





Release 4 Notes
 

RE-ANNOTATED GENOMIC SEQUENCE

Release 4 Notes: Updated February 21, 2006

Release 4.3 of the euchromatin is now available at http://flybase.net/annot/.

   
 

RELEASE 4.3 ANNOTATION UPDATE

Release 4, which initially consisted of genomic sequence only, was made public in April 2004 (euchromatic sequences). This release adds 1.4 Mb of high-quality sequence, including 21 gaps that have been closed and two inverted regions that have been corrected (see BDGP Release 4 notes for details).

Release 4.3 is the fourth in a series of regular updates that reflect gene-by-gene annotation assessments, rather than a comprehensive survey of the entire genome. Re-annotation of a gene model is triggered by new sequence data, data curated from the literature, or user communications. For updates that include changes to annotations only (and not the underlying sequence), the release numbers increase as decimal increments. These more frequent updates also include new supporting data, represented in the evidence tiers of the gene annotation reports (e.g., Sos), in the Gbrowse views, and in the Apollo annotation editor and viewer.

Release 4.3 annotations were available from NCBI on January 30, 2006 and FlyBase on March 03, 2006. At FlyBase, these data are available from Gene Annotation reports, which are accessible from individual gene report pages or as a result of a query using the Basic Annotation Query Form, the FlyBase BLAST server, the batch query page, and the download site.

Previous releases, unannotated BAC-based sequences, and the WGS3 whole-genome shotgun sequence assembly continue to be available from GenBank. See the Heterochromatin section below for information about the release of Heterochromatin.

Tabulated information about features in this release is shown below, and a comparison of Release 4.3 to 4.2, with lists of new, split, merged, deleted, etc. genes, may be found HERE. Release 4.3 is the first annotation set to include the features of the mitochondrial genome.


In Release 4.2 the representation of dicistronic genes was changed. Find information on dicistronic identifiers HERE.

Annotation Statistics for Release 4.3

Note that gene model statistics include heterochromatin annotations, but aligned feature counts are for euchromatin only.


New Gene Models 63
Deleted Gene Models 2
Merged Gene Models 42 from release 4.2 -> 19 in release 4.3
Split Gene Models 5 from release 4.2 -> 12 in release 4.3
Unchanged peptides 18978

Annotated Gene Models* Count Avg. size Longest Shortest Change from previous release**
Genes 14816 5002 279927 16 +101
Protein coding genes 14066 5247 279927 138 +79
Protein coding transcripts 19819 2256 69571 132 +211
Exons 65380 480 27725 3 +567
Introns 48501 1192 185510 30 +366
5' Untranslated regions 17736 184 3391 1 +143
3' Untranslated regions 12162 366 5684 1 +213
Unique peptides 17134 568 23015 25 +166
rRNA 104 172 1995 29 +2
tRNA 314 75 186 61 +20
snRNA 46 111 255 36 0
snoRNA 63 88 316 16 0
miRNA 66 22 29 20 0
miscellaneous non-coding RNA 110 1903 31065 19 +3
pseudogenes 52 1119 13064 53 +2

Transposable Elements Present in the Sequenced Strain 12794 1258 66001 21 0
Euchromatic transposable elements 6005 1258 66001 21 0
Heterochromatic repeat with transposon homology+ 6189 NA NA NA 0

Other Annotated Gene Features Count Change from previous release
aberration junction 131 +4
enhancer 33 +6
point mutation 1444 +444
poly A site 122 -4
protein binding site 1372 +2
regulatory region 195 -24
rescue fragment 220 +13
sequence variant 414 +66
signal peptide 1 0

Mapped reagent features Count Change from previous release
transposable element insertion site 35783 +2515
oligonucleotide 194086 0

Aligned evidence features++ Algorithm Count Change from previous release

Nucleotide alignments      
BAC clone locator 710 0
D. melanogaster cDNA inserts sim4tandem,splign 13464 +2585
D. melanogaster EST (total) sim4 308722 0
EST from sequenced strain sim4 153900 0
EST from different strains sim4 154822 0
Other melanogaster DNA sequences sim4tandem 12707 0

 
ab initio gene predictions      
Genie prediction Genie v2.2/flyGenie 11063 0
Genscan prediction Genscan 1.0 17811 0
Augustus prediction Augustus 1.0 12316 0

 

Proteins aligned  

D. melanogaster proteins WU-blastx 2.0, Prosplign 26915 +2829
Other Insect proteins WU-blastx 2.0 7011 0
Nematode proteins WU-blastx 2.0 6318 0
Yeast proteins WU-blastx 2.0 2149 0
Plant proteins WU-blastx 2.0 8319 0
Rodent proteins WU-blastx 2.0 14732 0
Primate proteins WU-blastx 2.0 13607 0
Other invertebrate proteins WU-blastx 2.0 12991 0
Other vertebrate proteins WU-blastx 2.0 10383 0
Other  proteins Prosplign 6530 +6530

   
Translated nucleotide alignments    
Insect ESTs WU-tblastx 2.0 N.A. N.A.
A. gambiae genomic WU-tblastx 2.0 N.A. N.A.
D. pseudoobscura genomic WU-tblastx 2.0 N.A. N.A.
     

 

* counts include mitochondrial protein-coding genes (13), tRNAs (22) and rRNAs (2)
** change is relative to Release 4.2 annotations

+Natural transposon insertions in heterochromatin are 'repeat_regions' with high TE homology. See Transposable Elements below.
++Aligned evidence feature counts are for euchromatin only.


The confidence we have in the annotated gene models varies considerably; improvements to the gene models will be ongoing, and will require the continued input of the community. If you notice a mistake in annotation, please submit an error report form (also accessed from the gene annotation reports) or write to flybase-updates AT morgan.harvard.edu. Updates may also be submitted as sequence records or as Apollo-generated XML files.

 

 

 
HETEROCHROMATIN

The sequence finishing and annotation of the heterochromatic region of the genome is being performed by the Drosophila Heterochromatin Genome Project (DHGP; see Hoskins et al. 2002). As sequence gaps are filled, and the heterochromatic scaffolds are finished to high quality and re-annotated, they will be contributed to GenBank and FlyBase and integrated into future releases of the Drosophila genomic sequence.

Release 3.2b annotations of the heterochromatic regions are available from FlyBase and the public data libraries (NCBI, EBI, DDBJ). At FlyBase, these data are available from FlyBase Gene Annotation reports, the FlyBase BLAST server, the batch query page, and the download site.

The Release 3.2b heterochromatin annotation represents the latest effort to describe the protein-coding genes, non-coding genes, and other features located in the heterochromatin sequence. In this update, the underlying sequence is the 20.7Mb of Release 3 whole-genome-shotgun (WGS) scaffolds from Celera that could not be assembled into the euchromatin arms as well as a few BDGP-sequenced scaffolds.

The WGS3 heterochromatin consists of ~2600 scaffolds that still contain gaps and collapsed repeats, but are otherwise considered relatively high-quality sequence. Some of these have been mapped to particular chromosome arms (i.e. 2h, 3h, 4h, Xh, or Yh), while the remainder have been placed on chromosome U. It is important to note that scaffolds that have been mapped to a particular chromosome arm are provisionally ordered, but not oriented: they are ordered by their experimentally determined cytological locations, but their orientation and exact order remain unclear. Chromosome U consists of unordered, unoriented scaffolds. While the underlying sequence of the scaffolds annotated in Release 3.2 has not changed, the mapping and ordering of these scaffolds on chromosome arms (e.g. 2h, 3h...) may differ from previous releases.

The transition between the euchromatic and heterochromatic regions of the genome is thought to be a gradual one, and there are no objective rules to categorize the sequence in this transitional area as definitively euchromatic or heterochromatic. Currently the boundaries between the euchromatic and heterochromatic portions of the genome are based on cytological data, as described in Hoskins et al. 2002.

Annotation guidelines consistent with FlyBase and the overall Drosophila genome annotation were adhered to whenever possible. However, since these annotations are based on high-quality draft sequence, certain gene models may contain missing or premature stop codons, missing start codons, or gaps within their ORFs. Open reading frames corresponding to fragments of transposable elements are common in heterochromatin; every attempt was made to identify these and exclude them from the gene annotations.

Release 4 annotation of the heterochromatin should become available in Summer 2005.

As the DHGP adds new data and improves the quality of the underlying sequence and assembly in future releases, the quality of the annotations will also improve. The DHGP welcomes any feedback and data from the community that will assist in this effort.

KNOWN MUTATIONS IN THE SEQUENCED STRAIN

The sequenced strain, usually described as the y[1]; cn[1] bw[1] sp[1] strain, was known to carry mutations in those four genes. During annotation, mutations in other genes have been discovered (currently known are mutations in oc, LysC, MstProx, GstD5, Rh6, Gr22b, Or98b and CG8447). To allow compilation of a comprehensive proteome, wild-type protein sequences for these genes have been included in sequence entries to GenBank/EMBL/DDBJ. Wherever possible, a RefSeq accession based on an alternative wild-type sequence and curated as a FlyBase Annotated Genome Sequence (ARGS) has been provided.

GENOMIC SEQUENCE RELEASES vs. ANNOTATION RELEASES

The different releases of the D. melanogaster genomic sequence are designated by the whole number component of the release number. The first annotated genomic sequence was released on March 24, 2000, and constituted Release 1 (Adams, et al., 2000). After Celera/BDGP filled 330 gaps and changed ~3000 annotations, Release 2 was made public in October, 2000. This whole genome shotgun assembly had ~1300 gaps.

To produce the 116.8 Mb Release 3 euchromatic sequence, the BDGP closed almost all of the gaps in the euchromatic portion of the genome, and raised the sequence quality to an estimated error rate of less than one in 100,000 base pairs in the unique portion of the sequence, and less than one in 10,000 base pairs in the repetitive portion (Celniker et al. 2002). The accuracy of the assembly was verified by restriction digestion of BAC clones, and the composite sequences of transposable elements in the previous releases were replaced in Release 3 with the true sequences of 1572 individual transposon insertions.

To create the 118.4 Mb Release 4 genomic sequence, 21 gaps were closed, and the assembly was validated in collaboration with the Genome Sciences Centre at the British Columbia Cancer Agency in Vancouver, Canada, using fingerprint analysis of a tiling path of BACs spanning the genome. This assembly has 23 gaps remaining.

The BDGP is continuing to improve the genomic sequence to high quality. Release 5 genomic sequence is being submitted as unannotated BACs to GenBank as it is finished.

Commencing with Release 3 and continuing into the future, changes to the gene models and other annotations will occur more often than changes to the underlying sequence. These changes are indicated by fractional release numbers; for example, 'Release 3.2' consists of the second update of annotations on the Release 3 genomic sequence. FlyBase will continue to increment release numbers across the entire genome.
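The numbering convention described above can be sketched in a few lines of Python. This is an illustrative sketch only, not FlyBase software: the names Release, parse_release, and same_sequence are hypothetical, and treating a bare whole number such as 'Release 2' as annotation update 0 is an assumption made for convenience.

```python
import re
from typing import NamedTuple


class Release(NamedTuple):
    sequence: int    # whole-number part: genomic sequence release
    annotation: int  # fractional part: annotation update on that sequence


# Matches '3', '3.2', 'Release 4.3'; a trailing letter (as in '3.2b') is ignored.
_RELEASE = re.compile(r"(\d+)(?:\.(\d+))?")


def parse_release(label: str) -> Release:
    """Split a release label into sequence and annotation components."""
    match = _RELEASE.search(label)
    if match is None:
        raise ValueError(f"no release number found in {label!r}")
    # Assumption: a label without a fractional part counts as update 0.
    return Release(int(match.group(1)), int(match.group(2) or 0))


def same_sequence(a: str, b: str) -> bool:
    """True if two releases annotate the same underlying genomic sequence."""
    return parse_release(a).sequence == parse_release(b).sequence


print(parse_release("Release 3.2"))
print(same_sequence("Release 4.0", "Release 4.3"))
print(same_sequence("Release 3.2", "Release 4.0"))
```

Under this reading, coordinates from Release 4.0 and Release 4.3 refer to the same sequence, whereas coordinates from Release 3.2 and Release 4.0 may not.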

In FlyBase, the release number will appear at the top of each annotation query and report page, and also at the FlyBase download sites for sequence. Please make a note of the release number you are working with.

The annotated sequence is submitted to GenBank as chromosome arms, and GenBank cuts these into segments of manageable size, averaging ~270 kb. When the underlying sequence for a given segment changes, GenBank increments the decimal version number. Note that this does not occur genome-wide, so some accession version numbers will change and others will not. On occasion, the underlying sequence has not changed, but the extent of a given segment may differ (to avoid dividing a gene model between two segments). Such a change in extent will also result in an increment of the version number. Changes to annotations are indicated by an updated date stamp.

Examples of release number changes and corresponding GenBank version numbers are shown in the table below.

Date Release GenBank Version
March 2000 Release 1 AE003452.1
October 2000 Release 2 AE003452.2
June 2002 Release 3.0 AE003452.3
February 2003 Release 3.1 AE003452.4
March 2004 Release 3.2 AE003452.4
November 2004 Release 4.0 AE003452.5
February 2005 Release 4.1 AE003452.5



March 2000 Release 1 AE003463.1
October 2000 Release 2 AE003463.1
June 2002 Release 3.0 AE003463.2
February 2003 Release 3.1 AE003463.2
March 2004 Release 3.2 AE003463.2
November 2004 Release 4.0 AE003463.2
February 2005 Release 4.1 AE003463.2



Links from FlyBase gene and annotation reports will go to the most recent release at NCBI. If you need access to a previous release, you can query at NCBI using the accession number including the version number suffix; click on 'revision history.'
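As a rough illustration of how GenBank version suffixes relate to FlyBase releases, the sketch below records the release and version pairs from the table in this section and reports the releases at which each accession's version was incremented. The names HISTORY, split_accession, and releases_with_new_version are hypothetical, and the data structure is only one convenient way to hold the table; nothing here queries NCBI.

```python
from typing import Dict, List, Tuple

# (FlyBase release, GenBank version) pairs copied from the table in this section.
HISTORY: Dict[str, List[Tuple[str, int]]] = {
    "AE003452": [("1", 1), ("2", 2), ("3.0", 3), ("3.1", 4),
                 ("3.2", 4), ("4.0", 5), ("4.1", 5)],
    "AE003463": [("1", 1), ("2", 1), ("3.0", 2), ("3.1", 2),
                 ("3.2", 2), ("4.0", 2), ("4.1", 2)],
}


def split_accession(accession_version: str) -> Tuple[str, int]:
    """Split 'AE003452.5' into the accession and its integer version."""
    accession, _, version = accession_version.partition(".")
    return accession, int(version)


def releases_with_new_version(accession: str) -> List[str]:
    """Return the releases at which the GenBank version was incremented."""
    changed = []
    previous = None
    for release, version in HISTORY[accession]:
        if previous is not None and version != previous:
            changed.append(release)
        previous = version
    return changed


print(split_accession("AE003452.5"))
print(releases_with_new_version("AE003452"))
print(releases_with_new_version("AE003463"))
```

Note that a version increment does not always mean the sequence changed: as described above, a change in the extent of a segment also increments the version, which may explain increments that coincide with annotation-only releases.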

 
 

RELEASE 4.1 and 4.2 ANNOTATIONS

Releases 4.1 and 4.2 were the second and third in the series of regular updates that reflect gene-by-gene annotation assessments, rather than a comprehensive survey of the entire genome. Re-annotation of a gene model was triggered by new sequence data, data curated from the literature, or user communications. For updates that included changes to annotations only (and not the underlying sequence), the release numbers increased as decimal increments. These more frequent updates also included new supporting data, represented in the evidence tiers of the gene annotation reports (e.g., Sos), in the Gbrowse views, and in the Apollo annotation editor and viewer.

 
RELEASE 4.0 ANNOTATION

Annotations from Release 3.2 were promoted to the Release 4 sequence without further assessment; this constituted Release 4.0, made public in November 2004 (euchromatic sequences only).

Very few annotations differed between Release 3.2 and Release 4.0. Forty-one gene models that fell within regions of underlying sequence change exhibited changes in transcript sequences; of these, 25 resulted in changes to the predicted proteins. Two entities were deleted: one gene model was merged with its neighbor, and one natural transposable element insertion was not present in the Release 4 sequence. In addition, the initiating amino acid of the CDSs for non-AUG starts (erroneously annotated in r3.2) was corrected, and one gene model omitted from Release 3.2 was reinstated.

RELEASE 3.2 ANNOTATION

The March 2004 Release 3.2 included new sequence features curated from the fly literature, such as mutational lesions, aberration breakpoints, and insertion sites of transgenic constructs. These new sequence features may be accessed via the Gene Annotation reports; however, they are not included in the Release 3.2 GenBank submissions. A major addition to annotated gene models in Release 3.2 was the inclusion of 100 5SrRNAs (of the estimated 160 genes in the 56F 5SrRNA gene cluster); this includes four 5SrRNA pseudogenes.

RELEASE 3.1

When Release 3 of the genomic sequence became available, FlyBase conducted a comprehensive review of all euchromatic annotations (Misra et al. 2002). The goals of this re-annotation were:
  • To manually inspect and synthesize the results of computational analysis of the entire euchromatic sequence into updated annotations, using a small group of human curators and a consistent set of rules.
  • To take advantage of the large numbers of new EST and full-length cDNA sequences from the BDGP (LBNL) and the community in improving gene models.
  • To add annotations of non-protein-coding genes, transposons, and pseudogenes.
  • To validate the results against published peptide sequences.
In order to address these goals, a new computational pipeline was created (Mungall et al. 2002) with an exhaustive list of Drosophila sequence datasets and SwissProt/trEMBL SWALL peptide datasets from other species. The results and datasets are stored in the new FlyBase genome annotation database, so that evidence for the annotations can be tracked and queried. A new graphical user interface, Apollo, was developed in a collaboration between FlyBase BDGP and Ensembl, to allow FlyBase biologist curators to easily view the results of computational analysis and efficiently edit the annotations (Lewis et al. 2002). A set of curation rules and a controlled vocabulary of comments was created to allow the group of ten curators to annotate consistently. And finally, a set of validation steps was created, including software to compare each predicted peptide to those curated peptides in SwissProt with experimental evidence.

The Release 3 re-annotation improved the quality of the majority of gene models. The length of UTRs and the number of alternative transcripts increased, due to the increase in EST and complete cDNA sequences. The fine details of the exon-intron structure were significantly improved. Numerous genes were merged and/or split, based on the cDNA and BLASTX data; some genes predicted in earlier releases were deleted, and others were newly predicted. Genes were deleted if they overlapped transposons or if they fell below a minimum size cutoff (100aa) and had no experimental evidence beyond a computational gene prediction. Overall, these improved annotations resulted in changes in >45% of the predicted proteins.

TRANSPOSABLE ELEMENTS

As a result of the whole genome shotgun assembly, the sequence of each transposon in Releases 1 and 2 was a composite derived from a number of elements of that transposon type. In Release 3, the sequence of each transposon insertion in the euchromatin of the y[1]; cn[1] bw[1] sp[1] strain was determined and characterized (Kaminker et al. 2002). See the BDGP Natural Transposable Element page for more information. The transposons in euchromatin were not updated in Release 3.1, Release 3.2, or Release 4.0; they were simply mapped forward.

 

Transposable elements (TEs) in the Release 4 sequence have been completely re-annotated, using a combined evidence approach described in Quesneville et al. (2005) (PLoS Comp. Biol. 1 (2):e22). Many more elements are now annotated in Release 4 relative to Release 3 because of improved methods, inclusion of more families, and the addition of more TE-dense sequence in pericentromeric regions. Many of these new elements are very short (a few hundred nucleotides) and/or from divergent copies including the large INE-1 family, not previously annotated in Release 3.

The Drosophila heterochromatin sequence is extremely rich in repetitive satellite elements, simple repeats, and transposable element fragments. At the time of Release 3.1, greater than 55% of the Release 3 heterochromatin sequence was determined to have homology to a repetitive element of some type. Currently the Drosophila Heterochromatin Genome Project estimates that 75% of the Release 3 heterochromatin sequence is comprised of repetitive sequence. Since the repetitive regions in the heterochromatin are so fragmented and located in regions with many gaps and potential assembly errors, we did not rigorously curate and hand-identify transposable elements in the same manner as Kaminker et al. 2002 for the Release 3 euchromatin. Instead, we used the Kaminker et al. "Natural Transposable Element" dataset as a library for RepeatMasker to identify stretches of sequence that were likely to be a transposable element or repeat. Since these regions may not represent complete elements, or may contain many nested elements, the DHGP refers to these as 'repeat regions'. Essentially, 'repeat regions' are stretches of genomic sequence with a significant alignment to a known Drosophila transposable element or simple repeat. In most cases a repeat region is comprised of thousands of nested fragments of other transposable elements. Since our method relies on alignment to known elements, it is likely that some legitimate repeats remain to be identified.

The results of the heterochromatin repeat analysis can be seen as the 'Repeatmasker' result tier when using the Apollo genome viewer or obtained as FASTA, GFF, or GAME-XML from the DHGP FTP site.

 
GENE AND TRANSCRIPT IDENTIFIERS

In Releases 3.0 and 3.1, protein-coding genes were given 'CG' identifiers of the form CGnnnnn. For non-protein-coding genes, such as tRNAs, snRNAs, snoRNAs, microRNAs, miscellaneous non-coding RNAs, and pseudogenes, 'CR' identifiers of the form CRnnnnn were assigned. Transposable elements were given TEnnnnn identifiers. Transcripts were assigned FlyBase transcript identifiers, for which the gene identifier is followed by a suffix -R[A-Z]; e.g., CG12345-RA, CG12345-RB. For peptides, the -R[A-Z] suffix is replaced by a -P[A-Z] suffix, with the second identifying letter always in agreement with that of the corresponding transcript; e.g., CG12345-PA, CG12345-PB.
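The identifier scheme above lends itself to a simple pattern match. The following Python sketch is illustrative only: the function names classify and peptide_for are hypothetical, the pattern accepts any number of digits (the example CG8094 has four), and TE identifiers are deliberately left out because no transcript or peptide suffixes are described for them.

```python
import re

# CG = protein-coding gene, CR = non-protein-coding gene or pseudogene.
# An optional suffix -R<letter> marks a transcript, -P<letter> a peptide.
_ID = re.compile(r"^(?P<gene>C[GR]\d+)(?:-(?P<kind>[RP])(?P<letter>[A-Z]))?$")

_KINDS = {None: "gene", "R": "transcript", "P": "peptide"}


def classify(identifier: str):
    """Return (kind, gene identifier) for a CG/CR-style identifier."""
    match = _ID.match(identifier)
    if match is None:
        raise ValueError(f"not a CG/CR identifier: {identifier!r}")
    return _KINDS[match.group("kind")], match.group("gene")


def peptide_for(transcript_id: str) -> str:
    """Map a protein-coding transcript ID to its peptide ID (-RA -> -PA)."""
    match = _ID.match(transcript_id)
    if match is None or match.group("kind") != "R":
        raise ValueError(f"not a transcript identifier: {transcript_id!r}")
    if not match.group("gene").startswith("CG"):
        raise ValueError("only CG (protein-coding) transcripts have peptides")
    return f"{match.group('gene')}-P{match.group('letter')}"


print(classify("CG12345-RA"))
print(classify("CR30000"))
print(peptide_for("CG12345-RB"))
```

The suffix letter is carried over unchanged, reflecting the rule that the peptide letter always agrees with that of the corresponding transcript.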

In Release 3.2, the standard symbols for gene annotation CGnnnnn were replaced with the accepted gene symbol (where available). For example, CG8094, CG8094-RA, and CG8094-PA become gene Hex-C, transcript Hex-C-RA, and protein Hex-C-PA. The CG8094 ID is still supported as a more computable alternative to this symbolic name, but will be less visible.

In Releases 1 and 2, only protein-coding genes were annotated, and CGnnnn identifiers were assigned to genes, CTnnnn identifiers to transcripts, and pp-CTnnnn identifiers to peptides. These old Release 1 and 2 CT identifiers are now obsolete, and there is no mapping between CT identifiers and the Release 3 CGnnnn-RA identifiers. However, in most cases the CT identifier has become a synonym of the gene, and can be queried using the FlyBase Gene Search page to find the gene it was associated with in Release 2. In some cases, a Release 2 gene may correspond to more than one Release 3 gene, e.g. if exons were redistributed or split between two new Release 3 genes.

Send comments to us at flybase-help AT morgan.harvard.edu