Notes on FlyBase annotation Chado DB derived releases:

  r3.1.0_12182003 = chado xml derived (public release 3.1.0 annotations, Dec 2003)
  r3.2.0_12052003 = chado xml derived (pre-release 3.2.0 annotations, Dec 2003)
  v7.0_0728       = chado xml derived (preliminary public release 3.1.0 annotations, July/Aug 2003)

id-check: FBan to FBgn id/symbol checks

  FBan annotation data (chado.xml derived) comparing to FBgn gene data of November 03
  for ID and symbol errors/mismatches

  compare FBan-v7.0_0728.list by anId to FBgn: (public data as of Dec 2003)
    ok=12815 ; FBti=-- ; noFBgn=1 ; oldFBgn=213 ; dupFBgn=3 ; badsym=943
  compare FBan-r3.1.0_12182003.list by anId to FBgn:
    ok=12873 ; FBti=1571 ; noFBgn=1 ; oldFBgn=248 ; dupFBgn=3 ; badsym=948
  compare FBan-r3.2.0_12052003.list by anId to FBgn:
    ok=13454 ; FBti=1573 ; noFBgn=296 ; oldFBgn=219 ; dupFBgn=4 ; badsym=247

seq-check: annotation feature sequence check

  Sequences are generated using feature tables derived from chado xml
  to extract dna (all use same release 3.1.0 dna).
  The test features are
    transposon  = transposable_element features
    translation = standard protein translation of CDS feature sequences
    transcript  = mRNA tRNA miscRNA features (untranslated)
      - note the transcript comparison set includes ncRNA not in baseline set

  Baseline data labeled '_r3.1.0g' = gadfly release 3.1.0 (feb-july 2003), from
    ftp://flybase.net/genomes/Drosophila_melanogaster/dmel_RELEASE3-1/FASTA/

  Generated sequences are checked against baseline sequences comparing:
    (a) sequence length, (b) sequence 32bit checksum, (c) sequence ID
  Difference in length or checksum means they are not same bases,
  different ID means a change in transcript/protein assignment

  r3.1.0_12182003/seqs-check/
    sequences match but FBgn0002781/mod(mdg4) 7 trans-spliced translations (need special handling)
    transcript comparison set for r3.1.0 includes miscRNA not in baseline set

  r3.2.0_12052003/seqs-check/
    Chromosome                      2L    2R    3L    3R     4     X
    Total no. sequences           3327  3801  3625  4638   187  3093
    Increase in no. sequences       88   111   112    77     8   166
    # ID changes                    10    15    23    22     2    23
    # Length/checksum changes       75   126   112   107    15    73
    Unmatched in old               129   172   175   128    13   191
    Unmatched in new                51    76    86    73     7    48

feature-check: annotation feature counts, locations

  FBan annotation features of Dec. 2003 (r310d,r320a; chado.xml) comparing to
  GadFly features of Feb-July 03 (r310g) [? excluding heterosomes or not ?]

# Table of D. mel. genome feature counts per release.
# Feature                            r310g    r310d    r320a
-----------------------------------------------------------------
BAC                                     --      949      949
CDS                                  18109    18109    18671
EST                                     --   302509   302359
cDNA_clone                              --    10197    10206
gene                                 13369    13369    13487
mRNA                                 18109    18126    18819
miscellaneous curator's obse..          35        0        0
ncRNA                                   60       95       65
oligonucleotide                         --   193168   193775
processed_transcript                    --    14677    14705
protein                                 --   211135   211751
pseudogene                              17       17       38
rRNA                                     0        0       96
repeat_region                           --     3021     3050
segment                                437      437      437
snRNA                                   28       28       28
snoRNA                                  28       28       28
so                                       0    14334    14350
tRNA                                   288      288      281
transcription_start_site                --    16832    17005
transposable_element                  1572     1571     1572
transposable_element_inserti..          --     4346     4373
-----------------------------------------------------------------
-- == data not available for this feature

#------------------------

seq- bulk data

  bulk sequence/annot file name format:
    $org_$chr_$feature_$release.$format

  $org in (dmel)
  $chr in (2L 2R 3L 3R X 4), (2h 3h Xh Yh U)
  $feature in (
    gene, mRNA, CDS, CDS-translation,
    transposon/transposable_element, pseudogene,
    tRNA, miscRNA=ncRNA,snRNA,snoRNA,rRNA
    gene-extended5000  # any other extended?
    ## non-feature dna
    chromosome/chromosome-arm/entire (name as $org_$chr_entire ?)
    scaffold/segment
  )
  $release in (
    r3.1.0g (gadfly = current bulk files)
    r3.1.0d (chado r3.1.0_12182003 - should match r3.1.0g sequence, diff annots)
    r3.2.0a (chado r3.2.0_12052003)
  )
  $format in (
    .fasta(.gz)
    .gff(.gz)
    .chado.xml(.bz2)
    .game.xml(?)
  )

Notes:
  -- drop 'whole_' files, leave only per $chr ?, let folks cat $chr together as desired

#------------------------

Bulk files compared to those of prior release:
  ftp://flybase.net/genomes/Drosophila_melanogaster/dmel_RELEASE3-1/FASTA/

  whole_genome_*     -- create by catenating each chr file set
  heterochromatin_* and (2h,3h,Xh,Yh,U) -- 'heterosomes' to be added
  euchromatin_*      -- create by catenating each chr file set, excluding 'heterosomes'

  per chromosome set
    2L_3_UTR, 2L_5_UTR       -- to be added
    2L_CDS                   == dmel_2L_CDS
    2L_annotation            == catenate dmel_2L_gene with (tRNA,miscRNA,transposon) set; OTHERS?
    2L_annotation_extend5000 == dmel_2L_gene_extended5000, minus (tRNA,miscRNA,transposon) set
    2L_annotation_extend2000 .. not planned
    2L_annotation_extend500  .. not planned
    2L_exon                  .. not planned
    2L_genomic               -- to be added (arm dna, no sequence change)
    2L_genomic_scaffolds     -- to be added (segment dna, no sequence change)
    2L_intron                .. not planned
    2L_masked_genomic        .. not planned
    2L_noncoding-gene        == catenate (tRNA,miscRNA,transposon,pseudogene)
    2L_protein-coding-gene   == dmel_2L_gene
    2L_splice_site           .. not planned
    2L_tRNA                  == dmel_2L_tRNA
    2L_transcript            == catenate dmel_2L_mRNA with (tRNA,miscRNA; drop ncRNA)
    2L_translation           == dmel_2L_translation (CDS translation)
    2L_transposable_element  == dmel_2L_transposon (rename dmel_2L_transposable_element ?)
    2L_unique_intergenic     .. not planned
    2L_unique_intron         .. not planned

  Not in past release:
    dmel_2L_miscRNA
    dmel_2L_pseudogene

gff/
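
Sketch: building and parsing bulk file names

  The bulk file name format given above ($org_$chr_$feature_$release.$format)
  can be illustrated with a small Python sketch. The value lists are copied
  from these notes; the helper names (build_name, parse_name, BulkName) and
  the regular expression are assumptions made for illustration and are not
  part of any FlyBase tool.

```python
import re
from typing import NamedTuple, Optional

# Value sets copied from the notes above.
CHRS = ("2L", "2R", "3L", "3R", "X", "4", "2h", "3h", "Xh", "Yh", "U")
RELEASES = ("r3.1.0g", "r3.1.0d", "r3.2.0a")


class BulkName(NamedTuple):
    """Hypothetical container for the parts of a bulk file name."""
    org: str
    chrom: str
    feature: str
    release: str
    fmt: str
    compression: Optional[str]


def build_name(org, chrom, feature, release, fmt, compression=None):
    """Assemble $org_$chr_$feature_$release.$format, plus optional .gz/.bz2."""
    name = f"{org}_{chrom}_{feature}_{release}.{fmt}"
    return f"{name}.{compression}" if compression else name


# The feature part may itself contain '_' (e.g. transposable_element), so it is
# matched greedily and the release tag (r<digits>.<digits>.<digits><letter>)
# anchors the split.
_NAME_RE = re.compile(
    r"^(?P<org>[^_]+)_(?P<chrom>[^_]+)_(?P<feature>.+)"
    r"_(?P<release>r\d+\.\d+\.\d+[a-z])"
    r"\.(?P<fmt>fasta|gff|chado\.xml|game\.xml)"
    r"(?:\.(?P<compression>gz|bz2))?$"
)


def parse_name(filename):
    """Split a bulk file name into its parts; raise ValueError if it does not fit."""
    m = _NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"not a bulk file name: {filename!r}")
    parts = BulkName(**m.groupdict())
    if parts.chrom not in CHRS:
        raise ValueError(f"unknown chromosome arm: {parts.chrom!r}")
    if parts.release not in RELEASES:
        raise ValueError(f"unknown release tag: {parts.release!r}")
    return parts
```

  Usage, under the same assumptions: build_name("dmel", "2L", "CDS",
  "r3.2.0a", "fasta", "gz") gives a name of the documented shape, and
  parse_name() recovers the parts from it, rejecting arms or release tags
  that are not listed in these notes.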
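
Sketch: seq-check comparison by length, checksum and ID

  The seq-check section above compares generated sequences to baseline
  sequences by (a) length, (b) 32-bit checksum and (c) ID. A minimal Python
  sketch of that kind of comparison follows. These notes do not say which
  32-bit checksum was used, so CRC-32 (zlib.crc32) stands in for it here;
  the function names and the exact bucketing rules are likewise assumptions,
  chosen to mirror the row labels of the r3.2.0 table.

```python
import zlib
from collections import Counter


def signature(seq):
    """Return (length, 32-bit checksum) for a sequence.

    CRC-32 is an assumed stand-in for the unspecified checksum; sequences
    are upper-cased first so that letter case does not affect the result.
    """
    data = seq.upper().encode("ascii")
    return len(data), zlib.crc32(data) & 0xFFFFFFFF


def compare_sets(baseline, generated):
    """Bucket generated sequences against baseline ones.

    Both arguments map sequence ID -> sequence string.
    Returns a Counter with keys: match, length_checksum_change,
    id_change, unmatched_in_new, unmatched_in_old.
    """
    base_sig = {sid: signature(seq) for sid, seq in baseline.items()}
    ids_by_sig = {}
    for sid, sig in base_sig.items():
        ids_by_sig.setdefault(sig, set()).add(sid)

    counts = Counter()
    accounted_old = set()
    for sid, seq in generated.items():
        sig = signature(seq)
        if base_sig.get(sid) == sig:
            # same ID, same length and checksum
            counts["match"] += 1
            accounted_old.add(sid)
        elif sid in base_sig:
            # same ID, but the bases differ
            counts["length_checksum_change"] += 1
            accounted_old.add(sid)
        elif sig in ids_by_sig:
            # same bases under a different ID
            counts["id_change"] += 1
            accounted_old.update(ids_by_sig[sig])
        else:
            counts["unmatched_in_new"] += 1
    counts["unmatched_in_old"] = len(set(baseline) - accounted_old)
    return counts
```

  A mod(mdg4)-style trans-spliced translation would need handling outside
  this sketch, as noted for r3.1.0_12182003/seqs-check/.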