
Dear Colleagues,

 

We are writing to bring you up to date on the status of the comparative Drosophila genome sequencing and analysis plans.  As you will see below, the sequencing itself is nearly complete.  We are moving into the phase of the project that will identify the reference sets of genes, syntenic relationships and chromosomal maps, in the hope that the community will use these reference data sets for its further analyses of these genomes.  Our intention is to make these reference data sets the major focus of the initial publication on these genomes and to invite anyone in the community who so desires to contribute to this effort or to publish their initial findings in a special journal issue timed to come out at the same time as the initial genome center publication.

 

So as not to constrain the research of individual laboratories, we propose to coordinate downstream analyses in only two ways: by acting as liaisons with one or more journals so that the manuscripts are published together at an agreed-upon date, and by providing information on who is planning to do which analyses.  It will be up to the individual participating groups to make sure that they have completed their analyses and submitted their manuscripts on time for peer review and publication.  Our current target for submission of these manuscripts is the end of this calendar year.  The editors of Nature and Nature Genetics have expressed a strong interest in publishing the results (including a main paper summarizing the sequencing and major findings and a collection of papers going into more depth on various aspects of the analysis).  Genome Research and Genetics would also be interested in publishing some of the in-depth analyses in an issue timed to come out at about the same time as the main paper.

 

Sequencing and Assembly

For the comparative analyses, we will use the most up-to-date versions of the various assemblies available at the time of the initial assembly freeze (completed for several of the species; due by Sept. 15 for willistoni and grimshawi).  For the published melanogaster finished arms, melanogaster draft heterochromatin and pseudoobscura WGS assemblies, we will use the assemblies and annotation sets then current in GenBank.  The status of the sequencing and assemblies of the other species is tabulated below:

 

Dros. species  Sequencing & Assembly Status                                Sequencing Center

virilis        ~9-fold WGS complete & assembled                            Agencourt
ananassae      ~8-fold WGS complete & assembled                            Agencourt
mojavensis     ~8-fold WGS complete & assembled                            Agencourt
erecta         ~12-fold WGS complete & assembled                           Agencourt
grimshawi      ~8-fold WGS complete (assembly to be released by Sept 15)   Agencourt
willistoni     ~6-fold WGS (BAC paired ends currently being sequenced;     Venter Institute (JCVI)
               assembly to be released by Sept 15)
persimilis     ~4-fold WGS complete & assembled                            Broad Institute
sechellia      ~3-fold WGS complete (assembly to be released by Sept 1)    Broad Institute
yakuba         ~6-fold WGS complete (assembly in GenBank)                  Washington Univ. (WUGSC)
               (additional coverage - automated sequence improvement
               expected Fall ’05)
simulans       ~3-fold WGS of w501 strain & 1-fold coverage of 6 other     Washington Univ. (WUGSC)
               strains complete (2 assemblies currently available;
               deeper coverage of w501 strain expected Fall ’05)

 

Most species have been sequenced to deep WGS coverage levels.  The persimilis and sechellia genomes have been sequenced to low WGS coverage (~3-4X), with the core assemblies derived independently and then enhanced by synteny to related species.  For yakuba, there will be a further assembly that incorporates two rounds of additional reads directed at weak regions of the assembly using WUGSC auto-finishing software.  For simulans, there is 2.8X coverage of one strain (w501) and 1X coverage of 6 other strains.  One version of the assembly layers reads from the other strains onto the core w501 assembly.  The other simulans assembly is built on synteny with melanogaster.  It is anticipated that additional WGS sequencing of w501 will be carried out in the fall to allow an independent assembly of this strain of simulans.
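
As a rough illustration of why the low-coverage genomes benefit from synteny-based enhancement, the sketch below applies the idealized Lander-Waterman estimate, in which the expected fraction of a genome covered by at least one read at c-fold coverage is 1 - e^(-c).  This is an assumption introduced here for illustration, not a calculation reported by the sequencing centers: it treats reads as placed uniformly at random and ignores repeats, cloning bias and heterochromatin, so real assemblies will do worse.  The function name is hypothetical; the fold values are the approximate figures quoted in the status table above.

```python
import math


def expected_fraction_covered(fold_coverage: float) -> float:
    """Idealized Lander-Waterman estimate of the fraction of bases
    covered by at least one read, assuming uniformly random reads."""
    if fold_coverage < 0:
        raise ValueError("fold coverage must be non-negative")
    return 1.0 - math.exp(-fold_coverage)


# Approximate fold coverages quoted in the status table above.
reported_coverage = {
    "sechellia": 3,
    "persimilis": 4,
    "willistoni": 6,
    "mojavensis": 8,
    "virilis": 9,
    "erecta": 12,
}

for species, fold in reported_coverage.items():
    fraction = expected_fraction_covered(fold)
    print(f"{species:<12}{fold:>3}X  expected fraction covered: {fraction:.4%}")
```

Under these assumptions about 95% of bases are expected to be covered at 3-fold and more than 99.9% at 8-fold, which is consistent with the low-coverage assemblies having more gaps for synteny to bridge.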

 

Assembly Evaluation

The assembly evaluation group has completed an initial analysis and evaluation of the available whole genome assemblers using the yakuba and virilis data sets.  The results of the analysis remain confidential within the group, but it is safe to say that it was not possible to declare a clear winner across all of the evaluation criteria analyzed.  It is encouraging to note that most of the assemblers did a very good job and the detailed analysis performed identified many specific issues and problems that will help the assembly developers improve the software over the coming months and years.  It was left up to the sequencing centers to make a final decision as to which assembly program(s) they will use.  Agencourt and the Broad Institute are using Arachne, the JCVI is using the Celera Assembler and the WUGSC is using PCAP. 

 

Mapping Supercontigs onto the Chromosome Arms

There are two approaches being undertaken to create assemblies that approximate the extent of the euchromatin of the chromosome arms of each species. 

 

Using synteny to build arm-sized sequence maps: Bill Gelbart’s group (gelbart@morgan.harvard.edu) is evaluating the feasibility of aligning supercontigs into chromosome arm-sized units (ultracontigs) using syntenic information.

 

Association of sequence maps with genetic maps:  Sequence-tagged genetic markers (e.g., recombinationally mapped cloned genes, microsatellite markers, SNPs) will be used to associate the supercontigs and/or ultracontigs with the linkage map of each species.  Polytene in situ hybridization using markers from anchor points on the superscaffolds and/or ultracontigs will be used to associate the sequence maps and the cytogenetic map of each chromosome arm.  Thom Kaufman (kaufman@bio.indiana.edu), Bryant McAllister (bryant-mcallister@uiowa.edu) and Teri Markow (tmarkow@public.arl.arizona.edu) have organized this effort and identified people to take the lead on organizing their species community to establish the map associations for each species:

o        melanogaster species group (simulans, yakuba, sechellia and erecta): Michael Ashburner (ma11@gen.cam.ac.uk) and Thom Kaufman (kaufman@bio.indiana.edu)

o        ananassae: Muneo Matsuda (matsudam@kyorin-u.ac.jp) and Kiyohito Yoshida (majin@ees.hokudai.ac.jp)

o        pseudoobscura: Steve Schaeffer (swschaeffer@psu.edu)

o        persimilis: Mohamed Noor (noor@duke.edu)

o        willistoni: Claudia Rohde (claudiarohde@yahoo.com)

o        virilis: Bryant McAllister (bryant-mcallister@uiowa.edu) and Jorge Vieira (jbvieira@ibmc.up.pt)

o        mojavensis: Teri Markow (tmarkow@public.arl.arizona.edu)

o        grimshawi: Patrick O'Grady (pogrady@uvm.edu)

 

Whole Genome Alignments

Our discussions with alignment groups led us to conclude that it does not make sense to strive for a single set of DNA-based alignments because they all differ somewhat, and people have their own preferences about which ones are most useful for their particular downstream analyses.  We would, however, like to make sure that we end up with alignments of similar quality to those being produced for the human ENCODE regions.  As it stands now, it looks as though we will end up with four different sets of alignments: MAVID alignments (Pachter), Multiz alignments (UCSC), LAGAN alignments (Sidow/Batzoglou) and TBA alignments (Webb Miller/Karro).

 

Genome-Wide Gene Annotation Sets

Our goal is to produce a genome-wide consensus set of gene predictions for each species.  Having a well-vetted reference set of gene models is important so that groups doing downstream analyses can work from the same annotations.  In order to accomplish this, we are organizing several activities:

 

·         Production of cDNA libraries and 5' ESTs: ESTs will provide training sets for groups who will be contributing gene prediction sets for some or all of the species, as well as having other uses.  Normalized libraries from embryos and adults for virilis, ananassae, erecta, mojavensis and grimshawi have been sequenced (~20-25,000 5' EST reads each).  A library for willistoni is currently being sequenced.

 

·         Production of gene prediction sets:  There are several groups who have come forward with interest in producing gene prediction sets using a variety of computational approaches.  Now that we have established the timetable for the initial assembly freezes, we will ascertain how long these groups require to produce their gene prediction sets.  Our starting point in these discussions is to target early fall for the production of these sets, but we must recognize that this may not be feasible in all cases.  We are interested in two kinds of gene prediction sets, including predictions of protein-coding and RNA-coding genes:

o        Improved melanogaster gene prediction sets through comparisons with the nucleotides and gene predictions of the other sequenced Drosophila species.

o        Production of reference gene prediction sets for each of the other Drosophila species.

 

·         Evaluation of the gene prediction sets:  In the time frame we are envisioning, it is likely that the initial annotation set will be the one used for subsequent analyses.

o        Computational evaluation: There are a variety of computational methodologies to compare the outputs of different gene prediction approaches.  We will enlist the contributors of the prediction sets to propose computational methodologies for this evaluation and for selection of a consensus gene prediction set.

o        Manual evaluation: While it is unlikely that it will have a major impact on the initial analysis annotation set, we do want to begin the task of improving the consensus set of predictions.  One approach to this will be through manual review and modification of the annotations.  We will enlist interested members of the community with expertise in particular gene families or in expert manual annotation to help in the review of the gene prediction sets.

o        Experimental testing of gene predictions:  If funds become available, we will explore the possibility of carrying out experimental validation of a sample of gene predictions, focusing on examples of predictions that vary among the contributed prediction sets.
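
One simple way to picture the computational evaluation and the choice of predictions for experimental testing is to count how many contributed sets support each predicted exon.  The sketch below is only an illustration under stated assumptions, not the methodology the contributing groups will propose: it requires exact coordinate matches, it treats exons rather than whole gene models as the unit of comparison, and the predictor names, scaffold name, coordinates and `min_support` threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Hypothetical exon record: (scaffold, start, end, strand).
Exon = Tuple[str, int, int, str]


def exon_support(prediction_sets: Dict[str, Set[Exon]]) -> Counter:
    """Count how many prediction sets contain each exact exon."""
    support: Counter = Counter()
    for exons in prediction_sets.values():
        support.update(exons)
    return support


def split_by_agreement(
    prediction_sets: Dict[str, Set[Exon]], min_support: int
) -> Tuple[Set[Exon], Set[Exon]]:
    """Return (consensus, disputed) exons.

    Consensus exons appear in at least `min_support` sets; disputed
    exons appear in fewer and are candidates for manual review or
    experimental testing.
    """
    support = exon_support(prediction_sets)
    consensus = {exon for exon, n in support.items() if n >= min_support}
    disputed = set(support) - consensus
    return consensus, disputed


# Made-up example data for three hypothetical predictors.
example_sets = {
    "predictor_a": {("scaffold_1", 100, 200, "+"), ("scaffold_1", 300, 450, "+")},
    "predictor_b": {("scaffold_1", 100, 200, "+"), ("scaffold_1", 300, 460, "+")},
    "predictor_c": {("scaffold_1", 100, 200, "+"), ("scaffold_1", 300, 450, "+")},
}

consensus, disputed = split_by_agreement(example_sets, min_support=2)
print("consensus exons:", sorted(consensus))
print("disputed exons: ", sorted(disputed))
```

In this made-up example the exon ending at 460 is supported by only one predictor, so it lands in the disputed set, which is the kind of variable prediction the experimental testing item above proposes to sample.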

 

While we do not wish to constrain any other group's work in any other area of research on these Drosophila genomes, we are happy to invite others to contribute work that might be appropriate for the main paper on the assembly, annotation and analysis of these species.  Please contact Doug Smith (douglas.smith@agencourt.com), Bill Gelbart (gelbart@morgan.harvard.edu) or Thom Kaufman (kaufman@bio.indiana.edu) regarding any such contributions.

 

Accessing Drosophila Genome Resources

 

Timelines

 

Sincerely,

 

Doug Smith, Agencourt Inc.  (douglas.smith@agencourt.com)

Bill Gelbart, Harvard U.  (gelbart@morgan.harvard.edu)

Thom Kaufman, Indiana U. (kaufman@bio.indiana.edu)

 

Send comments to us at flybase-help AT morgan.harvard.edu