Dear Colleagues,
We are writing to bring you up to date on the status of the comparative
Drosophila genome sequencing and analysis plans. As you will see below, the
sequencing itself is nearly complete. We are moving into the phase of the
project to identify the reference sets of genes, syntenic relationships and
chromosomal maps, with the hope that the community will use these reference
data sets for its further analyses of these genomes.
Our intention is to make these reference data sets the major focus of
the initial publication on these genomes and to invite anyone in the community
who so desires to contribute to this effort or to publish their initial
findings in a special journal issue timed to come out at the same time as the
initial genome center publication.
So as not to constrain the research of individual
laboratories, we propose to coordinate downstream analyses only insofar as we
will act as liaisons with one or more journals to coordinate publication of the
manuscripts together at an agreed-upon date, and will provide information on
who is planning which analyses. It
will be up to the individual participating groups to make sure that they have
completed their analyses and submitted their manuscript on time for peer review
and publication. Our current target for
submission of these manuscripts is the end of this calendar year. The editors of Nature and Nature Genetics
have expressed a strong interest in publishing the results (including a main
paper summarizing the sequencing and major findings and a collection of papers
going into more depth on various aspects of the analysis). Genome Research and Genetics would also be
interested in publishing some of the in-depth analyses in an issue timed to
come out at about the same time as the main paper.
Sequencing and Assembly
For the comparative analyses, we will use the most
up-to-date versions of the various assemblies available at the time of the
initial assembly freeze (completed for several of the species; due by Sept. 15
for willistoni and grimshawi). For the published melanogaster finished arms, melanogaster
draft heterochromatin and pseudoobscura
WGS assemblies, we will use the assemblies and annotation sets then current in
GenBank. The status of the sequencing
and assemblies of the other species is tabulated below:
| Dros. species | Sequencing & Assembly Status | Sequencing Center |
| virilis | ~9-fold WGS complete & assembled | Agencourt |
| ananassae | ~8-fold WGS complete & assembled | Agencourt |
| mojavensis | ~8-fold WGS complete & assembled | Agencourt |
| erecta | ~12-fold WGS complete & assembled | Agencourt |
| grimshawi | ~8-fold WGS complete (assembly to be released by Sept 15) | Agencourt |
| willistoni | ~6-fold WGS (BAC paired ends currently being sequenced; assembly to be released by Sept 15) | Venter Institute (JCVI) |
| persimilis | ~4-fold WGS complete & assembled | Broad Institute |
| sechellia | ~3-fold WGS complete (assembly to be released by Sept 1) | Broad Institute |
| yakuba | ~6-fold WGS complete (assembly in GenBank) (additional coverage - automated sequence improvement expected Fall '05) | Washington Univ. (WUGSC) |
| simulans | ~3-fold WGS of w501 strain & 1-fold coverage of 6 other strains complete (2 assemblies currently available; deeper coverage of w501 strain expected Fall '05) | Washington Univ. (WUGSC) |
Most species have been sequenced to deep WGS coverage
levels. The persimilis and sechellia
projects have been sequenced to low WGS coverage (~3-4X) with the core
assemblies derived independently and then enhanced by synteny to related
species. For yakuba, there will be a further
assembly that incorporates two rounds of additional reads directed at weak
regions of the assembly using WUGSC auto-finishing software. For simulans,
there is 2.8X coverage of one strain (w501) and 1X coverage of 6 other
strains. One version of the assembly
layers reads from the other strains onto the core w501 assembly. The other simulans
assembly is built on synteny with melanogaster. It is anticipated that additional WGS
sequencing of w501 will be carried out in the fall to allow an independent
assembly of this strain of simulans.
Assembly Evaluation
The assembly evaluation group has completed an initial
analysis and evaluation of the available whole genome assemblers using the yakuba and virilis data sets. The
results of the analysis remain confidential within the group, but it is safe to
say that it was not possible to declare a clear winner across all of the
evaluation criteria analyzed. It is
encouraging to note that most of the assemblers did a very good job and the
detailed analysis performed identified many specific issues and problems that
will help the assembly developers improve the software over the coming months
and years. It was left up to the
sequencing centers to make a final decision as to which assembly program(s)
they will use. Agencourt and the Broad
Institute are using Arachne, the JCVI is using the Celera Assembler and the
WUGSC is using PCAP.
Mapping Supercontigs onto the Chromosome Arms
There are two approaches being undertaken to create
assemblies that approximate the extent of the euchromatin of the chromosome
arms of each species.
Using synteny to build arm-sized sequence maps: Bill Gelbart’s group (gelbart@morgan.harvard.edu) is
evaluating the feasibility of aligning supercontigs into chromosome arm-sized
units (ultracontigs) using syntenic information.
Association of sequence maps with genetic maps: Sequence tagged
genetic markers (e.g., recombinationally mapped cloned genes, microsatellite
markers, SNPs) will be used to associate the supercontigs and/or ultracontigs
with the linkage map of each species.
Polytene in situ hybridization
using markers from anchor points on the superscaffolds and/or ultracontigs
will be used to associate the sequence maps and the cytogenetic map of each
chromosome arm. Thom Kaufman (kaufman@bio.indiana.edu), Bryant McAllister (bryant-mcallister@uiowa.edu) and Teri
Markow (tmarkow@public.arl.arizona.edu) have organized this effort and
identified people to take the lead on organizing their species community to
establish the map associations for each species:
o melanogaster species group (simulans, yakuba, sechellia and erecta):
  Michael Ashburner (ma11@gen.cam.ac.uk) and
  Thom Kaufman (kaufman@bio.indiana.edu)
o ananassae: Muneo Matsuda (matsudam@kyorin-u.ac.jp) and
  Kiyohito Yoshida (majin@ees.hokudai.ac.jp)
o pseudoobscura: Steve Schaeffer (swschaeffer@psu.edu)
o persimilis: Mohamed Noor (noor@duke.edu)
o willistoni: Claudia Rohde (claudiarohde@yahoo.com)
o virilis: Bryant McAllister (bryant-mcallister@uiowa.edu) and
  Jorge Vieira (jbvieira@ibmc.up.pt)
o mojavensis: Teri Markow (tmarkow@public.arl.arizona.edu)
o grimshawi: Patrick O'Grady (pogrady@uvm.edu)
Whole Genome Alignments
Our discussions with alignment groups led us to conclude
that it does not make sense to strive for a single set of DNA-based alignments
because they all differ somewhat, and people have their own preferences about
which ones are most useful for their particular downstream analyses. We
would, however, like to make sure that we end up with alignments of similar
quality to those being produced for the human ENCODE regions. As it
stands now, it looks as though we will end up with four different sets of
alignments: MAVID alignments (Pachter), Multiz alignments (UCSC), LAGAN
alignments (Sidow/Batzoglou) and TBA alignments (Webb Miller/Karro).
Genome-Wide Gene Annotation Sets
Our goal is to produce a genome-wide consensus set of gene
predictions for each species. Having a
well-vetted reference set of gene models is important so that groups doing
downstream analyses can work from a common set of annotations.
In order to accomplish this, we are organizing several activities:
· Production of cDNA libraries and 5' ESTs: ESTs will
provide training sets for groups who will be contributing gene prediction sets
for some or all of the species, as well as having other uses. Normalized
libraries from embryos and adults for virilis, ananassae, erecta,
mojavensis and grimshawi have
been sequenced (~20-25,000 5' EST reads each). A library for willistoni is currently being sequenced.
· Production of gene prediction sets: There are several
groups who have come forward with interest in producing gene prediction sets
using a variety of computational approaches.
Now that we have established the timetable for the initial assembly
freezes, we will ascertain how long these groups require to produce their gene
prediction sets. Our starting point in
these discussions is to target early fall for the production of these sets, but
we must recognize that this may not be feasible in all cases. We are interested in two kinds of gene
prediction sets, including predictions of protein-coding and RNA-coding genes:
  o Improved melanogaster gene prediction sets
    through comparisons with the nucleotides and gene predictions of the other
    sequenced Drosophila species.
  o Production of reference gene prediction sets for each of the other
    Drosophila species.
· Evaluation of the gene prediction sets: In the time frame we
are envisioning, it is likely that the initial annotation set will be the one
that will be used for subsequent analyses.
  o Computational evaluation: There
    are a variety of computational methodologies to compare the outputs of
    different gene prediction approaches. We
    will enlist the contributors of the prediction sets to propose computational
    methodologies for this evaluation and for selection of a consensus gene
    prediction set.
  o Manual evaluation:
    While it is unlikely that it will have a major impact on the initial analysis
    annotation set, we do want to begin the task of improving the consensus set of
    predictions. One approach to this will
    be manual review and modification of the annotations. We will enlist interested members of the
    community with expertise in particular gene families or in expert manual
    annotation to help in the review of the gene prediction sets.
  o Experimental testing of gene predictions: If funds become
    available, we will explore the possibility of carrying out experimental
    validation of a sample of gene predictions, focusing on examples of predictions
    that vary among the contributed prediction sets.
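As an illustration only of the kind of computational comparison mentioned
above, the following minimal sketch measures exon-level agreement between
contributed prediction sets and keeps exons supported by more than one set.
It is not a method adopted by any working group: the exon representation
(sequence name, start, end, strand), the function names, the predictor labels
and the min_support threshold are all assumptions made for the example.

```python
from itertools import combinations


def exon_jaccard(set_a, set_b):
    """Fraction of exons shared by two prediction sets (exact coordinate match)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 1.0  # two empty sets agree trivially
    return len(a & b) / len(union)


def pairwise_agreement(prediction_sets):
    """Jaccard agreement for every pair of named prediction sets."""
    return {
        (name_a, name_b): exon_jaccard(prediction_sets[name_a], prediction_sets[name_b])
        for name_a, name_b in combinations(sorted(prediction_sets), 2)
    }


def consensus_exons(prediction_sets, min_support=2):
    """Exons predicted identically by at least min_support sets (hypothetical rule)."""
    counts = {}
    for exons in prediction_sets.values():
        for exon in set(exons):
            counts[exon] = counts.get(exon, 0) + 1
    return {exon for exon, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    # Hypothetical exons: (supercontig, start, end, strand)
    example = {
        "predictor_A": [("sc1", 100, 200, "+"), ("sc1", 300, 400, "+")],
        "predictor_B": [("sc1", 100, 200, "+"), ("sc1", 300, 450, "+")],
        "predictor_C": [("sc1", 100, 200, "+"), ("sc1", 300, 400, "+"),
                        ("sc1", 900, 950, "-")],
    }
    for pair, score in pairwise_agreement(example).items():
        print(pair, round(score, 2))
    print(sorted(consensus_exons(example)))
```

Exons on which the sets disagree (here the 300-450 and 900-950 exons) are the
sort of cases that manual review or experimental testing could then target.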
While we do not wish to constrain any other group's work in
any other area of research on these Drosophila genomes, we are happy to invite
others to contribute work that might be appropriate for the main paper on the
assembly, annotation and analysis of these species. Please contact Doug Smith
(douglas.smith@agencourt.com), Bill Gelbart (gelbart@morgan.harvard.edu) or
Thom Kaufman (kaufman@bio.indiana.edu) regarding any such contributions.
Accessing Drosophila Genome Resources
- Michael Eisen (UC Berkeley) is
hosting a central "AAA" web site
"Assembly/Alignment/Annotation of 12 related Drosophila species"
(http://rana.lbl.gov/drosophila/) housing all genome data sets and links
relevant to these projects.
Contributions of additional data sets or links to this resource
should be made by emailing multiple@fruitfly.org.
- FlyBase
(http://flybase.bio.indiana.edu/) will post announcements relating to the
sequencing and analysis of these species, and when annotated genomes are
in GenBank, will incorporate them into FlyBase. In advance of this, FlyBase is providing
BLAST access to these genomes as their assemblies are made available by
the sequencing centers (http://species.flybase.net/).
Timelines
- Sequencing and Assembly: Initial
freezes of all species' WGS assemblies to be available by Sept. 15.
- Creating Chromosome Arm Maps: The
working group will be meeting at a workshop at the U. of Arizona on
Sunday, Oct. 30 with a goal of producing the first release of the
chromosome arm maps immediately thereafter.
- Preliminary alignments are now
available at the AAA site. These
will be recomputed once the initial assemblies are frozen.
- Gene Prediction Sets: We request
that any groups interested in submitting annotation prediction sets for
any or all of the Drosophila species (including the previously annotated melanogaster and pseudoobscura genomes) submit these
to AAA by November 1. We will
publish a request regarding formats of these data sets in the near
future. All of these prediction
sets will be made accessible as prediction tracks in FlyBase and will be
available for downloading by other groups displaying the Drosophila
genomes.
- Consensus Annotation Sets: A
working group will be formed to establish the criteria for determining the
consensus annotation set, which will be the set of gene models submitted
to GenBank. The goal is to have the
consensus annotation sets available for downstream analysis by December 1.
- Manuscripts: The current goal is
to submit the initial manuscript and related papers from individual groups
on February 1, 2006.
Sincerely,
Doug Smith, Agencourt
Inc. (douglas.smith@agencourt.com)
Bill Gelbart, Harvard
U. (gelbart@morgan.harvard.edu)
Thom Kaufman, Indiana
U. (kaufman@bio.indiana.edu)