FlyBase Reference Manual B. Detailed Descriptions of FlyBase Structure and Data
This section Last Updated: 5 August 2006

B.1. Genes

The Genes section of FlyBase contains information on Drosophila genes that has been curated from the literature and sequence databases. Data from all species of the family Drosophilidae are included. The initial data set was produced by merging the genes data in the text of Lindsley and Zimm (1992) with the old LOCI table of Ashburner, and Merriam's Genevent database. Information from all three sources has, however, been considerably revised and reformatted. New gene and allele records are added through FlyBase's curation of the literature and sequence databases. The curation of phenotypic data, a particularly complex class of Genes data, is discussed in Phenotypic Data in FlyBase, Drysdale (2001).

Some of the records in Genes will be transient. As more data become available some gene records will merge with others. Furthermore, some of these records are based on minimal data, for example, the annotation to an EMBL or GenBank sequence record. Our policy is to include data wherever we can. As records merge (or split) they will always be traceable by their secondary gene identifier numbers and by their synonyms.

One of the major differences between Lindsley and Zimm (1992) on the one hand, and Lindsley and Grell (1968) and Bridges and Brehme (1944) on the other, is that the 1944 and 1968 books were very much catalogs of mutations rather than of genes. Bridges and Brehme (1944) and Lindsley and Grell (1968) were allele based, while Lindsley and Zimm (1992) is largely, although not entirely, gene based. FlyBase is a gene-based database, and Genes reflects this change. Having said that, it will be apparent that the transition is by no means complete in Genes. For the majority of genes, mutant phenotypes are described in the respective allele records. In many cases where, as far as we know, all mutant alleles have a similar phenotype, this description will be found in the record for the first allele in Genes. Many genes in Lindsley and Zimm (1992) had no alleles specified, although it is clear that these genes were identified by one or more mutant alleles. In these cases we have arbitrarily designated an allele with the superscript 1. (Likewise, where an allele is referred to in text with a gene designation, we have regarded this as implying allele 1, where this seems reasonable, and made the change to state allele 1 explicitly.) There remain, in Genes, many cases where phenotypic information is to be found within the gene record itself. This is especially so for genes for which there is a great amount of data.

Errors in Genes.

Genes data will not be free of errors, typographical, of fact, or of interpretation. Please inform FlyBase when you find any error in these data. It will then be corrected. E-mail to flybase-updates at morgan.harvard.edu (reformat to standard e-mail address) or contact a member of the FlyBase group, whose addresses and phone/fax numbers are given in Reference Manual I: The FlyBase Project.

B.1.1. General description of Genes data

The Genes file contains a set of Drosophila gene records, the data of each record being organized into many different fields. As far as possible, we have implemented controlled vocabularies for the descriptions. These are indicated by [cv]. The controlled vocabularies are to be found in controlled-vocabularies.txt. This process is by no means complete, except for some of the simpler fields, such as mutagen. For example, all X ray induced alleles are described as 'X ray' (without the quotes) in the allele origin field, never as 'X rays', 'X-ray' or 'X-rays'.

The use of controlled vocabularies will increase in the future. This will allow users to more easily search the database and retrieve genes or alleles with particular properties.
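The exact-match behaviour of a controlled vocabulary can be sketched as follows. This is only an illustration: the set ORIGIN_TERMS and both function names are hypothetical, only the term 'X ray' comes from the text above, and the layout of controlled-vocabularies.txt is not assumed.

```python
# Hypothetical in-memory vocabulary for the allele origin (*o) field.
# Only 'X ray' is taken from the text; a real list would come from
# controlled-vocabularies.txt, whose layout is not described here.
ORIGIN_TERMS = {"X ray"}


def is_controlled_term(value, vocabulary=ORIGIN_TERMS):
    """Return True only for an exact, case-sensitive match against the vocabulary."""
    return value in vocabulary


def near_misses(value, vocabulary=ORIGIN_TERMS):
    """Suggest vocabulary terms that differ only by case, hyphens or a final 's'."""
    def squash(term):
        key = term.lower().replace("-", " ").strip()
        return key[:-1] if key.endswith("s") else key

    return sorted(t for t in vocabulary if squash(t) == squash(value) and t != value)
```

With this sketch, 'X-rays' is rejected by is_controlled_term but near_misses points back to 'X ray', which is the kind of search benefit a controlled vocabulary is meant to give.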

Overall syntax: The maximum line length is 255 characters; there are no blank lines; all lines begin with either * or #; lines that begin with # have no other characters; lines that begin with * have a letter in column 2, a space in column 3 and at least one more character beginning in column 4. The character # appears nowhere else in the file. The character * does, unfortunately, but the string *[A-Z,a-z] does not.

Record structure: The lines that consist of just '#' mark the end of the record for a gene. All other lines hold data for a gene; each field is one or more lines that have the same character in column 2. This character identifies the field and, sometimes, its position within a record (see below).
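The syntax rules above are simple enough to check mechanically. The following is a minimal sketch, not FlyBase's own software: the function names are hypothetical, and lines are assumed to have had their trailing newline removed already.

```python
import re

MAX_LINE_LENGTH = 255
# '*', a letter in column 2, a space in column 3, data from column 4 onward.
FIELD_LINE = re.compile(r"^\*([A-Za-z]) (.+)$")


def check_line(line):
    """Return a list of rule violations for one line (empty if the line is valid)."""
    problems = []
    if len(line) > MAX_LINE_LENGTH:
        problems.append("longer than 255 characters")
    if line == "":
        problems.append("blank line")
    elif line.startswith("#"):
        if line != "#":
            problems.append("'#' line carries other characters")
    elif line.startswith("*"):
        if not FIELD_LINE.match(line):
            problems.append("expected '*', letter, space, then data")
        elif "#" in line:
            problems.append("'#' inside a data line")
    else:
        problems.append("line starts with neither '*' nor '#'")
    return problems


def split_records(lines):
    """Yield one list of (code, value) pairs per record; a '#' line ends a record."""
    record = []
    for line in lines:
        if line == "#":
            yield record
            record = []
        else:
            match = FIELD_LINE.match(line)
            if match is None:
                raise ValueError(f"malformed line: {line!r}")
            record.append((match.group(1), match.group(2)))
    if record:
        raise ValueError("last record is not terminated by '#'")
```

For example, list(split_records(["*a w", "*e white", "#"])) gives a single record holding the pairs ('a', 'w') and ('e', 'white').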

B.1.2. List of Genes field descriptions

These are the current field designations in alphabetical order:

*a gene symbol
*b genetic location
*c cytological location
*d biological role of gene product [cv]
*e full name of gene or allele
*f cellular compartment of which gene product is a component [cv]
*g nucleic acid sequence databank and other DNA accession number
*h polymorphism data
*i symbol synonym(s)
*j xenogenetic interaction information on alleles
*k phenotypic information on alleles
*l transposable element data
*m protein database accession number
*n aberrations causing position-effect variegation of gene [cv]
*o origin/mutagen [cv]
*p phenotypic information on genes
*q information concerning functional relationships between genes
*r information on wild-type biological role
*s molecular information for genes and alleles
*t class of gene [cv]
*u miscellaneous information on genes and alleles
*v information on availability
*w discoverer
*x reference(s)
*y secondary FlyBase identifier number(s)
*z primary FlyBase identifier number
*A allele symbol
*B alternative genetic location
*C comments on cytology associated with allele
*D comments on cytological location
*E a duplicate of a *x field, used to tie data to a reference
*F function of gene product [cv]
*G insertion chromosome associated with allele
*H date record entered or updated
*I transgene construct that carries allele
*J protein domain information
*K arguably most useful aneuploids for this gene
*L synonym for transgene construct symbol
*M probable ortholog in reference species of drosophilid
*N synonym for insertion symbol
*O progenitor allele or chromosome if relevant to allele
*P aberration causing the allele
*Q complementation information concerning alleles
*R comments on origin, including progenitor genotype if irrelevant to allele
*S genetic interaction information on alleles
*T recent review article that discusses this gene
*U nickname
*V name synonym

*Y name of gene product

Field structure: The first line of each record is the *a field. There is only one of these per record. Other fields may appear in any order, and most can appear more than once, not necessarily consecutively. All fields before the first *A field (if there is one) refer to the gene. All fields between two *A fields (or between an *A field and a #) refer to the immediately preceding allele. Thus, for example, *b fields always appear before any *A fields, but *e fields can appear anywhere (e.g., "*e white" and "*e white-apricot"). Fields before the first *A are in a defined order:

aHiezyCbcwBDdJUltrfvFghmnpqsuxE

In pretty outputs the *-codes are replaced by a text term describing the field.
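The gene/allele scoping rule and the replacement of *-codes by text terms can be sketched as below. This is an illustrative sketch only: group_record, pretty and the FIELD_LABELS dictionary are hypothetical names, the dictionary covers just a few codes copied from the list in B.1.2, and the wording used by the real pretty outputs may differ.

```python
# A few labels copied from the list of field designations in B.1.2; the mapping
# itself is illustrative and not the wording used by FlyBase's pretty outputs.
FIELD_LABELS = {
    "a": "gene symbol",
    "e": "full name of gene or allele",
    "b": "genetic location",
    "A": "allele symbol",
    "o": "origin/mutagen",
}


def group_record(fields):
    """Split one record's (code, value) pairs into gene-level and per-allele fields.

    Fields before the first *A belong to the gene; each *A opens a new allele
    that owns every field up to the next *A or the end of the record.
    """
    if not fields or fields[0][0] != "a":
        raise ValueError("a record must start with its *a field")
    gene, alleles = [], []
    for index, (code, value) in enumerate(fields):
        if code == "a" and index > 0:
            raise ValueError("only one *a field is allowed per record")
        if code == "A":
            alleles.append({"symbol": value, "fields": []})
        elif alleles:
            alleles[-1]["fields"].append((code, value))
        else:
            gene.append((code, value))
    return {"gene": gene, "alleles": alleles}


def pretty(fields):
    """Render (code, value) pairs with text labels, falling back to the *-code."""
    return [f"{FIELD_LABELS.get(code, '*' + code)}: {value}" for code, value in fields]
```

Applied to the "*e white" and "*e white-apricot" example, the first *e stays with the gene and the second is attached to the allele that precedes it.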

Special characters: There are no special characters used in this file. Superscripts are enclosed between square brackets []; subscripts between double square brackets [[]]. Greek letters are written out, e.g. alpha, beta.
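As an illustration of the bracket convention, the sketch below converts it to HTML markup for display. The choice of HTML and the name to_html are assumptions made for this example; nested brackets and HTML escaping are not handled.

```python
import re

SUBSCRIPT = re.compile(r"\[\[(.+?)\]\]")
SUPERSCRIPT = re.compile(r"\[(.+?)\]")


def to_html(symbol):
    """Convert [[x]] to <sub>x</sub> and [x] to <sup>x</sup>."""
    # Subscripts first, otherwise the single-bracket pattern would consume them.
    symbol = SUBSCRIPT.sub(r"<sub>\1</sub>", symbol)
    return SUPERSCRIPT.sub(r"<sup>\1</sup>", symbol)
```

Handling the double brackets before the single ones is the one ordering decision that matters here.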

B.1.3. Detailed description of the Genes fields

In this description the fields are grouped logically, rather than alphabetically. Links in the list of field designations in section B.1.2. above go to the relevant detailed field descriptions below.

B.1.4. Nontraditional alleles

In addition to 'alleles' in the traditional sense, FlyBase now names and curates further classes of allele so that phenotypic or expression pattern data can be captured for in vitro construct alleles and alleles of reporter (e.g., Ecol\lacZ), effector (e.g., Scer\FLP) or toxin (e.g., Rcom\DT-A) genes. Since these alleles have not historically been named by researchers, and have been named by FlyBase, their presentation in FlyBase requires some explanation:

B.1.4.1. Alleles of reporter genes

Alleles of reporter genes currently fall into two main classes, those resulting from enhancer trap experiments, and those resulting from promoter (or other regulatory region) analysis, where a fragment is used to drive the expression of a reporter gene. Ecol\lacZ will be used for illustration.

Enhancer trap results:

Promoter analysis results:

B.1.4.2. Alleles of ectopically expressed Drosophila gene products

Products of genes may be ectopically expressed due either to juxtaposition with different regulatory sequences in the genome (as a result of being inserted into different-than-wild-type locations by chromosome rearrangement or P element transposition) or due to in vitro construction creating a different constellation of regulatory sequences than in wild type.

By analogy with alleles of Ecol\lacZ for enhancer traps, P-element-borne insertions of genes (e.g., w or ve) that have a qualitatively distinct _position-dependent_ mutant phenotype will be curated as new alleles of those genes, e.g., ve[Stg], caused by a particular insertion of P{HS-rho}, namely P{HS-rho}Stg.

The 'in vitro construct' ectopic expression alleles currently fall into two main classes, one component or two component systems:

One component systems:
Gene A is expressed from a promoter of gene B. The allele is typically generated by in vitro construction. In such cases the allele symbol is of the format 'gene-A[gene-B.PI]', e.g., phyl[sev.PC], or 'gene-A[gene-B.fragment descriptor]' where the author includes a promoter fragment descriptor, e.g., phyl[ninaE.GMR].

An occasional exception is made for promoter fusions that are widely used to provide essentially wild-type gene function; these alleles have the minigene '+m construct' designation (see below) prepended to the promoter designation (e.g., heat shock), as in w[+mW.hs].

It is common for authors to report a construct in which, e.g., ftz is expressed under a 'heat shock' or Hsp70 promoter while providing no further details about the nature of the promoter. For these cases the allele designation hs.PI is employed, e.g., Antp[hs.PZ] for 'Antp heat shock construct of Zeng'. An 'hs' designation should be reserved for cases in which the heat inducible, not just the minimal, promoter fragment is used.

Where the allele is both altered in its coding region and expressed from an ectopic promoter, the sequence 'alteration.promoter' is used in the allele designation, e.g., tor[13D.hs.sev] to denote the coding sequence of tor[13D] expressed from a heat shock (undefined) promoter with a sev enhancer. An exception to this rule is made for Tags, which appear as the last component of the allele symbol (see below).

Two component systems:

B.1.4.3. Alleles of ectopically expressed non-Drosophila effector products

A note on ribozymes: FlyBase has a foreign ribozyme gene, symbol LTSV\RBZ. Alleles of LTSV\RBZ capture the different variants and are named with the syntax 'promoter.target gene', e.g., LTSV\RBZ[hs.ftz] for a heat inducible ftz-targeted ribozyme.

'+m' minigenes

The minigene allele designation is used in its narrow sense, i.e., where the only difference between the allele and the wild type is the removal of more or less non-essential sequences. Thus the minigene allele designation is reserved for those cases where the gene's own promoter is driving its expression.

The minigene allele symbols begin with 'm', for minigene, and are followed by the construct symbol used in the publication. If no construct symbol has been used, the string 'mIa' where 'm' stands for minigene, 'I' for the first author's last name initial and 'a' for the first in the series is used. If the function of the minigene is stated to be indistinguishable from that of the wild type allele, the 'm' is preceded by a '+'.

Tags

Genes can be modified by the addition of a tag allowing the product to be identified, purified, or targeted to a particular subcellular distribution. Tagged alleles have the syntax 'gene-symbol[x.T:y]', where x is an identifier and y is the name of the tag (e.g., Hsap\MYC, T:Ivir\HA1, SV40\nls2), as in CycB[B1.T:Hsap\Myc]. Where a tag is artificial, the species prefix Zzzz is used, e.g., T:Zzzz\His6.

B.1.4.4. Classical alleles engineered into transgene constructs, including rescue constructs

A class of alleles is named to capture fragments of genomic DNA used in rescue constructs. The symbol for the rescuing allele begins with '+t'. This is followed by the length as stated by the authors, or by the construct symbol if the length is not given; if neither length nor construct symbol is stated, '+tIa' is used, where 't' stands for transgene, 'I' for the first author's last name initial and 'a' for the first in the series. When rescue is incomplete, the construct is considered to carry a mutant allele. In that case the allele designator is the construct symbol, 'length of genomic insert.tIa' if no symbol is given, or 'tIa' where neither length nor construct symbol is stated.

When a classic allele, e.g., w[a], is put into a transgene construct it will get a new designation, e.g., w[a.tIa], to reflect its transgenic environment, where 't' stands for transgene, 'I' for the first author's last name initial and 'a' for the first in the series.

FlyBase is, of course, happy to discuss and advise on use of nomenclature of these non-traditional alleles.

B.1.5. Protein and transcript symbols and exon naming

FlyBase strives to link curated information to particular protein and transcript species. In order to maintain the data in this way, it is necessary to assign different symbols to each gene product. Proteins, transcripts and exons are symbolized as follows.

Protein symbols are of the form cact[+]P482 where the gene symbol and allele designation are followed by a capital P and the size of the protein in amino acids. When the size in amino acids is not known, the size in kiloDaltons is used, e.g. grh[+]P120kD. If no size is known, the symbol is followed by a capital letter to distinguish products that are known to be different, e.g. Sh[+]PA, Sh[+]PB. If multiple proteins of the same size and divergent sequence are characterized, the symbols are followed by different capital letters, e.g. abc[+]P345A, abc[+]P345B. A generic protein symbol, e.g. cact[+]P, is used to capture properties that cannot be specifically attributed to one protein product of a gene.

Transcripts are similarly named. The gene symbol and allele designation are followed by a capital R and the size in kb, e.g. cact[+]R2.2. Where possible the size as estimated by northern blot is used. If not, the size of the longest cDNA is used and this is indicated in the transcript table. For transcripts of unknown size, the symbol is followed by a capital letter, e.g. grh[+]RA, grh[+]RB. For multiple transcripts of similar size and divergent sequence, the symbols are followed by different capital letters, e.g. abc[+]R1.7A, abc[+]R1.7B. A generic transcript symbol, e.g. cact[+]R, is used to capture properties that cannot be specifically attributed to one particular transcript of a gene.
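The protein and transcript naming rules can be read as a small grammar. The sketch below parses the example symbols given above; it is an illustration under stated assumptions (bracketed allele designations, a single distinguishing capital letter, sizes written as plain numbers), and parse_product_symbol is a hypothetical name rather than a FlyBase tool.

```python
import re

SYMBOL = re.compile(
    r"^(?P<gene>.+?)\[(?P<allele>[^\]]+)\]"      # gene symbol and allele designation
    r"(?P<kind>[PR])"                             # P = protein, R = transcript
    r"(?:(?P<size>\d+(?:\.\d+)?)(?P<kd>kD)?)?"    # optional size, optional kD unit
    r"(?P<variant>[A-Z])?$"                       # optional distinguishing letter
)


def parse_product_symbol(symbol):
    """Split a protein or transcript symbol into its documented components."""
    m = SYMBOL.match(symbol)
    if m is None:
        raise ValueError(f"not a product symbol: {symbol!r}")
    kind = "protein" if m["kind"] == "P" else "transcript"
    size = float(m["size"]) if m["size"] else None
    if size is None:
        unit = None
    elif kind == "transcript":
        unit = "kb"                      # transcript sizes are given in kb
    else:
        unit = "kD" if m["kd"] else "aa"  # proteins: amino acids unless marked kD
    return {
        "gene": m["gene"],
        "allele": m["allele"],
        "kind": kind,
        "size": size,
        "unit": unit,
        "variant": m["variant"],
        # A generic symbol carries neither a size nor a distinguishing letter.
        "generic": size is None and m["variant"] is None,
    }
```

For instance, grh[+]P120kD is read as a 120 kD protein of grh, and cact[+]R as the generic transcript entity for cact.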

In general, all of the exons comprising a gene are numbered consecutively from 5' to 3'. Where exons partially overlap, they are given the same number with a suffix, e.g., 2a, 2b.

In some cases, it is not possible to attribute a characteristic to an individual gene product. For example, expression pattern data is often obtained with probes or antibodies that recognize more than one product of a gene. It is not rigorously known where each individual gene product is expressed. In addition, it is often not possible to determine which transcript observed on a northern blot corresponds to a particular cDNA. In these cases, the data is linked to a generic protein or transcript entity for that gene.

B.1.6. FlyBase Genes - Interactive Fly Cross Index

FlyBase has developed a hierarchical view of the Interactive Fly entitled "Interactive Fly Hierarchy: cross-index to FlyBase genes". This hierarchy is accessible from both Allied Data and Genes. The hierarchy provides an overview of the Interactive Fly with links to the specific Interactive Fly pages, as well as gene lists with links to the individual gene records in FlyBase and the Interactive Fly. This permits searches for genes grouped according to developmental and cellular pathways and functions.

B.1.7. Differences and omissions from Lindsley and Zimm (1992)

All errors found in Lindsley and Zimm (1992) have been corrected. A list of these errors, sorted by page number, is in the file errors.txt in the Redbook section of FlyBase Documents. The material in the DELETION MAP tables in the 'lethals' section of Lindsley and Zimm (1992) is not included; these tables are available in the Redbook section of Maps. The tables of Lindsley and Zimm (1992) have been broken down and the data incorporated into the text of the relevant gene record. All references within the body of a text entry of Lindsley and Zimm (1992), i.e., not in the references: field, have been duplicated into the references: field. With a very few exceptions all references are to be found in the FlyBase Bibliography and carry FlyBase reference ID numbers. The molecular map figures in Lindsley and Zimm (1992) are not included in Genes, but are available in the Redbook/Images sections of Documents. Lindsley and Zimm often used introductory sections for groups of genes that are, in some way or other, related (see, e.g., the record for ASC, page 50). This structure is not suitable for FlyBase, and this information has, in general, been repeated in each of the relevant individual gene records.

B.2. Synonyms

FlyBase maintains a record of synonyms for gene, allele, aberration, transposon and transgene construct symbols that have appeared in the literature and stock center stock lists. Files with tables of synonyms and their corresponding "valid" symbols are found in the relevant sections of FlyBase.

Synonyms have several different causes. Sometimes two workers give the same symbol to two different genes, requiring one of these to be changed. Sometimes two workers, either by accident or design(1), give two different symbols to the same gene; in that case the symbol that has priority should be used. Many of the synonyms arise, however, as a consequence of minor variation in the way a gene's, aberration's, transposon's or transgene construct's symbol is written (e.g., with lower case or capital first letter), or by error, either in the literature or in these tables. In some cases it has been difficult to decide whether a name is a gene synonym or just an allele name (this is especially so for lethals). We have taken a very liberal attitude to synonyms and, when in doubt, have included a name as a synonym even when it may more correctly be an allele name.

The files are:

1. "Scientists would rather use each other's toothbrushes than each other's nomenclature.", Keith Yamamoto.

B.3. Species other than D. melanogaster

FlyBase includes data on all species from the family Drosophilidae. The 'default' species is D. melanogaster and all symbols and names of genes, alleles, aberrations and clones from other species have a prefix of the form Nnnn\, where N is the initial letter of the genus (e.g. D for species in the genus Drosophila) and nnn is normally the first three letters of the specific epithet (e.g., sim for simulans). In formal terms all symbols and names from D. melanogaster have the prefix Dmel\, but this is usually omitted.

Species prefixes are also used for non-melanogaster genes introduced into D. melanogaster via a transgene construct, including Ecol\lacZ, Scer\GAL4 and Avic\GFP. In addition, genes carried by natural transposable elements have the transposon symbol as a 'species' prefix, for example, P\T, the gene for P-element transposase. To find genes such as these in a Genes search, change the 'Species' option from the default 'Dmel' to 'All'.
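The prefix convention can be sketched in a few lines. Both function names are hypothetical, and the sketch deliberately covers only the normal case: abbreviations that depart from the "first three letters" rule must be looked up in the species list, and symbols that contain more than one prefix (tags such as T:Hsap\MYC, or fusions) are not handled.

```python
DEFAULT_SPECIES = "Dmel"


def make_prefix(genus, epithet):
    """Build the usual Nnnn abbreviation: genus initial plus three letters of the epithet."""
    return genus[0].upper() + epithet[:3].lower()


def split_species(symbol):
    """Return (prefix, symbol without prefix); an unprefixed symbol is taken as Dmel."""
    # The separator between prefix and symbol is a single backslash.
    prefix, separator, rest = symbol.partition("\\")
    if not separator:
        return DEFAULT_SPECIES, symbol
    return prefix, rest
```

Because transposon symbols are used in the same position as species abbreviations, split_species returns ('P', 'T') for the P-element transposase gene without any special casing.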

A list of all of the names and abbreviations used by FlyBase for species is included in the Nomenclature section of FlyBase. The species-abbreviations.txt file has the syntax:
taxgroup | abbreviation | genus | species name | common name | comment

At present, five different 'taxgroups' are recognized:

drosophilid (i.e., species in the family Drosophilidae), non-drosophilid eukaryote, prokaryote, transposable element, and virus (including prokaryote viruses). The file is sorted in this order.
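A line of species-abbreviations.txt can be split into its six columns as sketched below. The class and function names are hypothetical, and so is the sample line used afterwards; whether the real file pads the '|' separators with spaces is not stated, so the sketch strips whitespace either way.

```python
from typing import NamedTuple


class SpeciesEntry(NamedTuple):
    taxgroup: str
    abbreviation: str
    genus: str
    species_name: str
    common_name: str
    comment: str


def parse_species_line(line):
    """Split one 'taxgroup | abbreviation | genus | species name | common name | comment' line."""
    # At most five splits, so a '|' inside the trailing comment is preserved.
    parts = [part.strip() for part in line.rstrip("\n").split("|", 5)]
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, found {len(parts)}: {line!r}")
    return SpeciesEntry(*parts)
```

For a made-up line such as "drosophilid | Dsim | Drosophila | simulans | | ", the result has abbreviation 'Dsim' and empty common name and comment fields.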

We stress that identity of gene symbol between two species cannot be used to conclude 'homology' of genes. Where known, or strongly suspected, information concerning homologous genes within the family is present in a *M field of the genes file.

FlyBase has made only limited efforts to curate genes, alleles and aberrations from species other than D. melanogaster for the period before 1989. We have back curated from D.I.S. and some primary papers and reviews that have come to hand. For four species we have incorporated the efforts of others:

We would be happy to hear from colleagues who are able to review records from species other than D. melanogaster. We thank Jerry Coyne for reviewing the records for D. simulans, D. mauritiana and D. sechellia.

B.4. Genetic objects from non-Drosophila species that are included in Drosophila

Sequences from many other organisms are often included in artificial constructs introduced into the genome of Drosophila. FlyBase calls these 'foreign genes' and they have symbols that indicate both the species of origin and the nature of the element, e.g., Hsap\BMP4, the BMP4 gene from humans. A list of the species abbreviations used is to be found in the Nomenclature section.

Just as two or more different Drosophila genes can be engineered into a gene fusion so can two or more different foreign gene coding regions. These are called 'foreign fusion' genes, e.g., Avic\GFP::Ecol\lacZ, a coding fusion of Aequorea victoria GFP and the E. coli lacZ gene.

Structural and non-coding elements ('SAFE elements', see B.1.3.) from non-Drosophila species are called foreign SAFE elements. The most common group of foreign SAFE elements are short sequence tags used to mark genes or their products (including epitope tags). These have symbols that begin with 'T:', e.g., T:Hsap\MYC, the 'myc' epitope tag. Artificial sequences are also classed as SAFE elements, e.g., T:Zzzz\His6 for a DNA sequence encoding a run of six histidine residues.

A limited class of regulatory elements from foreign species are classified as foreign SIRE elements (synthetic and/or isolated regulatory elements). This class is restricted to regulatory elements widely used in an isolated context, for example as mobile activating elements. Examples are the synthetic multiple UAS[[G]] elements, restricted to cases in which they are used within transgene constructs designed to activate adjacent endogenous genes.

The class of element is indicated in a *t line, which, for the objects described in this section, can have the following values:

Each class, or any combination of classes, can be extracted from the database by using the complex query form in Genes with the "Class" option changed from the default "all" to one or more (ctrl+click to add terms) of these categories.

For each class the origin of the gene is described in star-coded format in a *u line with the following syntax:
*u Foreign sequence; species == <species_name>; gene|sequence|sequence tag|function tag|epitope tag == <gene symbol>; <database_abbreviation:database_id>.

Attempts are first made to cross-reference to another genetic database (e.g., OMIM, GDB, MGD). If such a link cannot be made then we attempt to establish a link with a protein or nucleic acid sequence database. The database abbreviations used will be found in Reference Manual F: Links To and from FlyBase. The gene name or symbol will be enclosed in single quotation marks if no cross-reference to another genetic database can be found. If no cross-reference can be established then a brief literature reference to the object will be included within the 'comment' field. In the case of epitope tags the comment field will normally include the 'name' of the antibody recognizing the epitope and a literature reference.
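The *u syntax for foreign sequences can be parsed with plain string operations, as in the sketch below. The function name is hypothetical, the database identifier in the usage note is a placeholder rather than a real accession, and any comment text beyond the cross-reference is ignored.

```python
KINDS = ("gene", "sequence", "sequence tag", "function tag", "epitope tag")
PREFIX = "*u Foreign sequence; "


def parse_foreign_origin(line):
    """Parse a '*u Foreign sequence; species == ...; <kind> == ...; db:id.' line."""
    if not line.startswith(PREFIX):
        raise ValueError("not a foreign-sequence *u line")
    body = line[len(PREFIX):].rstrip()
    if body.endswith("."):
        body = body[:-1]                      # drop the terminating full stop
    parts = [part.strip() for part in body.split(";")]
    if len(parts) < 2:
        raise ValueError("species and object clauses are both required")
    species_key, _, species = parts[0].partition("==")
    kind, _, symbol = parts[1].partition("==")
    if species_key.strip() != "species" or kind.strip() not in KINDS:
        raise ValueError(f"unexpected clause in: {line!r}")
    symbol = symbol.strip()
    # Single quotes mark a symbol with no cross-reference to a genetic database.
    quoted = len(symbol) >= 2 and symbol[0] == "'" and symbol[-1] == "'"
    xref = None
    if len(parts) > 2 and parts[2]:
        database, _, identifier = parts[2].partition(":")
        xref = (database.strip(), identifier.strip())
    return {
        "species": species.strip(),
        "kind": kind.strip(),
        "symbol": symbol.strip("'"),
        "quoted": quoted,
        "xref": xref,
    }
```

Given "*u Foreign sequence; species == Homo sapiens; gene == BMP4; OMIM:000000." (with 000000 standing in for a real identifier), the sketch returns the species, the kind 'gene', the symbol BMP4 and the pair ('OMIM', '000000').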

B.5. Maps

The Maps section of FlyBase contains map-based browsing and query tools and data. See Reference Manual C: Using FlyBase on the Web for further information on these tools.

FlyBase uses Bridges' revised maps for the banding patterns of the polytene chromosomes. See:

Bridges, 1938, J. Hered. 29: 11-13 (X chromosome), Bridges and Bridges, 1939, J. Hered. 30: 475-476 (2R), Bridges, 1941, J. Hered. 32: 64-65 (3L), Bridges, 1941, J. Hered. 32: 299-300 (3R), Bridges, 1942, J. Hered. 33: 403-408 (2L).

B.5.1. Sequence-based Maps

B.5.1.1. Genome Browser, GBrowse

GBrowse (a product of the Generic Model Organism Database Project) provides a Web-based view of a specified region of the genome; the location of that region along the chromosome arm is indicated graphically. The region of interest can be specified by gene symbol, CG identifier, a mapped feature (such as a Drosophila Gene Collection cDNA clone, BAC genomic clone, P element insertion, or protein sequence accession in the SPTR database with BLASTX similarity to the genomic sequence), or a coordinate extent on a scaffold accession or chromosome arm. One can also input a sequence string using the Fly BLAST server and from the BLAST results list link to the alignment in the GBrowse view. The extent of the region (from 100 bp to 5 Mbp) can be controlled by the user using the zoom option. Adjacent regions can be viewed using the scroll option. Annotated genes, supporting data, and other sequence-aligned data (e.g., P-element insertion sites and Affymetrix oligos) are shown as color-coded features flanking the central sequence axis. Features can be identified by mousing over the relevant graphic and viewing the feature name in the status bar; when the view is zoomed in sufficiently, or the gene labelling option is selected, the gene annotations are labelled. Included below the GBrowse view of the region are BAC in situ images. The "Display Settings" panel can be used to control the subset of features displayed, the width of the image, and other display options. For example, one can choose to have gene symbols displayed or can choose to have an expanded view of the aligned data. The data behind the GBrowse view, including cytological locations and GO gene function descriptions, can be downloaded in various flat-file formats: tabulated, FASTA, GAME-XML or GFF formats.

B.5.1.2. Drosophila Genome Overview

The FlyBase tool Drosophila Genome Overview is an extension of GBrowse that allows users to browse entire chromosome arms at once. The default view displays cytological numbered divisions, the tiling BAC genomic clones, and the annotated sequence scaffolds in GenBank. Clicking on the BAC or GenBank scaffolds takes users to the GBrowse view of the region. Users can also choose to display all of the genes along a chromosome arm, as well as cDNAs that align to the genomic sequence, P element insertions, transposable elements, and sequencing gaps. The width of the map can be adjusted, which is necessary when viewing these finer, optional features.

B.5.1.3. Apollo

A more flexible and interactive view of the same data provided in GBrowse is possible using the Apollo genome browser and annotator. Use of this tool requires that the Apollo software be downloaded and installed locally; data are then loaded via a Web connection from the annotation database. Data can be saved locally in the form of GAME-XML flat files and subsequently reloaded into Apollo. A detailed and comprehensive user guide for Apollo is available. This tool provides several options for viewing annotations and features down to the sequence level, and allows searches for specific genomic or amino acid sequence strings. Apollo also provides editing options, including sequence-level modifications of exon extents, addition of alternative transcripts, deletion of existing annotations, modifications involving merging or splitting existing annotations, and addition of comments associated with specific genes or transcripts. There are many options for customizing the format of the view and the data sets; these may be saved as user preferences.

B.5.2. Gene Order Maps

The Gene Order Maps section contains maps that communicate both gene order and cytological location. There are two formats: files whose names end in '.ps' are suitable for downloading and printing on a PostScript printer, while those ending in '.txt' are preferable for viewing in a web browser. Their format is documented in detail in the file geneorder.doc in the same folder.

Using the Gene Order Maps

The gene-order map communicates both gene order and cytological location. Presenting both on a genome-wide map is rather different from presenting them for a small, well-mapped region, so a novel format has been adopted; it is documented here.

1. Cytological range
Each gene whose cytological location is known with a range of uncertainty less than about two number divisions is written on a vertical line whose extent is the range of uncertainty. Overlapping lines are staggered. To this extent, in other words, the format is as in the EofD. A gene whose symbol exceeds nine characters may cross more than one line; the line it is attached to always goes through the second character of the symbol.

Bands are drawn with differing sizes, but this is not in any way related to amount of DNA per band, as it is on the EofD. It is only a function of how much data we need to place there.

2. "Limiting" genes
In addition, at either end of the line there is the symbol for a gene that is known to lie to the indicated side of the gene in the middle of the line. Two points must be emphasized about these "limiting" genes: they are not being stated to have the same cytological location as the "limited" gene, and they are not being stated definitely to be the neighboring gene. They are chosen by pragmatic criteria as being the most informative genes that are known to lie to the indicated side. These criteria include cytological location and size of range of uncertainty of that location. This means that it is common, especially in well-mapped regions, for a gene to appear more than once. A gene can appear as a limiter of any number of other genes, but it will only be a limited gene on at most one line.

Limiters are identified only by direct recombination, complementation or molecular map data; cytology (of genes or of breakpoints) is never used. If a gene has no limiter on one side (or both), that means that no gene can be placed to that side using direct genetic or molecular data.

3. Multiple "limited" genes on a single line
In the better-characterized regions, gene order is known to a degree that cannot be clearly represented by cytological range. This is alleviated by placing two or more genes "limited" on the same line. So as to maintain completeness of information, a set of genes is only ever limited on the same line if (a) their relative order is completely known, and (b) they all have identical cytological ranges. The limiters of a line with more than one gene are known to lie to the indicated side of all limited genes.

 |      y 
 |      | 
 |      | 
1B5     | 
 |     svr 
 |      | 
 |    elav 
        | 
 |      | 
 |      | 
1B6     | 
 |      | 
 |    Appl 

This says:

It does not say:

4. Nested or overlapping genes
The software that analyses map data understands the concept of genes within genes, but this is hard to depict graphically without a generally more confusing format. Sometimes, therefore, a gene will be shown as its own limiter, or as both limited by and limiting (to the same side) another gene.

We have incorporated some molecular data into this map, and will add much more over the coming year, but the bulk of the information is based on genetic data. Therefore, the definition of overlap of two genes is not necessarily that the transcription units overlap. For example, ftz is shown as embedded in Scr, because Scr[-] ftz[+] deficiencies exist that delete proximal material (including Antp).

5. Genes with cytological extent
A few dozen genes are stated to be deleted by deficiencies which (according to our data) do not quite overlap, thus implying that the gene occupies the whole region between the deficiencies (plus a bit on either side). In most cases the gap between the deficiencies is only one band, so we have fudged the issue by placing the gene at the interband, e.g. y in 1B1-2:

 | 
 | 
1B1 
 |         arth
 |          | 
            y 
 |    y     | 
 |    |     ac 
1B2   ac 
 |    | 
 |    sc 

Two files related to the correspondence of the genetic and cytogenetic maps are also in Maps:

B.5.3. Computed Aberration Breakpoints and Cytological Locations of Genes

If you see computed cytologies in FlyBase that you think are incorrect, please contact us at flybase-updates at morgan.harvard.edu (reformat to standard e-mail address).

Five categories of information regarding the polytene location of genes and aberration breakpoints are captured by FlyBase:

Recombination, complementation and molecular information do not reveal polytene locations directly, but can be combined with orcein and in situ data to derive inferred polytene locations. This type of analysis is non-trivial when conducted on a large dataset. FlyBase has produced software that performs this analysis automatically, with some provisos which are explained below (see 'Provisos').

The output of this software is a 'best guess' of the polytene location of each gene or aberration breakpoint for which any relevant data are known to FlyBase. The guess is presented as a range of uncertainty, whose ends are either polytene bands (such as 22F1) or lettered subdivisions (such as 22F). Heterochromatic bands (such as h41) are also used. This range appears as the polytene location of the gene or breakpoint in the header section of the gene or aberration report, and is also used as the underlying data for the various map-based user interfaces, such as the graphical maps and CytoSearch.

To the extent possible (see 'Provisos' below), the computed range of uncertainty of a gene or breakpoint is the range consistent with ALL the data known to FlyBase. Thus, if in one publication a gene has been reported to lie in 35B1-4, and in another publication it is reported to lie in 35B3-6, and there is no other relevant information in FlyBase, the computed location will be 35B3-4. More complex situations arise from complementation and recombination data. For example, if Df(1)xyz is stated to have its proximal breakpoint at 15A1-4, and Df(1)pqr is stated to have its distal breakpoint at 15A3-6, and the Df's are known to overlap (because there is a gene, abc, that they both delete), then both those breakpoints will be computed to lie in 15A3-4 -- as will the gene abc itself.
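The two examples above can be worked through as interval arithmetic. This is a minimal sketch under stated assumptions, not the FlyBase software: every function name is hypothetical, a range is assumed to lie within a single lettered subdivision and is held as a pair of band numbers, and band numbers are assumed to increase from left (distal) to right (proximal).

```python
import re
from typing import Optional, Tuple

Range = Tuple[int, int]  # (first band, last band) within one lettered subdivision


def parse_range(text: str) -> Tuple[str, Range]:
    """Split a string such as '35B1-4' into ('35B', (1, 4)). Simplified format."""
    m = re.fullmatch(r"(\d+[A-F])(\d+)-(\d+)", text)
    if m is None:
        raise ValueError(f"unsupported range: {text!r}")
    return m.group(1), (int(m.group(2)), int(m.group(3)))


def format_range(subdivision: str, r: Range) -> str:
    """Render ('35B', (3, 4)) as '35B3-4'; a single band as e.g. '35B3'."""
    lo, hi = r
    return f"{subdivision}{lo}" if lo == hi else f"{subdivision}{lo}-{hi}"


def intersect(a: Range, b: Range) -> Optional[Range]:
    """Range consistent with both reports, or None if they conflict."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def constrain_order(left: Range, right: Range) -> Tuple[Range, Range]:
    """Narrow two ranges given that the 'left' point is not right of the 'right' point."""
    if left[0] > right[1]:
        raise ValueError("ranges are inconsistent with the stated order")
    new_left = (left[0], min(left[1], right[1]))
    new_right = (max(right[0], left[0]), right[1])
    return new_left, new_right


# First example: two published locations for the same gene.
sub, first = parse_range("35B1-4")
_, second = parse_range("35B3-6")
gene_location = format_range(sub, intersect(first, second))

# Second example: the deficiencies overlap, so the distal (left) break of
# Df(1)pqr cannot lie to the right of the proximal (right) break of Df(1)xyz.
pqr_distal, xyz_proximal = constrain_order((3, 6), (1, 4))
```

Under these assumptions gene_location is '35B3-4', and both breakpoints narrow to bands 3-4 of 15A, matching the text. The gene abc lies between the two breakpoints and therefore falls in the same range.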

Because of the inherent complexity of these computations, the basis for the computed range is often far from obvious at first sight. FlyBase therefore includes, directly following the computed range in the Full and Abridged (but not Synopsis) gene and aberration reports, one-line descriptions of the primary data from which each end of the range was determined. There is no requirement that any two data items derive from the same reference. Those from the last example above would be as follows (with arbitrary data for the other ends of the deficiencies):

For gene abc:
Computed cytological location: 15A3-4
Left limit from inclusion in Df(1)pqr (FBrf0012345)
Right limit from inclusion in Df(1)xyz (FBrf0054321)
For Df(1)xyz:
Computed cytological location: 14D;15A3-4
Limits of break 1 from polytene analysis (FBrf0013579)