
NCBI Genomic Sequence Assembly and Annotation Process


Overview

The NCBI Map Viewer facilitates access, query, and display of mapping data including the physical sequence map. As genome sequence data become available for an organism, NCBI staff work to provide the data as a reference sequence (RefSeq) for display in the Map Viewer. We have developed several protocols to reach this goal and rely heavily on collaboration with genome-specific research groups whenever possible. NCBI provides various levels of computation, analysis, and curation as needed per organism. For instance, the mouse genome is assembled by an external group and annotated via the NCBI annotation pipeline. For the human genome, NCBI computes the assembly in collaboration with the international sequencing consortium; NCBI and other external groups independently provide annotation on the assembled genome. For other genomes, such as Drosophila melanogaster, the NCBI RefSeqs represent the assembly and annotation as provided by the fly sequencing consortium.

NCBI is providing reference sequence (RefSeq) records that represent assemblies of genomic sequence data and the corresponding RNA and protein sequences. The NCBI annotation pipeline annotates the genomic RefSeq data with features such as genes, RNAs, proteins, variation (SNPs), STS markers, and FISH mapped clones. All sequences (genomic, RNAs, proteins) are available for customized BLAST searches. BLAST results, as well as the sequence features, are readily displayed on NCBI's Map Viewer.

This document describes the processes used to:

  1. assemble the sequence data
  2. annotate features
  3. provide a dataset of assembled genomic sequence, RNAs, and proteins

These processes are complex and will continue to be refined; therefore, regions that are not fully represented should improve with subsequent builds.


Genomic Sequence Assembly and Layout

The procedure below describes the assembly of the human genome based on curated Tiling Path Files (TPFs) that are provided by the international sequencing consortium, and HTGS sequences found in GenBank. For each chromosome, the TPF data indicate clone order, clone overlap information, and the location of gaps.

Currently, NCBI is annotating assembled mouse and rat genomic sequence; additional information about these genome assemblies is available here:



Preparing Input Data

The input data include both finished and draft genomic sequence data available in GenBank.

The sequence data are first screened for contaminating sequences. Sequences are compared against a collection of sequences for common contaminants (including the UniVec database, the E. coli genome, bacterial transposons, and bacteriophage) using MegaBLAST (1). Sequence is then masked for repeats using RepeatMasker (2) (repeat sequence is converted to lowercase) and blasted against the genomes of other completely sequenced organisms (including S. cerevisiae, D. melanogaster, and C. elegans). In addition, the sequence data are screened for mouse repeats using RepeatMasker (2) and for mouse STSs using e-PCR (3). Any clone containing sequence from another organism is entirely removed from the input data set.

The prepared sequence data, in which repetitive and contaminating regions have been masked, are then broken into fragments at gap positions.
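The fragmenting step can be illustrated with a minimal sketch. It assumes that gaps appear in the clone sequence as runs of "N" or "n" and that repeat-masked sequence is lowercase a/c/g/t; the function name `split_at_gaps` is hypothetical and is not part of the NCBI pipeline.

```python
import re


def split_at_gaps(sequence):
    """Return (start_offset, fragment) pairs for each gap-free stretch.

    Assumption: a gap is any run of 'N' or 'n'. Lowercase a/c/g/t is
    treated as repeat-masked sequence and is kept inside the fragment.
    """
    fragments = []
    for match in re.finditer(r"[^Nn]+", sequence):
        fragments.append((match.start(), match.group()))
    return fragments


if __name__ == "__main__":
    for offset, fragment in split_at_gaps("ACGTnnnnacgtNNGG"):
        print(offset, fragment)
```

Keeping the start offset with each fragment makes it possible to map later overlap results back to the original clone coordinates.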



Sequence Layout

Genomic contigs are constructed in two steps: a clone layout stage, followed by a sequence building stage. The layout stage determines which BACs belong to the same contig. The sequence building stage assembles the clone sequences that belong to the same contig.

The clone layout is generated using clone order information provided in the TPFs and sequence overlap information, in conjunction with the BAC chromosome assignment. BAC clones that are not listed on the TPFs are included in the assembly when they provide an extension into a gap; this placement is determined when sequence overlaps between fragments are identified using MegaBLAST (1). Approximately 95% of the BACs are assigned to a chromosome based on one or more of the following criteria: 1) inclusion in a TPF (TPFs are chromosome specific); 2) annotation on the submitted GenBank record; 3) presence of at least 3 STS markers, identified by e-PCR (3), that have themselves been mapped to the same chromosome; or 4) personal communication from a sequencing laboratory.
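As an illustration of the STS-based criterion only, the sketch below assigns a BAC to a chromosome when at least 3 of its STS markers map to the same chromosome. The input format, the function name, and the decision to leave a BAC unassigned when two chromosomes both reach the threshold are assumptions made for this example; the other criteria (TPF membership, GenBank annotation, personal communication) are not modeled.

```python
from collections import Counter

MIN_STS_MARKERS = 3  # threshold stated in the text


def assign_chromosome_by_sts(marker_chromosomes):
    """Pick a chromosome for a BAC from the chromosomes of its STS hits.

    marker_chromosomes: one chromosome label per STS marker found on the
    BAC (hypothetical input, e.g. ["7", "7", "7", "12"]).
    Returns the chromosome label, or None if no single chromosome is
    supported by at least MIN_STS_MARKERS markers.
    """
    counts = Counter(marker_chromosomes)
    candidates = [chrom for chrom, n in counts.items() if n >= MIN_STS_MARKERS]
    # Assumption: conflicting evidence leaves the BAC unassigned.
    return candidates[0] if len(candidates) == 1 else None


if __name__ == "__main__":
    print(assign_chromosome_by_sts(["7", "7", "7", "12"]))
```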

The layout stage identifies, and removes from subsequent processing, conflicting sequence overlaps and redundant BACs. Overlapping fragments of BACs placed on two different chromosomes are not considered for the layout. In addition, redundant BACs that are fully contained within another longer BAC sequence are filtered out and not used in downstream processing.
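The redundancy filter can be sketched as an interval containment check. This is a simplification: it assumes each BAC has already been given hypothetical start and end coordinates in the layout, whereas the pipeline works from sequence overlaps. Exact duplicates are also dropped here, which is an assumption beyond the "longer BAC" wording above.

```python
def drop_contained(clones):
    """Return the names of clones not contained within another clone.

    clones: dict mapping clone name -> (start, end) layout coordinates
    (hypothetical representation for this sketch).
    """
    # Sort by start ascending, then end descending, so that any clone
    # that could contain a later one is visited first.
    ordered = sorted(clones.items(), key=lambda item: (item[1][0], -item[1][1]))
    kept = []
    max_end = None
    for name, (start, end) in ordered:
        if max_end is not None and end <= max_end:
            continue  # fully inside an earlier clone that starts no later
        kept.append(name)
        max_end = end
    return kept


if __name__ == "__main__":
    print(drop_contained({"A": (0, 100), "B": (10, 50), "C": (80, 150)}))
```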

Sequence Assembly

The sequence building stage considers all possible fragment-to-fragment overlaps for every pair of BACs that overlap in the contig layout. Overlapping sequences are then merged together to form a single contiguous stretch called a meld. A contig may have several melds if most of the BACs are still of draft quality. Melds are ordered and oriented based on information derived from ESTs, mRNAs, paired plasmid reads, and order and orientation information submitted by the groups that submitted the BAC sequences to GenBank. The gap between consecutive melds on the contig has been arbitrarily set to 100 bases (represented as "n").
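The final joining step can be sketched as follows, assuming the melds have already been ordered and each has been given a strand. The 100-base "n" gap comes from the text; the function names and the (sequence, strand) input format are hypothetical.

```python
MELD_GAP = "n" * 100  # arbitrary gap length stated in the text

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string, preserving case (masking)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_contig(oriented_melds):
    """Join ordered melds into one contig sequence.

    oriented_melds: list of (sequence, strand) pairs in contig order,
    where strand is "+" or "-" (hypothetical input format).
    """
    parts = [
        seq if strand == "+" else reverse_complement(seq)
        for seq, strand in oriented_melds
    ]
    return MELD_GAP.join(parts)


if __name__ == "__main__":
    contig = build_contig([("AACG", "+"), ("AACG", "-")])
    print(len(contig))
```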

Contig Scaffold

Contigs are placed and oriented on a chromosome using the TPF data, sequence overlaps with mapped STS markers, and paired BAC-end sequences.

Feature Annotation

The annotation process identifies sequence features on the contigs such as variation, sequence tagged sites, FISH-mapped clone regions, known and predicted genes, and gene models. This stage provides contig, RNA, and protein records with added feature annotation.

Statistics on the number of contigs provided, the number of features annotated, and the number of elements on other maps in the Map Viewer resource are available here: Human Genome.

Clone Features

Human FISH-mapped clones (5, 6) are annotated on the human genome by aligning their sequence tags on the contigs using MegaBLAST (1) and e-PCR (3) analysis. Sequence tags are in the form of either GenBank Accession numbers from the draft or finished clone insert sequence, GenBank Accession numbers of BAC-end sequences, or STS markers determined by PCR and hybridization experiments.

Currently, we only annotate human clones that have been mapped by fluorescence in situ hybridization (FISH) by the Human BAC Resource Consortium. These data provide a means to determine the correspondence between the sequence and the cytogenetic coordinate systems.

Mouse BAC clones (7) are annotated by aligning their BAC-end sequence to the assembly using MegaBLAST (1).

STS Features

Electronic PCR (ePCR) (3) is used to place STS primer pairs, stored in UniSTS, on the contigs by looking for consistency between the determined product size and the reported size.
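The size-consistency test can be sketched as below. The text does not state how much the determined product size may differ from the reported size, so the `margin` parameter and its default of 50 bases are assumptions for illustration, as are the function name and the coordinate inputs.

```python
def sts_hit_is_consistent(forward_start, reverse_end, reported_size, margin=50):
    """Check whether a primer-pair hit matches the reported STS size.

    forward_start, reverse_end: 1-based contig coordinates of the outer
    ends of the two primer matches (hypothetical inputs).
    margin: allowed size difference in bases (assumed value, not taken
    from the source).
    """
    observed_size = reverse_end - forward_start + 1
    return abs(observed_size - reported_size) <= margin


if __name__ == "__main__":
    print(sts_hit_is_consistent(1001, 1200, reported_size=200))
```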

SNP Features

Variations in dbSNP are mapped to the Genome Assembly by BLAST homology. Hits are recorded as high confidence if 95% of the flanking sequence is returned in the alignment with 0-6 mismatches. If no high confidence hits are observed, hits are recorded as low confidence if 75% of the flanking sequence is returned in the alignment with < 3% mismatches.
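These thresholds can be expressed as a small sketch. It assumes that "percent of the flanking sequence returned" means aligned length divided by flank length, and that the mismatch percentage is taken over the aligned length; both readings, and all function names, are assumptions rather than the dbSNP implementation.

```python
def grade_hit(flank_len, aligned_len, mismatches):
    """Grade a single BLAST hit as "high", "low", or None."""
    if flank_len <= 0 or aligned_len <= 0:
        return None
    coverage = aligned_len / flank_len
    if coverage >= 0.95 and mismatches <= 6:
        return "high"
    if coverage >= 0.75 and mismatches / aligned_len < 0.03:
        return "low"
    return None


def select_hits(hits):
    """Keep high-confidence hits; fall back to low-confidence hits
    only when no high-confidence hit exists.

    hits: list of (flank_len, aligned_len, mismatches) tuples.
    Returns (confidence_label, selected_hits).
    """
    graded = [(grade_hit(*hit), hit) for hit in hits]
    high = [hit for grade, hit in graded if grade == "high"]
    if high:
        return "high", high
    low = [hit for grade, hit in graded if grade == "low"]
    return ("low", low) if low else (None, [])


if __name__ == "__main__":
    print(select_hits([(100, 96, 2), (100, 80, 2)]))
```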

Variation annotation in the Map Viewer reports overall mapping quality as the number of chromosomes hit, number of contigs hit, and total number of hits to genome. SNPs with ambiguous map positions are annotated with a warning when Variation map is the master map. Complete mapping information is available from both the dbSNP web site and FTP site.

Gene, mRNA, misc_RNA, and Protein Features

Genes are annotated using both (i) RefSeq transcript alignments and (ii) Gnomon prediction in those regions not covered by RefSeq alignments. The annotation includes coding transcripts, pseudogenes, and non-coding transcripts, which are represented as "misc_RNA" features.

RefSeq transcript alignments:

A first set of known genes (and their corresponding transcripts and proteins) is identified by aligning reference sequences (RefSeq) to the assembled genomic sequence using MegaBLAST (1) and assembling the hits according to limited constraints and heuristics regarding exon structure. Transcript models are reconstructed by attempting to settle disagreements between individual sequence alignments without using an a priori model (such as codon usage, initiation, or polyA signals). Although such a model is not used, information generated during a build (including predictions from Gnomon) is used to improve the RefSeqs themselves.

Alternate RefSeq models derived from the available sequence data are grouped under the same gene when they share one or more exons on the same strand.
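The grouping rule can be sketched with a union-find structure. The sketch assumes that "sharing an exon" means identical exon start and end coordinates on the same strand; the pipeline's actual definition may be looser, and the input format is hypothetical.

```python
def group_transcripts(transcripts):
    """Group transcript models that share an exon on the same strand.

    transcripts: dict mapping name -> (strand, set of (start, end) exons).
    Returns a list of sets of transcript names, one set per gene.
    """
    parent = {name: name for name in transcripts}

    def find(name):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_seen = {}  # (strand, exon) -> first transcript carrying it
    for name, (strand, exons) in transcripts.items():
        for exon in exons:
            key = (strand, exon)
            if key in first_seen:
                parent[find(name)] = find(first_seen[key])
            else:
                first_seen[key] = name

    groups = {}
    for name in transcripts:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


if __name__ == "__main__":
    models = {
        "t1": ("+", {(100, 200), (300, 400)}),
        "t2": ("+", {(300, 400), (500, 600)}),
        "t3": ("-", {(300, 400)}),
    }
    print(group_transcripts(models))
```

Grouping is transitive in this sketch: two transcripts with no exon in common still end up under the same gene if a third transcript shares an exon with each of them.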

If the defining RefSeq sequence aligns to more than one location on the genome, the best alignment is selected and annotated on the contig. If the alignments are of equal quality, both are annotated. Genes (and corresponding transcript and protein features) are annotated on the contig if the defining transcript alignment has >=95% identity and the aligned region covers >=50% of its length, or at least 1000 bases.
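The annotation threshold can be written as a one-line predicate. The sentence above can be parsed in more than one way; this sketch assumes the reading "identity of at least 95%, and either at least 50% of the transcript aligned or at least 1000 bases aligned". That reading, and the function and parameter names, are assumptions.

```python
def passes_annotation_threshold(identity, aligned_bases, transcript_length):
    """Decide whether a defining transcript alignment is annotated.

    identity: fraction of identical bases in the alignment (0.0 to 1.0).
    aligned_bases: number of transcript bases in the aligned region.
    transcript_length: full length of the defining transcript.
    """
    covers_enough = (
        aligned_bases / transcript_length >= 0.50 or aligned_bases >= 1000
    )
    return identity >= 0.95 and covers_enough


if __name__ == "__main__":
    print(passes_annotation_threshold(0.97, 1200, 4000))
```

Under this reading, a 4000-base transcript with 1200 aligned bases at 97% identity is annotated through the 1000-base clause even though only 30% of it aligns.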

Gnomon prediction:

Once the RefSeqs are placed on the genome, the remainder of the supporting information includes other mRNAs, ESTs, and information on protein homologies generated from comparisons of translated regions.

Additional GenBank mRNAs and ESTs are aligned to the assembled genomic sequence, and together with the RefSeq alignments, are chained together to merge alignments based on shared splice sites. A set of optimal self-consistent, non-overlapping transcript alignments is chosen from each regional cluster of these chained transcript alignments, using metrics of coding propensity, splice score, and protein alignments via BLASTX against filtered NR proteins (those with CDD hits or hits in distant organisms).

Transcript models are generated via a Hidden Markov Model (HMM) using transcript alignment constraints and protein hit information if available. The model allows nonconsensus splices existing in the transcript alignment, makes deletions/insertions in the sequence to compensate for the frameshifts found in the protein alignments, and suppresses stop codons found in "exons" of protein alignments. Note that models generated with frameshifts and suppressed stop codons are strong candidates for pseudogenes.

The HMM will continue through regions without constraint information and create ab initio models. These are aligned via BLASTP against filtered NR proteins and an optimal self-consistent set of protein hits is chosen based on total score. The HMM is re-run with constraints based on these protein hits. This produces the final set of Gnomon gene models; the contig annotation includes only the subset not overlapping the RefSeq-based models.
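The final selection step, keeping only Gnomon models that do not overlap RefSeq-based models, can be sketched as an interval filter. The sketch assumes both model sets are given as closed (start, end) spans on the same contig and ignores strand; whether the pipeline treats overlap in a strand-specific way is not stated in the text.

```python
def non_overlapping_models(gnomon_models, refseq_models):
    """Return the Gnomon spans that overlap no RefSeq-based span.

    Both arguments are lists of (start, end) closed intervals on one
    contig (hypothetical representation for this sketch).
    """

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    return [
        model
        for model in gnomon_models
        if not any(overlaps(model, refseq) for refseq in refseq_models)
    ]


if __name__ == "__main__":
    print(non_overlapping_models([(100, 200), (500, 600)], [(150, 300)]))
```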

End Products

The NCBI Genome Annotation project provides sequences and resource support via AceView, LocusLink, and Map Viewer.

Sequence Data

A comprehensive set of RefSeq records is provided on the FTP site. Multiple mRNA and protein RefSeqs are provided for genes when the supporting RefSeq, GenBank mRNA, and EST data support alternative splicing. Transcripts are also instantiated for some non-protein-coding genes; these records represent transcribed pseudogenes.

See the RefSeq documentation for a complete list of accession prefixes. Accessions that begin with the prefix XM_ (mRNA), XR_ (non-coding transcript), and XP_ (protein) are model reference sequences produced by NCBI's Genome Annotation project. These records represent the transcripts and proteins that are annotated on the NCBI Contigs (prefix NT_ or NW_), which may have been generated from incomplete data. Because the XM_, XR_, and XP_ accessions reflect the current state of NCBI's assembly of the genomic sequence, they may be different from GenBank submissions for mRNAs and/or the curated RefSeq records. These differences may reflect real sequence variation (polymorphism), errors in GenBank accessions used as sources for unreviewed (provisional) RefSeq records, or errors or gaps in the available genomic sequence. These sequences should be used with caution, after comparing any XM_ or XP_ accession to other available sequence information (check BLink, LocusLink, or related sequences).
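When processing downloaded records, model sequences can be flagged by accession prefix. The sketch below covers only the prefixes named in this document (XM_, XR_, XP_, NT_, NW_, NM_, NP_); the function names and example accession numbers are hypothetical, and the RefSeq documentation remains the authoritative prefix list.

```python
# Prefixes and meanings as described in this document only.
PREFIX_MEANING = {
    "XM_": "model mRNA",
    "XR_": "model non-coding transcript",
    "XP_": "model protein",
    "NT_": "genomic contig",
    "NW_": "genomic contig",
    "NM_": "mRNA RefSeq",
    "NP_": "protein RefSeq",
}

MODEL_PREFIXES = ("XM_", "XR_", "XP_")


def describe_accession(accession):
    """Return the record type for an accession, or "unknown"."""
    return PREFIX_MEANING.get(accession[:3].upper(), "unknown")


def is_model_record(accession):
    """True for records produced by the genome annotation process."""
    return accession.upper().startswith(MODEL_PREFIXES)


if __name__ == "__main__":
    for acc in ("XM_000001", "NM_000002", "NT_000003"):
        print(acc, describe_accession(acc), is_model_record(acc))
```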

Resource Support

BLAST - see below

dbSNP provides information about sequence variation including map location, alleles, frequency data, genotype data, and functional data. Report pages include links to LocusLink, UniSTS, GenBank, PubMed, the NCBI Map Viewer, and external submitter web sites.

Human Genome Resources and Mouse Genome Resources pages provide a central point of access to information about sequencing progress, NCBI resources, NIH resources, and meeting and press releases.

LocusLink includes report pages for all genes defined by the genome annotation process. Every effort is made to provide associations to known genes; additional pages anchored on an InterimID are provided for new genes or those that cannot be unambiguously associated with a known gene.

Map Viewer presents a graphical view of the available genome sequence data as well as non-sequence map data such as cytogenetic, genetic, physical, or radiation hybrid maps (the type and number of maps available vary by organism). The Map Viewer provides a robust query interface and interactive displays. Additional information on using the resource and on organism-specific maps including those for human and mouse is available. The Map Viewer displays may also include links to view supporting evidence (Evidence Viewer, Model Maker).

Entrez Graphical Sequence View provides a graphical overview of the GenBank Flat File plus a section of the sequence data. Annotated features are indicated on both the graphic and the sequence. The interface provides both zoom and scroll capability. This view is available for all sequence records by selecting "Graphics" from the "Display" menu; links to this graphical sequence viewer are provided from LocusLink and the Map Viewer (look for "sv" links). This display is particularly useful for viewing genes and other features annotated on the contigs. See Example.

RefSeq provides a non-redundant database of sequences including genomic, transcript, and protein. RefSeq transcripts are used as a reagent for genome annotation.

UniSTS provides STS marker reports on primer sequences, product size, mapping information, GenBank and RefSeq records that contain the primer sequences (as determined by Electronic PCR), and provides links to relevant resources.

Data Access

RefSeq contigs, model transcripts, and model proteins are fully integrated into the main NCBI resources. Thus, they can be retrieved with Entrez queries, accessed via a customized BLAST page, and are included in protein "BLink" pages, which display the results of pre-computed BLAST searches.

BLAST

The organism-specific BLAST pages (for example, Human, Mouse) provide an interface to BLAST an Accession number or FASTA-formatted sequence against the assembled genomic sequence data, as well as the RefSeq transcripts and proteins annotated on the genome.

Entrez Retrieval

The RefSeq contigs, transcripts, and proteins are retrievable with standard Entrez queries such as an Accession number, gene symbol, or protein name. You can also use the Limits settings, or make use of Entrez "properties" restrictions to further restrict the query.

See the RefSeq web site for Entrez query tips

FTP

The genomes FTP site holds data generated by analysis and/or additional processing at NCBI. This site includes sequences generated by the genome build and annotation effort (contig, transcript and protein sequences) as well as Map Viewer data files. Please see the provided README files for further information. This site will be expanded to include additional data such as the curated RefSeq sequences and LocusLink. Currently, mRNA and protein RefSeq (the NM_###### and NP_######) and LocusLink data updates are provided in the pre-existing FTP directory.

References

1. MegaBLAST
Also see:
Zhang Z, Schwartz S, Wagner L, Miller W.
A greedy algorithm for aligning DNA sequences.
J Comput Biol. 2000 Feb-Apr;7(1-2):203-14.
PMID: 10890397

2. RepeatMasker
Smit AFA, Green P.

3. ePCR
Schuler GD.
Sequence mapping by electronic PCR.
Genome Res. 1997 May;7(5):541-50.
PMID: 9149949

4. GenomeScan
Yeh RF, Lim LP, Burge CB.
Computational inference of homologous gene structures in the human genome.
Genome Res. 2001 May;11(5):803-16.
PMID: 11337476

5. FISH-mapped clones
The BAC Resource Consortium.
Integration of cytogenetic landmarks into the draft sequence of the human genome.
Nature. 2001 Feb 15;409(6822):953-8.
PMID: 11237021

6. FISH-mapped clones
Kirsch IR, Green ED, Yonescu R, Strausberg RL, Carter N, Braden VV, Hilgenfeld E, Schuler G, Lash AE, Shen GL, Martelli M, Kuehl WM, Klausner RD, Ried T.
A systematic, high-resolution linkage of the cytogenetic and physical maps of the human genome.
Nat Genet. 2000 Apr;24(4):339-40.
PMID: 10742091

7. Mouse BAC end sequences
Zhao S, Shatsman S, Ayodeji B, Geer K, Tsegaye G, Krol M, Gebregeorgis E, Shvartsbeyn A, Russell D, Overton L, Jiang L, Dimitrov G, Tran K, Shetty J, Malek JA, Feldblyum T, Nierman WC, Fraser CM.
Mouse BAC ends quality assessment and sequence analyses.
Genome Res. 2001 Oct;11(10):1736-45.
PMID: 11591651

Glossary

For more definitions, see:



Term: Definition

BAC-end sequence: The ends of a Bacterial Artificial Chromosome (BAC) have been sequenced and submitted to GenBank; the internal BAC sequence may not be available. When both end-sequences from the same BAC are available, this information can be used to order contigs into scaffolds.

contig: A set of overlapping clones or sequences from which a sequence can be obtained. NCBI Contig records represent contiguous sequence constructed from many clone sequences. These records may include draft and finished sequence and may contain sequence gaps (within a clone) or gaps between clones when the gap is spanned by another clone that is not sequenced.

draft sequence: At least 3-4x of the estimated clone insert is covered in Phred Q20 bases in the shotgun sequencing stage, as defined for the human genome sequencing project. Note that the exact definition of 'draft' may be different for other genome projects. Clone sequence may contain several pieces of sequence, separated by gaps. The true order and orientation of these pieces may not be known.

finished sequence: The clone insert is contiguously sequenced to a high-quality standard, with an error rate of 0.01%. There are usually no gaps in the sequence.

fragment: A contiguous stretch of sequence within a clone sequence that doesn't contain a gap, vector, or other contaminating sequence.

meld: When two or more fragments overlap in the entire alignable region, these sequences are merged together to make a single longer sequence.

order & orientation: Sequence overlap information is used to order and orient (ONO) fragments within a large clone sequence.

scaffold: Ordered set of contigs placed on the chromosome.



Revised November 6, 2003
