NCBI Genomic Sequence Assembly and Annotation Process
The NCBI Map Viewer facilitates access to, and query and display of, mapping data, including the physical sequence map. As genome sequence data become available for an organism, NCBI staff work to provide the data as a reference sequence (RefSeq) for display in the Map Viewer. We have developed several protocols to reach this goal and rely heavily on collaboration with genome-specific research groups whenever possible. NCBI provides various levels of computation, analysis, and curation as needed per organism. For instance, the mouse genome is assembled by an external group and annotated via the NCBI annotation pipeline. For the human genome, NCBI computes the assembly in collaboration with the international sequencing consortium; NCBI and other external groups independently provide annotation on the assembled genome. For other genomes, such as Drosophila melanogaster, the NCBI RefSeqs represent the assembly and annotation as provided by the fly sequencing consortium.
NCBI provides RefSeq records that represent assemblies of genomic sequence data and the corresponding RNA and protein sequences. The NCBI annotation pipeline annotates the genomic RefSeq data with features such as genes, RNAs, proteins, variation (SNPs), STS markers, and FISH-mapped clones. All sequences (genomic, RNA, and protein) are available for customized BLAST searches. BLAST results, as well as the sequence features, can be displayed in NCBI's Map Viewer.
This document describes the processes used to assemble genomic sequence, annotate the assembled sequence, and make the resulting data available.
These processes are complex and will continue to be refined; therefore, regions that are not fully represented should improve with subsequent builds.
The procedure below describes the assembly of the human genome based on curated Tiling Path Files (TPFs), which are provided by the international sequencing consortium, and on HTGS sequences found in GenBank. The TPF data indicate the clone order per chromosome, clone overlap information, and the location of gaps. Currently, NCBI is annotating assembled mouse and rat genomic sequence; additional information about these genome assemblies is available here:
The input data include both finished and draft genomic sequence data available in GenBank. The sequence data are first screened for contaminating sequences. Sequences are compared against a collection of sequences for common contaminants (including the UniVec database, the E. coli genome, bacterial transposons, and bacteriophage) using MegaBLAST (1). Sequence is then masked for repeats using RepeatMasker (2) (repeat sequence is converted to lowercase) and blasted against the genomes of other completely sequenced organisms (including S. cerevisiae, D. melanogaster, and C. elegans). In addition, the sequence data are screened for mouse repeats using RepeatMasker (2) and for mouse STSs using e-PCR (3). Any clone containing sequence from another organism is entirely removed from the input data set. The prepared sequence data, in which repetitive and contaminating regions have been masked, are then broken into fragments at gap positions.
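The last step above, breaking the prepared sequence into fragments at gap positions, can be sketched as follows. This is a minimal illustration only: it assumes that gaps appear as runs of `N` or `n`, and the function name `split_at_gaps` and the `min_gap` threshold are hypothetical, not part of the NCBI pipeline.

```python
import re


def split_at_gaps(seq, min_gap=10):
    """Split a sequence into (start, fragment) pairs at runs of N.

    min_gap is an assumed threshold: shorter runs of N are kept inside a
    fragment as ambiguous bases instead of being treated as gaps.
    Lowercase (repeat-masked) bases are preserved unchanged.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    fragments = []
    pos = 0
    for match in gap.finditer(seq):
        if match.start() > pos:
            fragments.append((pos, seq[pos:match.start()]))
        pos = match.end()
    if pos < len(seq):
        fragments.append((pos, seq[pos:]))
    return fragments
```

Keeping the start offset with each fragment lets later overlap results be mapped back to coordinates on the original clone sequence.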
Genomic contigs are constructed in two steps: a clone layout stage, followed by a sequence building stage. The layout stage determines which BACs belong to the same contig. The sequence building stage assembles the clone sequences that belong to the same contig. The clone layout is generated using clone order information provided in the TPFs and sequence overlap information, in conjunction with the BAC chromosome assignment. BAC clones that are not listed on the TPFs are included in the assembly when they provide an extension into a gap; this placement is determined when sequence overlaps between fragments are identified using MegaBLAST (1). Approximately 95% of the BACs are assigned to a chromosome based on: 1) inclusion in a TPF (TPFs are chromosome specific); 2) annotation on the submitted GenBank record; 3) presence of at least 3 STS markers, identified by e-PCR (3), that have themselves been mapped to the same chromosome; or 4) personal communication from a sequencing laboratory. The layout stage identifies conflicting sequence overlaps and redundant BACs and removes them from subsequent processing. Overlapping fragments of BACs placed on two different chromosomes are not considered for the layout. In addition, redundant BACs that are fully contained within another, longer BAC sequence are filtered out and not used in downstream processing.
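As an illustration of the third assignment criterion (at least 3 STS markers mapped to the same chromosome), here is a minimal sketch. The function name, the input format (one chromosome label per STS hit on the BAC), and the decision to treat ties as ambiguous are assumptions made for this example, not documented pipeline behavior.

```python
from collections import Counter


def assign_chromosome_by_sts(marker_chromosomes, min_markers=3):
    """Return the chromosome supported by at least min_markers STS hits.

    marker_chromosomes holds one chromosome label per STS marker found
    on the BAC. Returns None when support is insufficient or tied.
    """
    ranked = Counter(marker_chromosomes).most_common()
    if not ranked:
        return None
    best_chrom, best_count = ranked[0]
    if best_count < min_markers:
        return None
    # Assumption: equal support for two chromosomes is left unassigned.
    if len(ranked) > 1 and ranked[1][1] == best_count:
        return None
    return best_chrom
```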
The sequence building stage considers all possible fragment-to-fragment overlaps for every pair of BACs that overlap in the contig layout. Overlapping sequences are then merged together to form a single contiguous stretch called a meld. A contig may have several melds if most of the BACs are still of draft quality. Melds are ordered and oriented based on information derived from ESTs, mRNAs, paired plasmid reads, and order and orientation information submitted by the groups that submitted the BAC sequences to GenBank. The gap between consecutive melds on the contig has been arbitrarily set to 100 bases (represented as "n").
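The way ordered and oriented melds are joined into a contig sequence can be sketched as below. The 100-base gap of "n" comes from the text; the function names and the `(sequence, strand)` input format are hypothetical.

```python
GAP_LENGTH = 100  # arbitrary gap size between consecutive melds, per the text

_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Reverse-complement a DNA string, preserving case."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_contig(melds):
    """Join melds into one contig sequence.

    melds is a list of (sequence, strand) pairs already in contig order;
    strand is '+' or '-'. Minus-strand melds are reverse-complemented
    before joining.
    """
    oriented = [
        seq if strand == "+" else reverse_complement(seq)
        for seq, strand in melds
    ]
    return ("n" * GAP_LENGTH).join(oriented)
```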
Contigs are placed and oriented on a chromosome using the TPF data, sequence
overlaps with mapped STS markers, and paired BAC-end sequences.
The annotation process identifies sequence features on the
contigs such as variation, sequence tagged sites, FISH-mapped clone
regions, known and predicted genes, and gene models. This stage provides
contig, RNA, and protein records with added feature annotation.
Human FISH-mapped clones (5, 6) are annotated on the human genome by aligning their sequence tags on the contigs using MegaBLAST (1) and e-PCR (3) analysis. Sequence tags take one of three forms: GenBank Accession numbers from the draft or finished clone insert sequence, GenBank Accession numbers of BAC-end sequences, or STS markers determined by PCR and hybridization experiments. Currently, we only annotate human clones that have been mapped by fluorescence in situ hybridization (FISH) by the BAC Resource Consortium. These data provide a means to determine the correspondence between the sequence and the cytogenetic coordinate systems. Mouse BAC clones (7) are annotated by aligning their BAC-end sequences to the assembly using MegaBLAST (1).
Electronic PCR (e-PCR) (3) is used to place STS primer pairs, stored in UniSTS,
on the contigs by looking for consistency between the determined product size
and the reported size.
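The size-consistency check can be illustrated with a small sketch. This is not the e-PCR program itself: it uses exact string matching for the primers, which is a simplification, and the function name `find_sts` and the `margin` parameter are assumptions made for the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sts(contig, fwd_primer, rev_primer, reported_size, margin=50):
    """Return (start, product_size) for each consistent primer-pair hit.

    A hit is a forward primer followed on the same strand by the reverse
    complement of the reverse primer, where the implied product size is
    within `margin` bases of the reported STS size.
    """
    contig = contig.upper()
    fwd = fwd_primer.upper()
    rev_rc = reverse_complement(rev_primer.upper())
    hits = []
    start = contig.find(fwd)
    while start != -1:
        end = contig.find(rev_rc, start + len(fwd))
        while end != -1:
            size = end + len(rev_rc) - start
            if abs(size - reported_size) <= margin:
                hits.append((start, size))
            end = contig.find(rev_rc, end + 1)
        start = contig.find(fwd, start + 1)
    return hits
```

An STS that yields more than one hit, or hits on several contigs, would need further review; how the pipeline resolves such cases is not described here.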
Variations in dbSNP are mapped to the genome assembly by BLAST homology. Hits are recorded as high confidence if at least 95% of the flanking sequence is returned in the alignment with 0-6 mismatches. If no high confidence hits are observed, hits are recorded as low confidence if at least 75% of the flanking sequence is returned in the alignment with < 3% mismatches. Variation annotation in the Map Viewer reports overall mapping quality as the number of chromosomes hit, the number of contigs hit, and the total number of hits to the genome. SNPs with ambiguous map positions are annotated with a warning when the Variation map is the master map. Complete mapping information is available from both the dbSNP web site and FTP site.
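The two-tier confidence rule can be written out as a short sketch. The thresholds come from the text; the function names, the `(aligned_length, mismatches)` hit format, and the choice to compute the mismatch percentage relative to the aligned length are assumptions.

```python
def _is_high(flank_length, aligned_length, mismatches):
    # At least 95% of the flanking sequence aligned, with 0-6 mismatches.
    return aligned_length / flank_length >= 0.95 and mismatches <= 6


def _is_low(flank_length, aligned_length, mismatches):
    # At least 75% of the flanking sequence aligned, with < 3% mismatches.
    return (aligned_length / flank_length >= 0.75
            and mismatches / aligned_length < 0.03)


def classify_snp_hits(flank_length, hits):
    """Return (confidence, hits_kept) for one SNP.

    hits is a list of (aligned_length, mismatches) tuples. Low-confidence
    hits are reported only when no high-confidence hit exists.
    """
    high = [h for h in hits if _is_high(flank_length, *h)]
    if high:
        return "high", high
    low = [h for h in hits if _is_low(flank_length, *h)]
    if low:
        return "low", low
    return None, []
```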
Genes are annotated using both (i) RefSeq transcript alignments and (ii) Gnomon prediction in those regions not covered by RefSeq alignments. The annotation includes coding transcripts, pseudogenes, and non-coding transcripts, which are represented as "misc_RNA" features.
RefSeq transcript alignments: A first set of known genes (and their corresponding transcripts and proteins) is identified by aligning reference sequences (RefSeqs) to the assembled genomic sequence using MegaBLAST (1) and assembling the hits according to limited constraints and heuristics regarding exon structure. Transcript models are reconstructed by attempting to settle disagreements between individual sequence alignments without using an a priori model (such as codon usage, initiation, or polyA signals). Although such a model is not used, information generated during a build (including predictions from Gnomon) is used to improve the RefSeqs themselves. Alternate RefSeq models derived from the available sequence data are grouped under the same gene when they share one or more exons on the same strand. If the defining RefSeq sequence aligns to more than one location on the genome, the best alignment is selected and annotated on the contig. If the alignments are of equal quality, both are annotated. Genes (and corresponding transcript and protein features) are annotated on the contig if the defining transcript alignment has >=95% identity and the aligned region covers >=50% of the length, or at least 1000 bases.
Gnomon prediction: Once the RefSeqs are placed on the genome, the remainder of the supporting information includes other mRNAs, ESTs, and information on protein homologies generated from comparisons of translated regions. Additional GenBank mRNAs and ESTs are aligned to the assembled genomic sequence and, together with the RefSeq alignments, are chained together to merge alignments based on shared splice sites.
A set of optimal, self-consistent, non-overlapping transcript alignments is chosen from each regional cluster of these chained transcript alignments, using metrics of coding propensity, splice score, and protein alignments via BLASTX against filtered NR proteins (those with CDD hits or hits in distant organisms). Transcript models are generated via a Hidden Markov Model (HMM) using transcript alignment constraints and protein hit information if available. The model allows nonconsensus splices that exist in the transcript alignment, makes deletions/insertions in the sequence to compensate for the frameshifts found in the protein alignments, and suppresses stop codons found in "exons" of protein alignments. Note that models generated with frameshifts and suppressed stop codons are strong candidates for pseudogenes. The HMM will continue through regions without constraint information and create ab initio models. These are aligned via BLASTP against filtered NR proteins, and an optimal self-consistent set of protein hits is chosen based on total score. The HMM is re-run with constraints based on these protein hits. This produces the final set of Gnomon gene models; the contig annotation includes only the subset not overlapping the RefSeq-based models.
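Two of the selection rules described in this section lend themselves to a short sketch: the threshold for annotating a RefSeq transcript alignment, and the restriction of Gnomon models to those that do not overlap RefSeq-based models. The function names and data shapes are hypothetical. The threshold is read here as "identity >= 95% and (coverage >= 50% or aligned length >= 1000 bases)", which is one interpretation of the wording, and overlap is tested on contig coordinates without regard to strand, which is also an assumption.

```python
def passes_refseq_threshold(identity, aligned_bases, transcript_length):
    """Apply the annotation threshold to one transcript alignment.

    identity is a fraction (0.95 means 95%).
    """
    covers_enough = (aligned_bases >= 0.5 * transcript_length
                     or aligned_bases >= 1000)
    return identity >= 0.95 and covers_enough


def overlaps(a, b):
    """True if two half-open (start, end) intervals share any base."""
    return a[0] < b[1] and b[0] < a[1]


def gnomon_models_to_annotate(gnomon_models, refseq_models):
    """Keep only Gnomon models that overlap no RefSeq-based model.

    Both arguments are lists of (start, end) spans on the same contig.
    """
    return [
        g for g in gnomon_models
        if not any(overlaps(g, r) for r in refseq_models)
    ]
```

The pairwise overlap test is quadratic in the number of models; sorting the spans first would scale better, but the simple form keeps the rule easy to read.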
The NCBI Genome Annotation project provides sequences and resource support via AceView, LocusLink, and Map Viewer.
A comprehensive set of RefSeq records is provided on the FTP site. Multiple mRNA and protein RefSeqs are provided for a gene when the RefSeq, GenBank mRNA, and EST data support alternative splicing. Transcripts are also instantiated for some non-protein-coding genes. These records represent transcribed pseudogenes. See the RefSeq documentation for a complete list of accession prefixes. Accessions that begin with the prefix XM_ (mRNA), XR_ (non-coding transcript), or XP_ (protein) are model reference sequences produced by NCBI's Genome Annotation project. These records represent the transcripts and proteins that are annotated on the NCBI Contigs (prefix NT_ or NW_), which may have been generated from incomplete data. Because the XM_, XR_, and XP_ accessions reflect the current state of NCBI's assembly of the genomic sequence, they may differ from GenBank submissions for mRNAs and/or the curated RefSeq records. These differences may reflect real sequence variation (polymorphism), errors in GenBank accessions used as sources for unreviewed (provisional) RefSeq records, or errors or gaps in the available genomic sequence. These sequences should be used with caution, after comparing any XM_, XR_, or XP_ accession to other available sequence information (check BLink, LocusLink, or related sequences).
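When processing records in bulk, it can help to flag model sequences by accession prefix so that they are treated with the caution described above. The sketch below covers only the prefixes named in this document; the function name and the description strings are hypothetical, and the RefSeq documentation remains the authority for the full prefix list.

```python
# Prefixes named in this document only; not a complete RefSeq prefix list.
ACCESSION_PREFIXES = {
    "NT_": "genomic contig",
    "NW_": "genomic contig",
    "NM_": "mRNA RefSeq",
    "NP_": "protein RefSeq",
    "XM_": "model mRNA",
    "XR_": "model non-coding transcript",
    "XP_": "model protein",
}


def describe_accession(accession):
    """Return a description for a known prefix, or None if unrecognized."""
    return ACCESSION_PREFIXES.get(accession[:3].upper())


def is_model(accession):
    """True for XM_, XR_, and XP_ accessions (Genome Annotation models)."""
    return accession[:3].upper() in ("XM_", "XR_", "XP_")
```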
BLAST: see below.
RefSeq contigs, model transcripts, and model proteins are fully integrated into the main NCBI resources. They can be retrieved with Entrez queries and accessed via a customized BLAST page, and they are included in protein "BLink" pages, which display the results of pre-computed BLAST searches.
The organism-specific BLAST pages (for example, Human and Mouse) provide an interface to BLAST an Accession number or FASTA-formatted sequence against the assembled genomic sequence data, as well as against the RefSeq transcripts and proteins annotated on the genome.
The RefSeq contigs, transcripts, and proteins are retrievable
with standard Entrez
queries such as an Accession number, gene symbol, or protein name. You can
also use the Limits settings, or make use of Entrez "properties"
restrictions to further restrict the query.
The genomes FTP
site holds data generated by analysis and/or additional processing
at NCBI. This site includes sequences generated by the genome build and
annotation effort (contig, transcript and protein sequences) as
well as Map Viewer data files. Please see the provided README files for
further information. This site will be expanded to include additional data
such as the curated RefSeq sequences and LocusLink data. Currently, updates to
the mRNA and protein RefSeqs (NM_###### and NP_######) and to LocusLink data
are provided in the pre-existing FTP directory.
1. MegaBLAST. Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J Comput Biol 2000 Feb-Apr;7(1-2):203-14. PMID: 10890397
2. RepeatMasker. Smit AFA, Green P.
3. ePCR. Schuler GD. Sequence mapping by electronic PCR. Genome Res 1997 May;7(5):541-50. PMID: 9149949
4. GenomeScan. Yeh RF, Lim LP, Burge CB. Computational inference of homologous gene structures in the human genome. Genome Res 2001 May;11(5):803-16. PMID: 11337476
5. FISH-mapped clones. The BAC Resource Consortium. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature 2001 Feb 15;409(6822):953-8. PMID: 11237021
6. FISH-mapped clones. Kirsch IR, Green ED, Yonescu R, Strausberg RL, Carter N, Braden VV, Hilgenfeld E, Schuler G, Lash AE, Shen GL, Martelli M, Kuehl WM, Klausner RD, Ried T. A systematic, high-resolution linkage of the cytogenetic and physical maps of the human genome. Nat Genet 2000 Apr;24(4):339-40. PMID: 10742091
7. Mouse BAC end sequences. Zhao S, Shatsman S, Ayodeji B, Geer K, Tsegaye G, Krol M, Gebregeorgis E, Shvartsbeyn A, Russell D, Overton L, Jiang L, Dimitrov G, Tran K, Shetty J, Malek JA, Feldblyum T, Nierman WC, Fraser CM. Mouse BAC ends quality assessment and sequence analyses. Genome Res 2001 Oct;11(10):1736-45.