RefSeq Production Processes |
RefSeq records are derived from primary GenBank submissions; varying levels of validation, additional annotation, and manual curation are applied to the RefSeq record. NCBI Reference Sequences are provided through the separate processes described below.
This page provides a brief overview of the RefSeq production processes. Also see:
NCBI Handbook, RefSeq chapter
NCBI Handbook, LocusLink chapter
NCBI Handbook, Genome Annotation chapter
Genome Annotation Pipeline
LocusLink Pipeline
Collaboration |
The Entrez Genomes pipeline relies on collaboration to supply the RefSeq collection for some organisms, including Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. The collaborating group carries out the primary sequence-level review and may also provide additional functional information. This pipeline is automated: data are provided by a collaborating group and validated at NCBI to detect obvious errors (e.g., the annotated CDS location is not capable of encoding the provided protein) and to apply annotation in a more uniform way. NCBI staff do not carry out additional manual curation of these records. NCBI may update the records to correct a general format problem, but otherwise these records are updated only when the collaborating group provides an update.
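The kind of check mentioned above (whether an annotated CDS location can encode the provided protein) can be illustrated with a minimal Python sketch. This is not NCBI's validation code: the function names `translate` and `cds_can_encode`, the 0-based half-open coordinates, the single-interval forward-strand CDS, and the use of the standard genetic code are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organelle or alternative translation tables are needed for this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a nucleotide string codon by codon; unknown codons become 'X'."""
    cds = cds.upper()
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


def cds_can_encode(genomic, start, end, protein):
    """Hypothetical check: does genomic[start:end] translate to `protein`?

    Coordinates are 0-based and half-open, and the CDS is assumed to be a
    single interval on the forward strand (no introns, no reverse complement).
    """
    cds = genomic[start:end]
    if len(cds) == 0 or len(cds) % 3 != 0:
        return False  # a partial codon cannot encode a complete protein
    translated = translate(cds)
    if translated.endswith("*"):
        translated = translated[:-1]  # drop the terminal stop codon
    # An internal stop leaves a '*' in the string, so the comparison fails.
    return translated == protein


if __name__ == "__main__":
    genomic = "CCATGGCTAAATAAGG"  # made-up sequence for illustration
    print(cds_can_encode(genomic, 2, 14, "MAK"))
    print(cds_can_encode(genomic, 3, 15, "MAK"))
```

A real validator would also have to handle multi-exon locations, the minus strand, and organism-specific genetic codes; the sketch only shows the core comparison.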
The LocusLink pipeline also provides some records through collaborations. Collaborators may review the primary accession-to-gene associations, provide reviewed RefSeq sequence data, and/or provide functional descriptions. Additional collaborations with official nomenclature groups and model-organism databases provide a critical data source for the LocusLink/RefSeq pipeline.
RefSeq records that are contributed by collaboration have a REVIEWED status; the collaborating group is identified on the sequence record.
Genome Assembly & Annotation Pipeline |
NCBI provides annotation for some assembled genomic sequence data, including human, mouse, rat, honey bee, and others. This pipeline is automated, and the data are refreshed periodically. The model RefSeq records produced by this pipeline have a distinguishing accession prefix (XM, XR, XP), are derived from the genomic sequence, have varying levels of transcript or protein homology support, and are not subject to further manual curation.
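Because model records are distinguished by their accession prefix, downstream code can separate them from other RefSeq records with a simple prefix test. The sketch below is one way to do that in Python; the helper names `accession_prefix` and `is_model_record` are hypothetical, and the accession numbers in the usage example are placeholders rather than real records.

```python
# Prefixes this page lists for model records from the annotation pipeline.
MODEL_PREFIXES = {"XM", "XR", "XP"}


def accession_prefix(accession):
    """Return the part of a RefSeq accession before the underscore, upper-cased."""
    return accession.strip().upper().split("_", 1)[0]


def is_model_record(accession):
    """Hypothetical helper: True if the accession carries a model prefix."""
    return accession_prefix(accession) in MODEL_PREFIXES


if __name__ == "__main__":
    # Placeholder accessions, used only to exercise the prefix test.
    for acc in ["XM_000001.1", "XP_000002.1", "NG_000003.1"]:
        print(acc, is_model_record(acc))
```

The test looks only at the prefix, so it says nothing about whether an accession actually exists or what its current status is.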
Also see:
NCBI Handbook, Annotation chapter
Genome Assembly & Annotation Build Pipeline
LocusLink Associated Pipeline |
RefSeq records that are associated with LocusLink are produced through a pipeline that takes advantage of the additional descriptive information available in the LocusLink database. Multiple collaborations support the collection of this descriptive information.
This dataset consists primarily of transcripts and proteins. Some records representing genomic regions (accession prefix NG_) are provided specifically to support more comprehensive genome-level annotation.
A combined approach uses both collaborator-supplied sequence information and automated BLAST analysis to provide an initial RefSeq record. Records are validated to correct annotation errors and to provide annotation in a more consistent format. LocusLink descriptive information, including Official Nomenclature and additional citations, is applied to the records. These initial records have a PROVISIONAL or PREDICTED status.
Additional manual curation is applied to this set of RefSeq records to provide the optimal sequence record and to fix sequence errors, including mis-association with a locus (as might occur for closely related gene families), chimeric sequences, vector or linker contamination, and apparent sequencing errors. Both the nucleotide and protein sequence records may change as a result of this process. Sequence-level review is carried out primarily by NCBI staff, but some records are provided via collaboration. Additional annotation and functional information are applied, as available, during the sequence review process. Additional sequence records are also supplied to represent splice variants. These records have a REVIEWED status.
Since there is a strong manual curation component in this pipeline, input from the research community is especially welcome to further improve the quality of this dataset. The RefSeq records generated by this pipeline are used as a reagent in the genome assembly & annotation pipeline (see above).
Also see:
NCBI Handbook, RefSeq chapter
LocusLink Build Pipeline
Entrez Genomes Pipeline |
The Entrez Genomes pipeline provides RefSeq records for bacteria, viruses, organelles, plasmids, and other organisms including Saccharomyces cerevisiae, Arabidopsis thaliana, Plasmodium falciparum, and Leishmania major. Drosophila melanogaster and Caenorhabditis elegans RefSeq records are provided through collaboration, processing in the Entrez Genomes pipeline, and processing in the LocusLink supported pipeline. Records for additional species are added to the collection as sufficient sequence data become available. This pipeline relies on automatic computation, collaboration, and in-house expert analysis to provide records at several levels of curation. These RefSeq records undergo an initial automated validation process before being released. The validation step checks for data errors and provides consistent feature annotation. If more than one genomic sequence is available for the genome, one is selected for use as the RefSeq standard. This selection takes into account various factors, including level of annotation, strain information, and community input.
Also see:
Microbial Genomes
Organelles
Viral Genomes
Plant Genomes