
RefSeq Production Processes

RefSeq records are derived from primary GenBank submissions; varying levels of validation, additional annotation, and manual curation are applied to the RefSeq record. NCBI Reference Sequences are provided through the separate processes described below.

This page provides a brief overview of the RefSeq production processes. Also see:
NCBI Handbook, RefSeq chapter
NCBI Handbook, LocusLink chapter
NCBI Handbook, Genome Annotation chapter
Genome Annotation Pipeline
LocusLink Pipeline


Collaboration

The Entrez Genomes pipeline relies on collaboration to supply the RefSeq collection for some organisms including Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. The primary sequence level review is carried out by the collaborating group. Additional functional information may also be provided by the collaborating group. This pipeline is automated; data is provided by a collaborating group and validated at NCBI to detect obvious errors (e.g., the annotated CDS location is not capable of encoding the provided protein), and to apply annotation in a more uniform way. Additional manual curation of these records is not carried out by NCBI staff. NCBI may update the records to correct a general format problem, but otherwise these records are only updated when the collaborating group provides an update.
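
The "obvious error" named above, an annotated CDS location that cannot encode the provided protein, can be illustrated with a minimal sketch. This is not NCBI's validation code: the function names, the single forward-strand interval, the 1-based inclusive coordinates, and the use of the standard genetic code are all assumptions made for illustration.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate a nucleotide CDS; unknown codons become 'X'."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return "".join(
        CODON_TABLE.get(cds[pos:pos + 3], "X") for pos in range(0, len(cds), 3)
    )


def cds_encodes_protein(genomic: str, start: int, end: int, protein: str) -> bool:
    """Check whether genomic[start..end] (1-based, inclusive, forward strand;
    a simplifying assumption) translates to the provided protein."""
    try:
        translated = translate(genomic[start - 1:end])
    except ValueError:
        return False
    # Ignore a single terminal stop codon, then reject internal stops.
    if translated.endswith("*"):
        translated = translated[:-1]
    if "*" in translated:
        return False
    return translated == protein.upper().rstrip("*")
```

With a made-up 16-base sequence, `cds_encodes_protein("GGATGGCTTGGTAACC", 3, 14, "MAW")` accepts the annotation, while shifting the start by one base breaks the reading frame and the check fails. A real check would also need to handle multi-exon locations, the reverse strand, and alternative genetic codes.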

The LocusLink pipeline also provides some records through collaborations. Collaborators may review the primary accession-to-gene associations, provide reviewed RefSeq sequence data, and/or provide functional descriptions. Additional collaborations with official nomenclature groups and model-organism databases provide a critical data source for the LocusLink/RefSeq pipeline.

RefSeq records that are contributed by collaboration have a REVIEWED status; the collaborating group is identified on the sequence record.

Genome Assembly & Annotation Pipeline

NCBI provides annotation for some assembled genomic sequence data, including human, mouse, rat, honey bee, and others. This pipeline is automated and data is refreshed periodically. The model RefSeq records produced from this pipeline have a distinguishing accession prefix (XM, XR, XP), are derived from the genomic sequence, have varying levels of transcript or protein homology support, and are not subject to further manual curation.
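
Because model records are distinguished by accession prefix, they can be separated from other records with a simple prefix test. The sketch below is illustrative only: the function name is hypothetical, the molecule-type descriptions follow general RefSeq accession conventions and are not stated on this page, and the accessions in the usage note are made up.

```python
# Prefixes named above for model records from the genome annotation pipeline.
# The descriptions are an assumption based on general RefSeq conventions.
MODEL_PREFIXES = {
    "XM": "model mRNA",
    "XR": "model non-coding RNA",
    "XP": "model protein",
}


def is_model_record(accession: str) -> bool:
    """Return True if the accession carries a model-record prefix (XM_, XR_, XP_)."""
    prefix, separator, _ = accession.strip().upper().partition("_")
    # Require the underscore so that a bare "XM" is not treated as an accession.
    return bool(separator) and prefix in MODEL_PREFIXES
```

For example, `is_model_record("XM_000001.1")` is true and `is_model_record("NM_000001.1")` is false. The test looks only at the prefix and does not check that the accession exists.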

Also see:
NCBI Handbook, Annotation chapter
Genome Assembly & Annotation Build Pipeline

LocusLink Associated Pipeline

RefSeq records that are associated with LocusLink are produced through a pipeline that takes advantage of the additional descriptive information available in the LocusLink database. Multiple collaborations support the collection of this descriptive information.

This dataset consists primarily of transcripts and proteins. Some records representing genomic regions (accession prefix NG_) are provided specifically to support more comprehensive genome-level annotation.

A combined approach uses both collaborator-supplied sequence information and automated BLAST analysis to provide an initial RefSeq record. Records are subject to validation to correct annotation errors and provide annotation in a more consistent format. LocusLink descriptive information, including Official Nomenclature and additional citations, is applied to the records. These initial records have a PROVISIONAL or PREDICTED status.

Additional manual curation is applied to this set of RefSeq records to provide the optimal sequence record, and to fix sequence errors including mis-association with a locus (as might occur for closely related gene families), chimeric sequences, vector or linker contamination, or apparent sequencing errors. Both the nucleotide and protein sequence records may change due to this process. Sequence-level review is carried out primarily by NCBI staff, but some records are provided via collaboration. Additional annotation and functional information are applied, as available, during the sequence review process. Furthermore, additional sequence records are supplied to represent splice variants. These records have a REVIEWED status.

Since there is a strong manual curation component in this pipeline, input from the research community is especially welcome to further improve the quality of this dataset. The RefSeq records generated by this pipeline are used as a reagent in the genome assembly & annotation pipeline (see above).

Also see:
NCBI Handbook, RefSeq chapter
LocusLink Build Pipeline

Entrez Genomes Pipeline

The Entrez Genomes pipeline provides RefSeq records for bacteria, viruses, organelles, plasmids, and other organisms including Saccharomyces cerevisiae, Arabidopsis thaliana, Plasmodium falciparum, and Leishmania major. Drosophila melanogaster and Caenorhabditis elegans RefSeq records are provided through collaboration, processing in the Entrez Genomes pipeline, and processing in the LocusLink supported pipeline. Records for additional species are added to the collection as sufficient sequence data becomes available. This pipeline relies on automatic computation, collaboration, and in-house expert analysis to provide records at several levels of curation. These RefSeq records undergo an initial automated validation process before being released. The validation step checks for data errors and provides consistent feature annotation. If more than one genomic sequence is available for the genome, then one is selected for use as the RefSeq standard. This selection takes into account various factors including level of annotation, strain information, and community input.
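
The selection of one genomic sequence as the RefSeq standard can be pictured as ranking candidates on the factors listed above. The page does not say how these factors are measured or weighed, so everything in this sketch is an assumption: the `Candidate` fields, the feature count as a stand-in for "level of annotation", and the priority order (community input, then strain information, then annotation level).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    """A candidate genomic sequence; all fields are hypothetical stand-ins."""
    accession: str             # made-up identifier for the candidate
    annotated_features: int    # proxy for "level of annotation"
    has_strain_info: bool      # whether strain information is recorded
    community_preferred: bool  # proxy for "community input"


def select_refseq_standard(candidates: Sequence[Candidate]) -> Candidate:
    """Pick one candidate using an assumed priority order of the listed factors."""
    if not candidates:
        raise ValueError("no candidate genomic sequences supplied")
    # Tuples compare element by element, so earlier fields take priority.
    return max(
        candidates,
        key=lambda c: (c.community_preferred, c.has_strain_info, c.annotated_features),
    )
```

Under this assumed ordering, a community-preferred candidate is chosen over a more heavily annotated one. In practice the choice also involves expert judgment, which a sort key cannot represent.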

Also see:
Microbial Genomes
Organelles
Viral Genomes
Plant Genomes


Last updated September 24, 2004