NCBI Prokaryotic Genome Annotation Pipeline

The NCBI Prokaryotic Genome Annotation Pipeline (PGAP) is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements.

NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. The first version of the NCBI Prokaryotic Genome Annotation Pipeline was developed in 2001, and the pipeline is regularly upgraded to improve the quality of structural and functional annotation (Haft DH et al. 2018; Tatusova T et al. 2016). Recent improvements use curated protein profile hidden Markov models (HMMs), including TIGRFAMs and new HMMs for antimicrobial resistance proteins, as well as curated complex domain architectures for functional annotation of proteins. NCBI's annotation pipeline depends on several internal databases and is not currently available for download or use outside of the NCBI environment.

GenBank

The NCBI prokaryotic annotation pipeline is available as a service for GenBank submitters. The pipeline can annotate both complete genomes and draft WGS genomes consisting of multiple contigs. You can request PGAP annotation when you submit your genome to GenBank.

Both WGS and non-WGS genomes, including gapless complete bacterial chromosomes, can be submitted via the Submission Portal. You will be asked to choose whether the genome being submitted is considered WGS or not. The differences for GenBank purposes are:

non-WGS:

  • Each chromosome is in a single sequence and there are no extra sequences.
  • Each sequence in the genome must be assigned to a chromosome, plasmid, or organelle.
  • Plasmids and organelles can still be in multiple pieces.

WGS:

  • One or more chromosomes are in multiple pieces and/or some sequences are not assembled into chromosomes.

In both cases:

  • There can still be gaps within the sequences; you will supply that information in the submission.
  • Plasmids and organelles can still be in multiple pieces.
  • Internal sequences must be arranged in the correct order and orientation.
  • Sequences concatenated in unknown order are not allowed.
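The WGS versus non-WGS distinction above can be read as a simple decision rule. The following is a minimal sketch of that rule in Python, not an NCBI tool: the `Sequence` record, its field names, and the `classify_submission` function are hypothetical names introduced only for illustration, and the Submission Portal itself simply asks the submitter to make this choice.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sequence:
    """One submitted sequence (hypothetical record, for illustration only)."""
    name: str
    # "chromosome", "plasmid", "organelle", or None if the sequence
    # has not been assigned to any replicon.
    replicon_type: Optional[str]
    # Label of the chromosome/plasmid/organelle this sequence belongs to.
    replicon_name: Optional[str] = None


def classify_submission(sequences: List[Sequence]) -> str:
    """Return "WGS" or "non-WGS" following the rules listed above."""
    # Any sequence not assigned to a chromosome, plasmid, or organelle
    # counts as an extra sequence, which makes the genome WGS.
    if any(s.replicon_type is None for s in sequences):
        return "WGS"

    # Count how many pieces each chromosome is split into.
    pieces_per_chromosome = Counter(
        s.replicon_name for s in sequences if s.replicon_type == "chromosome"
    )

    # A chromosome in more than one piece makes the genome WGS.
    # Plasmids and organelles may be in multiple pieces in either case.
    if any(count > 1 for count in pieces_per_chromosome.values()):
        return "WGS"

    return "non-WGS"


# Example: one complete chromosome plus a plasmid in two pieces.
example = [
    Sequence("seq1", "chromosome", "1"),
    Sequence("seq2", "plasmid", "pA"),
    Sequence("seq3", "plasmid", "pA"),
]
print(classify_submission(example))
```

The sketch deliberately ignores gaps within sequences, since gaps are allowed in both categories and are reported separately in the submission.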

Submission is through the Genome Submission Portal. See the genome submission instructions page for details.

RefSeq

All RefSeq bacterial and archaeal genomes, with the exception of RefSeq Prokaryotic Reference Genomes, are annotated using NCBI's prokaryotic genome annotation pipeline. Additional information on this policy is available here:

For information about RefSeq Eukaryotic genomes, please see: Eukaryotic Genome Annotation

Questions about RefSeq prokaryotic genomes: genomes@ncbi.nlm.nih.gov

References

Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res. 2016 Aug 19;44(14):6614-24. doi: 10.1093/nar/gkw569. PMID: 27342282

Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O'Neill K, Li W, Chitsaz F, Derbyshire MK, Gonzales NR, Gwadz M, Lu F, Marchler GH, Song JS, Thanki N, Yamashita RA, Zheng C, Thibaud-Nissen F, Geer LY, Marchler-Bauer A, Pruitt KD. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860. doi: 10.1093/nar/gkx1068. PMID: 29112715

Last updated: 2019-01-07T17:56:56Z