NCBI Tools for Bioinformatics Research

Bacterial Genome Submission Guidelines

Introduction

The following is a guide to help bacterial genome submitters prepare their submissions using Sequin, an NCBI software tool for submitting and updating GenBank entries.

Many genome centers maintain feature locations and annotations in an internal database. Sequin can read a simple five-column tab-delimited table of feature locations and qualifiers. The table is read into Sequin along with the DNA sequence (in FASTA format) and the submitter information, and the record is then ready for submission to GenBank.

The feature table format allows different kinds of features (e.g., gene, mRNA, coding region, tRNA) and qualifiers (e.g., /product, /note) to be indicated. Once the sequence and annotation have been read in, Sequin's validator can be used to check for errors such as internal stops in coding regions.

FASTA-formatted sequence

Sequin can read nucleotide sequences in FASTA format. FASTA format consists of a single definition line, beginning with a '>' and followed by optional text, and subsequent lines of sequence. At minimum, all definition lines must contain an identifier for the sequence, called the SeqId. Other information about the biological source of the organism can also be encoded on the definition line of the sequence and is used to build the record. A sample definition line is

>OB_HTE831 [organism=Oceanobacillus iheyensis] [strain=HTE831]

Common source modifiers, e.g. [strain=HTE831], may be incorporated into the definition line. A complete list of modifiers is available from the Sequin FAQ page. Many of these modifiers can also be entered on the template; see the section "Creating the Sequin file" below.

An example of a FASTA-formatted sequence is shown in Figure 1.
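The definition-line layout described above can be checked before the sequence is read into Sequin. The following is a minimal Python sketch, not part of Sequin; the function name parse_defline is hypothetical, and it assumes that the SeqId is the first whitespace-delimited token after '>' and that modifiers are written as [key=value].

```python
import re


def parse_defline(defline):
    """Split a FASTA definition line into its SeqId and [key=value] modifiers."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must begin with '>'")
    body = defline[1:].strip()
    # Assumption: the SeqId is the first token and is not itself a modifier.
    if not body or body.startswith("["):
        raise ValueError("the definition line must contain a SeqId")
    seq_id = body.split()[0]
    # Collect every [key=value] pair that follows the SeqId.
    modifiers = {
        key.strip(): value.strip()
        for key, value in re.findall(r"\[([^=\]]+)=([^\]]*)\]", body)
    }
    return seq_id, modifiers


seq_id, modifiers = parse_defline(
    ">OB_HTE831 [organism=Oceanobacillus iheyensis] [strain=HTE831]"
)
print(seq_id, modifiers)
```

The SeqId returned here is the value that must also appear on the first line of the feature table.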

Feature table layout

Sequin reads features from a simple five-column tab-delimited table. The feature table specifies the location and type of each feature, and Sequin processes the feature intervals and translates any coding regions into proteins. The first line of the table contains the following basic information.

>Feature SeqId table_name

The SeqId must be the same as that used on the sequence. The table_name is optional. Subsequent lines of the table list the features. Columns are separated by tabs.

Column 1: Start location of feature
Column 2: Stop location of feature
Column 3: Feature key
Column 4: Qualifier key
Column 5: Qualifier value

Figure 2 shows a sample feature table and illustrates a number of points about the feature table format. The GenBank flatfile corresponding to this table is shown in Figure 3.

Features that are on the complementary strand, such as the gene abrB and its corresponding CDS, are indicated by reversing the interval locations.
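For centers that generate the table from an internal database, it may help to read the table back and inspect it before loading it into Sequin. The sketch below is an illustrative Python reader for the five-column layout described above; it is not Sequin's own parser, and the names parse_feature_table and on_complement_strand are hypothetical. It assumes that a line with a start, a stop and a feature key opens a new feature, that a line with only a start and a stop adds an interval to the current feature, and that a line with empty first three columns carries a qualifier.

```python
def parse_feature_table(text):
    """Parse a five-column tab-delimited feature table.

    Returns (seq_id, features). Each feature is a dict holding the feature
    key, a list of (start, stop) string pairs and a list of
    (qualifier, value) pairs.
    """
    seq_id = None
    features = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank or whitespace-only lines
        if line.startswith(">Feature"):
            seq_id = line.split()[1]  # the second token is the SeqId
            continue
        cols = line.split("\t")
        cols += [""] * (5 - len(cols))  # pad short lines to five columns
        start, stop, key, qual_key, qual_value = (c.strip() for c in cols[:5])
        if start and stop and key:
            features.append(
                {"key": key, "intervals": [(start, stop)], "qualifiers": []}
            )
        elif not features:
            raise ValueError(f"line appears before any feature: {line!r}")
        elif start and stop:
            # an extra interval belonging to the current feature
            features[-1]["intervals"].append((start, stop))
        elif qual_key:
            features[-1]["qualifiers"].append((qual_key, qual_value))
    return seq_id, features


def on_complement_strand(interval):
    """True when the interval is reversed, i.e. start is greater than stop."""
    start, stop = (int(pos.lstrip("<>")) for pos in interval)
    return start > stop
```

Positions are kept as strings so that partial-location markers such as < are preserved; on_complement_strand strips them only for the comparison.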

Please avoid unnecessary capitalization in all text entered in your table.

locus_tag and protein_id

The locus_tag and protein_id qualifiers should be used in all bacterial genome submissions.

All genes should be assigned a systematic gene identifier which should receive the locus_tag qualifier on the gene feature in the table. Genes may also have functional names as assigned in the scientific literature. In this example, OB0001 is the systematic gene identifier, while abcD is the functional gene name.

The use of locus_tag is supported in Sequin version 4.35 or newer. If you have an older version of Sequin, please download the current version.

Table view of gene with both biological name and locus_tag:

1	1575	gene
			gene	abcD
			locus_tag	OB0001

Flatfile view:

          gene       1..1575
                     /gene="abcD"
                     /locus_tag="OB0001"

Table view of gene with only locus_tag:

1	1575	gene
			locus_tag	OB0001

Flatfile view:

          gene       1..1575
                     /locus_tag="OB0001"

All proteins in a complete genome should be assigned an identification number by the submitter. We use this number to track proteins when sequences are updated. This number is indicated in the table by the CDS qualifier protein_id, and should have the format gnl|dbname|string, where dbname is the username on your FTP account, and string is the identification number. In this example, the protein_id for abcD is gnl|dbname|OB0001. This identifier is saved with the record (in ASN.1 format), but it is not visible in the flatfile. We recommend using the locus_tag number as the protein identification number.

Please avoid reusing locus_tag prefixes that have been used previously. Because the protein_id is used for internal tracking in our database, it is important that a genome center never duplicates a complete protein_id (dbname + string). Thus, if your genome center is submitting more than one genome, please be sure to use a different locus_tag/protein_id for each genome.

Example:

1	1575	gene
			gene	abcD
			locus_tag	OB0001
1	1575	CDS
			product	AbcD
			protein_id	gnl|dbname|OB0001
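When protein_id values are generated by script, the gnl|dbname|string format and the no-duplicates requirement can be enforced at the same time. The Python sketch below is illustrative only; the helper names are hypothetical, and the rule that neither part may contain '|' or whitespace is an assumption made here to keep the three fields unambiguous, not a documented NCBI requirement.

```python
from collections import Counter


def make_protein_id(dbname, locus_tag):
    """Build a protein_id of the form gnl|dbname|string from a locus_tag."""
    for part in (dbname, locus_tag):
        # Assumption: '|' and whitespace would make the fields ambiguous.
        if not part or "|" in part or any(ch.isspace() for ch in part):
            raise ValueError(f"unsuitable protein_id component: {part!r}")
    return f"gnl|{dbname}|{locus_tag}"


def find_duplicate_ids(protein_ids):
    """Return the protein_id values that occur more than once, sorted."""
    counts = Counter(protein_ids)
    return sorted(pid for pid, count in counts.items() if count > 1)


ids = [make_protein_id("dbname", tag) for tag in ("OB0001", "OB0002", "OB0001")]
print(find_duplicate_ids(ids))
```

Running the duplicate check across all genomes submitted by one center, rather than one genome at a time, is what catches a reused identifier.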

Gene features

Gene features are always a single interval, and their location should cover the intervals of all the relevant features such as promoters and operator binding sites.

Gene names should follow the standard bacterial nomenclature rules of three lower case letters. Different loci are distinguished by a suffix of uppercase letters.

correct cytB
incorrect CYTB
incorrect cytochrome B
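A quick screen for gene names that break this convention can be scripted. The following Python sketch reads the rule literally, as three lowercase letters followed by an optional run of uppercase letters. The pattern and the function name are assumptions for illustration, and names that are legitimate exceptions in the literature would need manual review.

```python
import re

# Literal reading of the rule: three lowercase letters, then an optional
# suffix of uppercase letters (assumption; real nomenclature has exceptions).
GENE_NAME = re.compile(r"[a-z]{3}[A-Z]*")


def is_conventional_gene_name(name):
    """True if the whole name matches the three-letter-plus-suffix pattern."""
    return GENE_NAME.fullmatch(name) is not None


for candidate in ("cytB", "CYTB", "cytochrome B"):
    print(candidate, is_conventional_gene_name(candidate))
```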

CDS (coding region) features

All CDS features should have a product qualifier (protein name). This should be a concise name, not a description or phrase. Alternatively, a protein may be named with the same symbol as the corresponding gene, written with an initial capital letter. If the protein is not known, use "unknown" or "hypothetical protein" as the product name.

Descriptions, notes describing similarity to other proteins, and functional comments should be placed in the appropriate CDS qualifiers such as note, or prot_desc, as they are descriptors of the product. E.C. numbers should be fielded in an EC_number qualifier.

start	stop	CDS
			product	DNA gyrase subunit B
			EC_number	5.99.1.3

Qualifiers that can be used on the CDS feature are:

start	stop	CDS
			product
			prot_desc	
			function	
			EC_number	
			note

Bifunctional proteins: If a protein contains two separate and distinct functions or if it has more than one name, each can be listed in the table as a separate product qualifier on the CDS in the table. The value of the first product qualifier will become the /product on the CDS in the flatfile, and any additional product qualifiers will be shown as a /note.

Table view:

start	stop	CDS
			product	methylenetetrahydrofolate dehydrogenase (NADP+)
			product	methenyltetrahydrofolate cyclohydrolase
			note	bifunctional
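The mapping from several product qualifiers in the table to a single /product plus a /note in the flatfile can be sketched as follows. This Python example is illustrative and is not Sequin's implementation; the function name is hypothetical, and the order and "; " separator used to combine notes are assumptions modeled on the folD entry in Figure 3.

```python
def split_products(qualifiers):
    """Return (product, note) as they would appear on the flatfile CDS.

    qualifiers is a list of (key, value) pairs in table order.
    """
    products = [value for key, value in qualifiers if key == "product"]
    notes = [value for key, value in qualifiers if key == "note"]
    # The first product qualifier becomes /product.
    product = products[0] if products else None
    # Additional products are shown as a /note; explicit notes are placed
    # first and parts are joined with "; " (assumption based on Figure 3).
    note_parts = notes + products[1:]
    note = "; ".join(note_parts) if note_parts else None
    return product, note


cds_qualifiers = [
    ("product", "methylenetetrahydrofolate dehydrogenase (NADP+)"),
    ("product", "methenyltetrahydrofolate cyclohydrolase"),
    ("note", "bifunctional"),
]
print(split_products(cds_qualifiers))
```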

Please avoid notes that state a specific percentage of similarity to other database entries: the record you point to may change, making your note inaccurate or obsolete.

Disrupted genes

Occasionally, genes may be disrupted for a variety of reasons, including mutations, sequencing artifacts, and insertion of insertion sequences in the coding region. Consequently, the conceptual translation of the coding region may include a frameshift when compared with proteins in the database. These can be annotated in a number of ways:

a) Add a single gene feature which covers both of the potential coding regions and add the pseudo qualifier indicating that this is a pseudogene. If known, a note qualifier may be added indicating why this gene is disrupted.

1	200	gene	
			gene	abcD
			pseudo
			note	frameshift

b) Alternatively, this can be annotated with a misc_feature. Please use the complete nucleotide spans of the frameshifted gene. If known, a note can be added to indicate the reason for the incomplete translation.

1	200	gene	
			gene	abcD
			note	contains frameshift
1	200	misc_feature
			note	nonfunctional abcD due to frameshift

c) A coding region containing a frameshift that is thought to be corrected by ribosomal slippage can be annotated using a join feature. A join feature is used to combine two non-contiguous regions of sequence that encode a protein. This is typically used to combine eukaryotic exons to translate the coding region. To create a join CDS you should specify the spans of each contiguous region of sequence that encodes the protein. The use of the join feature is rare in bacteria.

333255	333181	CDS
333179	332157
			product	AbcD
			protein_id	gnl|dbname|OB0001
			exception	ribosomal slippage

In this case the CDS should also include an exception qualifier with the exact text "ribosomal slippage".

Alternatively, if you include a join feature for a different reason, please include a note qualifier indicating why the two nucleotide spans are joined.
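To see how table intervals correspond to the location shown in the flatfile, the conversion can be sketched in Python. This is an illustration rather than Sequin's code: the function name is hypothetical, positions are plain integers (partial markers such as < are not handled), and features that mix strands are rejected.

```python
def flatfile_location(intervals):
    """Convert table intervals [(start, stop), ...] to a flatfile location."""
    if not intervals:
        raise ValueError("a feature needs at least one interval")
    reversed_flags = [start > stop for start, stop in intervals]
    if any(reversed_flags) and not all(reversed_flags):
        raise ValueError("mixed-strand features are not handled in this sketch")
    minus = all(reversed_flags)
    if minus:
        # On the complementary strand the table lists intervals in the
        # direction of translation; the flatfile lists them in ascending order.
        spans = [f"{stop}..{start}" for start, stop in reversed(intervals)]
    else:
        spans = [f"{start}..{stop}" for start, stop in intervals]
    location = spans[0] if len(spans) == 1 else "join(" + ",".join(spans) + ")"
    return f"complement({location})" if minus else location


print(flatfile_location([(45081, 44806)]))
print(flatfile_location([(333255, 333181), (333179, 332157)]))
```

The first call reproduces the abrB location from Figure 3, and the second shows the two-interval ribosomal slippage example above written as a join on the complementary strand.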

Functional bacteriophage

If your bacterial genome contains a functional phage, an additional source feature should be included with the spans covering the complete phage sequence. However, if the phage is not functional or if you are not sure, annotate it as a misc_feature.

361	4200	source
			organism	Bacteriophage xyz

Insertion sequences and transposons

Insertion sequences and transposons should be annotated as repeat_region features. The name of the insertion sequence or transposon should be added in an insertion_seq or transposon qualifier.

Table view:

1	100	repeat_region
			insertion_seq	IS912
			
500	600	repeat_region
			transposon	Tn912

Ribosomal RNA, tRNA and other RNA features

RNA features (rRNA, tRNA, RNA) should include a corresponding gene feature with a locus_tag qualifier for tracking purposes.

1	400	gene
			locus_tag	OB0001
1	400	rRNA
			product	16S ribosomal RNA
401	500	gene
			locus_tag	OB0002
401	500	tRNA
			product	tRNA-Phe
501	600	gene
			locus_tag	OB0003
501	600	RNA
			product	text

Creating the Sequin file

Your Sequin file can be generated using tbl2asn. tbl2asn is a command line program that automates parts of the submission process. It is packaged with the Sequin archive. tbl2asn reads a template along with the sequence and table files, and outputs ASN.1 for submission to GenBank.

The template for tbl2asn is created with Sequin. On the starting Sequin page, click on "Start New Submission", and fill out the Submitting Authors form. On the Organism and Sequences form, enter organism information that pertains to all the records. Features specific to a single record, such as chromosome, should be indicated on the FASTA definition line of the individual sequence. Import a dummy nucleotide sequence (this sequence will be replaced later in the process by the real sequence) and no protein sequence. Save the template using Sequin's File-->Save command.

In addition to the template, you will need two files to generate your Sequin submission. The first is your FASTA-formatted sequence, in a file with the extension .fsa. The second is the five-column tab-delimited table, in a file with the extension .tbl.

Next, run tbl2asn with the command

tbl2asn -t template_file -p path_to_files

-t specifies the template file (including the path) [required]

-p specifies the path for the table and sequence files ('-p .' is the current directory) [required]

-v performs a validation [optional]

In the directory specified by '-p', the program looks for pairs of files called file.fsa and file.tbl that have the same file prefix, and builds ASN.1 records for these pairs. The ASN.1 record will be called file.sqn. The results of the validation (error checking) will be called file.val. After any validation errors have been fixed with Sequin, the .sqn files can be submitted to GenBank. Additional tips on using Sequin are found in the Sequin Quick Guide.
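Before running tbl2asn on a directory, it can be useful to confirm which .fsa files have a matching .tbl file, since only such pairs are processed. The Python sketch below is a hypothetical wrapper, not part of the Sequin distribution. It uses only the -t, -p and -v options as described above, and it assumes tbl2asn is on the PATH; check the options against the version you have installed.

```python
import subprocess
from pathlib import Path


def find_pairs(directory):
    """Return the sorted file prefixes that have both a .fsa and a .tbl file."""
    directory = Path(directory)
    return [
        fsa.stem
        for fsa in sorted(directory.glob("*.fsa"))
        if fsa.with_suffix(".tbl").exists()
    ]


def tbl2asn_command(template_file, directory, validate=False):
    """Build the tbl2asn argument list from the options described above."""
    command = ["tbl2asn", "-t", str(template_file), "-p", str(directory)]
    if validate:
        command.append("-v")  # request validation, producing file.val
    return command


def run_tbl2asn(template_file, directory, validate=True):
    """Run tbl2asn if at least one .fsa/.tbl pair exists in the directory."""
    pairs = find_pairs(directory)
    if not pairs:
        raise FileNotFoundError("no matching .fsa/.tbl pairs found")
    # Assumption: the tbl2asn executable is installed and on the PATH.
    subprocess.run(tbl2asn_command(template_file, directory, validate), check=True)
    return pairs
```

Calling run_tbl2asn("template_file", ".") corresponds to the command line shown above with -v added; the returned prefixes are the names under which the .sqn and .val files are expected to appear.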

Figure 1: Sample FASTA-formatted sequence

>OB_HTE831 [organism=Oceanobacillus iheyensis] [strain=HTE831]
CGACCACAATGGTACGATTGTTCATAAATCAGGAGATGTTCCTATTCATATAAAGATACC
AAACAGATCTCTAATACATGACCAGGATATCAACTTCTATAATGGTTCCGAAAACGAAAG
AAAACCAAATCTAGAGCGTAGAGACGTCGACCGTGTTGGTGATCCAATGAGGATGGATAG
[etc.]

Figure 2: Sequin table format

>Feature OB_HTE831
<1830	2966	gene
			gene	dnaN
			locus_tag	OB0002
<1830	2966	CDS
			product	DNA-directed DNA polymerase III beta chain
			EC_number	2.7.7.7
			protein_id	gnl|ncbi|OB0002
3219	3440	gene
			locus_tag	OB0003
3219	3440	CDS
			product	hypothetical protein
			protein_id	gnl|ncbi|OB0003
3443	4552	gene
			gene	recF
			locus_tag	OB0004
3443	4552	CDS
			product	RecF
			function	DNA repair and genetic recombination
			protein_id	gnl|ncbi|OB0004
5109	7034	gene
			gene	gyrB
			locus_tag	OB0006
5109	7034	CDS
			product	DNA gyrase subunit B
			EC_number	5.99.1.3
			protein_id	gnl|ncbi|OB0006
45081	44806	gene
			gene	abrB
			locus_tag	OB0045
45081	44806	CDS
			product	AbrB
			protein_id	gnl|ncbi|OB0045
			function	transcriptional pleiotropic regulator
64225	64758	gene
			locus_tag	OB0064
64225	64758	CDS
			product	stage V sporulation protein T
			function	transcriptional regulator
			protein_id	gnl|ncbi|OB0064
84524	85393	gene
			locus_tag	OB0082
84524	85393	CDS
			product	chaperonin
			product	heat shock protein 33
			protein_id	gnl|ncbi|OB0082
89569	91050	gene
			locus_tag	OB0088
89569	91050	CDS
			product	lysine-tRNA ligase
			EC_number	6.1.1.6
			protein_id	gnl|ncbi|OB0088
91493	96462	gene
			gene	rrnA
			locus_tag	OB3546
91493	93058	rRNA
			product	16S ribosomal RNA
93292	96213	rRNA
			product	23S ribosomal RNA
96347	96462	rRNA
			product	5S ribosomal RNA
96468	96744	gene
			gene	trnC
			locus_tag	OB3547
96468	96543	tRNA
			product	tRNA-Val
96545	96620	tRNA
			product	tRNA-Thr
96669	96744	tRNA
			product	tRNA-Lys
1914923	1914066	gene
			gene	folD
			locus_tag	OB1880
1914923	1914066	CDS
			product	methylenetetrahydrofolate dehydrogenase (NADP+)
			product	methenyltetrahydrofolate cyclohydrolase
			EC_number	1.5.1.5
			EC_number	3.5.4.9
			protein_id	gnl|ncbi|OB1880
			note	bifunctional

Figure 3: GenBank flatfile

LOCUS       OB_HTE831   3630528 bp    DNA     circular BCT 11-DEC-2002
DEFINITION  Oceanobacillus iheyensis, complete genome.
ACCESSION   
VERSION     
KEYWORDS    .
SOURCE      Oceanobacillus iheyensis
  ORGANISM  Oceanobacillus iheyensis
            Bacteria; Firmicutes; Bacillales; Oceanobacillus.
REFERENCE   1  (bases 1 to 3630528)
  AUTHORS   Takami,H., Takaki,Y. and Uchiyama,I.
  TITLE     Genome sequence of Oceanobacillus iheyensis isolated from the Iheya
            Ridge and its unexpected adaptive capabilities to extreme
            environments
  JOURNAL   Nucleic Acids Res. 30 (18), 3927-3935 (2002)
  MEDLINE   22220767
   PUBMED   12235376
REFERENCE   2  (bases 1 to 3630528)
  AUTHORS   Takami,H., Takaki,Y. and Chee,G.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-DEC-2001) Hideto Takami, Japan Marine Science and
            Technology Center, Deep-sea Microorganisms Research Group; 2-15
            Natsushima-cho, Yokosuka, Kanagawa 237-0061, Japan
FEATURES             Location/Qualifiers
     source          1..3630528
                     /organism="Oceanobacillus iheyensis"
                     /strain="HTE831"
                     /db_xref="taxon:182710"
     gene            1830..2966
                     /gene="dnaN"
                     /locus_tag="OB0002"
     CDS             1830..2966
                     /gene="dnaN"
                     /EC_number="2.7.7.7"
                     /codon_start=1
                     /transl_table=11
                     /product="DNA-directed DNA polymerase III beta chain"
                     /translation="MRFTIQRDKLINGVSNVMKAISARTVIPILTGMKIEVKNHGVTL
                     TGSDSDISIEYYIPIEEDGIVHVENIEEGTIILQAKYFPDIVRKLPESTVDIVVDDQL
                     NVRITSGKAEFNLNGQSAEEYPQLPKVQTENSFELPIDLLKSMIKQTVFAVSTMETRP
                     ILTGVNLKLVDNSLSFTATDSHRLARREIPVSNAPIEISQIVVPGKSLNELNKILGDS
                     EETVEISVTNNQILFRTKHLNFLSRLLDGNYPETSRLIPEQSKTKIQLKTKELLGTID
                     RASLLAKEERNNVVKFNAPGNSMIEISSNSPEVGNVVEEITADQMEGEDVKISFSSKY
                     MIDALKAIEYDEVQIEFTGAMRPFIIRPVGDDSILQLILPVRTY"
     gene            3219..3440
                     /locus_tag="OB0003"
     CDS             3219..3440
                     /locus_tag="OB0003"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /translation="MHEQIQIDTEYITLGQLIKLLNFLESGGMVKTFLQEEGALVNGH
                     LEQRRGRKLYPKDVVEIQGIGSYIVIKED"
     gene            3443..4552
                     /gene="recF"
                     /locus_tag="OB0004"
     CDS             3443..4552
                     /gene="recF"
                     /function="DNA repair and genetic recombination"
                     /codon_start=1
                     /transl_table=11
                     /product="RecF"
                     /translation="MHIEKLELTNYRNYDQLEIAFDDQINVIIGENAQGKTNLMEAIY
                     VLSFARSHRTPREKELIQWDKDYAKIEGRITKRNQSIPLQISITSKGKKAKVNHLEQH
                     RLSDYIGSVNVVMFAPEDLTIVKGAPQIRRRFMDMELGQIQPTYIYHLAQYQKVLKQR
                     NHLLKQLQRKPNSDTTMLEVLTDQLIEHASILLERRFIYLELLRKWAQPIHRGISREL
                     EQLEIQYSPSIEVSEDANKEKIGNIYQMKFAEVKQKEIERGTTLAGPHRDDLIFFVNG
                     KDVQTYGSQGQQRTTALSIKLAEIELIYQEVGEYPILLLDDVLSELDDYRQSHLLNTI
                     QGKVQTFVSTTSVEGIHHETLQQAELFRVTDGVVN"
     gene            5109..7034
                     /gene="gyrB"
                     /locus_tag="OB0006"
     CDS             5109..7034
                     /gene="gyrB"
                     /EC_number="5.99.1.3"
                     /codon_start=1
                     /transl_table=11
                     /product="DNA gyrase subunit B"
                     /translation="MSMEDKITENQEYGADQIQVLEGLEAVRKRPGMYIGSTSEKGLH
                     HLVWEIVDNSIDEALAGYCDHIQVVVEEDNSITVKDNGRGIPVDIQQKTGRPALEVIM
                     TVLHAGGKFGGGGYKVSGGLHGVGASVVNALSSELEVYVHRDGKVHFLSFKKGVPDGE
                     IKVIGDTDITGTVTHFRPDTEIFTETTEYNFDTLEQRLRELAFLNKGLKISIEDKRTD
                     REQVTYHYEGGISSYVEFINKNKEVLHEPFFAEGEDQGISVEVAIQYNDGFASNLYSF
                     ANNIHTYEGGSHEVGFRSGLTRIINDYAKKNGLIKDGDSNLSGDDVREGMTTIVSIKH
                     PDPQFEGQTKTKLGNSEVRAITDGVFSEAFSKFLYENPSTAKIIVEKGLMASRARLAA
                     KKARELTRRKSNLEISNLPGKLADCSSRDAAISELYIVEGDSAGGSAKSGRDRHFQAI
                     LPLRGKILNVEKARLDRILSNNEVRAMITALGSGVGEEFDISKARYHKIVIMTDADVD
                     GAHIRTLLLTFFYRYMRPLIEQGYIYIAQPPLYQVKQGKTVNYAYNDKELDRILNEIP
                     KAPKPNIQRYKGLGEMNADQLWDTTMDPDTRTLLQVELSDAIDADQVFDMLMGDKVEP
                     RRIFIEENAQYVKNLDI"
     gene            complement(44806..45081)
                     /gene="abrB"
                     /locus_tag="OB0045"
     CDS             complement(44806..45081)
                     /gene="abrB"
                     /function="transcriptional pleiotropic regulator"
                     /codon_start=1
                     /transl_table=11
                     /product="AbrB"
                     /translation="MKSTGIVRKVDELGRVVIPIELRRTLDIHEKDTMEIYVDNDKIV
                     LKKYKPNMTCQVTGEVSDENLSIANGNLVLSPAGAQILLEEIQSRFK"
     gene            64225..64758
                     /locus_tag="OB0064"
     CDS             64225..64758
                     /locus_tag="OB0064"
                     /function="transcriptional regulator"
                     /codon_start=1
                     /transl_table=11
                     /product="stage V sporulation protein T"
                     /translation="MKATGIVRRIDDLGRVVVPKEIRRTLRIREGDPLEIFVDREGEV
                     ILKKYSPINELGHFAKEYAEALFQSLQTPVMITDRDDVIAVAGESKKEYLNKPISNAI
                     ADTIEGRSQVFEVDTKSMEIIDGQEQQLQSYCIHPVIANGDPIGCVLIFSKEEKLSKI
                     EQKAAETASTFLAKQME"
     gene            84524..85393
                     /locus_tag="OB0082"
     CDS             84524..85393
                     /locus_tag="OB0082"
                     /note="heat shock protein 33"
                     /codon_start=1
                     /transl_table=11
                     /product="chaperonin"
                     /translation="MKDYLIKATANNGKIRAYAVQSTNTIEEARRRQDTFATASAALG
                     RTITITAMMGAMLKGDDSITTKVMGNGPLGAIVADADADGHVRGYVTNPHVDFDLNDK
                     GKLDVARAVGTEGNISVIKDLGLKDFFTGETPIVSGEISEDFTYYYATSEQLPSAVGA
                     GVLVNPDHTILAAGGFIVQVMPGAEEEVINELEDQIQAIPAISSLIREGKSPEEILTQ
                     LFGEECLTIHEKMPIEFRCKCSKDRLAQAIIGLGNDEIQAMIEEDQGAEATCHFCNEK
                     YHFTEEELEDLKQ"
     gene            89569..91050
                     /locus_tag="OB0088"
     CDS             89569..91050
                     /locus_tag="OB0088"
                     /EC_number="6.1.1.6"
                     /codon_start=1
                     /transl_table=11
                     /product="lysine-tRNA ligase"
                     /translation="MSEELNEHMQVRRDKLAEHMEKGLDPFGGKFERSHQATDLIEKY
                     DSYSKEELEETTDEVTIAGRLMTKRGKGKAGFAHIQDLSGQIQLYVRKDMIGDDAYEV
                     FKSADLGDIVGVTGVMFKTNVGEISVKAKQFQLLTKSLRPLPEKYHGLKDIEQRYRQR
                     YLDLITNPDSRGTFVSRSKIIQSMREYLNGQGFLEVETPMMHSIPGGASARPFITHHN
                     ALDIELYMRIAIELHLKRLMVGGLEKVYEIGRVFRNEGVSTRHNPEFTMIELYEAYAD
                     YHDIMELTENLVAHIAKQVHGSTTITYGEHEINLEPKWTRLHIVDAVKDATGVDFWKE
                     VSDEEARALAKEHGVQVTESMSYGHVVNEFFEQKVEETLIQPTFIHGHPVEISPLAKK
                     NKEDERFTDRFELFIVGREHANAFSELNDPIDQRARFEAQVKERAEGNDEAHYMDEDF
                     LEALEYGMPPTGGLGIGVDRLVMLLTNSPSIRDVLLFPQMRTK"
     gene            91493..96462
                     /gene="rrnA"
                     /locus_tag="OB3546"
     rRNA            91493..93058
                     /gene="rrnA"
                     /product="16S ribosomal RNA"
     rRNA            93292..96213
                     /gene="rrnA"
                     /product="23S ribosomal RNA"
     rRNA            96347..96462
                     /gene="rrnA"
                     /product="5S ribosomal RNA"
     gene            96468..96744
                     /gene="trnC"
                     /locus_tag="OB3547"
     tRNA            96468..96543
                     /gene="trnC"
                     /product="tRNA-Val"
     tRNA            96545..96620
                     /gene="trnC"
                     /product="tRNA-Thr"
     tRNA            96669..96744
                     /gene="trnC"
                     /product="tRNA-Lys"
     gene            complement(1914066..1914923)
                     /gene="folD"
                     /locus_tag="OB1880"
     CDS             complement(1914066..1914923)
                     /gene="folD"
                     /EC_number="1.5.1.5"
                     /EC_number="3.5.4.9"
                     /note="bifunctional; methenyltetrahydrofolate
                     cyclohydrolase"
                     /codon_start=1
                     /transl_table=11
                     /product="methylenetetrahydrofolate dehydrogenase (NADP+)"
                     /translation="MATLLNGKELSEELKQKMKIEVDELKEKGLTPHLTVILVGDNPA
                     SKSYVKGKEKACAVTGISSNLIELPENISQDELLQIIDEQNNDDSVHGILVQLPLPDQ
                     MDEQKIIHAISPAKDVDGFHPINVGKMMTGEDTFIPCTPYGILTMLRSKDISLEGKHA
                     VIIGRSNIVGKPIGLLLLQENATVTYTHSRTKNLQEITKQADILIVAIGRAHAINADY
                     IKEDAVVIDVGINRKDDGKLTGDVDFESAEQKASYITPVPRGVGPMTITMLLKNTIKA
                     AKGLNDVER"
BASE COUNT  1165552 a 648314 c 647106 g 1169556 t
ORIGIN      
        1 actttcaaaa aaatcagcgt aaaaaacata ctaatttggg caaattccca cctgttttta
       61 gggacatttt tctttgaatt agagcctcag cagctcgtca ttgctgaatt ttcttgaagt

For an additional example, see GenBank accession number AE014016.

Revised June 9, 2003