Bacterial Genome Submission Guidelines
Introduction

The following is a guide to help bacterial genome submitters prepare their submissions using Sequin, an NCBI software tool for submitting and updating GenBank entries. Many genome centers maintain feature locations and annotations in an internal database. Sequin can read a simple five-column tab-delimited table of feature locations and qualifiers. The table is read into Sequin along with the DNA sequence (in FASTA format) and the submitter information, and the record is then ready for submission to GenBank. The feature table format allows different kinds of features (e.g., gene, mRNA, coding region, tRNA) and qualifiers (e.g., /product, /note) to be indicated. Once the sequence and annotation have been read in, Sequin's validator can be used to check for errors such as internal stops in coding regions.

FASTA-formatted sequence

Sequin can read nucleotide sequences in FASTA format. FASTA format consists of a single definition line, beginning with a '>' and followed by optional text, and subsequent lines of sequence. At minimum, every definition line must contain an identifier for the sequence, called the SeqId. Other information about the biological source of the organism can also be encoded on the definition line of the sequence and is used to build the record. A sample definition line (the one used in Figure 1) is

>OB_HTE831 [organism=Oceanobacillus iheyensis] [strain=HTE831]

Common source modifiers may be incorporated into the definition line, e.g., [strain=HTE831]. A complete list of modifiers is available from the Sequin FAQ page. Many of these modifiers can also be entered on the template. See the section "Creating the Sequin file" below. An example of a FASTA-formatted sequence is shown in Figure 1.

Feature table layout

Sequin reads features from a simple five-column tab-delimited table. The feature table specifies the location and type of each feature, and Sequin processes the feature intervals and translates any coding regions into proteins. The first line of the table contains the following basic information.
>Feature SeqId table_name

The SeqId must be the same as that used on the sequence. The table_name is optional. Subsequent lines of the table list the features. Columns are separated by tabs.

Column 1: Start location of feature
Column 2: Stop location of feature
Column 3: Feature key
Column 4: Qualifier key
Column 5: Qualifier value

Figure 2 shows a sample feature table and illustrates a number of points about the feature table format. The GenBank flatfile corresponding to this table is shown in Figure 3. Features that are on the complementary strand, such as the gene abrB and its corresponding CDS, are indicated by reversing the interval locations (the start is greater than the stop). Please avoid unnecessary capitalization in any text entered in your table.

locus_tag and protein_id

The locus_tag and protein_id qualifiers should be used in all bacterial genome submissions. All genes should be assigned a systematic gene identifier, which is entered as the locus_tag qualifier on the gene feature in the table. Genes may also have functional names as assigned in the scientific literature. In the examples below, OB0001 is the systematic gene identifier, while abcD is the functional gene name. The use of locus_tag is supported in Sequin version 4.35 or newer. If you have an older version of Sequin, please download the current version.

Table view of gene with both biological name and locus_tag:

1       1575    gene
                        gene            abcD
                        locus_tag       OB0001

Flatfile view:

     gene            1..1575
                     /gene="abcD"
                     /locus_tag="OB0001"

Table view of gene with only locus_tag:

1       1575    gene
                        locus_tag       OB0001

Flatfile view:

     gene            1..1575
                     /locus_tag="OB0001"

All proteins in a complete genome should be assigned an identification number by the submitter. We use this number to track proteins when sequences are updated. This number is indicated in the table by the CDS qualifier protein_id, and should have the format gnl|dbname|string, where dbname is the username on your FTP account, and string is the identification number. In this example, the protein_id for abcD is gnl|dbname|OB0001.
This identifier is saved with the record (in ASN.1 format), but it is not visible in the flatfile. We recommend using the locus_tag number as the protein identification number. Please avoid reusing locus_tag prefixes that have already been used. Since the protein_id is used for internal tracking in our database, it is important that the complete protein_id (dbname + string) not be duplicated by a genome center. Thus, if your genome center is submitting more than one genome, please be sure to use a different locus_tag/protein_id for each genome.

Example:

1       1575    gene
                        gene            abcD
                        locus_tag       OB0001
1       1575    CDS
                        product         AbcD
                        protein_id      gnl|dbname|OB0001

Gene features

Gene features are always a single interval, and their location should cover the intervals of all the relevant features such as promoters and operator binding sites. Gene names should follow the standard bacterial nomenclature rules: three lowercase letters, with different loci distinguished by a suffix of uppercase letters.

Correct: cytB

CDS (coding region) features

All CDS features should have a product qualifier (protein name). This should be a concise name, not a description or phrase. Alternatively, a protein may be denoted by the same symbol as the corresponding gene, but with the first letter capitalized. In cases where the protein is not known, use "unknown" or "hypothetical protein" as the product name. Descriptions, notes describing similarity to other proteins, and functional comments should be placed in the appropriate CDS qualifiers, such as note or prot_desc, as they are descriptors of the product. E.C. numbers should be entered in an EC_number qualifier.
start   stop    CDS
                        product         DNA gyrase subunit B
                        EC_number       5.99.1.3

Qualifiers that can be used on the CDS feature are:

start   stop    CDS
                        product
                        prot_desc
                        function
                        EC_number
                        note

Bifunctional proteins: If a protein has two separate and distinct functions, or if it has more than one name, each can be listed as a separate product qualifier on the CDS in the table. The value of the first product qualifier will become the /product on the CDS in the flatfile, and any additional product qualifiers will be shown as a /note.

Table view:

start   stop    CDS
                        product         methylenetetrahydrofolate dehydrogenase (NADP+)
                        product         methenyltetrahydrofolate cyclohydrolase
                        note            bifunctional

Please avoid notes that state a specific percentage of similarity to other entries in the database, since the record you point to may change and leave your note inaccurate or obsolete.

Disrupted genes

Occasionally, genes may be disrupted for a variety of reasons, including mutations, sequencing artifacts, and insertion sequences inserted into the coding region. Consequently, the conceptual translation of the coding region may include a frameshift when compared with proteins in the database. These can be annotated in a number of ways:

a) Add a single gene feature which covers both of the potential coding regions and add the pseudo qualifier indicating that this is a pseudogene. If known, a note qualifier may be added indicating why this gene is disrupted.

1       200     gene
                        gene            abcD
                        pseudo
                        note            frameshift

b) Alternatively, this can be annotated with a misc_feature. Please use the complete nucleotide spans of the frameshifted gene. If known, a note can be added to indicate the reason for the incomplete translation.
1       200     gene
                        gene            abcD
                        note            contains frameshift
1       200     misc_feature
                        note            nonfunctional abcD due to frameshift

c) A coding region containing a frameshift that is thought to be corrected by ribosomal slippage can be annotated using a join feature. A join feature is used to combine two non-contiguous regions of sequence that encode a protein. This is typically used to combine eukaryotic exons to translate the coding region. To create a join CDS you should specify the spans of each contiguous region of sequence that encodes the protein. The use of the join feature is rare in bacteria.

333255  333181  CDS
333179  332157
                        product         AbcD
                        protein_id      gnl|dbname|OB0001
                        exception       ribosomal slippage

In this case the CDS should also include an exception qualifier with the exact text "ribosomal slippage". Alternatively, if you include a join feature for a different reason, please include a note qualifier indicating why the two nucleotide spans are joined.

Functional bacteriophage

If your bacterial genome contains a functional phage, an additional source feature should be included with the spans covering the complete phage sequence. However, if the phage is not functional or if you are not sure, annotate it as a misc_feature.

361     4200    source
                        organism        Bacteriophage xyz

Insertion sequences and transposons

Insertion sequences and transposons should be annotated as repeat_region features. The name of the insertion sequence or transposon should be added in an insertion_seq or transposon qualifier.

Table view:

1       100     repeat_region
                        insertion_seq   IS912
500     600     repeat_region
                        transposon      Tn912

Ribosomal RNA, tRNA and other RNA features

RNA features (rRNA, tRNA, RNA) should include a corresponding gene feature with a locus_tag qualifier for tracking purposes.
1       400     gene
                        locus_tag       OB0001
1       400     rRNA
                        product         16S ribosomal RNA
401     500     gene
                        locus_tag       OB0002
401     500     tRNA
                        product         tRNA-Phe
501     600     gene
                        locus_tag       OB0003
501     600     RNA
                        product         text

Creating the Sequin file

Your Sequin file can be generated using tbl2asn. tbl2asn is a command line program that automates parts of the submission process. It is packaged with the Sequin archive. tbl2asn reads a template along with the sequence and table files, and outputs ASN.1 for submission to GenBank.

The template for tbl2asn is created with Sequin. On the starting Sequin page, click on "Start New Submission", and fill out the Submitting Authors form. On the Organism and Sequences form, enter organism information that pertains to all the records. Source information specific to a single record, such as chromosome, should be indicated on the FASTA definition line of the individual sequence. Import a dummy nucleotide sequence (this sequence will be replaced later in the process by the real sequence) and no protein sequence. Save the template using Sequin's File -> Save command.

In addition to the template, you will need two files to generate your Sequin submission. The first file is your FASTA-formatted sequence, in a file named with the extension .fsa. The second file is the five-column tab-delimited table, in a file named with the extension .tbl.

Next, run tbl2asn with the command

tbl2asn -t template_file -p path_to_files

-t  specifies the template file (including the path) [required]
-p  specifies the path for the table and sequence files ('-p .' is the current directory) [required]
-v  performs a validation [optional]

In the directory specified by '-p', the program looks for pairs of files called file.fsa and file.tbl that have the same file prefix, and builds ASN.1 records for these pairs. The ASN.1 record will be called file.sqn. The results of the validation (error checking) will be called file.val. After any validation errors have been fixed with Sequin, the .sqn files can be submitted to GenBank. Additional tips on using Sequin are found in the Sequin Quick Guide.
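
Because tbl2asn pairs files by prefix and the SeqId in the table must match the one on the FASTA definition line, it can be worth checking a directory before running the program. The following is a minimal sketch of such a check, not part of Sequin or tbl2asn; the function names (fasta_seq_id, table_seq_id, check_pairs) are hypothetical and introduced here only for illustration.

```python
from pathlib import Path


def fasta_seq_id(path):
    """Return the SeqId (first token after '>') from a FASTA definition line."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                return tokens[0] if tokens else None
    return None


def table_seq_id(path):
    """Return the SeqId from the first line of a feature table ('>Feature SeqId ...')."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">Feature"):
                tokens = line.split()
                return tokens[1] if len(tokens) > 1 else None
    return None


def check_pairs(directory):
    """Map each file prefix to a problem description; an empty dict means no problems found."""
    directory = Path(directory)
    prefixes = {p.stem for p in directory.glob("*.fsa")}
    prefixes |= {p.stem for p in directory.glob("*.tbl")}
    problems = {}
    for prefix in sorted(prefixes):
        fsa = directory / f"{prefix}.fsa"
        tbl = directory / f"{prefix}.tbl"
        if not fsa.exists():
            problems[prefix] = "missing .fsa file"
        elif not tbl.exists():
            problems[prefix] = "missing .tbl file"
        else:
            seq_id, tbl_id = fasta_seq_id(fsa), table_seq_id(tbl)
            if seq_id is None or seq_id != tbl_id:
                problems[prefix] = f"SeqId mismatch: {seq_id!r} vs {tbl_id!r}"
    return problems
```

Calling check_pairs(".") on the directory that would be passed to '-p' returns a dictionary of prefixes that are likely to cause trouble. It only compares file names and SeqIds; it does not replace the validation performed by tbl2asn or Sequin.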
Figure 1: Sample FASTA-formatted sequence

>OB_HTE831 [organism=Oceanobacillus iheyensis] [strain=HTE831]
CGACCACAATGGTACGATTGTTCATAAATCAGGAGATGTTCCTATTCATATAAAGATACC
AAACAGATCTCTAATACATGACCAGGATATCAACTTCTATAATGGTTCCGAAAACGAAAG
AAAACCAAATCTAGAGCGTAGAGACGTCGACCGTGTTGGTGATCCAATGAGGATGGATAG
[etc.]

Figure 2: Sequin table format

>Feature OB_HTE831
<1830   2966    gene
                        gene            dnaN
                        locus_tag       OB0002
<1830   2966    CDS
                        product         DNA-directed DNA polymerase III beta chain
                        EC_number       2.7.7.7
                        protein_id      gnl|ncbi|OB0002
3219    3440    gene
                        locus_tag       OB0003
3219    3440    CDS
                        product         hypothetical protein
                        protein_id      gnl|ncbi|OB0003
3443    4552    gene
                        gene            recF
                        locus_tag       OB0004
3443    4552    CDS
                        product         RecF
                        function        DNA repair and genetic recombination
                        protein_id      gnl|ncbi|OB0004
5109    7034    gene
                        gene            gyrB
                        locus_tag       OB0006
5109    7034    CDS
                        product         DNA gyrase subunit B
                        EC_number       5.99.1.3
                        protein_id      gnl|ncbi|OB0006
45081   44806   gene
                        gene            abrB
                        locus_tag       OB0045
45081   44806   CDS
                        product         AbrB
                        protein_id      gnl|ncbi|OB0045
                        function        transcriptional pleiotropic regulator
64225   64758   gene
                        locus_tag       OB0064
64225   64758   CDS
                        product         stage V sporulation protein T
                        function        transcriptional regulator
                        protein_id      gnl|ncbi|OB0064
84524   85393   gene
                        locus_tag       OB0082
84524   85393   CDS
                        product         chaperonin
                        product         heat shock protein 33
                        protein_id      gnl|ncbi|OB0082
89569   91050   gene
                        locus_tag       OB0088
89569   91050   CDS
                        product         lysine-tRNA ligase
                        EC_number       6.1.1.6
                        protein_id      gnl|ncbi|OB0088
91493   96462   gene
                        gene            rrnA
                        locus_tag       OB3546
91493   93058   rRNA
                        product         16S ribosomal RNA
93292   96213   rRNA
                        product         23S ribosomal RNA
96347   96462   rRNA
                        product         5S ribosomal RNA
96468   96744   gene
                        gene            trnC
                        locus_tag       OB3547
96468   96543   tRNA
                        product         tRNA-Val
96545   96620   tRNA
                        product         tRNA-Thr
96669   96744   tRNA
                        product         tRNA-Lys
1914923 1914066 gene
                        gene            folD
                        locus_tag       OB1880
1914923 1914066 CDS
                        product         methylenetetrahydrofolate dehydrogenase (NADP+)
                        product         methenyltetrahydrofolate cyclohydrolase
                        EC_number       1.5.1.5
                        EC_number       3.5.4.9
                        protein_id      gnl|ncbi|OB1880
                        note            bifunctional
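
A table like the one in Figure 2 can be written out from annotations held in an internal database. The sketch below shows one possible way to serialize features into the five-column tab-delimited layout described in this guide. The function name format_feature_table and the dictionary layout (key, intervals, qualifiers) are assumptions made for illustration; they are not defined by Sequin.

```python
def format_feature_table(seq_id, features, table_name=""):
    """Serialize features into the five-column tab-delimited feature table text.

    Each feature is a dict with:
      "key"        - feature key, e.g. "gene" or "CDS"
      "intervals"  - list of (start, stop) pairs; start > stop means complementary strand
      "qualifiers" - list of (qualifier_key, value) pairs; use "" for value-less
                     qualifiers such as pseudo
    """
    header = f">Feature {seq_id}"
    if table_name:
        header += f" {table_name}"
    lines = [header]
    for feature in features:
        intervals = feature["intervals"]
        first_start, first_stop = intervals[0]
        # Columns 1-3: start, stop, feature key.
        lines.append(f"{first_start}\t{first_stop}\t{feature['key']}")
        # Additional intervals of a joined feature carry only start and stop.
        for start, stop in intervals[1:]:
            lines.append(f"{start}\t{stop}")
        # Columns 4-5: qualifier key and value, after three empty columns.
        for qual_key, qual_value in feature.get("qualifiers", []):
            if qual_value:
                lines.append(f"\t\t\t{qual_key}\t{qual_value}")
            else:
                lines.append(f"\t\t\t{qual_key}")
    return "\n".join(lines) + "\n"


# Example: the abrB gene and CDS from Figure 2.
abrb = [
    {"key": "gene", "intervals": [(45081, 44806)],
     "qualifiers": [("gene", "abrB"), ("locus_tag", "OB0045")]},
    {"key": "CDS", "intervals": [(45081, 44806)],
     "qualifiers": [("product", "AbrB"), ("protein_id", "gnl|ncbi|OB0045")]},
]
table_text = format_feature_table("OB_HTE831", abrb)
```

Keeping qualifiers as an ordered list of pairs, rather than a dictionary, allows repeated qualifiers such as the two product lines used for bifunctional proteins.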
Figure 3: GenBank flatfile

LOCUS       OB_HTE831            3630528 bp    DNA     circular BCT 11-DEC-2002
DEFINITION  Oceanobacillus iheyensis, complete genome.
ACCESSION
VERSION
KEYWORDS    .
SOURCE      Oceanobacillus iheyensis
  ORGANISM  Oceanobacillus iheyensis
            Bacteria; Firmicutes; Bacillales; Oceanobacillus.
REFERENCE   1  (bases 1 to 3630528)
  AUTHORS   Takami,H., Takaki,Y. and Uchiyama,I.
  TITLE     Genome sequence of Oceanobacillus iheyensis isolated from the
            Iheya Ridge and its unexpected adaptive capabilities to extreme
            environments
  JOURNAL   Nucleic Acids Res. 30 (18), 3927-3935 (2002)
  MEDLINE   22220767
   PUBMED   12235376
REFERENCE   2  (bases 1 to 3630528)
  AUTHORS   Takami,H., Takaki,Y. and Chee,G.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-DEC-2001) Hideto Takami, Japan Marine Science and
            Technology Center, Deep-sea Microorganisms Research Group; 2-15
            Natsushima-cho, Yokosuka, Kanagawa 237-0061, Japan
FEATURES             Location/Qualifiers
     source          1..3630528
                     /organism="Oceanobacillus iheyensis"
                     /strain="HTE831"
                     /db_xref="taxon:182710"
     gene            1830..2966
                     /gene="dnaN"
                     /locus_tag="OB0002"
     CDS             1830..2966
                     /gene="dnaN"
                     /EC_number="2.7.7.7"
                     /codon_start=1
                     /transl_table=11
                     /product="DNA-directed DNA polymerase III beta chain"
                     /translation="MRFTIQRDKLINGVSNVMKAISARTVIPILTGMKIEVKNHGVTL
                     TGSDSDISIEYYIPIEEDGIVHVENIEEGTIILQAKYFPDIVRKLPESTVDIVVDDQL
                     NVRITSGKAEFNLNGQSAEEYPQLPKVQTENSFELPIDLLKSMIKQTVFAVSTMETRP
                     ILTGVNLKLVDNSLSFTATDSHRLARREIPVSNAPIEISQIVVPGKSLNELNKILGDS
                     EETVEISVTNNQILFRTKHLNFLSRLLDGNYPETSRLIPEQSKTKIQLKTKELLGTID
                     RASLLAKEERNNVVKFNAPGNSMIEISSNSPEVGNVVEEITADQMEGEDVKISFSSKY
                     MIDALKAIEYDEVQIEFTGAMRPFIIRPVGDDSILQLILPVRTY"
     gene            3219..3440
                     /locus_tag="OB0003"
     CDS             3219..3440
                     /locus_tag="OB0003"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /translation="MHEQIQIDTEYITLGQLIKLLNFLESGGMVKTFLQEEGALVNGH
                     LEQRRGRKLYPKDVVEIQGIGSYIVIKED"
     gene            3443..4552
                     /gene="recF"
                     /locus_tag="OB0004"
     CDS             3443..4552
                     /gene="recF"
                     /function="DNA repair and genetic recombination"
                     /codon_start=1
                     /transl_table=11
                     /product="RecF"
                     /translation="MHIEKLELTNYRNYDQLEIAFDDQINVIIGENAQGKTNLMEAIY
                     VLSFARSHRTPREKELIQWDKDYAKIEGRITKRNQSIPLQISITSKGKKAKVNHLEQH
                     RLSDYIGSVNVVMFAPEDLTIVKGAPQIRRRFMDMELGQIQPTYIYHLAQYQKVLKQR
                     NHLLKQLQRKPNSDTTMLEVLTDQLIEHASILLERRFIYLELLRKWAQPIHRGISREL
                     EQLEIQYSPSIEVSEDANKEKIGNIYQMKFAEVKQKEIERGTTLAGPHRDDLIFFVNG
                     KDVQTYGSQGQQRTTALSIKLAEIELIYQEVGEYPILLLDDVLSELDDYRQSHLLNTI
                     QGKVQTFVSTTSVEGIHHETLQQAELFRVTDGVVN"
     gene            5109..7034
                     /gene="gyrB"
                     /locus_tag="OB0006"
     CDS             5109..7034
                     /gene="gyrB"
                     /EC_number="5.99.1.3"
                     /codon_start=1
                     /transl_table=11
                     /product="DNA gyrase subunit B"
                     /translation="MSMEDKITENQEYGADQIQVLEGLEAVRKRPGMYIGSTSEKGLH
                     HLVWEIVDNSIDEALAGYCDHIQVVVEEDNSITVKDNGRGIPVDIQQKTGRPALEVIM
                     TVLHAGGKFGGGGYKVSGGLHGVGASVVNALSSELEVYVHRDGKVHFLSFKKGVPDGE
                     IKVIGDTDITGTVTHFRPDTEIFTETTEYNFDTLEQRLRELAFLNKGLKISIEDKRTD
                     REQVTYHYEGGISSYVEFINKNKEVLHEPFFAEGEDQGISVEVAIQYNDGFASNLYSF
                     ANNIHTYEGGSHEVGFRSGLTRIINDYAKKNGLIKDGDSNLSGDDVREGMTTIVSIKH
                     PDPQFEGQTKTKLGNSEVRAITDGVFSEAFSKFLYENPSTAKIIVEKGLMASRARLAA
                     KKARELTRRKSNLEISNLPGKLADCSSRDAAISELYIVEGDSAGGSAKSGRDRHFQAI
                     LPLRGKILNVEKARLDRILSNNEVRAMITALGSGVGEEFDISKARYHKIVIMTDADVD
                     GAHIRTLLLTFFYRYMRPLIEQGYIYIAQPPLYQVKQGKTVNYAYNDKELDRILNEIP
                     KAPKPNIQRYKGLGEMNADQLWDTTMDPDTRTLLQVELSDAIDADQVFDMLMGDKVEP
                     RRIFIEENAQYVKNLDI"
     gene            complement(44806..45081)
                     /gene="abrB"
                     /locus_tag="OB0045"
     CDS             complement(44806..45081)
                     /gene="abrB"
                     /function="transcriptional pleiotropic regulator"
                     /codon_start=1
                     /transl_table=11
                     /product="AbrB"
                     /translation="MKSTGIVRKVDELGRVVIPIELRRTLDIHEKDTMEIYVDNDKIV
                     LKKYKPNMTCQVTGEVSDENLSIANGNLVLSPAGAQILLEEIQSRFK"
     gene            64225..64758
                     /locus_tag="OB0064"
     CDS             64225..64758
                     /locus_tag="OB0064"
                     /function="transcriptional regulator"
                     /codon_start=1
                     /transl_table=11
                     /product="stage V sporulation protein T"
                     /translation="MKATGIVRRIDDLGRVVVPKEIRRTLRIREGDPLEIFVDREGEV
                     ILKKYSPINELGHFAKEYAEALFQSLQTPVMITDRDDVIAVAGESKKEYLNKPISNAI
                     ADTIEGRSQVFEVDTKSMEIIDGQEQQLQSYCIHPVIANGDPIGCVLIFSKEEKLSKI
                     EQKAAETASTFLAKQME"
     gene            84524..85393
                     /locus_tag="OB0082"
     CDS             84524..85393
                     /locus_tag="OB0082"
                     /note="heat shock protein 33"
                     /codon_start=1
                     /transl_table=11
                     /product="chaperonin"
                     /translation="MKDYLIKATANNGKIRAYAVQSTNTIEEARRRQDTFATASAALG
                     RTITITAMMGAMLKGDDSITTKVMGNGPLGAIVADADADGHVRGYVTNPHVDFDLNDK
                     GKLDVARAVGTEGNISVIKDLGLKDFFTGETPIVSGEISEDFTYYYATSEQLPSAVGA
                     GVLVNPDHTILAAGGFIVQVMPGAEEEVINELEDQIQAIPAISSLIREGKSPEEILTQ
                     LFGEECLTIHEKMPIEFRCKCSKDRLAQAIIGLGNDEIQAMIEEDQGAEATCHFCNEK
                     YHFTEEELEDLKQ"
     gene            89569..91050
                     /locus_tag="OB0088"
     CDS             89569..91050
                     /locus_tag="OB0088"
                     /EC_number="6.1.1.6"
                     /codon_start=1
                     /transl_table=11
                     /product="lysine-tRNA ligase"
                     /translation="MSEELNEHMQVRRDKLAEHMEKGLDPFGGKFERSHQATDLIEKY
                     DSYSKEELEETTDEVTIAGRLMTKRGKGKAGFAHIQDLSGQIQLYVRKDMIGDDAYEV
                     FKSADLGDIVGVTGVMFKTNVGEISVKAKQFQLLTKSLRPLPEKYHGLKDIEQRYRQR
                     YLDLITNPDSRGTFVSRSKIIQSMREYLNGQGFLEVETPMMHSIPGGASARPFITHHN
                     ALDIELYMRIAIELHLKRLMVGGLEKVYEIGRVFRNEGVSTRHNPEFTMIELYEAYAD
                     YHDIMELTENLVAHIAKQVHGSTTITYGEHEINLEPKWTRLHIVDAVKDATGVDFWKE
                     VSDEEARALAKEHGVQVTESMSYGHVVNEFFEQKVEETLIQPTFIHGHPVEISPLAKK
                     NKEDERFTDRFELFIVGREHANAFSELNDPIDQRARFEAQVKERAEGNDEAHYMDEDF
                     LEALEYGMPPTGGLGIGVDRLVMLLTNSPSIRDVLLFPQMRTK"
     gene            91493..96462
                     /gene="rrnA"
                     /locus_tag="OB3546"
     rRNA            91493..93058
                     /gene="rrnA"
                     /product="16S ribosomal RNA"
     rRNA            93292..96213
                     /gene="rrnA"
                     /product="23S ribosomal RNA"
     rRNA            96347..96462
                     /gene="rrnA"
                     /product="5S ribosomal RNA"
     gene            96468..96744
                     /gene="trnC"
                     /locus_tag="OB3547"
     tRNA            96468..96543
                     /gene="trnC"
                     /product="tRNA-Val"
     tRNA            96545..96620
                     /gene="trnC"
                     /product="tRNA-Thr"
     tRNA            96669..96744
                     /gene="trnC"
                     /product="tRNA-Lys"
     gene            complement(1914066..1914923)
                     /gene="folD"
                     /locus_tag="OB1880"
     CDS             complement(1914066..1914923)
                     /gene="folD"
                     /EC_number="1.5.1.5"
                     /EC_number="3.5.4.9"
                     /note="bifunctional; methenyltetrahydrofolate
                     cyclohydrolase"
                     /codon_start=1
                     /transl_table=11
                     /product="methylenetetrahydrofolate dehydrogenase (NADP+)"
                     /translation="MATLLNGKELSEELKQKMKIEVDELKEKGLTPHLTVILVGDNPA
                     SKSYVKGKEKACAVTGISSNLIELPENISQDELLQIIDEQNNDDSVHGILVQLPLPDQ
                     MDEQKIIHAISPAKDVDGFHPINVGKMMTGEDTFIPCTPYGILTMLRSKDISLEGKHA
                     VIIGRSNIVGKPIGLLLLQENATVTYTHSRTKNLQEITKQADILIVAIGRAHAINADY
                     IKEDAVVIDVGINRKDDGKLTGDVDFESAEQKASYITPVPRGVGPMTITMLLKNTIKA
                     AKGLNDVER"
BASE COUNT  1165552 a 648314 c 647106 g 1169556 t
ORIGIN
        1 actttcaaaa aaatcagcgt aaaaaacata ctaatttggg caaattccca cctgttttta
       61 gggacatttt tctttgaatt agagcctcag cagctcgtca ttgctgaatt ttcttgaagt

For an additional example, see GenBank Accession Number AE014016.
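
The relationship between the table intervals in Figure 2 and the flatfile locations in Figure 3 can be sketched in a few lines. The code below is an illustration only, not Sequin's implementation: the function names are hypothetical, mixed-strand features are rejected, and the treatment of the partial symbols '<' and '>' follows general GenBank location conventions, which is an assumption rather than something this guide spells out.

```python
# A partial symbol flips when an interval is rewritten in ascending order
# for the complementary strand (assumption based on GenBank conventions).
_FLIP = {"<": ">", ">": "<", "": ""}


def _split(coord):
    """Split a table coordinate such as '<1830' or 2966 into (symbol, position)."""
    text = str(coord)
    symbol = text[0] if text[0] in "<>" else ""
    return symbol, int(text[len(symbol):])


def _span(start, stop):
    """Return (is_minus_strand, 'low..high') for one table interval."""
    (start_sym, start_pos), (stop_sym, stop_pos) = _split(start), _split(stop)
    if start_pos <= stop_pos:
        return False, f"{start_sym}{start_pos}..{stop_sym}{stop_pos}"
    # Reversed interval: the feature lies on the complementary strand.
    return True, f"{_FLIP[stop_sym]}{stop_pos}..{_FLIP[start_sym]}{start_pos}"


def feature_location(intervals):
    """Build a flatfile-style location string from a list of (start, stop) pairs."""
    spans = [_span(start, stop) for start, stop in intervals]
    strands = {minus for minus, _ in spans}
    if len(strands) != 1:
        raise ValueError("empty or mixed-strand features are not handled in this sketch")
    minus = strands.pop()
    texts = [text for _, text in spans]
    if minus:
        texts.reverse()  # list the spans in ascending coordinate order
    location = texts[0] if len(texts) == 1 else "join(" + ",".join(texts) + ")"
    return f"complement({location})" if minus else location


abrb_location = feature_location([(45081, 44806)])
slippage_location = feature_location([(333255, 333181), (333179, 332157)])
```

With the abrB interval from Figure 2 the function returns the complement(...) form shown in Figure 3, and the two-interval ribosomal slippage example becomes a join wrapped in complement.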
Revised June 9, 2003