Submission of complete genomes or other large sequence records
Sequin can read a five-column, tab-delimited table of feature locations and qualifiers. The table is read into Sequin, along with the DNA sequence (in FASTA or PHRAP/ACE format) and the submitter information, and the record is then ready for submission to GenBank. Even very long chromosomes (e.g., 20 Mb) can be processed in seconds.
The feature table format allows different kinds of features (e.g., gene, mRNA, coding region, tRNA) and qualifiers (e.g., /product, /note) to be indicated. (Systematic sequencing tracking numbers used in genome-level annotation are entered as locus tags.) It also allows multiple intervals per feature to represent the multiple exons in a spliced mRNA, for example. Once the sequence and annotation have been read in, Sequin's validator can be used to check for internal stops in coding regions and use of consensus splice sites. Coding region (CDS), gene, and mRNA feature intervals are also checked for mutual consistency.
The entire process can be automated for multiple chromosomes in a genome with the utility tbl2asn, which produces ASN.1 from pairs of table and sequence files.
Sequin can read nucleotide sequences of any size in FASTA format. FASTA format consists of a single definition line, beginning with a ">" followed by optional text, and subsequent lines of sequence. At minimum, all definition lines must contain an identifier for the sequence, called the SeqId. Other information about the biological source of the organism can also be encoded on the definition line of the sequence and is used to build the record. A sample definition line is
>Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] [chromosome=XVI]
Common source modifiers on the definition line include [strain=yyy] and [chromosome=nnn]. Note that there are no spaces around the equal sign. Complete circular genomes should include the modifier [topology=circular]. Sequences located in the mitochondrion or chloroplast need the modifier [location=mitochondrion] or [location=chloroplast]. A complete list of modifiers is available in the Source and Organism subpage sections of the Sequin help document.
An example of a FASTA-formatted sequence is shown in Figure 1.
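The definition-line layout described above can be illustrated with a short Python sketch. It is not part of Sequin; the function name parse_defline is hypothetical, and the sketch assumes the SeqId is the first whitespace-delimited token after ">" and that modifiers are written as [key=value] with no spaces around the equal sign.

```python
import re

def parse_defline(line):
    """Split a FASTA definition line into (seq_id, modifiers).

    Hypothetical helper for illustration only; Sequin's own parser
    may accept more variations than this sketch does.
    """
    if not line.startswith(">"):
        raise ValueError("definition line must begin with '>'")
    body = line[1:].strip()
    if not body or body.startswith("["):
        raise ValueError("definition line has no SeqId")
    # The SeqId is taken to be the first token after '>'.
    parts = body.split(None, 1)
    seq_id = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    # Collect every [key=value] pair; values may contain spaces.
    modifiers = dict(re.findall(r"\[([^=\[\]]+)=([^\[\]]*)\]", rest))
    return seq_id, modifiers

seq_id, modifiers = parse_defline(
    ">Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] [chromosome=XVI]"
)
print(seq_id)
print(modifiers)
```

Run on the sample definition line, the sketch separates the SeqId Sc_16 from the organism, strain, and chromosome modifiers.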
Sequin reads features from a five-column, tab-delimited table. The feature table specifies the location and type of each feature, and Sequin processes the feature intervals and translates any CDS features into proteins. The first line of the table contains the following basic information.
>Features SeqId table_name
The SeqId must be the same as that used on the sequence. The table_name is optional. Subsequent lines of the table list the features. Columns are separated by tabs.
Column 1: Start location of feature
Column 2: Stop location of feature
Column 3: Feature key
Column 4: Qualifier key
Column 5: Qualifier value
Figure 2 shows a sample table and illustrates a number of points about the table format. The GenBank flatfile corresponding to this table is shown in Figure 3, and a graphical overview of the features in this example is in Figure 4.
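To make the table conventions concrete, the following Python sketch reads a small excerpt of the sample table and prints each feature with a GenBank-style location. It is an illustration under stated assumptions, not Sequin's implementation: the names Feature, parse_feature_table, and genbank_location are hypothetical; columns are assumed to be separated by literal tabs; a start greater than the stop is taken to mean the minus strand; and an [offset=N] line is assumed to add N to every coordinate that follows it. The header test accepts either ">Feature" or ">Features".

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    key: str
    intervals: list = field(default_factory=list)   # ((mark, start), (mark, stop))
    qualifiers: list = field(default_factory=list)  # (qualifier key, value)

def _shift(pos, offset):
    """Apply the offset to one coordinate, keeping a leading '<' or '>'."""
    mark = pos[0] if pos[0] in "<>" else ""
    return mark, int(pos[len(mark):]) + offset

def parse_feature_table(text):
    """Return (seq_id, features) from five-column, tab-delimited text."""
    seq_id, features, offset = None, [], 0
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped:
            continue
        if raw.startswith(">Feature"):
            seq_id = raw.split()[1]
            continue
        if stripped.startswith("[offset="):
            offset = int(stripped[len("[offset="):-1])
            continue
        cols = raw.split("\t")
        cols += [""] * (5 - len(cols))
        start, stop, key, qkey, qval = cols[:5]
        if start and stop:
            interval = (_shift(start, offset), _shift(stop, offset))
            if key:                      # columns 1-3: a new feature
                features.append(Feature(key, [interval]))
            else:                        # columns 1-2 only: another interval
                features[-1].intervals.append(interval)
        elif qkey:                       # columns 4-5: a qualifier
            features[-1].qualifiers.append((qkey, qval))
    return seq_id, features

def genbank_location(feature):
    """Render the intervals as a flatfile-style location string."""
    (_, first_start), (_, first_stop) = feature.intervals[0]
    minus = first_start > first_stop
    flip = {"<": ">", ">": "<", "": ""}
    parts = []
    for (ms, s), (me, e) in feature.intervals:
        if minus:  # swap the ends and mirror any partial marks
            parts.append(f"{flip[me]}{e}..{flip[ms]}{s}")
        else:
            parts.append(f"{ms}{s}..{me}{e}")
    if minus:
        parts.reverse()  # list intervals in ascending coordinate order
    loc = parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"
    return f"complement({loc})" if minus else loc

sample = (
    ">Feature Sc_16\n"
    "<1\t1009\tCDS\n"
    "\t\t\tproduct\tacid trehalase\n"
    "\t\t\tcodon_start\t2\n"
    "[offset=2000]\n"
    "1253\t420\tgene\n"
    "\t\t\tlocus_tag\tYPR027C\n"
    "2626\t2590\ttRNA\n"
    "2570\t2535\n"
    "\t\t\tproduct\ttRNA-Phe\n"
    "3522\t3572\tCDS\n"
    "3706\t4197\n"
    "\t\t\tproduct\tYip2p\n"
)

seq_id, features = parse_feature_table(sample)
locations = [(f.key, genbank_location(f)) for f in features]
for key, loc in locations:
    print(key, loc)
```

Under these assumptions the excerpt yields the same locations as the corresponding flatfile entries: the partial CDS stays at <1..1009, while the features after the offset line are shifted by 2000 and the reversed intervals are reported as complement locations.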
Note that when annotating complete genomes, systematic gene names and protein identifiers are required.
When submitting a complete bacterial genome, please review the genome guidelines.
tbl2asn is a command-line program that automates parts of the submission process. It is packaged with the Sequin archive (Sequin version 3.30 and above). tbl2asn reads a template, along with the sequence and table files, and outputs ASN.1 for submission to GenBank. Thus, the submitter does not need to read each set of table and sequence files into Sequin. Details about using tbl2asn can be found in the tbl2asn documentation.
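Because tbl2asn works on pairs of sequence and table files, it can help to confirm before a batch run that every sequence file has a matching table and that both use the same SeqId. The Python sketch below is a pre-flight check only and does not invoke tbl2asn. The function names are hypothetical, and the .fsa and .tbl suffixes and the one-pair-per-basename layout are assumptions about how the files are organized.

```python
from pathlib import Path

def pair_inputs(directory, seq_suffix=".fsa", tbl_suffix=".tbl"):
    """Pair sequence and table files that share a basename.

    Returns (pairs, unmatched): a list of (sequence, table) paths and
    the basenames that have only one of the two files.
    """
    directory = Path(directory)
    seqs = {p.stem: p for p in directory.glob(f"*{seq_suffix}")}
    tbls = {p.stem: p for p in directory.glob(f"*{tbl_suffix}")}
    pairs = [(seqs[n], tbls[n]) for n in sorted(seqs.keys() & tbls.keys())]
    unmatched = sorted(seqs.keys() ^ tbls.keys())
    return pairs, unmatched

def fasta_seq_id(path):
    """SeqId from the first line of a FASTA file ('>SeqId ...')."""
    with open(path) as handle:
        return handle.readline()[1:].split()[0]

def table_seq_id(path):
    """SeqId from the first line of a feature table ('>Feature SeqId ...')."""
    with open(path) as handle:
        return handle.readline().split()[1]

def seq_ids_agree(seq_path, tbl_path):
    """True when the table refers to the same SeqId as the sequence."""
    return fasta_seq_id(seq_path) == table_seq_id(tbl_path)

def report(directory):
    """Print one status line per pair and per unmatched basename."""
    pairs, unmatched = pair_inputs(directory)
    for seq_path, tbl_path in pairs:
        status = "ok" if seq_ids_agree(seq_path, tbl_path) else "SeqId mismatch"
        print(f"{seq_path.name} + {tbl_path.name}: {status}")
    for name in unmatched:
        print(f"{name}: missing sequence or table file")
```

Calling report() on a directory of inputs lists each pair and flags basenames that lack a partner. Both SeqId helpers assume well-formed first lines and will raise an error on an empty file.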
Figure 1. FASTA-formatted sequence.
>Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] [chromosome=XVI]
CGACCACAATGGTACGATTGTTCATAAATCAGGAGATGTTCCTATTCATATAAAGATACC
AAACAGATCTCTAATACATGACCAGGATATCAACTTCTATAATGGTTCCGAAAACGAAAG
AAAACCAAATCTAGAGCGTAGAGACGTCGACCGTGTTGGTGATCCAATGAGGATGGATAG
[etc.]
Figure 2. Sample feature table.
>Feature Sc_16
1	7000	REFERENCE
			PubMed	8849441
<1	1050	gene
			gene	ATH1
			locus_tag	YPR026W
<1	1009	CDS
			product	acid trehalase
			product	Ath1p
			codon_start	2
			protein_id	gnl|SGD|S0006230
<1	1050	mRNA
			product	acid trehalase
[offset=2000]
1253	420	gene
			locus_tag	YPR027C
1253	420	CDS
			product	Ypr027cp
			note	hypothetical protein
			protein_id	gnl|SGD|S0006231
1253	420	mRNA
			product	Ypr027cp
2626	2590	tRNA
2570	2535
			gene	-
			product	tRNA-Phe
3450	4536	gene
			gene	YIP2
			locus_tag	YPR028W
3522	3572	CDS
3706	4197
			product	Yip2p
			prot_desc	similar to human polyposis locus protein 1 (YPD)
			protein_id	gnl|SGD|S0006232
3450	3572	mRNA
3706	4536
			product	Yip2p
Figure 3. GenBank flatfile corresponding to the sample feature table.
LOCUS       Sc_16        7000 bp    DNA             PLN       08-MAY-2000
DEFINITION  Saccharomyces cerevisiae chromosome XVI strain S288C.
ACCESSION   Sc_16
VERSION
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 7000)
  AUTHORS   Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B.,
            Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M.,
            Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H.
            and Oliver,S.G.
  TITLE     Life with 6000 genes
  JOURNAL   Science 274 (5287), 546 (1996)
  PUBMED    8849441
REFERENCE   2  (bases 1 to 7000)
  AUTHORS   Ouellette,B.F.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-MAY-2000) NCBI/NLM, National Institutes of Health,
            Building 38A, Room 8N805, Bethesda, MD 20894, USA
FEATURES             Location/Qualifiers
     source          1..7000
                     /organism="Saccharomyces cerevisiae"
                     /strain="S288C"
                     /chromosome="XVI"
     mRNA            <1..1050
                     /gene="ATH1"
                     /product="acid trehalase"
     gene            <1..1050
                     /gene="ATH1"
                     /locus_tag="YPR026W"
     CDS             <1..1009
                     /gene="ATH1"
                     /note="Ath1p"
                     /codon_start=2
                     /product="acid trehalase"
                     /translation="DHNGTIVHKSGDVPIHIKIPNRSLIHDQDINFYNGSENERKPNL
                     ERRDVDRVGDPMRMDRYGTYYLLKPKQELTVQLFKPGLNARNNIAENKQITNLTAGVP
                     GDVAFSALDGNNYTHWQPLDKIHRAKLLIDLGEYNEKEITKGMILWGQRPAKNISISI
                     LPHSEKVENLFANVTEIMQNSGNDQLLNETIGQLLDNAGIPVENVIDFDGIEQEDDES
                     LDDVQALLHWKKEDLAKLIEQIPRLNFLKRKFVKILDNVPVSPSEPYYEASRNQSLIE
                     ILPSNRTTFTIDYDKLQVGDKGNTDWRKTRYIVVAVQGVYDDYDDDNKGATIKEIVLN
                     D"
     mRNA            complement(2420..3253)
                     /locus_tag="YPR027C"
                     /product="Ypr027cp"
     gene            complement(2420..3253)
                     /locus_tag="YPR027C"
     CDS             complement(2420..3253)
                     /locus_tag="YPR027C"
                     /note="hypothetical protein"
                     /codon_start=1
                     /product="Ypr027cp"
                     /translation="MVGIYRILASFVPLLGLLFAFHDDDMIDTVTIIKTVYETVTSTS
                     TAPAPAATKSVSEKKLDDTKLTLQVIQTMVSCFSVGENPANMISCGLGVVILMFSLII
                     ELINKLENDGINEPQRLYDLIKPKYVELPSNYVNEKIKTTFEPLDLYLGVNMNTSGSE
                     LNQNCLILKLGEKTALPFPGLAQQICYTKGASNEFTNYKLSDIQGNLNENSQGIANGV
                     FQKISNIRKISGNFKSQLYQISEKITDENWDGSAVGFTAHGREKGPNKSQISVSFYRD
                     N"
     tRNA            complement(join(4535..4570,4590..4626))
                     /product="tRNA-Phe"
     mRNA            join(5450..5572,5706..6536)
                     /gene="YIP2"
                     /product="Yip2p"
     gene            5450..6536
                     /gene="YIP2"
                     /locus_tag="YPR028W"
     CDS             join(5522..5572,5706..6197)
                     /gene="YIP2"
                     /note="similar to human polyposis locus protein 1 (YPD)"
                     /codon_start=1
                     /product="Yip2p"
                     /translation="MSEYASSIHSQMKQFDTKYSGNRILQQLENKTNLPKSYLVAGLG
                     FAYLLLIFINVGGVGEILSNFAGFVLPAYLSLVALKTPTSTDDTQLLTYWIVFSFLSV
                     IEFWSKAILYLIPFYWFLKTVFLIYIALPQTGGARMIYQKIVAPLTDRYILRDVSKTE
                     KDEIRASVNEASKATGASVH"
BASE COUNT     2201 a   1276 c   1255 g   2268 t
ORIGIN
        1 cgaccacaat ggtacgattg ttcataaatc aggagatgtt cctattcata taaagatacc
       61 aaacagatct ctaatacatg accaggatat caacttctat aatggttccg aaaacgaaag
      121 aaaaccaaat ctagagcgta gagacgtcga ccgtgttggt gatccaatga ggatggatag
[etc.]
Revised June 14, 2004.