Submission of complete genomes or other large sequence records

Introduction

Sequin can read a five-column, tab-delimited table of feature locations and qualifiers. The table is read into Sequin, along with the DNA sequence (in FASTA or PHRAP/ACE format) and the submitter information, and the record is then ready for submission to GenBank. Even very long chromosomes (e.g., 20 Mb) can be processed in seconds.

The feature table format allows different kinds of features (e.g., gene, mRNA, coding region, tRNA) and qualifiers (e.g., /product, /note) to be indicated. (Systematic sequencing tracking numbers used in genome-level annotation are entered as locus tags.) It also allows multiple intervals per feature to represent the multiple exons in a spliced mRNA, for example. Once the sequence and annotation have been read in, Sequin's validator can be used to check for internal stops in coding regions and use of consensus splice sites. Coding region (CDS), gene, and mRNA feature intervals are also checked for mutual consistency.
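
As a rough illustration of what "mutual consistency" between gene, mRNA, and CDS intervals can mean, the sketch below tests whether one feature's span lies inside another's, using the YIP2 coordinates from Figure 2. The helper names `span` and `contains` are hypothetical, and this is not Sequin's validator, only a simplified containment check on plus-strand intervals.

```python
def span(intervals):
    """Return the (lowest, highest) coordinate covered by a list of intervals."""
    coords = [c for interval in intervals for c in interval]
    return min(coords), max(coords)


def contains(outer, inner):
    """True when the span of `inner` lies entirely within the span of `outer`."""
    outer_lo, outer_hi = span(outer)
    inner_lo, inner_hi = span(inner)
    return outer_lo <= inner_lo and inner_hi <= outer_hi


# YIP2 intervals as listed in Figure 2 (before the offset is applied).
gene = [(3450, 4536)]
mrna = [(3450, 3572), (3706, 4536)]
cds = [(3522, 3572), (3706, 4197)]

print(contains(gene, mrna))  # the mRNA should fall within the gene
print(contains(mrna, cds))   # the CDS should fall within the mRNA
```

A fuller check would also compare the individual exon boundaries, which this sketch deliberately leaves out.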

The entire process can be automated for multiple chromosomes in a genome with the utility tbl2asn, which produces ASN.1 from pairs of table and sequence files.

FASTA-formatted sequence

Sequin can read nucleotide sequences of any size in FASTA format. A FASTA-formatted sequence consists of a single definition line, which begins with ">", followed by lines of sequence. At minimum, every definition line must contain an identifier for the sequence, called the SeqId. Other information about the biological source can also be encoded on the definition line and is used to build the record. A sample definition line is

>Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] [chromosome=XVI]

Common source modifiers incorporated into the definition line include [strain=yyy] and [chromosome=nnn]. Note that there are no spaces surrounding the equals sign. Complete circular genomes should include the modifier [topology=circular]. Sequences located in the mitochondrion or chloroplast need a modifier of [location=mitochondrion] or [location=chloroplast]. A complete list of modifiers is available in the Source and Organism sections of the Sequin help document.
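
To make the definition-line layout concrete, here is a minimal sketch that splits a line into its SeqId and its [key=value] modifiers. The function name `parse_defline` and the regular expression are assumptions made for illustration; Sequin's own parser may accept or reject inputs differently.

```python
import re

# Matches one [key=value] source modifier on a FASTA definition line.
MODIFIER = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_defline(line):
    """Split a FASTA definition line into its SeqId and a dict of modifiers."""
    if not line.startswith(">"):
        raise ValueError("definition line must begin with '>'")
    body = line[1:].strip()
    # The SeqId is the first whitespace-delimited token after '>'.
    if not body or body.startswith("["):
        raise ValueError("definition line has no SeqId")
    seq_id = body.split()[0]
    modifiers = {key.strip(): value.strip() for key, value in MODIFIER.findall(body)}
    return seq_id, modifiers


seq_id, mods = parse_defline(
    ">Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] [chromosome=XVI]"
)
print(seq_id)
print(mods)
```

Run on the sample definition line, this yields the SeqId Sc_16 and the three modifiers organism, strain, and chromosome.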

An example of a FASTA-formatted sequence is shown in Figure 1.

Table Layout

Sequin reads features from a five-column, tab-delimited table. The feature table specifies the location and type of each feature, and Sequin processes the feature intervals and translates any CDS features into proteins. The first line of the table contains the following basic information.

>Feature SeqId table_name

The SeqId must be the same as that used on the sequence. The table_name is optional. Subsequent lines of the table list the features. Columns are separated by tabs.
Column 1: Start location of feature
Column 2: Stop location of feature
Column 3: Feature key
Column 4: Qualifier key
Column 5: Qualifier value

Figure 2 shows a sample table and illustrates a number of points about the table format. The GenBank flatfile corresponding to this table is shown in Figure 3, and a graphical overview of the features in this example is in Figure 4.
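
The column layout can be read with a few lines of code. The following is a minimal sketch, assuming that a line with columns 1 to 3 filled starts a new feature, a line with only columns 1 and 2 adds an interval to the current feature, and a line with columns 1 to 3 empty carries a qualifier. The function name `parse_feature_table` is hypothetical, and bracketed directives such as [offset=2000] are skipped rather than interpreted.

```python
def parse_feature_table(text):
    """Parse a five-column feature table into (seq_id, list of feature dicts)."""
    seq_id = None
    features = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue
        if line.startswith(">Feature"):
            parts = line.split()
            seq_id = parts[1] if len(parts) > 1 else None
            continue
        if line.startswith("["):
            continue  # directives such as [offset=2000] are ignored in this sketch
        cols = line.split("\t")
        if cols[0].strip():
            # Columns 1 and 2 are filled: an interval line.
            start = cols[0].strip()
            stop = cols[1].strip() if len(cols) > 1 else ""
            key = cols[2].strip() if len(cols) > 2 else ""
            if key:
                features.append({"key": key, "intervals": [(start, stop)], "qualifiers": []})
            elif features:
                features[-1]["intervals"].append((start, stop))
            else:
                raise ValueError("interval line appears before any feature")
        else:
            # Columns 1 to 3 are empty: a qualifier line (extra tabs are tolerated).
            fields = [c.strip() for c in cols[3:] if c.strip()]
            if not fields:
                continue
            if not features:
                raise ValueError("qualifier line appears before any feature")
            value = fields[1] if len(fields) > 1 else ""
            features[-1]["qualifiers"].append((fields[0], value))
    return seq_id, features


# A short excerpt of the sample table, with tabs written explicitly.
SAMPLE = "\n".join([
    ">Feature Sc_16",
    "<1\t1050\tgene",
    "\t\t\tgene\tATH1",
    "\t\t\tlocus_tag\tYPR026W",
    "3522\t3572\tCDS",
    "3706\t4197",
    "\t\t\tproduct\tYip2p",
])

table_seq_id, table_features = parse_feature_table(SAMPLE)
for feature in table_features:
    print(feature["key"], feature["intervals"], feature["qualifiers"])
```

Coordinates are kept as strings so that partial markers such as "<1" survive the parse unchanged.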

Single Record Submissions

Single submissions can be treated like a new Sequin submission. On the starting Sequin page, click on "Start New Submission", and fill out the Submitting Authors form. On the Organism and Sequences form, indicate that this is a Single sequence in FASTA format, choose the Organism and Molecule, indicate that the FASTA definition line starts with a sequence ID, and import the nucleotide sequence. Do not import protein sequence. After you create the initial record containing the sequence, import the table by choosing it with Sequin's File-->Open command. The features listed in the table will be immediately displayed on the flatfile view. Carry out any desired editing, then choose Search-->Validate to check for errors in the record, and fix any mistakes. Use the File-->Save As command to save the record in ASN.1 format. It is this ASN.1 version that should be submitted to GenBank. Additional tips on using Sequin are found in the Sequin Quick Guide.

Multiple Record Submissions

tbl2asn is a command-line program that automates parts of the submission process. It is packaged with the Sequin archive (Sequin version 3.30 and above). tbl2asn reads a template, along with the sequence and table files, and outputs ASN.1 for submission to GenBank. Thus, the submitter does not need to read each set of table and sequence files into Sequin. Details about using tbl2asn can be found in the tbl2asn documentation.
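
Because tbl2asn works on pairs of sequence and table files, it can help to confirm beforehand that every sequence file has a table and that the two agree on the SeqId. The sketch below is one way to do that. It assumes that paired files share a base name and use the suffixes .fsa and .tbl; the function names `first_seq_id` and `check_pairs` are hypothetical, and the script does not run tbl2asn itself.

```python
from pathlib import Path


def first_seq_id(path, prefix):
    """Return the SeqId on the first line of `path` that starts with `prefix`."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(prefix):
                fields = line[len(prefix):].split()
                return fields[0] if fields else None
    return None


def check_pairs(directory, seq_suffix=".fsa", table_suffix=".tbl"):
    """List problems that would keep sequence and table files from pairing up."""
    problems = []
    for seq_path in sorted(Path(directory).glob("*" + seq_suffix)):
        table_path = seq_path.with_suffix(table_suffix)
        if not table_path.exists():
            problems.append(f"{seq_path.name}: no matching {table_path.name}")
            continue
        # The SeqId on the FASTA definition line must match the table header.
        fasta_id = first_seq_id(seq_path, ">")
        table_id = first_seq_id(table_path, ">Feature")
        if fasta_id != table_id:
            problems.append(
                f"{seq_path.name}: SeqId {fasta_id!r} differs from table SeqId {table_id!r}"
            )
    return problems
```

Calling `check_pairs(".")` in a directory of chromosome files returns an empty list when every pair is consistent.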

 

Figure 1: Sample FASTA-formatted sequence:

>Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] [chromosome=XVI]
CGACCACAATGGTACGATTGTTCATAAATCAGGAGATGTTCCTATTCATATAAAGATACC
AAACAGATCTCTAATACATGACCAGGATATCAACTTCTATAATGGTTCCGAAAACGAAAG
AAAACCAAATCTAGAGCGTAGAGACGTCGACCGTGTTGGTGATCCAATGAGGATGGATAG
[etc.]

 

Figure 2: Sequin table format:

>Feature Sc_16
1	7000	REFERENCE
			PubMed		8849441
<1	1050	gene
			gene		ATH1
			locus_tag	YPR026W
<1	1009	CDS
			product		acid trehalase
			product		Ath1p
			codon_start	2
			protein_id	gnl|SGD|S0006230
<1	1050	mRNA
			product		acid trehalase
[offset=2000]
1253	420	gene
			locus_tag	YPR027C
1253	420	CDS
			product		Ypr027cp
			note		hypothetical protein
			protein_id	gnl|SGD|S0006231
1253	420	mRNA
			product		Ypr027cp
2626	2590	tRNA
2570	2535
			gene		-
			product		tRNA-Phe
3450	4536	gene
			gene		YIP2
			locus_tag	YPR028W
3522	3572	CDS
3706	4197
			product		Yip2p
			prot_desc	similar to human polyposis locus protein 1 (YPD)
			protein_id	gnl|SGD|S0006232
3450	3572	mRNA
3706	4536
			product		Yip2p	
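
Comparing this table with the flatfile in Figure 3 suggests how intervals become locations: the [offset=2000] value is added to every later coordinate, an interval whose start exceeds its stop lies on the complementary strand, and multiple intervals are joined. The sketch below reproduces that mapping. The function name `to_location` is hypothetical, and the handling of "<" and ">" partial markers on the complementary strand is an assumption rather than something the figures demonstrate.

```python
def _coord(text, offset):
    """Parse one table coordinate, returning (position + offset, is_partial)."""
    partial = text[0] in "<>"
    return int(text.lstrip("<>")) + offset, partial


def to_location(intervals, offset=0):
    """Build a GenBank-style location string from (start, stop) table intervals."""
    spans = []
    minus = False
    for start, stop in intervals:
        low, low_partial = _coord(start, offset)
        high, high_partial = _coord(stop, offset)
        if low > high:
            # Start greater than stop: the feature is on the complementary strand.
            minus = True
            low, low_partial, high, high_partial = high, high_partial, low, low_partial
        text = ("<" if low_partial else "") + str(low) + ".."
        text += (">" if high_partial else "") + str(high)
        spans.append((low, text))
    spans.sort()  # locations are written in ascending coordinate order
    parts = [text for _, text in spans]
    location = parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"
    return "complement(" + location + ")" if minus else location


print(to_location([("<1", "1050")]))
print(to_location([("1253", "420")], offset=2000))
print(to_location([("2626", "2590"), ("2570", "2535")], offset=2000))
print(to_location([("3522", "3572"), ("3706", "4197")], offset=2000))
```

The four calls correspond to the ATH1 gene, the YPR027C gene, the tRNA-Phe feature, and the YIP2 CDS, and their results can be compared with the locations printed in the flatfile.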

 

Figure 3: GenBank flatfile

LOCUS       Sc_16        7000 bp    DNA             PLN       08-MAY-2000
DEFINITION  Saccharomyces cerevisiae chromosome XVI strain S288C.
ACCESSION   Sc_16
VERSION
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 7000)
  AUTHORS   Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B.,
            Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M.,
            Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H. and
            Oliver,S.G.
  TITLE     Life with 6000 genes
  JOURNAL   Science 274 (5287), 546 (1996)
   PUBMED   8849441
REFERENCE   2  (bases 1 to 7000)
  AUTHORS   Ouellette,B.F.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-MAY-2000) NCBI/NLM, National Institutes of Health,
            Building 38A, Room 8N805, Bethesda, MD 20894, USA
FEATURES             Location/Qualifiers
     source          1..7000
                     /organism="Saccharomyces cerevisiae"
                     /strain="S288C"
                     /chromosome="XVI"
     mRNA            <1..1050
                     /gene="ATH1"
                     /product="acid trehalase"
     gene            <1..1050
                     /gene="ATH1"
                     /locus_tag="YPR026W"
     CDS             <1..1009
                     /gene="ATH1"
                     /note="Ath1p"
                     /codon_start=2
                     /product="acid trehalase"
                     /translation="DHNGTIVHKSGDVPIHIKIPNRSLIHDQDINFYNGSENERKPNL
                     ERRDVDRVGDPMRMDRYGTYYLLKPKQELTVQLFKPGLNARNNIAENKQITNLTAGVP
                     GDVAFSALDGNNYTHWQPLDKIHRAKLLIDLGEYNEKEITKGMILWGQRPAKNISISI
                     LPHSEKVENLFANVTEIMQNSGNDQLLNETIGQLLDNAGIPVENVIDFDGIEQEDDES
                     LDDVQALLHWKKEDLAKLIEQIPRLNFLKRKFVKILDNVPVSPSEPYYEASRNQSLIE
                     ILPSNRTTFTIDYDKLQVGDKGNTDWRKTRYIVVAVQGVYDDYDDDNKGATIKEIVLN
                     D"
     mRNA            complement(2420..3253)
                     /locus_tag="YPR027C"
                     /product="Ypr027cp"
     gene            complement(2420..3253)
                     /locus_tag="YPR027C"
     CDS             complement(2420..3253)
                     /locus_tag="YPR027C"
                     /note="hypothetical protein"
                     /codon_start=1
                     /product="Ypr027cp"
                     /translation="MVGIYRILASFVPLLGLLFAFHDDDMIDTVTIIKTVYETVTSTS
                     TAPAPAATKSVSEKKLDDTKLTLQVIQTMVSCFSVGENPANMISCGLGVVILMFSLII
                     ELINKLENDGINEPQRLYDLIKPKYVELPSNYVNEKIKTTFEPLDLYLGVNMNTSGSE
                     LNQNCLILKLGEKTALPFPGLAQQICYTKGASNEFTNYKLSDIQGNLNENSQGIANGV
                     FQKISNIRKISGNFKSQLYQISEKITDENWDGSAVGFTAHGREKGPNKSQISVSFYRD
                     N"
     tRNA            complement(join(4535..4570,4590..4626))
                     /product="tRNA-Phe"
     mRNA            join(5450..5572,5706..6536)
                     /gene="YIP2"
                     /product="Yip2p"
     gene            5450..6536
                     /gene="YIP2"
                     /locus_tag="YPR028W"
     CDS             join(5522..5572,5706..6197)
                     /gene="YIP2"
                     /note="similar to human polyposis locus protein 1 (YPD)"
                     /codon_start=1
                     /product="Yip2p"
                     /translation="MSEYASSIHSQMKQFDTKYSGNRILQQLENKTNLPKSYLVAGLG
                     FAYLLLIFINVGGVGEILSNFAGFVLPAYLSLVALKTPTSTDDTQLLTYWIVFSFLSV
                     IEFWSKAILYLIPFYWFLKTVFLIYIALPQTGGARMIYQKIVAPLTDRYILRDVSKTE
                     KDEIRASVNEASKATGASVH"
BASE COUNT     2201 a   1276 c   1255 g   2268 t
ORIGIN
        1 cgaccacaat ggtacgattg ttcataaatc aggagatgtt cctattcata taaagatacc
       61 aaacagatct ctaatacatg accaggatat caacttctat aatggttccg aaaacgaaag
      121 aaaaccaaat ctagagcgta gagacgtcga ccgtgttggt gatccaatga ggatggatag [etc.]

 

Figure 4: Sequin graphical view

The Sequin graphical view displays the nucleotide sequence as the top bar, with arrows or bars below it representing the different features.

 

Questions or Comments?
Write to the NCBI Service Desk

Revised June 14, 2004.