GCG/helix FAQ: GCG routines

GCG doesn't accept my sequence data format

Your sequence data is not in GCG format. You can use cut-and-paste into Seqed, Readseq or GCG's Reformat to convert it into GCG format on helix, or use GCG-Lite's format conversion tool. These methods are described further below.

Cut-and-paste into Seqed

Log into helix from your desktop computer and start up GCG. Type 'seqed', and enter screen mode. In a separate window, open the file containing your sequence, select it, and paste it into the seqed window. Then exit seqed, saving the file which will now be in GCG format. Check that the entire sequence was cut/pasted and saved. This is a quick-and -dirty method, but is not useful for very long sequences or if you have many sequences.

Readseq

To use Readseq, the data must be in one of the following formats:

   *IG/Stanford, used by Intelligenetics and others
   *GenBank/GB, genbank flatfile format
   *NBRF format
   *EMBL, EMBL flatfile format
   *GCG, single sequence format of GCG software
   *DNAStrider, for common Mac program
   *Fitch format, limited use
   *Pearson/Fasta, common format used by FastA program and
             others
   *Zuker format, limited use. Input only.
   *Olsen, format printed by Olsen VMS sequence editor. Input
             only.
   *Phylip3.2, sequential format for Phylip programs
   *Phylip, interleaved format for Phylip programs (v3.3,
             v3.4)
   *Plain/ Raw, sequence data only (no name, document,
          numbering)
   *MSF multi sequence format used by GCG software
   *PAUP's multiple sequence (NEXUS) format
   *PIR/CODATA format used by PIR

If your data is 'raw' (i.e. it has simply sequence data with no headers, dividers etc.), then be aware that Readseq may not accept it correctly. Your simplest option is to convert it into Fasta format by adding this line to the top of the file:
>>test input sequence
Type 'man readseq' on helix for more information.

It is always worth checking the output file after Readseq! You don't need to examine the whole sequence, just check the beginning, end and length of the sequence. Readseq sometimes doesn't recognize headers properly and includes them in the sequence -- an easy error to notice.

Reformat

In order to use Reformat on sequence files, the files must contain a heading, a dividing line, and a sequence. Type 'genhelp reformat' for more details on the input sequence format. It is a good idea to make a copy of your input sequence before running reformat, as it overwrites the original file. To run the program, type 'reformat filename', and if all goes well you should now have a GCG-formatted sequence in the file. If something doesn't work, see Reformat gave me an empty file or Reformat put the header into the sequence which may help you to troubleshoot.

GCG-Lite

GCG-Lite has a web-based format conversion tool that converts between the formats that Readseq uses. Paste your sequence into the input box, choose an output format, and click on 'Submit Request'. The reformatted sequence will appear in your web browser, where you can save it into a file.

How do I save scores from Pileup runs ?

When you run Pileup interactively, it prints the pairwise similarity scores to the screen. You see output like this:
Determining pairwise similarity scores... 1 x 2 1.19 1 x 3 1.20 //////////////////////// 26 x 28 0.87 27 x 28 0.86
However, the final output file does not contain these scores. If you want to save or print the scores, the quick-and-dirty way is to cut-and-paste them from your screen into a file. An alternative way is to run pileup in batch mode with the '-monitor' option. Since pileup runs can be long, batch mode may be a good idea anyway. When the batch job is done, it will send mail to your account, and if you have the '-monitor' flag, the mail will contain the pairwise scores.

Sample session:
% pileup -batch -monitor PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. PileUp of what sequences ? @hsp2.list 1 BBHSP60 1931 bp 2 BBHSPRO 2116 bp 3 BBOSPAGE 819 bp What is the gap creation penalty (* 5.00 *) ? What is the gap extension penalty (* 0.30 *) ? The minimum density for a one-page plot is 3.3 sequences/100 platen units. What density do you want (* 3.3 *) ? What should I call the output file name (* hsp2.msf *) ? ** pileup will run as a batch or at job. ** pileup was submitted using the command: " atnow " warning: commands will be executed using /bin/sh job 865448280.a at Wed Jun 4 14:18:00 1997

Mail received when batch job finishes:
Date: Wed, 4 Jun 1997 14:18:23 -0400 (EDT) From: Susan Chacko <susanc@helix.nih.gov> Subject: pileup Welcome to the WISCONSIN PACKAGE Version 8.1-UNIX, August 1995 [....etc....] Running pileup using : .. -monitor -INfile1=@hsp2.list -GAPweight=5.000000 -LENgthweight=0.300000 -OUTfile1=hsp2.msf -DENsity=3.333433 -FIGure -Default 1 BBHSP60 1931 bp 2 BBHSPRO 2116 bp 3 BBOSPAGE 819 bp Determining pairwise similarity scores... 1 x 2 0.36 1 x 3 0.37 2 x 3 0.38 Aligning... 1 ........................................-..... 2 .................................................. ..............................................-... ....... Total sequences: 3 Alignment length: 2,120 CPU time: 09.97 Output file:/u5/susanc/hsp2.msf

How do I save the genetic distance scale from Growtree?

The GrowTree program will not display distance information. However, the tree generated by GrowTree uses the distances information in the Distances output file (used as input to the GrowTree program). This file does have distances information. These values are substitutions per 100 residues. Remember that GrowTree grows one tree that is compatible with this information, but it does not grow and evaluate trees.

If you can calculate the scale from the trees file that is output by growtree, you may be able to edit the growtree figure file (which is produced by 'growtree -figure') and manually add the scale. You can get information about editing the figure file from the GCG user guide (or type 'genhelp figure'). However, this will be fairly complicated unless you are very familiar with the figure file formats.

There are other software packages that will do this including PAUP for the Mac and PHYLIP for Mac and Windows. Phylip may be available on the helix systems in the future; send mail to the helix staff via the ptr help system (type 'ptr' at the helix prompt) if you are interested.

A list of programs for Phylogenetic analysis is available from U. Arizona.

I keep getting files called seqed.log in my directory.

If you exit Seqed abnormally, it creates a seqed.log file to attempt to rescue work-in-progress. The correct way to exit seqed is to type Ctrl-D to get to the command line, and then enter 'exit' (to exit or save) or 'quit' (to exit without saving).

Seqed says 'recovering...' for ever!!

When you exited Seqed the previous time, something abnormal happened that caused seqed to rescue your sequence into a file called seqed.log. If this file was somehow corrupted, then the next time seqed is started, you will get a message saying

recovering....can't open

and nothing more will happen. The solution to get seqed working again is to delete the seqed.log file (by typing 'rm seqed.log').

Note: This problem seems to have been fixed in GCG 9.0

What databases are available for Fasta searches?

The welcome notice when you start up GCG tells you what databases are available and when each one was last updated.

The databases can be accessed in groups and sub-groups. For example, Bacterial sequences can be searched separately as ba:*, gb_ba:*, or as part of Genbank (Genbank:*), or GenbankPlus (GBPlus:*), or GenEMBLPlus (GEPlus:*). The database section of the GCG manual lists the short names for each protein and nucleic acid database. (On helix, you can also type 'genmanual' and go to 'Databases').

Genbank and EMBL trade data nightly, so that Genbank encompasses almost all of EMBL at any given time. Thus it is only necessary to maintain one of these databases. On the helix systems, we keep Genbank up-to-date, and so EMBL on our system is vanishingly small. Thus, the GenEMBLPlus is essentially the same as GenbankPlus, GenEMBL is essentially equal to Genbank, and so on.

To use any of these databases in a Fasta search, remember to type ':*' after the database name! For example, you can search the Tags library using 'Tags:*' or 'Gb_Tags:*'. The database specifications are case-insensitive.

How do I get information about the databases?

When you start up GCG, the welcome notice tells you which databases we have installed on helix, and when they were updated. Detailed information about each database can be found in the directory /gcgdata/gcgcore/data/moredata/ in files called swissprot.release, genbank.release etc. You can read these files using the 'more' command (e.g. helix% more /gcgdata/gcgcore/data/moredata/pir.release or copy them to your own directory using GCG's 'fetch' command. (e.g.

helix% fetch rebase.release

Fetch copies GCG sequences or data files from the GCG database 
into your directory or displays them on your terminal screen.

 rebase.release

You may want to see the latest list of databases on the helix systems. You can also obtain this list by typing 'genmanual' at the helix prompt, and going to 'Databases'.

Motifs doesn't find a known motif in my protein.

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Could your 'known motif' be a frequent motif, one that is seen commonly in protein sequences? By default, the program does not show these in its output. You need to use the '-frequent' option for the 'motifs' program to pick out common motifs such as leucine zippers.

From the GCG help on motifs:

 The PROSITE Dictionary  contains  a number of short sequence patterns
 that occur frequently in protein sequences.  Most of these frequently
 found  patterns  are  post-translational  modifications,   but   more
 specific  patterns  such  as  leucine  zippers  also  fall  into this
 category.  Such frequently  found patterns are  not normally shown by
 Motifs,  but you can display them  by  using the command-line  option
 -FREquent.   More  so  than  with  other   patterns  in  the  PROSITE
 Dictionary,  the presence of these frequently occurring patterns does
 not assure  you that the  protein actually contains the corresponding
 function.

 Here are the patterns that the PROSITE Dictionary classifies
 as frequently occurring in Release 14.0:
 
;Amidation            1 xG(R,K)(R,K)                                 0009.pdoc
;Asn_Glycosylation    1 N~(P)(S,T)~(P)                               0001.pdoc
;Camp_Phospho_Site    1 (R,K)2x(S,T)                                 0004.pdoc
;Ck2_Phospho_Site     1 (S,T)x2(D,E)                                 0006.pdoc
;Glycosaminoglycan    1 SGxG                                         0002.pdoc
;Leucine_Zipper       1 Lx6Lx6Lx6L                                   0029.pdoc
;Microbodies_Cter     1 (S,T,A,G,C,N)(R,K,H)(L,I,V,M,A,F,Y)>         0299.pdoc
;Myristyl             1 G~(E,D,R,K,H,P,F,Y,W)x2(S,T,A,G,C,N)~(P)     0008.pdoc
;Pkc_Phospho_Site     1 (S,T)x(R,K)                                  0005.pdoc
;Rgd                  1 RGD                                          0016.pdoc
;Tyr_Phospho_Site     1 (R,K)x{2,3}(D,E)x{2,3}Y                      0007.pdoc

What databases are available for Blast searches?

There are no Blast databases available on the helix systems. Blast searches are available to helix users, but are actually run at NCBI, National Library of Medicine on their Blast databases. NCBI maintains a list of databases available for Blast searches. As of Jun 1997, the following are available, but check their web page for changes/updates.

Peptide Sequence Databases
nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR
month All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days.
swissprot the last major release of the SWISS-PROT protein sequence database (no updates)
yeast Yeast (Saccharomyces cerevisiae) protein sequences.
E. coli E. coli genomic CDS translations
pdb Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank
kabat [kabatpro] Kabat's database of sequences of immunological interest
alu Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994) .
Nucleotide Sequence Databases
nr All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences)
month All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
dbest Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
dbsts Non-redundant Database of GenBank+EMBL+DDBJ STS Divisions
htgs High Throughput Genomic Sequences
yeast Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
E. coli E. coli genomic nucleotide sequences
pdb Sequences derived from the 3-dimensional structure
kabat [kabatnuc] Kabat's database of sequences of immunological interest
vector Vector subset of GenBank(R), NCBI, (anonymous FTP to ncbi.nlm.nih.gov beneath the /pub/vector directory)
mito Database of mitochondrial sequences, Rel. 1.0, July 1995"
alu Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).
epd Eukaryotic Promotor Database
gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.

Fasta says "No sequences to compare to query sequence!"

You may have typed in the database name incorrectly. For example, the following responses would be incorrect:

  Search for query in what sequence(s) (* GenEMBL:* *) ?  genbank
  Search for query in what sequence(s) (* GenEMBL:* *) ?  embl 
  Search for query in what sequence(s) (* GenEMBL:* *) ?  genbank:
  Search for query in what sequence(s) (* GenEMBL:* *) ?  all

and Fasta will say:

   *** No sequences to compare to query sequence! ***

You need the ":*" after the database name in order for Fasta to search it correctly. Here are some correct responses:

  Search for query in what sequence(s) (* GenEMBL:* *) ?

(i.e. if you simply hit a carriage return (Enter), Fasta will search the default database. For nucleic acid query sequences this is GenEMBL, and for protein query sequences the default is Swissprot).

  Search for query in what sequence(s) (* GenEMBL:* *) ?  genbank:*

(This is the correct syntax for the database name).

  Search for query in what sequence(s) (* GenEMBL:* *) ? *

Here the syntax is correct, but the '*' tells Fasta to search through all the sequence files in your own current directory, and not through any database. If this is what you want, you're ok.

Another possibility is that you were trying to search EMBL? On the helix systems, we primarily maintain Genbank, and only keep those entries from EMBL that do not appear in EMBL. Since EMBL and Genbank largely overlap, many of the EMBL databases (such as em_pat, em_gss) are empty or have very few sequences. Try the equivalent Genbank databases instead (e.g. gb_pat or gb_gss).

Reformat gave me an empty file.

One possible reason:

You edited the file in a word processor and saved it as a word processor document. Reformat can't read the word processor codes that are in your document, and can't convert it. You need to save it as an ASCII text file, which Reformat should be able to convert.

Reformat put the header into the sequence.

The file does not contain the dividing line. Input files for Reformat need to have a header, a dividing line, and a sequence. The dividing line contains only two dots, and nothing else. For example, an input file like this will work:

   Human fetal Beta globin G gamma
	   ..
   AGGAAGCACC CTTCAGCAGT TCCACA

But if the middle line with the dots is missing, Reformat will assume the whole file contains a sequence. It says:

   No ".." divider

and will create a 'sequence' that looks like:

    REFORMAT of: test.seq  check: -1  from: 1  to: 52  June 17, 1997 10:14

    (No documentation)

     test.seq  Length: 52  June 17, 1997 10:14  Type: P  Check: 7154  ..

          1  Humanfetal Betaglobin GgammaAGGA AGCACCCTTC AGCAGTTCCA 

         51  CA

Where do I get the codon usage table for xxx?

The programs BackTranslate, CodonPreference and Correspond use codon frequency tables. By default they use the table for E.Coli. However, additional codon usage tables have been provided on the helix systems for GCG users. At present they include:

Species Codon table file name
green algae algae.cod
Aspergillus nidulans asn.cod
Arabidopsis thaliana ath.cod
Hordeum vulgare (Barley) bly.cod
Bombyx mori (Silk Moth) bmo.cod
Bos taurus (Cow) bov.cod
Bacillus subtilis bsu.cod
Caenorhabditis elegans cel.cod
Chironomus sp. chi.cod
Gallus sp. (Chicken) chk.cod
Chlamydomonas cre.cod
Dictyostelium discoideum ddi.cod
Drosophila melanogaster dro.cod
Escherichia coli eco.cod
enteric bacterial (highly expressed) genes ecohigh.cod
Cricetulus sp. & Mesocricetus sp. (Hamster) ham.cod
Homo sapiens hum.cod
Klebsiella pneumoniae kpn.cod
Macaca sp mac.cod
Mus sp. (Mouse) mus.cod
Zea mays (Maize) mze.cod
Zea mays chloroplast (Maize) mzecp.cod
Neurospora crassa neu.cod
Neisseria gonorrheae ngo.cod
Pisum sativum (Pea) pea.cod
Petunia sp. pet.cod
Plasmodium falciparum pfa.cod
Phaseolus vulgaris (Lima bean) phv.cod
Solanum tuberosum (Potato) pot.cod
Pseudomonas sp. pse.cod
Oryctolagus sp. (Rabbit) rab.cod
Rattus sp. (Rat) rat.cod
Rhizobium meliloti rhm.cod
Oryza sativa (Rice) ric.cod
Ovis sp. (Sheep) shp.cod
Physarum polycephalum slm.cod
Glycine max (Soybean) soy.cod
Staphylococcus aureus sta.cod
Salmonella thphimurium sty.cod
Strongylocentrotus purpuratus sus.cod
Tetrahymena thermophila tet.cod
Nicotiana tabacum (Tobacco) tob.cod
Nicotiana tabacum chloroplast (Tobacco) tobcp.cod
Lycopersicon esculentum (Tomato) tom.cod
Trypanosoma brucei trb.cod
Triticum aestivum (Wheat) wht.cod
Xenopus laevis xel.cod
Saccharomyces cerevisiae ysc.cod
Saccharomyces cerevisiae mitochondrion yscmt.cod
Schizosaccharomyces pombe ysp.cod

For details about any of these tables, extract the actual table by typing 'fetch xxx.cod' at the helix prompt, and read the documentation at the top of the file.

To use any of these codon tables instead of the default (E.Coli) table, add the parameter '-translate=xxx.cod' when running any of the above programs.

If you need a codon table that is not in this list, send mail to the helix staff using the ptr command on helix (type 'ptr' at the helix prompt) and we will attempt to track it down for you.

What options are available with each program?

Most GCG programs have a '-check' option. If you use this option, GCG will list all available options for that particular program and prompt you for further input.

Sample session:
helix% motifs -check Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds. Minimal Syntax: % motifs [-INfile=]SW:Kad1_Human -Default Prompted Parameters: [-OUTfile=]kad1_human.motifs the output file name Local Data Files: -DATa=prosite.patterns file of protein sequence patterns Optional Parameters: -NOREFerence suppresses the PROSITE abstract for each pattern found -FREquent shows motifs that are frequently found in proteins -MISmatch=1 allows one mismatch -NAMes writes output file in file-of-sequence-names format -APPend appends the pattern data file to your output file -SHOw shows every file searched, even if no pattern was found -ONCe limits finds to patterns found only once -MINCuts=2 limits finds to patterns found a minimum of 2 times -MAXCuts=3 limits finds to patterns found a maximum of 3 times -EXCLude=n1,n2 excludes patterns found between positions n1 and n2 -NOMONitor suppresses the screen trace showing each file -NOSUMmary suppresses the screen summary at the end of the program Add what to the command line ? -frequent MOTIFs from what protein sequence(s) ? cram_craab.swissprot What should I call the output file (* cram_craab.motifs *) ? cram_craab.swissprot len: 46 ............................ Total finds: 2 Total length: 46 Total sequences: 1 CPU time (sec): 01.19 Output file: "/u5/susanc/cram_craab.motifs"

How do I edit more than 30 aligned sequences?

You're out of luck, GCG's multiple sequence editors cannot handle more than 30 sequences at once.

Why can't I find a published sequence?

We maintain Genbank on helix, and only those parts of EMBL that are not present in Genbank. There are several possible reasons why you might not find a published sequence:

Are you searching in the right database? If you are using GCG's Lookup program, you can search 'Genbank' &/or 'GB_New'. The 'Genbank' database contains the last release of Genbank. The 'GB_New' database contains all updates since that release, upto the previous night. If the data were published recently, the sequence might be in GB_New rather than Genbank.
Are you searching by accession number? If so, make sure you are using the Genbank accession number rather than EMBL...
In the Lookup program, try searching 'All text' for words relevant to your sequence, in case the published accession number has a typo.
Authors sometimes request that their data be held in confidence until after publication, even though they have already been assigned an accession number. This may create a delay in the release of the data.

How do I get my sequences into Pileup?

You need a file that contains the names of all the sequences you want to put into Pileup. If the sequence files are actually in your own directory, and are all in GCG format, your input file can look like this:

--------------------------- File pileup.inp -------------------
w68343.gb_new
x12871.gb_new
hcu31231.gb_vi
hsig487.gb_pr
---------------------------------------------------------------

When pileup asks for the sequences, type '@pileup.inp' in response. Note the '@' symbol before the filename; this tells GCG that the file contains a list of sequences. The filename itself should not start with '@'!

You don't need to have the sequences in your own directory to align them; pileup (and most GCG programs) can fish out sequences from the databases. For example, a valid file of sequence names would be:

----------------------- File hsp.list --------------------------
LOOKUP in: genbank  of: "[SQ-ALL: hsp70*]"

 3 entries  June  3, 1997 16:11 ..

GB_BA:BBHSP60 ! ID: 120c0005
! DEFINITION  B.burgdorferi hsp60 gene for 60kDa heat shock protein.
GB_BA:BBHSPRO ! ID: 130c0005
! DEFINITION  B.burgdorferi dnaK gene for heat-shock protein.
GB_BA:BBOSPAGE ! ID: 480c0005
! DEFINITION  B.burgdorferi OspA gene for outer surface protein A.
-------------------------------------------------------------------

(This file was output by GCG's Lookup program). In this case, the sequences are still in the database, and Pileup will pull out each sequence at the start of its run. It will take a little longer than if the sequences are in your home directory.

Sample pileup session:
helix% pileup PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. PileUp of what sequences ? @pileup.inp 1 w68343.gb_new 392 bp 2 x12871.gb_new 429 bp 3 hcu31231.gb_vi 308 bp 4 hsig487.gb_pr 487 bp What is the gap creation penalty (* 5.00 *) ? [....etc......]

Should I use Blast or Fasta?

Blast is a program developed at NCBI which searches with a query sequence against a database. Blast looks for regions of local similarity -- i.e. it searches for regions of the same length between the query sequence and the database sequence. It does not insert gaps to improve the quality of the match. Blast calculates a 'quality-of-match' value using a scoring matrix, and the output will have all matches which are above a cutoff value. Since Blast looks for regions of similarity, it can find that two different regions of the same database sequence are similar to a query sequence.

Fasta uses the Pearson-Lippman algorithm, and looks for "word" matches (an optimal word size is 2 for proteins and 4-6 for nucleic acids) between the query and the database sequence. It finds the region in the database sequence that has the most word matches, and then looks for other nearby regions that match (i.e. it inserts gaps into the sequences if that will improve the match). It keeps track of the highest-matching sequences, and will report only the top matching sequences.

Which is more appropriate for your purpose? Note the following:

Blast is faster. A typical Blast search takes a few minutes, a typical Fasta search is 10-20 mins. (However, with the parallelized version of Fasta on helix, this may no longer be true)
Blast is usually more sensitive than Fasta for detecting protein sequence similarity, since it doesn't require a perfect match at the first stage of the search. Blast can also filter out low-complexity protein sequence regions which may result in non-specific matches.
Blast has more search modes. For example, you can use tblastn to search with a query nucleotide sequence against a protein database. The query sequence will be translated into all 6 reading frames before the search. To do the same thing with Fasta, you'd have to translate the nucleotide sequence separately and run Fasta 6 times.
but
Fasta will let you tailor your search more precisely. For example, you could search just the plant sequences (using pl:*) while Blast will require you to search the whole GenEMBL database. Thus you may get a more meaningful result from Fasta.
Blast has a long word size, which reduces its sensitivity.
Fasta is good at detecting genomic DNA regions using a cDNA query sequence because it allows a gap extension penalty of zero. Blast will find only the longest exon or fail, since it only measures ungapped alignments.
Blast cannot search with very short sequences. While not the best approach, Fasta does better.

Because of its speed and simplicity, a Blast search is probably your best first bet. If one of the above points is important to you, or you want more details, try Fasta.

Additional links:

Can I use Fasta with a short sequence?

You can, but it's not a good idea. Fasta and Blast work best with sequences over 50 bp. For short sequences, try findpatterns instead.

For example, say you have a 10-nucleotide protein binding site within a promoter, and you want to search the E.Coli genome for this consensus sequence. Findpatterns works best if you can define a sequence pattern that is specific, but includes all of the "real" binding sites. It does not allow positional weighting, so the best thing to give it is fully conserved residues/IUPAC specifications and allowed spacings between conserved groups. In addition, if there are correlated dinucleotide variations they are best searched as separate patterns instead of via a single generalization of the pattern.

i.e.

CNGGATNA{5,7}TNATCCNG
CNGGTANA{5,7}TNTAGGNGMM

works better than

CNGGNNNA{5,7}TNNNCCNG

Findpatterns also lets you allow mismatches, but in general for short, degenerate patterns without positional weighting this will just decrease your signal to noise ratio. It is better to explicitly search additional patterns.

Findpatterns patterns are specified in the same data file format as restriction enzymes for "MAP", etc. Type "fetch pattern.dat" to get an example file you can modify with your own patterns. Type "genhelp findpatterns" to see a description of the program. The pattern syntax is described under "defining patterns" and making and using your own pattern files are described under "pattern file" and "local data files".

If you have several of these sequences, and if the contribution of the individual bases is unequal (e.g. the T1, A2 and T6 in the -10 box are more conserved than the others) then you can align the known sequences (make individual sequence files and then use Pileup; or enter them directly using Lineup). You can then use Profilemake and then Profilesearch. These latter two programs are meant for proteins but can be gotten to work with DNA sequences - don't forget to use -MATRix=profiledna.cmp with Profilemake.

Dr. William Pearson, author of FastA, says that FastA should work fine for such a search if you use a word size of 1 instead of the default word size of 6 for DNA.

(Response to this question modified from posts by Michael Leonetto and Paul Roy on bionet.software.gcg.)

Should I use framesearch or Blastx?

aka

Framesearch takes forever; is it worth it?

Framesearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Framesearch uses the Smith-Waterman algorithm, and can insert gaps in the sequence to improve the match.

Two similar programs are Blastx, which searches a protein database against a query nucleotide sequence, or Tblastn and Tfasta, which search a nucleotide database against a query protein sequence. All these programs first translate the nucleotide sequence in all six reading frames before searching and alignment. The Blast group of programs does not insert gaps, and none of these account for frame shift errors that may be present.

But framesearch is very slow, especially when searching an entire database. The blast group of programs will give you a result in half an hour or less , as opposed to 3 hrs or days for framesearch. If you need the sensitivity of framesearch, try running a Blastx/Tblastn/Tfasta search, taking the top scores, and running framesearch against those.

Additional links:

Framesearch sensitivity, from the GCG Newsletter, Nov 95.

Searching databases for sequences similar to a sequence of interest from NYU.

After I generate a multiple sequence alignment using the PileUp program, how can I calculate a consensus sequence for that alignment?

Use the Pretty program with the command-line qualifier -CONsensus. The Pretty program uses a scoring matrix to calculate a consensus sequence for a group of aligned sequences. It prompts you for the plurality you want to use in creating this consensus.

You also can specify a threshold value with the command-line qualifier -THReshold. This can be important in amino acid sequence analysis. The default value for the threshold qualifier is 1.0, which means that all matched residue pairs in the scoring matrix that have a value of 1.0 or greater will vote as a block in calculating a consensus residue for a particular position. If you do not want to see block voting, you can increase the threshold value with the command-line qualifier -THReshold=1.5. (1.5 is the value assigned for an identical match in the default scoring matrix used by Pretty. For example, an alanine to alanine match would be assigned a value of 1.5.) If you are using a scoring matrix other than the default one, you can use the command line qualifier -IDEntity to eliminate block voting.

Question/response taken from GCG Newsletter, March 94..

How can I use a consensus sequence created with the Pretty program as input to other Wisconsin Package programs?

You can do this by re-running the Pretty program with the command-line qualifier -UGLy using the Pretty output file as input. Pretty rewrites the sequences in the original input file, including the consensus sequence, into individual sequence files. For example, the consensus sequence is rewritten into a file called consensus.ugly. You can use these individual files as input to any other Wisconsin Package programs.

Question/response taken from GCG Newsletter, March 94..

Can I change how agreement or disagreement at positions in the alignment are displayed in the consensus sequence or in the alignment generated by Pretty?

Yes. There are a number of qualifiers you can use to accomplish this when running the Pretty program:

-IDEntity displays only those residues in the consensus sequence that are unanimous at a given position in the alignment. If you specify a symbol, such as -IDEntity=*, the program displays that symbol in the consensus sequence. If you do not specify a symbol, the unanimous residue will appear in the consensus sequence at that position.
-DIFferences displays only those residues in the alignment that disagree with the consensus sequence at a given position in the alignment. If you specify a symbol, such as -DIFferences="-", that symbol will be displayed in the alignment at those positions that agree with the consensus. If you do not specify a symbol, the position will be blank.
-CASe displays residues in the alignment that agree with the consensus sequence in uppercase and those that disagree with the consensus in lowercase.

Question/response taken from GCG Newsletter, March 94..

How do I import a sequence file into GCG?

Say you have a file, created in WP or Simple Text or Microsoft Word, or something that came out of an automatic sequencer, on your desktop machine. It's unlikely to be in GCG format, but you want to run some GCG programs on this sequence. There are a few different ways to convert it into the right format:

Cut-and-paste into Seqed

Log into helix from your desktop computer and start up GCG. Type 'seqed', and enter screen mode. In a separate window, open the file containing your sequence, select it, and paste it into the seqed window. Then exit seqed, saving the file which will now be in GCG format. Check that the entire sequence was cut/pasted and saved. This is a quick-and-dirty method that mostly works, but you have to be careful that you cut-and-paste the entire sequence. It is not useful for very long sequences or if you have many sequences.

GCG's Reformat program

Save the file as an ASCII text file. If the sequence has word-processor codes in it, no sequence conversion program will accept it.

Bring the file to helix using some file transfer program like Fetch or FTP.

Start up GCG (type 'gcg' at the helix prompt) and then run reformat (type 'reformat filename' at the helix prompt). Reformat should give you a GCG-formatted sequence in the same file. If something doesn't work right, see Reformat gave me an empty file or Reformat put the header into the sequence which may help you to troubleshoot.

Readseq

To use Readseq, the data must be in one of the following formats:

   *IG/Stanford, used by Intelligenetics and others
   *GenBank/GB, genbank flatfile format
   *NBRF format
   *EMBL, EMBL flatfile format
   *GCG, single sequence format of GCG software
   *DNAStrider, for common Mac program
   *Fitch format, limited use
   *Pearson/Fasta, common format used by FastA program and
             others
   *Zuker format, limited use. Input only.
   *Olsen, format printed by Olsen VMS sequence editor. Input
             only.
   *Phylip3.2, sequential format for Phylip programs
   *Phylip, interleaved format for Phylip programs (v3.3,
             v3.4)
   *Plain/ Raw, sequence data only (no name, document,
          numbering)
   *MSF multi sequence format used by GCG software
   *PAUP's multiple sequence (NEXUS) format
   *PIR/CODATA format used by PIR

GCG-Lite

Reformat cut off the end of my sequence.

One possible reason is that your input sequence was in one long line without carriage returns. This can happen if you typed the sequence in something like SimpleText, without carriage returns, and then transferred it to helix. There are several ways around this:

Edit the file on helix, and add a few carriage returns after every 50 bp or so. This is reasonable as a one-time option; however, if you are going to be doing this often you should use a less cumbersome method.
When you type the sequence in your editor (the one on your desktop machine), put carriage returns (i.e. hit the Enter key) after every 50 or 80 characters.
See if your editor (the one on your desktop machine) has an option for saving the file with carriage returns.
When you transfer the file, see if the file transfer program has an option to insert carriage returns. For example, the Fetch program has an option to transfer the file as 'Wrapped Text'. This will break up the long lines, and Reformat will be able to process the sequence successfully.

It is always worth checking the output from a reformatting program! You don't need to examine the whole sequence, just look at the beginning, the end and the length of the sequence.

How do I translate the coding regions of a Genbank entry?

Search for your protein by accession number or name in Lookup.
In the 'Features:' field, type 'cds.
In the 'Form of output list' field, select 'Fragments'. (Use the space bar on your keyboard to toggle options in this field).

Save the result in a file, which will contain the coding regions.

Then use 'translate' to translate them.

Sample session:
helix% lookup LookUp identifies sequences by name, accession number, author, organism, keyword, title, reference, feature, definition, length, or date. The output is a list of sequences. The LookUp program is experimental in this release--please look carefully at your results. LOOKUP in what sequence libraries: a) swissprot b) pir c) embl d) genbank e) em_tags f) gb_tags g) gb_new h) genpept i) All libraries q) quit Please choose one or more (* i *): d Complete the query form below: All text: tryptophan synthase Definition: Author: Keyword: Sequence name: Accession number: Organism: Reference: Title: Feature: cds On or after (dd-mmm-yy): On or before (dd-mmm-yy): Shortest sequence length: Longest sequence length: Inter-field operator: AND Form of output list: Fragments Press D to continue. Searching genbank 32 features were found. Do you wish to: 1) write out this list to a file 2) preview the results 3) refine the query 4) choose different libraries q) quit Please choose one (* 1 *): 1 What should I call the output file (* lookup.list *) ? test.list .... 32 features were written to "test.list" helix% translate @test.list Translate translates nucleotide sequences into peptide sequences. acctrpf from 506 to 1147 acctrpf from 1149 to 1466 bactrpab from 179 to 1393 bactrpab from 1374 to 2183 bactrsyb from 31 to 1233 brltrp2 from 1 to 366 buhtrpb from 1 to 676 buhtrpba from 1 to 676 buhtrpbb from 1 to 679 buhtrpbc from 1 to 676 ccrtrpfba from 181 to 450 ccrtrpfba from 500 to 1159 ccrtrpfba from 1182 to 2402 ccrtrpfba from 2415 to 3242 mvotrpba from 1 to 206 mvotrpba from 304 to 1533 mvotrpba from 1571 to 2425 mvotrpba from 2460 to 2600 (reverse of) mvotrpba from 2666 to 2874 (reverse of) pptrpab from 1 to 18 pptrpab from 129 to 1346 pptrpab from 1343 to 2152 (reverse of) psetrpbai from 99 to 908 (reverse of) psetrpbai from 908 to 2134 psetrpbai from 2245 to 3141 syctrpb from 518 to 1756 ab003491 from 1 to 1413 athtrpb from 1517 to 1858 athtrpb from 2109 to 2465 athtrpb from 2549 to 2991 athtrpb from 3084 to 3178 athtrpb from 3275 to 3450 athtrpsb from 442 to 798 athtrpsb from 1174 to 1530 athtrpsb from 1606 to 2048 athtrpsb from 2136 to 2230 athtrpsb from 2322 to 2497 cynpltrnk from 148 to 876 mzeorp1a from 1 to 1170 mzeorp2a from 3 to 1334 Output files called "*.pep"

How do I stop Fasta?

If you start up a Fasta run, and then discover that you're searching the wrong database or have the wrong query sequence, or that the search is simply taking too long, you can stop the program. For Fasta, a 'Ctrl-C' (hold down the Control key, and hit 'c') will stop the search. The program will tell you the best scores it has found so far, and offer to continue with the alignment part of Fasta. Another 'Ctrl-c' at this point will stop Fasta completely, and return you to the helix prompt.

If you just want to 'pause' the program (while you take a look at your query file, for example) use 'Ctrl-z' (hold down the Control key and hit 'z') to suspend the program execution. You will get a helix prompt, and the program will remain in a suspended state until you type 'fg' (for 'foreground') or 'bg' (for 'background'). You can also choose to kill the program after suspending it, by typing 'jobs', and killing the appropriate job. For example, after pausing a Fasta job with 'Ctrl-z':
helix% jobs [1] + Suspended ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta )
This tells you that you have one job suspended, and that it is job number 1.
helix% bg [1] ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta ) & 1,701 Sequences 675,057 aa searched SW:AGL1_ARATH 1,801 Sequences 716,153 aa searched SW:ALAT_PIG 1,901 Sequences 754,379 aa searched SW:ALF_SCHPO 2,001 Sequences 792,028 aa searched SW:AMA1_PLAF8 [...etc....]
This puts the job into the background. You have limited control over it now, and will probably not be able to pause, kill or stop it until it is finished.
helix% fg ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta x89712.genpept ) 6,001 Sequences 2,332,970 aa searched SW:CC27_HUMAN 6,101 Sequences 2,374,535 aa searched SW:CCKN_MACEU 6,201 Sequences 2,406,280 aa searched SW:CD36_HUMAN 6,301 Sequences 2,437,213 aa searched SW:CD9_CERAE 6,401 Sequences 2,479,989 aa searched SW:CE11_CAEEL 6,501 Sequences 2,516,827 aa searched SW:CERB_CERCA 6,601 Sequences 2,563,151 aa searched SW:CG2A_DROME 6,701 Sequences 2,594,679 aa searched SW:CH10_CAUCR [...etc...]
This puts the job back into the foreground and it will continue to the end.
helix% kill %1 [1] Terminated ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta )
This kills the Fasta job.

'fg' and 'bg' should be used with caution, as they interfere with the normal execution of a Fasta job, and in some situations can lead to unexpected results especially if a job is foregrounded or backgrounded more than once. Your job may hang, become unkillable, or leave incomplete files.

When should I use stringsearch?

Stringsearch searches through the specified database for a particular character string. You can choose to search through Definitions (option 'A') or Complete Sequence Annotation (option 'B'). The 'A' option is much faster, but both options have been superceded by GCG's Lookup program, which is faster and has more options.

For example, a search through Swissprot for 'glyceraldehyde' takes 25 seconds with Stringsearch Option A, 22 minutes with Stringsearch Option B, and about half a second with Lookup. Lookup searches can also be more easily tailored for your specific needs -- you can search through mouse sequences only, for example. So in most cases, use Lookup instead of Stringsearch.

How can I get a tfasta output aligned with Pileup?

I have a small region of a protein which contains the residues which have been found to be mutated in patients. I want to check Genbank for all homologous proteins to see if these residues are highly conserved. So I want to do a tfasta run to identify the homologs, and then use Pileup to align them. How can I get the list of sequences from a tfasta output into Pileup?

The tfasta output (or the tblastn output) are not suitable for automatic parsing, so it is difficult to get a list of sequence names from the output that you can pop into Pileup.

However, a better way to do this would be to run a fasta search against the GenPept database, instead of Genbank. GenPept is maintained on the helix systems and is updated bimonthly. It contains all the putative translations from the last Genbank release. You can search GenPept using 'GenPept:*' in Fasta. In Pileup, you can then use the Begin= and End= parameters to define the ends of the region for alignment. Each sequence can have its own range limitation.

Another option is to use framesearch instead of fasta. You could then create a list file using the begin, end and range information from the framesearch output, and use that as input into Pileup. However, framesearch is much slower than Fasta.

How do I design a primer?

There are several design criteria:

melting temperatures of primer and product
GC content
ambiguous bases
self-complementarity -- how to avoid primer dimer formation and primer secondary structure
size of primer
PCR product length
repeats, e.g. similarity to common rodent repeat sequences

GCG's prime program selects primers, and is also accessible through

GCG-Lite's web interface to prime.

Other web-accessible primer design programs are

Primer3 from the Whitehead Institute,

xprimer from U. Minnesota,

GeneFisher from U. Bielefeld.

Programs than run on Macs and PC's are MacVector for the Mac, GeneWorks for the Mac, OSP for Mac/PC/VAX/Sun/Xwindows, and primegen. Also check EBI's list of primer design programs.

More information about primer design can be found at:

Restriction Maps and PCR primer design, a course at NYU.

PCR primer analysis, a course from U. Kentucky.

How do I convert peptide sequences from 1-letter to 3-letter amino acid codes?

Reformat, which converts other sequence formats into GCG format, can do this. You would run it with the '-ONEintothree' option, and your input sequence file can be either GCG format or any of the other formats that Reformat accepts. Note that reformat will write over your input sequence file, so save a copy if you want to preserve the original!

Sample session:
helix% more jc5122.pir2 P1;JC5122 - superoxide dismutase (EC 1.15.1.1) (Mn) precursor - Caenorhabditis elegans C;Species: Caenorhabditis elegans C;Date: 02-Feb-1997 #sequence_revision 27-Feb-1997 #text_change 13-Mar-1997 C;Accession: JC5122; JS0750 [...etc etc....] jc5122.pir2 Length: 221 August 14, 1997 10:53 Type: P Check: 1490 .. 1 MLQNTVRCVS KLVQPITGVA AVRSKHSLPD LPYDYADLEP VISHEIMQLH 51 HQKHHATYVN NLNQIEEKLH EAVSKGNVKE AIALQPALKF NGGGHINHSI 101 FWTNLAKDGG EPSAELLTAI KSDFGSLDNL QKQLSASTVA VQGSGWGWLG 151 YCPKGKILKV ATCANQDPLE ATTGLVPLFG IDVWEHAYYL QYKNVRPDYV 201 NAIWKIANWK NVSERFAKAQ Q helix% reformat -oneintothree Reformat rewrites sequence file(s), scoring matrix file(s), or enzyme data file(s) so that they can be read by GCG programs. REFORMAT what sequence file(s) ? jc5122.pir2 jc5122.pir2 length: 663 aa helix% more jc5122.pir2 P1;JC5122 - superoxide dismutase (EC 1.15.1.1) (Mn) precursor - Caenorhabditis elegans C;Species: Caenorhabditis elegans C;Date: 02-Feb-1997 #sequence_revision 27-Feb-1997 #text_change 13-Mar-1997 C;Accession: JC5122; JS0750 [... etc etc ...] F;50,98,182,186/Binding site: manganese (His, His, Asp, His) #status predicted jc5122.pir2 Length: 663 August 14, 1997 10:53 Type: P Check: 9126 .. 1 Met Leu Gln Asn Thr Val Arg Cys Val Ser Lys Leu Val Gln Pro 46 Ile Thr Gly Val Ala Ala Val Arg Ser Lys His Ser Leu Pro Asp 91 Leu Pro Tyr Asp Tyr Ala Asp Leu Glu Pro Val Ile Ser His Glu 136 Ile Met Gln Leu His His Gln Lys His His Ala Thr Tyr Val Asn 181 Asn Leu Asn Gln Ile Glu Glu Lys Leu His Glu Ala Val Ser Lys 226 Gly Asn Val Lys Glu Ala Ile Ala Leu Gln Pro Ala Leu Lys Phe 271 Asn Gly Gly Gly His Ile Asn His Ser Ile Phe Trp Thr Asn Leu 316 Ala Lys Asp Gly Gly Glu Pro Ser Ala Glu Leu Leu Thr Ala Ile 361 Lys Ser Asp Phe Gly Ser Leu Asp Asn Leu Gln Lys Gln Leu Ser 406 Ala Ser Thr Val Ala Val Gln Gly Ser Gly Trp Gly Trp Leu Gly 451 Tyr Cys Pro Lys Gly Lys Ile Leu Lys Val Ala Thr Cys Ala Asn 496 Gln Asp Pro Leu Glu Ala Thr Thr Gly Leu Val Pro Leu Phe Gly 541 Ile Asp Val Trp Glu His Ala Tyr Tyr Leu Gln Tyr Lys Asn Val 586 Arg Pro Asp Tyr Val Asn Ala Ile Trp Lys Ile Ala Asn Trp Lys 631 Asn Val Ser Glu Arg Phe Ala Lys Ala Gln Gln

Blast gives me an error: Server did not say "ok"

When you run Blast on the helix systems, the search actually runs to the computers at NCBI. As of Jan 1998, the NCBI 'experimental' servers (Cruncher and Muncher), which used to process blast jobs, have been replaced. The new system now sends blast jobs to the NCBI web server. If the server is overloaded or the network is slow, you may see the following error message:


helix% blast x89712.genpept -default

 Sending query...
 Awaiting results...

 ** Server did not say "OK".

Your options are to try again later, use the Blast email server (add '-mail' to the command line), or use Blast via the web.

The old 'blast -batch' command doesn't work anymore; to run blast in batch mode see How do I run blast searches in batch mode

Why have my alignments changed in Version 9?

The following information is taken from the GCG Newsletter, July 1997.

Protein sequence alignment programs in the Wisconsin Package use scoring matrices to make comparisons between pairs of amino acids. In Version 9.0, released in December 1996, changes were made to the scoring matrices which may affect the alignments you create. If the sequences you aligned in previous versions of the Package were very similar, you may not have noticed much difference to those same alignments produced using Version 9.0. However, with more distantly related sequences the alignments produced may be substantially different.

Version 9 Changes to Scoring Matrices

There are two main changes to protein scoring matrices in Version 9.0.

BLOSUM62¹, the new default protein scoring matrix, is the primary cause for the difference in alignments created in Version 9.0. Versions of the Wisconsin Package prior to 9.0 used a scoring matrix based on the PAM250 table from M. Dayhoff², where all perfect matches were set to a single value. BLOSUM62, in contrast, assigns scores ranging from 4 to 11 to perfect matches (identity). Scores for a non-perfect but similar matches range from 0 to 3. Scores for mismatches (non-similar matches) are assigned vales less than 0. Knowing the range of these values is helpful for setting display thresholds for similar amino acids.
Scoring matrix values are now integers. Matrices which previously held floating point (real number) values have had their values multiplied by 10 so that they now hold integer values. In addition, the values for gap creation and extension penalties also must be entered as integer values.

The modified PAM250 matrix which was used in previous software versions is still available in Version 9.0. The matrix has been renamed oldpep.cmp, and the original floating point values were changed to integer values. However, we do not recommend using this matrix. If you are trying to align distantly related sequences and are not achieving expected results with BLOSUM62, use the BLOSUM30 matrix. You can specify this alternative matrix with the -MATRix=blosum30.cmp (UNIX) or /MATRix=blosum30.cmp (OpenVMS) command-line parameter when you run a sequence comparison program.

How Does This Affect Me?

Two common questions you may have about the alignments created in Version 9.0 and their answers are detailed below.

Q: The global alignment displayed by PileUp or Gap shows one sequence completely to the right of the others, with only a few end bases shown as matched when run with BLOSUM62 (see examples below). But with the old matrix the sequences appeared to align together. Why?

A: The old matrix (PAM250) was overly permissive of mismatches and allowed you to align unrelated sequences. The BLOSUM62 matrix is more restrictive and will not routinely align unrelated sequences along their entire lengths. Thus, you may find that the displaced sequence(s) may be unrelated to the rest of your alignment. If you have reason to believe that the sequences are globally but very distantly related, then you might want to align them with a matrix based on an evolutionary model that assumes greater divergence time, like BLOSUM30. (A complete series of BLOSUM matrices is provided in the GenMoreData directory.)

(UNIX) prompt> namels genmoredata:blosum*

(OpenVMS) prompt> dir genmoredata:blosom*

51 100 Calm_Human RHVMTNLGEK LTDEEVDEMI READIDGDGQ VNYEEFVQMM TAK~~~~~~~ Calm_Drome RHVMTNLGEK LTDEEVDEMI READIDGDGQ VNYEEFVTMM TSK~~~~~~~ Calm_Wheat RHVMTNLGEK LTDEEVDEMI READVDGDGQ INYEEFVKVM MAK~~~~~~~ Hsp2_Mouse ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MV RYRMRSPSEG 101 147 Calm_Human ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~ Calm_Drome ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~ Calm_Wheat ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~ Hsp2_Mouse PHQGPGQDHE REEQGQGQGL SPERVEDYGR THRGHHHHRH RRCSRKR

Global alignments created by PileUp in Version 9.

51 RHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVTMMTSK....... 93 |. : 1 ......................................MVRYRMRSPSEG 12
Global alignment created by Gap in Version 9.

Pileup doesn't find conserved motifs in my alignment in Version 9

The following information is taken from the GCG Newsletter, July 1997.

PileUp was always intended as a global alignment tool, which aligns sequences along their entire lengths. Because of this, PileUp is not ideally suited to finding the best local region of similarity (such as a shared motif) among all of the sequences. Global alignment algorithms tend to overlook areas of local similarity in favor of aligning the entirety of all sequences.

If you know where the areas of conserved motifs lie, use the SeqLab Editor to manually edit the resulting global alignment from PileUp, and then select subranges to realign portions of an existing alignment.

References

(1) Henikoff, S. and Henikoff, J. G. 1992. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences USA 89, 10915-10919.

(2) Schwartz, R.M. and Dayhoff, M.O. (1979). Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff, ed.), 5, Suppl. 3 (pp; 353-358), National Biomedical Research Foundation, Washington, D.C., USA.

How do I run blast searches in batch mode?

After NCBI replaced their experimental blast servers in Jan 98, the blast searches on the helix system changed. The new blast program does not have the 'blast -batch' command. Instead, you should use the helix utility batchblast. A simple run of batchblast would be

helix% batchblast *.seq

which would run blast searches with default parameters for each of the files called xxx.seq, and produce an output file called xxx.blastn for each search. Batchblast has options to choose the database, program, and parameters; see the batchblast man page for more information.

How do I view the trees from Paupsearch?

Paupsearch and Paupdisplay are the GCG interfaces to the PAUP program. A common Mac/Windows program to view such trees is TreeView. However, the output produced by the GCG Paup programs is not suitable for Treeview input. TreeView requires a 'rooted' file.

The good news is that you can use PaupDisplay (and PaupSearch) to create a script files that can then be input to gcg_paup. A script can then be modified so that gcg_paup, when run, will perform the mid-point rooting and output the required .rooted file.

Here are the steps that you should take:

Run PaupSearch and then PaupDisplay with the desired settings. You should include the optional qualifier "SCRIPT" on the command line used to run PaupDisplay in order to generate the necessary script file. In addition, you may wish to consider using the optional command line qualifier "-NORUN" which prevents PaupSearch from doing anything beside generating the script. If you leave this out, then you will end up duplicating your efforts since you will repeat the analysis when you run gcg_paup (please see below).
Sample command line:
```
helix% paupsearch  -norun -script=myfile.script
```

In the script file (i.e. "myfile.script" in this example) you will see a "PAUP block" that looks something like this:


     begin paup;
     set errorstop;
     log file='myfile.paupdisplay' replace;
     set criterion=parsimony;
     describe all / brlens patristic homoplasy nofvalue root=midpoint;
     quit;
     endblock;

Edit the PAUP block, adding a "Savetrees" command that looks something like:

     savetrees /brlens root file='myfile.rooted';

Run gcg_paup with the modified script. For example:
```
   helix% gcg_paup myfile.script
```

Here is an example of the final .rooted file:


 #NEXUS 

 begin trees;  [Treefile saved Fri May  8 10:41:54 1998]
 [!>Tree(s) input to PAUP* as user-defined tree(s)
 ]
         translate
                 1 chpdrbbq,
                 2 chpdrbbs,
 
                 3 chpdrbbd,
                 4 chpdrbca,
                 5 chpdrbae,
                 6 chpdrbag
                 ;
 tree PAUP_1 = ((1:0,2:1):26,((3:2,4:1):19,(5:7,6:6):12):0);
 end;

This file (called myfile.rooted by default) will work as input to TreeView.

Information courtesy of Eric Cabot, GCG inc.

I need to compare my 200 bp sequence against a complete virus genome, which is 124,000 bases long. How should I do this?

One way is to run Fasta against the genome sequence. If the genome sequence is in Genbank, you can pull it out with the GCG 'fetch' command. Then run Fasta as follows:

helix% fasta

FastA does a Pearson and Lipman search for similarity between a query
sequence and a group of sequences of the same type (nucleic acid or
protein). For nucleotide searches, FastA may be more sensitive than BLAST. 

11-Nov-1997

 FASTA with what query sequence ?  mysequence.gcg

                  Begin (* 1 *) ?  
                End (*    50 *) ?  

 Search for query in what sequence(s) (* GenEMBL:* *) ?  hevzvxx.gb_vi

 What word size (* 6 *) ?  

 Don't show scores whose E() value exceeds: (* 2.0 *):  

 What should I call the output file (* mysequence.fasta *) ?  

          1 Sequences     124,884 bp searched    hevzvxx.gb_vi


(Nucleotide) FASTA of: test.gcg  from: 1   to: 50  June 22, 1998 10:39
 TO: hevzvxx.gb_vi  Sequences:          1  Symbols:    124,884  Word Size: 6

 Searching with both strands of the query.
 Scoring matrix: GenRunData:fastadna.cmp
 Constant pamfactor used
 Gap creation penalty: 16      Gap extension penalty: 4


The best scores are:                    init1 initn   opt

hevzvxx.gb_vi  
! LOCUS       HEVZVXX    124884 bp   ...  250   250   250
hevzvxx.gb_vi  
! LOCUS       HEVZVXX    124884 bp   ...   55    80    56



The list contains 2 entries.
How many alignments would you like to see (* 2 *) ? 

Aligning...


 CPU time used: 
       Database scan:  0:00: 0.8
Post-scan processing:  0:00: 1.5
      Total CPU time:  0:00: 3.0
 Output File: mysequence.fasta

As you can see, Fasta only found 2 entries -- basically, the best hit in the forward and backward direction. The genome is large, and so programs like Gap and Bestfit won't work, and anyway they will also give only the single best hit. An alternative approach would be to use the program Wordsearch followed by the program Segments. For example:

helix% wordsearch

WordSearch identifies sequences in the database that share large 
numbers of common words in the same register of comparison with your
query sequence.  The output of  WordSearch can be displayed with
Segments.

 WORDSEARCH with what query sequence ?  mysequence.gcg
                                       
                  Begin (* 1 *) ?   
                End (*  1000 *) ?  

 Search for query in what sequence(s) (* GenEMBL:* *) ?  vi:hevzvxx
                                                        
 What word size (* 6 *) ?  

 List how many best diagonals (* 50 *) ?  

 Integrate how many adjacent diagonals (* 3 *) ?  

 What should I call the output file (* mysequence.word *) ?  

      1 HEVZVXX              Len: 124,884
         6-mers found:     1,045,257
 Diagonals with words:        24,879
      Total diagonals:       251,766
   Sequences searched:             1
             CPU time:         00.53

          Output file: mysequence.word

The output from WordSearch is then typically passaged through the program Segments to produce a ".pairs" file. Here is what some of the the results of this search look like:

(BestFit) SEGMENTS from: myfile.word  May 22, 1998 14:29

 (Nucleotide) WORDSEARCH of: /usr/users/share/cabot/test/temp/myfile.seq  
 check: 2347 from: 1  to: 1000
 ASSEMBLE    January 13, 1998 16:44
Symbols:     1 to: 1000  from: dmrt412g          ck: 5977,     1 to: 1000 
LOCUS       DMRT412G     6897 bp    DNA             INV       13-NOV-1996
DEFINITION  Drosophila retrotransposon 412 genome.
ACCESSION   X04132 X03733 . . . 

 AvMatch: 3.84  AvMisMatch: -6.00  GapWeight: 50  LengthWeight: 3   ..

        Match display thresholds for the alignment(s):
                    | = IDENTITY
                    : =   3
                    . =   1

myfile.seq                check: 2347  from: 1      to: 1000   /Reverse
GB_VI:HEVZVXX             check: 4734  from: 70843  to: 124884
     X04370 Varicella-Zoster virus complete genome. 9/93
 Gaps: 1  Quality: 209  Ratio: 2.944  Score: 50  Width: 7  Limits: +/-8 
                  .         .         .         .         .
     989 CTCCGTTTTACAATATTTCTTACAATTTTTCTTATCTATATATATTTTAT 940
         |||||| ||   |||||   | ||   | |    || |   |||| || |
   70854 CTCCGTGTT.TTATATTATATCCACGGTGTTGATTCAACCAATATGTTGT 70902
                  .         .  
     939 ATTTACTTTATATTTATATATA 918
         || | |||| | |||   ||||
   70903 ATCTTCTTTTTTTTTTACTATA 70924

myfile.seq                check: 2347  from: 1      to: 1000  
GB_VI:HEVZVXX             check: 4734  from: 26311  to: 124884
     X04370 Varicella-Zoster virus complete genome. 9/93
 Gaps: 0  Quality: 112  Ratio: 7.000  Score: 46  Width: 8  Limits: +/-9 
                  .      
     648 AGATAATATTAAAAAT 663
         | ||| ||||| ||||
   26964 ACATATTATTATAAAT 26979

If the query sequence is relatively short (under 240 residues) then an alternate solution would be to use the program FindPatterns. Please be aware of the fact that, unless the optional qualifier "MISmatch" is included on the command line used to run FindPatterns, that only exact matches will be found. E.g:
```
   findpatterns -mis=2
```

Yet another approach is to use the program BreakUp to break HEVZVXX into overlapping segments which can then be searched with FastA. You must first use Fetch to copy the entire sequence to a file in the current working directory. Here is an example, showing how to break the sequence into a series of 5 kb segments that overlap by 500 bases.

% fetch vi:hevzvxx
Fetch copies GCG sequences or data files from the GCG database 
into your directory or displays them on your terminal screen.

 hevzvxx.gb_vi

% breakup  -seg=5000 -over=500

BreakUp reads a GCG-format sequence file containing more than 350,000
sequence characters and writes it as a set of separate, shorter,
overlapping sequence files that can be analyzed by Wisconsin 
Package programs. 


 BREAKUP of what file(s) ?   hevzvxx.gb_vi
                           
 hevzvxx_00.gb_vi  length: 5500 bp
 hevzvxx_01.gb_vi  length: 5500 bp
 hevzvxx_02.gb_vi  length: 5500 bp
 hevzvxx_03.gb_vi  length: 5500 bp
 hevzvxx_04.gb_vi  length: 5500 bp
 hevzvxx_05.gb_vi  length: 5500 bp
 hevzvxx_06.gb_vi  length: 5500 bp
 hevzvxx_07.gb_vi  length: 5500 bp
 hevzvxx_08.gb_vi  length: 5500 bp
 hevzvxx_09.gb_vi  length: 5500 bp
 hevzvxx_10.gb_vi  length: 5500 bp
 hevzvxx_11.gb_vi  length: 5500 bp
 hevzvxx_12.gb_vi  length: 5500 bp
 hevzvxx_13.gb_vi  length: 5500 bp
 hevzvxx_14.gb_vi  length: 5500 bp
 hevzvxx_15.gb_vi  length: 5500 bp
 hevzvxx_16.gb_vi  length: 5500 bp
 hevzvxx_17.gb_vi  length: 5500 bp
 hevzvxx_18.gb_vi  length: 5500 bp
 hevzvxx_19.gb_vi  length: 5500 bp
 hevzvxx_20.gb_vi  length: 5500 bp
 hevzvxx_21.gb_vi  length: 5500 bp
 hevzvxx_22.gb_vi  length: 5500 bp
 hevzvxx_23.gb_vi  length: 5500 bp
 hevzvxx_24.gb_vi  length: 4884 bp

You can then use an ambigous file specification such as "hevzvxx_*.gb_vi" with to perform a FastA to search against the the files produced by BreakUp. I recommend that you also include the optional qualifier NOSTATs the FastA command line in order to suppress the calculation of E() statistics. For example:

% fasta -nostat
 
FastA does a Pearson and Lipman search for similarity between a query
sequence and a group of sequences of the same type (nucleic acid or
protein). For nucleotide searches, FastA may be more sensitive than BLAST. 

 FASTA with what query sequence ?  mysequence.gcg
                                  
                  Begin (* 1 *) ?  
                End (*  1000 *) ?  

 Search for query in what sequence(s) (* GenEMBL:* *) ?  hevzvxx_*.gb_vi

Information courtesy of Eric Cabot, GCG inc.

How do I kill a paupsearch batch run?

On the helix systems, some cpu-intensive programs like paupsearch, framesearch, and mfold have been set up to run remotely on our high-speed parallel computer. This is invisible to users, so that you run these programs just like any other. For an interactive run of these programs, you can kill the program by typing 'Ctrl-C' (hold down the Control key and press 'c') just as with any Unix program.

For a batch run, however, the program is running on another computer, and it is not as easy to kill the program. During normal working hours, you can call up the helix staff (4-DCRT) and ask them to kill it for you. If you want to kill it yourself, you need to rlogin to churn, find the process, and kill it. Sample session:
helix% mfold -batch >> Setting up remote run on churn.nih.gov... MFold predicts optimal and suboptimal secondary structures for an RNA molecule using the most recent energy minimization method of Zuker. (Linear) MFOLD what sequence ? mzeorp1a.gb_pl Begin (* 1 *) ? End (* 1382 *) ? What should I call the energy matrix output file (* mzeorp1a.mfold *) ? ** mfold will run as a batch or at job. ** mfold was submitted using the command: " atnow " warning: commands will be executed using /bin/sh job 908896860.a at Tue Oct 20 11:21:00 1998
In the above session, I started an mfold run in batch. Note that the program told me:

>> Setting up remote run on churn.nih.gov...

which is a major clue that the program is running remotely on churn. I now realize I put in the wrong sequence and want to kill this run before it goes for days.

helix% rlogin churn
IRIX Release 6.2 IP25 churn
Copyright 1987-1996 Silicon Graphics, Inc. All Rights Reserved.
Last login: Tue Oct 20 09:52:40 EDT 1998 by susanc@chisos.cit.nih.gov
Share II resource management; Copyright (C) 1989-1995 Softway Pty Ltd
Share login on /dev/ttyq19.
Tue Oct 20 11:21:48 EDT 1998

churn% ps -fu susanc
     UID   PID  PPID  C    STIME TTY     TIME CMD
  susanc 29736 29735 19 11:21:45 pts/19  0:01 -tcsh 
  susanc 29634   840  0 11:21:05 ?       0:00 sh 
  susanc 29635 29634  0 11:21:06 ?       0:00 /bin/csh -f ./mfold_29611_1 
  susanc 29761 29736  8 11:22:19 pts/19  0:00 ps -fu susanc 
  susanc 28429 28428  0 09:52:40 pts/17  0:01 -tcsh 
  susanc 29636 29635 80 11:21:06 ?       1:05 /gcg/gcgbin/execute/mfold -Init=mfold_29611_1.init 
  susanc 29637 29635  0 11:21:06 ?       0:00 Mail -s mfold susanc 
  susanc 19125 19124  0 14:13:27 pts/13  0:01 -tcsh 

churn% kill 29636
churn% ps -fu susanc
  susanc 18938 18937  0 14:05:30 pts/12  0:04 -tcsh 
  susanc 29736 29735  0 11:21:45 pts/19  0:01 -tcsh 
  susanc 29776 29736  7 11:23:19 pts/19  0:00 ps -fu susanc 
churn% exit
helix%

So, as above, I rlogin'd to churn, looked for all the processes running under my name (with 'ps -fu susanc'), found the one that says /gcg/gcgbin/execute/mfold, and killed that one's process id. Then I typed 'ps -fu susanc' again to make sure it was gone, and exited churn back to helix. The batch process will send you email saying that it was killed.

How can I find all relevant sequences with Lookup?

I'm looking for Mouse TGF-beta related sequences. I tried an All Text search in Lookup for 'TGF-beta', and found 37 sequences. But M13177, a TGF-beta sequence that I know exists, was not among the list. Why didn't it show up?

The problem is really that there is no standard nomenclature that Genbank depositers use. So some people might describe as sequence with the words 'TGF-beta', and some may use 'Transforming Growth Factor Beta', with or without hyphens. Lookup's indices are pretty good, but they can't take care of all the variations.

The best approach is to think of all the possible variations of your key phrase, and search for all of them. So, in this case, look for

TGF-beta or Transforming Growth Factor

You need to use & for AND, and | for OR. See the lookup documentation for more information. So a good search would be:

 Complete the query form below:

                 All text:  TGF-beta | (transforming & growth & factor)
               Definition:
                   Author:
                  Keyword:
            Sequence name:
         Accession number:
                 Organism:  mus musculus
                Reference:
                    Title:
                  Feature:
  On or after (dd-mmm-yy):               On or before (dd-mmm-yy):
 Shortest sequence length:                Longest sequence length:

     Inter-field operator:  AND             Form of output list:  Whole Entries

 Press D to continue.

Searching genbank

 49 entries were found.

As you see, 49 entries were found, among which is GB_RO:MUSTGFRNA, which is your accession number M13177. Note that you can 'fetch' the sequence using either the locus name MUSTGFRNA or the accession number M13177.

How can I find bases 3134678 to 3136537 of the E. Coli genome?

Unfortunately there is no really convenient way to do exactly what you want. There are a couple of possible methods, but both require more effort than one would like.

Method 1: GCG's Lookup and Fetch programs: The E. Coli genome is divided into 400 sections, which are saved as 400 separate entries in the Genbank database. You can find the list of entries by using the GCG program 'lookup'. Select 'Genbank' as the database to search, type 'Coli & genome' next to 'All Text', and save the results. It will show that entry GB_BA1:ECAE000111 contains section 1 of 400 of the complete genome, and GB_BA1:ECAE000510 contains the last section, 400 of 400.
This does not tell you which bases are contained in each entry. If from some other source (e.g. a paper, or the author) you know one entry that contains part of your region of interest, you can get the upstream and downstream bases by getting the preceding and following entries. For example, in this case entry ECAE000381 contains most of the region of interest. If you want a few hundred bases on either side of this entry, you would want entries ECAE000380 and ECAE000382. Note that this requires prior knowledge of at least one entry number. The entries themselves give no indication of which bases they include, they just say 'section 271 of 400' etc.
In the lookup program, you could also search for 'U00096', which is the accession number for the entire genome. You can do this search through GCG-Lite+ on the web.
You can obtain the actual entries by typing 'fetch ECAE000381' for each accession number at the helix prompt.

Method 2: NCBI's Genome Query

An alternative way to obtain the sequence is by pointing your web browser at NCBI's Genome Query. Click on 'E. Coli' and you will get a basic circular figure showing the genome. Put your cursor as close to 3100K as possible, and click. You will get a figure showing 3100950..3150949, or something close to that. Use the forward and back arrows at the bottom of the figure to obtain a region that includes your region of interest. Then click on the 'TextView' button in the left frame. The main frame should now show an actual Genbank entry for bases 3100950..3150949
Choose 'Save Frame As...' in your browser menu under File. Choose 'Text' as the format, and save the entry into a file on your desktop computer. This entry is in Genbank format, will need to be converted to GCG format for any GCG program. ou can use the helix program 'fmtseq' to convert it. Type 'fmtseq' at the helix prompt and follow instructions.
Method 2: NCBI's Genome Query: An alternative way to obtain the sequence is by pointing your web browser at NCBI's Genome Query. Click on 'E. Coli' and you will get a basic circular figure showing the genome. Put your cursor as close to 3100K as possible, and click. You will get a figure showing 3100950..3150949, or something close to that. Use the forward and back arrows at the bottom of the figure to obtain a region that includes your region of interest. Then click on the 'TextView' button in the left frame. The main frame should now show an actual Genbank entry for bases 3100950..3150949
Choose 'Save Frame As...' in your browser menu under File. Choose 'Text' as the format, and save the entry into a file on your desktop computer. This entry is in Genbank format, will need to be converted to GCG format for any GCG program. ou can use the helix program 'fmtseq' to convert it. Type 'fmtseq' at the helix prompt and follow instructions.

However the sequence is obtained, you will need to delete the extra bases on each end within seqed, and will have to keep track of the base numbering by yourself. Type 'seqed filename' to start up seqed. If you used Method 1, you will have two files to enter. Position your cursor at the end of the first sequence, hit Ctrl-D to get to the command prompt, and type 'Include filename2' to add the second sequence at the end of the first.

How can I search for promoter regions?

The GCG program Findpatterns can be used to find any pattern, and the GCG file tfsites.dat contains patterns from the Transcription Factor database. Sample session:

helix 75% findpatterns -data=genmoredata:tfsites.dat

FindPatterns identifies sequences that contain short patterns like
GAATTC or YRYRYRYR.  You can define the patterns ambiguously and allow
mismatches. You can provide the patterns in a file or simply type them
in from the terminal

 FINDPATTERNS in what sequence(s) ? aw778747.gb_est46

  What should I call the output file (* aw778747.find *) ?

     aw778747.gb_est46 len:        410     ....................................
...............................................................................
...............................................................................
..................................................

 FINDPATTERNS in what sequence(s) ?
      
     Total finds:             45
    Total length:            410
 Total sequences:              1
        CPU time:          01.63

     Output file: aw778747.find

The file tfsites.dat is in the GCG directory genmoredata. You can copy this to your own directory using the 'fetch' program. e.g.

helix% fetch tfsites.dat

Fetch copies GCG sequences or data files from the GCG database
into your directory or displays them on your terminal screen.

 tfsites.dat

You can then modify this file to retain only the sites of your interest.

Peptide Sequence Databases
nr	All non-redundant GenBank CDS translations+PDB+SwissProt+PIR
month	All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days.
swissprot	the last major release of the SWISS-PROT protein sequence database (no updates)
yeast	Yeast (Saccharomyces cerevisiae) protein sequences.
E. coli	E. coli genomic CDS translations
pdb	Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank
kabat [kabatpro]	Kabat's database of sequences of immunological interest
alu	Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994) .
Nucleotide Sequence Databases
nr	All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences)
month	All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
dbest	Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
dbsts	Non-redundant Database of GenBank+EMBL+DDBJ STS Divisions
htgs	High Throughput Genomic Sequences
yeast	Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
E. coli	E. coli genomic nucleotide sequences
pdb	Sequences derived from the 3-dimensional structure
kabat [kabatnuc]	Kabat's database of sequences of immunological interest
vector	Vector subset of GenBank(R), NCBI, (anonymous FTP to ncbi.nlm.nih.gov beneath the /pub/vector directory)
mito	Database of mitochondrial sequences, Rel. 1.0, July 1995"
alu	Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).
epd	Eukaryotic Promotor Database
gss	Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.

Species	Codon table file name
green algae	algae.cod
Aspergillus nidulans	asn.cod
Arabidopsis thaliana	ath.cod
Hordeum vulgare (Barley)	bly.cod
Bombyx mori (Silk Moth)	bmo.cod
Bos taurus (Cow)	bov.cod
Bacillus subtilis	bsu.cod
Caenorhabditis elegans	cel.cod
Chironomus sp.	chi.cod
Gallus sp. (Chicken)	chk.cod
Chlamydomonas	cre.cod
Dictyostelium discoideum	ddi.cod
Drosophila melanogaster	dro.cod
Escherichia coli	eco.cod
enteric bacterial (highly expressed) genes	ecohigh.cod
Cricetulus sp. & Mesocricetus sp. (Hamster)	ham.cod
Homo sapiens	hum.cod
Klebsiella pneumoniae	kpn.cod
Macaca sp	mac.cod
Mus sp. (Mouse)	mus.cod
Zea mays (Maize)	mze.cod
Zea mays chloroplast (Maize)	mzecp.cod
Neurospora crassa	neu.cod
Neisseria gonorrheae	ngo.cod
Pisum sativum (Pea)	pea.cod
Petunia sp.	pet.cod
Plasmodium falciparum	pfa.cod
Phaseolus vulgaris (Lima bean)	phv.cod
Solanum tuberosum (Potato)	pot.cod
Pseudomonas sp.	pse.cod
Oryctolagus sp. (Rabbit)	rab.cod
Rattus sp. (Rat)	rat.cod
Rhizobium meliloti	rhm.cod
Oryza sativa (Rice)	ric.cod
Ovis sp. (Sheep)	shp.cod
Physarum polycephalum	slm.cod
Glycine max (Soybean)	soy.cod
Staphylococcus aureus	sta.cod
Salmonella thphimurium	sty.cod
Strongylocentrotus purpuratus	sus.cod
Tetrahymena thermophila	tet.cod
Nicotiana tabacum (Tobacco)	tob.cod
Nicotiana tabacum chloroplast (Tobacco)	tobcp.cod
Lycopersicon esculentum (Tomato)	tom.cod
Trypanosoma brucei	trb.cod
Triticum aestivum (Wheat)	wht.cod
Xenopus laevis	xel.cod
Saccharomyces cerevisiae	ysc.cod
Saccharomyces cerevisiae mitochondrion	yscmt.cod
Schizosaccharomyces pombe	ysp.cod