Your sequence data is not in GCG format. You can use cut-and-paste into Seqed, Readseq or GCG's Reformat to convert it into GCG format on helix, or use GCG-Lite's format conversion tool. These methods are described further below.
*IG/Stanford, used by Intelligenetics and others *GenBank/GB, genbank flatfile format *NBRF format *EMBL, EMBL flatfile format *GCG, single sequence format of GCG software *DNAStrider, for common Mac program *Fitch format, limited use *Pearson/Fasta, common format used by FastA program and others *Zuker format, limited use. Input only. *Olsen, format printed by Olsen VMS sequence editor. Input only. *Phylip3.2, sequential format for Phylip programs *Phylip, interleaved format for Phylip programs (v3.3, v3.4) *Plain/ Raw, sequence data only (no name, document, numbering) *MSF multi sequence format used by GCG software *PAUP's multiple sequence (NEXUS) format *PIR/CODATA format used by PIRIf your data is 'raw' (i.e. it has simply sequence data with no headers, dividers etc.), then be aware that Readseq may not accept it correctly. Your simplest option is to convert it into Fasta format by adding this line to the top of the file:
It is always worth checking the output file after Readseq! You don't need to examine the whole sequence, just check the beginning, end and length of the sequence. Readseq sometimes doesn't recognize headers properly and includes them in the sequence -- an easy error to notice.
When you run Pileup interactively, it prints the pairwise similarity scores to the screen. You see output like this:
Determining pairwise similarity scores... 1 x 2 1.19 1 x 3 1.20 //////////////////////// 26 x 28 0.87 27 x 28 0.86 |
Sample session:
|
Mail received when batch job finishes:
Date: Wed, 4 Jun 1997 14:18:23 -0400 (EDT) From: Susan Chacko <susanc@helix.nih.gov> Subject: pileup Welcome to the WISCONSIN PACKAGE Version 8.1-UNIX, August 1995 [....etc....] Running pileup using : .. -monitor -INfile1=@hsp2.list -GAPweight=5.000000 -LENgthweight=0.300000 -OUTfile1=hsp2.msf -DENsity=3.333433 -FIGure -Default 1 BBHSP60 1931 bp 2 BBHSPRO 2116 bp 3 BBOSPAGE 819 bp Determining pairwise similarity scores... 1 x 2 0.36 1 x 3 0.37 2 x 3 0.38 Aligning... 1 ........................................-..... 2 .................................................. ..............................................-... ....... Total sequences: 3 Alignment length: 2,120 CPU time: 09.97 Output file:/u5/susanc/hsp2.msf |
If you can calculate the scale from the trees file that is output by growtree, you may be able to edit the growtree figure file (which is produced by 'growtree -figure') and manually add the scale. You can get information about editing the figure file from the GCG user guide (or type 'genhelp figure'). However, this will be fairly complicated unless you are very familiar with the figure file formats.
There are other software packages that will do this including PAUP for the Mac and PHYLIP for Mac and Windows. Phylip may be available on the helix systems in the future; send mail to the helix staff via the ptr help system (type 'ptr' at the helix prompt) if you are interested.
A list of programs for Phylogenetic analysis is available from U. Arizona.
recovering....can't openand nothing more will happen. The solution to get seqed working again is to delete the seqed.log file (by typing 'rm seqed.log').
Note: This problem seems to have been fixed in GCG 9.0
The databases can be accessed in groups and sub-groups. For example, Bacterial sequences can be searched separately as ba:*, gb_ba:*, or as part of Genbank (Genbank:*), or GenbankPlus (GBPlus:*), or GenEMBLPlus (GEPlus:*). The database section of the GCG manual lists the short names for each protein and nucleic acid database. (On helix, you can also type 'genmanual' and go to 'Databases').
Genbank and EMBL trade data nightly, so that Genbank encompasses almost all of EMBL at any given time. Thus it is only necessary to maintain one of these databases. On the helix systems, we keep Genbank up-to-date, and so EMBL on our system is vanishingly small. Thus, the GenEMBLPlus is essentially the same as GenbankPlus, GenEMBL is essentially equal to Genbank, and so on.
To use any of these databases in a Fasta search, remember to type ':*' after the database name! For example, you can search the Tags library using 'Tags:*' or 'Gb_Tags:*'. The database specifications are case-insensitive.
helix% fetch rebase.release Fetch copies GCG sequences or data files from the GCG database into your directory or displays them on your terminal screen. rebase.release
You may want to see the latest list of databases on the helix systems. You can also obtain this list by typing 'genmanual' at the helix prompt, and going to 'Databases'.
From the GCG help on motifs:
The PROSITE Dictionary contains a number of short sequence patterns that occur frequently in protein sequences. Most of these frequently found patterns are post-translational modifications, but more specific patterns such as leucine zippers also fall into this category. Such frequently found patterns are not normally shown by Motifs, but you can display them by using the command-line option -FREquent. More so than with other patterns in the PROSITE Dictionary, the presence of these frequently occurring patterns does not assure you that the protein actually contains the corresponding function. Here are the patterns that the PROSITE Dictionary classifies as frequently occurring in Release 14.0: ;Amidation 1 xG(R,K)(R,K) 0009.pdoc ;Asn_Glycosylation 1 N~(P)(S,T)~(P) 0001.pdoc ;Camp_Phospho_Site 1 (R,K)2x(S,T) 0004.pdoc ;Ck2_Phospho_Site 1 (S,T)x2(D,E) 0006.pdoc ;Glycosaminoglycan 1 SGxG 0002.pdoc ;Leucine_Zipper 1 Lx6Lx6Lx6L 0029.pdoc ;Microbodies_Cter 1 (S,T,A,G,C,N)(R,K,H)(L,I,V,M,A,F,Y)> 0299.pdoc ;Myristyl 1 G~(E,D,R,K,H,P,F,Y,W)x2(S,T,A,G,C,N)~(P) 0008.pdoc ;Pkc_Phospho_Site 1 (S,T)x(R,K) 0005.pdoc ;Rgd 1 RGD 0016.pdoc ;Tyr_Phospho_Site 1 (R,K)x{2,3}(D,E)x{2,3}Y 0007.pdoc
Peptide Sequence Databases | |
nr | All non-redundant GenBank CDS translations+PDB+SwissProt+PIR |
month | All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days. |
swissprot | the last major release of the SWISS-PROT protein sequence database (no updates) |
yeast | Yeast (Saccharomyces cerevisiae) protein sequences. |
E. coli | E. coli genomic CDS translations |
pdb | Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank |
kabat [kabatpro] | Kabat's database of sequences of immunological interest |
alu | Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994) . |
Nucleotide Sequence Databases | |
nr | All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences) |
month | All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days. |
dbest | Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions |
dbsts | Non-redundant Database of GenBank+EMBL+DDBJ STS Divisions |
htgs | High Throughput Genomic Sequences |
yeast | Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences |
E. coli | E. coli genomic nucleotide sequences |
pdb | Sequences derived from the 3-dimensional structure |
kabat [kabatnuc] | Kabat's database of sequences of immunological interest |
vector | Vector subset of GenBank(R), NCBI, (anonymous FTP to ncbi.nlm.nih.gov beneath the /pub/vector directory) |
mito | Database of mitochondrial sequences, Rel. 1.0, July 1995" |
alu | Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994). |
epd | Eukaryotic Promotor Database |
gss | Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences. |
Search for query in what sequence(s) (* GenEMBL:* *) ? genbank Search for query in what sequence(s) (* GenEMBL:* *) ? embl Search for query in what sequence(s) (* GenEMBL:* *) ? genbank: Search for query in what sequence(s) (* GenEMBL:* *) ? alland Fasta will say:
*** No sequences to compare to query sequence! ***You need the ":*" after the database name in order for Fasta to search it correctly. Here are some correct responses:
Search for query in what sequence(s) (* GenEMBL:* *) ?(i.e. if you simply hit a carriage return (Enter), Fasta will search the default database. For nucleic acid query sequences this is GenEMBL, and for protein query sequences the default is Swissprot).
Search for query in what sequence(s) (* GenEMBL:* *) ? genbank:*(This is the correct syntax for the database name).
Search for query in what sequence(s) (* GenEMBL:* *) ? *Here the syntax is correct, but the '*' tells Fasta to search through all the sequence files in your own current directory, and not through any database. If this is what you want, you're ok.
Another possibility is that you were trying to search EMBL? On the helix systems, we primarily maintain Genbank, and only keep those entries from EMBL that do not appear in EMBL. Since EMBL and Genbank largely overlap, many of the EMBL databases (such as em_pat, em_gss) are empty or have very few sequences. Try the equivalent Genbank databases instead (e.g. gb_pat or gb_gss).
Human fetal Beta globin G gamma .. AGGAAGCACC CTTCAGCAGT TCCACABut if the middle line with the dots is missing, Reformat will assume the whole file contains a sequence. It says:
No ".." dividerand will create a 'sequence' that looks like:
REFORMAT of: test.seq check: -1 from: 1 to: 52 June 17, 1997 10:14 (No documentation) test.seq Length: 52 June 17, 1997 10:14 Type: P Check: 7154 .. 1 Humanfetal Betaglobin GgammaAGGA AGCACCCTTC AGCAGTTCCA 51 CA
Species | Codon table file name |
green algae | algae.cod |
Aspergillus nidulans | asn.cod |
Arabidopsis thaliana | ath.cod |
Hordeum vulgare (Barley) | bly.cod |
Bombyx mori (Silk Moth) | bmo.cod |
Bos taurus (Cow) | bov.cod |
Bacillus subtilis | bsu.cod |
Caenorhabditis elegans | cel.cod |
Chironomus sp. | chi.cod |
Gallus sp. (Chicken) | chk.cod |
Chlamydomonas | cre.cod |
Dictyostelium discoideum | ddi.cod |
Drosophila melanogaster | dro.cod |
Escherichia coli | eco.cod |
enteric bacterial (highly expressed) genes | ecohigh.cod |
Cricetulus sp. & Mesocricetus sp. (Hamster) | ham.cod |
Homo sapiens | hum.cod |
Klebsiella pneumoniae | kpn.cod |
Macaca sp | mac.cod |
Mus sp. (Mouse) | mus.cod |
Zea mays (Maize) | mze.cod |
Zea mays chloroplast (Maize) | mzecp.cod |
Neurospora crassa | neu.cod |
Neisseria gonorrheae | ngo.cod |
Pisum sativum (Pea) | pea.cod |
Petunia sp. | pet.cod |
Plasmodium falciparum | pfa.cod |
Phaseolus vulgaris (Lima bean) | phv.cod |
Solanum tuberosum (Potato) | pot.cod |
Pseudomonas sp. | pse.cod |
Oryctolagus sp. (Rabbit) | rab.cod |
Rattus sp. (Rat) | rat.cod |
Rhizobium meliloti | rhm.cod |
Oryza sativa (Rice) | ric.cod |
Ovis sp. (Sheep) | shp.cod |
Physarum polycephalum | slm.cod |
Glycine max (Soybean) | soy.cod |
Staphylococcus aureus | sta.cod |
Salmonella thphimurium | sty.cod |
Strongylocentrotus purpuratus | sus.cod |
Tetrahymena thermophila | tet.cod |
Nicotiana tabacum (Tobacco) | tob.cod |
Nicotiana tabacum chloroplast (Tobacco) | tobcp.cod |
Lycopersicon esculentum (Tomato) | tom.cod |
Trypanosoma brucei | trb.cod |
Triticum aestivum (Wheat) | wht.cod |
Xenopus laevis | xel.cod |
Saccharomyces cerevisiae | ysc.cod |
Saccharomyces cerevisiae mitochondrion | yscmt.cod |
Schizosaccharomyces pombe | ysp.cod |
For details about any of these tables, extract the actual table by typing 'fetch xxx.cod' at the helix prompt, and read the documentation at the top of the file.
To use any of these codon tables instead of the default (E.Coli) table, add the parameter '-translate=xxx.cod' when running any of the above programs.
If you need a codon table that is not in this list, send mail to the helix staff using the ptr command on helix (type 'ptr' at the helix prompt) and we will attempt to track it down for you.
Sample session:
helix% motifs -check Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds. Minimal Syntax: % motifs [-INfile=]SW:Kad1_Human -Default Prompted Parameters: [-OUTfile=]kad1_human.motifs the output file name Local Data Files: -DATa=prosite.patterns file of protein sequence patterns Optional Parameters: -NOREFerence suppresses the PROSITE abstract for each pattern found -FREquent shows motifs that are frequently found in proteins -MISmatch=1 allows one mismatch -NAMes writes output file in file-of-sequence-names format -APPend appends the pattern data file to your output file -SHOw shows every file searched, even if no pattern was found -ONCe limits finds to patterns found only once -MINCuts=2 limits finds to patterns found a minimum of 2 times -MAXCuts=3 limits finds to patterns found a maximum of 3 times -EXCLude=n1,n2 excludes patterns found between positions n1 and n2 -NOMONitor suppresses the screen trace showing each file -NOSUMmary suppresses the screen summary at the end of the program Add what to the command line ? -frequent MOTIFs from what protein sequence(s) ? cram_craab.swissprot What should I call the output file (* cram_craab.motifs *) ? cram_craab.swissprot len: 46 ............................ Total finds: 2 Total length: 46 Total sequences: 1 CPU time (sec): 01.19 Output file: "/u5/susanc/cram_craab.motifs" |
--------------------------- File pileup.inp ------------------- w68343.gb_new x12871.gb_new hcu31231.gb_vi hsig487.gb_pr ---------------------------------------------------------------When pileup asks for the sequences, type '@pileup.inp' in response. Note the '@' symbol before the filename; this tells GCG that the file contains a list of sequences. The filename itself should not start with '@'!
You don't need to have the sequences in your own directory to align them; pileup (and most GCG programs) can fish out sequences from the databases. For example, a valid file of sequence names would be:
----------------------- File hsp.list -------------------------- LOOKUP in: genbank of: "[SQ-ALL: hsp70*]" 3 entries June 3, 1997 16:11 .. GB_BA:BBHSP60 ! ID: 120c0005 ! DEFINITION B.burgdorferi hsp60 gene for 60kDa heat shock protein. GB_BA:BBHSPRO ! ID: 130c0005 ! DEFINITION B.burgdorferi dnaK gene for heat-shock protein. GB_BA:BBOSPAGE ! ID: 480c0005 ! DEFINITION B.burgdorferi OspA gene for outer surface protein A. -------------------------------------------------------------------(This file was output by GCG's Lookup program). In this case, the sequences are still in the database, and Pileup will pull out each sequence at the start of its run. It will take a little longer than if the sequences are in your home directory.
Sample pileup session:
helix% pileup PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. PileUp of what sequences ? @pileup.inp 1 w68343.gb_new 392 bp 2 x12871.gb_new 429 bp 3 hcu31231.gb_vi 308 bp 4 hsig487.gb_pr 487 bp What is the gap creation penalty (* 5.00 *) ? [....etc......] |
Fasta uses the Pearson-Lippman algorithm, and looks for "word" matches (an optimal word size is 2 for proteins and 4-6 for nucleic acids) between the query and the database sequence. It finds the region in the database sequence that has the most word matches, and then looks for other nearby regions that match (i.e. it inserts gaps into the sequences if that will improve the match). It keeps track of the highest-matching sequences, and will report only the top matching sequences.
Which is more appropriate for your purpose? Note the following:
but
Additional links:
For example, say you have a 10-nucleotide protein binding site within a promoter, and you want to search the E.Coli genome for this consensus sequence. Findpatterns works best if you can define a sequence pattern that is specific, but includes all of the "real" binding sites. It does not allow positional weighting, so the best thing to give it is fully conserved residues/IUPAC specifications and allowed spacings between conserved groups. In addition, if there are correlated dinucleotide variations they are best searched as separate patterns instead of via a single generalization of the pattern.
i.e.
CNGGATNA{5,7}TNATCCNGworks better than
CNGGTANA{5,7}TNTAGGNGMM
CNGGNNNA{5,7}TNNNCCNG
Findpatterns also lets you allow mismatches, but in general for short, degenerate patterns without positional weighting this will just decrease your signal to noise ratio. It is better to explicitly search additional patterns.
Findpatterns patterns are specified in the same data file format as restriction enzymes for "MAP", etc. Type "fetch pattern.dat" to get an example file you can modify with your own patterns. Type "genhelp findpatterns" to see a description of the program. The pattern syntax is described under "defining patterns" and making and using your own pattern files are described under "pattern file" and "local data files".
If you have several of these sequences, and if the contribution of the individual bases is unequal (e.g. the T1, A2 and T6 in the -10 box are more conserved than the others) then you can align the known sequences (make individual sequence files and then use Pileup; or enter them directly using Lineup). You can then use Profilemake and then Profilesearch. These latter two programs are meant for proteins but can be gotten to work with DNA sequences - don't forget to use -MATRix=profiledna.cmp with Profilemake.
Dr. William Pearson, author of FastA, says that FastA should work fine for such a search if you use a word size of 1 instead of the default word size of 6 for DNA.
(Response to this question modified from posts by Michael Leonetto and Paul Roy on bionet.software.gcg.)
Two similar programs are Blastx, which searches a protein database against a query nucleotide sequence, or Tblastn and Tfasta, which search a nucleotide database against a query protein sequence. All these programs first translate the nucleotide sequence in all six reading frames before searching and alignment. The Blast group of programs does not insert gaps, and none of these account for frame shift errors that may be present.
But framesearch is very slow, especially when searching an entire database. The blast group of programs will give you a result in half an hour or less , as opposed to 3 hrs or days for framesearch. If you need the sensitivity of framesearch, try running a Blastx/Tblastn/Tfasta search, taking the top scores, and running framesearch against those.
Additional links:
You also can specify a threshold value with the command-line qualifier -THReshold. This can be important in amino acid sequence analysis. The default value for the threshold qualifier is 1.0, which means that all matched residue pairs in the scoring matrix that have a value of 1.0 or greater will vote as a block in calculating a consensus residue for a particular position. If you do not want to see block voting, you can increase the threshold value with the command-line qualifier -THReshold=1.5. (1.5 is the value assigned for an identical match in the default scoring matrix used by Pretty. For example, an alanine to alanine match would be assigned a value of 1.5.) If you are using a scoring matrix other than the default one, you can use the command line qualifier -IDEntity to eliminate block voting.
Question/response taken from GCG Newsletter, March 94..
Question/response taken from GCG Newsletter, March 94..
*IG/Stanford, used by Intelligenetics and others *GenBank/GB, genbank flatfile format *NBRF format *EMBL, EMBL flatfile format *GCG, single sequence format of GCG software *DNAStrider, for common Mac program *Fitch format, limited use *Pearson/Fasta, common format used by FastA program and others *Zuker format, limited use. Input only. *Olsen, format printed by Olsen VMS sequence editor. Input only. *Phylip3.2, sequential format for Phylip programs *Phylip, interleaved format for Phylip programs (v3.3, v3.4) *Plain/ Raw, sequence data only (no name, document, numbering) *MSF multi sequence format used by GCG software *PAUP's multiple sequence (NEXUS) format *PIR/CODATA format used by PIRIf your data is 'raw' (i.e. it has simply sequence data with no headers, dividers etc.), then be aware that Readseq may not accept it correctly. Your simplest option is to convert it into Fasta format by adding this line to the top of the file:
It is always worth checking the output file after Readseq! You don't need to examine the whole sequence, just check the beginning, end and length of the sequence. Readseq sometimes doesn't recognize headers properly and includes them in the sequence -- an easy error to notice.
Sample session:
helix% lookup LookUp identifies sequences by name, accession number, author, organism, keyword, title, reference, feature, definition, length, or date. The output is a list of sequences. The LookUp program is experimental in this release--please look carefully at your results. LOOKUP in what sequence libraries: a) swissprot b) pir c) embl d) genbank e) em_tags f) gb_tags g) gb_new h) genpept i) All libraries q) quit Please choose one or more (* i *): d Complete the query form below: All text: tryptophan synthase Definition: Author: Keyword: Sequence name: Accession number: Organism: Reference: Title: Feature: cds On or after (dd-mmm-yy): On or before (dd-mmm-yy): Shortest sequence length: Longest sequence length: Inter-field operator: AND Form of output list: Fragments Press |
If you just want to 'pause' the program (while you take a look at your query file, for example) use 'Ctrl-z' (hold down the Control key and hit 'z') to suspend the program execution. You will get a helix prompt, and the program will remain in a suspended state until you type 'fg' (for 'foreground') or 'bg' (for 'background'). You can also choose to kill the program after suspending it, by typing 'jobs', and killing the appropriate job. For example, after pausing a Fasta job with 'Ctrl-z':
helix% jobs [1] + Suspended ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta )This tells you that you have one job suspended, and that it is job number 1. helix% bg [1] ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta ) & 1,701 Sequences 675,057 aa searched SW:AGL1_ARATH 1,801 Sequences 716,153 aa searched SW:ALAT_PIG 1,901 Sequences 754,379 aa searched SW:ALF_SCHPO 2,001 Sequences 792,028 aa searched SW:AMA1_PLAF8 [...etc....] |
helix% fg ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta x89712.genpept ) 6,001 Sequences 2,332,970 aa searched SW:CC27_HUMAN 6,101 Sequences 2,374,535 aa searched SW:CCKN_MACEU 6,201 Sequences 2,406,280 aa searched SW:CD36_HUMAN 6,301 Sequences 2,437,213 aa searched SW:CD9_CERAE 6,401 Sequences 2,479,989 aa searched SW:CE11_CAEEL 6,501 Sequences 2,516,827 aa searched SW:CERB_CERCA 6,601 Sequences 2,563,151 aa searched SW:CG2A_DROME 6,701 Sequences 2,594,679 aa searched SW:CH10_CAUCR [...etc...] |
helix% kill %1 [1] Terminated ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta ) |
'fg' and 'bg' should be used with caution, as they interfere with the normal execution of a Fasta job, and in some situations can lead to unexpected results especially if a job is foregrounded or backgrounded more than once. Your job may hang, become unkillable, or leave incomplete files.
For example, a search through Swissprot for 'glyceraldehyde' takes 25 seconds with Stringsearch Option A, 22 minutes with Stringsearch Option B, and about half a second with Lookup. Lookup searches can also be more easily tailored for your specific needs -- you can search through mouse sequences only, for example. So in most cases, use Lookup instead of Stringsearch.
The tfasta output (or the tblastn output) are not suitable for automatic parsing, so it is difficult to get a list of sequence names from the output that you can pop into Pileup.
However, a better way to do this would be to run a fasta search against the GenPept database, instead of Genbank. GenPept is maintained on the helix systems and is updated bimonthly. It contains all the putative translations from the last Genbank release. You can search GenPept using 'GenPept:*' in Fasta. In Pileup, you can then use the Begin= and End= parameters to define the ends of the region for alignment. Each sequence can have its own range limitation.
Another option is to use framesearch instead of fasta. You could then create a list file using the begin, end and range information from the framesearch output, and use that as input into Pileup. However, framesearch is much slower than Fasta.
Other web-accessible primer design programs are
Programs than run on Macs and PC's are MacVector for the Mac, GeneWorks for the Mac, OSP for Mac/PC/VAX/Sun/Xwindows, and primegen. Also check EBI's list of primer design programs.
More information about primer design can be found at:
Sample session:
helix% more jc5122.pir2 P1;JC5122 - superoxide dismutase (EC 1.15.1.1) (Mn) precursor - Caenorhabditis elegans C;Species: Caenorhabditis elegans C;Date: 02-Feb-1997 #sequence_revision 27-Feb-1997 #text_change 13-Mar-1997 C;Accession: JC5122; JS0750 [...etc etc....] jc5122.pir2 Length: 221 August 14, 1997 10:53 Type: P Check: 1490 .. 1 MLQNTVRCVS KLVQPITGVA AVRSKHSLPD LPYDYADLEP VISHEIMQLH 51 HQKHHATYVN NLNQIEEKLH EAVSKGNVKE AIALQPALKF NGGGHINHSI 101 FWTNLAKDGG EPSAELLTAI KSDFGSLDNL QKQLSASTVA VQGSGWGWLG 151 YCPKGKILKV ATCANQDPLE ATTGLVPLFG IDVWEHAYYL QYKNVRPDYV 201 NAIWKIANWK NVSERFAKAQ Q helix% reformat -oneintothree Reformat rewrites sequence file(s), scoring matrix file(s), or enzyme data file(s) so that they can be read by GCG programs. REFORMAT what sequence file(s) ? jc5122.pir2 jc5122.pir2 length: 663 aa helix% more jc5122.pir2 P1;JC5122 - superoxide dismutase (EC 1.15.1.1) (Mn) precursor - Caenorhabditis elegans C;Species: Caenorhabditis elegans C;Date: 02-Feb-1997 #sequence_revision 27-Feb-1997 #text_change 13-Mar-1997 C;Accession: JC5122; JS0750 [... etc etc ...] F;50,98,182,186/Binding site: manganese (His, His, Asp, His) #status predicted jc5122.pir2 Length: 663 August 14, 1997 10:53 Type: P Check: 9126 .. 1 Met Leu Gln Asn Thr Val Arg Cys Val Ser Lys Leu Val Gln Pro 46 Ile Thr Gly Val Ala Ala Val Arg Ser Lys His Ser Leu Pro Asp 91 Leu Pro Tyr Asp Tyr Ala Asp Leu Glu Pro Val Ile Ser His Glu 136 Ile Met Gln Leu His His Gln Lys His His Ala Thr Tyr Val Asn 181 Asn Leu Asn Gln Ile Glu Glu Lys Leu His Glu Ala Val Ser Lys 226 Gly Asn Val Lys Glu Ala Ile Ala Leu Gln Pro Ala Leu Lys Phe 271 Asn Gly Gly Gly His Ile Asn His Ser Ile Phe Trp Thr Asn Leu 316 Ala Lys Asp Gly Gly Glu Pro Ser Ala Glu Leu Leu Thr Ala Ile 361 Lys Ser Asp Phe Gly Ser Leu Asp Asn Leu Gln Lys Gln Leu Ser 406 Ala Ser Thr Val Ala Val Gln Gly Ser Gly Trp Gly Trp Leu Gly 451 Tyr Cys Pro Lys Gly Lys Ile Leu Lys Val Ala Thr Cys Ala Asn 496 Gln Asp Pro Leu Glu Ala Thr Thr Gly Leu Val Pro Leu Phe Gly 541 Ile Asp Val Trp Glu His Ala Tyr Tyr Leu Gln Tyr Lys Asn Val 586 Arg Pro Asp Tyr Val Asn Ala Ile Trp Lys Ile Ala Asn Trp Lys 631 Asn Val Ser Glu Arg Phe Ala Lys Ala Gln Gln |
helix% blast x89712.genpept -default Sending query... Awaiting results... ** Server did not say "OK".Your options are to try again later, use the Blast email server (add '-mail' to the command line), or use Blast via the web.
The old 'blast -batch' command doesn't work anymore; to run blast in batch mode see How do I run blast searches in batch mode
Protein sequence alignment programs in the Wisconsin Package use scoring matrices to make comparisons between pairs of amino acids. In Version 9.0, released in December 1996, changes were made to the scoring matrices which may affect the alignments you create. If the sequences you aligned in previous versions of the Package were very similar, you may not have noticed much difference to those same alignments produced using Version 9.0. However, with more distantly related sequences the alignments produced may be substantially different.
Version 9 Changes to Scoring Matrices
There are two main changes to protein scoring matrices in Version 9.0.
How Does This Affect Me?
Two common questions you may have about the alignments created in Version 9.0 and their answers are detailed below.
Q: The global alignment displayed by PileUp or Gap shows one sequence completely to the right of the others, with only a few end bases shown as matched when run with BLOSUM62 (see examples below). But with the old matrix the sequences appeared to align together. Why?
A: The old matrix (PAM250) was overly permissive of mismatches and allowed you to align unrelated sequences. The BLOSUM62 matrix is more restrictive and will not routinely align unrelated sequences along their entire lengths. Thus, you may find that the displaced sequence(s) may be unrelated to the rest of your alignment. If you have reason to believe that the sequences are globally but very distantly related, then you might want to align them with a matrix based on an evolutionary model that assumes greater divergence time, like BLOSUM30. (A complete series of BLOSUM matrices is provided in the GenMoreData directory.)
(UNIX) prompt> namels genmoredata:blosum*
(OpenVMS) prompt> dir genmoredata:blosom*
51 100 Calm_Human RHVMTNLGEK LTDEEVDEMI READIDGDGQ VNYEEFVQMM TAK~~~~~~~ Calm_Drome RHVMTNLGEK LTDEEVDEMI READIDGDGQ VNYEEFVTMM TSK~~~~~~~ Calm_Wheat RHVMTNLGEK LTDEEVDEMI READVDGDGQ INYEEFVKVM MAK~~~~~~~ Hsp2_Mouse ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MV RYRMRSPSEG 101 147 Calm_Human ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~ Calm_Drome ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~ Calm_Wheat ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~ Hsp2_Mouse PHQGPGQDHE REEQGQGQGL SPERVEDYGR THRGHHHHRH RRCSRKR |
Global alignments created by PileUp in Version 9.
51 RHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVTMMTSK....... 93 |. : 1 ......................................MVRYRMRSPSEG 12 |
PileUp was always intended as a global alignment tool, which aligns sequences along their entire lengths. Because of this, PileUp is not ideally suited to finding the best local region of similarity (such as a shared motif) among all of the sequences. Global alignment algorithms tend to overlook areas of local similarity in favor of aligning the entirety of all sequences.
If you know where the areas of conserved motifs lie, use the SeqLab Editor to manually edit the resulting global alignment from PileUp, and then select subranges to realign portions of an existing alignment.
References
(1) Henikoff, S. and Henikoff, J. G. 1992. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences USA 89, 10915-10919.
(2) Schwartz, R.M. and Dayhoff, M.O. (1979). Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff, ed.), 5, Suppl. 3 (pp; 353-358), National Biomedical Research Foundation, Washington, D.C., USA.
helix% batchblast *.seqwhich would run blast searches with default parameters for each of the files called xxx.seq, and produce an output file called xxx.blastn for each search. Batchblast has options to choose the database, program, and parameters; see the batchblast man page for more information.
The good news is that you can use PaupDisplay (and PaupSearch) to create a script files that can then be input to gcg_paup. A script can then be modified so that gcg_paup, when run, will perform the mid-point rooting and output the required .rooted file.
Here are the steps that you should take:
Sample command line:
helix% paupsearch -norun -script=myfile.script
begin paup; set errorstop; log file='myfile.paupdisplay' replace; set criterion=parsimony; describe all / brlens patristic homoplasy nofvalue root=midpoint; quit; endblock; |
savetrees /brlens root file='myfile.rooted';
helix% gcg_paup myfile.script
#NEXUS begin trees; [Treefile saved Fri May 8 10:41:54 1998] [!>Tree(s) input to PAUP* as user-defined tree(s) ] translate 1 chpdrbbq, 2 chpdrbbs, 3 chpdrbbd, 4 chpdrbca, 5 chpdrbae, 6 chpdrbag ; tree PAUP_1 = ((1:0,2:1):26,((3:2,4:1):19,(5:7,6:6):12):0); end; |
This file (called myfile.rooted by default) will work as input to TreeView.
Information courtesy of Eric Cabot, GCG inc.
helix% fasta FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST. 11-Nov-1997 FASTA with what query sequence ? mysequence.gcg Begin (* 1 *) ? End (* 50 *) ? Search for query in what sequence(s) (* GenEMBL:* *) ? hevzvxx.gb_vi What word size (* 6 *) ? Don't show scores whose E() value exceeds: (* 2.0 *): What should I call the output file (* mysequence.fasta *) ? 1 Sequences 124,884 bp searched hevzvxx.gb_vi (Nucleotide) FASTA of: test.gcg from: 1 to: 50 June 22, 1998 10:39 TO: hevzvxx.gb_vi Sequences: 1 Symbols: 124,884 Word Size: 6 Searching with both strands of the query. Scoring matrix: GenRunData:fastadna.cmp Constant pamfactor used Gap creation penalty: 16 Gap extension penalty: 4 The best scores are: init1 initn opt hevzvxx.gb_vi ! LOCUS HEVZVXX 124884 bp ... 250 250 250 hevzvxx.gb_vi ! LOCUS HEVZVXX 124884 bp ... 55 80 56 The list contains 2 entries. How many alignments would you like to see (* 2 *) ? Aligning... CPU time used: Database scan: 0:00: 0.8 Post-scan processing: 0:00: 1.5 Total CPU time: 0:00: 3.0 Output File: mysequence.fasta |
helix% wordsearch WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments. WORDSEARCH with what query sequence ? mysequence.gcg Begin (* 1 *) ? End (* 1000 *) ? Search for query in what sequence(s) (* GenEMBL:* *) ? vi:hevzvxx What word size (* 6 *) ? List how many best diagonals (* 50 *) ? Integrate how many adjacent diagonals (* 3 *) ? What should I call the output file (* mysequence.word *) ? 1 HEVZVXX Len: 124,884 6-mers found: 1,045,257 Diagonals with words: 24,879 Total diagonals: 251,766 Sequences searched: 1 CPU time: 00.53 Output file: mysequence.word |
(BestFit) SEGMENTS from: myfile.word May 22, 1998 14:29 (Nucleotide) WORDSEARCH of: /usr/users/share/cabot/test/temp/myfile.seq check: 2347 from: 1 to: 1000 ASSEMBLE January 13, 1998 16:44 Symbols: 1 to: 1000 from: dmrt412g ck: 5977, 1 to: 1000 LOCUS DMRT412G 6897 bp DNA INV 13-NOV-1996 DEFINITION Drosophila retrotransposon 412 genome. ACCESSION X04132 X03733 . . . AvMatch: 3.84 AvMisMatch: -6.00 GapWeight: 50 LengthWeight: 3 .. Match display thresholds for the alignment(s): | = IDENTITY : = 3 . = 1 myfile.seq check: 2347 from: 1 to: 1000 /Reverse GB_VI:HEVZVXX check: 4734 from: 70843 to: 124884 X04370 Varicella-Zoster virus complete genome. 9/93 Gaps: 1 Quality: 209 Ratio: 2.944 Score: 50 Width: 7 Limits: +/-8 . . . . . 989 CTCCGTTTTACAATATTTCTTACAATTTTTCTTATCTATATATATTTTAT 940 |||||| || ||||| | || | | || | |||| || | 70854 CTCCGTGTT.TTATATTATATCCACGGTGTTGATTCAACCAATATGTTGT 70902 . . 939 ATTTACTTTATATTTATATATA 918 || | |||| | ||| |||| 70903 ATCTTCTTTTTTTTTTACTATA 70924 myfile.seq check: 2347 from: 1 to: 1000 GB_VI:HEVZVXX check: 4734 from: 26311 to: 124884 X04370 Varicella-Zoster virus complete genome. 9/93 Gaps: 0 Quality: 112 Ratio: 7.000 Score: 46 Width: 8 Limits: +/-9 . 648 AGATAATATTAAAAAT 663 | ||| ||||| |||| 26964 ACATATTATTATAAAT 26979 |
findpatterns -mis=2
% fetch vi:hevzvxx Fetch copies GCG sequences or data files from the GCG database into your directory or displays them on your terminal screen. hevzvxx.gb_vi % breakup -seg=5000 -over=500 BreakUp reads a GCG-format sequence file containing more than 350,000 sequence characters and writes it as a set of separate, shorter, overlapping sequence files that can be analyzed by Wisconsin Package programs. BREAKUP of what file(s) ? hevzvxx.gb_vi hevzvxx_00.gb_vi length: 5500 bp hevzvxx_01.gb_vi length: 5500 bp hevzvxx_02.gb_vi length: 5500 bp hevzvxx_03.gb_vi length: 5500 bp hevzvxx_04.gb_vi length: 5500 bp hevzvxx_05.gb_vi length: 5500 bp hevzvxx_06.gb_vi length: 5500 bp hevzvxx_07.gb_vi length: 5500 bp hevzvxx_08.gb_vi length: 5500 bp hevzvxx_09.gb_vi length: 5500 bp hevzvxx_10.gb_vi length: 5500 bp hevzvxx_11.gb_vi length: 5500 bp hevzvxx_12.gb_vi length: 5500 bp hevzvxx_13.gb_vi length: 5500 bp hevzvxx_14.gb_vi length: 5500 bp hevzvxx_15.gb_vi length: 5500 bp hevzvxx_16.gb_vi length: 5500 bp hevzvxx_17.gb_vi length: 5500 bp hevzvxx_18.gb_vi length: 5500 bp hevzvxx_19.gb_vi length: 5500 bp hevzvxx_20.gb_vi length: 5500 bp hevzvxx_21.gb_vi length: 5500 bp hevzvxx_22.gb_vi length: 5500 bp hevzvxx_23.gb_vi length: 5500 bp hevzvxx_24.gb_vi length: 4884 bp |
% fasta -nostat FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST. FASTA with what query sequence ? mysequence.gcg Begin (* 1 *) ? End (* 1000 *) ? Search for query in what sequence(s) (* GenEMBL:* *) ? hevzvxx_*.gb_vi |
For a batch run, however, the program is running on another computer, and it is not as easy to kill the program. During normal working hours, you can call up the helix staff (4-DCRT) and ask them to kill it for you. If you want to kill it yourself, you need to rlogin to churn, find the process, and kill it. Sample session:
helix% mfold -batch >> Setting up remote run on churn.nih.gov... MFold predicts optimal and suboptimal secondary structures for an RNA molecule using the most recent energy minimization method of Zuker. (Linear) MFOLD what sequence ? mzeorp1a.gb_pl Begin (* 1 *) ? End (* 1382 *) ? What should I call the energy matrix output file (* mzeorp1a.mfold *) ? ** mfold will run as a batch or at job. ** mfold was submitted using the command: " atnow " warning: commands will be executed using /bin/sh job 908896860.a at Tue Oct 20 11:21:00 1998 |
>> Setting up remote run on churn.nih.gov...which is a major clue that the program is running remotely on churn. I now realize I put in the wrong sequence and want to kill this run before it goes for days.
helix% rlogin churn IRIX Release 6.2 IP25 churn Copyright 1987-1996 Silicon Graphics, Inc. All Rights Reserved. Last login: Tue Oct 20 09:52:40 EDT 1998 by susanc@chisos.cit.nih.gov Share II resource management; Copyright (C) 1989-1995 Softway Pty Ltd Share login on /dev/ttyq19. Tue Oct 20 11:21:48 EDT 1998 churn% ps -fu susanc UID PID PPID C STIME TTY TIME CMD susanc 29736 29735 19 11:21:45 pts/19 0:01 -tcsh susanc 29634 840 0 11:21:05 ? 0:00 sh susanc 29635 29634 0 11:21:06 ? 0:00 /bin/csh -f ./mfold_29611_1 susanc 29761 29736 8 11:22:19 pts/19 0:00 ps -fu susanc susanc 28429 28428 0 09:52:40 pts/17 0:01 -tcsh susanc 29636 29635 80 11:21:06 ? 1:05 /gcg/gcgbin/execute/mfold -Init=mfold_29611_1.init susanc 29637 29635 0 11:21:06 ? 0:00 Mail -s mfold susanc susanc 19125 19124 0 14:13:27 pts/13 0:01 -tcsh churn% kill 29636 churn% ps -fu susanc susanc 18938 18937 0 14:05:30 pts/12 0:04 -tcsh susanc 29736 29735 0 11:21:45 pts/19 0:01 -tcsh susanc 29776 29736 7 11:23:19 pts/19 0:00 ps -fu susanc churn% exit helix% |
The problem is really that there is no standard nomenclature that Genbank depositers use. So some people might describe as sequence with the words 'TGF-beta', and some may use 'Transforming Growth Factor Beta', with or without hyphens. Lookup's indices are pretty good, but they can't take care of all the variations.
The best approach is to think of all the possible variations of your key phrase, and search for all of them. So, in this case, look for
TGF-beta or Transforming Growth FactorYou need to use & for AND, and | for OR. See the lookup documentation for more information. So a good search would be:
Complete the query form below: All text: TGF-beta | (transforming & growth & factor) Definition: Author: Keyword: Sequence name: Accession number: Organism: mus musculus Reference: Title: Feature: On or after (dd-mmm-yy): On or before (dd-mmm-yy): Shortest sequence length: Longest sequence length: Inter-field operator: AND Form of output list: Whole Entries Press |
This does not tell you which bases are contained in each entry. If from some other source (e.g. a paper, or the author) you know one entry that contains part of your region of interest, you can get the upstream and downstream bases by getting the preceding and following entries. For example, in this case entry ECAE000381 contains most of the region of interest. If you want a few hundred bases on either side of this entry, you would want entries ECAE000380 and ECAE000382. Note that this requires prior knowledge of at least one entry number. The entries themselves give no indication of which bases they include, they just say 'section 271 of 400' etc.
In the lookup program, you could also search for 'U00096', which is the accession number for the entire genome. You can do this search through GCG-Lite+ on the web.
You can obtain the actual entries by typing 'fetch ECAE000381' for each accession number at the helix prompt.
Choose 'Save Frame As...' in your browser menu under File. Choose 'Text' as the format, and save the entry into a file on your desktop computer. This entry is in Genbank format, will need to be converted to GCG format for any GCG program. ou can use the helix program 'fmtseq' to convert it. Type 'fmtseq' at the helix prompt and follow instructions.
However the sequence is obtained, you will need to delete the extra bases on each end within seqed, and will have to keep track of the base numbering by yourself. Type 'seqed filename' to start up seqed. If you used Method 1, you will have two files to enter. Position your cursor at the end of the first sequence, hit Ctrl-D to get to the command prompt, and type 'Include filename2' to add the second sequence at the end of the first.
helix 75% findpatterns -data=genmoredata:tfsites.dat FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal FINDPATTERNS in what sequence(s) ? aw778747.gb_est46 What should I call the output file (* aw778747.find *) ? aw778747.gb_est46 len: 410 .................................... ............................................................................... ............................................................................... .................................................. FINDPATTERNS in what sequence(s) ? Total finds: 45 Total length: 410 Total sequences: 1 CPU time: 01.63 Output file: aw778747.find |
helix% fetch tfsites.dat Fetch copies GCG sequences or data files from the GCG database into your directory or displays them on your terminal screen. tfsites.dat |