GCG doesn't accept my sequence data format

Your sequence data is not in GCG format. You can use cut-and-paste into Seqed, Readseq or GCG's Reformat to convert it into GCG format on helix, or use GCG-Lite's format conversion tool. These methods are described further below.

  • Cut-and-paste into Seqed
  • Log into helix from your desktop computer and start up GCG. Type 'seqed', and enter screen mode. In a separate window, open the file containing your sequence, select it, and paste it into the seqed window. Then exit seqed, saving the file which will now be in GCG format. Check that the entire sequence was cut/pasted and saved. This is a quick-and -dirty method, but is not useful for very long sequences or if you have many sequences.
  • Readseq
  • To use Readseq, the data must be in one of the following formats:
       *IG/Stanford, used by Intelligenetics and others
       *GenBank/GB, genbank flatfile format
       *NBRF format
       *EMBL, EMBL flatfile format
       *GCG, single sequence format of GCG software
       *DNAStrider, for common Mac program
       *Fitch format, limited use
       *Pearson/Fasta, common format used by FastA program and
                 others
       *Zuker format, limited use. Input only.
       *Olsen, format printed by Olsen VMS sequence editor. Input
                 only.
       *Phylip3.2, sequential format for Phylip programs
       *Phylip, interleaved format for Phylip programs (v3.3,
                 v3.4)
       *Plain/ Raw, sequence data only (no name, document,
              numbering)
       *MSF multi sequence format used by GCG software
       *PAUP's multiple sequence (NEXUS) format
       *PIR/CODATA format used by PIR
    
    If your data is 'raw' (i.e. it has simply sequence data with no headers, dividers etc.), then be aware that Readseq may not accept it correctly. Your simplest option is to convert it into Fasta format by adding this line to the top of the file:
    >>test input sequence
    Type 'man readseq' on helix for more information.

    It is always worth checking the output file after Readseq! You don't need to examine the whole sequence, just check the beginning, end and length of the sequence. Readseq sometimes doesn't recognize headers properly and includes them in the sequence -- an easy error to notice.

  • Reformat
  • In order to use Reformat on sequence files, the files must contain a heading, a dividing line, and a sequence. Type 'genhelp reformat' for more details on the input sequence format. It is a good idea to make a copy of your input sequence before running reformat, as it overwrites the original file. To run the program, type 'reformat filename', and if all goes well you should now have a GCG-formatted sequence in the file. If something doesn't work, see Reformat gave me an empty file or Reformat put the header into the sequence which may help you to troubleshoot.

  • GCG-Lite
  • GCG-Lite has a web-based format conversion tool that converts between the formats that Readseq uses. Paste your sequence into the input box, choose an output format, and click on 'Submit Request'. The reformatted sequence will appear in your web browser, where you can save it into a file.


    How do I save scores from Pileup runs ?

    When you run Pileup interactively, it prints the pairwise similarity scores to the screen. You see output like this:
    Determining pairwise similarity scores...
         
            1   x     2       1.19
            1   x     3       1.20
         
          ////////////////////////
         
           26   x    28       0.87
           27   x    28       0.86
    
    However, the final output file does not contain these scores. If you want to save or print the scores, the quick-and-dirty way is to cut-and-paste them from your screen into a file. An alternative way is to run pileup in batch mode with the '-monitor' option. Since pileup runs can be long, batch mode may be a good idea anyway. When the batch job is done, it will send mail to your account, and if you have the '-monitor' flag, the mail will contain the pairwise scores.

    Sample session:
    % pileup -batch -monitor
    
    PileUp creates a multiple sequence alignment from a group of related
    sequences using progressive, pairwise alignments.  It can also plot a
    tree showing the clustering relationships used to create the alignment. 
    
     PileUp of what sequences ?  @hsp2.list
    
    
       1         BBHSP60  1931 bp
       2         BBHSPRO  2116 bp
       3        BBOSPAGE   819 bp
    
     What is the gap creation penalty (* 5.00 *) ?  
    
     What is the gap extension penalty (* 0.30 *) ?  
    
     The minimum density for a one-page plot is 3.3 sequences/100 platen units.
     What density do you want (* 3.3 *) ?  
    
     What should I call the output file name (* hsp2.msf *) ? 
    
     ** pileup will run as a batch or at job.
    
     ** pileup was submitted using the command:
        "  atnow  "
    
    warning: commands will be executed using /bin/sh
    job 865448280.a at Wed Jun  4 14:18:00 1997
    

    Mail received when batch job finishes:
    Date: Wed, 4 Jun 1997 14:18:23 -0400 (EDT)
    From: Susan Chacko <susanc@helix.nih.gov>
    Subject: pileup
    
                         Welcome to the WISCONSIN PACKAGE 
                          Version 8.1-UNIX, August 1995
    			[....etc....]
    Running pileup using :
    ..
    -monitor
    -INfile1=@hsp2.list
    -GAPweight=5.000000
    -LENgthweight=0.300000
    -OUTfile1=hsp2.msf
    -DENsity=3.333433
    -FIGure
    -Default
    
       1         BBHSP60  1931 bp
       2         BBHSPRO  2116 bp
       3        BBOSPAGE   819 bp
    
     Determining pairwise similarity scores...
    
       1   x     2       0.36
       1   x     3       0.37
       2   x     3       0.38
    
     Aligning...
    
       1     ........................................-.....
       2     ..................................................
             ..............................................-...
             .......
    
    
            Total sequences:          3
           Alignment length:      2,120
                   CPU time:      09.97
    
                Output file:/u5/susanc/hsp2.msf
    


    How do I save the genetic distance scale from Growtree?

    The GrowTree program will not display distance information. However, the tree generated by GrowTree uses the distances information in the Distances output file (used as input to the GrowTree program). This file does have distances information. These values are substitutions per 100 residues. Remember that GrowTree grows one tree that is compatible with this information, but it does not grow and evaluate trees.

    If you can calculate the scale from the trees file that is output by growtree, you may be able to edit the growtree figure file (which is produced by 'growtree -figure') and manually add the scale. You can get information about editing the figure file from the GCG user guide (or type 'genhelp figure'). However, this will be fairly complicated unless you are very familiar with the figure file formats.

    There are other software packages that will do this including PAUP for the Mac and PHYLIP for Mac and Windows. Phylip may be available on the helix systems in the future; send mail to the helix staff via the ptr help system (type 'ptr' at the helix prompt) if you are interested.

    A list of programs for Phylogenetic analysis is available from U. Arizona.


    I keep getting files called seqed.log in my directory.

    If you exit Seqed abnormally, it creates a seqed.log file to attempt to rescue work-in-progress. The correct way to exit seqed is to type Ctrl-D to get to the command line, and then enter 'exit' (to exit or save) or 'quit' (to exit without saving).


    Seqed says 'recovering...' for ever!!

    When you exited Seqed the previous time, something abnormal happened that caused seqed to rescue your sequence into a file called seqed.log. If this file was somehow corrupted, then the next time seqed is started, you will get a message saying
    recovering....can't open
    
    and nothing more will happen. The solution to get seqed working again is to delete the seqed.log file (by typing 'rm seqed.log').

    Note: This problem seems to have been fixed in GCG 9.0


    What databases are available for Fasta searches?

    The welcome notice when you start up GCG tells you what databases are available and when each one was last updated.

    The databases can be accessed in groups and sub-groups. For example, Bacterial sequences can be searched separately as ba:*, gb_ba:*, or as part of Genbank (Genbank:*), or GenbankPlus (GBPlus:*), or GenEMBLPlus (GEPlus:*). The database section of the GCG manual lists the short names for each protein and nucleic acid database. (On helix, you can also type 'genmanual' and go to 'Databases').

    Genbank and EMBL trade data nightly, so that Genbank encompasses almost all of EMBL at any given time. Thus it is only necessary to maintain one of these databases. On the helix systems, we keep Genbank up-to-date, and so EMBL on our system is vanishingly small. Thus, the GenEMBLPlus is essentially the same as GenbankPlus, GenEMBL is essentially equal to Genbank, and so on.

    To use any of these databases in a Fasta search, remember to type ':*' after the database name! For example, you can search the Tags library using 'Tags:*' or 'Gb_Tags:*'. The database specifications are case-insensitive.


    How do I get information about the databases?

    When you start up GCG, the welcome notice tells you which databases we have installed on helix, and when they were updated. Detailed information about each database can be found in the directory /gcgdata/gcgcore/data/moredata/ in files called swissprot.release, genbank.release etc. You can read these files using the 'more' command (e.g. helix% more /gcgdata/gcgcore/data/moredata/pir.release or copy them to your own directory using GCG's 'fetch' command. (e.g.
    helix% fetch rebase.release
    
    Fetch copies GCG sequences or data files from the GCG database 
    into your directory or displays them on your terminal screen.
    
     rebase.release
    

    You may want to see the latest list of databases on the helix systems. You can also obtain this list by typing 'genmanual' at the helix prompt, and going to 'Databases'.


    Motifs doesn't find a known motif in my protein.

    Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Could your 'known motif' be a frequent motif, one that is seen commonly in protein sequences? By default, the program does not show these in its output. You need to use the '-frequent' option for the 'motifs' program to pick out common motifs such as leucine zippers.

    From the GCG help on motifs:

     The PROSITE Dictionary  contains  a number of short sequence patterns
     that occur frequently in protein sequences.  Most of these frequently
     found  patterns  are  post-translational  modifications,   but   more
     specific  patterns  such  as  leucine  zippers  also  fall  into this
     category.  Such frequently  found patterns are  not normally shown by
     Motifs,  but you can display them  by  using the command-line  option
     -FREquent.   More  so  than  with  other   patterns  in  the  PROSITE
     Dictionary,  the presence of these frequently occurring patterns does
     not assure  you that the  protein actually contains the corresponding
     function.
    
     Here are the patterns that the PROSITE Dictionary classifies
     as frequently occurring in Release 14.0:
     
    ;Amidation            1 xG(R,K)(R,K)                                 0009.pdoc
    ;Asn_Glycosylation    1 N~(P)(S,T)~(P)                               0001.pdoc
    ;Camp_Phospho_Site    1 (R,K)2x(S,T)                                 0004.pdoc
    ;Ck2_Phospho_Site     1 (S,T)x2(D,E)                                 0006.pdoc
    ;Glycosaminoglycan    1 SGxG                                         0002.pdoc
    ;Leucine_Zipper       1 Lx6Lx6Lx6L                                   0029.pdoc
    ;Microbodies_Cter     1 (S,T,A,G,C,N)(R,K,H)(L,I,V,M,A,F,Y)>         0299.pdoc
    ;Myristyl             1 G~(E,D,R,K,H,P,F,Y,W)x2(S,T,A,G,C,N)~(P)     0008.pdoc
    ;Pkc_Phospho_Site     1 (S,T)x(R,K)                                  0005.pdoc
    ;Rgd                  1 RGD                                          0016.pdoc
    ;Tyr_Phospho_Site     1 (R,K)x{2,3}(D,E)x{2,3}Y                      0007.pdoc
    
    


    What databases are available for Blast searches?

    There are no Blast databases available on the helix systems. Blast searches are available to helix users, but are actually run at NCBI, National Library of Medicine on their Blast databases. NCBI maintains a list of databases available for Blast searches. As of Jun 1997, the following are available, but check their web page for changes/updates.

    Peptide Sequence Databases
    nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR
    month All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days.
    swissprot the last major release of the SWISS-PROT protein sequence database (no updates)
    yeast Yeast (Saccharomyces cerevisiae) protein sequences.
    E. coli E. coli genomic CDS translations
    pdb Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank
    kabat [kabatpro] Kabat's database of sequences of immunological interest
    alu Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994) .
    Nucleotide Sequence Databases
    nr All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences)
    month All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
    dbest Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
    dbsts Non-redundant Database of GenBank+EMBL+DDBJ STS Divisions
    htgs High Throughput Genomic Sequences
    yeast Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
    E. coli E. coli genomic nucleotide sequences
    pdb Sequences derived from the 3-dimensional structure
    kabat [kabatnuc] Kabat's database of sequences of immunological interest
    vector Vector subset of GenBank(R), NCBI, (anonymous FTP to ncbi.nlm.nih.gov beneath the /pub/vector directory)
    mito Database of mitochondrial sequences, Rel. 1.0, July 1995"
    alu Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).
    epd Eukaryotic Promotor Database
    gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.


    Fasta says "No sequences to compare to query sequence!"

    You may have typed in the database name incorrectly. For example, the following responses would be incorrect:
      Search for query in what sequence(s) (* GenEMBL:* *) ?  genbank
      Search for query in what sequence(s) (* GenEMBL:* *) ?  embl 
      Search for query in what sequence(s) (* GenEMBL:* *) ?  genbank:
      Search for query in what sequence(s) (* GenEMBL:* *) ?  all
    
    and Fasta will say:
       *** No sequences to compare to query sequence! *** 
    
    You need the ":*" after the database name in order for Fasta to search it correctly. Here are some correct responses:
      Search for query in what sequence(s) (* GenEMBL:* *) ? 
    
    (i.e. if you simply hit a carriage return (Enter), Fasta will search the default database. For nucleic acid query sequences this is GenEMBL, and for protein query sequences the default is Swissprot).
      Search for query in what sequence(s) (* GenEMBL:* *) ?  genbank:*
    
    (This is the correct syntax for the database name).
      Search for query in what sequence(s) (* GenEMBL:* *) ? *
    
    Here the syntax is correct, but the '*' tells Fasta to search through all the sequence files in your own current directory, and not through any database. If this is what you want, you're ok.

    Another possibility is that you were trying to search EMBL? On the helix systems, we primarily maintain Genbank, and only keep those entries from EMBL that do not appear in EMBL. Since EMBL and Genbank largely overlap, many of the EMBL databases (such as em_pat, em_gss) are empty or have very few sequences. Try the equivalent Genbank databases instead (e.g. gb_pat or gb_gss).


    Reformat gave me an empty file.

    One possible reason:
  • You edited the file in a word processor and saved it as a word processor document. Reformat can't read the word processor codes that are in your document, and can't convert it. You need to save it as an ASCII text file, which Reformat should be able to convert.


    Reformat put the header into the sequence.

    The file does not contain the dividing line. Input files for Reformat need to have a header, a dividing line, and a sequence. The dividing line contains only two dots, and nothing else. For example, an input file like this will work:
       Human fetal Beta globin G gamma
    	   ..
       AGGAAGCACC CTTCAGCAGT TCCACA
    
    But if the middle line with the dots is missing, Reformat will assume the whole file contains a sequence. It says:
       No ".." divider
    
    and will create a 'sequence' that looks like:
        REFORMAT of: test.seq  check: -1  from: 1  to: 52  June 17, 1997 10:14
    
        (No documentation)
    
         test.seq  Length: 52  June 17, 1997 10:14  Type: P  Check: 7154  ..
    
              1  Humanfetal Betaglobin GgammaAGGA AGCACCCTTC AGCAGTTCCA 
    
             51  CA
    


    Where do I get the codon usage table for xxx?

    The programs BackTranslate, CodonPreference and Correspond use codon frequency tables. By default they use the table for E.Coli. However, additional codon usage tables have been provided on the helix systems for GCG users. At present they include:

    SpeciesCodon table file name
    green algae algae.cod
    Aspergillus nidulansasn.cod
    Arabidopsis thalianaath.cod
    Hordeum vulgare (Barley)bly.cod
    Bombyx mori (Silk Moth)bmo.cod
    Bos taurus (Cow)bov.cod
    Bacillus subtilisbsu.cod
    Caenorhabditis eleganscel.cod
    Chironomus sp.chi.cod
    Gallus sp. (Chicken)chk.cod
    Chlamydomonascre.cod
    Dictyostelium discoideum ddi.cod
    Drosophila melanogasterdro.cod
    Escherichia colieco.cod
    enteric bacterial (highly expressed) genes ecohigh.cod
    Cricetulus sp. & Mesocricetus sp. (Hamster)ham.cod
    Homo sapienshum.cod
    Klebsiella pneumoniaekpn.cod
    Macaca spmac.cod
    Mus sp. (Mouse)mus.cod
    Zea mays (Maize)mze.cod
    Zea mays chloroplast (Maize)mzecp.cod
    Neurospora crassaneu.cod
    Neisseria gonorrheaengo.cod
    Pisum sativum (Pea)pea.cod
    Petunia sp.pet.cod
    Plasmodium falciparumpfa.cod
    Phaseolus vulgaris (Lima bean)phv.cod
    Solanum tuberosum (Potato)pot.cod
    Pseudomonas sp.pse.cod
    Oryctolagus sp. (Rabbit)rab.cod
    Rattus sp. (Rat)rat.cod
    Rhizobium meliloti rhm.cod
    Oryza sativa (Rice) ric.cod
    Ovis sp. (Sheep) shp.cod
    Physarum polycephalum slm.cod
    Glycine max (Soybean) soy.cod
    Staphylococcus aureus sta.cod
    Salmonella thphimurium sty.cod
    Strongylocentrotus purpuratus sus.cod
    Tetrahymena thermophila tet.cod
    Nicotiana tabacum (Tobacco) tob.cod
    Nicotiana tabacum chloroplast (Tobacco) tobcp.cod
    Lycopersicon esculentum (Tomato) tom.cod
    Trypanosoma brucei trb.cod
    Triticum aestivum (Wheat) wht.cod
    Xenopus laevis xel.cod
    Saccharomyces cerevisiae ysc.cod
    Saccharomyces cerevisiae mitochondrion yscmt.cod
    Schizosaccharomyces pombe ysp.cod

    For details about any of these tables, extract the actual table by typing 'fetch xxx.cod' at the helix prompt, and read the documentation at the top of the file.

    To use any of these codon tables instead of the default (E.Coli) table, add the parameter '-translate=xxx.cod' when running any of the above programs.

    If you need a codon table that is not in this list, send mail to the helix staff using the ptr command on helix (type 'ptr' at the helix prompt) and we will attempt to track it down for you.


    What options are available with each program?

    Most GCG programs have a '-check' option. If you use this option, GCG will list all available options for that particular program and prompt you for further input.

    Sample session:
    helix% motifs -check
    
    Motifs looks for sequence motifs by searching through proteins for the
    patterns defined in the PROSITE Dictionary of Protein Sites and
    Patterns.  Motifs can display an abstract of the current literature on
    each of the motifs it finds. 
    
    Minimal Syntax: % motifs [-INfile=]SW:Kad1_Human -Default
    
    Prompted Parameters:
    
    [-OUTfile=]kad1_human.motifs    the output file name
    
    Local Data Files:  
    
    -DATa=prosite.patterns   file of protein sequence patterns
    
    Optional Parameters:
    
    -NOREFerence    suppresses the PROSITE abstract for each pattern found
    -FREquent       shows motifs that are frequently found in proteins 
    -MISmatch=1     allows one mismatch
    -NAMes          writes output file in file-of-sequence-names format
    -APPend         appends the pattern data file to your output file
    -SHOw           shows every file searched, even if no pattern was found
    -ONCe           limits finds to patterns found only once
    -MINCuts=2      limits finds to patterns found a minimum of 2 times
    -MAXCuts=3      limits finds to patterns found a maximum of 3 times
    -EXCLude=n1,n2  excludes patterns found between positions n1 and n2
    -NOMONitor      suppresses the screen trace showing each file
    -NOSUMmary      suppresses the screen summary at the end of the program
    
     Add what to the command line ?  -frequent
    
     MOTIFs from what protein sequence(s) ?  cram_craab.swissprot
    
     What should I call the output file (* cram_craab.motifs *) ?  
    
    cram_craab.swissprot len:         46 ............................
    
                 Total finds:          2
                Total length:         46
             Total sequences:          1
              CPU time (sec):      01.19
    
                 Output file: "/u5/susanc/cram_craab.motifs"
    


    How do I edit more than 30 aligned sequences?

    You're out of luck, GCG's multiple sequence editors cannot handle more than 30 sequences at once.


    Why can't I find a published sequence?

    We maintain Genbank on helix, and only those parts of EMBL that are not present in Genbank. There are several possible reasons why you might not find a published sequence:


    How do I get my sequences into Pileup?

    You need a file that contains the names of all the sequences you want to put into Pileup. If the sequence files are actually in your own directory, and are all in GCG format, your input file can look like this:
    --------------------------- File pileup.inp -------------------
    w68343.gb_new
    x12871.gb_new
    hcu31231.gb_vi
    hsig487.gb_pr
    ---------------------------------------------------------------
    
    When pileup asks for the sequences, type '@pileup.inp' in response. Note the '@' symbol before the filename; this tells GCG that the file contains a list of sequences. The filename itself should not start with '@'!

    You don't need to have the sequences in your own directory to align them; pileup (and most GCG programs) can fish out sequences from the databases. For example, a valid file of sequence names would be:

    ----------------------- File hsp.list --------------------------
    LOOKUP in: genbank  of: "[SQ-ALL: hsp70*]"
    
     3 entries  June  3, 1997 16:11 ..
    
    GB_BA:BBHSP60 ! ID: 120c0005
    ! DEFINITION  B.burgdorferi hsp60 gene for 60kDa heat shock protein.
    GB_BA:BBHSPRO ! ID: 130c0005
    ! DEFINITION  B.burgdorferi dnaK gene for heat-shock protein.
    GB_BA:BBOSPAGE ! ID: 480c0005
    ! DEFINITION  B.burgdorferi OspA gene for outer surface protein A.
    -------------------------------------------------------------------
    
    (This file was output by GCG's Lookup program). In this case, the sequences are still in the database, and Pileup will pull out each sequence at the start of its run. It will take a little longer than if the sequences are in your home directory.

    Sample pileup session:
    helix% pileup
    
    PileUp creates a multiple sequence alignment from a group of related
    sequences using progressive, pairwise alignments.  It can also plot a
    tree showing the clustering relationships used to create the alignment. 
    
     PileUp of what sequences ?  @pileup.inp
    
    
       1   w68343.gb_new   392 bp
       2   x12871.gb_new   429 bp
       3  hcu31231.gb_vi   308 bp
       4   hsig487.gb_pr   487 bp
    
     What is the gap creation penalty (* 5.00 *) ?
    
    [....etc......]
    


    Should I use Blast or Fasta?

    Blast is a program developed at NCBI which searches with a query sequence against a database. Blast looks for regions of local similarity -- i.e. it searches for regions of the same length between the query sequence and the database sequence. It does not insert gaps to improve the quality of the match. Blast calculates a 'quality-of-match' value using a scoring matrix, and the output will have all matches which are above a cutoff value. Since Blast looks for regions of similarity, it can find that two different regions of the same database sequence are similar to a query sequence.

    Fasta uses the Pearson-Lippman algorithm, and looks for "word" matches (an optimal word size is 2 for proteins and 4-6 for nucleic acids) between the query and the database sequence. It finds the region in the database sequence that has the most word matches, and then looks for other nearby regions that match (i.e. it inserts gaps into the sequences if that will improve the match). It keeps track of the highest-matching sequences, and will report only the top matching sequences.

    Which is more appropriate for your purpose? Note the following:

    Because of its speed and simplicity, a Blast search is probably your best first bet. If one of the above points is important to you, or you want more details, try Fasta.

    Additional links:

    More on Blast and Fasta at U. Oxford.
    Database Similarity Searching using Blast and Fasta -- an article by Bruno Gaeta with details about the algorithms and choices.
    Interpreting FastA output from the University of Minnesota.
    The Blast Help Manual from NCBI.
    Searching databases for sequences similar to a sequence of interest from NYU.


    Can I use Fasta with a short sequence?

    You can, but it's not a good idea. Fasta and Blast work best with sequences over 50 bp. For short sequences, try findpatterns instead.

    For example, say you have a 10-nucleotide protein binding site within a promoter, and you want to search the E.Coli genome for this consensus sequence. Findpatterns works best if you can define a sequence pattern that is specific, but includes all of the "real" binding sites. It does not allow positional weighting, so the best thing to give it is fully conserved residues/IUPAC specifications and allowed spacings between conserved groups. In addition, if there are correlated dinucleotide variations they are best searched as separate patterns instead of via a single generalization of the pattern.

    i.e.

    CNGGATNA{5,7}TNATCCNG
    CNGGTANA{5,7}TNTAGGNGMM
    works better than
    CNGGNNNA{5,7}TNNNCCNG

    Findpatterns also lets you allow mismatches, but in general for short, degenerate patterns without positional weighting this will just decrease your signal to noise ratio. It is better to explicitly search additional patterns.

    Findpatterns patterns are specified in the same data file format as restriction enzymes for "MAP", etc. Type "fetch pattern.dat" to get an example file you can modify with your own patterns. Type "genhelp findpatterns" to see a description of the program. The pattern syntax is described under "defining patterns" and making and using your own pattern files are described under "pattern file" and "local data files".

    If you have several of these sequences, and if the contribution of the individual bases is unequal (e.g. the T1, A2 and T6 in the -10 box are more conserved than the others) then you can align the known sequences (make individual sequence files and then use Pileup; or enter them directly using Lineup). You can then use Profilemake and then Profilesearch. These latter two programs are meant for proteins but can be gotten to work with DNA sequences - don't forget to use -MATRix=profiledna.cmp with Profilemake.

    Dr. William Pearson, author of FastA, says that FastA should work fine for such a search if you use a word size of 1 instead of the default word size of 6 for DNA.

    (Response to this question modified from posts by Michael Leonetto and Paul Roy on bionet.software.gcg.)


    Should I use framesearch or Blastx?

    aka

    Framesearch takes forever; is it worth it?

    Framesearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Framesearch uses the Smith-Waterman algorithm, and can insert gaps in the sequence to improve the match.

    Two similar programs are Blastx, which searches a protein database against a query nucleotide sequence, or Tblastn and Tfasta, which search a nucleotide database against a query protein sequence. All these programs first translate the nucleotide sequence in all six reading frames before searching and alignment. The Blast group of programs does not insert gaps, and none of these account for frame shift errors that may be present.

    But framesearch is very slow, especially when searching an entire database. The blast group of programs will give you a result in half an hour or less , as opposed to 3 hrs or days for framesearch. If you need the sensitivity of framesearch, try running a Blastx/Tblastn/Tfasta search, taking the top scores, and running framesearch against those.

    Additional links:

    Framesearch sensitivity, from the GCG Newsletter, Nov 95.
    Searching databases for sequences similar to a sequence of interest from NYU.


    After I generate a multiple sequence alignment using the PileUp program, how can I calculate a consensus sequence for that alignment?

    Use the Pretty program with the command-line qualifier -CONsensus. The Pretty program uses a scoring matrix to calculate a consensus sequence for a group of aligned sequences. It prompts you for the plurality you want to use in creating this consensus.

    You also can specify a threshold value with the command-line qualifier -THReshold. This can be important in amino acid sequence analysis. The default value for the threshold qualifier is 1.0, which means that all matched residue pairs in the scoring matrix that have a value of 1.0 or greater will vote as a block in calculating a consensus residue for a particular position. If you do not want to see block voting, you can increase the threshold value with the command-line qualifier -THReshold=1.5. (1.5 is the value assigned for an identical match in the default scoring matrix used by Pretty. For example, an alanine to alanine match would be assigned a value of 1.5.) If you are using a scoring matrix other than the default one, you can use the command line qualifier -IDEntity to eliminate block voting.

    Question/response taken from GCG Newsletter, March 94..


    How can I use a consensus sequence created with the Pretty program as input to other Wisconsin Package programs?

    You can do this by re-running the Pretty program with the command-line qualifier -UGLy using the Pretty output file as input. Pretty rewrites the sequences in the original input file, including the consensus sequence, into individual sequence files. For example, the consensus sequence is rewritten into a file called consensus.ugly. You can use these individual files as input to any other Wisconsin Package programs.

    Question/response taken from GCG Newsletter, March 94..


    Can I change how agreement or disagreement at positions in the alignment are displayed in the consensus sequence or in the alignment generated by Pretty?

    Yes. There are a number of qualifiers you can use to accomplish this when running the Pretty program: Question/response taken from GCG Newsletter, March 94..


    How do I import a sequence file into GCG?

    Say you have a file, created in WP or Simple Text or Microsoft Word, or something that came out of an automatic sequencer, on your desktop machine. It's unlikely to be in GCG format, but you want to run some GCG programs on this sequence. There are a few different ways to convert it into the right format:
    Cut-and-paste into Seqed
    Log into helix from your desktop computer and start up GCG. Type 'seqed', and enter screen mode. In a separate window, open the file containing your sequence, select it, and paste it into the seqed window. Then exit seqed, saving the file which will now be in GCG format. Check that the entire sequence was cut/pasted and saved. This is a quick-and-dirty method that mostly works, but you have to be careful that you cut-and-paste the entire sequence. It is not useful for very long sequences or if you have many sequences.
    GCG's Reformat program
    Save the file as an ASCII text file. If the sequence has word-processor codes in it, no sequence conversion program will accept it.
    Bring the file to helix using some file transfer program like Fetch or FTP.
    Start up GCG (type 'gcg' at the helix prompt) and then run reformat (type 'reformat filename' at the helix prompt). Reformat should give you a GCG-formatted sequence in the same file. If something doesn't work right, see Reformat gave me an empty file or Reformat put the header into the sequence which may help you to troubleshoot.

    Readseq
    To use Readseq, the data must be in one of the following formats:
       *IG/Stanford, used by Intelligenetics and others
       *GenBank/GB, genbank flatfile format
       *NBRF format
       *EMBL, EMBL flatfile format
       *GCG, single sequence format of GCG software
       *DNAStrider, for common Mac program
       *Fitch format, limited use
       *Pearson/Fasta, common format used by FastA program and
                 others
       *Zuker format, limited use. Input only.
       *Olsen, format printed by Olsen VMS sequence editor. Input
                 only.
       *Phylip3.2, sequential format for Phylip programs
       *Phylip, interleaved format for Phylip programs (v3.3,
                 v3.4)
       *Plain/ Raw, sequence data only (no name, document,
              numbering)
       *MSF multi sequence format used by GCG software
       *PAUP's multiple sequence (NEXUS) format
       *PIR/CODATA format used by PIR
    
    If your data is 'raw' (i.e. it has simply sequence data with no headers, dividers etc.), then be aware that Readseq may not accept it correctly. Your simplest option is to convert it into Fasta format by adding this line to the top of the file:
    >>test input sequence
    Type 'man readseq' on helix for more information.

    It is always worth checking the output file after Readseq! You don't need to examine the whole sequence, just check the beginning, end and length of the sequence. Readseq sometimes doesn't recognize headers properly and includes them in the sequence -- an easy error to notice.

    GCG-Lite
    GCG-Lite has a web-based format conversion tool that converts between the formats that Readseq uses. Paste your sequence into the input box, choose an output format, and click on 'Submit Request'. The reformatted sequence will appear in your web browser, where you can save it into a file on your own machine. Transfer this file to helix using a file transfer program such as Fetch or FTP.


    Reformat cut off the end of my sequence.

    One possible reason is that your input sequence was in one long line without carriage returns. This can happen if you typed the sequence in something like SimpleText, without carriage returns, and then transferred it to helix. There are several ways around this: It is always worth checking the output from a reformatting program! You don't need to examine the whole sequence, just look at the beginning, the end and the length of the sequence.


    How do I translate the coding regions of a Genbank entry?

  • Search for your protein by accession number or name in Lookup.
    In the 'Features:' field, type 'cds.
    In the 'Form of output list' field, select 'Fragments'. (Use the space bar on your keyboard to toggle options in this field).
  • Save the result in a file, which will contain the coding regions.
  • Then use 'translate' to translate them.

    Sample session:
    helix% lookup
    LookUp identifies sequences by name, accession number, author, organism,
    keyword, title, reference, feature, definition, length, or date.  The output
    is a list of sequences. 
    
    The LookUp program is experimental in this release--please look carefully at
    your results. 
    
     LOOKUP in what sequence libraries:
    
       a) swissprot
       b) pir
       c) embl
       d) genbank
       e) em_tags
       f) gb_tags
       g) gb_new
       h) genpept
       i) All libraries
     
       q) quit
    
     Please choose one or more (* i *): d
    
     Complete the query form below:
    
                     All text:  tryptophan synthase
                   Definition:
                       Author:
                      Keyword:
                Sequence name:
             Accession number:
                     Organism:
                    Reference:
                        Title:
                      Feature:  cds
      On or after (dd-mmm-yy):               On or before (dd-mmm-yy):
     Shortest sequence length:                Longest sequence length:
    
         Inter-field operator:  AND             Form of output list:  Fragments
    
     Press D to continue.
    
     Searching genbank
    
     32 features were found.
    
     Do you wish to:
    
       1) write out this list to a file
       2) preview the results
       3) refine the query
       4) choose different libraries
     
       q) quit
    
     Please choose one (* 1 *): 1
     What should I call the output file (* lookup.list *) ?  test.list
    
     ....
     32 features were written to "test.list"
    
     helix% translate @test.list
    Translate translates nucleotide sequences into peptide sequences.  
    
                 acctrpf          from    506 to   1147
                 acctrpf          from   1149 to   1466
                 bactrpab         from    179 to   1393
                 bactrpab         from   1374 to   2183
                 bactrsyb         from     31 to   1233
                 brltrp2          from      1 to    366
                 buhtrpb          from      1 to    676
                 buhtrpba         from      1 to    676
                 buhtrpbb         from      1 to    679
                 buhtrpbc         from      1 to    676
                 ccrtrpfba        from    181 to    450
                 ccrtrpfba        from    500 to   1159
                 ccrtrpfba        from   1182 to   2402
                 ccrtrpfba        from   2415 to   3242
                 mvotrpba         from      1 to    206
                 mvotrpba         from    304 to   1533
                 mvotrpba         from   1571 to   2425
                 mvotrpba         from   2460 to   2600
    (reverse of) mvotrpba         from   2666 to   2874
    (reverse of) pptrpab          from      1 to     18
                 pptrpab          from    129 to   1346
                 pptrpab          from   1343 to   2152
    (reverse of) psetrpbai        from     99 to    908
    (reverse of) psetrpbai        from    908 to   2134
                 psetrpbai        from   2245 to   3141
                 syctrpb          from    518 to   1756
                 ab003491         from      1 to   1413
                 athtrpb          from   1517 to   1858
                 athtrpb          from   2109 to   2465
                 athtrpb          from   2549 to   2991
                 athtrpb          from   3084 to   3178
                 athtrpb          from   3275 to   3450
                 athtrpsb         from    442 to    798
                 athtrpsb         from   1174 to   1530
                 athtrpsb         from   1606 to   2048
                 athtrpsb         from   2136 to   2230
                 athtrpsb         from   2322 to   2497
                 cynpltrnk        from    148 to    876
                 mzeorp1a         from      1 to   1170
                 mzeorp2a         from      3 to   1334
    
       Output files called "*.pep"
    


    How do I stop Fasta?

    If you start up a Fasta run, and then discover that you're searching the wrong database or have the wrong query sequence, or that the search is simply taking too long, you can stop the program. For Fasta, a 'Ctrl-C' (hold down the Control key, and hit 'c') will stop the search. The program will tell you the best scores it has found so far, and offer to continue with the alignment part of Fasta. Another 'Ctrl-c' at this point will stop Fasta completely, and return you to the helix prompt.

    If you just want to 'pause' the program (while you take a look at your query file, for example) use 'Ctrl-z' (hold down the Control key and hit 'z') to suspend the program execution. You will get a helix prompt, and the program will remain in a suspended state until you type 'fg' (for 'foreground') or 'bg' (for 'background'). You can also choose to kill the program after suspending it, by typing 'jobs', and killing the appropriate job. For example, after pausing a Fasta job with 'Ctrl-z':
    helix% jobs
    [1]  + Suspended              ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta )
    
    This tells you that you have one job suspended, and that it is job number 1.
    helix% bg
    [1]    ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta ) &
          1,701 Sequences     675,057 aa searched    SW:AGL1_ARATH
          1,801 Sequences     716,153 aa searched    SW:ALAT_PIG
          1,901 Sequences     754,379 aa searched    SW:ALF_SCHPO
          2,001 Sequences     792,028 aa searched    SW:AMA1_PLAF8
    [...etc....]
    
    This puts the job into the background. You have limited control over it now, and will probably not be able to pause, kill or stop it until it is finished.
    helix% fg
    ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta x89712.genpept )
    
          6,001 Sequences   2,332,970 aa searched    SW:CC27_HUMAN
          6,101 Sequences   2,374,535 aa searched    SW:CCKN_MACEU
          6,201 Sequences   2,406,280 aa searched    SW:CD36_HUMAN
          6,301 Sequences   2,437,213 aa searched    SW:CD9_CERAE
          6,401 Sequences   2,479,989 aa searched    SW:CE11_CAEEL
          6,501 Sequences   2,516,827 aa searched    SW:CERB_CERCA
          6,601 Sequences   2,563,151 aa searched    SW:CG2A_DROME
          6,701 Sequences   2,594,679 aa searched    SW:CH10_CAUCR
    [...etc...]
    
    This puts the job back into the foreground and it will continue to the end.
    helix% kill %1
    [1]    Terminated             ( set noglob; $GCGUTILDIR/gcglsf.pl $GCGUTILDIR/fasta )
    
    This kills the Fasta job.

    'fg' and 'bg' should be used with caution, as they interfere with the normal execution of a Fasta job, and in some situations can lead to unexpected results especially if a job is foregrounded or backgrounded more than once. Your job may hang, become unkillable, or leave incomplete files.


    When should I use stringsearch?

    Stringsearch searches through the specified database for a particular character string. You can choose to search through Definitions (option 'A') or Complete Sequence Annotation (option 'B'). The 'A' option is much faster, but both options have been superceded by GCG's Lookup program, which is faster and has more options.

    For example, a search through Swissprot for 'glyceraldehyde' takes 25 seconds with Stringsearch Option A, 22 minutes with Stringsearch Option B, and about half a second with Lookup. Lookup searches can also be more easily tailored for your specific needs -- you can search through mouse sequences only, for example. So in most cases, use Lookup instead of Stringsearch.


    How can I get a tfasta output aligned with Pileup?

    I have a small region of a protein which contains the residues which have been found to be mutated in patients. I want to check Genbank for all homologous proteins to see if these residues are highly conserved. So I want to do a tfasta run to identify the homologs, and then use Pileup to align them. How can I get the list of sequences from a tfasta output into Pileup?

    The tfasta output (or the tblastn output) are not suitable for automatic parsing, so it is difficult to get a list of sequence names from the output that you can pop into Pileup.

    However, a better way to do this would be to run a fasta search against the GenPept database, instead of Genbank. GenPept is maintained on the helix systems and is updated bimonthly. It contains all the putative translations from the last Genbank release. You can search GenPept using 'GenPept:*' in Fasta. In Pileup, you can then use the Begin= and End= parameters to define the ends of the region for alignment. Each sequence can have its own range limitation.

    Another option is to use framesearch instead of fasta. You could then create a list file using the begin, end and range information from the framesearch output, and use that as input into Pileup. However, framesearch is much slower than Fasta.


    How do I design a primer?

    There are several design criteria:
    • melting temperatures of primer and product
    • GC content
    • ambiguous bases
    • self-complementarity -- how to avoid primer dimer formation and primer secondary structure
    • size of primer
    • PCR product length
    • repeats, e.g. similarity to common rodent repeat sequences
    GCG's prime program selects primers, and is also accessible through
    GCG-Lite's web interface to prime.

    Other web-accessible primer design programs are

    Primer3 from the Whitehead Institute,
    xprimer from U. Minnesota,
    GeneFisher from U. Bielefeld.

    Programs than run on Macs and PC's are MacVector for the Mac, GeneWorks for the Mac, OSP for Mac/PC/VAX/Sun/Xwindows, and primegen. Also check EBI's list of primer design programs.

    More information about primer design can be found at:

    Restriction Maps and PCR primer design, a course at NYU.
    PCR primer analysis, a course from U. Kentucky.


    How do I convert peptide sequences from 1-letter to 3-letter amino acid codes?

    Reformat, which converts other sequence formats into GCG format, can do this. You would run it with the '-ONEintothree' option, and your input sequence file can be either GCG format or any of the other formats that Reformat accepts. Note that reformat will write over your input sequence file, so save a copy if you want to preserve the original!

    Sample session:
    helix% more jc5122.pir2
    P1;JC5122 - superoxide dismutase (EC 1.15.1.1) (Mn) precursor - Caenorhabditis 
     elegans
    C;Species: Caenorhabditis elegans
    C;Date: 02-Feb-1997 #sequence_revision 27-Feb-1997 #text_change 13-Mar-1997
    C;Accession: JC5122; JS0750
        [...etc etc....]
    jc5122.pir2  Length: 221  August 14, 1997 10:53  Type: P  Check: 1490  ..
    
           1  MLQNTVRCVS KLVQPITGVA AVRSKHSLPD LPYDYADLEP VISHEIMQLH 
    
          51  HQKHHATYVN NLNQIEEKLH EAVSKGNVKE AIALQPALKF NGGGHINHSI 
    
         101  FWTNLAKDGG EPSAELLTAI KSDFGSLDNL QKQLSASTVA VQGSGWGWLG 
    
         151  YCPKGKILKV ATCANQDPLE ATTGLVPLFG IDVWEHAYYL QYKNVRPDYV 
    
         201  NAIWKIANWK NVSERFAKAQ Q
    
    
    helix% reformat -oneintothree
    
    Reformat rewrites sequence file(s), scoring matrix file(s), or enzyme 
    data file(s) so that they can be read by GCG programs. 
    
     REFORMAT what sequence file(s) ?  jc5122.pir2
    
         jc5122.pir2  length: 663 aa
    
    helix% more jc5122.pir2
    P1;JC5122 - superoxide dismutase (EC 1.15.1.1) (Mn) precursor - Caenorhabditis 
     elegans
    C;Species: Caenorhabditis elegans
    C;Date: 02-Feb-1997 #sequence_revision 27-Feb-1997 #text_change 13-Mar-1997
    C;Accession: JC5122; JS0750
       [... etc etc ...]
    F;50,98,182,186/Binding site: manganese (His, His, Asp, His) #status predicted
    
    jc5122.pir2  Length: 663  August 14, 1997 10:53  Type: P  Check: 9126  ..
    
           1  Met Leu Gln Asn Thr Val Arg Cys Val Ser Lys Leu Val Gln Pro 
    
          46  Ile Thr Gly Val Ala Ala Val Arg Ser Lys His Ser Leu Pro Asp 
    
          91  Leu Pro Tyr Asp Tyr Ala Asp Leu Glu Pro Val Ile Ser His Glu 
    
         136  Ile Met Gln Leu His His Gln Lys His His Ala Thr Tyr Val Asn 
    
         181  Asn Leu Asn Gln Ile Glu Glu Lys Leu His Glu Ala Val Ser Lys 
    
         226  Gly Asn Val Lys Glu Ala Ile Ala Leu Gln Pro Ala Leu Lys Phe 
    
         271  Asn Gly Gly Gly His Ile Asn His Ser Ile Phe Trp Thr Asn Leu 
    
         316  Ala Lys Asp Gly Gly Glu Pro Ser Ala Glu Leu Leu Thr Ala Ile 
    
         361  Lys Ser Asp Phe Gly Ser Leu Asp Asn Leu Gln Lys Gln Leu Ser 
    
         406  Ala Ser Thr Val Ala Val Gln Gly Ser Gly Trp Gly Trp Leu Gly 
    
         451  Tyr Cys Pro Lys Gly Lys Ile Leu Lys Val Ala Thr Cys Ala Asn 
    
         496  Gln Asp Pro Leu Glu Ala Thr Thr Gly Leu Val Pro Leu Phe Gly 
    
         541  Ile Asp Val Trp Glu His Ala Tyr Tyr Leu Gln Tyr Lys Asn Val 
    
         586  Arg Pro Asp Tyr Val Asn Ala Ile Trp Lys Ile Ala Asn Trp Lys 
    
         631  Asn Val Ser Glu Arg Phe Ala Lys Ala Gln Gln 
    
    


    Blast gives me an error: Server did not say "ok"

    When you run Blast on the helix systems, the search actually runs to the computers at NCBI. As of Jan 1998, the NCBI 'experimental' servers (Cruncher and Muncher), which used to process blast jobs, have been replaced. The new system now sends blast jobs to the NCBI web server. If the server is overloaded or the network is slow, you may see the following error message:
    
    helix% blast x89712.genpept -default
    
     Sending query...
     Awaiting results...
    
     ** Server did not say "OK".
    
    Your options are to try again later, use the Blast email server (add '-mail' to the command line), or use Blast via the web.

    The old 'blast -batch' command doesn't work anymore; to run blast in batch mode see How do I run blast searches in batch mode


    Why have my alignments changed in Version 9?

    The following information is taken from the GCG Newsletter, July 1997.

    Protein sequence alignment programs in the Wisconsin Package use scoring matrices to make comparisons between pairs of amino acids. In Version 9.0, released in December 1996, changes were made to the scoring matrices which may affect the alignments you create. If the sequences you aligned in previous versions of the Package were very similar, you may not have noticed much difference to those same alignments produced using Version 9.0. However, with more distantly related sequences the alignments produced may be substantially different.

    Version 9 Changes to Scoring Matrices

    There are two main changes to protein scoring matrices in Version 9.0.

    • BLOSUM621, the new default protein scoring matrix, is the primary cause for the difference in alignments created in Version 9.0. Versions of the Wisconsin Package prior to 9.0 used a scoring matrix based on the PAM250 table from M. Dayhoff2, where all perfect matches were set to a single value. BLOSUM62, in contrast, assigns scores ranging from 4 to 11 to perfect matches (identity). Scores for a non-perfect but similar matches range from 0 to 3. Scores for mismatches (non-similar matches) are assigned vales less than 0. Knowing the range of these values is helpful for setting display thresholds for similar amino acids.
    • Scoring matrix values are now integers. Matrices which previously held floating point (real number) values have had their values multiplied by 10 so that they now hold integer values. In addition, the values for gap creation and extension penalties also must be entered as integer values.
    The modified PAM250 matrix which was used in previous software versions is still available in Version 9.0. The matrix has been renamed oldpep.cmp, and the original floating point values were changed to integer values. However, we do not recommend using this matrix. If you are trying to align distantly related sequences and are not achieving expected results with BLOSUM62, use the BLOSUM30 matrix. You can specify this alternative matrix with the -MATRix=blosum30.cmp (UNIX) or /MATRix=blosum30.cmp (OpenVMS) command-line parameter when you run a sequence comparison program.

    How Does This Affect Me?

    Two common questions you may have about the alignments created in Version 9.0 and their answers are detailed below.

    Q: The global alignment displayed by PileUp or Gap shows one sequence completely to the right of the others, with only a few end bases shown as matched when run with BLOSUM62 (see examples below). But with the old matrix the sequences appeared to align together. Why?

    A: The old matrix (PAM250) was overly permissive of mismatches and allowed you to align unrelated sequences. The BLOSUM62 matrix is more restrictive and will not routinely align unrelated sequences along their entire lengths. Thus, you may find that the displaced sequence(s) may be unrelated to the rest of your alignment. If you have reason to believe that the sequences are globally but very distantly related, then you might want to align them with a matrix based on an evolutionary model that assumes greater divergence time, like BLOSUM30. (A complete series of BLOSUM matrices is provided in the GenMoreData directory.)

    (UNIX) prompt> namels genmoredata:blosum*

    (OpenVMS) prompt> dir genmoredata:blosom*

                 51                                                 100   
    
      Calm_Human  RHVMTNLGEK LTDEEVDEMI READIDGDGQ VNYEEFVQMM TAK~~~~~~~
    
      Calm_Drome  RHVMTNLGEK LTDEEVDEMI READIDGDGQ VNYEEFVTMM TSK~~~~~~~
    
      Calm_Wheat  RHVMTNLGEK LTDEEVDEMI READVDGDGQ INYEEFVKVM MAK~~~~~~~
    
      Hsp2_Mouse  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~MV RYRMRSPSEG
    
                  101                                             147
    
      Calm_Human  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~
    
      Calm_Drome  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~
    
      Calm_Wheat  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~
    
      Hsp2_Mouse  PHQGPGQDHE REEQGQGQGL SPERVEDYGR THRGHHHHRH RRCSRKR
    
    

    Global alignments created by PileUp in Version 9.

    
      51 RHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVTMMTSK....... 93    
    
                                               |.  :
    
       1 ......................................MVRYRMRSPSEG 12
    
    
    Global alignment created by Gap in Version 9.


    Pileup doesn't find conserved motifs in my alignment in Version 9

    The following information is taken from the GCG Newsletter, July 1997.

    PileUp was always intended as a global alignment tool, which aligns sequences along their entire lengths. Because of this, PileUp is not ideally suited to finding the best local region of similarity (such as a shared motif) among all of the sequences. Global alignment algorithms tend to overlook areas of local similarity in favor of aligning the entirety of all sequences.

    If you know where the areas of conserved motifs lie, use the SeqLab Editor to manually edit the resulting global alignment from PileUp, and then select subranges to realign portions of an existing alignment.

    References

    (1) Henikoff, S. and Henikoff, J. G. 1992. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences USA 89, 10915-10919.

    (2) Schwartz, R.M. and Dayhoff, M.O. (1979). Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff, ed.), 5, Suppl. 3 (pp; 353-358), National Biomedical Research Foundation, Washington, D.C., USA.


    How do I run blast searches in batch mode?

    After NCBI replaced their experimental blast servers in Jan 98, the blast searches on the helix system changed. The new blast program does not have the 'blast -batch' command. Instead, you should use the helix utility batchblast. A simple run of batchblast would be
    helix% batchblast *.seq
    
    which would run blast searches with default parameters for each of the files called xxx.seq, and produce an output file called xxx.blastn for each search. Batchblast has options to choose the database, program, and parameters; see the batchblast man page for more information.


    How do I view the trees from Paupsearch?

    Paupsearch and Paupdisplay are the GCG interfaces to the PAUP program. A common Mac/Windows program to view such trees is TreeView. However, the output produced by the GCG Paup programs is not suitable for Treeview input. TreeView requires a 'rooted' file.

    The good news is that you can use PaupDisplay (and PaupSearch) to create a script files that can then be input to gcg_paup. A script can then be modified so that gcg_paup, when run, will perform the mid-point rooting and output the required .rooted file.

    Here are the steps that you should take:

    • Run PaupSearch and then PaupDisplay with the desired settings. You should include the optional qualifier "SCRIPT" on the command line used to run PaupDisplay in order to generate the necessary script file. In addition, you may wish to consider using the optional command line qualifier "-NORUN" which prevents PaupSearch from doing anything beside generating the script. If you leave this out, then you will end up duplicating your efforts since you will repeat the analysis when you run gcg_paup (please see below).

      Sample command line:

      helix% paupsearch  -norun -script=myfile.script
      
    • In the script file (i.e. "myfile.script" in this example) you will see a "PAUP block" that looks something like this:
      
           begin paup;
           set errorstop;
           log file='myfile.paupdisplay' replace;
           set criterion=parsimony;
           describe all / brlens patristic homoplasy nofvalue root=midpoint;
           quit;
           endblock;
      
      Edit the PAUP block, adding a "Savetrees" command that looks something like:
           savetrees /brlens root file='myfile.rooted';
      
    • Run gcg_paup with the modified script. For example:
         helix% gcg_paup myfile.script
      
    Here is an example of the final .rooted file:
    
     #NEXUS 
    
     begin trees;  [Treefile saved Fri May  8 10:41:54 1998]
     [!>Tree(s) input to PAUP* as user-defined tree(s)
     ]
             translate
                     1 chpdrbbq,
                     2 chpdrbbs,
     
                     3 chpdrbbd,
                     4 chpdrbca,
                     5 chpdrbae,
                     6 chpdrbag
                     ;
     tree PAUP_1 = ((1:0,2:1):26,((3:2,4:1):19,(5:7,6:6):12):0);
     end;
    

    This file (called myfile.rooted by default) will work as input to TreeView.

    Information courtesy of Eric Cabot, GCG inc.


    I need to compare my 200 bp sequence against a complete virus genome, which is 124,000 bases long. How should I do this?

    • One way is to run Fasta against the genome sequence. If the genome sequence is in Genbank, you can pull it out with the GCG 'fetch' command. Then run Fasta as follows:
      helix% fasta
      
      FastA does a Pearson and Lipman search for similarity between a query
      sequence and a group of sequences of the same type (nucleic acid or
      protein). For nucleotide searches, FastA may be more sensitive than BLAST. 
      
      11-Nov-1997
      
       FASTA with what query sequence ?  mysequence.gcg
      
                        Begin (* 1 *) ?  
                      End (*    50 *) ?  
      
       Search for query in what sequence(s) (* GenEMBL:* *) ?  hevzvxx.gb_vi
      
       What word size (* 6 *) ?  
      
       Don't show scores whose E() value exceeds: (* 2.0 *):  
      
       What should I call the output file (* mysequence.fasta *) ?  
      
                1 Sequences     124,884 bp searched    hevzvxx.gb_vi
      
      
      (Nucleotide) FASTA of: test.gcg  from: 1   to: 50  June 22, 1998 10:39
       TO: hevzvxx.gb_vi  Sequences:          1  Symbols:    124,884  Word Size: 6
      
       Searching with both strands of the query.
       Scoring matrix: GenRunData:fastadna.cmp
       Constant pamfactor used
       Gap creation penalty: 16      Gap extension penalty: 4
      
      
      The best scores are:                    init1 initn   opt
      
      hevzvxx.gb_vi  
      ! LOCUS       HEVZVXX    124884 bp   ...  250   250   250
      hevzvxx.gb_vi  
      ! LOCUS       HEVZVXX    124884 bp   ...   55    80    56
      
      
      
      The list contains 2 entries.
      How many alignments would you like to see (* 2 *) ? 
      
      Aligning...
      
      
       CPU time used: 
             Database scan:  0:00: 0.8
      Post-scan processing:  0:00: 1.5
            Total CPU time:  0:00: 3.0
       Output File: mysequence.fasta
      

    • As you can see, Fasta only found 2 entries -- basically, the best hit in the forward and backward direction. The genome is large, and so programs like Gap and Bestfit won't work, and anyway they will also give only the single best hit. An alternative approach would be to use the program Wordsearch followed by the program Segments. For example:
      helix% wordsearch
      
      WordSearch identifies sequences in the database that share large 
      numbers of common words in the same register of comparison with your
      query sequence.  The output of  WordSearch can be displayed with
      Segments.
      
       WORDSEARCH with what query sequence ?  mysequence.gcg
                                             
                        Begin (* 1 *) ?   
                      End (*  1000 *) ?  
      
       Search for query in what sequence(s) (* GenEMBL:* *) ?  vi:hevzvxx
                                                              
       What word size (* 6 *) ?  
      
       List how many best diagonals (* 50 *) ?  
      
       Integrate how many adjacent diagonals (* 3 *) ?  
      
       What should I call the output file (* mysequence.word *) ?  
      
            1 HEVZVXX              Len: 124,884
               6-mers found:     1,045,257
       Diagonals with words:        24,879
            Total diagonals:       251,766
         Sequences searched:             1
                   CPU time:         00.53
      
                Output file: mysequence.word
      
      The output from WordSearch is then typically passaged through the program Segments to produce a ".pairs" file. Here is what some of the the results of this search look like:
      (BestFit) SEGMENTS from: myfile.word  May 22, 1998 14:29
      
       (Nucleotide) WORDSEARCH of: /usr/users/share/cabot/test/temp/myfile.seq  
       check: 2347 from: 1  to: 1000
       ASSEMBLE    January 13, 1998 16:44
      Symbols:     1 to: 1000  from: dmrt412g          ck: 5977,     1 to: 1000 
      LOCUS       DMRT412G     6897 bp    DNA             INV       13-NOV-1996
      DEFINITION  Drosophila retrotransposon 412 genome.
      ACCESSION   X04132 X03733 . . . 
      
       AvMatch: 3.84  AvMisMatch: -6.00  GapWeight: 50  LengthWeight: 3   ..
      
              Match display thresholds for the alignment(s):
                          | = IDENTITY
                          : =   3
                          . =   1
      
      myfile.seq                check: 2347  from: 1      to: 1000   /Reverse
      GB_VI:HEVZVXX             check: 4734  from: 70843  to: 124884
           X04370 Varicella-Zoster virus complete genome. 9/93
       Gaps: 1  Quality: 209  Ratio: 2.944  Score: 50  Width: 7  Limits: +/-8 
                        .         .         .         .         .
           989 CTCCGTTTTACAATATTTCTTACAATTTTTCTTATCTATATATATTTTAT 940
               |||||| ||   |||||   | ||   | |    || |   |||| || |
         70854 CTCCGTGTT.TTATATTATATCCACGGTGTTGATTCAACCAATATGTTGT 70902
                        .         .  
           939 ATTTACTTTATATTTATATATA 918
               || | |||| | |||   ||||
         70903 ATCTTCTTTTTTTTTTACTATA 70924
      
      myfile.seq                check: 2347  from: 1      to: 1000  
      GB_VI:HEVZVXX             check: 4734  from: 26311  to: 124884
           X04370 Varicella-Zoster virus complete genome. 9/93
       Gaps: 0  Quality: 112  Ratio: 7.000  Score: 46  Width: 8  Limits: +/-9 
                        .      
           648 AGATAATATTAAAAAT 663
               | ||| ||||| ||||
         26964 ACATATTATTATAAAT 26979
      

    • If the query sequence is relatively short (under 240 residues) then an alternate solution would be to use the program FindPatterns. Please be aware of the fact that, unless the optional qualifier "MISmatch" is included on the command line used to run FindPatterns, that only exact matches will be found. E.g:
         findpatterns -mis=2
      

    • Yet another approach is to use the program BreakUp to break HEVZVXX into overlapping segments which can then be searched with FastA. You must first use Fetch to copy the entire sequence to a file in the current working directory. Here is an example, showing how to break the sequence into a series of 5 kb segments that overlap by 500 bases.
      % fetch vi:hevzvxx
      Fetch copies GCG sequences or data files from the GCG database 
      into your directory or displays them on your terminal screen.
      
       hevzvxx.gb_vi
      
      % breakup  -seg=5000 -over=500
      
      BreakUp reads a GCG-format sequence file containing more than 350,000
      sequence characters and writes it as a set of separate, shorter,
      overlapping sequence files that can be analyzed by Wisconsin 
      Package programs. 
      
      
       BREAKUP of what file(s) ?   hevzvxx.gb_vi
                                 
       hevzvxx_00.gb_vi  length: 5500 bp
       hevzvxx_01.gb_vi  length: 5500 bp
       hevzvxx_02.gb_vi  length: 5500 bp
       hevzvxx_03.gb_vi  length: 5500 bp
       hevzvxx_04.gb_vi  length: 5500 bp
       hevzvxx_05.gb_vi  length: 5500 bp
       hevzvxx_06.gb_vi  length: 5500 bp
       hevzvxx_07.gb_vi  length: 5500 bp
       hevzvxx_08.gb_vi  length: 5500 bp
       hevzvxx_09.gb_vi  length: 5500 bp
       hevzvxx_10.gb_vi  length: 5500 bp
       hevzvxx_11.gb_vi  length: 5500 bp
       hevzvxx_12.gb_vi  length: 5500 bp
       hevzvxx_13.gb_vi  length: 5500 bp
       hevzvxx_14.gb_vi  length: 5500 bp
       hevzvxx_15.gb_vi  length: 5500 bp
       hevzvxx_16.gb_vi  length: 5500 bp
       hevzvxx_17.gb_vi  length: 5500 bp
       hevzvxx_18.gb_vi  length: 5500 bp
       hevzvxx_19.gb_vi  length: 5500 bp
       hevzvxx_20.gb_vi  length: 5500 bp
       hevzvxx_21.gb_vi  length: 5500 bp
       hevzvxx_22.gb_vi  length: 5500 bp
       hevzvxx_23.gb_vi  length: 5500 bp
       hevzvxx_24.gb_vi  length: 4884 bp
      
      
      You can then use an ambigous file specification such as "hevzvxx_*.gb_vi" with to perform a FastA to search against the the files produced by BreakUp. I recommend that you also include the optional qualifier NOSTATs the FastA command line in order to suppress the calculation of E() statistics. For example:
      % fasta -nostat
       
      FastA does a Pearson and Lipman search for similarity between a query
      sequence and a group of sequences of the same type (nucleic acid or
      protein). For nucleotide searches, FastA may be more sensitive than BLAST. 
      
       FASTA with what query sequence ?  mysequence.gcg
                                        
                        Begin (* 1 *) ?  
                      End (*  1000 *) ?  
      
       Search for query in what sequence(s) (* GenEMBL:* *) ?  hevzvxx_*.gb_vi
      
      
    Information courtesy of Eric Cabot, GCG inc.


    How do I kill a paupsearch batch run?

    On the helix systems, some cpu-intensive programs like paupsearch, framesearch, and mfold have been set up to run remotely on our high-speed parallel computer. This is invisible to users, so that you run these programs just like any other. For an interactive run of these programs, you can kill the program by typing 'Ctrl-C' (hold down the Control key and press 'c') just as with any Unix program.

    For a batch run, however, the program is running on another computer, and it is not as easy to kill the program. During normal working hours, you can call up the helix staff (4-DCRT) and ask them to kill it for you. If you want to kill it yourself, you need to rlogin to churn, find the process, and kill it. Sample session:
    helix% mfold -batch
     
    >> Setting up remote run on churn.nih.gov...
    
    MFold predicts optimal and suboptimal secondary structures for an RNA
    molecule using the most recent energy minimization method of Zuker.
    
     (Linear) MFOLD what sequence ? mzeorp1a.gb_pl
    
                      Begin (* 1 *) ?  
                    End (*  1382 *) ?  
     
     What should I call the energy matrix output file (* mzeorp1a.mfold *) ? 
    
     ** mfold will run as a batch or at job.
    
     ** mfold was submitted using the command:
        "  atnow   "
    
           warning: commands will be executed using /bin/sh
    job 908896860.a at Tue Oct 20 11:21:00 1998
    
    In the above session, I started an mfold run in batch. Note that the program told me:

    >> Setting up remote run on churn.nih.gov...
    
    which is a major clue that the program is running remotely on churn. I now realize I put in the wrong sequence and want to kill this run before it goes for days.
    helix% rlogin churn
    IRIX Release 6.2 IP25 churn
    Copyright 1987-1996 Silicon Graphics, Inc. All Rights Reserved.
    Last login: Tue Oct 20 09:52:40 EDT 1998 by susanc@chisos.cit.nih.gov
    Share II resource management; Copyright (C) 1989-1995 Softway Pty Ltd
    Share login on /dev/ttyq19.
    Tue Oct 20 11:21:48 EDT 1998
    
    churn% ps -fu susanc
         UID   PID  PPID  C    STIME TTY     TIME CMD
      susanc 29736 29735 19 11:21:45 pts/19  0:01 -tcsh 
      susanc 29634   840  0 11:21:05 ?       0:00 sh 
      susanc 29635 29634  0 11:21:06 ?       0:00 /bin/csh -f ./mfold_29611_1 
      susanc 29761 29736  8 11:22:19 pts/19  0:00 ps -fu susanc 
      susanc 28429 28428  0 09:52:40 pts/17  0:01 -tcsh 
      susanc 29636 29635 80 11:21:06 ?       1:05 /gcg/gcgbin/execute/mfold -Init=mfold_29611_1.init 
      susanc 29637 29635  0 11:21:06 ?       0:00 Mail -s mfold susanc 
      susanc 19125 19124  0 14:13:27 pts/13  0:01 -tcsh 
    
    churn% kill 29636
    churn% ps -fu susanc
      susanc 18938 18937  0 14:05:30 pts/12  0:04 -tcsh 
      susanc 29736 29735  0 11:21:45 pts/19  0:01 -tcsh 
      susanc 29776 29736  7 11:23:19 pts/19  0:00 ps -fu susanc 
    churn% exit
    helix% 
    
    So, as above, I rlogin'd to churn, looked for all the processes running under my name (with 'ps -fu susanc'), found the one that says /gcg/gcgbin/execute/mfold, and killed that one's process id. Then I typed 'ps -fu susanc' again to make sure it was gone, and exited churn back to helix. The batch process will send you email saying that it was killed.


    How can I find all relevant sequences with Lookup?

    I'm looking for Mouse TGF-beta related sequences. I tried an All Text search in Lookup for 'TGF-beta', and found 37 sequences. But M13177, a TGF-beta sequence that I know exists, was not among the list. Why didn't it show up?

    The problem is really that there is no standard nomenclature that Genbank depositers use. So some people might describe as sequence with the words 'TGF-beta', and some may use 'Transforming Growth Factor Beta', with or without hyphens. Lookup's indices are pretty good, but they can't take care of all the variations.

    The best approach is to think of all the possible variations of your key phrase, and search for all of them. So, in this case, look for

    TGF-beta or Transforming Growth Factor
    
    You need to use & for AND, and | for OR. See the lookup documentation for more information. So a good search would be:
     Complete the query form below:
    
                     All text:  TGF-beta | (transforming & growth & factor)
                   Definition:
                       Author:
                      Keyword:
                Sequence name:
             Accession number:
                     Organism:  mus musculus
                    Reference:
                        Title:
                      Feature:
      On or after (dd-mmm-yy):               On or before (dd-mmm-yy):
     Shortest sequence length:                Longest sequence length:
    
         Inter-field operator:  AND             Form of output list:  Whole Entries
    
     Press D to continue.
    
    Searching genbank
    
     49 entries were found.
    
    As you see, 49 entries were found, among which is GB_RO:MUSTGFRNA, which is your accession number M13177. Note that you can 'fetch' the sequence using either the locus name MUSTGFRNA or the accession number M13177.


    How can I find bases 3134678 to 3136537 of the E. Coli genome?

    Unfortunately there is no really convenient way to do exactly what you want. There are a couple of possible methods, but both require more effort than one would like.

    Method 1: GCG's Lookup and Fetch programs
  • The E. Coli genome is divided into 400 sections, which are saved as 400 separate entries in the Genbank database. You can find the list of entries by using the GCG program 'lookup'. Select 'Genbank' as the database to search, type 'Coli & genome' next to 'All Text', and save the results. It will show that entry GB_BA1:ECAE000111 contains section 1 of 400 of the complete genome, and GB_BA1:ECAE000510 contains the last section, 400 of 400.

    This does not tell you which bases are contained in each entry. If from some other source (e.g. a paper, or the author) you know one entry that contains part of your region of interest, you can get the upstream and downstream bases by getting the preceding and following entries. For example, in this case entry ECAE000381 contains most of the region of interest. If you want a few hundred bases on either side of this entry, you would want entries ECAE000380 and ECAE000382. Note that this requires prior knowledge of at least one entry number. The entries themselves give no indication of which bases they include, they just say 'section 271 of 400' etc.

    In the lookup program, you could also search for 'U00096', which is the accession number for the entire genome. You can do this search through GCG-Lite+ on the web.

    You can obtain the actual entries by typing 'fetch ECAE000381' for each accession number at the helix prompt.

    Method 2: NCBI's Genome Query
  • An alternative way to obtain the sequence is by pointing your web browser at NCBI's Genome Query. Click on 'E. Coli' and you will get a basic circular figure showing the genome. Put your cursor as close to 3100K as possible, and click. You will get a figure showing 3100950..3150949, or something close to that. Use the forward and back arrows at the bottom of the figure to obtain a region that includes your region of interest. Then click on the 'TextView' button in the left frame. The main frame should now show an actual Genbank entry for bases 3100950..3150949

    Choose 'Save Frame As...' in your browser menu under File. Choose 'Text' as the format, and save the entry into a file on your desktop computer. This entry is in Genbank format, will need to be converted to GCG format for any GCG program. ou can use the helix program 'fmtseq' to convert it. Type 'fmtseq' at the helix prompt and follow instructions.

  • However the sequence is obtained, you will need to delete the extra bases on each end within seqed, and will have to keep track of the base numbering by yourself. Type 'seqed filename' to start up seqed. If you used Method 1, you will have two files to enter. Position your cursor at the end of the first sequence, hit Ctrl-D to get to the command prompt, and type 'Include filename2' to add the second sequence at the end of the first.


    How can I search for promoter regions?

    The GCG program Findpatterns can be used to find any pattern, and the GCG file tfsites.dat contains patterns from the Transcription Factor database. Sample session:
    helix 75% findpatterns -data=genmoredata:tfsites.dat
    
    FindPatterns identifies sequences that contain short patterns like
    GAATTC or YRYRYRYR.  You can define the patterns ambiguously and allow
    mismatches. You can provide the patterns in a file or simply type them
    in from the terminal
    
     FINDPATTERNS in what sequence(s) ? aw778747.gb_est46
    
      What should I call the output file (* aw778747.find *) ?
    
         aw778747.gb_est46 len:        410     ....................................
    ...............................................................................
    ...............................................................................
    ..................................................
    
     FINDPATTERNS in what sequence(s) ?
          
         Total finds:             45
        Total length:            410
     Total sequences:              1
            CPU time:          01.63
    
         Output file: aw778747.find
    
    The file tfsites.dat is in the GCG directory genmoredata. You can copy this to your own directory using the 'fetch' program. e.g.
    helix% fetch tfsites.dat
    
    Fetch copies GCG sequences or data files from the GCG database
    into your directory or displays them on your terminal screen.
    
     tfsites.dat
    
    
    You can then modify this file to retain only the sites of your interest.