
I need to compare my 200 bp sequence against a complete virus genome, which is 124,000 bases long. How should I do this?

One way is to run FastA against the genome sequence. If the genome sequence is in GenBank, you can copy it into your directory with the GCG Fetch command. Then run FastA as follows:

helix% fasta

FastA does a Pearson and Lipman search for similarity between a query
sequence and a group of sequences of the same type (nucleic acid or
protein). For nucleotide searches, FastA may be more sensitive than BLAST. 

11-Nov-1997

 FASTA with what query sequence ?  mysequence.gcg

                  Begin (* 1 *) ?  
                End (*    50 *) ?  

 Search for query in what sequence(s) (* GenEMBL:* *) ?  hevzvxx.gb_vi

 What word size (* 6 *) ?  

 Don't show scores whose E() value exceeds: (* 2.0 *):  

 What should I call the output file (* mysequence.fasta *) ?  

          1 Sequences     124,884 bp searched    hevzvxx.gb_vi


(Nucleotide) FASTA of: test.gcg  from: 1   to: 50  June 22, 1998 10:39
 TO: hevzvxx.gb_vi  Sequences:          1  Symbols:    124,884  Word Size: 6

 Searching with both strands of the query.
 Scoring matrix: GenRunData:fastadna.cmp
 Constant pamfactor used
 Gap creation penalty: 16      Gap extension penalty: 4


The best scores are:                    init1 initn   opt

hevzvxx.gb_vi  
! LOCUS       HEVZVXX    124884 bp   ...  250   250   250
hevzvxx.gb_vi  
! LOCUS       HEVZVXX    124884 bp   ...   55    80    56



The list contains 2 entries.
How many alignments would you like to see (* 2 *) ? 

Aligning...


 CPU time used: 
       Database scan:  0:00: 0.8
Post-scan processing:  0:00: 1.5
      Total CPU time:  0:00: 3.0
 Output File: mysequence.fasta

As you can see, FastA reported only 2 entries: essentially, the best hit on each strand of the query. The genome is too large for programs like Gap and BestFit, which in any case would also report only the single best alignment. An alternative approach is to run the program WordSearch followed by the program Segments. For example:

helix% wordsearch

WordSearch identifies sequences in the database that share large 
numbers of common words in the same register of comparison with your
query sequence.  The output of  WordSearch can be displayed with
Segments.

 WORDSEARCH with what query sequence ?  mysequence.gcg
                                       
                  Begin (* 1 *) ?   
                End (*  1000 *) ?  

 Search for query in what sequence(s) (* GenEMBL:* *) ?  vi:hevzvxx
                                                        
 What word size (* 6 *) ?  

 List how many best diagonals (* 50 *) ?  

 Integrate how many adjacent diagonals (* 3 *) ?  

 What should I call the output file (* mysequence.word *) ?  

      1 HEVZVXX              Len: 124,884
         6-mers found:     1,045,257
 Diagonals with words:        24,879
      Total diagonals:       251,766
   Sequences searched:             1
             CPU time:         00.53

          Output file: mysequence.word

The output from WordSearch is then typically passed through the program Segments to produce a ".pairs" file. Here is what some of the results of this search look like:

(BestFit) SEGMENTS from: myfile.word  May 22, 1998 14:29

 (Nucleotide) WORDSEARCH of: /usr/users/share/cabot/test/temp/myfile.seq  
 check: 2347 from: 1  to: 1000
 ASSEMBLE    January 13, 1998 16:44
Symbols:     1 to: 1000  from: dmrt412g          ck: 5977,     1 to: 1000 
LOCUS       DMRT412G     6897 bp    DNA             INV       13-NOV-1996
DEFINITION  Drosophila retrotransposon 412 genome.
ACCESSION   X04132 X03733 . . . 

 AvMatch: 3.84  AvMisMatch: -6.00  GapWeight: 50  LengthWeight: 3   ..

        Match display thresholds for the alignment(s):
                    | = IDENTITY
                    : =   3
                    . =   1

myfile.seq                check: 2347  from: 1      to: 1000   /Reverse
GB_VI:HEVZVXX             check: 4734  from: 70843  to: 124884
     X04370 Varicella-Zoster virus complete genome. 9/93
 Gaps: 1  Quality: 209  Ratio: 2.944  Score: 50  Width: 7  Limits: +/-8 
                  .         .         .         .         .
     989 CTCCGTTTTACAATATTTCTTACAATTTTTCTTATCTATATATATTTTAT 940
         |||||| ||   |||||   | ||   | |    || |   |||| || |
   70854 CTCCGTGTT.TTATATTATATCCACGGTGTTGATTCAACCAATATGTTGT 70902
                  .         .  
     939 ATTTACTTTATATTTATATATA 918
         || | |||| | |||   ||||
   70903 ATCTTCTTTTTTTTTTACTATA 70924

myfile.seq                check: 2347  from: 1      to: 1000  
GB_VI:HEVZVXX             check: 4734  from: 26311  to: 124884
     X04370 Varicella-Zoster virus complete genome. 9/93
 Gaps: 0  Quality: 112  Ratio: 7.000  Score: 46  Width: 8  Limits: +/-9 
                  .      
     648 AGATAATATTAAAAAT 663
         | ||| ||||| ||||
   26964 ACATATTATTATAAAT 26979
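
To make the "diagonals" in the WordSearch summary more concrete, here is a minimal Python sketch of the general idea: every word shared by the query and the target votes for the diagonal (target position minus query position) on which it lies, and diagonals with many votes are candidate regions of similarity. This is only an illustration, not GCG's implementation. The function name and the toy sequences are made up, only one strand is examined, and adjacent diagonals are not integrated.

```python
from collections import Counter, defaultdict

def diagonal_word_counts(query, target, word_size=6):
    """Count shared words on each diagonal (target position minus query position)."""
    # Index the start position of every word in the query.
    positions = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        positions[query[i:i + word_size]].append(i)

    # Each word shared with the target votes for the diagonal it lies on.
    diagonals = Counter()
    for j in range(len(target) - word_size + 1):
        for i in positions.get(target[j:j + word_size], ()):
            diagonals[j - i] += 1
    return diagonals

# Toy data (hypothetical): the query is embedded 5 bases into the target.
query = "ACGTACGGTCA"
target = "TTTTT" + query + "GGGGG"
best_diagonal, hits = diagonal_word_counts(query, target).most_common(1)[0]
print(best_diagonal, hits)  # 5 6
```

The six overlapping 6-mers of the toy query all fall on diagonal 5, which is why that diagonal scores highest.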

If the query sequence is relatively short (under 240 residues), an alternative is to use the program FindPatterns. Be aware that FindPatterns finds only exact matches unless the optional qualifier "MISmatch" is included on the command line. For example:
```
   findpatterns -mis=2
```
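
To illustrate what allowing mismatches means, here is a rough Python sketch that slides a pattern along a target and keeps every window that differs at no more than a given number of positions. It is not how FindPatterns is implemented. The function name and test sequences are hypothetical, gaps are not considered, and only the strand as given is searched.

```python
def find_with_mismatches(pattern, target, max_mismatches=2):
    """Return (start, mismatches) for each window of target that differs
    from pattern at no more than max_mismatches positions (0-based starts)."""
    pattern, target = pattern.upper(), target.upper()
    hits = []
    for start in range(len(target) - len(pattern) + 1):
        mismatches = 0
        for p, t in zip(pattern, target[start:start + len(pattern)]):
            if p != t:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, give up on this window
        else:
            # The inner loop finished without exceeding the limit.
            hits.append((start, mismatches))
    return hits

# Toy target (hypothetical): one exact copy and one copy with a single change.
target = "GGGG" + "ACGTAC" + "GGGG" + "ACCTAC" + "GGGG"
print(find_with_mismatches("ACGTAC", target, max_mismatches=0))  # [(4, 0)]
print(find_with_mismatches("ACGTAC", target, max_mismatches=1))  # [(4, 0), (14, 1)]
```

With no mismatches allowed, the second copy is missed entirely, which is the behavior to expect when the qualifier is left off.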

Yet another approach is to use the program BreakUp to break HEVZVXX into overlapping segments, which can then be searched with FastA. You must first use Fetch to copy the entire sequence to a file in the current working directory. Here is an example showing how to break the sequence into a series of 5 kb segments that overlap by 500 bases, so that each file except the last is 5,500 bp long.

% fetch vi:hevzvxx
Fetch copies GCG sequences or data files from the GCG database 
into your directory or displays them on your terminal screen.

 hevzvxx.gb_vi

% breakup  -seg=5000 -over=500

BreakUp reads a GCG-format sequence file containing more than 350,000
sequence characters and writes it as a set of separate, shorter,
overlapping sequence files that can be analyzed by Wisconsin 
Package programs. 


 BREAKUP of what file(s) ?   hevzvxx.gb_vi
                           
 hevzvxx_00.gb_vi  length: 5500 bp
 hevzvxx_01.gb_vi  length: 5500 bp
 hevzvxx_02.gb_vi  length: 5500 bp
 hevzvxx_03.gb_vi  length: 5500 bp
 hevzvxx_04.gb_vi  length: 5500 bp
 hevzvxx_05.gb_vi  length: 5500 bp
 hevzvxx_06.gb_vi  length: 5500 bp
 hevzvxx_07.gb_vi  length: 5500 bp
 hevzvxx_08.gb_vi  length: 5500 bp
 hevzvxx_09.gb_vi  length: 5500 bp
 hevzvxx_10.gb_vi  length: 5500 bp
 hevzvxx_11.gb_vi  length: 5500 bp
 hevzvxx_12.gb_vi  length: 5500 bp
 hevzvxx_13.gb_vi  length: 5500 bp
 hevzvxx_14.gb_vi  length: 5500 bp
 hevzvxx_15.gb_vi  length: 5500 bp
 hevzvxx_16.gb_vi  length: 5500 bp
 hevzvxx_17.gb_vi  length: 5500 bp
 hevzvxx_18.gb_vi  length: 5500 bp
 hevzvxx_19.gb_vi  length: 5500 bp
 hevzvxx_20.gb_vi  length: 5500 bp
 hevzvxx_21.gb_vi  length: 5500 bp
 hevzvxx_22.gb_vi  length: 5500 bp
 hevzvxx_23.gb_vi  length: 5500 bp
 hevzvxx_24.gb_vi  length: 4884 bp
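
The lengths in this listing are consistent with a simple scheme in which a new piece starts every 5,000 bases and each piece runs 500 bases into the next one. The Python sketch below reproduces that arithmetic. The scheme is inferred from the listing rather than taken from BreakUp itself, and the function name and the placeholder genome are made up for illustration.

```python
def break_up(sequence, segment=5000, overlap=500):
    """Split sequence into pieces that start every `segment` bases and
    extend `overlap` bases into the following piece.
    Returns a list of (0-based start, piece) tuples."""
    pieces = []
    for start in range(0, len(sequence), segment):
        pieces.append((start, sequence[start:start + segment + overlap]))
    return pieces

# Placeholder sequence with the same length as HEVZVXX (not the real genome).
genome = "A" * 124884
pieces = break_up(genome)
print(len(pieces), len(pieces[0][1]), len(pieces[-1][1]))  # 25 5500 4884
```

The overlap matters because a match that straddles a 5,000-base boundary would otherwise be cut in two. With a 500-base overlap, any match shorter than that is contained whole in at least one piece.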

You can then use an ambiguous file specification such as "hevzvxx_*.gb_vi" to run a FastA search against the files produced by BreakUp. I recommend that you also include the optional qualifier NOSTATs on the FastA command line in order to suppress the calculation of E() statistics. For example:

% fasta -nostat
 
FastA does a Pearson and Lipman search for similarity between a query
sequence and a group of sequences of the same type (nucleic acid or
protein). For nucleotide searches, FastA may be more sensitive than BLAST. 

 FASTA with what query sequence ?  mysequence.gcg
                                  
                  Begin (* 1 *) ?  
                End (*  1000 *) ?  

 Search for query in what sequence(s) (* GenEMBL:* *) ?  hevzvxx_*.gb_vi
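
Positions reported for hits against the BreakUp pieces are relative to each piece, not to the whole genome. Assuming, as the file listing above suggests, that pieces are numbered from 00 and that piece N begins 5,000 x N bases into the genome, a position can be converted back with a little arithmetic. The Python helper below is a hypothetical sketch of that conversion, not part of the Wisconsin Package, and the position used in the example is invented.

```python
import re

def genome_position(filename, local_position, segment=5000):
    """Convert a 1-based position within a BreakUp piece into a 1-based
    position in the original sequence, using the piece number in the file name."""
    match = re.search(r"_(\d+)\.", filename)
    if match is None:
        raise ValueError(f"no segment number found in {filename!r}")
    return int(match.group(1)) * segment + local_position

# Position 854 in piece 14 corresponds to 14 * 5000 + 854 in the genome.
print(genome_position("hevzvxx_14.gb_vi", 854))  # 70854
```

Keep in mind that a hit lying in the 500-base overlap will be reported twice, once in each of the two neighboring pieces, and both will map to the same genome position.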

Information courtesy of Eric Cabot, GCG inc.