
Can I use Fasta with a short sequence?

You can, but it's not a good idea. FastA and BLAST work best with sequences longer than 50 bp. For short sequences, try Findpatterns instead.

For example, say you have a 10-nucleotide protein-binding site within a promoter, and you want to search the E. coli genome for this consensus sequence. Findpatterns works best if you can define a sequence pattern that is specific but still includes all of the "real" binding sites. It does not allow positional weighting, so the best input is a pattern built from fully conserved residues or IUPAC ambiguity codes, plus the allowed spacings between conserved groups. In addition, if there are correlated dinucleotide variations, they are best searched as separate patterns rather than through a single generalization of the pattern.

For example, searching the two patterns

CNGGATNA{5,7}TNATCCNG
CNGGTANA{5,7}TNTAGGNGMM
works better than
CNGGNNNA{5,7}TNNNCCNG
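
The difference between the specific and the generalized pattern can be illustrated outside GCG. The following is only a rough Python sketch, not a reimplementation of Findpatterns: it assumes that each letter is a standard IUPAC nucleotide code and that {m,n} repeats the preceding symbol m to n times, as in a regular expression. The function names and the test sequences are made up for illustration, and only the given strand is searched.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def pattern_to_regex(pattern):
    """Translate an IUPAC pattern with {m,n} repeats into a regex string.

    Assumption: {m,n} applies to the single preceding symbol, as in a regex.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "{":
            end = pattern.index("}", i)
            parts.append(pattern[i:end + 1])  # copy the repeat count unchanged
            i = end + 1
        else:
            parts.append(IUPAC[ch.upper()])
            i += 1
    return "".join(parts)

def find_sites(pattern, sequence):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # The lookahead lets matches that overlap each other all be reported.
    regex = re.compile("(?=(" + pattern_to_regex(pattern) + "))")
    return [(m.start(1), m.group(1)) for m in regex.finditer(sequence.upper())]

# Hypothetical sequences: one fits the specific pattern, the other fits
# only the generalized pattern.
real_site = "CAGGATCAAAAATGATCCTG"
decoy = "CAGGCCCAAAAATCCCCCTG"

for name, pat in [("specific", "CNGGATNA{5,7}TNATCCNG"),
                  ("general", "CNGGNNNA{5,7}TNNNCCNG")]:
    print(name, len(find_sites(pat, real_site)), len(find_sites(pat, decoy)))
```

In this toy case the generalized pattern also reports the decoy, which is the kind of extra noise the separate, specific patterns avoid.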

Findpatterns also lets you allow mismatches, but for short, degenerate patterns without positional weighting this generally just lowers your signal-to-noise ratio. It is better to search additional patterns explicitly.

Patterns for Findpatterns are specified in the same data file format as the restriction enzymes for "MAP", etc. Type "fetch pattern.dat" to get an example file you can modify with your own patterns. Type "genhelp findpatterns" to see a description of the program. The pattern syntax is described under "defining patterns", and making and using your own pattern files is described under "pattern file" and "local data files".

If you have several of these sequences, and the individual bases contribute unequally (e.g. the T1, A2 and T6 in the -10 box are more conserved than the others), you can align the known sequences (make individual sequence files and then use Pileup, or enter them directly using Lineup). You can then run Profilemake followed by Profilesearch. These two programs are meant for proteins but can be made to work with DNA sequences; don't forget to use -MATRix=profiledna.cmp with Profilemake.
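
To show what positional weighting buys you, here is a minimal Python sketch of a log-odds position weight matrix built from aligned sites. It is not the Profilemake algorithm, only the general idea of scoring each position by how conserved it is. The aligned sites, the pseudocount and the uniform background frequency are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from equal-length aligned DNA sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def best_window(pwm, sequence):
    """Return (score, start) of the highest-scoring window (A/C/G/T only)."""
    width = len(pwm)
    scored = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Made-up aligned sites resembling a -10 box; positions 1, 2 and 6 are
# fully conserved here, the others vary.
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT", "TAAAAT"]
toy_pwm = build_pwm(toy_sites)

score, start = best_window(toy_pwm, "GGCGTATAATGCC")
print(start, round(score, 2))
```

A mismatch at a fully conserved position costs more score than one at a variable position, which a plain pattern with mismatches allowed cannot express.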

Dr. William Pearson, author of FastA, says that FastA should work fine for such a search if you use a word size of 1 instead of the default word size of 6 for DNA.
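
The effect of word size on a short query can be seen in a simplified Python sketch. This is not how FastA is implemented; it only counts identical k-letter words at the same offset in two equal-length sequences, which is roughly what the initial word lookup depends on. The sequences and the function name are hypothetical.

```python
def shared_words(query, target, k):
    """Count offsets where query and target contain the same k-letter word."""
    return sum(
        query[i:i + k] == target[i:i + k]
        for i in range(len(query) - k + 1)
    )

# Hypothetical 10-nucleotide site and a copy with one central mismatch.
query = "ACGTTGCATG"
target = "ACGTAGCATG"

print(shared_words(query, target, 6))  # words of length 6
print(shared_words(query, target, 1))  # words of length 1
```

With a word size of 6, one mismatch near the middle of a 10-nucleotide site leaves no shared words at all, so the site gives the word lookup nothing to find. With a word size of 1, the nine matching positions still count.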

(Response to this question modified from posts by Michael Leonetto and Paul Roy on bionet.software.gcg.)