
BLAST Frequently Asked Questions (FAQ)

Tips and hints

Q: Which BLAST program should I use?

You have many choices among BLAST programs and databases, and some are better suited to certain questions than others. To help you decide which BLAST program fits the question you are asking, we have created a selection chart, the "BLAST Program Selection Guide".

Q: How can I search a batch of sequences with BLAST?

There are three options for "Batch" BLAST searches:
1) Web MegaBLAST EST analysis tool: This program is optimized for aligning nucleotide sequences that differ slightly as a result of sequencing or other similar "errors". MegaBLAST is good for scanning a large number of EST-type sequences (about 500 kb in length) against a large database in search of the closest matches. You can import a file of EST sequences in FASTA format, or a list of GenBank accessions or GIs, and have them compared to the BLAST databases. The default output is an easily reviewable Hit Table format, although you can download and save the results in standard pairwise HTML or any of the other output formats. MegaBLAST is available from the BLAST web page, the standalone BLAST executables, or via the network BLAST client (see below).

2) Standalone BLAST executables: The Standalone BLAST executables are command line programs which run BLAST searches against locally downloaded copies of the NCBI BLAST databases. The programs will handle a single large file with multiple FASTA query sequences, or you can create a script to send multiple files one at a time. The executables are available for a wide variety of platforms, including many "flavors" of UNIX (Linux, Solaris, etc.), Windows PC and even Mac OS X.

The Standalone executables are available at the anonymous FTP location ftp://ftp.ncbi.nih.gov/blast/executables/. Information on the Standalone BLAST executables is available in the README file at ftp://ftp.ncbi.nih.gov/blast/documents/blast.txt, which is also bundled with the downloaded binaries.

3) BLAST Network Client 'blastcl3': The BLAST 2.0 Network client allows you to submit a single file of FASTA sequences over an Internet connection to the NCBI BLAST databases. You submit searches through the client to the NCBI servers and do not need to download the databases locally. The BLAST Network client executables are located at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/. There are blastcl3 executables for various UNIX platforms, PC Windows and Macintosh.
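
For option 2 above, a script that sends query files one at a time first has to split a multi-FASTA file into smaller pieces. The following is a minimal Python sketch of that step, assuming ordinary FASTA input in which every record starts with a ">" line. The function names (parse_fasta, batch_records, to_fasta) and the example sequences are hypothetical and are not part of the BLAST distribution.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def batch_records(records: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Group records into lists holding at most batch_size records each."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_fasta(records: Iterable[Record]) -> str:
    """Serialize records back to FASTA text, one sequence line per record."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Made-up example input: three short records split into batches of two.
example = ">est1 hypothetical\nACGTACGT\nACGT\n>est2\nGGGCCC\n>est3\nTTTT\n"
batches = list(batch_records(parse_fasta(example), 2))
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {len(batch)} record(s)")
```

Each batch can then be written to its own file with to_fasta and passed to whichever BLAST program you run locally.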

Q: How can I write a program to submit jobs to NCBI's BLAST servers?

Use the URL API. Documentation is also available in PostScript and PDF.

Q: How can I limit my BLAST search based on Organism?

The option to limit a search to an organism, or even to a taxonomic classification, is part of the "Limit by Entrez Search" option on most standard BLAST search pages. There is a pull down menu to select the most common organisms found in GenBank, and also a field in which to enter a species name or classification (example: "eubacteria"). Using this option will cause your query sequences to be compared only to sequences in our databases from that organism.

There are also several "specialized" BLAST Pages devoted to different organisms on the main BLAST web page.

Q: How can I limit my search to a subset of database sequences?

You can use the "Limit by Entrez Search" option found on most standard BLAST search pages to run an Entrez search and have your query sequence compared to the results of this search. For example, if you wanted to limit your search to all phosphorylase sequences from mouse, you could enter the following valid Entrez search strategy in the Limit by Entrez field of the BLAST search page: phosphorylase AND "Mus musculus"[Organism]
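
If you build such limits in a script, the string can be assembled from its parts. This is a minimal Python sketch covering only the simple "term AND organism" form shown above; the function name entrez_limit is hypothetical, and the full Entrez query syntax is much richer than this.

```python
from typing import Optional


def entrez_limit(term: Optional[str] = None, organism: Optional[str] = None) -> str:
    """Build a simple Entrez limit string of the form: term AND "organism"[Organism]."""
    parts = []
    if term:
        parts.append(term)
    if organism:
        # The organism name is quoted and tagged with the [Organism] field.
        parts.append(f'"{organism}"[Organism]')
    return " AND ".join(parts)


limit = entrez_limit("phosphorylase", "Mus musculus")
print(limit)
```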

Q: Is it possible to search for a motif or pattern with BLAST?

There are two general approaches to this type of question. Do you wish to find out whether motifs exist in your query sequence, or do you have a known motif and wish to find other proteins or nucleotides with this motif?

In the first case, finding motifs in your query sequence can be done for proteins using the CDD (Conserved Domain Database) and CDART (Conserved Domain Architecture Retrieval Tool) tools. CDD allows you to compare your protein to a database of alignments and profiles representing protein domains conserved in molecular evolution, as well as to 3-dimensional protein structures in the MMDB database. These tools use the popular protein motif databases Pfam (http://pfam.wustl.edu/) and SMART (http://smart.embl-heidelberg.de) in addition to the MMDB database.

In the second case, if you have a known motif and wish to identify other proteins with this motif, you can use PHI-BLAST. A PHI-BLAST search takes a motif pattern and a protein sequence as input and compares these to the NCBI protein databases, looking for other proteins which contain conserved regions similar to the motif entered.

For nucleotides, it is only possible to search with a short query sequence representing your motif or region of interest, using the Nucleotide BLAST "Search for short nearly exact matches" service from the main BLAST web page. This can find other sequences which contain similar nucleotide patterns. However, there is no database of nucleotide patterns that can identify patterns in your nucleotide query sequence.

You may also be interested in checking out other molecular biology web sites, such as those mentioned in the Other Molecular Biology Resources section at the end of this FAQ, for motif searching software.

Q: How do I perform a similarity search with a short peptide/nucleotide sequence?

There is a special page with pre-set parameters for searching with short sequences. You can access this page by clicking the "Search for short nearly exact matches" link on the main BLAST web page.

Essentially, for these searches the Expect value threshold has been increased and the word size decreased. Short hits generally have large E values and require smaller word sizes to initiate formation of the HSP for extension. In addition, for proteins, the "PAM30" matrix becomes the default, which optimises hits to shorter sequences, which in general show a lower percentage of evolutionary drift.

Q: Can I use BLAST to compare two or more sequences in a multiple sequence alignment?

You can use the BLAST 2 Sequences service to compare two nucleotide or two protein sequences against each other using the Gapped BLAST algorithm. This will allow you to perform a BLAST search between the two sequences, allowing for the introduction of gaps (deletions and insertions) in the resulting alignment. Remember that BLAST is a "local" alignment program and does not make global alignments between sequences to calculate total percent homologies.

To compare one sequence against a specific sequence or set of sequences, you can also use a separate multiple sequence alignment program. There are many such software tools available to do this. You may also be interested in checking out other molecular biology web sites, such as those mentioned in the Other Molecular Biology Resources section at the end of this FAQ.

Q: What is the Expect (E) value?

The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E value, or the closer it is to "0", the more "significant" the match is. However, keep in mind that matches to short query sequences can be virtually identical and still have relatively high E values. This is because the calculation of the E value also takes into account the length of the query sequence, and shorter sequences have a higher probability of occurring in the database purely by chance. For more details please see the calculations in the BLAST Course.

The Expect value can also be used as a convenient way to create a significance threshold for reporting results. You can change the Expect value threshold on most main BLAST search pages. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
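
The behaviour described above can be illustrated with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The Python sketch below is a simplified illustration only: it ignores the length corrections a real search applies, and the values of lambda and K, the scores and the database size are assumptions chosen for illustration, not taken from any particular search. The function name expect_value is hypothetical.

```python
import math


def expect_value(raw_score: float, query_len: int, db_len: float,
                 lam: float = 0.267, k: float = 0.041) -> float:
    """Simplified E value: K * m * n * exp(-lambda * S).

    lam and k are illustrative assumptions; real values depend on the
    scoring matrix and gap penalties used in the search.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


DB_LEN = 1_000_000_000  # assumed database size in residues

# A short query cannot reach a high score, even with a perfect match.
short_hit = expect_value(raw_score=50, query_len=10, db_len=DB_LEN)

# A long query with a strong match reaches a far higher score.
long_hit = expect_value(raw_score=500, query_len=300, db_len=DB_LEN)

# The same hit against a database twice as large is twice as likely by chance.
short_hit_big_db = expect_value(raw_score=50, query_len=10, db_len=2 * DB_LEN)

print(f"short query E = {short_hit:.3g}")
print(f"long query  E = {long_hit:.3g}")
print(f"short query, doubled database E = {short_hit_big_db:.3g}")
```

Under these assumptions the short query has a much larger E value than the long one, and doubling the database size doubles E, which is why a threshold that suits long queries can hide good matches to short ones.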

Q: What is low-complexity sequence?

Regions with low-complexity sequence have an unusual composition, and this can create problems in sequence similarity searching (Wootton & Federhen, 1996). Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity, and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits (please also see Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?).

In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
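
One rough way to put a number on "low complexity" is the Shannon entropy of the residue composition. The Python sketch below is a simplified stand-in for illustration only. It is not the algorithm used by the BLAST filters, and the function name shannon_entropy and the mixed-composition comparison sequence are made up for the example.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Entropy, in bits per residue, of the letter composition of seq."""
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# The two low-complexity examples from the text above.
low_protein = "PPCDPPPPPKDKKKKDDGPP"
low_nucleotide = "AAATAAAAAAAATAAAAAAT"

# A made-up nucleotide sequence with an even mix of A, C, G and T.
mixed_nucleotide = "ACGTACGTTGCAGTCAAGCT"

for name, seq in [("low protein", low_protein),
                  ("low nucleotide", low_nucleotide),
                  ("mixed nucleotide", mixed_nucleotide)]:
    print(f"{name}: {shannon_entropy(seq):.2f} bits per residue")
```

The maximum possible value is log2(4) = 2 bits for nucleotides and log2(20), about 4.32 bits, for proteins. Both example sequences from the text fall well below the maximum for their alphabet, while the mixed sequence reaches it.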

Other Molecular Biology Resources:

The on-line BLAST Course was written by Dr. Stephen Altschul and discusses the basics of the Gapped BLAST algorithm. In addition the full text of the 1997 Nucleic Acids Research paper "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs" is also available on-line.
Other links:
European Bioinformatics Institute (EBI) BioCatalog
Indiana University IUBio Archive
Sequence manipulation site

Troubleshooting

Q: Causes for "No significant similarity found".

Below are several reasons that a BLAST search can result in the "No significant similarity found" message.

Short Sequences: There is a special BLAST page optimized for searching with short sequences. Go to the main BLAST web page and select the "Search for short nearly exact matches" link in the Nucleotide - Nucleotide or Protein - Protein section.

Filtering: BLAST filters regions of low complexity (for a description of low complexity see "What is low-complexity sequence?" above). If your sequence contains large regions of "low complexity", it may not produce significant hits to the database. You can turn off filtering by setting the "Filter" option to "None" using the pull down menu.

Query Format: Another reason you may see the "No Significant Similarity found" message is using the wrong type of sequence in your search.

1) Accession/GI Number or FASTA. Check that you have the Input Data set to the correct format for your Query. Set the pull down menu to "Accession number or Gi" to search with GenBank accession numbers or Gi numbers. Set to FASTA for raw amino acid or nucleotide sequences. For more information on FASTA format, click here.

2) Sequence type and Program combination. You can search with an amino acid query sequence using the blastp and tblastn programs. With nucleotide query sequences you can use blastn, blastx, and tblastx. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.

For more information on the BLAST programs, click here.
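
The query-type and program combinations listed in 2) above can be written down as a small lookup table. The Python sketch below does this and adds a crude guess at whether a raw sequence is nucleotide or protein. The guessing heuristic, its 0.9 threshold and all function names are assumptions made for illustration and do not describe how the BLAST pages check input.

```python
# program -> (query type, database type)
PROGRAMS = {
    "blastp": ("protein", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("nucleotide", "protein"),      # query is translated
    "tblastn": ("protein", "nucleotide"),     # database is translated
    "tblastx": ("nucleotide", "nucleotide"),  # both are translated
}


def guess_sequence_type(seq: str, threshold: float = 0.9) -> str:
    """Guess 'nucleotide' or 'protein' from the letter composition (heuristic)."""
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    nucleotide_like = sum(c in "ACGTUN" for c in letters)
    if nucleotide_like / len(letters) >= threshold:
        return "nucleotide"
    return "protein"


def programs_for(seq: str) -> list:
    """Return the programs whose query type matches the guessed type of seq."""
    query_type = guess_sequence_type(seq)
    return sorted(name for name, (qtype, _) in PROGRAMS.items()
                  if qtype == query_type)


# Made-up example queries.
print(programs_for("ACGTACGTAGCTAGCT"))
print(programs_for("MKVLAAGIVGLLLAQPSW"))
```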

Q: Why does my search timeout on the BLAST servers?

Certain combinations of BLAST searches with large sequences against large databases can cause the BLAST servers to time out. This has to do with a limit on the server CPUs which prevents sequences that generate many HSPs from hoarding server resources.

However there are some things you can do to prevent timeout and generate results from large sequences.

- Some sequences contain large regions of ALU repeats. In this case you can select the "Human Repeat" filtering option on the main BLAST search page. This will mask repeat regions which generate a large number of biologically uninteresting hits to the databases.

- Increase the Word Size to 20 - 25. With a default Word Size of 7, the BLAST algorithm finds initial HSPs of 7 bases in length and begins extension of these from either end. In a large sequence this can generate hundreds of initial HSPs between the query sequence and even a single large genomic sequence in the databases. Increasing the Word Size to 25 requires a longer initial match, limiting the number of small initial fragments to be extended.

- Decrease the Expect value to 1.0 or lower. Many hits from large sequences are to small fragments in the database. These hits tend to have high Expect values, so lowering the threshold eliminates them and concentrates the results on hits which are more likely to contain large coding regions and genomic fragments.
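
The effect of Word Size on the number of initial matches can be illustrated with a toy seed counter. The Python sketch below only counts exact word matches between two sequences and performs no extension, so it is a simplified illustration rather than the BLAST implementation. The function name count_seed_hits and the example sequences are made up.

```python
from collections import defaultdict


def count_seed_hits(query: str, subject: str, word_size: int) -> int:
    """Count (query position, subject position) pairs sharing an exact word."""
    # Index every word of length word_size in the subject by its positions.
    index = defaultdict(list)
    for i in range(len(subject) - word_size + 1):
        index[subject[i:i + word_size]].append(i)

    # Look up every word of the query in that index.
    hits = 0
    for j in range(len(query) - word_size + 1):
        hits += len(index.get(query[j:j + word_size], ()))
    return hits


# Made-up example sequences sharing an 8-base region.
query = "ACGTACGTAC"
subject = "TTACGTACGG"

for word_size in (3, 5, 7):
    print(f"word size {word_size}: {count_seed_hits(query, subject, word_size)} seed hit(s)")
```

Every match of a longer word contains a match of each shorter word at the same position, so the number of seeds can only stay the same or fall as the word size grows.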

If you are still seeing a "timeout" error message after making the above changes, please contact blast-help@ncbi.nlm.nih.gov with the RID of your search.

Q: Why do I get the message "ERROR:BLASTSetUpSearch: Unable to calculate Karlin-Altschul params, check querysequence" ?

This will happen if your entire query sequence has been masked by low complexity filtering. You will need to turn filtering off to get hits. For further information on filtering, please read the sections of the BLAST FAQs on Q: What is low-complexity sequence? and also Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?

Q: Why do I get the message "ERROR: Blast: No valid letters to be indexed"?

You may have accidentally entered an accession number in the search box without changing the input selection from "Sequence in FASTA format" to "Accession or gi". You will also see this error message if too many ambiguity codes (R, Y, K, W, N, etc. for nucleotides) are present in your query sequence. Although BLAST allows ambiguity codes, be aware that these will always contribute a negative score in nucleic acid searches. Thus, sequences such as degenerate PCR primers with ambiguity codes may not find any significant hits even though they may be designed from sequences that are present in the database.
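
Before submitting a nucleotide query, you can check how much of it consists of ambiguity codes. The Python sketch below counts the standard IUPAC nucleotide ambiguity letters. The function name ambiguity_fraction and the example primer are hypothetical, and the sketch does not attempt to reproduce the point at which BLAST itself reports the error.

```python
# IUPAC nucleotide ambiguity codes (everything other than A, C, G, T/U).
AMBIGUITY_CODES = set("RYKMSWBDHVN")


def ambiguity_fraction(seq: str) -> float:
    """Fraction of letters in seq that are nucleotide ambiguity codes."""
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        return 0.0
    ambiguous = sum(c in AMBIGUITY_CODES for c in letters)
    return ambiguous / len(letters)


# Made-up degenerate primer for illustration.
primer = "GGNCARYTNGTNAA"
print(f"{ambiguity_fraction(primer):.0%} of the primer is ambiguity codes")
```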

Q: After running a search why do I see a string of "X"s (or "N"s) in my query sequence that I did not put there?

You are seeing the result of automatic filtering of your query for low-complexity sequence that is performed to prevent artifactual hits. The filter substitutes any low-complexity sequence that it finds with the letter "N" in nucleotide sequences (e.g., "NNNNNNNNNNNNN") or the letter "X" in protein sequences (e.g., "XXXXXXXXX"). Low-complexity regions can result in high scores that reflect compositional bias rather than significant position-by-position alignment (Wootton & Federhen, 1996). Filter programs can eliminate these potentially confounding matches from the BLAST reports, leaving regions whose BLAST statistics reflect the specificity of their pairwise alignment. Queries searched with the blastn program are filtered with DUST. The other BLAST programs use SEG.

Q: How can I see low-similarity matches when there are many strong hits to my query sequence?

Often, when the query is a member of a large sequence family, the summary hit list and the alignments returned only contain very high scoring hits. To look at low-similarity matches, you must increase the maximum number of results returned. On the BLAST Web pages, often it is sufficient to increase the size of the summary hit list and the number of alignments shown using the menus on the Advanced pages. However, it is possible to increase the lists even further using the Other Advanced Options box on the Advanced BLAST pages. For BLAST 2.0, "-v 2000", for example, will increase the number of descriptions returned in the summary hit list to 2000. The option "-b 2000" will similarly increase the number of alignments returned.

Q: I have heard that I will be penalized if I send a large number of sequences to the servers?

The NCBI WWW BLAST server is a shared resource and it would be unfair for a few users to monopolize it. To prevent this, the server keeps track of how many queries are in the queue for each user and penalizes those users with many queries in the queue. This is done by calculating a 'Time of Execution' (TOE). If a user has only one query in the queue, then the TOE is set to the current time. As a user adds more queries to the queue, the TOE of each new query is set to the current time plus 60 seconds for every query already in the queue. For example, if a user sent in five requests one after the other without waiting for any to be worked on, then the TOEs for the requests would be:

1st request: current time
2nd request: current time + 60 seconds
3rd request: current time + 120 seconds
4th request: current time + 180 seconds
5th request: current time + 240 seconds

The BLAST server works through requests in order from earliest to latest TOE. A query will be executed before its TOE if there are no other queries with an earlier TOE. Users with large numbers of queries are encouraged to use the BLAST servers at off-peak hours, which are from 8 p.m. to 8 a.m. (EST).
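
The TOE arithmetic in the example above can be reproduced in a few lines. This Python sketch assumes all requests arrive at the same moment and uses the 60 second penalty stated above. The function name times_of_execution and the example timestamp are hypothetical, and the sketch says nothing about how the server actually implements its queue.

```python
from datetime import datetime, timedelta
from typing import List


def times_of_execution(now: datetime, n_requests: int,
                       penalty_seconds: int = 60) -> List[datetime]:
    """TOE for each of n_requests submitted back to back at time now.

    Request number i (counting from 0) finds i of the user's queries already
    queued, so its TOE is pushed back by i * penalty_seconds.
    """
    return [now + timedelta(seconds=penalty_seconds * i)
            for i in range(n_requests)]


submitted = datetime(2000, 1, 1, 12, 0, 0)  # arbitrary example time
toes = times_of_execution(submitted, 5)
for number, toe in enumerate(toes, start=1):
    delay = int((toe - submitted).total_seconds())
    print(f"request {number}: current time + {delay} seconds")
```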

