BLAST Program Selection Guide |
By blast-help group, NCBI User
Service
NCBI, NLM, NIH, 8600 Rockville Pike, Bethesda, MD 20894 |
|
1. Introduction |
NCBI has provided BLAST sequence analysis services
for over a decade. For many users, the first question they face is
"Which BLAST program should I use?"
In order to help users arrive at an answer to this question, we
have constructed this table called the "BLAST Program Selection
Guide." It is divided into several categories according to the nature and size of the query
and the primary goal of the search. Starting from the query
sequence on the left and cross-referencing to the right, a user
will arrive at the specific BLAST program best suited for that
search.
This document is also available in
PDF
(1056656 bytes).
|
|
2. Common BLAST databases |
|
To discuss BLAST program selection, we first need
to know what databases are available and what sequences they
contain. Here we will take a look at the common BLAST databases.
According to their content, they are grouped into nucleotide and
protein databases. These databases and their detailed compositions
are listed in the two tables below.
NCBI also provides specialized BLAST databases such as the vector
screening database, a variety of genome databases for different
organisms, and trace databases. The content of those databases will
be listed when the relevant special BLAST pages are discussed. |
|
Table 2.1 Content of Protein Sequence
Databases |
Database Name |
Content Description |
nr |
Non-redundant GenBank CDS translations + PDB + SwissProt + PIR
+ PRF. |
swissprot |
Last major release of the SWISS-PROT protein sequence database
(no incremental updates). |
pat |
Proteins from the Patent division of GenBank. |
month |
All new or revised GenBank CDS translations + PDB + SwissProt +
PIR + PRF released in the last 30 days. |
pdb |
Sequences derived from the 3-dimensional structure records from
the Protein Data Bank. |
|
|
Table 2.2 Nucleotide Databases for
BLAST |
Database Name |
Content Description |
nr |
All GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS,
GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant"
due to computational cost. |
est |
Database of GenBank + EMBL + DDBJ sequences from EST
division. |
est_human |
Human subset of est. |
est_mouse |
Mouse subset of est. |
est_others |
Subset of est other than human or mouse. |
gss |
Genome Survey Sequence, includes single-pass genomic data,
exon-trapped sequences, and Alu PCR sequences. |
htgs |
Unfinished High Throughput Genomic Sequences: phases 0, 1 and
2. Finished, phase 3 HTG sequences are in nr. |
pat |
Nucleotides from the Patent division of GenBank. |
pdb |
Sequences derived from the 3-dimensional structure records from
the Protein Data Bank. They are NOT the coding
sequences for the corresponding proteins found in the same PDB
record. |
month |
All new or revised GenBank+EMBL+DDBJ+PDB sequences released in
the last 30 days. |
alu_repeats |
Select Alu repeats from REPBASE, suitable for masking Alu
repeats from query sequences. See "Alu alert" by Claverie and
Makalowski, Nature 371: 752 (1994). |
dbsts |
Database of Sequence Tagged Site entries from the STS division of
GenBank + EMBL + DDBJ. |
chromosome |
Complete genomes and complete chromosomes from the NCBI
Reference Sequence project. |
wgs |
Assemblies of Whole Genome Shotgun
sequences. |
|
|
3. Program Selection Tables |
|
The appropriate selection of a BLAST program for a
given search is influenced by the following three factors: 1) the
nature of the query, 2) the purpose of the search, and 3) the
database intended as the target of the search. The following tables
provide recommendations on how to make this selection. |
|
Table 3.1 Program
Selection for Nucleotide Queries |
Length ¹ |
Database |
Purpose |
Program |
Explanation |
20 bp or longer
28 bp or above for megablast |
Nucleotide |
Identify the query sequence |
Discontiguous megablast,
megablast, or
blastn |
Learn more ... |
Find sequences similar to query sequence |
discontiguous megablast or blastn |
Learn more ... |
Find similar sequence from the Trace archive |
Trace megablast, or Trace
discontiguous megablast |
Learn more ... |
Find similar proteins to translated query in a translated
database |
Translated BLAST (tblastx) |
Learn more ... |
Peptide |
Find similar proteins to translated query in a protein
database |
Translated BLAST (blastx) |
Learn more ... |
7 - 20 bp |
Nucleotide |
Find primer binding sites or map short contiguous motifs |
Search for short, nearly exact matches |
Learn more ... |
¹ The cut-off is only a
recommendation. For short queries, one is more likely to get
matches if the "Search for short, nearly exact matches" page is
used. A detailed discussion is in Section 4 below. With default
settings, the shortest unambiguous query one can use is 11 bases for
blastn and 28 bases for MEGABLAST. |
|
|
Table 3.2 Program
Selection for Protein Queries |
Length ¹ |
Database |
Purpose |
Program |
Explanation |
15 residues or longer |
Peptide |
Identify the query sequence or find protein sequences similar
to the query |
Standard Protein BLAST (blastp) |
Learn more ... |
Find members of a protein family or build a custom
position-specific score matrix |
PSI-BLAST |
Learn more ... |
Find proteins similar to the query around a given pattern |
PHI-BLAST |
Learn more ... |
Find conserved domains in the query |
CD-search (RPS-BLAST) |
Learn more ... |
Find conserved domains in the query and identify other proteins
with similar domain architectures |
Conserved Domain Architecture Retrieval Tool (CDART) |
Learn more ... |
Nucleotide |
Find similar proteins in a translated nucleotide database |
Translated BLAST (tblastn) |
Learn more ... |
5-15 residues |
Peptide |
Search for peptide motifs |
Search for short, nearly exact matches |
Learn more ... |
¹ The cut-off is only a
recommendation. For short queries, one is more likely to get
matches if the "Search for short, nearly exact matches" page is
used. Detailed discussion is in Section 4 below. |
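The selection logic of Tables 3.1 and 3.2 can be sketched as a small lookup. This is only an illustration: the function name, the string labels, and the decision to treat a query of exactly 20 bp or 15 residues as "long" are assumptions made for the example, and the length cut-offs remain recommendations rather than hard limits.

```python
# Hypothetical helper mirroring Tables 3.1 and 3.2; not an NCBI interface.
SHORT_PAGE = "Search for short, nearly exact matches"

# (query type, database type) -> (cut-off for "long" queries,
#                                 suggestion for long queries,
#                                 minimum length for the short-query page or None)
SELECTION = {
    ("nucleotide", "nucleotide"): (
        20, "megablast, discontiguous megablast, or blastn (tblastx for translated comparisons)", 7),
    ("nucleotide", "peptide"): (20, "blastx", None),
    ("protein", "peptide"): (
        15, "blastp (or PSI-BLAST, PHI-BLAST, CD-search, CDART)", 5),
    ("protein", "nucleotide"): (15, "tblastn", None),
}


def suggest_program(query_type: str, database_type: str, length: int) -> str:
    """Return the program suggested by the tables for this query."""
    try:
        cutoff, program, short_minimum = SELECTION[(query_type, database_type)]
    except KeyError:
        raise ValueError("combination not listed in Tables 3.1 and 3.2") from None
    if length >= cutoff:
        return program
    if short_minimum is not None and length >= short_minimum:
        return SHORT_PAGE
    raise ValueError("query is shorter than the tables cover")


print(suggest_program("nucleotide", "peptide", 300))
print(suggest_program("protein", "peptide", 8))
```

The purpose of the search (identification, family membership, domain search) still decides among the programs listed for a given row, so the returned string is a starting point only.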
|
|
As genomic and other specialized sequence
information is made available to the public, NCBI creates
specialized BLAST pages for those sequences. The table below
provides a general guide on how to select and use those special
BLAST databases. |
|
Table 3.3 Search
against Genome or Special Databases |
Query ¹ |
Database |
Purpose |
Program Pages to Use ³ |
Explanation |
Nucleotide:
20 or 28 bp and above
Protein:
15 residues and above |
- ² |
Compare two sequences directly |
Align two sequences |
Learn more ... |
Human Genome |
Map the query sequence
Determine the genomic structure
Identify novel genes
Find homologs
Other data mining |
Human |
Learn more ... |
Mouse Genome |
Mouse |
Learn more ... |
Rat Genome |
Rat |
Learn more ... |
Fugu (Pufferfish) |
Fugu rubripes |
Learn more ... |
Zebrafish |
Zebrafish |
Learn more ... |
Insects (fruit fly, mosquito, and honeybees) |
Insects |
Learn more ... |
Nematodes (worms) |
Nematodes |
Learn more ... |
Plants |
Plants |
Learn more ... |
Fungi Genomes (including yeasts) |
Fungi |
Learn more ... |
Malaria |
Malaria |
Learn more ... |
Other Lower Eukaryotic Genomes |
Other eukaryotes genomes |
Learn more ... |
Microbial Genomes |
Microbial genomes |
Learn more ... |
Immunoglobulin sequences |
Find matches to curated immunoglobulin sequences |
IgBLAST |
Learn more ... |
Nucleotide:
20 or 28 bp and above |
UniVec |
Screen for vector contamination |
VecScreen |
Learn more ... |
¹ This is similar to what is
in Table 3.1 and Table 3.2. For most of the pages, the search
parameters can be modified to enable searches with a short query by
pasting additional options in the "Advanced Options" text box. For
protein comparisons, -F F -e 20000 -W 2
should be used. For nucleotide comparisons, use -F F -e 1000 -W 7 (the values in Table 4.3.1). This also requires
unchecking the megablast checkbox.
² "Align two sequences" treats the second sequence as the
database.
³ Available databases and their contents are described in
Section 5. |
|
|
NOTE:
GenBank® and BLAST®
are registered trademarks granted to NCBI by USPTO.
For questions and suggestions about BLAST, please write to: blast-help@ncbi.nlm.nih.gov
For general questions about NCBI resources, please write to: info@ncbi.nlm.nih.gov
NCBI User Services can also be reached by phone at:
(301)496-2475. |
|
|
4. Explanation for the program
choices given in Tables 3.1 and 3.2 |
|
4.1 MEGABLAST is the tool of choice to identify a
nucleotide sequence. |
|
The best way to identify an unknown sequence is to
see if that sequence already exists in a public database. If the
database sequence is a well-characterized sequence, then one will
have access to a wealth of biological information. MEGABLAST,
discontiguous-megablast, and blastn all can be used to accomplish
this goal. However, MEGABLAST is specifically designed to
efficiently find long alignments between very similar sequences and
thus is the best tool to use to find the identical match to your
query sequence. In addition to the expect value significance
cut-off, MEGABLAST also provides an adjustable percent identity
cut-off for the alignment, which overrides the significance
threshold set by the Expect value parameter.
Web MEGABLAST can also accept batch queries, the only web BLAST
tool with this capability. Please refer to the "Batch
Search" section for details. |
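To make the percent identity cut-off concrete, here is a minimal sketch of how percent identity can be computed for one gapped alignment and compared with a threshold. It assumes identity is counted as identical aligned positions divided by the alignment length; the function names and the 95% threshold are illustrative and are not taken from the MEGABLAST page.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percent of alignment columns where both rows carry the same base."""
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("aligned rows must be non-empty and of equal length")
    pairs = zip(aligned_query.upper(), aligned_subject.upper())
    # A gap character ("-") never counts as an identity.
    matches = sum(1 for q, s in pairs if q == s and q != "-")
    return 100.0 * matches / len(aligned_query)


def passes_identity_cutoff(aligned_query: str, aligned_subject: str,
                           cutoff: float = 95.0) -> bool:
    """Keep an alignment only if it reaches the chosen percent identity."""
    return percent_identity(aligned_query, aligned_subject) >= cutoff


print(percent_identity("ACGTACGT", "ACGTTCGT"))        # 7 of 8 columns match
print(passes_identity_cutoff("ACGTACGT", "ACGTTCGT"))  # below a 95% cut-off
```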
|
4.2 Discontiguous MEGABLAST is better at finding nucleotide sequences similar,
but not identical, to your nucleotide query. |
|
The BLAST nucleotide algorithm finds similar
sequences by breaking the query into short subsequences called
words. The program identifies the exact matches to the query words
first (word hits). The BLAST program then extends these word hits in
multiple steps to generate the final gapped alignments.
One of the important parameters governing the sensitivity of BLAST
searches is the length of the initial words, or word size as it is
called. The most important reason that blastn is more sensitive
than MEGABLAST is that it uses a shorter default word size. Because
of this, blastn is better than MEGABLAST at finding alignments to
related nucleotide sequences from other organisms. The word size is
adjustable in blastn and can be reduced from the default value of
11 to a minimum of 7 to increase search sensitivity.
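The effect of word size on the initial word hits can be illustrated with a toy example. The sketch below only covers the seeding step (exact word matches); it does not extend hits into alignments, and the function names and sequences are made up for the illustration.

```python
from collections import defaultdict


def index_words(sequence: str, word_size: int) -> dict:
    """Map every word of the given size to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return index


def find_word_hits(query: str, subject: str, word_size: int = 11) -> list:
    """Return (query position, subject position) pairs of exact word matches."""
    subject_index = index_words(subject, word_size)
    hits = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in subject_index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


# Two 16-base toy sequences that differ at a single central position.
query = "ACGTTGCAAGGCTTAC"
subject = "ACGTTGCATGGCTTAC"

print(find_word_hits(query, subject, word_size=11))  # no 11-base word survives the mismatch
print(find_word_hits(query, subject, word_size=7))   # shorter words still seed on both sides
```

With the longer word, the single mismatch removes every possible seed, so the similarity would never be examined further; with the shorter word, seeds remain on either side of the mismatch.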
The search sensitivity can be further improved by using the newly
introduced discontiguous megablast page. This page uses an
algorithm with the same name, which is similar to that reported by
Ma et al. Rather than requiring exact
word matches as seeds for alignment extension, discontiguous
megablast uses non-contiguous words within a longer window of
template. In coding mode, the third base wobbling is taken into
consideration by focusing on finding matches at the first and
second codon positions while ignoring the mismatches in the third
position. At the same word size, a discontiguous MEGABLAST search
is more sensitive and efficient than a standard blastn
search. For this reason, it is now the recommended tool for
this type of search. Alternative non-coding patterns can also be
specified if desired. Additional details on discontiguous megablast are
available at:
http://www.ncbi.nlm.nih.gov/blast/discontiguous.html
http://www.ncbi.nlm.nih.gov/Web/Newsltr/FallWinter02/blastlab.html
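The idea of a discontiguous word can be sketched as follows. The template used here ("110" repeated, so every third position is ignored) is a simplified stand-in chosen for the example and is not one of the actual templates used by discontiguous megablast; the function names and sequences are also hypothetical.

```python
from collections import defaultdict


def masked_word(sequence: str, start: int, template: str) -> str:
    """Keep only the bases under a '1' in the template window."""
    window = sequence[start:start + len(template)]
    return "".join(base for base, flag in zip(window, template) if flag == "1")


def find_template_hits(query: str, subject: str, template: str) -> list:
    """Return (query position, subject position) pairs whose masked words agree."""
    span = len(template)
    subject_index = defaultdict(list)
    for s_pos in range(len(subject) - span + 1):
        subject_index[masked_word(subject, s_pos, template)].append(s_pos)
    hits = []
    for q_pos in range(len(query) - span + 1):
        for s_pos in subject_index.get(masked_word(query, q_pos, template), []):
            hits.append((q_pos, s_pos))
    return hits


# Two toy coding sequences that encode the same peptide (Ala-Glu-Lys-Leu-Gly)
# but differ at every third codon position.
query = "GCTGAAAAACTGGGT"
subject = "GCCGAGAAGCTCGGC"

coding_template = "110110110110"   # illustrative: ignore each third base
contiguous_template = "1" * 8      # an ordinary word with the same number of matched bases

print(find_template_hits(query, subject, coding_template))
print(find_template_hits(query, subject, contiguous_template))
```

Both templates require eight matching bases, but only the discontiguous one finds seeds here, because the mismatches fall on the ignored third codon positions.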
It is important to point out that nucleotide-nucleotide searches
are not the best method for finding homologous protein coding
regions in other organisms. That task is better accomplished by
performing searches at the protein level, by direct protein-protein
BLAST searches or by translated BLAST searches. This is because of
the codon degeneracy, the greater information available in amino
acid sequence, and the more sophisticated algorithm in
protein-protein BLAST. |
|
4.3 "Search for short nearly exact matches" is
useful for primer or short nucleotide motif searches. |
|
Short sequences (less than 20 bases) will often
not find any significant matches to the database entries under the
standard nucleotide-nucleotide BLAST settings. The usual reasons
for this are that the significance threshold governed by the expect
value parameter is set too stringently and the default word size
parameter is set too high.
You can adjust both the word size and the expect value on the
standard BLAST pages to work with short sequences. However, we do
provide a BLAST page with these values preset to give optimum
results with short sequences. This page ("Search for short nearly
exact matches") is linked under the nucleotide BLAST section of the
main BLAST page.
Table 4.3.1 Parameter settings for standard blastn and "Search for short nearly exact matches" |
Program | Word Size | DUST Filter Setting | Expect Value |
Standard blastn | 11 | On | 10 |
Search for short nearly exact matches | 7 | Off | 1000 |
|
|
A common use of this page is to check the
specificity of PCR primers or hybridization probes. A useful way to check a pair
of PCR primers is to first concatenate them by inserting a string of
20 or more N's between the two primers, and then search the
concatenated pair as one sequence. Since BLAST looks for local
alignments and automatically searches both strands, there is no
need to reverse complement one of the primers before doing the
concatenation or the search.
The query sequence should contain no ambiguous bases. Consensus
motifs with degenerate
bases, such as KMKGSMGYYGSNNNNNNGCTYRGCWCSYTC or CNNGAANNTCCNNG will not work for this type of
search. |
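The primer concatenation described above is easy to script. The following is a minimal sketch; the function name and the two primer sequences are invented for the example, and the check simply rejects any primer containing characters other than A, C, G, or T.

```python
def concatenate_primers(forward: str, reverse: str, spacer_length: int = 20) -> str:
    """Join two primers with a run of N's so they can be searched as one query."""
    if spacer_length < 20:
        raise ValueError("use a spacer of 20 or more N's")
    for name, primer in (("forward", forward), ("reverse", reverse)):
        ambiguous = set(primer.upper()) - set("ACGT")
        if ambiguous:
            raise ValueError(f"{name} primer contains ambiguous bases: {sorted(ambiguous)}")
    # No reverse complementing is needed because BLAST searches both strands.
    return forward.upper() + "N" * spacer_length + reverse.upper()


# Hypothetical primer pair, for illustration only.
forward_primer = "ACGTACGTACGGATCCAT"
reverse_primer = "TTGGCCAATTCGATCGAA"

print(concatenate_primers(forward_primer, reverse_primer))
```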
|
4.4 Use the Trace Archive BLAST page to search raw
primary sequence trace files. |
|
Trace data files are not official entries of the
GenBank database and have no associated feature annotations.
Despite this limitation, they are still a rich source of
information, especially for organisms lacking a significant amount
of regular mRNA or assembled genomic sequences. The sequences come
from a variety of projects and sequencing strategies, including
Whole Genome Shotgun (WGS), BAC end sequencing, and EST sequencing.
The trace data are single-pass sequencing reads not trimmed for
quality or vector contamination. Their average lengths are between
500 and 700 bp.
A search against the Trace Archive can use MEGABLAST or
discontiguous MEGABLAST. The former is better for identifying
exact matches in intra-species searches, such as looking for extra
mRNA sequences or the genomic counterparts for a given gene, while
the latter is better for identifying similar coding sequences from
different species. Information on the Trace Archive is available
from the Trace documentation page. |
|
4.5 Standard protein BLAST is designed for
protein searches. |
|
Standard protein-protein BLAST (blastp) is used
for both identifying a query amino acid sequence and for finding
similar sequences in protein databases. Like other BLAST programs,
blastp is designed to find local regions of similarity. When
sequence similarity spans the whole sequence, blastp will also
report a global alignment, which is the preferred result for
protein identification purposes.
For a clearer result in an identification search, try turning off both the "low
complexity filter" and the "Composition based statistics" function.
Unlike nucleotide BLAST, there is no MEGABLAST counterpart for
protein searches, so batch searching via the web is not possible. |
|
4.6 PSI-BLAST is designed for more sensitive
protein-protein similarity searches. |
|
Position-Specific Iterated (PSI)-BLAST is the most
sensitive BLAST program, making it useful for finding very
distantly related proteins. Use PSI-BLAST when your standard
protein-protein BLAST search either failed to find significant
hits, or returned hits with descriptions such as "hypothetical
protein" or "similar to..."
The first round of PSI-BLAST is a standard protein-protein BLAST
search. The program builds a position-specific scoring matrix (PSSM
or profile) from an alignment of the sequences returned with Expect
values better (lower) than the inclusion threshold (default=0.005).
In the second iteration the PSSM becomes the query in the search.
Any new database hits below the inclusion threshold are included in
the construction of the new PSSM. A PSI-BLAST search is said to
have converged when no more new database sequences are added in
subsequent iterations. You can add database hits that fall outside
the inclusion threshold to your PSSM for the next round by checking
the box next to the hit.
You can also save a PSSM created during a PSI-BLAST search of one
database and use it to search a different database. To do this,
change "Alignment" to "PSSM" in a pull-down menu in the Format
section of a "Formatting BLAST" page (at any iteration after the
first). Then format the search, copy the resulting PSSM and paste
it into the PSSM window of a new PSI-BLAST search page. |
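The iteration and convergence rule described above can be sketched as a simple loop. This shows the control flow only: `search` is a hypothetical callable standing in for one PSI-BLAST round (it receives the identifiers already in the profile and returns Expect values), and `toy_search` is a made-up stand-in with invented identifiers and values.

```python
from typing import Callable, Dict, Set, Tuple


def iterate_until_converged(search: Callable[[Set[str]], Dict[str, float]],
                            inclusion_threshold: float = 0.005,
                            max_iterations: int = 10) -> Tuple[Set[str], int]:
    """Repeat searches until no new hit falls below the inclusion threshold."""
    included: Set[str] = set()
    for iteration in range(1, max_iterations + 1):
        hits = search(included)  # round 1 runs with an empty profile
        newly_included = {seq_id for seq_id, expect in hits.items()
                          if expect < inclusion_threshold} - included
        if not newly_included:
            return included, iteration  # converged: nothing new to add
        included |= newly_included
    return included, max_iterations


def toy_search(profile_members: Set[str]) -> Dict[str, float]:
    """Made-up search: seqC only scores well once seqB is in the profile."""
    hits = {"seqA": 1e-30, "seqB": 1e-8, "seqC": 0.5}
    if "seqB" in profile_members:
        hits["seqC"] = 1e-4
    return hits


members, rounds = iterate_until_converged(toy_search)
print(sorted(members), rounds)
```

In the toy run, the distant sequence is only picked up in the second round, and the third round adds nothing new, which is the convergence condition.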
|
4.7 PHI-BLAST can do a restricted protein pattern
search. |
|
Pattern-Hit Initiated (PHI)-BLAST is designed to
search for proteins that contain a pattern specified by the user,
AND are similar to the query sequence in the vicinity of the
pattern. This dual requirement is intended to reduce the number of
database hits that contain the pattern, but are likely to have no
true homology to the query.
To run PHI-BLAST, enter your query (which contains one or more
instances of the pattern) into the "Search" box, and enter your
pattern into the "PHI pattern" box in the "Options" section of the
page. Patterns must follow the syntax conventions of PROSITE. Only
one pattern can be used in a given search. The documentation on
pattern syntax is at:
http://www.ncbi.nlm.nih.gov/blast/html/PHIsyntax.html
Sample query sequence, with modified defline and highlighted
pattern occurrence, and a sample pattern in ProSite format are given
below:
>gi|4758958|ref|NP_004148.1| Human cAMP-dependent protein kinase
MSHIQIPPGLTELLQGYTVEVLRQQPPDLVEFAVEYFTRLREARAPASVLPAATPRQSLGHPPPEPGPDR
VADAKGDSESEEDEDLEVPVPSRFNRRVSVCAETYNPDEEEEDTDPRVIHPKTDEQRCRLQEACKDILLF
KNLDQEQLSQVLDAMFERIVKADEHVIDQGDDGDNFYVIERGTYDILVTKDNQTRSVGQYDNRGSFGELA
LMYNTPRAATIVATSEGSLWGLDRVTFRRIIVKNNAKKRKMFESFIESVPLLKSLEVSERMKIVDVIGEK
IYKDGERIITQGEKADSFYIIESGEVSILIRSRTKSNKDGGNQEVEIARCHKGQYFGELALVTNKPRAAS
AYAVGDVKCLVMDVQAFERLLGPCMDIMKRNISHYEEQLVKMFGSSVDLGNLGQ
[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]. |
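To see where the sample pattern occurs in the sample query, the pattern can be translated into a regular expression. The converter below is a simplified sketch that handles only the syntax elements used in this example plus the {..} exclusion form (it ignores anchors such as < and >), and the function names are made up for the illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("{", "[^").replace("}", "]")   # {AB} = anything but A or B
        element = element.replace("(", "{").replace(")", "}")    # (5,11) = repeat count
        parts.append(element)
    return "".join(parts)


def find_pattern(pattern: str, sequence: str) -> list:
    """Return the substrings of the sequence that match the pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    return [match.group() for match in regex.finditer(sequence)]


PATTERN = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]."
SEQUENCE = (
    "MSHIQIPPGLTELLQGYTVEVLRQQPPDLVEFAVEYFTRLREARAPASVLPAATPRQSLGHPPPEPGPDR"
    "VADAKGDSESEEDEDLEVPVPSRFNRRVSVCAETYNPDEEEEDTDPRVIHPKTDEQRCRLQEACKDILLF"
    "KNLDQEQLSQVLDAMFERIVKADEHVIDQGDDGDNFYVIERGTYDILVTKDNQTRSVGQYDNRGSFGELA"
    "LMYNTPRAATIVATSEGSLWGLDRVTFRRIIVKNNAKKRKMFESFIESVPLLKSLEVSERMKIVDVIGEK"
    "IYKDGERIITQGEKADSFYIIESGEVSILIRSRTKSNKDGGNQEVEIARCHKGQYFGELALVTNKPRAAS"
    "AYAVGDVKCLVMDVQAFERLLGPCMDIMKRNISHYEEQLVKMFGSSVDLGNLGQ"
)

print(prosite_to_regex(PATTERN))
print(find_pattern(PATTERN, SEQUENCE))
```

The sample query contains two instances of the pattern, which is consistent with the statement that a query may contain one or more occurrences.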
|
4.8 The protein "Search for short nearly exact
matches" is optimized to find matches to a short peptide. |
|
A short peptide (10-15mer or less) often will not
find any significant matches to the database under the standard
protein-protein BLAST settings. The usual reasons for this are that
the significance threshold governed by the expect value parameter
is set too stringently and the default word size parameter is set
too high.
You can adjust both the word size and the expect value on the
standard BLAST pages to make them work with short query sequences.
However, NCBI provides a separate BLAST page with these values preset to
optimize blastp searches with short query sequences. This page,
"Search for short nearly exact matches", is available via a link
under the Protein BLAST section of the BLAST home page. In
addition, this page uses the more stringent PAM30 matrix in lieu of
BLOSUM62 and turns off composition-based statistics.
Composition-based statistics takes the amino acid composition of
the query and subject sequences into account when calculating the
score and significance of the alignments. It can have a large
effect on searches using queries with a biased amino acid
composition. By definition, short peptides have a biased
composition and should not be used with composition-based
statistics.
Due to the requirement that the query needs to be at least twice
the word size, a query shorter than 5 residues is not recommended
even though it can be as short as 4 residues when the word size is
set to 2. In addition, since ambiguous residues break the query
sequence, there should be no ambiguities in the query to ensure
that the entire sequence can be used as seeds for the initial
search.
Table 4.8.1 Parameter settings for standard blastp and "Search for short nearly exact matches" |
Program | Word Size | SEG Filter | Expect Value | Composition based Statistics | Score Matrix |
Standard Protein Blast | 3 | On | 10 | On | BLOSUM62 |
Search for short nearly exact matches | 2 | Off | 20000 | Off | PAM30 |
|
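The length and ambiguity rules for short peptide queries can be checked before submitting a search. The sketch below is illustrative only: the function name and messages are made up, and treating every letter outside the 20 standard amino acids as ambiguous is an assumption of the example.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def check_short_peptide(query: str, word_size: int = 2) -> list:
    """Return a list of problems found in a short peptide query (empty if none)."""
    problems = []
    peptide = query.upper()
    if len(peptide) < 2 * word_size:
        problems.append(f"query must be at least {2 * word_size} residues "
                        f"for word size {word_size}")
    elif len(peptide) < 5:
        problems.append("queries shorter than 5 residues are not recommended")
    ambiguous = sorted(set(peptide) - STANDARD_RESIDUES)
    if ambiguous:
        problems.append("ambiguous residues break the query: " + ", ".join(ambiguous))
    return problems


print(check_short_peptide("FGELA"))  # no problems reported
print(check_short_peptide("FGEL"))   # long enough for word size 2, but not recommended
print(check_short_peptide("FGX"))    # too short and contains an ambiguous residue
```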
|
4.9 The "Nucleotide query - Protein db [blastx]"
is useful for finding similar proteins to those encoded by a
nucleotide query. |
|
Translated BLAST services are useful when trying
to find homologous proteins to a nucleotide coding region. Blastx
compares the translation of the nucleotide query sequence to a
protein database. Because blastx translates the query sequence in
all six reading frames and provides combined significance
statistics for hits to different frames, it is particularly useful
when the reading frame of the query sequence is unknown or it
contains errors that may lead to frame shifts or other coding
errors. Thus blastx is often the first analysis performed with a
newly determined nucleotide sequence and is used extensively in
analyzing EST sequences. |
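Six-frame translation, the step that blastx applies to the query, can be sketched in a few lines using the standard genetic code. This is a simplified illustration: the function names and frame labels are made up, codons containing anything other than A, C, G, or T are translated as X, and none of the scoring or statistics of blastx is reproduced.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def translate(sequence: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop codon."""
    codons = (sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(sequence: str) -> dict:
    """Translate both strands in all three reading frames."""
    sequence = sequence.upper()
    frames = {}
    for strand, strand_sequence in (("+", sequence), ("-", reverse_complement(sequence))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_sequence[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six translations is then compared to the protein database, which is why the reading frame of the query does not need to be known in advance.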
|
4.10 The "Protein query - Translated db
[tblastn]" search is useful for finding protein homologs in
unannotated nucleotide data. |
|
A tblastn search allows you to compare a protein
sequence to the six-frame translations of a nucleotide database. It
can be a very productive way of finding homologous protein coding
regions in unannotated nucleotide sequences such as expressed
sequence tags (ESTs) and draft genome records (HTG), located in the
BLAST databases est and htgs, respectively.
ESTs are short, single-read cDNA sequences. They comprise the
largest pool of sequence data for many organisms and contain
portions of transcripts from many uncharacterized genes. Since ESTs
have no annotated coding sequences, there are no corresponding
protein translations in the BLAST protein databases. Hence a
tblastn search is the only way to search for these potential coding
regions at the protein level. The HTG sequences, draft sequences
from various genome projects or large genomic clones, are another
large source of unannotated coding regions.
Like all translating searches, the tblastn search is especially
suited to working with error prone data like ESTs and draft genomic
sequences from HTG because it combines BLAST statistics for hits to
multiple reading frames and thus is robust to frame shifts
introduced by sequencing error. |
|
4.11 The "Nucleotide query - Translated db
[tblastx]" is useful for identifying novel genes in error prone
query sequences. |
|
tblastx takes a nucleotide query sequence,
translates it in all six frames, and compares those translations to
the database sequences dynamically translated in all six frames.
This effectively performs a more sensitive blastp search without
doing the manual translation.
tblastx gets around the potential frame shifts and ambiguities that
may prevent certain open reading frames from being detected. This
is very useful in identifying potential proteins encoded by single
pass read ESTs. In addition, it can be a good tool for identifying
novel genes.
This type of search is computationally intensive and searches with
large genomic queries are not recommended. The best way to do this
is to install standalone blast and perform the search locally. For
more information on standalone blast, please read the documents for
formatdb and standalone BLAST at:
ftp://ftp.ncbi.nih.gov/blast/documents/formatdb.txt
ftp://ftp.ncbi.nih.gov/blast/documents/netblast.txt |
|
4.12 The Conserved Domain Database (CDD) search
service uses RPS-BLAST to identify conserved protein domains. |
|
Reverse Position Specific BLAST (RPS-BLAST) is a
more sensitive way of identifying conserved domains in proteins
than standard BLAST searching. It compares a protein sequence
against a database of position specific scoring matrices (PSSMs).
The PSSMs used in CDD search capture the substitution frequencies
at each position in the multiple sequence alignments of recognized
conserved domains. The conserved domain alignments are from
NCBI's CDD, which contains alignments from protein domain
databases: Smart, Pfam, COG, KOG, and LOAD. For additional
information, refer to the CDD help document at: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml |
|
4.13 The Conserved Domain Architecture Retrieval
Tool (CDART) explores the domain architectures of proteins. |
|
CDART allows you to examine the domain structure
of all proteins in the default BLAST protein database. The CDART
tool first searches a query sequence for the presence of conserved
domains using RPS-BLAST. It then allows you to retrieve proteins
that share one or more protein domains in common with your query.
Because CDART relies on RPS-BLAST, these searches are more
sensitive than ordinary BLAST searches.
If the query does not contain any conserved domains, CDART will not
report any result. |
|
5. Explanation for Program Choices Given in Table
3.3 |
|
5.1 "BLAST 2 Sequences" is designed for direct
comparison of two sequences. |
This program takes two input sequences and
compares them directly. "Aligning Two Sequences" regards the second
sequence as the database. Unlike the other BLAST programs, there is
no need to format the database sequence in any special way. Since
translated BLAST programs are incorporated in this program, the
second sequence can be of a different type so long as an appropriate
BLAST program is selected. Appropriate query/program combinations are
listed in the table below.
Table 5.1.1 Appropriate Query/Program Combinations for "BLAST 2 Sequences" |
First Query | Second Query | Program to Use |
Nucleotide | Nucleotide | blastn, megablast, or tblastx |
Nucleotide | Protein | blastx |
Protein | Nucleotide | tblastn |
Protein | Protein | blastp |
If the database sequence or second query is present in an NCBI
database, using the GI/Accession instead of the FASTA sequence
allows the program to incorporate the translation and other
sequence features, found in that record, into the final alignment
making it more informative. |
|
5.2 The Human Genome BLAST page is for comparing a
query against NCBI's assembly of the human genome, plus its
derivative and related databases. |
|
Like other BLAST search pages in this Genomes
section, this page provides a centralized page to access
specialized databases. In this case, the databases are the current
NCBI human genome build and those derived from or related to it.
All flavors of BLAST, except tblastx, are available with MEGABLAST
set as default. Default filters are DUST and human repeats. The
BLAST output links directly to the Human Genome MapViewer, where
database hits can be visualized/analyzed in a genomic context, such
as their relationship to other map elements like Transcript, SNPs,
and Gene_seq. Names for the databases are being standardized; refer
to Table 5.3.1 for details on database content. |
|
5.3 Use the Mouse Genome BLAST page to search current
assemblies and other mouse sequences. |
|
The organization of this page is similar to that
of the Human Genome BLAST page. As on that page, tblastx is not
provided. MEGABLAST is the default algorithm and DUST plus rodent
repeats are the default filters. The default database "genome" is
analogous to the human genome database. Hits are linked to the Mouse
Genome MapViewer for visualization. The content of the databases
available for searching is given below. [Content derived from
http://www.ncbi.nlm.nih.gov/genome/seq/Database.html]
Table 5.3.1 Mouse
Genome BLAST Databases and Contents |
curated NT contigs |
Assemblies of Finished Mouse BAC clones. These are annotated
with SNPs and STSs and available in GenBank |
HTGS |
Mouse phase 0, phase 1, phase 2 or phase 3 sequence. These are
the original BAC sequences as submitted by the sequencing
centers. |
genome (default) |
This database represents the current public build of the
genome. The sequences in this database will have RefSeq accession
numbers of type NT_###### or NW_###### and they represent either
contigs from a clone based assembly or supercontigs from a whole
genome shotgun or composite assembly. The contigs in this database
are from both the reference assembly and any alternate assemblies
available for the genome. This database is generated at the time of
a genome release. |
HTGS |
This database is a collection of all sequences in GenBank that
have an HTG keyword. This allows users to search htgs_phase3
sequences (normally found in NR) and htgs_phase0, 1 and 2 sequences
(normally found in HTGS) at the same time. |
Ref RNA |
Collection of reference mRNAs generated by the NCBI RefSeq
project. This database is generated daily. |
Ref protein |
Collection of reference proteins generated by the NCBI RefSeq
project. This database is generated daily. |
Build RNA |
Collection of reference mRNAs generated by NCBI as part of the
genome annotation pipeline. This database is generated at the time
of a genome release. |
Build protein |
Collection of reference proteins generated by NCBI as part of
the genome annotation pipeline. This database is generated at the
time of a genome release. |
Ab Initio RNA |
Collection of ab initio RNA predictions generated by
NCBI as part of the genome annotation pipeline. This database is
generated at the time of a genome release. |
Ab Initio protein |
Collection of ab initio protein predictions generated by
NCBI as part of the genome annotation pipeline. This database is
generated at the time of a genome release. |
ESTs |
Single pass sequence reads from cDNA libraries. This database
is updated daily. |
BAC ends |
The end sequences of BAC clones. This database is generated
daily |
Traces-WGS |
All of the raw organism WGS traces. This database is updated as
needed. |
Traces-ESTs |
All of the raw organism EST traces. This database is updated as
needed. |
Traces-other |
All of the raw organism non-WGS and non-EST traces. This
database is updated as needed. |
WGS contigs |
If an organism was assembled using a whole genome shotgun (WGS)
strategy, this database is available (if the WGS assembly is in
GenBank). This database is updated as needed. Not available for
human. |
Gene Trap Clones (Mouse Only) |
A collection of sequences generated by performing Gene Trap
insertions. This database is updated weekly. |
|
|
5.4 Use the Rat BLAST page to search rat genome
assembly and other rat related sequence databases. |
|
This page provides access to BLAST databases
specific for rat. The databases available are similar to those for
mouse, with the exception that the contig accessions begin with NW_
and there is no "Gene Trap Clone" database. Hits are
linked to a visual display in the Rat Genome MapViewer. For database
information, refer to Table 5.3.1. |
|
5.5 The Microbial page provides centralized access
to complete and unfinished bacterial/archaeal genomes. |
|
This page provides access to many complete and
some unfinished bacterial/archaeal genomes. The available genomes
are listed on the page. The primary dataset is the genome(s), with
protein as the derivative dataset. Due to the lack of annotation,
the protein dataset may not be available for unfinished genomes.
One can choose to search against all the genomes or a selected
subset of them, and all flavors of BLAST programs are
available.
This is a very dynamic page since the number of available genomes
is increasing steadily, and the page is frequently updated to
reflect the changes. For BLAST hits to an unfinished genome, only
the matched region and its immediate flanking sequences are available
for downloading. |
|
5.6 The Other eukaryotes BLAST page provides access
to genomic sequences of other eukaryotic organisms. |
|
Genomic sequences for many other lower eukaryotes
are available from this page. The exact sequences available for
BLAST search vary depending on the stage of the sequencing
projects.
The databases on this page overlap with those on the Malaria, Fungi, Insects,
and Nematodes BLAST pages. The difference is that results from
those organism-specific pages are linked to the corresponding
MapViewer display, which provides more information
on genomic context. |
|
5.7 Use the Fugu genome BLAST page to search against
the draft Fugu rubripes (Puffer fish) genome. |
|
This page provides access to the draft genome and
the protein translation of Fugu rubripes (Japanese Puffer fish), an
assembly provided by the DOE Joint Genome Institute (http://www.jgi.doe.gov/). For details on the
databases and their release policy, please go to http://genome.jgi-psf.org/fugu6/fugu6.home.
Similar BLAST searches against this genome assembly can also be
run there (http://aluminum.jgi-psf.org/prod/bin/runBlast.pl?db=fugu6). |
|
5.8 Use the
Zebrafish Genome BLAST page to search against Zebrafish specific
sequences. |
Currently there are no finished genomic contigs
for this organism. The databases provided in this page are a subset
of sequences from this organism present in different NCBI sequence
databases.
Table 5.8.1
Content of Zebrafish Genome Blast Databases |
Database Name |
Content Description |
mRNAs |
All Zebrafish mRNAs in GenBank. |
ESTs |
Single pass sequence reads from numerous Zebrafish cDNA
libraries. |
HTGS |
Zebrafish phase 0, phase 1, phase 2 or phase 3 sequences. These
are the original BAC sequences as submitted by the sequencing
centers. |
WGS Traces |
All of the raw Zebrafish WGS and BAC Traces from Trace
archive. |
EST Traces |
All of the raw Zebrafish EST Traces from Trace archive. |
Reference mRNAs |
Zebrafish reference mRNAs generated by the NCBI RefSeq
project. |
Reference Proteins |
Zebrafish reference proteins generated by the NCBI RefSeq
project. |
|
|
5.9 Use the Plants genome BLAST pages to search
against green plant genomes. |
|
Currently, only nucleotide sequences are available
from this page for a limited number of green plants. For this
reason, only blastn and tblastn searches are available.
Arabidopsis thaliana is the only exception: protein data are also
available for it, and the hits identified are linked to MapViewer.
Table 5.9.1
Plants Genome BLAST Database Content |
Database Name |
Content |
Arabidopsis thaliana (mustard) |
Genome assembly, mRNAs, and Proteins |
Avena sativa (Oat) |
Currently available genomic clones, GSS, and STS entries |
Glycine max (soy bean) |
Currently available genomic clone, GSS, EST, and STS
entries |
Hordeum vulgare (Barley) |
Currently available genomic clones, ESTs, GSSs, and STSs |
Oryza sativa (Rice) |
Currently available genomic clone, EST, GSS, htg, and STS
entries |
Oryza sativa indica (Indian Rice) |
WGS contig assemblies, mRNAs, and Proteins |
Oryza sativa ssp. indica WGS contigs (not mapped) |
WGS contigs, not yet mapped |
Triticum aestivum (Wheat) |
Currently available genomic clone, EST, GSS, and STS
entries |
Zea mays (Corn) |
Currently available genomic clone and EST entries |
Lycopersicon esculentum (Tomato) |
Currently available genomic clone, GSS, EST, and STS
entries |
Mapped sequences from all listed plants |
All of the DNA sequences above. |
|
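The program restriction described for the Plants page follows from the molecule types each BLAST program works with. The sketch below is a minimal Python illustration, not NCBI's implementation: it lists the query and database types of the five classic programs and filters them by database type. The function name and the `offered` allow-list are hypothetical and exist only for this example. Note that tblastx also targets nucleotide databases in general; the guide simply states that this page offers blastn and tblastn.

```python
# Query and database molecule types for the five classic BLAST programs.
# The types describe the sequences as supplied, before any translation
# that the program performs internally.
PROGRAMS = {
    "blastn":  {"query": "nucleotide", "database": "nucleotide"},
    "blastp":  {"query": "protein",    "database": "protein"},
    "blastx":  {"query": "nucleotide", "database": "protein"},
    "tblastn": {"query": "protein",    "database": "nucleotide"},
    "tblastx": {"query": "nucleotide", "database": "nucleotide"},
}


def programs_for_database(db_type, offered=None):
    """Return the programs that can search a database of the given type.

    `offered` is a hypothetical per-page allow-list; pass it to model a
    page that exposes only some of the compatible programs.
    """
    compatible = [name for name, types in PROGRAMS.items()
                  if types["database"] == db_type]
    if offered is not None:
        compatible = [name for name in compatible if name in offered]
    return sorted(compatible)


# A nucleotide-only page restricted to the two programs named in the guide.
plants_page = programs_for_database("nucleotide", offered={"blastn", "tblastn"})
```

Calling `programs_for_database("protein")` shows why blastp and blastx need protein data, which on this page exists only for Arabidopsis thaliana.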
|
5.10 The Nematode BLAST page. |
|
From this page, one can access the
Caenorhabditis genome and the derivative databases. In
addition, a sequence database for Caenorhabditis briggsae is
available (genomic sequences only). |
|
5.11 Yeasts Genome BLAST page provides access to
multiple yeast genomes. |
|
This page provides access to different yeast
genomes and their protein translations. Sequences of other yeast
strains are available in addition to those for Saccharomyces
cerevisiae and Schizosaccharomyces pombe. The databases
for these two well-known organisms can be searched individually or
together. Hits are linked to MapViewer. All flavors of BLAST, with
the exception of tblastx, are available.
Table 5.11.1
Database list for Yeasts page |
Organism |
Sequence available |
Schizosaccharomyces pombe |
Genome, mRNAs, and Proteins |
Saccharomyces cerevisiae |
Genome, mRNAs, and Proteins |
Saccharomyces paradoxus NRRL Y-17217 |
Nucleotide only |
Saccharomyces mikatae IFO1815(MIT) |
Nucleotide only |
Saccharomyces mikatae IFO1815(WashU) |
Nucleotide only |
Saccharomyces bayanus MCYC623(MIT) |
Nucleotide only |
Saccharomyces bayanus MCYC623(WashU) |
Nucleotide only |
Saccharomyces castellii NRRL Y-12630 |
Nucleotide only |
Saccharomyces kluyveri NRRL Y-12651 |
Nucleotide only |
Saccharomyces kudriavzevii IFO 1802 |
Nucleotide only |
All YEAST Genomes |
Genomes, mRNAs, and Proteins |
Neurospora crassa |
Genome, mRNAs, and proteins |
Magnaporthe grisea |
Genome, mRNAs, and proteins |
Aspergillus nidulans |
Genome, mRNAs, and proteins |
All Species |
Genomes, mRNAs, and Proteins |
|
|
5.12 Use the Flies BLAST page to search the
Anopheles gambiae, Drosophila melanogaster, and
Apis mellifera genomes. |
|
This page provides access to the genome scaffold
of Anopheles gambiae (mosquito) and Drosophila melanogaster
(fruitfly) chromosomes. The proteins translated from the genome
annotation are also available. The data for Anopheles
gambiae are from a publicly funded NIAID project, with the sequencing
and assembly performed by Celera Corporation. The data for
Drosophila melanogaster come from FlyBase. Hits are linked to the
corresponding MapViewer pages, which provide additional information.
Apis mellifera sequences were added recently and do not have
MapViewer links yet. |
|
5.13 The VecScreen page is for identifying vector
sequence contamination in a query sequence. |
|
VecScreen, listed under the Special section, is a rapid
screening tool that checks the query sequence against a
non-redundant vector database, UniVec, which contains one copy of
every unique sequence segment from a large number of cloning
vectors. In addition, UniVec contains sequences for adapters,
linkers, stuffers, and primers that are commonly used in the
cloning and manipulation of cDNA or genomic DNA. Detailed
information on UniVec is at: http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html.
This page is generally used to screen for vector contamination in
sequences before their submission to GenBank. The color-coded
graphics on the result page make the result easy to
understand. |
|
6. Appendix |
|
6.1 Web MEGABLAST can accept batch queries. |
|
MEGABLAST is the only BLAST web service that can
accept multiple queries. There are two ways to enter batch queries
in MEGABLAST. If the query sequences are not present in the NCBI
Entrez system, those sequences need to be provided in FASTA format,
one after another with no blank lines in between sequences. The
FASTA definition line (or title) of each sequence should be on a
single line all by itself. If those sequences are already saved as
a text file in proper format, the file can be uploaded using the
"Browse" button. An example query file with multiple sequences is
given below.
>Sequence-1
GAGTGGCAGTTATATAGACCGGCGGCGGAGCACGCGTGTGTGCGGACGCAGTTGCGTGAGGGGTTTGTAC
TATCCTCGGTGCTGTGGTGCAGAGCTAGTTCCTCTCCAGCTCAGCCGCGTAGGTTTGGACATATTTGACT
CTTTTCCCCCCAGGTTGAATTGACCAAAGCAATGGTGATGGAGAAGCCTAGTCCCCTGCTGGTCGGGCGG
GAATTTGTGAGACAGTATTACACACTGCTGAACCAGGCCCCAGACATGCTGCATAGATTTTATGGAAAGA
ACTCTTCTTATGTCCATGGGGGATTGGATTCAAATGGAAAGCCAGCAGATGCAGTCTACGGACAGAAAGA
AATCCACAGGAAAGTGATGTCACAAAACTTCACCAACTGCCACACCAAGATTCGCCATGTTGATGCTCAT
GCCACGCTAAATGATGGTGTGGTAGTCCAGGTGATGGGGCTTCTCTCTAACAACAACCAGGCTTTGAGGA
GATTCATGCAAACGTTTGTCCTTGCTCCTGAGGGGTCTGTTGCAAATAAATTCTATGTTCACAATGATAT
CTTCAGATACCAAGATGAGGTCTTTGGTGGGTTTGTCACTGAGCCTCAGGAGGAGTCTGAAGAAGAAGTA
GAGGAACCTGAAGAAAGACAGCAAACACCTGAGGTGGTACCTGATGATTCTGGAACTTTCTATGATCAGG
CAGTTGTCAGTAATGACATGGAAGAACATTTAGAGGAGCCTGTTGCTGAACCAGAGCCTGATCCTGAACC
AGAACCAGAACAAGAACCTGTATCTGAAATCCAAGAGGAAAAGCCTGAGCCAGTATTAGAAGAAACTGCC
CCTGAGGATGCTCAGAAGAGTTCTTCTCCAGCACCTGCAGACATAGCTCAGACAGTACAGGAAGACTTGA
GGACATTTTCTTGGGCATCTGTGACCAGTAAGAATCTTCCACCCAGTGGAGCTGTTCCAGTTACTGGGAT
ACCACCTCATGTTGTTAAAGTACCAGCTTCACAGCCCCGTCCAGAGTCTAAGCCTGAATCTCAGATTCCA
CCACAAAGACCTCAGCGGGATCAAAGAGTGCGAGAACAACGAATAAATATTCCTCCCCAAAGGGGACCCA
>Sequence-2
TCTGCACTGGAGAGTCTGGAGCTGGGAAGACGGAAAACACCAAGAAGGTCATCCAGTACCTCGCCCACGT
GGCATCGTCTCCAAAGGGCAGGAAGGAGCCGGGTGTCCCCGGTGAGCTGGAGCGGCAGCTGCTTCAGGCC
AACCCCATCCTAGAGGCCTTTGGCAATGCCAAGACAGTGAAGAATGACAACTCCTCCCGATTCGGCAAAT
TCATCCGCATCAACTTTGATGTTGCCGGGTACATCGTGGGCGCCAACATTGAGACCTACCTGCTGGAGAA
GTCGCGGGCCATCCGCCAGGCCAAGGACGAGTGCAGCTTCCACATCTTCTACCAGCTGCTGGGGGGCGCT
GGAGAGCAGCTCAAAGCCGACCTCCTCCTCGAGCCCTGCTCCCACTACCGGTTCCTGACCAACGGGCCGT
CATCCTCTCCCGGCCAGGAGCGGGAACTCTTCCAGGAGACGCTGGAGTCGCTGCGGGTCCTGGGATTCAG
CCACGAGGAAATCATCTCCATGCTGCGGATGGTCTCAGCAGTTCTCCAGTTTGGCAACATTGCCTTGAAG
AGAGAACGGAACACCGATCAAGCCACCATGCCTGACAACACAGCTGCACAGAAGCTCTGCCGCCTCTTGG
GACTGGGGGTGACGGATTTCTCCCGAGCCTTGCTCACCCCTCGCATCAAAGTTGGCCGAGACTATGTGCA
GAAAGCCCAGACTAAGGAACAGGCTGACTTCGCGCTGGAGGCCCTGGCCAAGGCCACCTACGAGCGCCTC
If the query sequences are already present in an Entrez Nucleotide
database, their GI or Accession numbers can be pasted into the
search box, one identifier per line.
For example, the two groups of identifiers given in
the table are equivalent. As with sequences, a text file
containing those identifiers can be uploaded through the "Browse"
button. |
Accessions |
NCBI GIs |
U12345
F12564
BH023812 |
540023
708563
14647366 |
|
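The formatting rules for batch queries described above can be checked before a file is uploaded. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, and the checks cover just the rules given in this section (no blank lines between sequences, each sequence introduced by its own definition line, one identifier per line). It cannot detect a definition line that was wrapped onto a second line, since the continuation would look like sequence data.

```python
def check_batch_fasta(text):
    """Return a list of problems found in a multi-sequence FASTA query."""
    lines = text.strip("\n").split("\n")
    if lines == [""]:
        return ["query is empty"]
    problems = []
    if not lines[0].startswith(">"):
        problems.append("line 1: query does not start with a definition line")
    previous_was_defline = False
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {number}: blank line between sequences")
            continue
        if line.startswith(">"):
            if previous_was_defline:
                problems.append(
                    f"line {number}: previous definition line has no sequence")
            previous_was_defline = True
        else:
            previous_was_defline = False
    if previous_was_defline:
        problems.append("last definition line has no sequence")
    return problems


def parse_identifier_list(text):
    """Split an identifier list (one per line) and label each entry.

    An all-digit entry is treated as a GI number and anything else as an
    accession; this is a simplification made for the example.
    """
    entries = [line.strip() for line in text.splitlines() if line.strip()]
    return [("gi" if entry.isdigit() else "accession", entry)
            for entry in entries]
```

For example, `parse_identifier_list("U12345\n540023\n")` labels the first entry as an accession and the second as a GI number.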
For other means of batch BLAST searching, refer to "Other
alternative means for batch BLAST searches" (Section 6.3). |
|
6.2 Degenerate bases and ambiguity codes are
treated as mismatches by BLAST. |
|
Uncertainties in a nucleotide sequence can be
represented by a standard set of single-letter codes. These codes
are often used to represent degenerate bases in the third position
of codons, in degenerate oligo-nucleotide primers, or sequence
motifs. Although BLAST accepts ambiguities in the query, the
BLAST web pages have built-in functionality that screens query
sequences. Too many ambiguities in a nucleotide query can make
the BLAST page mistake the nucleotide query for a protein. This
prevents the search from going through and results in an error
message.
BLAST treats the ambiguities in an accepted nucleotide query as
mismatches in alignments. In short queries, these ambiguous bases
may break up the query in such a way that no valid word is available
for BLAST to index the query and identify initial word hits, thus
preventing BLAST from finding any matches in the database.
Table
6.2.1 Single Letter Nucleotide Code |
Code |
Meaning (Base) |
Code |
Meaning (Base) |
A |
adenosine (A) |
M |
amino (A or C) |
C |
cytidine (C) |
S |
strong (G or C) |
G |
guanine (G) |
W |
weak (A or T) |
T |
thymidine (T) |
B |
not A (G or T or C) |
U |
uridine (U) |
D |
not C (G or A or T) |
R |
purine (G or A) |
H |
not G (A or C or T) |
Y |
pyrimidine (T or C) |
V |
not T (G or C or A) |
K |
keto (G or T) |
N |
any base (A or G or C or T) |
- 1 |
gap(s) |
|
|
1 Dashes in the query
are not accepted; they are removed before the search is
submitted. To represent gaps, use a string of N's. |
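To see how ambiguity codes can leave a short query without a usable word, the sketch below encodes Table 6.2.1 as a Python dictionary and measures the longest stretch of unambiguous bases. It is an illustration, not BLAST's own screening logic. The function names are hypothetical, and the default `word_size` of 11 is an assumption based on the usual blastn default; it is not stated in this section.

```python
import re

# Bases that each single-letter code can stand for (Table 6.2.1).
NUCLEOTIDE_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def clean_query(seq):
    """Upper-case the query and drop dashes, as the web page is said to do."""
    return seq.upper().replace("-", "")


def ambiguity_fraction(seq):
    """Fraction of positions that are not a single, definite base.

    Characters missing from the table are counted as ambiguous too.
    """
    seq = clean_query(seq)
    if not seq:
        return 0.0
    ambiguous = sum(1 for base in seq
                    if len(NUCLEOTIDE_CODES.get(base, "")) != 1)
    return ambiguous / len(seq)


def longest_unambiguous_run(seq):
    """Length of the longest stretch made only of A, C, G, T or U."""
    runs = re.findall(r"[ACGTU]+", clean_query(seq))
    return max((len(run) for run in runs), default=0)


def has_valid_word(seq, word_size=11):
    """True if at least one unambiguous word of `word_size` bases exists."""
    return longest_unambiguous_run(seq) >= word_size
```

With these definitions, a 13-base primer such as `ACGTNACGTACGT` has a longest unambiguous run of 8 bases, so `has_valid_word` returns False at the assumed word size of 11.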
For those programs that use amino acid query sequences (BLASTP and
TBLASTN), the IUPAC based amino acid codes are given in the table
below.
Table 6.2.2
Single Letter Amino Acid Code |
Code |
Residue |
Code |
Residue |
A |
alanine |
P |
proline |
B |
aspartate or asparagine |
Q |
glutamine |
C |
cysteine |
R |
arginine |
D |
aspartate |
S |
serine |
E |
glutamate |
T |
threonine |
F |
phenylalanine |
U1 |
selenocysteine |
G |
glycine |
V |
valine |
H |
histidine |
W |
tryptophan |
I |
isoleucine |
Y |
tyrosine |
K |
lysine |
Z |
glutamate or glutamine |
L |
leucine |
X |
any residue |
M |
methionine |
* |
translation stop |
N |
asparagine |
-2 |
gap of indeterminate length |
1 BLAST cannot handle U
properly in protein alignments because it is not specified in the
scoring matrices used by blastp. To partially resolve this, U in
the query is replaced by an X before the search is performed.
2 Dashes in the query are removed before the
search is submitted. To represent a gap, use a string of X's
instead.
3 Blastp treats the ambiguous codes as mismatches in
the alignment. |
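The substitutions described in the footnotes above can be previewed locally before a protein query is submitted. This is a minimal Python sketch with hypothetical function names. It only mimics the two stated rules (U replaced by X, dashes removed) and reports where the ambiguous codes B, Z and X fall. It does not reproduce any other processing the BLAST page may perform.

```python
# Ambiguous single-letter amino acid codes from Table 6.2.2.
AMBIGUOUS_RESIDUES = {
    "B": "aspartate or asparagine",
    "Z": "glutamate or glutamine",
    "X": "any residue",
}


def prepare_protein_query(seq):
    """Apply the footnote rules: remove dashes, then replace U with X."""
    return seq.upper().replace("-", "").replace("U", "X")


def ambiguous_positions(seq):
    """Return (1-based position, code) pairs for ambiguous residues.

    Positions refer to the prepared query, that is, after dashes have been
    removed and U has been replaced by X.
    """
    prepared = prepare_protein_query(seq)
    return [(position, residue)
            for position, residue in enumerate(prepared, start=1)
            if residue in AMBIGUOUS_RESIDUES]
```

Because blastp treats these codes as mismatches, a query with many reported positions can be expected to align less well than its unambiguous counterpart.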
|
|
|
6.3 Other alternative means for batch BLAST
searches. |
|
Even though the BLAST home page does not offer batch
searches other than blastn via MEGABLAST, we do provide alternatives
for users who would like to batch their blastp or other types of
BLAST searches. The options and their pros and cons are summarized
in the table below.
Table 6.3
Alternative Means for Batch BLAST Searches |
Alternatives |
Pros |
Cons |
Links |
blastcl3 |
- No database maintenance
- Simple to set up
|
- Server/network fluctuation
- Relatively low throughput
- No graphic output
|
document
program |
URL-API |
- Versatility
- No database maintenance
|
- Custom scripts needed
- Load restrictions
|
document |
Standalone BLAST |
- No server fluctuation
- Custom databases
- High throughput
|
- Needs database update
- No graphic output
|
document
program |
|
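The URL-API row above notes that custom scripts are needed. As a rough illustration of what such a script prepares, the Python sketch below splits a multi-sequence FASTA file into single records and builds the form-encoded parameters for a submit request and a retrieve request. Several things here are assumptions that do not come from this guide: the endpoint URL, the parameter names (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE) as recalled from the URL-API document, and all function names. The sketch sends nothing over the network; consult the linked document before using it against the live service.

```python
from urllib.parse import urlencode

# Assumed endpoint; check the URL-API document for the current address.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def split_fasta(text):
    """Split multi-sequence FASTA text into one string per record."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line.strip())
    if current:
        records.append("\n".join(current))
    return records


def put_request_body(query, program="blastp", database="nr"):
    """Form-encoded body for submitting one query (assumed parameters)."""
    return urlencode({
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
    })


def get_request_query(rid, format_type="Text"):
    """Query string for fetching results by request ID (assumed parameters)."""
    return urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type})


# One request body per record, ready to be POSTed to BLAST_URL by the caller.
bodies = [put_request_body(record)
          for record in split_fasta(">seq1\nMKT\n>seq2\nGAV\n")]
```

Submitting the records one at a time, with pauses between requests, is one way to stay within the load restrictions listed in the table; the appropriate pacing is defined by the service, not by this sketch.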
|
|