What is a Conserved Domain?
Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications often coincide: what is found as an independently folding unit of a polypeptide chain frequently also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. The figure illustrates four "domains" identified as structural units in the MMDB entry 1IGR, chain A, as segments colored in magenta, blue, brown, and green.
[Figure: MMDB entry 1IGR, chain A, with four structural domains colored magenta, blue, brown, and green; the conserved domains found by CD-Search are indicated at the bottom.]
For this query sequence, the CD-Search service would identify conserved domains as indicated at the bottom of the image. Good correspondence exists between structural units, identified by purely geometric criteria, and units asserted to be evolutionarily conserved. The region annotated as "Furin-like" was split in two by the MMDB domain parser.
Molecular evolution may have utilized such domains as building blocks, recombined in different arrangements to modulate protein function. We define conserved domains as recurring units in molecular evolution, whose extents can be determined by sequence and structure analysis. Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences. The distinction between domains and motifs is not sharp, however, especially in the case of short repetitive units. Functional motifs are also present outside the scope of structurally conserved domains. The CD database is not meant to systematically collect such motifs.
What are the source databases?
Conserved domains can be summarized with multiple local sequence alignments. Computational biologists have compiled collections of such alignments representing conserved domains, and we import them from three major sources:
COG (Clusters of Orthologous Groups) is an NCBI-curated protein classification resource. Sequence alignments corresponding to COGs are created automatically from constituent sequences and have not been manually validated for import into CDD.
Colleagues at NCBI have contributed a small number of curated alignments. This collection is labeled "LOAD", which stands for "Library Of Ancient Domains" [Aravind L., Sreekumar K., and Koonin E.V.].
CDD now also contains alignment models curated in the context of the CDD project. We have re-evaluated and modified multiple sequence alignments imported from outside sources and made them agree with what we know from three-dimensional structure and three-dimensional structure superposition. Curated alignments contain aligned blocks spanning all rows (with no gaps allowed inside blocks) and unaligned regions between blocks. The blocks are meant to represent conserved structural core motifs of the corresponding domain family. In addition to working on the alignment and alignment model, we attempt to annotate features on aligned columns. Features are recorded together with their evidence, and both can be visualized on CD summary pages generated by the CD-Server and with Cn3D.
Source databases are evident from CD accessions:
What are the CD processing steps?
Source alignments are processed to provide links from each sequence to the protein division of Entrez. Occasionally, sequences that cannot be identified in Entrez's databases are omitted or replaced by closely related matches. Whenever possible, sequences in the alignment are replaced by closely related sequences that have direct links to three-dimensional structures. A representative sequence is chosen, preferably with a structure link, for technical reasons. By default, this representative is chosen as the 3D structure shown when CD alignments are visualized with Cn3D, but the CD-Server allows for the selection of other structure representatives when appropriate.
A consensus sequence is computed from the imported alignments. The extent of the consensus is determined by the alignment columns that are represented in at least 50% (weighted) of all aligned sequences. The most frequently occurring residue in each column (after weighting) is reported. A position-specific scoring matrix (PSSM) is calculated over the extent of the consensus sequence; the consensus sequence itself does not contribute to the residue frequency statistics. Search databases compiled from these PSSMs are available through the CD-Search service.
A curated set of CDs, in which 3D-structure information is used explicitly to define domain boundaries and aligned blocks and to amend alignment details, has been added to the collection.
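The consensus step described above can be illustrated with a minimal Python sketch. It assumes that the alignment is given as equal-length strings with "-" for gaps and that one weight per sequence is supplied by the caller; how CDD actually derives sequence weights is not stated here, and the function name and tie-breaking rule are illustrative only.

```python
from collections import defaultdict


def weighted_consensus(rows, weights, min_coverage=0.5, gap="-"):
    """Return a consensus over columns covered by >= min_coverage of the weight.

    rows    : aligned sequences of equal length, gaps written as `gap`
    weights : one non-negative weight per row (assumed to be given)
    """
    if not rows or len(rows) != len(weights):
        raise ValueError("need one weight per aligned row")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")

    total = float(sum(weights))
    consensus = []
    for col in range(length):
        # Sum the weights of each residue observed in this column.
        counts = defaultdict(float)
        for row, weight in zip(rows, weights):
            residue = row[col]
            if residue != gap:
                counts[residue] += weight
        covered = sum(counts.values())
        # Columns below the coverage threshold are left out of the consensus.
        if covered / total >= min_coverage:
            # Most frequent residue; ties are broken alphabetically (assumption).
            consensus.append(max(sorted(counts), key=counts.get))
    return "".join(consensus)


rows = ["MKV-L", "MRV-L", "MKVAL", "-KI-L"]
print(weighted_consensus(rows, [1, 1, 1, 1]))  # MKVL
```

With equal weights, the fourth column is occupied in only one of four rows and is dropped; giving the third row a larger weight would bring that column back into the consensus.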
How and when is CDD updated?
The source databases are updated several times a year. We try to follow these updates and adjust the CD database with no more than a few months' delay. Updates to the non-curated CD database will be incremental, mirroring the availability of new domain family alignments, the removal of families, and changes in seed alignments.
How to find "Conserved Domains"
Most users will explore CDs starting from CD-Search results for a protein of interest. The hit list formatted by CD-Search contains links to the respective CD summary pages. CD-Search results have been pre-calculated for proteins in Entrez and are available as "Domains" links.
CDD is also its own Entrez database; the Entrez query interface allows searching for keywords, publication dates, and taxonomic span, to list a few options. Conserved domains in CDD have also been neighbored to each other; the corresponding links to "related" domains are available in the Entrez system. Two domains are called related if their search models identify several of the same or overlapping sequence intervals in Entrez/Protein with significant scores.
As a third option, a simple search interface can be used to find CDs by keyword or accession. Search for strings like "Kinase", "pfam023", or "Tetratrico" to see how it works. Use a single keyword only; wildcards and logical operators are not permitted at the moment, but partial strings are matched.
Following the links to CD summary pages allows alignment display in several formats.
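The "related domains" criterion can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used by CDD: it assumes that each search model comes with a list of significant hits given as (protein identifier, start, end) tuples, and the `min_shared` parameter is a hypothetical stand-in for "several".

```python
def intervals_overlap(hit_a, hit_b):
    """True if two (protein_id, start, end) hits share at least one residue."""
    same_protein = hit_a[0] == hit_b[0]
    return same_protein and hit_a[1] <= hit_b[2] and hit_b[1] <= hit_a[2]


def shared_hits(hits_a, hits_b):
    """Count hits of model A that overlap some hit of model B."""
    return sum(
        1 for a in hits_a if any(intervals_overlap(a, b) for b in hits_b)
    )


def are_related(hits_a, hits_b, min_shared=2):
    """Call two models related if they share at least min_shared intervals."""
    return shared_hits(hits_a, hits_b) >= min_shared


# Made-up hit lists for two search models.
model_a_hits = [("P1", 10, 250), ("P2", 5, 240), ("P3", 100, 350)]
model_b_hits = [("P1", 200, 400), ("P2", 300, 500), ("P3", 90, 340)]
print(are_related(model_a_hits, model_b_hits))  # True
```

Here the two models overlap on proteins P1 and P3 but not on P2, so they count as related at a threshold of two shared intervals and as unrelated at a threshold of three.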
CD Alignment visualization
A CD summary page lists the main CD features and provides links to the source database and literature references, if available.
Hot-Links and information from the CD-summary page
Alignment Displays
What is RPS-BLAST?
RPS-BLAST stands for "Reverse Position-Specific BLAST". It is a variant of the popular PSI-BLAST program ("Position-Specific Iterated BLAST"). PSI-BLAST finds sequences significantly similar to the query in a database search and uses the resulting alignments to build a position-specific scoring matrix (PSSM) for the query. With this PSSM, the database is scanned again to pull in additional significant hits and further refine the scoring model.
RPS-BLAST uses the query sequence to search a database of pre-calculated PSSMs and reports significant hits in a single pass. The role of the PSSM has changed from "query" to "subject", hence the term "reverse" in RPS-BLAST. RPS-BLAST is the search tool used in the CD-Search service. The CD-Search service provides a web interface to the RPS-BLAST program, the CD search databases, and interactive alignment visualization including 3D structures. A standalone version of the RPS-BLAST program is available as part of the NCBI toolkit distribution.
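The role reversal can be made concrete with a toy Python sketch in which a query sequence is scanned, in a single pass, against a small library of pre-calculated PSSMs. This is not the RPS-BLAST algorithm: it uses ungapped window scoring with a raw score cutoff instead of gapped alignment and E-values, and the PSSM representation, default score, and library contents are all made up for illustration.

```python
def best_ungapped_score(query, pssm, default=-1):
    """Slide the PSSM along the query; return (best score, offset) or None.

    pssm is a list of columns, each a dict mapping residue -> score.
    Residues missing from a column receive the `default` score.
    """
    width = len(pssm)
    best = None
    for offset in range(len(query) - width + 1):
        score = sum(
            column.get(query[offset + i], default)
            for i, column in enumerate(pssm)
        )
        if best is None or score > best[0]:
            best = (score, offset)
    return best  # None when the query is shorter than the PSSM


def search_pssm_library(query, library, min_score):
    """Single pass over all PSSMs; the query is scored against each model."""
    hits = []
    for name, pssm in library.items():
        result = best_ungapped_score(query, pssm)
        if result is not None and result[0] >= min_score:
            hits.append((name, result[0], result[1]))
    # Highest-scoring models first.
    return sorted(hits, key=lambda hit: -hit[1])


# Hypothetical two-model library.
library = {
    "toy_domain_1": [{"G": 5}, {"K": 4, "R": 3}, {"S": 4, "T": 4}],
    "toy_domain_2": [{"W": 6}, {"W": 6}],
}
print(search_pssm_library("MAGKTLL", library, min_score=10))
# [('toy_domain_1', 13, 2)]
```

The PSSMs play the part of the database ("subject") and the plain sequence is the query, which is the reverse of the second and later PSI-BLAST iterations.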
Which Search Databases are available?
Currently, CD-Search is offered with the following search databases:
Can I run RPS-BLAST locally?
A standalone version of RPS-BLAST is available as part of the NCBI toolkit distribution (see ftp://ftp.ncbi.nih.gov/toolbox). Information on how to create your own RPS-BLAST search databases is also available.
Position-specific scoring matrices corresponding to the CD-Search databases are available via FTP (see the README file for instructions).
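As a rough sketch of a local run, the Python snippet below assembles and launches a search from a script. It assumes the `rpsblast` executable from the later BLAST+ command-line suite (options `-query`, `-db`, `-evalue`, `-outfmt`, `-out`), which is not the toolkit binary described in this document; the file and database names in the example are hypothetical.

```python
import shutil
import subprocess


def build_rpsblast_command(query_fasta, database, evalue=0.01, out_path=None):
    """Assemble a BLAST+ style rpsblast command line (tabular output)."""
    command = [
        "rpsblast",
        "-query", query_fasta,    # protein query in FASTA format
        "-db", database,          # local RPS-BLAST search database
        "-evalue", str(evalue),   # E-value cutoff (assumed value)
        "-outfmt", "6",           # tabular output
    ]
    if out_path is not None:
        command += ["-out", out_path]
    return command


def run_rpsblast(query_fasta, database, evalue=0.01):
    """Run the search and return the tabular report as text."""
    if shutil.which("rpsblast") is None:
        raise RuntimeError("rpsblast executable not found on PATH")
    command = build_rpsblast_command(query_fasta, database, evalue)
    completed = subprocess.run(
        command, check=True, capture_output=True, text=True
    )
    return completed.stdout


# Hypothetical file and database names; nothing is executed here.
print(" ".join(build_rpsblast_command("query.fasta", "my_cdd_db")))
```

Separating command construction from execution keeps the sketch easy to adapt if the local installation uses different option names.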
What input is required?
To submit a query sequence to CD-Search, you only need to provide the sequence, either as raw sequence data, in FASTA format, or as a GI or accession (valid in the NCBI Entrez system). Pressing the submit button starts CD-Search with default settings for search sensitivity and display options. Note that CD-Search only works for protein sequences.
Advanced search options:
Note that these are only available when using the actual CD-Search form. Searches launched from the CDD home page or together with protein BLAST requests use default search parameters.
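The three accepted input forms can be told apart with a simple heuristic, sketched below in Python. This is an illustration only and not how the CD-Search server parses its input: it assumes that a single token containing a digit is a GI or accession and that letters-only text is a raw sequence, and the example strings are made up.

```python
def classify_query(text):
    """Guess whether a query is FASTA, an identifier, or raw sequence data."""
    stripped = text.strip()
    if not stripped:
        raise ValueError("empty query")
    # FASTA records start with a '>' definition line.
    if stripped.startswith(">"):
        return "fasta"
    # Heuristic: one token with at least one digit looks like a GI/accession.
    if len(stripped.split()) == 1 and any(ch.isdigit() for ch in stripped):
        return "identifier"
    # Otherwise expect letters only, possibly broken over several lines.
    residues = "".join(stripped.split())
    if residues.isalpha():
        return "raw"
    raise ValueError("unrecognized query format")


print(classify_query(">example protein\nMKVLAAGILLPW"))  # fasta
print(classify_query("12345"))                           # identifier
print(classify_query("MKVLAAGI\nLLPW"))                  # raw
```

The sketch does not check that a raw sequence is a protein rather than a nucleotide sequence; a caller would need a separate test for that.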
Output formatting options:
How long do I have to wait for the results?
CD-Search requests are submitted to the BLAST servers immediately. A typical search should take only a few seconds, depending on the size of the search database chosen, the length of the query sequence, and the load on the servers.
CD-Search requests can also be sent to the BLAST queuing system by using the optional button at the bottom of the CD-Search page (this happens by default for searches launched in parallel with protein BLAST requests). Requests sent to the queue will take longer, but the results can be retrieved at a later time using the RID ("Request ID") without having to repeat the search. A form at the bottom of the CD-Search page can be used to retrieve earlier search results by RID.
When do search requests end up in the BLAST-Queue?
When CD-Search is run as an integral part of a protein BLAST search request, the job is put in the BLAST queue and may take a little longer to complete (depending on the system load and the length of the query sequence). A queued CD-Search will try to retrieve the finished results every few seconds until they are available. You may also store the request ID (RID) and retrieve the results later.
What are the elements on the results page?
The CD-Search server summarizes results on a single page in various formats. These are listed below in detail. At the top of each results page, summary information concerning the query sequence and the search database is printed.
Occasionally, domain cartoons have jagged edges. This means that the alignment found by RPS-BLAST omitted more than 20% of the CD's extent at the N- or C-terminus (or both, as indicated by the cartoons). This feature may hint at truncated query sequences, false-positive hits, or unusual domain architectures involving long insertions. The exact percentage of the CD's extent used in the alignment is listed in the pairwise alignment section.
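The jagged-edge rule can be expressed as a short calculation, sketched here in Python. It assumes that the 20% threshold applies to each terminus separately and that the aligned region is given as 1-based, inclusive positions on the CD; the function and key names are illustrative only.

```python
def truncation_flags(cd_length, aln_start, aln_end, threshold=0.20):
    """Report which ends of a CD are missing from an alignment.

    aln_start and aln_end are 1-based, inclusive positions on the CD.
    """
    if not (1 <= aln_start <= aln_end <= cd_length):
        raise ValueError("alignment must lie within the CD")
    missing_n = (aln_start - 1) / cd_length        # fraction lost at N-terminus
    missing_c = (cd_length - aln_end) / cd_length  # fraction lost at C-terminus
    return {
        "n_jagged": missing_n > threshold,
        "c_jagged": missing_c > threshold,
        "percent_aligned": 100.0 * (aln_end - aln_start + 1) / cd_length,
    }


# A hit covering positions 51-200 of a 200-residue CD.
print(truncation_flags(200, 51, 200))
# {'n_jagged': True, 'c_jagged': False, 'percent_aligned': 75.0}
```

In this example a quarter of the CD is missing at the N-terminus, so only the left edge of the cartoon would be drawn jagged.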
How do I look at multiple alignments?
When you click on the cartoon in the graphical display on the CD-Search results page, an alignment view will be opened, which adds the query sequence to the multiple CD alignment. It is possible to limit the number of CD sequences displayed in this view and to determine which of the CD's sequences are displayed:
Alignment visualization including 3D-structures
"Redisplay Alignment" from an alignment view that includes the query sequence will launch Cn3D by default if a structure is included in the CD. The structure viewing options discussed above are available. Note that Cn3D offers column-specific coloring by sequence conservation when invoked with multiple alignment views. This is a convenient way to study sequence conservation within a CD alignment and to find out how well the aligned query fits the existing patterns of conservation and variability.
What does the pink dot mean?
A pink dot next to a CD identifier indicates that the CD has links to 3D structure. Alignment visualization for hits to this CD may utilize Cn3D, which must be installed as a helper application. The pink dots in the leftmost column of the CD-Search hit list are linked to provide immediate visualization: if Cn3D is installed properly, a 3D view will be presented showing the alignment of the query with a subset of the most similar sequences found in this CD (and the representative with 3D structure, irrespective of its similarity to the query).
Updated 05/15/04