CDD - Conserved Domain Database Help

What is a Conserved Domain?

Domains can be thought of as distinct functional and/or structural units of a protein. These two classifications often coincide: a unit found to fold independently within a polypeptide chain frequently also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. The image below illustrates 4 "domains" identified as structural units in the MMDB-entry 1IGR, chain A, as segments colored in magenta, blue, brown, and green. (Click on the figure to launch this view in Cn3D):


For this query sequence, the CD-Search service would identify conserved domains as indicated at the bottom of the image. Good correspondence exists between structural units, identified by purely geometric criteria, and units asserted to be evolutionarily conserved. The region annotated as "Furin-like" was split in two by the MMDB domain parser. (Click here to redo the CD-Search for this example.)

Molecular evolution may have utilized such domains as building blocks, recombined in different arrangements to modulate protein function. We define conserved domains as recurring units in molecular evolution, whose extents can be determined by sequence and structure analysis.

Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences. The distinction between domains and motifs is not sharp, however, especially in the case of short repetitive units. Functional motifs are also present outside the scope of structurally conserved domains. The CD database is not meant to systematically collect such motifs.

What are the source databases?

Conserved Domains can be summarized with multiple local sequence alignments. Computational biologists have compiled collections of such alignments representing conserved domains, and we import them from three major sources:
  • Smart and Pfam are public domain databases, which are offered in combination with HMM-based search engines and alignment visualization services.
  • COG (Clusters of Orthologous Groups) is an NCBI-curated protein classification resource. Sequence alignments corresponding to COGs are created automatically from constituent sequences and have not been validated manually for import in CDD.
  • Colleagues at NCBI have contributed a small number of curated alignments. This collection is labeled "LOAD", which stands for "Library Of Ancient Domains" [Aravind L., Sreekumar K., and Koonin E.V.].

CDD now also contains alignment models curated in the context of the CDD project. We have re-evaluated and modified multiple sequence alignments imported from outside sources, and made them agree with what we know from three-dimensional structure and three-dimensional structure superposition. Curated alignments contain aligned blocks spanning all rows (with no gaps allowed inside blocks) and unaligned regions between blocks. The blocks are meant to represent conserved structural core motifs of the corresponding domain family. In addition to working on the alignment and alignment model, we attempt to put feature annotation on aligned columns. Features are recorded together with evidence and both can be visualized on CD-Server generated CD summary pages and with Cn3D.

Source databases are evident from CD accessions:
Accession starts with    Source Database
pfam                     Pfam
smart                    Smart
COG                      COGs
LOAD_                    LOAD set
cd                       Curated at NCBI
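
The prefix listing above can be applied mechanically. The following is a minimal sketch, not part of any CDD software: the function name `source_database` and the example accessions are illustrative assumptions, and only the prefix-to-source mapping comes from the listing.

```python
# Prefix-to-source mapping copied from the accession listing above.
SOURCE_BY_PREFIX = {
    "pfam": "Pfam",
    "smart": "Smart",
    "COG": "COGs",
    "LOAD_": "LOAD set",
    "cd": "Curated at NCBI",
}


def source_database(accession: str) -> str:
    """Return the source database implied by the prefix of a CD accession.

    Hypothetical helper for illustration; matching is case-sensitive,
    which is an assumption about how accessions are written.
    """
    # Try longer prefixes first so a short prefix never shadows a longer one.
    for prefix in sorted(SOURCE_BY_PREFIX, key=len, reverse=True):
        if accession.startswith(prefix):
            return SOURCE_BY_PREFIX[prefix]
    return "unknown"


if __name__ == "__main__":
    # Example accessions are for illustration only.
    for acc in ("pfam00069", "smart00220", "COG0515", "cd00180"):
        print(acc, "->", source_database(acc))
```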

What are the CD processing steps?

Source alignments are processed to provide links from each sequence to the protein division of Entrez. Occasionally, sequences that cannot be identified in Entrez's databases are omitted or replaced by closely related matches. Whenever possible, sequences in the alignment are replaced by closely related sequences that have direct links to three-dimensional structures. A representative sequence is chosen, preferably with a structure-link, for technical reasons. By default, this representative is chosen as the 3D structure shown when CD alignments are visualized with Cn3D, but the CD-server allows for the selection of other structure representatives when appropriate.
A consensus sequence is computed from the imported alignments. Alignment columns have to be represented in at least 50% (after weighting) of all aligned sequences to determine the extent of the consensus. The most frequently occurring residue in each column (after weighting) is reported. A position-specific scoring matrix (PSSM) is calculated over the extent of the consensus sequence; the consensus sequence itself does not contribute to the residue frequency statistics. Search databases compiled from these PSSMs are available through the CD-Search service.
A curated set of CDs, in which 3D-structure information is used explicitly to define domain boundaries and aligned blocks and to amend alignment details, has been added to the collection.
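
The consensus step described above can be illustrated with a short sketch. It is an assumption-laden simplification rather than the CDD implementation: the per-sequence weights are taken as given (the weighting scheme is not specified here), gaps are written as "-", and every column that passes the 50% coverage test is kept. The function name `consensus` is hypothetical.

```python
from collections import defaultdict


def consensus(rows, weights, min_coverage=0.5, gap="-"):
    """Weighted consensus of an alignment given as equal-length strings.

    A column is kept only if the weighted fraction of rows with a residue
    (not a gap) reaches min_coverage; the residue with the largest summed
    weight is reported for each kept column.
    """
    if len(rows) != len(weights) or len({len(r) for r in rows}) != 1:
        raise ValueError("need one weight per row and rows of equal length")
    total = sum(weights)
    out = []
    for col in range(len(rows[0])):
        counts = defaultdict(float)
        for row, weight in zip(rows, weights):
            if row[col] != gap:
                counts[row[col]] += weight
        covered = sum(counts.values())
        if covered / total >= min_coverage:
            out.append(max(counts, key=counts.get))
    return "".join(out)


if __name__ == "__main__":
    alignment = ["ACD-", "ACE-", "A-EK"]
    print(consensus(alignment, [1.0, 1.0, 1.0]))
    # Up-weighting the third row changes which columns pass the coverage test.
    print(consensus(alignment, [1.0, 1.0, 4.0]))
```

Changing the weights can change both the reported residues and which columns are included, which is why redundancy weighting matters for the extent of the consensus.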

How and when is CDD updated?

The source databases are updated several times a year. We try to follow these updates and adjust the CD database with no more than a few months' delay. Updates to the non-curated CD database will be incremental, mirroring the availability of new domain family alignments, the removal of families, and changes in seed alignments.

How to find "Conserved Domains"

Most users will explore CDs starting from CD-Search results for a protein of interest. The hit list formatted by CD-Search will contain active hotlinks to respective CD summary pages. CD-Search results have been pre-calculated for proteins in Entrez and are available as "Domains" links.
CDD also is its own Entrez database; the Entrez query interface allows searching for keywords, publication dates, and taxonomic span, to list a few options. Conserved Domains in CDD have also been neighbored to each other; the corresponding links to "related" domains are available in the Entrez system. Two domains are called related if their search models identify several of the same or overlapping sequence intervals in Entrez/Protein with significant scores.
As a third option, a simple search interface can be used to find CDs by keyword or accession. Search for strings like "Kinase" or "pfam023" or "Tetratrico" to see how it works. Use a single keyword only; wildcards and logical operators are not permitted at the moment, but partial strings are matched.

Following the hotlinks to CD summary pages allows alignment display in several formats.

CD Alignment visualization

A CD-summary page will list the main CD features and provide hotlinks to the source database and literature references, if available. Click here for an example.

Hot-Links and information from the CD-summary page

  • CD: CD accession followed by short name
  • PSSM-Id: unique numerical identifier of the search model
  • Source: if active, the link jumps to the corresponding entry in the source database (works for Smart-, Pfam-, and COG-derived CDs).
  • Description: short description of the family represented by the CD
  • Taxa: links from taxonomic terms bring up the NCBI's taxonomy browser at the corresponding nodes.
  • References: links to literature references in PubMed. These citations have been collected by the source databases.
  • Related: CDs considered related according to CDART clustering
  • Status: information about the CD's curation status. Curated models have been realigned by NCBI with consideration of 3D structure. Alignments imported from outside sources have not been changed (except for the import process detailed above).
  • Created: date at which the seed (or de-novo) alignment was imported into CDD.
  • Updated: date of the most recent changes to the alignment model and/or descriptive information.
  • Aligned: lists the number of rows in the sequence alignment (including consensus)
  • PSSM: length of the search model
  • Representative: from the aligned CD sequences, a representative has been chosen for technical reasons. This is a structure-derived sequence, if possible. The extent of the search model, however, is reflected in a consensus representative sequence, which is used to represent the alignment model, but is ignored when calculating the PSSM. Amino acid residues in the consensus represent the most frequent residue type seen in a column of the alignment model (after weighting to account for redundancy).
  • Proteins: this link will point to a CDART summary of proteins identified to contain this domain by CD-Search.

Alignment Displays

  • How?
    By default, a CD-summary page will show an abbreviated version of the alignment underlying the search model. Click here for an example. Alignments can be displayed in several formats. If a representative structure is available, the "View 3D Structure"- button launches the Cn3D structure and alignment viewer, which must be installed as a helper application. Click here to test your setup. In Cn3D, the user has interactive control over alignment formatting and coloring options.
    Using pull-down menus two structure display options can be set: the Cn3D version (default setting is for Cn3D 4.0 or higher), and the level of display detail. The default "Virtual Bonds" setting displays C-alpha atoms only, with virtual bonds connecting them, and "All Atoms" will load a detailed model. "All Atoms" will increase the amount of structure data transmitted (and loading the structures may take significantly longer).
    The default alignment view is formatted for display by the browser ("Hypertext"). Click here for an example. Other available formats are "Plain Text", and "mFASTA", which can be selected as menu options next to the "View Alignment"-button.
    The HTML-formatted alignment views indicate aligned and unaligned residues (as uppercase and lowercase letters, respectively). Blocks aligned across all selected sequences are colored using a color ramp from blue to red; unaligned residues and columns not fully covered by all sequences are shown in gray. On the color ramp, conserved columns are shown in red. A pull-down menu allows for various settings of the conservation-coloring threshold in the displayed alignment, and another menu controls the alignment display width.
  • What?
    The menu options next to the "Subset Rows"-button allow for a selection of sequences from the CD for alignment viewing. The first pull-down menu controls the number of sequences to be displayed; the second menu controls the selection. If "top listed sequences" is selected, the specified number will be picked as they occur in the recorded alignment (arbitrary order).
    "Of the most diverse members" will select a subset, which represents a sequence-dissimilar subset of the indicated size (starting with the consensus, the sequence most dissimilar to the consensus, the sequence most dissimilar to these two, ...). Sequence similarity is measured by fraction of identical residues. Note that the consensus/representative sequence is always included in the displays.
    If "select by taxonomy" is chosen, the aligned sequences displayed must be picked from the taxonomy tree, by cutting off unwanted branches. If "select from list" is chosen, a tabular listing of the aligned sequences is presented; individual sequences can be picked from the table and will be added to the sequence selection of "top listed sequences" or "the most diverse members".
    Note: If alignment views are launched from a taxonomy-subset CD-summary, the automatically generated sequence display lists are also subjected to taxonomy subsetting, which may result in empty display lists and error messages.
  • Features:
    Curated CD alignment models may have been enriched by recording the location and nature of features conserved in the domain family. Typically these would describe catalytic residues, binding sites, or motifs commonly referred to in the literature. Features are added if they seem applicable to the family described in the CD's scope and if there is evidence linking the feature to a set of addresses on the alignment. Such evidence is recorded and available for inspection. It may consist of free-text comments, citations linked to PubMed, or "structure evidence", which exemplifies the existence of a site by highlighting an actual molecular complex, for example.
    Features are indicated on formatted alignment displays by an extra alignment row containing hash-marks ('#') to pinpoint the feature address. If more than one feature is available for an alignment model, the highlighted feature can be selected from a pull-down menu. When using Cn3D for alignment visualization, features are accessible via Cn3D's CDD Annotations panel, for interactive highlighting and alignment column selection.
    On CD-summary pages, evidence for the currently displayed feature is accessible at the bottom of the page. Citations used as reference are found as links to PubMed entries. Buttons labeled "View Structure Evidence" will launch Cn3D with a customized view, which highlights the evidence-forming set of residues and molecules by using different display styles.
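
The "most diverse members" selection described under "What?" above can be sketched as a greedy farthest-first pick. This is an illustration under stated assumptions, not the server's code: identity is computed over columns where both sequences have a residue, and "most dissimilar to these two" is interpreted as the candidate whose closest already-selected sequence is least similar. The names `identity` and `most_diverse` are hypothetical.

```python
def identity(a, b, gap="-"):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def most_diverse(rows, n, start=0):
    """Greedily pick up to n row indices, beginning with rows[start].

    rows[start] plays the role of the consensus/representative, which is
    always included. Each further pick is the row whose highest identity
    to any already-chosen row is lowest.
    """
    chosen = [start]
    while len(chosen) < min(n, len(rows)):
        candidates = [i for i in range(len(rows)) if i not in chosen]
        best = min(
            candidates,
            key=lambda i: max(identity(rows[i], rows[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    aligned = ["AAAA", "AAAC", "CCCC", "AACC"]  # row 0 stands in for the consensus
    print(most_diverse(aligned, 3))
```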


What is RPS-BLAST?

RPS-BLAST stands for "Reverse Position-Specific BLAST". This is a variant of the popular PSI-BLAST program ("Position-Specific Iterated BLAST"). PSI-BLAST finds sequences significantly similar to the query in a database search and uses the resulting alignments to build a Position-Specific Score Matrix (PSSM) for the query. With this PSSM the database is scanned again to eventually pull in more significant hits, and further refine the scoring model.
RPS-BLAST uses the query sequence to search a database of pre-calculated PSSMs, and reports significant hits in a single pass. The role of the PSSM has changed from "query" to "subject", hence the term "reverse" in RPS-BLAST.
RPS-BLAST is the search tool used in the CD-Search service. The CD-Search service provides a web-interface to the RPS-BLAST program, the CD search databases, and interactive alignment visualization including 3D structures. A standalone version of the RPS-BLAST program is available as part of the NCBI toolkit distribution.
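
To make the "reverse" role concrete, the toy sketch below scans one query against a small collection of precomputed scoring matrices in a single pass. It is deliberately simplified and is not the RPS-BLAST algorithm: it scores only ungapped, full-length placements and uses a raw score cutoff instead of word-hit heuristics, gapped extension, and E-values. The matrix representation (one residue-to-score dictionary per position), the default score, and all names are assumptions made for illustration.

```python
def best_ungapped_score(query, pssm, default=-1):
    """Best (score, start) over all ungapped placements of the PSSM on the query.

    pssm is a list with one {residue: score} dict per model position;
    residues missing from a dict receive the default score.
    Returns None if the query is shorter than the model.
    """
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        score = sum(
            pssm[i].get(query[start + i], default) for i in range(width)
        )
        if best is None or score > best[0]:
            best = (score, start)
    return best


def search_pssm_database(query, database, min_score):
    """Score the query against every PSSM once; return hits sorted by score."""
    hits = []
    for name, pssm in database.items():
        result = best_ungapped_score(query, pssm)
        if result is not None and result[0] >= min_score:
            hits.append((name, result[0], result[1]))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    toy_database = {
        "toyA": [{"A": 2}, {"C": 2}, {"D": 2}],
        "toyB": [{"K": 3}, {"K": 3}],
    }
    print(search_pssm_database("GGACDGG", toy_database, min_score=4))
```

Here the matrices are the search database and the sequence is the probe, which is the inversion of roles the name RPS-BLAST refers to.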

Which Search Databases are available?

Currently, CD-Search is offered with the following search databases:
  • Smart - a mirror of a recent Smart set of domain alignments. Note that some Smart families may be missing from the mirror due to update delays or because they describe very short conserved peptides and/or motifs, which would be difficult to detect using the CD-Search service. You may want to try the HMM-based search service offered on the Smart site. Note also that some Smart domains are not mirrored in CD because they represent "superfamilies" encompassing several individual, but related, domains; the corresponding seed alignments may not be available from the source database in these cases. Note also that Smart version numbers do not change with incremental updates of the source database (and the mirrored CD-Search database).
  • Pfam - a mirror of a recent Pfam-A database of curated seed alignments. Pfam version numbers do change with incremental updates. As with Smart, families describing very short motifs or peptides may be missing from the mirror. An HMM-based search engine is offered on the Pfam site.
  • COG - a mirror of the current COG database of orthologous protein families. Seed alignments have been generated by an automated process. An alternative search engine, "Cognitor", which runs protein-BLAST against a database of COG-assigned sequences, is offered on the COG site.
  • Cdd - this is a superset including Smart, Pfam, COG, alignments from the LOAD-database (Library Of Ancient Domains), contributed by L. Aravind, E. Koonin, and colleagues, and CD alignments curated at NCBI.

Can I run RPS-BLAST locally?

A standalone version of RPS-BLAST is available as part of the NCBI toolkit distribution (see ftp://ftp.ncbi.nih.gov/toolbox). Information on how to create your own RPS-BLAST search databases can be found here.
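
As a hedged sketch of a local run: the wrapper below assumes the `rpsblast` executable from the later BLAST+ command-line suite is installed and on the PATH, and that a search database has already been set up locally. The toolkit program referred to above may use different option names. The file name `query.fasta` and the database name `Cdd` are placeholders, and the helper names are hypothetical.

```python
import subprocess


def build_rpsblast_command(query_fasta, database, evalue=0.01, out_path=None):
    """Assemble an argument list for the BLAST+ rpsblast executable.

    -outfmt 6 requests tab-separated hit lines; the E-value default
    mirrors the CD-Search default threshold of 0.01.
    """
    command = [
        "rpsblast",
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),
        "-outfmt", "6",
    ]
    if out_path is not None:
        command += ["-out", out_path]
    return command


def run_rpsblast(query_fasta, database, evalue=0.01):
    """Run rpsblast and return its standard output as text.

    Raises FileNotFoundError if the executable is missing and
    subprocess.CalledProcessError if the program exits with an error.
    """
    command = build_rpsblast_command(query_fasta, database, evalue)
    completed = subprocess.run(
        command, check=True, capture_output=True, text=True
    )
    return completed.stdout


if __name__ == "__main__":
    # Placeholder names; printing the command avoids requiring a local install.
    print(" ".join(build_rpsblast_command("query.fasta", "Cdd")))
```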

Position-specific scoring matrices corresponding to the CD-Search databases are available via FTP (see the README file for instructions).

What input is required?

To submit a query sequence to CD-search, you only need to provide the sequence, as raw sequence data, formatted as FASTA, or as Gi/Accession (valid in the NCBI Entrez system). Hitting the submit button will start CD-search with default settings for search sensitivity and display options.
Note that CD-search only works for protein sequences.

Advanced search options:

Note that these options are only available when using the actual CD-Search form. Searches launched from the CDD Home page or together with protein BLAST requests use default search parameters.
  • Expect: modifies the E-value threshold used for filtering results. False positive results should be very rare with the default setting of 0.01 (use a more conservative, i.e. lower, setting for more reliable results). Results with E-values of 1 and above should be considered putative false positives.
  • Filter: By default, query sequences are filtered for compositionally biased regions. These are flagged as such and largely ignored during the search phase. If filtering is turned on, the graphical display of results highlights filtered-out regions on the query.
  • Search mode: this option affects the detailed heuristic with which initial hits are detected and expanded into local alignments. Changing the options from "Multiple hits, 1-pass" (Default) to "Single-hit, 1-pass", and "2-pass" may result in slightly higher sensitivity, but will also increase runtime.
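
The effect of the Expect option can be pictured as a simple filter over the hit list. The sketch below is illustrative only: the hit representation (name, E-value), the function name, and the choice to keep hits whose E-value equals the threshold are assumptions.

```python
def filter_hits(hits, expect=0.01):
    """Keep hits with E-value at or below the threshold.

    Each kept hit is returned as (name, evalue, putative_false_positive),
    where the flag marks E-values of 1 and above, following the guidance
    given for the Expect option.
    """
    kept = []
    for name, evalue in hits:
        if evalue <= expect:
            kept.append((name, evalue, evalue >= 1.0))
    return kept


if __name__ == "__main__":
    example_hits = [("domA", 1e-20), ("domB", 0.5), ("domC", 3.0)]
    print(filter_hits(example_hits))             # default threshold
    print(filter_hits(example_hits, expect=10))  # permissive threshold
```

With the default threshold only the strongest hit survives; raising the threshold admits weaker hits, including ones that should be treated as putative false positives.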

Output formatting options:

  • Number of hits displayed: limits the size of the hit list produced by CD-Search. Typically, for average sized proteins, the number of expected domain-hits is small and the default setting should be more than sufficient.
  • Graphical overview: On top of each search results page, a graphical overview is presented to indicate the location of domain-hits on the query sequence. The pull-down menu allows this feature to be turned off. The "condensed overview" will only show the best scoring hit for each region on the query, and display overlapping hits only if the mutual overlap with better-scoring hits does not exceed 50%.
  • Color Schemes: This option affects the way pairwise alignments are displayed (and colored) on the CD-search results pages. Only option "3" (Default) uses color to highlight similar and identical residues, the other options produce more conventional (BLAST-style) alignment displays.
    Color Scheme 1 Aligned residues are displayed in uppercase, residues identical in the alignment between query and representative are shown in the extra line between the two sequences, similar residues (with a positive score in the BLOSUM62 matrix) are indicated with a "+". Regions masked out due to composition-bias are displayed in italics.
    Color Scheme 2 Same as Scheme 1, except for aligned regions being displayed in bold letters, and masked out regions being colored gray.
    Color Scheme 3 Identical residues colored red, similar residues colored blue. Masked out regions are printed in italics.
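
The "condensed overview" rule described under "Graphical overview" above lends itself to a short sketch. Several details are assumptions rather than documented behavior: hits are given as (name, start, end, score) with half-open coordinates on the query, "mutual overlap" is taken as the shared length divided by the length of the shorter hit, and each hit is compared only against better-scoring hits that are themselves displayed.

```python
def overlap_fraction(a, b):
    """Shared length of two half-open intervals, relative to the shorter one."""
    (a_start, a_end), (b_start, b_end) = a, b
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return 0.0
    return shared / min(a_end - a_start, b_end - b_start)


def condensed_overview(hits, max_overlap=0.5):
    """Select hits to display, best score first.

    A hit is shown only if its overlap with every better-scoring hit
    already shown does not exceed max_overlap.
    """
    shown = []
    for hit in sorted(hits, key=lambda h: h[3], reverse=True):
        span = (hit[1], hit[2])
        if all(
            overlap_fraction(span, (other[1], other[2])) <= max_overlap
            for other in shown
        ):
            shown.append(hit)
    return shown


if __name__ == "__main__":
    example = [
        ("A", 0, 100, 200.0),
        ("B", 10, 90, 150.0),   # lies inside A, so it is hidden
        ("C", 80, 180, 120.0),  # overlaps A by 20%, so it is shown
        ("D", 300, 400, 50.0),
    ]
    print([hit[0] for hit in condensed_overview(example)])
```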

How long do I have to wait for the results?

CD-search requests are submitted to the BLAST servers immediately. A typical search should take a few seconds only, depending on the size of the search database chosen, the length of the query sequence, and the load on the servers. Click here to test response time with a typical query.
CD-Search requests can also be sent to the BLAST queuing system, using the optional button at the bottom of the CD-Search page (this happens by default for searches launched in parallel with protein BLAST requests). Requests sent to the queue will take longer, but the results can be retrieved at a later time using the RID ("Request ID"), without having to re-calculate the search. A form at the bottom of the CD-Search page can be used to retrieve earlier search results by RID.

When do search requests end up in the BLAST-Queue?

When CD-search is run as an integral part of protein-BLAST search requests, the jobs are put in the BLAST queue and may take a little longer to complete (depending on the system load and length of query sequence). Queued CD-search will try to retrieve the finished results every few seconds until they are available. You may also store the request-id (RID) and retrieve results later here.

What are the elements on the results page?

The CD-search server will summarize results on a single page in various formats. These are listed below in detail. On top of each search page, summary information concerning the query sequence and the search database is printed.

How do I look at multiple alignments?

When you click on the cartoon in the graphical display on the CD-search results page, an alignment view will be opened, which adds the query sequence to the multiple CD-alignment. It is possible to limit the number of CD-sequences displayed in this view and to determine which of the CD's sequences are displayed:
  1. top listed sequences: will select the first N sequences, in the order in which they occur in the CD.
  2. of most diverse members: will pick the N most diverse CD-sequences starting with the representative. Use this option to learn about the highest sequence diversity expected for a typical family member, as defined by the authors of the source alignment.
  3. sequences most similar to the query: (default) will pick N sequences most similar, in the aligned region, to the query of the CD-search. Use this option to learn about the nature of the query's closest sequence neighbors in the CD.
The multiple alignment is formatted for display with your browser by default. However, the resulting page allows for the display options to be modified (as described previously under "CD Alignment visualization").

Alignment visualization including 3D-structures

"Redisplay Alignment" from an alignment view that includes the query sequence will launch Cn3D by default if a structure is included in the CD. The structure viewing options discussed above are available. Note that Cn3D offers column-specific coloring by sequence conservation when invoked with multiple alignment views; this is a convenient way to study sequence conservation within a CD-alignment and to find out how well the aligned query fits the existing patterns of conservation and variability.

What does the pink dot mean?

A pink dot next to a CD-identifier indicates that the CD has links to 3D structure. Alignment visualization for hits to this CD may utilize Cn3D, which must be installed as a helper application. The pink dots in the leftmost column of the CD-search hit list are hotlinked to provide immediate visualization: If Cn3D is installed properly, a 3D-view will be presented showing the alignment of the query with a subset of the most similar sequences found in this CD (and the representative with 3D structure, irrespective of its similarity to the query).


Updated 05/15/04
