dbSNP Frequently Asked Questions (FAQs)

General Information

1. Description of SNP database
2. Relationship between dbSNP and GenBank
3. Classes of genetic variation within the database
4. History of the database
5. Sources of the data
6. Growth rate
7. Species represented in the database
8. Relationship between dbSNP and the NIH Polymorphism Discovery Resource
9. Provisions for quality control

Searching

1. Searching dbSNP
2. Batch queries
3. Searching for polymorphisms of specific genes or chromosomes

Submission and Withdrawal

1. Submitting data
2. Simultaneous SNP and STS submission
3. Naming formats
4. NCBI assay ID, or ss ID
5. Reference SNP ID, or rs ID
6. Human gene nomenclature
7. Hold until published (HUP) policy
8. Withdrawing data

Downloading

1. Downloading via FTP
2. Getting Started - dbSNP FTP Primer

Linking to dbSNP

1. Linking to a dbSNP record

Data Processing and Summary Measures

1. How variations are mapped to genome sequence
2. Why do the functional classifications for some variations change when a genome is re-assembled?
3. How average heterozygosity is computed


General Information

1. What is the SNP database?
 

SNP stands for "single nucleotide polymorphism". SNPs are the most common genetic variations and occur once every 100 to 300 bases. A key aspect of research in genetics is the association of sequence variation with heritable phenotypes. It is expected that SNPs will accelerate the identification of disease genes by allowing researchers to look for associations between a disease and specific differences (SNPs) in a population. This differs from the more typical approach of pedigree analysis, which tracks transmission of a disease through a family. It is much easier to obtain DNA samples from a random set of individuals in a population than it is to obtain them from every member of a family over several generations. Once discovered, these polymorphisms can be used by additional laboratories, using the sequence information around the polymorphism and the specific experimental conditions.

For a current summary of information contained in the database, see the dbSNP Summary page.

Back to FAQs
 

2. What is the relationship between dbSNP and GenBank?
dbSNP is an independent database and not a division of GenBank. Although much of the data in dbSNP is not in GenBank, the dbSNP data will be integrated with other NCBI genomic data, and this will ultimately affect GenBank records. The sequences of dbSNP records are expected to be contained within the sequences of one or more GenBank records, with the GenBank records generally containing longer sequences and fewer allele designations.  The integration will eventually link dbSNP records with GenBank records containing overlapping sequences deduced or stated to be from the same location.

Reference dbSNP records will be mapped to external resources or databases and will also point back to the original dbSNP submitter records.  For further information on this, see "What is a Reference SNP, or rs ID?" below.   As with all NCBI projects, the data in dbSNP will be freely available to the scientific community and made available in a variety of forms.  Please see the dbSNP Home Page for further information.

Back to FAQs
 

3. What classes of genetic variation are included in the database?
The database has been designed to accept several classes of genetic variation:
    • SNPs
    • microsatellite repeats
    • small insertion/deletion polymorphisms

    dbSNP uses the term "SNP" in the much looser sense of "minor genetic variation," so there is no requirement or assumption about minimum allele frequencies for the polymorphisms in the database. Thus, the scope of dbSNP includes disease-causing CLINICAL MUTATIONS as well as NEUTRAL POLYMORPHISMS. Given the current activity in the discovery of general sequence variation, it is anticipated that SNP markers with unknown selective effects will be the vast majority of submitted records.

    Back to FAQs
     

 4. When was the SNP database established?
dbSNP was launched in September of 1998.

Back to FAQs
 

 5. Where does the data come from?
dbSNP accepts submissions from all laboratories and is especially interested in mutations in genes and in variations for which additional biological information is known. Major contributors to the database are laboratories associated with the National Human Genome Research Institute (NHGRI) grants program. The NHGRI has funded an intensive effort to collect 50,000 SNPs in three years. The grant recipients under this program include genome centers, private extramural research labs, and private businesses. These groups are working on a variety of aspects of new SNP discovery, new technologies for SNP detection, and rapid SNP genotyping in large samples.

In addition to this central contribution to dbSNP, other research labs and private companies can deposit SNP information to make it easily accessible to the research community and part of the public domain.  We are currently designing a common data exchange format for SNP data to be used between the central SNP databases.

Back to FAQs
 
6. What is the rate at which dbSNP is growing?
 
In general, the database grows at a rate of about 90 SNPs per month; however, large numbers of submissions by large research projects can cause uneven growth.  While dbSNP welcomes data from smaller contributors, the majority of the data will probably come from a small number of large projects funded by the NHGRI grants. For this reason, it is expected to grow in erratic jumps for the next few years. It is expected that there will be 5,000 - 10,000 SNPs by the end of the first year of the funded SNP grants.  For the current number of SNPs in the database, see the dbSNP Summary page.
Back to FAQs
 
7. Which species are represented in the database?
Currently, only Homo sapiens is represented in the database; however, the database has been designed to accept mutation information from any species.

Back to FAQs
 

8. How is dbSNP related to the NIH Polymorphism Discovery Resource?
The database has been designed to accept frequency and/or individual genotype information from any submitter-defined population, not just the NIH Polymorphism Discovery Resource (NIHPDR). The NIHPDR is encouraged as a resource for the extramural NIH-funded SNP grants to facilitate a more straightforward evaluation of differing discovery or genotyping methods using a consistent panel of samples. None of the frequency information currently in dbSNP comes from the NIHPDR.

Back to FAQs
 

9. What provisions are made for quality control?
Data validation can be maintained for both submitted assay reports and reference SNP objects.  At the level of an individual submitted assay report, dbSNP provides several fields to assess the quality of the data.

For further information, please see: How to Submit,  Validation fields.

Back to FAQs
 



 
Searching

1. How can I search dbSNP?

dbSNP can be searched either via other NCBI resources or directly.

Via other NCBI resources
We are developing three ways to query the database by integrating it with other NCBI resources.  They are:

a. By gene name/nomenclature association
Query results from the LocusLink database will show a purple "S" button if SNP records have been mapped to the gene. Clicking on the "S" will take you to a list of the reference SNP records for that gene.
b. By map location
dbSNP is currently being integrated with GeneMap99 and the integrated physical maps that are being constructed at NCBI. When integration is completed, the maps may be browsed for SNP content in user-specified regions of the map. This feature should be ready in mid-May.
c. As a BLAST operation on dbSNP using a candidate sequence
The sequences in dbSNP are currently being formatted to be searched by BLAST. Users will be able to submit a query sequence to BLAST and receive a list of any SNPs in the database that hit the sequence. This feature will be ready in late May.

Direct searching of dbSNP
Currently, there are six ways to search dbSNP directly:
2. Is it possible to do batch queries?
There are two means of doing batch queries:
    1. Searching batches submitted by individual laboratories.

    This method is used to search groups, or batches, of SNPs submitted by individual laboratories. Batches are identified by the local batch identification code, submitter handle, number of SNPs in the submission, and the date of submission. Batches are displayed chronologically, with the most recent submissions first. See the New Batches page for further information.

    2. Submitting a batch of requests.

    This method would allow a user to submit a batch of queries, or requests. The results would then be returned to the user's email account in ASN.1, FASTA, XML, chromosome report, or text flatfile format. The batch of requests can be submitted as an upload file or entered using a web interface. See the dbSNP Batch Query page for further information.
     

Back to FAQs
3. How can I search for polymorphisms of a specific gene or chromosome?
It is currently not possible to search for SNPs of a specific gene or chromosome; however, a report of mapped SNPs sorted by chromosome and fine map position is available. It is possible to look at the map of a specific gene or chromosome and search for SNPs within that region.  See "How can I search dbSNP?" for a discussion of the integrated resource features that will soon address this need.

NCBI is also in the process of integrating dbSNP entries with other sequence and mapping resources via BLAST and Electronic PCR (E-PCR) analysis. This analysis will attempt to associate all SNPs with a nucleotide sequence record and/or physical map contig. If the SNP is in a gene region, it will be annotated on the appropriate Reference Sequence or UniGene cluster.

Back to FAQs
 


Submission and Withdrawal

1. How can I submit data to the database?

Independent labs can submit data directly to NCBI  by following the submission procedures and suggestions found on the dbSNP How to Submit page.

Back to FAQs
 

2. What kind of data is needed to submit STS and SNP data simultaneously?
To submit STS and SNP data simultaneously, it is necessary to submit a batch file that includes the following sections in the indicated order.
 
 
SECTION          TYPE        SECTION DESCRIPTION
Contact          CONT        Submitter's name, phone number, and other contact information for this datafile
Publication      PUB         List of pre-press, or published articles about the markers
*SNP Method      METHOD      Free text section for description of general method of assay
*SNP Population  POPULATION  Description of population sample
*STS Source      SOURCE      Description of source organism
*STS Protocol    PROTOCOL    PCR protocol components
*STS Buffer      BUFFER      PCR buffer components
STS Record       STS         An STS entry using a SOURCE, PROTOCOL, BUFFER, primers and sequence
*STS Method      METH        A section to label/decode lines in an STS Map section
STS Map Data     MAP         Map information for the STS (and, hence, the SNP)
SNP Record       SNPAssay    An SNP entry using METHOD, STS, alleles and flanking sequence
SNP Frequency    SNPPOPUSE   Frequency data for SNPASSAY in POPULATION
 
Sections marked with an asterisk (*) need to be defined and submitted only once. The other sections carry the particular details for each SNP, insertion/deletion, or microsatellite in a data set.
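
The required ordering can be checked mechanically before a file is uploaded. The following Python sketch is illustrative only: the function name `check_section_order` and the idea of representing a batch file as a list of TYPE codes are assumptions made for this example, not part of any dbSNP tool. It also assumes that sections of the same type are grouped together, which the FAQ does not state explicitly.

```python
# Section TYPE codes in the order listed in the section table above.
SECTION_ORDER = [
    "CONT", "PUB", "METHOD", "POPULATION", "SOURCE", "PROTOCOL",
    "BUFFER", "STS", "METH", "MAP", "SNPASSAY", "SNPPOPUSE",
]


def check_section_order(types):
    """Return True if the TYPE codes never step backwards in the table order.

    Hypothetical helper: `types` is the list of TYPE codes read from a batch
    file, top to bottom. Codes are compared case-insensitively, and repeated
    codes (for example, many STS records in a row) are allowed.
    """
    rank = {code: position for position, code in enumerate(SECTION_ORDER)}
    last = -1
    for code in types:
        key = code.upper()
        if key not in rank:
            raise ValueError(f"unknown section type: {code!r}")
        if rank[key] < last:
            return False
        last = rank[key]
    return True
```

A file whose sections read CONT, PUB, METHOD, STS, STS, SNPASSAY passes this check, while one that places SNPASSAY before STS does not.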

Details of each section can be found in the dbSTS submission instructions or the dbSNP submission instructions.

Back to FAQs
 

3. Is there a specific name or format I must use for submission?
Individual laboratories will be assigned a unique handle, which is a short lab identifier, usually an acronym or abbreviated name.  The handle will allow submissions to be associated with laboratories independent of the details of who is handling a particular set of submissions from that laboratory.

For more information on handles,  please see:  How to Submit, Handles
 

Local SNP identifiers need only be unique within a specific handle. The combination "HANDLE | LOCAL SNP ID" will be unique within the database. There is a 64-character limit for SNP identifiers.
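
As a rough illustration of these two rules (the 64-character limit and uniqueness of the handle/identifier pair), a submitter could pre-check a list of identifiers with something like the Python sketch below. The function names, the example handles, and the exact rendering of the combined key (a bare "|" with no surrounding spaces) are assumptions made for this example.

```python
MAX_LOCAL_ID_LENGTH = 64  # character limit stated in the FAQ


def make_submitter_key(handle, local_snp_id):
    """Combine a handle and a local SNP ID into a 'HANDLE|LOCAL_SNP_ID' key.

    Hypothetical helper; the separator format is an assumption.
    """
    handle = handle.strip()
    local_snp_id = local_snp_id.strip()
    if not handle or not local_snp_id:
        raise ValueError("handle and local SNP ID must both be non-empty")
    if len(local_snp_id) > MAX_LOCAL_ID_LENGTH:
        raise ValueError("local SNP ID exceeds 64 characters")
    return f"{handle}|{local_snp_id}"


def find_duplicate_keys(records):
    """Return the sorted keys that occur more than once.

    `records` is an iterable of (handle, local_snp_id) pairs.
    """
    seen = set()
    duplicates = set()
    for handle, local_snp_id in records:
        key = make_submitter_key(handle, local_snp_id)
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return sorted(duplicates)
```

Note that the same local identifier under two different handles (here the made-up handles LABX and LABY) is not a duplicate, because uniqueness is only required within a handle.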

For more information on identifiers and examples of identifiers, please see: How to Submit, Identifiers
 
Back to FAQs
 

4. What is the NCBI assay ID, or 'ss' ID?
The NCBI assay ID, or 'ss' ID, is an accession number assigned by NCBI to submitted SNPs. It has the format NCBI|ss<NCBI ASSAY ID>. Note that 'ss' is always in lower case.

For more information on ss IDs, please see: How To Submit, Resource Integration.

Back to FAQs
 

5. What is a reference SNP, or 'rs' ID?
A reference SNP ID, or 'rs' ID, is an identification tag assigned by NCBI to SNPs that appear to be unique in the database. The rs ID number, or tag, is assigned at submission. Initially, it is expected that nearly every submission will be assigned an rs ID. As the database matures, however, submitted SNPs that map to the same location as previously submitted SNPs will be linked into the reference set of the existing reference SNP record. These rs IDs will be a set of features that will be mapped to external resources or databases, including NCBI databases. The rs ID number will be noted on the records in these external resources and databases in order to point users back to the original dbSNP records. A reference SNP record has the format NCBI|rs<NCBI SNP ID>. Note that 'rs' is always in lower case.
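
For readers who need to recognize these identifiers in their own files, the ss and rs formats described in questions 4 and 5 can be matched with a short pattern. The Python sketch below is an illustration, not an NCBI utility: the function name is made up, and it assumes that the ID portion is a plain decimal number. It tolerates optional whitespace after the "|" and rejects upper-case prefixes, since 'ss' and 'rs' are always lower case.

```python
import re

# Matches 'NCBI|ss123' or 'NCBI|rs123'. The prefix must be lower case.
# Assumption: the NCBI-assigned ID is a decimal integer.
_ID_PATTERN = re.compile(r"^NCBI\|\s*(ss|rs)(\d+)$")


def parse_ncbi_snp_id(text):
    """Split an ID such as 'NCBI|rs123' into (kind, number).

    `kind` is 'ss' for a submitted assay or 'rs' for a reference SNP.
    """
    match = _ID_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized ss/rs identifier: {text!r}")
    return match.group(1), int(match.group(2))
```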

For more information on rs IDs, please see: How To Submit, Abstract Report.

Back to FAQs
 

6. What is the official nomenclature for human genes?
There is currently no official nomenclature for human genes; however, the Human Gene Nomenclature Committee is currently trying to establish a nomenclature standard and does have a recommended format. The Human Gene Nomenclature Committee is the accepted authority for establishing these standards. For new genes lacking official nomenclature, the research community is encouraged to use the Nomenclature Committee web form to submit a proposed gene symbol and name, thus creating a community-provided name. In general, the research community does try to use pre-existing names, but these names might not be the current official names, so situations do arise where a single gene is called by multiple names. There is no enforcement of this suggested nomenclature method, and investigators are free to name a gene as they wish.

There is no standard 'format' for the official gene name, but, for human genes, the official gene symbol (an abbreviation) does have the standardized format of capitalizing all alphabetic characters and excluding non-alphanumeric characters. For example, official symbols might look like ABC3 but do not look like ABC(3).
 
Back to FAQs
 

7. What is the "hold until published (HUP)" policy?
dbSNP data cannot be held confidential until publication. Note that dbSTS and dbSNP have different "hold until published" (HUP) policies. Submissions to dbSTS can be withheld from public view until the accession number is published. dbSNP records, however, will be available for public inspection when the submission process is complete, even in the case of simultaneous dbSNP/dbSTS submissions. STS submissions that require HUP treatment should be submitted separately, and prior to the SNP submission.
Back to FAQs
 
8. Is it possible to withdraw submissions from the database?
A record cannot be completely deleted from the database. The submission identification number of all records, and the rs ID number of abstracted reference SNP records, are incorporated into other databases (please see "What is the NCBI assay ID, or 'ss' ID?" and "What is a reference SNP, or 'rs' ID?" above). These ID tags indicate the source of SNPs to users of other databases that reference the data in question, and they point those users back to dbSNP. Also, other laboratories can submit records to dbSNP that refer to SNPs already in the database. Because a submission quickly becomes permanently associated with other records both within dbSNP and in other databases, it is not possible to completely eliminate it from the database. A record can be marked as "withdrawn," however, so that a query of that SNP will indicate that the submitter has chosen to withdraw that data.

Back to FAQs
 


Downloading

1. Is it possible to download dbSNP?

 dbSNP is available for downloading via the NCBI FTP server in three formats:
    • pc compressed
    • uncompressed
    • unix compressed
     

    The following FTP files are available:
     

    • contact.rep - Handle information and submitter contact information
    • publicat.rep - Publications cited in the database
    • method.rep - Assay methods defined by submitters
    • populatn.rep - Population descriptions defined by submitters
    • snpassay.rep - Assay reports for all SNPs in the database
    • popuse.rep - Population frequency data
    • induse.rep - Individual genotype data
     
    Note: These files contain the data only. They are not a stand-alone searchable database and do not include software.

    Back to FAQs
     


Linking to dbSNP

1. Is it possible to create HTML links to a particular dbSNP record?

Yes. If you have a local website with more data related to a particular SNP, it may be useful to create a link from the local site to the ss ID number of a chosen SNP record. To do so, use the following URL:

http://www.ncbi.nlm.nih.gov/SNP/snp_retrieve.cgi?subsnp_id=

Set the "subsnp_id" to the chosen NCBI Assay ID number.

For example, if the NCBI Assay ID number is 123, set the URL as:
http://www.ncbi.nlm.nih.gov/SNP/snp_retrieve.cgi?subsnp_id=123
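
If links are generated by a script rather than written by hand, the same URL can be assembled programmatically. A minimal Python sketch follows; the function name `subsnp_link` is made up for this example, and only the base URL and the `subsnp_id` parameter come from the instructions above.

```python
from urllib.parse import urlencode

# Base URL and parameter name as given in the FAQ.
BASE_URL = "http://www.ncbi.nlm.nih.gov/SNP/snp_retrieve.cgi"


def subsnp_link(assay_id):
    """Build a link to a dbSNP record from a numeric NCBI Assay (ss) ID."""
    assay_id = int(assay_id)
    if assay_id <= 0:
        raise ValueError("NCBI Assay ID must be a positive integer")
    return f"{BASE_URL}?{urlencode({'subsnp_id': assay_id})}"
```

Calling `subsnp_link(123)` reproduces the example URL shown above.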

Back to FAQs
 


Data Processing and Summary Measures

1. How variations are mapped to genome sequence

When reference genome assemblies are available, we use them as anchor sequences to place refSNP clusters into a genomic context. We clean dbSNP flanking sequences with RepeatMasker and then remap them to the most current build of each genome using MegaBLAST. The mapping results then define a new non-redundant set of variations for the genome.

In general, a word size of 28 is used in MegaBLAST computations, but a small subset of our data has a half flank (i.e., the 5' or 3' flanking sequence taken individually as the MegaBLAST query sequence) of 25 bases, and this subset is searched with a word size of 22. To map a deletion, we require that both flanking sequences are returned in the alignment and, furthermore, that both penultimate bases flanking the allele are returned. In other words, the gap as defined in the alignment exactly matches the deletion as defined in dbSNP.

The complete command line for MegaBLAST is:

megablast -U T -F m -J F -X 180 -r 10 -q -20 -P 1000 -R T -W 28
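
To show how the two word sizes mentioned above relate to this command line, here is a small Python sketch that builds the argument list and swaps the `-W` value for the short-flank subset. It is only an illustration: the function name and the `short_half_flank` flag are made up, the query and database arguments are not given in this FAQ and are therefore omitted, and it assumes that `-W` is the only option that differs for 25-base half flanks.

```python
import shlex

# Command line exactly as quoted in the FAQ. Query and database arguments
# are not given there, so this list is not runnable on its own.
BASE_COMMAND = "megablast -U T -F m -J F -X 180 -r 10 -q -20 -P 1000 -R T -W 28"


def megablast_args(short_half_flank=False):
    """Return the argument list, with word size 22 for 25-base half flanks.

    `short_half_flank` is a hypothetical flag set by the caller for the
    small subset of data whose half flank is only 25 bases long.
    """
    args = shlex.split(BASE_COMMAND)
    if short_half_flank:
        # Assumption: only the word size (-W) changes for this subset.
        args[args.index("-W") + 1] = "22"
    return args
```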

In addition, we filter the MegaBLAST results into two classes in the dbSNP database:

0) better than 95% sequence alignment with fewer than 6 mismatches
1) better than 75% alignment with less than 3% mismatches.

Anything below the lower threshold is discarded.
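
The two-class filter can be written as a small decision function. The Python sketch below is one reading of the thresholds, not the actual dbSNP pipeline code: it assumes that "sequence alignment" means the fraction of the flanking sequence that aligned, and that the 3% mismatch limit is measured relative to the aligned length. Both interpretations, and all names, are assumptions.

```python
def classify_hit(aligned_fraction, mismatches, aligned_length):
    """Return 0 or 1 for the two mapping classes, or None to discard.

    aligned_fraction: fraction (0.0 to 1.0) of the flank that aligned.
    mismatches: number of mismatched bases in the alignment.
    aligned_length: number of aligned bases (assumed denominator for 3%).
    """
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    # Class 0: better than 95% alignment with fewer than 6 mismatches.
    if aligned_fraction > 0.95 and mismatches < 6:
        return 0
    # Class 1: better than 75% alignment with less than 3% mismatches.
    if aligned_fraction > 0.75 and mismatches / aligned_length < 0.03:
        return 1
    # Anything below the lower threshold is discarded.
    return None
```

Under this reading, a hit with 98% alignment and 6 mismatches over 400 aligned bases misses class 0 but still qualifies for class 1.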

A non-negligible subset of dbSNP refSNPs (rs) fails to map because of heavy repeat masking on the flanking sequence. When variations align at more than one site on the genome, they sometimes map with distinct mismatch counts. dbSNP does not make any effort to reduce map redundancy by comparing individual quality scores at each site.

Generally, dbSNP reports all mapping results against the current assembly. This does not, however, cover everything in dbSNP. There are three major cases where we do not map and/or annotate:

     a. Submissions that are completely masked as repetitive elements. These are dropped from any further computations. This set of refSNPs is dumped in chromosome "rs_chMasked" on our ftp site.

     b. Submissions that are defined in a cDNA context with extensive splicing. These SNPs are typically annotated on RefSeq mRNAs through a separate annotation process. We are working to reverse map these variations back to contig coordinates, but that has not been implemented. For now, you can find this set of variations in "rs_chNotOn" on the ftp site.

     c. Variations with excessive hits to the genome. Variations with 3+ hits to the genome are neither annotated as variation features on contigs nor included in variation tracks for either the NCBI or UCSC map viewer resource. These data are in "rs_chMulti" on our ftp site.
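
The three cases amount to a simple routing rule for deciding where an unannotated refSNP ends up on the ftp site. The Python sketch below restates that rule. The function and parameter names are made up, and the order in which the cases are tested is an assumption, since the FAQ does not say which bin wins when more than one case applies.

```python
def unmapped_bin(fully_masked, cdna_only, genome_hits):
    """Return the ftp bin name for a refSNP that is not annotated, else None.

    fully_masked: True if the flanking sequence is entirely repeat-masked.
    cdna_only: True if the submission is defined only in a spliced cDNA context.
    genome_hits: number of places the refSNP maps on the genome.
    """
    if fully_masked:
        return "rs_chMasked"  # case a: dropped from further computation
    if cdna_only:
        return "rs_chNotOn"   # case b: not yet reverse mapped to contigs
    if genome_hits >= 3:
        return "rs_chMulti"   # case c: 3+ hits to the genome
    return None               # mapped and annotated normally
```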

Furthermore, the heuristics for non-SNP variations (i.e., named elements and STRs) are probably a bit too conservative, so some of these are consequently lost. While we prefer to err on the side of caution to avoid false annotation of variation in inappropriate locations, we are working to improve the hit rate for these variations as well.

2. Why do the functional classifications for some variations change when a genome is re-assembled?

Functional annotation varies from build to build because the underlying substrate, namely the reference genome sequence, is itself changing from assembly to assembly. During each assembly, the algorithms used to define 'genes' are refined to improve accuracy. Since gene features can be defined by various classes of evidence that vary in their certainty, there is currently some thrashing in estimates of gene numbers and their precise exon structure on the genome. Duplicates are identified and merged, spurious annotations are removed, and new evidence is included as the annotation pipelines are developed.

The net result for the dbSNP user is that SNPs may be in an exon (or gene more generally) in one build, and in an intron or UTR (or intergenic DNA) in the next build if the exon (gene) is subsequently removed. The genome sequencing community is converging on a stable reference sequence. When it is finished, annotation (including SNP function) will be much more stable.

3. How average heterozygosity is computed

Average heterozygosity is computed for each refSNP cluster as described here.
Back to FAQs
 


Comments or Questions?
If you have further questions, comments or suggestions for improvement of this FAQ page or dbSNP please contact the NCBI Service Desk at info@ncbi.nlm.nih.gov

FAQ page updated: October 1, 2002



Revised: June 9, 2004 8:18 AM.