dbSNP Frequently
Asked Questions (FAQs)
General Information
1. Description of SNP database
2. Relationship between dbSNP and GenBank
3. Classes of genetic variation within the database
4. History of the database
5. Sources of the data
6. Growth rate
7. Species represented in the database
8. Relationship between dbSNP and the NIH Polymorphism Discovery
Resource
9. Provisions for quality control
Searching
1. Searching dbSNP
2. Batch queries
3. Searching for polymorphisms of specific genes
or chromosomes
Submission and Withdrawal
1. Submitting data
2. Simultaneous SNP and STS submission
3. Naming formats
4. NCBI assay ID, or ss ID
5. Reference SNP ID, or rs ID
6. Human gene nomenclature
7. Hold until published (HUP) policy
8. Withdrawing data
Downloading
1. Downloading via FTP
2. Getting Started - dbSNP FTP Primer
Linking to dbSNP
1. Linking to a dbSNP record
Data Processing and Summary Measures
1. How variations are mapped to genome sequence
2. Why do the functional classifications for some
variations change when a genome is re-assembled?
3. How average heterozygosity is computed
General Information
1. What is the SNP database?
SNP stands for "single nucleotide polymorphism". SNPs
are the most common genetic variations and occur once every 100 to 300
bases. A key aspect of research in genetics is the association of
sequence variation with heritable phenotypes. It is expected that
SNPs will accelerate the identification of disease genes by allowing researchers
to look for associations between a disease and specific differences
(SNPs) in a population. This differs from the more typical approach
of pedigree analysis, which tracks transmission of a disease through a family.
It is much easier to obtain DNA samples from a random set of individuals
in a population than it is to obtain them from every member of a family
over several generations. Once discovered, these polymorphisms can
be used by additional laboratories, using the sequence information around
the polymorphism and the specific experimental conditions.
For a current summary of information contained in the database, see
the dbSNP Summary
page.
Back to FAQs
2. What is the relationship
between dbSNP and GenBank?
dbSNP is an independent database and not a division of GenBank.
Although much of the data in dbSNP is not in GenBank, the dbSNP data will
be integrated with other NCBI genomic data, and this will ultimately affect
GenBank records. The sequences of dbSNP records are expected to be contained
within the sequences of one or more GenBank records, with the GenBank records
generally containing longer sequences and fewer allele designations.
The integration will eventually link dbSNP records with GenBank records
containing overlapping sequences deduced or stated to be from the same
location.
Reference dbSNP records will be mapped to external resources or databases
and will also point back to the original dbSNP submitter records.
For further information on this, see "What is a Reference SNP, or rs ID?"
below. As with all NCBI projects, the data in dbSNP will be
freely available to the scientific community and made available in a variety
of forms. Please see the dbSNP
Home Page for further information.
Back to FAQs
3. What classes of genetic
variation are included in the database?
The database has been designed to accept several classes of
genetic variation:
-
SNPs
-
microsatellite repeats
-
small insertion/deletion polymorphisms
dbSNP uses the term "SNP" in the much looser sense of "minor genetic
variation," so there is no requirement or assumption about minimum allele
frequencies for the polymorphisms in the database. Thus the scope of dbSNP
includes disease causing CLINICAL MUTATIONS as well as NEUTRAL POLYMORPHISMS.
Given the current activity in the discovery of general sequence variation,
it is anticipated that SNP markers with unknown selective effects will
be the vast majority of submitted records.
Back to FAQs
4. When was the SNP database
established?
dbSNP was launched in September of 1998.
Back to FAQs
5. Where does the data
come from?
dbSNP accepts submissions from all laboratories and is especially
interested in mutations in genes and/or where additional biological information
is known. Major contributors to the database are laboratories associated
with the National Human Genome
Research Institute (NHGRI) grants program. The NHGRI has funded an
intensive effort to collect 50,000 SNPs in three years. The grant recipients
under this program include genome centers, private extramural research
labs, and private businesses. These groups are working on a variety
of aspects of new SNP discovery, new technologies for SNP detection, and
rapid SNP genotyping in large samples.
In addition to this central contribution to dbSNP, other research labs
and private companies can deposit SNP information to make it easily accessible
to the research community and part of the public domain. We are currently
designing a common data exchange format for SNP data to be used between
the central SNP databases.
Back to FAQs
6. What is the rate at which
dbSNP is growing?
In general, the database grows at a rate of about 90 SNPs per
month; however, large numbers of submissions by large research projects
can cause uneven growth. While dbSNP welcomes data from smaller contributors,
the majority of the data will probably come from a small number of large
projects funded by the NHGRI grants. For this reason, it is expected to
grow in erratic jumps for the next few years. It is expected that there
will be 5,000 - 10,000 SNPs by the end of the first year of the funded
SNP grants. For the current number of SNPs in the database, see the
dbSNP Summary
page.
Back to FAQs
7. Which species are represented
in the database?
Currently only Homo sapiens is represented in
the database; however, the database has been designed to accept mutation
information from any species, not just Homo sapiens.
Back to FAQs
8. How is dbSNP related to the
NIH Polymorphism Discovery Resource?
The database has been designed to accept frequency and/or individual
genotype information from any submitter-defined population, not just
the NIH
Polymorphism Discovery Resource (NIHPDR). The NIHPDR is encouraged
as a resource for the extramural NIH-funded SNP grants to facilitate a
more straightforward evaluation of differing discovery or genotyping
methods using a consistent panel of samples. None of the frequency information
currently in dbSNP comes from the NIHPDR.
Back to FAQs
9. What provisions are made
for quality control?
Data validation can be maintained for both submitted assay
reports and reference SNP objects. At the level of an individual
submitted assay report, dbSNP provides several fields to assess the quality
of the data.
For further information, please see: How
to Submit, Validation fields.
Back to FAQs
Searching
1. How can I search
dbSNP?
dbSNP can be searched either via other NCBI resources or directly.
Via other NCBI resources
We are developing three ways to query the database by integrating it
with other NCBI resources. They are:
a. by gene name/nomenclature association
Query results from the LocusLink database will show a purple
"S" button if SNP records have been mapped to the gene. Clicking
on the "S" will take you to a list of the reference SNP records for that gene
in the LocusLink database.
b. by map location
dbSNP is currently being integrated with GeneMap99 and the integrated
physical maps that are being constructed at NCBI. When integration
is completed, the maps may be browsed for SNP content in user specified
regions of the map. This feature should be ready in mid May.
c. as a BLAST operation on dbSNP using a candidate sequence.
The sequences in dbSNP are currently being formatted to be
searched by BLAST. Users will be able to submit a query sequence
to BLAST, and receive a list of any SNPs in the database that hit the sequence.
This feature will be ready in late May.
Direct searching of dbSNP
Currently, there are six ways to search dbSNP directly:
2. Is it possible to do batch
queries?
There are two means of doing batch queries:
1. Searching batches submitted by individual laboratories.
This method searches groups, or batches, of SNPs submitted by
individual laboratories. Batches are identified by the local batch identification
code, submitter handle, number of SNPs in the submission, and the date of
submission. Batches are displayed by date of submission, most recent
first. See the New
Batches page for further information.
2. Submitting a batch of requests.
This method allows a user to submit a batch of queries, or requests.
The results are then returned to the user's email account in ASN.1,
FASTA, XML, chromosome report, or text flatfile format. The batch of requests
can be submitted as an upload file or entered using a web interface. See
the dbSNP Batch
Query page for further information.
Back to FAQs
3. How can I search for
polymorphisms of a specific gene or chromosome?
It is currently not possible to search for SNPs of a specific
gene or chromosome; however, a report of mapped SNPs sorted by chromosome
and fine map position is available. It is possible to look at the map of
a specific gene or chromosome and search for SNPs within that region.
See "How can I search dbSNP?" for a discussion
of the integrated resource features that will soon address this need.
NCBI is also in the process of integrating dbSNP entries with other
sequence and mapping resources via BLAST
and Electronic PCR (E-PCR)
analysis. This analysis will attempt to associate all SNPs with a nucleotide
sequence record and/or physical map contig. If the SNP is in a gene region,
it will be annotated on the appropriate Reference Sequence or UniGene
cluster.
Back to FAQs
Submission and Withdrawal
1. How can I submit data
to the database?
Independent labs can submit data directly to NCBI by
following the submission procedures and suggestions found on the dbSNP
How
to Submit page.
Back to FAQs
2. What kind of
data is needed to submit STS and SNP data simultaneously?
To submit STS and SNP data simultaneously, it is necessary
to submit a batch file that includes the following sections in the indicated
order.
SECTION | TYPE | SECTION DESCRIPTION
Contact | CONT | Submitter's name, phone number, and other contact information for this datafile
Publication | PUB | List of pre-press, or published articles about the markers
*SNP Method | METHOD | Free text section for description of general method of assay
*SNP Population | POPULATION | Description of population sample
*STS Source | SOURCE | Description of source organism
*STS Protocol | PROTOCOL | PCR protocol components
*STS Buffer | BUFFER | PCR buffer components
STS Record | STS | An STS entry using a SOURCE, PROTOCOL, BUFFER, primers and sequence
*STS Method | METH | A section to label/decode lines in an STS Map section
STS Map Data | MAP | Map information for the STS (and, hence, the SNP)
SNP Record | SNPAssay | An SNP entry using METHOD, STS, alleles and flanking sequence
SNP Frequency | SNPPOPUSE | Frequency data for SNPASSAY in POPULATION
Sections denoted with an asterisk (*) only have to be defined and submitted
once. The other sections carry the particular
details for each SNP, insertion/deletion or microsatellite in a data set.
Details of each section can be found in the dbSTS submission instructions
or the dbSNP submission instructions.
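As a rough illustration of the ordering requirement, the sketch below checks that the section TYPE values read from a batch file follow the order of the table above. The function name and the idea of passing the types in as a plain list are hypothetical, and the sketch assumes that repeated sections of the same type are allowed but that a file never returns to an earlier type; the submission instructions remain the authority on what is actually accepted.

```python
# Section TYPE values in the order listed in the table above.
REQUIRED_ORDER = [
    "CONT", "PUB", "METHOD", "POPULATION", "SOURCE", "PROTOCOL",
    "BUFFER", "STS", "METH", "MAP", "SNPAssay", "SNPPOPUSE",
]

# Compare case-insensitively, since the table itself mixes "SNPAssay" and "SNPASSAY".
RANK = {name.upper(): position for position, name in enumerate(REQUIRED_ORDER)}


def first_out_of_order(section_types):
    """Return the index of the first section that is unknown or breaks the
    table order, or None if the whole sequence is acceptable.

    Assumption (not stated in the FAQ): a type may repeat, but the file
    may not go back to a type that appears earlier in the table.
    """
    highest_rank_seen = -1
    for index, section_type in enumerate(section_types):
        rank = RANK.get(section_type.upper())
        if rank is None or rank < highest_rank_seen:
            return index
        highest_rank_seen = rank
    return None
```

For example, a file whose sections are CONT, PUB, METHOD, STS, STS, SNPASSAY passes, while CONT, STS, PUB is flagged at the third section.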
Back to FAQs
3. Is there a specific name
or format I must use for submission?
Individual laboratories will be assigned a unique handle, which
is a short lab identifier, usually an acronym or abbreviated name.
The handle will allow submissions to be associated with laboratories independent
of the details of who is handling a particular set of submissions from
that laboratory.
For more information on handles, please see: How
to Submit, Handles
Local SNP identifiers need only be unique within a specific handle.
The combination "HANDLE | LOCAL SNP ID" will be unique within the database.
There is a 64 character limit for SNP identifiers.
For more information on identifiers and examples of identifiers, please
see: How to Submit, Identifiers
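The identifier rules above can be sketched as a small pre-submission check. This is only an illustration: the function names, the example handle, and the exact string form of the combined key (a bare "|" with no surrounding spaces) are assumptions, not part of the dbSNP submission format.

```python
MAX_LOCAL_ID_LENGTH = 64  # character limit for SNP identifiers stated above


def make_submitter_key(handle, local_snp_id):
    """Combine a lab handle and a local SNP ID into one key.

    The "HANDLE|LOCAL_ID" string form is an assumption for illustration.
    """
    if not handle or not local_snp_id:
        raise ValueError("handle and local SNP ID must both be non-empty")
    if len(local_snp_id) > MAX_LOCAL_ID_LENGTH:
        raise ValueError(
            f"local SNP ID exceeds {MAX_LOCAL_ID_LENGTH} characters: {local_snp_id!r}"
        )
    return f"{handle}|{local_snp_id}"


def find_duplicate_keys(records):
    """Return combined keys that occur more than once in (handle, local_id) pairs."""
    seen = set()
    duplicates = []
    for handle, local_id in records:
        key = make_submitter_key(handle, local_id)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates
```

Two laboratories can both use the local ID "snp001" without conflict, because the handle is part of the key; the same local ID used twice under one handle is reported as a duplicate.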
Back to FAQs
4. What is the NCBI assay ID, or 'ss'
ID?
The NCBI assay ID, or 'ss' ID, is simply an accession number
assigned by NCBI to submitted SNPs. It has the format NCBI|ss<NCBI ASSAY ID>.
Note that 'ss' is always in lower case.
For more information on ss IDs, please see: How
To Submit, Resource Integration.
Back to FAQs
5. What is a reference SNP, or 'rs'
ID?
A reference SNP ID, or 'rs' ID, is an identification tag assigned
by NCBI to SNPs that appear to be unique in the database. The rs
ID number, or tag, is assigned at submission. Initially, it is expected
that nearly every submission will be assigned an rs ID. As the database
matures, however, submitted SNPs that map to the same location as a previously
submitted SNP will be linked into the reference set of the existing reference
SNP record. These SNP rs IDs will be a set of features that will
be mapped to external resources or databases, including NCBI databases.
The SNP rs ID number will be noted on the records in these external resources
and databases in order to point users back to the original dbSNP records.
A reference SNP record has the format NCBI|rs<NCBI SNP ID>.
Note that 'rs' is always in lower case.
For more information on rs IDs, please see: How
To Submit, Abstract Report.
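The two ID formats described in this FAQ, NCBI|ss<NCBI ASSAY ID> and NCBI|rs<NCBI SNP ID>, can be recognized with a short parser. This is a minimal sketch: the function name is hypothetical, and it assumes the ID part is a plain decimal number, which the FAQ implies but does not state outright.

```python
import re

# 'ss' and 'rs' must be lower case, as noted above; the numeric ID is an assumption.
_ID_PATTERN = re.compile(r"NCBI\|(ss|rs)(\d+)")


def parse_dbsnp_id(text):
    """Split an 'NCBI|ss123' or 'NCBI|rs123' string into (kind, number)."""
    match = _ID_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a recognized ss or rs identifier: {text!r}")
    return match.group(1), int(match.group(2))
```

With this sketch, "NCBI|ss123" parses to ("ss", 123), while "NCBI|SS123" is rejected because the prefix is not lower case.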
Back to FAQs
6. What is the official nomenclature
for human genes?
There is currently no official nomenclature for human genes;
however, The Human
Gene Nomenclature Committee is currently trying to establish a nomenclature
standard and does have a recommended format. The Human Gene Nomenclature
Committee is the accepted authority for establishing these standards. For
new genes lacking official nomenclature the research community is encouraged
to use the Nomenclature Committee web form to submit a proposed gene symbol
and name, thus creating a community-provided name. In general, the
research community tries to use pre-existing names, but
these names might not be the current official names, so situations
do arise where a single gene is called by multiple names. There
is no enforcement of this suggested nomenclature method and investigators
are free to name a gene as they wish.
There is no standard 'format' for the official gene name, but, for human
genes, the official gene symbol (an abbreviation) does have the standardized
format of capitalizing all alphabetic characters and excluding
non-alphanumeric characters. For example, official symbols might look
like ABC3 but do not look like ABC(3).
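The symbol format described above can be expressed as a simple pattern check. The sketch below encodes only the rule as stated in this FAQ (upper-case letters and digits, nothing else); the function name is hypothetical, and the Nomenclature Committee's own guidelines may contain further rules or exceptions that are not captured here.

```python
import re

# Upper-case ASCII letters and digits only, per the rule stated above.
_SYMBOL_PATTERN = re.compile(r"[A-Z0-9]+")


def looks_like_official_symbol(symbol):
    """Return True if the symbol uses only capital letters and digits."""
    return _SYMBOL_PATTERN.fullmatch(symbol) is not None
```

Under this check ABC3 is accepted, while ABC(3) and abc3 are not.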
Back to FAQs
7. What is the "hold until
published (HUP)" policy?
dbSNP data cannot be held confidential until publication. Note
that dbSTS and dbSNP have different "hold until published" or HUP policies.
Submissions to dbSTS can be withheld from public view until the accession
number is published. dbSNP records, however, will be available for public
inspection when the submission process is complete, even in the case of
simultaneous dbSNP/dbSTS submissions. STS submissions that require
HUP treatment should be submitted separately, and prior to the SNP submission.
Back to FAQs
8. Is it possible to withdraw
submissions from the database?
A record cannot be completely deleted from the database.
The submission identification number of all records, and the rs ID number
of abstracted reference SNP records, are incorporated into other databases
(Please see "What is the NCBI assay ID or 'ss' ID?"
and "What is a reference SNP or 'rs' ID?" above).
These ID tags indicate source information of SNPs to users of other databases
that reference the data in question. These ID tags point the users
back to dbSNP. Also, other laboratories can submit records to dbSNP that
refer to SNPs already in the database. Because a submission quickly becomes
permanently associated with other records both within dbSNP and in other
databases it is not possible to completely eliminate it from the database.
A record can be marked as "withdrawn," however, so that a query of that
SNP will indicate that the submitter has chosen to withdraw that data.
Back to FAQs
Downloading
1. Is it possible to download
dbSNP?
dbSNP is available for downloading via the NCBI
FTP server in three formats:
Linking to dbSNP
1. Is it possible to create
HTML links to a particular dbSNP record?
Yes. If you have a local website with more data related to
a particular SNP, it may be useful to create a link from the local site
to the ss id number of a chosen SNP record. To do so, use the following
URL:
http://www.ncbi.nlm.nih.gov/SNP/snp_retrieve.cgi?subsnp_id=
Set the "subsnp_id" to the chosen NCBI Assay ID number.
For example, if the NCBI Assay Id number is 123, set the URL as:
http://www.ncbi.nlm.nih.gov/SNP/snp_retrieve.cgi?subsnp_id=123
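A local site that holds many assay IDs could generate these links programmatically. The helper below is a minimal sketch; its name is hypothetical, and it assumes the NCBI Assay ID is a positive integer, as in the example above.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.ncbi.nlm.nih.gov/SNP/snp_retrieve.cgi"


def subsnp_link(assay_id):
    """Build the link to a dbSNP record from its NCBI Assay (ss) ID number."""
    assay_id = int(assay_id)  # accepts 123 or "123"; rejects non-numeric input
    if assay_id <= 0:
        raise ValueError("assay ID must be a positive integer")
    return f"{BASE_URL}?{urlencode({'subsnp_id': assay_id})}"
```

Calling subsnp_link(123) reproduces the example URL given above.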
Back to FAQs
Data Processing and Summary
Measures
1. How variations are
mapped to genome sequence
When reference genome assemblies are available, we use them as anchor
sequence to place refSNP clusters into a genomic context. We clean dbSNP flanking
sequences with RepeatMasker and then remap them to the most current build of
each genome using MegaBLAST. The mapping results then define a new non-redundant
set of variations for the genome.
In general a word size of 28 is used in MegaBLAST computations, but a small
subset of our data has a half flank (i.e. 5' or 3' flanking sequence taken individually
as MegaBLAST query sequence) size of 25 bases and this is blasted with a word
size of 22. To map a deletion we require that both flanking sequences are returned
in the alignment, and furthermore both penultimate bases flanking the allele
are returned. In other words, the gap as defined in the alignment exactly matches
the deletion as defined in dbSNP.
The complete command line from MegaBLAST is:
megablast -U T -F m -J F -X 180 -r 10 -q -20 -P 1000 -R T -W 28
In addition, we filter the MegaBLAST results into two classes in the dbSNP database:
0) better than 95% sequence alignment with fewer than 6 mismatches
1) better than 75% alignment with less than 3% mismatches.
Anything below the lower threshold is discarded.
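The two filter classes can be sketched as a small function. This is an illustration under stated assumptions, not dbSNP's actual code: the function name is hypothetical, "sequence alignment" is taken to mean the aligned fraction of the query length, and the mismatch percentage is taken relative to the aligned length. The FAQ does not define either quantity precisely.

```python
def mapping_class(query_length, aligned_length, mismatches):
    """Return 0 or 1 for the two classes described above, or None if discarded.

    Assumptions (not confirmed by the FAQ):
      - alignment percentage = aligned_length / query_length
      - mismatch percentage  = mismatches / aligned_length
    """
    if query_length <= 0 or aligned_length <= 0:
        return None
    aligned_pct = 100.0 * aligned_length / query_length
    mismatch_pct = 100.0 * mismatches / aligned_length

    # Class 0: better than 95% alignment with fewer than 6 mismatches.
    if aligned_pct > 95.0 and mismatches < 6:
        return 0
    # Class 1: better than 75% alignment with less than 3% mismatches.
    if aligned_pct > 75.0 and mismatch_pct < 3.0:
        return 1
    # Anything below the lower threshold is discarded.
    return None
```

Note that a long, nearly complete alignment with 6 or more mismatches can still fall into class 1 if its mismatch rate is under 3%.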
A non-negligible subset of dbSNP refSNPs (rs) fails to map because of heavy repeat masking
on the flanking sequence. When variations align at more than one site on the
genome, they sometimes map with distinct mismatch counts. dbSNP does not make
any effort to reduce map redundancy by comparing individual quality scores at
each site.
Generally, dbSNP reports all mapping results against the current assembly. This
is certainly not, however, everything in dbSNP. There are three major cases
where we do not map and/or annotate:
a. Submissions that are completely masked as repetitive
elements. These are dropped from any further computations. This set of
refSNPs is placed in chromosome "rs_chMasked" on our ftp site.
b. Submissions that are defined in a cDNA context
with extensive splicing. These SNPs are typically annotated on refseq mRNAs
through a separate annotation process. We are working to reverse map these variations
back to contig coordinates, but that has not been
implemented. For now, you can find this set of variations in "rs_chNotOn" on
the ftp site.
c. Variations with excessive hits to the genome.
Variations with 3+ hits to the genome are neither annotated as variation features
on contigs nor included in variation tracks for either the NCBI or UCSC map viewer
resources. These data are in "rs_chMulti" on our ftp site.
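The three cases above can be summarized as a routing function that names the ftp set a refSNP would land in. This is only a sketch: the function and parameter names are hypothetical, and the order in which the cases are tested is an assumption, since the FAQ does not say which case takes precedence when more than one applies.

```python
def ftp_bin(fully_masked, cdna_context, genome_hits):
    """Return the ftp set name for a refSNP that is not mapped or annotated,
    or None if none of the three cases above applies.

    The precedence of the checks is an assumption made for illustration.
    """
    if fully_masked:
        return "rs_chMasked"   # case a: completely masked as repetitive
    if cdna_context:
        return "rs_chNotOn"    # case b: defined in a cDNA context with splicing
    if genome_hits >= 3:
        return "rs_chMulti"    # case c: 3+ hits to the genome
    return None
```

A refSNP with one or two genome hits and no masking or cDNA issue returns None here, meaning it is not excluded by any of the three cases.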
Furthermore, the heuristics for non-SNP variations (i.e. named elements and
STRs) are probably a bit too conservative so some of these are consequently
lost. While we prefer to err on the side of caution to avoid false annotation
of variation in inappropriate locations, we are working to improve the success
hit-rate for these variations as well.
2. Why
do the functional classifications for some variations change when a genome is
re-assembled?
Functional annotation varies from build to build because the underlying
substrate, namely the reference genome sequence, is itself changing from assembly
to assembly. During each assembly, the algorithms used to define 'genes' are
refined to improve accuracy. Since gene features can be defined by various classes
of evidence that vary in their certainty, there is currently some thrashing
in estimates of gene numbers and their precise exon structure on the genome.
Duplicates are identified and merged, spurious annotations are removed, and
new evidence is included as the annotation pipelines are developed.
The net result for the dbSNP user is that SNPs may be in an exon (or gene more
generally) in one build, and in an intron or UTR (or intergenic DNA) in the
next build if the exon (gene) is subsequently removed. The genome sequencing
community is converging on a stable reference sequence. When it is finished,
annotation (including SNP function) will be much more stable.
3. How average
heterozygosity is computed
Average heterozygosity is computed for each refSNP cluster as described
here.
Back to FAQs
Comments or Questions?
If you have further questions, comments or suggestions for improvement
of this FAQ page or dbSNP please contact the NCBI Service Desk at
info@ncbi.nlm.nih.gov
FAQ page updated: October 1, 2002