The table
that lists the schema change has three columns.
The first two columns show the build id. The third column has change
description.
Table names are
in bold. Only columns are indented and only columns with changes are shown
under the table.
Please see
data dictionary for full table details.
‘ -- -- --‘ means no
corresponding objects ( table or column) for the build.
For example,
in build 118, a new table “SNPAlleleFreq” is added.
Date last
updated: March 2, 2007
-
For
human, SubInd table is replaced with a partitioned view of a set of underlying
tables by chromosomes in two separate databases to keep each database size
small for ease of maintenance: human_gty1 has SubInd for chromosome 1-7. human_gyt2 has SubInd
for chromosome 8-22, X, Y, Mitochondrial.
-
Table
name format is: SubInd_ch1, SubInd_ch2 etc. SubInd_Un is for any snp that is mapped
to a contig but the contig’s chromosome is unknown; SubInd_NotOn is for any snp
that can not be aligned to any contigs.
Date last
updated: September 28, 2005
-
Major
changes are in alignment types in mapping tables. Detailed description of new mapping tables is
in: ftp://ftp.ncbi.nlm.nih.gov/snp/specs/b125_mapping.doc
-
The
other major change is that starting b125 the old dbSNP database is split by
organisms into 39 different snp databases with identical schema.
-
Except
mapping table changes, other changes are minimal. Please check back around October 6, 2005 for
updated b125 ER diagram.
Date last
updated: November 3, 2004
Build 122 |
Build 123 |
Note |
|
|
|
AlleleMotif |
AlleleMotif |
|
repeat_cnt |
repeat_cnt |
|
Datatype:int |
Datatype:real |
We have submitted
variation as (ACTGTGGGGACACCTCGCCTGCACCCACCTCCCTCCCCCCAGCTCCCGAGC)10.5/11/11.5/12/14/15/21. So repeat_cnt need to real to record the
repeat value of "10.5" in this case. |
|
|
|
ContigInfo |
ContigInfo |
|
contig_acc |
contig_acc |
|
Datatype:varchar(15) |
Datatype:varchar(25) |
To accomate longer contig
accession names. |
|
|
|
-- -- -- |
db_build_pipeline |
This is an internal table
to track tasks. It is in draft state. |
|
|
|
-- -- -- |
dn_Motif_rev |
Motif table has rev_motif_id
to point to the reverse motif sequence in the same Motif table. This table is a denormalized table that
makes it easier to find the reverse form of a Motif. This table is used in internal
computations. |
|
|
|
-- -- -- |
dn_UniGty_allele |
This table is used in
internal computation. |
|
|
|
-- -- -- |
dn_UniVariation_rev |
Similar to dn_Motif_rev.
This table makes it easier to get the reverse form of a univariation string. |
|
|
|
|
|
|
-- -- -- |
hsSupplContigInfo |
dbSNP build 123 human snp are mapped to Genome build
35. To assist users who still work
with NCBI build 34.3 map, we have also mapped the new dbSNP build 123 refSNP(rs) set against build 34.3. The results are in
SNPContigLoc_b34_3. This table stores
the contig information for build 34.3. It has exactly the same structure as
ContigInfo, which has build 35 contig information. |
|
|
|
-- -- -- |
hsSupplMapInfo |
Please see
hsSupplContigInfo note. This table is the "SNPMapInfo" counterpart
for build 123 rs onto build 34.3. SNPMapInfo, of course, is the map
statistics for dbSNP build 123 rs hits on NCBI build 35. |
|
|
|
-- -- -- |
IndGrpByPop |
This table assigns "individual
group such as Asian" to population id. Individual group is a way to
group snp frequency data for analysis. Ind_grp is in SubmittedIndividual,
IndivBySource and Individual table. |
|
|
|
-- -- -- |
IndGrpByPopInd |
This table assigs
"individual group" to each individual id within a population. This
is in addition to the above IndGrpByPop because submitters define population
and sometimes a submitted population contains individuals from several
"Individual groups". |
|
|
|
IndMergeArch |
IndMergeArch |
We assign individual id
(ind_id) based on submission information for example Coriell number. When
there is no information on submitted individual id, we assign new ind_id.
Later we may get Coriell numbers and find out that several separate ind_id
are actually one.
We merge these ind_id. The merge process is saved in this table. |
old_ind_id |
-- -- -- |
|
|
|
|
-- -- -- |
ind_id_high |
old_ind_id-> new_ind_id
was in previous build. To be more precise, we changed to column name to:
ind_id_high and ind_id_low. Since when we merge ind_id, we always keep the
ind_id that has the lowest number. |
|
|
|
new_ind_id |
-- -- -- |
|
|
|
|
-- -- -- |
ind_id_low |
|
|
|
|
-- -- -- |
build_id |
|
|
|
|
-- -- -- |
create_time |
|
|
|
|
-- -- -- |
IUPACna |
This is a table that
assists simple computation of IUPack bases. |
|
|
|
|
|
|
--
-- -- |
mmContigExon_35_1 |
Started mapping of mouse
to NCBI build 35.1. |
|
|
|
|
|
|
RsMergeArch |
RsMergeArch |
|
build_id |
build_id |
smallint ranges from -32,768 - 32,767. Enough for our
build_id. A quick way to save space. |
Datatype:int |
Datatype:smallint |
|
|
|
|
-- -- -- |
rsCurrent |
When an rs
merges away several times, we need to trace rsHigh->rsLow several time to
get the current rs. For convenience, rsCurrent and orien2Current are added.
If an rs only merged once, then rsLow is the same as
rsCurrent. |
-- -- -- |
orien2Current |
see above. |
|
|
|
SNP |
SNP |
|
avg_heterozygosity |
avg_heterozygosity |
|
Datatype:float |
Datatype:real |
"real"
offers plenty precision for the field and saves storage space for this table. |
|
|
|
het_se |
het_se |
|
Datatype:float |
Datatype:real |
see above. |
|
|
|
avg_max_heterozygosity |
-- -- -- |
Remove not used field to
save space. |
snp_class |
-- -- -- |
snp_class is the class for the variation type. This is replace by field univar_id which points to the unified
variation string for the snp in UniVariation table. Use
UniVariation.subsnp_class for this information instead of the old snp_class. |
-- -- -- |
exemplar_subsnp_id |
For a
refSNP exemplar ss, we use to
use view SNPSeqPick. When querying large sets of data, using SNPSeqPick is inefficient and
slow. Add this new field simplifies many queries that need to get exemplar ss
data for an rs, for example the query to get rs
fasta. |
-- -- -- |
univar_id |
This univar_id points to the unified
variation string for the snp in UniVariation table. This makes it easy to get
all the alleles for an rs. |
-- -- -- |
cnt_subsnp |
A refSNP is a cluster of subsnp(ss). Cnt_subsnp is an important property of rs.
Although it can be derived from SNPSubSNPLink table, it is convenient to save
it here. This is a tradeoff between space and speed. |
|
|
|
SNP_mRNA_pos |
-- -- -- |
This table used to store a
snp position on an mrna_acc. The position is derived from ContigExon. But due
to gaps and alignment exceptions, the position in this table has not been
precise. This table is only used to
link to NCBI SeqViewer when a snp is on mRNA. It meant to center the
SeqViewer on the snp. But since mRNA sequence is usually short enough to page
through, we rather simply bring the mRNA SeqViewer on the first position than
keep an inprecise position. We may store this information again in the future
when better ways to compute is found. |
|
|
|
SNPAllele |
-- -- -- |
This table is unnecessary
now that we have univar_id in SNP and we have UniVariAllele table which lists
all alleles for each unified variation. For ex. For '(CT)3/4/5/6',
UniVariAllele will have allele_id for (CT)3, (CT)4, (CT)5 and (CT)6. |
|
|
|
-- -- -- |
SNPAncestralAllele |
We used to have
ancestral_allele_flag in SNPAllele table. Since SNPAllele table can be
dropped, we need this table to save the ancestral allele data. |
|
|
|
SNPContigLoc |
SNPContigLoc |
|
allele |
allele |
|
Datatype:varchar(255) |
Datatype:varchar(1200) |
New mapping program in
build 35 can find alignment with bigger gaps, so allele length needs to be
increased. We see that most of the long alleles contain repeats. In the
future we may try to condense the allele using repeat string and number the
way we do with STR in variation string. |
|
|
|
-- -- -- |
SNPContigLoc_b34_3 |
This table has exactly the
same structure as SNPContigLoc, which has dbSNP build 123 rs hits on NCBI
build 35 and other alternative assemblies. But this table SNPContigLoc_b34_3
has dbSNP build 123 rs hits on NCBI build 34.3 to assist users who still use
the 34.3 genome maps. |
|
|
|
|
|
|
SNPFlankStatus |
-- -- -- |
Starting build 123,
mapping programs use megablast internal repeat masker. Before build 123,
dbSNP repeat masks flank sequences and save the masking status in
SNPFlankStatus. So this table is no
longer needed. |
|
|
|
|
|
|
|
|
|
SNPHWProb |
SNPHWProb |
|
ind_cnt |
ind_cnt |
|
Datatype:int |
Datatype:smallint |
smallint ranges from -32,768 - 32,767. Enough for this
field. A quick way to save space. |
|
|
|
-- -- --
|
SNPIndGtyFlag |
Since a snp is a cluster
of ss and an ind_id is a cluster of submitted_ind_id. When we have individual
genotype for several ss and several submitted_ind_id that cluster to one rs and one ind_id, individual genotype should be same. But
we have found out many records with different genotype data. This table flags this interesting fact. |
|
|
|
|
|
|
SNPSubSNPLink |
SNPSubSNPLink |
|
comment |
comment |
|
Null Option:NOT NULL |
Null Option:NULL |
Comment field is raraly if
ever used but occupies large space. So we nullified it but kept the field for
compatible with old sql codes. In the future, we may drop this completely. |
|
|
|
SubInd |
SubInd |
|
batch_id |
batch_id |
|
Datatype:int |
Datatype:smallint |
smallint ranges from -32,768 - 32,767. Enough for this
field. A quick way to save space. |
|
|
|
submitted_ind_id |
submitted_ind_id |
|
Datatype:int |
Datatype:smallint |
smallint ranges from -32,768 - 32,767. Enough for this
field. A quick way to save space. |
|
|
|
|
|
|
SubSNP |
SubSNP |
We added three derived
fields to SubSNP table to speed up data dump by organism, chromosome, rs and
then within each rs cluster, property of each ss. |
-- -- -- |
snp_id |
|
-- -- -- |
tax_id |
|
-- -- -- |
chr_id |
|
|
|
|
SubSNPHWProb |
-- -- -- |
Hardy-Weinberg probability
(HWP) are computed at each rs level (SNPHWProb) and at each submitted ss and
population level( FreqSummaryBySsPop). This table is therefore dropped. |
|
|
|
SubSNPRepMask |
-- -- -- |
We dropped this table for
the same reason in SNPFlankStatus note.
Starting from dbSNP build 123, mapping uses megablast internal repeat
mask, so we do not track the mask status of submitted snp seqeunce with
dbSNP. |
|
|
|
-- -- -- |
UniVariAllele |
For each unified variation,
this table lists all alleles within the variations. Please see SNP.univar_id
note for more details. |
|
|
|
UniVariation |
UniVariation |
This table keeps all
unique variation. |
allele_cnt |
allele_cnt |
|
Datatype:int |
Datatype:smallint |
smallint is enough. No need for int. |
|
|
|
-- -- -- |
src_code |
This code is a foreign key
to UniVariationSrcCode. It tracks the source the first time a univariation is
added. For ex. The variation string is added for a reverse form of a
variation. |
|
|
|
-- -- -- |
rev_univar_id |
Added this field so it is
easy to find the reverse form of a univariation. |
|
|
|
-- -- -- |
UniVariationSrcCode |
Please see
UniVariation.src_code and the table data for detail. |
|
|
|
-- -- -- |
VariComplexAllele |
This is an internal table
to handle complex variation forms. |
Date last
updated: August 18, 2004
Build 121 |
Build 122 |
Note |
-- -- -- |
ChiSqPValueLookUp |
This table replaces
ChiSqPValue. This is used to get the p-value from ChiSq when computing H-W
Prob. |
|
|
|
dn_Allele_rev |
dn_Allele_rev |
This table is used internally
for processing. |
rev_flag |
rev_flag |
|
Datatype:bit |
Datatype:tinyint |
|
|
|
|
-- -- -- |
dn_Gty_rev |
Used for internal
processing. |
|
|
|
dn_snpFxnCnt |
dn_snpFxnCnt |
This is a summary table
showing count of snp in each functional class(intron,
coding etc). |
-- -- -- |
tax_id |
Added tax_id to this table
since we start to have genomes for more organisms. |
|
|
|
dn_table_rowcount |
dn_table_rowcount |
This table lists the row
count of each table. When users re-create their own dbSNP, they can use this
table to check if they data were loaded OK. |
org_prefix |
-- -- -- |
org_prefix and tax_id is not used yet. Currently, the table
rowcount just the rowcount for a table. In the future when we have organism
division databases, we may have tables shared by multiple organism.
tax_id field will be used to show row count per
tax_id. |
-- -- -- |
tax_id |
|
|
|
|
FreqSummaryBySsPop |
FreqSummaryBySsPop |
Column type changes for this
table. Float to real to save storage space. |
source |
source |
|
Datatype:char(2) |
Datatype:varchar(1) |
We compute H-W prob from
either individual genotype data(I), or just genotype
frequency (P). So we only need varcchar(1). |
|
|
|
chisq |
chisq |
|
Datatype:float |
Datatype:real |
|
Null Option:NOT NULL |
Null Option:NULL |
|
|
|
|
df |
df |
|
Null Option:NOT NULL |
Null Option:NULL |
|
|
|
|
hwp |
hwp |
|
Datatype:float |
Datatype:real |
|
Null Option:NOT NULL |
Null Option:NULL |
|
|
|
|
-- -- -- |
ggaContigExon |
The new set of chicken
mapping table are added. They all have prefix 'gga'. Organism prefix is defined in
OrganismTax.prefix column. |
|
|
|
-- -- -- |
ggaContigInfo |
|
|
|
|
-- -- -- |
ggaMapLink |
|
|
|
|
-- -- -- |
ggaMapLinkPid |
|
|
|
|
-- -- -- |
ggaMapList |
|
|
|
|
-- -- -- |
ggaSNPContigLoc |
|
|
|
|
-- -- -- |
ggaSNPContigLocusId |
|
|
|
|
-- -- -- |
ggaSNPMapInfo |
|
|
|
|
-- -- -- |
IndGrpCode |
Added this new table to
group individuals. See the data in IndGrpCode for a list of groups.
Currently, the grouping only applies to human. |
|
|
|
IndivBySource |
IndivBySource |
|
-- -- -- |
src_ind_grp |
This column stores
grouping supplied by sample source ,such as Coriell. |
|
|
|
Individual |
Individual |
|
-- -- -- |
ind_grp |
This column stores the
grouping code assigned by dbSNP. |
|
|
|
IndivSourceCode |
IndivSourceCode |
|
-- -- -- |
src_type |
Added this column to
distinguish various sources. For example, Coriell's src_type is
"repository" and CEPH is "curator". |
|
|
|
-- -- -- |
display_order |
One individual often has
several id from different source. This value will be
used to order the display. |
|
|
|
-- -- -- |
IndMergeArch |
As we get more individual
sample information, we may see that previously two separated individuals are
actually the same. This table keeps individual merging history. This is not
used yet. |
|
|
|
|
|
|
miscMapLink |
miscMapLink |
|
-- -- -- |
i_tax_id |
For organisms with no NCBI
genome assembly, the hit to GenBank accessions are stored in this table.
Since there're multiple organisms, tax_id was added. But since tax_id is in
SNP table, we could get tax_id by joining SNP and this table. So this column
may be unnecessary and be taken out in future schema modifications. |
|
|
|
Publication |
Publication |
|
asn |
-- -- -- |
This field was not used. |
|
|
|
SNPHWProb |
SNPHWProb |
This table stays basically
the same except the column name listed below. The column name is not
consistent with table FreqSummaryBySsPop which has H-W prob for each ss over an population. |
degreeFreedom |
df |
|
|
|
|
chi_sq |
chisq |
|
|
|
|
hw_prob |
hwp |
|
|
|
|
-- -- -- |
ind_cnt |
Added column ind_cnt for
individual count. |
|
|
|
source |
-- -- -- |
This field is removed
because it is not as meaningful for an refSNP(rs) as
the source in FreqSummaryBySsPop: some member ss may have individual genotype
while other may only have genotype frequency. |
|
|
|
SubmittedIndividual |
SubmittedIndividual |
|
-- -- -- |
loc_ind_grp |
loc_ind_grp will store the individual group name submitted by
submitter. |
|
|
|
-- -- -- |
UniVar_Valid_Allele |
This table is used
internally. |
Date last
updated: June 2, 2004
To keep dbSNP in manageable size, we have partitioned four
large tables by ss# and moved the data with ss# less than 20 million to a
supplemental database called dbSNP_sup.
The four large tables that have been partitioned are:
SubSNPSeq3, SubSNPSeq5, SubSNPCommLine and SubSNPAcc.
For ss# between 1 and 10,000,000, it is stored in table
<tablename>_p1, where <tablename> represents the above four tables.
For ss# between 10,000,001 and 20,000,000, it is stored in
table <tablename>_p2
Currently, we do not have ss# over 30,000,000, when we do, we will create <tablename>_p3 for ss# between
20,000,001 and 30,000,000,
For example:
SubSNPSeq3_p1 has 3’ sequence for all ss# between 1 and
10,000,000.
SubSNPSeq3_p2 has 3’ sequence for all ss# between 10,000,001
and 20,000,000.
SubSNPSeq3_p1 and SubSNPSeq3_p1 still have the same column
definitions as the original SubSNPSeq3.
In dbSNP, for ss# over 20,000,000, we still store them in dbSNP, the main database. We renamed these four tables by adding a suffix “_ins” to note that we will be add new data into these tables as new submissions come in. The original names for these four tables are now partitioned views across dbSNP and dbSNP_sup having all ss#. For more details on partitioned view,
This change will be reflected in ftp bcp files. Using SubSNPSeq3 as an example, instead of having a huge file for SubSNPSeq3 prior to this buid, we now have bcp file for SubSNPSeq_p1, SubSNPSeq_p2 and SubSNPSeq_ins. Between builds, sequence for old snp seldom change (they only change when submitter updates them which is rare). We will have separate small update file for SubSNPSeq3_p1 and SubSNPSeq3_p2. So for each new build, users only need to get the new bcp file from SubSNPSeq3_ins.
Date last
updated: April 24, 2004
Note: ER diagram and Data dictionary for build 120
will be updated on soon.
Build 119 |
Build 120 |
Note |
agContigInfo |
agContigInfo |
Please see change note in
"Build 117 Schema Change - ContigInfo" in this document. |
-- -- -- |
group_term |
|
-- -- -- |
group_label |
|
-- -- -- |
contig_label |
|
|
|
|
agSNPMapInfo |
agSNPMapInfo |
Please see change note in
"Build 117 Schema Change-SNPMapInfo" in this document. |
-- -- -- |
hap_cnt |
|
-- -- -- |
unplaced_cnt |
|
-- -- -- |
ref_cnt |
|
-- -- -- |
assembly_cnt |
|
|
|
|
-- -- -- |
dn_table_rowcount |
Added this table to simply
store the row count for each table. So users who create local dbSNP can
compare the counts. |
|
|
|
-- -- -- |
dn_validationSummary |
An internal validation
table. |
|
|
|
Individual |
Individual |
SubmittedIndividual has tax_id
due to annotation process. It is possible several submitted individuals are
from the same sample. This curated sample data is in Individual table. So to
normalize data, tax_id
should be in Individual table. Currently, we have tax_id in both SubmittedIndividual
and Individual tables. Eventually, we will only have tax_id in Individual
table. |
-- -- -- |
tax_id |
|
|
|
|
SNPHapMapPriority |
SNPHapMapPriority |
|
priority_code |
priority_code |
|
Datatype:tinyint |
Datatype:smallint |
HapMapClassCode has 9
codes, so priority_code needs to be a smallint. |
Date last
updated: Jan. 30, 2004
Note: ER diagram and Data dictionary for build 119
will be updated on soon.
Please see
most recent data dictionary at: ftp://ftp.ncbi.nlm.nih.gov/snp/mssql/schema/DataDictionary_b118.html
Build 118 |
Build 119 |
Note |
|
|
|
-- -- -- |
dgContigInfo |
Added 'dg' prefixed tables
for dog snp mapping data. |
-- -- -- |
dgSNPContigLoc |
|
-- -- -- |
dgSNPContigLocusId |
|
-- -- -- |
dgSNPMapInfo |
|
|
|
|
-- -- -- |
dn_chr_locusList |
This table is a
de-normalized table, storing list of locus_id for each contig chromosomes.
This is used in individual genotype summary page. |
|
|
|
-- -- -- |
dn_gty_rsCnt_byChr |
This table is a
de-normalized table, storing genotyped snp count for each individual on
chromosomes. |
|
|
|
-- -- -- |
dn_gty_rsCnt_byLocus |
This table is a
de-normalized table, storing genotyped snp count for each individual on a
locus. |
|
|
|
MethodClass |
MethodClass |
|
|
|
|
-- -- -- |
validation_status |
Added validation_status.
Value 0 means the method is computational. 1 - method
can be considered validating when multiple ss
are in the same rs cluster. |
|
|
|
|
|
|
SNPContigLocusId |
SNPContigLocusId |
Added mrna_ver and
protein_ver. Contig_acc, mrna_acc and protein_acc fields are changed to varcahr(15) to be consistent with accession fields in
other tables such as SNPContigLoc. The
following int field are changed to tinyint:
contig_ver, fxn_class, reading_frame to save space. |
|
|
|
SubmittedIndividual |
SubmittedIndividual |
|
-- -- -- |
loc_ind_id |
Submitter individual id
can be case sensitive. For ex. Mouse strain name use case in special ways.
This field is needed to preserve case information. |
Date last updated: Nov. 14, 2003
Please see data dictionary
at: ftp://ftp.ncbi.nlm.nih.gov/snp/mssql/schema/DataDictionary_b118.html
Build 117 |
Build 118 |
Note |
|
|
|
-- -- -- |
dgMapLink |
Added mapping data for
dog. |
|
|
|
-- -- -- |
dgMapLinkPid |
Added mapping data for
dog. |
|
|
|
dn_batchCount |
dn_batchCount |
|
build_id |
-- -- -- |
We will keep snp counts in
each batch for current batch only. So there is no need for this column. |
|
|
|
dn_handleCount |
dn_handleCount |
|
build_id |
-- -- -- |
We will keep snp counts in
each batch for current batch only. So there is no need for this column. |
|
|
|
dn_IND_batchCount |
dn_IND_batchCount |
|
build_id |
-- -- -- |
We will keep snp counts in
each batch for current batch only. So there is no need for this column. |
|
|
|
SubSNPGenotypeSum |
FreqSummaryBySsPop |
This table is renamed to
be consistent with the two tables used to compute this summary:
GtyFreqBySsPop and GtyFreqBySsPop. The new name clearly shows that the data
ins for each SubSNP within a specific population. |
source |
source |
|
Datatype:char(1) |
Datatype:char(2) |
Use char(2): GF - source data
is genotype frequency, IG - source data is individual genotype, AF - source
data is allele frequency. |
non_founder_cnt |
non_founder_ind_cnt |
Renamed so it is clear
that the number in this column count of individuals, not chromosomes. |
-- -- -- |
het |
heterozygosity |
-- -- -- |
het_se |
standard error for the
above heterozygosity. |
|
|
|
mmContigInfo |
mmContigInfo |
The changes in this table
is the same in ContigInfo. Please see change description for ContigInfo in
build 117. |
contig_class |
-- -- -- |
|
-- -- -- |
group_term |
|
-- -- -- |
group_label |
|
-- -- -- |
contig_label |
|
|
|
|
mmSNPContigLocusId |
mmSNPContigLocusId |
|
-- -- -- |
mrna_ver |
Added this to keep track
the version of the mrna accession, like the contig_ver. |
-- -- -- |
protein_ver |
Added this to keep track
the version of a protein accession, like the contig_ver. |
|
|
|
mmSNPMapInfo |
mmSNPMapInfo |
The changes in this table
are the same in ContigInfo. Please see change description for ContigInfo in
build 117. |
-- -- -- |
ref_cnt |
|
-- -- -- |
assembly_cnt |
|
|
|
|
-- -- -- |
SNPAlleleFreq |
This new table keeps the
average allele frequency for refSNP. |
|
|
|
-- -- -- |
SNPGtyFreq |
This new table keeps the
average genotype freqeuncy for refSNP. |
|
|
|
SubmittedIndividual |
SubmittedIndividual |
|
-- -- -- |
loc_ind_alias |
Added this column to track
the different id for a submitted individual for genotype data. |
|
|
|
t_vwSNP_avgFreq |
-- -- -- |
This is replaced with
SNPAlleleFreq. |
|
|
|
t_vwSNP_MinorAlleleCnt2More |
-- -- -- |
This table is no longer needed. |
|
|
|
t_vwSubPopGty_alleleFreq |
-- -- -- |
This table is replaced
with GtyFreqBySsPop. |
|
|
|
t_vwSubSNP_validationByMinorAlleleCnt |
-- -- -- |
This table is no longer
needed. |
|
|
|
Date last updated:
Note: Data dictionary and ER diagram for build 117 will be available on Oct.17, 2003.
Build 116 |
Build 117 |
Note |
|
|
|
Allele |
Allele |
Removed 3 columns in Allele
table because an allele may have many motifs (ex. (GAT)10AAT(GAT)5). So keep
allele motif relationship in enw table AlleleMotif. |
motif_id |
-- -- -- |
|
repeat_cnt |
-- -- -- |
|
rev_allele |
-- -- -- |
|
-- -- -- |
last_updated_time |
|
|
|
|
-- -- -- |
|
Added this table to save
computed allele frequency per subsnp and population. The frequency computed
from individual genotype takes the highest precedence which then takes
precedence over the allele frequency computed from genotype frequency. When
none of the above data is available, we use the submitted allele frequency. |
|
|
|
-- -- -- |
AlleleMotif |
Added this table to keep the
relationship between an allele and all its motif. |
|
|
|
ContigInfo |
contig_class was used to track the haplotype data of humans and
the strain data of mouse for each of the various contigs. But it became clear
that different ncbi assembly tasks required different ways of grouping the
alternatives so the single category contig_class was broken out into three
fields. For dbSNP, the group_term gives the best breakdown of the various
alternative forms, but contig_label and/or group_label give a better idea of
the common name associated with haplotype, strain, alternate_assembly, etc. |
|
contig_class |
-- -- -- |
|
-- -- -- |
group_term |
id used to group contigs
together for annotation purposes. This allows |
-- -- -- |
group_label |
Label used to label an
assembly in the MapViewer. |
-- -- -- |
contig_label |
Label used to label a
particular contig in an assembly. For |
|
|
|
-- -- -- |
GtyFreqBySsPop |
Added this table to save
computed genotype frequency per subsnp and population. The frequency computed
from individual genotype takes the precedence over submitted genotype
frequency. |
|
|
|
ObsVariation |
ObsVariation |
|
-- -- -- |
var_flag |
refers to VarFlagCode.code |
|
|
|
PopClassUniqCount |
-- -- -- |
Not used. |
|
|
|
-- -- -- |
rnContigExon |
Added mapping information
for rat, with prefix 'rn' added to mapping tables. This note also applies to
the following 7 tables. |
|
|
|
-- -- -- |
rnContigInfo |
|
|
|
|
-- -- -- |
rnMapLink |
|
|
|
|
-- -- -- |
rnMapLinkPid |
|
|
|
|
-- -- -- |
rnMapList |
|
|
|
|
-- -- -- |
rnSNPContigLoc |
|
|
|
|
-- -- -- |
rnSNPContigLocusId |
|
|
|
|
-- -- -- |
rnSNPMapInfo |
|
|
|
|
SnpFunctionCode |
SnpFunctionCode |
|
-- -- -- |
is_coding |
Added this column for
internal query. |
|
|
|
SNPMapInfo |
SNPMapInfo |
|
-- -- -- |
ref_cnt |
set greater than zero if
the snp maps onto the reference strand. |
-- -- -- |
assembly_cnt |
Counts the number of
alternate assemblies (excluding reference) which the snp hits. |
|
|
|
sts_markers |
-- -- -- |
|
|
|
|
SubPop |
SubPop |
Change some float column
to real to save storage space. |
|
|
|
-- -- -- |
ValidGtyByVariation |
Internal use currently. |
|
|
|
-- -- -- |
VarFlagCode |
Some submitted variation
pattern is problematic. Define codes to track problems. |
Date:
The following details all schema changes between build 116 and 115, sorted by table names.
Table names are in bold. Only columns are indented and only columns with changes are shown under the table. See data dictionary for full table details.
‘ -- -- --‘ means no corresponding objects ( table or
column) for the build.
For example, in build 116,
pop_cnt is added to table dn_batchCount.
Please see ER diagram for
more details: ftp://ftp.ncbi.nlm.nih.gov/snp/mssql/schema/erd_dbSNP.pdf
Please see data dictionary
at: ftp://ftp.ncbi.nlm.nih.gov/snp/mssql/schema/DataDictionary_b116.html
Build 116 |
Build 115 |
Note |
dn_batchCount |
dn_batchCount |
Count to speed up query. |
pop_cnt |
-- -- -- |
For frequency and genotype batch, added
count of population and count of individual. |
ind_cnt |
-- -- -- |
|
|
|
|
dn_handleCount |
dn_handleCount |
|
rs_cnt |
rs_cnt |
Make the column nullable,
so 'NULL' means not computed yet, while '0' means the result. |
Null Option:NULL |
Null Option:NOT NULL |
|
rs_validated_cnt |
rs_validated_cnt |
|
Null Option:NULL |
Null Option:NOT NULL |
|
|
|
|
dn_IND_batchCount |
-- -- -- |
This count table helps
speed up "New Batches" web query. |
|
|
|
-- -- -- |
dn_new_batch |
dn_new_xxx_batch tables
are no longer needed. It was used internally to support query most recent
batches. Now web page uses Batch.submit_time to decide submission order. |
-- -- -- |
dn_new_ind_batch |
|
|
|
|
-- -- -- |
dn_new_nov_batch |
|
|
|
|
-- -- -- |
dn_new_pop_batch |
|
|
|
|
-- -- -- |
dn_new_snp_batch |
|
|
|
|
GtyAllele |
GtyAllele |
The new table makes it
easy to compute allele frequency from genotype submission on forward or
reverse strand. |
-- -- -- |
allele_id |
|
rev_flag |
-- -- -- |
|
fwd_allele_id |
-- -- -- |
|
unigty_id |
-- -- -- |
|
create_time |
create_time |
|
Null Option:NOT NULL |
Null Option: |
|
Default:-- -- -- |
Default:DefCurTime |
|
last_updated_time |
-- -- -- |
|
|
|
|
IndivAlias |
--
-- -- |
Genotype submission
includes individual id. Different submitters sometimes use different
individual id for the same DNA sample.
Ex. Mouse strain 'AKR' or 'AKR/J' are the same. |
|
|
|
IndivBySource |
IndivBySource |
Minor database change |
src_id |
src_id |
|
Datatype:int |
Datatype:numeric(3) |
|
|
|
|
IndivSourceCode |
IndivSourceCode |
Minor database change |
code |
code |
|
Datatype:int |
Datatype:numeric(3) |
|
|
|
|
LoadHistory |
LoadHistory |
|
-- -- -- |
create_time |
Each build releases different
component in slight different time.
Now each individual component release time is stored here. |
ftp_done_date |
-- -- -- |
|
entrez_done_date |
-- -- -- |
|
blastdb_done_date |
-- -- -- |
|
web_date |
-- -- -- |
|
|
|
|
mmSNPMapInfo |
mmSNPMapInfo |
See data dictionary. |
hap_cnt |
-- -- -- |
|
unplaced_cnt |
-- -- -- |
|
|
|
|
ObsGenotype |
ObsGenotype |
See data dictionary. |
-- -- -- |
submitted_strand_code |
|
-- -- -- |
obs_ss_fwd |
|
-- -- -- |
create_time |
|
-- -- -- |
unigty_id |
|
-- -- -- |
allele_id_1 |
|
-- -- -- |
allele_id_2 |
|
|
|
|
ObsVariation |
ObsVariation |
subsnp_clas and
iupack_code are moved to table UniVariation. |
-- -- -- |
subsnp_class |
|
univar_id |
univar_id |
|
Null Option:NULL |
Null Option:NOT NULL |
|
-- -- -- |
iupack_code |
|
|
|
|
OrganismTax |
OrganismTax |
Added division_cd in
preparation to split dbSNP database into several divisions. Divisions are
loosely based on GenBank divisions. |
division_cd |
-- -- -- |
|
Population |
Population |
When curating submitted
individuals, it helps to know the source of the population samples. So src_id is added. For ex. If we know a
population's source is NIHPDR, then we know '1' or 'P001' or 'PD001' is the
same sample in NIHPDR panel. We currently manually set this field based on
population description. |
src_id |
-- -- -- |
|
|
|
|
SNPContigLocusId |
SNPContigLocusId |
|
fxn_class |
fxn_class |
data type changed from int
to tinyint to save storage. |
reading_frame |
reading_frame |
same as above. |
contig_ver |
contig_ver |
same as above. |
|
|
|
SNP |
SNP |
|
-- -- -- |
snp_type |
snp_type was used to
record snp withdraw reason code defined in SnpTypes. We currently do not keep
withdrawn snp in SNP table. Withdrawn snp is in SNPHistory. |
tax_id |
-- -- -- |
tax_id, snp_class and
validation_status was orignianlly in SNP table. From build 111 to build 115,
they were moved to t_vwSNP to faciliate database processing. But during
database size increase in build 116, views proved too slow to merit the
separating these three fields. So starting build 116, these three fields are
back in SNP table. |
snp_class |
-- -- -- |
|
validation_status |
-- -- -- |
|
|
|
|
|
|
|
SnpFunctionCode |
SnpFunctionCode |
|
top_level_class |
-- -- -- |
Added this field to group
the functional class code into "exon" and "other". |
|
|
|
SNPHWProb |
SNPHWProb |
|
chi_sq |
chi_sq |
float' takes 8 byte
storage. 'Real' takes 4 byte. In dbSNP, 'real' offers enough precision. |
Datatype:real |
Datatype:float |
|
hw_prob |
hw_prob |
|
Datatype:real |
Datatype:float |
|
|
|
|
SNPMapInfo |
SNPMapInfo |
|
hap_cnt |
-- -- -- |
|
unplaced_cnt |
-- -- -- |
|
|
|
|
|
|
|
SnpValidationCode |
SnpValidationCode |
|
abbrev |
abbrev |
|
Datatype:varchar(40) |
Datatype:varchar(20) |
With the addition of
double hit validation type, the abbreviation string gets longer. |
create_time |
-- -- -- |
|
last_updated_time |
-- -- -- |
|
|
|
|
-- -- -- |
SNPValBy_Illumina_Apr1003
|
The snp list is this table
is now in SNPVal and used to set the validation_status for the snp. |
|
|
|
-- -- -- |
SNPValBy_PERLEGEN_OCT0802 |
The snp list is this table
is now in SNPVal and used to set the validation_status for the snp. |
|
|
|
StrandCode |
StrandCode |
|
rs_to_ss_orien |
-- -- -- |
See data in this table to
see why this new column is needed. |
|
|
|
-- -- -- |
sts_markers |
|
|
|
|
SubInd |
SubInd |
Streamlined SubInd table
by normalizing observed genotype into ObsGenotype table. |
-- -- -- |
allele_cnt |
|
-- -- -- |
a_bit |
|
-- -- -- |
c_bit |
|
-- -- -- |
g_bit |
|
-- -- -- |
t_bit |
|
-- -- -- |
del_bit |
|
-- -- -- |
oth_bit |
|
-- -- -- |
observed |
|
submitted_rs |
-- -- -- |
|
|
|
|
SubIndDeletedBySubmitter |
-- -- -- |
Added this table for
internal tracking. |
|
|
|
SubPop |
SubPop |
|
subpop_id |
-- -- -- |
Added this column to
shorten primary key, especially since we have added more tables referencing
SubPop table, such as SubPopGty. |
|
|
|
SubPopAllele |
SubPopAllele |
|
subpop_id |
-- -- -- |
See above. In the future
SubPopAllele will use this column to reference SubPop so batch_id,
pop_id,subsnp_id,type will no be needed. These four columns are planed to be
dropped in build 117. |
|
|
|
SubPopDeletedBySubmitter |
SubPopDeletedBySubmitter |
Added this table for
internal tracking. |
-- -- -- |
build_id |
|
delete_build_id |
-- -- -- |
|
|
|
|
SubPopGty |
-- -- -- |
In the past, genotype frequency
were stored in SubPopAllele. But submitted genotype frequency has different
property from submitted allele frequency. So this table is created to keep
just genotype frequency to faciliate computing, for ex, to derive allele
frequency based on genotype frequency. |
|
|
|
SubPopHet |
-- -- -- |
This table is created, but
not populated yet. We plan to move the heterozygosity frequency from
SubPopAllele into this table in build 117. |
|
|
|
SubPopRsMerge |
-- -- -- |
Submitters may submit
frequency on an rs that was already merged to another rs. This table keeps a
record of this fact. Since we always stored frequency on the rs exemplar ss,
the effect of this merging is that the frequency may be stored under a
different ss#. |
|
|
|
SubSNP |
SubSNP |
|
-- -- -- |
hw_prob |
SubSNPHWProb is used for
this values. Less than 5% subsnp has data that make the computation of
Hardy-Weinberg probability possible. Storing this value in SubSNP table
wastes space since SubSNP has over 10 millions rows. |
-- -- -- |
subsnp_type |
This was intended to
indicate the withdraw detail of a subsnp. Now SubSNP table does not keep
deleted snp. Deleted SubSNP is tracked in SubSNPDeletedBySubmitter. |
top_or_bot_strand |
-- -- -- |
This is an experimental
way of identifying strand. |
validation_status |
-- -- -- |
Since build 111, we
started to compute validation status from a view. But with data growth, it is
no longer efficient to compute this value in each query. Maintaining a table copy of view
t_vwSubSNP_validation defeats the beauty of views. So we are returning this
field back into SubSNP. |
|
|
|
SubSNPSeqPos |
-- -- -- |
In build 116, we received
submission of snp with flanking sequence over 200Kb for snp's in repeat
regions and the submitter already know the genome location of the snp. In
this case, storing all flanking sequence does not provide value since the
flanks will be repeat masked out anyway. So we save the accession and the snp
position of the accession. |
|
|
|
SubSNPSeqTypeCode |
-- -- -- |
Explains the meaning of
'type' field in table SubSNPSeq3 and SubSNPSeq5. |
|
|
|
-- -- -- |
t_vwSNP |
|
|
|
|
t_vwSubPopGty_alleleFreq |
-- -- -- |
|
|
|
|
-- -- -- |
t_vwSubSNP_validation |
See note for SubSNP.validation_status. |
|
|
|
t_vwSubSNP_validationByMinorAlleleCnt |
|
|
|
|
|
UniGty |
-- -- -- |
|
|
|
|
UniVariation |
-- -- -- |
|
Date:
- Dropped tables: dn_new_snp_batch, dn_new_pop_batch, dn_new_ind_batch, dn_new_nov_batch
These are internal table that are no longer needed.
- Added tax_id to SubmittedIndividual table because some populations group individuals from different organism.
Date:
Table
|
Column
|
Note
|
ObsGenotype
|
|
This table keeps all submitted genotypes and submitted strand code in
obs and submitted_strand_code columns.
Obs_upp is the all uppercase version of obs. Obs_ss_fwd is the genotype on the forward
strand of the SubSNP. Ex. If genotype
is submitted as A/C|STRAND_SS_REV, then obs has “A/C”, obs_ss_fwd has “G/T”.
|
SubSNPGenoSum is renamed to SubSNPGenotypeSum.
|
|
Source has value “I” or “P”:
“I” means the genotype frequency is summarized from individual genotypes. “P” means the genotype frequency is submitted for a population. |
SubSNPHWProb
|
|
Source has same means as in SubSNPGenotypeSum.source
|
SubPopDeletedBySubmitter
|
|
Added this table to keep track deleted data, so summary statistics
needs to be recomputed for the affected batch and refSNP cluster.
|
t_vwSNP
|
-
Dropped rarely used columns.
- Current columns are snp_id, validation_status, snp_class and tax_id, create_time ( is the time this row is inserted ). |
|
Dn_new_snp_batch
Dn_new_pop_batch Dn_new_ind_batch Dn_new_nov_batch |
|
dropped
|
Dn_new_batch
|
|
Added to replace dn_new_<batch_type>_batch, where
<batch_type> has value of snp,pop,ind or nov.
|
Date:
Table
|
Column
|
Note
|
SubSNP
|
|
To get the observed variation and subsnp_class for a give subsnp, use
ObsVariation.pattern, join SubSNP and ObsVariation on
SubSNP.variation_id=ObsVariation.var_id
|
Batch
|
·
Organism is replaced with
OrganismTax.organism.
|
Join Batch and OrganismTax on tax_id to get the organism of a batch.
|
Date:
There are a few major schema changes between build 110 and build 111.
Table |
Column |
Note |
t_vwSubSNP_validation - replace
SubSNP.validaton_status |
validation_status |
- t_vwSubSNP_validation is a direct copy from view vwSubSNP_validation. Please see vwSubSNP_validation.sql for details. |
SNPSeqPick - This is a view. It replaces
the old table with the same name and columns: SNPSeqPick. |
snp_id, subsnp_id |
- see SNPSeqPick.sql for view details. |
t_vwSNP - replace the same columns in
SNP. |
PCR_confirmed_ind assay_sample_size cnt_subsnp indv_homozygosity_obs_ind num_segregate popdata_sample_size snp_class tax_id total_sample_size validation_status |
t_vwSNP is direct copy from view vwSNP. Please see vwSNP.sql for details on how each field is computed. |
SubPopAllele |
cnt cnt_int |
- cnt is changed to type float. This will help eliminate rounding error in computing average allele frequency etc. - cnt_int will be deleted after cnt is set to be number of chromosomes for allele frequency and number of individuals for genotype and heterozygosity data. |
Table |
Column |
Note |
Further change in Build 112 |
SubSNP |
variation_id |
variation_id is foreign key to ObsVariation (var_id). |
- SubSNP.observed will be replaced with ObsVariation.pattern. - SubSNP.subsnp_class will replaced with ObsVariation.subsnp_class. |
Motif |
motif_id motif, rev_motif_id |
This table keeps all unique motifs in an allele. For ex. An observed variation is (CA)2/3/4/6. Allele table will have four rows: (CA)2, (CA)3, (CA)4, (CA)6. Each row with the same motif_id, referring to one motif record of motif (CA). - rev_motif_id is id of the reverse motif. For ex. (CA)’s reverse motif is (TG). rev_motif_id is the motif_id for motif (TG). |
|
Allele |
allele_id, allele motif_id repeat_cnt |
- This table keeps all unique occurrence of an observed allele in a variation submission. |
|
VariAllele |
var_id, allele_id |
- This table associates a variation with 2 or more alleles. |
|
ObsVariation |
var_id, pattern, subsnp_class |
- This table keeps all unique submitted observed field from the old SubSNP.observed field. |
|
ObsGenotype |
gty_id, var_id, allele_id1, allele_id2 |
For an observed variation, this table lists all possible genotypes. For ex. If an observed variation is ‘A/C’, then there could possibly be three genotypes: A/A, A/C, C/C. |
|
SubInd |
gty_id |
foreign key to ObsGenotype(gty_id) |
- The following columns in SubInd will be replace with gty_id: observed, allele_cnt a_bit, c_bit g_bit t_bit del_bit oth_bit |
SubPopAllele |
allele_id |
foreign key to Allele(allele_id). |
-The following columns in SubPopAllele will be replaced with allele_id: allele, other |
t_vwSNP_avgFreq - This is a table directly copied from view vwSNP_avgFreq. |
snp_id, allele, freq |
|
- This information is previously only available on refSNP page. |
ChiSqPValue |
chi_sq df pvalue |
This is a lookup table with chi-square probability distribution use by Hardy-weinberg probability computation. |
|