Schema Change Document Description:

The table that lists the schema change has three columns. The first two columns show the build id. The third column has change description.

Table names are in bold. Only columns are indented and only columns with changes are shown under the table.

Please see data dictionary for full table details.

‘ -- -- --‘ means no corresponding objects ( table or column) for the build.

For example, in build 118, a new table “SNPAlleleFreq” is added.

Build 127 Schema Change

Date last updated: March 2, 2007

- For human, SubInd table is replaced with a partitioned view of a set of underlying tables by chromosomes in two separate databases to keep each database size small for ease of maintenance: human_gty1 has SubInd for chromosome 1-7. human_gyt2 has SubInd for chromosome 8-22, X, Y, Mitochondrial.

- Table name format is: SubInd_ch1, SubInd_ch2 etc. SubInd_Un is for any snp that is mapped to a contig but the contig’s chromosome is unknown; SubInd_NotOn is for any snp that can not be aligned to any contigs.

Build 125 Schema Change

Date last updated: September 28, 2005

- Major changes are in alignment types in mapping tables. Detailed description of new mapping tables is in: ftp://ftp.ncbi.nlm.nih.gov/snp/specs/b125_mapping.doc

- The other major change is that starting b125 the old dbSNP database is split by organisms into 39 different snp databases with identical schema.

- Except mapping table changes, other changes are minimal. Please check back around October 6, 2005 for updated b125 ER diagram.

Build 123 Schema Change

Date last updated: November 3, 2004

Build 122	Build 123	Note

AlleleMotif	AlleleMotif
repeat_cnt	repeat_cnt
Datatype:int	Datatype:real	We have submitted variation as (ACTGTGGGGACACCTCGCCTGCACCCACCTCCCTCCCCCCAGCTCCCGAGC)10.5/11/11.5/12/14/15/21. So repeat_cnt need to real to record the repeat value of "10.5" in this case.

ContigInfo	ContigInfo
contig_acc	contig_acc
Datatype:varchar(15)	Datatype:varchar(25)	To accomate longer contig accession names.

-- -- --	db_build_pipeline	This is an internal table to track tasks. It is in draft state.

-- -- --	dn_Motif_rev	Motif table has rev_motif_id to point to the reverse motif sequence in the same Motif table. This table is a denormalized table that makes it easier to find the reverse form of a Motif. This table is used in internal computations.

-- -- --	dn_UniGty_allele	This table is used in internal computation.

-- -- --	dn_UniVariation_rev	Similar to dn_Motif_rev. This table makes it easier to get the reverse form of a univariation string.


-- -- --	hsSupplContigInfo	dbSNP build 123 human snp are mapped to Genome build 35. To assist users who still work with NCBI build 34.3 map, we have also mapped the new dbSNP build 123 refSNP(rs) set against build 34.3. The results are in SNPContigLoc_b34_3. This table stores the contig information for build 34.3. It has exactly the same structure as ContigInfo, which has build 35 contig information.

-- -- --	hsSupplMapInfo	Please see hsSupplContigInfo note. This table is the "SNPMapInfo" counterpart for build 123 rs onto build 34.3. SNPMapInfo, of course, is the map statistics for dbSNP build 123 rs hits on NCBI build 35.

-- -- --	IndGrpByPop	This table assigns "individual group such as Asian" to population id. Individual group is a way to group snp frequency data for analysis. Ind_grp is in SubmittedIndividual, IndivBySource and Individual table.

-- -- --	IndGrpByPopInd	This table assigs "individual group" to each individual id within a population. This is in addition to the above IndGrpByPop because submitters define population and sometimes a submitted population contains individuals from several "Individual groups".

IndMergeArch	IndMergeArch	We assign individual id (ind_id) based on submission information for example Coriell number. When there is no information on submitted individual id, we assign new ind_id. Later we may get Coriell numbers and find out that several separate ind_id are actually one. We merge these ind_id. The merge process is saved in this table.
old_ind_id	-- -- --

-- -- --	ind_id_high	old_ind_id-> new_ind_id was in previous build. To be more precise, we changed to column name to: ind_id_high and ind_id_low. Since when we merge ind_id, we always keep the ind_id that has the lowest number.

new_ind_id	-- -- --

-- -- --	ind_id_low

-- -- --	build_id

-- -- --	create_time

-- -- --	IUPACna	This is a table that assists simple computation of IUPack bases.


-- -- --	mmContigExon_35_1	Started mapping of mouse to NCBI build 35.1.


RsMergeArch	RsMergeArch
build_id	build_id	smallint ranges from -32,768 - 32,767. Enough for our build_id. A quick way to save space.
Datatype:int	Datatype:smallint

-- -- --	rsCurrent	When an rs merges away several times, we need to trace rsHigh->rsLow several time to get the current rs. For convenience, rsCurrent and orien2Current are added. If an rs only merged once, then rsLow is the same as rsCurrent.
-- -- --	orien2Current	see above.

SNP	SNP
avg_heterozygosity	avg_heterozygosity
Datatype:float	Datatype:real	"real" offers plenty precision for the field and saves storage space for this table.

het_se	het_se
Datatype:float	Datatype:real	see above.

avg_max_heterozygosity	-- -- --	Remove not used field to save space.
snp_class	-- -- --	snp_class is the class for the variation type. This is replace by field univar_id which points to the unified variation string for the snp in UniVariation table. Use UniVariation.subsnp_class for this information instead of the old snp_class.
-- -- --	exemplar_subsnp_id	For a refSNP exemplar ss, we use to use view SNPSeqPick. When querying large sets of data, using SNPSeqPick is inefficient and slow. Add this new field simplifies many queries that need to get exemplar ss data for an rs, for example the query to get rs fasta.
-- -- --	univar_id	This univar_id points to the unified variation string for the snp in UniVariation table. This makes it easy to get all the alleles for an rs.
-- -- --	cnt_subsnp	A refSNP is a cluster of subsnp(ss). Cnt_subsnp is an important property of rs. Although it can be derived from SNPSubSNPLink table, it is convenient to save it here. This is a tradeoff between space and speed.

SNP_mRNA_pos	-- -- --	This table used to store a snp position on an mrna_acc. The position is derived from ContigExon. But due to gaps and alignment exceptions, the position in this table has not been precise. This table is only used to link to NCBI SeqViewer when a snp is on mRNA. It meant to center the SeqViewer on the snp. But since mRNA sequence is usually short enough to page through, we rather simply bring the mRNA SeqViewer on the first position than keep an inprecise position. We may store this information again in the future when better ways to compute is found.

SNPAllele	-- -- --	This table is unnecessary now that we have univar_id in SNP and we have UniVariAllele table which lists all alleles for each unified variation. For ex. For '(CT)3/4/5/6', UniVariAllele will have allele_id for (CT)3, (CT)4, (CT)5 and (CT)6.

-- -- --	SNPAncestralAllele	We used to have ancestral_allele_flag in SNPAllele table. Since SNPAllele table can be dropped, we need this table to save the ancestral allele data.

SNPContigLoc	SNPContigLoc
allele	allele
Datatype:varchar(255)	Datatype:varchar(1200)	New mapping program in build 35 can find alignment with bigger gaps, so allele length needs to be increased. We see that most of the long alleles contain repeats. In the future we may try to condense the allele using repeat string and number the way we do with STR in variation string.

-- -- --	SNPContigLoc_b34_3	This table has exactly the same structure as SNPContigLoc, which has dbSNP build 123 rs hits on NCBI build 35 and other alternative assemblies. But this table SNPContigLoc_b34_3 has dbSNP build 123 rs hits on NCBI build 34.3 to assist users who still use the 34.3 genome maps.


SNPFlankStatus	-- -- --	Starting build 123, mapping programs use megablast internal repeat masker. Before build 123, dbSNP repeat masks flank sequences and save the masking status in SNPFlankStatus. So this table is no longer needed.



SNPHWProb	SNPHWProb
ind_cnt	ind_cnt
Datatype:int	Datatype:smallint	smallint ranges from -32,768 - 32,767. Enough for this field. A quick way to save space.

-- -- --	SNPIndGtyFlag	Since a snp is a cluster of ss and an ind_id is a cluster of submitted_ind_id. When we have individual genotype for several ss and several submitted_ind_id that cluster to one rs and one ind_id, individual genotype should be same. But we have found out many records with different genotype data. This table flags this interesting fact.


SNPSubSNPLink	SNPSubSNPLink
comment	comment
Null Option:NOT NULL	Null Option:NULL	Comment field is raraly if ever used but occupies large space. So we nullified it but kept the field for compatible with old sql codes. In the future, we may drop this completely.

SubInd	SubInd
batch_id	batch_id
Datatype:int	Datatype:smallint	smallint ranges from -32,768 - 32,767. Enough for this field. A quick way to save space.

submitted_ind_id	submitted_ind_id
Datatype:int	Datatype:smallint	smallint ranges from -32,768 - 32,767. Enough for this field. A quick way to save space.


SubSNP	SubSNP	We added three derived fields to SubSNP table to speed up data dump by organism, chromosome, rs and then within each rs cluster, property of each ss.
-- -- --	snp_id
-- -- --	tax_id
-- -- --	chr_id

SubSNPHWProb	-- -- --	Hardy-Weinberg probability (HWP) are computed at each rs level (SNPHWProb) and at each submitted ss and population level( FreqSummaryBySsPop). This table is therefore dropped.

SubSNPRepMask	-- -- --	We dropped this table for the same reason in SNPFlankStatus note. Starting from dbSNP build 123, mapping uses megablast internal repeat mask, so we do not track the mask status of submitted snp seqeunce with dbSNP.

-- -- --	UniVariAllele	For each unified variation, this table lists all alleles within the variations. Please see SNP.univar_id note for more details.

UniVariation	UniVariation	This table keeps all unique variation.
allele_cnt	allele_cnt
Datatype:int	Datatype:smallint	smallint is enough. No need for int.

-- -- --	src_code	This code is a foreign key to UniVariationSrcCode. It tracks the source the first time a univariation is added. For ex. The variation string is added for a reverse form of a variation.

-- -- --	rev_univar_id	Added this field so it is easy to find the reverse form of a univariation.

-- -- --	UniVariationSrcCode	Please see UniVariation.src_code and the table data for detail.

-- -- --	VariComplexAllele	This is an internal table to handle complex variation forms.

Build 122 Schema Change

Date last updated: August 18, 2004

Build 121	Build 122	Note
-- -- --	ChiSqPValueLookUp	This table replaces ChiSqPValue. This is used to get the p-value from ChiSq when computing H-W Prob.

dn_Allele_rev	dn_Allele_rev	This table is used internally for processing.
rev_flag	rev_flag
Datatype:bit	Datatype:tinyint

-- -- --	dn_Gty_rev	Used for internal processing.

dn_snpFxnCnt	dn_snpFxnCnt	This is a summary table showing count of snp in each functional class(intron, coding etc).
-- -- --	tax_id	Added tax_id to this table since we start to have genomes for more organisms.

dn_table_rowcount	dn_table_rowcount	This table lists the row count of each table. When users re-create their own dbSNP, they can use this table to check if they data were loaded OK.
org_prefix	-- -- --	org_prefix and tax_id is not used yet. Currently, the table rowcount just the rowcount for a table. In the future when we have organism division databases, we may have tables shared by multiple organism. tax_id field will be used to show row count per tax_id.
-- -- --	tax_id

FreqSummaryBySsPop	FreqSummaryBySsPop	Column type changes for this table. Float to real to save storage space.
source	source
Datatype:char(2)	Datatype:varchar(1)	We compute H-W prob from either individual genotype data(I), or just genotype frequency (P). So we only need varcchar(1).

chisq	chisq
Datatype:float	Datatype:real
Null Option:NOT NULL	Null Option:NULL

df	df
Null Option:NOT NULL	Null Option:NULL

hwp	hwp
Datatype:float	Datatype:real
Null Option:NOT NULL	Null Option:NULL

-- -- --	ggaContigExon	The new set of chicken mapping table are added. They all have prefix 'gga'. Organism prefix is defined in OrganismTax.prefix column.

-- -- --	ggaContigInfo

-- -- --	ggaMapLink

-- -- --	ggaMapLinkPid

-- -- --	ggaMapList

-- -- --	ggaSNPContigLoc

-- -- --	ggaSNPContigLocusId

-- -- --	ggaSNPMapInfo

-- -- --	IndGrpCode	Added this new table to group individuals. See the data in IndGrpCode for a list of groups. Currently, the grouping only applies to human.

IndivBySource	IndivBySource
-- -- --	src_ind_grp	This column stores grouping supplied by sample source ,such as Coriell.

Individual	Individual
-- -- --	ind_grp	This column stores the grouping code assigned by dbSNP.

IndivSourceCode	IndivSourceCode
-- -- --	src_type	Added this column to distinguish various sources. For example, Coriell's src_type is "repository" and CEPH is "curator".

-- -- --	display_order	One individual often has several id from different source. This value will be used to order the display.

-- -- --	IndMergeArch	As we get more individual sample information, we may see that previously two separated individuals are actually the same. This table keeps individual merging history. This is not used yet.


miscMapLink	miscMapLink
-- -- --	i_tax_id	For organisms with no NCBI genome assembly, the hit to GenBank accessions are stored in this table. Since there're multiple organisms, tax_id was added. But since tax_id is in SNP table, we could get tax_id by joining SNP and this table. So this column may be unnecessary and be taken out in future schema modifications.

Publication	Publication
asn	-- -- --	This field was not used.

SNPHWProb	SNPHWProb	This table stays basically the same except the column name listed below. The column name is not consistent with table FreqSummaryBySsPop which has H-W prob for each ss over an population.
degreeFreedom	df

chi_sq	chisq

hw_prob	hwp

-- -- --	ind_cnt	Added column ind_cnt for individual count.

source	-- -- --	This field is removed because it is not as meaningful for an refSNP(rs) as the source in FreqSummaryBySsPop: some member ss may have individual genotype while other may only have genotype frequency.

SubmittedIndividual	SubmittedIndividual
-- -- --	loc_ind_grp	loc_ind_grp will store the individual group name submitted by submitter.

-- -- --	UniVar_Valid_Allele	This table is used internally.

Build 121 Schema Change

Date last updated: June 2, 2004

To keep dbSNP in manageable size, we have partitioned four large tables by ss# and moved the data with ss# less than 20 million to a supplemental database called dbSNP_sup.

The four large tables that have been partitioned are: SubSNPSeq3, SubSNPSeq5, SubSNPCommLine and SubSNPAcc.

For ss# between 1 and 10,000,000, it is stored in table <tablename>_p1, where <tablename> represents the above four tables.

For ss# between 10,000,001 and 20,000,000, it is stored in table <tablename>_p2

Currently, we do not have ss# over 30,000,000, when we do, we will create <tablename>_p3 for ss# between 20,000,001 and 30,000,000,

For example:

SubSNPSeq3_p1 has 3’ sequence for all ss# between 1 and 10,000,000.

SubSNPSeq3_p2 has 3’ sequence for all ss# between 10,000,001 and 20,000,000.

SubSNPSeq3_p1 and SubSNPSeq3_p1 still have the same column definitions as the original SubSNPSeq3.

In dbSNP, for ss# over 20,000,000, we still store them in dbSNP, the main database. We renamed these four tables by adding a suffix “_ins” to note that we will be add new data into these tables as new submissions come in. The original names for these four tables are now partitioned views across dbSNP and dbSNP_sup having all ss#. For more details on partitioned view,

This change will be reflected in ftp bcp files. Using SubSNPSeq3 as an example, instead of having a huge file for SubSNPSeq3 prior to this buid, we now have bcp file for SubSNPSeq_p1, SubSNPSeq_p2 and SubSNPSeq_ins. Between builds, sequence for old snp seldom change (they only change when submitter updates them which is rare). We will have separate small update file for SubSNPSeq3_p1 and SubSNPSeq3_p2. So for each new build, users only need to get the new bcp file from SubSNPSeq3_ins.

Build 120 Schema Change

Date last updated: April 24, 2004

Note: ER diagram and Data dictionary for build 120 will be updated on soon.

Build 119	Build 120	Note
agContigInfo	agContigInfo	Please see change note in "Build 117 Schema Change - ContigInfo" in this document.
-- -- --	group_term
-- -- --	group_label
-- -- --	contig_label

agSNPMapInfo	agSNPMapInfo	Please see change note in "Build 117 Schema Change-SNPMapInfo" in this document.
-- -- --	hap_cnt
-- -- --	unplaced_cnt
-- -- --	ref_cnt
-- -- --	assembly_cnt

-- -- --	dn_table_rowcount	Added this table to simply store the row count for each table. So users who create local dbSNP can compare the counts.

-- -- --	dn_validationSummary	An internal validation table.

Individual	Individual	SubmittedIndividual has tax_id due to annotation process. It is possible several submitted individuals are from the same sample. This curated sample data is in Individual table. So to normalize data, tax_id should be in Individual table. Currently, we have tax_id in both SubmittedIndividual and Individual tables. Eventually, we will only have tax_id in Individual table.
-- -- --	tax_id

SNPHapMapPriority	SNPHapMapPriority
priority_code	priority_code
Datatype:tinyint	Datatype:smallint	HapMapClassCode has 9 codes, so priority_code needs to be a smallint.

Build 119 Schema Change

Date last updated: Jan. 30, 2004

Note: ER diagram and Data dictionary for build 119 will be updated on soon.

Please see most recent data dictionary at: ftp://ftp.ncbi.nlm.nih.gov/snp/mssql/schema/DataDictionary_b118.html

Build 118	Build 119	Note

-- -- --	dgContigInfo	Added 'dg' prefixed tables for dog snp mapping data.
-- -- --	dgSNPContigLoc
-- -- --	dgSNPContigLocusId
-- -- --	dgSNPMapInfo

-- -- --	dn_chr_locusList	This table is a de-normalized table, storing list of locus_id for each contig chromosomes. This is used in individual genotype summary page.

-- -- --	dn_gty_rsCnt_byChr	This table is a de-normalized table, storing genotyped snp count for each individual on chromosomes.

-- -- --	dn_gty_rsCnt_byLocus	This table is a de-normalized table, storing genotyped snp count for each individual on a locus.

MethodClass	MethodClass

-- -- --	validation_status	Added validation_status. Value 0 means the method is computational. 1 - method can be considered validating when multiple ss are in the same rs cluster.


SNPContigLocusId	SNPContigLocusId	Added mrna_ver and protein_ver. Contig_acc, mrna_acc and protein_acc fields are changed to varcahr(15) to be consistent with accession fields in other tables such as SNPContigLoc. The following int field are changed to tinyint: contig_ver, fxn_class, reading_frame to save space.

SubmittedIndividual	SubmittedIndividual
-- -- --	loc_ind_id	Submitter individual id can be case sensitive. For ex. Mouse strain name use case in special ways. This field is needed to preserve case information.

Build 118 Schema Change

Date last updated: Nov. 14, 2003

Please see data dictionary at: ftp://ftp.ncbi.nlm.nih.gov/snp/mssql/schema/DataDictionary_b118.html

Build 117	Build 118	Note

-- -- --	dgMapLink	Added mapping data for dog.

-- -- --	dgMapLinkPid	Added mapping data for dog.

dn_batchCount	dn_batchCount
build_id	-- -- --	We will keep snp counts in each batch for current batch only. So there is no need for this column.

dn_handleCount	dn_handleCount
build_id	-- -- --	We will keep snp counts in each batch for current batch only. So there is no need for this column.

dn_IND_batchCount	dn_IND_batchCount
build_id	-- -- --	We will keep snp counts in each batch for current batch only. So there is no need for this column.

SubSNPGenotypeSum	FreqSummaryBySsPop	This table is renamed to be consistent with the two tables used to compute this summary: GtyFreqBySsPop and GtyFreqBySsPop. The new name clearly shows that the data ins for each SubSNP within a specific population.
source	source
Datatype:char(1)	Datatype:char(2)	Use char(2): GF - source data is genotype frequency, IG - source data is individual genotype, AF - source data is allele frequency.
non_founder_cnt	non_founder_ind_cnt	Renamed so it is clear that the number in this column count of individuals, not chromosomes.
-- -- --	het	heterozygosity
-- -- --	het_se	standard error for the above heterozygosity.

mmContigInfo	mmContigInfo	The changes in this table is the same in ContigInfo. Please see change description for ContigInfo in build 117.
contig_class	-- -- --
-- -- --	group_term
-- -- --	group_label
-- -- --	contig_label

mmSNPContigLocusId	mmSNPContigLocusId
-- -- --	mrna_ver	Added this to keep track the version of the mrna accession, like the contig_ver.
-- -- --	protein_ver	Added this to keep track the version of a protein accession, like the contig_ver.

mmSNPMapInfo	mmSNPMapInfo	The changes in this table are the same in ContigInfo. Please see change description for ContigInfo in build 117.
-- -- --	ref_cnt
-- -- --	assembly_cnt

-- -- --	SNPAlleleFreq	This new table keeps the average allele frequency for refSNP.

-- -- --	SNPGtyFreq	This new table keeps the average genotype freqeuncy for refSNP.

SubmittedIndividual	SubmittedIndividual
-- -- --	loc_ind_alias	Added this column to track the different id for a submitted individual for genotype data.

t_vwSNP_avgFreq	-- -- --	This is replaced with SNPAlleleFreq.

t_vwSNP_MinorAlleleCnt2More	-- -- --	This table is no longer needed.

t_vwSubPopGty_alleleFreq	-- -- --	This table is replaced with GtyFreqBySsPop.

t_vwSubSNP_validationByMinorAlleleCnt	-- -- --	This table is no longer needed.

Build 117 Schema Change

Date last updated: Oct. 16, 2003

Note: Data dictionary and ER diagram for build 117 will be available on Oct.17, 2003.

Build 116	Build 117	Note

Allele	Allele	Removed 3 columns in Allele table because an allele may have many motifs (ex. (GAT)10AAT(GAT)5). So keep allele motif relationship in enw table AlleleMotif.
motif_id	-- -- --
repeat_cnt	-- -- --
rev_allele	-- -- --
-- -- --	last_updated_time

-- -- --		Added this table to save computed allele frequency per subsnp and population. The frequency computed from individual genotype takes the highest precedence which then takes precedence over the allele frequency computed from genotype frequency. When none of the above data is available, we use the submitted allele frequency.

-- -- --	AlleleMotif	Added this table to keep the relationship between an allele and all its motif.

ContigInfo	ContigInfo	contig_class was used to track the haplotype data of humans and the strain data of mouse for each of the various contigs. But it became clear that different ncbi assembly tasks required different ways of grouping the alternatives so the single category contig_class was broken out into three fields. For dbSNP, the group_term gives the best breakdown of the various alternative forms, but contig_label and/or group_label give a better idea of the common name associated with haplotype, strain, alternate_assembly, etc.
contig_class	-- -- --
-- -- --	group_term	id used to group contigs together for annotation purposes. This allows annotaters to group contigs by assembly (see note below for PAR special case). Possible values: ref_[strain/haplotype] (haplotype human, strain others) alt_[strain/haplotype] ref_PAR alt_assembly the ref_ and alt_ prefixes assume that these are contigs of type NT/NW and the NCBI-assembly team will be loading these to ID. ref_PAR: The pseudoautosomal region requires special processing. The Y contigs of the PAR region will have this group_term to make sure the NCBI-assembly team don't penalize features in this region for going to two chromosomes. These contigs will be joined back into the reference assembly upon loading to MapViewer alt_assembly: this designation is used when the NCBI-assembly team have an extra assembly for a chromosome and the NCBI-assembly team don't want to load the annotation/contigs from this chromosome to ID.
-- -- --	group_label	Label used to label an assembly in the MapViewer. On a non-sequence based map, this term can serve to group objects into a common coordinate system. Example values: reference C57BL/6J 129_substrain
-- -- --	contig_label	Label used to label a particular contig in an assembly. For example, in mouse, the NCBI-assembly team have grouped all contigs that are of type 129/??? for annotation purposes. When the NCBI-assembly team display a particular contig, it would be useful to display the actual strain name (129/Sv, 129/Ola, etc). Example values: C57BL/6J 129/SvJ 129/SvEvTac

-- -- --	GtyFreqBySsPop	Added this table to save computed genotype frequency per subsnp and population. The frequency computed from individual genotype takes the precedence over submitted genotype frequency.

ObsVariation	ObsVariation
-- -- --	var_flag	refers to VarFlagCode.code

PopClassUniqCount	-- -- --	Not used.

-- -- --	rnContigExon	Added mapping information for rat, with prefix 'rn' added to mapping tables. This note also applies to the following 7 tables.

-- -- --	rnContigInfo

-- -- --	rnMapLink

-- -- --	rnMapLinkPid

-- -- --	rnMapList

-- -- --	rnSNPContigLoc

-- -- --	rnSNPContigLocusId

-- -- --	rnSNPMapInfo

SnpFunctionCode	SnpFunctionCode
-- -- --	is_coding	Added this column for internal query.

SNPMapInfo	SNPMapInfo
-- -- --	ref_cnt	set greater than zero if the snp maps onto the reference strand.
-- -- --	assembly_cnt	Counts the number of alternate assemblies (excluding reference) which the snp hits.

sts_markers	-- -- --

SubPop	SubPop	Change some float column to real to save storage space.

-- -- --	ValidGtyByVariation	Internal use currently.

-- -- --	VarFlagCode	Some submitted variation pattern is problematic. Define codes to track problems.

Build 116 Schema Change

Date: Aug 7, 2003

The following details all schema changes between build 116 and 115, sorted by table names.

Table names are in bold. Only columns are indented and only columns with changes are shown under the table. See data dictionary for full table details.

‘ -- -- --‘ means no corresponding objects ( table or column) for the build.

For example, in build 116, pop_cnt is added to table dn_batchCount.

Please see ER diagram for more details: ftp://ftp.ncbi.nlm.nih.gov/snp/mssql/schema/erd_dbSNP.pdf

Please see data dictionary at: ftp://ftp.ncbi.nlm.nih.gov/snp/mssql/schema/DataDictionary_b116.html

Build 116	Build 115	Note
dn_batchCount	dn_batchCount	Count to speed up query.
pop_cnt	-- -- --	For frequency and genotype batch, added count of population and count of individual.
ind_cnt	-- -- --

dn_handleCount	dn_handleCount
rs_cnt	rs_cnt	Make the column nullable, so 'NULL' means not computed yet, while '0' means the result.
Null Option:NULL	Null Option:NOT NULL
rs_validated_cnt	rs_validated_cnt
Null Option:NULL	Null Option:NOT NULL

dn_IND_batchCount	-- -- --	This count table helps speed up "New Batches" web query.

-- -- --	dn_new_batch	dn_new_xxx_batch tables are no longer needed. It was used internally to support query most recent batches. Now web page uses Batch.submit_time to decide submission order.
-- -- --	dn_new_ind_batch

-- -- --	dn_new_nov_batch

-- -- --	dn_new_pop_batch

-- -- --	dn_new_snp_batch

GtyAllele	GtyAllele	The new table makes it easy to compute allele frequency from genotype submission on forward or reverse strand.
-- -- --	allele_id
rev_flag	-- -- --
fwd_allele_id	-- -- --
unigty_id	-- -- --
create_time	create_time
Null Option:NOT NULL	Null Option:
Default:-- -- --	Default:DefCurTime
last_updated_time	-- -- --

IndivAlias	-- -- --	Genotype submission includes individual id. Different submitters sometimes use different individual id for the same DNA sample. Ex. Mouse strain 'AKR' or 'AKR/J' are the same.

IndivBySource	IndivBySource	Minor database change
src_id	src_id
Datatype:int	Datatype:numeric(3)

IndivSourceCode	IndivSourceCode	Minor database change
code	code
Datatype:int	Datatype:numeric(3)

LoadHistory	LoadHistory
-- -- --	create_time	Each build releases different component in slight different time. Now each individual component release time is stored here.
ftp_done_date	-- -- --
entrez_done_date	-- -- --
blastdb_done_date	-- -- --
web_date	-- -- --

mmSNPMapInfo	mmSNPMapInfo	See data dictionary.
hap_cnt	-- -- --
unplaced_cnt	-- -- --

ObsGenotype	ObsGenotype	See data dictionary.
-- -- --	submitted_strand_code
-- -- --	obs_ss_fwd
-- -- --	create_time
-- -- --	unigty_id
-- -- --	allele_id_1
-- -- --	allele_id_2

ObsVariation	ObsVariation	subsnp_clas and iupack_code are moved to table UniVariation.
-- -- --	subsnp_class
univar_id	univar_id
Null Option:NULL	Null Option:NOT NULL
-- -- --	iupack_code

OrganismTax	OrganismTax	Added division_cd in preparation to split dbSNP database into several divisions. Divisions are loosely based on GenBank divisions.
division_cd	-- -- --
Population	Population	When curating submitted individuals, it helps to know the source of the population samples. So src_id is added. For ex. If we know a population's source is NIHPDR, then we know '1' or 'P001' or 'PD001' is the same sample in NIHPDR panel. We currently manually set this field based on population description.
src_id	-- -- --

SNPContigLocusId	SNPContigLocusId
fxn_class	fxn_class	data type changed from int to tinyint to save storage.
reading_frame	reading_frame	same as above.
contig_ver	contig_ver	same as above.

SNP	SNP
-- -- --	snp_type	snp_type was used to record snp withdraw reason code defined in SnpTypes. We currently do not keep withdrawn snp in SNP table. Withdrawn snp is in SNPHistory.
tax_id	-- -- --	tax_id, snp_class and validation_status was orignianlly in SNP table. From build 111 to build 115, they were moved to t_vwSNP to faciliate database processing. But during database size increase in build 116, views proved too slow to merit the separating these three fields. So starting build 116, these three fields are back in SNP table.
snp_class	-- -- --
validation_status	-- -- --


SnpFunctionCode	SnpFunctionCode
top_level_class	-- -- --	Added this field to group the functional class code into "exon" and "other".

SNPHWProb	SNPHWProb
chi_sq	chi_sq	float' takes 8 byte storage. 'Real' takes 4 byte. In dbSNP, 'real' offers enough precision.
Datatype:real	Datatype:float
hw_prob	hw_prob
Datatype:real	Datatype:float

SNPMapInfo	SNPMapInfo
hap_cnt	-- -- --
unplaced_cnt	-- -- --


SnpValidationCode	SnpValidationCode
abbrev	abbrev
Datatype:varchar(40)	Datatype:varchar(20)	With the addition of double hit validation type, the abbreviation string gets longer.
create_time	-- -- --
last_updated_time	-- -- --

-- -- --	SNPValBy_Illumina_Apr1003	The snp list is this table is now in SNPVal and used to set the validation_status for the snp.

-- -- --	SNPValBy_PERLEGEN_OCT0802	The snp list is this table is now in SNPVal and used to set the validation_status for the snp.

StrandCode	StrandCode
rs_to_ss_orien	-- -- --	See data in this table to see why this new column is needed.

-- -- --	sts_markers

SubInd	SubInd	Streamlined SubInd table by normalizing observed genotype into ObsGenotype table.
-- -- --	allele_cnt
-- -- --	a_bit
-- -- --	c_bit
-- -- --	g_bit
-- -- --	t_bit
-- -- --	del_bit
-- -- --	oth_bit
-- -- --	observed
submitted_rs	-- -- --

SubIndDeletedBySubmitter	-- -- --	Added this table for internal tracking.

SubPop	SubPop
subpop_id	-- -- --	Added this column to shorten primary key, especially since we have added more tables referencing SubPop table, such as SubPopGty.

SubPopAllele	SubPopAllele
subpop_id	-- -- --	See above. In the future SubPopAllele will use this column to reference SubPop so batch_id, pop_id,subsnp_id,type will no be needed. These four columns are planed to be dropped in build 117.

SubPopDeletedBySubmitter	SubPopDeletedBySubmitter	Added this table for internal tracking.
-- -- --	build_id
delete_build_id	-- -- --

SubPopGty	-- -- --	In the past, genotype frequency were stored in SubPopAllele. But submitted genotype frequency has different property from submitted allele frequency. So this table is created to keep just genotype frequency to faciliate computing, for ex, to derive allele frequency based on genotype frequency.

SubPopHet	-- -- --	This table is created, but not populated yet. We plan to move the heterozygosity frequency from SubPopAllele into this table in build 117.

SubPopRsMerge	-- -- --	Submitters may submit frequency on an rs that was already merged to another rs. This table keeps a record of this fact. Since we always stored frequency on the rs exemplar ss, the effect of this merging is that the frequency may be stored under a different ss#.

SubSNP	SubSNP
-- -- --	hw_prob	SubSNPHWProb is used for this values. Less than 5% subsnp has data that make the computation of Hardy-Weinberg probability possible. Storing this value in SubSNP table wastes space since SubSNP has over 10 millions rows.
-- -- --	subsnp_type	This was intended to indicate the withdraw detail of a subsnp. Now SubSNP table does not keep deleted snp. Deleted SubSNP is tracked in SubSNPDeletedBySubmitter.
top_or_bot_strand	-- -- --	This is an experimental way of identifying strand.
validation_status	-- -- --	Since build 111, we started to compute validation status from a view. But with data growth, it is no longer efficient to compute this value in each query. Maintaining a table copy of view t_vwSubSNP_validation defeats the beauty of views. So we are returning this field back into SubSNP.

SubSNPSeqPos	-- -- --	In build 116, we received submission of snp with flanking sequence over 200Kb for snp's in repeat regions and the submitter already know the genome location of the snp. In this case, storing all flanking sequence does not provide value since the flanks will be repeat masked out anyway. So we save the accession and the snp position of the accession.

SubSNPSeqTypeCode	-- -- --	Explains the meaning of 'type' field in table SubSNPSeq3 and SubSNPSeq5.

-- -- --	t_vwSNP

t_vwSubPopGty_alleleFreq	-- -- --

-- -- --	t_vwSubSNP_validation	See note for SubSNP.validation_status.

t_vwSubSNP_validationByMinorAlleleCnt

UniGty	-- -- --

UniVariation	-- -- --

Build 115 Schema Change

Date: June 6, 2003

- Dropped tables: dn_new_snp_batch, dn_new_pop_batch, dn_new_ind_batch, dn_new_nov_batch

These are internal table that are no longer needed.

- Added tax_id to SubmittedIndividual table because some populations group individuals from different organism.

Build 114 Schema Change

Date: April 29, 2003

Table	Column	Note
ObsGenotype		This table keeps all submitted genotypes and submitted strand code in obs and submitted_strand_code columns. Obs_upp is the all uppercase version of obs. Obs_ss_fwd is the genotype on the forward strand of the SubSNP. Ex. If genotype is submitted as A/C\|STRAND_SS_REV, then obs has “A/C”, obs_ss_fwd has “G/T”.
SubSNPGenoSum is renamed to SubSNPGenotypeSum.	Freq_obs and genotype_obs are dropped. Hw_prob is renamed to hwp degreeFreedom is renamed to df chi_sq is renamed to chisq	Source has value “I” or “P”: “I” means the genotype frequency is summarized from individual genotypes. “P” means the genotype frequency is submitted for a population.
SubSNPHWProb	Hw_prob is renamed to hwp degreeFreedom is renamed to df chi_sq is renamed to chisq	Source has same means as in SubSNPGenotypeSum.source
SubPopDeletedBySubmitter		Added this table to keep track deleted data, so summary statistics needs to be recomputed for the affected batch and refSNP cluster.
t_vwSNP	- Dropped rarely used columns. - Current columns are snp_id, validation_status, snp_class and tax_id, create_time ( is the time this row is inserted ).
Dn_new_snp_batch Dn_new_pop_batch Dn_new_ind_batch Dn_new_nov_batch		dropped
Dn_new_batch		Added to replace dn_new_<batch_type>_batch, where <batch_type> has value of snp,pop,ind or nov.

Build 112 Schema Change

Date: April 2, 2003

Table	Column	Note
SubSNP	observed is replaced with ObsVariation.pattern. subsnp_class is replaced with ObsVariation.subsnp_class. Dropped columns allele_cnt,a_bit,c_bit,g_bit,t_bit,del_bit,oth_bit,allele_bits	To get the observed variation and subsnp_class for a give subsnp, use ObsVariation.pattern, join SubSNP and ObsVariation on SubSNP.variation_id=ObsVariation.var_id
Batch	· Organism is replaced with OrganismTax.organism.	Join Batch and OrganismTax on tax_id to get the organism of a batch.

Build 111 Schema Change

Date: March 23, 2003

There are a few major schema changes between build 110 and build 111.

Tables/Columns in Build 111 that are replacing old tables/columns in build 110:

Table	Column	Note
t_vwSubSNP_validation - replace SubSNP.validaton_status	validation_status	- t_vwSubSNP_validation is a direct copy from view vwSubSNP_validation. Please see vwSubSNP_validation.sql for details.
SNPSeqPick - This is a view. It replaces the old table with the same name and columns: SNPSeqPick.	snp_id, subsnp_id	- see SNPSeqPick.sql for view details.
t_vwSNP - replace the same columns in SNP.	PCR_confirmed_ind assay_sample_size cnt_subsnp indv_homozygosity_obs_ind num_segregate popdata_sample_size snp_class tax_id total_sample_size validation_status	t_vwSNP is direct copy from view vwSNP. Please see vwSNP.sql for details on how each field is computed.
SubPopAllele	cnt cnt_int	- cnt is changed to type float. This will help eliminate rounding error in computing average allele frequency etc. - cnt_int will be deleted after cnt is set to be number of chromosomes for allele frequency and number of individuals for genotype and heterozygosity data.

New table and columns:

Table	Column	Note	Further change in Build 112
SubSNP	variation_id	variation_id is foreign key to ObsVariation (var_id).	- SubSNP.observed will be replaced with ObsVariation.pattern. - SubSNP.subsnp_class will replaced with ObsVariation.subsnp_class.
Motif	motif_id motif, rev_motif_id	This table keeps all unique motifs in an allele. For ex. An observed variation is (CA)2/3/4/6. Allele table will have four rows: (CA)2, (CA)3, (CA)4, (CA)6. Each row with the same motif_id, referring to one motif record of motif (CA). - rev_motif_id is id of the reverse motif. For ex. (CA)’s reverse motif is (TG). rev_motif_id is the motif_id for motif (TG).
Allele	allele_id, allele motif_id repeat_cnt	- This table keeps all unique occurrence of an observed allele in a variation submission.
VariAllele	var_id, allele_id	- This table associates a variation with 2 or more alleles.
ObsVariation	var_id, pattern, subsnp_class	- This table keeps all unique submitted observed field from the old SubSNP.observed field.
ObsGenotype	gty_id, var_id, allele_id1, allele_id2	For an observed variation, this table lists all possible genotypes. For ex. If an observed variation is ‘A/C’, then there could possibly be three genotypes: A/A, A/C, C/C.
SubInd	gty_id	foreign key to ObsGenotype(gty_id)	- The following columns in SubInd will be replace with gty_id: observed, allele_cnt a_bit, c_bit g_bit t_bit del_bit oth_bit
SubPopAllele	allele_id	foreign key to Allele(allele_id).	-The following columns in SubPopAllele will be replaced with allele_id: allele, other
t_vwSNP_avgFreq - This is a table directly copied from view vwSNP_avgFreq.	snp_id, allele, freq		- This information is previously only available on refSNP page.
ChiSqPValue	chi_sq df pvalue	This is a lookup table with chi-square probability distribution use by Hardy-weinberg probability computation.

Schema Change from builds before 111:

Schema Change Document Description:

Build 127 Schema Change

Build 125 Schema Change

Build 123 Schema Change

Build 122 Schema Change

Build 121 Schema Change

Build 120 Schema Change

Build 119 Schema Change

Build 118 Schema Change

Build 117 Schema Change

Build 116 Schema Change

Build 115 Schema Change

Build 114 Schema Change

Table

Column

Note

ObsGenotype

SubSNPGenoSum is renamed to SubSNPGenotypeSum.

Source has value “I” or “P”:

SubSNPHWProb

Source has same means as in SubSNPGenotypeSum.source

SubPopDeletedBySubmitter

Added this table to keep track deleted data, so summary statistics needs to be recomputed for the affected batch and refSNP cluster.

t_vwSNP

- Dropped rarely used columns.

Dn_new_snp_batch

dropped

Dn_new_batch

Added to replace dn_new_<batch_type>_batch, where <batch_type> has value of snp,pop,ind or nov.

Build 112 Schema Change

Table

Column

Note

SubSNP

To get the observed variation and subsnp_class for a give subsnp, use ObsVariation.pattern, join SubSNP and ObsVariation on SubSNP.variation_id=ObsVariation.var_id

Batch

· Organism is replaced with OrganismTax.organism.

Join Batch and OrganismTax on tax_id to get the organism of a batch.

Build 111 Schema Change

Tables/Columns in Build 111 that are replacing old tables/columns in build 110:

New table and columns:

Schema Change from builds before 111: