The mtDNA Population Database: An Integrated Software and Database Resource for Forensic Comparison
Keith L. Monson
Research Chemist
Forensic Science Research Unit
Federal Bureau of Investigation
Quantico, Virginia
Kevin W. P. Miller
Biologist-Forensic Examiner
DNA Analysis Unit 2
Mark R. Wilson
Supervisory Special Agent
DNA Analysis Unit 2
Joseph A. DiZinno
Section Chief
Scientific Analysis Section
Bruce Budowle
Senior Biological Sciences Program Advisor
Forensic Analysis Branch
Federal Bureau of Investigation
Washington, DC
Introduction
Nucleotide
sequencing of the human mitochondrial DNA (mtDNA) control region
has been validated for the genetic characterization of forensic
specimens (for references, see Budowle et al. 1999). Mitochondrial
DNA analysis is especially useful for the analysis of teeth,
bones, and hair, as well as highly degraded tissues that do not
lend themselves to successful nuclear DNA analysis.
In contrast
to nuclear DNA, mtDNA follows maternal clonal inheritance patterns
without recombination. Therefore, with few exceptions (i.e.,
heteroplasmy), mtDNA types are faithfully inherited from one
generation to the next through the maternal line. These characteristics
facilitate collection of reference material for forensic comparison,
even in cases where generations are skipped. For forensic purposes,
the weight of a mtDNA match between two evidentiary items is
determined by counting the number of times the profile occurs
in one or more datasets of unrelated individuals. Given the level
of diversity that has been observed in mtDNA, the estimate of
rarity by counting mtDNA types is highly dependent on the size
of the reference database and will often be overestimated (National
Research Council 1996; Budowle et al. 1999).
This article
describes a database of mtDNA control region nucleotide sequences
and reports on software for searching the profiles. These data
are made available to the forensic and research communities free
of charge. The mtDNA Population Database program has two components:
population data that are stored as relational tables in Microsoft®
Access 2000® format, and specialized software designed to
search these data. The mtDNA Population Database program (data,
searching software, and user's manual) is published in this issue
of Forensic Science Communications. Technical support
is limited to that supplied in the user's manual. Updates to
the population data will be published periodically. Searching
functionality similar to that in the mtDNA Population Database
program provided here is available to law enforcement laboratories
through the CODISmt program (COmbined DNA Index System-mitochondrial
DNA). CODISmt offers the additional possibilities
of searching combined mitochondrial and nuclear DNA profiles
and of searching open case files, particularly missing persons
cases, across the CODIS network.
Population Data
Nucleotide
sequence data are divided into two components, forensic and public
(Miller and Budowle 2001). In each category, profiles are designated
as differences from the Cambridge Reference Sequence (CRS)
(Anderson et al. 1981). The forensic component is used to assess
the weight of mtDNA associations developed in forensic casework.
It consists of anonymous population profiles contributed by collaborating
laboratories. In addition to its own quality assurance measures,
and as a minimum quality assurance assessment on prospective
submissions, each participating laboratory must correctly type
a set of control samples before any of its results are accepted
for inclusion in the population database. This exercise facilitates
compatibility of typing methods and of profile nomenclature.
All forensic profiles include, at a minimum, a sequence region
in hypervariable region I (HVI), defined by nucleotide positions
16024-16365, and a sequence region in hypervariable region II
(HVII), defined by nucleotide positions 73-340. Additional
contributions of population data are welcomed.
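As an illustration of the minimum coverage requirement, the following Python sketch checks whether a profile's sequenced ranges span both hypervariable regions. The `Profile` class, its field names, and the position-to-base representation of differences are assumptions made for this example; they are not taken from the database schema or the searching software.

```python
from dataclasses import dataclass, field

# Minimum regions required of every forensic profile (positions from the text).
HVI = (16024, 16365)
HVII = (73, 340)


@dataclass
class Profile:
    """Hypothetical container: a profile stored as differences from the CRS."""
    identifier: str
    ranges: list                                      # inclusive (start, end) positions sequenced
    differences: dict = field(default_factory=dict)   # position -> observed base


def covers(ranges, region):
    """Return True if one sequenced range fully contains the region.

    Simplification: adjacent or overlapping ranges are not merged first.
    """
    start, end = region
    return any(lo <= start and end <= hi for lo, hi in ranges)


def meets_forensic_minimum(profile):
    """Check that both HVI and HVII are fully sequenced."""
    return covers(profile.ranges, HVI) and covers(profile.ranges, HVII)
```

A profile sequenced over 16024-16365 and 73-340 passes this check, whereas one whose second range stops at position 300 does not.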
Public
data consist of mtDNA sequence data from the scientific literature
and the GenBank and European Molecular Biology Laboratory (EMBL)
genetic databases. These sequence data were collected, cataloged,
annotated, formatted, and organized. The public data include
and replace the mtDNA data from the mtDNA concordance study of
Miller et al. (1996). In general, the quality assurance methods
described in that study were applied to the public data. The data
were checked for uniformity of nomenclature and cross-checked
with other publications by the same authors and the GenBank/EMBL
genetic databases to minimize the possibility of error and duplication.
Although the public data have not been subjected to the same
quality standards as the forensic data, these data provide useful
information on worldwide population groups not contained within
the forensic dataset and can be used for investigative purposes.
The profiles
in both forensic and public datasets are uniquely identified
in the database by a systematic naming scheme. Each profile is
denoted by a unique identifier and the respective literature
citation. Where possible, each profile is indexed by the population
group assigned by the contributor, as well as the continent,
country, or region of specimen origin. A standard 14-character
nucleotide sequence identifier is assigned to each profile, using
the structure XXX.YYY.ZZZZZZ, as described by Miller and
Budowle (2001). The first three characters (XXX) reflect the
country of origin, using codes defined by the United Nations
(1997). The second three characters (YYY) describe the group
or ethnic affiliation to which a particular profile belongs.
The final six characters (ZZZZZZ) are sequential acquisition
numbers. For example, profile JPN.ASN.000105 designates the 105th
nucleotide sequence from an individual of Asian origin from Japan.
An Asian American individual would carry the same code for ethnicity,
but a different code for country of origin (e.g., USA.ASN.000105).
The population/ethnicity codes for indigenous peoples are numeric
and arbitrarily assigned. For example, USA.008.000105 refers
to an individual from the Apache tribe sampled from the United
States.
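The naming scheme lends itself to simple validation. The Python sketch below splits an identifier into its three fields. The regular expression reflects only what is described above (three characters, three characters, six digits); the assumption that country codes are uppercase letters, and the names `SequenceId` and `parse_identifier`, are illustrative and not part of the database software.

```python
import re
from typing import NamedTuple

# XXX.YYY.ZZZZZZ: country code, group code (letters, or digits for
# indigenous peoples), and a six-digit sequential acquisition number.
IDENTIFIER = re.compile(r"([A-Z]{3})\.([A-Z0-9]{3})\.(\d{6})")


class SequenceId(NamedTuple):
    country: str
    group: str
    acquisition: int


def parse_identifier(text):
    """Split a 14-character sequence identifier into its three fields."""
    match = IDENTIFIER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid sequence identifier: {text!r}")
    country, group, number = match.groups()
    return SequenceId(country, group, int(number))
```

For instance, `parse_identifier("JPN.ASN.000105")` yields the country code `JPN`, the group code `ASN`, and acquisition number 105.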
Associated Software
The central
function of the software is to facilitate searching of a mtDNA
nucleotide sequence developed from an evidentiary sample against
one or more sequence datasets. The software offers various search
parameters, provides options for report details, and provides
tools for exploration of the datasets themselves. Two types of
searching are supported:
- A comparison
of a single profile against the dataset
- A pairwise
search in which every profile is compared to every other profile
in the selected dataset(s)
Details
of program operation are described in a user's manual that accompanies
the download of data and searching software from the Internet.
Figure 1. The Search screen is used to enter a target profile and to select and search dataset(s) for sequences matching that profile.

To perform a single profile search, the Search screen (Figure 1) is invoked as the default upon starting the program.
It can also be accessed
by selecting Mode and then Search on the menu bar. The user specifies
the Sequenced Regions (which contain, but are not limited to,
HVI and HVII), the profile in terms of Differences from Anderson
(i.e., the Cambridge Reference Sequence [CRS]) and which data
(forensic or public) are to be searched. To broaden or limit
the search within the dataset chosen, individual groups (e.g.,
African-American, Caucasian, Thai) may be selected. The user
can also specify a number of search parameters. These include
the following:
- Whether
partial overlaps (where some, but not all, of the ranges sequenced
in the search and database profiles are in common) are to be
searched,
- Whether
insertions are to be considered, and
- The degree
of output detail.
Figure 3. Search output (excerpt): Profiles with two or fewer differences from search profile.
Upon completion of the search, the Search
output lists all search parameters and tabulates counts and frequencies
of matches (i.e., count/number of sequences searched) in various
groupings of the chosen database (total combined, by major group
and by individual groups). If the option is specified, the sequence
of every matching profile and the number of sites differing from
the target profile are also listed. Figures
2 and 3 illustrate representative portions of the search output.
For example, the number of profiles that are identical to the
search profile can be indicated, as well as the frequency of
the profile relative to the number of profiles in the dataset
used in the search (Figure 2). In addition, individual profiles,
with their respective identifiers, that are obtained as search
results can be displayed (Figure 3).
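The counting approach behind this output can be sketched in a few lines of Python. This is a simplified illustration, not the program's algorithm: it assumes every profile has been sequenced over the same ranges, represents each profile as a hypothetical mapping from nucleotide position to the base that differs from the CRS, and omits the handling of insertions and partial overlaps. The function names and the sample identifiers in the usage note are invented for the example.

```python
from collections import Counter


def count_differences(target, candidate):
    """Number of sites at which two profiles differ.

    Each profile maps position -> base differing from the CRS; a position
    absent from a profile is taken to agree with the CRS.
    """
    positions = set(target) | set(candidate)
    return sum(1 for pos in positions if target.get(pos) != candidate.get(pos))


def search(target, database, max_differences=0):
    """Tabulate matches per group and list the matching profiles.

    database: list of (identifier, group, profile) tuples.
    Returns ({group: (count, searched, frequency)}, [(identifier, n_diff)]).
    """
    searched = Counter()
    matched = Counter()
    hits = []
    for identifier, group, profile in database:
        searched[group] += 1
        n_diff = count_differences(target, profile)
        if n_diff <= max_differences:
            matched[group] += 1
            hits.append((identifier, n_diff))
    # Frequency is count / number of sequences searched, as in the output.
    summary = {g: (matched[g], searched[g], matched[g] / searched[g])
               for g in searched}
    return summary, hits
```

With a three-profile toy dataset, a call such as `search({16126: "C", 73: "G"}, database, max_differences=1)` returns the per-group counts and frequencies together with each hit and its number of differing sites.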
A pairwise
search may be helpful in determining the general relationships
between datasets. Pairwise comparisons are performed using the
same algorithms as used for a single profile search but are invoked
by selecting Mode, then Pairwise from the menu
bar (Figure 4). In addition, the user can specify either the
specific sequence ranges to be searched or limit the comparison
to the regions common to every profile in the selected dataset(s).
Note that two regions may be defined (HVI and HVII) but that
no gaps in either are permitted in the search sequence. Figure
5 displays excerpted results of several intragroup and intergroup
comparisons. It includes the number of matches and number of
comparisons performed within the dataset, the quotient of the
two, and the mean number of nucleotide differences per comparison.
The mean number of differences tends to be similar for different
populations within a major group. Also provided (but not illustrated)
are counts of mtDNA types observed within each group and estimates
of the random match probability (Stoneking et al. 1991) and genetic
diversity (Tajima 1989).
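The pairwise summary statistics can be illustrated with a short Python sketch. It assumes that the random match probability is the sum of squared type frequencies and that genetic diversity is n/(n-1) times one minus that sum, which are the forms commonly used in the forensic mtDNA literature; whether the program computes exactly these quantities is an assumption. Profiles are again represented as hypothetical position-to-base mappings over identical sequenced ranges.

```python
from collections import Counter
from itertools import combinations


def count_differences(a, b):
    """Number of sites at which two position -> base profiles differ."""
    positions = set(a) | set(b)
    return sum(1 for pos in positions if a.get(pos) != b.get(pos))


def pairwise_summary(profiles):
    """Compare every profile with every other profile in one group."""
    n = len(profiles)
    if n < 2:
        raise ValueError("at least two profiles are required")
    diffs = [count_differences(a, b) for a, b in combinations(profiles, 2)]
    comparisons = len(diffs)                      # n * (n - 1) / 2
    matches = sum(1 for d in diffs if d == 0)
    # Identical profiles collapse into one mtDNA type.
    types = Counter(frozenset(p.items()) for p in profiles)
    sum_sq = sum((count / n) ** 2 for count in types.values())
    return {
        "comparisons": comparisons,
        "matches": matches,
        "match_fraction": matches / comparisons,
        "mean_differences": sum(diffs) / comparisons,
        "types": len(types),
        "random_match_probability": sum_sq,              # assumed form
        "genetic_diversity": n / (n - 1) * (1 - sum_sq),  # assumed form
    }
```

Under these assumed formulas, the genetic diversity equals one minus the fraction of matching pairs, so the two reported quantities can be cross-checked against each other.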
Contact Information
Deborah Polanskey
Federal Bureau of Investigation
DNA Analysis Unit 2
Room 3220
2501 Investigation Parkway
Quantico, Virginia 22135
User's Manual: mtDNA Population Database
The Release Notes are
published in this issue of Forensic Science Communications.
Accessing the mtDNA Population Database
The database is installed on the user's hard drive from the download
link on this page. The database will download only if you are using
Internet Explorer. If you experience difficulties with the database,
please call 202-324-4354.
This version of the mtDNA Population Database was revised as of June 2004. Changes made to the database can be found in the most recent copy of the Release Notes.
Disclaimer
Mitochondrial
DNA (mtDNA) is a small, circular piece of DNA. It is found outside
the nucleus in most cells and is generally involved with the
production of energy for the body. Like other types of DNA, a
portion of mtDNA does not encode proteins and has no known function.
This region, called the control region, is used in forensic DNA
analysis because it is highly variable from person to person.
The mtDNA population database is a compilation of differences
in mtDNA control regions from a random collection of unrelated
individuals of various ethnic backgrounds. A variety of forensic,
law enforcement, and academic institutions have contributed to
the mtDNA population database. The database is intended for use
by the scientific community to establish the relative level of
occurrence of a particular genetic type in a particular group
of individuals. Identifying information specific to any particular
person, such as age, sex, or disease status, is not available.
Acknowledgements
The following
institutions contributed nucleotide sequence data to the mtDNA
population database:
Armed Forces DNA Identification Laboratory, Rockville, Maryland
Illinois State Police, Springfield, Illinois
Institute of Legal Medicine, University of Innsbruck, Innsbruck, Austria
University of California at Berkeley, Berkeley, California
This work
was supported in part by FBI contract #J-FBI-98-090, with contributions
from the National Institute of Justice.
References
Anderson,
S., Bankier, A. T., Barrell, B. G., de Bruijn, M. H. L., Coulson,
A. R., Drouin, J., Eperon, I. C., Nierlich, D. P., Roe, B. A.,
Sanger, F., Schreier, P. H., Smith, A. J. H., Staden, R., and
Young, I. G. Sequence and organization of the human mitochondrial
genome, Nature (1981) 290:457-465.
Budowle,
B., Wilson, M. R., DiZinno, J. A., Stauffer, C., Fasano, M. A.,
Holland, M. M., and Monson, K. L. Mitochondrial DNA regions HVI
and HVII population data, Forensic Science International
(1999) 103:23-35.
Miller,
K. W. P. and Budowle, B. A compendium of human mitochondrial
DNA control region: Development of an international standard
forensic database, Croatian Medical Journal (2001) 42(3):315-327.
Miller,
K. W. P., Dawson, J. L., and Hagelberg, E. A concordance of nucleotide
substitutions in the first and second hypervariable segments
of the human mtDNA control region, International Journal of
Legal Medicine (1996) 109:107-113.
National
Research Council. NRC Report II: The Evaluation of Forensic
DNA Evidence. National Academy Press, Washington, DC, 1996, p.159.
Stoneking,
M., Hedgecock, D., Higuchi, R. G., Vigilant, L., Erlich, H. A.,
Arnheim, N., and Wilson, L. A. Population variation of human
mtDNA control region sequences detected by enzymatic amplification
and sequence-specific oligonucleotide probes, American Journal
of Human Genetics (1991) 48:370-382.
Tajima,
F. Statistical method for testing the neutral mutation hypothesis
by DNA polymorphism, Genetics (1989) 123:585-595.
United
Nations. Terminology Bulletin No.347/Rev.1: Country Names.
United Nations Office of Conference and Support Services, New
York, 1997, pp.1-50.