Standards for Scientific and
Technical Data
John Rumble
National Institute of Standards and Technology
President, CODATA
john.rumble@nist.gov
This Talk
- Needs for S&T data and database standards
- Basic approaches to standards
- Content of S&T data and database standards
- Types of standards bodies
- Examples
Needs for S&T Data and Database
Standards
- Improved data collection
completeness, uniformity, easier
- More efficient database building
collective solutions to nomenclature and
representational problems
basis for data dictionaries
- Data exchange and integration
sharing, combining, and comparing data
- Easier data use
common connections to applications, knowledge
discovery
Basic Approaches to Data and
Database Standards
- Neutral formats
allows one to maintain own format
only requires translation into and out of format
- Define data elements
include normal data dictionary information
identify minimum set necessary
define others in case wanted or needed
allow extensions (self-definition)
Do not underestimate difficulties in resolving
nomenclature problems
- Separate semantics from syntax
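The neutral-format and data-element ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names, the two "lab" layouts, and the `x_` prefix for extensions are all hypothetical and do not come from any published standard.

```python
# Hypothetical neutral format: each element has a data dictionary entry.
# "required" marks the minimum set; the rest are defined in case they are wanted.
NEUTRAL_ELEMENTS = {
    "substance_name": {"type": str, "required": True},
    "property_name": {"type": str, "required": True},
    "value": {"type": float, "required": True},
    "unit": {"type": str, "required": True},
    "temperature_K": {"type": float, "required": False},
}

EXTENSION_PREFIX = "x_"  # assumed convention for self-defined extensions


def validate(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, spec in NEUTRAL_ELEMENTS.items():
        if name not in record:
            if spec["required"]:
                problems.append(f"missing required element: {name}")
        elif not isinstance(record[name], spec["type"]):
            problems.append(f"wrong type for element: {name}")
    # Anything outside the dictionary must be a declared extension.
    for name in record:
        if name not in NEUTRAL_ELEMENTS and not name.startswith(EXTENSION_PREFIX):
            problems.append(f"undefined element: {name}")
    return problems


def lab_a_to_neutral(row):
    """Translate into the neutral format from a (name, property, value, unit) tuple."""
    name, prop, value, unit = row
    return {
        "substance_name": name,
        "property_name": prop,
        "value": float(value),
        "unit": unit,
    }


def neutral_to_lab_b(record):
    """Translate out of the neutral format into a flat 'key=value' string."""
    return ";".join(f"{key}={record[key]}" for key in sorted(record))
```

Each group keeps its own layout and writes only two translators, one into and one out of the neutral format, rather than one translator per partner.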
Content of S&T Data and Database
Standards - 1
- Description of substance, systems,
species, taxa,
- Reporting properties, measurements,
observations, characteristics, calculations, results
- Property context
Description of Substances, Systems,
Species - 2
- Required for computerized description
- Uniqueness
it is this substance, not that one
different levels - protein, protease,
protease IV, protease IV (acidopolus), ...
- Equivalency
same to a specified level: all ketones,
all ethyl-ketones, R-ketones, …
- Can include association (bonding,
joining, etc.), interactions, reactions
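One way to picture uniqueness and equivalency is to describe each substance as a path from the most general term to the most specific. The sketch below assumes such a path representation; the tuple layout, the function names, and the second protease entry are hypothetical and serve only to illustrate the idea.

```python
# Descriptions as paths, most general term first (illustrative values only).
PROTEASE_IV = ("protein", "protease", "protease IV")
PROTEASE_X = ("protein", "protease", "protease X")  # hypothetical second entry
KINASE = ("protein", "kinase")


def is_same_substance(a, b):
    """Uniqueness: the full descriptions match, so it is this substance, not that one."""
    return a == b


def equivalent(a, b, level):
    """Equivalency: the descriptions agree on their first `level` terms."""
    if level < 1 or level > min(len(a), len(b)):
        return False
    return a[:level] == b[:level]
```

Under this sketch, `equivalent(PROTEASE_IV, PROTEASE_X, 2)` holds because both are proteases, while `is_same_substance(PROTEASE_IV, PROTEASE_X)` does not.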
Description of Substances, Systems,
Species
- Several types of information
primary identifiers (names), specifications, characterization and composition,
source, processing history, reference test results, association (bonding, joining),
material form, supplementary information
- Millions of chemicals, species,
people,
- Do not have to include all in
standard
- Often have variety of description
approaches
- Most S&T substances, systems
and species need information modeling, e.g., for biological species
Reporting Properties, etc.
- Properties, measurements, observations,
characteristics, calculations, results
- Usually more complex structure
than realized
- Greater dependency on variables
than thought
- Often can be represented in a
variety of ways - text, numbers, equations, coordinate systems, other
- Multiple nomenclature systems
also a problem here
Property Context -1
- Conditions (situation) under which
the property, observation, etc. is meaningful
- One meaning of metadata
- Independent variables of two types
those set at beginning and not changed
those varied throughout data set
- Includes data collection methods,
data analysis documentation, others
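The two types of independent variables can be kept apart in the data structure itself. The following is a minimal sketch, assuming a hypothetical `DataSet` class; the field names are invented for illustration and are not part of any standard.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    property_name: str
    unit: str
    fixed_conditions: dict   # variables set at the beginning and not changed
    varied_names: tuple      # names of variables varied throughout the data set
    method: str = ""         # data collection method, part of the context
    points: list = field(default_factory=list)

    def add_point(self, value, **varied):
        """Record one value together with the varied variables it depends on."""
        if set(varied) != set(self.varied_names):
            raise ValueError(f"expected exactly these variables: {self.varied_names}")
        self.points.append((value, dict(varied)))

    def context_of(self, index):
        """Full context of one point: fixed conditions plus its varied values."""
        _, varied = self.points[index]
        return {**self.fixed_conditions, **varied}
```

Storing the fixed conditions once and the varied variables per point keeps every value tied to the conditions under which it is meaningful.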
Property Context -2
- Variables can be very numerous
(hundreds) and complex
- Information models help greatly
- Most researchers do not record
every variable
Must construct definitions of two types
mandatory - without these data set would
be useless
optional - if you record, do it this way
- List
controlled vocabulary
suggested vocabulary
free text
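The mandatory/optional split and the three kinds of value lists can be combined in a simple checker. This is a sketch under assumptions: the variable names, the term lists, and the choice to treat an unlisted suggested term as a warning instead of an error are all hypothetical.

```python
CONTROLLED, SUGGESTED, FREE_TEXT = "controlled", "suggested", "free text"

# Hypothetical context definitions for a mechanical test data set.
DEFINITIONS = {
    "test_method": {"mandatory": True, "kind": CONTROLLED,
                    "terms": {"tensile", "compression"}},
    "specimen_shape": {"mandatory": False, "kind": SUGGESTED,
                       "terms": {"bar", "sheet"}},
    "operator_notes": {"mandatory": False, "kind": FREE_TEXT, "terms": set()},
}


def check_context(context):
    """Return (errors, warnings) for a dict of recorded context variables."""
    errors, warnings = [], []
    for name, rule in DEFINITIONS.items():
        if name not in context:
            # Without a mandatory variable the data set would be useless.
            if rule["mandatory"]:
                errors.append(f"missing mandatory variable: {name}")
            continue
        value = context[name]
        if rule["kind"] == CONTROLLED and value not in rule["terms"]:
            errors.append(f"{name}: '{value}' is not in the controlled vocabulary")
        elif rule["kind"] == SUGGESTED and value not in rule["terms"]:
            warnings.append(f"{name}: '{value}' is not a suggested term")
        # Free text is accepted as recorded.
    return errors, warnings
```

Optional variables are never demanded, but when one is recorded it is checked against its definition, which matches the "if you record, do it this way" rule.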
Types of Standards Bodies
- Formal
ISO
EU
National Standards Development Organizations
(SDOs, e.g., JSI)
- Informal
Professional and technical societies
Unions
Pre-standardization
- New
OMG
W3C
CODATA
Examples of S&T Data and
Standards
- Crystallography - CIF, mmCIF
- Physics - neutron collisions,
nuclear structure
- Analytical chemistry - IR, NMR,
Mass Spectra
- Chemical nomenclature - IUPAC,
CAS, CML
- Surface characterization - XPS
- Engineering materials - ISO STEP, ASTM, MatML
- Biology - viruses, …
- Molecular biology - mmCIF, OMG
The Standards Process
1. Someone identifies need for standard
2. A group is convinced of the need
3. A technology (approach) is proposed
4. A consensus is achieved by all interested parties
5. The standard is published
6. The standard is used by people and groups who have a vested interest (business
reason) to do so
Some Standards Economics and
Sociology
- You must be motivated to build
a standard
- You must be motivated to use
a standard
- Industry builds and uses standards
for business reasons
- If it is hard to build a standard,
the motivation is usually lacking
Standards and Science
- Scientists are often reluctant
to use standards because they are not the state of the art
- Scientists are not accustomed
to working towards a consensus
- Scientists are usually uncomfortable
in working with a formal standards group
- Most scientific data standards
are developed because of a hidden business motivation
Where does CODATA fit in - 1
- Long experience in development
of data reporting requirements
- Multi-national, multi-disciplinary
- Existing Commission, Task Group
activities
- Friendlier environment for scientists
- Considerable experience in setting
scientific data standards
Where does CODATA fit in - 2
- Instructional and training materials
- Task groups
- Web and print publication
- Neutrality
- Cross-union cooperation
- Comprehensive survey of scientific
data and database standards
- CODATA wants to create an environment
helpful for data and database standards