Vocabulary Control in the Plant Genome Database


Dr. C. Rose Broome
Database Specialist
Plant Genome Data and Information Center
National Agricultural Library, USDA
Beltsville, MD

A filing system succeeds only if its users can easily find the information they need in it.

The most natural and direct way of accessing a document or other information source is by subject matter. Folders, library card catalog entries, or database records can be filed (physically or logically) according to some kind of subject classification. As long as the "filer" and the "retriever" use the same classification scheme, what goes in can be retrieved selectively and usefully.

Computerized databases have proved to be superior filing systems, allowing fast access to information in many ways, no matter how the records are physically stored. A record may have several subject terms assigned to it by the "filer" (or more correctly, the indexer), who places the terms in a special database field that is often labeled "Descriptor," "Keyword," or "Subject Heading." This field is analogous to the subject index that is found in the back of a book, and it may be used by the "retriever" (or searcher) to locate the precise kind of information desired.

This system of indexing works well only if the searcher asks for the record using one or more of the exact terms the indexer chose to classify the information in the record. But even two experts in the same subject matter may not use exactly the same word or phrase to describe the contents of a document. One geneticist may use the term "b chromosomes" whereas another might choose "supernumerary chromosomes" as a subject term. One may think of a term in the singular form (e.g. "leaf"); another may use the plural form ("leaves"). Regional spellings may differ ("color" vs. "colour"; "aluminum" vs. "aluminium"), as may regional usage ("lucerne" in Great Britain vs. "alfalfa" in the United States).

Controlled Vocabularies

The library and information science community has addressed the problems inherent in subject access by adopting the use of "controlled vocabularies," lists of acceptable subject terms that must be used by indexer and searcher alike. The indexer, armed with a knowledge of the subject matter and with a list of subject terms, selects the most appropriate terms with which to classify the item and enters them into the subject field of the database record. The searcher, armed with some degree of knowledge about the subject and with a copy of the same controlled vocabulary list, then selects from the subject field those terms that most closely match the precise subject matter of interest. Precision in retrieval is greatly aided if one knows the exact form and spelling of a subject term.
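The idea can be sketched in a few lines of Python. This is only an illustration, not a description of how any NAL system is implemented: the table of entry terms, the choice of which form is "preferred," and the sample records are all hypothetical, built from the variant terms mentioned earlier.

```python
# Hypothetical entry-term table: every variant points to one preferred term.
# Which form is "preferred" is an arbitrary choice made for this sketch.
PREFERRED = {
    "b chromosomes": "b chromosomes",
    "supernumerary chromosomes": "b chromosomes",
    "leaf": "leaves",
    "leaves": "leaves",
    "color": "color",
    "colour": "color",
    "alfalfa": "alfalfa",
    "lucerne": "alfalfa",
}

# Hypothetical records whose descriptor fields hold only preferred terms.
RECORDS = [
    {"title": "Record A", "descriptors": {"alfalfa", "leaves"}},
    {"title": "Record B", "descriptors": {"b chromosomes"}},
    {"title": "Record C", "descriptors": {"color", "leaves"}},
]


def to_preferred(term):
    """Return the preferred form of a term, or None if it is not in the list."""
    key = " ".join(term.lower().split())  # ignore case and extra spaces
    return PREFERRED.get(key)


def search(records, term):
    """Return titles of records indexed under the preferred form of term."""
    preferred = to_preferred(term)
    if preferred is None:
        return []  # the term is not part of the controlled vocabulary
    return [r["title"] for r in records if preferred in r["descriptors"]]
```

With such a table, a searcher who types "lucerne" and an indexer who wrote "alfalfa" meet at the same descriptor; a term absent from the list retrieves nothing, which signals that the vocabulary list should be consulted.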

In a well-constructed vocabulary list, one term should represent one concept. The term chosen to characterize a data record may consist of one word (e.g. "aneuploidy"), or the concept may be represented by a multi-word phrase (e.g. "single-seed descent" or "ornamental woody plants"). The thing represented is conceptualized by the indexer and the searcher alike as a unitary class within the context of the subject matter.

Plant Genome Database

Controlled vocabularies may consist of lists containing a few well-defined categories. For example, in the Plant Genome Database (PGD) being developed at NAL, a field called "genome type" has a controlled vocabulary of only three terms: "nuclear," "chloroplast," and "mitochondrial." Fields such as "linkage-group type," "map type," and "stock type" (for genetic stock collections) have longer but fairly brief, stable lists of acceptable terms. But such fields as "phenotypic trait" (containing such values as "flower color," "yield," "chromosome number," and "seed weight") may eventually have hundreds or thousands of acceptable terms--some of them common to more than one species of organism, but others being unique characteristics of but one species.
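A short list such as the one for "genome type" lends itself to automatic checking at data entry. The following Python sketch shows one way this might be done; the record layout and the function name are assumptions made for illustration and do not describe the actual PGD design.

```python
# Controlled list for one field, using the three terms named in the text.
# The dictionary layout and the validate() helper are hypothetical.
CONTROLLED_FIELDS = {
    "genome type": {"nuclear", "chloroplast", "mitochondrial"},
}


def validate(record):
    """Return (field, value) pairs whose value is not an acceptable term."""
    problems = []
    for field, allowed in CONTROLLED_FIELDS.items():
        if field in record and record[field] not in allowed:
            problems.append((field, record[field]))
    return problems
```

A record with the value "nuclear" passes with no problems reported, whereas a value such as "plastid" would be flagged for the indexer to correct. Longer lists, such as those for "phenotypic trait," could be checked the same way but would need far more upkeep.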

In its initial phase, PGD will include data on only a small number of vascular plant species, so vocabulary control will, at first, be relatively simple. But over the next few years the database is expected to expand to include data on the genetics and biology of an as yet undetermined number of agriculturally significant plant, animal, and microbial species. The more species (and their specialized terminologies) that are added, the more challenging it becomes to control the terminology so as to permit selective retrieval of information.

Problems

Several problems arise when one attempts to merge several precise, highly technical vocabularies into a general purpose vocabulary of biological terms.

The English language is rich and complex, and it abounds in homographs (words spelled the same way but with different meanings) and in synonyms or near-synonyms. In building a controlled vocabulary of terms that describe the morphology and anatomy of plants, animals, and microorganisms, ambiguities will need to be resolved for words such as "ear" (the infructescence of the corn plant or the auditory organ of a mammal?) and "cob" (part of that "ear" on the corn plant or a kind of small horse?).
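One common thesaurus convention for handling homographs is to attach a parenthetical qualifier to each sense, so that every term in the list stands for exactly one concept. The Python sketch below illustrates that convention; the qualifier wording and the function names are hypothetical, and the text does not say that PGD will adopt this approach.

```python
# Hypothetical qualified terms for the two homographs named in the text.
HOMOGRAPHS = {
    "ear": ["ear (corn infructescence)", "ear (mammalian auditory organ)"],
    "cob": ["cob (corn)", "cob (horse)"],
}


def resolve(term):
    """Return the candidate terms for a word; more than one means a choice is needed."""
    key = term.strip().lower()
    return HOMOGRAPHS.get(key, [key])


def is_ambiguous(term):
    """True when the word maps to more than one qualified term."""
    return len(resolve(term)) > 1
```

An indexing or search program built along these lines could prompt the user to pick one of the qualified terms whenever an ambiguous word is entered, rather than silently mixing maize records with mammal records.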

Large and complex controlled vocabularies, including lists of gene symbols, names of chemical components, metabolic processes/pathways, and lists of accepted scientific names for organisms, will be required for other data elements in PGD. These vocabularies will of necessity be developed by the collaborative efforts of many biologists over a considerable time. Fortunately these will not all have to be created de novo, as many published thesauri, glossaries, and dictionaries exist that cover the disciplines to be represented in PGD. These authority lists will be carefully examined to see if they may be incorporated into what may ultimately become a full-blown thesaurus for agricultural genomics.

CAB Thesaurus

NAL uses the CAB Thesaurus, published in England by C.A.B. International, to index journal articles for the AGRICOLA database. To search AGRICOLA for articles on a particular subject, one may obtain a copy of the CAB Thesaurus [from CAB International, 845 N. Park Avenue, Tucson AZ 85719, telephone 800-528-4841] and find out exactly what terms are used in the "Descriptor" field to describe subjects of interest to the searcher. By using precise terminology, the searcher can attain great precision in retrieval and avoid "false drops."

The scope of the CAB Thesaurus covers all of agriculture. Because it aims at breadth rather than depth, it is presently inadequate as a single source of controlled vocabulary terms for a database as detailed as PGD. However, it may serve as a starting point. More detailed hierarchies of terms may be added to PGD vocabularies through the collaboration of scientists contributing data to the database.

Nomenclatural Lists

Standardized vocabulary and nomenclature lists are being developed by several international organizations for such data types as gene names (International Society of Plant Molecular Biology), plant names in common use (International Union of Biological Sciences), and enzyme nomenclature (International Union of Biochemistry). PGD will utilize pertinent nomenclatural lists sanctioned by the International Unions for use as authority files for the appropriate database fields.

Vocabulary Development

As more species are accommodated in PGD, problems of vocabulary control will inevitably multiply. However, to enable geneticists to search across species for common genetic factors, mechanisms, and expressions, the terms chosen for description must allow them, whenever possible, to detect genetic commonalities not only when comparing apples with apples, but also when comparing apples with oranges or orangutans.
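The payoff of shared terminology for cross-species searching can be shown with a small Python sketch. The records, the species chosen, and the field names are hypothetical examples; only the trait terms are taken from the list of sample values given earlier for the "phenotypic trait" field.

```python
from collections import defaultdict

# Hypothetical records; every record uses the same controlled trait terms.
RECORDS = [
    {"species": "Zea mays", "trait": "seed weight"},
    {"species": "Glycine max", "trait": "seed weight"},
    {"species": "Glycine max", "trait": "flower color"},
]


def species_by_trait(records):
    """Build an index from each trait term to the set of species it describes."""
    index = defaultdict(set)
    for record in records:
        index[record["trait"]].add(record["species"])
    return index
```

Because both species are indexed under the identical term "seed weight," a single lookup retrieves both. Had one contributor written "kernel weight" instead, the commonality would have gone undetected.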

Vocabulary development must be considered an important adjunct to data collection and input throughout the life cycle of the PGD project to ensure that valuable information in the database is found and put to good use.

A good text for further reading on this subject is Vocabulary Control for Information Retrieval, 2nd ed. (1986), by F. W. Lancaster, Information Resources Press, Arlington, VA.