Categorization of Government Information (CGI) Working Group
U.S. Federal Interagency Committee on Government Information

Recommendation for Search Interoperability

Location: This document is located on the Internet at http://www.cio.gov/documents/ICGI/recommendation.html

Comments: This document will be retired on completion of a recommendation to OMB by the U.S. Federal Interagency Committee on Government Information, in December 2004.
Until September 27, 2004, comments on this document may be sent to the editor, Eliot Christian, U.S. Geological Survey (e-mail: echristian@usgs.gov).
Comments received and acted upon thus far are listed at http://www.search.gov/interop/rec-comments.html

Contents:

  1. Recommendation
  2. Implications
  3. Background
  4. Base Requirements
  5. ISO 23950 Overview
  6. Alternatives Considered
  7. Review Process Used
  8. Notes and References

1 Recommendation

The U.S. Federal Government should adopt a search service standard to enhance interoperability among networked systems that aid in the discovery of and access to government information. The adopted search service standard should be the ISO 23950 international standard, thereby providing a high degree of interoperability across many communities of practice and types of data and information holdings (see section 5, ISO 23950 Overview, below). This recommendation follows existing law and policies of the U.S. Federal Government, positioning the standard search service as a supplement to other search mechanisms as may be needed for reasons other than broad scale interoperability.

2 Implications

Policy - No additional policy action is needed to implement the standard search service recommended here. The U.S. Federal Government already has law and policy mandating a standard search service as part of the Government Information Locator Service established by law (United States Code Title 44, Section 3511). Corresponding Federal policy (OMB Memorandum 98-5) required a standard search service to be used for locating government information. That standard search service is described in Federal Information Processing Standard (FIPS) 192-1, which is required to be cited in procurements of search technology by Federal agencies. FIPS 192 adopted a profile [GILS] of the international standard recommended here, ISO 23950. Similar law, policy, and standards exist for geospatial data in the United States (i.e., E-Government Act section 216, OMB Circular A-16, and the Geospatial profile [GEO] of ISO 23950).

Oversight - As noted in existing policy, GSA and OMB oversight should be exercised to assess and enforce existing law and policy requiring search technology procured by Federal agencies to comply with FIPS 192. FIPS 192 should be updated to include the newly available "Web services" profile of ISO 23950 known as [SRW].

Cost - There is an ongoing operational cost to government in supporting any standardized search service, but this would be essentially the same as what is entailed in setting up non-standard search services. When first introduced, the support of a search service standard may prompt an add-on cost in acquisitions of search technology, but such a cost would be a small percentage of what the U.S. Federal government spends on disseminating government information. For example, an Internet portal for government information may cost millions per year, while the additional support for the search service standard on that portal may be thousands per year. Also, after search technology vendors have implemented the standard interface once, their costs for supporting additional implementations should be very minor.

3 Background

An information index helps a searcher to find resources. Such an index usually covers just one collection of resources, yet searchers often want to search across multiple collections. The ability to search multiple, separately operated indexes is called "search interoperability". Amazingly, libraries worldwide already offer interoperable search across their many thousands of collections. This search interoperability is based on a carefully negotiated international standard supported by the major vendors of information retrieval technology. The standard also addresses far more than mere "word in text" search: it includes sophisticated methods needed for precise searching of collections holding millions of diverse resources.

Clearly, libraries worldwide are meeting their customers' needs to have seamless access to information. The same cannot be said of governments. Agencies no longer rely on libraries as the backbone of information dissemination--they offer information directly to the public via their own indexes and directories of Web pages, databases, and a diverse range of specialized services. It is true that the amount of accessible government information is growing at a healthy rate, but the need for people to confidently search for government information across agencies and levels of government is not being met. The problem is not so much the amount of information. The problem is that few governments have yet focused on search interoperability.

Because public access to government information is the basis of effective, accountable and transparent government, interoperability of government search facilities is essential. Adoption of a search service standard would serve the public interest by making government information more readily accessible through the diverse community of government information providers. Search interoperability also generates government-wide efficiencies: from increased information sharing, and from lowered costs for mechanisms needed to merge information from multiple government sources. Efficiencies accrue within each single government organization, as well. For instance, a search service standard provides some "future-proofing" against changes in search technology. With standards-based search, the periodic migration to new search technology is not so disruptive, and it is also easier to maintain access to legacy holdings.

Governments at all levels worldwide are major producers and consumers of data and information, encompassing many communities of practice and types of data and information holdings. Because governments both depend upon and foster a competitive intermediary market for information dissemination and service delivery, government support of broad scale, standards-based interoperability is essential. In that regard, governments must promote an information search interface that is non-proprietary, fair, and stable. By acquiring products that support an international standard search service, governments will encourage a fair and competitive market for products, and maximize agency choice.

The E-Government Act of 2002 requires the U.S. Federal Government to enhance search interoperability by adopting a common standard. Section 207 ("Accessibility, Usability, And Preservation of Government Information"), requires that the Interagency Committee on Government Information submit recommendations to the Director of the Office of Management and Budget (OMB) on "the adoption of standards, which are open to the maximum extent feasible, to enable the organization and categorization of Government information in a way that is searchable electronically, including by searchable identifiers; and in ways that are interoperable across agencies".

4 Base Requirements

This Recommendation satisfies all of the mandatory and desirable requirements as given in the Statement of Requirements for Search Interoperability, posted for public comments and revised over the period February - April, 2004. Citations of "Stated Requirement" in the following table refer to section paragraphs as given in that Statement of Requirements at http://www.search.gov/interop/requirements.html .

Requirement | Paraphrased Statement of Requirement | Mandatory or Desirable
------------|--------------------------------------|-----------------------
6.1 (par. 3) | Supports different levels of access control, such as restrictions by service, session, distributed resource, database, record, or data element. | Mandatory
6.1 (par. 4) | Supports authentication of user identity through an ancillary service (e.g., "e-Authentication"). | Mandatory
6.1 (par. 4) | Supports verification of the integrity of delivered data, metadata, or other information. | Mandatory
6.2 (par. 1) | Supports the search service standard for library catalogs accessible over network technologies, [ISO 23950] (identical to ANSI/NISO Z39.50). | Mandatory
6.2 (par. 2) | Supports the library standard for catalog records, Machine-Readable Cataloging. | Mandatory
6.2 (par. 3) | Supports access to data without mandating proprietary technologies, vocabularies, or thesauri. | Mandatory
6.3 (par. 1) | Can be readily accommodated by leading search products, including Internet search engines. | Mandatory
6.3 (par. 1) | Supports search of information that may be unstructured (often called "full-text"), semi-structured (typically represented with inline "markup"), or structured (sometimes known as "fielded"). | Mandatory
6.3 (par. 2) | Supports search of HTML meta element contents and other varieties of metadata embedded within particular types of files (e.g., PDF, e-mail). | Mandatory
6.3 (par. 2) | Supports customizable search of other varieties of structured metadata through common mechanisms such as SQL and LDAP. | Mandatory
6.4 (par. 2) | Provides for interoperable search across locators for information and collections of information. | Mandatory
6.4 (par. 3) | Interoperable with the international standard search service supporting the U.S. National Spatial Data Infrastructure Clearinghouse of geospatial data. | Mandatory
6.5 (par. 1) | Implementable over the Internet using TCP/IP, HTTP/HTTPS, HTTP GET and HTTP POST. | Mandatory
6.5 (par. 1) | Precisely defined as to how searches are expressed and communicated between a client component and a server component, including a query language, a query syntax, and standardization of a result set schema. | Mandatory
6.5 (par. 1) | Specified in an interface definition language such as Web Services Definition Language. | Desirable
6.6 | Supports searching of structured information using a nested Boolean query, e.g., (date > '20040101') AND ((subject = 'earthquake') OR (subject = 'temblor')). | Mandatory
6.6 | Supports the usual sets of data structures (word, phrase, date, URL) and relations (equal, greater than, less than). | Mandatory
6.7 (par. 1) | Includes a query evaluation function to handle "abstract concepts" (e.g., name, category, date) according to what they mean semantically rather than merely how they may be labeled syntactically. | Mandatory
6.7 (par. 3) | Supports abstract concepts that are produced by semantic mapping without requiring any particular semantic mapping technique. | Mandatory
6.8 | Supports gateway to Internet Anonymous FTP Archive (IAFA) file system catalogs and Distributed Authoring and Versioning for the Web (WebDAV). | Desirable
6.8 | Adapts readily to the underlying data model of named properties and property sets that is defined for objects addressable by software. | Mandatory
6.9 (par. 1) | Already in production use for searching metadata variants such as Dublin Core Metadata Initiative (ISO 15836), Encoded Archival Description, and ISO 8879 Standard Generalized Markup Language (SGML). | Desirable
6.10 | Compatible with many and diverse approaches to compiling collections of information, without mandating any particular approach. | Desirable
6.11 | Supports interoperable search of business and services registries, modeled on ISO 11179 Metadata Registries, ebXML, or the Universal Description, Discovery, and Integration (UDDI) model. | Mandatory
6.12 | Scalable in terms of supporting arbitrarily complex searches. | Mandatory
6.12 | Scalable in not foreclosing concurrent searches on multiple servers. | Desirable
6.13 (par. 1) | Extensible to search tasks with unusual data structures and relations, definable through profiles or equivalent. | Mandatory
6.13 (par. 2) | Provides extension mechanisms to nurture innovation in areas not yet ready for the broadest level of standardization. | Desirable
6.14 | Has been in use worldwide in many languages. | Mandatory
6.14 | Supports negotiation between client and server as to each other's language capabilities for the session. | Mandatory
6.14 | Supports character set negotiation, with Latin-1 as a minimum for U.S. Federal Government applications. | Mandatory

5 ISO 23950 Overview

ISO 23950, the international standard for information search and retrieval, defines a particular set of network client-server "services". The definition is powerful enough to accommodate the most commonly required search functions over a broad range of search facilities, including the requirements stated above. This section provides a general overview of ISO 23950 using the particular variety of ISO 23950 known as the [SRW] (Search and Retrieve for the Web) profile.

By analogy to a restaurant, a network service in operation is like a waiter handling a dinner order from a customer. Just as a customer is not expected to give step-by-step instructions to the kitchen, the ISO 23950 service allows the client to precisely specify the request but does not allow the client to specify the exact procedure for satisfying the request. This is an important feature for security as well as for broad interoperability. Clients have no more control than necessary, and clients need not know execution details.

For example, a searcher who wants to find what the Library of Congress may have on "fruit" can send an ISO 23950 search request that looks like this:
http://z3950.loc.gov:7090/voyager?operation=searchRetrieve&version=1.1&query=fruit

This SRW search syntax uses the Internet standard for URLs (RFC 1738). The search request has two component parts: a "base URL" and a "searchpart", separated by a question mark ("?"). The base URL identifies the server host and port (here, "z3950.loc.gov:7090") and the ISO 23950 service (here, "voyager"). The searchpart consists of parameters separated by "&", each with the structure "key=value". The names of the parameters from the ISO 23950 service description are the "key" strings within the URL. (In this example, the keys are "operation", "version", and "query".)
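The structure just described can be sketched in a few lines of Python using only the standard library. This is an illustration, not part of any SRW toolkit: the helper names build_srw_url and split_srw_url are assumptions introduced here, and only the base URL and the three keys come from the example above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_srw_url(base_url, query, version="1.1"):
    """Join a base URL and a searchpart with "?" (hypothetical helper)."""
    params = [
        ("operation", "searchRetrieve"),
        ("version", version),
        ("query", query),
    ]
    # urlencode joins the key=value pairs with "&" and escapes the values.
    return base_url + "?" + urlencode(params)

def split_srw_url(url):
    """Split a search URL back into its base URL and searchpart parameters."""
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # parse_qs returns a list per key; each key appears once in this sketch.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return base, params

url = build_srw_url("http://z3950.loc.gov:7090/voyager", "fruit")
base, params = split_srw_url(url)
print(url)
print(base, params)
```

Building the searchpart through urlencode rather than by string concatenation means that query terms containing spaces or reserved characters are escaped correctly.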

The ISO 23950 definition of a standard search syntax provides an obvious level of interoperability. The example search statement could be applied to several popular Internet search services in this way:
 http://www.google.com/search?operation=searchRetrieve&version=1.1&query=fruit
 http://search.yahoo.com/search?operation=searchRetrieve&version=1.1&query=fruit
 http://alltheweb.com/search?operation=searchRetrieve&version=1.1&query=fruit
 http://www.altavista.com/web/results?operation=searchRetrieve&version=1.1&query=fruit
 http://vivisimo.com/search?operation=searchRetrieve&version=1.1&query=fruit

Without ISO 23950, a searcher would need to use the particular syntax invented by each search technology vendor:
 http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=fruit&btnG=Google+Search
 http://search.yahoo.com/search?fr=fp-pull-web-t&p=fruit
 http://alltheweb.com/search?cat=web&cs=utf8&q=fruit&_sb_lang=pref
 http://www.altavista.com/web/results?q=fruit&kgs=1&kls=0
 http://vivisimo.com/search?query=fruit&v%3Asources=Web&x=0&y=0

This is only a trivial example of the variety of search syntaxes supported by technology vendors, especially as most also support Boolean operations with fielded searches. The bewildering variety of search syntaxes has become a major barrier to search interoperability among Internet search vendors, just as it was among library catalog search vendors before agreements were reached on the international search standard in the 1990s.
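The practical cost of this variety can be illustrated with a short Python sketch. The table VENDOR_QUERY_KEY and both function names are hypothetical; the parameter names are copied from the vendor-specific URLs listed above, and the other vendor parameters are omitted for brevity. Nothing here implies that the listed services accept the standard form.

```python
from urllib.parse import urlencode

# Name of the query-term parameter in each vendor-specific URL shown above.
VENDOR_QUERY_KEY = {
    "http://www.google.com/search": "q",
    "http://search.yahoo.com/search": "p",
    "http://alltheweb.com/search": "q",
    "http://www.altavista.com/web/results": "q",
    "http://vivisimo.com/search": "query",
}

def vendor_url(base_url, term):
    """Without a standard, the client needs per-vendor knowledge."""
    key = VENDOR_QUERY_KEY[base_url]  # raises KeyError for an unknown service
    return base_url + "?" + urlencode({key: term})

def standard_url(base_url, term):
    """With a standard searchpart, one rule covers every conforming server."""
    searchpart = {"operation": "searchRetrieve", "version": "1.1", "query": term}
    return base_url + "?" + urlencode(searchpart)

for base in VENDOR_QUERY_KEY:
    print(vendor_url(base, "fruit"))
print(standard_url("http://z3950.loc.gov:7090/voyager", "fruit"))
```

The vendor table has to be extended, and kept current, for every new service, whereas the standard form needs no per-service entries at all.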

Following here is a bit more detail about the "Common Query Language" [CQL] syntax used in the "searchpart" of an SRW search URL introduced above. In CQL, a query can be as simple as an unqualified single term ("fruit" in the example above). Queries also may be joined together using the Boolean operators "and" and "or", as in the following example:
(bird or dinosaur) and (feathers or scales)
The Boolean "not" is used as a binary operator, finding records which contain "this but not that". For example,
dinosaur not reptile
would find records which contain the word "dinosaur" but not the word "reptile".
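As a rough illustration of how a server might evaluate such Boolean queries, the following Python sketch tests a query against the set of words in one record. It is a toy, not a CQL implementation: the function names are hypothetical, only bare terms, parentheses, and the three operators are handled, and the operators are assumed to have equal precedence and to be applied left to right.

```python
import re

def tokenize(query):
    """Split a query into parentheses and bare words."""
    return re.findall(r"[()]|[^\s()]+", query)

def matches(query, record_words):
    """Return True if the record (a set of lowercase words) satisfies the query."""
    tokens = tokenize(query)
    pos = 0

    def parse_expr():
        nonlocal pos
        result = parse_term()
        # Assumption: equal precedence, evaluated left to right.
        while pos < len(tokens) and tokens[pos].lower() in ("and", "or", "not"):
            op = tokens[pos].lower()
            pos += 1
            right = parse_term()
            if op == "and":
                result = result and right
            elif op == "or":
                result = result or right
            else:  # binary "not": this but not that
                result = result and not right
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        return tok.lower() in record_words

    value = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return value

print(matches("(bird or dinosaur) and (feathers or scales)", {"dinosaur", "feathers"}))
print(matches("dinosaur not reptile", {"dinosaur", "reptile"}))
```

A production server would evaluate the same parsed query against its indexes rather than against each record in turn.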

In addition to queries targeted at whole records, queries can be limited to a particular part of the records being searched. These searchable parts are called "indexes" in CQL. For example, limiting a search to the "author" index would find matches on the names of authors. An index is specified in CQL as part of a set of indexes, in recognition that different communities of practice sometimes have unique indexes. For instance, both the bibliographic and the heraldry communities might wish to name a "title" index, but those indexes would have different meanings.

In ISO 23950 and CQL, an "index" is an abstract concept. A CQL query that limits a search to the "author" index can be executed in various ways by the server application. For an e-mail collection, the author index may contain values taken out of the "from" field of e-mail messages; for a news clipping collection, the author index may contain values taken out of the "by-line" field in the news stories. This abstraction is very important for achieving search interoperability.
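The idea of an abstract index can be sketched in Python as a per-collection mapping from the index name to a concrete field. Everything in this sketch is hypothetical, including the mapping table, the function name, and the sample records. Only the "from" and "by-line" fields come from the text above.

```python
# Hypothetical mapping from the abstract "author" index to the concrete
# field that holds author names in each kind of collection.
INDEX_TO_FIELD = {
    "email": {"author": "from"},
    "news": {"author": "by-line"},
}

def search(collection_kind, records, index, term):
    """Find records whose mapped field contains the term (case-insensitive)."""
    field = INDEX_TO_FIELD[collection_kind][index]
    term = term.lower()
    return [r for r in records if term in r.get(field, "").lower()]

# Made-up sample data for illustration only.
emails = [
    {"from": "Pat Smith", "subject": "Budget"},
    {"from": "Lee Jones", "subject": "Agenda"},
]
stories = [
    {"by-line": "Pat Smith", "headline": "Quake shakes coast"},
    {"by-line": "Ana Ruiz", "headline": "Harvest report"},
]

# The client asks for the same abstract index in both cases.
email_hits = search("email", emails, "author", "smith")
news_hits = search("news", stories, "author", "smith")
print(email_hits)
print(news_hits)
```

The client's request is identical for both collections. Only the server-side mapping differs, which is what lets one query run across differently structured holdings.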

6 Alternatives Considered

Single Portal - From a management control perspective, there is a certain attraction to having just one system encompassing all relevant information. The manager of a portal might then focus just on "operability" issues within the manager's own control, relegating "interoperability" to "someone else's problem". Such a single portal could be physically distributed, using various mechanisms for "pulling" or "pushing" information, metadata, and update signals among distributed components of a logically centralized system. Yet, from a public policy perspective, the very idea of a single, master portal is unrealistic. Any effective government organization must accommodate relationships with other levels of government and with other public and private sector information sources. Consequently, any single portal must co-exist with other information portals, and so must support a degree of interoperability. From a technology perspective also, interoperability is a more appropriate approach. There are simply too many mutually incompatible search mechanisms already in place to imagine that any single solution could provide customized interfaces to all of them. The need for interoperability is even more obvious when one considers that many of those custom interfaces would be shaped by distinct vendors who can change interface specifications at will.

Common Data Model - In the early days of mainframe computer systems, it was common to envision an enterprise-wide "management information system" that mandated a common data model applied to all enterprise information systems. This approach is less stringent than subordinating all systems into some master, all-encompassing system, but it still does require central administration of an abstract and complex model shared by all interacting systems. In practice, this approach suffers much the same difficulties found in the "single portal" approach. Today's reality is that any government organization must accommodate a great variety of in-house and external actors who evolve their component systems independently. These largely independent systems already have their own data models that often have little in common, even when a single vendor has supplied the systems software.

Applications Programming Interface (API) - Software is an integral part of most government information systems and software is implemented through programs. Designers of complex systems usually divide software into modules that are each provided with a published interface with well-defined entry points for application programmers. Unfortunately, such an API approach must be tailored to each distinct programming environment. Now that there are many operating platforms and programming languages, the programming interfaces needed for broad interoperability have become too numerous to be manageable. However, the current "services-oriented architecture" approach underlying the present recommendation does build on the programming discipline of the API approach. The important difference is that "services" are based on the characteristics of a network interface between interacting systems, rather than being based on characteristics of the programming interface. This is a great advantage for the set of problems encountered in information search and retrieval.

Structured Query Language (SQL) - SQL has a long history of use, starting with the first management information system efforts several decades ago. When combined with an appropriate network service such as ODBC (Open DataBase Connectivity), SQL can be used as part of a services-oriented architecture. However, SQL by itself does not include the essential idea of search indexes as an abstract mapping against actual content structures. Also, SQL is oriented toward query of database tables rather than Information Retrieval against very large collections. An SQL query would result in a table having all records that satisfy the search constraints; Information Retrieval would build a "result set" giving a rank-ordered listing of records that satisfy a search request, any of which might be actually retrieved in a separate operation. Nevertheless, SQL is often used very effectively in combination with the ISO 23950 international standard search service recommended here.
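The contrast between an SQL table of matching rows and an Information Retrieval "result set" can be sketched in Python with the standard sqlite3 module. This is only an illustration under stated assumptions: the docs table, its contents, the function names, and the naive term-count ranking are all invented here and are not drawn from ISO 23950.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        (1, "fruit fly genetics"),
        (2, "fruit fruit and more fruit"),
        (3, "vegetable gardening"),
    ],
)

# SQL style: a single query returns a table of every matching row.
rows = conn.execute(
    "SELECT id, body FROM docs WHERE body LIKE ? ORDER BY id", ("%fruit%",)
).fetchall()

def search(term):
    """Build a result set: matching ids ordered by a naive relevance score."""
    scored = []
    for doc_id, body in conn.execute("SELECT id, body FROM docs"):
        score = body.lower().split().count(term.lower())
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

def retrieve(result_set, position):
    """Fetch one record from a result set in a separate operation."""
    doc_id = result_set[position]
    return conn.execute(
        "SELECT body FROM docs WHERE id = ?", (doc_id,)
    ).fetchone()[0]

result_set = search("fruit")
print(rows)
print(result_set, retrieve(result_set, 0))
```

Here SQL still does the storage and lookup work behind the search service, in line with the observation that the two are often used in combination.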

7 Review Process Used

An iterative approach identified requirements for a search service standard. A draft Statement of Requirements was posted publicly, with comments of the major stakeholders invited, on the Internet at http://www.search.gov/interop/requirements.html . Several revisions were made over the months February - April, 2004. The initial version of the Statement of Requirements was based on a recommendation in February 2003 of the E-Government Technical Committee of the Organization for the Advancement of Structured Information Standards [OASIS]. The OASIS recommendation was informed by, among others, an April 2002 white paper titled "Interoperability Strategy: Concepts, Challenges, and Recommendations" by the Industry Advisory Council [IAC], Enterprise Architecture Shared Interest Group. (Focused on promoting government-industry partnerships, the Industry Advisory Council represents professionals from over 400 leading information technology companies.)

8 Notes and References

[CQL] Common Query Language is described at http://www.loc.gov/z3950/agency/zing/cql/

[GILS] E. Christian, Application Profile For The Government Information Locator Service (GILS), http://www.gils.net/prof_v2.html , April 1997.

[GEO] D. Nebert, Z39.50 Application Profile for Geospatial Metadata or "GEO", http://www.blueangeltech.com/Standards/GeoProfile/geo22.htm , May 2000.

[IAC] Dodd, John, et al., Interoperability Strategy: Concepts, Challenges, and Recommendations, http://www.search.gov/IAC-Interoperability.pdf , April 2003.

[ISO 23950] ANSI/NISO Z39.50-1995, Information Retrieval (Z39.50): Application Service Definition and Protocol Specification, http://lcweb.loc.gov/z3950/agency , 1995.

[OASIS] The OASIS E-Government Technical Committee recommendation concerning Search Service interoperability is available at http://www.oasis-open.org/committees/download.php/5846/wd-egov-searchservice-CD.pdf

[SRW] Search/Retrieve Web Service home page is http://www.loc.gov/z3950/agency/zing/srw/