NIST Scientific and Technical Databases

National Institute of Standards and Technology
NIST Special Database 25

NIST Federal Register Document Image Database: Volume 1

NIST has produced a new document image database for evaluating document analysis and recognition technologies and information retrieval systems. NIST Special Database 25 contains page images from the 1994 Federal Register, together with ground truth text, commercial OCR results, and image quality assessment results.

A new, fully automated process developed at NIST was used to derive ground truth for the document images. The method matches the optical character recognition (OCR) results from a page against the typesetting files for an entire book. Public domain software for deriving ground truth is provided in the form of Perl scripts and C source code, and includes new, more efficient string alignment technology and a word-level scoring package. The documentation includes a complete software reference guide with online manual pages. With this ground truthing technology, it is now feasible to produce much larger data sets, at much lower cost, than was possible with previous labor-intensive, manual data collection projects.
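To illustrate the general idea of aligning OCR output with reference text and scoring it at the word level, here is a minimal Python sketch. It uses a textbook edit-distance alignment, not the more efficient alignment method distributed with the database, and the function names `align_words` and `word_accuracy` are hypothetical, chosen only for this example.

```python
def align_words(ocr_words, truth_words):
    """Align two word lists with a standard edit-distance table.

    Illustrative only: this is the classic dynamic-programming alignment,
    not the alignment algorithm shipped with the database.
    Returns (distance, ops), where each op is (kind, ocr_word, truth_word).
    """
    n, m = len(ocr_words), len(truth_words)
    # dist[i][j] = edit distance between ocr_words[:i] and truth_words[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr_words[i - 1] == truth_words[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # extra word in the OCR output
                dist[i][j - 1] + 1,         # truth word missing from the OCR output
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back from the bottom-right corner to recover the alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = ocr_words[i - 1] == truth_words[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
                kind = "match" if same else "substitute"
                ops.append((kind, ocr_words[i - 1], truth_words[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("insert", ocr_words[i - 1], None))
            i -= 1
        else:
            ops.append(("delete", None, truth_words[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


def word_accuracy(ops, truth_len):
    """Fraction of ground truth words that the OCR output reproduced exactly."""
    matches = sum(1 for kind, _, _ in ops if kind == "match")
    return matches / truth_len if truth_len else 1.0


if __name__ == "__main__":
    ocr = "the Federa1 Register is published daily".split()
    truth = "the Federal Register is published daily".split()
    distance, ops = align_words(ocr, truth)
    for op in ops:
        print(op)
    print("distance:", distance, "accuracy:", word_accuracy(ops, len(truth)))
```

The table-based approach costs time and memory proportional to the product of the two lengths, which is acceptable for a single page but helps explain why more efficient alignment is needed when a page must be matched against the text of an entire book.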

Roughly 250 issues, comprising nearly 69,000 pages, were published in the Federal Register in 1994. This volume of the database contains the pages of 20 books published in January of that year. The database includes scanned images, SGML-tagged ground truth text, commercial OCR results, and image quality assessment results. These data files are useful in a wide variety of experiments and research. Future volumes may be released, depending on the level of interest.

This volume of the database contains 4711 page images, scanned as binary images at 15.75 pixels per millimeter (400 pixels per inch). The images are stored in the NIST IHead format and are compressed using CCITT Group 4 compression. Documentation for this format and source code for reading and writing IHead images are provided. Of these page images, 4519 have corresponding ground truth.
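The figures in the paragraph above can be checked with a few lines of arithmetic. The Python sketch below converts the scan resolution between units and computes the share of pages that have ground truth. The final helper estimates the uncompressed size of one binary page. The 8.5 by 11 inch page size in the usage example is an assumption made for illustration, not a dimension stated for this database, and all function names are hypothetical.

```python
MM_PER_INCH = 25.4


def ppi_to_px_per_mm(ppi):
    """Convert pixels per inch to pixels per millimeter."""
    return ppi / MM_PER_INCH


def ground_truth_coverage(pages_with_truth, total_pages):
    """Fraction of page images that have corresponding ground truth."""
    return pages_with_truth / total_pages


def binary_page_bytes(width_in, height_in, ppi):
    """Uncompressed size of a 1-bit-per-pixel page, rows padded to whole bytes."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px


if __name__ == "__main__":
    print(f"{ppi_to_px_per_mm(400):.2f} pixels per millimeter")
    print(f"{ground_truth_coverage(4519, 4711):.1%} of pages have ground truth")
    # Assumed page size (8.5 x 11 inches); the real page dimensions may differ.
    print(binary_page_bytes(8.5, 11, 400), "bytes uncompressed per page")
```

An estimate like the last one gives a rough sense of why the images are stored with CCITT Group 4 compression rather than as raw bitmaps.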

This volume is distributed on two ISO-9660 CD-ROMs and occupies 1.27 gigabytes of storage.

The source code used to create this data is available in sd25_src.tar.Z.

Examples from this database are located at the anonymous FTP site sequoyah.nist.gov at: sd25.tar

Cost: $90.00.

A PDF version of the Users' Guide is available.

For ordering information contact:

Standard Reference Data
National Institute of Standards and Technology
100 Bureau Dr., STOP 2310
Gaithersburg, MD 20899-2310
Voice: (301)975-2008
Email: Contact Us
FAX: 301-926-0416

Technical contact:

Michael D. Garris
100 Bureau Drive, Stop 8940
Building 225, Room A216
Gaithersburg, MD 20899-8940
Email: mgarris@nist.gov
Voice: (301)975-2928

Keywords: document image database; OCR; optical character; recognition technology



Create Date: 6/02
Last Update: Friday, 19-Mar-04 08:13:20