NIST Scientific and Technical Databases


NIST Special Database 12

NIST Census Miniform Training Database 2: Binary Images from Paper and Microfilm

NIST Special Database 12 is a set of 1990 Census Miniform images. A Miniform is a non-sensitive portion of the Industry and Occupation section of an actual Census Long Form with handwritten responses to three questions.

The database is available on CD-ROM and contains images of 6,000 paper miniforms (18,000 fields), 12,500 microfilm miniforms (37,500 fields), and files containing ASCII transcriptions of the strings that were written in the miniform fields. This database is designed for the evaluation of optical character recognition (OCR) systems in a difficult but realistic form-based task on binary images from microfilm.

Each miniform image contains three fields with handwritten answers to the following questions (Long Form Questions 28b, 29a, and 29b respectively).

  • Describe the activity performed at location where employed.
  • What kind of work was this person doing?
  • What were this person's most important activities or duties?
A possible set of responses would therefore be:
  • hospital
  • registered nurse
  • patient care
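Because each field comes with an ASCII transcription, an OCR system's output can be compared against the reference string field by field. The following is a minimal Python sketch of one such comparison, not the scoring procedure used at the Conference. The function names, the case and whitespace normalization, and the sample OCR output are all assumptions made for illustration.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def field_accuracy(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Return the fraction of fields whose normalized OCR output equals the reference."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one field is required")
    matches = sum(
        normalize(ref) == normalize(hyp)
        for ref, hyp in zip(references, hypotheses)
    )
    return matches / len(references)


# One miniform: the three reference transcriptions from the example above,
# paired with made-up OCR output (hypothetical, for illustration only).
reference_fields = ["hospital", "registered nurse", "patient care"]
ocr_fields = ["Hospital", "registered  nurse", "patient cane"]

accuracy = field_accuracy(reference_fields, ocr_fields)
print(f"{accuracy:.3f}")  # prints 0.667: two of the three fields match
```

Exact matching after normalization is a deliberately strict choice. A character-level edit distance would give partial credit for near misses such as "patient cane".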

The forms were scanned from microfilm, yielding images of far lower quality than forms scanned from paper. The images are 624 by 744 pixels sampled at 78.74 pixels/cm (200 pixels/inch). They are packed five to a file and are CCITT Group 4 compressed. Source code for image manipulation, including programs to decompress and unpack the images, is present on the CD-ROM. The code is written in the C programming language and was developed on Sun workstations running SunOS 4.1.1.*
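The image figures above can be checked with a short calculation. The Python sketch below converts the sampling rate between units, derives the physical size of one miniform image, and estimates its size once decompressed. It assumes one bit per pixel with rows padded to whole bytes, which is a common layout for binary images but is not confirmed by this page. The constant names are illustrative.

```python
PIXELS_PER_INCH = 200
CM_PER_INCH = 2.54
WIDTH_PX, HEIGHT_PX = 624, 744
IMAGES_PER_FILE = 5

# Sampling rate in metric units: 200 / 2.54 is about 78.74 pixels/cm.
pixels_per_cm = PIXELS_PER_INCH / CM_PER_INCH

# Physical extent of one miniform image.
width_in = WIDTH_PX / PIXELS_PER_INCH    # 3.12 inches
height_in = HEIGHT_PX / PIXELS_PER_INCH  # 3.72 inches
width_cm = WIDTH_PX / pixels_per_cm
height_cm = HEIGHT_PX / pixels_per_cm

# Decompressed size, ASSUMING 1 bit per pixel and byte-aligned rows.
bytes_per_row = (WIDTH_PX + 7) // 8                          # 78
raw_bytes_per_image = bytes_per_row * HEIGHT_PX              # 58,032
raw_bytes_per_file = raw_bytes_per_image * IMAGES_PER_FILE   # 290,160

print(f"{pixels_per_cm:.2f} pixels/cm")
print(f"{width_in:.2f} x {height_in:.2f} in ({width_cm:.2f} x {height_cm:.2f} cm)")
print(f"{raw_bytes_per_image} bytes per image, {raw_bytes_per_file} bytes per file")
```

Under these assumptions each image decompresses to roughly 57 KB, so a five-image file holds about 283 KB of raw bitmap data before any file headers.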

Special Database 12 was the second of three produced in conjunction with The Second Census Optical Character Recognition Systems Conference, and was intended for system training. (The first, Special Database 11, contained microfilm training data. The third, Special Database 13, contained the paper and microfilm data used for the actual system testing.)

NIST and the Bureau of the Census sponsored the Conference, in which participants sought to determine the state of the art of the OCR industry on a challenging, realistic task. The results of the Conference were published in NIST Internal Report (IR) 5452. That report is available on the Internet in PostScript form via anonymous FTP from the server sequoyah.ncsl.nist.gov, maintained by NIST's Visual Image Processing Group. It is also available on request in hardcopy form.

Special Database 12 comes with a 30-page guide that presents an overview of the Conference and its results and documents the file formats and software.

*Specific hardware and software products identified were used in order to adequately support the development of the technology described in this document. In no case does such identification imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment identified is necessarily the best available for the purpose.

Price: $90.00. Special pricing for multiple copies available. Call for details.

A PDF version of the Users' Guide is also available.


For more information please contact:

Standard Reference Data Program
National Institute of Standards and Technology
100 Bureau Dr., Stop 2310
Gaithersburg, MD 20899-2310

(301) 975-2008 (VOICE) / (301) 926-0416 (FAX)

The scientific contact for this database is:
Stanley Janet
National Institute of Standards and Technology
100 Bureau Drive, Stop 8940
Building 225, Room A216
Gaithersburg, MD 20899-8940
PH: (301) 975-2916
e-mail: stan.janet@nist.gov

Keywords: ASCII Reference; automated character recognition; automated data capture; Binary Image Database; census forms; Census OCR Systems Conference; character recognition; forms recognition; hand print; handwriting recognition; Microfilm Documents; NIST; OCR; optical character recognition; paper; software recognition; style.



Create Date: 6/02
Last Update: Friday, 19-Mar-04 08:06:56