NIST Special Database 12
NIST Census Miniform Training Database 2: Binary Images from Paper and Microfilm
NIST Special Database 12 is a set of 1990 Census
Miniform images. A Miniform is a non-sensitive portion of the Industry
and Occupation section of an actual Census Long Form with handwritten
responses to three questions.
The database is available on CD-ROM and contains images of 6,000 paper
miniforms (18,000 fields) and 12,500 microfilm miniforms (37,500 fields),
along with files containing ASCII transcriptions of the strings that were
written in the miniform fields. This database is designed for the evaluation
of optical character recognition (OCR) systems in a difficult but realistic
form-based task on binary images from paper and microfilm.
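The field counts quoted above follow from three fields per miniform. A minimal sketch of that arithmetic, where the names `FIELDS_PER_MINIFORM` and `field_count` are illustrative and not part of the database software:

```python
# Each miniform carries three handwritten fields (per the description above).
FIELDS_PER_MINIFORM = 3


def field_count(miniforms: int) -> int:
    """Return the number of fields contained in the given number of miniforms."""
    return miniforms * FIELDS_PER_MINIFORM


paper_fields = field_count(6000)        # paper miniforms
microfilm_fields = field_count(12500)   # microfilm miniforms
total_miniforms = 6000 + 12500
total_fields = paper_fields + microfilm_fields

print(paper_fields, microfilm_fields, total_miniforms, total_fields)
```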
Each miniform image contains three fields with handwritten answers to the
following questions (Long Form Questions 28b, 29a, and 29b, respectively):
- Describe the activity performed at location where employed.
- What kind of work was this person doing?
- What were this person's most important activities or duties?
A possible set of responses
would therefore be:
- hospital
- registered nurse
- patient care
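One way to hold the three responses for a single miniform in memory is sketched below. The class name `MiniformResponses`, its attribute names, and the `by_question` helper are hypothetical. They do not reflect the layout of the ASCII transcription files on the CD-ROM, which is documented in the guide shipped with the database.

```python
from dataclasses import dataclass

# Long Form question numbers and wording, as listed above.
QUESTIONS = {
    "28b": "Describe the activity performed at location where employed.",
    "29a": "What kind of work was this person doing?",
    "29b": "What were this person's most important activities or duties?",
}


@dataclass(frozen=True)
class MiniformResponses:
    """Hypothetical container for the three transcribed fields of one miniform."""

    activity_at_location: str  # answer to question 28b
    kind_of_work: str          # answer to question 29a
    duties: str                # answer to question 29b

    def by_question(self) -> dict:
        """Map each question number to its transcribed answer."""
        answers = (self.activity_at_location, self.kind_of_work, self.duties)
        return dict(zip(QUESTIONS, answers))


example = MiniformResponses("hospital", "registered nurse", "patient care")
print(example.by_question())
```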
The microfilm miniforms were scanned from microfilm, yielding images of far
lower quality than the forms scanned from paper. The images are 624 by 744
pixels, sampled at 78.74 pixels/cm (200 pixels/inch). They are packed five
to a file and are CCITT Group 4 compressed. Source code for image
manipulation, including programs to decompress and unpack the images, is
included on the CD-ROM. The code is written in the C programming language
and was developed on Sun workstations running SunOS 4.1.1.*
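The image dimensions and sampling rate above imply a physical field size and an uncompressed storage size. The sketch below works these out under two assumptions that are not stated in the description: one bit per pixel with rows packed into whole bytes, and no per-image header. The actual file layout is defined by the software and guide on the CD-ROM.

```python
# Figures taken from the description above.
WIDTH_PX, HEIGHT_PX = 624, 744
PIXELS_PER_INCH = 200
IMAGES_PER_FILE = 5
CM_PER_INCH = 2.54

# Physical size of the scanned miniform area.
width_in = WIDTH_PX / PIXELS_PER_INCH     # 3.12 inches
height_in = HEIGHT_PX / PIXELS_PER_INCH   # 3.72 inches

# 200 pixels/inch expressed per centimetre (about 78.74).
pixels_per_cm = PIXELS_PER_INCH / CM_PER_INCH

# Assumption: binary image, 1 bit per pixel, 8 pixels per byte, no header.
bytes_per_row = WIDTH_PX // 8             # 624 is a multiple of 8
raw_bytes_per_image = bytes_per_row * HEIGHT_PX
raw_bytes_per_file = raw_bytes_per_image * IMAGES_PER_FILE

print(width_in, height_in, round(pixels_per_cm, 2))
print(bytes_per_row, raw_bytes_per_image, raw_bytes_per_file)
```

These are uncompressed sizes only. The Group 4 compressed size depends on image content and cannot be derived from the dimensions.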
Special Database 12 was the second of three databases produced in conjunction
with the Second Census Optical Character Recognition Systems Conference, and
was intended for system training. (The first, Special Database 11, contained
microfilm training data. The third, Special Database 13, contained the paper
and microfilm data used for the actual system testing.)
NIST and the Bureau
of the Census sponsored the Conference, in which participants sought to
determine the state of the art of the OCR industry on a challenging, realistic
task. The results of the Conference were published in NIST Internal Report
(IR) 5452. That report is available on the Internet in PostScript form
via anonymous FTP from the server sequoyah.ncsl.nist.gov, maintained
by NIST's Visual Image Processing Group. It is also available on request
in hardcopy form.
Special Database
12 comes with a 30-page guide that presents an overview of the Conference
and its results and documents the file formats and software.
*Specific hardware
and software products identified were used in order to adequately support
the development of the technology described in this document. In no case
does such identification imply recommendation or endorsement by the National
Institute of Standards and Technology, nor does it imply that the equipment
identified is necessarily the best available for the purpose.
Price: $90.00. Special pricing for multiple copies is available. Call for
details.
For more information, please contact:
- Standard Reference Data Program
National Institute of Standards and Technology
100 Bureau Dr., Stop 2310
Gaithersburg, MD 20899-2310
(301) 975-2008 (VOICE) / (301) 926-0416 (FAX)
The scientific contact for this database is:
- Stanley Janet
National Institute of Standards and Technology
100 Bureau Drive, Stop 8940
Building 225, Room A216
Gaithersburg, MD 20899-8940
PH: (301) 975-2916
e-mail: stan.janet@nist.gov
Keywords:
ASCII Reference; automated character recognition; automated data capture;
Binary Image Database; census forms; Census OCR Systems Conference; character
recognition; forms recognition; hand print; handwriting recognition; Microfilm
Documents; NIST; OCR; optical character recognition; paper; software recognition;
style.