LHNCBC: Document Abstract
Year: 2003
LHNCBC-2003-007
Ground Truth Data for Document Image Analysis
Ford G, Thoma GR
Proceedings of 2003 Symposium on Document Image Understanding and Technology. 2003 April 9-11: 199-205.
The ground truth data described here is collected from the production operation of MARS (Medical Article Records System), a system combining scanning, OCR, document image analysis and lexical analysis techniques. Developed by an R&D division of the National Library of Medicine (NLM), MARS automatically extracts bibliographic data from paper-based biomedical journals to populate the Library's flagship database, MEDLINE, used worldwide by biomedical researchers and clinicians. The bibliographic data extracted include the article title, author names, institutional affiliations and abstracts. The ground truth data includes document images, OCR output and operator-verified data at the page, zone, line, word, and character levels. It is accessible online via a public website to enable researchers to develop innovative and efficient algorithms for automatic zoning (page segmentation), labeling (field identification), lexical analysis to correct OCR errors, and syntax reformatting to adhere to established conventions. In addition, we offer a tool (Rover) to visually compare the results of such programs to the ground truth data. The ground truth and results data are in XML, and Rover is written in Java. The website was developed with Macromedia Dreamweaver UltraDev 4 to provide a rich interface and extensive database connectivity.