LHNCBC: Document Abstract
Year: 2003
LHNCBC-2003-006
Automated Document Labeling for Web-Based Online Medical Journals
Le DX, Thoma GR
Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;II: 411-15.
An increasing number of publishers are using the Internet and the World Wide Web to provide their subscribers with access to online journals. New techniques are needed to capture, classify, analyze, extract, modify, and reformat Web-based document information for computer storage, access, and processing. An R&D division of the National Library of Medicine (NLM) is developing an automated system, temporarily code-named WebMARS for Web-based Medical Article Records System, to download, analyze and extract bibliographic information from Web-based journal articles to produce citation records for its MEDLINE database. This paper describes one component of this system: assigning meaningful labels to text zones containing article title, author names, affiliation, and abstract. This labeling technique is based on features derived from the World Wide Web Consortium Document Object Model (W3C DOM) and an analysis of the page layout for each journal, a DOM-based document node location and content analysis, string pattern matching, and a depth-first node traversal algorithm. Experiments carried out on a variety of Web-based medical journals have proved the feasibility of this automated document labeling approach. Preliminary evaluation results on a small set of Web-based medical journal articles show that the system is capable of labeling text zones at an accuracy of over 95%.
Keywords: W3C Document Object Model, Automated document labeling, MEDLINE database, National Library of Medicine.
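The general idea of labeling text zones by walking a document tree depth-first and applying string patterns to each node can be illustrated with a minimal sketch. This is not the WebMARS implementation: the simplified tree built with Python's standard `html.parser` stands in for a W3C DOM, and the `Node`, `TreeBuilder`, `RULES`, `parse`, and `label_zones` names, the regular expressions, and the sample page are all assumptions made for illustration. The per-journal page layout analysis and node location features mentioned in the abstract are not modeled here.

```python
import re
from html.parser import HTMLParser


class Node:
    """Minimal DOM-like node: tag name, parent, children, and direct text."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified tree (a hypothetical stand-in for a W3C DOM)."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:      # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


# Hypothetical string patterns, tried in order; real rules would be tuned per journal.
NAME = r"[A-Z][A-Za-z.'-]*(?: [A-Z][A-Za-z.'-]*)+"
RULES = [
    ("abstract", re.compile(r"^abstract\b", re.I)),
    ("affiliation", re.compile(r"\b(University|Department|Institute|Hospital|Library)\b")),
    ("author", re.compile(rf"^{NAME}(?:, {NAME})*$")),
]


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def label_zones(node, labels=None):
    """Depth-first traversal: label this node's text, then visit its children."""
    if labels is None:
        labels = []
    text = " ".join(node.text.split())   # normalize whitespace
    if text:
        if node.tag in ("h1", "h2"):     # assumption: headings carry the title
            labels.append(("title", text))
        else:
            for label, pattern in RULES:
                if pattern.search(text):
                    labels.append((label, text))
                    break
    for child in node.children:
        label_zones(child, labels)
    return labels


# Invented sample page, not taken from any real journal.
SAMPLE = """<html><body>
<h1>A Sample Article Title</h1>
<p>Jane Q. Example, John R. Sample</p>
<p>Department of Examples, Sample University</p>
<div><p>Abstract: This is sample text.</p></div>
</body></html>"""

labels = label_zones(parse(SAMPLE))
for label, text in labels:
    print(f"{label}: {text}")
```

In this sketch the order of `RULES` matters: the affiliation pattern is tried before the author pattern because an affiliation line made of capitalized words could otherwise be mistaken for a list of names.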