NSF Award Abstract - #0205448 | AWSFL008-DS3 |
NSF Org | IIS |
Latest Amendment Date | July 28, 2004 |
Award Number | 0205448 |
Award Instrument | Continuing grant |
Program Manager |
Sylvia J. Spengler IIS DIV OF INFORMATION & INTELLIGENT SYSTEMS CSE DIRECT FOR COMPUTER & INFO SCIE & ENGINR |
Start Date | September 1, 2002 |
Expires | August 31, 2007 (Estimated) |
Expected Total Amount | $3,499,759 (Estimated) |
Investigator |
Aravind K. Joshi joshi@linc.cis.upenn.edu (Principal Investigator current); Mark Liberman (Co-Principal Investigator current); Martha S. Palmer (Co-Principal Investigator current); Susan B. Davidson (Co-Principal Investigator current); Fernando C. Pereira (Co-Principal Investigator current) |
Sponsor |
U of Pennsylvania, Research Services, Philadelphia, PA 19104-6205, 215/898-7293 |
NSF Program | 1687 ITR MEDIUM (GROUP) GRANTS |
Field Application | 0000099 Other Applications NEC |
Program Reference Code | 1655, 9218, HPCC |
EIA-0205448 Joshi, Aravind University of Pennsylvania ITR: Mining the Bibliome -- Information Extraction from the Biomedical Literature
The major goal is the development of qualitatively better methods for automatically extracting information from the biomedical literature, relying on recent research in high-accuracy parsing and shallow semantic analysis. The special focus will be on information relevant to drug development, in collaboration with researchers in the Knowledge Integration and Discovery Systems group at GlaxoSmithKline.
This project will also address several database research problems, including methods for modeling complex, incomplete and changing information using semistructured data, and also ways to connect the text analysis process to an information integration environment that can deal with the wide variety of extant bioinformatic data models, formats, languages and interfaces.
The engine of recent progress in language processing research has been linguistic data: text corpora, treebanks, lexicons, test corpora for information retrieval and information extraction, and so on. Much of this data has been created by Penn researchers and published by Penn's Linguistic Data Consortium. Hence, one of our major goals is to develop and publish new linguistic resources in three categories: a large corpus of biomedical text annotated with syntactic structures (`Treebank') and shallow semantic structures (proposition bank or `Propbank'); several large sets of biomedical abstracts and full-text articles annotated with entities and relations of interest to drug developers, such as enzyme inhibition by various compounds or genotype/phenotype connections (`Factbanks'); and broad-coverage lexicons and tools for the analysis of biomedical texts.