Ontology Engineering approaches based on semi-automated curation of the primary literature Gully APC Burns, Tommy Ingulfsen, Donghui Feng and Ed Hovy Biomedical.


1 Ontology Engineering approaches based on semi-automated curation of the primary literature Gully APC Burns, Tommy Ingulfsen, Donghui Feng and Ed Hovy Biomedical Knowledge Engineering Group, Information Sciences Institute, University of Southern California

2 Where’s all the knowledge?
Image taken from U.S. Geological Survey Energy Resource Surveys Program
The primary research literature …
… is the end-product of all scientific research
… forms the basis for human understanding of the subject
… is written in natural language
… is structured
… is interpretable
… is expensive
… is terse

3 Precision and imprecision in biological representation Assay: define model system Experiment: perform measurements Conceptual model ‘Stress’, ‘energy balance’, ‘homeostasis’, ‘glucoprivation’ 2-deoxyglucose (2DG) administered intravenously to rats, look for activation in ‘stress-responsive’ neurons MAP-K and pERK activate in neurons in PVH, BST and CEAl High-level concepts Independent variables Dependent variables Imprecise Precise

4 Partitioning the literature

5 The problem with knowledge: an over-abundance of data

6 Corpus Preparation for Natural Language Processing The Journal of Comparative Neurology is the foremost international journal for neuroanatomy. We downloaded ~12,000 PDFs in total, covering 1970-2005. We preprocessed the papers with consistent formatting, from vols. 204-490 (1982-2005), providing a corpus of 9,474 PDF files. This corpus contains 99,094,318 words.

7 Active Learning / Information Extraction Methodology

8 The logical structure of a tract-tracing experiment
Tracer Chemical [1]
Injection Site [1]
- Location: brain structure, topography, side
Labeled region [1...*]
- Location: brain structure, topography, ipsi-contra relative to injection site?
- Label type
- Label density
(diagram labels: ‘anterograde’, ‘retrograde’)
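The structure on this slide can be written down as a small record type. The following is a minimal sketch, not the schema the authors actually used: the class and field names are hypothetical, the cardinalities `[1]` and `[1...*]` are taken from the slide, and treating ‘anterograde’ / ‘retrograde’ as values of the label type is an assumption. The example values are made up for illustration and do not come from any paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Location:
    """A location as listed on the slide: brain structure, topography, side."""
    brain_structure: str
    topography: Optional[str] = None
    # For an injection site this is the side of the brain; for a labeled
    # region it is ipsi/contra relative to the injection site.
    side: Optional[str] = None


@dataclass
class LabeledRegion:
    location: Location
    # Assumption: label type takes values such as 'anterograde' / 'retrograde'.
    label_type: Optional[str] = None
    label_density: Optional[str] = None


@dataclass
class TractTracingExperiment:
    tracer_chemical: str                      # cardinality [1]
    injection_site: Location                  # cardinality [1]
    labeled_regions: List[LabeledRegion] = field(default_factory=list)  # [1...*]

    def validate(self) -> None:
        """Enforce the [1...*] cardinality on labeled regions."""
        if not self.labeled_regions:
            raise ValueError("at least one labeled region is required")


# Illustrative values only (hypothetical, not extracted from the corpus).
example = TractTracingExperiment(
    tracer_chemical="PHA-L",
    injection_site=Location("PVH", topography="dorsal part", side="left"),
    labeled_regions=[
        LabeledRegion(
            Location("BST", side="ipsilateral"),
            label_type="anterograde",
            label_density="dense",
        )
    ],
)
example.validate()
print(example.tracer_chemical, len(example.labeled_regions))
```

A fixed record like this is what makes field labeling well defined: each annotated span in a results section is assigned to one of these slots.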

9 Annotated XML Example from Albanese & Minciacchi, 1983, JCN 216:406-420
[Figure callouts: expt. label delineation injection labeling description]

10 Recall, Precision and F-Score
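Only the title of this slide survives in the transcript. As a reminder of the standard definitions, here is a minimal sketch that computes the three scores from counts of true positives, false positives and false negatives. It assumes the balanced F-score (harmonic mean of precision and recall), which is consistent with the F-Score column on slide 11; the function name and the example counts are hypothetical.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, balanced F-score) from raw counts.

    tp: labels predicted and correct
    fp: labels predicted but wrong
    fn: gold labels the system missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(tp=8, fp=2, fn=4)
print(f"precision={p:.4f} recall={r:.4f} f={f:.4f}")
```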

11 Field Labeling Results – overall label level

System Features                                          Precision  Recall  F-Score
Baseline                                                 0.3926     0.1673  0.2346
Lexicon                                                  0.5689     0.3771  0.4536
Lexicon + Surface Words                                  0.7415     0.6817  0.7103
Lexicon + Surface Words + Window Words                   0.7843     0.7039  0.7420
Lexicon + Surface + Window Words + Dependency features   0.7756     0.7347  0.7546

Preliminary data from a training set of 14 documents + testing on 16 documents
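As a sanity check on the table, the reported F-Score values can be recomputed from the reported precision and recall. The sketch below assumes the balanced F-score, F = 2PR / (P + R); the variable and function names are illustrative. Because the inputs are rounded to four decimals, the comparison uses a small tolerance.

```python
# (system features, precision, recall, reported F-score) copied from the slide.
ROWS = [
    ("Baseline", 0.3926, 0.1673, 0.2346),
    ("Lexicon", 0.5689, 0.3771, 0.4536),
    ("Lexicon + Surface Words", 0.7415, 0.6817, 0.7103),
    ("Lexicon + Surface Words + Window Words", 0.7843, 0.7039, 0.7420),
    ("Lexicon + Surface + Window Words + Dependency features",
     0.7756, 0.7347, 0.7546),
]


def balanced_f(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest gap between a recomputed and a reported F-score.
max_deviation = max(abs(balanced_f(p, r) - f) for _, p, r, f in ROWS)

for name, p, r, f in ROWS:
    print(f"{name}: reported {f:.4f}, recomputed {balanced_f(p, r):.4f}")
print(f"max deviation: {max_deviation:.5f}")

# The feature set with the highest reported F-score.
best = max(ROWS, key=lambda row: row[3])[0]
print("best:", best)
```

Note that adding dependency features raises recall and F-score but slightly lowers precision relative to the window-word configuration.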

12 Field Labeling Results- Confusion Matrices

13 Generalizing the methodology: ‘Histology’ [from Gonzalo-Ruiz et al 1992, JCN 321: 300-311]

14 The logical structure of a tract-tracing experiment
Tracer Chemical [1]
Injection Site [1]
- Location: brain structure, topography, side
Labeled region [1...*]
- Location: brain structure, topography, ipsi-contra relative to injection site?
- Label type
- Label density
(diagram labels: ‘anterograde’, ‘retrograde’)

15 Time and effort
Current performance achieved by annotating 40 documents
Each document contains 97 sentences (in results section) on average
Annotation rate
- ~40 sentences/hr (no support)
- ~115 sentences/hr (after 20 documents)
Time taken to annotate documents to train the system to perform at this standard
- ~65 hours with no support
- Estimate ~2 months for a 50% RA (20 hours / week)
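The figures on this slide can be combined with simple arithmetic. The sketch below is one possible reading, not the authors' stated calculation: it assumes the first 20 documents are annotated at the unsupported rate and the remaining 20 at the supported rate. The slide itself labels its ~65 hour figure "with no support", so the sketch only shows which combination of the stated rates produces a number of that size. All names are illustrative.

```python
def annotation_hours(documents: int,
                     sentences_per_document: float,
                     sentences_per_hour: float) -> float:
    """Hours needed to annotate the given number of documents."""
    return documents * sentences_per_document / sentences_per_hour


SENTENCES_PER_DOC = 97     # average, results section only (from the slide)
TOTAL_DOCS = 40            # documents annotated (from the slide)
RATE_NO_SUPPORT = 40       # sentences per hour (from the slide)
RATE_WITH_SUPPORT = 115    # sentences per hour after 20 documents (from the slide)
DOCS_BEFORE_SUPPORT = 20   # assumption: support starts after 20 documents

# Every document annotated at the unsupported rate.
all_unsupported = annotation_hours(TOTAL_DOCS, SENTENCES_PER_DOC, RATE_NO_SUPPORT)

# Assumed mixed scenario: first 20 unsupported, remaining 20 supported.
mixed = (
    annotation_hours(DOCS_BEFORE_SUPPORT, SENTENCES_PER_DOC, RATE_NO_SUPPORT)
    + annotation_hours(TOTAL_DOCS - DOCS_BEFORE_SUPPORT,
                       SENTENCES_PER_DOC, RATE_WITH_SUPPORT)
)

print(f"all unsupported: {all_unsupported:.1f} h")
print(f"mixed scenario:  {mixed:.1f} h")
```

Under these assumptions the mixed scenario comes to roughly 65 hours, while annotating all 40 documents at the unsupported rate would take 97 hours.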

16 Can we discover the schema from the text?
Given a large review or a grant proposal specific to a single laboratory
Annotate independent and dependent variables in papers.
Can we learn and extract these patterns?

17 An example from current set of annotations 10 independent variables: age species sex weight agonist/antagonist combinations (9) primary antibody preparation protocol brain region 1 dependent variable: signal density

18 Acknowledgements
Funding
- Information Sciences Institute, seed funding *
- National Library of Medicine (RO1-LM07061) *
- NSF (LONI MAP project)
- HBP (USCBP)
Neuroscience consultants
- Alan Watts *
- Larry Swanson *
- Arshad Khan *
- Rick Thompson *
- Joel Hahn *
- Lori Gorton *
- Kim Rapp *
Computer Scientists
- Eduard Hovy *
- Donghui Feng *
- Patrick Pantel *
Developers
- Tommy Ingulfsen *
- Wei-Cheng Cheng

