Text mining for biology and medicine: Glasgow, Feb , 2008
Biomedical information extraction at the University of Pennsylvania
Mark Liberman
Linguistic Data Consortium
Outline
  The PennBioIE project: background, accomplishments, future
  Public service announcement: publishing data via the LDC
  The parable of Yang Jin
  Annotation as “common law semantics”
    a serviceable technology that will improve
    are there better long-term alternatives?
PennBioIE Project
  Goals:
    Learn to strip-mine the bibliome: better NLP tools for text data mining
    Publish biomedical text annotation: treebanks, entities, relations
  Participants:
    Penn NLP researchers
    Biomedical researchers (Penn, GSK, CHoP)
Penn BioIE Project
  Domains:
    CYP: inhibition of cytochrome P-450 enzymes
      1100 abstracts
      collaboration with GSK
    Onco: genomic variations associated with cancer
      1158 abstracts
      collaboration with Children’s Hospital of Philadelphia
Annotation sequence
  1. pretagging (document segmentation etc.)
  2. named entities
  3. POS
  4. treebanking
  5. relations
Penn BioIE Project
  Results:
    Some improved techniques
    Some published data
      get rel. 0.9 from http://bioie.ldc.upenn.edu
      rel. 1.0 soon to be published by LDC
    Some applications -- e.g. FABLE
    Some questions
      How to break the F-measure ceiling?
      How to decrease annotation burden?
      How to increase semantic coverage?
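For readers unfamiliar with the F-measure mentioned in the questions above, here is a minimal sketch of how precision, recall, and F1 are commonly computed for entity tagging. It assumes exact-match scoring over spans; the `Span` type, the label names, and the toy data are hypothetical illustrations, not the PennBioIE scorer.

```python
from typing import NamedTuple, Set, Tuple


class Span(NamedTuple):
    """A hypothetical entity mention: character offsets plus a type label."""
    start: int
    end: int
    label: str


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match scoring: a predicted span counts only if offsets and label both match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (invented for illustration): one exact match, one label mismatch, one spurious span.
gold = {Span(0, 6, "gene"), Span(20, 28, "variation"), Span(40, 52, "malignancy")}
predicted = {Span(0, 6, "gene"), Span(20, 28, "gene"), Span(60, 65, "gene")}
p, r, f = precision_recall_f1(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Because the gold standard is itself produced by annotators who disagree, scores of this kind are bounded in practice by annotation consistency, which is one way to read the "ceiling" in the question above.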
A note on the LDC
The Linguistic Data Consortium is
  an open consortium of universities, companies, and government laboratories;
  founded in 1992 with seed money from DARPA;
  run by the University of Pennsylvania with 45 full-time staff in Philadelphia.
But really, the LDC is…
  a specialized digital publisher, which has distributed >50,000 copies of >750 corpora and other resources to ~2,500 research organizations in 62 countries.
  … and might want to publish your data.
Why publish with LDC?
  It’s a publication! LDC pubs have:
    authors
    ISBN numbers
    standard bibliographic citation formats
    editions
  IPR, licensing are handled your way (from “all rights reserved” to open access)
  LDC deals with the hassle of reproduction, distribution, maintenance
The parable of Yang Jin
The annotation conundrum
  “Natural” annotation is inconsistent
    poor agreement for entities, worse for relations
    task-internal metrics are noisy
  “Top down” specification is even worse (e.g. existing elaborate ontologies)
  Solution: iterative refinement of rules via interaction with annotation practice
    result: complex accretion of “common law”
    slow to develop, hard to learn
    more consistent -- but is it correct?
    complexity may re-create inconsistency
      new types and sub-types
      ambiguity, confusion
ACE 2005 consistency
  1P vs. 1P: independent first passes by junior annotators, no QC
  ADJ vs. ADJ: outputs of two parallel, independent dual first-pass annotations, each adjudicated by an independent senior annotator
Iterative improvement
  From ACE 2005 (Ralph Weischedel):
  Repeat until criteria met or until time has expired:
    1. Analyze performance of previous task & guidelines
       Scores, confusion matrices, etc.
    2. Hypothesize & implement changes to tasks/guidelines
    3. Update infrastructure as needed
       DTD, annotation tool, and scorer
    4. Annotate texts
    5. Evaluate inter-annotator agreement
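The last step of the loop, evaluating inter-annotator agreement, can be illustrated with a chance-corrected statistic. The sketch below computes Cohen's kappa over per-token labels from two annotators. This is an assumption chosen for illustration: ACE used its own scorer, and the label names and toy sequences here are invented.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy token labels (invented): "O" marks tokens outside any entity.
annotator_1 = ["O", "GENE", "O", "O", "GENE", "O", "VAR", "O"]
annotator_2 = ["O", "GENE", "O", "GENE", "GENE", "O", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa={kappa:.2f}")
```

Raw agreement in this toy case is 6 of 8 tokens, but kappa is much lower because most tokens are "O" and would agree by chance. That gap is one reason task-internal metrics can look better than the underlying consistency warrants.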
NLP as Law School
  Many complex rules
  Plus Wiki
  Plus Listserv

  Rules, Notes, Fiats and Exceptions
    Task        #Pages   #Rules
    Entity      34       20
    Value       10       5
    TIMEX
    Relations   36       25
    Events      77       50
    Total

  Example Decision Rule (Event p33)
    Note: For Events where a single common trigger is ambiguous between the types LIFE (i.e. INJURE and DIE) and CONFLICT (i.e. ATTACK), we will only annotate the Event as a LIFE Event in case the relevant resulting state is clearly indicated by the construction. The above rule will not apply when there are independent triggers.
BioIE case law
  Guidelines for oncology tagging
  Guidelines for oncology tagging (local)
Discussion
  How to make it better
    Integrating multiple information sources
      text, bioinformatic databases, microarray data, …
    less-supervised learning
      inferring useful features from untagged text
      active learning, information markets, etc.
    create a “basis set” of ready-made entity types
  How to make it different
    the analogy to translation
    the lure of systematic semantics
    (machine) learning: who is learning what?
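As one concrete reading of "active learning" in the list above, the sketch below shows uncertainty sampling: send annotators the items the current model is least sure about. It assumes a model that outputs a probability distribution per item. The document IDs, labels, and probabilities are hypothetical, and this is only one of several selection strategies.

```python
import math
from typing import Dict, List


def entropy(distribution: Dict[str, float]) -> float:
    """Shannon entropy in bits; higher means the model is less certain."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0.0)


def select_for_annotation(predictions: Dict[str, Dict[str, float]], batch_size: int) -> List[str]:
    """Return the IDs of the `batch_size` items with the most uncertain predictions."""
    ranked = sorted(predictions, key=lambda doc_id: entropy(predictions[doc_id]), reverse=True)
    return ranked[:batch_size]


# Hypothetical model outputs for three untagged abstracts.
predictions = {
    "abstract-1": {"gene": 0.98, "other": 0.02},
    "abstract-2": {"gene": 0.55, "other": 0.45},
    "abstract-3": {"gene": 0.80, "other": 0.20},
}
queue = select_for_annotation(predictions, batch_size=2)
print(queue)
```

The intent is to spend scarce annotator time where a new label is most likely to change the model, rather than on examples it already handles confidently.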