 Text mining for biology and medicine: Glasgow, Feb. 21-22, 2008 Biomedical information extraction at the University of Pennsylvania Mark Liberman


1  Text mining for biology and medicine: Glasgow, Feb. 21-22, 2008 Biomedical information extraction at the University of Pennsylvania Mark Liberman myl@cis.upenn.edu Linguistic Data Consortium http://www.ldc.upenn.edu

2  Outline
  - The PennBioIE project: background, accomplishments, future
  - Public service announcement: publishing data via the LDC
  - The parable of Yang Jin
  - Annotation as “common law semantics”
      - a serviceable technology that will improve
      - are there better long-term alternatives?

3  PennBioIE Project
  Goals:
  - Learn to strip-mine the bibliome: better NLP tools for text datamining
  - Publish biomedical text annotation: treebanks, entities, relations
  Participants:
  - Penn NLP researchers
  - Biomedical researchers (Penn, GSK, CHoP)

4  Penn BioIE Project
  Domains:
  - CYP: inhibition of cytochrome P-450 enzymes
      - 1100 abstracts
      - collaboration with GSK
  - Onco: genomic variations associated with cancer
      - 1158 abstracts
      - collaboration with Children’s Hospital of Philadelphia

5  Annotation sequence
  1. pretagging (document segmentation etc.)
  2. named entities
  3. POS
  4. treebanking
  5. relations

6  Penn BioIE Project
  Results:
  - Some improved techniques
  - Some published data
      - rel. 0.9 available from http://bioie.ldc.upenn.edu
      - rel. 1.0 soon to be published by LDC
  - Some applications -- e.g. FABLE
  - Some questions
      - How to break the F-measure ceiling?
      - How to decrease annotation burden?
      - How to increase semantic coverage?


8 A note on the LDC

9  The Linguistic Data Consortium is an open consortium of universities, companies, and government laboratories; founded in 1992 with seed money from DARPA; run by the University of Pennsylvania with 45 full-time staff in Philadelphia.

10  But really, the LDC is…
  a specialized digital publisher, which has distributed >50,000 copies of >750 corpora and other resources to ~2,500 research organizations in 62 countries.
  … and might want to publish your data.

11  Why publish with LDC?
  - It’s a publication!
  - LDC pubs have:
      - authors
      - ISBN numbers
      - standard bibliographic citation formats
      - editions
  - IPR, licensing are handled your way (from “all rights reserved” to open access)
  - LDC deals with the hassle of reproduction, distribution, maintenance

12  The parable of Yang Jin

13  The annotation conundrum
  - “Natural” annotation is inconsistent
      - poor agreement for entities, worse for relations
      - task-internal metrics are noisy
  - “Top down” specification is even worse (e.g. existing elaborate ontologies)
  - Solution: iterative refinement of rules via interaction with annotation practice
      - result: complex accretion of “common law”
      - slow to develop, hard to learn
      - more consistent -- but is it correct?
      - complexity may re-create inconsistency
          - new types and sub-types
          - ambiguity, confusion

14  ACE 2005 consistency
  - 1P vs. 1P: independent first passes by junior annotators, no QC
  - ADJ vs. ADJ: outputs of two parallel, independent dual first-pass annotations, adjudicated by two independent senior annotators

15  Iterative improvement
  From ACE 2005 (Ralph Weischedel):
  Repeat until criteria met or until time has expired:
  1. Analyze performance of previous task & guidelines
     (scores, confusion matrices, etc.)
  2. Hypothesize & implement changes to tasks/guidelines
  3. Update infrastructure as needed
     (DTD, annotation tool, and scorer)
  4. Annotate texts
  5. Evaluate inter-annotator agreement
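Steps 1 and 5 of this loop rely on agreement scores and confusion matrices. The following is a minimal sketch of how two annotators' entity spans might be compared, not the ACE scorer or the PennBioIE tooling. The `Span` type, the function names, the label names, and the choice of exact-match F-measure are all assumptions made for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-off entity annotation: character offsets plus a label."""
    start: int
    end: int
    label: str


def pairwise_f1(a: set, b: set) -> float:
    """Exact-match F-measure between two annotators' span sets.

    A span counts as agreed only if offsets and label are identical.
    Treating either annotator as the reference gives the same value.
    """
    if not a and not b:
        return 1.0  # assumption: two empty annotations agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def label_confusions(a: set, b: set) -> Counter:
    """Count (label_a, label_b) pairs for spans with identical offsets.

    Assumes each annotator gives at most one label per offset pair.
    """
    labels_b = {(s.start, s.end): s.label for s in b}
    counts = Counter()
    for s in a:
        other = labels_b.get((s.start, s.end))
        if other is not None:
            counts[(s.label, other)] += 1
    return counts


# Invented example data; the labels are placeholders, not project categories.
annotator_a = {Span(0, 7, "GENE"), Span(12, 20, "VARIATION"), Span(30, 38, "MALIGNANCY")}
annotator_b = {Span(0, 7, "GENE"), Span(12, 20, "GENE"), Span(40, 45, "MALIGNANCY")}

print(round(pairwise_f1(annotator_a, annotator_b), 3))  # 0.333
print(label_confusions(annotator_a, annotator_b))
```

Off-diagonal entries in the confusion counts (here, one span labeled VARIATION by one annotator and GENE by the other) are the kind of evidence that would motivate a guideline change in step 2.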

16  NLP as Law School
  Many complex rules
  - Plus Wiki
  - Plus Listserv

  Rules, Notes, Fiats and Exceptions

  Task        #Pages  #Rules
  Entity          34      20
  Value           10       5
  TIMEX2          75      50
  Relations       36      25
  Events          77      50
  Total          232     150

  Example Decision Rule (Event p33)
  Note: For Events that where a single common trigger is ambiguous between the types LIFE (i.e. INJURE and DIE) and CONFLICT (i.e. ATTACK), we will only annotate the Event as a LIFE Event in case the relevant resulting state is clearly indicated by the construction. The above rule will not apply when there are independent triggers.

17  BioIE case law
  - Guidelines for oncology tagging
  - Guidelines for oncology tagging (local)

18  Discussion
  - How to make it better
      - Integrating multiple information sources: text, bioinformatic databases, microarray data, …
      - less-supervised learning: inferring useful features from untagged text; active learning, information markets, etc.
      - create a “basis set” of ready-made entity types
  - How to make it different
      - the analogy to translation
      - the lure of systematic semantics
      - (machine) learning: who is learning what?

