Integrating Syntactic and Semantic Annotation of Biomedical Text (6/27/03)
Seth Kulick, Mark Liberman, Martha Palmer and Andrew Schein, The University of Pennsylvania


Slide 1: Integrating Syntactic and Semantic Annotation of Biomedical Text
Seth Kulick, Mark Liberman, Martha Palmer and Andrew Schein
The University of Pennsylvania
6/27/03
Support from: NSF ITR-EIA-0205448

Slide 2: Contributors
The University of Pennsylvania
- Ann Bies, Susan Davidson, Hubert Jin, Aravind Joshi, Seth Kulick, Jeremy Lacivita, Mark Liberman, Mark Mandel, Mitch Marcus, Marty McCormick, Tom Morton, Martha Palmer, Eric Pancoast, Fernando Pereira, Andrew Schein, Val Tannen, Lyle Ungar, Peng Wang
eGenome (Children’s Hospital of Philadelphia)
- Yang Jin, Peter White, Scott Winters
GlaxoSmithKline
- Jim Butler, Paula Matuszek, Robin McEntire
Other
- Robert Gaizauskas, Jun-ichi Tsujii, Bonnie Webber

Slide 3: Goal
- Information Extraction from the biomedical literature, particularly Medline
  - Enzyme Inhibition Relations
    Example: “Expression of CYP3A11 and PXR was suppressed by inactivation of HNF4alpha”
    Customer: GlaxoSmithKline
  - Mutation/Malignancy Relations
    Example: “Ki-ras mutations were detected in 17.2% of the adenomas.”
    Customer: eGenome
- Annotate 1-10K abstracts for each domain

Slide 4: Approach to Information Extraction
- Phase 1:
  - Develop definitions and ontologies
  - Annotate data according to definitions
- Phase 2: Train corpus-based algorithms exploiting various annotations:
  - Parsing
  - Predicate-argument analysis
  - Reference resolution
- Phase 3: “Active Annotation”

Slide 5: Active Annotation
[Diagram with the labels: Machine Learning, Selective Sampling/Labeling, Selected Documents, Hand Correction, Hand Annotation]
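The slide names selective sampling but does not say which selection criterion is used. As a minimal sketch, assuming the common least-confidence heuristic (an assumption, not something the slides state), the selection step might look like this. The document names and confidence scores are hypothetical.

```python
from typing import Callable, Dict, List


def select_for_annotation(
    docs: List[str],
    confidence: Callable[[str], float],
    budget: int,
) -> List[str]:
    """Return the `budget` documents the current model is least sure about.

    Assumption: `confidence` maps a document to the model's confidence in
    its own automatic annotation (higher means more certain).
    """
    return sorted(docs, key=confidence)[:budget]


# Hypothetical confidences produced by a tagger trained on hand-annotated data.
toy_confidence: Dict[str, float] = {
    "abstract_a": 0.95,
    "abstract_b": 0.40,
    "abstract_c": 0.70,
    "abstract_d": 0.55,
}

selected = select_for_annotation(
    list(toy_confidence), lambda d: toy_confidence[d], budget=2
)
print(selected)  # ['abstract_b', 'abstract_d']
```

In such a loop the selected documents would be labeled automatically, corrected by hand, and added to the training data before the model is retrained.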

Slide 6: Challenge: Diversity in Expression
1. “Activation of the C-Ki-ras genes by point mutations in codons 12 or 13...”
2. “Point mutations in codons 12 and 13 activated C-Ki-ras”
3. “Point mutations in codons 12 and 13 were activators of C-Ki-ras gene”
Want to populate a factbank with:
  activation(C-Ki-ras, point mutation in codon 12)
  activation(C-Ki-ras, point mutation in codon 13)
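The factbank entries above can be pictured as simple relation records. The following is only an illustrative sketch: the `Fact` type and `expand_codon_facts` helper are hypothetical names, and the slides do not describe the actual factbank format.

```python
from typing import List, NamedTuple


class Fact(NamedTuple):
    relation: str
    target: str
    cause: str


def expand_codon_facts(
    relation: str, target: str, mutation_type: str, codons: List[int]
) -> List[Fact]:
    """Emit one fact per codon so coordinated codons become separate entries."""
    return [
        Fact(relation, target, f"{mutation_type} in codon {codon}")
        for codon in codons
    ]


facts = expand_codon_facts("activation", "C-Ki-ras", "point mutation", [12, 13])
for fact in facts:
    print(f"{fact.relation}({fact.target}, {fact.cause})")
# activation(C-Ki-ras, point mutation in codon 12)
# activation(C-Ki-ras, point mutation in codon 13)
```

All three sentence variants on the slide would ideally reduce to these same two records.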

Slide 7: Approaches to Handling Diversity
- Current Approach is to either:
  - Hand build extraction patterns to cover all variant expressions, or
  - Annotate lots of data to get examples of variant expressions (for machine learning)
- Proposed Approach: Linguistic analysis of the sentences

Slide 8: Information Extraction Approaches
[Diagram contrasting the Common Approach and the Proposed Approach, with the labels: Lexical Info, Linguistic Annotation, Extraction Algorithm, Extracted Relations]

Slide 9: Our Annotation Effort
Together for the first time…
Annotations include:
- Treebank (syntax)
- Propbank (predicate-argument structure)
- Entities (genes, malignancies)
- Reference and Coreference
- Factbanking (end goal)

Slide 10: Syntactic Structure (Treebank Annotation)
[Figure: parse tree for “Activation of the c-ki-ras genes by point mutations in codons 12 or 13”, built from NP, PP and Nom nodes, with a trace (t) inside the coordinated “codons 12 or 13”]

Slide 11: More Examples of Coordination
- “the ortho and meta positions”
  (= the ortho positions and meta positions)
- “PLC and cytochrome P450 arachidonate epoxygenase activity”
  (= PLC arachidonate epoxygenase activity and cytochrome P450 arachidonate…)
- “enhanced CYP2C9 expression and 11,12 EET production”
  (= enhanced CYP2C9 expression and enhanced 11,12 EET production)

Slide 12: Predicate-Argument Annotation: Propbank
- “Point mutations in codons 12 and 13 were activators of C-Ki-ras genes”
- “Activation of the C-Ki-ras genes by point mutations in Codons 12 or 13...”
- Predicate-Argument Structure (Propbank):
  - REL: activation
    activatee: c-ki-ras genes
    activator: point mutations in codons 12 or 13
  - REL: mutations
    type: point
    position(s): Codons 12 or 13
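One way to picture the predicate-argument structures above is as a relation plus a set of labeled roles, independent of the order in which the sentence mentions them. This is a minimal sketch; `Frame` and `make_frame` are hypothetical names and not the Propbank file format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Frame:
    rel: str
    roles: Tuple[Tuple[str, str], ...]  # sorted (role label, filler) pairs


def make_frame(rel: str, roles: Dict[str, str]) -> Frame:
    """Store roles in sorted order so equality ignores the order of mention."""
    return Frame(rel, tuple(sorted(roles.items())))


# The two structures listed on the slide.
activation = make_frame(
    "activation",
    {
        "activatee": "c-ki-ras genes",
        "activator": "point mutations in codons 12 or 13",
    },
)
mutations = make_frame(
    "mutations", {"type": "point", "position(s)": "Codons 12 or 13"}
)

# The same roles given in a different order yield an equal frame.
reordered = make_frame(
    "activation",
    {
        "activator": "point mutations in codons 12 or 13",
        "activatee": "c-ki-ras genes",
    },
)
print(activation == reordered)  # True
print(dict(mutations.roles)["type"])  # point
```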

Slide 13: Why Combine Treebank and Propbank?
- Treebank indicates constituents
  - subject, verb, direct object, etc.
- Propbank indicates roles of constituents
  - “agent,” “theme,” “quantification,” etc.
  - inhibitor, inhibitee, inhibition rate
- Prior work combines Treebank/Propbank for financial text IE (Surdeanu et al., 2003; Gildea and Palmer, 2002)

Slide 14: Entity Annotation
- Entities we annotate include: “gene”, “protein”, “substance”, “malignancy”
- Metonymy Issues:
  - Is a reference a gene or a protein?
  - We use subtypes, following ACE conference convention
  - Gene is broken into three categories: “Generic,” “Gene/RNA” and “Protein”

Slide 15: The Gene Entity
[Diagram with the three subtypes: Generic, Gene/RNA, Protein]

Slide 16: WordFreak Annotation Tool
Morton, Lacivita, Pancoast: www.annotation.org

Slide 17: Reference and Co-reference Annotation
- Co-reference is an equivalence relation
- Subtypes prevent nonsense in a co-ref graph
Example of reference types:
  “K-Ras is a member of the Ras family of Oncogenes. The protein form is actively expressed in…”
  class-membership(K-Ras, Ras family)
  anaphor(K-Ras_protein, protein form)
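Because co-reference is an equivalence relation, mentions can be grouped with a union-find structure, and a subtype check can refuse links that would merge, say, a gene mention with a protein mention. The sketch below is an illustration under that assumption; `CorefGraph` and the mention strings are hypothetical, while the subtype names come from the Entity Annotation slide.

```python
from typing import Dict


class CorefGraph:
    """Union-find over mentions; links across different subtypes are refused."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}
        self.subtype: Dict[str, str] = {}

    def add(self, mention: str, subtype: str) -> None:
        self.parent[mention] = mention
        self.subtype[mention] = subtype

    def find(self, mention: str) -> str:
        # Walk to the class representative, halving the path as we go.
        while self.parent[mention] != mention:
            self.parent[mention] = self.parent[self.parent[mention]]
            mention = self.parent[mention]
        return mention

    def link(self, a: str, b: str) -> bool:
        root_a, root_b = self.find(a), self.find(b)
        # Every member of a class shares one subtype, so comparing roots suffices.
        if self.subtype[root_a] != self.subtype[root_b]:
            return False
        self.parent[root_a] = root_b
        return True


graph = CorefGraph()
graph.add("K-Ras (gene)", "Gene/RNA")
graph.add("K-Ras (protein)", "Protein")
graph.add("the protein form", "Protein")

print(graph.link("K-Ras (protein)", "the protein form"))  # True
print(graph.link("K-Ras (gene)", "the protein form"))  # False
```

The refused link is the kind of "nonsense" that subtypes are meant to keep out of the co-reference graph.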

Slide 18: Current Activities
- In Progress:
  - Entity Annotation of “Gene,” “Chemical,” “Malignancy,” “genetic variation,” etc.
  - POS annotation
  - Training Treebank Syntactic Annotators
- Starting Up:
  - Start coreference annotation
  - Build our first entity tagging models

Slide 19: Some Projected Milestone Dates
- January 2004 - Entity tagging and coreference on oncology domain complete. We publish:
  - annotation guidelines
  - data
  - baseline statistical taggers
- May 2004 - First draft syntactic analysis of oncology domain (1-10K Medline abstracts)

Slide 20: Some Annotation Projects and Related Research
- GENIA Project and U Tokyo Work: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA
- Pasta system and Sheffield Work: http://nlp.shef.ac.uk/research/areas/bio.html
- GENIES system and Columbia/CUNY Work
- Modeling Linguistic Phenomena:
  - Ray/Craven, IJCAI-2001
  - Pustejovsky et al. 2003

Slide 21: The End.

Slide 22: Some Examples Follow

Slide 23: Reference and Co-reference
- Our reference subtypes are:
  - Acronyms (definitions and linkages)
  - Anaphor (such as pronouns)
  - Classes versus their members
  - “Is-a” relation, i.e. “{CYP450}, {an enzyme} found in…”
  - Standardized database reference

Slide 24: Complex Coordination Example
“Inhibition of CB -52 and -101 metabolism”
Note coordination of “CB” and also “metabolism”!
The sentence above can be represented as: “Inhibition of CB-52 metabolism and CB-101 metabolism”
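The expansion in this example amounts to distributing the shared material over each conjunct. As a rough sketch, assuming the shared prefix, the conjuncts and the shared suffix have already been identified by the syntactic annotation (the hypothetical helper below does no parsing itself):

```python
from typing import List


def expand_coordination(prefix: str, conjuncts: List[str], suffix: str) -> List[str]:
    """Attach the shared prefix and suffix to every conjunct."""
    return [f"{prefix}{conjunct} {suffix}".strip() for conjunct in conjuncts]


# "CB" is shared on the left and "metabolism" on the right.
expanded = expand_coordination("CB", ["-52", "-101"], "metabolism")
print("Inhibition of " + " and ".join(expanded))
# Inhibition of CB-52 metabolism and CB-101 metabolism

# A simpler case with only a shared right-hand element.
print(expand_coordination("", ["ortho", "meta"], "positions"))
# ['ortho positions', 'meta positions']
```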

