Text Mining in Biomedicine
Michael Krauthammer, Department of Pathology, Yale University School of Medicine


Definition
Text mining is
–the process of automatically extracting knowledge from large text collections
–data mining applied to text documents / knowledge discovery from text
–a modular process similar to reading, where facts from different articles / books are combined for novel inference (de Bruijn 2002)

Examples in Biomedicine
Input sentences:
–Protein A activates Protein B
–Protein C triggers Apoptosis
–Protein B activates Protein C
Text Mining System -> output network linking Protein A, Protein B, Protein C and Apoptosis
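The combination of facts in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triples are typed in by hand (no extraction is performed), and the helper names `build_network` and `find_path` are hypothetical, not part of any system named in the slides.

```python
from collections import defaultdict

# Hand-entered facts mirroring the three example sentences (assumption:
# a real system would produce these triples automatically).
facts = [
    ("Protein A", "activates", "Protein B"),
    ("Protein C", "triggers", "Apoptosis"),
    ("Protein B", "activates", "Protein C"),
]

def build_network(triples):
    """Map each source entity to a list of (relation, target) edges."""
    network = defaultdict(list)
    for source, relation, target in triples:
        network[source].append((relation, target))
    return network

def find_path(network, start, goal, seen=None):
    """Depth-first search for a chain of entities from start to goal."""
    seen = seen or set()
    if start == goal:
        return [start]
    seen.add(start)
    for _, target in network.get(start, []):
        if target not in seen:
            rest = find_path(network, target, goal, seen)
            if rest:
                return [start] + rest
    return None

network = build_network(facts)
path = find_path(network, "Protein A", "Apoptosis")
print(" -> ".join(path))
```

No single sentence links Protein A to Apoptosis; the chain only appears once the three facts are merged into one network.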

Signal Transduction © Max Planck Institute of Molecular Physiology

Signal Transduction - Apoptosis © Daniel Focosi / Molecular Medicine


Mining Molecular Interactions

Information Explosion

Mining Molecular Interactions
Input sentences:
–Protein A activates Protein B
–Protein C triggers Apoptosis
–Protein B activates Protein C
GeneWays System -> output network linking Protein A, Protein B, Protein C and Apoptosis

Network-based Candidate Gene Prediction

Text Mining - Components

Information Extraction
Information Extraction: “the activity of populating a structured information source (or database) from an unstructured, or free text, information source” (Gaizauskas & Wilks 1998)

Information Extraction
Many information sources are free text:
–Law (Court Orders)
–Academic Research (Research Articles)
–Finance (Quarterly Reports)
–Medicine (Discharge Summaries)
–Biology (Molecular Interactions)
Data analysis on free text is difficult
Information extraction transforms free text into structured (machine-readable) data

Information Extraction
DISCHARGE SUMMARY (free text): “Patient Smith reports fever and weight loss”
-> INFORMATION EXTRACTION ->
PATIENT DATABASE (structured data):
Name: Smith
Symptom: fever
Symptom: weight loss
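As a rough illustration of the free-text-to-database step in the discharge summary example, the sketch below fills (field, value) rows from one sentence. It assumes a single fixed sentence template and a hypothetical function name `extract_record`; real clinical text would need far more robust processing.

```python
import re

# Assumed sentence template: "Patient <name> reports <symptom> and <symptom>".
PATTERN = re.compile(r"Patient (?P<name>\w+) reports (?P<symptoms>.+)")

def extract_record(sentence):
    """Turn one free-text sentence into a list of (field, value) rows."""
    match = PATTERN.search(sentence.rstrip("."))
    if match is None:
        return []
    rows = [("Name", match.group("name"))]
    # Split the symptom list on commas and on the word "and".
    for symptom in re.split(r",\s*|\s+and\s+", match.group("symptoms")):
        if symptom:
            rows.append(("Symptom", symptom.strip()))
    return rows

rows = extract_record("Patient Smith reports fever and weight loss")
for field, value in rows:
    print(f"{field}: {value}")
```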

Information Extraction
SCIENTIFIC ARTICLE (free text): “We observed the activation of protein A by protein B”
-> INFORMATION EXTRACTION (Natural Language Processing: statistical methods, pattern matching, full/shallow parsing) ->
RESEARCH DATABASE (structured data):
Substance: Protein A
Interaction: activation
Substance: Protein B

Statistical Methods
Stapley (2000): Measuring gene associations
Venn diagram of a set of Medline documents showing the intersection of documents containing both genes i and j.
BioBibliometric distance: d_ij = (|i| + |j|) / |i ∩ j|, where |i| and |j| are the numbers of documents mentioning gene i and gene j, and |i ∩ j| is the number mentioning both.
Stapley, B. J. and G. Benoit (2000). “Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts.” Pac Symp Biocomput:
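The distance on this slide can be computed directly from two sets of document identifiers. The sketch below follows the formula as written on the slide, with |ij| read as the size of the intersection; the function name and the toy document IDs are assumptions made for illustration.

```python
def biobibliometric_distance(docs_i, docs_j):
    """(|i| + |j|) / |i ∩ j| for two sets of document identifiers."""
    shared = docs_i & docs_j
    if not shared:
        # Genes that never co-occur are treated as infinitely distant
        # (an assumption; the slide does not cover this case).
        return float("inf")
    return (len(docs_i) + len(docs_j)) / len(shared)

# Toy document IDs, not real Medline identifiers.
gene_i_docs = {1, 2, 3, 4}
gene_j_docs = {3, 4, 5}
distance = biobibliometric_distance(gene_i_docs, gene_j_docs)
print(distance)
```

Two genes mentioned in exactly the same documents reach the smallest possible value, 2; the value grows as the overlap shrinks.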

Pattern Matching
Pattern matching (~regexp) to extract protein-protein interactions
Blaschke, C., M. A. Andrade, et al. (1999). “Automatic extraction of biological information from scientific text: protein-protein interactions.” Proc Int Conf Intell Syst Mol Biol:
Ng, S. K. and M. Wong (1999). “Toward Routine Automatic Pathway Discovery from On-line Scientific Text Abstracts.” Genome Inform Ser Workshop Genome Inform 10:
Ono, T., H. Hishigaki, et al. (2001). “Automated extraction of information on protein-protein interactions from the biological literature.” Bioinformatics 17(2):
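A minimal sketch of regular-expression extraction is given below. It is not the method of any of the cited papers: the protein dictionary, the verb list and the two patterns (an active "X verb Y" form and the nominal "activation of X by Y" form) are all assumptions chosen to match the example sentences used elsewhere in these slides.

```python
import re

# Assumed name dictionary and verb list (hypothetical, for illustration only).
PROTEINS = ["Protein A", "Protein B", "Protein C", "Apoptosis"]
NAME = "|".join(re.escape(p) for p in PROTEINS)
VERB = "activates|inhibits|triggers"

# "X activates Y"
ACTIVE = re.compile(rf"({NAME}) ({VERB}) ({NAME})", re.IGNORECASE)
# "activation of X by Y": Y is the agent, X the target
NOMINAL = re.compile(rf"activation of ({NAME}) by ({NAME})", re.IGNORECASE)

def extract_interactions(text):
    """Return (agent, verb, target) triples found by the two patterns."""
    found = []
    for agent, verb, target in ACTIVE.findall(text):
        found.append((agent, verb, target))
    for target, agent in NOMINAL.findall(text):
        found.append((agent, "activates", target))
    return found

text = ("Protein A activates Protein B. "
        "We observed the activation of protein A by protein B.")
triples = extract_interactions(text)
print(triples)
```

Each surface form needs its own pattern, which is the main limitation of this approach compared with parsing.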

Full Parsing
Parsing: detect the sequence of grammar rules that describes the internal structure of a sentence
Grammar rule: S -> NP VP
[The house]NP [was demolished]VP.
Syntax parse tree:

Full Parsing
Language Parsing in Biomedicine
MedLEE and GENIES semantic grammar parsers (Columbia University, Dr. Carol Friedman)
MedLEE: clinical medicine parser for discharge summaries, radiology reports and pathology reports
Example input: “the patient has a family history of coronary artery disease”

Full Parsing
GENIES: parser for the molecular domain. Extracts molecular interactions.
Frame representation: each frame is a list beginning with the elements type and value, possibly followed by additional frames:
[protein, Il-2, [state, active]]
For example, the parse of “Raf-1 activates Mek-1” is
[action, activate, [protein, Raf-1], [protein, Mek-1]]

Full Parsing
Handles nested sentences (context-free language):
“mediation of sonic hedgehog-induced expression of Coup-Tfii by a protein phosphatase”
[action,promote,[geneorprotein, phosphatase], [action,activate,[geneorprotein,sonic hedgehog], [action,express,X,[geneorprotein,Coup-Tfii]]]]
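The nested frame shown above maps naturally onto nested Python lists. The sketch below walks such a frame and lists every action it contains. It assumes each action frame has exactly the layout type, value, agent, target seen in the slide examples; the helper names are hypothetical and this is not GENIES code.

```python
# The nested frame from the slide, written as Python lists.
frame = ["action", "promote",
         ["geneorprotein", "phosphatase"],
         ["action", "activate",
          ["geneorprotein", "sonic hedgehog"],
          ["action", "express", "X", ["geneorprotein", "Coup-Tfii"]]]]

def label(slot):
    """Value of a sub-frame, or the slot itself for a bare placeholder like X."""
    return slot[1] if isinstance(slot, list) else slot

def collect_actions(node):
    """Return (action, agent, target) for this frame and all nested actions."""
    actions = []
    if isinstance(node, list) and node and node[0] == "action":
        _, value, agent, target = node[:4]
        actions.append((value, label(agent), label(target)))
        for child in node[2:]:
            actions.extend(collect_actions(child))
    return actions

actions = collect_actions(frame)
for action in actions:
    print(action)
```

When the target of an action is itself an action, the target is reported by that action's value, so the nesting stays visible in the flat list.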

Full Parsing
Hafner, C. D., K. Baclawski, et al. (1994). “Creating a knowledge base of biological research papers.” Proc Int Conf Intell Syst Mol Biol 2:
Friedman, C., P. Kra, et al. (2001). GENIES: A Natural-Language System for the Extraction of Molecular Pathways from Complete Journal Articles. Proc Int Conf Intell Syst Mol Biol, Copenhagen.
Yakushiji A, Tateisi Y, Miyao Y, Tsujii J. Event extraction from biomedical papers using a full parser. Pac Symp Biocomput. 2001:
McDonald DM, Chen H, Su H, Marshall BB. Extracting gene pathway relations using a hybrid grammar: the Arizona relation parser. Bioinformatics Jul 15
Leroy G, Chen H, Martinez JD. A shallow parser based on closed-class words to capture relations in biomedical text. J Biomed Inform Jun;36(3):
Koike A, Niwa Y, Takagi T. Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics Oct 27
Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I. Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics Mar 22;20(5): Epub 2004 Jan 22

Shallow Semantic Parsing
Medical abstracts: “The article discussed that Zocor reduced cholesterol in the intervention group.”
Parse: Zocor (Arg0) reduced cholesterol (Arg1)
DATABASE: medicine, action, blood test
Example queries: What medicine decreased a blood test? How did a medicine affect a blood test?

Shallow Semantic Parsing
Shallow Semantic Parsing Technique (SSPT)
–Successfully applied in non-medical domain*
–“Predicate-centric”
–Dissects sentences into simple WHAT did WHAT to WHOM/WHAT, plus modifiers (WHEN, WHERE, WHY and HOW)
Example: The article discussed that Zocor (What) reduced (did What) cholesterol (to What) in the intervention group (modifiers).
–Thus two core arguments, “Zocor” (Argument 0) and “cholesterol” (Argument 1), are related by the predicate “reduce(d)”
–Modifier: “in the intervention group”
–“The article discussed that” is a null argument, i.e. it is not part of the predicate arguments.
* S. Pradhan, D. Jurafsky, et al. In Proc. of NAACL-HLT 2004.
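The predicate-centric view of the Zocor sentence can be held in a small record and queried. The sketch below is hypothetical throughout: the `Proposition` class, the category table and the query helper are illustrative names, and the proposition is entered by hand rather than produced by a parser.

```python
from dataclasses import dataclass, field

@dataclass
class Proposition:
    predicate: str                      # normalized verb, e.g. "reduce"
    arg0: str                           # WHAT (the actor)
    arg1: str                           # to WHAT (the thing acted upon)
    modifiers: list = field(default_factory=list)

# Assumed semantic categories for the two arguments.
CATEGORY = {"Zocor": "medicine", "cholesterol": "blood test"}

propositions = [
    Proposition("reduce", "Zocor", "cholesterol", ["in the intervention group"]),
]

def agents_for(props, predicate, agent_category, target_category):
    """Answer 'what <agent_category> <predicate>d a <target_category>?'."""
    return [
        p.arg0
        for p in props
        if p.predicate == predicate
        and CATEGORY.get(p.arg0) == agent_category
        and CATEGORY.get(p.arg1) == target_category
    ]

answer = agents_for(propositions, "reduce", "medicine", "blood test")
print(answer)
```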

Treebank and Propbank
–Treebank contains the Wall Street Journal (WSJ) corpus annotated with syntactic information
–Propbank annotates the same WSJ corpus found in Treebank with semantic information
–Given the syntactic and semantic features, we can build a machine learning-based Information Extraction (IE) system, using shallow semantic parsing
–The advantage of using Treebank and Propbank is the reuse of existing corpora to do ‘free’ information extraction in the medical domain

Introduction: Treebank
\\treebank\parsed\mrg\wsj_0001.mrg
“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (,,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (,,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (..) ))
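A bracketed tree of this kind can be read into nested lists with a short recursive reader. The sketch below is an assumption-laden illustration, not a Treebank tool: it is applied to a shortened fragment of the sentence (punctuation nodes and the empty outer wrapper are left out), and the function names are hypothetical.

```python
import re

def parse_tree(text):
    """Parse a bracketed tree into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "every node must start with '('"
        pos += 1
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())      # nested constituent
            else:
                node.append(tokens[pos])  # label or word
                pos += 1
        pos += 1                          # consume ")"
        return node

    return read()

def leaves(node):
    """Return (POS tag, word) pairs in sentence order."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    pairs = []
    for child in children:
        pairs.extend(leaves(child))
    return pairs

fragment = "(VP (MD will) (VP (VB join) (NP (DT the) (NN board))))"
tree = parse_tree(fragment)
print(leaves(tree))
```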

Introduction: Propbank
“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”
wsj/00/wsj_0001.mrg 0 8 gold join.01 vf--a 0:2-ARG0 7:0-ARGM-MOD 8:0-rel 9:1-ARG1 11:1-ARGM-PRD 15:1-ARGM-TMP
–Location in Treebank: wsj/00/wsj_0001.mrg 0 8
–Verb ‘Join’: join.01
–Argument 0: 0:2-ARG0
–Argument 1: 9:1-ARG1
–Argument M: the ARGM-* entries
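The Propbank annotation line for this sentence is whitespace-separated and easy to split into parts. In the sketch below the dictionary keys are descriptive names chosen for illustration (assumptions, not official field names), and each argument entry is assumed to have the form location-LABEL joined by a hyphen.

```python
def parse_propbank_line(line):
    """Split one annotation line into its location, roleset and arguments."""
    fields = line.split()
    filename, sentence, terminal, tagger, roleset, inflection = fields[:6]
    arguments = []
    for item in fields[6:]:
        # "0:2-ARG0" -> location "0:2", label "ARG0"; only the first
        # hyphen separates them, so "ARGM-MOD" stays intact.
        location, _, role = item.partition("-")
        arguments.append((location, role))
    return {
        "file": filename,
        "sentence": int(sentence),
        "predicate_terminal": int(terminal),
        "tagger": tagger,
        "roleset": roleset,
        "inflection": inflection,
        "arguments": arguments,
    }

line = ("wsj/00/wsj_0001.mrg 0 8 gold join.01 vf--a 0:2-ARG0 7:0-ARGM-MOD "
        "8:0-rel 9:1-ARG1 11:1-ARGM-PRD 15:1-ARGM-TMP")
record = parse_propbank_line(line)
print(record["roleset"], record["arguments"])
```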

Overall idea
Syntax (from Treebank) predicts predicate arguments (from Propbank), e.g. Arg0: the eater, Arg1: the thing eaten

Problem: WSJ corpus = business domain
In order to use WSJ, we have to make sure that its predicate distribution is “representative” of medical sentences. We found that 99 of the top 100 predicates in medical abstracts also occur in the WSJ corpus.
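The representativeness check described here amounts to a set-membership count. The sketch below shows the calculation on made-up data: the verb lists are hypothetical placeholders, not the predicates from the study, and `predicate_coverage` is an illustrative name.

```python
def predicate_coverage(top_predicates, reference_predicates):
    """Fraction of the top predicates that also occur in the reference corpus."""
    reference = set(reference_predicates)
    found = [p for p in top_predicates if p in reference]
    return len(found) / len(top_predicates)

# Hypothetical toy data standing in for the two corpora.
medical_top = ["reduce", "improve", "suggest", "randomize"]
wsj_verbs = ["reduce", "improve", "suggest", "join", "say"]

coverage = predicate_coverage(medical_top, wsj_verbs)
print(f"{coverage:.0%} of the top medical predicates occur in the reference set")
```

With the figures reported on the slide, the same calculation gives 99 / 100 = 0.99.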

Results: Verb Frequency
10 most frequently found verbs in medical abstracts
# | Occurrences | Verb | Cumulative frequency
11238 reduce, improve, suggest, increase, use, associate, compare, show, provide, appear 0.260

Methods: ML Training Set and Intra-Domain Testing Set
WSJ -> extract sentences with top 5 verbs -> 15,424 words
–Training Set: 12,500 words
–Test Set: 2,924 words

Methods: ML Training & Testing (Intra-domain)
ML Training: WSJ Training Set -> extraction of syntactic features from Treebank and semantic categories from Propbank -> build classifier for semantic categories (SVMTorch)
ML Testing: WSJ Testing Set -> extraction of syntactic features -> predict semantic categories
Example: Pierre Vinken, 61 years old, will join [the board]_Arg1 as a nonexecutive director Nov. 29.

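Syntactic Features
Parse tree of the example sentence (S with NP, VP, SBAR, PP constituents):
[The article]NP [discussed]VP [that]SBAR [Zocor]NP [reduced]VP [cholesterol]NP [in]PP [the intervention group]NP
Labels: “The article discussed that” = Null, “Zocor” = Argument 0, “reduced” = Verb, “cholesterol” = Argument 1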

Syntactic Features
–Predicate of the sentence
–Syntactic path from a word to the sentence predicate: for the word Zocor, the paths are NP -> S -> VP -> VBD and S -> VP -> VBD
–Phrase type: the syntactic category of the constituent (NP and S for Zocor)
* S. Pradhan, D. Jurafsky, et al. In Proc. of NAACL-HLT 2004.
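The path feature can be computed by climbing from the constituent to the lowest node it shares with the predicate and then descending. The sketch below assumes a hand-built tree for the clause “Zocor reduced cholesterol”, a hypothetical `Node` class, and the arrows ↑ and ↓ to mark direction (a notational choice for this example, since the slide text does not preserve its arrow symbols).

```python
class Node:
    """Parse-tree node with a parent pointer (illustrative helper class)."""
    def __init__(self, label, children=(), word=None):
        self.label = label
        self.word = word
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def ancestors(node):
    """The node itself followed by its parent, grandparent, ... up to the root."""
    chain = [node]
    while node.parent is not None:
        node = node.parent
        chain.append(node)
    return chain

def syntactic_path(constituent, predicate):
    """Labels from the constituent up to the common ancestor, then down to the predicate."""
    up = ancestors(constituent)
    down = ancestors(predicate)
    common = next(n for n in up if n in down)
    up_part = up[: up.index(common) + 1]
    down_part = reversed(down[: down.index(common)])
    return "↑".join(n.label for n in up_part) + "".join("↓" + n.label for n in down_part)

# Assumed tree: (S (NP Zocor) (VP (VBD reduced) (NP cholesterol)))
zocor = Node("NP", word="Zocor")
reduced = Node("VBD", word="reduced")
cholesterol = Node("NP", word="cholesterol")
vp = Node("VP", [reduced, cholesterol])
s = Node("S", [zocor, vp])

print(syntactic_path(zocor, reduced))
print(syntactic_path(cholesterol, reduced))
```

The two arguments get different paths, which is what lets a classifier tell a subject-like constituent from an object-like one.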

Syntactic Features
–Position of the word relative to the predicate
–Head Word POS: the POS tag of the syntactic head of the constituent
–Sub-categorization: the phrase structure rule expanding the predicate’s parent node in the parse tree, e.g. VP -> VBD-NP for the predicate reduced
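Two of these features, position and sub-categorization, can be read off a small tree as sketched below. The tree is hand-built for the clause “Zocor reduced cholesterol” and encoded as (label, children) tuples; this encoding and the function names are assumptions made for illustration.

```python
# Nodes are (label, children) tuples; a leaf's "children" is its word.
TREE = ("S", [("NP", "Zocor"),
              ("VP", [("VBD", "reduced"), ("NP", "cholesterol")])])

def subcategorization(tree, predicate_word):
    """Rule expanding the parent of the predicate, e.g. 'VP -> VBD-NP'."""
    label, children = tree
    if isinstance(children, str):
        return None
    for child in children:
        if child[1] == predicate_word:
            return f"{label} -> " + "-".join(c[0] for c in children)
    for child in children:
        rule = subcategorization(child, predicate_word)
        if rule:
            return rule
    return None

def words(tree):
    """Words of the tree in sentence order."""
    _, children = tree
    if isinstance(children, str):
        return [children]
    return [w for child in children for w in words(child)]

def position(tree, word, predicate_word):
    """Whether the word occurs before or after the predicate."""
    order = words(tree)
    return "before" if order.index(word) < order.index(predicate_word) else "after"

print(subcategorization(TREE, "reduced"))
print(position(TREE, "Zocor", "reduced"), position(TREE, "cholesterol", "reduced"))
```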

Results: Intra-domain performance
Argument | Recall | Precision | F | n
NULL N/A Weighted Avg

Results: Comparison with Prior Work* (Intra-domain)
Table 1*: Performance on WSJ test set
Arg | Precision | Recall | F
ID (null)
ID + Class
* S. Pradhan, D. Jurafsky, et al. In Proc. of NAACL-HLT 2004.

Methods: ML Cross-Domain Testing Set
Medline Abstracts -> hand-annotated test set (6373 words)
–250 sentences with 5 target verbs
–Manually annotated by 2 medical experts

Methods: ML Testing (cross-domain)
ML Training: Propbank (WSJ) -> extraction of syntactic and semantic categories -> WSJ Training set -> SVMTorch
ML Testing: RCT Abstracts -> extraction of syntactic features -> Medical Abstracts Testing set -> predict semantic categories

Results: Cross-domain performance
Arg | Recall | Precision | F | n
NULL Weighted Avg

Results: Comparison with prior work* (cross-domain)
Table 15*: Performance on the AQUAINT test set. AQUAINT: collection of text from the NY Times Inc., AP Inc., and Xinhua News Service
Arg | Precision | Recall | F
ID (null)
ID + Class

Discussion
Our ML classifier for null arguments
–Intra-domain F = 86%, and cross-domain F = 75%, difference = 11%
Pradhan and Jurafsky article for null arguments
–Intra-domain F = 92%, and cross-domain F = 81%, difference = 11%
Reuse of Propbank and Treebank information to automatically annotate medical abstracts using SSPT and an ML classifier is feasible

Discussion - Limitations
Limitation
–The results are based on a small medical testing set
Future directions
–Improve performance by adding: the verb sense feature found in Propbank (not used here), lexical features (currently lacking), verb clustering, and temporal cue words
–Test the performance using a much larger medical abstract test set

Summary
–Literature is an important resource for biomedical knowledge
–Text mining = framework for accessing the free text in the literature, and transforming it to structured data
–Machine Learning = essential element in the text mining process

Appendix: Sentence Predicate Extraction
–Perl module Lingua::EN::Sentence -> identified sentences
–Charniak parser (1) -> identified parts of speech (based on WSJ corpus)
–Extracted terminals with VB* POS tags
–Program morpha (2) -> normalization of verbs
1. Charniak, E., A Maximum-Entropy-Inspired Parser. 1999, Brown University.
2. Minnen, G., J. Carroll, and D. Pearce, Applied morphological processing of English. Natural Language Engineering, (3): p
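The last two steps of this pipeline (picking out VB* terminals and normalizing them) can be sketched on an already parsed sentence. The example below is a stand-in, not the original tooling: the bracketed parse is hand-written rather than Charniak parser output, and the small lemma table is a hypothetical substitute for morpha.

```python
import re
from collections import Counter

# Matches terminals such as "(VBD reduced)"; VB, VBD, VBG, VBN, VBP, VBZ.
VERB_TERMINAL = re.compile(r"\((VB[DGNPZ]?)\s+([^\s()]+)\)")

# Hypothetical lookup table standing in for a real morphological normalizer.
LEMMAS = {"discussed": "discuss", "reduced": "reduce"}

def verb_terminals(parse):
    """Return (POS tag, word) pairs for every verb terminal in a bracketed parse."""
    return VERB_TERMINAL.findall(parse)

def normalize(word):
    """Lower-case the verb and map it to its base form when known."""
    word = word.lower()
    return LEMMAS.get(word, word)

parse = ("(S (NP (DT The) (NN article)) (VP (VBD discussed) (SBAR (IN that) "
         "(S (NP (NNP Zocor)) (VP (VBD reduced) (NP (NN cholesterol)))))))")

verbs = verb_terminals(parse)
counts = Counter(normalize(word) for _, word in verbs)
print(verbs)
print(counts.most_common())
```

Counting the normalized verbs over a whole corpus of parses would yield a verb frequency ranking of the kind reported in the results.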