Classifying Semantic Relations in Bioscience Texts

Presentation transcript:

Classifying Semantic Relations in Bioscience Texts Barbara Rosario Marti Hearst SIMS, UC Berkeley http://biotext.berkeley.edu Supported by NSF DBI-0317510 and a gift from Genentech

Problem: Which relations hold between 2 entities (TREATMENT and DISEASE)? Cure? Prevent? Side effect?

Hepatitis Examples
Cure: These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.
Prevent: A two-dose combined hepatitis A and B vaccine would facilitate immunization programs.
Vague: Effect of interferon on hepatitis B.

Two tasks
Relationship extraction: identify the several semantic relations that can occur between the entities disease and treatment in bioscience text.
Entity extraction: a related problem: identify such entities.

The Approach
Data: MEDLINE abstracts and titles.
Graphical models: combine relation and entity extraction in one framework; both static and dynamic models.
A simple discriminative approach: a neural network.
Lexical, syntactic and semantic features.

Outline
Related work
Data and semantic relations
Features
Models and results
Conclusions

Several DIFFERENT Relations between the Same Types of Entities
This differs from the problem statement of other work on relations: many find one relation which holds between two entities (many based on ACE).
Agichtein and Gravano (2000): lexical patterns for the location-of relation
Zelenko et al. (2002): SVMs for person-affiliation and organization-location
Hasegawa et al. (ACL 2004): Person-Organization -> "President" relation
Craven (1999, 2001): HMMs for subcellular-location and disorder-association; doesn't identify the actual relation

Related work: Bioscience
Many hand-built rules: Feldman et al. (2002), Friedman et al. (2001), Pustejovsky et al. (2002), Saric et al. (this conference).
Craven (1999, 2001) considers positive examples to be all the sentences that simply contain the entities, rather than analyzing which relations hold between these entities.
Role extraction: Pustejovsky et al. use a rule-based system to extract entities in the inhibit-relation. Their experiments use MEDLINE sentences that contain verbal and nominal forms of the stem inhibit; the actual task performed is therefore the extraction of entities that are connected by some form of the stem inhibit, which is potentially different from the extraction of entities in the inhibit-relation, since there may well be other ways to express this relation.

Data and Relations
MEDLINE abstracts and titles; 3662 sentences labeled.
Relevant: 1724. Irrelevant: 1771 (e.g., "Patients were followed up for 6 months").
2 types of entities, many instances: treatment and disease.
7 relationships between these entities.
An annotator with biology expertise looked at the titles and abstracts separately and labeled the sentences in both based solely on the content of the individual sentences. The labeled data is available at http://biotext.berkeley.edu

Semantic Relationships (with counts)
Cure (810): Intravenous immune globulin for recurrent spontaneous abortion
Only Disease (616): Social ties and susceptibility to the common cold
Only Treatment (166): Flucticasone propionate is safe in recommended doses
Prevent (63): Statins for prevention of stroke

Semantic Relationships (cont.)
Vague (36): Phenylbutazone and leukemia
Side Effect (29): Malignant mesodermal mixed tumor of the uterus following irradiation
Does NOT cure (4): Evidence for double resistance to permethrin and malathion in head lice

Features
Word
Part of speech
Phrase constituent
Orthographic features: "is number", "all letters are capitalized", "first letter is capitalized", ...
MeSH (semantic features): replace words, or sequences of words, with generalizations via MeSH categories (e.g., Peritoneum -> Abdomen). A small sketch of the orthographic features follows.
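To make the per-token features concrete, here is a minimal sketch (our illustration, not the paper's code; the feature names are hypothetical stand-ins):

```python
# A minimal sketch (illustration only) of per-token orthographic
# features of the kind listed above.
def orthographic_features(token: str) -> dict:
    return {
        "is_number": token.replace(".", "", 1).isdigit(),
        "all_caps": token.isupper(),
        "init_cap": token[:1].isupper(),
    }

# Example: orthographic_features("Peritoneum")
# -> {"is_number": False, "all_caps": False, "init_cap": True}
```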

Features (cont.): MeSH
MeSH Tree Structures:
1. Anatomy [A]
2. Organisms [B]
3. Diseases [C]
4. Chemicals and Drugs [D]
5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
6. Psychiatry and Psychology [F]
7. Biological Sciences [G]
8. Physical Sciences [H]
9. Anthropology, Education, Sociology and Social Phenomena [I]
10. Technology and Food and Beverages [J]
11. Humanities [K]
12. Information Science [L]
13. Persons [M]
14. Health Care [N]
15. Geographic Locations [Z]
We also used a large domain-specific lexical hierarchy (MeSH) for generalization across classes of nouns. There are about 19,000 unique main terms in MeSH and 15 main sub-hierarchies (trees), each corresponding to a major branch of medical ontology; for example, tree A corresponds to Anatomy, tree C to Diseases, and so on. It is possible for multiple words to be mapped to the same concept, due either to lexical ambiguity or to different ways of classifying the same concept. For this work, we simply retain the first MeSH mapping for each word.

Features (cont.): MeSH
1. Anatomy [A]
   Body Regions [A01]
   Musculoskeletal System [A02]
   Digestive System [A03] +
   Respiratory System [A04] +
   Urogenital System [A05] +
   Endocrine System [A06] +
   Cardiovascular System [A07] +
   Nervous System [A08] +
   Sense Organs [A09] +
   Tissues [A10] +
   Cells [A11] +
   Fluids and Secretions [A12] +
   Animal Structures [A13] +
   Stomatognathic System [A14]
   (...)
Body Regions [A01]
   Abdomen [A01.047]
      Groin [A01.047.365]
      Inguinal Canal [A01.047.412]
      Peritoneum [A01.047.596] +
      Umbilicus [A01.047.849]
   Axilla [A01.133]
   Back [A01.176] +
   Breast [A01.236] +
   Buttocks [A01.258]
   Extremities [A01.378] +
   Head [A01.456] +
   Neck [A01.598]
   (...)
The longer the MeSH term, the longer the path from the root of the hierarchy and the more precise the description. We can represent the MeSH terms at different levels of description; currently, we use the second level. This choice is somewhat arbitrary (made mainly with the sparsity issue in mind), but in light of the importance of the MeSH features it may be worthwhile to investigate finding the optimal level of description. (This can be seen as another form of smoothing.)
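Since MeSH tree numbers encode the path from the root, truncating a code to its first two components yields the second-level generalization described above. A minimal sketch (our illustration; the lookup tables are tiny hypothetical excerpts of MeSH):

```python
# A minimal sketch (illustration only) of generalizing a MeSH tree
# number to the second level of the hierarchy.
MESH_CODES = {"Peritoneum": "A01.047.596", "Groin": "A01.047.365"}
MESH_NAMES = {"A01.047": "Abdomen"}

def mesh_level(term: str, level: int = 2) -> str:
    code = MESH_CODES[term]                        # e.g., "A01.047.596"
    truncated = ".".join(code.split(".")[:level])  # -> "A01.047"
    return MESH_NAMES.get(truncated, truncated)

# mesh_level("Peritoneum") -> "Abdomen"  (A01.047.596 -> A01.047)
```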

Models
2 static generative models
3 dynamic generative models
1 discriminative model (a neural network)

Static Graphical Models (figure: models S1 and S2)
S1: observations dependent on the role but independent of the relation given the roles.
S2: observations dependent on both the relation and the role.
In S1 the observations are independent of the relation (given the roles). In S2, the observations depend on both the relation and the role; in other words, the relation generates not only the sequence of roles but also the observations, encoding the fact that even when the roles are given, the observations depend on the relation. For example, sentences containing the word "prevent" are more likely to represent a "prevent" kind of relationship. A sketch of the two factorizations follows.
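As a sketch (our reconstruction from the description above, not copied from the paper; here $R$ is the relation, $t_i$ the role of word $i$, and $o_{ij}$ the $j$-th observed feature of word $i$):

```latex
% S1: features generated by the roles alone
P(R, t_{1:n}, o) = P(R) \prod_{i=1}^{n} P(t_i \mid R)
                   \prod_{i=1}^{n} \prod_{j} P(o_{ij} \mid t_i)

% S2: features generated by both the relation and the roles
P(R, t_{1:n}, o) = P(R) \prod_{i=1}^{n} P(t_i \mid R)
                   \prod_{i=1}^{n} \prod_{j} P(o_{ij} \mid t_i, R)
```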

Dynamic Graphical Models
D1, D2: as in S1, S2, but with a Markov chain over the role sequence.
D3: only one observation per state is dependent on both the relation and the role.
In D1 the observations are independent of the relation (given the roles). In D2, the observations depend on both the relation and the role; in other words, the relation generates not only the sequence of roles but also the observations, encoding the fact that even when the roles are given, the observations depend on the relation. For example, sentences containing the word "prevent" are more likely to represent a "prevent" kind of relationship. In D3 only one observation per state depends on both the relation and the role, the motivation being that some observations (such as the words) depend on the relation while others (for example, the parts of speech) might not. In the experiments reported here, the observations which have edges from both the role and the relation nodes are the words. (We ran an experiment in which this observation node was the MeSH term, obtaining similar results.)

Graphical Models
Relation node: the semantic relation (cure, prevent, none, ...) expressed in the sentence.
The node labeled "Relation" represents the relationship present in the sentence. We assume here that there is a single relation for each sentence between the entities.

Graphical Models
Role nodes: 3 choices: treatment, disease, or none.
The nodes labeled "Role" represent the entities (in this case the choices are DISEASE, TREATMENT and NULL); there are as many roles as there are words in the sentence. The simpler static models do not assume an ordering in the role sequence. The dynamic models were inspired by prior work on HMM-like graphical models for role extraction: these models consist of a Markov sequence of states (usually corresponding to semantic roles) where each state generates one or multiple observations. These models assume that there is an ordering in the semantic roles that can be captured with the Markov assumption, and that the role generates the observations (the words, for example). All our models make the additional assumption that there is a relation that generates the role sequence; thus, these models have the appealing property that they can simultaneously perform role extraction and relationship recognition, given the sequence of observations.

Graphical Models
Feature nodes (observed): word, POS, MeSH, ...
The children of the role nodes are the words and their features, which are the only nodes observed. The task is to recover the sequence of role states and the relation, given the observed features.

Graphical Models
For dynamic model D1: the joint probability distribution is over the relation, role, and feature nodes (a reconstruction is sketched below). Parameters are estimated with maximum likelihood and absolute discounting smoothing.
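A sketch of the D1 joint, reconstructed from the description above rather than copied from the paper (same notation as before):

```latex
% D1: the relation generates a Markov chain of roles; each role
% generates its observations independently of the relation.
P(R, t_{1:n}, o) = P(R)\, P(t_1 \mid R) \prod_{i=2}^{n} P(t_i \mid t_{i-1}, R)
                   \prod_{i=1}^{n} \prod_{j} P(o_{ij} \mid t_i)

% In D2, P(o_{ij} | t_i) becomes P(o_{ij} | t_i, R); in D3 this
% holds for only one observation (the word) per state.
```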

Comparison: Thompson et al. (2003) vs. our D1
Thompson et al. 2003: frame classification and role labeling for FrameNet sentences; the target word must be observed; more relations and roles.
Our D1: very similar models, with frame == relation. C: head word/phrase type of constituent (e.g., Anne/NP).
Target word observed: all their sentences have a target from a fixed list (an easier task). The inference procedure is also different. They have more frames/relations (55) and more roles (117).
For the task most similar to ours (identify constituents and classify them): frame classification 97.5% (but with the target given); role labeling accuracy 70.1% (not directly comparable to our F-measure).

Neural Networks
Feed-forward network (MATLAB)
Training with conjugate gradient descent
One hidden layer (hyperbolic tangent activation); logistic sigmoid function for the output layer, which represents the relationships
Same features as the graphical models; a discriminative approach
The number of units in the output layer is the number of relations (eight or nine) and is therefore fixed. The network was trained for several choices of numbers of hidden units; we chose the best-performing networks based on training-set error for each of the models, and then tested these networks on held-out data. A sketch of the architecture follows.
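A minimal sketch of this architecture (our illustration in Python/NumPy, not the authors' MATLAB code; for simplicity it uses plain gradient descent rather than conjugate gradient, and the arrays X and Y are hypothetical stand-ins for the feature vectors and one-hot relation labels):

```python
import numpy as np

def train_relation_net(X, Y, n_hidden=20, lr=0.1, epochs=500, seed=0):
    """X: (n_samples, n_features); Y: (n_samples, n_relations) one-hot."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], Y.shape[1]
    W1 = rng.normal(0.0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, n_out)); b2 = np.zeros(n_out)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                   # tanh hidden layer
        P = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))   # logistic sigmoid outputs
        dZ2 = (P - Y) / len(X)        # gradient wrt pre-sigmoid activations
        W2g, b2g = H.T @ dZ2, dZ2.sum(axis=0)
        dH = (dZ2 @ W2.T) * (1.0 - H ** 2)         # back through tanh
        W1g, b1g = X.T @ dH, dH.sum(axis=0)
        W1 -= lr * W1g; b1 -= lr * b1g
        W2 -= lr * W2g; b2 -= lr * b2g
    return W1, b1, W2, b2

def predict(params, X):
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)
    P = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))
    return P.argmax(axis=1)   # index of the predicted relation
```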

Relation extraction
Results are reported in terms of classification accuracy (with and without irrelevant sentences), for 2 cases: roles hidden and roles given. For the graphical models this is inference; for the NN it is a simple classification problem.
We ran the experiments for two cases: "roles given", where the true semantic roles are given and used as input for the classification along with the observable features; and "only features", a more realistic case in which the true roles are hidden and we classify the relations given only the observable features. For the graphical models, when the roles are hidden, they are marginalized over (see the sketch below).
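A sketch of that marginalization in the notation used earlier (our reconstruction):

```latex
% Predict the relation by summing out the hidden role sequence:
\hat{R} = \arg\max_{R} \sum_{t_1, \ldots, t_n} P(R, t_{1:n}, o)
```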

Relation classification: Results
The neural net is always best: the NN always outperforms the graphical models. Two possible reasons for this are: the discriminative approach may be the most appropriate for fully labeled data; or the graphical models we proposed may not be the right ones, i.e., the independence assumptions they make may misrepresent the underlying dependencies. It must be pointed out that the neural network is much slower than the graphical models and requires a great deal of memory.

Relation classification: Results
With no smoothing, D1 is the best graphical model: D1 outperforms the other dynamic models when no smoothing is applied. This was expected, since the parameters of models D2 and D3 are sparser than those of D1.

Relation classification: Results
With smoothing and no roles, D2 is the best graphical model: when smoothing is applied and the true roles are hidden (the most realistic case), D2 achieves the best classification accuracies.

Relation classification: Results
With smoothing and roles given, D1 is the best graphical model. D1 does well in the cases where the roles are not both present. By contrast, D2 does better than D1 when the presence of specific words strongly determines the outcome (e.g., the presence of "prevention" or "prevent" helps identify the Prevent relation).

Relation classification: Results
The dynamic models always outperform the static ones, by a large margin.

Relation classification: Confusion Matrix
Computed for model D2, "rel. + irrel.", "only features". To provide an idea of where the errors occur, this table shows the confusion matrix for model D2 for the most realistic and difficult case of "rel. + irrel.", "only features". It indicates that the algorithm performs poorly primarily for the cases with little training data, with the exception of the ONLY DISEASE case, which is often mistaken for CURE.

Role extraction
Results in terms of F-measure.
Graphical models: junction tree algorithm (BNT); the relation is hidden and marginalized over. (The graphical models can do role extraction and relationship classification simultaneously.)
NN: we couldn't run it (the feature vectors are too large).
We use a strict evaluation: every token is assessed (for example, even punctuation must be associated with the appropriate entity) and we do not assign partial credit for constituents for which only some of the words are correctly labeled. A sketch of the evaluation follows.
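A minimal sketch (our illustration) of a strict token-level F-measure of this flavor; the paper's constituent-level strictness may differ in detail, and the label names are hypothetical:

```python
# Strict token-level F-measure: every token counts, including
# punctuation, and a token is correct only if its label matches.
def strict_f_measure(gold, pred, null="NONE"):
    """gold, pred: equal-length lists of per-token role labels."""
    tp = sum(g == p != null for g, p in zip(gold, pred))
    fp = sum(p != null and g != p for g, p in zip(gold, pred))
    fn = sum(g != null and g != p for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```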

Role Extraction: Results F-measures D1 best when no smoothing Again model D1 outperforms the other dynamic models when no smoothing is applied

Role Extraction: Results
F-measures: D2 is best with smoothing, but smoothing doesn't boost scores as much as in relation classification. When smoothing is applied, model D2 achieves the best F-measures; note, however, that the three dynamic models achieve similar results. The percentage improvements of D2 and D3 versus D1 are, respectively, 10% and 6.5% for relation classification and 1.4% for role extraction (in the "only relevant", "only features" case). This suggests that there is a dependency between the observations and the relation that is captured by the additional edges in D2 and D3, but that this dependency is more helpful in relation classification than in role extraction.

Features impact: Role Extraction
Most important features: 1) Word, 2) MeSH
F-measures (rel. + irrel.):

                D1              D2
All features    0.67            0.71
No word         0.58 (-13.4%)   0.61 (-14.1%)
No MeSH         0.63 (-5.9%)    0.65 (-8.4%)

(For all the other features the decrease was negligible.)
Note: MeSH = MeSH ID + domain knowledge

Features impact: Relation classification
Most important features: roles
Accuracy (rel. + irrel.):

                           D1              D2              NN
All feat. + roles          91.6            82.0            96.9
All feat. - roles          68.9 (-24.7%)   74.9 (-8.7%)    79.6 (-17.8%)
All feat. + roles - Word   91.6 (0%)       79.8 (-2.8%)    96.4 (-0.5%)
All feat. + roles - MeSH   91.6 (0%)       84.6 (+3.1%)    97.3 (+0.4%)

(For all the other features the decrease was negligible.)

Features impact: Relation classification
Most realistic case: roles not known
Most important features: 1) MeSH, 2) Word for D1 and NN (but vice versa for D2)
Accuracy (rel. + irrel.):

                           D1              D2              NN
All feat. - roles          68.9            74.9            79.6
All feat. - roles - Word   66.7 (-3.3%)    66.1 (-11.8%)   76.2 (-4.3%)
All feat. - roles - MeSH   62.7 (-9.1%)    72.5 (-3.2%)    74.1 (-6.9%)

(For all the other features the decrease was negligible.)
Note: MeSH = MeSH ID + domain knowledge

Conclusions
Classification of subtle semantic relations in bioscience text.
The discriminative model (a neural network) achieves high classification accuracy.
Graphical models allow the simultaneous extraction of entities and relationships.
The lexical hierarchy (MeSH) is important.
Future work: a new collection of disease/treatment data; different entities/relations; unsupervised learning to discover relation types.
We have addressed the problem of distinguishing between several different relations that can hold between two semantic entities, a difficult and important task in natural language understanding. Because there is no existing gold standard for this problem, we developed the relation definitions; this, however, may not be an exhaustive or fully representative list, and in future work we plan to assess additional relations. It is unclear at this time whether this approach will work on other types of text; the technical nature of bioscience text may lend itself well to this type of analysis. However, very useful, real-world applications could be developed if we were able to do natural language understanding effectively, if only in the biomedical domain.

Thank you! Barbara Rosario Marti Hearst SIMS, UC Berkeley http://biotext.berkeley.edu

Additional slides

Smoothing: absolute discounting
Lower the probability of seen events by subtracting a constant from their count: starting from the ML estimate $P_{ML}(x) = c(x)/N$, each seen event gets $P(x) = (c(x) - \delta)/N$. The remaining probability mass is evenly divided among the unseen events.
We experimented with different values for the smoothing factor, ranging from a minimum of 0.0000005 to a maximum of 10; the results reported fix the smoothing factor at its minimum value.
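A minimal sketch of this scheme (our illustration, not the paper's code), following the description above: subtract a constant delta from each seen count and spread the freed mass evenly over the unseen events.

```python
from collections import Counter

def absolute_discount(counts: Counter, vocab, delta=5e-7):
    """counts: observed event counts; vocab: the full event space."""
    n = sum(counts.values())
    seen = [e for e in vocab if counts[e] > 0]
    unseen = [e for e in vocab if counts[e] == 0]
    # Discounted probability for seen events.
    probs = {e: (counts[e] - delta) / n for e in seen}
    if unseen:
        leftover = delta * len(seen) / n   # mass removed from seen events
        for e in unseen:
            probs[e] = leftover / len(unseen)
    return probs   # probabilities sum to 1 over vocab
```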

F-measures for role extraction as a function of the smoothing factor (figure; the larger points on the left show the results when no smoothing was applied). We found that, for the dynamic models, a wide range of smoothing factors achieved almost identical results; nevertheless, in future work we plan to implement cross-validation to find the optimal smoothing factor. By contrast, the static models were more sensitive to the value of the smoothing factor, especially for the role extraction task.

Relation classification accuracies as a function of the smoothing factor (figure).

Role Extraction: Results
The static models come closer to the dynamic ones for role extraction than for relation classification (note: no neural networks here). The decreases in performance from D1 to S1 and from D2 to S2 are, respectively (in the "only relevant", "only features" case), 7.4% and 7.3% for role extraction versus 27.1% and 44% for relation classification. This suggests the importance of modeling the sequence of roles for relation classification.

Features impact: Role Extraction (with averages)
Most important features: 1) Word, 2) MeSH
F-measures (rel. + irrel.):

                D1              D2              Average
All features    0.67            0.71
No word         0.58 (-13.4%)   0.61 (-14.1%)   -13.7%
No MeSH         0.63 (-5.9%)    0.65 (-8.4%)    -7.2%

(For all the other features the decrease was negligible.)
Note: MeSH = MeSH ID + domain knowledge

Features impact: Role extraction (only rel.)
Most important features: 1) Word, 2) MeSH
F-measures:

                D1             D2             Average
All features    0.72           0.73
No word         0.65 (-9.7%)   0.66 (-9.6%)   -9.6%
No MeSH         0.69 (-4.2%)   0.69 (-5.5%)   -4.8%

(For all the other features the decrease was negligible.)
Note: MeSH = MeSH ID + domain knowledge

Features impact: Relation classification (with averages)
Most important features: roles
Accuracy (rel. + irrel.):

                           D1              D2              NN              Avg.
All feat. + roles          91.6            82.0            96.9
All feat. - roles          68.9 (-24.7%)   74.9 (-8.7%)    79.6 (-17.8%)   -17.1%
All feat. + roles - Word   91.6 (0%)       79.8 (-2.8%)    96.4 (-0.5%)    -1.1%
All feat. + roles - MeSH   91.6 (0%)       84.6 (+3.1%)    97.3 (+0.4%)    +1.1%

(For all the other features the decrease was negligible.)
When the roles are known, the other features have very little impact (a possible point of comparison with the many systems that assume the roles are given).

Features impact: Relation classification (with averages)
Most realistic case: roles not known
Most important features: 1) MeSH, 2) Word for D1 and NN (but vice versa for D2)
Accuracy (rel. + irrel.):

                           D1              D2              NN              Avg.
All feat. - roles          68.9            74.9            79.6
All feat. - roles - Word   66.7 (-3.3%)    66.1 (-11.8%)   76.2 (-4.3%)    -6.4%
All feat. - roles - MeSH   62.7 (-9.1%)    72.5 (-3.2%)    74.1 (-6.9%)    -6.4%

(For all the other features the decrease was negligible.)
Note: MeSH = MeSH ID + domain knowledge