Classifying Semantic Relations in Bioscience Texts


1 Classifying Semantic Relations in Bioscience Texts
Barbara Rosario and Marti Hearst, SIMS, UC Berkeley. Supported by NSF DBI and a gift from Genentech.

2 Problem: Which relations hold between 2 entities?
Treatment -> Disease: Cure? Prevent? Side effect?

3 Hepatitis Examples
Cure: These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.
Prevent: A two-dose combined hepatitis A and B vaccine would facilitate immunization programs.
Vague: Effect of interferon on hepatitis B.

4 Two tasks
Relationship extraction: identify the several semantic relations that can occur between the entities disease and treatment in bioscience text.
Entity extraction: a related problem, identify such entities.

5 The Approach
Data: MEDLINE abstracts and titles.
Graphical models: combine relation and entity extraction in one framework; both static and dynamic models.
Simple discriminative approach: neural network.
Lexical, syntactic and semantic features.

6 Outline
Related work
Data and semantic relations
Features
Models and results
Conclusions

7 Several DIFFERENT Relations between the Same Types of Entities
This differs from the problem statement of other work on relations. Many find one relation which holds between two entities (many based on ACE):
Agichtein and Gravano (2000): lexical patterns for location-of
Zelenko et al. (2002): SVM for person-affiliation and organization-location
Hasegawa et al. (ACL 2004): Person-Organization -> "President" relation
Craven (1999, 2001): HMM for subcellular-location and disorder-association; doesn't identify the actual relation

8 Related work: Bioscience
Many hand-built rules: Feldman et al. (2002), Friedman et al. (2001), Pustejovsky et al. (2002), Saric et al. (this conference).
Craven (1999, 2001) considers positive examples to be all the sentences that simply contain the entities, rather than analyzing which relations hold between these entities.
Role extraction: Pustejovsky et al. use a rule-based system to extract entities in the inhibit-relation. Their experiments use MEDLINE sentences that contain verbal and nominal forms of the stem "inhibit"; the actual task performed is therefore the extraction of entities that are connected by some form of the stem "inhibit", which is potentially different from the extraction of entities in the inhibit-relation, since there may well be other ways to express this relation.

9 Data and Relations
MEDLINE abstracts and titles; 3662 sentences labeled relevant or irrelevant (1771 irrelevant, e.g. "Patients were followed up for 6 months").
2 types of entities, many instances: treatment and disease.
7 relationships between these entities.
An annotator with biology expertise looked at the titles and abstracts separately and labeled the sentences in both based solely on the content of the individual sentences. The labeled data is available at

10 Semantic Relationships
Cure (810): Intravenous immune globulin for recurrent spontaneous abortion
Only Disease (616): Social ties and susceptibility to the common cold
Only Treatment (166): Fluticasone propionate is safe in recommended doses
Prevent (63): Statins for prevention of stroke

11 Semantic Relationships
Vague (36): Phenylbutazone and leukemia
Side Effect (29): Malignant mesodermal mixed tumor of the uterus following irradiation
Does NOT Cure (4): Evidence for double resistance to permethrin and malathion in head lice

12 Features
Word
Part of speech
Phrase constituent
Orthographic features: 'is number', 'all letters are capitalized', 'first letter is capitalized', …
MeSH (semantic features): replace words, or sequences of words, with generalizations via MeSH categories, e.g. Peritoneum -> Abdomen
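As an illustration of the orthographic features listed above, here is a minimal sketch in Python (the function name, the exact checks, and the example token are illustrative, not the authors' feature code):

```python
def orthographic_features(token: str) -> dict:
    """Toy orthographic feature extractor; feature names are illustrative."""
    letters = [c for c in token if c.isalpha()]
    return {
        "is_number": token.isdigit(),
        "all_letters_capitalized": bool(letters) and all(c.isupper() for c in letters),
        "first_letter_capitalized": token[:1].isupper(),
    }

# orthographic_features("TJ-135")
# -> {'is_number': False, 'all_letters_capitalized': True, 'first_letter_capitalized': True}
```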

13 Features (cont.): MeSH
MeSH Tree Structures:
1. Anatomy [A]
2. Organisms [B]
3. Diseases [C]
4. Chemicals and Drugs [D]
5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
6. Psychiatry and Psychology [F]
7. Biological Sciences [G]
8. Physical Sciences [H]
9. Anthropology, Education, Sociology and Social Phenomena [I]
10. Technology and Food and Beverages [J]
11. Humanities [K]
12. Information Science [L]
13. Persons [M]
14. Health Care [N]
15. Geographic Locations [Z]
We also used a large domain-specific lexical hierarchy (MeSH) for generalization across classes of nouns. There are about 19,000 unique main terms in MeSH and 15 main sub-hierarchies (trees), each corresponding to a major branch of medical ontology; for example, tree A corresponds to Anatomy, tree C to Diseases, and so on. It is possible for multiple words to be mapped to the same concept, due to lexical ambiguity or simply to different ways of classifying the same concept. For this work, we simply retain the first MeSH mapping for each word.

14 Features (cont.): MeSH
1. Anatomy [A]
  Body Regions [A01]
  Musculoskeletal System [A02]
  Digestive System [A03] +
  Respiratory System [A04] +
  Urogenital System [A05] +
  Endocrine System [A06] +
  Cardiovascular System [A07] +
  Nervous System [A08] +
  Sense Organs [A09] +
  Tissues [A10] +
  Cells [A11] +
  Fluids and Secretions [A12] +
  Animal Structures [A13] +
  Stomatognathic System [A14]
  (…)
Body Regions [A01]
  Abdomen [A01.047]
    Groin
    Inguinal Canal
    Peritoneum +
    Umbilicus
  Axilla [A01.133]
  Back [A01.176] +
  Breast [A01.236] +
  Buttocks [A01.258]
  Extremities [A01.378] +
  Head [A01.456] +
  Neck [A01.598]
  (…)
The longer the MeSH term, the longer the path from the root of the hierarchy and the more precise the description. We can represent the MeSH terms at different levels of description; currently, we use the second level. This is somewhat arbitrary (and mainly chosen with the sparsity issue in mind), but in light of the importance of the MeSH features it may be worthwhile to investigate the optimal level of description. (This can be seen as another form of smoothing.)
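To make the level-2 generalization concrete, here is a minimal sketch (the function name and the deeper example code are illustrative; only the Peritoneum -> Abdomen style of mapping comes from the slides):

```python
def generalize_mesh(tree_number: str, level: int = 2) -> str:
    """Truncate a MeSH tree number to its first `level` components, so a
    descriptor deep under Abdomen generalizes to 'A01.047' (Abdomen)."""
    return ".".join(tree_number.split(".")[:level])

# generalize_mesh("A01.047")      -> "A01.047"  (already at level 2)
# generalize_mesh("A01.047.365")  -> "A01.047"  (hypothetical deeper code)
```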

15 Models 2 static generative models 3 dynamic generative models
1 discriminative model (neural networks)

16 Static Graphical Models
S1: observations dependent on the roles but independent of the relation, given the roles.
S2: observations dependent on both the relation and the roles.
In S1 the observations are independent of the relation (given the roles). In S2, the observations depend on both the relation and the role; in other words, the relation generates not only the sequence of roles but also the observations, encoding the fact that even when the roles are given, the observations depend on the relation. For example, sentences containing the word "prevent" are more likely to represent a Prevent kind of relationship.

17 Dynamic Graphical Models
D1, D2: as in S1, S2.
D3: only one observation per state is dependent on both the relation and the role.
In D1 the observations are independent of the relation (given the roles). In D2, the observations depend on both the relation and the role; the relation generates not only the sequence of roles but also the observations, encoding the fact that even when the roles are given, the observations depend on the relation. For example, sentences containing the word "prevent" are more likely to represent a Prevent kind of relationship. In D3 only one observation per state depends on both the relation and the role, the motivation being that some observations (such as the words) depend on the relation while others (for example, the parts of speech) might not. In the experiments reported here, the observations which have edges from both the role and the relation nodes are the words. (We ran an experiment in which this observation node was the MeSH term, obtaining similar results.)
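For concreteness, the difference between the models can be summarized by their emission terms; this is a reconstruction from the description above, not notation quoted from the paper (here r_t is the role of word t, R the relation, f_{t,k} the k-th feature of word t, and w_t its word observation):

```latex
\text{D1: } P(f_{t,k} \mid r_t) \qquad
\text{D2: } P(f_{t,k} \mid r_t, R) \qquad
\text{D3: } P(w_t \mid r_t, R)\ \text{and}\ P(f_{t,k} \mid r_t)\ \text{for the remaining features}
```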

18 Graphical Models
Relation node: the semantic relation (cure, prevent, none, …) expressed in the sentence.
The node labeled "Relation" represents the relationship present in the sentence. We assume here that there is a single relation for each sentence between the entities.

19 Graphical Models
Role nodes: 3 choices (treatment, disease, or none).
The nodes labeled "Role" represent the entities (in this case the choices are DISEASE, TREATMENT and NULL). There are as many role nodes as there are words in the sentence. The simpler static models do not assume an ordering in the role sequence. The dynamic models were inspired by prior work on HMM-like graphical models for role extraction: these models consist of a Markov sequence of states (usually corresponding to semantic roles) where each state generates one or multiple observations. They assume that there is an ordering in the semantic roles that can be captured with the Markov assumption, and that the role generates the observations (the words, for example). All our models make the additional assumption that there is a relation that generates the role sequence; thus, these models have the appealing property that they can simultaneously perform role extraction and relationship recognition, given the sequence of observations.

20 Graphical Models Feature nodes (observed): word, POS, MeSH…
Children of the role nodes are the words and their features, which are the only nodes observed. The task is to recover the sequence of role states and the relation, given the observed features.

21 Graphical Models
For dynamic model D1: the joint probability distribution is over the relation, role and feature nodes; parameters are estimated with maximum likelihood, with absolute discounting smoothing.
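The joint distribution itself was shown as a figure on the original slide; a plausible reconstruction for D1, assuming the roles form a Markov chain generated by the relation and each role independently emits its features, is:

```latex
P(R, r_{1:T}, f_{1:T}) \;=\;
P(R)\, P(r_1 \mid R) \prod_{t=2}^{T} P(r_t \mid r_{t-1}, R)
\prod_{t=1}^{T} \prod_{k=1}^{K} P(f_{t,k} \mid r_t)
```

Here R is the relation, r_t the role of word t, and f_{t,k} the k-th feature of word t; the paper's exact parameterization may differ in detail.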

22 Thompson et al. 2003 vs. our D1
Thompson et al. (2003): frame classification and role labeling for FrameNet sentences. Very similar models, with frame == relation. Their observation C is the head word and phrase type of the constituent (e.g., Anne/NP). The target word is observed: all sentences have a target from a fixed list (an easier setting). The inference procedure is also different, and there are more frames/relations (55) and more roles (117). For the task most similar to ours (identify constituents and classify them): frame classification 97.5% (but with the target given); role labeling accuracy 70.1% (not directly comparable to our F-measure).

23 Neural Networks
Feed-forward network (MATLAB), trained with conjugate gradient descent.
One hidden layer (hyperbolic tangent activation); logistic sigmoid function for the output layer representing the relationships.
Same features as for the graphical models; a discriminative approach.
The number of units in the output layer is the number of relations (eight or nine) and is therefore fixed. The network was trained for several choices of numbers of hidden units; we chose the best-performing networks based on training-set error for each of the models, and then tested these networks on held-out data.
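A minimal sketch of the forward pass just described (the original implementation was in MATLAB; the layer sizes and weight shapes below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Feed-forward pass: one tanh hidden layer, sigmoid output units,
    one output unit per relation (eight or nine in the experiments)."""
    h = np.tanh(W1 @ x + b1)      # hidden layer, hyperbolic tangent activation
    return sigmoid(W2 @ h + b2)   # output layer, logistic sigmoid per relation

# Illustrative shapes: 500 input features, 20 hidden units, 8 relations.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(20, 500)), np.zeros(20)
W2, b2 = rng.normal(size=(8, 20)), np.zeros(8)
scores = forward(rng.normal(size=500), W1, b1, W2, b2)   # one score per relation
```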

24 Relation extraction
Results are in terms of classification accuracy (with and without irrelevant sentences), for 2 cases: roles hidden and roles given.
Graphical models vs. NN (for the NN this is a simple classification problem).
We ran the experiments for two cases: "roles given", where the true semantic roles are given and used as input for classification along with the observable features, and "only features", a more realistic case in which the true roles are hidden and we classify the relations given only the observable features. For the graphical models, when the roles are hidden they are marginalized over.
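Under the D1 factorization sketched earlier, marginalizing over the hidden roles amounts to the following decision rule (a reconstruction consistent with the description, not a formula quoted from the paper):

```latex
\hat{R} \;=\; \arg\max_{R}\; P(R) \sum_{r_{1:T}}
P(r_1 \mid R) \prod_{t=2}^{T} P(r_t \mid r_{t-1}, R)
\prod_{t=1}^{T} \prod_{k=1}^{K} P(f_{t,k} \mid r_t)
```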

25 Relation classification: Results
Neural net always best.
The NN always outperforms the graphical models. Two possible reasons for this are that the discriminative approach may be the most appropriate for fully labeled data, or that the graphical models we proposed may not be the right ones, i.e., the independence assumptions they make may misrepresent underlying dependencies. It must be pointed out that the neural network is much slower than the graphical models and requires a great deal of memory.

26 Relation classification: Results
With no smoothing, D1 is the best graphical model.
Graphical model D1 outperforms the other dynamic models when no smoothing is applied; this was expected, since the parameters of models D2 and D3 are sparser than those of D1.

27 Relation classification: Results
With smoothing and no roles, D2 is the best graphical model.
When smoothing is applied and the true roles are hidden (the most realistic case), D2 achieves the best classification accuracies.

28 Relation classification: Results
With smoothing and roles given, D1 is the best graphical model.
When the roles are given, D1 is the best model. D1 does well in the cases in which not both roles are present. By contrast, D2 does better than D1 when the presence of specific words strongly determines the outcome (e.g., the presence of "prevention" or "prevent" helps identify the Prevent relation).

29 Relation classification: Results
Dynamic models always outperform static ones.
Note also that the dynamic models always outperform the static ones by a large margin.

30 Relation classification: Confusion Matrix
Computed for model D2, "rel. + irrel.", "only features".
To provide an idea of where the errors occur, this table shows the confusion matrix for model D2 for the most realistic and difficult case of "rel. + irrel.", "only features". It indicates that the algorithm performs poorly primarily for the cases with little training data, with the exception of the ONLY DISEASE case, which is often mistaken for CURE.

31 Role extraction
Results are in terms of F-measure. (Graphical models can do role extraction and relationship classification simultaneously.)
Graphical models: junction tree algorithm (BNT); the relation is hidden and marginalized over.
NN: couldn't run it (feature vectors too large).
We use a strict evaluation: every token is assessed (for example, even punctuation must be associated with the appropriate entity) and we do not assign partial credit for constituents for which only some of the words are correctly labeled.
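A minimal sketch of a strict per-token scoring in this spirit (the label names and any details beyond what the slide states are assumptions):

```python
def strict_token_f1(gold, pred, null_label="NONE"):
    """Token-level F-measure: every token counts (punctuation included)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != null_label)
    n_pred = sum(1 for p in pred if p != null_label)
    n_gold = sum(1 for g in gold if g != null_label)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# strict_token_f1(["TREAT", "NONE", "DIS"], ["TREAT", "DIS", "DIS"])  -> 0.8
```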

32 Role Extraction: Results
F-measures: D1 is best when no smoothing is applied.
Again, model D1 outperforms the other dynamic models when no smoothing is applied.

33 Role Extraction: Results
F-measures: D2 is best with smoothing, but smoothing doesn't boost scores as much as in relation classification.
When smoothing is applied, model D2 achieves the best F-measures; note, however, that the three dynamic models achieve similar results. The percentage improvements of D2 and D3 versus D1 are, respectively, 10% and 6.5% for relation classification and 1.4% for role extraction (in the "only relevant", "only features" case). This suggests that there is a dependency between the observations and the relation that is captured by the additional edges in D2 and D3, but that this dependency is more helpful in relation classification than in role extraction.

34 Features impact: Role Extraction
Most important features: 1) word, 2) MeSH
Change vs. all features (rel. + irrel.): no word: D1 -13.4%, D2 …; no MeSH: D1 -5.9%, D2 …
(For all the other features the decrease was negligible.)
Note: MeSH = MeSH ID + domain knowledge

35 Features impact: Relation classification
Most important features: the roles.
Accuracy change (rel. + irrel.):
All feat. + roles: baseline
All feat. - roles: D1 -24.7%, D2 -8.7%, NN -17.8%
All feat. + roles - word: D1 0%, D2 …, NN -0.5%
All feat. + roles - MeSH: D1 0%, D2 …, NN …
(For all the other features the decrease was negligible.)

36 Features impact: Relation classification
Most realistic case: roles not known.
Most important features: 1) MeSH, 2) word for D1 and NN (vice versa for D2).
Accuracy change (rel. + irrel.):
All feat. - roles: baseline
All feat. - roles - word: D1 -3.3%, D2 -11.8%, NN -4.3%
All feat. - roles - MeSH: D1 -9.1%, D2 -3.2%, NN -6.9%
(For all the other features the decrease was negligible.)
Note: MeSH = MeSH ID + domain knowledge

37 Conclusions
Classification of subtle semantic relations in bioscience text.
A discriminative model (neural network) achieves high classification accuracy.
Graphical models allow the simultaneous extraction of entities and relationships.
The lexical hierarchy (MeSH) is important.
Future work: a new collection of disease/treatment data; different entities/relations; unsupervised learning to discover relation types.
We have addressed the problem of distinguishing between several different relations that can hold between two semantic entities, a difficult and important task in natural language understanding. Because there is no existing gold standard for this problem, we developed the relation definitions; this may not be an exhaustive or fully representative list, and in the future we plan to assess additional relations. It is unclear at this time whether this approach will work on other types of text; the technical nature of bioscience text may lend itself well to this type of analysis. However, very useful, real-world applications could be developed if we were able to do natural language understanding effectively, even if only in the biomedical domain.

38 Thank you! Barbara Rosario and Marti Hearst, SIMS, UC Berkeley

39 Additional slides

40 Smoothing: absolute discounting
Lower the probability of seen events by subtracting a constant from their counts (the ML estimate being count(x)/N); the remaining probability mass is divided evenly among the unseen events.
We experimented with different values of the smoothing factor, ranging from a minimum value to a maximum of 10; the results reported fix the smoothing factor at its minimum value.
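A minimal sketch of absolute discounting as described above (the value of the discount constant and the function shape are illustrative, not the paper's settings):

```python
def absolute_discounting(counts, vocab, delta=0.1):
    """Subtract a constant `delta` from every seen count and divide the
    freed probability mass evenly among the unseen events."""
    total = sum(counts.values())
    seen = {x for x, c in counts.items() if c > 0}
    unseen = [x for x in vocab if x not in seen]
    freed_mass = delta * len(seen) / total
    probs = {}
    for x in vocab:
        if x in seen:
            probs[x] = (counts[x] - delta) / total
        else:
            # split the freed mass evenly (if everything was seen, nothing to add)
            probs[x] = freed_mass / len(unseen) if unseen else 0.0
    return probs

# absolute_discounting({"cure": 3, "prevent": 1}, ["cure", "prevent", "vague"])
# -> approximately {'cure': 0.725, 'prevent': 0.225, 'vague': 0.05}
```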

41 F-measures for role extraction as a function of the smoothing factor
The bigger points on the left are the results when no smoothing was applied.
We found that for the dynamic models, a wide range of smoothing factors gave almost identical results; nevertheless, in future work we plan to use cross-validation to find the optimal smoothing factor. By contrast, the static models were more sensitive to the value of the smoothing factor, especially for the role extraction task.

42 Relation accuracies as a function of the smoothing factor

43 Role Extraction: Results
Static models do relatively better (compared to the dynamic ones) for role extraction than for relation classification. Note: no neural networks for this task.
The decreases in performance from D1 to S1 and from D2 to S2 are, respectively (in the "only relevant", "only features" case), 7.4% and 7.3% for role extraction, versus 27.1% and 44% for relation classification. This suggests the importance of modeling the sequence of roles for relation classification.

44 Features impact: Role Extraction
Most important features: 1) word, 2) MeSH
Change vs. all features (rel. + irrel.): no word: D1 -13.4%, D2 …, average …; no MeSH: D1 -5.9%, D2 …, average …
(For all the other features the decrease was negligible.)
Note: MeSH = MeSH ID + domain knowledge

45 Features impact: Role extraction
Most important features: 1) word, 2) MeSH
F-measure change vs. all features: no word: D1 -9.7%, D2 …, average …; no MeSH: D1 -4.2%, D2 …, average …
(For all the other features the decrease was negligible.)
Note: MeSH = MeSH ID + domain knowledge
For rel+irrel. (only rel.)

46 Features impact: Role extraction
Most important features: 1) word, 2) MeSH
F-measure change vs. all features: no word: D1 -9.7%, D2 …; no MeSH: D1 -4.2%, D2 …
(For all the other features the decrease was negligible.)
Note: MeSH = MeSH ID + domain knowledge
For rel+irrel. (only rel.)

47 Features impact: Relation classification
Most important features: the roles.
Accuracy change (rel. + irrel.):
All feat. + roles: baseline
All feat. - roles: D1 -24.7%, D2 -8.7%, NN -17.8%, avg. -17.1%
All feat. + roles - word: D1 0%, D2 …, NN -0.5%, avg. -1.1%
All feat. + roles - MeSH: D1 0%, D2 …, NN …, avg. …
(For all the other features the decrease was negligible.)
When the roles are known, the other features have very little impact. (Perhaps comment on the (many) systems that do assume the roles are given?)

48 Features impact: Relation classification
Most realistic case: roles not known.
Most important features: 1) MeSH, 2) word for D1 and NN (vice versa for D2).
Accuracy change (rel. + irrel.):
All feat. - roles: baseline
All feat. - roles - word: D1 -3.3%, D2 -11.8%, NN -4.3%, avg. -6.4%
All feat. - roles - MeSH: D1 -9.1%, D2 -3.2%, NN -6.9%, avg. -6.4%
(For all the other features the decrease was negligible.)
Note: MeSH = MeSH ID + domain knowledge


Download ppt "Classifying Semantic Relations in Bioscience Texts"

Similar presentations


Ads by Google