Natural Language Processing in Bioinformatics: Uncovering Semantic Relations
Barbara Rosario, SIMS, UC Berkeley
Outline of Talk
Goal: Extract semantics from text
- Information and relation extraction
- Protein-protein interactions
Text Mining
Text mining is the discovery by computers of new, previously unknown information via the automatic extraction of information from text.
Text Mining
Text:
- Stress is associated with migraines
- Stress can lead to loss of magnesium
- Calcium channel blockers prevent some migraines
- Magnesium is a natural calcium channel blocker
Step 1: Extract semantic entities from text
Text Mining
Step 1: Extract semantic entities from the text above: Stress, Migraine, Magnesium, Calcium channel blockers
Text Mining (cont.)
Step 2: Classify relations between entities
- Stress -> Migraine: associated with
- Stress -> Magnesium: lead to loss
- Calcium channel blockers -> Migraine: prevent
- Magnesium -> Calcium channel blockers: subtype-of (is a)
Text Mining (cont.)
Step 3: Do reasoning over the extracted relations: find new correlations
Text Mining (cont.)
Step 4: Do reasoning: infer causality
Inferred: stress leads to loss of magnesium; deficiency of magnesium means no prevention of migraine; hence a possible causal path from stress to migraine.
My Research
Information extraction: identify the entities (Stress, Migraine, Magnesium, Calcium channel blockers) in text such as:
- Stress is associated with migraines
- Stress can lead to loss of magnesium
- Calcium channel blockers prevent some migraines
- Magnesium is a natural calcium channel blocker
My Research
Relation extraction: classify the relations between the entities: associated with, lead to loss, prevent, subtype-of (is a)
Information and Relation Extraction
Problems, given biomedical text:
- Find all the treatments and all the diseases
- Find the relations that hold between them (cure? prevent? side effect?)
Hepatitis Examples
- Cure: "These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135."
- Prevent: "A two-dose combined hepatitis A and B vaccine would facilitate immunization programs."
- Vague: "Effect of interferon on hepatitis B"
Two Tasks
- Relation extraction: identify the several semantic relations that can occur between the entities disease and treatment in bioscience text
- Information extraction (IE), a related problem: identify such entities
Outline of IE
- Data and semantic relations
- Quick intro to graphical models
- Models and results
- Features
- Conclusions
Data and Relations
- MEDLINE abstracts and titles
- 3662 sentences labeled
  - Relevant: 1724
  - Irrelevant: 1771 (e.g., "Patients were followed up for 6 months")
- 2 entity types: treatment and disease
- 7 relationships between these entities
- The labeled data are available at http://biotext.berkeley.edu
Semantic Relationships
- 810: Cure ("Intravenous immune globulin for recurrent spontaneous abortion")
- 616: Only Disease ("Social ties and susceptibility to the common cold")
- 166: Only Treatment ("Flucticasone propionate is safe in recommended doses")
- 63: Prevent ("Statins for prevention of stroke")
Semantic Relationships (cont.)
- 36: Vague ("Phenylbutazone and leukemia")
- 29: Side Effect ("Malignant mesodermal mixed tumor of the uterus following irradiation")
- 4: Does NOT cure ("Evidence for double resistance to permethrin and malathion in head lice")
Outline of IE
- Data and semantic relations
- Quick intro to graphical models
- Models and results
- Features
- Conclusions
Graphical Models
- Unifying framework for developing machine learning algorithms
- Graph theory plus probability theory
- Widely used:
  - Error-correcting codes
  - Systems diagnosis
  - Computer vision
  - Filtering (Kalman filters)
  - Bioinformatics
(Quick Intro to) Graphical Models
- Nodes are random variables
- Edges are annotated with conditional probabilities
- Absence of an edge between nodes implies conditional independence
- A "probabilistic database"
(figure: a directed graph over nodes A, B, C, D)
Graphical Models
(figure: DAG with edges A -> B, A -> C, D -> C)
Define a joint probability distribution:
P(X1, ..., XN) = ∏_i P(Xi | Par(Xi))
P(A, B, C, D) = P(A) P(D) P(B|A) P(C|A,D)
Learning: given data, estimate P(A), P(B|A), P(D), P(C|A,D)
Graphical Models
Define a joint probability distribution:
P(X1, ..., XN) = ∏_i P(Xi | Par(Xi))
P(A, B, C, D) = P(A) P(D) P(B|A) P(C|A,D)
Learning: given data, estimate P(A), P(B|A), P(D), P(C|A,D)
Inference: compute conditional probabilities, e.g., P(A | B, D)
Inference = probabilistic queries; general inference algorithms exist (e.g., the junction tree algorithm)
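The factorization and inference steps on this slide can be sketched in a few lines. This is an illustrative example, not code from the talk; the network shape matches the slide (A -> B, {A, D} -> C), but every probability value below is invented for demonstration.

```python
# Sketch: the slide's network A -> B, {A, D} -> C as conditional probability
# tables, with inference by brute-force enumeration. All values are made up.
from itertools import product

P_A = {True: 0.3, False: 0.7}
P_D = {True: 0.6, False: 0.4}
P_B_given_A = {True: {True: 0.8, False: 0.2},
               False: {True: 0.1, False: 0.9}}        # P_B_given_A[a][b]
P_C_given_AD = {                                       # P_C_given_AD[(a, d)][c]
    (True, True): {True: 0.9, False: 0.1},
    (True, False): {True: 0.5, False: 0.5},
    (False, True): {True: 0.4, False: 0.6},
    (False, False): {True: 0.05, False: 0.95},
}

def joint(a, b, c, d):
    """P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D), as on the slide."""
    return P_A[a] * P_D[d] * P_B_given_A[a][b] * P_C_given_AD[(a, d)][c]

# Sanity check: the joint sums to 1 over all assignments.
total = sum(joint(a, b, c, d) for a, b, c, d in product([True, False], repeat=4))

# Inference by enumeration: P(A=true | B=true, D=true), marginalizing out C.
num = sum(joint(True, True, c, True) for c in [True, False])
den = sum(joint(a, True, c, True) for a in [True, False] for c in [True, False])
posterior = num / den
```

Enumeration is exponential in the number of variables; the junction tree algorithm mentioned on the slide is what makes inference practical on larger networks.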
Naïve Bayes Models
- Simple graphical model: the features xi depend on the class Y
- Naïve Bayes assumption: all xi are independent given Y
- Currently used for text classification and spam detection
(figure: Y with child feature nodes x1, x2, x3)
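The Naïve Bayes assumption on this slide means P(Y | x) is proportional to P(Y) times the product of P(xi | Y). A minimal sketch of the spam-detection use case mentioned above, with a tiny made-up corpus and add-one smoothing (not the talk's classifier):

```python
# Hedged sketch of Naive Bayes text classification: Y generates independent
# word features, so we score each class by log P(Y) + sum_i log P(x_i | Y).
import math
from collections import Counter, defaultdict

docs = [("buy cheap pills now", "spam"),
        ("meeting agenda attached", "ham"),
        ("cheap meds buy now", "spam"),
        ("project meeting notes", "ham")]

class_counts = Counter(label for _, label in docs)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in docs:
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    scores = {}
    for y in class_counts:
        logp = math.log(class_counts[y] / len(docs))        # log P(Y)
        total = sum(word_counts[y].values())
        for w in text.split():                               # + sum log P(x_i | Y)
            # add-one smoothing so unseen words do not zero out the score
            logp += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        scores[y] = logp
    return max(scores, key=scores.get)
```

Despite the (usually false) independence assumption, this kind of model is a strong baseline for text classification.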
Dynamic Graphical Models
- Graphical model composed of repeated segments
- HMMs (Hidden Markov Models): POS tagging, speech recognition, IE
(figure: a chain of hidden tag nodes t1 ... tN, each emitting a word wi)
HMMs
Joint probability distribution:
P(t1, ..., tN, w1, ..., wN) = P(t1) ∏_i P(ti | ti-1) P(wi | ti)
Estimate P(t1), P(ti | ti-1), P(wi | ti) from labeled data
HMMs
Joint probability distribution:
P(t1, ..., tN, w1, ..., wN) = P(t1) ∏_i P(ti | ti-1) P(wi | ti)
Estimate P(t1), P(ti | ti-1), P(wi | ti) from labeled data
Inference: P(ti | w1, w2, ..., wN)
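A standard way to decode an HMM like the one above is the Viterbi algorithm, which finds the most likely tag sequence for a word sequence. This sketch uses toy treatment/disease/none parameters invented for illustration; the talk does not specify its decoding algorithm here.

```python
# Viterbi sketch for the HMM on the slide: dynamic programming over
# V[i][t] = probability of the best tag sequence ending in tag t at position i.
def viterbi(words, tags, start_p, trans_p, emit_p):
    # Unseen words get a tiny emission probability instead of zero.
    V = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({}); back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: V[i - 1][p] * trans_p[p][t])
            V[i][t] = (V[i - 1][best_prev] * trans_p[best_prev][t]
                       * emit_p[t].get(words[i], 1e-6))
            back[i][t] = best_prev
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):   # follow backpointers
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy parameters (invented): uniform-ish transitions, peaked emissions.
tags = ["treatment", "disease", "none"]
start_p = {"treatment": 0.3, "disease": 0.3, "none": 0.4}
trans_p = {t: {"treatment": 0.3, "disease": 0.3, "none": 0.4} for t in tags}
emit_p = {"treatment": {"vaccine": 0.6},
          "disease": {"hepatitis": 0.6},
          "none": {"the": 0.5, "prevents": 0.4}}
```

In practice one works in log space to avoid underflow on long sequences; the multiplication form above mirrors the slide's joint probability directly.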
Graphical Models for IE
Different dependencies between the feature nodes and the relation node
(figure: five model variants: dynamic models D1, D2, D3 and static models S1, S2)
Graphical Model
- Relation node: the semantic relation (cure, prevent, none, ...) expressed in the sentence
- The relation node generates the state sequence and the observations
Graphical Model
- Markov sequence of states (roles)
- Role nodes: Role_t ∈ {treatment, disease, none}
(figure: chain of role nodes Role_{t-1} -> Role_t -> Role_{t+1})
Graphical Model
- Roles generate multiple observations
- Feature nodes (observed): word, POS, MeSH, ...
Graphical Model
Inference: find the relation and the roles given the observed features
Features
- Word
- Part of speech
- Phrase constituent
- Orthographic features: 'is number', 'all letters are capitalized', 'first letter is capitalized', ...
- Semantic features (MeSH)
MeSH
MeSH Tree Structures:
1. Anatomy [A]
2. Organisms [B]
3. Diseases [C]
4. Chemicals and Drugs [D]
5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
6. Psychiatry and Psychology [F]
7. Biological Sciences [G]
8. Physical Sciences [H]
9. Anthropology, Education, Sociology and Social Phenomena [I]
10. Technology and Food and Beverages [J]
11. Humanities [K]
12. Information Science [L]
13. Persons [M]
14. Health Care [N]
15. Geographic Locations [Z]
MeSH (cont.)
1. Anatomy [A]
  - Body Regions [A01] +
  - Musculoskeletal System [A02]
  - Digestive System [A03] +
  - Respiratory System [A04] +
  - Urogenital System [A05] +
  - Endocrine System [A06] +
  - Cardiovascular System [A07] +
  - Nervous System [A08] +
  - Sense Organs [A09] +
  - Tissues [A10] +
  - Cells [A11] +
  - Fluids and Secretions [A12] +
  - Animal Structures [A13] +
  - Stomatognathic System [A14]
  (...)
Body Regions [A01]
  - Abdomen [A01.047]
    - Groin [A01.047.365]
    - Inguinal Canal [A01.047.412]
    - Peritoneum [A01.047.596] +
    - Umbilicus [A01.047.849]
  - Axilla [A01.133]
  - Back [A01.176] +
  - Breast [A01.236] +
  - Buttocks [A01.258]
  - Extremities [A01.378] +
  - Head [A01.456] +
  - Neck [A01.598]
  (...)
Use of Lexical Hierarchies in NLP
- Big problem in NLP: a few words occur a lot, most occur very rarely (Zipf's law), which makes it difficult to do statistics
- One solution: use lexical hierarchies and compute statistics on classes of words instead of individual words
- Another example of such a hierarchy: WordNet
Mapping Words to MeSH Concepts
- "headache pain" -> C23.888.592.612.441, G11.561.796.444
  - truncated one level: C23.888 [Neurologic Manifestations], G11.561 [Nervous System Physiology]
  - truncated further: C23 [Pathological Conditions, Signs and Symptoms], G11 [Musculoskeletal, Neural, and Ocular Physiology]
- "headache recurrence" -> C23.888.592.612.441, C23.550.291.937
- "breast cancer cells" -> A01.236, C04, A11
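The generalization step above exploits the fact that MeSH tree numbers are dot-separated paths: truncating a code walks up the hierarchy to a broader class. A small sketch (the codes come from the slide; the helper name is mine):

```python
# Sketch: generalize a MeSH tree code by truncating its dot-separated path,
# trading a rare specific concept for a frequent broad class.
def generalize(code, level):
    """Keep the first `level` components of a MeSH tree code."""
    return ".".join(code.split(".")[:level])

headache = "C23.888.592.612.441"
broad = generalize(headache, 2)      # C23.888, Neurologic Manifestations
broader = generalize(headache, 1)    # C23, Pathological Conditions, Signs and Symptoms
```

This is what lets the models gather statistics over classes of words rather than over individual rare words.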
Graphical Model
- Joint probability distribution over the relation, role, and feature nodes
- Parameters estimated with maximum likelihood and absolute discounting smoothing
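Absolute discounting, named on the slide, subtracts a fixed amount from each observed count and redistributes the freed mass to unseen outcomes. A minimal sketch under my own assumptions (the discount value and the uniform redistribution over unseen words are mine; the talk does not give its exact scheme):

```python
# Hedged sketch of absolute-discounting smoothing: subtract a fixed discount d
# from each seen count, then spread the freed probability mass uniformly over
# unseen outcomes. Real implementations typically back off to a lower-order
# distribution instead of a uniform one; that refinement is omitted here.
from collections import Counter

def absolute_discount(counts, vocab, d=0.5):
    """Return a smoothed distribution over `vocab` from raw `counts`."""
    total = sum(counts.values())
    seen = [w for w in vocab if counts[w] > 0]
    unseen = [w for w in vocab if counts[w] == 0]
    freed = d * len(seen) / total          # mass freed by discounting
    p = {w: (counts[w] - d) / total for w in seen}
    for w in unseen:
        p[w] = freed / len(unseen)
    return p
```

Without smoothing, any maximum-likelihood estimate assigns zero probability to feature values never seen with a given role or relation, which would zero out whole sentences at inference time.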
Graphical Model
Inference: find the relation and the roles given the observed features
Relation Extraction
- Results in terms of classification accuracy (with and without irrelevant sentences)
- 2 cases: roles given; roles hidden (only features)
Relation Classification: Results
- Good results for a difficult task
- One of the few systems to tackle several DIFFERENT relations between the same types of entities; thus differs from the problem statement of other work on relations

Accuracy:
Sentences       Input          Baseline   GM (D2)
Only relevant   only features  46.7       72.6
Only relevant   roles given               76.6
Rel. + irrel.   only features  50.6       74.9
Rel. + irrel.   roles given               82.0
Role Extraction: Results
- Junction tree algorithm
- F-measure = (2 * Precision * Recall) / (Precision + Recall)
- (Related work extracting "diseases" and "genes" reports an F-measure of 0.50)

Sentences       F-measure
Only relevant   0.73
Rel. + irrel.   0.71
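The F-measure on this slide is the harmonic mean of precision and recall, computed here from true-positive, false-positive, and false-negative counts (the counts in the example call are illustrative, not the talk's):

```python
# The slide's F-measure, spelled out from raw counts.
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)   # fraction of predicted roles that are correct
    recall = tp / (tp + fn)      # fraction of true roles that are found
    return 2 * precision * recall / (precision + recall)
```

When precision equals recall, the F-measure equals both, e.g. f_measure(73, 27, 27) gives 0.73.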
Features Impact: Role Extraction
Most important features: 1) word, 2) MeSH

Feature set    Rel. + irrel.    Only rel.
All features   0.71             0.73
No word        0.61 (-14.1%)    0.66 (-9.6%)
No MeSH        0.65 (-8.4%)     0.69 (-5.5%)
Features Impact: Relation Classification (rel. + irrel.)
Most important feature: roles

                              Accuracy
All features + roles          82.0
All features - roles          74.9 (-8.7%)
All features + roles - Word   79.8 (-2.8%)
All features + roles - MeSH   84.6 (+3.1%)
Features Impact: Relation Classification (rel. + irrel.)
Most realistic case: roles not known
Most important features: 1) word, 2) MeSH

                              Accuracy
All features - roles          74.9
All features - roles - Word   66.1 (-11.8%)
All features - roles - MeSH   72.5 (-3.2%)
Conclusions
- Classification of subtle semantic relations in bioscience text
- Graphical models for the simultaneous extraction of entities and relationships
- Importance of MeSH, a lexical hierarchy
Outline of Talk
Goal: Extract semantics from text
- Information and relation extraction
- Protein-protein interactions: using an existing database to gather labeled data
Protein-Protein Interactions
- One of the most important challenges in modern genomics, with many applications throughout biology
- There are several protein-protein interaction databases (BIND, MINT, ...), all manually curated
Protein-Protein Interactions
- Supervised systems require manually labeled data, while purely unsupervised ones have yet to prove effective for these tasks; other approaches include semi-supervised learning, active learning, and co-training
- We propose using resources developed in the biomedical domain to gather labeled data for the task of classifying interactions between proteins
HIV-1 Protein Interaction Database
Documents interactions between HIV-1 proteins and:
- host cell proteins
- other HIV-1 proteins
- diseases associated with HIV/AIDS
2224 pairs of interacting proteins, 65 interaction types
http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions
HIV-1 Protein Interaction Database

Protein 1   Protein 2   Paper IDs               Interaction Type
Tat, p14    AKT3        11156964, 11994280, ..  activates
AIP1        Gag, Pr55   14519844, ...           binds
Tat, p14    CDK2        9223324                 induces
Tat, p14    CDK2        7716549                 enhances
Tat, p14    CDK2        9525916                 downregulates
....
Most Common Interactions
(chart omitted: distribution of the most common interaction types)
Protein-Protein Interactions
Idea: use the database to "label data"
Example entry: Protein 1 = Tat, p14; Protein 2 = AKT3; interaction = activates; paper ID = 11156964
- Extract from the paper all the sentences mentioning Protein 1 and Protein 2
Protein-Protein Interactions
Idea: use the database to "label data"
Example entry: Protein 1 = Tat, p14; Protein 2 = AKT3; interaction = activates; paper ID = 11156964
- Extract from the paper all the sentences mentioning Protein 1 and Protein 2
- Label them with the interaction given in the database (activates)
Protein-Protein Interactions
Use citations: find all the papers that cite the papers in the database
Example entry: Tat, p14 / AKT3 / activates / paper 11156964, cited by papers 9918876 and 9971769
Protein-Protein Interactions
- From the citing papers, extract the citation sentences; from these, extract the sentences mentioning Protein 1 and Protein 2
- Label them with the interaction given in the database (activates)
Examples of Sentences
From the paper: "The interpretation of these results was slightly complicated by the fact that AIP-1/ALIX depletion by using siRNA likely had deleterious effects on cell viability, because a Western blot analysis showed slightly reduced Gag expression at later time points (fig. 5C)."
From citations: "They also demonstrate that the GAG protein from membrane-containing viruses, such as HIV, binds to Alix/AIP1, thereby recruiting the ESCRT machinery to allow budding of the virus from the cell surface (TARGET_CITATION; CITATION)."
10 Interaction Types
(chart omitted: the 10 interaction types used for classification)
Protein-Protein Interactions
Tasks: given sentences from a paper and/or citation sentences to it:
- Predict the interaction type given in the HIV database for that paper (a 10-way classification problem)
- Extract the proteins involved
Protein-Protein Interactions
Models:
- Dynamic graphical model
- Naïve Bayes
Graphical Models
(figure omitted: the graphical models used for interaction classification)
Evaluation
- Evaluation at the document level:
  - All (sentences from papers + citations)
  - Papers (only sentences from papers)
  - Citations (only citation sentences)
- "Trigger word" baseline: a list of keywords per interaction (e.g., for inhibits: "inhibitor", "inhibition", "inhibit", etc.); if a keyword is present, assign the corresponding interaction
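The trigger-word baseline above can be sketched in a few lines. The only keyword list the talk gives is the "inhibits" example; the other lists and the fallback label here are my own illustrative assumptions.

```python
# Sketch of the trigger-word baseline: assign the interaction whose keyword
# appears in the sentence. Keyword lists beyond "inhibits" are invented.
TRIGGERS = {
    "inhibits": ["inhibitor", "inhibition", "inhibit"],
    "activates": ["activator", "activation", "activate"],
    "binds": ["binding", "bind", "bound"],
}

def trigger_classify(sentence, default="unknown"):
    text = sentence.lower()
    for interaction, keywords in TRIGGERS.items():
        if any(k in text for k in keywords):
            return interaction
    return default
```

The weakness the results table exposes is visible even in this sketch: a sentence with no trigger word, or with trigger words for several interactions, cannot be classified reliably, which is why the learned models beat this baseline by a wide margin.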
Results
Accuracies on interaction classification (roles hidden):

Model              All    Papers   Citations
Markov model       60.5   57.8     53.4
Naïve Bayes        58.1   57.8     55.7
Baselines:
Most freq. inter.  21.8   11.1     26.1
TriggerW           20.1   24.4     20.4
TriggerW + BO      25.8   40.0     26.1
Results: Confusion Matrix
(matrix omitted) For All; overall accuracy: 60.5%
Hiding the Protein Names
Replaced protein names with the token PROT_NAME:
"Selective CXCR4 antagonism by Tat" -> "Selective PROT_NAME antagonism by PROT_NAME"
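The masking step above can be sketched with a simple substitution over a list of known protein names. The helper and the name list are mine; the slide only shows the before/after example.

```python
# Sketch: replace known protein names with PROT_NAME so the classifier cannot
# rely on memorizing specific proteins, matching the slide's example.
import re

def mask_proteins(sentence, protein_names):
    # Longest names first, so a short name is not matched inside a longer one.
    pattern = "|".join(re.escape(p)
                       for p in sorted(protein_names, key=len, reverse=True))
    return re.sub(pattern, "PROT_NAME", sentence)

masked = mask_proteins("Selective CXCR4 antagonism by Tat", ["CXCR4", "Tat"])
```

The accuracy drop on the next slide then measures how much the models were leaning on the protein names themselves rather than on the surrounding context.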
Results with No Protein Names

Model          Papers          Citations
Markov model   44.4 (-23.1%)   52.3 (-2.0%)
Naïve Bayes    46.7 (-19.2%)   53.4 (-4.1%)
Protein Extraction (Protein Name Tagging, Role Extraction)
The identification of all the proteins present in the sentence that are involved in the interaction:
"These results suggest that Tat-induced phosphorylation of serine 5 by CDK9 might be important after transcription has reached the +36 position, at which time CDK7 has been released from the complex. Tat might regulate the phosphorylation of the RNA polymerase II carboxyl-terminal domain in pre-initiation complexes by activating CDK7."
Protein Extraction: Results
(no dictionary used)

            Recall   Precision   F-measure
All         0.74     0.85        0.79
Papers      0.56     0.83        0.67
Citations   0.75     0.84        0.79
Conclusions of the Protein-Protein Interaction Project
- Encouraging results for the automatic classification of protein-protein interactions
- Use of an existing database for gathering labeled data
- Use of citations
Conclusion
Machine learning methods for NLP tasks; three lines of research in this area, with state-of-the-art results:
- Information and relation extraction for "treatments" and "diseases"
- Protein-protein interactions
- (Noun compounds)
Thank You!
Barbara Rosario
SIMS, UC Berkeley
rosario@sims.berkeley.edu