Discovering Semantic Relations (for Proteins and Digital Devices) Barbara Rosario Intel Research.

1 Discovering Semantic Relations (for Proteins and Digital Devices) Barbara Rosario Intel Research

2 Outline Semantic relations –Protein-protein interactions (joint work with Marti Hearst) –Digital devices (joint work with Bill Schilit, Google and Oksana Yakhnenko, Iowa State University) Models to do text classification and information extraction Two new proposals for getting labeled data

3 Text mining Text Mining is the discovery by computers of new, previously unknown information, via automatic extraction of information from text Example: a (human) analysis of titles of articles in the biomedical literature suggested a role of magnesium deficiency in migraines [Swanson]

4 Text mining Text: –Stress is associated with migraines –Stress can lead to loss of magnesium –Calcium channel blockers prevent some migraines –Magnesium is a natural calcium channel blocker 1: Extract semantic entities from text

5 Text mining Text: –Stress is associated with migraines –Stress can lead to loss of magnesium –Calcium channel blockers prevent some migraines –Magnesium is a natural calcium channel blocker 1: Extract semantic entities from text: Stress, Migraine, Magnesium, Calcium channel blockers

6 Text mining (cont.) Text: –Stress is associated with migraines –Stress can lead to loss of magnesium –Calcium channel blockers prevent some migraines –Magnesium is a natural calcium channel blocker Entities: Stress, Migraine, Magnesium, Calcium channel blockers 2: Classify relations between entities: associated with, lead to loss, prevent, subtype-of (is a)

7 Text mining (cont.) Text: –Stress is associated with migraines –Stress can lead to loss of magnesium –Calcium channel blockers prevent some migraines –Magnesium is a natural calcium channel blocker Entities: Stress, Migraine, Magnesium, Calcium channel blockers Relations: associated with, lead to loss, prevent, subtype-of (is a) 3: Do reasoning: find new correlations
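
The reasoning step can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: the triples are read off the four example sentences by hand, and the helper names `build_graph` and `indirect_links` are hypothetical. The sketch looks for an entity that links two others which are never related directly, which is the shape of the magnesium and migraine example.

```python
from collections import defaultdict

# Relation triples read by hand from the four example sentences on the slide.
TRIPLES = [
    ("stress", "associated with", "migraine"),
    ("stress", "leads to loss of", "magnesium"),
    ("calcium channel blocker", "prevents", "migraine"),
    ("magnesium", "is a", "calcium channel blocker"),
]


def build_graph(triples):
    """Index the entities as an undirected graph; each edge keeps its relation."""
    graph = defaultdict(dict)
    for head, relation, tail in triples:
        graph[head][tail] = relation
        graph[tail][head] = relation
    return graph


def indirect_links(graph, source, target):
    """Return the intermediate entities B such that source-B and B-target are
    both stated, while source-target itself is never stated directly."""
    if target in graph[source]:
        return []  # already an explicit relation, nothing new to infer
    return sorted(b for b in graph[source] if target in graph[b])


graph = build_graph(TRIPLES)
bridges = indirect_links(graph, "magnesium", "migraine")
print(bridges)  # ['calcium channel blocker', 'stress']
```

Both bridging entities point to a possible magnesium and migraine connection that no single sentence states.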

8 Relations The identification and classification of semantic relations is crucial for the semantic analysis of text Protein-protein interactions Relations for digital devices

9 Protein-protein interactions Applications throughout biology There are several protein-protein interaction databases (BIND, MINT, ...), all manually curated Most biomedical research and new discoveries are available electronically, but only in free-text format. Automatic mechanisms are needed to convert text into more structured forms

10 Protein-protein interactions Supervised systems require manually labeled training data, while purely unsupervised systems have yet to be proven effective for these tasks. We propose using resources developed in the biomedical domain to address the problem of gathering labeled data for the task of classifying interactions between proteins

11 HIV-1, Protein interaction database “The goal of this project is to provide scientists a summary of all known interactions of HIV-1 proteins with host cell proteins, other HIV-1 proteins, or proteins from disease organisms associated with HIV/AIDS” There are 2224 interacting protein pairs and 51 types of interaction http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/

12 HIV-1, Protein interaction database
Protein 1 | Protein 2 | Interaction | Paper ID
10000 | 155871 | activates | 11156964
10015 | 155030 | binds | 14519844, …
1017 | 155871 | induces | 9223324
10197 | 155348 | degraded by | 10893419
…

13 Protein-protein interactions Idea: use this to “label data” Database record: Protein 1 = 10000, Protein 2 = 155871, Interaction = activates, Paper ID = 11156964 Extract from the paper all the sentences with Protein 1 and Protein 2 … Label them with the interaction given in the database

14 Protein-protein interactions Idea: use this to “label data” Database record: Protein 1 = 10000, Protein 2 = 155871, Interaction = activates, Paper ID = 11156964 Extract from the paper all the sentences with Protein 1 and Protein 2 … Label them with the interaction given in the database (here: activates)

15 Protein-protein interactions Use citations Find all the papers that cite the papers in the database Database record: Protein 1 = 10000, Protein 2 = 155871, Interaction = activates, Paper ID = 11156964 Citing papers: ID 9918876, ID 9971769

16 Protein-protein interactions From the citing papers, extract the citation sentences; from these extract the sentences with Protein 1 and Protein 2 Label them Database record: Protein 1 = 10000, Protein 2 = 155871, Interaction = activates, Paper ID = 11156964 Citing papers: ID 9918876, ID 9971769 Label: activates
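
A minimal Python sketch of this labeling idea follows. Several things are assumptions rather than part of the slides: the sentence splitter is deliberately naive, matching is done on plain substrings, and the names `ProtA` and `ProtB` are placeholders for whatever surface names the two database IDs map to. The same function could be applied to the text of the paper or to citation sentences.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentences(text, protein1_names, protein2_names, interaction):
    """Keep the sentences that mention both proteins and attach the
    interaction recorded in the database as their label."""
    labeled = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        has_p1 = any(name.lower() in lowered for name in protein1_names)
        has_p2 = any(name.lower() in lowered for name in protein2_names)
        if has_p1 and has_p2:
            labeled.append((sentence, interaction))
    return labeled


# Hypothetical text and protein names, for illustration only.
text = ("ProtA activates ProtB in infected cells. "
        "ProtB levels were measured. ProtA was also studied.")
examples = label_sentences(text, ["ProtA"], ["ProtB"], "activates")
```

Only the first sentence mentions both proteins, so it is the only one that receives the label. The labels are noisy by construction: a sentence can mention both proteins without expressing the interaction.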

17 Protein-protein interactions Task: Given the sentences extracted from paper ID and/or the citation sentences: Determine the interaction given in the HIV-1 database for paper ID Identify the proteins involved in the interaction (protein name tagging, or role extraction).
Interaction | Papers | Citances
Degrades | 60 | 63
Synergizes with | 86 | 101
Stimulates | 103 | 64
Binds | 98 | 324
Inactivates | 68 | 92
Interacts with | 62 | 100
Requires | 96 | 297
Upregulates | 119 | 98
Inhibits | 78 | 84
Suppresses | 51 | 99

18 The models (1) Naïve Bayes (NB) for interaction classification.
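
As an illustration of what a Naïve Bayes interaction classifier does, here is a small bag-of-words version with add-one smoothing. The slides do not give the features or the smoothing used, so both are assumptions, and the four training sentences are invented toy data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over lowercase whitespace tokens."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            words = sentence.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, sentence):
        # Words never seen in training are ignored.
        words = [w for w in sentence.lower().split() if w in self.vocab]
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                # Add-one (Laplace) smoothed log likelihood.
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy sentences, with protein names already replaced by placeholders.
train_sentences = [
    "PROT1 binds to PROT2",
    "PROT1 binding of PROT2 was observed",
    "PROT1 inhibits PROT2 expression",
    "PROT2 is inhibited by PROT1",
]
train_labels = ["binds", "binds", "inhibits", "inhibits"]
model = NaiveBayes().fit(train_sentences, train_labels)
```

With this toy data, `model.predict("PROT1 binds PROT2 tightly")` returns `"binds"` and `model.predict("PROT2 inhibited PROT1")` returns `"inhibits"`.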

19 The models (2) Dynamic graphical model (DM) for protein interaction classification (and role extraction).

20 Dynamic graphical models Graphical model composed of repeated segments HMMs (Hidden Markov Models) –POS tagging, speech recognition, IE [Figure: HMM with tag nodes t_N and word nodes w_N]

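21 HMMs Joint probability distribution –$P(t_1,\dots,t_N,w_1,\dots,w_N) = P(t_1)\,P(w_1 \mid t_1)\prod_{i=2}^{N} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$ Estimate $P(t_1)$, $P(t_i \mid t_{i-1})$, $P(w_i \mid t_i)$ from labeled data [Figure: HMM with tag nodes t_N and word nodes w_N]

22 HMMs Joint probability distribution –$P(t_1,\dots,t_N,w_1,\dots,w_N) = P(t_1)\,P(w_1 \mid t_1)\prod_{i=2}^{N} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$ Estimate $P(t_1)$, $P(t_i \mid t_{i-1})$, $P(w_i \mid t_i)$ from labeled data Inference: $P(t_1, t_2, \dots, t_N \mid w_1, w_2, \dots, w_N)$ [Figure: HMM with tag nodes t_N and word nodes w_N]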
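
The joint probability and the inference step can be made concrete with a small sketch. The two states, the probability tables and the three-word sentence are invented toy values, not estimates from the talk's data. Inference is done here with the standard Viterbi algorithm, which returns the single most probable tag sequence; the slide does not say which inference procedure was used. The emission of the first word, P(w_1 | t_1), is included explicitly.

```python
def joint_probability(tags, words, start, trans, emit):
    """P(t_1..t_N, w_1..w_N) = P(t_1) P(w_1|t_1) * prod_{i>=2} P(t_i|t_{i-1}) P(w_i|t_i)."""
    prob = start[tags[0]] * emit[tags[0]].get(words[0], 0.0)
    for prev, tag, word in zip(tags, tags[1:], words[1:]):
        prob *= trans[prev][tag] * emit[tag].get(word, 0.0)
    return prob


def viterbi(words, states, start, trans, emit):
    """Return the most probable tag sequence for the observed words."""
    # best[s] = (probability, path) of the best path that ends in state s.
    best = {s: (start[s] * emit[s].get(words[0], 0.0), [s]) for s in states}
    for word in words[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[p][0] * trans[p][s] * emit[s].get(word, 0.0), best[p][1])
                 for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy parameters (assumed for illustration).
states = ["PROTEIN", "OTHER"]
start = {"PROTEIN": 0.5, "OTHER": 0.5}
trans = {"PROTEIN": {"PROTEIN": 0.2, "OTHER": 0.8},
         "OTHER": {"PROTEIN": 0.5, "OTHER": 0.5}}
emit = {"PROTEIN": {"tat": 0.5, "cxcr4": 0.5},
        "OTHER": {"binds": 0.9, "tat": 0.05, "cxcr4": 0.05}}

words = ["tat", "binds", "cxcr4"]
best_tags = viterbi(words, states, start, trans, emit)
best_prob = joint_probability(best_tags, words, start, trans, emit)
```

For these toy tables the best sequence is PROTEIN, OTHER, PROTEIN, with joint probability 0.5 · 0.5 · 0.8 · 0.9 · 0.5 · 0.5 = 0.045.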

23 Graphical model for role and relation extraction –Markov sequence of states (roles) –States generate multiple observations –The relation generates the state sequence and the observations [Figure: graphical model with Interaction, Roles and Features nodes]

24 Analyzing the results Hiding the protein names: “Selective CXCR4 antagonism by Tat” becomes: “Selective PROT1 antagonism by PROT2” – To check whether the interaction types could be unambiguously determined by the protein names. Compare results with a trigger words approach
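
Hiding the protein names amounts to a simple substitution. The sketch below assumes the surface names of both proteins are known for each sentence; the function name `mask_proteins` is hypothetical.

```python
import re


def mask_proteins(sentence, protein1_names, protein2_names):
    """Replace every mention of the two proteins with PROT1 / PROT2."""

    def replace_all(text, names, placeholder):
        # Longest names first, so a short alias does not clobber a longer one.
        for name in sorted(names, key=len, reverse=True):
            pattern = r"\b" + re.escape(name) + r"\b"
            text = re.sub(pattern, placeholder, text, flags=re.IGNORECASE)
        return text

    masked = replace_all(sentence, protein1_names, "PROT1")
    return replace_all(masked, protein2_names, "PROT2")


masked = mask_proteins("Selective CXCR4 antagonism by Tat", ["CXCR4"], ["Tat"])
print(masked)  # Selective PROT1 antagonism by PROT2
```

A classifier trained on the masked sentences can only use the surrounding words, which is what makes the comparison in the results informative.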

25 Results: interaction classification
Classification accuracies
Model | All | Papers | Citances
DM | 60.5 | 57.8 | 53.4
NB | 58.1 | 57.8 | 55.7
No protein names: DM | 60.5 | 44.4 | 52.3
No protein names: NB | 59.7 | 46.7 | 53.4
Trigger words | 25.8 | 40.0 | 26.1
Baseline (most frequent interaction) | 21.8 | 11.1 | 26.1

26 Results: protein extraction
Data | Recall | Precision | F-measure
All | 0.74 | 0.85 | 0.79
Papers | 0.56 | 0.83 | 0.67
Citances | 0.75 | 0.84 | 0.79
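
The F-measure column can be checked from the other two. The slide does not say which F-measure is reported, so this assumes the balanced one (the harmonic mean of precision and recall), F = 2PR / (P + R).

```python
def f_measure(recall, precision):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs as reported on the slide.
rows = {"All": (0.74, 0.85), "Papers": (0.56, 0.83), "Citances": (0.75, 0.84)}
scores = {name: round(f_measure(r, p), 2) for name, (r, p) in rows.items()}
print(scores)  # {'All': 0.79, 'Papers': 0.67, 'Citances': 0.79}
```

The recomputed values match the reported column, which supports the assumption.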

27 Conclusions of protein-protein interaction project Difficult and important problem: the classification of (ten) different interaction types between proteins in text The dynamic graphical model DM can simultaneously perform protein name tagging and relation identification High accuracy on both problems (well above the baselines) The results obtained removing the protein names indicate that our models learn the linguistic context of the interactions. Found evidence supporting the hypothesis that citation sentences are a good source of training data, most likely because they provide a concise and precise way of summarizing facts in the bioscience literature. Use of a protein-interaction database to automatically gather labeled data for this task.

28 Relations for digital devices Identification of activities/relations between device pairs. What can you do with a given device pair? –Digital camera and TV –Media server and computer –Media server and wireless router –Toshiba laptop and wireless audio adapter –PC and DVR –TV and DVR

29 Looking for relations You can search the Web? –Google searches: TV DVR and PC DVR Current search engines find co-occurrence of query terms Often you need to find semantically related entities For text mining, inference and for search (IR)

30 Looking for relations You can search the Web? –Google searches: PC DVR and TV DVR You may want to see instead all the sentences in which the two devices are involved in an activity/relation, and get a sense of what you can do with these devices Activities_between(PC DVR) –From which you learn, for example, that »Can build a Better DVR out of an Old PC »Any modern Windows PC can be used for DVR duty Activities_between(TV DVR) –From which you learn, for example, that »DVR allows you to pause live TV »Can watch Google Satellite TV through your "internet ready" Google DVR

31 Looking for relations We can frame this problem as a classification problem: Given a sentence containing two digital devices, is there a relation between them expressed in the sentence or not?
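
To turn such a sentence into classifier input, one simple option is to use the words that occur between the two device mentions. The slides do not say which features the system used, so this is only an assumed baseline feature, and `between_words` is a hypothetical helper.

```python
def between_words(sentence, device_a, device_b):
    """Return the lowercase words between the first mentions of the two
    devices, or None if either device is missing from the sentence."""
    lowered = sentence.lower()
    pos_a = lowered.find(device_a.lower())
    pos_b = lowered.find(device_b.lower())
    if pos_a < 0 or pos_b < 0:
        return None
    # Order the two mentions by their position in the sentence.
    (first_pos, first), (second_pos, _) = sorted(
        [(pos_a, device_a), (pos_b, device_b)]
    )
    start = first_pos + len(first)
    return lowered[start:second_pos].split()


features = between_words("DVR allows you to pause live TV", "TV", "DVR")
print(features)  # ['allows', 'you', 'to', 'pause', 'live']
```

Verbs such as "allows" and "pause" in this window hint at a relation, whereas a bare list of device names would leave it nearly empty.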

32 Looking for relations Media server and computer
–The Allegro Media Server application reads the iTunes music library file to find the music stored on your computer [YES]
–You will use the FTP software to transfer files from your computer to the media server [YES]
–The media server has many functions and it needs to be a high-end computer with plenty of hard drive space to store the very large video files that get created [YES]
–Sometimes you might want to play faster than your computer, or your Internet connection, or your media server, can handle [NO]
–Anderson, George Homsy, A Continuous Media I/O Server and Its Synchronization Mechanism, Computer, v.24 n.10, p.51-57, October 1991 [NO]
–GSIC > Research Computer System > Obtaining Accouts > Media Server [NO]

33 Looking for relations Media server and wireless router
–For example, if you access a local media server in your house that is connected to a wireless router that has a port speed of only 100 Mbps [..] [YES]
–Besides serving as a router, a wireless access point, and a four-port switch, the WRTSL54GS includes a storage link and a media server [YES]
–It has a built in video server, media server, home automation, wireless router, internet gateway [NO]

34 Our system Set of 57 pairs of digital devices Searched the Web (Google) using the device pairs as queries From the Web pages retrieved, we extracted the 3627 text excerpts containing both devices We labeled them (YES or NO) Trained a classification system

35 Our FUTURE system Will make it possible to identify the Web pages containing relations. –Could display only those. –Could highlight only sentences with relations –For digital devices, this would allow, for example, useful queries for troubleshooting Searching the web is one of the principal methods used to seek out information and to resolve problems involving digital devices for home networks

36 Our FUTURE system Possible extensions of the project to get the activity types –We look at the sentences extracted and come up with a set of possible activities. Build a (multi-class) classification system to classify the different activities (supervised) –Extract the most indicative words for the activities (like the words highlighted here); cluster them to get “activity clusters” (unsupervised)

37 Our system Set of 50 device pairs Search the Web (Google) using the device pairs as queries From the Web pages retrieved, we extracted the sentences containing both devices We labeled them (YES or NO) Trained a classification system

38 Labeling with Mechanical Turk To train a classification system, we need labels –Time consuming, subjective, different for each domain and task –(But unsupervised systems usually work worse) We used a web service, Mechanical Turk (MTurk, http://www.mturk.com), that lets you create and post a task that requires human intervention, and offer a reward for the completion of the task.

39 Mechanical Turk HIT for labeling relations

40 Surveys

41

42 Mechanical Turk We created a total of 121 surveys, each consisting of 30 questions. Our reward to users was between 15 and 30 cents per survey (< 1 cent per text segment) –We obtained labels for 3627 text segments for under $70. Each HIT was completed (by all 3 “workers”) within a few minutes to half an hour –We had perfect agreement for 49% of all sentences –5% received all three labels (discarded) –For 46%, two labels were assigned (the majority vote was used to determine the final label) 1865 text segments were labeled YES; 1485 text segments were labeled NO
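
The aggregation rule described here (keep unanimous and two-out-of-three labels, discard three-way splits) can be sketched as follows. The slide implies a third answer option besides YES and NO but does not name it, so the label `UNSURE` is a placeholder assumption, as is the function name `aggregate`.

```python
from collections import Counter


def aggregate(labels):
    """Combine the three worker labels for one text segment.

    A unanimous or 2-of-3 majority label wins; if all three workers
    disagree there is no majority and the segment is discarded (None).
    """
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= 2 else None


print(aggregate(["YES", "YES", "YES"]))    # YES (perfect agreement)
print(aggregate(["YES", "NO", "YES"]))     # YES (majority vote)
print(aggregate(["YES", "NO", "UNSURE"]))  # None (discarded)
```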

43 Classification Now we have labeled data Need a (binary) classifier

44 Summary (from lecture 17) Algorithms for Classification Binary classification –Perceptron –Winnow –Support Vector Machines (SVM) –Kernel Methods –Multilayer Neural Networks Multi-Class classification –Decision Trees –Naïve Bayes –K nearest neighbor
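
As a reminder of how the simplest algorithm on this list works, here is a minimal perceptron for binary classification. It is a textbook sketch on invented two-dimensional toy data, not the Weka implementation used for the comparison in this talk.

```python
def train_perceptron(examples, epochs=10):
    """examples: list of (feature_vector, label) with label in {+1, -1}."""
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # mistake: move the hyperplane toward x
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def predict(w, b, x):
    """Classify x by the side of the hyperplane it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Invented, linearly separable toy data.
data = [((2, 1), 1), ((1, 3), 1), ((-1, -2), -1), ((-2, -1), -1)]
w, b = train_perceptron(data)
```

On separable data like this the update rule stops changing the weights once every training example is classified correctly.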

45 Support Vector Machine (SVM) (from Gert Lanckriet, Statistical Learning Theory Tutorial; from Lecture 17) Large Margin Classifier Linearly separable case Goal: find the hyperplane $w^T x + b = 0$ that maximizes the margin $M$ Support vectors lie on $w^T x_a + b = 1$ and $w^T x_b + b = -1$
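
For a hyperplane scaled so that the support vectors satisfy w·x + b = ±1, the margin is M = 2 / ||w||, which is why maximizing the margin means minimizing ||w||. The sketch below checks this numerically for hand-picked values of w, b and two points; it does not train an SVM.

```python
import math


def decision_value(w, b, x):
    """Compute w.x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the hyperplanes w.x + b = 1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hand-picked example values (assumed for illustration).
w, b = (3.0, 4.0), -2.0
x_a, x_b = (1.0, 0.0), (-1.0, 1.0)  # one point on each margin hyperplane

print(decision_value(w, b, x_a))  # 1.0
print(decision_value(w, b, x_b))  # -1.0
print(margin_width(w))            # 0.4
```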

46 Graphical models Directed (like Naïve Bayes and HMM) Undirected (Markov Network)

47 Maximum Margin Markov Networks Large Margin Classifier + (undirected) Markov Networks [Taskar 03] –To combine the strengths of the two methods: High dimensional feature space, strong theoretical guarantees Problem structure, ability to capture correlation between labels Benjamin Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-margin markov networks. In NIPS.

48 Directed Maximum Margin Model Large Margin Classifier + (directed) graphical model (Naïve Bayes) MMNB: Maximum Margin Naïve Bayes –Essentially, to combine the strengths of graphical models (better at interpreting data, worse classification performance) with those of discriminative models (better performance, opaque working mechanism)

49 Results Compare with Naïve Bayes and Perceptron (Weka) Classification accuracy: –MMNB: 79.98 –Naïve Bayes: 75.62 –Perceptron: 63.03

50 Conclusion Semantic relations Two projects: interactions between proteins and relations between digital devices Statistical models (dynamic graphical models, Maximum Margin Naïve Bayes) Creative ways of obtaining labeled data: protein database and “paying” people (MTurk)

51 Thanks! Barbara Rosario barbara.rosario@intel.com Intel Research

52 Additional slides

53 All device pairs: desktop wireless router; PC stereo; digital camera television; pc wireless audio adapter; digital camera tv set; pc wireless router; ibm laptop buffalo media player; Phillips stereo pc; ibm laptop linksys wireless router; prismq media player wireless router; ibm laptop squeezebox; stereo laptop; ibm laptop wireless audio adapter; stereo toshiba laptop; kodak camera television; toshiba laptop buffalo media player; laptop linksys wireless router; toshiba laptop linksys wireless router; laptop media server; toshiba laptop netgear wireless router; laptop squeezebox; toshiba laptop squeezebox; laptop stereo; toshiba laptop wireless audio adapter; laptop wireless audio adapter

54 All device pairs (cont.): buffalo media player wireless router; laptop wireless router; buffalo media server wireless router; linkstation home server wireless router; camera tv; linkstation multimedia server wireless router; computer linksys wireless router; media player wireless router; computer media server; media server linksys wireless router; computer stereo; media server netgear wireless router; computer wireless audio adapter; media server wireless router; computer wireless router; network media player wireless router; desktop media server; nikon camera television; desktop stereo; pc media server; desktop wireless audio adapter; pc squeezebox

