Discovering Semantic Relations (for Proteins and Digital Devices) Barbara Rosario Intel Research.

Presentation transcript:

Discovering Semantic Relations (for Proteins and Digital Devices) Barbara Rosario Intel Research

Outline Semantic relations –Protein-protein interactions (joint work with Marti Hearst) –Digital devices (joint work with Bill Schilit, Google and Oksana Yakhnenko, Iowa State University) Models to do text classification and information extraction Two new proposals for getting labeled data

Text mining Text mining is the discovery by computers of new, previously unknown information via automatic extraction of information from text. Example: a (human) analysis of titles of articles in the biomedical literature suggested a role of magnesium deficiency in migraines [Swanson]

Text mining Text: –Stress is associated with migraines –Stress can lead to loss of magnesium –Calcium channel blockers prevent some migraines –Magnesium is a natural calcium channel blocker 1: Extract semantic entities from text

Text mining Text: –Stress is associated with migraines –Stress can lead to loss of magnesium –Calcium channel blockers prevent some migraines –Magnesium is a natural calcium channel blocker 1: Extract semantic entities from text. Entities: Stress, Migraine, Magnesium, Calcium channel blockers

Text mining (cont.) Text: –Stress is associated with migraines –Stress can lead to loss of magnesium –Calcium channel blockers prevent some migraines –Magnesium is a natural calcium channel blocker 2: Classify relations between entities. Entities: Stress, Migraine, Magnesium, Calcium channel blockers. Relations: Associated with, Lead to loss, Prevent, Subtype-of (is a)

Text mining (cont.) Text: –Stress is associated with migraines –Stress can lead to loss of magnesium –Calcium channel blockers prevent some migraines –Magnesium is a natural calcium channel blocker 3: Do reasoning: find new correlations. Entities: Stress, Migraine, Magnesium, Calcium channel blockers. Relations: Associated with, Lead to loss, Prevent, Subtype-of (is a)
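The three steps above can be sketched in a few lines of Python. This is only an illustration, assuming the entities and relations have already been extracted into triples; the relation names and the `inherit_via_is_a` helper are hypothetical and not part of the original slides. The only reasoning rule shown is "a subtype inherits the relations of its supertype".

```python
from collections import defaultdict

# Relation triples read off the four example sentences (steps 1 and 2).
TRIPLES = [
    ("stress", "associated_with", "migraine"),
    ("stress", "leads_to_loss_of", "magnesium"),
    ("calcium channel blocker", "prevents", "migraine"),
    ("magnesium", "is_a", "calcium channel blocker"),
]

def inherit_via_is_a(triples):
    """Step 3: propose new triples where a subtype inherits its supertype's relations."""
    by_subject = defaultdict(list)
    for subj, rel, obj in triples:
        by_subject[subj].append((rel, obj))
    known = set(triples)
    inferred = []
    for subj, rel, obj in triples:
        if rel != "is_a":
            continue
        # obj is the supertype: copy its non-taxonomic relations down to subj.
        for parent_rel, parent_obj in by_subject[obj]:
            candidate = (subj, parent_rel, parent_obj)
            if parent_rel != "is_a" and candidate not in known:
                inferred.append(candidate)
    return inferred

NEW_FACTS = inherit_via_is_a(TRIPLES)
print(NEW_FACTS)
```

On this toy input the only proposed fact is that magnesium prevents migraine, which is the kind of hypothesis the Swanson example points to.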

Relations The identification and classification of semantic relations is crucial for the semantic analysis of text Protein-protein interactions Relations for digital devices

Protein-protein interactions Applications throughout biology. There are several protein-protein interaction databases (BIND, MINT, ...), all manually curated. Most of the biomedical research and new discoveries are available electronically, but only in free-text format. Automatic mechanisms are needed to convert text into more structured forms

Protein-protein interactions Supervised systems require manually labeled training data, while purely unsupervised systems have yet to be proven effective for these tasks. We propose the use of resources developed in the biomedical domain to address the problem of gathering labeled data for the task of classifying interactions between proteins

HIV-1, Protein interaction database “The goal of this project is to provide scientists a summary of all known interactions of HIV-1 proteins with host cell proteins, other HIV-1 proteins, or proteins from disease organisms associated with HIV/AIDS” There are 2224 interacting protein pairs and 51 types of interaction

HIV-1, Protein interaction database
Table columns: Protein 1 | Protein 2 | Interaction | Paper ID
Interaction values shown: activates; binds; induces; degraded by; …

Protein-protein interactions Idea: use this to “label data”. Database row: Protein 1 | Protein 2 | Interaction: activates | Paper ID. Extract from the paper all the sentences with Protein 1 and Protein 2 … Label them with the interaction given in the database

Protein-protein interactions Idea: use this to “label data”. Database row: Protein 1 | Protein 2 | Interaction: activates | Paper ID. Extract from the paper all the sentences with Protein 1 and Protein 2 … Label them with the interaction given in the database (here: activates)

Protein-protein interactions Use citations: find all the papers that cite the papers in the database. Database row: Protein 1 | Protein 2 | Interaction: activates | Paper ID. Citing papers: ID, ID

Protein-protein interactions From the citing papers, extract the citation sentences; from these, extract the sentences with Protein 1 and Protein 2, and label them. Database row: Protein 1 | Protein 2 | Interaction: activates | Paper ID. Citing papers: ID, ID. Label: activates
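The labeling idea can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the sentence splitter is a naive regular expression, the function names are hypothetical, and the example database row (including the interaction "inhibits") is invented for the demonstration.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def mentions(sentence, name):
    """True if `name` occurs as a whole word (case-insensitive) in the sentence."""
    return re.search(r"\b" + re.escape(name) + r"\b", sentence, re.IGNORECASE) is not None

def label_sentences(text, protein1, protein2, interaction):
    """Keep sentences mentioning both proteins and label them with the database interaction."""
    return [
        (sentence, interaction)
        for sentence in split_sentences(text)
        if mentions(sentence, protein1) and mentions(sentence, protein2)
    ]

# Hypothetical database row and paper text, for illustration only.
PAPER_TEXT = (
    "Tat is a viral protein. "
    "Selective CXCR4 antagonism by Tat was observed. "
    "CXCR4 is a chemokine receptor."
)
LABELED = label_sentences(PAPER_TEXT, "Tat", "CXCR4", "inhibits")
print(LABELED)
```

The same function applies unchanged to citation sentences: pass the text of the citing sentences instead of the text of the paper itself.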

Protein-protein interactions Task: Given the sentences extracted from paper ID and/or the citation sentences: Determine the interaction given in the HIV-1 database for paper ID. Identify the proteins involved in the interaction (protein name tagging, or role extraction).

Interaction       Papers  Citances
Degrades              60        63
Synergizes with       86       101
Stimulates           103        64
Binds                 98       324
Inactivates           68        92
Interacts with        62       100
Requires              96       297
Upregulates          119        98
Inhibits              78        84
Suppresses            51        99

The models (1) Naïve Bayes (NB) for interaction classification.
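For readers unfamiliar with Naïve Bayes, here is a minimal multinomial version with add-one smoothing in plain Python. It is a generic textbook sketch, not the NB model used in the talk; the function names and the four training sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (token_list, label). Returns the counts needed for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(tokens, model):
    """Return the label maximizing log P(label) + sum of log P(token | label)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training set (invented), with protein names already masked.
TRAIN = [
    ("prot1 binds to prot2".split(), "binds"),
    ("prot1 binding site on prot2".split(), "binds"),
    ("prot1 inhibits prot2 activity".split(), "inhibits"),
    ("prot2 is inhibited by prot1".split(), "inhibits"),
]
MODEL = train_nb(TRAIN)
PREDICTION = classify_nb("prot1 inhibits prot2".split(), MODEL)
print(PREDICTION)
```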

The models (2) Dynamic graphical model (DM) for protein interaction classification (and role extraction).

Dynamic graphical models Graphical model composed of repeated segments. HMMs (Hidden Markov Models) –POS tagging, speech recognition, IE (Diagram: chain of hidden tags $t_1, \dots, t_N$, each generating a word $w_1, \dots, w_N$)

HMMs Joint probability distribution –$P(t_1, \dots, t_N, w_1, \dots, w_N) = P(t_1)\,P(w_1 \mid t_1) \prod_{i=2}^{N} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$ Estimate $P(t_1)$, $P(t_i \mid t_{i-1})$, $P(w_i \mid t_i)$ from labeled data

HMMs Joint probability distribution –$P(t_1, \dots, t_N, w_1, \dots, w_N) = P(t_1)\,P(w_1 \mid t_1) \prod_{i=2}^{N} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$ Estimate $P(t_1)$, $P(t_i \mid t_{i-1})$, $P(w_i \mid t_i)$ from labeled data. Inference: $P(t_1, t_2, \dots, t_N \mid w_1, w_2, \dots, w_N)$
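A small sketch of both operations, under toy assumptions: the joint probability multiplies the start, transition and emission probabilities along the sequence, and Viterbi decoding finds the tag sequence that maximizes the posterior (equivalently, the joint, since the words are fixed). The two tags (`PROT`, `O`) and all probability values are invented; in practice they would be estimated from labeled data.

```python
def joint_probability(tags, words, start, trans, emit):
    """P(t_1..t_N, w_1..w_N) = P(t_1) P(w_1|t_1) * prod_i P(t_i|t_{i-1}) P(w_i|t_i)."""
    p = start[tags[0]] * emit[tags[0]].get(words[0], 0.0)
    for i in range(1, len(tags)):
        p *= trans[tags[i - 1]][tags[i]] * emit[tags[i]].get(words[i], 0.0)
    return p

def viterbi(words, states, start, trans, emit):
    """Return (probability, tag_sequence) of the most probable tag sequence."""
    # best[s] = (probability, path) of the best path ending in state s.
    best = {s: (start[s] * emit[s].get(words[0], 0.0), [s]) for s in states}
    for word in words[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (
                    (best[prev][0] * trans[prev][s] * emit[s].get(word, 0.0),
                     best[prev][1] + [s])
                    for prev in states
                ),
                key=lambda item: item[0],
            )
        best = column
    return max(best.values(), key=lambda item: item[0])

# Toy parameters (invented for illustration).
STATES = ["PROT", "O"]
START = {"PROT": 0.3, "O": 0.7}
TRANS = {"PROT": {"PROT": 0.2, "O": 0.8}, "O": {"PROT": 0.4, "O": 0.6}}
EMIT = {
    "PROT": {"Tat": 0.5, "CXCR4": 0.5},
    "O": {"binds": 0.9, "Tat": 0.05, "CXCR4": 0.05},
}
WORDS = ["Tat", "binds", "CXCR4"]

BEST_PROB, BEST_TAGS = viterbi(WORDS, STATES, START, TRANS, EMIT)
print(BEST_TAGS, BEST_PROB)
```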

Graphical model for role and relation extraction –Markov sequence of states (roles) –States generate multiple observations –The relation generates the state sequence and the observations (Diagram node labels: Interaction, Roles, Features)

Analyzing the results Hiding the protein names: “Selective CXCR4 antagonism by Tat” becomes: “Selective PROT1 antagonism by PROT2” – To check whether the interaction types could be unambiguously determined by the protein names. Compare results with a trigger words approach
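The name-hiding step can be illustrated with a short sketch. The function name `mask_proteins` is hypothetical, and whole-word regular-expression matching is an assumption about how names would be located.

```python
import re

def mask_proteins(sentence, protein1, protein2):
    """Replace whole-word occurrences of the two protein names with placeholders."""
    masked = re.sub(r"\b" + re.escape(protein1) + r"\b", "PROT1", sentence)
    masked = re.sub(r"\b" + re.escape(protein2) + r"\b", "PROT2", masked)
    return masked

MASKED = mask_proteins("Selective CXCR4 antagonism by Tat", "CXCR4", "Tat")
print(MASKED)
```

A classifier trained on masked sentences can only rely on the surrounding words, which is what makes the comparison informative.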

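Results: interaction classification
Table columns: Model | Classification accuracy (All, Papers, Citances)
Rows: DM; NB; No protein names: DM, NB; Trigger words; Baseline: most frequent interaction
[accuracy values missing in transcript]

Results: proteins extraction
Table columns: Recall | Precision | F-measure
Rows: All; Papers; Citances
[values missing in transcript]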

Conclusions of protein-protein interaction project Difficult and important problem: the classification of (ten) different interaction types between proteins in text The dynamic graphical model DM can simultaneously perform protein name tagging and relation identification High accuracy on both problems (well above the baselines) The results obtained removing the protein names indicate that our models learn the linguistic context of the interactions. Found evidence supporting the hypothesis that citation sentences are a good source of training data, most likely because they provide a concise and precise way of summarizing facts in the bioscience literature. Use of a protein-interaction database to automatically gather labeled data for this task.

Relations for digital devices Identification of activities/relations between device pairs. What can you do with a given device pair? –Digital camera and TV –Media server and computer –Media server and wireless router –Toshiba laptop and wireless audio adapter –PC and DVR –TV and DVR

Looking for relations Can you search the Web? –Google searches: TV DVR and PC DVR. Current search engines find co-occurrence of query terms. Often you need to find semantically related entities, for text mining, inference and search (IR)

Looking for relations Can you search the Web? –Google searches: PC DVR and TV DVR. You may want to see instead all the sentences in which the two devices are involved in an activity/relation and get a sense of what you can do with these devices. Activities_between(PC DVR) –From which you learn for example that »Can build a Better DVR out of an Old PC »Any modern Windows PC can be used for DVR duty Activities_between(TV DVR) –From which you learn for example that »DVR allows you to pause live TV »Can watch Google Satellite TV through your "internet ready" Google DVR

Looking for relations We can frame this problem as a classification problem: Given a sentence containing two digital devices, is there a relation between them expressed in the sentence or not?

Looking for relations Media server and computer –The Allegro Media Server application reads the iTunes music library file to find the music stored on your computer YES –You will use the FTP software to transfer files from your computer to the media server YES –The media server has many functions and it needs to be a high-end computer with plenty of hard drive space to store the very large video files that get created YES –Sometimes you might want to play faster than your computer, or your Internet connection, or your media server, can handle NO –Anderson, George Homsy, A Continuous Media I/O Server and Its Synchronization Mechanism, Computer, v.24 n.10, p.51-57, October 1991 NO –GSIC > Research Computer System > Obtaining Accouts > Media Server NO

Looking for relations Media server and wireless router –For example, if you access a local media server in your house that is connected to a wireless router that has a port speed of only 100 Mbps [..] YES –Besides serving as a router, a wireless access point, and a four-port switch, the WRTSL54GS includes a storage link and a media server YES –It has a built in video server, media server, home automation, wireless router, internet gateway NO

Our system Set of 57 pairs of digital devices. Searched the Web (Google) using the device pairs as queries. From the Web pages retrieved, we extracted the 3627 text excerpts containing both devices. We labeled them (YES or NO). Trained a classification system

Our FUTURE system Will make it possible to identify the Web pages containing relations. –Could display only those. –Could highlight only sentences with relations –For digital devices, this would allow, for example, useful queries for troubleshooting. Searching the web is one of the principal methods used to seek out information and to resolve problems involving digital devices for home networks

Our FUTURE system Possible extensions of the project to get the activity types –We look at the sentences extracted and come up with a set of possible activities. Build a (multi-class) classification system to classify the different activities (supervised) –Extract the most indicative words for the activities (like the words highlighted on the slide); cluster them to get “activity clusters” (unsupervised)

Our system Set of 50 device pairs. Search the Web (Google) using the device pairs as queries. From the Web pages retrieved, we extracted the sentences containing both devices. We labeled them (YES or NO). Trained a classification system

Labeling with Mechanical Turk To train a classification system, we need labels –Time consuming, subjective, different for each domain and task –(But unsupervised systems usually work worse) We used a web service, Mechanical Turk (MTurk), that lets you create and post a task that requires human intervention and offer a reward for the completion of the task.

Mechanical Turk HIT for labeling relations

Surveys

Mechanical Turk We created a total of 121 surveys, each consisting of 30 questions. Our reward to users was between 15 and 30 cents per survey (< 1 cent per text segment) –We obtained labels for 3627 text segments for under $70. Each HIT was completed (by all 3 “workers”) within a few minutes to half an hour –We had perfect agreement for 49% of all sentences –5% received three different labels (discarded) –46% received two different labels (the majority vote was used to determine the final label) 1865 text segments were labeled YES; 1485 text segments were labeled NO
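The aggregation rule described on this slide (unanimous, majority vote, or discard) can be sketched as below. The function name and the third label value `UNSURE` are assumptions; the slide does not say what the third possible label was.

```python
from collections import Counter

def aggregate(labels):
    """Combine the workers' labels for one text segment.

    Returns (final_label, status), where status is 'unanimous', 'majority',
    or 'discarded' (no label has more than half of the votes).
    """
    label, votes = Counter(labels).most_common(1)[0]
    if votes == len(labels):
        return label, "unanimous"
    if votes > len(labels) / 2:
        return label, "majority"
    return None, "discarded"

RESULTS = [
    aggregate(["YES", "YES", "YES"]),
    aggregate(["YES", "NO", "YES"]),
    aggregate(["YES", "NO", "UNSURE"]),  # 'UNSURE' is a hypothetical third label
]
print(RESULTS)
```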

Classification Now we have labeled data Need a (binary) classifier

Summary (from lecture 17) Algorithms for Classification Binary classification –Perceptron –Winnow –Support Vector Machines (SVM) –Kernel Methods –Multilayer Neural Networks Multi-Class classification –Decision Trees –Naïve Bayes –K nearest neighbor

Support Vector Machine (SVM) (from Gert Lanckriet, Statistical Learning Theory Tutorial; from Lecture 17) Large Margin Classifier. Linearly separable case. Goal: find the hyperplane that maximizes the margin $M$. Separating hyperplane: $w^T x + b = 0$. Support vectors: $w^T x_a + b = 1$ and $w^T x_b + b = -1$
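The size of the margin follows from the two support-vector equations on this slide. A short derivation, using the standard hard-margin formulation (the labels $y_i \in \{-1, +1\}$ are notation assumed here, not taken from the slide):

```latex
\begin{align*}
w^\top x_a + b &= 1, \qquad w^\top x_b + b = -1 \\
\Rightarrow\quad w^\top (x_a - x_b) &= 2 \\
M &= \frac{w^\top (x_a - x_b)}{\lVert w \rVert} = \frac{2}{\lVert w \rVert} \\
\max_{w,b} M \;&\Longleftrightarrow\; \min_{w,b} \tfrac{1}{2}\lVert w \rVert^2
\quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 \;\; \forall i
\end{align*}
```

Maximizing the margin is therefore the same as minimizing the norm of $w$ while keeping every training point on the correct side of its margin hyperplane.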

Graphical models Directed (like Naïve Bayes and HMM) Undirected (Markov Network)

Maximum Margin Markov Networks Large Margin Classifier + (undirected) Markov Networks [Taskar 03] –To combine the strengths of the two methods: high-dimensional feature space and strong theoretical guarantees (large margin classifiers); problem structure and ability to capture correlation between labels (Markov networks). Reference: Benjamin Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In NIPS.

Directed Maximum Margin Model Large Margin Classifier + (directed) graphical model (Naïve Bayes) MMNB: Maximum Margin Naïve Bayes –Essentially, to combine the strengths of graphical models (better at interpreting data, worse performance in classification) with those of discriminative models (better performance, unintelligible working mechanism)

Results Compare with Naïve Bayes and Perceptron (Weka). Classification accuracy: –MMNB: [value missing in transcript] –Naïve Bayes: [value missing in transcript] –Perceptron: 63.03

Conclusion Semantic relations. Two projects: interactions between proteins and relations between digital devices. Statistical models (dynamic graphical models, Maximum Margin Naïve Bayes). Creative ways of obtaining labeled data: protein database and “paying” people (MTurk)

Thanks! Barbara Rosario Intel Research

Additional slides

All device pairs: desktop wireless router; PC stereo; digital camera television; pc wireless audio adapter; digital camera tv set; pc wireless router; ibm laptop buffalo media player; Phillips stereo pc; ibm laptop linksys wireless router; prismq media player wireless router; ibm laptop squeezebox; stereo laptop; ibm laptop wireless audio adapter; stereo toshiba laptop; kodak camera television; toshiba laptop buffalo media player; laptop linksys wireless router; toshiba laptop linksys wireless router; laptop media server; toshiba laptop netgear wireless router; laptop squeezebox; toshiba laptop squeezebox; laptop stereo; toshiba laptop wireless audio adapter; laptop wireless audio adapter

All device pairs (cont.): buffalo media player wireless router; laptop wireless router; buffalo media server wireless router; linkstation home server wireless router; camera tv; linkstation multimedia server wireless router; computer linksys wireless router; media player wireless router; computer media server; media server linksys wireless router; computer stereo; media server netgear wireless router; computer wireless audio adapter; media server wireless router; computer wireless router; network media player wireless router; desktop media server; nikon camera television; desktop stereo; pc media server; desktop wireless audio adapter; pc squeezebox