Learning Analogies and Semantic Relations Nov 29 2010 William Cohen.

Presentation transcript:

Learning Analogies and Semantic Relations, Nov 29 2010, William Cohen

Announcements
Upcoming assignments:
– Wiki pages for October should be revised
– Wiki pages for November due tomorrow 11/30
– Projects due Fri 12/10
Project presentations next week:
– Monday 12/6 and Wed 12/8
– 20 min including time for Q/A
– 30 min for the group project
– (Order is reverse of the mid-term project reports)

[Machine Learning, 2005]

Motivation
Information extraction is about understanding entity names in text… and also relations between entities.
How do you determine if you "understand" an arbitrary relation?
– For fixed relations R: labeled data (ACE)
– For arbitrary relations: … ?

Evaluation

How do you measure the similarity of relation instances?
1. Create a feature vector r_{x:y} for each instance x:y (e.g., mason:stone, soldier:gun).
2. Use cosine distance.

Creating an instance vector for x:y
Generate a bunch of queries:
– "X of the Y" ("stone of the mason")
– "X with the Y" ("soldier with the gun")
– …
For each query q_j(X,Y), record the number of hits in a search engine as r_{x:y,j}
– Actually record log(#hits + 1)
– Actually sometimes replace X with stem(X)*
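To make this concrete, here is a minimal Python sketch of building r_{x:y} from log hit counts and comparing two instances with cosine similarity. The template list and the hit_count function are illustrative stand-ins (not the paper's fixed query set or a real search engine), and stemming is omitted.

```python
import math

# Illustrative joining-phrase templates in the spirit of the slide;
# the real system uses a larger, fixed list of such phrases.
TEMPLATES = ["{x} of the {y}", "{x} with the {y}", "{x} for the {y}",
             "{y} of the {x}", "{y} with the {x}", "{y} for the {x}"]

def instance_vector(x, y, hit_count):
    """Build r_{x:y}: one coordinate per query template, set to
    log(#hits + 1) for the instantiated query."""
    return [math.log(hit_count(t.format(x=x, y=y)) + 1) for t in TEMPLATES]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy usage with a fake hit-count function standing in for a search engine.
fake_counts = {"stone of the mason": 120, "gun with the soldier": 45}
r1 = instance_vector("stone", "mason", lambda q: fake_counts.get(q, 0))
r2 = instance_vector("gun", "soldier", lambda q: fake_counts.get(q, 0))
print(cosine(r1, r2))
```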

The queries used
Similar to Hearst ’92 and follow-ups.

Some results
Ranking 369 possible x:y pairs as candidate answers.

How do you measure the similarity of relation instances?
1. Create a feature vector r_{x:y} for each instance x:y
2. Use cosine distance to rank the candidate answers (a), …, (d)
3. Test-taking strategy:
– Define margin = (bestScore − secondBest)
– If margin < θ and θ ≥ 0 then skip
– If margin < θ and θ < 0 then guess the top 2
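A small sketch of the test-taking step under one reading of the margin rule above: rank the answer choices by similarity to the stem pair and skip when the gap between the top two scores falls below a threshold θ. The function and the exact skip/guess behavior are illustrative, not the paper's.

```python
def answer_question(stem_vec, choice_vecs, theta, sim):
    """Rank answer choices by sim(stem_vec, choice_vec); skip (return
    None) if the margin between the best and second-best score is
    below theta, otherwise return the index of the best choice."""
    scores = sorted(((sim(stem_vec, v), i) for i, v in enumerate(choice_vecs)),
                    reverse=True)
    margin = scores[0][0] - scores[1][0]
    if margin < theta:
        return None          # too close to call: skip the question
    return scores[0][1]
```

Here sim can be the cosine helper from the earlier sketch.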

Results

Followup work
Given x:y pairs, replace the instance vectors with rows of M':
1. Look up synonyms x', y' of x and y and construct "near analogies" x':y and x:y'; drop any that don't occur frequently (e.g., "mason:stone" → "mason:rock").
2. Search for phrases "x Q y" or "y Q x", using the near analogies as well as the original pair x:y, where Q is any sequence of up to three words.
3. For each phrase, create patterns by introducing wildcards.
4. Build a pair-pattern frequency matrix M.
5. Apply SVD to M and keep the best 300 dimensions → M'.
Define sim_1(x:y, u:v) = cosine distance in M'.
Compute the similarity of x:y and u:v as the average of sim_1(p1, p2) over all pairs p1, p2 where (a) p1 is x:y or an alternate; (b) p2 is u:v or an alternate; and (c) sim_1(p1, p2) ≥ sim_1(x:y, u:v). [Turney, CL 2006]
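Steps 4 and 5 can be sketched with numpy: build the pair-by-pattern frequency matrix M, truncate its SVD, and measure cosine similarity between pairs in the reduced space M'. The near-analogy generation and the averaging over alternates are omitted, and the toy counts are made up.

```python
import numpy as np

def reduced_similarity(M, k=300):
    """Project a (pair x pattern) frequency matrix onto its top-k
    singular dimensions and return a cosine-similarity function
    between row (word-pair) indices."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    k = min(k, len(s))
    M_prime = U[:, :k] * s[:k]        # rows of M in the reduced space

    def sim(i, j):
        u, v = M_prime[i], M_prime[j]
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0

    return sim

# Toy usage: 4 word pairs x 6 patterns, with made-up counts.
M = np.array([[3, 0, 1, 0, 2, 0],
              [2, 1, 0, 0, 3, 0],
              [0, 4, 0, 2, 0, 1],
              [0, 3, 1, 2, 0, 0]], dtype=float)
sim_1 = reduced_similarity(M, k=2)
print(sim_1(0, 1), sim_1(0, 2))
```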

Results for LRA on the 50B-word WMTS corpus: LRA 56.5 vs. VSM-WMTS 40.3.

Additional application: relation classification

Relation classification

Ablation experiments - 1

Ablation experiments - 2
What is the effect of using many automatically-generated patterns vs. only 64 manually-generated ones? (Most of the manual patterns are also found automatically.)
Feature selection in pattern space instead of SVD.

Lessons and questions
How are relations and surface patterns correlated?
– One-many? (several class-subclass patterns)
– Many-one? (some patterns are ambiguous)
– Many-many? (and is it 10-10, …?)
Is it surprising that information about relation similarity is spread out across
– So much text?
– So many surface patterns?

Followup 2 … a pure corpus-based approach
Given M word pairs X,Y, construct feature vectors f_{XY} like this:
– Find phrases matching left? X middle{0,3} Y right? (e.g., "the mason cut the stone with") and stem them.
– In each phrase, each word other than X and Y may be replaced with a wildcard, creating 2^(n-2) patterns (e.g., "* mason cut the stone with", "the mason * the stone with", …, "* mason * * stone *").
– Retain the 20M patterns associated with the most X,Y pairs.
– Weight a pattern that appears i times for X,Y as log(i+1).
– Normalize vectors to unit length.
Use supervised learning on this representation. [Turney, COLING 2008]
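A minimal sketch of the pattern features: enumerate the 2^(n-2) wildcard patterns for one phrase, weight a pattern seen i times as log(i+1), and normalize to unit length. The phrase below is a toy example, and the corpus-wide step of keeping only the ~20M most widely shared patterns is left out.

```python
import itertools
import math
from collections import Counter

def wildcard_patterns(phrase_tokens, x, y):
    """All 2^(n-2) patterns for one phrase: each word other than X and Y
    may independently be replaced by a wildcard '*'."""
    slots = [i for i, w in enumerate(phrase_tokens) if w not in (x, y)]
    patterns = []
    for mask in itertools.product([False, True], repeat=len(slots)):
        toks = list(phrase_tokens)
        for use_wildcard, i in zip(mask, slots):
            if use_wildcard:
                toks[i] = "*"
        patterns.append(" ".join(toks))
    return patterns

def feature_vector(phrases, x, y):
    """f_{X,Y}: weight a pattern that appears i times as log(i+1), then
    normalize the vector to unit length."""
    counts = Counter(p for ph in phrases for p in wildcard_patterns(ph, x, y))
    vec = {p: math.log(i + 1) for p, i in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {p: w / norm for p, w in vec.items()} if norm else vec

# Toy usage: one matching phrase for the pair mason:stone.
phrase = ["the", "mason", "cut", "the", "stone", "with"]
print(len(wildcard_patterns(phrase, "mason", "stone")))   # 2^4 = 16 patterns
f = feature_vector([phrase], "mason", "stone")
```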

Followup 2 … a pure corpus-based approach
Given M word pairs X,Y, construct feature vectors f_{XY}.
Use supervised learning for synonym-or-not. [Turney, COLING 2008]
10-fold CV on 80 questions = 320 word pairs.
Accuracy 76.2%; rank 9/15 compared to prior approaches (best 97.5; avg human 64.5).

Followup 2 … a pure corpus-based approach
Given M word pairs X,Y, construct feature vectors f_{XY}.
Use supervised learning for synonym-vs-antonym. [Turney, COLING 2008]
10-fold CV on 136 sample questions.
Accuracy 75%; first published results on this task.

Followup 2 … a pure corpus-based approach
Given M word pairs X,Y, construct feature vectors f_{XY}.
Use supervised learning for similar/associated/both. [Turney, COLING 2008]
10-fold CV on 144 pairs labeled in psychological experiments.
Accuracy 77.1%; first published results on this task.

Followup 2 … a pure corpus-based approach
Given M word pairs X,Y, construct feature vectors f_{XY}.
Use supervised learning for analogies. [Turney, COLING 2008]
Negative examples come from another problem; repeat 10x with a different "negative" example, average the scores for the test cases, then pick the best answer.
Accuracy: 52.1%; rank: 3/12 prior papers (best 56.1%; avg student 57%).

Summary

Background for Wed: pair HMMs and generative models of alignment

Alignments and expectations
Simplified version of the idea from Learning String Edit Distance, Ristad and Yianilos, PAMI 1998.

HMM Example
Two states, 1 and 2, with transition probabilities Pr(1→1), Pr(1→2), Pr(2→1), Pr(2→2).
Emission probabilities:
– Pr(1→x): d 0.3, h 0.5, b 0.2
– Pr(2→x): a 0.3, e 0.5, o 0.2
Sample output: x_T = heehahaha, s_T = …
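A minimal sketch of sampling an output string from this two-state HMM. The emission tables are the ones on the slide; the transition probabilities are placeholders, since the transcript only names them.

```python
import random

emit = {1: {"d": 0.3, "h": 0.5, "b": 0.2},     # Pr(1->x), from the slide
        2: {"a": 0.3, "e": 0.5, "o": 0.2}}     # Pr(2->x), from the slide
trans = {1: {1: 0.4, 2: 0.6},                  # assumed values: only the
         2: {1: 0.5, 2: 0.5}}                  # names appear in the transcript

def sample(T, start=1):
    """Sample a state sequence s_1..s_T and the emitted string x_1..x_T."""
    states, chars = [], []
    s = start
    for _ in range(T):
        states.append(s)
        chars.append(random.choices(list(emit[s]), list(emit[s].values()))[0])
        s = random.choices(list(trans[s]), list(trans[s].values()))[0]
    return "".join(chars), states

x, s = sample(9)
print(x, s)   # e.g. a string like "heehahaha" with its hidden states
```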

HMM Inference
(Trellis over time steps t = 1 … T, states l = 1 … K, and observations x_1 … x_T.)
Key point: Pr(s_i = l) depends only on Pr(l'→l) and s_{i-1}, so you can propagate probabilities forward.
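The forward pass makes that key point concrete: each trellis column is computed from the previous one and the transition probabilities. A generic sketch (the state set, start distribution, and probability tables are supplied by the caller):

```python
def forward(x, states, start_p, trans_p, emit_p):
    """alpha[t][l] = joint probability of the first t+1 observations and
    being in state l after t+1 steps (0-based t).  Each column depends
    only on the previous column and Pr(l'->l), so probabilities are
    propagated left to right through the trellis."""
    alpha = [{l: start_p[l] * emit_p[l].get(x[0], 0.0) for l in states}]
    for t in range(1, len(x)):
        prev = alpha[-1]
        alpha.append({l: sum(prev[lp] * trans_p[lp][l] for lp in states)
                         * emit_p[l].get(x[t], 0.0)
                      for l in states})
    return alpha   # sum of the last column = Pr(x)
```

With the emit table and (assumed) trans table from the sampling sketch above and a uniform start distribution, sum(forward("heehahaha", [1, 2], {1: 0.5, 2: 0.5}, trans, emit)[-1].values()) gives the probability of that sample output.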

Pair HMM Notation Andrew will use “null”

Pair HMM Example
(A single state, 1, with an emission table of edit operations e and their probabilities Pr(e).)

Pair HMM Example
(Single state 1 with emission table e, Pr(e).)
Sample run: z_T = …
Strings x, y produced by z_T: x = heehee, y = teehe
Notice that x, y is also produced by z_4 + …, and many other edit strings.

Distances based on pair HMMs

Pair HMM Inference
Dynamic programming is possible: fill out the matrix left-to-right, top-down.
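A sketch of that dynamic program for a single-state pair HMM: F[i][j] sums the probability of every edit string that generates the prefixes x[:i] and y[:j]. The emission tables p_sub, p_ins, p_del are assumptions, and the explicit stopping probability used by Ristad and Yianilos is ignored.

```python
def pair_hmm_score(x, y, p_sub, p_ins, p_del):
    """Fill F left-to-right, top-down.  p_sub[(a, b)] is the probability
    of emitting the pair <a, b>, p_ins[b] of <null, b>, and p_del[a] of
    <a, null>.  F[len(x)][len(y)] is Pr(x, y) under the model."""
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:    # substitution / match <x_i, y_j>
                total += F[i - 1][j - 1] * p_sub.get((x[i - 1], y[j - 1]), 0.0)
            if j > 0:              # insertion <null, y_j>
                total += F[i][j - 1] * p_ins.get(y[j - 1], 0.0)
            if i > 0:              # deletion <x_i, null>
                total += F[i - 1][j] * p_del.get(x[i - 1], 0.0)
            F[i][j] = total
    return F[n][m]
```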

Pair HMM Inference
(DP matrix over positions t = 1 … T in one string and v = 1 … K in the other.)

Pair HMM Inference
(DP matrix over t = 1 … T and v = 1 … K.)
One difference: after i emissions of the pair HMM, we do not know the column position.

Pair HMM Inference: Forward-Backward
(Same DP matrix over t = 1 … T and v = 1 … K.)

Multiple states
Three states SUB, IX, IY, each with its own emission table of edit operations e and probabilities Pr(e).

An extension: multiple states
Conceptually, add a "state" dimension to the DP matrix (over t = 1 … T, v = 1 … K, and states l, e.g. SUB and IX).
EM methods generalize easily to this setting.
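Following the slide, the multi-state extension only adds a state index to the same recurrence. In this sketch F[l][i][j] is the probability of generating x[:i] and y[:j] with the last operation emitted by state l; the state set (e.g. SUB, IX, IY), start and transition probabilities, and per-state emission tables are all assumptions supplied by the caller.

```python
def multi_state_pair_forward(x, y, states, start, trans, emit):
    """emit[l] maps an edit operation (a, b), (a, None) or (None, b) to
    its probability in state l; trans[lp][l] and start[l] are the
    state-transition and start probabilities."""
    n, m = len(x), len(y)
    F = {l: [[0.0] * (m + 1) for _ in range(n + 1)] for l in states}

    def inflow(l, i, j):
        # probability of arriving in state l at cell (i, j), about to emit
        if i == 0 and j == 0:
            return start[l]
        return sum(F[lp][i][j] * trans[lp][l] for lp in states)

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            for l in states:
                p = 0.0
                if i > 0 and j > 0:   # state l emits the pair <x_i, y_j>
                    p += emit[l].get((x[i-1], y[j-1]), 0.0) * inflow(l, i-1, j-1)
                if i > 0:             # state l emits <x_i, null>
                    p += emit[l].get((x[i-1], None), 0.0) * inflow(l, i-1, j)
                if j > 0:             # state l emits <null, y_j>
                    p += emit[l].get((None, y[j-1]), 0.0) * inflow(l, i, j-1)
                F[l][i][j] = p
    return sum(F[l][n][m] for l in states)   # Pr(x, y), ignoring an end state
```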