Distributional Clustering of English Words. Fernando Pereira - AT&T Bell Laboratories; Naftali Tishby - Dept. of Computer Science, Hebrew University; Lillian Lee - Dept. of Computer Science, Cornell University. Presenter: Juan Ramos, Dept. of Computer Science, Rutgers University.

Overview Purpose: evaluate a method for clustering words according to their distribution in particular syntactic contexts. Methodology: find minimum-distortion sets of word clusters that serve as models of word co-occurrence.

Applications Scientific point of view: lexical acquisition, i.e., learning word association tendencies from corpora. Practical point of view: word classes address data sparseness in grammar and language models, and support clustering over large document corpora.

Definitions Context: the function a given word plays in its sentence. –E.g., a noun appearing as a direct object. Sense class: hidden model describing word association tendencies. –A mixture of clusters with cluster-membership probabilities given a word. Cluster: probabilistic formalization of a sense class.

Problem Setting Restrict the problem to verbs (V) and nouns (N) in the main verb-direct object relationship. f(v, n) = frequency of occurrence of the verb-noun pair (v, n). –Text must be parsed/preprocessed to extract these pairs. For a given noun n, the conditional distribution over verbs is p(v|n) = f(v,n) / Σ_v' f(v',n).
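
As an illustration (not from the original slides), a minimal Python sketch of building the empirical conditional distributions p(v|n) from verb-object pair counts; the toy pairs are invented for the example.

```python
from collections import Counter, defaultdict

# Toy verb-object pairs; in the paper these come from parsed AP newswire.
pairs = [("fire", "gun"), ("fire", "missile"), ("fire", "employee"),
         ("hire", "employee"), ("hire", "worker"), ("fire", "gun")]

f = Counter(pairs)                       # f(v, n): pair frequencies
noun_totals = defaultdict(int)           # sum over v of f(v, n), per noun
for (v, n), c in f.items():
    noun_totals[n] += c

def p_v_given_n(v, n):
    """Empirical conditional probability p(v | n) = f(v, n) / sum_v' f(v', n)."""
    return f[(v, n)] / noun_totals[n] if noun_totals[n] else 0.0

print(p_v_given_n("fire", "gun"))        # 1.0: 'gun' only occurs with 'fire'
print(p_v_given_n("fire", "employee"))   # 0.5
```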

Problem Setting cont. Goal: create a set C of clusters and membership probabilities p(c|n). Each c in C is associated with a cluster centroid p_c, a distribution over V. –p_c(v) is a membership-weighted average of the distributions p(v|n) of the nouns in the cluster.

Distributional Similarity Given two distributions p, q, the KL divergence (relative entropy) is D(p || q) = Σ_x p(x) log (p(x)/q(x)). –D(p || q) = 0 if and only if p = q. –A small D(p_n || p_c) means the noun distribution p_n is well modeled by the centroid p_c. D(p_n || p_c) measures the information lost by using p_c in place of p_n.
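
A small Python sketch (again an illustration, not the authors' code) of the KL divergence defined above; terms with p(x) = 0 contribute nothing, while q(x) = 0 where p(x) > 0 makes the divergence infinite.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)), with 0 * log(0/...) = 0."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return float("inf")          # p puts mass where q has none
        total += px * math.log(px / qx)
    return total

p = {"gun": 0.6, "missile": 0.4}
q = {"gun": 0.5, "missile": 0.3, "employee": 0.2}
print(kl_divergence(p, q))               # small value: p is close to q
```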

Theoretical Foundation Given unstructured sets V and N, and training data X of independent verb-noun pairs. Problem: learn the joint distribution of pairs from X. Not quite unsupervised, not quite supervised. –No internal structure is assumed in the pairs. –The task is to learn the underlying distribution.

Distributional Clustering Approximately decompose p(v|n) as p'(v|n) = Σ_{c in C} p(c|n) · p(v|c). –p(c|n) = membership probability of noun n in cluster c. –p(v|c) = probability of verb v under the centroid of cluster c. Assuming p(n) and p'(n) coincide, the joint model is p'(n,v) = Σ_{c in C} p(c) · p(n|c) · p(v|c).
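
A hypothetical sketch of the decomposition p'(v|n) = Σ_c p(c|n) · p(v|c) for one noun; the cluster names and probabilities are made up for the example.

```python
# Cluster memberships for the noun 'gun' and centroid distributions over verbs.
p_c_given_n = {"c_weapon": 0.9, "c_personnel": 0.1}          # p(c | n='gun')
p_v_given_c = {
    "c_weapon":    {"fire": 0.7, "aim": 0.3},                 # p(v | c)
    "c_personnel": {"fire": 0.4, "hire": 0.6},
}

def p_prime_v_given_n(v):
    """p'(v | n) = sum_c p(c | n) * p(v | c)."""
    return sum(p_c * p_v_given_c[c].get(v, 0.0)
               for c, p_c in p_c_given_n.items())

print(p_prime_v_given_n("fire"))   # 0.9*0.7 + 0.1*0.4 = 0.67
print(p_prime_v_given_n("hire"))   # 0.9*0.0 + 0.1*0.6 = 0.06
```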

Maximum Likelihood Cluster Centroids Used to maximize the goodness of fit between the data and the model p'(n,v). For a sequence of pairs S, the model log-likelihood is l(S) = Σ_{(n,v) in S} log p'(n,v). –Maximize with respect to p(n|c) and p(v|c). –Setting the variation of l(S) to zero yields the centroid re-estimation conditions.
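
A small illustration (my own, with hypothetical argument names) of the log-likelihood computation: p_c_given_n maps each noun to its membership distribution, p_v_given_c maps each cluster to its centroid distribution over verbs, and p_n is the noun marginal.

```python
import math

def log_likelihood(pairs, p_n, p_c_given_n, p_v_given_c):
    """l(S) = sum over (n, v) in S of log p'(n, v),
    where p'(n, v) = p(n) * sum_c p(c|n) * p(v|c)."""
    total = 0.0
    for n, v in pairs:
        mix = sum(pc * p_v_given_c[c].get(v, 0.0)
                  for c, pc in p_c_given_n[n].items())
        total += math.log(p_n[n] * mix)   # assumes the model gives the pair nonzero mass
    return total
```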

Maximum Entropy Cluster Membership Assume the variations of p(n|c) and p(v|c) are independent. –The membership probabilities p(c|n) are the Bayes inverses of the p(n|c). –The p(v|c) that maximize l(S) also minimize the average distortion between the cluster model and the data.

Entropy Cluster Membership cont. Average cluster distortion: ⟨D⟩ = Σ_n Σ_c p(c|n) · D(p_n || p_c), where p_n = p(·|n) and p_c is the centroid of c. Membership entropy: H = -Σ_n Σ_c p(c|n) · log p(c|n).

Entropy Cluster Membership cont. The maximum-entropy class and membership distributions take the Gibbs form p(c|n) = exp(-β · D(p_n || p_c)) / Z_n, and similarly for the class distribution. –Z(c) and Z(n) are the corresponding normalization sums. Substituting these forms into the previous equations reduces the log-likelihood to an expression in the normalization sums; at the maximum, its variation vanishes.

KL Distortion Minimize the average KL distortion by setting the variation of the KL distances with respect to the centroids to zero. –The result is that each centroid is a membership-weighted average of the noun distributions: p(v|c) = Σ_n p(n|c) · p(v|n).
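
A sketch of one re-estimation step, assuming the update rules reconstructed above (Gibbs memberships p(c|n) ∝ exp(-β · D(p_n || p_c)), then centroids as membership-weighted averages of noun distributions). Here kl is a divergence function such as the kl_divergence sketch earlier, and a uniform prior over nouns is assumed.

```python
import math
from collections import defaultdict

def reestimate(p_v_given_n, centroids, beta, kl):
    """One step: memberships p(c|n) ~ exp(-beta * D(p_n || p_c)),
    then centroids p(v|c) = sum_n p(n|c) * p(v|n) (uniform prior over nouns).
    kl(p, q) computes D(p || q); assumes each noun overlaps some centroid's support."""
    # Soft cluster memberships for each noun.
    membership = {}
    for n, p_n in p_v_given_n.items():
        weights = {c: math.exp(-beta * kl(p_n, p_c)) for c, p_c in centroids.items()}
        z_n = sum(weights.values())
        membership[n] = {c: w / z_n for c, w in weights.items()}
    # New centroids: membership-weighted averages of the noun distributions.
    new_centroids = {}
    for c in centroids:
        total = sum(membership[n][c] for n in p_v_given_n)
        dist = defaultdict(float)
        for n, p_n in p_v_given_n.items():
            w = membership[n][c] / total          # plays the role of p(n | c)
            for v, pv in p_n.items():
                dist[v] += w * pv
        new_centroids[c] = dict(dist)
    return membership, new_centroids
```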

Free Energy Function The combination of minimum distortion and maximum entropy is equivalent to minimizing the free energy F = ⟨D⟩ - H/β. F determines ⟨D⟩ and H through its partial derivatives with respect to β. The minimum of F balances the disordering effect of maximizing entropy against the ordering effect of minimizing distortion.

Hierarchical Clustering The number of clusters is determined through a sequence of increases of β. –Higher β implies more local influence of each noun on the definition of centroids. Start with low β and a single cluster c in C. –Search for the lowest β that splits c into two or more leaf clusters. –Repeat until |C| reaches the desired size. A simplified sketch of this annealing loop appears below.
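
A highly simplified sketch of the annealing schedule referenced above (my own illustration; the paper's splitting criterion perturbs centroids and checks whether the copies actually diverge, which is only crudely imitated here). It reuses the reestimate and kl_divergence sketches from earlier.

```python
import random

def anneal(p_v_given_n, reestimate, kl, max_clusters=8,
           beta=1.0, beta_factor=1.5, n_iters=20):
    """Grow a set of clusters by raising beta and splitting centroids."""
    # Start with a single centroid: the average of all noun distributions.
    nouns = list(p_v_given_n)
    avg = {}
    for p_n in p_v_given_n.values():
        for v, pv in p_n.items():
            avg[v] = avg.get(v, 0.0) + pv / len(nouns)
    centroids = {"c0": avg}

    while len(centroids) < max_clusters:
        beta *= beta_factor                       # lower the "temperature" 1/beta
        # Split each centroid into two slightly perturbed copies.
        centroids = {f"{c}_{i}": {v: pv * (1 + 0.01 * random.uniform(-1, 1))
                                  for v, pv in dist.items()}
                     for c, dist in centroids.items() for i in (0, 1)}
        # Re-estimate memberships and centroids at this beta until roughly stable.
        for _ in range(n_iters):
            _, centroids = reestimate(p_v_given_n, centroids, beta, kl)
    return centroids
```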

Experimental Results Classify the 64 nouns appearing as direct objects of the verb 'fire' in 1988 Associated Press newswire, where |V| = . The four words most similar to each cluster centroid, with their KL distances, are shown for the first splits. –Split 1: a cluster of 'fire' as discharging weapons vs. a cluster of 'fire' as dismissing employees. –Split 2: weapons as projectiles vs. weapons as guns.

Clustering on Verb ‘fire’

Evaluation

Evaluation cont.

Conclusions Clustering is efficient, informative, and yields good predictions. Future work: –Make the clustering method more rigorous. –Introduce human judgment, i.e., a more supervised evaluation. –Extend the model to other word relationships.
