A Discriminative Latent Variable Model for Online Clustering
Rajhans Samdani, Kai-Wei Chang, Dan Roth
Department of Computer Science, University of Illinois at Urbana-Champaign

Motivating Example: Coreference

Coreference resolution: cluster denotative noun phrases (mentions) in a document based on the underlying entities. The task is to learn a clustering function from training data:
- Use expressive features between mention pairs (e.g., string similarity).
- Learn a similarity metric between mentions.
- Cluster mentions based on the metric.
Mentions arrive in left-to-right order:

[Bill Clinton], recently elected as the [President of the USA], has been invited by the [Russian President], [Vladimir Putin], to visit [Russia]. [President Clinton] said that [he] looks forward to strengthening ties between [USA] and [Russia].

We learn this metric using a joint distribution over clusterings.

Online Clustering

Online clustering: items arrive in a given order. Motivating property: cluster item i with no access to future items (to its right), using only the previous items (to its left). This setting is general and natural in many tasks, e.g., clustering posts in a forum or clustering network attacks. An online clustering algorithm is likely to be more efficient than a batch algorithm in such a setting.

Greedy Best-Left-Link Clustering

Best-left-linking decoding (Bengtson and Roth '08). A naïve way to learn the model is to decouple (i) learning a similarity metric between pairs from (ii) hard clustering of mentions using this metric.

[Bill Clinton], recently elected as the [President of the USA], has been invited by the [Russian President], [Vladimir Putin], to visit [Russia]. [President Clinton] said that [he] looks forward to strengthening ties between [USA] and [Russia].

[Figure: each mention (Bill Clinton, President of the USA, Russian President, Vladimir Putin, Russia, President Clinton) connects to its best antecedent on its left.]
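For concreteness, a minimal sketch of such a best-left-link decoder with a decoupled pairwise scorer w · φ(i, j); the feature function `phi`, weight vector `w` (NumPy arrays), and the new-entity threshold are illustrative assumptions, not the authors' exact setup:

```python
def best_left_link(i, phi, w, threshold=0.0):
    """Return the best antecedent j < i for mention i, or None when no
    left-link score clears the threshold (i then starts a new entity)."""
    if i == 0:
        return None                          # first mention starts a new entity
    j_best = max(range(i), key=lambda j: w @ phi(i, j))
    return j_best if w @ phi(i, j_best) > threshold else None
```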

Our Contribution

A novel discriminative latent variable model, the Latent Left-Linking Model (L³M), for jointly learning the metric and the clustering, which outperforms existing models:
- Trains the pairwise similarity metric for clustering using latent variable structured prediction.
- Relaxes the single best link: considers a distribution over links.
- An efficient learning algorithm that decomposes over individual items in the training stream.

Outline

- Motivation, examples, and problem description
- Latent Left-Linking Model (L³M)
  - Likelihood computation
  - Inference
  - Role of temperature
  - Alternate latent variable perspective
- Learning
  - Discriminative structured prediction learning view
  - Stochastic gradient based decomposed learning
- Empirical study

Latent Left-Linking Model (L³M)

Modeling axioms:
- Each item can link only to some item on its left (creating a left-link).
- The event of i linking to j is independent of the event of i′ linking to j′.
- The probability of i linking to j is Pr[j ← i] ∝ exp(w · φ(i, j) / γ), where γ ∈ [0, 1] is a temperature-like user-tuned parameter.
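A minimal sketch of this link distribution, assuming a pairwise feature function `phi(i, j)` that returns a NumPy vector and a learned weight vector `w` (both names are illustrative):

```python
import numpy as np

def left_link_probs(i, phi, w, gamma=0.5):
    """Pr[j <- i] for each j < i (requires i >= 1): a temperature-scaled
    softmax over the left-link scores w . phi(i, j)."""
    scores = np.array([w @ phi(i, j) for j in range(i)]) / gamma
    scores -= scores.max()        # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```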

L³M: Likelihood of Clustering

Let C be a clustering of data stream d, with C(i, j) = 1 if i and j are co-clustered and 0 otherwise. A dummy item represents the start of a cluster. The probability of C multiplies the probabilities of the items connecting as per C:

Pr[C; w] = Π_i Pr[i, C; w] ∝ Π_i ( Σ_{j<i} exp(w · φ(i, j) / γ) C(i, j) )

The partition/normalization function is efficient to compute:

Z_d(w) = Π_i ( Σ_{j<i} exp(w · φ(i, j) / γ) )
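A hedged sketch of the resulting log-likelihood, computed directly from the two formulas above; here a dummy link with score 0 stands in for "i starts a new cluster" (the exact dummy scoring is an assumption):

```python
import numpy as np
from scipy.special import logsumexp

def log_prob_clustering(C, n, phi, w, gamma=0.5):
    """log Pr[C; w] for a stream of n items: per item, the log-sum of the
    links allowed by C minus the log partition sum over all left-links."""
    ll = 0.0
    for i in range(n):
        # Scores for linking i to each j < i, plus the dummy link (score 0).
        scores = np.array([w @ phi(i, j) for j in range(i)] + [0.0]) / gamma
        allowed = [scores[j] for j in range(i) if C(i, j)]
        numer = logsumexp(allowed) if allowed else scores[-1]  # i starts a cluster
        ll += numer - logsumexp(scores)
    return ll
```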

L³M: Greedy Inference/Clustering

Items arrive sequentially. The probability of i connecting to a previously formed cluster c is the sum of the probabilities of i connecting to the items in c:

Pr[c ← i] = Σ_{j∈c} Pr[j ← i; w] ∝ Σ_{j∈c} exp(w · φ(i, j) / γ)

Greedy clustering:
- Compute c* = argmax_c Pr[c ← i].
- Connect i to c* if Pr[c* ← i] > t (a threshold); otherwise i starts a new cluster.
- This may not yield the most likely clustering.
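A sketch of this greedy decoder, reusing `left_link_probs` from the earlier snippet; the threshold default and the handling of the first item are assumptions:

```python
def greedy_cluster(n, phi, w, gamma=0.5, t=0.5):
    """Assign each arriving item to its highest-probability existing
    cluster, or start a new cluster when no cluster clears threshold t."""
    clusters = [[0]] if n > 0 else []           # the first item starts a cluster
    for i in range(1, n):
        p = left_link_probs(i, phi, w, gamma)   # Pr[j <- i] for all j < i
        best, best_mass = None, 0.0
        for c in clusters:
            mass = sum(p[j] for j in c)         # Pr[c <- i]
            if mass > best_mass:
                best, best_mass = c, mass
        if best is not None and best_mass > t:
            best.append(i)                      # join the best cluster
        else:
            clusters.append([i])                # start a new cluster
    return clusters
```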

Inference: Role of Temperature γ

Recall the probability of i connecting to a previous item j and to a cluster c:

Pr[j ← i] ∝ exp(w · φ(i, j) / γ)    Pr[c ← i] ∝ Σ_{j∈c} exp(w · φ(i, j) / γ)

γ tunes the importance of high-scoring links:
- As γ decreases from 1 to 0, high-scoring links become more important.
- For γ = 0, Pr[j ← i] is a Kronecker delta function centered on the argmax link (assuming no ties).
- For γ = 0, clustering therefore considers only the "best left link," and greedy clustering is exact.
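A toy illustration of this sharpening effect (the scores are made up):

```python
import numpy as np

scores = np.array([1.0, 0.9, 0.2])    # made-up left-link scores for one item
for gamma in (1.0, 0.3, 0.01):
    p = np.exp(scores / gamma)
    p /= p.sum()
    print(gamma, np.round(p, 4))
# gamma = 1.0 yields a fairly flat distribution over links; by gamma = 0.01
# the mass is essentially one-hot on the argmax link, the gamma -> 0 limit.
```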

Latent Variables: Left-Linking Forests

A left-linking forest f specifies, for each item, its parent on the left (with arrow directions reversed). The probability of a forest f is based on the sum of the edge weights in f:

Pr[f; w] ∝ exp( Σ_{(i,j)∈f} w · φ(i, j) / γ )

L³M is equivalent to expressing the probability of C as the sum of the probabilities of all consistent (latent) left-linking forests:

Pr[C; w] = Σ_{f∈F(C)} Pr[f; w]
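The equivalence is just distributivity: the product over items of per-item link sums expands into a sum over all consistent parent choices. A brute-force check on a toy three-item stream (link scores invented):

```python
import itertools
import numpy as np

s = {(1, 0): 0.7, (2, 0): -0.3, (2, 1): 1.1}   # invented scores w . phi(i, j)
# All three items in one cluster: item 1 must link to item 0, and item 2
# may link to item 0 or item 1. Enumerate the consistent forests:
forests = itertools.product([0], [0, 1])        # (parent of 1, parent of 2)
lhs = sum(np.exp(s[1, p1] + s[2, p2]) for p1, p2 in forests)
rhs = np.exp(s[1, 0]) * (np.exp(s[2, 0]) + np.exp(s[2, 1]))
assert np.isclose(lhs, rhs)     # sum over forests == product of per-item sums
```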

Outline

- Motivation, examples, and problem description
- Latent Left-Linking Model
  - Inference
  - Role of temperature
  - Likelihood computation
  - Alternate latent variable perspective
- Learning
  - Discriminative structured prediction learning view
  - Stochastic gradient based decomposed learning
- Empirical study

L³M: Likelihood-based Learning

Learn w from an annotated clustering C_d for each data stream d ∈ D. L³M learns w by minimizing the regularized negative log-likelihood

LL(w) = β‖w‖² + Σ_d log Z_d(w) − Σ_d Σ_i log( Σ_{j<i} exp(w · φ(i, j) / γ) C_d(i, j) )

(regularization + log partition function − log un-normalized probability), which learns by marginalizing over the underlying latent left-linking forests. Relation to other latent variable models:
- γ = 1: Hidden Variable CRFs (Quattoni et al., 07)
- γ = 0: Latent Structural SVMs (Yu and Joachims, 09)
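Since `log_prob_clustering` from the earlier sketch returns exactly the log un-normalized probability minus log Z_d per stream, the objective can be written compactly (a sketch; `streams` pairs each gold clustering with its length):

```python
def LL(w, streams, phi, gamma=0.5, beta=0.1):
    """Regularized negative log-likelihood over the training streams."""
    return beta * (w @ w) - sum(log_prob_clustering(C, n, phi, w, gamma)
                                for C, n in streams)
```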

Training Algorithms: Discussion

The objective function LL(w) is non-convex. One option is the Concave-Convex Procedure (CCCP) (Yuille and Rangarajan, 03; Yu and Joachims, 09):
- Pros: guaranteed to converge to a local minimum (Sriperumbudur et al., 09).
- Cons: requires the entire data stream to compute a single gradient update.

Alternatively, use online updates based on stochastic (sub-)gradient descent (SGD); the sub-gradient decomposes on a per-item basis:
- Cons: no theoretical guarantees for SGD with non-convex functions.
- Pros: can learn in an online fashion; converges much faster than CCCP; great empirical performance.
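A sketch of one per-item stochastic gradient step. The gradient of item i's term is (1/γ) times the expected features under the gold-constrained link distribution minus those under the unconstrained one (the standard latent-variable CRF form); this is not the authors' exact code, and it assumes i has at least one gold antecedent:

```python
import numpy as np
from scipy.special import logsumexp

def sgd_item_update(w, i, C, phi, gamma=0.5, eta=0.1, beta=1e-4):
    """One SGD step on item i's term of LL(w), with weight decay for the
    beta * ||w||^2 regularizer (requires i >= 1 and a gold antecedent)."""
    Phi = np.array([phi(i, j) for j in range(i)])   # one feature row per j < i
    scores = (Phi @ w) / gamma
    p = np.exp(scores - logsumexp(scores))          # unconstrained Pr[j <- i]
    q = p * np.array([float(C(i, j)) for j in range(i)])
    q /= q.sum()                                    # restricted to gold links
    grad = (q - p) @ Phi / gamma                    # ascent direction on log-lik.
    return (1 - eta * beta) * w + eta * grad
```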

Outline

- Motivation, examples, and problem description
- Latent Left-Linking Model
  - Inference
  - Role of temperature
  - Likelihood computation
  - Alternate latent variable perspective
- Learning
  - Discriminative structured prediction learning view
  - Stochastic gradient based decomposed learning
- Empirical study

Experiment: Coreference Resolution

Cluster denotative noun phrases, called mentions. Mentions follow a left-to-right order. Features: mention distance, substring match, gender match, etc. Experiments are on ACE 2004 and OntoNotes-5.0; we report the average of three popular coreference clustering evaluation metrics: MUC, B³, and CEAF.
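For concreteness, a toy version of such a pairwise feature map (the field names are invented for illustration):

```python
import numpy as np

def mention_pair_features(m_i, m_j):
    """Toy pairwise features between mention i and an earlier mention j."""
    return np.array([
        1.0,                                      # bias
        float(m_j["text"] in m_i["text"]),        # substring match
        float(m_i["gender"] == m_j["gender"]),    # gender match
        1.0 / (1 + m_i["pos"] - m_j["pos"]),      # decays with mention distance
    ])
```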

Coreference: ACE

[Results chart (average of MUC, B³, and CEAF; higher is better).]
- Jointly learning the metric and the clustering helps.
- Considering multiple links helps.

Coreference: OntoNotes

[Results chart; higher is better.]
By incorporating domain knowledge constraints, L³M achieves state-of-the-art performance on OntoNotes-5.0 (Chang et al., 13).

Experiments: Document Clustering

Cluster the posts in a forum based on authors or topics.
- Dataset: discussions from
- Posts arrive in time order, e.g.: "Veteran insurance", "Re: Veteran insurance", "North Korean Missiles", "Re: Re: Veteran insurance"
- Features: common words, tf-idf similarity, time between arrivals
- Evaluated with Variation of Information (Meila, 07)

Author Based Clustering

[Results chart (Variation of Information; lower is better).]

Topic Based Clustering

[Results chart (Variation of Information; lower is better).]

Conclusions

Latent Left-Linking Model:
- Principled probabilistic modeling for online clustering tasks.
- Marginalizes over the underlying latent link structures.
- Tuning γ helps: considering multiple links helps.
- Efficient greedy inference.

SGD-based learning:
- Decomposes learning into smaller gradient updates over individual items.
- Rapid convergence and high accuracy.

Solid empirical performance on problems with a natural streaming order.