EXPLORATORY LEARNING Semi-supervised Learning in the presence of unanticipated classes Bhavana Dalvi, William W. Cohen, Jamie Callan School of Computer Science, Carnegie Mellon University.

Presentation transcript:

EXPLORATORY LEARNING Semi-supervised Learning in the presence of unanticipated classes Bhavana Dalvi, William W. Cohen, Jamie Callan School of Computer Science, Carnegie Mellon University

Motivation

Positioning in the problem space
- Semi-supervised learning
  - All classes are known, e.g. Country, State
  - A few seed examples per class, e.g. (Country: USA, Japan, India, ...), (State: CA, PA, MN, etc.)
  - The model learns to propagate labels from labeled to unlabeled points
  - Makes use of existing knowledge, but assumes all classes are known
- Unsupervised learning
  - Works without any training data
  - Doesn't make use of existing knowledge
- Exploratory learning
  - Makes use of existing knowledge
  - Discovers unknown classes, e.g. City, Animals, etc.

Semi-supervised EM
- Initialize the model with a few seeds per class
- Iterate until convergence:
  - E step: predict labels for the unlabeled points
  - M step: recompute model parameters using the seeds plus the predicted labels for the unlabeled points
- Problem: unlabeled points might not belong to any of the existing classes, which causes semantic drift. You might start with "fruits" and end up with all sorts of "food" items, or even "trees".
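To make the loop concrete, here is a minimal hard-EM sketch using scikit-learn's MultinomialNB as the base model. It assumes X is a non-negative document-term matrix (counts or TF-IDF), that seed_idx, seed_labels and unlabeled_idx are NumPy arrays with integer class labels, and that a fixed iteration count stands in for a proper convergence check; the function name semisup_em is ours, not from the paper.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def semisup_em(X, seed_idx, seed_labels, unlabeled_idx, n_iter=10):
    """Hard semi-supervised EM: every point is forced into one of the seeded classes."""
    model = MultinomialNB().fit(X[seed_idx], seed_labels)        # initialize from seeds
    for _ in range(n_iter):
        pred = model.predict(X[unlabeled_idx])                   # E step: label unlabeled points
        all_idx = np.concatenate([seed_idx, unlabeled_idx])
        all_y = np.concatenate([seed_labels, pred])
        model = MultinomialNB().fit(X[all_idx], all_y)           # M step: refit on seeds + predictions
    return model
```

Because every unlabeled point must land in one of the k seeded classes, points from unanticipated classes pull the class models away from the seeds, which is the semantic drift illustrated next.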

Example: Semantic Drift (20-Newsgroups dataset)
[Figure: comparison of the existing semi-supervised method and the proposed method.]

Problem definition

Problem Definition
- Input:
  - A large set of data points: X1 ... Xn
  - Some known classes: C1 ... Ck
  - A small number of seeds per known class, |seeds| << n
- Output:
  - Labels for all data points Xi
  - New classes discovered from the data: Ck+1 ... Ck+m, with (k+m) << n

Solution
Can we extend the semi-supervised EM algorithm for this purpose?

Exploratory EM Algorithm
- Initialize the model with a few seeds per class
- Iterate until convergence (of the data likelihood and the number of classes):
  - E step: predict labels for the unlabeled points
    - For i = 1 to n:
      - If P(Cj | Xi), j = 1 to k, is nearly uniform for data point Xi: create a new class Ck+1 and assign Xi to it
      - Else: assign Xi to argmax over Cj of P(Cj | Xi)
  - M step: recompute model parameters using the seeds and the predicted labels for the unlabeled points
  - The number of classes might increase in each iteration: check whether the model selection criterion is satisfied; if not, revert to the model from iteration t-1
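Below is a minimal sketch of this loop, again with MultinomialNB as the base model and the Jensen-Shannon near-uniformity test (next slides) as the class-creation criterion. It simplifies the algorithm: it opens a new class for every unexplained point and omits the model-selection check and the revert step; explore_em and is_nearly_uniform are illustrative names, and integer seed labels 0..k-1 are assumed.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.naive_bayes import MultinomialNB

def is_nearly_uniform(probs):
    """JS divergence between P(C|x) and the uniform distribution is below 1/k."""
    k = len(probs)
    return jensenshannon(probs, np.full(k, 1.0 / k), base=2) ** 2 < 1.0 / k

def explore_em(X, seed_idx, seed_labels, unlabeled_idx, max_iter=20):
    labels = {i: y for i, y in zip(seed_idx, seed_labels)}       # seed labels stay fixed
    next_class = max(seed_labels) + 1
    for _ in range(max_iter):
        idx = sorted(labels)
        model = MultinomialNB().fit(X[idx], [labels[i] for i in idx])  # M step
        changed = False
        for i in unlabeled_idx:                                  # E step
            probs = model.predict_proba(X[[i]])[0]
            if is_nearly_uniform(probs):                         # no existing class explains Xi
                new_label = next_class                           # so open a new class
                next_class += 1
            else:
                new_label = model.classes_[int(np.argmax(probs))]
            if labels.get(i) != new_label:
                labels[i] = new_label
                changed = True
        if not changed:                                          # converged
            break
    return labels
```

In the full algorithm the expanded model would also be scored with a model selection criterion (AICc below) and reverted to the previous iteration's model if the added classes are not justified.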

Nearly uniform? Jensen-Shannon divergence criterion
- Data point x, current number of classes k
- Posterior: P(C1 | x), P(C2 | x), ..., P(Ck | x)
- Uniform = [1/k, 1/k, ..., 1/k]
- Div = Jensen-Shannon divergence(P(C | x), Uniform)
- If Div < 1/k: create a new class Ck+1
- Else: assign x to argmax over Cj of P(Cj | x)
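As a quick sanity check, here is the criterion evaluated on two hypothetical posteriors over k = 4 classes; note that scipy's jensenshannon returns the square root of the divergence, hence the squaring.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_to_uniform(probs):
    """Jensen-Shannon divergence (base 2) between P(C|x) and the uniform distribution."""
    k = len(probs)
    return jensenshannon(probs, np.full(k, 1.0 / k), base=2) ** 2

print(js_to_uniform([0.26, 0.25, 0.25, 0.24]))  # ~0.0001 < 1/4  -> create a new class
print(js_to_uniform([0.90, 0.05, 0.03, 0.02]))  # ~0.35  >= 1/4  -> assign to the argmax class
```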

Nearly uniform? MinMax criterion
- Data point x, current number of classes k
- Posterior: P(C1 | x), ..., P(Ck | x)
- MaxProb = max{ P(C1 | x), ..., P(Ck | x) }
- MinProb = min{ P(C1 | x), ..., P(Ck | x) }
- If MaxProb / MinProb < 2: create a new class Ck+1
- Else: assign x to argmax over Cj of P(Cj | x)
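The MinMax decision is a one-liner; the epsilon guarding against zero probabilities is our addition, not part of the slide.

```python
def minmax_new_class(probs, ratio=2.0, eps=1e-12):
    """True if the most likely class is less than `ratio` times as probable as the least likely one."""
    return max(probs) / max(min(probs), eps) < ratio

print(minmax_new_class([0.26, 0.25, 0.25, 0.24]))  # True  -> create a new class
print(minmax_new_class([0.90, 0.05, 0.03, 0.02]))  # False -> assign to the argmax class
```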

What are we trying to optimize?
- Objective function: maximize { log data likelihood - model penalty } over the cluster parameters Params{1..m} and the number of clusters m
- The model penalty is computed using the model selection criterion

Model Selection Criterion
- Extended Akaike information criterion (AICc):
  AICc(g) = -2*L(g) + 2*v + 2*v*(v+1)/(n-v-1)
  where g is the model being evaluated, L(g) is the log-likelihood of the data given g, v is the number of free parameters of the model, and n is the number of data points. The first term measures the fit to the data and the remaining terms penalize model complexity. Lower values are preferred.
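A direct transcription of the formula, assuming log_lik is L(g), v the number of free parameters and n the number of data points; in the exploratory loop the expanded model is kept only if its AICc is lower than the previous model's.

```python
def aicc(log_lik, v, n):
    """Corrected Akaike information criterion; lower values are preferred."""
    return -2.0 * log_lik + 2.0 * v + (2.0 * v * (v + 1)) / (n - v - 1)

# e.g. accept the model with the extra class only when
# aicc(L_new, v_new, n) < aicc(L_old, v_old, n)
```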

Extending existing SSL methods
- Semi-supervised Naïve Bayes
- Seeded K-Means
- Seeded von Mises-Fisher

Naïve Bayes (multinomial model)
- Semi-supervised Naïve Bayes:
  label(Xi) = argmax over Cj, j=1..k, of P(Cj | Xi)
- Exploratory Naïve Bayes:
  if P(Cj | Xi), j=1..k, is nearly uniform: label(Xi) = Ck+1
  else: label(Xi) = argmax over Cj, j=1..k, of P(Cj | Xi)

K-Means
- Features: L1-normalized TF-IDF vectors
- Similarity: dot product(centroid, data point)
- Semi-supervised K-Means:
  assign Xi to the closest centroid Cj
- Exploratory K-Means:
  if Xi is nearly equidistant from all centroids: create a new cluster Ck+1 and put Xi in it
  else: assign Xi to the closest centroid
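A minimal sketch of the exploratory assignment step, assuming the rows of X are L1-normalized TF-IDF vectors (dense NumPy arrays) and similarity is the dot product with each centroid; the near-equidistance threshold `ratio` is an illustrative choice, not a value from the paper.

```python
import numpy as np

def exploratory_assign(X, centroids, ratio=1.05):
    """Assign each point to its closest centroid, or open a new cluster when it is
    nearly equidistant from (equally dissimilar to) all existing centroids."""
    centroids = [np.asarray(c, dtype=float) for c in centroids]
    labels = np.empty(X.shape[0], dtype=int)
    for i, x in enumerate(X):
        sims = np.array([c @ x for c in centroids])
        if sims.max() < ratio * sims.min():               # no centroid stands out
            centroids.append(np.asarray(x, dtype=float))  # seed a new cluster at this point
            labels[i] = len(centroids) - 1
        else:
            labels[i] = int(np.argmax(sims))
    return labels, centroids
```

The M step then recomputes each centroid from its assigned points (plus the seeds for the seeded clusters), as in standard seeded K-Means.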

Von Mises-Fisher (vMF)
- vMF: data distributed on the unit hypersphere
  [Figure: vMF densities for kappa = 1 (blue), kappa = 10 (green), kappa = 100 (red); arrows show the mean direction mu.]
- Banerjee et al.: hard-EM generative cluster models based on the vMF distribution
- The exploratory extension is analogous to the Naïve Bayes one, based on the near-uniformity of P(Cj | Xi)

Exploratory EM Algorithm
- Initialize the model with a few seeds per class
- Iterate until convergence (of the data likelihood and the number of classes):
  - E step: predict labels for the unlabeled points; if P(Cj | Xi), j = 1 to k, is nearly uniform for a data point Xi, create a new class Ck+1 and assign Xi to it
  - M step: recompute model parameters using the seeds plus the predicted labels for the unlabeled points
  - The number of classes might increase in each iteration: check whether the model selection criterion is satisfied; if not, revert to the model from iteration t-1
- The framework is generic and applicable to any clustering or classification task:
  - Choose the classification/clustering algorithm: K-Means, Naïve Bayes, vMF, ...
  - Choose the class-creation criterion: MinMax, JS, a trained classifier, ...
  - Choose the model selection criterion: AIC, BIC, AICc, ...

Semi-supervised Gibbs Sampling + Chinese Restaurant Process (an inherently exploratory baseline)
- Initialize the model using the seed data
- for epoch in 1 to numEpochs:
    for item in unlabeled data:
      decrement the data counts for item and label[epoch-1, item]
      sample a label from P(label | item), creating new classes via the CRP
      increment the data counts for item and record label[epoch, item]
(Taken from Bob Carpenter's LingPipe blog.)

Experiments

Datasets

Dataset            # Documents   # Features   # Classes
Delicious_Sports   -             -            -
20-Newsgroups      18.7K         61.2K        20
Reuters            8.3K          18.9K        65

Exploratory vs. Semi-supervised EM
Comparison in terms of macro-averaged seed-class F1.
[Chart: baseline, best-case performance of the improved baseline, and the proposed method.]

Findings
- Algorithm: Exploratory EM ≥ semi-supervised EM with 'm' extra classes
- New-class creation criterion: near-uniformity ≥ random
- Against an existing exploratory method (Gibbs sampling with the Chinese Restaurant Process): Exploratory EM ≥ Gibbs + CRP on seed-class F1, runtime, and the number of classes produced, with no concentration parameter to tune

Conclusions and Future Work

Summary
Advantages:
- Dynamically creating new classes reduces semantic drift on the known classes
- Simple heuristics for near-uniformity work
- Extends existing SSL methods: Naïve Bayes, K-Means, vMF
- The Exploratory EM version proves more effective than Gibbs sampling with CRP
Limitations:
- Limited to the EM setting
- Converges experimentally; a theoretical proof is needed
- No longer parallelizable
- Evaluating the newly created clusters is a challenge
- Experiments are limited to cases where each data point belongs to only one class/cluster

Future Work
- Evaluation:
  - Are the new clusters meaningful?
  - Can we name the newly created clusters/classes?
  - Can we parallelize it?
- Applications:
  - A scatter/gather tool for information retrieval
  - Hierarchical classification, e.g. populating knowledge bases
  - Multi-view datasets

Thank You Questions?

Extra Slides

ExploreEM is better than Gibbs+CRP
Improvements in terms of:
- F1 on seed classes
- Number of classes produced
- Total runtime
- No concentration parameter to tune
Explore-CRP-Gibbs: the probability of creating a new class is extended to depend on the near-uniformity of P(old classes | x)

Explore-CRP-Gibbs
- In the standard CRP, the probability of creating a new class depends on a fixed prior: the concentration parameter P_new
- It can be extended to depend on the near-uniformity of P(known classes | x):
  P(new class) = P_new / (k * d)
  where k is the current number of classes and d is the JS divergence between the uniform distribution and P(Cj | Xi)
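A small illustration of this modified prior, reusing the JS divergence to the uniform distribution; the guard against a zero divergence is our addition.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def new_class_prob(probs, p_new):
    """P(new class) = p_new / (k * d): the more peaked P(C|x) is, the larger d
    and the smaller the probability of opening a new class."""
    k = len(probs)
    d = jensenshannon(probs, np.full(k, 1.0 / k), base=2) ** 2
    return p_new / (k * max(d, 1e-12))   # d == 0 for a perfectly uniform posterior
```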