A PAC-Bayesian Approach to Formulation of Clustering Objectives
Yevgeny Seldin
Joint work with Naftali Tishby

Motivation
Clustering tasks are often ambiguous
It is hard to compare solutions, especially if they are based on different objectives
(Example figure)

Motivation
Many structures co-exist simultaneously
The problem of comparison cannot be resolved by testing any property of the clustering itself
(Example figure)

Motivation
Inability to compare solutions is problematic for advancement and improvement
(Example figure)

Thesis
We do not cluster the data just for the sake of clustering, but rather to facilitate the solution of some higher-level task
The quality of a clustering should be evaluated by its contribution to the solution of that task

Example: cluster, then pack
Clustering by shape is preferable
Evaluate the amount of time saved

Proof of Concept: Collaborative Filtering via Co-clustering
(Figure: viewers X1 and movies X2 are assigned to clusters C1 and C2, which jointly determine the rating Y)
Model: see the sketch below
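A minimal LaTeX sketch of the prediction rule the figure depicts. The parametrization below (cluster assignments q(C1|X1), q(C2|X2) and a cluster-level rating model q(Y|C1,C2)) is my reading of the slide, so treat it as an assumption rather than the authors' exact formulation:

```latex
% Hedged reconstruction: viewers X_1 and movies X_2 are (softly) assigned to
% clusters C_1 and C_2, and the rating Y is predicted at the cluster level.
\[
  q(Y \mid X_1, X_2) \;=\; \sum_{C_1, C_2} q(C_1 \mid X_1)\, q(C_2 \mid X_2)\, q(Y \mid C_1, C_2)
\]
```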

Analysis
Expectation w.r.t. the true distribution p(X1,X2,Y) (unrestricted)
Expectation w.r.t. the classifier q(Y|X1,X2)
Given a loss l(Y,Y')
Model-independent comparison
– Does not depend on the form of q: we can compare any two co-clusterings
– We can compare a clustering-based solution to any other solution (e.g., matrix factorization)
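To make the comparison criterion concrete, here is one standard way to write the expected and empirical loss the slide refers to (the sample notation, e.g. the sample size N, is mine):

```latex
% Expected loss of a randomized predictor q(Y'|X_1,X_2) under the true
% distribution p(X_1,X_2,Y), for a given loss function l(Y,Y'):
\[
  L(q) \;=\; \mathbb{E}_{p(X_1,X_2,Y)}\, \mathbb{E}_{q(Y' \mid X_1,X_2)} \big[\, l(Y, Y') \,\big]
\]
% Its empirical counterpart on a sample \{(x_1^i, x_2^i, y^i)\}_{i=1}^N:
\[
  \hat{L}(q) \;=\; \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{q(Y' \mid x_1^i, x_2^i)} \big[\, l(y^i, Y') \,\big]
\]
```

Because L(q) depends only on p and l, it is defined for any predictor, which is what allows the model-independent comparisons listed above.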

PAC-Bayesian Generalization Bounds
Classification [McAllester '99, ...]:
– H – hypothesis space
– P – prior over H
– Q – posterior over H
– To classify x: draw h ∈ H according to Q(h) and return y = h(x)
– L(Q) – expected loss, L̂(Q) – empirical loss
– With probability ≥ 1-δ, a bound of the form sketched below holds, where D denotes the KL divergence between Q and P
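The bound itself appeared as a formula on the slide. For reference, one common form of McAllester-style PAC-Bayesian bounds (variants differ in constants and lower-order terms) is:

```latex
% With probability at least 1-\delta over an i.i.d. sample of size N,
% simultaneously for all posteriors Q over H:
\[
  L(Q) \;\le\; \hat{L}(Q) \;+\; \sqrt{ \frac{ D(Q \,\|\, P) + \ln\frac{N}{\delta} }{ 2(N-1) } }
\]
% where D(Q||P) is the KL divergence between the posterior Q and the prior P.
```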

PAC-Bayesian Analysis of Co-clustering (Seldin & Tishby, ICML '08)
The bound (schematic form sketched below) lets us:
– compare any two co-clusterings
– find a locally optimal co-clustering
– compare a clustering-based solution to any other solution (e.g., matrix factorization)
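The concrete bound was given as a formula on these slides. Schematically, and only as my paraphrase (constants and lower-order terms omitted; see the ICML '08 paper for the exact statement), it trades empirical prediction quality against the information the cluster assignments retain about the original variables:

```latex
% Schematic form only: n_i denotes the cardinality of X_i, N the sample size,
% and I(X_i;C_i) the mutual information between a variable and its clusters
% under the posterior Q.
\[
  L(Q) \;\lesssim\; \hat{L}(Q) \;+\; \sqrt{ \frac{ \sum_{i=1}^{2} n_i\, I(X_i; C_i) \;+\; \mathrm{const} }{ N } }
\]
```

Minimizing the right-hand side is what yields both a regularized co-clustering objective and a model order selection criterion: finer clusterings pay for themselves only if the drop in empirical loss outweighs the growth of the mutual information terms.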

Co-occurrence Data Analysis
(Figure: co-occurrence matrix over X1 (words) and X2 (documents))
Approached by:
– Co-clustering
– Probabilistic Latent Semantic Analysis
– ...
No theoretical comparison of the approaches
No model order selection criteria

Suggested Approach (Seldin & Tishby, AISTATS '09)
The true distribution p(X1,X2) (unrestricted)
Co-occurrence events are generated by p(X1,X2)
q(X1,X2) – a density estimator
Evaluate the ability of q to predict new co-occurrences (out-of-sample performance of q; see the criterion below)
Makes it possible to compare approaches
Provides model order selection
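The natural way to score the ability of q to predict new co-occurrences is out-of-sample log-loss; the held-out estimate below is my illustration of how it would be measured in practice:

```latex
% Out-of-sample performance of the density estimator q:
\[
  -\,\mathbb{E}_{p(X_1,X_2)} \ln q(X_1, X_2)
\]
% In practice, estimated on M held-out co-occurrence events (x_1^j, x_2^j):
\[
  -\,\frac{1}{M} \sum_{j=1}^{M} \ln q\big(x_1^j, x_2^j\big)
\]
```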

Density Estimation with Co-clustering (Seldin & Tishby, AISTATS '09)
(Figure: X1 and X2 are mapped to clusters C1 and C2 via q1(C1|X1) and q2(C2|X2); the model and the bound, which holds with probability ≥ 1-δ, were given as formulas on the slide)
Information-Theoretic Co-clustering [Dhillon et al. '03]: maximize I(C1;C2) alone
The PAC-Bayesian approach provides regularization and model order selection (see the schematic objective below)
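To make the contrast with [Dhillon et al. '03] concrete: information-theoretic co-clustering maximizes I(C1;C2) alone, whereas the PAC-Bayesian bound leads to an objective of roughly the following flavor. This is a schematic paraphrase, not the exact statement; the precise trade-off weights come out of the bound in the AISTATS '09 paper:

```latex
% Schematic regularized objective (my paraphrase): the weights \beta_1, \beta_2
% are determined by the bound and shrink as the sample grows, so more data
% justifies finer (higher-order) clusterings.
\[
  \max_{q_1(C_1 \mid X_1),\; q_2(C_2 \mid X_2)} \;\; I(C_1; C_2) \;-\; \beta_1\, I(X_1; C_1) \;-\; \beta_2\, I(X_2; C_2)
\]
```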

Future work
Formal analysis of clustering
– Points are generated by p(X), X ∈ R^d
– q(X) is an estimator of p(X), e.g., a mixture of Gaussians: q(X) = Σ_i λ_i N(μ_i, Σ_i)
– Evaluate −E_{p(X)} ln q(X) (see the sketch below)
– Model order selection
– Comparison of different approaches
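As an illustration of the proposed evaluation in this future-work setting (not part of the talk), here is a minimal Python sketch that scores Gaussian-mixture estimators by held-out log-likelihood, an empirical stand-in for −E_{p(X)} ln q(X), and uses it for model order selection. It assumes scikit-learn's GaussianMixture and uses synthetic data purely for demonstration:

```python
# Minimal sketch: select the number of mixture components by out-of-sample
# log-likelihood, an empirical proxy for -E_{p(X)} ln q(X).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic data for illustration only: three Gaussian blobs in R^2.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(300, 2))
               for c in ([0.0, 0.0], [3.0, 3.0], [0.0, 4.0])])
X_train, X_heldout = train_test_split(X, test_size=0.3, random_state=0)

best_k, best_score = None, -np.inf
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm.fit(X_train)
    score = gmm.score(X_heldout)  # mean held-out log-likelihood per point
    if score > best_score:
        best_k, best_score = k, score

print(f"selected number of components: {best_k} "
      f"(held-out log-likelihood per point: {best_score:.3f})")
```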

Relation to Other Approaches to Regularization and Model Order Selection in Clustering
Information Bottleneck (IB)
– Tishby, Pereira & Bialek '99; Slonim, Friedman & Tishby '06; ...
Minimum Description Length (MDL) principle
– Grünwald '07; ...
Stability
– Lange, Roth, Braun & Buhmann '04; Shamir & Tishby '08; Ben-David & Luxburg '08; ...

Relation with IB
= The “relevance variable” Y was a prototype of a “high-level task”
≠ IB does not analyze generalization directly
– Although there is a post-factum analysis [Shamir, Sabato & Tishby '08]
– There is a slight difference in the resulting trade-off
≠ IB returns the complete curve of the trade-off between compression level and quality of prediction (no model order selection)
≠ The PAC-Bayesian approach suggests the point that provides optimal prediction at a given sample size

Generalization ≠ MDL
= MDL returns a single optimal solution for a given sample size
= The resulting trade-off is similar
≠ Although the weighting is different
≠ MDL is not concerned with generalization
≠ MDL solutions can overfit the data
– [Kearns, Mansour, Ng & Ron '97], [Seldin '09]

Generalization ≠ Stability
Example: “Gaussian ring”
– Mixture-of-Gaussians estimation is not stable
– If we increase the sample size and the number of Gaussians to infinity, it will converge to the true distribution
The “meaning” of the clusters is different

Some high-level remarks (for future work)

Clustering and Humans
Clustering represents a structure of the world
By clustering objects we ignore their irrelevant properties and concentrate on the relevant ones
We communicate using a structured description of the world
There must be advantages to such a representation

What Kinds of Tasks Is Clustering Required For?
Classification – ???
Memory efficiency
Computational efficiency
Communication efficiency
Transfer learning
Control
Your guess...

Context-Dependent Nature of Clustering (Back to Humans)
Clustering is tightly related to object naming (the definition of categories)
Clustering can change according to our needs

Summary
In order to deliver better clustering algorithms and understand their outcome, we have to identify and formalize their potential applications
Clustering algorithms should be evaluated by their contribution in the context of their potential application