A PAC-Bayesian Approach to Formulation of Clustering Objectives Yevgeny Seldin Joint work with Naftali Tishby.

Slides:

Advertisements

Similar presentations

Shai Ben-David University of Waterloo Canada, Dec 2004 Generalization Bounds for Clustering - Some Thoughts and Many Questions.

Advertisements

Feature Selection as Relevant Information Encoding Naftali Tishby School of Computer Science and Engineering The Hebrew University, Jerusalem, Israel NIPS.

Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.

Pattern Analysis Prof. Bennett Math Model of Learning and Discovery 2/14/05 Based on Chapter 1 of Shawe-Taylor and Cristianini.

DATA MINING LECTURE 7 Minimum Description Length Principle Information Theory Co-Clustering.

Information Bottleneck EM School of Engineering & Computer Science The Hebrew University, Jerusalem, Israel Gal Elidan and Nir Friedman.

Model Assessment and Selection

CMPUT 466/551 Principal Source: CMU

Learning on Probabilistic Labels Peng Peng, Raymond Chi-wing Wong, Philip S. Yu CSE, HKUST 1.

Jun Zhu Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint.

2. Introduction Multiple Multiplicative Factor Model For Collaborative Filtering Benjamin Marlin University of Toronto. Department of Computer Science.

Probability based Recommendation System Course : ECE541 Chetan Tonde Vrajesh Vyas Ashwin Revo Under the guidance of Prof. R. D. Yates.

Information Bottleneck presented by Boris Epshtein & Lena Gorelick Advanced Topics in Computer and Human Vision Spring 2004.

Near-optimal Nonmyopic Value of Information in Graphical Models Andreas Krause, Carlos Guestrin Computer Science Department Carnegie Mellon University.

Predictive Automatic Relevance Determination by Expectation Propagation Yuan (Alan) Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani.

SLAM: Simultaneous Localization and Mapping: Part I Chang Young Kim These slides are based on: Probabilistic Robotics, S. Thrun, W. Burgard, D. Fox, MIT.

Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.

Conditional Random Fields

Gaussian Information Bottleneck Gal Chechik Amir Globerson, Naftali Tishby, Yair Weiss.

Collaborative Ordinal Regression Shipeng Yu Joint work with Kai Yu, Volker Tresp and Hans-Peter Kriegel University of Munich, Germany Siemens Corporate.

Visual Recognition Tutorial

Amos Storkey, School of Informatics. Density Traversal Clustering and Generative Kernels a generative framework for spectral clustering Amos Storkey, Tom.

Distributed Representations of Sentences and Documents

Part I: Classification and Bayesian Learning

The Power of Word Clusters for Text Classification Noam Slonim and Naftali Tishby Presented by: Yangzhe Xiao.

1 Collaborative Filtering: Latent Variable Model LIU Tengfei Computer Science and Engineering Department April 13, 2011.

Radial Basis Function Networks

Transfer Learning From Multiple Source Domains via Consensus Regularization Ping Luo, Fuzhen Zhuang, Hui Xiong, Yuhong Xiong, Qing He.

Methods in Medical Image Analysis Statistics of Pattern Recognition: Classification and Clustering Some content provided by Milos Hauskrecht, University.

Cao et al. ICML 2010 Presented by Danushka Bollegala.

Topic Models in Text Processing IR Group Meeting Presented by Qiaozhu Mei.

WEMAREC: Accurate and Scalable Recommendation through Weighted and Ensemble Matrix Approximation Chao Chen ⨳ , Dongsheng Li

Thesis Proposal PrActive Learning: Practical Active Learning, Generalizing Active Learning for Real-World Deployments.

Kernel Classifiers from a Machine Learning Perspective (sec ) Jin-San Yang Biointelligence Laboratory School of Computer Science and Engineering.

ECE 8443 – Pattern Recognition Objectives: Error Bounds Complexity Theory PAC Learning PAC Bound Margin Classifiers Resources: D.M.: Simplified PAC-Bayes.

Particle Filters for Shape Correspondence Presenter: Jingting Zeng.

Transfer Learning Task. Problem Identification Dataset : A Year: 2000 Features: 48 Training Model ‘M’ Testing 98.6% Training Model ‘M’ Testing 97% Dataset.

T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General Conditions for Predictivity in Learning Theory Michael Pfeiffer

1 A fast algorithm for learning large scale preference relations Vikas C. Raykar and Ramani Duraiswami University of Maryland College Park Balaji Krishnapuram.

Dual Transfer Learning Mingsheng Long 1,2, Jianmin Wang 2, Guiguang Ding 2 Wei Cheng, Xiang Zhang, and Wei Wang 1 Department of Computer Science and Technology.

Regularization and Feature Selection in Least-Squares Temporal Difference Learning J. Zico Kolter and Andrew Y. Ng Computer Science Department Stanford.

Bias and Variance of the Estimator PRML 3.2 Ethem Chp. 4.

Some Aspects of Bayesian Approach to Model Selection Vetrov Dmitry Dorodnicyn Computing Centre of RAS, Moscow.

Biointelligence Laboratory, Seoul National University

Guest lecture: Feature Selection Alan Qi Dec 2, 2004.

Cold Start Problem in Movie Recommendation JIANG CAIGAO, WANG WEIYAN Group 20.

Information Bottleneck versus Maximum Likelihood Felix Polyakov.

Bias and Variance of the Estimator PRML 3.2 Ethem Chp. 4.

Information-Theoretic Co- Clustering Inderjit S. Dhillon et al. University of Texas, Austin presented by Xuanhui Wang.

Learning Kernel Classifiers Chap. 3.3 Relevance Vector Machine Chap. 3.4 Bayes Point Machines Summarized by Sang Kyun Lee 13 th May, 2005.

NTU & MSRA Ming-Feng Tsai

Feature Selction for SVMs J. Weston et al., NIPS 2000 오장민 (2000/01/04) Second reference : Mark A. Holl, Correlation-based Feature Selection for Machine.

Collaborative Filtering via Euclidean Embedding M. Khoshneshin and W. Street Proc. of ACM RecSys, pp , 2010.

1 Chapter 8: Model Inference and Averaging Presented by Hui Fang.

Regression Variance-Bias Trade-off. Regression We need a regression function h(x) We need a loss function L(h(x),y) We have a true distribution p(x,y)

A PAC-Bayesian Approach to Formulation of Clustering Objectives Yevgeny Seldin Joint work with Naftali Tishby.

Information Bottleneck Method & Double Clustering + α Summarized by Byoung Hee, Kim.

PAC-Bayesian Analysis of Unsupervised Learning Yevgeny Seldin Joint work with Naftali Tishby.

Computational Intelligence: Methods and Applications Lecture 26 Density estimation, Expectation Maximization. Włodzisław Duch Dept. of Informatics, UMK.

Page 1 CS 546 Machine Learning in NLP Review 2: Loss minimization, SVM and Logistic Regression Dan Roth Department of Computer Science University of Illinois.

ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.

Predictive Automatic Relevance Determination by Expectation Propagation Y. Qi T.P. Minka R.W. Picard Z. Ghahramani.

Lecture 1.31 Criteria for optimal reception of radio signals.

Alan Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani

Bias and Variance of the Estimator

Advanced Artificial Intelligence

Bayesian Models in Machine Learning

Probabilistic Models with Latent Variables

Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.

What is The Optimal Number of Features

Presentation transcript:

A PAC-Bayesian Approach to Formulation of Clustering Objectives Yevgeny Seldin Joint work with Naftali Tishby

Motivation Clustering tasks are often ambiguous It is hard to compare between solutions, especially if based on different objectives Example

Motivation Many structures co- exist simultaneously The problem of comparison of solutions cannot be resolved by testing any property of the clustering itself Example

Motivation Clustering depends on our needs Example Recyclable Non-Recyclable

Motivation Inability to compare solutions is problematic for advancement and improvement Example

Thesis We do not cluster the data just for the sake of clustering, but rather to facilitate a solution of some higher level task The quality of clustering should be evaluated by its contribution to the solution of that task We should put more effort into identification and formulation of the tasks which are solved via clustering

Example Cluster then pack Clustering by shape is preferable Evaluate the amount of time saved

Proof of Concept Collaborative Filtering via Co-clustering Y Y Y C2C2 X 2 (movies) X 1 (viewers) Model: C1C1

Evaluation How well are we going to predict the new ratings Y Y Y C2C2 X 2 (movies) X 1 (viewers) C1C1

Analysis Expectation w.r.t. the true distribution p(X 1,X 2,Y) (unrestricted) Expectation w.r.t. the classifier q(Y|X 1,X 2 ) Given loss l(Y,Y’) Model-independent comparison –Does not depend on the form of q We can compare any two co-clusterings We can compare clustering-based solution to any other solution (e.g. Matrix Factorization)

PAC-Bayesian Analysis of Co-clustering Seldin & Tishby ICML`08

PAC-Bayesian Analysis of Co-clustering

Seldin & Tishby ICML`08 PAC-Bayesian Analysis of Co-clustering

We can compare any two co-clusterings We can find a locally optimal co-clustering We can compare clustering-based solution to any other solution (e.g. Matrix Factorization) Seldin & Tishby ICML`08 PAC-Bayesian Analysis of Co-clustering

Bound Meaning Trade-off between empirical performance and effective complexity

Application Replace with a trade-off Seldin ’09

Application Replace with a trade-off MovieLens dataset –100,000 ratings on 5-star scale –80,000 train ratings, 20,000 test ratings –943 viewers x 1682 movies –State-of-the-art Mean Absolute Error (0.72) –The optimal performance is achieved even with 300x300 cluster space Seldin ’09

Co-occurrence Data Analysis X 2 (documents) X 1 (words) Approached by –Co-clustering [Slonim&Tishby’01,Dhillon et.al.’03,...] –Probabilistic Latent Semantic Analysis [Hofmann’99,…] –… No theoretical comparison of the approaches No model order selection criteria

Suggested Approach Co-occurrence events are generated by p(X 1,X 2 ) q(X 1,X 2 ) – a density estimator X2X2 X1X1 Evaluate the ability of q to predict new co-occurrences (out-of-sample performance of q) Seldin & Tishby AISTATS`09

Suggested Approach The true distribution p(X 1,X 2 ) (unrestricted) Co-occurrence events are generated by p(X 1,X 2 ) q(X 1,X 2 ) – a density estimator X2X2 X1X1 Evaluate the ability of q to predict new co-occurrences (out-of-sample performance of q) Seldin & Tishby AISTATS`09 Possibility of comparison of approaches Model order selection

Density Estimation with Co-clustering q1(C1|X1)q1(C1|X1) q2(C2|X2)q2(C2|X2) X2X2 X1X1 Model: With probability ≥ 1-  : Seldin & Tishby AISTATS`09

Density Estimation with Co-clustering Model: With probability ≥ 1-  : Related work –Information-Theoretic Co-clustering [Dhillon et. al. ’03] : maximize I(C 1 ;C 2 ) alone –PAC-Bayesian approach provides regularization and model order selection Seldin & Tishby AISTATS`09

Future work Formal analysis of clustering –Points are generated by p(X), X  R d –q(X) is an estimator of p(X) E.g. Mixture of Gaussians: q(X) =  i i N(  i,  i ) –Evaluate –E p(X) ln q(X) –Model order selection –Comparison of different approaches

Relation to Other Approaches to Regularization and Model Order Selection in Clustering Information Bottleneck (IB) –[Tishby, Pereira & Bialek ’99, Slonim, Friedman & Tishby ’06, …] Minimum Description Length (MDL) principle –[Barron,Rissanen&Yu’98, Grünwald ’07, …] Stability –[Lange, Roth, Braun & Buhmann ’04, Shamir & Tishby ’08, Ben-David & Luxburg ’08, …]

Relation with IB =The “relevance variable” Y was a prototype of a “high level task” ≠IB does not analyze generalization directly –Although there is a post-factum analysis [Shamir,Sabato&Tishby ’08] –There is a slight difference in the resulting tradeoff ≠IB returns the complete curve of the trade-off between compression level and quality of prediction (no model order selection) ≠PAC-Bayesian approach suggests a point which provides optimal prediction at a given sample size

Generalization ≠ MDL MDL is not concerned with generalization MDL solutions can overfit the data –[Kearns,Mansour,Ng,Ron ’97], [Seldin ’09]

Generalization ≠ Stability Example: “Gaussian ring” –Mixture of Gaussians estimation is not stable –If we increase the size of the sample and the number of Gaussians to infinity it will converge to the true distribution “Meaning” of the clusters is different

Some high level remarks (For future work)

Clustering and Humans Clustering represents a structure of the world By clustering objects we ignore irrelevant properties of the objects and concentrate on the relevant ones (relevance is application dependent) We communicate by using a structured description of the world Clustering is tightly related to object naming There must be advantages to such a representation

What Kind of Tasks Clustering is Required for? Classification - ??? Memory efficiency Computational efficiency Communication efficiency Multi-task learning and Transfer learning Transfer learning and Control Robustness Your ideas…

Summary In order to deliver better clustering algorithms and understand their outcome we have to identify and formalize their potential applications Clustering algorithms should be evaluated by their contribution in the context of their potential application