Sufficient Dimensionality Reduction with Irrelevance Statistics
Amir Globerson (1), Gal Chechik (2), Naftali Tishby (1)
(1) Center for Neural Computation and School of Computer Science and Engineering, The Hebrew University
(2) Robotics Lab, CS Department, Stanford University

Goal: find a simple representation of X which captures its relation to Y.
Q1: Clustering is not always the most appropriate description (e.g. PCA vs. clustering).
Q2: Data may contain many structures, and not all of them are relevant for a given task.
[Figure: X related to Y through an unknown representation, marked "?"]

Talk layout
- Continuous feature extraction using SDR
- Using irrelevance data in unsupervised learning
- SDR with irrelevance data
- Applications

Continuous features
[Figure: a papers-by-terms count matrix, with papers ranging from Applications to Theory]
What would be a simple representation of the papers? Clustering is not a good solution here; the papers lie on a continuous scale.
Look at the mean number of words in the following groups: (figure, performance, improvement, empirical) and (equation, inequality, integral).
Better: look at weighted means (e.g. "figure" is only loosely related to results).
The means give a continuous index reflecting the content of the document.

Information in Expectations
Represent p(X|y) via the expected value of some function φ(x).
Look at $\langle \phi(x) \rangle_{p(x|y)}$ — a set of |Y| values representing p(X|y).
[Figure: the conditional distribution p(X|y) written out as the vector (p(x_1|y), p(x_2|y), ..., p(x_n|y))]
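For concreteness, a minimal numpy sketch of this representation; the term-by-group count table and the feature values φ(x) below are made up for illustration, not taken from the talk:

```python
import numpy as np

# Hypothetical term counts: rows are terms x, columns are document groups y.
counts = np.array([[30.0,  2.0,  1.0],
                   [25.0,  3.0,  2.0],
                   [ 1.0, 20.0, 18.0],
                   [ 2.0, 22.0, 25.0]])

# Conditional distribution p(x|y): one column per y.
p_x_given_y = counts / counts.sum(axis=0, keepdims=True)

# A hand-picked feature value phi(x) for every term x (hypothetical).
phi = np.array([1.0, 0.8, -0.9, -1.0])

# <phi(x)>_{p(x|y)}: one expectation per y -- a continuous index for each group.
expected_phi = phi @ p_x_given_y
print(expected_phi)
```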

Examples
- A weighted sum of word counts in a document can be informative about its content.
- Weighted grey levels in specific image areas may reveal its identity.
- Mean expression levels can reveal tissue identity.
But what are the best features to use? We need a measure of the information in expected values.

Quantifying information in expectations
Possible measures?
- A natural candidate is I(φ(X);Y), but it equals I(X;Y) for any 1-1 φ(x) and goes to H(Y) as n grows.
- We want to extract only the information related to the expected values.
- Idea: consider all distributions which have the given expected values, and choose the least informative one.

Quantifying information in expectations
Define the set of distributions which agree with p on the expected values of φ and on the marginals:
$$\tilde{P}(\phi, p) = \{\, q(x,y) : \langle \phi(x) \rangle_{q(x|y)} = \langle \phi(x) \rangle_{p(x|y)},\ q(x) = p(x),\ q(y) = p(y) \,\}$$
We define the information in measuring φ(x) on p(x,y) as
$$\tilde{I}[\phi] \equiv \min_{q \in \tilde{P}(\phi, p)} I_q(X;Y)$$

Sufficient Dimensionality Reduction (SDR)
Find the φ(x) which maximizes $\tilde{I}[\phi]$.
Equivalent to finding the maximum likelihood parameters for the exponential form
$$p(x,y) \propto \exp\big(\phi(x)\cdot\psi(y) + A(x) + B(y)\big)$$
Can be done using an iterative projection algorithm (GT, JMLR 03).
Produces useful features for document analysis.
But what if p(x,y) contains many structures?
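As a rough illustration of the maximum-likelihood view (not the iterative projection algorithm of GT, JMLR 03), the exponential form above can be fit by plain gradient ascent on the log-likelihood. The sketch below assumes finite X and Y given as an empirical joint table; the sizes and learning settings are made up:

```python
import numpy as np

def sdr_ml_fit(p_hat, d=1, lr=0.1, n_iter=2000, seed=0):
    """Fit p(x,y) ~ exp(phi(x).psi(y) + A(x) + B(y)) to an empirical joint
    p_hat (shape |X| x |Y|) by gradient ascent on the log-likelihood."""
    rng = np.random.default_rng(seed)
    nx, ny = p_hat.shape
    phi = rng.normal(scale=0.1, size=(nx, d))
    psi = rng.normal(scale=0.1, size=(ny, d))
    A = np.zeros(nx)
    B = np.zeros(ny)
    for _ in range(n_iter):
        logits = phi @ psi.T + A[:, None] + B[None, :]
        logits -= logits.max()            # numerical stability
        q = np.exp(logits)
        q /= q.sum()                      # model joint q(x,y)
        diff = p_hat - q                  # gradient = empirical minus model expectations
        g_phi = diff @ psi
        g_psi = diff.T @ phi
        phi += lr * g_phi
        psi += lr * g_psi
        A += lr * diff.sum(axis=1)
        B += lr * diff.sum(axis=0)
    return phi, psi

# Usage on a small made-up joint distribution:
p_hat = np.random.default_rng(1).random((20, 4))
p_hat /= p_hat.sum()
phi, psi = sdr_ml_fit(p_hat, d=2)
```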

Talk layout
- Feature extraction using SDR
- Irrelevance data in unsupervised learning
- Enhancement of SDR with irrelevance data
- Applications

Relevant and Irrelevant Structures
Data may contain structures we don't want to learn. For example:
- Face recognition: face geometry is important, illumination is not.
- Speech recognition: the spectral envelope is important, pitch is not (in English).
- Document classification: content is important, style is not.
- Gene classification: a given gene may be involved in pathological as well as normal pathways.
Relevance is not absolute; it is task dependent.

Irrelevance Data
Data sets which contain only irrelevant structures are often available (Chechik and Tishby, NIPS 2002):
- Images of one person under different illumination conditions
- Recordings of one word uttered in different intonations
- Documents of similar content but different styles
- Gene expression patterns from healthy tissues
Goal: find features which avoid the irrelevant ones.

Learning with Irrelevance Data
Given a model of the data f, let Q(f, D) be some quantifier of the goodness of the feature f on the dataset D (e.g. likelihood, information).
We want to find $\max_f \; Q(f, D^+) - Q(f, D^-)$.
This has been demonstrated successfully (CT, 2002) for the case where
- f = p(t|x), a soft clustering
- Q(f, D) = I(T;Y)
The principle is general and can be applied to any modeling scheme.
[Figure: the main data D+ alongside the irrelevance data D-]

Talk layout
- Information in expectations (SDR)
- Irrelevance data in unsupervised learning
- Enhancement of SDR with irrelevance data
- Applications

Adding Irrelevance Statistics to SDR
Using $\tilde{I}[\phi]$ as our goodness-of-feature quantifier, we can use two distributions: a relevant one, $p^+(x,y)$, and an irrelevant one, $p^-(x,y)$.
The optimal feature is then
$$\phi^*(x) = \arg\max_{\phi} \; L(\phi), \qquad L(\phi) = \tilde{I}_{p^+}[\phi] - \lambda\, \tilde{I}_{p^-}[\phi]$$
For λ = 0 we have SDR.

Calculating φ*(x)
When λ = 0, an iterative algorithm can be devised (Globerson and Tishby, 02).
Otherwise, the gradient of L(φ) can be calculated and ascended.
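A hedged sketch of the second route, ascending L(φ). The quantifier `info(phi, p)` below is a placeholder for the min-information measure (not an implementation from the paper), and a finite-difference gradient stands in for the analytic one:

```python
import numpy as np

def numerical_gradient(f, phi, eps=1e-4):
    """Central-difference gradient of a scalar function f at phi."""
    grad = np.zeros_like(phi)
    it = np.nditer(phi, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        orig = phi[idx]
        phi[idx] = orig + eps
        f_plus = f(phi)
        phi[idx] = orig - eps
        f_minus = f(phi)
        phi[idx] = orig                      # restore the entry
        grad[idx] = (f_plus - f_minus) / (2.0 * eps)
    return grad

def sdr_is_ascent(phi0, info, p_plus, p_minus, lam=1.0, lr=0.05, n_iter=100):
    """Ascend L(phi) = info(phi, p+) - lam * info(phi, p-).

    `info` is assumed to evaluate the information measure of the feature phi
    on a joint distribution; it is left abstract in this sketch."""
    phi = np.array(phi0, dtype=float)
    for _ in range(n_iter):
        L = lambda f: info(f, p_plus) - lam * info(f, p_minus)
        phi = phi + lr * numerical_gradient(L, phi)
    return phi
```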

Synthetic Example
[Figure: a synthetic joint-distribution pair — the relevant data D+ and the irrelevant data D-]

Phase Transitions
[Figure: behaviour of the extracted feature φ(x)]

Talk layout
- Feature extraction using SDR
- Irrelevance data in unsupervised learning
- SDR with irrelevance data
- Applications

Converting Images into Distributions
[Figure: pixels X and images Y — each image is treated as a distribution over pixel positions]
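A minimal sketch of such a conversion, assuming each image y is mapped to a distribution p(x|y) over pixel positions in proportion to its grey levels (the exact preprocessing used in the talk is not specified here):

```python
import numpy as np

def image_to_distribution(img):
    """Map a 2-D array of non-negative grey levels to a flat p(x|y)."""
    flat = np.asarray(img, dtype=float).ravel()
    return flat / flat.sum()

# Stacking the per-image distributions gives a conditional table p(X|Y):
images = [np.random.default_rng(i).random((32, 32)) for i in range(5)]  # placeholder images
p_x_given_y = np.stack([image_to_distribution(im) for im in images], axis=1)
```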

Extracting a single feature
The AR dataset consists of images of 50 men and women at different illuminations and postures.
We took the following distributions:
- Relevant: 50 men at two illumination conditions (right and left)
- Irrelevant: 50 women at the same illumination conditions
- Expected features: discriminate between men, but not between illuminations

Results for a single feature

Face clustering task
- Took 5 men with 26 different postures.
- The task is to cluster the images according to their identity.
- Took 26 images of another man as irrelevance data.
- Performed dimensionality reduction using several methods (PCA, OPCA, CPCA and SDR-IS) and measured precision on the reduced data.
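For reference, a hedged sketch of one way to run such an evaluation with scikit-learn. The talk does not spell out its precision measure, so cluster purity is used as a stand-in here, and PCA stands in for any of the reduction methods:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_precision(X, labels, n_components=5, n_clusters=5, seed=0):
    """X: (n_images, n_pixels) array; labels: integer identities."""
    Z = PCA(n_components=n_components).fit_transform(X)
    assign = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)
    correct = 0
    for c in range(n_clusters):
        members = labels[assign == c]
        if members.size:                 # count the majority identity in each cluster
            correct += np.bincount(members).max()
    return correct / len(labels)
```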

Precision results

Conclusions
- Presented a method for feature extraction based on the expected values of features of X.
- Showed how it can be augmented to avoid irrelevant structures.
Future Work
- Eliminate the dependence on the dimension of Y via compression constraints
- Extend to the multivariate case (graphical models)