Feature Selection as Relevant Information Encoding
Naftali Tishby
School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
NIPS 2001

Many thanks to: Noam Slonim, Amir Globerson, Bill Bialek, Fernando Pereira, Nir Friedman

Feature Selection?
NOT generative modeling!
– no assumptions about the source of the data
Extracting relevant structure from data
– functions of the data (statistics) that preserve information
Information about what? Approximate Sufficient Statistics.
Need a principle that is both general and precise.
– Good principles survive longer!

A Simple Example...

Simple Example

A new compact representation: the document clusters preserve the relevant information between the documents and the words.

[Figure: Documents and Words]

Mutual information
How much is X telling about Y? I(X;Y) is a function of the joint probability distribution p(x,y): the minimal number of yes/no questions (bits) we need to ask about x in order to learn all we can about Y. It is the uncertainty removed about X when we know Y:
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
[Figure: Venn diagram of H(X|Y), H(Y|X), and I(X;Y)]
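As a quick numerical illustration (not part of the original slides), here is a minimal Python sketch that computes I(X;Y) in bits directly from a tabulated joint distribution p(x,y); the example channel is only for demonstration:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint distribution p_xy[i, j] = p(x_i, y_j)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal p(x), column vector
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal p(y), row vector
    mask = p_xy > 0                            # convention: 0 * log 0 = 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Example: X uniform binary, Y = X flipped with probability 0.1
p = np.array([[0.45, 0.05],
              [0.05, 0.45]])
print(mutual_information(p))   # about 0.531 bits = H(Y) - H(Y|X) = 1 - H(0.1)
```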

Relevant Coding
What are the questions that we need to ask about X in order to learn about Y? We need to partition X into relevant domains, or clusters, between which we really need to distinguish...
[Figure: the conditional distributions P(x|y1) and P(x|y2) induce a partition of X into the regions X|y1 and X|y2]

Bottlenecks and Neural Nets
Auto-association: forcing compact representations through a bottleneck layer; the hidden layer is then a relevant code of the input w.r.t. the output.
[Figure: an auto-associative network (Input, bottleneck, Output; Sample 1, Sample 2) and a time series split into Past and Future]

Q: How many bits are needed to determine the relevant representation?
– We need to index the maximal number of non-overlapping green blobs inside the blue blob (the mutual information!).

Information Bottleneck
In rate-distortion theory the distortion function determines the relevant part of the pattern... but what if we do not know a distortion function, only a relevance variable? Examples:
– speech vs. its transcription
– images vs. object names
– faces vs. expressions
– stimuli vs. spike trains (neural codes)
– protein sequences vs. structure/function
– documents vs. text categories
– inputs vs. responses
– etc.

The idea: find a compressed signal T of X that needs only a short encoding (small I(X;T)) while preserving as much as possible of the information about the relevant signal Y (large I(T;Y)).

A Variational Principle We want a short representation of X that keeps the information about another variable, Y, if possible.
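The formula on this slide did not survive the transcript; the standard IB variational principle from Tishby, Pereira and Bialek (1999), written here with T denoting the compressed representation, is:

```latex
\min_{p(t \mid x)} \; \mathcal{L}\bigl[p(t \mid x)\bigr]
   \;=\; I(X;T) \;-\; \beta\, I(T;Y),
\qquad T \leftrightarrow X \leftrightarrow Y \ \text{(Markov chain)}, \quad \beta \ge 0,
```

where β trades compression of X against preservation of the information about Y.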

The Self-Consistent Equations
– Marginal: p(t) = Σ_x p(t|x) p(x)
– Markov condition: p(y|x,t) = p(y|x) (T depends on Y only through X)
– Bayes' rule: p(x|t) = p(t|x) p(x) / p(t)
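Combining these conditions with the variational principle gives the exponential self-consistent solution of the original IB paper (Z(x, β) is a normalization factor):

```latex
p(t \mid x) \;=\; \frac{p(t)}{Z(x,\beta)}\,
  \exp\!\Bigl(-\beta\, D_{\mathrm{KL}}\bigl[\,p(y \mid x)\,\big\|\,p(y \mid t)\,\bigr]\Bigr),
\qquad
p(y \mid t) \;=\; \frac{1}{p(t)} \sum_{x} p(y \mid x)\, p(t \mid x)\, p(x).
```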

Effective distortion
The effective distortion measure that emerges is d(x,t) = D_KL[ p(y|x) || p(y|t) ]. It is regular if p(y|x) is absolutely continuous w.r.t. p(y|t), and small if t predicts y as well as x does.

The iterative algorithm (generalized Blahut-Arimoto): alternate the self-consistent equations, updating p(t|x), p(t), and p(y|t) in turn until convergence.
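A minimal NumPy sketch of these alternating updates (the function name, random initialization, and fixed iteration count are illustrative choices, not taken from the slides):

```python
import numpy as np

def ib_iterate(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Generalized Blahut-Arimoto iterations for the Information Bottleneck.

    p_xy: joint distribution of shape (|X|, |Y|); returns p(t|x), p(t), p(y|t).
    """
    rng = np.random.default_rng(seed)
    eps = 1e-12
    n_x, _ = p_xy.shape
    p_x = p_xy.sum(axis=1)                        # p(x)
    p_y_given_x = p_xy / p_x[:, None]             # p(y|x)

    # random soft assignment p(t|x)
    p_t_given_x = rng.random((n_x, n_clusters))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = p_x @ p_t_given_x                               # p(t) = sum_x p(x) p(t|x)
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_t /= p_t[:, None] + eps                     # p(y|t)

        # effective distortion d(x,t) = KL[ p(y|x) || p(y|t) ]
        log_ratio = np.log(p_y_given_x[:, None, :] + eps) - np.log(p_y_given_t[None, :, :] + eps)
        d = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=2)

        # p(t|x) proportional to p(t) exp(-beta * d(x,t))
        log_p = np.log(p_t + eps)[None, :] - beta * d
        log_p -= log_p.max(axis=1, keepdims=True)             # numerical stability
        p_t_given_x = np.exp(log_p)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    return p_t_given_x, p_t, p_y_given_t
```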

The Information Bottleneck Algorithm: the alternating updates minimize an IB "free energy", F = I(X;T) + β ⟨ D_KL[ p(y|x) || p(y|t) ] ⟩, which is non-increasing over the iterations.

The information plane: the optimal I(T;Y) for a given I(X;T) is a concave function of I(X;T). Points below this curve are possible; points above it are impossible.

Regression as relevant encoding
Extracting relevant information is fundamental for many problems in learning. Regression: knowing the parametric class f(x;θ), with additive Gaussian noise, we can calculate p(X,Y) without sampling!
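The slide's formula is missing from the transcript; under the Gaussian-noise assumption above, the joint distribution follows directly from the parametric class, for example:

```latex
y = f(x;\theta) + \nu, \quad \nu \sim \mathcal{N}(0,\sigma^{2})
\;\;\Longrightarrow\;\;
p(x, y \mid \theta) \;=\; p(x)\,\frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{\bigl(y - f(x;\theta)\bigr)^{2}}{2\sigma^{2}}\right).
```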

Manifold of relevance
Assuming a continuous manifold for the representation, the self-consistent equations become coupled (locally defined) eigenfunction equations, with the trade-off parameter β playing the role of an eigenvalue.

Generalization as relevant encoding
The two-sample problem: what is the probability that two samples come from one source? Knowing the function class we can estimate p(X,Y); convergence depends on the class complexity.

Document classification - information curves

Multivariate Information Bottleneck
– Complex relationships between many variables
– Multiple unrelated dimensionality reduction schemes
– Trade between known and desired dependencies
– Express IB in the language of graphical models
– A multivariate extension of Rate-Distortion Theory

Multivariate Information Bottleneck: Extending the dependency graphs (Multi-information)
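For reference, the multi-information of a set of variables, the quantity that the multivariate IB trades off across the dependency graphs, is the natural generalization of mutual information:

```latex
\mathcal{I}(X_1;\dots;X_n)
\;=\; D_{\mathrm{KL}}\!\left[\, p(x_1,\dots,x_n) \;\Big\|\; \prod_{i=1}^{n} p(x_i) \,\right].
```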

Sufficient Dimensionality Reduction (with Amir Globerson)
Exponential families have sufficient statistics. Given a joint distribution, find the best approximation of exponential form with d feature functions. This can be done by alternating maximization of entropy under expectation constraints; the resulting functions are our relevant features at rank d.
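A sketch of the exponential form the slide refers to, as it appears in the later Globerson and Tishby SDR work (the exact handling of the marginals varies between formulations, so take this as indicative):

```latex
\tilde{p}(x,y) \;\propto\;
  \exp\!\left( \sum_{i=1}^{d} \phi_i(x)\,\psi_i(y) \right),
```

where the d feature pairs (φ_i, ψ_i) are found by alternating maximum-entropy steps under constraints on the expectations of the features.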

Conclusions
There may be a single principle behind...
– noise filtering
– time series prediction
– categorization and classification
– feature extraction
– supervised and unsupervised learning
– visual and auditory segmentation
– clustering
– self-organized representation
– ...

Summary
We present a general information-theoretic approach for extracting relevant information. It is a natural generalization of Rate-Distortion theory, with similar convergence and optimality proofs, and it unifies learning, feature extraction, filtering, and prediction... Applications (so far) include:
– word sense disambiguation
– document classification and categorization
– spectral analysis
– neural codes
– bioinformatics
– data clustering based on multi-distance distributions
– ...