Bregman Information Bottleneck. NIPS'03, Whistler, December 2003. Koby Crammer, Hebrew University of Jerusalem; Noam Slonim, Princeton University.

Motivation
- Extend the IB to a broad family of representations
- Relation to the exponential family
- Example representations: text ("Hello, world"), multinomial distributions, vectors

Outline
- Rate-Distortion Formulation
- Bregman Divergences
- Bregman IB
- Statistical Interpretation
- Summary

Information Bottleneck
(Diagram relating X, T, and Y: T is a compressed representation of X that preserves information about Y.) Each input x is represented by the conditional distribution [p(y=1|x), …, p(y=n|x)]; each cluster t is represented by [p(y=1|t), …, p(y=n|t)].

Rate-Distortion Formulation
- Input variables
- Distortion
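The equations on this slide do not survive in the transcript; as a minimal sketch, assuming the standard IB rate-distortion setup of Tishby, Pereira, and Bialek that the talk builds on:

  % Input variables: the joint distribution p(x, y); T is a compressed representation of X.
  % Distortion between an input x and a cluster (prototype) t:
  \[ d(x, t) = D_{\mathrm{KL}}\!\left[\, p(y \mid x) \,\middle\|\, p(y \mid t) \,\right] \]
  % Rate-distortion trade-off, minimized over the soft assignments p(t | x):
  \[ \min_{p(t \mid x)} \; I(T; X) + \beta \, \big\langle d(x, t) \big\rangle_{p(x)\, p(t \mid x)} \]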

Self-Consistent Equations
- Boltzmann distribution
- Markov + Bayes
- Marginal
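The three equations themselves are missing from the transcript; these labels presumably refer to the standard IB self-consistent equations, reproduced here as a sketch:

  % Boltzmann distribution over cluster assignments:
  \[ p(t \mid x) = \frac{p(t)}{Z(x, \beta)} \exp\!\big(-\beta \, D_{\mathrm{KL}}[\, p(y \mid x) \,\|\, p(y \mid t) \,]\big) \]
  % Markov + Bayes: the cluster representatives in Y-space,
  \[ p(y \mid t) = \frac{1}{p(t)} \sum_x p(y \mid x)\, p(t \mid x)\, p(x) \]
  % Marginal:
  \[ p(t) = \sum_x p(t \mid x)\, p(x) \]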

Bregman Divergences
For a strictly convex, differentiable function f: S → R, the Bregman divergence is
B_f(v || u) = f(v) - [ f(u) + f'(u)(v - u) ],
i.e., the gap at v between f and its tangent at u. (Figure: the points (u, f(u)), (v, f(v)), and (v, f(u) + f'(u)(v - u)) on the graph of f.)

Bregman IB: Rate-Distortion Formulation
- Functional
- Bregman function
- Input variables
- Distortion
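Again the equations are lost in the transcript; a plausible sketch, assuming the Bregman IB simply replaces the KL distortion by a Bregman divergence (consistent with the special cases listed two slides below):

  % Bregman function f, with inputs x and prototypes t in the domain S of f.
  % Distortion:
  \[ d(x, t) = B_f(x \,\|\, t) \]
  % Functional:
  \[ \min_{p(t \mid x)} \; I(T; X) + \beta \, \big\langle B_f(x \,\|\, t) \big\rangle_{p(x)\, p(t \mid x)} \]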

Self-Consistent Equations
- Boltzmann distribution
- Prototypes: convex combination of the input vectors
- Marginal
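As a sketch of what these bullets likely stand for, assuming the equations mirror the IB ones with B_f in place of KL (the prototype update as a convex combination matches Bregman clustering, where the optimal representative is always the weighted mean):

  % Boltzmann assignments:
  \[ p(t \mid x) = \frac{p(t)}{Z(x, \beta)} \exp\!\big(-\beta \, B_f(x \,\|\, \mu_t)\big) \]
  % Prototypes: a convex combination of the input vectors,
  \[ \mu_t = \sum_x p(x \mid t)\, x \]
  % Marginal:
  \[ p(t) = \sum_x p(t \mid x)\, p(x) \]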

Special Cases
Information Bottleneck:
- Bregman function: f(x) = x log(x) - x
- Domain: simplex
- Divergence: Kullback-Leibler
Soft K-means:
- Bregman function: f(x) = (1/2) x^2
- Domain: R^n
- Divergence: Euclidean distance
- [Still, Bialek, Bottou, NIPS 2003]
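A small numerical sketch of how these two choices of f recover the two divergences, for anyone who wants to check the algebra (the bregman() helper is hypothetical; NumPy assumed):

import numpy as np

def bregman(f, grad_f, v, u):
    """Bregman divergence B_f(v || u) = f(v) - f(u) - <grad f(u), v - u>."""
    return f(v) - f(u) - np.dot(grad_f(u), v - u)

# f(x) = sum_i x_i log x_i - x_i  -> generalized KL; plain KL on the simplex.
f_ib = lambda x: np.sum(x * np.log(x) - x)
grad_ib = lambda x: np.log(x)

# f(x) = (1/2) ||x||^2  -> half the squared Euclidean distance.
f_km = lambda x: 0.5 * np.dot(x, x)
grad_km = lambda x: x

p = np.array([0.2, 0.3, 0.5])    # both points lie on the simplex
q = np.array([0.25, 0.25, 0.5])

print(bregman(f_ib, grad_ib, p, q), np.sum(p * np.log(p / q)))   # equal: KL divergence
print(bregman(f_km, grad_km, p, q), 0.5 * np.sum((p - q) ** 2))  # equal: (1/2)||p - q||^2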

Bregman IB
(Diagram relating Bregman IB to the Information Bottleneck, Bregman Clustering, Rate-Distortion, and the Exponential Family.)

Expectation Parameters
Examples (single dimension):
- Normal
- Poisson
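The definitions themselves are missing from the transcript; as a sketch of the standard notions these labels refer to:

  % Exponential family in natural form, with log-partition \psi:
  \[ p(x \mid \theta) = h(x)\, \exp\!\big(\theta x - \psi(\theta)\big) \]
  % Expectation parameter:
  \[ \mu = \mathbb{E}_\theta[x] = \psi'(\theta) \]
  % Normal with known variance \sigma^2:  \theta = \mu / \sigma^2,  \psi(\theta) = \sigma^2 \theta^2 / 2,  expectation parameter = \mu.
  % Poisson with rate \lambda:  \theta = \log \lambda,  \psi(\theta) = e^{\theta},  expectation parameter = \lambda.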

Expectation parameters:  Properties :  Exponential Family and Bregman Divergences

Illustration

Expectation parameters:  Properties :   Exponential Family and Bregman Divergences

Back to Distributional Clustering
- Distortion
- Data vectors and prototypes: expectation parameters
- Question: for which exponential-family distribution do we obtain this distortion?
- Answer: Poisson
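A sketch of why the answer is Poisson, assuming the distortion in question is the IB choice f(x) = x log(x) - x from the special-cases slide:

  % For the Poisson family, \psi(\theta) = e^{\theta}, so the conjugate is
  \[ f(\mu) = \psi^*(\mu) = \mu \log \mu - \mu \]
  % and the induced Bregman divergence between an observed count x and a prototype \mu is
  \[ B_f(x \,\|\, \mu) = x \log\frac{x}{\mu} - x + \mu, \]
  % the (generalized) KL divergence, i.e., exactly the IB distortion on expectation parameters.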

Multinomial Distribution / Product of Poisson Distributions
(Illustration: a sample sequence "a a b a a a b a a a" with empirical frequencies Pr(a) = 0.8, Pr(b) = 0.2, viewed either as a multinomial draw over the alphabet {a, b} or as independent Poisson counts per symbol.)
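The point of the illustration is presumably the standard relation between the two models, sketched here:

  % If the symbol counts are independent Poissons, n_y \sim \mathrm{Poisson}(\lambda p_y), then
  % conditioned on the total count N = \sum_y n_y the counts are multinomial:
  \[ (n_1, \dots, n_k) \mid N \;\sim\; \mathrm{Multinomial}(N; \, p_1, \dots, p_k). \]
  % So clustering count vectors under the Poisson/KL distortion matches the distributional
  % clustering of the original Information Bottleneck.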

Back to Distributional Clustering
- Information Bottleneck: distributional clustering of Poisson distributions
- (Soft) k-means: (soft) clustering of Normal distributions

Maximum Likelihood Perspective
- Distortion
- Input: observations
- Output: parameters of a distribution
- IB functional
- EM [Elidan & Friedman]
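The transcript drops the equations here as well; a sketch of the standard link between the Boltzmann assignments and EM, assuming the exponential-family/Bregman correspondence above:

  % EM's free energy for a mixture with responsibilities q(t | x):
  \[ F(q, \theta) = \sum_x p(x) \sum_t q(t \mid x)
       \Big[ \log \pi_t + \log p(x \mid \theta_t) - \log q(t \mid x) \Big] \]
  % Maximizing over q gives q(t \mid x) \propto \pi_t\, p(x \mid \theta_t)
  %   \propto \pi_t \exp(-B_f(x \,\|\, \mu_t)),
  % which is the Boltzmann assignment at \beta = 1; maximizing over \theta_t gives the
  % convex-combination prototype update.  This is one way to read the "IB functional / EM" bullet.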

Back to the Self-Consistent Equations
- Posterior
- Partition function: a weighted β-norm of the likelihood
- β → ∞: the most likely cluster dominates
- β → 0: the clusters collapse into a single prototype
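A sketch of the partition-function reading, assuming the Bregman/exponential-family correspondence so that exp(-B_f(x || mu_t)) is proportional to p(x | theta_t):

  % Partition function of the Boltzmann assignments:
  \[ Z(x, \beta) = \sum_t p(t)\, e^{-\beta\, B_f(x \,\|\, \mu_t)}
                 \;\propto\; \sum_t p(t)\, p(x \mid \theta_t)^{\beta}, \]
  % i.e. a weighted \beta-norm (power mean) of the likelihoods.
  % \beta \to \infty: the largest likelihood dominates and assignments become hard.
  % \beta \to 0: p(t \mid x) \to p(t) for every x, so all prototypes collapse into one.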

Summary
Bregman Information Bottleneck:
- Clustering/compression for many representations and divergences
Statistical interpretation:
- Clustering of distributions from the exponential family
- EM-like formulation
Current work:
- Algorithms
- Characterize the distortion measures that also yield Boltzmann distributions
- General distortion measures