Minimum Information Inference


1 Minimum Information Inference
Naftali Tishby and Amir Globerson, ICNC, CSE, The Hebrew University. TAU, Jan. 2, 2005

2 Abuse of knowledge: There are alternative approaches!
Successful bounds are for typical (not worst-case) behavior (e.g. StatMech, IT)!
Specific models and “physics-like” analysis: much tighter (semi-)rigorous bounds for specific cases, and better UNDERSTANDING of the data (e.g. phase transitions, see HKST 97).
Provides interesting insight into trading computational complexity against sample size (e.g. the Ising perceptron).
Too much focus on the abilities of certain mathematical methods; our theoretical knowledge is MUCH wider!

3 Rigorous bounds from Statistical Mechanics
Key quantity: the density of ε-good solutions near the permissible entropy bound

4 Generalization in unsupervised learning: what is the question?
Find “interesting” sample properties that are stable from sample to sample.
One interpretation: separating variables. Find T such that P(S1,S2|T) = P(S1|T) P(S2|T).
T can be clusters, low-dimensional manifolds, features, sufficient statistics.
In most cases T can only be approximated, and the complexity of T grows with sample size!
Interesting question: HOW does it grow?

5 Talk outline
Classification with probabilistic models: generative vs. discriminative
The Minimum Information Principle
Generalization error bounds
Game-theoretic motivation
Joint typicality
The MinMI algorithms
Empirical evaluations
Related extensions: SDR and IB

6 The Classification Problem
Learn how to classify (complex) observations X into (simple) classes Y.
Given labeled examples (xi,yi), use them to construct a classifier y=g(x).
What is a good classifier? Denote by p*(x,y) the true underlying law.
We want to minimize the generalization error (written out below).
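The generalization-error expression appeared as a formula image on the original slide; a standard way to write it, using the p*(x,y) defined above, is:

    \mathrm{err}(g) \;=\; \Pr_{(x,y)\sim p^*}\big[\, g(x) \neq y \,\big]
                  \;=\; \sum_{x,y} p^*(x,y)\,\mathbf{1}\{ g(x) \neq y \}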

7 Observed, Learned, Truth (diagram slide)
From the observed sample (xi,yi), i=1…n, we learn a classifier y=g(x); the truth is the unknown p*(x,y). The generalization error depends on p*(x,y) and therefore cannot be computed directly.

8 Choosing a classifier
Need to limit the search to some set of rules; if every rule is possible we will surely over-fit.
Use a family gθ(x), where θ is a parameter.
It would be nice if the true rule were in the family gθ(x).
How do we choose θ in gθ(x)?

9 Common approach: Empirical Risk Minimization
A reasonable strategy: find the classifier which minimizes the empirical (sample) error (written out below). It does not necessarily provide the best generalization, although theoretical bounds exist. It is also computationally hard to minimize directly, so many works minimize upper bounds on the error. Here we focus on a different strategy.
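The empirical-error objective was a formula image on the original slide; written out for the sample (xi,yi), i=1…n, and the family gθ:

    \hat{\theta} \;=\; \arg\min_{\theta}\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ g_{\theta}(x_i) \neq y_i \}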

10 Probabilistic models for classification
Had we known p*(x,y), the optimal predictor would be the Bayes rule written below. But we don't know it; we can try to estimate it. Two general approaches: generative and discriminative.
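The optimal predictor appeared as a formula image; the standard Bayes-optimal rule consistent with the surrounding text is:

    g^{*}(x) \;=\; \arg\max_{y}\; p^{*}(y \,|\, x) \;=\; \arg\max_{y}\; p^{*}(x, y)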

11 Generative Models
Assume p(x|y) has some parametric form, e.g. a Gaussian. Each y has a different set of parameters θy. How do we estimate θy and p(y)? Maximum likelihood!

12 Generative Models -Estimation
It is easy to see that p(y) should be set to the empirical frequency of the classes. The parameters θy are obtained by collecting all x values for class y and computing a maximum-likelihood estimate.

13 Example: Gaussians
Assume the class-conditional distribution p(x|y) is Gaussian. Then the maximum-likelihood parameters are the empirical mean and variance of the samples in class y (written out below). (The original slide showed the two fitted class densities, labeled y=1 and y=2.)
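The estimates themselves were shown as a formula image; for the samples falling in class y they are the usual maximum-likelihood estimates:

    \hat{\mu}_y \;=\; \frac{1}{n_y} \sum_{i:\,y_i=y} x_i, \qquad
    \hat{\sigma}_y^{2} \;=\; \frac{1}{n_y} \sum_{i:\,y_i=y} (x_i - \hat{\mu}_y)^2, \qquad
    n_y \;=\; |\{ i : y_i = y \}|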

14 Example: Naïve Bayes
Say X=[X1,…,Xn] is an n-dimensional observation. Assume the coordinates are independent given the class, i.e. p(x|y) = Πi p(xi|y). The parameters are p(xi=k|y), calculated by counting how many times xi=k in class y; these are empirical means of indicator functions (a counting sketch follows).
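A minimal counting sketch of these estimates, assuming discrete features stored in an integer array X of shape (num_samples, num_features) and class labels y (the function and variable names here are illustrative, not from the original slides):

    import numpy as np

    def naive_bayes_counts(X, y, num_values):
        """Estimate p(x_i = k | y) and p(y) by counting, as described above."""
        classes = np.unique(y)
        num_features = X.shape[1]
        # p_xy[c, i, k] holds the estimate of p(x_i = k | y = classes[c])
        p_xy = np.zeros((len(classes), num_features, num_values))
        p_y = np.zeros(len(classes))
        for ci, c in enumerate(classes):
            Xc = X[y == c]
            p_y[ci] = len(Xc) / len(X)              # empirical class frequency
            for i in range(num_features):
                counts = np.bincount(Xc[:, i], minlength=num_values)
                p_xy[ci, i] = counts / counts.sum() # empirical mean of indicator functions
        return p_y, p_xy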

15 Generative Classifiers: Advantages
Sometimes it makes sense to assume a generation process for p(x|y) (e.g. speech or DNA). Estimation is easy. Closed form solutions in many cases (through empirical means). The parameters can be estimated with relatively high confidence from small samples (e.g. empirical mean and variance). See Ng and Jordan (2001). Performance is not bad at all.

16 Discriminative Classifiers
But to classify we only need p(y|x); why not estimate it directly? Generative classifiers (implicitly) estimate p(x), which is not really needed or known. Instead, assume a parametric form pθ(y|x) for the conditional distribution directly.

17 Discriminative Models - Estimation
Choose y to maximize conditional likelihood Estimation is usually not in closed form. Requires iterative maximization (gradient methods etc).

18 Example: logistic regression
Assume p(x|y) are Gaussians with different means and the same variance. Then p(y|x) takes the form written below, and the goal is to estimate the parameters ay, by. This is called logistic regression, since the log of the distribution (up to normalization) is linear in x.
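The resulting form, reconstructed from the equal-variance Gaussian assumption above, is the familiar soft-max in the parameters ay, by:

    p(y \,|\, x) \;=\; \frac{\exp(a_y \cdot x + b_y)}{\sum_{y'} \exp(a_{y'} \cdot x + b_{y'})}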

19 Discriminative Naïve Bayes
Assuming p(x|y) is in the Naïve Bayes class, the discriminative distribution takes the log-linear form written below. It is similar to Naïve Bayes, but the ψ(x,y) functions are not distributions, which is why we need the additional normalization Z. This is also called a conditional first-order log-linear model.
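A reconstruction of that form, consistent with the ψ(x,y) functions and the normalization Z mentioned on the slide (the per-coordinate functions ψi play the role of the Naïve Bayes factors):

    p(y \,|\, x) \;=\; \frac{1}{Z(x)} \exp\Big( \sum_i \psi_i(x_i, y) \Big), \qquad
    Z(x) \;=\; \sum_{y'} \exp\Big( \sum_i \psi_i(x_i, y') \Big)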

20 Discriminative: Advantages
Estimates only the relevant distributions (important when X is very complex). Often outperforms generative models for large enough samples (see Ng and Jordan, 2001). Can be shown to minimize an upper bound on the classification error.

21 The best of both worlds…
Generative models (often) employ empirical means which are easy and reliable to estimate. But they model each class separately so poor discrimination is obtained. We would like a discriminative approach based on empirical means.

22 Learning from Expected values (observations, in physics)
Assume we have some “interesting” observables φ(X), and that we are given their sample empirical means for the different classes Y (for example, the first two moments of each class). How can we use this information to build a classifier? Idea: look for models which reproduce the observed expectations but contain no other information.

23 The MaxEnt approach The Entropy H(X,Y) is a measure of uncertainty
(and typicality!) Find the distribution with the given empirical means and maximum joint entropy H(X,Y) (Jaynes 57, …). It is “least committed” to the observations and most typical, and it yields “nice” exponential forms (see below).
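One way to write the resulting exponential form, for class-conditional mean constraints on the observables φ(x) with λy the corresponding Lagrange multipliers (a reconstruction, since the formula was an image):

    p_{\mathrm{ME}}(x \,|\, y) \;=\; \frac{1}{Z_y} \exp\big( \lambda_y \cdot \phi(x) \big), \qquad
    Z_y \;=\; \sum_x \exp\big( \lambda_y \cdot \phi(x) \big)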

24 Occam’s in Classification
Minimum assumptions about X and Y imply independence. But because X behaves differently for different Y, they cannot be independent. How can we quantify their level of dependence? (The original slide illustrated this with two class-conditional densities, p(x|y=1) and p(x|y=2), with means m1 and m2 on the X axis.)

25 Mutual Information (Shannon 48)
The measure of the information shared by two variables (definition below). X and Y are independent iff I(X;Y)=0. It bounds the Bayes classification error: eBayes ≤ 0.5(H(Y)-I(X;Y)) (Hellman and Raviv, 1970). Why not minimize it subject to the observation constraints?
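The standard definition, filling in the formula that appeared as an image:

    I(X;Y) \;=\; \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
           \;=\; H(Y) - H(Y \,|\, X)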

26 More on Mutual Information…
I(X;Y) is the unique functional (up to units) that quantifies the notion of information in X about Y in a covariant way.
Mutual information is the generating functional for both source coding (minimization) and channel coding (maximization).
It quantifies independence in a model-free way.
It has a natural multivariate extension, I(X1,…,Xn).

27 MinMI: Problem Setting
Given a sample (x1,y1),…,(xn,yn):
For each y, calculate the empirical expected value of the observables φ(X).
Calculate the empirical marginal p(y).
Find the minimum mutual information distribution with the given empirical expected values.
The value of the minimum information is precisely the information in the observations!

28 MinMI Formulation
The (convex) set of constraints and the information-minimizing distribution are written below (they appeared as formulas on the original slide). This is a convex problem: no local minima!
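A reconstruction of the formulation from the problem setting on the previous slide, with fixed empirical class marginal p(y), observables φ(x), and class-conditional empirical means ay:

    \mathcal{F}(a) \;=\; \Big\{\, p(x|y) \;:\; \sum_x p(x|y)\,\phi(x) = a_y, \ \ \sum_x p(x|y) = 1 \ \ \forall y \,\Big\}

    p_{\mathrm{MI}} \;=\; \arg\min_{p \in \mathcal{F}(a)} I(X;Y), \qquad
    \text{where } p(x) = \sum_y p(y)\, p(x|y)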

29 The problem is convex given p(y), for any empirical means, without specifying p(x).
The minimization generates an auxiliary sparse pMI(x): “support assignments”. (The original slide showed a figure comparing pMI and p.)

30 Characterizing the solution form
The minimizer has the exponential form written below, where λ(y) are Lagrange multipliers and pMI(x) is the induced marginal. Via Bayes this gives pMI(y|x), which can be used for classification. But how do we find it?

31 Careful… I cheated… What if pMI(x)=0 ? No legal pMI(y|x) …
But we can still define f(y|x) as below, show that it is sub-normalized, and use f(y|x) for classification! The solutions are actually very sparse: many pMI(x) are zero (“support assignments”…).
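A reconstruction of the definition, extending the Bayes expression of the previous slide to all x so that it remains well defined even when pMI(x)=0; the sub-normalization claimed on the slide then reads:

    f(y|x) \;=\; p(y)\, \exp\big( \lambda_y \cdot \phi(x) + \nu_y \big), \qquad
    \sum_y f(y|x) \;\le\; 1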

32 A dual formulation
Using convex duality we can show that MinMI can be reformulated as an optimization over the multipliers alone (the dual appeared as a formula on the original slide). The result is a geometric program, with strict inequalities for x such that p(x)=0, and it avoids dealing with p(x) at all!

33 A generalization bound
If the estimated means are equal to their true expected values, we can show that the generalization error is bounded in terms of fMI(y|x). (The bound appeared on the original slide as a formula involving -log2 fMI(y|x), together with an illustrative figure.)

34 A Game Theoretic Interpretation
Among all distributions in F(a), why choose MinMI? The MinMI classifier minimizes the worst-case loss over the class, and the loss is an upper bound on the generalization error. So MinMI minimizes a worst-case upper bound.

35 MinMI and Joint Typicality
Given a sequence, the probability that another, independently drawn sequence appears jointly typical with it under their joint distribution is asymptotically 2^{-n I(X;Y)} (the joint AEP). This suggests minimum mutual information (MinMI) as a general principle for joint (typical) inference.

36 I-Projections (Csiszar 75, Amari 82,…)
The I-projection of a distribution q(x) onto a set F is defined below. For a set defined by linear constraints it takes an exponential form and can be calculated using Generalized Iterative Scaling or gradient methods. Looks familiar?
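The standard definitions, filling in the formulas that appeared as images; for linear expectation constraints the projection is an exponential tilt of q:

    p^{*} \;=\; \arg\min_{p \in F} D(p \,\|\, q), \qquad
    D(p \,\|\, q) \;=\; \sum_x p(x) \log \frac{p(x)}{q(x)}

    F \;=\; \Big\{ p : \sum_x p(x)\,\phi(x) = a \Big\}
    \quad\Longrightarrow\quad
    p^{*}(x) \;\propto\; q(x)\, e^{\lambda \cdot \phi(x)}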

37 The MinMI Algorithm
Initialize pMI(x) (e.g. uniformly) and iterate:
For all y: set pMI(x|y) to the I-projection of the current pMI(x) onto the constraint set for class y.
Marginalize: pMI(x) = Σy p(y) pMI(x|y).
(A runnable sketch of this alternating procedure follows.)
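A minimal sketch of the alternating procedure described above for a finite X, not the authors' exact implementation; the I-projection is computed through its convex dual (minimize log Σx q(x)exp(λ·φ(x)) - λ·a over λ). The names i_project, minmi, phi, a_y are illustrative:

    import numpy as np
    from scipy.optimize import minimize

    def i_project(q, phi, a):
        """I-project q(x) onto {p : sum_x p(x) phi(x) = a}; phi has shape (|X|, d)."""
        def dual(lam):
            logits = np.log(q + 1e-300) + phi @ lam
            m = logits.max()
            return m + np.log(np.exp(logits - m).sum()) - lam @ a
        lam = minimize(dual, np.zeros(phi.shape[1]), method="BFGS").x
        logits = np.log(q + 1e-300) + phi @ lam
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def minmi(p_y, phi, a_y, n_iter=100):
        """p_y: class marginals (k,); phi: (|X|, d); a_y: per-class means (k, d)."""
        n_x, k = phi.shape[0], len(p_y)
        p_x = np.full(n_x, 1.0 / n_x)                    # initialize the marginal
        for _ in range(n_iter):
            # I-project the current marginal onto each class's constraint set
            p_x_given_y = np.stack([i_project(p_x, phi, a_y[c]) for c in range(k)])
            p_x = p_y @ p_x_given_y                      # marginalize
        return p_x, p_x_given_y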

38 The MinMI Algorithm

39 Synthetic Example: two binary neurons R1, R2 and a binary stimulus S.
We measure the responses of each neuron individually for each stimulus. We know nothing of their correlation.

40 Constraints: we want to find p(R1,R2|S), which has 8 parameters.
We have 6 constraints (4 marginal + 2 normalization), leaving two free parameters.

41 Example: Two moments (MaxEnt vs. MinMI)
The observations are the class-conditional mean and variance. The MaxEnt solution makes p(X|y) a Gaussian. The MinMI solutions are far from Gaussian and discriminate much better. (The original slide showed the MaxEnt and MinMI solutions side by side.)

42 Example: Conditional Marginals
Recall that in Naïve Bayes we used the empirical means of indicator functions (slide 14). We can use the same means for MinMI.

43 Naïve Bayes Analogs: Naïve Bayes (generative) vs. the discriminative 1st-order log-linear model (the corresponding formulas were shown on the original slide).

44 Experiments: 12 UCI datasets, discrete features only.
We used singleton marginal constraints and compared to Naïve Bayes and the 1st-order log-linear model. Note: Naïve Bayes and MinMI use exactly the same input; the log-linear model also approximates p(x) and thus uses more information.

45

46 Generalization error for full sample

47 Related ideas
Extract the best observables using minimum MI: Sufficient Dimensionality Reduction (SDR).
Efficient representations of X with respect to Y: the Information Bottleneck approach.
Bounding the information in neural codes from very sparse statistics.
A statistical extension of Support Vector Machines.

48 Conclusions
We presented a method for inferring classifiers based on simple sample means.
MinMI outperforms the discriminative model for small sample sizes and outperforms the generative model.
Unlike generative models, it provides generalization guarantees.

