A Brief Maximum Entropy Tutorial
Presenter: Davidson
Date: 2009/02/04
Original Author: Adam Berger, 1996/07/05
Outline
  Overview
  Motivating example
  Maxent modeling
  Training data
  Features and constraints
  The maxent principle
  Exponential form
  Maximum likelihood
  Skipped sections and further reading
Overview
Statistical modeling
  Models the behavior of a random process
  Uses samples of output data to construct a representation of the process
  Predicts the future behavior of the process
Maximum entropy models
  A family of distributions within the class of exponential models for statistical modeling
Motivating example (1/4)
An English-to-French translator translates the English word in into one of 5 French phrases: dans, en, à, au cours de, pendant
Goal
  1. Extract a set of facts about the decision-making process
  2. Construct a model of this process
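Motivating example (2/4)
The translator always chooses among those 5 French phrases, so the model p must satisfy:
  $p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) = 1$
The most intuitively appealing model (the most uniform model subject to our knowledge) is:
  $p(\text{dans}) = p(\text{en}) = p(\text{à}) = p(\text{au cours de}) = p(\text{pendant}) = 1/5$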
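Motivating example (3/4)
If a second clue is discovered, namely that the translator chose either dans or en 30% of the time, p must satisfy 2 constraints:
  $p(\text{dans}) + p(\text{en}) = 3/10$
  $p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) = 1$
A reasonable choice for p would be the most uniform one:
  $p(\text{dans}) = p(\text{en}) = 3/20$
  $p(\text{à}) = p(\text{au cours de}) = p(\text{pendant}) = 7/30$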
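As a quick numerical check of the "most uniform" choice, the sketch below compares two distributions that both sum to 1 and put 30% of the mass on dans and en together. It uses Shannon entropy as the uniformity measure, which the tutorial only introduces later; the `skewed` distribution is a made-up alternative for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a distribution given as {outcome: probability}."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

# Spreads the mass evenly within each group: 3/10 over {dans, en}, 7/10 over the rest.
uniform_like = {"dans": 3 / 20, "en": 3 / 20,
                "à": 7 / 30, "au cours de": 7 / 30, "pendant": 7 / 30}

# Hypothetical alternative that satisfies the same two constraints.
skewed = {"dans": 0.25, "en": 0.05, "à": 0.5, "au cours de": 0.1, "pendant": 0.1}

for model in (uniform_like, skewed):
    # Both candidates must be proper distributions and respect the 30% clue.
    assert abs(sum(model.values()) - 1.0) < 1e-12
    assert abs(model["dans"] + model["en"] - 0.3) < 1e-12

h_uniform_like = entropy(uniform_like)
h_skewed = entropy(skewed)
print(h_uniform_like, h_skewed)
```

Both candidates are admissible, but the evenly spread one has the higher entropy, which matches the intuition that it assumes the least beyond the two constraints.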
Motivating example (4/4)
What if a third constraint is discovered, namely that the translator chose either dans or à half of the time:
  $p(\text{dans}) + p(\text{à}) = 1/2$
The choice for the model is no longer obvious
Two problems arise when complexity is added:
  The meaning of "uniform" and how to measure the uniformity of a model
  How to find the most uniform model subject to a set of constraints
One solution: the Maximum Entropy (maxent) Model
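To make the three-constraint case concrete, here is a minimal sketch that finds a maximum entropy solution numerically. It assumes the third constraint is p(dans) + p(à) = 1/2, as in Berger's original tutorial, and it borrows the exponential form introduced later in these slides. Plain gradient ascent is used here for simplicity; it is not the parameter-estimation method of the original tutorial, and the step count is an arbitrary choice.

```python
import math

WORDS = ["dans", "en", "à", "au cours de", "pendant"]

def f_dans_or_en(y):
    return 1.0 if y in ("dans", "en") else 0.0

def f_dans_or_a(y):
    return 1.0 if y in ("dans", "à") else 0.0

FEATURES = [f_dans_or_en, f_dans_or_a]
TARGETS = [0.3, 0.5]  # required expected values of the two features

def model(lambdas):
    """Unconditional exponential model: p(y) proportional to exp(sum_i lambda_i f_i(y))."""
    scores = {y: math.exp(sum(l * f(y) for l, f in zip(lambdas, FEATURES)))
              for y in WORDS}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

lambdas = [0.0, 0.0]
for _ in range(1000):
    p = model(lambdas)
    expected = [sum(p[y] * f(y) for y in WORDS) for f in FEATURES]
    # Move each weight toward matching its target expectation.
    lambdas = [l + (t - e) for l, t, e in zip(lambdas, TARGETS, expected)]

p = model(lambdas)
print(p)
```

The normalization constraint is satisfied by construction, and au cours de and pendant receive equal probability because no feature distinguishes them.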
Maxent Modeling
Consider a random process which produces an output value y, a member of a finite set Y
In generating y, the process may be influenced by some contextual information x, a member of a finite set X
The task is to construct a stochastic model that accurately represents the behavior of the random process
This model estimates the conditional probability $p(y|x)$ that, given a context x, the process will output y
We denote by P the set of all conditional probability distributions. A model $p(y|x)$ is an element of P
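Training data
Training sample:
  $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$
The training sample's empirical probability distribution:
  $\tilde{p}(x,y) \equiv \frac{1}{N} \times$ (number of times that $(x,y)$ occurs in the sample)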
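The empirical distribution is just relative frequency counting over the training pairs. A minimal sketch, using a made-up training sample in which x is the word following in and y is the chosen translation:

```python
from collections import Counter

def empirical_distribution(sample):
    """Return {(x, y): count(x, y) / N} for a list of (x, y) training pairs."""
    n = len(sample)
    return {pair: count / n for pair, count in Counter(sample).items()}

# Hypothetical sample of N = 20 observations.
sample = ([("April", "en")] * 9 + [("April", "dans")] * 1
          + [("the", "dans")] * 6 + [("the", "à")] * 4)

p_tilde = empirical_distribution(sample)
print(p_tilde)
```

Pairs that never occur in the sample are simply absent from the dictionary, which corresponds to an empirical probability of zero.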
Features and constraints (1/4)
Use a set of statistics of the training sample to construct a statistical model of the process
Statistics that are independent of the context, e.g. in the motivating example, in translates to dans or en with frequency 3/10
Statistics that depend on the conditioning information x, e.g. in the training sample, if April is the word following in, then the translation of in is en with frequency 9/10
Features and constraints (2/4)
To express the event that in translates as en when April is the following word, we can introduce the indicator function:
  $f(x,y) = \begin{cases} 1 & \text{if } y = \text{en and April follows in} \\ 0 & \text{otherwise} \end{cases}$
The expected value of f with respect to the empirical distribution $\tilde{p}(x,y)$ is exactly the statistic we are interested in. This expected value is given by:
  $\tilde{p}(f) \equiv \sum_{x,y} \tilde{p}(x,y) f(x,y)$
We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f. We call such a function a feature function, or feature for short.
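A feature is an ordinary function of (x, y), and its empirical expectation is a weighted sum over the observed pairs. The sketch below illustrates this; the empirical probabilities are made-up values, and x is assumed to be the word following in.

```python
def f(x, y):
    """Indicator feature: 1 if in is translated as en and April follows, else 0."""
    return 1 if (y == "en" and x == "April") else 0

def empirical_expectation(feature, p_tilde):
    """Sum of p~(x, y) * f(x, y) over all observed (x, y) pairs."""
    return sum(prob * feature(x, y) for (x, y), prob in p_tilde.items())

# Hypothetical empirical distribution p~(x, y).
p_tilde = {("April", "en"): 0.45, ("April", "dans"): 0.05,
           ("the", "dans"): 0.30, ("the", "à"): 0.20}

expected_f = empirical_expectation(f, p_tilde)
print(expected_f)
```

Because f is binary, its expectation is simply the total empirical probability of the pairs on which the feature fires.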
Features and constraints (3/4)
The expected value of f with respect to the model $p(y|x)$ is:
  $p(f) \equiv \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x,y)$
where $\tilde{p}(x)$ is the empirical distribution of x in the training sample
We constrain this expected value to be the same as the expected value of f in the training sample:
  $p(f) = \tilde{p}(f)$
This is called a constraint equation, or simply a constraint
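The model expectation weights each context by its empirical frequency and each output by the model's conditional probability. A small sketch, with a hypothetical model stored as a nested dictionary `model[x][y]` and made-up numbers chosen so that the constraint holds:

```python
def f(x, y):
    """Indicator feature: 1 if in is translated as en and April follows, else 0."""
    return 1 if (y == "en" and x == "April") else 0

def model_expectation(feature, p_tilde_x, model):
    """Sum of p~(x) * p(y|x) * f(x, y) over all contexts x and outputs y."""
    return sum(px * pyx * feature(x, y)
               for x, px in p_tilde_x.items()
               for y, pyx in model[x].items())

# Hypothetical empirical context distribution and conditional model.
p_tilde_x = {"April": 0.5, "the": 0.5}
model = {"April": {"en": 0.9, "dans": 0.1, "à": 0.0},
         "the": {"en": 0.0, "dans": 0.6, "à": 0.4}}

p_f = model_expectation(f, p_tilde_x, model)
p_tilde_f = 0.45  # empirical expectation of f, assumed for this toy example
satisfies_constraint = abs(p_f - p_tilde_f) < 1e-12
print(p_f, satisfies_constraint)
```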
Features and constraints (4/4)
Combining the above 3 equations yields:
  $\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x,y) = \sum_{x,y} \tilde{p}(x,y)\, f(x,y)$
By restricting attention to those models for which the constraint holds, we eliminate from consideration those models which do not agree with the training sample on how often the output of the process should exhibit the feature f.
What we have so far:
  A means of representing statistical phenomena inherent in a sample of data, namely $\tilde{p}(f)$
  A means of requiring that our model of the process exhibit these phenomena, namely $p(f) = \tilde{p}(f)$
The maxent principle (1/2)
Suppose n feature functions $f_i$ are given
We would like our model to accord with these statistics, i.e. we would like p to lie in the subset C of P defined by
  $C \equiv \{\, p \in P \mid p(f_i) = \tilde{p}(f_i) \text{ for } i \in \{1, 2, \ldots, n\} \,\}$
Among the models $p \in C$, we would like to select the distribution which is most uniform. But what does "uniform" mean?
A mathematical measure of the uniformity of a conditional distribution $p(y|x)$ is provided by the conditional entropy:
  $H(p) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x)$
The maxent principle (2/2)
The entropy is bounded from below by zero, the entropy of a model with no uncertainty at all
The entropy is bounded from above by $\log |Y|$, the entropy of the uniform distribution over all $|Y|$ possible values of y
The principle of maximum entropy:
  To select a model from a set C of allowed probability distributions, choose the model $p^* \in C$ with maximum entropy $H(p)$:
  $p^* = \arg\max_{p \in C} H(p)$
$p^*$ is always well-defined; that is, there is always a unique model $p^*$ with maximum entropy in any constrained set C
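The conditional entropy and its two bounds can be checked numerically. The sketch below uses a hypothetical three-word output set and two made-up contexts, with the usual convention that 0 log 0 counts as 0:

```python
import math

def conditional_entropy(p_tilde_x, model):
    """H(p) = -sum over x, y of p~(x) p(y|x) log p(y|x), in nats."""
    return -sum(px * pyx * math.log(pyx)
                for x, px in p_tilde_x.items()
                for pyx in model[x].values() if pyx > 0)

outputs = ["dans", "en", "à"]
p_tilde_x = {"April": 0.5, "the": 0.5}

# Uniform over the outputs in every context: attains the upper bound log |Y|.
uniform = {x: {y: 1 / len(outputs) for y in outputs} for x in p_tilde_x}
# No uncertainty at all in any context: attains the lower bound 0.
certain = {x: {"dans": 1.0, "en": 0.0, "à": 0.0} for x in p_tilde_x}

h_uniform = conditional_entropy(p_tilde_x, uniform)
h_certain = conditional_entropy(p_tilde_x, certain)
print(h_uniform, h_certain)
```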
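Exponential form (1/3)
The method of Lagrange multipliers is applied to impose the constraints on the optimization
The constrained optimization problem is to find
  $p^* = \arg\max_{p \in C} H(p)$
Maximize
  $H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x)$
subject to the following constraints:
  1. $p(y|x) \ge 0$ for all x, y
  2. $\sum_{y} p(y|x) = 1$ for all x
  Constraints 1 and 2 guarantee that p is a conditional probability distribution
  3. $\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x,y) = \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y)$ for $i \in \{1, 2, \ldots, n\}$
  In other words, $p \in C$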
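Exponential form (2/3)
When the Lagrange multipliers are introduced, the objective function becomes:
  $\xi(p, \Lambda, \gamma) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) + \sum_{i} \lambda_i \Bigl( \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x,y) - \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y) \Bigr) + \gamma \Bigl( \sum_{y} p(y|x) - 1 \Bigr)$
The real-valued parameters $\gamma$ and $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ correspond to the 1+n constraints imposed on the solution
Solve by using the EM algorithm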
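The step from the Lagrangian to the exponential form can be sketched as follows. This assumes the Lagrangian is written as the entropy plus the multiplier terms with the sign convention shown in the first line, that $\tilde{p}(x) > 0$, and that $\gamma$ and $\Lambda$ are held fixed while differentiating with respect to a single $p(y|x)$:

```latex
\begin{align*}
\xi(p,\Lambda,\gamma) &= H(p) + \sum_{i=1}^{n}\lambda_i\bigl(p(f_i)-\tilde{p}(f_i)\bigr)
  + \gamma\Bigl(\sum_{y}p(y|x)-1\Bigr) \\
\frac{\partial \xi}{\partial p(y|x)} &= -\tilde{p}(x)\bigl(1+\log p(y|x)\bigr)
  + \tilde{p}(x)\sum_{i}\lambda_i f_i(x,y) + \gamma = 0 \\
\Rightarrow\quad p(y|x) &= \exp\Bigl(\sum_{i}\lambda_i f_i(x,y)\Bigr)\,
  \exp\Bigl(\frac{\gamma}{\tilde{p}(x)}-1\Bigr)
\end{align*}
```

The second factor does not depend on y, so requiring $\sum_y p(y|x) = 1$ fixes it to the reciprocal of $\sum_y \exp\bigl(\sum_i \lambda_i f_i(x,y)\bigr)$, which is the normalizing constant of the exponential model.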
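Exponential form (3/3)
The final result:
The maximum entropy model subject to the constraints C has the parametric form $p_{\Lambda^*}$ of the equation below, where $\Lambda^*$ can be determined by maximizing the dual function $\Psi(\Lambda)$
  $p_\Lambda(y|x) = \frac{1}{Z_\Lambda(x)} \exp\Bigl( \sum_{i} \lambda_i f_i(x,y) \Bigr)$
  $Z_\Lambda(x) = \sum_{y} \exp\Bigl( \sum_{i} \lambda_i f_i(x,y) \Bigr)$
  $\Psi(\Lambda) = -\sum_{x} \tilde{p}(x) \log Z_\Lambda(x) + \sum_{i} \lambda_i \tilde{p}(f_i)$
  $\Lambda^* = \arg\max_{\Lambda} \Psi(\Lambda)$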
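The parametric form is straightforward to evaluate once the feature functions and weights are given. A minimal sketch, where the function name `maxent_prob`, the single feature, and the weight values are all illustrative assumptions:

```python
import math

def maxent_prob(y, x, lambdas, features, outputs):
    """p(y|x) = exp(sum_i lambda_i f_i(x, y)) / Z(x) for the exponential model."""
    def score(candidate):
        return math.exp(sum(l * f(x, candidate) for l, f in zip(lambdas, features)))
    z = sum(score(candidate) for candidate in outputs)  # normalizer Z(x)
    return score(y) / z

def f_en_april(x, y):
    """Indicator feature: in is translated as en and April follows."""
    return 1.0 if (x == "April" and y == "en") else 0.0

outputs = ["dans", "en", "à", "au cours de", "pendant"]

# With all weights at zero the model is uniform over the outputs.
p_zero = maxent_prob("en", "April", [0.0], [f_en_april], outputs)
# A positive weight raises the probability of outputs on which the feature fires.
p_boost = maxent_prob("en", "April", [math.log(2)], [f_en_april], outputs)
print(p_zero, p_boost)
```

With weight log 2 the score of en in the April context doubles while the other four scores stay at 1, so its probability becomes 2/6.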
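Maximum likelihood
The log-likelihood $L_{\tilde{p}}(p)$ of the empirical distribution $\tilde{p}$ as predicted by a model p is defined by:
  $L_{\tilde{p}}(p) \equiv \log \prod_{x,y} p(y|x)^{\tilde{p}(x,y)} = \sum_{x,y} \tilde{p}(x,y) \log p(y|x)$
The dual function $\Psi(\Lambda)$ of the previous section is just the log-likelihood for the exponential model $p_\Lambda$; that is:
  $\Psi(\Lambda) = L_{\tilde{p}}(p_\Lambda)$
The result from the previous section can be rephrased as:
  The model $p^* \in C$ with maximum entropy is the model in the parametric family $p_\Lambda(y|x)$ that maximizes the likelihood of the training sample $\tilde{p}$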
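The equivalence can be illustrated numerically: maximizing the log-likelihood of an exponential model drives the model expectations of the features toward their empirical values. The sketch below uses plain gradient ascent, which is a simplification and not the parameter-estimation method of the original tutorial; the training sample, the two features, the step size, and the step count are all made-up assumptions.

```python
import math
from collections import Counter

OUTPUTS = ["dans", "en", "à", "au cours de", "pendant"]

def f_en_april(x, y):
    return 1.0 if (x == "April" and y == "en") else 0.0

def f_dans(x, y):
    return 1.0 if y == "dans" else 0.0

FEATURES = [f_en_april, f_dans]

def p_model(x, lambdas):
    """Return {y: p(y|x)} for the exponential model with the given weights."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, FEATURES)))
              for y in OUTPUTS}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def log_likelihood(sample, lambdas):
    """Sum over (x, y) of p~(x, y) log p(y|x), computed as a sample average."""
    return sum(math.log(p_model(x, lambdas)[y]) for x, y in sample) / len(sample)

def model_expectations(p_x, lambdas):
    """Model expectation of each feature: sum of p~(x) p(y|x) f_i(x, y)."""
    dists = {x: p_model(x, lambdas) for x in p_x}
    return [sum(p_x[x] * dists[x][y] * f(x, y) for x in p_x for y in OUTPUTS)
            for f in FEATURES]

def fit(sample, steps=2000, lr=1.0):
    n = len(sample)
    p_x = {x: c / n for x, c in Counter(x for x, _ in sample).items()}
    empirical = [sum(f(x, y) for x, y in sample) / n for f in FEATURES]
    lambdas = [0.0] * len(FEATURES)
    for _ in range(steps):
        # Gradient of the log-likelihood: empirical minus model expectation.
        model = model_expectations(p_x, lambdas)
        lambdas = [l + lr * (e - m) for l, e, m in zip(lambdas, empirical, model)]
    return lambdas, empirical, model_expectations(p_x, lambdas)

# Hypothetical training sample: x is the word following in, y is the translation.
sample = ([("April", "en")] * 9 + [("April", "dans")] * 1
          + [("the", "dans")] * 6 + [("the", "à")] * 4)

lambdas, empirical, fitted = fit(sample)
print(lambdas, empirical, fitted)
```

At the fitted weights the model satisfies the constraints, so the maximum likelihood exponential model is also a member of the constrained set C.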
Skipped sections
  Computing the parameters: orial/node10.html#SECTION
  Algorithms for inductive learning: orial/node11.html#SECTION
Further readings: orial/node14.html#SECTION