Information Theory Metrics Giancarlo Schrementi

Expected Value. Example: die roll: (1/6)*1 + (1/6)*2 + (1/6)*3 + (1/6)*4 + (1/6)*5 + (1/6)*6 = 3.5. The equation, E[X] = Σ_x p(x)·x, gives you the value that you can expect an event to take on average.

A Template Equation. Expected value forms a template for many of the equations in information theory. Notice it has three parts: a summation over all possible events, the probability of each event, and the value of that event. It can be thought of as a weighted average.
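To make the template concrete, here is a minimal Python sketch (the function name and the list-based representation are just for illustration):

def expected_value(probs, values):
    # Weighted average: each value is weighted by the probability of its event.
    return sum(p * v for p, v in zip(probs, values))

# Die roll from the previous slide: six faces, each with probability 1/6.
print(expected_value([1/6] * 6, [1, 2, 3, 4, 5, 6]))  # 3.5 for a fair die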

Entropy as an EV. The entropy equation, H(X) = -Σ_x p(x) log2 p(x), has the same three parts. The value here is the log base 2 of one over the probability (equivalently, -log2 p(x)), which can be seen as the amount of information that an event transmits. Since it is a logarithm, less likely occurrences are more informative than highly likely occurrences. So entropy can be thought of as the expected amount of information that we will receive.
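A small Python sketch of entropy as an expected value (the example distributions are invented):

import math

# Entropy: the expected information content, -sum over x of p(x) * log2 p(x), in bits.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit (fair coin)
print(entropy([1/6] * 6))    # about 2.585 bits (fair die)
print(entropy([0.9, 0.1]))   # about 0.469 bits (biased coin: less surprise on average)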

Mutual Information. Our three friends show up again: I(X;Y) = Σ_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]. The value this time is (the log of) the joint probability divided by the product of the two marginal probabilities. This equation is definitely an expected value equation, but what does that value tell us?

MI's Value. The denominator tells us the probability of the two events occurring if they were independent. The numerator is the actual probability of the two events occurring, which will be different if they are dependent in some way. Dividing the two tells us how much more likely one is than the other; thus the value tells us how much more likely the joint event is than it would be if the two were independent, giving us a measure of dependence. The lg (log base 2) scales the value into bits: the information that one event tells us about the other.

Mutual Information as an EV. Mutual information can be looked at as the expected number of bits that one event tells us about another event in the distribution. This can be thought of as how many fewer bits are needed to encode the next event because of your knowledge of the prior event. An MI of zero means that the two variables are independent. MI is also symmetric in X and Y and is always non-negative.
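A sketch of MI as an expected value over a joint distribution (the dictionary representation and the toy distributions are just for illustration):

import math

# Mutual information in bits: sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) ).
def mutual_information(joint):
    # joint[(x, y)] -> p(x, y); the marginals are obtained by summing the joint.
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Perfectly dependent pair of binary variables: MI = 1 bit.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))
# Independent pair: MI = 0 bits.
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}))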

Kullback-Leibler Divergence. Once again it looks like an EV equation: D(P || Q) = Σ_x P(x) log2 [ P(x) / Q(x) ]. P and Q are two distributions, and P(x) is the probability of event x under P. The division in the value part tells us how much more (or less) likely an event is in distribution P than it would be in Q. The lg scales this to bits, but what does that tell us?

What does KL tell us? The value part tells us the difference in informational content between event x under distribution P and event x under distribution Q. So KL tells us the expected difference in informational content of events in the two distributions. This can be thought of as the difference in the number of bits needed to encode an event under the two distributions (roughly, the extra bits incurred by encoding data from P with a code built for Q).

Other KL Observations. As written above, KL is not symmetric and does not obey the triangle inequality, but it is non-negative. If P and Q are the same, the equation results in zero. Kullback and Leibler actually defined the divergence as the sum of the equation above plus its counterpart with P and Q switched; it then becomes symmetric.
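A minimal Python sketch of these observations, assuming two small made-up distributions over the same three events:

import math

# KL divergence in bits: sum over x of P(x) * log2( P(x) / Q(x) ).
def kl(p, q):
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(kl(p, q))             # D(P || Q)
print(kl(q, p))             # D(Q || P): a different number, so KL is not symmetric
print(kl(p, q) + kl(q, p))  # the symmetrized form Kullback and Leibler originally used
print(kl(p, p))             # 0.0 when the two distributions are identical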

Comparing MI and KL. The two equations are noticeably similar. In fact you can express MI as the KL divergence between p(x,y) and p(x)p(y). This tells us that MI is computing the divergence between the true joint distribution and the distribution in which the two variables are completely independent, essentially computing KL with respect to a fixed reference point. KL divergence gives us a measure of how different two distributions are; MI gives us a measure of how different a distribution is from what it would be if its events were independent.
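This identity can be checked numerically; in the sketch below the joint distribution is invented and the kl helper just restates the definition above:

import math

def kl(p, q):
    # p and q map the same events to probabilities.
    return sum(p[e] * math.log2(p[e] / q[e]) for e in p if p[e] > 0)

# A made-up joint distribution over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}
product = {(x, y): px[x] * py[y] for (x, y) in joint}

# MI(X; Y) equals D( p(x,y) || p(x)p(y) ): about 0.125 bits for this joint.
print(kl(joint, product))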

Sample Application: Chunk Parser. Chunk parsers attempt to divide a sentence into its logical/grammatical units, in this case recursively: split the sentence into two chunks, then split those two chunks into two chunks, and so on. MI and KL are both measures that can tell us how statistically related two words are to each other. If two words are highly related, there is unlikely to be a chunk boundary between them; likewise, if they are largely unrelated, then between them is a good place to suppose a chunk boundary.
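A rough sketch of that recursion, assuming some association score between adjacent words (for instance one of the point-wise measures on the next slide); the function, its min_len parameter, and the score passed in are all illustrative, not the parser from the cited papers:

def chunk(words, score, min_len=2):
    # Recursively split the word list in two at the adjacent pair with the
    # weakest association score, the idea being that weakly related neighbours
    # are the most plausible chunk boundaries.
    if len(words) < 2 * min_len:
        return [words]
    scores = [score(words[i], words[i + 1]) for i in range(len(words) - 1)]
    cut = min(range(min_len - 1, len(words) - min_len), key=lambda i: scores[i]) + 1
    return chunk(words[:cut], score, min_len) + chunk(words[cut:], score, min_len)

# Usage with any association measure, e.g. a point-wise MI estimated from a corpus:
# chunk("the big dog chased the small cat".split(), score=my_pmi_estimate)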

MI and KL in this Task. Here we are only concerned with the values at a point, not the expected value over all points, so the summations and the weights are dropped, giving us "point-wise" calculations. These calculations are made over bigrams in a text, where x and y are two adjacent words and p(x,y) is the probability of the bigram occurring. Note what P and Q are in the KL equation: Q is p(y), which can be thought of as the prior, and P is p(y|x), which can be thought of as the posterior. This gives you the information difference that knowledge of x brings to the probability of the right-hand word y.
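A hedged sketch of the two point-wise scores as described here; the function names and the toy bigram probabilities are invented:

import math

# Point-wise MI for a bigram xy: log2( p(x,y) / (p(x) * p(y)) ).
def pointwise_mi(p_xy, p_x, p_y):
    return math.log2(p_xy / (p_x * p_y))

# Point-wise KL-style score: log2( p(y|x) / p(y) ), the information that moving
# from the prior p(y) to the posterior p(y|x) adds about the right-hand word y.
def pointwise_kl(p_y_given_x, p_y):
    return math.log2(p_y_given_x / p_y)

# Toy bigram "new york" with invented corpus probabilities.
p_new, p_york, p_new_york = 0.010, 0.002, 0.0015
print(pointwise_mi(p_new_york, p_new, p_york))
print(pointwise_kl(p_new_york / p_new, p_york))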

Results of the Comparison. First, note that these point-wise equations can result in negative numbers if the numerator probability is lower than the denominator. Also, MI is no longer symmetrical, because the bigram probability p(x,y) is different from p(y,x). This asymmetry is useful for language analysis: most languages are direction dependent, so we want our metrics to be sensitive to that. Both provide an accurate dependence measure, but the KL measure produces more exaggerated values the further it gets from zero, because p(y|x) can get much higher than p(x,y) will ever be. This makes KL more useful in this context, in that relationships are more striking and thus easier to pick out from noise.

One Final Variant. These two equations are variants of mutual information designed to tell you how much a word tells you about the words that could occur to its left or to its right. Notice that each is an expected value over all the bigrams in which the word occurs on the right or on the left, weighted by the conditional probability of that bigram given the word in question. This can be used to give you an estimate of the handedness of a word: how much the word restricts what can occur to its left or to its right.
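Since the two equations themselves are not reproduced in this transcript, the following is only a speculative sketch of the rightward version, assuming the quantity being averaged is the point-wise MI of each bigram, weighted by p(y|w) as the slide describes; the leftward version would mirror it with the roles of left and right swapped:

import math

def rightward_selectivity(w, bigram_p, unigram_p):
    # Assumed form: sum over words y of p(y | w) * log2( p(w,y) / (p(w) * p(y)) ).
    # bigram_p maps (left, right) word pairs to probabilities; unigram_p maps words to probabilities.
    total = 0.0
    for (left, right), p_xy in bigram_p.items():
        if left != w or p_xy == 0:
            continue
        p_y_given_w = p_xy / unigram_p[w]
        total += p_y_given_w * math.log2(p_xy / (unigram_p[w] * unigram_p[right]))
    return total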

References
S. Kullback and R. A. Leibler. "On Information and Sufficiency." Annals of Mathematical Statistics 22(1):79-86, March 1951.
Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, Giancarlo Schrementi. "On Statistical Parameter Setting." Proceedings of the First Workshop on Psycho-computational Models of Human Language Acquisition (COLING-2004). Geneva, Switzerland. August 28-29, 2004.
Damir Ćavar, Paul Rodrigues, Giancarlo Schrementi. "Syntactic Parsing Using Mutual Information and Relative Entropy." Proceedings of the Midwest Computational Linguistics Colloquium (MCLC). Bloomington, IN, USA. July.