Introduction to Natural Language Processing (600.465): Probability


Introduction to Natural Language Processing (600.465): Probability
Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic
9/14/1999

Experiments & Sample Spaces
- Experiment, process, test, ...
- Set of possible basic outcomes: sample space Ω
  - coin toss (Ω = {head, tail}), die (Ω = {1..6})
  - yes/no opinion poll, quality test (bad/good) (Ω = {0,1})
  - lottery (|Ω| ≈ 10^7 .. 10^12)
  - number of traffic accidents somewhere per year (Ω = N)
  - spelling errors (Ω = Z*), where Z is an alphabet and Z* is the set of possible strings over such an alphabet
  - missing word (|Ω| ≈ vocabulary size)

Events
- Event A is a set of basic outcomes
- Usually A ⊆ Ω, and all A ∈ 2^Ω (the event space)
  - Ω is then the certain event, Ø is the impossible event
- Example: experiment: three coin tosses
  - Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  - count cases with exactly two tails: then A = {HTT, THT, TTH}
  - all heads: A = {HHH}

Probability
- Repeat the experiment many times, record how many times a given event A occurred ("count" c1).
- Do this whole series many times; remember all the ci.
- Observation: if repeated really many times, the ratios ci/Ti (where Ti is the number of experiments run in the i-th series) are close to some (unknown but) constant value.
- Call this constant the probability of A. Notation: p(A)

Estimating probability
- Remember: ... close to an unknown constant.
- We can only estimate it:
  - from a single series (the typical case, since mostly the outcome of a series is given to us and we cannot repeat the experiment): set p(A) = c1/T1.
  - otherwise, take the weighted average of all ci/Ti (or, if the data allow, simply look at the set of series as if it were a single long series). This is the best estimate.

Example
- Recall our example: experiment: three coin tosses
  - Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  - count cases with exactly two tails: A = {HTT, THT, TTH}
- Run the experiment 1000 times (i.e., 3000 tosses)
  - counted: 386 cases with two tails (HTT, THT, or TTH)
  - estimate: p(A) = 386 / 1000 = .386
- Run again: 373, 399, 382, 355, 372, 406, 359
  - p(A) = .379 (weighted average), or simply 3032 / 8000
- Uniform distribution assumption: p(A) = 3/8 = .375
- (In-class exercise: assuming 5 seconds per experiment (3 tosses), running all 8000 experiments takes 40,000 s = 11 h 6 min 40 s; starting at 2:20 pm you would finish at 1:26:40 am.)
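
A quick way to see the ci/Ti ratios settle near 3/8 is to simulate the experiment. The sketch below (plain Python, standard library only; the seeds and series sizes are arbitrary choices, not from the slide) runs eight series of 1000 three-toss experiments and prints the per-series and pooled estimates of p(A).

```python
import random

def run_series(n_experiments=1000, seed=None):
    """Run one series; count experiments with exactly two tails in 3 tosses."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        tosses = [rng.choice("HT") for _ in range(3)]
        if tosses.count("T") == 2:            # event A = {HTT, THT, TTH}
            hits += 1
    return hits

counts = [run_series(1000, seed=i) for i in range(8)]   # 8 series of 1000 experiments
print("per-series estimates:", [c / 1000 for c in counts])
print("pooled estimate:     ", sum(counts) / 8000)      # compare with 3/8 = .375
```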

Basic Properties
- p: 2^Ω → [0,1]
- p(Ω) = 1
- Disjoint events: p(∪ Ai) = ∑i p(Ai)
- [NB: the axiomatic definition of probability takes the above three conditions as axioms]
- Immediate consequences:
  - p(Ø) = 0
  - p(Ā) = 1 − p(A)
  - A ⊆ B ⇒ p(A) ≤ p(B)
  - ∑a ∈ Ω p(a) = 1

Joint and Conditional Probability
- p(A,B) = p(A ∩ B)
- p(A|B) = p(A,B) / p(B)
- Estimating from counts:
  - p(A|B) = p(A,B) / p(B) = (c(A ∩ B) / T) / (c(B) / T) = c(A ∩ B) / c(B)
- [Figure: Venn diagram of A, B and A ∩ B inside Ω]
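
To make the counting estimate concrete, here is a minimal sketch in Python. The tiny word-pair "corpus" is invented purely for illustration; the point is only the formula p(A|B) = c(A ∩ B) / c(B).

```python
from collections import Counter

# Invented toy data: pairs (B, A), e.g. a word and the word that follows it.
pairs = [("new", "york"), ("new", "car"), ("new", "york"),
         ("old", "car"), ("new", "deal"), ("old", "york")]

c_B = Counter(b for b, _ in pairs)     # c(B)
c_AB = Counter(pairs)                  # c(A ∩ B): joint pair counts

def p_cond(a, b):
    """Estimate p(A=a | B=b) = c(A ∩ B) / c(B)."""
    return c_AB[(b, a)] / c_B[b] if c_B[b] else 0.0

print(p_cond("york", "new"))   # 2 / 4 = 0.5
```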

Bayes Rule
- p(A,B) = p(B,A), since p(A ∩ B) = p(B ∩ A)
  - therefore p(A|B) p(B) = p(B|A) p(A), and therefore
- p(A|B) = p(B|A) p(A) / p(B) !
- [Figure: Venn diagram of A, B and A ∩ B inside Ω]

Independence
- Can we compute p(A,B) from p(A) and p(B)?
- Recall from the previous foil:
  - p(A|B) = p(B|A) p(A) / p(B)
  - p(A|B) p(B) = p(B|A) p(A)
  - p(A,B) = p(B|A) p(A)
  - ... we're almost there: how does p(B|A) relate to p(B)?
- p(B|A) = p(B) iff A and B are independent
- Example: two coin tosses; the weather today and the weather on March 4th, 1789
- Any two events for which p(B|A) = p(B)!

Chain Rule
- p(A1, A2, A3, A4, ..., An) = p(A1|A2,A3,A4,...,An) × p(A2|A3,A4,...,An) × p(A3|A4,...,An) × ... × p(An-1|An) × p(An) !
- This is a direct consequence of the Bayes rule.
- (Intentionally a different order than in the book, just to make you think a bit.)

The Golden Rule (of Classic Statistical NLP)
- Interested in an event A given B (where it is not easy, practical, or desirable to estimate p(A|B) directly):
- take the Bayes rule and maximize over all A:
  - argmaxA p(A|B) = argmaxA p(B|A) p(A) / p(B) = argmaxA p(B|A) p(A) !
  - ... since p(B) is constant when varying A
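
The rule is easy to mirror in code. Below is a minimal sketch of the argmax; the candidate set and all probabilities are made-up numbers standing in for a prior p(A) and a "channel" model p(B|A), not values from the slides.

```python
# Hypothetical noisy-channel style setup: observed B, candidate events A
# with invented prior p(A) and likelihood p(B|A).
candidates = {
    #  A       p(A)   p(B|A)
    "the": (0.020, 0.10),
    "ten": (0.002, 0.01),
    "tea": (0.001, 0.02),
}

# argmax_A p(A|B) = argmax_A p(B|A) * p(A); p(B) is constant and drops out.
best = max(candidates, key=lambda a: candidates[a][1] * candidates[a][0])
print(best)   # -> "the"
```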

Random Variables
- A random variable is a function X: Ω → Q
  - in general Q = Rⁿ, typically R
  - easier to handle real numbers than real-world events
- A random variable is discrete if Q is countable (i.e., also if finite)
- Example: die: natural "numbering" [1,6]; coin: {0,1}
- Probability distribution:
  - pX(x) = p(X = x) =df p(Ax), where Ax = {a ∈ Ω : X(a) = x}
  - often just p(x) if it is clear from context what X is

Expectation; Joint and Conditional Distributions
- Expectation is the mean of a random variable (weighted average):
  - E(X) = ∑x ∈ X(Ω) x · pX(x)
  - Example: one six-sided die: 3.5; two dice (sum): 7
- Joint and conditional distribution rules: analogous to probabilities of events
  - Bayes: pX|Y(x,y), in simpler notation pXY(x|y) or just p(x|y): p(x|y) = p(y|x) · p(x) / p(y)
  - Chain rule: p(w,x,y,z) = p(z) · p(y|z) · p(x|y,z) · p(w|x,y,z)

Standard distributions
- Binomial (discrete)
  - outcome: 0 or 1 (thus: binomial)
  - make n trials, interested in the (probability of the) number of successes r
  - Must be careful: it's not uniform!
  - pb(r|n) = (n choose r) / 2^n (for equally likely outcomes)
  - (n choose r) counts how many possibilities there are for choosing r objects out of n: n! / ((n−r)! r!)
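
For equally likely binary outcomes the formula is a one-liner; this sketch uses math.comb for the binomial coefficient.

```python
from math import comb

def p_binomial(r, n):
    """p_b(r|n) = C(n, r) / 2^n for n trials with equally likely 0/1 outcomes."""
    return comb(n, r) / 2 ** n

# Probability of exactly 2 tails in 3 fair tosses -- the 3/8 from the earlier example.
print(p_binomial(2, 3))   # 0.375
```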

Continuous Distributions
- The normal distribution ("Gaussian")
  - pnorm(x|μ,σ) = e^(−(x−μ)²/(2σ²)) / (σ √(2π))
  - where μ is the mean (x-coordinate of the peak) (0 for the standard normal)
  - and σ is the standard deviation (1 for the standard normal)
- [Figure: bell curve centered at μ]
- other continuous distributions: hyperbolic, t
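
A small sketch of the density; mu and sigma default to the standard-normal values 0 and 1 noted in parentheses above.

```python
from math import exp, pi, sqrt

def p_norm(x, mu=0.0, sigma=1.0):
    """Normal density: exp(-(x-mu)^2 / (2*sigma^2)) / (sigma * sqrt(2*pi))."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

print(p_norm(0.0))             # peak of the standard normal, about 0.3989
print(p_norm(1.0, 0.0, 2.0))   # wider curve: larger sigma, lower peak
```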

Introduction to Natural Language Processing (600.465): Essential Information Theory I
Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic
9/14/1999

The Notion of Entropy
- Entropy ~ "chaos", fuzziness, opposite of order, ...
  - you know it: it is much easier to create a "mess" than to tidy things up...
- Comes from physics:
  - entropy does not go down unless energy is used
- Measure of uncertainty:
  - if low ... low uncertainty; the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

The Formula
- Let pX(x) be the distribution of a random variable X
- Basic outcomes (alphabet): Ω
- H(X) = − ∑x ∈ Ω p(x) log2 p(x) !
- Unit: bits (with natural log: nats)
- Notation: H(X) = Hp(X) = H(p) = HX(p) = H(pX)

Using the Formula: Example
- Toss a fair coin: Ω = {head, tail}
  - p(head) = .5, p(tail) = .5
  - H(p) = −0.5 log2(0.5) + (−0.5 log2(0.5)) = 2 × ((−0.5) × (−1)) = 2 × 0.5 = 1
- Take a fair, 32-sided die: p(x) = 1/32 for every side x
  - H(p) = − ∑i = 1..32 p(xi) log2 p(xi) = −32 (p(x1) log2 p(x1)) (since p(xi) = p(x1) = 1/32 for all i) = −32 × ((1/32) × (−5)) = 5
  - (now you see why it's called bits?)
- Unfair coin: p(head) = .2 ... H(p) = .722; p(head) = .01 ... H(p) = .081
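
These numbers are easy to verify with a few lines of Python; the helper below just applies H(p) = −∑ p(x) log2 p(x), skipping zero-probability outcomes.

```python
from math import log2

def entropy(dist):
    """H(p) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute 0."""
    return -sum(p * log2(p) for p in dist if p > 0)

print(entropy([0.5, 0.5]))        # fair coin: 1.0 bit
print(entropy([1 / 32] * 32))     # fair 32-sided die: 5.0 bits
print(entropy([0.2, 0.8]))        # unfair coin: ~0.722 bits
print(entropy([0.01, 0.99]))      # very unfair coin: ~0.081 bits
```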

Example: Book Availability
[Figure: entropy H(p) plotted against p(Book Available) from 0 to 1; H peaks at 1 bit at p = 0.5, with a "bad bookstore" and a "good bookstore" marked at different points on the curve]

The Limits
- When is H(p) = 0?
  - if the result of the experiment is known ahead of time:
  - necessarily: ∃x ∈ Ω: p(x) = 1 and ∀y ∈ Ω, y ≠ x: p(y) = 0
- Upper bound?
  - none in general
  - for |Ω| = n: H(p) ≤ log2 n
    - nothing can be more uncertain than the uniform distribution

Entropy and Expectation
- Recall: E(X) = ∑x ∈ X(Ω) pX(x) × x
- Then: E(log2(1/pX(x))) = ∑x ∈ X(Ω) pX(x) log2(1/pX(x)) = − ∑x ∈ X(Ω) pX(x) log2 pX(x) = H(pX) =notation H(p)

Perplexity: motivation
- Recall:
  - 2 equiprobable outcomes: H(p) = 1 bit
  - 32 equiprobable outcomes: H(p) = 5 bits
  - 4.3 billion equiprobable outcomes: H(p) ≈ 32 bits
- What if the outcomes are not equiprobable?
  - 32 outcomes, 2 equiprobable at .5, the rest impossible: H(p) = 1 bit
  - Is there a measure for comparing the entropy (i.e., the uncertainty/difficulty of prediction) also for random variables with different numbers of outcomes?

Perplexity
- Perplexity: G(p) = 2^H(p)
- ... so we are back at 32 (for 32 equiprobable outcomes), 2 for fair coins, etc.
- It is easier to imagine:
  - NLP example: the size of a vocabulary with a uniform distribution which would be equally hard to predict
- The "wilder" (more biased) the distribution, the better: lower entropy, lower perplexity
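
A tiny self-contained sketch (repeating the entropy helper from the earlier example) that checks the numbers above:

```python
from math import log2

def entropy(dist):
    return -sum(p * log2(p) for p in dist if p > 0)

def perplexity(dist):
    """G(p) = 2 ** H(p)."""
    return 2 ** entropy(dist)

print(perplexity([0.5, 0.5]))                 # 2.0
print(perplexity([1 / 32] * 32))              # 32.0
print(perplexity([0.5, 0.5] + [0.0] * 30))    # 2.0: impossible outcomes add no uncertainty
```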

Joint Entropy and Conditional Entropy
- Two random variables: X (sample space Ω), Y (sample space Ψ)
- Joint entropy: no big deal ((X,Y) considered a single event):
  - H(X,Y) = − ∑x ∈ Ω ∑y ∈ Ψ p(x,y) log2 p(x,y)
- Conditional entropy:
  - H(Y|X) = − ∑x ∈ Ω ∑y ∈ Ψ p(x,y) log2 p(y|x)
  - recall that H(X) = E(log2(1/pX(x))) (a weighted "average", and the weights are not conditional)

Conditional Entropy (Using the Calculus)
- Other definition:
  - H(Y|X) = ∑x ∈ Ω p(x) H(Y|X=x)
  - for H(Y|X=x) we can use the single-variable definition (x ~ constant):
  - = ∑x ∈ Ω p(x) ( − ∑y ∈ Ψ p(y|x) log2 p(y|x) )
  - = − ∑x ∈ Ω ∑y ∈ Ψ p(y|x) p(x) log2 p(y|x)
  - = − ∑x ∈ Ω ∑y ∈ Ψ p(x,y) log2 p(y|x)
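
Both quantities can be computed directly from a joint table. The joint distribution below is made up for illustration; the last line checks the chain rule H(X,Y) = H(Y|X) + H(X) from the next slide.

```python
from math import log2
from collections import defaultdict

# Invented joint distribution p(x, y) over a tiny sample space.
p_xy = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.40, ("b", 1): 0.10}

p_x = defaultdict(float)
for (x, _), p in p_xy.items():
    p_x[x] += p                     # marginal p(x)

def H_joint(p_xy):
    """H(X,Y) = -sum_{x,y} p(x,y) log2 p(x,y)."""
    return -sum(p * log2(p) for p in p_xy.values() if p > 0)

def H_cond(p_xy, p_x):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y) / p(x)."""
    return -sum(p * log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

H_x = -sum(p * log2(p) for p in p_x.values())     # H(X)
print(H_joint(p_xy), H_cond(p_xy, p_x))
print(H_cond(p_xy, p_x) + H_x)                    # equals H(X,Y): chain rule
```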

Properties of Entropy I
- Entropy is non-negative:
  - H(X) ≥ 0
  - proof: (recall: H(X) = − ∑x ∈ Ω p(x) log2 p(x))
    - log2 p(x) is negative or zero for p(x) ≤ 1, and p(x) is non-negative; their product p(x) log2 p(x) is thus non-positive; a sum of non-positive numbers is non-positive; and −f is non-negative for non-positive f
- Chain rule:
  - H(X,Y) = H(Y|X) + H(X), as well as
  - H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y))

Properties of Entropy II
- Conditional entropy is better (than unconditional):
  - H(Y|X) ≤ H(Y) (proof on Monday)
- H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
  - equality iff X, Y independent [recall: X, Y independent iff p(X,Y) = p(X)p(Y)]
- H(p) is concave (remember the book availability graph?)
  - concave function f over an interval (a,b): ∀x,y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1−λ)y) ≥ λf(x) + (1−λ)f(y)
  - a function f is convex if −f is concave
  - [for proofs and generalizations, see Cover/Thomas]
- [Figure: a concave curve with the chord λf(x) + (1−λ)f(y) lying below it between x and y]

"Coding" Interpretation of Entropy
- The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p: = H(p)
- Remember various compression algorithms?
  - they do well on data with repeating (= easily predictable = low-entropy) patterns
  - their results, though, have high entropy ⇒ compressing compressed data does nothing

Coding: Example
- How many bits do we need for ISO Latin 1?
  - the trivial answer: 8
- Experience: some characters are more common, some (very) rare:
  - so what if we use fewer bits for the frequent ones and more bits for the rare ones? [be careful: we still want to decode (easily)!]
  - suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ .0004
  - code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, rest: 11b1b2b3b4b5b6b7b8
  - code acbbécbaac: 00|10|01|01|1100001111|10|01|00|00|10 (a c b b é c b a a c)
  - number of bits used: 28 (vs. 80 using the "naive" 8-bit coding)
- code length ~ 1 / probability; conditional probabilities OK!
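
The bit counting can be reproduced mechanically. The sketch below implements the toy code; the slide leaves the eight bits b1...b8 for a rare character generic, so here the ISO Latin 1 byte of the character is used (an assumption), which may change the exact bit pattern shown for 'é' but not the total length.

```python
def encode(text):
    """Toy variable-length code: 'a'->00, 'b'->01, 'c'->10, anything else -> '11' + its 8-bit Latin-1 code."""
    table = {"a": "00", "b": "01", "c": "10"}
    return "".join(table.get(ch, "11" + format(ord(ch), "08b")) for ch in text)

msg = "acbbécbaac"
bits = encode(msg)
print(bits, len(bits))    # 28 bits
print(len(msg) * 8)       # 80 bits with the naive fixed 8-bit coding
```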

Entropy of a Language
- Imagine that we produce the next letter using p(l_{n+1}|l1,...,ln), where l1,...,ln is the sequence of all the letters which had been uttered so far (i.e., n is really big!); let's call l1,...,ln the history h (h_{n+1}), and the set of all histories H:
- Then compute its entropy:
  - − ∑h ∈ H ∑l ∈ A p(l,h) log2 p(l|h)
- Not very practical, is it?

Introduction to Natural Language Processing (600.465): Essential Information Theory II
Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic
9/14/1999

Kullback-Leibler Distance (Relative Entropy)
- Remember the long series of experiments... ci/Ti oscillates around some number... we can only estimate it... so we get a distribution q.
- So we have a distribution q (sample space Ω, r.v. X); the true distribution is, however, p (same Ω, X).
- ⇒ how big an error are we making?
- D(p||q) (the Kullback-Leibler distance):
  - D(p||q) = ∑x ∈ Ω p(x) log2 (p(x)/q(x)) = Ep log2 (p(x)/q(x))
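
A direct translation of the formula into code (using the conventions 0 log 0 = 0 and p log(p/0) = ∞ from the next slide); the two example distributions are invented and also show that the "distance" is not symmetric.

```python
from math import log2, inf

def kl(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)) over two aligned lists of probabilities."""
    d = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # convention: 0 log 0 = 0
        if qx == 0:
            return inf        # convention: p log (p/0) = infinity for p > 0
        d += px * log2(px / qx)
    return d

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl(p, q), kl(q, p))     # two different values: D(p||q) != D(q||p)
```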

Comments on Relative Entropy
- Conventions:
  - 0 log 0 = 0
  - p log (p/0) = ∞ (for p > 0)
- Distance? (a less "misleading" name: Divergence)
  - not quite a distance: not symmetric: D(p||q) ≠ D(q||p), and it does not satisfy the triangle inequality
  - but it is useful to look at it that way
- H(p) + D(p||q): bits needed for encoding p if q is used:
  - −∑ p log p + ∑ p log (p/q) = −∑ p log p + ∑ p (log p − log q) = −∑ p log q

Mutual Information (MI) in terms of relative entropy
- Random variables X, Y; pXY(x,y), pX(x), pY(y)
- Mutual information (between two random variables X, Y):
  - I(X,Y) = D(p(x,y) || p(x)p(y))
- I(X,Y) measures how much (our knowledge of) Y contributes (on average) to easing the prediction of X
- or, how much p(x,y) deviates from the (independence) product p(x)p(y)

Mutual Information: the Formula
- Rewrite the definition [recall: D(r||s) = ∑v ∈ Ω r(v) log2 (r(v)/s(v)); substitute r(v) = p(x,y), s(v) = p(x)p(y), v ~ (x,y)]:
- I(X,Y) = D(p(x,y) || p(x)p(y)) = ∑x ∈ Ω ∑y ∈ Ψ p(x,y) log2 (p(x,y) / (p(x)p(y))) !
- Measured in bits (what else? :-)
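
Spelled out as code over a small joint table (the numbers are invented): marginals are computed from the joint, and I(X,Y) comes out positive exactly because p(x,y) differs from p(x)p(y).

```python
from math import log2
from collections import defaultdict

# Invented joint distribution over (x, y).
p_xy = {("rain", "umbrella"): 0.30, ("rain", "none"): 0.10,
        ("sun",  "umbrella"): 0.05, ("sun",  "none"): 0.55}

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

def mutual_information(p_xy, p_x, p_y):
    """I(X,Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )."""
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

print(mutual_information(p_xy, p_x, p_y))   # > 0 here; it is 0 iff X and Y are independent
```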

From Mutual Information to Entropy
- ... by how many bits does the knowledge of Y lower the entropy H(X)?
- I(X,Y) = ∑x ∈ Ω ∑y ∈ Ψ p(x,y) log2 (p(x,y) / (p(y)p(x)))
  - ... use p(x,y)/p(y) = p(x|y)
  - = ∑x ∈ Ω ∑y ∈ Ψ p(x,y) log2 (p(x|y)/p(x))
  - ... use log(a/b) = log a − log b (a ~ p(x|y), b ~ p(x)), distribute the sums
  - = ∑x ∈ Ω ∑y ∈ Ψ p(x,y) log2 p(x|y) − ∑x ∈ Ω ∑y ∈ Ψ p(x,y) log2 p(x)
  - ... use the definition of H(X|Y) (left term) and ∑y ∈ Ψ p(x,y) = p(x) (right term)
  - = −H(X|Y) + (− ∑x ∈ Ω p(x) log2 p(x))
  - ... use the definition of H(X) (right term), swap terms
  - = H(X) − H(X|Y)
- ... and by symmetry, = H(Y) − H(Y|X)

Properties of MI vs. Entropy
- I(X,Y) = H(X) − H(X|Y) = the number of bits the knowledge of Y lowers the entropy of X
  - = H(Y) − H(Y|X) (previous foil, symmetry)
- Recall: H(X,Y) = H(X|Y) + H(Y) ⇒ −H(X|Y) = H(Y) − H(X,Y)
  - ⇒ I(X,Y) = H(X) + H(Y) − H(X,Y)
- I(X,X) = H(X) (since H(X|X) = 0)
- I(X,Y) = I(Y,X) (just for completeness)
- I(X,Y) ≥ 0 ... let's prove that now (as promised).

Jensen's Inequality
- [Figure: a convex curve with the chord λf(x) + (1−λ)f(y) lying above it between x and y]
- Recall: f is convex on an interval (a,b) iff ∀x,y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y)
- Jensen's inequality: for a distribution p(x), r.v. X on Ω, and convex f:
  - f(∑x ∈ Ω p(x) x) ≤ ∑x ∈ Ω p(x) f(x)
- Proof (idea): by induction on the number of basic outcomes;
  - start with |Ω| = 2: p(x1)f(x1) + p(x2)f(x2) ≥ f(p(x1)x1 + p(x2)x2) (⇐ definition of convexity)
  - for the induction step (|Ω| = k → k+1), just use the induction hypothesis and the definition of convexity (again).

Information Inequality
- D(p||q) ≥ 0 !
- Proof:
  - 0 = −log 1 = −log ∑x ∈ Ω q(x) = −log ∑x ∈ Ω (q(x)/p(x)) p(x) ≤
  - ... apply Jensen's inequality here (−log is convex) ...
  - ≤ ∑x ∈ Ω p(x) (−log(q(x)/p(x))) = ∑x ∈ Ω p(x) log(p(x)/q(x)) = D(p||q)

Other (In)Equalities and Facts
- Log sum inequality: for ri, si ≥ 0:
  - ∑i=1..n ri log(ri/si) ≥ (∑i=1..n ri) log(∑i=1..n ri / ∑i=1..n si)
- D(p||q) is convex [in p,q] (⇐ log sum inequality)
- H(pX) ≤ log2 |Ω|, where Ω is the sample space of pX
  - Proof: for the uniform u(x) on the same sample space Ω: ∑ p(x) log u(x) = −log2 |Ω|;
  - log2 |Ω| − H(X) = −∑ p(x) log u(x) + ∑ p(x) log p(x) = D(p||u) ≥ 0
- H(p) is concave [in p]:
  - Proof: from H(X) = log2 |Ω| − D(p||u) and D(p||u) convex ⇒ H(p) concave

Cross-Entropy
- Typical case: we have a series of observations T = {t1, t2, t3, t4, ..., tn} (numbers, words, ...; ti ∈ Ω); estimate (simple):
  - ∀y ∈ Ω: p̂(y) = c(y) / |T|, where c(y) = |{t ∈ T; t = y}|
- ... but the true p is unknown; every sample is too small!
- Natural question: how well do we do using p̂ [instead of p]?
- Idea: simulate the actual p by using a different T' (or rather: by using a different observation we simulate the insufficiency of T vs. some other data (a "random" difference))

Cross Entropy: The Formula
- Hp'(p̂) = H(p') + D(p'||p̂)
- Hp'(p̂) = − ∑x ∈ Ω p'(x) log2 p̂(x) !
- p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p̂
- note on notation (confusing...): p̂/p' are often written simply as p/p'; also HT'(p)
- (Cross-)Perplexity: Gp'(p̂) = GT'(p̂) = 2^Hp'(p̂)

Conditional Cross Entropy
- So far: "unconditional" distributions p(x), p'(x)...
- In practice: virtually always conditioning on context
- Interested in: sample space Ψ, r.v. Y, y ∈ Ψ; context: sample space Ω, r.v. X, x ∈ Ω:
  - "our" distribution p(y|x), tested against p'(y,x), which is taken from some independent data:
  - Hp'(p) = − ∑y ∈ Ψ, x ∈ Ω p'(y,x) log2 p(y|x)

Sample Space vs. Data
- In practice, it is often inconvenient to sum over the sample spaces Ψ, Ω (especially for cross-entropy!)
- Use the following formula instead:
  - Hp'(p) = − ∑y ∈ Ψ, x ∈ Ω p'(y,x) log2 p(y|x) = − (1/|T'|) ∑i = 1..|T'| log2 p(yi|xi)
- This is in fact the normalized log probability of the "test" data:
  - Hp'(p) = − (1/|T'|) log2 ∏i = 1..|T'| p(yi|xi) !

Computation Example
- Ω = {a, b, ..., z}, probability distribution (assumed/estimated from data):
  - p(a) = .25, p(b) = .5, p(x) = 1/64 for x ∈ {c,...,r}, p(x) = 0 for the rest: s, t, u, v, w, x, y, z
- Data (test): barb, so p'(a) = p'(r) = .25, p'(b) = .5
- Sum over Ω (per letter, −p'(x) log2 p(x)): a: .5, b: .5, r: 1.5, all other letters: 0 → total 2.5
- Sum over the data (si = b, a, r, b): −log2 p(si) = 1 + 2 + 6 + 1 = 10; (1/|T'|) × 10 = (1/4) × 10 = 2.5
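
The 2.5 bits can be checked both ways with a few lines of Python; the distribution and the test string are the ones on the slide.

```python
from math import log2
from collections import Counter

# The slide's distribution over {a..z}.
p = {"a": 0.25, "b": 0.5}
p.update({ch: 1 / 64 for ch in "cdefghijklmnopqr"})   # c..r
p.update({ch: 0.0 for ch in "stuvwxyz"})              # s..z

data = "barb"                                          # test data T'
p_prime = {ch: c / len(data) for ch, c in Counter(data).items()}

# Sum over the sample space: -sum_x p'(x) log2 p(x)
h_space = -sum(pp * log2(p[x]) for x, pp in p_prime.items())

# Sum over the data: -(1/|T'|) sum_i log2 p(s_i)
h_data = -sum(log2(p[ch]) for ch in data) / len(data)

print(h_space, h_data)    # both 2.5 bits
```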

Cross Entropy: Some Observations
- How does H(p) compare to Hp'(p)? It can be smaller, equal, or larger: ALL of <, =, > are possible!
- Previous example: [p(a) = .25, p(b) = .5, p(x) = 1/64 for x ∈ {c,...,r}, 0 for the rest: s, t, u, v, w, x, y, z]
  - H(p) = 2.5 bits = Hp'(p) (test data: barb)
- Other data: probable: (1/8)(6+6+6+1+2+1+6+6) = 4.25
  - H(p) < 4.25 bits = Hp'(p) (probable)
- And finally: abba: (1/4)(2+1+1+2) = 1.5
  - H(p) > 1.5 bits = Hp'(p) (abba)
- But what about baby? −p'('y') log2 p('y') = −.25 log2 0 = ∞ (??)

Cross Entropy: Usage
- Comparing data?? NO! (we believe that we test on real data!)
- Rather: comparing distributions (vs. real data)
- Have (got) two distributions: p and q (on some Ω, X)
  - which is better?
  - better: the one with lower cross-entropy (perplexity) on real data S
- "Real" data: S
  - HS(p) = − (1/|S|) ∑i = 1..|S| log2 p(yi|xi) ?? HS(q) = − (1/|S|) ∑i = 1..|S| log2 q(yi|xi)

Comparing Distributions
- Test data S: probable
- p(.) from the previous example: HS(p) = 4.25
  - p(a) = .25, p(b) = .5, p(x) = 1/64 for x ∈ {c,...,r}, 0 for the rest: s, t, u, v, w, x, y, z
- q(.|.) (conditional; defined by a table), e.g.: q(o|r) = 1, q(r|p) = .125
  - HS(q) = (1/8) (−log2 q(p|other) − log2 q(r|p) − log2 q(o|r) − log2 q(b|o) − log2 q(a|b) − log2 q(b|a) − log2 q(l|b) − log2 q(e|l))
  - = (1/8) (0 + 3 + 0 + 0 + 1 + 0 + 1 + 0) = .625
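
The slide specifies only two entries of q explicitly; in the sketch below the remaining conditional probabilities are assumptions chosen so that the per-symbol bit counts match the row (0, 3, 0, 0, 1, 0, 1, 0) above, and "<s>" is a made-up marker for the "other" (start-of-string) context.

```python
from math import log2

# Unigram model p from the earlier example (only the letters needed for "probable").
p = {"a": 0.25, "b": 0.5}
p.update({ch: 1 / 64 for ch in "cdefghijklmnopqr"})

# Bigram model q(y|x): q(o|r) = 1 and q(r|p) = .125 are from the slide;
# the remaining values are assumptions matching the slide's bit counts.
q = {("<s>", "p"): 1.0, ("p", "r"): 0.125, ("r", "o"): 1.0, ("o", "b"): 1.0,
     ("b", "a"): 0.5, ("a", "b"): 1.0, ("b", "l"): 0.5, ("l", "e"): 1.0}

S = "probable"
contexts = ["<s>"] + list(S[:-1])          # previous symbol for each position

h_p = -sum(log2(p[ch]) for ch in S) / len(S)                         # H_S(p)
h_q = -sum(log2(q[(x, y)]) for x, y in zip(contexts, S)) / len(S)    # H_S(q)

print(h_p)   # 4.25 bits
print(h_q)   # 0.625 bits -- the conditional model q wins on this data
```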