9/14/1999 JHU CS 600.465/Jan Hajič 1
Introduction to Natural Language Processing (600.465)
Probability
Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.

2 Experiments & Sample Spaces
Experiment, process, test, ...
Set of possible basic outcomes: sample space Ω
– coin toss (Ω = {head, tail}), die (Ω = {1..6})
– yes/no opinion poll, quality test (bad/good) (Ω = {0,1})
– lottery (|Ω| very large)
– # of traffic accidents somewhere per year (Ω = N)
– spelling errors (Ω = Z*), where Z is an alphabet and Z* is the set of possible strings over such an alphabet
– missing word (|Ω| ≅ vocabulary size)

3 Events
Event A is a set of basic outcomes
Usually A ⊆ Ω, and all A ∈ 2^Ω (the event space)
– Ω is then the certain event, Ø is the impossible event
Example:
– experiment: three coin tosses; Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– count cases with exactly two tails: then A = {HTT, THT, TTH}
– all heads: A = {HHH}

4 Probability
Repeat the experiment many times, record how many times a given event A occurred ("count" c_1).
Do this whole series many times; remember all the c_i's.
Observation: if repeated really many times, the ratios c_i/T_i (where T_i is the number of experiments run in the i-th series) are close to some (unknown but) constant value.
Call this constant the probability of A. Notation: p(A)

5 Estimating probability
Remember: ... close to an unknown constant.
We can only estimate it:
– from a single series (the typical case, as mostly the outcome of a series is given to us and we cannot repeat the experiment): set p(A) = c_1/T_1.
– otherwise, take the weighted average of all c_i/T_i (or, if the data allow, simply look at the set of series as if it were a single long series). This is the best estimate.

6 Example
Recall our example:
– experiment: three coin tosses; Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– count cases with exactly two tails: A = {HTT, THT, TTH}
Run the experiment 1000 times (i.e. 3000 tosses).
Counted: 386 cases with two tails (HTT, THT, or TTH); estimate: p(A) = 386 / 1000 = .386
Run again: 373, 399, 382, 355, 372, 406, 359
– p(A) = .379 (weighted average), or simply 3032 / 8000
Uniform distribution assumption: p(A) = 3/8 = .375
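A minimal simulation of this experiment (a Python sketch, not part of the original slides; the function name and seeds are illustrative): toss three fair coins repeatedly and estimate p(exactly two tails) from counts.

```python
import random

def estimate_two_tails(n_runs=1000, seed=0):
    """Estimate p(exactly two tails in three fair coin tosses) from n_runs experiments."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_runs):
        tosses = [rng.choice("HT") for _ in range(3)]
        if tosses.count("T") == 2:
            count += 1
    return count / n_runs

# Each series of 1000 runs oscillates around the true value 3/8 = 0.375
print([estimate_two_tails(1000, seed=s) for s in range(8)])
```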

7 Basic Properties
Basic properties:
– p: 2^Ω → [0,1]
– p(Ω) = 1
– Disjoint events: p(∪ A_i) = ∑_i p(A_i)
[NB: axiomatic definition of probability: take the above three conditions as axioms]
Immediate consequences:
– p(Ø) = 0, p(¬A) = 1 − p(A), A ⊆ B ⇒ p(A) ≤ p(B)
– ∑_{a ∈ Ω} p(a) = 1

8 Joint and Conditional Probability
p(A,B) = p(A ∩ B)
p(A|B) = p(A,B) / p(B)
– Estimating from counts:
p(A|B) = p(A,B) / p(B) = (c(A ∩ B) / T) / (c(B) / T) = c(A ∩ B) / c(B)
[figure: Venn diagram of Ω with events A, B and their intersection A ∩ B]
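A small sketch (an assumed Python helper, not from the slides) of the count-based estimate p(A|B) = c(A ∩ B) / c(B):

```python
def conditional_prob(outcomes, in_A, in_B):
    """Estimate p(A|B) = c(A ∩ B) / c(B) from a list of observed outcomes.
    in_A and in_B are predicates telling whether an outcome belongs to A / B."""
    c_B = sum(1 for o in outcomes if in_B(o))
    c_AB = sum(1 for o in outcomes if in_A(o) and in_B(o))
    return c_AB / c_B if c_B else 0.0

# Toy data: three-coin-toss outcomes; A = "exactly two tails", B = "first toss is T"
data = ["HTT", "THT", "TTH", "TTT", "HHT", "THH", "HTT", "TTH"]
print(conditional_prob(data, lambda o: o.count("T") == 2, lambda o: o[0] == "T"))
```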

9 Bayes Rule
p(A,B) = p(B,A), since p(A ∩ B) = p(B ∩ A)
– therefore: p(A|B) p(B) = p(B|A) p(A), and therefore
p(A|B) = p(B|A) p(A) / p(B) !
[figure: Venn diagram of Ω with events A, B and their intersection A ∩ B]

10 Independence
Can we compute p(A,B) from p(A) and p(B)?
Recall from the previous foil:
p(A|B) = p(B|A) p(A) / p(B)
p(A|B) p(B) = p(B|A) p(A)
p(A,B) = p(B|A) p(A)
... we're almost there: how does p(B|A) relate to p(B)?
– p(B|A) = p(B) iff A and B are independent
Example: two coin tosses; the weather today and the weather on March 4th, 1789
Any two events for which p(B|A) = p(B)!

11 Chain Rule
p(A_1, A_2, A_3, A_4, ..., A_n) =
p(A_1|A_2,A_3,A_4,...,A_n) × p(A_2|A_3,A_4,...,A_n) × p(A_3|A_4,...,A_n) × ... × p(A_{n-1}|A_n) × p(A_n) !
this is a direct consequence of the Bayes rule.

12 The Golden Rule (of Classic Statistical NLP)
Interested in an event A given B (where it is not easy or practical or desirable to estimate p(A|B)):
take Bayes rule, max over all As:
argmax_A p(A|B) = argmax_A p(B|A) p(A) / p(B) = argmax_A p(B|A) p(A) !
... as p(B) is constant when changing As
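A toy sketch of the golden rule in Python (not from the slides; the candidate words, prior, and likelihood numbers are all hypothetical): choose the A maximizing p(B|A) p(A), since p(B) cancels out.

```python
def most_likely(candidates, prior, likelihood, observed):
    """Return the candidate a maximizing p(observed|a) * p(a); p(observed) is constant."""
    return max(candidates, key=lambda a: likelihood[(observed, a)] * prior[a])

prior = {"their": 0.4, "there": 0.6}                              # hypothetical p(A)
likelihood = {("thier", "their"): 0.5, ("thier", "there"): 0.1}   # hypothetical p(B|A)
print(most_likely(list(prior), prior, likelihood, "thier"))       # -> 'their'
```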

13 Random Variables
A random variable is a function X: Ω → Q
– in general: Q = R^n, typically R
– easier to handle real numbers than real-world events
A random variable is discrete if Q is countable (i.e. also if finite)
Example: die: natural "numbering" [1,6]; coin: {0,1}
Probability distribution:
– p_X(x) = p(X=x) =_df p(A_x), where A_x = {a ∈ Ω: X(a) = x}
– often just p(x) if it is clear from context what X is

14 Expectation; Joint and Conditional Distributions
Expectation is the mean of a random variable (weighted average)
– E(X) = ∑_{x ∈ X(Ω)} x · p_X(x)
Example: one six-sided die: 3.5; two dice (sum): 7
Joint and conditional distribution rules:
– analogous to probability of events
Bayes: p_{X|Y}(x|y) =_notation p_{XY}(x|y) =_even simpler notation p(x|y) = p(y|x) · p(x) / p(y)
Chain rule: p(w,x,y,z) = p(z) · p(y|z) · p(x|y,z) · p(w|x,y,z)
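A quick check of the expectation examples (a Python sketch, not part of the slides):

```python
from fractions import Fraction

def expectation(pmf):
    """E(X) = sum of x * p(x) for a discrete pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}
print(expectation(die))                      # 7/2 = 3.5
two_dice = {s: Fraction(sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == s), 36)
            for s in range(2, 13)}
print(expectation(two_dice))                 # 7
```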

15 Standard distributions
Binomial (discrete)
– outcome: 0 or 1 (thus: binomial)
– make n trials
– interested in the (probability of the) number of successes r
Must be careful: it's not uniform!
p_b(r|n) = C(n,r) / 2^n (for equally likely outcomes)
C(n,r) counts how many possibilities there are for choosing r objects out of n; C(n,r) = n! / ((n−r)! r!)
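A sketch of the binomial pmf in Python (not from the slides); with success probability p = 0.5 it reduces to C(n,r) / 2^n as above:

```python
from math import comb

def binomial_pmf(r, n, p=0.5):
    """Probability of exactly r successes in n independent trials with success probability p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(binomial_pmf(2, 3))                              # 3/8 = 0.375 (two tails in three tosses)
print(sum(binomial_pmf(r, 10) for r in range(11)))     # the pmf sums to 1.0
```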

16 Continuous Distributions
The normal distribution ("Gaussian"):
p_norm(x|μ,σ) = e^(−(x−μ)²/(2σ²)) / (σ √(2π))
where:
– μ is the mean (x-coordinate of the peak) (0)
– σ is the standard deviation (1)
[figure: the bell-shaped density curve over x]
other: hyperbolic, t

9/14/1999 JHU CS 600.465/Jan Hajič 17
*Introduction to Natural Language Processing (600.465)
Essential Information Theory I
Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.

18 The Notion of Entropy
Entropy ~ "chaos", fuzziness, opposite of order, ...
– you know it: it is much easier to create a "mess" than to tidy things up...
Comes from physics:
– entropy does not go down unless energy is used
Measure of uncertainty:
– if low... low uncertainty; the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

19 The Formula
Let p_X(x) be a distribution of random variable X
Basic outcomes (alphabet) Ω
H(X) = − ∑_{x ∈ Ω} p(x) log₂ p(x) !
Unit: bits (with natural log: nats)
Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)

20 Using the Formula: Example
Toss a fair coin: Ω = {head, tail}
– p(head) = .5, p(tail) = .5
– H(p) = −0.5 log₂(0.5) + (−0.5 log₂(0.5)) = 2 × ((−0.5) × (−1)) = 2 × 0.5 = 1
Take a fair, 32-sided die: p(x) = 1/32 for every side x
– H(p) = − ∑_{i=1..32} p(x_i) log₂ p(x_i) = −32 (p(x_1) log₂ p(x_1)) (since p(x_i) = p(x_1) = 1/32 for all i) = −32 × ((1/32) × (−5)) = 5 (now you see why it's called bits?)
Unfair coin:
– p(head) = .2 ... H(p) = .722; p(head) = .01 ... H(p) = .081
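The same numbers can be checked with a few lines of Python (a sketch, not part of the slides):

```python
from math import log2

def entropy(pmf):
    """H(p) = -sum of p(x) log2 p(x); zero-probability outcomes contribute 0 by convention."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

print(entropy({"head": 0.5, "tail": 0.5}))              # 1.0 bit
print(entropy({i: 1/32 for i in range(32)}))            # 5.0 bits
print(round(entropy({"head": 0.2, "tail": 0.8}), 3))    # 0.722
print(round(entropy({"head": 0.01, "tail": 0.99}), 3))  # 0.081
```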

21 Example: Book Availability
[figure: entropy H(p) plotted against p(Book Available) from 0 to 1; the curve is low near both ends ("bad bookstore", "good bookstore") and peaks in the middle]

22 The Limits
When is H(p) = 0?
– if the result of an experiment is known ahead of time:
– necessarily: ∃ x ∈ Ω: p(x) = 1, and ∀ y ∈ Ω, y ≠ x: p(y) = 0
Upper bound?
– none in general
– for |Ω| = n: H(p) ≤ log₂ n
nothing can be more uncertain than the uniform distribution

23 Entropy and Expectation
Recall:
– E(X) = ∑_{x ∈ X(Ω)} p_X(x) · x
Then:
E(log₂(1/p_X(x))) = ∑_{x ∈ X(Ω)} p_X(x) log₂(1/p_X(x)) = − ∑_{x ∈ X(Ω)} p_X(x) log₂ p_X(x) = H(p_X) =_notation H(p)

24 Perplexity: motivation
Recall:
– 2 equiprobable outcomes: H(p) = 1 bit
– 32 equiprobable outcomes: H(p) = 5 bits
– 4.3 billion equiprobable outcomes: H(p) ≈ 32 bits
What if the outcomes are not equiprobable?
– 32 outcomes, 2 equiprobable at .5, rest impossible: H(p) = 1 bit
– any measure for comparing the entropy (i.e. uncertainty/difficulty of prediction) (also) for random variables with different numbers of outcomes?

25 Perplexity
Perplexity:
– G(p) = 2^H(p)
... so we are back at 32 (for 32 equiprobable outcomes), 2 for fair coins, etc.
it is easier to imagine:
– NLP example: the size of a vocabulary with a uniform distribution that is equally hard to predict
the "wilder" (more biased) the distribution, the better:
– lower entropy, lower perplexity
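A perplexity sketch building on the entropy function above (Python, not part of the slides):

```python
from math import log2

def perplexity(pmf):
    """G(p) = 2**H(p): the size of the uniform distribution that is equally hard to predict."""
    h = -sum(p * log2(p) for p in pmf.values() if p > 0)
    return 2 ** h

print(perplexity({i: 1/32 for i in range(32)}))            # 32.0
print(perplexity({"head": 0.5, "tail": 0.5}))              # 2.0
print(round(perplexity({"head": 0.2, "tail": 0.8}), 2))    # 1.65 (biased -> lower perplexity)
```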

26 Joint Entropy and Conditional Entropy
Two random variables: X (sample space Ω), Y (sample space Ψ)
Joint entropy:
– no big deal ((X,Y) considered a single event):
H(X,Y) = − ∑_{x ∈ Ω} ∑_{y ∈ Ψ} p(x,y) log₂ p(x,y)
Conditional entropy:
H(Y|X) = − ∑_{x ∈ Ω} ∑_{y ∈ Ψ} p(x,y) log₂ p(y|x)
recall that H(X) = E(log₂(1/p_X(x))) (weighted "average", and weights are not conditional)

27 Conditional Entropy (Using the Calculus)
other definition:
H(Y|X) = ∑_{x ∈ Ω} p(x) H(Y|X=x) =
(for H(Y|X=x), we can use the single-variable definition (x ~ constant))
= ∑_{x ∈ Ω} p(x) ( − ∑_{y ∈ Ψ} p(y|x) log₂ p(y|x) ) =
= − ∑_{x ∈ Ω} ∑_{y ∈ Ψ} p(y|x) p(x) log₂ p(y|x) =
= − ∑_{x ∈ Ω} ∑_{y ∈ Ψ} p(x,y) log₂ p(y|x)
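A conditional-entropy sketch in Python (not from the slides; the joint distribution below is a made-up toy example):

```python
from math import log2

def conditional_entropy(joint):
    """H(Y|X) = -sum over (x,y) of p(x,y) log2 p(y|x); joint is given as {(x, y): p}."""
    p_x = {}
    for (x, _), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return -sum(p * log2(p / p_x[x]) for (x, _), p in joint.items() if p > 0)

# Toy joint distribution over X in {0,1} and Y in {'a','b'} (made-up numbers)
joint = {(0, "a"): 0.4, (0, "b"): 0.1, (1, "a"): 0.1, (1, "b"): 0.4}
print(round(conditional_entropy(joint), 3))   # 0.722, i.e. H(Y|X) < H(Y) = 1 bit
```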

28 Properties of Entropy I
Entropy is non-negative:
– H(X) ≥ 0
– proof: (recall: H(X) = − ∑_{x ∈ Ω} p(x) log₂ p(x))
log(p(x)) is negative or zero for p(x) ≤ 1; p(x) is non-negative; their product p(x) log(p(x)) is thus negative or zero; a sum of non-positive numbers is non-positive; and −f is non-negative for non-positive f
Chain rule:
– H(X,Y) = H(Y|X) + H(X), as well as
– H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y))

29 Properties of Entropy II
Conditional entropy is better (than unconditional):
– H(Y|X) ≤ H(Y)
H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
– equality iff X, Y independent [recall: X, Y independent iff p(X,Y) = p(X)p(Y)]
H(p) is concave (remember the book availability graph?)
– concave function f over an interval (a,b): ∀ x,y ∈ (a,b), ∀ λ ∈ [0,1]: f(λx + (1−λ)y) ≥ λf(x) + (1−λ)f(y)
– a function f is convex if −f is concave
[for proofs and generalizations, see Cover & Thomas]
[figure: a concave curve f over points x, y with the chord value λf(x) + (1−λ)f(y)]

30 "Coding" Interpretation of Entropy
The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p: = H(p)
Remember various compression algorithms?
– they do well on data with repeating (= easily predictable = low-entropy) patterns
– their results, though, have high entropy ⇒ compressing compressed data does nothing

31 Coding: Example
How many bits do we need for ISO Latin 1?
– the trivial answer: 8 (32–255)
Experience: some chars are more common, some (very) rare:
– ...so what if we use more bits for the rare and fewer bits for the frequent? [be careful: we want to decode (easily)!]
– suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ .0004
– code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, rest: 11 b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8
– code acbbécbaac: 00 10 01 01 11b_1...b_8 10 01 00 00 10
– number of bits used: 28 (vs. 80 using "naive" coding)
code length ~ 1 / probability; conditional probabilities OK!
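A quick Python check of the bit count (a sketch, not part of the slides): 'a', 'b', 'c' cost 2 bits each, any other character costs 10 bits (the 2-bit prefix 11 plus its 8-bit Latin-1 code).

```python
def encoded_length(text):
    """Number of bits under the slide's variable-length code."""
    return sum(2 if ch in "abc" else 10 for ch in text)

msg = "acbbécbaac"
print(encoded_length(msg), "bits vs.", 8 * len(msg), "bits naive")   # 28 vs. 80
```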

32 Entropy of a Language
Imagine that we produce the next letter using p(l_{n+1}|l_1,...,l_n), where l_1,...,l_n is the sequence of all the letters which had been uttered so far (i.e. n is really big!); let's call l_1,...,l_n the history h (h_{n+1}), and all histories H:
Then compute its entropy:
– H = − ∑_{h ∈ H} ∑_{l ∈ A} p(l,h) log₂ p(l|h), where A is the alphabet
Not very practical, is it?

33 Kullback-Leibler Distance (Relative Entropy)
Remember:
– long series of experiments... c_i/T_i oscillates around some number... we can only estimate it... so we get a distribution q.
So we have a distribution q (sample space Ω, r.v. X); the true distribution is, however, p (same Ω, X): how big an error are we making?
D(p||q) (the Kullback-Leibler distance):
D(p||q) = ∑_{x ∈ Ω} p(x) log₂ (p(x)/q(x)) = E_p log₂ (p(x)/q(x))
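A KL-divergence sketch in Python (not from the slides; it uses the conventions from the next slide, 0 log 0 = 0 and p log(p/0) = ∞):

```python
from math import log2

def kl_divergence(p, q):
    """D(p||q) = sum of p(x) log2(p(x)/q(x)); conventions: 0 log 0 = 0, p log(p/0) = inf."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return float("inf")
        total += px * log2(px / qx)
    return total

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.75, "b": 0.25}
print(round(kl_divergence(p, q), 3), round(kl_divergence(q, p), 3))   # 0.208 0.189 (not symmetric)
```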

34 Comments on Relative Entropy
Conventions:
– 0 log 0 = 0
– p log (p/0) = ∞ (for p > 0)
Distance? (less "misleading": divergence)
– not quite: not symmetric: D(p||q) ≠ D(q||p), and it does not satisfy the triangle inequality [for a metric d: d(x,z) ≤ d(x,y) + d(y,z)]
– but useful to look at it that way
H(p) + D(p||q): bits needed for encoding p if q is used

35 Mutual Information (MI)
...in terms of relative entropy
Random variables X, Y; p_{XY}(x,y), p_X(x), p_Y(y)
Mutual information (between two random variables X, Y):
I(X,Y) = D(p(x,y) || p(x)p(y))
I(X,Y) measures how much (our knowledge of) Y contributes (on average) to easing the prediction of X
or, how p(x,y) deviates from the (independence) product p(x)p(y)

36 Mutual Information: the Formula
Rewrite the definition:
[recall: D(r||s) = ∑_{v ∈ Ω} r(v) log₂ (r(v)/s(v)); substitute r(v) = p(x,y), s(v) = p(x)p(y); v ~ (x,y)]
I(X,Y) = D(p(x,y) || p(x)p(y)) = ∑_{x ∈ Ω} ∑_{y ∈ Ψ} p(x,y) log₂ (p(x,y) / (p(x)p(y)))
Measured in bits (what else? :-) !
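A mutual-information sketch in Python (not from the slides), reusing the toy joint distribution from the conditional-entropy example:

```python
from math import log2

def mutual_information(joint):
    """I(X,Y) = sum over (x,y) of p(x,y) log2( p(x,y) / (p(x) p(y)) ); joint is {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

joint = {(0, "a"): 0.4, (0, "b"): 0.1, (1, "a"): 0.1, (1, "b"): 0.4}
print(round(mutual_information(joint), 3))   # 0.278 = H(Y) - H(Y|X) = 1 - 0.722
```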

37 From Mutual Information to Entropy
by how many bits does the knowledge of Y lower the entropy H(X)?
I(X,Y) = ∑_{x ∈ Ω} ∑_{y ∈ Ψ} p(x,y) log₂ (p(x,y)/(p(y)p(x))) =
...use p(x,y)/p(y) = p(x|y)
= ∑_{x ∈ Ω} ∑_{y ∈ Ψ} p(x,y) log₂ (p(x|y)/p(x)) =
...use log(a/b) = log a − log b (a ~ p(x|y), b ~ p(x)), distribute sums
= ∑_{x ∈ Ω} ∑_{y ∈ Ψ} p(x,y) log₂ p(x|y) − ∑_{x ∈ Ω} ∑_{y ∈ Ψ} p(x,y) log₂ p(x) =
...use def. of H(X|Y) (left term), and ∑_{y ∈ Ψ} p(x,y) = p(x) (right term)
= −H(X|Y) + (− ∑_{x ∈ Ω} p(x) log₂ p(x)) =
...use def. of H(X) (right term), swap terms
= H(X) − H(X|Y)
...by symmetry, = H(Y) − H(Y|X)

38 Properties of MI vs. Entropy
I(X,Y) = H(X) − H(X|Y) = number of bits by which the knowledge of Y lowers the entropy of X
= H(Y) − H(Y|X) (prev. foil, symmetry)
Recall: H(X,Y) = H(X|Y) + H(Y) ⇒ −H(X|Y) = H(Y) − H(X,Y) ⇒ I(X,Y) = H(X) + H(Y) − H(X,Y)
I(X,X) = H(X) (since H(X|X) = 0)
I(X,Y) = I(Y,X) (just for completeness)
I(X,Y) ≥ 0 ... let's prove that now (as promised).

39 Jensen's Inequality
Recall: f is convex on an interval (a,b) iff ∀ x,y ∈ (a,b), ∀ λ ∈ [0,1]: f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y)
J.I.: for a distribution p(x), r.v. X on Ω, and convex f:
f( ∑_x p(x) x ) ≤ ∑_x p(x) f(x)
Proof (idea): by induction on the number of basic outcomes; start with |Ω| = 2:
p(x_1)f(x_1) + p(x_2)f(x_2) ≥ f(p(x_1)x_1 + p(x_2)x_2) (<= def. of convexity)
for the induction step (|Ω| = k → k+1), just use the induction hypothesis and the def. of convexity (again).
[figure: a curve f over points x, y with the chord value λf(x) + (1−λ)f(y)]

40 Information Inequality
D(p||q) ≥ 0 !
Proof:
0 = − log 1 = − log ∑_{x ∈ Ω} q(x) = − log ∑_{x ∈ Ω} (q(x)/p(x)) p(x) ≤
...apply Jensen's inequality here (− log is convex)...
≤ ∑_{x ∈ Ω} p(x) (−log(q(x)/p(x))) = ∑_{x ∈ Ω} p(x) log(p(x)/q(x)) = D(p||q)

41 Other (In)Equalities and Facts
Log sum inequality: for r_i, s_i ≥ 0:
∑_{i=1..n} r_i log(r_i/s_i) ≥ (∑_{i=1..n} r_i) log( (∑_{i=1..n} r_i) / (∑_{i=1..n} s_i) )
D(p||q) is convex [in p,q] (<= log sum inequality)
H(p_X) ≤ log₂ |Ω|, where Ω is the sample space of p_X
– Proof: take uniform u(x) on the same sample space Ω: ∑ p(x) log u(x) = −log₂|Ω|;
log₂|Ω| − H(X) = − ∑ p(x) log u(x) + ∑ p(x) log p(x) = D(p||u) ≥ 0
H(p) is concave [in p]:
– Proof: from H(X) = log₂|Ω| − D(p||u); D(p||u) convex ⇒ H(X) concave

42 Cross-Entropy
Typical case: we've got a series of observations T = {t_1, t_2, t_3, t_4, ..., t_n} (numbers, words, ...; t_i ∈ Ω); estimate (simple):
∀ y ∈ Ω: p̂(y) = c(y) / |T|, where c(y) = |{t ∈ T: t = y}|
...but the true p is unknown; every sample is too small!
Natural question: how well do we do using p̂ [instead of p]?
Idea: simulate actual p by using a different T' [p']
(or rather: by using a different observation we simulate the insufficiency of T vs. some other data ("random" difference))

43 Cross Entropy: The Formula
H_{p'}(p̂) = H(p') + D(p' || p̂)
H_{p'}(p̂) = − ∑_{x ∈ Ω} p'(x) log₂ p̂(x) !
p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p̂
note on notation (confusing...): p/p' ↔ p̂, also H_{T'}(p̂)
(Cross-)Perplexity: G_{p'}(p̂) = G_{T'}(p̂) = 2^{H_{p'}(p̂)}

44 Conditional Cross Entropy
So far: "unconditional" distribution(s) p(x), p'(x)...
In practice: virtually always conditioning on context
Interested in: sample space Ψ, r.v. Y, y ∈ Ψ; context: sample space Ω, r.v. X, x ∈ Ω:
"our" distribution p(y|x); test against p'(y,x), which is taken from some independent data:
H_{p'}(p) = − ∑_{y ∈ Ψ, x ∈ Ω} p'(y,x) log₂ p(y|x)

45 Sample Space vs. Data
In practice, it is often inconvenient to sum over the sample space(s) Ψ, Ω (especially for cross entropy!)
Use the following formula:
H_{p'}(p) = − ∑_{y ∈ Ψ, x ∈ Ω} p'(y,x) log₂ p(y|x) = − (1/|T'|) ∑_{i=1..|T'|} log₂ p(y_i|x_i)
This is in fact the normalized log probability of the "test" data:
H_{p'}(p) = − (1/|T'|) log₂ ∏_{i=1..|T'|} p(y_i|x_i) !
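A cross-entropy sketch over test data (Python, not from the slides); for simplicity it is the unconditional version, matching the computation example on the next slide:

```python
from math import log2

def cross_entropy(test_data, p):
    """H_{p'}(p) = -(1/|T'|) * sum of log2 p(t_i): normalized neg. log probability of the test data."""
    return -sum(log2(p[t]) for t in test_data) / len(test_data)

# p(a) = .25, p(b) = .5, p(x) = 1/64 for x in c..r (0 otherwise); test data T' = "barb"
p = {"a": 0.25, "b": 0.5, **{ch: 1/64 for ch in "cdefghijklmnopqr"}}
print(cross_entropy("barb", p))   # (1/4) * (1 + 2 + 6 + 1) = 2.5 bits
```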

46 Computation Example
Ω = {a, b, ..., z}, prob. distribution (assumed/estimated from data):
– p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s,t,u,v,w,x,y,z
Data (test): barb; p'(a) = p'(r) = .25, p'(b) = .5
Sum over Ω:
– H_{p'}(p) = ∑_{α ∈ {a..z}} −p'(α) log₂ p(α) = .25×2 + .5×1 + .25×6 = 2.5
Sum over data:
– i/s_i: 1/b, 2/a, 3/r, 4/b; −log₂ p(s_i): 1, 2, 6, 1; sum = 10
– H_{p'}(p) = (1/|T'|) × 10 (note: the factor is 1/|T'|, not p') = (1/4) × 10 = 2.5

47 Cross Entropy: Some Observations
H(p) <, =, > H_{p'}(p)? ALL of these are possible!
Previous example: [p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s,t,u,v,w,x,y,z]
H(p) = 2.5 bits = H_{p'}(p) (barb)
Other data: probable: (1/8)(6+6+6+1+2+1+6+6) = 4.25
– H(p) < 4.25 bits = H_{p'}(p) (probable)
And finally: abba: (1/4)(2+1+1+2) = 1.5
– H(p) > 1.5 bits = H_{p'}(p) (abba)
But what about: baby
– −p'('y') log₂ p('y') = −.25 log₂ 0 = ∞ (??)

48 Cross Entropy: Usage
Comparing data??
– NO! (we believe that we test on real data!)
Rather: comparing distributions (vs. real data = test data)
Have (got) 2 distributions: p and q (on some Ω, X)
– which is better?
– better: the one with lower cross-entropy (perplexity) on real data S
"Real" data: S
H_S(p) = − (1/|S|) ∑_{i=1..|S|} log₂ p(y_i|x_i)  ??  H_S(q) = − (1/|S|) ∑_{i=1..|S|} log₂ q(y_i|x_i)

49 Comparing Distributions
(Real) test data S: probable
p(.) from the previous example: H_S(p) = 4.25
– p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s,t,u,v,w,x,y,z
q(.|.) (conditional; defined by a table): much better!!
– ex.: q(o|r) = 1, q(r|p) = .125
– H_S(q) = − (1/8)(log₂ q(p|oth.) + log₂ q(r|p) + log₂ q(o|r) + log₂ q(b|o) + log₂ q(a|b) + log₂ q(b|a) + log₂ q(l|b) + log₂ q(e|l)) = .625
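A comparison sketch in Python (not from the slides): score two models on the same test data by cross-entropy; the uniform competitor q below is hypothetical, since the slide's conditional q(.|.) table is not given in full.

```python
from math import log2

def cross_entropy(test_data, p):
    """Normalized negative log2 probability of the test data under model p."""
    return -sum(log2(p[t]) for t in test_data) / len(test_data)

p = {"a": 0.25, "b": 0.5, **{ch: 1/64 for ch in "cdefghijklmnopqr"}}
q = {ch: 1/26 for ch in "abcdefghijklmnopqrstuvwxyz"}    # hypothetical uniform competitor

S = "probable"
print(cross_entropy(S, p), round(cross_entropy(S, q), 2))   # 4.25 vs. ~4.7: p wins here
```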