1 Statistical methods in NLP Course 2
Diana Trandabăț

2 Quick Recap
Out of three prisoners, one, randomly selected without their knowledge, will be executed, and the other two will be released. One of the prisoners asks the guard to show him which of the other two will be released (at least one of them will be released anyway). If the guard answers, will the prisoner have more information than before?

3 Quick Recap

4 Essential Information Theory
Developed by Shannon in the 1940s.
Goal: maximizing the amount of information that can be transmitted over an imperfect communication channel:
- Data compression (entropy)
- Transmission rate (channel capacity)

5 Probability mass function
The probability mass function gives the probability that a random variable X takes each of its numeric values:
p(x) = P(X = x) = P(A_x), where A_x = {ω ∈ Ω : X(ω) = x}
Example: the number of heads when flipping 2 coins:
p(nr_heads = 0) = 1/4, p(nr_heads = 1) = 1/2, p(nr_heads = 2) = 1/4
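
As a quick check, here is a minimal Python sketch (not from the course materials) that recovers this pmf by enumerating the sample space of two fair coin flips:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for two fair coin flips: four equally likely outcomes.
outcomes = list(product("HT", repeat=2))

# Random variable X = number of heads; A_x is the set of outcomes mapped to x.
counts = Counter(outcome.count("H") for outcome in outcomes)
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}

print(pmf)  # {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
```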

8 Expectation
The expectation E(X) = Σ_x x p(x) is the mean or average of a random variable.
Example: if Y is the value of the face shown when rolling one die, then E(Y) = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5.
E(X + Y) = E(X) + E(Y)
E(XY) = E(X) E(Y) if X and Y are independent

9 Variance
The variance of a random variable is a measure of whether the values of the variable tend to be consistent over trials or to vary a lot.
Var(X) = E((X − E(X))²) = E(X²) − (E(X))²
The commonly used standard deviation σ is the square root of the variance.
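
A small Python sketch (not from the slides) that verifies the die example and both forms of the variance formula:

```python
from fractions import Fraction

# Fair six-sided die: faces 1..6, each with probability 1/6.
faces = range(1, 7)
p = Fraction(1, 6)

E_X = sum(x * p for x in faces)                     # expectation E(X)
E_X2 = sum(x * x * p for x in faces)                # E(X^2)
var_def = sum((x - E_X) ** 2 * p for x in faces)    # Var(X) = E((X - E(X))^2)
var_short = E_X2 - E_X ** 2                         # Var(X) = E(X^2) - (E(X))^2

print(E_X)                  # 7/2 (= 3.5)
print(var_def, var_short)   # both 35/12
```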

10 Entropy
X: a discrete random variable; p(x): the probability mass function of X.
The entropy (or self-information) of X is H(X) = −Σ_x p(x) log2 p(x).
Entropy measures the amount of information in a random variable: it is the average length (in bits) of the message needed to transmit an outcome of that variable using the optimal code. With the optimal code, this average length is minimal and equals the entropy.

11 Entropy (cont)
H(X) is a weighted average of −log2 p(x), where the weighting depends on the probability of each x.
H increases with message length; H = 0 when the value of X is determinate, hence providing no new information.
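
A minimal sketch of this definition in Python; the helper name `entropy` is mine, not from the course:

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) * log2 p(x); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/2]))   # 1.0 bit  (fair coin, cf. Exercise 1)
print(entropy([1/8] * 8))    # 3.0 bits (fair 8-sided die, cf. Exercise 2)
```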

12 Exercise Compute the Entropy of tossing a coin

13 Exercise
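A worked solution, assuming a fair coin:

$$H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}$$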

14 Exercise 2
Example: entropy of rolling an 8-sided die with faces 1, 2, 3, 4, 5, 6, 7, 8.
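
Assuming the die is fair, each face has probability 1/8, so:

$$H(X) = -\sum_{i=1}^{8}\tfrac{1}{8}\log_2\tfrac{1}{8} = \log_2 8 = 3 \text{ bits}$$

A fair 8-sided die can therefore be encoded with 3 bits per outcome (e.g. faces 1–8 as 000–111).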

16 Exercise 3
Entropy of a biased die with P(X=1) = 1/2, P(X=2) = 1/4, P(X=3) = 0, … (remaining face probabilities not recovered).

18 Simplified Polynesian
Simplified Polynesian has six letters, with the following letter frequencies:
p: 1/8, t: 1/4, k: 1/8, a: 1/4, i: 1/8, u: 1/8
From these frequencies we can compute the per-letter entropy and derive an optimal coding (2-bit codewords for the 1/4-probability letters, 3-bit codewords for the 1/8-probability letters).
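
Using the letter frequencies above, the per-letter entropy works out to 2.5 bits:

$$H(P) = -\sum_{\ell \in \{p,t,k,a,i,u\}} P(\ell)\log_2 P(\ell) = 2\cdot\tfrac{1}{4}\cdot 2 + 4\cdot\tfrac{1}{8}\cdot 3 = 2.5 \text{ bits}$$

This matches a code that spends 2 bits on t and a and 3 bits on p, k, i, u.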

21 Joint Entropy
The joint entropy of two random variables X, Y is the amount of information needed on average to specify both their values.
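
For reference, the standard definition:

$$H(X,Y) = -\sum_{x}\sum_{y} p(x,y)\,\log_2 p(x,y)$$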

22 Conditional Entropy The conditional entropy of a random variable Y given another X, expresses how much extra information one still needs to supply on average to communicate Y given that the other party knows X
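
Written out, this is:

$$H(Y\mid X) = \sum_{x} p(x)\,H(Y\mid X = x) = -\sum_{x}\sum_{y} p(x,y)\,\log_2 p(y\mid x)$$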

23 Chain Rule
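The chain rule for entropy states:

$$H(X,Y) = H(X) + H(Y\mid X)$$
$$H(X_1,\ldots,X_n) = H(X_1) + H(X_2\mid X_1) + \cdots + H(X_n\mid X_1,\ldots,X_{n-1})$$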

24 Simplified Polynesian Revisited
Syllable structure: all words consist of sequences of CV syllables (C: consonant, V: vowel).

26 More on entropy
Entropy rate (per-word / per-letter entropy)
Entropy of a Language
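
The standard definitions of these two quantities are:

$$H_{\text{rate}} = \frac{1}{n}\,H(X_1,\ldots,X_n)$$
$$H(L) = \lim_{n\to\infty}\frac{1}{n}\,H(X_1,\ldots,X_n)$$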

27 Mutual Information I(X,Y) is the mutual information between X and Y.
It is the measure of dependence between two random variables, or the amount of information one random variable contains about the other
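
Mutual information relates to entropy through the standard identities:

$$I(X;Y) = H(X) - H(X\mid Y) = H(Y) - H(Y\mid X) = \sum_{x,y} p(x,y)\,\log_2\frac{p(x,y)}{p(x)\,p(y)}$$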

28 Mutual Information (cont)
I is 0 only when X, Y are independent: H(X|Y) = H(X).
H(X) = H(X) − H(X|X) = I(X,X): entropy is the self-information.
For two dependent variables, I grows not only with the degree of their dependence but also with their entropy.
This also explains why the mutual information between two totally dependent variables is not constant but depends on their entropy: H(X) = I(X,X).

29 More on Mutual Information
Conditional Mutual Information
Chain Rule
Pointwise Mutual Information
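
Their standard forms:

$$I(X;Y\mid Z) = H(X\mid Z) - H(X\mid Y,Z)$$
$$I(X_1,\ldots,X_n;\,Y) = \sum_{i=1}^{n} I(X_i;\,Y\mid X_1,\ldots,X_{i-1})$$
$$\mathrm{PMI}(x,y) = \log_2\frac{p(x,y)}{p(x)\,p(y)}$$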

30 Exercise 4
Let p(x|y) be given by [table of values over X and Y; only the entry 1/3 is recoverable]. Find:
(a) H(X), H(Y)
(b) H(X|Y), H(Y|X)
(c) H(X,Y)
(d) I(X,Y)
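
The table itself is not recoverable here, so the following Python sketch uses a hypothetical 2×2 joint distribution (not the one from the exercise) just to show how each requested quantity falls out once p(x,y) is known:

```python
import math

# Hypothetical joint distribution p(x, y); replace with the table from the slide.
# Rows are values of X, columns are values of Y.
joint = [
    [1/4, 1/4],
    [1/6, 1/3],
]

p_x = [sum(row) for row in joint]            # marginal p(x)
p_y = [sum(col) for col in zip(*joint)]      # marginal p(y)

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_X = H(p_x)
H_Y = H(p_y)
H_XY = H([p for row in joint for p in row])  # joint entropy H(X,Y)
H_X_given_Y = H_XY - H_Y                     # chain rule: H(X|Y) = H(X,Y) - H(Y)
H_Y_given_X = H_XY - H_X
I_XY = H_X + H_Y - H_XY                      # I(X;Y) = H(X) + H(Y) - H(X,Y)

print(f"H(X)={H_X:.3f}  H(Y)={H_Y:.3f}  H(X,Y)={H_XY:.3f}")
print(f"H(X|Y)={H_X_given_Y:.3f}  H(Y|X)={H_Y_given_X:.3f}  I(X;Y)={I_XY:.3f}")
```

If the exercise instead specifies the conditional table p(x|y), multiply each column by the corresponding marginal p(y) first to recover the joint.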

31 Entropy and Linguistics
Entropy is a measure of uncertainty: the more we know about something, the lower the entropy.
If a language model captures more of the structure of the language, then its entropy should be lower, so we can use entropy as a measure of the quality of our models: lower entropy means a better code, and the optimal code attains the minimum, the entropy itself.
The entropy of the language exists in the world, but we don't know it (we don't know P); we can only look for better codes, hoping to lower the entropy estimate.

32 Entropy and Linguistics (cont)
Relative entropy (KL divergence): a measure of how different two probability distributions (over the same event space) are; the average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q.
We cannot compute this directly because we still don't know p, so we use another quantity, the cross entropy, as an approximation (see page 75).
Noisy channel => next class!

33 Great!  See you next time!

