
1 CPSC 503 Computational Linguistics
Intro to Probability and Information Theory. Lecture 5. Giuseppe Carenini. CPSC 503, Spring 2004.

2 Today (28/1)
Why do we need probabilities and information theory?
Basic probability theory
Basic information theory
Topics that are more important in the field (not just the ones used in the textbook). An overview, NOT complete: the aim is to clarify the key concepts and address the most common misunderstandings.

3 Why do we need probabilities?
Spelling errors: what is the most probable correct word?
Real-word spelling errors, speech and handwriting recognition: what is the most probable next word?
Part-of-speech tagging, word-sense disambiguation, probabilistic parsing.
Basic question: what is the probability of a sequence of words (e.g., of a sentence)?

4 Disambiguation Tasks
Example: "I made her duck"
Part-of-speech tagging: duck: V / N; her: possessive adjective / dative pronoun.
Word sense disambiguation: make: create / cook; duck is also semantically ambiguous as N (bird vs. cotton fabric) and as V (plunge under water, lower the head or body suddenly).
Syntactic disambiguation: (I (made (her duck))) vs. (I (made (her) (duck))); make: transitive (single direct obj.) / ditransitive (two objs.) / causative (direct obj. + verb).
Goal: find the most likely interpretation in the given context.

5 Why do we need information theory?
How much information is contained in a particular probabilistic model (PM)?
How predictive is a PM?
Given two PMs, which one better matches a corpus?
Tools: entropy, mutual information, relative entropy, cross-entropy, perplexity.

6 Basic Probability/Info Theory
An overview (not complete, sometimes imprecise!) to clarify basic concepts you may encounter in NLP and to address common misunderstandings.

7 Experiments and Sample Spaces
Uncertain situation: an experiment, process, test, ... that produces exactly one out of several possible basic outcomes.
Set of possible basic outcomes: the sample space Ω.
Examples: coin toss (Ω = {head, tail}), die (Ω = {1..6}), opinion poll (Ω = {yes, no}), quality test (Ω = {bad, good}), lottery (|Ω| ≈ 10^5 – 10^7), number of traffic accidents in Canada in 2005 (Ω = ℕ), missing word (|Ω| ≈ vocabulary size).
Missing word: what are the possible outcomes of any process trying to find out what that word is?

8 Events
An event A is a set of basic outcomes: A ⊆ Ω, and A ∈ 2^Ω (the event space, i.e., the power set of Ω).
Ω is the certain event, Ø is the impossible event.
Example: experiment = three coin tosses, Ω = {HHH, HHT, HTH, THH, TTH, HTT, THT, TTT}.
Exactly two tails: A = {TTH, HTT, THT}. All heads: A = {HHH}.

9 Probability Function/Distribution
Intuition: a measure of how likely an event is.
Axiomatic definition: P: 2^Ω → [0,1], with P(Ω) = 1, and P(A ∪ B) = P(A) + P(B) whenever A and B are disjoint events.
Immediate consequences: P(Ø) = 0; P(Ā) = 1 − P(A); A ⊆ B ⇒ P(A) ≤ P(B); Σ_{a∈Ω} P(a) = 1.
How to estimate P(A): repeat the "experiment" n times, let c = # of times the outcome is in A, then P(A) ≈ c/n.
For the missing word we can "repeat" the experiment by considering each word in the text as missing and counting how many times each word is the outcome.
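Below is a minimal Python sketch (not from the original slides) of the relative-frequency estimate P(A) ≈ c/n, applied to the missing-word experiment: each token of an invented toy corpus is treated in turn as the missing word.

```python
from collections import Counter

# Toy corpus standing in for "the text"; treating each token in turn as the
# missing word makes every token one repetition of the experiment.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

n = len(tokens)                       # number of repetitions
counts = Counter(tokens)              # c = # of times each outcome occurred

# Relative-frequency estimate P(A) ~ c/n for the event "the missing word is w"
p_hat = {w: c / n for w, c in counts.items()}

print(p_hat["the"])                   # 4/13 ~ 0.308
print(sum(p_hat.values()))            # ~1.0: the estimates form a distribution
```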

10 Missing Word from Book
Find out what the missing word is, assuming it is a word already in the text and assuming words are morphologically disambiguated.

11 Joint and Conditional Probability
Joint probability: P(A,B) = P(A ∩ B).
Conditional probability: P(A|B) = P(A,B) / P(B).
Bayes' rule: P(A,B) = P(B,A) (since A ∩ B = B ∩ A), so P(A|B) P(B) = P(B|A) P(A), hence P(A|B) = P(B|A) P(A) / P(B).

12 Missing Word: Independence
Find out what the missing word is, assuming it is a word already in the text and assuming words are morphologically disambiguated.

13 Independence
How does P(A|B) relate to P(A)?
If knowing that B is the case does not change the probability of A (i.e., P(A|B) = P(A)), then A and B are independent.
Immediate consequence: P(A,B) = P(A) P(B).
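As a quick illustration (not from the slides), the same three-coin-toss space can be used to check P(A,B) = P(A)·P(B) for one independent and one non-independent pair of events:

```python
from itertools import product

# Three coin tosses again: check whether P(A,B) = P(A) * P(B) holds.
omega = [''.join(t) for t in product('HT', repeat=3)]

def P(event):
    return sum(1 for o in omega if event(o)) / len(omega)

first_heads  = lambda o: o[0] == 'H'          # first toss is heads
second_heads = lambda o: o[1] == 'H'          # second toss is heads
two_tails    = lambda o: o.count('T') == 2    # exactly two tails

# Independent: knowing the first toss tells us nothing about the second.
print(P(lambda o: first_heads(o) and second_heads(o)),
      P(first_heads) * P(second_heads))       # 0.25  0.25

# Not independent: "exactly two tails" does depend on the first toss.
print(P(lambda o: first_heads(o) and two_tails(o)),
      P(first_heads) * P(two_tails))          # 0.125  0.1875
```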

14 Chain Rule
The rule: P(A,B,C,D,...) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) ...
The "proof": P(A,B,C,D,...) = P(A) · P(A,B)/P(A) · P(A,B,C)/P(A,B) · P(A,B,C,D)/P(A,B,C) · ...
General form: P(A1, ..., An) = ∏_{i=1..n} P(Ai | A1 ∩ ... ∩ Ai−1)
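A sketch of what the chain rule buys us in NLP: the probability of a word sequence decomposes into conditional probabilities of each word given its predecessors. The numeric values below are invented purely for illustration.

```python
# Chain-rule decomposition of a word sequence:
#   P(w1, ..., wn) = P(w1) * P(w2|w1) * P(w3|w1,w2) * ...
# The conditional probabilities below are made-up numbers, not real estimates.
sentence = ["I", "made", "her", "duck"]
cond_probs = [
    0.05,   # P(I)                      -- hypothetical
    0.01,   # P(made | I)               -- hypothetical
    0.20,   # P(her | I, made)          -- hypothetical
    0.02,   # P(duck | I, made, her)    -- hypothetical
]

p_sentence = 1.0
for word, p in zip(sentence, cond_probs):
    p_sentence *= p
    print(word, p_sentence)   # running product of the chain-rule factors

print(p_sentence)             # 2e-06: the probability of the whole sequence
```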

15 Random Variables and pmf
So far we have had an event space that differs with every problem we look at. A random variable (RV) X allows us to talk about the probabilities of numerical values that are related to the event space.
A random variable is a function from the sample space to the real numbers (for a continuous RV) or to the integers (for a discrete RV). Because an RV has a numerical range, we can often do the math more easily by working with the values of the RV rather than directly with events, and we can define a probability mass function (pmf) p(x) = P(X = x) for a discrete RV.
Examples: a die with its "natural numbering" [1..6]; English word length [1, ?].

16 Example: English Word Length
[Figure: plot of p(x) for X = English word length (or the number of words in a document), with x ranging roughly from 1 to 25.]
Sampling? How to do it?
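One way to sample, sketched in Python (the toy text is invented): read off the value of X = word length for each token of a text and use relative frequencies as the empirical pmf.

```python
from collections import Counter

# Toy text sample; a real estimate of p(x) would use a large corpus.
text = ("the quick brown fox jumps over the lazy dog and then the fox "
        "ducks under the fence").split()

lengths = [len(w) for w in text]      # value of the RV X = word length per token
counts = Counter(lengths)
n = len(lengths)

pmf = {x: c / n for x, c in sorted(counts.items())}   # p(x) = P(X = x)
for x, p in pmf.items():
    print(x, round(p, 3))             # empirical probability of each word length
```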

17 Expectation and Variance
The expectation E[X] is the (expected) mean or average of an RV. Example: rolling one die, E[X] = 3.5.
The variance Var[X] of an RV is a measure of whether the values of the RV tend to be consistent over samples or to vary a lot; σ = √Var[X] is the standard deviation.
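A short worked example (added for illustration) computing the expectation, variance and standard deviation of the fair-die RV directly from its pmf:

```python
# Expectation and variance of a discrete RV, computed from its pmf.
# Example: one fair die, X in {1, ..., 6}, p(x) = 1/6.
pmf = {x: 1 / 6 for x in range(1, 7)}

E = sum(x * p for x, p in pmf.items())                 # E[X] = sum_x x p(x) = 3.5
Var = sum((x - E) ** 2 * p for x, p in pmf.items())    # Var[X] = E[(X - E[X])^2]
sigma = Var ** 0.5                                     # standard deviation

print(round(E, 3), round(Var, 3), round(sigma, 3))     # 3.5  2.917  1.708
```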

18 Joint, Marginal and Conditional RV/Distributions
Sometimes you need to define more than one RV or probability distribution over a sample space. We call all of them distributions, and Bayes' rule and the chain rule also apply to them!

19 Joint Distributions (word length & word class)
[Table: joint distribution of X = word length (1, 2, 3, 4, ...) and Y = word class (N, V, Adj, Adv). Note: the numbers are fictional, and word class is not an RV.]
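A sketch of how such a joint table can be used, with made-up numbers standing in for the fictional ones on the slide: marginals are obtained by summing out the other variable, and conditionals by dividing by a marginal.

```python
# A fictional joint distribution P(X = x, Y = y) over word length X (1-4) and
# word class Y (N, V, Adj, Adv), standing in for the table on this slide.
joint = {
    (1, 'N'): 0.02, (1, 'V'): 0.03, (1, 'Adj'): 0.01, (1, 'Adv'): 0.04,
    (2, 'N'): 0.08, (2, 'V'): 0.10, (2, 'Adj'): 0.02, (2, 'Adv'): 0.05,
    (3, 'N'): 0.15, (3, 'V'): 0.10, (3, 'Adj'): 0.05, (3, 'Adv'): 0.05,
    (4, 'N'): 0.15, (4, 'V'): 0.05, (4, 'Adj'): 0.07, (4, 'Adv'): 0.03,
}

# Marginals: sum the joint probabilities over the other variable.
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (1, 2, 3, 4)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y)
       for y in ('N', 'V', 'Adj', 'Adv')}

# Conditional: P(Y = y | X = x) = P(x, y) / P(x), e.g. for words of length 3.
p_y_given_x3 = {y: joint[(3, y)] / p_x[3] for y in ('N', 'V', 'Adj', 'Adv')}

print(round(sum(joint.values()), 3))   # 1.0: a proper joint distribution
print(p_x)                             # marginal distribution of word length
print(p_y_given_x3)                    # word-class distribution given length 3
```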

20 Conditional Distributions and Independence (word length & word class)
[Tables over X = word length (1, 2, 3, 4) and Y = word class (N, V, Adj, Adv), illustrating conditional distributions and a check for independence. Note: word class is not an RV.]

21 Standard Distributions
Discrete: binomial, multinomial. Continuous: normal. Go back to your stats textbook for the details...

22 Today (28/1)
Why do we need probabilities and information theory?
Basic probability theory
Basic information theory
Topics that are more important in the field (not just the ones used in the textbook). An overview, NOT complete: the aim is to clarify the key concepts and address the most common misunderstandings.

23 Entropy
Def. 1: a measure of uncertainty.
Def. 2: a measure of the information that we need to resolve an uncertain situation.
Def. 3: a measure of the information that we obtain from an experiment that resolves an uncertain situation.
Let p(x) = P(X = x), where x ∈ X. Then H(p) = H(X) = − Σ_{x∈X} p(x) log2 p(x). It is normally measured in bits.
Note: X is not limited to numbers; it can range over any set of basic elements, e.g., words or parts of speech.
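A minimal Python sketch (not from the slides) of the entropy formula, applied to a fair coin, a biased coin and a fair die:

```python
from math import log2

def entropy(pmf):
    """H(p) = -sum_x p(x) log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

fair_coin   = {'H': 0.5, 'T': 0.5}
biased_coin = {'H': 0.9, 'T': 0.1}
fair_die    = {x: 1 / 6 for x in range(1, 7)}

print(entropy(fair_coin))               # 1.0 bit: maximal uncertainty for two outcomes
print(round(entropy(biased_coin), 3))   # ~0.469 bits: a predictable coin is less uncertain
print(round(entropy(fair_die), 3))      # ~2.585 bits = log2(6)
```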

24 Entropy (extra slides)
Using the formula: an example; the binary-outcome case; the limits (why exactly that formula?); entropy and expectation; the coding interpretation; joint and conditional entropy; a summary of key properties.

25 Mutual Information
Chain rule for entropy: H(X,Y) = H(X) + H(Y|X).
By the chain rule for entropy, H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y). Therefore H(X) − H(X|Y) = H(Y) − H(Y|X).
This difference is called the mutual information between X and Y, I(X,Y): the reduction in uncertainty of one random variable due to knowing about another, i.e., the amount of information one random variable contains about another.
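A small sketch (added for illustration) computing I(X,Y) as H(X) + H(Y) − H(X,Y) for an invented joint distribution over two binary RVs:

```python
from math import log2

def H(pmf):
    """Entropy in bits of a distribution given as {value: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# A small invented joint distribution over two binary RVs X and Y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# I(X,Y) = H(X) + H(Y) - H(X,Y), equivalently H(X) - H(X|Y).
I = H(p_x) + H(p_y) - H(joint)
print(round(I, 4))   # ~0.278 bits of shared information
```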

26 Relative Entropy or Kullback-Leibler Divergence
Def.: the relative entropy is a measure of how different two probability distributions (over the same event space) are: D(p||q) = Σ_{x∈X} p(x) log2(p(x)/q(x)).
The KL divergence between p and q can be seen as the average number of bits that are wasted by encoding events from distribution p with a code based on the not-quite-right distribution q.
I(X,Y) = D(p(x,y) || p(x)p(y)): mutual information measures how much the joint distribution deviates from independence!
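Continuing the sketch from the previous slide, the identity I(X,Y) = D(p(x,y) || p(x)p(y)) can be checked numerically on the same invented joint distribution:

```python
from math import log2

def kl(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

# Same invented joint distribution as in the mutual-information sketch above.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# The joint we would get if X and Y were independent: p(x, y) = p(x) p(y).
indep = {(x, y): p_x[x] * p_y[y] for x in (0, 1) for y in (0, 1)}

# D(p(x,y) || p(x)p(y)) equals the mutual information I(X,Y) computed above.
print(round(kl(joint, indep), 4))   # ~0.278: how far the joint is from independence
```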

27 Next Time
Probabilistic models applied to spelling.
Read Chapter 5 up to page 156.

