
1 CPSC 503 Computational Linguistics
Intro to Probability and Information Theory. Lecture 5. Giuseppe Carenini. CPSC 503, Spring 2004.

2 Today (28/1)
Why do we need probabilities and information theory?
Basic probability theory
Basic information theory
Topics that are more important in the field (not just the ones used in the textbook). An overview, NOT complete: the aim is to clarify the key concepts and address the most common misunderstandings.

3 Why do we need probabilities?
Spelling errors: what is the most probable correct word?
Real-word spelling errors, speech and handwriting recognition: what is the most probable next word?
Part-of-speech tagging, word-sense disambiguation, probabilistic parsing.
Basic question: what is the probability of a sequence of words (e.g., of a sentence)?

4 Disambiguation Tasks
Example: "I made her duck"
Part-of-speech tagging: duck: V / N; her: possessive adjective / dative pronoun.
Word sense disambiguation: make: create / cook; duck is also semantically ambiguous as N (bird vs. cotton fabric) and as V (plunge under water, lower the head or body suddenly).
Syntactic disambiguation: (I (made (her duck))) vs. (I (made (her) (duck))); make: transitive (single direct obj.) / ditransitive (two objs.) / causative (direct obj. + verb).
Goal: find the most likely interpretation in the given context.

5 Why do we need information theory?
How much information is contained in a particular probabilistic model (PM)?
How predictive is a PM?
Given two PMs, which one better matches a corpus?
Tools: entropy, mutual information, relative entropy, cross-entropy, perplexity.

6 Basic Probability/Info Theory
An overview (not complete, sometimes imprecise!) to clarify basic concepts you may encounter in NLP and to address common misunderstandings.

7 Experiments and Sample Spaces
Uncertain situation: an experiment, process, test, ... that produces exactly one out of several possible basic outcomes.
Set of possible basic outcomes: the sample space Ω.
Examples: coin toss (Ω = {head, tail}), die (Ω = {1..6}), opinion poll (Ω = {yes, no}), quality test (Ω = {bad, good}), lottery (|Ω| ≈ 10^5 – 10^7), number of traffic accidents in Canada in 2005 (Ω = ℕ), missing word (|Ω| ≈ vocabulary size).
Missing word: what are the possible outcomes of any process trying to find out what that word is?

8 Events
An event A is a set of basic outcomes: A ⊆ Ω, and A ∈ 2^Ω (the event space, i.e., the power set of Ω).
Ω is the certain event, Ø is the impossible event.
Example: experiment = three coin tosses, Ω = {HHH, HHT, HTH, THH, TTH, HTT, THT, TTT}.
Exactly two tails: A = {TTH, HTT, THT}. All heads: A = {HHH}.

9 Probability Function/Distribution
Intuition: a measure of how likely an event is.
Axiomatic definition: P: 2^Ω → [0,1], with P(Ω) = 1, and P(A ∪ B) = P(A) + P(B) whenever A and B are disjoint events.
Immediate consequences: P(Ø) = 0; P(Ā) = 1 − P(A); A ⊆ B ⇒ P(A) ≤ P(B); Σ_{a∈Ω} P(a) = 1.
How to estimate P(A): repeat the "experiment" n times, let c = # of times the outcome is in A, then P(A) ≈ c/n.
For the missing word we can "repeat" the experiment by considering each word in the text as missing and counting how many times each word is the outcome.
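Below is a minimal Python sketch (not from the original slides) of the relative-frequency estimate P(A) ≈ c/n, applied to the missing-word experiment: each token of an invented toy corpus is treated in turn as the missing word.

```python
from collections import Counter

# Toy corpus standing in for "the text"; treating each token in turn as the
# missing word makes every token one repetition of the experiment.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

n = len(tokens)                       # number of repetitions
counts = Counter(tokens)              # c = # of times each outcome occurred

# Relative-frequency estimate P(A) ~ c/n for the event "the missing word is w"
p_hat = {w: c / n for w, c in counts.items()}

print(p_hat["the"])                   # 4/13 ~ 0.308
print(sum(p_hat.values()))            # ~1.0: the estimates form a distribution
```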

10 Missing Word from Book
Find out what the missing word is, assuming it is a word already in the text and assuming words are morphologically disambiguated.

11 Joint and Conditional Probability
Joint probability: P(A,B) = P(A ∩ B).
Conditional probability: P(A|B) = P(A,B) / P(B).
Bayes' rule: P(A,B) = P(B,A) (since A ∩ B = B ∩ A), so P(A|B) P(B) = P(B|A) P(A), hence P(A|B) = P(B|A) P(A) / P(B).

12 Missing Word: Independence
Find out what the missing word is, assuming it is a word already in the text and assuming words are morphologically disambiguated.

13 Independence
How does P(A|B) relate to P(A)?
If knowing that B is the case does not change the probability of A (i.e., P(A|B) = P(A)), then A and B are independent.
Immediate consequence: P(A,B) = P(A) P(B).
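As a quick illustration (not from the slides), the same three-coin-toss space can be used to check P(A,B) = P(A)·P(B) for one independent and one non-independent pair of events:

```python
from itertools import product

# Three coin tosses again: check whether P(A,B) = P(A) * P(B) holds.
omega = [''.join(t) for t in product('HT', repeat=3)]

def P(event):
    return sum(1 for o in omega if event(o)) / len(omega)

first_heads  = lambda o: o[0] == 'H'          # first toss is heads
second_heads = lambda o: o[1] == 'H'          # second toss is heads
two_tails    = lambda o: o.count('T') == 2    # exactly two tails

# Independent: knowing the first toss tells us nothing about the second.
print(P(lambda o: first_heads(o) and second_heads(o)),
      P(first_heads) * P(second_heads))       # 0.25  0.25

# Not independent: "exactly two tails" does depend on the first toss.
print(P(lambda o: first_heads(o) and two_tails(o)),
      P(first_heads) * P(two_tails))          # 0.125  0.1875
```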

14 Chain Rule
The rule: P(A,B,C,D,...) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) ...
The "proof": P(A,B,C,D,...) = P(A) · P(A,B)/P(A) · P(A,B,C)/P(A,B) · P(A,B,C,D)/P(A,B,C) · ...
General form: P(A1, ..., An) = ∏_{i=1..n} P(Ai | A1 ∩ ... ∩ Ai−1)
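A sketch of what the chain rule buys us in NLP: the probability of a word sequence decomposes into conditional probabilities of each word given its predecessors. The numeric values below are invented purely for illustration.

```python
# Chain-rule decomposition of a word sequence:
#   P(w1, ..., wn) = P(w1) * P(w2|w1) * P(w3|w1,w2) * ...
# The conditional probabilities below are made-up numbers, not real estimates.
sentence = ["I", "made", "her", "duck"]
cond_probs = [
    0.05,   # P(I)                      -- hypothetical
    0.01,   # P(made | I)               -- hypothetical
    0.20,   # P(her | I, made)          -- hypothetical
    0.02,   # P(duck | I, made, her)    -- hypothetical
]

p_sentence = 1.0
for word, p in zip(sentence, cond_probs):
    p_sentence *= p
    print(word, p_sentence)   # running product of the chain-rule factors

print(p_sentence)             # 2e-06: the probability of the whole sequence
```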

15 Random Variables and pmf
So far we have had an event space that differs with every problem we look at. A random variable (RV) X allows us to talk about the probabilities of numerical values that are related to the event space.
A random variable is a function from the sample space to the real numbers (for a continuous RV) or to the integers (for a discrete RV). Because an RV has a numerical range, we can often do the math more easily by working with the values of the RV rather than directly with events, and we can define a probability mass function (pmf) p(x) = P(X = x) for a discrete RV.
Examples: a die with its "natural numbering" [1..6]; English word length [1, ?].

16 Example: English Word Length
[Figure: plot of p(x) for X = English word length (or the number of words in a document), with x ranging roughly from 1 to 25.]
Sampling? How to do it?
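One way to sample, sketched in Python (the toy text is invented): read off the value of X = word length for each token of a text and use relative frequencies as the empirical pmf.

```python
from collections import Counter

# Toy text sample; a real estimate of p(x) would use a large corpus.
text = ("the quick brown fox jumps over the lazy dog and then the fox "
        "ducks under the fence").split()

lengths = [len(w) for w in text]      # value of the RV X = word length per token
counts = Counter(lengths)
n = len(lengths)

pmf = {x: c / n for x, c in sorted(counts.items())}   # p(x) = P(X = x)
for x, p in pmf.items():
    print(x, round(p, 3))             # empirical probability of each word length
```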

17 Expectation and Variance
The expectation E[X] is the (expected) mean or average of an RV. Example: rolling one die, E[X] = 3.5.
The variance Var[X] of an RV is a measure of whether the values of the RV tend to be consistent over samples or to vary a lot; σ = √Var[X] is the standard deviation.
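A short worked example (added for illustration) computing the expectation, variance and standard deviation of the fair-die RV directly from its pmf:

```python
# Expectation and variance of a discrete RV, computed from its pmf.
# Example: one fair die, X in {1, ..., 6}, p(x) = 1/6.
pmf = {x: 1 / 6 for x in range(1, 7)}

E = sum(x * p for x, p in pmf.items())                 # E[X] = sum_x x p(x) = 3.5
Var = sum((x - E) ** 2 * p for x, p in pmf.items())    # Var[X] = E[(X - E[X])^2]
sigma = Var ** 0.5                                     # standard deviation

print(round(E, 3), round(Var, 3), round(sigma, 3))     # 3.5  2.917  1.708
```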

18 Joint, Marginal and Conditional RV/Distributions
Sometimes you need to define more than one RV or probability distribution over a sample space. We call all of them distributions, and Bayes' rule and the chain rule also apply to them!

19 Joint Distributions (word length & word class)
[Table: joint distribution of X = word length (1, 2, 3, 4, ...) and Y = word class (N, V, Adj, Adv). Note: the numbers are fictional, and word class is not an RV.]
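A sketch of how such a joint table can be used, with made-up numbers standing in for the fictional ones on the slide: marginals are obtained by summing out the other variable, and conditionals by dividing by a marginal.

```python
# A fictional joint distribution P(X = x, Y = y) over word length X (1-4) and
# word class Y (N, V, Adj, Adv), standing in for the table on this slide.
joint = {
    (1, 'N'): 0.02, (1, 'V'): 0.03, (1, 'Adj'): 0.01, (1, 'Adv'): 0.04,
    (2, 'N'): 0.08, (2, 'V'): 0.10, (2, 'Adj'): 0.02, (2, 'Adv'): 0.05,
    (3, 'N'): 0.15, (3, 'V'): 0.10, (3, 'Adj'): 0.05, (3, 'Adv'): 0.05,
    (4, 'N'): 0.15, (4, 'V'): 0.05, (4, 'Adj'): 0.07, (4, 'Adv'): 0.03,
}

# Marginals: sum the joint probabilities over the other variable.
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (1, 2, 3, 4)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y)
       for y in ('N', 'V', 'Adj', 'Adv')}

# Conditional: P(Y = y | X = x) = P(x, y) / P(x), e.g. for words of length 3.
p_y_given_x3 = {y: joint[(3, y)] / p_x[3] for y in ('N', 'V', 'Adj', 'Adv')}

print(round(sum(joint.values()), 3))   # 1.0: a proper joint distribution
print(p_x)                             # marginal distribution of word length
print(p_y_given_x3)                    # word-class distribution given length 3
```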

20 Conditional Distributions and Independence (word length & word class)
[Tables over X = word length (1, 2, 3, 4) and Y = word class (N, V, Adj, Adv), illustrating conditional distributions and a check for independence. Note: word class is not an RV.]

21 Standard Distributions
Discrete: binomial, multinomial. Continuous: normal. Go back to your stats textbook for the details...

22 Today (28/1)
Why do we need probabilities and information theory?
Basic probability theory
Basic information theory
Topics that are more important in the field (not just the ones used in the textbook). An overview, NOT complete: the aim is to clarify the key concepts and address the most common misunderstandings.

23 Entropy
Def. 1: a measure of uncertainty.
Def. 2: a measure of the information that we need to resolve an uncertain situation.
Def. 3: a measure of the information that we obtain from an experiment that resolves an uncertain situation.
Let p(x) = P(X = x), where x ∈ X. Then H(p) = H(X) = − Σ_{x∈X} p(x) log2 p(x). It is normally measured in bits.
Note: X is not limited to numbers; it can range over any set of basic elements, e.g., words or parts of speech.
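A minimal Python sketch (not from the slides) of the entropy formula, applied to a fair coin, a biased coin and a fair die:

```python
from math import log2

def entropy(pmf):
    """H(p) = -sum_x p(x) log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

fair_coin   = {'H': 0.5, 'T': 0.5}
biased_coin = {'H': 0.9, 'T': 0.1}
fair_die    = {x: 1 / 6 for x in range(1, 7)}

print(entropy(fair_coin))               # 1.0 bit: maximal uncertainty for two outcomes
print(round(entropy(biased_coin), 3))   # ~0.469 bits: a predictable coin is less uncertain
print(round(entropy(fair_die), 3))      # ~2.585 bits = log2(6)
```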

24 Entropy (extra slides)
Using the formula: an example; the binary-outcome case; the limits (why exactly that formula?); entropy and expectation; the coding interpretation; joint and conditional entropy; a summary of key properties.

25 Mutual Information
Chain rule for entropy: H(X,Y) = H(X) + H(Y|X).
By the chain rule for entropy, H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y). Therefore H(X) − H(X|Y) = H(Y) − H(Y|X).
This difference is called the mutual information between X and Y, I(X,Y): the reduction in uncertainty of one random variable due to knowing about another, i.e., the amount of information one random variable contains about another.
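A small sketch (added for illustration) computing I(X,Y) as H(X) + H(Y) − H(X,Y) for an invented joint distribution over two binary RVs:

```python
from math import log2

def H(pmf):
    """Entropy in bits of a distribution given as {value: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# A small invented joint distribution over two binary RVs X and Y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# I(X,Y) = H(X) + H(Y) - H(X,Y), equivalently H(X) - H(X|Y).
I = H(p_x) + H(p_y) - H(joint)
print(round(I, 4))   # ~0.278 bits of shared information
```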

26 Relative Entropy or Kullback-Leibler Divergence
Def.: the relative entropy is a measure of how different two probability distributions (over the same event space) are: D(p||q) = Σ_{x∈X} p(x) log2(p(x)/q(x)).
The KL divergence between p and q can be seen as the average number of bits that are wasted by encoding events from distribution p with a code based on the not-quite-right distribution q.
I(X,Y) = D(p(x,y) || p(x)p(y)): mutual information measures how much the joint distribution deviates from independence!
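Continuing the sketch from the previous slide, the identity I(X,Y) = D(p(x,y) || p(x)p(y)) can be checked numerically on the same invented joint distribution:

```python
from math import log2

def kl(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

# Same invented joint distribution as in the mutual-information sketch above.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# The joint we would get if X and Y were independent: p(x, y) = p(x) p(y).
indep = {(x, y): p_x[x] * p_y[y] for x in (0, 1) for y in (0, 1)}

# D(p(x,y) || p(x)p(y)) equals the mutual information I(X,Y) computed above.
print(round(kl(joint, indep), 4))   # ~0.278: how far the joint is from independence
```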

27 Next Time
Probabilistic models applied to spelling.
Read Chapter 5 up to page 156.

