1
KEY CONCEPTS IN PROBABILITY: SMOOTHING, MLE, AND MAP
2
Outline
MAPs and MLEs – catchup from last week
Joint Distributions – a new learner
Naïve Bayes – another new learner
3
Administrivia
Homeworks:
– Due tomorrow
– Hardcopy and Autolab submission (see wiki)
Texts:
– Mitchell or Murphy are optional this week – an update from Tom Mitchell’s long-expected new edition
– Bishop is also excellent if you prefer but a little harder to skip around in
– Pick one or the other (both is overkill)
– Main differences are not content but notation: for instance…
4
Some practical problems
I bought a loaded d20 on eBay… but it didn’t come with any useful specs. How can I find out how it behaves?
1. Collect some data (20 rolls)
2. Estimate Pr(i) = C(rolls of i) / C(any roll)
5
A better solution
I bought a loaded d20 on eBay… but it didn’t come with any specs. How can I find out how it behaves?
0. Imagine some data (20 rolls, each i shows up 1x)
1. Collect some data (20 rolls)
2. Estimate Pr(i) = C(rolls of i) / C(any roll)
6
A better solution?
Q: What if I used m imaginary rolls, with a probability of q = 1/20 of rolling any i? That gives the smoothed estimate
Pr(i) = (C(rolls of i) + m*q) / (C(any roll) + m)
I can use this formula with m > 20, or even with m < 20 … say with m = 1.
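A minimal sketch of the two estimators, assuming made-up roll data (the random rolls and the m, q values here are illustrative, not from the slides):

from collections import Counter
import random

random.seed(0)
rolls = [random.randint(1, 20) for _ in range(20)]   # stand-in for the 20 observed d20 rolls
counts = Counter(rolls)

m, q = 20, 1.0 / 20   # m imaginary rolls, each face equally likely

def mle(i):
    # raw maximum-likelihood estimate: C(i) / C(any roll)
    return counts[i] / len(rolls)

def smoothed(i):
    # smoothed (MAP-style) estimate: (C(i) + m*q) / (C(any roll) + m)
    return (counts[i] + m * q) / (len(rolls) + m)

for i in range(1, 21):
    print(i, round(mle(i), 3), round(smoothed(i), 3))

Unseen faces get probability 0 under the MLE but a small nonzero probability under the smoothed estimate.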
7
Terminology – more later
– This is called a uniform Dirichlet prior
– C(i), C(ANY) are sufficient statistics
– MLE = maximum likelihood estimate
– MAP = maximum a posteriori estimate
– Tom’s notes are different
8
Some differences….
William: Estimate each probability Pr(i) associated with a multinomial, for C(i) = count of times you saw i, with MLE as
Pr(i) = C(i) / C(ANY)
and estimate the ith MAP as
Pr(i) = (C(i) + m*q) / (C(ANY) + m)
Tom: estimate Θ = P(heads) for a binomial with MLE as
Θ = #heads / (#heads + #tails)
and with MAP as
Θ = (#heads + #imaginary heads) / (#heads + #imaginary heads + #tails + #imaginary tails)
9
Some apparent differences….
Tom: estimate Θ = P(heads) for a binomial with MLE as
Θ = α1 / (α1 + α0), where α1 = #heads, α0 = #tails
and with MAP as
Θ = (α1 + γ1) / (α1 + γ1 + α0 + γ0), where γ1 = #imaginary heads, γ0 = #imaginary tails
The two notations line up: C(i) = α1, C(ANY) = α0 + α1, m = (γ0 + γ1), q = γ1 / (γ0 + γ1)
Tom’s version (imaginary heads/tails) emphasizes the pseudo-data; William’s version (m and q) emphasizes the prior… and the confidence in the prior.
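A quick numerical check of this correspondence, using invented counts (the specific numbers are not from the slides):

# made-up real and imaginary coin-flip counts
alpha1, alpha0 = 7, 3      # observed heads, tails
gamma1, gamma0 = 2, 2      # imaginary heads, tails

# Tom's MAP: add the imaginary counts to the real ones
theta_tom = (alpha1 + gamma1) / (alpha1 + gamma1 + alpha0 + gamma0)

# William's MAP: the same thing written with a prior q and a confidence m
C_i, C_any = alpha1, alpha0 + alpha1
m = gamma0 + gamma1
q = gamma1 / (gamma0 + gamma1)
theta_william = (C_i + m * q) / (C_any + m)

assert abs(theta_tom - theta_william) < 1e-12
print(theta_tom, theta_william)   # both 0.6428...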
10
[Two plots: imagined m = 60 samples with q = 0.3, and imagined m = 60 samples with q = 0.4]
11
[Two plots: imagined m = 120 samples with q = 0.3, and imagined m = 120 samples with q = 0.4]
12
Why we call this a MAP
Simple case: replace the die with a coin
– Now there’s one parameter: q = P(H)
– I start with a prior over q, P(q)
– I get some data: D = {D1=H, D2=T, …}
– I compute the maximum of the posterior of q
MAP estimate: q_MAP = argmax_q P(q | D) = argmax_q P(D | q) P(q)
MLE estimate: q_MLE = argmax_q P(D | q)
13
Why we call this a MAP
Simple case: replace the die with a coin
– Now there’s one parameter: q = P(H)
– I start with a prior over q, P(q)
– I get some data: D = {D1=H, D2=T, …}
– I compute the posterior of q
The math works out if the pdf of the prior is P(q) ∝ q^α (1−q)^β (a Beta density with parameters α+1, β+1); α and β then act as counts of imaginary pos/neg examples.
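A short standard derivation of why the imaginary counts appear, written in the notation above (n1 and n0 denote the observed heads and tails):

% With n_1 observed heads, n_0 observed tails, and prior P(q) \propto q^{\alpha}(1-q)^{\beta}:
P(q \mid D) \;\propto\; P(D \mid q)\,P(q)
  \;=\; q^{n_1}(1-q)^{n_0}\, q^{\alpha}(1-q)^{\beta}
  \;=\; q^{n_1+\alpha}(1-q)^{n_0+\beta}
% Setting the derivative of the log-posterior to zero gives the mode, i.e. the MAP estimate:
\hat{q}_{\mathrm{MAP}} \;=\; \frac{n_1+\alpha}{n_1+\alpha+n_0+\beta}
% which is just the MLE computed after adding \alpha imaginary heads and \beta imaginary tails.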
14
Why we call this a MAP
[Plots of the pdf P(q) for several parameter settings.]
15
Why we call this a MAP
This is called a beta distribution.
The generalization to multinomials is called a Dirichlet distribution, with parameters α1, …, αK and density
f(x1, …, xK) ∝ x1^(α1−1) · x2^(α2−1) · … · xK^(αK−1)
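For reference, the Dirichlet density and the MAP estimate it yields in the multinomial case (standard results, stated here in the notation above):

% Dirichlet density over a probability vector (x_1, ..., x_K):
f(x_1,\dots,x_K;\,\alpha_1,\dots,\alpha_K)
  \;=\; \frac{1}{B(\alpha_1,\dots,\alpha_K)}\,\prod_{k=1}^{K} x_k^{\alpha_k-1},
  \qquad x_k \ge 0,\quad \textstyle\sum_k x_k = 1
% With observed counts C(k), the posterior mode (MAP estimate) is
\hat{P}(k) \;=\; \frac{C(k) + \alpha_k - 1}{C(\mathrm{ANY}) + \sum_j (\alpha_j - 1)}
% which matches the smoothed estimate (C(k) + m q)/(C(\mathrm{ANY}) + m)
% for the uniform prior q = 1/K when \alpha_k - 1 = m/K for every k.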
16
KEY CONCEPTS IN PROBABILITY: THE JOINT DISTRIBUTION
17
Some practical problems
I have one standard “fair” d6 die and two loaded d6 dice, one loaded high and one loaded low.
Loaded high: P(X=6) = 0.50
Loaded low: P(X=1) = 0.50
Experiment: pick two of the dice uniformly at random (call the pair A) and roll them. What is more likely – rolling a seven or rolling doubles?
Three combinations: HL, HF, FL
P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL)
     = P(D | A=HL)*P(A=HL) + P(D | A=HF)*P(A=HF) + P(D | A=FL)*P(A=FL)
18
A brute-force solution
A    Roll 1  Roll 2  P                    Comment
FL   1       1       1/3 * 1/6 * 1/2      doubles
FL   1       2       1/3 * 1/6 * 1/10
FL   1       …       …
…    1       6       …                    seven
…    …       …       …
…    6       6       …                    doubles
HL   1       1       …
…    1       2       …
…    …       …       …
HF   1       1       …
…    …       …       …
A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk. With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=xi), e.g.
P(doubles)
P(seven or eleven)
P(total is higher than 5)
…
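A small brute-force script for this table (a sketch: following the example rows, it assumes each loaded die puts probability 1/10 on its non-loaded faces and that the three pairs are equally likely):

from itertools import product
from fractions import Fraction

F = {i: Fraction(1, 6) for i in range(1, 7)}                                   # fair die
H = {i: Fraction(1, 2) if i == 6 else Fraction(1, 10) for i in range(1, 7)}    # loaded high
L = {i: Fraction(1, 2) if i == 1 else Fraction(1, 10) for i in range(1, 7)}    # loaded low

pairs = {'HL': (H, L), 'HF': (H, F), 'FL': (F, L)}    # three equally likely picks

joint = {}   # (A, roll1, roll2) -> probability
for a, (d1, d2) in pairs.items():
    for r1, r2 in product(range(1, 7), repeat=2):
        joint[(a, r1, r2)] = Fraction(1, 3) * d1[r1] * d2[r2]

def prob(event):
    # P(event) = sum of the joint rows where the boolean event holds
    return sum(p for (a, r1, r2), p in joint.items() if event(r1, r2))

print("P(doubles) =", prob(lambda r1, r2: r1 == r2))
print("P(seven)   =", prob(lambda r1, r2: r1 + r2 == 7))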
19
The Joint Distribution
Recipe for making a joint distribution of M variables:
Example: Boolean variables A, B, C
20
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
Example: Boolean variables A, B, C
A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
21
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C
A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10
22
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C
A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10
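A minimal sketch of this recipe in code, using the example table above (the prob() helper is my own, not from the slides):

# Step 1 + 2: a truth table with a probability for each row (A, B, C)
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Step 3: the axioms require the entries to sum to 1
assert abs(sum(joint.values()) - 1.0) < 1e-9

def prob(event):
    # P(event) for any boolean combination of the primitive events
    return sum(p for row, p in joint.items() if event(*row))

print(prob(lambda a, b, c: a == 1))            # P(A)
print(prob(lambda a, b, c: a == 1 or b == 1))  # P(A or B)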
23
Estimating The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, estimate how probable it is from data.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C
A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10
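A sketch of step 2 when the probabilities come from data: count how often each row occurs, optionally with the m, q smoothing from earlier (the training records here are invented for illustration):

from collections import Counter
from itertools import product

# invented training data: one (A, B, C) observation per record
data = [(0, 0, 0), (1, 1, 0), (0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 1, 0)]

rows = list(product([0, 1], repeat=3))
counts = Counter(data)

m, q = 8, 1 / 8   # smoothing: m imaginary examples spread uniformly over the 8 rows
joint = {r: (counts[r] + m * q) / (len(data) + m) for r in rows}

assert abs(sum(joint.values()) - 1.0) < 1e-9
print(joint[(0, 0, 0)])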
24
Pros and Cons of the Joint Distribution
You can do a lot with it!
– Answer any query Pr(Y1, Y2, … | X1, X2, …)
It takes up a lot of room!
It takes a lot of data to train!
It can be expensive to use
– The big question: how do you simplify (approximate, compactly store, …) the joint and still be able to answer interesting queries?
25
Copyright © Andrew W. Moore
Density Estimation
Our Joint Distribution learner is our first example of something called Density Estimation.
A Density Estimator learns a mapping from a set of attribute values to a probability:
[Diagram: Input Attributes → Density Estimator → Probability]
26
Copyright © Andrew W. Moore
Density Estimation – looking ahead
Compare it to two other major kinds of models:
[Diagram: Input Attributes → Regressor → Prediction of real-valued output]
[Diagram: Input Attributes → Density Estimator → Probability]
[Diagram: Input Attributes → Classifier → Prediction of categorical output or class (one of a few discrete values)]
27
Another example
28
Starting point: Google books 5-gram data
– All 5-grams that appear >= 40 times in a corpus of 1M English books
   – 30Gb compressed, 250-300Gb uncompressed
   – Each 5-gram contains frequency distribution over years (which I ignored)
– Pulled out counts for all 5-grams (A,B,C,D,E) where C=affect or C=effect, and turned this into a joint probability table
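A sketch of that extraction step, assuming a hypothetical pre-aggregated tab-separated file of "w1 w2 w3 w4 w5 count" lines (the real Google Books n-gram files are organized differently, with per-year counts):

from collections import defaultdict

counts = defaultdict(int)
total = 0

with open("5grams.tsv") as f:               # hypothetical pre-aggregated file
    for line in f:
        *words, n = line.rstrip("\n").split("\t")
        a, b, c, d, e = words
        if c.lower() in ("affect", "effect"):   # keep case, so Effect/EFFECT stay distinct values of C
            counts[(a, b, c, d, e)] += int(n)
            total += int(n)

# joint probability table over (A, B, C, D, E)
joint = {k: v / total for k, v in counts.items()}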
29
Some of the Joint Distribution
A      B       C       D      E          p
is     the     effect  of     the        0.00036
is     the     effect  of     a          0.00034
.      The     effect  of     this       0.00034
to     this    effect  :      “          0.00034
be     the     effect  of     the        …
…      …       …       …      …          …
not    the     effect  of     any        0.00024
…      …       …       …      …          …
does   not     affect  the    general    0.00020
does   not     affect  the    question   0.00020
any    manner  affect  the    principle  0.00018
…
about 50k more rows… that summarize 90M 5-gram instances in text
30
Example queries
Pr(C) ?
c          Pr(C=c)
C=effect   0.94628
C=affect   0.04725
C=Effect   0.00575
C=EFFECT   0.00067
C=effecT   …
31
Example queries
Pr(B | C=affect) ?
b          Pr(B=b | C=affect)
B=not      0.61357
B=to       0.11483
B=may      0.03267
B=they     0.02738
B=which    …
32
Example queries
Pr(C | B=not, D=the) ?
c          Pr(C=c | B=not, D=the)
C=affect   0.99644
C=effect   0.00356
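These queries are just marginalization and conditioning over the 5-gram joint table. A sketch, using a tiny invented stand-in for the full table so it runs on its own:

# toy stand-in for the 5-gram joint table: (A, B, C, D, E) -> probability
joint = {
    ("is", "the", "effect", "of", "the"): 0.6,
    ("does", "not", "affect", "the", "general"): 0.3,
    ("does", "not", "affect", "the", "question"): 0.1,
}

def marginal(joint, index):
    # Pr(X_index = x) for every value x: sum out the other variables
    dist = {}
    for row, p in joint.items():
        dist[row[index]] = dist.get(row[index], 0.0) + p
    return dist

def conditional(joint, target, given):
    # Pr(X_target = x | conjunction of (index, value) pairs in `given`)
    rows = {r: p for r, p in joint.items() if all(r[i] == v for i, v in given.items())}
    z = sum(rows.values())                 # probability of the evidence
    dist = {}
    for r, p in rows.items():
        dist[r[target]] = dist.get(r[target], 0.0) + p / z
    return dist

print(marginal(joint, 2))                                        # Pr(C)
print(conditional(joint, target=2, given={1: "not", 3: "the"}))  # Pr(C | B=not, D=the)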
33
Copyright © Andrew W. Moore
Density Estimation As a Classifier
[Diagram: Input Attributes → Density Estimator → Probability]
[Diagram: Input Attributes → Classifier → Prediction of categorical output or class (one of a few discrete values)]
[Diagram: Input Attributes + Class Y → Density Estimator → Probability P(X1=x1, …, Xn=xn)]
From the density estimate you can read off P(Y=y1 | X1=x1, …, Xn=xn), …, P(Y=yk | X1=x1, …, Xn=xn)
Predict: f(X1=x1, …, Xn=xn) = argmax_yi P(Y=yi | X1=x1, …, Xn=xn)
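A sketch of that classification rule over a joint table (the table keys and values here are invented; the normalizer P(X=x) is the same for every class, so comparing joint values is enough):

# joint table over (attribute tuple, class): P(X1=x1, ..., Xn=xn, Y=y)
joint = {
    (("not", "the"), "affect"): 0.28,
    (("not", "the"), "effect"): 0.01,
    (("the", "of"), "effect"): 0.60,
    (("the", "of"), "affect"): 0.11,
}

def classify(joint, x):
    # argmax_y P(Y=y | X=x), computed by comparing the joint entries P(x, y)
    scores = {y: p for (attrs, y), p in joint.items() if attrs == x}
    return max(scores, key=scores.get) if scores else None

print(classify(joint, ("not", "the")))   # -> 'affect'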
34
An experiment: how useful is the brute-force joint classifier?
Test set: extracted all uses of affect or effect in a 20k-document newswire corpus
– about 723 n-grams, 661 distinct
Tried to predict the center word C with:
– argmax_c Pr(C=c | A=a, B=b, D=d, E=e)
– using the joint estimated from the Google ngram data
35
Poll time… https://piazza.com/class/ij382zqa2572hc
36
Example queries
How many errors would I expect in 100 trials if my classifier always just guesses the most frequent class?
https://piazza.com/class/ij382zqa2572hc
c          Pr(C=c)
C=effect   0.94628
C=affect   0.04725
C=Effect   0.00575
C=EFFECT   0.00067
C=effecT   …
37
Performance summary
Pattern         Used   Errors
P(C|A,B,D,E)    101    1
But: no counts at all for (a,b,d,e) for 622 of the 723 instances!
38
Slightly fancier idea….
Tried to predict the center word with:
– Pr(C | A=a, B=b, D=d, E=e)
– then P(C | A,B,D) if there’s no data for that
– then P(C | B,D) if there’s no data for that
– then P(C | B) …
– then P(C)
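A sketch of this backoff scheme (the count tables, keys, and numbers below are invented for illustration):

def predict_with_backoff(context, tables):
    # `tables` is a list of (key_fn, counts) pairs, most specific pattern first;
    # `counts` maps a context key -> {candidate word: count}
    for key_fn, counts in tables:
        bucket = counts.get(key_fn(context))
        if bucket:                        # back off only when there is no data at this level
            return max(bucket, key=bucket.get)
    return None

# invented toy tables keyed on (a,b,d,e), then (b,d), then () for the unconditional P(C)
tables = [
    (lambda c: (c["a"], c["b"], c["d"], c["e"]), {}),                          # no 4-word contexts seen
    (lambda c: (c["b"], c["d"]), {("not", "the"): {"affect": 9, "effect": 1}}),
    (lambda c: (), {(): {"effect": 95, "affect": 5}}),
]

print(predict_with_backoff({"a": "would", "b": "not", "d": "the", "e": "minister"}, tables))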
39
EXAMPLES
– “The cumulative _ of the”: effect (1.0)
– “Go into _ on January”: effect (1.0)
– “From cumulative _ of accounting”: not present in train data; nor is “From cumulative _ of _”, but “_ cumulative _ of _”: effect (1.0)
– “Would not _ Finance Minister”: not present, but “_ not _ _ _”: affect (0.9625)
40
Performance summary
Pattern         Used   Errors
P(C|A,B,D,E)    101    1
P(C|A,B,D)      157    6
P(C|B,D)        163    13
P(C|B)          244    78
P(C)            58     31
Total           723
[Annotations on the slide: 5% error, 3% error, 15% error]