CSE 446: Point Estimation Winter 2012 Dan Weld Slides adapted from Carlos Guestrin (& Luke Zettlemoyer)
Random Variable: Discrete or Continuous
– Boolean/discrete: like an atom in probabilistic logic; denotes some event (e.g., the die will come up "6")
– Continuous: domain is the real numbers; denotes some quantity (e.g., the age of a person taken from CSE 446)
Probability Distribution (Joint)
Bayesian Methods: learning and classification methods based on probability theory.
– Use the prior probability of each category, given no information about an item.
– Produce a posterior probability distribution over possible categories, given a description of an item.
Bayes' theorem plays a critical role in probabilistic learning and classification.
Axioms of Probability Theory
– All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1
– Probability of truth and falsity: P(true) = 1, P(false) = 0
– The probability of a disjunction is: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
(Venn diagram of overlapping events A and B omitted.)
De Finetti (1931): an agent who bets according to probabilities that violate these axioms can be forced to accept bets that lose money regardless of the outcome.
Terminology
– Conditional probability: P(X | Y), the probability of X when the value of Y is given
– Joint probability: P(X, Y), the probability of X and Y together
– Marginal probability: P(X), obtained by summing the joint distribution over the other variables
Conditional Probability
– P(A | B) is the probability of A given B
– Assumes: B is all and only the information known
– Defined by: P(A | B) = P(A ∧ B) / P(B)
(Venn diagram of A, B, and their intersection omitted.)
Probability Theory
– Sum rule: P(X) = Σ_Y P(X, Y)
– Product rule: P(X, Y) = P(Y | X) P(X)
Monty Hall (worked example; the door diagrams on these slides are not reproduced here)
Independence
– A and B are independent iff: P(A | B) = P(A) (equivalently, P(B | A) = P(B))
– Therefore, if A and B are independent: P(A | B) = P(A ∧ B) / P(B) = P(A), so P(A ∧ B) = P(A) P(B)
– These constraints are logically equivalent
Independence (Venn diagram omitted): P(A ∧ B) = P(A) P(B)
Independence Is Rare (Venn diagram example): P(A) = 25%, P(B) = 80%, P(A | B) = 1/7 ≈ 14%. A and B are not independent, since P(A | B) ≠ P(A).
Conditional Independence…? Are A and B independent, i.e., is P(A | B) = P(A)? From the diagram: P(A) = (.25 + .5)/2 = .375, P(B) = .75, P(A | B) = .333. So no: P(A | B) ≠ P(A).
A, B Conditionally Independent Given C: P(A | B, C) = P(A | C). Example (C = spots): P(A | C) = .5, P(B | C) = .333, and P(A | B, C) = .5 = P(A | C), so A and B are conditionally independent given C. (Venn diagram of A, B, C omitted.)
Bayes' Theorem: P(H | E) = P(E | H) P(H) / P(E)
Simple proof from the definition of conditional probability:
1. P(H | E) = P(H ∧ E) / P(E)    (def. cond. prob.)
2. P(E | H) = P(H ∧ E) / P(H)    (def. cond. prob.)
3. P(H ∧ E) = P(E | H) P(H)      (multiply both sides of 2 by P(H))
Substituting 3 into 1 gives P(H | E) = P(E | H) P(H) / P(E). QED
Medical Diagnosis: Encephalitis Computadoris
– Extremely nasty, extremely rare: P(E) = 1/100,000,000
– DNA test: 0% false negatives, 1/10,000 false-positive rate
– You test positive. Scared? What is P(E | positive)?
– By Bayes' theorem, P(E | positive) = 1 · 10⁻⁸ / P(positive) ≈ 1/10,000. Intuition: in 100M people, the test flags about 1 true case plus ~10,000 false positives.
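A quick sanity check of these numbers in Python (a minimal sketch; the prevalence and error rates are the ones on the slide, the variable names are mine):

```python
# Posterior probability of disease given a positive test, via Bayes' theorem.
prior = 1 / 100_000_000           # P(disease): 1 in 100M
p_pos_given_disease = 1.0         # sensitivity: 0% false negatives
p_pos_given_healthy = 1 / 10_000  # false-positive rate

# P(positive) via the sum rule (marginalizing over disease status)
p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)

posterior = p_pos_given_disease * prior / p_pos
print(f"P(disease | positive test) = {posterior:.6f}")  # ~0.0001, i.e. about 1 in 10,000
```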
Your first consulting job
– An eccentric billionaire Eastside tech founder asks: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?
– You say: Please flip it a few times.
– You say: The probability is P(H) = 3/5.
– He says: Why???
– You say: Because…
Thumbtack – Binomial Distribution
– P(Heads) = θ, P(Tails) = 1 − θ
– Flips are i.i.d.: independent events, identically distributed according to the binomial distribution
– For a sequence D with αH heads and αT tails: P(D | θ) = θ^αH (1 − θ)^αT
Maximum Likelihood Estimation
– Data: observed set D with αH heads and αT tails
– Hypothesis: flips follow a binomial distribution with parameter θ
– Learning: finding θ is an optimization problem. What's the objective function?
– MLE: choose θ to maximize the probability of D: θ̂ = argmax_θ P(D | θ)
Your first parameter learning algorithm
– θ̂ = argmax_θ ln P(D | θ) = argmax_θ ln [θ^αH (1 − θ)^αT]
– Set the derivative to zero and solve: d/dθ ln P(D | θ) = αH/θ − αT/(1 − θ) = 0  ⇒  θ̂ = αH / (αH + αT)
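A minimal sketch of the resulting estimator (function and variable names are mine, not from the slides):

```python
def thumbtack_mle(num_heads: int, num_tails: int) -> float:
    """MLE for P(Heads) of a thumbtack: theta_hat = alpha_H / (alpha_H + alpha_T)."""
    return num_heads / (num_heads + num_tails)

print(thumbtack_mle(3, 2))    # 0.6, matching the 3-heads / 2-tails example
print(thumbtack_mle(30, 20))  # 0.6 again: same point estimate, more data
```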
But, how many flips do I need?
– Billionaire says: I flipped 3 heads and 2 tails.
– You say: θ̂ = 3/5, I can prove it!
– He says: What if I flipped 30 heads and 20 tails?
– You say: Same answer, I can prove it!
– He says: What's better?
– You say: Umm… the more the merrier???
– He says: Is this why I am paying you the big bucks???
A bound (from Hoeffding's inequality)
– For N = αH + αT flips, true parameter θ*, and any ε > 0:  P(|θ̂ − θ*| ≥ ε) ≤ 2e^(−2Nε²)
(Plot omitted: probability of mistake vs. N, showing exponential decay.)
PAC Learning
– PAC: Probably Approximately Correct
– Billionaire says: I want to know the thumbtack θ within ε = 0.1, with probability at least 1 − δ. How many flips? That is, how big do I set N?
– From the Hoeffding bound: 2e^(−2Nε²) ≤ δ  ⇒  N ≥ ln(2/δ) / (2ε²)
– Interesting! Let's look at some numbers with ε = 0.1.
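A small sketch that turns the bound into a sample-size calculation (the ε and δ values in the example call are illustrative choices, not taken from the slide):

```python
import math

def flips_needed(eps: float, delta: float) -> int:
    """Smallest N with 2*exp(-2*N*eps**2) <= delta, i.e. N >= ln(2/delta) / (2*eps**2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(flips_needed(eps=0.1, delta=0.05))  # 185 flips suffice for a 0.1-accurate estimate w.p. 0.95
```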
What if I have prior beliefs?
– Billionaire says: Wait, I already have a rough idea of what the thumbtack's θ is "close" to. What can you do for me now?
– You say: I can learn it the Bayesian way… Rather than estimating a single θ, we obtain a distribution over possible values of θ.
(Plots omitted: the distribution over θ "in the beginning" vs. "after observations", e.g., after observing the flips {tails, tails}.)
Bayesian Learning
– Use Bayes' rule:  P(θ | D) = P(D | θ) P(θ) / P(D)   [posterior = data likelihood × prior / normalization]
– Or equivalently:  P(θ | D) ∝ P(D | θ) P(θ)
– Also, for uniform priors, P(θ) ∝ 1, so the posterior ∝ P(D | θ) — this reduces to the MLE objective
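A minimal sketch of this update on a discretized grid of θ values (the grid and the uniform prior are my choices for illustration):

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)      # grid of candidate parameter values
prior = np.ones_like(theta)              # uniform prior, P(theta) ∝ 1
prior /= prior.sum()

alpha_H, alpha_T = 3, 2                  # observed data: 3 heads, 2 tails
likelihood = theta**alpha_H * (1 - theta)**alpha_T   # P(D | theta)

posterior = likelihood * prior           # Bayes' rule, up to normalization
posterior /= posterior.sum()             # normalize so it sums to 1

print(theta[np.argmax(posterior)])       # with a uniform prior, the mode is ~0.6, the MLE
```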
Bayesian Learning for Thumbtacks
– Likelihood function is binomial:  P(D | θ) = θ^αH (1 − θ)^αT
– What about the prior? It should represent expert knowledge and give a simple posterior form
– Conjugate priors: closed-form representation of the posterior; for the binomial, the conjugate prior is the Beta distribution
Beta prior distribution – P(θ)
– Prior:  P(θ) = θ^(βH−1) (1 − θ)^(βT−1) / B(βH, βT)  ~  Beta(βH, βT)
– Likelihood function:  P(D | θ) = θ^αH (1 − θ)^αT
– Posterior:  P(θ | D) ∝ θ^(αH+βH−1) (1 − θ)^(αT+βT−1)
Posterior Distribution
– Prior: Beta(βH, βT)
– Data: αH heads and αT tails
– Posterior distribution:  P(θ | D) = Beta(αH + βH, αT + βT)
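A sketch of this conjugate update using scipy (the prior hyperparameters below are illustrative choices, not values from the slides):

```python
from scipy.stats import beta

# Prior Beta(beta_H, beta_T); an illustrative prior that mildly favors theta ~ 0.5
beta_H, beta_T = 3, 3
# Data: alpha_H heads, alpha_T tails
alpha_H, alpha_T = 3, 2

# Conjugacy: the posterior is again a Beta, with the counts simply added in
posterior = beta(alpha_H + beta_H, alpha_T + beta_T)
print("posterior mean:", posterior.mean())             # E[theta | D]
print("posterior 95% interval:", posterior.interval(0.95))
```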
Bayesian Posterior Inference
– Posterior distribution:  P(θ | D) = Beta(αH + βH, αT + βT)
– Bayesian inference: no longer a single parameter
– For any specific f, the function of interest, compute the expected value of f:  E[f(θ)] = ∫₀¹ f(θ) P(θ | D) dθ
– The integral is often hard to compute
MAP: Maximum a Posteriori Approximation
– As more data is observed, the Beta posterior becomes more certain (more sharply peaked)
– MAP: use the most likely parameter to approximate the expectation:  θ̂ = argmax_θ P(θ | D),  E[f(θ)] ≈ f(θ̂)
MAP for the Beta distribution
– MAP: use the most likely parameter:  θ̂ = argmax_θ P(θ | D) = (αH + βH − 1) / (αH + βH + αT + βT − 2)
– The Beta prior is equivalent to extra thumbtack flips (pseudo-counts)
– As N → ∞, the prior is "forgotten"; but for small sample sizes, the prior is important!
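A sketch comparing the MLE with the Beta-MAP estimate, treating the prior as extra "pseudo-flips" (the hyperparameter values are illustrative, not from the slides):

```python
def beta_map(alpha_H, alpha_T, beta_H, beta_T):
    """MAP under a Beta(beta_H, beta_T) prior:
    (alpha_H + beta_H - 1) / (alpha_H + beta_H + alpha_T + beta_T - 2)."""
    return (alpha_H + beta_H - 1) / (alpha_H + beta_H + alpha_T + beta_T - 2)

def mle(alpha_H, alpha_T):
    return alpha_H / (alpha_H + alpha_T)

# Small sample: the prior's pseudo-counts pull the estimate toward 0.5
print(mle(3, 2), beta_map(3, 2, 5, 5))          # 0.6 vs ~0.538
# Large sample: the prior is "forgotten"
print(mle(300, 200), beta_map(300, 200, 5, 5))  # 0.6 vs ~0.598
```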
Envelope Problem
– One envelope (A or B) has twice as much money as the other. You hold A. Switch to B?
– Suppose A has $20. Then B has either $10 or $40, so E[B] = ½($10 + $40) = $25 > $20. Switch?
What about continuous variables? Billionaire says: If I am measuring a continuous variable, what can you do for me? You say: Let me tell you about Gaussians…
Some properties of Gaussians
– Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian:  X ~ N(μ, σ²), Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)
– The sum of (independent) Gaussians is Gaussian:  X ~ N(μ_X, σ²_X), Y ~ N(μ_Y, σ²_Y), Z = X + Y  ⇒  Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
– Easy to differentiate, as we will see soon!
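A quick numerical check of these two closure properties (the parameter values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0
x = rng.normal(mu, sigma, size=1_000_000)

# Affine transform: Y = aX + b should have mean a*mu + b and variance a^2 * sigma^2
a, b = 0.5, 1.0
y = a * x + b
print(y.mean(), y.var())   # ~2.0 and ~2.25

# Sum of independent Gaussians: means add, variances add
x2 = rng.normal(-1.0, 2.0, size=1_000_000)
z = x + x2
print(z.mean(), z.var())   # ~1.0 and ~13.0
```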
Learning a Gaussian
– Collect a bunch of data: hopefully i.i.d. samples, e.g., exam scores xᵢ
– Learn the parameters: mean μ and variance σ²
(Table of example exam scores omitted.)
MLE for a Gaussian
– Probability of i.i.d. samples D = {x₁, …, x_N}:  P(D | μ, σ) = (1/(σ√(2π)))^N ∏ᵢ exp(−(xᵢ − μ)² / (2σ²))
– Log-likelihood of the data:  ln P(D | μ, σ) = −N ln(σ√(2π)) − Σᵢ (xᵢ − μ)² / (2σ²)
Your second learning algorithm: MLE for the mean of a Gaussian
– What's the MLE for the mean? Set the derivative to zero:  ∂/∂μ ln P(D | μ, σ) = Σᵢ (xᵢ − μ)/σ² = 0  ⇒  μ̂_MLE = (1/N) Σᵢ xᵢ
MLE for the variance
– Again, set the derivative to zero:  ∂/∂σ ln P(D | μ, σ) = −N/σ + Σᵢ (xᵢ − μ)²/σ³ = 0  ⇒  σ̂²_MLE = (1/N) Σᵢ (xᵢ − μ̂)²
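A minimal sketch of both closed-form estimators (the example scores are made up for illustration):

```python
import numpy as np

scores = np.array([85., 92., 78., 99., 89.])   # illustrative exam scores

mu_mle = scores.mean()                          # (1/N) * sum_i x_i
var_mle = ((scores - mu_mle) ** 2).mean()       # (1/N) * sum_i (x_i - mu_hat)^2

print(mu_mle, var_mle)
assert np.isclose(var_mle, scores.var())        # numpy's default var() divides by N, i.e. the MLE
```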
Learning Gaussian parameters
– MLE:  μ̂_MLE = (1/N) Σᵢ xᵢ,   σ̂²_MLE = (1/N) Σᵢ (xᵢ − μ̂)²
– BTW, the MLE for the variance of a Gaussian is biased: the expected result of the estimation is not the true parameter!
– Unbiased variance estimator:  σ̂²_unbiased = (1/(N − 1)) Σᵢ (xᵢ − μ̂)²
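A small simulation illustrating the bias (the true parameters, sample size, and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
true_var, n, trials = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)

var_mle = ((samples - mu_hat) ** 2).mean(axis=1)                 # divides by N
var_unbiased = ((samples - mu_hat) ** 2).sum(axis=1) / (n - 1)   # divides by N-1

print(var_mle.mean())       # ~3.2 = (N-1)/N * 4: systematically too small
print(var_unbiased.mean())  # ~4.0: matches the true variance on average
```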
Bayesian learning of Gaussian parameters
– Conjugate priors: for the mean, a Gaussian prior; for the variance, a Wishart distribution
– Prior for the mean:  P(μ) = N(μ | η, λ²) = (1/(λ√(2π))) exp(−(μ − η)² / (2λ²))
MAP for the mean of a Gaussian
– Maximize ln [P(D | μ) P(μ)] with respect to μ; setting the derivative to zero gives  μ̂_MAP = (λ² Σᵢ xᵢ + σ² η) / (Nλ² + σ²)
– A precision-weighted average of the data and the prior mean η; as N → ∞, it approaches the MLE
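A sketch of this estimator, assuming a known observation variance σ² and the Gaussian prior N(η, λ²) on μ from the previous slide (all numeric values are illustrative):

```python
import numpy as np

def map_mean(data, sigma2, prior_mean, prior_var):
    """MAP estimate of a Gaussian mean with known variance sigma2
    and a Gaussian prior N(prior_mean, prior_var) on mu."""
    n = len(data)
    return (sigma2 * prior_mean + prior_var * np.sum(data)) / (sigma2 + n * prior_var)

scores = np.array([85., 92., 78., 99., 89.])        # illustrative data
print(scores.mean())                                 # MLE: 88.6
print(map_mean(scores, sigma2=25.0, prior_mean=75.0, prior_var=25.0))
# MAP (~86.3) is pulled toward the prior mean; with more data it approaches the MLE
```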