CSE 446: Point Estimation Winter 2012 Dan Weld Slides adapted from Carlos Guestrin (& Luke Zettlemoyer)

Random Variable A random variable may be discrete or continuous. – Boolean/discrete: like an atom in probabilistic logic; denotes some event (e.g., the die will come up "6"). – Continuous: the domain is the real numbers; denotes some quantity (e.g., the age of a person taken from CSE 446). A probability distribution assigns probabilities to the values of a random variable, and may be joint over several variables.

Bayesian Methods Learning and classification methods based on probability theory. – Uses the prior probability of each category, given no information about an item. – Produces a posterior probability distribution over possible categories, given a description of an item. Bayes theorem plays a critical role in probabilistic learning and classification.

Axioms of Probability Theory All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1. A true proposition has probability 1 and a false one has probability 0: P(true) = 1, P(false) = 0. The probability of a disjunction is P(A ∨ B) = P(A) + P(B) − P(A ∧ B).

De Finetti (1931) An agent who bets according to probabilities that violate these axioms can be forced to bet so as to lose money, regardless of the outcome.

Terminology Conditional probability, P(X | Y): the probability of X when the value of Y is given. Joint probability, P(X, Y): the probability of X and Y together. Marginal probability, P(X): the probability of X alone, with the other variables summed out.

Conditional Probability P(A | B) is the probability of A given B. Assumes that B is all and only the information known. Defined by: P(A | B) = P(A ∧ B) / P(B).

Probability Theory Sum rule (marginalization): P(X) = Σ_Y P(X, Y). Product rule: P(X, Y) = P(Y | X) P(X).
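To make the sum and product rules concrete, here is a minimal Python sketch (not from the original slides); the 2×3 joint table is invented purely for illustration.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over X in {0,1}, Y in {0,1,2}.
# Rows index X, columns index Y; entries sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

# Sum rule: marginalize out Y to get P(X).
p_x = joint.sum(axis=1)

# Product rule: P(X, Y) = P(Y | X) P(X).
p_y_given_x = joint / p_x[:, None]
reconstructed = p_y_given_x * p_x[:, None]

print(p_x)                                # [0.4 0.6]
print(np.allclose(reconstructed, joint))  # True: the product rule recovers the joint
```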

Monty Hall The Monty Hall problem was worked through as a sequence of figures on the original slides (not transcribed here).
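Since the Monty Hall figures are not transcribed, here is a small simulation sketch (my addition, not part of the slides) that reproduces the result they illustrate: switching wins about 2/3 of the time, staying about 1/3.

```python
import random

def monty_hall(trials=100_000):
    """Estimate win rates for the 'stay' and 'switch' strategies."""
    stay_wins = switch_wins = 0
    for _ in range(trials):
        prize = random.randrange(3)    # door hiding the car
        choice = random.randrange(3)   # contestant's first pick
        # Host opens a door that is neither the pick nor the prize.
        opened = next(d for d in range(3) if d != choice and d != prize)
        # Switching means taking the remaining unopened door.
        switched = next(d for d in range(3) if d != choice and d != opened)
        stay_wins += (choice == prize)
        switch_wins += (switched == prize)
    return stay_wins / trials, switch_wins / trials

print(monty_hall())   # roughly (0.33, 0.67)
```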

Independence A and B are independent iff P(A | B) = P(A), or equivalently P(B | A) = P(B). Therefore, if A and B are independent, P(A ∧ B) = P(A | B) P(B) = P(A) P(B). These constraints are logically equivalent.

Independence Pictured as a Venn diagram over the space of outcomes: A and B are independent exactly when P(A ∧ B) = P(A) P(B).

Independence Is Rare Example (from the Venn diagram on the slide): P(A) = 25%, P(B) = 80%, yet P(A | B) ≈ 1/7. A and B are not independent, since P(A | B) ≠ P(A).

Conditional Independence…? Are A and B independent? Compare P(A | B) with P(A). From the diagram: P(A) = (.25 + .5)/2 = .375, P(B) = .75, and P(A | B) = .333, so P(A | B) ≠ P(A) and A and B are not independent.

A, B Conditionally Independent Given C A and B are conditionally independent given C iff P(A | B, C) = P(A | C). Example with C = spots: P(A | C) = .5, P(B | C) = .333, and P(A | B, C) = .5, so the condition holds.
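As an illustration of checking conditional independence numerically, here is a hedged sketch (not from the slides); the joint table is fabricated so that A ⊥ B | C holds by construction.

```python
import numpy as np

# Hypothetical joint distribution P(A, B, C) over three binary variables,
# indexed as joint[a, b, c]; the numbers are made up and constructed so that
# A and B are conditionally independent given C.
p_c = np.array([0.6, 0.4])                        # P(C)
p_a_given_c = np.array([[0.5, 0.5], [0.2, 0.8]])  # rows: c, cols: a -> P(A|C)
p_b_given_c = np.array([[0.3, 0.7], [0.9, 0.1]])  # rows: c, cols: b -> P(B|C)

joint = np.zeros((2, 2, 2))
for a in range(2):
    for b in range(2):
        for c in range(2):
            joint[a, b, c] = p_a_given_c[c, a] * p_b_given_c[c, b] * p_c[c]

# Verify P(A | B, C) == P(A | C) for every value of B.
p_ab_given_c = joint / joint.sum(axis=(0, 1), keepdims=True)  # P(A, B | C)
p_a_given_bc = joint / joint.sum(axis=0, keepdims=True)       # P(A | B, C)
p_a_given_c_check = p_ab_given_c.sum(axis=1)                  # P(A | C), indexed [a, c]

for b in range(2):
    assert np.allclose(p_a_given_bc[:, b, :], p_a_given_c_check)
print("A is conditionally independent of B given C for this table")
```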

Bayes Theorem Simple proof from the definition of conditional probability:
1. P(H | E) = P(H ∧ E) / P(E) (def. cond. prob.)
2. P(E | H) = P(H ∧ E) / P(H) (def. cond. prob.)
3. P(H ∧ E) = P(E | H) P(H) (mult. both sides of 2 by P(H))
4. P(H | E) = P(E | H) P(H) / P(E) (substitute 3 in 1) QED

Medical Diagnosis Encephalitis Computadoris: extremely nasty, extremely rare (1 in 100M). DNA testing: 0% false negatives, 1/10,000 false positives. You test positive — scared? Let H = you have the disease and E = the test is positive. Then P(E) ≈ 10⁻⁸ + 10⁻⁴, so P(H | E) = 1 × 10⁻⁸ / (10⁻⁸ + 10⁻⁴) ≈ 10⁻⁴. Intuition: in 100M people there is about 1 true case plus roughly 10,000 false positives.
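The same Bayes-rule calculation as a minimal sketch (my addition); the prevalence and error rates mirror the numbers above.

```python
def posterior(prior, true_pos_rate, false_pos_rate):
    """P(disease | positive test) via Bayes rule."""
    p_pos = true_pos_rate * prior + false_pos_rate * (1 - prior)
    return true_pos_rate * prior / p_pos

# Prevalence 1/100M, perfect sensitivity, 1/10,000 false positive rate.
print(posterior(prior=1e-8, true_pos_rate=1.0, false_pos_rate=1e-4))
# ≈ 1e-4: even after a positive test, the disease is still very unlikely.
```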

Your first consulting job An eccentric Eastside tech billionaire founder asks: – He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up? – You say: Please flip it a few times… – You say: The probability is P(H) = 3/5. – He says: Why??? – You say: Because…

Thumbtack – Binomial Distribution P(Heads) = θ, P(Tails) = 1 − θ. Flips are i.i.d.: – independent events – identically distributed according to the binomial distribution. For a sequence D of α_H heads and α_T tails: P(D | θ) = θ^{α_H} (1 − θ)^{α_T}.

Maximum Likelihood Estimation Data: an observed set D of α_H heads and α_T tails. Hypothesis: binomial distribution. Learning θ is an optimization problem – what's the objective function? MLE: choose θ to maximize the probability of D: θ̂ = argmax_θ P(D | θ) = argmax_θ ln P(D | θ).

Your first parameter learning algorithm ln P(D | θ) = α_H ln θ + α_T ln(1 − θ). Set the derivative to zero and solve: d/dθ ln P(D | θ) = α_H/θ − α_T/(1 − θ) = 0, which gives θ̂_MLE = α_H / (α_H + α_T).
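A small sketch (not from the slides) of this first learning algorithm: the closed-form MLE, checked against a brute-force maximization of the log-likelihood, using the 3-heads/2-tails example.

```python
import numpy as np

def mle_theta(alpha_h, alpha_t):
    """Closed-form MLE for the heads probability of a binomial."""
    return alpha_h / (alpha_h + alpha_t)

# Numeric check: the log-likelihood is maximized at the same value.
alpha_h, alpha_t = 3, 2
thetas = np.linspace(1e-6, 1 - 1e-6, 100_000)
log_lik = alpha_h * np.log(thetas) + alpha_t * np.log(1 - thetas)

print(mle_theta(alpha_h, alpha_t))   # 0.6
print(thetas[np.argmax(log_lik)])    # ≈ 0.6
```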

But how many flips do I need? Billionaire says: I flipped 3 heads and 2 tails. You say: θ = 3/5, I can prove it! He says: What if I flipped 30 heads and 20 tails? You say: Same answer, I can prove it! He says: Which is better? You say: Umm… the more the merrier??? He says: Is this why I am paying you the big bucks???

A bound (from Hoeffding's inequality) For N = α_H + α_T flips, let θ* be the true parameter. For any ε > 0: P(|θ̂ − θ*| ≥ ε) ≤ 2e^{−2Nε²}. (The slide's plot of probability of mistake versus N shows this exponential decay.)

PAC Learning PAC: Probably Approximately Correct. Billionaire says: I want to know the thumbtack parameter θ within ε = 0.1, with probability at least 1 − δ. How many flips? That is, how big do I set N? From the Hoeffding bound, it suffices that 2e^{−2Nε²} ≤ δ, i.e., N ≥ ln(2/δ) / (2ε²). Interesting! Let's look at some numbers for ε = 0.1.
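A quick sketch (my addition) that plugs numbers into the Hoeffding bound above; the δ values are illustrative choices, since the slide's exact numbers are not transcribed.

```python
import math

def flips_needed(eps, delta):
    """Smallest N satisfying 2*exp(-2*N*eps**2) <= delta (Hoeffding bound)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

for delta in (0.05, 0.01, 0.001):
    print(delta, flips_needed(eps=0.1, delta=delta))
# δ=0.05 -> 185 flips, δ=0.01 -> 265, δ=0.001 -> 381
```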

What if I have prior beliefs? Billionaire says: Wait, I already have a rough idea of what the thumbtack probability is "close" to. What can you do for me now? You say: I can learn it the Bayesian way… Rather than estimating a single θ, we obtain a distribution over possible values of θ: one in the beginning (the prior), and an updated one after observing flips, e.g. {tails, tails}.

Bayesian Learning Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D), i.e., posterior = data likelihood × prior / normalization. Or equivalently: P(θ | D) ∝ P(D | θ) P(θ). Also, for uniform priors, P(θ) ∝ 1, so P(θ | D) ∝ P(D | θ) and this reduces to the MLE objective.

Bayesian Learning for Thumbtacks The likelihood function is binomial: P(D | θ) = θ^{α_H} (1 − θ)^{α_T}. What about the prior? It should – represent expert knowledge – and yield a simple posterior form. Conjugate priors: – closed-form representation of the posterior – for the binomial, the conjugate prior is the Beta distribution.

Beta prior distribution – P(θ) Prior: P(θ) = θ^{β_H − 1} (1 − θ)^{β_T − 1} / B(β_H, β_T), i.e., θ ~ Beta(β_H, β_T). Likelihood function: P(D | θ) = θ^{α_H} (1 − θ)^{α_T}. Posterior: P(θ | D) ∝ θ^{α_H + β_H − 1} (1 − θ)^{α_T + β_T − 1}.

Posterior Distribution Prior: Beta(β_H, β_T). Data: α_H heads and α_T tails. Posterior distribution: P(θ | D) ~ Beta(β_H + α_H, β_T + α_T).
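A minimal sketch (not from the slides) of this Beta–binomial update using SciPy; the Beta(3, 3) prior pseudo-counts and the 3-heads/2-tails data are assumptions for illustration.

```python
from scipy import stats

beta_h, beta_t = 3, 3        # Beta(3, 3): prior belief that θ is near 0.5
alpha_h, alpha_t = 3, 2      # observed data: 3 heads, 2 tails

posterior = stats.beta(beta_h + alpha_h, beta_t + alpha_t)   # Beta(6, 5)

print(posterior.mean())          # posterior mean ≈ 0.545
print(posterior.interval(0.95))  # 95% credible interval for θ
```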

Bayesian Posterior Inference Posterior distribution: P(θ | D) ~ Beta(β_H + α_H, β_T + α_T). Bayesian inference: – no longer a single parameter – for any specific function of interest f, compute its expected value E[f(θ)] = ∫ f(θ) P(θ | D) dθ – the integral is often hard to compute.

MAP: Maximum a Posteriori Approximation As more data is observed, the Beta posterior becomes more peaked (more certain). MAP: use the most likely parameter θ̂ = argmax_θ P(θ | D) to approximate the expectation: E[f(θ)] ≈ f(θ̂).

MAP for Beta distribution MAP: use the most likely parameter: θ̂ = argmax_θ P(θ | D) = (α_H + β_H − 1) / (α_H + β_H + α_T + β_T − 2). The Beta prior is equivalent to extra thumbtack flips (pseudo-counts). As N → ∞, the prior is "forgotten"; but for small sample sizes, the prior is important!
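A short sketch (my addition) contrasting MLE and MAP on small and large samples; the Beta(3, 3) prior is an illustrative assumption.

```python
def mle(alpha_h, alpha_t):
    return alpha_h / (alpha_h + alpha_t)

def map_beta(alpha_h, alpha_t, beta_h, beta_t):
    """Mode of the Beta(alpha_h + beta_h, alpha_t + beta_t) posterior."""
    return (alpha_h + beta_h - 1) / (alpha_h + beta_h + alpha_t + beta_t - 2)

# Small sample: the prior pseudo-counts pull the estimate toward 0.5.
print(mle(3, 2), map_beta(3, 2, beta_h=3, beta_t=3))         # 0.6 vs ≈ 0.556
# Large sample: the prior is "forgotten".
print(mle(300, 200), map_beta(300, 200, beta_h=3, beta_t=3)) # 0.6 vs ≈ 0.599
```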

Envelope Problem One envelope has twice as much money as the other. Should you switch from envelope A to envelope B?

Envelope Problem Suppose A is opened and contains $20. Then B holds either $10 or $40, and naively E[B] = ½($10 + $40) = $25 > $20. Switch?

What about continuous variables? Billionaire says: If I am measuring a continuous variable, what can you do for me? You say: Let me tell you about Gaussians… X ~ N(μ, σ²), with density P(x | μ, σ) = (1 / (σ√(2π))) e^{−(x − μ)² / (2σ²)}.

Some properties of Gaussians Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian: X ~ N(μ, σ²) and Y = aX + b imply Y ~ N(aμ + b, a²σ²). The sum of (independent) Gaussians is Gaussian: X ~ N(μ_X, σ²_X), Y ~ N(μ_Y, σ²_Y), and Z = X + Y imply Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y). Easy to differentiate, as we will see soon!
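These two properties are easy to check empirically; the following sketch (not from the slides) verifies them by sampling, with arbitrarily chosen parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Affine transformation: X ~ N(2, 3^2), so Y = 5*X + 1 should be N(11, 15^2).
x = rng.normal(2.0, 3.0, n)
y = 5 * x + 1
print(y.mean(), y.std())   # ≈ 11, ≈ 15

# Sum of independent Gaussians: W ~ N(-1, 4^2), so Z = X + W should be N(1, 5^2).
w = rng.normal(-1.0, 4.0, n)
z = x + w
print(z.mean(), z.std())   # ≈ 1, ≈ 5
```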

Learning a Gaussian Collect a bunch of data – hopefully i.i.d. samples – e.g., exam scores x_1, …, x_N (the slide shows a small table of sample scores). Learn the parameters: – mean μ – variance σ².

MLE for Gaussian Probability of i.i.d. samples D = {x_1, …, x_N}: P(D | μ, σ) = (1 / (σ√(2π)))^N ∏_i e^{−(x_i − μ)² / (2σ²)}. Log-likelihood of the data: ln P(D | μ, σ) = −N ln(σ√(2π)) − Σ_i (x_i − μ)² / (2σ²).

Your second learning algorithm: MLE for the mean of a Gaussian What's the MLE for the mean? Set the derivative to zero: ∂/∂μ ln P(D | μ, σ) = (1/σ²) Σ_i (x_i − μ) = 0, which gives μ̂_MLE = (1/N) Σ_i x_i.

MLE for variance Again, set the derivative to zero: ∂/∂σ ln P(D | μ, σ) = −N/σ + (1/σ³) Σ_i (x_i − μ)² = 0, which gives σ̂²_MLE = (1/N) Σ_i (x_i − μ̂)².

Learning Gaussian parameters MLE: μ̂_MLE = (1/N) Σ_i x_i and σ̂²_MLE = (1/N) Σ_i (x_i − μ̂)². BTW, the MLE for the variance of a Gaussian is biased – the expected result of the estimation is not the true parameter! Unbiased variance estimator: σ̂²_unbiased = (1/(N − 1)) Σ_i (x_i − μ̂)².
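A minimal sketch (my addition) of the Gaussian MLE and the biased vs. unbiased variance estimators; the "exam scores" are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(75.0, 10.0, size=50)   # hypothetical exam scores

mu_mle = scores.mean()
var_mle = ((scores - mu_mle) ** 2).mean()                          # divide by N (biased)
var_unbiased = ((scores - mu_mle) ** 2).sum() / (len(scores) - 1)  # divide by N-1

print(mu_mle, var_mle, var_unbiased)
# Equivalent: np.var(scores) divides by N, np.var(scores, ddof=1) by N-1.
```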

Bayesian learning of Gaussian parameters Conjugate priors: – mean: Gaussian prior – variance: Wishart distribution. Prior for the mean: μ ~ N(η, λ²), i.e., P(μ | η, λ) = (1 / (λ√(2π))) e^{−(μ − η)² / (2λ²)}.

MAP for mean of Gaussian With a Gaussian prior μ ~ N(η, λ²) and known variance σ², setting the derivative of the log-posterior to zero gives μ̂_MAP = (Σ_i x_i/σ² + η/λ²) / (N/σ² + 1/λ²), a precision-weighted average of the data and the prior mean.
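A hedged sketch (not from the slides) of the MAP estimate above, assuming known variance σ² and a Gaussian prior N(η, λ²) on the mean; all numbers are illustrative.

```python
import numpy as np

def map_gaussian_mean(x, sigma, eta, lam):
    """MAP estimate of a Gaussian mean with known variance sigma^2
    and a Gaussian prior N(eta, lam^2) on the mean."""
    n = len(x)
    precision = n / sigma**2 + 1 / lam**2
    return (x.sum() / sigma**2 + eta / lam**2) / precision

rng = np.random.default_rng(2)
x = rng.normal(75.0, 10.0, size=5)   # few samples: the prior matters
print(x.mean())                                              # MLE
print(map_gaussian_mean(x, sigma=10.0, eta=50.0, lam=5.0))   # pulled toward the prior mean 50
```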