Bayesian learning finalized (with high probability)

Everything’s random... Basic Bayesian viewpoint: treat (almost) everything as a random variable. Data/independent variable: the vector X. Class/dependent variable: Y. Parameters: Θ (e.g., mean, variance, correlations, multinomial parameters, etc.). Use Bayes’ rule to assess probabilities of classes. This allows us to say things like: “It is very unlikely that the mean height is 2 light years.”

Uncertainty over params Maximum likelihood treats parameters as (unknown) constants; the job is just to pick the constants so as to maximize the data likelihood. Full-blown Bayesian modeling treats parameters as random variables: a PDF over a parameter variable tells us how certain or uncertain we are about the location of that parameter. It also allows us to express prior beliefs (probabilities) about the parameters.

Example: Coin flipping Have a “weighted” coin -- want to figure out θ = Pr[heads]. Maximum likelihood: flip the coin a bunch of times, measure #heads and #tails, and use an estimator to return a single value for θ. This is called a point estimate.
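As a minimal sketch (my illustration, not from the original slides): the maximum-likelihood point estimate for a coin is just the fraction of heads. The encoding 1 = heads, 0 = tails and the example flips are assumptions for the demo.

```python
# Minimal sketch: maximum-likelihood point estimate of theta = Pr[heads].
# Assumes flips are encoded as 1 = heads, 0 = tails.

def mle_theta(flips):
    """Return the MLE of Pr[heads]: #heads / #flips."""
    if not flips:
        raise ValueError("need at least one flip")
    return sum(flips) / len(flips)

flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # hypothetical data
print(mle_theta(flips))  # 0.7 -- a single number, with no measure of uncertainty
```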

Example: Coin flipping Have a “weighted” coin -- want to figure out θ = Pr[heads]. Bayesian posterior estimation: start with a distribution over what θ might be, flip the coin a bunch of times, measure #heads and #tails, and update the distribution -- but never reduce it to a single number. Always keep around Pr[θ | data]: the posterior estimate.
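A hedged sketch (mine, not from the slides) of what “keep around a distribution” can mean in practice: represent Pr[θ | data] on a discrete grid of candidate θ values and renormalize after each flip. The grid size, the uniform starting distribution, and the example flips are illustrative choices.

```python
# Sketch: Bayesian updating of Pr[theta | data] on a discrete grid.
# Starts from a uniform distribution over candidate theta values and
# multiplies in the likelihood of each observed flip (1 = heads, 0 = tails).

def update_posterior(posterior, thetas, flip):
    """One Bayesian update: posterior(theta) proportional to prior(theta) * likelihood(flip | theta)."""
    likelihood = [t if flip == 1 else (1.0 - t) for t in thetas]
    unnormalized = [p * l for p, l in zip(posterior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

thetas = [i / 100.0 for i in range(101)]        # candidate values of theta
posterior = [1.0 / len(thetas)] * len(thetas)   # uniform prior

for flip in [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]:     # hypothetical flips
    posterior = update_posterior(posterior, thetas, flip)

# The whole distribution is retained; e.g., report its mean and its peak:
mean = sum(t * p for t, p in zip(thetas, posterior))
mode = thetas[max(range(len(posterior)), key=posterior.__getitem__)]
print(mean, mode)
```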

Example: Coin flipping [Figure slides: the posterior distribution over θ shown after 0, 1, 5, 10, 20, 50, and 100 flips total.]

How does it work? Think of parameters as just another kind of random variable. Now your data distribution is Pr[X | Θ] -- this is the generative distribution (a.k.a. observation distribution, sensor model, etc.). What we want is some model of the parameter as a function of the data, Pr[Θ | X]. Get there with Bayes’ rule: Pr[Θ | X] = Pr[X | Θ] Pr[Θ] / Pr[X].
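The same relationship written out with the names used on the following slides (a standard restatement, not copied from the slide images):

```latex
% Bayes' rule for parameters, with each factor named as on the slides that follow.
\underbrace{\Pr[\Theta \mid X]}_{\text{posterior}}
  \;=\;
  \frac{\overbrace{\Pr[X \mid \Theta]}^{\text{generative distribution}}
        \;\overbrace{\Pr[\Theta]}^{\text{parameter prior}}}
       {\underbrace{\Pr[X]}_{\text{data prior}}}
```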

What does that mean? Let’s look at the parts. The generative distribution, Pr[X | Θ]: describes how data is generated by the underlying process. It is usually easy to write down (well, easier than the other parts, anyway); it’s the same old PDF/PMF we’ve been working with. It can be used to “generate” new samples of data that “look like” your training data.
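A small sketch (my illustration, not from the slides) of using the generative distribution to produce new data that “looks like” the training data, here for the coin example with a hypothetical θ:

```python
import random

# Sketch: sampling from the generative distribution Pr[X | theta] for the
# coin example. theta = 0.7 is a hypothetical parameter value.

def generate_flips(theta, n, rng=random):
    """Draw n flips (1 = heads, 0 = tails) from a coin with Pr[heads] = theta."""
    return [1 if rng.random() < theta else 0 for _ in range(n)]

synthetic = generate_flips(theta=0.7, n=20)
print(synthetic)  # new data that "looks like" flips of the same coin
```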

What does that mean? The parameter prior or a priori distribution, Pr[Θ]: allows you to say “this value of Θ is more likely than that one”; it lets you express beliefs/assumptions/preferences about the parameters of the system. It also takes over when the data is sparse (small N). In the limit of large data, the prior should “wash out”, letting the data dominate the estimate of the parameter (see the numeric sketch below). You can let Pr[Θ] be “uniform” (a.k.a. “uninformative”) to minimize its impact.
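A quick numeric illustration of the “wash out” claim, using the Beta prior introduced a few slides later; the posterior-mean formula is the standard Beta result, and the counts (700 heads, 300 tails) are hypothetical:

```latex
% Posterior mean under a Beta(\alpha, \beta) prior after h heads and t tails:
%   E[\theta \mid h, t] = (\alpha + h) / (\alpha + \beta + h + t)
% With h = 700, t = 300:
\text{Beta}(1,1)\ \text{prior:}\quad \frac{1 + 700}{2 + 1000} \approx 0.6996
\qquad
\text{Beta}(10,10)\ \text{prior:}\quad \frac{10 + 700}{20 + 1000} \approx 0.6961
% Both are close to the maximum-likelihood estimate h/(h+t) = 0.7:
% with enough data, the prior has little influence.
```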

What does that mean? The data prior, Pr[X]: expresses the probability of seeing data set X independent of any particular model. Huh?

What does that mean? The data prior, Pr[X]: expresses the probability of seeing data set X independent of any particular model. You can get it from the joint data/parameter model by marginalizing out the parameter: Pr[X] = ∫ Pr[X | Θ] Pr[Θ] dΘ. In practice, you often don’t need it explicitly (why?).
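For reference, the marginalization and the standard observation that answers the slide’s “why?”:

```latex
% The data prior is the likelihood averaged over the parameter prior:
\Pr[X] \;=\; \int \Pr[X \mid \Theta]\,\Pr[\Theta]\,d\Theta
% It does not depend on \Theta, so as a function of \Theta the posterior satisfies
\Pr[\Theta \mid X] \;\propto\; \Pr[X \mid \Theta]\,\Pr[\Theta],
% and \Pr[X] can be recovered at the end as the normalizing constant.
```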

What does that mean? Finally, the posterior (or a posteriori) distribution, Pr[Θ | X]: literally, “from what comes after” (Latin). Essentially, it is what we believe about the parameter after we look at the data, as compared to the “prior” or “a priori” (lit., “from what is before”) parameter distribution.

Example: coin flipping A (biased) coin lands heads-up with probability p and tails-up with probability 1-p. The parameter of the system is p, and the goal is to find Pr[p | sequence of coin flips]. (Technically, we want a PDF, f(p | flips).) Q: what family of PDFs is appropriate?

Example: coin flipping We need a PDF over possible values of p, with p ∈ [0,1]. A commonly used distribution is the beta distribution: f(p | α, β) = p^(α-1) (1-p)^(β-1) / B(α, β). The normalization constant B(α, β) is the “Beta function”; the two factors involve p = Pr[heads] and 1-p = Pr[tails].

The Beta Distribution [Image of beta distribution PDFs for various (α, β), courtesy of Wikimedia Commons.]
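A short sketch (my illustration; the (α, β) pairs are arbitrary) that evaluates the beta PDF at a few points, reproducing qualitatively what the figure shows -- how the hyperparameters shape the distribution:

```python
from math import gamma

# Sketch: evaluate the beta PDF f(p | alpha, beta) at a few points to see how
# the hyperparameters shape the prior. The (alpha, beta) pairs are arbitrary.

def beta_pdf(p, alpha, beta):
    """Beta density: p^(alpha-1) * (1-p)^(beta-1) / B(alpha, beta)."""
    B = gamma(alpha) * gamma(beta) / gamma(alpha + beta)  # Beta function
    return p ** (alpha - 1) * (1 - p) ** (beta - 1) / B

for alpha, beta in [(1, 1), (2, 2), (5, 2), (0.5, 0.5)]:
    densities = [round(beta_pdf(p, alpha, beta), 3) for p in (0.1, 0.3, 0.5, 0.7, 0.9)]
    print(f"Beta({alpha}, {beta}):", densities)
```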

Generative distribution f(p | α, β) is the prior distribution for p. The parameters α and β are hyperparameters; they govern the shape of f(). We still need the generative distribution Pr[h, t | p], where h and t are the numbers of heads and tails. Use a binomial distribution: Pr[h, t | p] = C(h+t, h) · p^h · (1-p)^t.

Posterior Now, by Bayes’ rule: f(p | h, t) ∝ Pr[h, t | p] · f(p | α, β) ∝ p^h (1-p)^t · p^(α-1) (1-p)^(β-1) = p^(α+h-1) (1-p)^(β+t-1). That is, the posterior is again a beta distribution, Beta(α+h, β+t) -- the beta prior is conjugate to the binomial likelihood.
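A minimal sketch of this conjugate update (mine; the prior and the counts are hypothetical): add the observed heads to α and the observed tails to β.

```python
# Sketch: conjugate Beta-Binomial update.
# Prior Beta(alpha, beta) + data (h heads, t tails) -> posterior Beta(alpha + h, beta + t).

def update_beta(alpha, beta, heads, tails):
    """Return the posterior hyperparameters after observing heads/tails."""
    return alpha + heads, beta + tails

alpha0, beta0 = 2.0, 2.0            # hypothetical prior: mild belief that the coin is fair
alpha1, beta1 = update_beta(alpha0, beta0, heads=7, tails=3)

posterior_mean = alpha1 / (alpha1 + beta1)
print(alpha1, beta1, posterior_mean)  # Beta(9.0, 5.0), posterior mean ~ 0.643
```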

Exercise Suppose you want to estimate the average air speed of an unladen (African) swallow. Let’s say that airspeeds of individual swallows, x, are Gaussian distributed with mean μ and variance 1: x ~ N(μ, 1). Let’s say, also, that we think the mean is “around” 50 kph, but we’re not sure exactly what it is; our uncertainty (variance) is 10: μ ~ N(50, 10). Derive the posterior estimate of the mean airspeed.
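For checking your answer: the standard Gaussian-conjugate result, under the assumption that the prior is μ ~ N(50, 10) and each observation has variance 1 (this is the textbook formula, not a derivation from the original slides):

```latex
% Posterior for the mean of a Gaussian with known variance \sigma^2 = 1,
% under the prior \mu \sim N(\mu_0, \sigma_0^2) with \mu_0 = 50, \sigma_0^2 = 10,
% after observing x_1, \dots, x_n:
\mu \mid x_1,\dots,x_n \;\sim\;
N\!\left(
  \frac{\sigma_0^2 \sum_{i=1}^n x_i + \sigma^2 \mu_0}{n\,\sigma_0^2 + \sigma^2},\;
  \frac{\sigma_0^2\,\sigma^2}{n\,\sigma_0^2 + \sigma^2}
\right)
% With \sigma^2 = 1 and \sigma_0^2 = 10:
%   posterior mean     = (10 \sum_i x_i + 50) / (10 n + 1)
%   posterior variance = 10 / (10 n + 1)
```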