ECE 5424: Introduction to Machine Learning

Bayes rule, priors and maximum a posteriori
Presentation transcript:

ECE 5424: Introduction to Machine Learning
Topics: Statistical Estimation (MLE, MAP, Bayesian)
Readings: Barber 8.6, 8.7
Stefan Lee, Virginia Tech

Administrative
HW1
- Due on Wed 9/14, 11:55pm
- Problem 2.2: Two cases (in ball, out of ball)
Project Proposal
- Due: Tue 09/21, 11:55pm
- <= 2 pages, NIPS format
(C) Dhruv Batra

Recap from last time (C) Dhruv Batra

Procedural View
Training Stage:
- Raw Data → x (Feature Extraction)
- Training Data { (x,y) } → f (Learning)
Testing Stage:
- Test Data x → f(x) (Apply function, Evaluate error)
(C) Dhruv Batra

Statistical Estimation View
Probabilities to the rescue:
- x and y are random variables
- D = {(x1,y1), (x2,y2), …, (xN,yN)} ~ P(X,Y)
- IID: Independent and Identically Distributed
  - Both training & testing data are sampled IID from P(X,Y)
  - Learn on the training set
  - Have some hope of generalizing to the test set
(C) Dhruv Batra

Interpreting Probabilities
What does P(A) mean?
Frequentist View: lim N→∞ #(A is true)/N, the limiting frequency of a repeating non-deterministic event
Bayesian View: P(A) is your “belief” about A
Market Design View: P(A) tells you how much you would bet
(C) Dhruv Batra

Concepts
Likelihood: How well does a certain hypothesis explain the data?
Prior: What do you believe before seeing any data?
Posterior: What do you believe after seeing the data?
(C) Dhruv Batra
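These three quantities are tied together by Bayes' rule; the slide itself does not show the formula, but the standard statement is:

  P(θ | D) = P(D | θ) P(θ) / P(D),   i.e.   posterior ∝ likelihood × prior.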

KL-Divergence / Relative Entropy (C) Dhruv Batra Slide Credit: Sam Roweis
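The equations from this slide are not reproduced in the transcript; the standard definition being referenced, for discrete distributions p and q, is:

  KL(p || q) = Σ_x p(x) log [ p(x) / q(x) ]  ≥ 0,   with equality iff p = q.

It measures how poorly q models data that actually comes from p, and it reappears below in the justification of maximum likelihood.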

Plan for Today
Statistical Learning
- Frequentist Tool: Maximum Likelihood
- Bayesian Tools: Maximum A Posteriori, Bayesian Estimation
Simple examples (like the coin toss), but the SAME concepts will apply to sophisticated problems.
(C) Dhruv Batra

Your first probabilistic learning algorithm
After taking this ML class, you drop out of VT and join an illegal betting company specializing in mascot fist fights.
Your new boss asks you: If the VT and UVA mascots fight tomorrow, will the Hokiebird win or lose?
You say: What is the record?
He says: W, L, L, W, W
You say: P(Hokiebird wins) = …
(C) Dhruv Batra

Go to board for MLE work.
(C) Dhruv Batra Slide Credit: Yaser Abu-Mostafa
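The board work is not captured in the transcript. A standard reconstruction for N independent coin flips (or fights) with nH heads/wins and nT tails/losses:

  L(θ; D) = P(D | θ) = θ^nH (1 − θ)^nT
  log L(θ) = nH log θ + nT log(1 − θ)
  d/dθ log L(θ) = nH/θ − nT/(1 − θ) = 0   ⇒   θ_MLE = nH / (nH + nT) = nH / N

For the record W, L, L, W, W this gives θ_MLE = 3/5.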

Maximum Likelihood Estimation
Goal: Find a good θ
What's a good θ? One that makes it likely for us to have seen this data.
Quality of θ = Likelihood(θ; D) = P(data | θ)
(C) Dhruv Batra
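A minimal Python sketch of this computation for the Hokiebird record above, assuming a Bernoulli (win/lose) model; the variable names are illustrative, not from the slides:

  # Maximum likelihood estimate of theta = P(Hokiebird wins) under a Bernoulli model.
  record = ["W", "L", "L", "W", "W"]            # last year's fight record
  n_wins = sum(1 for r in record if r == "W")   # sufficient statistic: number of wins
  n_total = len(record)

  # The likelihood theta^n_wins * (1 - theta)^(n_total - n_wins)
  # is maximized at the empirical win frequency.
  theta_mle = n_wins / n_total
  print(theta_mle)                              # 0.6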

Why Max-Likelihood?
- Leads to “natural” estimators
- MLE is optimal if the model class is correct
(C) Dhruv Batra
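One way to make "MLE is optimal if the model class is correct" precise (an editorial addition, not from the slide) is the link to the KL-divergence introduced earlier. As N → ∞,

  (1/N) Σ_i log P(x_i | θ)  →  E_{x ~ p*}[ log P(x | θ) ]  =  −KL(p* || P_θ) + const,

so maximizing the likelihood is asymptotically the same as minimizing the KL-divergence from the true distribution p* to the model P_θ; if p* lies in the model class, that minimum is p* itself.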

Sufficient Statistic
D1 = {1,1,1,0,0,0}, D2 = {1,0,1,0,1,0}
A function of the data ϕ(Y) is a sufficient statistic if the following is true:
(C) Dhruv Batra
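The defining condition is cut off in the transcript; a standard way to state it (a reconstruction, not necessarily the slide's exact wording):

  ϕ(D) is sufficient for θ if the likelihood depends on the data only through ϕ(D), i.e.
  P(D | θ) = g(ϕ(D), θ) · h(D)   (Fisher–Neyman factorization).

For coin flips, ϕ(D) = (nH, nT); both D1 and D2 above have ϕ = (3, 3), so they lead to exactly the same estimate of θ.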

How many flips do I need?
Boss says: Last year: 3 heads/wins-for-Hokiebird, 2 tails/losses-for-Hokiebird.
You say: θ = 3/5, I can prove it!
He says: What if it were 30 heads/wins-for-Hokiebird and 20 tails/losses-for-Hokiebird?
You say: Same answer, I can prove it!
He says: What's better?
You say: Hmm… The more the merrier???
He says: Is this why I am paying you the big bucks???
Joke Credit: Carlos Guestrin
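The slide leaves the boss's question hanging. One standard way the "how many flips" question gets answered (an assumption about where the lecture goes next, not shown in the transcript) is a Hoeffding-style bound on the MLE:

  P( |θ_MLE − θ*| ≥ ε )  ≤  2 exp(−2 N ε²),

so to be within ε of the true θ* with probability at least 1 − δ it suffices to see N ≥ ln(2/δ) / (2ε²) flips; 30/20 really is better than 3/2.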

Bayesian Estimation
Boss says: What if I know the Hokiebird is a better fighter closer to Thanksgiving? (fighting for his life)
You say: Bayesian it is then…
(C) Dhruv Batra

Priors
What are priors?
- Express beliefs before experiments are conducted
- Computational ease: lead to “good” posteriors
- Help deal with unseen data
- Regularizers: more about this in later lectures
Conjugate Priors
- Prior is conjugate to the likelihood if it leads to itself as the posterior
- Closed-form representation of the posterior
(C) Dhruv Batra
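For the coin/fight example, the conjugate pair is a Beta prior with a Bernoulli likelihood; spelling it out (standard algebra, added here for concreteness, with hyperparameters βH, βT):

  Prior:      P(θ) = Beta(θ; βH, βT) ∝ θ^(βH − 1) (1 − θ)^(βT − 1)
  Likelihood: P(D | θ) = θ^nH (1 − θ)^nT
  Posterior:  P(θ | D) ∝ θ^(βH + nH − 1) (1 − θ)^(βT + nT − 1) = Beta(θ; βH + nH, βT + nT)

The posterior is again a Beta distribution, so the hyperparameters behave like pseudo-counts of extra wins and losses.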

Beta prior distribution – P(θ)
Demo: http://demonstrations.wolfram.com/BetaDistribution/
Go to board after this slide for the Beta MAP derivation.
Slide Credit: Carlos Guestrin
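The board derivation is not in the transcript; a standard reconstruction, using the Beta(βH, βT) prior above with nH wins and nT losses:

  P(θ | D) ∝ θ^(βH + nH − 1) (1 − θ)^(βT + nT − 1)
  Setting the derivative of the log-posterior to zero, (βH + nH − 1)/θ = (βT + nT − 1)/(1 − θ), gives
  θ_MAP = (nH + βH − 1) / (nH + nT + βH + βT − 2).

Sanity check against the "Effect of Prior" slide below: a Beta(2,2) prior and a single win give θ_MAP = 2/3.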

Benefits of conjugate priors (C) Dhruv Batra

MAP for Beta distribution
MAP: use the most likely parameter (the posterior mode derived above)
- Beta prior is equivalent to extra W/L matches
- As N → ∞, the prior is “forgotten”
- But for small sample sizes, the prior is important!
Slide Credit: Carlos Guestrin

Effect of Prior
Prior = Beta(2,2), Dataset = {H}
MLE: L(θ) = θ, so θ_MLE = 1
MAP: Posterior = Beta(3,2), so θ_MAP = (3−1)/(3+2−2) = 2/3
(C) Dhruv Batra
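A small Python sketch that reproduces the numbers on this slide and adds the posterior-mean (fully Bayesian) estimate for comparison, using the Beta-Bernoulli update derived above:

  # Prior Beta(2,2); observe a single head/win.
  alpha, beta = 2, 2
  n_heads, n_tails = 1, 0

  # MLE ignores the prior: L(theta) = theta, maximized at 1.
  theta_mle = n_heads / (n_heads + n_tails)

  # Posterior is Beta(alpha + n_heads, beta + n_tails) = Beta(3, 2).
  a_post, b_post = alpha + n_heads, beta + n_tails

  # MAP = posterior mode = (a - 1) / (a + b - 2).
  theta_map = (a_post - 1) / (a_post + b_post - 2)

  # Posterior mean = a / (a + b).
  theta_mean = a_post / (a_post + b_post)

  print(theta_mle, theta_map, theta_mean)   # 1.0  0.666...  0.6

The prior pulls the estimate away from the extreme MLE value of 1 toward something more reasonable given only one observation.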

What you need to know
Statistical Learning:
- Maximum likelihood, and why MLE
- Sufficient statistics
- Maximum a posteriori (MAP)
- Bayesian estimation (returns an entire distribution)
- Priors, posteriors, conjugate priors
- Beta distribution (conjugate of the Bernoulli)