CS 2750: Machine Learning Probability Review Density Estimation

CS 2750: Machine Learning Probability Review Density Estimation Prof. Adriana Kovashka University of Pittsburgh March 23, 2017

Plan for this lecture: Probability basics (review); Some terms from probabilistic learning; Some common probability distributions

Machine Learning: Procedural View Training Stage: Raw Data → x (extract features); Training Data {(x,y)} → f (learn model). Testing Stage: Test Data x → f(x) (apply learned model, evaluate error). Adapted from Dhruv Batra

Statistical Estimation View Probabilities to the rescue: x and y are random variables, and D = (x1,y1), (x2,y2), …, (xN,yN) ~ P(X,Y). IID: Independent and Identically Distributed. Both training and testing data are sampled IID from P(X,Y), so by learning on the training set we have some hope of generalizing to the test set. Dhruv Batra

Probability A is a non-deterministic event; think of A as a Boolean-valued variable. Examples: A = your next patient has cancer; A = Andy Murray wins US Open 2017. Dhruv Batra

Interpreting Probabilities What does P(A) mean? Frequentist view: the limit as N→∞ of #(A is true)/N, the frequency of a repeating non-deterministic event. Bayesian view: P(A) is your "belief" about A. Adapted from Dhruv Batra

Axioms of Probability 0 ≤ P(A) ≤ 1; P(false) = 0; P(true) = 1; P(A ∨ B) = P(A) + P(B) − P(A ∧ B). Dhruv Batra, Andrew Moore
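
These axioms are easy to sanity-check numerically. Below is a minimal sketch (not from the slides) using a toy sample space of equally likely die rolls; the events A and B are made up for illustration.

```python
# Minimal numeric check of the inclusion-exclusion axiom on a toy sample
# space of equally likely die rolls (hypothetical example).
omega = [1, 2, 3, 4, 5, 6]
A = {2, 4, 6}          # "roll is even"
B = {4, 5, 6}          # "roll is at least 4"

P = lambda event: len(event) / len(omega)

lhs = P(A | B)                      # P(A v B)
rhs = P(A) + P(B) - P(A & B)        # P(A) + P(B) - P(A ^ B)
assert abs(lhs - rhs) < 1e-12       # the axiom holds: 4/6 == 3/6 + 3/6 - 2/6
```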

Probabilities: Example Use Apples and Oranges Chris Bishop

Marginal, Joint, Conditional Marginal Probability Conditional Probability Joint Probability Chris Bishop

Joint Probability P(X1,…,Xn) gives the probability of every combination of values: an n-dimensional array with v^n entries if all variables are discrete with v values, and all v^n entries must sum to 1. The probability of all possible conjunctions (assignments of values to some subset of variables) can be calculated by summing the appropriate subset of values from the joint distribution; therefore all conditional probabilities can also be calculated. Example joint table:

positive        circle   square
       red      0.20     0.02
       blue       –      0.01

negative        circle   square
       red      0.05     0.30
       blue     0.20       –

Adapted from Ray Mooney
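
A minimal sketch of working with a small joint table in numpy; the 2×2 joint distribution below uses made-up numbers (not the slide's table) just to show how marginals and conditionals fall out of the joint.

```python
import numpy as np

# Joint distribution P(color, shape); rows: color (red, blue),
# columns: shape (circle, square). Values are made up for illustration.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])
assert np.isclose(joint.sum(), 1.0)          # all v^n entries sum to 1

p_color = joint.sum(axis=1)                  # marginal P(color) = [0.40, 0.60]
p_shape = joint.sum(axis=0)                  # marginal P(shape) = [0.50, 0.50]

# Conditional P(shape | color=red) = P(red, shape) / P(red)
p_shape_given_red = joint[0] / p_color[0]    # [0.75, 0.25]
print(p_color, p_shape, p_shape_given_red)
```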

Marginal Probability (figure omitted: marginalizing a joint distribution over y and z) Dhruv Batra, Erik Sudderth

Conditional Probability P(Y=y | X=x): What do you believe about Y=y, if I tell you X=x? P(Andy Murray wins US Open 2017)? What if I tell you: he is currently ranked #1, and he has won the US Open once. Dhruv Batra

Conditional Probability Chris Bishop

Conditional Probability Dhruv Batra, Erik Sudderth

Sum and Product Rules Sum Rule: p(X) = Σ_Y p(X,Y). Product Rule: p(X,Y) = p(Y|X) p(X). Chris Bishop

Chain Rule Generalizes the product rule: p(x1, …, xn) = p(x1) p(x2|x1) p(x3|x1,x2) ⋯ p(xn|x1,…,xn−1). Example: p(a,b,c) = p(a) p(b|a) p(c|a,b). Equations from Wikipedia

The Rules of Probability Sum Rule Product Rule Chris Bishop

Independence A and B are independent iff P(A | B) = P(A) (equivalently, P(B | A) = P(B)). Therefore, if A and B are independent: P(A ∧ B) = P(A | B) P(B) = P(A) P(B). These two constraints are logically equivalent. Ray Mooney
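
A minimal sketch of checking this condition numerically: two binary variables are independent exactly when every cell of the joint table equals the product of the corresponding marginals. The joint table is made up for illustration.

```python
import numpy as np

# Test P(A, B) = P(A) P(B) for every cell of a made-up joint table.
joint = np.array([[0.12, 0.28],    # rows: A = true / false
                  [0.18, 0.42]])   # cols: B = true / false

p_a = joint.sum(axis=1, keepdims=True)   # marginal P(A)
p_b = joint.sum(axis=0, keepdims=True)   # marginal P(B)

independent = np.allclose(joint, p_a * p_b)
print(independent)   # True here: 0.12 = 0.30 * 0.40, and so on
```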

Independence Marginal: P satisfies (X ⊥ Y) if and only if P(X=x, Y=y) = P(X=x) P(Y=y) for all x ∈ Val(X), y ∈ Val(Y). Conditional: P satisfies (X ⊥ Y | Z) if and only if P(X,Y|Z) = P(X|Z) P(Y|Z) for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z). Dhruv Batra

Independence Dhruv Batra, Erik Sudderth

Bayes’ Theorem p(Y|X) = p(X|Y) p(Y) / p(X); posterior ∝ likelihood × prior. Chris Bishop
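
A minimal numeric sketch of Bayes' rule for a discrete variable: multiply likelihood by prior, then normalize by the evidence. The prior and likelihood values below are made up for illustration.

```python
import numpy as np

# Posterior over Y given an observation X = x: posterior ∝ likelihood × prior.
prior = np.array([0.7, 0.3])            # P(Y = y1), P(Y = y2)
likelihood = np.array([0.2, 0.9])       # P(X = x | Y = y1), P(X = x | Y = y2)

unnormalized = likelihood * prior       # numerator of Bayes' rule
posterior = unnormalized / unnormalized.sum()   # divide by P(X = x)
print(posterior)                        # approximately [0.341, 0.659]
```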

Expectations E[f] = Σ_x p(x) f(x). Conditional Expectation (discrete): E_x[f | y] = Σ_x p(x|y) f(x). Approximate Expectation (discrete and continuous): E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n), with x_n drawn from p(x). Chris Bishop
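
A minimal sketch of the approximate (Monte Carlo) expectation: average f over samples drawn from p(x). The choice of p as a standard normal and f(x) = x² is illustrative only; the true value of E[x²] is 1.

```python
import numpy as np

# Approximate E[f] ≈ (1/N) Σ f(x_n) with samples x_n ~ p(x).
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # samples from p(x) = N(0, 1)
approx = np.mean(x ** 2)           # f(x) = x^2, so E[f] is the variance
print(approx)                      # close to 1.0
```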

Variances and Covariances var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]². cov[x,y] = E_{x,y}[(x − E[x])(y − E[y])]. Chris Bishop

Entropy Important quantity in coding theory, statistical physics, and machine learning. Chris Bishop

Entropy Coding theory: x is discrete with 8 possible states; how many bits are needed to transmit the state of x? If all states are equally likely, H[x] = −8 × (1/8) log2(1/8) = 3 bits. Chris Bishop
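
A one-line check of the example above: the entropy of 8 equally likely states, computed in bits.

```python
import numpy as np

# H[x] = -Σ p(x) log2 p(x) for 8 equally likely states = 3 bits.
p = np.full(8, 1 / 8)
H = -np.sum(p * np.log2(p))
print(H)   # 3.0
```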

Entropy H[x] = −Σ_x p(x) ln p(x) (use log2 for bits). Chris Bishop

Entropy Chris Bishop

The Kullback-Leibler Divergence KL(p‖q) = −Σ_x p(x) ln (q(x)/p(x)) = Σ_x p(x) ln (p(x)/q(x)) ≥ 0, with equality iff p = q; it is not symmetric. Chris Bishop
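
A minimal sketch of KL divergence between two discrete distributions; the distributions are made up for illustration, and both directions are computed to show the asymmetry.

```python
import numpy as np

# KL(p || q) = Σ p(x) log(p(x) / q(x)) >= 0, equal to 0 iff p = q.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))
print(kl_pq, kl_qp)   # both >= 0, and generally not equal (not symmetric)
```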

Mutual Information I[x,y] = KL( p(x,y) ‖ p(x)p(y) ) = H[x] − H[x|y] = H[y] − H[y|x]; it is zero iff x and y are independent. Chris Bishop
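
A minimal sketch computing mutual information directly from its definition as the KL divergence between the joint and the product of marginals; the joint table is made up for illustration.

```python
import numpy as np

# I[x, y] = KL( p(x, y) || p(x) p(y) ), computed over a small joint table.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])
p_x = joint.sum(axis=1, keepdims=True)
p_y = joint.sum(axis=0, keepdims=True)

I = np.sum(joint * np.log(joint / (p_x * p_y)))
print(I)   # > 0 here; exactly 0 when x and y are independent
```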

Likelihood / Prior / Posterior A hypothesis is denoted as h; it is one member of the hypothesis space H. A set of training examples is denoted as D, a collection of (x, y) pairs for training. Pr(h): the prior probability of the hypothesis; without observing any training data, what is the probability that h is the target function we want? Rebecca Hwa

Likelihood / Prior / Posterior Pr(D): the prior probability of the observed data, i.e. the chance of getting the particular set of training examples D. Pr(h|D): the posterior probability of h, the probability that h is the target given that we have observed D. Pr(D|h): the probability of getting D if h were true (a.k.a. the likelihood of the data). Bayes' rule: Pr(h|D) = Pr(D|h) Pr(h) / Pr(D). Rebecca Hwa

MAP vs MLE Estimation Maximum a posteriori (MAP) estimation: h_MAP = argmax_h Pr(h|D) = argmax_h Pr(D|h) Pr(h) / Pr(D) = argmax_h Pr(D|h) Pr(h). Maximum likelihood estimation (MLE): h_ML = argmax_h Pr(D|h). Rebecca Hwa
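
A minimal sketch contrasting the two estimates on a grid of coin-bias hypotheses h = P(heads). The observed flip counts and the prior favoring fair coins are made up for illustration.

```python
import numpy as np

# Compare h_ML = argmax_h Pr(D|h) with h_MAP = argmax_h Pr(D|h) Pr(h).
heads, tails = 8, 2                               # observed data D
h = np.linspace(0.01, 0.99, 99)                   # candidate hypotheses

likelihood = h**heads * (1 - h)**tails            # Pr(D | h)
prior = np.exp(-((h - 0.5) ** 2) / 0.02)          # Pr(h), peaked at 0.5
prior /= prior.sum()

h_ml = h[np.argmax(likelihood)]                   # maximum likelihood
h_map = h[np.argmax(likelihood * prior)]          # maximum a posteriori
print(h_ml, h_map)   # MLE is ~0.8; MAP is pulled toward the prior's peak at 0.5
```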

Plan for this lecture: Probability basics (review); Some terms from probabilistic learning; Some common probability distributions

The Gaussian Distribution N(x | μ, σ²) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) ). Chris Bishop

Curve Fitting Re-visited Chris Bishop

Gaussian Parameter Estimation Likelihood function Chris Bishop

Maximum Likelihood Determine w_ML by minimizing the sum-of-squares error E(w). Chris Bishop
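
A minimal sketch of the point being made here: under Gaussian observation noise, the maximum-likelihood curve fit is the one that minimizes the sum of squared errors. The synthetic data and the degree-3 polynomial are assumptions for illustration.

```python
import numpy as np

# Synthetic curve-fitting data: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# np.polyfit minimizes the sum of squared residuals, i.e. it returns the
# maximum-likelihood polynomial coefficients under Gaussian noise.
w_ml = np.polyfit(x, t, deg=3)
print(w_ml)
```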

Predictive Distribution Chris Bishop

MAP: A Step towards Bayes posterior ∝ likelihood × prior. Determine w_MAP by minimizing the regularized sum-of-squares error Ẽ(w). Adapted from Chris Bishop

The Gaussian Distribution Diagonal covariance matrix Covariance matrix proportional to the identity matrix Chris Bishop

Gaussian Mean and Variance Chris Bishop

Maximum Likelihood for the Gaussian Given i.i.d. data X = (x_1, …, x_N), the log-likelihood function is ln p(X | μ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) Σ_n (x_n − μ)ᵀ Σ⁻¹ (x_n − μ). Sufficient statistics: Σ_n x_n and Σ_n x_n x_nᵀ. Chris Bishop

Maximum Likelihood for the Gaussian Set the derivative of the log-likelihood function to zero and solve to obtain μ_ML = (1/N) Σ_n x_n. Similarly, Σ_ML = (1/N) Σ_n (x_n − μ_ML)(x_n − μ_ML)ᵀ. Chris Bishop
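
A minimal sketch of these closed-form estimates: the sample mean and the (biased, 1/N) sample covariance recover the parameters of a 2-D Gaussian. The "true" parameters are made up and the data are synthetic.

```python
import numpy as np

# ML estimates for a multivariate Gaussian: μ_ML = sample mean,
# Σ_ML = (1/N) Σ_n (x_n - μ_ML)(x_n - μ_ML)^T.
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[1.0, 0.3], [0.3, 0.5]])
X = rng.multivariate_normal(true_mu, true_cov, size=5000)

mu_ml = X.mean(axis=0)
diff = X - mu_ml
sigma_ml = diff.T @ diff / X.shape[0]     # note 1/N, not 1/(N-1)
print(mu_ml, sigma_ml, sep="\n")          # close to the true parameters
```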

Maximum Likelihood – 1D Case Chris Bishop

Mixtures of Gaussians Old Faithful data set: a single Gaussian vs. a mixture of two Gaussians. Chris Bishop

Mixtures of Gaussians Combine simple models into a complex model: p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k), e.g. K=3. Each N(x | μ_k, Σ_k) is a component and each π_k a mixing coefficient. Chris Bishop
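
A minimal sketch of a K=3 mixture of 1-D Gaussians: evaluating the mixture density as a π-weighted sum of components, and drawing samples by first picking a component and then sampling from it. All parameter values are made up.

```python
import numpy as np

# p(x) = Σ_k π_k N(x | μ_k, σ_k²) with K = 3 components.
pi = np.array([0.5, 0.3, 0.2])        # mixing coefficients, sum to 1
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def gmm_density(x):
    comps = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comps @ pi                 # weighted sum of component densities

# Ancestral sampling: pick a component k ~ π, then draw x ~ N(μ_k, σ_k²).
rng = np.random.default_rng(0)
k = rng.choice(3, size=10_000, p=pi)
samples = rng.normal(mu[k], sigma[k])
print(gmm_density(np.array([0.0])), samples.mean())
```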

Mixtures of Gaussians Chris Bishop

Binary Variables Coin flipping: heads=1, tails=0. Bernoulli Distribution: Bern(x | μ) = μ^x (1 − μ)^(1 − x), with E[x] = μ and var[x] = μ(1 − μ). Chris Bishop

Binary Variables N coin flips: the number of heads m follows the Binomial Distribution, Bin(m | N, μ) = (N choose m) μ^m (1 − μ)^(N − m). Chris Bishop

Binomial Distribution Chris Bishop

Parameter Estimation ML for Bernoulli. Given D = {x_1, …, x_N} with m heads, the likelihood is p(D | μ) = Π_n μ^(x_n) (1 − μ)^(1 − x_n), and maximizing it gives μ_ML = (1/N) Σ_n x_n = m/N. Chris Bishop

Parameter Estimation Example: D = three tosses, all heads, so μ_ML = 3/3 = 1. Prediction: all future tosses will land heads up. Overfitting to D. Chris Bishop
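
A minimal sketch of this overfitting behaviour, assuming (as in the example above) that D is three heads in a row: the ML estimate assigns zero probability to tails.

```python
import numpy as np

# Bernoulli ML estimate μ_ML = m / N on a tiny dataset of three heads.
D = np.array([1, 1, 1])          # three observed heads
mu_ml = D.mean()                 # m / N
print(mu_ml)                     # 1.0  -> P(tails) = 0 under the ML fit
```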

Beta Distribution Distribution over μ ∈ [0, 1]: Beta(μ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] μ^(a−1) (1 − μ)^(b−1), with E[μ] = a / (a + b). Chris Bishop

Bayesian Bernoulli The Beta distribution provides the conjugate prior for the Bernoulli distribution: with m heads and l tails, p(μ | m, l, a, b) ∝ μ^(m+a−1) (1 − μ)^(l+b−1), i.e. the posterior is Beta(μ | m + a, l + b). Chris Bishop

Bayesian Bernoulli The hyperparameters a_N and b_N are the effective numbers of observations of x=1 and x=0 (they need not be integers). The posterior distribution can in turn act as a prior as more data is observed.

Bayesian Bernoulli Interpretation? The predictive probability p(x=1 | D) = (m + a) / (m + a + l + b) is the fraction of observations (real and fictitious/prior) corresponding to x=1, where l = N − m. For infinitely large datasets, it reduces to the maximum likelihood estimate m/N.
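
A minimal sketch of the conjugate Beta-Bernoulli update described above; the prior hyperparameters and the observed counts are made up for illustration.

```python
# Beta(a, b) prior + m heads and l tails  ->  posterior Beta(a + m, b + l);
# predictive P(heads) = (m + a) / (m + a + l + b).
a, b = 2.0, 2.0                  # prior "pseudo-counts" of heads and tails
m, l = 8, 2                      # observed heads and tails (l = N - m)

a_post, b_post = a + m, b + l    # posterior Beta(a_N, b_N)
p_heads = a_post / (a_post + b_post)
print(a_post, b_post, p_heads)   # 10.0 4.0 ~0.714; approaches m/N = 0.8 as N grows
```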

Prior ∙ Likelihood = Posterior Chris Bishop

Multinomial Variables 1-of-K coding scheme: x = (0, …, 0, 1, 0, …, 0)ᵀ with Σ_k x_k = 1, and p(x | μ) = Π_k μ_k^(x_k) with Σ_k μ_k = 1. Chris Bishop

ML Parameter Estimation Given D = {x_1, …, x_N}, the likelihood is p(D | μ) = Π_k μ_k^(m_k) with counts m_k = Σ_n x_nk. To ensure Σ_k μ_k = 1, use a Lagrange multiplier, λ; the solution is μ_k^ML = m_k / N. Chris Bishop
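
A minimal sketch of the resulting estimate: for 1-of-K observations, the Lagrange-multiplier solution is just counts over total. The observations below are made-up 1-of-K vectors.

```python
import numpy as np

# Multinomial ML estimate μ_k = m_k / N from 1-of-K encoded observations.
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [1, 0, 0]])
m = X.sum(axis=0)        # counts m_k = Σ_n x_nk  -> [3, 1, 1]
mu_ml = m / X.shape[0]   # μ_k = m_k / N          -> [0.6, 0.2, 0.2]
print(mu_ml)
```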

The Multinomial Distribution Chris Bishop

The Dirichlet Distribution Conjugate prior for the multinomial distribution: Dir(μ | α) = [Γ(α_0) / (Γ(α_1) ⋯ Γ(α_K))] Π_k μ_k^(α_k − 1), where α_0 = Σ_k α_k. Chris Bishop
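
A minimal sketch of this conjugacy: a Dirichlet prior plus observed counts gives a Dirichlet posterior whose parameters are just the prior concentrations plus the counts. The prior and counts below are made up for illustration.

```python
import numpy as np

# Dirichlet-multinomial update: prior Dir(α) + counts m -> posterior Dir(α + m).
alpha = np.array([1.0, 1.0, 1.0])         # uniform Dirichlet prior
m = np.array([3, 1, 1])                   # observed counts m_k

alpha_post = alpha + m                    # posterior concentration parameters
mu_mean = alpha_post / alpha_post.sum()   # posterior mean of μ
print(alpha_post, mu_mean)                # [4. 2. 2.] [0.5 0.25 0.25]
```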