Ch 1. Introduction. Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Updated 2007-03-27 by J.-H. Eom (2nd round revision); originally summarized by K.-I. Kim. Biointelligence Laboratory, Seoul National University.

Contents
1.1 Example: Polynomial Curve Fitting
1.2 Probability Theory
- Probability densities
- Expectations and covariances
- Bayesian probabilities
- The Gaussian distribution
- Curve fitting re-visited
- Bayesian curve fitting
1.3 Model Selection

Pattern Recognition
- Training set {x_1, ..., x_N} with target vector t
- Training (learning) phase: determine the mapping y(x) from the training data
- Generalization: performance on new examples (the test set)
- Preprocessing: e.g. feature extraction

Supervised, Unsupervised, and Reinforcement Learning
- Supervised learning (with a target vector): classification, regression
- Unsupervised learning (without a target vector): clustering, density estimation, visualization
- Reinforcement learning (maximize a reward): trade-off between exploration and exploitation

Example: Polynomial Curve Fitting
- N observations x = (x_1, ..., x_N)^T with corresponding targets t = (t_1, ..., t_N)^T
- Fit the data with a polynomial function
  y(x, \mathbf{w}) = w_0 + w_1 x + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j
- Minimize the error function, the sum of squares of the errors (a code sketch follows below):
  E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2
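As a concrete illustration of the error-function minimization above, here is a minimal NumPy sketch that fits an M-th order polynomial by least squares to noisy samples of sin(2πx). The data size, noise level, and the choice M = 3 are illustrative assumptions, not the slide's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                                   # 10 observations, cubic polynomial (assumed)
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # noisy targets

# Design matrix: column j holds x^j, so y(x, w) = Phi @ w = sum_j w_j x^j
Phi = np.vander(x, M + 1, increasing=True)

# Least-squares solution minimizes E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print("fitted coefficients w*:", w_star)
```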

Model Selection & Over-fitting (1/2)
- Choosing the order M of the polynomial: too small an M gives a poor fit, while a large M (e.g. M = 9 for 10 data points) passes through the training points exactly but oscillates wildly between them

Model Selection & Over-fitting (2/2)
- Root-mean-square (RMS) error: E_{RMS} = \sqrt{2 E(\mathbf{w}^*) / N}
- Too large an M leads to over-fitting: the training error keeps shrinking while the test error grows (compared numerically in the sketch below)
- The more data, the better the generalization
- Over-fitting is a general property of maximum likelihood
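The following sketch makes the over-fitting behaviour concrete by computing E_RMS on a training set and a separate test set for several orders M. The synthetic sin(2πx) data, the set sizes, and the noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, noise=0.3):
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=noise, size=n)

def fit(x, t, M):
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def e_rms(x, t, w):
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))      # equals sqrt(2 E(w) / N)

x_tr, t_tr = make_data(10)                     # small training set
x_te, t_te = make_data(100)                    # independent test set
for M in (0, 1, 3, 9):
    w = fit(x_tr, t_tr, M)
    print(f"M={M}: train E_RMS={e_rms(x_tr, t_tr, w):.3f}, "
          f"test E_RMS={e_rms(x_te, t_te, w):.3f}")
```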

Regularization
- Controls the over-fitting phenomenon by adding a penalty term to the error function:
  \tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2
- Such techniques are known as shrinkage methods in statistics; the quadratic regularizer is called ridge regression, and in neural networks it is known as weight decay (see the sketch below)
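A hedged sketch of the quadratic (ridge / weight-decay) regularizer: the closed-form minimizer of the penalized error is compared for a few values of λ. The synthetic data set is an assumption; the ln λ values (minus infinity, -18, 0) follow those reported in the chapter's figures.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)
Phi = np.vander(x, 10, increasing=True)        # M = 9 polynomial

def ridge_fit(Phi, t, lam):
    # Minimizes 0.5*||Phi w - t||^2 + (lam/2)*||w||^2 via the normal equations
    # (lam*I + Phi^T Phi) w = Phi^T t; lstsq keeps this robust when lam = 0.
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    return np.linalg.lstsq(A, Phi.T @ t, rcond=None)[0]

for ln_lam in (-np.inf, -18.0, 0.0):           # ln(lambda) values as in the chapter's figures
    lam = np.exp(ln_lam)                       # exp(-inf) gives lambda = 0, i.e. no regularization
    w = ridge_fit(Phi, t, lam)
    print(f"ln(lambda)={ln_lam}: max |w_j| = {np.abs(w).max():.2e}")
```

Large coefficients for λ = 0 shrink dramatically as λ grows, which is exactly the weight-decay effect described above.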

Probability Theory
Bishop's fruit-box example (a red box and a blue box, each containing apples and oranges) motivates two questions:
- "What is the overall probability that the selection procedure will pick an apple?"
- "Given that we have chosen an orange, what is the probability that the box we chose was the blue one?"

Rules of Probability (1/2)
- Joint probability: p(X = x_i, Y = y_j)
- Marginal probability: p(X = x_i) = \sum_j p(X = x_i, Y = y_j)
- Conditional probability: p(Y = y_j | X = x_i) = p(X = x_i, Y = y_j) / p(X = x_i)

Rules of Probability (2/2)
- Sum rule: p(X) = \sum_Y p(X, Y)
- Product rule: p(X, Y) = p(Y | X) p(X)
- Bayes' theorem:
  p(Y | X) = \frac{p(X | Y) p(Y)}{p(X)}
  i.e. posterior \propto likelihood \times prior, with p(X) = \sum_Y p(X | Y) p(Y) acting as the normalizing constant (the fruit-box questions are worked through numerically below)
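To make the sum rule and Bayes' theorem concrete, here is a small numerical check of the two fruit-box questions, using the counts from Bishop's example: the red box holds 2 apples and 6 oranges, the blue box holds 3 apples and 1 orange, and the boxes are picked with probabilities 0.4 and 0.6.

```python
# Box prior and within-box fruit fractions from Bishop's example
p_box = {"r": 0.4, "b": 0.6}
p_fruit_given_box = {"r": {"apple": 2 / 8, "orange": 6 / 8},
                     "b": {"apple": 3 / 4, "orange": 1 / 4}}

# Sum rule / marginalization: p(F = apple) = sum_B p(apple | B) p(B)
p_apple = sum(p_fruit_given_box[b]["apple"] * p_box[b] for b in p_box)
print("p(apple) =", p_apple)                       # 11/20 = 0.55

# Bayes' theorem: p(B = blue | F = orange) = p(orange | blue) p(blue) / p(orange)
p_orange = sum(p_fruit_given_box[b]["orange"] * p_box[b] for b in p_box)
p_blue_given_orange = p_fruit_given_box["b"]["orange"] * p_box["b"] / p_orange
print("p(blue | orange) =", p_blue_given_orange)   # 1/3
```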

Probability Densities
- Probabilities with respect to continuous variables: the probability density p(x) over x satisfies
  p(x \in (a, b)) = \int_a^b p(x) \, dx, with p(x) \ge 0 and \int_{-\infty}^{\infty} p(x) \, dx = 1
- Sum rule: p(x) = \int p(x, y) \, dy
- Product rule: p(x, y) = p(y | x) \, p(x)
- Cumulative distribution function: P(z) = \int_{-\infty}^{z} p(x) \, dx

Expectations and Covariances
- Expectation of f(x): the average value of f(x) under a probability distribution p(x),
  E[f] = \sum_x p(x) f(x) (discrete case) or E[f] = \int p(x) f(x) \, dx (continuous case)
- Conditional expectation: E_x[f | y] = \sum_x p(x | y) f(x)
- Variance: a measure of how much variability there is in f(x) around its mean,
  var[f] = E[(f(x) - E[f(x)])^2] = E[f(x)^2] - E[f(x)]^2
- Covariance: the extent to which x and y vary together,
  cov[x, y] = E_{x,y}[(x - E[x])(y - E[y])] = E_{x,y}[x y] - E[x] E[y]
(These definitions are checked numerically in the sketch below.)
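A quick Monte Carlo sanity check of these definitions, using an assumed joint distribution (x standard normal, y a noisy linear function of x); the sample size and distributions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)                       # x ~ N(0, 1)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)   # y depends linearly on x plus noise

f = x ** 2                                         # an arbitrary function f(x)
print("E[f]      ~", f.mean())                                # ~1 for f(x)=x^2 under N(0,1)
print("var[f]    ~", np.mean(f ** 2) - f.mean() ** 2)          # E[f^2] - E[f]^2
print("cov[x, y] ~", np.mean(x * y) - x.mean() * y.mean())     # ~2
```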

Bayesian Probabilities: Frequentist vs. Bayesian
- Frequentist view
  - w is regarded as a fixed parameter whose value is determined by an "estimator"
  - Maximum likelihood: w is set to the value maximizing the likelihood p(D | w); the negative log likelihood serves as an error function
  - Error bars are obtained from the distribution over possible data sets, e.g. via the bootstrap (sketched below)
- Bayesian view
  - There is a single data set D (the one that is actually observed)
  - The uncertainty in the parameters is expressed as a probability distribution over w, updated with Bayes' theorem: p(w | D) = p(D | w) p(w) / p(D)
  - Advantage: the inclusion of prior knowledge arises naturally, and incorporating a prior leads to less extreme conclusions than maximum likelihood
  - Non-informative priors can be used when little prior knowledge is available
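A sketch (not code from the slides) of the frequentist bootstrap mentioned above: error bars for the maximum-likelihood estimate of a Gaussian mean are obtained by resampling the single observed data set with replacement. The data set and resample count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=1.5, scale=2.0, size=50)     # the single observed data set

mu_ml = data.mean()                                # maximum-likelihood estimate of the mean
boot_means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                       for _ in range(2000)])      # re-estimate on bootstrap resamples
print(f"mu_ML = {mu_ml:.3f}, bootstrap standard error = {boot_means.std(ddof=1):.3f}")
```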

Bayesian Probabilities: Expansion of Bayesian Applications
- The full Bayesian procedure long had limited practical application, even though its origins lie in the 18th century, because it requires marginalizing over the whole of parameter space
- Markov chain Monte Carlo (MCMC) sampling methods are flexible but computationally intensive, so they have mainly been used for small-scale problems
- Highly efficient deterministic approximation schemes (e.g. variational Bayes, expectation propagation) are an alternative to sampling methods and have allowed Bayesian techniques to be used in large-scale problems

The Gaussian Distribution
- Gaussian distribution for a single real-valued variable x:
  N(x | \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}
  governed by the mean \mu and variance \sigma^2 (standard deviation \sigma, precision \beta = 1/\sigma^2)
- D-dimensional multivariate Gaussian distribution for a D-dimensional vector x of continuous variables:
  N(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}
  governed by the mean \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma}, with determinant |\boldsymbol{\Sigma}| in the normalizer (both densities are implemented in the sketch below)
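The two density formulas can be implemented directly in NumPy as a sanity check; the test points and parameter values below are arbitrary.

```python
import numpy as np

def gauss1d(x, mu, sigma2):
    # N(x | mu, sigma^2)
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def gauss_nd(x, mu, Sigma):
    # N(x | mu, Sigma) for a D-dimensional vector x
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)     # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

print(gauss1d(0.5, mu=0.0, sigma2=1.0))            # ~0.352
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gauss_nd(np.array([0.2, 0.8]), mu, Sigma))
```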

Gaussian Distribution: Example (1/2)
- Estimate the unknown parameters \mu and \sigma^2 from observations x = (x_1, ..., x_N)^T drawn i.i.d. from N(x | \mu, \sigma^2); the log likelihood is
  \ln p(\mathbf{x} | \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)
- Maximizing with respect to \mu gives the sample mean: \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n
- Maximizing with respect to \sigma^2 gives the sample variance: \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2
- The two are evaluated subsequently: first \mu_{ML}, then \sigma^2_{ML}

Gaussian Distribution: Example (2/2)
- Bias phenomenon: a limitation of the maximum likelihood approach, related to over-fitting
- Taking expectations over data sets of size N (verified by simulation below):
  E[\mu_{ML}] = \mu (the mean is correct)
  E[\sigma^2_{ML}] = \frac{N-1}{N} \sigma^2 (the variance is underestimated)
- The corresponding unbiased estimate is \tilde{\sigma}^2 = \frac{N}{N-1} \sigma^2_{ML}
- (The slide's figure compares the true mean with the sample mean as the data-set size N varies.)
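The bias result can be verified by simulation: drawing many data sets of size N and averaging the ML estimates reproduces E[sigma^2_ML] close to (N-1)/N times sigma^2. The values of mu, sigma^2, N, and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, N, trials = 0.0, 4.0, 5, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = samples.mean(axis=1)                             # ML mean of each data set
var_ml = ((samples - mu_ml[:, None]) ** 2).mean(axis=1)  # ML variance of each data set

print("E[mu_ML]        ~", mu_ml.mean())                   # ~ mu = 0
print("E[sigma^2_ML]   ~", var_ml.mean())                  # ~ (N-1)/N * sigma^2 = 3.2
print("bias-corrected  ~", (N / (N - 1)) * var_ml.mean())  # ~ sigma^2 = 4.0
```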

Curve Fitting Re-visited (1/2)
- Goal of the curve fitting problem: predict the target variable t given some new input variable x
- Assume a Gaussian distribution for t whose mean is the polynomial value:
  p(t | x, \mathbf{w}, \beta) = N(t | y(x, \mathbf{w}), \beta^{-1})
- Determine the unknown w and \beta by maximum likelihood using the training data \{\mathbf{x}, \mathbf{t}\}
- For data drawn i.i.d. from this distribution, the likelihood is
  p(\mathbf{t} | \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} N(t_n | y(x_n, \mathbf{w}), \beta^{-1})
  and in log form
  \ln p(\mathbf{t} | \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)

Curve Fitting Re-visited (2/2)
- Maximizing the likelihood is therefore equivalent to minimizing the sum-of-squares error function (the negative log likelihood), giving \mathbf{w}_{ML}
- Determining the precision by maximum likelihood:
  \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}_{ML}) - t_n \}^2
- Predictive distribution: predictions for new values of x, expressed using \mathbf{w}_{ML} and \beta_{ML} (see the sketch below):
  p(t | x, \mathbf{w}_{ML}, \beta_{ML}) = N(t | y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1})
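The following sketch ties these pieces together: w_ML from least squares, beta_ML from the residuals, and the Gaussian predictive distribution evaluated at a new input. The data set and model order are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
Phi = np.vander(x, M + 1, increasing=True)

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # maximizes the likelihood in w
residuals = Phi @ w_ml - t
beta_ml = 1.0 / np.mean(residuals ** 2)            # 1/beta_ML = (1/N) sum_n (y(x_n, w_ML) - t_n)^2

x_new = 0.35
y_new = np.vander([x_new], M + 1, increasing=True)[0] @ w_ml
print(f"predictive distribution at x={x_new}: N(t | {y_new:.3f}, {1 / beta_ml:.4f})")
```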

Maximum Posterior (MAP)
- Determine w as its most probable value given the data, i.e. by maximizing the posterior p(\mathbf{w} | \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} | \mathbf{x}, \mathbf{w}, \beta) \, p(\mathbf{w} | \alpha)
- Introduce a Gaussian prior over w with hyper-parameter \alpha:
  p(\mathbf{w} | \alpha) = N(\mathbf{w} | \mathbf{0}, \alpha^{-1} \mathbf{I}) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp\left\{ -\frac{\alpha}{2} \mathbf{w}^T \mathbf{w} \right\}
- Taking the negative logarithm and combining the previous terms, the maximum of the posterior is given by the minimum of
  \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\alpha}{2} \mathbf{w}^T \mathbf{w}
- Maximizing the posterior distribution is thus equivalent to minimizing the regularized sum-of-squares error function (1.4), with regularization parameter \lambda = \alpha / \beta (checked numerically below)
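A numerical check of the stated equivalence: the MAP solution under the Gaussian prior coincides with the ridge solution using lambda = alpha / beta. The values of alpha, beta, and the synthetic data set are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)
Phi = np.vander(x, 10, increasing=True)            # M = 9
alpha, beta = 5e-3, 11.1                           # assumed hyper-parameter values

# MAP: minimize (beta/2)*||Phi w - t||^2 + (alpha/2)*w^T w
w_map = np.linalg.solve(alpha * np.eye(10) + beta * Phi.T @ Phi, beta * Phi.T @ t)

# Ridge with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(10) + Phi.T @ Phi, Phi.T @ t)

print("max |w_MAP - w_ridge| =", np.abs(w_map - w_ridge).max())   # ~0 up to round-off
```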

Bayesian Curve Fitting
- A fully Bayesian treatment marginalizes over w rather than making a point estimate:
  p(t | x, \mathbf{x}, \mathbf{t}) = \int p(t | x, \mathbf{w}) \, p(\mathbf{w} | \mathbf{x}, \mathbf{t}) \, d\mathbf{w}
- For the polynomial model with Gaussian noise and a Gaussian prior, this marginalization can be performed analytically, giving a Gaussian predictive distribution
  p(t | x, \mathbf{x}, \mathbf{t}) = N(t | m(x), s^2(x))
  with
  m(x) = \beta \, \boldsymbol{\phi}(x)^T \mathbf{S} \sum_{n=1}^{N} \boldsymbol{\phi}(x_n) t_n, \qquad
  s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T \mathbf{S} \, \boldsymbol{\phi}(x), \qquad
  \mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^{N} \boldsymbol{\phi}(x_n) \boldsymbol{\phi}(x_n)^T,
  where \boldsymbol{\phi}(x) = (1, x, \dots, x^M)^T
- The first term of s^2(x) represents the noise on the target variable; the second reflects the remaining uncertainty in w (implemented in the sketch below)
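A direct implementation sketch of m(x), s^2(x), and S for the polynomial basis phi(x) = (1, x, ..., x^M)^T. The alpha and beta values match those quoted in Bishop's illustration (alpha = 5e-3, beta = 11.1), but the noisy data set itself is simulated here.

```python
import numpy as np

rng = np.random.default_rng(8)
N, M, alpha, beta = 10, 9, 5e-3, 11.1
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

def phi(x_val):                                    # polynomial basis vector phi(x) = (1, x, ..., x^M)^T
    return np.array([x_val ** j for j in range(M + 1)])

Phis = np.stack([phi(xn) for xn in x])             # N x (M+1) matrix of basis vectors
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phis.T @ Phis)   # S^{-1} = alpha*I + beta*sum phi phi^T

def predictive(x_new):
    ph = phi(x_new)
    m = beta * ph @ S @ (Phis.T @ t)               # predictive mean m(x)
    s2 = 1.0 / beta + ph @ S @ ph                  # predictive variance s^2(x)
    return m, s2

m, s2 = predictive(0.5)
print(f"p(t | x=0.5, data) = N(t | {m:.3f}, {s2:.4f})")
```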

Model Selection
- Choosing the proper model complexity gives good generalization; we want to select the best model
- Measuring generalization performance:
  - If data are plentiful, divide them into training, validation, and test sets
  - Otherwise, use cross-validation, in the extreme the leave-one-out technique (sketched below)
    - Drawbacks: the computation is expensive (many training runs), and a model with multiple complexity parameters may require exploring many combinations of settings
  - Alternative measures of performance, e.g. the Akaike information criterion (AIC) and the Bayesian information criterion (BIC)
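A minimal leave-one-out cross-validation sketch for choosing the polynomial order M, which also illustrates the computational drawback (one fit per held-out point per candidate order). The data set and candidate orders are assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0.0, 1.0, 15)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=15)

def loo_rms(M):
    errs = []
    for i in range(len(x)):                        # hold out one point at a time
        mask = np.arange(len(x)) != i
        Phi = np.vander(x[mask], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t[mask], rcond=None)
        pred = np.vander([x[i]], M + 1, increasing=True)[0] @ w
        errs.append((pred - t[i]) ** 2)
    return np.sqrt(np.mean(errs))

scores = {M: loo_rms(M) for M in range(10)}
best_M = min(scores, key=scores.get)
print({M: round(s, 3) for M, s in scores.items()})
print("selected order M =", best_M)
```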