1
Ch 1. Introduction
Pattern Recognition and Machine Learning, C. M. Bishop, 2006
Updated by J.-H. Eom (2nd round revision), 2007-03-27
Originally summarized by K.-I. Kim
Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
2
Contents
1.1 Example: Polynomial Curve Fitting
1.2 Probability Theory
1.2.1 Probability densities
1.2.2 Expectations and covariances
1.2.3 Bayesian probabilities
1.2.4 The Gaussian distribution
1.2.5 Curve fitting re-visited
1.2.6 Bayesian curve fitting
1.3 Model Selection
3
Pattern Recognition
Training set, target vector
Training (learning) phase: determine the function y(x)
Generalization: performance on a new test set
Preprocessing, feature extraction
4
Supervised, Unsupervised and Reinforcement Learning
Supervised learning: with a target vector
- Classification
- Regression
Unsupervised learning: without a target vector
- Clustering
- Density estimation
- Visualization
Reinforcement learning: maximize a reward
- Trade-off between exploration and exploitation
5
1.1 Example: Polynomial Curve Fitting
Given N observations of an input x with corresponding targets t, fit the data with a polynomial function y(x, w) by minimizing an error function: the sum of squares of the errors between the predictions y(x_n, w) and the targets t_n.
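Not part of the slides: a minimal runnable sketch of this least-squares polynomial fit. The synthetic sin(2πx)-plus-noise data mirror the book's running example; the specific sizes and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in the spirit of the book's example: t = sin(2*pi*x) + noise
N, M = 10, 3                      # number of observations, polynomial order (assumed)
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Design matrix with columns x^0, x^1, ..., x^M
Phi = np.vander(x, M + 1, increasing=True)

# Least-squares solution minimizes E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

y = Phi @ w                       # fitted values at the training inputs
E = 0.5 * np.sum((y - t) ** 2)    # sum-of-squares error at the fitted weights
print(w, E)
```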
6
Model Selection & Over-fitting (1/2)
Choosing the polynomial order M
7
Model Selection & Over-fitting (2/2)
RMS (root-mean-square) error
Too large an M → over-fitting
The more data, the better the generalization
Over-fitting is a general property of maximum likelihood
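For reference, the RMS error the slide refers to (Bishop eq. 1.3) rescales the sum-of-squares error at the fitted weights w* by the number of data points:

```latex
E_{\mathrm{RMS}} = \sqrt{\,2E(\mathbf{w}^{\star})/N\,}
```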
8
Regularization
Controlling the over-fitting phenomenon by adding a penalty term to the error function
Known in statistics as shrinkage; the quadratic-penalty case is ridge regression, called weight decay in neural networks
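The penalized error function being minimized, as written in the book (λ controls the strength of the quadratic weight penalty):

```latex
\widetilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n,\mathbf{w}) - t_n\}^{2}
  + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^{2}
```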
9
1.2 Probability Theory
Motivating example with two boxes of fruit:
"What is the overall probability that the selection procedure will pick an apple?"
"Given that we have chosen an orange, what is the probability that the box we chose was the blue one?"
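The slide poses the questions without the worked answer. Using the book's setup (red box: 2 apples, 6 oranges; blue box: 3 apples, 1 orange; the red box is picked with probability 4/10, the blue with 6/10), the sum and product rules give:

```latex
p(F=a) = p(F=a \mid B=r)\,p(B=r) + p(F=a \mid B=b)\,p(B=b)
       = \tfrac{1}{4}\cdot\tfrac{4}{10} + \tfrac{3}{4}\cdot\tfrac{6}{10} = \tfrac{11}{20}

p(B=b \mid F=o) = \frac{p(F=o \mid B=b)\,p(B=b)}{p(F=o)}
               = \frac{\tfrac{1}{4}\cdot\tfrac{6}{10}}{\tfrac{9}{20}} = \tfrac{1}{3}
```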
10
Rules of Probability (1/2)
Joint probability p(X, Y)
Marginal probability p(X), obtained by summing the joint over the other variable
Conditional probability p(Y|X)
11
Rules of Probability (2/2)
Sum rule
Product rule
Bayes' theorem: posterior = likelihood × prior / normalizing constant
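The corresponding equations from the book:

```latex
\text{sum rule:}\qquad p(X) = \sum_{Y} p(X, Y)

\text{product rule:}\qquad p(X, Y) = p(Y \mid X)\,p(X)

\text{Bayes' theorem:}\qquad p(Y \mid X) = \frac{p(X \mid Y)\,p(Y)}{p(X)},
  \qquad p(X) = \sum_{Y} p(X \mid Y)\,p(Y)
```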
12
Figure: joint, marginal and conditional distributions of two variables, illustrated with a histogram of N = 60 data points.
13
1.2.1 Probability Densities
Probabilities with respect to continuous variables
Probability density p(x) over x
Sum rule and product rule for densities
Cumulative distribution function
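The continuous-variable counterparts from the book, with integrals replacing sums:

```latex
p\big(x \in (a, b)\big) = \int_{a}^{b} p(x)\,\mathrm{d}x,
  \qquad p(x) \ge 0, \qquad \int_{-\infty}^{\infty} p(x)\,\mathrm{d}x = 1

\text{sum rule:}\qquad p(x) = \int p(x, y)\,\mathrm{d}y

\text{product rule:}\qquad p(x, y) = p(y \mid x)\,p(x)

\text{cumulative distribution:}\qquad P(z) = \int_{-\infty}^{z} p(x)\,\mathrm{d}x
```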
14
1.2.2 Expectations and Covariances
Expectation of f(x): the average value of f(x) under a probability distribution p(x)
Conditional expectation
Variance: a measure of how much variability there is in f(x) around its mean
Covariance: the extent to which x and y vary together
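The definitions behind these bullets, as given in the book:

```latex
\mathbb{E}[f] = \sum_{x} p(x)\,f(x) \quad\text{(discrete)},
  \qquad
\mathbb{E}[f] = \int p(x)\,f(x)\,\mathrm{d}x \quad\text{(continuous)}

\operatorname{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^{2}\big]
  = \mathbb{E}[f(x)^{2}] - \mathbb{E}[f(x)]^{2}

\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big]
  = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\,\mathbb{E}[y]
```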
15
1.2.3 Bayesian Probabilities – Frequentist vs. Bayesian
The likelihood p(D|w) plays a central role in both viewpoints, but is interpreted differently.
Frequentist
- w is considered a fixed parameter determined by an 'estimator'
- Maximum likelihood: the error function is the negative log likelihood
- Error bars: obtained from the distribution of possible data sets (e.g. by bootstrap)
Bayesian
- There is a single data set (the one that is actually observed)
- The uncertainty in the parameters is expressed as a probability distribution over w, via Bayes' theorem
- Advantage: the inclusion of prior knowledge arises naturally
- Leads to less extreme conclusions by incorporating the prior
- Non-informative priors
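Bayes' theorem applied to the parameters w given the observed data D, as used on this slide:

```latex
p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\,p(\mathbf{w})}{p(\mathcal{D})},
  \qquad \text{posterior} \propto \text{likelihood} \times \text{prior},
  \qquad p(\mathcal{D}) = \int p(\mathcal{D} \mid \mathbf{w})\,p(\mathbf{w})\,\mathrm{d}\mathbf{w}
```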
16
1.2.3 Bayesian Probabilities – Expansion of Bayesian Application
Full Bayesian procedures long had limited practical application, even though they have their origins in the 18th century
- They require marginalizing over the whole of parameter space
Markov chain Monte Carlo (MCMC) sampling methods
- Computationally intensive; used for small-scale problems
Highly efficient deterministic approximation schemes
- e.g. variational Bayes, expectation propagation
- An alternative to sampling methods
- Have allowed Bayesian techniques to be used in large-scale problems
17
1.2.4 Gaussian Distribution
Gaussian distribution for a single real-valued variable x
- Parameters: mean, variance (standard deviation), precision
D-dimensional multivariate Gaussian distribution for a D-dimensional vector x of continuous variables
- Parameters: mean vector and covariance matrix (whose determinant appears in the normalization)
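The univariate and multivariate Gaussian densities referred to here (the precision is β = 1/σ²):

```latex
\mathcal{N}(x \mid \mu, \sigma^{2}) = \frac{1}{(2\pi\sigma^{2})^{1/2}}
  \exp\Big\{ -\frac{1}{2\sigma^{2}} (x - \mu)^{2} \Big\}

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) =
  \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}
  \exp\Big\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf T}
  \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big\}
```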
18
1.2.4 Gaussian Distribution – Example (1/2)
Estimating the unknown parameters from observed data
- Data points are assumed to be i.i.d.
Maximizing the likelihood with respect to the mean gives the sample mean
Maximizing with respect to the variance gives the sample variance
- The two can be evaluated sequentially: first the mean, then the variance using it
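The resulting maximum likelihood estimators (sample mean and sample variance):

```latex
\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n,
  \qquad
\sigma^{2}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^{2}
```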
19
1.2.4 Gaussian Distribution – Example (2/2)
Bias: a limitation of the maximum likelihood approach, related to over-fitting
Taking expectations over data sets of size N drawn from the true distribution:
- the ML estimate of the mean is correct on average
- the ML estimate of the variance systematically underestimates the true variance
(Figure: sample mean vs. true mean, illustrated for data sets of size N)
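The expectations the slide summarizes, together with the unbiased variance estimator obtained by rescaling:

```latex
\mathbb{E}[\mu_{\mathrm{ML}}] = \mu,
  \qquad
\mathbb{E}[\sigma^{2}_{\mathrm{ML}}] = \Big(\frac{N-1}{N}\Big)\sigma^{2},
  \qquad
\widetilde{\sigma}^{2} = \frac{N}{N-1}\,\sigma^{2}_{\mathrm{ML}} \;\;\text{(unbiased)}
```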
20
1.2.5 Curve Fitting Re-visited (1/2)
Goal in the curve fitting problem: prediction of the target variable t given some new input variable x
Assume a Gaussian distribution for t, with mean y(x, w) and precision β
Determine the unknown w and β by maximum likelihood using the training data {x, t}
Likelihood: for data drawn i.i.d. from this distribution, the likelihood is a product over the data points; take its log for maximization
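The Gaussian noise model, the i.i.d. likelihood, and its log form:

```latex
p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}),\, \beta^{-1}\big)

p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) =
  \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1}\big)

\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) =
  -\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^{2}
  + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)
```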
21
1.2.5 Curve Fitting Re-visited (2/2)
Maximizing the likelihood with respect to w = minimizing the sum-of-squares error function (via the negative log likelihood)
The precision β is then also determined by maximum likelihood
Predictive distribution: predictions for new values of x, using the fitted w and β
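The maximum likelihood precision and the resulting predictive distribution:

```latex
\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N}
  \{y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n\}^{2}

p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}) =
  \mathcal{N}\big(t \mid y(x, \mathbf{w}_{\mathrm{ML}}),\, \beta_{\mathrm{ML}}^{-1}\big)
```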
22
Maximum Posterior (MAP)
Introduce a prior (a Gaussian distribution over w) with hyper-parameter α
Determine w as its most probable value given the data, i.e. by maximizing the posterior
Taking the negative logarithm and combining the previous terms, the maximum of the posterior is given by the minimum of a regularized error
Maximizing the posterior distribution = minimizing the regularized sum-of-squares error function (1.4), with regularization parameter λ = α/β
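The Gaussian prior, the posterior it induces, and the MAP objective (equivalent to the regularized error (1.4) with λ = α/β):

```latex
p(\mathbf{w} \mid \alpha) = \mathcal{N}\big(\mathbf{w} \mid \mathbf{0},\, \alpha^{-1}\mathbf{I}\big)

p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto
  p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)

\text{MAP:}\quad \min_{\mathbf{w}}\;
  \frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^{2}
  + \frac{\alpha}{2} \mathbf{w}^{\mathsf T} \mathbf{w}
```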
23
1.2.6 Bayesian Curve Fitting
A fully Bayesian treatment marginalizes over w: the predictive distribution is obtained by integrating over all values of w, weighted by their posterior probability
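The marginalization the slide refers to; in the book it yields a Gaussian predictive distribution with input-dependent mean and variance (φ(x) denotes the vector of polynomial basis functions):

```latex
p(t \mid x, \mathbf{x}, \mathbf{t}) =
  \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, \mathrm{d}\mathbf{w}
  = \mathcal{N}\big(t \mid m(x),\, s^{2}(x)\big)

m(x) = \beta\, \boldsymbol{\phi}(x)^{\mathsf T} \mathbf{S}
  \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, t_n,
  \qquad
s^{2}(x) = \beta^{-1} + \boldsymbol{\phi}(x)^{\mathsf T} \mathbf{S}\, \boldsymbol{\phi}(x)

\mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^{N}
  \boldsymbol{\phi}(x_n)\, \boldsymbol{\phi}(x_n)^{\mathsf T}
```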
24
1.3 Model Selection
Proper model complexity → good generalization and the best model
Measuring the generalization performance
- If data are plentiful, divide them into training, validation and test sets
- Otherwise, cross-validate; the leave-one-out technique is the extreme case (see the sketch after this list)
Drawbacks of cross-validation
- Expensive computation: the number of training runs grows with the number of folds
- A model with multiple complexity parameters may require exploring many combinations of settings
Alternative measures of performance, e.g. Akaike information criterion (AIC), Bayesian information criterion (BIC)
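Not from the slides: a minimal sketch of choosing the polynomial order M by S-fold cross-validation, reusing the synthetic sin(2πx) data assumed in the earlier fitting sketch (the data generation and the range of M tried are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, as in the earlier polynomial-fitting sketch (assumed setup)
N, S = 30, 5                        # data points, number of folds
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

folds = np.array_split(rng.permutation(N), S)

def cv_error(M):
    """Average held-out sum-of-squares error for polynomial order M."""
    errs = []
    for k in range(S):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(S) if j != k])
        Phi_trn = np.vander(x[trn], M + 1, increasing=True)
        Phi_val = np.vander(x[val], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_trn, t[trn], rcond=None)
        errs.append(0.5 * np.sum((Phi_val @ w - t[val]) ** 2))
    return np.mean(errs)

best_M = min(range(10), key=cv_error)   # order with the lowest cross-validation error
print(best_M)
```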