Basic Technical Concepts in Machine Learning
Introduction
Supervised learning
Problems in supervised learning
Bayesian decision theory

Ch1. Pattern Recognition & Machine Learning
Pattern recognition is concerned with automatically discovering regularities in data and, based on those regularities, classifying the data into categories.

Pattern Recognition & Machine Learning
Supervised learning: classification, regression
Unsupervised learning: clustering, density estimation, dimensionality reduction

Ch1. Pattern Recognition & Machine Learning Introduction Supervised learning Problems in supervised learning Bayesian decision theory

Supervised Learning
Training set X = {x1, x2, …, xN} with a class or target vector t = {t1, t2, …, tN}.
Goal: find a function y(x) that takes an input vector x and outputs a class label t.
Training data {(xn, tn)} → learned function y(x)

Supervised Learning
Medical example: X = {patient1, patient2, …, patientN}
patient1 = (high pressure, normal temp., high glucose, …), t1 = cancer
patient2 = (low pressure, normal temp., normal glucose, …), t2 = not cancer
new patient = (low pressure, low temp., normal glucose, …), t = ?
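A minimal sketch of this setup (not a method from the lecture): the toy classifier below encodes the hypothetical patients as numeric feature vectors and uses a simple 1-nearest-neighbour rule to stand in for the learned function y(x). All feature values and units are made up for illustration.

```python
# Toy supervised-learning example (hypothetical data, 1-nearest-neighbour stand-in for y(x)).
import numpy as np

# Training set X = {x1, ..., xN} with targets t = {t1, ..., tN}.
# Features: [blood pressure, temperature, glucose] in made-up units.
X = np.array([
    [160.0, 37.0, 180.0],   # patient1: high pressure, normal temp., high glucose
    [100.0, 37.0, 90.0],    # patient2: low pressure, normal temp., normal glucose
])
t = np.array(["cancer", "not cancer"])

def y(x_new):
    """Predict the class of x_new as the class of its nearest training example."""
    distances = np.linalg.norm(X - x_new, axis=1)
    return t[np.argmin(distances)]

new_patient = np.array([105.0, 35.5, 95.0])  # low pressure, low temp., normal glucose
print(y(new_patient))                        # -> not cancer
```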

Ch1. Pattern Recognition & Machine Learning Introduction Supervised learning Problems in supervised learning Bayesian decision theory

Pattern Recognition & Machine Learning
Two problems in supervised learning:
1. Overfitting and underfitting
2. The curse of dimensionality

Pattern Recognition & Machine Learning
Example: polynomial curve fitting (regression)

Pattern Recognition & Machine Learning
Fit the curve by minimizing the error between the model's predictions and the training targets.

Pattern Recognition & Machine Learning
Take the model to be a polynomial (linear in the weights w): y(x, w) = w0 + w1 x + w2 x^2 + … + wM x^M
Minimize the sum-of-squares error: E(w) = ½ Σn {y(xn, w) – tn}^2
What happens as we vary the order M?

Underfitting

Overfitting

Pattern Recognition & Machine Learning
Root-Mean-Square (RMS) error: E_RMS = sqrt(2 E(w*) / N), which makes errors comparable across data sets of different sizes.
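To see what happens as M varies, here is a minimal sketch that fits polynomials of increasing order to a small synthetic data set and reports the training and test RMS errors. The noisy-sinusoid data follow Bishop's running example; the sample size, noise level, and use of NumPy's polyfit are assumptions for illustration.

```python
# Polynomial curve fitting for several orders M on a hypothetical noisy sinusoid.
import numpy as np

rng = np.random.default_rng(0)
N = 10
x_train = np.linspace(0.0, 1.0, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=N)
x_test = np.linspace(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=100)

def rms(w, xs, ts):
    """E_RMS = sqrt(2 E(w*) / N), i.e. the root of the mean squared residual."""
    return np.sqrt(np.mean((np.polyval(w, xs) - ts) ** 2))

for M in (0, 1, 3, 9):
    w = np.polyfit(x_train, t_train, deg=M)   # least-squares fit of an order-M polynomial
    print(f"M={M}: train RMS={rms(w, x_train, t_train):.3f}, "
          f"test RMS={rms(w, x_test, t_test):.3f}")

# Small M underfits (both errors stay high); M = 9 drives the training error
# towards zero while the test error grows, i.e. the model overfits.
```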

Regularization
One solution: regularization. Penalize models that are too complex by adding a term to the error function:
E(w) = ½ Σn {y(xn, w) – tn}^2 + (λ/2) ||w||^2
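A minimal sketch of the regularized fit, assuming the same hypothetical noisy-sinusoid data as above. It solves the ridge problem as an augmented least-squares system (equivalent to minimizing E(w) with the λ/2 penalty); the value of λ is chosen arbitrarily.

```python
# Regularized (ridge) polynomial fit with M = 9 on the same hypothetical data.
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

M = 9
Phi = np.vander(x, M + 1, increasing=True)   # design matrix with columns 1, x, ..., x^M

for lam in (0.0, 1e-3):
    # Minimizing E(w) = 1/2 sum_n {y(xn,w) - tn}^2 + (lam/2)||w||^2 is equivalent to
    # ordinary least squares on the augmented system [Phi; sqrt(lam) I] w = [t; 0].
    A = np.vstack([Phi, np.sqrt(lam) * np.eye(M + 1)])
    b = np.concatenate([t, np.zeros(M + 1)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(f"lambda = {lam:g}: ||w|| = {np.linalg.norm(w):.2f}")

# With lambda = 0 the order-9 fit interpolates the data and the weights become huge;
# even a small penalty keeps them moderate and smooths the fitted curve.
```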

Pattern Recognition & Machine Learning
Two problems in supervised learning:
1. Overfitting and underfitting
2. The curse of dimensionality

Curse of Dimensionality
Example: classify a vector x into one of 3 classes:

Curse of Dimensionality
Solution: divide the input space into cells:

Curse of Dimensionality
But if the number of dimensions is high, the number of cells grows exponentially, so we would need an enormous amount of data just to populate the "neighborhood" of x.
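A minimal illustration of this blow-up, assuming each feature axis is split into a fixed number of bins (10 here, an arbitrary choice):

```python
# Number of cells needed when each of D feature axes is split into 10 bins.
bins_per_dimension = 10
for D in (1, 2, 3, 10, 20):
    n_cells = bins_per_dimension ** D
    print(f"D = {D:2d}: {n_cells:.2e} cells")

# The cell count (and hence the data needed to populate the cells) grows
# exponentially with the number of dimensions D.
```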

Introduction Supervised learning Problems in supervised learning Bayesian decision theory

Decision Theory
State of nature. Let C denote the state of nature; C is a random variable (e.g., C = C1 for sea bass, C = C2 for salmon).
A priori probabilities. Let P(C1) and P(C2) denote the a priori probabilities of C1 and C2 respectively; we know P(C1) + P(C2) = 1.
Decision rule. Decide C1 if P(C1) > P(C2); otherwise decide C2.

Basic Concepts
Class-conditional probability density function. Let x be a continuous random variable; p(x|C) is the probability density of x given the state of nature C.
For example, what is the density of lightness given that the class is salmon, p(lightness | salmon)? Or given sea bass, p(lightness | sea bass)?

Figure 2.1

Bayes Formula
How do we combine the a priori and class-conditional probabilities to obtain the probability of a state of nature given the observation?
Bayes formula: P(Cj | x) = p(x | Cj) P(Cj) / p(x)
posterior = (likelihood × prior) / evidence
Bayes decision: choose C1 if P(C1 | x) > P(C2 | x); otherwise choose C2.
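A minimal sketch of the Bayes decision rule on the sea bass / salmon example, assuming purely for illustration that the class-conditional densities of lightness are Gaussian; the priors and density parameters below are made up.

```python
# Bayes decision for the two-class fish example with made-up Gaussian likelihoods.
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

priors = {"sea bass": 0.6, "salmon": 0.4}                  # P(Cj), assumed
params = {"sea bass": (8.0, 1.5), "salmon": (11.0, 1.0)}   # (mu, sigma) of lightness, assumed

def posterior(c, x):
    likelihood = normal_pdf(x, *params[c])                                 # p(x | Cj)
    evidence = sum(normal_pdf(x, *params[k]) * priors[k] for k in priors)  # p(x)
    return likelihood * priors[c] / evidence                               # P(Cj | x)

x = 10.0   # observed lightness
decision = max(priors, key=lambda c: posterior(c, x))
print({c: round(posterior(c, x), 3) for c in priors}, "->", decision)
```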

Figure 2.2

Minimizing Error
What is the probability of error?
P(error | x) = P(C1 | x) if we decide C2, and P(C2 | x) if we decide C1.
Does the Bayes rule minimize the probability of error?
P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx, and because the Bayes rule makes P(error | x) as small as possible for every x, the integral (and hence the overall error) is minimized. The answer is "yes".

Simplifying Bayes Rule
Bayes formula: P(Cj | x) = p(x | Cj) P(Cj) / p(x)
The evidence p(x) is the same for all classes, so we can drop it when making a decision.
Rule: choose C1 if p(x | C1) P(C1) > p(x | C2) P(C2); otherwise decide C2.
If p(x | C1) = p(x | C2), the decision depends only on the priors; if P(C1) = P(C2), it depends only on the likelihoods.

Loss Function
Let {C1, C2, …, Cc} be the possible states of nature and {a1, a2, …, ak} the possible actions.
The loss function λ(ai | Cj) is the loss incurred for taking action ai when the state of nature is Cj.
Expected (conditional) loss: R(ai | x) = Σj λ(ai | Cj) P(Cj | x)
Decision: select the action that minimizes the conditional risk; this gives the best achievable performance (the resulting minimum overall risk is called the Bayes risk).
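A minimal sketch of minimum-risk classification, assuming a hypothetical two-class problem in which deciding C2 when the truth is C1 (say, missing a cancer) is ten times as costly as the opposite error; the posteriors and loss values are made up.

```python
# Minimum conditional-risk decision with an asymmetric (made-up) loss matrix.
import numpy as np

posteriors = np.array([0.3, 0.7])   # P(C1|x), P(C2|x) for some observed x

# loss[i, j] = lambda(ai | Cj): loss of taking action ai when the true class is Cj.
loss = np.array([
    [0.0, 1.0],    # a1 = "decide C1": cheap false alarm if the truth is C2
    [10.0, 0.0],   # a2 = "decide C2": very costly if the truth is C1
])

risk = loss @ posteriors             # R(ai | x) = sum_j lambda(ai | Cj) P(Cj | x)
best = np.argmin(risk)               # action with minimum conditional risk
print(risk, "-> choose action", best + 1)

# With this asymmetric loss we decide C1 even though P(C2|x) > P(C1|x).
# With the zero-one loss of the next slide, the rule reduces to picking
# the class with the largest posterior.
```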

Zero-One Loss
We will normally be concerned with the symmetric, or zero-one, loss function:
λ(ai | Cj) = 0 if i = j, 1 if i ≠ j
In this case the conditional risk is:
R(ai | x) = Σj λ(ai | Cj) P(Cj | x) = Σ_{j≠i} P(Cj | x) = 1 – P(Ci | x)
so minimizing the risk amounts to choosing the class with the largest posterior probability P(Ci | x).