LEARNING FROM EXAMPLES AIMA Chapter 18 (4-5) CSE 537 Spring 2014 Instructor: Sael Lee. Slides are mostly made from AIMA resources, Andrew W. Moore's tutorials, and Bart Selman's Cornell CS4700 decision tree slides.
AIMA Chapter 18 (4) EVALUATING AND CHOOSING THE BEST HYPOTHESIS
CHOOSING BETWEEN HYPOTHESES There can be multiple hypotheses consistent with the training data.
OCKHAM'S RAZOR Choosing between consistent hypotheses: given two models of similar generalization error, one should prefer the simpler model over the more complex model. In general there is a trade-off between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better. (Figure: training error keeps decreasing with model complexity while the generalization/overfitting error eventually increases.)
EVALUATING AND CHOOSING THE BEST HYPOTHESIS Goal: learn a hypothesis that fits the future data best. How do we define "future data" and "best"? "Future data": stationarity assumption: there is a probability distribution over examples that remains stationary over time, and examples are selected independent and identically distributed (i.i.d.): P(Ej | Ej-1, Ej-2, …) = P(Ej) (independent); P(Ej) = P(Ej-1) = P(Ej-2) = … (identically distributed). Under this assumption we can hold out past data and use it in place of future data for testing. "Best fit": error rate, the proportion of mistakes the hypothesis makes.
MODEL SELECTION The task of finding the best hypothesis has two parts. Model selection: choosing the hypothesis space, e.g. choosing the degree of the polynomial (1st-order vs. 3rd-order vs. 9th-order fits of the same data). Optimization: finding the best hypothesis within that space, e.g. choosing the slopes (parameters) of the polynomial. The data is split into a training set (e.g. 80%), a validation set (e.g. 10%) used for model selection, and a held-out test set for the final evaluation; a sketch of degree selection with a validation set follows below.
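A minimal sketch of model selection by validation error, not taken from the slides: the dataset, split fractions, and candidate degrees are illustrative assumptions, and numpy's polyfit/polyval stand in for "optimization within the hypothesis space".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D dataset: noisy samples of an underlying curve.
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)

# Split: 80% training, 10% validation (a final 10% would be kept as the test set).
idx = rng.permutation(len(x))
train, val = idx[:80], idx[80:90]

best_degree, best_err = None, float("inf")
for degree in (1, 3, 9):                              # candidate hypothesis spaces
    w = np.polyfit(x[train], y[train], degree)        # optimization within the space
    val_err = np.mean((np.polyval(w, x[val]) - y[val]) ** 2)
    if val_err < best_err:
        best_degree, best_err = degree, val_err

print("degree chosen by validation error:", best_degree)
```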
CROSS-VALIDATION k-fold cross-validation: split the data into k folds; in turn, each fold serves as the held-out test set while the remaining k-1 folds are used for training; average the results over the k runs. Leave-one-out cross-validation is the special case k = N.
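A minimal k-fold cross-validation sketch (assumed, not from the slides); fit and error are placeholders for whatever learner and loss are being evaluated.

```python
import numpy as np

def k_fold_cv(x, y, k, fit, error):
    """Average held-out error over k folds.  fit(x, y) returns a model;
    error(model, x, y) returns that model's error on the given data."""
    folds = np.array_split(np.random.permutation(len(x)), k)
    errs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train_idx], y[train_idx])
        errs.append(error(model, x[test_idx], y[test_idx]))
    return np.mean(errs)   # leave-one-out CV is the special case k == len(x)
```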
AIMA Ch. 18.4 SIMPLE MODEL SELECTION ALGORITHM: grow the model size (complexity) step by step, estimate the error at each size with cross-validation, and return the size with the lowest validation error.
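A rough sketch in the spirit of AIMA's model-selection wrapper; the interfaces (learner, cv_error) and the max_size cutoff are my assumptions, and cv_error can be the k_fold_cv sketch above.

```python
def simple_model_selection(learner, x, y, cv_error, max_size=20):
    """learner(size) returns (fit, error) functions for a hypothesis space of
    the given size (e.g. polynomial degree); cv_error(x, y, fit, error) returns
    a cross-validated error estimate for that space."""
    best_size, best_err = None, float("inf")
    for size in range(1, max_size + 1):            # grow model complexity
        fit, error = learner(size)
        err = cv_error(x, y, fit, error)
        if err < best_err:
            best_size, best_err = size, err
    fit, _ = learner(best_size)
    return fit(x, y)                               # retrain the winner on all the data
```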
FROM ERROR RATE TO LOSS Error rate: Count(y ≠ ŷ) / N. Loss function: L(x, y, ŷ) = Utility(result of using y given an input x) − Utility(result of using ŷ given input x).
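A small sketch of the usual simplified losses (absolute-value, squared-error, 0/1) and the error rate; the function names are mine.

```python
import numpy as np

def l1_loss(y, yhat):
    return np.abs(y - yhat)            # absolute-value loss

def l2_loss(y, yhat):
    return (y - yhat) ** 2             # squared-error loss

def zero_one_loss(y, yhat):
    return (y != yhat).astype(float)   # 1 for a mistake, 0 otherwise

def error_rate(y, yhat):
    return np.mean(y != yhat)          # Count(y != yhat) / N
```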
REGULARIZATION Doing model selection and optimization at once: search for a hypothesis that directly minimizes the weighted sum of the empirical loss and the complexity of the hypothesis (written out below).
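The regularized objective, following AIMA's formulation; λ is the weight that trades loss against complexity.

```latex
\mathrm{Cost}(h) = \mathrm{EmpLoss}(h) + \lambda \, \mathrm{Complexity}(h),
\qquad
\hat{h}^{*} = \operatorname*{argmin}_{h \in \mathcal{H}} \mathrm{Cost}(h)
```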
AIMA Chapter 18 (6) PARAMETRIC LEARNING - REGRESSION & CLASSIFICATION
UNIVARIATE LINEAR REGRESSION Linear regression assumes that the expected value of the output given an input, E[y|x], is linear. Goal of linear regression: find the best-fit hypothesis h_w(x) = w1 x + w0 that minimizes the loss function, e.g. the squared loss Loss(h_w) = Σj (yj − h_w(xj))²; a closed-form sketch follows below.
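A minimal closed-form sketch for the univariate case, using the standard least-squares formulas; the function and variable names are mine.

```python
import numpy as np

def univariate_linear_regression(x, y):
    """Closed-form least squares: return (w0, w1) minimizing
    sum_j (y_j - (w1 * x_j + w0))**2."""
    n = len(x)
    w1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
    w0 = (np.sum(y) - w1 * np.sum(x)) / n
    return w0, w1
```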
UNIVARIATE LINEAR REGRESSION IN A PROBABILISTIC MODEL Assume that the data is formed by y = wx + noise, where the noise signals are independent and the noise has a normal distribution with mean 0 and unknown variance σ². Then P(y|w,x) has a normal distribution with mean wx and variance σ²: P(y|w,x) = Normal(μ = wx, σ²).
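Why the probabilistic view matters: maximizing the likelihood under this Gaussian noise model is the same as minimizing the squared error. The standard derivation, spelled out for reference:

```latex
\log \prod_{j=1}^{n} P(y_j \mid w, x_j)
  = \sum_{j=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left(-\frac{(y_j - w x_j)^2}{2\sigma^2}\right)
  = \text{const} - \frac{1}{2\sigma^2} \sum_{j=1}^{n} (y_j - w x_j)^2
```

So argmax over w of the likelihood equals argmin over w of the sum of squared errors.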
BAYESIAN LINEAR REGRESSION Model: y = wx + Normal(0, σ²), i.e. P(y|w,x) = Normal(μ = wx, σ²). We have a set of datapoints (x1,y1), (x2,y2), …, (xn,yn) which are EVIDENCE about w. We want to infer w from the data: P(w | x1,…,xn, y1,…,yn). You can use Bayes' rule to work out a posterior distribution for w given the data (with a Gaussian prior on w, the posterior is again Gaussian), or you could do Maximum Likelihood Estimation.
BEYOND LINEAR MODELS Beyond linear models, analytical solutions generally do not exist, so an alternative method is required for the Maximum Likelihood Estimate: the gradient descent method (*note: a form of hill climbing).
w ← any point in the parameter space
loop until convergence do
  for each wi in w do
    wi ← wi − α ∂Loss(w)/∂wi
α: step size (learning rate)
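A minimal batch gradient-descent sketch for the univariate linear model; the learning rate and iteration count are illustrative assumptions, and constant factors from the derivative are folded into alpha.

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, iters=5000):
    """Batch gradient descent on the mean squared error of h_w(x) = w1*x + w0."""
    w0, w1 = 0.0, 0.0
    for _ in range(iters):
        err = y - (w1 * x + w0)          # residuals on the whole training set
        w0 += alpha * np.mean(err)       # w0 <- w0 - alpha * dLoss/dw0
        w1 += alpha * np.mean(err * x)   # w1 <- w1 - alpha * dLoss/dw1
    return w0, w1
```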
TWO TYPES OF GRADIENT DESCENT
Batch gradient descent: minimize the sum of the individual losses over all examples at each step. Convergence to the unique global minimum is guaranteed (in the linear case) as long as the learning rate α is selected small enough, but it can be slow to converge.
Stochastic gradient descent: minimize the individual losses one example at a time. A fixed learning rate does not guarantee convergence (a scheduled, decreasing learning rate does), but convergence is typically faster than batch gradient descent.
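A stochastic variant of the sketch above; the decaying step-size schedule alpha0 / (1 + t) is an illustrative assumption.

```python
import numpy as np

def stochastic_gradient_descent(x, y, alpha0=0.1, epochs=50):
    """Stochastic gradient descent: one example per update, with a scheduled,
    decreasing learning rate so the iterates converge."""
    w0, w1, t = 0.0, 0.0, 0
    for _ in range(epochs):
        for j in np.random.permutation(len(x)):
            alpha = alpha0 / (1 + t)
            err = y[j] - (w1 * x[j] + w0)
            w0 += alpha * err
            w1 += alpha * err * x[j]
            t += 1
    return w0, w1
```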
MULTIVARIATE LINEAR REGRESSION Parameter optimization in multivariate regression: the hypothesis is h_w(x) = w · x = Σi wi xi, and the best weight vector minimizes the squared loss over the data; with the examples collected in a data matrix X, the minimizer has the closed form w* = (XᵀX)⁻¹ Xᵀ y (see the sketch below).
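A minimal normal-equations sketch, assuming numpy and a well-conditioned design matrix.

```python
import numpy as np

def multivariate_linear_regression(X, y):
    """Solve the normal equations  (X^T X) w = X^T y  for w.
    X is an (n_examples, n_features) matrix; a column of ones is prepended
    so that w[0] plays the role of the intercept."""
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(X.T @ X, X.T @ y)
```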
MULTIVARIATE LINEAR REGRESSION CONT. L1 regularization vs. L2 regularization. Optimization with regularization: find the point in the constrained (regularized) region closest to the minimum of the unregularized loss (the center of the loss contours) -> L1 regularization tends to drive some weights exactly to zero, giving sparse models.
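The two penalties written out; λ is the regularization weight.

```latex
\text{L1 (lasso):}\quad \mathrm{Cost}(\mathbf{w}) = \mathrm{Loss}(\mathbf{w}) + \lambda \sum_i |w_i|
\qquad\qquad
\text{L2 (ridge):}\quad \mathrm{Cost}(\mathbf{w}) = \mathrm{Loss}(\mathbf{w}) + \lambda \sum_i w_i^2
```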
LINEAR CLASSIFICATION WITH A HARD THRESHOLD Linear functions can be used for classification as well as regression: classify as 1 if w · x ≥ 0 and 0 otherwise, i.e. h_w(x) = Threshold(w · x). The decision boundary is the hyperplane w · x = 0, and learning the decision boundary means learning w.
LINEAR CLASSIFICATION WITH A HARD THRESHOLD CONT.
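The hard-threshold classifier is typically trained with the perceptron learning rule, wi ← wi + α (y − h_w(x)) xi. A minimal sketch (the slide content here is not recoverable, so the details below are my assumptions), with 0/1 labels and a bias column:

```python
import numpy as np

def perceptron_fit(X, y, alpha=0.1, epochs=100):
    """Perceptron learning rule for the hard-threshold classifier
    h_w(x) = 1 if w.x >= 0 else 0:  w_i <- w_i + alpha * (y - h_w(x)) * x_i."""
    X = np.column_stack([np.ones(len(X)), X])   # bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xj, yj in zip(X, y):
            pred = 1.0 if xj @ w >= 0 else 0.0
            w += alpha * (yj - pred) * xj
    return w
```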
LOGISTIC REGRESSION: CLASSIFICATION Replace the hard threshold with the logistic (sigmoid) function, h_w(x) = 1 / (1 + e^(−w·x)). Unlike linear regression, it is not possible to find a closed-form solution for the coefficient values that maximize the likelihood function, so an iterative process must be used: we will use gradient descent again.
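A minimal sketch of the iterative fit; this version uses the gradient of the log-likelihood (cross-entropy) rather than the squared-loss derivative, and the learning rate and iteration count are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_fit(X, y, alpha=0.1, iters=2000):
    """Gradient ascent on the log-likelihood of a 0/1 label vector y:
    w <- w + alpha * X^T (y - sigmoid(X w)) / N."""
    X = np.column_stack([np.ones(len(X)), X])   # bias column
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w += alpha * X.T @ (y - sigmoid(X @ w)) / len(y)
    return w
```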
NAÏVE BAYES NETWORK FOR CLASSIFICATION Classification examples: spam filtering, handwritten digit recognition.
NAÏVE BAYES FOR SPAM FILTERING Network: a class node Y (Spam?) with one child fi per word in a dictionary of size n (f1: word1, f2: word2, …, fn: wordn); the parameters are the prior P(Y) and the conditional tables P(Fi|Y). Naïve Bayes assumption: the evidence variables are conditionally independent given the class, so P(Y, f1, f2, …, fn) = P(Y) ∏i P(fi | Y). We only specify how each feature depends on the class, so the total number of parameters is linear in n.
INFERENCE IN NAÏVE BAYES Goal: compute the posterior over causes, P(Y | f1, f2, …, fn). Step 1: get the joint probability of causes and evidence, P(Y, f1, …, fn). Step 2: get the probability of the evidence, P(f1, …, fn). Step 3: renormalize: P(Y | f1, f2, …, fn) = α P(Y, f1, f2, …, fn) = α P(Y) ∏i P(fi | Y), where α = 1 / P(f1, …, fn). (Model figure: class node Y: Spam? with word features whose parameters are estimated as relative frequencies #Wordi / M.)
Assumptions: bag of words (count word frequencies over a fixed dictionary). What do we need in order to use naïve Bayes? Estimates of the local conditional probability tables: P(Y), the prior over labels, and P(Fi|Y) for each feature (evidence variable). These probabilities are collectively called the parameters of the model and are denoted by θ. Learning means estimating the parameters P(Y), P(Fi|Y), and P(Fi|¬Y) from data (see the sketch below).
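A minimal sketch of parameter learning and inference for the spam filter, assuming bag-of-words count features and add-one (Laplace) smoothing as an extra detail not on the slides.

```python
import numpy as np

def train_naive_bayes(X, y):
    """X: (n_docs, n_words) bag-of-words count matrix, y: 0/1 spam labels.
    Returns log P(Y) and log P(word | Y), with add-one (Laplace) smoothing."""
    log_prior = np.log(np.array([np.mean(y == c) for c in (0, 1)]))
    log_cond = []
    for c in (0, 1):
        counts = X[y == c].sum(axis=0) + 1      # add-one smoothing
        log_cond.append(np.log(counts / counts.sum()))
    return log_prior, np.array(log_cond)

def predict_naive_bayes(log_prior, log_cond, x):
    """Posterior P(Y | f_1..f_n): joint in log space (Step 1), then renormalize (Step 3)."""
    scores = log_prior + log_cond @ x           # log P(Y) + sum_i x_i * log P(word_i | Y)
    post = np.exp(scores - scores.max())
    return post / post.sum()                    # the alpha renormalization step
```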
EXAMPLE: SPAM FILTERING CONT.