Linear Regression & Classification Prof. Navneet Goyal CS & IS BITS, Pilani
Fundamentals of Modeling Abstract representation of a real-world process. Y = 3X + 2 is a very simple model of how variable Y might relate to variable X; it is an instance of a more general model structure Y = aX + b, where a and b are parameters. θ is generally used to denote a generic parameter or a set (or vector) of parameters, e.g. θ = {a, b}. Values of parameters are chosen by estimation, that is, by minimizing or maximizing an appropriate score function measuring the fit of the model to the data. Before we can choose the parameters, we must choose an appropriate functional form of the model itself.
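To make "estimation by minimizing a score function" concrete for Y = aX + b: the score function below is an assumption (the standard least-squares choice), not something stated on the slide.

```latex
% A minimal sketch: least-squares estimation of theta = {a, b}
% from n observed pairs (x_i, y_i).
\[
S(a, b) = \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2,
\qquad
\hat{\theta} = \{\hat{a}, \hat{b}\} = \arg\min_{a,\,b} S(a, b)
\]
```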
Fundamentals of Modeling Predictive modeling PM can be thought of as learning a mapping from an input set of vector measurements x to a scalar output y. Vector output is also possible but rarely used in practice. One of the variables is expressed as a function of the others (the predictor variables). Response variable: Y; predictor variables: Xi. ŷ = f(x1, x2, …, xp; θ). When Y is quantitative, this task of estimating a mapping from the p-dimensional X to Y is called regression. When Y is categorical, the task of learning a mapping from X to Y is called classification learning or supervised classification.
Predictive Modeling Predictive modeling Predicts the value of some target characteristic of an object on the basis of observed values of other characteristics of the object Examples: Regression (Prediction in DM) & Classification
Predictive Modeling Prediction: linear regression, nonlinear regression. Classification (supervised learning): decision trees, k-NN, SVM, ANN.
Definition of Regression Regression is a (statistical) methodology that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others. Examples: Sales of a product can be predicted by using the relationship between sales volume and amount of advertising The performance of an employee can be predicted by using the relationship between performance and aptitude tests The size of a child’s vocabulary can be predicted by using the relationship between the vocabulary size, the child’s age and the parents’ educational input.
Regression Problem Visualisation [Figure: data points (x, y) with fitted values ŷ; RMSE annotated]
Structure of a Linear Regression Model Given a set of features x, a linear predictor has the form shown below. The output is a real-valued, quantitative variable.
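The slide shows the predictor only as an image; a standard form of a linear predictor over a feature vector x = (x1, …, xp) is:

```latex
% Linear predictor: weighted sum of the features plus an intercept w_0
\[
\hat{y}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p
= w_0 + \mathbf{w}^{\mathsf{T}} \mathbf{x}
\]
```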
Classification Problem Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C, where each ti is assigned to one class. Prediction is similar, but may be viewed as having an infinite number of classes.
Classification Classification is the task of assigning an object, described by a feature vector, to one of a set of mutually exclusive groups A linear classifier has a linear decision boundary The perceptron training algorithm is guaranteed to converge in a finite time when the data set is linearly separable
What is Classification? Classification is also known as (statistical) pattern recognition. The aim is to build a machine/algorithm that can assign appropriate qualitative labels to new, previously unseen quantitative data using a priori knowledge and/or information contained in a training set. The patterns to be classified are usually groups of measurements/observations that are believed to be informative for the classification task. Example: face recognition. [Diagram: training data D = {X, y} and prior knowledge are used to design/learn a classifier m(θ, x); given a new pattern x, the classifier predicts a class label ŷ]
Classification: Applications Spam mail IDS (rare event classification) Credit rating Medical diagnosis Categorizing cells as malignant or benign based on MRI scans Classifying galaxies based on their shapes Predicting preterm births Crop yield prediction Identifying mushrooms as poisonous or edible …
Classification: Applications Example: Credit Card Company Every purchase is placed in 1 of 4 classes Authorize Ask for further identification before authorizing Do not authorize Do not authorize but contact police Two functions of Data Mining Examine historical data to determine how the data fit into 4 classes Apply the model to each new purchase
Classification: 3 phase job Model building phase (learning phase) Testing phase Model usage phase
Distance-based Classification Nearest Neighbors: if it walks like a duck, quacks like a duck, and looks like a duck, then it is probably a duck. [Figure: given a test record, compute its distance to the training records and choose the k "nearest" records]
Definition of Nearest Neighbor The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
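A minimal sketch of k-NN classification under the assumptions of numeric feature vectors and Euclidean distance; the function and variable names are illustrative, not from the slides.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict the class of x_test from the k nearest training records."""
    # Euclidean distance from the test record to every training record
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative data set: two classes in 2-D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 4.9]), k=3))  # -> 1
```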
Support Vector Machines Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines One Possible Solution
Support Vector Machines Another possible solution
Support Vector Machines Other possible solutions
Support Vector Machines Which one is better? B1 or B2? How do you define better?
Support Vector Machines Find a hyperplane that maximizes the margin => B1 is better than B2
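In the standard hard-margin formulation (not reproduced in the slide text), for a separating hyperplane w·x + b = 0 and labels yi ∈ {−1, +1}, the margin is 2/‖w‖, so maximizing the margin amounts to:

```latex
% Hard-margin SVM: maximize the margin 2/||w|| by minimizing ||w||^2 / 2
\[
\min_{\mathbf{w},\, b} \;\; \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\bigl(\mathbf{w}^{\mathsf{T}}\mathbf{x}_i + b\bigr) \ge 1, \;\; i = 1, \dots, N
\]
```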
Support Vector Machines What if the problem is not linearly separable?
Nonlinear Support Vector Machines What if decision boundary is not linear?
Support Vector Machines Solid line is preferred. Geometrically, we can characterize the solid plane as being "furthest" from both classes. How can we construct the plane "furthest" from both classes?
Support Vector Machines Figure – Best plane bisects closest points in the convex hulls Examine the convex hull of each class’ training data (indicated by dotted lines) and then find the closest points in the two convex hulls (circles labeled d and c). The convex hull of a set of points is the smallest convex set containing the points. If we construct the plane that bisects these two points (w=d-c), the resulting classifier should be robust in some sense.
Convex Sets [Figure: a convex set and a non-convex (concave) set] A function (in blue) is convex if and only if the region above its graph (in green) is a convex set.
Convex Hulls Convex hull: elastic band analogy. For planar objects, i.e., objects lying in the plane, the convex hull may be easily visualized by imagining an elastic band stretched open to encompass the given object; when released, it will assume the shape of the required convex hull.
Disadvantages of Linear Decision Surfaces [Figure: data plotted against Var1 and Var2]
Advantages of Non-Linear Surfaces [Figure: data plotted against Var1 and Var2]
Linear Classifiers in High-Dimensional Spaces Find a function φ(x) to map the data to a different space. [Figure: data in the (Var1, Var2) space mapped to the space of Constructed Feature 1 and Constructed Feature 2]
Handwriting Recognition Task T: recognizing and classifying handwritten words within images. Performance measure P: percent of words correctly classified. Training experience E: a database of handwritten words with given classifications.
Handwriting Recognition
Pattern Recognition Example Handwriting Digit Recognition
Pattern Recognition Example Handwriting Digit Recognition Each digit is represented by a 28x28 pixel image, which can be represented by a vector of 784 real numbers. Objective: to have an algorithm that will take such a vector as input and identify the digit it represents. This is a non-trivial problem due to variability in handwriting. Take images of a large number of digits (N) as a training set. Use the training set to tune the parameters of an adaptive model. Each digit in the training set has been identified by a target vector t, which represents the identity of the corresponding digit. The result of running a machine learning algorithm can be expressed as a function y(x) which takes a new digit x as input and outputs a vector y, encoded in the same way as t. The form of y(x) is determined through the learning (training) phase.
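A small sketch to make the 784-dimensional representation concrete; the image array is synthetic, standing in for a real 28x28 grayscale digit, and the 1-of-10 encoding of t is an assumption (a common convention, not stated on the slide).

```python
import numpy as np

# Synthetic stand-in for a 28x28 grayscale digit image (values in [0, 1])
image = np.random.rand(28, 28)

# Flatten the image into the 784-dimensional input vector x described above
x = image.reshape(-1)
print(x.shape)  # (784,)

# A possible target vector t: 1-of-10 ("one-hot") encoding, e.g. for the digit 3
t = np.zeros(10)
t[3] = 1.0
```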
Pattern Recognition Example Generalization The ability to categorize correctly new examples that differ from those in training Generalization is a central goal in pattern recognition Preprocessing Input variables are preprocessed to transform them into some new space of variables where it is hoped that the problem will be easier to solve (see fig.) Images of digits are translated and scaled so that each digit is contained within a box of fixed size. This reduces variability. Preprocessing stage is referred to as feature extraction New test data must be preprocessed using the same steps as training data
Pattern Recognition Example Preprocessing can also speed up computations. For example: face detection in a high-resolution video stream. Find useful features that are fast to compute and yet also preserve useful discriminatory information enabling faces to be distinguished from non-faces. The average value of image intensity in a rectangular sub-region can be evaluated extremely efficiently, and a set of such features is very effective in fast face detection. Because such features are smaller in number than the number of pixels, this is referred to as a form of dimensionality reduction. Care must be taken so that important information is not discarded during preprocessing.
Pattern Recognition Example Supervised & unsupervised learning If the training data consists of both input vectors and target vectors: supervised learning. Digit recognition problem – classification. Predicting crop yield – regression. If the training data consists of only input vectors: unsupervised learning. Discover groups of similar examples within the data – clustering. Find the distribution of the data within the input space – density estimation. Project data from a high-dimensional space to a 2- or 3-dimensional space for the purpose of visualization.
Reinforcement Learning The problem of finding suitable actions to take in a given situation in order to maximize a reward
Polynomial Curve Fitting Observe a real-valued input variable x • Use x to predict the value of target variable t • Synthetic data generated from sin(2πx) • Random noise in target values [Figure: training data plotted as target variable t against input variable x]
Polynomial Curve Fitting N observations of x: x = (x1,..,xN)T with corresponding targets t = (t1,..,tN)T • Goal is to exploit the training set to predict the value of t for a new value of x • Inherently a difficult problem. Data generation: N = 10 points, spaced uniformly in range [0,1], generated from sin(2πx) by adding small Gaussian noise. Noise is typical due to unobserved variables.
Polynomial Curve Fitting • Where M is the order of the polynomial • Is a higher value of M better? We'll see shortly! • Coefficients w0, …, wM are denoted by vector w • Nonlinear function of x, linear function of the coefficients w • Called Linear Models
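The polynomial referred to above appears only as an image on the slide; its standard form is:

```latex
% Polynomial of order M: linear in the coefficients w, nonlinear in x
\[
y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^{2} + \dots + w_M x^{M}
= \sum_{j=0}^{M} w_j x^{j}
\]
```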
Sum-of-Squares Error Function
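The error function itself is not reproduced in the text; the standard sum-of-squares error over the N training points is:

```latex
% Sum-of-squares error between predictions y(x_n, w) and targets t_n
\[
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl\{ y(x_n, \mathbf{w}) - t_n \bigr\}^{2}
\]
```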
Polynomial curve fitting
Polynomial curve fitting Choice of M? This is called model selection or model comparison.
0th Order Polynomial Poor representations of sin(2πx)
1st Order Polynomial Poor representations of sin(2πx)
3rd Order Polynomial Best Fit to sin(2πx)
9th Order Polynomial Over Fit: Poor representation of sin(2πx)
Polynomial Curve Fitting Good generalization is the objective. Dependence of generalization performance on M? Consider a data set of 100 points. Calculate E(w*) for both training data & test data. Choose the M which minimizes E(w*). Root Mean Square Error (RMS): sometimes convenient to use, as division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures ERMS is measured on the same scale (and in the same units) as the target variable t.
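The RMS error referred to above, in its usual form:

```latex
% Root-mean-square error; w* denotes the coefficients that minimize E(w)
\[
E_{\mathrm{RMS}} = \sqrt{\, 2 E(\mathbf{w}^{*}) / N \,}
\]
```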
Over-fitting Why is it happening? For small M (0, 1, 2), the polynomial is too inflexible to handle the oscillations of sin(2πx). For M = 3–8, it is flexible enough to handle the oscillations of sin(2πx). For M = 9, it is too flexible: training error = 0 but generalization error is high.
Polynomial Coefficients
Data Set Size: M=9 - The larger the data set, the more complex the model we can afford to fit to the data - The number of data points should be no less than 5-10 times the number of adaptive parameters in the model
Over-fitting Problem Should we limit the number of parameters according to the available training set? The complexity of the model should depend only on the complexity of the problem! Least-squares estimation represents a specific case of maximum likelihood. Over-fitting is a general property of maximum likelihood. The over-fitting problem can be avoided using the Bayesian approach!
Regularization Penalize large coefficient values
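A standard way to penalize large coefficient values (the quadratic penalty below is the usual choice and is assumed here, since the slide shows the formula only as an image):

```latex
% Sum-of-squares error plus a quadratic (ridge) penalty on the coefficients;
% the regularization coefficient lambda controls the strength of the penalty
\[
\widetilde{E}(\mathbf{w}) =
\frac{1}{2} \sum_{n=1}^{N} \bigl\{ y(x_n, \mathbf{w}) - t_n \bigr\}^{2}
+ \frac{\lambda}{2} \lVert \mathbf{w} \rVert^{2}
\]
```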
Regularization: [Figures comparing the fitted curves and the training/test error for different values of the regularization parameter]
Polynomial Coefficients
Linear Models for Regression The goal of regression is to predict the value of one or more continuous target variables t given the value of a D-dimensional vector x of input variables. We have already discussed polynomial curve fitting for regression; a polynomial is a specific example of a broad class of functions called Linear Regression Models: functions which are linear functions of the adjustable parameters. The simplest form of linear regression models are also linear functions of the input variables. A more useful class of functions can be obtained by taking a linear combination of a fixed set of nonlinear functions of the input variables, known as basis functions. Such models are linear functions of the parameters but nonlinear with respect to the input variables.
Linear Models for Regression Linear models have significant limitations as practical techniques for ML, particularly for problems involving high dimensionality Linear models possess nice analytical properties and form the foundation for more sophisticated models
Linear Basis Function Models Simplest linear model for regression with d input variables (written out in the sketch below), where x1, …, xd are the input variables. Compare with linear regression with one variable. Compare with polynomial regression with one variable. Linear in both parameters and input variables. Significant limitations since it is a linear function of the input variables; in the 1-D case it is a straight-line fit.
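The simplest model referred to above (shown on the slide only as an image) is, in standard notation:

```latex
% Simplest linear regression model: linear in both the parameters and the inputs
\[
y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d
\]
```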
Linear Basis Function Models
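The general linear basis function model, which the following slides build on (its equation appears only as an image), has the standard form:

```latex
% Linear combination of fixed nonlinear basis functions phi_j(x);
% defining phi_0(x) = 1 lets w_0 act as a bias term
\[
y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \,\phi_j(\mathbf{x})
= \sum_{j=0}^{M-1} w_j \,\phi_j(\mathbf{x})
= \mathbf{w}^{\mathsf{T}} \boldsymbol{\phi}(\mathbf{x})
\]
```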
Linear Basis Function Models Polynomial regression is a particular example of this model! How? With a single input variable x, take the basis functions to be the polynomial basis φj(x) = x^j. Limitation of polynomial basis functions? They are global: changes in one region of input space affect all others. One can instead divide the input space into regions and use different polynomials in each region, which is equivalent to spline functions.
Linear Basis Function Models Polynomial basis functions: these are global; a small change in x affects all basis functions.
Linear Basis Function Models (4) Gaussian basis functions: these are local; a small change in x only affects nearby basis functions. μj and s control location and scale (width).
Linear Basis Function Models (5) Sigmoidal basis functions (see the forms collected below): these too are local; a small change in x only affects nearby basis functions. μj and s control location and scale (slope).
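The basis functions named on the last three slides appear only as images; their standard forms for a scalar input x are:

```latex
% Polynomial, Gaussian, and sigmoidal basis functions
\begin{align}
\phi_j(x) &= x^{j} && \text{(polynomial)} \\
\phi_j(x) &= \exp\!\left( -\frac{(x - \mu_j)^{2}}{2 s^{2}} \right) && \text{(Gaussian)} \\
\phi_j(x) &= \sigma\!\left( \frac{x - \mu_j}{s} \right), \quad
\sigma(a) = \frac{1}{1 + e^{-a}} && \text{(sigmoidal)}
\end{align}
```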
Homework Read about Gaussian, sigmoidal, & Fourier basis functions, and about sequential learning & online algorithms. Will discuss in the next class!
The Bias-Variance Decomposition Bias-variance decomposition is a formal method for analyzing the prediction error of a predictive model. Using the analogy of a projectile aimed at a target: Bias = average distance between the target and the location where the projectile hits the ground (depends on the angle). Variance = deviation between a single hit x and the average position where the projectile hits the ground (depends on the force). Noise: if the target is not stationary, then the observed distance is also affected by changes in the location of the target.
The Bias-Variance Decomposition A low-degree polynomial has high bias (fits poorly) but low variance across different data sets. A high-degree polynomial has low bias (fits well) but high variance across different data sets. Interactive demo @: http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_bias_variance.htm
The Bias-Variance Decomposition True height of Chinese emperor: 200cm, about 6’6”. Poll a random American: ask “How tall is the emperor?” We want to determine how wrong they are, on average
The Bias-Variance Decomposition Each scenario has an expected value of 180 (i.e., a bias error of 20), but increasing variance in the estimate. Squared error = (bias error)² + variance. As variance increases, error increases.
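In symbols, for an estimator ŷ of a target value y, the expected squared error decomposes as follows (the additional noise term applies when the target itself is noisy, as in the moving-target case above):

```latex
% Bias-variance decomposition of the expected squared error of an estimator
\[
\mathbb{E}\bigl[(\hat{y} - y)^{2}\bigr]
= \underbrace{\bigl(\mathbb{E}[\hat{y}] - y\bigr)^{2}}_{\text{bias}^{2}}
+ \underbrace{\mathbb{E}\bigl[(\hat{y} - \mathbb{E}[\hat{y}])^{2}\bigr]}_{\text{variance}}
+ \;\text{noise}
\]
```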
Effect of the regularization parameter on the bias and variance terms [Figure: fits ranging from high variance / low bias to low variance / high bias]
An example of the bias-variance trade-off
Beating the bias-variance trade-off We can reduce the variance term by averaging lots of models trained on different datasets. This seems silly: if we had lots of different datasets, it would be better to combine them into one big training set; with more training data there will be much less variance. Weird idea: we can create different datasets by bootstrap sampling of our single training dataset. This is called "bagging", and it works surprisingly well (see the sketch below). But if we have enough computation, it's better to do the right Bayesian thing: combine the predictions of many models using the posterior probability of each parameter vector as the combination weight.
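A minimal sketch of the bagging idea described above, assuming the sin(2πx) toy data from earlier slides and high-degree polynomial fits as the individual high-variance models; all names and settings here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Single training set: noisy samples of sin(2*pi*x)
N = 30
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)

def bagged_predict(x_new, n_models=50, degree=9):
    """Average the predictions of many models, each fit to a bootstrap sample."""
    preds = []
    for _ in range(n_models):
        # Bootstrap sample: draw N points with replacement from the training set
        idx = rng.integers(0, N, N)
        coeffs = np.polyfit(x[idx], t[idx], degree)   # high-variance individual model
        preds.append(np.polyval(coeffs, x_new))
    return np.mean(preds, axis=0)                     # averaging reduces variance

x_new = np.linspace(0.0, 1.0, 5)
print(bagged_predict(x_new))
```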