PREDICTION. Elsayed Hemayed, Data Mining Course.

1 PREDICTION Elsayed Hemayed Data Mining Course

2 Outline
- Introduction
- Regression Analysis
- Linear Regression
- Multiple Linear Regression
- Predictor Error Measures
- Evaluating the Accuracy of a Predictor

3 Introduction
- "What if we would like to predict a continuous value, rather than a categorical label (as in classification)?"
- Numeric prediction is the task of predicting continuous (or ordered) values for a given input. Examples:
  - The salary of college graduates with 10 years of work experience.
  - The potential sales of a new product given its price.
- The most widely used approach for numeric prediction is regression, a statistical methodology.

4 Regression Analysis
- Regression analysis can be used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable, which is continuous-valued.
- The predictor variables are the attributes of interest describing the tuple; their values are known.
- The response variable is what we want to predict.
- Given a tuple described by predictor variables, we want to predict the associated value of the response variable.

5 Regression Analysis – cont.
- We'll discuss straight-line regression analysis (which involves a single predictor variable) and multiple linear regression analysis (which involves two or more predictor variables).
- Several software packages exist to solve regression problems. Examples include SAS, SPSS, and S-Plus.

6 Straight Line Regression: Linear Regression

7 Straight line regression
- Straight-line regression analysis involves a response variable, $y$, and a single predictor variable, $x$.
- It is the simplest form of regression, and models $y$ as a linear function of $x$.
- That is, $y = w_0 + w_1 x$, where the variance of $y$ is assumed to be constant and $w_0$ and $w_1$ are regression coefficients:
  - $w_0$ is the Y-intercept.
  - $w_1$ is the slope of the line.

8 Straight line regression – cont.
- These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
- Let $D$ be a training set consisting of values of the predictor variable, $x$, for some population and their associated values for the response variable, $y$.
- The training set contains $|D|$ data points of the form $(x_1, y_1), (x_2, y_2), \ldots, (x_{|D|}, y_{|D|})$.

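9 Straight line regression – cont.
- The regression coefficients can be estimated as
  $$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
  where $\bar{x}$ is the mean value of $x_1, x_2, \ldots, x_{|D|}$, and $\bar{y}$ is the mean value of $y_1, y_2, \ldots, y_{|D|}$.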
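
As a minimal sketch of the method of least squares for a single predictor, the Python function below computes the slope as the ratio of the summed cross-deviations to the summed squared deviations of x, and then the intercept from the two means. The function name `fit_line` and the training data are illustrative assumptions, not taken from the slides.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (w0, w1) minimizing the squared error of y = w0 + w1*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w1 = sxy / sxx
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Hypothetical training set lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w0, w1 = fit_line(xs, ys)
print(w0, w1)
```

Because the sample points lie exactly on a line, the fit recovers that line's intercept and slope; with noisy data the same function returns the best-fitting line in the least-squares sense.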

10 Example – Salary Data
- Using the least squares method on the salary data gives y = 23.6 + 3.5x, where x is years of experience and y is salary in thousands of dollars.
- Thus the predicted salary of a college graduate with, say, 10 years of experience is $58,600.
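
The arithmetic behind the salary figure can be checked with a short sketch. It assumes that salary is expressed in thousands of dollars, which is what the slide's result of $58,600 implies; the function name `predict_salary` is hypothetical.

```python
def predict_salary(years_experience, w0=23.6, w1=3.5):
    """Predicted salary in thousands of dollars, from y = w0 + w1*x."""
    return w0 + w1 * years_experience


salary_k = predict_salary(10)          # 23.6 + 3.5 * 10
print(f"${salary_k * 1000:,.0f}")      # prints $58,600
```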

11 Multiple Linear Regression
- It allows the response variable $y$ to be modeled as a linear function of, say, $n$ predictor variables or attributes, $A_1, A_2, \ldots, A_n$, describing a tuple, $X$. (That is, $X = (x_1, x_2, \ldots, x_n)$.)
- An example of a multiple linear regression model based on two predictor attributes or variables, $A_1$ and $A_2$, is $y = w_0 + w_1 x_1 + w_2 x_2$, where $x_1$ and $x_2$ are the values of attributes $A_1$ and $A_2$, respectively, in $X$.

12 Multiple Linear Regression – Least Squares
- The least squares method can be extended to solve for $w_0$, $w_1$, and $w_2$.
- The equations, however, become long and are tedious to solve by hand.
- Multiple regression problems are instead commonly solved with the use of statistical software packages, such as SAS, SPSS, and S-Plus.
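
To illustrate how software can take over the tedious part, here is a minimal pure-Python sketch that fits a multiple linear regression by forming the normal equations and solving them with Gaussian elimination. This is one common way to compute a least-squares fit, not necessarily what SAS, SPSS, or S-Plus do internally. The function names and the data are assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a*w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's lists are not modified
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w


def fit_multiple(rows, ys):
    """Return [w0, w1, ..., wn] for y = w0 + w1*x1 + ... + wn*xn."""
    # Prepend a constant 1 to each tuple so that w0 acts as the intercept
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    # Normal equations: (Z^T Z) w = Z^T y
    ztz = [[sum(z[i] * z[j] for z in design) for j in range(k)]
           for i in range(k)]
    zty = [sum(z[i] * y for z, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(ztz, zty)


# Hypothetical tuples generated from y = 1 + 2*x1 + 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]
weights = fit_multiple(rows, ys)
print(weights)
```

Since the sample data contain no noise, the returned coefficients match the generating model up to floating-point rounding.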

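13 Predictor Error Measures
- Let $D_T$ be a test set of the form $(X_1, y_1), (X_2, y_2), \ldots, (X_d, y_d)$, where the $X_i$ are the $n$-dimensional test tuples with associated known values, $y_i$, for a response variable, $y$, and $d$ is the number of tuples in $D_T$.
- With $y_i'$ denoting the predicted value for $X_i$, the mean absolute error is $\frac{1}{d}\sum_{i=1}^{d} |y_i - y_i'|$ and the mean squared error is $\frac{1}{d}\sum_{i=1}^{d} (y_i - y_i')^2$.
- The mean squared error exaggerates the presence of outliers, while the mean absolute error does not.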
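
The effect of an outlier on the two measures can be seen in a small sketch. The test values below are hypothetical: four predictions are off by 1 and one is off by 10.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


# Hypothetical test set: the last prediction is an outlier (off by 10)
actual = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [11.0, 11.0, 15.0, 15.0, 28.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(mae, mse)
```

If the last prediction were also off by only 1, both measures would equal 1. The single outlier raises the mean absolute error to about 2.8 but the mean squared error to about 20.8, because squaring magnifies large errors.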

14 Predictor Error Measures – cont.
- The mean value of the $y_i$'s of the training data,

15 Evaluating the Accuracy of a Predictor – The Holdout method
- The given data are randomly partitioned into two independent sets, a training set and a test set.
- Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set.
- The training set is used to derive the model, whose accuracy is estimated with the test set.
- The estimate is pessimistic because only a portion of the initial data is used to derive the model.
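
A holdout partition can be sketched in a few lines of Python. The function name `holdout_split`, the `seed` parameter, and the sample records are assumptions for illustration; the two-thirds default follows the typical allocation described above.

```python
import random


def holdout_split(data, train_fraction=2 / 3, seed=None):
    """Randomly partition data into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = list(data)        # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of 30 records, identified here by index
records = list(range(30))
train, test = holdout_split(records, seed=0)
print(len(train), len(test))     # 20 10
```

The two sets are disjoint and together cover all records; the model would be derived from `train` only and its error measured on `test`.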

16 Estimating accuracy with the holdout method [figure]

17 Random Subsampling
- The holdout method is repeated $k$ times.
- The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration.
- (For prediction, we can take the average of the predictor error rates.)
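
Random subsampling for a numeric predictor can be sketched as follows. The sketch assumes a straight-line least-squares model and mean absolute error as the error measure; both choices, along with the function names and the synthetic data, are illustrative and not specified by the slides.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for y = w0 + w1*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - w1 * x_bar, w1


def holdout_error(points, rng, train_fraction=2 / 3):
    """One holdout round: fit on the training part, return test-set MAE."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    w0, w1 = fit_line([x for x, _ in train], [y for _, y in train])
    return mean(abs(y - (w0 + w1 * x)) for x, y in test)


def random_subsampling(points, k=10, seed=0):
    """Average the holdout error over k independent random partitions."""
    rng = random.Random(seed)
    return mean(holdout_error(points, rng) for _ in range(k))


# Hypothetical noisy data scattered around y = 2 + 0.5x
data_rng = random.Random(42)
points = [(float(x), 2 + 0.5 * x + data_rng.uniform(-1, 1)) for x in range(30)]
avg_mae = random_subsampling(points, k=10)
print(avg_mae)
```

Averaging over several random partitions makes the estimate less dependent on any single split than one holdout run would be.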

18 Homework due day 3
- Prepare a database with several thousand records.
- Define a data mining application to run on your data.
- Download and install a free data mining tool.
- Use the tool to mine your data.
- Prepare a demo to present your findings to the class.

19 Summary
- Introduction
- Regression Analysis
- Linear Regression
- Multiple Linear Regression
- Predictor Error Measures
- Evaluating the Accuracy of a Predictor
