PREDICTION
Elsayed Hemayed
Data Mining Course
Outline
- Introduction
- Regression Analysis
- Linear Regression
- Multiple Linear Regression
- Predictor Error Measures
- Evaluating the Accuracy of a Predictor
Introduction
“What if we would like to predict a continuous value, rather than a categorical label (as in classification)?”
Numeric prediction is the task of predicting continuous (or ordered) values for a given input, for example:
- the salary of a college graduate with 10 years of work experience,
- the potential sales of a new product given its price.
The most widely used approach for numeric prediction is regression, a statistical methodology.
Regression Analysis
Regression analysis can be used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable, which is continuous-valued.
- The predictor variables are the attributes of interest describing the tuple (the values of the predictor variables are known).
- The response variable is what we want to predict.
Given a tuple described by predictor variables, we want to predict the associated value of the response variable.
Regression Analysis – cont.
We’ll discuss straight-line regression analysis (which involves a single predictor variable) and multiple linear regression analysis (which involves two or more predictor variables).
Several software packages exist to solve regression problems. Examples include SAS (www.sas.com), SPSS (www.spss.com), and S-Plus (www.insightful.com).
Straight Line (Linear) Regression
Straight Line Regression
Straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is the simplest form of regression, and models y as a linear function of x. That is,
y = w0 + w1 x
where the variance of y is assumed to be constant, and w0 and w1 are regression coefficients:
- w0 is the y-intercept,
- w1 is the slope of the line.
Straight Line Regression – cont.
These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
Let D be a training set consisting of values of the predictor variable, x, for some population and their associated values for the response variable, y. The training set contains |D| data points of the form (x1, y1), (x2, y2), …, (x|D|, y|D|).
Straight Line Regression – cont.
The least squares estimates of the coefficients are
w1 = Σi (xi - x̄)(yi - ȳ) / Σi (xi - x̄)²
w0 = ȳ - w1 x̄
where the sums run over the |D| training points, x̄ is the mean value of x1, x2, …, x|D|, and ȳ is the mean value of y1, y2, …, y|D|.
Example – Salary Data
Using the least squares method on the salary data gives y = 23.6 + 3.5x, with x in years of experience and y in thousands of dollars. Thus the predicted salary of a college graduate with, say, 10 years of experience is 23.6 + 3.5(10) = 58.6, i.e. $58,600.
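The following is a minimal Python sketch of the least squares formulas above. The salary data behind the slide's fitted line is not shown, so the (x, y) values below are made up for illustration only.

```python
# Least squares fit of y = w0 + w1*x.
# Hypothetical data: x = years of experience, y = salary in $1000s (made up).
x = [1, 3, 3, 6, 8, 9, 11, 13, 16, 21]
y = [20, 30, 36, 43, 57, 59, 64, 72, 83, 90]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# w1 = sum((xi - x_mean)*(yi - y_mean)) / sum((xi - x_mean)^2);  w0 = y_mean - w1*x_mean
w1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))
w0 = y_mean - w1 * x_mean

print(f"Fitted line: y = {w0:.1f} + {w1:.1f}x")
print(f"Predicted salary at 10 years: ${(w0 + w1 * 10) * 1000:,.0f}")
```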
Multiple Linear Regression
Multiple linear regression allows the response variable y to be modeled as a linear function of, say, n predictor variables or attributes, A1, A2, …, An, describing a tuple, X (that is, X = (x1, x2, …, xn)).
An example of a multiple linear regression model based on two predictor attributes or variables, A1 and A2, is
y = w0 + w1 x1 + w2 x2
where x1 and x2 are the values of attributes A1 and A2, respectively, in X.
Multiple Linear Regression – Least Squares
The least squares method can be extended to solve for w0, w1, and w2. The equations, however, become long and are tedious to solve by hand. Multiple regression problems are instead commonly solved with the use of statistical software packages, such as SAS, SPSS, and S-Plus.
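As an illustration of what such packages do internally, here is a short NumPy sketch that solves the two-attribute model y = w0 + w1 x1 + w2 x2 by least squares; the data values are hypothetical.

```python
import numpy as np

# Hypothetical training tuples: columns are the values of attributes A1 and A2.
X = np.array([[3.0, 1.0],
              [8.0, 2.0],
              [9.0, 2.5],
              [13.0, 3.0],
              [6.0, 1.5],
              [11.0, 2.0]])
y = np.array([30.0, 57.0, 64.0, 72.0, 43.0, 59.0])

# Prepend a column of ones so the intercept w0 is estimated along with w1 and w2.
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares solution of X_design @ w = y.
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
w0, w1, w2 = w
print(f"y = {w0:.2f} + {w1:.2f}*x1 + {w2:.2f}*x2")
```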
Predictor Error Measures
Let DT be a test set of the form (X1, y1), (X2, y2), …, (Xd, yd), where the Xi are the n-dimensional test tuples with associated known values, yi, for a response variable, y, and d is the number of tuples in DT. Writing yi′ for the value of yi predicted by the model, two common error measures are
Mean absolute error: (1/d) Σi |yi - yi′|
Mean squared error: (1/d) Σi (yi - yi′)²
The mean squared error exaggerates the presence of outliers, while the mean absolute error does not.
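A minimal Python sketch of these two measures, using their standard definitions; the example values are invented to show how a single outlier inflates the squared error much more than the absolute error.

```python
# Standard definitions of the two error measures.
def mean_absolute_error(y_true, y_pred):
    return sum(abs(y - yp) for y, yp in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    return sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) / len(y_true)

# Hypothetical test values: the last prediction is a large outlier.
y_true = [50, 60, 70, 80]
y_pred = [52, 58, 71, 120]
print(mean_absolute_error(y_true, y_pred))  # 11.25
print(mean_squared_error(y_true, y_pred))   # 402.25
```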
Predictor Error Measures – cont.
The error can also be reported in relative terms: the relative absolute error and the relative squared error divide the total absolute (or squared) error by the error that would result from always predicting ȳ, the mean value of the yi’s of the training data.
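A sketch of the relative measures under that reading; the slide only shows the closing fragment of the definition, so these functions follow the standard textbook formulation and should be treated as an assumption.

```python
# Relative error measures: total error normalized by the error of the
# trivial predictor that always outputs the training mean y_train_mean.
def relative_absolute_error(y_true, y_pred, y_train_mean):
    num = sum(abs(y - yp) for y, yp in zip(y_true, y_pred))
    den = sum(abs(y - y_train_mean) for y in y_true)
    return num / den

def relative_squared_error(y_true, y_pred, y_train_mean):
    num = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))
    den = sum((y - y_train_mean) ** 2 for y in y_true)
    return num / den
```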
Evaluating the Accuracy of a Predictor – The Holdout Method
The given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set. The training set is used to derive the model, whose accuracy is estimated with the test set. The estimate is pessimistic because only a portion of the initial data is used to derive the model.
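A minimal sketch of the holdout partition in Python; the data set and the two-thirds/one-third split implementation are illustrative, not taken from the slide.

```python
import random

def holdout_split(data, train_fraction=2/3, seed=None):
    """Randomly partition data into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(data)     # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (x, y) tuples; the model would be derived from `train`
# and its error estimated on the held-out `test` set.
data = [(x, 23.6 + 3.5 * x + random.uniform(-5, 5)) for x in range(30)]
train, test = holdout_split(data, seed=42)
print(len(train), "training tuples,", len(test), "test tuples")
```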
[Figure: Estimating accuracy with the holdout method]
Random Subsampling
The holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration. (For prediction, we can take the average of the predictor error rates.)
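Continuing the earlier sketches, random subsampling can be expressed as k repetitions of the holdout split with the error estimates averaged; holdout_split and the error function are the hypothetical helpers defined above, and fit/predict stand in for whatever model-building routine is used.

```python
def random_subsampling_error(data, fit, predict, error_fn, k=10, seed=0):
    """Repeat the holdout method k times and average the predictor error."""
    errors = []
    for i in range(k):
        train, test = holdout_split(data, seed=seed + i)
        model = fit(train)                          # derive the model from the training set
        y_true = [y for _, y in test]
        y_pred = [predict(model, x) for x, _ in test]
        errors.append(error_fn(y_true, y_pred))     # e.g. mean_absolute_error
    return sum(errors) / k
```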
Homework (due day 3)
- Prepare a database with several thousand records.
- Define a data mining application to run on your data.
- Download and install a free data mining tool.
- Use the tool to mine your data.
- Prepare a demo to present your findings to the class.
Summary
- Introduction
- Regression Analysis
- Linear Regression
- Multiple Linear Regression
- Predictor Error Measures
- Evaluating the Accuracy of a Predictor