Chapter 2: Overview of the Data Mining Process

Presentation transcript:

Chapter 2 Overview of the Data Mining Process 1

Introduction Data Mining – Predictive analysis – Tasks of classification and prediction – Core of business intelligence Database Methods – OLAP – SQL – Do not involve statistical modeling 2

Core Ideas in Data Mining Analytical Methods Used in Predictive Analytics – Classification: used with categorical response variables, e.g. will a purchase be made or not made? – Prediction: predict (estimate) the value of a continuous response variable; the term is also used with categorical responses – Association Rules: affinity analysis, “what goes with what”; seeks correlations among data 3

Core Ideas in Data Mining Data Reduction – Reduce variables – Group together similar variables Data Exploration – View data as evidence – Get “a feel” for the data Data Visualization – Graphical representation of data – Locate trends, correlations, etc. 4

Supervised Learning “Supervised learning” algorithms are those used in classification and prediction. – Data are available in which the value of the outcome of interest is known. “Training data” are the data from which the classification or prediction algorithm “learns,” or is “trained,” about the relationship between predictor variables and the outcome variable. This process results in a “model” – Classification Model – Predictive Model 5

Supervised Learning The model is then run with another sample of data – “validation data” – for which the outcome is known but we wish to see how well the model performs – If many different models are being tried out, a third sample with known outcomes – “test data” – is used with the final, selected model to estimate how well it will do. The model can then be used to classify or predict the outcome of interest in new cases where the outcome is unknown. 6

Linear regression analysis is an example of supervised learning – The Y variable is the (known) outcome variable – The X variable is some predictor variable – A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by this line – The regression line can then be used to predict Y values for new values of X for which we do not know the Y value. 7

Unsupervised Learning – No outcome variable to predict or classify – No “learning” from cases Unsupervised learning methods – Association Rules – Data Reduction Methods – Clustering Techniques 8

The Steps in Data Mining 1. Develop an understanding of the purpose of the data mining project – Is it a one-shot effort to answer a question or questions, or – an application (an ongoing procedure)? 2. Obtain the dataset to be used in the analysis. – Random sampling from a large database to capture records to be used in an analysis – Pulling together data from different databases: internal (e.g. past purchases made by customers) and external (e.g. credit ratings). – Usually the analysis to be done requires only thousands or tens of thousands of records. 9

The Steps in Data Mining 3. Explore, clean, and preprocess the data – Verify that the data are in reasonable condition. – How should missing data be handled? – Are the values in a reasonable range, given what you would expect for each variable? – Are there obvious “outliers”? – Data are reviewed graphically, for example with a matrix of scatter plots showing the relationship of each variable with each other variable. – Ensure consistency in the definitions of fields, units of measurement, time periods, etc. 10

The Steps in Data Mining 4. Reduce the data – If supervised training is involved, separate the data into training, validation and test datasets. – Eliminate unneeded variables – Transform variables (e.g. turning “money spent” into “spent > $100” vs. “spent ≤ $100”) – Create new variables (e.g. a variable that records whether at least one of several products was purchased) – Make sure you know what each variable means, and whether it is sensible to include it in the model. 5. Determine the data mining task – Classification, prediction, clustering, etc. 6. Choose the data mining techniques to be used – Regression, neural nets, hierarchical clustering, etc. 11

The Steps in Data Mining 7. Use algorithms to perform the task. – Iterative process - trying multiple variants, and often using multiple variants of the same algorithm (choosing different variables or settings within the algorithm). – When appropriate, feedback from the algorithm's performance on validation data is used to refine the settings. 8. Interpret the results of the algorithms. – Choose the best algorithm to deploy, – Use final choice on the test data to get an idea how well it will perform. 9. Deploy the model. – Integrate the model into operational systems – Run it on real records to produce decisions or actions. – For example, the model might be applied to a purchased list of possible customers, and the action might be “include in the mailing if the predicted amount of purchase is > $10." 12
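
Steps 7 and 8 can be sketched in a few lines: several candidate models are compared on the validation data, and only the chosen one is scored on the test data. This is a minimal illustration, not the procedure of any particular tool. The two candidate models, their coefficients, and the records are all hypothetical and assumed to have been fitted on training data beforehand.

```python
import math

def rmse(model, records):
    """Root-mean-squared error of a model over (x, y) records."""
    return math.sqrt(sum((y - model(x)) ** 2 for x, y in records) / len(records))

# Hypothetical candidate models, assumed already fitted on the training data.
candidates = {
    "flat": lambda x: 20.0,
    "linear": lambda x: 5.0 + 3.0 * x,
}

# Hypothetical held-out records (x, y) whose outcomes are known.
validation = [(1, 8.5), (3, 13.0), (5, 21.0)]
test = [(2, 11.5), (4, 16.0)]

# Step 7: compare the candidates on the validation data and keep the best one.
best_name = min(candidates, key=lambda name: rmse(candidates[name], validation))

# Step 8: score only the chosen model on the test data.
test_error = rmse(candidates[best_name], test)
print(best_name, round(test_error, 3))
```

Because the validation data took part in choosing the model, the error on the untouched test data is the more honest estimate of how the deployed model will perform.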

Preliminary Steps Organization of datasets – Records in rows – Variables in columns; in supervised learning one of these will be the outcome variable, usually placed in the first or last column Sampling from a database – Use samples to create, validate, and test the model Oversampling rare events – If a response value is seldom found in the data, increase the share of the sample drawn from those cases – Adjust algorithm as necessary 13
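
One simple way to oversample a rare outcome is to duplicate rare records, sampling with replacement, until they reach a chosen share of the sample. The sketch below assumes this approach. The function name, the `target_share` parameter, and the data are hypothetical and not taken from the slides.

```python
import random

def oversample_rare(records, is_rare, target_share, seed=0):
    """Duplicate rare records (sampling with replacement) until they make up
    roughly target_share of the returned sample."""
    rng = random.Random(seed)
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    if not rare or not (0 < target_share < 1):
        return list(records)
    # Solve n_rare / (n_rare + n_common) = target_share for n_rare.
    needed = round(target_share * len(common) / (1 - target_share))
    extra = rng.choices(rare, k=max(0, needed - len(rare)))
    sample = common + rare + extra
    rng.shuffle(sample)
    return sample

# Hypothetical data: 2 responders among 100 records.
records = [{"id": i, "responded": i < 2} for i in range(100)]
balanced = oversample_rare(records, lambda r: r["responded"], target_share=0.5)
print(len(balanced), sum(r["responded"] for r in balanced))
```

The resulting sample no longer reflects the true frequency of the rare outcome, which is one reason the slide says to adjust the algorithm as necessary.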

Preliminary Steps (Pre-processing and Cleaning the Data) Types of variables – Continuous – assumes any real numerical value (generally within a specified range) – Categorical – assumes one of a limited number of values: Text (e.g. Payments ∈ {current, not current, bankrupt}), Numerical (e.g. Age ∈ {0 … 120}), Nominal (e.g. payments), Ordinal (e.g. age) 14

Preliminary Steps (Pre-processing and Cleaning the Data) Handling categorical variables – If a categorical variable is ordered, it can be used as a continuous variable (e.g. age, level of credit, etc.) – Use “dummy” variables when the range of values is not large, e.g. variable occupation ∈ {student, unemployed, employed, retired} – Create binary (yes/no) dummy variables: Student – yes/no, Unemployed – yes/no, Employed – yes/no, Retired – yes/no Variable selection – The more predictor variables, the more records are needed to build the model – Reduce the number of variables whenever appropriate 15
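
The occupation example above can be turned into yes/no dummy variables as follows. This is a small sketch; the helper name `make_dummies` and the sample values are made up for illustration.

```python
def make_dummies(values, categories):
    """Return one dict of 0/1 indicator variables per input value."""
    return [{cat: int(v == cat) for cat in categories} for v in values]

categories = ["student", "unemployed", "employed", "retired"]
occupation = ["employed", "student", "retired"]  # hypothetical records

dummies = make_dummies(occupation, categories)
for row in dummies:
    print(row)
```

Each record has exactly one dummy set to 1. For linear regression, one of the four columns is usually dropped, since its value is implied by the other three.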

Preliminary Steps (Pre-processing and Cleaning the Data) Overfitting – Building a model means describing relationships among variables in order to predict future outcome (dependent) values on the basis of future predictor (independent) values. – Avoid “explaining” variation in the data that was nothing more than chance variation; avoid mislabeling “noise” in the data as if it were a “signal” – Caution: if the dataset is not much larger than the number of predictor variables, it is very likely that a spurious relationship like this will creep into the model 16

Overfitting 17

Preliminary Steps (Pre-processing and Cleaning the Data) How many variables and how much data – A good rule of thumb is to have ten records for every predictor variable. – For classification procedures: at least 6 × m × p records, where m = number of outcome classes and p = number of variables – Compactness or parsimony is a desirable feature in a model. – A matrix of x-y plots can be useful in variable selection: you can see at a glance the x-y plots for all variable combinations. – A straight line would be an indication that one variable is exactly correlated with another; we would want to include only one of them in our model. – Weed out irrelevant and redundant variables from the model – Consult a domain expert whenever possible 18
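
The two rules of thumb above are simple arithmetic. The sketch below applies them to a hypothetical model with 5 predictors and 2 outcome classes; the function names are illustrative.

```python
def min_records_prediction(p, per_predictor=10):
    """Rule of thumb: ten records for every predictor variable."""
    return per_predictor * p

def min_records_classification(m, p):
    """Rule of thumb: at least 6 x m x p records
    (m outcome classes, p variables)."""
    return 6 * m * p

# Hypothetical model: 5 predictors, 2 outcome classes.
print(min_records_prediction(p=5))
print(min_records_classification(m=2, p=5))
```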

Preliminary Steps (Pre-processing and Cleaning the Data) Outliers – Values that lie far away from the bulk of the data are called outliers – No statistical rule can tell us whether such an outlier is the result of an error – These are judgments best made by someone with “domain” knowledge – If the number of records with outliers is very small, they might be treated as missing data. 19

Preliminary Steps (Pre-processing and Cleaning the Data) Missing values – If the number of records with missing values is small, those records might be omitted – The more variables, the more records are dropped – Solution: use the average value computed from the records with valid data for the variable with missing data; this reduces variability in the data set – Human judgment can be used to determine the best way to handle missing data 20
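
Replacing missing values with the average of the valid values might look like the sketch below, where `None` marks a missing entry (an assumption about how the data are stored) and the income figures are hypothetical. The last lines show the reduced variability mentioned above.

```python
from statistics import mean, stdev

def impute_mean(values):
    """Replace None entries with the mean of the valid entries."""
    valid = [v for v in values if v is not None]
    if not valid:
        raise ValueError("no valid values to average")
    fill = mean(valid)
    return [fill if v is None else v for v in values]

income = [40, None, 60, 50, None]  # hypothetical, in thousands of dollars
filled = impute_mean(income)
print(filled)

# Imputed values sit exactly at the mean, so the spread shrinks.
spread_before = stdev(v for v in income if v is not None)
spread_after = stdev(filled)
print(spread_before, spread_after)
```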

Preliminary Steps (Pre-processing and Cleaning the Data) Normalizing (standardizing) the data – To normalize the data, we subtract the mean from each value and divide by the standard deviation of the resulting deviations from the mean – This expresses each value as the “number of standard deviations away from the mean”, the z-score – Needed if variables are in different units, e.g. hours, thousands of dollars, etc. – Clustering algorithms measure the distance between the values of variables, so they need a standard scale for distance. – Data mining software, including XLMiner, typically has an option that normalizes the data in those algorithms where it may be required 21
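
The z-score calculation described above can be sketched as follows. The sample standard deviation (`statistics.stdev`) is used here as an assumption; a given tool may divide by the population standard deviation instead. The two example variables are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (assumed convention)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

hours = [2, 4, 6, 8]                  # hypothetical, measured in hours
dollars_thousands = [10, 20, 30, 40]  # hypothetical, in thousands of dollars

print([round(z, 3) for z in z_scores(hours)])
print([round(z, 3) for z in z_scores(dollars_thousands)])
```

After normalization the two variables are on the same scale even though their original units differ, which is what distance-based methods such as clustering need.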

Preliminary Steps Use and creation of partitions – Training partition: the largest partition; contains the data used to build the various models; the same training partition is generally used to develop multiple models. – Validation partition: used to assess the performance of each model and to compare models and pick the best one; in classification and regression tree algorithms the validation partition may be used automatically to tune and improve the model. – Test partition: sometimes called the “holdout” or “evaluation” partition; used to assess the performance of a chosen model with new data. 22
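
A random three-way partition can be sketched as below. The 50/30/20 proportions and the seed are assumptions made for the example, not values given in the slides.

```python
import random

def partition(records, train=0.5, valid=0.3, seed=12345):
    """Randomly split records into training, validation, and test partitions.
    Whatever is left after training and validation becomes the test partition."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )

# Hypothetical dataset of 100 record ids.
training, validation, test = partition(range(100))
print(len(training), len(validation), len(test))
```

Fixing the seed makes the partition reproducible, so different models can be built and compared on exactly the same records.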

The Three Data Partitions and Their Role in the Data Mining Process 23

Example – Linear Regression Boston Housing Data 24

25

Partitioning the data 26

Using XLMiner for Multiple Linear Regression 27

Specifying Output 28

Prediction of Training Data 29

Prediction of Validation Data 30

Summary of errors 31

RMS error – Error = actual − predicted – RMS error = root-mean-squared error = square root of the average squared error – In the previous example, the training and validation sets differ in size, so only RMS Error and Average Error are comparable 32
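
The error measures above can be computed as in this sketch. The actual and predicted values are hypothetical, and the dictionary keys are illustrative names rather than XLMiner's output labels.

```python
import math

def error_summary(actual, predicted):
    """Total squared error, RMS error, and average error (actual - predicted)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    sse = sum(e * e for e in errors)
    return {
        "total_sse": sse,                 # grows with the number of records
        "rms_error": math.sqrt(sse / n),  # comparable across partitions
        "average_error": sum(errors) / n, # comparable across partitions
    }

actual = [24, 22, 35]     # hypothetical outcomes
predicted = [25, 21, 33]  # hypothetical model predictions
summary = error_summary(actual, predicted)
print(summary)
```

The total sum of squared errors depends on how many records a partition contains, whereas the RMS error and average error are per-record quantities and can be compared between training and validation sets of different sizes.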

Using Excel and XLMiner for Data Mining – Excel is limited in data capacity – However, the training and validation of DM models can be handled within the modest limits of Excel and XLMiner – Models can then be used to score larger databases – XLMiner has functions for interacting with various databases (taking samples from a database, and scoring a database from a developed model) 33

Simple Regression Example 34

Simple Regression Model – Make a prediction about the starting salary of a current college graduate – Data set: starting salaries of recent college graduates – Compute the average salary – How certain are we of this prediction? There is variability in the data. 35

Simple Regression Model – Compute total variation – The smaller the amount of total variation, the more accurate (certain) our prediction will be. – Use total variation as an index of uncertainty about our prediction 36

Simple Regression Model – How can we “explain” the variability? Perhaps it depends on the student’s GPA (plot of Salary against GPA) 37

Simple Regression Model – Find a linear relationship between GPA and starting salary – As GPA increases/decreases, starting salary increases/decreases 38

Simple Regression Model – Least squares method to find the regression model – Choose a and b in the regression model (equation) Y-hat = a + b * X so that it minimizes the sum of the squared deviations, where each deviation is the actual Y value minus the predicted Y value (Y-hat) 39
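
For a single predictor, the least squares values of a and b have a closed form: b is the sum of cross-deviations divided by the sum of squared X deviations, and a is the mean of Y minus b times the mean of X. The sketch below applies it to three made-up (GPA, salary) pairs; these are not the data behind the slides.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for the model y-hat = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

gpa = [2.0, 3.0, 4.0]           # hypothetical
salary = [15000, 22000, 27000]  # hypothetical

a, b = fit_line(gpa, salary)
residuals = [yi - (a + b * xi) for xi, yi in zip(gpa, salary)]
print(round(a, 2), round(b, 2), round(sum(residuals), 6))
```

The residuals of a least squares line with an intercept always sum to zero, up to floating-point rounding.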

Simple Regression Model – How good is the model? – u-hat is a “residual” value – The sum of all u-hats is zero – The sum of all u-hats squared is the total variance not explained by the model – “Unexplained variance” is 7,425,926 – a = 4,779 and b = 5,370 (a computer program computed these values) 40

Simple Regression Model – Total Variation = 23,000,000 41

Simple Regression Model – Total Unexplained Variation = 7,425,726 42

Simple Regression Model – Relative Goodness of Fit – Summarize the improvement in prediction using the regression model – Compute R², the coefficient of determination – The regression model (equation) is a better predictor than guessing the average salary – GPA is a more accurate predictor of starting salary than guessing the average – R² is the “performance measure” for the model. – Predicted Starting Salary = 4,779 + 5,370 * GPA 43
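
R² is one minus the share of the total variation that the model leaves unexplained. The sketch below plugs in the figures from slides 40 and 41 (total variation 23,000,000, unexplained 7,425,926, a = 4,779, b = 5,370). The GPA of 3.0 is a hypothetical input chosen only to show a prediction.

```python
def r_squared(total_variation, unexplained_variation):
    """Coefficient of determination: share of total variation explained."""
    return 1 - unexplained_variation / total_variation

def predict_salary(gpa, a=4779, b=5370):
    """Predicted starting salary from the fitted line a + b * GPA."""
    return a + b * gpa

r2 = r_squared(23_000_000, 7_425_926)
print(round(r2, 3))
print(predict_salary(3.0))  # hypothetical graduate with a 3.0 GPA
```

An R² of roughly 0.68 means that about two thirds of the variation in starting salary is accounted for by GPA in this example.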

Detailed Regression Example 44

Data Set – Table columns: Obs #, Salary, GPA, Months Work 45

Scatter Plot - GPA vs Salary 46

Scatter Plot - Work vs Salary 47

Pearson Correlation Coefficients – −1 ≤ r ≤ 1 – Correlation matrix of Salary, GPA, and Months Work 48
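
A Pearson correlation coefficient can be computed directly from its definition, as in this sketch. The input lists are hypothetical and were chosen so that the two extreme values of r appear.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # perfectly increasing relationship
print(pearson_r([1, 2, 3], [6, 4, 2]))  # perfectly decreasing relationship
```

In a simple regression with one predictor, the square of r between X and Y equals the R² of the fitted line.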

Three Regressions – Salary = f(GPA) – Salary = f(Work) – Salary = f(GPA, Work) – Interpret Excel Output 49

Interpreting Results Regression Statistics – Multiple R – R² – Adjusted R² – Standard Error Sy Statistical Significance – t-test – p-value – F test 50

Regression Statistics Table – Multiple R: R = square root of R² – R²: coefficient of determination – Adjusted R²: used if there is more than one x variable – Standard Error Sy: the sample estimate of the standard deviation of the error (actual − predicted) 51
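
Adjusted R² and the standard error follow from R², the residual sum of squares, the number of observations n, and the number of x variables k, using the standard textbook formulas. The inputs in this sketch are hypothetical and are not the values from the Excel output.

```python
import math

def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def standard_error(sse, n, k):
    """Standard error of the estimate from the residual sum of squares."""
    return math.sqrt(sse / (n - k - 1))

# Hypothetical regression: 10 observations, 2 x variables.
print(round(adjusted_r_squared(r2=0.75, n=10, k=2), 4))
print(standard_error(sse=700, n=10, k=2))
```

Adjusted R² penalizes each additional x variable, so it is the fairer figure when comparing models with different numbers of predictors.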

ANOVA Table – Table 1 gives the F statistic – Tests the claim that there is no significant relationship between all of your independent variables and your dependent variable – The Significance F value is a p-value – Reject the claim of NO significant relationship between your independent and dependent variables if p < α – Generally α = 0.05 52
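
The F statistic in an ANOVA table is the ratio of the regression mean square to the residual mean square. The sketch below uses hypothetical sums of squares for a model with 2 x variables and 10 observations. It stops at F itself, because the corresponding p-value comes from the F distribution with k and n − k − 1 degrees of freedom, which the Python standard library does not provide.

```python
def f_statistic(ssr, sse, n, k):
    """F = (SSR / k) / (SSE / (n - k - 1)) for k x variables, n observations."""
    msr = ssr / k            # mean square for regression
    mse = sse / (n - k - 1)  # mean square for residuals
    return msr / mse

# Hypothetical ANOVA figures: SS regression = 2100, SS residual = 700.
f_value = f_statistic(ssr=2100, sse=700, n=10, k=2)
print(f_value)
```

A large F relative to the F distribution gives a small Significance F, which is the p-value compared against α in the rule above.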

Regression Coefficients Table – The Coefficients column gives the b0, b1, b2, …, bn values for the regression equation. – b0 is the intercept – b1 is next to your independent variable x1 – b2 is next to your independent variable x2 – b3 is next to your independent variable x3 53

Regression Coefficients Table – p-values for the individual t tests of each independent variable – Each t test tests the claim that there is no relationship between the independent variable (in the corresponding row) and your dependent variable. – Reject the claim of NO significant relationship between your independent variable (in the corresponding row) and the dependent variable if p < α. 54

Salary = f(GPA) – Excel regression output, 10 observations – Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations – ANOVA (rows Regression, Residual, Total): df, SS, MS, F, Significance F – Coefficients table (rows Intercept, GPA): Coefficients, Standard Error, t Stat, P-value, Lower 95%, Upper 95% 55

Salary = f(Work) – Excel regression output, 10 observations – Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations – ANOVA (rows Regression, Residual, Total): df, SS, MS, F, Significance F – Coefficients table (rows Intercept, Months Work): Coefficients, Standard Error, t Stat, P-value, Lower 95%, Upper 95% 56

Salary = f(GPA, Work) – Excel regression output, 10 observations – Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations – ANOVA (rows Regression, Residual, Total): df, SS, MS, F, Significance F – Coefficients table (rows Intercept, GPA, Months Work): Coefficients, Standard Error, t Stat, P-value, Lower 95%, Upper 95% 57

Compare Three “Models” – Regression Statistics (Multiple R, R Square, Adjusted R Square, Standard Error; 10 observations each) shown side by side for f(GPA, Work), f(Work), and f(GPA) 58