Chapter 4 – Dimension Reduction
Data Mining for Business Analytics
Shmueli, Patel & Bruce

Dimensionality
Dimensionality is the number of variables in the data set; it can grow due to derived variables.
Too many variables can lead to overfitting: accuracy and reliability can suffer.
Too many variables can also lead to multicollinearity problems and can pose computational problems.
When the model is deployed, unneeded variables lead to unnecessary costs of collecting and processing data.

Curse of dimensionality
The data required to densely populate a space increases exponentially as the dimension increases.
Eight points fill the one-dimensional space, but they become more separated as the dimension increases; in 100-dimensional space, they would be like distant galaxies.

Practical Considerations
Which variables are most important for the task?
Which variables are likely to be useless?
Which variables are likely to contain much error?
Which variables can be measured in the future?
Which variables can be measured before the outcome?

Exploring the data
Statistical summary of the data, using common metrics: average, median, minimum, maximum, standard deviation, counts & percentages.

Summary Statistics – Boston Housing
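A summary table like this one can be reproduced programmatically. Below is a minimal sketch in Python/pandas, assuming a hypothetical BostonHousing.csv file whose columns follow the Boston Housing example used in this chapter (the file name and column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical file name; columns assumed to match the Boston Housing example
housing_df = pd.read_csv("BostonHousing.csv")

# Common summary metrics for every numerical variable
summary = pd.DataFrame({
    "mean": housing_df.mean(numeric_only=True),
    "median": housing_df.median(numeric_only=True),
    "min": housing_df.min(numeric_only=True),
    "max": housing_df.max(numeric_only=True),
    "std": housing_df.std(numeric_only=True),
})
print(summary.round(2))

# Counts and percentages for a categorical variable such as CHAS
print(housing_df["CHAS"].value_counts())
print((housing_df["CHAS"].value_counts(normalize=True) * 100).round(1))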

Summarize Using Pivot Tables
Counts & percentages are useful for summarizing categorical data.
Boston Housing example: 35 neighborhoods border the Charles River (CHAS = 1); 471 neighborhoods do not (CHAS = 0).

Pivot Tables - cont.
Averages are useful for summarizing grouped numerical data.
Boston Housing example: compare average home values in neighborhoods that border the Charles River (1) and those that do not (0).

Pivot Tables, cont.
Group by multiple criteria: by number of rooms and location.
E.g., neighborhoods on the Charles with 6-7 rooms have an average house value of … ($000).

Pivot Table - Hint
To get counts, drag any variable (e.g., "ID") to the data area, select "settings", then change "sum" to "count".
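The same pivot-table logic can be sketched in Python/pandas. The sketch below assumes the hypothetical BostonHousing.csv file with columns RM, CHAS, and MEDV; the room bins are illustrative:

```python
import pandas as pd

housing_df = pd.read_csv("BostonHousing.csv")  # hypothetical file name

# Bin the number of rooms so records can be grouped by ranges such as 6-7 rooms
housing_df["RM_bin"] = pd.cut(housing_df["RM"], bins=[3, 4, 5, 6, 7, 8, 9])

# Average home value (MEDV, in $000) by room bin and river location (CHAS)
mean_values = pd.pivot_table(housing_df, values="MEDV",
                             index="RM_bin", columns="CHAS", aggfunc="mean")
print(mean_values.round(2))

# Counts per group: the equivalent of changing "sum" to "count" in Excel
counts = pd.pivot_table(housing_df, values="MEDV",
                        index="RM_bin", columns="CHAS", aggfunc="count")
print(counts)
```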

Correlations Between Pairs of Variables: Correlation Matrix from Excel
[Correlation matrix of CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, and MEDV; diagonal entries are 1.00.]
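A correlation matrix like this can be computed in one line with pandas; a minimal sketch, again assuming the hypothetical BostonHousing.csv:

```python
import pandas as pd

housing_df = pd.read_csv("BostonHousing.csv")  # hypothetical file name

# Pairwise correlations between all numerical variables
corr_matrix = housing_df.corr(numeric_only=True)
print(corr_matrix.round(2))
```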

Reducing Categories
A single categorical variable with m categories is typically transformed into m-1 dummy variables.
Each dummy variable takes the values 0 or 1: 0 = "no" for the category, 1 = "yes".
Problem: can end up with too many variables.
Solution: reduce the number of variables by combining categories that are close to each other.
Use pivot tables to assess how sensitive the outcome variable is to the dummies.
Exception: Naïve Bayes can handle categorical variables without transforming them into dummies.
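A sketch of the dummy-variable transformation with pandas; the data frame and category labels here are toy values for illustration only:

```python
import pandas as pd

# Toy data frame with one categorical variable (values are illustrative)
df = pd.DataFrame({"ZN_group": ["low", "low", "high", "medium", "high"]})

# m categories become m-1 dummy variables; drop_first removes the redundant one
dummies = pd.get_dummies(df["ZN_group"], prefix="ZN", drop_first=True)
print(dummies)
```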

Combining Categories
Many zoning categories are the same or similar with respect to CAT.MEDV:
ZN = 12.5, 18, 21, 25, 28, 30, 70 and 85 all have CAT.MEDV = 0.
ZN = 17.5, 90, 95 and 100 all have CAT.MEDV = 1.
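One way to sketch this combining step in pandas, using the groupings listed above (the sample ZN values and the group labels are illustrative):

```python
import pandas as pd

# Toy zoning values; the grouping follows the CAT.MEDV pattern described above
df = pd.DataFrame({"ZN": [12.5, 18, 90, 100, 25, 17.5]})

cat_medv_0 = {12.5, 18, 21, 25, 28, 30, 70, 85}   # categories with CAT.MEDV = 0
df["ZN_combined"] = df["ZN"].apply(
    lambda z: "group_0" if z in cat_medv_0 else "group_1")
print(df)
```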

Principal Components Analysis
Goal: reduce a set of numerical variables.
The idea: remove the overlap of information between these variables. ["Information" is measured by the sum of the variances of the variables.]
Final product: a smaller number of numerical variables that contain most of the information.

Principal Components Analysis
How does PCA do this?
Create new variables that are linear combinations of the original variables (i.e., they are weighted averages of the original variables).
These linear combinations are uncorrelated (no information overlap), and only a few of them contain most of the original information.
The new variables are called principal components.
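A minimal sketch of this idea with scikit-learn, using two correlated toy variables; the component weights and variance split it prints are illustrative, not the book's values:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Two correlated toy variables standing in for calories and rating
x1 = rng.normal(100, 20, size=200)
x2 = -0.5 * x1 + rng.normal(0, 5, size=200)
X = np.column_stack([x1, x2])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)            # the new variables Z1 and Z2
print(pca.components_)                   # weights (linear-combination coefficients)
print(pca.explained_variance_ratio_)     # share of total variance per component
print(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])  # approximately 0: no overlap
```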

Example – Breakfast Cereals

Description of Variables
Name: name of cereal
mfr: manufacturer
type: cold or hot
calories: calories per serving
protein: grams
fat: grams
sodium: mg
fiber: grams
carbo: grams of complex carbohydrates
sugars: grams
potass: mg
vitamins: % of FDA recommendation
shelf: display shelf
weight: oz. per serving
cups: cups per serving
rating: Consumer Reports rating

Consider calories & ratings
Total variance (= "information") is the sum of the individual variances of calories and rating.
Calories accounts for 66% of the total variance; rating accounts for 34%.
[Covariance matrix of Calories and Rating.]

First & Second Principal Components
Z1 and Z2 are two linear combinations.
Z1 has the highest variation (spread of values); Z2 has the lowest variation.

PCA output for these 2 variables
Top: weights to project the original data onto Z1 & Z2; e.g., (-0.847, 0.532) are the weights for Z1.
Bottom: reallocated variance for the new variables; Z1: 86% of total variance, Z2: 14%.

Computing the PC scores
Let X1 = Calories, X2 = Rating.
Z1 = -0.847 (X1 - x̄1) + 0.532 (X2 - x̄2)
Z2 = 0.532 (X1 - x̄1) + 0.847 (X2 - x̄2)

Computing the PC scores
Let X1 = Calories, X2 = Rating; x̄2 = 42.67 is the mean of Rating.
Example scores: for 100%_Bran, the Rating term in Z1 is 0.532 (68.4 - 42.67); for 100%_Natural_Bran it is 0.532 (33.98 - 42.67).
[Slide table: Calories, Rating, Z1 and Z2 values for these two cereals.]
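The score computation is just a weighted sum of mean-centered values. The sketch below uses the weights reported on the earlier slide; the calories/rating rows are illustrative and the means are computed from this small toy matrix rather than the full 77-cereal data set, so the scores will not match the book's exactly:

```python
import numpy as np

# Weights reported for Z1 and Z2 in the two-variable PCA output
w1 = np.array([-0.847, 0.532])
w2 = np.array([0.532, 0.847])

# Illustrative calories/rating rows (rows = cereals)
X = np.array([[70.0, 68.4],
              [120.0, 33.98],
              [110.0, 50.0]])

X_centered = X - X.mean(axis=0)   # subtract the column means (the x-bar values)
z1 = X_centered @ w1              # first principal component scores
z2 = X_centered @ w2              # second principal component scores
print(np.round(z1, 2), np.round(z2, 2))
```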

Principal Component Scores
The weights are used to compute the scores above; e.g., the column 1 scores are the Z1 scores, computed using the weights (-0.847, 0.532).

Properties of the resulting variables
New distribution of information: new variances = 498 (for Z1) and 79 (for Z2).
The sum of the new variances equals the sum of the variances of the original variables calories and rating.
The new variable Z1 has most of the total variance; it might be used as a proxy for both calories and rating.
Z1 and Z2 have a correlation of zero (no information overlap).

Generalization
X1, X2, X3, …, Xp: the original p variables.
Z1, Z2, Z3, …, Zp: weighted averages of the original variables.
All pairs of Z variables have 0 correlation.
Order the Z's by variance (Z1 largest, Zp smallest).
Usually the first few Z variables contain most of the information, and so the rest can be dropped.

PCA on the full data set
First 6 components shown; the first 2 capture 93% of the total variation.
Note: data differ slightly from the text.

Normalizing data
In these results, sodium dominates the first PC just because of the way it is measured (mg): its scale is greater than almost all the other variables, so its variance is a dominant component of the total variance.
Normalize each variable to remove the scale effect: divide by the standard deviation (you may subtract the mean first).
Normalization (= standardization) is usually performed in PCA; otherwise the measurement units affect the results.
Note: in XLMiner, use the correlation matrix option to use normalized variables.
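A sketch of standardizing before PCA in Python, corresponding to the correlation-matrix option; it assumes a hypothetical Cereals.csv whose numerical columns match the variable list shown earlier (file and column names are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cereals_df = pd.read_csv("Cereals.csv")   # hypothetical file name
# Numerical columns assumed to match the variable list shown earlier
num_cols = ["calories", "protein", "fat", "sodium", "fiber", "carbo",
            "sugars", "potass", "vitamins", "shelf", "weight", "cups", "rating"]
X = cereals_df[num_cols].dropna()

# Standardize: subtract the mean and divide by the standard deviation
X_std = StandardScaler().fit_transform(X)

pca = PCA()
pca.fit(X_std)
# With standardized variables, no single variable (e.g., sodium) dominates
print(pca.explained_variance_ratio_.round(3))
```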

PCA using standardized variables
The first component accounts for a smaller part of the variance; more components are needed to capture the same amount of information.

PCA in Classification/Prediction
Apply PCA to the training data.
Decide how many PCs to use.
Use the variable weights in those PCs with the validation/new data.
This creates a new, reduced set of predictors in the validation/new data.
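A sketch of that workflow with scikit-learn: fit the scaling and PCA weights on the training partition only, then apply the same weights to the validation data before fitting a predictive model. The data set, column names, and number of components here are placeholders:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

housing_df = pd.read_csv("BostonHousing.csv")   # hypothetical file name
X = housing_df.drop(columns=["MEDV"])
y = housing_df["MEDV"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)

scaler = StandardScaler().fit(X_train)                    # fit scaling on training only
pca = PCA(n_components=5).fit(scaler.transform(X_train))  # choose how many PCs to keep

# Apply the training-data weights to both partitions
Z_train = pca.transform(scaler.transform(X_train))
Z_valid = pca.transform(scaler.transform(X_valid))

model = LinearRegression().fit(Z_train, y_train)
print(model.score(Z_valid, y_valid))   # performance on the reduced predictor set
```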

Regression-Based Dimension Reduction
Multiple linear regression or logistic regression.
Use subset selection: the algorithm chooses a subset of variables.
This procedure is integrated directly into the predictive task.
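One way to sketch subset selection in scikit-learn is forward selection, shown below; the selector settings and the data set are illustrative and not the book's specific algorithm:

```python
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

housing_df = pd.read_csv("BostonHousing.csv")   # hypothetical file name
X = housing_df.drop(columns=["MEDV"])
y = housing_df["MEDV"]

# Forward selection: greedily add the predictor that improves the fit most
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=5,
                                     direction="forward")
selector.fit(X, y)
print(list(X.columns[selector.get_support()]))
```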

Summary
Data summarization is an important part of data exploration.
Data summaries include numerical metrics (average, median, etc.) and graphical summaries.
Data reduction is useful for compressing the information in the data into a smaller subset.
Categorical variables can be reduced by combining similar categories.
Principal components analysis transforms an original set of numerical variables into a smaller set of weighted averages of the original variables that contain most of the original information in fewer variables.