2011 Data Mining
Industrial & Information Systems Engineering
Pilsung Kang
Industrial & Information Systems Engineering, Seoul National University of Science & Technology
Chapter 3: Data Exploration & Dimension Reduction

Data Mining, IISE, SNUT
Steps in Data Mining revisited
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model

Example: Boston Housing Data (Step 1)
Define and understand the purpose of the data mining project
- Keep the local economy stable by maintaining home prices.

Example: Boston Housing Data (Step 2)
Formulate the data mining problem
- What is the purpose? To predict the median value of a housing unit in a neighborhood.
- What data mining task is appropriate? Prediction.

Example: Boston Housing Data (Step 3)
Obtain/verify/modify the data: Data acquisition

Variable   Description
CRIM       Per capita crime rate by town
ZN         Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS      Proportion of non-retail business acres per town
CHAS       Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX        Nitric oxides concentration (parts per 10 million)
RM         Average number of rooms per dwelling
AGE        Proportion of owner-occupied units built prior to 1940
DIS        Weighted distances to five Boston employment centres
RAD        Index of accessibility to radial highways
TAX        Full-value property-tax rate per $10,000
PTRATIO    Pupil-teacher ratio by town
B          1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT      Percentage of the population of lower status
MEDV       Median value of owner-occupied homes in $1000s

Example: Boston Housing Data (Step 3)
Obtain/verify/modify the data: Data example
[Slide figure not reproduced]

Example: Boston Housing Data (Step 4)
Data exploration: Basic statistics
[Slide figure not reproduced]

Example: Boston Housing Data (Step 4)
Data exploration: Single variable
- Histogram: shows the approximate distribution of a single variable.

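As a rough illustration of what a histogram computes, the sketch below bins a small sample into equal-width intervals and prints the counts as text bars. The sample values and the choice of 4 bins are assumptions made for the illustration; they are not taken from the Boston Housing data.

```python
import numpy as np

# Hypothetical sample values, not taken from the Boston Housing data.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], dtype=float)

# Count how many values fall into each of 4 equal-width bins.
# Every bin is half-open [left, right) except the last, which includes its right edge.
counts, edges = np.histogram(values, bins=4)

# Print a text histogram: one row per bin, one '#' per record.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:4.1f} to {right:4.1f}: {'#' * int(count)}")
```

The tall bar on the left and the isolated bar on the right show at a glance that the sample is skewed toward small values.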
Example: Boston Housing Data (Step 4)
Data exploration: Single variable
- Box plot: shows basic statistics of a single variable.
[Figure: a single box plot and a conditional box plot, labeled with the median, mean, quartile 1, quartile 3, "min", "max", and outliers]

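The quantities a box plot draws can be computed directly. The sketch below assumes the common convention that the whiskers ("min" and "max") end at the most extreme points within 1.5 times the interquartile range of the box, and that anything beyond is an outlier; the slides do not state which rule they use. The function name `box_plot_stats` and the sample are hypothetical.

```python
import numpy as np

def box_plot_stats(values, whisker=1.5):
    """Return the statistics a box plot draws for a 1-D sample.

    Assumes the 1.5 * IQR whisker rule (an assumption, not stated in the slides).
    """
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    return {
        "mean": float(x.mean()),
        "median": float(median),
        "q1": float(q1),
        "q3": float(q3),
        "min": float(inside.min()),  # end of the lower whisker
        "max": float(inside.max()),  # end of the upper whisker
        "outliers": np.sort(x[(x < lower_fence) | (x > upper_fence)]).tolist(),
    }

# Hypothetical sample, not taken from the Boston Housing data.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
stats = box_plot_stats(sample)
print(stats)
```

A conditional box plot repeats the same computation once per group, for example once for each value of a categorical variable.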
Example: Boston Housing Data (Step 4)
Data exploration: Multiple variables
- Correlation analysis: shows the correlation between every pair of variables.
- Helps select one representative among highly correlated (positively or negatively) variables.

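A correlation matrix of this kind can be computed as sketched below. The data are synthetic and generated only for the illustration: columns `a`, `b`, `c`, and `d` are hypothetical variables, built so that `b` is strongly positively correlated with `a`, `c` strongly negatively, and `d` not at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic variables (hypothetical, for illustration only).
a = rng.normal(size=n)
b = 2.0 * a + rng.normal(scale=0.1, size=n)  # strongly positively correlated with a
c = -a + rng.normal(scale=0.1, size=n)       # strongly negatively correlated with a
d = rng.normal(size=n)                       # unrelated to a

# Rows are records, columns are variables, hence rowvar=False.
data = np.column_stack([a, b, c, d])
corr = np.corrcoef(data, rowvar=False)

print(np.round(corr, 2))
```

Entries close to +1 or -1 flag pairs that carry nearly the same information, so one variable of each such pair is a candidate for removal.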
Example: Boston Housing Data (Step 4)
Data exploration: Multiple variables
- Scatter plot matrix: shows the relationship between each pair of variables.

Example: Boston Housing Data (Step 4)
Data exploration: Multiple variables
- Pivot table in Excel: a user-specified data summarization tool.
- Can reveal non-linear relationships between two variables.

Dimensionality Reduction (Step 4)
Data customization: Dimensionality reduction
- Variable selection: select a small subset of the original variables.
  - Filter: variable selection and model building are independent of each other.
  - Wrapper: variable selection is guided by the results of data mining models (forward, backward, stepwise).
- Variable extraction: construct a small set of new variables by transforming and combining the original variables.
  - An independent performance criterion is used.

Dimensionality Reduction (Step 4)
Variable selection: Filter approach
- Example: select variables based on the correlation matrix.
- Remove NOX, AGE, DIS, and TAX.
  - NOX, DIS, and TAX are highly correlated with INDUS.
  - AGE is highly correlated with NOX.

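One simple way to automate a correlation-based filter is sketched below: walk through the variables in order and keep a variable only if it is not highly correlated with any variable already kept. The function name `correlation_filter`, the 0.7 threshold, and the synthetic variables `x1` to `x4` are assumptions made for the illustration; the slides pick the variables to remove by inspecting the correlation matrix and do not prescribe this procedure.

```python
import numpy as np

def correlation_filter(data, names, threshold=0.7):
    """Keep a variable only if |correlation| with every kept variable is below threshold.

    data: array with records as rows and variables as columns.
    The threshold value is an assumption, not taken from the slides.
    """
    corr = np.corrcoef(data, rowvar=False)
    kept = []
    for j in range(len(names)):
        if all(abs(corr[j, k]) < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic data (hypothetical): x2 duplicates x1, x4 mirrors x3.
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
x4 = -x3 + rng.normal(scale=0.1, size=n)
data = np.column_stack([x1, x2, x3, x4])

selected = correlation_filter(data, ["x1", "x2", "x3", "x4"])
print(selected)
```

Because no model is trained at any point, the selection is independent of the model building step, which is what makes this a filter rather than a wrapper. Note that the result depends on the order in which the variables are visited.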
Dimensionality Reduction (Step 4)
Variable selection: Wrapper approach
- Select variables based on the model results.
- Forward selection
  - Start with the most relevant variable.
  - Add another variable if it increases the accuracy of the data mining model.
- Backward elimination
  - Start with all variables.
  - Remove the most irrelevant variable if doing so increases (or at least does not decrease) the accuracy of the data mining model.
- Stepwise selection
  - Alternate between forward selection and backward elimination.

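Forward selection can be sketched as follows. This is a minimal illustration under stated assumptions: the model is ordinary least squares, "accuracy" is measured as mean squared error on a held-out validation set, and the data are synthetic. The function names `validation_mse` and `forward_selection` are hypothetical; the slides do not fix the model or the accuracy measure.

```python
import numpy as np

def validation_mse(X_train, y_train, X_val, y_val, columns):
    """Fit least squares on the chosen columns (plus intercept); return validation MSE."""
    def design(X):
        return np.column_stack([np.ones(len(X)), X[:, columns]])
    coef, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)
    residual = y_val - design(X_val) @ coef
    return float(np.mean(residual ** 2))

def forward_selection(X_train, y_train, X_val, y_val):
    """Greedily add the variable that lowers validation MSE the most; stop when none does."""
    selected = []
    remaining = list(range(X_train.shape[1]))
    # Baseline: intercept-only model.
    best_error = validation_mse(X_train, y_train, X_val, y_val, selected)
    while remaining:
        errors = {j: validation_mse(X_train, y_train, X_val, y_val, selected + [j])
                  for j in remaining}
        best_j = min(errors, key=errors.get)
        if errors[best_j] >= best_error:
            break  # no candidate improves the model
        selected.append(best_j)
        remaining.remove(best_j)
        best_error = errors[best_j]
    return selected, best_error

# Synthetic data (hypothetical): records are rows; only columns 0 and 2 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

selected, error = forward_selection(X[:200], y[:200], X[200:], y[200:])
print(selected, round(error, 3))
```

Backward elimination mirrors this loop by starting from all columns and removing one at a time, and stepwise selection alternates the two moves. Every candidate requires refitting the model, which is why wrappers cost more than filters.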
Dimensionality Reduction (Step 4)
Variable extraction: Principal component analysis (PCA)
- Purpose: preserve as much of the variance as possible with fewer bases.
- Example: [figure not reproduced]

Dimensionality Reduction (Step 4)
Variable extraction: Principal component analysis (PCA)
Mathematical background: Projection
[Slide figure not reproduced]

Dimensionality Reduction (Step 4)
Variable extraction: Principal component analysis (PCA)
Mathematical background: Covariance
- $X$: a data set ($m$ by $n$; $m$: number of variables, $n$: number of records).
- The covariance matrix is symmetric: $\mathrm{Cov}(X)_{ij} = \mathrm{Cov}(X)_{ji}$.
- Total variance of the data set $= \mathrm{tr}[\mathrm{Cov}(X)] = \mathrm{Cov}(X)_{11} + \mathrm{Cov}(X)_{22} + \mathrm{Cov}(X)_{33} + \dots + \mathrm{Cov}(X)_{mm}$.

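The symmetry and trace properties can be checked numerically. The sketch below assumes the covariance matrix is computed from mean-centered data as $\frac{1}{n}(X - \bar{X})(X - \bar{X})^{\top}$; the slides do not show whether they divide by $n$ or $n - 1$, so the $1/n$ factor is an assumption. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data set: m = 3 variables (rows), n = 500 records (columns).
scales = np.array([[1.0], [2.0], [0.5]])
X = rng.normal(size=(3, 500)) * scales

# Center each variable, then form the m-by-m covariance matrix (1/n convention, assumed).
X_centered = X - X.mean(axis=1, keepdims=True)
cov = X_centered @ X_centered.T / X.shape[1]

# Total variance is the trace, i.e. the sum of the per-variable variances.
total_variance = np.trace(cov)
print(np.round(cov, 3))
print(round(float(total_variance), 3))
```

The diagonal holds the variance of each variable, so the trace adds up how much the data set varies in total, which is the quantity PCA tries to preserve.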
Dimensionality Reduction (Step 4)
Variable extraction: Principal component analysis (PCA)
Mathematical background: Eigen problem
- If $A$ is an $m$ by $m$ symmetric matrix (as a covariance matrix is):
  - There are $m$ real eigenvalues (counted with multiplicity) and $m$ corresponding eigenvectors.
  - The eigenvectors can be chosen to be mutually orthogonal.
  - $\mathrm{tr}(A) = \lambda_1 + \lambda_2 + \lambda_3 + \dots + \lambda_m$

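These properties can be verified numerically for a symmetric matrix, which is the case that matters for PCA because covariance matrices are symmetric. The matrix `A` below is a hypothetical example, and `numpy.linalg.eigh` is used because it is designed for symmetric input.

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a covariance matrix.
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 1.0]])

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Each column v satisfies A v = lambda v.
residual = A @ eigenvectors - eigenvectors * eigenvalues

print(np.round(eigenvalues, 4))
print(np.round(eigenvectors.T @ eigenvectors, 4))  # identity: columns are orthonormal
print(round(float(eigenvalues.sum()), 4), np.trace(A))
```

Together with the covariance slide, the trace identity means the eigenvalues of a covariance matrix add up to the total variance of the data set.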
Dimensionality Reduction (Step 4)
Variable extraction: Principal component analysis (PCA)
PCA Procedure 1: Normalize the data
[Figure: two scatter plots on axes x1 and x2]

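Two common readings of "normalize" are sketched below: centering each variable at zero, and additionally scaling each variable to unit variance. The slides do not state which one is intended, so both are shown; the function names `center` and `standardize` and the synthetic data are assumptions made for the illustration.

```python
import numpy as np

def center(X):
    """Subtract each variable's mean. X is m by n (variables in rows, records in columns)."""
    return X - X.mean(axis=1, keepdims=True)

def standardize(X):
    """Center, then divide by each variable's standard deviation (assumed non-zero)."""
    return center(X) / X.std(axis=1, keepdims=True)

# Synthetic data (hypothetical): two variables with different means and spreads.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=10.0, scale=3.0, size=200),
               rng.normal(loc=-5.0, scale=0.5, size=200)])

X_centered = center(X)
X_standardized = standardize(X)
print(np.round(X_centered.mean(axis=1), 6), np.round(X_standardized.std(axis=1), 6))
```

Centering is what the variance formulas rely on. Scaling matters when variables are measured in very different units, since otherwise the variable with the largest numeric spread dominates the principal components.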
Dimensionality Reduction (Step 4)
Variable extraction: Principal component analysis (PCA)
PCA Procedure 2: Formulate the problem
- If a set of vectors $x$ is projected onto $w$, the projected data have some variance $V$.
- PCA aims to maximize $V$.

Dimensionality Reduction (Step 4)
Variable extraction: Principal component analysis (PCA)
PCA Procedure 3: Solve the problem
- Use a Lagrangian multiplier.

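Procedures 2 and 3 can be written out as follows. This is a sketch of the standard derivation, not a transcription of the slides: it assumes that the $m$ by $n$ data matrix $X$ with columns $x_i$ has already been centered, that $w$ is constrained to unit length, and that variance is normalized by $1/n$. The symbol $S$ for the covariance matrix is introduced here for convenience.

```latex
\begin{align}
V &= \frac{1}{n}\sum_{i=1}^{n} \left(w^{\top} x_i\right)^2
   = w^{\top}\left(\frac{1}{n} X X^{\top}\right) w
   = w^{\top} S w \\
\max_{w}\; & w^{\top} S w \quad \text{subject to} \quad w^{\top} w = 1 \\
L(w, \lambda) &= w^{\top} S w - \lambda \left(w^{\top} w - 1\right) \\
\frac{\partial L}{\partial w} &= 2 S w - 2 \lambda w = 0
   \;\Longrightarrow\; S w = \lambda w \\
V &= w^{\top} S w = \lambda\, w^{\top} w = \lambda
\end{align}
```

Under these assumptions the optimal $w$ is an eigenvector of the covariance matrix, and the variance it preserves equals the corresponding eigenvalue, so the best single basis is the eigenvector with the largest eigenvalue.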
Dimensionality Reduction (Step 4)
Variable extraction: Principal component analysis (PCA)
PCA Procedure 4: Select the bases
- Select the bases in descending order of their eigenvalues.
- With only one basis, 96% of the original variance is preserved.

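The share of variance preserved by the first k bases is the sum of the k largest eigenvalues divided by the sum of all eigenvalues. The sketch below computes it for a hypothetical 2 by 2 covariance matrix; the matrix is made up for the illustration and is not the one behind the 96% figure, and the function names `rank_bases` and `bases_needed` are assumptions.

```python
import numpy as np

def rank_bases(cov):
    """Return eigenvalues and eigenvectors sorted by descending eigenvalue,
    plus the cumulative fraction of variance preserved by the first k bases."""
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order
    order = np.argsort(eigenvalues)[::-1]            # flip to descending
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    preserved = np.cumsum(eigenvalues) / eigenvalues.sum()
    return eigenvalues, eigenvectors, preserved

def bases_needed(preserved, target):
    """Smallest number of bases whose preserved fraction reaches target (target below 1.0)."""
    return int(np.argmax(preserved >= target)) + 1

# Hypothetical covariance matrix of two strongly correlated variables.
cov = np.array([[2.0, 1.9],
                [1.9, 2.0]])

eigenvalues, eigenvectors, preserved = rank_bases(cov)
k = bases_needed(preserved, target=0.95)
print(np.round(eigenvalues, 3), np.round(preserved, 3), k)
```

In this made-up case the first basis alone preserves 97.5% of the variance, so a 95% target is met with a single basis.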
Dimensionality Reduction (Step 4)
Variable extraction: Principal component analysis (PCA)
PCA Procedure 5: Construct new data
[Figure: data on axes x1 and x2 and the new axis z1]

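The five procedures can be put together in a short sketch: center the data, form the covariance matrix, solve the eigen problem, keep the top k eigenvectors, and project the data onto them to obtain the new variables. The function name `pca_transform`, the $1/n$ covariance convention, and the synthetic two-variable data set are assumptions made for the illustration.

```python
import numpy as np

def pca_transform(X, k):
    """Project an m-by-n data set (variables in rows) onto its top-k principal axes.

    Returns Z (k by n new data), W (m by k bases as columns), and the top-k eigenvalues.
    """
    X_centered = X - X.mean(axis=1, keepdims=True)     # Procedure 1 (centering only)
    cov = X_centered @ X_centered.T / X.shape[1]       # covariance, 1/n convention (assumed)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)    # Procedure 3
    order = np.argsort(eigenvalues)[::-1][:k]          # Procedure 4: largest eigenvalues first
    W = eigenvectors[:, order]
    Z = W.T @ X_centered                               # Procedure 5: z = w^T x
    return Z, W, eigenvalues[order]

# Synthetic data (hypothetical): x2 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + 0.2 * rng.normal(size=300)
X = np.vstack([x1, x2])

Z, W, top_eigenvalues = pca_transform(X, k=1)
print(Z.shape, np.round(W.ravel(), 3), np.round(top_eigenvalues, 3))
```

The variance of the new variable z1 equals the largest eigenvalue, which ties the constructed data back to the quantity being maximized.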
Dimensionality Reduction (Step 4)
PCA Example: Breakfast cereals
- Original data [table not reproduced]
- When there are only two variables [figure not reproduced]
- Covariance matrix and scatter plot of Calories and Rating [figure not reproduced]
- Eigenvalues and eigenvectors [figure not reproduced]
- Newly constructed variables [figure not reproduced]
- General case: more than two variables [figure not reproduced]
- Scatter plot on principal components [figure not reproduced]