2011 Data Mining
Pilsung Kang, Industrial & Information Systems Engineering, Seoul National University of Science & Technology
Chapter 3: Data Exploration & Dimension Reduction

Slide 2: Steps in Data Mining revisited
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model

Slide 3: Example: Boston Housing Data (step 1)
Define and understand the purpose of the data mining project
- Stabilize the local economy by keeping home prices stable.

Slide 4: Example: Boston Housing Data (step 2)
Formulate the data mining problem
- What is the purpose? To predict the median value of a housing unit in the neighborhood.
- What data mining task is appropriate? Prediction.

Slide 5: Example: Boston Housing Data (step 3)
Obtain/verify/modify the data: Data acquisition

Variable  Description
CRIM      Per capita crime rate by town
ZN        Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS     Proportion of non-retail business acres per town
CHAS      Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX       Nitric oxides concentration (parts per 10 million)
RM        Average number of rooms per dwelling
AGE       Proportion of owner-occupied units built prior to 1940
DIS       Weighted distances to five Boston employment centres
RAD       Index of accessibility to radial highways
TAX       Full-value property-tax rate per $10,000
PTRATIO   Pupil-teacher ratio by town
B         1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT     Lower status of the population
MEDV      Median value of owner-occupied homes in $1000

Slide 6: Example: Boston Housing Data (step 3)
Obtain/verify/modify the data: Data example

Slide 7: Example: Boston Housing Data (step 4)
Data exploration: Basic statistics

Slide 8: Example: Boston Housing Data (step 4)
Data exploration: Single variable
- Histogram: shows the rough distribution of a single variable.
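A histogram boils down to counting values per bin. The sketch below assumes equal-width bins; the function name `histogram`, the bin count, and the choice of input (the ten x1 values from the PCA worked example later in this transcript, not the Boston data) are illustrative assumptions.

```python
def histogram(values, n_bins):
    """Return (bin_edges, counts) for n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: put everything in the first bin
        else:
            # Clamp so the maximum value lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# x1 values taken from the PCA worked example in this transcript.
x1 = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1]
edges, counts = histogram(x1, n_bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:.2f}, {right:.2f}) {'#' * count}")
```

The number of bins controls how rough the picture is: too few bins hide the shape, too many leave most bins empty.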

Slide 9: Example: Boston Housing Data (step 4)
Data exploration: Single variable
- Box plot: shows the basic statistics of a single variable.
Figure: a single box plot and a conditional box plot, labeled with the median, mean, quartile 1, quartile 3, "max", "min", and outliers.
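The quantities a box plot draws can be computed directly. This is a minimal sketch that assumes the common 1.5 x IQR rule for the whisker ends (the quoted "max" and "min") and for flagging outliers; the slide does not state which rule its figure uses, and the function name `box_plot_stats` is hypothetical.

```python
from statistics import mean, quantiles


def box_plot_stats(values):
    """Summary statistics behind a box plot, using the 1.5 x IQR whisker rule."""
    q1, median, q3 = quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": mean(values),
        "whisker_min": min(inside),  # the box plot's "min"
        "whisker_max": max(inside),  # the box plot's "max"
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# x1 values taken from the PCA worked example in this transcript.
x1 = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1]
stats = box_plot_stats(x1)
print(stats)
```

A conditional box plot repeats the same computation once per group, for example once for each value of a categorical variable.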

Slide 10: Example: Boston Housing Data (step 4)
Data exploration: Multiple variables
- Correlation analysis: shows the correlation between every pair of variables.
- Helps select a representative variable from among highly correlated (positively or negatively) variables.
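A correlation matrix is the Pearson correlation coefficient evaluated for every pair of variables. The sketch below is one way to compute it; the function names and the input layout (a dict mapping variable names to lists of values) are assumptions, and the data are the x1 and x2 rows from the PCA worked example in this transcript.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict: result[a][b] is the correlation between columns a and b."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}


data = {
    "x1": [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1],
    "x2": [2.4, 0.7, 2.9, 2.2, 3, 2.7, 1.6, 1.1, 1.6, 0.9],
}
corr = correlation_matrix(data)
for name, row in corr.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Values close to +1 or -1 mark pairs of variables that carry nearly the same information.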

Slide 11: Example: Boston Housing Data (step 4)
Data exploration: Multiple variables
- Scatter plot matrix: shows the relationship between each pair of variables.

Slide 12: Example: Boston Housing Data (step 4)
Data exploration: Multiple variables
- Pivot table in Excel: a user-specified data summarization tool.
- Can reveal non-linear relationships between two variables.

Slide 13: Dimensionality Reduction (step 4)
Data customization: Dimensionality reduction
- Variable selection: select a small set of the original variables.
  - Filter: the variable selection and model building processes are independent.
  - Wrapper: variable selection is guided by the results of data mining models (forward, backward, stepwise).
- Variable extraction: construct a small set of new variables by transforming and combining the original variables.
  - An independent performance criterion is used.

Slide 14: Dimensionality Reduction (step 4)
Variable selection: Filter approach
- Example: select variables based on the correlation matrix.
- Remove NOX, AGE, DIS, and TAX.
  - NOX, DIS, and TAX are highly correlated with INDUS.
  - AGE is highly correlated with NOX.
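A correlation-based filter like the one in this example can be sketched as a greedy pass over the variables: keep a variable unless it is highly correlated with one already kept. The 0.9 threshold, the function names, and the toy columns `a`, `b`, `c` are all made up for illustration; they are not the Boston data and not the cut-off used on the slide.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient (assumes neither column is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def filter_by_correlation(columns, threshold=0.9):
    """Greedy filter: drop a variable if |r| >= threshold with any kept variable."""
    kept, dropped = [], []
    for name, values in columns.items():
        if any(abs(pearson(values, columns[k])) >= threshold for k in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


# Made-up toy columns: b is almost a multiple of a, c is unrelated to a.
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2.1, 3.9, 6.2, 7.8, 10.1],
    "c": [5, 1, 4, 2, 3],
}
kept, dropped = filter_by_correlation(toy)
print("kept:", kept, "dropped:", dropped)
```

The result depends on the order in which variables are visited, so in practice the analyst decides which member of a correlated group is the representative one. No model is trained at any point, which is what makes this a filter rather than a wrapper.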

Slide 15: Dimensionality Reduction (step 4)
Variable selection: Wrapper approach
- Select variables based on the model results.
- Forward selection: start with the most relevant variable; add another variable if it increases the accuracy of the data mining model.
- Backward elimination: start with all variables; remove the least relevant variable if the accuracy of the data mining model increases (or at least does not decrease).
- Stepwise selection: alternate forward selection and backward elimination.
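Forward selection can be sketched as a loop around any scoring function that trains and evaluates a model on a candidate subset. In the sketch below the `score` callable stands in for that step; `toy_score`, the variable names `v1` to `v3`, and the gain values are invented purely so the example runs, and are not results from any real model.

```python
def forward_selection(candidates, score):
    """Greedy forward selection.

    `score(subset)` must return the model accuracy for a list of variables;
    higher is better. Stops when no remaining variable improves the score.
    """
    selected, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        # Try adding each remaining variable and keep the best single addition.
        top_score, top_var = max(
            ((score(selected + [v]), v) for v in remaining), key=lambda t: t[0]
        )
        if top_score <= best:
            break  # no addition increases the accuracy
        selected.append(top_var)
        remaining.remove(top_var)
        best = top_score
    return selected, best


# Hypothetical stand-in for "train a model and measure its accuracy".
GAIN = {"v1": 0.5, "v2": 0.3, "v3": 0.01}


def toy_score(subset):
    # Each variable adds its gain, minus a small cost per variable used.
    return sum(GAIN[v] for v in subset) - 0.05 * len(subset)


selected, best = forward_selection(["v1", "v2", "v3"], toy_score)
print(selected, round(best, 2))
```

Backward elimination mirrors this loop: start from all variables and repeatedly remove the one whose removal does not lower the score. Stepwise selection alternates the two moves.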

Slide 16: Dimensionality Reduction (step 4)
Variable extraction: Principal component analysis (PCA)
- Purpose: preserve as much of the variance as possible with fewer bases.
- Example (figure)

Slide 17: Dimensionality Reduction (step 4)
Variable extraction: Principal component analysis (PCA)
- Mathematical background: Projection (figure)
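The slide's projection figure is not preserved in the transcript. As a reminder of the standard definition, the projection of a vector x onto a direction w is ((w . x) / (w . w)) w. The sketch below uses hypothetical helper names `dot` and `project`.

```python
def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(x, w):
    """Project x onto the direction w.

    Returns (coefficient, projected_vector). When w has unit length, the
    coefficient is the coordinate of x along w.
    """
    coefficient = dot(x, w) / dot(w, w)
    return coefficient, [coefficient * wi for wi in w]


coefficient, shadow = project([3.0, 4.0], [1.0, 0.0])
print(coefficient, shadow)
```

PCA uses exactly this coordinate along w as the value of the new variable for each record.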

Slide 18: Dimensionality Reduction (step 4)
Variable extraction: Principal component analysis (PCA)
- Mathematical background: Covariance
  - X: a data set (m by n; m: number of variables, n: number of records).
  - Cov(X)_ij = Cov(X)_ji (the covariance matrix is symmetric).
  - Total variance of the data set = tr[Cov(X)] = Cov(X)_11 + Cov(X)_22 + Cov(X)_33 + ... + Cov(X)_mm
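The two properties above (symmetry, and total variance as the trace) can be checked numerically. This sketch follows the slide's m-by-n layout, with one row per variable; the n - 1 divisor (sample covariance) is an assumption, since the slide's own formula is not preserved, and the data are the x1 and x2 rows from the PCA worked example in this transcript.

```python
def covariance_matrix(rows):
    """Covariance matrix of an m-by-n data set (m variables, n records)."""
    n = len(rows[0])
    means = [sum(row) / n for row in rows]
    centered = [[v - mu for v in row] for row, mu in zip(rows, means)]
    # Entry (i, j) is the averaged product of centered variables i and j.
    return [
        [sum(a * b for a, b in zip(ri, rj)) / (n - 1) for rj in centered]
        for ri in centered
    ]


X = [
    [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1],  # x1
    [2.4, 0.7, 2.9, 2.2, 3, 2.7, 1.6, 1.1, 1.6, 0.9],  # x2
]
cov = covariance_matrix(X)
total_variance = sum(cov[i][i] for i in range(len(cov)))  # trace
print([[round(v, 4) for v in row] for row in cov], round(total_variance, 4))
```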

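Slide 19: Dimensionality Reduction (step 4)
Variable extraction: Principal component analysis (PCA)
- Mathematical background: Eigen problem
  - For an m-by-m matrix A, find scalars λ (eigenvalues) and non-zero vectors x (eigenvectors) such that Ax = λx.
  - If A is an m-by-m symmetric matrix, as a covariance matrix is, it has m real eigenvalues (counted with multiplicity) and m corresponding eigenvectors.
  - The eigenvectors can be chosen to be mutually orthogonal.
  - tr(A) = λ_1 + λ_2 + λ_3 + ... + λ_m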
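For a symmetric 2-by-2 matrix these properties can be verified with a closed-form solution. The sketch below is limited to that 2-by-2 case; the function name and the example matrix [[2, 1], [1, 2]] are illustrative choices, not taken from the slides.

```python
from math import hypot


def eigen_symmetric_2x2(a, b, c):
    """Eigenvalues and unit eigenvectors of [[a, b], [b, c]], largest first."""
    mid = (a + c) / 2
    half_gap = hypot((a - c) / 2, b)
    l1, l2 = mid + half_gap, mid - half_gap
    if b == 0:
        # Already diagonal: the coordinate axes are the eigenvectors.
        v1, v2 = ((1.0, 0.0), (0.0, 1.0)) if a >= c else ((0.0, 1.0), (1.0, 0.0))
    else:
        norm = hypot(b, l1 - a)
        v1 = (b / norm, (l1 - a) / norm)
        v2 = (-v1[1], v1[0])  # perpendicular to v1
    return (l1, l2), (v1, v2)


(l1, l2), (v1, v2) = eigen_symmetric_2x2(2.0, 1.0, 2.0)
print("eigenvalues:", l1, l2)
print("trace equals eigenvalue sum:", l1 + l2 == 2.0 + 2.0)
print("eigenvectors orthogonal:", v1[0] * v2[0] + v1[1] * v2[1] == 0)
```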

Slide 20: Dimensionality Reduction (step 4)
Variable extraction: Principal component analysis (PCA)
- PCA Procedure 1: Normalize the data (here, subtract each variable's mean)

Original data:
x1    2.5    0.5    2.2    1.9    3.1    2.3    2      1      1.5    1.1
x2    2.4    0.7    2.9    2.2    3      2.7    1.6    1.1    1.6    0.9

Normalized data:
x1    0.69  -1.31   0.39   0.09   1.29   0.49   0.19  -0.81  -0.31  -0.71
x2    0.49  -1.21   0.99   0.29   1.09   0.79  -0.31  -0.81  -0.31  -1.01
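The normalized rows on this slide are the original values minus each variable's mean (1.81 for x1 and 1.91 for x2), with no rescaling by the standard deviation. A short sketch that reproduces them, using the hypothetical helper name `center`:

```python
def center(values):
    """Subtract the mean so that the variable has mean zero."""
    mu = sum(values) / len(values)
    return [v - mu for v in values]


x1 = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1]
x2 = [2.4, 0.7, 2.9, 2.2, 3, 2.7, 1.6, 1.1, 1.6, 0.9]

c1, c2 = center(x1), center(x2)
print([round(v, 2) for v in c1])
print([round(v, 2) for v in c2])
```

Centering matters because the variance of a projection is measured around the mean; without it, the first basis would mostly point toward the data's mean rather than along its spread.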

Slide 21: Dimensionality Reduction (step 4)
Variable extraction: Principal component analysis (PCA)
- PCA Procedure 2: Formulate the problem
  - If a set of normalized vectors x is projected onto a unit vector w, the variance after projection becomes V = w^T Cov(X) w.
  - PCA aims at maximizing V subject to w^T w = 1.

Slide 22: Dimensionality Reduction (step 4)
Variable extraction: Principal component analysis (PCA)
- PCA Procedure 3: Solve the problem
  - Use a Lagrange multiplier.
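The slide's own equations are not preserved in the transcript. A standard way to carry out this step, assuming the projected variance is written as V = w^T S w with S = Cov(X) and the direction w constrained to unit length (the symbol S and the exact form of the constraint are assumptions), is:

```latex
\begin{aligned}
\max_{w}\;\; & V = w^{\top} S w \quad \text{subject to } w^{\top} w = 1 \\
L(w, \lambda) &= w^{\top} S w - \lambda \,\bigl(w^{\top} w - 1\bigr) \\
\frac{\partial L}{\partial w} &= 2 S w - 2 \lambda w = 0
  \;\;\Longrightarrow\;\; S w = \lambda w \\
V &= w^{\top} S w = \lambda \, w^{\top} w = \lambda
\end{aligned}
```

So every stationary direction w is an eigenvector of the covariance matrix, and the variance preserved along it equals its eigenvalue. This is why the bases are ranked by eigenvalue.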

Slide 23: Dimensionality Reduction (step 4)
Variable extraction: Principal component analysis (PCA)
- PCA Procedure 4: Select the bases
  - Select bases in descending order of eigenvalues.
  - With only one basis, 96% of the original variance is preserved.

Slide 24: Dimensionality Reduction (step 4)
Variable extraction: Principal component analysis (PCA)
- PCA Procedure 5: Construct new data

x1    0.69  -1.31   0.39   0.09   1.29   0.49   0.19  -0.81  -0.31  -0.71
x2    0.49  -1.21   0.99   0.29   1.09   0.79  -0.31  -0.81  -0.31  -1.01
z1    0.83  -1.78   0.99   0.27   1.68   0.91  -0.10  -1.14  -0.44  -1.22
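Procedures 4 and 5 can be sketched end to end for this two-variable example. The code assumes the sample covariance (n - 1 divisor) and uses the closed-form eigen decomposition of a symmetric 2-by-2 matrix; the variable names are illustrative, and a general data set would need a numerical eigen solver instead.

```python
from math import hypot

x1 = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1]
x2 = [2.4, 0.7, 2.9, 2.2, 3, 2.7, 1.6, 1.1, 1.6, 0.9]


def center(values):
    mu = sum(values) / len(values)
    return [v - mu for v in values]


def cov(a, b):
    # a and b are already centered; the n - 1 divisor is an assumption.
    return sum(p * q for p, q in zip(a, b)) / (len(a) - 1)


c1, c2 = center(x1), center(x2)
s11, s12, s22 = cov(c1, c1), cov(c1, c2), cov(c2, c2)

# Eigenvalues of the symmetric matrix [[s11, s12], [s12, s22]].
mid = (s11 + s22) / 2
half_gap = hypot((s11 - s22) / 2, s12)
lam1, lam2 = mid + half_gap, mid - half_gap

# Procedure 4: share of the total variance kept by the first basis.
share = lam1 / (lam1 + lam2)

# Unit eigenvector for lam1 (this form is valid because s12 != 0 here).
norm = hypot(s12, lam1 - s11)
w = (s12 / norm, (lam1 - s11) / norm)

# Procedure 5: z1 is the projection of each centered record onto w.
z1 = [w[0] * a + w[1] * b for a, b in zip(c1, c2)]

print(f"variance preserved by one basis: {share:.0%}")
print([round(z, 2) for z in z1])
```

Run on the slide's data, this gives the 96% figure and the z1 row of the table. The sign of an eigenvector is arbitrary, so another solver may return z1 with every sign flipped.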

Slide 25: Dimensionality Reduction (step 4)
PCA Example: Breakfast cereals
- Original data

Slide 26: Dimensionality Reduction (step 4)
PCA Example: Breakfast cereals
- When there are only two variables

Slide 27: Dimensionality Reduction (step 4)
PCA Example: Breakfast cereals
- Covariance matrix

            Calories    Rating
Calories      379.63   -188.68
Rating       -188.68    197.32

- Scatter plot (figure)

Slide 28: Dimensionality Reduction (step 4)
PCA Example: Breakfast cereals
- Eigenvalues and eigenvectors
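The slide's eigenvalue table is not preserved, but the values can be recomputed from the Calories/Rating covariance matrix given on the previous slide. This sketch solves the 2-by-2 characteristic equation through the trace and determinant; the function name and the sign convention (first component non-negative) are assumptions.

```python
from math import hypot, sqrt


def eigen_from_trace_det(a, b, c):
    """Eigen decomposition of [[a, b], [b, c]] via its trace and determinant."""
    trace, det = a + c, a * c - b * b
    root = sqrt(trace * trace / 4 - det)
    l1, l2 = trace / 2 + root, trace / 2 - root
    if b == 0:
        v1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        norm = hypot(b, l1 - a)
        v1 = (b / norm, (l1 - a) / norm)
        if v1[0] < 0:
            v1 = (-v1[0], -v1[1])  # eigenvector sign is arbitrary; fix it
    return l1, l2, v1


# Covariance of Calories and Rating from the breakfast cereals example.
l1, l2, v1 = eigen_from_trace_det(379.63, -188.68, 197.32)
share = l1 / (l1 + l2)

print(f"eigenvalues: {l1:.2f}, {l2:.2f}")
print(f"first basis (Calories, Rating): ({v1[0]:.3f}, {v1[1]:.3f})")
print(f"variance preserved by the first basis: {share:.1%}")
```

The opposite signs in the first basis reflect the negative covariance: cereals with more calories tend to have lower ratings.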

Slide 29: Dimensionality Reduction (step 4)
PCA Example: Breakfast cereals
- Newly constructed variables

Slide 30: Dimensionality Reduction (step 4)
PCA Example: Breakfast cereals
- General case: more than two variables

Slide 31: Dimensionality Reduction (step 4)
PCA Example: Breakfast cereals
- Scatter plot on principal components
