2011 Data Mining, Pilsung Kang, Industrial & Information Systems Engineering, Seoul National University of Science & Technology


Chapter 3: Data Exploration & Dimension Reduction

Data Mining, IISE, SNUT
Steps in Data Mining revisited
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model

Example: Boston Housing Data (Step 1). Define and understand the purpose of the data mining project: keep the local economy stable by maintaining home prices.

Example: Boston Housing Data (Step 2). Formulate the data mining problem. What is the purpose? To predict the median value of a housing unit in the neighborhood. What data mining task is appropriate? Prediction.

Example: Boston Housing Data (Step 3). Obtain/verify/modify the data: data acquisition.

Variable  Description
CRIM      per capita crime rate by town
ZN        proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS     proportion of non-retail business acres per town
CHAS      Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX       nitric oxides concentration (parts per 10 million)
RM        average number of rooms per dwelling
AGE       proportion of owner-occupied units built prior to 1940
DIS       weighted distances to five Boston employment centres
RAD       index of accessibility to radial highways
TAX       full-value property-tax rate per $10,000
PTRATIO   pupil-teacher ratio by town
B         1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT     % lower status of the population
MEDV      median value of owner-occupied homes in $1000s

Example: Boston Housing Data (Step 3). Obtain/verify/modify the data: data example. [Table: sample records]

Example: Boston Housing Data (Step 4). Data exploration: basic statistics.
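Such summary statistics are easy to reproduce; a minimal pandas sketch on a made-up four-record sample (the values are illustrative, not the real Boston data):

```python
import pandas as pd

# Hypothetical mini-sample of three Boston Housing variables
df = pd.DataFrame({
    "CRIM": [0.006, 0.027, 0.027, 0.032],
    "RM":   [6.575, 6.421, 7.185, 6.998],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

# describe() gives count, mean, std, min, quartiles, and max per variable
stats = df.describe()
```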

Example: Boston Housing Data (Step 4). Data exploration, single variable: a histogram shows the rough distribution of a single variable.

Example: Boston Housing Data (Step 4). Data exploration, single variable: a box plot shows the basic statistics of a single variable (median, quartiles 1 and 3, whisker "min"/"max", mean, and outliers). [Figure: a single box plot and a conditional box plot]

Example: Boston Housing Data (Step 4). Data exploration, multiple variables: correlation analysis shows the correlation between every pair of variables and helps to select one representative among highly correlated (positively or negatively) variables.
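A pairwise correlation matrix can be computed directly; a small sketch with toy data (x and y are constructed to be perfectly negatively correlated):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [10.0, 8.0, 6.0, 4.0, 2.0],   # exact linear decrease in x
    "z": [3.0, 1.0, 4.0, 1.0, 5.0],
})

# Pearson correlation for every pair of variables
corr = df.corr()
```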

Example: Boston Housing Data (Step 4). Data exploration, multiple variables: a scatter plot matrix shows the interaction between every pair of variables.

Example: Boston Housing Data (Step 4). Data exploration, multiple variables: a pivot table in Excel is a user-specified data summarization tool; it can reveal non-linear relations between two variables.
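The same kind of user-specified summarization can be done outside Excel with pandas; a sketch on a hypothetical CHAS/MEDV sample (six made-up neighborhoods):

```python
import pandas as pd

df = pd.DataFrame({
    "CHAS": [0, 0, 1, 1, 0, 1],               # river dummy variable
    "MEDV": [20.0, 22.0, 30.0, 34.0, 24.0, 32.0],
})

# Mean home value for tracts on vs. off the river
pivot = pd.pivot_table(df, values="MEDV", index="CHAS", aggfunc="mean")
```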

Dimensionality Reduction (Step 4). Data customization: dimensionality reduction.
- Variable selection: select a small subset of the original variables. Filter: variable selection and model building are independent. Wrapper: variable selection is guided by the results of data mining models (forward, backward, stepwise).
- Variable extraction: construct a small set of new variables by transforming and combining the original variables; an independent performance criterion is used.

Dimensionality Reduction. Variable selection, filter approach. Example: select variables based on the correlation matrix; remove NOX, AGE, DIS, and TAX, since NOX, DIS, and TAX are highly correlated with INDUS, and AGE is highly correlated with NOX.
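The filter idea can be sketched in code: greedily keep a variable only if it is not highly correlated with one already kept. The data here are synthetic (NOX is built as a near-copy of INDUS to mimic the slide's situation); the 0.9 threshold is an illustrative assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "INDUS": base,
    "NOX":   base + 0.05 * rng.normal(size=200),  # nearly duplicates INDUS
    "RM":    rng.normal(size=200),                # unrelated variable
})

def filter_correlated(df, threshold=0.9):
    """Keep a variable only if its |correlation| with every kept variable is below threshold."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept

kept = filter_correlated(df)   # NOX is dropped; INDUS and RM survive
```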

Dimensionality Reduction. Variable selection, wrapper approach: select variables based on model results.
- Forward selection: start with the most relevant variable; add another variable if it increases the accuracy of the data mining model.
- Backward elimination: start with all variables; remove the most irrelevant variable if the accuracy of the model increases (or at least does not decrease).
- Stepwise selection: alternate forward selection and backward elimination.
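Forward selection can be sketched as follows, using the residual sum of squares of a least-squares fit as the (inverse) accuracy measure; the synthetic data and the stopping threshold are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))
# Only columns 0 and 2 actually influence the target
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + 0.1 * rng.normal(size=n)

def sse(X, y, cols):
    """Residual sum of squares of a least-squares fit on the chosen columns (plus intercept)."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def forward_select(X, y, min_improvement=1.0):
    selected, remaining = [], list(range(X.shape[1]))
    current = sse(X, y, selected)
    while remaining:
        best = min(remaining, key=lambda c: sse(X, y, selected + [c]))
        new = sse(X, y, selected + [best])
        if current - new < min_improvement:   # stop when the gain is negligible
            break
        selected.append(best)
        remaining.remove(best)
        current = new
    return selected

chosen = forward_select(X, y)   # recovers the two informative columns
```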

Dimensionality Reduction. Variable extraction: principal component analysis (PCA). Purpose: preserve as much of the variance as possible with fewer bases. [Figure: example of projecting 2-D data onto one direction]

Dimensionality Reduction. Variable extraction: principal component analysis (PCA). Mathematical background, projection: the projection of a vector x onto a unit vector w has length w^T x, and the projected vector is (w^T x) w.
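A quick numeric check of the projection formula (w is a unit vector):

```python
import numpy as np

x = np.array([3.0, 4.0])
w = np.array([1.0, 0.0])            # a unit-length direction, ||w|| = 1

scalar_proj = w @ x                 # length of x along w
vector_proj = scalar_proj * w       # projection of x onto the line spanned by w
```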

Dimensionality Reduction. Variable extraction: principal component analysis (PCA). Mathematical background, covariance: let X be a data set (m by n; m = number of variables, n = number of records). The covariance matrix Cov(X) is m by m and symmetric: Cov(X)_ij = Cov(X)_ji. The total variance of the data set is tr[Cov(X)] = Cov(X)_11 + Cov(X)_22 + ... + Cov(X)_mm.
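These covariance facts can be verified numerically on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))   # m = 5 variables (rows), n = 100 records (columns)

S = np.cov(X)                   # m-by-m covariance matrix (rows are variables by default)
total_variance = np.trace(S)    # sum of the diagonal = total variance of the data set
```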

Dimensionality Reduction. Variable extraction: principal component analysis (PCA). Mathematical background, the eigen problem: if A is an m-by-m non-singular symmetric matrix (as a covariance matrix is), it has m real eigenvalue/eigenvector pairs Av_i = λ_i v_i, its eigenvectors are orthogonal, and tr(A) = λ_1 + λ_2 + ... + λ_m.
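These properties can be checked with numpy (np.linalg.eigh is the routine for symmetric matrices; it returns eigenvalues in ascending order and orthonormal eigenvectors):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T                             # a symmetric 4-by-4 matrix

eigvals, eigvecs = np.linalg.eigh(A)    # m real eigenvalues, orthonormal eigenvectors
```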

Dimensionality Reduction. Variable extraction: principal component analysis (PCA). PCA procedure 1: normalize the data, i.e., z-score each variable (subtract its mean and divide by its standard deviation). [Table: the variables x1 and x2 before and after normalization]
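Step 1 in code, z-scoring each column of a small made-up data set:

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])   # 4 records, 2 variables on very different scales

# Subtract each column's mean and divide by its (sample) standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```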

Dimensionality Reduction. Variable extraction: principal component analysis (PCA). PCA procedure 2: formulate the problem. If a set of (normalized, zero-mean) vectors x_i is projected onto w, the variance after projection is

V = (1/n) * sum_i (w^T x_i)^2 = w^T S w, where S = Cov(X).

PCA aims at maximizing V subject to a unit-length constraint:

max_w w^T S w  subject to  w^T w = 1

Dimensionality Reduction. Variable extraction: principal component analysis (PCA). PCA procedure 3: solve the problem using a Lagrangian multiplier:

L = w^T S w - λ(w^T w - 1)
∂L/∂w = 2Sw - 2λw = 0  =>  Sw = λw

So w must be an eigenvector of S, and the projected variance w^T S w = λ is the corresponding eigenvalue; the maximum is attained by the eigenvector with the largest eigenvalue.
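The solution can be verified numerically: the top eigenvector of the covariance matrix satisfies Sw = λw, and its projected variance equals the largest eigenvalue:

```python
import numpy as np

# Synthetic data with more variance along the first axis
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.diag([3.0, 1.0])
S = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
w = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
```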

Dimensionality Reduction. Variable extraction: principal component analysis (PCA). PCA procedure 4: select the bases in descending order of eigenvalues. In this example, with only one basis, 96% of the original variance is preserved.
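Selecting bases by explained variance ratio, sketched with hypothetical eigenvalues chosen to match the 96% figure on this slide:

```python
import numpy as np

# Hypothetical eigenvalues of a 2-variable covariance matrix
eigvals = np.array([9.6, 0.4])

order = np.argsort(eigvals)[::-1]        # indices in descending order of eigenvalue
ratio = eigvals[order] / eigvals.sum()   # fraction of total variance per component
```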

Dimensionality Reduction. Variable extraction: principal component analysis (PCA). PCA procedure 5: construct the new data, z_1 = w^T x for each record. [Table: original variables x1, x2 and the new variable z1]
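The five steps combine into a minimal end-to-end sketch on synthetic 2-D data, projecting each record onto the first principal component:

```python
import numpy as np

# Synthetic correlated 2-D data (illustrative, not the textbook example)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])

Xc = X - X.mean(axis=0)                 # step 1: center the data
S = np.cov(Xc, rowvar=False)            # steps 2-3: covariance, then its eigen problem
eigvals, eigvecs = np.linalg.eigh(S)
w = eigvecs[:, -1]                      # step 4: basis with the largest eigenvalue
z = Xc @ w                              # step 5: new 1-D variable z1 = w^T x
```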

Dimensionality Reduction. PCA example: breakfast cereals. Original data.

Dimensionality Reduction. PCA example: breakfast cereals. When there are only two variables.

Dimensionality Reduction. PCA example: breakfast cereals. Covariance matrix of the two variables, Calories and Rating, and their scatter plot. [Table: 2-by-2 covariance matrix of Calories and Rating]

Dimensionality Reduction. PCA example: breakfast cereals. Eigenvalues and eigenvectors.

Dimensionality Reduction. PCA example: breakfast cereals. Newly constructed variables.

Dimensionality Reduction. PCA example: breakfast cereals. General case: more than two variables.

Dimensionality Reduction. PCA example: breakfast cereals. Scatter plot on principal components.