2011 Data Mining Industrial & Information Systems Engineering Chapter 2: Overview of Data Mining Process Pilsung Kang Industrial & Information Systems.

Presentation transcript:

2011 Data Mining Industrial & Information Systems Engineering Chapter 2: Overview of Data Mining Process Pilsung Kang Industrial & Information Systems Engineering Seoul National University of Science & Technology

Data Mining, IISE, SNUT
Data Mining Definition Revisited
• Extracting useful information from large datasets. (Hand et al., 2001)
• Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. (Berry and Linoff, 1997, 2000)
• Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. (Gartner Group, 2004)

Descriptive vs. Predictive (purpose)
Descriptive Modeling
• Look back to the past.
• Extract compact and easily understood information from large, sometimes gigantic, databases.
• OLAP (online analytical processing), SQL (structured query language).
Predictive Modeling
• Predict the future.
• Identify strong links between variables of data.
• Predict the unknown consequence (dependent variable) based on the information provided (independent variables).
• y = f(x_1, x_2, ..., x_n) + ε

Supervised vs. Unsupervised (methods)
Supervised Learning
• Goal: predict a single "target" or "outcome" variable.
• Finds relations between X and Y.
• Train (learn) on data where the target value is known.
• Score data where the target value is not known.
Unsupervised Learning
• Explores intrinsic characteristics.
• Estimates the underlying distribution.
• Segments data into meaningful groups or detects patterns.
• There is no target (outcome) variable to predict or classify.

Data Mining Techniques 1: Data Visualization
• Graphs and plots of data.
• Histograms, boxplots, bar charts, scatterplots.
• Especially useful for examining relationships between pairs of variables.
• Descriptive & Unsupervised

Data Mining Techniques 2: Data Reduction
• Distillation of complex/large data into simpler/smaller data.
• Reducing the number of variables/columns: also called dimensionality reduction (variable selection; variable extraction, e.g., principal component analysis).
• Reducing the number of records/rows: also called data compression (e.g., sampling and clustering).
• Descriptive & Unsupervised

Data Mining Techniques 3: Segmentation/Clustering
• Goal: divide the entire data into a small number of subgroups.
• Records are homogeneous within groups and heterogeneous between groups.
• Examples: market segmentation, social network analysis.
• Descriptive & Unsupervised

Data Mining Techniques 3: Segmentation/Clustering example: hierarchical clustering

Data Mining Techniques 4: Classification
• Goal: predict a categorical target (outcome) variable.
• Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy.
• Each row is a case/record/instance.
• Each column is a variable.
• The target variable is often binary (yes/no).
• Predictive & Supervised

Data Mining Techniques 4: Classification example: decision tree

Data Mining Techniques 4: Classification example: logistic regression
• Play if 1/(1+exp(-0.2*outlook+0.4*humidity+0.8*windy)) > 0.5
• Else, do not play
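The decision rule on this slide can be written out as a small function. This is only a sketch: the slide does not say how outlook, humidity, and windy are coded, so the 0/1 inputs used here are an assumption, and the function names are hypothetical.

```python
import math


def play_probability(outlook: float, humidity: float, windy: float) -> float:
    """Score from the slide: 1 / (1 + exp(-0.2*outlook + 0.4*humidity + 0.8*windy))."""
    exponent = -0.2 * outlook + 0.4 * humidity + 0.8 * windy
    return 1.0 / (1.0 + math.exp(exponent))


def decide(outlook: float, humidity: float, windy: float, threshold: float = 0.5) -> str:
    """Apply the slide's rule: play only if the score exceeds the threshold."""
    if play_probability(outlook, humidity, windy) > threshold:
        return "play"
    return "do not play"


# Assumed 0/1 coding of the inputs, for illustration only.
print(decide(outlook=1, humidity=0, windy=0))
print(decide(outlook=0, humidity=1, windy=1))
```

With these coefficients the score exceeds 0.5 exactly when the exponent is negative, so a higher outlook value pushes toward playing while higher humidity and windy values push against it.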

Data Mining Techniques 4: Classification example
• "Separate the riding mower buyers (●) from non-buyers (○)"
• x-axis: income (x $1000); y-axis: lot size (x 1000 sqft)

Data Mining Techniques 5: Prediction
• Goal: predict a numerical target (outcome) variable.
• Examples: sales, revenue, performance.
• As in classification:
  - Each row is a case/record/instance.
  - Each column is a variable.
• Taken together, classification and prediction constitute "predictive analytics".
• Predictive & Supervised

Data Mining Techniques 5: Prediction example: neural networks

Data Mining Techniques 6: Association Rule
• Goal: produce rules that define "what goes with what".
• Example: "If X was purchased, Y was also purchased".
• Rows are transactions.
• Used in recommender systems: "Our records show you bought X, you may also like Y".
• Also called "affinity analysis" or "market basket analysis".
• Predictive & Unsupervised

Data Mining Techniques 6: Association Rule example: market basket analysis
• Walmart (USA), E-Mart (Korea)

Data Mining Techniques 7: Novelty Detection
• Goal: identify whether a new case is similar to the given 'normal' cases.
• Examples: medical diagnosis, fault detection, identity verification.
• Each row is a case/record/instance.
• Each column is a variable.
• There is no explicit target variable; all records are assumed to share the same target.
• Also called "outlier detection" or "one-class classification".
• Predictive & Unsupervised

Data Mining Techniques 7: Novelty Detection example: keystroke dynamics-based user authentication

Data Mining Techniques (summary)
• Supervised Learning, Descriptive Modeling: …
• Supervised Learning, Predictive Modeling: Classification, Prediction
• Unsupervised Learning, Descriptive Modeling: Data Visualization, Data Reduction, Segmentation/clustering
• Unsupervised Learning, Predictive Modeling: Association Rules, Novelty Detection

Steps in Data Mining
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model

Steps in Data Mining, Step 1: Define and understand the purpose of the data mining project
• Why do we have to conduct this project?
• What would be achieved if the project succeeds?
(Jun, 2010:

Steps in Data Mining, Step 2: Formulate the data mining problem
• What is the purpose?
  - Increase sales.
  - Detect cancer patients.
• What data mining task is appropriate?
  - Classification.
  - Prediction.
  - Association rules, …

Steps in Data Mining, Step 3: Obtain/verify/modify the data (data acquisition)
• Data source
  - Data warehouse, data mart, …
• Define input variables and, if necessary, a target variable
  - Ex: churn prediction for a credit card service
  - Inputs: age, sex, tenure, amount of spending, risk grade, …
  - Target: whether the customer leaves the company.

Steps in Data Mining, Step 3: Obtain/verify/modify the data (outlier detection)
• Outlier
  - "A value that the variable cannot have" or "an extremely rare value" (ex: age 990, height -150 cm, …).
  - Real databases contain outliers for many reasons.
• How to deal with outliers?
  - Ignore the records with outliers if enough records remain.
  - Replace the outlier with another value (mean, median, an estimate from a certain pdf, etc.) if the records are insufficient.

Steps in Data Mining, Step 3: Obtain/verify/modify the data (missing value imputation)
• Missing value
  - A variable is missing when it has a null value in the database although it should have a real value.
  - Causes: operational errors, human errors.
• How to deal with missing values?
  - Ignore the records with missing values if enough records remain.
  - Replace the missing value with another value (mean, median, an estimate from a certain pdf, etc.) if the records are insufficient.
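Both ways of handling missing values can be sketched in a few lines. This is an illustrative sketch, not the slide author's code: missing entries are assumed to be represented as None, and the function names are hypothetical.

```python
from statistics import mean, median


def drop_incomplete(records):
    """Option 1: ignore every record (dict) that has at least one missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def impute(values, strategy="mean"):
    """Option 2: replace each missing entry with the mean or median of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


ages = [20, None, 40]
print(impute(ages))  # mean of the observed values fills the gap

records = [{"age": 20, "sex": "F"}, {"age": None, "sex": "M"}]
print(drop_incomplete(records))  # only the complete record is kept
```

The same replacement idea applies to outliers once they have been flagged: treat the flagged value as missing and impute it.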

Steps in Data Mining, Step 3: Obtain/verify/modify the data (variable handling)
• Type of variables
  - Binary: 0/1 (ex: benign/malignant in medical diagnosis).
  - Categorical: more than two values, ordered (ex: high, middle, low) or not ordered (ex: color, job).
  - Ordinal: continuous; differences between two consecutive values are not identical (ex: rank on the final exam).
  - Interval: continuous; differences between two consecutive values are identical (ex: age, height, weight).

Steps in Data Mining, Step 3: Obtain/verify/modify the data (variable handling)
• Variable transformation
  - Binning: interval → binary or ordered categorical (ex: Low / Mid / High).
  - 1-of-C coding: unordered categorical → binary.
• Example, "Color: yellow, red, blue, green":
  Color    d1  d2  d3
  yellow   1   0   0
  red      0   1   0
  blue     0   0   1
  green    0   0   0
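The color table uses three indicator variables for four categories, with the last category (green) coded as all zeros. A minimal sketch of that coding follows; the function name is hypothetical.

```python
def one_of_c(value, categories):
    """Encode `value` with C-1 indicators; the last category is the all-zero reference."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories[:-1]]


colors = ["yellow", "red", "blue", "green"]
for color in colors:
    print(color, one_of_c(color, colors))
```

Dropping one indicator loses no information, because the reference category is identified by all indicators being zero.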

Steps in Data Mining, Step 4: Explore and customize the data (data visualization)
• Single variable
  - Histogram: shows the distribution of a single variable; makes it possible to check normality.
  - Box plot: shows the median, quartile 1, quartile 3, "min", "max", mean, and outliers.

Steps in Data Mining, Step 4: Explore and customize the data (data visualization)
• Multiple variables
  - Correlation table: indicates which variables are highly (positively or negatively) correlated. Helps to remove irrelevant variables or select representative variables.

Steps in Data Mining, Step 4: Explore and customize the data (data visualization)
• Multiple variables
  - Scatter plot matrix: shows the relation between each pair of variables (ex: Var. 1, Var. 2, Var. 3, Var. 4).

Steps in Data Mining, Step 4: Explore and customize the data (dimensionality reduction)
• Curse of dimensionality
  - The number of records needed to sustain the same explanatory power increases exponentially as the number of variables increases.
  - Ex: 2^1 = 2, 2^2 = 4, 2^3 = 8.
  - "If there are various logical ways to explain a certain phenomenon, the simplest is the best" (Occam's Razor)

Steps in Data Mining, Step 4: Explore and customize the data (dimensionality reduction)
• Variable reduction
  - Select a small set of relevant variables.
  - Correlation analysis, Kolmogorov-Smirnov test, …
  - [Table: correlation matrix of V1 to V6; V1 and V4 are selected.]

Steps in Data Mining, Step 4: Explore and customize the data (dimensionality reduction)
• Variable extraction
  - Construct new variables that condense the information in the original variables.
  - Principal component analysis (PCA), …
• Example:
  - Original variables: age, sex, height, weight, income, property, tax paid
  - Constructed variables:
    Var1: age + 3*I(sex = female) + 0.2*height - 0.3*weight
    Var2: income + 0.1*property + 2*tax paid

Steps in Data Mining, Step 4: Explore and customize the data (instance reduction)
• Random sampling
  - Select a small set of records with a uniform sampling rate.
  - In classification, class ratios are preserved.
• Stratified sampling
  - Select a set of records such that rare events have a higher probability of being selected.
  - In classification, class ratios are modified.
  - Under-sampling: preserve the minority class, reduce the majority class.
  - Over-sampling: preserve the majority class, increase the minority class.
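Under-sampling and over-sampling can be sketched as follows. The slide does not prescribe a target class ratio; balancing every class to the same size is an assumption made here for illustration, and the function names are hypothetical.

```python
import random
from collections import Counter


def _group_by_class(records, labels):
    groups = {}
    for record, label in zip(records, labels):
        groups.setdefault(label, []).append(record)
    return groups


def under_sample(records, labels, rng):
    """Keep the smallest class whole; randomly shrink every other class to its size."""
    groups = _group_by_class(records, labels)
    n_min = min(len(rows) for rows in groups.values())
    sampled = []
    for label, rows in groups.items():
        chosen = rows if len(rows) == n_min else rng.sample(rows, n_min)
        sampled.extend((row, label) for row in chosen)
    return sampled


def over_sample(records, labels, rng):
    """Keep the largest class whole; grow every other class by drawing with replacement."""
    groups = _group_by_class(records, labels)
    n_max = max(len(rows) for rows in groups.values())
    sampled = []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(n_max - len(rows))]
        sampled.extend((row, label) for row in rows + extra)
    return sampled


records = list(range(10))
labels = [0] * 8 + [1] * 2  # class 1 is the rare event
print(Counter(label for _, label in under_sample(records, labels, random.Random(0))))
print(Counter(label for _, label in over_sample(records, labels, random.Random(0))))
```

Under-sampling discards majority records, so it suits large datasets; over-sampling keeps every record but repeats minority records.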

Steps in Data Mining, Step 4: Explore and customize the data (data separation)
• Over-fitting
  - Occurs when data mining algorithms 'memorize' the given data, including its unnecessary parts (noise, outliers, etc.).

Steps in Data Mining, Step 4: Explore and customize the data (data partition)
• Training data: used to build a model or train a data mining algorithm.
• Validation data: used to select the best parameters for the model.
• Test data: used to select the best model among the algorithms considered.
[Figure: candidate models Algorithm A-1, A-2, A-3 and Algorithm B-1, B-2, B-3, shown against the training, validation, and test data.]
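A random three-way partition can be sketched as below. The 60/20/20 proportions and the function name are assumptions for illustration; the slide does not give split sizes.

```python
import random


def partition(records, train=0.6, valid=0.2, seed=0):
    """Shuffle the records once, then cut them into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains after the first two cuts
    return training, validation, test


training, validation, test = partition(range(100))
print(len(training), len(validation), len(test))
```

The three sets are disjoint, so performance measured on the validation or test data is not inflated by records the model has already seen during training.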

Steps in Data Mining, Step 4: Explore and customize the data (data normalization)
• Normalization (standardization)
  - Eliminates the effect caused by different measurement scales or units.
  - z-score: (value - mean) / (standard deviation).
• Original data:
  Id     Age  Income
  1      25   1,000,000
  …      …    …
  Mean   35   2,000,000
  Stdev  5    1,000,000
• Normalized data:
  Id     Age  Income
  …      …    …
  Mean   0    0
  Stdev  1    1
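The z-score can be checked against the first record of the slide's table (age 25 with mean 35 and standard deviation 5; income 1,000,000 with mean 2,000,000 and standard deviation 1,000,000). In this sketch the column-wise helper uses the population standard deviation, which is an assumption; the slide does not say whether the sample or population version is meant. The function names and the ages in the second example are hypothetical.

```python
from statistics import mean, pstdev


def z_score(value, mean_value, stdev_value):
    """z-score: (value - mean) / (standard deviation)."""
    return (value - mean_value) / stdev_value


def standardize(values):
    """Standardize a whole column so that it has mean 0 and standard deviation 1."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (assumed)
    if s == 0:
        raise ValueError("cannot standardize a constant column")
    return [z_score(v, m, s) for v in values]


print(z_score(25, 35, 5))                      # age of record 1
print(z_score(1_000_000, 2_000_000, 1_000_000))  # income of record 1
print(standardize([30, 35, 40]))               # hypothetical ages
```

After standardization, age and income are on the same scale, so distance-based methods are no longer dominated by the variable with the larger unit.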

Steps in Data Mining, Step 5: Build data mining models
• Data mining algorithms
  - Classification: logistic regression, k-nearest neighbor, naïve Bayes, classification trees, neural networks, linear discriminant analysis.
  - Prediction: linear regression, k-nearest neighbor, regression trees, neural networks.
  - Association rules: Apriori algorithm.
  - Clustering: hierarchical clustering, K-Means clustering.

Steps in Data Mining, Step 6: Evaluate and interpret the results
• Classification performance
  - Confusion matrix:
    Actual 1(+), predicted 1(+): true positive, sensitivity (A)
    Actual 1(+), predicted 0(-): false negative, Type II error (B)
    Actual 0(-), predicted 1(+): false positive, Type I error (C)
    Actual 0(-), predicted 0(-): true negative, specificity (D)
  - Simple accuracy: (A+D)/(A+B+C+D)
  - Balanced correction rate
  - Lift charts, receiver operating characteristic (ROC) curve, etc.
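The confusion-matrix cells and the measures derived from them can be computed as in this sketch. It treats 1 as the positive class and uses the standard definitions: accuracy is the share of correctly classified records, (A+D)/(A+B+C+D), sensitivity is A/(A+B), and specificity is D/(C+D). The function names and the example labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (A, B, C, D) = (true positives, false negatives, false positives, true negatives)."""
    pairs = list(zip(actual, predicted))
    a = sum(1 for y, p in pairs if y == 1 and p == 1)
    b = sum(1 for y, p in pairs if y == 1 and p == 0)
    c = sum(1 for y, p in pairs if y == 0 and p == 1)
    d = sum(1 for y, p in pairs if y == 0 and p == 0)
    return a, b, c, d


def accuracy(a, b, c, d):
    return (a + d) / (a + b + c + d)


def sensitivity(a, b):
    return a / (a + b)  # share of actual positives that were predicted positive


def specificity(c, d):
    return d / (c + d)  # share of actual negatives that were predicted negative


actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
a, b, c, d = confusion_counts(actual, predicted)
print(a, b, c, d)
print(accuracy(a, b, c, d), sensitivity(a, b), specificity(c, d))
```

Reporting sensitivity and specificity alongside accuracy matters when classes are imbalanced, since a model that always predicts the majority class can still reach a high accuracy.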

Steps in Data Mining, Step 6: Evaluate and interpret the results
• Prediction performance
  - y: actual target value, y': predicted target value
  - Mean squared error, root mean squared error
  - Mean absolute error
  - Mean absolute percentage error
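The four error measures can be sketched with their usual textbook definitions, where y is the list of actual values and y_pred the list of predicted values. Expressing the percentage error on a 0 to 100 scale is a convention assumed here, and the function names and example values are hypothetical.

```python
import math


def mse(y, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(y, y_pred)) / len(y)


def rmse(y, y_pred):
    """Root mean squared error: square root of the MSE, in the unit of the target."""
    return math.sqrt(mse(y, y_pred))


def mae(y, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(a - p) for a, p in zip(y, y_pred)) / len(y)


def mape(y, y_pred):
    """Mean absolute percentage error (0 to 100 scale); undefined if any actual value is 0."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(y, y_pred)) / len(y)


y = [100, 200, 400]
y_pred = [110, 190, 400]
print(mse(y, y_pred), rmse(y, y_pred), mae(y, y_pred), mape(y, y_pred))
```

Squared-error measures penalize large misses more heavily than the absolute-error measures, while the percentage error is scale-free and so is comparable across targets with different units.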

Steps in Data Mining, Step 6: Evaluate and interpret the results
• Clustering
  - Within variance: variance among records in a single cluster.
  - Between variance: variance between clusters.
  - Good clustering: high between variance and low within variance.
• Association rules
  - Support, confidence, lift
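The three association-rule measures can be sketched with their common definitions for a rule "if X then Y": support is the fraction of transactions containing both X and Y, confidence is that support divided by the support of X, and lift is the confidence divided by the support of Y. The function names and the toy transactions are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def lift(transactions, x, y):
    """Confidence relative to how often Y occurs overall; above 1 means positive association."""
    return confidence(transactions, x, y) / support(transactions, y)


baskets = [{"A", "B"}, {"A", "B"}, {"A"}, {"B", "C"}]
print(support(baskets, {"A", "B"}))
print(confidence(baskets, {"A"}, {"B"}))
print(lift(baskets, {"A"}, {"B"}))
```

In this toy data the lift of "if A then B" is below 1, meaning that buying A makes B slightly less likely than its overall frequency would suggest.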

Steps in Data Mining, Step 7: Deploy and monitor the model
• Deployment
  - Integrate the data mining model into the operational system.
  - Run the model on real data to produce decisions or actions.
  - "Send Mr. Kang a coupon because his likelihood to leave the company next month is 80%"
• Monitoring
  - Evaluate the performance of the model after deployment.
  - Update or redevelop the model if necessary.