Overview of the Data Mining Process


Presentation transcript:

Overview of the Data Mining Process
Data Mining for Business Analytics
Shmueli, Patel & Bruce

Core Ideas in Data Mining
- Classification
- Prediction
- Association Rules
- Predictive Analytics
- Data Reduction and Dimension Reduction
- Data Exploration and Visualization
- Supervised and Unsupervised Learning

Supervised Learning
- Goal: Predict a single “target” or “outcome” variable
- Train on data where the target value is known
- Score data where the target value is not known
- Methods: Classification and Prediction

Unsupervised Learning
- Goal: Segment data into meaningful segments; detect patterns
- There is no target (outcome) variable to predict or classify
- Methods: Association rules, data reduction & exploration, visualization

Supervised: Classification
- Goal: Predict categorical target (outcome) variable
- Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy…
- Each row is a case (customer, tax return, applicant)
- Each column is a variable
- Target variable is often binary (yes/no)

Supervised: Prediction
- Goal: Predict numerical target (outcome) variable
- Examples: sales, revenue, performance
- As in classification:
  - Each row is a case (customer, tax return, applicant)
  - Each column is a variable
- Taken together, classification and prediction constitute “predictive analytics”

Unsupervised: Association Rules
- Goal: Produce rules that define “what goes with what”
- Example: “If X was purchased, Y was also purchased”
- Rows are transactions
- Used in recommender systems – “Our records show you bought X, you may also like Y”
- Also called “affinity analysis”
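A minimal sketch of the "if X was purchased, Y was also purchased" idea, assuming each transaction is represented as a set of item names. The function name and the sample baskets are hypothetical. The value computed is the share of transactions containing X that also contain Y, which is often called the confidence of the rule.

```python
def rule_strength(transactions, x, y):
    """Share of transactions containing x that also contain y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

# Hypothetical baskets: each row is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
    {"milk"},
]
share = rule_strength(transactions, "bread", "milk")
print(f"bread -> milk holds in {share:.0%} of bread transactions")
```

A recommender built on such rules would suggest Y to buyers of X only when this share is high enough to be useful; the cutoff is a modeling choice not specified in the slides.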

Unsupervised: Data Reduction
- Distillation of complex/large data into simpler/smaller data
- Reducing the number of variables/columns (e.g., principal components)
- Reducing the number of records/rows (e.g., clustering)

Unsupervised: Data Visualization
- Graphs and plots of data
- Histograms, boxplots, bar charts, scatterplots
- Especially useful to examine relationships between pairs of variables

Data Exploration
- Data sets are typically large, complex & messy
- Need to review the data to help refine the task
- Use techniques of Reduction and Visualization

The Process of Data Mining

Steps in Data Mining
1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Explore, clean, pre-process data
4. Reduce the data, transform variables
5. Determine DM task (classification, clustering, etc.)
6. Partition the data (for supervised tasks)
7. Choose the techniques (regression, neural networks, etc.)
8. Perform the task
9. Interpret – compare models
10. Deploy the best model

SAS SEMMA
- Sample: Take a sample and partition
- Explore: Examine statistically and graphically
- Modify: Transform variables and impute missing values
- Model: Fit predictive model
- Assess: Compare models using validation dataset

Obtaining Data: Sampling
- Data mining typically deals with huge databases
- Algorithms and models are typically applied to a sample from a database, to produce statistically valid results
- XLMiner, e.g., limits the “training” partition to 10,000 records
- Once you develop and select a final model, you use it to “score” the observations in the larger database

Rare event oversampling
- Often the event of interest is rare
- Examples: response to mailing, fraud in taxes, …
- Sampling may yield too few “interesting” cases to effectively train a model
- A popular solution: oversample the rare cases to obtain a more balanced training set
- Later, need to adjust results for the oversampling
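One way to oversample the rare cases can be sketched as follows. This is an illustration under assumptions, not the procedure XLMiner uses: rare records are resampled with replacement until they match the number of common records, and the record layout (`id`, `response`) and the 2% response rate are hypothetical.

```python
import random

def oversample_rare(records, is_rare, rng):
    """Return a training set in which rare cases are resampled (with
    replacement) until they are as numerous as the common cases."""
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + rare + extra

rng = random.Random(0)
# Hypothetical mailing data: 20 responders (rare) among 1000 records.
records = [{"id": i, "response": 1 if i < 20 else 0} for i in range(1000)]
balanced = oversample_rare(records, lambda r: r["response"] == 1, rng)

original_share = sum(r["response"] for r in records) / len(records)
balanced_share = sum(r["response"] for r in balanced) / len(balanced)
print(original_share, balanced_share)
```

The gap between `original_share` and `balanced_share` is what the later adjustment has to account for: a model trained on the balanced set sees responders far more often than it will in the full database.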

Pre-processing Data

Types of Variables
- Determine the types of pre-processing needed, and algorithms used
- Main distinction: Categorical vs. numeric
- Numeric
  - Continuous
  - Integer
- Categorical
  - Ordered (low, medium, high)
  - Unordered (male, female)

Variable handling
- Numeric
  - Most algorithms in XLMiner can handle numeric data
  - May occasionally need to “bin” into categories
- Categorical
  - Naïve Bayes can use as-is
  - In most other algorithms, must create binary dummies (number of dummies = number of categories – 1)
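The "number of categories – 1" rule for dummies can be sketched in plain Python. The helper name `make_dummies` is hypothetical, and treating the first category in sorted order as the baseline is an assumption made here for illustration; the category labels come from the REMODEL variable described later in the deck.

```python
def make_dummies(values, prefix):
    """Encode a categorical column as (number of categories - 1) binary
    dummy columns; the first category in sorted order is the baseline."""
    categories = sorted(set(values))
    kept = categories[1:]  # drop one category: it is implied by the others
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]

remodel = ["None", "Recent", "None", "Old"]
dummies = make_dummies(remodel, "REMODEL")
for value, row in zip(remodel, dummies):
    print(value, row)
```

A record whose dummies are all zero belongs to the baseline category, which is why one dummy fewer than the number of categories carries the full information.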

Variable selection
- Parsimony
- Rules of thumb for the number of records needed:
  - 10 records per variable
  - 6 x m x p records, where m = no. of outcome classes and p = no. of variables
- Redundancy must be avoided
- Domain experts must be consulted
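The two rules of thumb are simple arithmetic. The sketch below evaluates both for a hypothetical classification problem with 12 variables and 2 outcome classes; the function name and the inputs are illustrative only.

```python
def records_needed(n_variables, n_classes):
    """Return the record counts suggested by the two rules of thumb."""
    ten_per_variable = 10 * n_variables
    six_m_p = 6 * n_classes * n_variables
    return ten_per_variable, six_m_p

simple_rule, class_rule = records_needed(n_variables=12, n_classes=2)
print(simple_rule, class_rule)
```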

Detecting Outliers
- An outlier is an observation that is “extreme”, being distant from the rest of the data (definition of “distant” is deliberately vague)
- Outliers can have disproportionate influence on models (a problem if the outlier is spurious)
- An important step in data pre-processing is detecting outliers
- Once detected, domain knowledge is required to determine whether an outlier is an error or truly extreme
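Because the slides leave "distant" deliberately vague, any concrete rule is a judgment call. The sketch below uses one common convention, flagging values more than a chosen number of standard deviations from the mean. The threshold of 2.5 and the sample values are assumptions for illustration, not a rule stated in the deck.

```python
import statistics

def flag_outliers(values, threshold=2.5):
    """Return values lying more than `threshold` sample standard
    deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

# Hypothetical measurements with one suspicious entry.
measurements = [48, 50, 52, 49, 51, 50, 47, 53, 50, 500]
outliers = flag_outliers(measurements)
print(outliers)
```

Flagging is only the first step: whether 500 is a data-entry error or a genuinely extreme case still has to be decided with domain knowledge.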

Detecting Outliers
- In some contexts, finding outliers is the purpose of the DM exercise (airport security screening). This is called “anomaly detection”.

Handling Missing Data
- Most algorithms will not process records with missing values. Default is to drop those records.
- Solution 1: Omission
  - If a small number of records have missing values, can omit them
  - If many records are missing values on a small set of variables, can drop those variables (or use proxies)
  - If many records have missing values, omission is not practical
- Solution 2: Imputation
  - Replace missing values with reasonable substitutes
  - Lets you keep the record and use the rest of its (non-missing) information
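Both solutions can be sketched on a small set of records, assuming missing values are represented as `None`. Using the median of the observed values as the "reasonable substitute" is an assumption made here; the slides do not name a specific substitute. The records are hypothetical.

```python
import statistics

def drop_incomplete(records):
    """Omission: keep only records with no missing values."""
    return [r for r in records if None not in r.values()]

def impute_median(records, column):
    """Imputation: fill missing values in one column with the median
    of the observed values, leaving other columns untouched."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = statistics.median(observed)
    return [{**r, column: fill if r[column] is None else r[column]}
            for r in records]

records = [
    {"ROOMS": 6, "FLOORS": 2},
    {"ROOMS": None, "FLOORS": 1},
    {"ROOMS": 8, "FLOORS": 2},
    {"ROOMS": 7, "FLOORS": None},
]
complete = drop_incomplete(records)
imputed = impute_median(records, "ROOMS")
print(len(complete), imputed[1])
```

Omission halves this tiny data set, while imputation keeps all four records, which illustrates why omission stops being practical once many records have gaps.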

Normalizing (Standardizing) Data
- Used in some techniques when variables with the largest scales would dominate and skew results
- Puts all variables on same scale
- Normalizing function: Subtract mean and divide by standard deviation (used in XLMiner)
- Alternative function: scale to 0-1 by subtracting minimum and dividing by the range
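The two normalizing functions can be written out directly. This is a sketch using illustrative living-area values; using the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption, since the slides do not say which one XLMiner applies.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def rescale_01(values):
    """Subtract the minimum and divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

living_area = [1352, 1976, 1371, 2608, 1438]  # illustrative values
z_scores = standardize(living_area)
scaled = rescale_01(living_area)
print([round(z, 2) for z in z_scores])
print([round(s, 2) for s in scaled])
```

After either transformation, a variable measured in square feet and one measured in counts of rooms contribute on comparable scales.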

Partitioning the Data
- Problem: How well will our model perform with new data?
- Solution: Separate data into two parts
  - Training partition to develop the model
  - Validation partition to implement the model and evaluate its performance on “new” data
- Addresses the issue of overfitting

Test Partition
- When a model is developed on training data, it can overfit the training data (hence need to assess on validation)
- Assessing multiple models on same validation data can overfit validation data
- Some methods use the validation data to choose a parameter. This too can lead to overfitting the validation data
- Solution: final selected model is applied to a test partition to give unbiased estimate of its performance on new data
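A random three-way partition can be sketched as below. The 50/30/20 split, the seed, and the function name are assumptions chosen for illustration; the slides do not prescribe particular proportions.

```python
import random

def partition(records, percents=(50, 30, 20), seed=1):
    """Shuffle the records and split them into training, validation
    and test partitions according to integer percentages."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percents[0] // 100
    n_valid = len(shuffled) * percents[1] // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to test
    return train, valid, test

rows = list(range(1000))  # stand-in for 1000 record identifiers
train, valid, test = partition(rows)
print(len(train), len(valid), len(test))
```

The model is fitted on `train`, competing models and parameters are compared on `valid`, and `test` is touched only once, for the final selected model.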

Overfitting
- Statistical models can produce highly complex explanations of relationships between variables
- The “fit” may be excellent
- When used with new data, models of great complexity do not perform as well

100% fit – not useful for new data

Overfitting (cont.)
- Causes:
  - Too many predictors
  - A model with too many parameters
  - Trying many different models
- Consequence: Deployed model will not work as well as expected with completely new data.

Building Predictive Model with XLMiner
Example – West Roxbury Home Value Dataset
[Table: first ten records of the dataset. Columns: TOTAL VALUE, TAX, LOT SQ FT, YR BUILT, GROSS AREA, LIVING AREA, FLOORS, ROOMS, BEDROOMS, FULL BATH, HALF BATH, KITCHEN, FIREPLACE, REMODEL]

- TOTAL VALUE: Total assessed value for property, in thousands of USD
- TAX: Tax bill amount based on total assessed value multiplied by the tax rate, in USD
- LOT SQ FT: Total lot size of parcel in square feet
- YR BUILT: Year property was built
- GROSS AREA: Gross floor area
- LIVING AREA: Total living area for residential properties (ft²)
- FLOORS: Number of floors
- ROOMS: Total number of rooms
- BEDROOMS: Total number of bedrooms
- FULL BATH: Total number of full baths
- HALF BATH: Total number of half baths
- KITCHEN: Total number of kitchens
- FIREPLACE: Total number of fireplaces
- REMODEL: When house was remodeled (Recent/Old/None)

Modeling Process
1. Determine the purpose
   - Predict value of homes in West Roxbury
2. Obtain the data
   - West Roxbury Housing.XLSX
3. Explore, clean and preprocess the data
   - TAX is circular (it is computed from TOTAL VALUE), so it is not useful as a predictor
   - Generate descriptive statistics and look for unusual values
   - Generate plots and examine them

Modeling Process
4. Reduce the dimension
   - Condense number of categories
   - Consolidate multiple numerical variables using Principal Component Analysis
5. Determine the data mining task
6. Partition the data (for supervised tasks)
7. Choose the technique
   - For the example: multiple linear regression

Modeling Process
8. Use the algorithm to perform the task
9. Interpret results
10. Deploy the model

Step 6: Partitioning the data

Step 8: Using XLMiner for Multiple Linear Regression

Summary of errors

RMS error
- Error = actual - predicted
- RMS = Root-mean-squared error = Square root of average squared error
- In the previous example, sizes of training and validation sets differ, so only RMS Error and Average Error are comparable
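The two comparable error summaries follow directly from the definitions above. The sketch below uses small hypothetical actual and predicted values chosen so the arithmetic is easy to check by hand.

```python
import math

def average_error(actual, predicted):
    """Mean of (actual - predicted); signed, so errors can cancel."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

def rms_error(actual, predicted):
    """Square root of the average squared error."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

actual = [10, 20, 30, 40]      # hypothetical observed values
predicted = [7, 24, 30, 40]    # hypothetical model output
avg = average_error(actual, predicted)
rms = rms_error(actual, predicted)
print(avg, rms)
```

Both measures are averages over records, which is why they can be compared between partitions of different sizes, whereas a total such as the sum of squared errors grows with the number of records.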

Using Excel and XLMiner for Data Mining
- Excel is limited in data capacity
- However, the training and validation of DM models can be handled within the modest limits of Excel and XLMiner
- Models can then be used to score larger databases
- XLMiner has functions for interacting with various databases (taking samples from a database, and scoring a database from a developed model)

Summary
- Data Mining consists of supervised methods (Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization)
- Before algorithms can be applied, data must be characterized and pre-processed
- To evaluate performance and to avoid overfitting, data partitioning is used
- Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database