Big Data Analytics The Data Mining process Roger Bohn Jan. 2016

Presentation transcript:

Big Data Analytics: The Data Mining Process. Roger Bohn, Jan. 2016. Some material from Data Mining for Business Intelligence by Shmueli, Patel & Bruce. 9/14/2018

Communication. Ted forums: data sets; discussions, news, paper topics; R coding, homework, admin. Hint: use descriptive title lines. R question. Good stuff. Data sets: several sites posted. HHS, Census, and similar government statistics. A database of all IPOs and officers of IPOs, covering 20 years. New satellite data on worldwide land use: <3 km grid x 5 years x 10 observations per year, > 1E7 observations, covering urbanization, agriculture, unused land, etc.

Sites for usable data (on Ted). Kaggle Competitions: this site runs contests among DM teams. Some have monetary prizes, others are just for street cred. You don't have to fully enter the contest in order to use the data (in most cases). About 20 diverse data sets. http://www.kaggle.com/competitions Example: Walmart Recruiting, predict sales in stormy weather. UCI Machine Learning Repository: UC Irvine has a repository of data sets that is used by the whole machine learning community. Example: Weibo microblog entries, and information about the writer; more than 200K observations. Hospital readmission data: there is big money in this issue now for hospitals, because if patients are readmitted within 30 days, the hospital has to pay all the costs of their treatment.

Why Data Mining? The explosive growth of data: from terabytes to petabytes. Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society. Major sources of abundant data: business (Web, e-commerce, transactions, stocks, …); science (remote sensing, bioinformatics, scientific simulation, …); society and everyone (news, digital cameras, YouTube). We are drowning in data, but starving for knowledge! “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets.


Canonical example. Which buildings to inspect in NY City? Which financial reports to audit more carefully? Should we approve this transaction? (Is it fraudulent? Likely to fail?) Examples: credit cards, mortgages.

Supervised Learning. Goal: predict a single “target” or “outcome” variable. Uses training data, where the target value is known. Methods: classification and prediction.

Supervised: Classification. Goal: predict a categorical target (outcome) variable. Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy, … Each row is a case (customer, tax return, applicant). Each column is a variable. The target variable is often binary (yes/no). Classifications are often deliberately biased to reflect the cost of errors.
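One way to read "deliberately biased" classification is as a cost-based cutoff. The sketch below is in Python rather than the R used in the course, and it assumes a model that already outputs a fraud probability. The dollar costs and the function names are hypothetical. If correct decisions cost nothing, a case should be flagged whenever p x cost of a miss exceeds (1 - p) x cost of a false alarm, which gives the cutoff computed here.

```python
def cost_based_cutoff(cost_false_positive, cost_false_negative):
    """Probability cutoff that minimizes expected misclassification cost.

    Derived from: flag when p * cost_fn >= (1 - p) * cost_fp,
    assuming correct decisions cost zero.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def classify(prob_positive, cutoff):
    """Return True (flag the case) when the probability reaches the cutoff."""
    return prob_positive >= cutoff


# Hypothetical costs: a false alarm costs $10 of review time,
# a missed fraud costs $990.
cutoff = cost_based_cutoff(cost_false_positive=10.0, cost_false_negative=990.0)

print(cutoff)                  # far below the "neutral" 0.5
print(classify(0.05, cutoff))  # a 5% fraud probability is already flagged
```

With unequal costs the cutoff moves well away from 0.5, so the classifier is biased toward the cheaper kind of error.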

Supervised: Prediction. Goal: predict a numerical target (outcome) variable. Examples: sales, revenue, performance. As in classification, each row is a case (customer, tax return, applicant) and each column is a variable. Regression is a common tool, but we are often not interested in the value of the coefficients per se. Instead, the aim is to forecast the outcome for a new case. Taken together, classification and prediction constitute “predictive analytics”.

(Unsupervised) Data Visualization. Graphs and plots of data: histograms, boxplots, bar charts, scatterplots. Especially useful for examining relationships between pairs of variables. General concept: Exploratory Data Analysis. Where do you start with new data?

Steps in Data Mining. (1) Define/understand the problem, question, or decision. (2) Obtain data (may involve random sampling). (3) Explore, clean, and pre-process the data. (4) Specify the task (classification, clustering, etc.). (5) Try one or more algorithms (regression, k-Nearest Neighbors, trees, neural networks, etc.). (6) Iterative implementation and “tuning”. (7) Assess results and compare models. (8) Deploy the model in production mode for daily use.


Knowledge Discovery (KDD) Process. This is a view from the typical database systems and data warehousing communities. Data mining plays an essential role in the knowledge discovery process. [Figure: the KDD pipeline, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation.]

Data Mining in Business Intelligence. [Figure: a pyramid in which the potential to support business decisions increases toward the top. From top to bottom: Decision Making (End User); Data Presentation, Visualization Techniques (Business Analyst); Data Mining, Information Discovery (Data Analyst); Data Exploration: Statistical Summary, Querying, and Reporting; Data Preprocessing/Integration, Data Warehouses (DBA); Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems.]

Pre-processing Data. Format conversion (e.g. text to numeric). Parsing (e.g. web data). Merging data from multiple sources. Dealing with outliers. Missing observations (some algorithms don’t care). Rare event oversampling. Normalizing. “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” Source: Steve Lohr, “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, NY Times, Aug. 17, 2014.

Convert Types of Variables. Variable types determine the pre-processing needed and the algorithms that can be used. Main distinction: categorical vs. numeric. Categorical variables are either ordered (low, medium, high) or unordered (male, female). In the R language, specify the type of each variable.
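The ordered/unordered distinction matters once categories have to become numbers. The sketch below is in Python rather than R and uses made-up variable names and values. It shows one common choice: ordered categories become integer ranks, and unordered categories become 0/1 dummy variables so that no artificial ordering is implied.

```python
ORDER = ["low", "medium", "high"]  # explicit ordering for the ordered variable


def encode_ordered(values, order):
    """Map each ordered category to its rank (0, 1, 2, ...)."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


def encode_unordered(values):
    """One 0/1 dummy column per category level (levels sorted alphabetically)."""
    levels = sorted(set(values))
    dummies = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, dummies


# Hypothetical columns from a small data set
risk = ["low", "high", "medium", "low"]
sex = ["male", "female", "female", "male"]

risk_codes = encode_ordered(risk, ORDER)
levels, sex_dummies = encode_unordered(sex)

print(risk_codes)
print(levels, sex_dummies)
```

Some algorithms need one dummy column dropped to avoid redundancy. That detail is left out here.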

Detecting Outliers. An outlier is an observation that is “extreme”, being distant from the rest of the data (the definition of “distant” is deliberately vague). Outliers can have a disproportionate influence on models (a problem if the outlier is spurious). An important step in data pre-processing is detecting outliers. Once an outlier is detected, domain knowledge is required to determine whether it is an error or truly extreme. Common example: a misplaced decimal point.
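Since “distant” is left vague, any concrete rule is a choice. The sketch below is in Python with made-up readings. It uses one common convention, the 1.5 x interquartile-range rule, which is an assumption here and not a rule prescribed by the slides. It only flags candidates. Deciding whether a flagged value is an error still takes domain knowledge.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical measurements; 101.0 looks like 10.1 with a misplaced decimal point
readings = [10.2, 9.8, 10.5, 10.1, 9.9, 101.0, 10.3, 10.0]
suspects = iqr_outliers(readings)

print(suspects)
```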

Handling Missing Data. Many algorithms will not process records with missing values; the default is to drop those records. Solution 1: Omission. If a small number of records have missing values, you can omit them. If many records are missing values on a small set of variables, you can drop those variables (or use proxies). If many records have missing values, omission is not practical. Solution 2: Imputation. Replace missing values with reasonable substitutes. This lets you keep the record and use the rest of its (non-missing) information. Solution 3: Use an algorithm that handles missing data (e.g. classification trees).
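Solutions 1 and 2 can be sketched in a few lines of Python. The records and column names are made up. Using the median as the “reasonable substitute” is an assumption, one of several possible choices.

```python
from statistics import median

# Hypothetical records; None marks a missing value
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]


def drop_incomplete(rows):
    """Solution 1 (omission): keep only records with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_median(rows, column):
    """Solution 2 (imputation): fill missing values with the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


complete = drop_incomplete(records)
imputed = impute_median(impute_median(records, "age"), "income")

print(len(complete), "of", len(records), "records survive omission")
print(imputed)
```

Omission loses half of this tiny data set, while imputation keeps every record and the information in its non-missing fields.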

Rare event oversampling. Often the event of interest is rare. Examples: response to a mailing, fraud, … These may be only a few percent of the total sample. Sampling may yield too few “interesting” cases to train a model effectively. A popular solution: oversample the rare cases to obtain a more balanced training set. Later, the results need to be adjusted for the oversampling.
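A minimal Python sketch of both halves of this idea follows, with a made-up data set and hypothetical function names. Rare cases are resampled with replacement until they reach a target share of the training set. The adjustment step is shown as the standard base-rate (prior) correction of a predicted probability, which is one way, assumed here, to “adjust results for the oversampling”.

```python
import random


def oversample(rows, is_rare, target_share, seed=0):
    """Duplicate rare rows (sampling with replacement) up to target_share."""
    rng = random.Random(seed)
    rare = [r for r in rows if is_rare(r)]
    common = [r for r in rows if not is_rare(r)]
    # Rare rows needed so that rare / (rare + common) equals target_share
    needed = round(target_share * len(common) / (1 - target_share))
    extra = rng.choices(rare, k=max(0, needed - len(rare)))
    return common + rare + extra


def correct_probability(p_sample, rate_original, rate_sample):
    """Map a probability estimated on oversampled data back to the true base rate."""
    a = p_sample * rate_original / rate_sample
    b = (1 - p_sample) * (1 - rate_original) / (1 - rate_sample)
    return a / (a + b)


# Hypothetical data: 20 fraud cases out of 1000 (2%)
rows = [{"fraud": i < 20} for i in range(1000)]
balanced = oversample(rows, lambda r: r["fraud"], target_share=0.5)

share = sum(r["fraud"] for r in balanced) / len(balanced)
print(len(balanced), share)

# A score of 0.5 on the balanced data corresponds to the 2% base rate
print(correct_probability(0.5, rate_original=0.02, rate_sample=0.5))
```

Oversampling should be applied to the training partition only, so that validation data still reflect the real event rate.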

Normalizing (Standardizing) Data. Needed when the variables with the largest scales would otherwise dominate and skew the results. Needed for some algorithms (e.g. kNN), not for others (regression). Puts all variables on the same scale: is weight in g or kg? Is distance in meters, feet, or km? Normalizing function: subtract the mean and divide by the standard deviation. Alternative: scale to 0-1 by subtracting the minimum and dividing by the range. The 0-1 scaling is useful when the data contain both dummy and numeric variables. Sometimes it is best not to normalize, because unnormalized coefficient values give more insight.
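Both normalizing functions are short enough to write out. The sketch below is in Python with made-up weights. It uses the sample standard deviation, which is an assumption, since the slides do not say which version is meant. It also checks that the result does not depend on whether weight was recorded in g or kg.

```python
from statistics import mean, stdev


def standardize(values):
    """z-score: subtract the mean and divide by the (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def min_max(values):
    """Scale to the 0-1 range: subtract the minimum and divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical body weights, recorded in grams and in kilograms
weights_g = [60000.0, 72000.0, 85000.0, 91000.0]
weights_kg = [w / 1000 for w in weights_g]

z_g = standardize(weights_g)
z_kg = standardize(weights_kg)
scaled = min_max(weights_g)

print(z_g)
print(z_kg)    # same values up to rounding: the unit no longer matters
print(scaled)  # runs from 0.0 to 1.0
```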

The Problem of Overfitting. Statistical models can produce highly complex explanations of relationships between variables. The “fit” may appear excellent. But when used with new data, models of great complexity do not do so well.

100% fit: not useful for new data

Overfitting (cont.). Causes: too many predictors; a model with too many parameters; trying many different models. Consequence: the deployed model will not work as well as expected with completely new data.
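The “100% fit” problem can be reproduced with a toy experiment. The Python sketch below rests on assumptions throughout: the data are simulated from a noisy straight line, the overly complex model is a memorizer (1-nearest-neighbor), and the simple model is a least-squares line. None of this comes from the slides beyond the general idea.

```python
import random

rng = random.Random(42)


def make_data(n):
    """Simulated cases: y = 2x + 1 plus Gaussian noise (an assumed process)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0, 2.0)) for x in xs]


train, new = make_data(30), make_data(30)


def memorizer(x):
    """Overly complex model: return the y of the closest training case."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def fit_line(data):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data)
    sxx = sum((x - mx) ** 2 for x, _ in data)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = fit_line(train)


def line(x):
    """Simple model: the fitted straight line."""
    return slope * x + intercept


def mse(model, data):
    """Mean squared error of a model on a set of (x, y) cases."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


for name, model in [("memorizer", memorizer), ("line", line)]:
    print(name, "train MSE:", mse(model, train), "new-data MSE:", mse(model, new))
```

The memorizer reproduces every training case exactly, so its training error is zero, yet on fresh data it chases the noise it memorized. In runs like this the simple line typically does better on the new cases.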

A Big Idea in Data Analytics: Partitioning the Data. Problem: how well will our model perform with new data? Solution: separate the data into two parts. The training partition is used to develop the model. The validation partition is used to apply the model and evaluate its performance on “new” data. This addresses the issue of overfitting.
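A random two-way partition takes only a shuffle and a cut. The Python sketch below assumes a 60/40 split and a fixed seed for repeatability. Both are illustrative choices and not values given in the slides.

```python
import random


def partition(rows, train_share=0.6, seed=1):
    """Randomly split rows into (training, validation) partitions."""
    shuffled = rows[:]                      # copy so the original order is kept
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))                     # stand-in for 100 cases
training, validation = partition(rows)

print(len(training), len(validation))
```

The model is then fitted on `training` only, and its error is measured on `validation`, which it has never seen.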

Multiple Partitions. When a model is developed on training data, it can overfit the training data (hence the need to assess it on validation data). Assessing multiple models on the same validation data can overfit the validation data. Some methods use the validation data to choose a parameter; this too can lead to overfitting the validation data. Solution: the final selected model is applied to a third, test partition, which gives an unbiased estimate of its performance on new data.
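The three-partition workflow can be sketched end to end in Python. Everything specific here is an assumption for illustration: the simulated data, the 50/25/25 split, and the use of k-nearest-neighbor regression with k chosen on the validation partition.

```python
import random

rng = random.Random(7)

# Simulated cases: y = 3x plus noise (an assumed process)
data = []
for _ in range(300):
    x = rng.uniform(0, 10)
    data.append((x, 3.0 * x + rng.gauss(0, 1.0)))

rng.shuffle(data)
train, valid, test = data[:150], data[150:225], data[225:]


def knn_predict(train_rows, x, k):
    """Average the outcomes of the k training cases closest to x."""
    nearest = sorted(train_rows, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train_rows, rows, k):
    """Mean squared error of k-NN (built on train_rows) on another partition."""
    return sum((knn_predict(train_rows, x, k) - y) ** 2 for x, y in rows) / len(rows)


# The validation partition is used to choose the parameter k ...
candidates = [1, 3, 5, 9, 15]
best_k = min(candidates, key=lambda k: mse(train, valid, k))
validation_error = mse(train, valid, best_k)

# ... so only the untouched test partition gives an unbiased error estimate.
test_error = mse(train, test, best_k)

print("chosen k:", best_k)
print("validation MSE:", validation_error)
print("test MSE:", test_error)
```

Because `best_k` was picked to look good on the validation partition, `validation_error` is an optimistic figure. The test error is the one to report.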

The same concept should be used in classic regression analysis. Instead of trying to estimate the forecast errors, just measure them! The standard error of residuals and the t test are no longer needed! Statistical estimates of errors rest on a long list of assumptions: homoskedastic errors; no autocorrelation; no important omitted variables (hah). Instead, hold aside a testing sample.

Summary. Data mining includes many supervised methods (classification and prediction) plus some unsupervised methods (association rules, data reduction, data exploration and visualization). Before algorithms can be applied, data must be characterized and pre-processed. This takes work! To evaluate performance and to avoid overfitting, partition the data. Data mining methods are usually applied to a sample from a large database, and then the best model is used to analyze the entire database.