Data Mining Survey of applications and methodologies - Akshat Singhal, Oberlin College, 2007.

Presentation Summary
- What is data mining?
- Evolution of data mining
- Applications
- Process
- Models: predictive vs. descriptive
- Decision tree (classification rules) example
- Association rules example
- Text mining example
- Software used

What Is Data Mining?
- Also called Knowledge Discovery in Databases (KDD).
- “The extraction of hidden predictive information from large databases”, or the process of automatically searching large volumes of data for patterns.
- It answers questions such as:
  - “What products are candy buyers most likely to buy this month?”
  - “What kind of credit card transaction is a likely fraud?”
  - “What colour of automobile is the most associated with accidents?”

Evolution of Data Mining
Progression: Files, then RDBMS, then OLAP, then Data Mining.
- Data Collection (1960s)
  Business question: “How many widgets were sold this year?”
  Enabling technology: computers, tapes, disks
- Data Access (1980s)
  Business question: “How many widgets were sold and for what cost this year?”
  Enabling technology: relational databases (RDBMS)
- Data Warehousing and Decision Support
  Business question: “How many widgets were sold without discount in the recently acquired Puerto Rico store of Giant Corp, Inc.?”
  Enabling technology: on-line analytical processing (OLAP), multidimensional databases, data warehouses
- Data Mining
  Business question: “How many widgets will be sold in Cleveland next year?”
  Enabling technology: machine learning, and technologies for handling mass storage and computation such as RAID and SMP

What Data Mining Is NOT
- Data entry, storage, access, or connectivity among diverse data sources (data warehousing).
- Presenting data in a better format (data presentation / interfacing).
- Brute-force application of algorithms to generate data about data (statistics).
- Finding relations that don’t manifest themselves in the given data (business strategy).

Types of Data Mining:
1. Forecasting what may happen in the future
2. Classifying and clustering data items into groups by recognizing patterns
3. Associating events (attribute values) that are likely to occur together
4. Sequencing events that are likely to lead to later events

Example Applications
- Fraud/non-compliance anomaly detection (government)
- Credit/risk scoring
- Intrusion detection
- Parts failure prediction
- Market basket analysis
- “Fun” statistics
- Product recommendations
- Customer profiling
- Maximizing profitability (cross selling, identifying profitable customers)
- Web mining
- Weather prediction
- Using patterns in medical test results for diagnosis

Success Stories
- HSBC: used data mining to target mailings at customers better (e.g. not sending car loan brochures to millionaires).
- DEA: analyzed suspect calls to catch drug peddlers (i.e. don’t say LSD on the phone).
- IRS: better scheduling and catching tax fraud.
- Daimler-Benz: used data mining to analyze testing data for F-Cell fuelled vehicles.
- Walmart: analyzing 7.5 TB of customer and supplier data.

Privacy Concerns
- Data mining extracts new insights from old data.
- This data may have been collected with a stated purpose of record-keeping only.
- Results of data mining can classify people as high risk/potentially criminal and hence hurt them.
- Many believe data mining is the same as The Man simply stealing information (the mining metaphor is ambiguous).

Issues of Scale
- Common data sets are non-trivial in size, usually on the order of terabytes.
- Data is almost never consistent in quality.
- A top-down approach is needed to solve data mining problems.
- The answer: a standard process for data mining, CRISP-DM (CRoss Industry Standard Process for Data Mining).

CRISP-DM
- Proposed by SPSS, Daimler-Benz, and OHRA in 1996.
- Follows uniform and well-documented guidelines.
- Flexible on the type of:
  - business/agency problems
  - data
  - application software (i.e. the software tools used for analysis)
- Very similar to the standard software development process (top-down model).

Phases of CRISP-DM
(Each task is followed by its outputs.)

Business Understanding
- Determine Business Objectives: Background; Business Objectives; Business Success Criteria
- Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
- Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
- Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
- Collect Initial Data: Initial Data Collection Report
- Describe Data: Data Description Report
- Explore Data: Data Exploration Report
- Verify Data Quality: Data Quality Report

Data Preparation
- Data Set: Data Set Description
- Select Data: Rationale for Inclusion / Exclusion
- Clean Data: Data Cleaning Report
- Construct Data: Derived Attributes; Generated Records
- Integrate Data: Merged Data
- Format Data: Reformatted Data

Modeling
- Select Modeling Technique: Modeling Technique; Modeling Assumptions
- Generate Test Design: Test Design
- Build Model: Parameter Settings; Models; Model Description
- Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
- Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
- Review Process: Review of Process
- Determine Next Steps: List of Possible Actions; Decision

Deployment
- Plan Deployment: Deployment Plan
- Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
- Produce Final Report: Final Report; Final Presentation
- Review Project: Experience Documentation

CRISP-DM: Stage 1
- Define the business objective.
- Define the data mining objective.
- Define the set of data to be used, and identify outliers in the data.
- Gauge the reliability of the analysis.
- Reasons:
  - Business objectives are often unclear (e.g. cutting mailing costs vs. finding new areas to campaign in).
  - Data quality varies widely, even in large, well-structured organizations.

Stage 2-3: Data Preparation
- Evaluate the quality of the data.
- Statistical outliers, incomplete data, and sparse data must be accounted for.
- Data may need to be transformed (for instance, by a logarithm function) to yield useful statistics.
- Bad-quality data:
  - Sparse data: e.g. in market basket analysis, one customer never buys the whole store, so the resulting matrix is very sparse.
  - Incomplete data: e.g. people do not answer every question in surveys.
- Data from a 10-year-old IBM mainframe needs conversion and standardization.
- Non-entries can manifest themselves as 0 or some default value.
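A minimal sketch of three of the preparation steps named on this slide: treating a default value as a non-entry, applying a logarithm transform, and measuring how sparse a market basket matrix is. The income figures, item names, and function names are hypothetical and chosen only for illustration; they do not come from the presentation.

```python
import math

def mark_missing(values, default=0):
    """Replace the default written for non-entries with None,
    so a missing value is not mistaken for a real zero."""
    return [None if v == default else v for v in values]

def log10_transform(values):
    """Apply log10 to every present value; missing values stay missing."""
    return [None if v is None else math.log10(v) for v in values]

def density(baskets, catalogue):
    """Fraction of cells that would be non-zero in a customer-by-item matrix."""
    filled = sum(len(basket) for basket in baskets)
    return filled / (len(baskets) * len(catalogue))

# Hypothetical skewed attribute where 0 means "no entry".
incomes = [32000, 0, 45000, 1200000, 0, 51000]
cleaned = mark_missing(incomes)
logged = log10_transform(cleaned)

# Hypothetical baskets, stored as sets instead of a dense 0/1 matrix.
catalogue = ["bread", "milk", "eggs", "tea", "soap", "rice", "salt", "jam"]
baskets = [{"bread", "milk"}, {"tea"}, {"eggs", "rice", "salt"}, {"milk"}]

print(cleaned)
print([None if x is None else round(x, 2) for x in logged])
print(f"basket matrix density: {density(baskets, catalogue):.2f}")
```

Storing each basket as a set is one simple way to avoid materializing the mostly empty matrix; real tools typically use a dedicated sparse matrix format instead.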

Stage 4: Modelling
Predictive models:
- The output is a function or distribution that predicts values for individual objects.
- e.g. whether to play or not, given that it’s sunny outside and humidity is high.
- Use classification rules. Classification looks for associations to one target attribute (say, Class = Ham or Spam).
Descriptive models:
- The output is a set of interesting (local, marginal) properties of the distribution.
- e.g. if it’s sunny and we decide to play, the temperature must be cool.
- Use association rules. Associations are more numerous because they can be between any number of attributes.

Algorithms
Predictive:
- Regression algorithms: neural networks, rule induction
- Classification algorithms: CHAID, C5.0, Naïve Bayesian classifier
Descriptive:
- Clustering/grouping algorithms: K-means, Kohonen maps
- Association algorithms: GRI
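Of the descriptive algorithms listed, K-means is compact enough to sketch in a few lines. The version below is the textbook assign-then-update loop written in plain Python, not the implementation of any tool mentioned in this presentation. The points and starting centroids are made up for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, max_iterations=20):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its cluster, until nothing moves."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]  # keep an empty cluster's centroid
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
final_centroids, final_clusters = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])

for centroid, members in zip(final_centroids, final_clusters):
    print(centroid, members)
```

The starting centroids are passed in explicitly so the run is repeatable; in practice they are usually chosen at random, and the result can depend on that choice.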

Decision Tree Induction Example (C4.5)
The C4.5 algorithm infers classification rules from this data, such as:
- If Outlook = sunny and Humidity <= 75, then Play = yes
- If Outlook = rainy and Windy = true, then Play = no
Rules can be represented as a decision tree. In this example, the rules can help predict whether a game will be played, based on weather data.
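To show how a tree learner decides which attribute to test first, the sketch below computes plain information gain for each attribute. Two assumptions apply. First, the slide's data table did not survive in this transcript, so the rows are the small nominal weather dataset distributed with Weka, used here as a stand-in. Second, C4.5 itself ranks splits by gain ratio and can also choose numeric thresholds such as Humidity <= 75; this sketch covers neither and only illustrates the underlying entropy calculation.

```python
import math
from collections import Counter

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# Stand-in data (assumed): nominal weather dataset; the last column is the class "play".
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, index):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name:12s} {gain:.3f}")
print("split first on:", best)
```

On this stand-in data the highest gain belongs to outlook, which is consistent with Outlook appearing as the first condition in the rules above.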

Association Rules Example
Given data about contact lens use and eye characteristics for a number of people, find associations in the data such as:
- If tear production rate = reduced (low), then contact-lenses = none (i.e. people with dry eyes are not prescribed contact lenses).
- If contact-lenses = hard, then astigmatism = true (i.e. people who are prescribed hard lenses have astigmatism).
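Association rules are usually judged by their support (how often the whole rule holds in the data) and their confidence (how often the consequent holds when the antecedent does). The sketch below computes both for rules of the two shapes shown on this slide. The eight records and the field names (tear_rate, astigmatism, lenses) are hypothetical; they are not the dataset used in the presentation.

```python
# Hypothetical records, invented for illustration only.
records = [
    {"tear_rate": "reduced", "astigmatism": "no", "lenses": "none"},
    {"tear_rate": "reduced", "astigmatism": "yes", "lenses": "none"},
    {"tear_rate": "normal", "astigmatism": "no", "lenses": "soft"},
    {"tear_rate": "normal", "astigmatism": "yes", "lenses": "hard"},
    {"tear_rate": "normal", "astigmatism": "yes", "lenses": "hard"},
    {"tear_rate": "normal", "astigmatism": "no", "lenses": "soft"},
    {"tear_rate": "reduced", "astigmatism": "yes", "lenses": "none"},
    {"tear_rate": "normal", "astigmatism": "yes", "lenses": "none"},
]

def matches(record, conditions):
    """True if the record has every attribute value listed in conditions."""
    return all(record.get(key) == value for key, value in conditions.items())

def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule 'antecedent => consequent'."""
    covered = [r for r in records if matches(r, antecedent)]
    correct = [r for r in covered if matches(r, consequent)]
    support = len(correct) / len(records)
    confidence = len(correct) / len(covered) if covered else 0.0
    return support, confidence

dry_eyes = rule_stats(records, {"tear_rate": "reduced"}, {"lenses": "none"})
hard_lenses = rule_stats(records, {"lenses": "hard"}, {"astigmatism": "yes"})

print("tear_rate=reduced => lenses=none:", dry_eyes)
print("lenses=hard => astigmatism=yes:", hard_lenses)
```

A rule miner enumerates many candidate rules and keeps those whose support and confidence exceed chosen thresholds; this sketch only scores two rules that are given in advance.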

Text Mining Example
- Oberlinconfessional.com is a website for anonymous confessions, restricted to Oberlin.
- “Automatically Categorizing Written Texts by Author Gender” by Moshe Koppel et al. describes an algorithm for predicting the gender of a text’s writer based on word occurrences.

Results:
- Posts are more male than female at 6:00 AM, 7:00 AM, and 5:00 PM (possible reason: women don’t stay up that late).
- Posts are more female than male throughout the rest of the day (possible reason: there are more women than men in the community).

Software
- Weka toolkit: Java-based open source data mining workbench (with reusable code).
- Pentaho: open source business intelligence suite.
- IBM DB2 Data Warehouse Edition: complete data warehouse suite with mining and visualizing capabilities (easily googleable).
- SPSS: back-end software as well as a range of industry-specific data mining solutions.
- SAS: commercial text mining tools and business intelligence server.

Presentation Summary
- What is data mining?
- Evolution of data mining
- Applications
- Process
- Models: predictive vs. descriptive
- Decision tree (classification rules) example
- Association rules example
- Text mining example
- Software used
Slide was repeated because YOU are a hetero-associative learner.

Questions