Motivation: Why Data Mining?
Holy grail: informed decision making.
Lots of data are being collected:
– Business: transactions, web logs, GPS tracks, …
– Science: remote sensing, micro-array gene expression data, …
Challenges:
– Volume of data >> number of human analysts
– Some automation is needed
Limitations of relational databases:
– Cannot predict the future (i.e., answer questions about items not in the database). Ex. Predict tomorrow's weather or the credit-worthiness of a new customer.
– Cannot compute transitive closure and more complex questions. Ex. What are the natural groups of customers? Which subsets of items are bought together?
Data mining may help!
– Provide better and customized insights for business
– Help scientists with hypothesis generation

Motivation for Data Mining
Understanding of a (new) phenomenon:
– Discovery of a model may be aided by patterns
– Ex. London: cholera deaths clustered around a water pump
– Patterns narrow down potential causes
– Changed hypothesis: miasma => water-borne disease
Though the final model may not involve patterns:
– Cause-effect, e.g., cholera is caused by germs

Data Mining: Definition
The process of discovering interesting, useful, non-trivial patterns from large datasets.
– Patterns interest the non-specialist; exceptions to patterns interest the specialist.
Pattern families:
1. Clusters
2. Outliers, anomalies
3. Associations, correlations
4. Classification and prediction models
5. …

What's NOT Data Mining
Simple querying or summarization of data:
– Ex. Find the number of Subaru drivers in Ramsey County
– Search space is not large (not exponential)
Testing a hypothesis via a primary data analysis:
– Ex. Do Subaru drivers vote for Democrats?
– Search space is not large!
– DM: secondary data analysis to generate multiple plausible hypotheses
Uninteresting or obvious patterns in data:
– Ex. Minneapolis and St. Paul have similar climates
– Common knowledge: nearby places have similar climates!

Context of Data Mining Models
CRISP-DM (CRoss-Industry Standard Process for Data Mining) phases:
– Application/Business Understanding
– Data Understanding
– Data Preparation
– Modeling
– Evaluation
– Deployment
[Figure: the phases of CRISP-DM]

Outline
– Clustering
– Outlier Detection
– Association Rules
– Classification & Prediction
– Summary

Clustering: What are natural groups of employees?
Relation R, with K = 2:

Id   Age   Years of Service
A    30    5
B    50    25
C    50    15
D    25    5
E    30    10
F    55    25

Clustering: Geometric View shows 2 groups!
[Scatter plot of relation R, Age vs. Years of Service: A, D, E form one group; B, C, F form the other]

K-Means Algorithm: 1. Start with random seeds
[Scatter plot: the six points of R with two randomly placed seeds]

K-Means Algorithm: 2. Assign points to closest seed
[Scatter plots: point colors show the closest seed for each point]

K-Means Algorithm: 3. Revise seeds to group centers
[Scatter plot: each seed moves to the center (mean) of its group]

K-Means Algorithm: 2. Assign points to closest seed (repeated with the revised seeds)
[Scatter plots: point colors show the closest revised seed]

K-Means Algorithm: 3. Revise seeds to group centers (repeated)
[Scatter plots: seeds move again to their group centers]

K-Means Algorithm: If the seeds changed, then loop back to step 2 (assign points to closest seed)
[Scatter plot: point colors show the closest seed]

K-Means Algorithm: Termination. Step 3 (revise seeds to group centers) no longer changes the seeds.
[Scatter plots: final seeds and the two groups {A, D, E} and {B, C, F}]
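
The loop above is compact enough to sketch in code. Below is a minimal pure-Python K-means run on relation R; the starting seed positions are an arbitrary illustrative choice, since the slides leave the random seeding unspecified:

```python
# Minimal K-means sketch (pure Python, 2-D points).
# The seed positions below are an illustrative assumption.

def kmeans(points, seeds, max_iters=100):
    for _ in range(max_iters):
        # Step 2: assign each point to its closest seed (squared distance).
        clusters = [[] for _ in seeds]
        for x, y in points:
            d = [(x - sx) ** 2 + (y - sy) ** 2 for sx, sy in seeds]
            clusters[d.index(min(d))].append((x, y))
        # Step 3: revise each seed to its group's center (mean).
        new_seeds = [(sum(x for x, _ in c) / len(c),
                      sum(y for _, y in c) / len(c)) if c else s
                     for c, s in zip(clusters, seeds)]
        if new_seeds == seeds:      # seeds unchanged: terminate
            break
        seeds = new_seeds
    return seeds, clusters

# Relation R: (Age, Years of Service) for employees A..F.
R = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
seeds, clusters = kmeans(R, seeds=[(30, 5), (50, 25)])
print(seeds)      # about (28.3, 6.7) and (51.7, 21.7)
print(clusters)   # groups {A, D, E} and {B, C, F}
```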

Outline
– Clustering
– Outlier Detection
– Association Rules
– Classification & Prediction
– Summary

Outliers – Global and Local
Ex. Traffic data in the Twin Cities
[Figure: traffic sensor time series; Sensor 9 is abnormal]

Outlier Detection
Distribution tests:
– Global outliers, i.e., different from the population
– Local outliers, i.e., different from their neighbors
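
As a concrete instance of a distribution test, a value can be flagged as a global outlier when it lies far from the population mean. A minimal sketch, assuming a 2-standard-deviation threshold and made-up sensor readings (neither comes from the slides):

```python
# Global outlier test sketch: flag readings more than 2 standard
# deviations from the mean. Data and threshold are illustrative.
import statistics

readings = [42, 45, 44, 43, 46, 41, 95, 44]   # hypothetical sensor values
mean = statistics.mean(readings)               # 50.0
sd = statistics.stdev(readings)                # about 18.3

outliers = [x for x in readings if abs(x - mean) > 2 * sd]
print(outliers)   # [95]: the abnormal reading
```

A local outlier test would instead compare each reading to its spatial or temporal neighbors rather than to the whole population.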

Outline
– Clustering
– Outlier Detection
– Association Rules
– Classification & Prediction
– Summary

Associations: Which items are bought together?
Input: transactions with item-types.

Transaction   Items bought
1             {socks, …, milk, …, beef, egg, …}
2             {pillow, …, toothbrush, ice-cream, muffin, …}
3             {…, pacifier, formula, blanket, …}
…             …
n             {battery, juice, beef, egg, chicken, …}

Metrics balance computation cost and statistical interpretation!
– Support: probability(Diaper and Beer in T) = 2/5
– Confidence: probability(Beer in T | Diaper in T) = 2/2
Algorithm: Apriori [Agrawal & Srikant, VLDB 1994]
– Support-based pruning using monotonicity
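
Both metrics are simple ratios over the transaction set. A minimal sketch for the rule Diaper => Beer, over five hypothetical transactions chosen to reproduce the 2/5 and 2/2 figures above (the actual transaction contents are not fully legible in the slide):

```python
# Support and confidence for the rule Diaper => Beer.
# The five transactions are illustrative assumptions.
transactions = [
    {"socks", "diaper", "milk", "beer", "beef", "egg"},
    {"diaper", "beer", "pillow", "toothbrush"},
    {"pacifier", "formula", "blanket"},
    {"bread", "eggs"},
    {"battery", "juice", "beef", "egg", "chicken"},
]

both   = sum(1 for t in transactions if {"diaper", "beer"} <= t)   # 2
diaper = sum(1 for t in transactions if "diaper" in t)             # 2

support = both / len(transactions)   # P(Diaper and Beer in T) = 2/5
confidence = both / diaper           # P(Beer in T | Diaper in T) = 2/2
print(support, confidence)           # 0.4 1.0
```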

Apriori Algorithm: How to eliminate infrequent item-sets asap?

Transaction Id   Time    Item-types bought
…                …:35    Milk, bread, cookies, juice
792              19:38   Milk, juice
…                …:05    Milk, eggs
…                …:40    Bread, cookies, coffee

Support threshold >= 0.5

Apriori Algorithm: Eliminate infrequent singleton sets
Count each item-type over the four transactions (support threshold >= 0.5, i.e., count >= 2):

Item-type   Count
Milk        3
Bread       2
Cookies     2
Juice       2
Coffee      1
Eggs        1

[Lattice diagram: Milk, Bread, Cookies, Juice survive; Eggs and Coffee are pruned]

Apriori Algorithm: Make pairs from frequent items & prune infrequent pairs!
Form pairs only from the frequent singletons (Milk, Bread, Cookies, Juice):

Item pair         Count
Milk, Bread       1
Milk, Cookies     1
Milk, Juice       2
Bread, Cookies    2
Bread, Juice      1
Cookies, Juice    1

With support threshold >= 0.5, only {Milk, Juice} and {Bread, Cookies} survive.
[Lattice diagram: the other four pairs are pruned]

Apriori Algorithm: Make triples from frequent pairs & prune infrequent triples!
No candidate triples are generated, due to monotonicity: every 3-item candidate contains at least one infrequent pair.
The Apriori algorithm examined only 12 subsets (6 singletons and 6 pairs) instead of all 64 subsets of the 6 item-types!
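
The level-wise search above fits in a few lines. A minimal sketch of Apriori on the four transactions, with support threshold 0.5 (count >= 2):

```python
# Apriori sketch: generate k-item candidates from frequent (k-1)-item
# sets, prune by monotonicity, and count support in the transactions.
from itertools import combinations

transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
min_count = 2   # support threshold 0.5 over 4 transactions

def frequent(candidates):
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_count}

# Level 1: frequent singletons.
level = frequent({frozenset([i]) for t in transactions for i in t})
k = 2
while level:
    print(sorted(sorted(s) for s in level))
    # Join step: unions of two frequent sets that have exactly k items.
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # Prune step (monotonicity): every (k-1)-subset must be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in level for s in combinations(c, k - 1))}
    level = frequent(candidates)
    k += 1
# Prints the 4 frequent singletons, then the frequent pairs
# {milk, juice} and {bread, cookies}; no triples survive the prune.
```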

Outline
– Clustering
– Outlier Detection
– Association Rules
– Classification & Prediction
– Summary

Find a (decision-tree) model to predict LoanWorthy!
Predict the class column LoanWorthy from the other columns.

Learning samples:

RID   Married   Salary     Acct_balance   Age    LoanWorthy
1     No        >=50K      <5K            >=25   Yes
2     …         >=50K      >=5K           >=25   Yes
3     Yes       20K..50K   <5K            <25    No
4     No        <20K       >=5K           <25    No
5     No        <20K       <5K            >=25   No
6     Yes       20K..50K   >=5K           >=25   Yes

Testing sample:

RID   Married   Salary   Acct_balance   Age    LoanWorthy
7     Yes       <20K     >=5K           >=25   ?

A Decision Tree to Predict LoanWorthy from the other columns:

Salary?
– <20K: No (RIDs 4, 5)
– 20K..50K: Age?
  – <25: No (RID 3)
  – >=25: Yes (RID 6)
– >=50K: Yes (RIDs 1, 2)

Q? What is the decision on the new application (testing sample, RID 7)?
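
Read off the slide, the tree is just nested if/else on the attribute values, and applying it to the new application answers the question. A sketch (attribute values kept as the strings used in the table):

```python
# The decision tree above as nested if/else.
def loanworthy(salary, age):
    if salary == "<20K":
        return "No"                              # RIDs 4, 5
    if salary == "20K..50K":
        return "No" if age == "<25" else "Yes"   # RIDs 3, 6
    return "Yes"                                 # ">=50K": RIDs 1, 2

# RID 7: Married=Yes, Salary=<20K, Acct_balance=>=5K, Age=>=25.
print(loanworthy("<20K", ">=25"))   # -> No
```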

Another Decision Tree to Predict LoanWorthy from the other columns:

Age?
– <25: No (RIDs 3, 4)
– >=25: Salary?
  – <20K: No (RID 5)
  – 20K..50K: Yes (RID 6)
  – >=50K: Yes (RIDs 1, 2)

Q? What is the decision on the new application (RID 7)?

ID3 Algorithm: Choosing a decision for the Root Node (1)
Predict the class LoanWorthy from the other columns. First, count how many groups each candidate attribute splits the learning samples into:

Attribute   Married   Salary   Acct_balance   Age   LoanWorthy
# Groups    2         3        2              2     2

ID3 Algorithm: Choosing a decision for the Root Node (2)
Next, list the class labels (y/n for LoanWorthy) inside each group:
– Married: yes {n, y}; no {y, n, n} (Married is unrecorded for RID 2)
– Salary: <20K {n, n}; 20K..50K {n, y}; >=50K {y, y}
– Acct_balance: <5K {y, n, n}; >=5K {y, n, y}
– Age: <25 {n, n}; >=25 {y, y, n, y}
– LoanWorthy: {y, y, y}; {n, n, n}

ID3 Algorithm: Choosing a decision for the Root Node (3)
Compute the entropy of each split as the weighted average of its groups' entropies:
– Salary: (2/6)*0 + (2/6)*1 + (2/6)*0 = 0.33
– Acct_balance: (3/6)*0.92 + (3/6)*0.92 = 0.92
– Age: (2/6)*0 + (4/6)*0.81 = 0.54

ID3 Algorithm: Choosing a decision for the Root Node (4)
Information gain = entropy of the class column (here 1, since LoanWorthy has 3 y and 3 n) minus the split entropy:
– Gain(Salary) = 1 - 0.33 = 0.67
– Gain(Acct_balance) = 1 - 0.92 = 0.08
– Gain(Age) = 1 - 0.54 = 0.46

Root Node: the decision is based on Salary, the attribute with the largest information gain.
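
The entropy and gain numbers follow mechanically from the group listings two slides back. A minimal sketch of the computation (binary entropy in bits, as ID3 uses):

```python
# Entropy of a split = weighted average entropy of its groups;
# gain = entropy of the class column minus the split entropy.
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count("y"), labels.count("n")) if c)

def gain(groups, parent):
    split = sum(len(g) / len(parent) * entropy(g) for g in groups)
    return entropy(parent) - split

parent = ["y", "y", "n", "n", "n", "y"]                  # LoanWorthy, RIDs 1..6
print(gain([["n","n"], ["n","y"], ["y","y"]], parent))   # Salary: ~0.67
print(gain([["y","n","n"], ["y","n","y"]], parent))      # Acct_balance: ~0.08
print(gain([["n","n"], ["y","y","n","y"]], parent))      # Age: ~0.46
```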

Root Node of a Decision Tree to Predict LoanWorthy

Salary?
– <20K: RIDs 4, 5 (No, No)
– 20K..50K: RIDs 3 (No), 6 (Yes)
– >=50K: RIDs 1, 2 (Yes, Yes)

ID3 Algorithm: Which leaves need refinement?

Salary?
– <20K: RIDs 4, 5 (No, No): pure, no refinement needed
– 20K..50K: RIDs 3 (No), 6 (Yes): mixed, needs refinement
– >=50K: RIDs 1, 2 (Yes, Yes): pure, no refinement needed

ID3 Algorithm Output: A Decision Tree to Predict the LoanWorthy column from the other columns

Salary?
– <20K: No (RIDs 4, 5)
– 20K..50K: Age?
  – <25: No (RID 3)
  – >=25: Yes (RID 6)
– >=50K: Yes (RIDs 1, 2)

Another Decision Tree to Predict LoanWorthy from the other columns

Salary?
– <20K: No (RIDs 4, 5)
– 20K..50K: Acct_balance?
  – <5K: No (RID 3)
  – >=5K: Yes (RID 6)
– >=50K: Yes (RIDs 1, 2)

A Decision Root Not Preferred by ID3
ID3 prefers Salary over Age for the root-node decision due to the difference in information gain, even though the two choices are comparable in classification accuracy.

Age?
– <25: RIDs 3, 4 (No, No)
– >=25: RIDs 1 (Yes), 2 (Yes), 5 (No), 6 (Yes)

A Decision Tree Not Preferred by ID3
ID3 is greedy, preferring Salary over Age for the decision in the root node. Thus it prefers the decision trees on the earlier slides over the following one (despite comparable quality):

Age?
– <25: No (RIDs 3, 4)
– >=25: Salary?
  – <20K: No (RID 5)
  – 20K..50K: Yes (RID 6)
  – >=50K: Yes (RIDs 1, 2)

Summary
Data mining: the process of discovering interesting, useful, non-trivial patterns from large datasets.
Pattern families:
1. Clusters, e.g., K-Means
2. Outliers, anomalies
3. Associations, correlations
4. Classification and prediction models, e.g., decision trees
5. …

Review Quiz
Consider a Washingtonian.com article about election micro-targeting using a database of 200+ million records about individuals. The database is compiled from voter lists and memberships (e.g., advocacy groups, frequent-buyer cards, catalog/magazine subscriptions, …), as well as polls/surveys of effective messages and preferences.

Q1. Match each of the following use-cases from the article to one of these categories: traditional SQL2 query, association, clustering, or classification.
(i) How many single Asian men under 35 live in a given congressional district?
(ii) How many college-educated women with children at home are in Canton, Ohio?
(iii) Jaguar, Land Rover, and Porsche owners tend to be more Republican, while Subaru, Hyundai, and Volvo drivers lean Democratic.
(iv) Some of the strongest predictors of political ideology are things like education, homeownership, income level, and household size.
(v) Religion and gun ownership are the two most powerful predictors of partisan ID.
(vi) … it even studied the roads Republicans drove as they commuted to work, which allowed the party to put its billboards where they would do the most good.
(vii) Catalyst and its competitors can build models to predict voter choices. … Based on how alike they are, you can assign a probability to them. … a likelihood of support on each person based on how many character traits a person shares with your known supporters.
(viii) Will 51 percent of the voters buy what the RNC candidate is offering? Or will the DNC candidate seem like a better deal?

Q2. Compare and contrast data mining with relational databases.
Q3. Compare and contrast data mining with traditional statistics (or machine learning).