Data Mining 198:541. Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Definition Data mining is the exploration and analysis of large.

Slides:



Advertisements
Similar presentations
An Introduction to Data Mining
Advertisements

(Some slides courtesy of Rich Caruana, Cornell University)
The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 4: An Introduction To Data Mining (II) Johannes Gehrke
Mining Association Rules from Microarray Gene Expression Data.
Data Mining Sangeeta Devadiga CS 157B, Spring 2007.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
ICS 421 Spring 2010 Data Mining 1 Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 4/6/20101Lipyeow Lim.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 28 Data Mining Concepts.
© Prentice Hall1 DATA MINING TECHNIQUES Introductory and Advanced Topics Eamonn Keogh (some slides adapted from) Margaret Dunham Dr. M.H.Dunham, Data Mining,
Normalization and Data Mining R&G Chapter 19 Lecture 27 Science is the knowledge of consequences, and dependence of one fact upon another. Thomas Hobbes.
Data Mining By Archana Ketkar.
Data Mining Adrian Tuhtan CS157A Section1.
Data Mining – Intro.
CS157A Spring 05 Data Mining Professor Sin-Min Lee.
Pattern Recognition Lecture 20: Data Mining 3 Dr. Richard Spillman Pacific Lutheran University.
Data Mining: A Closer Look Chapter Data Mining Strategies 2.
Chapter 5 Data mining : A Closer Look.
TURKISH STATISTICAL INSTITUTE INFORMATION TECHNOLOGIES DEPARTMENT (Muscat, Oman) DATA MINING.
Enterprise systems infrastructure and architecture DT211 4
Data Mining By Andrie Suherman. Agenda Introduction Major Elements Steps/ Processes Tools used for data mining Advantages and Disadvantages.
1 Data Mining, Database Tuning Tuesday, Feb. 27, 2007.
Data Mining : Introduction Chapter 1. 2 Index 1. What is Data Mining? 2. Data Mining Functionalities 1. Characterization and Discrimination 2. MIning.
Dr. Awad Khalil Computer Science Department AUC
Knowledge Discovery & Data Mining process of extracting previously unknown, valid, and actionable (understandable) information from large databases Data.
Chapter 5: Data Mining for Business Intelligence
Data Mining Techniques
MAKING THE BUSINESS BETTER Presented By Mohammed Dwikat DATA MINING Presented to Faculty of IT MIS Department An Najah National University.
Data Mining Dr. Chang Liu. What is Data Mining Data mining has been known by many different terms Data mining has been known by many different terms Knowledge.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
Data Mining Chun-Hung Chou
Tang: Introduction to Data Mining (with modification by Ch. Eick) I: Introduction to Data Mining A.Short Preview 1.Initial Definition of Data Mining 2.Motivation.
3 Objects (Views Synonyms Sequences) 4 PL/SQL blocks 5 Procedures Triggers 6 Enhanced SQL programming 7 SQL &.NET applications 8 OEM DB structure 9 DB.
Data Mining CS157B Fall 04 Professor Lee By Yanhua Xue.
INTRODUCTION TO DATA MINING MIS2502 Data Analytics.
Knowledge Discovery and Data Mining Evgueni Smirnov.
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
Knowledge Discovery and Data Mining Evgueni Smirnov.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
CS157B Fall 04 Introduction to Data Mining Chapter 22.3 Professor Lee Yu, Jianji (Joseph)
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
CRM - Data mining Perspective. Predicting Who will Buy Here are five primary issues that organizations need to address to satisfy demanding consumers:
Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.
Instructor : Prof. Marina Gavrilova. Goal Goal of this presentation is to discuss in detail how data mining methods are used in market analysis.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
Association Rule Mining
DATA MINING By Cecilia Parng CS 157B.
MIS2502: Data Analytics Advanced Analytics - Introduction.
Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Definition Data mining is the exploration and analysis of large quantities of data.
Data Mining and Decision Support
Academic Year 2014 Spring Academic Year 2014 Spring.
Data Mining Copyright KEYSOFT Solutions.
Monday, February 22,  The term analytics is often used interchangeably with:  Data science  Data mining  Knowledge discovery  Extracting useful.
Chapter 26: Data Mining Prepared by Assoc. Professor Bela Stantic.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 28 Data Mining Concepts.
Introduction to Data Mining Mining Association Rules Reference: Tan et al: Introduction to data mining. Some slides are adopted from Tan et al.
Data Mining – Introduction (contd…) Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Data Mining Functionalities
Data Mining.
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Data Mining – Intro.
MIS2502: Data Analytics Advanced Analytics - Introduction
DATA MINING © Prentice Hall.
A Research Oriented Study Report By :- Akash Saxena
Data Mining 101 with Scikit-Learn
Adrian Tuhtan CS157A Section1
Market Basket Analysis and Association Rules
Sangeeta Devadiga CS 157B, Spring 2007
I don’t need a title slide for a lecture
Market Basket Analysis and Association Rules
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Presentation transcript:

Data Mining 198:541

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Definition Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6%

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Definition (Cont.) Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Valid: The patterns hold in general. Novel: We did not know the pattern beforehand. Useful: We can devise actions from the patterns. Understandable: We can interpret and comprehend the patterns.

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Why Use Data Mining Today? Human analysis skills are inadequate: Volume and dimensionality of the data High data growth rate Availability of: Data Storage Computational power Off-the-shelf software Expertise

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. An Abundance of Data Supermarket scanners, POS data Preferred customer cards Credit card transactions Direct mail response Call center records ATM machines Demographic data Sensor networks Cameras Web server logs Customer web site trails

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Why Use Data Mining Today? Competitive pressure! “ The secret of success is to know something that nobody else knows. ” Aristotle Onassis Competition on service, not only on price (Banks, phone companies, hotel chains, rental car companies) Personalization, CRM The real-time enterprise “ Systemic listening ” Security, homeland defense

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. The Knowledge Discovery Process Steps: lIdentify business problem lData mining lAction lEvaluation and measurement lDeployment and integration into businesses processes

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Data Mining Step in Detail 2.1 Data preprocessing Data selection: Identify target datasets and relevant fields Data cleaning Remove noise and outliers Data transformation Create common units Generate new fields 2.2 Data mining model construction 2.3 Model evaluation

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Preprocessing and Mining Original Data Target Data Preprocessed Data Patterns Knowledge Data Integration and Selection Preprocessing Model Construction Interpretation

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Example Application: Sky Survey Input data: 3 TB of image data with 2 billion sky objects, took more than six years to complete Goal: Generate a catalog with all objects and their type Method: Use decision trees as data mining model Results: 94% accuracy in predicting sky object classes Increased number of faint objects classified by 300% Helped team of astronomers to discover 16 new high red-shift quasars in one order of magnitude less observation time

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Gold Nuggets? Investment firm mailing list: Discovered that old people do not respond to IRA mailings Bank clustered their customers. One cluster: Older customers, no mortgage, less likely to have a credit card “ Bank of 1911 ” Customer churn example

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. What is a Data Mining Model? A data mining model is a description of a specific aspect of a dataset. It produces output values for an assigned set of input values. Examples: Linear regression model Classification model Clustering

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Data Mining Models (Contd.) A data mining model can be described at two levels: Functional level: Describes model in terms of its intended usage. Examples: Classification, clustering Representational level: Specific representation of a model. Example: Log-linear model, classification tree, nearest neighbor method. Black-box models versus transparent models

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Data Mining: Types of Data Relational data and transactional data Spatial and temporal data, spatio-temporal observations Time-series data Text Images, video Mixtures of data Sequence data Features from processing other data sources

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Types of Variables Numerical: Domain is ordered and can be represented on the real line (e.g., age, income) Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race) Ordinal: Domain is ordered, but absolute differences between values is unknown (e.g., preference scale, severity of an injury)

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Data Mining Techniques Supervised learning Classification and regression Unsupervised learning Clustering Dependency modeling Associations, summarization, causality Outlier and deviation detection Trend analysis and change detection

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Market Basket Analysis Consider shopping cart filled with several items Market basket analysis tries to answer the following questions: Who makes purchases? What do customers buy together? In what order do customers purchase items?

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Market Basket Analysis Given: A database of customer transactions Each transaction is a set of items Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Market Basket Analysis (Contd.) Coocurrences 80% of all customers purchase items X, Y and Z together. Association rules 60% of all customers who purchase X and Y also buy Z. Sequential patterns 60% of customers who first buy X also purchase Y within three weeks.

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Confidence and Support We prune the set of all possible association rules (LHS => RHS) using two interestingness measures: Confidence of a rule: (% of transactions with LHS that contain RHS) X  Y has confidence c if P(Y|X) = c Support of a rule: (% of transactions that contain LHS U RHS) X  Y has support s if P(XY) = s We can also define Support of an itemset (a coocurrence) XY: XY has support s if P(XY) = s

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Example Examples: {Pen} => {Milk} Support: 75% Confidence: 75% {Ink} => {Pen} Support: 75% Confidence: 100%

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Example Find all itemsets with support >= 75%?

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Example Can you find all association rules with support >= 50%?

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Market Basket Analysis: Applications Sample Applications Direct marketing Fraud detection for medical insurance Floor/shelf planning Web site layout Cross-selling

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Association Rule Algorithms Find all large itemsets For each large itemset, find all association rules with sufficient confidence A priori Algorithm (Agrawal et al.) Any subset of a frequent itemset has to be frequent

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Problem Redux (Contd.) Definitions: An itemset is frequent if it is a subset of at least x transactions. (FI.) An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.) GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)). Obvious relationship: MFI subset FI Example: D={ {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} } Minimum support x = 3 {1,2} is frequent {1,2,3} is maximal frequent Support({1,2}) = 4 All maximal frequent itemsets: {1,2,3}

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. The Itemset Lattice {} {2}{1}{4}{3} {1,2}{2,3}{1,3}{1,4}{2,4} {1,2,3,4} {1,2,3} {3,4} {1,2,4}{1,3,4}{2,3,4}

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Frequent Itemsets Frequent itemsets Infrequent itemsets {} {2}{1}{4}{3} {1,2}{2,3}{1,3}{1,4}{2,4} {1,2,3,4} {1,2,3} {3,4} {1,2,4}{1,3,4}{2,3,4}

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Apriori: 1-Itemsets {} {2}{1}{4}{3} {1,2}{2,3}{1,3}{1,4}{2,4} {1,2,3,4} {1,2,3} {3,4} {1,2,4}{1,3,4}{2,3,4} The Apriori Principle: I infrequent if (I - {x}) infrequent Infrequent Frequent Currently examined Don ’ t know

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Apriori: 2-Itemsets {} {2}{1}{4}{3} {1,2}{2,3}{1,3}{1,4}{2,4} {1,2,3,4} {1,2,3} {3,4} {1,2,4}{1,3,4}{2,3,4} Infrequent Frequent Currently examined Don ’ t know

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Apriori: 3-Itemsets {} {2}{1}{4}{3} {1,2}{2,3}{1,3}{1,4}{2,4} {1,2,3,4} {1,2,3} {3,4} {1,2,4}{1,3,4}{2,3,4} Infrequent Frequent Currently examined Don ’ t know

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Extensions Imposing constraints Only find rules involving the dairy department Only find rules involving expensive products Only find “expensive” rules Only find rules with “whiskey” on the right hand side Only find rules with “milk” on the left hand side Hierarchies on the items Calendars (every Sunday, every 1 st of the month)

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Itemset Constraints Definition: A constraint is an arbitrary property of itemsets. Examples: The itemset has support greater than No element of the itemset costs more than $40. The items in the set average more than $20. Goal: Find all itemsets satisfying a given constraint P. “ Solution ” : If P is a support constraint, use the Apriori Algorithm.

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Finding Association Rules 1.Identify frequent itemsets (Itemsets with support > minsup) 2.Generate candidate rules Divide each frequent itemset X into pairs of LHS and RHS itemsets (LHS U RHS = X) 3.Compute the confidence of the rule: Support(X)/support(LHS) (From Apriori, all LHS and RHS are frequent)

Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Applications Spatial association rules Web mining Market basket analysis User/customer profiling