Data Warehousing and Data Mining

Presentation transcript:

Data Warehousing and Data Mining (3/3/2008)

Why Data Mining? Potential Applications
Database analysis and decision support:
- Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation.
- Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
- Fraud detection and management.
Other applications:
- Text mining (news groups, documents) and Web analysis.
- Intelligent query answering.

What Is Data Mining?
Data mining (knowledge discovery in databases):
- Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information from data in large databases.
Alternative names and their “inside stories”:
- Data mining: a misnomer?
- Knowledge discovery in databases (KDD: SIGKDD), knowledge extraction, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
- (Deductive) query processing.
- Expert systems or small ML/statistical programs.

Data Mining: A KDD Process
Data mining is the core of the knowledge discovery process.
[Figure: Databases -> data cleaning and data integration -> data warehouse -> selection of task-relevant data -> data mining -> pattern evaluation]

Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB systems and information repositories:
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW

Data Mining Functionality
Association:
- From association and correlation to causality.
- Finding rules like “inside(x, city) => near(x, highway)”.
Cluster analysis:
- Group data to form new classes, e.g., cluster houses to find distribution patterns.
Decision tree:
- Prioritize the important factors in constructing a business rule, in a tree format.
Neural network:
- Prioritize the important factors in constructing a business rule, as a ranking of weights.
Genetic algorithm:
- The fitness of a rule is assessed by its classification accuracy on a set of training samples.
Web mining:
- Mining website data to analyze web usage.

Knowledge Discovery Process
1. Data selection
2. Cleaning
3. Enrichment
4. Coding
5. Data mining
6. Reporting

Data Selection
Once you have formulated your informational requirements, the next logical step is to collect and select the data you need. Setting up a KDD activity is also a long-term investment. The data environment will need to be refreshed from operational data on a regular basis, so investing in a data warehouse is an important aspect of the whole process.

Cleaning
Almost all databases in large organizations are polluted, and when we start to look at the data from a data mining perspective, our ideas about the consistency of data change. Therefore, before we start the data mining process, we have to clean up the data as much as possible; in many cases this can be done automatically.

Enrichment
Matching the information from bought-in databases with your own databases can be difficult. A well-known problem is the reconstruction of family relationships in databases. In a relational environment, we can simply join this information with our original data.

Technologies for Mining Frequent Patterns in Large Databases
- What is frequent pattern mining?
- Frequent pattern mining algorithms: Apriori and its variations
- Recent progress on efficient mining methods: mining frequent patterns without candidate generation

What Is Frequent Pattern Mining?
What is a frequent pattern?
- A pattern (set of items, sequence, etc.) that occurs together frequently in a database.
Frequent pattern: an important form of regularity.
- What products were often purchased together? Beers and diapers!
- What are the consequences of a hurricane?
- What is the next target after buying a PC?

Applications of Frequent Pattern Mining
Association analysis:
- Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification
- Association-based classification analysis
Sequential pattern analysis:
- Web log sequences, DNA analysis, etc.

Application Examples
Market basket analysis:
- * => Maintenance Agreement: what should the store do to boost Maintenance Agreement sales?
- Home Electronics => *: what other products should the store stock up on if it has a sale on Home Electronics?
Attached mailing in direct marketing.
Detecting “ping-pong”ing of patients:
- transaction: patient
- item: doctor/clinic visited by a patient
- support of a rule: number of common patients

In general, given a count S of source data, an association rule indicates that the events A1, A2, …, An will most likely be associated with the event B:
S = A1 + A2 + … + B + other events
A1, A2, …, An => B
Each such association has a support level and a confidence level.
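The formula that followed on this slide did not survive extraction. Assuming the standard definitions (consistent with the "Rule Measures" slide later in the deck), and writing count(·) for the number of records in which all the listed events occur (a notation introduced here for illustration), the two measures can be sketched as:

```latex
\[
\text{support}(A_1,\dots,A_n \Rightarrow B)
  = \frac{\operatorname{count}(A_1 \wedge \dots \wedge A_n \wedge B)}{S},
\qquad
\text{confidence}(A_1,\dots,A_n \Rightarrow B)
  = \frac{\operatorname{count}(A_1 \wedge \dots \wedge A_n \wedge B)}
         {\operatorname{count}(A_1 \wedge \dots \wedge A_n)}
\]
```

Support measures how often the whole rule occurs in the data; confidence measures how often B occurs among the records where A1, …, An already occur.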

Association Rule Mining
Given:
- A database of customer transactions.
- Each transaction is a list of items (purchased by a customer in a visit).
Find all rules that correlate the presence of one set of items with that of another set of items.
- Example: 98% of people who purchase tires and auto accessories also get automotive services done.
- Any number of items may appear in the consequent or antecedent of a rule.
- It is possible to specify constraints on rules (e.g., find only rules involving Home Laundry Appliances).
Association rule: if people purchase tires and auto accessories, then they will also get automotive services done. Confidence level: 98%.

Basic Concepts
Rule form: “A => B [support s, confidence c]”.
- Support: usefulness of discovered rules.
- Confidence: certainty of the detected association.
- Rules that satisfy both min_sup and min_conf are called strong.
Examples:
- buys(x, “diapers”) => buys(x, “beers”) [0.5%, 60%]
- age(x, “30-34”) ^ income(x, “42K-48K”) => buys(x, “high resolution TV”) [2%, 60%]
- major(x, “CS”) ^ takes(x, “DB”) => grade(x, “A”) [1%, 75%]
Association rule: if Major = “CS” and takes “DB”, then Grade = “A”. Support level = 1%; confidence level = 75%.

Rule Measures: Support and Confidence
Find all the rules X & Y => Z with minimum confidence and support.
- Support, s: probability that a transaction contains {X, Y, Z}.
- Confidence, c: conditional probability that a transaction having {X, Y} also contains Z.
With minimum support 50% and minimum confidence 50%, we have:
- A => C (50%, 66.6%)
- C => A (50%, 100%)
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
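The transaction table behind these numbers is not in the transcript. As a minimal sketch, the four-transaction database below is hypothetical, chosen only so that it is consistent with the quoted figures for A => C and C => A; the function names are also illustrative.

```python
# Hypothetical database (an assumption, not recovered from the slide),
# chosen to be consistent with the quoted support and confidence values.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]


def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)


def confidence(antecedent, consequent, db):
    """Estimate of P(consequent | antecedent) from transaction counts."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)


s_ac = support({"A", "C"}, transactions)
c_ac = confidence({"A"}, {"C"}, transactions)
c_ca = confidence({"C"}, {"A"}, transactions)
print(f"A => C: support {s_ac:.1%}, confidence {c_ac:.1%}")
print(f"C => A: support {s_ac:.1%}, confidence {c_ca:.1%}")
```

Both rules share the same support because support depends only on the combined itemset {A, C}; the confidences differ because A occurs in more transactions than C does.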

Frequent Pattern Mining Methods: Apriori and Its Variations
- The Apriori algorithm
- Improvements of Apriori
- Incremental, parallel, and distributed methods
- Different measures in association mining

An Influential Mining Methodology: The Apriori Algorithm
The Apriori method:
- Proposed by Agrawal & Srikant, 1994.
- A similar level-wise algorithm by Mannila et al.
Major idea:
- A subset of a frequent itemset must be frequent. E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be. If any itemset is infrequent, its supersets cannot be frequent!
- A powerful, scalable candidate set pruning technique: it reduces candidate k-itemsets dramatically (for k > 2).

Mining Association Rules: Example
Min. support 50%, min. confidence 50%.
For rule A => C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent.

Procedure of Mining Association Rules:
1. Find the frequent itemsets: the sets of items that have minimum support (Apriori).
   - A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
   - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
2. Use the frequent itemsets to generate association rules.
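The second step of the procedure can be sketched as follows. The support counts are tallied from the nine AllElectronics transactions listed in the FP-tree walkthrough later in the deck; the function name `rules_from` and the 70% confidence threshold are assumptions made for illustration.

```python
from itertools import combinations

# Support counts for {I1, I2, I5} and its subsets, tallied from the nine
# AllElectronics transactions (T100..T900) shown later in the deck.
support_count = {
    frozenset(["I1"]): 6,
    frozenset(["I2"]): 7,
    frozenset(["I5"]): 2,
    frozenset(["I1", "I2"]): 4,
    frozenset(["I1", "I5"]): 2,
    frozenset(["I2", "I5"]): 2,
    frozenset(["I1", "I2", "I5"]): 2,
}


def rules_from(itemset, support_count, min_conf):
    """List rules antecedent => (itemset - antecedent) meeting min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), size)):
            # confidence = count(whole itemset) / count(antecedent)
            conf = support_count[itemset] / support_count[antecedent]
            if conf >= min_conf:
                rules.append((sorted(antecedent), sorted(itemset - antecedent), conf))
    return rules


rules = rules_from(["I1", "I2", "I5"], support_count, min_conf=0.7)
for lhs, rhs, conf in rules:
    print(f"{' & '.join(lhs)} => {' & '.join(rhs)}  (confidence {conf:.0%})")
```

No extra database scan is needed at this stage: every count the rules require was already collected while finding the frequent itemsets.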

The Apriori Algorithm
(C_k: candidate itemsets of size k; L_k: frequent itemsets of size k)
- Join step: C_k is generated by joining L_{k-1} with itself.
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence any candidate containing it should be removed.
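A minimal sketch of the join and prune steps, assuming itemsets are stored as sorted tuples and that two (k-1)-itemsets are joined when they share their first k-2 items. The function name and the sample L3 are hypothetical and are not taken from the slides.

```python
from itertools import combinations


def generate_candidates(prev_frequent, k):
    """Build C_k from L_{k-1}: join step followed by prune step."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    prev_set = set(prev)
    candidates = set()
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            # Join: merge two (k-1)-itemsets that agree on their first k-2 items.
            if a[:k - 2] == b[:k - 2]:
                cand = tuple(sorted(set(a) | set(b)))
                # Prune: keep the candidate only if all its (k-1)-subsets are frequent.
                if all(sub in prev_set for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates


# Hypothetical frequent 3-itemsets, for illustration only.
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]
C4 = generate_candidates(L3, 4)
print(C4)
```

Here the join produces two 4-itemsets, abcd and acde. The second is pruned because its subset ade is not in L3, so only abcd remains to be counted against the database.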

Apriori: Pseudocode
C_k: candidate itemsets of size k
L_k: frequent itemsets of size k
    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do begin
        C_{k+1} = candidates generated from L_k;
        for each transaction t in database do
            increment the count of all candidates in C_{k+1} that are contained in t
        L_{k+1} = candidates in C_{k+1} with min_support
    end
    return ∪_k L_k;
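The pseudocode can be turned into a short runnable sketch. This version is one possible reading, not the original authors' implementation: candidates are formed as unions of two frequent k-itemsets (a simpler but less efficient join than the prefix-based one), and support is an absolute count. The input is the set of nine AllElectronics transactions listed later in the deck, with a minimum support count of 2.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset itemset: support count} for every frequent itemset."""
    db = [frozenset(t) for t in transactions]
    # L_1: count single items in one scan.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # C_{k+1}: unions of two frequent k-itemsets with exactly k+1 items,
        # kept only if every k-subset is frequent (Apriori pruning).
        candidates = {
            a | b
            for a, b in combinations(level, 2)
            if len(a | b) == k + 1
            and all(frozenset(s) in level for s in combinations(a | b, k))
        }
        # One database scan counts all surviving candidates.
        counts = {c: sum(c <= t for t in db) for c in candidates}
        level = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent


transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
frequent = apriori(transactions, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The loop stops when a level produces no frequent itemsets, which mirrors the `L_k != ∅` condition in the pseudocode.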

The Apriori Algorithm: Example
[Figure: worked example in which database D is scanned repeatedly to produce C1, L1, C2, L2, C3 and L3]

Mining Frequent Itemsets without Candidate Generation
The Apriori candidate generate-and-test method suffers from the following costs:
- It may need to generate a huge number of candidate sets.
- It may need to repeatedly scan the database and check a large set of candidates by pattern matching.
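To get a feel for the first cost, the following back-of-the-envelope sketch counts candidates. The helper names and the sample sizes are assumptions chosen for illustration; the arithmetic itself is standard combinatorics.

```python
from math import comb


def candidate_2_itemsets(n_frequent_items):
    """Number of candidate 2-itemsets obtained by joining n frequent 1-itemsets."""
    return comb(n_frequent_items, 2)


def nonempty_subsets(pattern_length):
    """Number of non-empty sub-patterns of one frequent pattern of this length,
    all of which are frequent by the Apriori principle."""
    return 2 ** pattern_length - 1


print(candidate_2_itemsets(10_000))
print(nonempty_subsets(100))
```

Even a modest number of frequent items leads to tens of millions of 2-item candidates, and a single long frequent pattern implies an astronomically large number of frequent sub-patterns.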

Frequent-Pattern Growth (FP-growth)
FP-growth adopts a divide-and-conquer strategy. It first compresses the database representing frequent items into a frequent-pattern tree (FP-tree). Mining of the FP-tree then starts from each frequent length-1 pattern (as an initial suffix pattern), constructs its conditional pattern base (a “subdatabase” consisting of the set of prefix paths in the FP-tree), and then constructs its (conditional) FP-tree.

Frequent Pattern Tree Algorithm
Step 1: Create a table of candidate data items in descending order of frequency.
Step 2: Build the Frequent Pattern Tree according to each event of the candidate data items.
Step 3: Link the table with the tree.

Table: Transactional data for an AllElectronics branch

Figure: An FP-tree that registers compressed, frequent pattern information

Step 1: Get the frequent 1-itemsets in descending order of count, with a user-required support level of 2.
Item  Count
I2    7
I1    6
I3    6
I4    2
I5    2
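Step 1 can be reproduced with a short sketch. The transactions are the nine listed on the following slides; breaking ties between equally frequent items by item name is an assumption that happens to match the order shown in the table.

```python
from collections import Counter

MIN_COUNT = 2  # user-required support level from the slide

transactions = {
    "T100": ["I1", "I2", "I5"], "T200": ["I2", "I4"], "T300": ["I2", "I3"],
    "T400": ["I1", "I2", "I4"], "T500": ["I1", "I3"], "T600": ["I2", "I3"],
    "T700": ["I1", "I3"], "T800": ["I1", "I2", "I3", "I5"], "T900": ["I1", "I2", "I3"],
}

# Count how many transactions contain each item.
counts = Counter(item for items in transactions.values() for item in items)

# Keep frequent items, sorted by descending count (ties broken by item name).
order = sorted((i for i in counts if counts[i] >= MIN_COUNT),
               key=lambda i: (-counts[i], i))
rank = {item: position for position, item in enumerate(order)}

# Rewrite every transaction in that global order, dropping infrequent items.
ordered = {tid: sorted((i for i in items if i in rank), key=rank.get)
           for tid, items in transactions.items()}

for item in order:
    print(item, counts[item])
print(ordered["T100"])
```

Sorting each transaction into one global order is what lets transactions with common frequent items share a prefix path when they are inserted into the tree.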

Step 2: T100 = I2, I1, I5

Step 3: T200 = I2, I4

Step 4: T300 = I2, I3

Step 5: T400 = I1, I2, I4

Step 6: T500 = I1, I3

Step 7: T600 = I2, I3

Step 8: T700 = I1, I3

Step 9: T800 = I1, I2, I3, I5

Step 10: T900 = I1, I2, I3

Step 11: Link the table with the tree
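The whole walkthrough (Steps 1 to 11) can be condensed into a minimal sketch of FP-tree construction. The `Node` class, the `build_fp_tree` function and the use of a dictionary of node lists as the header table are illustrative choices, not taken from the slides; only tree construction is shown, not the mining of conditional pattern bases.

```python
from collections import Counter, defaultdict


class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    counts = Counter(item for t in transactions for item in t)
    # Frequent items in descending count order (ties broken by name: an assumption).
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))
    rank = {item: pos for pos, item in enumerate(order)}
    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes for that item (node links)
    for t in transactions:
        node = root
        # Insert the transaction's frequent items in global frequency order.
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)  # link the table entry to the new node
            child.count += 1
            node = child
    return root, header, order


transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
root, header, order = build_fp_tree(transactions, min_count=2)


def show(node, depth=0):
    """Print the tree with indentation reflecting depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)


show(root)
```

Because seven of the nine transactions contain I2, they all share the single I2 branch below the root; this sharing is where the compression comes from.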

Reading Assignment
“Data Mining: Concepts and Techniques” by Han and Kamber, Morgan Kaufmann Publishers, 2001, chapter 6, pp

Lecture Review Question 7
What is the rationale for having various data mining techniques? In other words, how can one decide which of the following techniques to select in data mining?
- Association rules
- Clustering
- Decision tree
- Neural network
- Web mining
- Genetic programming
What are the major differences between the Apriori algorithm and the Frequent Pattern Tree (FP-tree) with respect to performance? Justify your answer.

CS5483 Tutorial Question 5
Given the weather data as shown in the table below:
Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         High      False  Yes
Rainy     Cool         Normal    False  Yes
Rainy     Cool         Normal    True   No
Overcast  Cool         Normal    True   Yes
Sunny     Mild         High      False  No
Sunny     Cool         Normal    False  Yes
Rainy     Mild         Normal    False  Yes
Sunny     Mild         Normal    True   Yes
Overcast  Mild         High      True   Yes
Overcast  Hot          Normal    False  Yes
Sunny     Mild         High      True   No
In this table, there are four attributes: outlook, temperature, humidity and windy; the outcome is whether to play or not.
(a) Show the possible association rules that can determine the outcome, without support and confidence level.
(b) Show the support level and confidence level of the following association rule: if temperature = cool then humidity = normal.
CS5483 Tutorial Question 7
Given the weather data as shown in the table below: