Pattern Recognition, Lecture 20: Data Mining 3. Dr. Richard Spillman, Pacific Lutheran University
Class Topics: Introduction, Decision Functions, Cluster Analysis, Statistical Decision Theory, Feature Selection, Machine Learning, Neural Nets, Midterm One, Midterm Two, Data Mining, Project Presentations
Review: Data Mining Example, Data Preprocessing, Preprocessing Tasks
Review – What is Data Mining?
It is a method to get beyond the “tip of the iceberg”: the information readily available from a database is only the visible tip, and data mining (also known as Knowledge Discovery in Databases, Data Archeology, or Data Dredging) uncovers what lies below.
Review – Data Preprocessing
Data preparation is a big issue for both warehousing and mining
Data preparation includes:
–Data cleaning and data integration
–Data reduction and feature selection
–Discretization
Many methods have been developed, but this is still an active area of research
OUTLINE Frequent Pattern Mining Association Rule Mining Algorithms
Frequent Pattern Mining
What is Frequent Pattern Mining?
What is a frequent pattern?
–A pattern (a set of items, a sequence, etc.) that occurs frequently in a database
Frequent patterns are an important form of regularity:
–What products were often purchased together? Beers and diapers!
–What are the consequences of a hurricane?
–What is the next target after buying a PC?
Applications
Market Basket Analysis
–Maintenance Agreement: what should the store do to boost Maintenance Agreement sales?
–Home Electronics: what other products should the store stock up on if it has a sale on Home Electronics?
Attached mailing in direct marketing
Detecting “ping-pong”ing of patients
–transaction: patient
–item: doctor/clinic visited by a patient
–support of a rule: number of common patients
Frequent Pattern Mining Methods Association analysis – Basket data analysis, cross-marketing, catalog design, loss-leader analysis, text database analysis –Correlation or causality analysis Clustering Classification – Association-based classification analysis Sequential pattern analysis – Web log sequence, DNA analysis, etc.
Association Rule Mining
Given:
–A database of customer transactions
–Each transaction is a list of items (purchased by a customer in one visit)
Find all rules that correlate the presence of one set of items with that of another set of items
–Example: 98% of people who purchase tires and auto accessories also get automotive services done
–Any number of items may appear in the consequent/antecedent of a rule
–It is possible to specify constraints on the rules (e.g., find only rules involving Home Laundry Appliances)
Basic Concepts
Rule form: “A => B [support s, confidence c]”
Support: usefulness of discovered rules
Confidence: certainty of the detected association
Rules that satisfy both min_sup and min_conf are called strong.
Examples:
–buys(x, “diapers”) => buys(x, “beers”) [0.5%, 60%]
–age(x, “30-34”) ^ income(x, “42K-48K”) => buys(x, “high resolution TV”) [2%, 60%]
–major(x, “CS”) ^ takes(x, “DB”) => grade(x, “A”) [1%, 75%]
Rule Measures
Find all the rules X & Y => Z with minimum confidence and support
–support, s: probability that a transaction contains {X, Y, Z}
–confidence, c: conditional probability that a transaction having {X, Y} also contains Z
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and the overlap of customers who buy both]
Example: Support
Given the following database:
TID 2000: A, B, C
TID 1000: A, C
TID 4000: A, D
TID 5000: B, E, F
For the rule A => C, support is the probability that a transaction contains both A and C. 2 out of 4 transactions contain both A and C, so the support is 50%.
Example: Confidence
Given the same database: for the rule A => C, confidence is the conditional probability that a transaction which contains A also contains C. 2 out of the 3 transactions which contain A also contain C, so the confidence is 66%. (A short code sketch of both measures follows.)
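To make the two measures concrete, here is a minimal Python sketch (not from the original lecture) that recomputes them on the four-transaction database above; the helper names support and confidence are illustrative.

```python
# Transaction database from the example above (TID -> set of items).
db = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

print(support({"A", "C"}, db))       # 0.5   -> 50% support for A => C
print(confidence({"A"}, {"C"}, db))  # 0.66  -> 66% confidence for A => C
```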
Algorithms
Apriori Algorithm
The Apriori method:
–Proposed by Agrawal & Srikant in 1994
–A similar level-wise algorithm was proposed by Mannila et al.
Major idea:
–A subset of a frequent itemset must be frequent. E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be. Conversely, if any subset is infrequent, its supersets cannot be frequent (see the sketch below).
–This gives a powerful, scalable candidate-set pruning technique: it reduces the number of candidate k-itemsets dramatically (for k > 2)
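The pruning test can be sketched in a few lines of Python (illustrative, not from the lecture); has_infrequent_subset is a hypothetical helper name.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Apriori check: a k-itemset can be frequent only if every one of
    its (k-1)-subsets was found frequent at the previous level."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# If {beer, diaper} is infrequent, {beer, diaper, nuts} can be pruned
# without ever being counted against the database.
frequent_2 = {frozenset({"beer", "nuts"}), frozenset({"diaper", "nuts"})}
print(has_infrequent_subset({"beer", "diaper", "nuts"}, frequent_2))  # True
```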
Example
Min. support 50%, min. confidence 50%. Given the transaction database D:
TID 100: 1, 3, 4
TID 200: 2, 3, 5
TID 300: 1, 2, 3, 5
TID 400: 2, 5
Apriori Process
1. Find the frequent itemsets: the sets of items that have minimum support (Apriori)
–A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
–Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
2. Use the frequent itemsets to generate association rules.
Apriori Algorithm
(Ck: candidate itemsets of size k; Lk: frequent itemsets of size k)
Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, and hence should be removed
Example
Tracing Apriori on database D (min. support 50% = 2 transactions, min. confidence 50%):
Scan D to count the 1-itemsets: C1 = {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
Keep those meeting min. support: L1 = {1}, {2}, {3}, {5}
Join L1 with itself and scan D: C2 = {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
Keep those meeting min. support: L2 = {1,3}, {2,3}, {2,5}, {3,5}
Join L2 with itself, prune, and scan D: C3 = {2,3,5}:2
Keep those meeting min. support: L3 = {2,3,5}
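For reference, here is a compact level-wise sketch in Python that reproduces this trace on D; it uses a naive union-based join rather than an optimized one, and all names are illustrative.

```python
# Level-wise Apriori sketch on the database D traced above.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_sup = 2  # 50% of 4 transactions

def sup(itemset):
    """Number of transactions in D containing the whole itemset."""
    return sum(1 for t in D if itemset <= t)

items = {i for t in D for i in t}
L = {frozenset({i}) for i in items if sup(frozenset({i})) >= min_sup}
k = 2
while L:
    print(f"L{k - 1}:", sorted(sorted(s) for s in L))
    # Join step (naive): unite two (k-1)-itemsets when the union has k items.
    C = {a | b for a in L for b in L if len(a | b) == k}
    # Scan D and keep the candidates that reach minimum support.
    L = {c for c in C if sup(c) >= min_sup}
    k += 1
```

Running it prints L1 = {1}, {2}, {3}, {5}, then L2 = {1,3}, {2,3}, {2,5}, {3,5}, then L3 = {2,3,5}, matching the trace above.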
Generating the Candidate Set
In the example above, how do you go from L2 to C3? The same join-and-prune construction works at every level. For example, if L3 = {abc, abd, acd, ace, bcd}:
Self-joining: L3 * L3 gives abcd (from abc and abd) and acde (from acd and ace)
Pruning: acde is removed because its subset ade is not in L3
So C4 = {abcd}
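A short Python sketch of this join-and-prune step on the slide's L3, treating each itemset as a set of single-character items (illustrative only). The naive union join also produces abce (from abc and ace), but the prune step removes it, just as it removes acde.

```python
from itertools import combinations

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}

# Join step: L3 * L3 -> candidate 4-itemsets (naive union-based join).
joined = {a | b for a in L3 for b in L3 if len(a | b) == 4}

# Prune step: drop any candidate that has a 3-subset missing from L3.
C4 = {c for c in joined
      if all(frozenset(s) in L3 for s in combinations(c, 3))}

print(sorted("".join(sorted(c)) for c in C4))  # ['abcd']
```

The classical Apriori join is more selective (it only merges itemsets that agree on their first k-2 items, so abce would never be generated), but with the prune step the naive version yields the same C4.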
Generating Strong Association Rules
Confidence(A => B) = P(B | A) = support(A ∪ B) / support(A)
Example: for database D and L3 = {2, 3, 5}, the possible rules are:
–2 and 3 => 5, confidence 2/2 = 100%
–2 and 5 => 3, confidence 2/3 = 66%
–3 and 5 => 2, confidence 2/2 = 100%
–2 => 3 and 5, confidence 2/3 = 66%
–3 => 2 and 5, confidence 2/3 = 66%
–5 => 3 and 2, confidence 2/3 = 66%
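This step can be sketched as follows, assuming database D and min_conf = 50%; the code enumerates every antecedent/consequent split of the frequent itemset {2, 3, 5} and prints the rules that are strong (all names are illustrative).

```python
from itertools import combinations

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
itemset = frozenset({2, 3, 5})  # the frequent 3-itemset L3
min_conf = 0.5

def sup(s):
    """Number of transactions in D containing all items of s."""
    return sum(1 for t in D if s <= t)

# Split the itemset into every antecedent/consequent pair and keep
# the rules whose confidence reaches min_conf.
for r in range(1, len(itemset)):
    for ante in map(frozenset, combinations(itemset, r)):
        cons = itemset - ante
        conf = sup(itemset) / sup(ante)
        if conf >= min_conf:
            print(f"{sorted(ante)} => {sorted(cons)}: "
                  f"confidence {sup(itemset)}/{sup(ante)} = {conf:.0%}")
```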
Possible Quiz
What is a frequent pattern?
Define support and confidence.
What is the basic principle of the Apriori algorithm?