
1 Association Discovery from Databases
Association rules are a simple formalism for expressing positive connections between columns in a 0/1 matrix. A classical example is a rule stating that if a customer buys beer and sausage, then with 80% confidence he/she also buys mustard.
Association rule mining: finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses.
Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc.
Examples:
Rule form: LHS => RHS [support, confidence]
buys(x, diapers) => buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) => grade(x, A) [1%, 75%]
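To make the [support, confidence] annotation concrete, here is a minimal Python sketch (not from the slide) of a rule as a plain value; the 10% support figure on the beer/sausage example is an assumed number, since the slide gives only the 80% confidence:

from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    lhs: frozenset       # antecedent, e.g. {beer, sausage}
    rhs: frozenset       # consequent, e.g. {mustard}
    support: float       # fraction of transactions containing lhs | rhs
    confidence: float    # support(lhs | rhs) / support(lhs)

# assumed 10% support, 80% confidence as on the slide
r = Rule(frozenset({"beer", "sausage"}), frozenset({"mustard"}), 0.10, 0.80)
print(r)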

2 Association Rule: Basic Concepts
Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in one visit)
Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive services done
Applications
– * => Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
– Home Electronics => * (what other products should the store stock up on?)
– Attached mailing in direct marketing
– Detecting "ping-pong"ing of patients, faulty "collisions"

3 Associations, Support and Confidence
Let I = {i1, i2, ..., im} be a set of literals, each called an item.
Let D = {t1, t2, ..., tn} be a set of transactions, where a transaction t is a set of items.
An association rule is of the form X => Y, where X and Y are subsets of I and X ∩ Y = ∅.
Each rule has two measures of value: support and confidence. Support indicates the frequency of the occurring pattern; confidence denotes the strength of the implication in the rule.
The support of the rule X => Y is support(X ∪ Y).
c is the confidence of the rule X => Y if c% of the transactions that contain X also contain Y; it can be written as the ratio support(X ∪ Y) / support(X).
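These definitions translate directly into code. A minimal Python sketch (the item names and four example transactions are illustrative, not from the slide):

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    # confidence(X => Y) = support(X UNION Y) / support(X)
    return support(x | y, transactions) / support(x, transactions)

D = [frozenset(t) for t in ({"mustard", "sausage", "beer"},
                            {"sausage", "beer"},
                            {"mustard", "beer", "chips"},
                            {"mustard", "sausage", "beer", "chips"})]
X, Y = frozenset({"sausage", "beer"}), frozenset({"mustard"})
print(support(X | Y, D))      # 0.5
print(confidence(X, Y, D))    # 0.666...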

4 Rule Measures: Support and Confidence
Find all the rules X & Y => Z with minimum confidence and support
– support, s: probability that a transaction contains {X & Y & Z}
– confidence, c: conditional probability that a transaction having {X & Y} also contains Z
With minimum support 50% and minimum confidence 50%, we have
– A => C (50%, 66.6%)
– C => A (50%, 100%)
(Venn diagram: customers who buy beer, customers who buy diapers, and the overlap of customers who buy both.)

5 Association Discovery
Given a user-specified minimum support (MINSUP) and minimum confidence (MINCONF), the problem is to find all high-confidence rules over large itemsets (frequent sets, i.e., sets with high support), where support and confidence exceed minsup and minconf. This problem can be decomposed into two subproblems:
1. Find all large itemsets, i.e., those with support > minsup (the frequent sets).
2. For each large itemset X and each B ∈ X (or Y ⊂ X), find those rules X\{B} => B (or X−Y => Y) for which confidence > minconf.
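A Python sketch of subproblem 2, assuming the supports of X and all of its subsets have already been computed; the numbers below reuse the A/C supports from slide 4:

from itertools import combinations

def rules_from_itemset(x, supp, minconf):
    # Emit every rule (X - Y) => Y, for nonempty proper subsets Y of X,
    # whose confidence supp(X) / supp(X - Y) reaches minconf.
    out = []
    for r in range(1, len(x)):
        for y in combinations(sorted(x), r):
            y = frozenset(y)
            conf = supp[x] / supp[x - y]
            if conf >= minconf:
                out.append((x - y, y, conf))
    return out

supp = {frozenset("A"): 0.75, frozenset("C"): 0.5, frozenset("AC"): 0.5}
print(rules_from_itemset(frozenset("AC"), supp, minconf=0.5))
# C => A with confidence 1.0, and A => C with confidence 0.666...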

6 Mining Association Rules — An Example
For the rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent.
Min. support 50%; min. confidence 50%.

7 Rules from Frequent Sets
X = {mustard, sausage, beer}; frequency = 0.4
Y = {mustard, sausage, beer, chips}; frequency = 0.2
If the customer buys mustard, sausage, and beer, then the probability that he/she buys chips is 0.2 / 0.4 = 0.5.
This is a simple descriptive pattern; its statistical meaning is that the confidence of A => B is P(B|A).

8 Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset (the Apriori rule); i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
– Iteratively find the frequent itemsets of cardinality 1 through k (k-itemsets)
Use the frequent itemsets to generate association rules.

9 The Algorithm
(1) The frequent sets are computed iteratively.
1st iteration: the large 1-itemsets are found by scanning the database.
kth iteration: C_k is created by applying Apriori-gen to L_(k-1), and is then scanned for frequent sets. Apriori-gen generates only those k-itemsets whose every (k-1)-itemset subset is frequent (i.e., in L_(k-1)).
(2) Generating rules. For each frequent set X, output all rules R(X, Y) = (X−Y => Y), where Y is a subset of X, for which c(R(X, Y)) = supp(X) / supp(X−Y) is at least minconf.
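A Python sketch of Apriori-gen, assuming the usual formulation (join two frequent (k-1)-itemsets that agree on all but their last item in sorted order, then prune any candidate with an infrequent (k-1)-subset); the slide names the steps but not this exact construction:

from itertools import combinations

def apriori_gen(l_prev):
    # l_prev: the frequent (k-1)-itemsets, as a set of frozensets.
    k = len(next(iter(l_prev))) + 1
    ordered = [tuple(sorted(s)) for s in l_prev]
    candidates = set()
    for a in ordered:
        for b in ordered:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:    # join step
                c = frozenset(a) | {b[-1]}
                # prune step: every (k-1)-subset of c must be in L_(k-1)
                if all(frozenset(s) in l_prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

L2 = {frozenset(p) for p in ("AC", "BC", "BE", "CE")}
print(apriori_gen(L2))    # {frozenset({'B', 'C', 'E'})}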

10 The Apriori Algorithm
Join step: C_k is generated by joining L_(k-1) with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Pseudo-code:
C_k: candidate itemsets of size k
L_k: frequent itemsets of size k
L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_(k+1) = candidates generated from L_k;
    for each transaction t in the database do
        increment the count of all candidates in C_(k+1) that are contained in t
    L_(k+1) = candidates in C_(k+1) with min_support
end
return ∪_k L_k;
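A runnable Python sketch of the pseudo-code above (candidate generation is done here with a simple union-join plus the prune step, rather than a separate Apriori-gen routine):

from itertools import combinations

def apriori(transactions, min_support):
    # Returns {frequent itemset: support count}.
    min_count = min_support * len(transactions)
    level = {frozenset([i]) for t in transactions for i in t}   # C_1
    frequent, k = {}, 1
    while level:
        # one scan of the database per iteration
        counts = {c: sum(c <= t for t in transactions) for c in level}
        lk = {c for c, n in counts.items() if n >= min_count}
        frequent.update({c: counts[c] for c in lk})
        # join L_k with itself; prune candidates with an infrequent k-subset
        level = {a | b for a in lk for b in lk
                 if len(a | b) == k + 1
                 and all(frozenset(s) in lk for s in combinations(a | b, k))}
        k += 1
    return frequent

D = [frozenset(t) for t in ("ACD", "BCE", "ABCE", "BE")]
print(apriori(D, 0.5))    # the frequent sets of the example on the next slides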

11 Example
Consider the database in Table 2.2.

Table 2.2. Sample transaction database
TID   Items
---   -----------
100   A C D
200   B C E
300   A B C E
400   B E

Let minimum support = 50% and minimum confidence = 60%. Since there are four records in the table, the minimum support count is 2 (4 × 50% = 2).

12 The Process of Finding Frequent Sets

Database D             Candidate 1-itemsets        Frequent 1-itemsets
TID  Items             Itemset    Support count    Itemset    Support count
100  A C D             {A}        2                {A}        2
200  B C E        -->  {B}        3                {B}        3
300  A B C E           {C}        3                {C}        3
400  B E               {D}        1                {E}        3
                       {E}        3

Candidate 2-itemsets   Candidate 2-itemsets        Frequent 2-itemsets
Itemset                Itemset    Support count    Itemset    Support count
{A, B}                 {A, B}     1                {A, C}     2
{A, C}            -->  {A, C}     2                {B, C}     2
{A, E}                 {A, E}     1                {B, E}     3
{B, C}                 {B, C}     2                {C, E}     2
{B, E}                 {B, E}     3
{C, E}                 {C, E}     2

Candidate 3-itemsets   Candidate 3-itemsets        Frequent 3-itemsets
Itemset                Itemset    Support count    Itemset    Support count
{B, C, E}         -->  {B, C, E}  2                {B, C, E}  2

Derive association rules. We have one large 3-itemset, {B, C, E}, with s = 50%. Recall the predetermined minconf = 60%. We get:
B and C implies E, with support = 50% and confidence = 100%.
B and E implies C, with support = 50% and confidence = 66.7%.
C and E implies B, with support = 50% and confidence = 100%.
B implies C and E, with support = 50% and confidence = 66.7%.
C implies B and E, with support = 50% and confidence = 66.7%.
E implies B and C, with support = 50% and confidence = 66.7%.
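A brute-force check of the six rules in Python (a sketch; it simply recomputes each support and confidence directly from Table 2.2):

D = [set("ACD"), set("BCE"), set("ABCE"), set("BE")]
sup = lambda s: sum(set(s) <= t for t in D) / len(D)
for lhs, rhs in (("BC", "E"), ("BE", "C"), ("CE", "B"),
                 ("B", "CE"), ("C", "BE"), ("E", "BC")):
    s, c = sup(lhs + rhs), sup(lhs + rhs) / sup(lhs)
    print(f"{lhs} => {rhs}: support {s:.0%}, confidence {c:.1%}")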

13 The Apriori Algorithm — Example
(Diagram: starting from database D, each scan of D produces candidate and frequent sets in turn: C1 -> L1 -> C2 -> L2 -> C3 -> L3.)

14 General Framework for Rule Discovery
Given a class P of patterns, specify whether a pattern p ∈ P occurs frequently enough (support) and is also interesting (confidence).
Compute PI(d, P) = { p ∈ P | p occurs sufficiently often in d and p is interesting }.
Examples:
– P: all association rules
– P': all association rules with B on the right-hand side
– P'': all association rules with B on the right-hand side and C occurring in the left-hand side
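A Python sketch of PI(d, P) as a filter over a candidate pattern class; the rules and their numbers below are illustrative, and the subclasses P' and P'' restrict the rule shape as in the bullets above:

def PI(rules, minsup, minconf, shape=lambda r: True):
    # rules: iterable of (lhs, rhs, support, confidence) tuples
    return [r for r in rules
            if r[2] >= minsup and r[3] >= minconf and shape(r)]

rules = [(frozenset("A"), frozenset("B"), 0.5, 0.9),
         (frozenset("C"), frozenset("B"), 0.4, 0.7),
         (frozenset("A"), frozenset("C"), 0.6, 0.8)]

p1 = lambda r: "B" in r[1]                     # P': B on the right-hand side
p2 = lambda r: "B" in r[1] and "C" in r[0]     # P'': ... and C on the left
print(PI(rules, 0.3, 0.6, p1))    # the first two rules
print(PI(rules, 0.3, 0.6, p2))    # only the C => B rule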

15 Association Rules in Table Form

