Business Systems Intelligence: 4. Mining Association Rules
Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)

Acknowledgments
These notes are based (heavily) on those provided by the authors to accompany “Data Mining: Concepts & Techniques” by Jiawei Han and Micheline Kamber
Some slides are also based on trainer’s kits provided by
More information about the book is available at: www-sal.cs.uiuc.edu/~hanj/bk2/
And information on SAS is available at: www.sas.com

Mining Association Rules
Today we will look at:
– Association rule mining
– Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
– Sequential pattern mining
– Applications/extensions of frequent pattern mining
– Summary

What Is Association Mining?
Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
Frequent pattern: A pattern (set of items, sequence, etc.) that occurs frequently in a database

Motivations For Association Mining
Motivation: Finding regularities in data
– What products were often purchased together? Beer and nappies!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?

Motivations For Association Mining (cont…)
Foundation for many essential data mining tasks
– Association, correlation, causality
– Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
– Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)

Motivations For Association Mining (cont…)
Broad applications
– Basket data analysis, cross-marketing, catalog design, sale campaign analysis
– Web log (click stream) analysis, DNA sequence analysis, etc.

Market Basket Analysis
Market basket analysis is a typical example of frequent itemset mining
Customers’ buying habits are divined by finding associations between different items that customers place in their “shopping baskets”
This information can be used to develop marketing strategies

Association Rule Basic Concepts
Let I be a set of items {I1, I2, I3, …, Im}
Let D be a database of transactions where each transaction T is a set of items such that T ⊆ I
So, if A is a set of items, a transaction T is said to contain A if and only if A ⊆ T
An association rule is an implication A ⇒ B where A ⊂ I, B ⊂ I, and A ∩ B = ∅

Association Rule Support & Confidence
We say that an association rule A ⇒ B holds in the transaction set D with support, s, and confidence, c
The support of the association rule is given as the percentage of transactions in D that contain both A and B (or A ∪ B)
So, the support can be considered the probability P(A ∪ B), where A ∪ B means a transaction contains every item in A and every item in B

Association Rule Support & Confidence (cont…)
The confidence of the association rule is given as the percentage of transactions in D containing A that also contain B
So, the confidence can be considered the conditional probability P(B|A)
Association rules that satisfy minimum support and confidence values are said to be strong

Itemsets & Frequent Itemsets
An itemset is a set of items
A k-itemset is an itemset that contains k items
The occurrence frequency of an itemset is the number of transactions that contain the itemset
– This is also known more simply as the frequency, support count or count
An itemset is said to be frequent if the support count satisfies a minimum support count threshold
The set of frequent k-itemsets is denoted Lk

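Support & Confidence Again
Support and confidence values can be calculated as follows:
support(A ⇒ B) = P(A ∪ B) = support_count(A ∪ B) / |D|
confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
Here |D| is the number of transactions in D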

Mining Association Rules: An Example
Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%
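The support figures in the table above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `support_count`, `support` and `confidence` are illustrative choices, and the transactions are copied from the example.

```python
# Transactions from the example, keyed by transaction id.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(itemset, transactions):
    """Fraction of all transactions that contain itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the support counts."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)

print(support({"A"}, transactions))                       # 0.75
print(support({"A", "C"}, transactions))                  # 0.5
print(f"{confidence({'A'}, {'C'}, transactions):.2f}")    # 0.67
print(f"{confidence({'C'}, {'A'}, transactions):.2f}")    # 1.00
```

Note that the two rules built from {A, C} share the same support but differ in confidence, which is why both measures are reported for a rule.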

Association Rule Mining
So, in general, association rule mining can be reduced to the following two steps:
1. Find all frequent itemsets
   – Each itemset will occur at least as frequently as a minimum support count
2. Generate strong association rules from the frequent itemsets
   – These rules will satisfy minimum support and confidence measures

Combinatorial Explosion!
A major challenge in mining frequent itemsets is that the number of frequent itemsets generated can be massive
For example, a long frequent itemset will contain a combinatorial number of shorter frequent sub-itemsets
A frequent itemset of length 100 will contain the following number of frequent sub-itemsets:
C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30
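The size of that number is easy to check directly. The snippet below is a small illustrative sketch: it sums the binomial coefficients C(100, k) for k = 1 to 100 and compares the result with 2^100 − 1.

```python
from math import comb

n = 100
# Count the non-empty sub-itemsets of a 100-itemset, one size at a time.
total = sum(comb(n, k) for k in range(1, n + 1))

print(total == 2**n - 1)   # True
print(f"{total:.3e}")      # 1.268e+30
```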

The Apriori Algorithm
Any subset of a frequent itemset must be frequent
– If {beer, nappy, nuts} is frequent, so is {beer, nappy}
– Every transaction having {beer, nappy, nuts} also contains {beer, nappy}
Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!

The Apriori Algorithm (cont…)
The Apriori algorithm is known as a candidate generation-and-test approach
Method:
– Generate length (k+1) candidate itemsets from length k frequent itemsets
– Test the candidates against the DB
Performance studies show the algorithm’s efficiency and scalability

The Apriori Algorithm: An Example
Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan
– C1 (itemset: sup): {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
– L1 (itemset: sup): {A}: 2, {B}: 3, {C}: 3, {E}: 3
2nd scan
– C2: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
– C2 (itemset: sup): {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
– L2 (itemset: sup): {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
3rd scan
– C3: {B, C, E}
– L3 (itemset: sup): {B, C, E}: 2
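The level-wise process in this example can be sketched in Python as follows. This is an illustrative sketch rather than the slides' own implementation: the function name `apriori` is hypothetical, and the minimum support count of 2 is an assumption inferred from the example (it is the threshold at which {D}, with count 1, drops out of L1). For brevity the join step unions any two frequent k-itemsets and relies on the pruning step, which yields the same candidates as a prefix-based self-join.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support_count} for every frequent itemset."""
    # First scan: count the single items (C1) and keep the frequent ones (L1).
    counts = {}
    for items in transactions:
        for item in items:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # Candidate generation: join pairs of frequent k-itemsets, then
        # prune any candidate that has an infrequent k-subset.
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in frequent for sub in combinations(union, k)
            ):
                candidates.add(union)
        # Test: one scan of the database to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Database TDB from the example; min_count=2 is an assumed threshold.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_count=2)
for itemset, count in sorted(result.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)
```

Run on TDB, the sketch finds the four itemsets of L1, the four of L2 and the single itemset {B, C, E} of L3.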

Important Details Of The Apriori Algorithm
There are two crucial questions in implementing the Apriori algorithm:
– How to generate candidates?
– How to count supports of candidates?

Generating Candidates
There are 2 steps to generating candidates:
– Step 1: Self-joining Lk
– Step 2: Pruning
Example of candidate generation
– L3 = {abc, abd, acd, ace, bcd}
– Self-joining: L3 * L3
   abcd from abc and abd
   acde from acd and ace
– Pruning: acde is removed because ade is not in L3
– C4 = {abcd}
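The two steps can be written out as a short Python sketch. The function name `generate_candidates` is a hypothetical choice, and itemsets are assumed to be stored as sorted tuples so that two k-itemsets are joined when they agree on their first k−1 items.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Self-join then prune; itemsets are sorted tuples of equal length k."""
    frequent = set(frequent_k)
    ordered = sorted(frequent)
    k = len(ordered[0]) if ordered else 0

    # Step 1: self-join itemsets that share their first k-1 items.
    joined = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if a[:-1] == b[:-1]:
                joined.append(a + (b[-1],))

    # Step 2: prune candidates that have a k-subset outside frequent_k.
    pruned = [
        c for c in joined
        if all(sub in frequent for sub in combinations(c, k))
    ]
    return joined, pruned

l3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
joined, c4 = generate_candidates(l3)
print(["".join(c) for c in joined])  # ['abcd', 'acde']
print(["".join(c) for c in c4])      # ['abcd']
```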

How to Count Supports Of Candidates?
Why is counting supports of candidates a problem?
– The total number of candidates can be huge
– One transaction may contain many candidates
Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a transaction
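A full hash-tree is beyond a short example, so the sketch below swaps in a simpler stand-in: a flat Python dictionary keyed by candidate itemset. It plays the role of the subset function by enumerating the k-subsets of each transaction and looking each one up. The function name `count_supports` is hypothetical, and the data is the TDB database and C2 candidate set from the Apriori example.

```python
from itertools import combinations

def count_supports(candidates, transactions, k):
    """Count how many transactions contain each k-item candidate."""
    counts = {frozenset(c): 0 for c in candidates}
    for items in transactions:
        # Enumerate the transaction's k-subsets and look each one up.
        for subset in combinations(sorted(items), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
c2 = ["AB", "AC", "AE", "BC", "BE", "CE"]
c2_counts = count_supports(c2, tdb, k=2)
for candidate in c2:
    print(candidate, c2_counts[frozenset(candidate)])
```

The dictionary lookup is constant time per subset, but the number of k-subsets per transaction grows quickly with transaction length, which is the cost a hash-tree is designed to reduce.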

Generating Association Rules
Once all frequent itemsets have been found, association rules can be generated
Strong association rules from a frequent itemset are generated by calculating the confidence in each possible rule arising from that itemset and testing it against a minimum confidence threshold

Example
TID    List of item_IDs
T100   Beer, Crisps, Milk
T200   Crisps, Bread
T300   Crisps, Nappies
T400   Beer, Crisps, Bread
T500   Beer, Nappies
T600   Crisps, Nappies
T700   Beer, Nappies
T800   Beer, Crisps, Nappies, Milk
T900   Beer, Crisps, Nappies

ID   Item
I1   Beer
I2   Crisps
I3   Nappies
I4   Bread
I5   Milk
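Rule generation can be illustrated on these nine transactions. The sketch below takes one frequent itemset, {Beer, Crisps, Milk}, enumerates every rule that splits it into an antecedent and a consequent, and keeps those that meet a confidence threshold. The 70% threshold and the function names are assumptions made for illustration; the slides do not state a threshold.

```python
from itertools import combinations

transactions = [
    {"Beer", "Crisps", "Milk"},             # T100
    {"Crisps", "Bread"},                    # T200
    {"Crisps", "Nappies"},                  # T300
    {"Beer", "Crisps", "Bread"},            # T400
    {"Beer", "Nappies"},                    # T500
    {"Crisps", "Nappies"},                  # T600
    {"Beer", "Nappies"},                    # T700
    {"Beer", "Crisps", "Nappies", "Milk"},  # T800
    {"Beer", "Crisps", "Nappies"},          # T900
]

def support_count(itemset, transactions):
    return sum(1 for items in transactions if itemset <= items)

def rules_from(itemset, transactions, min_conf):
    """All rules antecedent => (itemset - antecedent) meeting min_conf."""
    itemset = frozenset(itemset)
    total = support_count(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            conf = total / support_count(antecedent, transactions)
            if conf >= min_conf:
                rules.append((antecedent, itemset - antecedent, conf))
    return rules

# min_conf=0.7 is an assumed threshold, not taken from the slides.
strong = rules_from({"Beer", "Crisps", "Milk"}, transactions, min_conf=0.7)
for antecedent, consequent, conf in strong:
    print(sorted(antecedent), "=>", sorted(consequent), f"{conf:.0%}")
```

With this threshold three of the six possible rules survive, each with 100% confidence; for instance Beer and Crisps together predict Milk in only 2 of 4 transactions, so that rule is dropped.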

Challenges Of Frequent Pattern Mining
Challenges
– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates

Bottleneck Of Frequent-Pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
– To find frequent itemset i1 i2 … i100
   # of scans: 100
   # of candidates: C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30
Bottleneck: candidate-generation-and-test

Mining Frequent Patterns Without Candidate Generation
Techniques for mining frequent itemsets which avoid candidate generation include:
– FP-growth
   Grow long patterns from short ones using local frequent items
– ECLAT (Equivalence CLASS Transformation) algorithm
   Uses a data representation in which transactions are associated with items, rather than the other way around (vertical data format)
These methods can be much faster than the Apriori algorithm
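The vertical data format can be made concrete with a compact ECLAT-style sketch. It is a simplified illustration, not a tuned implementation: each item is mapped to the set of transaction ids containing it, and the support of a longer itemset is obtained by intersecting those id sets instead of rescanning the database. The function name `eclat`, the reuse of the TDB database from the Apriori example, and the minimum support count of 2 are all assumptions.

```python
def eclat(transactions, min_count):
    """Return {tuple_of_items: support_count} using tid-set intersection."""
    # Vertical format: item -> set of ids of the transactions containing it.
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # Each (item, tids) pair is already frequent together with prefix.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Intersect tid sets to find frequent extensions of itemset.
            suffix = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_count:
                    suffix.append((other, shared))
            extend(itemset, suffix)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_count),
        key=lambda pair: pair[0],
    )
    extend((), start)
    return frequent

tdb = {10: {"A", "C", "D"}, 20: {"B", "C", "E"}, 30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
eclat_result = eclat(tdb, min_count=2)
for itemset, count in sorted(eclat_result.items(), key=lambda p: (len(p[0]), p[0])):
    print(itemset, count)
```

On TDB this finds the same nine frequent itemsets as the level-wise approach, with the database read only once to build the id sets.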

Sequence Databases and Sequential Pattern Analysis
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
– Customer shopping sequences: First buy computer, then CD-ROM, and then digital camera, within 3 months.
– Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
– Telephone calling patterns, Weblog click streams
– DNA sequences and gene structures

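What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences
A sequence database lists one sequence per sequence ID (SID); the example uses SIDs 10, 20, 30 and 40 (the example sequences are not preserved in this transcript)
A sequence is an ordered list of elements. An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
One sequence is a subsequence of another if its elements can be matched, in order, to elements of the other that contain them
Given support threshold min_sup = 2, a subsequence contained in at least 2 sequences of the database is a sequential pattern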
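The subsequence test and the support of a pattern can be sketched in Python. The three sequences below are hypothetical, made up for illustration because the slide's own sequences are not available; each sequence is a list of elements and each element is a set of items. The function names are likewise illustrative.

```python
def is_subsequence(pattern, sequence):
    """True if each element of pattern fits inside a later and later
    element of sequence, preserving order."""
    pos = 0
    for element in pattern:
        # Advance to the next element of sequence that contains this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def sequence_support(pattern, database):
    """Number of sequences in the database that contain pattern."""
    return sum(1 for seq in database.values() if is_subsequence(pattern, seq))

# Hypothetical sequence database, not the one from the slides.
database = {
    1: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}],
    2: [{"a", "d"}, {"c"}, {"b", "c"}],
    3: [{"e"}, {"a", "b"}, {"c"}],
}
pattern = [{"a", "b"}, {"c"}]
print(sequence_support(pattern, database))  # 2
```

Under an assumed min_sup of 2, the pattern is a sequential pattern in this toy database because sequences 1 and 3 contain it.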

Challenges On Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases
A mining algorithm should
– Find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
– Be highly efficient, scalable, involving only a small number of database scans
– Be able to incorporate various kinds of user-specific constraints

A Basic Property Of Sequential Patterns: Apriori
A basic property: Apriori
– If a sequence S is not frequent
– Then none of the super-sequences of S is frequent
Example sequence database: Seq. IDs 10, 20, 30, 40 and 50 (sequences not preserved in this transcript), with support threshold min_sup = 2

GSP: A Generalized Sequential Pattern Mining Algorithm
GSP (Generalized Sequential Pattern) mining algorithm proposed in 1996
Outline of the method
– Initially, every item in DB is a candidate of length 1
– For each level (i.e., sequences of length k):
   Scan database to collect support count for each candidate sequence
   Generate candidate length (k+1) sequences from length k frequent sequences using Apriori
– Repeat until no frequent sequence or no candidate can be found
Major strength: Candidate pruning by Apriori

Finding Length 1 Sequential Patterns
Examine GSP using an example (min_sup = 2)
Initial candidates: all singleton sequences (8 in this example)
Scan database once, count support for candidates
Candidate support counts, in slide order: 3, 5, 4, 3, 3, 2, 1, 1

Generating Length 2 Candidates
51 length-2 candidates (6*6 + 6*5/2, from the 6 frequent length-1 sequences)
Without Apriori property, 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
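The candidate counts follow one formula: n items give n*n two-element sequences (an item followed by an item, repeats allowed) plus n*(n−1)/2 single-element sequences holding an unordered pair. The snippet below is a small sketch of that arithmetic; the function name `length2_candidates` is hypothetical.

```python
def length2_candidates(n):
    """Length-2 candidate sequences that can be built from n items."""
    ordered_pairs = n * n              # item followed by item, repeats allowed
    same_element = n * (n - 1) // 2    # two distinct items in one element
    return ordered_pairs + same_element

with_apriori = length2_candidates(6)     # only the 6 frequent items
without_apriori = length2_candidates(8)  # all 8 items
pruned = 1 - with_apriori / without_apriori

print(with_apriori, without_apriori)     # 51 92
print(f"{pruned:.2%}")                   # 44.57%
print(length2_candidates(1000))          # 1499500
```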

Finding Length 2 Sequential Patterns
Scan database one more time, collect support count for each length 2 candidate
There are 19 length 2 candidates which pass the minimum support threshold
– They are length 2 sequential patterns

Generating Length 3 Candidates And Finding Length 3 Patterns
Generate length 3 candidates
– Self-join length 2 sequential patterns
   Based on the Apriori property, a length-3 sequence is a candidate only if all of its length-2 subsequences are sequential patterns
– 46 candidates are generated
Find length 3 sequential patterns
– Scan database once more, collect support counts for candidates
– 19 out of 46 candidates pass support threshold

The GSP Mining Process (min_sup = 2)
– 1st scan: 8 candidates, 6 length-1 sequential patterns
– 2nd scan: 51 candidates, 19 length-2 sequential patterns, 10 candidates not in DB at all
– 3rd scan: 46 candidates, 19 length-3 sequential patterns, 20 candidates not in DB at all
– 4th scan: 8 candidates, 6 length-4 sequential patterns
– 5th scan: 1 candidate, 1 length-5 sequential pattern

The GSP Algorithm
Take all singleton sequences as length 1 candidates
Scan database once, find F1, the set of length 1 sequential patterns
Let k = 1; while Fk is not empty do
– Form Ck+1, the set of length (k+1) candidates from Fk
– If Ck+1 is not empty, scan database once, find Fk+1, the set of length (k+1) sequential patterns
– Let k = k+1
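The loop above can be sketched in Python under one simplifying assumption: every element of a sequence holds exactly one item, so a sequence is just a string of items. Full GSP also handles multi-item elements, which this sketch does not attempt. The function names and the four-sequence database are hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern's items appear in sequence in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)

def gsp_single_items(database, min_sup):
    """Level-wise sequential pattern mining for single-item elements."""
    items = sorted({item for seq in database for item in seq})
    candidates = [(item,) for item in items]   # length 1 candidates
    patterns = {}
    while candidates:
        # One database scan per level: count the support of each candidate.
        counts = {c: sum(1 for seq in database if contains(seq, c)) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        patterns.update(frequent)
        # Join p and q when p without its first item equals q without its
        # last item, then prune candidates with an infrequent subsequence.
        candidates = []
        for p in frequent:
            for q in frequent:
                if p[1:] == q[:-1]:
                    c = p + (q[-1],)
                    if all(c[:i] + c[i + 1:] in frequent for i in range(len(c))):
                        candidates.append(c)
    return patterns

# Hypothetical database; each character is one single-item element.
database = ["abcd", "acd", "abd", "bc"]
gsp_patterns = gsp_single_items(database, min_sup=2)
for pattern, count in sorted(gsp_patterns.items(), key=lambda p: (len(p[0]), p[0])):
    print("".join(pattern), count)
```

On this toy database the loop runs for three levels and stops when no length 4 candidate can be formed.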

Bottlenecks of GSP
A huge set of candidates could be generated
– 1,000 frequent length 1 sequences generate 1000*1000 + 1000*999/2 = 1,499,500 length 2 candidates!
Multiple scans of database in mining
Real challenge: mining long sequential patterns
– An exponential number of short candidates
– A length-100 sequential pattern needs about 10^30 candidate sequences!

Improvements On GSP
FreeSpan:
– Projection-based: No candidate sequence needs to be generated
– But, projection can be performed at any point in the sequence, and the projected sequences will not shrink much
PrefixSpan
– Projection-based
– But only prefix-based projection: fewer projections and quickly shrinking sequences
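Prefix-based projection can be illustrated with a reduced PrefixSpan-style sketch. As a simplifying assumption, every element holds a single item, so sequences are plain strings; the published algorithm also handles multi-item elements. The function name `prefixspan` and the database are hypothetical.

```python
def prefixspan(database, min_sup):
    """Grow patterns by projecting the database on each frequent prefix."""
    patterns = {}

    def mine(prefix, projected):
        # Count each item at most once per projected sequence.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_sup:
                continue
            pattern = prefix + (item,)
            patterns[pattern] = count
            # Keep only what follows the first occurrence of item, so the
            # projected sequences shrink at every step.
            shorter = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, shorter)

    mine((), list(database))
    return patterns

# Hypothetical database; each character is one single-item element.
database = ["abcd", "acd", "abd", "bc"]
ps_patterns = prefixspan(database, min_sup=2)
for pattern, count in sorted(ps_patterns.items(), key=lambda p: (len(p[0]), p[0])):
    print("".join(pattern), count)
```

No candidate sequences are generated and tested here: each pattern is produced only because its last item is frequent in the projected database of its prefix.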

Frequent-Pattern Mining: Achievements
Frequent pattern mining: an important task in data mining
Frequent pattern mining methodology
– Candidate generation & test vs. projection-based (frequent-pattern growth)
– Various optimization methods: database partition, scan reduction, hash tree, sampling, border computation, clustering, etc.
Related frequent-pattern mining algorithm: scope extension
– Mining closed frequent itemsets and max-patterns (e.g., MaxMiner, CLOSET, CHARM, etc.)
– Mining multi-level, multi-dimensional frequent patterns with flexible support constraints
– Constraint pushing for mining optimization
– From frequent patterns to correlation and causality

Frequent-Pattern Mining: Research Problems
Multi-dimensional gradient analysis: patterns regarding changes and differences
– Not just counts: other measures, e.g., avg(profit)
Mining top-k frequent patterns without support constraint
Mining fault-tolerant associations
– “3 out of 4 courses excellent” leads to A in data mining
Fascicles and database compression by frequent pattern mining
Partial periodic patterns
DNA sequence analysis and pattern classification

Questions?
