Market Baskets, Frequent Itemsets, A-Priori Algorithm

Similar presentations
Recap: Mining association rules from large datasets

Huffman Codes and Asssociation Rules (II) Prof. Sin-Min Lee Department of Computer Science.
Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,
1 Frequent Itemset Mining: Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored basket-by-basket.
Data Mining of Very Large Data
IDS561 Big Data Analytics Week 6.
 Back to finding frequent itemsets  Typically, data is kept in flat files rather than in a database system:  Stored on disk  Stored basket-by-basket.
1 Mining Associations Apriori Algorithm. 2 Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored.
1 Association Rules Market Baskets Frequent Itemsets A-Priori Algorithm.
1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Chapter 5: Mining Frequent Patterns, Association and Correlations
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
732A02 Data Mining - Clustering and Association Analysis ………………… Jose M. Peña Association rules Apriori algorithm FP grow algorithm.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
1 Association Rules Apriori Algorithm. 2 Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored.
1 Association Rules Market Baskets Frequent Itemsets A-priori Algorithm.
Association Analysis: Basic Concepts and Algorithms.
Chapter 4: Mining Frequent Patterns, Associations and Correlations
Asssociation Rules Prof. Sin-Min Lee Department of Computer Science.
Association Rule Mining - MaxMiner. Mining Association Rules in Large Databases  Association rule mining  Algorithms Apriori and FP-Growth  Max and.
Mining Association Rules
Mining Association Rules
Association Rule Mining. Mining Association Rules in Large Databases  Association rule mining  Algorithms Apriori and FP-Growth  Max and closed patterns.
Performance and Scalability: Apriori Implementation.
Mining Association Rules in Large Databases. What Is Association Rule Mining?  Association rule mining: Finding frequent patterns, associations, correlations,
1 “Association Rules” Market Baskets Frequent Itemsets A-priori Algorithm.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Frequent Itemsets and Association Rules 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 3: Frequent Itemsets.
DATA MINING LECTURE 3 Frequent Itemsets Association Rules.
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining III COMP Seminar GNET 713 BCB Module Spring 2007.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Data Mining Find information from data data ? information.
Jeffrey D. Ullman Stanford University.  2% of your grade will be for answering other students’ questions on Piazza.  18% for Gradiance.  Piazza code.
M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining ARM: Improvements March 10, 2009 Slide.
The UNIVERSITY of KENTUCKY Association Rule Mining CS 685: Special Topics in Data Mining Spring 2009.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
The UNIVERSITY of KENTUCKY Association Rule Mining CS 685: Special Topics in Data Mining.
Data Mining Association Rules Mining Frequent Itemset Mining Support and Confidence Apriori Approach.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP Seminar BCB 713 Module Spring 2011.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
CS685: Special Topics in Data Mining The UNIVERSITY of KENTUCKY Frequent Itemset Mining II Tree-based Algorithm Max Itemsets Closed Itemsets.
Mining Association Rules in Large Databases
Data Mining Find information from data data ? information.
Reducing Number of Candidates
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Data Mining: Concepts and Techniques
Information Management course
Association rule mining
Association Rules Repoussis Panagiotis.
Frequent Pattern Mining
Frequent Itemsets Association Rules
CPS216: Advanced Database Systems Data Mining
Dynamic Itemset Counting
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Hash-Based Improvements to A-Priori
Mining Association Rules in Large Databases
Frequent Itemset Mining & Association Rules
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Association Analysis: Basic Concepts and Algorithms
Frequent-Pattern Tree
Mining Association Rules in Large Databases
Association Rule Mining
Mining Association Rules in Large Databases
Association Analysis: Basic Concepts
What Is Association Mining?
Presentation transcript:

Association Rules: Market Baskets, Frequent Itemsets, A-Priori Algorithm

The Market-Basket Model A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.

Market-Baskets – (2) Really a general many-many mapping (association) between two kinds of things. But we ask about connections among “items,” not “baskets.” The technology focuses on common events, not rare events (“long tail”).

Support Simplest question: find sets of items that appear “frequently” in the baskets. Support for itemset I = the number of baskets containing all items in I. Sometimes given as a percentage. Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.

Example: Frequent Itemsets Items={milk, coke, pepsi, beer, juice}. Support = 3 baskets. B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} Frequent itemsets: {m}, {c}, {b}, {j}, {m,b} , {b,c} , {c,j}.
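As a concreteness check, here is a brute-force Python sketch (my illustration, not part of the slides) that counts the support of every itemset over these eight baskets and recovers exactly the frequent itemsets listed above:

    # Illustrative sketch, not from the slides: brute-force support counting.
    from itertools import combinations

    baskets = [
        {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
        {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
    ]
    threshold = 3  # support threshold s

    items = sorted(set().union(*baskets))
    frequent = []
    for k in range(1, len(items) + 1):
        for itemset in combinations(items, k):
            # Support of an itemset = number of baskets containing all its items.
            support = sum(1 for b in baskets if set(itemset) <= b)
            if support >= threshold:
                frequent.append((itemset, support))

    print(frequent)  # singletons {m},{c},{b},{j} plus pairs {m,b},{b,c},{c,j}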

Applications – (1) Items = products; baskets = sets of products someone bought in one trip to the store. Example application: given that many people buy beer and diapers together: Run a sale on diapers; raise price of beer. Only useful if many buy diapers & beer.

Applications – (2) Baskets = sentences; items = documents containing those sentences. Items that appear together too often could represent plagiarism. Notice items do not have to be “in” baskets.

Applications – (3) Baskets = Web pages; items = words. Unusual words appearing together in a large number of documents, e.g., “Brad” and “Angelina,” may indicate an interesting relationship.

Aside: Words on the Web Many Web-mining applications involve words. Cluster pages by their topic, e.g., sports. Find useful blogs, versus nonsense. Determine the sentiment (positive or negative) of comments. Partition pages retrieved from an ambiguous query, e.g., “jaguar.”

Words – (2) Very common words are stop words. They rarely help determine meaning, and they block from view interesting events, so ignore them. The TF/IDF measure distinguishes “important” words from those that are usually not meaningful.

Words – (3) TF/IDF = “term frequency, inverse document frequency”: relates the number of times a word appears to the number of documents in which it appears. Low values are words like “also” that appear essentially at random. High values are words like “computer” that are likely the topic of any document in which they appear at all.
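For reference, the usual textbook form of the measure (a standard definition, not spelled out on the slide; here f_ij is the frequency of word i in document j, N the number of documents, and n_i the number of documents containing word i):

    % Standard TF.IDF definition; the notation is assumed, not from the slides.
    TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}, \qquad
    IDF_i = \log_2 \frac{N}{n_i}, \qquad
    \mathrm{TF.IDF}_{ij} = TF_{ij} \times IDF_i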

Scale of the Problem WalMart sells 100,000 items and can store billions of baskets. The Web has billions of words and many billions of pages.

Association Rules If-then rules about the contents of baskets. {i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then it is likely to contain j.” Confidence of this association rule is the probability of j given i1,…,ik.

Example: Confidence B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} An association rule: {m, b} → c. Confidence = 2/4 = 50%.
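A tiny Python sketch (mine, not from the slides) computing this confidence directly from the baskets:

    # Illustrative sketch: confidence of the rule {m, b} -> c.
    baskets = [
        {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
        {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
    ]

    def support(itemset):
        # Number of baskets containing every item in `itemset`.
        return sum(1 for b in baskets if itemset <= b)

    lhs, rhs = {"m", "b"}, {"c"}
    confidence = support(lhs | rhs) / support(lhs)
    print(confidence)  # 2 / 4 = 0.5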

Finding Association Rules Question: “find all association rules with support ≥ s and confidence ≥ c .” Note: “support” of an association rule is the support of the set of items on the left. Hard part: finding the frequent itemsets. Note: if {i1, i2,…,ik} → j has high support and confidence, then both {i1, i2,…,ik} and {i1, i2,…,ik ,j } will be “frequent.”

Computation Model Typically, data is kept in flat files rather than in a database system. Stored on disk. Stored basket-by-basket. Expand baskets into pairs, triples, etc. as you read baskets. Use k nested loops to generate all sets of size k.

File Organization Example: items are positive integers, and boundaries between baskets are –1. (Figure: the file is a sequence of item integers, with –1 marking the end of Basket 1, Basket 2, Basket 3, and so on.)

Computation Model – (2) The true cost of mining disk-resident data is usually the number of disk I/O’s. In practice, association-rule algorithms read the data in passes – all baskets read in turn. Thus, we measure the cost by the number of passes an algorithm takes.

Main-Memory Bottleneck For many frequent-itemset algorithms, main memory is the critical resource. As we read baskets, we need to count something, e.g., occurrences of pairs. The number of different things we can count is limited by main memory. Swapping counts in/out is a disaster (why?).

Finding Frequent Pairs The hardest problem often turns out to be finding the frequent pairs. Why? Often frequent pairs are common, frequent triples are rare. Why? Probability of being frequent drops exponentially with size; number of sets grows more slowly with size. We’ll concentrate on pairs, then extend to larger sets.

Naïve Algorithm Read the file once, counting in main memory the occurrences of each pair. From each basket of n items, generate its n(n−1)/2 pairs by two nested loops. Fails if (#items)² exceeds main memory. Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).
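As a sketch (illustrative Python, not from the slides), the naïve algorithm is just one pass with an in-memory table of pair counts:

    # Illustrative sketch of the naive algorithm: one pass, count all pairs.
    from collections import defaultdict
    from itertools import combinations

    def count_pairs(baskets):
        # Count occurrences of each item pair across all baskets.
        counts = defaultdict(int)
        for basket in baskets:
            # n(n-1)/2 pairs per basket, generated by two (implicit) nested loops.
            for pair in combinations(sorted(basket), 2):
                counts[pair] += 1
        return counts

    baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}]
    print(count_pairs(baskets))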

Example: Counting Pairs Suppose 10^5 items. Suppose counts are 4-byte integers. Number of pairs of items: 10^5(10^5 − 1)/2 ≈ 5×10^9. Therefore, 2×10^10 bytes (20 gigabytes) of main memory needed.

Details of Main-Memory Counting Two approaches: Count all pairs, using a triangular matrix. Keep a table of triples [i, j, c] = “the count of the pair of items {i, j } is c.” (1) requires only 4 bytes/pair. Note: always assume integers are 4 bytes. (2) requires 12 bytes, but only for those pairs with count > 0.

(Figure: Method (1), the triangular matrix, uses 4 bytes per pair; Method (2), the table of triples, uses 12 bytes per occurring pair.)

Triangular-Matrix Approach – (1) Number items 1, 2,… Requires table of size O(n) to convert item names to consecutive integers. Count {i, j } only if i < j. Keep pairs in the order {1,2}, {1,3},…, {1,n }, {2,3}, {2,4},…,{2,n }, {3,4},…, {3,n },…{n -1,n }.

Triangular-Matrix Approach – (2) Find pair {i, j} at position (i−1)(n − i/2) + j − i. Total number of pairs n(n−1)/2; total bytes about 2n².
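A small Python sketch of that indexing (my illustration; the function name is an assumption): the slide's formula, rewritten in integer arithmetic, maps each pair {i, j} with i < j onto positions 1 … n(n−1)/2 of a flat count array.

    # Illustrative sketch of the triangular-matrix layout, not from the slides.
    def pair_index(i, j, n):
        # 1-based position of pair {i, j}, i < j, in the triangular array;
        # integer form of the slide's (i-1)(n - i/2) + j - i.
        assert 1 <= i < j <= n
        return (i - 1) * (2 * n - i) // 2 + j - i

    n = 4
    counts = [0] * (n * (n - 1) // 2)      # one count slot per possible pair
    counts[pair_index(2, 3, n) - 1] += 1   # increment the count of pair {2, 3}
    print([pair_index(i, j, n) for i in range(1, n) for j in range(i + 1, n + 1)])
    # -> [1, 2, 3, 4, 5, 6]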

Details of Approach #2 Total bytes used is about 12p, where p is the number of pairs that actually occur. Beats triangular matrix if at most 1/3 of possible pairs actually occur. May require extra space for retrieval structure, e.g., a hash table.

A-Priori Algorithm for Pairs A two-pass approach called a-priori limits the need for main memory. Key idea: monotonicity : if a set of items appears at least s times, so does every subset. Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.

Apriori Algorithm – General (Agrawal & Srikant, 1994). Min_sup = 2. Database D:
TID 10: a, c, d
TID 20: b, c, e
TID 30: a, b, c, e
TID 40: b, e
Scan D for the 1-candidates and their counts: a:2, b:3, c:3, d:1, e:3. Frequent 1-itemsets: a:2, b:3, c:3, e:3.
2-candidates: ab, ac, ae, bc, be, ce. Scan D for their counts: ab:1, ac:2, ae:1, bc:2, be:3, ce:2. Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2.
3-candidate: bce. Scan D: bce:2. Frequent 3-itemsets: bce:2.

Important Details of Apriori How to generate candidates? Step 1: self-joining Lk Step 2: pruning How to count supports of candidates?

How to Generate Candidates? Suppose the items in Lk-1 are listed in an order.
Step 1: self-join Lk-1:
INSERT INTO Ck
SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
FROM Lk-1 p, Lk-1 q
WHERE p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning:
for each itemset c in Ck do
    for each (k-1)-subset s of c do
        if (s is not in Lk-1) then delete c from Ck

Example of Candidate Generation L3 = {abc, abd, acd, ace, bcd}. Self-joining L3*L3: abcd from abc and abd; acde from acd and ace. Pruning: acde is removed because ade is not in L3. Result: C4 = {abcd}.
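A sketch of this generate-and-prune step in Python (my illustration; the function name apriori_gen and the sorted-tuple representation are assumptions), reproducing the L3 → C4 example above:

    # Illustrative sketch, not from the slides: self-join + prune.
    from itertools import combinations

    def apriori_gen(L_prev):
        # L_prev: set of frequent (k-1)-itemsets as sorted tuples.
        L_prev = set(L_prev)
        k = len(next(iter(L_prev))) + 1
        candidates = set()
        for p in L_prev:
            for q in L_prev:
                # Join: first k-2 items equal, last item of p < last item of q.
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # Prune: every (k-1)-subset of c must be in L_{k-1}.
                    if all(s in L_prev for s in combinations(c, k - 1)):
                        candidates.add(c)
        return candidates

    L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
    print(apriori_gen(L3))  # {('a','b','c','d')}; acde pruned since ade not in L3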

How to Count Supports of Candidates? Why is counting the supports of candidates a problem? The total number of candidates can be huge, and one transaction may contain many candidates. Method: candidate itemsets are stored in a hash-tree; a leaf node of the hash-tree contains a list of itemsets and counts, and an interior node contains a hash table. The subset function finds all the candidates contained in a transaction.

Apriori: Candidate Generation-and-test Any subset of a frequent itemset must also be frequent (an anti-monotone property). A transaction containing {beer, diaper, nuts} also contains {beer, diaper}; so {beer, diaper, nuts} is frequent ⇒ {beer, diaper} must also be frequent. No superset of any infrequent itemset should be generated or tested, so many item combinations can be pruned.

The Apriori Algorithm
Ck: candidate itemsets of size k; Lk: frequent itemsets of size k.
L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
return ∪k Lk;

(Figure: the two passes of a-priori. C1 = all items; the first pass counts the items; filter to L1 = frequent items; construct C2 = all pairs of items from L1; the second pass counts the pairs; filter to L2 = frequent pairs; construct C3, and so on.)
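A compact sketch of the two-pass scheme for pairs (assumed Python code, not from the slides):

    # Illustrative sketch: two-pass a-priori for frequent pairs.
    from collections import defaultdict
    from itertools import combinations

    def apriori_pairs(baskets, s):
        # First pass: count individual items, keep the frequent ones (L1).
        item_counts = defaultdict(int)
        for basket in baskets:
            for item in basket:
                item_counts[item] += 1
        L1 = {i for i, c in item_counts.items() if c >= s}

        # Second pass: count only pairs whose items are both in L1 (C2).
        pair_counts = defaultdict(int)
        for basket in baskets:
            for pair in combinations(sorted(i for i in basket if i in L1), 2):
                pair_counts[pair] += 1
        return {p: c for p, c in pair_counts.items() if c >= s}  # L2

    baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
               {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]
    print(apriori_pairs(baskets, 3))  # {('b','c'): 4, ('b','m'): 4, ('c','j'): 3}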

Challenges of FPM Challenges: multiple scans of the transaction database; a huge number of candidates; the tedious workload of support counting for candidates. Improving Apriori – general ideas: reduce the number of transaction-database scans; shrink the number of candidates; facilitate the support counting of candidates.

DIC: Reduce Scans (S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997). Once both A and D are determined frequent, the counting of AD can begin. Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin. (Figure: the itemset lattice over {A, B, C, D}, from {} through the 1-itemsets and 2-itemsets up to ABCD; unlike Apriori's strict level-by-level passes, DIC starts counting an itemset as soon as all of its subsets are known to be frequent.)

Partition: Scan Database Only Twice (A. Savasere, E. Omiecinski, and S. Navathe, 1995). Partition the database into n partitions. If itemset X is frequent, then X is frequent in at least one partition. Scan 1: partition the database and find the local frequent patterns. Scan 2: consolidate the global frequent patterns.
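A rough sketch of this two-scan idea in Python (my illustration under simplifying assumptions: the database fits in a list, partitions are contiguous chunks, and all names are mine):

    # Illustrative sketch of Partition: union of local frequent itemsets is a
    # complete candidate set, verified by one more full scan.
    from itertools import combinations

    def local_frequent(part, local_s):
        # Brute-force frequent itemsets within one in-memory partition.
        items = sorted(set().union(*part))
        found = set()
        for k in range(1, len(items) + 1):
            for c in combinations(items, k):
                if sum(1 for b in part if set(c) <= b) >= local_s:
                    found.add(c)
        return found

    def partition_mine(baskets, s, n_parts=2):
        size = (len(baskets) + n_parts - 1) // n_parts
        parts = [baskets[i:i + size] for i in range(0, len(baskets), size)]
        # Scan 1: collect local frequent itemsets (proportional threshold).
        candidates = set()
        for part in parts:
            local_s = max(1, s * len(part) // len(baskets))
            candidates |= local_frequent(part, local_s)
        # Scan 2: count each candidate globally, keep the truly frequent ones.
        return {c for c in candidates
                if sum(1 for b in baskets if set(c) <= b) >= s}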

Sampling for Frequent Patterns (H. Toivonen, 1996). Select a sample of the original database and mine frequent patterns within the sample using Apriori. Scan the full database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked (example: check abcd instead of ab, ac, …). Scan the database again to find missed frequent patterns.

Bottleneck of Frequent-pattern Mining Multiple database scans are costly. Mining long patterns needs many passes of scanning and generates lots of candidates. To find the frequent itemset i1i2…i100: # of scans: 100; # of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27×10^30. Bottleneck: candidate generation-and-test. Can we avoid candidate generation?

Set Enumeration Tree Subsets of I can be enumerated systematically, e.g., for I = {a, b, c, d}: ∅; a, b, c, d; ab, ac, ad, bc, bd, cd; abc, abd, acd, bcd; abcd.

Borders of Frequent Itemsets The border of the frequent itemsets is connected: if X and Y are frequent and X is an ancestor of Y, then all patterns between X and Y are frequent. (Figure: the itemset lattice over {a, b, c, d}.)

Projected Databases To find a child Xy of X, only the X-projected database is needed: the sub-database of transactions containing X. Item y is frequent in the X-projected database ⇒ Xy is a frequent pattern. (Figure: the itemset lattice over {a, b, c, d}.)

Tree-Projection Method Find the frequent 2-itemsets. For each frequent 2-itemset xy, form a projected database: the sub-database of transactions containing xy. Mine recursively: if x′y′ is frequent in the xy-projected database, then xyx′y′ is a frequent pattern.

Borders and Max-patterns Max-patterns are the border of the frequent patterns: every subset of a max-pattern is frequent, and every proper superset of a max-pattern is infrequent. (Figure: the itemset lattice over {a, b, c, d}.)

MaxMiner: Mining Max-patterns (Bayardo, 1998). Min_sup = 2. Transactions:
TID 10: A, B, C, D, E
TID 20: B, C, D, E
TID 30: A, C, D, F
1st scan: find the frequent items A, B, C, D, E. 2nd scan: find the support for the potential max-patterns AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, DE, CDE. Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan.

Frequent Closed Patterns (N. Pasquier et al., ICDT ’99). For a frequent itemset X, if there exists no item y such that every transaction containing X also contains y, then X is a frequent closed pattern. Min_sup = 2; example database:
TID 10: a, c, d, e, f
TID 20: a, b, e
TID 30: c, e, f
TID 40: a, c, d, f
TID 50: c, e, f
Here “acdf” is a frequent closed pattern. Closed patterns are a concise representation of the frequent patterns: they reduce the number of patterns and rules.
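A tiny Python sketch of the closedness test (my illustration over the example database above; the helper name is an assumption): an itemset X is closed iff no item y outside X appears in every transaction that contains X.

    # Illustrative sketch, not from the slides: checking closedness.
    transactions = [{"a","c","d","e","f"}, {"a","b","e"}, {"c","e","f"},
                    {"a","c","d","f"}, {"c","e","f"}]

    def is_closed(X):
        containing = [t for t in transactions if X <= t]
        others = set().union(*containing) - X
        # X is closed iff no outside item appears in all containing transactions.
        return not any(all(y in t for t in containing) for y in others)

    print(is_closed({"a", "c", "d", "f"}))  # True: acdf is closed
    print(is_closed({"a", "c", "d"}))       # False: every transaction with acd also has f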

CLOSET: Mining Frequent Closed Patterns (Pei, Han & Mao, 2000). Min_sup = 2; same database as above. f_list: all frequent items in support-ascending order: d–a–f–e–c. Divide the search space: patterns having d; patterns having a but no d; and so on down the f_list. Find frequent closed patterns recursively: every transaction having d also has c, f, and a ⇒ cfad is a frequent closed pattern.

Closed and Max-patterns Closed-pattern mining algorithms can be adapted to mine max-patterns, since a max-pattern must be closed. Depth-first search methods have advantages over breadth-first search ones.