Association Rule Mining

Similar presentations

Recap: Mining association rules from large datasets
Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,
DATA MINING Association Rule Discovery. AR Definition aka Affinity Grouping Common example: Discovery of which items are frequently sold together at a.
Data Mining of Very Large Data
LOGO Association Rule Lecturer: Dr. Bo Yuan
Association Rule Mining. 2 The Task Two ways of defining the task General –Input: A collection of instances –Output: rules to predict the values of any.
10 -1 Lecture 10 Association Rules Mining Topics –Basics –Mining Frequent Patterns –Mining Frequent Sequential Patterns –Applications.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Association rules The goal of mining association rules is to generate all possible rules that exceed some minimum user-specified support and confidence.
Sampling Large Databases for Association Rules ( Toivenon’s Approach, 1996) Farzaneh Mirzazadeh Fall 2007.
1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Chapter 5: Mining Frequent Patterns, Association and Correlations
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining Part 2 (under construction!) Introduction to Data Mining with Case Studies Author: G. K. Gupta Prentice Hall India, 2006.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Analysis: Basic Concepts and Algorithms.
Data Mining Association Analysis: Basic Concepts and Algorithms
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee, D. W. Cheung, B. Kao Department of Computer Science.
Association Rules Prof. Sin-Min Lee Department of Computer Science.
© Vipin Kumar CSci 8980 Fall CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
Fast Algorithms for Association Rule Mining
Lecture14: Association Rules
Mining Association Rules
Finding Similar Items.
1 Locality-Sensitive Hashing Basic Technique Hamming-LSH Applications.
Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing Locality-Sensitive Hashing.
Association Discovery from Databases Association rules are a simple formalism for expressing positive connections between columns in a 0/1 matrix. A classical.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
1 Mining Association Rules Mohamed G. Elfeky. 2 Introduction Data mining is the discovery of knowledge and useful information from the large amounts of.
Supermarket shelf management – Market-basket model:  Goal: Identify items that are bought together by sufficiently many customers  Approach: Process.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.
CS 8751 ML & KDDSupport Vector Machines1 Mining Association Rules KDD from a DBMS point of view –The importance of efficiency Market basket analysis Association.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.
1 Low-Support, High-Correlation Finding rare, but very similar items.
Data Mining Find information from data data ? information.
Association Rule Mining
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture node is modified based on Lecture Notes for.
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
CS 345: Topics in Data Warehousing Thursday, November 18, 2004.
Security in Outsourced Association Rule Mining. Agenda  Introduction  Approximate randomized technique  Encryption  Summary and future work.
Jeffrey D. Ullman Stanford University.  2% of your grade will be for answering other students’ questions on Piazza.  18% for Gradiance.  Piazza code.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Data Mining - assoc. rules C. Faloutsos.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
COMP53311 Association Rule Mining Prepared by Raymond Wong Presented by Raymond Wong
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
DATA MINING: ASSOCIATION ANALYSIS (2) Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
Data Mining Association Analysis: Basic Concepts and Algorithms
Frequent Pattern Mining
Finding Similar Items: Locality Sensitive Hashing
Market Basket Many-to-many relationship between different objects
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Analysis: Basic Concepts
Presentation transcript:

CS246: Association Rule Mining

Association Rule Mining What is the problem? What is an association rule? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Motivating Problem If a customer buys, “Diet Coke,” is she likely to buy a nutrition bar? To arrange store shelves, etc. Beer and diaper Life as a parent is tough… Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Word of Caution Famous example: David Rhine at Duke Tested students for “extrasensory perception” Asked them to guess 10 cards – red or black 1/1000 of them guess all 10 correctly. If done many times, some unlikely events happen for purely statistical reasons No physical validity Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Problem Definition Input: transaction records (set of items) T1: Bread, Milk, Apple T2: Beer, Chips T3: Pants, Brush, Toothpaste, Chopstick … Output: all “association rules” Bread, Milk  Apple If a customer buys bread and milk, he is likely to buy an apple. Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Confidence Bread  Apple: If a customer buys bread, he is likely to buy an apple. What does likely mean? A large fraction of baskets with bread also have apple. Formally, P{ I1 | I2 , I3 } > c c : confidence, say 0.95 Probability to buy an item given other items If a customer buys I2 , I3 , she is likely to buy I1 with 95% probability “Strength” of the rule Identify all association rules satisfying confidence threshold c Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Support Do we really want to find all association rules? If we sell only 5 items of a particular product, who cares what it is sold with? Find association rules only for the set of items that appear often enough. Formally, P{ I1 , I2 , I3 } > s s: support Fraction of records containing the itemset Statistical “significance” I1 , I2 , I3 : frequent itemset Find association rules for frequent itemsets Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Problem Definition Input: transaction records (set of items) Output: All association rules I1 , I2  I3 with support: P{ I1 , I2 , I3 } > s and confidence: P{ I1 | I2 , I3 } > c Is the difference between confidence and support clear? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Basic Algorithm? Step 1: Find all frequent itemsets P{ I1 , I2 , I3 } > s Step 2: From the large itemsets, identify high confidence rules P{ I1 | I2 , I3 } > c Junghoo "John" Cho (UCLA Computer Science)

Step 1: Frequent Itemsets Find all itemsets {I1, …, Ik} with P{ I1, …, Ik } > s: the frequent itemsets. More informally, find all sets of items appearing in more than k transactions. Is it really difficult? How can we solve it? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Naïve Approach Keep counters for all subsets of items {A, B, C} {A}, {B}, {C}, {AB}, {BC}, {AC} {ABC} Scan all transaction records and increase counters Transaction {A, C} {A}++, {C}++, {AC}++ What is difficult? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Main Challenge? Problem: 2n subsets for n items 1000 items: 21000 = 10301 Clearly not feasible Lesson: When data size is large, even a simple problem can be very difficult. What was their main idea? Junghoo "John" Cho (UCLA Computer Science)

Main Idea of Apriori Algorithm If (A, B, C) is a frequent itemset, (A, B) is a frequent itemset If (A, B) is not a frequent itemset, (A, B, C) cannot be a frequent itemset Consider (A, B, C) only if all its subsets are frequent itemsets Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Apriori Algorithm L1 = { frequent 1-itemsets }, k = 1 Candidate set generation Candidate set Ck : potentially frequent k itemset {A, B, C} is a candidate set iff all its subsets {A, B}, {B, C} and {A, C} are frequent itemsets Generate candidate set Ck+1 using Lk Scanning Check whether candidate sets are actually frequent Increase k by 1, and go to step 2 Ask questions for candidate set generation: If {a}, {b} large, is {a, b} candidate? If {a, b}, {b, c} large, is {a, b, c} candidate? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Example Items: {A, B, C, D} Transactions: {A, B}, {A, D} {A, B, C} {B} Support: 0.5 = 2 transactions Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Example  A B C D        {A,B}    {A, B} {A, D} {A, B, C} {B} Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Why Does Apriori Work? Typical grocery-store scenario: 100,000 different items 10M baskets with 10 items each (108 items) support = 0.01 Q: How many items can Apriori eliminate? A: At most 1000 items remain (less than 1%) An item should appear at least 0.01*107 = 105 108 items in total, so 108/105 = 1000 items Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Basic Algorithm Step 1: Find all frequent itemsets P{ I1 , I2 , I3 } > s Apriori algorithm Step 2: From the large itemsets, identify high confidence rules P{ I1 | I2 , I3 } > c Junghoo "John" Cho (UCLA Computer Science)

Step 2: High Confidence Rules In principle, the second step is straightforward: we already computed the needed support values (e.g., P{ I1, I2, I3 } and P{ I2, I3 }) in the first step. Piece of cake. Simple division! Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) More On Step 2 Q: But given a frequent k-itemset, how many potential rules? A: 2k! Any efficient algorithm? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Questions (1) Is support pruning valid? What about Castillo de Ygay ($5000 wine)  Caviar? Even if we only sell 100 items, significant profit… Technically very challenging Finding all association rules without support pruning Topic of the next paper Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Questions (2) Is P{Beer|Diaper} > 0.95 really meaningful? What if beer appears in 95% of baskets? Interest: P{Beer, Diaper} / P{Beer} P{Diaper} Implication strength: Beer  Diaper == ~(Beer, ~Diaper) P{~Diaper} P{Beer} / P{~Diaper, Beer} Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Follow-up Works Candidate set generation still costly Iceberg queries No candidate set generation stage Minimizing number of passes Junghoo "John" Cho (UCLA Computer Science)

Mining without Support Pruning What is the problem? How can we identify "Castillo de Ygay → Caviar"? Apriori is efficient only for frequent items. Problem definition: data mining with low support and high correlation, i.e., finding rare but very similar items. Junghoo "John" Cho (UCLA Computer Science)

Matrix Representation Typical scenario 100,000 items 10M baskets with 10 items each Matrix Columns = items Rows = baskets (i, j) = 1 if item cj is in basket ri Very sparse: almost all 0’s (less than 0.01% 1’s) Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Matrix Example {a, b} {a, f} {b, d, g} {b, e} {c, d, e} {a} {e, f} a b c d e f g 1 Junghoo "John" Cho (UCLA Computer Science)

Association Rule and Similarity Think of column Ci as the set of rows with a 1. Association rule (confidence): P(C2 | C1) = |C1 ∩ C2| / |C1|. Similarity (Jaccard): Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|. Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Example C1 C2 0 1 1 1 1 0 0 0 P(C2|C1) = 2/4 Sim(C1, C2) = 2/5 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Problem Definition Find all highly similar pairs All Ci, Cj with Sim(Ci, Cj) > s* s*: Similarity threshold Junghoo "John" Cho (UCLA Computer Science)

Why Similarity (not Confidence)? A1: The techniques work only for similarity. A2: High similarity implies high confidence: |C1 ∩ C2| / |C1 ∪ C2| ≤ |C1 ∩ C2| / |C1| (same numerator, larger denominator on the left), so every highly similar pair also has high confidence. Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Assumption Matrix does not fit into main memory Number of columns is relatively small Can store some information in main memory per each item Number of rows can be very big Sparse data: mostly 0 in the matrix Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Key Idea? “Compress” the matrix into a smaller one Load the compressed matrix into main memory Find high similarity pairs from the compressed matrix Much easier than disk-based computation Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Min-Hash? LSH? Hamming? What are the for? Min-Hash?: compression LSH?: similarity pair computation Hamming LSH?: compression+similarity Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How To Compress? (1) “Hash” each column C to a small signature Sig(C) such that Sim(C1, C2) is the same as the “similarity” of Sig(C1) and Sig(C2) Sig(C) is small enough, so that we can store the “compressed” matrix in main memory Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) How To Compress? (2) Idea 1 Pick 100 random rows Sig(C1) = the 100 bits of the selected rows Would it work? Idea 1 does not work Matrix is sparse Most of the columns will be “0000…0” But the columns are different because 1’s are in different rows Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Min-Hashing Imagine rows are permuted randomly “Hash” function h(C) The first row number with 1 in column C Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Example C1 C2 C3 1 2 3 4 5 Permutation = (45123) S1 S2 S3 5 4 1 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Important Property The probability that h(C1) = h(C2) is the same as Sim(C1, C2) Why? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Row Types Given C1 and C2, rows can be classified as a = # of rows of type a Sim(C1, C2) = a / (a + b + c) Q: What’s P{ h(C1) = h(C2) }? A: a / (a + b + c) Look down C1 and C2 until we see 1 If it’s type a, then h(C1) = h(C2) If it’s type b or c, not. C1 C2 a 1 b c d Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Min-Hash Signature Pick (say) 100 random permutations of the rows Get Min-Hash values from each permutation Sig(C) = the list of 100 Min-Hash values Sim( Sig(C1), Sig(C2) ) = fraction of signatures for which Min-Hash value agrees Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Example 1 2 4 5 3 S3 S2 S1 Perm1 = (12345) Perm2 = (54321) Perm3 = (34512) C1 C2 C3 1 2 3 4 5 Similarities: 1-2 1-3 2-3 Matrix 0.5 0.25 Sig 0.67 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Basic Idea “Compress” the matrix into a smaller one Min-Hash signature Find high similarity pairs from the compressed matrix How? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Problem From the signature matrix (which fits into main memory), identify all similar pairs Assuming 100,000 items Potentially 1010 similar pairs? One counter per one pair? No way How? Junghoo "John" Cho (UCLA Computer Science)

Locality Sensitive Hashing A technique to limit the number of similar pairs to consider Approach Using LSH, identify “candidate similar pairs” Scan the Min-Hash signature matrix for verification Junghoo "John" Cho (UCLA Computer Science)

Locality Sensitive Hashing Partition the signature matrix into l bands of r rows each. (The slide shows a signature matrix with columns C1–C7 and signature rows h1–h6 split into band 1, band 2, …, each of r rows.) Junghoo "John" Cho (UCLA Computer Science)

Locality Sensitive Hashing Hash each column in each band into buckets. Junghoo "John" Cho (UCLA Computer Science)

Locality Sensitive Hashing Two columns are a candidate pair if they hash to the same bucket in any band. (On the slide, two columns landing in the same bucket are marked "Candidate pair!") Junghoo "John" Cho (UCLA Computer Science)

Locality Sensitive Hashing Final verification After identifying candidates, verify each candidate-pair (Ci, Cj) by examining Sig(Ci) and Sig (Cj) for similarity Junghoo "John" Cho (UCLA Computer Science)
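A sketch of the banding and verification steps on Min-Hash signatures: each band of r signature rows is hashed to a bucket (here, simply by using the band's tuple of values as a dictionary key), any two columns sharing a bucket in some band become a candidate pair, and candidates are then verified against the full signatures:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(sigs, bands, rows_per_band):
    """Return candidate column pairs: columns whose signatures agree on all rows of some band."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        lo, hi = b * rows_per_band, (b + 1) * rows_per_band
        for col_id, sig in enumerate(sigs):
            buckets[tuple(sig[lo:hi])].append(col_id)  # the band slice acts as the bucket key
        for cols_in_bucket in buckets.values():
            candidates.update(combinations(sorted(cols_in_bucket), 2))
    return candidates

def verify(sigs, candidates, threshold):
    """Final verification: keep candidates whose full-signature similarity exceeds the threshold."""
    keep = []
    for i, j in candidates:
        sim = sum(a == b for a, b in zip(sigs[i], sigs[j])) / len(sigs[i])
        if sim > threshold:
            keep.append((i, j, sim))
    return keep

# e.g. with 100-value signatures: 20 bands of 5 rows each
# cands = lsh_candidates(sigs, bands=20, rows_per_band=5)
# print(verify(sigs, cands, threshold=0.8))
```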

Junghoo "John" Cho (UCLA Computer Science) Example 100,000 columns 100 Min-Hash integer signature Total signature table size 4 x 100 x 100,000 = 40 MB (not bad) Potential similar pairs 100000 x 100000 / 2 = 5,000,000,000 (too many!) 20 bands of 5 integers per band Compute false positive and false negative rates Junghoo "John" Cho (UCLA Computer Science)

False Negative: 80% Similar Probability that C1, C2 are identical in one band: 0.8^5 ≈ 0.328. Probability that C1, C2 are not identical in any of the 20 bands: (1 – 0.328)^20 ≈ 0.00035. We miss only about 1/3000 of the 80%-similar column pairs! Very few false negatives. Junghoo "John" Cho (UCLA Computer Science)

False Positive: 40% Similar Probability that C1, C2 are identical in one band: 0.4^5 ≈ 0.01. Probability that C1, C2 are identical in at least one of the 20 bands: 1 – (1 – 0.01)^20 ≈ 0.18. Only about 20% of dissimilar pairs are identified as candidate pairs, and false positives are much lower when similarities are well below 40%. Junghoo "John" Cho (UCLA Computer Science)
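Both calculations follow the same formula; a tiny sketch that reproduces them for any signature similarity s, with b bands of r rows each:

```python
def candidate_probability(s, bands=20, rows=5):
    """Probability that two columns with signature similarity s hash together in at least one band."""
    p_band = s ** rows                 # all r rows of one band agree
    return 1 - (1 - p_band) ** bands   # at least one of the b bands agrees

print(1 - candidate_probability(0.8))  # false-negative rate for 80%-similar pairs ≈ 0.00035
print(candidate_probability(0.4))      # false-positive rate for 40%-similar pairs ≈ 0.18
```

Tuning r and l (the number of rows and bands) moves the steep part of this S-shaped curve, which is exactly the knob the next slide refers to.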

Junghoo "John" Cho (UCLA Computer Science) LSH Summary Similar signature column pair identification algorithm Split the signature matrix into l bands of r rows each Identify almost all similar pairs and a small number of unsimilar pairs By adjusting r and l Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Hamming LSH Life is simpler if the matrix has about 50% 1’s We can take a random collection of rows Let us make the matrix denser! How? Construct a series of matrices by OR-ing together pairs of rows 0 disappears over time… Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Example OR 1 1 1 1 More 1’s Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Hamming LSH Construct all matrices No more than log n matrices for n rows Total number of rows in all matrices is 2n Twice as much work as the original matrix Identify similar columns from each matrix From each matrix, apply LHS to the columns with density between 30% -- 70% 1’s Report similar columns Note that similar columns have similar densities, so they will be considered together in at least one matrix No point ever comparing columns whose number of 1’s are very different Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Summary Apriori, Min-Hash, LSH, Hamming LSH Finding frequent pairs? Apriori Finding similar pairs? Min-Hash+LSH or Hamming LSH Min-Hash: Sparse matrix compression LSH: Similar signature identification Hamming LSH: Amplification of 1 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Questions Can we extend the techniques to multiple column rules C1, C2  C3? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) Any Questions? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) AprioriTid (1) Q: What was the main idea? A: Some transactions may not need to be checked Candidate itemsets: {A, B}, {A, C} Transaction: {A, D, E, F}? We may eliminate many transactions Q: How do we know {A, B, E, F} is not necessary? A: When we check {A, B} and {A, C} we can tell that {A, B, E, F} does not have any candidate sets Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) AprioriTid (2) In each pass, Substitute each transaction with a set of candidate itemsets Candidate set: {A, B, C}, {A, C, D}, {A, C, M} Transaction T1: {A, B, C, D, F, G}  T1: {{A, B, C}, {A, C, D}} Candidate itemset {A, C, D} appears in T1 if {A, C} and {A, D} appears in T1 Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) AprioriTid (3) Q: Advantage? A: Many transactions/items may be eliminated Especially in later passes Q: Disadvantage? A: A transaction may be blown up T1: {A, B, C, D}  T1: {{A, B, C}, {A, B, D}} Why not just eliminate “infrequent items”? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) AprioriHybrid In earlier passes, use Apriori In later passes, use AprioriTid Switching criteria Does the generated set of transactions fit in main memory? Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science) History of the paper Earlier SIGMOD93 paper (AIS Algorithm) Very difficult to read. Poor organization Did not use the “obvious” pruning criteria Very naïve and simple heuristics Techniques in the paper may not be very important Much more efficient algorithms proposed next year Even great research starts with small ideas As you can see from the history Learn how a “simple” idea can change things… Junghoo "John" Cho (UCLA Computer Science)