Slide 1: Frequent itemset mining and temporal extensions

Sunita Sarawagi
sunita@it.iitb.ac.in
http://www.it.iitb.ac.in/~sunita
Slide 2: Association rules

• Given several sets of items, for example:
  – the set of items purchased
  – the set of pages visited on a website
  – the set of doctors visited
• Find all rules that correlate the presence of one set of items with another
  – Rules are of the form X => Y, where X and Y are sets of items
  – E.g.: purchase of books A & B => purchase of C
Slide 3: Parameters: support and confidence

• Every rule X => Z has two parameters:
  – Support: the probability that a transaction contains both X and Z
  – Confidence: the conditional probability that a transaction containing X also contains Z
• Two thresholds for association rule mining:
  – Minimum support s
  – Minimum confidence c
• Example with s = 50% and c = 50%:
  – A => C (support 50%, confidence 66.6%)
  – C => A (support 50%, confidence 100%)
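The numbers on this slide can be reproduced with a few lines of Python. This is a minimal sketch; the four toy transactions below are an assumption chosen so that A => C and C => A come out at the quoted support and confidence values.

    # Minimal sketch: support and confidence over a toy set of baskets.
    transactions = [
        {"A", "B", "C"},
        {"A", "C"},
        {"A", "D"},
        {"B", "E", "F"},
    ]

    def support(itemset):
        # Fraction of transactions that contain every item in `itemset`.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        # Conditional probability that a basket with `lhs` also contains `rhs`.
        return support(lhs | rhs) / support(lhs)

    print(support({"A", "C"}))       # 0.5   -> both rules have 50% support
    print(confidence({"A"}, {"C"}))  # 0.666 -> A => C has 66.6% confidence
    print(confidence({"C"}, {"A"}))  # 1.0   -> C => A has 100% confidence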
Slide 4: Applications of fast itemset counting

• Cross-selling in retail and banking
• Catalog design and store layout
• Applications in medicine: finding redundant tests
• Improving the predictive capability of classifiers that assume attribute independence
• Improved clustering of categorical attributes
Slide 5: Finding association rules in large databases

• Number of transactions: in the millions
• Number of distinct items: tens of thousands
• Lots of work on scalable algorithms
• Typically two parts to the algorithm:
  1. Finding all frequent itemsets with support > s
  2. Finding rules with confidence greater than c
• The frequent itemset search is the more expensive part
  – Apriori algorithm, FP-tree algorithm
Slide 6: The Apriori algorithm

    L1 = {frequent itemsets of size one}
    for (k = 1; Lk != ∅; k++)
        Ck+1 = candidates generated from Lk by:
            joining Lk with itself
            pruning any (k+1)-itemset that has a k-subset not in Lk
        for each transaction t in the database
            increment the count of every candidate in Ck+1 contained in t
        Lk+1 = candidates in Ck+1 with count >= min_support
    return ∪k Lk
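A compact Python rendering of this level-wise loop is sketched below. It assumes transactions are given as sets of items and that a generate_candidates helper implements the join-and-prune step (a sketch of that helper follows the next slide); this is an illustrative skeleton, not an optimized implementation.

    from collections import defaultdict

    def apriori(transactions, min_count):
        # L1: frequent 1-itemsets, counted in one pass over the database.
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[frozenset([item])] += 1
        L = {s for s, c in counts.items() if c >= min_count}
        frequent = set(L)

        k = 1
        while L:
            C = generate_candidates(L, k)      # candidate (k+1)-itemsets
            counts = defaultdict(int)
            for t in transactions:             # one database pass per level
                for c in C:
                    if c <= t:                 # candidate contained in t
                        counts[c] += 1
            L = {c for c in C if counts[c] >= min_count}
            frequent |= L
            k += 1
        return frequent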
Slide 7: How to generate candidates?

• Suppose the items in Lk-1 are listed in some fixed order
• Step 1: self-join Lk-1

      insert into Ck
      select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
      from Lk-1 p, Lk-1 q
      where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

• Step 2: pruning

      forall itemsets c in Ck do
          forall (k-1)-subsets s of c do
              if (s is not in Lk-1) then delete c from Ck
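The join and prune steps can be sketched as follows. For simplicity this uses a union-based join over frozensets rather than the sorted-prefix SQL join shown above; once the prune step is applied, the resulting candidate set is the same.

    from itertools import combinations

    def generate_candidates(Lk, k):
        # Join: combine two frequent k-itemsets that overlap in k-1 items.
        candidates = set()
        for p in Lk:
            for q in Lk:
                u = p | q
                if len(u) == k + 1:
                    candidates.add(u)
        # Prune: keep a candidate only if all of its k-subsets are frequent.
        return {c for c in candidates
                if all(frozenset(s) in Lk for s in combinations(c, k))}

With the toy transactions from the earlier support/confidence sketch, apriori(transactions, 2) returns the frequent itemsets {A}, {B}, {C} and {A, C} (as frozensets).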
Slide 8: The Apriori algorithm (example)

[Figure: a worked example on a small database D, alternating database scans with candidate generation: C1 -> L1 -> C2 -> L2 -> C3 -> L3.]
Slide 9: Improvements to Apriori

• Apriori with well-designed data structures works well in practice when frequent itemsets are not too long (the common case)
• Many enhancements have been proposed:
  – Sampling: count in two passes
  – Invert the database to be column-major instead of row-major and count by intersection (see the sketch after this list)
  – Count itemsets of multiple lengths in one pass
• Reducing the number of passes is of limited use, since I/O is not the bottleneck
• Main bottleneck: candidate generation and counting are not optimized for long itemsets
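The "column-major" point corresponds to a vertical layout: store, for each item, the set of transaction IDs (TIDs) that contain it, and count an itemset by intersecting TID sets. A minimal sketch with made-up transactions:

    # Vertical (column-major) layout: item -> set of transaction IDs.
    transactions = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]

    tidlists = {}
    for tid, basket in enumerate(transactions):
        for item in basket:
            tidlists.setdefault(item, set()).add(tid)

    def support_count(itemset):
        # Support of an itemset = size of the intersection of its TID sets.
        return len(set.intersection(*(tidlists[i] for i in itemset)))

    print(support_count({"A", "B"}))   # 2 (transactions 0 and 3)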
Slide 10: Mining frequent patterns without candidate generation

• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  – Highly condensed, but complete for frequent pattern mining
• Develop an efficient, FP-tree-based frequent pattern mining method
  – A divide-and-conquer methodology: decompose mining tasks into smaller ones
  – Avoid candidate generation
Slide 11: Construct FP-tree from the database

min_support = 0.5 (i.e., a count of at least 3 of the 5 transactions)

TID   Items bought                 (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o}              {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

• Scan the DB once and find the frequent 1-itemsets
• Order the frequent items in each transaction by decreasing frequency
• Scan the DB again and construct the FP-tree

[Figure: the resulting FP-tree, rooted at {}, with a main branch f:4 -> c:3 -> a:3 -> m:2 -> p:2 and smaller branches for b and for the c:1 -> b:1 -> p:1 path.]
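The two scans described on this slide can be sketched in Python as follows. The node and header-table structures are assumptions made for illustration; a real FP-tree implementation keeps the same information but is more carefully engineered.

    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}     # item -> child FPNode
            self.next = None       # node-link to the next node with this item

    def build_fptree(transactions, min_count):
        # Scan 1: count items and keep only the frequent ones.
        freq = defaultdict(int)
        for t in transactions:
            for item in t:
                freq[item] += 1
        freq = {i: c for i, c in freq.items() if c >= min_count}
        order = sorted(freq, key=lambda i: (-freq[i], i))  # decreasing frequency

        root, header = FPNode(None, None), {}
        # Scan 2: insert each transaction's frequent items in the global order.
        for t in transactions:
            node = root
            for item in (i for i in order if i in t):
                child = node.children.get(item)
                if child is None:
                    child = FPNode(item, node)
                    node.children[item] = child
                    child.next, header[item] = header.get(item), child
                child.count += 1
                node = child
        return root, header, order

On the slide's five transactions with a minimum count of 3, the surviving items are f, c, a, b, m and p with the counts shown in the header table; f and c tie at 4, so their relative order depends on the tie-break.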
Slide 12: Step 1: FP-tree to conditional pattern base

• Start at the header table of the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item
• Accumulate all of the transformed prefix paths of that item to form its conditional pattern base

Item   Conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
Slide 13: Step 2: Construct the conditional FP-tree

• For each conditional pattern base:
  – Accumulate the count for each item in the base
  – Construct the FP-tree over the frequent items of the pattern base
• Example: the m-conditional pattern base is fca:2, fcab:1, which gives the m-conditional FP-tree {} -> f:3 -> c:3 -> a:3
• All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
Slide 14: Mining frequent patterns by creating conditional pattern bases

Item   Conditional pattern base        Conditional FP-tree
p      {(fcam:2), (cb:1)}              {(c:3)} | p
m      {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)} | m
b      {(fca:1), (f:1), (c:1)}         Empty
a      {(fc:3)}                        {(f:3, c:3)} | a
c      {(f:3)}                         {(f:3)} | c
f      Empty                           Empty

Repeat this recursively for higher items... (a Python sketch of this recursion follows)
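The recursion summarized in this table can be sketched on top of the build_fptree code above. This is an unoptimized illustration (conditional pattern bases are expanded into explicit transaction lists), not the algorithm as engineered in the FP-growth paper.

    def fp_growth(root, header, order, min_count, suffix=frozenset()):
        patterns = {}
        # Work upward from the least frequent item in the header table.
        for item in reversed(order):
            new_suffix = suffix | {item}
            support, cond_base = 0, []
            node = header.get(item)
            while node is not None:            # follow the node-links
                support += node.count
                path, p = [], node.parent
                while p is not None and p.item is not None:
                    path.append(p.item)
                    p = p.parent
                if path:
                    cond_base.append((path, node.count))
                node = node.next
            patterns[new_suffix] = support
            # Build the conditional FP-tree and mine it recursively.
            cond_txns = [set(path) for path, c in cond_base for _ in range(c)]
            if cond_txns:
                croot, cheader, corder = build_fptree(cond_txns, min_count)
                patterns.update(
                    fp_growth(croot, cheader, corder, min_count, new_suffix))
        return patterns

Mining the m-conditional tree this way yields m, fm, cm, am, fcm, fam, cam and fcam, matching the previous slide.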
Slide 15: FP-growth vs. Apriori: scalability with the support threshold

[Figure: run time of FP-growth and Apriori as the support threshold varies, on data set T25I20D10K.]
Slide 16: Criticism of support and confidence

• X and Y may be positively correlated while X and Z are negatively related, yet the support and confidence of X => Z dominate
• Need to measure the departure from the support expected under independence
• For two items, use the ratio Pr[X & Z] / (Pr[X] Pr[Z]) of observed joint support to the support expected under independence
• For k items, the expected support is derived from the supports of the (k-1)-itemsets using iterative scaling methods
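A tiny numeric illustration of the two-item ratio; the support values below are made up, and the function name interest is just a label for this sketch. A rule can have respectable support and confidence while the ratio shows X and Z to be negatively related.

    def interest(sup_xz, sup_x, sup_z):
        # Ratio of observed joint support to the support expected under
        # independence: 1 means independent, below 1 negatively related.
        return sup_xz / (sup_x * sup_z)

    sup_x, sup_z, sup_xz = 0.60, 0.75, 0.40   # assumed single-item and joint supports
    print(sup_xz)                             # support of X => Z: 40%
    print(sup_xz / sup_x)                     # confidence of X => Z: ~67%
    print(interest(sup_xz, sup_x, sup_z))     # ~0.89 < 1: X and Z negatively related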
Slide 17: Prevalent correlations are not interesting

• Analysts already know about prevalent rules
• Interesting rules are those that deviate from prior expectation
• Mining's payoff is in finding surprising phenomena

[Cartoon: in 1995, "bedsheets and pillow covers sell together!" is exciting news; by 1998 the same finding only gets a "Zzzz...".]
Slide 18: What makes a rule surprising?

• It does not match prior expectation
  – E.g., the correlation between milk and cereal remains roughly constant over time
• It cannot be trivially derived from simpler rules
  – Milk 10%, cereal 10%
  – Milk and cereal 10% ... surprising (independence would predict 10% x 10% = 1%)
  – Eggs 10%
  – Milk, cereal and eggs 0.1% ... surprising!
  – Expected value: 1% (support of milk-and-cereal times support of eggs)
Slide 19: Finding surprising temporal patterns

• Algorithms to mine for surprising patterns
  – Encode itemsets into bit streams using two models:
      Mopt: the optimal model, which allows change along time
      Mcons: the constrained model, which does not allow change along time
  – Surprise = difference in the number of bits under Mopt and Mcons
Slide 20: One item: the optimal model

• Milk-buying habits are modeled by a biased coin
• The customer tosses this coin to decide whether to buy milk
  – Head, or "1", denotes "basket contains milk"
  – The coin bias is Pr[milk]
• The analyst wants to study Pr[milk] along time
  – A single coin with fixed bias is not interesting
  – Changes in the bias are interesting
Slide 21: The coin segmentation problem

• Two players, A and B
• A has a set of coins with different biases
• A repeatedly:
  – picks an arbitrary coin
  – tosses it an arbitrary number of times
• B observes the head/tail sequence and guesses the transition points and biases

[Figure: A picks, tosses, and returns coins; B watches the resulting sequence.]
Slide 22: How to explain the data

• Given n head/tail observations:
  – One extreme: assume n different coins, each with bias 0 or 1
      The data fits perfectly (with probability one), but many coins are needed
  – The other extreme: assume a single coin
      It may fit the data poorly
• The "best explanation" is a compromise between the two

[Figure: the toss sequence split into three segments with biases 1/4, 5/7 and 1/3.]
Slide 23: Coding examples

• A sequence of k zeroes:
  – Naive encoding takes k bits
  – Run-length encoding takes about log k bits
• 1000 bits with 10 randomly placed 1's, the rest 0's:
  – Posit a coin with bias 0.01
  – By Shannon's theorem, the data encoding cost is

        -10 log2(0.01) - 990 log2(0.99), roughly 81 bits
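A quick check of the cost quoted above, using the standard Shannon code length of -log2 Pr[outcome] bits per toss:

    import math

    def data_cost_bits(heads, tails, p):
        # Total cost of coding `heads` 1's and `tails` 0's with a coin of bias p.
        return -heads * math.log2(p) - tails * math.log2(1 - p)

    print(data_cost_bits(10, 990, 0.01))   # ~80.9 bits, versus 1000 bits naively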
Slide 24: How to find optimal segments

• Build a graph over the toss sequence: a sequence of 17 tosses gives a derived graph with 18 nodes, one per segment boundary
• Each edge corresponds to one candidate segment:
  – Edge cost = model cost + data cost
  – Model cost = one node ID + one Pr[head] value
  – Data cost for a segment with Pr[head] = 5/7 (5 heads, 2 tails) is the Shannon cost of coding those 7 tosses
• The optimal segmentation is a shortest path in this graph
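A small Python sketch of the shortest-path formulation: node i sits between positions i and i+1 of the toss sequence, an edge (i, j) encodes tosses i..j-1 as one segment with its own empirical bias, and the cheapest segmentation is a shortest path from node 0 to node n. The flat per-segment model cost used here is a simplified stand-in for the model cost described on the slide.

    import math

    def best_segmentation(tosses, model_cost_bits):
        n = len(tosses)

        def edge_cost(i, j):
            seg = tosses[i:j]
            heads = sum(seg)
            tails = len(seg) - heads
            p = heads / len(seg)
            data = 0.0
            if 0 < p < 1:                      # pure segments cost 0 data bits
                data = -heads * math.log2(p) - tails * math.log2(1 - p)
            return model_cost_bits + data

        # Shortest path on the DAG: best[j] = cheapest encoding of tosses[:j].
        best = [0.0] + [math.inf] * n
        back = [0] * (n + 1)
        for j in range(1, n + 1):
            for i in range(j):
                c = best[i] + edge_cost(i, j)
                if c < best[j]:
                    best[j], back[j] = c, i
        segments, j = [], n
        while j > 0:
            segments.append((back[j], j))
            j = back[j]
        return best[n], segments[::-1]

    # A bias shift halfway through is recovered as two segments.
    print(best_segmentation([0] * 20 + [1] * 20, model_cost_bits=8.0))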
Slide 25: Two or more items

• "Unconstrained" segmentation:
  – k items induce a 2^k-sided coin
  – "milk and cereal" = 11, "milk, not cereal" = 10, "neither" = 00, etc.
• The shortest path finds a significant shift in any of the coin-face probabilities
• Problem: some of these shifts may be completely explained by the marginals
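The 2^k-sided coin amounts to mapping each basket to a k-bit face. A two-item illustration (the item names are assumptions):

    def face(basket, items=("milk", "cereal")):
        # Map a basket to one face of the 2**k-sided coin, one bit per item.
        return "".join("1" if item in basket else "0" for item in items)

    print(face({"milk", "cereal"}))   # "11"
    print(face({"milk", "bread"}))    # "10"  (milk, not cereal)
    print(face({"eggs"}))             # "00"  (neither)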
Slide 26: Example

• A drop in the joint sale of milk and cereal may be completely explained by a drop in the sale of milk
• Pr[milk & cereal] / (Pr[milk] Pr[cereal]) remains constant over time
• Call this ratio ρ
Slide 27: Constant-ρ segmentation

• Compute a global ρ over all time
• All coins must share this common value of ρ (ρ = observed support / support expected under independence)
• Segment as before
• Compare with the unconstrained coding cost
Slide 28: Is all this really needed?

• Simpler alternative (sketched below):
  – Aggregate the data into suitable time windows
  – Compute support, correlation, ρ, etc. in each window
  – Use a variance threshold to choose itemsets
• Pitfalls:
  – Choices to make: windows, thresholds
  – May miss fine detail
  – Over-sensitive to outliers
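The windowed alternative can be sketched as below; the window size and variance threshold are arbitrary assumptions, which is exactly the kind of tuning the pitfalls refer to.

    import statistics

    def windowed_supports(baskets, itemset, window):
        # `baskets` is the time-ordered list of transactions (sets of items).
        sups = []
        for start in range(0, len(baskets), window):
            w = baskets[start:start + window]
            sups.append(sum(itemset <= b for b in w) / len(w))
        return sups

    def looks_surprising(baskets, itemset, window=1000, var_threshold=1e-4):
        # Flag itemsets whose per-window support varies beyond the threshold.
        sups = windowed_supports(baskets, itemset, window)
        return len(sups) > 1 and statistics.variance(sups) > var_threshold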
Slide 29: Experiments

• Millions of baskets over several years
• Two algorithms:
  – The complete MDL approach
  – MDL segmentation + statistical tests (MStat)
• Data set:
  – 2.8 million transactions
  – 7 years, 1987 to 1993
  – 15,800 items
  – 2.62 items per basket on average
Slide 30: Little agreement in itemset ranks

• Simpler methods do not approximate MDL
Slide 31: MDL has high selectivity

• The scores of the best itemsets stand out from the rest under MDL
Slide 32: Three anecdotes

Each anecdote shows ρ plotted against time.

• High MStat score:
  – Small marginals
  – Polo shirts & shorts
• High correlation:
  – Small percentage variation
  – Bedsheets & pillow cases
• High MDL score:
  – Significant gradual drift
  – Men's & women's shorts
Slide 33: Conclusion

• A new notion of surprising patterns based on:
  – The joint support expected from the marginals
  – The variation of joint support along time
• A robust MDL formulation
• Efficient algorithms:
  – Near-optimal segmentation using shortest paths
  – Pruning criteria
• Successful application to real data
Slide 34: References

• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB '94, 487-499, Santiago, Chile, 1994.
• S. Chakrabarti, S. Sarawagi and B. Dom. Mining surprising patterns using temporal description length. Proc. of the 24th Int'l Conference on Very Large Databases (VLDB), 1998.
• J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD '00, 1-12, Dallas, TX, May 2000.
• Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. (Some of the slides in this talk are taken from this book.)
• H. Toivonen. Sampling large databases for association rules. VLDB '96, 134-145, Bombay, India, September 1996.