4/3/01 — CS632 Data Mining
Data Mining
Presented by: Kevin Seng
Papers
Rakesh Agrawal and Ramakrishnan Srikant:
- Fast Algorithms for Mining Association Rules
- Mining Sequential Patterns
Outline
For each paper:
- Present the problem.
- Describe the algorithms (intuition and design).
- Performance.
Market Basket Introduction
- Retailers are able to collect massive amounts of sales data ("basket data"):
  - Bar-code technology
  - E-commerce
- Sales data generally includes customer id, transaction date, and items bought.
Market Basket Problem
- It would be useful to find association rules between transactions, e.g. "75% of the people who buy spaghetti also buy tomato sauce."
- Given a set of basket data, how can we efficiently find the set of association rules?
Formal Definition (1)
- L = {i_1, i_2, …, i_m} is the set of items.
- Database D is a set of transactions.
- A transaction T is a set of items such that T ⊆ L.
- A unique identifier, TID, is associated with each transaction.
Formal Definition (2)
- T contains X, a set of some items in L, if X ⊆ T.
- Association rule X ⇒ Y, where X ⊂ T, Y ⊂ T, and X ∩ Y = ∅.
- Confidence: % of transactions which contain X that also contain Y.
- Support: % of transactions in D which contain X ∪ Y.
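The two measures can be checked on a toy dataset. A minimal sketch, assuming the definitions above; the transactions below are invented for illustration:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """support(X u Y) / support(X): of the transactions containing X,
    the fraction that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

# Made-up basket data (not from the papers).
transactions = [
    {"spaghetti", "tomato sauce", "bread"},
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "cheese"},
    {"bread", "butter"},
]

# 3 of 4 transactions contain spaghetti; 2 of those 3 also contain sauce.
print(support({"spaghetti"}, transactions))                       # 0.75
print(confidence({"spaghetti"}, {"tomato sauce"}, transactions))  # ~0.667
```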
Formal Definition (3)
Given a set of transactions D, we want to generate all association rules that have support and confidence greater than the user-specified minimum support (minsup) and minimum confidence (minconf).
Problem Decomposition
Two sub-problems:
- Find all itemsets that have transaction support above minsup. These itemsets are called large itemsets.
- From all the large itemsets, generate the set of association rules that have confidence above minconf.
Second Sub-problem
Straightforward approach: for every large itemset l, find all non-empty proper subsets of l. For every such subset a, output a rule of the form a ⇒ (l − a) if the ratio of support(l) to support(a) is at least minconf.
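The straightforward approach above can be sketched directly. Assumptions: `large_support` maps each large itemset (and, as Apriori guarantees, every subset of it) to its support; the example values are invented:

```python
from itertools import combinations

def gen_rules(large_support, minconf):
    """For every large itemset l and non-empty proper subset a,
    emit a => (l - a) when support(l) / support(a) >= minconf."""
    rules = []
    for l, sup_l in large_support.items():
        if len(l) < 2:
            continue  # no non-empty proper subsets to split on
        for r in range(1, len(l)):
            for a in map(frozenset, combinations(l, r)):
                conf = sup_l / large_support[a]
                if conf >= minconf:
                    rules.append((a, l - a, conf))
    return rules

# Illustrative supports (not from the papers).
supports = {
    frozenset("A"): 0.6,
    frozenset("B"): 0.4,
    frozenset("AB"): 0.3,
}
print(gen_rules(supports, minconf=0.5))
```

Both {A} ⇒ {B} (confidence 0.5) and {B} ⇒ {A} (confidence 0.75) clear the threshold here.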
Discovering Large Itemsets
Done with multiple passes over the data:
- First pass: find all individual items that are large (have minimum support).
- Each subsequent pass, using the large itemsets found in the previous pass:
  - Generate candidate itemsets.
  - Count support for each candidate itemset.
  - Eliminate itemsets that do not have minimum support.
Algorithm
L_1 = {large 1-itemsets};
for (k = 2; L_{k-1} ≠ ∅; k++) do begin
    C_k = apriori-gen(L_{k-1});        // new candidates
    forall transactions t ∈ D do       // counting support
        C_t = subset(C_k, t);          // candidates contained in t
        forall candidates c ∈ C_t do
            c.count++;
    end
    L_k = {c ∈ C_k | c.count ≥ minsup};
end
Answer = ∪_k L_k;
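The loop above can be rendered as a minimal Python sketch. This is not the paper's implementation: candidate generation here is a deliberately naive stand-in for apriori-gen, and support is counted by direct subset tests instead of a hash-tree; `minsup` is an absolute count:

```python
from itertools import combinations

def naive_candidates(prev_large, k):
    # Naive stand-in for apriori-gen: union pairs of (k-1)-itemsets,
    # keep size-k results all of whose (k-1)-subsets are large.
    cands = set()
    for a in prev_large:
        for b in prev_large:
            u = a | b
            if len(u) == k and all(frozenset(s) in prev_large
                                   for s in combinations(u, k - 1)):
                cands.add(u)
    return cands

def apriori(transactions, minsup_count):
    # First pass: large 1-itemsets.
    items = {i for t in transactions for i in t}
    large = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= minsup_count}
    answer = set(large)
    k = 2
    while large:
        candidates = naive_candidates(large, k)
        # Count support by direct subset tests (the papers use a hash-tree).
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        large = {c for c, n in counts.items() if n >= minsup_count}
        answer |= large
        k += 1
    return answer

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(apriori(transactions, minsup_count=2))
```

On this toy data every single item and every pair is large, but {a, b, c} appears only once and is eliminated.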
Candidate Generation
The AIS and SETM algorithms use the transactions in the database to generate new candidates. But this generates a lot of candidates which we know beforehand are not large!
Apriori Algorithms
Generate candidates using only the large itemsets found in the previous pass, without considering the database.
Intuition: any subset of a large itemset must be large.
Apriori Candidate Generation
Takes in L_{k-1} and returns C_k. Two steps:
- Join L_{k-1} with L_{k-1}.
- Prune out all itemsets in the joined result which contain a (k−1)-subset not found in L_{k-1}.
Candidate Generation (Join)
insert into C_k
select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Candidate Generation (Example)
L_3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
Join → C_4 = { {1 2 3 4}, {1 3 4 5} }
Prune → C_4 = { {1 2 3 4} } ({1 3 4 5} is dropped because its subset {1 4 5} is not in L_3)
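The join and prune steps can be sketched together; run on the L_3 above, this reproduces the example (a sketch, assuming itemsets are kept as sorted tuples):

```python
from itertools import combinations

def apriori_gen(prev_large, k):
    """Join: merge two (k-1)-itemsets that agree on their first k-2 items
    (in sorted order). Prune: drop any candidate with a (k-1)-subset
    that is not large."""
    prev = {tuple(sorted(s)) for s in prev_large}
    joined = set()
    for p in prev:
        for q in prev:
            # p.item_{k-1} < q.item_{k-1} avoids duplicate joins.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    return {c for c in joined
            if all(s in prev for s in combinations(c, k - 1))}

L3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
print(apriori_gen(L3, k=4))   # {(1, 2, 3, 4)} — {1 3 4 5} is pruned
```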
Counting Support
- Need to count the number of transactions which support a given itemset.
- For efficiency, use a hash-tree (the subset function).
Subset Function (Hash-tree)
Candidate itemsets are stored in a hash-tree:
- Leaf node: contains a list of itemsets.
- Interior node: contains a hash table; each bucket of the hash table points to another node.
- The root is at depth 1; interior nodes at depth d point to nodes at depth d+1.
Hash-tree Example (1)
[Figure: the candidates C_3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} } stored in a hash-tree.]
Using the Hash-tree
- If we are at a leaf: find all itemsets contained in the transaction.
- If we are at an interior node: hash on each remaining element of the transaction.
- At the root node: hash on every element of the transaction.
Hash-tree Example (2)
[Figure: counting the transactions D = { {1 2 3 4}, {2 3 5} } against the hash-tree from Example (1).]
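A compact way to see the subset function at work is a trie — effectively a hash-tree in which every hash bucket holds exactly one item. This is illustrative code, not the paper's exact data structure; run on the candidates and transactions from the two example slides, it bumps every contained candidate's count in one traversal per transaction:

```python
def build_trie(candidates):
    """Store each candidate (a sorted tuple of items) as a root-to-leaf path."""
    root = {}
    for cand in candidates:
        node = root
        for item in cand:
            node = node.setdefault(item, {})
        node["count"] = [0]               # leaf marker for this candidate
    return root

def count_subsets(node, t, start=0):
    """Hash on each remaining element of the (sorted) transaction t."""
    if "count" in node:                   # a candidate ends at this node
        node["count"][0] += 1
    for i in range(start, len(t)):
        child = node.get(t[i])
        if child is not None:
            count_subsets(child, t, i + 1)

def get_count(root, cand):
    node = root
    for item in cand:
        node = node[item]
    return node["count"][0]

C3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
trie = build_trie(C3)
for t in [(1, 2, 3, 4), (2, 3, 5)]:       # the transactions D from the slide
    count_subsets(trie, t)
print([get_count(trie, c) for c in C3])   # [1, 1, 1, 0, 1]
```

{1 3 5} is supported by neither transaction; every other candidate is contained in {1 2 3 4} or {2 3 4 …} exactly once.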
AprioriTid (1)
- Does not use the transactions in the database for counting itemset support.
- Instead stores the transactions as sets of possible large itemsets, C̄_k.
- Each member of C̄_k is of the form <TID, {X_k}>, where each X_k is a potentially large k-itemset present in the transaction TID.
AprioriTid (2)
Advantage of C̄_k:
- If a transaction does not contain any candidate k-itemset, then it has no entry in C̄_k.
- The number of entries in C̄_k may be less than the number of transactions in D, especially for large k.
- Speeds up counting!
AprioriTid (3)
However, for small k each entry in C̄_k may be larger than its corresponding transaction. The usual space vs. time trade-off.
AprioriTid (4)
Example. [Figure omitted.]
Observation
- When C̄_k does not fit in main memory we see a large jump in execution time.
- AprioriTid beats Apriori only when C̄_k fits in main memory.
AprioriHybrid
It is not necessary to use the same algorithm for all the passes — combine the two algorithms!
- Start with Apriori.
- When C̄_k can fit in main memory, switch to AprioriTid.
Performance (1)
Measured performance by running the algorithms on generated synthetic data. [Table of data-generation parameters omitted.]
Performance (2)
[Chart omitted.]
Performance (3)
[Chart omitted.]
Mining Sequential Patterns (1)
Sequential patterns are ordered lists of itemsets. Market basket examples:
- Customers typically rent "Star Wars", then "The Empire Strikes Back", then "Return of the Jedi".
- "Fitted sheets and pillow cases", then "comforter", then "drapes and ruffles".
Mining Sequential Patterns (2)
- Looks at sequences of transactions, as opposed to a single transaction.
- Groups transactions based on customer ID: the customer sequence.
Formal Definition (1)
Given a database D of customer transactions, each transaction consists of: customer id, transaction-time, and the items purchased. No customer has more than one transaction with the same transaction-time.
Formal Definition (2)
- Itemset i = (i_1 i_2 … i_m), where each i_j is an item.
- Sequence s = ⟨s_1 s_2 … s_n⟩, where each s_j is an itemset.
- A sequence ⟨a_1 a_2 … a_n⟩ is contained in ⟨b_1 b_2 … b_m⟩ if there exist integers i_1 < i_2 < … < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, …, a_n ⊆ b_{i_n}.
- A sequence s is maximal if it is not contained in any other sequence.
Formal Definition (3)
- A customer supports a sequence s if s is contained in the customer sequence for this customer.
- Support of a sequence: % of customers who support the sequence. (For mining association rules, support was a % of transactions.)
Formal Definition (4)
Given a database D of customer transactions, find the maximal sequences among all sequences that have a certain user-specified minimum support. Sequences with support above minsup are large sequences.
Algorithm: Sort Phase
Sort the database with customer ID as the major key and transaction-time as the minor key. This converts the original transaction database into a database of customer sequences.
Algorithm: Litemset Phase (1)
Find all large itemsets. Why? Because each itemset in a large sequence has to be a large itemset.
Algorithm: Litemset Phase (2)
To get all large itemsets we can use the Apriori algorithms discussed earlier, but support counting must be modified: for sequential patterns, support is measured as a fraction of customers.
Algorithm: Litemset Phase (3)
The set of large itemsets is then mapped to a set of contiguous integers (one per itemset), so that two large itemsets can be compared in constant time.
Algorithm: Transformation (1)
- Need to repeatedly determine which of a given set of large sequences are contained in a customer sequence.
- So represent each transaction as the set of large itemsets it contains; a customer sequence then becomes a list of sets of itemsets.
Algorithm: Transformation (2)
[Figure: example of the transformed database.]
Algorithm: Sequence Phase (1)
Use the set of large itemsets to find the desired sequences. Similar structure to the Apriori algorithms used to find large itemsets:
- Use a seed set to generate candidate sequences.
- Count support for each candidate.
- Eliminate candidate sequences which are not large.
Algorithm: Sequence Phase (2)
Two types of algorithms:
- Count-all: counts all large sequences, including non-maximal ones (AprioriAll).
- Count-some: tries to avoid counting non-maximal sequences by counting longer sequences first (AprioriSome, DynamicSome).
Algorithm: Maximal Phase (1)
Find the maximal sequences among S, the set of all large sequences.
Algorithm: Maximal Phase (2)
- Use a hash-tree to find all subsequences of s_k in S.
- Similar to the subset function used in finding large itemsets; S is stored in the hash-tree.
AprioriAll (1)
[Figure omitted.]
AprioriAll (2)
- A hash-tree is used for counting.
- Candidate generation is similar to candidate generation for large itemsets, except that order matters, so we don't have the condition p.item_{k-1} < q.item_{k-1}.
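The sequence version of the join can be sketched as follows. Assumptions: after the transformation phase, each sequence is a tuple of mapped itemset ids, and dropping the "p < q" condition means every ordered pair with a common prefix produces a candidate:

```python
from itertools import combinations

def seq_gen(prev_large, k):
    """Join large (k-1)-sequences p, q that agree on all but their last
    element. Order matters, so both p+(q_last,) and q+(p_last,) arise —
    there is no p.item_{k-1} < q.item_{k-1} restriction. Then prune
    candidates with a non-large (k-1)-subsequence."""
    prev = {tuple(s) for s in prev_large}
    joined = {p + (q[-1],) for p in prev for q in prev
              if p[:-1] == q[:-1] and p != q}
    # combinations() on a tuple yields order-preserving subsequences.
    return {c for c in joined
            if all(s in prev for s in combinations(c, k - 1))}

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(sorted(seq_gen(L3, 4)))   # [(1, 2, 3, 4)]
```

The join also produces (1 2 4 3), (1 3 4 5), and (1 3 5 4), but each contains a 3-subsequence not in L_3, so pruning removes them.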
AprioriAll (3)
Example of candidate generation. [Figure omitted.]
AprioriAll (4)
Example. [Figure omitted.]
Count-some Algorithms
Try to avoid counting non-maximal sequences by counting longer sequences first. Two phases:
- Forward phase: find all large sequences of certain lengths.
- Backward phase: find all remaining large sequences.
AprioriSome (1)
Determines which lengths to count using a next() function, which takes as a parameter the length of the sequences counted in the last pass.
- next(k) = k + 1 is the same as AprioriAll.
- next() balances the trade-off between counting non-maximal sequences and counting extensions of small candidate sequences.
AprioriSome (2)
hit_k = |L_k| / |C_k|
Intuition: as hit_k increases, the time wasted by counting extensions of small candidates decreases.
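One way next() might exploit the hit ratio is a simple step function. The thresholds and step sizes below are invented for illustration — the paper defines its own concrete function:

```python
def next_length(k, hit):
    """Illustrative next() for AprioriSome: the higher the hit ratio
    hit = |L_k| / |C_k|, the further ahead we jump (thresholds are
    made up, not taken from the paper)."""
    if hit < 0.66:
        return k + 1   # many candidates failed: step cautiously
    if hit < 0.85:
        return k + 2
    return k + 3       # almost everything was large: jump ahead

print(next_length(3, hit=0.5))   # 4 — behaves like AprioriAll
print(next_length(3, hit=0.9))   # 6 — skips lengths 4 and 5
```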
AprioriSome (3)
[Figure omitted.]
AprioriSome (4)
Backward phase. For all lengths which we skipped:
- Delete the sequences in the candidate set which are contained in some large sequence.
- Count the remaining candidates and find all sequences with minimum support.
Also delete large sequences found in the forward phase which are non-maximal.
AprioriSome (5)
[Figure omitted.]
AprioriSome (6)
Forward-phase example, with next(k) = 2k and minsup = 2. [Figure showing the candidate 3-sequences C_3 omitted.]
AprioriSome (7)
Backward-phase example. [Figure showing the candidate 3-sequences C_3 omitted.]
Performance (1)
Used generated datasets again. [Table of data-generation parameters omitted.]
Performance (2)
- DynamicSome generates too many candidates.
- AprioriSome does a little better than AprioriAll, since it avoids counting many non-maximal sequences.
Performance (3)
The advantage of AprioriSome is reduced for two reasons:
- AprioriSome generates more candidates.
- Candidates remain memory-resident even if skipped over, so the heuristic cannot always be followed.
Wrap Up
Just presented two classic papers on data mining:
- Association rules
- Sequential patterns