
1 New York University EDBT’98
Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set
Dao-I Lin and Zvi M. Kedem
Department of Computer Science, Courant Institute of Mathematical Sciences, New York University

2 Overview
- The importance of the maximum frequent set
- Structural properties
- Traditional one-way search algorithms
- Pincer-Search algorithm
- Experiments on synthetic and census databases
- Conclusions

3 Setting
- Basic terms:
  - {1, 2, …, n}: the set of all items
  - Transaction: a set of items
  - Database: a set of transactions
  - User-defined threshold (supp_min): a number in [0,1]
  - Frequent itemset: a combination of items (an itemset) occurring in at least a supp_min fraction of the transactions in the database
- Maximum frequent set:
  - Maximal frequent itemset: a frequent itemset that has no frequent proper superset
  - An itemset is frequent if and only if it is a subset of a maximal frequent itemset
  - Maximum frequent set: the set of all maximal frequent itemsets
- Discovering the maximum frequent set is a key problem in many data mining applications
  - Association rules, strong rules, episodes, and minimal keys

4 An Example
- Database:
  Transaction  Itemset
  1            {1,2,3,5}
  2            {1,5}
  3            {1,2}
  4            {1,2,3}
- Set supp_min to 0.5
- Frequent itemsets are {1}, {2}, {3}, {5}, {1,2}, {1,3}, {1,5}, {2,3}, and {1,2,3}, since each occurs in at least 2 out of 4 transactions
- Maximum frequent set is {{1,2,3},{1,5}}
- [Figure: itemset lattice showing {1,2,3,4,5}; {1,2,3}; {1,2}, {1,3}, {2,3}, {1,5}; {1}, {2}, {3}, {4}, {5}]
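The example can be checked by brute force. The following is a minimal sketch, not part of the original slides: it enumerates every itemset over the items that appear in the database, keeps those meeting the threshold, and then keeps the frequent itemsets that have no frequent proper superset. The function names `frequent_itemsets` and `maximum_frequent_set` are hypothetical, and exhaustive enumeration is only practical for tiny examples like this one.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_supp):
    """Return {itemset: support} for every itemset with support >= min_supp."""
    db = [frozenset(t) for t in transactions]
    items = sorted(frozenset().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            # Support = fraction of transactions that contain the itemset.
            supp = sum(itemset <= t for t in db) / len(db)
            if supp >= min_supp:
                result[itemset] = supp
    return result


def maximum_frequent_set(frequent):
    """Keep only the frequent itemsets with no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


# Database from the example slide.
DB = [{1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3}]
FREQUENT = frequent_itemsets(DB, 0.5)
MFS = maximum_frequent_set(FREQUENT)
print(sorted(sorted(s) for s in FREQUENT))
print(sorted(sorted(s) for s in MFS))
```

Running it on the four transactions above should reproduce the nine frequent itemsets and the two maximal ones listed on the slide.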

5 An Example
- Database:
  Transaction  Itemset
  1            {1,2,3,4,5}
  2            {1,3}
  3            {1,2}
  4            {1,2,3,4}
- Set supp_min to 0.5
- Frequent itemsets are {1}, {2}, {3}, {4}, {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, and {1,2,3,4}, since each occurs in at least 2 out of 4 transactions
- Maximum frequent set is {{1,2,3,4}}
- [Figure: itemset lattice showing {1,2,3,4,5}; {1,2,3,4}; {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}; {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}; {1}, {2}, {3}, {4}, {5}. Legend: blue = frequent itemsets, red = maximal frequent itemsets, black = infrequent itemsets]


7 Two Observations
- Let A and B be two itemsets with A ⊆ B
- Observation-1: A infrequent ⇒ B infrequent (if a transaction does not contain A, it cannot contain B)
- Observation-2: B frequent ⇒ A frequent (if a transaction contains B, it must contain A)
- [Figure: two itemset lattices, labeled A and B: the itemsets containing item 5, from {5} up to the four-item sets, and the subsets of {1,2,3,4}]
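Both observations follow from the fact that support cannot grow when items are added to an itemset. As a small sanity check (a sketch, not from the original slides), the code below verifies this on the four-transaction database {1,2,3,4,5}, {1,3}, {1,2}, {1,2,3,4} from the earlier example. The helper names are hypothetical, and thresholds are expressed as absolute transaction counts.

```python
from itertools import combinations

DB = [frozenset(t) for t in ({1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4})]
ITEMS = sorted(frozenset().union(*DB))


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in DB)


def all_itemsets():
    """Yield every non-empty itemset over ITEMS."""
    for k in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            yield frozenset(combo)


def observations_hold(min_count):
    """Check Observation-1 and Observation-2 for every pair A ⊆ B."""
    for a in all_itemsets():
        for b in all_itemsets():
            if a <= b:
                # Support is anti-monotone: the superset is never more frequent.
                if support_count(b) > support_count(a):
                    return False
                # A frequent B with an infrequent subset A would violate
                # both observations (they are contrapositives of each other).
                if support_count(b) >= min_count > support_count(a):
                    return False
    return True


print(all(observations_hold(c) for c in range(1, len(DB) + 1)))
```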

8 Computing the Maximum Frequent Set
- Observation-1 leads to bottom-up search algorithms, such as AIS (AIS93), Apriori (AS94), OCD (MTV94), SETM (HS95), DHP (PCY95), Partition (SON95), ML-T2+ (HF95), Sampling (T96), DIC (BMUT97), Clique (ZPOL97)
- Observation-2 leads to top-down search algorithms, such as TopDown (ZPOL97), guess-and-correct (MT97)
- [Figure: lattice of all itemsets over {1,2,3,4,5}. Legend: blue = frequent itemsets, red = maximal frequent itemsets, black = infrequent itemsets]

9 Complexity of One-Way Search
- For bottom-up search, every frequent itemset is explicitly examined (in the example, the search continues until {1,2,3,4} is examined)
- For top-down search, every infrequent itemset is explicitly examined (in the example, the search continues until {5} is examined)
- [Figure: lattice of all itemsets over {1,2,3,4,5}. Legend: blue = frequent itemsets, red = maximal frequent itemsets, black = infrequent itemsets]
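To make the cost of one-way search concrete, here is a rough sketch (not from the original slides) of a levelwise bottom-up search and a levelwise top-down search. Both run on the database {1,2,3,4,5}, {1,3}, {1,2}, {1,2,3,4} with a minimum count of 2, and record which itemsets each search examines. The functions `bottom_up` and `top_down` are simplified, hypothetical stand-ins and do not reproduce any of the cited algorithms exactly.

```python
from itertools import combinations

DB = [frozenset(t) for t in ({1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4})]
ITEMS = frozenset().union(*DB)
MIN_COUNT = 2  # supp_min = 0.5 on 4 transactions


def is_frequent(itemset):
    return sum(itemset <= t for t in DB) >= MIN_COUNT


def bottom_up():
    """Levelwise search upward from single items; returns the itemsets examined."""
    examined, k = set(), 1
    level = {frozenset([i]) for i in ITEMS}
    while level:
        examined |= level
        frequent = {c for c in level if is_frequent(c)}
        # Observation-1: a candidate with an infrequent k-subset is skipped.
        level = {
            a | b
            for a in frequent
            for b in frequent
            if len(a | b) == k + 1
            and all(frozenset(s) in frequent for s in combinations(a | b, k))
        }
        k += 1
    return examined


def top_down():
    """Levelwise search downward from the full itemset; returns (examined, maximal)."""
    examined, maximal = set(), set()
    level = {ITEMS}
    while level:
        examined |= level
        smaller = set()
        for c in level:
            if is_frequent(c):
                maximal.add(c)
            elif len(c) > 1:
                smaller |= {c - {e} for e in c}
        # Observation-2: subsets of a known frequent itemset are skipped.
        level = {c for c in smaller if not any(c <= m for m in maximal)}
    return examined, maximal


BOTTOM_UP_EXAMINED = bottom_up()
TOP_DOWN_EXAMINED, TOP_DOWN_MFS = top_down()
print(len(BOTTOM_UP_EXAMINED), len(TOP_DOWN_EXAMINED), TOP_DOWN_MFS)
```

In this example the bottom-up search examines all 15 frequent itemsets plus {5}, and the top-down search examines all 16 infrequent itemsets plus {1,2,3,4}, which is the behavior the slide describes.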

10 Pincer-Search: Combining Top-down and Bottom-up Searches
- Use Observation-1 to eliminate candidates in the top-down search
- Use Observation-2 to eliminate candidates in the bottom-up search
- This example shows how combining both searches could dramatically reduce
  - the number of candidates examined
  - the number of passes over the database
- [Figure: lattice of all itemsets over {1,2,3,4,5}. Legend: blue = frequent itemsets, red = maximal frequent itemsets, black = infrequent itemsets, green = itemsets not examined]

11 MFCS: A New Data Structure
- For bottom-up search: candidate set (as usual)
- For top-down search: a new, dynamically maintained data structure, the maximum frequent candidate set (MFCS)
- MFCS is a set of itemsets such that:
  - the subsets of its elements, taken together, include all known frequent itemsets
  - the subsets of its elements do not include any currently known infrequent itemset
  - it is of minimum cardinality
- MFCS supports efficient coordination between the bottom-up and top-down searches
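One way to maintain these three properties is to split every MFCS element that contains a newly discovered infrequent itemset. The sketch below is an illustration under that assumption, since the slides do not spell out the update procedure. The function name `update_mfcs` is hypothetical.

```python
def update_mfcs(mfcs, infrequent):
    """Return a new MFCS in which no element contains an infrequent itemset.

    mfcs       -- set of frozensets (the current MFCS)
    infrequent -- iterable of frozensets found infrequent in the current pass
    """
    mfcs = set(mfcs)
    for s in infrequent:
        # Snapshot the elements that still contain s; each one must be split.
        for m in [m for m in mfcs if s <= m]:
            mfcs.remove(m)
            for e in s:
                # Dropping any single item of s yields a set that no longer contains s.
                smaller = m - {e}
                # Keep it only if it is not already covered by another element,
                # which keeps the cardinality of MFCS minimal.
                if smaller and not any(smaller <= other for other in mfcs):
                    mfcs.add(smaller)
    return mfcs


start = {frozenset({1, 2, 3, 4, 5})}
found_infrequent = [frozenset({2, 5}), frozenset({3, 5}), frozenset({4, 5})]
RESULT = update_mfcs(start, found_infrequent)
print(sorted(sorted(m) for m in RESULT))
```

Starting from {{1,2,3,4,5}} and applying the infrequent itemsets {2,5}, {3,5}, and {4,5} in turn, the intermediate elements are {1,3,4,5} and then {1,4,5}, and the result is {{1,2,3,4},{1,5}}.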

12 Pincer-Search: Search Path
[Figure: search path through the itemset lattice. Itemsets shown: {1,2,3,4,5}; {1,2,3,4}, {1,3,4,5}; {1,3,4}, {1,4,5}; {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, {1,5}, {2,5}, {3,5}, {4,5}; {1}, {2}, {3}, {4}, {5}. Arrow labels: "By {2,5}", "By {3,5}", "By {4,5}"]

13 Pincer-Search Algorithm
L_0 := ∅; k := 1; C_1 := {{i} | i ∈ {1,2,...,n}}
MFCS := {{1,2,...,n}}; MFS := ∅
while C_k ≠ ∅
    read database and count supports for C_k and MFCS
    MFS := MFS ∪ {frequent itemsets in MFCS}
    determine frequent set L_k and infrequent set S_k
    use S_k to update MFCS
    generate new candidate set C_{k+1} (join, recover, and prune)
    k := k + 1
return MFS
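The loop can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation, and it deviates from the slide in three labeled ways. First, it does not remove subsets of known maximal frequent itemsets from L_k and then recover them. Instead it keeps those candidates and marks them frequent without counting (Observation-2), so candidate generation is a plain join-and-prune. Second, it returns the final MFCS, which in this variant coincides with MFS once no candidates remain. Third, thresholds are absolute counts. All function names are hypothetical.

```python
from itertools import combinations


def update_mfcs(mfcs, infrequent):
    """Split each MFCS element that contains a newly found infrequent itemset."""
    mfcs = set(mfcs)
    for s in infrequent:
        for m in [m for m in mfcs if s <= m]:
            mfcs.remove(m)
            for e in s:
                smaller = m - {e}
                if smaller and not any(smaller <= other for other in mfcs):
                    mfcs.add(smaller)
    return mfcs


def apriori_gen(frequent, k):
    """Join and prune: (k+1)-itemsets whose k-subsets are all frequent."""
    return {
        a | b
        for a in frequent
        for b in frequent
        if len(a | b) == k + 1
        and all(frozenset(c) in frequent for c in combinations(a | b, k))
    }


def pincer_search_sketch(transactions, min_count):
    """Return (maximum frequent set, number of itemsets whose support was counted)."""
    db = [frozenset(t) for t in transactions]
    items = frozenset().union(*db)
    cache = {}  # itemset -> support count, filled only when actually counted

    def is_frequent(itemset):
        if itemset not in cache:
            cache[itemset] = sum(itemset <= t for t in db)
        return cache[itemset] >= min_count

    candidates = {frozenset([i]) for i in items}  # C_1
    mfcs, mfs, k = {items}, set(), 1
    while candidates:
        # Top-down step: frequent MFCS elements are maximal frequent itemsets.
        mfs |= {m for m in mfcs if is_frequent(m)}
        frequent, infrequent = set(), set()  # L_k and S_k
        for c in candidates:
            # Subsets of a known maximal frequent itemset need no counting.
            if any(c <= m for m in mfs) or is_frequent(c):
                frequent.add(c)
            else:
                infrequent.add(c)
        mfcs = update_mfcs(mfcs, infrequent)
        candidates = apriori_gen(frequent, k)
        k += 1
    return mfcs, len(cache)


MFS, COUNTED = pincer_search_sketch(
    [{1, 2, 3, 4, 5}, {1, 3}, {1, 2}, {1, 2, 3, 4}], min_count=2
)
print(sorted(sorted(m) for m in MFS), COUNTED)
```

On the database {1,2,3,4,5}, {1,3}, {1,2}, {1,2,3,4} with a minimum count of 2, this sketch counts the support of only 7 itemsets: {1,2,3,4,5}, the five single items, and {1,2,3,4}.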

14 Performance: Observations and Experiments
- Non-monotone property of the maximum frequent set
  - Both the number of candidates and the number of frequent itemsets increase as supp_min decreases
  - This is NOT true for the number of maximal frequent itemsets
    - Suppose MFS is {{1,2},{1,3},{2,3}} when supp_min is 9%
    - If supp_min decreases to 6%, then MFS could become {{1,2,3}}
  - This property will NOT help bottom-up search algorithms
  - However, this property may help the Pincer-Search algorithm
- Concentrated and scattered distributions
  - Concentrated: on each level, the frequent itemsets have many common items; the frequent items tend to cluster (narrow and tall)
  - Scattered: the frequent itemsets do not have many common items (wide and flat)
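The non-monotone property can be seen on a tiny database. The five transactions below are a made-up illustration, not data from the presentation, and thresholds are given as absolute counts. Lowering the threshold increases the number of frequent itemsets while the number of maximal frequent itemsets shrinks.

```python
from itertools import combinations

# Hypothetical database chosen so that three maximal pairs merge into one triple.
DB = [frozenset(t) for t in ({1, 2, 3}, {1, 2, 3}, {1, 2}, {1, 3}, {2, 3})]


def frequent_itemsets(min_count):
    """All itemsets contained in at least min_count transactions (brute force)."""
    items = sorted(frozenset().union(*DB))
    found = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            if sum(itemset <= t for t in DB) >= min_count:
                found.add(itemset)
    return found


def maximal(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


HIGH = frequent_itemsets(3)  # higher threshold: 3 of 5 transactions
LOW = frequent_itemsets(2)   # lower threshold: 2 of 5 transactions
print(len(HIGH), len(maximal(HIGH)))
print(len(LOW), len(maximal(LOW)))
```

A bottom-up search still has to visit every frequent itemset at the lower threshold, while a search that reaches {1,2,3} from above has only a single maximal itemset to find.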

15 Scattered Distributions [figure]

16 Scattered Distributions [figure]

17 Concentrated Distributions [figure]

18 Concentrated Distributions [figure]

19 Census Data [figure]

20 Conclusions
- Pincer-Search is good for concentrated distributions
- In general, one can use Adaptive Pincer-Search
- More experiments on real-life databases are needed

