Pushing Aggregate Constraints by Divide-and-Approximate. Ke Wang, Yuelong Jiang, Jeffrey Xu Yu, Guozhu Dong and Jiawei Han.


1 Pushing Aggregate Constraints by Divide-and-Approximate Ke Wang, Yuelong Jiang, Jeffrey Xu Yu, Guozhu Dong and Jiawei Han

2 No Easy-to-Push Constraints There exists a gap between the interestingness criterion and the techniques used for mining patterns from large amounts of data. Anti-monotonicity is too loose as a pruning strategy, yet too restrictive as an interestingness criterion. Should we design a new algorithm for every pattern class that cannot be handled by anti-monotonicity alone? Goal: mining patterns with "general" constraints.

3 Iceberg-Cube Mining An iceberg-cube mining query: select A, B, C, count(*) from R cube by A, B, C having count(*) >= 2. count(*) >= 2 is an anti-monotone constraint. [Slide figure: example relation R(A, B, C, M) and the resulting cells with count(*).]
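The count(*) iceberg condition above can be made concrete in a few lines. This is a hypothetical brute-force sketch (not BUC) over a toy relation invented for illustration:

```python
from itertools import combinations

# Hypothetical toy relation R over dimensions (A, B, C).
R = [("a1", "b1", "c1"),
     ("a1", "b1", "c2"),
     ("a1", "b2", "c1"),
     ("a2", "b1", "c1")]

def iceberg_count(rows, min_count=2):
    """Brute-force iceberg cube: enumerate every group-by cell and keep
    those whose count(*) meets the threshold (the iceberg condition)."""
    counts = {}
    for row in rows:
        # Each non-empty subset of the 3 dimensions yields one cell per tuple.
        for k in (1, 2, 3):
            for dims in combinations(range(3), k):
                cell = tuple((d, row[d]) for d in dims)
                counts[cell] = counts.get(cell, 0) + 1
    return {cell: n for cell, n in counts.items() if n >= min_count}

result = iceberg_count(R)
# e.g. cell A=a1 survives (count 3), while cell A=a2 is dropped (count 1)
```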

4 Iceberg-Cube Mining Another query: select A, B, C, sum(M) from R cube by A, B, C having sum(M) >= 150. sum(M) >= 150 is an anti-monotone constraint when all values in M are positive, but it is not anti-monotone when some values in M are negative. [Slide figure: example relations R1 and R2 with measure M.]
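A two-tuple example makes the failure of anti-monotonicity concrete; the relation and values below are hypothetical stand-ins for the slide's R1/R2:

```python
# Hypothetical tuples (A, B, M). With a negative measure, sum(M) >= 150
# is not anti-monotone: the sub-cell a1 fails, yet its super-cell a1b1 passes.
rows = [("a1", "b1", 200),
        ("a1", "b2", -100)]

def cell_sum(rows, a=None, b=None):
    """sum(M) over the tuples matching the given cell."""
    return sum(m for (x, y, m) in rows
               if (a is None or x == a) and (b is None or y == b))

assert cell_sum(rows, a="a1") == 100            # sub-cell fails sum(M) >= 150
assert cell_sum(rows, a="a1", b="b1") == 200    # super-cell passes anyway
```

So pruning a1 would wrongly discard a1b1, which is why the constraint cannot be pushed as-is.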

5 The Main Idea Study iceberg-cube mining with constraints of the form f(v) θ σ, where f is a function built from SQL-like aggregates and arithmetic operators (+, -, *, /), v is a cell variable, σ is a constant, and θ is either ≤ or ≥. Can we push constraints into iceberg-cube mining that are neither anti-monotone nor monotone? If so, is there a pushing method that is not specific to a particular constraint? Divide-and-Approximate: find a "stronger approximator" for the constraint in each subspace.

6 Some Definitions A relation has dimensions Di and one or more measures Mi. A cell d1…dk consists of one value di from each dimension Di; we use c as a cell variable and d1…dk for a cell value. SAT(d1…dk) (or SAT(c)) contains all tuples that contain all the values in d1…dk (or c). c' is a super-cell of c, and c is a sub-cell of c', if c' contains all the values in c. Let C be a constraint f(v) θ σ; CUBE(C) denotes the set of cells that satisfy C. A constraint C is weaker than C' if CUBE(C') ⊆ CUBE(C). [Slide figure: example relation R(A, B, C, M).]
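The sub-cell/super-cell relation is just containment. This minimal sketch represents a cell as a tuple of (dimension, value) pairs, which is a convention of this illustration, not the paper's notation:

```python
def is_subcell(c, c_prime):
    """c is a sub-cell of c_prime (and c_prime a super-cell of c)
    if c_prime contains every (dimension, value) pair of c."""
    return set(c) <= set(c_prime)

c = (("A", "a1"),)
c_prime = (("A", "a1"), ("B", "b1"))
assert is_subcell(c, c_prime)
assert not is_subcell(c_prime, c)
```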

7 An Example Iceberg-cube mining: select A, B, C, sum(M) from R cube by A, B, C having sum(M) >= 150. sum(c) >= 150 is neither anti-monotone nor monotone. Let the space be S = {ABC, AB, AC, BC, A, B, C}, and write sum(c) = psum(c) - nsum(c) >= 150, where psum(c) is the profit and nsum(c) is the cost. Push an anti-monotone approximator: use psum(c) >= 150 and ignore nsum(c), but if nsum(c) is large there will be many false positives. Better, use the minimum nsum in S: psum(c) - nsum_min(ABC) >= 150, where nsum_min(ABC) is the minimum nsum over S. Better still, use the minimum nsum in a subspace of S, which gives a stronger constraint. [Slide figure: example relation R(A, B, C, M).]
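The psum/nsum split behind these approximators is simple to state in code; the measure values and the nsum_min below are hypothetical:

```python
# psum/nsum decomposition of sum(M).
def psum(ms):
    return sum(m for m in ms if m > 0)

def nsum(ms):
    return sum(-m for m in ms if m < 0)

ms = [100, 80, -30]     # hypothetical measures of the tuples in SAT(c)
sigma = 150
assert psum(ms) - nsum(ms) == 150   # the real sum(c)
# Weakest approximator: ignore the cost entirely.
assert psum(ms) >= sigma
# Stronger approximator: subtract a lower bound on the cost, here a
# hypothetical minimum nsum over the space.
nsum_min = 20
assert psum(ms) - nsum_min >= sigma
```

Since nsum(c) >= nsum_min for every cell in the space, psum(c) - nsum_min is an upper bound on sum(c), so failing the approximator guarantees failing the constraint.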

8 The Search Strategy (using a lexicographic tree) A node represents a group-by. BUC (BottomUpCube): partition the database in the depth-first order of the lexicographic tree. [Slide figure: the lexicographic tree over dimensions A, B, C, D, E, and an example relation R(A, B, C, D, E, M).]
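A minimal BUC-style recursion can be sketched as follows, assuming tuples are plain value tuples and using count(*) >= min_count as the anti-monotone constraint; this is only the baseline depth-first search and omits the paper's approximator pushing:

```python
def buc(rows, start, ndims, prefix, out, min_count=2):
    """Depth-first BUC sketch: record the current group-by cell, then
    partition on each remaining dimension and recurse only into
    partitions that satisfy the anti-monotone count constraint."""
    if prefix:                        # record this cell's count(*)
        out[tuple(prefix)] = len(rows)
    for d in range(start, ndims):     # extend the group-by, depth first
        parts = {}
        for r in rows:
            parts.setdefault(r[d], []).append(r)
        for v, part in parts.items():
            if len(part) >= min_count:        # anti-monotone pruning
                buc(part, d + 1, ndims, prefix + [(d, v)], out)

# Hypothetical toy relation over dimensions (A, B, C).
rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"),
        ("a1", "b2", "c1"), ("a2", "b1", "c1")]
cube = {}
buc(rows, 0, 3, [], cube)
# cube now holds exactly the cells with count(*) >= 2
```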

9 Another Example Iceberg-cube mining: select A, B, C, D, E, sum(M) from R cube by A, B, C, D, E having sum(M) >= 200. At node ABCDE, sum(12345) = psum(12345) - nsum(12345) = 200 - 250 = -50 (fails). Backtracking to ABC: psum(123) - nsum_min(12345) = 190 < 200 (fails). Then, at node ABCE, the cell p[1235] must fail, so all tuples t with t[1235] = p[1235] can be pruned. [Slide figure: example relation R(A, B, C, D, E, M).]

10 Find a cell p at a node u0 that fails C, and extract an anti-monotone approximator Cp. Consider an ancestor uk of u0 such that u0 is the left-most leaf in tree(uk). p[u] denotes p projected onto u (a cell of u), and tree(uk, p) = {p[u] | u is a node in tree(uk)}. p is the maximum cell in tree(uk, p) and p[uk] is the minimum cell. If p[uk] fails Cp, all cells in tree(uk, p) fail. Note: tree(uk, p) ≠ tree(uk, p') if p' ≠ p. A node in tree(uk) is a set of group-by attributes; a cell in tree(uk, p) is a tuple of group-by values. [Slide figure: the lexicographic tree with uk, u0 and tree(uk) highlighted.]

11 The Pruning On the backtracking from u0 to uk: check whether u0 is on the left-most path in tree(uk); check whether p[uk] can use the same anti-monotone approximator as p[u0]; check whether p[uk] fails Cp. If all conditions are met, then for every unexplored child ui of uk, prune all tuples that match p on tail(ui), because such tuples generate only cells in tree(uk, p), which fail Cp. tail(u): the set of all dimensions appearing in tree(u). [Slide figure: the lexicographic tree with uk and u0 highlighted.]

12 Suppose that a cell p[ABCDE] fails. On the backtracking from ABCDE to ABC, if the conditions are met (p[ABC] fails), prune tuples such that t[ABCE] = p[ABCE]. On the backtracking from ABC to AB, if the conditions are met (p[AB] fails), prune tuples such that t[ABDE] = p[ABDE] from tree(ABD), and prune tuples such that t[ABE] = p[ABE] from tree(ABE). Given a leaf node u0 and a cell p at u0, let uk…u0 be the leftmost path in tree(uk), k >= 0; p is a pruning anchor wrt (uk, u0), and tree(uk, p) is the pruning scope. [Slide figure: the lexicographic tree with uk, u0 and ui highlighted.]

13 The D&A Algorithm Modify BUC: push up a pruning anchor p along the leftmost path from u0 to uk; at the current node, partition the pruning anchors pushed up, in addition to partitioning the tuples. [Slide figure: example relation R(A, B, C, D, E, M).]

14 With Min-Support Suppose cell abcd is frequent but cell abcde is infrequent (so the search should stop at abcd). If cell abcd is anchored at node A, we cannot prune ae, abe, ace, ade in tree(A, abcd). [Slide figure: the lexicographic tree; min-sup = 3, sum(M) >= 100.]

15 Rollback Tree RBtree(AD), RBtree(AC), RBtree(ABD), RBtree(D), RBtree(C), and RBtree(B) do not contain E. If abcd is anchored at the root, we can prune tuples from RBtree(D), RBtree(C), and RBtree(B). [Slide figure: the rollback tree; min-sup = 3, sum(M) >= 100.]

16 Constraint/Function Monotonicity A constraint C is a-monotone if whenever a cell is not in CUBE(C), neither is any of its super-cells. A constraint C is m-monotone if whenever a cell is in CUBE(C), so is every super-cell. A function x(y) is a-monotone wrt y if x decreases as y grows, where "grows" means becoming a super-cell for cell-valued y and increasing for real-valued y. A function x(y) is m-monotone wrt y if x increases as y grows. An example: for sum(v) = psum(v) - nsum(v), sum(v) is m-monotone wrt psum(v) and a-monotone wrt nsum(v).

17 Constraint/Function Monotonicity Notation: let ā denote m and m̄ denote a, and let τ denote either a or m. Example: psum(v) ≥ σ is a-monotone, and psum(v) ≤ σ is m-monotone. If psum(c1) ≥ σ does not hold, then psum(c2) ≥ σ does not hold for any super-cell c2 of c1 (say c1 is a cell of ABC and c2 is a cell of ABCD). f(v) ≥ σ is τ-monotone if and only if f(v) is τ-monotone wrt v; f(v) ≤ σ is τ̄-monotone if and only if f(v) is τ-monotone wrt v. An example: for sum(v) = psum(v) - nsum(v) ≥ σ, sum(v) ≥ σ is m-monotone with psum(v) because sum(v) is m-monotone wrt psum(v), and sum(v) ≥ σ is a-monotone with nsum(v) because sum(v) is a-monotone wrt nsum(v).

18 Find Approximators Consider f(v) ≥ σ. Divide the aggregate occurrences in f into two groups: A+, the occurrences through which f monotonically increases as the cell v grows (becomes a super-cell), and A-, the occurrences through which f monotonically decreases. For sum(v) = psum(v) - nsum(v) ≥ σ: A+ = {nsum(v)}, A- = {psum(v)}. f(A+; A-/c_min) ≥ σ and f(A+/c_min; A-) ≤ σ are m-monotone approximators in a subspace Si, where c_min is the minimum cell instantiation in Si. f(A+/c_max; A-) ≥ σ and f(A+; A-/c_max) ≤ σ are a-monotone approximators in a subspace Si, where c_max is the maximum cell instantiation in Si. Example: sum(nsum/c_max; psum) ≥ σ.
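The a-monotone approximator f(A+/c_max; A-) ≥ σ for sum reduces to a single comparison; the psum/nsum values below are hypothetical:

```python
# Sketch of the a-monotone approximator for sum(v) = psum(v) - nsum(v):
# substitute c_max into the A+ occurrence (nsum), keep A- (psum) at v.
# Since nsum(v) >= nsum(c_max) for every v in the subspace,
# psum(v) - nsum(c_max) is an upper bound on sum(v).
def approximator(psum_v, nsum_cmax, sigma):
    return psum_v - nsum_cmax >= sigma

assert approximator(psum_v=200, nsum_cmax=40, sigma=150)       # bound survives
assert not approximator(psum_v=120, nsum_cmax=40, sigma=150)   # prune: even the bound fails
```

Because the bound is an upper bound on sum(v) and itself a-monotone, a cell that fails the approximator can be pruned together with all its super-cells in the subspace.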

19 Separate Monotonicity Consider function rewriting: rewrite (E1 + E2) * E into E1 * E + E2 * E. Consider space division: divide a space into subspaces Si. Find approximators for each subspace Si using these equation-rewriting techniques.

20 Experimental Studies Consider sum(v) = psum(v) - nsum(v). Three algorithms: BUC, pushing only the minimum support; BUC+, pushing approximators and the minimum support; D&A, pushing approximators and the minimum support with divide-and-approximate pruning.

21 Vary minimum support

22 Without minimum support (*) psum(v) >= σ

23 Scalability

24 Conclusion General aggregate constraints, rather than only well-behaved constraints. SQL-like tuple-based aggregates, rather than item-based aggregates. Constraint-independent techniques, rather than constraint-specific techniques. A new pushing strategy: divide-and-approximate.