Mining Association Rules in Large Databases


Alternative Methods for Frequent Itemset Generation: traversal of the itemset lattice, general-to-specific vs. specific-to-general.

Alternative Methods for Frequent Itemset Generation: traversal of the itemset lattice, equivalence classes.

Alternative Methods for Frequent Itemset Generation: traversal of the itemset lattice, breadth-first vs. depth-first.

ECLAT: for each item, store a list of the IDs of the transactions containing it, its TID-list.

ECLAT determines the support of any k-itemset by intersecting the TID-lists of two of its (k-1)-subsets. Three traversal approaches: top-down, bottom-up, and hybrid. Advantage: very fast support counting. Disadvantage: intermediate TID-lists may become too large for memory.
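
A minimal sketch of the TID-list idea (toy data and helper names are illustrative, not ECLAT's full top-down/bottom-up machinery):

    # Toy database: item -> TID-list (set of IDs of transactions containing it).
    tidlists = {
        'a': {10, 20, 40},
        'c': {10, 30, 40},
        'd': {10, 40},
        'e': {10, 20, 30},
        'f': {10, 30, 40},
    }

    def support(itemset, tidlists):
        # The TID-list of an itemset is the intersection of its items'
        # TID-lists; its support is simply the size of that intersection.
        tids = set.intersection(*(tidlists[i] for i in itemset))
        return len(tids)

    print(support(('a', 'c'), tidlists))  # 2: transactions 10 and 40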

Mining Frequent Closed Patterns: CLOSET. F-list: the list of all frequent items in support-ascending order; here F-list = d-a-f-e-c. Divide the search space: patterns having d; patterns having a but no d; etc. Find frequent closed patterns recursively: every transaction having d also has c, f, a, so cfad is a frequent closed pattern. [J. Pei, J. Han & R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00]

min_sup = 2

    TID | Items
    ----+--------------
    10  | a, c, d, e, f
    20  | a, b, e
    30  | c, e, f
    40  | a, c, d, f
    50  | c, e, f

Iceberg Queries. An iceberg query computes an aggregate over one attribute or a set of attributes, returning only those groups whose aggregate value is above a given threshold. Example:

    select P.custID, P.itemID, sum(P.qty)
    from purchase P
    group by P.custID, P.itemID
    having sum(P.qty) >= 10

Compute iceberg queries efficiently by an Apriori-like strategy: first compute the lower-dimensional aggregates, then compute higher-dimensional ones only for groups whose lower-dimensional aggregates are all above the threshold.
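
A small sketch of that bottom-up strategy (toy data; the pruning is valid because quantities are non-negative, so a (custID, itemID) group can reach the threshold only if both of its one-dimensional totals do):

    from collections import defaultdict

    # purchase rows: (custID, itemID, qty); threshold for sum(qty).
    purchases = [(1, 'a', 6), (1, 'b', 12), (2, 'a', 8), (2, 'b', 3), (1, 'a', 5)]
    T = 10

    # Lower dimensions first: total qty per customer and per item.
    by_cust, by_item = defaultdict(int), defaultdict(int)
    for c, i, q in purchases:
        by_cust[c] += q
        by_item[i] += q

    # Aggregate a (cust, item) pair only if both marginals pass the threshold.
    by_pair = defaultdict(int)
    for c, i, q in purchases:
        if by_cust[c] >= T and by_item[i] >= T:
            by_pair[(c, i)] += q

    print({k: v for k, v in by_pair.items() if v >= T})
    # {(1, 'b'): 12, (1, 'a'): 11}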

Multiple-Level Association Rules. Items often form a hierarchy, and items at lower levels are expected to have lower support. Rules on itemsets at the appropriate levels can be quite useful. A transactional database can be encoded based on dimensions and levels, and we can explore shared multi-level mining. (Hierarchy from the slide figure: food splits into milk and bread; milk into skim and full fat; bread into wheat and white; with brands such as Fraser and Wonder below.)
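
One common encoding, sketched here under an assumed toy hierarchy, is to extend each transaction with the ancestors of its items, so that rules at every level can be mined in one pass:

    # Hypothetical child -> parent hierarchy for illustration.
    HIERARCHY = {'full fat milk': 'milk', 'skim milk': 'milk',
                 'wheat bread': 'bread', 'white bread': 'bread',
                 'milk': 'food', 'bread': 'food'}

    def extend(transaction):
        # Add every ancestor of every item: a purchase of 'full fat milk'
        # then also supports 'milk' and 'food'.
        items = set(transaction)
        for item in transaction:
            while item in HIERARCHY:
                item = HIERARCHY[item]
                items.add(item)
        return items

    print(extend({'full fat milk', 'wheat bread'}))
    # contains 'milk', 'bread', and 'food' as well as the original items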

Mining Multi-Level Associations. A top-down, progressive-deepening approach: first find high-level strong rules, e.g. milk → bread [20%, 60%]; then find their lower-level, "weaker" rules, e.g. full fat milk → wheat bread [6%, 50%]. Variations on mining multiple-level association rules: level-crossed association rules, e.g. full fat milk → Wonder wheat bread; association rules with multiple, alternative hierarchies, e.g. full fat milk → Wonder bread.

Multi-Dimensional Association: Concepts. Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread"). Multi-dimensional rules involve two or more dimensions or predicates: inter-dimension association rules (no repeated predicates), e.g. age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke"); hybrid-dimension association rules (repeated predicates), e.g. age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke"). Categorical attributes have a finite number of possible values with no ordering among them; quantitative attributes are numeric, with an implicit ordering among values.

Techniques for Mining Multi-Dimensional Associations. Search for frequent k-predicate sets; for example, {age, occupation, buys} is a 3-predicate set. Techniques can be categorized by how age is treated. 1. Static discretization of quantitative attributes: quantitative attributes are statically discretized using predefined concept hierarchies. 2. Quantitative association rules: quantitative attributes are dynamically discretized into "bins" based on the distribution of the data. 3. Distance-based association rules: a dynamic discretization process that considers the distance between data points.
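
For instance, static discretization (technique 1) just maps each numeric value onto a predefined interval before mining starts; a minimal sketch with hypothetical bins:

    # Hypothetical concept-hierarchy intervals for 'age'.
    AGE_BINS = [(0, 18, 'under 19'), (19, 25, '19-25'),
                (26, 34, '26-34'), (35, 120, '35+')]

    def discretize_age(age):
        # Replace the numeric value with its interval label.
        for lo, hi, label in AGE_BINS:
            if lo <= age <= hi:
                return label
        raise ValueError('age out of range')

    print(discretize_age(22))  # '19-25'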

Quantitative Association Rules. Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized. 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat. Cluster "adjacent" association rules to form general rules using a 2-D grid. Example: age(X, "30-34") ∧ income(X, "24K-48K") ⇒ buys(X, "high resolution TV").

ARCS (Association Rule Clustering System). How does ARCS work? 1. Binning. 2. Finding frequent predicate sets. 3. Clustering. 4. Optimizing.

Limitations of ARCS: only quantitative attributes on the LHS of rules, and only 2 attributes on the LHS (a 2-D limitation). An alternative to ARCS: non-grid-based, equi-depth binning, with clustering based on a measure of partial completeness ("Mining Quantitative Association Rules in Large Relational Tables" by R. Srikant and R. Agrawal).

Interestingness Measurements. Objective measures: two popular ones are support and confidence. Subjective measures: a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it).

Computing Interestingness Measure. Given a rule X → Y, the information needed to compute the rule's interestingness can be obtained from a contingency table.

Contingency table for X → Y:

              Y     ¬Y
    X        f11    f10   | f1+
    ¬X       f01    f00   | f0+
             f+1    f+0   | |T|

f11: support count of X and Y; f10: support count of X and ¬Y; f01: support count of ¬X and Y; f00: support count of ¬X and ¬Y. These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
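
A minimal helper computing the three most common of those measures from the four counts (illustrative; the call matches the Tea/Coffee table on the next slide):

    def measures(f11, f10, f01, f00):
        # support = P(X and Y); confidence = P(Y | X); lift = P(Y|X) / P(Y)
        n = f11 + f10 + f01 + f00
        support = f11 / n
        confidence = f11 / (f11 + f10)
        lift = confidence / ((f11 + f01) / n)
        return support, confidence, lift

    print(measures(15, 5, 75, 5))  # (0.15, 0.75, 0.8333...)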

Drawback of Confidence.

             Coffee   ¬Coffee
    Tea        15        5    |  20
    ¬Tea       75        5    |  80
               90       10    | 100

Association rule: Tea → Coffee. Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9. Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375.

Statistical Independence. Population of 1000 students: 600 know how to swim (S), 700 know how to bike (B), 420 know how to swim and bike (S, B). P(S ∧ B) = 420/1000 = 0.42 and P(S) × P(B) = 0.6 × 0.7 = 0.42. P(S ∧ B) = P(S) × P(B) ⇒ statistical independence; P(S ∧ B) > P(S) × P(B) ⇒ positively correlated; P(S ∧ B) < P(S) × P(B) ⇒ negatively correlated.

Statistical-Based Measures. Measures that take statistical dependence into account.
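
The formulas themselves were on a slide image; the standard definitions of the measures this slide refers to (lift and the φ-coefficient are the ones the following slides use) are:

    \mathrm{Lift} = \frac{P(Y \mid X)}{P(Y)}, \qquad
    \mathrm{Interest} = \frac{P(X,Y)}{P(X)\,P(Y)}, \qquad
    \mathrm{PS} = P(X,Y) - P(X)\,P(Y),

    \phi = \frac{P(X,Y) - P(X)\,P(Y)}
                {\sqrt{P(X)\,[1-P(X)]\,P(Y)\,[1-P(Y)]}}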

Example: Lift/Interest.

             Coffee   ¬Coffee
    Tea        15        5    |  20
    ¬Tea       75        5    |  80
               90       10    | 100

Association rule: Tea → Coffee. Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9. Lift = 0.75/0.9 = 0.8333 (< 1, therefore the rule is negatively associated).

There are many measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? And what about Apriori-style support-based pruning: how does it affect these measures?

Example: -Coefficient -coefficient is analogous to correlation coefficient for continuous variables Y X 60 10 70 20 30 100 Y X 20 10 30 60 70 100  Coefficient is the same for both tables

Properties of a Good Measure. Piatetsky-Shapiro: three properties a good measure M must satisfy: (1) M(A,B) = 0 if A and B are statistically independent; (2) M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged; (3) M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged.

Constraint-Based Mining. Interactive, exploratory mining of gigabytes of data: could it be real? Yes, by making good use of constraints! What kinds of constraints can be used in mining? Knowledge-type constraints: classification, association, etc. Data constraints: SQL-like queries, e.g. find product pairs sold together in Vancouver in Dec. '98. Dimension/level constraints: relevance to region, price, brand, customer category. Interestingness constraints: strong rules (min_support ≥ 3%, min_confidence ≥ 60%). Rule constraints: cheap item sales (price < $10) triggering big sales (sum > $200).

Rule Constraints in Association Mining. Two kinds of rule constraints: rule form constraints (meta-rule guided mining), e.g. P(x, y) ^ Q(x, w) → takes(x, "database systems"); rule (content) constraints (constraint-based query optimization), e.g. sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000.

Constraint-Based Association Query. Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price). A constrained association query (CAQ) is of the form {(S1, S2) | C}, where C is a set of constraints on S1, S2, including a frequency constraint. A classification of (single-variable) constraints:
- Class constraint: S ⊆ A, e.g. S ⊆ Item.
- Domain constraints: S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g. S.Price < 100; v θ S, where θ is ∈ or ∉, e.g. snacks ∉ S.Type; V θ S or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}, e.g. {snacks, sodas} ⊆ S.Type.
- Aggregation constraint: agg(S) θ v, where agg ∈ {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}, e.g. count(S1.Type) = 1, avg(S2.Price) < 100.

Constrained Association Query Optimization Problem. Given a CAQ = {(S1, S2) | C}, the algorithm should be: sound: it only finds frequent sets that satisfy the given constraints C; complete: all frequent sets that satisfy the given constraints C are found. A naive solution: apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one. A better approach: comprehensively analyze the properties of the constraints and push them as deeply as possible inside the frequent-set computation.

Anti-Monotone and Monotone Constraints. A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca. A constraint Cm is monotone iff, for any pattern S satisfying Cm, every super-pattern of S also satisfies it.

Property of Constraints: Anti-Monotonicity. Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint. Examples: sum(S.Price) ≤ v is anti-monotone; sum(S.Price) ≥ v is not anti-monotone. Application: push "sum(S.Price) ≤ 1000" deeply into the iterative frequent-set computation.
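
A minimal sketch of that push (hypothetical names and prices). The constraint is checked at candidate-generation time, before support counting, which is safe because no superset of a violating set can satisfy sum(S.Price) ≤ v:

    def gen_candidates(freq_k, prices, max_total=1000):
        # Join frequent k-itemsets into (k+1)-candidates, dropping any
        # candidate that already violates the anti-monotone constraint
        # sum(S.Price) <= max_total.
        candidates = set()
        for a in freq_k:
            for b in freq_k:
                c = a | b
                if len(c) == len(a) + 1 and sum(prices[i] for i in c) <= max_total:
                    candidates.add(c)
        return candidates

    prices = {'tv': 900, 'dvd': 150, 'cd': 20}
    freq_1 = {frozenset({'tv'}), frozenset({'dvd'}), frozenset({'cd'})}
    print(gen_candidates(freq_1, prices))  # {tv, dvd} (sum 1050) is pruned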

Characterization of Anti-Monotonicity Constraints:

    Constraint                   Anti-monotone?
    S θ v, θ ∈ {=, ≤, ≥}         yes
    v ∈ S                        no
    S ⊇ V                        no
    S ⊆ V                        yes
    min(S) ≤ v                   no
    min(S) ≥ v                   yes
    max(S) ≤ v                   yes
    max(S) ≥ v                   no
    count(S) ≤ v                 yes
    count(S) ≥ v                 no
    sum(S) ≤ v                   yes
    sum(S) ≥ v                   no
    avg(S) θ v, θ ∈ {=, ≤, ≥}    convertible
    (frequency constraint)       (yes)

Succinct Constraints. If a rule constraint is succinct, we can directly generate precisely the sets that satisfy it, before support counting begins. Example: min(I.price) ≥ 500, where I is a set of products; knowing the price of each product, one can directly evaluate this predicate. Such constraints are pre-counting prunable.

Property of Constraints: Succinctness. For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C. Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1. Examples: sum(S.Price) ≥ v is not succinct; min(S.Price) ≤ v is succinct. Optimization: if C is succinct, then C is pre-counting prunable; the satisfaction of the constraint alone is not affected by the iterative support counting.
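
A tiny sketch of what "directly generate the satisfying sets" means, mirroring the earlier min-price example (hypothetical data; for min(S.Price) ≥ v the satisfying itemsets are exactly the non-empty subsets of the items priced at or above v):

    from itertools import chain, combinations

    def satisfying_sets(prices, v=500):
        # Enumerate all itemsets with min price >= v before any
        # support counting: pick only from the eligible items.
        eligible = [i for i in prices if prices[i] >= v]
        return chain.from_iterable(combinations(eligible, k)
                                   for k in range(1, len(eligible) + 1))

    prices = {'tv': 900, 'hifi': 600, 'cd': 20}
    print(list(satisfying_sets(prices)))
    # [('tv',), ('hifi',), ('tv', 'hifi')]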

Characterization of Constraints by Succinctness:

    Constraint                   Succinct?
    S θ v, θ ∈ {=, ≤, ≥}         yes
    v ∈ S                        yes
    S ⊇ V                        yes
    S ⊆ V                        yes
    S = V                        yes
    min(S) ≤ v                   yes
    min(S) ≥ v                   yes
    min(S) = v                   yes
    max(S) ≤ v                   yes
    max(S) ≥ v                   yes
    max(S) = v                   yes
    count(S) ≤ v                 weakly
    count(S) ≥ v                 weakly
    count(S) = v                 weakly
    sum(S) ≤ v                   no
    sum(S) ≥ v                   no
    sum(S) = v                   no
    avg(S) θ v, θ ∈ {=, ≤, ≥}    no
    (frequency constraint)       (no)

Convertible Constraints. Suppose all items in patterns are listed in a total order R. A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C. Example: avg(I.price) ≤ 500. If items are added to the itemset in price-ascending order, avg(I.price) can only increase with each new item, so once the constraint is violated no extension can satisfy it. Similarly, there are convertible monotone constraints.

Example of Convertible Constraints: avg(S) ≥ v. Let R be the value-descending order over the set of items, e.g. I = {9, 8, 6, 4, 3, 1}. avg(S) ≥ v is convertible monotone w.r.t. R: if S is a suffix of S1, then avg(S1) ≥ avg(S). For example, {8, 4, 3} is a suffix of {9, 8, 4, 3}, and avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5. So if S satisfies avg(S) ≥ v, so does S1: {8, 4, 3} satisfies avg(S) ≥ 4, and so does {9, 8, 4, 3}.
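
A short brute-force check of that property over the example items, just to illustrate the definition (every pattern is a sub-sequence of the descending list, and its suffixes drop the largest items):

    from itertools import combinations

    def avg(xs):
        return sum(xs) / len(xs)

    # Items in value-descending order R.
    items = [9, 8, 6, 4, 3, 1]

    # For every pattern S1 and every suffix S of it, avg(S1) >= avg(S),
    # so satisfying avg >= v propagates from a suffix to the pattern.
    for k in range(1, len(items) + 1):
        for s1 in combinations(items, k):
            assert all(avg(s1) >= avg(s1[j:]) for j in range(len(s1)))
    print("avg(S1) >= avg(suffix) holds for all patterns")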