1 Mining Quantitative Association Rules in Large Relational Databases
Presented by Jin Jin, April 1, 2004
2 Review: Association Rules
- Association rules capture interesting associations among huge numbers of transactions.
- An association rule is an expression of the form X => Y, where X and Y are sets of items.
- Goal of association rule mining: find all association rules that satisfy the user-specified minimum support and minimum confidence thresholds.
3 Outline
- Introduction
- 5 steps of discovering quantitative association rules
- Partitioning quantitative attributes
- Interest
- Algorithm
- Conclusion
4 Introduction
Boolean Association Rules Problem: finding associations between the "1" values in a relational table, where all attributes are Boolean.
E.g., the table below yields the itemsets {A,B,C}, {A,C}, {B,C}:

TID  A  B  C
001  1  1  1
002  1  0  1
003  0  1  1
5 Introduction, Cont.
Most databases have richer attribute types, e.g.:
- Quantitative: Age, Income
- Categorical: Zip code, Make of Car
Quantitative Association Rules Problem: mining association rules over quantitative and categorical attributes.
6 Mapping the Quantitative Association Rules Problem into the Boolean Association Rules Problem
- If all attributes are categorical, or the quantitative attributes have only a few values, map each <attribute, value> pair to a boolean attribute.
- If the domain of values for a quantitative attribute is large, first partition the values into intervals and then map each <attribute, interval> pair to a boolean attribute.
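A minimal sketch of this mapping, assuming records are simple attribute/value dictionaries (the function name, record layout, and interval boundaries are illustrative, not from the slides):

```python
def to_boolean_table(records, quantitative_intervals):
    """quantitative_intervals maps a quantitative attribute to its list of
    (lo, hi) intervals, e.g. {"Age": [(20, 25), (26, 35)]}."""
    rows = []
    for rec in records:
        row = {}
        for attr, value in rec.items():
            if attr in quantitative_intervals:
                # one boolean column per <attribute, interval> pair
                for lo, hi in quantitative_intervals[attr]:
                    row[f"{attr}: {lo}..{hi}"] = int(lo <= value <= hi)
            else:
                # one boolean column per <attribute, value> pair
                # (pairs absent from a record are implicitly 0)
                row[f"{attr}: {value}"] = 1
        rows.append(row)
    return rows

records = [{"Age": 23, "Married": "No", "NumCars": 1},
           {"Age": 29, "Married": "Yes", "NumCars": 0}]
print(to_boolean_table(records, {"Age": [(20, 25), (26, 35)]}))
```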
7 Mapped boolean table:

RecID  Age: 20..25  Age: 26..35  Married: Yes  Married: No  NumCars: 0  NumCars: 1
100    1            0            0             1            0           1
200    1            0            1             0            0           1
300    0            1            1             0            1           0
400    0            1            0             1            1           0

Original table:

RecordID  Age  Married  NumCars
100       23   No       1
200       25   Yes      1
300       29   Yes      0
400       33   No       0
8 Mapping Problems
- MinSup: if the number of intervals for a quantitative attribute is large, the support for any single interval can be low.
- MinConf: information is lost by partitioning values into intervals, and this loss increases as the intervals become larger.
9 Catch-22
- If intervals are too large, rules may not have MinConf.
- If intervals are too small, rules may not have MinSup.
How do we solve it?
10 Solving the Catch-22
Consider all possible continuous ranges over the values of the quantitative attribute, or over the partitioned intervals:
- To satisfy minimum support, combine adjacent intervals/values (see the sketch below).
- To satisfy minimum confidence, increase the number of intervals.
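A minimal sketch of the interval-combining idea, assuming we already have support counts for consecutive base intervals (the thresholds and names are illustrative):

```python
def frequent_ranges(interval_supports, min_sup, max_sup):
    """interval_supports: support (as a fraction) of each consecutive base interval.
    A range of adjacent intervals is kept once it reaches minimum support, and we
    stop extending it as soon as its combined support exceeds the maximum support."""
    ranges = []
    n = len(interval_supports)
    for start in range(n):
        total = 0.0
        for end in range(start, n):
            total += interval_supports[end]
            if total > max_sup:          # stop combining adjacent intervals
                break
            if total >= min_sup:
                ranges.append((start, end, total))
    return ranges

# base intervals with supports 2%, 3%, 4%, 1%; minsup 5%, maxsup 8%
print(frequent_ranges([0.02, 0.03, 0.04, 0.01], 0.05, 0.08))
```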
11 Unfortunately, More Problems
- Execution time: if a quantitative attribute has n values (or intervals), there are on average O(n^2) ranges that include a specific value or interval.
- Many rules: if a value (or interval) has MinSup, then any range containing this value also has MinSup, producing many uninteresting rules.
12 Our Approach
- Maximum support: stop combining adjacent intervals once their combined support exceeds this value.
- Partial completeness: quantify the information lost due to partitioning.
- Interest measure: help prune out uninteresting rules.
13 Problem Definition
- The rule X => Y holds in the record set D with confidence c if c% of the records in D that support X also support Y.
- The rule X => Y has support s in the record set D if s% of the records in D support X ∪ Y.
14 Formal Problem Statement
"Given a set of records D, the problem of mining quantitative association rules is to find all quantitative association rules that have support and confidence greater than the user-specified minimum support and minimum confidence."
15 5 Steps of Discovering Quantitative Association Rules
1) Determine the number of partitions for each quantitative attribute.
2) Map the values of each attribute to a set of consecutive integers, such that the order of the values is preserved.
3) Find the support for each value of both quantitative and categorical attributes. For quantitative attributes, adjacent values are combined as long as their support is less than the user-specified maximum support. Next, generate the frequent itemsets.
4) Use the frequent itemsets to generate association rules.
5) Determine the interesting rules.
A sketch of steps 1-2 follows.
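A minimal sketch of steps 1-2, i.e. equi-depth partitioning plus the order-preserving mapping to consecutive integers (the data and helper names are illustrative):

```python
def equi_depth_partition(values, num_intervals):
    """Step 1: split the sorted values into intervals holding roughly the
    same number of records; return the interval boundaries."""
    ordered = sorted(values)
    depth = len(ordered) / num_intervals
    boundaries = []
    for i in range(num_intervals):
        lo = ordered[int(i * depth)]
        hi = ordered[min(int((i + 1) * depth) - 1, len(ordered) - 1)]
        boundaries.append((lo, hi))
    return boundaries

def map_to_integers(intervals):
    """Step 2: consecutive integers preserve the order of the intervals."""
    return {interval: i for i, interval in enumerate(intervals)}

ages = [23, 25, 29, 33, 34, 37, 40, 45, 52, 60]
intervals = equi_depth_partition(ages, 4)
print(intervals)                  # e.g. [(23, 25), (29, 34), (37, 40), (45, 60)]
print(map_to_integers(intervals))
```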
17 Partitioning Quantitative Attributes
- Partial completeness gives a handle on the amount of information lost by partitioning: the lower the level of partial completeness, the less information is lost.
- Equi-depth partitioning minimizes the number of intervals required to satisfy a given partial completeness level.
18 Partial Completeness
- R: set of rules generated by considering all ranges over the raw values.
- R': set of rules generated by considering all ranges over the partitions.
- Measure the information loss: for each rule in R, how "far" is the "closest" rule in R'?
- Use the ratio of the supports of the rules as a measure of how far apart they are.
19 Partial Completeness Over Itemsets
Let C denote the set of all frequent itemsets in D. For any K >= 1, we call P K-complete w.r.t. C if:
- P ⊆ C, and any subset of an itemset in P is also in P;
- for every itemset X in C there is a generalization X' in P such that support(X') <= K * support(X), and for every Y ⊆ X there is a generalization Y' ⊆ X' such that support(Y') <= K * support(Y).
20 Sample Partial Completeness

Number  Itemset                 Support
1       Age: 20-30              5%
2       Age: 20-40              6%
3       Age: 20-50              8%
4       Cars: 1-2               5%
5       Cars: 1-3               6%
6       Age: 20-30, Cars: 1-2   4%
7       Age: 20-40, Cars: 1-3   5%

Itemsets 2, 3, 5, and 7 form a 1.5-complete set.
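A quick check of the generalization condition support(X') <= K * support(X) for the itemsets above (supports written as fractions; the pairing of each remaining itemset with its closest generalization in {2, 3, 5, 7} is spelled out here for illustration):

```python
K = 1.5
# (support of the specific itemset, support of its generalization in P)
pairs = [
    (0.05, 0.06),   # 1: Age 20-30            ->  2: Age 20-40
    (0.05, 0.06),   # 4: Cars 1-2             ->  5: Cars 1-3
    (0.04, 0.05),   # 6: Age 20-30, Cars 1-2  ->  7: Age 20-40, Cars 1-3
]
print(all(gen <= K * spec for spec, gen in pairs))   # True -> P is 1.5-complete
```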
21 Close Rule
Given a set of frequent itemsets P which is K-complete w.r.t. the set of all frequent itemsets, the minimum confidence used when generating rules from P must be set to 1/K times the desired level to guarantee that a close rule will be generated.
22 Determining the Number of Partitions
Given a partial completeness level K and equi-depth partitioning, the required number of intervals is
Number of intervals = 2n / (m * (K - 1))
where
n = number of quantitative attributes
m = minimum support (as a fraction)
K = partial completeness level
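For illustration, plugging in assumed values n = 2 quantitative attributes, minimum support m = 0.05 (5%), and partial completeness level K = 1.5:

```latex
\[
\text{Number of intervals} = \frac{2n}{m(K-1)}
  = \frac{2 \cdot 2}{0.05 \cdot 0.5} = 160
  \quad \text{(per quantitative attribute)}
\]
```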
23 Interest
Consider the following rules, where about a quarter of the people in the age group 20..30 are also in the age group 20..25:
- Age: 20..30 => Cars: 1..2 (8% support, 70% confidence)
- Age: 20..25 => Cars: 1..2 (2% support, 70% confidence)
The second rule is redundant. Capture such cases with a "greater than expected" interest measure.
24 Expected Values
- E_Pr(Z')[Pr(Z)]: the "expected" value of Pr(Z) based on Pr(Z'), where Z' is a generalization of Z.
- E_Pr(Y'|X')[Pr(Y|X)]: the "expected" confidence of the rule X => Y based on the rule X' => Y', where X' and Y' are generalizations of X and Y, respectively.
25 Expected Values, Cont
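Written out item by item, following the formulation in the original quantitative association rules paper (the z_i / y_j item-level notation is introduced here for illustration), where Z = {z_1, ..., z_n} and Z' = {z'_1, ..., z'_n} with each z'_i a generalization of z_i:

```latex
\[
E_{\Pr(Z')}[\Pr(Z)] = \Pr(Z') \prod_{i=1}^{n} \frac{\Pr(z_i)}{\Pr(z'_i)}
\qquad
E_{\Pr(Y'\mid X')}[\Pr(Y\mid X)] = \Pr(Y'\mid X') \prod_{j=1}^{m} \frac{\Pr(y_j)}{\Pr(y'_j)}
\]
```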
26 Interest Measure
A rule X => Y is R-interesting w.r.t. X' => Y' if:
- the support of X => Y is at least R times the expected support based on X' => Y', or the confidence is at least R times the expected confidence based on X' => Y', and
- the itemset X ∪ Y is R-interesting w.r.t. X' ∪ Y'.
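A minimal sketch of the support half of the R-interest test; the helper names and the numbers (taken from the Age/Cars example three slides back) are illustrative:

```python
def expected_support(gen_support, item_ratios):
    """item_ratios: Pr(item) / Pr(generalized item) for each specialized item."""
    exp = gen_support
    for ratio in item_ratios:
        exp *= ratio
    return exp

def r_interesting_support(actual_support, gen_support, item_ratios, R):
    return actual_support >= R * expected_support(gen_support, item_ratios)

# Age 20..25 covers about a quarter of Age 20..30, so the expected support of
# "Age: 20..25 => Cars: 1..2" is 0.25 * 8% = 2%; the actual 2% is not R times
# greater than expected, so the rule is pruned as uninteresting.
print(r_interesting_support(0.02, 0.08, [0.25], R=1.1))   # False
```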
27 Algorithm: Finding Frequent Itemsets
Based on the Apriori algorithm for finding boolean association rules:
- Candidate generation
  - Join phase
  - Subset prune phase
  - Interest prune phase
- Counting the support of candidates
28 Algorithm, Cont.
- A k-itemset is an itemset with k items.
- L_k: the set of frequent k-itemsets.
- L_{k-1} is used to generate C_k, the candidate k-itemsets.
- Scan the database, determine which of the candidates in C_k are contained in each record, and increment their support by one.
- At the end of the pass, C_k is examined to yield L_k.
A sketch of this level-wise loop follows.
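A minimal sketch of the level-wise loop (records are treated as plain sets of items; generate_candidates is the join/prune step sketched after the next two slides):

```python
def apriori(records, min_sup_count, generate_candidates):
    # L1: frequent 1-itemsets
    counts = {}
    for rec in records:
        for item in rec:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup_count}
    frequent = set(Lk)
    k = 2
    while Lk:
        Ck = generate_candidates(Lk, k)           # join + prune phases
        counts = {c: 0 for c in Ck}
        for rec in records:                       # one pass over the data
            for cand in Ck:
                if cand <= set(rec):              # candidate contained in record
                    counts[cand] += 1
        Lk = {c for c, n in counts.items() if n >= min_sup_count}   # yields L_k
        frequent |= Lk
        k += 1
    return frequent
```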
29 Candidate Generation
Join phase: L_{k-1} is joined with itself; the first k-2 items must be the same, and the attributes of the last two items must be different.
E.g., L_2 = { (Married: Yes, Age: 20..24), (Married: Yes, Age: 20..29), (Married: Yes, NumCars: 0..1), (Age: 20..29, NumCars: 0..1) }
After the join step, C_3 consists of:
{ (Married: Yes, Age: 20..24, NumCars: 0..1), (Married: Yes, Age: 20..29, NumCars: 0..1) }
30 Candidate Generation, Cont.
Subset prune phase: join results having some (k-1)-subset that is not in L_{k-1} are deleted.
E.g., delete (Married: Yes, Age: 20..24, NumCars: 0..1) because (Age: 20..24, NumCars: 0..1) is not in L_2.
Interest prune phase: further prune the candidate set according to the user-specified interest level.
A sketch of the join and subset-prune phases follows.
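A minimal sketch of the join and subset-prune phases for the example above; items are (attribute, range) pairs kept in a fixed attribute order, and the interest prune is omitted (the attribute ordering and data layout are assumptions for illustration):

```python
from itertools import combinations

ATTR_ORDER = {"Married": 0, "Age": 1, "NumCars": 2}   # illustrative ordering

def generate_candidates(Lk_minus_1, k):
    prev = {frozenset(s) for s in Lk_minus_1}
    ordered = [tuple(sorted(s, key=lambda it: ATTR_ORDER[it[0]])) for s in Lk_minus_1]
    candidates = set()
    # Join phase: first k-2 items identical, attributes of the last items differ
    for a in ordered:
        for b in ordered:
            if (a[:k - 2] == b[:k - 2]
                    and ATTR_ORDER[a[k - 2][0]] < ATTR_ORDER[b[k - 2][0]]):
                candidates.add(frozenset(a + (b[k - 2],)))
    # Subset prune phase: every (k-1)-subset must itself be in L_{k-1}
    return {c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}

L2 = [{("Married", "Yes"), ("Age", "20..24")},
      {("Married", "Yes"), ("Age", "20..29")},
      {("Married", "Yes"), ("NumCars", "0..1")},
      {("Age", "20..29"), ("NumCars", "0..1")}]
print(generate_candidates(L2, 3))   # only (Married: Yes, Age: 20..29, NumCars: 0..1) survives
```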
31 Counting the Support of Candidates
- Partition candidates into groups such that the candidates in each group have the same attributes and the same values for their categorical attributes.
- Replace each group with a single "super-candidate" consisting of:
  1) the common categorical attribute values, and
  2) a data structure representing the set of values of the quantitative attributes.
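A minimal sketch of the grouping into super-candidates; the candidate representation (categorical part as a frozenset, quantitative part as ranges) is an assumption for illustration:

```python
from collections import defaultdict

def build_super_candidates(candidates):
    """candidates: list of (categorical_items, quantitative_ranges), where
    categorical_items is a frozenset of (attr, value) pairs and
    quantitative_ranges is a tuple of (attr, lo, hi) triples."""
    supers = defaultdict(list)
    for cat_items, ranges in candidates:
        # candidates sharing the same categorical attribute values collapse
        # into one super-candidate keyed by that categorical part
        supers[cat_items].append(ranges)
    return supers

cands = [(frozenset({("Married", "Yes")}), (("Age", 20, 24),)),
         (frozenset({("Married", "Yes")}), (("Age", 20, 29),))]
print(build_super_candidates(cands))
```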
32 Counting the Support of Candidates, Cont.
- Find which "super-candidates" are supported by the categorical attributes in the record.
- If the categorical attributes of a "super-candidate" are supported by a given record, find which of the candidates inside that super-candidate are supported by the quantitative attributes.
33 Conclusion
- Partitioning and combining adjacent partitions
- Partial completeness
- "Greater-than-expected-value" interest measure
34 Questions??