Research issues on association rule mining Loo Kin Kong 26th February, 2003

Plan
- Recent trends in data mining
- Association rule interestingness
- Association rule mining on data streams
- Research directions
- Conclusion

Association rules
- First proposed in [Agrawal et al. 94]
- Given a database D of transactions, which contains only binary attributes
- For an itemset x, the support of x is defined as supp(x) = fraction of the transactions in D containing x
- An association rule is of the form I → J, where:
  - I ∩ J = ∅
  - supp(I ∪ J) ≥ σ_supp (the minimum support threshold)
  - supp(I ∪ J) / supp(I) ≥ σ_conf (the minimum confidence threshold)
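To make the definitions concrete, here is a minimal Python sketch; the toy database, item names, and the helper supp() are illustrative, not from the slides:

```python
# Toy transaction database (illustrative only).
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def supp(itemset, transactions):
    """supp(x): fraction of the transactions containing every item of x."""
    x = set(itemset)
    return sum(x <= t for t in transactions) / len(transactions)

# Rule I -> J with I = {bread}, J = {milk}:
I, J = {"bread"}, {"milk"}
support = supp(I | J, D)                    # supp(I ∪ J) = 2/4 = 0.5
confidence = supp(I | J, D) / supp(I, D)    # 0.5 / 0.75 ≈ 0.667
print(support, confidence)
```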

Recent trends in association rule mining
- Association rule interestingness
- Association rule mining on data streams
- Privacy preservation [Rizvi et al. 02]
- New data structures to improve the efficiency of finding frequent itemsets [Relue et al. 01]

Association rule interestingness – overview
- Problems with association rule mining:
  - Too many rules are mined
  - Mined rules may contain redundant or trivial rules
- Subjective approaches aim at minimizing the human effort involved
- Objective approaches aim at filtering out uninteresting rules based on some predefined interestingness measure

Subjective approaches
- Rule templates [Klemettinen et al. 94]
  - A rule template specifies which attributes are to occur in the LHS and RHS of a rule
  - e.g., any rule of the form “” & (any number of conditions) → “” is uninteresting
- By elimination [Sahar 99]
  - For a rule r = A → B, r′ = a → b is an ancestor rule if a ⊆ A and b ⊆ B; r′ is said to cover r
  - An ancestor rule can be classified as one of the following: True-Not-Interesting (TNI), Not-True-Interesting (NTI), Not-True-Not-Interesting (NTNI), True-Interesting (TI)

Objective approaches
- Statistical / problem-specific measures: entropy gain, lift, …
- Pruning redundant rules by the maximum entropy principle [Jaroszewicz 02]

Probability
- A finite probability space is a pair (S, P), in which:
  - S is a finite non-empty set
  - P is a mapping P: S → [0,1] satisfying Σ_{s∈S} P(s) = 1
- Each s ∈ S is called an event
- P(s), also denoted by p_s, is the probability of the event s
- The self-information of s is defined as I(s) = −log P(s)

Entropy
- A partition U is a collection of mutually exclusive elements whose union equals S; each element contains one or more events
- The measure of uncertainty that any event of a partition U would occur is called the entropy of U:
  H(U) = −p_1 log p_1 − p_2 log p_2 − … − p_N log p_N,
  where p_1, …, p_N are respectively the probabilities of the elements a_1, …, a_N of U
- H(U) is maximum when p_1 = p_2 = … = p_N = 1/N
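As an illustration, a small Python sketch of the entropy formula (using log base 2 is an assumption; the slides do not fix the base of the logarithm):

```python
import math

def entropy(probs):
    """H(U) = -sum(p * log2(p)) over the element probabilities of U.

    Elements with zero probability contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: the maximum for N = 2
print(entropy([0.25] * 4))    # 2.0 bits: the maximum for N = 4
print(entropy([0.9, 0.1]))    # ~0.47 bits: low uncertainty
```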

The maximum entropy method (MEM)
- The MEM determines the probabilities p_i of the events in a partition U, subject to various given constraints
- By MEM, when some of the p_i's are unknown, they must be chosen to maximize the entropy of U, subject to the given constraints

Definitions
- A constraint C is a pair C = (I, p), where:
  - I is an itemset
  - p ∈ [0,1] is the probability of I occurring in a transaction
- The set of constraints generated by an association rule I → J is defined as C(I → J) = {(I, supp(I)), (I ∪ J, supp(I ∪ J))}
- A rule K → J is a sub-rule of I → J if K ⊂ I

I-nonredundancy
- A rule I → J is considered I-nonredundant with respect to a set R of association rules if:
  - I = ∅, or
  - I(C_{I,J}(R), I → J) is larger than some threshold, where I(·) is either I_act(·) or I_pass(·), and C_{I,J}(R) is the set of constraints induced by all sub-rules of I → J in R

Pruning redundant association rules
Input: a set R of association rules
1. For each singleton A_i in the database:
2.   R_i = {∅ → A_i}
3.   k = 1
4.   For each rule I → A_i ∈ R with |I| = k, do:
5.     If I → A_i is I-nonredundant w.r.t. R_i, then
6.       R_i = R_i ∪ {I → A_i}
7.   k = k + 1
8.   Goto 4
9. R = ∪_i R_i
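Below is a hedged Python sketch of this loop. The slides do not define the measure I(·) (I_act or I_pass), so interestingness here is a hypothetical stand-in that scores a candidate rule against the rules already kept for the same consequent; all names are illustrative:

```python
def prune_redundant(rules, singletons, interestingness, threshold):
    """Sketch of the pruning loop above.

    rules: iterable of (antecedent frozenset, consequent item) pairs.
    interestingness: hypothetical stand-in for the measure I(.) of the
    slides; it scores a candidate rule against the rules kept so far
    (which induce the constraints C_{I,J}(R_i)).
    """
    result = set()
    for a in singletons:
        kept = {(frozenset(), a)}                     # step 2: R_i = {∅ → A_i}
        candidates = [(i, c) for (i, c) in rules if c == a]
        max_k = max((len(i) for i, _ in candidates), default=0)
        for k in range(1, max_k + 1):                 # steps 3-8: grow |I| = k
            for (i, c) in candidates:
                if len(i) == k and interestingness(kept, (i, c)) > threshold:
                    kept.add((i, c))                  # step 6: keep nonredundant rule
        result |= kept                                # step 9: union of all R_i
    return result
```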

Association rule interestingness: let's face it...
- "Interesting" is a subjective notion; domain knowledge is needed at some stage to determine what is interesting
- In fact, one may argue that a truly objective interestingness measure does not exist, because we are trying to model what is interesting
- But "objective" interestingness measures are still worth studying: they can act as a filter before any human intervention is required

Interesting or uninteresting?
- Consider the association rule r = I → J with supp(r) = 1% and conf(r) = 100%
- A question: is r interesting or uninteresting?
- Considering the support and/or confidence of one single rule may not be enough to determine whether a rule is interesting
- So we try to compare a rule with some other rule(s)

Observation: comparing a family of rules
- For a maximal frequent itemset I, the set of rules I′ → {i}, where i ∈ I and I′ ⊆ I \ {i}, forms a family of rules
- For example, for the maximal frequent itemset {abcde}, the rules
  - abcd → e, conf = supp({abcde}) / supp({abcd})
  - abc → e, conf = supp({abce}) / supp({abc})
  - abd → e, conf = supp({abde}) / supp({abd})
  - ...
  are in a family

[Figure: the lattice of all subsets of {a, b, c, d, e}, from ∅ at the bottom to abcde at the top; the half whose itemsets contain e is shown in one color (blue), the remaining half in another (orange).]

Observation: comparing a family of rules (cont'd)
- The blue half of the lattice is obtained by appending the item "e" to each node in the orange half
- The family of rules captures how the item "e" affects the support of the orange half of the lattice
- Idea: we may compare the confidences of rules in a family to find any "unusually" high or low confidences
- We can use some statistical tests to perform the comparison; there is no need for complicated statistical models (e.g., the MEM)
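A minimal sketch of enumerating such a family, assuming supp is a support lookup produced by an earlier frequent-itemset mining pass (the function and its signature are illustrative):

```python
from itertools import combinations

def rule_family(maximal, item, supp):
    """Confidences of all rules I' -> {item} with I' ⊆ maximal minus {item}.

    supp is assumed to map a frozenset of items to its support, e.g. a
    dictionary lookup built while mining frequent itemsets.
    """
    rest = sorted(set(maximal) - {item})
    family = {}
    for k in range(1, len(rest) + 1):
        for lhs in combinations(rest, k):
            lhs = frozenset(lhs)
            family[lhs] = supp(lhs | {item}) / supp(lhs)
    return family
```

The resulting confidences could then be screened with a simple outlier test (e.g., flagging values far from the family mean), in the spirit of the statistical comparison suggested above.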

Association rule mining on data streams
- In some new applications, data come as a continuous "stream":
  - The sheer volume of a stream over its lifetime is huge
  - Queries require timely answers
  - Examples: stock ticks, network traffic measurements
- A method for finding approximate frequency counts over data streams is proposed in [Manku et al. 02]

Goals of the paper
- Some notation:
  - Let N denote the current length of the stream
  - Let s ∈ (0,1) denote the support threshold
  - Let ε ∈ (0,1) denote the error tolerance
- The algorithm ensures that:
  - All itemsets whose true frequency exceeds sN are reported (i.e., no false negatives)
  - No itemset whose true frequency is less than (s − ε)N is output
  - Estimated frequencies are less than the true frequencies by at most εN

The simple case: finding frequent items
- Each transaction in the stream contains only 1 item
- Two algorithms were proposed, namely:
  - the Sticky Sampling Algorithm
  - the Lossy Counting Algorithm
- Features of the algorithms:
  - Sampling techniques are used
  - Frequency counts found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
  - For Lossy Counting, all frequent items are reported

Lossy Counting Algorithm
- The incoming data stream is conceptually divided into buckets of w = ⌈1/ε⌉ transactions
- Counts are kept in a data structure D
- Each entry in D is of the form (e, f, Δ), where:
  - e is the item
  - f is the frequency of e in the stream since the entry was inserted into D
  - Δ is the maximum possible count of e in the stream before e was added to D

Lossy Counting Algorithm (cont'd)
Notation: D is the set of all counts; N is the current length of the stream; e is the next transaction (item); w is the bucket width; b is the current bucket id.
1. D ← ∅; N ← 0
2. w ← ⌈1/ε⌉; b ← 1
3. e ← next transaction; N ← N + 1
4. if (e, f, Δ) exists in D then
5.   f ← f + 1
6. else
7.   insert (e, 1, b − 1) into D
8. endif
9. if N mod w = 0 then
10.   prune(D, b); b ← b + 1
11. endif
12. goto 3

function prune(D, b)
1. for each entry (e, f, Δ) in D do
2.   if f + Δ ≤ b then
3.     remove the entry from D
4.   endif
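A compact Python rendering of the above, as a sketch rather than the paper's implementation (the class and method names are invented for illustration):

```python
import math

class LossyCounter:
    """Sketch of Lossy Counting for single items."""

    def __init__(self, epsilon):
        self.eps = epsilon                # error tolerance ε
        self.w = math.ceil(1 / epsilon)   # bucket width ⌈1/ε⌉
        self.b = 1                        # current bucket id
        self.n = 0                        # stream length N so far
        self.counts = {}                  # e -> (f, Δ)

    def add(self, e):
        self.n += 1
        if e in self.counts:
            f, delta = self.counts[e]
            self.counts[e] = (f + 1, delta)     # steps 4-5: increment f
        else:
            self.counts[e] = (1, self.b - 1)    # step 7: insert (e, 1, b-1)
        if self.n % self.w == 0:                # bucket boundary reached
            self.counts = {k: (f, d) for k, (f, d) in self.counts.items()
                           if f + d > self.b}   # prune: drop entries with f + Δ ≤ b
            self.b += 1

    def frequent(self, s):
        """Report items with f ≥ (s − ε)N: no frequent item is missed."""
        return [e for e, (f, _) in self.counts.items()
                if f >= (s - self.eps) * self.n]
```

Usage would be lc = LossyCounter(epsilon=0.001), then lc.add(e) per stream element and lc.frequent(s) to answer a query.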

Lossy Counting
- Lossy Counting guarantees that:
  - When a deletion occurs, b ≤ εN
  - If an entry (e, f, Δ) is deleted, then f_e ≤ b, where f_e is the actual frequency count of e
  - Hence, if an entry (e, f, Δ) is deleted, f_e ≤ εN
  - Finally, f ≤ f_e ≤ f + εN

The more complex case: finding frequent itemsets
- The Lossy Counting algorithm is extended to find frequent itemsets; transactions in the data stream may contain any number of items
- Essentially the same as the case for single items, except:
  - Multiple buckets (β of them, say) are processed in a batch
  - Each entry in D is of the form (set, f, Δ)
  - Transactions read in are (wisely) expanded to their subsets

Association rule mining on data streams: food for thought
- Challenges in mining from data streams:
  - Fast updates
  - Data are usually not permanently stored (but may be buffered)
  - Fast responses to queries
  - Minimal resource usage (e.g., the number of counts kept)
- Possible interesting problems concerning association rule mining on data streams:
  - More efficient/accurate algorithms for finding association rules on data streams
  - Mining changes in frequency counts

The lattice structure
- A bottleneck in the algorithm proposed in [Manku et al. 02] is that it needs to expand a transaction to its subsets for counting
- For example, for a transaction {abcde}, we may need to count the itemsets {a}, {b}, {c}, {d}, {e}, {ab}, {ac}, ...
- Hence updates are expensive (although queries can be fast)
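To see the blow-up concretely: a transaction of n items has 2^n − 1 non-empty subsets. A toy sketch (this ignores the pruning tricks the actual algorithm uses to avoid materializing all of them):

```python
from itertools import combinations

def nonempty_subsets(transaction):
    """Yield all non-empty subsets of a transaction: 2^|t| - 1 of them."""
    items = sorted(transaction)
    for k in range(1, len(items) + 1):
        yield from combinations(items, k)

# A 5-item transaction already expands to 31 candidate itemsets.
print(sum(1 for _ in nonempty_subsets({"a", "b", "c", "d", "e"})))  # 31
```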

The lattice structure (cont'd)
[Figure: the lattice of all subsets of {a, b, c, d, e}, from ∅ to abcde, illustrating the subsets to which a single transaction may be expanded.]

Conclusion
- Both association rule interestingness and mining on data streams are challenging problems
- Research on rule interestingness can make association rule mining a more effective tool for knowledge discovery
- Association rule mining on data streams is an upcoming application area and a promising direction for research

References
[Agrawal et al. 94] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. VLDB 1994.
[Jaroszewicz 02] S. Jaroszewicz and D. A. Simovici. Pruning Redundant Association Rules Using Maximum Entropy Principle. PAKDD 2002.
[Klemettinen et al. 94] M. Klemettinen et al. Finding Interesting Rules from Large Sets of Discovered Association Rules. CIKM 1994.
[Manku et al. 02] G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB 2002.
[Relue et al. 01] R. Relue, X. Wu and H. Huang. Efficient Runtime Generation of Association Rules. CIKM 2001.
[Rizvi et al. 02] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. VLDB 2002.
[Sahar 99] S. Sahar. Interestingness Via What Is Not Interesting. KDD 1999.

Q & A