CS 590M Fall 2001: Security Issues in Data Mining Lecture 5: Association Rules, Sequential Associations.

Why Association Rules?
Understand attributes, not entities
Discover relationships that
  – Show some dependency between attributes
  – Are “interesting”
Give an understanding of the data space

Formal Definition
Data:
  – Items I = {i_1, …, i_n}
  – Transactions T = {t_1, …, t_m} where t_i = {i_j1, …, i_jk}
Support: Given A ⊆ I, supp(A) = |{t ∈ T | t ⊇ A}| / |T|
Goal: Find rules A ⇒ B with support ≥ s and confidence ≥ c where:
  – A, B ⊆ I, A ∩ B = ∅
  – support = supp(A ∪ B), confidence = supp(A ∪ B) / supp(A)

Sample: Market Basket
T \ I  Hardware  Auto  Clothing  Furnishings  Paper goods  Grocery
t0     -         -     +         -            +            -
t1     +         +     -         -            -            -
t2     +         -     -         -            +            -
t3     -         -     +         +            +            -
t4     +         +     -         -            +            -
t5     -         -     -         -            +            +
t6     -         -     +         +            -            -
t7     +         +     -         -            +            -
t8     -         -     +         -            +            +
t9     -         -     -         -            +            -
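As a worked example of the support and confidence definitions, the sketch below encodes the sample basket and evaluates two rules over it. The helper names (`count`, `supp`, `confidence`) are illustrative choices, not part of the lecture.

```python
# Items (columns) and transactions (rows) from the sample market basket.
ITEMS = ["Hardware", "Auto", "Clothing", "Furnishings", "Paper goods", "Grocery"]
ROWS = {
    "t0": "--+-+-", "t1": "++----", "t2": "+---+-", "t3": "--+++-",
    "t4": "++--+-", "t5": "----++", "t6": "--++--", "t7": "++--+-",
    "t8": "--+-++", "t9": "----+-",
}

# Each transaction becomes the set of items marked "+".
transactions = {
    name: {item for item, mark in zip(ITEMS, marks) if mark == "+"}
    for name, marks in ROWS.items()
}

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions.values() if itemset <= t)

def supp(itemset):
    """supp(A) = |{t in T | t contains A}| / |T|."""
    return count(itemset) / len(transactions)

def confidence(a, b):
    """Confidence of A => B, i.e. supp(A u B) / supp(A), computed from counts."""
    return count(a | b) / count(a)

print(supp({"Hardware", "Auto"}))          # 0.3
print(confidence({"Hardware"}, {"Auto"}))  # 0.75
print(confidence({"Auto"}, {"Hardware"}))  # 1.0
```

Every basket with Auto also has Hardware, so Auto ⇒ Hardware has confidence 1.0, while the reverse rule holds in only three of the four Hardware baskets.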

Types of associations
Machine-learning base: classification / decision rules
  – Entities independent, unordered
  – Find rules leading to target class
  – To get rule sets, re-run for all classes as targets
Market-basket
  – Collection of related entities with same key
  – Can be modeled as independent entities, sparse data
Sequential
  – Like market basket, but group by distance rather than same key

Historical Association Rule Learning
Decision tree converted to rules
  – ID3, as discussed in previous lecture
Direct production of decision rules
  – CN2, others
Problem: Algorithms don’t scale well to many practical problems

Database community contribution: Market Basket Association Rules
Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207-216, Washington, D.C., May 1993.
Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994, pages 487-499.

Database community contribution: Market Basket Association Rules
Practical problems often have sparse data
  – Many attributes, few items per transaction
Goal is typically search for high support
  – High support = broad impact
  – High confidence not crucial (as opposed to classification)
Very large data sets (main-memory algorithms impractical)

A-Priori Algorithm
Observation: if A has support s, then
  – ∀ i ∈ A, supp(i) ≥ s
Gives bottom-up algorithm
  – Find single items with support ≥ s
  – For pairs, look only at subsets of transactions made up of those items
  – Recurse

A-Priori Algorithm
First, generate all large itemsets
  – Sets X ⊆ I such that supp(X) ≥ s (threshold)
  – Captures “supp(A ∪ B) ≥ s” part of problem
Second, find high-confidence rules that are subsets of X
  – B = X_i, A = X − B
  – To find confidence, need supp(A)
But A is itself a large itemset (every subset of a large itemset is large), so its support has already been counted and there is no need to go back to the database!
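The second phase can be sketched as follows: given the counts of the large itemsets, rules are formed by splitting each itemset X into a consequent B and an antecedent A = X − B, with no further pass over the data. The dictionary `large_counts` is a hand-filled subset of the counts from the sample market basket, and the function name `rules` is an assumption made for illustration.

```python
from itertools import combinations

# Counts (out of 10 transactions) for some large itemsets of the sample basket.
large_counts = {
    frozenset({"Hardware"}): 4,
    frozenset({"Auto"}): 3,
    frozenset({"Paper goods"}): 8,
    frozenset({"Hardware", "Auto"}): 3,
    frozenset({"Hardware", "Paper goods"}): 3,
    frozenset({"Auto", "Paper goods"}): 2,
    frozenset({"Hardware", "Auto", "Paper goods"}): 2,
}

def rules(large_counts, min_conf):
    """Return (A, B, confidence) for every rule A => B with confidence >= min_conf."""
    found = []
    for x, x_count in large_counts.items():
        # Try every non-empty proper subset B of X as the consequent.
        for size in range(1, len(x)):
            for b in combinations(sorted(x), size):
                b = frozenset(b)
                a = x - b
                # A is a subset of a large itemset, so its count is already known.
                conf = x_count / large_counts[a]
                if conf >= min_conf:
                    found.append((a, b, conf))
    return found

for a, b, conf in rules(large_counts, min_conf=0.9):
    print(sorted(a), "=>", sorted(b), conf)
```

The lookup `large_counts[a]` relies on the table containing every subset of every large itemset, which is what the first phase guarantees.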

A-Priori Algorithm
L_1 = {large 1-itemsets};
for ( k = 2; L_{k-1} ≠ ∅; k++ ) {
    C_k = select p.i_1, p.i_2, …, p.i_{k-1}, q.i_{k-1}
          from L_{k-1} p, L_{k-1} q
          where p.i_1 = q.i_1, …, p.i_{k-2} = q.i_{k-2}, p.i_{k-1} < q.i_{k-1};
    ∀ transactions t ∈ T {
        C_t = subset(C_k, t);    // Candidates contained in t
        ∀ candidates c ∈ C_t : c.count++;
    }
    L_k = {c ∈ C_k | c.count ≥ minsup};
}
Answer = ∪_k L_k;
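A minimal Python sketch of the level-wise loop, run on the sample market basket, might look like the following. It assumes itemsets are stored as sorted tuples and that `minsup_count` is an absolute count rather than a fraction. It also adds the subset-pruning step from the Agrawal and Srikant paper, which the slide does not show, and it tests each candidate against each transaction directly instead of using the paper's `subset` hash tree.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset (sorted tuple): count} for all large itemsets."""
    # L_1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup_count}
    answer = dict(large)

    k = 2
    while large:
        # Join step: merge two (k-1)-itemsets that share their first k-2 items.
        prev = sorted(large)
        candidates = set()
        for p in prev:
            for q in prev:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # Prune step (from the paper): every (k-1)-subset must be large.
                    if all(s in large for s in combinations(c, k - 1)):
                        candidates.add(c)
        # Count step: one pass over the transactions per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup_count}
        answer.update(large)
        k += 1
    return answer

# The ten sample transactions, written out as sets.
baskets = [
    {"Clothing", "Paper goods"}, {"Hardware", "Auto"},
    {"Hardware", "Paper goods"}, {"Clothing", "Furnishings", "Paper goods"},
    {"Hardware", "Auto", "Paper goods"}, {"Paper goods", "Grocery"},
    {"Clothing", "Furnishings"}, {"Hardware", "Auto", "Paper goods"},
    {"Clothing", "Paper goods", "Grocery"}, {"Paper goods"},
]
result = apriori(baskets, minsup_count=2)
print(result[("Auto", "Hardware", "Paper goods")])  # 2
```

With a threshold of two transactions, the candidate {Clothing, Furnishings, Paper goods} is pruned without being counted, because its subset {Furnishings, Paper goods} occurs only once.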

Frequent episodes for sequential associations
Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Discovering Frequent Episodes in Sequences. In First International Conference on Knowledge Discovery and Data Mining (KDD'95), pages 210-215, Montreal, Canada, August 1995. AAAI Press.
Instead of transactions, items are grouped by a sliding window in time
Same basic idea as A-Priori

Frequent Episodes: Definition
Event types E
Event (A, t) where A ∈ E
Sequence S = ((A_1, t_1), …, (A_n, t_n))
Frequent episode F = (A_i, …, A_j) where
  – ∃ t_l, t_m such that t_1 ≤ t_l < … < t_m ≤ t_n and t_m − t_l ≤ window
  – count(((A_i, t_l), …, (A_j, t_m))) ≥ support
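One simplified way to turn this definition into code is to slide a window over the timeline and count the windows in which the episode's event types occur in order. The sketch below assumes integer, distinct timestamps, a sequence already sorted by time, and a window covering [start, start + window]. The event names are made up for illustration, and the window bookkeeping in the Mannila et al. paper differs in detail.

```python
def occurs_in_order(episode, events):
    """True if the event types in `episode` appear in `events` in that order."""
    remaining = iter(events)
    # Each `any` consumes the shared iterator up to the first match,
    # so later event types must be found strictly after earlier ones.
    return all(any(a == target for a, _ in remaining) for target in episode)

def count_windows(sequence, episode, window):
    """Count window positions whose events contain `episode` in order."""
    if not sequence:
        return 0
    first, last = sequence[0][1], sequence[-1][1]
    hits = 0
    for start in range(first, last + 1):
        in_window = [(a, t) for a, t in sequence if start <= t <= start + window]
        if occurs_in_order(episode, in_window):
            hits += 1
    return hits

# Hypothetical audit-log events as (event type, time) pairs.
log = [
    ("login_fail", 1), ("login_fail", 2), ("login_ok", 3),
    ("su", 5), ("login_fail", 9), ("login_ok", 10),
]
print(count_windows(log, ("login_fail", "login_ok"), window=2))  # 4
print(count_windows(log, ("login_ok", "login_fail"), window=2))  # 0
```

An episode would then be called frequent when this count reaches the chosen support threshold.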

Applications/Issues in Security
Frequent episodes in intrusion detection data
  – What does this tell us?
Preventing the discovery of associations
  – Known items to protect
  – What if we don’t know what we want to protect?
