Slide 1: Mining Frequent Patterns, Associations, and Correlations (Chapter 5)
(From "Data Mining: Concepts and Techniques")
Slide 2: Chapter 5 Outline
- What is association rule mining
- Mining single-dimensional Boolean association rules
- Mining multilevel association rules
- Mining multidimensional association rules
- From association mining to correlation analysis
- Summary
Slide 3: What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Rule form: "Body → Head [support, confidence]".
- Examples:
  - buys(X, "diapers") → buys(X, "beers") [0.5%, 60%]
  - age(X, "20..29") ^ income(X, "30..39K") → buys(X, "PC") [2%, 60%]
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, classification, etc.
Slide 4: Denotation of Association Rules
- Rule form: "Body → Head [support, confidence]" (association, correlation, and causality).
- Rule examples:
  - sales(T, "computer") → sales(T, "software") [support = 1%, confidence = 75%]
  - buy(T, "Beer") → buy(T, "Diaper") [support = 2%, confidence = 70%]
  - age(X, "20..29") ^ income(X, "30..39K") → buys(X, "PC") [support = 2%, confidence = 60%]
- (The slide's callouts label the parts of a rule: data object, attribute name, data content, and data context.)
Slide 5: Association Rule Mining (Basic Concepts)
- Given: a database of transactions, where each transaction is a list of items (e.g., purchased by a customer in one visit).
- Find: all rules that correlate the presence of one set of items with that of another set of items (X → Y).
  - E.g., 98% of people who purchase tires and auto accessories also get car maintenance done.
- Application examples:
  - * → Car Maintenance Agreement (what should the store do to boost Car Maintenance Agreement sales?)
  - Home Electronics → * (what other products should the store stock up on?)
Slide 6: Association Rule Mining (Support and Confidence)
- Given a transaction database, find all rules X → Y with minimum support and confidence:
  - Support S: the probability that a transaction contains both X and Y.
  - Confidence C: the conditional probability that a transaction containing X also contains Y.
- Notation: I = {i1, i2, ..., in} is the set of all items; a transaction Tj is a subset of I.
- Example (min. support 50%, min. confidence 50%): A → C (50%, 66.6%), C → A (50%, 100%).
- (The slide's Venn diagram shows customers buying X, customers buying Y, and customers buying both.)
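The support and confidence numbers above can be reproduced with a few lines of Python. This is a minimal sketch, not library code; the four-transaction database is an assumption chosen to be consistent with the slide's figures (A → C at 50%/66.6%, C → A at 50%/100%).

    # Toy transaction database (assumed; consistent with the slide's numbers).
    transactions = [
        {"A", "B", "C"},
        {"A", "C"},
        {"A", "D"},
        {"B", "E", "F"},
    ]

    def support(itemset, db):
        # Fraction of transactions that contain every item of `itemset`.
        return sum(1 for t in db if itemset <= t) / len(db)

    def confidence(lhs, rhs, db):
        # Conditional probability: support of the union over support of the body.
        return support(lhs | rhs, db) / support(lhs, db)

    print(support({"A", "C"}, transactions))       # 0.5      -> 50% support
    print(confidence({"A"}, {"C"}, transactions))  # 0.666... -> A → C at 66.6%
    print(confidence({"C"}, {"A"}, transactions))  # 1.0      -> C → A at 100%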
Slide 7: A Road Map for Association Rule Mining (Types of Association Rules)
- Boolean vs. quantitative associations (based on the types of data values):
  - buys(x, "SQLServer") ^ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]
  - age(x, "30..39") ^ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
- Single-dimensional vs. multidimensional associations (see the examples above).
- Single-level vs. multiple-level analysis: e.g., what brands of beers are associated with what brands of diapers?
- Various extensions:
  - Correlation and causality analysis: association does not necessarily imply correlation or causality.
  - Enforced constraints: e.g., small sales (sum < 1,000)?
Slide 8: Chapter 5 Outline (section divider; see Slide 2)
Slide 9: Mining Association Rules (How to Find Strong Rules: An Example)
- Goal: find strong rules with support >= 50% and confidence >= 50%.
- 1) Find the frequent itemsets: the sets of items that have minimum support.
- 2) Generate association rules from the frequent itemsets.
  - Rule form: "Body → Head [support, confidence]" with min. support and confidence.
  - Example for rule A → C: find the supports of {A, C} and {A}:
      support = support({A, C}) = 50% > min. support
      confidence = support({A, C}) / support({A}) = 66.6% > min. confidence
- Problem: how many itemsets and rules are there? A → C; ...; C → A; ...; A, B → C; ...
Slide 10: Mining Association Rules Efficiently (the Apriori Algorithm)
- 1) Find the frequent itemsets (key steps):
  - Iteratively find frequent itemsets of cardinality 1 to k (from 1-itemsets to k-itemsets) using the Apriori principle.
  - Frequent itemset: a set of items that has minimum support.
  - Apriori principle: any subset of a frequent itemset must also be frequent. I.e., if {A, B} is a frequent 2-itemset, both {A} and {B} must be frequent 1-itemsets; if either {A} or {B} is not frequent, {A, B} cannot be frequent.
- 2) Generate association rules from the frequent itemsets.
Slide 11: The Apriori Algorithm: Example (support >= 50%, i.e., support count >= 2)
- (The slide traces a small database D through the candidate/frequent itemset sequence: scan D → C1 → L1 → C2 → L2 → C3 → L3.)
Slide 12: The Apriori Algorithm (Pseudo-Code for Finding Frequent Itemsets)
- Iteratively: C1 → L1 → C2 → L2 → C3 → L3 → ... → CN
- Ck (like a table): the set of candidate itemsets of size k.
- Lk (like a table): the set of frequent itemsets of size k.
- Pseudo-code:

    C1 = {candidate 1-itemsets};
    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != empty; k++) do begin
        Ck+1 = candidates generated from Lk           // Lk → Ck+1: join & prune
        for each transaction t in the database do     // scan the database
            increment the count of all itemsets in Ck+1 contained in t
        Lk+1 = itemsets in Ck+1 with support >= min_support   // Ck+1 → Lk+1
    end
    return the union of all Lk;
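The loop above is compact enough to run directly. Below is a hedged Python sketch of it (not the textbook's code): itemsets are frozensets, and apriori_gen performs the join-and-prune candidate generation described on the next two slides.

    from itertools import combinations

    def apriori_gen(Lk):
        # Join Lk with itself, then prune candidates having an infrequent k-subset.
        k = len(next(iter(Lk)))
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        return {c for c in candidates
                if all(frozenset(s) in Lk for s in combinations(c, k))}

    def apriori(db, min_sup):
        db = [frozenset(t) for t in db]
        items = {frozenset([i]) for t in db for i in t}
        L = {c for c in items if sum(c <= t for t in db) / len(db) >= min_sup}
        frequent = set(L)
        while L:
            C = apriori_gen(L)                        # Lk -> Ck+1 (join & prune)
            L = {c for c in C                         # scan DB: Ck+1 -> Lk+1
                 if sum(c <= t for t in db) / len(db) >= min_sup}
            frequent |= L
        return frequent

    db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(sorted(sorted(s) for s in apriori(db, 0.5)))
    # [['A'], ['A', 'C'], ['B'], ['C']]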
Slide 13: Generate Candidate Itemsets Ck+1 from Lk (Join & Prune)
- Representation (Lk → Ck+1):
  - Lk: the set of frequent itemsets of size k.
  - Ck+1: the set of candidate itemsets of size k+1.
- Steps for computing the candidate (k+1)-itemsets Ck+1:
  - 1) Join step: Ck+1 is generated by joining Lk with itself (note the length constraint: k → k+1).
  - 2) Prune step (apply the Apriori principle): any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset, so use Lk to prune Ck+1.
Slide 14: Generation of Candidate Itemsets Ck+1 (Pseudo-Code)
- Suppose the items in each itemset of Lk are listed (sorted) in order.
- Step 1: self-join Lk (SQL pseudo-code; p and q are aliases of Lk, and the items are sorted):

    insert into Ck+1
    select p.item1, p.item2, ..., p.itemk, q.itemk
    from Lk p, Lk q
    where p.item1 = q.item1, ..., p.itemk-1 = q.itemk-1, p.itemk < q.itemk

- Step 2: prune. Any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset (Ck+1 is like a table):

    for every (k+1)-itemset c in Ck+1 do
        for every k-subset s of c do
            if (s is not in Lk) then delete c from Ck+1
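A direct Python rendering of the two steps, assuming each itemset in Lk is stored as a sorted tuple so the SQL's "p.item1 = q.item1, ..., p.itemk < q.itemk" join condition can be mirrored literally:

    from itertools import combinations

    def join_and_prune(Lk):
        k = len(next(iter(Lk)))
        Lk = sorted(Lk)
        # Join step: pair itemsets sharing their first k-1 items, last items ordered.
        joined = [p[:k - 1] + (p[k - 1], q[k - 1])
                  for p in Lk for q in Lk
                  if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]]
        # Prune step: drop any candidate with a k-subset that is not in Lk.
        Lk = set(Lk)
        return [c for c in joined
                if all(s in Lk for s in combinations(c, k))]

    L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
    print(join_and_prune(L3))   # [('a', 'b', 'c', 'd')]: acde is pruned (ade not in L3)

The L3 → C4 run matches the worked example on the next slide.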
Slide 15: Generating Candidate Itemsets C4 (Example: L3 → C4)
- L3 = {abc, abd, acd, ace, bcd}. What is C4?
- Self-joining L3 * L3 (pairwise joining of aliases p and q over L3):
  - abcd (from abc + abd): joins, goes into C4.
  - abce (from abc + ace): does not join (abc and ace do not share their first two items).
  - acde (from acd + ace): joins, goes into C4.
- Pruning: acde is removed from C4 because ade is not in L3 (nor is cde); the 3-subsets of acde are acd, ace, ade, cde.
- Therefore C4 = {abcd}.
Slide 16: Is Apriori Fast Enough? (Performance Bottlenecks)
- The core of the Apriori algorithm:
  - Compute Ck (Lk-1 → Ck): use the frequent (k-1)-itemsets to generate candidate k-itemsets (candidate generation).
  - Compute Lk (Ck → Lk): scan the database and pattern-match to collect counts for the candidate itemsets.
- The bottleneck of Apriori is candidate generation:
  - Huge candidate sets: 10^4 frequent 1-itemsets generate roughly 10^7 candidate 2-itemsets; what would be the number of 3-itemsets, 4-itemsets, ...?
  - For a transaction database over 100 items, e.g., {a1, a2, ..., a100}, there are 2^100, about 10^30, possible candidate itemsets.
- Multiple database scans to compute the Lk: N scans are needed, where N is the length of the longest pattern.
Slide 17: Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose hash-bucket count is below the threshold cannot be frequent (see the sketch after this list).
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (enables parallel computing).
- Sampling: mine a subset of the given data with a lowered support threshold, plus a method to determine completeness.
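As an illustration of the first technique, here is a hedged sketch of hash-bucket counting for 2-itemsets (in the spirit of the DHP approach). Bucket counts upper-bound each pair's count, so a pair whose bucket stays below min_count can be discarded before C2 is ever built; collisions only make the filter conservative, never wrong.

    from itertools import combinations

    def hash_filter(db, min_count, n_buckets=7):
        buckets = [0] * n_buckets
        for t in db:
            for pair in combinations(sorted(t), 2):
                buckets[hash(pair) % n_buckets] += 1
        # A pair can be frequent only if its bucket count reaches min_count.
        return lambda pair: buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_count

    db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    maybe_frequent = hash_filter(db, min_count=2)
    print(maybe_frequent(("A", "C")))   # True: the pair occurs twice, so its bucket >= 2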
Slide 18: Mining Frequent Patterns by FP-Growth (Without Candidate Generation or Repeated DB Scans)
- Avoid costly database scans: compress a large database into a compact frequent-pattern tree (FP-tree) structure that is highly condensed yet complete for frequent pattern mining.
- Avoid costly candidate generation: an efficient FP-tree-based frequent pattern mining method.
- A divide-and-conquer methodology: decompose mining tasks into smaller ones.
Slide 19: FP-growth vs. Apriori (Scalability with the Support Threshold)
- (The slide plots run time against support threshold on the data set T25I20D10K.)
- For a data set named TxIyDz: x is the average transaction length, y the average itemset length, and z the total number of transactions.
Slide 20: Major Steps of the FP-growth Algorithm
- 1) Construct the FP-tree.
- 2) Construct the conditional pattern base for each item in the FP-tree's header table.
- 3) Construct the conditional FP-tree from each entry of the conditional pattern base.
- 4) Derive frequent patterns by mining the conditional FP-trees recursively; if a conditional FP-tree contains a single path, simply enumerate all the patterns.
Slide 21: Construct the FP-tree from a Transaction DB (min_support = 50%, i.e., support count >= 2.5)
- Steps:
  - 1. Scan the DB once and find the frequent 1-itemsets.
  - 2. Order the frequent items in descending frequency order.
  - 3. Scan the DB again and construct the FP-tree.
- Transactions, with frequent items reordered:

    TID | Items bought            | (Ordered) frequent items
    100 | f, a, c, d, g, i, m, p  | f, c, a, m, p
    200 | a, b, c, f, l, m, o     | f, c, a, b, m
    300 | b, f, h, j, o           | f, b
    400 | b, c, k, s, p           | c, b, p
    500 | a, f, c, e, l, p, m, n  | f, c, a, m, p

- Header table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3; each entry heads a chain of node-links into the tree.
- (The slide draws the resulting FP-tree: root {} with main path f:4 → c:3 → a:3 → m:2 → p:2, a branch b:1 → m:1 under a:3, a branch b:1 under f:4, and a separate path c:1 → b:1 → p:1.)
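A minimal FP-tree construction sketch in Python (illustrative, not the paper's code). Ties in frequency are broken alphabetically here, so the item order comes out c, f, a, b, m, p rather than the slide's f, c, a, b, m, p; the tree is equivalent up to that swap.

    from collections import defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent, self.count = item, parent, 1
            self.children = {}

    def build_fp_tree(db, min_count):
        freq = defaultdict(int)
        for t in db:                                   # scan 1: count items
            for item in t:
                freq[item] += 1
        freq = {i: c for i, c in freq.items() if c >= min_count}
        order = sorted(freq, key=lambda i: (-freq[i], i))   # frequency-descending
        root, header = Node(None, None), defaultdict(list)  # header holds node-links
        for t in db:                                   # scan 2: insert ordered items
            node = root
            for item in [i for i in order if i in t]:
                if item in node.children:
                    node.children[item].count += 1     # shared prefix: bump count
                else:
                    node.children[item] = Node(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
        return root, header

    db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
    root, header = build_fp_tree(db, min_count=3)
    print({i: sum(n.count for n in header[i]) for i in header})
    # item totals: c:4, f:4, a:3, b:3, m:3, p:3 (matches the header table)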
Slide 22: Benefits of the FP-tree Structure
- Completeness:
  - Never breaks a long pattern of any transaction.
  - Preserves complete information for frequent pattern mining.
- Compactness:
  - Reduces irrelevant information: infrequent items are gone.
  - Frequency-descending ordering: more frequent items are more likely to be shared.
  - Never larger than the original database (not counting node-links and counts).
  - Example: for the Connect-4 DB, the compression ratio can be over 100.
Slide 23: Step 2: Construct the Conditional Pattern Base
- Starting from the 2nd item of the FP-tree's header table, traverse the FP-tree by following the node-links of each frequent item.
- Accumulate all transformed prefix paths of each frequent item to form that item's entry in the conditional pattern base.
- Conditional pattern base:

    Item | Conditional pattern base
    c    | f:3
    a    | fc:3
    b    | fca:1, f:1, c:1
    m    | fca:2, fcab:1
    p    | fcam:2, cb:1
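Continuing the sketch from Slide 21: an item's conditional pattern base falls out of the node-links and the prefix-path property. With the alphabetical tie-breaking used there, m's base comes out as cfa:2, cfab:1, mirroring the slide's fca:2, fcab:1 with c and f swapped.

    def conditional_pattern_base(item, header):
        base = []
        for node in header[item]:                  # follow the item's node-links
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)           # climb the prefix path
                parent = parent.parent
            if path:
                # The prefix carries the count of the item's own node.
                base.append((list(reversed(path)), node.count))
        return base

    print(conditional_pattern_base("m", header))
    # [(['c', 'f', 'a'], 2), (['c', 'f', 'a', 'b'], 1)]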
Slide 24: Properties of the FP-tree for Conditional Pattern Base Construction
- Node-link property: for a frequent item a_i, all the frequent itemsets that contain a_i can be obtained by following a_i's node-links, starting from a_i's entry in the FP-tree's header table.
- Prefix-path property: to calculate the frequent itemsets containing a_i in a path P, only the prefix sub-path of node a_i in P needs to be accumulated, and its frequency count carries the same count as node a_i.
Slide 25: Step 3: Construct the Conditional FP-tree
- For each conditional pattern-base entry (illustrated here by m):
  - Accumulate the count of each item in the entry.
  - Construct the conditional FP-tree over the entry's frequent items.
- m's conditional pattern base: fca:2, fcab:1.
- m's conditional FP-tree: the single path {} → f:3 → c:3 → a:3 (b is dropped because its accumulated count, 1, is below the minimum support count of 3).
Slide 26: Conditional Pattern Bases and Conditional FP-trees (minimum support count = 3; "|" denotes concatenation)

    Item | Conditional pattern base  | Conditional FP-tree
    f    | Empty                     | Empty
    c    | {(f:3)}                   | {(f:3)}|c
    a    | {(fc:3)}                  | {(f:3, c:3)}|a
    b    | {(fca:1), (f:1), (c:1)}   | Empty
    m    | {(fca:2), (fcab:1)}       | {(f:3, c:3, a:3)}|m
    p    | {(fcam:2), (cb:1)}        | {(c:3)}|p
Slide 27: Step 4: Recursively Mine the Conditional FP-trees
- Suppose a conditional FP-tree T has a single path P: then all frequent itemsets of T can be generated by enumerating all the item combinations in P.
- Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3.
- All frequent itemsets containing m: m, fm, cm, am, fcm, fam, cam, fcam (every item combination in P, each with m appended).
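The single-path case is just a power-set enumeration; a short Python sketch reproduces the eight patterns above:

    from itertools import combinations

    path = ["f", "c", "a"]            # the m-conditional FP-tree's single path
    patterns = [combo + ("m",)        # every combination of path items, plus m
                for r in range(len(path) + 1)
                for combo in combinations(path, r)]
    print(patterns)
    # [('m',), ('f', 'm'), ('c', 'm'), ('a', 'm'), ('f', 'c', 'm'),
    #  ('f', 'a', 'm'), ('c', 'a', 'm'), ('f', 'c', 'a', 'm')]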
Slide 28: Why Is the FP-growth Algorithm Fast?
- Performance studies show FP-growth is an order of magnitude faster than the Apriori algorithm.
- Reasoning:
  - FP-tree building uses a compact data structure.
  - No candidate generation and no candidate tests.
  - Repeated database scans are eliminated.
  - The basic operation is counting.
Slide 29: Presentation of Association Rules (Table Form)
- (The slide shows mined rules displayed as a table.)
Slide 30: Visualization of Association Rules Using a Plane Graph
- (Figure only.)
Slide 31: Visualization of Association Rules Using a Rule Graph
- (Figure only.)
Slide 32: Chapter 5 Outline (section divider; see Slide 2)
Slide 33: Multiple-Level Association Rules
- Items often form a concept hierarchy (the slide's example: food branches into milk and bread; milk into 2% and skim; bread into white and wheat; brands such as Fraser and Wonder sit at the lowest level).
- Rules regarding itemsets at the appropriate levels can be quite useful.
- Items at lower levels are expected to have lower support.
- The transaction database can be encoded based on dimensions and levels, so we can explore multi-level mining.
Slide 34: Mining Multi-Level Associations
- A top-down, progressive deepening approach:
  - First find high-level strong rules: milk → bread [20%, 60%].
  - Then find their lower-level "weaker" rules: 2% milk → wheat bread [6%, 50%].
- Variations on mining multiple-level association rules:
  - Level-crossed association rules: 2% milk → Wonder wheat bread.
  - Association rules with multiple, alternative concept hierarchies: 2% milk → Wonder bread.
Slide 35: Uniform Support vs. Reduced Support (Multi-Level Association Mining)
- Uniform support: use the same minimum support for mining all levels.
- Pros:
  - Simple: one minimum support threshold.
  - If an itemset contains any item whose ancestors do not have minimum support, there is no need to examine it.
- Cons: lower-level items do not occur as frequently, so if the support threshold is
  - too high, we miss low-level associations;
  - too low, we generate too many high-level associations.
Slide 36: Uniform Support Example
- Multi-level mining with uniform support, min_sup = 5% at both levels:
  - Level 1: Milk [support = 10%] passes.
  - Level 2: 2% Milk [support = 6%] passes; Skim Milk [support = 4%] fails.
Slide 37: Uniform Support vs. Reduced Support (continued)
- Reduced support: use a reduced minimum support at lower levels. There are 3 main search strategies:
  - Level-by-level independent (no level-(i-1) information): every tree node (item) is examined, regardless of whether its parent node is frequent (too loose, hence not efficient).
  - Level-cross filtering by k-itemset (uses the level-(i-1) k-itemset information): a k-itemset at the i-th level is examined iff its parent k-itemset at the (i-1)-th level is frequent (efficient, but too restrictive: many rules are missed).
  - Level-cross filtering by single item (uses the level-(i-1) 1-item information): an item at the i-th level is examined iff its parent node at the (i-1)-th level is significant, which need not mean frequent (a compromise between the two strategies above).
Slide 38: Reduced Support Example (Level-Cross Filtering by Single Item)
- Items: Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%].
- Case 1 (level 1 min_sup = 8%, level 2 min_sup = 4%): Milk is frequent at level 1, so both children are examined; 2% Milk and Skim Milk both pass at level 2.
- Case 2 (level 1 min_sup = 12%, level 2 min_sup = 6%): Milk (10%) is not frequent at level 1 but is still significant, so its children are examined; 2% Milk (6%) passes and Skim Milk (4%) fails.
- In case 2, the level-1 parent node is significant but not frequent.
Slide 39: Redundancy Filtering in Multi-Level Association Mining
- Some rules may be redundant due to "ancestor" relationships between items.
- Example of a redundant rule:
  - milk → wheat bread [support = 8%, confidence = 70%]
  - 2% milk → wheat bread [support = 2%, confidence = 72%]
- We say the first rule is an ancestor of the second rule.
- A rule is redundant if its support is close to or less than the "expected" value based on the rule's ancestor. For instance, if roughly one quarter of the milk sold is 2% milk, the expected support of the second rule is 8% x 1/4 = 2%; since that matches its actual support, the second rule adds no new information.
Slide 40: Progressive Deepening in Multi-Level Association Mining
- A top-down, progressive deepening approach:
  - First mine the high-level frequent items: milk (15%), bread (10%).
  - Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%).
- Different min_support strategies across levels lead to different algorithms:
  - With the same min_support across all levels: toss item t if any of t's ancestors is infrequent.
  - With reduced min_support at lower levels: examine only those descendants whose ancestors' support is frequent or non-negligible.
Slide 41: Progressive Refinement of Data Mining Quality
- Why progressive refinement? Mining operators can be expensive or cheap, fine or rough; trade speed for quality with step-by-step refinement.
- Two- or multi-step mining:
  - First apply a rough/cheap operator (superset coverage).
  - Then apply an expensive algorithm on the substantially reduced candidate set (Koperski & Han, SSD'95).
- Superset coverage property: preserve all the positive answers; a false positive is allowed, but a false negative is not.
Slide 42: Chapter 5 Outline (section divider; see Slide 2)
Slide 43: Multi-Dimensional Association (Concepts)
- Single-dimensional rules: buys(X, "milk") → buys(X, "bread").
- Multi-dimensional rules:
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ^ occupation(X, "student") → buys(X, "coke").
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ^ buys(X, "popcorn") → buys(X, "coke").
- Categorical attributes: a finite number of possible values, with no ordering among the values.
- Quantitative attributes: numeric, with an implicit ordering among values (e.g., age).
Slide 44: Mining Quantitative Associations
- Search for frequent k-predicate sets; e.g., {age, occupation, buys} is a 3-predicate set.
- Techniques can be categorized by how quantitative attributes (e.g., age) are treated; a code sketch contrasting the first two follows this list.
  - 1. Static discretization: quantitative attributes are statically discretized using predefined concept hierarchies, with numeric values replaced by ranges (e.g., for salary: 0-20K, 21-30K, 31-40K, >= 41K).
  - 2. Dynamic discretization: quantitative attributes are dynamically discretized into bins based on the distribution of the data (e.g., equi-width or equi-depth bins).
  - 3. Distance-based (clustering) discretization: a dynamic discretization process that also considers the distance between data points (see Slide 49).
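A small sketch contrasting the first two treatments on an age attribute (the bin edges and data values are assumptions, not from the slides):

    ages = [23, 25, 31, 34, 35, 38, 41, 44, 52, 60]

    # 1. Static discretization: predefined concept-hierarchy ranges.
    hierarchy = {(20, 29): "20..29", (30, 39): "30..39",
                 (40, 49): "40..49", (50, 69): "50..69"}
    static = [label for a in ages
              for (lo, hi), label in hierarchy.items() if lo <= a <= hi]

    # 2. Dynamic discretization: equi-depth bins (equal number of points each).
    k = 5
    depth = len(ages) // k
    equi_depth = [sorted(ages)[i:i + depth] for i in range(0, len(ages), depth)]

    print(static)       # ['20..29', '20..29', '30..39', ..., '50..69']
    print(equi_depth)   # [[23, 25], [31, 34], [35, 38], [41, 44], [52, 60]]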
Slide 45: Static Discretization of Quantitative Attributes
- Before mining, discretize the data using concept hierarchies; numeric values are replaced by ranges (e.g., salary: 0-20K, 21-30K, 31-40K, >= 41K).
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
- A data cube is well suited for mining:
  - The cells of an n-dimensional cuboid correspond to the predicate sets.
  - The cells store the support counts of the corresponding n-predicate sets.
- (The slide shows the cuboid lattice over {age, income, buys}: (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys).)
Slide 46: Quantitative Association Rule Mining
- Numeric attributes are dynamically discretized so that the confidence or compactness of the mined rules is maximized.
- For 2-D quantitative association rules of the form A_quan1 ^ B_quan2 → C_cat, cluster "adjacent" association rules to form more general rules using a 2-D grid.
- Example: the four rules
  - age(X, "34") ^ income(X, "30K-40K") → buys(X, "TV")
  - age(X, "34") ^ income(X, "40K-50K") → buys(X, "TV")
  - age(X, "35") ^ income(X, "30K-40K") → buys(X, "TV")
  - age(X, "35") ^ income(X, "40K-50K") → buys(X, "TV")
  combine into the more compact rule age(X, "34-35") ^ income(X, "30K-50K") → buys(X, "TV").
Slide 47: ARCS (Association Rule Clustering System)
- How ARCS works: 1) binning; 2) finding frequent predicate sets; 3) clustering; 4) optimizing.
Slide 48: Limitations of ARCS
- Only quantitative attributes are allowed on the LHS of rules.
- Only 2 attributes on the LHS (a 2-D limitation).
- An alternative to ARCS: non-grid-based equi-depth binning with clustering based on a measure of partial completeness ("Mining Quantitative Association Rules in Large Relational Tables" by R. Srikant and R. Agrawal).
Slide 49: Mining Distance-Based Association Rules
- Binning methods do not capture the semantics of interval data.
- Distance-based partitioning gives a more meaningful discretization by considering:
  - the density, or number of points, in an interval;
  - the "closeness" of points in an interval.
Slide 50: Clusters and Distance Measurements
- Let S[X] be a set of N tuples t1, t2, ..., tN projected on the numerical attribute set X.
- The diameter of S[X] is the average pairwise distance over all N(N-1) ordered pairs of tuples:

    d(S[X]) = ( sum over i = 1..N, j = 1..N of dist_X(t_i[X], t_j[X]) ) / ( N(N-1) )

- dist_X is a distance metric, e.g., Euclidean distance or Manhattan distance.
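The diameter is straightforward to compute; a short sketch with Manhattan distance on a 1-D attribute (the values are illustrative):

    def diameter(points, dist):
        # Average distance over all N(N-1) ordered pairs (the i = j terms are zero).
        n = len(points)
        return sum(dist(p, q) for p in points for q in points) / (n * (n - 1))

    values = [31, 32, 34, 35]
    print(diameter(values, lambda a, b: abs(a - b)))   # 2.333...: average pairwise distance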
Slide 51: How to Find Clusters
- Assess the density of a cluster C_X: its diameter must be small enough, and the number of samples in it must be large enough.
- Finding clusters and distance-based rules:
  - The density threshold replaces the notion of support.
  - A modified version of the BIRCH clustering algorithm is used.
Slide 52: Chapter 5 Outline (section divider; see Slide 2)
Slide 53: Interestingness Measurements
- Objective measures: two popular ones are support and confidence.
- Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it).
Slide 54: Criticism of Support and Confidence (Example 1)
- Example 1 (Aggarwal & Yu, PODS'98): 5000 students in total, analyzed in a 2-D space (playing basketball, eating cereal). Given:
  - 3000 play basketball;
  - 3750 eat cereal;
  - 2000 both play basketball and eat cereal.
- The following association rules are derived:
  - play basketball → eat cereal [40%, 66.7%]
  - play basketball → not eat cereal [20%, 33.3%]
Slide 55: Criticism of Support and Confidence (continued)
- Rule 1, play basketball → eat cereal [40%, 66.7%], implies that a student who plays basketball likes eating cereal. The rule is misleading: the percentage of all students eating cereal is 75% (3750/5000), which is higher than 66.7%.
- Rule 2, play basketball → not eat cereal [20%, 33.3%], is far more accurate: it implies that a student who plays basketball is less likely to eat cereal.
Slide 56: Interestingness Measure: Correlation (Lift)
- play basketball → eat cereal [40%, 66.7%] is misleading, since the overall share of students eating cereal is 75% > 66.7%.
- play basketball → not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.
- A measure of dependent/correlated events, lift:

    lift(A, B) = P(A and B) / ( P(A) * P(B) )

- With B = basketball and C = cereal: lift(B, C) < lift(B, not C), so a student who plays basketball tends not to eat cereal.
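Plugging in the numbers from Example 1 shows how lift corrects the misleading confidence (a short sketch; the counts come from Slide 54):

    n = 5000
    p_b, p_c, p_bc = 3000 / n, 3750 / n, 2000 / n     # basketball, cereal, both

    lift_b_c = p_bc / (p_b * p_c)                     # 0.89 < 1: negatively correlated
    lift_b_notc = (p_b - p_bc) / (p_b * (1 - p_c))    # 1.33 > 1: positively correlated
    print(round(lift_b_c, 2), round(lift_b_notc, 2))  # 0.89 1.33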
Slide 57: Criticism of Support and Confidence (Example 2)
- (The slide tabulates three items X, Y, Z over a set of transactions, with 1 = occurred and 0 = not occurred.)
- In the data, X and Y are positively correlated, while X and Z are negatively related.
- Yet judged by support and confidence, Z appears more positively correlated with X than Y does: a contradiction.
- We need a measure of dependent or correlated events (continued on the next slide).
Slide 58: Criticism of Support and Confidence (Example 2, continued)
- Correlation (lift) takes both P(A) and P(B) into consideration:

    corr(A, B) = P(A and B) / ( P(A) * P(B) )

- P(A and B) = P(A) * P(B) exactly when A and B are independent events.
- A and B are negatively correlated if corr(A, B) is less than 1, and positively correlated otherwise.
- Here P(X) = 1/2, P(Y) = 1/4, P(Z) = 7/8; with the table's joint counts, corr rates Y as positively correlated with X and Z as negatively correlated, resolving the contradiction.
Slide 59: Constraint-Based Mining
- Interactive, exploratory mining of gigabytes of data: could it be real? Yes, by making good use of constraints!
- What kinds of constraints can be used in mining?
  - Knowledge type constraints: specify the type of knowledge to be mined.
  - Data constraints: specify the relevant data, e.g., find product pairs sold together in Vancouver in Dec. '98.
  - Dimension (attribute) constraints: specify the desired dimensions, e.g., region, price, brand, customer category.
  - Rule constraints: specify the form of the rules, e.g., small sales (price < $200).
  - Interestingness constraints: specify the thresholds, e.g., strong rules (min_support >= 3%, min_confidence >= 60%).
- (See the KDD process steps on Slide 64.)
Slide 60: Rule Constraints in Association Mining
- Two kinds of rule constraints:
  - Rule form constraints (meta-rule guided mining): e.g., P(x, y) ^ Q(x, w) → takes(x, "database systems").
  - Rule content constraints (constraint-based query optimization; Ng et al., SIGMOD'98): e.g., sum(LHS) < 20 ^ count(LHS) > 3.
- 1-variable vs. 2-variable constraints (Lakshmanan et al., SIGMOD'99):
  - 1-var: a constraint confining only one side (LHS or RHS) of the rule, as in the examples above.
  - 2-var: a constraint confining both sides (LHS and RHS), e.g., sum(LHS) < min(RHS) and max(RHS) < 5 * sum(LHS).
Slide 61: Chapter 5 Outline (section divider; see Slide 2)
Slide 62: Why Is the Big Pie Still There? (Open Directions)
- More on constraint-based mining of associations: Boolean vs. quantitative associations; associations on discrete vs. continuous data.
- From association to correlation and causal structure analysis: association does not necessarily imply correlation or causal relationships.
- From intra-transaction associations to inter-transaction associations: e.g., breaking the barriers of transactions (Lu et al., TOIS'99).
- From association analysis to classification and clustering analysis: e.g., clustering association rules.
Slide 63: Summary
- Association rule mining is probably the most significant contribution from the database community to KDD.
- A large number of papers have been published, and many interesting issues have been explored.
- An interesting research direction: association analysis on other types of data, such as spatial data, multimedia data, and time series data.
Slide 64: Steps in the KDD Process (Technically)
- Databases → relevant data (selection) → data preprocessing → data mining → pattern evaluation/presentation → knowledge.
- Data mining is the core step of the KDD process.