CPT-S Advanced Databases 11 Yinghui Wu EME 49
Scalable data mining (a case study)
Data mining: What is data mining? A tentative definition:
– Use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in the data
– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Database Processing vs. Data Mining
Database processing:
– Query: well defined (SQL, SPARQL, XPath, …)
– Output: precise, a subset of the database
– Examples: find all my friends living in Seattle who like French restaurants; find all credit applicants with last name Smith; identify customers who have purchased more than $10,000 in the last month.
Data mining:
– Query: poorly defined, no precise query language
– Output: fuzzy, not a subset of the database
– Examples: find all my friends who frequently go to French restaurants if their friends do (association rules); find all credit applicants who are poor credit risks (classification); identify customers with similar buying habits (clustering).
Data Mining Models and Tasks
– Predictive tasks: use variables to predict unknown or future values of other variables.
– Descriptive tasks: find human-interpretable patterns that describe the data.
Association rules
Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that predict the occurrence of an item based on occurrences of other items.
Rules discovered: {Milk} → {Coke}, {Diaper, Milk} → {Beer}
Association Rule Discovery: Applications
Marketing and sales promotion:
– Let the rule discovered be {Bagels, …} → {Potato Chips}
– Potato Chips as consequent: can be used to determine what should be done to boost its sales.
– Bagels in the antecedent: can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in the antecedent and Potato Chips in the consequent: can be used to see what products should be sold with bagels to promote the sale of Potato Chips.
Definition: Frequent Itemset
Itemset
– A collection of one or more items, e.g., {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2
Support (s)
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5
Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
Association rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
Rule evaluation metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X
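A minimal sketch of these two metrics in code. The five-transaction table behind the numbers on these slides did not survive the export, so the database below is a reconstruction chosen to reproduce the quoted values (σ({Milk, Bread, Diaper}) = 2, s = 2/5, and the rule confidences quoted two slides later); the helper names are illustrative only.

```python
# Hypothetical transaction database, consistent with the counts on these slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(X, Y, db):
    """How often items in Y appear in transactions that contain X."""
    return support(set(X) | set(Y), db) / support(X, db)

print(support({"Milk", "Bread", "Diaper"}, transactions))      # 0.4   (sigma = 2, s = 2/5)
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666… (2 of 3 transactions)
```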
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive! Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules: R = 3^d − 2^(d+1) + 1; if d = 6, R = 602 rules
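A quick sanity check of the rule count above (the closed form is the standard count of rules with non-empty antecedent and consequent; reconstructed here, since the original formula appeared only as an image):

```python
# R = 3^d - 2^(d+1) + 1: each of the d items goes to the LHS, the RHS, or neither
# (3^d assignments); subtract assignments with an empty LHS (2^d) and with an
# empty RHS (2^d), then add back the doubly-subtracted all-empty assignment (+1).
d = 6
print(3**d - 2**(d + 1) + 1)  # 602
```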
Mining Association Rules: Decoupling
Example rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
– All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent itemset generation
– Generate all itemsets whose support ≥ minsup
2. Rule generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width => expensive since M = 2^d!
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Use a subsample of N transactions
Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates: Apriori Apriori principle: –If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due to the following property of the support measure: –Support of an itemset never exceeds the support of its subsets –This is known as the anti-monotone property of support
Illustrating Apriori Principle
(Figure: itemset lattice; once an itemset is found to be infrequent, all of its supersets are pruned.)
Illustrating Apriori Principle
Minimum support count = 3
Items (1-itemsets), pairs (2-itemsets), and triplets (3-itemsets) are generated level by level; there is no need to generate candidate pairs involving Coke or Eggs, since these items are infrequent.
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
Apriori Algorithm Method: –Let k=1 –Generate frequent itemsets of length 1 –Repeat until no new frequent itemsets are identified Generate length (k+1) candidate itemsets from length k frequent itemsets Prune candidate itemsets containing subsets of length k that are infrequent Count the support of each candidate by scanning the DB Eliminate candidates that are infrequent, leaving only those that are frequent
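A compact sketch of this loop in Python (an assumed illustration, not any reference implementation): level-wise candidate generation, subset-based pruning, and support counting by scanning the database.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support_count} for all itemsets with count >= minsup."""
    db = [frozenset(t) for t in transactions]

    # k = 1: count single items and keep the frequent ones
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # Generate length-(k+1) candidates from length-k frequent itemsets
        items = sorted({i for s in frequent for i in s})
        prev = set(frequent)
        candidates = set()
        for s in frequent:
            for item in items:
                if item not in s:
                    cand = s | {item}
                    # Prune candidates containing an infrequent k-subset
                    if all(frozenset(sub) in prev for sub in combinations(cand, k)):
                        candidates.add(cand)

        # Count candidate support by scanning the database
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # Eliminate infrequent candidates and continue with the survivors
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

For example, `apriori(transactions, minsup=3)` over the toy database sketched earlier returns every frequent itemset together with its support count.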
Apriori: Reducing Number of Comparisons Candidate counting: –Scan the database of transactions to determine the support of each candidate itemset –To reduce the number of comparisons, store the candidates in a hash structure Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
Apriori: Implementation Using Hash Tree
Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
We need:
– A hash function (here items 1, 4, 7 hash to one branch; 2, 5, 8 to another; 3, 6, 9 to the third)
– A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)
Apriori: Implementation Using Hash Tree
(Figure: the subset operation on the hash tree; a transaction is matched against only 11 of the 15 candidates.)
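A simplified stand-in for hash-tree counting, assuming Python's built-in hash table in place of an explicit tree: candidates sit in a hash structure, and each transaction is matched only against its own k-subsets, never against the full candidate list.

```python
from itertools import combinations

def count_candidates(candidates, db, k):
    """Support counting without matching every transaction against every candidate."""
    counts = {frozenset(c): 0 for c in candidates}   # candidates stored in a hash table
    for t in db:
        for subset in combinations(sorted(t), k):    # enumerate the k-subsets of t
            key = frozenset(subset)
            if key in counts:                        # probe only the matching bucket
                counts[key] += 1
    return counts
```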
Apriori: Alternative Search Methods Traversal of Itemset Lattice –General-to-specific vs Specific-to-general
Apriori: Alternative Search Methods
Traversal of Itemset Lattice
– Breadth-first vs Depth-first
Bottlenecks of Apriori
Candidate generation can result in huge candidate sets:
– 10^4 frequent 1-itemsets will generate ~10^7 candidate 2-itemsets
– To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate ~2^100 ≈ 10^30 candidates.
Multiple scans of the database:
– Needs (n + 1) scans, where n is the length of the longest pattern
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.
– If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
Rule Generation
How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same itemset has an anti-monotone property
– E.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
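A sketch of rule generation that exploits this anti-monotonicity (assumed code; `support` is a hypothetical table mapping itemsets to their supports): consequents are grown level by level, and once a consequent fails minconf, no larger consequent containing it is ever generated, which is sound because confidence only drops as items move from the antecedent to the consequent.

```python
from itertools import combinations

def gen_rules(L, support, minconf):
    """Yield (lhs, rhs, conf) from frequent itemset L; `support` maps frozensets to supports."""
    L = frozenset(L)
    consequents = [frozenset([i]) for i in L]          # start with 1-item consequents
    while consequents:
        kept = []
        for rhs in consequents:
            lhs = L - rhs
            if not lhs:
                continue
            conf = support[L] / support[lhs]
            if conf >= minconf:
                yield lhs, rhs, conf
                kept.append(rhs)                       # only surviving consequents grow
        # Build the next level from survivors only: a larger consequent is kept
        # only if all of its sub-consequents survived the confidence check.
        kept_set, next_level = set(kept), set()
        for a, b in combinations(kept, 2):
            merged = a | b
            if len(merged) == len(a) + 1 and all(
                    frozenset(sub) in kept_set for sub in combinations(merged, len(a))):
                next_level.add(merged)
        consequents = next_level
```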
Association rules in graph data
Association in social networks
Conventional association rules over itemsets: X ⇒ Y
Association in social networks: "if customers x and x′ are friends living in the same city c, there are at least 3 French restaurants in c that x and x′ both like, and if x′ visits a newly opened French restaurant y in c, then x may also visit y."
Association with topological constraints! Identify customers for a new French restaurant visit.
(Figure: pattern linking x, x′, their city, at least 3 French restaurants, and the new restaurant y.)
Association via graph patterns: more involved than rules for itemsets!
(Figures: detecting fake accounts, e.g., acc #1 and a "fake" acc #2 spreading "claim a prize" keywords via an article or blog; identifying customers for a released album, e.g., users x1, x2 in Ecuador connected to a "Shakira" album.)
Question 1: How to define association rules via graph patterns?
Question 2: How to discover interesting rules?
Question 3: How to use the rules to identify customers?
Graph Pattern Association Rules (GPARs)
A graph-pattern association rule (GPAR) R(x, y) has the form
R(x, y): Q(x, y) ⇒ q(x, y)
– Q(x, y) is a graph pattern, where x and y are two designated nodes
– q(x, y) is an edge labeled q from x to y (a predicate)
– Q and q are the antecedent and consequent of R, respectively.
Example R(x, French restaurant): Q(x, French restaurant) ⇒ like(x, French restaurant), where Q is the pattern relating x, a friend x′, their city, and at least 3 French restaurants that both like.
Graph Pattern Association Rules (GPARs)
R(x, y): Q(x, y) ⇒ q(x, y)
If there exists a match h that identifies v_x and v_y as matches of the designated nodes x and y in Q, respectively, then the consequent q(v_x, v_y) will likely hold.
Example: Q(x, French restaurant) ⇒ like(x, French restaurant) reads "if x and x′ are friends living in city c, there are at least 3 French restaurants in c that x and x′ both like, and if x′ visits a newly opened French restaurant y in c, then x may also visit y."
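A minimal sketch of how a GPAR could be represented and matched (an assumed illustration, not the paper's code): the pattern Q is a labeled networkx DiGraph with designated nodes x and y, matching is delegated to networkx's VF2 subgraph-isomorphism matcher, and the counting condition of the example ("at least 3 French restaurants") is ignored here for simplicity.

```python
from networkx.algorithms import isomorphism

class GPAR:
    """R(x, y): Q(x, y) => q(x, y), with Q a labeled pattern and q an edge label."""
    def __init__(self, Q, x, y, q):
        self.Q, self.x, self.y, self.q = Q, x, y, q

    def matches(self, G):
        """Yield (vx, vy): nodes of G matching the designated nodes x, y of Q."""
        nm = isomorphism.categorical_node_match("label", None)
        em = isomorphism.categorical_edge_match("label", None)
        gm = isomorphism.DiGraphMatcher(G, self.Q, node_match=nm, edge_match=em)
        for mapping in gm.subgraph_isomorphisms_iter():   # maps G-nodes -> Q-nodes
            inv = {qn: gn for gn, qn in mapping.items()}
            yield inv[self.x], inv[self.y]

    def consequent_holds(self, G, vx, vy):
        """Does the edge q(vx, vy) exist in G?"""
        return G.has_edge(vx, vy) and G.edges[vx, vy].get("label") == self.q
```

Here Q and G would both be built as `networkx.DiGraph` objects whose nodes and edges carry a `label` attribute; VF2 checks induced subgraph isomorphism, which matches the "isomorphic subgraph" counting question raised on the next slide.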
Support and Confidence
Support of R(x, y): based on the matches of Q(x, y) ⇒ q(x, y) in the (single) input graph G.
(Figure: matches of the French-restaurant pattern at Le Bernardin and Per Se in New York.)
Question: how to count, the number of isomorphic subgraphs in a single graph?
Support and Confidence
Confidence of R(x, y): measured against the candidates of R.
– Candidates: matches of x that have at least one edge of type q but are not a match for q(x, y).
– Local closed world assumption: matches satisfying q(x, y) are "positive", candidates that do not are "negative", and the remaining matches are "unknown".
(Figure: "Shakira" album example, with positive, negative, and unknown matches among users in Ecuador.)
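One possible reading of these counts as code, building on the GPAR sketch above (an assumption; the paper's exact support normalization is not reproduced here): positives are matches for which the consequent edge holds, candidates additionally include the "negatives" under the local closed world assumption, and "unknown" matches are excluded.

```python
def support_and_confidence(gpar, G):
    """Count positive matches and candidates under the local closed world assumption."""
    positives, candidates = 0, 0
    for vx, vy in gpar.matches(G):
        has_q_edge = any(d.get("label") == gpar.q
                         for _, _, d in G.out_edges(vx, data=True))
        if gpar.consequent_holds(G, vx, vy):
            positives += 1                 # "positive"
            candidates += 1
        elif has_q_edge:
            candidates += 1                # "negative": vx has some q-edge, but not to vy
        # else: "unknown" -- excluded from the candidate count
    confidence = positives / candidates if candidates else 0.0
    return positives, confidence
```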
Discovering GPARs
Mining GPARs for a particular event often leads to the same group of entities; we therefore diversify, using a difference function between GPARs and a bi-criteria diversification function F that balances support and diversity.
The diversified mining problem: Given a graph G, a predicate q(x, y), a support bound α, and positive integers k and d, find a set S of k nontrivial GPARs pertaining to q(x, y) such that (a) F(S) is maximized; and (b) for each GPAR R ∈ S, supp(R, G) ≥ α and r(P_R, x) ≤ d (the radius of the pattern P_R at x is at most d).
(Figure: two diversified GPARs for French restaurants, one also involving an Asian restaurant, with matches at Le Bernardin and Per Se in New York and Patina in LA.)
Parallel GPAR Discovery
A parallel discovery algorithm:
– The coordinator S_c divides G into n−1 fragments, each assigned to a processor S_i.
– GPARs are discovered in parallel by bulk synchronous processing in d rounds:
1. S_c posts a set M of GPARs to each processor (distribute M_i);
2. each processor generates GPARs locally by extending those in M (locally expand M_i);
3. new GPARs are collected and assembled by S_c in the barrier synchronization phase, and S_c incrementally updates the top-k GPAR set L_k.
Open questions: Is it parallel scalable? How is load balanced? What is the communication cost?
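A structural sketch of the d-round BSP loop, written sequentially for clarity (all callables and parameter names are placeholders for the operators described on this slide, not the paper's API):

```python
def parallel_gpar_discovery(fragments, expand_locally, assemble, update_topk, M0, d, k):
    """Sketch: coordinator/worker supersteps for diversified GPAR mining."""
    M, Lk = M0, []
    for _ in range(d):                                           # d supersteps
        local = [expand_locally(frag, M) for frag in fragments]  # workers, in parallel
        new_gpars = assemble(local)                              # barrier synchronization
        Lk = update_topk(Lk, new_gpars, k)                       # coordinator refreshes L_k
        M = new_gpars                                            # redistributed next round
        if not M:
            break
    return Lk
```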
Identifying entities using GPARs
Given a set Σ of GPARs pertaining to the same q(x, y), a graph G, and a confidence threshold η, the set of entities identified by Σ is the set of nodes v_x that match the designated node x of some GPAR R ∈ Σ with conf(R, G) ≥ η.
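Reusing the sketches above (Sigma, eta, and the helper names are from this illustration, not the paper's code), identification reduces to filtering GPARs by confidence and collecting their x-matches:

```python
def identify_entities(Sigma, G, eta):
    """Nodes vx matched by some GPAR in Sigma whose confidence on G is at least eta."""
    identified = set()
    for gpar in Sigma:
        _, conf = support_and_confidence(gpar, G)
        if conf >= eta:
            identified.update(vx for vx, _ in gpar.matches(G))
    return identified
```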
Entity identification algorithm
Scalability of the discovery algorithm: on average 3.2 times faster.