1
M.Phil Probation Talk: Association Rules Mining of Existentially Uncertain Data. Presenter: Chui Chun Kit. Supervisor: Dr. Benjamin C.M. Kao.
2
Presentation Outline
Introduction: What are association rules? How are association rules mined from a large database?
Probabilistic Data (Uncertain Data): What is probabilistic/uncertain data? The Possible World interpretation of uncertain data.
Mining frequent patterns from uncertain data: a simple algorithm to mine association rules from uncertain data, and the computational problem it runs into.
Efficient methods of mining association rules from uncertain data.
Experimental Results and Discussions.
Conclusion and Future Work.
3
Section 1 Introduction: What is an association rule?
4
Introduction. Suppose Peter is a psychologist. He judges a list of psychological symptoms to make diagnoses and give treatments to his patients, and all diagnosis records are stored in a transaction database. We call each patient record a transaction, each psychological symptom a binary attribute (with value either yes or no), and the collection of patients' records a transaction database. (Slide figure: a table with psychological symptoms — Mood Disorder, Anxiety Disorder, Eating Disorder, Obsessive-Compulsive Disorder, Depression, ..., Self-Destructive Disorder — as binary attributes and one row per patient.)
5
Introduction. One day, while reviewing his patients' records, Peter discovers some patterns in his patients' psychological symptoms, e.g. patients having "mood disorder" are often associated with "eating disorder". He would like to learn about the associations between different psychological symptoms from his patients' records.
6
Introduction. Peter may be interested in the following associations among different psychological symptoms (association rules):
Mood disorder => Eating disorder
Mood disorder => Depression
Eating disorder => Depression + Mood disorder
Eating disorder + Depression => Self destructive disorder + Mood disorder
These associations are very useful information to assist diagnosis and treatment.
7
Introduction. However, the psychological symptoms database is very large (too many records), so it is impossible to analyze the associations by human inspection. In computer science research, the problem of mining association rules from a transaction database was formulated and solved by R. Agrawal in 1993; the Apriori algorithm is the basic algorithm for mining association rules. Thanks, computer scientists!
8
Introduction – Association Rules. There are two parameters that measure the interestingness of an association rule. Support is the fraction of database transactions that contain the items in the rule; it shows how frequent the items in the rule are. Confidence is the percentage of transactions containing the antecedent that also contain the consequent; it shows the certainty of the rule. Example: Eating disorder => Depression [support = 2%, confidence = 60%], where "Eating disorder" is the antecedent and "Depression" is the consequent. The 2% support value means that 2% of the patients in the database have both psychological symptoms; the 60% confidence value means that 60% of the patients having eating disorder also have depression.
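As a minimal illustration (a toy dataset of four hypothetical patient records, not from the talk), support and confidence can be computed as follows:

```python
# Illustrative sketch: computing support and confidence for the rule
# {eating_disorder} => {depression} on a tiny Boolean database.
transactions = [
    {"eating_disorder", "depression"},
    {"eating_disorder"},
    {"mood_disorder", "depression"},
    {"eating_disorder", "depression", "mood_disorder"},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"eating_disorder"}, {"depression"}
sup = support(antecedent | consequent)          # 2/4 = 0.5
conf = sup / support(antecedent)                # 0.5 / 0.75 ~= 0.67
print(f"support={sup:.2f}, confidence={conf:.2f}")
```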
9
Introduction – Association Rules. Two steps for mining association rules. Step 1: find ALL frequent itemsets; itemsets are frequent if their supports are over the user-specified SUPPORT threshold. Step 2: generate association rules from the frequent itemsets; an association rule is generated if its confidence is over a user-specified CONFIDENCE threshold. ("Given the transaction database, find ALL the association rules with SUPPORT values over 10% and CONFIDENCE values over 60%, please!" — on the psychological symptoms database.)
10
Introduction – Association Rules. (Same two steps as above: find all frequent itemsets, then generate the rules.) The overall performance of mining association rules is determined by the first step, so for the sake of discussion let us focus on the first step in this talk.
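Before setting step 2 aside, here is a minimal sketch of how rules would be generated from the frequent itemsets produced by step 1 (the itemsets and support values below are illustrative):

```python
# Illustrative sketch of step 2 (rule generation), assuming step 1 has already
# produced the frequent itemsets together with their support values.
from itertools import combinations

def generate_rules(frequent, min_conf):
    """frequent: dict mapping frozenset itemsets to their support values."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[antecedent]   # confidence = sup(X u Y) / sup(X)
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

freq = {frozenset({"ED"}): 0.20, frozenset({"DEP"}): 0.25,
        frozenset({"ED", "DEP"}): 0.12}
print(generate_rules(freq, min_conf=0.6))   # ED => DEP with confidence 0.6
```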
11
Section 1 Introduction: How to mine frequent itemsets from a large database?
12
Mining Frequent Itemsets – problem definition. Given a transaction database D with n attributes (items) and m transactions, where each transaction t is a Boolean vector representing the presence or absence of each item, and a minimum support threshold s: find ALL itemsets with support values over s. (Slide figure: an m × n Boolean table with items I1, I2, ..., In as columns and transactions t1, ..., tm as rows.)
13
Brute-force approach. Suppose there are 5 items in the database, i.e. A, B, C, D and E; there are 2^5 = 32 itemsets in total. Scan the database once to count the supports of ALL itemsets together. If there are n different items, there will be 2^n itemsets to count in total; with just 20 items, that is already over 1,000,000 itemsets — computationally infeasible.
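A sketch of this brute-force counting over a toy database, which makes the 2^n blow-up explicit:

```python
# Illustrative sketch of the brute-force approach: enumerate every non-empty
# itemset over n items and count its support in one database scan.
from itertools import combinations

items = ["A", "B", "C", "D", "E"]                 # n = 5 -> 2^5 = 32 itemsets
transactions = [{"A", "B", "C"}, {"B", "C", "E"}, {"A", "D"}]

counts = {}
for k in range(1, len(items) + 1):                # all 2^n - 1 non-empty itemsets
    for itemset in combinations(items, k):
        counts[itemset] = 0

for t in transactions:                            # one scan over the database
    for itemset in counts:
        if set(itemset) <= t:
            counts[itemset] += 1

print(len(counts) + 1, "itemsets (incl. the empty set); infeasible for large n")
```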
14
The Apriori Algorithm. Apriori property: all subsets of a frequent itemset must also be frequent. Equivalently, once an itemset is found to be infrequent, all of its supersets can be pruned and there is no need to count their frequency. The Apriori algorithm adopts an iterative approach that exploits this property to identify infrequent itemsets early. (Slide figure: an itemset lattice with an itemset found to be infrequent and its pruned supersets highlighted.)
15
The Apriori Algorithm – how it works. The algorithm starts by inspecting ALL size-1 items ({A}, {B}, {C}, {D}, {E}). The supports of all size-1 candidates are obtained by a SUBSET FUNCTION procedure that scans the database once, and candidates with support over the support threshold become large (frequent) items. If item {A} turns out to be infrequent, then by the APRIORI PROPERTY all supersets of {A} cannot be frequent, which saves the effort of counting their supports. The APRIORI-GEN procedure then generates only those size-(k+1) candidates that are potentially frequent (here {B,C}, {B,D}, {B,E}, {C,D}, {C,E}, {D,E}, and later {B,D,E}, {C,D,E}). The algorithm alternates between candidate generation and support counting, obtaining the frequent itemsets iteratively until no more candidates are generated. (Slide figure: the Candidates / Large itemsets loop through Apriori-Gen and the Subset Function.)
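A compact sketch of the level-wise loop just described (candidate counting, selection of large itemsets, and Apriori-Gen with the subset-pruning test), on a toy Boolean database:

```python
# Hedged sketch of the Apriori loop: count candidates, keep the frequent ones,
# and use Apriori-Gen (join + prune with the Apriori property) for the next level.
from itertools import combinations

def apriori(transactions, min_count):
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]          # size-1 candidates
    frequent = []
    while candidates:
        # subset function: count the support of every candidate in one scan
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        large = {c for c, n in counts.items() if n >= min_count}
        frequent.extend(large)
        # Apriori-Gen: join large k-itemsets, keep only candidates whose
        # size-k subsets are all frequent (the Apriori property)
        candidates = []
        for a in large:
            for b in large:
                union = a | b
                if len(union) == len(a) + 1 and \
                        all(frozenset(s) in large for s in combinations(union, len(a))):
                    candidates.append(union)
        candidates = list(set(candidates))
    return frequent

db = [{"B", "C", "E"}, {"B", "C", "D"}, {"B", "D", "E"}, {"A", "C"}]
print(apriori(db, min_count=2))   # e.g. {B}, {C}, {D}, {E}, {B,C}, {B,D}, {B,E}
```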
16
Important Detail of Apriori – the Subset Function. The Subset Function scans the database transaction by transaction and increments the corresponding support counts of the candidates. Since there are generally many candidates, the Subset Function organizes them in a hash-tree data structure: each interior node of the hash tree contains a hash table, and each leaf node contains a list of itemsets and their support counts. (Slide figure: the Apriori loop — starting from k = 1, Apriori-Gen produces candidate itemsets, the Subset Function produces large itemsets, then k = k + 1.)
17
Important Detail of Apriori – how are candidates stored in the hash tree? Each interior node contains a hash table; here items hash into three buckets, {1,4,7}, {2,5,8} and {3,6,9}. Leaf nodes contain a list of itemsets and support counts, and two levels (level 0 and level 1) are needed for storing size-2 candidates. To store candidate {1,2}, first hash on its first item (1), then hash on its second item (2), and place it in the leaf node reached; candidate {1,2} is stored in that leaf. Similarly, candidate {2,4} is hashed and stored in its own slot, and the remaining candidates ({3,6}, {1,5}, ...) are inserted in the same way.
18
Important Detail of Apriori – how is a transaction processed with the hash tree? Enumerate all size-2 subsets within the transaction and traverse the hash tree to increment the corresponding support counts: e.g. for subset {1,4}, hash on 1 and then on 4 to reach a leaf, search the leaf for the candidate, and when the itemset is found increment its support count. A transaction with 100 items has C(100,2) = 4950 size-2 subsets, and the same procedure has to be repeated for ALL size-2 subsets of the transaction and for ALL transactions!
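A simplified sketch of a two-level hash tree for size-2 candidates, hashing on one item per level with three buckets as on the slide (the branching factor and data layout are illustrative):

```python
# Hedged sketch: a simplified 2-level hash tree for size-2 candidates,
# assuming items are integers and a fixed branching factor of 3.
from itertools import combinations

BRANCH = 3
def bucket(item):          # hash on one item, as on the slide (1,4,7 / 2,5,8 / 3,6,9)
    return item % BRANCH

class HashTree:
    def __init__(self):
        # leaves[b1][b2] is a dict {candidate_pair: support_count}
        self.leaves = [[dict() for _ in range(BRANCH)] for _ in range(BRANCH)]

    def insert(self, pair):                        # pair is a sorted 2-tuple
        a, b = pair
        self.leaves[bucket(a)][bucket(b)][pair] = 0

    def add_transaction(self, items):
        # enumerate all size-2 subsets of the transaction and
        # increment the counts of those that are stored candidates
        for a, b in combinations(sorted(items), 2):
            leaf = self.leaves[bucket(a)][bucket(b)]
            if (a, b) in leaf:
                leaf[(a, b)] += 1

tree = HashTree()
for cand in [(1, 2), (2, 4), (3, 6), (1, 5)]:
    tree.insert(cand)
tree.add_transaction([1, 2, 4, 9])                 # increments {1,2} and {2,4}
print(tree.leaves[1][2])                           # {(1, 2): 1, (1, 5): 0}
```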
19
Section 2 Probabilistic Data What is probabilistic data?
20
Probabilistic Database (Uncertain Database). In reality, when psychologists make a diagnosis, they estimate the likelihood of the presence of each psychological symptom of a patient. The likelihood of presence of each symptom is represented as an existential probability. (Slide figure: the psychological symptoms table with probabilities instead of yes/no values, e.g. Patient 1 with Mood Disorder 97%, Anxiety Disorder 5%, Eating Disorder 84%, ..., and Patient 2 with Mood Disorder 90%, Anxiety Disorder 85%, ...) How do we mine association rules from an uncertain database?
21
Other areas with probabilistic databases: pattern recognition (handwriting recognition, speech recognition, etc.), information retrieval, and scientific databases. (Slide figure: binary features with existential probabilities, e.g. Pattern 1 with Feature 1 = 90% and Feature 2 = 85%, Pattern 2 with Feature 1 = 60% and Feature 2 = 5%.)
22
Section 2 Probabilistic Data: the Possible World interpretation of an uncertain database, introduced by S. Abiteboul et al. in the paper "On the Representation and Querying of Sets of Possible Worlds", SIGMOD 1987.
23
Possible World Interpretation – example. Consider a database with two psychological symptoms and two patients: Patient 1 has Depression with probability 90% and Eating Disorder with probability 80%; Patient 2 has Depression with probability 40% and Eating Disorder with probability 70%. Since each of the four entries is either truly present or truly absent, there are 16 possible worlds, and each possibility is called a "possible world". For example, one possibility is that both patients actually have both psychological illnesses; another is that Patient 1 has only the eating disorder while Patient 2 has both. Data uncertainty is eliminated when we focus on an individual possible world, so we can discuss the support of an itemset in each individual world. (Slide figure: the 16 possible worlds, each shown as a 2 × 2 presence/absence table over the two patients and the two symptoms.)
24
Possible World Interpretation – support of itemset {Depression, Eating Disorder} (call the symptoms S1 and S2). For each possible world we can discuss both the support of {S1, S2} in that world and the likelihood of that world being the true world. Possible world 1 (both patients have both symptoms) has support 2 and likelihood 0.9 × 0.8 × 0.4 × 0.7 = 0.2016; possible world 2 (Patient 1 has only the eating disorder, Patient 2 has both) has support 1 and likelihood 0.1 × 0.8 × 0.4 × 0.7 = 0.0224; the support/likelihood pairs of the remaining worlds (0.0504, 0.3024, 0.0864, 0.1296, 0.0056, 0.0336, ...) are obtained in the same way. Question: overall, how many occurrences of the itemset {S1, S2} do we expect over these possible worlds? We define the expected support to be the support count averaged over ALL the possible worlds, weighted by their likelihoods.
25
Possible World Interpretation – expected support. The expected support is calculated by summing up the weighted support counts of ALL the possible worlds: world 1 contributes 2 × 0.2016 = 0.4032, world 2 contributes 1 × 0.0224 = 0.0224, and the remaining worlds contribute 0.0504, 0.3024, 0.0864, 0.1296, 0.0056, 0, ..., giving an expected support of 1. That is, we expect 1 patient to have both "Eating Disorder" and "Depression". Notice that the world likelihoods form a discrete probability distribution over the support values of itemset {S1, S2}: since the patients' entries are independent of each other, P(support = 0) = 20.16%, P(support = 1) = 59.68% and P(support = 2) = 20.16%.
26
Possible World Interpretation. Instead of enumerating all possible worlds to calculate the expected support, it can be calculated directly from the uncertain database: for each transaction, multiply the existential probabilities of the items in the itemset, then take the total sum over all transactions. In the example, Patient 1 contributes 0.9 × 0.8 = 0.72 and Patient 2 contributes 0.4 × 0.7 = 0.28, for a total expected support of 1.
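In symbols, the expected support of an itemset X is the sum over all transactions t of the product of the existential probabilities of X's items in t. A minimal sketch of this computation (the item names below are just the example from the slide):

```python
# Sketch of the expected-support formula: for each transaction multiply the
# existential probabilities of the itemset's items, then sum over transactions.
def expected_support(itemset, uncertain_db):
    """uncertain_db: list of dicts mapping item -> existential probability."""
    total = 0.0
    for t in uncertain_db:
        prod = 1.0
        for item in itemset:
            prod *= t.get(item, 0.0)       # absent item => probability 0
        total += prod
    return total

db = [{"depression": 0.9, "eating_disorder": 0.8},     # Patient 1
      {"depression": 0.4, "eating_disorder": 0.7}]     # Patient 2
print(expected_support({"depression", "eating_disorder"}, db))   # 0.72 + 0.28 = 1.0
```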
27
Mining Frequent Itemsets from probabilistic data – problem definition. Given an uncertain database D in which each item of a transaction is associated with an existential probability, and a user-specified support threshold s, return ALL the itemsets having expected support greater than or equal to |D| × s. In other words, find ALL the itemsets that are expected to be frequent according to the existential probabilities in the uncertain database.
28
Section 3 Mining frequent patterns from uncertain data The Uncertain Apriori algorithm
29
Uncertain Apriori Algorithm. All the procedures are the same as in the conventional association rule mining algorithm; the only difference is in the subset function. (Slide figure: the Apriori loop — starting from k = 1, Apriori-Gen generates candidate itemsets, the Subset Function counts them to produce the size-k large itemsets, then k = k + 1 until the end.)
30
Uncertain Apriori Algorithm. The only difference is in the subset function: instead of storing support counts, candidate itemsets are associated with expected support counts, and each candidate's count is incremented by the expected support contributed by the transaction. For example, for a transaction containing item 1 with probability 70% and item 4 with probability 30%, the expected support count of candidate {1,4} is increased by 0.7 × 0.3 = 0.21.
31
Uncertain Apriori Algorithm. Thus we can apply Uncertain Apriori to an uncertain database (e.g. the psychological symptoms database with existential probabilities) to mine ALL the frequent itemsets. But why does the algorithm run for so long, or even fail to terminate? (Slide figure: a transaction with existential probabilities, e.g. item 1 at 90%, item 2 at 2%, item 3 at 99%, ..., being matched against the size-2 candidates in the hash tree.)
32
Computational Issue. Each item (attribute) of a transaction (object) is associated with an existential probability; besides the items with a very high probability of presence, there are a large number of items with a relatively low probability of presence. (Slide figure: the psychological symptoms database with existential probabilities, e.g. Patient 1 with Mood Disorder 90%, Anxiety Disorder 2%, Eating Disorder 99%, Obsessive-Compulsive Disorder 97%, Depression 92%, ..., Self Destructive Disorder 5%.)
33
Computational Issue. Consider a transaction with some low-existential-probability items, e.g. items 1 (70%), 2 (50%), 4 (30%), 7 (3%), 10 (2%), ..., 991 (60%). The expected supports this transaction contributes to the size-2 candidates in a leaf node are tiny whenever low-probability items are involved: {1,4} gets 0.21, {1,7} gets 0.021, {1,10} gets 0.014, {4,7} gets 0.009, {4,10} gets 0.006 and {7,10} gets only 0.0006. These are many insignificant subset increments, and if {7,10} turns out to be infrequent after scanning the database, ALL of its subset increments were redundant.
34
Computational Issue. A preliminary experiment was conducted to verify the computational bottleneck of mining an uncertain database. In general, an uncertain database has "longer" transactions (i.e. more items per transaction): some items with high existential probabilities and some items with low existential probabilities. In our current study, we focus on datasets with a bimodal existential probability distribution.
35
Computational Issue. Synthetic datasets simulate a bimodal distribution of existential probabilities: 7 datasets with the same frequent itemsets, varying the percentage of items with low existential probability across the datasets (0%, 33.33%, 50%, 60%, 66.67%, 71.4% and 75%).
36
Preliminary Study. All datasets have the same large itemsets in each iteration, but there is a sudden burst in the number of candidates in the second iteration. Since the 0% and 75% low-probability datasets have the same frequent itemsets, the subset increments contributed by the 75% of items with low existential probability may be largely redundant, so there is potential to reduce the execution time. The time spent on subset checking in each iteration shows that the computational bottleneck occurs in iteration 2. (Slide figures: number of candidates and number of large itemsets per iteration, and time spent on subset checking per iteration, for the different datasets.)
37
Section 4 Efficient Methods of Mining Frequent itemsets from Existentially Uncertain Data
38
Efficient Method 1 Data Trimming Avoid insignificant subset increments
39
Method 1 - Data Trimming Strategy – direction. Try to avoid incrementing those insignificant expected support counts. This saves the effort of traversing the hash tree, the effort of computing the expected supports (multiplications of floating-point variables), and the I/O for retrieving the items with very low existential probability.
40
Method 1 - Data Trimming Strategy. Question: which items should be trimmed? Intuitively, items with low existential probability should be trimmed, but how low is "low"? For the time being, let us assume there is a user-specified trimming threshold.
41
Method 1 - Data Trimming Strategy. Create a trimmed database by removing all items with existential probability lower than the trimming threshold. During the trimming process, some statistics are kept for error estimation when mining the trimmed database: the total trimmed expected support count of each item, the maximum existential probability among each item's trimmed occurrences, and possibly other information (e.g. inverted lists, signature files, etc.). (Slide figure: an uncertain database over items I1, I2, I3, ..., I4000 and its trimmed counterpart, which keeps only the high-probability entries such as t1's I1 = 90% and I2 = 80%, together with a per-item statistics table.)
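A minimal sketch of the trimming step, assuming the uncertain database is a list of item-to-probability dictionaries (the function and variable names are illustrative, not from the talk's C implementation):

```python
# Hedged sketch of the trimming module: remove low-probability entries and
# keep, per item, the total trimmed expected support and the maximum
# existential probability among its trimmed occurrences.
def trim(uncertain_db, threshold):
    trimmed_db, trimmed_sum, trimmed_max = [], {}, {}
    for t in uncertain_db:
        kept = {}
        for item, p in t.items():
            if p >= threshold:
                kept[item] = p
            else:                                   # trimmed: record statistics
                trimmed_sum[item] = trimmed_sum.get(item, 0.0) + p
                trimmed_max[item] = max(trimmed_max.get(item, 0.0), p)
        trimmed_db.append(kept)
    return trimmed_db, trimmed_sum, trimmed_max

db = [{"I1": 0.90, "I2": 0.80, "I3": 0.03},
      {"I1": 0.80, "I2": 0.04, "I3": 0.85}]
print(trim(db, threshold=0.10))
```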
42
Method 1 - Data Trimming Strategy – the trimming framework. The uncertain database is first passed into the Trimming Module, which removes the items with low existential probability and gathers statistics during the trimming process. During trimming, the "true" expected support counts of the size-1 candidates are counted, so the size-1 large itemsets have no false negatives. The size-1 frequent items are then passed into the APRIORI-GEN procedure to generate size-2 candidates, and the Subset Function scans the trimmed database to count the expected support of every size-2 candidate; we expect mining the trimmed database to save a lot of I/O and computational cost. Notice that the itemsets found infrequent are only infrequent in the trimmed database: they may contain some itemsets that are truly frequent in the original database. The Pruning Module therefore uses the statistics gathered by the Trimming Module to estimate the error and to identify the potentially frequent itemsets among the infrequent ones. Here there are two strategies: use the potentially frequent itemsets to generate the size-(k+1) candidates, or do not use them. Finally, all the potentially frequent itemsets are checked against the original database (the Patch Up Module) to verify their true supports.
43
Method 1 - Data Trimming Strategy – Pruning Module. The role of the Pruning Module is to identify those itemsets which are infrequent in the trimmed database but may still be frequent in the original database. The expected support of an itemset {A,B} splits into two parts: the part in which both item A and item B remain in the trimmed database, which can be obtained by mining the trimmed database, and the part involving trimmed occurrences, which has to be estimated.
44
Method 1 - Data Trimming Strategy – Pruning Module. If the count obtained from the trimmed database plus the upper bound of the estimated (trimmed) part is greater than or equal to the minimum expected support requirement, {A,B} is regarded as potentially frequent. Otherwise, {A,B} cannot be frequent in the original database and can be pruned.
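A hedged sketch of this pruning check. The talk's exact error bound is not reproduced on this slide; as a simple stand-in, the sketch bounds the contribution lost by trimming with the total trimmed expected support of each item in the itemset (each trimmed transaction can add at most p(A), resp. p(B), to {A,B}), which is a valid but looser bound than one that also exploits the maximum trimmed probability:

```python
# Hedged sketch of the pruning check, with a simple (loose) error bound that
# stands in for the talk's exact estimate.
def potentially_frequent(trimmed_count, itemset, trimmed_sum, min_esup):
    # trimmed_count: expected support of `itemset` counted on the trimmed database
    # trimmed_sum[x]: total expected support of item x that was trimmed away
    error_bound = sum(trimmed_sum.get(x, 0.0) for x in itemset)
    return trimmed_count + error_bound >= min_esup

# e.g. counted 1.4 on the trimmed database; at most 0.5 + 0.3 was trimmed away
print(potentially_frequent(1.4, ("A", "B"), {"A": 0.5, "B": 0.3}, min_esup=2.0))  # True
```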
45
Method 1 - Data Trimming Strategy – max count pruning strategy. The pruning strategy depends on statistics gathered by the Trimming Module: for each size-1 item it keeps the total expected support count that was trimmed and the maximum existential probability among the trimmed occurrences. When these statistics are kept globally for the whole database, the method is called the Global Max Count Pruning Strategy. Because estimating the whole database with global counts can be loose, we may instead keep the same statistics locally, per partition of the database (parts a, b, c, d, e), to obtain a tighter bound: the Local Max Count Pruning Strategy. (Slide figure: per-item statistics tables, e.g. item I1 with trimmed expected support 16.6 in part a, 14.2 in part b, 13 in part c, 0.1 in part d and 2.7 in part e, and maximum trimmed probabilities of 2%, 0.5%, 6%, 1% and 0.7% in the respective parts.)
46
Method 1 - Data Trimming Strategy – max count pruning strategy. The upper bound estimations of the trimmed expected supports are derived from the statistics gathered in iteration 1. SKIP
47
Method 1 - Data Trimming Strategy – Patch Up Module. The Pruning Module identifies a set of potentially frequent itemsets; the Patch Up Module verifies the true frequencies of these potentially frequent itemsets against the original uncertain database and recovers the missed frequent itemsets. Two strategies: a One-Pass Patch Up Strategy and a Multiple-Passes Patch Up Strategy. (Slide figure: the trimming framework — Trimming Module, trimmed database plus statistics, Apriori-Gen, Subset Function with hash tree, Pruning Module, Patch Up Module.)
48
Method 1 - Data Trimming Strategy – determining the trimming threshold. Question: which items should be trimmed? (Slide figure: the same trimming framework, highlighting the Trimming Module.)
49
Method 1 - Data Trimming Strategy Determine trimming threshold Before scanning the database and incrementing the support counts of candidates, we cannot deduce which itemset is infrequent. We can make a guess on the trimming threshold from the statistics gathered from previous iterations.
50
Method 1 - Data Trimming Strategy – determining the trimming threshold. Using statistics from the previous iteration, order the existential probabilities of a size-1 item in descending order and plot its cumulative support. E.g. item A has an expected support just over the support threshold: it is marginally frequent, so its supersets are potentially infrequent. If a superset is infrequent, it will not be frequent in the trimmed database either, so we want to trim items in such a way that the error estimation is tight enough for the Pruning Module to prune that superset. Use the existential probability of the item at which the cumulative support curve crosses the support threshold as the trimming threshold. (Slide figure: cumulative support of item A, with its occurrences ordered by existential probability in descending order.)
51
Method 1 - Data Trimming Strategy – determining the trimming threshold. E.g. item B has an expected support much larger than the support threshold, so its supersets are likely to be frequent, and the expected support contributed by its low-probability occurrences is insignificant. Again, use the existential probability of the item at the crossing point as the trimming threshold. (Slide figure: cumulative support of item B, with its occurrences ordered by existential probability in descending order.)
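A sketch of this threshold heuristic, assuming per-item existential probabilities gathered in the previous iteration (a simplified reading of the slides, not the talk's exact procedure):

```python
# Hedged sketch: sort an item's existential probabilities in descending order,
# accumulate them, and return the probability at which the cumulative expected
# support crosses the minimum expected support (the "intersecting" occurrence).
def trimming_threshold(probabilities, min_expected_support):
    cumulative = 0.0
    for p in sorted(probabilities, reverse=True):
        cumulative += p
        if cumulative >= min_expected_support:
            return p            # trimming threshold for this item
    return 0.0                  # item cannot reach the threshold: nothing to trim

# A marginally frequent item crosses the threshold late, so the returned
# trimming threshold is low and few of its occurrences are trimmed.
print(trimming_threshold([0.9, 0.8, 0.3, 0.2, 0.1], min_expected_support=2.0))  # 0.3
```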
52
Efficient Method 2 Decremental Pruning Identify infrequent candidates during database scan
53
Method 2 - Decremental Pruning. In some cases it is possible to identify an itemset as infrequent before scanning the whole database. For instance, suppose the minimum support threshold is 100 and the expected support of item A is 101, with A appearing in transaction t1 with probability 70% and in t2 with probability 50% (and the other item of interest having probability 0% in both). The total expected support that A can still contribute is 100.3 from transaction t2 onwards and only 99.8 from transaction t3 onwards. So after scanning transaction t2 we can already conclude that, over t2 to t100K, item A cannot make any candidate containing it reach the threshold: ALL candidates containing A must be infrequent and can be pruned. (Slide figure: an uncertain database of 100K transactions with the existential probabilities of item A.)
54
Method 2 - Decremental Pruning. Before scanning the database, define two "decremental counters" for itemset {A,B}, one per item. The counter for A, d_t(A,AB), represents how much the expected support of {A,B} would exceed the minimum support in the best case — that is, if from transaction t to the end of the database ALL occurrences of item A matched with item B and ALL those matching Bs had 100% existential probability. Initially, d_0(A,AB) equals the expected support of A minus the minimum support.
55
Method 2 - Decremental Pruning. While scanning the transactions, update the decremental counters according to the following update equation:
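One update rule consistent with the worked numbers on the next slide (a reconstruction, since the slide's own equation is not reproduced here) is

d_t(A, AB) = d_{t-1}(A, AB) − p_t(A) · (1 − p_t(B)),   with d_0(A, AB) = expSup(A) − min_sup,

and symmetrically d_t(B, AB) = d_{t-1}(B, AB) − p_t(B) · (1 − p_t(A)). Intuitively, each transaction t reduces the remaining slack by the expected support that A contributes without B being present with probability 1.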
56
Method 2 - Decremental Pruning – brute-force method. Example: support threshold 50% over 4 transactions, so min_sup = 2. Uncertain database (existential probabilities): T1: A = 100%, B = 50%, C = 30%; T2: A = 90%, B = 80%, C = 70%; T3: A = 30%, B = 40%, C = 90%; T4: A = 40%, B = 40%, C = 30%. The expected supports are A = 2.6, B = 2.1 and C = 2.2. For candidate itemset {A,B}: before scanning the database, initialize its decremental counters; the value d0(A,AB) = 0.6 means that even if ALL occurrences of item A matched with item B and ALL matching Bs had 100% existential probability over the whole database, the expected support count of {A,B} would exceed min_sup by only 0.6. Update the decremental counters while scanning: after T1 the counter is 0.1, meaning the best case over transactions 2 to 4 exceeds min_sup by only 0.1, and after T2 it drops below zero. We can therefore conclude that candidate {A,B} is infrequent without scanning T3 and T4, which saves computational effort in the subset function.
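A sketch that reproduces the slide's numbers for candidate {A,B}, using the counter update reconstructed above (names are illustrative):

```python
# Hedged sketch of the brute-force decremental counters for one candidate {A,B},
# reproducing the slide's numbers (min_sup = 2, expected support of A = 2.6).
def decremental_prune(db, a, b, min_sup):
    exp_a = sum(t.get(a, 0.0) for t in db)
    exp_b = sum(t.get(b, 0.0) for t in db)
    d_a, d_b = exp_a - min_sup, exp_b - min_sup       # d0(A,AB) = 0.6, d0(B,AB) = 0.1
    for i, t in enumerate(db, start=1):
        d_a -= t.get(a, 0.0) * (1.0 - t.get(b, 0.0))  # reconstructed update rule
        d_b -= t.get(b, 0.0) * (1.0 - t.get(a, 0.0))
        if d_a < 0 or d_b < 0:                        # best case can no longer reach min_sup
            return f"{{A,B}} pruned after transaction T{i}"
    return "{A,B} not pruned"

db = [{"A": 1.0, "B": 0.5, "C": 0.3},
      {"A": 0.9, "B": 0.8, "C": 0.7},
      {"A": 0.3, "B": 0.4, "C": 0.9},
      {"A": 0.4, "B": 0.4, "C": 0.3}]
print(decremental_prune(db, "A", "B", min_sup=2))     # pruned after T2
```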
57
Method 2 - Decremental Pruning – brute-force method. This method is infeasible because each candidate has to be associated with at least 2 decremental counters, and even when an itemset has been identified as infrequent, the subset function still has to traverse the hash tree down to the leaf nodes to retrieve the corresponding counters before that is known. (Slide figure: the hash tree with decremental counters, e.g. d0(A,AD) and d0(D,AD), stored alongside each candidate in the leaves.)
58
Method 2 - Decremental Pruning – aggregate-by-item method. This method aggregates the decremental counters and maintains an upper bound of them. Suppose there are three size-2 candidates {A,B}, {A,C} and {B,C}; the brute-force method keeps 6 decremental counters in total. Instead, aggregate the counters d0(A,AB) and d0(A,AC) into a single counter d0(A) that is an upper bound of both.
59
Method 2 - Decremental Pruning – aggregate-by-item method, continued (same uncertain database as before). Initialize the counters, then scan transaction T1 and update the decremental counters: d1(A) = 0.6 − [1 × (1 − 0.5)] = 0.1. Since no counter is smaller than zero yet, we cannot conclude that any candidate is infrequent. Scan transaction T2 and update again; now d2(A) is smaller than zero, so {A,B} and {A,C} are infrequent and can be pruned. SKIP
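A sketch of the aggregate-by-item counter on the same example database; the counter for item A is decremented using the largest partner probability in each transaction so that it stays an upper bound of d(A,AB) and d(A,AC):

```python
# Hedged sketch of the aggregate-by-item counter: one counter d(A) per item,
# decremented with the *largest* partner probability so that it remains an
# upper bound of every brute-force counter d(A, AX).
def aggregated_prune(db, item, partners, min_sup):
    d = sum(t.get(item, 0.0) for t in db) - min_sup       # d0(A) = 2.6 - 2 = 0.6
    for i, t in enumerate(db, start=1):
        best_partner = max(t.get(x, 0.0) for x in partners)
        d -= t.get(item, 0.0) * (1.0 - best_partner)
        if d < 0:
            return f"all candidates containing {item} pruned after T{i} (d = {d:.2f})"
    return f"{item}: nothing pruned"

db = [{"A": 1.0, "B": 0.5, "C": 0.3},
      {"A": 0.9, "B": 0.8, "C": 0.7},
      {"A": 0.3, "B": 0.4, "C": 0.9},
      {"A": 0.4, "B": 0.4, "C": 0.3}]
print(aggregated_prune(db, "A", ["B", "C"], min_sup=2))   # pruned after T2, d = -0.08
```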
60
Method 2 - Decremental Pruning – hash-tree integration method. Rather than loosely aggregating the decremental counters by item, the aggregation can be based on the hash function used in the subset function. Recall that the brute-force approach stores the decremental counters in the leaf nodes of the hash tree.
61
Method 2 - Decremental Pruning – hash-tree integration method. The hash-tree integration method aggregates the decremental counters according to the hash function and stores the aggregated counters in the interior (hash) nodes, starting from the root of the hash tree. When an aggregated decremental counter becomes smaller than or equal to zero, the corresponding itemsets in the leaf nodes below it cannot be frequent and can be pruned.
62
Method 2 - Decremental Pruning – hash-tree integration method, improving the pruning power. The hash tree is a prefix tree constructed on the lexicographic order of items, so an item earlier in the order is the prefix of more itemsets. E.g. with candidates {A,B}, {A,C}, {A,D}, {B,C}, {B,D}, {C,D}, there are 3 itemsets under the decremental counter for prefix A but only 1 itemset under the counter for prefix C: if the former becomes negative during the database scan we can prune 3 itemsets, whereas the latter prunes only 1.
63
Method 2 - Decremental Pruning – hash-tree integration method. Our strategy is to reorder the items by their expected supports in ascending order, so that the decremental counters of the prefixes covering many itemsets are more likely to become negative than those covering only a few, maximizing the number of itemsets pruned.
64
Efficient Method 3 Candidate Filtering: identify infrequent candidates before the database scan. SKIP
65
Method 3 – Candidate Filtering. It is possible to identify some infrequent candidate itemsets before scanning the database to verify their supports. Uncertain database (min_sup = 2): T1: A = 30%, B = 50%, C = 100%; T2: A = 70%, B = 80%, C = 90%; T3: A = 90%, B = 40%, C = 30%; T4: A = 30%, B = 40%, C = 50%. After scanning the database once, the expected supports of items A, B and C are 2.2, 2.1 and 2.7; during this scan we also keep the maximum existential probability of each item (A: 90%, B: 80%, C: 100%). When the size-2 candidates are generated, these statistics give an upper bound on each candidate's expected support BEFORE scanning the database: the bounds for {A,B}, {A,C} and {B,C} are 1.76, 2.2 and 2.1.
66
Method 3 – Candidate Filtering, continued. For {A,B}: even if ALL occurrences of item A matched with B at B's maximum existential probability, {A,B} would have an expected support of at most 2.2 × 80% = 1.76. This upper bound on the expected support of {A,B} is smaller than min_sup, so {A,B} must be infrequent and can be pruned.
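A sketch of the candidate filtering check. The slide bounds {A,B} by (expected support of A) × (maximum existential probability of B); the sketch takes the minimum over both orderings, which is consistent with the bounds 1.76, 2.2 and 2.1 shown above:

```python
# Hedged sketch of candidate filtering: bound each size-2 candidate's expected
# support using only statistics collected in the previous scan.
def filter_candidates(candidates, exp_sup, max_prob, min_esup):
    survivors = []
    for a, b in candidates:
        upper = min(exp_sup[a] * max_prob[b], exp_sup[b] * max_prob[a])
        if upper >= min_esup:
            survivors.append((a, b))        # may be frequent: must still be counted
        # else: cannot be frequent, pruned before the database scan
    return survivors

exp_sup = {"A": 2.2, "B": 2.1, "C": 2.7}    # from the previous iteration's scan
max_prob = {"A": 0.9, "B": 0.8, "C": 1.0}
cands = [("A", "B"), ("A", "C"), ("B", "C")]
print(filter_candidates(cands, exp_sup, max_prob, min_esup=2))
# {A,B} is pruned (upper bound 1.76 < 2); {A,C} and {B,C} survive
```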
67
Section 5 Experimental Results and Discussions
68
Experiments – synthetic datasets. Data associations are generated by the IBM Synthetic Data Generator, parameterized by the average length of each transaction (T), the average length of the hidden frequent patterns (I) and the number of transactions (D). Data uncertainty: we simulate the situation where some items have high existential probabilities while others have low existential probabilities, using a bimodal distribution parameterized by the base of the high existential probabilities (HB), the base of the low existential probabilities (LB), the standard deviations of the high and low existential probabilities (HD, LD) and the percentage of items with low existential probabilities (R). Default setting: T100 R75% I6 D100K, HB90 HD5 LB10 LD5.
69
Experiments – implementation. Implemented in the C programming language. Machine: 2.6 GHz CPU, 1 GB memory, Fedora Linux. Experimental settings: T100 R75% I6 D100K, HB90 HD5 LB10 LD5 (dataset size about 136 MB), support threshold 0.5%.
70
Experimental Results – trimming method (T100 R75% I6 D100K, HB90 HD5 LB10 LD5). Since we use the one-pass patch up strategy, the trimming methods have one extra Patch Up phase. Iteration 2 is computationally expensive because there are many candidates, leading to heavy computational effort in the subset function: for Uncertain Apriori each transaction has C(100,2) = 4950 size-2 subsets, whereas for the trimming methods a trimmed transaction has only about C(25,2) = 300 size-2 subsets. The trimming methods successfully reduce the number of subset increments in ALL iterations, and even with the time spent on the Patch Up phase they still show a significant performance gain. (Slide figure: execution time of the trimming methods vs. Uncertain Apriori in each iteration.)
71
Experimental Results – CPU cost saving by trimming (T100 R75% I6 D100K, HB90 HD5 LB10 LD5). There is a negative CPU saving in iteration 1 because time is spent gathering the statistics for the Pruning Module; from iteration 2 to 6 the trimming methods achieve high computational savings in exactly the iterations where the CPU cost is significant. (Slide figures: CPU cost of the trimming methods vs. Uncertain Apriori in each iteration, and the CPU cost saving in each iteration.)
72
Experimental Results – I/O cost saving by trimming (T100 R75% I6 D100K, HB90 HD5 LB10 LD5). The trimming methods incur extra I/O in iteration 2 because they have to scan the original database PLUS create the trimmed database; I/O cost savings then occur from iteration 3 to iteration 6. In other words, the I/O cost saving increases when there are longer frequent itemsets. (Slide figures: I/O cost of the trimming methods vs. Uncertain Apriori in each iteration, and the I/O cost saving in each iteration.)
73
Experimental Results – varying the support threshold (T100 R75% I6 D100K, HB90 HD5 LB10 LD5). The execution time of the trimming method grows at a smaller rate than that of Uncertain Apriori across the support thresholds. (Slide figure: execution time of the trimming methods vs. Uncertain Apriori for different support thresholds.)
74
Experimental Results – varying the percentage of items with low existential probability (R from 0% to 100%; T100 R?% I6 D100K, HB90 HD5 LB10 LD5). All the datasets have the same frequent itemsets. The trimming methods achieve almost linear execution time as the percentage of items with low existential probability increases. (Slide figure: execution time of the trimming methods vs. Uncertain Apriori for different percentages of items with low existential probability.)
75
Experimental Results – decremental pruning (T100 R75% I6 D100K, HB90 HD5 LB10 LD5). In terms of pruning power in the 2nd iteration, the "integrate with hash tree" method outperforms the "aggregate by items" method. However, although it can prune twice as many candidates as the "aggregate by items" method, the resulting time saving is not significant because the "integrate with hash tree" method has more overhead. (Slide figures: percentage of candidates pruned during the database scan in the 2nd iteration, and execution time of the decremental pruning methods vs. Uncertain Apriori for different percentages of items with low existential probability.)
76
Experimental Results – varying the percentage of items with low existential probability (T100 R75% I6 D100K, HB90 HD5 LB10 LD5). The trimming and decremental methods can be combined to form a hybrid algorithm. (Slide figure: execution time of the decremental and trimming methods vs. Uncertain Apriori for different percentages of items with low existential probability.)
77
Experimental Results – hybrid algorithms (T100 R75% I6 D100K, HB90 HD5 LB10 LD5). Combining the three proposed methods achieves the smallest execution time. (Slide figure: execution time of the different combinations vs. Uncertain Apriori for different percentages of items with low existential probability.)
78
Experimental Results – varying the percentage of items with low existential probability (T100 R75% I6 D100K, HB90 HD5 LB10 LD5). CPU cost savings of the hybrid algorithm occur when 5% or more of the items in the dataset have low existential probability, and 80% or more of the CPU cost is saved for datasets with 40% or more low-probability items. I/O cost savings occur when there are 40% or more items with low existential probability; in fact, this figure only shows that the I/O cost saving increases as more items are trimmed — the I/O saving should also depend on the length of the hidden frequent itemsets, which could be shown by varying the (I) parameter in the dataset generation process. (Slide figures: overall CPU saving and overall I/O saving of the hybrid algorithm for different percentages of items with low existential probability.)
79
Conclusion. We have defined the problem of mining frequent itemsets from an uncertain database, and adopted the Possible World interpretation as the theoretical foundation of the mining process. Existing frequent itemset mining algorithms are either inapplicable to uncertain data or unacceptably inefficient on it. We have identified the computational bottleneck of Uncertain Apriori and proposed a number of efficient methods that reduce both the CPU and the I/O cost significantly.
80
Future Work. Sensitivity and scalability tests on each parameter (T, I, K, HB, LB, etc.). Generating association rules from uncertain data: what is the meaning of an association rule mined from uncertain data? A real case study. Other types of association rules, such as quantitative association rules and multidimensional association rules — I am now interested in rules of the form "80% Eating Disorder => 90% Depression". Papers.
81
End Thank you