Mining Negative Rules in Large Databases using GRD Dhananjay R Thiruvady Supervisor: Professor Geoffrey Webb
Overview Aims Association Rule Discovery Generalized Rule Discovery Tidsets and Diffsets Conclusion
Aims 1. To mine negative rules in a large database using GRD 2. To assess whether the negative rules are of potential interest to a user
Association Rule Discovery Rule: A => B (e.g. tea => coffee), where A is the antecedent and B is the consequent. Aim: search the database for strong associations between itemsets. An itemset is a set of items that can occur together in a transaction, e.g. {tea} in supermarket data.
Association Rule Discovery (Contd.) Support of Tea => Coffee: transactions with Tea and Coffee / |Data space|. Confidence of Tea => Coffee: transactions with Tea and Coffee / transactions with Tea.
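As a worked illustration of the two measures above, the following minimal Python sketch computes support and confidence for tea => coffee. The five transactions and the helper names (`cover`, `support`, `confidence`) are made up for this example and are not taken from the presentation.

```python
# Toy data: each transaction is the set of items bought together (assumed values).
transactions = [
    {"tea", "coffee"},
    {"tea", "milk"},
    {"tea", "coffee", "milk"},
    {"coffee"},
    {"milk"},
]

def cover(itemset, data):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in data if itemset <= t)

def support(antecedent, consequent, data):
    """Fraction of all transactions containing antecedent AND consequent."""
    return cover(antecedent | consequent, data) / len(data)

def confidence(antecedent, consequent, data):
    """Fraction of antecedent transactions that also contain the consequent."""
    return cover(antecedent | consequent, data) / cover(antecedent, data)

sup = support({"tea"}, {"coffee"}, transactions)      # 2 of 5 transactions
conf = confidence({"tea"}, {"coffee"}, transactions)  # 2 of 3 tea transactions
print(f"support={sup:.2f} confidence={conf:.2f}")
```

In this toy data, tea and coffee occur together in two of the five transactions, and tea occurs in three, so the rule has support 0.4 and confidence of about 0.67.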
Association Rule Discovery (Contd.) Generates rules from itemsets that meet a minimum support (frequent itemsets). Further constraints can then be applied, e.g. confidence (interest).
Generalized Rule Discovery An alternative to Association Rule Discovery. Uses the OPUS algorithm for unordered search [Webb, 95]. Generates a large number of rules based on user-specified constraints. Constraints include minimum support, confidence, etc.
Tidsets and Diffsets [Zaki, Gouda, 01] Every itemset is stored with its corresponding set of transaction identifiers (its tidset). Vertical mining of this kind has proved to be more efficient than horizontal mining. [Table: tidsets for Tea, Coffee, Milk]
Tidsets and Diffsets (contd.) A diffset is the set of transactions in which the itemset does not appear. The diffset of an itemset is therefore the tidset of its negation. [Table: tidsets for Tea, Coffee, Milk alongside Diffset (Tea), Diffset (Coffee), Diffset (Milk)]
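The vertical layout can be sketched in a few lines of Python. This is an illustration only: the transaction ids and contents are invented, `build_tidsets` is a hypothetical helper name, and the diffset is taken exactly as defined on this slide, i.e. the transactions in which the item does not appear.

```python
# Horizontal layout: transaction id -> items (toy data, assumed for illustration).
transactions = {
    1: {"tea", "coffee"},
    2: {"tea", "milk"},
    3: {"tea", "coffee", "milk"},
    4: {"coffee"},
    5: {"milk"},
}

def build_tidsets(data):
    """Convert to the vertical layout: item -> set of transaction ids."""
    tidsets = {}
    for tid, items in data.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

tidsets = build_tidsets(transactions)
all_tids = set(transactions)

# Diffset as defined on the slide: transactions in which the item is absent.
diffsets = {item: all_tids - tids for item, tids in tidsets.items()}

for item in sorted(tidsets):
    print(item, sorted(tidsets[item]), sorted(diffsets[item]))
```

With these transactions, tea has tidset {1, 2, 3} and diffset {4, 5}. Each diffset is a single set difference against the full set of transaction ids.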
Tidsets and Diffsets (contd.) GRD already calculates the tidset of each itemset, so the diffset of an itemset can be computed at very little extra cost. [Table: A, B, C alongside ~A, ~B, ~C]
Conclusion Find negative correlations between itemsets in a database. Rules: tea => ~coffee, ~tea => coffee, ~tea => ~coffee. This will be achieved by extending the GRD technique.
Conclusion (Contd.) Using diffsets: tidset(A) = diffset(~A). Negative associations can be calculated with very little additional computational overhead. Assess whether the resulting negative rules are potentially interesting or not.
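To make the tidset/diffset relationship concrete, here is a minimal Python sketch that scores the three negative rule forms by swapping a tidset for its complement whenever an item is negated. It is not the GRD implementation: the tidsets are toy values, and `cover` and `rule_stats` are hypothetical names introduced for this example.

```python
# Toy vertical data (assumed): item -> ids of transactions containing it.
all_tids = {1, 2, 3, 4, 5}
tidsets = {"tea": {1, 2, 3}, "coffee": {1, 3, 4}}

def cover(item, negated=False):
    """Tidset of the item, or its complement (the diffset) when negated."""
    tids = tidsets[item]
    return all_tids - tids if negated else tids

def rule_stats(antecedent, consequent):
    """Support and confidence of a rule; each side is an (item, negated) pair."""
    a = cover(*antecedent)
    c = cover(*consequent)
    both = a & c
    support = len(both) / len(all_tids)
    confidence = len(both) / len(a) if a else 0.0
    return support, confidence

# tea => ~coffee, ~tea => coffee, ~tea => ~coffee
rules = {
    "tea => ~coffee": rule_stats(("tea", False), ("coffee", True)),
    "~tea => coffee": rule_stats(("tea", True), ("coffee", False)),
    "~tea => ~coffee": rule_stats(("tea", True), ("coffee", True)),
}
for name, (sup, conf) in rules.items():
    print(f"{name}: support={sup:.2f} confidence={conf:.2f}")
```

The only extra work for a negated item is one set difference against the full set of transaction ids, which is the sense in which negative rules add little overhead once tidsets are available.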
Any questions?