Optimization of Association Rules Extraction Through Exploitation of Context Dependent Constraints Arianna Gallo, Roberto Esposito, Rosa Meo, Marco Botta Dipartimento di Informatica, Università di Torino
Outline
  Motivations
    Knowledge Discovery from Databases (KDD), Inductive Databases
    Constraint-Based Mining
    Incremental Constraint Evaluation
  Association Rule Mining
  Incremental Algorithms
    Constraints properties
      Item Dependent Constraints (IDC)
      Context Dependent Constraints (CDC)
    Incremental Algorithms for IDC and CDC
  Performance results and Conclusions
Motivations: KDD process and Inductive Databases (IDB)
  The KDD process consists of the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
  Inductive Databases were proposed by Mannila and Imielinski [CACM’96] as a support for KDD.
  KDD is an interactive and iterative process.
  Inductive Databases contain both data and inductive generalizations (e.g. patterns, models) extracted from the data.
  Users can query the inductive database with an advanced, ad hoc data mining query language (constraint-based queries).
Motivations: Constraint-Based Mining and Incrementality
  Why constraints?
    They can be pushed into the pattern computation to prune the search space.
    They give the user a tool to express her interests (both in data and in knowledge).
  In an IDB, constraint-based queries are very often a refinement of previous ones:
    explorative process
    reconciling background and extracted knowledge
  Why execute each query from scratch every time?
  The new query can be executed incrementally! [Baralis et al., DaWaK’99]
A Generic Mining Language
  R = Q(T, G, I, M, <statistical measures>)
  A very generic constraint-based mining query requests:
    extraction from a source table (T)
    of sets of items (itemsets) on some schema (I)
    from the groups of the database, defined by grouping constraints (G)
    satisfying some user-defined constraints, the mining constraints (M).
    The number of such groups must be sufficient (user-defined statistical evaluation measures, such as support).
  In our case R contains association rules.
An Example
  Mining query: R = Q(purchase, customer, product, price > 100, support_count >= 2)
  Source table purchase (columns: transaction, customer, product, date, price, quantity).
  Products in row order: hiking boots, ski pants, jacket, col shirt, ski pants, jacket, col shirt, jacket.
  Result R, itemsets:
    itemset               support_count
    {ski pants}           2
    {jacket, ski pants}   2
    {jacket}              3
  Result R, rules (frequency: 2/3):
    body        head        confidence
    ski pants   jacket      1
    jacket      ski pants   2/3
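The example query can be sketched in a few lines of Python. This is only an illustrative brute-force version, not the authors' implementation: the function name `mine` is hypothetical, and the customer assignments and prices in `purchase` are invented (the table values are not legible here), chosen so that the counts match the result shown on the slide.

```python
from itertools import combinations

# Hypothetical rows (customer, product, price). Customers and prices are
# invented for illustration; they are not the values of the original table.
purchase = [
    (1001, "hiking boots", 180),
    (1001, "ski pants", 140),
    (1001, "jacket", 300),
    (1002, "col shirt", 25),
    (1002, "ski pants", 140),
    (1002, "jacket", 300),
    (1003, "col shirt", 25),
    (1003, "jacket", 300),
]

def mine(rows, min_price, min_support):
    """Group by customer, keep items with price > min_price, count itemsets."""
    groups = {}
    for customer, product, price in rows:
        groups.setdefault(customer, set())
        if price > min_price:  # mining constraint
            groups[customer].add(product)
    # Support count = number of groups (customers) containing the itemset.
    counts = {}
    for items in groups.values():
        for size in range(1, len(items) + 1):
            for itemset in combinations(sorted(items), size):
                counts[itemset] = counts.get(itemset, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    # Rules body -> head from each frequent itemset with at least two items.
    rules = {}
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for body in combinations(itemset, size):
                head = tuple(i for i in itemset if i not in body)
                rules[(body, head)] = (supp, supp / frequent[body])
    return frequent, rules

frequent, rules = mine(purchase, min_price=100, min_support=2)
```

Exhaustive enumeration is exponential in the group size, so this only serves to make the definitions of support count and confidence concrete.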
Incremental Algorithms
  We studied an incremental approach to answering new constraint-based queries that makes use of the information (rules with support and confidence) contained in previous results.
  We identified two classes of query constraints:
    item dependent (IDC)
    context dependent (CDC)
  We propose two newly developed incremental algorithms that exploit past results in the two cases (IDC and CDC).
Relationships between two queries
  Query equivalence: R1 = R2; no computation is needed [FQAS’04]
  Query containment: [This paper]
    Inclusion: R2 ⊆ R1 and common elements have the same statistical measures. R2 = σ_C(R1)
    Dominance: R2 ⊆ R1 but common elements do not have the same statistical measures. R2 ⊆ σ_C(R1)
  We can speed up the execution of a new query using results of previous queries.
  Which previous queries?
  How can we recognize inclusion or dominance between two constraint-based queries?
IDC vs CDC
  Example on the purchase table (columns: transaction, customer, product, date, price, quantity): IDC: price > 150; CDC: qty > …
  Item Dependent Constraints (IDC)
    are functionally dependent on the item extracted;
    are satisfied for a given itemset either for all the groups in the database or for none;
    if an itemset is common to R1 and R2, it will have the same support: inclusion.
  Context Dependent Constraints (CDC)
    depend on the transactions in the database;
    might be satisfied for a given itemset only for some groups in the database;
    an itemset common to R1 and R2 might not have the same support: dominance.
Incremental Algorithm for IDC
  Previous query Q1: ….. Constraint: price > 5 ….. Its result R1 is held in memory as rules (BODY, HEAD, SUPP, CONF), including A → B and A → C.
  Current query Q2: ….. Constraint: price > 10 …..
  Item Domain Table (columns: item, price, category; items A, B, C; categories hi-tech, housing): item C belongs to a row that does not satisfy the new IDC constraint (Fail).
  Delete from R1 all rules containing item C; the remaining rules form R2 (R2 = σ_P(R1)).
  In the example, R2 keeps A → B with SUPP 2 and CONF 1.
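The IDC step amounts to a selection over the rules already in memory, with no access to the transaction data. A minimal sketch follows, assuming rules are stored as dictionaries and the item domain table as a mapping from item to attributes; the names `incremental_idc`, `item_table` and `satisfies`, as well as the prices and the measures of the rule A → C, are hypothetical.

```python
def incremental_idc(previous_rules, item_table, satisfies):
    """Keep only the rules of R1 whose items all satisfy the new IDC."""
    # Items whose row in the item domain table fails the new constraint.
    failing = {item for item, attrs in item_table.items() if not satisfies(attrs)}
    return [
        rule for rule in previous_rules
        if not (set(rule["body"]) | set(rule["head"])) & failing
    ]

# Hypothetical item domain table: prices and categories are invented.
item_table = {
    "A": {"price": 12, "category": "hi-tech"},
    "B": {"price": 30, "category": "hi-tech"},
    "C": {"price": 8, "category": "housing"},
}

# R1, computed earlier under the weaker constraint price > 5.
r1 = [
    {"body": ("A",), "head": ("B",), "supp": 2, "conf": 1.0},
    {"body": ("A",), "head": ("C",), "supp": 2, "conf": 1.0},
]

# New, stricter constraint of Q2: price > 10. Item C fails it.
r2 = incremental_idc(r1, item_table, lambda attrs: attrs["price"] > 10)
```

Support and confidence are copied unchanged, which is valid only because an IDC holds for an itemset either in every group or in none.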
Incremental Algorithm for CDC
  Previous query Q1: ….. Constraint: qty > 5 ….. Its result R1 is held in memory as rules (BODY, HEAD, SUPP, CONF).
  Current query Q2: ….. Constraint: qty > 10 …..
  Steps:
    build the BHF from the rules in R1;
    read the DB;
    find the groups
      in which the new constraints are satisfied,
      containing items belonging to the BHF;
    update the support counters in the BHF;
    output R2 as rules (BODY, HEAD, SUPP, CONF).
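The CDC steps can be sketched as a single recount pass restricted to the rules of the previous result. This is a simplified illustration that uses plain dictionaries as counters instead of a body-head forest; the function name `incremental_cdc`, the row layout `(group, item, qty)` and all data values are assumptions.

```python
def incremental_cdc(previous_rules, rows, satisfies, min_support):
    """Recount the rules of R1 under a stricter context dependent constraint."""
    # Candidates come only from the previous result (dominance: R2 within R1).
    body_count = {body: 0 for body, _ in previous_rules}
    rule_count = {rule: 0 for rule in previous_rules}
    # Read the DB once, keeping per group only the items whose row
    # satisfies the new constraint.
    groups = {}
    for group, item, qty in rows:
        groups.setdefault(group, set())
        if satisfies(qty):
            groups[group].add(item)
    # Update the support counters.
    for items in groups.values():
        for body in body_count:
            if body <= items:
                body_count[body] += 1
        for body, head in rule_count:
            if body <= items and head <= items:
                rule_count[(body, head)] += 1
    # Keep the rules that still reach the support threshold.
    return {
        rule: (supp, supp / body_count[rule[0]])
        for rule, supp in rule_count.items()
        if supp >= min_support and supp > 0
    }

# Hypothetical rows (group, item, qty); values are invented.
rows = [
    ("c1", "a", 12), ("c1", "f", 11), ("c1", "g", 20),
    ("c2", "a", 15), ("c2", "f", 6), ("c2", "g", 12),
    ("c3", "a", 11), ("c3", "f", 14), ("c3", "g", 13),
    ("c4", "a", 7), ("c4", "f", 12), ("c4", "g", 11),
]
# Rules assumed to be in R1 (valid under qty > 5).
r1_rules = [
    (frozenset("a"), frozenset("fg")),
    (frozenset("a"), frozenset("g")),
]
r2 = incremental_cdc(r1_rules, rows, lambda qty: qty > 10, min_support=3)
```

Unlike the IDC case, the database must be read again, because the supports of surviving rules can change; the saving comes from counting only the candidates taken from R1.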
Body-Head Forest (BHF)
  The body (head) tree contains the itemsets that are candidates for the body (head) part of a rule.
  An itemset is represented as a single path in the tree, and vice versa.
  Each path in the body (head) tree is associated with a counter representing the body (rule) support.
  Figure: tree nodes a, m, g with counter (4) and nodes f, g with counter (3).
  Example rule: a → f g, rule support = 3, confidence = 3/4.
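One possible way to realize such a structure is a prefix tree of bodies in which every body node owns a prefix tree of heads. The sketch below is an assumption about the layout, not the authors' data structure; the names `Node`, `BodyHeadForest`, `add_rule`, `count_group` and `measures` are hypothetical, and the groups are invented so that the counters reproduce the numbers of the example rule.

```python
class Node:
    def __init__(self):
        self.children = {}     # item -> Node
        self.count = 0         # support counter of the path ending here
        self.terminal = False  # True if a stored itemset ends at this node
        self.head_tree = None  # for body nodes: root of the tree of heads

def insert(root, itemset):
    """Store an itemset as a single path of sorted items."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, Node())
    node.terminal = True
    return node

def find(root, itemset):
    node = root
    for item in sorted(itemset):
        node = node.children[item]
    return node

def bump(node, items):
    """Increment the counter of every stored path contained in `items`."""
    for item, child in node.children.items():
        if item in items:
            if child.terminal:
                child.count += 1
                if child.head_tree is not None:
                    bump(child.head_tree, items)  # heads of this body
            bump(child, items)

class BodyHeadForest:
    def __init__(self):
        self.bodies = Node()

    def add_rule(self, body, head):
        body_node = insert(self.bodies, body)
        if body_node.head_tree is None:
            body_node.head_tree = Node()
        insert(body_node.head_tree, head)

    def count_group(self, items):
        bump(self.bodies, set(items))

    def measures(self, body, head):
        body_node = find(self.bodies, body)
        rule_node = find(body_node.head_tree, head)
        conf = rule_node.count / body_node.count if body_node.count else 0.0
        return rule_node.count, conf

bhf = BodyHeadForest()
bhf.add_rule({"a"}, {"f", "g"})
for group in [{"a", "f", "g"}, {"a", "f", "g", "m"}, {"a", "g", "f"}, {"a", "m"}, {"m", "g"}]:
    bhf.count_group(group)
support, confidence = bhf.measures({"a"}, {"f", "g"})
```

The counter on the body path gives the body support and the counter on the head path gives the rule support, so confidence is their ratio.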
Experiments (1): ID vs CD algorithm
  [Figure: four plots. ID algorithm, panels (a) and (b); CD algorithm, panels (c) and (d). Plot titles: execution time vs constraint selectivity; execution time vs volume of previous result.]
Experiments (2): CARE vs Incremental
  [Figure: three plots, panels (a), (b), (c). Plot titles: execution time vs cardinality of previous result; execution time vs support threshold; execution time vs selectivity of constraints.]
Conclusions and future work
  We proposed two incremental algorithms for constraint-based mining that use the information contained in previous results to answer new queries.
  The first algorithm deals with item dependent constraints, the second with context dependent ones.
  We evaluated the incremental algorithms on a fairly large dataset.
  The results show that the approach drastically reduces the execution time.
  An interesting direction for future research: integration of condensed representations with these incremental techniques.
The end
  Questions?
Condensed representation
  It is well known that the set of association rules can rapidly grow unwieldy, especially as the frequency bound decreases.
  Since most of these rules turn out to be redundant, it is not necessary to mine rules from all frequent itemsets; it is sufficient to consider only the rules among closed frequent itemsets.
  In fact, frequent closed itemsets are a small subset (a condensed representation) of the frequent itemsets, without information loss.
  For this reason, mining the frequent closed itemsets instead of all frequent itemsets is highly advantageous.
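To make the notion concrete: a frequent itemset is closed when no proper superset has the same support. The sketch below filters the itemsets of the earlier purchase example; the function name `closed_itemsets` is hypothetical, and the quadratic scan is for illustration only, not an efficient closed-itemset miner.

```python
def closed_itemsets(frequent):
    """Keep the itemsets that have no proper superset with equal support."""
    return {
        itemset: supp
        for itemset, supp in frequent.items()
        if not any(itemset < other and supp == frequent[other] for other in frequent)
    }

# Frequent itemsets and support counts from the purchase example.
frequent = {
    frozenset({"ski pants"}): 2,
    frozenset({"jacket", "ski pants"}): 2,
    frozenset({"jacket"}): 3,
}
closed = closed_itemsets(frequent)
```

Here {ski pants} is dropped because {jacket, ski pants} has the same support; its support can still be recovered as that of its smallest closed superset, which is why no information is lost.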