ICMLC2007, Aug. 19~22, 2007, Hong Kong

Incremental Maintenance of Ontology-Exploiting Association Rules

Ming-Cheng Tseng (1), Wen-Yang Lin (2) and Rong Jeng (3)
(1, 3) Institute of Information Engineering, I-Shou University, Taiwan
(2) Dept. of Comp. Sci. & Info. Eng., National University of Kaohsiung, Taiwan
August 20, 2007

Outline
- Introduction
- Problem description
- The proposed algorithm
- Performance evaluation
- Conclusions

Introduction
Motivation
- In general, many semantic relationships (domain knowledge) exist among items
  - It is natural to incorporate a domain ontology into the data mining process to discover more innovative rules
- The source databases change over time
  - E.g., insertion, deletion, modification
  - The discovered knowledge (rules) has to be updated to reflect the new situation

Introduction (cont.)
Association rules
- Given:
  - A database of customer transactions
  - Each transaction is a set of items
- Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
- Example: Sony VAIO => HP LaserJet 1300 (Sup. 30%, Conf. 60%)

Introduction (cont.)
Strong association rules
- Given: user-specified constraints
  - Minimum support (min_sup)
  - Minimum confidence (min_conf)
- Find rules X => Y with support and confidence larger than the user-specified minimum values
- Example: min_sup = 25%, min_conf = 50%
  - Sony VAIO => HP LaserJet 1300 (Sup. 30%, Conf. 60%)

Introduction (cont.)
Frequent itemsets (patterns) mining
- The association mining problem can be reduced to the problem of mining frequent itemsets, i.e., itemsets with support larger than min_sup
- Example: min_sup = 25%, min_conf = 50%
  - Sony VAIO => HP LaserJet 1300 (Sup. 30%, Conf. 60%)
  - sup({Sony VAIO, HP LaserJet 1300}) = 30%
  - sup({Sony VAIO}) = 50%
  - Hence Conf. = 30% / 50% = 60%

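The support and confidence figures in the example can be reproduced with a short sketch. The toy transaction list below is invented for illustration (only the 30% / 50% / 60% figures come from the slides), and the function names are hypothetical. The threshold test is non-strict (`>=`), which is the common convention; replace it with `>` to follow the slide wording literally.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def count(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for t in transactions if target <= t)


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions containing `itemset` (0.0 for an empty database)."""
    if not transactions:
        return 0.0
    return count(itemset, transactions) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """conf(X => Y) = count(X u Y) / count(X), or 0.0 when X never occurs."""
    x_set, y_set = frozenset(x), frozenset(y)
    n_x = count(x_set, transactions)
    if n_x == 0:
        return 0.0
    return count(x_set | y_set, transactions) / n_x


def is_strong(x: Iterable[str], y: Iterable[str],
              transactions: List[Transaction],
              min_sup: float, min_conf: float) -> bool:
    """True when X => Y meets both thresholds (non-strict comparison assumed)."""
    xy = frozenset(x) | frozenset(y)
    return (support(xy, transactions) >= min_sup
            and confidence(x, y, transactions) >= min_conf)


# Illustrative database: 10 transactions, 5 with Sony VAIO, 3 of them with the printer.
toy_db: List[Transaction] = (
    [frozenset({"Sony VAIO", "HP LaserJet 1300"})] * 3
    + [frozenset({"Sony VAIO"})] * 2
    + [frozenset({"Other item"})] * 5
)

if __name__ == "__main__":
    print(support({"Sony VAIO", "HP LaserJet 1300"}, toy_db))
    print(confidence({"Sony VAIO"}, {"HP LaserJet 1300"}, toy_db))
    print(is_strong({"Sony VAIO"}, {"HP LaserJet 1300"}, toy_db, 0.25, 0.50))
```

Confidence is computed from raw counts rather than from two support ratios, which avoids dividing one rounded fraction by another.
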
Introduction (cont.)
Ontology
- W3C Web Ontology Working Group: “An ontology formally defines a common set of terms that are used to describe and represent a domain knowledge.”
- E.g., taxonomy: a kind of ontology presenting classification relationships among objects

Introduction (cont.)
Ontology-exploiting association rules
- Example: IBM 60GB HD => HP DeskJet

Problem Description
Incremental maintenance of ontology-exploiting association rules
- Given:
  - A database of customer transactions, DB
  - An incremental database, db
  - An item ontology, T
  - The discovered frequent itemsets in DB, L
  - Minimum support, ms, and minimum confidence, mc
- Find all frequent itemsets in UD = DB + db w.r.t. ms
- Construct all strong rules from the frequent itemsets w.r.t. mc

Problem Description (cont.) -- Example

Customer transactions DB

| TID | Purchased Items |
|-----|-----------------|
| 1 | IBM TP, Epson EPL, Toner Cartridge |
| 2 | Sony VAIO, IBM TP, Epson EPL |
| 3 | IBM TP, HP DeskJet, Ink Cartridge |
| 4 | HP DeskJet |
| 5 | IBM TP, HP DeskJet, Ink Cartridge |
| 6 | Sony VAIO, Ink Cartridge |

Item ontology G (figure)

minsup = 70% (algorithms AROC, AROS)

Discovered frequent itemsets L

| L1 | Count |
|----|-------|
| {Printer} | |
| {PC} | |
| {IBM TP} | |
| {RAM 256MB*} | |
| {IBM 60GB*} | |

| L2 & L3 | Count |
|---------|-------|
| {Printer, PC} | |
| {Printer, IBM TP} | |
| {Printer, RAM 256MB*} | |
| {Printer, IBM 60GB*} | |
| {RAM 256MB*, IBM 60GB*} | |
| {Printer, RAM 256MB*, IBM 60GB*} | |

Problem Description (cont.) -- Example

Customer transactions DB

| TID | Purchased Items |
|-----|-----------------|
| 1 | IBM TP, Epson EPL, Toner Cartridge |
| 2 | Sony VAIO, IBM TP, Epson EPL |
| 3 | IBM TP, HP DeskJet, Ink Cartridge |
| 4 | HP DeskJet |
| 5 | IBM TP, HP DeskJet, Ink Cartridge |
| 6 | Sony VAIO, Ink Cartridge |

Incremental transactions db

| TID | Items Purchased |
|-----|-----------------|
| 7 | Toner Cartridge |
| 8 | IBM TP, HP DeskJet, IBM 60GB, Toner Cartridge |
| 9 | IBM 60GB, Toner Cartridge |

Item ontology G (figure)

minsup = 70%

Updated frequent itemsets L’ = ??

The Proposed Algorithm – IMARO
Basic scheme
- An Apriori-based maintenance algorithm
- Employs a bottom-up, level-wise search strategy
- Starts from the frequent 1-itemsets, L1, then L2, …, Lk, etc.

Figure: itemset lattice over items A, B, C, D
- Level 4: ABCD
- Level 3: ABC, ABD, ACD, BCD
- Level 2: AB, AC, AD, BC, BD, CD
- Level 1: A, B, C, D

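One step of the bottom-up, level-wise search can be sketched as the classic Apriori candidate generation: join frequent k-itemsets into (k+1)-itemsets and keep only those whose k-subsets are all frequent. This is the textbook Apriori step, not necessarily the exact procedure used inside IMARO; the function name `apriori_gen` and the sample L2 are illustrative assumptions.

```python
from itertools import combinations
from typing import FrozenSet, Set

Itemset = FrozenSet[str]


def apriori_gen(frequent_k: Set[Itemset], k: int) -> Set[Itemset]:
    """Generate candidate (k+1)-itemsets from the frequent k-itemsets.

    Join step: union two frequent k-itemsets that differ in exactly one item.
    Prune step: drop a candidate if any of its k-subsets is not frequent.
    """
    candidates: Set[Itemset] = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) != k + 1:
                continue
            if all(frozenset(sub) in frequent_k
                   for sub in combinations(union, k)):
                candidates.add(union)
    return candidates


if __name__ == "__main__":
    # Hypothetical L2 over the lattice items A, B, C, D; CD is assumed infrequent.
    L2 = {frozenset(p) for p in ("AB", "AC", "AD", "BC", "BD")}
    for c in sorted("".join(sorted(s)) for s in apriori_gen(L2, 2)):
        print(c)
```

With CD missing from L2, both ACD and BCD are pruned without being counted, which is what makes the level-wise order worthwhile.
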
The Proposed Algorithm – IMARO (cont.)
Terminology

| Notation | Definition |
|----------|------------|
| DB | Original database |
| db | Incremental database |
| UD | Updated database, UD = DB + db |
| T | Item ontology |
| ED | Extension of DB with extended items in T |
| ed | Extension of db with extended items in T |
| UE | Updated extended database, UE = ED + ed |

The Proposed Algorithm – IMARO (cont.)
Example

The Proposed Algorithm – IMARO (cont.)
Note on database extension
- A component item may also exist as a primitive item itself
- To clarify the meaning of associations involving such an item, we have to differentiate the role this item plays
- E.g., IBM TP => Ink Cartridge can mean either:
  - customers who buy an IBM TP notebook also buy an Ink Cartridge
  - customers who buy an IBM TP notebook also buy a product composed of an Ink Cartridge

Original transaction

| TID | Purchased Items |
|-----|-----------------|
| 5 | IBM TP, HP DeskJet, Ink Cartridge |

Extended transaction

| TID | Primitive Items | Extended Items |
|-----|-----------------|----------------|
| 5 | IBM TP, HP DeskJet, Ink Cartridge* | PC, RAM 256MB, IBM 60GB, Printer, Ink Cartridge |

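The extension of a transaction can be sketched as follows. The ontology fragment is an assumption inferred from the examples on the slides (the ontology figure itself is not reproduced here), and it covers only a few items; the dictionaries `IS_A` and `HAS_PART` and the function `extend_transaction` are hypothetical names. The two roles are kept apart by returning primitive and extended items as separate sets, so an item such as Ink Cartridge can appear in both.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Partial ontology, inferred from the slide examples (assumption).
IS_A: Dict[str, str] = {          # classification: item -> generalized item
    "IBM TP": "PC",
    "HP DeskJet": "Printer",
    "Epson EPL": "Printer",
}
HAS_PART: Dict[str, List[str]] = {  # composition: item -> component items
    "IBM TP": ["RAM 256MB", "IBM 60GB"],
    "HP DeskJet": ["Ink Cartridge"],
    "Epson EPL": ["Toner Cartridge", "Photo Conductor"],
}


def extend_transaction(items: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (primitive items, extended items) for one transaction.

    Extended items are all generalizations reachable through IS_A plus the
    direct components listed in HAS_PART (one composition level assumed).
    """
    primitive = set(items)
    extended: Set[str] = set()
    for item in primitive:
        parent = IS_A.get(item)
        while parent is not None:        # climb the classification hierarchy
            extended.add(parent)
            parent = IS_A.get(parent)
        extended.update(HAS_PART.get(item, []))
    return primitive, extended


if __name__ == "__main__":
    prim, ext = extend_transaction(["IBM TP", "HP DeskJet", "Ink Cartridge"])
    print(sorted(prim))
    print(sorted(ext))
```

For transaction 5 this yields the extended items PC, RAM 256MB, IBM 60GB, Printer and Ink Cartridge, with Ink Cartridge present once in each role.
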
The Proposed Algorithm – IMARO (cont.)
Process flow for updating frequent k-itemsets (figure)
- e.g., AROC or AROS

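The Proposed Algorithm – IMARO (cont.)
Frequent/infrequent itemsets inference

| Condition: A in L^ED | Condition: A in L^ed | Result: A in UE | Action | Case |
|----------------------|----------------------|-----------------|--------|------|
| yes | yes | freq. | none | 1 |
| yes | no | undetd. | compare sup_UD(A) with ms | 2 |
| no | yes | undetd. | scan DB | 3 |
| no | no | infreq. | none | 4 |
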
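The four cases can be sketched in code under one stated assumption: the two conditions are whether itemset A was frequent in the extended original database ED and whether it is frequent in the extended increment ed. Cases 1 and 4 need no further work because counts add up: if A meets ms in both parts it meets ms in their union, and if it misses ms in both it misses it in the union. The function names and the non-strict comparison are illustrative choices, not taken from the slides.

```python
from typing import Tuple


def infer_status(frequent_in_ED: bool, frequent_in_ed: bool) -> Tuple[int, str, str]:
    """Map the two conditions to (case number, status in UE, required action)."""
    if frequent_in_ED and frequent_in_ed:
        return 1, "frequent", "none"
    if frequent_in_ED:
        # Count in ED is already known; only the small increment was scanned.
        return 2, "undetermined", "compare updated support with ms"
    if frequent_in_ed:
        # Count in ED is unknown, so the original database must be rescanned.
        return 3, "undetermined", "scan DB"
    return 4, "infrequent", "none"


def is_frequent_after_update(count_ED: int, count_ed: int,
                             size_ED: int, size_ed: int, ms: float) -> bool:
    """Resolve an undetermined itemset once both counts are available.

    Extension adds items to transactions but not transactions, so
    size_ED == |DB| and size_ed == |db|.
    """
    return (count_ED + count_ed) >= ms * (size_ED + size_ed)


if __name__ == "__main__":
    print(infer_status(True, False))
    print(is_frequent_after_update(5, 1, 6, 3, 0.5))
```

Case 3 is the expensive one, which is why the candidate-reduction optimizations matter: every itemset that reaches it forces a pass over the original data.
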
The Proposed Algorithm – IMARO (cont.)
Optimization 1: Candidate pruning
- Any candidate itemset that contains both an item and any one of its extensions (generalized item or component) is pruned
- Examples of pruned candidates:
  - {Epson EPL, Printer}
  - {Epson EPL, Toner Cartridge*}

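A minimal sketch of this pruning rule, assuming a lookup table from each primitive item to its extensions. The table `EXTENSIONS` is a hypothetical fragment inferred from the slide examples, and component-role items carry a trailing `*` as in the example {Epson EPL, Toner Cartridge*}.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

Itemset = FrozenSet[str]

# Item -> its extensions: generalized items and component-role items (assumed).
EXTENSIONS: Dict[str, Set[str]] = {
    "Epson EPL": {"Printer", "Toner Cartridge*", "Photo Conductor*"},
    "HP DeskJet": {"Printer", "Ink Cartridge*"},
    "IBM TP": {"PC", "RAM 256MB*", "IBM 60GB*"},
}


def is_redundant(candidate: Itemset) -> bool:
    """True if the candidate holds an item together with one of its extensions."""
    return any(EXTENSIONS.get(item, set()) & candidate for item in candidate)


def prune_candidates(candidates: Iterable[Itemset]) -> List[Itemset]:
    """Keep only candidates that pass the pruning rule, preserving input order."""
    return [c for c in candidates if not is_redundant(c)]


if __name__ == "__main__":
    cands = [
        frozenset({"Epson EPL", "Printer"}),
        frozenset({"Epson EPL", "Toner Cartridge*"}),
        frozenset({"Epson EPL", "PC"}),
    ]
    print([sorted(c) for c in prune_candidates(cands)])
```

Such candidates carry no information, since a transaction containing the item always contains its extensions after database extension, so their support equals that of the item alone.
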
The Proposed Algorithm – IMARO (cont.)
Optimization 2: Extension filtering
- The extension of an item can be added only if that item does appear in at least one candidate itemset currently being counted

Figure: item ontology with node labels Printer, HP DeskJet, Epson EPL, Photo Conductor, Toner Cartridge, Ink Cartridge, PC, Sony VAIO, IBM TP, RAM 256MB, IBM 60GB, S 60GB; some nodes are marked "-"

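One possible reading of extension filtering, sketched below: when a transaction is extended during the current pass, an extended item is kept only if it occurs in at least one candidate itemset being counted in that pass, because any other extended item cannot contribute to a count. This interpretation, the function name `filter_extensions`, and the sample candidates are assumptions rather than details confirmed by the slides.

```python
from typing import FrozenSet, Iterable, Set

Itemset = FrozenSet[str]


def filter_extensions(extended_items: Iterable[str],
                      candidates: Iterable[Itemset]) -> Set[str]:
    """Keep only the extended items that occur in some current candidate."""
    needed: Set[str] = set().union(*candidates)   # all items used by candidates
    return {item for item in extended_items if item in needed}


if __name__ == "__main__":
    # Extended items of transaction 5 and a hypothetical set of 2-candidates.
    extended = {"PC", "RAM 256MB", "IBM 60GB", "Printer", "Ink Cartridge"}
    candidates = [
        frozenset({"Printer", "PC"}),
        frozenset({"Printer", "IBM 60GB"}),
    ]
    print(sorted(filter_extensions(extended, candidates)))
```

The filter shrinks each extended transaction before subset testing, so its benefit grows at higher levels, where fewer distinct items survive in the candidate set.
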
Performance Evaluation
- Compared with applying our proposed algorithms, AROC and AROS, to the whole database DB + db with T
- Test data: a synthetic dataset generated by the IBM data generator with an artificially built ontology

| Parameter | Description | Default value |
|-----------|-------------|---------------|
| \|DB\| | Number of original transactions | 200,000 |
| \|t\| | Average size of transactions | 20 |
| N | Number of items | 362 |
| R | Number of groups | 30 |
| L | Number of levels | 4 |
| F | Fanout | 5 |

Performance Evaluation (cont.)
Varying minimum supports, |db| = 40,000 (figure)

Performance Evaluation (cont.)
Varying incremental transaction size, ms = 1.5% (figure)

Conclusions
- We have investigated the problem of updating ontology-exploiting association rules when new transactions are inserted into the database
- An Apriori-based algorithm is proposed
- Other issues
  - More complicated semantic relationships and knowledge
  - Non-uniform minimum support
    - Generalized item or composite item occurs more frequently
  - Towards a total solution for evolving environments
    - Ontology evolution, database update
    - Interactive refinement of support constraints
    - …

Thanks for your attention!

Conclusions (cont.)
Taxonomy of semantic relationships (figure)
*Source: Veda C. Storey, VLDB Journal, 1993

Related Work
Comparison with previous work (model of incremental maintenance of association rules)

| Contributors | Type of database update | Type of ontology |
|--------------|-------------------------|------------------|
| Srikant & Agrawal, 1995 | none | classification |
| Han & Fu, 1995 | none | classification |
| Cheung et al., 1996 | insertion | classification |
| Cheung et al., 1997 | insertion, deletion and modification | none |
| Jea et al., 2003 | none | composition |
| Chien et al., 2005 | none | classification & composition |