Scalable Mining for Classification Rules in Relational Databases
Presented by: Nadav Grossaug
Authors: Min Wang, Bala Iyer, Jeffrey Scott Vitter
Abstract
Problem: the size of training sets keeps increasing.
MIND (MINing in Databases) is a classifier that can be implemented easily over SQL; other classifiers need O(N) space in memory.
MIND scales well over: I/O, number of processors.
Overview
Introduction
Algorithm
Database Implementation
Performance
Experimental Results
Conclusions
Introduction - Classification Problem
[Diagram: the DETAIL table feeds a classifier, which produces a decision tree: root test "Age <= 30" (yes/no branches), inner test "salary <= 62K", with leaves "safe" and "risky".]
Introduction - Scalability in Classification
Importance of scalability:
Use a very large training set - data is not memory resident.
Number of CPUs - better usage of resources.
Introduction - Scalability in Classification
Properties of MIND:
Scalable in memory
Scalable in CPU
Uses SQL
Easy to implement
Assumptions:
Attribute values are discrete
We focus on the growth stage (no pruning)
The Algorithm - Data Structure
Data is kept in the DETAIL table: DETAIL(attr_1, attr_2, ..., class, leaf_num)
attr_i = the i-th attribute
class = the class label
leaf_num = the leaf of the current tree that the example belongs to (this value can be computed from the known tree)
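A minimal sketch of the DETAIL schema, assuming two integer attributes and an integer-coded class label (the slide does not give column types):

  CREATE TABLE DETAIL (
    attr1    INTEGER,   -- first attribute of the example
    attr2    INTEGER,   -- second attribute of the example
    class    INTEGER,   -- class label of the example
    leaf_num INTEGER    -- leaf of the current tree holding the example
  );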
The Algorithm - gini index S - data Set C - number of Classes Pi - relative frequency of class i in S gini index :
The Algorithm
GrowTree(DETAIL table):
  initialize tree T and put all records of DETAIL in the root
  while (some leaf in T is not a STOP node)
    for each attribute i do
      evaluate the gini index for each non-STOP leaf at each split value with respect to attribute i
    for each non-STOP leaf do
      get the overall best split for it
    partition the records and grow the tree for one more level according to the best splits
    mark all small or pure leaves as STOP nodes
  return T
Database Implementation - Dimension Tables
For each attribute i and each level of the tree:
INSERT INTO DIM_i
SELECT leaf_num, class, attr_i, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr_i
Size of DIM_i = #leaves * #distinct values of attr_i * #classes
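For a feel of the size (illustrative numbers, not from the slides): with 8 active leaves, 100 distinct values of attr_i, and 2 classes, DIM_i holds at most 8 * 100 * 2 = 1,600 rows, so the dimension tables stay tiny compared to DETAIL.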
Database Implementation - Dimension Table SQL
INSERT INTO DIM_1
SELECT leaf_num, class, attr_1, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr_1

INSERT INTO DIM_2
SELECT leaf_num, class, attr_2, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr_2
Database Implementation - UP/DOWN - Split
For each attribute we find all possible split points:
INSERT INTO UP
SELECT d1.leaf_num, d1.attr_i, d1.class, SUM(d2.count)
FROM DIM_i d1 FULL OUTER JOIN DIM_i d2
  ON d1.leaf_num = d2.leaf_num AND d2.attr_i <= d1.attr_i AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attr_i, d1.class
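The slide shows only UP; a symmetric sketch for the DOWN table (an assumption following the same schema, counting the records on the other side of each candidate split) would be:

  INSERT INTO DOWN
  SELECT d1.leaf_num, d1.attr_i, d1.class, SUM(d2.count)
  FROM DIM_i d1 FULL OUTER JOIN DIM_i d2
    ON d1.leaf_num = d2.leaf_num AND d2.attr_i > d1.attr_i AND d1.class = d2.class
  GROUP BY d1.leaf_num, d1.attr_i, d1.class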
Database Implementation - Class Views
Create a view for each class k and attribute i:
CREATE VIEW C_k_UP (leaf_num, attr_i, count) AS
SELECT leaf_num, attr_i, count
FROM UP
WHERE class = k
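The GINI_VALUE view on the next slide also references C_k_DOWN views; a matching sketch (assuming a DOWN table with the same columns as UP) would be:

  CREATE VIEW C_k_DOWN (leaf_num, attr_i, count) AS
  SELECT leaf_num, attr_i, count
  FROM DOWN
  WHERE class = k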
Database Implementation - GINI VALUE
Create a view holding the gini value of every candidate split:
CREATE VIEW GINI_VALUE (leaf_num, attr_i, gini) AS
SELECT u_1.leaf_num, u_1.attr_i, f_gini
FROM C_1_UP u_1, ..., C_c_UP u_c, C_1_DOWN d_1, ..., C_c_DOWN d_c
WHERE u_1.attr_i = ... = u_c.attr_i = ... = d_c.attr_i
AND u_1.leaf_num = ... = u_c.leaf_num = ... = d_c.leaf_num
Here f_gini is an arithmetic expression computing the gini index of the split from the 2c class counts.
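As a concrete sketch for C = 2 classes, expanding f_gini into the weighted gini of a split per the definition above (this expansion is an assumption; the slide leaves f_gini abstract):

  CREATE VIEW GINI_VALUE (leaf_num, attr_i, gini) AS
  SELECT u1.leaf_num, u1.attr_i,
         -- weighted gini: |up|/|S| * gini(up) + |down|/|S| * gini(down)
         ( (u1.count + u2.count) * (1.0 - POWER(1.0*u1.count/(u1.count+u2.count), 2)
                                        - POWER(1.0*u2.count/(u1.count+u2.count), 2))
         + (d1.count + d2.count) * (1.0 - POWER(1.0*d1.count/(d1.count+d2.count), 2)
                                        - POWER(1.0*d2.count/(d1.count+d2.count), 2)) )
         / (u1.count + u2.count + d1.count + d2.count)
  FROM C_1_UP u1, C_2_UP u2, C_1_DOWN d1, C_2_DOWN d2
  WHERE u1.attr_i = u2.attr_i AND u2.attr_i = d1.attr_i AND d1.attr_i = d2.attr_i
    AND u1.leaf_num = u2.leaf_num AND u2.leaf_num = d1.leaf_num AND d1.leaf_num = d2.leaf_num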
Database Implementation - MIN GINI VALUE
Create a table with the minimum gini value for attribute i:
INSERT INTO MIN_GINI
SELECT leaf_num, i, attr_i, gini
FROM GINI_VALUE a
WHERE a.gini = (SELECT MIN(gini)
                FROM GINI_VALUE b
                WHERE a.leaf_num = b.leaf_num)
Database Implementation - BEST SPLIT
Create a view over MIN_GINI for the best split:
CREATE VIEW BEST_SPLIT (leaf_num, attr_name, attr_value) AS
SELECT leaf_num, attr_name, attr_value
FROM MIN_GINI a
WHERE a.gini = (SELECT MIN(gini)
                FROM MIN_GINI b
                WHERE a.leaf_num = b.leaf_num)
Database Implementation - Partitioning
Build new nodes by splitting old nodes according to the BEST_SPLIT values.
Assign each record to its new leaf: leaf_num is computed by a user function (see the sketch below), so there is no need to UPDATE the data or the database.
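A minimal sketch of the idea (the function name leaf_num() and its arguments are hypothetical; the slide only says a function is used): instead of physically updating DETAIL, the next level's dimension tables apply the function on the fly:

  -- leaf_num() walks the tree built so far and returns the leaf a record falls into
  INSERT INTO DIM_i
  SELECT leaf_num(attr_1, ..., attr_n), class, attr_i, COUNT(*)
  FROM DETAIL
  WHERE leaf_num(attr_1, ..., attr_n) <> STOP
  GROUP BY leaf_num(attr_1, ..., attr_n), class, attr_i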
Performance
I/O cost of MIND: roughly one sequential scan of DETAIL per tree level; the DIM, UP, DOWN, and MIN_GINI relations are small, so cost is proportional to tree depth times the size of DETAIL.
I/O cost of SPRINT: reads and rewrites its per-attribute lists at every level, so its I/O also grows with the number of attributes.
Experimental Results
[Charts: normalized time to finish building the tree; normalized time to build the tree per example.]
Experimental Results
[Charts: normalized time to build the tree per number of processors; time to build the tree by training-set size.]
Conclusions
MIND works over a DB.
MIND works well because:
- MIND rephrases classification as a database problem
- MIND avoids UPDATEs of the DETAIL table
- parallelism and scaling are achieved through the RDBMS
- MIND uses a user function to get the performance gain in the DIM_i creation