1
Scalable Mining For Classification Rules in Relational Databases
Min Wang, Bala Iyer, Jeffrey Scott Vitter
Presented by: Nadav Grossaug
2
Abstract
Problem: training sets keep increasing in size.
MIND (MINing in Databases) is a classifier that can be implemented easily over SQL.
Other classifiers need O(N) space in memory.
MIND scales well over:
– I/O
– number of processors
3
Overview
– Introduction
– Algorithm
– Database Implementation
– Performance
– Experimental Results
– Conclusions
4
Introduction - Classification Problem
[Figure: the DETAIL table is fed to a classifier, which produces a decision tree with splits "Age <= 30" and "salary <= 62K", yes/no branches, and leaf labels "safe" / "risky".]
5
Introduction - Scalability in Classification
Importance of scalability:
– Use a very large training set: the data is not memory resident.
– Number of CPUs: better usage of resources.
6
Introduction - Scalability in Classification
Properties of MIND:
– Scalable in memory
– Scalable in CPU
– Uses SQL
– Easy to implement
Assumptions:
– Attribute values are discrete
– We focus on the growth stage (no pruning)
7
The Algorithm - Data Structure
Data is kept in the DETAIL table:
DETAIL(attr_1, attr_2, ..., class, leaf_num)
– attr_i = the i-th attribute
– class = the class label
– leaf_num = the number of the leaf the example currently belongs to (this can be computed from the tree built so far)
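A minimal sketch of this table for the two example attributes from the introduction slide; the age/salary schema and the column types are assumptions for illustration:

CREATE TABLE DETAIL (
    age      INTEGER,   -- attr_1: a discrete-valued attribute
    salary   INTEGER,   -- attr_2: a discrete-valued attribute
    class    INTEGER,   -- the class label
    leaf_num INTEGER    -- leaf the record currently falls into
)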
8
The Algorithm - gini index
S = a data set; C = the number of classes; p_i = the relative frequency of class i in S.
gini index: gini(S) = 1 - Σ_{i=1}^{C} p_i²
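For illustration (numbers assumed): a node holding 60 records of one class and 40 of the other has gini = 1 - (0.6² + 0.4²) = 0.48, while a pure node has gini = 1 - 1² = 0; lower gini means a purer partition.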
9
The Algorithm
GrowTree(DETAIL):
  initialize tree T and put all records of DETAIL in the root
  while (some leaf in T is not a STOP node)
    for each attribute i do
      evaluate the gini index for each non-STOP leaf at each split value with respect to attribute i
    for each non-STOP leaf do
      get the overall best split for it
    partition the records and grow the tree for one more level according to the best splits
    mark all small or pure leaves as STOP nodes
  return T
10
Database Implementation - Dimension Table
For each attribute i and each level of the tree:
INSERT INTO DIM_i
SELECT leaf_num, class, attr_i, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr_i
Size of DIM_i = #leaves * #distinct values of attr_i * #classes
11
Database Implementation - Dimension Table SQL
INSERT INTO DIM_1
SELECT leaf_num, class, attr_1, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr_1

INSERT INTO DIM_2
SELECT leaf_num, class, attr_2, COUNT(*)
FROM DETAIL
WHERE leaf_num <> STOP
GROUP BY leaf_num, class, attr_2
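A concrete instance of these inserts, assuming the hypothetical DETAIL(age, salary, class, leaf_num) schema from the introduction slide (STOP written here as the literal 'STOP' marker; both are assumptions for illustration):

-- counts per (leaf, class, value) for each attribute of the assumed schema
INSERT INTO DIM_1
SELECT leaf_num, class, age, COUNT(*)
FROM DETAIL
WHERE leaf_num <> 'STOP'     -- skip records in finished leaves
GROUP BY leaf_num, class, age

INSERT INTO DIM_2
SELECT leaf_num, class, salary, COUNT(*)
FROM DETAIL
WHERE leaf_num <> 'STOP'
GROUP BY leaf_num, class, salary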
12
Database Implementation - UP/DOWN Split
For each attribute i we find all possible split places:
INSERT INTO UP
SELECT d1.leaf_num, d1.attr_i, d1.class, SUM(d2.count)
FROM DIM_i d1 FULL OUTER JOIN DIM_i d2
  ON d1.leaf_num = d2.leaf_num
  AND d2.attr_i <= d1.attr_i
  AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attr_i, d1.class
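The DOWN table (class counts on the other side of each candidate split) is not spelled out on the slide; one plausible symmetric sketch, assuming the same schema, simply flips the comparison:

INSERT INTO DOWN
SELECT d1.leaf_num, d1.attr_i, d1.class, SUM(d2.count)
FROM DIM_i d1 FULL OUTER JOIN DIM_i d2
  ON d1.leaf_num = d2.leaf_num
  AND d2.attr_i > d1.attr_i    -- records strictly above the split point
  AND d1.class = d2.class
GROUP BY d1.leaf_num, d1.attr_i, d1.class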
13
Database Implementation - Class Views
Create a view for each class k and attribute i:
CREATE VIEW C_k_UP (leaf_num, attr_i, count) AS
SELECT leaf_num, attr_i, count
FROM UP
WHERE class = k
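The matching views over DOWN, referenced as C_1_DOWN, ..., C_c_DOWN on the next slide, would look analogous; a sketch:

CREATE VIEW C_k_DOWN (leaf_num, attr_i, count) AS
SELECT leaf_num, attr_i, count
FROM DOWN
WHERE class = k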
14
Database Implementation - GINI VALUE
Create a view holding the gini value of every candidate split:
CREATE VIEW GINI_VALUE (leaf_num, attr_i, gini) AS
SELECT u_1.leaf_num, u_1.attr_i, f_gini
FROM C_1_UP u_1, ..., C_c_UP u_c, C_1_DOWN d_1, ..., C_c_DOWN d_c
WHERE u_1.attr_i = ... = u_c.attr_i = d_1.attr_i = ... = d_c.attr_i
AND u_1.leaf_num = ... = u_c.leaf_num = d_1.leaf_num = ... = d_c.leaf_num
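Here f_gini stands for the arithmetic expression that combines the up/down class counts into the weighted gini of the split. A sketch of what it could expand to in the two-class case (c = 2); the exact expression, the 1.0* casts to force floating-point division, and the omission of empty-side/NULL handling are assumptions for illustration, not the paper's verbatim SQL:

CREATE VIEW GINI_VALUE (leaf_num, attr_i, gini) AS
SELECT u1.leaf_num, u1.attr_i,
       -- weighted gini: (|up|*gini(up) + |down|*gini(down)) / |S|
       ( (u1.count + u2.count) *
           (1.0 - POWER(1.0*u1.count/(u1.count+u2.count), 2)
                - POWER(1.0*u2.count/(u1.count+u2.count), 2))
       + (d1.count + d2.count) *
           (1.0 - POWER(1.0*d1.count/(d1.count+d2.count), 2)
                - POWER(1.0*d2.count/(d1.count+d2.count), 2)) )
       / (u1.count + u2.count + d1.count + d2.count)
FROM C_1_UP u1, C_2_UP u2, C_1_DOWN d1, C_2_DOWN d2
WHERE u1.attr_i = u2.attr_i AND u2.attr_i = d1.attr_i AND d1.attr_i = d2.attr_i
  AND u1.leaf_num = u2.leaf_num AND u2.leaf_num = d1.leaf_num
  AND d1.leaf_num = d2.leaf_num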
15
Database Implementation - MIN GINI VALUE
Create a table holding the minimum gini value per leaf for attribute i (its columns are leaf_num, attr_name, attr_value, gini):
INSERT INTO MIN_GINI
SELECT leaf_num, i, attr_i, gini
FROM GINI_VALUE a
WHERE a.gini = (SELECT MIN(gini)
                FROM GINI_VALUE b
                WHERE a.leaf_num = b.leaf_num)
16
Database Implementation - BEST SPLIT
Create a view over MIN_GINI holding the best split per leaf:
CREATE VIEW BEST_SPLIT (leaf_num, attr_name, attr_value) AS
SELECT leaf_num, attr_name, attr_value
FROM MIN_GINI a
WHERE a.gini = (SELECT MIN(gini)
                FROM MIN_GINI b
                WHERE a.leaf_num = b.leaf_num)
17
Database Implementation - Partitioning
Build new nodes by splitting old nodes according to the BEST_SPLIT values.
Assign the correct node to each record: leaf_num is computed by a user function, so there is no need to UPDATE the data or the database. A sketch of the idea follows.
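A minimal sketch, assuming a hypothetical user-defined function tree_leaf(age, salary) that replays the split conditions of the tree built so far and returns the record's leaf; the function name, its arguments, and the age/salary schema are assumptions for illustration:

-- tree_leaf() is hypothetical: it evaluates the current tree over a
-- record's attributes, so DETAIL itself is never updated between levels.
INSERT INTO DIM_1
SELECT tree_leaf(age, salary), class, age, COUNT(*)
FROM DETAIL
WHERE tree_leaf(age, salary) <> 'STOP'
GROUP BY tree_leaf(age, salary), class, age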
18
Performance
[Slide compares the I/O cost of MIND with the I/O cost of SPRINT; the cost formulas appeared as figures on the original slide.]
19
Experimental Results
[Charts: normalized time to finish building the tree; normalized time to build the tree per example.]
20
Experimental Results
[Charts: normalized time to build the tree per number of processors; time to build the tree by training set size.]
21
Conclusions
MIND works over a database. MIND works well because:
– MIND rephrases classification as a database problem.
– MIND avoids UPDATEs to the DETAIL table.
– Parallelism and scaling are achieved through the use of the RDBMS.
– MIND uses a user function to get the performance gain in the DIM_i creation.