RainForest
Presented by: Mehdi Teymouri and Kianoush Malakouti
Instructor: Dr. Vahidipour
Faculty of Electrical and Computer Engineering, University of Kashan; Fall 1396 (2017)
Background (1/2) Decision Tree: Efficiency & Scalability
The efficiency of existing decision tree algorithms, such as ID3, C4.5, and CART, is well established for relatively small data sets. But what if D, the disk-resident training set of class-labeled tuples, does not fit in main memory? In other words, how scalable is decision tree induction? The pioneering decision tree algorithms just mentioned assume that the data are memory resident, i.e., they require the training tuples to fit in main memory.
Background (2/2) Decision Tree: Efficiency & Scalability
Efficiency becomes a concern when these algorithms are applied to the mining of very large real-world databases, where the training data most often will not fit in memory. More scalable approaches, capable of handling training data that are too large to fit in memory, are required. Earlier strategies to "save space" included discretizing continuous-valued attributes and sampling data at each node; these techniques, however, still assume that the training set can fit in memory. Other strategies reduce the quality of the resulting tree.
What is RainForest? (1/2) RainForest (a whimsical name!) is a scalable framework for decision tree induction. It was introduced by Gehrke, Ramakrishnan, and Ganti in an article published in the journal Data Mining and Knowledge Discovery, Volume 4, July 2000. RainForest separates the scalability aspects from the criteria that determine the quality of the tree.
What is RainForest? (2/2) RainForest adapts to the amount of main memory available and applies to any decision tree induction algorithm; it also improves performance. Applying the framework to an algorithm yields a scalable version of that algorithm without changing its output. The quality of the resulting decision tree therefore needs no re-evaluation, and we can concentrate on scalability issues.
AVC-list (1/2) The method maintains an AVC-set (where "AVC" stands for "Attribute-Value, Classlabel") describing the training tuples at each node. The AVC-set of an attribute P at node N gives the aggregated class-label count for each distinct value of P among the tuples at N. Equivalently, it is the projection of F(N), the family of N (the set of tuples in the database that follow the path from the root to N when classified by the tree), onto P and the class label, with counts aggregated. The set of all AVC-sets at a node N is the AVC-group of N.
AVC-list (2/2) The size of the AVC-set for an attribute A at node N depends only on the number of distinct values of A and the number of class labels in the set of tuples at N, i.e., in F(N); it does not depend on the number of records. Typically, this size fits in memory even for real-world data. RainForest also provides techniques for handling the case where the AVC-group does not fit in memory. The method therefore offers high scalability for decision tree induction on very large data sets.
Example: Class-Labeled Training Tuples
id  age          income  student  credit_rating  class: buys_computer
1   youth        high    no       fair           no
2   youth        high    no       excellent      no
3   middle_aged  high    no       fair           yes
4   senior       medium  no       fair           yes
5   senior       low     yes      fair           yes
6   senior       low     yes      excellent      no
7   middle_aged  low     yes      excellent      yes
8   youth        medium  no       fair           no
9   youth        low     yes      fair           yes
10  senior       medium  yes      fair           yes
11  youth        medium  yes      excellent      yes
12  middle_aged  medium  no       excellent      yes
13  middle_aged  high    yes      fair           yes
14  senior       medium  no       excellent      no
Example: AVC-sets
Each AVC-set gives, per attribute value, the counts of tuples with buys_computer = yes and buys_computer = no:

age AVC-set:
age          yes  no
youth        2    3
middle_aged  4    0
senior       3    2

income AVC-set:
income  yes  no
low     3    1
medium  4    2
high    2    2

student AVC-set:
student  yes  no
yes      6    1
no       3    4

credit_rating AVC-set:
credit_rating  yes  no
fair           6    2
excellent      3    3
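For illustration, here is a minimal Python sketch (our own, not from the paper; the names tuples, attributes, and build_avc_group are ours) that builds the AVC-group of the root node from the 14 training tuples above in a single scan; it reproduces the counts shown in the AVC-sets.

    from collections import defaultdict

    # Training tuples at the root: (age, income, student, credit_rating, buys_computer)
    tuples = [
        ("youth", "high", "no", "fair", "no"),
        ("youth", "high", "no", "excellent", "no"),
        ("middle_aged", "high", "no", "fair", "yes"),
        ("senior", "medium", "no", "fair", "yes"),
        ("senior", "low", "yes", "fair", "yes"),
        ("senior", "low", "yes", "excellent", "no"),
        ("middle_aged", "low", "yes", "excellent", "yes"),
        ("youth", "medium", "no", "fair", "no"),
        ("youth", "low", "yes", "fair", "yes"),
        ("senior", "medium", "yes", "fair", "yes"),
        ("youth", "medium", "yes", "excellent", "yes"),
        ("middle_aged", "medium", "no", "excellent", "yes"),
        ("middle_aged", "high", "yes", "fair", "yes"),
        ("senior", "medium", "no", "excellent", "no"),
    ]
    attributes = ["age", "income", "student", "credit_rating"]

    def build_avc_group(tuples, attributes):
        """One scan over the tuples at a node builds the AVC-set of every
        predictor attribute: a map (attribute value, class label) -> count."""
        avc_group = {a: defaultdict(int) for a in attributes}
        for *values, label in tuples:
            for attr, value in zip(attributes, values):
                avc_group[attr][(value, label)] += 1
        return avc_group

    avc_group = build_avc_group(tuples, attributes)
    print(avc_group["age"][("middle_aged", "yes")])  # -> 4, as in the age AVC-set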
Top-Down Decision Tree Induction Schema: A Common Pattern
BuildTree(Node N, data partition D)
    apply the algorithm to D to find criterion(N)
    let K be the number of children of N
    if (K > 0)
        create K children C1, C2, ..., CK of N
        use the best split to partition D into D1, D2, ..., DK
        for (i = 1; i <= K; i++)
            BuildTree(Ci, Di)
        endfor
    endif
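In Python, the same schema might look as follows (a sketch under our own naming; find_criterion and split stand in for the algorithm-specific steps):

    class Node:
        def __init__(self):
            self.criterion = None   # splitting criterion chosen at this node
            self.children = []

    def build_tree(node, data, find_criterion, split):
        """Generic top-down induction: find_criterion picks the split from the
        data partition; split distributes the partition across the children."""
        node.criterion = find_criterion(data)     # algorithm-specific step
        partitions = split(data, node.criterion)  # D1, ..., DK; empty at a leaf
        for part in partitions:
            child = Node()
            node.children.append(child)
            build_tree(child, part, find_criterion, split)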
RainForest Refinement
for each predictor attribute P
    call A.find_best_partitioning(AVC-set of P)
endfor
K = A.decide_splitting_criterion()

Scales with the size of the database.
Adapts gracefully to the amount of main memory available.
Not restricted to a specific classification algorithm: the utility of a predictor attribute P is examined independently of the other predictor attributes.
Isolates an important component, the AVC-set, which allows the separation of scalability issues from the algorithm A that decides on the splitting criterion.
Total main memory required = the maximum size of any AVC-set.
Evaluates previous results.
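For concreteness, here is a sketch of what A.find_best_partitioning could compute for a categorical attribute, scoring a multiway split by Gini impurity directly from the AVC-set (the actual criterion depends on the algorithm A plugged into the framework; gini_of_split is our hypothetical name):

    def gini_of_split(avc_set):
        """Score a multiway split on one attribute from its AVC-set alone:
        avc_set maps (attribute value, class label) -> count. Lower is better."""
        by_value = {}  # group class counts by attribute value
        for (value, label), count in avc_set.items():
            by_value.setdefault(value, {})[label] = count
        total = sum(avc_set.values())
        weighted_gini = 0.0
        for counts in by_value.values():
            n = sum(counts.values())
            gini = 1.0 - sum((c / n) ** 2 for c in counts.values())
            weighted_gini += (n / total) * gini
        return weighted_gini

    # With the AVC-group built earlier, the best attribute to split on is e.g.:
    # best = min(attributes, key=lambda a: gini_of_split(avc_group[a]))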
Steps in RainForest Refinement Algorithm
1. AVC-group construction
2. Choose the splitting attribute and predicate (according to the algorithm)
3. Partition D across the children nodes
Main Memory
Depending on the amount of main memory available, three cases can be distinguished:
1. The AVC-group of the root node fits in main memory: RF-Write, RF-Read, RF-Hybrid.
2. Each individual AVC-set of the root node fits in main memory, but the AVC-group of the root node does not: RF-Vertical.
3. None of the individual AVC-sets of the root fits in main memory.
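Because an AVC-set's size is bounded by (distinct attribute values) x (class labels), the applicable case can be estimated before induction starts. A small illustrative sketch (our own; choose_regime and entry_bytes are assumptions, not from the paper):

    def choose_regime(distinct_values, n_classes, mem_budget, entry_bytes=8):
        """distinct_values: per-attribute distinct-value counts at the root.
        Returns which family of RainForest algorithms applies."""
        set_sizes = [v * n_classes * entry_bytes for v in distinct_values]
        if sum(set_sizes) <= mem_budget:
            return "AVC-group fits: RF-Write / RF-Read / RF-Hybrid"
        if max(set_sizes) <= mem_budget:
            return "each AVC-set fits: RF-Vertical"
        return "no single AVC-set fits in memory"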
State of Node

State: Send
  Precondition: criterion(N) is computed; N's children nodes are allocated; N is the root or the parent of N is in 'Send' state.
  Processing behavior of tuple t: t is sent to a child according to criterion(N).
State: Fill
  Processing behavior of tuple t: the AVC-group of N is updated with t.
State: Write
  Processing behavior of tuple t: t is appended to N's partition.
State: FillWrite
  Processing behavior of tuple t: the AVC-group of N is updated with t, and t is appended to N's partition.
State: Undecided (new node)
  No processing.
State: Dead
  Precondition: N does not split, or all children of N are in the 'Dead' state.
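This state-dependent processing can be rendered compactly (a sketch of ours; State, process_tuple, and the node attributes are hypothetical names):

    from enum import Enum, auto

    class State(Enum):
        SEND = auto()       # route t onward to a child
        FILL = auto()       # fold t into the node's AVC-group
        WRITE = auto()      # append t to the node's on-disk partition
        FILLWRITE = auto()  # both FILL and WRITE
        UNDECIDED = auto()  # new node: no processing yet
        DEAD = auto()       # no split, or all children dead

    def process_tuple(node, t):
        if node.state is State.SEND:
            child = node.criterion(t)            # child index chosen by criterion(N)
            process_tuple(node.children[child], t)
        elif node.state in (State.FILL, State.FILLWRITE):
            node.update_avc_group(t)
        if node.state in (State.WRITE, State.FILLWRITE):
            node.partition.append(t)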
RF-Write
1. R:Fill. One scan of the database constructs the AVC-group of R.
2. Apply the algorithm (compute criterion(R)) and create the K children of R. R:Send; each child:Write.
3. An additional scan of the database, in which each tuple t is written into one of the K partitions.
4. Recurse on each partition.
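A hedged Python sketch of the RF-Write pattern at a single node (the file handling and all names here are our own; the point is the two sequential scans of the database):

    import csv, os

    def rf_write(db_path, out_dir, find_criterion, class_index):
        """Scan 1 builds the node's AVC-group in memory; scan 2 writes each
        tuple to the on-disk partition of the child it is routed to."""
        avc_group = {}
        with open(db_path) as f:                 # scan 1: node in 'Fill' state
            for row in csv.reader(f):
                label = row[class_index]
                for i, value in enumerate(row):
                    if i != class_index:
                        counts = avc_group.setdefault(i, {})
                        counts[(value, label)] = counts.get((value, label), 0) + 1
        criterion = find_criterion(avc_group)    # algorithm-specific split choice
        writers = {}
        with open(db_path) as f:                 # scan 2: children in 'Write' state
            for row in csv.reader(f):
                child = criterion(row)           # index of the child for this tuple
                if child not in writers:
                    writers[child] = open(os.path.join(out_dir, f"part{child}.csv"), "w")
                writers[child].write(",".join(row) + "\n")
        for w in writers.values():
            w.close()
        # Step 4: recurse on each partition file part0.csv, ..., partK-1.csv.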
Comparison
The End
Thank you for your time.