Published by Άιμον Κορομηλάς. Modified over 6 years ago.
1
Bootstrapped Optimistic Algorithm for Tree Construction
Presented by: Mehdi Teymouri and Kianoush Malakouti
Instructor: Dr. Vahidipour
Faculty of Electrical and Computer Engineering, University of Kashan; Fall 96
2
What is it?
BOAT is an algorithm for making the construction of decision trees more efficient.
It uses a statistical technique called "bootstrapping" to create several smaller subsets of the data.
Each subset is used to build a tree, resulting in several trees.
These trees are examined and used to construct a new tree T'.
It turns out that T' is very close to the tree that would be generated from the whole data set.
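The bootstrapping step can be sketched as follows. This is a minimal illustration, assuming the data fits in a Python list; the function name `bootstrap_samples` and its parameters are hypothetical and do not come from the presentation.

```python
import random

def bootstrap_samples(records, num_samples, sample_size, seed=0):
    """Draw `num_samples` subsets of `sample_size` records each.

    Sampling is done with replacement, so a record may appear more
    than once in a subset. All names here are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [rng.choices(records, k=sample_size) for _ in range(num_samples)]

# Toy data set: 1000 record ids standing in for database rows.
records = list(range(1000))
samples = bootstrap_samples(records, num_samples=5, sample_size=100)
```

Each element of `samples` would then be handed to an ordinary tree-building routine, giving one tree per subset.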
3
Why is the standard method not good enough?
Standard decision tree construction is not efficient.
For a tree of height h, it needs h passes over the entire database.
To include new data, the tree must be rebuilt.
For large databases, this is not feasible.
So we need a fast, scalable method.
4
What is good about BOAT?
Efficient construction of the decision tree.
Only a few passes over the database are needed (as few as 2).
A sample of the data set gives insight into the full database.
It improves both functionality and performance, with a performance gain of around 300%.
It is the first scalable algorithm with the ability to incrementally update the tree.
5
BOAT Intuition
Begin with a sample of the data.
Build a decision tree on the sample.
For numeric attributes, use a confidence interval for the split point.
Make a limited number of passes over the full data to both verify the sampled tree and construct the full tree.
Only data that falls inside the confidence interval needs to be rescanned to determine the exact split.
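The idea that only records inside the confidence interval need a second look can be sketched as a single scan that routes each record. This is an illustration under assumptions: records are dictionaries, the split predicate has the form `value <= split_point`, and the function name `route_by_interval` is hypothetical.

```python
def route_by_interval(records, attribute, low, high):
    """Route records in one scan, given a confidence interval [low, high]
    that is assumed to contain the exact split point.

    Records below `low` must go left and records above `high` must go
    right, whatever the exact split point turns out to be. Only the
    records inside the interval are held back for later inspection.
    """
    left, held, right = [], [], []
    for record in records:
        value = record[attribute]
        if value < low:
            left.append(record)
        elif value > high:
            right.append(record)
        else:
            held.append(record)
    return left, held, right

rows = [{"age": a} for a in (18, 25, 30, 32, 34, 40, 61)]
left, held, right = route_by_interval(rows, "age", low=30, high=34)
```

In this toy run only the three records with ages 30, 32 and 34 are held back; the other four are routed immediately.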
6
Selection Criterion
The combined information of the splitting attribute and the splitting predicates at a node n is called the splitting criterion at n.
Impurity functions are used to choose the attribute to split on: entropy, gain ratio, Gini.
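Two of the impurity functions named above, and their use in picking a numeric split point, can be sketched as follows. This is a minimal brute-force illustration for small in-memory data, not the presentation's implementation; the function names and the `value <= threshold` predicate form are assumptions.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels, impurity=gini):
    """Return (threshold, weighted impurity) of the best split of the
    form `value <= threshold`, trying every boundary between distinct
    sorted values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best[1]:
            best = (pairs[i - 1][0], score)
    return best

print(best_numeric_split([1, 2, 3, 4], ["a", "a", "b", "b"]))  # (2, 0.0)
```

Running the same search once per attribute and keeping the lowest score gives both parts of the splitting criterion: the attribute and its predicate.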
7
Confidence Interval
Construct T trees.
If, at node n, the splitting attribute is not the same in all trees, discard n and its subtree in all trees.
The confidence interval on a numeric attribute is determined by the range of the split points in the T trees.
The exact split point is likely to lie between the minimum and maximum of the split points in the T trees.
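The rule for a single node can be sketched as follows. This is an illustration, assuming each bootstrap tree reports its split at the node as an `(attribute, split_point)` pair; the function name `combine_bootstrap_splits` is hypothetical.

```python
def combine_bootstrap_splits(splits):
    """Combine the splits that the bootstrap trees chose at one node.

    `splits` holds one (attribute, split_point) pair per tree.
    Returns (attribute, (low, high)) when all trees agree on the
    attribute, or None when they disagree, which signals that the
    node and its subtree should be discarded.
    """
    attributes = {attribute for attribute, _ in splits}
    if len(attributes) != 1:
        return None
    points = [point for _, point in splits]
    return attributes.pop(), (min(points), max(points))

agree = combine_bootstrap_splits([("age", 30), ("age", 34), ("age", 31)])
disagree = combine_bootstrap_splits([("age", 30), ("income", 50000)])
```

Here `agree` is `("age", (30, 34))` and `disagree` is `None`.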
8
Verification
Verifying the predictions:
Use a lower bound on the impurity function to determine whether the confidence interval and the splitting attribute are correct.
If they are incorrect, discard the node and its subtree completely.
Rerun the algorithm on the data belonging to any discarded node.
9
Failed Verification
Discarding nodes near the top would result in resampling the entire database, with no savings on full scans.
This does not usually happen, because the basic probability distribution is likely to be captured by the sample.
Errors occur mostly at the detail (low) levels of the tree.
10
Dynamic Environments
There is no need to rebuild the decision tree frequently.
Store the confidence intervals.
The tree only needs to be rebuilt if the underlying probability distribution changes.
11
Experimental Results
Robust to noise: noise affects the detail-level probability distribution, so it affected only the lower levels of the tree and required rescans of small amounts of data.
Dynamically updated data: BOAT is much faster than the brute-force approach.
12
Weak Points
May not be as useful on complex probability distributions.
A failure at a high level of the tree means that most of the tree is discarded.
The hypotheses generated are only as simple as regular decision trees; BOAT is simply a way to speed up their generation.
13
Performance Comparison
14
The End
Thank you for your time