Bootstrapped Optimistic Algorithm for Tree Construction

Presentation transcript:

BOAT: Bootstrapped Optimistic Algorithm for Tree Construction

Presentation by Prashanth Saka, CIS 595, Fall 2000

BOAT is a new algorithm for decision tree construction that improves on earlier scalable algorithms in both functionality and performance, yielding a performance gain of around 300%. The reason: it needs only two scans over the entire training dataset. It is also the first scalable algorithm that can incrementally update the tree with respect to both insertions and deletions over the dataset.

Take a sample D' ⊆ D from the training database and construct a sample tree with a coarse splitting criterion at each node using bootstrapping. Then make one scan over the database D and process each tuple t by 'streaming' it down the tree. Starting at the root node n, update the counts of the buckets for each numerical predictor attribute. If t falls inside the confidence interval at n, t is written into a temporary file S_n at node n; otherwise it is sent further down the tree.
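A minimal sketch of this streaming step, assuming a hypothetical node record that stores the coarse splitting attribute and the bootstrap confidence interval for its split point; all names are illustrative, not code from the BOAT paper, and categorical splits and the per-attribute bucket counting are omitted for brevity:

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class SampleNode:
    """Hypothetical node of the bootstrapped sample tree (illustrative only)."""
    split_attr: Optional[str] = None                  # coarse splitting attribute
    conf_interval: Optional[Tuple[float, float]] = None  # (lo, hi) for the split point
    left: Optional["SampleNode"] = None
    right: Optional["SampleNode"] = None
    spill: list = field(default_factory=list)         # stands in for the file S_n
    is_leaf: bool = False

def stream_tuple(root: SampleNode, t: dict) -> None:
    """Send tuple t down the sample tree; keep it at node n if it falls
    inside n's confidence interval, otherwise route it further down."""
    node = root
    while node is not None and not node.is_leaf:
        # (bucket-count updates for each numerical predictor attribute omitted)
        lo, hi = node.conf_interval
        v = t[node.split_attr]
        if lo <= v <= hi:
            node.spill.append(t)       # would be written to the temporary file S_n
            return                     # exact split point is decided in the clean-up phase
        # outside the interval the branch taken is already certain
        node = node.left if v < lo else node.right
```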

The tree is then processed top-down. At each node n, a lower-bounding technique is used to check whether the global minimum value of the impurity function could be lower than i', the minimum impurity value found inside the confidence interval. If the check succeeds, we are done with node n; otherwise we discard n and its subtree during the current construction.
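A schematic sketch of this check, under the assumption that a proven lower bound on the impurity outside the confidence interval is available (the concrete bound from the paper is not reproduced here; the names are hypothetical):

```python
def keep_coarse_split(i_prime: float, lower_bound_outside: float) -> bool:
    """Decide whether the coarse splitting criterion at a node can be confirmed.

    i_prime             -- minimum impurity found at candidate split points
                           inside the confidence interval
    lower_bound_outside -- proven lower bound on the impurity achievable by
                           any split point outside the interval
    """
    # If no split outside the interval can have lower impurity than i',
    # the global minimum lies inside the interval and the node is confirmed.
    # Otherwise the node and its subtree are discarded and rebuilt.
    return lower_bound_outside >= i_prime
```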

In such a (binary) decision tree, each node other than the root has exactly one incoming edge, and each node has either zero or two outgoing edges. Each leaf is labeled with one class label. Each internal node n is labeled with one predictor attribute X_n, called the splitting attribute, and has a splitting predicate q_n associated with it. If X_n is numerical, then q_n is of the form X_n ≤ x_n, where x_n ∈ dom(X_n); x_n is the split point at node n.
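The structure defined above can be written down concretely; the small class below is only an illustration of these definitions, not code from BOAT, and it covers only the numerical form X_n ≤ x_n of the splitting predicate:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    """Illustrative node structure matching the definitions above."""
    class_label: Optional[str] = None    # set only for leaves
    split_attr: Optional[str] = None     # X_n, the splitting attribute
    split_point: Optional[float] = None  # x_n, for the numerical predicate X_n <= x_n
    left: Optional["TreeNode"] = None    # child taken when q_n is true
    right: Optional["TreeNode"] = None   # child taken when q_n is false

    @property
    def is_leaf(self) -> bool:
        return self.class_label is not None
```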

The combined information of the splitting attribute and the splitting predicate at a node n is called the splitting criterion at n.

With each node n ∈ T we associate a predicate f_n: dom(X1) × … × dom(Xm) → {true, false}, called its node predicate, defined as follows: for the root node n, f_n = true. Let n be a non-root node with parent p, whose splitting predicate is q_p. If n is the left child of p, then f_n = f_p ∧ q_p; if n is the right child of p, then f_n = f_p ∧ ¬q_p.
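Written out as a recurrence (a LaTeX restatement of the definition above):

```latex
f_{\mathrm{root}} = \mathrm{true}, \qquad
f_n =
\begin{cases}
  f_p \wedge q_p      & \text{if } n \text{ is the left child of } p,\\
  f_p \wedge \neg q_p & \text{if } n \text{ is the right child of } p.
\end{cases}
```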

Since each leaf node n ∈ T is labeled with a class label c, it encodes a classification rule f_n → c. The tree therefore defines a function T: dom(X1) × … × dom(Xm) → dom(C) and is a classifier, called a decision tree classifier. For a node n ∈ T with parent p, F_n is the set of records in D that follow the path from the root to node n when being processed by the tree. Formally, F_n = { t ∈ D : f_n(t) = true }.

Here, impurity-based split selection methods that produce binary splits are considered. These methods compute the splitting criterion by minimizing a concave impurity function imp. At each node, every predictor attribute X is examined, the impurity of the best split on X is calculated, and the final split is chosen so that the value of imp is minimized.
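As an illustration, the sketch below uses the Gini index, one common concave impurity function in this family; the code is only an assumed example of evaluating candidate splits on a numerical attribute, not BOAT's implementation:

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels (0 = pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numerical_split(records, attr, label_key="class"):
    """Return (best split point x, weighted impurity) for the predicate
    attr <= x, trying every distinct attribute value as a candidate."""
    n = len(records)
    best_x, best_imp = None, float("inf")
    for x in sorted({t[attr] for t in records})[:-1]:   # the max value splits nothing off
        left = [t[label_key] for t in records if t[attr] <= x]
        right = [t[label_key] for t in records if t[attr] > x]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / n
        if imp < best_imp:
            best_x, best_imp = x, imp
    return best_x, best_imp
```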

Let T be the final tree constructed by running the split selection method CL on the training database D. Since D does not fit in memory, consider a sample D' ⊆ D such that D' fits in memory, and compute a sample tree T' from D'. Each node n ∈ T' has a sample splitting criterion consisting of a sample splitting attribute and a sample split point. This knowledge of T' can guide us in the construction of T, our final goal.
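A sketch of growing such a sample tree from the in-memory sample, reusing the hypothetical gini/best_numerical_split helpers from the sketch above; stopping criteria and categorical attributes are simplifying assumptions, not the paper's procedure:

```python
from collections import Counter

def build_sample_tree(sample, attrs, label_key="class", min_size=10):
    """Recursively grow a binary sample tree T' from the in-memory sample D'."""
    labels = [t[label_key] for t in sample]
    if len(sample) < min_size or len(set(labels)) == 1:
        return {"label": Counter(labels).most_common(1)[0][0]}          # leaf
    # pick the attribute and split point with the lowest weighted impurity
    candidates = [(a, *best_numerical_split(sample, a, label_key)) for a in attrs]
    attr, x, _ = min(candidates, key=lambda c: c[2])
    if x is None:                                                       # no useful split
        return {"label": Counter(labels).most_common(1)[0][0]}
    left = [t for t in sample if t[attr] <= x]
    right = [t for t in sample if t[attr] > x]
    return {"attr": attr, "split": x,
            "left": build_sample_tree(left, attrs, label_key, min_size),
            "right": build_sample_tree(right, attrs, label_key, min_size)}
```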

Consider a node n in the sample tree T' with numerical sample splitting attribute X_n and sample splitting predicate X_n ≤ x. By T' being close to T we mean that the final splitting attribute at node n is also X_n, and that the final split point lies inside a confidence interval around x. For categorical attributes, both the splitting attribute and the splitting subset have to match exactly.

Bootstrapping: the bootstrapping method is applied to the in-memory sample D' to obtain a tree T' that is close to T with high probability. In addition to T', we also obtain confidence intervals that contain the final split points for nodes with numerical splitting attributes. The information at node n obtained through bootstrapping is called the coarse splitting criterion at node n.
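A rough sketch of how such a confidence interval could be obtained by bootstrapping, reusing the hypothetical best_numerical_split helper from the earlier sketch; the number of resamples and the use of the min/max of the bootstrap split points are simplifying assumptions, not the paper's exact procedure:

```python
import random

def bootstrap_split_interval(sample, attr, num_resamples=20, seed=0):
    """Estimate a confidence interval for the split point of one numerical
    attribute by recomputing the best split on bootstrap resamples of D'."""
    rng = random.Random(seed)
    split_points = []
    for _ in range(num_resamples):
        # resample D' with replacement and recompute the best split point
        resample = [rng.choice(sample) for _ in range(len(sample))]
        x, _ = best_numerical_split(resample, attr)
        if x is not None:
            split_points.append(x)
    if not split_points:
        return None, None
    # take the spread of the bootstrap split points as the confidence interval
    return min(split_points), max(split_points)
```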

At this point we have, at each node n, the final splitting attribute and a confidence interval of attribute values that contains the final split point. To decide on the final split point, we only need to examine the value of the impurity function at the attribute values inside the confidence interval. If all the tuples that fall inside the confidence interval of n were in memory, we could calculate the final split point exactly by evaluating the impurity function at these points only.

To bring these tuples into memory, we make one scan over D and keep in memory all tuples that fall inside the confidence interval at any node. Then we post-process each node with a numerical splitting attribute to find the exact value of the split point, using the tuples collected during the database scan. This phase is called the clean-up phase. Note that the coarse splitting criterion at node n, obtained from the sample D' through bootstrapping, is only correct with high probability.
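A sketch of this clean-up step for a single node: the impurity is evaluated only at the candidate values collected inside the node's confidence interval, while tuples that fell outside the interval enter through their class-label counts (gathered as bucket statistics during the scan). The names and the exact bookkeeping are illustrative assumptions, not BOAT's implementation:

```python
from collections import Counter

def gini_counts(counts):
    """Gini index from a Counter of class-label counts."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def cleanup_split_point(spilled, attr, below_counts, above_counts, label_key="class"):
    """Pick the exact split point at a node by evaluating the impurity only
    at candidate values inside the confidence interval.

    spilled      -- tuples written to S_n (attribute values inside the interval)
    below_counts -- Counter of class labels of tuples left of the interval
    above_counts -- Counter of class labels of tuples right of the interval
    """
    total = len(spilled) + sum(below_counts.values()) + sum(above_counts.values())
    best_x, best_imp = None, float("inf")
    for x in sorted({t[attr] for t in spilled}):
        left = Counter(below_counts)
        left.update(t[label_key] for t in spilled if t[attr] <= x)
        right = Counter(above_counts)
        right.update(t[label_key] for t in spilled if t[attr] > x)
        n_l, n_r = sum(left.values()), sum(right.values())
        imp = (n_l * gini_counts(left) + n_r * gini_counts(right)) / total
        if imp < best_imp:
            best_x, best_imp = x, imp
    return best_x
```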

Whenever the coarse splitting criterion at n is not correct, we detect this during the clean-up phase and can take the necessary corrective action. Hence, the method is guaranteed to produce exactly the same tree as a traditional main-memory algorithm run on the complete training set.