1 Chapter 10. Sampling Strategy for Building Decision Trees from Very Large Databases Comprising Many Continuous Attributes. Jean-Hugues Chauchat and Ricco Rakotomalala, Laboratory ERIC, University Lumière Lyon. Summarized by Seong-Bae Park.

2 Introduction. A fast and efficient sampling strategy for building decision trees (DTs) from a very large database. The chapter proposes a strategy using successive samples, one on each tree node.

3 Framework. Running example: the classic "Play Tennis" table.

4 Handling Continuous Attributes in a DT: Discretization
Global discretization: each continuous attribute is converted to a discrete one before the tree is built.
1. Each continuous variable is sorted.
2a. Several cutting points are tested to find the subdivision that is best with respect to the class attribute, using a splitting measure (entropy gain, chi-square, purity measure).
2b. The number of intervals and their boundaries are determined.
Local discretization: discretization is redone at each node. It is not necessary to decide how many intervals to create, since each split produces two intervals, and interactions among attributes are taken into account. It initially requires sorting the values, O(n log n), hence the need for sampling to reduce n. A minimal sketch of the binary cut-point search is given below.
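To make the cut-point search concrete, here is a small Python sketch (not the authors' implementation; the function names and the full rescan of labels at each candidate cut are illustrative simplifications — a running class-count update would make the scan linear after the sort):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_cut(values, labels):
    """Return (cut, gain): the midpoint between consecutive distinct values
    that maximizes the entropy gain of the binary split value <= cut."""
    pairs = sorted(zip(values, labels))        # O(n log n) sort of the attribute
    base = entropy(labels)
    n = len(pairs)
    best_cut, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                           # no cut between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain

# made-up temperature values and class labels, for illustration only
print(best_binary_cut([64, 65, 68, 69, 71, 72, 75, 80, 81, 83],
                      ["yes", "no", "yes", "yes", "no", "no", "yes", "no", "yes", "yes"]))
```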

5 Local Sampling Strategy
During construction, on each leaf a sample is drawn from the part of the database that corresponds to the path leading to that leaf.
Process (a toy rendering follows this list):
1. A complete list of the individuals in the base is drawn up;
2. the first sample is selected while the base is being read;
3. this sample is used to identify the best segmentation attribute, if one exists; otherwise the stopping rule has played its role and the node becomes a terminal leaf;
4. if a segmentation is possible, the list from step 1 is broken up into sub-lists corresponding to the leaves just obtained;
5. step 4 requires a pass through the database to update each example's leaf; this pass is the opportunity to select the samples used in later computations.
Steps 3 to 5 are iterated until all nodes have become terminal leaves.
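The toy rendering below is self-contained but simplified: records are assumed to be (attribute-dict, label) pairs with categorical attributes only, and a plain entropy-gain threshold stands in for the chapter's statistical stopping test. It is only meant to show one fresh sample per node and the routing pass over the node's record list.

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(sample, min_gain=0.05):
    """Best attribute (by entropy gain) on the sampled records, or None (stopping rule)."""
    labels = [y for _, y in sample]
    base, best = entropy(labels), (None, min_gain)
    for attr in sample[0][0]:
        parts = {}
        for x, y in sample:
            parts.setdefault(x[attr], []).append(y)
        rest = sum(len(p) / len(sample) * entropy(p) for p in parts.values())
        if base - rest > best[1]:
            best = (attr, base - rest)
    return best[0]

def grow_tree(db, record_ids, sample_size):
    """Grow one node: draw a fresh sample on this node, pick a split, recurse."""
    sample = [db[i] for i in random.sample(record_ids, min(sample_size, len(record_ids)))]
    attr = best_split(sample)
    if attr is None:                              # no significant split -> terminal leaf
        return Counter(y for _, y in sample).most_common(1)[0][0]
    children = {}
    for i in record_ids:                          # one pass to route records to sub-lists
        children.setdefault(db[i][0][attr], []).append(i)
    return {attr: {v: grow_tree(db, ids, sample_size) for v, ids in children.items()}}
```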

6 Local Sampling Strategy

7 Determining the Sample Size. When a split truly exists on a node, the sample must be large enough that: 1) the split is recognized as such, i.e. the power of the test is sufficient; 2) the discretization point is estimated as precisely as possible; 3) if several splitting attributes are possible on the node in the base, the criterion for the optimal attribute remains maximal in the sample.

8 Testing the Statistical Significance of a Link
On each node we use statistical testing concepts, the type I and type II error probabilities (α and β), while looking for the attribute that provides the best split according to the criterion T. The split is performed if two conditions are met: 1) this split is the best one, and 2) this split is admissible, i.e. T(sample data) would be unlikely if H0 were true.
Null hypothesis H0: "there is no link between the class attribute and the predictive attribute being tested."
p-value: the probability of T being greater than or equal to T(sample data) under H0.
H0 is rejected, and the split is therefore allowed, if the p-value is less than a predetermined significance level α.
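As an illustration (not necessarily the authors' exact procedure), the admissibility condition can be checked with a chi-square test of independence on the node's sampled contingency table; the 2x2 table and the α value below are made up:

```python
from scipy.stats import chi2_contingency

def split_is_significant(contingency, alpha=0.05):
    """Accept the split only if the p-value of the chi-square test of
    independence between class and (discretized) attribute is below alpha."""
    statistic, p_value, dof, _ = chi2_contingency(contingency)
    return p_value < alpha

# Example: 2x2 table (class x binary split) observed on the node's sample
print(split_is_significant([[30, 10], [12, 28]], alpha=0.01))
```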

9 Testing the Statistical Significance of a Link (continued)
Because several attributes are tested on each node (multiple hypotheses), the true significance level α' is larger than α. If K independent attributes are tested at level α, the probability α' of observing at least one p-value smaller than α is α' = 1 - (1 - α)^K; for example, with α = 0.01 and K = 30 candidate attributes, α' ≈ 0.26. One must therefore use a very small value for α. The significance level α bounds the type I error probability.

10 Notations
Y: class attribute; X: a predictor attribute.
π_ij: proportion of (Y = Y_i and X = X_j) in the sub-population corresponding to the working node.
π_i+ and π_+j: marginal proportions.
π_ij^0 = π_i+ · π_+j: product of the marginal proportions.
n_ij: count of cell (i, j) in the sample cross-tabulation; its expected value is E(n_ij) = n · π_ij.

11 Probability Distribution of the Criterion
The link is measured by the χ² statistic or by the information gain. When H0 is true and the sample size is large, both have an approximate (central) chi-square distribution with (p - 1)(q - 1) degrees of freedom. When H0 is false, the distribution is approximately a non-central chi-square with non-centrality parameter λ: λ = 0 when H0 is true, and the further the truth is from H0, the larger λ becomes. The non-central chi-square distribution has no closed analytic form, but it is asymptotically normal for large values of λ. λ is a function of the sample size n and of the frequencies π_ij in the whole database.

12 Probability Distribution of the Criterion (continued)
The value of λ:
For the information gain (G² form): λ = 2n · Σ_ij π_ij · ln(π_ij / π_ij^0).
For the χ² statistic: λ = n · Σ_ij (π_ij - π_ij^0)² / π_ij^0.
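A small numerical illustration of these two expressions, with a made-up 2x2 proportion table (zero cells are not handled):

```python
import numpy as np

def noncentrality(pi, criterion="chi2"):
    """Per-record non-centrality (multiply by n for a sample of size n);
    pi is the full-database joint proportion table, pi0 the product of its marginals."""
    pi = np.asarray(pi, dtype=float)
    pi0 = np.outer(pi.sum(axis=1), pi.sum(axis=0))
    if criterion == "chi2":
        return ((pi - pi0) ** 2 / pi0).sum()
    return 2 * (pi * np.log(pi / pi0)).sum()   # information-gain (G^2) version

pi = [[0.30, 0.10],
      [0.15, 0.45]]
n = 300
print(n * noncentrality(pi, "chi2"), n * noncentrality(pi, "gain"))
```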

13 Equalizing of Normal Risk Probabilities
Find the minimal sample size that yields a power of (1 - β), i.e. the smallest n such that, when H0 is false, the criterion exceeds the critical value T_{1-α} with probability at least (1 - β). If p = q = 2, then v = 1 and λ = nR².

14 Equalizing of Normal Risk Probabilities (continued)
The weaker the link (R²) in the database, the larger the sample must be to bring it to light. n also increases as the significance level α decreases: if one wants to reduce the risk probabilities, a larger sample is needed.
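An illustrative sample-size computation for the p = q = 2 case (this sketch uses scipy's non-central chi-square directly rather than the normal approximation discussed on the slide; the function name and the simple linear search are assumptions for the example):

```python
from scipy.stats import chi2, ncx2

def minimal_sample_size(r2, alpha=0.01, beta=0.05, n_max=10**6):
    """Smallest n such that a 1-df chi-square test at level alpha detects
    a link of strength R^2 (lambda = n * R^2) with power at least 1 - beta."""
    critical = chi2.ppf(1 - alpha, df=1)          # T_{1-alpha}
    for n in range(2, n_max):
        if ncx2.sf(critical, df=1, nc=n * r2) >= 1 - beta:
            return n
    return None

# The weaker the link, the larger the required sample
for r2 in (0.10, 0.01, 0.001):
    print(r2, minimal_sample_size(r2))
```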

15 Sampling Methods
Algorithm S: processes the database records sequentially and decides whether each record is selected. The first record is selected with probability n / N; if m records have been selected among the first t records, the (t+1)-th record is selected with probability (n - m) / (N - t); the algorithm stops once n records have been selected.
Algorithm D: generates a random jump between consecutive selected records instead of deciding record by record.
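A short Python sketch of Algorithm S, written directly from the selection probabilities quoted on the slide (the function name is illustrative):

```python
import random

def algorithm_s(records, n):
    """Select a simple random sample of n records from a sequence of known
    length N in one pass: record t+1 is taken with probability (n-m)/(N-t)."""
    N = len(records)
    sample, m = [], 0
    for t, record in enumerate(records):       # t records already examined
        if random.random() < (n - m) / (N - t):
            sample.append(record)
            m += 1
            if m == n:                          # stop once n records are selected
                break
    return sample

print(algorithm_s(list(range(100)), 10))
```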

16 Experiments
Objective of the experiments: to show that a tree built with local sampling has a generalization error rate comparable to that of a tree built from the complete database, and that sampling reduces computing time.
Artificial database: Breiman et al.'s "waves" problem. Two files are generated 100 times, one of 500,000 records for training and one of 50,000 records for validation; binary discretization and the ChAID decision tree algorithm are used.

17 Experiments (results). Beyond a certain sample size, the marginal gain from larger samples becomes weak.

18 With Real Benchmark DBs. Five DBs from the UCI repository, containing more than 12,900 individuals. The following operations are repeated 10 times: randomly subdivide the DB into a training set and a test set, then build and test the trees.

19 With Real Benchmark DBs

20 The Influence of n. The sample size must not be too small. Sampling drastically reduces computing time. On the "Letter" DB, small samples suffer from data fragmentation.

21 Conclusions
Working on samples is useful. The "step by step" character of decision tree construction makes it possible to propose a strategy using successive samples, supported by both theoretical and empirical evidence.
Open problems: optimal sampling methods; learning imbalanced classes (local equal-size sampling).

