1 Chapter 10. Sampling Strategy for Building Decision Trees from Very Large Databases Comprising Many Continuous Attributes. Jean-Hugues Chauchat and Ricco Rakotomalala, Laboratory ERIC, University Lumière Lyon. Summarized by Seong-Bae Park.

2 Introduction. A fast and efficient sampling strategy for building decision trees (DTs) from a very large database. The chapter proposes a strategy using successive samples, one on each tree node.

3 Framework. Running example: the classic "Play Tennis" table.

4 Handling Continuous Attributes in a DT: Discretization
Global discretization: each continuous attribute is converted to a discrete one before the tree is built.
1. Each continuous variable is sorted.
2a. Several cutting points are tested to find the subdivision that is best with respect to the class attribute, using a splitting measure (entropy gain, chi-square, purity measure).
2b. The number of intervals and their boundaries are determined.
Local discretization: discretization is redone at each node. It is not necessary to decide how many intervals to create, since each split produces two intervals, and interactions among attributes are taken into account. It initially requires sorting the values, O(n log n), hence the need for sampling to reduce n. A minimal sketch of the binary cut-point search is given below.
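To make the cut-point search concrete, here is a small Python sketch (not the authors' implementation; the function names and the full rescan of labels at each candidate cut are illustrative simplifications — a running class-count update would make the scan linear after the sort):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_cut(values, labels):
    """Return (cut, gain): the midpoint between consecutive distinct values
    that maximizes the entropy gain of the binary split value <= cut."""
    pairs = sorted(zip(values, labels))        # O(n log n) sort of the attribute
    base = entropy(labels)
    n = len(pairs)
    best_cut, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                           # no cut between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain

# made-up temperature values and class labels, for illustration only
print(best_binary_cut([64, 65, 68, 69, 71, 72, 75, 80, 81, 83],
                      ["yes", "no", "yes", "yes", "no", "no", "yes", "no", "yes", "yes"]))
```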

5 Local Sampling Strategy
During construction, on each leaf a sample is drawn from the part of the database that corresponds to the path leading to that leaf.
Process (a toy rendering follows this list):
1. A complete list of the individuals in the base is drawn up;
2. the first sample is selected while the base is being read;
3. this sample is used to identify the best segmentation attribute, if one exists; otherwise the stopping rule has played its role and the node becomes a terminal leaf;
4. if a segmentation is possible, the list from step 1 is broken up into sub-lists corresponding to the leaves just obtained;
5. step 4 requires a pass through the database to update each example's leaf; this pass is the opportunity to select the samples used in later computations.
Steps 3 to 5 are iterated until all nodes have become terminal leaves.
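The toy rendering below is self-contained but simplified: records are assumed to be (attribute-dict, label) pairs with categorical attributes only, and a plain entropy-gain threshold stands in for the chapter's statistical stopping test. It is only meant to show one fresh sample per node and the routing pass over the node's record list.

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(sample, min_gain=0.05):
    """Best attribute (by entropy gain) on the sampled records, or None (stopping rule)."""
    labels = [y for _, y in sample]
    base, best = entropy(labels), (None, min_gain)
    for attr in sample[0][0]:
        parts = {}
        for x, y in sample:
            parts.setdefault(x[attr], []).append(y)
        rest = sum(len(p) / len(sample) * entropy(p) for p in parts.values())
        if base - rest > best[1]:
            best = (attr, base - rest)
    return best[0]

def grow_tree(db, record_ids, sample_size):
    """Grow one node: draw a fresh sample on this node, pick a split, recurse."""
    sample = [db[i] for i in random.sample(record_ids, min(sample_size, len(record_ids)))]
    attr = best_split(sample)
    if attr is None:                              # no significant split -> terminal leaf
        return Counter(y for _, y in sample).most_common(1)[0][0]
    children = {}
    for i in record_ids:                          # one pass to route records to sub-lists
        children.setdefault(db[i][0][attr], []).append(i)
    return {attr: {v: grow_tree(db, ids, sample_size) for v, ids in children.items()}}
```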

6 Local Sampling Strategy

7 Determining the Sample Size. When a split truly exists on a node, the sample must be large enough that: 1) the split is recognized as such, i.e. the power of the test is sufficient; 2) the discretization point is estimated as precisely as possible; 3) if several splitting attributes are possible on the node in the base, the criterion for the optimal attribute remains maximal in the sample.

8 Testing the Statistical Significance of a Link
On each node we use statistical testing concepts, the type I and type II error probabilities (α and β), while looking for the attribute that provides the best split according to the criterion T. The split is performed if two conditions are met: 1) this split is the best one, and 2) this split is admissible, i.e. T(sample data) would be unlikely if H0 were true.
Null hypothesis H0: "there is no link between the class attribute and the predictive attribute being tested."
p-value: the probability of T being greater than or equal to T(sample data) under H0.
H0 is rejected, and the split is therefore allowed, if the p-value is less than a predetermined significance level α.
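As an illustration (not necessarily the authors' exact procedure), the admissibility condition can be checked with a chi-square test of independence on the node's sampled contingency table; the 2x2 table and the α value below are made up:

```python
from scipy.stats import chi2_contingency

def split_is_significant(contingency, alpha=0.05):
    """Accept the split only if the p-value of the chi-square test of
    independence between class and (discretized) attribute is below alpha."""
    statistic, p_value, dof, _ = chi2_contingency(contingency)
    return p_value < alpha

# Example: 2x2 table (class x binary split) observed on the node's sample
print(split_is_significant([[30, 10], [12, 28]], alpha=0.01))
```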

9 Testing the Statistical Significance of a Link (continued)
Because several attributes are tested on each node (multiple hypotheses), the true significance level α' is larger than α. If K independent attributes are tested at level α, the probability α' of observing at least one p-value smaller than α is α' = 1 - (1 - α)^K; for example, with α = 0.01 and K = 30 candidate attributes, α' ≈ 0.26. One must therefore use a very small value for α. The significance level α bounds the type I error probability.

10 Notations
Y: class attribute; X: a predictor attribute.
π_ij: proportion of (Y = Y_i and X = X_j) in the sub-population corresponding to the working node.
π_i+ and π_+j: marginal proportions.
π_ij^0 = π_i+ · π_+j: product of the marginal proportions.
n_ij: count of cell (i, j) in the sample cross-tabulation; its expected value is E(n_ij) = n · π_ij.

11 Probability Distribution of the Criterion
The link is measured by the χ² statistic or by the information gain. When H0 is true and the sample size is large, both have an approximate (central) chi-square distribution with (p - 1)(q - 1) degrees of freedom. When H0 is false, the distribution is approximately a non-central chi-square with non-centrality parameter λ: λ = 0 when H0 is true, and the further the truth is from H0, the larger λ becomes. The non-central chi-square distribution has no closed analytic form, but it is asymptotically normal for large values of λ. λ is a function of the sample size n and of the frequencies π_ij in the whole database.

12 Probability Distribution of the Criterion (continued)
The value of λ:
For the information gain (G² form): λ = 2n · Σ_ij π_ij · ln(π_ij / π_ij^0).
For the χ² statistic: λ = n · Σ_ij (π_ij - π_ij^0)² / π_ij^0.
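A small numerical illustration of these two expressions, with a made-up 2x2 proportion table (zero cells are not handled):

```python
import numpy as np

def noncentrality(pi, criterion="chi2"):
    """Per-record non-centrality (multiply by n for a sample of size n);
    pi is the full-database joint proportion table, pi0 the product of its marginals."""
    pi = np.asarray(pi, dtype=float)
    pi0 = np.outer(pi.sum(axis=1), pi.sum(axis=0))
    if criterion == "chi2":
        return ((pi - pi0) ** 2 / pi0).sum()
    return 2 * (pi * np.log(pi / pi0)).sum()   # information-gain (G^2) version

pi = [[0.30, 0.10],
      [0.15, 0.45]]
n = 300
print(n * noncentrality(pi, "chi2"), n * noncentrality(pi, "gain"))
```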

13 Equalizing of Normal Risk Probabilities
Find the minimal sample size that yields a power of (1 - β), i.e. the smallest n such that, when H0 is false, the criterion exceeds the critical value T_{1-α} with probability at least (1 - β). If p = q = 2, then v = 1 and λ = nR².

14 Equalizing of Normal Risk Probabilities (continued)
The weaker the link (R²) in the database, the larger the sample must be to bring it to light. n also increases as the significance level α decreases: if one wants to reduce the risk probabilities, a larger sample is needed.
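An illustrative sample-size computation for the p = q = 2 case (this sketch uses scipy's non-central chi-square directly rather than the normal approximation discussed on the slide; the function name and the simple linear search are assumptions for the example):

```python
from scipy.stats import chi2, ncx2

def minimal_sample_size(r2, alpha=0.01, beta=0.05, n_max=10**6):
    """Smallest n such that a 1-df chi-square test at level alpha detects
    a link of strength R^2 (lambda = n * R^2) with power at least 1 - beta."""
    critical = chi2.ppf(1 - alpha, df=1)          # T_{1-alpha}
    for n in range(2, n_max):
        if ncx2.sf(critical, df=1, nc=n * r2) >= 1 - beta:
            return n
    return None

# The weaker the link, the larger the required sample
for r2 in (0.10, 0.01, 0.001):
    print(r2, minimal_sample_size(r2))
```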

15 Sampling Methods
Algorithm S: processes the database records sequentially and decides whether each record is selected. The first record is selected with probability n / N; if m records have been selected among the first t records, the (t+1)-th record is selected with probability (n - m) / (N - t); the algorithm stops once n records have been selected.
Algorithm D: generates a random jump between consecutive selected records instead of deciding record by record.
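A short Python sketch of Algorithm S, written directly from the selection probabilities quoted on the slide (the function name is illustrative):

```python
import random

def algorithm_s(records, n):
    """Select a simple random sample of n records from a sequence of known
    length N in one pass: record t+1 is taken with probability (n-m)/(N-t)."""
    N = len(records)
    sample, m = [], 0
    for t, record in enumerate(records):       # t records already examined
        if random.random() < (n - m) / (N - t):
            sample.append(record)
            m += 1
            if m == n:                          # stop once n records are selected
                break
    return sample

print(algorithm_s(list(range(100)), 10))
```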

16 Experiments
Objective of the experiments: to show that a tree built with local sampling has a generalization error rate comparable to that of a tree built from the complete database, and that sampling reduces computing time.
Artificial database: Breiman et al.'s "waves" problem. Two files are generated 100 times, one of 500,000 records for training and one of 50,000 records for validation; binary discretization and the ChAID decision tree algorithm are used.

17 Experiments (results). Beyond a certain sample size, the marginal gain from larger samples becomes weak.

18 With Real Benchmark DBs. Five DBs from the UCI repository, containing more than 12,900 individuals. The following operations are repeated 10 times: randomly subdivide the DB into a training set and a test set, then build and test the trees.

19 With Real Benchmark DBs

20 The Influence of n. The sample size must not be too small. Sampling drastically reduces computing time. On the "Letter" DB, small samples suffer from data fragmentation.

21 Conclusions
Working on samples is useful. The "step by step" character of decision tree construction makes it possible to propose a strategy using successive samples, supported by both theoretical and empirical evidence.
Open problems: optimal sampling methods; learning imbalanced classes (local equal-size sampling).

