
1 Basics of Decision Trees
- A flow-chart-like hierarchical tree structure, often restricted to a binary structure
- Root: represents the entire dataset
- A node without children is called a leaf node; otherwise it is called an internal node
  - Internal nodes denote a test on an attribute
  - Branches (splits) represent outcomes of the test: a univariate split is based on a single attribute, a multivariate split on several
  - Leaf nodes represent class labels or class distributions
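To make this structure concrete, here is a minimal Python sketch of a binary tree node; it is not from the slides, and the field names are illustrative assumptions.

```python
class Node:
    """A node in a binary decision tree (illustrative sketch)."""

    def __init__(self, split_attr=None, split_value=None,
                 left=None, right=None, label=None):
        self.split_attr = split_attr    # attribute tested at an internal node
        self.split_value = split_value  # splitting predicate, e.g. salary <= 50K
        self.left = left                # subtree where the predicate holds
        self.right = right              # subtree where it does not
        self.label = label              # class label, set only for leaf nodes

    def is_leaf(self):
        # A node without children is a leaf and carries a class label.
        return self.left is None and self.right is None
```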

2 Example (training dataset)
- Three predictor attributes: salary, age, employment
- Class label attribute: group

Record ID | Salary | Age | Employment | Group
1         | 30K    | 30  | Self       | C
2         | 40K    | 35  | Industry   | C
3         | 70K    | 50  | Academia   | C
4         | 60K    | 45  | Self       | B
5         | 70K    | 30  | Academia   | B
6         | 60K    | 35  | Industry   | A
7         | 60K    | 35  | Self       | A
8         | 70K    | 30  | Self       | A
9         | 40K    | 45  | Industry   | C
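For the code sketches that follow, the training data above can be written as a small Python list of records; the dictionary keys mirror the table, and "30K" is written as 30_000.

```python
# The nine training records from the table above, one dict per record.
TRAINING_DATA = [
    {"id": 1, "salary": 30_000, "age": 30, "employment": "Self",     "group": "C"},
    {"id": 2, "salary": 40_000, "age": 35, "employment": "Industry", "group": "C"},
    {"id": 3, "salary": 70_000, "age": 50, "employment": "Academia", "group": "C"},
    {"id": 4, "salary": 60_000, "age": 45, "employment": "Self",     "group": "B"},
    {"id": 5, "salary": 70_000, "age": 30, "employment": "Academia", "group": "B"},
    {"id": 6, "salary": 60_000, "age": 35, "employment": "Industry", "group": "A"},
    {"id": 7, "salary": 60_000, "age": 35, "employment": "Self",     "group": "A"},
    {"id": 8, "salary": 70_000, "age": 30, "employment": "Self",     "group": "A"},
    {"id": 9, "salary": 40_000, "age": 45, "employment": "Industry", "group": "C"},
]
```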

3 Example (univariate split)
- Each internal node of a decision tree is labeled with a predictor attribute, called the splitting attribute
- Each leaf node is labeled with a class label
- Each edge originating from an internal node is labeled with a splitting predicate that involves only that node's splitting attribute
- The combined information about splitting attributes and splitting predicates at a node is called the split criterion

[Figure: example decision tree for the training data, with splitting attributes Salary, Age, and Employment, splitting predicates such as <= 50K / > 50K, <= 40 / > 40, and Academia, Industry / Self, and leaf nodes labeled Group A, Group B, and Group C.]

4 Two Phases
- Most decision tree generation consists of two phases:
  - Tree construction
  - Tree pruning

5 Building Decision Trees
- Basic algorithm (a greedy algorithm):
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

6 Tree-Building Algorithm
- We can build the whole tree by calling BuildTree(dataset TrainingData, split-selection-method CL)

Top-Down Decision Tree Induction Schema (Binary Splits):
Input: dataset S, split-selection-method CL
Output: decision tree for S

BuildTree(dataset S, split-selection-method CL)
(1) If all points in S are in the same class, then return;
(2) Use CL to evaluate splits for each attribute;
(3) Use the best split found to partition S into S1 and S2;
(4) BuildTree(S1, CL);
(5) BuildTree(S2, CL)
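A minimal Python rendering of this schema, assuming a split-selection function (the CL of the slide) that returns a predicate for the best binary split or None when no useful split exists; this is a sketch, not a production implementation.

```python
def build_tree(records, label_attr, choose_split):
    """Top-down induction schema with binary splits (sketch).

    choose_split(records) plays the role of CL; it should return a predicate
    function record -> bool, or None when no useful split exists.
    """
    labels = {r[label_attr] for r in records}
    if len(labels) == 1:                       # (1) all points in the same class
        return {"leaf": labels.pop()}

    predicate = choose_split(records)          # (2) evaluate splits via CL
    if predicate is None:                      # no split improves purity
        majority = max(labels, key=lambda c: sum(r[label_attr] == c for r in records))
        return {"leaf": majority}

    s1 = [r for r in records if predicate(r)]  # (3) partition S into S1 and S2
    s2 = [r for r in records if not predicate(r)]
    return {
        "predicate": predicate,
        "left": build_tree(s1, label_attr, choose_split),   # (4)
        "right": build_tree(s2, label_attr, choose_split),  # (5)
    }
```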

7 Split Selection
- Information gain / gain ratio (ID3 / C4.5)
  - All attributes are assumed to be categorical
  - Can be modified for continuous-valued attributes
- Gini index (IBM IntelligentMiner)
  - All attributes are assumed to be continuous-valued
  - Assumes there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes

8 Information Gain (ID3)
- T: training set; S: any set of cases
- freq(C_i, S): the number of cases in S that belong to class C_i
- |S|: the number of cases in set S
- The information of set S is defined as
  info(S) = - Σ_i ( freq(C_i, S) / |S| ) × log2( freq(C_i, S) / |S| )
- Consider a similar measurement after T has been partitioned in accordance with the n outcomes of a test X. The expected information requirement can be found as the weighted sum over the subsets:
  info_X(T) = Σ_{i=1..n} ( |T_i| / |T| ) × info(T_i)
- The quantity gain(X) = info(T) - info_X(T) measures the information that is gained by partitioning T in accordance with the test X. The gain criterion selects a test that maximizes this information gain.
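A short Python sketch of these formulas; the function names are our own, and the record format follows the TRAINING_DATA sketch above.

```python
import math
from collections import Counter

def info(records, label_attr="group"):
    """info(S): entropy of the class distribution in a set of records."""
    counts = Counter(r[label_attr] for r in records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_x(partitions, label_attr="group"):
    """info_X(T): expected information after partitioning T by test X."""
    total = sum(len(p) for p in partitions)
    return sum((len(p) / total) * info(p, label_attr) for p in partitions)

def gain(records, partitions, label_attr="group"):
    """gain(X) = info(T) - info_X(T)."""
    return info(records, label_attr) - info_x(partitions, label_attr)
```

For example, the gain of the test on employment is obtained by partitioning TRAINING_DATA into its Academia, Industry, and Self subsets and calling gain(TRAINING_DATA, partitions).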

9 Gain Ratio (C4.5)
- The gain criterion has a serious deficiency: it has a strong bias in favor of tests with many outcomes.
- We therefore also define
  split info(X) = - Σ_{i=1..n} ( |T_i| / |T| ) × log2( |T_i| / |T| )
  This represents the potential information generated by dividing T into n subsets, whereas the information gain measures the information relevant to classification that arises from the same division.
- Then
  gain ratio(X) = gain(X) / split info(X)
  expresses the proportion of information generated by the split that is useful, i.e., that appears helpful for classification.
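Continuing the sketch above (this reuses math and gain from the information gain sketch):

```python
def split_info(partitions):
    """split info(X): potential information generated by the division itself."""
    total = sum(len(p) for p in partitions)
    return -sum((len(p) / total) * math.log2(len(p) / total)
                for p in partitions if p)

def gain_ratio(records, partitions, label_attr="group"):
    """gain ratio(X) = gain(X) / split info(X)."""
    si = split_info(partitions)
    return gain(records, partitions, label_attr) / si if si > 0 else 0.0
```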

10 Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - Σ_{j=1..n} p_j^2
  where p_j is the relative frequency of class j in T.
- If a data set T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is defined as
  gini_split(T) = (N_1 / N) × gini(T_1) + (N_2 / N) × gini(T_2), where N = N_1 + N_2.
- The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible split points for each attribute).
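The same formulas as a small Python sketch, using the record format from the earlier examples:

```python
from collections import Counter

def gini(records, label_attr="group"):
    """gini(T) = 1 - sum_j p_j^2."""
    counts = Counter(r[label_attr] for r in records)
    total = len(records)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_split(t1, t2, label_attr="group"):
    """Weighted gini index of a binary split of T into T1 and T2."""
    n1, n2 = len(t1), len(t2)
    n = n1 + n2
    return (n1 / n) * gini(t1, label_attr) + (n2 / n) * gini(t2, label_attr)
```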

11 Pruning Decision Trees
- Why prune?
  - Overfitting: too many branches, some of which may reflect anomalies due to noise or outliers; the result is poor accuracy on unseen samples
- Two approaches to pruning:
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree"

12 Well-known Pruning Methods
- Reduced Error Pruning (sketched below)
- Pessimistic Error Pruning
- Minimum Error Pruning
- Critical Value Pruning
- Cost-Complexity Pruning
- Error-Based Pruning
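Reduced Error Pruning is the simplest of these. A rough sketch, assuming the dict-based trees built by the build_tree sketch above and a separate pruning (validation) set as described on the previous slide; Counter comes from the earlier imports.

```python
def classify(tree, record):
    """Follow predicates from the root until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["left"] if tree["predicate"](record) else tree["right"]
    return tree["leaf"]

def errors(tree, records, label_attr="group"):
    return sum(classify(tree, r) != r[label_attr] for r in records)

def reduced_error_prune(tree, prune_set, label_attr="group"):
    """Bottom-up: replace a subtree by a leaf whenever that does not
    increase the error on the pruning set."""
    if "leaf" in tree or not prune_set:
        return tree
    left_set = [r for r in prune_set if tree["predicate"](r)]
    right_set = [r for r in prune_set if not tree["predicate"](r)]
    tree["left"] = reduced_error_prune(tree["left"], left_set, label_attr)
    tree["right"] = reduced_error_prune(tree["right"], right_set, label_attr)
    majority = Counter(r[label_attr] for r in prune_set).most_common(1)[0][0]
    leaf = {"leaf": majority}
    if errors(leaf, prune_set, label_attr) <= errors(tree, prune_set, label_attr):
        return leaf
    return tree
```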

13 Extracting Rules from Decision Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example: IF salary = ">50K" AND age = ">40" THEN group = "C"
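A sketch of this path-to-rule idea over the dict-based trees used above; it assumes each internal node also carries a human-readable "description" of its test (an assumption added for printing, since the predicate itself is a function).

```python
def extract_rules(tree, conditions=()):
    """One IF-THEN rule per root-to-leaf path; conditions form a conjunction."""
    if "leaf" in tree:
        antecedent = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {antecedent} THEN group = {tree['leaf']}"]
    test = tree.get("description", "split predicate holds")  # human-readable test
    rules = extract_rules(tree["left"], conditions + (test,))
    rules += extract_rules(tree["right"], conditions + (f"NOT ({test})",))
    return rules
```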

14 Why Decision Tree Induction
- Compared to a neural network or a Bayesian classifier, a decision tree is easily interpreted/comprehended by humans
- While training neural networks can take large amounts of time and thousands of iterations, inducing decision trees is efficient and thus suitable for large training sets
- Decision tree generation algorithms do not require additional information beyond that already contained in the training data
- Decision trees display good classification accuracy compared to other techniques
- Decision tree induction can use SQL queries for accessing databases

15 Tree Quality Measures
- Accuracy
- Complexity
  - Tree size
  - Number of leaf nodes
- Computational speed

16 Scalability
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
- SLIQ (EDBT'96, Mehta et al.)
  - builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.)
  - constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim)
  - integrates tree splitting and tree pruning: stops growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
  - separates the scalability aspects from the criteria that determine the quality of the tree
  - builds an AVC-list (attribute, value, class label)
- BOAT
- CLOUDS

17 Other Issues
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values (a sketch follows this list)
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute (feature) construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
- Incremental tree induction
- Integration of data warehousing techniques
- Different data access methods
- Bias in split selection
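As a small illustration of the first missing-value strategy (assign the most common value), here is a hedged sketch over the dict-based records used earlier; the function name is our own.

```python
from collections import Counter

def impute_most_common(records, attr):
    """Fill missing values of `attr` with its most common observed value."""
    observed = [r[attr] for r in records if r.get(attr) is not None]
    if not observed:
        return records  # nothing observed; leave the records unchanged
    most_common = Counter(observed).most_common(1)[0][0]
    return [
        {**r, attr: r[attr] if r.get(attr) is not None else most_common}
        for r in records
    ]
```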

18 Decision Tree Induction Using P-trees
- Basic ideas:
  - Calculate information gain, gain ratio, or gini index using the count information recorded in P-trees
  - P-tree generation replaces sub-sample set creation
  - Use P-trees to determine whether all the samples are in the same class
  - No additional database scan is required

19 Using P-trees to Calculate Information Gain / Gain Ratio
- C: class label attribute
- P_S: P-tree of set S
- freq(C_j, S) = rc{ P_S ^ P_C(V_Cj) }
- |S| = rc{ P_S }
- |T_i| = rc{ P_T ^ P(V_Xi) }
- |T| = rc{ P_T }
- Here rc{ } denotes the root count of a P-tree and ^ denotes the P-tree AND operation
- So every formula for information gain and gain ratio can be calculated directly from P-tree root counts
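The slides give only the count identities. Purely as an illustration of how they might be used, here is a sketch that assumes a hypothetical Ptree class supporting a bitwise AND (&) and a root_count() method; that class and its methods are not from the slides.

```python
import math

def freq(p_s, p_class_value):
    """freq(C_j, S) = rc{ P_S ^ P_C(V_Cj) }: a count from an ANDed P-tree.

    p_s and p_class_value are assumed to be instances of a hypothetical
    Ptree class that implements __and__ and root_count().
    """
    return (p_s & p_class_value).root_count()

def info_from_ptrees(p_s, class_value_ptrees):
    """info(S) computed from root counts alone, with no database scan."""
    total = p_s.root_count()
    result = 0.0
    for p_cv in class_value_ptrees:   # one P-tree per class label value
        f = freq(p_s, p_cv)
        if f:
            result -= (f / total) * math.log2(f / total)
    return result
```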

20 P-Classifier versus ID3
[Figure: classification cost with respect to the dataset size]

