




1 Universidad de Buenos Aires, Maestría en Data Mining y Knowledge Discovery. Aprendizaje Automático 4: Inducción de árboles de decisión (1/2) [Machine Learning 4: Decision Tree Induction, part 1 of 2]. Eduardo Poggi (eduardopoggi@yahoo.com.ar), Ernesto Mislej (emislej@google.com). Autumn 2005.

2 Decision Trees: Definition; Mechanism; Splitting Functions; Hypothesis Space and Bias; Issues in Decision-Tree Learning; Avoiding overfitting through pruning; Numeric and Missing attributes.

3 Illustration. Example: learning to classify stars. [Figure: a decision tree over the attributes Luminosity and Mass; the Luminosity test splits on > T1 vs. <= T1, the Mass test splits on > T2 vs. <= T2, and the leaves are Type A, Type B, and Type C.]

4 Definition. A decision-tree learning algorithm approximates a target concept using a tree representation, where each internal node corresponds to an attribute and every terminal node corresponds to a class. There are two types of nodes: Internal node: splits into different branches according to the different values the corresponding attribute can take (example: luminosity <= T1 or luminosity > T1). Terminal node: decides the class assigned to the example.

5 Classifying Examples. [Figure: the star-classification tree again; an example X = (Luminosity <= T1, Mass > T2) is routed down the tree to its assigned class.]

6 Classifying Examples. To classify an example X we start at the root of the tree and check the value of the root's attribute on X. We follow the branch corresponding to that value and jump to the next node. We continue until we reach a terminal node and take that class as our best prediction. [Figure: for X = (Luminosity <= T1, Mass > T2), the path follows the <= T1 branch from Luminosity to Mass and the > T2 branch from Mass to the leaf Type B, so Class = B.]
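
A minimal Python sketch of this traversal (the Node class, the dictionary-style examples, and the concrete attribute names and thresholds are assumptions made for illustration; they are not part of the slides):

```python
# Minimal traversal sketch. The Node layout and the thresholds are illustrative assumptions.
class Node:
    def __init__(self, attribute=None, threshold=None, left=None, right=None, label=None):
        self.attribute = attribute  # attribute tested at an internal node
        self.threshold = threshold  # split point (T1, T2, ...)
        self.left = left            # branch followed when value <= threshold
        self.right = right          # branch followed when value > threshold
        self.label = label          # class stored at a terminal node

def classify(node, example):
    """Start at the root, follow the branch matching the example's value, repeat."""
    if node.label is not None:                       # terminal node reached
        return node.label
    value = example[node.attribute]
    child = node.left if value <= node.threshold else node.right
    return classify(child, example)

# A tree shaped like the star example (the concrete numbers are made up):
tree = Node("luminosity", threshold=10.0,
            left=Node("mass", threshold=5.0,
                      left=Node(label="Type C"),
                      right=Node(label="Type B")),
            right=Node(label="Type A"))

print(classify(tree, {"luminosity": 3.0, "mass": 7.0}))  # -> Type B
```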

7 Representation. Decision trees adopt a DNF (Disjunctive Normal Form) representation. For a fixed class, every branch from the root of the tree to a terminal node with that class is a conjunction of attribute values; different branches ending in that class form a disjunction. [Figure: a tree over binary attributes x1, x2, x3 with branches labeled 1 and 0: x1 = 0 leads to x2 (leaves A and B), x1 = 1 leads to x3 (leaves A and C).] For class A: (~x1 & ~x2) OR (x1 & ~x3).
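
Read as code, the class-A rule is just a Boolean expression; a small sketch following the x1/x2/x3 figure (the three test calls are illustrative):

```python
# Class A in DNF: each conjunct is one root-to-leaf branch of the tree that ends in A.
def is_class_a(x1: bool, x2: bool, x3: bool) -> bool:
    return (not x1 and not x2) or (x1 and not x3)

print(is_class_a(x1=False, x2=False, x3=True))   # True: branch (~x1 & ~x2)
print(is_class_a(x1=True,  x2=False, x3=False))  # True: branch (x1 & ~x3)
print(is_class_a(x1=True,  x2=True,  x3=True))   # False: this path ends in class C
```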

8 Appropriate Problems for Decision Trees:
- Attributes are both numeric and nominal.
- The target function takes on a discrete number of values.
- A DNF representation is effective in representing the target concept.
- Data may have errors.
- Some examples may have missing attribute values.

9 Decision Trees: Definition; Mechanism; Splitting Functions; Hypothesis Space and Bias; Issues in Decision-Tree Learning; Avoiding overfitting through pruning; Numeric and Missing attributes.

10 Mechanism. There are different ways to construct trees from data. We will concentrate on the top-down, greedy search approach. Basic idea:
1. Choose the best attribute a* to place at the root of the tree.
2. Separate the training set D into subsets {D1, D2, ..., Dk}, where each subset Di contains the examples having the same value for a*.
3. Recursively apply the algorithm to each new subset until all examples have the same class or there are few of them.
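
A small sketch of step 2 in Python (representing each example as a dictionary of attribute values is an assumption made for the example):

```python
from collections import defaultdict

def split_by_attribute(examples, attribute):
    """Partition the training set into subsets D1..Dk, one per value of `attribute`."""
    subsets = defaultdict(list)
    for example in examples:
        subsets[example[attribute]].append(example)
    return dict(subsets)

D = [{"size": "> t1", "class": "P"},
     {"size": "<= t1", "class": "NP"},
     {"size": "> t1", "class": "P"}]
print(split_by_attribute(D, "size"))
# {'> t1': [...], '<= t1': [...]}  -- two P examples and one NP example
```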

11 Illustration. Attributes: size and humidity. Size has two values: > t1 or <= t1. Humidity has three values: > t2, (> t3 and <= t2), or <= t3. [Figure: the training examples plotted by size and humidity, with thresholds t1, t2, and t3 marked; class P = poisonous, class N = not-poisonous.]

12 Illustration. Suppose we choose size as the best attribute: [Figure: the root node tests size, splitting on > t1 vs. <= t1; one branch is already pure (class P, poisonous), while the other still mixes poisonous and not-poisonous examples and is marked '?'.]

13 Illustration. Suppose we choose humidity as the next best attribute: [Figure: below the still-mixed size branch, a humidity node splits three ways, on > t2, on (> t3 and <= t2), and on <= t3; the resulting leaves are labeled P, NP, and NP.]

14 Formal Mechanism.
Create a root for the tree.
If all examples are of the same class, or the number of examples is below a threshold, return that class.
If no attributes are available, return the majority class.
Let a* be the best attribute.
For each possible value v of a*:
  Add a branch below a* labeled "a* = v".
  Let Sv be the subset of examples where attribute a* = v.
  Recursively apply the algorithm to Sv.
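
A compact sketch of this recursion (the nested-dictionary tree representation and the pluggable choose_best_attribute argument are assumptions; any scoring function, such as the information gain defined on the following slides, could be passed in):

```python
from collections import Counter

def majority_class(examples):
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def build_tree(examples, attributes, choose_best_attribute, threshold=1):
    classes = {e["class"] for e in examples}
    if len(classes) == 1 or len(examples) <= threshold:   # pure node or too few examples
        return majority_class(examples)
    if not attributes:                                     # no attributes left to test
        return majority_class(examples)
    a_star = choose_best_attribute(examples, attributes)   # "let a* be the best attribute"
    node = {a_star: {}}
    for v in sorted({e[a_star] for e in examples}):        # one branch per value "a* = v"
        S_v = [e for e in examples if e[a_star] == v]
        remaining = [a for a in attributes if a != a_star]
        node[a_star][v] = build_tree(S_v, remaining, choose_best_attribute, threshold)
    return node

# Trivial chooser just to exercise the sketch; a real one would maximize information gain.
toy = [{"size": "> t1", "class": "P"}, {"size": "<= t1", "class": "NP"},
       {"size": "> t1", "class": "P"}]
print(build_tree(toy, ["size"], lambda ex, attrs: attrs[0]))
# {'size': {'<= t1': 'NP', '> t1': 'P'}}
```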

15 Splitting Functions. Which attribute is the best one to split the data on? Let us recall some definitions from information theory. A measure of uncertainty, or entropy, associated with a random variable X is defined as H(X) = - Σ_i p_i log2(p_i), where the logarithm is in base 2. This is the "average amount of information or entropy of a finite complete probability scheme" (An Introduction to Information Theory, F. Reza).

16 Entropy (1). There are two possible events A and B forming a complete scheme (example: flipping a biased coin). If P(A) = 1/256 and P(B) = 255/256, then H(X) = 0.0369 bits. If P(A) = 1/2 and P(B) = 1/2, then H(X) = 1 bit. If P(A) = 7/16 and P(B) = 9/16, then H(X) = 0.989 bits.
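
A direct transcription of H(X) = - Σ p_i log2(p_i) that reproduces the three values above (a sketch; terms with zero probability are simply skipped):

```python
from math import log2

def entropy(probabilities):
    """H(X) = -sum(p * log2(p)); terms with p = 0 contribute nothing."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(round(entropy([1/256, 255/256]), 4))  # 0.0369
print(round(entropy([1/2, 1/2]), 4))        # 1.0
print(round(entropy([7/16, 9/16]), 3))      # 0.989
```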

17 Entropy (2). Entropy is a concave (downward) function. [Figure: the binary entropy curve H(p), rising from 0 at p = 0 to a maximum of 1 bit at p = 0.5 and falling back to 0 at p = 1.]

18 Splitting based on Entropy. Let's go back to our previous sample. Size divides the sample in two: S1 = {6P, 0NP} and S2 = {3P, 5NP}. H(S1) = 0; H(S2) = -(3/8) log2(3/8) - (5/8) log2(5/8). [Figure: the size threshold t1 partitions the size/humidity plot into the regions S1 and S2.]

19 Splitting based on Entropy. Let's go back to our previous sample. Humidity divides the sample in three: S1 = {2P, 2NP}, S2 = {5P, 0NP}, and S3 = {2P, 3NP}. H(S1) = 1; H(S2) = 0; H(S3) = -(2/5) log2(2/5) - (3/5) log2(3/5). [Figure: the humidity thresholds t2 and t3 partition the size/humidity plot into the regions S1, S2, and S3.]
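
A quick numeric check of the subset entropies quoted on the last two slides, using the same formula:

```python
from math import log2

def entropy(probabilities):
    return sum(-p * log2(p) for p in probabilities if p > 0)

# size split:     S1 = {6P, 0NP},              S2 = {3P, 5NP}
print(entropy([6/6]), round(entropy([3/8, 5/8]), 3))                       # 0.0 0.954
# humidity split: S1 = {2P, 2NP}, S2 = {5P, 0NP}, S3 = {2P, 3NP}
print(entropy([2/4, 2/4]), entropy([5/5]), round(entropy([2/5, 3/5]), 3))  # 1.0 0.0 0.971
```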

20 Information Gain. The information gain of an attribute A is IG(A) = H(S) - Σ_v (|Sv| / |S|) H(Sv). [Figure: the size/humidity plot of the sample again.]

21 Information Gain. IG(A) = H(S) - Σ_v (|Sv| / |S|) H(Sv), where H(S) is the entropy of all the examples and H(Sv) is the entropy of the subsample obtained by restricting S to the examples where attribute A takes value v; the sum ranges over all possible values of A.
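
A sketch of IG(A) as code; representing each subsample as a plain list of class labels is an assumption made for the example:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """IG(A) = H(S) - sum over values v of (|Sv| / |S|) * H(Sv)."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(len(sv) / n * entropy(sv) for sv in subsets)
```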

22 Components of IG(A). For the size split: H(S1) = 0; H(S2) = -(3/8) log2(3/8) - (5/8) log2(5/8); H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14); |S1|/|S| = 6/14; |S2|/|S| = 8/14. [Figure: the size split of the sample into S1 and S2 again.]
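
Putting the pieces together for the size split (9 P and 5 NP examples in total, S1 = {6P, 0NP}, S2 = {3P, 5NP}), a short check that follows the formulas on the previous slides:

```python
from math import log2

def H(probabilities):
    return sum(-p * log2(p) for p in probabilities if p > 0)

H_S  = H([9/14, 5/14])   # entropy of the whole sample (9 P, 5 NP)
H_S1 = H([6/6])          # pure subset {6P, 0NP}: entropy 0
H_S2 = H([3/8, 5/8])     # mixed subset {3P, 5NP}
ig_size = H_S - (6/14 * H_S1 + 8/14 * H_S2)   # IG(size) = H(S) - weighted subset entropies

print(round(H_S, 3), round(H_S2, 3), round(ig_size, 3))  # 0.94 0.954 0.395
```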

23 Decision Trees: Definition; Mechanism; Splitting Functions; Hypothesis Space and Bias; Issues in Decision-Tree Learning; Avoiding overfitting through pruning; Numeric and Missing attributes.

24 Hypothesis Space. We search over the hypothesis space of all possible decision trees. We keep only one hypothesis at a time, instead of maintaining several. We do no backtracking in the search: we choose the best alternative and continue growing the tree. We prefer shorter trees to larger trees. We prefer trees where the attributes that yield the lowest entropy (the highest information gain) are placed near the top.

25 Tareas (Homework). Read Chapter 3 of Mitchell up to Section 3.7 (exclusive).




