Decision Trees Binary output – easily extendible to multiple output classes. Takes a set of attributes for a given situation or object and outputs a yes/no.

Decision Trees
• Binary output, easily extended to multiple output classes.
• Takes a set of attributes describing a situation or object and outputs a yes/no answer.
• Typically, each internal node in the tree corresponds to a test on a single attribute. More complicated tests are possible, and some models split on more than a single attribute; the test need only split the data reaching the node in some way.
• The branches leaving a decision node are labeled with the possible outcomes of that node's test.
• Leaf nodes are where classification takes place. Each leaf is labeled with the Boolean value (yes/no) that should be returned if it is reached. A leaf could also be labeled with a probability.

Decision Tree Expressiveness
• Any Boolean function can be represented by a decision tree.
• In the worst case, the size of the tree needs to be exponential in the number of variables, i.e. a full representation of the truth table. Example: parity.
• Can we do better than that? Is there a representation for Boolean functions whose worst-case size is better than a decision tree's? Wouldn't it be great if there was?
• Each tree represents a disjunction of conjunctions of constraints on the attribute values of instances.

• The problem is that there are a whole lot of Boolean functions on n variables.
  o A truth table with n variables has 2^n rows, one row for each possible truth setting of the variables.
  o For each row, the output of the function can be either 0 or 1.
  o So we have 2^n bits, each of which can be set to either 0 or 1.
  o That means there are 2^(2^n) possible Boolean functions on n variables.
  o Telling that many functions apart takes 2^n bits in the worst case, whatever the representation.
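The counting argument above can be checked by brute force for small n. This is only an illustrative sketch; the function names are made up for this example.

```python
from itertools import product

def count_boolean_functions(n):
    """Number of distinct Boolean functions on n variables: 2^(2^n)."""
    rows = 2 ** n       # one truth-table row per assignment of the n variables
    return 2 ** rows    # each row's output is independently 0 or 1

def enumerate_boolean_functions(n):
    """Yield every Boolean function on n variables as a truth table (dict)."""
    rows = list(product([0, 1], repeat=n))
    for outputs in product([0, 1], repeat=len(rows)):
        yield dict(zip(rows, outputs))

for n in range(1, 5):
    print(n, count_boolean_functions(n))
# 1 4
# 2 16
# 3 256
# 4 65536
```

Already at n = 6 there are 2^64 functions, so enumerating them is only practical for very small n.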

ID3/C4.5
• Top-down induction of decision trees
• Non-incremental (there are extensions)
• Highly used and successful
• Attribute features are discrete; the output is also discrete
• Searching for the smallest tree is too complex
• Use a greedy iterative approach instead

ID3 Learning Approach
• C is a set of examples
• A test on attribute A partitions C into {C_1, C_2, ..., C_w}, where w is the number of states of A
• First find a good A for the root: the attribute which is "most important"
• Continue recursively until the training set is unambiguously classified or there are no more "relevant" features
  - C4.5 actually expands the tree and then prunes back statistically irrelevant partitions afterwards

Choosing variables to split on: what should our bias be?
• Bias: Ockham's razor, i.e. find the smallest possible tree
• Roughly, we could accomplish this by trying to minimize the depth of the tree
• How? There are many different ways. How about picking the attribute that maximizes classification accuracy at that step (greedy approach)? If the first attribute classifies everything correctly, we're done (depth = 1)!

Information Theory
• How much information do you need to be given in order to answer a yes/no question?
  - 1 bit.
• How much information do you need to answer a yes/no question if you know that the yes answer has probability 1?
  - 0 bits. You already know the answer.
• So a yes/no question where each answer has probability 0.5 requires 1 bit of information to answer.
• And a yes/no question where one answer is 100% probable requires 0 bits.
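The two cases above are the endpoints of Shannon entropy, H = sum over answers of p * log2(1/p). The slide does not state the formula; the sketch below assumes it is the measure intended, and the function name is illustrative.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; answers with probability 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 -> a fair yes/no question needs 1 bit
print(entropy([1.0, 0.0]))    # 0.0 -> the answer is already known
print(entropy([0.25, 0.75]))  # in between: a lopsided question needs less than 1 bit
```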

ID3 Learning Algorithm
1. S = training set
2. Calculate the gain for each remaining attribute
3. Select the attribute with the highest gain and create a new node for each partition
4. For each partition:
   - if it contains one class, then end
   - else if it contains more than one class, go to 2 with the remaining attributes
   - else if it is empty, label it with the most common class of the parent (or set it to null)
5. If the attributes are exhausted (this will only happen for an inconsistent S), label with the majority class
• Attributes which best discriminate between classes are chosen
• If the same class ratios are found in the partitioned sets as in the parent set, then the gain is 0
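The steps above can be sketched as follows. This is a minimal illustration, not the original ID3 code: the tree encoding (a tuple of attribute and branches), the function names, and the toy data are all assumptions made for this example. Branches are created only for attribute values that actually occur, so the empty-partition case never arises here.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Entropy of the set minus the weighted entropy of its partitions."""
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([e[target] for e in examples]) - remainder

def id3(examples, attributes, target):
    """Return a class label (leaf) or a tuple (attribute, {value: subtree})."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # step 4: one class, stop
        return labels[0]
    if not attributes:                 # step 5: attributes exhausted
        return Counter(labels).most_common(1)[0][0]
    # steps 2-3: pick the attribute with the highest gain
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        branches[value] = id3(subset, remaining, target)
    return (best, branches)

# Made-up toy data: the class depends only on "outlook".
data = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
print(information_gain(data, "outlook", "play"))  # 1.0
print(information_gain(data, "windy", "play"))    # 0.0 (same class ratios in each partition)
print(id3(data, ["outlook", "windy"], "play"))
```

On this data "windy" splits the examples into partitions with the same class ratios as the parent, so its gain is 0 and "outlook" becomes the root.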

Over-fitting
• Definition? Hypothesis h1 over-fits if h1 fits the training data better than h2, but h1 performs worse than h2 on new data
• A general problem (see p67)!
• What causes it? How do we avoid it?

ID3 Noise Handling: Avoiding Over-training
• Thresholds: could allow only attributes whose information gain exceeds some threshold, in order to sift out noise. Empirically, however, this tends to disallow relevant attribute tests. Other options?
• Use a statistical test (such as chi-square) to decide how confident we are that an attribute is irrelevant. Best ID3 results. (Takes the amount of data into account, which the threshold approach above does not.)
• Post-pruning
• Label a node with either the most common class, or with the probability of the most common class (good for learning a distribution vs. a function)
• Early stopping
• Use a separate set of examples (holdout set).

ID3 Noise Handling Mechanisms (cont.)
• Rule post-pruning
  - Convert the tree to rules (one rule for each path from the root to a leaf)
  - Generalize each rule by considering the removal of each of its preconditions
  - Sort the rules according to estimated accuracy, and consider them in this order
• Advantages of pruning rules (p72)?
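A rough sketch of the first two rule post-pruning steps follows. The tree encoding, the function names, and the toy data are hypothetical. Accuracy is estimated on a holdout set here, which is one possible choice and an assumption of this example rather than something the slide specifies.

```python
def tree_to_rules(tree, conditions=()):
    """One rule per root-to-leaf path: (list of (attribute, value) tests, label)."""
    if not isinstance(tree, tuple):        # leaf: the path so far is a rule
        return [(list(conditions), tree)]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

def rule_accuracy(conditions, label, examples, target):
    """Fraction of the examples matched by the rule that have the rule's label."""
    covered = [e for e in examples if all(e.get(a) == v for a, v in conditions)]
    if not covered:
        return 0.0
    return sum(e[target] == label for e in covered) / len(covered)

def prune_rule(conditions, label, examples, target):
    """Greedily drop preconditions as long as estimated accuracy does not fall."""
    conditions = list(conditions)
    changed = True
    while changed and conditions:
        changed = False
        current = rule_accuracy(conditions, label, examples, target)
        for condition in conditions:
            trial = [c for c in conditions if c != condition]
            if rule_accuracy(trial, label, examples, target) >= current:
                conditions, changed = trial, True
                break
    return conditions

# Hypothetical tree and holdout set, for illustration only.
tree = ("outlook", {"sunny": ("windy", {"yes": "no", "no": "yes"}), "rainy": "no"})
holdout = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "no"},
]
rules = tree_to_rules(tree)
first_conditions, first_label = rules[0]
print(prune_rule(first_conditions, first_label, holdout, "play"))
# [('windy', 'yes')]
```

Here the rule "outlook = sunny and windy = yes, then no" loses its outlook test, because the shorter rule is just as accurate on the holdout examples.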

ID3 - Missing Attribute Values - Learning
• Throw out data with missing attributes: but missing values are too common, the data could be important, and the learner is then not prepared to generalize with missing attributes
• Set the attribute to its most probable value
• Set the attribute to its most probable value given the example's class: similar performance
• Use a learning scheme (ID3, etc.) to fill in the attribute value, where the training set (TS) is made up of the complete examples and the original output class is just another attribute. Better, but not always empirically convincing
• Let "unknown" be just another attribute value: for ID3 this has the anomaly of apparently higher gain due to more attribute values; can be fixed with gain ratio
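The second and third options can be sketched as below. The function name, the use of None to mark a missing value, and the toy data are assumptions of this example.

```python
from collections import Counter

def fill_missing(examples, attribute, target, by_class=True):
    """Replace None in `attribute` with the most common known value,
    optionally restricted to examples of the same class."""
    known = [e for e in examples if e.get(attribute) is not None]
    overall = Counter(e[attribute] for e in known).most_common(1)[0][0]
    filled = []
    for e in examples:
        if e.get(attribute) is None:
            pool = [k[attribute] for k in known if k[target] == e[target]] if by_class else []
            value = Counter(pool).most_common(1)[0][0] if pool else overall
            e = {**e, attribute: value}    # copy, so the input is left untouched
        filled.append(e)
    return filled

# Made-up data: "sunny" is most common overall, "rainy" is most common for class "no".
data = [
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "rainy", "play": "no"},
    {"outlook": "rainy", "play": "no"},
    {"outlook": None,    "play": "no"},
]
print(fill_missing(data, "outlook", "play", by_class=False)[-1]["outlook"])  # sunny
print(fill_missing(data, "outlook", "play", by_class=True)[-1]["outlook"])   # rainy
```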

ID3 - Missing Attribute Values - Execution
• When arriving, during execution, at an attribute test for which the attribute value is missing:
• Each branch has a probability of being taken, based on what percentage of TS examples went down each branch
• Take all branches, but carry a weight representing the probability. Weights can be further modified (multiplied) by other missing attributes in the current test example as it continues down the tree.
• This results in multiple active leaf nodes. Set the output to the leaf with the highest weight, or sum the weights for each output class and output the class with the largest sum
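A sketch of the summed-weights variant follows. The node layout is hypothetical: each internal node is a dict that stores, per branch, the fraction of training examples that took it together with the subtree. The fractions in the example tree are made up.

```python
from collections import defaultdict

def class_weights(node, example, weight=1.0, totals=None):
    """Accumulate, per class, the weight reaching each active leaf."""
    if totals is None:
        totals = defaultdict(float)
    if not isinstance(node, dict):              # leaf: add the carried weight
        totals[node] += weight
        return totals
    value = example.get(node["attribute"])
    if value is None:
        # Missing value: follow every branch, scaling the weight by the
        # fraction of training examples that went down that branch.
        for fraction, child in node["branches"].values():
            class_weights(child, example, weight * fraction, totals)
    else:
        _, child = node["branches"][value]
        class_weights(child, example, weight, totals)
    return totals

def classify(node, example):
    """Output the class with the largest summed weight."""
    totals = class_weights(node, example)
    return max(totals, key=totals.get)

tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": (0.6, "yes"),
        "rainy": (0.4, {
            "attribute": "windy",
            "branches": {"yes": (0.75, "no"), "no": (0.25, "yes")},
        }),
    },
}
print(classify(tree, {}))                                     # both attributes missing
print(classify(tree, {"outlook": "rainy"}))                   # only windy missing
print(classify(tree, {"outlook": "rainy", "windy": "no"}))    # nothing missing
```

With both attributes missing, "yes" collects weight 0.6 + 0.4 * 0.25 and "no" collects 0.4 * 0.75, so the weights multiply along the path exactly as the slide describes.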

ID3 - Other stuff
• Problem with information gain: it favors attributes with lots of possible values. Remedy: gain ratio (p73).
• Other problems
• Handling real-valued attributes?
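To make the bias concrete, the sketch below compares plain gain with gain ratio, taken here as gain divided by the entropy of the attribute's own value distribution (the usual "split information" definition, assumed rather than quoted from the slide). The data is made up: "id" is unique per example, so it has many values and maximal gain.

```python
import math
from collections import Counter

def entropy(values):
    """Entropy in bits of a list of discrete values."""
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in Counter(values).values())

def information_gain(examples, attribute, target):
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([e[target] for e in examples]) - remainder

def gain_ratio(examples, attribute, target):
    """Gain normalized by split information; 0 if the attribute has one value."""
    split_info = entropy([e[attribute] for e in examples])
    if split_info == 0:
        return 0.0
    return information_gain(examples, attribute, target) / split_info

data = [
    {"id": 1, "outlook": "sunny", "play": "yes"},
    {"id": 2, "outlook": "sunny", "play": "yes"},
    {"id": 3, "outlook": "rainy", "play": "no"},
    {"id": 4, "outlook": "rainy", "play": "no"},
]
print(information_gain(data, "id", "play"), information_gain(data, "outlook", "play"))  # 1.0 1.0
print(gain_ratio(data, "id", "play"), gain_ratio(data, "outlook", "play"))              # 0.5 1.0
```

Plain gain cannot tell the two attributes apart, while gain ratio penalizes the four-way split on "id".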

ID3 - Conclusions
• Good empirical results
• Application robustness and accuracy comparable to neural networks, with faster learning (though NNs are better with continuous values, both input and output)
• Most used and best known of current systems; used widely to aid in creating rules for expert systems
