1
Induction of Decision Trees
Blaž Zupan and Ivan Bratko magix.fri.uni-lj.si/predavanja/uisp
2
An example application
An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients. A decision is needed: whether to put a new patient in an intensive-care unit. Due to the high cost of the ICU, patients who may survive less than a month are given higher priority. Problem: to predict high-risk patients and discriminate them from low-risk patients.
3
Another application
A credit card company receives thousands of applications for new cards. Each application contains information about an applicant: age, marital status, annual salary, outstanding debts, credit rating, etc. Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.
4
Machine learning and our focus
Like human learning from past experiences. A computer does not have “experiences”. A computer system learns from data, which represent some “past experiences” of an application domain. Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not-approved, and high-risk or low-risk. The task is commonly called supervised learning, classification, or inductive learning.
5
The data and the goal
Data: a set of data records (also called examples, instances, or cases) described by k attributes: A1, A2, …, Ak, and a class: each example is labelled with a pre-defined class.
Goal: to learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
Induction is different from deduction, and a DBMS does not support induction; the result of induction is higher-level information or knowledge: general statements about the data.
There are many approaches (see the lecture notes for CS3244, available at the Co-Op). We focus on three approaches here; other approaches include instance-based learning, neural networks, concept learning (version space, Focus, AQ11, …), genetic algorithms, and reinforcement learning.
6
An example: data (loan application)
[Table: loan application records described by their attributes, with a class column indicating approved or not.]
7
An example: the learning task
Learn a classification model from the data. Use the model to classify future loan applications into Yes (approved) and No (not approved). What is the class for the following case/instance?
8
Supervised vs. Unsupervised Learning
Supervised learning: classification is seen as supervised learning from examples. Supervision: the data (observations, measurements, etc.) are labeled with pre-defined classes; it is as if a “teacher” gives the classes (supervision). Test data are classified into these classes too. Unsupervised learning (clustering): class labels of the data are unknown; given a set of data, the task is to establish the existence of classes or clusters in the data.
9
Supervised learning process: two steps
Learning (training): learn a model using the training data. Testing: test the model using unseen test data to assess the model accuracy.
10
What do we mean by learning?
Given a data set D, a task T, and a performance measure M, a computer system is said to learn from D to perform the task T if after learning the system’s performance on T improves as measured by M. In other words, the learned model helps the system to perform T better as compared to no learning.
11
An example Data: Loan application data
Task: Predict whether a loan should be approved or not. Performance measure: accuracy. No learning: classify all future applications (test data) to the majority class (i.e., Yes): Accuracy = 9/15 = 60%. We can do better than 60% with learning.
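A tiny Python sketch of this no-learning baseline; the label list below is a stand-in for the class column of the 15 loan records, 9 of which are approved:

```python
from collections import Counter

labels = ["Yes"] * 9 + ["No"] * 6                # class column of the 15 training records
majority = Counter(labels).most_common(1)[0][0]  # the majority class ("Yes")
accuracy = labels.count(majority) / len(labels)  # fraction the baseline gets right
print(majority, accuracy)                        # Yes 0.6
```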
12
Fundamental assumption of learning
Assumption: The distribution of training examples is identical to the distribution of test examples (including future unseen examples). In practice, this assumption is often violated to a certain degree. Strong violations will clearly result in poor classification accuracy. To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.
13
Analysis of Severe Trauma Patients Data
[Figure: decision tree. The root tests PH_ICU (the worst pH value at the ICU): < 7.2 leads to Death 0.0 (0/15), > 7.33 leads to Well 0.88 (14/16), and the intermediate branch tests APPT_WORST (the worst activated partial thromboplastin time): < 78.7 leads to Well 0.82 (9/11), >= 78.7 leads to Death 0.0 (0/7).]
PH_ICU and APPT_WORST are exactly the two factors (theoretically) advocated as the most important ones in the study by Rotondo et al., 1997.
14
Breast Cancer Recurrence
[Figure: decision tree induced by Assistant Professional. The root splits on Degree of Malignancy (< 3 vs. >= 3); further splits use Tumor Size (< 15 vs. >= 15), Involved Nodes (< 3 vs. >= 3), and Age; each leaf gives the number of no-recurrence vs. recurrence cases (e.g., no rec 125 / recurr 39).]
Interesting: the accuracy of this tree compared to that of medical specialists.
15
Prostate cancer recurrence
[Figure: decision tree predicting recurrence (Yes/No) from Secondary Gleason Grade (1,2 / 3 / 4 / 5), PSA Level (threshold 14.9), Stage (T1c, T2a, T2b, T2c vs. T1ab, T3), and Primary Gleason Grade (2,3 / ...).]
16
Introduction Decision tree learning is one of the most widely used techniques for classification. Its classification accuracy is competitive with other methods, and it is very efficient. The classification model is a tree, called a decision tree. C4.5 by Ross Quinlan is perhaps the best-known system. It can be downloaded from the Web.
17
The loan data (reproduced)
[Table: the loan application data reproduced, with the class column indicating approved or not.]
18
A decision tree from the loan data
[Figure: a decision tree learned from the loan data, with decision nodes (attribute tests) and leaf nodes (classes).]
19
Use the decision tree
[Figure: the test case from the previous slide is routed down the tree and classified as No.]
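A minimal sketch of how such a tree is applied to a new case, assuming the tree is stored as nested dicts of the form {attribute: {value: subtree-or-class}}; the attribute names Own_house and Has_job below are illustrative placeholders, not necessarily the nodes of the tree in the figure:

```python
def classify(tree, case):
    """Follow decision nodes until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))             # attribute tested at this node
        tree = tree[attribute][case[attribute]]  # follow the branch for the case's value
    return tree

# Hypothetical tree for illustration only:
loan_tree = {"Own_house": {"yes": "Yes",
                           "no": {"Has_job": {"yes": "Yes", "no": "No"}}}}
print(classify(loan_tree, {"Own_house": "no", "Has_job": "no"}))  # -> No
```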
20
Is the decision tree unique?
No. Here is a simpler tree. We want a tree that is small and accurate: it is easier to understand and usually performs better. Finding the best tree is NP-hard. All current tree-building algorithms are heuristic.
21
An Example Data Set and Decision Tree
[Figure: an example data set and the decision tree induced from it, with attributes outlook (sunny/rainy), company (big/med/small), and sailboat (big/small), and classes yes/no.]
22
Classification
[Figure: a new case is classified by following the tree: outlook is tested first, then company (big/med), then sailboat (big/small), ending in a yes or no leaf.]
23
Induction of Decision Trees
Data set (learning set): each example = attributes + class.
Induced description = decision tree.
TDIDT: Top-Down Induction of Decision Trees (recursive partitioning).
24
Some TDIDT Systems
ID3 (Quinlan 79), CART (Breiman et al. 84), Assistant (Cestnik et al. 87), C4.5 (Quinlan 93), See5 (Quinlan 97), ..., Orange (Demšar, Zupan 98-03)
25
TDIDT Algorithm Also known as ID3 (Quinlan)
To construct decision tree T from learning set S:
If all examples in S belong to some class C, then make a leaf labeled C.
Otherwise: select the “most informative” attribute A, partition S according to A’s values, and recursively construct subtrees T1, T2, ... for the subsets of S.
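A minimal Python sketch of this recursive procedure; the data layout (a list of (attribute-dict, class) pairs) and the choose_attribute parameter are assumptions, and any of the selection measures discussed later can be plugged in as choose_attribute:

```python
from collections import Counter

def tdidt(examples, attributes, choose_attribute):
    """Grow a decision tree from (attribute_dict, class_label) pairs."""
    classes = [c for _, c in examples]
    if len(set(classes)) == 1:                    # all examples in S belong to one class C
        return classes[0]                         # -> leaf labeled C
    if not attributes:                            # no attribute left to test
        return Counter(classes).most_common(1)[0][0]   # majority-class leaf
    a = choose_attribute(examples, attributes)    # the "most informative" attribute
    tree = {a: {}}
    for v in {x[a] for x, _ in examples}:         # partition S according to A's values
        subset = [(x, c) for x, c in examples if x[a] == v]
        remaining = [b for b in attributes if b != a]
        tree[a][v] = tdidt(subset, remaining, choose_attribute)   # recurse on the subsets
    return tree
```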
26
Hunt’s Algorithm
[Figure: the tree is grown step by step on the tax-cheating example: first a single leaf (Don’t Cheat); then a split on Refund (Yes leads to Don’t Cheat); then Marital Status (Married leads to Don’t Cheat, Single/Divorced is split further); finally Taxable Income (< 80K vs. >= 80K).]
27
TDIDT Algorithm
The resulting tree T is: [Figure: attribute A at the root, with one branch for each of A’s values v1, v2, ..., vn, each leading to one of the subtrees T1, T2, ..., Tn.]
28
Another Example
29
Simple Tree
[Figure: Outlook at the root: sunny leads to Humidity (high: N, normal: P), overcast leads to P, rainy leads to Windy (yes: N, no: P).]
30
Complicated Tree
[Figure: a much larger tree over the same data, with Temperature (hot/cold/moderate) at the root and repeated tests on Outlook, Windy, and Humidity before reaching the P and N leaves.]
31
Attribute Selection Criteria
Main principle: select the attribute which partitions the learning set into subsets that are as “pure” as possible.
Various measures of purity: information-theoretic, Gini index, χ², ReliefF, ...
Various improvements: probability estimates, normalization, binarization, subsetting.
32
Information-Theoretic Approach
To classify an object, a certain amount of information is needed: I, the information.
After we have learned the value of attribute A, we only need some remaining amount of information to classify the object: Ires, the residual information.
Gain: Gain(A) = I – Ires(A).
The most ‘informative’ attribute is the one that minimizes Ires, i.e., maximizes the gain.
33
Entropy
The average amount of information I needed to classify an object is given by the entropy measure: I = – Σc p(c) log2 p(c).
[Figure: for a two-class problem, entropy plotted as a function of p(c1); it is 0 when p(c1) is 0 or 1 and maximal (1 bit) at p(c1) = 0.5.]
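A small Python sketch of these quantities, I, Ires(A), and Gain(A), assuming examples are (attribute-dict, class-label) pairs as in the tdidt sketch above; the function names are my own:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """I = -sum over classes c of p(c) * log2 p(c), in bits."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def residual_information(examples, attribute):
    """I_res(A): weighted sum of the entropies of the subsets induced by A's values."""
    n = len(examples)
    total = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset_labels = [c for x, c in examples if x[attribute] == v]
        total += (len(subset_labels) / n) * entropy(subset_labels)
    return total

def gain(examples, attribute):
    """Gain(A) = I - I_res(A)."""
    return entropy([c for _, c in examples]) - residual_information(examples, attribute)
```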
34
Residual Information
After applying attribute A, S is partitioned into subsets according to the values v of A.
Ires is the weighted sum of the amounts of information for the subsets: Ires(A) = – Σv p(v) Σc p(c|v) log2 p(c|v).
35
Triangles and Squares
36
A set of classified objects
Triangles and Squares Data Set: a set of classified objects. [Figure: the objects, each either a triangle or a square.]
37
Entropy
5 triangles and 9 squares.
Class probabilities: p(triangle) = 5/14, p(square) = 9/14.
Entropy: I = – (5/14) log2(5/14) – (9/14) log2(9/14) ≈ 0.940 bits.
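A quick check of this number with the entropy helper sketched earlier:

```python
labels = ["triangle"] * 5 + ["square"] * 9
print(round(entropy(labels), 3))   # 0.94 bits
```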
38
Entropy reduction by data set partitioning
[Figure: the set of objects is partitioned by Color? into red, yellow, and green subsets.]
39
Entropy of the attribute’s values
[Figure: for each value of Color? (red, yellow, green), the entropy of the corresponding subset is computed; their sum, weighted by subset size, gives Ires(Color).]
40
Information Gain
[Figure: Gain(Color) = I – Ires(Color), computed from the entropy of the whole set and the weighted entropies of the red, yellow, and green subsets.]
41
Information Gain of The Attribute
Attributes: Gain(Color) = 0.246, Gain(Outline) = 0.151, Gain(Dot) = 0.048.
Heuristic: the attribute with the highest gain is chosen (as sketched below).
This heuristic is local (local minimization of impurity).
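This selection heuristic is exactly what the tdidt sketch above expects as its choose_attribute argument; a minimal version reusing the gain helper:

```python
def highest_gain_attribute(examples, attributes):
    """Pick the attribute with the largest information gain."""
    return max(attributes, key=lambda a: gain(examples, a))
```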
42
Gain(Outline) = 0.971 – 0 = 0.971 bits
[Figure: the data split by Color? (red, yellow, green). Within one of the impure subsets: Gain(Outline) = 0.971 – 0 = 0.971 bits and Gain(Dot) = 0.971 – 0.951 = 0.020 bits, so Outline gives the better split there.]
43
Gain(Outline) = 0.971 – 0.951 = 0.020 bits
[Figure: within the other impure Color subset: Gain(Outline) = 0.971 – 0.951 = 0.020 bits and Gain(Dot) = 0.971 – 0 = 0.971 bits, so Dot gives the better split there; the first subset has already been split on Outline? (solid/dashed).]
44
[Figure: the partially grown tree: Color? at the root; the green branch is already pure, while the other two branches are further split on Dot? (yes/no) and Outline? (solid/dashed).]
45
Decision Tree
[Figure: the final decision tree. Color at the root; the green branch is a pure square leaf, and the other two branches are split on Dot (yes/no) and Outline (solid/dashed), each ending in triangle and square leaves.]
46
A Defect of Ires
Ires favors attributes with many values.
Such an attribute splits S into many subsets, and if these are small, they will tend to be pure anyway.
One way to rectify this is through a corrected measure, the information gain ratio.
47
Information Gain Ratio
I(A) is the amount of information needed to determine the value of attribute A: I(A) = – Σv p(v) log2 p(v).
Information gain ratio: GainRatio(A) = Gain(A) / I(A).
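A sketch of the corrected measure, again reusing the entropy and gain helpers above; the guard against a zero denominator is my own addition:

```python
def split_information(examples, attribute):
    """I(A): entropy of the distribution of A's values."""
    return entropy([x[attribute] for x, _ in examples])

def gain_ratio(examples, attribute):
    """GainRatio(A) = Gain(A) / I(A)."""
    i_a = split_information(examples, attribute)
    return gain(examples, attribute) / i_a if i_a > 0 else 0.0
```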
48
Information Gain Ratio
[Figure: computation of the information gain ratio for the attribute Color (red, yellow, green) on the triangles-and-squares data.]
49
Information Gain and Information Gain Ratio
50
Gini Index
Another sensible measure of impurity (i and j are classes): Gini = Σ(i ≠ j) p(i) p(j) = 1 – Σi p(i)².
After applying attribute A, the resulting Gini index is the weighted sum over A’s values: Gini(A) = Σv p(v) Gini(Sv).
Gini can be interpreted as the expected error rate.
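A corresponding sketch for the Gini criterion, with the same assumed data layout as the earlier helpers:

```python
from collections import Counter

def gini(labels):
    """Gini = 1 - sum over classes i of p(i)^2."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def gini_after_split(examples, attribute):
    """Gini(A): weighted Gini index of the subsets induced by A's values."""
    n = len(examples)
    total = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset_labels = [c for x, c in examples if x[attribute] == v]
        total += (len(subset_labels) / n) * gini(subset_labels)
    return total
```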
51
Gini Index
[Figure: the Gini index of the whole triangles-and-squares set: Gini = 1 – (5/14)² – (9/14)² ≈ 0.459.]
52
Gini Index for Color
[Figure: the Gini index after splitting on Color? (red, yellow, green), computed as the weighted sum of the subsets’ Gini indices.]
53
Gain of Gini Index
54
Three Impurity Measures
These impurity measures assess the effect of a single attribute.
The “most informative” criterion they define is therefore local (and “myopic”): it does not reliably predict the effect of several attributes applied jointly.