Induction of Decision Trees

Slides:

Advertisements

Similar presentations

Decision Tree Approach in Data Mining

Advertisements

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Part I Introduction to Data Mining by Tan,

Classification: Definition Given a collection of records (training set ) –Each record contains a set of attributes, one of the attributes is the class.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Classification: Definition l Given a collection of records (training set) l Find a model.

1 Data Mining Classification Techniques: Decision Trees (BUSINESS INTELLIGENCE) Slides prepared by Elizabeth Anglo, DISCS ADMU.

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach,

Classification Techniques: Decision Tree Learning

Lecture Notes for Chapter 4 Introduction to Data Mining

Classification: Decision Trees, and Naïve Bayes etc. March 17, 2010 Adapted from Chapters 4 and 5 of the book Introduction to Data Mining by Tan, Steinbach,

Decision Trees Instructor: Qiang Yang Hong Kong University of Science and Technology Thanks: Eibe Frank and Jiawei Han.

Lecture outline Classification Decision-tree classification.

CSci 8980: Data Mining (Fall 2002)

Induction of Decision Trees Blaž Zupan, Ivan Bratko magix.fri.uni-lj.si/predavanja/uisp.

Induction of Decision Trees

Decision Trees and Decision Tree Learning Philipp Kärger

1 Classification with Decision Trees I Instructor: Qiang Yang Hong Kong University of Science and Technology Thanks: Eibe Frank and Jiawei.

Data Mining with Decision Trees Lutz Hamel Dept. of Computer Science and Statistics University of Rhode Island.

Lecture 5 (Classification with Decision Trees)

Decision Trees an Introduction.

Example of a Decision Tree categorical continuous class Splitting Attributes Refund Yes No NO MarSt Single, Divorced Married TaxInc NO < 80K > 80K.

Classification.

Copyright © 2004 by Jinyan Li and Limsoon Wong Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong.

Learning Chapter 18 and Parts of Chapter 20

Induction of Decision Trees Blaž Zupan and Ivan Bratko magix.fri.uni-lj.si/predavanja/uisp.

1 Data Mining Lecture 3: Decision Trees. 2 Classification: Definition l Given a collection of records (training set ) –Each record contains a set of attributes,

Induction of Decision Trees. An Example Data Set and Decision Tree yes no yesno sunnyrainy no med yes smallbig outlook company sailboat.

Chapter 4 Classification. 2 Classification: Definition Given a collection of records (training set ) –Each record contains a set of attributes, one of.

Classification. 2 Classification: Definition  Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes.

Lecture 7. Outline 1. Overview of Classification and Decision Tree 2. Algorithm to build Decision Tree 3. Formula to measure information 4. Weka, data.

Modul 6: Classification. 2 Classification: Definition  Given a collection of records (training set ) Each record contains a set of attributes, one of.

1 Learning Chapter 18 and Parts of Chapter 20 AI systems are complex and may have many parameters. It is impractical and often impossible to encode all.

D ECISION T REES Jianping Fan Dept of Computer Science UNC-Charlotte.

Supervised Learning. CS583, Bing Liu, UIC 2 An example application An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc)

Lecture Notes for Chapter 4 Introduction to Data Mining

1 Illustration of the Classification Task: Learning Algorithm Model.

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining By Tan, Steinbach,

Introduction to Data Mining Clustering & Classification Reference: Tan et al: Introduction to data mining. Some slides are adopted from Tan et al.

Supervise Learning. 2 What is learning? “Learning denotes changes in a system that... enable a system to do the same task more efficiently the next time.”

Review of Decision Tree Learning Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.

Induction of Decision Trees Blaž Zupan and Ivan Bratko magix.fri.uni-lj.si/predavanja/uisp.

Copyright © 2004 by Jinyan Li and Limsoon Wong Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong.

Induction of Decision Trees

DECISION TREES An internal node represents a test on an attribute.

Decision Trees an introduction.

Classification Algorithms

Prepared by: Mahmoud Rafeek Al-Farra

Artificial Intelligence

Chapter 3: Supervised Learning

Data Science Algorithms: The Basic Methods

Dipartimento di Ingegneria «Enzo Ferrari»,

Cost-Sensitive Learning

Data Mining Classification: Basic Concepts and Techniques

Classification and Prediction

Advanced Artificial Intelligence

Jianping Fan Dept of Computer Science UNC-Charlotte

Cost-Sensitive Learning

Basic Concepts and Decision Trees

Data Mining – Chapter 3 Classification

Supervised vs. unsupervised Learning

Induction of Decision Trees and ID3

Induction of Decision Trees

Statistical Learning Dong Liu Dept. EEIS, USTC.

Learning Chapter 18 and Parts of Chapter 20

©Jiawei Han and Micheline Kamber

A task of induction to find patterns

Decision Trees Jeff Storey.

Data Mining CSCI 307, Spring 2019 Lecture 15

A task of induction to find patterns

COP5577: Principles of Data Mining Fall 2008 Lecture 4 Dr

Presentation transcript:

Induction of Decision Trees Blaž Zupan and Ivan Bratko magix.fri.uni-lj.si/predavanja/uisp

An example application An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether to put a new patient in an intensive-care unit. Due to the high cost of ICU, those patients who may survive less than a month are given higher priority. Problem: to predict high-risk patients and discriminate them from low-risk patients. CS583, Bing Liu, UIC

Another application A credit card company receives thousands of applications for new cards. Each application contains information about an applicant, age Marital status annual salary outstanding debts credit rating etc. Problem: to decide whether an application should approved, or to classify applications into two categories, approved and not approved. CS583, Bing Liu, UIC

Machine learning and our focus Like human learning from past experiences. A computer does not have “experiences”. A computer system learns from data, which represent some “past experiences” of an application domain. Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approve or not-approved, and high-risk or low risk. The task is commonly called: Supervised learning, classification, or inductive learning. CS583, Bing Liu, UIC

The data and the goal Data: A set of data records (also called examples, instances or cases) described by k attributes: A1, A2, … Ak. a class: Each example is labelled with a pre-defined class. Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances. Induction is different from deduction and DBMS does not not support induction; The result of induction is higher-level information or knowledge: general statements about data There are many approaches. Refer to the lecture notes for CS3244 available at the Co-Op. We focus on three approaches here, other examples: Other approaches Instance-based learning other neural networks Concept learning (Version space, Focus, Aq11, …) Genetic algorithms Reinforcement learning CS583, Bing Liu, UIC

An example: data (loan application) Approved or not CS583, Bing Liu, UIC

An example: the learning task Learn a classification model from the data Use the model to classify future loan applications into Yes (approved) and No (not approved) What is the class for following case/instance? CS583, Bing Liu, UIC

Supervised vs. unsupervised Learning Supervised learning: classification is seen as supervised learning from examples. Supervision: The data (observations, measurements, etc.) are labeled with pre-defined classes. It is like that a “teacher” gives the classes (supervision). Test data are classified into these classes too. Unsupervised learning (clustering) Class labels of the data are unknown Given a set of data, the task is to establish the existence of classes or clusters in the data CS583, Bing Liu, UIC

Supervised learning process: two steps Learning (training): Learn a model using the training data Testing: Test the model using unseen test data to assess the model accuracy CS583, Bing Liu, UIC

What do we mean by learning? Given a data set D, a task T, and a performance measure M, a computer system is said to learn from D to perform the task T if after learning the system’s performance on T improves as measured by M. In other words, the learned model helps the system to perform T better as compared to no learning. CS583, Bing Liu, UIC

An example Data: Loan application data Task: Predict whether a loan should be approved or not. Performance measure: accuracy. No learning: classify all future applications (test data) to the majority class (i.e., Yes): Accuracy = 9/15 = 60%. We can do better than 60% with learning. CS583, Bing Liu, UIC

Fundamental assumption of learning Assumption: The distribution of training examples is identical to the distribution of test examples (including future unseen examples). In practice, this assumption is often violated to certain degree. Strong violations will clearly result in poor classification accuracy. To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data. CS583, Bing Liu, UIC

Analysis of Severe Trauma Patients Data The worst pH value at ICU The worst active partial thromboplastin time PH_ICU <7.2 >7.33 7.2-7.33 Death 0.0 (0/15) APPT_WORST Well 0.88 (14/16) <78.7 >=78.7 Well 0.82 (9/11) Death 0.0 (0/7) PH_ICU and APPT_WORST are exactly the two factors (theoretically) advocated to be the most important ones in the study by Rotondo et al., 1997.

Breast Cancer Recurrence Degree of Malig < 3 >= 3 Tumor Size Involved Nodes < 15 >= 15 < 3 >= 3 Age no rec 125 recurr 39 no rec 30 recurr 18 recurr 27 no_rec 10 no rec 4 recurr 1 no rec 32 recurr 0 Tree induced by Assistant Professional Interesting: Accuracy of this tree compared to medical specialists

Prostate cancer recurrence Secondary Gleason Grade No Yes PSA Level Stage Primary Gleason Grade 1,2 3 4 5 14.9 14.9 T1c,T2a, T2b,T2c T1ab,T3 2,3

Introduction Decision tree learning is one of the most widely used techniques for classification. Its classification accuracy is competitive with other methods, and it is very efficient. The classification model is a tree, called decision tree. C4.5 by Ross Quinlan is perhaps the best known system. It can be downloaded from the Web. CS583, Bing Liu, UIC

The loan data (reproduced) Approved or not CS583, Bing Liu, UIC

A decision tree from the loan data Decision nodes and leaf nodes (classes) CS583, Bing Liu, UIC

Use the decision tree No CS583, Bing Liu, UIC

Is the decision tree unique? No. Here is a simpler tree. We want smaller tree and accurate tree. Easy to understand and perform better. Finding the best tree is NP-hard. All current tree building algorithms are heuristic algorithms CS583, Bing Liu, UIC

An Example Data Set and Decision Tree yes no sunny rainy med small big outlook company sailboat

Classification outlook sunny rainy yes company no big med no sailboat small big yes no

Induction of Decision Trees Data Set (Learning Set) Each example = Attributes + Class Induced description = Decision tree TDIDT Top Down Induction of Decision Trees Recursive Partitioning

Some TDIDT Systems ID3 (Quinlan 79) CART (Brieman et al. 84) Assistant (Cestnik et al. 87) C4.5 (Quinlan 93) See5 (Quinlan 97) ... Orange (Demšar, Zupan 98-03)

TDIDT Algorithm Also known as ID3 (Quinlan) To construct decision tree T from learning set S: If all examples in S belong to some class C Then make leaf labeled C Otherwise select the “most informative” attribute A partition S according to A’s values recursively construct subtrees T1, T2, ..., for the subsets of S

Hunt’s Algorithm Refund Refund Refund Marital Marital Status Status Don’t Cheat Yes No Don’t Cheat Refund Don’t Cheat Yes No Marital Status Single, Divorced Married Refund Don’t Cheat Yes No Marital Status Single, Divorced Married Taxable Income < 80K >= 80K

TDIDT Algorithm Resulting tree T is: Attribute A A v1 v2 vn A’s values Tn Subtrees

Another Example

Simple Tree Outlook Humidity P Windy N P N P sunny rainy overcast high normal yes no N P N P

Complicated Tree Temperature Outlook Outlook Windy P P Windy Windy P hot cold moderate Outlook Outlook Windy sunny rainy sunny rainy yes no overcast overcast P P Windy Windy P Humidity N Humidity yes no yes no high normal high normal N P P N Windy P Outlook P yes no sunny rainy overcast N P N P null

Attribute Selection Criteria Main principle Select attribute which partitions the learning set into subsets as “pure” as possible Various measures of purity Information-theoretic Gini index X2 ReliefF ... Various improvements probability estimates normalization binarization, subsetting

Information-Theoretic Approach To classify an object, a certain information is needed I, information After we have learned the value of attribute A, we only need some remaining amount of information to classify the object Ires, residual information Gain Gain(A) = I – Ires(A) The most ‘informative’ attribute is the one that minimizes Ires, i.e., maximizes Gain

Entropy The average amount of information I needed to classify an object is given by the entropy measure For a two-class problem: entropy p(c1)

Residual Information After applying attribute A, S is partitioned into subsets according to values v of A Ires is equal to weighted sum of the amounts of information for the subsets

Triangles and Squares

A set of classified objects Triangles and Squares Data Set: A set of classified objects . . . . . .

Entropy 5 triangles 9 squares class probabilities entropy . . . . . .

Entropy reduction by data set partitioning . . red yellow green Color? Entropy reduction by data set partitioning . .

Entropija vrednosti atributa . . . . . . . red Color? green Entropija vrednosti atributa . yellow .

. . . . . . . red Information Gain Color? green . yellow .

Information Gain of The Attribute Attributes Gain(Color) = 0.246 Gain(Outline) = 0.151 Gain(Dot) = 0.048 Heuristics: attribute with the highest gain is chosen This heuristics is local (local minimization of impurity)

Gain(Outline) = 0.971 – 0 = 0.971 bits red Color? green . yellow . Gain(Outline) = 0.971 – 0 = 0.971 bits Gain(Dot) = 0.971 – 0.951 = 0.020 bits

Gain(Outline) = 0.971 – 0.951 = 0.020 bits red Gain(Outline) = 0.971 – 0.951 = 0.020 bits Gain(Dot) = 0.971 – 0 = 0.971 bits Color? green . yellow . solid . Outline? dashed .

. . . Dot? . Color? . . . Outline? . red yes no green yellow solid dashed .

Decision Tree . . . . . . Color Dot square Outline triangle square red green yellow Dot square Outline yes no dashed solid triangle square triangle square

A Defect of Ires Ires favors attributes with many values Such attribute splits S to many subsets, and if these are small, they will tend to be pure anyway One way to rectify this is through a corrected measure of information gain ratio.

Information Gain Ratio I(A) is amount of information needed to determine the value of an attribute A Information gain ratio

Information Gain Ratio . . . . . . . red Color? green Information Gain Ratio . yellow .

Information Gain and Information Gain Ratio

Gini Index Another sensible measure of impurity (i and j are classes) After applying attribute A, the resulting Gini index is Gini can be interpreted as expected error rate

Gini Index . . . . . .

. . . . . . . red Color? green Gini Index for Color . yellow .

Gain of Gini Index

Three Impurity Measures These impurity measures assess the effect of a single attribute Criterion “most informative” that they define is local (and “myopic”) It does not reliably predict the effect of several attributes applied jointly