COMP3740 CR32: Knowledge Management and Adaptive Systems

Supervised ML to learn Classifiers: Decision Trees and Classification Rules
By Eric Atwell, School of Computing, University of Leeds (including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts, School of Computing, University of Leeds)

Reminder: Objectives of data mining
Data mining aims to find useful patterns in data. For this we need:
Data mining techniques, algorithms and tools, e.g. WEKA
A methodological framework to guide us in collecting data and applying the best algorithms: CRISP-DM
TODAY'S objective: learn how to learn classifiers: Decision Trees and Classification Rules
Supervised Machine Learning: the training set has the “answer” (class) for each example (instance)

Reminder: Concepts that can be “learnt”
The types of concepts we try to ‘learn’ include:
Clusters or ‘natural’ partitions, e.g. we might cluster customers according to their shopping habits.
Rules for classifying examples into pre-defined classes, e.g. “Mature students studying Information Systems with a high grade for General Studies A level are likely to get a 1st class degree”.
General associations, e.g. “People who buy nappies are in general likely also to buy beer”.
Numerical prediction, e.g. Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree (but are Gender, Programme really numbers???)

Output: decision tree
Outlook?
  sunny: Humidity?
    high: Play = ‘no’
    normal: Play = ‘yes’
  rainy: Windy?
    true: Play = ‘no’
    false: Play = ‘yes’

Decision Tree Analysis
Example instance set: can we predict, from the first 3 columns, the risk of getting a virus?
For convenience later: F = ‘shares Files’, S = ‘uses Scanner’, I = ‘Infected before’.
Decision trees attempt to predict the value in one column from the values in the other columns. They use a ‘training set’, which in a data warehouse setting may be a randomly selected set of facts from a data cube, with the various dimensions providing the other attributes.

Decision tree building method
Forms a decision tree: tries for a small tree covering all or most of the training set; internal nodes represent a test on an attribute value; branches represent the outcomes of the test.
Decides which attribute to test at each node, based on a measure of ‘entropy’.
Must avoid ‘over-fitting’: if the tree is complex enough it might describe the training set exactly, but be no good for prediction. May leave some ‘exceptions’.
The decision tree method uses more than one attribute to form the rule. An example tree for the virus risk data is on a later slide. We would like to use the smallest tree possible that describes the data, but the search space is in general too large to guarantee the smallest tree, so we use an entropy measure to decide which attribute to use first. Over-fitting is a problem that we haven't time to discuss in full here. The point is that, unless there are some inconsistencies in the data set, you may, with enough attributes, be able to build a massively dividing tree that describes the training set exactly even though there are no real rules to abstract. When you try to apply these rules to the data at large they will be no use: they are peculiar to the training set. The method for avoiding over-fitting uses the principle of minimum description length: essentially, if the number of bits needed to describe the tree is close to the number of bits required to describe the whole training set, then the tree is not a useful abstraction.

Building a decision tree (DT)
The algorithm is recursive; at any step:
T = set of (remaining) training instances, {C1, …, Ck} = set of classes.
If all instances in T belong to a single class Ci, then the DT is a leaf node identifying class Ci (done!).
…continued

Building a decision tree (DT) …continued
If T contains instances belonging to mixed classes, then choose a test based on a single attribute that will partition T into subsets {T1, …, Tn} according to the n outcomes of the test.
The DT for T comprises a root node identifying the test and one branch for each outcome of the test.
The branches are formed by applying the rules above recursively to each of the subsets {T1, …, Tn}.
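
A minimal sketch of this recursive procedure in Python (not from the slides: the nested-dict tree representation, the attribute and class names, and the majority-class fallback are illustrative assumptions, and the naive choice of test attribute is refined later using entropy):

```python
from collections import Counter

def majority_class(instances, target):
    """Most common class label among the given instances."""
    return Counter(inst[target] for inst in instances).most_common(1)[0][0]

def build_tree(instances, attributes, target="risk"):
    """Recursive decision-tree builder following the two rules above.

    instances:  list of dicts, e.g. {"F": "yes", "S": "no", "I": "yes", "risk": "High"}
    attributes: attribute names still available for testing.
    """
    classes = {inst[target] for inst in instances}

    # Rule 1: all instances in T belong to a single class -> leaf node.
    if len(classes) == 1:
        return {"leaf": classes.pop()}

    # No attributes left to test: fall back to the majority class.
    if not attributes:
        return {"leaf": majority_class(instances, target)}

    # Rule 2: choose a test attribute and partition T into subsets,
    # with one branch per outcome of the test.  Here we naively take the
    # first remaining attribute; the later slides refine this choice
    # using entropy / information gain.
    test = attributes[0]
    node = {"test": test, "branches": {}}
    for value in {inst[test] for inst in instances}:
        subset = [inst for inst in instances if inst[test] == value]
        node["branches"][value] = build_tree(subset, attributes[1:], target)
    return node
```

For example, build_tree(training_instances, ["F", "S", "I"]) on a virus-risk training set would return a nested dict of tests and leaves like the trees sketched on the following slides.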

Tree Building example
Classes = {High, Medium, Low}. Starting from the full training set T, choose a test based on F; number of outcomes n = 2 (Yes or No). The F = yes branch gives subset T1, the F = no branch gives subset T2.

Tree Building example
Classes = {High, Medium, Low}. For subset T1 (F = yes), choose a test based on I; number of outcomes n = 2 (Yes or No). The I node hangs below the F node: one branch of I gives subset T4, the other gives subset T3.

Tree Building example
Classes = {High, Medium, Low}. All the instances in T4 belong to a single class, so that branch of I becomes a leaf: Risk = ‘High’. Subset T3 still contains mixed classes.

Tree Building example
Classes = {High, Medium, Low}. Subset T3 is still mixed, so choose a test based on S; number of outcomes n = 2 (Yes or No). The tree so far: F at the root, I below the F = yes branch with a Risk = ‘High’ leaf on one side, and the new S node below the other branch of I.

Tree Building example
Classes = {High, Medium, Low}. Splitting T3 on S gives two pure subsets, so both branches of S become leaves: one Risk = ‘High’, the other Risk = ‘Low’.

Tree Building example
Classes = {High, Medium, Low}. Now return to subset T2 (F = no), which is still mixed, and choose a test based on S; number of outcomes n = 2 (Yes or No). A second S node is added below the F = no branch of the root.

Tree Building example
Classes = {High, Medium, Low}. Splitting T2 on S also gives pure subsets, so its two branches become leaves (Risk = ‘Low’ and Risk = ‘Medium’), completing the tree shown on the next slide.

Example Decision Tree
Shares files?
  no: Uses scanner?
    no: medium
    yes: low
  yes: Infected before?
    yes: high
    no: Uses scanner?
      no: high
      yes: low
Here is the tree corresponding to the table earlier. Why have we started with the ‘Shares files?’ attribute at the top of the tree? This is what the entropy measure will tell us.

Which attribute to test?
The ROOT could be S or I instead of F, leading to a different Decision Tree. The best DT is the “smallest”, most concise model.
The search space is in general too large to find the smallest tree by exhaustive search (trying them all). Instead we look for the attribute which splits the training set into the most homogeneous subsets. The measure used for ‘homogeneity’ is based on entropy. Intuitively, an attribute that splits the training set into very heterogeneous groups (i.e. an even mix of high risks and low risks) is not much use in forming a rule that predicts high risk. The next few slides look at how the various attributes fare.
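
A minimal sketch of such a homogeneity measure, assuming the same dict-based instances as the earlier build_tree sketch (entropy in bits; the function names are illustrative):

```python
import math
from collections import Counter

def entropy(class_labels):
    """Shannon entropy (in bits) of a collection of class labels:
    0 for a perfectly homogeneous set, higher for an even mix."""
    counts = Counter(class_labels)
    total = len(class_labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(instances, attribute, target):
    """Reduction in entropy obtained by splitting the instances on `attribute`:
    the weighted average entropy of the subsets is subtracted from the
    entropy of the whole set."""
    before = entropy([inst[target] for inst in instances])
    total = len(instances)
    after = 0.0
    for value in {inst[attribute] for inst in instances}:
        subset = [inst[target] for inst in instances if inst[attribute] == value]
        after += (len(subset) / total) * entropy(subset)
    return before - after
```

ID3-style learners pick the attribute with the largest information gain at each node; C4.5 (WEKA's J48) refines this with the gain ratio.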

Tree Building example (modified)
Classes = {Yes, No} (is the risk High: yes or no?). Starting from the full training set T, try a test based on F; number of outcomes n = 2 (Yes or No). One branch is labelled High Risk = ‘yes’ with counts 5, 1; the other High Risk = ‘no’ with counts 2, 0.

Tree Building example (modified)
Classes = {Yes, No}. Alternatively, try a test on the full set T based on S; number of outcomes n = 2 (Yes or No). One branch is labelled High Risk = ‘no’ with counts 4, 2; the other High Risk = ‘yes’ with counts 3, 1.

Tree Building example (modified)
Classes = {Yes, No}. Finally, try a test on the full set T based on I; number of outcomes n = 2 (Yes or No). One branch is labelled High Risk = ‘no’ with counts 3, 1; the other High Risk = ‘yes’ with counts 4, 1.

Decision tree building algorithm
For each decision point:
If the remaining examples are all +ve or all -ve, stop.
Else if there are some +ve and some -ve examples left and some attributes left, pick the remaining attribute with the largest information gain.
Else if there are no examples left, no such example has been observed; return a default class.
Else if there are no attributes left, examples with the same description have different classifications: noise, insufficient attributes, or a nondeterministic domain.
This shows the steps in the decision tree algorithm. There are a few details I haven't discussed to do with what happens if we run out of examples before running out of attributes to test, or vice versa. Also I haven't included here how to avoid over-fitting.
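
Tying the earlier sketches together, the "pick the remaining attribute with largest information gain" step could look like this (a sketch reusing the hypothetical information_gain function above, in place of build_tree's naive first-attribute choice):

```python
def choose_attribute(instances, attributes, target):
    """Return the remaining attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(instances, a, target))
```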

Evaluation of decision trees
At the leaf nodes two numbers are given:
N: the coverage for that node (how many instances it covers)
E: the errors for that node (how many of those instances are wrongly classified)
The whole tree can be evaluated in terms of its size (number of nodes) and its overall error rate, expressed as the number and percentage of cases wrongly classified. We seek small trees that have low error rates.

Evaluation of decision trees
The error rate for the whole tree can also be displayed in terms of a confusion matrix:

  (A)   (B)   (C)   ← classified as
   35     2     1   Class (A) = high
    4    41     5   Class (B) = medium
               68   Class (C) = low
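
As a generic illustration (a sketch, not the slide's exact figures, since the 'low' row above is only partly legible), the overall error rate can be read off any such confusion matrix, whose diagonal holds the correctly classified counts:

```python
import numpy as np

def error_rate(confusion_matrix):
    """Overall error rate = wrongly classified / total instances."""
    cm = np.asarray(confusion_matrix)
    return 1.0 - np.trace(cm) / cm.sum()
```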

Evaluation of decision trees
The error rates mentioned on the previous slides are normally computed using either:
The training set of instances, or
A test set of instances: some different examples!
If the decision tree algorithm has ‘over-fitted’ the data, then the error rate based on the training set will be far lower than that based on the test set.

Evaluation of decision trees
10-fold cross-validation can be used when the data set is limited in size:
Divide the data set randomly into 10 subsets.
Build a tree from 9 of the subsets and test it on the 10th.
Repeat the experiment 9 more times, using a different subset as the test set each time.
The overall error rate is the average over the 10 experiments.
10-fold cross-validation will lead to up to 10 different decision trees being built. The method for selecting or constructing the single best tree is not clear.
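
The module uses WEKA for this; purely as an illustration, here is a minimal sketch of the same 10-fold procedure in Python with scikit-learn, using synthetic stand-in data (the slides' virus-risk table is not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 instances, 5 attributes, 3 classes.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(tree, X, y, cv=10)   # 10 build-and-test experiments
print("accuracy per fold:", np.round(scores, 2))
print("overall error rate: %.2f" % (1 - scores.mean()))
```

cross_val_score carries out the 10 build-and-test experiments and returns one accuracy per fold; averaging them gives the overall error estimate described above.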

From decision trees to rules
Decision trees may not be easy to interpret:
Tests associated with lower nodes have to be read in the context of tests further up the tree.
‘Sub-concepts’ may sometimes be split up and distributed to different parts of the tree (see next slide).
Computer Scientists may prefer “if … then …” rules!

DT for “F = G = 1 or J = K = 1”
The sub-concept J = K = 1 is split across two subtrees:
F = 0:
  J = 0: no
  J = 1:
    K = 0: no
    K = 1: yes
F = 1:
  G = 1: yes
  G = 0:
    J = 0: no
    J = 1:
      K = 0: no
      K = 1: yes

Converting DT to rules
Step 1: every path from root to leaf represents a rule (for the tree on the previous slide):
If F = 0 and J = 0 then class no
If F = 0 and J = 1 and K = 0 then class no
If F = 0 and J = 1 and K = 1 then class yes
…
If F = 1 and G = 0 and J = 1 and K = 1 then class yes
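
A minimal sketch of this path-enumeration step, assuming the nested-dict tree representation from the earlier build_tree sketch (the function name is illustrative):

```python
def tree_to_rules(node, conditions=None):
    """Enumerate every root-to-leaf path of the tree as an
    'If ... then class ...' rule string."""
    conditions = conditions or []
    if "leaf" in node:
        body = " and ".join(f"{attr} = {value}" for attr, value in conditions)
        return [f"If {body or 'true'} then class {node['leaf']}"]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, conditions + [(node["test"], value)])
    return rules
```

Each returned string corresponds to one root-to-leaf path, exactly as in the list above.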

Generalising rules
Rules from the previous step, e.g.
If F = 0 and J = 1 and K = 1 then class yes
If F = 1 and G = 0 and J = 1 and K = 1 then class yes
can be generalised to simpler rules:
If G = 1 then class yes
If J = 1 and K = 1 then class yes

Tidying up rule sets
Generalisation leads to 2 problems:
The rules are no longer mutually exclusive. Solution: order the rules and use the first matching rule as the operative rule; the ordering is based on how many false-positive errors each rule makes.
The rule set is no longer exhaustive. Solution: choose a default value for the class when no rule applies; the default class is the one containing the most training cases not covered by any rule.
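
A minimal sketch of applying such a tidied-up rule set, assuming each rule is a (conditions, class) pair already ordered by false-positive count (this representation is an assumption, not from the slides):

```python
def apply_rules(instance, ordered_rules, default_class):
    """Return the class of the first matching rule; if no rule matches,
    fall back to the default class.

    ordered_rules: list of (conditions, predicted_class) pairs, where
    conditions maps attribute names to required values, sorted so that
    rules making fewer false-positive errors come first.
    """
    for conditions, predicted in ordered_rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return predicted
    return default_class
```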

Decision Tree - Revision
The decision tree builder algorithm discovers rules for classifying instances. At each step, it needs to decide which attribute to test at that point in the tree; a measure of ‘information gain’ can be used. The output is a decision tree based on the ‘training’ instances, evaluated with separate “test” instances. Leaf nodes which have a small coverage may be pruned if the error rate stays small for the pruned tree.

Pruning example (from W & F)
Subtree: Health plan contribution?
  none: 4 bad, 2 good
  half: 1 bad, 1 good
  full: 4 bad, 2 good
Labelling each of these leaves with its majority class gives 2 + 1 + 2 = 5 errors across 14 instances. We replace the subtree with a single leaf: Bad (14 instances, 5 errors). The number of errors is no worse, and the tree is smaller.

Decision trees v classification rules
Decision trees can be used for prediction or interpretation.
Prediction: compare an unclassified instance against the tree and predict what class it is in (with an error estimate).
Interpretation: examine the tree and try to understand why instances end up in the class they are in.
Rule sets are often better for interpretation: ‘small’, accurate rules can be examined, even if the overall accuracy of the rule set is poor.
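
For the prediction use, a minimal sketch of classifying one unclassified instance against the nested-dict tree from the earlier build_tree sketch:

```python
def classify(node, instance):
    """Walk from the root to a leaf, following the branch that matches the
    instance's value for each tested attribute, and return the leaf's class."""
    while "leaf" not in node:
        # Raises KeyError if the instance has an attribute value never seen in training.
        node = node["branches"][instance[node["test"]]]
    return node["leaf"]
```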

Self Check
You should be able to:
Describe how decision trees are built from a set of instances.
Build a decision tree based on a given attribute.
Explain what the ‘training’ and ‘test’ sets are for.
Explain what “supervised” means, and why classification is an example of supervised ML.