Finishing off last lecture: Support Vector Machines, a different approach to finding the decision boundary, particularly good at generalisation

Suppose we can divide the classes with a simple hyperplane

There will be infinitely many such lines

One of them is ‘optimal’

Because it maximises the margin: the distance of the hyperplane from the ‘support vectors’, the instances that are closest to instances of a different class

A Support Vector Machine (SVM) finds this hyperplane

But, usually there is no simple hyperplane that separates the classes!

One dimension (x), two classes

Two dimensions (x, x*sin(x)),

Now we can separate the classes

SVMs do this: if we add enough extra dimensions/fields using arbitrary functions of the existing fields, then it becomes very likely that we can separate the data. SVMs apply such a transformation, then find the optimal separating hyperplane. The ‘optimality’ of the separating hyperplane gives good generalisation properties.
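A minimal Python sketch of this idea (an illustration, not part of the lecture; it assumes NumPy and scikit-learn are installed, and the dataset and the x*sin(x) feature are made up to mirror the slides): a linear SVM cannot find a single threshold that separates these classes in one dimension, but after adding the extra dimension x*sin(x) a straight line separates them almost perfectly.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=200)
y = (x * np.sin(x) > 0).astype(int)        # class depends non-linearly on x

# Linear SVM on the raw 1-D feature: no single threshold separates the classes.
svm_1d = SVC(kernel="linear").fit(x.reshape(-1, 1), y)
print("accuracy in 1-D:", svm_1d.score(x.reshape(-1, 1), y))

# The same linear SVM after adding the extra dimension x*sin(x):
X2 = np.column_stack([x, x * np.sin(x)])
svm_2d = SVC(kernel="linear").fit(X2, y)
print("accuracy in 2-D:", svm_2d.score(X2, y))
```

In practice the transformation is usually not applied explicitly: kernel functions (e.g. SVC(kernel="rbf")) let the SVM work in such an expanded space implicitly.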

Decision Trees

Real-world applications of DTs. See here for a list: survey/node32.html. Includes: Agriculture, Astronomy, Biomedical Engineering, Control Systems, Financial analysis, Manufacturing and Production, Medicine, Molecular biology, Object recognition, Pharmacology, Physics, Plant diseases, Power systems, Remote Sensing, Software development, Text processing.

Field names

Field values

Field names Field values Class values

Why decision trees? Popular, since they are interpretable... and correspond to human reasoning/thinking about decision-making. They can perform quite well in accuracy when compared with other approaches... and there are good algorithms to learn decision trees from data.

Figure 1. Binary Strategy as a tree model. Mohammed MA, Rudge G, Wood G, Smith G, et al. (2012) Which Is More Useful in Predicting Hospital Mortality - Dichotomised Blood Test Results or Actual Test Values? A Retrospective Study in Two Hospitals. PLoS ONE 7(10).

Figure 1. Binary Strategy as a tree model.

We will learn the ‘classic’ algorithm to learn a DT from categorical data:

ID3

Suppose we want a tree that helps us predict someone’s politics, given their gender, age, and wealth:

gender | age         | wealth | politics
-------|-------------|--------|------------
male   | middle-aged | rich   | Right-wing
male   | young       | rich   | Right-wing
female | young       | poor   | Left-wing
female | middle-aged | poor   | Left-wing
male   | young       | poor   | Right-wing
male   | old         | poor   | Right-wing

Choose a start node (field) at random. (Training data table as above.)

Choose a start node (field) at random. [Tree so far: a single unexpanded node, ?] (Training data table as above.)

Choose a start node (field) at random: Age. (Training data table as above.)

Add branches for each value of this field. [Tree: Age, with branches young / mid / old] (Training data table as above.)

Check to see what has filtered down. [Tree: Age; young: 1 L, 2 R; mid: 1 L, 1 R; old: 0 L, 1 R] (Training data table as above.)

Where possible, assign a class value. [Tree: Age; young: 1 L, 2 R; mid: 1 L, 1 R; old: 0 L, 1 R → Right-Wing] (Training data table as above.)

Otherwise, we need to add further nodes. [Tree: Age; young (1 L, 2 R): ?; mid (1 L, 1 R): ?; old: Right-Wing] (Training data table as above.)

Repeat this process every time we need a new node. [Tree as on the previous slide.] (Training data table as above.)

Starting with the first new node – choose a field at random. [Tree: Age; young → wealth; mid: ?; old: Right-Wing] (Training data table as above.)

Check the classes of the data at this node… [Tree: Age; young → wealth (rich: 0 L, 1 R; poor: 1 L, 1 R); mid: ?; old: Right-Wing] (Training data table as above.)

And so on … [Tree: Age; young → wealth (rich: Right-wing; poor: 1 L, 1 R, needs another node); mid: ?; old: Right-Wing] (Training data table as above.)

But we can do better than randomly chosen fields! (Training data table as above.)

This is the tree we get if the first choice is ‘gender’: (Training data table as above.)

[Tree: gender; male → Right-Wing; female → Left-Wing] This is the tree we get if the first choice is ‘gender’. (Training data table as above.)

Algorithms for building decision trees (of this type):
Initialise: tree T contains one ‘unexpanded’ node.
Repeat until there are no unexpanded nodes:
  remove an unexpanded node U from T
  expand U by choosing a field
  add the resulting nodes to T
(A minimal sketch of this loop is given below.)
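Here is a minimal Python sketch of this generic algorithm (illustrative only: the data format, the function names, and the random field choice are assumptions; it is written recursively rather than with an explicit list of unexpanded nodes, but the effect is the same):

```python
import random

def build_tree(rows, fields, target):
    """Grow a decision tree over categorical data.

    rows   : list of dicts, e.g. {"gender": "male", ..., "politics": "Right-wing"}
    fields : field names still available for splitting
    target : name of the class field, e.g. "politics"
    """
    classes = {r[target] for r in rows}
    if len(classes) == 1:                 # pure node: assign the class value
        return classes.pop()
    if not fields:                        # nothing left to split on: majority class
        values = [r[target] for r in rows]
        return max(set(values), key=values.count)

    field = choose_field(rows, fields, target)      # expand this node
    tree = {field: {}}
    for value in {r[field] for r in rows}:          # one branch per value seen here
        subset = [r for r in rows if r[field] == value]
        remaining = [f for f in fields if f != field]
        tree[field][value] = build_tree(subset, remaining, target)
    return tree

def choose_field(rows, fields, target):
    """Placeholder: pick a field at random (as in the slides so far)."""
    return random.choice(fields)
```

With choose_field picking at random, as in the slides so far, different runs can give quite different trees for the same data; the entropy-based choice described next removes that arbitrariness.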

Algorithms for building decision trees (of this type) – expanding a node. [Diagram: a single unexpanded node, ?]

Algorithms for building decision trees (of this type) – the essential step. [Diagram: the node becomes a test on a Field, with a branch for each value (Value = X, Value = Y, Value = Z), each leading to a new unexpanded node ?]

So, which field? [Diagram as on the previous slide.]

Three choices: gender, age, or wealth. (Training data table as above.)

Suppose we choose age (table now sorted by age values):

gender | age         | wealth | politics
-------|-------------|--------|------------
male   | middle-aged | rich   | Right-wing
female | middle-aged | poor   | Left-wing
male   | old         | poor   | Right-wing
male   | young       | rich   | Right-wing
female | young       | poor   | Left-wing
male   | young       | poor   | Right-wing

Two of the values have a mixture of classes.

Suppose we choose wealth (table now sorted by wealth values):

gender | age         | wealth | politics
-------|-------------|--------|------------
female | middle-aged | poor   | Left-wing
male   | old         | poor   | Right-wing
female | young       | poor   | Left-wing
male   | young       | poor   | Right-wing
male   | middle-aged | rich   | Right-wing
male   | young       | rich   | Right-wing

One of the values has a mixture of classes - this choice is a bit less mixed up than age?

Suppose we choose gender (table now sorted by gender values):

gender | age         | wealth | politics
-------|-------------|--------|------------
female | middle-aged | poor   | Left-wing
female | young       | poor   | Left-wing
male   | old         | poor   | Right-wing
male   | middle-aged | rich   | Right-wing
male   | young       | poor   | Right-wing
male   | young       | rich   | Right-wing

The classes are not mixed up at all within the values.

So, at each step where we expand a node, we choose the field for which the relationship between the field values and the class values is least mixed up.

Measuring ‘mixed-up’ness: Shannon’s entropy measure. Suppose you have a bag of N discrete things, and there are T different types of things. Where p_t is the proportion of things in the bag that are of type t, the entropy of the bag is:

H = − Σ_{t=1..T} p_t log(p_t)

(The worked examples on the next slide use base-10 logarithms.)

Examples:
This mixture: { left left left right right } has entropy: − ( 0.6 log(0.6) + 0.4 log(0.4) ) = 0.292
This mixture: { A A A A A A A A B C } has entropy: − ( 0.8 log(0.8) + 0.1 log(0.1) + 0.1 log(0.1) ) = 0.278
This mixture: { same same same same same same } has entropy: − ( 1.0 log(1.0) ) = 0
Lower entropy = less mixed up.
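A small Python helper (an illustration, not from the slides) that computes this entropy for a list of class labels and reproduces the numbers above; the base-10 logarithm matches the worked examples:

```python
import math
from collections import Counter

def entropy(labels, base=10):
    """Shannon entropy: sum of -p*log(p) over the class proportions p."""
    n = len(labels)
    return sum(-(c / n) * math.log(c / n, base) for c in Counter(labels).values())

print(entropy(["left"] * 3 + ["right"] * 2))   # ~0.292
print(entropy(list("AAAAAAAABC")))             # ~0.278
print(entropy(["same"] * 6))                   # 0.0
```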

ID3 chooses fields based on entropy. [Diagram: candidate fields Field1, Field2, Field3, …, each with its values val1, val2, val3, …] Each value has an entropy value – how mixed up the classes are for that value choice.

ID3 chooses fields based on entropy. [Diagram as before, with each value now weighted by a proportion p1, p2, p3, …] Each value has an entropy value – how mixed up the classes are for that value choice. And each value also has a proportion – how much of the data at this node has this value.

ID3 chooses fields based on entropy. [Diagram: each field’s weighted value-entropies summed to give H(D|Field1), H(D|Field2), H(D|Field3), …] So ID3 works out H(D|Field) for each field, which is the entropies of the values weighted by the proportions: H(D|Field) = Σ_v p_v × H(value v), where v ranges over the values of the field, p_v is the proportion of the data at this node taking value v, and H(value v) is the entropy of the class labels among those rows.

ID3 chooses fields based on entropy. [Diagram as on the previous slide.] So ID3 works out H(D|Field) for each field, which is the entropies of the values weighted by the proportions. The field with the lowest value is chosen – this maximises the ‘Information Gain’, H(D) − H(D|Field).

Back here: gender, age, or wealth? (Training data table as above.)

Suppose we choose age (table now sorted by age values, as shown earlier).
H(D|age) = proportion-weighted entropy
= 2/6 × − ( 0.5 log(0.5) + 0.5 log(0.5) )  [middle-aged]
+ 1/6 × − ( 1 log(1) )  [old]
+ 3/6 × − ( 0.33 log(0.33) + 0.66 log(0.66) )  [young]

Suppose we choose wealth (table now sorted by wealth values, as shown earlier).
H(D|wealth)
= 4/6 × − ( 0.5 log(0.5) + 0.5 log(0.5) )  [poor]
+ 2/6 × − ( 1 log(1) )  [rich]

Suppose we choose gender (table now sorted by gender values, as shown earlier).
H(D|gender)
= 2/6 × − ( 1 log(1) )  [female]
+ 4/6 × − ( 1 log(1) )  [male]
= 0
This is the one we would choose...
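A short Python check of these three calculations (illustrative; the table is the six-row politics data from earlier, and h_given_field is the proportion-weighted entropy described above, using base-10 logarithms):

```python
import math
from collections import Counter

data = [
    {"gender": "male",   "age": "middle-aged", "wealth": "rich", "politics": "Right-wing"},
    {"gender": "male",   "age": "young",       "wealth": "rich", "politics": "Right-wing"},
    {"gender": "female", "age": "young",       "wealth": "poor", "politics": "Left-wing"},
    {"gender": "female", "age": "middle-aged", "wealth": "poor", "politics": "Left-wing"},
    {"gender": "male",   "age": "young",       "wealth": "poor", "politics": "Right-wing"},
    {"gender": "male",   "age": "old",         "wealth": "poor", "politics": "Right-wing"},
]

def entropy(labels, base=10):
    n = len(labels)
    return sum(-(c / n) * math.log(c / n, base) for c in Counter(labels).values())

def h_given_field(rows, field, target="politics"):
    """H(D|field): entropy of each value's class labels, weighted by the value's proportion."""
    n = len(rows)
    total = 0.0
    for value in {r[field] for r in rows}:
        subset = [r[target] for r in rows if r[field] == value]
        total += (len(subset) / n) * entropy(subset)
    return total

for field in ("age", "wealth", "gender"):
    print(field, round(h_given_field(data, field), 3))
```

This prints roughly 0.239 for age, 0.201 for wealth and 0.0 for gender, so ID3 splits on gender first, matching the calculation on the slide.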

Alternatives to Information Gain - all, somehow or other, give a measure of mixed-up-ness and have been used in building DTs: Chi Square, Gain Ratio, Symmetric Gain Ratio, Gini index, Modified Gini index, Symmetric Gini index, J-Measure, Minimum Description Length, Relevance, RELIEF, Weight of Evidence.
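As a small illustration of one of these alternatives (not from the slides), the Gini index is another measure of mixed-up-ness, computed from the squared class proportions:

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions (0 = pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["left"] * 3 + ["right"] * 2))   # 0.48
print(gini(["same"] * 6))                   # 0.0
```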

Decision Trees: further reading is easy to find via Google. Interesting topics in context are:
Pruning: close a branch down before you hit 0 entropy (why?)
Discretization and regression: trees that deal with real-valued fields
Decision Forests: what do you think these are?