Data Mining (and machine learning)


Data Mining (and machine learning) DM Lecture 7: Feature Selection David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Today: finishing correlation/regression; Feature Selection; Coursework 2.

Remember how to calculate r. If we have n pairs of (x, y) values, Pearson's r is:

r = Σ (x_i − x̄)(y_i − ȳ) / sqrt( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )

Interpretation: r runs from −1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).

Equivalently, you can do it like this. Looking at it another way: after z-normalisation, X is the z-normalised x value in the sample – indicating how many standard deviations it lies from the mean – and the same for Y. The formula for r on the last slide is then equivalent to:

r = (1/n) Σ X_i Y_i    (using the population standard deviation in the z-normalisation)
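
To make the two formulas concrete, here is a minimal Python sketch (not the coursework script – the example numbers are made up) computing r both ways:

```python
# Minimal sketch of Pearson's r, computed the two equivalent ways described above.
import math

def pearson_r(xs, ys):
    """Pearson's r via the covariance / standard-deviation formula."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pearson_r_znorm(xs, ys):
    """Equivalent: mean of the products of z-normalised values (population std)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
print(pearson_r(xs, ys), pearson_r_znorm(xs, ys))   # both print the same value (≈ 0.999)
```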

The names file in the C&C dataset has correlation values (class, target) for each field. [Table: for each field – Population, Householdsize, racepctblack, racePctWhite, racePctAsian, racePctHisp, agePct12t21, agePct12t29, agePct16t24, agePct65up, numbUrban, pctUrban, medIncome, pctWWage, pctWFarmSelf, pctWInvInc, pctWSocSec, pctWPubAsst, pctWRetire, medFamInc, perCapInc, whitePerCap, blackPerCap, indianPerCap, AsianPerCap, OtherPerCap, HispPerCap, NumUnderPov, PctPopUnderPov, PctLess9thGrade, … – it lists the min, max, mean, std, correlation with the class, median and mode.]

here … [the same per-field statistics table as on the previous slide]

Here are the top 20 fields by (absolute) correlation with the class (although the first doesn't count, since it is the class field itself) – this hints at how we might use correlation for feature selection: ViolentCrimesPerPop, PctIlleg, PctKids2Par, PctFam2Par, racePctWhite, PctYoungKids2Par, PctTeen2Par, racepctblack, pctWInvInc, pctWPubAsst, FemalePctDiv, TotalPctDiv, PctPolicBlack, MalePctDivorce, PctPersOwnOccup, PctPopUnderPov, PctUnemployed, PctHousNoPhone, PctPolicMinor, PctNotHSGrad.

Can anyone see a potential problem with choosing only (for example) the 20 features that correlate best with the target class?

Feature Selection: What. You have some data, and you want to use it to build a classifier, so that you can predict something (e.g. likelihood of cancer). The data has 10,000 fields (features). You need to cut it down to 1,000 fields before you try machine learning. Which 1,000? The process of choosing the 1,000 fields to use is called Feature Selection.

Datasets with many features: gene expression datasets (~10,000 features) http://www.ncbi.nlm.nih.gov/sites/entrez?db=gds ; proteomics data (~20,000 features) http://www.ebi.ac.uk/pride/

Feature Selection: Why? [Example results from http://elpub.scix.net/data/works/att/02-28.content.pdf]

It is easy to find many more cases in the literature where experiments show that accuracy falls when more features are used.

Why does accuracy reduce with more features? How does it depend on the specific choice of features? What else changes if we use more features? So, how do we choose the right features?

Why accuracy reduces: suppose the best feature set has 20 features. If you add another 5 features, the accuracy of machine learning will typically go down – even though you still have the original 20 features! Why does this happen?

Noise / Explosion. The additional features typically add noise: machine learning will pick up on spurious correlations that hold in the training set but not in the test set. And for some ML methods, more features means more parameters to learn (more NN weights, more decision tree nodes, etc.) – the increased space of possibilities is more difficult to search.

Feature selection methods: a big research area! [Diagram from Dash & Liu (1997)] We'll look briefly at parts of it.

Correlation-based feature ranking. This is what you will use in CW2. It is indeed used often by practitioners (who perhaps don't understand the issues involved in FS). It is actually fine for certain datasets, but it is not even considered in Dash & Liu's survey.
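
A hedged sketch of what correlation-based ranking looks like in code (the dataset, feature names and use of numpy here are my own illustration, not part of CW2):

```python
# Rank features by |Pearson r| with the class and keep the top k.
import numpy as np

def rank_by_correlation(X, y, names, k):
    """Return the k feature names whose |Pearson r| with y is largest."""
    scores = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]        # Pearson correlation with the class
        scores.append((abs(r), name))
    scores.sort(reverse=True)
    return [name for _, name in scores[:k]]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=200)   # class depends on f0 and f3
names = ["f0", "f1", "f2", "f3", "f4", "f5"]
print(rank_by_correlation(X, y, names, k=2))     # typically ['f0', 'f3']
```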

A made-up dataset. [Table with features f1, f2, f3, f4, … and a class column.]

Correlated with the class. [The same table, highlighting the features whose values correlate with the class.]

Uncorrelated with the class / seemingly random. [The same table, highlighting the remaining features.]

Correlation-based FS reduces the dataset to this. [The table with only the class-correlated features and the class column kept.]

But, column 5 shows us f3 + f4 – which is perfectly correlated with the class! [The table again, with an extra column containing f3 + f4.]
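
The table above is made up, but the effect is easy to reproduce. Below is a small contrived Python sketch (my own construction, not the slide's numbers): two features that each look almost uncorrelated with the target, while their sum is perfectly correlated with it.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(scale=10.0, size=2000)    # large shared component that cancels out
b = rng.normal(size=2000)
c = rng.normal(size=2000)
f3 = a + b                               # individually dominated by `a`
f4 = -a + c
target = f3 + f4                         # = b + c: the target depends only on the sum

print(round(np.corrcoef(f3, target)[0, 1], 2))       # small (≈ 0.07): f3 alone looks useless
print(round(np.corrcoef(f4, target)[0, 1], 2))       # small (≈ 0.07): so does f4
print(round(np.corrcoef(f3 + f4, target)[0, 1], 2))  # 1.0: together they are perfect
```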

Good FS methods therefore need to consider how well features work together. As we have noted before, if you take 100 features that are each well correlated with the class, they may simply be strongly correlated with each other, and so provide little more information than just one of them.

`Complete' methods. The original dataset has N features, and you want to use a subset of k features. A complete FS method means: try every subset of k features, and choose the best! The number of subsets is N! / (k!(N−k)!). What is this when N is 100 and k is 5? 75,287,520 – almost nothing.

`Complete' methods. What is N! / (k!(N−k)!) when N is 10,000 and k is 100? 5,000,000,000,000,000,000,000,000,000,

000,000,000,000,000,000,000,000,000,000,

000,000,000,000,000,000,000,000,000,000,

000,000,000,000,000,000,000,000,000,000,

… continued for another 114 slides. Actually C(10,000, 100) is around 6.5 × 10^241 – a 242-digit number (for comparison, there are around 10^80 atoms in the observable universe).
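
These counts are easy to verify exactly with a quick Python check, using the standard library's math.comb:

```python
# Quick check of the subset counts above, using exact integer arithmetic.
# math.comb(N, k) computes N! / (k! (N-k)!).
import math

print(math.comb(100, 5))        # 75287520
big = math.comb(10_000, 100)
print(len(str(big)))            # 242 -- a 242-digit number, about 6.5 x 10^241
```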

Can you see a problem with complete methods?

`Forward' methods. These methods `grow' a set S of features: 1. S starts empty. 2. Find the best feature to add (by checking which one gives best performance on a test set when combined with S). 3. If overall performance has improved, return to step 2; else stop.

`Backward' methods. These methods remove features one by one: 1. S starts with the full feature set. 2. Find the best feature to remove (by checking which removal from S gives best performance on a test set). 3. If overall performance has improved, return to step 2; else stop.
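
Here is a hedged sketch of both greedy wrappers in Python; `evaluate` is a hypothetical placeholder for "train a classifier on these features and return its held-out accuracy", and the toy scoring function is invented purely for illustration:

```python
# Greedy forward selection (grow S from empty) or backward elimination (shrink S).
def greedy_select(all_features, evaluate, forward=True):
    current = set() if forward else set(all_features)
    best_score = None if forward else evaluate(current)
    while True:
        candidates = set(all_features) - current if forward else set(current)
        if not candidates:
            break
        # try every single-feature change and keep the best one
        scored = []
        for f in candidates:
            trial = (current | {f}) if forward else (current - {f})
            scored.append((evaluate(trial), trial))
        score, subset = max(scored, key=lambda t: t[0])
        if best_score is not None and score <= best_score:
            break                               # no improvement: stop
        best_score, current = score, subset
    return current

# toy "evaluation": pretend the ideal subset is {f1, f3}, with a small size penalty
toy_eval = lambda feats: len(feats & {"f1", "f3"}) - 0.1 * len(feats)
print(greedy_select(["f1", "f2", "f3", "f4"], toy_eval, forward=True))   # {'f1', 'f3'}
print(greedy_select(["f1", "f2", "f3", "f4"], toy_eval, forward=False))  # {'f1', 'f3'}
```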

When might you choose forward instead of backward?

Random(ised) methods, aka stochastic methods. Suppose you have 1,000 features: there are 2^1000 possible subsets of features. One way to try to find a good subset is to run a stochastic search algorithm, e.g. hillclimbing, simulated annealing, a genetic algorithm, particle swarm optimisation, …

One-slide introduction to (most) stochastic search algorithms. BEGIN: 1. Initialise a random population P of N candidate solutions (maybe just 1), e.g. each solution is a random subset of features [GENERATE]. 2. Evaluate each solution in P, e.g. the accuracy of 3-NN using only the features in that solution [TEST]. ITERATE: 1. Generate a set C of new solutions, using the good ones in P (e.g. choose a good one and mutate it, combine bits of two or more solutions, etc.) [GENERATE]. 2. Evaluate each of the new solutions in C [TEST]. 3. Update P, e.g. by choosing the best N from all of P and C [UPDATE]. 4. If we have iterated a certain number of times, or accuracy is good enough, stop.
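
As a concrete (and deliberately minimal) instance of the generate / test / update loop above, here is a hillclimbing sketch over feature subsets; again `evaluate` is a hypothetical stand-in for a real classifier's held-out accuracy:

```python
import random

def hillclimb_features(n_features, evaluate, iterations=200, seed=0):
    rng = random.Random(seed)
    # GENERATE: start from one random subset, represented as a bit-mask over the features
    current = [rng.random() < 0.5 for _ in range(n_features)]
    best = evaluate(current)                     # TEST
    for _ in range(iterations):
        # GENERATE: mutate the current solution by flipping one feature in or out
        child = current[:]
        i = rng.randrange(n_features)
        child[i] = not child[i]
        score = evaluate(child)                  # TEST
        if score >= best:                        # UPDATE: keep the better (or equal) subset
            current, best = child, score
    return current, best

# toy evaluation: pretend only features 0 and 3 matter, with a small size penalty
toy_eval = lambda mask: mask[0] + mask[3] - 0.01 * sum(mask)
mask, score = hillclimb_features(10, toy_eval)
print([i for i, used in enumerate(mask) if used], score)   # usually [0, 3]
```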

Why randomised/search methods are good for FS. Usually you have a large number of features (e.g. 1,000). You can give each feature a score (e.g. correlation with the target, Relief weight – see the end slides – etc.) and choose the best-scoring features; this is very fast, but it does not evaluate how well features work with other features. You could give combinations of features a score, but there are far too many combinations of multiple features. Search algorithms are the only suitable approaches that get to grips with evaluating combinations of features.

CW2

CW2 involves: some basic dataset processing on the CandC dataset; applying a DMML technique called Naïve Bayes (NB – already implemented by me); implementing your own script/code that can work out the correlation (Pearson's r) between any two fields; and running experiments to compare the results of NB when using the `top 5', `top 10' and `top 20' fields according to correlation with the class field.

CW2 Naïve Bayes: described in another lecture. It only works on discretized data, and predicts the class value of a target field. It uses Bayesian probability in a simple way to come up with a best guess for the class value, based on the proportions in exactly the type of histograms you are doing for CW1. My NB awk script builds its probability model on the first 80% of the data, and then outputs its average accuracy when applying this model to the remaining 20% of the data. It also outputs a confusion matrix.
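
For readers who want to see the idea in code, here is a hedged Python sketch of this kind of Naïve Bayes with an 80/20 split, accuracy and a confusion matrix – it is an illustration only, not the awk script provided for CW2, and the tiny dataset is invented:

```python
# Naive Bayes on already-discretized rows; the class value is the last column.
from collections import Counter, defaultdict

def train_nb(rows):
    class_counts = Counter(r[-1] for r in rows)
    value_counts = defaultdict(int)        # (feature index, value, class) -> count
    feature_values = defaultdict(set)      # feature index -> set of values seen
    for r in rows:
        for j, v in enumerate(r[:-1]):
            value_counts[(j, v, r[-1])] += 1
            feature_values[j].add(v)
    return class_counts, value_counts, feature_values

def predict_nb(model, row):
    class_counts, value_counts, feature_values = model
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / n                                                   # prior P(class)
        for j, v in enumerate(row):                                  # times P(value | class)
            p *= (value_counts[(j, v, c)] + 1) / (cc + len(feature_values[j]))  # Laplace smoothing
        if p > best_p:
            best, best_p = c, p
    return best

# 80/20 split, accuracy and confusion matrix, as the CW2 script reports
data = [["low", "yes", "A"], ["low", "no", "A"], ["high", "yes", "B"],
        ["high", "no", "B"], ["low", "yes", "A"], ["high", "yes", "B"]] * 10
split = int(0.8 * len(data))
model = train_nb(data[:split])
confusion = Counter((r[-1], predict_nb(model, r[:-1])) for r in data[split:])
accuracy = sum(v for (t, p), v in confusion.items() if t == p) / (len(data) - split)
print(accuracy, dict(confusion))
```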

If time: the classic example of an instance-based heuristic method.

The Relief method: an instance-based, heuristic method – it works out a weight value for each feature, based on how important that feature seems to be in discriminating between near neighbours.

The Relief method. There are two features here – the x and the y co-ordinate. Initially they each have zero weight: wx = 0; wy = 0.

The Relief method. wx = 0; wy = 0; choose an instance at random, call it R.

The Relief method. wx = 0; wy = 0; find H (hit: the nearest instance to R of the same class) and M (miss: the nearest instance to R of a different class).

The Relief method. wx = 0; wy = 0; now we update the weights based on the distances between R and H and between R and M. This happens one feature at a time.

The Relief method. To change wx, we add to it (MR − HR)/n, where HR and MR are the distances from R to H and from R to M in the x dimension, and n is the number of instances we will sample. So, the further away the `miss' is in the x direction, the higher the weight of x – the more important x is in terms of discriminating the classes.

The Relief method. To change wy, we add (MR − HR)/n again, but this time calculated in the y dimension; clearly the difference is smaller – differences in this feature don't seem important in terms of class value.

The Relief method. Maybe now we have wx = 0.07, wy = 0.002.

The Relief method. wx = 0.07, wy = 0.002; pick another instance at random, and do the same again.

The Relief method. wx = 0.07, wy = 0.002; identify H and M.

The Relief method. wx = 0.07, wy = 0.002; add the HR and MR differences divided by n, for each feature, again …

The Relief method. In the end, we have a weight value for each feature: the higher the value, the more relevant the feature. We can use these weights for feature selection simply by choosing the features with the S highest weights (if we want to use S features). NOTE: it is important to use Relief only on min-max normalised data in [0, 1]. However, it is fine if categorical attributes are involved – in which case use Hamming distance for those attributes. Why divide by n? Because then the weight values can be interpreted as a difference of probabilities.
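
A hedged Python sketch of the Relief procedure just described (assuming numeric features already min-max normalised to [0, 1]; the toy data is invented, and this is not the exact Kira & Rendell pseudocode):

```python
import random

def relief_weights(X, y, n_samples=100, seed=0):
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(n_samples):
        i = rng.randrange(len(X))                    # pick a random instance R
        R, cls = X[i], y[i]
        def dist(a, b):                              # Manhattan distance on normalised data
            return sum(abs(ai - bi) for ai, bi in zip(a, b))
        hits   = [j for j in range(len(X)) if j != i and y[j] == cls]
        misses = [j for j in range(len(X)) if y[j] != cls]
        H = X[min(hits,   key=lambda j: dist(X[j], R))]    # nearest hit
        M = X[min(misses, key=lambda j: dist(X[j], R))]    # nearest miss
        for f in range(n_features):
            # the miss difference pushes the weight up, the hit difference pushes it down
            w[f] += (abs(M[f] - R[f]) - abs(H[f] - R[f])) / n_samples
    return w

# toy data: feature 0 separates the classes, feature 1 is noise
X = [[0.1, 0.5], [0.2, 0.9], [0.15, 0.1], [0.8, 0.4], [0.9, 0.8], [0.85, 0.2]]
y = [0, 0, 0, 1, 1, 1]
print(relief_weights(X, y, n_samples=50))   # feature 0 gets a much larger weight than feature 1
```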

The Relief method, plucked directly from the original paper (Kira and Rendell, 1992). [Algorithm figure not reproduced here.]

Some recommended reading, if you are interested, is on the website: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html