COMP61011 Foundations of Machine Learning Feature Selection
Only 200 papers in the world! I wish!
Square Kilometre Array (due 2024)
The world’s largest radio telescope array, producing around 1 terabyte of data per second. Stellar objects need to be classified in real time.
Supervised Learning
Training data + labels (possibly high dimensional) are used to build a model; at test time, the model maps a test input to a label prediction. This is the standard supervised learning scenario.
High Dimensional Data (this is real, on a USB stick on my desk – 41,672 features, 59 patients)
Recap: training data + labels build the model; a test input then produces a label prediction.
Supervised Learning + Feature Selection
Training data + labels; select a subset of features (i.e. columns); build the model; a test input then produces a label prediction.
The “Wrapper” approach
“You want to build a model… so just do it.” Try a feature set, build the model, evaluate the model, repeat.
Can we just do an exhaustive search…? Represent a feature set as a bit string: a bit set to 1 means we use that feature, otherwise 0, so a string containing eight 1s uses 8 features.
With M total features there are 2^M possible sets!
- 20 features … about 1 million feature sets to check
- 25 features … about 33.5 million sets
- 30 features … about 1.1 billion sets
(A sketch of this exhaustive search appears below.)
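To make the combinatorics concrete, here is a minimal sketch in Python (an illustration, not part of the original slides) of exhaustive wrapper search over bitmask-encoded feature sets; train_and_evaluate is a hypothetical placeholder for fitting and scoring a model.

    # A sketch of exhaustive wrapper search over bitmask-encoded subsets.
    # Only feasible for tiny M: there are 2**M candidate sets.
    import itertools

    M = 4  # deliberately tiny; M = 30 would mean ~1.1 billion subsets
    for bits in itertools.product([0, 1], repeat=M):
        subset = [i for i, b in enumerate(bits) if b == 1]
        # a real wrapper would call train_and_evaluate(subset) here
        print(bits, "->", subset)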
The “Wrapper” approach
“You want to build a model… so just do it.” Simplest strategy: greedy forward search.
REPEAT:
1. Try out each of the remaining features with your model.
2. Add the “best” one.
UNTIL satisfied with accuracy/error.
(See the sketch below.)
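A minimal sketch of this greedy loop, assuming scikit-learn is available; the choice of logistic regression and 5-fold cross-validated accuracy is illustrative, not prescribed by the slides.

    # Greedy forward selection: repeatedly add the feature that most
    # improves cross-validated accuracy, stopping when nothing helps.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    selected, remaining = [], list(range(X.shape[1]))
    best_score = 0.0
    while remaining:
        # score each remaining feature when added to the current set
        scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:  # stop when no feature helps
            break
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)

    print("selected:", selected, "cv accuracy: %.3f" % best_score)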
Visualising the search space…
Run to completion, greedy forward search evaluates at most M + (M−1) + … + 1 = M(M+1)/2 feature sets: quadratic in M, rather than the 2^M of exhaustive search.
Maybe we cannot, or don’t want to, build a classifier.
How inherently “useful” is a feature?
Can we say how “useful” a feature is?
Imagine you’re trying to guess the price of a car.
Relevant: engine size, age, mileage, presence of rust, …
Irrelevant: colour of windscreen wipers, size of wheels, stickers on window, …
Redundant: age / mileage (they carry overlapping information).
“Filters”
Relevancy = Correlation?
How often have you heard the phrase “X is correlated with Y” ?
All of these scatterplots have r = 0.81. But:
- Pearson only detects LINEAR relationships.
- It is univariate: it considers only one feature at a time.
- It assumes two real-valued variables.
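A quick illustration of the linearity point (a sketch, not from the slides): y is a deterministic function of x, yet Pearson’s r is essentially zero.

    # Pearson's r misses nonlinear dependence: y is fully determined
    # by x, but the relationship is not linear.
    import numpy as np

    x = np.linspace(-1, 1, 101)
    y = x ** 2                      # perfect dependence, zero linear trend
    r = np.corrcoef(x, y)[0, 1]
    print("Pearson r = %.3f" % r)   # prints ~0.000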
How about a classification problem?
Let’s use a simple “threshold” on variable X. Each point is a person in your database: green stars = “good” health, red circles = “bad” health. Here a threshold on X separates the classes cleanly: a useful feature, one that “discriminates” very well. (See the scoring sketch below.)
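As a sketch of what “discriminates well” means here (best_threshold_accuracy is a hypothetical helper, not from the slides), a single feature can be scored by the accuracy of the best simple threshold rule.

    # Score one feature by the best achievable accuracy of a rule
    # "predict class 1 if x > t" (or its flipped version).
    import numpy as np

    def best_threshold_accuracy(x, y):
        best = 0.0
        for t in np.unique(x):
            pred = (x > t).astype(int)
            acc = (pred == y).mean()
            best = max(best, acc, 1.0 - acc)  # allow flipped polarity
        return best

    x = np.array([0.1, 0.35, 0.4, 0.75, 0.8, 0.9])
    y = np.array([0, 0, 0, 1, 1, 1])
    print(best_threshold_accuracy(x, y))  # 1.0: highly "discriminative"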
How about a classification problem?
Let’s use a simple “threshold” on variable X again (green stars = “good” health, red circles = “bad” health). This time there is no useful threshold! The feature is not “discriminative”.
Fisher Score
F = (m1 − m2)^2 / (v1 + v2)
where m1, m2 are the per-class means of the feature, and v1, v2 are the per-class variances.
(m1 − m2)^2 is called the between-class scatter: BIG for good features.
v1 + v2 is called the within-class scatter: SMALL for good features.
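A minimal sketch of the Fisher score for a single feature of a two-class problem; fisher_score is an illustrative name, not a library function.

    # Fisher score: between-class scatter over within-class scatter.
    import numpy as np

    def fisher_score(x, y):
        """x: 1-D feature values; y: binary labels (0/1)."""
        x0, x1 = x[y == 0], x[y == 1]
        between = (x0.mean() - x1.mean()) ** 2  # (m1 - m2)^2
        within = x0.var() + x1.var()            # v1 + v2
        return between / within

    x = np.array([1.0, 1.2, 0.9, 3.1, 2.8, 3.0])
    y = np.array([0, 0, 0, 1, 1, 1])
    print(fisher_score(x, y))  # large value => discriminative feature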
How useful is a single measurement?
Imagine a feature whose values run from small to big for the two classes. (Figure from Guyon & Elisseeff, “An Introduction to Variable and Feature Selection”, Journal of Machine Learning Research, 2003.)
Considering features together…
Now consider two features jointly rather than one at a time. (Figure from Guyon & Elisseeff, 2003.)
Two irrelevant features may be relevant together
The classic case is an XOR-style pattern: neither feature separates the classes on its own, yet the pair does. (Figure from Guyon & Elisseeff, 2003; a sketch follows below.)
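To see this concretely, a sketch (assuming scikit-learn) of an XOR-style dataset: each feature’s per-class means coincide, so any univariate score such as the Fisher score is near zero, yet a classifier using both features is perfect.

    # XOR: features useless alone, perfect together.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 2)).astype(float)
    y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)

    for j in range(2):  # per-class means of each feature nearly coincide
        print("feature %d class means: %.2f vs %.2f"
              % (j, X[y == 0, j].mean(), X[y == 1, j].mean()))

    clf = DecisionTreeClassifier().fit(X, y)
    print("accuracy using both features:", clf.score(X, y))  # 1.0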
How useful is a feature? Need some kind of “dependency” measure…
e.g. Pearson’s correlation … but it assumes linearity.
Fisher score … but it assumes Gaussianity.
And both ignore feature interactions.
Mutual Information
I(X;Y) = Σ_x Σ_y p(x,y) log [ p(x,y) / ( p(x) p(y) ) ]
It measures the dependency of X and Y: zero when they are independent, maximal when they are identical.
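A minimal sketch of scoring features by their estimated mutual information with the class label, assuming scikit-learn’s mutual_info_classif estimator.

    # Estimate I(X_j; Y) for each feature of a standard dataset.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import mutual_info_classif

    X, y = load_iris(return_X_y=True)
    mi = mutual_info_classif(X, y, random_state=0)
    for j, score in enumerate(mi):
        print("feature %d: estimated I(X;Y) = %.3f" % (j, score))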
“Filter” methods: Three Ingredients
J(X;Y) is the dependency criterion. The three ingredients are:
- A dependency measure J(X;Y), e.g. Pearson’s correlation, Fisher score, or mutual information. Given a feature X and the target Y, it scores their dependency (e.g. J(X;Y) = 0.6), so we select the most relevant features and discard the irrelevant ones.
- A search procedure: keep a bag of selected features S, and iteratively add/remove features according to the criterion.
- A stopping criterion: continue until it is met.
(A sketch combining all three appears below.)
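Putting the three ingredients together, a minimal sketch assuming scikit-learn: mutual information as the dependency measure, ranking as the (trivial) search procedure, and “keep the top k” as the stopping criterion.

    # Filter method: rank features by mutual information, keep the top k.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    X, y = make_classification(n_samples=300, n_features=10,
                               n_informative=3, random_state=0)
    selector = SelectKBest(score_func=mutual_info_classif, k=3)
    X_reduced = selector.fit_transform(X, y)
    print("kept columns:", selector.get_support(indices=True))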
Feature Selection
Useful to:
- reduce the chance of overfitting
- reduce computational complexity at test time
- increase interpretability
Many methods:
- wrappers vs. filters, with pros and cons of each
- many variants of filters
Projects due next Friday, 4pm
This is the End of COMP61011. That’s it. We’re done.
Exam in January – past papers on the website.
You need to submit a hardcopy to SSO:
- your 6-page (maximum) report
You also need to send:
- the report as PDF, and a ZIP file of your code