COMP61011 Foundations of Machine Learning Feature Selection
Only 200 papers in the world! I wish!
Square Kilometre Array (due 2024)
The world’s largest radio telescope array, producing around 1 terabyte of data per second. Stellar objects need to be classified in real time.
Supervised Learning
Training data + labels (possibly high dimensional) are used to build a model; at test time, the model maps a test input to a label prediction. This is the standard supervised learning scenario.
High Dimensional Data (this is real, on a USB stick on my desk – 41,672 features, 59 patients)
Recap: training data + labels build the model; a test input then produces a label prediction.
Supervised Learning + Feature Selection
Training data + labels; select a subset of features (i.e. columns); build the model; a test input then produces a label prediction.
The “Wrapper” approach
“You want to build a model… so just do it.” Try a feature set, build the model, evaluate the model, repeat.
Can we just do an exhaustive search…? Represent a feature set as a bit string: a bit set to 1 means we use that feature, otherwise 0, so a string containing eight 1s uses 8 features.
With M total features there are 2^M possible sets!
- 20 features … about 1 million feature sets to check
- 25 features … about 33.5 million sets
- 30 features … about 1.1 billion sets
(A sketch of this exhaustive search appears below.)
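To make the combinatorics concrete, here is a minimal sketch in Python (an illustration, not part of the original slides) of exhaustive wrapper search over bitmask-encoded feature sets; train_and_evaluate is a hypothetical placeholder for fitting and scoring a model.

    # A sketch of exhaustive wrapper search over bitmask-encoded subsets.
    # Only feasible for tiny M: there are 2**M candidate sets.
    import itertools

    M = 4  # deliberately tiny; M = 30 would mean ~1.1 billion subsets
    for bits in itertools.product([0, 1], repeat=M):
        subset = [i for i, b in enumerate(bits) if b == 1]
        # a real wrapper would call train_and_evaluate(subset) here
        print(bits, "->", subset)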
The “Wrapper” approach
“You want to build a model… so just do it.” Simplest strategy: greedy forward search.
REPEAT:
1. Try out each of the remaining features with your model.
2. Add the “best” one.
UNTIL satisfied with accuracy/error.
(See the sketch below.)
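A minimal sketch of this greedy loop, assuming scikit-learn is available; the choice of logistic regression and 5-fold cross-validated accuracy is illustrative, not prescribed by the slides.

    # Greedy forward selection: repeatedly add the feature that most
    # improves cross-validated accuracy, stopping when nothing helps.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    selected, remaining = [], list(range(X.shape[1]))
    best_score = 0.0
    while remaining:
        # score each remaining feature when added to the current set
        scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:  # stop when no feature helps
            break
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)

    print("selected:", selected, "cv accuracy: %.3f" % best_score)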
Visualising the search space…
Run to completion, greedy forward search evaluates at most M + (M−1) + … + 1 = M(M+1)/2 feature sets: quadratic in M, rather than the 2^M of exhaustive search.
Maybe we cannot, or don’t want to, build a classifier.
How inherently “useful” is a feature?
Can we say how “useful” a feature is?
Imagine you’re trying to guess the price of a car.
Relevant: engine size, age, mileage, presence of rust, …
Irrelevant: colour of windscreen wipers, size of wheels, stickers on window, …
Redundant: age / mileage (they carry overlapping information).
“Filters”
Relevancy = Correlation?
How often have you heard the phrase “X is correlated with Y” ?
All of these scatterplots have r = 0.81. But:
- Pearson only detects LINEAR relationships.
- It is univariate: it considers only one feature at a time.
- It assumes two real-valued variables.
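A quick illustration of the linearity point (a sketch, not from the slides): y is a deterministic function of x, yet Pearson’s r is essentially zero.

    # Pearson's r misses nonlinear dependence: y is fully determined
    # by x, but the relationship is not linear.
    import numpy as np

    x = np.linspace(-1, 1, 101)
    y = x ** 2                      # perfect dependence, zero linear trend
    r = np.corrcoef(x, y)[0, 1]
    print("Pearson r = %.3f" % r)   # prints ~0.000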
How about a classification problem?
Let’s use a simple “threshold” on variable X. Each point is a person in your database: green stars = “good” health, red circles = “bad” health. Here a threshold on X separates the classes cleanly: a useful feature, one that “discriminates” very well. (See the scoring sketch below.)
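As a sketch of what “discriminates well” means here (best_threshold_accuracy is a hypothetical helper, not from the slides), a single feature can be scored by the accuracy of the best simple threshold rule.

    # Score one feature by the best achievable accuracy of a rule
    # "predict class 1 if x > t" (or its flipped version).
    import numpy as np

    def best_threshold_accuracy(x, y):
        best = 0.0
        for t in np.unique(x):
            pred = (x > t).astype(int)
            acc = (pred == y).mean()
            best = max(best, acc, 1.0 - acc)  # allow flipped polarity
        return best

    x = np.array([0.1, 0.35, 0.4, 0.75, 0.8, 0.9])
    y = np.array([0, 0, 0, 1, 1, 1])
    print(best_threshold_accuracy(x, y))  # 1.0: highly "discriminative"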
How about a classification problem?
Let’s use a simple “threshold” on variable X again (green stars = “good” health, red circles = “bad” health). This time there is no useful threshold! The feature is not “discriminative”.
Fisher Score
F = (m1 − m2)^2 / (v1 + v2)
where m1, m2 are the per-class means of the feature, and v1, v2 are the per-class variances.
(m1 − m2)^2 is called the between-class scatter: BIG for good features.
v1 + v2 is called the within-class scatter: SMALL for good features.
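A minimal sketch of the Fisher score for a single feature of a two-class problem; fisher_score is an illustrative name, not a library function.

    # Fisher score: between-class scatter over within-class scatter.
    import numpy as np

    def fisher_score(x, y):
        """x: 1-D feature values; y: binary labels (0/1)."""
        x0, x1 = x[y == 0], x[y == 1]
        between = (x0.mean() - x1.mean()) ** 2  # (m1 - m2)^2
        within = x0.var() + x1.var()            # v1 + v2
        return between / within

    x = np.array([1.0, 1.2, 0.9, 3.1, 2.8, 3.0])
    y = np.array([0, 0, 0, 1, 1, 1])
    print(fisher_score(x, y))  # large value => discriminative feature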
How useful is a single measurement?
Imagine a feature whose values run from small to big for the two classes. (Figure from Guyon & Elisseeff, “An Introduction to Variable and Feature Selection”, Journal of Machine Learning Research, 2003.)
Considering features together…
Now consider two features jointly rather than one at a time. (Figure from Guyon & Elisseeff, 2003.)
Two irrelevant features may be relevant together
The classic case is an XOR-style pattern: neither feature separates the classes on its own, yet the pair does. (Figure from Guyon & Elisseeff, 2003; a sketch follows below.)
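To see this concretely, a sketch (assuming scikit-learn) of an XOR-style dataset: each feature’s per-class means coincide, so any univariate score such as the Fisher score is near zero, yet a classifier using both features is perfect.

    # XOR: features useless alone, perfect together.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 2)).astype(float)
    y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)

    for j in range(2):  # per-class means of each feature nearly coincide
        print("feature %d class means: %.2f vs %.2f"
              % (j, X[y == 0, j].mean(), X[y == 1, j].mean()))

    clf = DecisionTreeClassifier().fit(X, y)
    print("accuracy using both features:", clf.score(X, y))  # 1.0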
How useful is a feature? Need some kind of “dependency” measure…
e.g. Pearson’s correlation … but it assumes linearity.
Fisher score … but it assumes Gaussianity.
And both ignore feature interactions.
Mutual Information
I(X;Y) = Σ_x Σ_y p(x,y) log [ p(x,y) / ( p(x) p(y) ) ]
It measures the dependency of X and Y: zero when they are independent, maximal when they are identical.
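A minimal sketch of scoring features by their estimated mutual information with the class label, assuming scikit-learn’s mutual_info_classif estimator.

    # Estimate I(X_j; Y) for each feature of a standard dataset.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import mutual_info_classif

    X, y = load_iris(return_X_y=True)
    mi = mutual_info_classif(X, y, random_state=0)
    for j, score in enumerate(mi):
        print("feature %d: estimated I(X;Y) = %.3f" % (j, score))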
“Filter” methods: Three Ingredients
J(X;Y) is the dependency criterion. The three ingredients are:
- A dependency measure J(X;Y), e.g. Pearson’s correlation, Fisher score, or mutual information. Given a feature X and the target Y, it scores their dependency (e.g. J(X;Y) = 0.6), so we select the most relevant features and discard the irrelevant ones.
- A search procedure: keep a bag of selected features S, and iteratively add/remove features according to the criterion.
- A stopping criterion: continue until it is met.
(A sketch combining all three appears below.)
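Putting the three ingredients together, a minimal sketch assuming scikit-learn: mutual information as the dependency measure, ranking as the (trivial) search procedure, and “keep the top k” as the stopping criterion.

    # Filter method: rank features by mutual information, keep the top k.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    X, y = make_classification(n_samples=300, n_features=10,
                               n_informative=3, random_state=0)
    selector = SelectKBest(score_func=mutual_info_classif, k=3)
    X_reduced = selector.fit_transform(X, y)
    print("kept columns:", selector.get_support(indices=True))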
Feature Selection
Useful to:
- reduce the chance of overfitting
- reduce computational complexity at test time
- increase interpretability
Many methods:
- wrappers vs. filters, with pros and cons of each
- many variants of filters
Projects due next Friday, 4pm
This is the End of COMP61011. That’s it. We’re done.
Exam in January – past papers on the website.
You need to submit a hardcopy to SSO:
- your 6-page (maximum) report
You also need to send:
- the report as PDF, and a ZIP file of your code