Machine Learning in Practice Lecture 22

Machine Learning in Practice Lecture 22 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

Plan for the Day: Announcements; Multi-Level Cross-Validation Questions?; Multi-Level Cross-Validation; Feature Selection

Setting Up the Experimenter for Regression Problems

Cascading Classifiers

Advanced Cross Validation for Hierarchical Models (Diagram: raw features feed Animal and Plant models, whose predictions become new features for the Living Things target class.) Let's say you want to train a model to predict whether something is a living thing. You know that animals have a lot in common with each other, and plants have a lot in common with each other. You think it would be easier to predict Living Thing if you first have models to predict Animal and Plant. Describe the steps you'll go through to build the Living Thing model.

Remember the cluster feature example (scatter plot of Class 1 and Class 2).

Added structure makes it easier to detect Class 1.

Advanced Cross Validation for Hierarchical Models (Diagram: raw features feed Classifiers A and B, whose outputs become new features for Classifier C, which predicts the target class.) You will use the results of Classifiers A and B to train Classifier C. You need labeled data for classes A and B to train Classifiers A and B, but you don't want to train Classifier C with those perfect labels. You can use cross-validation to get "noisy" versions of A and B in the training data for C.

Advanced Cross Validation (Table: columns A, B, C, D, and F; rows grouped into segments 1-7.) Let's say each instance in your data has features A, B, and C. You are trying to predict F. You want to add a new feature D, produced by its own trained classifier, which you think will help you detect F better.

Advanced Cross Validation Fold 1: Train a d classifier over segments 2-7. Use this model to apply D labels to segment 1. Train an f classifier over 2-7. Since the D labels in segment 1 will be noisy, you need to train your f classifier with noisy D labels in 2-7; you can get those noisy labels using cross-validation within 2-7. Use the f model to apply F labels to segment 1.

Advanced Cross Validation Think about how to get those noisy labels using cross-validation within 2-7: Train a d classifier on 3-7 to apply noisy D labels to 2. Train a d classifier on 2 plus 4-7 to apply noisy D labels to 3. And so on. Now you have noisy D labels for 2-7 in addition to the perfect D labels you started with. You will use these noisy labels to train f (not the perfect ones!).

Advanced Cross Validation Note: that was just ONE fold! The same two-level procedure is repeated for each of the remaining folds.
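A rough sketch of this procedure with Weka's Java API is below. The file name (living_things.arff), the attribute indices chosen for D and F, and the use of J48 are all hypothetical placeholders, and fold boundaries are simple contiguous blocks; the point is only the shape of the outer and inner cross-validation loops.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CascadedCrossValidation {
    static final int FOLDS = 7;  // the seven segments from the slides
    static final int D_IDX = 3;  // hypothetical index of the intermediate label D
    static final int F_IDX = 4;  // hypothetical index of the final target F

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("living_things.arff");  // hypothetical file
        data.setClassIndex(F_IDX);
        data.randomize(new Random(1));

        for (int fold = 0; fold < FOLDS; fold++) {
            Instances train = data.trainCV(FOLDS, fold);  // e.g. segments 2-7
            Instances test  = data.testCV(FOLDS, fold);   // e.g. segment 1

            // 1. Train the d classifier on all training segments with the gold D labels.
            //    (In a real experiment you would also remove the F column here, e.g. with
            //    the Remove filter, so the gold F labels never leak into D predictions.)
            Instances trainGold = new Instances(train);
            trainGold.setClassIndex(D_IDX);
            Classifier dModel = new J48();
            dModel.buildClassifier(trainGold);

            // 2. Inner cross-validation over the training segments: train d on the other
            //    inner folds and write NOISY D predictions into a copy of the training
            //    data, keeping the gold D labels untouched in trainGold.
            Instances trainNoisy = new Instances(train);
            trainNoisy.setClassIndex(F_IDX);
            int innerFolds = FOLDS - 1, n = trainGold.numInstances();
            for (int k = 0; k < innerFolds; k++) {
                int lo = k * n / innerFolds, hi = (k + 1) * n / innerFolds;
                Instances innerTrain = new Instances(trainGold, 0);
                for (int i = 0; i < n; i++)
                    if (i < lo || i >= hi) innerTrain.add(trainGold.instance(i));
                Classifier innerD = new J48();
                innerD.buildClassifier(innerTrain);
                for (int i = lo; i < hi; i++)
                    trainNoisy.instance(i).setValue(D_IDX,
                        innerD.classifyInstance(trainGold.instance(i)));
            }

            // 3. Train the f classifier on training data whose D column is noisy.
            Classifier fModel = new J48();
            fModel.buildClassifier(trainNoisy);

            // 4. On the held-out segment, fill in D with the dModel from step 1,
            //    then score f against the gold F labels.
            int correct = 0;
            for (int i = 0; i < test.numInstances(); i++) {
                test.instance(i).setValue(D_IDX, dModel.classifyInstance(test.instance(i)));
                if (fModel.classifyInstance(test.instance(i)) == test.instance(i).classValue())
                    correct++;
            }
            System.out.printf("Fold %d accuracy: %.3f%n",
                fold + 1, (double) correct / test.numInstances());
        }
    }
}
```

Note the asymmetry: the d model applied to the held-out segment is trained on all of the training segments, while the noisy D labels inside the training data come from the inner cross-validation.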

Remember: Dumping Labels from Weka. Save the output buffer, pull the results section out, and use the predicted column. NOTE: if you do this using Weka's cross-validation, you won't be able to match up the instance numbers!
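If you need predictions lined up with specific instances, one workaround (a sketch, not the only way) is to skip the output buffer and loop over the folds yourself, so you control which row gets which prediction. The file name and the choice of Naive Bayes below are placeholders, and the folds are simple unshuffled blocks so that row order is preserved.

```java
import java.io.PrintWriter;
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DumpPredictions {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);
        // Deliberately NOT randomized: row i of the output is row i of the ARFF file,
        // so the folds are contiguous, unstratified blocks of instances.

        int folds = 10, n = data.numInstances();
        try (PrintWriter out = new PrintWriter("predictions.csv")) {
            out.println("instance,actual,predicted");
            for (int fold = 0; fold < folds; fold++) {
                int lo = fold * n / folds, hi = (fold + 1) * n / folds;  // this fold's test rows
                Instances train = new Instances(data, 0);
                for (int i = 0; i < n; i++)
                    if (i < lo || i >= hi) train.add(data.instance(i));
                Classifier cls = new NaiveBayes();
                cls.buildClassifier(train);
                for (int i = lo; i < hi; i++) {
                    double pred = cls.classifyInstance(data.instance(i));
                    out.printf("%d,%s,%s%n", i,
                        data.classAttribute().value((int) data.instance(i).classValue()),
                        data.classAttribute().value((int) pred));
                }
            }
        }
    }
}
```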

Feature Selection

Why do irrelevant features hurt performance? Divide-and-conquer approaches have the problem that the further down in the tree you get, the less data you are paying attention to, so it's easy for the classifier to get confused. Naïve Bayes does not have this problem, but it has other problems, as we have discussed. SVMs are relatively good at ignoring irrelevant attributes, but they can still suffer; they are also very computationally expensive with large attribute spaces.
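You can see the effect for yourself by padding a data set with purely random attributes and comparing cross-validated accuracy before and after; instance-based learners such as k-nearest-neighbor are usually hit hardest. The file name below is a placeholder, the 20 noise attributes are an arbitrary choice, and the sketch assumes the class is the last attribute.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class IrrelevantAttributeDemo {
    public static void main(String[] args) throws Exception {
        Instances clean = DataSource.read("mydata.arff");   // placeholder file name
        clean.setClassIndex(clean.numAttributes() - 1);      // assumes class is last

        // Copy the data and pad it with 20 numeric attributes full of random noise.
        Instances noisy = new Instances(clean);
        Random rand = new Random(1);
        for (int a = 0; a < 20; a++) {
            noisy.insertAttributeAt(new Attribute("random" + a), 0);
            for (int i = 0; i < noisy.numInstances(); i++)
                noisy.instance(i).setValue(0, rand.nextDouble());
        }
        noisy.setClassIndex(noisy.numAttributes() - 1);       // class is still last

        // Compare 10-fold cross-validated accuracy of 3-nearest-neighbor on both versions.
        for (Instances d : new Instances[] { clean, noisy }) {
            Evaluation eval = new Evaluation(d);
            eval.crossValidateModel(new IBk(3), d, 10, new Random(1));
            System.out.printf("%d predictors -> accuracy %.3f%n",
                d.numAttributes() - 1, eval.pctCorrect() / 100.0);
        }
    }
}
```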

Two Paradigms for Attribute Selection
Wrapper Method: Evaluate each candidate subset using the algorithm that will be used for the classification, in terms of how the classifier does with that subset. Use a search method like Best First.
Filter Method: Use an independent metric of feature goodness; rank the attributes and then select.
Don't be confused – this is not the "standard" usage of filter versus wrapper.
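In Weka's API (the Select attributes tab wires up the same pieces), the two paradigms differ only in which evaluator and search method you plug in. A rough sketch, assuming a placeholder ARFF file with the class as the last attribute; the cutoff of 20 attributes and the choice of Naive Bayes inside the wrapper are arbitrary:

```java
import java.util.Arrays;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.Ranker;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FilterVsWrapper {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // "Filter" paradigm: score every attribute independently, then keep the top k.
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new ChiSquaredAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(20);                          // arbitrary cutoff
        filter.setSearch(ranker);
        filter.SelectAttributes(data);
        System.out.println("Filter picked: " + Arrays.toString(filter.selectedAttributes()));

        // "Wrapper" paradigm: evaluate whole subsets by how the actual learner
        // (here Naive Bayes) performs with them, searching the space with BestFirst.
        AttributeSelection wrapper = new AttributeSelection();
        WrapperSubsetEval wrapEval = new WrapperSubsetEval();
        wrapEval.setClassifier(new NaiveBayes());
        wrapper.setEvaluator(wrapEval);
        wrapper.setSearch(new BestFirst());
        wrapper.SelectAttributes(data);
        System.out.println("Wrapper picked: " + Arrays.toString(wrapper.selectedAttributes()));
    }
}
```

In both cases selectedAttributes() returns the chosen attribute indices, with the class index appended at the end.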

How do you evaluate feature goodness apart from the learning algorithm? Notice a combination of scoring heuristics and search methods

How do you evaluate feature goodness apart from the learning algorithm? (evaluating subsets of features) Look for the smallest set of attributes that distinguishes every training instance from every other training instance. A problem occurs if there are two instances with the same attribute values but different classes. You could use decision tree learning to pick out a subset of attributes to use with a different algorithm. It will have no effect if you use it with decision trees, but it might work well with instance-based learning, to avoid having it be confused by irrelevant attributes.

How do you evaluate feature goodness apart from the learning algorithm? (evaluating individual features) You can rank attributes for decision trees using 1R to compensate for the bias towards selecting features that branch heavily. You can also look at the correlation between each feature and the class attribute.

Efficiently Navigating the Attribute Space Evaluating individual attributes and then ranking them is the most efficient approach to attribute selection; that's what we have been doing up until now with ChiSquaredAttributeEval. Searching for the optimal subset of features based on evaluating subsets together is more complex. Exhaustive search for the optimal subset of attributes is not tractable, so use a greedy search such as BestFirst.

Efficiently Navigating the Attribute Space Remember that greedy methods are efficient, but they sometimes get stuck in locally optimal solutions.
Forward selection: Start with nothing and add attributes. On each round, pick the attribute that will have the biggest estimated positive effect on performance.
Backward elimination: Start with the whole set and prune. On each round, select the attribute that seems to be dragging down performance the most.
Bidirectional search methods combine the two.

Forward Selection

Forward Selection * Pick the most predictive feature.

Forward Selection * Pick the next most predictive feature, or the one that gives the pair the most predictive power altogether – in the case of the wrapper method, using the classification algorithm you will eventually use.

Forward Selection * Pick the next most predictive feature, or the one that gives the set the most predictive power altogether – in the case of the wrapper method, using the classification algorithm you will eventually use.

Backward Elimination * On each round, pick the least predictive feature, i.e., the one you can drop without hurting the predictiveness of the remaining features, or even helping it.
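Weka's GreedyStepwise search implements both strategies; a single flag switches between forward selection and backward elimination. A sketch with CfsSubsetEval as the subset evaluator and a placeholder file name:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ForwardVsBackward {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        for (boolean backwards : new boolean[] { false, true }) {
            GreedyStepwise search = new GreedyStepwise();
            search.setSearchBackwards(backwards);  // false = forward selection,
                                                   // true  = backward elimination
            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(new CfsSubsetEval());  // any subset evaluator would do
            sel.setSearch(search);
            sel.SelectAttributes(data);

            // selectedAttributes() appends the class index, hence the "- 1".
            System.out.println((backwards ? "Backward" : "Forward") + " search kept "
                + (sel.selectedAttributes().length - 1) + " attributes");
        }
    }
}
```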

Efficiently Searching the Attribute Space You can use a beam search method rather than selecting a single attribute on each round. Race search: stop when you don't get any statistically significant increase from one round to the next. Option: Schemata search is like race search except that you rank attributes first and then race the top-ranking attributes (more efficient!). Sometimes you include a random selection of other attributes in with the selected attributes.

Efficiently Searching the Attribute Space Different approaches will make different mistakes. Backward elimination produces larger attribute sets, and often better performance, than forward selection. Forward selection is good for eliminating redundant attributes or attributes with dependencies between them, which is good for Naïve Bayes; it is also better if you want to be able to understand the trained model.

Selecting an Attribute Selection Technique

Attribute Selection Options: Evaluating Subsets
CfsSubsetEval: looks for a subset of attributes that are highly correlated with the predicted class but have low inter-correlation with each other.
ClassifierSubsetEval: evaluates a subset by using a selected classifier to compute performance.
ConsistencySubsetEval: evaluates the goodness of a subset of attributes based on the consistency of instances that are close to each other in the reduced attribute space.

Attribute Selection Options: More Evaluators
SVMAttributeEval: ranks attributes by backwards elimination using an SVM; you specify the number or percent of attributes to get rid of on each iteration.
WrapperSubsetEval: just like ClassifierSubsetEval, but uses cross-validation to estimate the performance of each subset.
ChiSquaredAttributeEval: evaluates the worth of a feature by computing the chi-squared statistic of the attribute in relation to the predicted class.
GainRatioAttributeEval: like ChiSquared but using Gain Ratio.

Attribute Selection Options: Evaluating Individual Attributes
InfoGainAttributeEval: like ChiSquared but using Information Gain.
OneRAttributeEval: like ChiSquared but looks at the accuracy of using a single attribute for classification.
ReliefFAttributeEval: evaluates the worth of an attribute by comparing its value on neighboring instances of the same and of different classes.
SymmetricalUncertAttributeEval: like ChiSquared but uses symmetrical uncertainty.
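Whichever evaluator you pick, remember the multi-level cross-validation point from earlier in the lecture: if you select attributes on the full data set and then cross-validate, the selection step has already seen the test folds. Weka's AttributeSelectedClassifier avoids this by redoing the selection inside each training fold. A sketch with CfsSubsetEval, BestFirst, and J48 (all placeholder choices, as is the file name):

```java
import java.util.Random;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectionInsideCV {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // The meta-classifier re-runs attribute selection on each training fold,
        // so the test fold never influences which attributes get chosen.
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(new CfsSubsetEval());
        asc.setSearch(new BestFirst());
        asc.setClassifier(new J48());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```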

Subsets in a new vector space?

Remember Matrix Multiplication * Notice you ended up with fewer attributes!

What can we do with that? Using linear algebra, project one vector space onto a more compact one, then select the top N dimensions that, as a set, explain the most variance in your data.
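For example, an n x d data matrix X times a d x k projection matrix W gives an n x k matrix: the same instances described by fewer attributes. The numbers below are made up purely to show the shapes:

```latex
\[
\underbrace{\begin{pmatrix}
1 & 0 & 2 & 1\\
0 & 3 & 1 & 2\\
2 & 1 & 0 & 1
\end{pmatrix}}_{X:\; 3 \times 4}
\;
\underbrace{\begin{pmatrix}
0.5 & 0.1\\
0.2 & 0.7\\
0.4 & 0.3\\
0.1 & 0.6
\end{pmatrix}}_{W:\; 4 \times 2}
=
\underbrace{\begin{pmatrix}
1.4 & 1.3\\
1.2 & 3.6\\
1.3 & 1.5
\end{pmatrix}}_{XW:\; 3 \times 2}
\]
```

PCA chooses the columns of W to be the top eigenvectors of the data's covariance matrix; a random projection simply fills W with random values.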

Subsets in a new vector space PrincipalComponents: use principal components analysis to select a subset of eigenvectors from the diagonalized covariance matrix that accounts for a certain percentage of the variance in the data. A cheaper method is to use a random projection onto a smaller vector space. It's not as good as principal components analysis, but not that much worse either.
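Both options exist as Weka filters; the sketch below assumes the unsupervised PrincipalComponents and RandomProjection filters and their standard options, though exact option names can vary a little across Weka versions. The file name, the 95% variance threshold, and the 20 random dimensions are placeholder choices.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;
import weka.filters.unsupervised.attribute.RandomProjection;

public class NewVectorSpaces {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Principal components: keep enough eigenvectors to cover 95% of the variance.
        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.95);
        pca.setInputFormat(data);
        Instances pcaData = Filter.useFilter(data, pca);
        System.out.println("PCA space: " + (pcaData.numAttributes() - 1) + " attributes");

        // Random projection: cheaper, project straight onto 20 random dimensions.
        RandomProjection rp = new RandomProjection();
        rp.setNumberOfAttributes(20);
        rp.setInputFormat(data);
        Instances rpData = Filter.useFilter(data, rp);
        System.out.println("Random projection space: " + (rpData.numAttributes() - 1) + " attributes");
    }
}
```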

Selecting an Attribute Selection Technique Avoid Combinations that Don't Make Sense Together

E.g., Forward Selection Can Waste a Lot of Time! * Forward selection and backward elimination are much slower than just ranking attributes based on a metric that can be applied to one attribute at a time. So if you are using such a metric, just use a ranking selection technique rather than a search selection technique. Using search won't change the result; it will just waste a lot of time!

Does it matter which one you pick? Consider the spam data set. Decision trees worked well because we needed to consider interactions between attributes. CfsSubsetEval and ChiSquaredAttributeEval consider attributes out of context, and both significantly reduce performance. In this case we're harming performance because we're ignoring interactions between attributes at the selection stage. CfsSubsetEval is significantly better than ChiSquaredAttributeEval for the same number of features. But with ChiSquaredAttributeEval you can choose to keep more features, and then you do not get a degradation in performance even though you reduce the feature space.

Take Home Message Multi-level cross-validation helps prevent over-estimating performance for tuned models; it is needed for cascaded classification in addition to more typical tuning. There is a wide range of feature selection techniques; make sure your options are consistent with one another. Search-based approaches are much slower than simple ranking approaches.