Machine Learning in Practice Lecture 23 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute
In the home stretch….
Announcements
Questions?
Quiz (second to last!)
Homework (last!)
Discretization and Time Series Transformations
Data Cleansing
About the Quiz….
[Figure: the data split into five parts (1–5); each fold holds out a different part as the test set]
For fold i in {1..5}:
Test set is test_i
Select one of the remaining sets to be validation_i
Concatenate the remaining sets into part_train_i
For each algorithm in {X1-1, X1-2, X2-1, X2-2}: train the algorithm on part_train_i and test on validation_i
Now concatenate all but test_i into train_i
Now train the best algorithm on train_i and test on test_i to get the performance for fold i
Average the performance over the 5 folds
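The selection-within-cross-validation procedure above can be sketched in Python. Everything concrete here is my assumption, not part of the quiz: scikit-learn as the library, a synthetic dataset, and four decision trees of different depths standing in for the candidates X1-1 through X2-2.

```python
# Sketch of the quiz procedure: inside each outer fold, pick the best of
# four candidate models on a validation set, then retrain the winner on
# all non-test data and score it on the held-out fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Four hypothetical candidates standing in for X1-1, X1-2, X2-1, X2-2
candidates = {name: DecisionTreeClassifier(max_depth=d, random_state=0)
              for name, d in [("X1-1", 1), ("X1-2", 2), ("X2-1", 4), ("X2-2", 8)]}

outer = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in outer.split(X):
    # Carve validation_i out of the four non-test parts; the rest is part_train_i
    part_train, val = train_test_split(train_idx, test_size=0.25, random_state=0)
    # Select the best candidate by validation accuracy
    best = max(candidates,
               key=lambda n: candidates[n].fit(X[part_train], y[part_train])
                                          .score(X[val], y[val]))
    # Retrain the winner on everything except test_i, score on test_i
    model = candidates[best].fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(fold_scores))  # average performance over the 5 folds
```

The key point the quiz is testing: the test fold never influences which algorithm is selected, only the final performance estimate.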
Discretization and Time Series Transforms
Discretization
Connection between discretization and clustering: finding natural breaks in your data
Connection between discretization and feature selection: you can think of each interval as a feature or a feature value
Discretizing before classification limits options for breaks
If you attempt to discretize and it fails to find a split that would have been useful, it has the effect of eliminating a feature
Discretization and Feature Selection
Adding breaks is like creating new attribute values
Each attribute value is potentially a new binary attribute
Inserting boundaries is like a forward-selection approach to attribute selection
Discretization and Feature Selection
Removing boundaries is like a backwards-elimination approach to attribute selection
Discretization
Discretization sometimes improves performance even if you don’t strictly need nominal attributes
Breaks in good places bias the classifier toward learning a good model
Decision tree learners do discretization locally when they are selecting an attribute to branch on
There are advantages and disadvantages to local discretization
Layers
Think of building a model in layers
You can build a complex shape by combining lots of simple shapes
We’ll come back to this idea when we talk about ensemble methods in the next lecture!
You could build a complex model all at once, or you could build it in a series of simple stages: discretization, feature selection, model building
Unsupervised Discretization
Equal intervals (equal-interval binning)
E.g., for temperature: breaks every 10 degrees
E.g., for weight: breaks every 5 pounds
Equal frequencies (equal-frequency binning)
E.g., groupings of about 10 instances
E.g., groupings of about 100 instances
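The two unsupervised strategies can be contrasted in a few lines of numpy; the temperature values below are invented for illustration.

```python
# Equal-interval binning places breaks at evenly spaced values; equal-frequency
# binning places breaks at quantiles so each bin holds roughly the same count.
import numpy as np

temps = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 80, 81, 83, 85, 96])

# Equal-interval: 3 bins of equal width across the observed range
width_edges = np.linspace(temps.min(), temps.max(), num=4)
width_bins = np.digitize(temps, width_edges[1:-1])

# Equal-frequency: 3 bins with breaks at the 1/3 and 2/3 quantiles
freq_edges = np.quantile(temps, [0, 1 / 3, 2 / 3, 1])
freq_bins = np.digitize(temps, freq_edges[1:-1])

print(np.bincount(width_bins, minlength=3))  # bin sizes can be very uneven
print(np.bincount(freq_bins, minlength=3))   # bin sizes roughly equal
```

Note how the outlier at 96 stretches the equal-width bins but barely affects the quantile-based breaks.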
Supervised Discretization
Supervised splitting: find the best split point by generating all possible splits and using attribute selection to pick one
Keep splitting until you don’t get value anymore
It’s a little like building a decision tree and then throwing the tree away, but keeping the grouping of instances at the leaf nodes
Entropy based: rank splits using information gain
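The entropy-based idea, selecting a single split, can be sketched as follows; the values and class labels are made up for the example.

```python
# Toy entropy-based split selection: try every candidate cut point and keep
# the one with the highest information gain.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    base = entropy(labels)
    best_gain, best_cut = 0.0, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue  # no boundary between equal values
        cut = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        # Weighted entropy remaining after the split
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
        if base - remainder > best_gain:
            best_gain, best_cut = base - remainder, cut
    return best_cut, best_gain

vals = np.array([64, 65, 68, 69, 70, 71, 72, 75, 80, 85])
labs = np.array(["no", "no", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes"])
print(best_split(vals, labs))  # cut at 68.5 cleanly separates the classes
```

Applied recursively to each resulting interval, with a stopping criterion, this is the "build a tree and keep only the leaf groupings" procedure described above.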
Built-In Supervised Discretization
NaiveBayes can be used with or without supervised discretization
The SpeakerID data set has numeric attributes that are not normally distributed
Without discretization: kappa = .16
With discretization: kappa = .34
Doing Discretization in Weka
Note: there is also an unsupervised discretization filter
attributeIndices: which attributes you want to discretize
Target class is set inside the classifier
Doing Discretization in Weka
The last two options are for the stopping criterion
It is not clear how it evaluates the goodness of each split; this is not well documented
Example for Time Series Transforms
Amount of CO2 in a room is related to how many people were in the room N minutes ago
Let’s say you take a measurement every N/2 minutes
Before you apply a numeric prediction model to predict CO2 from number of people, first copy number of people forward 2 instances
[Figure: a sequence of instances, each with NumPeople and AmountCO2 attributes; NumPeople is copied forward two instances, leaving a missing value (?) where no earlier instance exists]
Time Series Transforms
Fill in with the delta, or fill in with a previous value
instanceRange: you specify how many instances backward or forward to look (negative means backwards)
fillWithMissing: the default is to ignore the first and last instances; if true, missing is used as the value for those attributes
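Outside Weka, the same "copy an attribute forward two instances" transform is a one-liner in pandas; the column names and values below are invented for illustration, and the lag direction mirrors (but is not an exact replica of) the Weka filter's instanceRange option.

```python
# Shift NumPeople forward 2 instances so each row's CO2 lines up with the
# occupancy measured two readings (N minutes) earlier.
import pandas as pd

df = pd.DataFrame({
    "NumPeople": [3, 5, 8, 8, 2],
    "AmountCO2": [400, 410, 450, 520, 560],
})

# Look 2 instances back (like a negative instanceRange in Weka terms)
df["NumPeople_lag2"] = df["NumPeople"].shift(2)

# With fillWithMissing-style behavior the first rows hold NaN;
# the alternative is to drop them before training
df_clean = df.dropna()
print(df_clean)
```

The first two rows have no instance two steps back, which is exactly the boundary case the fillWithMissing option exists to handle.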
Data Cleansing
Data Cleansing: Removing Outliers
Noticing outliers is easier when you look at the overall distribution of your data
Especially when using human judgment: you know what doesn’t look right
It’s harder to tell automatically whether the problem is that your data doesn’t fit the model or that you have outliers
Eliminating Noise with Decision Tree Learning
Train a tree
Eliminate misclassified examples
Train on the clean subset of the data
You will get a simpler tree that generalizes better
You can do this iteratively
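The train/eliminate/retrain loop above can be sketched with scikit-learn; the dataset, the injected label noise, the depth limit, and the fixed three iterations are all assumptions for the sake of the illustration, not part of the lecture's recipe.

```python
# Iterative noise elimination: train a (depth-limited) tree, drop the
# training examples it misclassifies, and retrain on the cleaner subset.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data with ~10% of labels flipped to simulate noise
X, y = make_classification(n_samples=300, flip_y=0.1, random_state=1)

X_clean, y_clean = X, y
for _ in range(3):  # "you can do this iteratively"
    tree = DecisionTreeClassifier(max_depth=3, random_state=1)
    tree.fit(X_clean, y_clean)
    keep = tree.predict(X_clean) == y_clean  # eliminate misclassified examples
    X_clean, y_clean = X_clean[keep], y_clean[keep]

print(len(y), "->", len(y_clean))
```

The depth limit matters: an unpruned tree can memorize the noise and misclassify nothing, in which case the loop removes nothing at all.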
Data Cleansing: Removing Outliers
One way of identifying outliers is to look for examples that several algorithms misclassify
Algorithms moving down different optimization paths are unlikely to get trapped in the same local minimum
You can compensate for outliers by adjusting the learning algorithm
E.g., using absolute distance rather than squared distance for a regression problem
This doesn’t remove outliers, but it reduces their effect
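Why absolute distance helps can be seen in the simplest possible regression, fitting a single constant; the numbers below are made up to make the contrast obvious.

```python
# Under squared loss the best constant predictor is the mean, which an
# outlier drags far from the bulk of the data; under absolute loss it is
# the median, which barely moves.
import numpy as np

y = np.array([2.0, 2.1, 1.9, 2.0, 2.2, 50.0])  # last value is an outlier

mean_fit = y.mean()        # minimizes the sum of squared errors
median_fit = np.median(y)  # minimizes the sum of absolute errors

print(mean_fit, median_fit)  # mean pulled above 10; median stays near 2
```

The same intuition carries over to full regression models: squared loss penalizes a single large residual quadratically, so one outlier can dominate the fit.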
Take Home Message
Discretization is related to feature selection and clustering
Similar alternative search strategies
Think about learning a model in stages
Getting back to the idea of natural breaks in your data
It is difficult to tell with only one model whether a data point is noisy or the model is overly simplistic