Feature Engineering Studio February 23, 2015
Let’s start by discussing the HW
Assignment 3: Data Cleaning
  Look for outliers in your data set
  Find 3 variables that have one or more outliers (if you can)
  Identify those variables
  Give the mean, median, SD, and some outlier values for each
  For each variable, write a one-sentence “just so story” (or multiple just so stories) about what might have caused the outlier(s)
  Argue (briefly) for a reasonable approach to dealing with that variable’s outliers (and explain why your chosen approach is reasonable)
Everyone will present an outlier
  Alphabetical order based on first name
    – Tie-breaker: last name
  I’ll call out letters
    – Using the class roster failed last time
Tell us about your best outlier
  Mean, median, SD, and some outlier values
  Give your “just so story” (or multiple just so stories) about what might have caused the outlier(s)
  What do you plan to do about it (if anything)?
Questions? Comments?
Things you can do in Excel part 2 of 3
Identifying specific cases of interest
Did event of interest ever occur for student?
Ratios between events of interest
How many students had 3 (or 4, 5, 2,…) of an event
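For readers working outside Excel, the three per-student summaries named above can be sketched in plain Python. This is only an illustration: the log layout, the student IDs, and the choice of "help" as the event of interest are all assumptions, not part of the course data.

```python
from collections import Counter, defaultdict

# Hypothetical action log: one (student_id, action_type) row per logged action.
log = [
    ("s1", "help"), ("s1", "attempt"), ("s1", "help"), ("s1", "attempt"),
    ("s2", "attempt"), ("s2", "attempt"),
    ("s3", "help"), ("s3", "attempt"), ("s3", "attempt"), ("s3", "attempt"),
]

# Count each action type per student.
counts = defaultdict(Counter)
for student, action in log:
    counts[student][action] += 1

# Did the event of interest ever occur for the student?
ever_help = {s: c["help"] > 0 for s, c in counts.items()}

# Ratio between two events of interest (None when the denominator is zero).
help_per_attempt = {
    s: (c["help"] / c["attempt"] if c["attempt"] else None)
    for s, c in counts.items()
}

# How many students had exactly 0, 1, 2, ... help requests?
students_by_help_count = Counter(c["help"] for c in counts.values())
```

Each dictionary is keyed by student, so the values can be pasted back into a per-student sheet as new feature columns.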
Unitized actions (such as unitized time)
Last 3 or 5 unitized
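One possible reading of "unitized time" is the time an action took, expressed in standard deviations above or below the mean time for that problem step. Under that assumption, a sketch of the unitized value and of a "last 3" running feature might look like the following. The log layout and step names are hypothetical, and the population SD is used here where a spreadsheet might use the sample SD.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical log in time order: (student_id, step_id, seconds).
log = [
    ("s1", "stepA", 10.0), ("s2", "stepA", 20.0), ("s3", "stepA", 30.0),
    ("s1", "stepB", 5.0),  ("s2", "stepB", 5.0),  ("s3", "stepB", 11.0),
    ("s1", "stepC", 8.0),  ("s2", "stepC", 12.0), ("s3", "stepC", 16.0),
]

# Mean and (population) SD of time for each step, across all students.
times_by_step = defaultdict(list)
for _, step, seconds in log:
    times_by_step[step].append(seconds)
step_stats = {step: (mean(t), pstdev(t)) for step, t in times_by_step.items()}

def unitized(step, seconds):
    """Time in SDs above or below the step's mean (0.0 if the step has no variance)."""
    m, sd = step_stats[step]
    return (seconds - m) / sd if sd > 0 else 0.0

def last_n_unitized(rows, n=3):
    """For each row, the mean unitized time over that student's last n actions so far."""
    history = defaultdict(list)
    feature = []
    for student, step, seconds in rows:
        history[student].append(unitized(step, seconds))
        recent = history[student][-n:]
        feature.append(sum(recent) / len(recent))
    return feature
```

Calling `last_n_unitized(log, n=5)` gives the "last 5" variant; early rows simply average over however many actions the student has so far.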
Comparing earlier behaviors to later behaviors through caching
Counts-if
Percentages of action type
Percentages of time spent per action/location/KC/etc.
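The counts-if and percentage features above correspond roughly to Excel's COUNTIFS and SUMIFS. A minimal Python sketch of the same idea follows; the log layout and action names are assumptions for illustration.

```python
# Hypothetical log: (student_id, action_type, seconds).
log = [
    ("s1", "hint", 4.0), ("s1", "attempt", 6.0), ("s1", "attempt", 10.0),
    ("s2", "attempt", 5.0), ("s2", "hint", 15.0),
]

def count_if(rows, predicate):
    """Count the rows that satisfy a condition."""
    return sum(1 for row in rows if predicate(row))

def pct_of_actions(rows, student, action):
    """Percentage of a student's actions that are of the given type."""
    total = count_if(rows, lambda r: r[0] == student)
    hits = count_if(rows, lambda r: r[0] == student and r[1] == action)
    return 100.0 * hits / total if total else 0.0

def pct_of_time(rows, student, action):
    """Percentage of a student's total time spent on the given action type."""
    total = sum(r[2] for r in rows if r[0] == student)
    spent = sum(r[2] for r in rows if r[0] == student and r[1] == action)
    return 100.0 * spent / total if total else 0.0
```

The same two functions work for locations or KCs if the second column holds that label instead of the action type.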
List merging
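List merging joins two tables on a shared key, which in Excel is typically done with VLOOKUP. A rough Python equivalent, using made-up tables and column names, could be:

```python
# Hypothetical per-student feature table and a separate table of outcomes.
features = [
    {"student": "s1", "help_requests": 2},
    {"student": "s2", "help_requests": 0},
    {"student": "s3", "help_requests": 1},
]
outcomes = [
    {"student": "s1", "post_test": 0.8},
    {"student": "s3", "post_test": 0.5},
]

# Index one list by its key, then look each row of the other list up in it.
outcome_by_student = {row["student"]: row for row in outcomes}

merged = []
for row in features:
    match = outcome_by_student.get(row["student"])
    # Keep unmatched students with a missing value rather than dropping them.
    merged.append({**row, "post_test": match["post_test"] if match else None})
```

Keeping unmatched rows with `None` makes missing outcomes visible, much as a failed lookup shows up as `#N/A` in a spreadsheet.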
Pearson Correlation
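Excel computes Pearson correlation with CORREL or PEARSON. To make the arithmetic explicit, here is a small sketch of the standard formula in Python; the function name is an assumption for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)
```

For example, `pearson_r([1, 2, 3, 4], [1, 3, 2, 4])` gives 0.8: the cross-product sum is 4 and each sum of squared deviations is 5.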
T-tests
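Excel's T.TEST returns a p-value directly. The sketch below shows only the underlying statistic for one common variant, the two-sample unequal-variance (Welch) test, which is an assumption about which test is wanted. It returns the t statistic and degrees of freedom; turning those into a p-value needs a t-distribution CDF, which is left to a stats package.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each sample needs at least 2 values")
    # Squared standard error contributed by each sample (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    if va + vb == 0:
        raise ValueError("t is undefined when both samples have zero variance")
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

For a feature, `a` and `b` would be the feature's values for the two ground-truth groups (for instance, students who did and did not show the behavior).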
More complex stats in Excel
  I have worksheets that can do Chi-squared, Cohen’s Kappa, Extra-Sum-of-Squares F-test, and various meta-analytic methods in Excel
  But if you don’t really know what you’re doing, it’s better to use a stats package for these
What else might you want to do in Excel?
Questions? Comments?
HW4: Feature Engineering 1, “Bring Me a Rock”
  Get your data set
  Open it in Excel
  Create as many features as you feel inspired to create
    – Features should be created with the goal of predicting your ground truth variable
    – At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme; but “time for last 3 actions” and “total time between help requests and next action” are two separate features)
  For each feature, write a 1-3 sentence “just so story” for why it might work
  Test how good each feature is
Testing Feature Goodness
  For this assignment, there are a bunch of ways to test feature goodness
  Single-feature prediction models in a data mining or stats package, giving Pearson correlation, Spearman’s rho, or Cohen’s kappa (special session this Wednesday)
  Compute Pearson correlation in Excel
  Compute t-test in Excel
  Compute other metrics in Excel (but see earlier disclaimer)
Were you right?
  Which of your “just so stories” seem to be correct?
  Did any of your features correlate in the opposite direction from what you expected?
Assignment 4
  Write a brief report for me and send me an Excel sheet with your features
  You don’t need to prepare a presentation
  But be ready to discuss your features in class
Next Classes
  2/25 Special Session
    – Using RapidMiner to Produce Prediction Models
    – Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
    – Statistical significance tests using linear regression don’t count…
  3/2 Advanced Feature Distillation in Excel
    – HW4 due