Feature Engineering Studio September 23, 2013
Let’s start by discussing the HW
Sort into pairs
Partner with the person next to you
Go over your reports together
– A maximum of 5 minutes apiece
5 minutes for first person
5 minutes for second person
Re-assemble into one big group
Who here found something really cool while taking a first look at their data? Show us, tell us
Who here found a histogram with a normal distribution? Show us, tell us
Who here found a histogram with a hypermode? Show us, tell us
Who here found a histogram with a flat distribution? Show us, tell us
Who here found a histogram with a skewed distribution? Show us, tell us
Who here found a histogram with a bimodal distribution? Show us, tell us
Who here found a histogram with something else interesting? Show us, tell us
Who here found something surprising with their min, max, average, stdev?
Categorical variables
Who here found something curious, weird, or interesting in the distribution of their categorical variables?
Who here hasn’t spoken yet today? (who analyzed data) Tell us something interesting you found in your data
Who here did something else fun with their data?
Data Cleaning
What did folks think of the Romero article? In particular, the part on data cleaning
Outliers
You have a huge outlier in your data
You need to know what to do about it
More specifically
What kind of outliers should be dealt with?
And what kind of outliers should be left alone?
What do you think?
Ways to identify outliers
Theoretical approaches – “Students used this software during 45 minute class periods. I see a 972 minute period of time taken to make a response. Something must have gone wrong here.”
Deviation-based approaches – “All data that is 3 SD over or 3 SD under the mean will be treated as an outlier”
When is each approach justified? (Examples please)
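The two ways of identifying outliers can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the 45-minute limit passed as `max_plausible`, and the sample response times are all assumptions made up for the example (972 echoes the quote on the slide).

```python
from statistics import mean, stdev

def theoretical_outliers(values, max_plausible):
    """Flag values above a limit that comes from knowing the study context."""
    return [v for v in values if v > max_plausible]

def deviation_outliers(values, n_sd=3.0):
    """Flag values more than n_sd sample standard deviations from the mean."""
    m = mean(values)
    sd = stdev(values)
    return [v for v in values if abs(v - m) > n_sd * sd]

# Hypothetical response times in minutes.
response_minutes = [2, 3, 1, 4, 2, 5, 3, 2, 4, 3,
                    2, 1, 3, 4, 2, 3, 5, 2, 3, 972]

print(theoretical_outliers(response_minutes, max_plausible=45))
print(deviation_outliers(response_minutes))
```

One thing to keep in mind with the deviation-based version: an extreme value inflates the mean and SD it is being compared against, so in a small sample a real outlier can fail to clear the 3 SD bar.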
Ways to deal with outliers
Truncation – Delete the outliers
Winsorising – Set the data value to the cut-off value
Doing nothing
When is each approach justified? (Examples please)
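As a rough sketch of the difference between the first two options, using the slide's definitions: the function names, the cut-offs of 0 and 45, and the sample values below are hypothetical choices for illustration only.

```python
def truncate(values, low, high):
    """Truncation: delete any value outside the cut-offs."""
    return [v for v in values if low <= v <= high]

def winsorise(values, low, high):
    """Winsorising: set any value outside the cut-offs to the nearest cut-off."""
    return [min(max(v, low), high) for v in values]

# Hypothetical times in minutes, with one huge outlier.
times = [2, 3, 1, 4, 972]

print(truncate(times, 0, 45))   # the outlier row is gone
print(winsorise(times, 0, 45))  # the outlier row is kept, capped at 45
```

Truncation changes how many data points you have; winsorising keeps every row but changes a value. Which of those is the lesser evil depends on what the row is needed for.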
Another data cleaning issue identified by Romero and colleagues
Missing data
A huge topic in its own right
Will not focus on it today, but we can come back to it later in the semester if there’s extra time and interest
– Quick vote: who wants me to try to fit in some time to talk about missing data?
Data Cleaning: Other Thoughts or Comments?
Romero article: Other Thoughts or Comments?
Assignment 3 Data Cleaning
Look for outliers in your data set
Find 3 variables that have one or more outliers (if you can)
Identify those variables
Give the mean, median, SD, and some outlier values for each
For each variable, write a 1-sentence “just-so story” (or multiple just-so stories) about what might have caused the outlier(s)
Argue (briefly) for a reasonable approach to dealing with that variable’s outliers (and explain why your chosen approach is reasonable)
Assignment 3
Write a brief report for me
You don’t need to prepare a presentation
But be ready to discuss your features in class
Next Classes
2/23 Feature Distillation in Excel
– Assignment 3 due
2/25 Special Session
– Using RapidMiner to Produce Prediction Models
– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
– Statistical significance tests using linear regression don’t count…
If there’s time…
Other cool things you can create with a few simple formulas (plus demos!)
Identifying specific cases of interest
Did event of interest ever occur for student?
Counts-so-far (and total value for student)
Counts-last-N-actions
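For anyone who would rather see counts-so-far and counts-last-N-actions outside a spreadsheet, here is a minimal Python sketch. It assumes a log that is already sorted by time, with one `(student, is_event)` pair per action; that row layout, the function names, and the sample log are all hypothetical. Both counts include the current action, which is a design choice (shift by one row if the feature must only see the past).

```python
from collections import defaultdict, deque

def counts_so_far(rows):
    """Per row: how many events this student has had up to and including this row."""
    totals = defaultdict(int)
    out = []
    for student, is_event in rows:
        totals[student] += int(is_event)
        out.append(totals[student])
    return out

def counts_last_n(rows, n):
    """Per row: how many events are in this student's most recent n actions."""
    windows = defaultdict(lambda: deque(maxlen=n))  # old actions fall off the left
    out = []
    for student, is_event in rows:
        windows[student].append(int(is_event))
        out.append(sum(windows[student]))
    return out

# Hypothetical log: (student id, did the event of interest occur on this action?)
log = [("a", True), ("a", False), ("b", True),
       ("a", True), ("a", True), ("b", False)]

print(counts_so_far(log))
print(counts_last_n(log, 2))
```

The last counts-so-far value for each student is that student's total, which also answers "did the event of interest ever occur for this student?" (total greater than zero).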
First attempts
Ratios between events of interest
How many students had 3 (or 4, 5, 2,…) of an event
Times-so-far
Cutoff-based features
Unitized actions (such as unitized time)
Last 3 or 5 unitized
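A small sketch of unitizing, assuming that "unitized" here means expressing each value in standard deviations above or below the mean (a z-score), and that "last 3 unitized" means summing the unitized values of the three most recent actions. Both readings are assumptions, as are the function names and the sample times; in real data you would typically unitize within a group (for example, per problem step) rather than over the whole column.

```python
from statistics import mean, pstdev

def unitize(values):
    """Express each value as SDs above (+) or below (-) the mean of the list."""
    m = mean(values)
    sd = pstdev(values)  # population SD; swap in stdev() for the sample SD
    if sd == 0:
        return [0.0 for _ in values]  # no spread, so nothing is unusual
    return [(v - m) / sd for v in values]

def last_n_sum(values, n):
    """Per position: sum of the current value and up to n-1 values before it."""
    return [sum(values[max(0, i - n + 1): i + 1]) for i in range(len(values))]

# Hypothetical times in seconds for one step.
times = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

unitized = unitize(times)
print(unitized)
print(last_n_sum(unitized, 3))
```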
Comparing earlier behaviors to later behaviors through caching
Counts-if
Percentages of action type
Percentages of time spent per action/location/KC/etc.
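The two percentage features can be sketched the same way. This assumes each row is an `(action_type, seconds)` pair for a single student; that layout, the function names, and the sample rows are hypothetical, and the code assumes at least one row with non-zero total time.

```python
from collections import Counter, defaultdict

def pct_of_actions(rows):
    """Percentage of all actions that fall into each action type."""
    counts = Counter(action for action, _ in rows)
    total = sum(counts.values())
    return {action: 100.0 * c / total for action, c in counts.items()}

def pct_of_time(rows):
    """Percentage of total time spent on each action type."""
    seconds = defaultdict(float)
    for action, duration in rows:
        seconds[action] += duration
    total = sum(seconds.values())
    return {action: 100.0 * s / total for action, s in seconds.items()}

# Hypothetical rows for one student: (action type, seconds spent).
rows = [("hint", 10.0), ("attempt", 30.0), ("attempt", 50.0), ("hint", 10.0)]

print(pct_of_actions(rows))
print(pct_of_time(rows))
```

The same grouping works for location or KC: replace the action type in each row with whatever you want to group by.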
Questions? Comments?
Other cool ideas?