Published by Norah Lambert. Modified over 8 years ago.
1
Feature Engineering Studio September 23, 2013
2
Let’s start by discussing the HW
3
Sort into pairs Partner with the person next to you
4
Sort into pairs Go over your reports together – A maximum of 5 minutes apiece
5
5 minutes for first person
6
5 minutes for second person
7
Re-assemble into one big group
8
Who here found something really cool while taking a first look at their data? Show us, tell us
9
Who here found a histogram with a normal distribution? Show us, tell us
10
Who here found a histogram with a hypermode? Show us, tell us
11
Who here found a histogram with a flat distribution? Show us, tell us
12
Who here found a histogram with a skewed distribution? Show us, tell us
13
Who here found a histogram with a bimodal distribution? Show us, tell us
14
Who here found a histogram with something else interesting? Show us, tell us
15
Who here found something surprising with their min, max, average, stdev?
16
Categorical variables Who here found something curious, weird, or interesting in the distribution of their categorical variables?
17
Who here hasn’t spoken yet today? (who analyzed data) Tell us something interesting you found in your data
18
Who here did something else fun with their data?
19
Data Cleaning
20
What did folks think of the Romero article? In particular, the part on data cleaning
21
Outliers
You have a huge outlier in your data
You need to know what to do about it
22
More specifically
What kind of outliers should be dealt with?
And what kind of outliers should be left alone?
What do you think?
23
Ways to identify outliers
Theoretical approaches
– “Students used this software during 45-minute class periods. I see a 972-minute period of time taken to make a response. Something must have gone wrong here.”
Deviation-based approaches
– “All data more than 3 SD above or 3 SD below the mean will be treated as an outlier”
When is each approach justified? (Examples please)
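The two approaches above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample response times are made up, and the 45-minute limit is taken from the slide's example.

```python
from statistics import mean, stdev

def theoretical_outliers(values, max_plausible):
    """Flag values above a limit justified by domain knowledge."""
    return [v for v in values if v > max_plausible]

def deviation_outliers(values, n_sd=3.0):
    """Flag values more than n_sd sample standard deviations from the mean."""
    m = mean(values)
    sd = stdev(values)
    return [v for v in values if abs(v - m) > n_sd * sd]

# Hypothetical response times in minutes; 972 echoes the slide's example.
times = [2, 3, 1, 4, 2, 5, 3, 2, 4, 3, 2, 3, 1, 2, 4, 3, 2, 5, 3, 972]
flagged_by_theory = theoretical_outliers(times, max_plausible=45)
flagged_by_sd = deviation_outliers(times)
```

Note that a huge outlier inflates the SD it is being compared against, so in very small samples the deviation rule can fail to flag anything; the theoretical rule does not have that problem.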
24
Ways to deal with outliers
Truncation
– Delete the outliers
Winsorising
– Set the data value to the cut-off value
Doing nothing
When is each approach justified? (Examples please)
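A minimal sketch of the first two options, using the definitions given on this slide. The cut-off values and the data are hypothetical; choosing the cut-offs is the part that needs justification.

```python
def truncate(values, low, high):
    """Truncation as defined above: delete values outside [low, high]."""
    return [v for v in values if low <= v <= high]

def winsorise(values, low, high):
    """Winsorising: replace values outside [low, high] with the nearest cut-off."""
    return [min(max(v, low), high) for v in values]

# Hypothetical response times in minutes, with assumed cut-offs of 0 and 45.
times = [1, 5, 972]
truncated = truncate(times, 0, 45)
winsorised = winsorise(times, 0, 45)
```

Truncation changes the number of data points; winsorising keeps every row but compresses the extremes.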
25
Another data cleaning issue identified by Romero and colleagues
Missing data
A huge topic in its own right
Will not focus on it today, but we can come back to it later in the semester if there’s extra time and interest
– Quick vote: who wants me to try to fit in some time to talk about missing data?
26
Data Cleaning: Other Thoughts or Comments?
27
Romero article: Other Thoughts or Comments?
28
Assignment 3
Data Cleaning
Look for outliers in your data set
Find 3 variables that have one or more outliers (if you can)
Identify those variables
Give the mean, median, SD, and some of the outlier values for each
For each variable, write a 1-sentence “just-so story” (or multiple just-so stories) about what might have caused the outlier(s)
Argue (briefly) for a reasonable approach to dealing with that variable’s outliers (and explain why your chosen approach is reasonable)
29
Assignment 3
Write a brief report for me
You don’t need to prepare a presentation
But be ready to discuss your features in class
30
Next Classes
2/23 Feature Distillation in Excel
– Assignment 3 due
2/25 Special Session
– Using RapidMiner to Produce Prediction Models
– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
– Statistical significance tests using linear regression don’t count…
31
If there’s time…
32
Other cool things you can create with a few simple formulas (plus demos!)
33
Identifying specific cases of interest
34
Did event of interest ever occur for student?
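One way to sketch this feature in Python, assuming the log is a list of (student, action) pairs. That layout and the names below are assumptions for illustration, not the format used in the demos.

```python
def ever_occurred(log, event):
    """Map each student to True if `event` appears in any of their rows."""
    flags = {}
    for student, action in log:
        flags[student] = flags.get(student, False) or action == event
    return flags

# Hypothetical log of (student, action) rows.
log = [("s1", "answer"), ("s2", "hint"), ("s1", "hint"), ("s3", "answer")]
hint_flags = ever_occurred(log, "hint")
```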
35
Counts-so-far (and total value for student)
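A rough sketch of a counts-so-far feature, again assuming a time-ordered list of (student, action) rows. Here the running count includes the current row; whether it should is a design choice, since including it can leak the current outcome into a predictor.

```python
from collections import defaultdict

def counts_so_far(log, event):
    """Return (running count per row, total count per student) for `event`."""
    seen = defaultdict(int)
    running = []
    for student, action in log:
        if action == event:
            seen[student] += 1
        running.append(seen[student])  # count up to and including this row
    return running, dict(seen)

# Hypothetical time-ordered log.
log = [("s1", "error"), ("s2", "error"), ("s1", "correct"), ("s1", "error")]
running, totals = counts_so_far(log, "error")
```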
36
Counts-last-N-actions
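The same idea restricted to a sliding window can be sketched with a fixed-length deque per student. The log layout and names are assumptions for illustration.

```python
from collections import defaultdict, deque

def counts_last_n(log, event, n):
    """For each row, count `event` among that student's last n actions
    (including the current one)."""
    recent = defaultdict(lambda: deque(maxlen=n))
    out = []
    for student, action in log:
        window = recent[student]
        window.append(action == event)  # oldest entry drops out automatically
        out.append(sum(window))
    return out

# Hypothetical time-ordered log for one student.
log = [("s1", "help"), ("s1", "help"), ("s1", "answer"), ("s1", "help")]
last_two = counts_last_n(log, "help", n=2)
```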
37
First attempts
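A minimal sketch of flagging first attempts, assuming each row is a (student, problem) pair in time order. The row layout is an assumption made for this example.

```python
def first_attempt_flags(log):
    """Flag each row that is the student's first attempt at that problem."""
    seen = set()
    flags = []
    for student, problem in log:
        key = (student, problem)
        flags.append(key not in seen)
        seen.add(key)
    return flags

# Hypothetical time-ordered log of attempts.
log = [("s1", "p1"), ("s1", "p1"), ("s2", "p1"), ("s1", "p2")]
is_first = first_attempt_flags(log)
```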
38
Ratios between events of interest
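A sketch of a per-student ratio between two event types, assuming (student, action) rows. Returning None when the denominator event never occurs is one possible convention; others (such as a smoothed ratio) may suit a given data set better.

```python
from collections import Counter

def event_ratio(log, numerator, denominator):
    """Per-student ratio of two event counts; None if the denominator is 0."""
    counts = Counter(log)  # counts each (student, action) pair
    ratios = {}
    for student in {s for s, _ in log}:
        denom = counts[(student, denominator)]
        ratios[student] = counts[(student, numerator)] / denom if denom else None
    return ratios

# Hypothetical log.
log = [("s1", "hint"), ("s1", "answer"), ("s1", "answer"), ("s2", "hint")]
hints_per_answer = event_ratio(log, "hint", "answer")
```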
39
How many students had 3 (or 4, 5, 2,…) of an event
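This can be sketched as a count of counts: first tally the event per student, then tally how many students share each total. The log layout and names are assumptions for illustration.

```python
from collections import Counter

def students_per_event_count(log, event):
    """Map k -> number of students with exactly k occurrences of `event`."""
    # Start every student at zero so students without the event are counted.
    per_student = Counter({student: 0 for student, _ in log})
    for student, action in log:
        if action == event:
            per_student[student] += 1
    return Counter(per_student.values())

# Hypothetical log.
log = [("s1", "hint"), ("s1", "hint"), ("s2", "hint"),
       ("s3", "answer"), ("s2", "hint")]
distribution = students_per_event_count(log, "hint")
```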
40
Times-so-far
41
Cutoff-based features
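A cutoff-based feature can be as simple as a boolean flag per row. The 3-second cutoff and the data below are purely hypothetical; the right cutoff depends on the data set.

```python
def below_cutoff(times, cutoff):
    """Flag each value that falls strictly below the cutoff."""
    return [t < cutoff for t in times]

# Hypothetical response times in seconds, with an assumed 3-second cutoff.
response_seconds = [1.2, 14.0, 2.9, 3.0, 40.5]
fast_flags = below_cutoff(response_seconds, cutoff=3.0)
```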
42
Unitized actions (such as unitized time)
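A sketch of one reading of "unitized": each value expressed as the number of standard deviations from the mean of its group (for example, time relative to other responses on the same step). That interpretation, the use of the population SD, and the (group, value) row layout are all assumptions here.

```python
from collections import defaultdict
from statistics import mean, pstdev

def unitize(rows):
    """rows: (group, value) pairs. Return each value as SDs from its group mean."""
    by_group = defaultdict(list)
    for group, value in rows:
        by_group[group].append(value)
    stats = {g: (mean(v), pstdev(v)) for g, v in by_group.items()}
    out = []
    for group, value in rows:
        m, sd = stats[group]
        # A group with no variation has no meaningful scale; use 0.0 there.
        out.append((value - m) / sd if sd > 0 else 0.0)
    return out

# Hypothetical times (seconds) on two problem steps.
rows = [("step1", 10), ("step1", 20), ("step2", 5), ("step2", 5)]
unitized = unitize(rows)
```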
43
Last 3 or 5 unitized
44
Comparing earlier behaviors to later behaviors through caching
45
Counts-if
46
Percentages of action type
47
Percentages of time spent per action/location/KC/etc.
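A minimal sketch of percentage of time per category, assuming rows of (category, seconds). The category could equally be a location or a KC; the names and data are illustrative only.

```python
from collections import defaultdict

def time_share(rows):
    """Return the percentage of total time spent in each category."""
    totals = defaultdict(float)
    for category, seconds in rows:
        totals[category] += seconds
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {category: 0.0 for category in totals}
    return {c: 100.0 * t / grand_total for c, t in totals.items()}

# Hypothetical time log for one student.
rows = [("hint", 10), ("answer", 30), ("hint", 10), ("answer", 50)]
shares = time_share(rows)
```

Counting rows instead of summing seconds gives the percentage-of-action-type variant.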
48
Questions? Comments?
49
Other cool ideas?