Feature Engineering Studio September 23, 2013. Let’s start by discussing the HW.


1 Feature Engineering Studio September 23, 2013

2 Let’s start by discussing the HW

3 Sort into pairs Partner with the person next to you

4 Sort into pairs Go over your reports together – A maximum of 5 minutes apiece

5 5 minutes for first person

6 5 minutes for second person

7 Re-assemble into one big group

8 Who here found something really cool while taking a first look at their data? Show us, tell us

9 Who here found a histogram with a normal distribution? Show us, tell us

10 Who here found a histogram with a hypermode? Show us, tell us

11 Who here found a histogram with a flat distribution? Show us, tell us

12 Who here found a histogram with a skewed distribution? Show us, tell us

13 Who here found a histogram with a bimodal distribution? Show us, tell us

14 Who here found a histogram with something else interesting? Show us, tell us

15 Who here found something surprising with their min, max, average, stdev?

16 Categorical variables Who here found something curious, weird, or interesting in the distribution of their categorical variables?

17 Who here hasn’t spoken yet today? (who analyzed data) Tell us something interesting you found in your data

18 Who here did something else fun with their data?

19 Data Cleaning

20 What did folks think of the Romero article? In particular, the part on data cleaning

21 Outliers
You have a huge outlier in your data
You need to know what to do about it

22 More specifically
What kind of outliers should be dealt with?
And what kind of outliers should be left alone?
What do you think?

23 Ways to identify outliers
Theoretical approaches
– “Students used this software during 45-minute class periods. I see a 972-minute period of time taken to make a response. Something must have gone wrong here.”
Deviation-based approaches
– “All data more than 3 SD above or below the mean will be treated as an outlier”
When is each approach justified? (Examples please)
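The two identification approaches can be sketched in a few lines of Python. This is a minimal illustration, not the course's own method: the function names, the 45-minute limit used as a plausibility cutoff, and the sample response times are all assumptions made up for the example (only the 972-minute value comes from the slide).

```python
from statistics import mean, stdev

def theoretical_outliers(values, max_plausible):
    """Flag values above a limit justified by domain knowledge."""
    return [v > max_plausible for v in values]

def deviation_outliers(values, n_sd=3.0):
    """Flag values more than n_sd sample standard deviations from the mean."""
    m = mean(values)
    sd = stdev(values)
    return [abs(v - m) > n_sd * sd for v in values]

# Hypothetical response times in minutes; 972 is the implausible value from the slide.
response_minutes = [2, 3, 1, 4, 2, 5, 3, 2, 4, 3, 2, 1, 3, 4, 2, 3, 5, 2, 3, 972]

# Theoretical rule: a class period lasted 45 minutes, so anything longer is suspect.
by_theory = theoretical_outliers(response_minutes, max_plausible=45)

# Deviation rule: anything more than 3 SD from the mean is suspect.
by_deviation = deviation_outliers(response_minutes)
```

One caveat on the deviation rule: an extreme value inflates the SD it is measured against, so in a sample of n values no point can sit more than (n-1)/sqrt(n) sample SDs from the mean. With 10 or fewer values, nothing can ever exceed a 3 SD cutoff.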

24 Ways to deal with outliers
Truncation – Delete the outliers
Winsorising – Set the data value to the cut-off value
Doing nothing
When is each approach justified? (Examples please)
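As a rough sketch of the first two options, using the slide's definitions (truncation deletes, winsorising clamps to the cutoff): the function names, the cutoffs of 0 and 45, and the sample data are assumptions for illustration only.

```python
def truncate(values, lower, upper):
    """Truncation as defined on the slide: drop values outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]

def winsorise(values, lower, upper):
    """Winsorising: replace values outside [lower, upper] with the nearest cutoff."""
    return [min(max(v, lower), upper) for v in values]

times = [2, 3, 1, 4, 972]                 # hypothetical response times in minutes
truncated = truncate(times, 0, 45)        # the 972 row is removed
winsorised = winsorise(times, 0, 45)      # the 972 row becomes 45
```

Truncation changes the number of rows, which matters if later features count actions per student; winsorising keeps every row but changes its value.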

25 Another data cleaning issue identified by Romero and colleagues
Missing data
A huge topic in its own right
Will not focus on it today, but we can come back to it later in the semester if there’s extra time and interest
– Quick vote: who wants me to try to fit in some time to talk about missing data?

26 Data Cleaning: Other Thoughts or Comments?

27 Romero article: Other Thoughts or Comments?

28 Assignment 3: Data Cleaning
Look for outliers in your data set
Find 3 variables that have one or more outliers (if you can)
Identify those variables
Give the mean, median, SD, and some of the outlier values for each
For each variable, write a 1-sentence “just so story” (or multiple just so stories) about what might have caused the outlier(s)
Argue (briefly) for a reasonable approach to dealing with that variable’s outliers (and explain why your chosen approach is reasonable)

29 Assignment 3
Write a brief report for me
You don’t need to prepare a presentation
But be ready to discuss your features in class

30 Next Classes
2/23 Feature Distillation in Excel
– Assignment 3 due
2/25 Special Session
– Using RapidMiner to Produce Prediction Models
– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
– Statistical significance tests using linear regression don’t count…

31 If there’s time…

32 Other cool things you can create with a few simple formulas (plus demos!)

33 Identifying specific cases of interest

34 Did event of interest ever occur for student?

35 Counts-so-far (and total value for student)
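A counts-so-far feature can be sketched outside a spreadsheet as a running tally per student. This is an illustrative sketch only: the log layout (chronological `(student, action)` pairs), the `"help"` event name, and the choice to count only occurrences before the current row are all assumptions.

```python
from collections import defaultdict

def counts_so_far(log, event):
    """For each row, count how often `event` occurred earlier for the same student.

    Returns (per-row counts, total count per student).
    """
    running = defaultdict(int)
    so_far = []
    for student, action in log:
        so_far.append(running[student])   # count before the current action
        if action == event:
            running[student] += 1
    return so_far, dict(running)

# Hypothetical action log in chronological order.
log = [("s1", "help"), ("s1", "attempt"), ("s2", "attempt"),
       ("s1", "help"), ("s2", "help"), ("s1", "attempt")]
so_far, totals = counts_so_far(log, "help")
```

Excluding the current row keeps the feature from encoding the very action it might later be used to predict; including it is an equally valid variant if that is not a concern.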

36 Counts-last-N-actions
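A counts-last-N-actions feature is the same idea restricted to a sliding window. The sketch below assumes the same hypothetical `(student, action)` log layout and counts the event within each student's previous `n` actions, not including the current one.

```python
from collections import defaultdict, deque

def counts_last_n(log, event, n):
    """For each row, count `event` among the same student's previous n actions."""
    recent = defaultdict(lambda: deque(maxlen=n))  # one fixed-size window per student
    counts = []
    for student, action in log:
        window = recent[student]
        counts.append(sum(1 for a in window if a == event))
        window.append(action)  # oldest action falls out once the window is full
    return counts

# Hypothetical log for a single student.
log = [("s1", "help"), ("s1", "help"), ("s1", "attempt"),
       ("s1", "help"), ("s1", "attempt")]
last_two = counts_last_n(log, "help", n=2)
```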

37 First attempts

38 Ratios between events of interest

39 How many students had 3 (or 4, 5, 2,…) of an event

40 Times-so-far

41 Cutoff-based features

42 Unitized actions (such as unitized time)
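The slide does not define "unitized", so the sketch below rests on an assumption: that unitized time means the time for an action expressed in standard-deviation units relative to the mean time for the same step across the data set. The row layout `(step, seconds)` and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def unitize(rows):
    """Return, for each (step, seconds) row, its z-score among rows for the same step."""
    by_step = defaultdict(list)
    for step, seconds in rows:
        by_step[step].append(seconds)
    # Mean and population SD per step.
    stats = {step: (mean(v), pstdev(v)) for step, v in by_step.items()}
    unitized = []
    for step, seconds in rows:
        m, sd = stats[step]
        # If every time for a step is identical, SD is 0; report 0.0 rather than divide by zero.
        unitized.append((seconds - m) / sd if sd > 0 else 0.0)
    return unitized

rows = [("a", 10), ("a", 20), ("a", 30), ("b", 5), ("b", 5)]  # hypothetical data
z = unitize(rows)
```

Because the mean and SD are sensitive to extreme values, it is worth settling how outliers are handled before unitizing.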

43 Last 3 or 5 unitized

44 Comparing earlier behaviors to later behaviors through caching

45 Counts-if

46 Percentages of action type

47 Percentages of time spent per action/location/KC/etc.
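A percentage-of-time feature can be sketched as each category's share of a student's total time. The row layout `(student, category, seconds)` and the category names are assumptions; the category could equally be a location or KC.

```python
from collections import defaultdict

def time_share(rows):
    """Return {student: {category: fraction of that student's total time}}."""
    totals = defaultdict(float)
    per_category = defaultdict(lambda: defaultdict(float))
    for student, category, seconds in rows:
        totals[student] += seconds
        per_category[student][category] += seconds
    return {
        student: {c: t / totals[student] for c, t in cats.items()}
        for student, cats in per_category.items()
        if totals[student] > 0  # skip students with no recorded time
    }

rows = [("s1", "hint", 30), ("s1", "attempt", 90), ("s2", "attempt", 10)]  # hypothetical
shares = time_share(rows)
```

Multiply the fractions by 100 for percentages. Swapping the seconds for a constant 1 per row gives the percentage of actions of each type instead.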

48 Questions? Comments?

49 Other cool ideas?

