Feature Engineering Studio September 23, 2013. Let’s start by discussing the HW.

Sort into pairs: Partner with the person next to you

Sort into pairs: Go over your reports together, a maximum of 5 minutes apiece

5 minutes for first person

5 minutes for second person

Re-assemble into one big group

Who here found something really cool while taking a first look at their data? Show us, tell us

Who here found a histogram with a normal distribution? Show us, tell us

Who here found a histogram with a hypermode? Show us, tell us

Who here found a histogram with a flat distribution? Show us, tell us

Who here found a histogram with a skewed distribution? Show us, tell us

Who here found a histogram with a bimodal distribution? Show us, tell us

Who here found a histogram with something else interesting? Show us, tell us

Who here found something surprising with their min, max, average, stdev?

Categorical variables: Who here found something curious, weird, or interesting in the distribution of their categorical variables?

Who here hasn’t spoken yet today? (who analyzed data) Tell us something interesting you found in your data

Who here did something else fun with their data?

Data Cleaning

What did folks think of the Romero article? In particular, the part on data cleaning

Outliers: You have a huge outlier in your data. You need to know what to do about it.

More specifically: What kinds of outliers should be dealt with? And what kinds of outliers should be left alone? What do you think?

Ways to identify outliers
Theoretical approaches
– “Students used this software during 45-minute class periods. I see a 972-minute period of time taken to make a response. Something must have gone wrong here.”
Deviation-based approaches
– “All data that is 3 SD over or 3 SD under the mean will be treated as an outlier”
When is each approach justified? (Examples please)
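
The two approaches on this slide can be sketched in a few lines of Python. This is only an illustration, not the course's method: the function names, the sample data, and the 45-minute limit as a hard cut-off are all assumptions made for the example (the 972 value mirrors the slide's quote). Note that an extreme value inflates the very mean and SD used to judge it, so the deviation rule can miss outliers in small samples.

```python
from statistics import mean, stdev

def theoretical_outliers(values, max_plausible):
    """Flag values above a limit chosen from domain knowledge (hypothetical helper)."""
    return [v for v in values if v > max_plausible]

def deviation_outliers(values, n_sd=3.0):
    """Flag values more than n_sd sample standard deviations from the mean."""
    m = mean(values)
    sd = stdev(values)
    return [v for v in values if abs(v - m) > n_sd * sd]

# Made-up response times in minutes; one value is clearly implausible.
response_minutes = [2, 3, 1, 4, 2, 5, 3, 2, 4, 3, 1, 2, 3, 4, 2, 3, 5, 1, 2, 972]

print(theoretical_outliers(response_minutes, max_plausible=45))  # [972]
print(deviation_outliers(response_minutes))                      # [972]
```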

Ways to deal with outliers
Truncation
– Delete the outliers
Winsorising
– Set the data value to the cut-off value
Doing nothing
When is each approach justified? (Examples please)
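
A minimal sketch of the first two options, following the definitions on this slide. The function names, cut-off values, and data are hypothetical and chosen only for illustration.

```python
def truncate(values, low, high):
    """Truncation as defined on the slide: drop values outside [low, high]."""
    return [v for v in values if low <= v <= high]

def winsorise(values, low, high):
    """Winsorising: replace values outside [low, high] with the nearest cut-off."""
    return [min(max(v, low), high) for v in values]

times = [2, 3, 1, 4, 972]        # made-up response times in minutes
print(truncate(times, 0, 45))    # [2, 3, 1, 4]
print(winsorise(times, 0, 45))   # [2, 3, 1, 4, 45]
```

Truncation changes the number of data points, while winsorising keeps every row but alters its value. That difference is one thing to weigh when arguing for either approach.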

Another data cleaning issue identified by Romero and colleagues: Missing data
A huge topic in its own right
Will not focus on it today, but we can come back to it later in the semester if there’s extra time and interest
– Quick vote: who wants me to try to fit in some time to talk about missing data?

Data Cleaning: Other Thoughts or Comments?

Romero article: Other Thoughts or Comments?

Assignment 3: Data Cleaning
Look for outliers in your data set
Find 3 variables that have one or more outliers (if you can)
Identify those variables
Give the mean, median, SD, and some outlier values for each
For each variable, write a 1-sentence “just so story” (or multiple just so stories) about what might have caused the outlier(s)
Argue (briefly) for a reasonable approach to dealing with that variable’s outliers (and explain why your chosen approach is reasonable)

Assignment 3
Write a brief report for me
You don’t need to prepare a presentation
But be ready to discuss your features in class

Next Classes
2/23 Feature Distillation in Excel
– Assignment 3 due
2/25 Special Session
– Using RapidMiner to Produce Prediction Models
– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
– Statistical significance tests using linear regression don’t count…

If there’s time…

Other cool things you can create with a few simple formulas (plus demos!)

Identifying specific cases of interest

Did event of interest ever occur for student?

Counts-so-far (and total value for student)
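
One way a counts-so-far feature might look outside a spreadsheet, as a sketch only. It assumes the log is a time-ordered list of (student, action) rows; the row layout, the names, and the "help" event are invented for the example.

```python
from collections import defaultdict

def counts_so_far(rows, event):
    """Return a per-row running count of `event` for that row's student,
    plus each student's final total."""
    running = defaultdict(int)
    per_row = []
    for student, action in rows:
        if action == event:
            running[student] += 1
        per_row.append(running[student])  # count includes the current row
    return per_row, dict(running)

# Hypothetical log, already sorted by time.
log = [("s1", "help"), ("s1", "attempt"), ("s2", "help"),
       ("s1", "help"), ("s2", "attempt")]

per_row, totals = counts_so_far(log, "help")
print(per_row)  # [1, 1, 1, 2, 1]
print(totals)   # {'s1': 2, 's2': 1}
```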

Counts-last-N-actions
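
A counts-last-N-actions feature could be sketched with a fixed-size window per student. The assumptions here are that the window includes the current action and that rows are time-ordered (student, action) pairs; all names and data are hypothetical.

```python
from collections import defaultdict, deque

def counts_last_n(rows, event, n):
    """For each row, count how many of that student's last n actions
    (including the current one) were `event`."""
    windows = defaultdict(lambda: deque(maxlen=n))
    per_row = []
    for student, action in rows:
        windows[student].append(action == event)  # oldest entry drops off automatically
        per_row.append(sum(windows[student]))
    return per_row

# Hypothetical single-student log, sorted by time.
s1_log = [("s1", "help"), ("s1", "help"), ("s1", "attempt"),
          ("s1", "help"), ("s1", "attempt")]

print(counts_last_n(s1_log, "help", n=3))  # [1, 2, 2, 2, 1]
```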

First attempts

Ratios between events of interest

How many students had 3 (or 4, 5, 2,…) of an event

Times-so-far

Cutoff-based features

Unitized actions (such as unitized time)
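
The slide does not define "unitized". The sketch below assumes it means expressing a time in standard deviations above or below the mean for the same step, which is one common reading. The (step, seconds) row layout, the use of the population SD, and the names are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def unitize_times(rows):
    """rows: (step, seconds) pairs. Return each time as the number of
    population SDs from the mean time for that step."""
    by_step = defaultdict(list)
    for step, secs in rows:
        by_step[step].append(secs)
    stats = {step: (mean(v), pstdev(v)) for step, v in by_step.items()}
    unitized = []
    for step, secs in rows:
        m, sd = stats[step]
        # A step where everyone took the same time has SD 0; report 0.0 there.
        unitized.append(0.0 if sd == 0 else (secs - m) / sd)
    return unitized

# Made-up times for two steps.
step_times = [("a", 10), ("a", 20), ("a", 30), ("b", 5), ("b", 5)]
z = unitize_times(step_times)
print([round(v, 2) for v in z])
```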

Last 3 or 5 unitized

Comparing earlier behaviors to later behaviors through caching

Counts-if

Percentages of action type

Percentages of time spent per action/location/KC/etc.
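
A sketch of how a percentage-of-time feature might be computed, assuming rows of (student, action, seconds). The row layout, names, and values are hypothetical. The same pattern would apply to location or KC in place of action.

```python
from collections import defaultdict

def time_share(rows):
    """Return {student: {action: fraction of that student's total time}}."""
    totals = defaultdict(float)
    per_action = defaultdict(lambda: defaultdict(float))
    for student, action, secs in rows:
        totals[student] += secs
        per_action[student][action] += secs
    return {
        student: {action: t / totals[student] for action, t in actions.items()}
        for student, actions in per_action.items()
        if totals[student] > 0  # skip students with no recorded time
    }

# Made-up log of time spent per action.
timed_log = [("s1", "help", 30), ("s1", "attempt", 90), ("s2", "attempt", 10)]
shares = time_share(timed_log)
print(shares["s1"])  # {'help': 0.25, 'attempt': 0.75}
```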

Questions? Comments?

Other cool ideas?