Feature Engineering Week 3 Video 3. Feature Engineering.

Slides:



Advertisements
Similar presentations
Final Year Projects Some tips
Advertisements

Mentoring Conversations
Feature Engineering Studio January 21, Welcome to Feature Engineering Studio Design studio-style course teaching how to distill and engineer features.
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
Non-fiction Reading Unit Introduction 1. Essential Questions How do capable readers make sense of nonfiction text? How do we read nonfiction text to become.
CS 330 Programming Languages 09 / 18 / 2007 Instructor: Michael Eckmann.
How organization can improve creativity Robotics and Automation Copyright © Texas Education Agency, All rights reserved. 1.
Educational Data Mining Overview John Stamper PSLC Summer School /25/2011 1PSLC Summer School 2011.
Mkkk; Jackson makes a Snail. Jackson wanted to draw a snail. One thing he knew for sure was that he wanted his snail to be big and that there were lots.
CREATING A MATHEMATICS CULTURE WHANGAREIMAY JIM HOGAN TEAM SOLUTIONS.
Non-fiction Reading Unit Introduction 1. Essential Questions How do capable readers make sense of nonfiction text? How do we read nonfiction text to become.
Classifiers, Part 3 Week 1, Video 5 Classification  There is something you want to predict (“the label”)  The thing you want to predict is categorical.
Lecturer: Gareth Jones Class 2: The Writing Process.
*** Remember – this material is based on 7 Habits.
Unit 2: Engineering Design Process
Feature Engineering Studio March 30, Iterative Feature Refinement.
Paragraph Writing Problems. Paragraph Writing Problems 1 University differ from High School such as number of lectures, time of lectures and exams. First.
How to do Quality Research for Your Research Paper
Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 February 13, 2012.
Unit 1 – Improving Productivity Ayesha begum. 1.1Why did you use a computer? What other systems / resources could you have used? We use a computer because.
Personal Reading Procedure By: Kellen X. Reinsch.
Feature Engineering Studio September 23, Welcome to Mucking Around Day.
Three Reasons to Communicate Get something DONE Have a conversation Help with distress.
Big Ideas & Better Questions, Part II Marian Small May, ©Marian Small, 2009.
Reading Strategies to Promote Student Success I Don’t Get It!
Ad Prima Charter School.  R7.B Identify, explain, interpret, describe, and/or analyze bias and propaganda techniques in nonfictional text.
Introduction Algorithms and Conventions The design and analysis of algorithms is the core subject matter of Computer Science. Given a problem, we want.
TAKING AND STUDYING CLASSROOM NOTES Take effective classroom notes Study and remember your notes.
By the end of this session you should be able to...
Feature Engineering Studio September 9, Welcome to Problem Proposal Day Rules for Presenters Rules for the Rest of the Class.
Big Ideas Differentiation Frames with Icons. 1. Number Uses, Classification, and Representation- Numbers can be used for different purposes, and numbers.
Feature Engineering Studio September 23, Let’s start by discussing the HW.
Hypothesis Testing. The 2 nd type of formal statistical inference Our goal is to assess the evidence provided by data from a sample about some claim concerning.
Feature Engineering Studio October 14, Iterative Feature Refinement.
Microsoft ® Office Excel 2003 Training Using XML in Excel SynAppSys Educational Services presents:
Module 3.2.  Learn the differences between kinds of textbooks  Learn ways to help students focus their reading and manage multiple or very large reading.
MATH COMMUNICATIONS Created for the Georgia – Alabama District By: Diane M. Cease-Harper, Ed.D 2014.
The E ngineering Design Process Foundations of Technology The E ngineering Design Process © 2013 International Technology and Engineering Educators Association,
Microsoft Excel – Pivot Tables Introduction to Microsoft Excel Pivot tables Please login to the computers and launch Microsoft Excel. Rob Jones Room WG43.
Feature Engineering Studio March 1, Let’s start by discussing the HW.
Feature Engineering Studio September 30, Quick Note Please me for appointments rather than just showing up at my office – I’m always glad.
Personal Reading Procedure P2RThinking Critically P2RThinking Critically Learning Styles Learning Styles How I learn Personally How I learn Personally.
T YPE OF R EADER Thinking about yourself as a reader!
The E ngineering Design Process Advanced Design Applications The E ngineering Design Process Teacher Resource – The First Five Days: Day 2 © 2014 International.
Graphs, Continued ECE 297.
1 Running Experiments for Your Term Projects Dana S. Nau CMSC 722, AI Planning University of Maryland Lecture slides for Automated Planning: Theory and.
Core Methods in Educational Data Mining HUDK4050 Fall 2014.
Feature Engineering Studio October 7, Welcome to Bring Me a Rock Day 2.
Opening. Two heads Before proceeding find a partner to go through the next slides with you. If you don’t have a partner be sure to share this later with.
Feature Engineering Studio September 9, Welcome to Feature Engineering Studio Design studio-style course teaching how to distill and engineer features.
Core Methods in Educational Data Mining HUDK4050 Fall 2014.
ADVANCED TOPICS IN EXCEL Evan Volgas, Lead Data Engineer at WhatRunsWhere.
Dr. Idit Caperton’s Class Summary as of 18/10/2005 What did we accomplish so far? We have been together in classes for a total of 14 hours, in 7 sessions,
 Create a flexible reading program.  Post a weekly reading schedule and allow students to find their names on it.  Allow students to move to appointed.
Core Methods in Educational Data Mining HUDK4050 Fall 2014.
#1 Make sense of problems and persevere in solving them How would you describe the problem in your own words? How would you describe what you are trying.
LearnPianoHere.com Presents Learning Piano: Ten Tips and Tricks If you are serious and determined to learn how to play the piano, do yourself a favor and.
Feature Engineering Studio October 7, Welcome to Bring Me Another Rock.
Evaluating and Summarizing Sources They Say, I Say Ch. 2.
Mathematics at Cambridge
AP CSP: Cleaning Data & Creating Summary Tables
Introduction to Engineering Design II (IE202) Section XBG Team 7 Designing a Robot Students name: IE202-Team#7 Celebration.
Engineering.
Core Methods in Educational Data Mining
Big Data, Education, and Society
Business Communication
Text Analytics and Machine Learning Workshop Machine Learning Session
Machine Learning in Practice Lecture 23
Core Methods in Educational Data Mining
Goals.
Presentation transcript:

Feature Engineering Week 3 Video 3

Feature Engineering

 Up until this point in the class, we’ve talked about building and validating prediction models  Models that infer a predicted variable from predictor variables

Where the Predicted Variable Comes From  A couple lectures ago, we went into a little more detail about where the predicted variable can come from

Where the Predictor Variables Come From  Where do the predictor variables come from?  Do they fall out of the sky?  Do they come from the Office for Predictor Variables in Washington, DC?

Feature Engineering  The art of creating predictor variables  A major topic in its own right  At Teachers College, I teach a semester-long design studio in Feature Engineering

Why a whole class?  Feature engineering is the least well-studied part of the process of developing prediction models  But it’s arguably the most important part  Your model will never be any good if your features (predictors) aren’t very good

Why a whole class?  It is an art, it is human-driven design  It involves lore rather than well-known and validated principles  It is hard!

The Big Idea  How can we take the voluminous, ill-formed, and yet under-specified data that we now have in education  And shape it into a reasonable set of variables  In an efficient, effective, and predictive way?

A process in its own right 1. Brainstorming features 2. Deciding what features to create 3. Creating the features 4. Studying the impact of features on model goodness 5. Iterating on features if useful 6. Go to 3 (or 1)

Brainstorming Features  Can be more or less formal

IDEO tips for Brainstorming 1. Defer judgment 2. Encourage wild ideas 3. Build on the ideas of others 4. Stay focused on the topic 5. One conversation at a time 6. Be visual 7. Go for quantity notes/seven-tips-on-better-brainstorming

Building on the Ideas of Others  Doesn’t just have to be people nearby  There’s a huge literature out there of features people have tried and what has worked, or failed to work, for a range of problems  Read papers from researchers working on similar problems, and see what you can use

Brainstorming Features  On hard projects, my research group often meets as a team over pizza and beer to brainstorm  On easier projects, one person brainstorms solo  And then often discusses their features with another person, who offers further suggestions

Deciding what features to create  There is never infinite time  A trade-off between the effort to create a feature and how likely it is to be useful  “How likely it is to be useful” – the best you can do is to Look at whether similar features have been useful for similar problems Use your best intuition  Worth biasing in favor of features that are different than anything else you’ve tried before  Explores a different part of the space

Creating features  Excel – Really good for prototyping features  Google Refine/OpenRefine – Some alternate features that are nice  Distillation Code – The scalable solution… but harder to check yourself or explore

Some useful tools in Excel  Pivot Tables – great for aggregating data, and getting the average, min, max, stdev  Vlookup – great for translating from aggregations (student-level data, for instance) back to action- level data  Example: How much faster is the current student action on the average student attempt on the current skill?

Further resources  vlookup-in-excel/  tables.html  ntinexcel/ss/8912pivot_table.htm

Other useful things you can do in Excel  Counts-so-far  Counts-last-n-actions  Differentiating first and subsequent attempts  Ratios between events of interest  Cut-off based features

GoogleRefine (now OpenRefine)  Functionality to make it easy to regroup and transform data  Find similar names  Connect names  Bin numerical data  Mathematical transforms showing resultant graphs  Text transforms and column creation

GoogleRefine (now OpenRefine)  Functionality for finding anomalies/outliers

GoogleRefine (now OpenRefine)  Functionality for automatically repeating the same process on a new data set  *Really* nice for cases where you complete a complex process and want to repeat it

GoogleRefine (now OpenRefine)  Some videos you may want to watch later   

Feature Iteration  Sometimes when a feature looks like it might be good  It’s worth iterating on that feature, trying close variants to see if they do better

Example  You have a feature “slow actions after hints” (cf. Shih, Koedinger, & Scheines, 2008)  You define “slow action” as an action taking over 20 seconds  What if 30 seconds is a better cut-off?

Ways to accomplish this…  By hand  Programming (Java? Matlab?)  Excel Equation Solver

Excel Equation Solver Tutorials  help/define-and-solve-a-problem-by-using-solver- HP aspx   One tip: multistart option avoids local minima (that can sometimes block the solver from even getting started)

A few thoughts

Does feature engineering over-fit?  It can  Which is why it’s useful to remember  The true test of a model is whether it works on entirely unseen data  If you iterate a lot and use cross-validated goodness  Then the true test of your model will be either a held-out data set or newly-collected data later on

Feature Engineering  Your features come from somewhere  You can take a standard set of variables or pre- existing variables  No question it’s faster  But thinking about your variables is likely to lead to better models

Next Lecture  Automated feature generation and selection