Feature Engineering Studio October 7, 2013
Welcome to Bring Me Another Rock
In birthdate order
Each person should tell us about their favorite feature they created for Bring Me Another Rock
– Tell us what it was
– How you created it
– Your just-so story
– And was your just-so story correct
Next Tell us about anything cool you did in Excel or another program to create a feature
Too Hard? Were there any features that anyone kind of wanted to create, but it was too difficult? (or too much work?)
Better? Who here got better features (in terms of goodness metric) for Bring Me Another Rock, than Bring Me a Rock?
Other Interesting Observations?
GoogleRefine (now OpenRefine)
Mostly just an Excel clone, abandoned in favor of the fully-online Google Sheets
But some nice additional functionality
GoogleRefine (now OpenRefine)
Functionality to make it easy to regroup and transform data
– Find similar names
– Connect names
– Bin numerical data
– Mathematical transforms showing resultant graphs
– Text transforms and column creation
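To make "find similar names" and "bin numerical data" concrete, here is a minimal Python sketch. It is loosely modeled on the key-collision ("fingerprint") idea and is not OpenRefine's actual implementation; the function names and the sample names are made up for illustration.

```python
import re
from collections import defaultdict

def fingerprint(name):
    """Normalize a name: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", name.strip().lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster_names(names):
    """Group names that share a fingerprint; return only groups with 2+ members."""
    groups = defaultdict(list)
    for n in names:
        groups[fingerprint(n)].append(n)
    return [g for g in groups.values() if len(g) > 1]

def bin_value(x, width):
    """Return the lower edge of the fixed-width bin that contains x."""
    return (x // width) * width

if __name__ == "__main__":
    # Hypothetical sample data
    print(cluster_names(["Baker, Ryan", "ryan baker", "Ryan  Baker.", "Rodrigo"]))
    print(bin_value(37, 10))
```

Names that collapse to the same key are candidates for merging; a person still has to confirm that they really refer to the same thing.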
GoogleRefine (now OpenRefine) Functionality for finding anomalies/outliers
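As a rough illustration of what outlier finding can look like in code, the sketch below flags values outside 1.5 times the interquartile range. This is a common rule of thumb chosen for the example, not a description of how OpenRefine does it, and the sample times are invented.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values that fall more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

if __name__ == "__main__":
    # Hypothetical action times in seconds; one student walked away from the keyboard
    times = [1, 2, 3, 4, 5, 6, 7, 8, 100]
    print(iqr_outliers(times))
```

Whether a flagged value is an error or a meaningful behavior (such as a long pause) is a judgment call that depends on the feature being built.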
GoogleRefine (now OpenRefine)
Functionality for automatically repeating the same process on a new data set
*Really* nice for cases where you complete a complex process and want to repeat it
– Replicates a really good logbook, which most data analysts don’t keep
– Now seen in other tools like IPython Notebook
– Still not in Excel, but Excel has been stagnant for years
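The "logbook you can replay" idea can be sketched in a few lines of Python: record each cleaning step as data, then apply the same list of steps to any new data set. The recipe format, column names, and rows below are hypothetical and are not OpenRefine's own operation-history format.

```python
import math

# Hypothetical recipe: each step is (description, column, function).
RECIPE = [
    ("trim whitespace in 'student'", "student", str.strip),
    ("lowercase 'student'", "student", str.lower),
    ("log-transform 'seconds'", "seconds", lambda s: math.log(float(s) + 1.0)),
]

def apply_recipe(rows, recipe):
    """Return new rows with every step applied in order; the input is not modified."""
    out = [dict(r) for r in rows]
    for _description, column, fn in recipe:
        for r in out:
            r[column] = fn(r[column])
    return out

if __name__ == "__main__":
    new_data = [{"student": "  Ana ", "seconds": "0"}]
    print(apply_recipe(new_data, RECIPE))
```

Because the descriptions travel with the steps, the recipe doubles as the written record of what was done to the data.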
GoogleRefine (now OpenRefine) Functionality for connecting your data set to web services to get additional relevant info
GoogleRefine (now OpenRefine)
Can load in and export common but hard-to-work-with data types
– JSON and XML
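For comparison, this is roughly what flattening nested JSON or XML into one row per action looks like when done by hand in Python. The field and tag names (`id`, `actions`, `type`, `seconds`, `student`, `action`) are assumptions invented for this example, not a format used in the course data.

```python
import json
import xml.etree.ElementTree as ET

def rows_from_json(text):
    """Flatten a JSON array of students, each with a nested 'actions' list."""
    rows = []
    for student in json.loads(text):
        for action in student["actions"]:
            rows.append({"student": student["id"],
                         "type": action["type"],
                         "seconds": float(action["seconds"])})
    return rows

def rows_from_xml(text):
    """Flatten <student id=...><action type=... seconds=.../></student> elements."""
    rows = []
    for student in ET.fromstring(text).findall("student"):
        for action in student.findall("action"):
            rows.append({"student": student.get("id"),
                         "type": action.get("type"),
                         "seconds": float(action.get("seconds"))})
    return rows

if __name__ == "__main__":
    js = '[{"id": "s1", "actions": [{"type": "help", "seconds": 4}]}]'
    xm = '<log><student id="s1"><action type="help" seconds="4"/></student></log>'
    print(rows_from_json(js))
    print(rows_from_xml(xm))
```

Once the data is in flat rows, it can be exported to a spreadsheet and used for feature creation like any other table.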
GoogleRefine (now OpenRefine) Some videos you should watch later
Questions? Comments?
Assignment for next Monday
Iterative Feature Refinement
Select three of the features you have created in previous assignments
These features should be “among the best” of the features you have previously created
For each of these three features, create at least five “close variants” of these features
– “time for last 3 actions” and “time for last 4 actions” are close variants
– “time for last 3 actions” and “total time between help requests and next action” are two separate features
Using the Excel Equation Solver is an OK substitute for creating five “close variants”
If you don’t use the Excel Equation Solver
– As you create the close variants for each feature, don’t just make them all at once
– Make a variant
– Test whether it’s better than the previous variant (by goodness metric)
– If it is, keep going in the same direction
– If it isn’t, try doing the opposite or something else
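The make-a-variant, test, keep-going loop can be sketched in Python using the "time for last k actions" example. The goodness metric here is assumed to be the absolute Pearson correlation between the feature and a label; that choice, the function names, and the data are all illustrative assumptions, so substitute whatever goodness metric you have been using.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def time_last_k(durations, k):
    """Feature variant: total time of the last k actions for one student."""
    return sum(durations[-k:])

def goodness(students, labels, k):
    """Assumed goodness metric: |correlation| between the feature and the label."""
    return abs(pearson([time_last_k(d, k) for d in students], labels))

def refine(students, labels, k_start, k_min=1, k_max=10):
    """Try one direction while goodness improves, then try the opposite direction."""
    k, best = k_start, goodness(students, labels, k_start)
    log = [(k, best)]                      # (variant, goodness) pairs for the report
    for step in (+1, -1):
        while k_min <= k + step <= k_max:
            g = goodness(students, labels, k + step)
            log.append((k + step, g))
            if g <= best:                  # not better: stop going this way
                break
            k, best = k + step, g          # better: keep going in the same direction
    return k, best, log

if __name__ == "__main__":
    # Hypothetical per-student action durations and labels
    students = [[5, 1, 2], [1, 9, 4], [3, 3, 6], [8, 2, 1]]
    labels = [2, 4, 6, 1]
    print(refine(students, labels, k_start=1, k_max=3))
```

The returned log lists every variant tried with its goodness, in the order tried, which is most of what a process write-up needs.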
Iterative Feature Refinement
Write a report that discusses your process
– I took feature N
– I changed it from N to N*
– The goodness changed from G to G*
– Then I did…
Iterative Feature Refinement
You don’t need to prepare a presentation
But be ready to discuss your features in class
Also for Next Monday Please read Rodrigo, M.M.T., Baker, R.S.J.d., McLaren, B., Jayme, A., Dy, T. (2012) Development of a Workbench to Address the Educational Data Mining Bottleneck. Proceedings of the 5th International Conference on Educational Data Mining,
Next Classes
3/30 Feature Reuse
– IFR assignment due
4/1 Lab Session: Building Predictive Models
– Come to this if you want to learn more about the theory behind building predictive models; how to do it effectively and appropriately (beyond just the how)
– You don’t need to come to this if you’ve taken Core Methods or Big Data and Education
Thank you!