Machine learning hackathon


Machine learning hackathon Results of the 3-hour hackathon, February 18, 2016

Description of the case
In this hackathon, we used an anonymized dataset of PLC output from 3 of the busiest production lines. The data is split into a training set covering 11 months (23,994 cases, from 01-2015 to 11-2015) and a test set covering 2 months (2,781 cases, from 12-2015 to 01-2016). The training set contains 96 variables: 95 predictors plus the target variable ‘Afvulsnelheid’ (filling speed). The test set omits ‘Afvulsnelheid’ and therefore has 95 variables. The goal of the hackathon was to predict ‘Afvulsnelheid’ for the test set: design an algorithm, train it on the 11 months of training data, then feed it the variables from the test set and have it predict a value for each case. To make the hackathon more challenging, contestants were allowed to use only 5 variables for the prediction instead of the 95 provided. This restriction forced them to dive into the data, explore correlations, and find creative ways to combine variables into only 5 without losing too much of the information in the data.
February 18, 2016 www.itility.nl
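One simple way to approach the 5-variable restriction is to rank candidate variables by how strongly they correlate with the target and keep the strongest ones. The sketch below illustrates that idea only; it is not the method any team is known to have used. The function names (`pearson`, `top_k_by_correlation`) and the toy data are hypothetical and not taken from the hackathon dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def top_k_by_correlation(columns, target, k=5):
    """Return the k column names with the largest absolute correlation to target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up toy data (assumption): three candidate variables and a target.
toy_columns = {
    "var_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with the target
    "var_b": [5.0, 3.0, 4.0, 1.0, 2.0],  # falls as the target rises
    "var_c": [2.0, 2.0, 2.0, 2.0, 2.0],  # constant, no signal
}
toy_target = [10.0, 20.0, 30.0, 40.0, 50.0]

selected = top_k_by_correlation(toy_columns, toy_target, k=2)
print(selected)
```

Ranking by absolute correlation only captures linear, one-variable-at-a-time relationships, so it is a starting point for exploration rather than a complete selection procedure.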

Description of the case
Scoring: Models are rated on the RMSE (root mean squared error), where lower is better.
Winners: The hackathon was won by a team that scored an RMSE of 679 using Azure ML Studio. Translated to actual values, the team's score corresponds to an MAE (mean absolute error) of 368, meaning predictions were off by roughly that amount on average. This might seem like a lot, but since the value varies between 0 and roughly 12,000, it is quite accurate.
Practical use: For now, it is hard to say whether the winning algorithm has any practical use. We do not yet know enough about the production process to judge whether predicting ‘Afvulsnelheid’ adds value. It may also be that the variables used for the prediction are not available beforehand, which would make them useless as predictors. What the hackathon did teach us, however, is that the dataset hides many correlations between variables and that, with the help of a domain expert, we might be able to convert these into value for the business.
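For reference, the two error measures mentioned above can be computed as in the minimal sketch below. The toy values are made up for illustration and are not hackathon results. RMSE is never smaller than MAE on the same predictions, and a large gap between the two (such as 679 versus 368) typically indicates that a few predictions were off by much more than the average.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square the errors, average, take the root."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: average size of the errors, ignoring sign."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Made-up toy values (assumption): one prediction is far off, the rest are close.
toy_actual = [1000.0, 2000.0, 3000.0, 4000.0]
toy_predicted = [1100.0, 1900.0, 3000.0, 4600.0]

toy_rmse = rmse(toy_actual, toy_predicted)
toy_mae = mae(toy_actual, toy_predicted)
print(f"RMSE = {toy_rmse:.1f}, MAE = {toy_mae:.1f}")
```

Because RMSE squares each error before averaging, the single large miss in the toy data pulls it well above the MAE.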

Top 10 variables
1. PGA = Automaat
2. PGUI = Verpakt_uitloop1
3. PGVL = Verpakt_totaal
4. PGMT = snelheid_station4
5. SDVT = Druk_voor_station4
========
6. T2AT = Aanvoerleiding_Actuele_temperatuur
7. SDVO = Druk_voor_station3
8. SDNO = Druk_na_station3
9. PGST = Luchtdosering_Slaglengte_station4
10. T1AT = Temperaturen_Aanvoerleidingen_Actuele_temperatuur
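Combining variables is one way to stay within a 5-variable limit. As a purely hypothetical illustration, the sketch below merges the pressure before and after station 3 (`Druk_voor_station3`, `Druk_na_station3`) into a single difference, so that two inputs occupy one slot. Whether this difference is meaningful for the process is an assumption that would need a domain expert to confirm; the slides do not say that any team built this feature, and the numbers are made up.

```python
def pressure_difference(before, after):
    """Combine two pressure readings into one derived feature per case."""
    return [b - a for b, a in zip(before, after)]

# Made-up toy readings (assumption), keyed by the variable names from the list.
toy_cases = {
    "Druk_voor_station3": [5.0, 5.5, 6.0],
    "Druk_na_station3": [4.0, 4.25, 4.5],
}

# Hypothetical derived feature: one column replaces two.
station3_difference = pressure_difference(
    toy_cases["Druk_voor_station3"],
    toy_cases["Druk_na_station3"],
)
print(station3_difference)
```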
