Machine learning hackathon
Results of the 3-hour hackathon (18 February 2016)
Description of case

In this hackathon, an anonymized dataset containing PLC output was used. The dataset contains data from 3 of the busiest production lines. The data is split into a training set containing 11 months of data and a test set containing 2 months of data.

Training set: 23,994 cases, from 01-2015 to 11-2015, with 96 variables (95 predictors plus the value to predict).
Test set: 2,781 cases, from 12-2015 to 01-2016, with 95 variables (the ‘Afvulsnelheid’ variable is omitted).

The goal of the hackathon was to predict ‘Afvulsnelheid’ (Dutch for filling speed) in the test set. This is done by designing an algorithm that is trained on the 11 months of training data, then feeding that algorithm the 95 variables from the test set and having it make a prediction for each case in the set.

To make the hackathon more challenging, a restriction was added: only 5 variables could be used to predict ‘Afvulsnelheid’, instead of the given 95. This restriction forced contestants to dive into the data, explore correlations, and find creative ways to combine variables into only 5 without losing too much of the information contained in the data.

February 18, 2016 www.itility.nl
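One simple way to approach the 5-variable restriction is to rank every candidate variable by the absolute value of its Pearson correlation with the target and keep the top 5. The sketch below illustrates that idea only; it is not necessarily what any team did. The data is synthetic, and the column names (var_0 ... var_7) and the helper names pearson and top_k_by_correlation are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def top_k_by_correlation(columns, target, k=5):
    """Return the k column names with the largest |correlation| to target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Synthetic stand-in for the training set (assumption: NOT the hackathon data).
rng = random.Random(0)
n_rows = 200
columns = {f"var_{i}": [rng.gauss(0, 1) for _ in range(n_rows)] for i in range(8)}

# The made-up target depends on var_0 and var_1 only, plus a little noise.
target = [
    3 * a - 2 * b + rng.gauss(0, 0.1)
    for a, b in zip(columns["var_0"], columns["var_1"])
]

selected = top_k_by_correlation(columns, target, k=5)
print(selected)
```

Correlation ranking only captures linear, one-variable-at-a-time relationships. Combining several raw variables into one engineered variable, as the restriction encouraged, would need an extra step before the ranking.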
Description of case

Scoring: Models are rated on RMSE (root mean squared error), where lower is better.

Winners: The hackathon was won by a team scoring an RMSE of 679 using Azure ML Studio. Translated to actual values, the team's score corresponds to an MAE (mean absolute error) of 368, meaning the predictions were off by 368 on average. This might seem a lot, but since the value varies between 0 and roughly 12,000, it amounts to an error of about 3% of the range, which is quite accurate.

Practical use: For now, it is hard to say whether the winning algorithm has any practical use. We do not currently know enough about the production process to know whether predicting ‘Afvulsnelheid’ has value for the process. It may also be that the variables used for the prediction are not available beforehand, which would make them useless as predictors. What the hackathon did teach us, however, is that the dataset contains many correlations between variables and that, with the help of a domain expert, we might be able to convert these into value for the business.
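The two scoring metrics can be sketched in a few lines of plain Python. The actual and predicted values below are invented for illustration and are not hackathon results. Only the reported MAE of 368 and the approximate range of 12,000 come from the slides.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Invented example values (assumption: not taken from the dataset).
actual = [1000.0, 5000.0, 9000.0, 12000.0]
predicted = [1100.0, 4900.0, 9000.0, 11000.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(f"RMSE = {rmse_value:.1f}, MAE = {mae_value:.1f}")

# Reported MAE relative to the approximate value range from the slides.
reported_mae = 368
approx_range = 12000
relative_error = reported_mae / approx_range
print(f"MAE as share of range: {relative_error:.1%}")
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few predictions are far off, because squaring weights large errors more heavily. A winning RMSE of 679 next to an MAE of 368 is consistent with most predictions being close and a smaller number being far off.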
Top 10 variables

1. PGA = Automaat
2. PGUI = Verpakt_uitloop1
3. PGVL = Verpakt_totaal
4. PGMT = snelheid_station4
5. SDVT = Druk_voor_station4
========
6. T2AT = Aanvoerleiding_Actuele_temperatuur
7. SDVO = Druk_voor_station3
8. SDNO = Druk_na_station3
9. PGST = Luchtdosering_Slaglengte_station4
10. T1AT = Temperaturen_Aanvoerleidingen_Actuele_temperatuur