MINING HISTORICAL DELAY DATA IN RAILWAYS. Fabrizio Cerreto, PhD student, Transport modelling, DTU Management Engineering
About the PhD research project. Background: MSc in Transport Engineering (Sapienza University of Rome + TU Delft); operation/punctuality analyst at NTV SpA; PhD student at DTU in IPTOP. Understanding delays in railways: analytical model (delay propagation); empirical data analysis; micro-simulation. Mining historical delay data in railways, 9 November 2018
About the PhD research project, empirical data analysis in detail: realized running times, real timetable supplements; earliness of trains; correlation heatmaps; principal components; delay profile clustering.
Background – research motivation. Timetable allowance: running time supplements; headway buffers. Practical design: good practices for magnitude (e.g. capacity consumption, UIC 406); rule of thumb for distribution (national rules: uniform, concentrated). Understanding delays: causes; recurrent patterns. Robust design: timetable supplements; headway buffers; primary delay prevention.
Data. Train timestamps at every station: train characteristics; station and operation; schedule; delay. Information from dispatchers. Combined data from DSB: rolling stock plan and operation. ~150M records, 2010-2016. Record fields: Scheduled time, Train ID, Station ID, Delay, Record Type, Input Source, Product, Operator, Cause Group Code, Cause Type Code, Delay Report Code, Delay Report Cause. Sample record: 06NOV13 09:12:00, 2720, KA, -5.17, I, DWH, RV, DSB, PASSAGER, 611, NULL. Additional sample values: 09:12:30, -6, U; 09:16:00, NA, -3.67.
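A minimal sketch of how one such record might be held in memory, assuming the first five sample values map onto the first five fields in header order. The class name `DelayRecord`, the attribute names and the reading of the delay as minutes are illustrative assumptions, not part of the DSB data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DelayRecord:
    """Hypothetical container for the first five fields listed on the slide."""
    scheduled: datetime   # "Scheduled time"
    train_id: str         # "Train ID"
    station_id: str       # "Station ID"
    delay_min: float      # "Delay" (unit assumed to be minutes)
    record_type: str      # "Record Type" (values "I" and "U" appear in the sample)


def parse_scheduled(text: str) -> datetime:
    # The sample timestamp looks like "06NOV13 09:12:00": day, month abbreviation, 2-digit year.
    # Month parsing relies on English month abbreviations (default C locale).
    return datetime.strptime(text, "%d%b%y %H:%M:%S")


record = DelayRecord(
    scheduled=parse_scheduled("06NOV13 09:12:00"),
    train_id="2720",
    station_id="KA",
    delay_min=-5.17,
    record_type="I",
)
print(record)
```

Keeping the train and station identifiers as strings avoids losing leading zeros or mixing numeric and alphabetic codes.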
Vestbane: Copenhagen - Roskilde. Time frame: Q3 2014. Line: København H – Roskilde, ~30 km. Traffic: semi-periodic timetable; express trains from/to Copenhagen; heterogeneous (freight + regional + national + international). Reasons: most important section; high interest from authorities.
Previous results: Copenhagen – Roskilde. [Plots: realized running times; actual running time supplements; both annotated "2nd percentile".]
Previous results: Copenhagen – Roskilde. Frequent delay patterns. [Figure labels: patterns 1, 2, 3; loses time; gains time; early at Roskilde; late at Copenhagen; stations Copenhagen, Valby, Høje Tåstrup, Roskilde.]
Previous results: Copenhagen – Roskilde. Realized running times: bias in timestamping. Detection points (track circuits) vs. timetable points (platform); departure bias; arrival bias.
Kystbane: Copenhagen - Helsingør. Time frame: timetable year 2014 (15/12/2013 – 14/12/2014). Line: København H – Helsingør, ~50 km. Reasons: cyclic timetable; well isolated; high interest from authorities.
Northbound timetable: Copenhagen - Helsingør. 3 stopping patterns; 6-9 trains/h; standardized rolling stock. Analyze separately: period of day; changes in operation. Stopping patterns: rush hour reinforcement; skip-stop from Copenhagen; stop-train from Sweden.
Data transpose / column split. Observations/rows: train-date. Fields/columns: station records. New variables: delay change; scheduled running time; realized running time; more… (20-25 more variables). Example layout: columns Date, Train ID, Data, KH U, KH I, KN U, KN I, KK U, KK I, …; for 22-apr-14, train 1314: Delay 0.4, 0.22, -0.17, -0.93; Delay_change 0.18, 0.39, 0.76, 0.47; Sch_run_time 3, 3.5, 1, 2.5, 0.5; Real_run_time 3.18, 3.89, 1.76, 1.57, 2.97.
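The transpose and the derived variables can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the record layout, the point labels `P1` to `P4` and the scheduled minutes are hypothetical, and the delays are borrowed from the slide in an assumed chronological order. Delay change is taken as the delay at a point minus the delay at the previous point, and realized running time as scheduled running time plus delay change, which agrees with the sample pairs on the slide (e.g. 3 and 0.18 give 3.18).

```python
from collections import defaultdict

# Hypothetical long-format records, one per train passage at a timetable point:
# (date, train_id, sequence, point, scheduled_minute, delay_minutes)
RECORDS = [
    ("2014-04-22", 1314, 0, "P1", 0.0, -0.93),
    ("2014-04-22", 1314, 1, "P2", 1.0, -0.17),
    ("2014-04-22", 1314, 2, "P3", 4.5, 0.22),
    ("2014-04-22", 1314, 3, "P4", 7.5, 0.40),
]


def to_wide(records):
    """Group long records into one ordered row per (date, train)."""
    rows = defaultdict(list)
    for date, train, seq, point, sched, delay in records:
        rows[(date, train)].append((seq, point, sched, delay))
    return {key: sorted(obs) for key, obs in rows.items()}


def derive(observations):
    """Compute per-segment variables between consecutive points."""
    segments = []
    for (_, p0, s0, d0), (_, p1, s1, d1) in zip(observations, observations[1:]):
        delay_change = d1 - d0   # positive: the train lost time on the segment
        sched_run = s1 - s0      # scheduled running time
        segments.append({
            "segment": f"{p0}->{p1}",
            "delay_change": round(delay_change, 2),
            "sch_run_time": round(sched_run, 2),
            "real_run_time": round(sched_run + delay_change, 2),
        })
    return segments


wide = to_wide(RECORDS)
profile = derive(wide[("2014-04-22", 1314)])
for segment in profile:
    print(segment)
```

Each train-date becomes one observation, and the list of delay changes along the route is the delay change profile of that run.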
First glance: scatterplots and distributions of delays. Highly correlated; highly non-normal.
First glance: scatterplots and distributions of delay changes. Non-correlated; highly non-normal.
Delay and delay change profiles: train 1309, 22/4/2014.
Pooled data
Issues with non-normality. Tests for changes in operation to Helsingør: Nørreport closed for renovation; trains skipped the stop (hidden timetable supplement); 22/4/2014: Nørreport opens again to main line trains. Test: before vs. after Nørreport re-opening. Parametric multivariate tests require normality, so univariate t-tests at stations are used instead. Result: significantly different operation. Dataset shrunk to Roskilde/Odense to Sweden.
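A univariate before/after comparison at each station could look like the sketch below. The slide only says "univariate t-test"; choosing Welch's unequal-variance variant is an assumption, and the delay samples are invented for illustration. The sketch returns the t statistic and the Welch-Satterthwaite degrees of freedom; turning them into a p-value needs a t-distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(before, after):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    n1, n2 = len(before), len(after)
    v1, v2 = variance(before), variance(after)   # sample variances (n - 1 denominator)
    se2 = v1 / n1 + v2 / n2                      # squared standard error of the mean difference
    t = (mean(before) - mean(after)) / sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical delays in minutes at two stations, before and after the re-opening.
samples = {
    "KH": ([1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 4.0, 5.0, 6.0, 7.0]),
    "KN": ([0.5, 1.0, 1.5, 2.0, 2.5], [0.6, 1.1, 1.4, 2.1, 2.4]),
}

results = {station: welch_t(before, after) for station, (before, after) in samples.items()}
for station, (t, df) in results.items():
    print(f"{station}: t = {t:.2f}, df = {df:.1f}")
```

Running one test per station multiplies the chance of a false positive, so a correction for multiple comparisons may be worth considering when many stations are tested.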
Correlation heatmaps. [Panels: northbound and southbound, ØK and ØP.] Significantly different patterns by direction and by stopping pattern. Smooth fades vs. sharp changes.
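The matrix behind such a heatmap is the pairwise correlation between the values recorded at each timetable point across all train runs. The sketch below uses Pearson correlation, which is an assumption since the slides do not name the coefficient, and invented delays at three points labelled with station codes from the data layout slide.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict with the correlation between every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical delays (minutes) of four train runs at three timetable points.
delays = {
    "KH": [0.0, 1.0, 2.0, 3.0],
    "KN": [0.5, 1.5, 2.5, 3.5],
    "KK": [3.0, 2.0, 1.0, 0.0],
}

matrix = correlation_matrix(delays)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Ordering the rows and columns by position along the line is what makes smooth fades or sharp changes visible when the matrix is drawn as a heatmap.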
Principal components analysis. Capture intrinsic variability in the data; resampling (noise reduction); dimension reduction (data handling).
Principal components analysis: example. 95% of the variability is explained with only 2 PCs. Drawback: strongly affected by non-normality.
Eigenvalues of the correlation matrix:
Principal Component  Eigenvalue  Difference  Proportion  Cumulative
1                    17.12432    16.14742    90.13%      90.1%
2                    0.976895    0.613815    5.14%       95.3%
3                    0.36308     0.148796    1.91%       97.2%
4                    0.214285    0.099198    1.13%       98.3%
5                    0.115087    0.053289    0.61%       98.9%
6                    0.061797    0.020083    0.33%       99.2%
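The Proportion and Cumulative columns follow from the eigenvalues alone: for a correlation matrix the eigenvalues sum to the number of variables, so each proportion is the eigenvalue divided by that number. The sketch below reproduces the columns from the slide's eigenvalues. The value of 19 variables is an assumption inferred from the reported proportions (17.12432 / 19 ≈ 90.13%); it is not stated on the slide.

```python
# Eigenvalues of the correlation matrix, copied from the slide (first six components).
eigenvalues = [17.12432, 0.976895, 0.36308, 0.214285, 0.115087, 0.061797]

# Assumption: 19 standardized variables, inferred from the reported proportions.
N_VARIABLES = 19

# Share of total variance carried by each component.
proportions = [ev / N_VARIABLES for ev in eigenvalues]

# Running total of the shares.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)

# "Difference" column: gap between consecutive eigenvalues.
differences = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]

for i, (ev, share, cum) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"PC{i}: eigenvalue={ev:.5f} proportion={share:.2%} cumulative={cum:.1%}")
```

The first two components together pass the 95% mark, which is the basis for keeping only 2 PCs in this example.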
Clustering: K-means. Simple; fast; converges almost always. k must be chosen (metrics); clusters not fixed, no reference.
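For reference, a minimal sketch of the standard K-means iteration (Lloyd's algorithm) applied to delay profiles. The profiles, the choice of k = 2 and the fixed initial centroids are illustrative assumptions; the slides do not state which implementation or initialization was used.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: assign each point to its nearest centroid, then recompute centroids."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for each point.
        labels = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical delay profiles (minutes at three consecutive stations), one per train run.
profiles = [
    [0.0, 0.1, 0.2],
    [0.1, 0.2, 0.1],
    [5.0, 5.5, 6.0],
    [5.2, 5.4, 6.1],
]

labels, centroids = kmeans(profiles, centroids=[profiles[0], profiles[2]])
print(labels)
```

Because the result depends on k and on the starting centroids, cluster labels carry no fixed meaning between runs, which is the "no reference" drawback noted on the slide.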
Clustering on delay: northbound trains. Layered; fuzzy.
Clustering on delay change: northbound trains.
Clustering on delay: southbound trains.
Clustering on delay change: southbound trains. Not clustered; fuzzy.
Conclusions. Real running time supplement vs. scheduled: calibrate with measured offset. Non-normality is an issue: multivariate statistical tests; PCA. Clustering depends on direction (delay vs. delay change). Direction towards bottlenecks: delay changes are distributed; clustering on delay. Direction from bottlenecks: delay changes are concentrated at the bottleneck; clustering on delay change. The correlation heatmap explains clustering on delay changes.
Data mining: next steps. Understanding factors that influence clustering: causes of delays (identify primary delays from historical data); regression/classification into clusters (period of the day, period of the year, weekday, composition, composition changes). Dynamics in delay propagation: observations are days of operation; define variables; include changes in the plan; cluster days to forecast operation (short term).
Thanks for your attention. Fabrizio Cerreto, PhD student, Transport modelling, DTU Management Engineering