You can NOT be serious! How to build a tennis model in 30 minutes. Dr Tim Paulden (Innovation & Development Manager, ATASS Sports). EARL 2014, London, 16 September 2014
Introduction ATASS Sports – Sports forecasting – Hardcore statistical research – Fusion of ‘academic’ and ‘pragmatic’ Today… – Building a very basic tennis model – Highlighting some key ideas
Tennis modelling Data obtained from tennis-data.co.uk – Spreadsheet for each year – You can easily get the data yourself! Ultimate goal of modelling is to determine the probability of different outcomes Can we forecast the probability of victory in a match from the players’ world rankings? How do we identify a "good" model?
Concept 1: Model calibration An effective model must be well-calibrated – The probabilities produced by the model must be consistent with the available data – Think in terms of “bins” – if we gather together all the cases where our generated win probability lies between 0.6 and 0.7 (say), the observed proportion of wins should match the mean win probability for the bin (roughly 0.65) – Here’s an extract from Nate Silver’s recent bestseller, “The Signal and the Noise”…
Concept 2: Model score Suppose we use a model to produce probabilities for a large number of sporting events (e.g. a collection of tennis matches). We can assess the model's quality by summing log(p) over all predictions, where p is the probability we assigned to the outcome that occurred; this sum is the model score. The closer we match the "true" probabilities, the higher the model score (closer to zero).
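The model score described above can be sketched in a few lines of R. This is a minimal sketch only: `model_score` is a hypothetical helper name, and the toy vectors are made-up numbers rather than tennis data.

```r
# Hypothetical helper: sum of log(p) over the outcomes that actually occurred.
# p   = probability assigned to "player A wins"
# res = 1 if player A won, 0 otherwise
model_score <- function(p, res) {
  p_outcome <- ifelse(res == 1, p, 1 - p)  # probability given to what happened
  sum(log(p_outcome))
}

# Toy example (made-up numbers, not from the tennis data)
p   <- c(0.8, 0.6, 0.3)
res <- c(1, 0, 0)
toy_score  <- model_score(p, res)            # log(0.8) + log(0.4) + log(0.7)
null_score <- model_score(rep(0.5, 3), res)  # 3 * log(0.5)
toy_score > null_score                       # the informed model scores higher
```

If every match appears twice in the data (A vs B and B vs A), summing over the rows with res == 1 counts each match once.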
The data set…
tennis has 68,972 rows of data, with each match appearing twice (A vs B and B vs A)
Columns: matchid, date, day ago, surf, bestof, aname, arank, bname, brank, res
> head(tennis)
surf  bestof  aname          arank  bname          brank
hard  3       clement a      18     gaudenzi a     101
hard  3       gaudenzi a     101    clement a      18
hard  3       goldstein p    81     jones a        442
hard  3       jones a        442    goldstein p    81
hard  3       haas t         23     smith l        485
hard  3       smith l        485    haas t         23
> tail(tennis)
surf  bestof  aname          arank  bname          brank
clay  3       rosol l        48     simon g        16
clay  3       simon g        16     rosol l        48
clay  3       garcia lopez g 87     mayer f        29
clay  3       mayer f        29     garcia lopez g 87
clay  3       rosol l        48     garcia lopez g 87
clay  3       garcia lopez g 87     rosol l        48
(In the last row of each extract, res = 0)
From ranks to probabilities How might we map the players' rankings onto a win probability? – We’ll look at an extremely rudimentary approach in a moment as a worked example – But first, consider for a moment how you might mathematically combine the players’ rankings to get a win probability for each player – What are the important properties?
A "first stab" – Model 1 Suppose our first guess is that if the two players' rankings are A and B, the probability of A winning the match is B/(A+B). [Table: sample rows of tennis (matchid, aname, arank, bname, brank, res) with the new column aprob1. Matches shown, each listed in both directions: clement a (18) v gaudenzi a (101); goldstein p (81) v jones a (442); haas t (23) v smith l (485); henman t (10) v rusedski g (69); hewitt l (7) v arthurs w (83)]
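As a worked example, Model 1 can be applied to the rankings in the sample rows. This is a sketch only: the vectors are typed in by hand, with each opponent's rank read off the mirrored row of the same match.

```r
# Ranks from the sample rows (each match appears in both directions)
arank <- c(18, 101, 81, 442, 23, 485, 10, 69, 7, 83)
brank <- c(101, 18, 442, 81, 485, 23, 69, 10, 83, 7)

# Model 1: probability that player A wins is B / (A + B)
aprob1 <- brank / (arank + brank)
round(aprob1, 3)

# The two rows of the same match always sum to 1
pair_sums <- aprob1[c(1, 3, 5, 7, 9)] + aprob1[c(2, 4, 6, 8, 10)]
pair_sums
```

The better-ranked player (smaller rank number) always gets a probability above 0.5, and the model depends only on the ratio of the two rankings.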
A "first stab" – Model 1 The "null" model, in which each player is always assigned a probability of 0.5, provides the baseline model score. Model 1 gives an improvement of 1710 over the null model (closer to zero is better).
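The null model's score can be worked out by hand, since every prediction contributes log(0.5). The sketch below assumes that the score counts each match once (summing over the winner's row, as in the res == 1 sum used later in the talk) and that the 68,972 rows form exactly 34,486 matches.

```r
n_rows    <- 68972       # rows in the tennis data frame (from the talk)
n_matches <- n_rows / 2  # assumption: every match appears exactly twice

# Each match contributes log(0.5) under the null model
null_score <- n_matches * log(0.5)
null_score
```

Any model that carries real information about the players should score above this baseline, i.e. closer to zero.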
How about the calibration? Let's generate a calibration plot for Model 1 – We'll use bins of width 0.1 (0 to 0.1, 0.1 to 0.2, etc), closed at the left hand side (e.g. 0.6 ≤ x < 0.7) – For each bin, we consider all instances where our model probability lies inside the bin, and plot a point whose x-coordinate is the mean of the model probabilities and whose y-coordinate is the observed proportion of wins for these instances – Example: In this case, for the bin 0.6 ≤ x < 0.7, the point plotted is (0.648, 0.588)
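The binning procedure above can be sketched as follows. The data are simulated purely for illustration (the model probabilities are deliberately made too extreme), and names such as `p_model` and `calib` are hypothetical.

```r
set.seed(1)  # synthetic data only, for illustration

# Simulated "true" probabilities and outcomes; p_model is deliberately too extreme
p_true  <- runif(5000, 0.05, 0.95)
res     <- rbinom(5000, 1, p_true)
p_model <- pmin(pmax(0.5 + 1.4 * (p_true - 0.5), 0.001), 0.999)

# Bins of width 0.1, closed on the left: [0, 0.1), [0.1, 0.2), ...
bins <- cut(p_model, breaks = seq(0, 1, by = 0.1), right = FALSE)

# One point per bin: (mean model probability, observed proportion of wins)
calib <- data.frame(
  mean_prob = as.vector(tapply(p_model, bins, mean)),
  obs_prop  = as.vector(tapply(res, bins, mean)),
  n         = as.vector(table(bins))
)

plot(calib$mean_prob, calib$obs_prop, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Mean model probability", ylab = "Observed proportion of wins")
abline(0, 1, lty = 2)  # a perfectly calibrated model lies on this line
```

Points below the dashed line at the high end and above it at the low end indicate probabilities that are too extreme.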
Systematic bias of Model 1
A quick fix... The probabilities are systematically too extreme, so we could try blending Model 1 with 0.5. What weighting on Model 1 gives the best model score? A weighting of 0.71 on Model 1 is best, and the model score improves as a result. Obtaining the best weighting can be done as a one-liner in R…
Quick fix (one-liner in R)
glm(tennis$res~tennis$aprob1, family=binomial(link="identity"))
Call: glm(formula = tennis$res ~ tennis$aprob1, family = binomial(link = "identity"))
Coefficients: (Intercept), tennis$aprob1
AIC: 85330
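The weighting search can also be written out explicitly, which makes it clearer what the one-liner is doing. The sketch below uses synthetic data in which the "true" probability is a 70/30 blend of the model probability and 0.5; the function name `blend_score` and all numbers are assumptions for illustration.

```r
set.seed(2)  # synthetic data only

# Synthetic "Model 1" probabilities that are too extreme by construction:
# the true win probability is a 70/30 blend of aprob1 and 0.5
aprob1 <- runif(20000, 0.02, 0.98)
p_true <- 0.7 * aprob1 + 0.3 * 0.5
res    <- rbinom(20000, 1, p_true)

# Score of the blended model as a function of the weight w on Model 1
blend_score <- function(w) {
  p <- w * aprob1 + (1 - w) * 0.5
  sum(log(ifelse(res == 1, p, 1 - p)))
}

# Search for the weight with the highest (closest to zero) score
best <- optimize(blend_score, interval = c(0, 1), maximum = TRUE)
best$maximum  # should land near the 0.7 used to simulate the data
```

With an identity link, the fitted slope plays the role of the weight w; when the data are symmetric in A and B, the intercept corresponds to (1 - w)/2.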
Bias reduced, but still apparent
A substantial improvement [Tables: sample rows (matchid, arank, brank, res, aprob) with the model scores for Model 1 and Model 2]
Stepping up a gear Invlogit function – widely used to predict binary sports outcomes (logistic regression)
Logistic regression Let's do a logistic regression of the result on the difference in rank, (B – A). This is equivalent to player A's win probability being invlogit( k*(B – A) ). The optimal value of k can be found using glm.
Logistic regression
rankdiff = tennis$brank - tennis$arank
g1 = glm(tennis$res~rankdiff-1, family=binomial(link="logit"))
Logistic regression
summary(g1)
Call: glm(formula = tennis$res ~ rankdiff - 1, family = binomial(link = "logit"))
Coefficients: rankdiff, Pr(>|z|) <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Number of Fisher Scoring iterations: 4
This has terrible calibration!
Logistic regression The model score comes out as -22480. Worse than Model 1! For comparison: – Null model (probabilities all 0.5) – Model 1 (simple B/(A+B) model) – Model 2 (Model 1 squeezed) We need a better way of capturing the curvature…
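One way to see what goes wrong with the rank difference: a model based on B – A treats a gap of ten places the same way at the top of the rankings as it does a hundred places down. The coefficient `k` below is hypothetical, chosen only to illustrate the point.

```r
invlogit <- function(x) exp(x) / (exp(x) + 1)

k <- 0.01  # hypothetical coefficient, chosen only for illustration

# Two match-ups with the same rank difference of 10
p_top  <- invlogit(k * (11 - 1))     # rank 1 vs rank 11
p_deep <- invlogit(k * (111 - 101))  # rank 101 vs rank 111

c(p_top, p_deep)  # identical: the model only sees B - A

# The ratio of the rankings tells the two cases apart (as in B/(A+B))
c(11 / (1 + 11), 111 / (101 + 111))
```

A gap between rank 1 and rank 11 plausibly means far more than a gap between rank 101 and rank 111, which a ratio-based or log-based term can capture.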
Our paper... Developing an improved tennis ranking system David Irons, Stephen Buckley and Tim Paulden MathSport International, June 2013 Updated version to appear this year in JQAS (Journal of Quantitative Analysis of Sports)
Our paper... A decent model for generating probabilities is invlogit( 0.58*(log(B) - log(A)) ), where invlogit is the function exp(x)/(exp(x)+1). [Table: sample rows (matchid, arank, brank, res, aprob) with the Model 3 score]
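The formula above translates directly into R. In this sketch `model3_prob` is a hypothetical helper name; the coefficient 0.58 and the definition of invlogit are taken from the slide, and the ranks 18 and 101 are the clement a v gaudenzi a example from the data extract.

```r
# invlogit as defined on the slide; base R's plogis() computes the same function
invlogit <- function(x) exp(x) / (exp(x) + 1)

# Model 3: probability that player A (rank A) beats player B (rank B)
model3_prob <- function(A, B, k = 0.58) invlogit(k * (log(B) - log(A)))

model3_prob(18, 101)  # clement a (18) vs gaudenzi a (101)
model3_prob(101, 18)  # mirrored row; the two probabilities sum to 1
model3_prob(50, 50)   # equal ranks give 0.5
```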
Our paper…
logterm = log(tennis$brank)-log(tennis$arank)
g1 = glm(tennis$res~logterm-1, family=binomial(link="logit"))
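To try the same fit without the real spreadsheets, one option is to simulate results from a known coefficient and check that glm recovers it. Everything below is synthetic: the rank range, the sample size and `k_true` are assumptions for illustration, not values from the tennis data.

```r
set.seed(3)  # synthetic data only, not the tennis-data.co.uk spreadsheets

n     <- 20000
arank <- sample(1:300, n, replace = TRUE)
brank <- sample(1:300, n, replace = TRUE)

# Simulate results from a known coefficient so the fit can be checked
k_true  <- 0.58
logterm <- log(brank) - log(arank)
res     <- rbinom(n, 1, plogis(k_true * logterm))

# Same model form as on the slide: no intercept, logit link
g <- glm(res ~ logterm - 1, family = binomial(link = "logit"))
k_hat <- coef(g)[["logterm"]]
k_hat  # estimate of k, expected to be close to k_true
```

Dropping the intercept (the `- 1` in the formula) forces the model to give 0.5 when the two rankings are equal, which is what symmetry between A and B requires.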
Our paper…
summary(g1)
Call: glm(formula = tennis$res ~ logterm - 1, family = binomial(link = "logit"))
Coefficients: logterm, highly significant (***)
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Number of Fisher Scoring iterations: 3
Our paper… sum(log(g1$fitted.values[which(tennis$res==1)])) [1]
Almost perfect calibration
Some comparisons [Tables: sample rows (matchid, arank, brank, res, aprob) with the model scores for Model 2 and Model 3]
Some comparisons [Tables: sample rows (matchid, arank, brank, res, aprob) for Model 1, Model 2 and Model 3 side by side]
Coming full circle In fact, a bit of algebra shows that invlogit( 0.58*(log(B) - log(A)) ) is exactly the same as B^0.58 / (A^0.58 + B^0.58). And invlogit( B-A ) is the same as exp(B)/(exp(A) + exp(B)). Try the simplest thing that could possibly work!
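Both identities are easy to confirm numerically. The check below is a sketch using a handful of arbitrary ranks; the second identity is tested on small values only, since exp() of a large rank would overflow.

```r
invlogit <- function(x) exp(x) / (exp(x) + 1)

A <- c(1, 18, 81, 442)
B <- c(7, 101, 442, 81)
k <- 0.58

# invlogit(k * (log(B) - log(A))) versus B^k / (A^k + B^k)
lhs <- invlogit(k * (log(B) - log(A)))
rhs <- B^k / (A^k + B^k)
all.equal(lhs, rhs)

# invlogit(B - A) versus exp(B) / (exp(A) + exp(B)), on small values
a <- c(1, 2, 5)
b <- c(3, 2, 1)
all.equal(invlogit(b - a), exp(b) / (exp(a) + exp(b)))
```

Seen this way, Model 1 is the special case with exponent 1, and the fitted exponent of 0.58 pulls the probabilities towards 0.5.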
Graphically…
A final extension What about the effect of the number of sets? Let's take the best model (Model 3) and look at the calibration plots…
Model 3 - All matches
Model 3 - Best of 3 sets
Model 3 - Best of 5 sets
Model 4 This suggests we should have a combined model – "Model 4" – based on the rules that are in operation. For "best of 3 sets": invlogit( 0.54*(log(B) - log(A)) ) For "best of 5 sets": invlogit( 0.72*(log(B) - log(A)) )
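A combined model of this kind might be coded as below. This is a sketch: `model4_prob` is a hypothetical helper name, and it assumes the larger coefficient (0.72) applies to best-of-5 matches and 0.54 to best-of-3 matches. The ranks 10 and 69 are the henman t v rusedski g example from the sample rows.

```r
invlogit <- function(x) exp(x) / (exp(x) + 1)

# Hypothetical helper: pick the coefficient according to the match format
model4_prob <- function(A, B, bestof) {
  k <- ifelse(bestof == 5, 0.72, 0.54)  # 0.72 for best of 5, 0.54 for best of 3
  invlogit(k * (log(B) - log(A)))
}

p3 <- model4_prob(A = 10, B = 69, bestof = 3)
p5 <- model4_prob(A = 10, B = 69, bestof = 5)
c(p3, p5)  # the better-ranked player is favoured more strongly over 5 sets
```

Because `ifelse` is vectorised, the same function can be applied to whole columns, for example with the arank, brank and bestof columns of the data frame.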
Model 4 – All matches
Model 4 – Best of 3 sets
Model 4 – Best of 5 sets
The best model score so far For Model 4, the model score is -21252. A final comparison: – Null model (probabilities all 0.5) – Model 1 (simple B/(A+B) model) – Model 2 (Model 1 squeezed) – Logistic: -22480 (based on B-A) – Model 3 (logistic with logs) – Combined: -21252 (split version of Model 3)
Some further questions How can we incorporate some of the other data available into the model? – Surface – Individual players Mapping rankings to probabilities is only one component of the modelling process… …you could use your own rankings or ratings!
Final thoughts Try it yourself! – Modelling principles: – Start Simple – Generalise Gradually – Capture Curvature – Banish Bias
Thank you for listening! Dr Tim Paulden