You can NOT be serious!
How to build a tennis model in 30 minutes
Dr Tim Paulden (Innovation & Development Manager, ATASS Sports)
EARL 2014, London, 16 September 2014
Introduction
ATASS Sports:
– Sports forecasting
– Hardcore statistical research
– Fusion of ‘academic’ and ‘pragmatic’
Today…
– Building a very basic tennis model
– Highlighting some key ideas
Tennis modelling
Data obtained from tennis-data.co.uk:
– Spreadsheet for each year
– You can easily get the data yourself!
The ultimate goal of modelling is to determine the probability of different outcomes.
Can we forecast the probability of victory in a match from the players' world rankings?
How do we identify a "good" model?
Concept 1: Model calibration
An effective model must be well-calibrated:
– The probabilities produced by the model must be consistent with the available data
– Think in terms of “bins”: if we gather together all the cases where our generated win probability lies between 0.6 and 0.7 (say), the observed proportion of wins should match the mean win probability for the bin (roughly 0.65)
– Here’s an extract from Nate Silver’s recent bestseller, “The Signal and the Noise”…
Concept 2: Model score
Suppose we use a model to produce probabilities for a large number of sporting events (e.g. a collection of tennis matches).
We can assess the model's quality by summing log(p) over all predictions, where p is the probability we assigned to the outcome that occurred – this is the model score.
The closer we match the "true" probabilities, the higher the model score (i.e. the closer it is to zero).
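The slides use R throughout; as a minimal sketch of the model-score idea in Python (function name my own, not from the talk):

```python
import math

def model_score(assigned_probs):
    """Sum of log(p) over the probability p assigned to each outcome
    that actually occurred; 0 is perfect, more negative is worse."""
    return sum(math.log(p) for p in assigned_probs)

# A model that was certain of every outcome scores exactly 0.
print(model_score([1.0, 1.0]))   # 0.0
# Hedged predictions are penalised; confident wrong ones even more so.
print(model_score([0.9, 0.8, 0.05]))
```

Note that a single confidently wrong prediction (the 0.05 above) costs more than several mildly hedged ones.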
The data set…
tennis has 68,972 rows of data, with each match appearing twice (A vs B and B vs A):

> dim(tennis)
[1] 68972    11
> head(tennis)
  matchid     date  day  ago surf bestof       aname arank       bname brank res
1       1 20010101 3747 4503 hard      3   clement a    18  gaudenzi a   101   1
2       1 20010101 3747 4503 hard      3  gaudenzi a   101   clement a    18   0
3       2 20010101 3747 4503 hard      3 goldstein p    81     jones a   442   1
4       2 20010101 3747 4503 hard      3     jones a   442 goldstein p    81   0
5       3 20010101 3747 4503 hard      3      haas t    23     smith l   485   1
6       3 20010101 3747 4503 hard      3     smith l   485      haas t    23   0
> tail(tennis)
      matchid     date  day ago surf bestof          aname arank          bname brank res
68967   34484 20130427 8246   4 clay      3        rosol l    48        simon g    16   1
68968   34484 20130427 8246   4 clay      3        simon g    16        rosol l    48   0
68969   34485 20130427 8246   4 clay      3 garcia lopez g    87        mayer f    29   1
68970   34485 20130427 8246   4 clay      3        mayer f    29 garcia lopez g    87   0
68971   34486 20130428 8247   3 clay      3        rosol l    48 garcia lopez g    87   1
68972   34486 20130428 8247   3 clay      3 garcia lopez g    87        rosol l    48   0
From ranks to probabilities
How might we map the players' rankings onto a win probability?
– We’ll look at an extremely rudimentary approach in a moment as a worked example
– But first, consider for a moment how you might mathematically combine the players’ rankings to get a win probability for each player
– What are the important properties?
A "first stab" – Model 1
Suppose our first guess is that if the two players' rankings are A and B, the probability of A winning the match is B/(A+B):

   matchid       aname arank       bname brank res     aprob1
1        1   clement a    18  gaudenzi a   101   1 0.84873950
2        1  gaudenzi a   101   clement a    18   0 0.15126050
3        2 goldstein p    81     jones a   442   1 0.84512428
4        2     jones a   442 goldstein p    81   0 0.15487572
5        3      haas t    23     smith l   485   1 0.95472441
6        3     smith l   485      haas t    23   0 0.04527559
7        4    henman t    10  rusedski g    69   1 0.87341772
8        4  rusedski g    69    henman t    10   0 0.12658228
9        5    hewitt l     7   arthurs w    83   1 0.92222222
10       5   arthurs w    83    hewitt l     7   0 0.07777778
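Model 1 is a one-liner; here is a Python sketch (the talk's code is in R, and the function name is my own) reproducing the first row of the table:

```python
def model1_prob(a_rank, b_rank):
    """Model 1: P(A beats B) = B / (A + B), from world rankings alone."""
    return b_rank / (a_rank + b_rank)

# clement (rank 18) vs gaudenzi (rank 101), as in row 1:
print(round(model1_prob(18, 101), 4))   # 0.8487
```

By construction the two rows for each match sum to 1, since B/(A+B) + A/(B+A) = 1.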
A "first stab" – Model 1
In this case, the model score is -22194.
The "null" model, in which each player is always assigned a probability of 0.5, gets a model score of -23904.
So Model 1 gives an improvement of 1710 over the null model (closer to zero is better).
How about the calibration?
Let's generate a calibration plot for Model 1:
– We'll use bins of width 0.1 (0 to 0.1, 0.1 to 0.2, etc), closed at the left-hand side (e.g. 0.6 ≤ x < 0.7)
– For each bin, we consider all instances where our model probability lies inside the bin, and plot a point whose x-coordinate is the mean of the model probabilities and whose y-coordinate is the observed proportion of wins for these instances
– Example: in this case, for the bin 0.6 ≤ x < 0.7, the point plotted is (0.648, 0.588)
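The binning procedure above can be sketched in Python (a hypothetical helper, not code from the talk):

```python
def calibration_points(probs, outcomes, width=0.1):
    """For each bin [0, 0.1), [0.1, 0.2), ..., return the pair
    (mean model probability, observed win proportion)."""
    bins = {}
    top = int(round(1 / width)) - 1
    for p, won in zip(probs, outcomes):
        idx = min(int(p / width), top)  # p == 1.0 goes in the top bin
        bins.setdefault(idx, []).append((p, won))
    return {
        idx: (sum(p for p, _ in pairs) / len(pairs),
              sum(w for _, w in pairs) / len(pairs))
        for idx, pairs in sorted(bins.items())
    }

# Three predictions falling in the 0.6-0.7 bin, two of which won:
print(calibration_points([0.65, 0.62, 0.68], [1, 0, 1]))
```

For a well-calibrated model, each returned pair lies close to the diagonal y = x.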
Systematic bias of Model 1
A quick fix...
The probabilities are systematically too extreme, so we could try blending Model 1 with 0.5.
What weighting on Model 1 maximises the model score?
A weighting of 0.71 on Model 1 is best – the model score improves from -22194 to -21333.
Obtaining the best weighting can be done as a one-liner in R…
Quick fix (one-liner in R) glm(tennis$res~tennis$aprob1, family=binomial(link="identity")) Call: glm(formula = tennis$res ~ aprob1, family = binomial(link = "identity")) Coefficients: (Intercept) aprob1 0.1444 0.7112 Degrees of Freedom: 68971 Total (i.e. Null); 68970 Residual Null Deviance: 95620 Residual Deviance: 85330 AIC: 85330
Bias reduced, but still apparent
A substantial improvement

Model 1 (score -22194):
  matchid arank brank res aprob1
        1    18   101   1  0.849
        7    91    94   1  0.508
       34   181     4   1  0.022
      141   118    18   1  0.132
     7897     1   314   1  0.997

Model 2 (score -21333):
  matchid arank brank res aprob2
        1    18   101   1  0.748
        7    91    94   1  0.506
       34   181     4   1  0.160
      141   118    18   1  0.239
     7897     1   314   1  0.853
Stepping up a gear
The invlogit function is widely used to predict binary sports outcomes (logistic regression).
Logistic regression
Let's do a logistic regression of the result on the difference in rank, (B – A).
This is equivalent to player A's win probability being invlogit( k*(B – A) ).
The optimal value of k can be found using glm.
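In Python (the talk uses R), invlogit and the rank-difference model look like this, with k taken from the glm fit shown below:

```python
import math

def invlogit(x):
    """Inverse logit: maps any real x to a probability in (0, 1)."""
    return math.exp(x) / (math.exp(x) + 1)

# Rank-difference model: P(A wins) = invlogit(k * (B - A)),
# using k = 0.0061067 from the glm fit.
k = 0.0061067
print(round(invlogit(k * (101 - 18)), 3))   # clement vs gaudenzi
```

Note how much more cautious this is than Model 1 for the same match – a hint that a plain rank difference does not capture the curvature.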
Logistic regression
rankdiff = tennis$brank - tennis$arank
g1 = glm(tennis$res ~ rankdiff - 1, family = binomial(link = "logit"))
Logistic regression
> summary(g1)

Call:
glm(formula = tennis$res ~ rankdiff - 1, family = binomial(link = "logit"))

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-4.229  -1.123   0.000   1.123   4.229

Coefficients:
          Estimate Std. Error z value Pr(>|z|)
rankdiff 0.0061067  0.0000987   61.87   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 95615  on 68972  degrees of freedom
Residual deviance: 89920  on 68971  degrees of freedom
AIC: 89922

Number of Fisher Scoring iterations: 4
This has terrible calibration!
Logistic regression
The model score comes out as -22480 – worse than Model 1!
– Model 0: -23904 (probabilities all 0.5)
– Model 1: -22194 (simple B/(A+B) model)
– Model 2: -21333 (Model 1 squeezed)
We need a better way of capturing the curvature…
Our paper...
Developing an improved tennis ranking system
David Irons, Stephen Buckley and Tim Paulden
MathSport International, June 2013
Updated version to appear this year in JQAS (Journal of Quantitative Analysis in Sports)
Our paper...
A decent model for generating probabilities is invlogit( 0.58*(log(B) - log(A)) ), where invlogit is the function exp(x)/(exp(x)+1).

Model 3 (score -21285):
  matchid arank brank res aprob3
        1    18   101   1  0.731
        7    91    94   1  0.505
       34   181     4   1  0.099
      141   118    18   1  0.252
     7897     1   314   1  0.966
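As a Python sketch (translating the slide's R formula; the function names are mine), Model 3 reproduces the aprob3 values shown above:

```python
import math

def invlogit(x):
    return math.exp(x) / (math.exp(x) + 1)

def model3_prob(a_rank, b_rank, k=0.58):
    """Model 3: logistic model on the difference of log-rankings."""
    return invlogit(k * (math.log(b_rank) - math.log(a_rank)))

print(round(model3_prob(18, 101), 3))   # 0.731, as in the first row
print(round(model3_prob(1, 314), 3))    # 0.966, the lopsided match
```

Taking logs of the rankings is what supplies the missing curvature: the gap between ranks 1 and 10 matters far more than the gap between 301 and 310.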
Our paper…
logterm = log(tennis$brank) - log(tennis$arank)
g1 = glm(tennis$res ~ logterm - 1, family = binomial(link = "logit"))
Our paper…
> summary(g1)

Call:
glm(formula = tennis$res ~ logterm - 1, family = binomial(link = "logit"))

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-2.614  -1.055   0.000   1.055   2.614

Coefficients:
        Estimate Std. Error z value Pr(>|z|)
logterm 0.578851   0.006371   90.86   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 95615  on 68972  degrees of freedom
Residual deviance: 85139  on 68971  degrees of freedom
AIC: 85141

Number of Fisher Scoring iterations: 3
Our paper…
Computing the model score directly from the fitted values:
> sum(log(g1$fitted.values[which(tennis$res==1)]))
[1] -21284.7
Almost perfect calibration
Some comparisons

Model 2 (score -21333):
  matchid arank brank res aprob2
        1    18   101   1  0.748
        7    91    94   1  0.506
       34   181     4   1  0.160
      141   118    18   1  0.239
     7897     1   314   1  0.853

Model 3 (score -21285):
  matchid arank brank res aprob3
        1    18   101   1  0.731
        7    91    94   1  0.505
       34   181     4   1  0.099
      141   118    18   1  0.252
     7897     1   314   1  0.966
Some comparisons
For matchid 7897 (arank 1, brank 314, res 1):
– Model 1: aprob1 = 0.997
– Model 2: aprob2 = 0.853
– Model 3: aprob3 = 0.966
Coming full circle
In fact, a bit of algebra shows that invlogit( 0.58*(log(B) - log(A)) ) is exactly the same as B^0.58 / (A^0.58 + B^0.58).
And invlogit( B-A ) is the same as exp(B)/(exp(A) + exp(B)).
Try the simplest thing that could possibly work!
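The equivalence is easy to check numerically – a Python sketch with my own function names:

```python
import math

def invlogit(x):
    return math.exp(x) / (math.exp(x) + 1)

def power_form(a_rank, b_rank, k=0.58):
    """The same model written 'Model 1 style': B^k / (A^k + B^k)."""
    return b_rank**k / (a_rank**k + b_rank**k)

# The two forms agree (up to floating-point noise) for any ranks:
for a, b in [(18, 101), (91, 94), (1, 314)]:
    lhs = invlogit(0.58 * (math.log(b) - math.log(a)))
    assert abs(lhs - power_form(a, b)) < 1e-9
print("identity holds")
```

The algebra: invlogit(k(log B − log A)) = (B^k/A^k) / (1 + B^k/A^k) = B^k / (A^k + B^k). So Model 3 is just Model 1 with the ranks raised to the power 0.58.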
Graphically…
A final extension
What about the effect of the number of sets?
Let's take the best model (Model 3) and look at the calibration plots…
Model 3 - All matches
Model 3 - Best of 3 sets
Model 3 - Best of 5 sets
Model 4
This suggests we should have a combined model – "Model 4" – based on the rules that are in operation:
For "best of 3 sets": invlogit( 0.54*(log(B) - log(A)) )
For "best of 5 sets": invlogit( 0.72*(log(B) - log(A)) )
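A Python sketch of Model 4 (coefficients from the slide; function names my own), switching the coefficient on the match format:

```python
import math

def invlogit(x):
    return math.exp(x) / (math.exp(x) + 1)

def model4_prob(a_rank, b_rank, best_of):
    """Model 4: Model 3 with a separate coefficient per format -
    0.54 for best-of-3 matches, 0.72 for best-of-5."""
    k = 0.72 if best_of == 5 else 0.54
    return invlogit(k * (math.log(b_rank) - math.log(a_rank)))

# The favourite is pushed further from 0.5 in best-of-5 matches,
# where upsets are rarer:
print(round(model4_prob(18, 101, 3), 3))
print(round(model4_prob(18, 101, 5), 3))
```

The larger best-of-5 coefficient reflects that longer matches give the stronger player more opportunity for their advantage to tell.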
Model 4 – All matches
Model 4 – Best of 3 sets
Model 4 – Best of 5 sets
The best model score so far
For Model 4, the model score is -21252.
A final comparison:
– Model 0: -23904 (probabilities all 0.5)
– Model 1: -22194 (simple B/(A+B) model)
– Model 2: -21333 (Model 1 squeezed)
– Logistic: -22480 (based on B-A)
– Model 3: -21285 (logistic with logs)
– Combined (Model 4): -21252 (split version of Model 3)
Some further questions
How can we incorporate some of the other data available into the model?
– Surface
– Individual players
Mapping rankings to probabilities is only one component of the modelling process… you could use your own rankings or ratings!
Final thoughts
Try it yourself!
– www.tennis-data.co.uk
Modelling principles:
– Start Simple
– Generalise Gradually
– Capture Curvature
– Banish Bias
Thank you for listening!
Dr Tim Paulden
tim.paulden@atass-sports.co.uk