You can NOT be serious! How to build a tennis model in 30 minutes Dr Tim Paulden (Innovation & Development Manager, Atass Sports) EARL 2014, London, 16 September 2014

Introduction ATASS Sports – Sports forecasting – Hardcore statistical research – Fusion of ‘academic’ and ‘pragmatic’ Today… – Building a very basic tennis model – Highlighting some key ideas

Tennis modelling Data obtained from tennis-data.co.uk – Spreadsheet for each year – You can easily get the data yourself! Ultimate goal of modelling is to determine the probability of different outcomes Can we forecast the probability of victory in a match from the players’ world rankings? How do we identify a "good" model?
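
As a concrete sketch of the first step in R (the file names below are hypothetical – tennis-data.co.uk provides one spreadsheet per year, but you will need to check the actual names and columns after downloading):

# assumed layout: one CSV per year saved locally as data/atp_2000.csv ... data/atp_2014.csv
files = file.path("data", paste0("atp_", 2000:2014, ".csv"))
tennis = do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))   # assumes every year has the same columns
str(tennis)   # inspect the columns before doing any modelling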

Concept 1: Model calibration An effective model must be well-calibrated – The probabilities produced by the model must be consistent with the available data – Think in terms of “bins” – if we gather together all the cases where our generated win probability lies between 0.6 and 0.7 (say), the observed proportion of wins should match the mean win probability for the bin (roughly 0.65) – Here’s an extract from Nate Silver’s recent bestseller, “The Signal and the Noise”…

Concept 2: Model score Suppose we use a model to produce probabilities for a large number of sporting events (e.g. a collection of tennis matches) We can assess the model's quality by summing log(p) over all predictions, where p is the probability we assigned to the outcome that occurred - this is the model score The closer we match the "true" probabilities, the higher the model score (closer to zero)
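
As a minimal sketch, this score can be computed in one line of R (the function name model_score is mine, not from the talk):

model_score = function(p_outcome) sum(log(p_outcome))   # p_outcome: probability assigned to the outcome that occurred, one per event

model_score(rep(0.5, 4))   # e.g. four 50:50 predictions score 4*log(0.5), about -2.77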

The data set… tennis has 68,972 rows of data, with each match appearing twice (A vs B and B vs A)

> dim(tennis)
[1] 68972    11

> head(tennis)
  matchid date day ago surf bestof       aname arank       bname brank res
1       1    …   …   … hard      3   clement a    18  gaudenzi a   101   …
2       1    …   …   … hard      3  gaudenzi a   101   clement a    18   …
3       2    …   …   … hard      3 goldstein p    81     jones a   442   …
4       2    …   …   … hard      3     jones a   442 goldstein p    81   …
5       3    …   …   … hard      3      haas t    23     smith l   485   1
6       3    …   …   … hard      3     smith l   485      haas t    23   0

> tail(tennis)
      matchid date day ago surf bestof          aname arank          bname brank res
68967       …    …   …   … clay      3        rosol l    48        simon g    16   …
68968       …    …   …   … clay      3        simon g    16        rosol l    48   …
68969       …    …   …   … clay      3 garcia lopez g    87        mayer f    29   …
68970       …    …   …   … clay      3        mayer f    29 garcia lopez g    87   …
68971       …    …   …   … clay      3        rosol l    48 garcia lopez g    87   1
68972       …    …   …   … clay      3 garcia lopez g    87        rosol l    48   0

From ranks to probabilities How might we map the players' rankings onto a win probability? – We’ll look at an extremely rudimentary approach in a moment as a worked example – But first, consider for a moment how you might mathematically combine the players’ rankings to get a win probability for each player – What are the important properties?

A "first stab" – Model 1 Suppose our first guess is that if the two players' rankings are A and B, the probability of A winning the match is B/(A+B) matchid aname arank bname brank res aprob1 1 1 clement a 18 gaudenzi a gaudenzi a 101 clement a goldstein p 81 jones a jones a 442 goldstein p haas t 23 smith l smith l 485 haas t henman t 10 rusedski g rusedski g 69 henman t hewitt l 7 arthurs w arthurs w 83 hewitt l

A "first stab" – Model 1 In this case, the model score is The "null" model in which each player is always assigned a probability of 0.5 gets a model score of So Model 1 gives an improvement of 1710 over the null model (closer to zero is better)

How about the calibration? Let's generate a calibration plot for Model 1 – We'll use bins of width 0.1 (0 to 0.1, 0.1 to 0.2, etc), closed at the left hand side (e.g. 0.6 ≤ x < 0.7) – For each bin, we consider all instances where our model probability lies inside the bin, and plot a point whose x-coordinate is the mean of the model probabilities and whose y-coordinate is the observed proportion of wins for these instances – Example: In this case, for the bin 0.6 ≤ x < 0.7, the point plotted is (0.648, 0.588)
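
A sketch of this calibration plot in base R, assuming the aprob1 column from Model 1 (cut() with right = FALSE gives the left-closed bins described above):

bins = cut(tennis$aprob1, breaks = seq(0, 1, by = 0.1), right = FALSE)   # [0,0.1), [0.1,0.2), ...
x = tapply(tennis$aprob1, bins, mean)   # mean model probability per bin
y = tapply(tennis$res, bins, mean)      # observed win proportion per bin
plot(x, y, xlab = "Mean model probability", ylab = "Observed proportion of wins")
abline(0, 1)   # points on this 45-degree line indicate perfect calibration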

Systematic bias of Model 1

A quick fix... The probabilities are systematically too extreme, so we could try blending Model 1 with 0.5 What weighting on Model 1 maximises the model score? A weighting of 0.71 on Model 1 is best, giving a clear improvement in the model score Obtaining the best weighting can be done as a one-liner in R…

Quick fix (one-liner in R)

glm(tennis$res ~ tennis$aprob1, family = binomial(link = "identity"))

Call:  glm(formula = tennis$res ~ tennis$aprob1, family = binomial(link = "identity"))

Coefficients:
  (Intercept)  tennis$aprob1
            …              …

Degrees of Freedom: 68971 Total (i.e. Null);  68970 Residual
Null Deviance:      …
Residual Deviance:  85326    AIC: 85330
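
The fitted intercept and slope can then be turned into the blended "Model 2" probabilities; a sketch (g2 is my name for the fitted object):

g2 = glm(tennis$res ~ tennis$aprob1, family = binomial(link = "identity"))
tennis$aprob2 = fitted(g2)   # Model 2 = intercept + slope * aprob1, i.e. Model 1 "squeezed" towards 0.5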

Bias reduced, but still apparent

A substantial improvement [Table: the ten example matches (matchid, arank, brank, res, aprob) under Model 1 and Model 2, together with each model's score]

Stepping up a gear Invlogit function – widely used to predict binary sports outcomes (logistic regression)

Logistic regression Invlogit function – widely used to predict binary sports outcomes (logistic regression) Let's do a logistic regression of the result on the difference in rank, (B – A) This is equivalent to player A's win probability being: invlogit( k*(B – A)) The optimal value of k can be found using glm

Logistic regression
rankdiff = tennis$brank - tennis$arank   # B - A: positive when player A has the better (smaller) ranking number
g1 = glm(tennis$res ~ rankdiff - 1, family = binomial(link = "logit"))   # "- 1" removes the intercept, so P(A wins) = invlogit(k * rankdiff)

Logistic regression summary(g1)

Call:  glm(formula = tennis$res ~ rankdiff - 1, family = binomial(link = "logit"))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
      …        …        …        …        …

Coefficients:
         Estimate Std. Error z value Pr(>|z|)
rankdiff        …          …       …   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: … on 68972 degrees of freedom
Residual deviance: … on 68971 degrees of freedom
AIC: …

Number of Fisher Scoring iterations: 4
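
The score of this logistic model can be checked with the same one-liner used later in the talk – the fitted values are the modelled probabilities that player A wins:

sum(log(g1$fitted.values[tennis$res == 1]))   # model score: log(p) summed over the winners' rows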

This has terrible calibration!

Logistic regression The model score comes out at -22480 – worse than Model 1! – Model 0 (probabilities all 0.5) – Model 1 (simple B/(A+B) model) – Model 2 (Model 1 squeezed) We need a better way of capturing the curvature…

Our paper... Developing an improved tennis ranking system David Irons, Stephen Buckley and Tim Paulden MathSport International, June 2013 Updated version to appear this year in JQAS (Journal of Quantitative Analysis in Sports)

Our paper... A decent model for generating probabilities is invlogit( 0.58*(log(B) - log(A)) ) where invlogit is the function exp(x)/(exp(x)+1) [Table: the ten example matches with their Model 3 probabilities, together with the Model 3 score]
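
A worked example of this formula, using the first match from the data set (clement, ranked 18, against gaudenzi, ranked 101):

invlogit = function(x) exp(x) / (exp(x) + 1)
invlogit(0.58 * (log(101) - log(18)))   # about 0.73 - the rank-18 player's win probability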

Our paper…
logterm = log(tennis$brank) - log(tennis$arank)   # difference in log-rankings, log(B) - log(A)
g1 = glm(tennis$res ~ logterm - 1, family = binomial(link = "logit"))   # no intercept: P(A wins) = invlogit(k * logterm)

Our paper… summary(g1)

Call:  glm(formula = tennis$res ~ logterm - 1, family = binomial(link = "logit"))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
      …        …        …        …        …

Coefficients:
        Estimate Std. Error z value Pr(>|z|)
logterm        …          …       …   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: … on 68972 degrees of freedom
Residual deviance: … on 68971 degrees of freedom
AIC: …

Number of Fisher Scoring iterations: 3

Our paper…
sum(log(g1$fitted.values[which(tennis$res==1)]))   # the Model 3 score: sum of log(p) over the winners' rows
[1] …

Almost perfect calibration

Some comparisons [Table: the ten example matches (matchid, arank, brank, res, aprob) under Model 2 and Model 3, together with each model's score]

Some comparisons [Table: the ten example matches (matchid, arank, brank, res, aprob) under Model 1, Model 2 and Model 3 side by side]

Coming full circle In fact, a bit of algebra shows that invlogit( 0.58*(log(B) - log(A)) ) is exactly the same as B^0.58 / (A^0.58 + B^0.58) And invlogit( B-A ) is the same as exp(B)/(exp(A) + exp(B)) Try the simplest thing that could possibly work!
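
A quick numerical check of the first identity, for an arbitrary pair of rankings:

invlogit = function(x) exp(x) / (exp(x) + 1)
A = 18; B = 101
invlogit(0.58 * (log(B) - log(A)))   # logistic form
B^0.58 / (A^0.58 + B^0.58)           # power form - gives the identical value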

Graphically…

A final extension What about the effect of the number of sets? Let's take the best model (Model 3) and look at the calibration plots…

Model 3 - All matches

Model 3 - Best of 3 sets

Model 3 - Best of 5 sets

Model 4 This suggests we should have a combined model – "Model 4" – based on the rules that are in operation For "best of 3 sets": invlogit( 0.54*(log(B) - log(A)) ) For "best of 5 sets": invlogit( 0.72*(log(B) - log(A)) )
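
A sketch of fitting the two coefficients separately via the bestof column (the 0.54 and 0.72 values are the slide's figures; the exact estimates are whatever glm returns on the data):

tennis$logterm = log(tennis$brank) - log(tennis$arank)
g3 = glm(res ~ logterm - 1, family = binomial(link = "logit"), data = tennis, subset = bestof == 3)
g5 = glm(res ~ logterm - 1, family = binomial(link = "logit"), data = tennis, subset = bestof == 5)
coef(g3)   # around 0.54 for best-of-3 matches
coef(g5)   # around 0.72 for best-of-5 matches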

Model 4 – All matches

Model 4 – Best of 3 sets

Model 4 – Best of 5 sets

The best model score so far For Model 4, the model score is -21252 A final comparison: – Model 0 (probabilities all 0.5) – Model 1 (simple B/(A+B) model) – Model 2 (Model 1 squeezed) – Logistic: -22480 (based on B-A) – Model 3 (logistic with logs) – Combined: -21252 (split version of Model 3)

Some further questions How can we incorporate some of the other data available into the model? – Surface – Individual players Mapping rankings to probabilities is only one component of the modelling process… …you could use your own rankings or ratings!

Final thoughts Try it yourself! – Modelling principles: – Start Simple – Generalise Gradually – Capture Curvature – Banish Bias

Thank you for listening! Dr Tim Paulden