Analysis of 2016-17 MLS Season Data Using Poisson Regression with R
Ian Campbell, Dr. Bahaeddine Taoufik, Dr. Nancy Cowden, Dr. Kevin Peterson

Main Idea
The main goal of this project is to analyze the 2016-17 MLS season, and to explore the statistical software R, by fitting a Poisson regression model and making inferences about how goals are scored from predictor variables such as passes, possession time, and shots.

The Data
MLS 2016-17 Season
Table 1: Team Data Per Game
Each entry corresponds to an individual game

Variable Correlation
Goals per Match
574 entries
Discrete data
Right-skewed
Figure 1: Illustration of the distribution of goals per match

Poisson Distribution
$$f(k;\lambda) = \Pr(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
$X$ is the discrete variable, $k = 0, 1, 2, \ldots$
$\lambda$ = mean of the discrete variable

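The probability mass function can be checked numerically. The slides did their computation in R; the following is only a minimal Python sketch, and the rate `lam = 1.5` is a hypothetical value chosen for illustration. It confirms that the probabilities sum to one and that the mean and variance both equal the rate.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Pr(X = k) for a Poisson variable with mean lam."""
    if k < 0:
        return 0.0
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Hypothetical rate, used only to illustrate the formula.
lam = 1.5

# Truncate the infinite support at 60; the remaining tail is negligible here.
probs = [poisson_pmf(k, lam) for k in range(60)]
total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))

print(f"sum={total:.6f} mean={mean:.6f} variance={variance:.6f}")
```
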
Poisson Distribution
Key assumptions for the Poisson distribution:
Independence: goals occur independently of one another
Homogeneity: the scoring rate is not affected by the time in the match
The time period is constant
The mean and variance of the Poisson distribution are the same
Testing and graphing done with R
Goals per Match: Observed vs. Expected Count
Observed mean = 1.483
Observed variance = 1.629
Difference = 0.146
Figure 2: Actual vs. Poisson distribution

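The expected side of the observed-versus-expected comparison can be sketched from the reported figures alone. This is a Python illustration, not the R code used for the slides; it assumes the 574 entries and the observed mean of 1.483 reported in this deck, and the cutoff of six goals before the tail bucket is an arbitrary choice. The observed counts per goal total are not given in the slide text, so only the expected counts and the variance-to-mean ratio are computed.

```python
import math

N_ENTRIES = 574            # entries reported on the slides
OBSERVED_MEAN = 1.483      # observed mean goals per match
OBSERVED_VARIANCE = 1.629  # observed variance of goals per match

def expected_counts(n, lam, max_goals):
    """Expected entries with 0..max_goals goals, plus one final tail bucket."""
    counts = [
        n * lam ** k * math.exp(-lam) / math.factorial(k)
        for k in range(max_goals + 1)
    ]
    counts.append(n - sum(counts))  # everything above max_goals
    return counts

counts = expected_counts(N_ENTRIES, OBSERVED_MEAN, max_goals=6)
for k, c in enumerate(counts[:-1]):
    print(f"{k} goals: {c:6.1f} expected entries")
print(f"7+ goals: {counts[-1]:6.1f} expected entries")

# A ratio near 1 is what the equal mean and variance assumption predicts.
dispersion_ratio = OBSERVED_VARIANCE / OBSERVED_MEAN
print(f"variance / mean = {dispersion_ratio:.3f}")
```
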
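Poisson Regression Model
$$\ln(\hat{y}_i) = b_0 + b_1 X_{i,1} + b_2 X_{i,2} + b_3 X_{i,3} + \dots + b_k X_{i,k}, \qquad 1 \le i \le n$$
$\hat{y}_i$ = predicted response
$X_{i,1}, \dots, X_{i,k}$ = predictor variables
$b_0$ = estimated intercept
$b_1, \dots, b_k$ = estimated coefficients
In matrix form:
$$\begin{pmatrix} \ln(\hat{y}_1) \\ \vdots \\ \ln(\hat{y}_n) \end{pmatrix} = \begin{pmatrix} 1 & X_{1,1} & X_{1,2} & X_{1,3} & X_{1,4} & \cdots & X_{1,k} \\ \vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n,1} & X_{n,2} & X_{n,3} & X_{n,4} & \cdots & X_{n,k} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_k \end{pmatrix}$$
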
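The slides do not show how the coefficients are estimated. Assuming the usual maximum-likelihood fit (which is what R's `glm` with a Poisson family performs), the estimates maximize the Poisson log-likelihood below. The vector notation $\mathbf{x}_i = (1, X_{i,1}, \dots, X_{i,k})^{\top}$ and $\mathbf{b} = (b_0, \dots, b_k)^{\top}$ is introduced only for this sketch, and $y_i$ denotes the observed goal count for entry $i$.

```latex
\ell(\mathbf{b}) = \sum_{i=1}^{n} \left[ y_i \, \mathbf{x}_i^{\top}\mathbf{b}
  - \exp\!\left(\mathbf{x}_i^{\top}\mathbf{b}\right) - \ln(y_i!) \right],
\qquad
\hat{y}_i = \exp\!\left(\mathbf{x}_i^{\top}\mathbf{b}\right)
```

Each term is the log of the Poisson probability of $y_i$ with mean $\hat{y}_i$, which is why the model is linear in $\ln(\hat{y}_i)$ rather than in $\hat{y}_i$ itself.
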
Data Analysis
Predictor variables:
Passes made (Passes)
Amount of time the team possessed the ball during the 90-minute game (Possesion.Rate)
Red cards received (Red.Cards)
Corners taken (Corners)
Free kicks taken (Free.Kicks)
Penalty kicks taken (Penalty.Kicks)
Shots taken (Shots)
Shots on target (SoT)
The model will follow the equation:
$$\ln(\text{Goals Scored}) = b_0 + b_1(\text{Passes}) + b_2(\text{Possession}) + b_3(\text{Shots}) + b_4(\text{SoT}) + b_5(\text{Corners}) + b_6(\text{Penalty Kicks}) + b_7(\text{Red Cards}) + b_8(\text{Free Kicks})$$
-H0: The response variable, goals scored, is not over-dispersed
-Ha: The response variable, goals scored, is over-dispersed
A final test must be run in R to verify over-dispersion

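The slides do not say which over-dispersion test was run in R. One common diagnostic, shown here as a Python sketch under that caveat, is the Pearson dispersion statistic: the Pearson chi-square divided by the residual degrees of freedom. Values near 1 are consistent with a Poisson model, and values well above 1 suggest over-dispersion. The goal counts and fitted means below are hypothetical, not data from the slides.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom."""
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    dof = len(observed) - n_params
    if dof <= 0:
        raise ValueError("need more observations than parameters")
    # For a Poisson model the variance equals the fitted mean mu.
    chi_sq = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi_sq / dof

# Hypothetical goals and fitted means for eight team-games.
goals = [0, 1, 1, 2, 3, 0, 2, 4]
fitted_means = [0.8, 1.1, 1.3, 1.6, 2.2, 0.9, 1.7, 2.5]

phi = pearson_dispersion(goals, fitted_means, n_params=2)
print(f"estimated dispersion = {phi:.3f}")
```
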
R Analysis
Insignificant variables:
Red cards
Free kicks
Adjusted equation:
$$\ln(\text{Goals Scored}) = b_0 + b_1(\text{Passes}) + b_2(\text{Possession}) + b_3(\text{Shots}) + b_4(\text{SoT}) + b_5(\text{Corners}) + b_6(\text{Penalty Kicks})$$

Adjusted Quasi-Poisson Model
Significance values increased
Standard errors decreased
Estimates improved
Final equation:
$$\ln(\text{Goals Scored}) = -2.92\times10^{-1} + 1.64\times10^{-3}(\text{Passes}) - 1.39\times10^{-2}(\text{Possession}) + 3.81\times10^{-2}(\text{Shots}) + 6.23\times10^{-2}(\text{SoT}) - 1.88\times10^{-2}(\text{Corners}) + 3.24\times10^{-1}(\text{Penalty Kicks})$$

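To see how the final equation turns match statistics into a predicted goal count, the fitted log-linear model can be evaluated directly. The coefficients below are copied from the final equation; everything else in this Python sketch is an assumption, including the dictionary keys, the `predict_goals` helper, and the example stat line, whose possession value assumes the same unit as the `Possesion.Rate` column.

```python
import math

# Coefficients copied from the final quasi-Poisson equation on the slides.
FINAL_MODEL = {
    "intercept": -2.92e-1,
    "passes": 1.64e-3,
    "possession": -1.39e-2,
    "shots": 3.81e-2,
    "sot": 6.23e-2,
    "corners": -1.88e-2,
    "penalty_kicks": 3.24e-1,
}

def predict_goals(stats, model=FINAL_MODEL):
    """Exponentiate the linear predictor to get the expected goal count."""
    linear = model["intercept"] + sum(
        model[name] * value for name, value in stats.items()
    )
    return math.exp(linear)

# Hypothetical stat line for one team in one game (not from the data set).
example_match = {
    "passes": 400, "possession": 50, "shots": 12,
    "sot": 4, "corners": 5, "penalty_kicks": 0,
}
with_penalty = dict(example_match, penalty_kicks=1)

print(f"predicted goals: {predict_goals(example_match):.3f}")
print(f"with one penalty kick: {predict_goals(with_penalty):.3f}")
```

Because the model is linear on the log scale, adding one penalty kick multiplies the prediction by a fixed factor regardless of the other statistics.
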
Interpretation
The interpretation of each variable holds only when that variable is increased by one unit (e.g. one pass or one shot) while all other variables are held constant.
Variable | Estimated Coefficient | Interpretation
Passes | 1.64e-3 | $(e^{0.00164}-1)\times100\% = 0.164\%$. For every pass made, the expected number of goals increases by 0.164% on average
Shots | 3.81e-2 | $(e^{0.0381}-1)\times100\% = 3.88\%$. For every shot taken, the expected number of goals increases by 3.88% on average
Shots on Target | 6.23e-2 | $(e^{0.0623}-1)\times100\% = 6.43\%$. For every shot on target, the expected number of goals increases by 6.43% on average
Possession | -1.39e-2 | $(e^{-0.0139}-1)\times100\% = -1.38\%$. For every minute of possession a team has, the expected number of goals decreases by 1.38% on average
Penalty Kicks | 3.24e-1 | $(e^{0.324}-1)\times100\% = 38.26\%$. For every penalty kick awarded, the expected number of goals increases by 38.26% on average
Corners | -1.88e-2 | $(e^{-0.0188}-1)\times100\% = -1.86\%$. For every corner taken, the expected number of goals decreases by 1.86% on average

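The percentage in each row comes from exponentiating the coefficient: a one-unit increase multiplies the expected goal count by $e^{b}$, so the percent change is $(e^{b}-1)\times100\%$. A short Python sketch of that conversion follows, using the coefficients of the final equation; the function name `percent_change` is an illustrative choice.

```python
import math

def percent_change(coefficient):
    """Percent change in the expected count for a one-unit increase."""
    return (math.exp(coefficient) - 1.0) * 100.0

# Coefficients from the final quasi-Poisson equation on the slides.
coefficients = {
    "Passes": 1.64e-3,
    "Shots": 3.81e-2,
    "Shots on Target": 6.23e-2,
    "Possession": -1.39e-2,
    "Penalty Kicks": 3.24e-1,
    "Corners": -1.88e-2,
}

for name, b in coefficients.items():
    print(f"{name}: {percent_change(b):+.3f}%")
```

For small coefficients the percent change is close to 100 times the coefficient itself, which is why the passes row reads 0.164% for a coefficient of 0.00164.
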
Penalty Kicks
Goal: 24' wide by 8' high; penalty spot 36' from the goal line
Average speed: 70 mph
Reaches goal line in 0.7 seconds
Average time to reach either post: 0.6 seconds
Average human reaction time: 0.25 seconds

Penalty Kick Dataset 93 entries All penalty kicks taken; missed or scored
Over-Dispersion Test
-H0: The response variable, goals scored, is not over-dispersed
-Ha: The response variable, goals scored, is over-dispersed
P-value greater than 0.05, so H0 is not rejected: there is no evidence of over-dispersion

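Penalty Kick Poisson
Adjusted penalty kick estimate: 0.595 (previous value: 0.324)
Smaller data set, resulting in lower significance values
Adjusted model:
$$\ln(\text{Goals Scored}) = -2.92\times10^{-1} + 1.34\times10^{-3}(\text{Passes}) - 2.32\times10^{-2}(\text{Possession}) + 8.17\times10^{-3}(\text{Shots}) + 7.77\times10^{-2}(\text{SoT}) - 2.76\times10^{-3}(\text{Corners}) + 5.95\times10^{-1}(\text{Penalty Kicks})$$
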
Interpretation
The interpretation of each variable holds only when that variable is increased by one unit (e.g. one pass or one shot) while all other variables are held constant.
Variable | Estimated Coefficient | Interpretation
Penalty Kicks | 5.95e-1 | $(e^{0.595}-1)\times100\% = 81.30\%$. For every penalty kick taken, the expected number of goals increases by 81.30% on average
Shots | 8.17e-3 | $(e^{0.00817}-1)\times100\% = 0.82\%$. For every shot taken, the expected number of goals increases by 0.82% on average
Shots on Target | 7.77e-2 | $(e^{0.0777}-1)\times100\% = 8.08\%$. For every shot on target, the expected number of goals increases by 8.08% on average
Possession | -2.32e-2 | $(e^{-0.0232}-1)\times100\% = -2.29\%$. For every minute of possession a team has, the expected number of goals decreases by 2.29% on average
Passes | 1.34e-3 | $(e^{0.00134}-1)\times100\% = 0.134\%$. For every pass made, the expected number of goals increases by 0.134% on average
Corners | -2.76e-3 | $(e^{-0.00276}-1)\times100\% = -0.28\%$. For every corner taken, the expected number of goals decreases by 0.28% on average

No Penalty Kick Dataset 481 data points Games that did not have penalty kicks present
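The two subsets partition the full data: 93 entries with a penalty kick plus 481 without give the 574 entries analyzed earlier. A minimal Python sketch of such a split follows; the record layout and the `penalty_kicks` field name are hypothetical stand-ins for the `Penalty.Kicks` column, and the four records are invented for illustration.

```python
def split_by_penalty_kicks(entries):
    """Partition entries into those with and without a penalty kick taken."""
    with_pk = [e for e in entries if e["penalty_kicks"] > 0]
    without_pk = [e for e in entries if e["penalty_kicks"] == 0]
    return with_pk, without_pk

# Hypothetical team-game records (not data from the slides).
entries = [
    {"team": "A", "goals": 2, "penalty_kicks": 1},
    {"team": "B", "goals": 0, "penalty_kicks": 0},
    {"team": "C", "goals": 1, "penalty_kicks": 0},
    {"team": "D", "goals": 3, "penalty_kicks": 2},
]

with_pk, without_pk = split_by_penalty_kicks(entries)
print(len(with_pk), "entries with a penalty kick")
print(len(without_pk), "entries without a penalty kick")

# The subset sizes reported on the slides add up to the full data set.
assert 93 + 481 == 574
```
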
Over-Dispersion Test
-H0: The response variable, goals scored, is not over-dispersed
-Ha: The response variable, goals scored, is over-dispersed
P-value greater than 0.05, so H0 is not rejected: there is no evidence of over-dispersion

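Non-Penalty Kick Poisson
Model equation:
$$\ln(\text{Goals Scored}) = -3.11\times10^{-1} + 1.50\times10^{-3}(\text{Passes}) - 1.24\times10^{-2}(\text{Possession}) + 3.88\times10^{-2}(\text{Shots}) + 5.94\times10^{-2}(\text{SoT}) - 1.75\times10^{-2}(\text{Corners})$$
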
Interpretation
The interpretation of each variable holds only when that variable is increased by one unit (e.g. one pass or one shot) while all other variables are held constant.
Variable | Estimated Coefficient | Interpretation
Shots | 3.88e-2 | $(e^{0.0388}-1)\times100\% = 3.96\%$. For every shot taken, the expected number of goals increases by 3.96% on average
Shots on Target | 5.94e-2 | $(e^{0.0594}-1)\times100\% = 6.12\%$. For every shot on target, the expected number of goals increases by 6.12% on average
Possession | -1.24e-2 | $(e^{-0.0124}-1)\times100\% = -1.23\%$. For every minute of possession a team has, the expected number of goals decreases by 1.23% on average
Passes | 1.50e-3 | $(e^{0.00150}-1)\times100\% = 0.15\%$. For every pass made, the expected number of goals increases by 0.15% on average
Corners | -1.75e-2 | $(e^{-0.0175}-1)\times100\% = -1.73\%$. For every corner taken, the expected number of goals decreases by 1.73% on average

Conclusion
Soccer is highly variable, and different games are played differently
Investigate ways to improve the model by finding the best predictors
Apply this method to other leagues with better players

Acknowledgments Dr. Bahaeddine Taoufik Dr. Nancy Cowden Dr. Kevin Peterson
Questions