Published by Widya Iskandar. Modified over 5 years ago.
1
Big Data Applications & Analytics Sports Analytics
Geoffrey Fox
January
BigDat 2015: International Winter School on Big Data
Tarragona, Spain, January 26-30, 2015
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
Sports Analytics 1/26/2015
2
Sports Informatics Summary
Sports is seeing significant growth in analytics, with pervasive statistics shifting to more sophisticated measures.
We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena.
There are interesting relationships between the economics of sports and big data analytics.
We look at wearables and consumer sports/recreation.
The importance of spatial visualization is discussed.
We look at other sports: soccer, Olympics, NFL football, basketball, tennis and horse racing.
3
Sports Informatics I: Sabermetrics (Baseball)
Lesson 1: Introduction
Introduction to all Sports Informatics
Moneyball
The Oakland Athletics
Diamond Dollars economic model of baseball
Performance-dollar relationship
Value of a win
4
Sports Analytics and Informatics
Understood to be very important
Around 1990 quants went to Wall Street; now they are becoming part of sports management
Applicable to all fields but more prominent in some, such as baseball
Several data sources:
  Statistics on actions within the game
  Analysis of real-time video
  Signals from special position-sensitive tags
  Custom instruments such as those in fitness wearables
The MIT Sloan Sports Analytics Conference is probably the best source of data
Baseball’s SABR (Society for American Baseball Research) has an interesting conference and a rich history
Involves many stakeholders: players, fans, management, media, sports equipment manufacturers, country pride etc.
5
Some Broad Trends
We will see three source types for data:
  Very precise numerical data on game results, athlete performances etc.
  Sensors that are tracked (position, acceleration, physical condition etc.)
  Video, where image processing is used to extract information
Some results are used to predict outcomes and make decisions on players or game play (e.g. which relief pitcher to use in baseball or which offensive play to use in the NFL)
  These analyses use probabilities, of course; choose the play that is most likely to succeed
Other results are used for “spatial visualization”, e.g. to map where field goal attempts are successful in the NBA or where passing plays are successful in the NFL.
6
Baseball Examples
7
Sabermetrics http://en.wikipedia.org/wiki/Sabermetrics
Sabermetrics is the empirical analysis of baseball, especially baseball statistics that measure in-game activity.
The term is derived from the acronym SABR, which stands for the Society for American Baseball Research.
It was coined by Bill James, who is one of its pioneers and is often considered its most prominent advocate and public face.
Sabermetricians frequently question traditional measures of baseball skill. For instance, they doubt that batting average is as useful as conventional wisdom says it is, because team batting average is a relatively poor predictor of team runs scored.
Sabermetric reasoning would say that runs win ballgames, and that a good measure of a player's worth is his ability to help his team score more runs than the opposing team.
VORP (Value Over Replacement Player) is used to measure the value of individual players.
8
Moneyball
Moneyball: The Art of Winning an Unfair Game is a book by Michael Lewis, published in 2003, and a 2011 film starring Brad Pitt
The 2002 Oakland Athletics finished first in the American League West with a record of
Billy Beane, the general manager, used sabermetrics and the farm system
9
Diamond Dollars I
This is a book by Vince Gennaro that does not incorporate sophisticated statistics but does define and describe baseball as a business with a nontrivial economic model that has to be considered in analytics, as a player's value depends on this model.
Note that in e-commerce and web search analysis, the fact that the real criterion for success was economic (strategies balancing user happiness against dollars from sales) led to nontrivial problems.
Different teams have different economic models due to different fan characteristics: size, expectations (of winning).
Fans include those who go to the ballpark plus those who watch on various media outlets, including TV and Internet
The Yankees have the highest team value and the highest fan interest, summing all revenue sources
There is a relationship between winning and attendance, and hence between winning and revenues
Appearing in the postseason makes an important revenue difference, and that requires winning!
10
Diamond Dollars II
This illustrates that revenue per year varies by team, with a factor of 2.8 difference in 2013 between top and bottom
11
Diamond Dollars III
Interestingly, attendance at ball games is increasing each year even with increased media outlets.
Note that the YES network (a regional sports network, RSN) has huge value to the Yankees
  YES covers baseball, basketball and football
The 90th win is worth more than the 80th, as it is directly relevant to playoffs
Wins on the margin are worth $5M each today
New stadiums add value
12
Diamond Dollars IV
A player’s value (measured in dollars to the team) depends on:
  Performance on the field, quantified by WAR
  Team's win-loss record
  Situation of other teams, which could affect the value of a given player to other teams
  Depth of club at player’s position
  Contract status (free agency?)
  Marquee value, which depends on and builds team brand value
Team brands:
  New York Yankees (26 world championships in 85 years)
  Chicago Cubs
  Boston Red Sox
A win is roughly equivalent to 10 runs
13
Sports Informatics I: Sabermetrics (Baseball)
Lesson 2: Basic Sabermetrics
Different Types of Baseball Data
Sabermetrics
Overview of all data
Details of some statistics based on basic data
OPS, wOBA, ERA, ERC, FIP, UZR
14
Big Data and Little Data
There are classic statistics that tell you important information; examples are Batting Average, or Earned Run Average for pitchers
includes 41 Batting (including OPS, wOBA) statistics definitions
  7 Base running
  50 Pitching (including ERA, FIP, ERC)
  12 Fielding
  3 Overall Value (including WAR)
  4 General
As baseball is a complicated game, these basic statistics may not be best correlated with success, and there are choices or combinations of basic statistics that (are claimed to) correlate better with success
  OPS and WAR are two examples of such combinations
However, these are still calculated from “little data”; basically a measure (a few numbers at most) of each pitch and at bat
15
Increasingly Sophisticated Analyses
Little data: basic statistics. Used in most popular discussions of sport
  One can add to basic averages selections such as size of ball park and performance against, say, left- and right-handed pitchers to make specific predictions more precise. This leads to
Little data: sabermetric statistics like OPS and WAR. These have been used for some time in professional analyses and are the origin of the Oakland Athletics success chronicled in Moneyball.
Finally, one can use the detailed video record of each action in baseball, coming from products like PITCHf/x that replace a pitch or hit by a video that can be analyzed with a more sophisticated model; this is Big Data
The “Diamond Dollars” discussion given earlier (by Vince Gennaro) also uses such statistics but includes a sophisticated fiscal model of baseball
  A win is worth more for some teams than others
16
Leading to Big Data
Big data: PITCHf/X, HITf/X, FIELDf/X and Commandf/X (catching) etc. are perhaps the future of sabermetrics and are described on the web by Vince Gennaro in talks and blogs.
This approach builds a more sophisticated model of a player
It uses video, so much larger amounts of data, although this is summarized in terms of numbers measuring speed, curve, location etc.
It can also measure the physical status of players and so help fitness and health
The analysis of this data uses more sophisticated analytics such as recommender engines
way-to-analyze-every-play is a rival to Sportvision FIELDf/X
17
Features of Baseball
Very accurate, clean data over a long time interval (over 140 years) with clear metadata
Actions are clearly associated with pitcher, batter or fielder, although two-way interaction is present
  E.g. a given batter will perform differently against different styles of pitch
This contrasts with soccer or basketball, where team features are much more important, although some actions like shooting free throws or 3-pointers are individually focused.
There is enough data to train models in detail and then test them on a different sample of data
18
OPS: On-base plus slugging
This is a “sabermetric baseball statistic”
OPS = OBP + SLG
OBP = (H + BB + HBP)/(AB + BB + SF + HBP)
SLG = TB/AB
OBP = On-base percentage
SLG = Slugging average
H = Hits
BB = Base on balls
HBP = Times hit by pitch
AB = At bats (plate appearances, not including bases on balls, being hit by pitch, sacrifices, interference, or obstruction)
SF = Sacrifice flies (fly balls hit to the outfield which, although caught for an out, allow a baserunner to advance)
TB = Total bases
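The OPS definitions above translate directly into a few lines of Python. This is a minimal sketch; the function names and the season line are hypothetical and are not real player data.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: (H + BB + HBP) / (AB + BB + SF + HBP)."""
    return (h + bb + hbp) / (ab + bb + sf + hbp)


def slg(tb, ab):
    """Slugging average: total bases per at bat."""
    return tb / ab


def ops(h, bb, hbp, ab, sf, tb):
    """On-base plus slugging: OBP + SLG."""
    return obp(h, bb, hbp, ab, sf) + slg(tb, ab)


# Hypothetical season line, made up for illustration only.
season = dict(h=150, bb=60, hbp=5, ab=500, sf=5, tb=250)
print(f"OPS = {ops(**season):.3f}")
```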
19
wOBA (weighted on-base average)
This is a “sabermetric baseball statistic”
AB (At Bats): number of trips to the plate in which the batter does not walk, get hit by a pitch, sacrifice (fly or bunt), or reach on interference
HBP (Hit By Pitches), SF (Sacrifice Flies), BB (Walks), IBB (Intentional Walks)
1B = Single, 2B = Double, 3B = Triple, HR = Home run
wOBA = (0.690×(BB − IBB) + w_HBP×HBP + w_1B×1B + w_2B×2B + w_3B×3B + w_HR×HR) / (AB + BB − IBB + SF + HBP) in 2013, where each w is that year's weight for the event
Empirically better than other statistics in measuring contribution to run scoring
Weights are calculated separately for each year
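A sketch of the wOBA calculation as a weighted sum over a common denominator. Only the 0.690 walk weight appears on the slide; the other weights below are an assumption (approximately the values commonly published for 2013) and should be checked against a current source before use. The function and dictionary names are hypothetical.

```python
# Only "ubb" (0.690) comes from the slide. The remaining weights are
# ASSUMED illustrative values, not taken from the source.
WEIGHTS_2013 = {
    "ubb": 0.690,  # unintentional walks (BB - IBB)
    "hbp": 0.722,  # assumption
    "1b": 0.888,   # assumption
    "2b": 1.271,   # assumption
    "3b": 1.616,   # assumption
    "hr": 2.101,   # assumption
}


def woba(bb, ibb, hbp, singles, doubles, triples, hr, ab, sf, w=WEIGHTS_2013):
    """Weighted on-base average for one batter and one set of yearly weights."""
    numerator = (
        w["ubb"] * (bb - ibb)
        + w["hbp"] * hbp
        + w["1b"] * singles
        + w["2b"] * doubles
        + w["3b"] * triples
        + w["hr"] * hr
    )
    denominator = ab + bb - ibb + sf + hbp
    return numerator / denominator


# Hypothetical season line, made up for illustration only.
line = dict(bb=60, ibb=10, hbp=5, singles=90, doubles=30,
            triples=5, hr=25, ab=500, sf=5)
print(f"wOBA = {woba(**line):.3f}")
```

Passing a different `w` dictionary reflects the point that the weights are recomputed for each year.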
20
ERA: Earned Run Average
ERA is the mean of earned runs given up by a pitcher per nine innings pitched (i.e. the traditional length of a game). It is determined by dividing the number of earned runs allowed by the number of innings pitched and multiplying by nine.
Runs resulting from defensive errors (including pitchers' defensive errors) are recorded as unearned runs and are not used to determine ERA
ERA is misleading for relief pitchers, because they are charged only for runs scored by batters who reached base while batting against them.
  They can “blow the save” by letting batters who were already on base when they entered the game score, yet still have zero ERA
Pitchers for the Colorado Rockies have historically faced many problems, all damaging to their ERAs. The combination of high altitude (5,280 ft or 1,610 m) and a semi-arid climate in Denver causes fly balls to travel up to 10% farther than at sea level. Denver's altitude and low humidity also reduce the ability of pitchers to throw effective breaking balls, due to both reduced air resistance and difficulty in gripping very dry baseballs.
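The ERA rule above (earned runs divided by innings pitched, times nine) can be sketched as follows. Counting outs instead of innings is a choice made here to avoid fractional innings; the function name and the numbers are hypothetical.

```python
def era(earned_runs, outs_recorded):
    """Earned run average: earned runs per nine innings (3 outs = 1 inning)."""
    innings = outs_recorded / 3
    return 9 * earned_runs / innings


# Hypothetical starter: 50 earned runs over 200 innings (600 outs).
print(era(50, 600))

# Hypothetical reliever: records 3 outs and lets two inherited runners score.
# Those runs are charged to the previous pitcher, so his own ERA stays at zero.
print(era(0, 3))
```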
21
Sports Informatics I: Sabermetrics (Baseball)
Lesson 3: Wins Above Replacement
Wins Above Replacement (WAR)
Discussion of calculation
Examples
Comparisons of different methods
Coefficient of determination
Another sabermetrics example
Summary of sabermetrics
22
WAR(P): Wins Above Replacement I
Wins Above Replacement Player is a sophisticated sabermetric baseball statistic developed to sum up the extent of “a player’s total contributions to their team”
It has an agreed goal but many implementations compared at
  WARP from Baseball Prospectus
  rWAR or updated bWAR from Baseball Reference
  fWAR from Fangraphs
10 runs are roughly equal to a win
A replacement level player is defined as contributing 20.5 runs fewer than a player of league-average performance over 600 plate appearances, i.e. below average!
  WAR = 2.05 is an average player over 600 plate appearances
WAR can be calculated for teams or parts of teams (say all pitchers) and also clearly depends on the time window.
A player's WAR value depends (linearly) on playing time
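The arithmetic linking the replacement level to WAR can be written out as a short sketch. It uses only the slide's numbers (20.5 runs per 600 plate appearances, 10 runs per win); scaling the replacement gap linearly with plate appearances is a simplifying assumption, and the function name is hypothetical.

```python
RUNS_PER_WIN = 10.0          # "10 runs are roughly equal to a win"
REPLACEMENT_GAP_RUNS = 20.5  # runs below league average ...
REFERENCE_PA = 600           # ... over this many plate appearances


def war_from_runs_above_average(runs_above_average, plate_appearances):
    """Convert runs above league average into wins above replacement."""
    # Assumed: the replacement gap scales linearly with playing time.
    replacement_runs = REPLACEMENT_GAP_RUNS * plate_appearances / REFERENCE_PA
    return (runs_above_average + replacement_runs) / RUNS_PER_WIN


# A league-average player (0 runs above average) over 600 PA:
print(war_from_runs_above_average(0, 600))
# The same player over half the playing time earns half the WAR:
print(war_from_runs_above_average(0, 300))
```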
23
WAR(P): Wins Above Replacement II
A team of replacement-level players is expected to have a .294 (originally 0.32) winning percentage, or 47.6 wins in a 162 game season.
  Definition change made in March 2013
This is 1000 extra wins for average (0.5 win %) teams summed over 2 leagues, i.e. total WAR for all teams in 2013 is 1000.
Wikipedia suggests formula Wins = fWAR (old definition)
Cameron found that a team's projected record based on fWAR and that team's actual record have a strong correlation of 0.83
WAR does not use average players as the reference: they are relatively rare, difficult to obtain and highly paid, whereas replacement level players, by their very definition, are players who are easy to obtain when a starter goes down. These are the players who receive non-roster invites at the start of the year or the players who are 6-year minor league free agents.
Baseball talent among the population is generally distributed normally, but only the very right end of that curve plays professional baseball.
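The 47.6 wins and the league-wide total of about 1000 WAR follow from simple arithmetic, sketched below. The count of 30 teams is an assumption added here (it is not stated on the slide).

```python
GAMES = 162
TEAMS = 30  # assumption: number of MLB teams in 2013, not from the slide
REPLACEMENT_WIN_PCT = 0.294
OLD_REPLACEMENT_WIN_PCT = 0.32

replacement_wins = REPLACEMENT_WIN_PCT * GAMES
average_wins = 0.5 * GAMES

# Wins above replacement for an average team, summed over all teams.
total_war = TEAMS * (average_wins - replacement_wins)

print(f"replacement-level team: {replacement_wins:.1f} wins")
print(f"old definition:         {OLD_REPLACEMENT_WIN_PCT * GAMES:.1f} wins")
print(f"league-wide WAR:        {total_war:.0f}")
```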
24
WAR(P): Wins Above Replacement III
WAR for position players has six components:
  Batting Runs, e.g. use wOBA (weighted on-base average)
  Base running Runs (Stolen Bases and Caught Stealing runs, 1st to 3rd on singles, outs on the bases, tagging up on fly balls, scoring from third on a ground ball, etc.)
  Runs added or lost due to Grounding into Double Plays in DP situations
  Fielding Runs (“Defensive Runs Saved”)
  Positional Adjustment Runs, e.g. a catcher gets added runs and designated hitters have runs removed. Pitchers (when they bat) need special treatment
  Replacement level Runs (based on playing time), needed because the other 5 components are measured relative to league average
These are complicated formulae, but all involve league averages and complicated but little data statistics
  For example, wOBA is a custom weighted average of 7 quantities
25
WAR(P): Wins Above Replacement IV
WAR for pitchers is based on Runs Allowed (both earned and unearned) and Innings Pitched compared to an average pitcher
  This is then adjusted by the difference between an average and a replacement level pitcher
  Fangraphs uses FIP (Fielding Independent Pitching)
Use an average pitcher corrected for the situation the current pitcher is placed in:
  Level of opposition
  Handling interleague play (currently the AL performs better than the NL) and the designated hitter difference
  Team defense ability as seen in FIP or Defense-Independent Pitching Stats (DIPS)
  Ball park effects. These are accounted for statistically and with a physics model as in the Big Data approach
  Relievers versus starters. Relievers have better ERA but only pitch a few innings. Relieving in a close game is worth more than relieving in a noncompetitive game
26
fWAR Examples
In 2014 Mike Trout (Angels) had an fWAR of 7.8 (10.5 in 2013)
In 2014, Corey Kluber (Indians) had 7.3 fWAR while Clayton Kershaw (Dodgers) had one of 7.2 even though
  Kluber was wins, 9 losses and 2.44 ERA
  Kershaw was 21 wins, 3 losses and 1.77 ERA
In 2014 the Dodgers had total fWAR 41.8: divided 27.1 batting and Pitching
The Red Sox had the highest fWAR of 43.3 in either League: divided batting and 13.6 pitching
Over all time (sum over appearances) Babe Ruth (10616 PA) had best fWAR at followed by Barry Bonds (12606 PA) at
PA = Plate Appearances
qual=y&type=8&season=2014&month=0&season1=1871&ind=0
Note that the definition is valid and statistics are available over the complete recorded history of baseball, so this is certainly not using Big Data
27
Little Data Sabermetrics
Note that there are modest volumes of data in sabermetrics analyses like WAR, but the results are determined by the data itself
All the magic coefficients in FIP, wOBA and other WAR components do not come from “theory”; they come from fitting data.
In this sense classic sabermetrics illustrates a key feature of “big data”: the data, not a model, determines the answer
Of course one needs a lot of baseball savvy to know what variables to include in formulae, albeit with unknown coefficients
Little data sabermetrics is discussed in an EdX course re/10e616fc ab4457ae18df92b20/ with modules on using SQL and R to calculate
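As a minimal illustration of a coefficient coming from a fit rather than from theory, the sketch below recovers a "runs per win" value by ordinary least squares. The data are synthetic, constructed to lie exactly on a line, and are not real standings; real sabermetric fits use many more variables and seasons.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic team seasons: run differential and wins (made-up values).
run_diff = [-100, -50, 0, 50, 100]
wins = [71, 76, 81, 86, 91]

intercept, slope = fit_line(run_diff, wins)
print(f"wins = {intercept:.1f} + {slope:.3f} * run differential")
print(f"runs per win = {1 / slope:.1f}")
```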
28
Sports Informatics: Advanced Sabermetrics (Baseball)
Lesson: Pitching Clustering and Video in Baseball
A Big Data pitcher clustering method introduced by Vince Gennaro
Data from blog and video at the 2013 SABR conference
29
Clustering Pitchers I
Vince Gennaro’s Blog
First decide on key properties of pitchers that are
  a) Important to batter’s performance
  b) Available from PITCHf/X
These 12 properties are in the right-hand column
Increase predictive power by looking at clusters of similar players rather than traditional one-on-one analyses
30
Clustering Pitchers II
Below is a visual mapping of pitcher clusters. Each node represents a pitcher and each line between pitchers represents a “connection” or a similarity, based on a defined minimum threshold level. This graph includes only LHPs and it clusters them against only right-handed hitters.
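The node-and-threshold idea can be sketched in plain Python: connect two pitchers when their similarity reaches a minimum threshold, then read clusters off as connected components. The similarity measure, the three features and the pitchers below are all hypothetical; the slide does not say how similarity is computed from the 12 PITCHf/X properties.

```python
from itertools import combinations
from math import sqrt

# Hypothetical pitchers with made-up, already normalized features.
pitchers = {
    "A": (0.90, 0.10, 0.30),
    "B": (0.88, 0.12, 0.28),
    "C": (0.20, 0.80, 0.70),
    "D": (0.22, 0.78, 0.72),
    "E": (0.50, 0.50, 0.10),
}


def similarity(u, v):
    """Assumed measure: 1 / (1 + Euclidean distance), in (0, 1]."""
    distance = sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1 / (1 + distance)


def build_graph(features, threshold):
    """Add an edge between two pitchers when similarity >= threshold."""
    graph = {name: set() for name in features}
    for a, b in combinations(features, 2):
        if similarity(features[a], features[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clusters(graph):
    """Connected components of the similarity graph (depth-first search)."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


print(clusters(build_graph(pitchers, threshold=0.9)))
```

Lowering the threshold merges clusters; raising it isolates more pitchers, which is the trade-off behind the "defined minimum threshold level".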
31
Clustering Pitchers III
This summarizes the traditional one-on-one approach (given batter record versus given pitcher) compared to the clustering method
Generalize to hitter clusters
32
ROI of Optimizing Matchups
Return on Investment
Replace 30th percentile guy by 70th percentile for 81 days: 19 runs
Optimize pinch hitter 100 times: 9 runs
Choose optimal relief pitcher 50 times: 5 runs
33 runs is 3 wins
1 win is worth $5M for a competitive team
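The return-on-investment arithmetic can be checked with a short sketch. It uses the slide's run gains together with the rules of thumb of about 10 runs per win and $5M per win; the variable names are hypothetical.

```python
RUNS_PER_WIN = 10            # rule of thumb from the earlier slides
DOLLARS_PER_WIN = 5_000_000  # value of a win for a competitive team

# Run gains from the three matchup optimizations listed on the slide.
run_gains = {
    "replace 30th percentile player with 70th for 81 days": 19,
    "optimize pinch hitter 100 times": 9,
    "choose optimal relief pitcher 50 times": 5,
}

total_runs = sum(run_gains.values())
whole_wins = round(total_runs / RUNS_PER_WIN)  # the slide rounds to 3 wins

# Integer arithmetic keeps the dollar figures exact.
value_rounded = whole_wins * DOLLARS_PER_WIN
value_unrounded = total_runs * DOLLARS_PER_WIN // RUNS_PER_WIN

print(f"{total_runs} runs is about {whole_wins} wins")
print(f"value: ${value_rounded:,} (rounded) or ${value_unrounded:,} (unrounded)")
```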
33
FIELDf/X for fielders I
The FIELDf/x® service uses Sportvision's baseball technology to digitally record the position of all players and hit balls in real time.
Left: type of material available
Right: result of a 95 foot run to catch the ball
34
Start of catch tracking-bob-bowman-interview/
35
HITf/X and result of hit
Talk from sabermetrics expert at Tufts
Home Runs
36
Sports Informatics: Other Sports
Lesson: Mainly Pretty Pictures ... Spatial Visualization
37
General comments II sports
Players analyzed in real time:
  Speed
  Heart rate
  Hydration
  Breathing
  Fatigue
  Pain
Coaches relate to technique
Enhance fantasy play
Fans engage through social media and real-time enhanced data
Teams get locations and contacts for fans
Implies injury reduction, and benefits for marketing and betting
38
Consumer devices innovation-summit-data-powered-storytelling Accelerometer Wristband Snow Goggles Track Activity and Sleep Head mounted display/camera Golf Swing analyzer Monitors racket and player
39
Soccer visualization
40
American Football
Computer vision and machine learning are used to classify plays and then predict the next one; players are recognized from video
41
NFL American football: color represents completion percentage
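A spatial map of this kind can be sketched by binning attempts into grid cells and computing the success rate per cell, which is the value a plotting tool would then turn into a color. The coordinates, cell size and function name are hypothetical; the slide does not describe how the original map was built.

```python
from collections import defaultdict


def completion_by_cell(attempts, cell_size):
    """Success rate per grid cell.

    attempts: iterable of (x, y, success) with field coordinates in any unit.
    Returns {(col, row): fraction of successful attempts in that cell}.
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [successes, attempts]
    for x, y, success in attempts:
        cell = (int(x // cell_size), int(y // cell_size))
        totals[cell][0] += 1 if success else 0
        totals[cell][1] += 1
    return {cell: made / tried for cell, (made, tried) in totals.items()}


# Made-up passing attempts: (x, y, completed?)
attempts = [
    (2, 3, True),
    (4, 1, False),
    (12, 3, True),
    (14, 8, True),
    (3, 13, False),
]

for cell, rate in sorted(completion_by_cell(attempts, cell_size=10).items()):
    print(cell, f"{rate:.0%}")
```

The same binning applies to NBA shot locations, with made shots in place of completed passes.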
42
NBA Shooting location/success I
Jose Calderon: one of the best pure shooters in the NBA
43
NBA Shooting location/success IV
44
Comparing 4 NBA players content/uploads/2012/02/Goldsberry_Sloan_Submission.pdf
45
Tennis http://www.slideshare.net/elew/sport-analytics-innovation
46
Horse Racing http://www.trakus.com/technology.asp#tNetText
More accurate and immediate than GPS or other positioning techniques, the Trakus system uses proprietary wireless communications to track tags fitted into each horse’s saddlecloth during live racing.
The durable, lightweight tag weighs 2.8 ounces (86 g) and it has the profile and size of a credit card or PCMCIA computer card, about 2 x 3 inches.