1
Mining Baseball Statistics
Data Mining – CSE881
Paul Cornwell, Kajal Miyan, Mojtaba Solgi
Project URL:
2
Overview of Baseball
Baseball is a team sport
There are two major leagues: AL (American), NL (National)
Many statistics characterizing player performance are published yearly
Each league names one player MVP (Most Valuable Player) each year according to a vote
People place bets on who will be MVP
3
Overview
Application (motivation): Baseball statistics
  Can we predict who will be named MVP?
  Learn how to do data mining
  Learn about baseball
  Impress sabermetricians
  Baseball: it’s not diseases, crime, or pollution
Baseball statistics
  Main task: predict MVPs for a given year
  Use SVM to rank players
4
Overview of Data and Mining
Data: 5 CSV files (Batting, Fielding, Master, Awards, Salaries)
Data Mining:
  Ranking (similar to classification)
  Anomaly detection (maybe)
Sample data (columns: playerID yearID stint teamID lgID Gbat AB R H 2B 3B HR RBI SB SO):
  aasedo01 1985 1 BAL AL 54
  abregjo01 CHN NL 6 9 2
  ackerji01 TOR 61
  adamsri02 SFN 121 12 23 3 10
  agostju01 CHA
  aguaylu01 PHI 91 165 27 46 7 21 26
  aguilri01 NYN 22 36 5
5
Methodology - Preprocessing
Initial Data: ~90,000 rows in Batting table
  One row: one player/year/stint/team
  Cut to ~28,000 rows, because of when Salary data begin and because of rule changes
Perl script to merge tables by playerID/yearID/stint
  Batting, Fielding, Awards (MVP), Salaries, Master = 48 columns
  ~14 hours, but I got to relearn Perl!
Discovered: infeasible to use WEKA, need to use SVM-Light
Reformatted from CSV to space-delimited SVM-Light format
  replace every “value” with “attribute:value”
  replace commas, spaces
  deleted 131 rows without a fielding record (3 largest: 26, 21, 16 at-bats)
  create (binary) rank value based on MVP status
  replace all MM/DD/YYYY with YYYY
  insert “qid” column according to year/league (46 qids)
  ...
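The CSV-to-SVM-Light reformatting step can be sketched in Python roughly as follows. This is a minimal illustration, not the project's Perl script: the `isMVP` field, the helper names `build_qids` and `to_svmlight_line`, and the choice to append the playerID as a trailing comment are all assumptions made for the example. It relies on SVM-Light's ranking input format, in which each line holds a target value, a `qid:` group, and integer-indexed `feature:value` pairs in ascending order.

```python
def build_qids(rows):
    """Assign one query id per (year, league) pair, numbered from 1.

    Hypothetical helper: the slides only say qids follow year/league.
    """
    keys = sorted({(r["yearID"], r["lgID"]) for r in rows})
    return {key: i for i, key in enumerate(keys, start=1)}


def to_svmlight_line(row, feature_names, qids):
    """Format one merged player row as an SVM-Light ranking line.

    Assumes row["isMVP"] is a boolean flag (hypothetical field name)
    and that every listed feature is numeric.
    """
    target = 1 if row["isMVP"] else 0          # binary rank value
    qid = qids[(row["yearID"], row["lgID"])]   # one group per year/league
    parts = [str(target), f"qid:{qid}"]
    # SVM-Light expects integer feature indices in ascending order.
    for index, name in enumerate(feature_names, start=1):
        parts.append(f"{index}:{float(row[name])}")
    # Text after '#' is treated as a comment; used here to keep the player id.
    return " ".join(parts) + f" # {row['playerID']}"
```

Keeping the playerID in the comment is one way to make it easier to match output ranks back to players during postprocessing.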
6
Methodology – Data Mining
Classification not apt to get good results, hence ranking with SVM-Light (Cornell University)
  Training generates a model which can rank input
Training phase
  Leave one (year) out
Testing
  Rank the players for that year
Postprocessing
  SVM-Light returns only ranks of the players as integers
  match ranks with corresponding players
  Reformat data for visualization
  Ranked the data for each attribute
Anomaly detection (in progress)
  KNN on 4 attributes (Gbat, R, HR, RBI) for players in >= 10 games
  Compute z-scores for each attribute/year
  Rank players by distance from nearest neighbor
  Compare ranks in various attributes for detecting anomalies
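The anomaly-detection idea (z-scores per attribute, then ranking by nearest-neighbor distance) might look something like the Python sketch below for a single year of data. It is an assumption-laden illustration rather than the project's implementation: the function names, the use of Euclidean distance, the population standard deviation, and the dictionary layout of each player record are all choices made for this example.

```python
import math
from statistics import mean, pstdev

ATTRS = ("Gbat", "R", "HR", "RBI")  # the four attributes named on the slide


def zscores(players, attrs=ATTRS):
    """Return {playerID: tuple of z-scores} for one year of players."""
    stats = {}
    for a in attrs:
        values = [p[a] for p in players]
        sigma = pstdev(values)
        # Guard against a constant column (assumption: treat spread as 1).
        stats[a] = (mean(values), sigma if sigma > 0 else 1.0)
    return {
        p["playerID"]: tuple((p[a] - stats[a][0]) / stats[a][1] for a in attrs)
        for p in players
    }


def rank_by_nn_distance(players, attrs=ATTRS, min_games=10):
    """Rank player ids from most to least isolated in z-score space."""
    eligible = [p for p in players if p["Gbat"] >= min_games]
    z = zscores(eligible, attrs)
    distance = {}
    for pid, vec in z.items():
        # Euclidean distance to the closest other player.
        distance[pid] = min(
            (math.dist(vec, other) for oid, other in z.items() if oid != pid),
            default=0.0,
        )
    return sorted(distance, key=distance.get, reverse=True)
```

Players near the top of the returned list are the ones with no close statistical neighbor, which is the sense of "anomaly" this sketch assumes.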
7
Methodology - Visualization
Bar charts of top 20 ranked players for various attributes
  Python
  Google App Engine
  Google Charts tool
U.S. map of player birthState density
8
Team Roles
Roles of team members
  Planning – Everyone
  Preprocessing – Paul Cornwell
  Data Mining – Kajal Miyan
  Visualization – Mojtaba Solgi
9
Related Work
No apparent academic work on predicting MLB MVPs
PECOTA
  Baseball Prospectus
  Baseball “forecasting”
  Makes statistical predictions about players
  No MVP prediction evident
  subscription service
Books are available with baseball forecasts
  apparently for one year only
10
Experimental Setup
Raw data downloaded from
Preprocessing done using Perl, Nano, Excel, OOo, TextPad
Preprocessing yields a table with ~28K rows and 45 columns
Experiments were conducted on a 2 GHz P4 machine running Kubuntu with 1 GB RAM
Data Mining and postprocessing with SVM-Light, Visual C#, Matlab
Visualization done using Python, Google App
11
Experimental Evaluation
Preliminary results
  SVM-Light trained on data, tested on 2007: ranked actual MVPs #1 and #11 (out of 1242 players) (2nd NL, #2)
  (there is one MVP for each league each year: AL, NL)
  2006: ranks 7, 16 (1371 players)
  2005: ranks 1, 4 (1322 players)
  2004: ranks 1, 3 (1342 players)
  2003: ranks 3, 32 (1341 players)
  2002: ranks 1, 11 (1316 players)
Final evaluation (pending)
  Leave-one-out
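A leave-one-year-out evaluation of this kind could be organized along the lines of the Python sketch below. It is only an outline under stated assumptions: the `isMVP` field, the function names, and the idea of receiving one numeric score per test player (higher meaning "more MVP-like") are hypothetical, and the call that trains and applies SVM-Light is deliberately left out.

```python
def leave_one_year_out(rows):
    """Yield (held_out_year, train_rows, test_rows) for every year."""
    years = sorted({r["yearID"] for r in rows})
    for held_out in years:
        train = [r for r in rows if r["yearID"] != held_out]
        test = [r for r in rows if r["yearID"] == held_out]
        yield held_out, train, test


def mvp_ranks(test_rows, scores):
    """Return {playerID: 1-based rank} for the actual MVPs in test_rows.

    scores[i] is assumed to be the model's score for test_rows[i];
    ties keep their input order because Python's sort is stable.
    """
    order = sorted(range(len(test_rows)), key=lambda i: scores[i], reverse=True)
    return {
        test_rows[i]["playerID"]: position
        for position, i in enumerate(order, start=1)
        if test_rows[i]["isMVP"]
    }
```

Reporting the MVPs' positions per held-out year, as in the preliminary results, then reduces to calling `mvp_ranks` once per split.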
12
Visualization Demo
13
Conclusions
MVP ranking was surprisingly successful
  Early results suggest that it is feasible to predict MVPs with some accuracy
Lessons learned
  Data mining is hard work
  Baseball statistics are actually sort of interesting
Future work
  Leave-one-out validation
  Incorporate team statistics in player evaluations (expert advice)