Mining Baseball Statistics

Data Mining – CSE881
Paul Cornwell, Kajal Miyan, Mojtaba Solgi
Project URL: http://kmp-cse881.appspot.com/

Overview of Baseball
- Baseball is a team sport
- There are two major leagues: AL (American), NL (National)
- Many statistics characterizing player performance are published yearly
- Each league names one player MVP (Most Valuable Player) each year according to a vote
- People place bets on who will be MVP

Overview
- Application: baseball statistics
- Motivation
  - Can we predict who will be named MVP?
  - Learn how to do data mining
  - Learn about baseball
  - Impress sabermetricians
  - Baseball: it's not diseases, crime, or pollution
- Main task: predict MVPs for a given year
  - Use SVM to rank players

Overview of Data and Mining
- Data: 5 CSV files (Batting, Fielding, Master, Awards, Salaries)
- Data Mining:
  - Ranking (similar to classification)
  - Anomaly detection (maybe)
- Batting table columns: playerID yearID stint teamID lgID Gbat AB R H 2B 3B HR RBI SB SO
- Sample rows (1985, values incomplete):
  aasedo01 1985 1 BAL AL 54
  abregjo01 CHN NL 6 9 2
  ackerji01 TOR 61
  adamsri02 SFN 121 12 23 3 10
  agostju01 CHA
  aguaylu01 PHI 91 165 27 46 7 21 26
  aguilri01 NYN 22 36 5

Methodology - Preprocessing
- Initial data: ~90,000 rows in the Batting table, 1871-2007
  - One row: one player/year/stint/team
- Cut to 1985-2007 (~28,000 rows), because salary data begin in 1985 and because of rule changes
- Perl script to merge tables by playerID/yearID/stint
  - Batting, Fielding, Awards (MVP), Salaries, and Master merged = 48 columns
  - ~14 hours, but I got to relearn Perl!
- Discovered: infeasible to use WEKA; need to use SVM-Light
- Reformatted from CSV to space-delimited SVM-Light format
  - replace every "value" with "attribute:value"
  - replace commas, spaces
  - deleted 131 rows without a fielding record (the three largest had 26, 21, and 16 at-bats)
  - create a (binary) rank value based on MVP status
  - replace all MM/DD/YYYY with YYYY
  - insert a "qid" column according to year/league (46 qids)
  - ...
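The CSV-to-SVM-Light reformatting step can be sketched roughly as follows. This is a minimal illustration in Python, not the Perl script the team used. The feature subset, the `MVP` column name, and the sample rows are assumptions made up for the example. Each output line follows the SVM-Light ranking layout `<target> qid:<n> <index>:<value> ... # <comment>`, with one qid per year/league pair.

```python
import csv
import io

# Hypothetical subset of numeric columns; the real merged table has many more.
FEATURES = ["Gbat", "AB", "R", "H", "HR", "RBI"]

# Made-up rows for illustration only (not taken from the real data).
SAMPLE_CSV = """playerID,yearID,lgID,Gbat,AB,R,H,HR,RBI,MVP
playerA,1985,AL,150,600,100,180,30,110,1
playerB,1985,AL,54,,,,,,0
playerC,1985,NL,91,165,27,46,6,21,0
"""


def build_qids(rows):
    """Assign one integer query id per (yearID, lgID) pair, in sorted order."""
    keys = sorted({(r["yearID"], r["lgID"]) for r in rows})
    return {key: i + 1 for i, key in enumerate(keys)}


def to_svmlight(row, qids):
    """Format one merged row as '<target> qid:<n> <index>:<value> ... # id'."""
    target = 1 if row["MVP"] == "1" else 0  # binary rank value from MVP status
    qid = qids[(row["yearID"], row["lgID"])]
    # Feature indices start at 1 and increase; empty cells are treated as 0.
    feats = " ".join(
        f"{i}:{float(row[name] or 0):g}"
        for i, name in enumerate(FEATURES, start=1)
    )
    return f"{target} qid:{qid} {feats} # {row['playerID']}"


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
qids = build_qids(rows)
lines = [to_svmlight(r, qids) for r in rows]
print("\n".join(lines))
```

With 23 seasons and two leagues, the same qid assignment would yield the 46 qids mentioned above.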

Methodology – Data Mining
- Classification not apt to get good results, hence ranking with SVM-Light (Cornell University)
  - Training generates a model which can rank input
- Training phase
  - Leave one (year) out
- Testing
  - Rank the players for that year
- Postprocessing
  - SVM-Light returns only ranks of the players as integers
  - Match ranks with corresponding players
  - Reformat data for visualization
  - Ranked the data for each attribute
- Anomaly detection (in progress)
  - KNN on 4 attributes (Gbat, R, HR, RBI) for players in >= 10 games
  - Compute z-scores for each attribute/year
  - Rank players by distance from nearest neighbor
  - Compare ranks in various attributes for detecting anomalies
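The anomaly-detection idea above (z-scores per attribute within a year, then ranking by distance to the nearest neighbor) can be sketched as below. This is only an illustration under assumptions: the team used other tools, the Euclidean distance is a guess at the metric, and the player records are invented.

```python
import math
from statistics import mean, pstdev

ATTRS = ["Gbat", "R", "HR", "RBI"]


def zscore_vectors(players):
    """Return one z-scored vector per player, computed within a single year."""
    stats = {}
    for a in ATTRS:
        values = [p[a] for p in players]
        stats[a] = (mean(values), pstdev(values))
    # A constant attribute (std == 0) contributes 0 to every vector.
    return [
        [(p[a] - stats[a][0]) / stats[a][1] if stats[a][1] else 0.0 for a in ATTRS]
        for p in players
    ]


def rank_anomalies(players, min_games=10):
    """Rank players of one year by Euclidean distance to their nearest neighbor."""
    eligible = [p for p in players if p["Gbat"] >= min_games]
    vectors = zscore_vectors(eligible)
    scored = []
    for i, v in enumerate(vectors):
        nearest = min(math.dist(v, w) for j, w in enumerate(vectors) if j != i)
        scored.append((eligible[i]["playerID"], nearest))
    # Largest nearest-neighbor distance first: the most isolated player.
    return sorted(scored, key=lambda t: t[1], reverse=True)


# Invented single-season records for illustration only.
season = [
    {"playerID": "p1", "Gbat": 150, "R": 80, "HR": 20, "RBI": 75},
    {"playerID": "p2", "Gbat": 148, "R": 82, "HR": 21, "RBI": 78},
    {"playerID": "p3", "Gbat": 152, "R": 79, "HR": 19, "RBI": 74},
    {"playerID": "p4", "Gbat": 60, "R": 20, "HR": 2, "RBI": 15},
    {"playerID": "p5", "Gbat": 62, "R": 22, "HR": 3, "RBI": 17},
    {"playerID": "p6", "Gbat": 155, "R": 130, "HR": 60, "RBI": 150},
    {"playerID": "p7", "Gbat": 5, "R": 0, "HR": 0, "RBI": 0},
]
ranking = rank_anomalies(season)
print(ranking[0][0])
```

Player `p7` is dropped by the 10-game filter before z-scores are computed, so a handful of at-bats does not distort the per-year means.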

Methodology - Visualization
- Bar charts of top 20 ranked players for various attributes
  - Python
  - Google App Engine
  - Google Charts tool
- U.S. map of player birthState density

Team Roles
- Planning – Everyone
- Preprocessing – Paul Cornwell
- Data Mining – Kajal Miyan
- Visualization – Mojtaba Solgi

Related Work
- No apparent academic work on predicting MLB MVPs
- PECOTA (Baseball Prospectus): www.baseballprospectus.com/pecota/
  - Baseball "forecasting"
  - Makes statistical predictions about players
  - No MVP prediction evident
  - Subscription service
- Books are available with baseball forecasts
  - apparently for one year only

Experimental Setup
- Raw data downloaded from http://baseball1.com/content/view/58/82/
- Preprocessing done using Perl, Nano, Excel, OOo, TextPad
- Preprocessing yields a table with ~28K rows and 45 columns
- Experiments were conducted on a 2 GHz P4 machine running Kubuntu 8.04 with 1 GB RAM
- Data mining and postprocessing with SVM-Light, Visual C#, Matlab
- Visualization done using Python, Google App Engine

Experimental Evaluation
- Preliminary results
  - SVM-Light trained on 1985-2006 data, tested on 2007
  - Ranked actual MVPs #1 and #11 (out of 1242 players) (2nd NL, #2)
  - (there is one MVP for each league each year: AL, NL)
  - 2006: ranks 7, 16 (1371 players)
  - 2005: ranks 1, 4 (1322 players)
  - 2004: ranks 1, 3 (1342 players)
  - 2003: ranks 3, 32 (1341 players)
  - 2002: ranks 1, 11 (1316 players)
- Final evaluation (pending)
  - Leave-one-out
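A leave-one-year-out evaluation of this kind could be organized along the following lines. This is a hedged sketch: it assumes the ranker emits one score per player, with higher meaning more preferred. The function names and the toy scores are hypothetical, and the training and ranking steps themselves (run with SVM-Light in the project) are not shown.

```python
def leave_one_year_out(years):
    """Yield (training_years, held_out_year) pairs, one per season."""
    for held_out in years:
        yield [y for y in years if y != held_out], held_out


def mvp_rank(scores, mvp_id):
    """Return the 1-based position of the actual MVP among scored players.

    `scores` maps playerID -> predicted score (assumed: higher is better).
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(mvp_id) + 1


# Toy scores for one held-out season (invented values).
toy_scores = {"playerA": 0.1, "playerB": 2.5, "playerC": 1.0}
splits = list(leave_one_year_out([2005, 2006, 2007]))
print(splits[0], mvp_rank(toy_scores, "playerC"))
```

Collecting `mvp_rank` for both leagues in every held-out year would give a table like the preliminary results above, for example ranks 1 and 11 out of 1242 players for 2007.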

Visualization Demo
http://kmp-cse881.appspot.com/

Conclusions
- MVP ranking was surprisingly successful
  - Early results suggest that it is feasible to predict MVPs with some accuracy
- Lessons learned
  - Data mining is hard work
  - Baseball statistics are actually sort of interesting
- Future work
  - Leave-one-out validation
  - Incorporate team statistics in player evaluations (expert advice)