Basketball Position Classification
Brandon Hardesty, Matt Saldaña, Audrey Bunn
Informal Problem Statement
Use classification algorithms to predict a basketball player's most effective position, either forward or guard. NBA players' statistics serve both as the algorithm's learning data and as the basis for comparison in classification.
Audrey
Formal Problem Statement
Let P be a set of 70 basketball players with four subsets: G, F, X, and Y. Subset G lists the statistics of the top 10 NBA guards for the 2017–2018 season, and subset F lists the statistics of the top 10 NBA forwards for the same season. Subset X lists NCAA statistics for 25 guards, and subset Y lists NCAA statistics for 25 forwards. Each player pi in X and Y is mapped to forward if pi's statistics, si, are most similar to set F's average statistics; likewise, pi is mapped to guard if si is most similar to set G's average statistics.
P = all players
G = NBA guards
F = NBA forwards
X = NCAA guards
Y = NCAA forwards
pi = a position-classified player
si = that player's position-specific statistics
Brandon
Program Use
AAU and collegiate programs
NBA front offices
Companies investing in big data in sports
Personal interest
Algorithm analysis
Spicy
Context
Defining Modern NBA Player Positions – Applying Machine Learning to Uncover Functional Roles in Basketball, by Han Man
Using Machine Learning to Find the 8 Types of Players in the NBA, by Alex Cheng
Both works utilize K-means clustering, DBSCAN, and hierarchical clustering.
Similarities: data source, normalization, classifying by position
Differences: they classify players within the same league and use different statistics for classification
Spicy
Statistical Evaluation
Average points per game*
Average rebounds per game*
Free throw percentage
Three-point percentage
*Normalized to a 36-minute game
Audrey
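The per-36-minute normalization above can be sketched as a simple scaling. This is a minimal illustration; the function name and the example numbers are hypothetical, not taken from the project's code or data:

```python
def per_36(stat_total, minutes_played):
    """Scale a counting stat to a per-36-minute rate, so players with
    different playing time can be compared fairly."""
    return stat_total / minutes_played * 36

# A player who scored 540 points in 1080 minutes averages 18.0 points per 36.
print(per_36(540, 1080))  # → 18.0
```

Percentages (free throw, three-point) need no such adjustment, since they are already rate statistics.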
Forwards vs Guards Statistically Brandon
Implemented Algorithms
Learning Vector Quantization (LVQ)
K-Nearest Neighbors (KNN)
Brute-Force Comparison
Brute-Force Method
Compares the input data to the average stats of the learning data
The position with the most "winning" comparisons is the classification
Ties are broken with point differentials
Pros:
Easy to comprehend
Low RAM usage
Cons:
Very naïve
Inaccurate
Variable/slow run times
Audrey
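The per-stat "winning comparisons" idea with a point-differential tie-breaker can be sketched as follows. This is a minimal sketch, not the project's actual implementation; the function name and the example stat vectors (points, rebounds, FT%, 3P%) are illustrative assumptions:

```python
def brute_force_classify(player, guard_avg, forward_avg):
    """Classify a stat vector by counting per-stat 'wins' against each
    position's average; ties fall back to total point differential."""
    guard_wins = forward_wins = 0
    guard_diff = forward_diff = 0.0
    for p, g, f in zip(player, guard_avg, forward_avg):
        if abs(p - g) < abs(p - f):
            guard_wins += 1      # this stat is closer to the guard average
        elif abs(p - f) < abs(p - g):
            forward_wins += 1    # this stat is closer to the forward average
        guard_diff += abs(p - g)
        forward_diff += abs(p - f)
    if guard_wins != forward_wins:
        return "guard" if guard_wins > forward_wins else "forward"
    # Tie-breaker round: the smaller total differential wins.
    return "guard" if guard_diff < forward_diff else "forward"

# Illustrative averages only: [points/36, rebounds/36, FT%, 3P%]
print(brute_force_classify([20.0, 4.0, 0.85, 0.38],
                           [19.0, 4.5, 0.84, 0.37],
                           [16.0, 9.0, 0.75, 0.30]))  # → guard
```

The tie-breaker pass is what makes run times variable: borderline players trigger the extra differential calculations.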
Learning Vector Quantization Method
LVQ takes training vectors and codebook vectors as inputs. It iterates through the training vectors and, for each one, finds the closest codebook vector. That codebook vector is then moved closer to the training instance (if their classes match) or further away (if they differ) by a learning rate times the difference between the vectors. After training has finished, test data is classified by the class of its closest codebook vector.
Pros:
Fastest run times
Cons:
Not the most accurate
Memory intensive
Brandon
K-Nearest Neighbors Method
Given: a set of training vectors, a test vector, and a value of K
Select the K entries from the training set that are closest to the test vector
Make a classification prediction by majority vote among those K closest training instances
Pros:
Most accurate
Low RAM usage
Cons:
Not the fastest
Spicy
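The steps above can be sketched as follows, with K = 3 matching the value quoted in the Q&A. This is a minimal sketch that sorts by distance rather than making repeated passes as the project's implementation apparently did; the function name and example vectors are illustrative:

```python
import math
from collections import Counter

def knn_classify(test_vec, train, k=3):
    """train: list of (vector, label). Return the majority label among
    the k training vectors closest to test_vec (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], test_vec))
    top_k_labels = [label for _, label in nearest[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Illustrative data: [points/36, rebounds/36]
train = [([18.0, 4.0], "guard"), ([20.0, 3.0], "guard"), ([21.0, 3.5], "guard"),
         ([15.0, 9.0], "forward"), ([13.0, 10.0], "forward"), ([16.0, 8.5], "forward")]
print(knn_classify([19.0, 3.0], train, k=3))  # → guard
```

KNN needs no training phase, which keeps RAM usage low, but every classification must measure the distance to every training instance, which is why it is not the fastest.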
Experimental Procedure
Classify college test data (25 guards, 25 forwards) by comparison against the NBA learning data
Perform multiple runs with different n values (6, 12, 18, 24, 30, 36, 42, 48)
Track total run time
Track accuracy percentage
Track memory usage
Brandon
Run Time Comparison (in milliseconds)
Audrey
Accuracy Comparison Spicy
RAM Usage Brandon
Conclusion
Best algorithm for this project: K-Nearest Neighbors
Most accurate
Moderate total run times
Lowest RAM usage
Learning Vector Quantization comes in second
Moderately accurate
Fastest run times
Highest RAM usage
Brute force = worthless
Audrey
Future Work
Predict wins and losses of games
Predict tournament winners
Apply to different sports
Try a different normalization technique
Utilize more in-game statistics
Expand the number of positions
Spicy
Five Questions
Q. Why are the brute-force run times so variable? What explains the spike in total run time between 24 test players and 42 test players?
A. The spike in run times comes from the "tie-breaker rounds." Some of the players tested during those runs were borderline players, meaning their stats fall almost exactly between those of a forward and a guard. The additional calculations needed to compute point differentials between the input data and these borderline players increased the run time for brute force.
Q. Why does K-Nearest Neighbors take more time than Learning Vector Quantization?
A. K-Nearest Neighbors has a longer run time because the algorithm has to run through the input data four times. It must make that many passes to find the three closest points (our K value) to the feature vector.
Q. Why does LVQ use more RAM to classify?
A. The Learning Vector Quantization algorithm uses more RAM because it must store both the codebook vectors and the data set.
Q. What would be a more accurate brute-force method?
A. Running the point-differential "tie-breaker round" from the beginning instead of the initial "majority rules" method. Point differentials would be more direct, fine-tuned, and accurate.
Q. Is there a better classification algorithm out there to solve this problem?
A. LVQ requires input data with the same number of attributes as the training data; that data is what the LVQ algorithm classifies.
Brandon