Sorting Music, Intelligently


Mathematical models for song recommendation engines. Algorithms by Blake Ottinger, Kevin Todd, and Steven Tucker.

1. What makes music likeable? Music tastes are often highly subjective and at times elusive. Recommendation engines rely on analyzing the attributes that make one song “similar” to another.

Contextual clues Categorization by genre (broad classification). Classification by artist. Classification by year (or era, e.g. 70s, 80s). Classification by tags. Tags are created manually when songs are entered into the database, and each song is classified with multiple tags. Because tags are human-entered for each song, their accuracy is critical.

Database Formatted with the following structure: Artist - Song Title, Year, Genre, Listens, (Tag1, tag2, tag3). Listens is a count of current listens on Spotify. The database has a total of about 20 possible tags, and we have taken care to ensure their accuracy. The songs are parsed from this file in Python into a list, with each item in the list representing one song object.
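The parsing step described above can be sketched as follows. The class name, field names, and the assumption that artist and title contain no commas are all illustrative, not taken from the project's actual code:

```python
# Sketch of parsing the song database; one line per song, in the
# "Artist - Song Title, Year, Genre, Listens, (tag1, tag2, tag3)" format.
from dataclasses import dataclass

@dataclass
class Song:
    artist: str
    title: str
    year: int
    genre: str
    listens: int
    tags: tuple

def parse_line(line):
    """Parse one formatted database line into a Song object."""
    head, tag_part = line.split("(", 1)
    tags = tuple(t.strip() for t in tag_part.rstrip(")\n").split(","))
    artist_title, year, genre, listens = [f.strip() for f in head.rstrip(", ").split(",")]
    artist, title = [p.strip() for p in artist_title.split(" - ", 1)]
    return Song(artist, title, int(year), genre, int(listens), tags)

def load_songs(path):
    """Read the whole database file into a list of Song objects."""
    with open(path) as f:
        return [parse_line(line) for line in f if line.strip()]
```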

Song Database We have songs stored in a formatted text file, available for the algorithm to choose from.

INFORMAL PROBLEM STATEMENT Design a machine-learning algorithm to efficiently sort through a finite database of music in a text-file format, ranking songs with a similarity variable based on a comparison of user-entered music preferences against each song’s data.

FORMAL PROBLEM STATEMENT Given a set S of songs, distribute each song into a distinct list according to its genre, tags, and era. Given a set Q of user-preference data, iterate through S, compute each song’s similarity rating (defined below), and append songs with a high rating to a list of suggestions. The similarity of a suggested song is the percentage of matched data, m, over total compared data, n, multiplied by 100. Present the suggestions with the highest computed similarity to the user.
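The similarity rating from the formal statement is a straightforward percentage; a minimal sketch:

```python
def similarity(m, n):
    """Similarity rating: matched attributes m over total compared
    attributes n, as a percentage. Returns 0 when nothing was compared."""
    if n == 0:
        return 0.0
    return (m / n) * 100
```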

2. Ways it’s been done One problem. Multiple implementations. Cue - Steven.

Computers can’t listen to music… Music tastes are human by nature and can be highly subjective, but computer scientists have found ways to algorithmically generate accurate, optimized suggestions. These algorithms rely on a numerical representation of a song’s attributes: they must mathematically evaluate the data from those attributes and make educated guesses about complex tastes. As a result, different algorithms often produce different results, and each may be practical for a different set of purposes.

Who Else Has Done It? Pandora Spotify Apple Music

3. Our Approach Three algorithms that take different approaches to optimizing music suggestions. Cue - Kevin

K-Nearest Neighbors Converts categorical string data into numerical data that can be manipulated. Plots the converted data onto a Cartesian graph. Determines focus points (P). Draws a radius (K) around each focus point. Derives similarity from Euclidean distance.

Converting Data Converting the data faithfully is crucial for this algorithm to retain accuracy. X-axis: Metal - Electronic. Y-axis: Rage - Cheesy.

Converting Data Courtesy of MusicMap.info

Plotting Data and Determining Focus Points Songs are plotted by their primary genre (x-axis) and a mathematical average of their tags (y-axis). User preference is split into three focus points; each K is determined by the user’s primary, secondary, and tertiary preferences.

Converting Data

Determining Similarity Similarity is derived from the Euclidean distance between a focus point and the potential points within K. Points at a closer distance have higher similarity.
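The distance-based ranking step can be sketched as below. The function names are illustrative; points are assumed to be the (genre, tag-average) coordinates described on the plotting slide:

```python
# Rank candidate songs by Euclidean distance from a focus point,
# keeping only those within radius k (closest, i.e. most similar, first).
import math

def euclidean(p, q):
    """2-D Euclidean distance between points p and q."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def neighbors_within_k(focus, candidates, k):
    """Return candidates within radius k of the focus, closest first."""
    in_range = [(euclidean(focus, c), c) for c in candidates
                if euclidean(focus, c) <= k]
    return [c for _, c in sorted(in_range)]
```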

Results

Pros and Cons Pros: consistent; fast with a small n; easy to test with different data. Cons: shallow optimization; dependent on graph setup; slow with a large n.

Probability Sliders Creates a list of user preferences for genres, tags, and eras. Slider: each genre, tag, and era has a float in (0, 1) associated with it, representing how much the user likes that label. If a potential new song meets the requirements, the user’s Sliders are updated with the song’s labels. Cue - Steven

User Sliders Initialization Initial user Slider values are based on the songs the user says they like. If 2 songs out of a total of 10 have the tag “happy”, the happy Slider will be set to 0.20. Repeat for each era, genre, and remaining tag. Sliders not given a value are assigned 0.10 to increase diversity.
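The initialization rule above (label frequency among liked songs, with a 0.10 default for unseen labels) can be sketched as follows. The representation of a song as a set of labels is an assumption for illustration:

```python
# Build initial user Sliders: each label's value is the fraction of
# liked songs carrying that label; labels never seen default to 0.10.
from collections import Counter

def init_sliders(liked_songs, all_labels, default=0.10):
    """liked_songs: iterable of label sets, one per liked song."""
    counts = Counter(label for song in liked_songs for label in song)
    n = len(liked_songs)
    return {label: counts[label] / n if counts[label] else default
            for label in all_labels}
```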

User Sliders Visualization Sliders are generated from the user’s input of desired songs. For testing, arbitrary random values were used; in practice, the Sliders would be created automatically from recorded data about what the user listens to.

Evaluating a Potential Song S ← song score, U ← user Slider. For each era, genre, and tag, compare a random float (from random.random()) to U. If random < U, add that random value to S; else, add random / 4 to S, giving the song a small chance of still being accepted.
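The evaluation loop above can be sketched as below. Normalizing the accumulated score S by the number of labels to obtain a fitness in [0, 1] is our assumption; the slides do not spell out how Fitness is derived from S:

```python
# For each of the song's labels, draw a random float and compare it to
# the user's Slider for that label: a hit contributes the full draw to
# the song score S, a miss contributes a quarter of it.
import random

def evaluate_song(song_labels, user_sliders):
    s = 0.0
    for label in song_labels:
        r = random.random()
        u = user_sliders.get(label, 0.0)
        if r < u:
            s += r       # draw falls under the user's preference level
        else:
            s += r / 4   # small chance the song is still accepted
    return s / len(song_labels)   # assumed normalization to a fitness
```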

Evaluating one song is O(g + e + t), where g is the number of genres, e the number of eras, and t the number of tags. Overall, the algorithm is O(n) song evaluations, where n is the number of songs to be accepted.

Accept / Decline If random.random() < Fitness, the song is accepted as a suitable song: the user Sliders are increased by a percentage, and ideally the song is added to the playlist. Otherwise the song is declined: for each of its labels, the user Slider is set to 95% of its original value. Then continue to the next song.
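The accept/decline bookkeeping can be sketched as below. The 95% decay on decline is from the slide; the 5% boost on acceptance is an illustrative choice, since the slides only say the Sliders increase “by a %”:

```python
# Nudge the user's Sliders after a verdict: accepted songs pull their
# labels' Sliders up (capped at 1.0), declined songs decay them to 95%.
def update_sliders(user_sliders, song_labels, accepted, boost=1.05, decay=0.95):
    factor = boost if accepted else decay
    for label in song_labels:
        if label in user_sliders:
            user_sliders[label] = min(1.0, user_sliders[label] * factor)
```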

Percentage Sliders Pros: easy to visualize; works well with mid-size user databases; quick comparisons; easy manipulations. Cons: selecting 1 song will not be enough; subject to random probability; less likely to accept as you request more songs; takes genre, era, and tags equally into account.

Multiple Correspondence Analysis [MCA] Each attribute of a song is converted to a numeric value representing a comparison to the user’s preferences; higher comparison values mean higher similarity. These values are combined mathematically into a single “similarity score” for the song, representing its likeability. The highest scores are ranked first and picked as recommendations for the user. Cue - Blake.

Measuring User Preferences We need to know how much artists, genres, eras, and tags should each be weighted (i.e., how important each factor is to the user). We gather slightly redundant user-preference data to do this and aggregate it into sets A and B. MCA then measures the correlation along each column of data sets A and B and sets the weight for each attribute according to its correlation.

Multiple Correspondence Analysis (cont.) A consistency value of 10 indicates a strong correlation. A strong correlation indicates that the attribute type is more important to the user and should be weighted more heavily.

Example: Measuring correlation

Data set A (primitive data):

Name | Favorite Genre | Favorite Artists | Preferred Tags | Preferred Era
Sam  | Rock           | AJR, Rush        | Upbeat, happy  | 80s

Data set B (favorite songs):

Song Title      | Genre | Artist       | Tags              | Year
Round and Round | Rock  | 3 Doors Down | Upbeat, catchy    | 2011
Clocks          | Pop   | Coldplay     | Chill, deep, hits | 2002

MCA - Computing final similarity We represent each attribute numerically: it initializes to 1 if it matches the user’s preferences and 0 otherwise. The correlation scores calculated previously determine how much genre, artist, era, and tags should each weigh.
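The final scoring step reduces to a weighted sum of 0/1 match indicators; a minimal sketch, with attribute names and the assumption of pre-normalized weights both illustrative:

```python
# Weighted similarity score: each attribute contributes its
# correlation-derived weight if it matched the user's preference (1),
# and nothing if it did not (0).
def mca_score(matches, weights):
    """matches, weights: dicts keyed by attribute type, e.g. 'genre'."""
    return sum(weights[attr] * matches.get(attr, 0) for attr in weights)
```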

MCA Pros and Cons Pros: very deep analysis with potential for good results; can determine the weight that should be applied to genre, tags, artists, and era; is moderately “self-correcting” due to input-data redundancy. Cons: can be computationally expensive; requires redundant input data (the user has to supply a list of their favorite songs to the algorithm). Expected efficiency: O(n * k * m), where n is the number of database songs, m is the number of users to run for, and k is the number of songs the user provides as favorites.

MCA: Real Example Output:

Results

What can we begin to take away? 4. Conclusion What can we begin to take away? Cue - Kevin

Future Work Implement user feedback for all algorithms. Expand K-Nearest Neighbors: optimize the x- and y-axes. Expand Percentage Sliders: add mutations to slightly increase/decrease the chance a song is accepted; add directly editable Sliders to manually change preferences. Expand MCA: try different data rankings. Cue - Kevin

Conclusion Different implementations of music optimization call for different approaches. Deep analysis, continuous user tracking: MCA. Medium-level analysis: Percentage Sliders. Shallow analysis, initial user preference: K-Nearest Neighbors.

5 Questions 1.) How do we determine whether optimization is accurate when analyzing something as subjective as music taste? 2.) How should the algorithmic approach change with different sizes of data sets? 3.) How do we determine how songs are tagged? 4.) How do we decide which data to derive from the user, and how do we best utilize that data? 5.) What would be the most effective way to implement user feedback for improvement?