Music Information Retrieval: Overview and Challenges
J.-S. Roger Jang (張智星)
Multimedia Information Retrieval (MIR) Lab
CSIE Dept, National Taiwan Univ.
http://mirlab.org/jang
2017/4/25
Outline
- Music Information Retrieval (MIR)
  - Intro to MIR
  - Intro to ISMIR & MIREX
- Two classical paradigms of MIR
  - QBSH (query by singing/humming)
  - AFP (audio fingerprinting)
- Conclusions
Introduction to QBSH
- QBSH: Query by Singing/Humming
  - Input: Singing or humming from microphone
  - Output: A ranked list retrieved from the song database according to similarity to the query
- Progression
  - First paper: Around 1994
  - Extensive studies since 2001
  - State of the art: QBSH tasks at ISMIR/MIREX, since 2006
Two Steps in QBSH
- Pitch tracking: to detect the period of a waveform
  - Time domain
    - ACF (Autocorrelation function)
    - NSDF (Normalized squared difference function)
    - AMDF (Average magnitude difference function)
  - Frequency domain
    - Harmonic product spectrum
    - Cepstrum
- Database comparison: to find similarity between query and database songs
  - Linear scaling
  - Dynamic time warping
  - Recursive alignment
  - Hybrid methods
Frame Blocking for Pitch Tracking
- Sample rate = 16 kHz
- Frame size = 512 samples
- Frame duration = 512/16000 = 0.032 s = 32 ms
- Overlap = 192 samples
- Hop size = frame size - overlap = 512 - 192 = 320 samples
- Frame rate = 16000/320 = 50 frames/sec = pitch rate
- [Figure: waveform zoomed in to show one frame and the overlap between adjacent frames]
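The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch; the function names `frame_params` and `block_frames` are illustrative and not from the slides, and trailing samples that do not fill a whole frame are simply dropped (an assumption).

```python
def frame_params(sample_rate=16000, frame_size=512, overlap=192):
    """Derive frame duration, hop size and frame rate from the slide's settings."""
    hop_size = frame_size - overlap
    return {
        "frame_duration_ms": 1000.0 * frame_size / sample_rate,
        "hop_size": hop_size,
        "frame_rate": sample_rate / hop_size,  # frames per second = pitch rate
    }


def block_frames(signal, frame_size=512, overlap=192):
    """Cut a signal into overlapping frames; an incomplete last frame is dropped."""
    hop_size = frame_size - overlap
    last_start = len(signal) - frame_size
    return [signal[i:i + frame_size] for i in range(0, last_start + 1, hop_size)]


params = frame_params()
print(params["frame_duration_ms"], params["hop_size"], params["frame_rate"])
```

With the default values this reproduces the numbers above: 32 ms per frame, a hop of 320 samples and 50 pitch points per second.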
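ACF: Auto-correlation Function
- Original frame: s(t)
- Shifted frame: s(t-τ), shown for τ = 30
- acf(30) = inner product of the overlapping part of the two frames
- Pitch period: the shift τ at which acf(τ) reaches its largest peak away from τ = 0
- To play safe, the frame size needs to cover at least two fundamental periods!
- [Figure: a frame of samples 1 to 128, its shifted copy, and the resulting ACF curve]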
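The ACF idea can be sketched in plain Python as follows. The names `acf` and `acf_pitch` are illustrative, and restricting the lag search to the 82 Hz to 1047 Hz pitch range is an assumption borrowed from the pitch-range slide, not a documented part of the lab's tracker.

```python
import math


def acf(frame, lag):
    """Inner product of the frame with itself shifted by `lag` (overlap part only)."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))


def acf_pitch(frame, sample_rate=16000, fmin=82.0, fmax=1047.0):
    """Return the pitch in Hz whose period (lag) maximizes the ACF."""
    min_lag = int(sample_rate / fmax)                        # shortest period considered
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)   # longest period considered
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: acf(frame, lag))
    return sample_rate / best_lag


# Usage: a synthetic 200 Hz sine, one 512-sample frame at 16 kHz.
frame = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(512)]
print(acf_pitch(frame))
```

A 512-sample frame at 16 kHz spans 32 ms, so it covers two periods only for pitches of about 62.5 Hz and above, which is why the lower search limit matters.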
Frequency to Semitone Conversion
- Semitone: A music scale based on A440
  - semitone = 69 + 12 * log2(freq / 440)
- Reasonable pitch range: E2 - C6
  - 82 Hz - 1047 Hz (semitone 40 - 84)
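A short sketch of the conversion, assuming the standard MIDI convention in which A440 maps to semitone 69 and each octave spans 12 semitones. The function names are illustrative.

```python
import math


def freq_to_semitone(freq_hz):
    """Convert a frequency in Hz to a (fractional) semitone number, A440 = 69."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


def semitone_to_freq(semitone):
    """Inverse mapping: semitone number back to frequency in Hz."""
    return 440.0 * 2.0 ** ((semitone - 69.0) / 12.0)


for name, freq in [("E2", 82.41), ("A4", 440.0), ("C6", 1046.5)]:
    print(name, round(freq_to_semitone(freq), 2))
```

Working in semitones makes a key change a constant offset, which simplifies the later comparison of pitch vectors.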
Demos
- Pitch-related demos
  - Pitch tracking
  - Pitch shift
Basic Comparison Method: Linear Scaling
- Scale the query pitch linearly to match the candidates
- [Figure: the original input pitch compressed by 0.5 and 0.75 and stretched by 1.25 and 1.5, each compared with the target pitch in the database; the best match is marked]
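Linear scaling can be sketched as below. This is only an illustration under stated assumptions: the scaling factors are the five shown in the figure, the query is compared with the beginning of the target, the key is aligned by shifting the query to the target's mean pitch, and the distance is the mean absolute difference in semitones. The slides do not specify these details, and the names `resample` and `linear_scaling` are hypothetical.

```python
def resample(pitch, new_len):
    """Linearly interpolate a pitch vector to new_len points (new_len >= 2)."""
    out = []
    for i in range(new_len):
        pos = i * (len(pitch) - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(pitch) - 1)
        frac = pos - lo
        out.append(pitch[lo] * (1.0 - frac) + pitch[hi] * frac)
    return out


def linear_scaling(query, target, factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Return (best distance, best factor) over the given scaling factors."""
    best_dist, best_factor = float("inf"), None
    for factor in factors:
        n = int(round(len(query) * factor))
        if n < 2 or n > len(target):
            continue  # scaled query would be degenerate or longer than the target
        scaled = resample(query, n)
        segment = target[:n]
        # Key transposition (assumption): align the mean pitch of the two vectors.
        shift = sum(segment) / n - sum(scaled) / n
        dist = sum(abs(s + shift - t) for s, t in zip(scaled, segment)) / n
        if dist < best_dist:
            best_dist, best_factor = dist, factor
    return best_dist, best_factor
```

Ranking the database then amounts to calling `linear_scaling` for every candidate song and sorting by the returned distance.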
Typical Result of Pitch Tracking
- Pitch tracking via autocorrelation for 茉莉花 (Jasmine)
Comparison of Pitch Vectors
- Yellow line: Target pitch vector
QBSH Demos
- QBSH demos by our lab
  - QBSH on the web: MIRACLE
  - QBSH on toys
- Existing commercial QBSH systems
  - www.midomi.com
  - www.soundhound.com
Our QBSH System: Miracle
- Clients: PC, PDA/Smartphone, Cellular
- Request from client to master server: pitch vector
- Response from master server to client: search result
- Single server with GPU
  - NVIDIA 560 Ti, 384 cores (speedup factor = 10)
- Database size: ~20,000
Improving QBSH
- Many ways to improve QBSH
  - Sorted error vector
  - Various weights for rests
  - Re-ranking for better accuracy
  - Better memory arrangement in GPU
  - …
Intro to Audio Fingerprinting (AFP)
- Goal
  - Identify a noisy version of a given audio clip
- Also known as…
  - “Query by exact example”: no “cover versions” are allowed
AFP Applications
- Commercial applications of AFP
  - Music identification & purchase
  - Royalty assignment (over radio)
  - TV shows or commercials ID (over TV)
  - Copyright violation (over web)
- Major commercial players
  - Shazam, Soundhound, Intonow, Viggle…
Two Stages in AFP
- Offline
  - Feature extraction
  - Hash table construction for songs in database
  - Inverted indexing
- Online
  - Feature extraction
  - Hash table search
  - Ranked list of the retrieved songs/music
Robust Feature Extraction
- Various kinds of features for AFP
  - Invariance along time and frequency
  - Landmark of a pair of local maxima
  - Wavelets
  - …
- Extensive tests required for choosing the best features
Representative Approaches to AFP
- Philips
  - J. Haitsma and T. Kalker, “A highly robust audio fingerprinting system”, ISMIR 2002.
- Shazam
  - A. Wang, “An industrial-strength audio search algorithm”, ISMIR 2003.
- Google
  - S. Baluja and M. Covell, “Content fingerprinting using wavelets”, Euro. Conf. on Visual Media Production, 2006.
  - V. Chandrasekhar, M. Sharifi, and D. A. Ross, “Survey and evaluation of audio fingerprinting schemes for mobile query-by-example applications”, ISMIR 2011.
Improvement on AFP
- Re-ranking of AFP by learning to rank
- Demo: http://mirlab.org/demo/audioFingerprinting
Shazam’s Method
- Ideas
  - Take advantage of the local structures of music
    - Find salient peaks on spectrogram
    - Pair peaks to form landmarks for comparison
  - Efficient search by hash tables
    - Use positions of landmarks as hash keys
    - Use song ID and offset time as hash values
    - Use time constraints to find matched landmarks
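The hash-table part of these ideas can be sketched as follows. This is a toy illustration, not Shazam's or the lab's implementation: peaks are assumed to be given as (time frame, frequency bin) pairs, the hash key is the tuple (f1, f2, dt), and the target-zone limits `max_dt` and `fan_out` are made-up values. The time constraint is enforced by voting on the offset between database time and query time.

```python
from collections import Counter, defaultdict


def landmarks(peaks, max_dt=10, fan_out=3):
    """Pair each peak with up to fan_out later peaks at most max_dt frames away."""
    peaks = sorted(peaks)
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt > 0:
                yield (f1, f2, dt), t1  # (hash key, anchor time)
                paired += 1
                if paired == fan_out:
                    break


def build_index(songs):
    """Offline stage: map each hash key to a list of (song ID, offset time)."""
    index = defaultdict(list)
    for song_id, peaks in songs.items():
        for key, t1 in landmarks(peaks):
            index[key].append((song_id, t1))
    return index


def identify(index, query_peaks):
    """Online stage: vote for (song ID, time offset); consistent offsets win."""
    votes = Counter()
    for key, t_query in landmarks(query_peaks):
        for song_id, t_song in index.get(key, ()):
            votes[(song_id, t_song - t_query)] += 1
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return song_id, offset, count


songs = {
    "A": [(0, 10), (2, 20), (5, 15), (8, 30), (12, 25), (15, 40)],
    "B": [(0, 11), (3, 22), (6, 14), (9, 33), (13, 21), (16, 44)],
}
index = build_index(songs)
# Query: the part of song A that starts at frame 5, re-timed to start at 0.
print(identify(index, [(0, 15), (3, 30), (7, 25), (10, 40)]))
```

Only landmarks that agree on the same song and the same offset accumulate votes, so accidental key collisions from other songs contribute little to the winner's count.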
How to Find Salient Peaks
- We need to find peaks that are salient along both frequency and time axes
  - Frequency axis: Gaussian local smoothing
  - Time axis: Decaying threshold over time
How to Find Initial Threshold?
- Goal
  - To suppress neighboring peaks
- Ideas
  - Find the local maxima of the magnitude spectra of the initial 10 frames
  - Superimpose a Gaussian on each local maximum
  - Find the maximum of all Gaussians
How to Update the Threshold along Time?
- Decay the threshold
- Find local maxima larger than the threshold: these are the salient peaks
- Define the new threshold as the max of the old threshold and the Gaussians passing through the active local maxima
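The threshold initialization and update can be sketched in plain Python as below. Several details are assumptions rather than the slides' exact procedure: the initial threshold is built from the bin-wise maximum over the first `init_frames` frames, the decay factor is 0.9 per frame, the Gaussian width is 2 bins, and only the forward pass is shown. The spectrogram is a list of magnitude spectra, one list per frame.

```python
import math


def local_maxima(values):
    """Indices k where values[k] is a local maximum along the frequency axis."""
    n = len(values)
    found = []
    for k in range(n):
        left = values[k - 1] if k > 0 else float("-inf")
        right = values[k + 1] if k < n - 1 else float("-inf")
        if values[k] > left and values[k] >= right:
            found.append(k)
    return found


def gaussian_envelope(values, sigma=2.0):
    """Max over Gaussians centered on each local maximum, scaled to its height."""
    env = [0.0] * len(values)
    for k in local_maxima(values):
        for j in range(len(values)):
            g = values[k] * math.exp(-0.5 * ((j - k) / sigma) ** 2)
            env[j] = max(env[j], g)
    return env


def salient_peaks(spectrogram, decay=0.9, sigma=2.0, init_frames=10):
    """Forward pass: return (frame, bin) pairs that exceed a decaying threshold."""
    n_bins = len(spectrogram[0])
    head = spectrogram[:init_frames]
    init_max = [max(frame[k] for frame in head) for k in range(n_bins)]
    threshold = gaussian_envelope(init_max, sigma)
    peaks = []
    for t, frame in enumerate(spectrogram):
        threshold = [decay * th for th in threshold]      # decay the threshold
        active = [0.0] * n_bins
        for k in local_maxima(frame):
            if frame[k] > threshold[k]:                    # salient peak
                peaks.append((t, k))
                active[k] = frame[k]
        spread = gaussian_envelope(active, sigma)
        threshold = [max(th, sp) for th, sp in zip(threshold, spread)]
    return peaks
```

A strong peak raises the threshold around its frequency bin, so weaker neighbors in the following frames are suppressed until the threshold has decayed enough.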
Time-decaying Thresholds
- [Figures: thresholds from the forward pass and from the backward pass]
How to Pair Salient Peaks?
- [Figure: the target zone used for pairing]
Salient Peaks and Landmarks
- Peak picking after forward smoothing
- Matched landmarks (green)
- (Source: Dan Ellis)
Landmarks for Hash Table Access
Optimization Strategies for AFP
- Several ways to optimize AFP
  - Strategy for query landmark extraction
  - Confidence measure
  - Incremental retrieval
  - Better use of the hash table
  - Re-ranking for better performance
Demos of Audio Fingerprinting
- Commercial apps
  - Shazam
  - Soundhound
- Our demo
  - http://mirlab.org/demo/audioFingerprinting
QBSH vs. AFP
           | QBSH                                 | AFP
Goal       | MIR                                  | MIR
Features   | Pitch: perceptible, small data size  | Landmarks: not perceptible, big data size
Method     | LS (linear scaling)                  | Matched LM (landmarks)
Database   | Harder to collect, small storage     | Easier to collect, large storage
Bottleneck | CPU/GPU-bound                        | I/O-bound
Conclusions
- Successful applications in MIR
  - QBSH
  - AFP
  - Due to
    - Faster, bigger memory
    - Advances in GPU/CPU (Moore’s law)
    - New machine learning methods
- Challenges in MIR
  - Audio melody extraction from polyphonic music
  - Database collection for QBSH
  - Cover song ID (which cannot be handled by AFP)
  - Polyphonic music transcription
Thank you for your attention! Questions & comments?