Audio Thumbnailing of Popular Music Using Chroma-Based Representations
Matt Williamson, Chris Scharf
Implementation based on: Mark A. Bartsch and Gregory H. Wakefield, IEEE Transactions on Multimedia, Vol. 7, No. 1, February 2005
Introduction
– Multimedia content is growing rapidly
– An efficient method of browsing is necessary
– Indexing and retrieval methods are media-dependent
Primary goal
– Minimize the audition time for a given type of media
Current methods
– Images: downsampling
  – Produces a smaller version of the image (a thumbnail)
  – Reduces the cost of delivery and display
Current methods
– Audio (speech): symbolic representation
  – Produces a transcript of the audio
What about music?
– Adapt an existing method: downsampling (time compression)
  – Results in highly distorted, unintelligible audio
What about music?
– Adapt an existing method (cont’d): symbolic representation (score transcription)
  – Extremely difficult
  – Results in essentially meaningless information
  – Does not convey other important elements:
    – Vocal style
    – Instruments used
    – Processing effects used
Essential problem
– Adapting existing methods cannot reduce the audition time for music while still conveying the “gist” of the song
Possible Solution
– Audio thumbnailing via chroma-based analysis
Audio thumbnailing
– Produces a short clip of the selection to represent the “gist” of the song
Chroma-based analysis
– Based on the extraction of chroma features from the audio
– Chroma Feature Extraction Algorithm (each step is sketched in code after its slide below):
  – Frame Segmentation
  – Feature Calculation
  – Correlation Calculation
  – Correlation Filtering
  – Thumbnail Selection
Chroma Feature Extraction
– Extract frequencies from the audio file
– Calculate chroma values from the frequencies
– Categorize chroma values into pitch classes
  – 12 pitch classes: A, A#/Bb, B, C, C#/Db, …, G#/Ab
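To make the pitch-class mapping concrete, here is a minimal sketch (not from the paper) in Python with NumPy, assuming an A440 reference: a frequency is converted to semitones above A4 and wrapped modulo 12, so octave-related frequencies collapse to the same class.

```python
import numpy as np

PITCH_CLASSES = ["A", "A#/Bb", "B", "C", "C#/Db", "D",
                 "D#/Eb", "E", "F", "F#/Gb", "G", "G#/Ab"]

def pitch_class(freq_hz, f_ref=440.0):
    """Map a frequency in Hz to one of the 12 pitch classes.

    Chroma discards octave information: two frequencies an octave
    apart map to the same class. f_ref = 440 Hz anchors class 0 at A.
    """
    # Distance from the reference in semitones, wrapped to one octave.
    semitones = 12.0 * np.log2(freq_hz / f_ref)
    return PITCH_CLASSES[int(np.round(semitones)) % 12]

# Example: 220 Hz (A3) and 880 Hz (A5) collapse to the same class.
print(pitch_class(220.0), pitch_class(880.0))  # A A
```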
Frame Segmentation
– Authors’ implementation:
  – Frame boundaries determined via a beat-tracking algorithm
  – Frame lengths range from 0.25 s to 0.56 s
– Our implementation:
  – Fixed frame length equal to the average of that range: 0.41 s
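A sketch of our fixed-length segmentation, assuming NumPy and a mono signal array; the function name and parameters are illustrative, and the paper's beat-tracked boundaries are replaced by the constant 0.41 s frame described above.

```python
import numpy as np

def segment_frames(signal, sample_rate, frame_len_s=0.41):
    """Split a mono signal into fixed-length, non-overlapping frames.

    The paper derives frame boundaries from a beat tracker
    (0.25-0.56 s); here we approximate with a constant 0.41 s frame.
    """
    frame_len = int(frame_len_s * sample_rate)
    n_frames = len(signal) // frame_len           # drop the ragged tail
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)
```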
Feature Calculation
– Calculate a 12-element chroma feature vector v_t for each frame:
  – Apply an FFT to each frame and fold the spectral energy into the 12 pitch classes
  – Constraints:
    – Minimum frequency: 20 Hz (the lower limit of human hearing)
    – Maximum frequency: 2000 Hz (higher frequencies adversely affect the perception of chroma)
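One plausible reading of this step, sketched with NumPy: take the FFT magnitude of a frame, keep only bins between 20 Hz and 2000 Hz, accumulate each bin's magnitude into its pitch class, and normalize. The magnitude weighting and the unit-norm normalization are our assumptions; the paper's exact feature computation was unclear to us (see Conclusion).

```python
import numpy as np

def chroma_vector(frame, sample_rate, f_min=20.0, f_max=2000.0):
    """12-element chroma feature vector for one frame.

    FFT magnitudes between f_min and f_max are accumulated into the
    pitch class of their bin frequency (A440 reference), then the
    vector is scaled to unit norm.
    """
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    v = np.zeros(12)
    in_band = (freqs >= f_min) & (freqs <= f_max)
    # Pitch-class index of each retained bin, relative to A440.
    classes = np.round(12.0 * np.log2(freqs[in_band] / 440.0)).astype(int) % 12
    np.add.at(v, classes, spectrum[in_band])

    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```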
Correlation Calculation
– Calculate the similarity matrix C
  – Each element is the correlation between two chroma feature vectors: C(i, j) = corr(v_i, v_j)
  – High correlation along diagonals of the matrix indicates repetition within the song
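A sketch of the similarity matrix, assuming one chroma vector per row of V: rows are mean-centered and unit-normalized, so each entry reduces to a Pearson correlation between two frames.

```python
import numpy as np

def similarity_matrix(V):
    """Frame-to-frame similarity matrix C.

    V is (n_frames, 12), one chroma vector per row. After centering
    and normalizing the rows, C[i, j] is the Pearson correlation
    between frames i and j. Repeated sections appear as high-valued
    diagonal stripes in C.
    """
    X = V - V.mean(axis=1, keepdims=True)
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    return X @ X.T
```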
Correlation Filtering
– Calculate the filtered time-lag matrix T
  – Exposes similarity between extended segments that are separated by a constant lag
  – Filtering is performed along the diagonals of C, using a symmetric rectangular windowing function (a uniform moving-average filter)
  – T is then “rotated” so that the diagonals are oriented vertically
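A sketch of the filtering step, assuming NumPy: each constant-lag diagonal of C is smoothed with a uniform moving average, and the result is stored with lag on the horizontal axis, which plays the role of the “rotation” above. The window length (20 frames, roughly 8 s at 0.41 s per frame) is our choice, not the paper's.

```python
import numpy as np

def time_lag_matrix(C, win=20):
    """Filtered time-lag matrix T from similarity matrix C.

    T[t, lag] is the similarity between frames t and t - lag,
    averaged over a length-`win` uniform window along the
    constant-lag diagonal. Long repeated segments become
    high-valued vertical columns of T.
    """
    n = C.shape[0]
    T = np.zeros((n, n))
    kernel = np.ones(win) / win
    for lag in range(1, n):
        diag = np.diagonal(C, offset=-lag)        # C[t, t - lag]
        if len(diag) >= win:
            T[lag:, lag] = np.convolve(diag, kernel, mode="same")
    return T
```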
Thumbnail Selection
– Select the maximum value in T
  – The location of this value indicates:
    – The occurrence time of the segment (the y-coordinate)
    – The lag to its repetition (the x-coordinate)
  – Constraints:
    – Minimum lag = 1/10 of the song length
    – Maximum start time = 3/4 of the song length (reduces susceptibility to a “fading repeat”)
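A sketch of the selection step under the constraints above, assuming the T produced by the previous sketch: invalid cells are masked out and the argmax gives the thumbnail's start time and lag. The 25 s clip length matches the segment lengths in the Results slide but is otherwise our assumption, not part of the selection itself.

```python
import numpy as np

def select_thumbnail(T, frame_len_s=0.41, clip_len_s=25.0):
    """Pick the thumbnail start time from the time-lag matrix T.

    The maximum of T locates a strongly repeated segment: its row is
    the occurrence time, its column the lag to the repeat. Cells with
    lag < 1/10 of the song or start > 3/4 of the song are excluded.
    """
    n = T.shape[0]
    masked = T.copy()
    masked[:, : n // 10] = -np.inf        # minimum lag: 1/10 of song
    masked[3 * n // 4 :, :] = -np.inf     # maximum start: 3/4 of song

    t, lag = np.unravel_index(np.argmax(masked), masked.shape)
    start_s = t * frame_len_s
    return start_s, start_s + clip_len_s, lag * frame_len_s
```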
Results (selected segments, [start, end] in seconds)
– Jimmy Buffett, “Math Suks”: system output [64, 89]
– Lifehouse, “You and Me”: system output [38, 63]
– Gavin DeGraw, “I Don’t Want To Be”: system output [95, 120]
– Super Mario Brothers Theme: system output [18, 43]
Conclusion
– Successfully extracted time segments that closely match the chorus of each song
– Feature Calculation issue: the authors’ implementation of this step is unclear
Possible Uses
– Audio domain:
  – Improved search capability (searching for similar songs)
  – Audio fingerprinting
– Other domains:
  – Detection of irregular heartbeats
Suggested Improvements and Alternatives
– Image-based analysis of the waveform
– Tested alternatives:
  – MSE on signal frequencies
  – Chroma-based analysis proved more accurate