1
Audio Thumbnailing of Popular Music Using Chroma-Based Representations
Matt Williamson and Chris Scharf
Implementation based on: Mark A. Bartsch and Gregory H. Wakefield, IEEE Transactions on Multimedia, Vol. 7, No. 1, February 2005
2
Introduction
Multimedia content is growing rapidly
Efficient methods of browsing are necessary
Indexing and retrieval methods are media-dependent
3
Primary goal
Minimize audition time for a given type of media
4
Current methods: Images
–Downsampling
  Produces a smaller version of the image (a thumbnail)
  Reduces the cost of delivery and display
5
Current methods: Speech audio
–Symbolic representation
  Produces a transcript of the audio
6
What about music?
Adapt an existing method:
–Downsampling (time compression)
  Results in highly distorted, unintelligible audio
7
What about music?
Adapt an existing method (cont’d):
–Symbolic representation (score transcription)
  Extremely difficult
  Results in essentially meaningless information
  Does not convey other important elements:
  –Vocal style
  –Instruments used
  –Processing effects used
8
Essential problem
Adapting existing methods cannot reduce the audition time for music while effectively conveying the “gist” of the song
9
Possible Solution
Audio thumbnailing via chroma-based analysis
10
Audio thumbnailing
Produces a short clip of the selection to represent the “gist” of the song
11
Chroma-based analysis
Based on the extraction of chroma features from the audio
Thumbnailing algorithm:
–Frame Segmentation
–Feature Calculation
–Correlation Calculation
–Correlation Filtering
–Thumbnail Selection
12
Chroma Feature Extraction
Extract frequencies from the audio file
Calculate chroma values from the frequencies (see the sketch below)
Categorize chroma values into pitch classes
–12 pitch classes: A, A#/Bb, B, C, …, G#/Ab
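The slide’s chroma formula did not survive transcription; below is a minimal sketch of the standard equal-tempered mapping from frequency to pitch class, assuming A4 = 440 Hz as the reference (the reference pitch and the rounding to the nearest semitone are our assumptions, not details taken from the paper).

    import numpy as np

    PITCH_CLASSES = ["A", "A#/Bb", "B", "C", "C#/Db", "D",
                     "D#/Eb", "E", "F", "F#/Gb", "G", "G#/Ab"]

    def chroma_class(freq_hz, ref_hz=440.0):
        # Distance from A4 in semitones; the nearest integer, taken mod 12, is the pitch class.
        semitones = 12.0 * np.log2(freq_hz / ref_hz)
        return PITCH_CLASSES[int(round(semitones)) % 12]

    # Example: middle C at roughly 261.63 Hz maps to "C".
    print(chroma_class(261.63))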
13
Frame Segmentation
Authors’ implementation:
–Frame length determined via a beat-tracking algorithm
–Range: 0.25 s to 0.56 s
Our implementation:
–Fixed frame length equal to the average of that range: 0.41 s (see the sketch below)
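A minimal sketch of the fixed 0.41 s framing described above, assuming a mono signal stored in a NumPy array and non-overlapping frames (the hop size equal to the frame length is an assumption; the authors instead derive frame boundaries from beat tracking).

    import numpy as np

    def segment_frames(samples, sample_rate, frame_sec=0.41):
        # Split a 1-D mono signal into consecutive, non-overlapping frames of frame_sec seconds.
        frame_len = int(round(frame_sec * sample_rate))
        n_frames = len(samples) // frame_len          # drop any trailing partial frame
        trimmed = samples[:n_frames * frame_len]
        return trimmed.reshape(n_frames, frame_len)   # shape: (n_frames, frame_len)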
14
Feature Calculation
Calculate a 12-element chroma feature vector v_t for each frame:
–Apply an FFT to each frame and fold the spectral energy into the 12 chroma bins (see the sketch below)
–Constraints:
  Minimum frequency: 20 Hz (the lower limit of human hearing)
  Maximum frequency: 2000 Hz (above this, the perception of chroma weakens)
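A minimal sketch of the per-frame feature, assuming FFT magnitudes between 20 Hz and 2000 Hz are simply summed into the 12 chroma bins and the result is unit-normalized; as noted in the Conclusion, the authors’ exact spectral weighting was unclear to us, so this plain summation is an assumption.

    import numpy as np

    def chroma_vector(frame, sample_rate, f_min=20.0, f_max=2000.0, ref_hz=440.0):
        # Fold FFT magnitudes between f_min and f_max into 12 chroma bins (bin 0 = A).
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
        v = np.zeros(12)
        for f, mag in zip(freqs, spectrum):
            if f_min <= f <= f_max:
                v[int(round(12.0 * np.log2(f / ref_hz))) % 12] += mag
        # Unit-normalize so later correlations reduce to inner products.
        return v / (np.linalg.norm(v) + 1e-12)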
15
Correlation Calculation
Calculate the similarity matrix C
–Each element C(i, j) is the correlation between feature vectors v_i and v_j (see the sketch below)
–High correlation along diagonals of the matrix indicates repetition within the song
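A minimal sketch of the similarity matrix, assuming the chroma vectors were unit-normalized in the previous step so that the correlation reduces to an inner product (whether the mean of each vector is removed first is not stated on the slide).

    import numpy as np

    def similarity_matrix(chroma):
        # chroma: (n_frames, 12) array whose rows are unit-normalized chroma vectors.
        # C[i, j] is then the correlation (inner product) between frames i and j.
        return chroma @ chroma.T

With unit-normalized rows, C is symmetric with ones on its main diagonal, and repeated sections show up as high-valued off-diagonal stripes.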
16
Correlation Filtering
Calculate the filtered time-lag matrix T:
–Exposes similarity between extended segments that are separated by a constant lag
–Filtering is performed along the diagonals of C
  Uses a symmetric rectangular window (a uniform moving-average filter)
–T is then “rotated” so that the diagonals are oriented vertically, giving time and lag axes (see the sketch below)
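A minimal sketch of the diagonal filtering, with time on the rows and lag on the columns to match the coordinates used in Thumbnail Selection; the choice of window length (here, the desired thumbnail length in frames) is an assumption.

    import numpy as np

    def time_lag_matrix(C, window_frames):
        # T[t, lag] = average similarity between the window of frames starting at t and
        # the window starting at t + lag (a uniform moving average along each diagonal of C).
        n = C.shape[0]
        T = np.zeros((n, n))
        kernel = np.ones(window_frames) / window_frames
        for lag in range(1, n):
            diag = np.diagonal(C, offset=lag)          # C[t, t + lag] for all valid t
            valid = len(diag) - window_frames + 1      # windows that fit entirely on this diagonal
            if valid <= 0:
                break
            T[:valid, lag] = np.convolve(diag, kernel, mode="valid")
        return T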
17
Thumbnail Selection
Select the maximum value in T (see the sketch below)
–The location of this value indicates:
  The start of the repeated segment (the y-coordinate)
  The lag to its repetition (the x-coordinate)
–Constraints:
  Minimum lag = 1/10 of the song length
  Maximum start time = 3/4 of the song length (to reduce susceptibility to a “fading repeat”)
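A minimal sketch of the final selection using the T layout from the previous sketch (rows = start frame, columns = lag); the thumbnail length parameter and the conversion back to seconds are our assumptions.

    import numpy as np

    def select_thumbnail(T, thumb_frames, frame_sec=0.41):
        # Pick the (start, lag) cell with the largest filtered similarity, subject to
        # the constraints above: lag >= 1/10 of the song, start <= 3/4 of the song.
        n_frames = T.shape[0]
        masked = T.copy()
        masked[:, :max(n_frames // 10, 1)] = -np.inf   # forbid short lags
        masked[(3 * n_frames) // 4:, :] = -np.inf      # forbid late starts (guards against a fading repeat)
        start, lag = np.unravel_index(np.argmax(masked), masked.shape)
        start_sec = start * frame_sec
        return start_sec, start_sec + thumb_frames * frame_sec   # thumbnail interval in seconds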
18
Results
Jimmy Buffett – “Math Sucks”
–System: [64, 89]
Lifehouse – “You and Me”
–System: [38, 63]
Gavin DeGraw – “I Don’t Want To Be”
–System: [95, 120]
Super Mario Brothers Theme
–System: [18, 43]
19
Conclusion
Successfully extracted time segments that closely match the chorus of each song
Feature Calculation issue:
–The authors’ implementation is unclear
20
Possible Uses
Audio domain:
–Improved search capability (searching for similar songs)
–Audio fingerprinting
Other domains:
–Detection of irregular heartbeats
21
Suggested Improvements and Alternatives
Image-based analysis of the waveform
Tested alternatives:
–MSE on signal frequencies
  Chroma-based analysis proved more accurate