Content-based retrieval of audio
Francois Thibault
MUMT 614B, McGill University
Overview
Need effective ways to browse audio databases of growing size by content
Use descriptive sound parameters or query-by-example systems
Determine similarity to the query in order to rank search results by relevance (AudioGoogle)
Feature selection is the sinews of war…
Cheng Yang Approach (1)
Audio files preprocessed to identify local peaks in signal power (n = /min)
Spectrogram computed using STFT of 2048 samples with Hamming window of 1024 samples and overlap factor of 2
Spectral vector extracted around each peak makes up (n, 180, k<<2048) feature space ( Hz range only)
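The preprocessing steps above can be sketched with NumPy. This is only an illustration, not Yang's code: the function names are hypothetical, "overlap factor of 2" is read here as a hop of half the window (512 samples), and the power peaks are taken per non-overlapping frame, which the slide does not specify.

```python
import numpy as np

def spectrogram(signal, win_len=1024, n_fft=2048, overlap=2):
    """Magnitude spectrogram: Hamming-windowed frames, zero-padded to n_fft."""
    hop = win_len // overlap  # assumed reading of "overlap factor of 2"
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # rfft zero-pads each frame to n_fft and keeps the n_fft // 2 + 1 non-negative bins
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

def power_peaks(signal, frame_len=1024):
    """Indices of frames whose mean power is a strict local maximum."""
    n = len(signal) // frame_len
    power = (signal[: n * frame_len].reshape(n, frame_len) ** 2).mean(axis=1)
    return [i for i in range(1, n - 1) if power[i - 1] < power[i] > power[i + 1]]
```

A spectral vector for each peak would then be a slice of the spectrogram rows near that peak, restricted to the frequency bins of interest.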
Yang Approach (2)
Given an example query, compute the feature vector for the query and look for similar audio in the database
Compute minimum distance between query and database feature sets, saving time with dynamic programming techniques (reuse results from previous pairs)
Linearity filtering to favor time-scaled versions compared to error orientation disagreement
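The slide does not give the details of Yang's matching step, so the sketch below swaps in classic dynamic time warping as a stand-in. It shows the general idea of reusing previously solved sub-alignments to find a minimum distance between two sequences of feature vectors. The function name is hypothetical.

```python
import math

def dtw_distance(query, reference):
    """Minimum cumulative distance aligning two sequences of feature vectors."""
    n, m = len(query), len(reference)
    inf = float("inf")
    # cost[i][j] = best cost of aligning query[:i] with reference[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(query[i - 1], reference[j - 1])
            # reuse the three previously solved sub-alignments
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because one query frame may align with several reference frames, a time-stretched copy of the query still scores a distance of zero, in line with the tolerance to tempo changes this approach aims for.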
Yang’s Results
Use database of 120 song excerpts (~1 min)
Good performance with varying tempos, audio quality, performance variations
Poor performance with transposed versions
Slow response, improved with indexing schemes
Jonathan Foote Approach
Calculate feature vectors of audio examples of desired classes (12 MFCCs plus energy)
Supervised training of a quantization tree (partitions feature space into regions with maximally different class populations)
Parameterized data is quantized using the tree for subsequent retrieval (creates template)
To retrieve similar audio content, a template is constructed for the query audio and compared with corpus templates using a cosine distance measure
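The comparison step can be sketched as follows. The sketch assumes that a template is a histogram counting how many frames fall into each leaf of the tree, and that the leaf indices come from an already trained tree, which is not shown. Both function names are hypothetical.

```python
import math
from collections import Counter

def template(leaf_ids, n_leaves):
    """Histogram of how many frames fell into each tree leaf (assumed template form)."""
    counts = Counter(leaf_ids)
    return [counts.get(leaf, 0) for leaf in range(n_leaves)]

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal templates."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm  # assumes neither template is all zeros
```

Ranking the corpus then amounts to sorting its templates by `cosine_distance` to the query template. The measure ignores template length, so a long and a short recording with the same leaf proportions compare as identical.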
Foote’s Results
Good way of measuring subjective qualities of sound, without using targeted features
Not as accurate as other techniques using psycho-acoustic knowledge in finding similar timbres (e.g. instruments)
Sensitive to pitch (will often return different timbres of same pitch)
Erling Wold et al. Approach (1)
Implemented several approaches in Muscle Fish software
More particularly, specify explicit perceptual features (loudness, pitch, brightness, bandwidth, harmonicity)
Statistics of corresponding acoustic correlates calculated for entire sample (mean, variance, autocorrelation) form a-vector
For training set, mean vector and covariance matrix are computed from the examples and become the system's model
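A minimal sketch of how such an a-vector might be assembled from frame-level feature trajectories. The slide does not say which autocorrelation lag is used, so a lag of one frame is assumed here. The function names and the dictionary layout are hypothetical.

```python
def trajectory_stats(values, lag=1):
    """Mean, variance and normalized lag-`lag` autocorrelation of one feature over time."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return mean, var, 0.0  # constant trajectory: autocorrelation left at 0
    acov = sum((values[i] - mean) * (values[i + lag] - mean)
               for i in range(n - lag)) / n
    return mean, var, acov / var

def a_vector(features):
    """Concatenate the statistics of each named feature trajectory, in sorted name order."""
    vec = []
    for name in sorted(features):
        vec.extend(trajectory_stats(features[name]))
    return vec
```

For example, `a_vector({"loudness": [...], "pitch": [...]})` yields six numbers: three for loudness, then three for pitch.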
Wold Approach (2)
Use a weighted Euclidean distance for classification and similarity measurements
Distance compared to threshold to decide if objects belong to the same class (optional)
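One way to read "weighted Euclidean distance" is sketched below, assuming each dimension is weighted by the inverse of its variance over the training examples. This is a diagonal simplification; using the full covariance matrix would mean inverting it instead. All names and the `eps` guard are hypothetical.

```python
import math

def fit_model(examples):
    """Per-dimension mean and variance of the training a-vectors."""
    n, dims = len(examples), len(examples[0])
    mean = [sum(e[d] for e in examples) / n for d in range(dims)]
    var = [sum((e[d] - mean[d]) ** 2 for e in examples) / n for d in range(dims)]
    return mean, var

def weighted_distance(vec, mean, var, eps=1e-12):
    """Euclidean distance to the class mean, each dimension weighted by 1 / variance."""
    return math.sqrt(sum((x - m) ** 2 / (v + eps)
                         for x, m, v in zip(vec, mean, var)))

def in_class(vec, mean, var, threshold):
    """Optional thresholding step: accept the sound if it is close enough to the model."""
    return weighted_distance(vec, mean, var) <= threshold
```

The weighting keeps features with a large numeric range from dominating the distance, and the same function can rank sounds by similarity when no threshold is applied.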
Wold Approach (3)
Segmentation is required beforehand, achieved using same features, detecting strong discrepancies
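A hedged sketch of what "detecting strong discrepancies" could look like: mark a boundary wherever the feature vector changes by more than a threshold from one frame to the next. The distance measure, the threshold and the function name are assumptions, not taken from the slide.

```python
import math

def segment_boundaries(frames, threshold):
    """Frame indices where the feature vector jumps by more than `threshold`."""
    return [i for i in range(1, len(frames))
            if math.dist(frames[i - 1], frames[i]) > threshold]
```

Each returned index starts a new segment, and each segment can then be summarized and classified on its own.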
Wold and Foote comparison
What I retain: Wold has proven that it is possible to use statistical methods for flexible classification