/25 Singer Similarity A Brief Literature Review Catherine Lai MUMT-611 MIR March 24,
Outline of Presentation
Introduction
–Motivation
–Related research
Recent publications
–Kim & Whitman, 2002
–Liu & Huang, 2002
–Tsai, Wang, Rodgers, Cheng & Yu, 2003
–Bartsch & Wakefield, 2004
Discussion
Conclusion

Introduction
Motivation
–Multitude of audio files circulating on the Internet
–Replace human documentation efforts and organize collections of music recordings automatically
–Singer identification is relatively easy for humans but not for machines
Related research
–Speaker identification
–Musical instrument identification

Kim & Whitman, “Singer Identification in Popular Music Recordings Using Voice Coding Features” (MIT Media Lab)
Automatically establish the I.D. of a singer using acoustic features extracted from songs in a DB of pop music
–Segment the vocal regions prior to singer I.D.
–Classifier uses features drawn from voice coding based on Linear Predictive Coding (LPC)
–LPC is good at highlighting formant locations
–Formants are regions of resonance that are perceptually significant

Kim & Whitman, Detection of Vocal Region
Detect regions of singing by detecting energy within the frequency range of vocal energy
–Filter the audio signal with a band-pass filter
–Used a Chebyshev IIR digital filter of order 12
–Attenuates instruments that fall outside the vocal range, e.g. bass and cymbals
The voice is not the only instrument remaining in the region
–Discriminate other sounds, e.g. drums, using a measure of harmonicity
–A vocal segment is > 90% voiced and is highly harmonic
–Measure the harmonicity of the filtered signal within an analysis frame and threshold it against a fixed value

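The harmonicity thresholding step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not say how harmonicity is measured, so the peak of the normalized autocorrelation is assumed here, and the names `harmonicity`, `is_vocal`, the lag range, and the 0.5 threshold are all hypothetical. The band-pass filtering stage is not shown.

```python
import math
import random

def harmonicity(frame, min_lag, max_lag):
    """Peak of the normalized autocorrelation over a lag range.

    Assumption: this stands in for the (unspecified) harmonicity measure.
    Values near 1 indicate a strongly periodic frame.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0  # silent frame: treat as non-harmonic
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, r / energy)
    return best

def is_vocal(frame, min_lag, max_lag, threshold=0.5):
    """Threshold the harmonicity against a fixed value (threshold is hypothetical)."""
    return harmonicity(frame, min_lag, max_lag) >= threshold

# Toy frames: a periodic signal (period 20 samples) and uniform noise.
sine = [math.sin(2 * math.pi * i / 20) for i in range(400)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]

sine_is_vocal = is_vocal(sine, 10, 40)    # periodic frame passes the threshold
noise_is_vocal = is_vocal(noise, 10, 40)  # noise frame does not
```

In practice the lag range would be chosen to cover plausible pitch periods of the singing voice at the signal's sample rate.
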
Kim & Whitman, Feature Extraction
12-pole LP analysis, based on the general principle behind LPC for speech, used for feature extraction
LP analysis performed on linear and warped scales
–Linear scale treats all frequencies equally
–Human ears are not equally sensitive to all frequencies
–Warping function closely approximates the Bark scale, which approximates the frequency sensitivity of human hearing
–Warped scale is better at capturing formant locations at lower frequencies

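As an illustration of the linear-scale LP analysis, the sketch below estimates predictor coefficients with the autocorrelation method and the Levinson-Durbin recursion. This is a textbook formulation assumed for illustration; the slides do not state which estimation method the authors used, and the warped-scale variant is not shown. The function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc(frame, order=12):
    """Levinson-Durbin recursion.

    Returns (coeffs, err) where coeffs[k-1] = a_k in the predictor
    x[n] ~ sum_k a_k * x[n-k], and err is the final prediction error energy.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predicted frame: stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Toy check: the impulse response of a one-pole filter with pole 0.9
# should give a_1 close to 0.9 and a_2 close to 0.
signal = [0.9 ** n for n in range(200)]
coeffs, residual = lpc(signal, order=2)
```

For the 12-pole analysis mentioned on the slide, `lpc(frame, order=12)` would be applied to each windowed frame.
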
Kim & Whitman, Experiments
Data set includes 17 different singers and > 200 songs
2 classifiers, Gaussian Mixture Model (GMM) and SVM, used on 3 different feature sets
–Linear scaled, warped scaled, and both linear and warped data
Run on entire song data and on segments classified as vocal only

Kim & Whitman, Results
Linear frequency features tend to outperform warped frequency features when each is used alone; the combination performs best
Song and frame accuracy increase when using only vocal segments with the GMM
Song and frame accuracy decrease when using only vocal segments with the SVM
(Kim & Whitman, 2002)

Kim & Whitman, Discussion and Future Work
Better performance of linear frequency scale features vs. warped frequency scale features indicates
–The machine finds the increased accuracy of the linear scale at higher frequencies useful
–Contrary to the human auditory system
The decrease in SVM performance is puzzling
–Suggests the SVM is finding aspects of the features not specifically related to voice
Add high-level musical knowledge to the system
–Attempt to I.D. song structure, such as locating verses or choruses
–Higher probability of vocals in these sections

Liu & Huang, “A Singer Identification Technique for Content-Based Classification of MP3 Music Objects”
Automatically classify MP3 music objects according to their singers
Major steps:
–Coefficients extracted from the compressed raw data are used to compute the MP3 features for segmentation
–These features are used to segment MP3 objects into a sequence of notes or phonemes
–For each MP3 phoneme in the training set, its MP3 features are extracted and stored with its associated singer in a phoneme DB
–Phonemes in the phoneme DB are used as discriminators in an MP3 classifier to I.D. the singers of unknown MP3 objects
[Figure: waveform of 2 phonemes (Liu & Huang, 2002)]

Liu & Huang, Classification
The number of different phonemes a singer can sing is limited, and singers with different timbres possess unique phoneme sets
Phonemes of an unknown MP3 song can be associated with similar phonemes of the same singer in the phoneme DB
kNN classifier used for classification
–Each unknown MP3 song is first segmented into phonemes
–The first N phonemes are compared with every discriminator in the phoneme DB
–The k closest neighbors are found
For each of the k closest neighbors,
–If its distance is within a threshold, a weighted vote is given
–The k*N weighted votes are accumulated according to singer
–The unknown MP3 song is assigned to the singer with the largest score

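The voting procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the slide does not give the distance measure or the vote weighting, so Euclidean distance and a weight of 1 - distance/threshold are assumed, and `identify_singer` and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Assumed distance between two phoneme feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify_singer(query_phonemes, phoneme_db, k, threshold):
    """Weighted kNN voting over the first N phonemes of an unknown song.

    query_phonemes: list of N feature vectors
    phoneme_db: list of (feature_vector, singer) discriminators
    Returns the singer with the largest score, or None if no vote was cast.
    """
    scores = defaultdict(float)
    for q in query_phonemes:
        ranked = sorted(((euclidean(q, f), s) for f, s in phoneme_db),
                        key=lambda pair: pair[0])
        for dist, singer in ranked[:k]:
            if dist <= threshold:
                # Hypothetical weighting: closer neighbours count more.
                scores[singer] += 1.0 - dist / threshold
    if not scores:
        return None
    return max(scores, key=scores.get)

# Toy phoneme DB with two singers and 2-dimensional features.
db = [([0.0, 0.0], "A"), ([0.1, 0.0], "A"),
      ([1.0, 1.0], "B"), ([1.1, 1.0], "B")]
winner = identify_singer([[0.05, 0.0], [0.02, 0.01]], db, k=2, threshold=0.5)
```

Returning the top few singers by score instead of only the best one would correspond to allowing more than one singer per music class.
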
Liu & Huang, Experiments
Data set consists of 10 male and 10 female Chinese singers, each with 30 songs
3 factors dominate the results of the MP3 music classification method
–Setting of k in the kNN classifier (best k = 80, resulting in a 90% precision rate)
–Threshold for the vote decision used by the discriminator (best threshold = 0.2)
–Number of singers allowed in a music class (the larger the number, the higher the precision)
Allow > 1 singer in a music class
–Grouping several singers with similar voices provides the ability to find songs by singers with similar voices

Liu & Huang, Results and Future Work
Results within expectation
–Songs sung by a singer with a very distinctive style resulted in the highest precision rate (> 90%)
–Songs sung by a singer with a common voice resulted in a precision rate of only 50%
Future work: use more music features
–Pitch, melody, rhythm, and harmonicity for music classification
–Try to represent MP3 features according to the syntax and semantics of the MPEG-7 standard
(Liu & Huang, 2002)

Tsai et al., “Blind Clustering of Popular Music Recordings Based on Singer Voice Characteristics” (ISMIR)
Technique for automatically clustering undocumented music recordings by their associated singers, given no information about the singers or the singer population
Clustering is based on the singer’s voice rather than background music, genre, or other attributes
3-stage process proposed:
–Segmentation of each recording into vocal/non-vocal segments
–Suppression of the background characteristics in the vocal segments
–Clustering of the recordings based on similarity of singer characteristics

Tsai et al., Classification
Classifier for vocal/non-vocal segmentation
–Front-end signal processor converts the digital waveform into spectrum-based feature vectors
–Back-end statistical processor performs modeling, matching, and decision making

Tsai et al., Classification
Classifier operates in 2 phases: training and testing
–During the training phase, a music DB with manual vocal/non-vocal transcriptions is used to form two separate GMMs: a vocal GMM and a non-vocal GMM
–In the testing phase, the recognizer takes as input feature vectors extracted from an unknown recording and produces as output the frame log-likelihoods for the vocal GMM and the non-vocal GMM

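The testing phase can be illustrated with a small sketch that scores each frame against two already-trained GMMs. Diagonal covariances, the data layout, and all names are assumptions made for illustration; training (e.g. by EM) is not shown, and the toy models below are invented.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed form)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of one frame under a GMM.

    gmm: list of (weight, mean, var) components with positive weights.
    Uses the log-sum-exp trick for numerical stability.
    """
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))

def frame_log_likelihoods(frames, vocal_gmm, nonvocal_gmm):
    """Per-frame (vocal, non-vocal) log-likelihood pairs."""
    return [(gmm_log_likelihood(f, vocal_gmm),
             gmm_log_likelihood(f, nonvocal_gmm)) for f in frames]

# Toy 1-dimensional models, invented for the example.
vocal_gmm = [(0.5, [0.0], [1.0]), (0.5, [1.0], [1.0])]
nonvocal_gmm = [(1.0, [5.0], [1.0])]
pairs = frame_log_likelihoods([[0.2], [4.8]], vocal_gmm, nonvocal_gmm)
```

Each pair can then be passed to a decision rule that labels the frame or segment as vocal or non-vocal.
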
Tsai et al., Classification
[Figure: block diagram (Tsai, 2003)]

Tsai et al., Decision Rules
Decision for each frame made according to one of three decision rules:
1. Frame-based
2. Fixed-length-segment-based
3. Homogeneous-segment-based
The segment-based rules assign a single classification per segment
(Tsai, 2003)

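The first two rules can be sketched as follows, given per-frame (vocal, non-vocal) log-likelihood pairs. This is an illustrative reading of the slide rather than the authors' code: summing log-likelihoods over a segment is an assumption, the function names are hypothetical, and the homogeneous-segment rule is omitted because it needs a separate step to find segment boundaries.

```python
def frame_based(loglik_pairs):
    """Label each frame independently by comparing its two log-likelihoods."""
    return ["vocal" if v > n else "non-vocal" for v, n in loglik_pairs]

def fixed_length_segment_based(loglik_pairs, segment_len):
    """Give every frame in a fixed-length segment the same label.

    Assumption: the segment score is the sum of its frame log-likelihoods.
    """
    labels = []
    for start in range(0, len(loglik_pairs), segment_len):
        segment = loglik_pairs[start:start + segment_len]
        total_vocal = sum(v for v, _ in segment)
        total_nonvocal = sum(n for _, n in segment)
        label = "vocal" if total_vocal > total_nonvocal else "non-vocal"
        labels.extend([label] * len(segment))
    return labels

# Toy log-likelihood pairs, invented for the example.
pairs = [(-1.0, -2.0), (-3.0, -2.5), (-1.0, -2.0),
         (-5.0, -1.0), (-4.0, -1.0), (-1.0, -1.5)]
per_frame = frame_based(pairs)
per_segment = fixed_length_segment_based(pairs, segment_len=3)
```

The segment-based rule smooths out isolated frame errors, at the cost of misplacing boundaries that fall inside a segment.
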
Tsai et al., Singer Characteristic Modeling
Characteristics of the voice need to be modeled to cluster recordings
–V = {v1, v2, v3, …}, the feature vectors from a vocal region, is a mixture of solo feature vectors S = {s1, s2, s3, …} and background accompaniment feature vectors B = {b1, b2, b3, …}
–S and B are unobservable
–B can be approximated from the non-vocal segments
–S is subsequently estimated given V and B
A solo model and a background music model are generated for each recording to be clustered

Tsai et al., Clustering
Each recording is evaluated against each singer’s solo model
–Log-likelihood of the vocal portion of one recording tested against one solo model is computed (for all solo models)
K-means algorithm used for clustering
–Starts with a single cluster and recursively splits clusters
–Bayesian Information Criterion employed to decide the best value of k

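For reference, one common form of the Bayesian Information Criterion is shown below. The slide does not give the exact likelihood or penalty weighting used by Tsai et al., so the symbols here are assumptions: X is the data being clustered, \Lambda_k the model with k clusters, m_k its number of free parameters, and N the number of observations.

```latex
\mathrm{BIC}(k) = \log p(X \mid \Lambda_k) - \frac{1}{2}\, m_k \log N,
\qquad
k^{*} = \arg\max_{k}\, \mathrm{BIC}(k)
```

The first term rewards fit and the second penalizes model size, so splitting clusters stops paying off once the gain in log-likelihood no longer outweighs the added parameters.
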
Tsai et al., Experiments
Data set consists of 416 tracks from Mandarin pop music CDs
Experiments run to validate the vocal/non-vocal segmentation method
–Best accuracy achieved was 78% using the homogeneous-segment-based method

Tsai et al., Results
System evaluated on the basis of average cluster purity
When k = singer population, the highest purity =
(Tsai, 2003)

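The slide does not define average cluster purity. The sketch below assumes a definition common in speaker clustering: the purity of a cluster is the sum of squared singer proportions within it, and the average is weighted by cluster size. The function name and toy labels are hypothetical.

```python
from collections import Counter

def average_cluster_purity(cluster_ids, singer_ids):
    """Size-weighted average of per-cluster purity (assumed definition).

    cluster_ids[i] is the cluster assigned to recording i,
    singer_ids[i] is its true singer.
    """
    if not cluster_ids or len(cluster_ids) != len(singer_ids):
        raise ValueError("need two non-empty lists of equal length")
    clusters = {}
    for cluster, singer in zip(cluster_ids, singer_ids):
        clusters.setdefault(cluster, Counter())[singer] += 1
    weighted = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        purity = sum((n / size) ** 2 for n in counts.values())
        weighted += size * purity
    return weighted / len(cluster_ids)

# Toy labels: a perfect clustering and a single mixed cluster.
perfect = average_cluster_purity([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = average_cluster_purity([0, 0, 0, 0], ["a", "a", "b", "b"])
```

Under this definition a clustering in which every cluster contains one singer scores 1, and the score falls as clusters mix singers.
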
Tsai et al., Future Work
Test the method on a wider variety of data
–Larger singer population
–Richer songs from different genres

Discussion and Conclusion
Singer similarity techniques can be used to
–Automatically organize a collection of music recordings based on lead singer
–Label guest performers, information usually omitted in music databases
–Replace human documentation efforts
Extend to handle duets, choruses, background vocals, and other musical data with multiple simultaneous or non-simultaneous singers
–In rock band songs, parts sung by the guitarist, drummer, or other band members can be identified

Bibliography
Bartsch, M. and G. Wakefield (2004). Singing voice identification using spectral envelope estimation. IEEE Transactions on Speech and Audio Processing, vol. 12, no. 2,
Kim, Y. and B. Whitman (2002). Singer identification in popular music recordings using voice coding features. In Proceedings of the 2002 International Symposium on Music Information Retrieval.
Liu, C. and C. Huang (2002). A singer identification technique for content-based classification of MP3 music objects. In Proceedings of the 2002 Conference on Information and Knowledge Management (CIKM),
Tsai, W., H. Wang, D. Rodgers, S. Cheng, and H. Yu (2003). Blind clustering of popular music recordings based on singer voice characteristics. In Proceedings of the 2003 International Symposium on Music Information Retrieval.