Singer Similarity
Doug Van Nort, MUMT 611
Goal
- Determine the singer/vocalist from features extracted from the audio signal
- Classify audio files by singer
- Storage and retrieval
Introduction
- Identifying a singer is a fairly easy task for humans, regardless of musical context
- Not so easy to find parameters for automatic identification
- Growth of file sharing and music databases leads to increased demand
Introduction
- Much work has been done in speech recognition, but it performs poorly for singer ID
- Speech systems are trained on speech data with no background noise
- The vocal problem has some fundamental differences:
  - Vocals exist amid a variety of background noise
  - Mixed voiced/unvoiced content
- Singer recognition is a similar problem to solo instrument identification
The Players
- Kim and Whitman (2002)
- Liu and Huang (2002)
Kim and Whitman
- From the MIT Media Lab
- Singer identification which:
  - Assumes strong harmonicity from vocals
  - Assumes pop music: instrumentation/levels within a critical frequency range
Two-step process
- Untrained algorithm for automatic segmentation
- Classification with training, based on the vocal segments
Detection of Vocal Regions
- Filter out frequencies outside the vocal range of 200-2,000 Hz
  - Chebyshev IIR digital filter
- Detect harmonicity
(Figure: frequency response of the band-pass filter)
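A minimal sketch of such a band-pass stage in Python with SciPy; the slides specify only the Chebyshev type and the 200-2,000 Hz band, so the filter order, ripple, and sampling rate below are assumptions:

```python
import numpy as np
from scipy import signal

FS = 16_000  # assumed sampling rate (not given in the slides)

# Band-pass Chebyshev Type I IIR filter over the vocal range.
# Order (4) and passband ripple (1 dB) are assumptions.
sos = signal.cheby1(4, rp=1, Wn=[200, 2000], btype='bandpass',
                    fs=FS, output='sos')

def bandpass_vocal(x: np.ndarray) -> np.ndarray:
    """Keep only the 200-2,000 Hz band where vocal energy concentrates."""
    return signal.sosfilt(sos, x)
```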
Filtering alone is not enough
- Bass and cymbals are gone, but other instruments fall within the range
- Need to extract features within the vocal range to find the voice
Harmonic detection
- Band-limited output sent through a bank of inverse comb filters
- Delay is varied across the bank
- The most attenuated output corresponds to the strongest harmonic content
- Harmonicity measure: ratio of signal energy to the maximally attenuated signal's energy
- Allows a threshold to be established for vocal/non-vocal decisions
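A sketch of this measure, assuming a feed-forward inverse comb y[n] = x[n] - x[n - D] and a sweep of candidate delays; the sweep range (f_lo/f_hi) is an assumption:

```python
import numpy as np

def harmonicity(x: np.ndarray, fs: float,
                f_lo: float = 80.0, f_hi: float = 1000.0) -> float:
    """Ratio of input energy to the most-attenuated inverse-comb output.

    The inverse comb y[n] = x[n] - x[n - D] places nulls at multiples of
    fs/D, so a strongly harmonic signal is heavily attenuated when D
    matches its fundamental period.
    """
    e_in = np.sum(x ** 2)
    e_min = np.inf
    for delay in range(int(fs / f_hi), int(fs / f_lo) + 1):
        y = x[delay:] - x[:-delay]        # inverse comb at delay D
        e_min = min(e_min, np.sum(y ** 2))
    return e_in / (e_min + 1e-12)         # large value => strongly harmonic
```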
Singer Identification
- Linear Predictive Coding (LPC) used to extract the location and magnitude of formants
- One of two classifiers used to identify the singer from the formant information
Feature Extraction
- A 12-pole linear predictor finds the formants via the autocorrelation method
- Standard LPC treats frequencies linearly, but human sensitivity is closer to logarithmic
- A warping function maps frequencies to an approximation of the Bark scale
- Warping is further beneficial in finding the fundamental
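A sketch of the plain (linear-frequency) 12-pole autocorrelation LPC step; the warped variant would replace each unit delay with a first-order all-pass section to approximate the Bark scale, which is omitted here. The pole-magnitude cutoff for picking formant candidates is an assumption:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame: np.ndarray, fs: float, order: int = 12) -> np.ndarray:
    """Formant frequency estimates from 12-pole LPC (autocorrelation method)."""
    x = frame * np.hanning(len(frame))
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    # Solve the Toeplitz normal equations R a = r for the predictor coefficients.
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    # Poles of the all-pole model 1 / (1 - sum_k a_k z^-k).
    poles = np.roots(np.concatenate(([1.0], -a)))
    freqs = np.angle(poles) * fs / (2 * np.pi)
    # Keep positive-frequency poles near the unit circle (0.7 is an assumption).
    return np.sort(freqs[(freqs > 0) & (np.abs(poles) > 0.7)])
```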
Classification Techniques
- Two established pattern recognition algorithms used:
  - Gaussian Mixture Model (GMM)
  - Support Vector Machine (SVM)
GMM
- Uses multiple weighted Gaussians to capture the behavior of each class
- Each feature vector is assumed to arise from a mixture of Gaussian distributions
- Gaussian parameters (mean and variance) found via Expectation Maximization (EM)
- Prior to EM, Principal Component Analysis (PCA) is applied to the data
  - Normalizes variances, avoiding the highly irregular scalings EM can produce
SVM
- Computes the optimal hyperplane that linearly separates two classes of data
- Does not depend on probability estimation
- Determined by a small number of data points (the support vectors)
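A sketch of both classifiers with scikit-learn, under assumptions the slides leave open (per-singer GMMs scored by log-likelihood, the mixture size, and the SVM kernel):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train(X: np.ndarray, y: np.ndarray, n_mix: int = 8):
    """X: (n_frames, n_features) LPC feature vectors; y: singer label per frame."""
    pca = PCA(whiten=True).fit(X)       # normalize variances before EM
    Xp = pca.transform(X)
    # One GMM per singer; mixture size is an assumption.
    gmms = {s: GaussianMixture(n_mix).fit(Xp[y == s]) for s in np.unique(y)}
    svm = SVC().fit(Xp, y)              # RBF kernel by default (an assumption)
    return pca, gmms, svm

def gmm_predict(pca, gmms, X: np.ndarray):
    """Pick the singer whose GMM gives the highest mean log-likelihood."""
    Xp = pca.transform(X)
    return max(gmms, key=lambda s: gmms[s].score(Xp))
```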
Experiments & Results
- Testbed of 200 songs by 17 different artists/vocalists
- Tracks downsampled; the vocal range is still well below Nyquist
- Half of the database used for training, half for testing
- Two experiments:
  - LPC features taken from the entire song
  - LPC features taken only from the vocal segments
- 1024-sample analysis frames with a hop size of 2
- LP analysis used both linear and warped frequency scales
Results
- Results better than chance (1/17, about 6%) but fall short of expected human performance
- The linear frequency scale alone outperforms the warped scale
- Oddly, using only the vocal segments decreases performance for the SVM
Liu and Huang
- Based on an MP3 database
- Particularly high demand for such an approach, given the widespread use of MPEG-1 Layer 3
- Algorithm works directly on data from the MP3 decoding process
Process
- Coefficients of the polyphase filter taken from the MP3 decoding process
- File segmented into phonemes based on these coefficients
- Feature vector constructed for each phoneme and stored with the artist name in a database
- Classifier trained on the database, then used to identify unknown MP3 files
(Figure: flowchart of the Liu/Huang singer similarity system)
Phoneme Segmentation
- MP3 decoding provides the polyphase filter coefficients
- Energy intensity of each subband: the sum of squares of its subband coefficients
- Frame energy is calculated from these polyphase coefficients
- An energy gap exists between two phonemes; segmentation aims to identify this gap automatically (see the sketch below)
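A sketch of the energy computation and gap-based segmentation; the subband array shape and the relative-energy threshold are assumptions:

```python
import numpy as np

def frame_energies(subbands: np.ndarray) -> np.ndarray:
    """subbands: (n_frames, 32, samples_per_band) polyphase outputs.
    Subband intensity = sum of squared coefficients; frame energy sums them."""
    return np.sum(subbands ** 2, axis=(1, 2))

def segment_phonemes(energy: np.ndarray, rel_thresh: float = 0.1):
    """Split at low-energy gaps between phonemes (threshold is an assumption)."""
    quiet = energy < rel_thresh * np.median(energy)
    bounds, start = [], None
    for i, q in enumerate(quiet):
        if not q and start is None:
            start = i                     # phoneme begins
        elif q and start is not None:
            bounds.append((start, i))     # phoneme ends at a gap
            start = None
    if start is not None:
        bounds.append((start, len(quiet)))
    return bounds  # (start_frame, end_frame) pairs, one per phoneme
```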
(Figures: waveform and frame energy of two consecutive phonemes)
Phoneme Feature Extraction
- Phoneme features computed directly from the MDCT coefficients
- 576-dimensional feature vector for each frame
- A phoneme's feature vector aggregates its n frames
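A sketch of collapsing a phoneme's frames into one feature vector; averaging the MDCT magnitudes is an assumption about the aggregation, which the slides do not specify:

```python
import numpy as np

def phoneme_feature(mdct_frames: np.ndarray) -> np.ndarray:
    """mdct_frames: (n, 576) MDCT coefficients for one phoneme's n frames.
    Returns a single 576-dim vector (mean magnitude is an assumption)."""
    return np.mean(np.abs(mdct_frames), axis=0)
```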
Classification: setup
- Create a database of phoneme feature vectors; this becomes the training set
- Discriminating radius: a measure of uniqueness, the minimum Euclidean distance to dissimilar vectors
(Figure: good vs. bad discriminators)
- The number of similar phonemes within the discriminating radius is also considered:
  - w = number of phonemes within the radius
  - f = frequency of the phoneme
- The discriminating ability of each phoneme thus depends on both frequency and distance
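A sketch of these two quantities, taking the slides' definitions literally (minimum distance to any other singer's phoneme, and the count of same-singer phonemes inside that radius):

```python
import numpy as np

def discriminating_radius(vecs: np.ndarray, singers: np.ndarray, i: int) -> float:
    """Min Euclidean distance from phoneme i to any dissimilar (other-singer) vector."""
    other = singers != singers[i]
    return np.linalg.norm(vecs[other] - vecs[i], axis=1).min()

def weight_w(vecs: np.ndarray, singers: np.ndarray, i: int, radius: float) -> int:
    """w: number of similar (same-singer) phonemes within the radius."""
    same = (singers == singers[i])
    same[i] = False  # exclude the phoneme itself
    return int(np.sum(np.linalg.norm(vecs[same] - vecs[i], axis=1) <= radius))
```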
Classification: in action
- Unknown MP3 segmented into phonemes; only the first N are used, for efficiency
- kNN used as the classifier
- K neighbors compared for each of the N phonemes, weighted by the discriminating function
- The K*N weighted "votes" are grouped by singer; the winner is the singer with the largest score
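A sketch of the weighted kNN vote; exactly how the discriminating weight enters each vote is an assumption (used directly as the vote strength here):

```python
import numpy as np

def classify(query: np.ndarray, db_vecs: np.ndarray, db_singers: np.ndarray,
             db_weights: np.ndarray, k: int = 5, n_phonemes: int = 20):
    """kNN over the first N query phonemes; each neighbor votes for its
    singer with its discriminating weight. Largest total score wins."""
    votes = {}
    for q in query[:n_phonemes]:
        dists = np.linalg.norm(db_vecs - q, axis=1)
        for j in np.argsort(dists)[:k]:      # the k nearest database phonemes
            s = db_singers[j]
            votes[s] = votes.get(s, 0.0) + db_weights[j]
    return max(votes, key=votes.get)
```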
Experiments/Results
- 10 male and 10 female singers, 30 songs apiece:
  - 10 songs for the phoneme database
  - 10 for training (discriminator weights)
  - 10 for the test set
Free parameters
- User-defined parameters:
  - k value
  - Discrimination threshold
  - Number of singers in a class
(Figures: results when varying the threshold, varying k, and varying the number of singers; overall results for all singers)
Conclusion
- Not much work yet strictly on singer identification
- Difficult because of temporal and background variances
- Quite useful, as many people identify artists with their singer
- Initial results promising, but short of human performance
- See also: Minnowmatch [Whitman, Flake, Lawrence]