Franz de Leon, Kirk Martinez
Web and Internet Science Group, School of Electronics and Computer Science, University of Southampton
{fadl1d09,

Audio features that approximate timbre, rhythm, and tempo are used for genre classification and music similarity estimation. In our system, audio data are modelled as a long-term accumulative distribution of frame-based spectral features. This is also known as the "bag-of-frames" (BOF) approach, wherein audio data are treated as a global distribution of frame occurrences.

Audio Pre-processing
The first step before calculating any features is to normalize the audio signal. The process scales and shifts the sound vectors so that they have a maximum amplitude of one and an average value of zero. The signal is then cut into frames with a window size of 512 samples (~23 ms at a sampling frequency of 22 kHz) and a hop size of 512 samples, i.e. non-overlapping frames.

Timbre Component
The timbre component is represented by the Mel-Frequency Cepstral Coefficients (coefficients 2 to 20) [1]. We use a single Gaussian represented by its mean µ and covariance matrix Σ. Other features derived from the spectral representation are spectral flux, spectral flux from the delta spectrum, spectral flux from the Mel-frequency spectrum, and spectral flux from the Mel-frequency delta spectrum.

Rhythm Component
The rhythm component is based on the Fluctuation Patterns (FPs) [2] of the audio signal. Fluctuation patterns describe the amplitude modulation of the loudness per frequency band. For each frame, the fluctuation pattern is represented by a 12×30 matrix. The rows correspond to reduced Mel-frequency bins while the columns correspond to modulation frequency bands. To summarize the FPs, the median of the matrices is computed. Additional features derived are the FP mean and FP standard deviation.

Tempo Component
The tempo component is derived from a technique using onset autocorrelation [3]. The onset strength is computed by taking the first-order difference along time of a Mel-frequency spectrogram, then summing across frequency. A high-pass filter is used to remove slowly varying offsets. The global tempo is estimated by autocorrelating the onset strength and choosing the period with the highest windowed peak. Sketches of these feature-extraction steps follow below.
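As a concrete illustration of the pre-processing step, here is a minimal numpy sketch (the function names are ours, not from the poster): it zero-centres the signal, scales it to unit peak amplitude, and cuts it into non-overlapping 512-sample frames.

```python
import numpy as np

def normalize_audio(x):
    """Shift to zero mean, then scale to a maximum amplitude of one."""
    x = x - np.mean(x)
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def frame_signal(x, frame_size=512, hop_size=512):
    """Cut the signal into frames of 512 samples (~23 ms at 22 kHz).
    With hop_size == frame_size the frames do not overlap."""
    n_frames = 1 + (len(x) - frame_size) // hop_size
    idx = np.arange(frame_size)[None, :] + hop_size * np.arange(n_frames)[:, None]
    return x[idx]  # shape (n_frames, frame_size)
```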
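For the timbre component, the single-Gaussian model reduces to a mean vector and covariance matrix over the per-frame MFCCs. A sketch, assuming an MFCC matrix has already been computed by any standard implementation; the spectral-flux helper shows one common formulation of that feature, not necessarily the exact variant used here.

```python
import numpy as np

def gaussian_timbre_model(mfccs):
    """Fit a single Gaussian (mean mu, covariance Sigma) to per-frame MFCCs.
    mfccs: array of shape (n_frames, n_coeffs), e.g. coefficients 2 to 20."""
    mu = mfccs.mean(axis=0)
    sigma = np.cov(mfccs, rowvar=False)
    return mu, sigma

def spectral_flux(spectra):
    """Average frame-to-frame spectral change (one common flux variant).
    spectra: array of shape (n_frames, n_bins); pass a magnitude, delta,
    Mel-frequency, or Mel-frequency delta spectrogram for the four variants."""
    diff = np.diff(spectra, axis=0)
    return float(np.mean(np.sqrt((diff ** 2).sum(axis=1))))
```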
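Computing the fluctuation patterns themselves involves a loudness model and a modulation-frequency analysis [2], which is beyond a short sketch. Summarizing the per-frame 12×30 FP matrices into the track-level median, mean, and standard deviation described above, however, is straightforward:

```python
import numpy as np

def summarize_fps(fp_frames):
    """Summarize a list of per-frame fluctuation patterns (12x30 matrices)
    into the track-level descriptors: FP median, FP mean, FP std."""
    fps = np.stack(fp_frames)  # shape (n_frames, 12, 30)
    return {
        "fp_median": np.median(fps, axis=0),
        "fp_mean": fps.mean(axis=0),
        "fp_std": fps.std(axis=0),
    }
```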
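The tempo estimate can be sketched as follows. The running-mean high-pass filter and the flat 40-200 BPM search range are simplifications of our own; the original method [3] weights the autocorrelation with a perceptual window around the preferred tempo rather than using a hard range.

```python
import numpy as np

def estimate_tempo(mel_spectrogram, frame_rate, bpm_range=(40.0, 200.0)):
    """Global tempo from onset-strength autocorrelation (after [3]).
    mel_spectrogram: (n_frames, n_mel_bands); frame_rate: frames per second."""
    # Onset strength: positive first-order difference along time, summed over bands
    onset = np.maximum(0.0, np.diff(mel_spectrogram, axis=0)).sum(axis=1)
    # Crude high-pass: subtract a running mean to remove slowly varying offsets
    kernel = np.ones(16) / 16.0
    onset = onset - np.convolve(onset, kernel, mode="same")
    # Autocorrelate and pick the strongest peak within the allowed lag range
    ac = np.correlate(onset, onset, mode="full")[len(onset) - 1:]
    min_lag = int(frame_rate * 60.0 / bpm_range[1])
    max_lag = int(frame_rate * 60.0 / bpm_range[0])
    lag = min_lag + int(np.argmax(ac[min_lag:max_lag]))
    return 60.0 * frame_rate / lag  # beats per minute
```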
Distance Computation
The timbre, rhythm, and tempo distances are calculated separately. Before they are combined, each distance component is normalized by removing the mean and dividing by the standard deviation of all the distances. The results are then weighted before being summed. Symmetry is obtained by summing the distances in both directions for each pair of tracks [4]. Distances between timbres are computed by comparing the Gaussian models using the symmetric Kullback-Leibler (KL) distance [5]. The Euclidean distance is used to compute the distance between rhythms. For tempo distances, a simple absolute difference is computed. A sketch of this step follows below.

[Block diagram: the spectral (timbre), rhythm, and tempo distances are each normalized, weighted (w1, w2, w3), and summed to form the distance matrix.]

Genre Classification
Genre classification involves training a classifier. The timbre, rhythm, and tempo features are combined into a single vector to represent each training file. To handle the multi-class problem, the Directed Acyclic Graph Support Vector Machine (DAG-SVM) [6] is used with a radial basis function (RBF) kernel.

[Block diagram: spectral, rhythm, and tempo features are combined and scaled, then used to train the DAG-SVM; a test song's features follow the same path and the classifier outputs its genre.]
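Below is a sketch of the symmetric KL distance between two single-Gaussian timbre models and of the distance combination. The weights are illustrative placeholders; the poster does not state the values actually used.

```python
import numpy as np

def symmetric_kl_gaussians(mu1, cov1, mu2, cov2):
    """Symmetric KL divergence between two full-covariance Gaussians,
    used as the timbre distance [5]."""
    d = len(mu1)
    icov1, icov2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    diff = mu1 - mu2
    # KL(p||q) + KL(q||p); the log-determinant terms cancel in the symmetric sum
    return 0.5 * (np.trace(icov2 @ cov1) + np.trace(icov1 @ cov2)
                  + diff @ (icov1 + icov2) @ diff) - d

def combine_distances(d_timbre, d_rhythm, d_tempo, w=(0.6, 0.25, 0.15)):
    """Z-score-normalize each pairwise distance matrix, weight, and sum;
    symmetrize by adding the distances in both directions [4].
    The weights w are illustrative, not the poster's values."""
    znorm = lambda m: (m - m.mean()) / m.std()
    combined = w[0] * znorm(d_timbre) + w[1] * znorm(d_rhythm) + w[2] * znorm(d_tempo)
    return combined + combined.T
```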
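For the classification stage, a hedged stand-in rather than the poster's exact method: scikit-learn's SVC trains the same pairwise (one-vs-one) RBF SVMs that a DAG-SVM arranges into a decision graph, so it approximates the pipeline without implementing the DAG evaluation of [6].

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_genre_classifier(features, labels):
    """features: (n_songs, n_dims) combined timbre+rhythm+tempo vectors.
    Scale the features, then train a multi-class RBF SVM (one-vs-one
    pairwise SVMs, the building block of a DAG-SVM)."""
    scaler = StandardScaler().fit(features)
    clf = SVC(kernel="rbf").fit(scaler.transform(features), labels)
    return scaler, clf

def predict_genre(scaler, clf, test_features):
    """Apply the same scaling to a test song's features and predict its genre."""
    return clf.predict(scaler.transform(test_features))
```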
Acknowledgements
Franz de Leon is supported by the Engineering Research and Development for Technology Faculty Development Program, DOST Philippines.

References
[1] B. Logan and A. Salomon, "A Music Similarity Function Based on Signal Analysis," in Proc. IEEE International Conference on Multimedia and Expo (ICME), 2001.
[2] E. Pampalk, "Computational Models of Music Similarity and their Application in Music Information Retrieval," PhD thesis, Vienna University of Technology, 2006.
[3] D. P. W. Ellis and G. E. Poliner, "Identifying 'Cover Songs' with Chroma Features and Dynamic Programming Beat Tracking," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.
[4] E. Pampalk, "Audio-Based Music Similarity and Retrieval: Combining a Spectral Similarity Model with Information Extracted from Fluctuation Patterns," in Submission to MIREX 2006, 2006.
[5] M. I. Mandel and D. P. W. Ellis, "Song-Level Features and Support Vector Machines for Music Classification," in Submission to MIREX 2005, 2005.
[6] J. C. Platt, N. Cristianini, and J. Shawe-Taylor, "Large Margin DAGs for Multiclass Classification," in Advances in Neural Information Processing Systems 12, 2000.