Hidden Markov Classifiers for Music Genres. Igor Karpov, Rice University. Comp 540 Term Project, Fall 2002.

The Problem. Classify digitally sampled music by genre or other categories. Categories are defined by "likeness" to their other members. The solution should be quick, flexible, and accurate.

Motivation. Organize large digital libraries. Search for music by melody/sound (second project). Understand music better.

Early Considerations.
- Look for common patterns in music. The nature of music is sequential.
- Digitally sampled formats (WAV, MP3, etc.)? More readily available and practical, but the raw information is harder to deal with.
- Symbolic formats (MIDI, melody)? Less readily available in practical applications; a different order of information, where some is lost and some is gained.

Early Considerations. Is melodic information enough? Consider orchestration, emphasis, etc. What are good models for this data? Learn from speech recognition, pattern recognition, and digital signal processing.

Previous Work: Folk Music Classification Using Hidden Markov Models. Wei Chai and Barry Vercoe, MIT Media Laboratory.
- Input: monophonic symbolic pieces of folk music from Germany, Austria, and Ireland.
- Product: 2- and 3-country classifiers using HMMs.
- Results: the number of hidden states does not matter much (2, 3, 4, 6); strict left-right and left-right models are better; the interval-sequence representation worked best; 2-way accuracies of 75%, 77%, and 66%, 3-way accuracy of 63%.

Previous Work: Music Classification Using Neural Networks. Paul Scott, a term project at Stanford the previous year.
- Data: 8 audio CDs in 4 genre categories + 4 audio CDs in 4 artist categories.
- Algorithm: multi-feature vectors extracted as input to a feed-forward ANN.
- Product: a 4-way genre classifier and a 4-way artist classifier.
- Results: genre classification 94.8% accurate, artist classification 92.4% accurate. Problematic experimental setup.

Previous Work: Distortion Discriminant Analysis for Audio Fingerprinting. Chris Burges et al. at Microsoft Research.
- Task: find short audio clips in 5 hours of distorted audio.
- Product: a new algorithm for feature extraction (fingerprinting) of audio streams.
- Key: a linear neural network performs Oriented Principal Component Analysis (OPCA), a signal/noise-optimal dimensionality reduction.

Dataset. Songs from each genre in MP3 format, compressed from 44.1 kHz stereo. Converted to a Wave-PCM linear-encoded mono signal. Cut 10 evenly spaced ten-second segments per song. 4 categories: rock, techno/trance, classical, Celtic dance.
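A minimal sketch of the clip-cutting step in Python (illustrative only; extract_clips and song.wav are hypothetical names, the song is assumed to be already decoded to mono PCM WAV, and long enough to hold all clips):

```python
import numpy as np
from scipy.io import wavfile

def extract_clips(path, n_clips=10, clip_seconds=10):
    """Cut n_clips evenly spaced, clip_seconds-long segments from one song."""
    rate, samples = wavfile.read(path)            # mono PCM signal
    clip_len = clip_seconds * rate                # samples per clip
    # Evenly spaced start offsets that keep every clip inside the song.
    starts = np.linspace(0, len(samples) - clip_len, n_clips).astype(int)
    return [samples[s:s + clip_len] for s in starts]

clips = extract_clips("song.wav")
```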

Dataset.
- Easily extracted from real-world data.
- Contains a lot of information.
- Enough for humans to distinguish between genres.

The Model. A continuous hidden Markov model. 3, 4, or 5 hidden states. Left-to-right architecture.

The Model. Each state "outputs" a feature vector O with probability distribution b_j(O):
- FFT-based Mel cepstral coefficients.
- Mel cepstra with delta and acceleration information.
- Linear prediction cepstral coefficients.
- (To be implemented) DDA fingerprints.
[Diagram: four-state left-to-right HMM, states s_1 through s_4 with output distributions b_1(O) through b_4(O).]
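To make the left-to-right topology concrete, here is a sketch using hmmlearn's GaussianHMM as a stand-in for the HTK models used in the project (the zero entries in the transition matrix remain zero under Baum-Welch reestimation):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

n_states = 4
# Left-to-right: each state either stays put or advances one step.
transmat = np.zeros((n_states, n_states))
for i in range(n_states - 1):
    transmat[i, i] = 0.5
    transmat[i, i + 1] = 0.5
transmat[-1, -1] = 1.0                 # final state absorbs

startprob = np.zeros(n_states)
startprob[0] = 1.0                     # always start in s_1

model = GaussianHMM(n_components=n_states, covariance_type="diag",
                    init_params="mc", params="stmc")
model.startprob_ = startprob
model.transmat_ = transmat
# model.fit(X, lengths) estimates the Gaussian output parameters b_j(O)
# while preserving the left-to-right structure.
```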

Feature Extraction: FFT and Mel. Pre-emphasize the audio signal. Multiply by a Hamming window function. Take the Fourier transform of the window. Derive 12 Mel cepstral coefficients from the spectrum (this models non-linear human audition). [Plot: amplitude vs. frequency spectrum.]
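The same pipeline can be sketched with librosa (the project itself used HTK's feature tools; the frame and FFT sizes here are illustrative):

```python
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=None, mono=True)   # hypothetical clip
y = np.append(y[0], y[1:] - 0.97 * y[:-1])             # pre-emphasis

# Hamming-windowed FFT -> Mel filterbank -> log -> DCT, 12 coefficients.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                            window="hamming", n_fft=512, hop_length=256)
# mfcc has shape (12, n_frames): one 12-dim feature vector per frame.
```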

10/31/2015 Features of the Features.

Feature Extraction: Deltas and Accelerations. For each Mel coefficient C_t, append the delta D_t = C_t - C_{t-1} to the feature vector. For each D_t, append the acceleration a_t = D_t - D_{t-1} to the feature vector. This enhances the HMM by adding memory of past states.
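A sketch of this augmentation using simple first differences (frames along axis 1; the first frame is duplicated so the output keeps the same number of frames):

```python
import numpy as np

def add_deltas(feats):
    """feats: (n_coeffs, n_frames) array of cepstral coefficients."""
    delta = np.diff(feats, axis=1, prepend=feats[:, :1])  # D_t = C_t - C_{t-1}
    accel = np.diff(delta, axis=1, prepend=delta[:, :1])  # a_t = D_t - D_{t-1}
    return np.vstack([feats, delta, accel])               # 12 -> 36 features
```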

Feature Extraction: LPC. Linear Predictive Coding. Model the signal as y_{n+1} = w_0*y_n + w_1*y_{n-1} + ... + w_{L-1}*y_{n-L+1} + e_{n+1}. Find the weights that minimize the mean squared error over the window. The 12 weights were used as a feature vector.
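One standard way to find those weights is the autocorrelation (Yule-Walker) form of the least-squares problem; a sketch (a common formulation, not necessarily the exact estimator HTK uses):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_weights(y, order=12):
    """Weights w_0..w_{order-1} minimizing the mean squared prediction error."""
    y = y - np.mean(y)
    r = np.correlate(y, y, mode="full")[len(y) - 1:]   # autocorrelation r[0..]
    # Normal equations: symmetric Toeplitz system R w = [r_1 .. r_order].
    return solve_toeplitz(r[:order], r[1:order + 1])
```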

Feature Extraction: Overall. Combine multiple kinds of features into hand-crafted vectors (like Paul Scott). Build prior knowledge about the problem into the learning task. (To do) Use optimizing feature-extraction methods such as DDA.

Continuous HMMs. Feature vectors come from a continuous domain. Two solutions:
- Discretize the space by finding a representative basis and a distance measure.
- Use continuous multivariate probability distributions.
Chose the latter: continuous HMMs.

Continuous HMMs. Represent each state's output probability by a mixture of M Gaussians. Use EM and Baum-Welch reestimation to get the Gaussian parameters and mixture coefficients. What should M be? A trade-off between the number of parameters and expressive power. M = 1 worked well.
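A minimal sketch of this output model with hmmlearn (again standing in for HTK); M corresponds to n_mix, and n_iter bounds the Baum-Welch reestimation steps:

```python
from hmmlearn.hmm import GMMHMM

model = GMMHMM(n_components=4,        # hidden states
               n_mix=1,               # M = 1 Gaussian per state worked well
               covariance_type="diag",
               n_iter=20)             # EM / Baum-Welch steps
# model.fit(X, lengths): X stacks the frame vectors of all training clips,
# lengths gives the number of frames in each clip.
```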

The Platform. The HTK library, originally developed for speech recognition at Cambridge (for a time owned by Microsoft, then licensed back to Cambridge):
- Feature extraction tools.
- Continuous HMMs.
- Baum-Welch and Viterbi algorithms.
- Optimized for performance.
It worked well; the only things I had to write were scripts, models, and data converters.

The Algorithm. One HMM, M, for each category. Use Baum-Welch reestimation for 20 steps (or until convergence) to obtain the M that maximizes log P(O_train | M). Use the Viterbi algorithm to obtain log P(O_test | M) for each category's model. Pick the greatest.
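The decision rule itself is only a few lines; a sketch assuming one trained model per genre (hmmlearn's score() uses the forward algorithm rather than Viterbi, but plays the same role of scoring log P(O | M)):

```python
def classify(clip_feats, models):
    """models: dict mapping genre name -> trained HMM for that genre."""
    scores = {genre: m.score(clip_feats) for genre, m in models.items()}
    return max(scores, key=scores.get)   # argmax over log P(O_test | M)
```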

Problems.
- The Celtic data set is similar to classical and rock, and smaller than the other three.
- Failed to find MIDI copies of the dataset.
- Viterbi training did not have enough examples for the number of parameters, even with a 3-state HMM (undetermined parameters), so Baum-Welch had to be used instead.
- Training was memory-intensive; dedicated Ultra-10s had to be used in parallel.

Results: 4-way by 12 MFCC. 70%/30% training/test split, 470 clips/category, 15 cross-validation trials per experiment. Top: 12 Mel cepstral coefficients; bottom: delta and acceleration coefficients added. 4 hidden states.
[Tables: two 4-state HMM confusion matrices over techno, classical, rock, and Celtic, one per feature set; numeric entries not preserved in the transcript.]

Results: 4-way by 12 LPC. 12 LPC cepstra from 14th-order LPC. Same experimental conditions as before.
[Table: 4-state HMM confusion matrix over techno, classical, rock, and Celtic; numeric entries not preserved.]

Results: Varying Hidden State Number. 660 clips per genre. 12 Mel cepstral coefficients with deltas and accelerations (36 total features).
[Tables: 3-way confusion matrices over techno, classical, and rock for HMMs with different numbers of hidden states; numeric entries not preserved.]

Results: Generalization. Verify that we are generalizing across songs: an entire song must be either all training or all test. Top: random selection (15 cross-validation trials); bottom: constrained selection (15 cross-validation trials).
[Tables: two 5-state HMM confusion matrices over techno, classical, and rock, one per selection scheme; numeric entries not preserved.]
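The constrained selection can be reproduced with a grouped split, where the group is the source song; a sketch with scikit-learn (clips and song_ids are hypothetical variables holding the clip features and their source-song labels):

```python
from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=15, test_size=0.3)
for train_idx, test_idx in splitter.split(clips, groups=song_ids):
    # All clips of a given song land entirely in train or entirely in test.
    ...  # train one HMM per genre on train_idx, evaluate on test_idx
```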

Conclusions.
- HMMs are a viable solution to the problem.
- The number of hidden states does not influence results within the limits tested.
- Most of the information is contained in the extracted feature vectors.
- The feature vectors are readily modeled by simple Gaussians.

Conclusions. Some types of music are harder to recognize than others:
- Fewer unique features are identifiable by feature extraction (Celtic).
- They sound like other genres.

Conclusions.
- The models generalize across songs, not just across different segments of the same song.
- Better feature extraction (e.g., DDA) is the main factor for improving performance.
- Practically useful tools for sorting MP3s can easily be developed with this technique.