Berenzweig and Ellis - WASPAA 01: Locating Singing Voice Segments Within Music Signals. Adam Berenzweig and Daniel P.W. Ellis, LabROSA, Columbia University.

Presentation transcript:

Slide 1: Locating Singing Voice Segments Within Music Signals
Adam Berenzweig and Daniel P.W. Ellis
LabROSA, Columbia University

Slide 2: LabROSA
What
Where
Who
Why you love us

Slide 3: The Future as We Hear It
Online Digital Music Libraries
The Coming Age of Streaming Music Services
Information Retrieval: How do we find what we want?
Recommendation: How do we know what we want to find?
  – Collaborative Filtering vs. Content-Based
  – What is Quality?

Slide 4: Motivation
Lyrics Recognition: Baby Steps
  – Segmentation
  – Forced Alignment
  – A Corpus
Song structure through singing structure?
  – Fingerprinting
  – Retrieval
  – Feature for similarity measures

Slide 5: Lyrics Recognition: Can YOU do it?
Notoriously hard, even for humans.
  – amIright.com, kissThisGuy.com
Why so hard?
  – Noise, music, whatever.
  – Singing is not speech: voice transformations
  – Strange word sequences (“poetry”)
Need a corpus

Slide 6: History of the Problem
Segmentation for Speech Recognition: Music/Speech
  – Scheirer & Slaney
Forced Alignment - Karaoke
  – Cano et al. [REF NEEDED]
Acoustic feature design: Custom job or Kitchen Sink?
Idea! Use a speech recognizer: PPF (Posterior Probability Features)
  – Williams & Ellis
Ultimately: Source separation, CASA

Slide 7: A Peek at the End

Slide 8: Architecture Overview
[Block diagram; labels recovered from the figure]
Audio
PLP
Speech Recognizer (Neural Net)
posteriogram, cepstra
Feature Calculation: Entropy H, H/h#, Dynamism D, P(h#)
Time-averaging
Gaussian Model (two blocks)
Segmentation (HMM)

Slide 9: Architecture Overview
[Block diagram; labels recovered from the figure]
Audio
PLP
Speech Recognizer (Neural Net)
posteriogram, cepstra
Neural Net (two blocks)
Segmentation (HMM)

Slide 10: “So how’s that working out for you, being clever?”
Entropy
Entropy excluding background
Dynamism
Background probability
Distribution Match: Likelihoods under single Gaussian model
  – Cepstra
  – PPF
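The four per-frame statistics named on this slide can be sketched from a posteriogram (one vector of phone-class posteriors per frame). The slide gives only the names, so the formulas below are assumptions: entropy is the usual Shannon entropy, "excluding background" is taken to mean dropping the h# class and renormalizing, and dynamism is taken as the mean squared change in posteriors between successive frames. The function names and the `bg` index are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of one posterior vector; zero entries contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def entropy_excluding_background(p, bg):
    """Entropy after dropping class `bg` and renormalizing (assumed definition)."""
    rest = [x for i, x in enumerate(p) if i != bg]
    total = sum(rest)
    if total <= 0.0:
        return 0.0  # all mass on the background class
    return entropy([x / total for x in rest])

def dynamism(prev, cur):
    """Mean squared change in posteriors between two frames (assumed definition)."""
    return sum((a - b) ** 2 for a, b in zip(prev, cur)) / len(cur)

def frame_features(posteriogram, bg):
    """Per-frame H, H without background, D, and P(h#) for a list of posterior vectors."""
    feats = []
    for t, p in enumerate(posteriogram):
        prev = posteriogram[t - 1] if t > 0 else p  # first frame has no predecessor
        feats.append({
            "H": entropy(p),
            "H_no_bg": entropy_excluding_background(p, bg),
            "D": dynamism(prev, p),
            "P_bg": p[bg],
        })
    return feats

# Toy posteriogram with three classes; class 2 plays the role of h#.
features = frame_features([[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]], bg=2)
```

In this toy input the first frame is maximally uncertain between two phone classes, and the second puts all its mass on the background class, so its entropy is zero and its dynamism is high.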

Slide 11: Recovering context with the HMM
Transition probabilities
  – Inverse average segment duration
Emission probabilities
  – Gaussian fit to time-averaged distribution
Segmentation: the Viterbi path
Evaluation
  – Frame error rate (no boundary consideration)
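A minimal sketch of the segmentation step described on this slide, under stated assumptions: each state's probability of leaving is the inverse of its average segment duration in frames (durations must exceed one frame), emissions are one-dimensional Gaussians, and the prior over states is uniform. The feature values, means, variances, and durations below are made up for illustration and are not the authors' numbers.

```python
import math

def transition_matrix(avg_durations):
    """Leave probability of state i is 1 / avg_durations[i], split evenly over other states."""
    n = len(avg_durations)
    matrix = []
    for i, d in enumerate(avg_durations):
        leave = 1.0 / d  # assumes d > 1 frame so the self-loop stays positive
        row = [leave / (n - 1)] * n
        row[i] = 1.0 - leave
        matrix.append(row)
    return matrix

def gaussian_loglik(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(log_emis, trans, prior):
    """Most likely state sequence given per-frame log emission scores."""
    n = len(trans)
    log_a = [[math.log(a) for a in row] for row in trans]
    delta = [math.log(prior[s]) + log_emis[0][s] for s in range(n)]
    back = []
    for t in range(1, len(log_emis)):
        new_delta, ptr = [], []
        for s in range(n):
            best = max(range(n), key=lambda r: delta[r] + log_a[r][s])
            ptr.append(best)
            new_delta.append(delta[best] + log_a[best][s] + log_emis[t][s])
        delta = new_delta
        back.append(ptr)
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):  # trace back from the last frame
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def segment(obs, models, avg_durations):
    """Label each frame with a state index; models is a list of (mean, var) pairs."""
    trans = transition_matrix(avg_durations)
    log_emis = [[gaussian_loglik(x, m, v) for (m, v) in models] for x in obs]
    prior = [1.0 / len(models)] * len(models)
    return viterbi(log_emis, trans, prior)

def frame_error_rate(pred, truth):
    """Fraction of frames labeled incorrectly, with no special handling of boundaries."""
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)

# State 0 = accompaniment only, state 1 = singing (hypothetical 1-D feature).
obs = [0.1, -0.2, 2.3, 0.0, 4.1, 3.9, 4.2]
path = segment(obs, models=[(0.0, 1.0), (4.0, 1.0)], avg_durations=[50.0, 50.0])
```

The third frame (2.3) is slightly closer to the singing model, but the sticky transitions make a one-frame excursion more costly than staying put, so the Viterbi path keeps it in state 0. This is the kind of context recovery the slide title refers to.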

Slide 12: Results
[Table, figures]
Listen!
  – Good, bad
  – trigger & stick
  – genre effects?

Slide 13: Results

Slide 14
E = 0.075
P(h#) in effect

Slide 15
E = 0.68
P(h#) gone bad

Slide 16
E = 0.61
Strong phones trigger, but can’t hold it
Production quality effect?
‘ey’, ‘uw’, ‘m’, ‘n’

Slide 17
E = 0.25
“Trigger and Stick”
‘s’

Slide 18
E = 0.54
False phones
‘bcl’, ‘dcl’, ‘b’, ‘d’, ‘l’, ‘r’

Slide 19
E = 0.20
Genre effect?

Slide 20: Discussion
The Moral of the Story: Just give it the data
PPF is better than cepstra.
Speech Recognizer is pretty powerful.
Why does the extra Gaussian model help PPF but not cepstra?
Time averaging helps PPF: proves that it’s using the overall distribution, not short-time detail (at least, when modelled by single Gaussians)
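The time averaging and single-Gaussian modelling discussed here might look like the sketch below. It assumes a centered moving average that is truncated at the edges and a diagonal-covariance Gaussian with a variance floor. The slides do not specify the window shape or the covariance structure, so both are illustrative choices, and the function names are hypothetical.

```python
import math

def time_average(frames, half_width):
    """Centered moving average of feature vectors, truncated at the signal edges."""
    out = []
    for t in range(len(frames)):
        lo = max(0, t - half_width)
        hi = min(len(frames), t + half_width + 1)
        chunk = frames[lo:hi]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out

def fit_diag_gaussian(frames, var_floor=1e-6):
    """Per-dimension mean and variance (diagonal covariance is an assumption)."""
    n = len(frames)
    columns = list(zip(*frames))
    means = [sum(col) / n for col in columns]
    variances = [
        max(sum((x - m) ** 2 for x in col) / n, var_floor)
        for col, m in zip(columns, means)
    ]
    return means, variances

def log_likelihood(x, model):
    """Log density of vector x under a diagonal Gaussian (means, variances)."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )

smoothed = time_average([[0.0], [3.0], [6.0]], half_width=1)
model = fit_diag_gaussian([[1.0, 2.0], [3.0, 2.0]])
```

In use, one such model would be fit per class (for example singing and accompaniment) on time-averaged training frames, and the per-class log-likelihoods of each test frame would feed the HMM as emission scores.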