Locating Singing Voice Segments Within Music Signals
Adam Berenzweig and Daniel P.W. Ellis
LabROSA, Columbia University
alb63@columbia.edu, dpwe@ee.columbia.edu
(Berenzweig and Ellis, WASPAA '01)
LabROSA
–What
–Where
–Who
–Why you love us
The Future as We Hear It
Online Digital Music Libraries: The Coming Age of Streaming Music Services
Information Retrieval: How do we find what we want?
Recommendation: How do we know what we want to find?
–Collaborative Filtering vs. Content-Based
–What is Quality?
Motivation
Lyrics Recognition: Baby Steps
–Segmentation
–Forced Alignment
–A Corpus
Song structure through singing structure?
–Fingerprinting
–Retrieval
–Feature for similarity measures
Lyrics Recognition: Can YOU do it?
Notoriously hard, even for humans.
–amIright.com, kissThisGuy.com
Why so hard?
–Noise, music, etc.
–Singing is not speech: voice transformations
–Strange word sequences ("poetry")
Need a corpus.
History of the Problem
Segmentation for Speech Recognition: Music/Speech
–Scheirer & Slaney
Forced Alignment: Karaoke
–Cano et al. [REF NEEDED]
Acoustic feature design: custom job or kitchen sink?
Idea! Use a speech recognizer: PPF (Posterior Probability Features)
–Williams & Ellis
Ultimately: source separation, CASA
A Peek at the End
Architecture Overview
Audio → PLP → Speech Recognizer (Neural Net) → cepstra / posteriogram
Feature Calculation: Entropy H, H excluding h#, Dynamism D, P(h#)
Time-averaging → Gaussian Models → Segmentation (HMM)
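The posterior-based statistics in the feature-calculation stage could be computed from the recognizer's posteriogram roughly as follows. This is a minimal NumPy sketch under my own assumptions (function and variable names are mine, not from the paper; `bg_idx` is assumed to index the background/silence class h#):

```python
import numpy as np

def posterior_features(post, bg_idx=0):
    """Per-frame statistics from a phone-posterior matrix.

    post   : (T, K) array of phone posteriors, one row per frame (rows sum to 1).
    bg_idx : column index of the background/silence class (e.g. 'h#').
    """
    eps = 1e-10
    # Entropy H of the posterior distribution in each frame:
    # low when the net is confident (speech-like), high when confused.
    H = -np.sum(post * np.log(post + eps), axis=1)
    # Dynamism D: mean squared frame-to-frame change of the posteriors,
    # capturing how rapidly the winning phone hypothesis moves around.
    D = np.mean(np.diff(post, axis=0) ** 2, axis=1)
    D = np.concatenate([[0.0], D])  # pad the first frame
    # Background (silence) probability P(h#) per frame.
    p_bg = post[:, bg_idx]
    return H, D, p_bg
```

On a uniform posteriogram the entropy is log(K) in every frame and the dynamism is zero, which matches the intuition that a maximally confused recognizer output is static and high-entropy.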
Architecture Overview (neural-net variant)
Audio → PLP → Speech Recognizer (Neural Net) → cepstra / posteriogram → Neural Nets → Segmentation (HMM)
"So how's that working out for you, being clever?"
–Entropy
–Entropy excluding background
–Dynamism
–Background probability
–Distribution match: likelihoods under a single Gaussian model
–Cepstra
–PPF
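The "distribution match" feature scores each frame's likelihood under single Gaussian models fitted to the two classes. A rough sketch of that idea, assuming diagonal-covariance Gaussians (my assumption; the paper does not specify the covariance structure here):

```python
import numpy as np

def fit_gaussian(X):
    """Fit a single diagonal-covariance Gaussian to feature frames X (T, d)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0) + 1e-6  # floor the variance for numerical safety
    return mu, var

def loglik(X, mu, var):
    """Per-frame log-likelihood of frames X (T, d) under the diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
```

In use, one model would be fitted to singing frames and one to accompaniment-only frames, and the per-frame log-likelihood difference serves as the distribution-match feature.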
Recovering Context with the HMM
Transition probabilities
–Inverse average segment duration
Emission probabilities
–Gaussian fit to time-averaged distribution
Segmentation: the Viterbi path
Evaluation
–Frame error rate (no boundary consideration)
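The HMM step above can be sketched as a two-state (no-sing / sing) Viterbi decode, with the state-exit probability set to the inverse of the average segment duration, plus the frame-error-rate metric. This is a minimal illustration under my own assumptions (two states, uniform initial probabilities, placeholder durations):

```python
import numpy as np

def viterbi_2state(loglik, avg_dur=(200, 300)):
    """Two-state (0 = no-sing, 1 = sing) Viterbi segmentation.

    loglik  : (T, 2) per-frame log-likelihoods from the two emission models.
    avg_dur : average segment durations in frames; exit prob = 1 / duration.
    """
    T = loglik.shape[0]
    exit_p = np.array([1.0 / d for d in avg_dur])
    logA = np.log(np.array([[1 - exit_p[0], exit_p[0]],
                            [exit_p[1], 1 - exit_p[1]]]))
    delta = np.zeros((T, 2))          # best log-score ending in each state
    psi = np.zeros((T, 2), dtype=int)  # backpointers
    delta[0] = np.log(0.5) + loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA  # scores[i, j]: from i into j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + loglik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):     # backtrack the best path
        path[t] = psi[t + 1, path[t + 1]]
    return path

def frame_error_rate(pred, truth):
    """Fraction of frames labelled differently from the reference."""
    return np.mean(np.asarray(pred) != np.asarray(truth))
```

The duration-derived transitions discourage implausibly short segments, which is exactly the context the raw per-frame classifier lacks.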
Results
[Table, figures]
Listen!
–Good, bad
–Trigger & stick
–Genre effects?
Results
E = 0.075: P(h#) in effect
E = 0.68: P(h#) gone bad
E = 0.61: Strong phones trigger, but can't hold it
Production quality effect?
–'ey', 'uw', 'm', 'n'
E = 0.25: "Trigger and Stick"
–'s'
E = 0.54: False phones
–'bcl', 'dcl', 'b', 'd'
–'l', 'r'
E = 0.20: Genre effect?
Discussion
The Moral of the Story: just give it the data.
–PPF is better than cepstra.
–The speech recognizer is pretty powerful.
–Why does the extra Gaussian model help PPF but not cepstra?
–Time averaging helps PPF: it shows that the system is using the overall distribution, not short-time detail (at least when modelled by single Gaussians).
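The time-averaging that helps the PPF features amounts to smoothing each feature dimension over a window of frames before modelling. A minimal sketch, assuming a simple centered moving average with edge padding (the paper does not specify the window shape or length, so both are placeholders here):

```python
import numpy as np

def time_average(feat, win=101):
    """Smooth features (T, d) with a centered moving average.

    win : odd window length in frames (placeholder value).
    """
    kernel = np.ones(win) / win
    pad = win // 2
    # Edge-pad so the output keeps the original number of frames.
    padded = np.pad(feat, ((pad, pad), (0, 0)), mode='edge')
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode='valid'), 0, padded)
```

Because the averaging washes out short-time detail, any benefit it brings supports the claim above: the single-Gaussian models are exploiting the overall feature distribution rather than frame-level dynamics.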