Berenzweig and Ellis - WASPAA 01: Locating Singing Voice Segments Within Music Signals. Adam Berenzweig and Daniel P.W. Ellis, LabROSA, Columbia University.

2 Locating Singing Voice Segments Within Music Signals
Adam Berenzweig and Daniel P.W. Ellis
LabROSA, Columbia University
alb63@columbia.edu, dpwe@ee.columbia.edu

3 LabROSA
What
Where
Who
Why you love us

4 The Future as We Hear It
Online Digital Music Libraries
The Coming Age of Streaming Music Services
Information Retrieval: How do we find what we want?
Recommendation: How do we know what we want to find?
–Collaborative Filtering vs. Content-Based
–What is Quality?

5 Motivation
Lyrics Recognition: Baby Steps
–Segmentation
–Forced Alignment
–A Corpus
Song structure through singing structure?
–Fingerprinting
–Retrieval
–Feature for similarity measures

6 Lyrics Recognition: Can YOU do it?
Notoriously hard, even for humans.
–amIright.com, kissThisGuy.com
Why so hard?
–Noise, music, whatever.
–Singing is not speech: voice transformations
–Strange word sequences (“poetry”)
Need a corpus

7 History of the Problem
Segmentation for Speech Recognition: Music/Speech
–Scheirer & Slaney
Forced Alignment - Karaoke
–Cano et al. [REF NEEDED]
Acoustic feature design: Custom job or Kitchen Sink?
Idea! Use a speech recognizer: PPF (Posterior Probability Features)
–Williams & Ellis
Ultimately: Source separation, CASA

8 A Peek at the End

9 Architecture Overview
[Block diagram. Labels: Audio; PLP; cepstra; Speech Recognizer (Neural Net); posteriogram; Feature Calculation (Entropy H, Entropy excluding background H/h#, Dynamism D, Background probability P(h#)); Time-averaging; Gaussian Model (two blocks); Segmentation (HMM)]

10 Architecture Overview
[Block diagram. Labels: Audio; PLP; cepstra; Speech Recognizer (Neural Net); posteriogram; Neural Net (two blocks); Segmentation (HMM)]

11 “So how’s that working out for you, being clever?”
Entropy
Entropy excluding background
Dynamism
Background probability
Distribution Match: Likelihoods under single Gaussian model
–Cepstra
–PPF
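The four scalar features named on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the posteriogram is a non-empty list of frames, each a list of class posteriors summing to 1, and that the background class (h#) sits at a known index. The function names, the natural-log entropy, and the exact form of the dynamism measure (mean squared frame-to-frame change, summed over classes) are assumptions made for the example.

```python
import math

def frame_entropy(post, exclude=None):
    """Entropy (in nats) of one frame of class posteriors.

    If `exclude` is a class index, that class is dropped and the remaining
    posteriors are renormalized first ("entropy excluding background").
    """
    probs = [p for i, p in enumerate(post) if i != exclude]
    total = sum(probs)
    if total <= 0.0:
        return 0.0  # nothing left after excluding the class
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0.0)

def dynamism(posteriogram):
    """Mean over time of the squared frame-to-frame posterior change."""
    if len(posteriogram) < 2:
        return 0.0
    diffs = [
        sum((b - a) ** 2 for a, b in zip(prev, cur))
        for prev, cur in zip(posteriogram, posteriogram[1:])
    ]
    return sum(diffs) / len(diffs)

def posterior_features(posteriogram, bg_index):
    """Average the per-frame features over a (non-empty) block of frames."""
    n = len(posteriogram)
    return {
        "H": sum(frame_entropy(f) for f in posteriogram) / n,
        "H_no_bg": sum(frame_entropy(f, exclude=bg_index) for f in posteriogram) / n,
        "D": dynamism(posteriogram),
        "P_bg": sum(f[bg_index] for f in posteriogram) / n,
    }
```

Calling `posterior_features(block, bg_index)` on a block of frames returns one value per feature, which is the form a downstream Gaussian model or HMM would consume.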

12 Recovering context with the HMM
Transition probabilities
–Inverse average segment duration
Emission probabilities
–Gaussian fit to time-averaged distribution
Segmentation: the Viterbi path
Evaluation
–Frame error rate (no boundary consideration)
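The segmentation and evaluation steps listed on this slide can be sketched as below. This is an illustrative outline rather than the authors' code. It assumes per-frame log-likelihoods for each state (for example singing vs. no singing) are already available, that mean segment durations are given in frames and exceed one frame, and that the probability of leaving a state is split evenly among the other states. All function and parameter names are hypothetical.

```python
import math

def transition_matrix(mean_durations):
    """Build transitions where P(leave state i) = 1 / mean duration of i.

    Durations are in frames and must be > 1 so no probability is zero.
    """
    n = len(mean_durations)
    trans = []
    for i, duration in enumerate(mean_durations):
        leave = 1.0 / duration
        row = [leave / (n - 1)] * n   # share the exit mass among other states
        row[i] = 1.0 - leave          # self-loop keeps the rest
        trans.append(row)
    return trans

def viterbi(log_emis, trans, prior):
    """Most likely state path; log_emis[t][s] is log p(frame t | state s)."""
    n_states = len(prior)
    log_trans = [[math.log(p) for p in row] for row in trans]
    score = [math.log(prior[s]) + log_emis[0][s] for s in range(n_states)]
    back = []
    for frame in log_emis[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor for state s at this frame.
            best = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best)
            new_score.append(score[best] + log_trans[best][s] + frame[s])
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def frame_error_rate(predicted, reference):
    """Fraction of frames labelled differently from the reference."""
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return wrong / len(reference)
```

With long mean durations the self-loop probabilities are close to 1, so the Viterbi path tends to ignore isolated frames whose likelihoods briefly favor the other state. The frame error rate counts every frame equally and gives no special treatment to segment boundaries, matching the "no boundary consideration" note above.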

13 Results
[Table, figures]
Listen!
–Good, bad
–trigger & stick
–genre effects?

14 Results

15 E = 0.075
P(h#) in effect

16 E = 0.68
P(h#) gone bad

17 E = 0.61
Strong phones trigger, but can’t hold it
Production quality effect?
‘ey’ ‘uw’ ‘m’, ‘n’

18 E = 0.25
“Trigger and Stick”
‘s’

19 E = 0.54
False phones
‘bcl’, ‘dcl’, ‘b’, ‘d’
‘l’, ‘r’

20 E = 0.20
Genre effect?

21 Discussion
The Moral of the Story: Just give it the data
PPF is better than cepstra.
Speech Recognizer is pretty powerful.
Why does the extra Gaussian model help PPF but not cepstra?
Time averaging helps PPF: proves that it’s using the overall distribution, not short-time detail (at least, when modelled by single Gaussians)
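The time-averaging and single-Gaussian modelling discussed here can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the slides do not give the window length or covariance structure, so the centered moving-average window and the diagonal covariance are choices made for the example, and all names are hypothetical.

```python
import math

def time_average(frames, window):
    """Moving average of feature vectors over a centered window.

    The window is truncated at the start and end of the sequence.
    """
    half = window // 2
    dim = len(frames[0])
    averaged = []
    for t in range(len(frames)):
        chunk = frames[max(0, t - half): t + half + 1]
        averaged.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return averaged

def fit_diag_gaussian(vectors, floor=1e-6):
    """Fit a single Gaussian with diagonal covariance (an assumption here)."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [
        max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, floor)  # variance floor
        for d in range(dim)
    ]
    return mean, var

def diag_loglik(x, mean, var):
    """Log-likelihood of vector x under the diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )
```

One Gaussian would be fitted per class on time-averaged training frames, and `diag_loglik` evaluated on time-averaged test frames under each class model. Averaging removes short-time detail before the model sees the data, so any gain from it points to the overall distribution carrying the useful information.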

