Phoneme Alignment. Slide 1 Phoneme Alignment based on Discriminative Learning
Shai Shalev-Shwartz, The Hebrew University, Jerusalem
Joint work with Joseph Keshet (Hebrew University), Yoram Singer (Google), and Dan Chazan (IBM)
Phoneme Alignment. Slide 2 The Alignment Problem
Text: “Have a test”
Phonetic transcription: /hh ae v ey tcl t eh s tcl t/
Waveform: [figure]
Phoneme Alignment. Slide 3 The Alignment Problem Setting
- acoustic representation x
- phonetic representation: the phonemes p_i, e.g. /hh ae v ey tcl t eh s tcl t/
- alignment function: maps the acoustic and phonetic representations to the start-time of each phoneme p_i in x
Phoneme Alignment. Slide 4 Acoustic Representation Short-time Fourier Transform
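
As a rough illustration of the short-time Fourier transform named on this slide, the sketch below cuts a signal into overlapping frames, applies a Hann window, and takes the magnitude of a naive DFT per frame. The function name `stft_magnitudes`, the window choice, and the frame and hop sizes are assumptions made for the example, not details taken from the talk.

```python
import cmath
import math


def stft_magnitudes(signal, frame_len, hop):
    """Return one magnitude spectrum per frame (naive O(N^2) DFT per frame).

    Assumption: a periodic Hann window; the talk does not specify a window.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        # Keep only the non-negative frequency bins of a real-valued signal.
        for k in range(frame_len // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Usage: a sinusoid with 4 cycles per 32-sample frame peaks at bin 4.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(64)]
spectra = stft_magnitudes(tone, frame_len=32, hop=16)
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spectra]
```

A real front end would use an FFT and, as in the experiments later in the deck, MFCC features; the naive DFT only keeps the example dependency-free.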
Phoneme Alignment. Slide 5 Comparing Alignments e.g.
Phoneme Alignment. Slide 6 ε-insensitive Cost. Figure: insensitivity region
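
The cost formula itself did not survive in this transcript. One plausible reading of a cost with an insensitivity region is the fraction of phoneme boundaries that fall more than a tolerance away from the reference; the sketch below implements that reading. The name `alignment_cost`, the parameter `epsilon`, and the exact form are assumptions.

```python
def alignment_cost(true_starts, predicted_starts, epsilon):
    """Fraction of boundaries predicted more than `epsilon` away from the truth.

    Assumption: deviations of at most `epsilon` cost nothing (the
    "insensitivity region"); larger deviations each count as one miss.
    """
    if len(true_starts) != len(predicted_starts):
        raise ValueError("alignments must contain the same number of phonemes")
    misses = sum(1 for y, y_hat in zip(true_starts, predicted_starts)
                 if abs(y - y_hat) > epsilon)
    return misses / len(true_starts)


# Usage: only the last boundary is off by more than 2 frames.
example_cost = alignment_cost([0, 10, 20, 30], [0, 12, 20, 35], epsilon=2)
```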
Phoneme Alignment. Slide 7 A Discriminative Learning Approach
Training set → learning algorithm (over a hypotheses class) → alignment function
Phoneme Alignment. Slide 8 Outline of Solution
1. Define the hypotheses class, which constitutes the template of our alignment function:
   a. Map each possible alignment into a vector in an abstract vector space
   b. Devise a projection in the vector space that orders alignments according to their quality
2. Derive a simple online learning algorithm
3. Convert the online algorithm to a batch procedure with some formal guarantees
Phoneme Alignment. Slide 9 Feature “Primitives” for Alignment
A feature primitive for alignment assesses the quality of a suggested alignment. Its inputs are the acoustic and phonetic representation and the suggested alignment.
Phoneme Alignment. Slide 10 Feature Primitive I Cumulative spectral change across the boundaries
Phoneme Alignment. Slide 11 Feature Primitives I Cumulative spectral change across the boundaries
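
A minimal sketch of what a "cumulative spectral change across the boundaries" primitive could look like: for every proposed boundary, measure the distance between the frames a fixed number of steps before and after it, and add the distances up. The Euclidean distance, the `offset` parameter, and the name `spectral_change` are assumptions, since the slide's formula is not preserved.

```python
import math


def spectral_change(frames, starts, offset):
    """Sum of distances between frames `offset` steps on either side of each boundary.

    Assumption: Euclidean distance between acoustic feature vectors.
    Boundaries too close to the edges of the utterance are skipped.
    """
    total = 0.0
    for boundary in starts:
        left, right = boundary - offset, boundary + offset
        if left < 0 or right >= len(frames):
            continue
        total += math.dist(frames[left], frames[right])
    return total


# Usage: the spectrum changes between frame 2 and frame 3, so a boundary
# placed at frame 3 sees a large change and a boundary at frame 1 sees none.
toy_frames = [[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]]
change_at_true_boundary = spectral_change(toy_frames, [3], offset=1)
change_at_wrong_boundary = spectral_change(toy_frames, [1], offset=1)
```

Under this reading, a well-placed boundary yields a larger value than a misplaced one, which is what lets the primitive separate good alignments from bad ones.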
Phoneme Alignment. Slide 12 Feature Primitive II: cumulative confidence in the phoneme sequence
Learn a static frame-based phoneme classifier (Dekel, Keshet, Singer, ’04). Its output is the confidence that a given phoneme was uttered at a given frame.
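
The cumulative-confidence idea can be sketched as follows: given per-frame confidences from some frame classifier, add up the confidence of each phoneme over the frames the suggested alignment assigns to it. The table `confidences` stands in for the classifier's output and is hypothetical, as is the function name.

```python
def cumulative_confidence(confidences, phonemes, starts):
    """Sum the classifier confidence of each phoneme over its assigned frames.

    confidences[t][p] is assumed to hold the confidence that phoneme p was
    uttered at frame t; phoneme i spans frames starts[i] .. starts[i+1]-1,
    and the last phoneme runs to the end of the utterance.
    """
    ends = list(starts[1:]) + [len(confidences)]
    total = 0.0
    for phoneme, start, end in zip(phonemes, starts, ends):
        for t in range(start, end):
            total += confidences[t][phoneme]
    return total


# Usage: two frames of "a" followed by two frames of "b".
toy_conf = [{"a": 1.0, "b": 0.0}, {"a": 1.0, "b": 0.0},
            {"a": 0.0, "b": 1.0}, {"a": 0.0, "b": 1.0}]
good = cumulative_confidence(toy_conf, ["a", "b"], [0, 2])
worse = cumulative_confidence(toy_conf, ["a", "b"], [0, 1])
```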
Phoneme Alignment. Slide 13 Feature Primitive III: phoneme duration model, based on the average length of each phoneme and the standard deviation of its length
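
One way to turn a per-phoneme mean and standard deviation into a score is to sum Gaussian log-densities of the durations implied by an alignment. The slide only names the two statistics, so the Gaussian form, the name `duration_score`, and the `stats` table below are assumptions.

```python
import math


def duration_score(phonemes, starts, total_frames, stats):
    """Sum of Gaussian log-densities of the phoneme durations.

    stats[p] = (mean_length, std_length) for phoneme p (hypothetical values).
    Assumption: durations are scored with a normal density.
    """
    ends = list(starts[1:]) + [total_frames]
    score = 0.0
    for phoneme, start, end in zip(phonemes, starts, ends):
        mean, std = stats[phoneme]
        duration = end - start
        score += (-0.5 * ((duration - mean) / std) ** 2
                  - math.log(std * math.sqrt(2 * math.pi)))
    return score


# Usage: "a" typically lasts 2 frames and "b" 4 frames.
toy_stats = {"a": (2.0, 1.0), "b": (4.0, 1.0)}
typical = duration_score(["a", "b"], [0, 2], 6, toy_stats)
atypical = duration_score(["a", "b"], [0, 4], 6, toy_stats)
```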
Phoneme Alignment. Slide 14 Feature Primitive IV: speaking rate (“dynamics”)
Figure: spectrogram at different rates of articulation (Pickett, 1980)
Phoneme Alignment. Slide 15 Feature Functions for Alignment: mapping all possible alignments into a vector space
Figure labels: correct alignment, slightly incorrect alignment, grossly incorrect alignment
Phoneme Alignment. Slide 16 Main Solution Principle: find a linear projection that ranks alignments according to their quality
Figure labels: correct alignment, slightly incorrect alignment, grossly incorrect alignment
Phoneme Alignment. Slide 17 Main Solution Principle (cont.): example of a low-confidence projection
Figure labels: correct alignment, slightly incorrect alignment, grossly incorrect alignment
Phoneme Alignment. Slide 18 Main Solution Principle (cont.): example of an incorrect projection
Figure labels: correct alignment, slightly incorrect alignment, grossly incorrect alignment
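
The ranking principle on these slides can be illustrated with a small sketch: each candidate alignment is represented by a feature vector, a weight vector `w` projects it onto a line, and the candidates are ordered by that projection. The feature values and weight vectors below are made up for the illustration.

```python
def projection(w, features):
    """Inner product of the weight vector with an alignment's feature vector."""
    return sum(wi * fi for wi, fi in zip(w, features))


def rank_alignments(w, candidates):
    """Return candidate names ordered from highest to lowest projection."""
    return sorted(candidates,
                  key=lambda name: projection(w, candidates[name]),
                  reverse=True)


# Hypothetical feature vectors for three candidate alignments.
toy_candidates = {
    "correct": [3.0, 1.0],
    "slightly incorrect": [2.0, 1.0],
    "grossly incorrect": [0.0, 2.0],
}
good_ranking = rank_alignments([1.0, -0.5], toy_candidates)
bad_ranking = rank_alignments([-1.0, 0.0], toy_candidates)
```

With the first weight vector the order matches alignment quality; the second is an example of an incorrect projection, since it places the grossly incorrect alignment first.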
Phoneme Alignment. Slide 19 Online Learning Algorithm Hypotheses class Cumulative cost
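Phoneme Alignment. Slide 20 Online Learning
For each round:
- Receive an instance
- Predict an alignment
- Receive the true alignment and pay the cost
- If an update is required: set the update step, then update the hypothesis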
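
The update formulas are missing from this transcript. The sketch below shows one common form such an online step can take, a passive-aggressive style update: if the true alignment does not beat the predicted one by a margin tied to the cost, move the weights toward the true alignment's features. The square-root margin, the cap `C`, and all names are assumptions rather than the authors' stated rule.

```python
import math


def online_update(w, phi_true, phi_pred, cost, C=1.0):
    """One passive-aggressive style step (an assumed form, not the slide's).

    phi_true / phi_pred: feature vectors of the true and predicted alignments.
    cost: cost of the predicted alignment relative to the true one.
    """
    delta = [a - b for a, b in zip(phi_true, phi_pred)]
    margin = sum(wi * di for wi, di in zip(w, delta))
    loss = max(0.0, math.sqrt(cost) - margin)
    norm_sq = sum(d * d for d in delta)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)  # margin already large enough: no update
    step = min(C, loss / norm_sq)
    return [wi + step * di for wi, di in zip(w, delta)]


# Usage: the first call moves the weights, the second leaves them alone.
w1 = online_update([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], cost=0.25)
w2 = online_update(w1, [1.0, 0.0], [0.0, 1.0], cost=0.25)
```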
Phoneme Alignment. Slide 21 Converting from Online to Batch
- Run the online algorithm on the training set and generate w_1, …, w_M
- Small online error ⇒ there exists w ∈ {w_1, …, w_M} whose generalization error is low (Cesa-Bianchi et al.)
- Choose the w ∈ {w_1, …, w_M} that minimizes the error on a fresh validation set
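
The selection step of the conversion can be sketched in a few lines: keep every hypothesis the online algorithm produced, evaluate each on held-out data, and return the one with the lowest average cost. The callables `predict` and `cost` are placeholders for the alignment function and the alignment cost; the toy usage substitutes scalars for both.

```python
def select_hypothesis(hypotheses, validation_set, predict, cost):
    """Return the hypothesis with the lowest average cost on the validation set.

    predict(w, x) and cost(y_true, y_pred) are supplied by the caller
    (placeholders for the alignment function and the alignment cost).
    """
    def average_cost(w):
        total = sum(cost(y, predict(w, x)) for x, y in validation_set)
        return total / len(validation_set)

    return min(hypotheses, key=average_cost)


# Toy usage with scalar "hypotheses": the target is y = 2 * x.
toy_validation = [(1, 2), (2, 4)]
chosen = select_hypothesis(
    [1, 2, 3],
    toy_validation,
    predict=lambda w, x: w * x,
    cost=lambda y, y_hat: abs(y - y_hat),
)
```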
Phoneme Alignment. Slide 22 Algorithmic aspects
Running time: if the “inference” step can be performed in polynomial time (e.g. by dynamic programming), then the entire algorithm operates in polynomial time as well.
Worst-case analysis for online learning: stated relative to any competitor u.
Generalization error: the online-to-batch conversion guarantees that low online error ⇒ low generalization error.
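
To make the dynamic-programming remark concrete, here is a simplified sketch of inference when the score decomposes over frames: it finds the monotone assignment of frames to phonemes that maximizes the total per-frame score, in time proportional to (number of phonemes) × (number of frames). It is a stand-in that uses frame-level scores only, not the full inference over all the feature primitives, and the name `best_alignment` is hypothetical.

```python
def best_alignment(frame_scores):
    """Start frame of each phoneme maximizing the summed per-frame scores.

    frame_scores[k][t]: score of assigning frame t to the k-th phoneme.
    Every phoneme receives at least one frame and phonemes appear in order.
    """
    num_phonemes, num_frames = len(frame_scores), len(frame_scores[0])
    if num_frames < num_phonemes:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    best = [[neg_inf] * num_frames for _ in range(num_phonemes)]
    advanced = [[False] * num_frames for _ in range(num_phonemes)]
    best[0][0] = frame_scores[0][0]
    for t in range(1, num_frames):
        for k in range(num_phonemes):
            stay = best[k][t - 1]
            move = best[k - 1][t - 1] if k > 0 else neg_inf
            if move > stay:
                best[k][t] = move + frame_scores[k][t]
                advanced[k][t] = True  # phoneme k starts at frame t
            elif stay > neg_inf:
                best[k][t] = stay + frame_scores[k][t]
    # Trace back from the last phoneme at the last frame.
    starts = [0] * num_phonemes
    k = num_phonemes - 1
    for t in range(num_frames - 1, 0, -1):
        if advanced[k][t]:
            starts[k] = t
            k -= 1
    return starts


# Usage: three phonemes over five frames.
toy_scores = [[5, 0, 0, 0, 0],
              [0, 5, 5, 0, 0],
              [0, 0, 0, 5, 5]]
decoded = best_alignment(toy_scores)
```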
Phoneme Alignment. Slide 23 Experiments
- TIMIT corpus
- Phoneme representation: 48 phonemes (Lee & Hon, 1989)
- Acoustic representation: MFCC+∆+∆∆ (ETSI standard)
- TIMIT training set:
  - 500 utterances for training a frame classifier
  - 3096 utterances for learning the alignment function
  - 100 utterances used for validation
Phoneme Alignment. Slide 24 Alternative Approaches
- Brugnara, Falavigna & Omologo, Automatic segmentation and labeling of speech based on HMM, 1993.
- Hosom, Automatic phoneme alignment based on acoustic-phonetic modeling, 2002.
- Toledano, Gomez & Grande, Automatic Phoneme Alignment, 2003.
Phoneme Alignment. Slide 25 Results
Baseline: Brugnara, Falavigna and Omologo, “Automatic segmentation and labeling of speech based on Hidden Markov Models”, Speech Comm., 12 (1993) 357-370.

Method          | Training size | Test set    | t < 10 ms | t < 20 ms | t < 30 ms | t < 40 ms
Discrim. Algo.  | 650 or 3696   | 192 core    | 79.7%     | 92.1%     | 96.2%     | 98.1%
Brugnara et al. | 3696          | 192 core    | 75.3%     | 88.9%     | 94.4%     | 97.1%
Discrim. Algo.  | 650 or 2336   | 1344 entire | 80.0%     | 92.3%     | 96.4%     | 98.2%
Brugnara et al. | 2336          | 1344 entire | 74.6%     | 88.8%     | 94.1%     | 96.8%
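
The column headings suggest that each figure is the share of predicted phoneme boundaries lying within a tolerance t of the reference boundary. Assuming that reading, the metric could be computed as in the sketch below; the function name and the toy boundary times are illustrative only and are not data from the experiments.

```python
def boundary_accuracy(true_ms, predicted_ms, tolerances=(10, 20, 30, 40)):
    """Fraction of boundaries whose absolute error is below each tolerance (ms).

    Assumption: a boundary counts as correct when |error| < t, matching the
    "t < 10 ms" style of the column headings.
    """
    errors = [abs(y - y_hat) for y, y_hat in zip(true_ms, predicted_ms)]
    return {t: sum(1 for e in errors if e < t) / len(errors)
            for t in tolerances}


# Toy usage: boundary errors of 5, 15, 0 and 40 ms.
toy_accuracy = boundary_accuracy([0, 100, 200, 300], [5, 115, 200, 340])
```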
Phoneme Alignment. Slide 26 Current and Future Work
Discriminative learning methods for:
- Whole phoneme sequence classification: 64% (ours) vs. 59% (HMM, IDIAP Torch3); results without normalization of silences etc.
- Small vocabulary continuous speech recognition
- Segmentation of utterances to speakers
Full online learning setting: real-time adaptation to speaker/environment changes