CNTS LTG (UA)
(i) Phoneme-to-Grapheme
(ii) Transcription-to-Subtitles
Bart Decadt, Erik Tjong Kim Sang, Walter Daelemans
Machine Learning of Phoneme-to-Grapheme Conversion for Out-of-Vocabulary Handling in Speech Recognition
CNTS-Atranos
Out-of-Vocabulary (OOV) word problem
– Proper names
– Domain terminology
– Complex morphology (compounds)
– Example: gespreksonderwerp (topic of conversation) is recognized as gesprek zonder werk (conversation without work)
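Architecture
– Speech recognizer (ESAT): input is speech, output is text
– Confidence threshold: low-confidence words are flagged as suspected OOVs
– Phoneme recognizer (ESAT): produces a phoneme string for each suspected OOV
– P2G converter (TIMBL), trained on the training data: converts the phoneme string to a spelling
– Spelling correction with a large vocabulary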
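The routing step in this architecture can be sketched as follows. This is a minimal illustration, not the ESAT or TIMBL interface: the function `transcribe`, the tuple layout of a hypothesis, the threshold value, and the toy `toy_p2g` and `toy_correct` stand-ins are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# A hypothesis is (recognized word, confidence score, phoneme string).
# This layout is an assumption made for the sketch.
Hypothesis = Tuple[str, float, str]


def transcribe(
    hypotheses: List[Hypothesis],
    threshold: float,
    p2g: Callable[[str], str],
    correct: Callable[[str], str],
) -> List[str]:
    """Keep confident words; send suspected OOVs through P2G and correction."""
    output = []
    for word, confidence, phonemes in hypotheses:
        if confidence >= threshold:
            # Confident enough: trust the speech recognizer's word.
            output.append(word)
        else:
            # Suspected OOV: spell out the phoneme string, then correct it.
            output.append(correct(p2g(phonemes)))
    return output


def toy_p2g(phonemes: str) -> str:
    """Hypothetical P2G stand-in: a lookup table with one invented entry."""
    return {"kAst": "kasd"}.get(phonemes, phonemes)


def toy_correct(spelling: str) -> str:
    """Hypothetical spelling-correction stand-in with one invented entry."""
    return {"kasd": "kast"}.get(spelling, spelling)


result = transcribe(
    [("de", 0.95, "d@"), ("kans", 0.30, "kAst")],
    threshold=0.5,
    p2g=toy_p2g,
    correct=toy_correct,
)
print(result)
```

In a real system the two stand-ins would be replaced by the trained converter and a vocabulary-based corrector; only the thresholding logic is meant to carry over.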
Memory-Based Learning
– Classification-based (alignment): e.g. instance =,=,k,A,s,t,= with class a
– Similarity-based
– Parameter optimization:
  – MBL algorithm (ib1, igtree)
  – Number of nearest neighbors
  – Feature weighting method
  – Class distance weighting
– Timbl (1998, 2002)
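The instance =,=,k,A,s,t,= with class a can be read as a fixed-width window over an aligned phoneme string, with the focus phoneme in the middle and "=" as padding. The sketch below builds such windows and classifies with an unweighted overlap metric and a single nearest neighbor. It is an illustration only: the helper names are hypothetical, the example word (Dutch "kast", phonemes k A s t) is an assumption about where the slide's instance comes from, and TiMBL's feature weighting and class distance weighting are deliberately left out.

```python
from typing import List, Sequence, Tuple

PAD = "="  # padding symbol, as in the instance shown on the slide

Instance = Tuple[Tuple[str, ...], str]  # (feature window, grapheme class)


def make_instances(
    phonemes: List[str], graphemes: List[str], context: int = 3
) -> List[Instance]:
    """Turn a one-to-one aligned phoneme/grapheme pair into windowed instances."""
    if len(phonemes) != len(graphemes):
        raise ValueError("phonemes and graphemes must be aligned one-to-one")
    padded = [PAD] * context + phonemes + [PAD] * context
    instances = []
    for i, grapheme in enumerate(graphemes):
        # The window holds `context` phonemes on each side of the focus.
        window = tuple(padded[i : i + 2 * context + 1])
        instances.append((window, grapheme))
    return instances


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the feature positions at which two windows differ."""
    return sum(x != y for x, y in zip(a, b))


def classify_1nn(memory: List[Instance], query: Sequence[str]) -> str:
    """Return the class of the stored instance closest to the query."""
    nearest = min(memory, key=lambda inst: overlap_distance(inst[0], query))
    return nearest[1]


memory = make_instances(list("kAst"), list("kast"))
print(memory[1])
```

The second instance in `memory` is the window centered on the phoneme A, which matches the slide's example.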
Experiment
Training data (129k words, 9k OOVs):
– from ESAT’s phoneme recognizer
– error rate = ~29% (substitutions + insertions + deletions)
– phoneme deletions are problematic
Baselines:
– Near-perfect phoneme data (CELEX): 99.1 (grapheme), 91.4 (word)
– Probabilistic: 70.5 (grapheme), 60.2 (word); OOVs only: 30.0 (grapheme), 3.0 (word)
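An error rate that sums substitutions, insertions and deletions is conventionally obtained from a minimum edit distance alignment between the reference and the recognized phoneme string, divided by the reference length. The sketch below shows that convention; it is an assumption that the ~29% figure was computed this way, and the function names and toy phoneme strings are invented for the example.

```python
from typing import Sequence, Tuple


def edit_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (substitutions, insertions, deletions) of a minimum-cost alignment."""
    n, m = len(ref), len(hyp)
    # table[i][j] = (total cost, subs, ins, dels) for ref[:i] versus hyp[:j]
    table = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = (i, 0, 0, i)  # delete every reference symbol
    for j in range(1, m + 1):
        table[0][j] = (j, 0, j, 0)  # insert every hypothesis symbol
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            total, subs, ins, dels = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (total, subs, ins, dels)  # match, no cost
            else:
                diagonal = (total + 1, subs + 1, ins, dels)
            total, subs, ins, dels = table[i - 1][j]
            deletion = (total + 1, subs, ins, dels + 1)
            total, subs, ins, dels = table[i][j - 1]
            insertion = (total + 1, subs, ins + 1, dels)
            # Tuples compare by total cost first, so min() picks a cheapest path.
            table[i][j] = min(diagonal, deletion, insertion)
    _, subs, ins, dels = table[n][m]
    return subs, ins, dels


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """(substitutions + insertions + deletions) / reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    subs, ins, dels = edit_counts(ref, hyp)
    return (subs + ins + dels) / len(ref)


# Toy example: the final phoneme of the reference is deleted.
print(edit_counts(list("kAst"), list("kAs")), error_rate(list("kAst"), list("kAs")))
```

A deleted phoneme leaves the converter with no position at which to emit a grapheme, which is one way to read the remark that deletions are problematic.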
Results
– Performance: reported for all words and for OOVs, at grapheme level and at word level
– Spelling correction: net effect: 8.6 (OOVs)
– (Simulated) interaction with speech recognizer: increases WER, but improves readability
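The difference between grapheme-level and word-level scores can be made concrete with a small sketch. It assumes that predictions and gold spellings have equal length per word, as they would in a classification-based setup with one output per position; the function names are hypothetical and the scoring details used in the experiments may differ.

```python
from typing import List


def word_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of words whose predicted spelling is entirely correct."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def grapheme_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of grapheme positions predicted correctly, over all words."""
    total = 0
    correct = 0
    for p, g in zip(predicted, gold):
        if len(p) != len(g):
            raise ValueError("this sketch needs position-aligned spellings")
        total += len(g)
        correct += sum(a == b for a, b in zip(p, g))
    return correct / total


predicted = ["gespreksonberwerp"]
gold = ["gespreksonderwerp"]
print(word_accuracy(predicted, gold), grapheme_accuracy(predicted, gold))
```

One wrong grapheme out of 17 still counts as a fully wrong word, which is why word-level scores sit well below grapheme-level scores.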
Examples
– gespreksonderwerp
  speech recognizer: gesprek zonder werk
  P2G-converter: gespreksonberwerp
– speelgoedmitrailleur /sperGutnitrKj-yr/
  speech recognizer: speelgoed moet hier
  P2G-converter: spergoetmietrijer
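A near-miss such as gespreksonberwerp is the kind of output that spelling correction against a large vocabulary can repair. The sketch below uses Python's standard `difflib.get_close_matches` as a stand-in similarity search; it is not the correction method used in the project, and the tiny vocabulary, the `cutoff` value and the function name `correct_spelling` are assumptions for illustration.

```python
import difflib
from typing import List


def correct_spelling(candidate: str, vocabulary: List[str], cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the candidate if none is close enough."""
    matches = difflib.get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


# Toy vocabulary; a real system would use a large word list.
vocabulary = ["gesprek", "gespreksonderwerp", "werk", "speelgoed"]

print(correct_spelling("gespreksonberwerp", vocabulary))
```

Leaving the candidate unchanged when nothing in the vocabulary is similar enough keeps the phonetic spelling readable instead of forcing a wrong word.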
Automatic subtitling (normalization)
Data collection and alignment
Architecture
– News autocues and subtitles: (semi-)automatic data capture
– (Semi-)automatic alignment of autocues with subtitles
– Linguistic annotation
– Training data for the machine learner
– Classifier: autocues to subtitles
Status (March 02)
– Teletext subtitle data capture hardware and software
– Software for VRT autocue file processing
– Software for alignment of autocues with subtitles
– Autocue-subtitle alignment
– Similar procedure for VRT soap series “Thuis” data
Statistical Subtitle Prediction
Baseline experiment
– 8000 words of soap (Thuis)
– actor scenario word-aligned with subtitles
– classification task (memory-based learning): predict deletion, substitution, copy
– Features: focus word + 8 words context + POS tags
– Feature selection (hill-climbing) selects only focus word
Results (10-fold CV)
– 71.7% (copy all: 67.3%)
– Most frequent replacement: {ge, gij, u, uw} replaced by je
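The classification task and the "copy all" baseline can be sketched as follows. Everything here is illustrative: the helper names are hypothetical, the aligned word pairs are an invented toy sentence (only the ge to je replacement comes from the slide), and the split of the 8 context words into 4 on each side is an assumption. Since feature selection kept only the focus word, the learned model is approximated here by a majority-class lookup per focus word.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

AlignedPair = Tuple[str, Optional[str]]  # (scenario word, subtitle word or None)


def label_pair(source: str, target: Optional[str]) -> str:
    """Map one aligned word pair to a class: delete, copy or substitute."""
    if target is None:
        return "delete"
    return "copy" if target == source else "substitute"


def windows(words: List[str], context: int = 4, pad: str = "_") -> List[Tuple[str, ...]]:
    """Focus word plus `context` words on each side (8 context words by default)."""
    padded = [pad] * context + words + [pad] * context
    return [tuple(padded[i : i + 2 * context + 1]) for i in range(len(words))]


def copy_all_accuracy(labels: List[str]) -> float:
    """Accuracy of the baseline that always predicts 'copy'."""
    return sum(label == "copy" for label in labels) / len(labels)


def train_focus_lookup(pairs: List[AlignedPair]) -> Dict[str, str]:
    """Majority class per focus word, mimicking a focus-word-only classifier."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for source, target in pairs:
        counts[source][label_pair(source, target)] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Invented toy alignment; None marks a word dropped from the subtitle.
pairs: List[AlignedPair] = [
    ("ge", "je"), ("komt", "komt"), ("toch", None), ("morgen", "morgen"),
]
labels = [label_pair(s, t) for s, t in pairs]
lookup = train_focus_lookup(pairs)
print(labels, copy_all_accuracy(labels), lookup.get("ge", "copy"))
```

Falling back to "copy" for unseen focus words keeps the lookup at least as good as the copy-all baseline on words it has never observed.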