Automatic Detection-based Phone Recognition on TIMIT
Hung-Shin Lee (李鴻欣)
Institute of Information Science (IIS), Academia Sinica, 12 July
Based on Chen and Wang in ISCSLP’08 and Interspeech’09
Page-2 Detection-Based ASR
Human SR
–Knowledge detection → knowledge integration, with knowledge (higher level)
–Attributes: phonological attr., prosodic attr., acoustic attr., …
DB ASR
–Detectors (HMM, CRF, …) → integrator (HMM, CRF, …) → results
–Results: phone, syllable, word, sentence, semantic info, …
Page-3 Phonological Systems
SPE (Sound Pattern of English)
–Literature: N. Chomsky & M. Halle, 1968
–Feature types: production-based, binary
–Examples: anterior, nasal, round
MV (Multi-valued Feature)
–Literature: S. King, 2000 (?)
–Feature types: production-based, 2-10 values
–Examples: centrality, front-back, manner, phonation, place, roundness
GP (Government Phonology)
–Literature: J. Harris, 1994
–Feature types: sound structure primes, binary
Page-4 Phonological Feature Detection (1)
–Input layer: 13 MFCCs × 9 frames (i-4 … i+4)
–MLP (detectors): hidden layer; recurrent, time-delay
–Output: posterior probability → quantization
–Feature sets: SPE_, GP_
Page-5 Phonological Feature Detection (2)
–Input: 13 MFCCs × 9 frames (i-4 … i+4), time-delay
–One MLP per MV feature: MLP (Centrality), MLP (Front-Back), MLP (Roundness); 6 MV features
–Output: MV_29
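The detector input and output handling on the two slides above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, edge frames are padded by repetition (the slides do not say how utterance boundaries are handled), and the number of quantization levels is a placeholder, not a value from the slides.

```python
def stack_context(frames, left=4, right=4):
    """Concatenate frames i-left .. i+right into one vector per frame i.

    frames is a list of equal-length feature vectors (e.g. 13 MFCCs each).
    Utterance edges are padded by repeating the first/last frame; this is an
    assumption, the slides do not say how boundaries are handled.
    """
    last = len(frames) - 1
    stacked = []
    for i in range(len(frames)):
        vector = []
        for offset in range(-left, right + 1):
            j = min(max(i + offset, 0), last)  # clamp to a valid frame index
            vector.extend(frames[j])
        stacked.append(vector)
    return stacked


def quantize_posterior(p, num_levels=10):
    """Map a posterior in [0, 1] to an integer bin 0 .. num_levels-1.

    num_levels is a hypothetical choice, not a value from the slides.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("posterior must lie in [0, 1]")
    return min(int(p * num_levels), num_levels - 1)
```

With 13 MFCCs and a window of i-4 … i+4, each stacked vector has 9 × 13 = 117 values, which is the size of the MLP input layer implied by the slide.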
Page-6 Conditional Random Field (CRF) Integrator
General chain CRF
–Input X (phonological features): …, x_{i-1}, x_i, x_{i+1}, …
–Output Y (phone): …, y_{i-1}, y_i, …
–State feature function, transition feature function
–λ_j, μ_k: feature function weight parameters
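As a rough illustration of how a chain CRF scores a phone sequence, the sketch below computes log p(Y | X) from two score tables. It assumes the weighted state feature functions have already been summed into `state[i][y]` and the weighted transition feature functions into `trans[y_prev][y]`; these table names and the helper functions are hypothetical, and this is the textbook forward recursion rather than the code inside CRF++.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(state, trans, labels):
    """Unnormalised log-score of one label sequence."""
    score = state[0][labels[0]]
    for i in range(1, len(labels)):
        score += trans[labels[i - 1]][labels[i]] + state[i][labels[i]]
    return score


def log_partition(state, trans):
    """log Z(X): forward recursion over all label sequences."""
    num_labels = len(state[0])
    alpha = list(state[0])
    for i in range(1, len(state)):
        alpha = [
            log_sum_exp([alpha[p] + trans[p][c] for p in range(num_labels)])
            + state[i][c]
            for c in range(num_labels)
        ]
    return log_sum_exp(alpha)


def log_prob(state, trans, labels):
    """log p(Y | X) for a linear-chain CRF."""
    return sequence_score(state, trans, labels) - log_partition(state, trans)
```

The forward recursion keeps the cost linear in the sequence length instead of enumerating every phone sequence.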
Page-7 CRF Integrator – Training Issues
Required labels for CRF training
–Phone: y
–Phonological features: x
Oracle-data trained CRF (OT CRF)
–Training data: phone labels → mapping phones → phonological features → phonological features
–OT CRF is trained on these phonological features and the phone labels
Detected-data trained CRF (DT CRF)
–Training data: speech → MLP detectors → phonological features (with errors)
–DT CRF is trained on these detected phonological features and the phone labels
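The oracle-data path amounts to a table lookup from each phone label to its feature values. A minimal sketch follows, assuming a toy table: the three phones and three SPE-style features are illustrative only, and the names `TOY_TABLE`, `FEATURE_ORDER`, and `oracle_features` are hypothetical. The detected-data path would replace the lookup with the MLP detector outputs for the same utterance.

```python
# Toy mapping table: phone -> binary feature values. Only three phones and
# three SPE-style features are listed; a real table would cover every phone
# in the phone set and every feature of the chosen system.
TOY_TABLE = {
    "n": {"anterior": 1, "nasal": 1, "round": 0},
    "ow": {"anterior": 0, "nasal": 0, "round": 1},
    "k": {"anterior": 0, "nasal": 0, "round": 0},
}
FEATURE_ORDER = ("anterior", "nasal", "round")


def oracle_features(phones, table=TOY_TABLE, order=FEATURE_ORDER):
    """Return one feature vector x per phone label y (oracle data)."""
    return [[table[p][f] for f in order] for p in phones]
```

Because the lookup is deterministic, oracle features contain no detection errors, which is the contrast with detected data that the slide draws.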
Page-8 Experiments
Corpus: TIMIT
–No SA1, SA2
–Training set (3296 utts), Dev set (400 utts)
–Test set (1344 utts)
Phone set: TIMIT61
–Evaluation: CMU/MIT 39
Baseline
–CI-HMM
Toolkits
–Nico Toolkit (for MLP), CRF++ (for CRF)
Page-9 Results (1)
Model: OT CRF; Test: OD features
–Columns: Phone Corr. %, Phone Acc. %
–Rows: SPE, GP, MV
Model: OT/DT CRF; Test: DD features
–Columns: Phone Corr. %, Phone Acc. %
–Rows: HMM-baseline; OT CRF (SPE, GP, MV); DT CRF (SPE, GP, MV)
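Phone correct and phone accuracy are conventionally computed from a minimum-edit alignment of the recognized phones against the reference: Corr = (N - S - D) / N and Acc = (N - S - D - I) / N, with N reference phones, S substitutions, D deletions, and I insertions. The sketch below assumes unit edit costs, so its tie-breaking may differ from a specific scoring tool, and it does not include the mapping from the 61 TIMIT phones to the 39 evaluation classes. The function names are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    # Each cell is (total_cost, subs, dels, ins); tuples compare by cost first.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        cur = [(i, 0, i, 0)]
        for j, h in enumerate(hyp, start=1):
            c, s, d, n = prev[j - 1]
            diag = (c, s, d, n) if r == h else (c + 1, s + 1, d, n)
            c, s, d, n = prev[j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = cur[j - 1]
            insertion = (c + 1, s, d, n + 1)
            cur.append(min(diag, deletion, insertion))
        prev = cur
    _, s, d, n = prev[-1]
    return s, d, n


def corr_acc(ref, hyp):
    """Return (percent correct, percent accuracy) for one reference/hypothesis pair."""
    s, d, n = align_counts(ref, hyp)
    total = len(ref)
    corr = 100.0 * (total - s - d) / total
    acc = 100.0 * (total - s - d - n) / total
    return corr, acc
```

Accuracy is never higher than correctness because it additionally penalizes insertions, which is why the two columns are reported side by side.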
Page-10 Results (2)
Columns: Methods, #, System, Phone Corr. (%), Phone Acc. (%)
–HMM baseline
System Fusion
–OT: SPE+GP+MV
–DT: SPE+GP+MV
–OT+DT: SPE+GP+MV
–OT: SPE+GP+MV +HMM
–DT: SPE+GP+MV +HMM
–OT+DT: SPE+GP+MV +HMM
Page-11 System Fusion with CRF
–Input X (phone sequences): …, x_{i-1}, x_i, x_{i+1}, …, from the SPE sys., MV sys., GP sys., and HMM sys.
–Output Y (combined results, phone): …, y_{i-1}, y_i, …
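One way to prepare such fusion data for CRF++ is to write one token per line, with one whitespace-separated column per system and the reference phone as the last column, and a blank line between sequences, which is the column layout CRF++ training files use. The sketch below is an assumption about how the inputs might be arranged, not the authors' pipeline: the function name is hypothetical, the column order (SPE, MV, GP, HMM) simply follows the slide, and the outputs of all systems are assumed to be already aligned token by token.

```python
def to_crfpp(sequences):
    """Format fusion data as CRF++-style training text.

    sequences is a list of (columns, labels) pairs. columns holds one list of
    phone outputs per system; labels holds the reference phones. All lists
    belonging to one sequence must have the same length.
    """
    blocks = []
    for columns, labels in sequences:
        lengths = {len(c) for c in columns} | {len(labels)}
        if len(lengths) != 1:
            raise ValueError("all systems and the reference must have equal length")
        rows = [
            " ".join(obs + (lab,))  # observation columns first, answer last
            for obs, lab in zip(zip(*columns), labels)
        ]
        blocks.append("\n".join(rows))
    return "\n\n".join(blocks) + "\n"
```

For example, `to_crfpp([([["n", "eh"], ["n", "ae"], ["m", "eh"], ["n", "eh"]], ["n", "eh"])])` yields two rows of five columns each.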
Page-12 Two Types of AFDT Imperfection
–AF asynchrony
–AFDT errors
Figure tiers: Phone, AF(A), AF(A’)
Phone labels in the figure: h# n eh ow kcl k w eh ae eh s tcl t ix n
Page-13 CRF Training (1)
Oracle Data Training
–Phone y → mapping table (phone → AFs) → AFs x_t
Detected Data Training
–AFDT → AFs x_t (with detected errors), paired with phone y
Page-14 CRF Training (2)
Aligned Data Training
–AFDT → AF sequence → AFs x_t, paired with phone y
Page-15 Results (3)
Columns: System, Phone Corr. (%), Phone Acc. (%)
Upper Bound
–OT CRF
–AT CRF
Real Case
–OT CRF
–DT CRF
–AT CRF
Observations
–… % acc. drop on the introduction of AF asynchrony
–Detection error causes a further 7.99 % acc. drop
Page-16 AF Asynchrony Compensation
AF asynchrony is caused by context variation
We can reduce AF asynchrony by letting our systems learn context variation directly
–Long-term information
Figure labels: 23 dim Mel; 310 ms; left context, right context; windows + DCTs (one per context); MLPs; 144 dim; 72 dim
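The long-term front end on this slide can be pictured as follows: take a long trajectory of each Mel band around the current frame, weight it with a window, split it into a left and a right context, and compress each half with a DCT before the MLPs. The sketch below is a loose illustration and makes several assumptions that are not on the slide: a 10 ms frame shift (so 310 ms is 31 frames), a Hamming window, a centre frame shared by both halves, and 3 retained DCT coefficients per band. It does not attempt to reproduce the 72 and 144 dimensions shown in the figure, and all function names are hypothetical.

```python
import math


def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * t / (n - 1)) for t in range(n)]


def dct_ii(x, num_coeffs):
    """First num_coeffs terms of the unnormalised DCT-II of x."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(num_coeffs)
    ]


def split_context_features(mel, centre, half=15, num_coeffs=3):
    """Return (left_vec, right_vec) for the frame at index centre.

    mel is a list of frames, each a list of Mel-band energies. half=15 gives
    a 31-frame window; the halves overlap at the centre frame (assumption).
    """
    if centre - half < 0 or centre + half >= len(mel):
        raise ValueError("context window does not fit inside the utterance")
    window = hamming(2 * half + 1)
    left_vec, right_vec = [], []
    for band in range(len(mel[0])):
        track = [mel[t][band] for t in range(centre - half, centre + half + 1)]
        weighted = [v * w for v, w in zip(track, window)]
        left_vec.extend(dct_ii(weighted[: half + 1], num_coeffs))
        right_vec.extend(dct_ii(weighted[half:], num_coeffs))
    return left_vec, right_vec
```

With 23 Mel bands and the placeholder of 3 coefficients, each context vector has 69 values; the two vectors would then feed separate left-context and right-context networks.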
Page-17 Results (4)
Columns: Test Data Type, System, Corr, Acc
Test data type: -
–CI-HMM
–CD-HMM
Test data type: Detected (real case)
–OT CRF (±3)
–Long Term AFDT + DT CRF (±3)
Test data type: Ideal (upper bound)
–Long Term AFDT + AT CRF
–MFCC AFDT + AT CRF (±3)
–Long Term AFDT + AT CRF (±3)
Test data type: Detected (real case)
–Long Term AFDT + AT CRF
–MFCC AFDT + AT CRF (±3)
–Long Term AFDT + AT CRF (±3)
Page-18 Conclusions
A well-designed phonological feature system is important
–AF asynchrony minimization training and AF-phone synchronization could also be investigated
Oracle-trained CRF is able to retrieve more phonological information from speech
–High phone correct rate (but sensitive to detection errors)
–Helpful for combination
Detection-based ASR is promising
–The front-end detector is a major issue
Page-19 AF and Phone Alignment Using AFDT
Figure: phone sequence and AF sequence aligned over time t