Compensating speaker-to-microphone playback system for robust speech recognition
So-Young Jeong and Soo-Young Lee
Brain Science Research Center and Department of Electrical Engineering and Computer Science
Korea Advanced Institute of Science and Technology
Motivation
ASR in mismatched environments
Environmental information
 – Background noise, acoustic/transmission channel
Assume an environment degradation model:
 Clean speech → Channel → Additive noise → Distorted speech
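The degradation model above is commonly written as y[n] = (h * x)[n] + v[n]: clean speech x is convolved with a channel impulse response h, and additive noise v is added afterwards. The minimal sketch below simulates that model, assuming a causal FIR channel and a precomputed noise sequence; the function name `degrade` is hypothetical.

```python
def degrade(clean, channel, noise):
    """Apply a causal FIR channel to `clean`, then add `noise` sample by sample.

    y[n] = sum_k channel[k] * clean[n - k] + noise[n], truncated to len(clean).
    """
    if len(noise) != len(clean):
        raise ValueError("noise must have the same length as clean")
    distorted = []
    for n in range(len(clean)):
        # Convolution sum over the channel taps that fall inside the signal.
        acc = 0.0
        for k, h in enumerate(channel):
            if n - k < 0:
                break
            acc += h * clean[n - k]
        distorted.append(acc + noise[n])
    return distorted


# Usage: a two-tap averaging channel plus a small noise burst on the first sample.
print(degrade([2.0, 4.0, 6.0], [0.5, 0.5], [0.1, 0.0, 0.0]))  # → [1.1, 3.0, 5.0]
```

Keeping the channel and the noise as separate arguments mirrors the two stages of the model, so either stage can be switched off (an identity channel `[1.0]` or an all-zero noise sequence).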
Channel Impacts on Features
[Figure: channel impact at the P.S., F.B., L.S., and C.S. feature levels under Channel Assumption 1 and Channel Assumption 2]
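One standard way to see how a channel propagates through the feature levels: a linear, time-invariant channel multiplies the power spectrum by |H(f)|², so after the logarithm it becomes a constant additive offset in the log spectrum, and, because the DCT is linear, in the cepstrum as well. This is why per-utterance mean subtraction (CMS) can remove such a channel. The sketch below assumes per-frame power spectra stored as plain Python lists and a known, hypothetical per-bin channel gain; it does not model the nonlinear or additive-noise effects that mean subtraction cannot remove, and it is not the authors' setup.

```python
import math


def log_power(frames, floor=1e-10):
    """Convert per-frame power spectra to log power spectra (natural log)."""
    return [[math.log(max(p, floor)) for p in frame] for frame in frames]


def apply_channel(frames, gain):
    """Multiply every frame's power spectrum by a fixed per-bin gain |H(f)|^2."""
    return [[p * g for p, g in zip(frame, gain)] for frame in frames]


def mean_subtract(frames):
    """Subtract the per-utterance mean of each feature dimension (CMS-style)."""
    n = len(frames)
    means = [sum(col) / n for col in zip(*frames)]
    return [[v - m for v, m in zip(frame, means)] for frame in frames]


clean_ps = [[1.0, 4.0, 9.0], [2.0, 3.0, 5.0], [0.5, 8.0, 1.0]]
channel_gain = [0.3, 1.5, 0.8]  # hypothetical |H(f)|^2 per frequency bin

clean_ls = log_power(clean_ps)
distorted_ls = log_power(apply_channel(clean_ps, channel_gain))

# In the log domain the channel appears as a constant offset log|H(f)|^2 per bin ...
offset = [d - c for d, c in zip(distorted_ls[0], clean_ls[0])]
# ... so mean subtraction maps clean and distorted features to the same values.
after_clean = mean_subtract(clean_ls)
after_distorted = mean_subtract(distorted_ls)
```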
Speaker-to-Microphone Playback
Speaker distortion
 – Nonlinearity caused by the voice coil
Microphone distortion
 – Frequency-response variation caused by differences in fabrication
 – Nonlinearity caused by limited dynamic range
 – Ambient noise pickup due to directionality
[Figure: speaker-to-microphone compensation]
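The nonlinear distortions listed above are exactly what a linear channel model cannot capture. As a toy illustration only, the sketch below models the voice-coil nonlinearity as a memoryless tanh soft saturation and the microphone's limited dynamic range as hard clipping; both functional forms and the `drive` and `limit` parameters are assumptions, not measurements of any real speaker or handset.

```python
import math


def speaker_saturation(x, drive=2.0):
    """Hypothetical voice-coil model: tanh soft saturation, scaled so 1.0 maps to 1.0."""
    scale = math.tanh(drive)
    return [math.tanh(drive * s) / scale for s in x]


def microphone_clip(x, limit=0.9):
    """Hypothetical dynamic-range model: hard-clip samples to [-limit, limit]."""
    return [max(-limit, min(limit, s)) for s in x]


def playback(x, drive=2.0, limit=0.9):
    """Chain the two nonlinearities: speaker first, then microphone."""
    return microphone_clip(speaker_saturation(x, drive), limit)


# Doubling the input does not double the output, so the chain is not linear.
quiet = playback([0.25])[0]
loud = playback([0.5])[0]
print(quiet, loud, 2 * quiet)
```

Because the output depends on signal level, the distortion cannot be folded into a single fixed frequency response, which motivates a learned mapping rather than channel normalization alone.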
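Speaker-to-Microphone Mapping
Where, and which type of, mapper should be deployed?
Mapper training: clean speech → F.E. → target (+); distorted speech → F.E. → Mapper → output; the error between target and output trains the mapper
Mapper application: distorted speech → F.E. → Trained Mapper → to recognizer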
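The training loop in the diagram, where the mapper output is compared against the clean features and the error is fed back, can be sketched for the simplest case of an affine mapper y = W x + b trained by batch gradient descent on the mean squared error. This is a minimal sketch under assumptions: features are plain Python lists of frame-aligned pairs, and `train_linear_mapper`, the learning rate, and the epoch count are illustrative choices, not the authors' training procedure, which the slides do not specify.

```python
def apply_mapper(W, b, x):
    """Map one distorted feature vector x to an estimate of the clean vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def train_linear_mapper(distorted, clean, lr=0.1, epochs=2000):
    """Fit y = W x + b by batch gradient descent on 0.5 * mean squared error.

    distorted, clean: lists of equal-length feature vectors (frame-aligned pairs).
    """
    n = len(distorted)
    dim_in, dim_out = len(distorted[0]), len(clean[0])
    # Start from the identity-like mapping (no compensation) with zero bias.
    W = [[1.0 if i == j else 0.0 for j in range(dim_in)] for i in range(dim_out)]
    b = [0.0] * dim_out
    for _ in range(epochs):
        grad_W = [[0.0] * dim_in for _ in range(dim_out)]
        grad_b = [0.0] * dim_out
        for x, target in zip(distorted, clean):
            y = apply_mapper(W, b, x)
            for i in range(dim_out):
                err = y[i] - target[i]  # mapper output minus clean feature
                grad_b[i] += err
                for j in range(dim_in):
                    grad_W[i][j] += err * x[j]
        for i in range(dim_out):
            b[i] -= lr * grad_b[i] / n
            for j in range(dim_in):
                W[i][j] -= lr * grad_W[i][j] / n
    return W, b


# Usage: recover a known affine mapping from paired toy data.
distorted = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
true_W, true_b = [[2.0, 0.5], [-1.0, 1.0]], [1.0, -0.5]
clean = [apply_mapper(true_W, true_b, x) for x in distorted]
W, b = train_linear_mapper(distorted, clean)
```

The same error-driven loop applies to nonlinear mappers such as an MLP; only the forward map and its gradients change.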
Mapping Error at L.S.
[Figure: diamond, plus, and cross markers denote mapping at the P.S., F.B., and L.S. levels, respectively]
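To compare mappers placed at different feature levels on one scale, their outputs can be converted to the L.S. domain before the error against the clean L.S. features is measured. A minimal sketch, assuming that mapped P.S. or F.B. outputs are linear-domain energies on the same frequency grid as the clean L.S. features (so that taking the natural log makes them comparable) and that the error is a per-dimension mean squared error; the function names and the flooring constant are hypothetical.

```python
import math


def to_log_domain(frames, floor=1e-10):
    """Hypothetical conversion of linear-domain (P.S./F.B.-style) energies to log values."""
    return [[math.log(max(v, floor)) for v in frame] for frame in frames]


def per_dim_mse(mapped, clean):
    """Mean squared error per feature dimension, averaged over frames."""
    n = len(mapped)
    totals = [0.0] * len(mapped[0])
    for m_frame, c_frame in zip(mapped, clean):
        for d, (m, c) in enumerate(zip(m_frame, c_frame)):
            totals[d] += (m - c) ** 2
    return [t / n for t in totals]


clean_ls = [[0.0, 1.0], [0.5, 2.0]]
# Output of a mapper that worked in the linear (P.S.) domain; logged before comparison.
mapped_ps = [[1.0, math.e], [math.exp(0.5), math.exp(1.5)]]
errors = per_dim_mse(to_log_domain(mapped_ps), clean_ls)
```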
Frequency correlation plots
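One plausible reading of a frequency correlation plot is the correlation, across frames, between frequency channel i of the clean features and channel j of the distorted features. The sketch below computes that Pearson correlation matrix; this interpretation and the function names are assumptions. Strong off-diagonal entries would suggest that a full (non-diagonal) mapper can exploit cross-frequency information.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # undefined for a constant channel; report zero
    return cov / math.sqrt(var_a * var_b)


def frequency_correlation(clean, distorted):
    """Matrix C[i][j] = corr(clean channel i, distorted channel j) over frames."""
    clean_ch = list(zip(*clean))  # transpose: one tuple per frequency channel
    distorted_ch = list(zip(*distorted))
    return [[pearson(c, d) for d in distorted_ch] for c in clean_ch]


clean = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
distorted = [[0.5 * c0 + 1.0, -c1] for c0, c1 in clean]  # toy per-channel distortion
C = frequency_correlation(clean, distorted)
```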
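Recognition Experiments
Task
 – Phoneme recognition with the set of 40 TIMIT phones
 – Phone accuracy = (N - D - S - I) × 100 / N, where N is the number of reference phones and D, S, and I are the numbers of deletions, substitutions, and insertions
Database
 – HTIMIT: TIMIT sentences re-recorded through 10 different telephone handsets
 – Training: 246 speakers × 8 sentences = 1968 sentences
 – Test: 48 speakers × 8 sentences = 384 sentences
Baseline
 – 3-state monophone HMM with 16 Gaussian mixtures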
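The D, S, and I counts in the phone accuracy formula come from aligning the recognized phone string against the reference. A minimal sketch of that computation, assuming unit costs for all three error types (the slides do not state the costs); `align_counts` and `phone_accuracy` are hypothetical helper names.

```python
def align_counts(ref, hyp):
    """Return (S, D, I) from a minimum-edit-distance alignment with unit costs."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j]: edit distance between ref[:i] and hyp[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i
    for j in range(1, cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right cell to count each error type.
    s = d = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            s += ref[i - 1] != hyp[j - 1]  # match adds 0, substitution adds 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            d += 1  # reference phone missing from the hypothesis
            i -= 1
        else:
            ins += 1  # extra phone in the hypothesis
            j -= 1
    return s, d, ins


def phone_accuracy(ref, hyp):
    """Phone accuracy = (N - D - S - I) * 100 / N, with N = len(ref)."""
    s, d, ins = align_counts(ref, hyp)
    n = len(ref)
    return (n - d - s - ins) * 100 / n


print(phone_accuracy(["sil", "k", "ae", "t"], ["sil", "k", "eh", "t", "s"]))  # → 50.0
```

With unit costs, S + D + I always equals the edit distance, so the accuracy does not depend on how ties are broken during the backtrace; only the split between the three counts does. Insertions are penalized, so accuracy can become negative for very noisy output.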
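Experiment I – CI result

Type | Matched | Mismatch | CMS | DIAG | LIN | PER | MLP
senh | 54.7    |          |     |      |     |     |
cb   |         |          |     |      |     |     |
cb   |         |          |     |      |     |     |
cb   |         |          |     |      |     |     |
cb   |         |          |     |      |     |     |
el   |         |          |     |      |     |     |
el   |         |          |     |      |     |     |
el   |         |          |     |      |     |     |
el   |         |          |     |      |     |     |
pt   |         |          |     |      |     |     |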
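Assuming DIAG denotes a diagonal mapper, i.e. an independent affine map y_d = a_d x_d + b_d for each feature dimension d (the slides do not define the column labels), its parameters have a closed-form least-squares solution per dimension. Unlike CMS, which only removes a per-utterance mean from each dimension, a diagonal mapper learned from paired clean/distorted data can also correct a per-dimension scale. The sketch is illustrative; `fit_diagonal_mapper` and `apply_diagonal_mapper` are hypothetical names.

```python
def fit_diagonal_mapper(distorted, clean):
    """Least-squares fit of y_d = a_d * x_d + b_d independently for each dimension d."""
    n = len(distorted)
    scales, biases = [], []
    for x_col, y_col in zip(zip(*distorted), zip(*clean)):
        mean_x, mean_y = sum(x_col) / n, sum(y_col) / n
        var_x = sum((x - mean_x) ** 2 for x in x_col)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_col, y_col))
        # Fall back to a pure offset if this dimension is constant in the data.
        a = cov_xy / var_x if var_x > 0.0 else 1.0
        scales.append(a)
        biases.append(mean_y - a * mean_x)
    return scales, biases


def apply_diagonal_mapper(scales, biases, frame):
    """Map one distorted feature vector dimension by dimension."""
    return [a * x + b for a, b, x in zip(scales, biases, frame)]


distorted = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
clean = [[3.0, 5.0], [5.0, 6.0], [7.0, 7.0]]  # clean_0 = 2*x_0 + 1, clean_1 = 0.5*x_1
scales, biases = fit_diagonal_mapper(distorted, clean)
```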
Conclusion
Speech signals distorted by a low-quality speaker-to-microphone playback system can be compensated with a feature-mapping network.
The feature-mapping scheme would be useful when environmental conditions make it difficult to collect a database.