1
Using Speech Recognition to Predict VoIP Quality
Wenyu Jiang IRT Lab April 3, 2002
2
Introduction to Voice Quality
Quality factors in Voice over IP (VoIP): packet loss, delay, and jitter; choice of voice codec
Quality metric: Mean Opinion Score (MOS)
Widely used, human based
Time consuming, labor intensive; results not available in real time
MOS grades: Excellent = 5, Good = 4, Fair = 3, Poor = 2, Bad = 1
3
Motivation
Features of a speech recognizer:
Automatic speech recognition (ASR): no human listeners needed
Recognition accuracy is apparently coupled with the quality of the input speech
Recognition can be done in real time, allowing online quality monitoring
Recognition performance may be related to speech intelligibility as well as quality
4
Related Work
ITU-T E-model [G.107/G.108]
An analytical model for estimating perceived quality
Provides loss-to-MOS mappings for some common codecs (G.729, G.711, G.723.1)
Chernick et al. study speech recognition performance with the DoD CELP codec:
Effect of bit error rate instead of packet loss
Phoneme (instead of word) recognition ratio
Some MOS results, but not accurate enough
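The slide only names the E-model; as a point of reference, the sketch below shows the standard R-factor-to-MOS conversion defined in ITU-T G.107, in Python. The loss- and codec-dependent impairments that produce the R factor are not shown, and the example R value is illustrative.

```python
def emodel_r_to_mos(r):
    """Convert an E-model R factor (ITU-T G.107) to an estimated MOS.

    The loss-dependent impairment for a codec such as G.729 would be
    computed separately and subtracted from R before calling this.
    """
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

# Example: a default-condition R of about 93 maps to roughly MOS 4.4
print(emodel_r_to_mos(93.2))
```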
5
Experiment Setup
Speech recognition engine: IBM ViaVoice on Linux
Wrote software for both voice model training and performance testing
Training and testing: 2 scripts, #1 for training, #2 for testing
2 speakers, A and B, both read both scripts
Script #2 is split into 25 audio clips, with 5 clips per loss condition (0%, 2%, 5%, 10%, 15%)
Codec: G.729; training uses G.729-processed audio
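The slides do not specify how the loss conditions were imposed; the sketch below assumes independent (Bernoulli) frame loss applied to a G.729 frame stream, one frame per packet, purely to illustrate how the 0-15% conditions could be generated. The frames and concealment handling are placeholders.

```python
import random

FRAME_BYTES = 10  # one G.729 frame: 10 ms of speech in 10 bytes (8 kbit/s)

def drop_frames(frames, loss_prob, seed=0):
    """Mark each frame as lost independently with probability loss_prob.

    Lost frames become None; a real G.729 decoder would run its
    frame-erasure concealment at those positions instead.
    """
    rng = random.Random(seed)
    return [None if rng.random() < loss_prob else f for f in frames]

# Example: impose the five loss conditions used in the experiment
clip = [b"\x10" * FRAME_BYTES] * 3000  # ~30 s of encoded speech (placeholder frames)
for p in (0.0, 0.02, 0.05, 0.10, 0.15):
    lost = sum(f is None for f in drop_frames(clip, p))
    print(f"loss={p:.0%}: {lost} of {len(clip)} frames dropped")
```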
6
Experiment Setup, contd.
Performance metrics:
Absolute word recognition ratio Rabs(p): fraction of words recognized correctly at packet loss probability p
Relative word recognition ratio Rrel(p) = Rabs(p) / Rabs(0%)
MOS listening tests: 22 listeners
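A minimal sketch of how the two metrics could be computed from the reference script and the recognizer's transcript. The exact word-matching rule is not given on the slide; a longest-common-subsequence alignment over words is assumed here.

```python
def matched_words(reference, hypothesis):
    """Count reference words matched by the ASR output, using a
    longest-common-subsequence alignment over words (an assumed
    counting rule; the slides do not specify the exact procedure)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # lcs[i][j] = LCS length of ref[:i] and hyp[:j]
    lcs = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, rw in enumerate(ref, 1):
        for j, hw in enumerate(hyp, 1):
            lcs[i][j] = (lcs[i - 1][j - 1] + 1 if rw == hw
                         else max(lcs[i - 1][j], lcs[i][j - 1]))
    return lcs[len(ref)][len(hyp)]

def r_abs(reference, hypothesis):
    """Absolute word recognition ratio: fraction of reference words recognized."""
    return matched_words(reference, hypothesis) / len(reference.split())

def r_rel(r_abs_at_p, r_abs_at_zero):
    """Relative recognition ratio: R_abs at loss p normalized by R_abs at 0% loss."""
    return r_abs_at_p / r_abs_at_zero

# Example: one reference word is missed, so R_abs = 5/6
print(r_abs("please call stella and ask her", "please call stella ask her"))
```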
7
Recognition Ratio vs. MOS
Both MOS and Rabs decrease as the packet loss probability p increases
Eliminating the middle variable p, which both curves share, yields a direct mapping from Rabs to MOS
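Concretely, "eliminating p" means pairing the two measured curves point by point and interpolating between the resulting (Rabs, MOS) pairs, as in the sketch below. The numbers are placeholders, not the measured data from the experiment.

```python
import numpy as np

# Placeholder calibration data: both curves measured at the same loss conditions.
loss = np.array([0.00, 0.02, 0.05, 0.10, 0.15])
mos  = np.array([4.1, 3.7, 3.2, 2.6, 2.1])       # from the listening tests
rabs = np.array([0.42, 0.41, 0.37, 0.30, 0.24])  # from the recognizer

# Pairing the curves point by point removes p and gives R_abs -> MOS directly.
order = np.argsort(rabs)                 # np.interp needs ascending x values

def mos_from_rabs(r):
    return float(np.interp(r, rabs[order], mos[order]))

print(mos_from_rabs(0.33))  # predicted MOS for an observed R_abs of 0.33
```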
8
Properties of ASR Performance
When the loss probability is low:
Recognition ratio changes slowly, possibly due to robustness in ViaVoice
MOS prediction is less accurate in such cases
Importance of the voice-training method: training audio should use the same codec as the test audio
9
Speaker Dependence in ASR
ViaVoice SDK cites a 90% accuracy for an average speaker without a heavy accent, sampled at 22 kHz, PCM linear 16-bit
For speaker A, we achieved about 42% accuracy with no packet loss
Reasons: 8 kHz sampling + G.729 compression; accent + talking speed
This does not interfere with MOS prediction, but speaker dependence needs to be checked
10
Speaker Dependence Check
The absolute recognition ratio is 70% for speaker B but only 42% for speaker A, i.e., it depends on the speaker
The relative recognition ratio Rrel, however, is universal and speaker-independent
11
Rrel as Universal MOS Predictor
Mapping from relative recognition ratio Rrel to MOS
12
Human Recognition Results
Listeners are asked to transcribe what they hear in addition to MOS grading. Human recognition result curves are less “smooth” than MOS curves.
13
Human Results, contd.
Two flat regions in the loss-vs-human-recognition curve:
2-5% loss (some loss, but not very high)
10-15% loss (loss is already too high)
Mapping between machine and human recognition performance
14
Application Scenarios
Sender transmits a pre-recorded audio clip of a speaker known to the receiver
Receiver does the following (sketched below):
Looks up Rabs(0%) for this speaker
Performs speech recognition on the received clip
Compares the result to the original text and computes Rrel
No need to store the original audio clip; the text alone is sufficient, so less storage is needed
The receiver need not know the packet loss probability
Suitable for end-to-end, black-box measurements
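A receiver-side sketch of this procedure. The calibration tables, clip names, and the recognize stub (standing in for the ASR engine; ViaVoice's actual API is not shown) are all hypothetical placeholders.

```python
import numpy as np

# All names and numbers below are illustrative placeholders.
REFERENCE_TEXT = {"clip7": "the quick brown fox jumps over the lazy dog"}
R_ABS_AT_ZERO_LOSS = {"speaker_A": 0.42, "speaker_B": 0.70}  # looked up per speaker

# Placeholder R_rel -> MOS calibration curve (ascending in R_rel).
CAL_R_REL = np.array([0.55, 0.70, 0.85, 0.95, 1.00])
CAL_MOS   = np.array([2.1, 2.6, 3.2, 3.7, 4.1])

def recognize(audio):
    """Stand-in for the speech recognition engine (e.g. ViaVoice)."""
    raise NotImplementedError("call the ASR engine on the received audio here")

def word_ratio(reference, hypothesis):
    # Simplified positional word match; see the metric sketch earlier
    # for an alignment-based count.
    ref, hyp = reference.split(), hypothesis.split()
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)

def estimate_mos(audio, clip_id, speaker):
    hypothesis = recognize(audio)                           # 1. run ASR on the received clip
    rabs = word_ratio(REFERENCE_TEXT[clip_id], hypothesis)  # 2. compare to the stored text
    rrel = rabs / R_ABS_AT_ZERO_LOSS[speaker]               # 3. normalize: speaker-independent R_rel
    return float(np.interp(rrel, CAL_R_REL, CAL_MOS))       # 4. map R_rel to a MOS estimate
```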
15
Conclusions
Evaluation of speech recognition performance as a MOS predictor
Used the ViaVoice speech engine; performance metric: word recognition ratio
The relative word recognition ratio is a universal, speaker-independent metric
Also analyzed human recognition performance
Future work: evaluate other codecs, e.g., G.726 and GSM