Using Speech Recognition to Predict VoIP Quality

Presentation transcript:

Using Speech Recognition to Predict VoIP Quality
Wenyu Jiang, IRT Lab
April 3, 2002

Introduction to Voice Quality
- Quality factors in Voice over IP (VoIP)
  - Packet loss, delay, and jitter
  - Choice of voice codec
- Quality metric: Mean Opinion Score (MOS)
  - Widely used
  - Human based: time consuming and labor intensive
  - Results not available in real time

MOS Grade    Score
Excellent    5
Good         4
Fair         3
Poor         2
Bad          1
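As a small illustration of the metric above, a MOS value is the average of the grades that listeners assign on the five-point scale. The sketch below assumes grades arrive as the text labels from the table; the function name and the sample grades are hypothetical and are not data from the talk.

```python
from statistics import mean

# Five-point opinion scale, as listed on the slide.
MOS_SCALE = {"Excellent": 5, "Good": 4, "Fair": 3, "Poor": 2, "Bad": 1}


def mean_opinion_score(grades):
    """Average a list of grade labels into a single MOS value."""
    if not grades:
        raise ValueError("at least one grade is required")
    return mean(MOS_SCALE[g] for g in grades)


# Hypothetical grades from four listeners for one clip (illustration only).
example_grades = ["Good", "Fair", "Good", "Poor"]
print(mean_opinion_score(example_grades))  # 3.25
```

Every clip needs a panel of listeners to produce one such number, which is why the slide calls the method labor intensive.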

Motivation
- Features of a speech recognizer
  - Automatic speech recognition (ASR) needs no human listeners
  - Recognition accuracy is apparently coupled with the quality of the input speech
  - Recognition can be done in real time, allowing online quality monitoring
- Recognition performance may be related to speech intelligibility as well as quality

Related Work
- ITU-T E-model [G.107/G.108]
  - An analytical model for estimating perceived quality
  - Provides loss-to-MOS mapping for some common codecs (G.729, G.711, G.723.1)
- Chernick et al. studied speech recognition performance with the DoD-CELP codec
  - Effect of bit error rate instead of packet loss
  - Phoneme (instead of word) recognition ratio
  - Some MOS results, but not accurate enough

Experiment Setup
- Speech recognition engine
  - IBM ViaVoice on Linux
  - Wrote software for both voice model training and performance testing
- Training and testing
  - 2 scripts: #1 for training, #2 for testing
  - 2 speakers, A and B; both read both scripts
  - Script #2 is split into 25 audio clips, with 5 clips per loss condition (0%, 2%, 5%, 10%, 15%)
- Codec: G.729
  - Training uses G.729-processed audio

Experiment Setup, contd.
- Performance metric
  - Absolute word recognition ratio, Rabs(p)
  - Relative word recognition ratio, Rrel(p)
  - p is the packet loss probability
- MOS listening tests: 22 listeners
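The slide names the two ratios but the transcript does not carry their formulas, so the following is only a sketch under stated assumptions. It assumes the absolute ratio is the fraction of reference words that the recognizer gets right, and that the relative ratio divides this by the same speaker's ratio at 0% loss (consistent with the later slide that looks up Rabs(0%) before computing Rrel). Counting correct words with an in-order alignment from `difflib` is also an assumption; the talk does not say how words were matched.

```python
from difflib import SequenceMatcher


def absolute_ratio(reference, recognized):
    """Assumed R_abs: matched words / words in the reference text."""
    ref = reference.lower().split()
    hyp = recognized.lower().split()
    if not ref:
        raise ValueError("reference text is empty")
    # Align the two word sequences and count words that match in order.
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


def relative_ratio(r_abs_at_loss, r_abs_no_loss):
    """Assumed R_rel: R_abs(p) normalized by the speaker's R_abs(0%)."""
    if r_abs_no_loss <= 0:
        raise ValueError("the no-loss ratio must be positive")
    return r_abs_at_loss / r_abs_no_loss


# Hypothetical texts, for illustration only.
r = absolute_ratio("the quick brown fox jumps", "the quick fox jump")
print(r)  # 0.6
```

Normalizing by the no-loss ratio is what would let speakers with different baseline accuracy be compared on one scale.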

Recognition Ratio vs. MOS
- Both MOS and Rabs decrease as the loss probability p increases
- Then, eliminate the middle variable p
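Eliminating the middle variable p can be sketched as follows: pair the MOS and the recognition ratio measured at the same loss rate, discard p, and interpolate between the resulting points. The function names, the piecewise-linear interpolation, and every number below are assumptions made for illustration; they are placeholders, not results from the talk.

```python
def eliminate_loss(mos_by_loss, ratio_by_loss):
    """Pair MOS and ratio measured at the same loss rate, dropping the loss rate."""
    shared = sorted(set(mos_by_loss) & set(ratio_by_loss))
    return sorted((ratio_by_loss[p], mos_by_loss[p]) for p in shared)


def predict_mos(points, ratio):
    """Piecewise-linear interpolation over (ratio, MOS) points, clamped at both ends."""
    if not points:
        raise ValueError("no (ratio, MOS) points available")
    if ratio <= points[0][0]:
        return points[0][1]
    if ratio >= points[-1][0]:
        return points[-1][1]
    for (r0, m0), (r1, m1) in zip(points, points[1:]):
        if r0 <= ratio <= r1:
            return m0 + (m1 - m0) * (ratio - r0) / (r1 - r0)


# Placeholder measurements keyed by loss percentage (NOT data from the talk).
mos_by_loss = {0: 4.0, 5: 3.0, 15: 2.0}
ratio_by_loss = {0: 0.4, 5: 0.3, 15: 0.1}

points = eliminate_loss(mos_by_loss, ratio_by_loss)
print(points)  # [(0.1, 2.0), (0.3, 3.0), (0.4, 4.0)]
```

Once the points are built, a prediction needs only a recognition ratio and no knowledge of the loss rate that produced it.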

Properties of ASR Performance
- When loss probability is low
  - Recognition ratio changes slowly
  - Possibly due to robustness in ViaVoice
  - MOS prediction is less accurate in this case
- Importance of voice training method
  - Training audio should use the same codec as testing

Speaker Dependence in ASR
- ViaVoice SDK cites 90% accuracy for
  - An average speaker without a heavy accent
  - Sampling at 22 kHz, PCM linear-16
- For speaker A, we achieved
  - About 42% accuracy with no packet loss
  - Reasons: 8 kHz sampling + G.729 compression; accent + talk speed
- This does not interfere with MOS prediction, but we need to check for speaker dependence

Speaker Dependence Check
- Absolute recognition ratio is 70% for speaker B but 42% for speaker A, so it depends on the speaker
- But the relative recognition ratio Rrel is universal and speaker-independent

Rrel as Universal MOS Predictor
- Mapping from relative recognition ratio Rrel to MOS

Human Recognition Results
- Listeners are asked to transcribe what they hear, in addition to MOS grading
- Human recognition result curves are less "smooth" than MOS curves

Human Results, contd.
- Two flat regions in the loss-human curve
  - 2-5% loss (some loss but not very high)
  - 10-15% loss (loss is already too high)
- Mapping between machine and human recognition performance

Application Scenarios
- Sender transmits a pre-recorded audio clip of a speaker known to the receiver
- Receiver does the following:
  - Looks up Rabs(0%) for this speaker
  - Performs speech recognition
  - Compares the result to the original text and computes Rrel
- No need to store the original audio clip
  - Just the text is sufficient → less storage
- Need not know the packet loss probability
  - Suitable for e2e black-box measurements
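The receiver-side steps above can be sketched as one function. Everything named here is hypothetical: `recognize` stands in for whatever ASR engine is used (the talk used ViaVoice, whose API is not shown), `ratio_to_mos` stands in for a previously fitted Rrel-to-MOS mapping, and the word matching via `difflib` and the definition Rrel = Rabs / Rabs(0%) are assumptions.

```python
from difflib import SequenceMatcher


def estimate_quality(audio_clip, speaker, reference_text,
                     baseline_ratio, recognize, ratio_to_mos):
    """Return (R_rel, predicted MOS) for one received clip."""
    # Step 1: look up the speaker's no-loss recognition ratio, R_abs(0%).
    r_abs_zero = baseline_ratio[speaker]
    # Step 2: run speech recognition on the received audio.
    recognized = recognize(audio_clip)
    # Step 3: compare against the stored text; only the text is needed.
    ref = reference_text.lower().split()
    hyp = recognized.lower().split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    r_rel = (matched / len(ref)) / r_abs_zero
    return r_rel, ratio_to_mos(r_rel)


# Stubs for illustration only: a fake recognizer and a placeholder mapping.
fake_recognize = lambda clip: "one two"
placeholder_mapping = lambda r_rel: 1.0 + 3.0 * r_rel

result = estimate_quality(b"", "A", "one two three four",
                          {"A": 0.5}, fake_recognize, placeholder_mapping)
print(result)  # (1.0, 4.0)
```

The packet loss probability never appears as an input, which is what makes the approach usable for end-to-end black-box measurement.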

Conclusions
- Evaluated speech recognition performance as a MOS predictor
  - Used the ViaVoice speech engine
  - Performance metric: word recognition ratio
  - The relative word recognition ratio is a universal, speaker-independent metric
- Also analyzed human recognition performance
- Future work: evaluate other codecs, e.g., G.726, GSM