Speech Coding EE 516 Spring 2009

Speech Coding EE 516 Spring 2009 Alex Acero

Acknowledgments Thanks to Allen Gersho for some slides…

Outline: Quality vs bit rate; Types of speech coders; Waveform coding; Speech production and vocoders; Analysis by synthesis; VoIP

Voice Quality: Bandwidth is easily quantified; voice quality is subjective. MOS (Mean Opinion Score), ITU-T Recommendation P.800: Excellent (5), Good (4), Fair (3), Poor (2), Bad (1). A minimum of 30 people listen to voice samples or take part in conversations.
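The scoring described on this slide can be sketched in a few lines. This is only an illustration: the function name, the listener-count check and the sample ratings are assumptions made for the example, not part of P.800.

```python
from statistics import mean

MIN_LISTENERS = 30  # the slide asks for a minimum of 30 people


def mean_opinion_score(ratings):
    """Average integer ratings on the 5 (Excellent) .. 1 (Bad) scale into a MOS."""
    if len(ratings) < MIN_LISTENERS:
        raise ValueError("need at least %d ratings" % MIN_LISTENERS)
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


if __name__ == "__main__":
    # Hypothetical panel: 10 listeners say Excellent, 15 Good, 5 Fair.
    panel = [5] * 10 + [4] * 15 + [3] * 5
    print(round(mean_opinion_score(panel), 2))
```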

Voice Quality: The P.800 recommendation covers the selection of participants, the test environment, explanations to listeners, and analysis of results. Toll quality: a MOS of 4.0 or higher.

Quality Measurements: Subjective and objective quality-testing techniques. PSQM (Perceptual Speech Quality Measurement), ITU-T P.861: intended to faithfully represent human judgement and perception; an algorithmic comparison between the output signal and a known input. Factors: type of speaker, loudness, delay, active/silence frames, clipping, environmental noise.

Evolution of Speech Coder Performance [Figure: speech quality (Bad, Poor, Fair, Good, Excellent) versus bit rate (kb/s), with 1980, 1990 and 2000 profiles; annotated examples: ITU Recommendations, Cellular Standards, North American TDMA, Secure Telephony. Eurospeech 2003.] Speaker notes: Be sure to mention where the curves flatten out for each decade. Also point out that these examples are meant to be listened to over a handset, not a loudspeaker. This is particularly true of 32 kb/s G.726. Before each coder is played, tell what it is.

Speech Coding (Telephony): More complicated than Moore's Law. Many dimensions: bit rate, quality, complexity and delay. Ceiling: quality ceiling (imposed by telephone standards); easy to reach the ceiling at high bit rates (≥ 8 kb/s); more room for progress at low bit rates (≤ 8 kb/s). Moore's Law time constant: bit rates halve every decade (≤ 8 kb/s); relatively slow by Moore's Law standards (not hyper-inflation); performance doubles every decade, like disk seek or money in the bank (normal inflation); limited more by physics than investment. Potential compression opportunity: at most 10x, 8 kb/s -> 2 kb/s -> 1 kb/s (?); speech (2 kb/s) >> text (2 bits/char), 100-1000 times more bits; speech coding will not close this gap for the foreseeable future.

Types of Speech Coders. Waveform codecs: sample and code; high quality and not complex; large amount of bandwidth. Source codecs (vocoders): match the incoming signal to a math model; linear-predictive filter model of the vocal tract; a voiced/unvoiced flag for the excitation; the information is sent rather than the signal; low bit rates, but sounds synthetic; higher bit rates do not improve much.

Types of Speech Coders. Hybrid codecs: attempt to provide the best of both; perform a degree of waveform matching; utilize the sound production model; quite good quality at low bit rate.

Waveform coders: high quality, high bitrate. Pulse Code Modulation (PCM): sample the input waveform, then quantize. Differential PCM: encode the difference between adjacent samples. Adaptive DPCM: adapt the quantization step size based on speech statistics.

Voice Sampling: A-to-D conversion takes discrete samples of the waveform and represents each sample by some number of bits. A signal can be reconstructed if it is sampled at a minimum of twice the maximum frequency (Nyquist theorem). Human speech: 300-3800 Hz, so 8000 samples per second. Each sample is encoded into an 8-bit PCM code word (e.g. 01100101) => 8000 x 8 bit/s = 64 kbit/s.

Quantization: how many bits are used to represent each sample. Quantization noise: the difference between the actual level of the input analog signal and the quantized level. More bits reduce the noise, with diminishing returns. With uniform quantization levels, louder talkers sound better.
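A small experiment can make the last two points concrete. The sketch below quantizes a sine tone with a uniform quantizer and reports the signal-to-noise ratio; the tone frequency, amplitudes and the mid-rise quantizer design are assumptions chosen for illustration.

```python
import math


def quantize_uniform(x, bits, full_scale=1.0):
    """Mid-rise uniform quantizer over [-full_scale, full_scale)."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    index = math.floor(x / step)
    index = max(-(levels // 2), min(levels // 2 - 1, index))  # clip to valid codes
    return (index + 0.5) * step


def snr_db(amplitude, bits, n=8000, freq=437.0, rate=8000.0):
    """SNR in dB of a quantized sine tone (tone parameters are illustrative)."""
    signal = [amplitude * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]
    noise = [s - quantize_uniform(s, bits) for s in signal]
    p_signal = sum(s * s for s in signal)
    p_noise = sum(e * e for e in noise)
    return 10.0 * math.log10(p_signal / p_noise)


if __name__ == "__main__":
    for amplitude in (0.9, 0.05):  # a loud talker and a quiet talker
        for bits in (8, 12):
            print(amplitude, bits, round(snr_db(amplitude, bits), 1))
```

Each extra bit halves the step size, and the quiet tone gets a much lower SNR than the loud one at the same word length because the noise level stays fixed while the signal shrinks.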

Non-uniform quantization: with uniform steps, the percentage quantization error is larger for smaller values of x(t). Goal: make the percentage error at small signal values similar to that at large ones. This process is called “companding” at the source encoding end and “decompanding” at the decoding (D/A) end. The net effect is to make the sum of the quantization errors smaller and more uniform percentage-wise. Logarithmic scaling (A-law in Europe and µ-law in US).

Non-uniform quantization: smaller quantization steps at smaller signal levels; spreads the signal-to-noise ratio more evenly.
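As a rough illustration of logarithmic companding, the sketch below uses the continuous µ-law curve with µ = 255 and an 8-bit uniform quantizer in the compressed domain. Both choices are assumptions for the example; a deployed G.711 codec uses a segmented approximation of the curve rather than this formula.

```python
import math

MU = 255.0  # assumed value, commonly quoted for 8-bit mu-law


def mu_compress(x, mu=MU):
    """Compress x in [-1, 1] logarithmically (companding at the encoder)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y, mu=MU):
    """Inverse of mu_compress (decompanding at the decoder)."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_uniform(x, bits):
    """Mid-rise uniform quantizer over [-1, 1)."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = max(-(levels // 2), min(levels // 2 - 1, math.floor(x / step)))
    return (index + 0.5) * step


def quantize_companded(x, bits=8):
    """Compress, quantize uniformly, then expand."""
    return mu_expand(quantize_uniform(mu_compress(x), bits))


def relative_error(x, xq):
    return abs(x - xq) / abs(x)


if __name__ == "__main__":
    for x in (0.01, 0.1, 0.9):
        print(x,
              relative_error(x, quantize_uniform(x, 8)),
              relative_error(x, quantize_companded(x, 8)))
```

For small inputs the companded quantizer gives a much smaller percentage error than the uniform one at the same 8 bits, which is the effect the slides describe.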

G.711: the most commonplace codec, used in the circuit-switched telephone network. PCM (Pulse-Code Modulation). With uniform quantization: 12 bits * 8 k/sec = 96 kbps. With non-uniform quantization: 8 bits * 8 k/sec = 64 kbps, the DS0 rate. mu-law: North America. A-law: other countries, a little friendlier to lower signal levels. An MOS of about 4.3.

DPCM (Differential PCM): only transmit the difference between the predicted value and the actual value. Voice changes relatively slowly, so it is possible to predict the value of a sample based on the values of previous samples. The receiver performs the same prediction. The simplest form: no prediction. No algorithmic delay.
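The encoder/decoder symmetry can be sketched as follows. This is a minimal illustration that assumes a first-order predictor (the previous reconstructed sample) and a fixed step size; the function names and the test tone are hypothetical.

```python
import math


def dpcm_encode(samples, step):
    """Send the quantized difference between each sample and its prediction."""
    codes = []
    prediction = 0.0
    for s in samples:
        code = round((s - prediction) / step)
        codes.append(code)
        prediction += code * step  # track exactly what the decoder will rebuild
    return codes


def dpcm_decode(codes, step):
    """The receiver performs the same prediction and adds the received difference."""
    out = []
    prediction = 0.0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out


def sine_tone(n=400, freq=200.0, rate=8000.0, amplitude=0.8):
    """Slowly varying test signal (illustrative parameters)."""
    return [amplitude * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]


if __name__ == "__main__":
    step = 0.01
    tone = sine_tone()
    codes = dpcm_encode(tone, step)
    decoded = dpcm_decode(codes, step)
    print("largest difference code:", max(abs(c) for c in codes))
    print("largest PCM code at the same step:", max(abs(round(s / step)) for s in tone))
    print("max reconstruction error:", max(abs(a - b) for a, b in zip(tone, decoded)))
```

Because the encoder predicts from its own reconstruction rather than from the clean input, quantization errors do not accumulate at the decoder, and the difference codes span a much smaller range than direct PCM codes would.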

ADPCM (Adaptive DPCM): predicts sample values based on past samples, factoring in some knowledge of how speech varies over time. The error is quantized and transmitted; fewer bits are required. G.721: 32 kbps. G.726: A-law/mu-law PCM -> 16, 24, 32, 40 kbps. An MOS of about 4.0 at 32 kbps.

Subjective quality metrics for speech
Basis | Range (better-worse) | Toll | Synthetic
Mean Opinion Score | 5 - 1 | 4.0 - 3.5 | 3.5 - 2.5
Diagnostic Rhyme Test (consonants) | 100 | ~ 95 | ~ 90
Acceptability Measure | | ~ 73 | ~ 54

Common Waveform Coders
Type | Quality | MOS - DRT - DAM | Kbit/sec | MIPS | Complexity
PCM | Toll | | 96 | | Very low
log PCM | | 4.3 - 95 - 73 | 64 | 0.01 |
ADM/CVSD | | | 40 | | Low

Information rate of speech: phonetic content at a rate of about 72 bits/second: 6 bits are sufficient for 40-50 different phonemes, and the average speaking rate is about 12 phonemes/second. This neglects: intonation (no pitch transmitted); emotion; individual characterization of speech (the ability to recognize the speaker); the fact that phone durations are different.
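The arithmetic behind the 72 bits/second figure, and the gap to 64 kbit/s PCM, can be checked directly. The phoneme inventory of 50 is the upper end of the range quoted on the slide; the variable names are only illustrative.

```python
import math

PHONEME_INVENTORY = 50    # upper end of the 40-50 range on the slide
PHONEMES_PER_SECOND = 12  # average speaking rate from the slide
PCM_RATE_BPS = 8000 * 8   # 8000 samples/s at 8 bits per sample

# Fixed-length code: enough bits to index every phoneme.
bits_per_phoneme = math.ceil(math.log2(PHONEME_INVENTORY))
phonetic_rate_bps = bits_per_phoneme * PHONEMES_PER_SECOND
pcm_to_phonetic_ratio = PCM_RATE_BPS / phonetic_rate_bps

if __name__ == "__main__":
    print(bits_per_phoneme, "bits per phoneme")
    print(phonetic_rate_bps, "bits/second of phonetic content")
    print(round(pcm_to_phonetic_ratio), "times more bits in 64 kbit/s PCM")
```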

Redundancies in speech: Our sampling frequency Fs is much greater than the vocal tract's rate of change (with the exception of closures). F0 (or perceived pitch) changes slowly compared to the windowing rate. Adjacent windows correlate rather well. The spectral waveform changes slowly, and most of the energy is at the low end of frequencies, so it changes even more slowly there (the important part of speech). It is possible to model phones as periodic/noisy filtered excitation and still obtain reasonable quality. Speech parameters may be weighted since they occur nonuniformly (different probabilities). The ear is insensitive to phase, so it can be discarded.

Average power spectrum of speech Notice that the frequency scale is logarithmic in this figure. Speech has in general higher power at the lower frequencies for sonorants and less power above 3.3kHz, as shown here.

Human Speech Production System: air flow is forced from the lungs to the vocal tract. Vocal tract: short-term correlations; a filter with resonances (called formants). Speech sound classes: voiced sounds (vocal cord vibration, long-term periodicity); unvoiced sounds (constriction in the vocal tract, no long-term periodicity); plosive sounds (release of air pressure behind the mouth).

A Little About Speech. Speech: air is pushed from the lungs past the vocal cords and along the vocal tract. The basic vibrations come from the vocal cords. The sound is altered by the disposition of the vocal tract (tongue and mouth). Model the vocal tract as a filter: the shape changes relatively slowly. The vibrations at the vocal cords: the excitation signal.

Voiced Speech: the vocal cords vibrate open and closed, interrupting the air flow and producing quasi-periodic pulses of air. The rate of the opening and closing is the pitch. A high degree of periodicity at the pitch period, 2-20 ms.

Voiced Speech [Figure panels: voiced speech; power spectrum density]

Unvoiced Speech: produced by forcing air at high velocities through a constriction. The glottis is held open. Noise-like turbulence. Shows little long-term periodicity. Short-term correlations are still present.

Unvoiced Speech [Figure panels: unvoiced speech; power spectrum density]

Stops (plosive sounds): a complete closure in the vocal tract; air pressure is built up and released suddenly. A vast array of sounds. The speech signal is relatively predictable over time. The reduction of transmission bandwidth can be significant.

Linear Predictive Coding (LPC): predict the current sample as a linear combination of past samples. An all-pole model. Minimize the squared error. Orthogonality principle. Solution.
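The equations from this slide did not survive in the transcript. As a hedged sketch of one standard way to carry out these steps, the code below uses the autocorrelation method: minimizing the squared prediction error leads to the normal equations sum_k a_k R(|i-k|) = R(i), which the Levinson-Durbin recursion solves. Whether the lecture used this exact formulation is an assumption, and the function names are illustrative.

```python
def autocorrelation(x, max_lag):
    """R(lag) = sum_n x[n] * x[n - lag] for lag = 0 .. max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients.

    Returns (a, err) where the prediction is
    x_hat[n] = sum_{k=1..order} a[k-1] * x[n-k]
    and err is the remaining prediction error energy.
    """
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        acc = r[i + 1] - sum(a[k] * r[i - k] for k in range(i))
        k_i = acc / err
        new_a = a[:]
        new_a[i] = k_i
        for k in range(i):
            new_a[k] = a[k] - k_i * a[i - 1 - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a, err


def normal_equation_residual(a, r):
    """Largest violation of sum_k a_k R(|i-k|) = R(i) over i = 1 .. order."""
    order = len(a)
    return max(abs(sum(a[k] * r[abs(i - 1 - k)] for k in range(order)) - r[i])
               for i in range(1, order + 1))


if __name__ == "__main__":
    frame = [0.0, 0.9, 1.4, 1.1, 0.3, -0.6, -1.2, -1.1, -0.5, 0.2, 0.8, 0.9]
    r = autocorrelation(frame, 2)
    coeffs, energy = levinson_durbin(r, 2)
    print(coeffs, energy, normal_equation_residual(coeffs, r))
```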

Vocoders (source coders): linear prediction model for the human voice system; medium quality, low bitrate.

Vector Quantization Example. Key challenge: given a source distribution, how to select the codebook (*) and partitions (---) to result in the smallest average distortion. Solution: divide and conquer. Two codes -> four -> eight …
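One common way to realize the "two codes, then four, then eight" idea is codebook training by splitting, with a few nearest-neighbour/centroid (Lloyd) iterations after each split. The sketch below assumes that approach, squared-error distortion and synthetic 2-D Gaussian data; none of these details are confirmed by the slide.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest(codebook, point):
    """Index of the codeword closest to point (defines the partition)."""
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i], point))


def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def refine(codebook, data, iterations=10):
    """Lloyd iterations: assign points to cells, move codewords to cell centroids."""
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for p in data:
            cells[nearest(codebook, p)].append(p)
        # An empty cell keeps its old codeword.
        codebook = [centroid(c) if c else w for c, w in zip(cells, codebook)]
    return codebook


def train_by_splitting(data, size, eps=0.01):
    """Grow a codebook 1 -> 2 -> 4 -> ... -> size (size should be a power of two)."""
    codebook = [centroid(data)]
    while len(codebook) < size:
        # Split every codeword into two slightly perturbed copies, then refine.
        codebook = [tuple(v + s * eps for v in w) for w in codebook for s in (1, -1)]
        codebook = refine(codebook, data)
    return codebook


def average_distortion(codebook, data):
    return sum(dist2(codebook[nearest(codebook, p)], p) for p in data) / len(data)


def sample_data(n=400, seed=0):
    """Synthetic 2-D Gaussian source, used only for illustration."""
    rng = random.Random(seed)
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


if __name__ == "__main__":
    data = sample_data()
    for size in (1, 2, 4, 8):
        print(size, round(average_distortion(train_by_splitting(data, size), data), 3))
```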

Analysis-by-Synthesis (AbS) Codecs: a hybrid method, the most successful and commonly used. Uses the vocoder's linear prediction model, with careful selection of the excitation signal to reconstruct the original waveform. High quality, low bitrate! Time-domain AbS codecs: not a simple two-state (voiced/unvoiced) excitation; different excitation signals are attempted, and the one closest to the original waveform is selected. Types: MPE (Multi-Pulse Excited), RPE (Regular-Pulse Excited), CELP (Code-Excited Linear Predictive).

Linear-Prediction-based Analysis-by-Synthesis (LPAS). How it works: segment speech into frames (typically 20 ms long); find the filter parameters for each frame; find the excitation that minimizes the prediction error, with perceptual weighting (more accuracy where speech energy is low); transmit the filter parameters and the excitation signal, using vector quantization.

LPAS Classification. Three classes: Multi-Pulse Excited (MPE), Regular-Pulse Excited (RPE), Code-Excited Linear Predictive (CELP). The difference lies in the representation of the excitation signal.

Multi-Pulse Excited (MPE): excitation is given by a fixed number of pulses. The position and amplitude of the pulses are computed to minimize the error and are transmitted to the decoder. Finding the best match is theoretically possible but not practical, so suboptimal estimates are used. Typically about 4 pulses per 5 ms are used.

Regular-Pulse Excited (RPE): multiple pulses are used, as in MPE, but regularly spaced at a fixed period. Only the first pulse's position and all pulses' amplitudes need to be transmitted. More pulses are allowed for better quality at the same bitrate: around 10 pulses per 5 ms.

Code-Excited Linear Predictive (CELP): excitation is given by an entry from a large vector quantizer codebook, plus a gain term for its power (amplitude). Key challenge: searching for the right excitation entries in real time. Solution: restructure the codebook so that it is optimized for searching (such as a tree). Performance: 4.8 kbps or lower bitrate with good quality.
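The analysis-by-synthesis search can be sketched as an exhaustive loop: pass every codebook entry through the synthesis filter, compute the best gain, and keep the entry whose output is closest to the target. This is a simplified illustration. It assumes zero filter memory, plain squared error without perceptual weighting, and a random codebook; a practical coder would use a structured codebook to avoid the exhaustive search.

```python
import random


def synthesize(excitation, lpc):
    """All-pole synthesis filter: s[n] = e[n] + sum_k lpc[k] * s[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out


def search_codebook(target, codebook, lpc):
    """Return (index, gain, error) of the entry that best reconstructs target."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, lpc)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        # Least-squares gain for this candidate.
        gain = sum(t * v for t, v in zip(target, y)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


def make_codebook(size, length, seed=0):
    """Hypothetical random Gaussian codebook, used only for illustration."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(length)] for _ in range(size)]


if __name__ == "__main__":
    lpc = [1.2, -0.5]  # illustrative stable two-pole filter
    codebook = make_codebook(16, 40)
    target = [0.7 * v for v in synthesize(codebook[5], lpc)]
    print(search_codebook(target, codebook, lpc))
```

Only the winning index and the gain need to be sent for the excitation; the decoder holds the same codebook and filter and repeats the synthesis.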

Further Improvements on CELP. Representation of pitch period: adaptive long-term prediction + short-term adjustment. Coding of the LP filter: vector quantization of the filter representation. Multimode coding: dynamic bit allocation between excitation, LP filter and pitch.

G.728 LD-CELP. CELP codecs: a filter whose characteristics change over time; a codebook of acoustic vectors (a vector = a set of elements representing various characteristics of the excitation); transmit the filter coefficients, the gain, and a pointer to the vector chosen. Low-Delay CELP: a backward-adaptive coder that uses previous samples to determine the filter coefficients; operates on five samples at a time; delay < 1 ms; only the pointer is transmitted.

G.728 LD-CELP: 1024 vectors in the codebook; 10-bit pointer (index); 16 kbps. The LD-CELP encoder minimizes a frequency-weighted mean-square error.

G.728 LD-CELP: MOS score of about 3.9. One-quarter of G.711 bandwidth (16 kbps). 30 MIPS. 2 kilobytes of RAM is needed for codebooks. 50th-order LPC filter. Lower delays are obtained by making the excitation vectors very short (~5 samples or 0.625 ms).
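The G.728 figures quoted on these slides fit together arithmetically. The check below assumes the 8 kHz telephony sampling rate used earlier in the deck; the constant names are illustrative.

```python
SAMPLE_RATE_HZ = 8000   # assumed telephony sampling rate
VECTOR_SAMPLES = 5      # LD-CELP operates on five samples at a time
CODEBOOK_SIZE = 1024    # vectors in the codebook

index_bits = (CODEBOOK_SIZE - 1).bit_length()        # bits needed for the pointer
vectors_per_second = SAMPLE_RATE_HZ // VECTOR_SAMPLES
bit_rate_bps = vectors_per_second * index_bits       # only the pointer is sent
vector_duration_ms = 1000.0 * VECTOR_SAMPLES / SAMPLE_RATE_HZ
fraction_of_g711 = bit_rate_bps / 64000

if __name__ == "__main__":
    print(index_bits, "bit pointer")
    print(bit_rate_bps, "bit/s")
    print(vector_duration_ms, "ms per excitation vector")
    print(fraction_of_g711, "of the G.711 rate")
```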

Algebraic CELP (ACELP): the residual samples are not vector-quantized but derived directly from an algebraic computation and used to excite the LP synthesizer, accelerating the search for the optimal excitation. The main advantage is that the algebraic codebook can be very large (> 50 bits) without running into storage or CPU time problems. A 16-bit algebraic codebook is used in the innovative codebook search, the aim of which is to find the best innovation and gain parameters. The innovation vector contains, at most, four non-zero pulses. In ACELP a block of N speech samples is synthesized by filtering an appropriate innovation sequence from a codebook, scaled by a gain factor, through two time-varying filters: a long-term (pitch) synthesis filter and a short-term synthesis filter. Conjugate Structure ACELP yields toll quality with a 10th-order LPC.

G.723.1 ACELP: 6.3 or 5.3 kbps; both mandatory; can change from one to another during a conversation. The coder: a band-limited input speech signal, sampled at 8 kHz with 16-bit uniform PCM quantization; operates on blocks of 240 samples at a time; a look-ahead of 7.5 ms; a total algorithmic delay of 37.5 ms + other delays; a high-pass filter to remove any DC component.

G.723.1 ACELP: various operations determine the appropriate filter coefficients. 5.3 kbps: Algebraic Code-Excited Linear Prediction. 6.3 kbps: Multi-pulse Maximum Likelihood Quantization. The transmission: linear prediction coefficients, gain parameters, excitation codebook index. 24-octet frames at 6.3 kbps, 20-octet frames at 5.3 kbps.

G.723.1 ACELP. G.723.1 Annex A: Silence Insertion Description (SID) frames of size four octets. The two lsbs of the first octet indicate the frame type:
lsbs | Frame type | Octets/frame
00 | 6.3 kbps | 24
01 | 5.3 kbps | 20
10 | SID frame | 4
An MOS of about 3.8. At least 37.5 ms delay.
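A receiver can use the two least significant bits of the first octet to find frame boundaries in a stream of mixed frame types. The sketch below only encodes the three cases listed on the slide; the function names are hypothetical, and the handling of the remaining bit pattern (11) is an assumption, since the slide does not describe it.

```python
# Frame type table from the slide: lsbs -> (description, octets per frame).
FRAME_TYPES = {
    0b00: ("6.3 kbps speech", 24),
    0b01: ("5.3 kbps speech", 20),
    0b10: ("SID", 4),
}


def classify_frame(first_octet):
    """Look up the frame type from the two lsbs of the first octet."""
    code = first_octet & 0b11
    if code not in FRAME_TYPES:
        # Assumption: treat the pattern the slide does not describe as an error.
        raise ValueError("frame type 0b11 is not described on the slide")
    return FRAME_TYPES[code]


def split_frames(stream):
    """Split a bytes object of back-to-back frames into (kind, frame) pairs."""
    frames = []
    pos = 0
    while pos < len(stream):
        kind, size = classify_frame(stream[pos])
        if pos + size > len(stream):
            raise ValueError("truncated frame")
        frames.append((kind, stream[pos:pos + size]))
        pos += size
    return frames


if __name__ == "__main__":
    stream = bytes([0x00] + [0xAA] * 23 + [0x01] + [0xAA] * 19 + [0x02] + [0xAA] * 3)
    for kind, frame in split_frames(stream):
        print(kind, len(frame))
```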

G.729: 8 kbps. Input frames of 10 ms (80 samples at the 8 kHz sampling rate); 5 ms look-ahead; algorithmic delay of 15 ms; an 80-bit frame for 10 ms of speech. A complex codec. G.729A (Annex A): a number of simplifications; same frame structure, so G.729 and G.729A encoders and decoders can be mixed; slightly lower quality.
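Frame duration, bit rate and algorithmic delay follow from the frame size, the bits per frame and the look-ahead. The helper below is a sketch with an illustrative name; it reproduces the G.729 numbers above and the G.723.1 numbers from the earlier slides. Note that the 24-octet G.723.1 frame works out to 6.4 kbit/s, slightly above the nominal 6.3 kbps, presumably because the frame is padded to whole octets.

```python
def frame_stats(samples_per_frame, bits_per_frame, lookahead_ms, sample_rate=8000):
    """Derive frame duration, bit rate and algorithmic delay for a frame-based codec."""
    frame_ms = 1000.0 * samples_per_frame / sample_rate
    return {
        "frame_ms": frame_ms,
        "bit_rate_bps": bits_per_frame * 1000.0 / frame_ms,
        # Algorithmic delay = one full frame plus the look-ahead.
        "algorithmic_delay_ms": frame_ms + lookahead_ms,
    }


g729 = frame_stats(samples_per_frame=80, bits_per_frame=80, lookahead_ms=5.0)
g7231_high = frame_stats(samples_per_frame=240, bits_per_frame=24 * 8, lookahead_ms=7.5)

if __name__ == "__main__":
    print("G.729:", g729)
    print("G.723.1 (24-octet frames):", g7231_high)
```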

G.729. VAD (Voice Activity Detection): based on analysis of several parameters of the input, using the current frame plus the two preceding frames. DTX (Discontinuous Transmission): send nothing or send an SID frame; the SID frame contains information to generate comfort noise. CNG (Comfort Noise Generation). G.729: an MOS of about 4.0. G.729A: an MOS of about 3.7.

G.729. G.729 Annex D: a lower-rate extension; 6.4 kbps; 10 ms speech samples, 64 bits/frame; MOS  6.3 kbps G.723.1. G.729 Annex E: a higher bit rate enhancement; the linear prediction filter of G.729 has 10 coefficients, that of G.729 Annex E has 30; the codebook of G.729 has 35 bits, that of G.729 Annex E has 44 bits; 118 bits/frame; 11.8 kbps.

CDMA QCELP (IS-733): variable-rate coder. Two most common rates: the high rate, 13.3 kbps, and a lower rate, 6.2 kbps. Silence suppression. For use with RTP: RFC 2658.

GSM Enhanced Full-Rate (EFR): an enhanced version of GSM Full-Rate. ACELP-based codec. The same bit rate and the same overall packing structure. 12.2 kbps. Supports discontinuous transmission. For use with RTP: RFC 1890.

GSM Adaptive Multi-Rate (AMR): 20 ms coding delay. Eight different modes, 4.75 kbps to 12.2 kbps; 12.2 kbps is GSM EFR; 7.4 kbps is IS-641 (TDMA cellular systems). Can change the mode at any time. Offers discontinuous transmission: the SID (Silence Descriptor) is sent in every 8th frame and is 5 bytes in size. The coding choice of many 3G wireless networks.

VSELP (Vector-Sum-Excited Linear Prediction): used in IS-54 standard TDMA cell phones in the US, with a variation in Japan, and in the first version of RealAudio for audio over the Internet. Data rate: 7.95 kbit/s, coding 20 ms of speech into 159-bit frames. In an actual TDMA cell phone, the vocoder output is packaged with error correction and signaling information, resulting in an over-the-air data rate of 16.2 kbit/s. For internet audio, each 159-bit frame is stored in 20 bytes, leaving 1 bit unused; the resulting file thus has a data rate of exactly 8 kbit/s. Limited ability to encode non-speech sounds. Performs poorly in the presence of background noise.

Effects of packetization

Advantages of VoIP. Cost savings: avoid the need for separate voice and data networks; conference calling, IVR, call forwarding, automatic redial, and caller ID features (traditional telcos normally charge extra); while regular telephone calls are billed by the minute or second, VoIP calls are billed per megabyte. Flexibility: a simple way to add an extra telephone line to a home or office; secure calls using standardized protocols (such as Secure Real-time Transport Protocol); location independence (call center agents using VoIP phones can work from anywhere); integration with other services available over the Internet, including video conversation, message or data file exchange in parallel with the conversation, audio conferencing, and managing address books. Potential quality improvements: break legacy 8 kHz => 16 kHz, stereo, 5.1.

Challenges with VoIP. Quality of Service (QoS): latency; packet loss (play silence? audio “healing”). Transcoding leads to quality degradations. Susceptibility to power failure. 911 calls. Security: hackers in your phone.

DTX and Comfort Noise: DTX is Discontinuous Transmission. A voice activity detector (VAD) detects whether there is active speech or not. When there is no active speech, different DTX procedures can be used: no transmission at all; Comfort Noise (CN) using RFC 3389; codec built-in CN, like the AMR SID (Silence Descriptor). The frequency of Comfort Noise packets varies but is usually some fraction of the normal packet rate.
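The interplay of VAD and DTX can be sketched with a toy sender. Everything here is an assumption made for illustration: the VAD is a bare energy threshold (real detectors analyze several parameters), and a comfort-noise update is sent at the start of each silent run and every eighth silent frame after that.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def dtx_schedule(frames, threshold, sid_interval=8):
    """Decide per frame whether to send speech, a comfort-noise update, or nothing."""
    decisions = []
    silent_run = 0
    for frame in frames:
        if frame_energy(frame) >= threshold:  # toy VAD: energy threshold only
            decisions.append("SPEECH")
            silent_run = 0
        else:
            # First silent frame and every sid_interval-th one carry a CN update.
            decisions.append("SID" if silent_run % sid_interval == 0 else "NONE")
            silent_run += 1
    return decisions


if __name__ == "__main__":
    loud = [0.5, -0.5] * 80
    quiet = [0.001, -0.001] * 80
    print(dtx_schedule([loud, loud] + [quiet] * 10 + [loud], threshold=0.01))
```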

Tones, Signal, and DTMF Digits: the hybrid codecs are optimized for human speech, but other data may need to be transmitted. Tones: fax tones, dialing tone, busy tone. DTMF digits for two-stage dialing or voice-mail. G.711 is OK; G.723.1 and G.729 can be unintelligible. The ingress gateway needs to intercept the tones and DTMF digits and use an external signaling system.