Audio Coding Lecture 7

Content  Digital Audio Basics  Speech Compression  Music Compression

Audio Basics  Analog to Digital Conversion  Sampling  Quantisation  Aliasing effects  Filtering  Companding  PCM encoding  Digital to Analog Conversion

Analog Audio Image from Mark Handley’s slides

Simple Analog-to-Digital Converter  Signal is sampled at sampling frequency f  Sampled signal is quantised into discrete values [Block diagram: Analog Signal → Sample and Hold + Quantiser → Digitised Codewords]

Sampling and Quantization. (a): Sampling the analog signal in the time dimension. (b): Quantization is sampling the analog signal in the amplitude(/voltage) dimension. Image from Li & Drew’s slides

Sample and Hold

Sample Rate  Sample rate: the number of samples per second  Also known as sampling frequency  Telephone: 8000 samples/sec  CD: 44,100 samples/sec  DVD: 48,000 or 96,000 samples/sec

Sample and Hold

Sampling Rate

Real music example

How fast to sample?  Nyquist-Shannon sampling theorem  It states how frequently we must sample in time to be able to recover the original sound.  For no loss of information:  Sampling frequency ‘slightly’ >= 2 * maximum signal frequency  Nyquist frequency is the highest (or maximum) frequency that can be accurately represented (that will not alias given a sampling rate)  Example:  Limit of human hearing: 20Hz – 20kHz; the human voice extends to about 4kHz  By Nyquist, sample rate must be >= 40,000 samples/sec  CD sample rate: 44,100 samples/sec
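
As a quick illustration of the theorem (a minimal sketch assuming NumPy; all values are illustrative), a 5kHz tone sampled at 8000 samples/sec produces exactly the same sample sequence, up to sign, as a 3kHz tone; this is the aliasing discussed on the next slide.

    import numpy as np

    fs = 8000                           # sampling rate (telephone quality)
    t = np.arange(80) / fs              # 10 ms of sample instants

    f_ok = 3000                         # below Nyquist (fs/2 = 4000 Hz)
    f_alias = 5000                      # above Nyquist: folds back to 8000 - 5000 = 3000 Hz

    x_ok = np.sin(2 * np.pi * f_ok * t)
    x_alias = np.sin(2 * np.pi * f_alias * t)

    # The sampled sequences are indistinguishable, so the 5 kHz tone cannot be
    # recovered -- hence the anti-aliasing filter introduced two slides below.
    print(np.allclose(x_alias, -x_ok))  # True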

Aliasing  What happens to all those higher frequencies you can’t sample?  They add noise to the sampled data at lower frequencies Aliasing noise (real music)

Analog-to-Digital Converter  Low-pass anti-aliasing filter (cutoff at f/2) on input  Signal is sampled at sampling frequency f  Sampled signal is quantised into discrete values [Block diagram: Analog Signal → Low-pass filter → Filtered Analog Signal → Sample and Hold + Quantiser → Digitised Codewords]

Quantisation  Sampled analog signal needs to be quantised (digitised)  Two questions:  How many discrete digital values?  What analog value does each digital value correspond to?  Simplest quantisation: linear  8-bit linear (divides the vertical axis into 256 levels)  16-bit linear (divides the vertical axis into 65,536 levels)
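
A minimal sketch of linear quantisation (illustrative names, assumes NumPy): the full range is divided into 2^n equal steps and each sample is rounded to the nearest one.

    import numpy as np

    def quantise_linear(x, n_bits):
        levels = 2 ** n_bits                        # 256 for 8-bit, 65,536 for 16-bit
        step = 2.0 / levels                         # input assumed normalised to [-1, 1]
        q = np.clip(np.round(x / step), -levels // 2, levels // 2 - 1)
        return q * step                             # reconstructed (quantised) value

    x = np.sin(2 * np.pi * np.linspace(0, 1, 1000))
    print(np.max(np.abs(x - quantise_linear(x, 8))))    # error on the order of 1/256
    print(np.max(np.abs(x - quantise_linear(x, 16))))   # error on the order of 1/65,536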

Quantisation Noise

How many levels?  8-bit linear encoding would probably be enough if the signal always used the full range  But the signal varies in loudness  If full range is used for loud parts, quiet parts will suffer from bad quantisation noise (only a few levels used)  If full range is used for quiet parts, loud parts will clip, resulting in really bad noise  CD uses 16-bit linear encoding, a pretty good match to the dynamic range of the human ear.

Telephony  16-bit linear would be rather expensive for telephony  8-bit linear gives poor quality  Solution: use 8 bits with a “logarithmic” encoding  Known as companding (compressing/expanding)  Goal is that quantisation noise is a fixed proportion of the signal, irrespective of whether the signal is quiet or loud

Linear Encoding

Logarithmic Encoding

μ-law and A-Law The parameter μ is set to μ = 100 or μ = 255. The parameter A for the A-law encoder is usually set to A = 87.6.
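
The companding formulas behind these parameters are not reproduced in this transcript. As a sketch, the standard μ-law compressor maps a normalised sample x in [-1, 1] to sgn(x)·ln(1 + μ|x|)/ln(1 + μ); the fragment below (illustrative names, assumes NumPy) applies it before an 8-bit uniform quantiser and then expands again.

    import numpy as np

    MU = 255.0

    def mu_compress(x):
        return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

    def mu_expand(y):
        return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

    x = np.array([0.001, 0.01, 0.1, 0.9])        # quiet and loud samples
    y = np.round(mu_compress(x) * 127) / 127     # 8-bit quantiser applied to the compressed value
    x_hat = mu_expand(y)
    print(np.abs(x - x_hat) / x)                 # relative error stays modest even for quiet samples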

Nonlinear transform for audio signals

μ-law vs A-law  8-bit μ-law used in US for telephony  8-bit A-law used in Europe for telephony  Similar, but slightly different curve  Both give similar quality to 12-bit linear encoding  A-law used for International circuits  Both are linear approximations to a log curve  8000 samples/sec * 8 bits/sample = 64Kb/s data rate

Speech Coding

Data Rates  Telephone quality voice:  8000 samples/sec, 8 bits/sample, mono  64Kb/s  CD quality audio:  44,100 samples/sec, 16 bits/sample, stereo  ~1.4Mb/s  Communications channels and storage cost money (although less than they used to)  What can we do to reduce the transmission and/or storage costs without sacrificing too much quality?

Speech Codec Overview  PCM – send every sample  DPCM – send differences between samples  ADPCM – send differences but adapt how we code them  SB-ADPCM – wideband codec, use ADPCM twice, once for lower frequencies, again at lower bitrate for upper frequencies  LPC – linear model of speech formation  CELP – use LPC as base but also use some bits to code corrections for the things LPC gets wrong

PCM  μ-law and A-law PCM have already reduced the data sent.  Lost frequencies above 4KHz  Non-linear encoding to reduce bits per sample  However, each sample is still independently encoded.  In reality, samples are correlated  Can utilise this correlation to reduce the data sent PCM signal encoding and decoding

Differential PCM  Normally the difference between samples is relatively small and can be coded with less than 8 bits  Simplest codec sends only the differences between samples  Typically use 6 bits for the difference, rather than 8 bits for the absolute value  Compression is lossy, as not all differences can be coded  Decoded signal is slightly degraded  Next difference must then be encoded relative to the previous decoded sample, so losses don’t accumulate
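
A toy sketch of the scheme above (illustrative parameters, not any particular standard): the encoder quantises differences but tracks the decoder's reconstruction, so quantisation losses do not accumulate.

    def dpcm_encode(samples, step=4):
        codes, recon = [], 0
        for s in samples:
            diff = s - recon                              # difference from last *decoded* sample
            code = max(-32, min(31, round(diff / step)))  # 6-bit difference code
            codes.append(code)
            recon += code * step                          # mirror the decoder's state
        return codes

    def dpcm_decode(codes, step=4):
        out, recon = [], 0
        for c in codes:
            recon += c * step
            out.append(recon)
        return out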

Differential PCM

ADPCM (Adaptive Differential PCM)  Makes a simple prediction of the next sample, based on the weighted previous n samples  Lossy coding of the difference between the actual sample and the prediction  Receiver runs the same prediction algorithm and adaptive quantisation levels to reconstruct speech [Block diagram: Measured samples → Adaptive quantiser + Predictor → Transmitted values]
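
A much-simplified sketch of the adaptive idea (not any standardised ADPCM scheme): a one-tap predictor plus a step size that grows when the difference codes are large and shrinks when they are small. The decoder repeats the same adaptation from the transmitted codes alone, so no step-size information needs to be sent.

    def adpcm_encode(samples):
        codes, pred, step = [], 0.0, 4.0
        for s in samples:
            code = max(-8, min(7, round((s - pred) / step)))           # 4-bit difference code
            codes.append(code)
            pred += code * step                                        # decoder-side reconstruction
            step = max(1.0, step * (1.5 if abs(code) >= 6 else 0.9))   # adapt the quantiser
        return codes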

ADPCM  Adaptive quantisation cannot always exactly encode a difference  Shows up as quantisation noise  Modems and fax machines try to use the full channel capacity  If they succeed, one sample is not predictable from the next  ADPCM will cause them to fail or work poorly  ADPCM not normally used on national voice circuits, but commonly used internationally to save capacity on expensive satellite or undersea fibres  May attempt to detect if it’s a modem, and switch back to regular PCM

Predictor Error  What happens if the signal gets corrupted while being transmitted?  Wrong value will be decoded  Predictor will be incorrect  All future values will be decoded incorrectly  Modern voice circuits have low but non-zero error  But ADPCM was used on older circuits with higher loss rates too.

ADPCM Predictor Error  Want to design a codec so that errors do not persist  Build in an automatic decay towards zero  If only differences of zero were sent, the predictor would decay the predicted (and hence decoded) value towards zero  Differences have a mean value of zero (there are as many positive differences as negative ones)  Thus predictor decay ensures that any error will also decrease over time until it disappears
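
One way to realise this decay (a sketch; the leak factor is illustrative) is a "leaky" predictor that multiplies its state by a constant slightly below one, so a corrupted sample fades out of the reconstruction instead of persisting.

    DECAY = 0.99   # illustrative leak factor

    def decode_with_decay(codes, step=4):
        out, pred = [], 0.0
        for c in codes:
            pred = DECAY * pred + c * step   # any one-off error shrinks by 1% per sample
            out.append(pred)
        return out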

ADPCM Predictor Error

Sub-band ADPCM  Regular ADPCM reduces the bitrate of 8KHz sampled audio (typically 32Kb/s)  If we have a 64Kb/s channel (e.g. ISDN), we could use the same techniques to produce better than toll-quality  Could just use ADPCM with 16KHz sampled audio, but not all frequencies are of equal importance  Sub-band ADPCM splits the signal into two frequency ranges and codes them separately

Sub-band ADPCM  Filter into two bands  50Hz – 3.5KHz: sample 8KHz, encode at 48Kb/s  3.5KHz – 7KHz: sample 16KHz, encode at 16Kb/s

Sub-band ADPCM  Practical issue:  Unless you have dedicated hardware, probably can’t sample two sub-bands separately at the same time  Need to process digitally  Sample at 16KHz  Use digital filters to split sub-bands and downsample the lower sub-band to 8KHz  Key point of Sub-band ADPCM  Not all frequencies are of equal importance (quantisation noise is more disruptive to some parts of the signal than others)  Allocate the bits where they do most good

Model-based Coding  PCM, DPCM and ADPCM directly code the received audio signal  An alternative approach is to build a parameterised model of the sound source (the human voice)  For each time slice  Analyse the audio signal to determine how the signal was produced  Determine the model parameters that fit  Send the model parameters  At the receiver, synthesize the voice from the model and the received parameters

Speech formation Voiced sounds: series of pulses of air as larynx opens and closes. Basic tone then shaped by changing resonance of vocal tract. Unvoiced sounds: larynx held open, turbulent noise made in mouth.

Linear Predictive Coding (LPC)  Low-bitrate encoder  1.2Kb/s - 4Kb/s  Sounds very synthetic  Basic LPC mostly used where bitrate really matters (eg in military applications)  Most modern voice codecs (eg GSM) are based on enhanced LPC encoders. Reading:

LPC  Digitize signal, and split into segments  For each segment, determine:  Pitch of the signal (i.e. the fundamental frequency)  Loudness of the signal  Whether sound is voiced or unvoiced  Voiced: vowels, “m”, “v”, “l”  Unvoiced: “f”, “s”  Vocal tract filter parameters (LPC coefficients)

LPC Decoder

 Vocal cord synthesizer generates a series of impulses.  Unvoiced synthesizer is a white noise source.  Vocal tract model uses a linear predictive filter.  The n-th sample is a linear combination of the previous p samples plus an error term: x_n = a_1·x_{n-1} + a_2·x_{n-2} + … + a_p·x_{n-p} + e_n  e_n comes from the synthesizer.  The coefficients a_1 … a_p comprise the vocal tract model, and shape the synthesized sounds.
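
A direct sketch of that synthesis filter (coefficients and pitch period are illustrative): each output sample is the weighted sum of the previous p outputs plus the excitation, an impulse train for voiced frames or white noise for unvoiced ones.

    def lpc_synthesize(a, excitation):
        # a[0..p-1] holds the predictor coefficients a_1 .. a_p
        p = len(a)
        x = [0.0] * p                                   # zero history before the frame
        for e_n in excitation:
            past = x[-1:-p - 1:-1]                      # x_{n-1}, x_{n-2}, ..., x_{n-p}
            x.append(sum(ai * xi for ai, xi in zip(a, past)) + e_n)
        return x[p:]

    # Voiced excitation: one impulse per pitch period (8000 Hz / 100 Hz pitch = 80 samples)
    excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(240)]
    frame = lpc_synthesize([0.5, -0.3, 0.1], excitation)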

LPC Encoder  Once pitch and voiced/unvoiced are determined, encoding consists of deriving the optimal LPC coefficients (a_1 … a_p) for the vocal tract model so as to minimize the mean-square error between the predicted signal and the actual signal.  Problem is straightforward in principle. In practice it involves: 1. The computation of a matrix of coefficient values. 2. The solution of a set of linear equations.  Several different ways exist to do this efficiently (autocorrelation, covariance, recursive lattice formulation) to assure convergence to a unique solution.
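
A sketch of the autocorrelation approach named above (assumes NumPy; a production codec would use the Levinson-Durbin recursion rather than a general solver): build the autocorrelation values, form the Toeplitz normal equations, and solve for a_1 .. a_p.

    import numpy as np

    def lpc_coefficients(frame, p):
        frame = np.asarray(frame, dtype=float)
        r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
        R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])   # Toeplitz matrix
        return np.linalg.solve(R, r[1:p + 1])   # coefficients minimising the mean-square error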

Limitation of LPC Model  LPC linear predictor is very simple.  For this to work, the vocal tract “tube” must not have any side branches (these would require a more complex model).  OK for vowels (tube is a reasonable model)  For nasal sounds, nose cavity forms a side branch.  In practice this is ignored in pure LPC.  More complex codecs attempt to code the residue signal, which helps correct this.

Code Excited Linear Prediction (CELP)  Goal is to efficiently encode the residue signal, improving speech quality over LPC, but without increasing the bit rate too much.  CELP codecs use a codebook of typical residue values.  Analyzer compares residue to codebook values.  Chooses value which is closest.  Sends that value.  Receiver looks up the code in its codebook, retrieves the residue, and uses this to excite the LPC formant filter.
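
A sketch of that codebook search (the codebook here is random, purely for illustration): pick the stored residue vector with the smallest squared error against the actual residue and transmit only its index.

    import numpy as np

    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((512, 40))    # 512 candidate residue vectors of 40 samples

    def search_codebook(residue):
        errors = np.sum((codebook - residue) ** 2, axis=1)
        index = int(np.argmin(errors))           # a 9-bit index replaces 40 residue samples
        return index, codebook[index]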

CELP  Problem is that codebook would require different residue values for every possible voice pitch.  Codebook search would be slow, and code would require a lot of bits to send.  One solution is to have two codebooks.  One fixed by codec designers, just large enough to represent one pitch period of residue.  One dynamically filled in with copies of the previous residue delayed by various amounts (delay provides the pitch)  CELP algorithm using these techniques can provide pretty good quality at 4.8Kb/s.

Enhanced LPC Usage  GSM (Groupe Speciale Mobile) Residual Pulse Excited LPC 13Kb/s  LD-CELP Low-delay Code-Excited Linear Prediction (G.728) 16Kb/s  CS-ACELP Conjugate Structure Algebraic CELP (G.729) 8Kb/s  MP-MLQ Multi-Pulse Maximum Likelihood Quantization (G.723.1) 6.3Kb/s

Music Compression

Music Coding  LPC-based codecs model the sound source to achieve good compression. Works well for voice. Terrible for music.  What if you can’t model the source?  Model the limitations of the human ear.  Not all sounds in the sampled audio can actually be heard.  Analyze the audio and send only the sounds that can be heard.  Quantize more coarsely where noise will be less audible.

Amplitude Sensitivity  Dynamic range is ratio of maximum signal amplitude to minimum signal amplitude (measured in decibels).  D = 20·log10(A_max / A_min) dB  Human hearing has dynamic range of ~96dB  Sensitivity of the ear is dependent on frequency.  Most sensitive in range of 2-5KHz.
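
As a quick check of the formula (a rough rule of thumb rather than an exact SNR calculation), n-bit linear coding gives an amplitude ratio of roughly 2^n between full scale and one quantisation step:

    import math

    def dynamic_range_db(n_bits):
        return 20 * math.log10(2 ** n_bits)   # D = 20·log10(A_max / A_min)

    print(dynamic_range_db(8))    # ~48 dB
    print(dynamic_range_db(16))   # ~96 dB, roughly the ear's dynamic range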

Amplitude Sensitivity Frequencies only heard if they exceed a sensitivity threshold: Source: Halsall, p184

Frequency Masking  The sensitivity threshold curve is distorted by the presence of loud sounds.  Frequencies just above and below the frequency of a loud sound need to be louder than the normal minimum amplitude before they can be heard.

Frequency Masking Effect of masking tone at three different frequencies

Temporal Masking  After hearing a loud sound, the ear is deaf to quieter sounds in the same frequency range for a short time  The louder the test tone, the shorter the time it takes for our hearing to recover from the masking sound.

MPEG Audio

MPEG Audio Encoding  Sample audio as PCM (typically 16-bit linear). 12 sets of 32 samples.  Use filter bank to divide signal into 32 frequency bands  Maps time-domain samples into 12 values for each of 32 frequency subbands.  Determine power in each subband.  Use psychoacoustic model to predict masking for each subband.  If power in a subband is below the masking threshold, don’t code it.  Otherwise determine the number of bits needed to code the subband such that the quantization noise is below the masking threshold. [One fewer bit of quantization introduces ~6dB of noise]
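
A sketch of that per-subband decision (all numbers illustrative): subbands whose power sits below the masking threshold get no bits, and the rest get roughly one bit for every 6dB by which their power exceeds the threshold.

    import math

    def allocate_bits(power_db, mask_db):
        bits = []
        for p, m in zip(power_db, mask_db):
            if p <= m:
                bits.append(0)                        # inaudible subband: not coded
            else:
                bits.append(math.ceil((p - m) / 6))   # each extra bit buys ~6 dB of SNR
        return bits

    print(allocate_bits([60, 35, 52, 20], [40, 45, 30, 25]))   # [4, 0, 4, 0]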

MPEG Audio Layers  Layer 1:  DCT-type filter with one frame and equal frequency spread per band.  Psychoacoustic model only uses frequency masking. Layer 2:  Uses three frames in filter (1152 samples).  More compact encoding of scale factors and samples. Layer 3:  Better critical band filter is used (non-equal frequencies)  Psychoacoustic model includes temporal masking effects.  Takes into account stereo redundancy  Huffman encoding of quantized samples.

MPEG Layer 1, 2 Encoder

Fixed Bitrate Encoding in MP3  Goal is to encode at a fixed bitrate.  Eg: 128Kb/s.  Can’t directly allocate bits to subbands because of Huffman encoding (don’t know how many bits will result).  Use an iterative approach to changing the scale factors used in quantizing each subband.

MP3 Iterative Encoding

MP3 Stereo  Multiple stereo modes:  Mono  Stereo: code each channel separately.  Joint stereo:  Code mean + difference.  For low frequencies, only code the mean (you can’t hear stereo at low frequencies)
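
A sketch of the mean + difference (mid/side) idea: the two channels are usually highly correlated, so the side signal is small and cheap to code, and at low frequencies only the mid signal need be kept.

    def to_mid_side(left, right):
        mid = [(l + r) / 2 for l, r in zip(left, right)]
        side = [(l - r) / 2 for l, r in zip(left, right)]
        return mid, side

    def from_mid_side(mid, side):
        left = [m + s for m, s in zip(mid, side)]
        right = [m - s for m, s in zip(mid, side)]
        return left, right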

MP3 Decoder  No need for psychoacoustic model at decoder.  Improved encoder can improve quality for any decoder.

Beyond MP3  MP3 is no longer state of the art.  Most newer codecs follow same general principles though.  MPEG 2 Advanced Audio Codec (AAC)  Ogg Vorbis  Windows Media Audio (WMA)  FLAC

MPEG-2 AAC AAC is derived from MP3. Main differences:  5.1 Surround Sound  Better filter bank:  MP3 used a hybrid filter bank for backward compatibility reasons.  MPEG-2 AAC uses a plain Modified Discrete Cosine Transform  Temporal Noise Shaping (TNS):  Shapes the distribution of quantization noise in time by prediction in the frequency domain.  Helps with voice signals.  Finer control of quantization resolution - the given bit rate can be used more efficiently.  Bit-stream format: the information to be transmitted undergoes entropy coding in order to keep redundancy as low as possible.

Ogg Vorbis  Patent-free, similar quality to AAC.  Like AAC, MDCT used to transform to frequency domain.  Psychoacoustic model used to determine the noise floor (envelope of masking effects) across frequency bands. Includes simultaneous noise masking. Noise floor subtracted from MDCT components. Noise floor and MDCT residue coded separately using a codebook-based vector quantization algorithm.

Windows Media Audio (WMA)  MDCT-based codec, pretty similar to AAC and Ogg.  Frequency and temporal masking, then requantise.  Main differences:  More block sizes to choose from (can trade off temporal vs frequency precision better).  Different use of Huffman coding:  Independently Huffman-code mantissa and exponent of floating-point quantized frequency values.  Mid/side coding of stereo.  Code L+R (mid) and L-R (side) separately.

Reading - FLAC  What is FLAC?  

Comparison of Internet Audio Compression Formats

Thanks!