
UNIT III Audio Compression


1 UNIT III Audio Compression

2 Outline
• Psychoacoustics
• Fundamentals of audio
• Temporal and frequency masking
• MPEG audio
• Companders
• Speech compression – introduction
• Vocoders – different types

3 Psychoacoustics
• The range of human hearing is about 20 Hz to about 20 kHz
• The frequency range of the voice is typically only from about 500 Hz to 4 kHz
• The dynamic range, the ratio of the maximum sound amplitude to the quietest sound that humans can hear, is on the order of about 120 dB

4 Equal-Loudness Relations
• Fletcher-Munson curves – equal-loudness curves that display the relationship between perceived loudness ("phons", in dB) and the stimulus sound volume ("sound pressure level", also in dB), as a function of frequency
• Fig. 14.1 shows the ear's perception of equal loudness:
– The bottom curve shows what level of pure-tone stimulus is required to produce the perception of a 10 dB sound
– Each curve is arranged so that every point on it is perceived as being as loud as a pure tone at 1 kHz at that curve's loudness level

5 Fig. 14.1: Fletcher-Munson curves (re-measured by Robinson and Dadson)

6 Frequency Masking
• Lossy audio data compression methods, such as MPEG/Audio encoding, remove some sounds which are masked anyway
• The general situation in regard to masking is as follows:
1. A lower tone can effectively mask (make us unable to hear) a higher tone
2. The reverse is not true – a higher tone does not mask a lower tone well
3. The greater the power in the masking tone, the wider its influence – the broader the range of frequencies it can mask
4. As a consequence, if two tones are widely separated in frequency then little masking occurs

7 Threshold of Hearing
• A plot of the threshold of human hearing for a pure tone
Fig. 14.2: Threshold of human hearing, for pure tones

8 Threshold of Hearing (cont’d)
• The threshold of hearing curve: if a sound is above the dB level shown then the sound is audible
• Turning up a tone so that it equals or surpasses the curve means that we can then distinguish the sound
• An approximate formula exists for this curve:
Threshold(f) = 3.64 (f/1000)^(−0.8) − 6.5 e^(−0.6 (f/1000 − 3.3)^2) + 10^(−3) (f/1000)^4   (14.1)
– The threshold units are dB; the frequency for the origin (0,0) in formula (14.1) is 2,000 Hz: Threshold(f) = 0 at f = 2 kHz
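The formula is easy to evaluate numerically. A minimal Python sketch, assuming the Terhardt-style approximation quoted above (printed values are approximate):

```python
import numpy as np

def threshold_db(f_hz):
    """Approximate threshold of hearing, Eq. (14.1); f in Hz, result in dB."""
    f = np.asarray(f_hz, dtype=float) / 1000.0   # the formula works in kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

print(threshold_db(2000.0))   # ~0 dB: the curve is normalized near 2 kHz
print(threshold_db(100.0))    # ~23 dB: low frequencies need far more energy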

9 Frequency Masking Curves
• Frequency masking is studied by playing a particular pure tone, say 1 kHz again, at a loud volume, and determining how this tone affects our ability to hear tones nearby in frequency
– one would generate a 1 kHz masking tone, at a fixed sound level of 60 dB, and then raise the level of a nearby tone until it is just audible
• The threshold in Fig. 14.3 plots the audible level for a single masking tone (1 kHz)
• Fig. 14.4 shows how the plot changes if other masking tones are used

10 Fig. 14.3: Effect on threshold for 1 kHz masking tone

11 Fig. 14.4: Effect of masking tone at three different frequencies

12 Critical Bands
• Critical bandwidth represents the ear's resolving power for simultaneous tones or partials
– At the low-frequency end, a critical band is less than 100 Hz wide, while for high frequencies the width can be greater than 4 kHz
• Experiments indicate that the critical bandwidth:
– for masking frequencies < 500 Hz: remains approximately constant in width (about 100 Hz)
– for masking frequencies > 500 Hz: increases approximately linearly with frequency

13 Table 14.1: The 25 critical bands and their bandwidths


15 Effect of Masking Tones in Bark Units
• The Bark unit is defined as the width of one critical band, for any masking frequency
• The idea of the Bark unit: every critical band width is roughly equal in terms of Barks (refer to Fig. 14.5)
Fig. 14.5: Effect of masking tones, expressed in Bark units

16 Conversion: Frequency & Critical Band Number
• Conversion expressed in the Bark unit:
b = f/100, for f < 500 Hz;  b = 9 + 4 log2(f/1000), for f ≥ 500 Hz   (14.2)
where f is in Hz and b is in Barks
• Another formula used for the Bark scale:
b = 13.0 arctan(0.76 f) + 3.5 arctan(f^2/56.25)   (14.3)
where f is in kHz and b is in Barks (the same applies to all below)
• The inverse equation:
f = [(exp(0.219 b)/352) + 0.1] b − 0.032 exp[−0.15 (b − 5)^2]   (14.4)
• The critical bandwidth (df) for a given center frequency f can also be approximated by:
df = 25 + 75 [1 + 1.4 f^2]^0.69   (14.5)
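These conversions are straightforward to code. A small Python sketch of Eqs. (14.2), (14.3) and (14.5) as quoted above (the function names are ours):

```python
import numpy as np

def hz_to_bark_simple(f_hz):
    """Eq. (14.2): piecewise critical-band number, f in Hz."""
    f_hz = np.asarray(f_hz, dtype=float)
    return np.where(f_hz < 500.0,
                    f_hz / 100.0,
                    9.0 + 4.0 * np.log2(f_hz / 1000.0))

def hz_to_bark(f_khz):
    """Eq. (14.3): Zwicker-style formula, f in kHz, result in Barks."""
    return 13.0 * np.arctan(0.76 * f_khz) + 3.5 * np.arctan(f_khz ** 2 / 56.25)

def critical_bandwidth(f_khz):
    """Eq. (14.5): critical bandwidth in Hz for center frequency f in kHz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

print(hz_to_bark(1.0))           # ~8.5 Barks at 1 kHz
print(critical_bandwidth(0.2))   # ~100 Hz at low frequencies
print(critical_bandwidth(10.0))  # ~2.3 kHz at high frequencies
```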

17 Temporal Masking
• Phenomenon: any loud tone will cause the hearing receptors in the inner ear to become saturated and require time to recover
• The following figures show the results of masking experiments:

18 Fig. 14.6: The louder the test tone, the less time it takes for our hearing to recover from the masking.

19 Fig. 14.7: Effect of temporal and frequency masking, depending on both time and closeness in frequency.

20 Fig. 14.8: For a masking tone that is played for a longer time, it takes longer before a test tone can be heard. Solid curve: masking tone played for 200 msec; dashed curve: masking tone played for 100 msec.

21 14.2 MPEG Audio
• MPEG audio compression takes advantage of psychoacoustic models, constructing a large multi-dimensional lookup table to transmit masked frequency components using fewer bits
• MPEG Audio Overview:
1. Applies a filter bank to the input to break it into its frequency components
2. In parallel, a psychoacoustic model is applied to the data for the bit-allocation block
3. The number of bits allocated is used to quantize the information from the filter bank – providing the compression

22 MPEG Layers
• MPEG audio offers three compatible layers:
– Each succeeding layer is able to understand the lower layers
– Each succeeding layer offers more complexity in the psychoacoustic model and better compression for a given level of audio quality
– Each succeeding layer's increased compression effectiveness is accompanied by extra delay
• The objective of the MPEG layers: a good tradeoff between quality and bit-rate

23 MPEG Layers (cont'd)
• Layer 1 quality can be quite good provided a comparatively high bit-rate is available
– Digital Audio Tape typically uses Layer 1 at around 192 kbps
• Layer 2 has more complexity; it was proposed for use in Digital Audio Broadcasting
• Layer 3 (MP3) is the most complex, and was originally aimed at audio transmission over ISDN lines
• Most of the complexity increase is at the encoder, not the decoder – accounting for the popularity of MP3 players

24 MPEG Audio Strategy
• The MPEG approach to compression relies on:
– Quantization
– The fact that the human auditory system is not accurate within the width of a critical band (perceived loudness and audibility of a frequency)
• The MPEG encoder employs a bank of filters to:
– Analyze the frequency ("spectral") components of the audio signal by calculating a frequency transform of a window of signal values
– Decompose the signal into subbands by using a bank of filters (Layers 1 & 2: "quadrature-mirror"; Layer 3: adds a DCT; the psychoacoustic model uses a Fourier transform)

25 MPEG Audio Strategy (cont'd)
• Frequency masking: uses a psychoacoustic model to estimate the just-noticeable noise level:
– The encoder balances the masking behavior and the available number of bits by discarding inaudible frequencies
– Quantization is scaled according to the sound level that is left over, above the masking levels
• May take into account the actual width of the critical bands:
– For practical purposes, audible frequencies are divided into 25 main critical bands (Table 14.1)
– To keep things simple, a uniform width is adopted for all frequency analysis filters, using 32 overlapping subbands

26 MPEG Audio Compression Algorithm
Fig. 14.9: Basic MPEG Audio encoder and decoder.

27 Basic Algorithm (cont’d)
• The algorithm proceeds by dividing the input into 32 frequency subbands, via a filter bank
– A linear operation taking 32 PCM samples, sampled in time; the output is 32 frequency coefficients
• In the Layer 1 encoder, the sets of 32 PCM values are first assembled into a set of 12 groups of 32s
– an inherent time lag in the coder, equal to the time to accumulate 384 (i.e., 12 × 32) samples
• Fig. 14.11 shows how the samples are organized
– A Layer 2 or Layer 3 frame actually accumulates more than 12 samples for each subband: a frame includes 1,152 samples

28 Fig. 14.11: MPEG Audio Frame Sizes

29 Bit Allocation Algorithm
• Aim: ensure that all of the quantization noise is below the masking thresholds
• One common scheme:
– For each subband, the psychoacoustic model calculates the Signal-to-Mask Ratio (SMR) in dB
– Then the Mask-to-Noise Ratio (MNR) is defined as the difference (as shown in Fig. 14.12):
MNR(dB) = SNR(dB) − SMR(dB)   (14.6)
– The lowest MNR is determined, and the number of code-bits allocated to this subband is incremented
– Then a new estimate of the SNR is made, and the process iterates until there are no more bits to allocate
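A hedged sketch of such a greedy loop in Python (not the standard's actual procedure: the ~6 dB-per-bit SNR model and the `max_bits_per_band` cap are simplifying assumptions, and the SMRs would come from the psychoacoustic model):

```python
import numpy as np

def allocate_bits(smr_db, total_bits, max_bits_per_band=15):
    """Greedy allocation in the spirit of Eq. (14.6): repeatedly give a bit
    to the subband whose mask-to-noise ratio is currently worst."""
    smr_db = np.asarray(smr_db, dtype=float)
    bits = np.zeros_like(smr_db, dtype=int)
    for _ in range(total_bits):
        snr_db = 6.02 * bits + 1.76 * (bits > 0)    # crude quantizer-SNR model
        mnr_db = snr_db - smr_db                    # Eq. (14.6): MNR = SNR - SMR
        mnr_db[bits >= max_bits_per_band] = np.inf  # band already at max resolution
        worst = int(np.argmin(mnr_db))              # most audible quantization noise
        bits[worst] += 1
    return bits

smr = [20.0, 12.0, 5.0, -3.0]    # example SMRs (dB) for four subbands
print(allocate_bits(smr, 12))    # bands with high SMR receive more bits
```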

30 Fig. 14.12: MNR and SMR. A qualitative view of SNR, SMR and MNR is shown, with one dominant masker and m bits allocated to a particular critical band.

31 MPEG-1 Audio Layers 1 and 2
• Mask calculations are performed in parallel with subband filtering, as in Fig. 14.13:
Fig. 14.13: MPEG-1 Audio Layers 1 and 2.

32 Layer 2 of MPEG-1 Audio
• Main differences:
– Three groups of 12 samples are encoded in each frame, and temporal masking is brought into play, as well as frequency masking
– Bit allocation is applied to window lengths of 36 samples instead of 12
– The resolution of the quantizers is increased from 15 bits to 16
• Advantage:
– a single scaling factor can be used for all three groups

33 Layer 3 of MPEG-1 Audio
• Main differences:
– Employs a similar filter bank to that used in Layer 2, except using a set of filters with non-equal frequencies
– Takes into account stereo redundancy
– Uses the Modified Discrete Cosine Transform (MDCT), which addresses the problems the DCT has at window boundaries by overlapping frames by 50%:
F(u) = Σ_{i=0..N−1} f(i) cos[ (2π/N) (i + 1/2 + N/4) (u + 1/2) ],  u = 0, …, N/2 − 1   (14.7)

34 Fig. 14.14: MPEG Audio Layer 3 Coding.

35 MP3 Compression Performance
• Table 14.2 shows various achievable MP3 compression ratios:
Table 14.2: MP3 compression performance

36 MPEG-2 AAC (Advanced Audio Coding)
• The standard vehicle for DVDs:
– Audio coding technology for the DVD-Audio Recordable (DVD-AR) format, also adopted by XM Radio
• Aimed at transparent sound reproduction for theaters
– Can deliver this at 320 kbps for five channels, so that sound can be played from 5 different directions: Left, Right, Center, Left-Surround, and Right-Surround
• Also capable of delivering high-quality stereo sound at bit-rates below 128 kbps

37 MPEG-2 AAC (cont'd)
• Supports up to 48 channels, sampling rates between 8 kHz and 96 kHz, and bit-rates up to 576 kbps per channel
• Like MPEG-1, MPEG-2 supports three different "profiles", but with a different purpose:
– Main profile
– Low Complexity (LC) profile
– Scalable Sampling Rate (SSR) profile

38 MPEG-4 Audio
• Integrates several different audio components into one standard: speech compression, perceptually based coders, text-to-speech, and MIDI
• MPEG-4 AAC (Advanced Audio Coding) is similar to the MPEG-2 AAC standard, with some minor changes
• Perceptual Coders:
– Incorporate a Perceptual Noise Substitution module
– Include a Bit-Sliced Arithmetic Coding (BSAC) module
– Also include a second perceptual audio coder, a vector-quantization method entitled TwinVQ

39 MPEG-4 Audio (cont'd)
• Structured Coders:
– Adopt "Synthetic/Natural Hybrid Coding" (SNHC) in order to make very low bit-rate delivery an option
– Objective: integrate "natural" multimedia sequences, both video and audio, with those arising synthetically – "structured" audio
– Take a "toolbox" approach and allow specification of many such models
– E.g., Text-To-Speech (TTS) is an ultra-low bit-rate method, and actually works, provided one need not care what the speaker actually sounds like

40 Other Commercial Audio Codecs
• Table 14.3 summarizes the target bit-rate range and main features of other modern general audio codecs
Table 14.3: Comparison of audio coding systems

41 The Future: MPEG-7 and MPEG-21
• Difference from current standards:
– MPEG-4 is aimed at compression using objects
– MPEG-7 is mainly aimed at "search": how can we find objects, assuming that multimedia is indeed coded in terms of objects?

42 – MPEG-7: a means of standardizing meta-data for audiovisual multimedia sequences – meant to represent information about multimedia information
• In terms of audio: facilitates the representation of, and search for, sound content
• An example application supported by MPEG-7 is automatic speech recognition (ASR)
– MPEG-21: an ongoing effort, aimed at driving a standardization effort for a Multimedia Framework from a consumer's perspective, particularly interoperability
• In terms of audio: supports this goal, using audio

43 Uniform Quantization
As discussed in the previous lecture, the disadvantage of using uniform quantization is that low-amplitude signals are drastically affected. This fact can be observed by considering the simulation results in the next four slides. In both cases two signals with a similar shape, but different amplitudes, are applied to the same quantizer with a spacing of Δv = 0.0625 between two quantization levels. The effects of quantization on the low-amplitude signal are obviously more significant than on the high-amplitude signal.
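The experiment shown on the next four slides is easy to reproduce. A Python sketch (the 5 Hz sinusoids are stand-ins for the slides' actual test signals; the mid-rise quantizer is one reasonable choice):

```python
import numpy as np

def uniform_quantize(x, dv=0.0625):
    """Mid-rise uniform quantizer with step dv (the slides' 0.0625 spacing)."""
    return dv * (np.floor(x / dv) + 0.5)

def sqnr_db(x, xq):
    """Measured signal-to-quantization-noise ratio in dB."""
    return 10 * np.log10(np.mean(x ** 2) / np.mean((x - xq) ** 2))

t = np.linspace(0, 1, 1000)
loud  = 1.000 * np.sin(2 * np.pi * 5 * t)   # Input Signal 1, max amplitude 1
quiet = 0.125 * np.sin(2 * np.pi * 5 * t)   # Input Signal 2, max amplitude 0.125

print(sqnr_db(loud,  uniform_quantize(loud)))   # ~32 dB
print(sqnr_db(quiet, uniform_quantize(quiet)))  # ~18 dB lower: only 4 levels span it
```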

44 Uniform Quantization – Input Signal 1, max amplitude = 1.

45 Uniform Quantization – Quantized Signal 1, Δv = 0.0625.

46 Uniform Quantization – Input Signal 2, max amplitude = 0.125.

47 Uniform Quantization – Quantized Signal 2, Δv = 0.0625.

48 Uniform Quantization
Figure-1: Input-output characteristic of a uniform quantizer.

49 Uniform Quantization
Recall that the Signal-to-Quantization-Noise Ratio of a uniform quantizer with L levels spanning [−m_p, m_p] is given by:
SNqR = 3 L^2 (P / m_p^2)
where P is the average power of the input signal. This equation confirms the earlier discussion that the SNqR for a low-amplitude signal is quite low. Therefore, the effect of quantization noise on such audio signals should be noticeable. Let's consider the case of voice signals (see next slide).
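As a quick numerical check of this formula (a sketch; L = 32 levels of width 0.0625 span [−1, 1], matching the earlier simulations):

```python
import numpy as np

# Theoretical SNqR = 3 * L^2 * P / mp^2 for the two sinusoidal test signals.
L, mp = 32, 1.0
for amp in (1.0, 0.125):
    P = amp ** 2 / 2                  # average power of a sinusoid
    print(amp, 10 * np.log10(3 * L ** 2 * P / mp ** 2))
# ~31.9 dB for the loud signal, ~13.8 dB for the quiet one,
# in close agreement with the measured values above.
```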

50 Uniform Quantization
Listen to the sample voice signals: first play "Voice file-1", then play "Voice file-1 Quantized (uniform)". Do you notice the degradation in voice quality? This degradation can be attributed to uniformly spaced quantization levels.
Note: You may not notice the difference between the two clips if you are using small laptop speakers. You should use either headphones or larger speakers.

51 Uniform Quantization
More insight into signal degradation can be gained by looking at the voice signal's histogram. A histogram shows the distribution of values of data. Figure-2 below shows the histogram of voice signal-1. Most of the values have low amplitude and occur around zero. Therefore, for voice signals uniform quantization will result in signal degradation.
Figure-2: Histogram of voice signal-1

52 Non-Uniform Quantization
The effect of quantization noise can be reduced by increasing the number of quantization intervals in the low-amplitude regions. This means that the spacing between the quantization levels should not be uniform. This type of quantization is called "Non-Uniform Quantization". The input-output characteristic is shown below.

53 Non-uniform Quantization
Non-uniform quantization is achieved by first passing the input signal through a "compressor". The output of the compressor is then passed through a uniform quantizer. The combined effect of the compressor and the uniform quantizer is that of a non-uniform quantizer (see figure 3). At the receiver the voice signal is restored to its original form by using an expander. This complete process of Compressing and Expanding the signal before and after uniform quantization is called Companding.

54 Non-uniform Quantization (Companding)
Input-output relationship of a compressor: y = g(x), where x = m(t)/m_p, with both axes normalized to [−1, 1].

55 Non-uniform Quantization (Companding)
µ-Law (used in the USA):
y = sgn(x) · ln(1 + µ|x|) / ln(1 + µ),  where x = m(t)/m_p
The value of 'µ' used with 8-bit quantizers for voice signals is 255.
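A minimal µ-law compressor/expander pair in Python (a sketch of the formula above, not a telephony-grade G.711 implementation):

```python
import numpy as np

def mu_compress(x, mu=255.0):
    """mu-law compressor; x normalized to [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_expand(y, mu=255.0):
    """Inverse characteristic (the expander used at the receiver)."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.array([0.01, 0.1, 1.0])
y = mu_compress(x)
print(y)              # small inputs are boosted: 0.01 -> ~0.23
print(mu_expand(y))   # the round trip recovers the original values
```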

56 Non-uniform Quantization (Companding)
The µ-law compressor characteristic curve for different values of ‘µ’.

57 Non-uniform Quantization (Companding)
Compressor → Uniform Quantizer → Expander (the voice signal can be heard at each stage).

58 Non-uniform Quantization (Companding)
Compressor → Uniform Quantizer → Expander: the 3 stages combine to give the characteristics of a non-uniform quantizer.

59 Non-uniform Quantization (Companding)
A uniform quantizer, with input and output voice files, is presented here for comparison with the non-uniform quantizer.

60 Non-Uniform Quantization
Let's have a look at the histogram of the compressed voice signal. In contrast to the histogram of the uncompressed signal (Figure-2), you can see that the values are now more distributed. Therefore, it can be said that the compressor changes the histogram/PDF of the voice signal from Gaussian (bell-shaped) to a uniform distribution (shown below).
Figure-3: Histogram of compressed voice signal

61 Non-Uniform Quantization
Where is the compression? The compression process in non-uniform quantization demands some elaboration for clarity of concepts. It should be noted that the compression mentioned in the previous slides is not the time- or frequency-domain compression which students are familiar with. This can be verified by looking at the time-domain waveforms at the input and output of the compressor. Note that both signals last for 3.75 seconds. Therefore, there is no compression in time or frequency.
Fig-4-a: Signal at compressor input. Fig-4-b: Signal at compressor output.

62 Non-Uniform Quantization
Where is the compression? The compression here occurs in the amplitude values. An intuitive way of explaining this compression in amplitudes is to say that the amplitudes of the compressed signal are more closely spaced (compressed) in comparison to the original signal. This can also be observed by looking at the waveform of the compressed signal (Fig-4-b). The compressor boosts the small amplitudes by a large amount. However, the large amplitude values receive very small gain and the maximum value remains the same. Therefore, the small values are multiplied by a large gain and are spaced relatively closer to the large amplitude values. A parameter which can be used to measure the degree of compression here is the dynamic range: "The dynamic range is the ratio of the maximum and minimum values of a variable quantity such as sound or light." In the simulations, the dynamic range (DR) of the compressor input = … dB, whereas the dynamic range (DR) of the compressor output = … dB.

63 Vocoders

64 The Channel Vocoder (analyzer):
The channel vocoder employs a bank of bandpass filters, each having a bandwidth between 100 Hz and 300 Hz. Typically, linear-phase FIR filters are used. The output of each filter is rectified and lowpass filtered. The bandwidth of the lowpass filter is selected to match the time variations in the characteristics of the vocal tract. In addition to the measurement of the spectral magnitudes, a voicing detector and a pitch estimator are included in the speech analysis.
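A sketch of one analyzer path in Python (the band edges, envelope cutoff, and tap count are illustrative choices, not values from the slide):

```python
import numpy as np
from scipy import signal

def channel_vocoder_analyze(s, fs, band_edges, env_cutoff=50.0, taps=129):
    """Analyzer path per band: bandpass -> rectify -> lowpass.
    band_edges: list of (lo, hi) pairs in Hz; env_cutoff tracks the
    slow variations of the vocal tract."""
    envelopes = []
    for lo, hi in band_edges:
        bp = signal.firwin(taps, [lo, hi], pass_zero=False, fs=fs)  # linear-phase FIR
        sub = signal.lfilter(bp, 1.0, s)
        rect = np.abs(sub)                                          # rectifier
        lp = signal.firwin(taps, env_cutoff, fs=fs)
        envelopes.append(signal.lfilter(lp, 1.0, rect))             # spectral magnitude
    return np.array(envelopes)   # one slowly varying envelope per channel

fs = 8000
s = np.random.randn(fs)               # stand-in for one second of speech
bands = [(300, 500), (500, 800), (800, 1100)]
print(channel_vocoder_analyze(s, fs, bands).shape)   # (3, 8000)
```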

65 The Channel Vocoder (analyzer block diagram):
S(n) → bandpass filter → rectifier → lowpass filter → A/D converter → encoder → To Channel (one such path per band); a voicing detector and pitch estimator also feed the encoder.

66 The Channel Vocoder (synthesizer):
At the receiver, the signal samples are passed through D/A converters. The outputs of the D/As are multiplied by the voiced or unvoiced signal sources. The resulting signals are passed through bandpass filters. The outputs of the bandpass filters are summed to form the synthesized speech signal.

67 The Channel Vocoder (synthesizer block diagram):
From Channel → decoder → D/A converters → multiplication by the excitation → bandpass filters → sum → output speech; a switch driven by the voicing information selects between a pulse generator (set by the pitch period) and a random noise generator as the excitation.

68 The Phase Vocoder: The phase vocoder is similar to the channel vocoder. However, instead of estimating the pitch, the phase vocoder estimates the phase derivative at the output of each filter. By coding and transmitting the phase derivative, this vocoder preserves the phase information that the channel vocoder discards.

69 The Phase Vocoder (analyzer block diagram):
S(n) → compute the short-term magnitude and phase derivative per channel: the short-term magnitude is lowpass filtered and decimated, while the phase is differentiated, lowpass filtered, and decimated to give the short-term phase derivative; both feed the encoder → To Channel.

70 The Phase Vocoder (synthesizer block diagram, kth channel):
From Channel → decoder → interpolators for the short-term amplitude and phase derivative; the phase derivative is integrated, and cos/sin oscillators scaled by the amplitude reconstruct the kth channel.

71 The Formant Vocoder: The formant vocoder can be viewed as a type of channel vocoder that estimates the first three or four formants in a segment of speech. It is this information, plus the pitch period, that is encoded and transmitted to the receiver.

72 The Formant Vocoder: Example of formants:
(a) The spectrogram of the utterance "day one", showing the pitch and the harmonic structure of speech.
(b) A zoomed spectrogram of the fundamental and the second harmonic.

73 The Formant Vocoder (analyzer block diagram):
Input speech → formant estimation (F1, B1; F2, B2; …) and pitch / V/U detection (F0, V/U) → encoder.
Fk: the frequency of the kth formant; Bk: the bandwidth of the kth formant.

74 The Formant Vocoder (synthesizer block diagram):
An excitation signal (built from F0 and the V/U decision) drives resonators specified by (F1, B1), (F2, B2), … to synthesize the speech.

75 Linear Predictive Coding :
The objective of LP analysis is to estimate the parameters of an all-pole model of the vocal tract. Several methods have been devised for generating the excitation sequence for speech synthesis. LPC-type speech analysis and synthesis systems differ primarily in the type of excitation signal that is generated for speech synthesis.
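A common route to the all-pole parameters is the autocorrelation method with the Levinson-Durbin recursion; a compact Python sketch (one standard approach, not the only one):

```python
import numpy as np

def lpc(frame, order=10):
    """All-pole (LP) coefficients via the autocorrelation method and
    Levinson-Durbin recursion."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]  # lags 0..order
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]                 # update predictor
        a[i] = k
        err *= 1.0 - k * k                                  # residual energy shrinks
    return a, err   # predictor polynomial A(z) and prediction-error energy

frame = np.random.randn(180)      # e.g. one 180-sample frame
a, e = lpc(frame, order=10)
print(a.shape, e > 0)             # 11 coefficients (a0 = 1), positive error
```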

76 LPC-10: This method is called LPC-10 because 10 coefficients are typically employed. LPC-10 partitions the speech into 180-sample frames. Pitch and voicing decisions are determined by using the AMDF (average magnitude difference function) and zero-crossing measures.
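A sketch of an AMDF-based pitch estimate (the square-wave test signal is a toy stand-in for voiced speech; lag limits are illustrative):

```python
import numpy as np

def amdf(frame, lag_min=20, lag_max=160):
    """Average Magnitude Difference Function: it dips at multiples of the
    pitch period. A voicing decision would combine this with zero crossings."""
    lags = np.arange(lag_min, lag_max)
    d = np.array([np.mean(np.abs(frame[l:] - frame[:-l])) for l in lags])
    return lags, d

fs = 8000
t = np.arange(180) / fs
frame = np.sign(np.sin(2 * np.pi * 100 * t))   # crude 100 Hz voiced-like signal
lags, d = amdf(frame)
print(lags[np.argmin(d)])   # ~80 samples = 10 ms period = 100 Hz pitch
```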

77 Residual Excited LP Vocoder :
Speech quality can be improved at the expense of a higher bit rate by computing and transmitting a residual error, as is done in the case of DPCM. In one method, the LPC model and excitation parameters are estimated from a frame of speech.

78 Residual Excited LP Vocoder :
The speech is synthesized at the transmitter and subtracted from the original speech signal to form the residual error. The residual error is quantized, coded, and transmitted to the receiver. At the receiver the signal is synthesized by adding the residual error to the signal generated from the model.

79 RELP Block Diagram:
S(n) → buffer and window → LP analysis; the LP parameters and excitation parameters go to the encoder → To Channel, while an LP synthesis model reconstructs the speech, which is subtracted (∑) from the input to form the residual.

80 Code Excited LP: CELP is an analysis-by-synthesis method in which the excitation sequence is selected from a codebook of zero-mean Gaussian sequences. A typical bit rate for CELP is 4800 bps.
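A toy analysis-by-synthesis search in Python (pitch prediction and the perceptual weighting filter W(z) are omitted; the codebook size and frame length are illustrative):

```python
import numpy as np
from scipy import signal

def celp_search(target, codebook, lp_coeffs):
    """Pass every codebook entry through the LP synthesis filter 1/A(z),
    scale it by its optimal gain, and keep the (index, gain) minimizing the
    squared error against the target segment."""
    best = (None, 0.0, np.inf)
    for idx, code in enumerate(codebook):
        synth = signal.lfilter([1.0], lp_coeffs, code)        # 1/A(z)
        gain = np.dot(target, synth) / np.dot(synth, synth)   # optimal gain
        err = np.sum((target - gain * synth) ** 2)
        if err < best[2]:
            best = (idx, gain, err)
    return best   # (codebook index, gain, residual error energy)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((64, 40))         # 64 zero-mean Gaussian sequences
a = np.array([1.0, -0.9])                        # toy LP polynomial A(z)
target = signal.lfilter([1.0], a, codebook[17])  # pretend target segment
print(celp_search(target, codebook, a)[0])       # recovers index 17, gain ~1
```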

81 CELP (analysis-by-synthesis coder) :
Speech samples → buffer and LP analysis (LP parameters sent as side information). A Gaussian excitation codebook entry, scaled by a gain, drives the pitch synthesis filter and the spectral-envelope (LP) synthesis filter; the synthesized speech is subtracted from the input, the error passes through the perceptual weighting filter W(z), and its energy is computed (square and sum). The index of the excitation sequence minimizing this energy is transmitted.

82 CELP (synthesizer):
From Channel → decoder → buffer and controller. A Gaussian excitation codebook entry drives the pitch synthesis filter and the LP synthesis filter; the LP parameters, gain, and pitch estimates are updated from the received side information.

83 Vector Sum Excited LP: The VSELP coder and decoder basically differ in the method by which the excitation sequence is formed. In the next block diagram of the VSELP, there are three excitation sources. One excitation is obtained from the pitch-period state. The other two excitation sources are obtained from two codebooks.

84 Vector Sum Excited LP : The bit rate of the VSELP is about 8000 bps.
Bit allocations for 8000-bps VSELP:
Parameter | Bits/5-ms frame | Bits/20-ms frame
10 LPC coefficients | … | …
Average speech energy | … | …
Excitation codewords from two VSELP codebooks | … | …
Gain parameters | … | …
Lag of pitch filter | … | …
Total | … | …

85 VSELP Decoder:
The long-term filter state and Codebooks 1 and 2 provide the three excitations, which are summed (∑) and passed through the pitch synthesis filter, the spectral-envelope (LP) synthesis filter, and a spectral post-filter to produce the synthetic speech.

