Audio Compression
Audio Compression
CD-quality audio: sampling rate = 44.1 KHz, quantization = 16 bits/sample, bit-rate ≈ 700 Kb/s per channel (1.41 Mb/s for 2-channel stereo)
Telephone-quality speech: sampling rate = 8 KHz, bit-rate = 128 Kb/s
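These figures follow directly from the definition of uncompressed PCM bit-rate (sampling rate × bits per sample × channels). A minimal sketch of the arithmetic, assuming the CD rate of 44.1 KHz:

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM bit-rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

# CD-quality stereo: 44,100 * 16 * 2 = 1,411,200 b/s (~1.41 Mb/s)
print(pcm_bitrate(44_100, 16, channels=2))
# Telephone-quality speech: 8,000 * 16 = 128,000 b/s (128 Kb/s)
print(pcm_bitrate(8_000, 16))
```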
Absolute Threshold
A tone is audible only if its power is above the absolute threshold level.
Masking Effect
If a tone of a certain frequency and amplitude is present, the audibility threshold curve changes: other tones or noise of similar frequency, but of much lower amplitude, are not audible.
Masking Effect (Single Masker)
Masking Effect (Multiple Maskers)
Temporal Masking
A loud tone of finite duration will mask a softer tone that follows it (for around 30 ms). A similar effect also occurs when the softer tone precedes the louder tone.
Perceptual Coding
Perceptual coding tries to minimize the perceptual distortion in a transform coding scheme.
Basic concept: allocate more bits (more quantization levels, less error) to the channels that are most audible, and fewer bits (more error) to the channels that are least audible.
Needs to continuously analyze the signal to determine the current audibility threshold curve using a perceptual model.
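A minimal sketch of the greedy idea behind perceptual bit allocation, not any particular standard's algorithm. The per-band signal-to-mask ratios (SMR) are assumed to come from the perceptual model, and the "6 dB per bit" rule is the usual approximation:

```python
def allocate_bits(smr_db, total_bits, max_bits_per_band=15):
    """Greedy perceptual bit allocation (illustrative only).

    smr_db: signal-to-mask ratio in dB for each subband, from the perceptual model.
    Each extra quantizer bit buys roughly 6 dB of SNR, so the next bit always
    goes to the band whose quantization noise is currently the most audible.
    """
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        eligible = [i for i in range(len(bits)) if bits[i] < max_bits_per_band]
        if not eligible:
            break
        # pick the band with the worst (largest) noise-to-mask ratio so far
        band = max(eligible, key=lambda i: smr_db[i] - 6.02 * bits[i])
        bits[band] += 1
    return bits

# Band 0 is the most audible here, so it receives the most bits
print(allocate_bits([20.0, 5.0, -3.0, 12.0], total_bits=12))
```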
Audio Coding: Main Standards
MPEG (Moving Picture Experts Group) family (note: the standard only specifies the decoder!)
MPEG-1: Layer 1, Layer 2, Layer 3 (MP3)
MPEG-2: backward-compatible part, and AAC (non-backward-compatible)
Dolby AC-3
MPEG-1 Audio Coder
Layer 1: deemed transparent at 384 Kb/s per channel
Subband coding with 32 channels; input divided into groups of 12 samples per subband
Coefficient normalization (extracts a scale factor per block)
For each block, chooses among 15 quantizers for perceptual quantization
No entropy coding after transform coding
Decoder is much simpler than the encoder
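A minimal sketch of the per-block processing just described for one subband (normalize a 12-sample block by its scale factor, then quantize uniformly with the number of levels chosen by bit allocation). The real Layer 1 scale-factor table and quantizer set are not reproduced here:

```python
def encode_block(samples12, levels):
    """Layer-1-style coding sketch for one 12-sample subband block.

    levels: number of quantizer levels chosen by the bit-allocation step
            (Layer 1 picks one of 15 quantizers per block).
    """
    scale_factor = max(abs(s) for s in samples12) or 1.0   # block normalization
    normalized = [s / scale_factor for s in samples12]     # now in [-1, 1]
    step = 2.0 / (levels - 1)
    indices = [round((x + 1.0) / step) for x in normalized] # uniform quantizer
    return scale_factor, indices

def decode_block(scale_factor, indices, levels):
    step = 2.0 / (levels - 1)
    return [scale_factor * (i * step - 1.0) for i in indices]

sf, idx = encode_block(
    [0.1, -0.4, 0.25, 0.0, 0.3, -0.2, 0.05, 0.4, -0.35, 0.1, 0.0, 0.2], levels=15)
print(decode_block(sf, idx, levels=15))
```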
Intensity Stereo Mode
The stereo effect at middle and high frequencies depends not so much on the different channel content as on the different channel amplitude.
Middle and upper subbands of the left and right channels are added together, and only the resulting summed samples are quantized.
The scale factor is sent for both channels so that the amplitudes can be controlled independently during playback.
MPEG-1 Audio Coder (cont’d)
Layer 2: transparent at 256 Kb/s per channel
Improved perceptual model (more computationally intensive)
Finer-resolution quantizers
Layer 3 (MP3): transparent at 96 Kb/s per channel
Applies a variable-size modified DCT to the samples of each subband channel
Uses non-uniform quantizers
Has an entropy coder (Huffman), which requires buffering
Much more complex than Layers 1 and 2
MPEG-1 Layers 1 and 2 Audio Encoder/Decoder (Single Channel)
MPEG-1 Layer 3 (MP3) Audio Encoder/Decoder (Single Channel)
Middle-side Stereo Mode
Frequency ranges that would normally be coded as left and right are instead coded as middle (left+right) and side (left-right).
The side channel can be coded with fewer bits, because the two channels are highly correlated.
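A minimal sketch of the mid/side transform. One common convention divides by 2 so the inverse is exact; the standards define their own scaling:

```python
def ms_encode(left, right):
    """Mid/side transform: mid carries the common content, side the difference."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# When the channels are highly correlated, the side signal is near zero
# and can be coded with very few bits.
L = [0.50, 0.30, -0.20, 0.10]
R = [0.48, 0.31, -0.22, 0.09]
mid, side = ms_encode(L, R)
print(side)                      # small values
print(ms_decode(mid, side) == (L, R))
```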
MPEG-2 Audio Coder
Backward compatible (i.e., MPEG-1 decoders can decode a portion of the MPEG-2 bit-stream)
Original goal: provide theater-style surround-sound capabilities
Modes of operation:
monaural
stereo
three channel (left, right and center)
four channel (left, right, center and rear surround)
five channel (four channel + a second surround channel)
Full five-channel surround stereo at 640 Kb/s
MPEG-2 Audio Coder (Cont’d)
Non-backward compatible (AAC)
At 320 Kb/s, judged equivalent to backward-compatible MPEG-2 at 640 Kb/s for five-channel surround sound
Can operate with any number of channels (from 1 to 48) and output bit-rates from 8 Kb/s to 182 Kb/s per channel
Sampling rate can be as low as 8 KHz and as high as 96 KHz per channel
Dolby AC-3
Used in movie theaters as part of the Dolby Digital film system
Selected for USA digital TV (DTV) and DVD
Bit-rate: 320 Kb/s for 5.1-channel stereo
Uses a 512-point modified DCT (can be switched to 256-point)
Floating-point conversion of the coefficients into exponent-mantissa pairs (mantissas quantized with a variable number of bits)
Transmits perceptual model parameters rather than the bit allocation itself
Dolby AC-3 Encoder (block diagram): PCM samples → frequency-domain transform → block floating point (exponents and mantissas) → exponents drive bit allocation → mantissa quantization → bitstream packing → encoded audio
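A minimal sketch of the exponent/mantissa split in the block-floating-point stage, not the actual AC-3 exponent coding; the mantissa bit count would come from the bit-allocation step:

```python
import math

def to_exponent_mantissa(coeff, mantissa_bits):
    """Split one transform coefficient into (exponent, quantized mantissa).

    The exponent records the coefficient's magnitude scale; the mantissa is
    what remains after scaling, quantized with a variable number of bits.
    """
    if coeff == 0.0:
        return 0, 0
    exponent = math.floor(math.log2(abs(coeff)))       # power-of-two scale
    mantissa = coeff / (2.0 ** (exponent + 1))          # now in (-1, 1)
    levels = 2 ** mantissa_bits
    q = round((mantissa + 1.0) / 2.0 * (levels - 1))    # uniform quantization
    return exponent, q

def from_exponent_mantissa(exponent, q, mantissa_bits):
    levels = 2 ** mantissa_bits
    mantissa = q / (levels - 1) * 2.0 - 1.0
    return mantissa * (2.0 ** (exponent + 1))

e, q = to_exponent_mantissa(0.0371, mantissa_bits=6)
print(from_exponent_mantissa(e, q, mantissa_bits=6))    # close to 0.0371
```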
Speech Compression
Application Scenarios
Key attributes of a speech codec: delay, complexity, quality, bit-rate
Key Attributes (cont’d)
Delay
One-way end-to-end delay for real-time use should be below 150 ms (at 300 ms it becomes annoying).
With more than two parties, the conference bridge (in which all voice channels are decoded, summed, and then re-encoded for transmission to their destinations) can double the processing delay.
Internet real-time connections with less than 150 ms of delay are unlikely, due to packet assembly and buffering, protocol processing, routing, queuing, network congestion, etc.
Speech coders often divide speech into blocks (frames), e.g. G.723 uses frames of 240 samples each (30 ms) plus look-ahead time.
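A back-of-the-envelope sketch of the algorithmic part of the delay budget. The 7.5 ms look-ahead used in the example is the figure commonly quoted for G.723.1, included here only as an illustrative assumption, not taken from the slide:

```python
def frame_delay_ms(frame_samples, sample_rate_hz, lookahead_ms=0.0):
    """Algorithmic delay contributed by frame buffering plus look-ahead."""
    return frame_samples / sample_rate_hz * 1000.0 + lookahead_ms

# G.723-style framing: 240 samples at 8 KHz = 30 ms, plus an assumed 7.5 ms look-ahead
print(frame_delay_ms(240, 8_000, lookahead_ms=7.5))   # 37.5 ms before any network delay
```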
Key Attributes (cont’d)
Complexity
Can be very complex. In PC video telephony, the bulk of the computation is for video coding/decoding, which leaves less CPU time for speech coding/decoding.
Quality
Intelligibility plus naturalness of the original speech. Speech coders for very low bit-rates are based on a speech production model (not good for music, and not robust to extraneous noise).
ITU Speech Coding Standards
ITU G.711
Designed for telephone-bandwidth speech signals (3 KHz)
Does direct sample-by-sample non-uniform quantization (PCM)
Provides the lowest delay possible (1 sample) and the lowest complexity
Not specific to speech
High rate and no recovery mechanism
Default coder for ISDN video telephony
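A minimal sketch of the non-uniform quantization idea, assuming the continuous μ-law companding curve (μ = 255); the actual G.711 standard uses segmented 8-bit μ-law/A-law approximations rather than this formula:

```python
import math

MU = 255.0  # mu-law constant (North America / Japan variant of G.711)

def mu_law_compress(x):
    """Continuous mu-law companding curve for x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def encode_sample(x, bits=8):
    """Compand, then quantize uniformly: quiet samples get finer effective steps."""
    levels = 2 ** bits
    y = mu_law_compress(x)
    return round((y + 1.0) / 2.0 * (levels - 1))

def decode_sample(code, bits=8):
    levels = 2 ** bits
    y = code / (levels - 1) * 2.0 - 1.0
    return mu_law_expand(y)

print(decode_sample(encode_sample(0.02)))   # quiet sample, small error
print(decode_sample(encode_sample(0.9)))    # loud sample, coarser step is tolerable
```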
ITU G.722
Designed for transmitting 7-KHz-bandwidth voice or music
Divides the signal into two bands (high-pass and low-pass), which are then encoded with different modalities
Quality is not perfectly transparent, especially for music; nevertheless, for teleconference-type applications, G.722 is greatly preferred to G.711 PCM because of the increased bandwidth
ITU G.726, G.727
ADPCM (Adaptive Differential PCM) codecs for telephone-bandwidth speech
Can operate using 2, 3, 4 or 5 bits per sample
ITU G.729, G.723
Model-based coders: use special models of production (synthesis) of speech
Linear synthesis: feed a "noise" signal into a linear LPC filter whose parameters are estimated from the original speech segment (sketched below)
Analysis by synthesis: the optimal "input noise" is computed and coded as a multipulse excitation
LPC parameter coding
Pitch prediction
Have provisions for dealing with frame erasure and packet-loss concealment (good on the Internet)
G.723 is part of the H.324 standard for communication over POTS with a modem
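A minimal sketch of the linear synthesis model only: an excitation signal fed through an all-pole LPC filter. The analysis-by-synthesis search, pitch prediction and parameter coding of the real codecs are omitted, and the filter coefficients below are placeholders, not values from any standard:

```python
import random

def lpc_synthesize(excitation, lpc_coeffs, gain=1.0):
    """All-pole LPC synthesis: s[n] = g*e[n] + sum_k a[k] * s[n-k-1].

    excitation: the coded "input noise" / multipulse excitation signal.
    lpc_coeffs: short-term predictor coefficients estimated from the
                original speech segment (placeholders in the example).
    """
    out = []
    for n, e in enumerate(excitation):
        s = gain * e
        for k, a in enumerate(lpc_coeffs):
            if n - k - 1 >= 0:
                s += a * out[n - k - 1]
        out.append(s)
    return out

# Toy example: white-noise excitation through a stable 2nd-order all-pole filter
random.seed(0)
excitation = [random.uniform(-1, 1) for _ in range(80)]
speech = lpc_synthesize(excitation, lpc_coeffs=[1.3, -0.5], gain=0.1)
print(speech[:5])
```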
ITU G.723 Scheme
ITU G.728
Hybrid between the lower bit-rate model-based coders (G.723 and G.729) and the ADPCM coders
Low delay but fairly high complexity
Considered equivalent in performance to 32 Kb/s G.726 and G.727
Suggested speech coder for low bit-rate ( Kb/s) ISDN video telephony
Remarkably robust to random bit errors
Applications
Video telephony / teleconference
Higher-rate, more reliable networks (ISDN, ATM): the logical choice is G.722 (best quality, 7-KHz band)
At  Kb/s: G.728 is a good choice because of its robust performance for many possible speech and audio inputs
Telephone-bandwidth modem, or a less reliable network (e.g., the Internet): G.723 is the coder of choice
Applications (cont’d)
Multimedia messaging
Speech, perhaps combined with text, graphics, images, data or video (asynchronous communication). Here delay is not an issue.
The message may be shared with a wide community, so the speech coder ought to be a commonly available standard.
For the most part, fidelity will not be an issue; G.729 or G.723 seem like good candidates.
Structured Audio
What Is Structured Audio?
A description format that is made up of semantic information about the sounds it represents, and that makes use of high-level (algorithmic) models.
Event-list representation: a sequence of control parameters that, taken alone, do not define the quality of a sound but instead specify the ordering and characteristics of parts of a sound with respect to some external model.
Event-list representation
Event-list representations are appropriate for soundtracks, piano, and percussive instruments; they are not good for violin, speech and singing.
Sequencers allow the specification and modification of event sequences.
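A minimal sketch of what an event-list representation might look like, with a sequencer-style edit applied to it. The tuple fields are invented for illustration and are not taken from any standard:

```python
# Each event says *when* to play *what*, relative to an external instrument
# model; nothing here describes the actual waveform of the sound.
event_list = [
    # (start time s, duration s, instrument, pitch as MIDI note, velocity)
    (0.00, 0.50, "piano", 60, 90),   # middle C
    (0.50, 0.50, "piano", 64, 80),   # E
    (1.00, 1.00, "piano", 67, 85),   # G
]

def transpose(events, semitones):
    """A sequencer-style edit: shift every note's pitch, leaving timing alone."""
    return [(t, d, inst, pitch + semitones, vel) for t, d, inst, pitch, vel in events]

print(transpose(event_list, 2))
```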
MIDI
MIDI (Musical Instrument Digital Interface) is a system specification consisting of both hardware and software components that define interconnectivity and a communication protocol for electronic synthesizers, sequencers, rhythm machines, personal computers and other musical instruments.
Interconnectivity defines a standard cabling scheme, connectors and input/output circuitry.
The communication protocol defines standard multibyte messages to control the instrument's voice, send responses and report status.
MIDI Communication Protocol
The MIDI communication protocol uses multibyte messages of two kinds: channel messages and system messages.
Channel messages address one of the 16 possible channels.
Voice messages are used to control the voice of the instrument:
Switch notes on/off
Send key-pressure messages indicating how hard the key is depressed
Send control messages to control effects like vibrato, sustain and tremolo
Pitch-wheel messages are used to change the pitch of all notes
Channel key-pressure messages provide a measure of force for the keys related to a specific channel (instrument)
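A minimal sketch of building the simplest channel voice messages as raw bytes. The Note On status byte is 0x90 combined with the channel number, Note Off is 0x80; each data byte is 7 bits:

```python
def note_on(channel, note, velocity):
    """Build a 3-byte MIDI Note On channel voice message."""
    assert 0 <= channel <= 15 and 0 <= note <= 127 and 0 <= velocity <= 127
    return bytes([0x90 | channel, note, velocity])

def note_off(channel, note, velocity=0):
    """Build a 3-byte MIDI Note Off channel voice message."""
    assert 0 <= channel <= 15 and 0 <= note <= 127 and 0 <= velocity <= 127
    return bytes([0x80 | channel, note, velocity])

# Middle C (note 60) at moderate velocity on channel 0
print(note_on(0, 60, 64).hex())   # '903c40'
print(note_off(0, 60).hex())      # '803c00'
```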
Sound Representation and Synthesis
Sampling
Individual instrument sounds (notes) are digitally recorded and stored in memory in the instrument. When the instrument is played, the note recordings are reproduced and mixed to produce the output sound.
Takes a lot of memory! To reduce storage:
Transpose the pitch of a sample during playback (see the sketch after this list)
Quasi-periodic sounds can be "looped" after the attack transient has died
Also used for creating sound effects for film (Foley)
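A minimal sketch of transposing a stored sample by reading it back at a different rate, using linear interpolation; real samplers use far more sophisticated resampling:

```python
def transpose_sample(samples, semitones):
    """Pitch-shift a stored note by resampling it during playback.

    Reading the table faster (ratio > 1) raises the pitch and shortens the
    note; linear interpolation fills in between stored samples.
    """
    ratio = 2.0 ** (semitones / 12.0)
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        out.append((1.0 - frac) * samples[i] + frac * samples[i + 1])
        pos += ratio
    return out

# One recorded note can cover nearby pitches instead of storing every note
recorded_note = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5] * 10
up_a_fourth = transpose_sample(recorded_note, 5)
print(len(recorded_note), len(up_a_fourth))
```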
Sound Representation and Synthesis (cont’d)
Additive and subtractive synthesis
Synthesize sound from the superposition of sinusoidal components (additive) or from the filtering of a harmonically rich source sound (subtractive)
Very compact, but with an "analog synthesizer" feel
Frequency modulation (FM) synthesis
Can synthesize a variety of sounds such as brass-like and woodwind-like tones, percussive sounds, bowed strings and piano tones (a sketch follows below)
There is no straightforward method for determining an FM synthesis algorithm from an analysis of a desired sound
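A minimal sketch of two-operator FM synthesis, y(t) = sin(2πf_c t + I·sin(2πf_m t)); the parameter values are illustrative only:

```python
import math

def fm_tone(carrier_hz, modulator_hz, mod_index, duration_s, sample_rate=44_100):
    """Simple two-operator FM synthesis.

    The modulation index I and the fc:fm ratio shape the spectrum, which is
    how FM produces brass-, bell- or woodwind-like timbres from two oscillators.
    """
    n = int(duration_s * sample_rate)
    return [
        math.sin(2 * math.pi * carrier_hz * t
                 + mod_index * math.sin(2 * math.pi * modulator_hz * t))
        for t in (i / sample_rate for i in range(n))
    ]

# A 220 Hz carrier with a 220 Hz modulator gives a harmonic, brass-like spectrum
samples = fm_tone(220.0, 220.0, mod_index=2.0, duration_s=0.5)
print(len(samples), samples[:3])
```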
Application of Structured Audio
Low-bandwidth transmission
Sound generation from process models
Interactive music applications
Content-based retrieval