Audio Compression
Audio Compression
CD-quality audio: sampling rate = 44.1 KHz, quantization = 16 bits/sample, bit-rate ≈ 700 Kb/s per channel (1.41 Mb/s for 2-channel stereo)
Telephone-quality speech: sampling rate = 8 KHz, bit-rate = 128 Kb/s
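These figures follow directly from the definition of uncompressed PCM bit-rate (sampling rate × bits per sample × channels). A minimal sketch of the arithmetic, assuming the CD rate of 44.1 KHz:

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM bit-rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

# CD-quality stereo: 44,100 * 16 * 2 = 1,411,200 b/s (~1.41 Mb/s)
print(pcm_bitrate(44_100, 16, channels=2))
# Telephone-quality speech: 8,000 * 16 = 128,000 b/s (128 Kb/s)
print(pcm_bitrate(8_000, 16))
```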
Absolute Threshold
A tone is audible only if its power is above the absolute threshold level.
Masking Effect
If a tone of a certain frequency and amplitude is present, the audibility threshold curve changes: other tones or noise of similar frequency, but of much lower amplitude, are not audible.
Masking Effect (Single Masker)
Masking Effect (Multiple Maskers)
Temporal Masking
A loud tone of finite duration will mask a softer tone that follows it (for around 30 ms). A similar effect also occurs when the softer tone precedes the louder tone.
Perceptual Coding
Perceptual coding tries to minimize the perceptual distortion in a transform coding scheme.
Basic concept: allocate more bits (more quantization levels, less error) to the channels that are most audible, and fewer bits (more error) to the channels that are least audible.
Needs to continuously analyze the signal to determine the current audibility threshold curve using a perceptual model.
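A minimal sketch of the greedy idea behind perceptual bit allocation, not any particular standard's algorithm. The per-band signal-to-mask ratios (SMR) are assumed to come from the perceptual model, and the "6 dB per bit" rule is the usual approximation:

```python
def allocate_bits(smr_db, total_bits, max_bits_per_band=15):
    """Greedy perceptual bit allocation (illustrative only).

    smr_db: signal-to-mask ratio in dB for each subband, from the perceptual model.
    Each extra quantizer bit buys roughly 6 dB of SNR, so the next bit always
    goes to the band whose quantization noise is currently the most audible.
    """
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        eligible = [i for i in range(len(bits)) if bits[i] < max_bits_per_band]
        if not eligible:
            break
        # pick the band with the worst (largest) noise-to-mask ratio so far
        band = max(eligible, key=lambda i: smr_db[i] - 6.02 * bits[i])
        bits[band] += 1
    return bits

# Band 0 is the most audible here, so it receives the most bits
print(allocate_bits([20.0, 5.0, -3.0, 12.0], total_bits=12))
```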
Audio Coding: Main Standards
MPEG (Moving Picture Experts Group) family (note: the standard only specifies the decoder!)
MPEG-1: Layer 1, Layer 2, Layer 3 (MP3)
MPEG-2: backward-compatible part, and AAC (non-backward-compatible)
Dolby AC-3
MPEG-1 Audio Coder
Layer 1: deemed transparent at 384 Kb/s per channel
Subband coding with 32 channels; input divided into groups of 12 samples per subband
Coefficient normalization (extracts a scale factor per block)
For each block, chooses among 15 quantizers for perceptual quantization
No entropy coding after transform coding
Decoder is much simpler than the encoder
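A minimal sketch of the per-block processing just described for one subband (normalize a 12-sample block by its scale factor, then quantize uniformly with the number of levels chosen by bit allocation). The real Layer 1 scale-factor table and quantizer set are not reproduced here:

```python
def encode_block(samples12, levels):
    """Layer-1-style coding sketch for one 12-sample subband block.

    levels: number of quantizer levels chosen by the bit-allocation step
            (Layer 1 picks one of 15 quantizers per block).
    """
    scale_factor = max(abs(s) for s in samples12) or 1.0   # block normalization
    normalized = [s / scale_factor for s in samples12]     # now in [-1, 1]
    step = 2.0 / (levels - 1)
    indices = [round((x + 1.0) / step) for x in normalized] # uniform quantizer
    return scale_factor, indices

def decode_block(scale_factor, indices, levels):
    step = 2.0 / (levels - 1)
    return [scale_factor * (i * step - 1.0) for i in indices]

sf, idx = encode_block(
    [0.1, -0.4, 0.25, 0.0, 0.3, -0.2, 0.05, 0.4, -0.35, 0.1, 0.0, 0.2], levels=15)
print(decode_block(sf, idx, levels=15))
```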
Intensity Stereo Mode
The stereo effect at middle and high frequencies depends not so much on the different channel content as on the different channel amplitude.
Middle and upper subbands of the left and right channels are added together, and only the resulting summed samples are quantized.
The scale factor is sent for both channels so that the amplitudes can be controlled independently during playback.
MPEG-1 Audio Coder (cont’d)
Layer 2: transparent at 256 Kb/s per channel
Improved perceptual model (more computationally intensive)
Finer-resolution quantizers
Layer 3 (MP3): transparent at 96 Kb/s per channel
Applies a variable-size modified DCT to the samples of each subband channel
Uses non-uniform quantizers
Has an entropy coder (Huffman), which requires buffering
Much more complex than Layers 1 and 2
MPEG-1 Layers 1 and 2 Audio Encoder/Decoder (Single Channel)
MPEG-1 Layer 3 (MP3) Audio Encoder/Decoder (Single Channel)
Middle-side Stereo Mode
Frequency ranges that would normally be coded as left and right are instead coded as middle (left+right) and side (left-right).
The side channel can be coded with fewer bits, because the two channels are highly correlated.
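A minimal sketch of the mid/side transform. One common convention divides by 2 so the inverse is exact; the standards define their own scaling:

```python
def ms_encode(left, right):
    """Mid/side transform: mid carries the common content, side the difference."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# When the channels are highly correlated, the side signal is near zero
# and can be coded with very few bits.
L = [0.50, 0.30, -0.20, 0.10]
R = [0.48, 0.31, -0.22, 0.09]
mid, side = ms_encode(L, R)
print(side)                      # small values
print(ms_decode(mid, side) == (L, R))
```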
MPEG-2 Audio Coder
Backward compatible (i.e., MPEG-1 decoders can decode a portion of the MPEG-2 bit-stream)
Original goal: provide theater-style surround-sound capabilities
Modes of operation:
monaural
stereo
three channel (left, right and center)
four channel (left, right, center and rear surround)
five channel (four channel + a second surround channel)
Full five-channel surround stereo at 640 Kb/s
MPEG-2 Audio Coder (Cont’d)
Non-backward compatible (AAC)
At 320 Kb/s, judged equivalent to backward-compatible MPEG-2 at 640 Kb/s for five-channel surround sound
Can operate with any number of channels (from 1 to 48) and output bit-rates from 8 Kb/s to 182 Kb/s per channel
Sampling rate can be as low as 8 KHz and as high as 96 KHz per channel
Dolby AC-3
Used in movie theaters as part of the Dolby Digital film system
Selected for USA digital TV (DTV) and DVD
Bit-rate: 320 Kb/s for 5.1-channel stereo
Uses a 512-point modified DCT (can be switched to 256-point)
Floating-point conversion of the coefficients into exponent-mantissa pairs (mantissas quantized with a variable number of bits)
Transmits perceptual model parameters rather than the bit allocation itself
Dolby AC-3 Encoder (block diagram): PCM samples → frequency-domain transform → block floating point (exponents and mantissas) → exponents drive bit allocation → mantissa quantization → bitstream packing → encoded audio
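A minimal sketch of the exponent/mantissa split in the block-floating-point stage, not the actual AC-3 exponent coding; the mantissa bit count would come from the bit-allocation step:

```python
import math

def to_exponent_mantissa(coeff, mantissa_bits):
    """Split one transform coefficient into (exponent, quantized mantissa).

    The exponent records the coefficient's magnitude scale; the mantissa is
    what remains after scaling, quantized with a variable number of bits.
    """
    if coeff == 0.0:
        return 0, 0
    exponent = math.floor(math.log2(abs(coeff)))       # power-of-two scale
    mantissa = coeff / (2.0 ** (exponent + 1))          # now in (-1, 1)
    levels = 2 ** mantissa_bits
    q = round((mantissa + 1.0) / 2.0 * (levels - 1))    # uniform quantization
    return exponent, q

def from_exponent_mantissa(exponent, q, mantissa_bits):
    levels = 2 ** mantissa_bits
    mantissa = q / (levels - 1) * 2.0 - 1.0
    return mantissa * (2.0 ** (exponent + 1))

e, q = to_exponent_mantissa(0.0371, mantissa_bits=6)
print(from_exponent_mantissa(e, q, mantissa_bits=6))    # close to 0.0371
```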
Speech Compression
Application Scenarios
Key attributes of a speech codec: delay, complexity, quality, bit-rate
Key Attributes (cont’d)
Delay
One-way end-to-end delay for real-time use should be below 150 ms (at 300 ms it becomes annoying).
With more than two parties, the conference bridge (in which all voice channels are decoded, summed, and then re-encoded for transmission to their destinations) can double the processing delay.
Internet real-time connections with less than 150 ms of delay are unlikely, due to packet assembly and buffering, protocol processing, routing, queuing, network congestion, etc.
Speech coders often divide speech into blocks (frames), e.g. G.723 uses frames of 240 samples each (30 ms) plus look-ahead time.
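A back-of-the-envelope sketch of the algorithmic part of the delay budget. The 7.5 ms look-ahead used in the example is the figure commonly quoted for G.723.1, included here only as an illustrative assumption, not taken from the slide:

```python
def frame_delay_ms(frame_samples, sample_rate_hz, lookahead_ms=0.0):
    """Algorithmic delay contributed by frame buffering plus look-ahead."""
    return frame_samples / sample_rate_hz * 1000.0 + lookahead_ms

# G.723-style framing: 240 samples at 8 KHz = 30 ms, plus an assumed 7.5 ms look-ahead
print(frame_delay_ms(240, 8_000, lookahead_ms=7.5))   # 37.5 ms before any network delay
```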
Key Attributes (cont’d)
Complexity
Can be very complex. In PC video telephony, the bulk of the computation is for video coding/decoding, which leaves less CPU time for speech coding/decoding.
Quality
Intelligibility plus naturalness of the original speech. Speech coders for very low bit-rates are based on a speech production model (not good for music, and not robust to extraneous noise).
ITU Speech Coding Standards
ITU G.711
Designed for telephone-bandwidth speech signals (3 KHz)
Does direct sample-by-sample non-uniform quantization (PCM)
Provides the lowest delay possible (1 sample) and the lowest complexity
Not specific to speech
High rate and no recovery mechanism
Default coder for ISDN video telephony
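A minimal sketch of the non-uniform quantization idea, assuming the continuous μ-law companding curve (μ = 255); the actual G.711 standard uses segmented 8-bit μ-law/A-law approximations rather than this formula:

```python
import math

MU = 255.0  # mu-law constant (North America / Japan variant of G.711)

def mu_law_compress(x):
    """Continuous mu-law companding curve for x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def encode_sample(x, bits=8):
    """Compand, then quantize uniformly: quiet samples get finer effective steps."""
    levels = 2 ** bits
    y = mu_law_compress(x)
    return round((y + 1.0) / 2.0 * (levels - 1))

def decode_sample(code, bits=8):
    levels = 2 ** bits
    y = code / (levels - 1) * 2.0 - 1.0
    return mu_law_expand(y)

print(decode_sample(encode_sample(0.02)))   # quiet sample, small error
print(decode_sample(encode_sample(0.9)))    # loud sample, coarser step is tolerable
```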
ITU G.722
Designed for transmitting 7-KHz-bandwidth voice or music
Divides the signal into two bands (high-pass and low-pass), which are then encoded with different modalities
Quality is not perfectly transparent, especially for music; nevertheless, for teleconference-type applications, G.722 is greatly preferred to G.711 PCM because of the increased bandwidth
ITU G.726, G.727
ADPCM (Adaptive Differential PCM) codecs for telephone-bandwidth speech
Can operate using 2, 3, 4 or 5 bits per sample
ITU G.729, G.723
Model-based coders: use special models of production (synthesis) of speech
Linear synthesis: feed a "noise" signal into a linear LPC filter whose parameters are estimated from the original speech segment (sketched below)
Analysis by synthesis: the optimal "input noise" is computed and coded as a multipulse excitation
LPC parameter coding
Pitch prediction
Have provisions for dealing with frame erasure and packet-loss concealment (good on the Internet)
G.723 is part of the H.324 standard for communication over POTS with a modem
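A minimal sketch of the linear synthesis model only: an excitation signal fed through an all-pole LPC filter. The analysis-by-synthesis search, pitch prediction and parameter coding of the real codecs are omitted, and the filter coefficients below are placeholders, not values from any standard:

```python
import random

def lpc_synthesize(excitation, lpc_coeffs, gain=1.0):
    """All-pole LPC synthesis: s[n] = g*e[n] + sum_k a[k] * s[n-k-1].

    excitation: the coded "input noise" / multipulse excitation signal.
    lpc_coeffs: short-term predictor coefficients estimated from the
                original speech segment (placeholders in the example).
    """
    out = []
    for n, e in enumerate(excitation):
        s = gain * e
        for k, a in enumerate(lpc_coeffs):
            if n - k - 1 >= 0:
                s += a * out[n - k - 1]
        out.append(s)
    return out

# Toy example: white-noise excitation through a stable 2nd-order all-pole filter
random.seed(0)
excitation = [random.uniform(-1, 1) for _ in range(80)]
speech = lpc_synthesize(excitation, lpc_coeffs=[1.3, -0.5], gain=0.1)
print(speech[:5])
```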
ITU G.723 Scheme
ITU G.728
Hybrid between the lower bit-rate model-based coders (G.723 and G.729) and the ADPCM coders
Low delay but fairly high complexity
Considered equivalent in performance to 32 Kb/s G.726 and G.727
Suggested speech coder for low bit-rate ( Kb/s) ISDN video telephony
Remarkably robust to random bit errors
Applications
Video telephony / teleconference
Higher-rate, more reliable networks (ISDN, ATM): the logical choice is G.722 (best quality, 7-KHz band)
At  Kb/s: G.728 is a good choice because of its robust performance for many possible speech and audio inputs
Telephone-bandwidth modem, or a less reliable network (e.g., the Internet): G.723 is the coder of choice
Applications (cont’d)
Multimedia messaging
Speech, perhaps combined with text, graphics, images, data or video (asynchronous communication). Here delay is not an issue.
The message may be shared with a wide community, so the speech coder ought to be a commonly available standard.
For the most part, fidelity will not be an issue; G.729 or G.723 seem like good candidates.
Structured Audio
What Is Structured Audio?
A description format that is made up of semantic information about the sounds it represents, and that makes use of high-level (algorithmic) models.
Event-list representation: a sequence of control parameters that, taken alone, do not define the quality of a sound but instead specify the ordering and characteristics of parts of a sound with respect to some external model.
Event-list representation
Event-list representations are appropriate for soundtracks, piano, and percussive instruments; they are not good for violin, speech and singing.
Sequencers allow the specification and modification of event sequences.
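A minimal sketch of what an event-list representation might look like, with a sequencer-style edit applied to it. The tuple fields are invented for illustration and are not taken from any standard:

```python
# Each event says *when* to play *what*, relative to an external instrument
# model; nothing here describes the actual waveform of the sound.
event_list = [
    # (start time s, duration s, instrument, pitch as MIDI note, velocity)
    (0.00, 0.50, "piano", 60, 90),   # middle C
    (0.50, 0.50, "piano", 64, 80),   # E
    (1.00, 1.00, "piano", 67, 85),   # G
]

def transpose(events, semitones):
    """A sequencer-style edit: shift every note's pitch, leaving timing alone."""
    return [(t, d, inst, pitch + semitones, vel) for t, d, inst, pitch, vel in events]

print(transpose(event_list, 2))
```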
MIDI
MIDI (Musical Instrument Digital Interface) is a system specification consisting of both hardware and software components that define interconnectivity and a communication protocol for electronic synthesizers, sequencers, rhythm machines, personal computers and other musical instruments.
Interconnectivity defines a standard cabling scheme, connectors and input/output circuitry.
The communication protocol defines standard multibyte messages to control the instrument's voice, send responses and report status.
MIDI Communication Protocol
The MIDI communication protocol uses multibyte messages of two kinds: channel messages and system messages.
Channel messages address one of the 16 possible channels.
Voice messages are used to control the voice of the instrument:
Switch notes on/off
Send key-pressure messages indicating how hard the key is depressed
Send control messages to control effects like vibrato, sustain and tremolo
Pitch-wheel messages are used to change the pitch of all notes
Channel key-pressure messages provide a measure of force for the keys related to a specific channel (instrument)
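A minimal sketch of building the simplest channel voice messages as raw bytes. The Note On status byte is 0x90 combined with the channel number, Note Off is 0x80; each data byte is 7 bits:

```python
def note_on(channel, note, velocity):
    """Build a 3-byte MIDI Note On channel voice message."""
    assert 0 <= channel <= 15 and 0 <= note <= 127 and 0 <= velocity <= 127
    return bytes([0x90 | channel, note, velocity])

def note_off(channel, note, velocity=0):
    """Build a 3-byte MIDI Note Off channel voice message."""
    assert 0 <= channel <= 15 and 0 <= note <= 127 and 0 <= velocity <= 127
    return bytes([0x80 | channel, note, velocity])

# Middle C (note 60) at moderate velocity on channel 0
print(note_on(0, 60, 64).hex())   # '903c40'
print(note_off(0, 60).hex())      # '803c00'
```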
Sound Representation and Synthesis
Sampling
Individual instrument sounds (notes) are digitally recorded and stored in memory in the instrument. When the instrument is played, the note recordings are reproduced and mixed to produce the output sound.
Takes a lot of memory! To reduce storage:
Transpose the pitch of a sample during playback (see the sketch after this list)
Quasi-periodic sounds can be "looped" after the attack transient has died
Also used for creating sound effects for film (Foley)
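A minimal sketch of transposing a stored sample by reading it back at a different rate, using linear interpolation; real samplers use far more sophisticated resampling:

```python
def transpose_sample(samples, semitones):
    """Pitch-shift a stored note by resampling it during playback.

    Reading the table faster (ratio > 1) raises the pitch and shortens the
    note; linear interpolation fills in between stored samples.
    """
    ratio = 2.0 ** (semitones / 12.0)
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        out.append((1.0 - frac) * samples[i] + frac * samples[i + 1])
        pos += ratio
    return out

# One recorded note can cover nearby pitches instead of storing every note
recorded_note = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5] * 10
up_a_fourth = transpose_sample(recorded_note, 5)
print(len(recorded_note), len(up_a_fourth))
```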
Sound Representation and Synthesis (cont’d)
Additive and subtractive synthesis
Synthesize sound from the superposition of sinusoidal components (additive) or from the filtering of a harmonically rich source sound (subtractive)
Very compact, but with an "analog synthesizer" feel
Frequency modulation (FM) synthesis
Can synthesize a variety of sounds such as brass-like and woodwind-like tones, percussive sounds, bowed strings and piano tones (a sketch follows below)
There is no straightforward method for determining an FM synthesis algorithm from an analysis of a desired sound
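A minimal sketch of two-operator FM synthesis, y(t) = sin(2πf_c t + I·sin(2πf_m t)); the parameter values are illustrative only:

```python
import math

def fm_tone(carrier_hz, modulator_hz, mod_index, duration_s, sample_rate=44_100):
    """Simple two-operator FM synthesis.

    The modulation index I and the fc:fm ratio shape the spectrum, which is
    how FM produces brass-, bell- or woodwind-like timbres from two oscillators.
    """
    n = int(duration_s * sample_rate)
    return [
        math.sin(2 * math.pi * carrier_hz * t
                 + mod_index * math.sin(2 * math.pi * modulator_hz * t))
        for t in (i / sample_rate for i in range(n))
    ]

# A 220 Hz carrier with a 220 Hz modulator gives a harmonic, brass-like spectrum
samples = fm_tone(220.0, 220.0, mod_index=2.0, duration_s=0.5)
print(len(samples), samples[:3])
```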
Application of Structured Audio
Low-bandwidth transmission
Sound generation from process models
Interactive music applications
Content-based retrieval