Coding Technologies for Speech and Audio Signals

Coding Technologies for Speech and Audio Signals
ISPACS 2005
Takehiro Moriya (守谷 健弘), NTT Communication Science Labs.
Thank you for the kind introduction. My name is Takehiro Moriya.

Self introduction
1980 Joined NTT, basic research
     Transform domain interleave VQ
     Conjugate VQ
1989 Guest researcher at AT&T Bell Labs
1990 Standardization for Japanese PDC (PSI-CELP)
1993 Standardization for ITU-T (CS-ACELP)
1995 Standardization for MPEG-4 (TwinVQ)
2001 Standardization for MPEG lossless audio
Let me introduce myself. I am a research manager at NTT labs. I have 25 years of experience in speech and audio coding, starting in 1980.

Technologies of speech and audio coding
[Figure: bit rate [kbit/s] (2 to 1024) versus year (1975 to 2005). Application labels: music, wideband, telephone, VoIP/mobile, mobile phone, vocoder, ubiquitous, archive, streaming. Schemes plotted: CD, DAT, MPEG-1, MPEG-2, MP3, AAC, MPEG-4, MPEG-4 (lossless), G.722, G.711, APC-AB, G.726, G.728, VSELP, G.729, PSI-CELP, LSP, PARCOR.]
This is a historical view of speech and audio coding. My research career can be partly mapped onto this figure.

Outline
1. Fundamentals
   1.1 Time domain for speech
   1.2 Frequency domain for audio
2. Standardization
   2.1 ITU-T speech coding
   2.2 MPEG audio coding
3. Hot topics
   3.1 MPEG lossless (ALS, SLS, DST)
   3.2 MPEG SBR and SSC
   3.3 MPEG surround
This is the outline of this talk.

Fundamentals

Category of coding
[Figure: categories of coding. Labels: lossless, lossy, text compression, time-domain speech coding, frequency-domain audio coding, image, video, presentation, metadata, speech, language.]
Coding is a rather vague term in the wide sense. I will focus on the compression of speech and audio. I will also mention the latest topics in lossless coding of audio.

Time-domain
linear prediction -> CELP
predictive coefficients
  PARCOR (partial auto correlation)
  LSP (line spectral pair)
vector quantization of excitation source
  algebraic structure (ACELP)
Big market for cellular phone and VoIP

LPC (Linear Predictive Coding)
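[Diagram: LPC synthesis filter. The excitation (innovation, that is, the prediction residual) enters a summation node (Σ). The synthesized output is fed back through a chain of unit delays Z^-1, and the delayed outputs are weighted by the predictive coefficients α1, α2, ..., αp and added back at the summation node.]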
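The feedback structure of this slide can be sketched in a few lines. This is only an illustration under an assumed sign convention, s[n] = e[n] + α1 s[n-1] + ... + αp s[n-p]; the function names `lpc_synthesize` and `lpc_residual` are hypothetical and do not come from any standard.

```python
def lpc_synthesize(excitation, alpha):
    """All-pole synthesis: s[n] = e[n] + sum_{i=1..p} alpha[i-1] * s[n-i]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:              # samples before n = 0 are taken as zero
                acc += a * out[n - i]
        out.append(acc)
    return out


def lpc_residual(signal, alpha):
    """Inverse (analysis) filter: e[n] = s[n] - sum_{i=1..p} alpha[i-1] * s[n-i]."""
    res = []
    for n, s in enumerate(signal):
        pred = sum(a * signal[n - i]
                   for i, a in enumerate(alpha, start=1) if n - i >= 0)
        res.append(s - pred)
    return res


impulse = [1.0, 0.0, 0.0, 0.0]
speech = lpc_synthesize(impulse, [0.5])   # one coefficient, alpha1 = 0.5
print(speech)                             # [1.0, 0.5, 0.25, 0.125]
print(lpc_residual(speech, [0.5]))        # [1.0, 0.0, 0.0, 0.0]
```

Running the inverse filter on the synthesized output returns the excitation, which is why the excitation and the prediction residual are the same signal in this model.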

Family of LPC parameters
[Figure: three equivalent parameter sets. LSP parameters ω1, ω2, ..., ωp (placed on the frequency axis); PARCOR coefficients k1, ..., kp; predictive coefficients α1, ..., αp.]
Merits of LSP: stability, interpolation, quantization, prediction
LSP parameters are convenient for interpolation, quantization, and prediction, since they have a natural relation to the power spectrum.
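As a rough illustration of how two members of this family relate, the Levinson-Durbin recursion yields the predictive coefficients and the PARCOR coefficients together from an autocorrelation sequence. This is a minimal sketch: the sign convention for k (k = -acc / err) is one of several in use and is an assumption here, and the name `levinson_durbin` is chosen for illustration only. LSP conversion is not shown.

```python
def levinson_durbin(r, order):
    """Return (alpha, k, err) from autocorrelation values r[0..order].

    alpha: predictive coefficients, s[n] ~ sum_i alpha[i-1] * s[n-i]
    k:     PARCOR (reflection) coefficients, one per recursion stage
    err:   remaining prediction error energy
    """
    a = [1.0] + [0.0] * order           # error filter A(z) = 1 + a[1] z^-1 + ...
    k = []
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        km = -acc / err
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + km * prev[m - i]
        a[m] = km
        k.append(km)
        err *= 1.0 - km * km
    alpha = [-c for c in a[1:]]
    return alpha, k, err


def is_stable(parcor):
    """The synthesis filter is stable when every PARCOR magnitude is below 1."""
    return all(abs(x) < 1.0 for x in parcor)


# Autocorrelation of a first-order process with correlation 0.5.
alpha, k, err = levinson_durbin([1.0, 0.5, 0.25], 2)
print(alpha, k, err, is_stable(k))
```

The stability test on k is a simple magnitude check, which is one reason PARCOR was attractive before LSP; the same check on the α values directly would require finding the roots of the filter polynomial.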

CELP (Code Excited Linear Prediction)
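[Diagram: CELP encoder. An adaptive codebook (periodic) and a random codebook (noise, pulse) are scaled by gains and summed (+) to drive the LPC synthesis filter, which is set by the LSP parameters. The perceptual error between the input and the synthesized signal is fed back to select the codebook entries (analysis by synthesis).]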
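The analysis-by-synthesis loop can be sketched as an exhaustive search: every candidate code vector is passed through the synthesis filter and the one with the smallest error is kept. This sketch makes strong simplifying assumptions that are not in the slide: there is no adaptive codebook, no perceptual weighting (plain squared error is used instead), the filter starts from zero state, and the codebook is random Gaussian. All names are hypothetical.

```python
import random


def synthesize(excitation, alpha):
    """Zero-state all-pole synthesis: s[n] = e[n] + sum_i alpha[i-1] * s[n-i]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc += a * out[n - i]
        out.append(acc)
    return out


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def search_codebook(target, codebook, alpha):
    """Return (index, gain, error) of the code vector that best matches target."""
    best_index, best_gain, best_err = -1, 0.0, float("inf")
    for index, code in enumerate(codebook):
        y = synthesize(code, alpha)          # synthesis inside the search loop
        energy = dot(y, y)
        if energy == 0.0:
            continue
        gain = dot(target, y) / energy       # optimal gain for this candidate
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if err < best_err:
            best_index, best_gain, best_err = index, gain, err
    return best_index, best_gain, best_err


rng = random.Random(0)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(16)]
alpha = [0.9]
# Build a target that is exactly entry 5 scaled by 1.5, then let the search find it.
target = [1.5 * v for v in synthesize(codebook[5], alpha)]
index, gain, err = search_codebook(target, codebook, alpha)
print(index, round(gain, 6))
```

Only the index and the gain would be transmitted; the decoder holds the same codebook and repeats the synthesis.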

Synthesis model for vocoder
[Diagram labels: pitch interval, gain, Σ, synthesis filter. Excitation: random source.]

Synthesis model for multi-pulse
[Diagram labels: pitch interval, gain, Σ, synthesis filter. Excitation: amplitude and position of each pulse.]

Synthesis model for regular multi-pulse
[Diagram labels: pitch interval, gain, Σ, synthesis filter. Excitation: amplitude of regularly spaced pulses.]

Synthesis model for CELP
[Diagram labels: pitch interval, gain, Σ, synthesis filter. Excitation: selection of one code vector from a codebook.]

Synthesis model of VSELP
[Diagram labels: pitch interval, gain, Σ, synthesis filter. Excitation: sum of base vectors, each with a selected polarity (+/-).]

Synthesis model for CS-CELP
[Diagram labels: pitch interval, gain, Σ, synthesis filter. Excitation: selection of a vector pair, each vector with a polarity (+/-).]

Synthesis model of ACELP
[Diagram labels: pitch interval, gain, Σ, synthesis filter. Excitation: selection of unit pulse positions, each pulse with a polarity (+/-).]
Simplicity is the seal of truth
Finally, ACELP has been commonly used for various speech coding standards. The structure is extremely simple.
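To show why the algebraic structure is simple, the sketch below builds an excitation vector from a handful of signed unit pulses and packs the choice into a short index, so no codebook table has to be stored. The layout (32 samples, 4 interleaved tracks, 8 positions per track, 16 bits in total) is a hypothetical example chosen for round numbers and is not the layout of any particular standard.

```python
SUBFRAME = 32                                  # samples per subframe (assumed)
TRACKS = 4                                     # one pulse per track (assumed)
POSITIONS_PER_TRACK = SUBFRAME // TRACKS       # 8 positions -> 3 bits per pulse


def build_excitation(slots, signs):
    """slots[t] in 0..7 selects sample t + TRACKS * slot; signs[t] is +1 or -1."""
    vec = [0] * SUBFRAME
    for track, (slot, sign) in enumerate(zip(slots, signs)):
        vec[track + TRACKS * slot] = sign
    return vec


def pack_index(slots, signs):
    """Pack 4 x (3-bit position + 1-bit sign) into one 16-bit integer."""
    index = 0
    for slot, sign in zip(slots, signs):
        index = (index << 3) | slot
        index = (index << 1) | (1 if sign < 0 else 0)
    return index


def unpack_index(index):
    """Inverse of pack_index; fields are read back from the least significant end."""
    slots, signs = [], []
    for _ in range(TRACKS):
        signs.append(-1 if index & 1 else 1)
        index >>= 1
        slots.append(index & 7)
        index >>= 3
    return slots[::-1], signs[::-1]


slots, signs = [0, 3, 7, 2], [1, -1, 1, -1]
excitation = build_excitation(slots, signs)
code = pack_index(slots, signs)
print(code, unpack_index(code))
```

Because each track owns its own set of positions, the pulses never collide, and the decoder rebuilds the vector from the index alone.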

Frequency-domain
Lapped transform: MDCT
  no frame-boundary noise and no information loss despite the overlap
Filter bank: QMF
  compromise between time and frequency resolution
adaptive noise control
psycho-acoustics

Transform coding
[Diagram: input -> transform (time to frequency) -> quantization -> inverse quantization -> transform (frequency to time) -> output. Envelope estimation drives the adaptive bit allocation, and the envelope is sent as side information.]

Basis of DCT
[Figure: DCT basis functions. Axes: time, frequency.]

Basis of MDCT
[Figure: MDCT basis functions. The first half overlaps with the previous frame (anti-symmetry); the second half overlaps with the next frame (symmetry).]
Keeping the continuity in the time domain
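The overlap property can be checked numerically. The sketch below is a direct, unoptimized MDCT and inverse MDCT with a sine window, frames of 2N samples advanced by N samples, and overlap-add at the output. The sine window and the 2/N scaling in the inverse are assumptions chosen so that the two overlapping halves add up to the input; real codecs use fast algorithms and other windows.

```python
import math


def sine_window(n_half):
    """Window of length 2N with w[n]^2 + w[n+N]^2 = 1."""
    size = 2 * n_half
    return [math.sin(math.pi * (n + 0.5) / size) for n in range(size)]


def mdct(frame, window):
    """2N windowed samples -> N coefficients."""
    n_half = len(frame) // 2
    coeffs = []
    for k in range(n_half):
        acc = 0.0
        for n in range(2 * n_half):
            phase = math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5)
            acc += window[n] * frame[n] * math.cos(phase)
        coeffs.append(acc)
    return coeffs


def imdct(coeffs, window):
    """N coefficients -> 2N windowed samples that still contain aliasing."""
    n_half = len(coeffs)
    out = []
    for n in range(2 * n_half):
        acc = 0.0
        for k in range(n_half):
            phase = math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5)
            acc += coeffs[k] * math.cos(phase)
        out.append(2.0 / n_half * window[n] * acc)
    return out


def analysis_synthesis(signal, n_half):
    """Transform every frame and overlap-add; len(signal) must be a multiple of N."""
    window = sine_window(n_half)
    padded = [0.0] * n_half + list(signal) + [0.0] * n_half
    output = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_half + 1, n_half):
        frame = padded[start:start + 2 * n_half]
        rebuilt = imdct(mdct(frame, window), window)
        for n, value in enumerate(rebuilt):
            output[start + n] += value      # aliasing of neighbours cancels here
    return output[n_half:n_half + len(signal)]


signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(32)]
restored = analysis_synthesis(signal, 8)
max_error = max(abs(a - b) for a, b in zip(signal, restored))
print(max_error)
```

Each frame of 2N samples yields only N coefficients, so the 50% overlap costs no extra data, and the reconstruction error stays at the level of floating-point rounding.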

QMF for MPEG-1, 2 Layer-I, II
32 band QMF filter bank (analysis) -> down sample -> adaptive bit allocation for 32 equal bands (energy, masking) -> adaptive quantization -> bit stream -> reconstruction -> 32 band QMF filter bank (synthesis)

QMF for MPEG-1, 2 Layer-III
32 band QMF filter bank (analysis) -> down sample -> long and short MDCT -> adaptive bit allocation for Bark-scale bands (energy, masking) -> adaptive quantization (Huffman coding), bit reservoir -> bit stream -> reconstruction -> 32 band QMF filter bank (synthesis)

QMF for MPEG extension tools
32 band QMF filter bank (analysis) -> SBR (Spectral Band Replication), PS (Parametric Stereo), Surround -> bit stream -> reconstruction -> 32 band QMF filter bank (synthesis)

Masking effect
[Figure: log spectrum versus frequency, showing the original spectrum, the allowable noise level, the audible level, and the masked region.]

Physical and perceptual distortion
[Figure: distortions relative to the original. Labels: result of compression, un-noticeable (masking), additive noise, un-noticeable region, additive echo, characteristics of perception, application.]
For vision, please listen to the next presentation.

Distortion by additional noise
[Figure: original and distortion shown as a log spectrum versus frequency and as a waveform versus time. The distortion is noticeable.]

Distortion by data compression
[Figure: original and distortion shown as a log spectrum versus frequency and as a waveform versus time. The distortion is masked by controlling the quantization noise.]

Distortion by echo
[Figure: original and distortion shown as a log spectrum versus frequency and as a waveform versus time, with an echo delayed by 40 ms. The echo is masked. Applications: watermark, search or recognition.]

Predictive coding and transform coding
time-domain (prediction) = frequency-domain (transform, subband)
small correlation: unpredictable waveform = flat spectrum
large correlation: predictable waveform = varied spectrum
effect (gain): prediction gain (waveform energy / residual energy) = transform gain (arithmetic mean / geometric mean of the spectrum)
method: closed-loop quantization (time domain); adaptive bit allocation, weighted quantization (frequency domain)
I have shown some technologies in both the time and frequency domains. From the viewpoint of compression, both are closely related. For example, the compression gain due to prediction and the gain due to transform are ideally the same, and both depend on the statistical properties of the signal. Time-domain schemes are used for speech coding and frequency-domain schemes for audio coding. One of the main differences is the time resolution: about 5 ms for speech and about 30 ms for audio.
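The claim that both gains are ideally the same can be checked on a simple model. The sketch below assumes a first-order autoregressive signal with correlation rho (a model chosen only for illustration), computes the transform gain as the ratio of the arithmetic mean to the geometric mean of its power spectrum, and compares it with the prediction gain 1 / (1 - rho^2) of the ideal first-order predictor. Function names are hypothetical.

```python
import math


def transform_gain_db(band_powers):
    """10 log10 (arithmetic mean / geometric mean) of the band powers."""
    count = len(band_powers)
    arithmetic = sum(band_powers) / count
    geometric = math.exp(sum(math.log(p) for p in band_powers) / count)
    return 10.0 * math.log10(arithmetic / geometric)


def ar1_power_spectrum(rho, bins):
    """Power spectrum of a unit-variance first-order autoregressive process."""
    return [(1.0 - rho * rho)
            / (1.0 - 2.0 * rho * math.cos(2.0 * math.pi * k / bins) + rho * rho)
            for k in range(bins)]


def ar1_prediction_gain_db(rho):
    """Waveform energy / residual energy for the ideal first-order predictor."""
    return 10.0 * math.log10(1.0 / (1.0 - rho * rho))


flat = transform_gain_db([1.0] * 8)                        # flat spectrum
varied = transform_gain_db(ar1_power_spectrum(0.9, 1024))  # strongly correlated
print(flat, varied, ar1_prediction_gain_db(0.9))
```

A flat spectrum gives no gain from either method, while the correlated signal gives the same gain whether it is exploited by prediction or by a transform with ideal bit allocation.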

Standards

Example of standard
ITU-T: cellular phone, VoIP, TV-phone, FAX
ISO/IEC JPEG, MPEG: digital camera, video, digital broadcasting, portable music player, DVD
As you know, we have various commercial products and services based on standard schemes. Here are some examples.

Merits of standard
interoperability
open source
long term maintenance
visible patent holders
integration of the highest technologies
cost reduction by mass production
market creation

Circulatory evolution of market
[Figure: cycle of market evolution. Labels: basic research, R & D, competition, standard, patent, patent pool, royalty, disclosure of technology, cost reduction, convenient service and products, users, market research.]

Standardization for speech
ITU-T G.
IMT-2000 (International Mobile Telecommunications)
GSM (Europe, Asia)
TIA (North America)
US FS-1015 (LPC-10), 1016 (CELP), 1017 (MELP)
Japanese cellular
  PDC full/half rate
  PHS
  cdmaOne
  PDC enhanced full rate

ITU-T standard for speech
Telephone band (8 kHz sample)
  G PCM 64 kbit/s
  G ADPCM 32 kbit/s (16, 24, 40 kbit/s)
  G.727 Embedded ADPCM 32 kbit/s (16, 24, 40 kbit/s)
  G.728 Low-delay CELP 16 kbit/s
  G ACELP/MPC-MLQ /6.3 kbit/s
  G CS-ACELP 8 kbit/s
Wide band (16 kHz sample)
  G SB-ADPCM 64, 56, 48 kbit/s
  G Transform coding 24, 32 kbit/s
  G AMR-WB kbit/s
There are a number of ITU-T speech coding standards. The numbers are running short, and some have a prefix or are reused.

Standard for IMT-2000
3GPP (3rd Generation Partnership Project) (ARIB, TTC, T1, ETSI, TTA)
3GPP2
bi-directional CODEC
  AMR (Adaptive Multi-Rate)
  AMR-WB (wide band)
video phone (H.263)
Audio/Low rate speech packet transmission (MPEG-4)
As far as mobile speech coding is concerned, research on compression is finished. I have no idea when the next one will come.

Bandwidth and bitrate for audio coding
[Figure: bandwidth [kHz] (6 to 24) versus rate [kbit/s] (24 to 768). Labels: DAT, MD, CD, multi-channel, MPEG-1, MPEG-2, MPEG-2 1/2 sample, AC-3, AAC, MPEG-4.]

Basic technology for audio coding
              Transform       Quantization
MPEG-1 L1,2   subband         adaptive bit
MPEG-1 L3     subband+MDCT    adaptive+Huffman
ATRAC         subband+MDCT    adaptive bit
AC-3          MDCT            adaptive+Huffman
AAC           MDCT            adaptive+Huffman
TwinVQ        MDCT            adaptive VQ

MPEG-1, 2/audio
MPEG-1
  sampling rate: 32, 44.1, 48 kHz
  stereo
  algorithm:
    Layer-I: band split
    Layer-II: + improved quantizer
    Layer-III: + MDCT + variable length code + bit reservoir ++
MPEG-2
  low sampling rate: 16, 22.05, 24 kHz
  multi channel: 5.1ch
  backward compatibility

MPEG-2/AAC
3 profiles: main, LC (Low Complexity), SSR (Scalable Sampling Rate)
sampling rate: 32, 44.1, 48 kHz, +X2, X1/2, X1/4
channel: 1-48
bit rate: kbit/s/ch
MDCT 1024 or 128
TNS (Temporal Noise Shaping)
MS (Middle-Side) stereo / intensity stereo
non-linear scale quantizer + variable length code (2 and 4 dimension Huffman code)

Tools in MPEG-4 audio
Low rate speech: HVXC (Harmonic Vector eXcitation Coding)
Speech (narrow/wide): CELP
Low rate audio: TwinVQ (Transform domain Weighted Interleave VQ)
Audio: MPEG-2 AAC (Advanced Audio Coding)
Error resilient framework
Parametric audio coding: HILN
Fine granular scalable audio coding: BSAC
Low delay audio coding: LD-AAC
Low overhead audio transport: LATM

MPEG-4 General audio
[Diagram: common tools (LTP, TNS, stereo coding, scalability, IMDCT -> output) shared by three quantization schemes. TwinVQ: interleave VQ for MDCT. AAC: scale factor, Huffman coding. BSAC: scale factor, bit-slice arithmetic coding.]

Audio Demo (low rate)
ITU-T G.711 64 kbit/s
ITU-T G.726 32 kbit/s
PDC Full 6.7 kbit/s
PDC Half 3.45 kbit/s
MPEG-4 HVXC 2 kbit/s
MPEG-4 TwinVQ 8 kbit/s

Hot Topics

Background of lossless coding
Demand for lossless compression of audio
  archiving analog and digital contents
  delivery over broadband network
  high quality audio format: up to 24 bit, 192 kHz sampling, multi-channel
  medical data, seismic data, sensor array, etc.
MPEG-4 extension
  official tools (open source)
  interoperability (good for over 100 years)
I think it is not necessary to talk about the need for lossless coding. One thing I want to emphasize is that the strongest point of the MPEG standard is interoperability, and this will be maintained for over 100 years.

Family of MPEG lossless
ALS: one-step compression in time domain
SLS: scalable to lossless from MPEG lossy core
  fine grain scalability in frequency domain
  Integer MDCT
DST: 1-bit oversample format
  compatible with Sony-Philips SACD format

Property of ALS
Time domain adaptive prediction
  simple to high-performance
  backward prediction
  BGMC for prediction residual
  Golomb-Rice code for PARCOR
  progressive order prediction
  long-term prediction
  hierarchical block switching
extension
  floating-point support
  multi-channel predictive coding
Let me review the history of development. The initial system is based on TUB's proposal. In the course of the core experiment process, there have been a number of enhancements and extensions.

Prediction residual
[Figure: amplitude versus time for the original wave and the prediction residual wave.]
This figure shows how efficiently LPC reduces the amplitude. The green waveform shows the original audio signal, and the red one is the prediction residual. We only need to encode the prediction residual instead of the original waveform. It is obvious that the amplitude can be reduced by dB, namely 5 to 7 bits can be saved per sample.
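The idea of coding only the residual, with exact reconstruction, can be sketched as follows. This is not the ALS algorithm: it swaps in a fixed second-order integer predictor for the adaptive LPC, and it applies a plain Rice code with a fixed parameter k to the residual. The bit stream is held as a text string of "0" and "1" for readability. All names are hypothetical.

```python
import math


def predict(history):
    """Fixed integer predictor (an assumption; ALS adapts its predictor)."""
    if len(history) >= 2:
        return 2 * history[-1] - history[-2]
    if history:
        return history[-1]
    return 0


def rice_encode(value, k):
    """Rice code of a non-negative integer: unary quotient, then k-bit remainder."""
    bits = "1" * (value >> k) + "0"
    if k > 0:
        bits += format(value & ((1 << k) - 1), "0%db" % k)
    return bits


def encode(samples, k):
    pieces = []
    for n, sample in enumerate(samples):
        residual = sample - predict(samples[:n])
        # Map signed residuals to non-negative integers: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
        mapped = 2 * residual if residual >= 0 else -2 * residual - 1
        pieces.append(rice_encode(mapped, k))
    return "".join(pieces)


def decode(bits, count, k):
    samples, pos = [], 0
    for _ in range(count):
        quotient = 0
        while bits[pos] == "1":
            quotient += 1
            pos += 1
        pos += 1                                    # skip the terminating "0"
        remainder = int(bits[pos:pos + k], 2) if k > 0 else 0
        pos += k
        mapped = (quotient << k) | remainder
        residual = mapped // 2 if mapped % 2 == 0 else -((mapped + 1) // 2)
        samples.append(residual + predict(samples))  # same predictor as the encoder
    return samples


samples = [int(round(1000 * math.sin(0.05 * n))) for n in range(200)]
bits = encode(samples, 2)
decoded = decode(bits, len(samples), 2)
print(len(bits), "bits for", len(samples), "samples; lossless:", decoded == samples)
```

Because the decoder runs the same predictor on already decoded samples, the reconstruction is bit exact, and the small residuals need far fewer bits than the original 16-bit range of values.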

Predictive coding
[Figure: input -> prediction -> residual, followed by three ways of handling the residual. Captions: different framework, rich commonality.]
vocoder: compression ratio 1/30; pulse interval, synthesis (magnify 30 times)
waveform coding: compression ratio 1/10; codebook for residual
lossless coding: compression ratio 1/2; all residual parameters

Compression and decoding time
[Figure: compression ratio [%] (45 to 50) versus averaged decoding time [sec] for 30 sec files (48, 96, 192 kHz). Plotted: ALS (reference decoder), ALS (enhanced decoder), ALS (high-compression), MPEG-4 SLS, Monkey's Audio (free software), OptimFrog (free software).]

Quality improvements by SBR and PS
[Figure: relative quality versus stereo bit rate [kbit/s] (24 to 144) for MP3, AAC, HE-AAC, and HE-AAC V2. Profiles: AAC profile (AAC), HE-AAC profile (AAC, SBR), HE-AAC V2 profile (AAC, SBR, PS). Annotations: Japanese digital broadcasting (2003), Japanese mobile digital broadcasting (2006).]

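MPEG SBR (HE-AAC)
[Diagram: Encoder: the input feeds the SBR (Spectral Band Replication) high frequency analysis (envelope, excitation), which produces the SBR bit stream, and also passes through low-pass and down sample to the AAC stereo encoder, which produces the AAC stereo bit stream. Decoder: the AAC stereo decoder gives the low-pass output, and SBR synthesis gives the full-band output.]
HE-AAC profile includes SBR.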

MPEG SBR+PS (HE-AAC v2)
[Diagram: Encoder: the stereo input is mixed down to a monaural input for the AAC monaural encoder (AAC monaural bit stream), while PS (parametric stereo) analysis produces a bit stream carrying channel level differences and inter-channel correlation. Decoder: the AAC monaural decoder gives the monaural output, and PS synthesis gives the stereo output.]
Similar to SBR, the Parametric Stereo tool can be combined with a base coder for the monaural signal. The side information of this tool is the channel level differences and the inter-channel correlation.
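The two side-information parameters named here can be computed with a few sums. This sketch is a simplification: it treats a whole block of time-domain samples as one band, whereas the actual tool works per QMF sub-band and time segment, and the mono downmix as a plain average is an assumption. Function names are hypothetical.

```python
import math


def stereo_parameters(left, right):
    """Return (channel level difference in dB, inter-channel correlation).

    Both channels are assumed to have non-zero energy.
    """
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    cross = sum(l * r for l, r in zip(left, right))
    cld_db = 10.0 * math.log10(energy_left / energy_right)
    icc = cross / math.sqrt(energy_left * energy_right)
    return cld_db, icc


def downmix(left, right):
    """Monaural signal handed to the base coder (simple average, assumed)."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


left = [1.0, -2.0, 3.0, -4.0]
right = [0.5 * x for x in left]          # same waveform, half the amplitude
cld_db, icc = stereo_parameters(left, right)
mono = downmix(left, right)
print(round(cld_db, 2), round(icc, 2), mono)
```

In this example the right channel is a scaled copy of the left, so the correlation is 1 and the level difference alone describes the stereo image; two numbers replace a whole second channel.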

MPEG surround
[Diagram: Encoder: the 5-ch input is mixed down to a stereo input for the AAC stereo encoder (AAC stereo bit stream), while surround analysis produces the surround bit stream carrying channel level differences, inter-channel correlation, and channel prediction coefficients. Decoder: the AAC stereo decoder gives the stereo output, and surround synthesis gives the 5-ch output.]
This project is on-going now. We have a stereo coder as the base.

History of MPEG Audio
[Figure: timeline from 1992 to 2006. Labels: MPEG-1, MPEG-2 MC/LSF*, MPEG-2 AAC, MPEG-4 V1, V2, MP3 on 4, SBR, SSC, lossless (DST, ALS, SLS), surround, 2001, 2005, forward and backward compatibility.]
*Multi-channel and Low Sampling Frequency

Future challenge
Open problems
  all-mighty coder for both speech and audio at less than 16 kbit/s
  wave field synthesis (multi-channel)
Integrated service
  video
  copyright management
