Published by Aldous O’Neal. Modified over 6 years ago.
1
Coding Technologies for Speech and Audio Signals
ISPACS 2005
Takehiro Moriya (守谷 健弘), NTT Communication Science Labs.
Thank you for the kind introduction. My name is Takehiro Moriya.
2
Self introduction
1980 Joined NTT, basic research
  Transform domain interleave VQ
  Conjugate VQ
1989 Guest researcher at AT&T Bell Labs
1990 Standardization for Japanese PDC (PSI-CELP)
1993 Standardization for ITU-T (CS-ACELP)
1995 Standardization for MPEG-4 (TwinVQ)
2001 Standardization for MPEG lossless audio
Let me introduce myself. I am a research manager at NTT labs. I have 25 years of experience in speech and audio coding, starting in 1980.
3
Technologies of speech and audio coding
[Figure: bit rate in kbit/s (2 to 1024) versus year (1975 to 2005). Schemes plotted: PARCOR, LSP, APC-AB, VSELP, PSI-CELP, G.711, G.722, G.726, G.728, G.729, CD, DAT, MPEG-1, MPEG-2, MP3, AAC, MPEG-4, MPEG-4 (lossless). Region labels: vocoder, mobile, mobile phone, VoIP/mobile, telephone, wideband, music, archive, streaming, ubiquitous.]
This is a historical view of speech and audio coding. My research career can be partly mapped onto this figure.
4
Outline
1. Fundamentals
  1.1 Time domain for speech
  1.2 Frequency domain for audio
2. Standardization
  2.1 ITU-T speech coding
  2.2 MPEG audio coding
3. Hot topics
  3.1 MPEG lossless (ALS, SLS, DST)
  3.2 MPEG SBR and SSC
  3.3 MPEG surround
This is the outline of this talk.
5
Fundamentals
6
Category of coding
lossless: text compression
lossy: time-domain (speech coding), frequency-domain (audio, image, video)
other diagram labels: presentation, metadata, speech, language
Coding is a rather vague term in the wide sense. I will focus on the compression of speech and audio. I will also mention the latest topics in lossless coding of audio.
7
Time-domain
linear prediction -> CELP
predictive coefficients
  PARCOR (partial auto correlation)
  LSP (line spectral pair)
vector quantization of excitation source
algebraic structure (ACELP)
Big market for cellular phone and VoIP
8
LPC (Linear Predictive Coding)
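[Diagram: all-pole synthesis filter. The excitation (innovation, prediction residual) enters an adder Σ. The synthesized output is fed back through a chain of delays z^-1, weighted by the predictive coefficients α1, α2, ..., αp.]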
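The synthesis filter in the diagram can be sketched in a few lines of Python. This is only an illustration, assuming the sign convention y[n] = e[n] + α1·y[n-1] + ... + αp·y[n-p]; the function names `lpc_synthesize` and `lpc_residual` are hypothetical and do not come from the talk.

```python
def lpc_synthesize(excitation, alpha):
    """All-pole synthesis: y[n] = e[n] + sum_i alpha[i-1] * y[n-i]."""
    order = len(alpha)
    output = []
    for n, e in enumerate(excitation):
        # Weighted sum of the previously synthesized samples (feedback path).
        prediction = sum(alpha[i] * output[n - 1 - i] for i in range(min(order, n)))
        output.append(e + prediction)
    return output


def lpc_residual(signal, alpha):
    """Inverse (analysis) filter: e[n] = y[n] - sum_i alpha[i-1] * y[n-i]."""
    order = len(alpha)
    return [
        s - sum(alpha[i] * signal[n - 1 - i] for i in range(min(order, n)))
        for n, s in enumerate(signal)
    ]


# A unit impulse through a first-order filter decays geometrically.
impulse = [1.0, 0.0, 0.0, 0.0]
response = lpc_synthesize(impulse, [0.5])
recovered = lpc_residual(response, [0.5])
```

Running the analysis filter on the synthesized output returns the excitation, which is why a coder only has to transmit the coefficients and the residual.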
9
Family of LPC parameters
predictive coefficients α1 ... αp
PARCOR coefficients k1 ... kp
LSP parameters ω1 ... ωp
[Figure: LSP parameters ω1, ω2, ..., ωp marked along the frequency axis]
merits of LSP: stability, interpolation, quantization, prediction
LSP parameters are convenient for interpolation, quantization, and prediction, since they have a natural relation to the power spectrum.
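The relation between two members of this family can be illustrated with the textbook step-up recursion, which converts PARCOR coefficients into predictive coefficients. The sketch below assumes the convention that the predictor is x̂[n] = Σ αi·x[n-i] (sign conventions differ between texts), and the function names are hypothetical.

```python
def parcor_to_predictor(parcor):
    """Step-up recursion: PARCOR k_1..k_p -> predictive coefficients alpha_1..alpha_p.

    Assumed convention: x_hat[n] = sum_i alpha_i * x[n - i].
    """
    alpha = []
    for k in parcor:
        order = len(alpha)
        # alpha_j(m) = alpha_j(m-1) - k_m * alpha_{m-j}(m-1), then alpha_m(m) = k_m
        updated = [alpha[i] - k * alpha[order - 1 - i] for i in range(order)]
        alpha = updated + [k]
    return alpha


def is_stable(parcor):
    """The synthesis filter is stable when every PARCOR magnitude is below 1."""
    return all(abs(k) < 1.0 for k in parcor)


coefficients = parcor_to_predictor([0.5, 0.2])
```

The stability test on PARCOR coefficients is a simple magnitude check, whereas checking the predictive coefficients directly would require finding the roots of a polynomial.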
10
CELP (Code Excited Linear Prediction)
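[Diagram: the LSP parameters derived from the input drive the LPC synthesis filter. The excitation is the gain-scaled sum of an adaptive codebook (periodic) and a random codebook (noise, pulse). The perceptual error between the input and the synthesized output is fed back to the codebook search (analysis by synthesis).]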
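The analysis-by-synthesis loop can be sketched as an exhaustive search: every candidate code vector is passed through the synthesis filter, the best gain is computed in closed form, and the candidate with the smallest error wins. This is a simplified illustration that assumes a plain squared error (no perceptual weighting) and a single fixed codebook (no adaptive codebook); the names `synthesize` and `search_codebook` are hypothetical.

```python
def synthesize(excitation, alpha):
    """All-pole LPC synthesis, y[n] = e[n] + sum_i alpha[i-1] * y[n-i]."""
    out = []
    for n, e in enumerate(excitation):
        out.append(e + sum(alpha[i] * out[n - 1 - i] for i in range(min(len(alpha), n))))
    return out


def search_codebook(target, codebook, alpha):
    """Return (index, gain, error) of the code vector whose synthesis best matches target."""
    target_energy = sum(t * t for t in target)
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, alpha)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero candidate cannot be scaled to match anything
        correlation = sum(t * v for t, v in zip(target, y))
        gain = correlation / energy                       # least-squares optimal gain
        error = target_energy - correlation * correlation / energy
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
alpha = [0.5]
target = [2.0 * v for v in synthesize(codebook[1], alpha)]
best_index, best_gain, best_error = search_codebook(target, codebook, alpha)
```

Because the error is measured on the synthesized signal rather than on the excitation, the encoder effectively contains a copy of the decoder.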
11
Synthesis model for vocoder
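[Diagram: the excitation is either a pulse train at the pitch interval or a random source; it is scaled by a gain, summed (Σ) and passed through the synthesis filter]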
12
Synthesis model for multi-pulse
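[Diagram: the pitch-interval component and the multi-pulse excitation are scaled by a gain, summed (Σ) and passed through the synthesis filter; the coded parameters are the amplitude and position of each pulse]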
13
Synthesis model for regular multi-pulse
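[Diagram: the pitch-interval component and regularly spaced pulses are scaled by a gain, summed (Σ) and passed through the synthesis filter; the coded parameters are the amplitudes of the regular pulses]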
14
Synthesis model for CELP
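[Diagram: the pitch-interval component and a code vector selected from a codebook are scaled by gains, summed (Σ) and passed through the synthesis filter; the coded parameter is the selection of the code vector]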
15
Synthesis model of VSELP
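[Diagram: the pitch-interval component and a sum of base vectors, each with a +/- sign, are scaled by gains, summed (Σ) and passed through the synthesis filter; the coded parameter is the polarity of each base vector]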
16
Synthesis model for CS-CELP
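[Diagram: the pitch-interval component and a pair of signed (+/-) vectors are scaled by gains, summed (Σ) and passed through the synthesis filter; the coded parameter is the selection of the vector pair]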
17
Synthesis model of ACELP
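[Diagram: the pitch-interval component and a few signed (+/-) unit pulses are scaled by gains, summed (Σ) and passed through the synthesis filter; the coded parameter is the selection of the unit pulse positions]
Finally, ACELP has been commonly used for various speech coding standards. The structure is extremely simple.
Simplicity is the seal of truth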
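The simplicity of the algebraic structure can be seen in a small sketch: the code vector is built directly from the bits of its index, so no codebook table has to be stored. The layout below (a 32-sample subframe, 4 interleaved tracks, one signed unit pulse per track, 4 bits per pulse) is a hypothetical example chosen for round numbers and is not the layout of any particular standard.

```python
SUBFRAME = 32    # hypothetical subframe length
NUM_TRACKS = 4   # hypothetical number of interleaved tracks, one pulse per track


def decode_algebraic_index(index):
    """Build the code vector for a 16-bit index: 3 position bits + 1 sign bit per track."""
    vector = [0.0] * SUBFRAME
    for track in range(NUM_TRACKS):
        field = (index >> (4 * track)) & 0xF
        slot = field & 0x7                       # which of the 8 positions on this track
        sign = -1.0 if field & 0x8 else 1.0      # polarity of the unit pulse
        vector[track + NUM_TRACKS * slot] = sign
    return vector


all_positive = decode_algebraic_index(0x0000)
one_flipped = decode_algebraic_index(0x0009)
```

Each track owns its own set of positions, so the pulses never collide and every index gives exactly four non-zero samples.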
18
Frequency-domain
Lapped transform: MDCT
  no frame-boundary noise and no information loss despite the overlap
Filter bank: QMF
  compromises time and frequency
adaptive noise control
psycho-acoustics
19
Transform coding
[Diagram: input -> transform (time to frequency) -> quantization -> quantization -> transform (frequency to time) -> output. Envelope estimation controls the adaptive bit allocation and is carried as side information.]
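Adaptive bit allocation from an estimated envelope can be illustrated with the textbook high-resolution rule, in which each coefficient receives the average number of bits plus half the base-2 logarithm of its variance relative to the geometric mean. This is a sketch under that assumption, not the allocation rule of any specific codec; the function name is hypothetical, and a practical coder would also clip negative values and round to integers.

```python
import math


def allocate_bits(variances, average_bits):
    """b_i = average_bits + 0.5 * log2(variance_i / geometric_mean(variances))."""
    log_mean = sum(math.log2(v) for v in variances) / len(variances)
    return [average_bits + 0.5 * (math.log2(v) - log_mean) for v in variances]


envelope = [16.0, 4.0, 1.0, 0.25]      # coefficient variances taken from the envelope
bits = allocate_bits(envelope, 2.0)    # 2 bits per coefficient on average
```

The total number of bits is unchanged by the rule; it only moves bits from low-energy coefficients to high-energy ones.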
20
Base of DCT
[Figure: DCT basis functions; axes: time, frequency]
21
Base of MDCT
[Figure: MDCT basis functions. Overlap with the previous frame: anti-symmetry. Overlap with the next frame: symmetry.]
Keeping the continuity in the time domain
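The way the overlapped halves cancel each other's aliasing can be checked numerically. The sketch below assumes a sine window (which satisfies the Princen-Bradley condition) and a direct, unoptimized evaluation of the transform; frame size, padding and all function names are choices made for this illustration.

```python
import math


def sine_window(half):
    """Sine window of length 2*half; w[n]^2 + w[n + half]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * half)) for n in range(2 * half)]


def _basis(n, k, half):
    return math.cos(math.pi / half * (n + 0.5 + half / 2.0) * (k + 0.5))


def mdct(frame, window):
    """2*half windowed samples -> half coefficients."""
    half = len(frame) // 2
    return [sum(window[n] * frame[n] * _basis(n, k, half) for n in range(2 * half))
            for k in range(half)]


def imdct(coeffs, window):
    """half coefficients -> 2*half windowed, time-aliased samples."""
    half = len(coeffs)
    return [window[n] * 2.0 / half * sum(coeffs[k] * _basis(n, k, half) for k in range(half))
            for n in range(2 * half)]


def mdct_roundtrip(signal, half):
    """Analyse with 50 % overlap, synthesise, and overlap-add (len(signal) % half == 0)."""
    window = sine_window(half)
    padded = [0.0] * half + list(signal) + [0.0] * half
    output = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * half + 1, half):
        block = imdct(mdct(padded[start:start + 2 * half], window), window)
        for n, value in enumerate(block):
            output[start + n] += value
    return output[half:half + len(signal)]


signal = [math.sin(0.3 * n) + 0.1 * n for n in range(16)]
restored = mdct_roundtrip(signal, 4)
```

Each frame of 2N samples yields only N coefficients, yet the overlap-add of neighbouring frames restores the signal, which is the sense in which the overlap costs no extra data.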
22
QMF for MPEG1,2 Layer-I, II
32 band QMF filter bank (analysis)
down sample
adaptive bit allocation for 32 equal bands (energy, masking)
adaptive quantization
bit stream
reconstruction
32 band QMF filter bank (synthesis)
23
QMF for MPEG1,2 Layer-III
32 band QMF filter bank (analysis)
down sample
long and short MDCT
adaptive bit allocation for Bark-scale (energy, masking)
adaptive quantization (Huffman coding), bit reservoir
bit stream
reconstruction
32 band QMF filter bank (synthesis)
24
QMF for MPEG extension tools
32 band QMF filter bank (analysis)
SBR (Spectral Band Replication)
PS (Parametric Stereo)
Surround
bit stream
reconstruction
32 band QMF filter bank (synthesis)
25
Masking effect
[Figure: log spectrum versus frequency, showing the original spectrum, the audible level, the allowable noise level and the masked region]
26
Physical and perceptual distortion
[Diagram labels: original; result of compression, un-noticeable (masking); additive noise; additive echo; un-noticeable region; characteristics of perception; application]
For vision, please listen to the next presentation.
27
Distortion by additional noise
[Figure: original and distortion, shown as log spectrum versus frequency and as waveform versus time; the distortion is noticeable]
28
Distortion by data compression
[Figure: original and distortion, shown as log spectrum versus frequency and as waveform versus time]
distortion is masked
control quantization noise
29
Distortion by echo
[Figure: original and distortion, shown as log spectrum versus frequency and as waveform versus time; time axis marked at 40 ms]
echo is masked
watermark
search or recognition
30
Predictive coding and transform coding
[Table labels: correlation (small to large), effect, gain, method]
time-domain (prediction)
  small correlation: unpredictable; large correlation: predictable
  prediction gain = waveform energy / residual energy
  method: closed-loop quantization
frequency-domain (transform, subband)
  small correlation: flat spectrum; large correlation: varied spectrum
  transform gain = arithmetic mean / geometric mean
  method: adaptive bit allocation, weighted quantization
prediction gain = transform gain
I have shown some technologies in both the time and frequency domains. From the viewpoint of compression, the two are closely related. For example, the compression gain due to prediction and the gain due to transform are ideally the same, depending on the statistical properties of the signal. Time-domain schemes are used for speech coding and frequency-domain schemes for audio coding. One of the main differences is the time resolution: speech (5 ms), audio (30 ms).
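The equality of the two gains can be checked numerically for a simple case. The sketch below assumes a first-order autoregressive source x[n] = a·x[n-1] + e[n] with unit-variance innovation, for which the ideal prediction gain is 1 / (1 - a²); the transform gain is evaluated as the arithmetic mean of the power spectrum divided by its geometric mean. The source model, the value a = 0.9 and the function names are assumptions of this example.

```python
import cmath
import math


def ar1_power_spectrum(a, bins):
    """Power spectrum 1 / |1 - a * exp(-jw)|^2 sampled on a uniform frequency grid."""
    return [1.0 / abs(1.0 - a * cmath.exp(-2j * math.pi * k / bins)) ** 2
            for k in range(bins)]


def transform_gain(spectrum):
    """Arithmetic mean of the spectrum divided by its geometric mean."""
    arithmetic = sum(spectrum) / len(spectrum)
    geometric = math.exp(sum(math.log(s) for s in spectrum) / len(spectrum))
    return arithmetic / geometric


def ar1_prediction_gain(a):
    """Waveform energy over residual energy for an ideal first-order predictor."""
    return 1.0 / (1.0 - a * a)


spectrum = ar1_power_spectrum(0.9, 4096)
gain_from_spectrum = transform_gain(spectrum)
gain_from_prediction = ar1_prediction_gain(0.9)
```

For a flat spectrum (a = 0) both expressions equal 1, which corresponds to the "unpredictable, flat spectrum" end of the table where neither approach gains anything.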
31
Standards
32
Example of standard
ITU-T: cellular phone, VoIP, TV-phone, FAX
ISO/IEC JPEG, MPEG: digital camera, video, digital broadcasting, portable music player, DVD
As you know, we have various commercial products and services based on standard schemes. Here are some examples.
33
Merits of standard
interoperability
open source
long term maintenance
visible patent holders
integration of the highest technologies
cost reduction by mass production
market creation
34
Circulatory evolution of market
[Diagram: cycle linking basic research, R & D, patent, standard, disclosure of technology, patent pool, royalty, service and products, users, cost reduction, competition and market research; label: convenient service and product]
35
Standardization for speech
ITU-T G.
IMT-2000 (International Mobile Telecommunication)
GSM (Europe, Asia)
TIA (North America)
US FS-1015 (LPC-10), 1016 (CELP), 1017 (MELP)
Japanese cellular
  - PDC full/half rate
  - PHS
  - cdmaOne
  - PDC enhanced full rate
36
ITU-T standard for speech
Telephone band (8 kHz sample)
  G.711 PCM 64 kbit/s
  G.726 ADPCM 32 kbit/s (16, 24, 40 kbit/s)
  G.727 Embedded ADPCM 32 kbit/s (16, 24, 40 kbit/s)
  G.728 Low-delay CELP 16 kbit/s
  G ACELP/MPC-MLQ /6.3 kbit/s
  G.729 CS-ACELP 8 kbit/s
Wide band (16 kHz sample)
  G.722 SB-ADPCM 64, 56, 48 kbit/s
  G Transform coding 24, 32 kbit/s
  G AMR-WB kbit/s
There are a number of ITU-T speech coding standards. The numbers are running short, so some have a prefix or are reused.
37
Standard for IMT-2000
3GPP (3rd Generation Partnership Project) (ARIB, TTC, T1, ETSI, TTA)
3GPP2
bi-directional CODEC
  AMR (Adaptive Multi-Rate)
  AMR-WB (wide band)
  video phone (H.263)
Audio/Low rate speech packet transmission (MPEG-4)
As far as mobile speech coding is concerned, research on compression is finished. I have no idea when the next one will be.
38
Bandwidth and bitrate for audio coding
[Figure: bandwidth (6 to 24 kHz) versus rate (24 to 768 kbit/s) for MPEG-1, MPEG-2, MPEG-2 (1/2 sample), MPEG-4, AC-3, AAC, multi-channel MPEG-2, MD, CD and DAT]
39
Basic technology for audio coding
              Transform       Quantization
MPEG-1 L1,2   subband         adaptive bit
MPEG-1 L3     subband+MDCT    adaptive+Huffman
ATRAC         subband+MDCT    adaptive bit
AC-3          MDCT            adaptive+Huffman
AAC           MDCT            adaptive+Huffman
TwinVQ        MDCT            adaptive VQ
40
MPEG-1, 2/audio
MPEG-1
  sampling rate: 32, 44.1, 48 kHz
  stereo
  algorithm:
    Layer-I    band split
    Layer-II   + improved quantizer
    Layer-III  + MDCT + variable length + bit reservoir ++
MPEG-2
  low sampling rate: 16, 22.05, 24 kHz
  multi channel: 5.1ch
  backward compatibility
41
MPEG-2/AAC
3 profiles: Main, LC (Low Complexity), SSR (Scalable Sampling Rate)
sampling rate: 32, 44.1, 48 kHz, plus x2, x1/2, x1/4
channel: 1-48
bit rate: kbit/s/ch
MDCT 1024 or 128
TNS (Temporal Noise Shaping)
MS (Mid/Side) stereo/intensity stereo
non-linear scale quantizer + variable length code (2 and 4 dimension Huffman code)
42
Tools in MPEG-4 audio
Low rate speech: HVXC (Harmonic Vector eXcitation Coding)
Speech (narrow/wide): CELP
Low rate audio: TwinVQ (Transform domain Weighted Interleave VQ)
Audio: MPEG-2 AAC (Advanced Audio Coding)
Error resilient framework
Parametric audio coding: HILN
Fine granular scalable audio coding: BSAC
Low delay audio coding: LD-AAC
Low overhead Audio Transport: LATM
43
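MPEG-4 General audio
[Diagram: common tools (MDCT, LTP, TNS, stereo coding, scalability) feed three alternative quantization paths: TwinVQ (interleave VQ), AAC (scale factor, Huffman coding) and BSAC (scale factor, bit-slice arithmetic coding); the output is produced by the IMDCT]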
44
Audio Demo (low rate)
ITU-T G.711 64 kbit/s
ITU-T G.726 32 kbit/s
PDC Full 6.7 kbit/s
PDC Half 3.45 kbit/s
MPEG4 HVXC 2 kbit/s
MPEG4 TwinVQ 8 kbit/s
45
Hot Topics
46
Background of lossless coding
Demand for lossless compression of audio
  archiving analog and digital contents
  delivery over broadband network
  high quality audio format: up to 24 bit, 192 kHz sampling, multi-channel
  medical data, seismic data, sensor array, etc.
MPEG-4 extension
  official tools (open source)
  interoperability (good for over 100 years)
I think it is not necessary to talk about the need for lossless coding. One thing I want to emphasize is that the strongest point of the MPEG standard is interoperability, and this will be maintained for over 100 years.
47
Family of MPEG lossless
ALS
  one-step compression in time domain
SLS
  scalable to lossless from MPEG lossy core
  fine grain scalability in frequency domain
  Integer MDCT
DST
  1-bit oversample format
  compatible with Sony-Philips SACD format
48
Property of ALS
Time domain adaptive prediction
  simple to high-performance
  backward prediction
  BGMC for prediction residual
  Golomb-Rice Code for PARCOR
  Progressive order prediction
  Long-term prediction
  Hierarchical block switching
extension
  Floating-point support
  Multi-channel predictive coding
Let me review the history of development. The initial system is based on TUB’s proposal. In the course of the core experiment process, there have been a number of enhancements and extensions.
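A Golomb-Rice code, named on this slide, maps small integers to short codewords without any stored table. The sketch below is a generic textbook version (zigzag mapping of signed values, unary quotient, k-bit remainder) written with bit strings for readability; it is not the ALS bitstream syntax, and the function names are hypothetical.

```python
def zigzag(value):
    """Map signed integers to non-negative ones: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * value if value >= 0 else -2 * value - 1


def unzigzag(mapped):
    return mapped // 2 if mapped % 2 == 0 else -(mapped + 1) // 2


def rice_encode(value, k):
    """Unary-coded quotient, a terminating '0', then the k-bit remainder."""
    mapped = zigzag(value)
    quotient, remainder = mapped >> k, mapped & ((1 << k) - 1)
    suffix = format(remainder, "b").zfill(k) if k > 0 else ""
    return "1" * quotient + "0" + suffix


def rice_decode(bits, k, count):
    """Decode `count` values from a concatenated bit string."""
    values, pos = [], 0
    for _ in range(count):
        quotient = 0
        while bits[pos] == "1":
            quotient += 1
            pos += 1
        pos += 1  # skip the terminating '0'
        remainder = int(bits[pos:pos + k], 2) if k > 0 else 0
        pos += k
        values.append(unzigzag((quotient << k) | remainder))
    return values


residuals = [0, -1, 3, -7, 12]
stream = "".join(rice_encode(r, 2) for r in residuals)
decoded = rice_decode(stream, 2, len(residuals))
```

The parameter k trades the length of the unary part against the fixed remainder, so it is normally chosen to match the typical magnitude of the values being coded.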
49
Prediction residual
[Figure: amplitude versus time for the original wave and the prediction residual wave]
This figure shows how efficient LPC is at reducing the amplitude. The green waveform shows the original audio signal, and the red one is the prediction residual. We only need to encode the prediction residual instead of the original waveform. It is obvious that the amplitude can be reduced by dB, namely by 5 to 7 bits per sample.
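The link between a reduction in residual energy and bits saved per sample is simple arithmetic: one bit corresponds to a factor of two in amplitude, about 6.02 dB. The sketch below applies this to a synthetic example, assuming a slowly varying sine tone and the simplest possible predictor (the previous sample); neither the signal nor the predictor is taken from the talk.

```python
import math


def first_order_residual(signal):
    """Residual of the trivial predictor x_hat[n] = x[n-1]; the first sample is kept as is."""
    return [signal[0]] + [signal[n] - signal[n - 1] for n in range(1, len(signal))]


def prediction_gain(signal, residual):
    """Return the energy ratio in dB and the equivalent bits per sample."""
    energy_ratio = sum(v * v for v in signal) / sum(v * v for v in residual)
    gain_db = 10.0 * math.log10(energy_ratio)
    return gain_db, gain_db / (20.0 * math.log10(2.0))


# Synthetic test tone: 200 samples per period, 10 periods.
tone = [1000.0 * math.sin(2.0 * math.pi * n / 200.0) for n in range(2000)]
gain_db, bits = prediction_gain(tone, first_order_residual(tone))
```

For this synthetic tone the gain is roughly 30 dB, or about 5 bits per sample; real audio needs a higher-order adaptive predictor to reach comparable figures.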
50
Predictive coding
[Diagram: input -> prediction -> residual -> synthesis, shared by three coders]
vocoder: compression ratio 1/30; residual represented by the pulse interval (residual shown magnified 30 times)
waveform coding: compression ratio 1/10; codebook for residual
lossless coding: compression ratio 1/2; all residual parameters
different framework, rich commonality
51
Compression and decoding time
[Figure: compression ratio (45 to 50 %) versus averaged decoding time in seconds for 30 sec files (48, 96, 192 kHz), comparing ALS (reference decoder), ALS (enhanced decoder), ALS (high-compression), MPEG-4 SLS, Monkey’s Audio (free software) and OptimFrog (free software)]
52
Quality improvements by SBR and PS
[Figure: relative quality versus stereo bit rate (24 to 144 kbit/s) for MP3, AAC, HE-AAC and HE-AAC V2. The profiles nest: AAC profile (AAC), HE-AAC profile (AAC, SBR), HE-AAC V2 profile (AAC, SBR, PS). Annotations: Japanese digital broadcasting (2003), Japanese mobile digital broadcasting (2006).]
53
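MPEG SBR (HE-AAC)
SBR (Spectral Band Replication)
[Diagram: encoder: the input goes to high frequency analysis (envelope, excitation), which produces the SBR bit stream, and through down sampling to a low-pass input for the AAC stereo encoder, which produces the AAC stereo bit stream. Decoder: the AAC stereo decoder gives the low-pass output, and SBR synthesis extends it to the full-band output.]
HE-AAC profile includes SBR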
54
MPEG SBR+PS (HE-AAC v2)
PS (parametric stereo)
[Diagram: encoder: the stereo input goes to PS analysis (channel level differences, inter channel correlation), which produces the PS bit stream, and through mix down to a monaural input for the AAC monaural encoder, which produces the AAC monaural bit stream. Decoder: the AAC monaural decoder gives the monaural output, and PS synthesis rebuilds the stereo output.]
Similar to SBR, the Parametric Stereo tool can be combined with a base coder for the monaural signal. The side information of this tool is the channel level difference and the inter channel correlation.
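The two side-information parameters can be illustrated with a short computation on one block of stereo samples. This is a simplified sketch that assumes a single full-band block and a plain average as the mix-down; an actual Parametric Stereo tool works per frequency band and time slot in a filter bank domain, and the function name is hypothetical.

```python
import math


def stereo_parameters(left, right):
    """Return (mono mix-down, channel level difference in dB, inter-channel correlation)."""
    energy_left = sum(v * v for v in left)
    energy_right = sum(v * v for v in right)
    cross = sum(l * r for l, r in zip(left, right))
    level_difference_db = 10.0 * math.log10(energy_left / energy_right)
    correlation = cross / math.sqrt(energy_left * energy_right)
    mono = [(l + r) / 2.0 for l, r in zip(left, right)]
    return mono, level_difference_db, correlation


left = [2.0, -2.0, 2.0, -2.0]
right = [1.0, -1.0, 1.0, -1.0]
mono, cld_db, icc = stereo_parameters(left, right)
```

Here the right channel is a half-amplitude copy of the left, so the correlation is 1 and the level difference is about 6 dB; only these two numbers plus the mono signal would need to be transmitted.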
55
MPEG surround
[Diagram: encoder: the 5-ch input goes to surround analysis (channel level differences, inter channel correlation, channel prediction coefficients), which produces the surround bit stream, and through mix-down to a stereo input for the AAC stereo encoder, which produces the AAC stereo bit stream. Decoder: the AAC stereo decoder gives the stereo output, and surround synthesis rebuilds the 5-ch output.]
This project is ongoing now. We have a stereo coder as the base.
56
History of MPEG Audio
[Figure: timeline from 1992 to 2006 showing MPEG-1, MPEG-2 MC/LSF*, MPEG-2 AAC, MPEG-4 V1, MPEG-4 V2, MP3 on 4, SBR, SSC, lossless (DST, ALS, SLS) and surround; annotations: forward and backward compatibility, 2001, 2005]
*Multi-channel and Low Sampling Frequency
57
Future challenge
Open problems
  all-mighty coder for both speech and audio at less than 16 kbit/s
  Wave field synthesis (multi-channel)
Integrated service
  video
  copyright management