Published by Jules Grenier. Modified over 6 years ago.
ON THE ARCHITECTURE OF THE CDMA2000® VARIABLE-RATE MULTIMODE WIDEBAND (VMR-WB) SPEECH CODING STANDARD

Milan Jelinek†, Redwan Salami‡, Sassan Ahmadi*, Bruno Bessette†, Philippe Gournay†‡ and Claude Laflamme†
†University of Sherbrooke, Canada - ‡VoiceAge Corp., Canada - *Nokia Inc., USA

VMR-WB: Variable-Rate Multi-Mode Wideband Speech Codec
New 3GPP2 WB Speech Coding Standard for 3G applications

Main Features:
- Near face-to-face communication speech quality
- Source and channel controlled operation (4 modes)
- 3GPP/ITU AMR-WB directly interoperable in Mode 3
- Compliant with CDMA2000 Rate Set 2 (FR), 6.2 (HR), 2.7 (QR) or 1.0 (ER) kbit/s frames
- WB ( Hz) and NB ( Hz) input/output
- 20 ms frames
- Noise reduction with adjustable maximum reduction

Average Bit Rates (ABR):

Mode      Active speech (kbit/s)   40% speech activity (kbit/s)
Mode 3    13.3                     6.1
Mode 0    12.8                     5.7
Mode 1    10.5                     4.8
Mode 2    8.1                      3.8

Encoder Flow Chart
[Figure: encoder flow chart. Block labels: Input, Spectral Analysis, LP Analysis, Pitch Tracking, Noise Estimation, Noise Reduction, Voice Activity?, Parameters, De-noised]

VMR-WB Coding Techniques
(Coding type, bit rate in kbit/s where the poster lists one, description)

Inactive Speech Coding
- CNG ER (1.0): noise-excited LP filter; smoothed over time
- CNG QR (2.7): as previous, but interoperable with AMR-WB CNG

Unvoiced Coding
- Unvoiced HR (6.2): 13-bit Gaussian codebook (4x/frame)
- Unvoiced QR: as previous, but randomly chosen vectors

Voiced Coding
- Voiced HR: frame-level signal modification; 12-bit ACELP codebook (4x/frame)

Generic Coding
- Interoperable FR (13.3): similar to kbit/s
- Generic FR: as previous + FER protection
- Interoperable HR: as Interoperable FR, but with random algebraic codebook indices
- Signaling HR
- Generic HR: pitch coded 2x/frame

Source-Controlled Operation
Hierarchical signal classification operating on frame level.

[Figure: classification flow chart. Begin, then the decisions 1. Voice Activity? 2. Unvoiced Frame? 3. Voiced Frame? 4. Low Energy? (Yes/No branches). Outcomes: CNG Encoding or DTX, Unvoiced Speech Optimized Encoding, Voiced Speech Optimized Encoding, Generic HR Encoding, Generic FR Encoding]

1. Voice Activity Detection (VAD)

Voice Activity Decision: lower for noisy speech, higher for clean speech.

Noise Estimation Update Decision:
Based on parameters with low sensitivity to noise level:
- pitch period varying AND
- normalized correlation at pitch period low AND
- low estimated order of AR model AND
- signal energy stationary
INDEPENDENT of VAD decision!
- Robust to noise level variations
- Conservative approach: the noise estimate is updated only when it is quite certain that the frame is inactive

2. Unvoiced Frame Decision
Based on the following parameters:
- Normalized Correlation
  T: open-loop pitch period estimate
  xi: perceptually weighted input signal
- Spectral Tilt
  Eh: average energy of the last 2 critical bands
  El: average energy of pitch-synchronous bins in the first 10 critical bands
- Frame Energy Variation
  E32(j): energy maximum in a block of 32 samples
- Relative Frame Energy, Erel
  Et: sum of critical band energies for the current frame, in dB
  Ef: long-term mean of Et for active speech

3. Voiced Frame Decision / Signal Modification
- The voiced decision is an inherent part of the original signal modification algorithm
- A frame is coded as voiced if all constraints of the modification are satisfied
- Signal modification is done pitch-synchronously
- Pitch period evolution is piecewise linear (constant at frame end) to avoid pitch period oscillations
- The modified input is synchronous with the original input at frame end
- Modification is transparent at least up to 30% of active speech frames (in the example below, no coding is used and 30% of active clean speech frames are modified)
[Figure: signal modification example]

4. Low Energy Decision
Purpose: To avoid encoding unclassified frames with low perceptual importance at Full Rate
Condition: 2000 Hz
Example: Typical example of a low-energy frame encoded with Generic HR in mode 2

Channel-Controlled Operation
- 4 operational modes controlled by channel conditions
- Transparent memory-less mode switching
- Per-frame bit rate control capability

Coding Types Relative Usage in Active Speech:

Coding Type        Mode 0    Mode 1    Mode 2    Mode 3
Generic FR         93.4 %    60.4 %    34.1 %    -
Interoperable FR   -         -         -         100.0 %
Generic HR         -         7.1 %     13.1 %    -
Voiced HR          -         13.0 %    33.2 %    -
Unvoiced HR        6.6 %     19.5 %    5.6 %     -
Unvoiced QR        -         -         14.0 %    -

Mode Switching Performance:
Comparing MOS scores of modes 0, 1, 2 with random mode switching at 0.5, 1 and 5 second intervals (from characterization test)

Enhancements at Decoder

Low Frequency Post-processing:
Enhancement of the periodicity in the low frequency region

Frame Error Concealment:
Lost Frame Concealment:
- Excitation energy and spectral envelope converge to estimated noise.
- Excitation periodicity converges to 0.
- Convergence rate depends on the signal class of the last good frame.
Recovery after erasure:
- Careful energy control of synthesized speech.
- Artificial onset reconstruction in case of a lost voiced onset.

Performance (MOS scores from selection test)
CDMA Specific Modes (Modes 0, 1, 2), WB Input

Performance (MOS scores from characterization test)
- NB Input Test: Modes 0, 1, 2, 3, clean speech, nominal level
- Test on Interworking with kbit/s: WB input, clean speech conditions

[Figures: MOS score charts. Legend: Ref 0: 14.25, Ref 1: 12.65, Ref 2: 8.85, Test 0: VMR-WB Mode 0, Test 1: VMR-WB Mode 1, Test 2: VMR-WB Mode 2. Panels: Clean Speech Conditions, Channel Error Conditions, Background Noise Conditions]
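The hierarchical classification in the flow chart above (1. Voice Activity? 2. Unvoiced Frame? 3. Voiced Frame? 4. Low Energy?) can be read as a chain of early exits. The sketch below is only an illustration of that reading, not the standard's implementation. It assumes that a "Yes" at each step selects the specialized encoder listed in the same order, since the Yes/No branch labels did not survive in this transcript. The class `FrameDecisions` and the function `select_encoding` are hypothetical names, the four booleans stand in for the actual decision logic, and mode-dependent restrictions on which coding types are allowed are ignored.

```python
from dataclasses import dataclass


@dataclass
class FrameDecisions:
    """Outcome of the four per-frame decisions (hypothetical container)."""
    voice_activity: bool  # 1. Voice Activity?
    unvoiced: bool        # 2. Unvoiced Frame?
    voiced: bool          # 3. Voiced Frame?
    low_energy: bool      # 4. Low Energy?


def select_encoding(d: FrameDecisions) -> str:
    """Walk the decisions in flow-chart order and return the encoder label.

    Assumption: each "Yes" exits to the specialized encoder, and a frame
    that fails every test falls through to Generic FR.
    """
    if not d.voice_activity:
        return "CNG Encoding or DTX"
    if d.unvoiced:
        return "Unvoiced Speech Optimized Encoding"
    if d.voiced:
        return "Voiced Speech Optimized Encoding"
    if d.low_energy:
        return "Generic HR Encoding"
    return "Generic FR Encoding"


if __name__ == "__main__":
    frame = FrameDecisions(voice_activity=True, unvoiced=False,
                           voiced=False, low_energy=True)
    print(select_encoding(frame))
```

Under this reading, later tests are evaluated only for frames that earlier tests did not claim, which is why the low-energy check applies only to otherwise unclassified frames.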
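As a consistency check on the numbers above, the average bit rate in active speech for each mode can be recomputed from the relative usage of the coding types and the frame bit rates (FR 13.3, HR 6.2, QR 2.7 kbit/s). This is a minimal sketch, not part of the poster. It assumes that each percentage belongs to the mode column that makes the column sum to 100 %, because the empty table cells were lost in this transcript, and that the last word of a coding type name identifies its frame rate. The names `RATE_KBPS`, `USAGE_PERCENT` and `average_active_bit_rate` are hypothetical.

```python
# Frame bit rates in kbit/s as listed on the poster.
RATE_KBPS = {"FR": 13.3, "HR": 6.2, "QR": 2.7}

# Relative usage in active speech (percent). The assignment of values to
# mode columns is an assumption: it is the one where every column sums to 100.
USAGE_PERCENT = {
    "Mode 0": {"Generic FR": 93.4, "Unvoiced HR": 6.6},
    "Mode 1": {"Generic FR": 60.4, "Generic HR": 7.1,
               "Voiced HR": 13.0, "Unvoiced HR": 19.5},
    "Mode 2": {"Generic FR": 34.1, "Generic HR": 13.1, "Voiced HR": 33.2,
               "Unvoiced HR": 5.6, "Unvoiced QR": 14.0},
    "Mode 3": {"Interoperable FR": 100.0},
}


def average_active_bit_rate(usage: dict) -> float:
    """Usage-weighted mean of the frame bit rates, in kbit/s."""
    total = 0.0
    for coding_type, percent in usage.items():
        rate_name = coding_type.split()[-1]  # "FR", "HR" or "QR"
        total += RATE_KBPS[rate_name] * percent / 100.0
    return total


if __name__ == "__main__":
    for mode, usage in USAGE_PERCENT.items():
        print(mode, round(average_active_bit_rate(usage), 1), "kbit/s")
```

Rounded to one decimal, the results are 12.8, 10.5, 8.1 and 13.3 kbit/s for Modes 0, 1, 2 and 3, which agrees with the active speech column of the average bit rate figures.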