Jingming Xu Multimedia Communications Lab University of Waterloo Rate-distortion Optimization for MP3 and AAC Audio Coding with Complete Decoder Compatibility Welcome Ladies and Gentlemen. Thanks for coming to my master’s seminar. I am Jingming Xu. I did my master’s with Dr. En-hui Yang at the Multimedia Communications Lab. Today, I will talk about… If you have any comments or questions during my talk, I will be happy to answer them at the end of the presentation.
Outline Introduction and motivation MP3, AAC, and Two-nested-loop Search Rate-distortion optimization for MP3 Rate-distortion optimization for AAC Conclusions and Future Research Introduction to audio coding and MPEG standards, and motivation of our research in ….. Then I will give a short review of MP3 and AAC audio coding standards, with emphasis on quantization and entropy coding constraints. State-of-the-art MP3 and AAC quantization and entropy coding scheme, and its problems. Based on the standard constraints, we will develop …. , in each case, I will provide simulation results with comparison to state-of-the-art… Rate-distortion optimization for MP3. Rate-distortion optimization for AAC. Concluding remarks. September 16th, 2005 2
Introduction Audio coding - different from universal data compression Long term correlations Multi-channel correlations Subject to natural noises Subjective perceptual quality judgement Audio coding methods - for both lossy and lossless Linear prediction Time-frequency mapping (DCT, FFT, MDCT, etc.) Parameter coding …. Audio signals have their own characteristics such as …., which make universal data compression schemes inefficient when applied directly to audio. During the past 30 years, people have come up with many methods specifically for both lossy and lossless audio coding, such as … Many of them have become international or industry standards, among which MPEG is certainly the most popular.
Introduction (2) MPEG - the most successful audio coding standard series so far MPEG-1 (1992) - T/F mapping based, 3 Layers with increasing complexity MPEG-2 BC (1994) - backward compatible with MPEG-1, with multi-channel and sampling frequency extensions MPEG-2 AAC (1997) - introducing more coding tools and giving up backward compatibility to improve quality MPEG-4 AAC (1999) - inherited from MPEG-2 AAC with TwinVQ and bitrate scalability extensions The development of the MPEG audio standards can be roughly divided into 2 phases …. The first phase, …., MP3 MPEG-4 even supports a new vector quantization scheme, TwinVQ, and bitrate scalability. However, since those extensions are beyond the scope of our research, we simply denote both MPEG-2 AAC and MPEG-4 AAC as AAC. MPEG-1 Layer 3 and MPEG-2 BC Layer 3 define the popular “MP3”.
Introduction (3) Motivations MP3 and AAC leave the design of structured encoding blocks open for performance enhancement. The state-of-the-art MP3 and AAC quantization and entropy coding scheme, Two-nested-loop Search (TNLS), is essentially unable to exploit the maximal standard-constrained flexibility for the best rate-distortion tradeoff. The huge success of MP3 and AAC in the digital audio industry. Like many other multimedia compression standards, …. As we will see later, …. …. There is still room we can exploit. The huge success of MP3 and AAC in the digital audio industry also motivates our research.
Introduction (4) Quality evaluation of compressed audio Most widely used objective measure - noise-to-mask ratio Most widely used subjective measure - ITU listening test (ITU-R Recommendation BS.1116) Triple sources A, B, C with hidden reference, double blind 5-grade impairment score scale Two quality evaluation methods used in our research …. NMR is the ratio of the noise energy in a band to its perceptual masking threshold; here w_{i} is the inverse of the masking threshold. During the test, the listener is free to listen to sources A, B, or C. Source A is known to be the reference signal. However, sources B and C may each be either the reference signal or the test signal. The assignment is determined randomly so that neither the listener nor the test administrator knows it beforehand. After listening, the listener is asked to rate sources B and C relative to source A on a continuous 5-grade impairment scale.
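As a rough illustration of the objective measure described above, the sketch below computes a per-band NMR as the weighted noise energy w_i × (noise energy in band i), with w_i the inverse of the masking threshold, and averages it over bands in dB. The averaging convention, the function names and the band layout are assumptions made for this example; the exact ANMR definition used in the thesis is not given on the slide.

```python
import math

def band_noise_energy(original, decoded, start, end):
    """Sum of squared reconstruction errors for coefficients in [start, end)."""
    return sum((x - y) ** 2 for x, y in zip(original[start:end], decoded[start:end]))

def average_nmr_db(original, decoded, band_edges, thresholds):
    """Average noise-to-mask ratio over all bands, in dB (assumed convention)."""
    ratios = []
    for b, threshold in enumerate(thresholds):
        weight = 1.0 / threshold  # w_i: inverse of the masking threshold
        noise = band_noise_energy(original, decoded, band_edges[b], band_edges[b + 1])
        ratios.append(weight * noise)
    mean_ratio = sum(ratios) / len(ratios)
    if mean_ratio == 0.0:
        return float("-inf")  # perfect reconstruction
    return 10.0 * math.log10(mean_ratio)
```

A value below 0 dB means that, on average, the quantization noise stays under the masking threshold.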
MP3 and AAC audio coding standards Encoding process Window switching Stereo coding Pre-processing in AAC: gain control, prediction, noise shaping and substitution, etc. A high-level block diagram of the MP3 encoding process is shown in … Time domain audio samples are first fed into a T/F mapping block, which converts them into spectral coefficients. They are also fed into a psychoacoustic model, which generates control information for T/F mapping (window switching), quantization and entropy coding. Under the control of the psychoacoustic model, the spectral coefficients are quantized, entropy coded, and packed together with format information and control information. T/F mapping option: window switching Quantization and entropy coding option: separate/joint channel coding
MP3 and AAC audio coding standards (2) Quantization and entropy coding in MP3 Scale factor bands and non-uniform quantization scale_factor values are encoded by a fixed number of bits in the side information and a variable number of bits in the main_data stream The whole spectrum is divided into a fixed number of scale factor bands. The non-uniform quantizer, corresponding to the de-quantizer defined in MP3, can be formulated as, where global_gain is …. After quantization, the scale factor values for one frame are broken down into four parts in the bitstream for efficient storage.
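For orientation, the long-block de-quantizer in the MPEG-1 Layer III specification has the power-law form xr = sign(ix) · |ix|^(4/3) · 2^((global_gain − 210)/4) · 2^(−scalefac_multiplier · (scalefac + preflag · pretab)). The sketch below implements that de-quantizer for a single coefficient and a matching quantizer obtained by inverting it. The function names are hypothetical, and the rounding offset 0.4054 is the value commonly seen in ISO-style encoders; it is an assumption here, not something stated on the slide.

```python
def dequantize(ix, global_gain, scalefac, scalefac_multiplier=0.5, preflag=0, pretab=0):
    """Reconstruct one long-block spectral value from its quantized index."""
    gain = 2.0 ** ((global_gain - 210) / 4.0)
    band = 2.0 ** (-scalefac_multiplier * (scalefac + preflag * pretab))
    sign = -1.0 if ix < 0 else 1.0
    return sign * abs(ix) ** (4.0 / 3.0) * gain * band

def quantize(xr, global_gain, scalefac, scalefac_multiplier=0.5, preflag=0, pretab=0):
    """Hard-decision quantizer matching dequantize() (rounding offset is assumed)."""
    # The reconstruction value of index 1 acts as the effective step size.
    step = dequantize(1, global_gain, scalefac, scalefac_multiplier, preflag, pretab)
    level = int((abs(xr) / step) ** 0.75 + 0.4054)
    return -level if xr < 0 else level
```

With global_gain = 210 and a zero scale factor the effective step is 1, so an input of 8.0 maps to index 5, which reconstructs to about 8.55. The gap between 8.0 and 8.55 is the kind of per-coefficient error that the rate-distortion optimization trades against bits.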
MP3 and AAC audio coding standards (3) Quantization and entropy coding in MP3 Huffman coding 34 fixed Huffman codebooks Huffman coding region division: Each region is coded with a different codebook that best matches the statistics of that region. big_value, count_1, zero, …. After quantization, the quantized spectrum is Huffman encoded. …. The way the regions are divided is generally open to design, except for the big_value subdivision in short windows. For AAC, there are lots of … Those audio pre-processing tools are optionally applied before the quantization and entropy coding block. The coding efficiency resulting from the adoption of these tools is signal dependent, and thus they usually operate under the control of the psychoacoustic model.
MP3 and AAC audio coding standards (4) Quantization and entropy coding in AAC Non-uniform quantizer: same as in MP3 scale_factor values are differentially encoded relative to that of the preceding band using a fixed Huffman codebook Huffman coding 12 fixed Huffman codebooks Huffman coding region division: Section boundaries can only be at the scale factor band boundaries For each section, the length of the section in scale factor bands, and the index of the codebook used for that section, are transmitted with a fixed number of bits. AAC uses the same …. In effect, the codebook index of each band is also differentially encoded solely relative to that of the preceding band, except for band 0: if they are the same, no bits need to be transmitted at all; otherwise, it costs a fixed number of bits.
Two-nested-loop Search algorithm Outer Loop Inner Loop Given a target data rate, the task of the outer loop is to amplify the scale factor of each “distorted” band so that its NMR falls below 1. Since the amplified parts of the spectrum need more bits for encoding, but the number of available bits is constant, the inner loop changes the global quantizer step size until the given spectrum can be encoded with the available bits. Overall, this mechanism shifts bits from spectral regions where they are not required to those where they are required.
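The two loops can be sketched on a toy model as follows. This is only an illustration under strong assumptions: the quantizer is uniform, `toy_bits` is a made-up codeword-length function standing in for the real Huffman tables, each band is given as a hypothetical (coefficients, masking threshold) pair, and the stopping rules are simplified. It is not the ISO or LAME implementation.

```python
def toy_bits(q):
    # Toy codeword length (assumption): stands in for the real Huffman tables.
    return 1 + 2 * ((abs(q) + 1).bit_length() - 1)

def encode(bands, global_gain, scalefacs):
    """Quantize every band; return (total bits, per-band NMR)."""
    bits, nmr = 0, []
    for (coeffs, mask), sf in zip(bands, scalefacs):
        # A larger scale factor means a finer step for that band.
        step = 2.0 ** (global_gain / 4.0) * 2.0 ** (-sf / 2.0)
        q = [round(c / step) for c in coeffs]
        bits += sum(toy_bits(v) for v in q)
        noise = sum((c - v * step) ** 2 for c, v in zip(coeffs, q))
        nmr.append(noise / mask)
    return bits, nmr

def inner_loop(bands, scalefacs, bit_budget, max_gain=120):
    """Rate loop: coarsen the global step until the frame fits the budget."""
    for gain in range(max_gain + 1):
        bits, nmr = encode(bands, gain, scalefacs)
        if bits <= bit_budget:
            break
    return gain, bits, nmr

def two_nested_loop_search(bands, bit_budget, max_outer=20, max_scalefac=15):
    """Distortion loop: amplify bands whose NMR exceeds 1, keep the best try."""
    scalefacs = [0] * len(bands)
    best = None
    for _ in range(max_outer):
        gain, bits, nmr = inner_loop(bands, scalefacs, bit_budget)
        worst = max(nmr)
        if best is None or worst < best[0]:
            best = (worst, gain, list(scalefacs), bits)  # remember best so far
        distorted = [b for b, r in enumerate(nmr) if r > 1.0]
        if not distorted or len(distorted) == len(bands):
            break  # nothing to fix, or amplifying everything would not help
        if any(scalefacs[b] >= max_scalefac for b in distorted):
            break  # scale factor range exhausted
        for b in distorted:
            scalefacs[b] += 1
    return best
```

Note that the sketch has to store the best parameters seen so far and stop on predefined conditions, because nothing guarantees that later iterations improve on earlier ones.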
Two-nested-loop Search algorithm (2) Problems in TNLS Quantization, scale factor adaptation and Huffman coding are considered separately. Has no convergence guarantee Does not aim to minimize the overall distortion Disregards the inter-band correlations of scale factors and Huffman codebook selection in AAC However, they actually work together to determine the rate-distortion performance. Optimizing only one of these factors in one step may force a sub-optimal selection of the other factors in the following steps and degrade the performance of the whole system in the end. The best parameters so far have to be stored during each iteration and restored as output after the final termination. And the iteration process has to be terminated according to predefined conditions without knowing the optimality of the result. In our research, we aim to attack these problems directly …
Rate-distortion optimization for MP3 Problem formulation Lagrangian RD cost minimization over the quantized coefficients, the scale factors, the Huffman coding region division and the Huffman codebook selection, with reconstruction through the non-uniform de-quantizer defined in MP3 and distortion measured by the noise-to-mask ratio. We formulate the rate-distortion optimization problem as the minimization of the actual Lagrangian RD cost …. By incorporating all coding factors in the …. stage
Rate-distortion optimization for MP3 (2) Problem formulation Soft-decision quantization In conventional hard-decision quantization, the quantized spectrum is solely determined by the given spectrum and scale factors. However, in the soft-decision quantization scenario, the quantized spectrum is considered a flexible coding factor and is selected such that the actual RD cost is minimized. The key point in our problem formulation is to further consider the quantized coefficients as an optimization variable, leading to the so-called soft-decision quantization.
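The difference between the two quantization styles can be shown on a single coefficient. The sketch below is a toy: it uses a uniform reconstruction q × step instead of the MP3 power-law de-quantizer, a made-up `toy_bits` length function instead of the Huffman tables, and a small search window around the hard decision. All names and parameters are assumptions for illustration.

```python
def toy_bits(q):
    # Toy codeword length (assumption): stands in for the real Huffman tables.
    return 1 + 2 * ((abs(q) + 1).bit_length() - 1)

def rd_cost(x, q, step, lam):
    """Lagrangian cost: squared error plus lambda times the toy rate."""
    return (x - q * step) ** 2 + lam * toy_bits(q)

def hard_decision(x, step):
    """The index is fixed by the input alone: nearest reconstruction level."""
    return round(x / step)

def soft_decision(x, step, lam, search=2):
    """The index is a free variable: pick the one with the lowest RD cost."""
    q0 = hard_decision(x, step)
    candidates = range(q0 - search, q0 + search + 1)
    return min(candidates, key=lambda q: rd_cost(x, q, step, lam))
```

For x = 3.4, step = 1 and lam = 1, the hard decision is 3 (cost 0.16 + 5 = 5.16) while the soft decision picks 2 (cost 1.96 + 3 = 4.96): a slightly larger error is accepted because the shorter codeword more than pays for it. With lam = 0 the two decisions coincide.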
Rate-distortion optimization for MP3 (3) Fixed-slope graph-based iterative RD optimization Step 1: Initialize a set of scale factors from the given frame of spectrum, together with an HCB selection. Set t=0, and specify a tolerance as the convergence criterion. Step 2: Given the current scale factors and HCB selection, for any t ≥ 0, find the optimal quantized spectrum and HCB region division through a standard-constrained graph, i.e., the pair that achieves the minimum RD cost, and record that minimum cost. Based on the problem formulation, we propose a fixed-slope graph-based iterative algorithm for RD optimization in MP3. Since the maximal allowed coefficient amplitude is closely related to the Huffman codebook region in which that coefficient lies, given a fixed Huffman codebook selection, the coding gains from the quantized spectrum and the region division are exploited jointly in Step 2 by an efficient graph-based optimal path search algorithm.
Rate-distortion optimization for MP3 (4) The directed graph is constructed based on the MP3 quantization and entropy coding constraints for long windows. A simpler version exists for short windows. Each layer corresponds to an HCB region, and each state in a layer stands for two neighboring coefficients to be encoded using the HCB selected for that region. Two special states, frame_begin and frame_end, take care of the start and the end of the frame, respectively. Each transition is assigned the cost obtained by minimizing the decomposed cost of the state it leads to over the corresponding quantized coefficients. The minimization then becomes the problem of searching for the path with the minimal accumulated cost through the graph. This graph-based search is a full dynamic programming procedure and always gives the optimal solution. Graph Search for MP3 Quantized Spectrum and Region Division
Rate-distortion optimization for MP3 (5) Fixed-slope graph-based iterative RD optimization Step 3: Given the quantized spectrum, region division and HCB selection, update the scale factors so that the RD cost achieves its minimum. Step 4: Given the quantized spectrum, region division and updated scale factors, update the HCB selection so that the codeword length of each region is minimized. Step 5: Repeat Steps 2, 3 and 4 for t = 0, 1, 2, … until the decrease in RD cost falls within the tolerance, then output the final quantized spectrum, scale factors, region division and HCB selection. Step 3 updates the scale factors. Note that MP3 has adaptive storage for scale factors: R(q) is usually determined by a few largest scale factors in the frame, and there is no closed-form formula to calculate the optimal ones. Therefore, a full search over all scale factor storage options within the standard needs to be applied. Step 4 updates the Huffman codebook selection: for each region, the Huffman codebook that gives the minimum codeword length is selected. Step 5 repeats …. until convergence occurs.
Rate-distortion optimization for MP3 (6) Simulation results: ANMR (implementation based on ISO MP3 reference codec) We implement our optimization algorithm on top of the ISO reference codec and the most advanced state-of-the-art MP3 codec, LAME, respectively. In each case, we use the original output as our initialization. For the ISO reference codec, we see that the joint optimization algorithm successfully improves the coding efficiency, with at least 0.8 dB distortion reduction for bitrates above 64 kbit/s. violin.wav spme50_1.wav
Rate-distortion optimization for MP3 (7) Simulation results: ANMR (implementation based on LAME 3.96.1 Best-quality mode) At least 0.6 dB distortion reduction for bitrates above 64 kbit/s for LAME. In the reference codec, the best parameters so far are not stored during each outer loop, which also leads to a great advantage for our joint optimization algorithm, especially at high bitrates. In both cases, the proposed joint optimization algorithm plays a less important role at low bitrates than at high bitrates, since there are fewer non-zero quantized coefficients that can be optimized using soft-decision quantization. violin.wav spme50_1.wav
Rate-distortion optimization for MP3 (8) Simulation results: ITU listening test (80kb/s) Furthermore, in listening tests, we notice that even though the proposed optimization algorithm yields a gain of roughly 0.7 for the ISO reference codec in both music and speech cases, it still lags behind the LAME encoder alone. We believe the main cause is the initial parameters, which we derive directly from the original reference codec output. Starting from this initialization, the iterative optimization process most likely ends at an inferior local optimum.
Rate-distortion optimization for MP3 (9) Remarks The iteration process may only achieve local optimality, thus a wisely chosen initial state is favored when one aims at the best possible RD performance. The fixed-slope graph-based iterative algorithm we proposed provides a feasible solution to the problems in TNLS. One can adaptively adjust the value of the Lagrangian multiplier λ to meet rate or distortion constraints in real audio compression applications. The iterative process, which can be viewed as a steepest descent minimization approach, may only achieve local optimality, since its discrete parameter space is most likely non-convex. Specifically, we incorporate all quantization and entropy coding variables in our optimization framework, directly target the minimization of the overall perceptual distortion, and guarantee convergence to a global or local optimum.
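One common way to meet a rate constraint with a fixed-slope encoder is to search over the slope itself. The sketch below uses bisection on the Lagrangian multiplier with a toy encoder (uniform reconstruction, a made-up `toy_bits` length function, soft decisions per coefficient). The slide only says the multiplier can be adjusted adaptively; bisection, the function names and the toy encoder are assumptions of this example, not the method used in the thesis.

```python
def toy_bits(q):
    # Toy codeword length (assumption): stands in for the real Huffman tables.
    return 1 + 2 * ((abs(q) + 1).bit_length() - 1)

def encode_at(coeffs, lam):
    """Toy fixed-slope encoder: returns (total bits, total squared error)."""
    rate, dist = 0, 0.0
    for x in coeffs:
        q0 = round(x)
        q = min(range(q0 - 2, q0 + 3),
                key=lambda v: (x - v) ** 2 + lam * toy_bits(v))
        rate += toy_bits(q)
        dist += (x - q) ** 2
    return rate, dist

def find_lambda(coeffs, target_bits, lo=0.0, hi=64.0, iters=30):
    """Bisection for a small slope whose rate does not exceed the target."""
    if encode_at(coeffs, hi)[0] > target_bits:
        raise ValueError("target rate not reachable within the slope range")
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if encode_at(coeffs, mid)[0] > target_bits:
            lo = mid  # too many bits: penalize rate more
        else:
            hi = mid  # fits: try a smaller penalty
    return hi
```

The search relies on the rate being non-increasing in the slope, which holds for this toy encoder up to tie-breaking; the returned value always satisfies the rate target because `hi` only ever moves to slopes that fit.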
Rate-distortion optimization for AAC Problem formulation Lagrangian RD cost minimization - scale factor sequence - Huffman codebook index sequence first-order inter-band dependency -> Dynamic programming (Viterbi algorithm) This suggests using the Viterbi algorithm to solve the optimization problem.
Rate-distortion optimization for AAC (2) Fixed-slope trellis-based RD optimization Step 1: Build up the trellis structure. For each state (i, j, k), i = 0, 1, …, N-1, j = 0, 1, …, Ns-1, k = 0, 1, …, Nh-1, in the trellis, find the best quantized coefficients to minimize its decomposed RD cost. Step 2: Find the optimal path through the trellis by the Viterbi algorithm. Step 3: Backtrack the optimal quantized coefficients, scale factors and Huffman codebook indices as the final output. Trellis: There are N stages in total, where N denotes the number of scale factor bands in one frame. Each stage is represented by (si; hi). There are Ns*Nh possible representations, or states, for each (si; hi), corresponding to the combinations of Ns possible values of si and Nh possible values of hi. Step 1: …. Given fixed scale factor j and Huffman codebook k. Based on the state cost obtained in Step 1 and the state transition cost ….
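Steps 2 and 3 amount to a standard Viterbi search. The sketch below is a generic version under stated assumptions: each (scale factor, codebook) pair is flattened into a single state index, `state_cost[i][s]` is taken to be the Step 1 result for band i in state s, and `transition_cost(prev, cur)` is a caller-supplied function standing in for the side-information cost of moving between neighboring bands. These names are hypothetical and the actual costs used in the thesis are not reproduced here.

```python
def viterbi(state_cost, transition_cost):
    """Return (minimum total cost, best state per stage) through the trellis."""
    n_stages = len(state_cost)
    n_states = len(state_cost[0])
    best = list(state_cost[0])  # best accumulated cost ending in each state
    back = []                   # back-pointers, one list per stage after the first
    for i in range(1, n_stages):
        new_best, pointers = [], []
        for cur in range(n_states):
            # Cheapest predecessor for this state (first-order dependency only).
            prev = min(range(n_states),
                       key=lambda p: best[p] + transition_cost(p, cur))
            new_best.append(best[prev] + transition_cost(prev, cur)
                            + state_cost[i][cur])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Backtrack from the cheapest final state.
    last = min(range(n_states), key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path
```

For example, with `state_cost = [[1, 5], [4, 1], [1, 6]]` and a transition cost of 1 for any state change, the search returns a total cost of 5 along the path `[0, 1, 0]`: switching states twice is cheaper than staying in state 0 throughout (cost 6). The work grows with the number of stages times the square of the number of states, rather than exponentially in the number of bands.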
Rate-distortion optimization for AAC (3) Trellis Structure for AAC Quantization and Entropy Coding
Rate-distortion optimization for AAC (4) Simulation results: ANMR Implementation based on ISO AAC reference codec Also compared with Aggarwal’s approach (Steps 2, 3 only) We see that our proposed algorithm successfully improves the coding efficiency of the ISO reference codec, with at least 2.2 dB distortion reduction for bitrates above 96 kbit/s, and more distortion reduction for bitrates below. The side information, including the scale factors and Huffman codebook indices, accounts for a larger portion of the total bitrate at low bitrates, which leads to a great advantage for Viterbi over the traditional TNLS scheme used in the ISO reference codec. Soft-decision quantization (Step 1), in contrast, plays a more important role at high bitrates. The distortion reduction from soft-decision quantization grows to roughly the same level as that from Viterbi when the bitrate reaches 192 kbit/s. violin.wav spme50_1.wav
Rate-distortion optimization for AAC (5) Simulation results: ITU listening test (64kb/s) Even though pure Viterbi already yields roughly 1.25 improvement against TNLS in both music and speech cases, joint soft-decision quantization and Viterbi can still push for another 0.25 gain.
Rate-distortion optimization for AAC (6) Remarks The fixed-slope trellis-based algorithm we proposed achieves the globally optimal RD performance within the quantization and entropy coding stage under the AAC standard constraints. Joint design of the pre-processing decisions with our proposed optimization can theoretically achieve the globally optimal performance in the entire standard-constrained parameter space, but with computational complexity exponential in the number of bands per frame. Please refer to the proof of “global optimality” in the thesis. Since AAC supports M/S stereo coding, TNS, Prediction, LTP and PNS on a band-to-band basis ….
Conclusions and Future Research The fixed-slope approach converts the encoding problem into a search problem through a constrained space, and thus permits the implementation of efficient sequential search algorithms. The spirit of soft-decision quantization completes our RD optimization frameworks and introduces significant performance enhancement. Substantial performance improvement against the state-of-the-art encoders is achieved with complete decoder compatibility in each case. …. It’s also important to emphasize that …. The additional computational complexity due to the proposed optimization is only incurred at the encoder.
Conclusions and Future Research (2) Real-time implementations Extension to scalable AAC Joint pre-processing and optimization for AAC Optimal lossy audio compression without syntax constraints Optimal settings for transform (e.g. block lengths), quantization (e.g. stepsizes) and prediction Joint design of quantization and entropy coding …. The proposed MP3 RD optimization, implemented in pure C code, runs 8 times slower than real-time on a 1.7 GHz CPU, and the proposed AAC rate-distortion optimization, also in pure C code, takes even more time. Joint RD optimization schemes for each transmission layer so that the optimal RD performance for the entire system can be achieved. To make the complexity affordable in real audio compression applications, an immediate challenge is then how to design joint perceptual modeling and signal analysis methods to sort all possible pre-processing decision candidates. The success of our RD optimization for lossy audio compression with complete MP3 and AAC decoder compatibility also gives rise to the research proposal of developing optimal lossy audio compression algorithms without any standard constraints, where many fundamental design problems are open, such as the items listed on this slide.
Questions?