1
High-Quality, Low-Delay Music Coding in the Opus Codec
Jean-Marc Valin, Gregory Maxwell, Koen Vos, Timothy B. Terriberry
2
What is Opus? New highly-flexible speech and audio codec
Completely free
Royalty-free licensing
Open-source implementation
IETF RFC 6716 (Sep. 2012)
3
Features Highly flexible
Bit-rates from 6 kb/s to 510 kb/s
Narrowband (8 kHz) to fullband (48 kHz)
Frame sizes from 2.5 ms to 60 ms
Speech and music support
Mono and stereo
Flexible rate control
Flexible complexity
All changeable dynamically
4
Opus Operating Modes
SILK-only: Narrowband, Mediumband or Wideband speech
Hybrid: Super-wideband or Fullband speech
CELT-only: Narrowband to Fullband music
[Block diagram: the encoder feeds the input to SILK (downsampled to 8-16 kHz) and to CELT (48 kHz) and multiplexes both into one bit-stream; the decoder demultiplexes, decodes each path, upsamples the SILK output and sums the two paths for the final output.]
5
CELT: "Constrained Energy Lapped Transform"
Transform coding with Modified Discrete Cosine Transform (MDCT)
Explicitly code energy of each band of the signal
Spectral envelope preserved no matter what
Code remaining details using algebraic VQ
Gain-shape quantization
Implicit psychoacoustics and bit allocation
Built into the format

The C-E-L-T in CELT stands for "constrained energy lapped transform". CELT is a conventional transform codec, that's the "LT" part. It uses the modified discrete cosine transform, just like MP3 and Vorbis, but it uses much smaller block sizes. The downside is that you have much poorer frequency resolution, which I'll explain more precisely in a moment. But the big idea that really makes CELT work is the "constrained energy" part. We chop the signal up into bands and then explicitly code how much energy is in each one. Then, whatever else the codec does, we constrain the output to have exactly that much energy in it. This is an extremely powerful technique for avoiding the kind of blatantly obvious artifacts that you can get out of MP3 or other codecs.

Then there's a couple of other pieces to CELT that didn't make it into the name. We use a special kind of vector quantization to code the details within each band, and a "pitch predictor", a technique we adapted from speech codecs. This works a lot like motion compensation in a video codec, and helps compensate for the poor frequency resolution of our small transform size.
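A minimal sketch of this gain-shape idea in plain C (not the libopus code): the band energy is coded as a gain, the band is normalized to a unit-norm "shape", and after the shape is quantized the result is renormalized and rescaled, so the decoded band energy matches the coded value no matter what the shape quantizer did. The crude rounding here stands in for the PVQ search described on later slides, and fine energy is ignored.

#include <math.h>
#include <stdio.h>

#define N 8   /* band size, hypothetical */

static void code_band(const float *x, float *out)
{
    /* 1. Measure the band energy and quantize the gain in 6 dB steps. */
    float e = 0;
    for (int i = 0; i < N; i++) e += x[i]*x[i];
    float gain  = sqrtf(e);
    float qgain = exp2f(roundf(log2f(gain + 1e-15f)));

    /* 2. Normalize the band to a unit-norm shape vector. */
    float shape[N];
    for (int i = 0; i < N; i++) shape[i] = x[i]/(gain + 1e-15f);

    /* 3. Crudely quantize the shape (stand-in for PVQ), then renormalize
     *    so the reconstruction reproduces the coded energy exactly. */
    float norm = 0;
    for (int i = 0; i < N; i++) {
        shape[i] = roundf(4*shape[i])/4;
        norm += shape[i]*shape[i];
    }
    norm = sqrtf(norm) + 1e-15f;

    /* 4. Reconstruct: decoded gain times the unit-norm shape. */
    for (int i = 0; i < N; i++) out[i] = qgain*shape[i]/norm;
}

int main(void)
{
    float x[N] = {1, -2, 3, 0.5f, -1, 0, 2, -0.5f};
    float y[N];
    code_band(x, y);
    for (int i = 0; i < N; i++) printf("%f\n", y[i]);
    return 0;
}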
6
CELT Window MDCT with low-overlap window
Fixed 2.5 ms overlap for all sizes
Overlap shape is like the Vorbis window
Pre-emphasis reduces spectral leakage
7
Critical Bands
Group MDCT coefficients into bands approximating the critical bands (Bark scale)
Band layout the same for all frame sizes
Need at least 1 coefficient for 120-sample frames
Corresponds to 8 coefficients for 960-sample frames

So CELT is going to take the MDCT coefficients from our transform, and partition them up into bands that match the critical bands, and code each separately. They fall on what's known as the Bark scale, which is roughly a log scale, so the low frequency bands are much smaller than the high frequency bands. We set a minimum band size of 3 coefficients, to limit the amount of per-band overhead we have, even though that means we don't have quite enough frequency resolution for all the bands below about 1.5 kHz. But that's okay, because that's where the human ear is the most sensitive, which means we're going to spend a lot of bits getting those bands to sound right anyway, and will have fewer gross artifacts to conceal.

On the other end, the high frequency bands can have 50 to 100 coefficients in them, depending on the frame size, and match the critical bands quite well. For stereo, we combine the bands from both channels, so there can be up to 200 coefficients in the last band. This is a huge percentage of the total audio, and because of the band structure we can get away with using a ridiculously small number of bits for it.
8
Coding Band Energy Energy computed for each band
Coarse-fine strategy
Coarse energy quantization: scalar quantization with 6 dB resolution; predicted from previous frame and from previous band; entropy-coded
Fine energy quantization: variable resolution (based on bit allocation); not entropy coded

Now, the single most important lesson that we learned from Vorbis is that you must preserve the energy in each band at all costs. People spent a lot of time optimizing solely for mean squared error, but a lot of artifacts which have a reasonable mean squared error sound really awful, and it turns out you do much better by trying to preserve the band energy, even if it hurts mean squared error. Vorbis does this implicitly with its floor curve, which is a rough approximation of the energy across the spectrum. But nothing in the Vorbis bitstream guarantees that the energy of the final reconstruction matches that of the floor curve. The Vorbis encoder has to take great pains (and spend extra bits) to ensure that no energy is lost during quantization. In CELT, on the other hand, we encode the energy of each band explicitly. Then, we mathematically constrain the rest of the codec to ensure that the decoded signal matches the coded energy in each band exactly. This is the big idea behind CELT, and one of the primary reasons it does so well.

To code the energy, we split it up into two parts. The coarse energy is stored at 6 dB resolution, regardless of the bitrate. We predict it from the same band in previous frames, and the previous band in the same frame. That prediction makes us less robust to packet loss, but it saves about 28 bits per frame, which translates into more than 5 kb/s with a 5 ms frame size. That's pretty significant compared to, say, a total budget of 64 kb/s.

Now, because we mathematically constrain the signal to match the coded energy, we can't compensate for quantization errors in the coarse energy by spending more bits on something else later. So we include a "fine energy" term that gives us better resolution where we need it. The fine energy doesn't benefit much from prediction, so it's just stored directly.
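A simplified sketch of the coarse energy step, assuming placeholder prediction coefficients alpha and beta (the tuned values and the exact recursion in libopus differ). Energies are in the base-2 log domain, so one quantizer step is 6 dB, and the prediction only uses values the decoder also has, so encoder and decoder stay in sync; entropy coding of the residuals is omitted.

#include <math.h>

#define NBANDS 21   /* number of coded bands */

/* log_e[]: current frame's band energies (encoder side, log2 domain)
 * prev[] : decoded coarse energies of the previous frame
 * q[]    : transmitted residuals; dec[]: reconstruction shared by both sides */
void coarse_energy(const float *log_e, const float *prev, int *q, float *dec)
{
    const float alpha = 0.8f;   /* inter-frame prediction weight (assumed) */
    const float beta  = 0.7f;   /* previous-band prediction weight (assumed) */
    float band_pred = 0;        /* running contribution from earlier bands */
    for (int b = 0; b < NBANDS; b++) {
        float pred = alpha*prev[b] + band_pred;
        float err  = log_e[b] - pred;
        q[b]   = (int)floorf(err + 0.5f);   /* scalar quantizer, 6 dB steps */
        dec[b] = pred + q[b];               /* what the decoder reconstructs */
        band_pred += beta*q[b];             /* feeds the next band's prediction */
    }
}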
9
Coding Band Shape Quantizing N-dimensional vectors of unit norm
N-1 degrees of freedom (hyper-sphere)
Describes "shape" of spectrum within the band
CELT uses algebraic vector quantization
Pyramid Vector Quantization (Fischer, 1986)
Combinations of K signed pulses
Set of vectors y such that ||y||L1 = K
Projected on unit sphere: x = y / ||y||L2

The first thing we do is divide each band by its energy, giving us an N-dimensional unit vector. You can also think about it as a point on an N-dimensional sphere. Then we're going to code these unit vectors using two pieces: an adaptive codebook that uses previously decoded data to construct a set of vectors that predict the current frame, and a fixed codebook that handles errors in the prediction. Both of these pieces use vector quantization, so first I'll explain what that is.
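For concreteness, the codebook defined above can be counted directly: the number of integer vectors of dimension N whose absolute values sum to K satisfies a simple recurrence (Fischer, 1986). The sketch below only counts entries; libopus computes equivalent quantities more efficiently inside its enumeration routines.

#include <stdint.h>
#include <stdio.h>

/* V(N, K): number of integer vectors of dimension N with L1 norm exactly K */
static uint64_t pvq_count(int n, int k)
{
    if (n == 0) return k == 0;
    if (k == 0) return 1;
    return pvq_count(n - 1, k) + pvq_count(n - 1, k - 1) + pvq_count(n, k - 1);
}

int main(void)
{
    /* e.g. 3 dimensions at a few pulse counts, as pictured on the next slide */
    for (int k = 1; k <= 8; k++)
        printf("V(3,%d) = %llu\n", k, (unsigned long long)pvq_count(3, k));
    return 0;
}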
10
Coding Band Shape N=3 at Various Rates
Here's an example of four codebooks in three dimensions at different rates. But to see what the predictor is really doing, watch how it moves the points of the fixed codebook on the surface of the sphere. All of the points in front of the plane orthogonal to the pitch predictor get squeezed into a smaller area, and all of the points behind it get stretched out. The number of points around the circle after prediction is approximately the same as the number of points around the equator in the fixed codebook, but the diameter of the circle is much smaller. That means we have more points squeezed into a smaller area, and hence the quantization error is smaller. We tried squeezing the points even closer to the circle around the pitch predictor, since we know in the optimal case our target has to lie on it. But this required a lot more CPU because the math got more complicated, and didn't actually sound any better.
11
Coding Band Shape Pyramid Vector Quantization
PVQ codebook has a fast enumeration algorithm
Converts between vector and integer codebook index
Encoded with flat probability model
Range coded, but cost is known in advance
Codebooks larger than 32 bits: split the vector in half and code each half separately

I don't have time to go into the details, but it turns out there's a fast way to convert back and forth between the vector representation and an integer codebook index. The integer is what gets written to the bitstream, and then the decoder converts it back to a vector on the other side. You can do this in order N+K if you're allowed to use a lookup table or multiplications. There's also a simpler algorithm which is order NK, but only requires addition. For embedded processors that don't have much RAM and have slow 32-bit multipliers, that can be a lot faster.

The codebook is also easy to search. Given an arbitrary unit vector to quantize, we can normalize it so the sum of the absolute value of its coefficients is K, and then round each component down to the nearest integer. That accounts for at least K-N pulses, in the worst case. Then we go back and add pulses one at a time, using a greedy algorithm, until we've placed all K of them. The whole thing is independent of the number of pulses, which is helpful in the low frequency bands, where the dimension is small, but we use lots of pulses.

These fast algorithms let us use enormous codebooks, and in fact they can sometimes have more than 4 billion entries at high rates. Originally we used 64-bit arithmetic to handle these cases, but that turned out to be very expensive, so now we just split the vector in half and code each half separately. At 128 kb/s, the extra overhead this adds is less than 7 bits per packet on average, or under 1%.
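A sketch of the greedy search described in the notes above, in plain float C (the libopus search is fixed-point-friendly and handles more edge cases). Most pulses are pre-placed by scaling the absolute values to sum to K and truncating, which never overshoots; the remaining pulses are added one at a time where they maximize the squared correlation with the target divided by the energy of the pulse vector.

#include <math.h>

void pvq_search(const float *x, int *y, int n, int k)
{
    float ax[32];               /* assumes n <= 32 for this sketch */
    float sum = 0, rxy = 0, ryy = 0;
    int placed = 0;
    for (int i = 0; i < n; i++) { ax[i] = fabsf(x[i]); sum += ax[i]; }

    /* Initial guess: scale so sum(|x|) == K, then truncate each component. */
    float g = sum > 0 ? (float)k/sum : 0;
    for (int i = 0; i < n; i++) {
        y[i] = (int)(g*ax[i]);
        placed += y[i];
        rxy += ax[i]*y[i];
        ryy += (float)y[i]*y[i];
    }

    /* Add the remaining pulses greedily. */
    while (placed < k) {
        int best = 0;
        float best_num = -1, best_den = 1;
        for (int i = 0; i < n; i++) {
            float num = (rxy + ax[i])*(rxy + ax[i]);  /* new correlation^2 */
            float den = ryy + 2.0f*y[i] + 1.0f;       /* new pulse energy   */
            if (num*best_den > best_num*den) {
                best_num = num; best_den = den; best = i;
            }
        }
        rxy += ax[best];
        ryy += 2.0f*y[best] + 1.0f;
        y[best]++;
        placed++;
    }
    /* Re-apply the signs of the input. */
    for (int i = 0; i < n; i++)
        if (x[i] < 0) y[i] = -y[i];
}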
12
Implicit Psychoacoustics: Bit Allocation
Synchronized allocator in encoder and decoder
Allocates fine energy and PVQ bits for each band
Based on shared information (no signaling)
Implicit psychoacoustic model
Intra-band masking: near-constant per-band SMR
Does not model inter-band masking or tone vs. noise
Allocation tuning (signaled)
Tilt: balances between LF vs. HF bits
Boost: gives more bits to individual bands
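A toy illustration of the "no signaling" idea: encoder and decoder run the same deterministic function over data they already share (the total budget and the band widths) plus the two signalled tuning values, so the per-band split itself never has to be transmitted. The proportions below are purely illustrative, not the tuned tables and conservation rules used by the real CELT allocator.

/* width[]: number of MDCT coefficients per band (known to both sides)
 * tilt, boost[]: the signalled tuning described above
 * bits[]: resulting per-band budget for fine energy and PVQ pulses */
void allocate_bits(const int *width, int nbands, int total_bits,
                   int tilt, const int *boost, int *bits)
{
    int sum = 0;
    for (int b = 0; b < nbands; b++) sum += width[b];
    for (int b = 0; b < nbands; b++) {
        /* width-proportional share, skewed toward LF (<0) or HF (>0) by tilt */
        int share = total_bits*width[b]/sum + tilt*(2*b - (nbands - 1));
        share += boost[b];               /* per-band boost, also signalled */
        bits[b] = share > 0 ? share : 0;
    }
    /* (a real allocator would also enforce that the shares sum to total_bits) */
}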
13
CELT Stereo Coupling Code separate energy for each channel
Prevents cross-talk
Converts to mid-side after normalization
Mid and side coded separately with their relative energy conserved
Prevents stereo unmasking
Intensity stereo: discards side past a certain frequency
14
Normalized Mid-Side Stereo
Input audio [figure: left and right input vectors]
15
Normalized Mid-Side Stereo
Channel normalization [figure: left and right vectors after normalization]
16
Normalized Mid-Side Stereo
Mid-side vectors [figure: left, right, mid and side vectors]
17
Normalized Mid-Side Stereo
Mid-side energy ratio: θ = atan(|side| / |mid|) [figure: mid and side vectors]
18
Normalized Mid-Side Stereo
Normalized mid and side, coded separately [figure: mid and side vectors]
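Putting slides 14 through 18 together, here is a sketch of the coupling for one band in plain C. It is not the libopus code, which works on already-normalized bands and quantizes θ with a band-dependent resolution, but it follows the steps shown above.

#include <math.h>

void stereo_couple(float *left, float *right, int n, float *theta)
{
    /* 1. Normalize each channel's band to unit norm (a separate energy is
     *    coded per channel, which prevents cross-talk). */
    float el = 0, er = 0;
    for (int i = 0; i < n; i++) { el += left[i]*left[i]; er += right[i]*right[i]; }
    el = sqrtf(el) + 1e-15f;  er = sqrtf(er) + 1e-15f;
    for (int i = 0; i < n; i++) { left[i] /= el; right[i] /= er; }

    /* 2. Form mid and side vectors and measure their relative energy. */
    float em = 0, es = 0;
    for (int i = 0; i < n; i++) {
        float m = left[i] + right[i];
        float s = left[i] - right[i];
        left[i] = m;  right[i] = s;        /* reuse the buffers: mid, side */
        em += m*m;    es += s*s;
    }
    *theta = atan2f(sqrtf(es), sqrtf(em)); /* coded angle, as on slide 17 */

    /* 3. Normalize mid and side; each is coded with PVQ, and the decoder
     *    uses theta to restore their relative energy. */
    em = sqrtf(em) + 1e-15f;  es = sqrtf(es) + 1e-15f;
    for (int i = 0; i < n; i++) { left[i] /= em; right[i] /= es; }
}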
19
Avoiding Birdie Artifacts
Small K → sparse spectrum after quantization
Produces tonal “tweets” in the HF
CELT: Use pre-rotation and post-rotation to spread the spectrum
Completely automatic (no per-band signaling)
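A rough sketch of the spreading idea, assuming a fixed rotation angle (libopus derives the angle from the band size and pulse count, and applies the cascade in stages for long bands): a chain of small 2-D rotations between neighbouring coefficients smears an overly sparse quantized spectrum, and the encoder applies the matching inverse cascade before the PVQ search.

#include <math.h>

void spread_rotation(float *x, int n, float theta, int inverse)
{
    float c = cosf(theta), s = sinf(theta);
    if (inverse) s = -s;    /* same pass structure with -theta undoes the cascade */
    /* forward pass over adjacent pairs ... */
    for (int i = 0; i < n - 1; i++) {
        float x0 = x[i], x1 = x[i+1];
        x[i]   = c*x0 - s*x1;
        x[i+1] = s*x0 + c*x1;
    }
    /* ... and a backward pass, so energy spreads in both directions */
    for (int i = n - 2; i >= 0; i--) {
        float x0 = x[i], x1 = x[i+1];
        x[i]   = c*x0 - s*x1;
        x[i+1] = s*x0 + c*x1;
    }
}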
20
Spectral Folding When rate in a band is too low, code nothing
Spectral folding: copy previous coefficients
Preserves band energy
Gives correct temporal envelope
Better than coding an extremely sparse spectrum
Partial signaling
Hard threshold at 3/16 bit per coefficient
Encoder can choose to skip additional bands
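A minimal sketch of folding one skipped band, assuming band_start >= band_size so there are enough lower-frequency coefficients to copy (libopus picks the source offset more carefully). The copied coefficients are renormalized to unit norm and the band's coded energy is applied afterwards as usual, which is why both the energy and a plausible temporal envelope are preserved.

#include <math.h>

void fold_band(float *spec, int band_start, int band_size)
{
    float e = 0;
    for (int i = 0; i < band_size; i++) {
        spec[band_start + i] = spec[band_start - band_size + i]; /* copy down */
        e += spec[band_start + i]*spec[band_start + i];
    }
    /* renormalize to unit norm; the decoded band energy is applied later */
    float g = 1.0f/(sqrtf(e) + 1e-15f);
    for (int i = 0; i < band_size; i++) spec[band_start + i] *= g;
}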
21
Transients (avoiding pre-echo)
Quantization error spreads over whole window
Can hear noise before an attack: pre-echo
Split a frame into smaller MDCT windows
Up to 8 “short blocks”
Interleave results and code as normal
Still code one energy value per band for all MDCTs
Simultaneous tones and transients
Use adaptive time-frequency resolution
Per-band Walsh-Hadamard transform
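The per-band time-frequency adjustment boils down to Haar/Walsh-Hadamard butterflies across the interleaved short-block coefficients; the sketch below is modelled on the haar1() helper in libopus. Applying it merges pairs of time slots into sum/difference terms, trading time resolution for frequency resolution within that band; the per-band choice is signalled and the decoder applies the matching inverse (the orthonormal butterfly is its own inverse).

void haar1(float *x, int n0, int stride)
{
    /* x holds n0*stride interleaved short-MDCT coefficients; 'stride' is the
     * number of short blocks (a power of two). */
    n0 >>= 1;
    for (int i = 0; i < stride; i++) {
        for (int j = 0; j < n0; j++) {
            float a = x[stride*2*j + i];
            float b = x[stride*(2*j + 1) + i];
            x[stride*2*j + i]       = (a + b)*0.70710678f;  /* sum term  */
            x[stride*(2*j + 1) + i] = (a - b)*0.70710678f;  /* diff term */
        }
    }
}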
22
Transients Time-Frequency Resolution
[Figure: time-frequency tilings comparing "Standard Short Blocks" with "Per-band TF Resolution" (axes: time, frequency), contrasting good frequency resolution with good time resolution]
23
Configuration Switching
Mode/bandwidth/frame-size/channel changes
Avoiding glitches when we switch
All modes can change frame sizes without issue
CELT can change audio bandwidth or mono/stereo
SILK can change mono/stereo with encoder help
How about everything else? 5 ms “redundant” CELT frames smooth the transition
Bitrate sweep example: 8 to 64 kb/s
24
Opus Music Quality 64 kb/s stereo music ABC/HR listening test by Hydrogen Audio
25
Cascading Tests: 5 cascaded encodings, bitrate = 128 kbit/s
26
Future Work Upcoming libopus 1.1 release
Automatic speech/music detection
Better VBR
Better surround quality
Optimizations
Specs: RTP payload format, file format (Ogg, Matroska)
27
Questions? Resources
Website: http://opus-codec.org
Mailing list:
IRC: #opus on irc.freenode.net
Git repository: git://git.opus-codec.org/opus.git
28
Anti-Collapse Pre-echo avoidance can cause collapse
Solution: fill holes with noise
[Comparison: no anti-collapse vs. with anti-collapse]
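A sketch of the hole-filling, assuming the short-block coefficients of a band are interleaved with stride nb_blocks (as after the transient handling above). Where a whole short block in the band decoded to zero, low-level pseudo-random noise is injected and the band is renormalized so the coded energy still holds. In libopus the injection gain is derived from the energy of recent frames rather than passed in as a parameter.

#include <math.h>
#include <stdlib.h>

void anti_collapse(float *band, int n, int nb_blocks, float fill_gain)
{
    int n_per_block = n/nb_blocks;
    int injected = 0;
    for (int b = 0; b < nb_blocks; b++) {
        /* coefficients of short block b are interleaved with stride nb_blocks */
        int empty = 1;
        for (int j = 0; j < n_per_block; j++)
            if (band[j*nb_blocks + b] != 0) { empty = 0; break; }
        if (empty) {   /* fill the collapsed block with +/- fill_gain noise */
            for (int j = 0; j < n_per_block; j++)
                band[j*nb_blocks + b] = (rand() & 1) ? fill_gain : -fill_gain;
            injected = 1;
        }
    }
    if (injected) {    /* restore unit norm so the coded energy still holds */
        float e = 0;
        for (int i = 0; i < n; i++) e += band[i]*band[i];
        float g = 1.0f/(sqrtf(e) + 1e-15f);
        for (int i = 0; i < n; i++) band[i] *= g;
    }
}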
29
Psychoacoustics Pitch Prefilter/Postfilter
Shapes quantization noise (like SILK’s LPC filter), but for harmonic signals (like SILK’s LTP filter)
[Figure: prefilter and postfilter]
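A sketch of the comb-filter pair behind the prefilter/postfilter, assuming x[] carries at least T samples of history before index 0 and T is the pitch period. The real filters in libopus use a short multi-tap kernel around the pitch lag, taper the gain at frame boundaries and smooth T between frames; only the basic idea is shown.

void comb_filter(float *y, const float *x, int n, int T, float g, int post)
{
    /* prefilter (encoder): attenuate the harmonic structure before the MDCT
     * postfilter (decoder): re-add it, pushing noise in between the harmonics */
    for (int i = 0; i < n; i++)
        y[i] = post ? x[i] + g*x[i - T]
                    : x[i] - g*x[i - T];
}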