Performance Analysis of a Decoder-Based Time-Scaling Algorithm for Variable Jitter Buffering of Speech over Packet Networks

Philippe Gournay, Kyle D. Anderson
VoiceAge Corporation
750 Chemin Lucerne, Montréal (Québec), Canada H3R 2H6
Philippe.Gournay@USherbrooke.ca

Notes: This title is somewhat lengthy… In fact, we will take things in reverse order and talk about: speech over packet networks, variable jitter buffering, decoder-based time scaling, and performance analysis.
Voice over Packet Networks

- Voice communication over packet networks (VoIP) is characterized by a variable transmission time (jitter)
- VoIP receivers generally use a jitter buffer to control the effects of the jitter
- The jitter buffer works by introducing an additional playout delay
- The playout delay is chosen to minimize the number of late packets while keeping the total end-to-end delay within acceptable limits
- Packets that arrive before their playout time are temporarily stored in a reception buffer

Notes: Speech communication over packet networks is characterized by variations in the transit time of the packets through the network. The jitter is the difference between the actual arrival times of the packets and a reference clock running at the packet rate. VoIP receivers generally use a jitter buffer to control the effects of this jitter. The aim of the jitter buffer is to transform the irregular flow of arriving packets into a regular flow, so that the speech decoder can provide a sustained flow of speech to the listener. It works by introducing an additional delay, called the playout delay (defined with respect to a reference clock started, for example, at the reception of the first packet). The playout delay should be chosen to minimize the number of packets that arrive too late to be decoded, while keeping the total end-to-end delay within acceptable limits. Packets that arrive before their playout time are temporarily stored in a reception buffer. Several strategies can be used to achieve the best compromise between delay and the number of rejected packets: the playout delay can be fixed throughout the conversation, or varied according to the network conditions. The following slides illustrate the different possible strategies using simple examples.
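As a minimal sketch of the late-packet test described above (one 20-ms frame per packet, millisecond timestamps; the helper and its numbers are illustrative, not from the paper):

```python
# Minimal jitter-buffer sketch: packet n is nominally due at n*FRAME_MS on
# the sender's clock; it can be decoded only if it arrives at the receiver
# before its playout time (nominal time + playout delay).

FRAME_MS = 20  # one frame per packet

def classify_packets(arrival_times_ms, playout_delay_ms, frame_ms=FRAME_MS):
    """Label each packet 'on time' or 'late' for a fixed playout delay."""
    labels = []
    for n, arrival in enumerate(arrival_times_ms):
        playout_time = n * frame_ms + playout_delay_ms
        labels.append("on time" if arrival <= playout_time else "late")
    return labels

# A 60-ms playout delay absorbs the first two arrivals but not the last two.
arrivals = [45, 70, 115, 160]
print(classify_packets(arrivals, 60))  # ['on time', 'on time', 'late', 'late']
```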
Voice over Packet Networks (2)

- Fixed transmission time (no jitter): a fixed playout delay is enough to produce a sustained flow of speech to the listener

[Figure: sender and receiver timelines for packets 1, 2, …, n, n+1, showing the transmission delay and the playout delay]

Notes: This scenario corresponds to the ideal network condition where there is no jitter. The packets are sent regularly, at the frame rate. They are received at the same rate, but after a certain transmission delay. As shown in the figure, a fixed playout delay is enough to produce a sustained flow of speech to the listener. Since there is no jitter, there will not be any late packets, even with a playout delay as low as zero. Note that in this example (and the following one) the packets are assumed to contain only one frame. When the frame size (CDMA eighth rate) or the frame duration (10-ms frames in the ITU-T G.729 standard) is small, however, sending several consecutive frames in the same packet reduces the packet header overhead.
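To make the header-overhead remark concrete, here is a back-of-the-envelope calculation assuming typical IPv4/UDP/RTP header sizes and the 10-byte payload of a G.729 frame (illustrative figures, not from the paper):

```python
# Packing several frames per packet amortizes the packet header: the
# fraction of each packet spent on headers drops as frames are bundled.

HEADER_BYTES = 40       # IPv4 (20) + UDP (8) + RTP (12)
G729_FRAME_BYTES = 10   # one 10-ms G.729 frame at 8 kbit/s

def header_overhead(frames_per_packet, frame_bytes=G729_FRAME_BYTES):
    payload = frames_per_packet * frame_bytes
    return HEADER_BYTES / (HEADER_BYTES + payload)

for n in (1, 2, 4):
    print(f"{n} frame(s)/packet -> {header_overhead(n):.0%} header overhead")
# 1 -> 80%, 2 -> 67%, 4 -> 50%
```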
Voice over Packet Networks (3)

- Variable transmission delay (some jitter), fixed playout delay: some packets (n, n+1) arrive too late to be decoded

[Figure: sender and receiver timelines showing the transmission delay, the playout delay, and the jitter values Δn-1, Δn, Δn+1]

Notes: This scenario corresponds to the case where the network produces some jitter. Let Δn be the jitter associated with frame number n, with the following sign convention: Δn < 0 means frame n arrives before its playout time; Δn > 0 means frame n is late. In the figure: Frame n-1 arrives before its playout time; it is therefore decoded and played normally. Frame n arrives less than 20 ms after its playout time; it must therefore be concealed by the decoder, but since it arrives before the following frame is decoded, it can still be used to update the internal state of the decoder. Frame n+1 arrives more than 20 ms after its playout time; it is treated as lost and concealed by the decoder if the decoder cannot process packets delayed by more than one frame duration. All these frames would have been decoded normally if the playout delay had been long enough (in this case, at least Δn+1 longer). Ideally, the playout delay would be changed on a frame-by-frame basis so that the next frame always arrives before its playout time, ensuring that no packet arrives too late to be decoded. This "ideal" approach is not practicable, however, for the following reasons: it is not always possible to perfectly predict the arrival time of the next packet, and it is not always possible to stretch a speech frame exactly enough to achieve the required playout delay.
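A hypothetical decision rule matching the three cases above (20-ms frames; thresholds and names are illustrative):

```python
FRAME_MS = 20

def late_frame_action(delta_ms, frame_ms=FRAME_MS):
    """Map the jitter of frame n (delta < 0: early, delta > 0: late) to the
    receiver's action, following the three cases on this slide."""
    if delta_ms <= 0:
        return "decode and play normally"
    if delta_ms < frame_ms:
        # Too late to be played, but it arrives before the next frame is
        # decoded: conceal it, then use it to update the decoder state.
        return "conceal, then update the decoder state"
    return "conceal (treated as lost)"

for delta in (-5, 12, 37):
    print(f"delta = {delta:+d} ms -> {late_frame_action(delta)}")
```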
Jitter Buffering Strategies

- Fixed jitter buffer: playout delay chosen at the beginning of the conversation
- Variable jitter buffer:
  - Talk-spurt based: playout delay changed at the beginning of each silence period
  - For quickly varying networks: better results are obtained when the playout delay is also adapted during active speech

Notes: In the classical "talk-spurt based" approach, the playout delay is adjusted at the beginning of each talk-spurt (or equivalently, at the end of each silence period), based on past jitter statistics. Since the playout delay is only changed during silences or background noise, this is done very easily by removing (to decrease the playout delay) or inserting (to increase the playout delay) a certain number of samples. The talk-spurt based approach is appropriate when the network characteristics vary slowly in time. For more quickly varying networks (exhibiting, for example, a large jitter or a strong clock difference between the sender and the receiver), even better results are obtained when the playout delay is also adapted during active speech. This strategy requires a means of stretching speech frames: playing out longer frames increases the playout delay, while playing out shorter frames decreases it.
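A minimal sketch of the talk-spurt based adjustment (16-kHz wideband sampling assumed, as in VMR-WB; the helper is illustrative):

```python
# During a silence period, the playout delay can be changed by simply
# inserting samples (longer silence, larger delay) or removing samples
# (shorter silence, smaller delay).

SAMPLE_RATE = 16000  # wideband sampling rate

def adjust_silence(silence, delay_change_ms):
    n = int(SAMPLE_RATE * abs(delay_change_ms) / 1000)
    if delay_change_ms >= 0:
        return silence + [0.0] * n       # insert: playout delay increases
    return silence[:len(silence) - n]    # remove: playout delay decreases

silence = [0.0] * 320                     # a 20-ms silence frame
print(len(adjust_silence(silence, +10)))  # 480 samples
print(len(adjust_silence(silence, -10)))  # 160 samples
```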
Playout Delay Adaptation

1. Using past jitter values, estimate the "ideal" playout time P̂i+1 of frame number i+1.
2. Send frame number i to the decoder, requesting it to generate an output frame of length T̂i = P̂i+1 - Pi.
3. The actual playout time of packet i+1 is Pi+1 = Pi + Ti, where Ti is the actual length of frame i.
4. Iterate from step 1.

Notes: Changing the playout delay during active speech requires a means of time-scaling decoded speech frames: playing out a longer frame increases the playout delay, while playing out a shorter frame decreases it. The playout delay for packet i is the difference between Pi and the reference clock running at the nominal frame rate.
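A sketch of this loop, with a decoder stub standing in for VMR-WB (the stub clamps requests to 10–40 ms in 5-ms steps purely for illustration; the playout-time predictor is a free parameter):

```python
FRAME_MS = 20  # standard frame duration

class DecoderStub:
    """Stands in for the time-scaling decoder: it may not honor the
    requested length exactly (here: clamp to 10..40 ms, 5-ms steps)."""
    def decode(self, requested_ms):
        return min(40, max(10, 5 * round(requested_ms / 5)))

def adaptation_loop(decoder, n_frames, predict_playout_ms):
    p_i = 0.0  # actual playout time of frame 0 (ms on the reference clock)
    for i in range(n_frames):
        p_hat = predict_playout_ms(i + 1)   # step 1: estimate P^(i+1)
        t_hat = p_hat - p_i                 # step 2: requested length T^i
        t_i = decoder.decode(t_hat)         # actual length Ti
        p_i += t_i                          # step 3: P(i+1) = Pi + Ti
    return p_i

# Example: ask for 5 extra ms per frame; after 4 frames the playout time is
# 100 ms instead of the nominal 80 ms, i.e. the playout delay grew by 20 ms.
print(adaptation_loop(DecoderStub(), 4, lambda i: i * (FRAME_MS + 5)))  # 100.0
```

In the real system the achievable frame length is further constrained by the voicing class of each frame, as described on the following slides.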
Time Scaling Inside the Decoder

… presents several advantages over "standalone" time scaling (SOLA, TDHS, …):
- Uses the decoder's internal parameters: pitch period (and, in VMR-WB, the voicing classification)
- Regulates the processor load (especially for shorter frames): the number of operations per second tends to increase as the frame length decreases, and some complexity is saved during the synthesis operation
- Improves quality: smoothing performed by the synthesis filters

Notes: Playout delay adaptation is generally done outside the decoder (on the PCM signal) using a technique such as PSOLA [3] (Pitch Synchronous Overlap-Add) or TDHS [4] (Time Domain Harmonic Scaling). It is clear, however, that doing it inside the decoder, in the excitation domain, has many advantages in terms of complexity. Working inside the decoder makes it possible to use the internal parameters of the codec: the pitch values and gains, and the voicing classification in the case of the VMR-WB codec [5], are particularly useful parameters for time scaling. It also regulates the processor load (i.e., the number of arithmetic and logical operations needed to decode one frame, divided by the duration of that frame). This is particularly true when the frame is shortened, since in that case the processor load tends to increase as the frame length decreases even if the number of operations remains the same. Working in the excitation domain, for example, saves some complexity because the synthesis operation (which, in the case of VMR-WB, includes LP synthesis, post-filtering, up-sampling and high-frequency generation) is performed on the shortened frame instead of the normal frame. It is also potentially advantageous in terms of quality, as the smoothing performed by the synthesis filters of the decoder is expected to be beneficial.
General Principle

- Time scaling is performed in the excitation domain
- The adaptive codebook is updated before time scaling to keep the encoder and decoder synchronized
- Frames are modified depending on their voicing classification (in VMR-WB, the voicing classification is part of the bitstream)
- Not all frames are modified
- Voiced frames can only be modified by an integer multiple of the pitch period
- Concealed frames can also be time-scaled
Inactive Frames

- The pseudo-random number generator used to build the excitation signal of CNG frames is simply run for the requested number of samples
- Output frame duration limited to between 0 and 40 ms (twice the standard duration)
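A minimal sketch of the CNG case (the uniform generator below is a stand-in for the codec's pseudo-random excitation source; 16-kHz sampling assumed):

```python
import random

SAMPLE_RATE = 16000
MAX_MS = 40  # twice the standard 20-ms duration

def cng_excitation(requested_ms, gain=0.01, rng=random.Random(0)):
    """Run the (stand-in) random generator for the requested duration,
    clamped to the 0..40 ms range allowed for inactive frames."""
    n = int(SAMPLE_RATE * min(max(requested_ms, 0), MAX_MS) / 1000)
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(n)]

print(len(cng_excitation(30)))  # 480 samples for a 30-ms frame
```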
Unvoiced Frames

- Plosive frames and frames that are too voiced are not modified
- To lengthen frames: insert zeros between the original excitation samples; adjust the gain to preserve the average energy per sample
- To shorten frames: remove samples from the excitation signal
- Frame duration limited to between 10 and 40 ms

Notes: Plosive frames and frames that seem "too voiced" to be handled as unvoiced are not modified. Plosive frames are those for which one subframe is more than 9 dB higher than the preceding subframe. "Too voiced" frames are those for which the average pitch gain is above 0.45. The remaining unvoiced frames are processed as follows. To generate longer frames, the necessary number of zeros is inserted between the original excitation samples, distributed as uniformly as possible throughout the frame; a weighting factor equal to the square root of the ratio between the requested and standard frame lengths is then applied to the modified excitation to preserve the average energy per sample (see the sketch below). To generate shorter frames, the necessary number of samples is removed from the excitation signal, from either the beginning or the end of the frame depending on the previous frame: if the previous frame was unvoiced, the samples are removed from the beginning; otherwise they are removed from the end. This protects against removing a possible weak voiced component from an unvoiced frame, such as the transition from an unvoiced sound to a voiced sound (voiced onset) or vice versa (voiced offset).
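A sketch of the lengthening rule for unvoiced frames, assuming nothing beyond what the notes describe (uniform zero insertion, then a sqrt(new/old) gain to preserve the average energy per sample):

```python
import math

def lengthen_unvoiced(excitation, new_len):
    """Insert zeros as uniformly as possible, then rescale the amplitude so
    the average energy per sample is unchanged. Illustrative, not VMR-WB."""
    old_len = len(excitation)
    assert new_len >= old_len
    out, acc = [], 0.0
    step = new_len / old_len      # > 1: each input sample "occupies" step slots
    for x in excitation:
        out.append(x)
        acc += step - 1.0
        while acc >= 1.0:         # insert a zero when a full slot accumulates
            out.append(0.0)
            acc -= 1.0
    out += [0.0] * (new_len - len(out))   # rounding remainder, if any
    gain = math.sqrt(new_len / old_len)   # preserve average energy per sample
    return [gain * x for x in out]

frame = [1.0, -1.0] * 4                   # toy 8-sample excitation
print(len(lengthen_unvoiced(frame, 12)))  # 12
```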
Voiced Frames

- Onset frames and frames that are not voiced enough are not modified
- To lengthen frames: use the long-term predictor to duplicate some pitch cycles
- To shorten frames: remove selected pitch cycles
- Frame duration limited to between 0 and 40 ms
To Lengthen Voiced Frames

- Voiced frames are lengthened by repeating selected pitch cycles

[Figure: past and current excitation signals, with the pitch period T0, the four subframes, and the minimum-energy point (i)]

Notes: This slide shows how voiced frames are lengthened. The subframe with the highest pitch gain is selected. Within that subframe, a sliding window of 20 samples is used to find the minimum-energy point (i). Some processing is performed on the four pitch values of the frame to correct any obvious pitch error (specifically, to avoid using a submultiple of the real pitch value). The difference between the desired and standard frame lengths, rounded up to the nearest integer multiple of the pitch period of the subframe, gives the number of samples that will be added to the excitation. A space is created in the excitation signal right after the minimum-energy point to receive the extra pitch cycles. The long-term prediction function of the decoder is then used to replicate the pitch cycle that immediately precedes the minimum-energy point, thus filling in the space.
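A simplified sketch of this procedure (single known pitch value, no pitch-error correction, and the search constrained so a full cycle always precedes the cut point; an illustrative stand-in for the decoder's long-term prediction):

```python
import math

def min_energy_point(excitation, pitch, win=20):
    """Sliding-window minimum-energy index; the search starts at `pitch`
    so that a full pitch cycle always precedes the chosen point."""
    candidates = range(pitch, len(excitation) - win + 1)
    return min(candidates,
               key=lambda i: sum(s * s for s in excitation[i:i + win]))

def lengthen_voiced(excitation, extra_samples, pitch):
    n_extra = pitch * math.ceil(extra_samples / pitch)  # whole pitch periods
    cut = min_energy_point(excitation, pitch)
    cycle = excitation[cut - pitch:cut]       # cycle preceding the cut point
    return excitation[:cut] + cycle * (n_extra // pitch) + excitation[cut:]

pulse_train = [1.0 if n % 40 == 0 else 0.0 for n in range(320)]  # pitch 40
print(len(lengthen_voiced(pulse_train, 100, pitch=40)))          # 440
```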
Experimental Results

- Experiments conducted on clean speech using mode 2 of the VMR-WB codec (average data rate of 4.96 kbit/s with 60% active speech)
- Subjective quality:
  - Excellent for "fast playback" (up to twice the normal speed)
  - Small degradation for "slow playback" (down to half the normal speed)
- Very efficient at adding or removing a few ms (e.g., 20 ms) to the playout delay from time to time: much better than losing a few frames
Distribution of Unmodified Frames

- Requested frame length (40 ms) is twice the standard frame length
- Total number of frames: 22803
- Number of unmodified frames: 4085 (18%)

Distribution of the unmodified frames:

  Voiced, onset:              814  (20% of unmodified)
  Voiced, not voiced enough:  935  (23%)
  Unvoiced, plosive:         1850  (45%)
  Unvoiced, too voiced:       486  (12%)

Notes: The percentage of frames that can be modified (regardless of the impact on quality) is around 82%. Note that this does not depend on the requested amount of scaling. The frames that cannot be modified are distributed as shown in the table.
Frame Length Distribution for Modified Voiced Frames

[Figure: histogram of modified voiced frame lengths; the desired frame length was 640 samples = 40 ms]

Notes: As mentioned before, while modified unvoiced frames always have the requested frame length (rounded to the nearest multiple of 5), voiced frames can only be modified by an integer multiple of the pitch period. This slide shows the length distribution of the modified voiced frames when the requested frame length is twice the standard length.
Time Required for a 50-ms Increase of the Playout Delay

[Figure: distribution of the adaptation time; experiment done for 8000 different active speech frames]

Notes: To evaluate the reactivity that the proposed time-scaling method would bring to an adaptive jitter buffer, we measured the time needed to increase the playout delay by 50 ms. The adaptation time is the sum of the durations of the frames (modified or not) that must be played out before the desired playout delay is achieved. The duration of the last modified frame is not counted, as the desired playout delay can be considered achieved when the first sample of that frame is played out. For example, if the request to increase the playout delay arrives just before decoding frame n, and if frames n, n+1 and n+2 are stretched to 40 ms, 40 ms and 30 ms respectively, then the adaptation time is 80 ms (the sum of the modified durations of frames n and n+1). The figure shows that the adaptation time is always between 70 ms and 300 ms, with very few cases above 200 ms (121 cases out of 8000). This is well below the average duration of a talk-spurt, which is rather on the order of one second. The shortest adaptation time is 70 ms (14 cases); it is achieved when the adaptation is done in 3 frames, the first two frames being stretched to a total of 70 ms and the third frame being 20 ms longer than usual. When the adaptation is performed during purely unvoiced segments (containing no onsets or plosives), the adaptation time is exactly 80 ms (corresponding to the example above). The adaptation time is never between 80 ms and 90 ms, because it cannot be more than 80 ms when done in 3 frames, nor less than 90 ms when done in 4 frames (three 30-ms frames followed by one 40-ms frame).
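A small helper reproducing this bookkeeping (illustrative; frame durations in ms):

```python
def adaptation_time(frame_durations_ms, target_increase_ms, standard_ms=20):
    """Sum the durations of the frames played out before the target delay
    increase is reached; the last modified frame is not counted, since the
    target is considered reached when its first sample is played out."""
    gained, played = 0, 0
    for dur in frame_durations_ms:
        gained += dur - standard_ms
        if gained >= target_increase_ms:
            return played
        played += dur
    raise ValueError("target not reached with the given frames")

# The example from the notes: frames stretched to 40, 40 and 30 ms.
print(adaptation_time([40, 40, 30], 50))  # 80
```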
Maximum Complexity and Corresponding Frame Length

  Requested frame length   Maximum complexity   Corresponding frame length
  Standard (20 ms)         6.68 WMOPS           20 ms
  Longer (40 ms)           6.69 WMOPS           20 ms
  Shorter (10 ms)          33.90 WMOPS          1.875 ms
  Shorter* (10 ms)         9.57 WMOPS           10 ms

  *: Modified voiced frames not allowed to be shorter than 10 ms

Notes: We instrumented the VMR-WB decoder to count the number of operations performed for each frame, then ran it on the same test file as before, asking it to scale all the frames by a certain amount. The complexity is given in WMOPS (weighted million operations per second), where the operations are weighted according to their respective complexity and the actual frame duration is used. When no time scaling is requested (standard 20-ms frames), the maximum decoding complexity is 6.68 WMOPS. When the decoder is requested to double the frame length, the highest complexity is only slightly more (6.69 WMOPS), but the corresponding frame length is in fact 20 ms, which indicates that the frame was not modified; lengthening speech frames therefore never increases the decoding complexity. When the decoder is asked to halve the frame length, the highest complexity is 33.90 WMOPS, for a frame length of 1.875 ms. When an additional constraint is set so that the length of modified voiced frames never goes below 10 ms, the highest complexity is 9.57 WMOPS. This makes the adaptive jitter buffer less reactive when decreasing the playout delay, but is much more reasonable from a complexity point of view. As a comparison, the complexity of a standard decoder followed by an external time compression with a factor of 50% would be at least twice that of the decoder (6.68 WMOPS x 20 ms / 10 ms = 13.36 WMOPS), plus that of the time-scaling operation itself. We noted that generating shorter frames is more complex during active speech: the whole excitation signal must be decoded before time scaling, so the resulting complexity (number of operations divided by the frame duration) goes up even though some processing is saved during synthesis. In contrast, time scaling does not really affect the complexity of CNG frames (the pseudo-random number generator is run only for the requested number of samples, and the synthesis is done only for the actual frame length). Therefore, if complexity is an issue, a good solution is (1) to increase the playout delay by lengthening speech frames as soon as it is necessary to avoid late frames, and (2) to decrease the playout delay by shortening speech frames only during silences. In that case, playout delay adaptation requires almost no additional complexity and does not unduly degrade quality.
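The load arithmetic behind these numbers, as a worked example (the helper is illustrative; 6.68 WMOPS is the slide's standard-frame figure):

```python
# Why shortening raises the processor load: the load is the weighted
# operation count divided by the duration of the frame actually played out.

def processor_load_wmops(weighted_mops_per_frame, frame_ms):
    """Weighted operations per frame divided by the frame duration."""
    return weighted_mops_per_frame / (frame_ms / 1000.0)

ops_per_frame = 6.68 * 0.020  # weighted Mops in one standard 20-ms frame
print(processor_load_wmops(ops_per_frame, 20.0))  # 6.68  (standard frame)
print(processor_load_wmops(ops_per_frame, 10.0))  # 13.36 (same ops, 10 ms)
```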
Optimizing the Complexity

- Lengthening does not increase complexity
- Shortening CNG frames does not increase complexity
- Shortening active speech frames increases complexity
- Good compromise:
  - Increase the playout delay (by lengthening frames) as soon as necessary
  - Decrease the playout delay (by shortening frames) during inactive periods only
  - Playout delay adaptation then requires almost no additional complexity
Audio Demonstration

- Male voice / female voice, original and VMR-WB mode 2
- Fast playback (200% of normal speed)
- Slow playback (66% of normal speed)
- Adaptation: -20 ms every 10 frames
- Adaptation: +20 ms every 10 frames
Summary

- Adaptive jitter buffering requires a means of time-scaling speech
- Time scaling can be done in the decoder's "excitation domain"
- This approach is very efficient in terms of both quality and reactivity
- It requires almost no additional complexity, provided some judicious limits are imposed on the amount of time scaling