RTP and playout delay compensation Henning Schulzrinne Dept. of Computer Science Columbia University Fall 2003
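RTP packet header
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+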
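As a rough illustration of the layout above, the following Python sketch unpacks the fixed 12-byte header and the CSRC list. The names `RtpHeader` and `parse_rtp_header` are made up for this example; header extensions and padding are detected but not processed.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    """Hypothetical container for the header fields (not from the slides)."""
    version: int
    padding: bool
    extension: bool
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrc: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the fixed RTP header plus CC contributing-source identifiers."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F                      # CSRC count (low 4 bits of byte 0)
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("packet truncated inside the CSRC list")
    csrc = struct.unpack(f"!{cc}I", packet[12:end])
    return RtpHeader(
        version=b0 >> 6,                # V: top 2 bits
        padding=bool(b0 & 0x20),        # P
        extension=bool(b0 & 0x10),      # X
        marker=bool(b1 & 0x80),         # M
        payload_type=b1 & 0x7F,         # PT: low 7 bits
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrc=csrc,
    )


# Usage: a hand-built packet with V=2, CC=1, M=1, PT=0.
example = struct.pack("!BBHII", 0x81, 0x80, 4660, 160, 0xDEADBEEF)
example += struct.pack("!I", 7)
header = parse_rtp_header(example)
```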
RTP: timestamp
Timestamp is measured in sample units
Reflects the nominal sampling time of the first sample in the packet
– e.g., a 20 ms block of 8,000 Hz audio advances the timestamp by 160 units per packet
Always 90 kHz for video
– e.g., 3000 timestamp units per frame at 30 fps
– 3600 at 25 fps
– 3750 at 24 fps
Advances by the nominal amount even if the real system clock is slower or faster
Note: the 32-bit integer may wrap around
– if starting at 0, after about 6 days for audio, ½ day for video
– but the starting value is supposed to be random
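The figures on this slide follow from simple arithmetic, which the sketch below reproduces. The function names are chosen only for this example.

```python
def units_per_packet(clock_rate_hz: int, packet_ms: float) -> float:
    """Timestamp increment for one audio packet of the given duration."""
    return clock_rate_hz * packet_ms / 1000


def units_per_frame(clock_rate_hz: int, frames_per_second: float) -> float:
    """Timestamp increment between consecutive video frames."""
    return clock_rate_hz / frames_per_second


def wraparound_days(clock_rate_hz: int) -> float:
    """Days until a 32-bit timestamp that started at 0 wraps around."""
    return 2**32 / clock_rate_hz / 86400


audio_step = units_per_packet(8000, 20)        # 160.0
video_steps = [units_per_frame(90000, fps) for fps in (30, 25, 24)]
audio_wrap = wraparound_days(8000)             # about 6.2 days
video_wrap = wraparound_days(90000)            # about 0.55 days
```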
RTP sequence number
Counts packets actually sent (16-bit counter)
Wraps around much more quickly than the timestamp
– e.g., for 20 ms packets, in about 22 minutes
Also uses a random starting value
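Because the counter wraps, comparisons between sequence numbers have to be done modulo 2^16. The sketch below shows one way to do that and to count missing packets. It assumes no duplicate packets and reordering of far less than half the sequence space; `seq_delta`, `count_lost`, and `wrap_minutes` are illustrative names.

```python
SEQ_MOD = 1 << 16  # RTP sequence numbers are 16 bits wide


def seq_delta(prev: int, cur: int) -> int:
    """Signed distance from prev to cur, in the range -32768..32767."""
    d = (cur - prev) % SEQ_MOD
    return d - SEQ_MOD if d >= SEQ_MOD // 2 else d


def count_lost(seqs) -> int:
    """Packets missing between the first and the highest sequence number seen.

    Assumes every received packet is unique (no duplicates).
    """
    seqs = list(seqs)
    if not seqs:
        return 0
    first = highest = seqs[0]          # 'highest' is kept unwrapped
    for s in seqs[1:]:
        d = seq_delta(highest % SEQ_MOD, s)
        if d > 0:                      # ignore reordered (older) packets
            highest += d
    expected = highest - first + 1
    return expected - len(seqs)


def wrap_minutes(packet_ms: float) -> float:
    """Minutes until the sequence number wraps at one packet per packet_ms."""
    return SEQ_MOD * packet_ms / 1000 / 60


# Sequence wraps from 65535 to 0, packets 2 and 1 arrive swapped, 3 is lost.
lost = count_lost([65534, 65535, 0, 2, 1, 4])
```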
RTP timestamp vs. sequence number
Related, but different purposes
– timestamp for timing reconstruction: playout delay compensation (later), synchronization with other sources (later)
– sequence number for loss measurements and gap detection
t = s*b + c, where
– t = timestamp
– s = sequence number
– b = sample units per packet
– c = offset
Offset c is constant within a talkspurt, but changes after each talkspurt or after a transmission gap
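One consequence of the relation t = s*b + c is that a receiver can spot a new talkspurt by checking whether the offset c has changed between two packets. The sketch below compares steps rather than absolute values, so that wrap-around of either field does not cause a false alarm. It assumes a fixed number of sample units per packet and that the two packets are given in send order; `offset_changed` is an illustrative name.

```python
TS_MOD = 1 << 32   # timestamps are 32 bits wide
SEQ_MOD = 1 << 16  # sequence numbers are 16 bits wide


def offset_changed(prev, cur, units_per_packet: int) -> bool:
    """True if c in t = s*b + c differs between two packets.

    prev and cur are (sequence_number, timestamp) pairs in send order;
    units_per_packet plays the role of b.
    """
    (s0, t0), (s1, t1) = prev, cur
    seq_step = (s1 - s0) % SEQ_MOD
    ts_step = (t1 - t0) % TS_MOD
    return ts_step != (seq_step * units_per_packet) % TS_MOD


b = 160  # 20 ms of 8,000 Hz audio
same_spurt = offset_changed((10, 5000), (11, 5160), b)     # False
after_loss = offset_changed((11, 5160), (13, 5480), b)     # False: one packet lost
after_pause = offset_changed((13, 5480), (14, 9640), b)    # True: silence gap
across_wrap = offset_changed((65535, TS_MOD - 96), (0, 64), b)  # False
```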
Playout delay
Converts variable network delay (“jitter”) into fixed delay
– thus, end-to-end delay is max(jitter) + propagation delay
– or, if willing to tolerate some late packets: roughly the 95th percentile of jitter + propagation delay
Propagation delay is invisible
– and hard to measure without synchronized clocks
– about 5 ms/1000 km one way
Total delay should be less than 150 ms one-way
End-to-end delay must remain constant within a talkspurt
– otherwise playout has gaps
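The trade-off between delay and late packets can be made concrete with a small sketch. It assumes that per-packet jitter samples (the variable part of the delay, in ms) are already available, which in practice requires some estimate of the fixed part; the function names and sample values are invented for illustration.

```python
def propagation_ms(distance_km: float) -> float:
    """One-way propagation delay at about 5 ms per 1000 km."""
    return 5.0 * distance_km / 1000


def end_to_end_delay(jitter_ms, propagation: float, late_fraction: float = 0.0) -> float:
    """Fixed delay that lets at most late_fraction of the samples arrive late."""
    ordered = sorted(jitter_ms)
    if not ordered:
        raise ValueError("need at least one jitter sample")
    allowed_late = int(late_fraction * len(ordered))   # packets we may drop
    k = len(ordered) - 1 - allowed_late
    return ordered[max(k, 0)] + propagation


jitter = [5, 12, 8, 40, 10, 7, 9, 11]       # made-up samples, in ms
prop = propagation_ms(4000)                 # 20.0 ms
no_loss = end_to_end_delay(jitter, prop)            # covers the 40 ms outlier
some_loss = end_to_end_delay(jitter, prop, 0.25)    # tolerates 2 of 8 late
within_budget = some_loss < 150
```

With these samples, covering every packet costs 60 ms, while tolerating two late packets out of eight brings the delay down to 31 ms.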
Playout delay
[Figure: timeline (axis: time) with labels "playout delay", "jitter", and "late = lost packet"]
Playout buffer
Logically infinite buffer
Implemented as a “circular buffer”, with wrap-around
Takes care of jitter and reordering based on RTP timestamp t
Playout point p = t*b + c
– p = buffer position, measured in samples (typically 16 bits if decoding is done before playout)
– b = buffer positions per sample (usually = 1)
– c = offset
[Figure labels: decoder (G.729, L16), silence]
Usually best to think of each talkspurt as an independently schedulable unit:
p = p_0 + (t - t_0) * b
– t_0 = timestamp of first packet in talkspurt
– p_0 = position of first packet in talkspurt
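A minimal sketch of such a buffer follows, using the per-talkspurt form p = p_0 + (t - t_0) * b with wrap-around. The class and method names are invented for this example, samples are assumed to be already decoded, and the buffer is assumed to be larger than the span of packets outstanding at any time. `start_talkspurt` must be called before `insert`.

```python
TS_MOD = 1 << 32  # RTP timestamps are 32 bits wide


class PlayoutBuffer:
    """Circular buffer of decoded samples, indexed by RTP timestamp."""

    def __init__(self, size: int, b: int = 1):
        self.buf = [0] * size      # 0 stands for silence
        self.b = b                 # buffer positions per sample
        self.t0 = None             # timestamp of first packet in talkspurt
        self.p0 = None             # buffer position of that packet

    def start_talkspurt(self, t0: int, p0: int) -> None:
        self.t0, self.p0 = t0, p0

    def position(self, t: int) -> int:
        """p = p_0 + (t - t_0) * b, reduced modulo the buffer size."""
        offset = (t - self.t0) % TS_MOD     # tolerate timestamp wrap-around
        return (self.p0 + offset * self.b) % len(self.buf)

    def insert(self, t: int, samples) -> None:
        """Place a packet's samples at the position given by its timestamp."""
        p = self.position(t)
        for i, x in enumerate(samples):
            self.buf[(p + i * self.b) % len(self.buf)] = x


# Usage: two-sample packets, the second and third arriving out of order.
pb = PlayoutBuffer(size=8)
pb.start_talkspurt(t0=1000, p0=6)
pb.insert(1000, [1, 2])
pb.insert(1004, [5, 6])
pb.insert(1002, [3, 4])
```

After these three inserts the samples sit in timestamp order starting at position 6 and wrapping to position 0, regardless of arrival order.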
Playout buffer, cont’d.
Thus, the hard part is computing the insertion point for the first packet in a talkspurt
Trying to predict the future
– late loss vs. excessive delay
Conceptually, two approaches:
– look at the current playout point when the first packet arrives, then leave some margin of error
   – may be too conservative
– compute based on the last talkspurt and change c
   – avoids overestimation due to a slow first packet
   – deals less well with jumps in delay after long pauses
[Figure labels: t=100, t=140, play, insert, t]
Simple method: assume a roughly normal distribution and take n times the variation of the delay (= jitter, e.g., its standard deviation)
– this becomes the extra delay
Other mechanisms:
– spike detection
– optimal value for last talkspurt
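The simple method can be sketched as follows. Using the population standard deviation as the measure of jitter and n = 4 as the multiplier are assumptions made for this example, as are the function name and the sample delays; the slides do not prescribe a particular estimator.

```python
import statistics


def extra_delay(delays_ms, n: float = 4.0) -> float:
    """Extra playout delay: n times the spread of the observed delays.

    The spread is taken to be the population standard deviation here,
    which is an assumption of this sketch.
    """
    return n * statistics.pstdev(delays_ms)


def playout_delay(delays_ms, n: float = 4.0) -> float:
    """Mean observed delay plus the extra delay, for the next talkspurt."""
    return statistics.fmean(delays_ms) + extra_delay(delays_ms, n)


last_talkspurt = [50, 60, 50, 60]          # made-up delays, in ms
margin = extra_delay(last_talkspurt)       # 4 * 5 ms
target = playout_delay(last_talkspurt)     # 55 ms + 20 ms
```

A larger n lowers the chance of late loss at the cost of more delay, which is the trade-off named on the slide.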