1
H.264: ITU-T H.264 | ISO/IEC 14496-10 (MPEG-4 Part 10) Advanced Video Coding (AVC)
2
Goals
- Improved coding efficiency: an average bit-rate reduction of 50% at fixed fidelity compared to any other standard; complexity vs. coding-efficiency scalability.
- Improved network friendliness: issues examined in H.263 and MPEG-4 are further improved; anticipates error-prone transport over mobile networks and the wired and wireless Internet.
- Simple syntax specification: targets simple and clean solutions, avoiding any excessive quantity of optional features or profile configurations.

An increasing number of services and the growing popularity of high-definition TV are creating a need for higher coding efficiency. Many transmission media offer much lower data rates than broadcast channels: cable modem, xDSL, or UMTS.
3
The Scope of Picture and Video Coding Standardization
Only restrictions on the bitstream, syntax, and decoder are standardized:
- Permits optimization beyond the obvious
- Permits complexity reduction for implementability
- Provides no guarantees of quality

Only the central decoder is standardized, by imposing restrictions on the bitstream and syntax and by defining the decoding process of the syntax elements, such that every decoder conforming to the standard will produce similar output when given an encoded bitstream that conforms to the constraints of the standard. This permits maximal freedom to optimize implementations in a manner appropriate to specific applications (balancing compression quality, implementation cost, time to market, etc.).
4
Applications
Entertainment video (1-8+ Mbps, higher latency):
- Broadcast via satellite / cable / terrestrial / DSL
- DVD for standard- and high-definition video
- VoD via various channels
Conversational services (usually < 1 Mbps, low latency):
- H.320 conversational
- 3GPP conversational H.324/M
- H.323 conversational Internet / best-effort IP/RTP
- 3GPP conversational IP/RTP/SIP
Streaming services (usually lower bit rate, higher latency):
- 3GPP streaming IP/RTP/RTSP
- Streaming IP/RTP/RTSP (without TCP fallback)
Other services:
- 3GPP multimedia messaging services, circuit switched and packet switched
5
Profiles & Levels Concepts
Many standards contain different configurations of capabilities, often organized as "profiles" and "levels":
- A profile is usually a set of algorithmic features
- A level is usually a degree of capability (e.g., resolution or speed of decoding)

H.264/AVC has three profiles:
- Baseline (lower capability plus error resilience; e.g., videoconferencing, mobile video)
- Main (high compression quality; e.g., broadcast)
- Extended (added features for efficient streaming)

Profiles and levels specify conformance points. These conformance points are designed to facilitate interoperability between various applications of the standard that have similar functional requirements. A profile defines a set of coding tools or algorithms that can be used in generating a conforming bitstream, whereas a level places constraints on certain key parameters of the bitstream.

The Baseline profile supports all features in H.264/AVC except the following two feature sets:
- Set 1: B slices, weighted prediction, CABAC, field coding, and picture- or macroblock-adaptive switching between frame and field coding.
- Set 2: SP/SI slices and slice data partitioning.

The first set of additional features is supported by the Main profile. However, the Main profile does not support the FMO, ASO, and redundant-pictures features that the Baseline profile supports, so only a subset of the coded video sequences decodable by a Baseline profile decoder can be decoded by a Main profile decoder. (Flags in the sequence parameter set indicate which decoder profiles can decode the coded video sequence.) The Extended profile supports all features of the Baseline profile plus both additional feature sets, except for CABAC.
6
H.264|AVC Layer Structure
- VCL (Video Coding Layer): efficiently represents the video content.
- NAL (Network Abstraction Layer): formats the VCL representation of the video and provides header information in a manner appropriate for conveyance by a variety of transport layers or storage media.
7
High-Level VCL Summary
The video coding layer is based on hybrid video coding and is similar in spirit to other standards, but with important differences. Some new key aspects:
- Enhanced motion compensation
- Small blocks for transform coding
- Improved de-blocking filter
- Enhanced entropy coding
These yield substantial bit-rate savings relative to other standards at the same quality.
8
Input Video Signal
- Progressive and interlaced frames can be coded as one unit
- Progressive vs. interlaced frame type is signaled but has no impact on decoding
- Each field can be coded separately (including dangling fields)

If the two fields of a frame were captured at different time instants, the frame is referred to as an interlaced frame; otherwise it is referred to as a progressive frame. The coded representation in H.264/AVC is primarily agnostic to this characteristic, i.e., to the underlying interlaced or progressive timing of the original captured pictures; the coding specifies a representation based primarily on geometric concepts rather than on timing. H.264/AVC uses a 4:2:0 sampling structure with 8 bits of precision per sample, the same as MPEG-2 Main-profile video.
9
Partitioning of the Picture
Slices:
- A picture is split into one or several slices
- Slices are self-contained
- Slices are a sequence of macroblocks
Macroblocks:
- Basic syntax and processing unit
- Contains 16x16 luma samples and 2 x 8x8 chroma samples
- Macroblocks within a slice depend on each other
- Macroblocks can be further partitioned

Slices are self-contained in the sense that, given the active sequence and picture parameter sets, their syntax elements can be parsed from the bitstream and the sample values in the area of the picture that the slice represents can be correctly decoded without data from other slices, provided the reference pictures used are identical at encoder and decoder.
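As a quick illustration of the macroblock partitioning described above, the following Python sketch computes the macroblock grid for a given picture size (the QCIF resolution is just an example; sizes that are not multiples of 16 are handled in the standard via padding and cropping, which this sketch ignores):

```python
# Minimal sketch: how a picture divides into 16x16 macroblocks.

def macroblock_grid(width: int, height: int) -> tuple[int, int]:
    """Return (columns, rows) of 16x16 macroblocks covering the picture."""
    assert width % 16 == 0 and height % 16 == 0, "example assumes padded sizes"
    return width // 16, height // 16

cols, rows = macroblock_grid(176, 144)  # QCIF example
print(cols, rows, cols * rows)          # 11 9 99 macroblocks
```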
10
Flexible Macroblock Ordering (FMO)
Slice group: a pattern of macroblocks defined by a macroblock allocation map. A slice group may contain one to several slices. Macroblock allocation map types:
- Interleaved slices
- Dispersed macroblock allocation (e.g., checkerboard)
- Explicit assignment of a slice group to each macroblock location in raster-scan order
- One or more "foreground" slice groups and a "leftover" slice group

The macroblock allocation map is specified by the content of the picture parameter set and some information from slice headers. A slice in a group is a sequence of macroblocks within the same slice group, processed in raster-scan order within the set of macroblocks of that slice group. Foreground and leftover groups can indicate a region of interest. The checkerboard type is useful for concealment in videoconferencing applications, where slice groups #0 and #1 are transmitted in separate packets and one of them is lost.
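The following sketch illustrates the geometry of two of the map types named above; it is not bitstream syntax, only an illustration of the raster-order macroblock-to-slice-group assignment:

```python
# Illustrative macroblock allocation maps, one entry per macroblock
# in raster-scan order; values are slice-group indices.

def checkerboard_map(mb_cols: int, mb_rows: int) -> list[int]:
    """Dispersed/checkerboard map alternating slice groups 0 and 1."""
    return [(x + y) % 2 for y in range(mb_rows) for x in range(mb_cols)]

def interleaved_map(mb_cols: int, mb_rows: int, groups: int) -> list[int]:
    """Interleaved map: rows of macroblocks cycle through the slice groups."""
    return [y % groups for y in range(mb_rows) for x in range(mb_cols)]
```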
11
Interlaced Processing
Field coding: each field is coded as a separate picture, using fields for motion compensation.
Frame coding:
- Type 1: the complete frame is coded as a single picture
- Type 2: the frame is scanned as macroblock pairs, and for each macroblock pair the encoder switches between frame and field coding

The H.264/AVC design allows encoders to choose adaptively, for each frame, between field coding and frame coding; this per-picture choice is referred to as picture-adaptive frame/field (PAFF) coding, while frame coding of type 2 is the macroblock-adaptive variant covered on the next slide. PAFF coding was reported to reduce bit rates in the range of 16 to 20% relative to frame-only coding.
12
Macroblock-Based Frame/Field Adaptive Coding
If a frame consists of mixed regions, some moving and some static, it is typically more efficient to code the non-moving regions in frame mode and the moving regions in field mode. The frame/field encoding decision can therefore also be made independently for each vertical pair of macroblocks (a 16x32 luma region) in a frame. MBAFF keeps the basic macroblock processing structure intact and permits motion-compensation areas as large as a macroblock. MBAFF performs better than PAFF by 14 to 16% for mixed-content sequences such as "Mobile and Calendar". PAFF performs better in cases of rapid global motion, scene changes, or intra picture refresh.
13
Scanning of a Macroblock
Intra_4x4 mode predicts each 4x4 luma block separately and is well suited for coding parts of a picture with significant detail. Intra_16x16 mode predicts the whole 16x16 luma block and is better suited for coding very smooth areas of a picture.
14
Basic Coding Structure
The input video signal is split into macroblocks, and the macroblocks are associated with slice groups and slices as shown.
15
Basic Coding Structure
The slice serves as the basic unit of coding.
16
Common Elements with other Standards
- Macroblocks: 16x16 luma + 2 x 8x8 chroma samples
- Input: association of luma and chroma and conventional sub-sampling of chroma (4:2:0)
- Block motion displacement
- Motion vectors over picture boundaries
- Variable block-size motion compensation
- Block transforms
- Scalar quantization
- I, P, and B coding types
17
Inter-Frame Prediction
18
Motion Compensation Accuracy
Each P macroblock type corresponds to a specific partition of the macroblock into the block shapes used for motion-compensated prediction. A maximum of sixteen motion vectors may be transmitted for a single P macroblock.
19
Quarter Sample Luma Interpolation
Half-sample positions are obtained by applying a 6-tap filter with tap values (1, -5, 20, 20, -5, 1). Quarter-sample positions are obtained by averaging samples at integer and half-sample positions.

The accuracy of motion compensation is one quarter of the distance between luma samples. If the motion vector points to an integer-sample position, the prediction signal consists of the corresponding samples of the reference picture; otherwise, the corresponding samples are obtained by interpolation at non-integer positions. The prediction values at half-sample positions are obtained by applying a one-dimensional 6-tap FIR filter horizontally and vertically; prediction values at quarter-sample positions are generated by averaging samples at integer and half-sample positions.

Half-sample positions: b, h, j (m and s are shared with other full samples). The samples at half-sample positions b and h are derived by first calculating intermediate values b1 and h1 with the 6-tap filter:

b1 = E - 5F + 20G + 20H - 5I + J
h1 = A - 5C + 20G + 20M - 5R + T

The final prediction values for b and h are obtained as follows and clipped to the range 0 to 255:

b = (b1 + 16) >> 5
h = (h1 + 16) >> 5

The sample at half-sample position j is obtained by

j1 = cc - 5dd + 20h1 + 20m1 - 5ee + ff

where the intermediate values cc, dd, ee, m1, and ff are obtained in a manner similar to h1. The final prediction value is then computed as j = (j1 + 512) >> 10 and clipped to the range 0 to 255.

Quarter-sample positions a, c, d, n, f, i, k, and q are derived by averaging, with upward rounding, the two nearest samples at integer and half-sample positions, for example a = (G + b + 1) >> 1.

When motion vectors point outside the image area, the reference frame is extrapolated beyond the image boundaries by repeating the edge samples before interpolation.
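A one-dimensional Python sketch of the half- and quarter-sample rules above; the rounding offset +16 and the final averaging follow the equations in the text, and clipping assumes 8-bit samples:

```python
# 6-tap half-sample filter (1, -5, 20, 20, -5, 1) and quarter-sample averaging.

def half_sample(p: list[int], x: int) -> int:
    """Interpolate the half-sample position between p[x] and p[x+1]."""
    b1 = p[x-2] - 5*p[x-1] + 20*p[x] + 20*p[x+1] - 5*p[x+2] + p[x+3]
    return min(255, max(0, (b1 + 16) >> 5))   # round, shift, clip to 0..255

def quarter_sample(sample_a: int, sample_b: int) -> int:
    """Quarter positions average the two nearest samples with upward rounding."""
    return (sample_a + sample_b + 1) >> 1
```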
20
Chroma Sample Interpolation
Chroma interpolation is 1/8-sample accurate: chroma is sampled at half the luma resolution, so quarter-sample luma motion corresponds to eighth-sample chroma motion.
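A sketch of the eighth-sample chroma interpolation, assuming the standard bilinear weighting of the four surrounding integer samples A, B, C, D; dx and dy are the fractional offsets in eighths (0..7):

```python
# Bilinear 1/8-sample chroma interpolation; the weights sum to 64, hence >> 6.

def chroma_sample(A: int, B: int, C: int, D: int, dx: int, dy: int) -> int:
    return ((8 - dx) * (8 - dy) * A + dx * (8 - dy) * B
            + (8 - dx) * dy * C + dx * dy * D + 32) >> 6
```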
21
Multiple Reference Frames
22
Multiple Reference Frames and Generalized Bi-Predictive Frames
23
New Types of Temporal Referencing
Known dependencies (MPEG-1, MPEG-2, etc.). New types of dependencies:
- Referencing order and display order are decoupled
- Referencing ability and picture type are decoupled
24
Intra Prediction: A-Q are samples of previously decoded blocks, and a-p are the samples of the block being coded.
25
Weighted Prediction
In addition to shifting in spatial position and selecting among multiple reference pictures, each region's prediction sample values can be multiplied by a weight and given an additive offset. Some key uses:
- Improved efficiency for B coding, e.g., accelerating motion, multiple non-reference B pictures temporally between reference pictures
- Excellent representation of fades: fade-in, fade-out, and cross-fades from scene to scene
The encoder can apply this to both P and B prediction types. A per-sample sketch follows below.
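A minimal sketch of explicit weighted prediction for one sample, assuming the usual weight/offset/shift parameterization (variable names here are illustrative, not taken from the slide):

```python
# One-sample explicit weighted prediction: scale by w, round by the weight
# denominator 2**log_wd, add offset o, clip to the 8-bit sample range.

def weighted_pred(p: int, w: int, o: int, log_wd: int) -> int:
    if log_wd >= 1:
        v = ((p * w + (1 << (log_wd - 1))) >> log_wd) + o
    else:
        v = p * w + o
    return min(255, max(0, v))
```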
26
Spatial prediction using surrounding “available” samples
Available samples are those previously reconstructed at the decoder within the same slice.
Luma intra prediction is either:
- a single prediction for the entire 16x16 macroblock, with 4 modes (vertical, horizontal, DC, planar), or
- 16 individual predictions of 4x4 blocks, with 9 modes (DC plus 8 directional)
Chroma intra prediction: a single prediction type for both 8x8 regions.
27
16x16 Intra Prediction Directions
Four prediction modes are supported. Mode 0 (vertical) and mode 1 (horizontal) copy the boundary samples into the respective columns and rows; mode 2 (DC) copies the average of the boundary samples to every pixel.
28
4x4 Intra Prediction Directions
In mode 0 and mode 1, the boundary samples are copied to the respective pixels. Mode 2 copies the average of the neighboring samples (the samples directly above and to the left) to the coding block.
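The three simplest 4x4 luma modes can be sketched directly; `above` and `left` hold the four reconstructed neighboring samples, and the directional modes 3-8 are omitted for brevity:

```python
import numpy as np

def intra4x4(mode: int, above: np.ndarray, left: np.ndarray) -> np.ndarray:
    if mode == 0:                                   # vertical: copy top row down
        return np.tile(above, (4, 1))
    if mode == 1:                                   # horizontal: copy left column
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == 2:                                   # DC: average of neighbors
        dc = (above.sum() + left.sum() + 4) >> 3
        return np.full((4, 4), dc)
    raise NotImplementedError("directional modes 3-8 omitted in this sketch")
```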
29
4x4 Intra Prediction Directions
30
4x4 Boundary Conditions
31
Transform Coding An additional 2x2 transform is also applied to the DC coefficients of the four 4x4 blocks of each chroma component.
32
Advantage of Small Size Transform
- The improved prediction process, both inter and intra, means the residual signal has less spatial correlation, so the transform has less to offer in terms of decorrelation; a 4x4 transform is essentially as efficient in removing statistical correlation as a larger transform.
- With similar objective compression capability, the smaller 4x4 transform has visual benefits: less noise around edges (referred to as "mosquito noise" or "ringing" artifacts).
- The smaller transform requires fewer computations and a smaller processing word length.
- Since the transformation process in H.264/AVC involves only adds and shifts, it is specified so that mismatch between encoder and decoder is avoided (a problem with earlier 8x8 DCT-based standards).
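A sketch of the 4x4 integer core transform: the matrix has only +/-1 and +/-2 entries, so it needs only adds and shifts, and the scaling that the standard folds into quantization is omitted here:

```python
import numpy as np

# Core transform matrix of the 4x4 integer transform.
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def core_transform(block: np.ndarray) -> np.ndarray:
    """Apply the separable 4x4 integer transform to a residual block."""
    return Cf @ block @ Cf.T
```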
33
Deblocking Filter
- Improves subjective and objective quality of the decoded picture; significantly superior to post-filtering
- Filtering affects the edges of the 4x4 block structure
- A highly content-adaptive filtering procedure removes blocking artifacts without unnecessarily blurring the visual content
- At the slice level, the global filtering strength can be adjusted to the individual characteristics of the video sequence
- At the edge level, filtering strength depends on the inter/intra decision, motion, and coded residuals
- At the sample level, quantizer-dependent thresholds can turn off filtering for each individual sample
- An especially strong filter for macroblocks with very flat characteristics almost removes "tiling" artifacts

One particular characteristic of block-based coding is the accidental production of visible block structures. Block edges are typically reconstructed with less accuracy than interior pixels, and "blocking" is generally considered one of the most visible artifacts with present compression methods.
34
Principle of Deblocking Filter
QP: Quantization Parameter The basic idea is that if a relatively large absolute difference between samples near a block edge is measured, it is quite likely a blocking artifact and should therefore be reduced. However, if the magnitude of that difference is so large that it cannot be explained by the coarseness of the quantization used in the encoding, the edge is more likely to reflect the actual behavior of the source picture and should not be smoothed over.
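The sample-level decision can be sketched as below; p1, p0 and q0, q1 are the samples on either side of the block edge, and alpha/beta stand in for the QP-dependent thresholds from the standard's tables (treated here as inputs):

```python
# Filter only when the edge step is small enough to be a quantization artifact.

def should_filter(p1: int, p0: int, q0: int, q1: int,
                  alpha: int, beta: int) -> bool:
    return (abs(p0 - q0) < alpha and
            abs(p1 - p0) < beta and
            abs(q1 - q0) < beta)
```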
35
Order of Filtering: Filtering can be done on a macroblock basis, that is, immediately after a macroblock is decoded. First the vertical edges are filtered, then the horizontal edges. The bottom row and right column of a macroblock are filtered when the corresponding adjacent macroblocks are decoded.
36
Deblocking: Subjective Result for Intra
37
Deblocking: Subjective Result for Inter
The blockiness is reduced, while the sharpness of the content is basically unchanged. Consequently, the subjective quality is significantly improved. The filter reduces bit rate typically by 5-10% while producing the same objective quality as the non-filtered video.
38
Entropy Coding
39
Variable Length Coding
- An Exp-Golomb code is used universally for almost all symbols except transform coefficients
- Context-adaptive VLCs are used for coding transform coefficients
- There is no end-of-block symbol; instead, the number of coefficients is decoded
- Coefficients are scanned backwards
- Contexts are built depending on the transform coefficients

Note that the statistics of coefficient values have less spread for the last non-zero coefficients than for the first ones; for this reason, coefficient values are coded in reverse scan order. An Exp-Golomb sketch follows below.
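A sketch of the unsigned Exp-Golomb code: codeNum k is written as leading zeros, then the binary representation of k+1:

```python
def ue_encode(k: int) -> str:
    bits = bin(k + 1)[2:]                  # binary of k+1
    return "0" * (len(bits) - 1) + bits    # prefix with len-1 zeros

def ue_decode(s: str) -> int:
    zeros = len(s) - len(s.lstrip("0"))    # count the leading zeros
    return int(s[zeros:2 * zeros + 1], 2) - 1

assert [ue_encode(k) for k in range(5)] == ["1", "010", "011", "00100", "00101"]
```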
40
Context Adaptive VLC (CAVLC)
Transform coefficients are coded with the following elements:
- Number of non-zero coefficients (N) and "trailing 1s" (T1s)
- Levels and signs of all non-zero coefficients
- Total number of zeros before the last non-zero coefficient
- Run of zeros before each non-zero coefficient
41
Number of Coefficients/Trailing ”1s”
Typically the last non-zero coefficients have |level| = 1. The number of non-zero coefficients (in the example, N = 5) and the number of "trailing 1s" (T1s = 2) are coded as one combined symbol. In this way, typically more than 50% of the coefficients are signaled as T1s, and no level information other than the sign is needed for these coefficients. The VLC table to use is chosen adaptively based on the number of coefficients in neighboring blocks.

"Trailing 1s" (T1s) indicate the number of coefficients with absolute value equal to 1 at the end of the scan; they need only a sign, since each equals +1 or -1. Because the statistics of coefficient values have less spread for the last non-zero coefficients than for the first ones, coefficient values are coded in reverse scan order. In the example, -2 is the first coefficient value to be coded, using a starting VLC; when coding the next coefficient (value 6 in the example), a new VLC may be chosen based on the just-coded coefficient.

TotalZeros specifies the number of zeros between the last non-zero coefficient of the scan and the start of the scan. In the example, TotalZeros is 3; since it is already known that N = 5, the value must be in the range 0-11.

RunBefore specifies how those 3 zeros are distributed. First the number of zeros before the last coefficient is coded: 2 in the example, which must be in the range 0-3, so a suitable VLC is used. One zero then remains, so the number of zeros before the second-to-last coefficient must be 0 or 1; in the example it is 1. At that point no zeros remain, and no more run information is coded. A sketch extracting these elements follows below.
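A sketch that extracts N, T1s, and TotalZeros from the scanned coefficients of one block (the coefficient array is an illustrative stand-in, not the slide's exact example):

```python
def cavlc_elements(coeffs: list[int]) -> dict:
    nz = [(i, c) for i, c in enumerate(coeffs) if c != 0]
    n = len(nz)
    t1s = 0                                # trailing |level| == 1 coeffs (max 3)
    for _, c in reversed(nz):
        if abs(c) == 1 and t1s < 3:
            t1s += 1
        else:
            break
    last = nz[-1][0] if nz else -1
    total_zeros = last + 1 - n             # zeros before the last nonzero coeff
    return {"N": n, "T1s": t1s, "TotalZeros": total_zeros}

print(cavlc_elements([0, 3, 0, 1, -1, -1, 0, 1] + [0] * 8))
# {'N': 5, 'T1s': 3, 'TotalZeros': 3}
```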
42
Bit-Rate Savings for CAVLC
43
A Comparison of Performance
Test of different standards (IEEE Transactions on Circuits and Systems for Video Technology, July 2003, Wiegand et al.), using the same rate-distortion optimization techniques for all codecs:
- "Streaming" test: high latency (B frames included)
- "Real-time conversation" test: no B frames
- "Entertainment-quality application" test: SD and HD resolutions
Several video sequences were used for each test. Four codecs were compared:
- MPEG-2 (Main profile; high-latency/streaming test only)
- H.263 (High Latency profile, Conversational High Compression profile, Baseline profile)
- MPEG-4 Visual (Simple profile and Advanced Simple profile, with and without B pictures)
- H.264/AVC (Main profile and Baseline profile)
44
Test Results for Streaming Application
45
Example Streaming Test Result
46
Example Streaming Test Result
47
Comparison to MPEG-4 ASP
48
Comparison to MPEG-2, H.263, MPEG-4
49
Test Results for Real-Time Conversation
50
Example Real-Time Conversation Result
51
Example Real-Time Test Result
52
Comparison to MPEG 2, H.263, MPEG-4
53
Test Results Entertainment-Quality Applications
54
Example Entertainment-Quality Applications Result
55
Example Entertainment-Quality Applications Result
56
H.264/AVC Layer Structure
57
Networks and Applications
- Broadcast over cable, satellite, DSL, terrestrial, etc.
- Interactive or serial storage on optical and magnetic devices, DVD, etc.
- Conversational services over ISDN, Ethernet, LAN, DSL, wireless networks, modems, etc., or a mixture of several
- Video-on-demand or multimedia streaming services over ISDN, DSL, Ethernet, LAN, wireless networks, etc.
- Multimedia Messaging Services (MMS) over ISDN, DSL, Ethernet, LAN, wireless networks, etc.
- New applications over existing and future networks!
How to handle this variety of applications and networks?
58
Network Abstraction Layer
Mapping of H.264/AVC video to transport layers, such as:
- RTP/IP for any kind of real-time wireline and wireless Internet service (conversational and streaming)
- File formats, e.g., ISO MP4, for storage and MMS
- H.32X for wireline and wireless conversational services
- MPEG-2 systems for broadcast services, etc.
The transport layers themselves are outside the scope of H.264/AVC standardization, but were designed for with awareness: the standard provides appropriate mechanisms and interfaces, a mapping to networks, and support for gateway design. Key concepts: parameter sets, Network Abstraction Layer (NAL) units, and the NAL-unit and byte-stream formats, all completely within the scope of H.264/AVC standardization.
59
Network Abstraction Layer (NAL) Units
Constraints:
- Many relevant networks are packet-switched
- Mapping packets to streams is easier than vice versa
- Undetected bit errors practically do not exist at the application layer
Architecture: NAL units as the transport entity
- NAL units may be mapped into a bit stream, or forwarded directly by a packet network
- NAL units are self-contained (independently decodable)
- The decoding process assumes NAL units arrive in decoding order
- The integrity of a NAL unit is signaled by its correct size (conveyed externally) and the forbidden_bit set to 0
60
Access Units
61
NAL Unit Format and Types
NAL unit header (1 byte):
- forbidden_bit (1 bit): may be used to signal that a NAL unit is corrupt (useful, e.g., for decoders capable of handling bit errors)
- nal_storage_idc (2 bits): signals relative importance and whether the picture is stored in the reference picture buffer
- nal_unit_type (5 bits): signals one of 10 NAL unit types: coded slice (regular VCL data); coded data partitions A, B, C (DPA, DPB, DPC); instantaneous decoder refresh (IDR); supplemental enhancement information (SEI); sequence and picture parameter sets (SPS, PPS); picture delimiter (PD); and filler data (FD)
NAL unit payload: an emulation-prevented sequence of bytes. A parsing sketch follows below.
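A parsing sketch for the one-byte header; note that the field called nal_storage_idc on this slide appears as nal_ref_idc in the published standard:

```python
def parse_nal_header(b: int) -> dict:
    return {
        "forbidden_bit": (b >> 7) & 0x1,   # must be 0 in a conforming stream
        "nal_ref_idc":   (b >> 5) & 0x3,   # relative importance / reference use
        "nal_unit_type":  b       & 0x1F,  # e.g. 7 = SPS, 8 = PPS, 5 = IDR slice
    }

print(parse_nal_header(0x67))  # 0x67 starts an SPS: ref_idc=3, type=7
```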
62
RTP Payload Format for H.264/AVC
- The specification of an RTP payload format is under way within the IETF AVT working group
- The draft also follows the "back-to-basics" goal of a simple syntax specification
- The RTP payload specification expects NAL units to be transmitted directly as the RTP payload
- An additional concept of aggregation packets allows more than one NAL unit per RTP packet (helpful for gateways between networks with different MTU size requirements)
- The RTP timestamp matches the presentation timestamp, using a fixed 90 kHz clock
- Open issue: media-unaware fragmentation
63
Data Partitioning NAL Units 1/2
H.264 | AVC supports data partitioning with three partitions:
- Data partition A (DPA) contains header information: the slice header, all macroblock header information, and motion vectors
- Data partition B (DPB) contains intra texture information: intra CBPs and intra coefficients
- Data partition C (DPC) contains inter texture information: inter CBPs and inter coefficients
When data partitioning is used, all partitions are carried in separate NAL units.
64
Data Partitioning NAL Units 2/2
Properties of the partition types:
- DPA is (perceptually) more important than DPB
- DPB cleans up error propagation; DPC does not
- Transporting DPA with higher QoS than DPB and DPC typically leads, in lossy transmission environments, to higher overall reconstructed picture quality at the same bit rate
Most packet networks offer some prioritization:
- Sub-transport and transport level, e.g., in 3GPP networks or when using DiffServ in IP
- Application-layer protection: packet duplication, packet-based FEC
65
Parameter Set Concept: Sequence, random-access, and picture headers can get lost. The solution in previous standards was duplication of headers; H.264/AVC applies a new concept: parameter sets.
66
Parameter Set Discussion
A parameter set carries information relevant to more than one slice, i.e., information traditionally found in sequence and picture headers. Most of this information is static, so transmitting a reference to it is sufficient. The problem is picture-dynamic information, namely timing (TR); the solution is to carry picture-dynamic information in every slice, with smaller overhead than one would expect. Parameter sets are conveyed out-of-band and reliably, so there are no corruption or synchronization problems; this is aligned with closed-control applications, while broadcast needs an in-band transmission mechanism.
67
Nested Parameter Sets: Each slice references a picture parameter set (PPS) to be used for decoding its VCL data:
- The PPS is selected by a short variable-length codeword transported in the slice header
- It contains, e.g., the entropy coding mode, FMO parameters, quantization initialization, weighted prediction indications, etc.
- The PPS reference can change between pictures
Each PPS in turn references a sequence parameter set (SPS):
- The SPS is referenced only from the PPS
- It contains, e.g., the profile/level indication, display parameters, timing information, etc.
- The SPS reference can change only at IDR pictures
A sketch of this reference chain follows below.
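A minimal data-model sketch of the slice -> PPS -> SPS chain; the fields shown are a small illustrative subset of each parameter set, not the full syntax:

```python
from dataclasses import dataclass

@dataclass
class SPS:                          # sequence parameter set
    seq_parameter_set_id: int
    profile_idc: int
    level_idc: int

@dataclass
class PPS:                          # picture parameter set
    pic_parameter_set_id: int
    seq_parameter_set_id: int       # selects the SPS this PPS builds on
    entropy_coding_mode: int        # 0 = CAVLC, 1 = CABAC

def resolve(slice_pps_id: int, ppss: dict[int, PPS], spss: dict[int, SPS]):
    """Follow a slice's PPS reference down to the active SPS."""
    pps = ppss[slice_pps_id]
    return pps, spss[pps.seq_parameter_set_id]
```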
68
Summarizing NAL:
- In H.264/AVC, the transport of video was taken into account from the very beginning
- Flexibility for integration with different transport protocols is provided
- A common structure based on NAL units and parameter sets enables simple gateway operation
- Mapping to the MPEG-2 transport stream is provided via the byte-stream format
- Payload specifications for different transport protocols, e.g., RTP/IP, are under way
69
Conclusions
The video coding layer is based on hybrid video coding and is similar in spirit to other standards, but with important differences. New key features:
- Enhanced motion compensation
- Small blocks for transform coding
- Improved deblocking filter
- Enhanced entropy coding
Bit-rate savings of around 50% against any other standard for the same perceptual quality (especially for higher-latency applications allowing B pictures). The standard belongs to both ITU-T VCEG and ISO/IEC MPEG.