MPEG-4
Objective
  Standardize algorithms for audiovisual coding in multimedia applications, allowing for
    Interactivity
    High compression
    Scalability of audio and video content
    Support for natural and synthetic audio and video
The Idea
  An audiovisual scene is a coded representation of audiovisual objects related in space and time

MPEG-4: Scenario
A/V object
  A video object within a scene
  The background
  An instrument or voice
  Coded independently
A/V scene
  Mixture of natural or synthetic objects
  Individual bitstreams multiplexed and transmitted
  One or more channels
  Each channel may have its own quality of service

MPEG-4: Video Object Plane
Video frame = sum of segmented regions with arbitrary shape (VOPs)
Shape, motion, and texture information of the VOPs belonging to the same video object is encoded into a video object layer (VOL)
Encode
  VOL identifiers
  Composition information
  Overlapping configuration of VOPs

MPEG-4: Coding
Shape coding
  Shape information in alpha planes
  Transparency of shape encoded
  Inter and intra shape coding functions
  After shape coding, each VOP in a VO is partitioned into non-overlapping macroblocks
Motion coding
  Shift parameter with respect to a reference window
  Standard macroblock
  Contour macroblock

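The macroblock partition above can be illustrated with a small sketch. It assumes that a "standard" macroblock lies entirely inside the VOP and a "contour" macroblock straddles the shape boundary; the slide does not define the terms, so this reading, the "transparent" label, and the function name classify_macroblocks are illustrative assumptions rather than the standard's definitions.

```python
def classify_macroblocks(alpha, mb_size=16):
    """Label each mb_size x mb_size block of a binary alpha plane.

    alpha is a list of rows of 0/1 values (1 = pixel inside the VOP).
    Returns a dict mapping (block_row, block_col) to "transparent",
    "standard", or "contour". The labels are assumptions for illustration.
    """
    height, width = len(alpha), len(alpha[0])
    labels = {}
    for top in range(0, height, mb_size):
        for left in range(0, width, mb_size):
            # Collect the alpha values covered by this macroblock,
            # clipping at the right and bottom edges of the plane.
            pixels = [alpha[y][x]
                      for y in range(top, min(top + mb_size, height))
                      for x in range(left, min(left + mb_size, width))]
            inside = sum(pixels)
            if inside == 0:
                label = "transparent"   # no VOP pixels at all
            elif inside == len(pixels):
                label = "standard"      # fully inside the VOP
            else:
                label = "contour"       # crosses the VOP boundary
            labels[(top // mb_size, left // mb_size)] = label
    return labels
```

With a tiny 4 x 4 alpha plane and mb_size=2, the top-left block of an object occupying the upper-left corner would come out as "standard", its right neighbor as "contour", and the empty bottom row as "transparent".
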
MPEG-4: Coding
Texture coding
  Intra-VOPs and residual errors from motion compensation are DCT coded as in MPEG-1
  4 luminance and 2 chrominance blocks in a macroblock
  P-VOPs (prediction error blocks) may not conform to the VOP boundary
    Pixels outside the active area are set to a constant value
  Standard compression
  Efficient prediction of DC and AC components from intra and inter coded blocks
Multiplexing
  Shape, motion, and texture coded data
  Motion and DCT coefficients can be jointly (H.263) or individually coded

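A minimal sketch of the padding and transform steps named above: pixels outside the active area of an 8 x 8 block are set to a constant, and the padded block is passed through a 2-D DCT. The function names pad_block and dct_2d and the fill parameter are assumptions for illustration; the slide does not say which constant is used, so it is left to the caller.

```python
import math

N = 8  # block size used by the DCT

def pad_block(block, mask, fill=0):
    """Replace pixels outside the active area (mask == 0) with a constant.

    The fill value is an assumption; the source only says "a constant value".
    """
    return [[block[y][x] if mask[y][x] else fill for x in range(N)]
            for y in range(N)]

def dct_2d(block):
    """Orthonormal 8x8 type-II DCT, computed directly from its definition."""
    def c(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = 0.25 * c(u) * c(v) * total
    return coeffs
```

Padding with a value close to the active pixels keeps the block smooth, so most of the energy lands in the DC coefficient and the AC coefficients stay small. A production encoder would use a fast DCT rather than this direct quadruple loop.
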
MPEG-4 Video Object Segmentation-I
Construct a video object
  User selects a start frame and outlines a polygon designating the rough object boundary
  Refine the boundary using a snake algorithm, if needed
  Compute a k-pixel bounding box around the object
  Within the bounding box compute
    Edge map: bit plane, after thresholding the output of a convolution kernel
    Color map: compute luminance and chrominance, quantize by k-means clustering, keep the quantization table
    Motion field: block-based motion vectors
  Segment into regions with no significant edges, smooth color, and smooth motion
  Intersect the segments with the initial object boundary to determine foreground and background regions
  Estimate the motion of the regions in the next frame with an affine motion model

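The last step, fitting an affine motion model to a region, might look like the sketch below. It assumes the model x' = a1*x + a2*y + a3, y' = a4*x + a5*y + a6 and a least-squares fit to the block motion vectors inside the region; the slide does not say how the parameters are estimated, so the fitting method and the names solve_3x3, fit_affine, and apply_affine are illustrative assumptions.

```python
def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point set (collinear or too few points)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

def fit_affine(points, vectors):
    """Least-squares affine fit to block motion vectors.

    points:  (x, y) block centres inside one region
    vectors: (dx, dy) motion vector measured at each centre
    Returns [a1, a2, a3, a4, a5, a6].
    """
    # Both output coordinates share the same 3x3 normal-equation matrix.
    ata = [[0.0] * 3 for _ in range(3)]
    atx = [0.0] * 3
    aty = [0.0] * 3
    for (x, y), (dx, dy) in zip(points, vectors):
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                ata[i][j] += row[i] * row[j]
            atx[i] += row[i] * (x + dx)
            aty[i] += row[i] * (y + dy)
    return solve_3x3(ata, atx) + solve_3x3(ata, aty)

def apply_affine(params, point):
    """Project a point into the next frame with the fitted model."""
    a1, a2, a3, a4, a5, a6 = params
    x, y = point
    return (a1 * x + a2 * y + a3, a4 * x + a5 * y + a6)
```

Applying apply_affine to every pixel of a region gives its estimated position in the next frame, which is what the tracking stage uses as a starting point. At least three non-collinear block centres are needed for the fit to be well defined.
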
MPEG-4 Video Object Segmentation-II
Track object
  Locate the estimated positions of foreground and background regions from the previous frame. Call this the object mask.
  Generate the same three feature maps with the quantization table; requantize if the error is large
  Classify regions into foreground, background, and new regions
    Intersection ratio r with object mask
    For foreground regions, if r > 80% OR foreground mask, mark as foreground; label foreground - mask as new
    For new regions, if r 80% mark as foreground; else find nearest-motion-similar neighbor. If it is in the foreground, do previous step, else keep region as new
  Iterate until stable
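The intersection ratio test can be sketched as follows. The code assumes r is the fraction of a region's pixels that fall inside the object mask, which the slide does not spell out, and it covers only the r > 80% rule for foreground regions; the rule for new regions is not sketched. The names intersection_ratio and is_foreground are hypothetical.

```python
def intersection_ratio(region, mask):
    """Fraction of the region's pixels covered by the object mask.

    region and mask are sets of (x, y) pixel coordinates. Defining r as
    |region & mask| / |region| is an assumption made for this sketch.
    """
    if not region:
        return 0.0
    return len(region & mask) / len(region)

def is_foreground(region, mask, threshold=0.8):
    """Apply the r > 80% test from the slide to one tracked region."""
    return intersection_ratio(region, mask) > threshold
```

Representing regions as coordinate sets keeps the ratio a one-line set intersection; a real tracker would more likely use label images, but the test itself stays the same.
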