What is MPEG?

MPEG is the digital video compression standard developed by the Moving Picture Experts Group. It works best on natural video images (scenes). This tutorial describes general MPEG compression concepts for color images. MPEG actually consists of three types of data streams: video, audio, and system. Together, these three streams are intended to deliver a synchronized video and audio multimedia experience with few visible or audible artifacts.

To the real world, MPEG is a generic means of compactly representing digital video and audio signals for consumer distribution. The basic idea is to transform a stream of discrete samples into a bitstream of tokens which takes less space, but is just as filling to the eye (or ear). This "transformation," or better, representation, exploits perceptual and even some actual statistical redundancies. The orthogonal dimensions of the Video and Audio streams can be further linked with the Systems layer, MPEG's own means of keeping the data types synchronized and multiplexed in a common serial bitstream.

MPEG Components
Video Token Stream	Compressed Video Stream
Audio Token Stream	Compressed Audio Stream
System Token Stream	Synchronization Stream

Video Compression

In many ways the MPEG-2 standard follows directly from the pioneering work of the JPEG standards group. In fact, several non-standard "MPEG" data streams were merely sequences of JPEG encoded images. However, significant temporal redundancy remains in such an implementation. This redundancy is addressed in the MPEG-1 and MPEG-2 specifications.

Types of Frames

An MPEG data stream consists of up to three types of video frames: Intra ('I'), Predictive ('P'), and Bi-directional ('B'). An 'I' frame contains all the data required to completely recreate a single video image. A 'P' frame contains the differences between the current image and the most recent preceding 'I' or 'P' frame. Finally, 'B' frames contain the information needed to correct a video frame interpolated between the surrounding 'I' and 'P' frames.

A requirement on MPEG is that it must support functionality similar to that of a VCR or laserdisc player. This means that the bitstream and its encoding must support forward and reverse play along with fast forward and fast reverse. The industry seems to have settled upon a maximum latency, or update interval, of no more than 0.4 seconds. These requirements force the data stream to contain 'I' frames at some minimum rate to allow rapid resynchronization of the decoder logic.

A sequence of frames may consist of almost any pattern of I, P, and B pictures (there are a few minor semantic restrictions on their placement). It is common in industrial practice to have a fixed pattern (e.g. IBBPBBPBBPBBPBB). However, more advanced encoders will attempt to optimize the placement of the three picture types according to local sequence characteristics in the context of more global characteristics (or at least they claim to, because it makes them sound more advanced).

The sequence shown must be reordered for an actual decoder to function properly. This is because each P frame must arrive before the B frames that depend on it can be processed.
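
The reordering described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any MPEG specification: the function name `display_to_coded_order` and the frame labels are hypothetical, and the sketch assumes that every B frame simply waits for the next I or P frame in display order.

```python
def display_to_coded_order(display):
    """Reorder a display-order string of frame types into coded order.

    Each B frame is held back until the next I or P frame (its future
    reference) has been emitted, because the decoder needs that
    reference before it can reconstruct the B frame.
    """
    coded = []
    pending_b = []
    for index, frame_type in enumerate(display):
        label = f"{frame_type}{index}"
        if frame_type == "B":
            pending_b.append(label)      # wait for the future reference
        else:
            coded.append(label)          # the I or P reference goes first
            coded.extend(pending_b)      # then the B frames it unlocks
            pending_b.clear()
    # Trailing B frames have no future reference inside this fragment;
    # this sketch simply emits them at the end.
    coded.extend(pending_b)
    return coded


print(display_to_coded_order("IBBPBBP"))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

The numbers in the labels are display positions, so the output shows each P frame moving ahead of the two B frames that precede it on screen.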

Intra Frames

The image building block is the Intra or 'I' frame. This frame is essentially a JPEG encoded image that is created without using any past history. This frame is sent to the decoder to provide a known starting point for subsequent predictive and bi-directional frames. The standard JPEG encoding steps all apply: dividing the image into 8x8 blocks, performing a DCT, quantizing the result, reading out the coefficients in serpentine (zigzag) order, and entropy coding the result (Huffman, Shannon-Fano, or arithmetic coding). However, the quantization and encoding use different numbers of bits, non-linear quantization values, and variable encoding tables.
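
As a rough illustration of the block pipeline just described (8x8 DCT, quantization, serpentine readout), here is a small pure-Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, the DCT uses orthonormal scaling, and a single uniform step size stands in for the quantization matrices that real encoders use. Entropy coding is omitted.

```python
import math

N = 8  # block size


def dct_2d(block):
    """Forward 8x8 DCT-II with orthonormal scaling (direct, unoptimized)."""
    def scale(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coefficients, step):
    """Uniform scalar quantization (assumption: one step for all coefficients)."""
    return [[round(value / step) for value in row] for row in coefficients]


def zigzag_order():
    """(row, col) pairs in serpentine order, walking the anti-diagonals."""
    return sorted(
        ((r, c) for r in range(N) for c in range(N)),
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


# A flat gray block: all of its energy lands in the DC coefficient.
flat = [[128] * N for _ in range(N)]
levels = quantize(dct_2d(flat), step=16)
scan = [levels[r][c] for r, c in zigzag_order()]
print(scan[:4])  # [64, 0, 0, 0]
```

The long run of zeros after the first value is what makes the readout order useful: the entropy coder can describe the rest of the block very cheaply.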

Generalized MPEG Block Diagram

Discrete Cosine Transform

Serpentine Readout of DCT Coefficients
(Progressive and Interlaced)

Predicted Frames

Predicted frames are predicted from the most recently reconstructed I or P frame. Predicted frames can themselves serve as references for subsequent predicted frames. Each macroblock in a predicted frame can be encoded either as a motion vector plus DCT-coded differences, or intra coded just like the macroblocks of an I frame.
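
To make the motion vector idea concrete, the following sketch performs an exhaustive block-matching search using the sum of absolute differences (SAD) as the cost. It is only an illustration: the function names, the 4x4 block size, and the tiny search range are assumptions chosen to keep the example small (MPEG macroblocks are 16x16), and real encoders use far more elaborate search strategies.

```python
def sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences between a block and a displaced reference block."""
    return sum(
        abs(cur[top + r][left + c] - ref[top + dy + r][left + dx + c])
        for r in range(size)
        for c in range(size)
    )


def best_motion_vector(cur, ref, top, left, size=4, search=2):
    """Try every displacement within +/-search and return (cost, dy, dx)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that would read outside the reference frame.
            if not (0 <= top + dy and top + dy + size <= height
                    and 0 <= left + dx and left + dx + size <= width):
                continue
            cost = sad(cur, ref, top, left, dy, dx, size)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


# Reference frame with a simple gradient; the current frame is the same
# picture shifted one sample to the right (left edge repeated).
ref = [[r * 8 + c for c in range(8)] for r in range(8)]
cur = [[ref[r][max(c - 1, 0)] for c in range(8)] for r in range(8)]

print(best_motion_vector(cur, ref, top=2, left=2))  # (0, 0, -1)
```

A cost of zero means the prediction is perfect, so only the vector (0, -1) would need to be sent for this block; in general the residual difference is DCT coded as the text describes.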

Prediction with and without motion compensation

Skipped Macroblocks in P frames

Prediction of Frames

Bi-directional Frames

Bi-directional frames are created from the closest two I or P frames, one from the past and one from the future. In each of these two frames, search for the block that best matches the block to be encoded. Now, test the compression achieved using the following three methods: use the forward vector, the backward vector, or the average of the two blocks. If none of the three methods produces acceptable results, intra code the macroblock.
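
The three-way test described above might look like the following sketch. It is a simplified stand-in rather than an encoder's actual decision logic: the function names and the `intra_threshold` parameter are hypothetical, blocks are flattened to plain lists of samples, and the sum of absolute differences is used as a proxy for "compression achieved".

```python
def block_error(target, prediction):
    """Sum of absolute differences between two equally sized sample lists."""
    return sum(abs(t - p) for t, p in zip(target, prediction))


def choose_b_mode(target, forward_pred, backward_pred, intra_threshold):
    """Pick forward, backward, or interpolated prediction, else fall back to intra."""
    # Average of the two predictions, rounded to the nearest integer.
    average_pred = [(f + b + 1) // 2
                    for f, b in zip(forward_pred, backward_pred)]
    errors = {
        "forward": block_error(target, forward_pred),
        "backward": block_error(target, backward_pred),
        "interpolated": block_error(target, average_pred),
    }
    mode = min(errors, key=errors.get)
    if errors[mode] > intra_threshold:
        return "intra", None       # no prediction is acceptable
    return mode, errors[mode]


target = [10, 20, 30, 40]
print(choose_b_mode(target, [8, 18, 28, 38], [12, 22, 32, 42], intra_threshold=50))
# ('interpolated', 0)
```

In this example each single-direction prediction is off by 2 per sample, but their average matches the target exactly, which is the typical situation for an object moving steadily between the two reference frames.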

The process of bi-directional frame correction has been referred to as "digital spackle," as it consists of taking an interpolated image and applying corrections (spackle) where the image deviates from nominal by an unacceptable amount. Some discussion centers around the advisability of using transmission channel bandwidth to send information that is useful to only one picture and cannot be used to enhance temporally proximate pictures.

Forward and Backward predictions for Bi-directional frames

How does MPEG achieve compression?

Here are some typical statistical conditions addressed by specific syntax and semantic tools:

1. Spatial correlation: transform coding with 8x8 DCT.

2. Human Visual Response---less acuity for higher spatial frequencies: lossy scalar quantization of the DCT coefficients.

3. Correlation across wide areas of the picture: prediction of the DC coefficient in the 8x8 DCT block.

4. Statistically more likely coded bitstream elements/tokens: variable length coding of macroblock_address_increment, macroblock_type, coded_block_pattern, motion vector prediction error magnitude, DC coefficient prediction error magnitude.

5. Quantized blocks with sparse quantized matrix of DCT coefficients: end_of_block token (variable length symbol).

6. Spatial masking: macroblock quantization scale factor.

7. Local coding adapted to overall picture perception (content dependent coding): macroblock quantization scale factor.

8. Adaptation to local picture characteristics: block based coding, macroblock_type, adaptive quantization.

9. Constant stepsizes in adaptive quantization: new quantization scale factor signaled only by special macroblock_type codes. (adaptive quantization scale not transmitted by default).

10. Temporal redundancy: forward, backwards macroblock_type and motion vectors at macroblock (16x16) granularity.

11. Perceptual coding of macroblock temporal prediction error: adaptive quantization and quantization of DCT transform coefficients (same mechanism as Intra blocks).

12. Low quantized macroblock prediction error: "No prediction error" for the macroblock may be signaled within macroblock_type. This is the macroblock_pattern switch.

13. Finer granularity coding of macroblock prediction error: Each of the blocks within a macroblock may be coded or not coded. Selective on/off coding of each block is achieved with the separate coded_block_pattern variable-length symbol, which is present in the macroblock only if the macroblock_pattern switch has been set.

14. Uniform motion vector fields (smooth optical flow fields): prediction of motion vectors.

15. Occlusion: forwards or backwards temporal prediction in B pictures. Example: an object becomes temporarily obscured by another object within an image sequence. As a result, there may be an area of samples in a previous picture (forward reference/prediction picture) which has similar energy to a macroblock in the current picture (thus it is a good prediction), but no areas within a future picture (backward reference) are similar enough. Therefore only forwards prediction would be selected by macroblock type of the current macroblock. Likewise, a good prediction may only be found in a future picture, but not in the past. In most cases, the object, or correlation area, will be present in both forward and backward references. macroblock_type can select the best of the three combinations.

16. Sub-sample temporal prediction accuracy: bi-linearly interpolated (filtered) "half-pel" block predictions. Real world motion displacements of objects (correlation areas) from picture-to-picture do not fall on integer pel boundaries, but at arbitrary fractional positions. Half-pel interpolation attempts to extract the true object displacement to within one order of approximation, often improving compression efficiency by at least 1 dB.

17. Limited motion activity in P pictures: skipped macroblocks. A macroblock is skipped when the motion vector is zero for both the horizontal and vertical vector components, and no quantized prediction error for the current macroblock is present. Skipped macroblocks are the most desirable element in the bitstream since they consume no bits, except for a slight increase in the bits of the next non-skipped macroblock.

18. Co-planar motion within B pictures: skipped macroblocks. A macroblock is skipped when the motion vector is the same as the previous macroblock's, and no quantized prediction error for the current macroblock is present.
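
Item 16 in the list above (half-pel prediction) is easy to show in code. The sketch below samples a frame at coordinates given in half-pel units by averaging the neighboring integer samples. It is an illustrative assumption of how such bilinear interpolation can be written, with a hypothetical function name; the caller is responsible for keeping the coordinates inside the frame.

```python
def half_pel_sample(frame, y2, x2):
    """Sample `frame` at (y2 / 2, x2 / 2), with coordinates in half-pel units.

    Integer positions return the stored sample. A half position in one
    direction averages two neighbors; a half position in both directions
    averages four. Results are rounded to the nearest integer.
    """
    y0, x0 = y2 // 2, x2 // 2
    y1 = y0 + (y2 % 2)   # same row when y2 is even, next row when odd
    x1 = x0 + (x2 % 2)   # same column when x2 is even, next column when odd
    total = frame[y0][x0] + frame[y0][x1] + frame[y1][x0] + frame[y1][x1]
    return (total + 2) // 4


frame = [[10, 20],
         [30, 40]]

print(half_pel_sample(frame, 0, 0))  # 10  (integer position)
print(half_pel_sample(frame, 0, 1))  # 15  (halfway between 10 and 20)
print(half_pel_sample(frame, 1, 0))  # 20  (halfway between 10 and 30)
print(half_pel_sample(frame, 1, 1))  # 25  (center of all four samples)
```

Summing four terms in every case keeps the function branch-free: at integer positions the same sample is counted four times, so the division by four returns it unchanged.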

Typical coded sizes for the MPEG frames

Typical bit sizes for the three different picture types:

Level	I	P	B	Average
30 Hz SIF @ 1.15 Mbit/sec	150,000	50,000	20,000	38,000
30 Hz CCIR 601 @ 4 Mbit/sec	400,000	200,000	80,000	130,000

Note: the above example is taken from a standard test sequence coded by the Test Model method, with an I frame distance of 15 (N = 15), and a P frame distance of 3 (M = 3).

Of course, among differing source material, scene changes, and use of advanced encoder models these numbers can be significantly different.
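
The "Average" column in the table above can be cross-checked with a little arithmetic. With N = 15 and M = 3, each group of 15 frames holds 1 I frame, 4 P frames, and 10 B frames. The sketch below (function names are hypothetical) computes the weighted average from the per-type sizes and compares it with the bit rate divided by the frame rate. Because the table entries are rounded, the two figures agree only approximately.

```python
def gop_counts(n, m):
    """Number of I, P, and B frames in a group with I distance n and P distance m."""
    i_frames = 1
    p_frames = n // m - 1
    b_frames = n - i_frames - p_frames
    return i_frames, p_frames, b_frames


def average_frame_bits(n, m, i_bits, p_bits, b_bits):
    """Weighted average coded frame size over one group of pictures."""
    i_frames, p_frames, b_frames = gop_counts(n, m)
    return (i_frames * i_bits + p_frames * p_bits + b_frames * b_bits) / n


print(gop_counts(15, 3))  # (1, 4, 10)

# 30 Hz SIF at 1.15 Mbit/sec
print(round(average_frame_bits(15, 3, 150_000, 50_000, 20_000)))  # 36667
print(round(1_150_000 / 30))                                      # 38333

# 30 Hz CCIR 601 at 4 Mbit/sec
print(round(average_frame_bits(15, 3, 400_000, 200_000, 80_000)))  # 133333
print(round(4_000_000 / 30))                                       # 133333
```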

Optimal MPEG-2 video bitrates

The Test subgroup has defined a few example "Sweet spot" sampling dimensions and bit rates for MPEG-2:

Dimensions	Coded rate	Application
352x480x24 Hz (progressive)	2 Mbit/sec	Equivalent to VHS quality. Intended for film source video. Half horizontal 601 (HHR). Looks almost broadcast NTSC quality.
544x480x30 Hz (interlaced)	4 Mbit/sec	PAL broadcast quality (nearly full capture of 5.4 MHz luminance signal). 544 samples matches the width of a 4:3 picture windowed within 720 sample/line 16:9 aspect ratio via pan and scan.
704x480x30 Hz (interlaced)	6 Mbit/sec	Full CCIR 601 sampling dimensions.

These numbers may be too ambitious. Bit rates of 3, 6, and 8 Mbit/sec respectively provide transparent quality for the above application examples when generated by a reasonably sophisticated encoder.
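
One way to compare these operating points is to convert each one to coded bits per pixel. The sketch below does that for the three "sweet spot" rows. The function names are hypothetical, and the compression ratio relies on an assumption that is not stated in the table: that the uncompressed source is 8-bit 4:2:0 video, which averages 12 bits per pixel.

```python
def bits_per_pixel(width, height, frame_rate, bit_rate):
    """Coded bits spent on each luminance sample position, on average."""
    return bit_rate / (width * height * frame_rate)


def compression_ratio(width, height, frame_rate, bit_rate, raw_bits_per_pixel=12):
    """Ratio of raw to coded size (assumption: 8-bit 4:2:0 source, 12 bits/pixel)."""
    return raw_bits_per_pixel / bits_per_pixel(width, height, frame_rate, bit_rate)


sweet_spots = [
    (352, 480, 24, 2_000_000),
    (544, 480, 30, 4_000_000),
    (704, 480, 30, 6_000_000),
]

for width, height, rate, bits in sweet_spots:
    bpp = bits_per_pixel(width, height, rate, bits)
    ratio = compression_ratio(width, height, rate, bits)
    print(f"{width}x{height}x{rate} Hz: {bpp:.2f} bits/pixel, about {ratio:.0f}:1")
```

All three rows come out at roughly half a bit per pixel, which suggests the example rates were chosen to hold the per-pixel budget approximately constant as the picture size grows.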

Typical picture sizes and their associated applications

352 x 240	SIF. CD WhiteBook Movies, video games.
352 x 480	HHR. VHS equivalent
480 x 480	Bandlimited (4.2 MHz) broadcast NTSC.
544 x 480	Laserdisc, D-2, Bandlimited PAL/SECAM.
640 x 480	Square pixel NTSC
720 x 480	CCIR 601. Studio D-1. Upper limit of Main Level.

Prepared by Bob Clodfelter, last updated 28 July 1998.