MPEG is the digital video compression standard developed by the Moving Picture
Experts Group. It works best on natural video images (scenes). This tutorial describes
general MPEG compression concepts for color images. MPEG actually consists of three types
of data streams: video, audio, and system. The combined effect of these three streams is
to produce an artifact-free video and audio multimedia experience.
To the real world, MPEG is a generic means of compactly representing digital video
and audio signals for consumer distribution. The basic idea is to transform a stream of
discrete samples into a bitstream of tokens which takes less space, but is just as filling
to the eye (or ear). This "transformation," or better representation,
exploits perceptual and even some actual statistical redundancies. The orthogonal
dimensions of Video and Audio streams can be further linked with the Systems
layer---MPEG's own means of keeping the data types synchronized and multiplexed in a
common serial bitstream.
MPEG Components | |
Video Token Stream | Compressed Video Stream |
Audio Token Stream | Compressed Audio Stream |
System Token Stream | Synchronization Stream |
In many ways the MPEG-2 standard follows directly from the pioneering work of the JPEG standards group. In fact, several non-standard "MPEG" data streams were merely sequences of JPEG-encoded images. However, significant temporal redundancy remains in such an implementation. This redundancy is addressed in the MPEG-1 and MPEG-2 specifications.
An MPEG data stream consists of up to three types of video frames: Intra ('I'), Predictive ('P'), and Bi-directional ('B'). An I frame contains all the data required to completely recreate a single video image. A P frame contains the differences between the current image and the most recent preceding I or P frame. Finally, a B frame contains the information needed to correct a frame interpolated from the nearest I or P frames before and after it.
A requirement on MPEG is that it must support functionality similar to that of a VCR or laser disc player. This means that the bitstream and its encoding must support forward and reverse play along with fast forward and fast reverse. The industry seems to have settled upon a maximum latency, or update interval, of no more than 0.4 seconds. These requirements force the data stream to contain a minimum number of 'I' frames to allow rapid resynchronization of the decoder logic.
A sequence of frames may consist of almost any pattern of I, P, and B pictures (there are a few minor semantic restrictions on their placement). It is common in industrial practice to use a fixed pattern (e.g. IBBPBBPBBPBBPBB). More advanced encoders, however, will attempt to optimize the placement of the three picture types according to local sequence characteristics in the context of more global characteristics (or at least they claim to, because it makes them sound more advanced).
The sequence shown is in display order and must be reordered before an actual decoder can process it.
This is because the I or P frame that follows a run of B frames must arrive before those B frames can be decoded.
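The reordering can be illustrated with a minimal Python sketch. The function name `display_to_coded_order` and the `I0`, `B1`, ... labels are illustrative assumptions, not part of any MPEG syntax; the sketch only models the rule that each B picture is held back until the following I or P picture has been sent.

```python
def display_to_coded_order(display):
    """Reorder picture types from display order to coded (transmission) order.

    Each B picture is held back until the next I or P picture, which it
    needs as its future reference, has been emitted.
    """
    coded = []
    pending_b = []
    for index, picture_type in enumerate(display):
        label = f"{picture_type}{index}"  # e.g. "B2" = B picture, display slot 2
        if picture_type == "B":
            pending_b.append(label)
        else:
            # An I or P picture goes out first, then the B pictures that
            # precede it in display order.
            coded.append(label)
            coded.extend(pending_b)
            pending_b = []
    # Trailing B pictures have no future reference inside this fragment;
    # in a real stream they would wait for the next I or P picture.
    coded.extend(pending_b)
    return coded


print(display_to_coded_order("IBBPBBP"))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

The decoder reverses this: it decodes pictures in the coded order and displays them by their original index.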
The image building block is the Intra or 'I' frame. This frame is essentially a JPEG-encoded image that is created without using any past history. It is sent to the decoder to provide a known starting point for subsequent predictive and bi-directional frames. The standard JPEG encoding steps are all used: dividing the image into 8x8 blocks, performing a DCT, quantizing the result, reading out the coefficients in serpentine order, and entropy coding (Huffman, Shannon-Fano, or arithmetic). However, the quantization and encoding use different numbers of bits, non-linear quantization values, and variable encoding tables.
Generalized MPEG Block Diagram
Discrete Cosine Transform
Serpentine Readout of DCT Coefficients (Progressive and Interlaced)
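The intra coding steps described above (8x8 DCT, quantization, serpentine readout) can be sketched in plain Python. This is a simplified illustration under stated assumptions: it uses a single hypothetical quantizer step (`step=16`) rather than a real quantization matrix, shows only the zigzag scan usually associated with progressive material, and stops before entropy coding. All function names are made up for this example.

```python
import math

N = 8  # block size used by the DCT


def dct_2d(block):
    """Forward 8x8 DCT-II with orthonormal scaling (direct, unoptimized)."""
    def scale(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * scale(u) * scale(v) * total
    return out


def quantize(coefficients, step=16):
    """Uniform scalar quantization with one step size (an assumption)."""
    return [[round(c / step) for c in row] for row in coefficients]


def zigzag_order():
    """Return the (row, column) visiting order of the zigzag scan."""
    order = []
    for s in range(2 * N - 1):  # s = row + column, one anti-diagonal at a time
        rows = range(max(0, s - N + 1), min(s, N - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)  # even diagonals run bottom-left to top-right
        order.extend((u, s - u) for u in rows)
    return order


def trim_trailing_zeros(values):
    """Drop the zero run at the end, where an end-of-block code would go."""
    end = len(values)
    while end > 0 and values[end - 1] == 0:
        end -= 1
    return values[:end]


flat_block = [[128] * N for _ in range(N)]  # a uniform gray block
levels = quantize(dct_2d(flat_block))
scanned = [levels[u][v] for u, v in zigzag_order()]
print(trim_trailing_zeros(scanned))  # [64]
```

A uniform block collapses to a single DC level, which is why smooth image areas cost so few bits after the scan.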
Predicted frames are predicted from the most recently reconstructed I or P frame. A predicted frame can in turn serve as the reference for subsequent predicted frames. Each macroblock in a predicted frame can be encoded either as a motion vector plus DCT-coded differences, or intra coded just like the macroblocks of an I frame.
Prediction with and without motion compensation
Skipped Macroblocks in P frames
Prediction of Frames
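How an encoder finds the motion vector for a macroblock is not specified by MPEG; a common textbook approach is exhaustive block matching. The sketch below assumes a sum-of-absolute-differences (SAD) cost, a small 4x4 block, and a +/-2 pel search window purely to keep the example short. The function names and the synthetic test picture are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block in cur and one in ref."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))


def best_motion_vector(cur, ref, cx, cy, size=4, search=2):
    """Full search for the offset (dx, dy) into ref that best predicts
    the block whose top-left corner is (cx, cy) in cur."""
    height, width = len(ref), len(ref[0])
    best = (0, 0)
    best_cost = sad(cur, ref, cx, cy, cx, cy, size)  # zero-motion candidate
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that would read outside the reference picture.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost


# Synthetic 12x12 reference picture, and a current picture in which the
# content has moved one pel to the right.
WIDTH = HEIGHT = 12
reference = [[(7 * x + 13 * y) % 31 for x in range(WIDTH)] for y in range(HEIGHT)]
current = [[reference[y][(x - 1) % WIDTH] for x in range(WIDTH)] for y in range(HEIGHT)]

print(best_motion_vector(current, reference, cx=4, cy=4))  # ((-1, 0), 0)
```

The vector (-1, 0) says the best match lies one pel to the left in the reference picture; the residual cost of 0 means no DCT-coded difference would be needed for this block.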
Bi-directional frames are created from the closest two I or P frames, one from the past and one from the future. Using these two frames, search each one for the block that best matches the block to be encoded. Then test the compression achieved using each of the following three methods: the forward vector, the backward vector, or the average of the two blocks. If none of the three methods produces acceptable results, intra code the macroblock.
The process of bi-directional frame correction has been referred to as "digital spackle," as it consists of taking an interpolated image and applying corrections (spackle) where the image deviates from nominal by an unacceptable amount. Some discussion centers on the advisability of using transmission channel bandwidth to send information that is useful to only one picture and cannot be used to enhance temporally proximate pictures.
Forward and Backward predictions for Bi-directional frames
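The three-way test for B-frame macroblocks can be sketched as follows. This is an illustration under assumptions: prediction quality is approximated by a sum of absolute differences instead of actually measuring the bits produced, blocks are flat lists of samples, and `intra_threshold` is a hypothetical tuning parameter that stands in for "acceptable results".

```python
def block_error(block, prediction):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(a - b) for a, b in zip(block, prediction))


def choose_b_mode(block, forward_pred, backward_pred, intra_threshold):
    """Pick forward, backward, or interpolated prediction, else intra."""
    # Average of the two best-matching blocks, rounded to nearest.
    average_pred = [(f + b + 1) // 2 for f, b in zip(forward_pred, backward_pred)]
    candidates = {
        "forward": forward_pred,
        "backward": backward_pred,
        "interpolated": average_pred,
    }
    mode = min(candidates, key=lambda name: block_error(block, candidates[name]))
    error = block_error(block, candidates[mode])
    if error > intra_threshold:
        return "intra", None  # no prediction is good enough
    return mode, error


block = [10, 20, 30, 40]
print(choose_b_mode(block, [8, 18, 28, 38], [12, 22, 32, 42], intra_threshold=50))
# ('interpolated', 0)
print(choose_b_mode(block, [100] * 4, [200] * 4, intra_threshold=50))
# ('intra', None)
```

In the first call the past block is slightly too dark and the future block slightly too bright, so their average predicts the current block exactly.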
Here are some typical statistical conditions addressed by specific syntax and
semantic tools:
1. Spatial correlation: transform coding with 8x8 DCT.
2. Human Visual Response---less acuity for higher spatial frequencies: lossy
scalar quantization of the DCT coefficients.
3. Correlation across wide areas of the picture: prediction of the DC coefficient
in the 8x8 DCT block.
4. Statistically more likely coded bitstream elements/tokens: variable length
coding of macroblock_address_increment, macroblock_type, coded_block_pattern, motion
vector prediction error magnitude, DC coefficient prediction error magnitude.
5. Quantized blocks with sparse quantized matrix of DCT coefficients: end_of_block token
(variable length symbol).
6. Spatial masking: macroblock quantization scale factor.
7. Local coding adapted to overall picture perception (content dependent coding):
macroblock quantization scale factor.
8. Adaptation to local picture characteristics: block based coding,
macroblock_type, adaptive quantization.
9. Constant stepsizes in adaptive quantization: new quantization scale factor
signaled only by special macroblock_type codes. (adaptive quantization scale not
transmitted by default).
10. Temporal redundancy: forward, backwards macroblock_type and motion vectors at
macroblock (16x16) granularity.
11. Perceptual coding of macroblock temporal prediction error: adaptive
quantization and quantization of DCT transform coefficients (same mechanism as Intra
blocks).
12. Low quantized macroblock prediction error: "No prediction error" for
the macroblock may be signaled within macroblock_type. This is the macroblock_pattern switch.
13. Finer granularity coding of macroblock prediction error: Each of the blocks
within a macroblock may be coded or not coded. Selective on/off coding of each block is
achieved with the separate coded_block_pattern variable-length symbol, which is
present in the macroblock only if the macroblock_pattern switch has been set.
14. Uniform motion vector fields (smooth optical flow fields): prediction of
motion vectors.
15. Occlusion: forwards or backwards temporal prediction in B pictures. Example:
an object becomes temporarily obscured by another object within an image sequence. As a
result, there may be an area of samples in a previous picture (forward
reference/prediction picture) which has similar energy to a macroblock in the current
picture (thus it is a good prediction), but no areas within a future picture (backward
reference) are similar enough. Therefore only forwards prediction would be selected by
macroblock type of the current macroblock. Likewise, a good prediction may only be found
in a future picture, but not in the past. In most cases, the object, or correlation area,
will be present in both forward and backward references. macroblock_type can select the
best of the three combinations.
16. Sub-sample temporal prediction accuracy: bi-linearly interpolated (filtered)
"half-pel" block predictions. Real world motion displacements of objects
(correlation areas) from picture-to-picture do not fall on integer pel boundaries, but at
irrational offsets. Half-pel interpolation attempts to extract the true object motion to within one
order of approximation, often improving compression efficiency by at least 1 dB.
17. Limited motion activity in P pictures: skipped macroblocks. A macroblock is skipped when the motion
vector is zero for both the horizontal and vertical vector components, and no quantized
prediction error for the current macroblock is present. Skipped macroblocks are the most
desirable element in the bitstream since they consume no bits, except for a slight
increase in the bits of the next non-skipped macroblock.
18. Co-planar motion within B pictures: skipped macroblocks. A macroblock is skipped when the motion
vector is the same as the previous macroblock's, and no quantized prediction error for the
current macroblock is present.
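As an illustration of item 16, the sketch below computes one bi-linearly interpolated half-pel sample. It assumes non-negative coordinates expressed in half-pel units, a reference picture stored as a list of rows, and round-to-nearest integer averaging; the function name is hypothetical.

```python
def half_pel_sample(ref, x2, y2):
    """Return the prediction sample at (x2 / 2, y2 / 2) in picture ref.

    x2 and y2 are in half-pel units, so odd values fall between samples.
    """
    x, y = x2 // 2, y2 // 2          # integer-pel part
    half_x, half_y = x2 % 2, y2 % 2  # 1 when the position is between pels
    a = ref[y][x]
    if not half_x and not half_y:
        return a                                    # on an integer pel
    if half_x and not half_y:
        return (a + ref[y][x + 1] + 1) // 2         # between two columns
    if half_y and not half_x:
        return (a + ref[y + 1][x] + 1) // 2         # between two rows
    # Centre of four neighbouring pels.
    return (a + ref[y][x + 1] + ref[y + 1][x] + ref[y + 1][x + 1] + 2) // 4


picture = [[10, 20],
           [30, 40]]
print(half_pel_sample(picture, 1, 0))  # 15
print(half_pel_sample(picture, 1, 1))  # 25
```

A half-pel motion vector applies this interpolation to every sample of the predicted block, which acts as a mild low-pass filter on the prediction as well as a finer motion estimate.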
Typical bit sizes for the three different picture types:
Level | I | P | B | Average |
30 Hz SIF @ 1.15 Mbit/sec | 150,000 | 50,000 | 20,000 | 38,000 |
30 Hz CCIR 601 @ 4 Mbit/sec | 400,000 | 200,000 | 80,000 | 130,000 |
Note: the above example is taken from a standard test sequence coded by the Test
Model method, with an I frame distance of 15 (N = 15), and a P frame distance of 3 (M =
3).
Of course, with differing source material, scene changes, and the use of advanced
encoder models, these numbers can differ significantly.
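The Average column can be cross-checked with a little arithmetic. The sketch below assumes that N is a multiple of M and that each group of N pictures holds exactly one I picture; the function names are illustrative. It compares the mean picture size implied by the table's I, P, and B figures with the bit rate divided by the picture rate.

```python
def gop_picture_counts(n, m):
    """Number of I, P, and B pictures in a group of n pictures with a
    reference picture every m pictures (assumes n is a multiple of m)."""
    i_count = 1
    p_count = n // m - 1
    b_count = n - n // m
    return i_count, p_count, b_count


def gop_average_bits(sizes, n, m):
    """Mean bits per picture over one group, from per-type picture sizes."""
    i_count, p_count, b_count = gop_picture_counts(n, m)
    total = i_count * sizes["I"] + p_count * sizes["P"] + b_count * sizes["B"]
    return total / n


def rate_average_bits(bit_rate, picture_rate):
    """Mean bits per picture implied by the channel rate alone."""
    return bit_rate / picture_rate


sif = {"I": 150_000, "P": 50_000, "B": 20_000}
ccir601 = {"I": 400_000, "P": 200_000, "B": 80_000}

print(gop_picture_counts(15, 3))                 # (1, 4, 10)
print(round(gop_average_bits(sif, 15, 3)))       # 36667
print(round(rate_average_bits(1_150_000, 30)))   # 38333
print(round(gop_average_bits(ccir601, 15, 3)))   # 133333
print(round(rate_average_bits(4_000_000, 30)))   # 133333
```

The CCIR 601 row is self-consistent, while the SIF row differs slightly between the two calculations, which is expected since the per-type sizes in the table are rounded typical values.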
The Test subgroup has defined a few example "Sweet spot" sampling
dimensions and bit rates for MPEG-2:
Dimensions | Coded rate | Application |
352x480x24 Hz (progressive) | 2 Mbit/sec | Equivalent to VHS quality. Intended for film source video. Half horizontal 601 (HHR). Looks almost broadcast NTSC quality. |
544x480x30 Hz (interlaced) | 4 Mbit/sec | PAL broadcast quality (nearly full capture of 5.4 MHz luminance signal). 544 samples matches the width of a 4:3 picture windowed within a 720 sample/line 16:9 aspect ratio via pan & scan. |
704x480x30 Hz (interlaced) | 6 Mbit/sec | Full CCIR 601 sampling dimensions. |
These numbers may be too ambitious. Bit rates of 3, 6, and 8 Mbit/sec respectively provide transparent quality for the above application examples when generated by a reasonably sophisticated encoder.
Dimensions | Application |
352 x 240 | SIF. CD WhiteBook Movies, video games. |
352 x 480 | HHR. VHS equivalent. |
480 x 480 | Bandlimited (4.2 MHz) broadcast NTSC. |
544 x 480 | Laserdisc, D-2, Bandlimited PAL/SECAM. |
640 x 480 | Square pixel NTSC. |
720 x 480 | CCIR 601. Studio D-1. Upper limit of Main Level. |
Prepared by Bob Clodfelter, last updated 28 July 1998.