Digital Video Compression

Digital video compression is a method of maximising picture quality within the network, video storage and processing capacity available to a system. This has always been a trade-off, which is explained below. Let’s start by taking a look at CIF (Common Intermediate Format), which was first defined in the H.261 standard in 1988.

This standard was used in early video conferencing devices, and in DVRs, NVRs and IP encoders from the late 1990s and early 2000s, to compress analogue CCTV pictures into a digital format for recording. Given the limitations of the equipment of the time (in 1998, for instance, the largest hard drive available at retail was around 16 GB), small resolutions forced a tough compromise between picture quality and recording time.

CIF images are presented in a standardised format for the picture resolution, frame rate, colour space and colour subsampling of digital video sequences. As the word "common" in its name implies, CIF was designed as a common compromise format to be relatively easy to convert for use either with PAL or NTSC television standard displays and cameras.

CIF defines a video sequence with a resolution of 352 × 288 pixels, which has a simple relationship to the PAL picture size (normally supplied at 25 frames per second), but with a frame rate of almost 30 frames per second like NTSC (which has a different picture size). The compromise was established as a way to reach international agreement so that video conferencing systems in different countries could communicate with each other without needing two separate modes for displaying the received video.

Since H.261 systems typically operated at very low bit rates, they also tended to run at low frame rates by skipping many of the camera’s source frames, so dropping further frames when converting to PAL or NTSC for display on a monitor was not particularly noticeable. Bearing in mind that the typical analogue CCTV ‘multiplexer’ of the time provided reasonable quality images at a rate of 1 frame per second per camera onto a video tape, being able to record several cameras at up to 3 frames per second at reasonable resolution onto a hard drive seemed a clear improvement. Of course, at this point the technology hadn’t quite caught up with the aspirations of installers and clients, and so further improvements were sought.

Low resolution images are fine in some cases but not in others, and it is a difficult choice to make. Below are some of the lowest available image resolutions, with ‘real-time’ (25 frames per second) bit rates for a single camera stream; this illustrates why 1 to 3 frames per second was the practical limit:

Format                 Resolution (pixels)   Bit rate at 25 fps, 12-bit colour*
QCIF (Quarter CIF)     176 x 144             7.6 Mbit/s
CIF (Full CIF, FCIF)   352 x 288             30.4 Mbit/s
4CIF (4 x CIF)         704 x 576             121.6 Mbit/s
D1                     720 x 576             124.4 Mbit/s

*12-bit colour sampling is an older standard allowing just 4,096 colours to be represented. With H.261 compression methods, few systems of the time could handle many (if any at all) 4CIF or D1 image streams.
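
As a rough check, the bit rates in the table can be reproduced with simple arithmetic: raw bit rate = width x height x bits per pixel x frames per second. The short Python sketch below is purely illustrative (the function name and rounding are our own, not part of any standard):

```python
def raw_bitrate_mbit(width, height, bits_per_pixel=12, fps=25):
    """Uncompressed bit rate in Mbit/s: pixels x bits per pixel x frames per second."""
    return width * height * bits_per_pixel * fps / 1_000_000

formats = {"QCIF": (176, 144), "CIF": (352, 288), "4CIF": (704, 576), "D1": (720, 576)}
for name, (w, h) in formats.items():
    print(f"{name}: {raw_bitrate_mbit(w, h):.1f} Mbit/s")
# QCIF: 7.6, CIF: 30.4, 4CIF: 121.7, D1: 124.4 - matching the table to within rounding
```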

So how are digital images produced?

The image information picked up by the camera’s CCD sensor is converted into pixels. The pixel information, such as brightness and colour, is encoded as binary numbers. The more bits per pixel that are allowed for colour sampling, the more colours can be represented:
- 8 bits provides a set of 256 colours
- 12 bits provides a set of 4096 colours
- 16 bits provides a set of 65,536 colours
- 24 bits provides a set of 16.77 million colours

More bits = more colours = higher image fidelity, but larger bitrates
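
The colour counts above are simply two raised to the number of bits per pixel, which can be checked in one line (illustrative only):

```python
for bits in (8, 12, 16, 24):
    print(f"{bits} bits per pixel -> {2 ** bits:,} colours")  # 256, 4,096, 65,536, 16,777,216
```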

Digital versus Analogue

From the outset, digital video systems aimed to compete directly with traditional analogue CCTV. As the equipment has improved in quality and capability and full IP systems have emerged, IP digital video has become the equal of any traditional CCTV solution.

Newer video standards and more capable equipment allow HD and UHD (4K) images, and the higher the image resolution, the better the image definition. But these higher resolutions demand much more data storage and, inevitably, much higher network bandwidth.

Resolution: More pixels = Better image quality
More pixels provide a higher resolution and therefore a better image, but that is only half the story. It also depends on whether black-and-white images are sufficient or full colour is required and, if colour is preferred, how many colours. This affects not only the number of pixels needed for a high quality display, but also the number of bits per pixel required to produce those colours. All of this affects the processing time to produce each pixel and the time taken to transmit it.

If, typically, 65,536 colours are required - which is pretty much a minimum these days - then each pixel will need 16 binary digits (bits). Halving the number of bits would speed everything up considerably, but would reduce the number of colours available to just 256. To get around this roadblock, images are compressed to some extent.

Uncompressed versus compressed images

Analogue CCTV images are uncompressed and therefore no loss of definition is experienced. Digital images are normally compressed by a number of standard or proprietary methods. Typically there are two sorts of digital compression methods:

Lossless compression - like ZIP, RAR, etc. which are not efficient for video and not fast enough for network use.

Lossy compression like H.261, JPEG, MJPEG, MPEG-4, H.264, H.265, etc. which is used to reduce latency, bandwidth usage and storage requirements.

Uncompressed images: Okay for individual images
If we take a standard digital camera, like a modern mobile phone camera, using a relatively low image resolution of 1024 x 768 pixels with 65,536-colour capability, then each individual uncompressed image requires just over 1.5 Megabytes of storage, calculated thus:

1024 x 768 = 786,432 pixels
786,432 pixels x 16 bits per pixel = 12,582,912 bits
12,582,912 bits / 8 bits per Byte = 1,572,864 Bytes (1.5 Megabytes)

Dealing with this is a simple task for still images stored on the memory card of an SLR or even a mobile phone camera. However, the storage required when an uncompressed image of this quality repeats 25 times per second is huge:

37.5 Megabytes for one second
2.2 Gigabytes for one minute
132 Gigabytes for one hour
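
As a sketch, the storage arithmetic above can be reproduced as follows (assuming 1 MB = 1024 x 1024 bytes and 1 GB = 1024 MB, consistent with the 1.5 MB per-frame figure; the variable names are purely illustrative):

```python
# Uncompressed storage for a 1024 x 768, 16-bit colour, 25 fps stream
frame_bytes = 1024 * 768 * 16 // 8     # 1,572,864 bytes (~1.5 MB) per frame
per_second = frame_bytes * 25          # ~37.5 MB
per_minute = per_second * 60           # ~2.2 GB
per_hour = per_minute * 60             # ~132 GB

MB, GB = 1024 ** 2, 1024 ** 3
print(f"{per_second / MB:.1f} MB per second")
print(f"{per_minute / GB:.2f} GB per minute")
print(f"{per_hour / GB:.1f} GB per hour")
```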

Obviously some sort of compression is needed, and there are two essentially different types of technique:

Lossless compression allows the exact original data to be reconstructed from the compressed data. This can be contrasted to lossy compression, which by definition does not allow the exact original data to be reconstructed from the compressed data. Creating a ZIP or RAR file uses an algorithm which allows exact lossless reconstruction of a data file. Lossy compression, using any of the standard video compression techniques, will reconstruct something close to the original, but not in fact exactly like the original.

Let’s look at this another way:

Digital Video Technology
The uncompressed CCIR 601 data rate for a PAL video stream is calculated as follows:

720 (pixels) x 576 (pixels) = 414,720 pixels per frame
414,720 pixels x 16 bits per pixel = 6,635,520 bits (approximately 6.6 Megabits) per frame
6,635,520 bits x 25 frames per second = 165,888,000 bits per second (approximately 166 Megabits per second)

The calculation above shows the scale of the problem if no compression is used. For a standard 625-line PAL picture, the uncompressed data rate required is approximately 166 Mbit/s - far in excess of the capacity available on a standard 100 Mbit/s Ethernet LAN.

Note that storage and memory ratings are in Bytes, and network data ratings are in bits.
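
That distinction matters when comparing the network figure above with storage throughput: divide by 8 to convert bits to Bytes. A minimal sketch of the conversion (variable names are our own):

```python
bits_per_second = 720 * 576 * 16 * 25                 # 165,888,000 bits/s for uncompressed PAL
print(f"{bits_per_second / 1_000_000:.0f} Mbit/s on the network")
print(f"{bits_per_second / 8 / 1024 ** 2:.1f} MB/s written to storage")
# roughly 166 Mbit/s, or about 19.8 MB/s to disk
```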

Digital Compression

Digital compression uses a CODEC for compression (COding) and decompression (DECoding) of images. A codec is a device or computer program for encoding or decoding a digital data stream or signal. A codec encodes a data stream or a signal for transmission and storage, possibly in encrypted form, and the decoder function reverses the encoding for playback or editing. Codecs are widely used in videoconferencing, streaming media, and video editing applications.

Spatial compression squeezes the description of the visual area of a video frame by looking for patterns and repetition among pixels. For example, in a picture that includes a blue sky, spatial compression will notice that many of the sky pixels are a similar shade of blue. Instead of describing each of several thousand pixels, spatial compression can record a much shorter description such as “All the pixels in this area are light blue”. Run-length encoding is a version of this technique that is used by many codecs. As spatial compression is increased, the data rate and file size decrease, but the picture loses sharpness and definition.
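
To illustrate the run-length idea, the sketch below collapses runs of identical pixel values into [value, count] pairs. This is a deliberately simplified model of the principle, not the scheme any particular codec actually uses:

```python
def run_length_encode(pixels):
    """Collapse runs of identical values into [value, run length] pairs."""
    encoded = []
    for value in pixels:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1          # extend the current run
        else:
            encoded.append([value, 1])   # start a new run
    return encoded

# A scan line that is mostly the same shade of "light blue" compresses very well
scan_line = [200] * 300 + [35, 37, 200, 200]
print(run_length_encode(scan_line))      # [[200, 300], [35, 1], [37, 1], [200, 2]]
```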

Temporal compression looks for ways to compact the description of the changes during a sequence of frames. It does this by looking for patterns and repetition over time. For example, in a video clip of a person speaking in front of a static background, temporal compression will notice that the only pixels that change from frame to frame are those forming the face of the speaker. All the other pixels don't change (when the camera is motionless). Instead of describing every pixel in every frame, temporal compression describes all the pixels in the first frame (known as a Key frame or I frame), and then for each frame that follows, describes only the pixels that are different from the previous frame. These frame types are termed I, B and P frames. They are different in the following characteristics:

I frames are the least compressible but don't require other video frames to decode. I frames are also known as key frames. An I frame (Intra-coded picture), is a complete image, like a JPG or BMP image file. P and B frames hold only part of the image information (the part that changes between frames), so they need less space in the output file than an I frame.

B frames can use both previous and next frames in the sequence for data reference to get the highest amount of data compression. A B frame (Bidirectional predicted picture) saves even more space by using differences between the current frame and both the preceding and following frames to specify its content.

P frames can use data from previous frames to decompress and are more compressible than I frames. A P frame (Predicted picture) holds only the changes in the image from the previous frame. For example, in a scene where a car moves across a stationary background, only the car's movements need to be encoded. The encoder does not need to store the unchanging background pixels in the P frame, thus saving space. P frames are also known as delta frames.

The frame types above are constructed and transported in groups, known as a GOP (Group Of Pictures). Typical GOP structure settings for early equipment were ‘IP’, ‘IBP’ and ‘IBBP’. A key frame (I frame) is always at the beginning, with the B and P frames following on - up to 24 of them on some newer systems running each channel at 25 frames per second.
These different types of frames have formed the basis of the whole H.26x series of compression codecs for network video.
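
The “changes only” idea behind P frames can be sketched very simply: keep the first frame in full (the I frame), then for each later frame record only the pixels that differ from the previous one. This is a toy model for illustration, not how H.26x codecs actually encode their data:

```python
def encode_sequence(frames):
    """Toy temporal compression: full first frame, then per-frame pixel differences."""
    key_frame = list(frames[0])              # the 'I frame', stored in full
    deltas = []
    previous = frames[0]
    for frame in frames[1:]:
        changed = {i: v for i, v in enumerate(frame) if v != previous[i]}
        deltas.append(changed)               # a simplified 'P frame': index -> new value
        previous = frame
    return key_frame, deltas

frames = [[10, 10, 10, 10],
          [10, 10, 99, 10],                  # one pixel changes
          [10, 10, 99, 11]]                  # another pixel changes
key, deltas = encode_sequence(frames)
print(key, deltas)                           # [10, 10, 10, 10] [{2: 99}, {3: 11}]
```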

Digital Compression Techniques

JPEG (which dates from September 1992) specifies both the codec, which defines how an image is compressed into a stream of bytes and decompressed back into an image, and the file format used to contain that stream.

JPEG is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography. The degree of compression can be adjusted, allowing a selectable trade-off between storage size and image quality.

JPEG typically achieves 10:1 compression with little loss in image quality.

JPEG compression is used in a number of image file formats. JPEG/Exif is the most common image format used by digital cameras and other photographic image capture devices. Along with JPEG/JFIF, it is the most common format for storing and transmitting photographic images on the internet.

The term "JPEG" is an acronym for the Joint Photographic Experts Group, which created the standard. JPEG files commonly have a filename extension of .jpg or .jpeg.

 

M-JPEG (Motion JPEG) provides spatial compression. This codec requires large amounts of CPU power and, unfortunately, M-JPEG is not standardised, which means that images produced by one manufacturer’s M-JPEG codec cannot always be decoded by another’s. M-JPEG is designed to exploit known limitations of the human eye, notably the fact that small colour changes are perceived less accurately than small changes in brightness.

M-JPEG typically gives 20:1 compression with little loss in image quality.

The codec compresses the entire image of each frame as a key frame (I frame), which is encoded and decoded independently. In multimedia, Motion JPEG (M-JPEG or MJPEG) is a video compression format in which each video frame or interlaced field of a digital video sequence is compressed separately as a JPEG image. Originally developed for multimedia PC applications, M-JPEG is now used by video-capture devices such as digital cameras, webcams, as well as by non-linear video editing systems. It is natively supported by the QuickTime Player, the PlayStation console, and web browsers such as Safari, Google Chrome, Mozilla Firefox and Microsoft Edge.

 

The Moving Picture Experts Group (MPEG) is a working group of authorities that was formed by ISO and IEC to set standards for audio and video compression and transmission. It was established in 1988 by Hiroshi Yasuda and Leonardo Chiariglione.

The MPEG standards consist of several different parts. Some of the approved MPEG standards were revised by later amendments and/or new editions. Let’s look at some of these:

MPEG-1 (1993): Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s (ISO/IEC 11172). This initial version was the first MPEG compression standard for audio and video, building on the earlier H.261 standard. It is commonly limited to about 1.5 Mbit/s, although the specification is capable of much higher bit rates. It was essentially designed to allow moving pictures and sound to be encoded into the bit rate of a Compact Disc. To meet the low bit-rate requirement, MPEG-1 down-samples the images and uses picture rates of only 24-30 Hz, resulting in moderate quality. This standard includes the popular MPEG-1 Audio Layer III (MP3) audio compression format.

MPEG-2 (1995): Generic coding of moving pictures and associated audio information (ISO/IEC 13818). The MPEG-2 standard was considerably broader in scope and of wider appeal than MPEG-1, as it supported interlacing and high definition. MPEG-2 is considered important because it was chosen as the compression scheme for over-the-air digital television (ATSC, DVB and ISDB), digital satellite TV services, digital cable television signals and DVD Video. MPEG-2 is also supported on Blu-ray Discs, although these normally use MPEG-4 Part 10 (H.264) or SMPTE VC-1 for true high-definition content.

MPEG-3, originally intended for HDTV, was never released; its work was folded into MPEG-2.

MPEG-4 (1998): Coding of audio-visual objects. (ISO/IEC 14496)
MPEG-4 uses further coding tools with additional complexity to achieve higher compression factors than MPEG-2. In addition to more efficient coding of video, MPEG-4 moves closer to computer graphics applications. In more complex profiles, the MPEG-4 decoder effectively becomes a rendering processor and the compressed bit-stream describes three-dimensional shapes and surface texture. Several higher-efficiency video standards (than MPEG-2 Video) are included, notably: MPEG-4 Part 2 (or Simple and Advanced Simple Profile) and MPEG-4 AVC (a.k.a. MPEG-4 Part 10 or H.264). MPEG-4 AVC / H.264 may be used on HD DVD and Blu-ray Discs, along with VC-1 and MPEG-2. MPEG-4 uses temporal compression, therefore only changes in the image scene are compressed and stored, which reduces the amount of storage required.

MPEG-4 can provide up to 100:1 compression whilst retaining acceptable image quality.

MPEG bit rate settings – CBR vs VBR
CBR (Constant Bit Rate) setting
With limited bandwidth available, the preferred mode is normally CBR because it generates a constant, predefined bit rate. The disadvantage of CBR is that image quality will vary: quality remains relatively high when there is no motion in a scene, but decreases significantly with increased motion. This is extremely noticeable when PTZ cameras are panned or zoomed, as the picture pixelates and then settles once the camera has stopped moving. This means that CBR cannot be used in every case.

MPEG - VBR (Variable Bit Rate) setting
Using VBR, a predefined level of image quality can be maintained regardless of the amount of motion in a scene. This is often desirable in video surveillance applications where high quality is needed, particularly when there is motion in a scene. However, the fluctuations in bit rate must be allowed for in the available storage and network bandwidth.
Switching off individual camera options like on-board analytics and Wide Dynamic Range can further reduce bandwidth usage when using VBR.

H.264, aka MPEG-4 Part 10 or MPEG-4 AVC (Advanced Video Coding) (May 2003)
H.264 is similar to MPEG-4 in that key frames and predicted frames are used during compression and decompression. H.264 provides a more efficient method of compression, with more precise motion search and prediction capability, but it requires more powerful CPU processing.

H.264 has been the de facto standard for CCTV encoding and decoding for over ten years. It works well, and installers have grown used to its good and bad points.

H.265 / HEVC
H.265 High Efficiency Video Coding (HEVC) version 2 was formally published on January 12, 2015.
H.265 was designed to substantially improve coding efficiency compared to H.264/MPEG-4 AVC, and is targeted at next-generation HD displays and content capture systems featuring progressive scanning and display resolutions from QVGA (320x240) up to 4K and 8K Ultra HD, as well as improved picture quality in terms of noise level, colour spaces, and dynamic range.

Average bit rate reduction compared to H.264/MPEG-4 is typically over 50% 

Latency

Although it has already been stated that there should be no real difference between analogue and digital image quality, the fact is that when we add video compression, delay is introduced. The compression / decompression process requires CPU time to analyse the data, and the speed of both the camera processor and the NVR hardware obviously has a bearing on this. When we also add IP networks into the mix, some delay comes from the Transport and Network Layers (UDP/TCP and IP) of the network itself and some comes from the Network Cards connecting to the network.

Poorly configured or poorly specified network equipment, electrical noise on the network, damaged, kinked or squashed cabling, and poor termination can all have a very detrimental effect on network speeds and connectivity.

Put together, this shows up on screen as a delay in displaying images; in extreme cases a delay of up to a few seconds could be experienced, or frames could be dropped, causing juddering or missing video segments.