Basic knowledge of audio and video entry



Video Encapsulation Format (MP4/MKV…) vs Video Encoding Format (H.264/FLAC/AAC…)

  What is a video? Essentially it is a sequence of pictures shown one after another at very short time intervals; because of persistence of vision, the viewer perceives the people in the pictures as moving, and that is a movie. In other words, a movie is at heart a collection of a great many still pictures. So how does each picture relate to a frame?

  If we stored every picture of a movie as-is, it would take an enormous amount of space. Instead, each picture is encoded into a frame by some algorithm, the frames are joined into a stream, and the streams are placed into a container of some kind; that is the movie file we normally see.

  MP4 and MKV are the most common kinds of video files you download. Such a file is essentially a package, and its suffix tells you how the package is wrapped. The package contains video (images only), audio (sound only), subtitles, and so on. When a player plays the file, it first unpacks the package (the technical term is demuxing), takes out the video, audio, etc., and then decodes and plays them.

  Since it is just a package, the suffix guarantees neither what is inside nor how many items there are. Each item in the package is called a track, and the common ones are:

  • Video: generally present, but there are exceptions, such as external audio tracks in the mka format, which is really just MKV without the video track. Note that "video" here means the images only, not the sound.
  • Audio: generally present, but in some cases the material is silent, so no audio track is included.
  • Chapters: the segmentation information carried over from the original Blu-ray disc. If the file includes it, the player can show chapters on the progress bar: in PotPlayer, right-click the screen, then Options -> Playback -> show bookmark/chapter marks on the progress bar; in MPC-HC, right-click the screen, then Options -> Tweaks -> show chapter marks on the progress bar.
  • Subtitles: sometimes subtitles ship with the file; if they are not hard subtitles burned into the video, they are packaged alongside it in the container.

  There may also be other attachments, which we will not list one by one. Nor is each type limited to a single track; MKV files with multiple audio tracks are quite common.

  Each track has its own format. For example, when we say the video is H.264 and the audio is AAC, these are the formats of the individual tracks.

  Common video formats include H.264 (further divided into 8bit/10bit), H.265 (likewise divided into 8bit/10bit), RealVideo (common in early rm/rmvb files), and VC-1 (Microsoft-led, common in wmv). Roughly speaking, H.264 = AVC = AVC1 and H.265 = HEVC.

  Common audio formats include FLAC/ALAC/TrueHD/DTS-HD MA, which are lossless, and AAC/MP3/AC3/DTS (Core), which are lossy.


  

Basic parameters of video: resolution, frames, frame rate and bit rate

  Video is made up of successive images. Each image is called a frame. An image is made of pixels, and the number of pixels an image has is called its resolution. For example, a 1920×1080 image is composed of 1920 pixels horizontally and 1080 pixels vertically. The resolution of a video is the resolution of each of its frames.

  How many images a video shows each second is called its frame rate (fps). Common frame rates are 24000/1001 ≈ 23.976, 30000/1001 ≈ 29.970, 60000/1001 ≈ 59.940, 25.000, 50.000, and so on. This number is how many images are displayed in one second; for example, 23.976 means 24000 images every 1001 seconds. A video's frame rate can be constant (CFR, Constant Frame Rate) or variable (VFR, Variable Frame Rate).

  Bit rate is defined as the video file size divided by its duration, usually expressed in Kbps (Kbit/s) or Mbps (Mbit/s). Note that 1 B (byte) = 8 b (bits). So for a 24-minute, 900 MB video:

Size: 900 MB = 900 MByte = 7200 Mbit
Time: 24 min = 1440 s
Bit rate: 7200 Mbit / 1440 s = 5 Mbps = 5000 Kbps
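The same arithmetic written as a tiny Python helper, mainly to highlight the byte-to-bit conversion that is easy to forget (a minimal sketch, not tied to any particular tool):

def bitrate_mbps(size_mbyte, duration_s):
    # 1 MByte = 8 Mbit, so multiply by 8 before dividing by the duration in seconds
    return size_mbyte * 8 / duration_s

print(bitrate_mbps(900, 24 * 60))   # 5.0 Mbps, i.e. about 5000 Kbps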

  When video files run for roughly the same length (for example, an episode is about 24 minutes), bit rate and file size are essentially interchangeable ways of describing how large the video is. If files of the same length and resolution differ in size, their bit rates differ accordingly.

  Bit rate can also be read as the total amount of data used to record the video per unit of time. A higher bit rate means more data is spent recording the video, which potentially means better quality. Note the word potentially: later we will look at why a high bit rate does not necessarily mean high picture quality.

  

  

Rate control: CQP/CRF/CBR/VBR/ABR

  The amount of data needed to play one second of video is its code rate (i.e. what is usually called the bitrate).

bitrate = width * height * color depth * frames per second

For example, a video at 30 frames per second, 24 bits per pixel, and a resolution of 480×240 would require 82,944,000 bits per second, or about 82.944 Mbps (480 × 240 × 24 × 30), if we did no compression at all.
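Under the stated assumption of no compression, this is just multiplication; the small sketch below recomputes it, together with the one-hour 720p figure used later in the text.

def raw_bitrate_bps(width, height, bits_per_pixel, fps):
    # Uncompressed bits per second: every pixel of every frame stored in full
    return width * height * bits_per_pixel * fps

print(raw_bitrate_bps(480, 240, 24, 30) / 1e6)            # 82.944 Mbit/s
one_hour_720p_bits = raw_bitrate_bps(1280, 720, 24, 30) * 3600
print(one_hour_720p_bits / 8 / 1024**3)                   # roughly 278 GiB for one hour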

  

Ⅰ. CQP(Constant QP)

Constant QP is the simplest rate-control method: every frame is encoded with the same fixed QP, so the amount of data produced per frame is unknown. It is neither a rate-first nor a quality-first model, but it is the simplest to implement.

Applicable scenarios: this method is generally not recommended, because it ignores the complexity of the content and applies the same compression to every frame, so neither the output quality nor the bit rate is stable. In my view it only suits very simple, nearly static scenes; with complex scenes the bit rate fluctuates wildly. It can also be useful in algorithm research or validation.

Features:

  • The instantaneous bit rate fluctuates with scene complexity;

  • Encoding is fast and the control is the simplest: every frame uses the same QP value;

  • CQP mode is supported in x264 and x265, but not in libvpx;

  • The QP range in H.264 is [0, 51]. The larger the QP, the larger the quantization step and the lower the quality of the encoded video; QP 0 means lossless encoding;

Ⅱ. CRF (Constant Rate Factor)

Constant rate factor: the encoder targets a certain constant level of perceived visual quality. It does this by lowering the bit rate spent on frames that are expensive but where loss is hard to see with the naked eye (high-speed motion, rich textures) and raising the bit rate spent on static frames.

Characteristics: the QP changes from frame to frame and between macroblocks within a frame, the output bit rate is unknown, and the perceived quality of each output frame is roughly constant. This mode is effectively fixed-quality mode plus a cap on the bit rate peak.

Applicable scenarios: suitable when you have definite requirements on video quality. The CRF value can be understood simply as a fixed target for the expected video quality; if you want stable subjective quality, choose this mode, which is a quality-first model. "Video quality" here can be loosely understood as the clarity of the video, the fineness of the pixels and the smoothness of playback.

Features:

  • Similar to constant QP, but it pursues constant subjective quality; the instantaneous bit rate still fluctuates with scene complexity, and QP values differ between frames and between macroblocks within a frame;

  • For scenes with fast motion or rich detail, quantization distortion is allowed to increase (the human eye is less sensitive there); conversely, for static or flat areas, quantization distortion is reduced;

  • CRF is the default rate control method for x264 and x265, and can also be used for libvpx;

  • The larger the CRF value, the higher the compression and the lower the quality. The CRF range is generally [0, 51]; the default is 23 for x264 and 28 for x265;

  • If you are not sure which CRF to use, start from the default and adjust based on your subjective impression of the output: lower the CRF if the quality is not good enough, raise it if the file is too large. A change of ±6 roughly halves or doubles the bit rate, and ±1 changes the bit rate by roughly 10%.
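The ±6 / ±1 rule above is easy to turn into a quick estimate. A minimal Python sketch (pure arithmetic, an approximation rather than anything guaranteed by the encoder):

def estimated_bitrate_factor(crf_delta):
    # +6 -> roughly half the bitrate, -6 -> roughly double, +1 -> roughly -10%
    return 2 ** (-crf_delta / 6)

for delta in (-6, -1, +1, +6):
    print(f"CRF {delta:+d}: ~{estimated_bitrate_factor(delta):.2f}x bitrate")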

  

Ⅲ. CBR (Constant Bit Rate)

Constant bit rate: the bit rate is kept essentially constant over a certain time window. This is a rate-first model.

Applicable scenarios: generally not recommended. Although the output bit rate stays at a stable value, quality is unstable and network bandwidth is not used effectively, because this model ignores the complexity of the content and treats every frame the same. However, some encoding software only supports constant quality or constant bit rate, so sometimes it has to be used. When it is, set the bit rate as high as the available bandwidth allows, to avoid very poor quality in complex motion scenes; with an unreasonable setting, motion scenes become too blurry to watch.

Features:

  • The bit rate is stable but the quality is not, and bandwidth is not used efficiently; in particular, with an unreasonable setting the picture becomes very blurry in complex motion scenes, which badly hurts the viewing experience;

  • However, the output bit rate is essentially stable, which makes it easy to predict the file size;

  

Ⅳ. VBR (Variable Bit Rate)

Variable bit rate: the encoder dynamically adjusts the output bit rate according to the complexity of the image content (in practice, how much changes between frames). Complex scenes get a high bit rate, simple scenes a low one, and the output bit rate fluctuates within a range. This improves blocking artifacts for mild motion, though it is still powerless against long stretches of violent motion. This mode suits local storage and offline encoding, and can be used when video and audio quality matter a lot but bandwidth does not.

There are two control modes: quality-priority mode and two-pass (2PASS) encoding mode.

Quality priority mode:

Regardless of the size of the output file, bits are allocated entirely according to the complexity of the content, so that playback quality is as good as possible.

Two-pass (2PASS) encoding mode:

The first pass analyses the content, detecting the simple and complex parts of the video and determining their proportions.

The second pass keeps the average bit rate of the video unchanged while assigning more bits to complex sections and fewer to simple ones. The result is good, but the encoding speed cannot keep up.

Applicable scenarios: VBR suits situations with few constraints on bandwidth and encoding speed but high demands on quality. Especially in complex motion scenes it maintains relatively high sharpness, and its output quality is fairly stable; it fits video-on-demand, recording, or storage systems that are not latency-sensitive.

Features:

  • The bit rate is unstable, but the quality is basically stable and very high;

  • Encoding is generally slow; it is a good first choice for on-demand, download, and storage systems, but unsuitable for low-latency or live-streaming systems;

  • This model pays no attention to the output bandwidth: for the sake of quality it uses as much bit rate as it needs, and it does not consider encoding speed either;

Ⅴ. ABR (Average Bit Rate)

Constant average target bit rate: simple scenes are given fewer bits and complex scenes enough bits, so a limited bit budget is allocated sensibly across different scenes, much like VBR. At the same time, over a given period the average bit rate stays close to the configured target, so the output file size can be controlled, much like CBR. It can be seen as a compromise between CBR and VBR, and it is what most people choose, especially when both quality and bandwidth matter. It is generally two to three times faster than VBR, and for files of the same size the quality is much better than CBR.

Applicable scenarios: ABR is widely used in live-streaming and low-latency systems, because it encodes in a single pass and is therefore fast while still balancing quality against bandwidth. It is also a good choice when transcoding speed matters. Most videos on Bilibili use this mode.

Features:

  • Overall quality is controllable, and it balances bit rate and speed at the same time; it is a compromise solution and the one used most in practice;

  • In use, the caller generally needs to set a minimum, maximum and average bit rate, and these values should be chosen as sensibly as possible;

Summary:

  The rate-control schemes above go by different names in different encoders, and the details vary, but they all work by influencing the QP, which in turn controls how coarse the quantization is. Consult the documentation of the specific encoder when using them.

  In general, ABR is preferred and strikes a satisfactory balance between speed, bit rate and quality. VBR, CBR and CRF each have their own scenarios and should be chosen according to the conditions.
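As a concrete reference, the sketch below shows how these modes are commonly expressed as x264 options through ffmpeg, driven from Python. The flags used (-crf, -b:v, -maxrate, -bufsize, -pass) are standard ffmpeg/libx264 options, but the file names and numeric values are placeholders, and exact behaviour should be checked against the encoder's own documentation.

import subprocess

SRC = "input.mp4"  # placeholder input file

# CRF: quality-first, output size unknown
subprocess.run(["ffmpeg", "-i", SRC, "-c:v", "libx264", "-crf", "23", "out_crf.mp4"])

# ABR: single pass, average bitrate target
subprocess.run(["ffmpeg", "-i", SRC, "-c:v", "libx264", "-b:v", "5M", "out_abr.mp4"])

# CBR-like: bitrate capped by a VBV buffer
subprocess.run(["ffmpeg", "-i", SRC, "-c:v", "libx264",
                "-b:v", "5M", "-maxrate", "5M", "-bufsize", "10M", "out_cbr.mp4"])

# Two-pass VBR: first pass analyses, second pass allocates bits
# (on Windows, replace /dev/null with NUL)
subprocess.run(["ffmpeg", "-y", "-i", SRC, "-c:v", "libx264", "-b:v", "5M",
                "-pass", "1", "-an", "-f", "null", "/dev/null"])
subprocess.run(["ffmpeg", "-i", SRC, "-c:v", "libx264", "-b:v", "5M",
                "-pass", "2", "out_2pass.mp4"])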

  

  

Progressive scan technology

  In the early days, engineers came up with a technique that doubled the perceived frame rate of video without consuming extra bandwidth. This technique is known as interlacing: roughly speaking, at one point in time it sends a field that fills half of the screen's lines, and at the next point in time a field that fills the other half.

  Most screens today use progressive scan. This is a way of displaying, storing, and transmitting moving images in which all the lines of each frame are drawn in sequence.


  

  

Color depth (8bit, 10bit)

  Color depth (bit depth) is what we mean by 8bit and 10bit: the precision per channel (which you can loosely think of as how many distinct brightness levels a color can take). 8bit means each channel is represented by an 8-bit integer (0~255), 10bit by a 10-bit integer (0~1023), and 16bit by a 16-bit integer (0~65535). Note that this description is not exact: when video is encoded, the full 0~255 range is not necessarily used; part of it may be reserved and only a subrange, for example 16~235, is actually used. We will not expand on that here.

  Your display is 8bit, meaning it can show every intensity from 0 to 255 on each RGB channel. The color depth of a video, however, is the color depth of its YUV data, and during playback the YUV must be converted to RGB by calculation. So the benefit of 10bit is indirect: it raises the precision of that intermediate calculation so the final colors come out more finely graded.

A single pixel on the display is actually made up of three colored lights.

How to understand why playing 10bit content on an 8bit display is still worthwhile:
A circle has a radius of 12.33 m. Find its area to two decimal places.
The radius is given to two decimal places and the result is required to two decimal places, so how precise does pi need to be? Also just two decimal places?
With pi = 3.14, the area comes out to 477.37 square meters.
With pi = 3.1416, the area comes out to 477.61 square meters.
With pi at sufficiently high precision, the area is 477.61 square meters. So pi = 3.1416 is enough, but 3.14 is not.

  In other words, even if the final output only needs low precision, that does not mean the numbers involved in the intermediate calculation can stay at low precision. This is exactly why 10bit YUV still has an accuracy advantage over 8bit YUV even when the final output is 8bit RGB. In fact, after conversion 8bit YUV covers only about 26% of the precision of 8bit RGB, while 10bit YUV covers about 97%. If you want your 8bit display to show 97% of that fineness, look for 10bit.

  The shortcomings of 8-bit precision show up mainly in low-brightness areas, where color banding easily forms.
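To make the precision argument concrete, here is a small pure-Python sketch: it pushes the luma of a dark gradient through an 8-bit and a 10-bit intermediate and counts how many distinct levels survive. The full-range BT.601 luma weights are an assumption used only for illustration.

def rgb_to_y(r, g, b):
    # Luma only, full-range BT.601 weights (assumed for this sketch)
    return 0.299 * r + 0.587 * g + 0.114 * b

def quantize(value, bits, max_in=255.0):
    # Map a 0..max_in value onto a 0..(2^bits - 1) integer grid
    levels = (1 << bits) - 1
    return round(value / max_in * levels)

def dequantize(code, bits, max_out=255.0):
    levels = (1 << bits) - 1
    return code / levels * max_out

# Walk a dark gradient (where banding is most visible) and count how many
# distinct levels survive an 8-bit vs a 10-bit intermediate representation.
distinct_8, distinct_10 = set(), set()
for g in range(0, 64):                      # dark green ramp, 64 input steps
    y = rgb_to_y(0, g, 0)
    distinct_8.add(dequantize(quantize(y, 8), 8))
    distinct_10.add(dequantize(quantize(y, 10), 10))

print(len(distinct_8), "levels survive an 8-bit intermediate")    # noticeably fewer than 64
print(len(distinct_10), "levels survive a 10-bit intermediate")   # all 64 preserved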

  

  

Image representation method: RGB model vs YUV model

RGB

  The three primary colors of light are red (Red), green (Green) and blue (Blue). Modern display technology produces any visible color by mixing the three primaries at different intensities. In image storage, recording an image by recording the red, green and blue intensity of each pixel is called the RGB model (RGB Model). Among common image formats, PNG and BMP are based on the RGB model.

  If the intensity of each color (plane) occupies 8 bits (values from 0 to 255), the color depth is 24 bits (8 × 3), i.e. 3 bytes per pixel. It follows that 2^24 = 16,777,216 different colors can be represented.

YUV

  Besides the RGB model, another widely used model is the YUV model, also called the luma-chroma model. Through a mathematical transformation it converts the three RGB channels into one channel representing brightness (Y, known as luma) and two channels representing color (U and V, known as chroma).

  There are different concrete realizations of the YUV model. A particularly useful one is the YCbCr model: it converts RGB into a luminance component (Y), a blue-difference chroma component (Cb) and a red-difference chroma component (Cr).

Conversion process

1. RGB -> YUV
  When converting an RGB signal to YUV, it is generally converted to the YUV444 format first (see the chroma subsampling section), and then the resolution of the UV signal is reduced to whatever format we need.

2. YUV -> RGB
  When converting YUV back to RGB, the downscaled UV signal must first be upsampled back to the same resolution as the Y signal, and only then converted to an RGB signal. When playing video or displaying images we must convert the YUV signal to RGB; this step is called rendering (Rendering).

Conversion formula

The following is the conversion formula between RGB and YUV:
[Figure: RGB to YUV conversion formulas]

In practice this step is carried out with an encoding matrix; written in matrix form:

[Figure: the same conversion written in matrix form]

The two matrices in the figure above are the encoding matrices.
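Since the original figures cannot be reproduced here, the sketch below writes out one common form of the conversion in code, using the widely cited full-range BT.601 (JPEG/JFIF) coefficients. Treat the coefficients as an assumption: different standards (BT.709, limited range) use different constants, and they may not match the original figure exactly.

def rgb_to_ycbcr(r, g, b):
    # Full-range BT.601 / JFIF coefficients (assumed for this sketch)
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return r, g, b

# Round trip on a pure red pixel; real codecs would also round and clamp to 0..255
print(rgb_to_ycbcr(255, 0, 0))                    # roughly (76.2, 85.0, 255.5)
print(ycbcr_to_rgb(*rgb_to_ycbcr(255, 0, 0)))     # back to approximately (255, 0, 0)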

Advantages of YUV

  In image and video processing and storage, the YUV format is generally preferred, for the following reasons:

  1. The human eye is far more sensitive to brightness than to chroma, so most of the effective information the eye perceives comes from brightness. The YUV model concentrates most of that effective information in the Y channel, while the UV channels carry much less; compared with the more even distribution of the RGB model, this both reduces redundancy and makes compression easier.
  2. It keeps backward compatibility with black-and-white display devices (black-and-white TV).
  3. In image editing, brightness and color saturation are easier to adjust under the YUV model.

  Almost all video formats, as well as the widely used JPEG image format, are based on the YCbCr model. During playback the player must convert the YCbCr data back to RGB by calculation; this step is called rendering (Rendering).

Since YUV has more advantages, why keep RGB?

Because essentially all color input and output devices ever invented natively support only RGB data. Even when a device accepts YUV input or output, it does so indirectly through internal data conversion.

  

  

Eliminate Redundancy - 1

  We realized that video simply cannot go uncompressed: a single one-hour video at 720p and 30 fps would require 278 GB. A general-purpose lossless compression algorithm alone, such as DEFLATE (used by PKZIP, gzip and PNG), cannot reduce the bandwidth enough, so we need other ways to compress video.

We get this number from the product 1280 × 720 × 24 × 30 × 3600 (width, height, bits per pixel, fps, and seconds).

  To do this, we can exploit several properties. A property of vision: we are much more sensitive to brightness than to color. Repetition in time: a video contains many images that differ only slightly. Repetition within an image: each frame contains many regions with the same or similar colors.

  

Chroma subsampling

  When the YUV model is applied, Y and UV are not equally important. In actual image and video storage and transmission, Y is usually recorded at full resolution while UV is recorded at half or even quarter resolution. This practice is called chroma subsampling (Chroma Sub-Sampling). It effectively reduces transmission bandwidth and increases the compression of the UV planes, though some information in the UV planes is inevitably lost.

  In ordinary video, 4:2:0 sampling is by far the most common. In terms of YUV pixel formats it is usually written as yuv420; this is chroma subsampling at work.


  We can now see that the essence of pixel formats like yuv444, yuv422 and yuv420 is that every pixel keeps its own luma value, while groups of pixels may share one chroma value. The sharing ratio is defined over a 4×2 rectangular reference block of pixels, which also makes formats such as yuv440 and yuv420 easy to understand.

Example: YCbCr 4:2:0 merging
This is a slice of an image merged using YCbCr 4:2:0; notice that we spend only 12 bits per pixel.

On average that works out to Y: 8 bits, Cb: 2 bits, Cr: 2 bits per pixel.

Because each Cb/Cr pair is shared by four pixels, every pixel effectively carries only a quarter of a chroma pair. With 8 bits per sample, that gives 8 + 8/4 + 8/4 = 12 bits per pixel.
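The same bookkeeping can be written as a tiny calculation over the J:a:b notation (assuming 8 bits per sample):

def bits_per_pixel(j, a, b, depth=8):
    # J:a:b over a J-wide, 2-row reference block:
    # J luma samples per row, a chroma samples in row 1, b chroma samples in row 2
    luma_bits   = 2 * j * depth          # Y samples in the 2-row block
    chroma_bits = 2 * (a + b) * depth    # Cb + Cr samples in the block
    return (luma_bits + chroma_bits) / (2 * j)

print(bits_per_pixel(4, 4, 4))   # 4:4:4 -> 24.0 bits per pixel
print(bits_per_pixel(4, 2, 2))   # 4:2:2 -> 16.0 bits per pixel
print(bits_per_pixel(4, 2, 0))   # 4:2:0 -> 12.0 bits per pixel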

  

Eliminate Redundancy - 2

  Earlier we calculated that we would need 278 GB to store a one-hour video at 720p and 30 fps. If we use YCbCr 4:2:0 we can cut that in half (139 GB), but it is still not ideal. We obtained that value by multiplying width, height, color depth and fps; previously we needed 24 bits per pixel, now only 12.

  

I frames, P frames and B frames in video coding

Video transmission (storage) principle

  Video relies on the persistence of vision of the human eye: by playing a series of pictures in quick succession it creates the perception of motion. Transmitting the raw images directly would make the video enormous, which existing networks and storage cannot accept. To make video practical to transmit and store, people noticed that it contains a great deal of repeated information; if the repetition is removed at the sending end and restored at the receiving end, the file size shrinks dramatically. This is where the H.264 video compression standard comes in.

  The raw image data in a video is compressed with the H.264 encoding format, and the audio samples with the AAC encoding format. Once the content is encoded and compressed it is indeed easier to store and transmit, but a corresponding decoding step is needed before it can be watched. So between encoding and decoding there obviously has to be a convention that both the encoder and the decoder understand. For video images, the convention is simple:

   The encoder encodes multiple images into a GOP (Group of Pictures), and the decoder reads one GOP at a time, decodes the pictures and then renders and displays them. A GOP is a group of consecutive pictures, consisting of one I frame and several B/P frames; it is the basic unit that video encoders and decoders work with, and the pattern repeats until the video ends. I frames are intra-coded frames (also called key frames), P frames are forward-predicted frames (forward reference frames), and B frames are bidirectionally predicted frames (bidirectional reference frames). Put simply, an I frame is a complete picture, while P and B frames record changes relative to other frames; P and B frames cannot be decoded without an I frame.

  In the H.264 compression standard, I frames, P frames, and B frames are used to represent transmitted video images.


1. I frame

  An I frame, also known as an intra-coded frame, is an independent frame carrying all of its own information; it can be decoded on its own without reference to other pictures, and can be thought of simply as a still image. The first frame of a video sequence is always an I frame, since it is a key frame.

2. P frame

  A P frame, also called an inter-frame predictive coded frame, must be encoded with reference to an earlier frame. It records the difference between the current frame and the previous frame (which may be an I frame or a P frame). During decoding, the difference recorded in this frame is added to the previously cached picture to produce the final picture. P frames usually take far fewer bits than I frames, but their drawback is that, because of the chain of dependencies on earlier P and I reference frames, they are very sensitive to transmission errors.


3. B frame

  A B frame, also called a bidirectionally predictive coded frame, records the difference between the current frame and both the previous and the following frames. To decode a B frame, the decoder needs not only the previously cached picture but also the following one; the final picture is produced by combining the preceding and following pictures with the data of the current frame. B frames compress best, but they demand more from the decoder.


Summary:

  An I frame only needs to consider itself; a P frame records the difference from the previous frame; a B frame records the differences from both the previous and the next frame, which saves the most space and keeps video files small, but makes decoding more involved: the decoder needs not only the previously cached picture but also the following I or P picture, and players that do not handle B-frame decoding well can easily stutter.
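If you want to see the actual I/P/B pattern of a file, ffprobe (shipped with ffmpeg) can list the picture type of every frame. A minimal sketch, assuming ffprobe is installed and "input.mp4" is a placeholder file name:

import subprocess

result = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "v:0",
     "-show_entries", "frame=pict_type", "-of", "csv=p=0", "input.mp4"],
    capture_output=True, text=True,
)
# One picture type (I, P or B) per output line; strip stray commas just in case
frame_types = [line.strip().strip(",") for line in result.stdout.splitlines() if line.strip()]
print("".join(frame_types[:60]))                            # e.g. IBBPBBP... (encoder dependent)
print({t: frame_types.count(t) for t in set(frame_types)})  # rough I/P/B counts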


  The video previewed in a video surveillance system is real-time and has strict requirements on smoothness. Transmitting only I frames and P frames improves adaptability to the network and reduces decoding cost, so such systems currently transmit only I and P frames. For example, Hikvision camera encoding uses an I-frame interval of 50, i.e. one I frame followed by 49 P frames.


Origin blog.csdn.net/qq_40342400/article/details/129621369