H.264: Concepts and Technical Details

January 31, 2021 / Last Modified February 1, 2021
Image Processing

Basic concepts:

  • frame number: does not directly give the display order; the display order is defined by the POC (Picture Order Count).
  • reference pictures: previously coded pictures, organized into two lists, list 0 (used by P) and list 1 (used, together with list 0, by B).
  • macroblock (MB): a 16x16 (luma samples) block, the smallest processing unit.
  • macroblocks are arranged in slices, which are sets of macroblocks in raster order.
  • I slice: contains only I macroblock types, which are predicted using intra prediction from decoded samples in the current slice.
  • P slice: may contain P and I macroblock types (plus skipped macroblocks).
  • B slice: may contain B and I macroblock types.
  • P macroblock: predicted using inter prediction from one reference picture in list 0.
  • B macroblock: predicted using inter prediction from one or two reference pictures, one from list 0 and/or one from list 1.
  • macroblock partitions for inter-coded macroblocks: an MB may be split into 16x16, 16x8, 8x16 or 8x8 partitions; if the 8x8 split is chosen, each 8x8 block may be further split into 8x8, 8x4, 4x8 or 4x4 sub-macroblock partitions. The MB is the smallest unit at which the prediction scheme is chosen, and the sub-macroblock partitions within one 8x8 partition must use the same reference picture (a small sketch of the partition options follows this list).
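To make the partition bookkeeping concrete, here is a small Python sketch (not from the book; the helper name is made up) that lists the inter partition options above and counts how many motion vectors a uniform partitioning of one MB implies:

```python
# Inter MB partition options from the list above, and how many motion vectors a
# uniform partitioning of one 16x16 MB implies (illustrative helper, not from the book).

MB_PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8)]   # macroblock partitions
SUB_PARTITIONS = [(8, 8), (8, 4), (4, 8), (4, 4)]      # sub-partitions of an 8x8 partition

def motion_vectors_per_mb(mb_part, sub_part=None):
    """Motion vectors needed when the whole MB uses one uniform partitioning."""
    blocks = (16 // mb_part[0]) * (16 // mb_part[1])
    if mb_part == (8, 8) and sub_part is not None:
        blocks *= (8 // sub_part[0]) * (8 // sub_part[1])
    return blocks

print(motion_vectors_per_mb((16, 16)))         # 1 motion vector
print(motion_vectors_per_mb((8, 8), (4, 4)))   # 16 motion vectors
```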

Performance limits for codecs are defined by a set of Levels, each placing limits on parameters such as sample processing rate, picture size, coded bitrate and memory requirements.

RBSP: Raw Byte Sequence Payload

  • VCL: Video Coding Layer
  • NAL: Network Abstraction Layer

The reason for separating VCL and NAL: the VCL focuses on coding features, while the NAL focuses on transport features.

A slice contains at least one macroblock and at most all macroblocks of a picture (one slice per picture). The slices of a picture need not contain the same number of macroblocks. Slices have minimal inter-dependency, which limits how far an error can propagate.

DPB: Decoded Picture Buffer

IDR: Instantaneous Decoder Refresh, made up of I or SI slices, used to clear the contents of the reference picture buffer. When the decoder receives an IDR picture, it marks all pictures in the reference buffer as unused for reference. The first picture in a coded sequence is always an IDR picture.

ASO: Arbitrary Slice Order. Slices in a coded picture may appear in any decoding order. The feature is in use whenever the first MB address of a slice is smaller than the first MB address of a preceding slice in the same picture.

Slice groups: still unclear to me...

Tree Structured Motion Compensation

The partitioning shown in the figure above applies only to inter-predicted MBs: only inter MBs may be split this way. An intra MB is partitioned differently: only square splits are allowed, i.e. one 16x16, four 8x8 or sixteen 4x4 blocks. This is a characteristic of H.264. The transform block partitioning is independent: four 8x8 or sixteen 4x4 blocks. In H.264, the MB is the smallest unit at which the prediction scheme is chosen!

For the chroma components everything is scaled down proportionally (halved in each dimension for 4:2:0 video), including the motion vectors.

Larger partitions tend to leave more energy in the residual but need fewer motion vectors; smaller partitions reduce the residual energy but need more motion vectors. Large partitions are therefore generally used for homogeneous areas, where the residual energy stays small anyway, and small partitions for areas with a lot of detail.

A six-tap Finite Impulse Response (FIR) filter with weights (1/32, -5/32, 5/8, 5/8, -5/32, 1/32) is used for interpolation. Interpolation is needed for sub-sample motion compensation: the six-tap filter produces the half-pel samples, and the quarter-pel samples are then obtained by linear interpolation. (The resulting motion vectors have fractional, quarter-sample precision.)
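A minimal Python sketch of the half-pel interpolation along one row, assuming the usual (1, -5, 20, 20, -5, 1)/32 form of the kernel and 8-bit samples; function names are illustrative:

```python
# Half-pel interpolation along one row with the 6-tap kernel (1, -5, 20, 20, -5, 1)/32,
# assuming 8-bit samples. Quarter-pel samples are then obtained by averaging the two
# nearest integer/half-pel samples: (a + b + 1) >> 1.

def clip255(x):
    return max(0, min(255, x))

def half_pel(samples, i):
    """Half-pel sample between samples[i] and samples[i+1]."""
    s = samples
    acc = s[i-2] - 5*s[i-1] + 20*s[i] + 20*s[i+1] - 5*s[i+2] + s[i+3]
    return clip255((acc + 16) >> 5)   # +16 rounds, >>5 divides by 32

row = [10, 20, 40, 80, 120, 160, 200, 210]
print(half_pel(row, 3))   # value interpolated between row[3]=80 and row[4]=120
```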

Motion Vector Prediction

Each partition needs its own motion vector, and coding these vectors can cost many bits, especially when the partitions are small. At the same time, the motion vectors of neighbouring partitions are strongly correlated, which makes it possible to predict the current partition's vector from the vectors of neighbouring, previously coded partitions. As with motion compensation, there is a predicted vector, MVp, and a motion vector difference, MVD, which is what actually gets coded!
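A sketch of the basic median predictor and MVD computation (the median-of-three-neighbours rule is the common case; the special cases for 16x8/8x16 partitions and unavailable neighbours are ignored here):

```python
# Median motion-vector prediction and MVD (common case only: the predictor is the
# component-wise median of the left (A), top (B) and top-right (C) neighbours).

def median3(a, b, c):
    return sorted((a, b, c))[1]

def predict_mv(mv_a, mv_b, mv_c):
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))

mv  = (5, -2)                               # motion vector found for the current partition
mvp = predict_mv((4, -1), (6, -3), (5, 0))  # -> (5, -1)
mvd = (mv[0] - mvp[0], mv[1] - mvp[1])      # only MVD is coded in the bitstream
print(mvp, mvd)                             # (5, -1) (0, -1)
```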

Intra Prediction

A 4x4 luma block has 9 optional prediction modes, a 16x16 luma block has 4 modes, and there are 4 modes for the chroma components.

A special intra mode, I_PCM, transmits the image samples directly, without prediction, transform, quantization or entropy coding. In some special cases, such as highly anomalous image content, this mode can be more efficient. I_PCM also places an absolute limit on the number of bits in an MB without affecting picture quality.

For modes 3-8 of the 4x4 luma modes, the predicted samples are formed from a weighted average of the prediction samples (the book does not give the specific weights). The SAE (Sum of Absolute Errors) metric is used to judge and select the best mode!
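As an illustration of SAE-based mode selection, here is a sketch using only three of the nine 4x4 modes (vertical, horizontal, DC) and NumPy for brevity:

```python
import numpy as np

# SAE-based selection among three of the nine 4x4 intra modes (vertical, horizontal, DC).
# Mode names and sample layout are simplified for illustration.

def predict_4x4(mode, top, left):
    if mode == "vertical":      # copy the row of samples above the block
        return np.tile(top, (4, 1))
    if mode == "horizontal":    # copy the column of samples to the left of the block
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == "dc":            # mean of the top and left samples
        return np.full((4, 4), int(round((top.sum() + left.sum()) / 8.0)))
    raise ValueError(mode)

def best_mode(block, top, left):
    sae = {m: int(np.abs(block - predict_4x4(m, top, left)).sum())
           for m in ("vertical", "horizontal", "dc")}
    return min(sae, key=sae.get), sae

block = np.array([[90, 92, 95, 97]] * 4)
top   = np.array([90, 92, 95, 97])
left  = np.array([60, 60, 60, 60])
print(best_mode(block, top, left))   # vertical wins with SAE = 0
```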

Mode 3 (Plane), one of the four 16x16 luma modes: a linear plane function is fitted to the upper and left-hand samples H and V. This works well in areas of smoothly-varying luminance.

The numbering of the four 8x8 chroma prediction modes is slightly different:
mode 0: DC
mode 1: horizontal
mode 2: vertical
mode 3: plane

The intra modes of neighbouring 4x4 blocks are also strongly correlated, so, as usual, predictive coding is used to signal the intra prediction mode. (16x16 luma blocks and chroma blocks do not use this method.)

Predictive coding simply means deriving a prediction by some agreed method; there is a difference (residual) between the prediction and the true value, and what is sent to the decoder is the information needed to form the prediction together with the difference.
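In its simplest form (an illustrative sketch, not the actual H.264 mode-signalling syntax, which uses a most-probable-mode flag plus a remaining-mode code):

```python
# Predictive coding in its simplest form: both sides derive the same prediction by an
# agreed rule, and only the difference (residual) is transmitted.

def encode(value, prediction):
    return value - prediction

def decode(residual, prediction):
    return prediction + residual

prediction = 6                        # derived identically at encoder and decoder
residual = encode(8, prediction)      # -> 2, sent in the bitstream
print(decode(residual, prediction))   # -> 8
```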

Deblocking Filter

A filter is applied to each decoded macroblock to reduce blocking distortion. The deblocking filter is applied just after the inverse transform. The filter smooths block edges, improving the appearance of decoded frames.

When QP is larger, blocking distortion is likely to be more significant.

H.264有3种transform:

  • a Hadamard transform for the 4x4 array of luma DC coefficients in intra macroblocks predicted in 16x16 mode,
  • a Hadamard transform for the 2x2 array of chroma DC coefficients,
  • a DCT-based transform for all other 4x4 blocks in the residual data.

Transmission order of the transformed data:

Blocks are sent in order from -1 to 25: block -1 holds the luma DC coefficients (sent only for 16x16 intra MBs), blocks 0-15 are the luma 4x4 blocks, blocks 16 and 17 are the 2x2 chroma DC blocks, and blocks 18-25 are the chroma AC blocks.

H.264's transform is based on the DCT, but it also differs from the DCT in important ways!

The DCT is computed as:

$$Y=AXA^T= \begin{bmatrix}
a & a & a & a \\
b & c & -c & -b \\
a & -a & -a & a \\
c & -b & b & -c \end{bmatrix} [X] \begin{bmatrix}
a & b & a & c \\
a & c & -a & -b \\
a & -c & -a & b \\
a & -b & a & -c \end{bmatrix}$$
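For reference, the constants in A, which the text below refers to as "as before", are the standard 4x4 DCT basis values:

$$a=\frac{1}{2},\qquad b=\sqrt{\frac{1}{2}}\cos\frac{\pi}{8}\approx 0.653,\qquad c=\sqrt{\frac{1}{2}}\cos\frac{3\pi}{8}\approx 0.271$$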

This expression can be factorized into:

$$\begin{align}
Y &= (CXC^T)*E \\
&= \left( \begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & d & -d & -1 \\
1 & -1 & -1 & 1 \\
d & -1 & 1 & -d \end{bmatrix} [X] \begin{bmatrix}
1 & 1 & 1 & d \\
1 & d & -1 & -1 \\
1 & -d & -1 & 1 \\
1 & -1 & 1 & -d \end{bmatrix} \right) * \begin{bmatrix}
a^2 & ab & a^2 & ab \\
ab & b^2 & ab & b^2 \\
a^2 & ab & a^2 & ab \\
ab & b^2 & ab & b^2 \end{bmatrix} \end{align}$$

\((CXC^T)\) is the core 2D transform. The constants a and b are as before, \(d=\frac{c}{b}\approx 0.414\), and \(*\) denotes element-by-element (scalar) multiplication, not a matrix product.

To simplify the computation, d is approximated by 1/2, and b is adjusted so that (together with the scaling matrix) the transform stays essentially orthogonal, giving \(d=\frac{1}{2},a=\frac{1}{2},b=\sqrt\frac{2}{5}\). The 2nd and 4th rows of C (and the 2nd and 4th columns of \(C^T\)) are then multiplied by 2 to remove the fractions, with the post-scaling matrix E adjusted to compensate. The final result is:

$$\begin{align}
Y &= (C_fXC_f^T)*E_f \\
&= \left( \begin{bmatrix}
1 & 1 & 1 & 1 \\
2 & 1 & -1 & -2 \\
1 & -1 & -1 & 1 \\
1 & -2 & 2 & -1 \end{bmatrix} [X] \begin{bmatrix}
1 & 2 & 1 & 1 \\
1 & 1 & -1 & -2 \\
1 & -1 & -1 & 2 \\
1 & -2 & 1 & -1 \end{bmatrix} \right) * \begin{bmatrix}
a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\
\frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \\
a^2 & \frac{ab}{2} & a^2 & \frac{ab}{2} \\
\frac{ab}{2} & \frac{b^2}{4} & \frac{ab}{2} & \frac{b^2}{4} \end{bmatrix} \end{align}$$

Why compute it this way?

  • the core transform uses only integer arithmetic (16-bit), so there is no loss of precision at decoding and zero mismatch between encoder and decoder;
  • the core transform can be implemented with additions and shifts only (see the sketch after this list);
  • the final scaling by the matrix E is folded into the quantization step, reducing the number of multiplications.
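A minimal NumPy sketch of the integer core transform \(C_fXC_f^T\) (the scaling by \(E_f\) is omitted since it is absorbed into quantization; in a real implementation the multiplications by ±1 and ±2 reduce to additions and shifts):

```python
import numpy as np

# Integer core transform W = Cf · X · Cf^T. The scaling by Ef is omitted here because
# it is absorbed into quantization.

Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_core(X):
    return Cf @ X @ Cf.T

X = np.array([[ 5, 11,  8, 10],
              [ 9,  8,  4, 12],
              [ 1, 10, 11,  4],
              [19,  6, 15,  7]])
print(forward_core(X))   # 4x4 block of integer coefficients
```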

The inverse transform is computed as:

$$\begin{align}
X' &= C_i^T(Y*E_i)C_i \\
&= \begin{bmatrix}
1 & 1 & 1 & \frac{1}{2} \\
1 & \frac{1}{2} & -1 & -1 \\
1 & -\frac{1}{2} & -1 & 1 \\
1 & -1 & 1 & -\frac{1}{2} \end{bmatrix} \left([Y]* \begin{bmatrix}
a^2 & ab & a^2 & ab \\
ab & b^2 & ab & b^2 \\
a^2 & ab & a^2 & ab \\
ab & b^2 & ab & b^2 \end{bmatrix}\right) \begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & \frac{1}{2} & -\frac{1}{2} & -1 \\
1 & -1 & -1 & 1 \\
\frac{1}{2} & -1 & 1 & -\frac{1}{2} \end{bmatrix} \end{align}$$

The book notes that the 1/2 factors in \(C_i\) can be implemented with a right shift without significant loss of accuracy, precisely because the input is pre-scaled by \(E_i\).
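A sketch of the corresponding 1D inverse-core butterfly, assuming the usual formulation in which the 1/2 factors become single right shifts:

```python
# 1D inverse-core butterfly: the 1/2 entries of Ci become single right shifts.
# Applied to each row and then each column of the (pre-scaled) coefficient block.

def inverse_core_1d(w0, w1, w2, w3):
    e0 = w0 + w2
    e1 = w0 - w2
    e2 = (w1 >> 1) - w3      # w1/2 as a right shift
    e3 = w1 + (w3 >> 1)      # w3/2 as a right shift
    return e0 + e3, e1 + e2, e1 - e2, e0 - e3

print(inverse_core_1d(40, 8, 4, 2))   # (53, 38, 34, 35)
```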

Quantization: (1) floating-point division is avoided, presumably with hardware cost in mind; (2) the post-scaling described above is incorporated into it. H.264 defines 52 Qstep values, and QP is the index into these 52 values. The QP for chroma is derived from the QP for luma; a custom mapping between QPy and QPc can also be defined, and this information is carried in the PPS.
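A sketch of the QP-to-Qstep relationship: Qstep doubles for every increase of 6 in QP, starting from the base values of the standard table for QP 0-5:

```python
# QP -> Qstep: 52 values (QP 0..51), with Qstep doubling for every increase of 6 in QP.

QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]   # Qstep for QP 0..5

def qstep(qp):
    assert 0 <= qp <= 51
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))

print(qstep(0), qstep(6), qstep(24), qstep(51))   # 0.625 1.25 10.0 224.0
```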

As mentioned above, if the MB uses the 16x16 intra prediction mode, the 4x4 array of DC coefficients is additionally transformed with a Hadamard transform. I won't record the details here; the same goes for the chroma transform, to be studied carefully when needed.
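A sketch of the extra Hadamard stage on the 4x4 luma DC array (forward direction only; the exact rounding used by the standard is not reproduced, so this is illustrative):

```python
import numpy as np

# Forward Hadamard transform of the 4x4 luma DC array of a 16x16-intra MB.
# The //2 is the usual post-scaling; the standard's exact rounding is not reproduced.

H = np.array([[1,  1,  1,  1],
              [1,  1, -1, -1],
              [1, -1, -1,  1],
              [1, -1,  1, -1]])

def hadamard_dc(dc):
    return (H @ dc @ H) // 2     # H is symmetric, so H == H.T

dc = np.arange(16).reshape(4, 4)   # dummy DC coefficients
print(hadamard_dc(dc))
```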

The relationship between prediction and transform in H.264: for inter MBs, if any partition is smaller than 8x8, the 4x4 transform must be used; for intra MBs with the 8x8*4 or 4x4*16 partitioning, prediction and transform sizes coincide, while with 16x16*1 prediction the sixteen 4x4 transforms are applied first and a further 4x4 Hadamard transform is then applied to the DC values. This relationship makes the code somewhat more complex to write.

The notes below concern the Main profile; everything above belongs to the Baseline profile.

B slices: the two reference pictures can be one from the past and one from the future, both from the past, or both from the future. B slices use list 0 and list 1; both lists may contain short-term and long-term reference pictures, and both may contain past as well as future pictures.

Prediction Options:

  • Direct: used for skipped MBs in B slices (no motion data is transmitted)
  • List0
  • List1
  • Bi-prediction: two reference regions are obtained, one from list 0 and one from list 1, and their average is used: \(pred(i,j)=(pred0(i,j)+pred1(i,j)+1)>>1\)

Weighted Prediction

Each prediction sample pred0(i,j) or pred1(i,j) is scaled by a weighting factor w0 or w1 prior to motion-compensated prediction. In the explicit types, the weighting factors are determined by the encoder and transmitted in the slice header. If implicit weighted prediction is used, w0 and w1 are calculated from the relative temporal positions of the list 0 and list 1 reference pictures: the closer the reference picture, the larger its weight. A typical application is a fade transition, where one scene fades into another.
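A sketch contrasting the plain bi-predictive average above with weighted prediction; the implicit-weight helper below is a simplified, normalised version of the idea (the standard's fixed-point formulation is not reproduced):

```python
# Plain bi-predictive averaging versus weighted prediction. The implicit weights here
# are a normalised simplification: the closer reference gets the larger weight.

def biprediction(pred0, pred1):
    return [(a + b + 1) >> 1 for a, b in zip(pred0, pred1)]

def implicit_weights(poc_cur, poc_ref0, poc_ref1):
    d0 = abs(poc_cur - poc_ref0)
    d1 = abs(poc_cur - poc_ref1)
    return d1 / (d0 + d1), d0 / (d0 + d1)    # (w0, w1)

def weighted_prediction(pred0, pred1, w0, w1):
    return [int(round(w0 * a + w1 * b)) for a, b in zip(pred0, pred1)]

p0, p1 = [100, 104, 108], [120, 124, 128]
print(biprediction(p0, p1))                    # [110, 114, 118]
w0, w1 = implicit_weights(4, 2, 8)             # ref0 is closer, so w0 > w1
print(weighted_prediction(p0, p1, w0, w1))
```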

The Extended profile is well suited to streaming video. A characteristic of streaming is switching between streams (different content, or the same content at different bitrates).

One switching scheme is to insert I slices at regular intervals (assuming one slice per frame); these insertion points become the switching points, since an I slice is decoded without any reference. The problem with this scheme is that the bitrate peaks at every switching point, because I slices are comparatively large.

SP slices are used to switch between streams that carry the same content at different bitrates:

The key is the slice AB2: using A1 as reference, it reconstructs B2 exactly; once B2 is available, decoding can continue with B3, and so on.

Another use of SP slices is to provide fast-forward functionality.

There is also a similar SI slice, which the book covers only briefly.

-- EOF --

Permalink: https://www.pynote.net/archives/3353

Comments

7 comments on "H.264: Concepts and Technical Details"


  • 麦新杰

    [B frame] B picture or B frame (bipredictive coded picture) – contains motion-compensated difference information relative to previously decoded pictures. In older designs such as MPEG-1 and H.262/MPEG-2, each B picture can only reference two pictures, the one which precedes the B picture in display order and the one which follows, and all referenced pictures must be I or P pictures. These constraints do not apply in the newer standards H.264/MPEG-4 AVC and HEVC.

  • 麦新杰

    [P frame] P picture or P frame (predictive coded picture) – contains motion-compensated difference information relative to previously decoded pictures. In older designs such as MPEG-1, H.262/MPEG-2 and H.263, each P picture can only reference one picture, and that picture must precede the P picture in display order as well as in decoding order and must be an I or P picture. These constraints do not apply in the newer standards H.264/MPEG-4 AVC and HEVC.

  • 麦新杰

    On P and B slices: every inter-predicted block in a P slice can carry only one set of motion-compensated prediction data, and a P slice uses only one reference picture list. Blocks in a B slice may carry up to two sets of motion-compensated prediction data, and a B slice may use two reference picture lists. Note that a reference picture list can hold multiple pictures.

    • 麦新杰

      P-frames provide the “differences” between the current frame and one (or more) frames that came before it. P-frames offer much better compression than I-frames, because they take advantage of both temporal and spatial compression and use less bits within a video stream.

  • 麦新杰

    H.264/AVC uses a 6-tap filter for half-pixel interpolation and then simple linear interpolation to achieve quarter-pixel precision from the half-pixel data.

  • 麦新杰

    Judging from the book's explanation, deciding how to partition an MB still comes down to analysing the residual data: where there is little change, use large partitions; where there is a lot of change, use small partitions.

  • 麦新杰

    The terms 'sub-pixel', 'half-pixel' and 'quarter-pixel' are widely used in this context, although in fact the process is usually applied to luma and chroma samples, not pixels.

