Notes on Learning in the Frequency Domain

Paper: Learning in the Frequency Domain, CVPR 2020

Code: https://github.com/calmevtime/DCTNet

Real images are often large and cannot be fed into a CNN directly. Typical CNN pipelines therefore first downsample the image to 224x224 before processing, which discards information and hurts accuracy. To address this, researchers from Alibaba propose transforming the RGB image into the frequency domain with the DCT instead of downsampling it directly. The method requires no changes to the existing network architecture and can be applied to any CNN.

The general idea of the method: the high-resolution RGB image is first converted to the YCbCr color space and then transformed to the frequency domain with the DCT, producing many channels. Since some channels matter much more for classification than others, only the significant channels are kept as input to the CNN.
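The first step above, the RGB to YCbCr conversion, can be sketched as follows. This is a minimal illustration using the standard ITU-R BT.601 (JPEG) coefficients; the function name and the stand-in image are my own, not from the paper's code.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Convert an HxWx3 uint8 RGB image to YCbCr (ITU-R BT.601, JPEG convention)."""
    img = img.astype(np.float64)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return np.stack([y, cb, cr], axis=-1)

img = np.zeros((448, 448, 3), dtype=np.uint8)  # a black stand-in image
ycc = rgb_to_ycbcr(img)
print(ycc.shape)  # (448, 448, 3)
```

Each of the three resulting planes (Y, Cb, Cr) is then transformed blockwise by the DCT.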

Specifically, the image is divided into 8x8 blocks, and each 8x8 block of the Y channel is transformed by the DCT into 64 coefficients, corresponding to 64 frequency components. An original image of size W x H yields W/8 x H/8 blocks. The coefficients at the same frequency position in every block are gathered into one feature map of size W/8 x H/8, so 8x8 = 64 feature maps are generated. The Cb and Cr channels each produce 64 feature maps as well, for a total of 64x3 = 192 feature maps. With W = H = 448, the frequency-domain input is therefore 56x56x192.
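The block-DCT-and-regroup step can be sketched as below. This is my own minimal reimplementation of the idea, not the DCTNet code; it uses `scipy.fft.dctn` (the type-II DCT, as in JPEG) and numpy reshapes to gather same-frequency coefficients across blocks.

```python
import numpy as np
from scipy.fft import dctn  # type-II DCT, the transform used in JPEG

def blockwise_dct_features(channel, block=8):
    """Turn an HxW channel into block*block frequency feature maps
    of size (H/block, W/block), one map per DCT coefficient position."""
    H, W = channel.shape
    # split into non-overlapping 8x8 blocks: (H/8, W/8, 8, 8)
    blocks = channel.reshape(H // block, block, W // block, block).transpose(0, 2, 1, 3)
    coeffs = dctn(blocks, axes=(-2, -1), norm='ortho')   # 2D DCT per block
    # gather same-frequency coefficients across blocks -> (64, H/8, W/8)
    return coeffs.transpose(2, 3, 0, 1).reshape(block * block, H // block, W // block)

y = np.random.rand(448, 448)   # stand-in for the 448x448 Y channel
feats = blockwise_dct_features(y)
print(feats.shape)             # (64, 56, 56); Y + Cb + Cr gives 192 maps
```

Applying the same function to Cb and Cr and stacking the results yields the 56x56x192 input described above.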

For ResNet-50 with a 224x224 input, the feature map after the first convolution and pooling is 56x56, so the 56x56x192 frequency-domain feature maps can be fed into the network at exactly that point.

Time is limited and I have not read the code yet; I will add a code walkthrough when I get the chance.


Source: www.cnblogs.com/gaopursuit/p/12552257.html