TResNet: ResNet improvements to achieve high accuracy while maintaining high GPU utilization

My thesis proposal is finally done, so let me quickly get this post out and let it sit for a while. The backstory: I want to write a paper, but after adding something to my model the speed dropped, so my advisor suggested adding this to see whether it can run a bit faster.

Paper title: TResNet: High Performance GPU-Dedicated Architecture

Paper address: https://arxiv.org/abs/2003.13630

Code: https://github.com/mrT23/TResNet

TResNet comes in three variants, TResNet-M, TResNet-L, and TResNet-XL, which differ only in depth and the number of channels. The TResNet architecture contains the following improvements: a SpaceToDepth stem, Anti-Alias downsampling, In-Place Activated BatchNorm, block selection, and SE layers. Some of these improvements increase model throughput, while others decrease it. Let's go through the five one by one.

1.  SpaceToDepth stem

The ResNet50 stem consists of a stride-2 conv7×7 followed by a max-pooling layer. ResNet-D replaces the conv7×7 with three conv3×3 layers, which improves accuracy but reduces training throughput. The paper instead uses a dedicated SpaceToDepth layer that rearranges blocks of spatial data into the depth (channel) dimension; the SpaceToDepth layer is followed by a simple convolution to match the required number of channels.

code:

import torch
import torch.nn as nn

class SpaceToDepth(nn.Module):
    def __init__(self, block_size=4):
        super().__init__()
        assert block_size == 4
        self.bs = block_size

    def forward(self, x):
        N, C, H, W = x.size()
        x = x.view(N, C, H // self.bs, self.bs, W // self.bs, self.bs)  # (N, C, H//bs, bs, W//bs, bs)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()  # (N, bs, bs, C, H//bs, W//bs)
        x = x.view(N, C * (self.bs ** 2), H // self.bs, W // self.bs)  # (N, C*bs^2, H//bs, W//bs)
        return x
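
As a quick sanity check, here is a minimal sketch of how the stem might be assembled from the layer above. The conv3×3 + BatchNorm + ReLU used to map the 48 rearranged channels to the stem width is my own placeholder, and stem_width=64 is only illustrative; the official repository fuses the normalization and activation with Inplace-ABN (see section 3).

import torch
import torch.nn as nn

# Sketch of a SpaceToDepth stem: SpaceToDepth turns a 3x224x224 image into a
# 48x56x56 tensor, then a simple convolution matches the desired channel count.
def make_stem(in_chans=3, stem_width=64):
    return nn.Sequential(
        SpaceToDepth(block_size=4),  # 3x224x224 -> 48x56x56
        nn.Conv2d(in_chans * 16, stem_width, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(stem_width),
        nn.ReLU(inplace=True),
    )

x = torch.randn(1, 3, 224, 224)
print(make_stem()(x).shape)  # torch.Size([1, 64, 56, 56])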

2.  Anti-Alias downsampling

The stride-2 convolution is replaced by a stride-1 convolution followed by a 3×3 blur filter applied with stride 2. Blurring before subsampling reduces aliasing and makes the downsampling more robust to small shifts of the input.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AADownsample(nn.Module):
    def __init__(self, filt_size=3, stride=2, channels=None):
        super(AADownsample, self).__init__()
        self.filt_size = filt_size
        self.stride = stride
        self.channels = channels

        assert self.filt_size == 3
        # Binomial coefficients [1, 2, 1] build a 3x3 blur kernel, normalized to sum to 1
        a = torch.tensor([1., 2., 1.])
        filt = (a[:, None] * a[None, :])
        filt = filt / torch.sum(filt)

        # One copy of the kernel per channel, applied below as a depthwise (grouped) convolution
        self.register_buffer('filt', filt[None, None, :, :].repeat((self.channels, 1, 1, 1)))

    def forward(self, input):
        input_pad = F.pad(input, (1, 1, 1, 1), 'reflect')
        return F.conv2d(input_pad, self.filt, stride=self.stride, padding=0, groups=input.shape[1])
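
Below is a hedged usage sketch of how such a downsampling step could be wired: a stride-1 conv3×3 followed by the AADownsample module above. The surrounding BatchNorm + ReLU is a plain placeholder (TResNet itself uses Inplace-ABN with Leaky-ReLU, see the next section), and the channel count is illustrative.

import torch
import torch.nn as nn

# Replace a stride-2 conv3x3 with: stride-1 conv3x3 -> blur filter applied with stride 2
channels = 64  # illustrative channel count
aa_downsample_block = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(channels),
    nn.ReLU(inplace=True),
    AADownsample(filt_size=3, stride=2, channels=channels),
)

x = torch.randn(1, channels, 56, 56)
print(aa_downsample_block(x).shape)  # torch.Size([1, 64, 28, 28])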

3. In-Place Activated BatchNorm (Inplace-ABN)

Throughout the architecture, the author replaces all BatchNorm + ReLU layers with Inplace-ABN layers, which implement BatchNorm and the activation as a single in-place operation. This greatly reduces the memory required to train a deep network, while the increase in computational cost is negligible.

Using Inplace-ABN in the TResNet model has the following advantages:

The BatchNorm layers are the main consumers of GPU memory. Replacing them with Inplace-ABN in practice almost doubles the maximum possible batch size, improving GPU throughput.
For TResNet, Leaky-ReLU gives better accuracy than plain ReLU. Some modern activation functions, such as Swish and Mish, may also give better accuracy than ReLU, but their GPU memory consumption and computational cost are higher. In contrast, Leaky-ReLU has exactly the same GPU memory consumption and computational cost as plain ReLU.
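
As a rough sketch of the substitution (not the repository's exact wrapper), assuming the inplace_abn package from the Inplace-ABN authors is installed; the Leaky-ReLU slope used here is only illustrative.

import torch.nn as nn
from inplace_abn import InPlaceABN  # assumed available via: pip install inplace-abn

# Conventional pattern: Conv -> BatchNorm -> ReLU as three separate modules
def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# TResNet-style pattern: BatchNorm + Leaky-ReLU fused into a single in-place operation
def conv_abn(in_ch, out_ch, slope=1e-3):  # slope value is illustrative
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        InPlaceABN(out_ch, activation="leaky_relu", activation_param=slope),
    )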

4. Blocks Selection

In the figure from the paper, the left side shows the BasicBlock used by ResNet34 and the right side shows the Bottleneck used by ResNet50. Bottleneck blocks have higher GPU usage but can reach higher accuracy, while BasicBlock blocks have a larger receptive field.

Therefore, TResNet uses BasicBlock in the first two stages and Bottleneck in the last two stages; a rough sketch of such a stage layout follows.
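
The sketch below only illustrates the idea, using torchvision's stock BasicBlock and Bottleneck; the block counts are illustrative (ResNet-50-style), and TResNet's actual blocks additionally use Inplace-ABN, anti-aliased downsampling and SE layers, so this is not the paper's exact configuration.

import torch.nn as nn
from torchvision.models.resnet import BasicBlock, Bottleneck

# Build one stage from a given residual block type (illustrative helper)
def make_stage(block, in_ch, width, num_blocks, stride):
    downsample = None
    if stride != 1 or in_ch != width * block.expansion:
        downsample = nn.Sequential(
            nn.Conv2d(in_ch, width * block.expansion, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(width * block.expansion),
        )
    layers = [block(in_ch, width, stride, downsample)]
    layers += [block(width * block.expansion, width) for _ in range(num_blocks - 1)]
    return nn.Sequential(*layers)

# Stages 1-2 use BasicBlock, stages 3-4 use Bottleneck (block counts are illustrative)
stages = nn.Sequential(
    make_stage(BasicBlock, 64, 64, 3, stride=1),
    make_stage(BasicBlock, 64, 128, 4, stride=2),
    make_stage(Bottleneck, 128, 256, 6, stride=2),
    make_stage(Bottleneck, 1024, 512, 3, stride=2),
)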

5. SE Layers

SE layers are placed only in the first three stages of the network, which gives the best speed-accuracy trade-off. For Bottleneck units, the SE module is added after the conv3×3 operation with a reduction factor of 8 (r = 8). For BasicBlock units, the SE module is added just before the residual sum, with a reduction factor of 4 (r = 4).
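
For reference, here is a generic squeeze-and-excitation module together with the reduction factors quoted above; this is my own minimal sketch with illustrative channel counts, not the exact SE implementation from the TResNet repository.

import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Generic squeeze-and-excitation block (sketch)."""
    def __init__(self, channels, reduction):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(             # excitation: channel bottleneck + gate
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))  # reweight the input channels

# Placement described above (channel counts are illustrative):
se_for_basicblock = SEModule(channels=64, reduction=4)    # before the residual sum, r = 4
se_for_bottleneck = SEModule(channels=128, reduction=8)   # after the conv3x3, r = 8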

After that, the paper presents various comparison experiments and ablation studies. With the proposal out of the way, it's time for me to go write my own little paper. Long live graduation!

Original post: blog.csdn.net/Zosse/article/details/127783334