SF-NET from Tencent

1. Network Architecture Overview

2. Overview of the Network Modules

Frame Level

A mini-batch can be represented as a 5D tensor of shape B × T × C × H × W (batch, time, channels, height, width).

The 2D convolution is applied per sample per frame: the time dimension is folded into the batch dimension, so each of the B × T frames is convolved independently.

The 3D convolution is applied per sample, over the full C × T × H × W volume.

We do not reduce the temporal dimension during 3D convolution, so in SF-Net the output temporal length equals T.
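The two shape conventions above can be sketched with NumPy; the sizes here are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical sizes for illustration (not from the paper).
B, T, C, H, W = 2, 16, 3, 32, 32
x = np.zeros((B, T, C, H, W))  # mini-batch as a 5D tensor

# 2D convolution is applied per sample per frame: fold time into the
# batch dimension so an ordinary 2D conv sees (B*T, C, H, W).
x2d = x.reshape(B * T, C, H, W)

# 3D convolution is applied per sample: move channels before time so a
# 3D conv sees (B, C, T, H, W); padding keeps the temporal length at T.
x3d = x.transpose(0, 2, 1, 3, 4)

print(x2d.shape, x3d.shape)  # (32, 3, 32, 32) (2, 3, 16, 32, 32)
```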

Inspired by the MiCT network, after each 2D and 3D convolution we merge the features of the two branches with a cross-domain element-wise summation, so the final output is the sum of the two branches' feature maps.

This operation speeds up learning and allows training of deeper architectures. At the same time, it lets the 3D convolution branch learn only residual temporal features, which in sign language correspond to fast and small motions, to complement the features learned by the 2D convolution.
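A minimal NumPy sketch of the cross-domain element-wise summation, assuming illustrative shapes and that both branches emit the same number of channels K:

```python
import numpy as np

# Hypothetical branch-output sizes for illustration.
B, T, K, H, W = 2, 16, 64, 8, 8
f2d = np.random.rand(B * T, K, H, W)       # 2D-branch features (time folded into batch)
f3d = np.random.rand(B, K, T, H, W) * 0.1  # 3D-branch (residual temporal) features

# Reshape the 2D features back to 5D, align axes with the 3D branch,
# then merge the two domains with an element-wise sum.
f2d_5d = f2d.reshape(B, T, K, H, W).transpose(0, 2, 1, 3, 4)
out = f2d_5d + f3d
print(out.shape)  # (2, 64, 16, 8, 8)
```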

After the last convolution block, we conduct a global average pooling to reduce dimension.

The output feature Y has shape B × T × K, where K is the number of channels in the last block.
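The pooling step can be sketched as below, again with hypothetical shape values:

```python
import numpy as np

# Hypothetical last-block feature map: B x T x K x H x W.
B, T, K, H, W = 2, 16, 512, 7, 7
feat = np.random.rand(B, T, K, H, W)

# Global average pooling over the spatial dimensions H and W,
# leaving a B x T x K frame-level feature.
y = feat.mean(axis=(3, 4))
print(y.shape)  # (2, 16, 512)
```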

Gloss Level

Similar to framing in automatic speech recognition (ASR), given an input of length T, a window size L, and a stride S, the number of meta frames generated is F = ⌊(T − L)/S⌋ + 1, and each meta frame contains L frames, giving a feature of shape L × K.

After framing, the frame-level feature is transformed into a 4D tensor Y′ of shape B × F × L × K.
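The framing step can be sketched as follows, using the L = 12 and S = 3 values from the experimental setup, with hypothetical T and K:

```python
import numpy as np

# Window size L and stride S as in the experiments; T and K are
# hypothetical values for illustration.
T, L, S, K = 30, 12, 3, 512
F = (T - L) // S + 1  # number of meta frames
print(F)  # 7

y = np.random.rand(T, K)  # frame-level features for one sample
# Slice the frame sequence into F overlapping meta frames of L frames each.
meta = np.stack([y[i * S : i * S + L] for i in range(F)])
print(meta.shape)  # (7, 12, 512)
```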

Next is an LSTM layer, whose output M has shape B × F × H, where H is the number of hidden units in the LSTM layer.

We add a regularizer at the gloss level to enhance the generalization of the features.

This regularizer is introduced after the first few epochs to ensure stable training.

Sentence Level

Add a bidirectional LSTM.

Use CTC as the loss function.

When combined with the regularizer at the gloss level, the total loss is the sentence-level CTC loss combined with the gloss-level regularization term.

During testing, the final output can be obtained by simply applying greedy decoding to the probability P_sl.
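Greedy CTC decoding can be sketched as below; the toy posterior and label set are made up for illustration, with label 0 as the blank:

```python
import numpy as np

def ctc_greedy_decode(probs, blank=0):
    """Greedy CTC decoding: take the argmax label per time step,
    collapse consecutive repeats, then drop the blank token."""
    best = probs.argmax(axis=-1)
    decoded, prev = [], None
    for t in best:
        if t != prev and t != blank:
            decoded.append(int(t))
        prev = t
    return decoded

# Toy posterior over 4 labels (0 = blank) for 6 time steps.
probs = np.array([
    [0.1, 0.8, 0.05, 0.05],
    [0.1, 0.8, 0.05, 0.05],
    [0.9, 0.05, 0.03, 0.02],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
print(ctc_greedy_decode(probs))  # [1, 2, 3]
```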

3. Experimental Setup

The setting for the frame level part is shown in Figure 4.

In the rest of the network, we use 1 LSTM layer with 512 hidden units and 1 BiLSTM layer with 256 hidden units in each direction respectively.

The window size L is 12, which is approximately 0.5 seconds for CSL and RWTH datasets. The framing stride S is set to 3.

BN is used after every 2D convolution and after each 2D/3D block.

Sequence-wise BN is used for LSTM and BiLSTM layers.

Training uses data augmentation, the Adam optimizer, and weight decay.

4. Experimental Results

 


Reposted from www.cnblogs.com/august-en/p/11789201.html