[Computer Vision] An Introduction to Image Feature Extractors (Part 2)

1. Mixed Depthwise Convolution

MixConv (or Mixed Depthwise Convolution) is a type of depthwise convolution that naturally mixes multiple kernel sizes in a single convolution. It is based on the insight that depthwise convolutions apply a single kernel size to all channels, a limitation MixConv overcomes by combining the advantages of multiple kernel sizes: it divides the channels into groups and applies a different kernel size to each group.
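The channel-grouping idea can be sketched in a few lines. This is a minimal 1-D illustration (the paper uses 2-D depthwise convolutions); all function names here are illustrative, not from the paper.

```python
def depthwise_conv1d(channel, kernel):
    """'Same'-padded 1-D convolution of a single channel with its own kernel."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + channel + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(channel))]

def mixconv(channels, group_kernels):
    """Split the channels into len(group_kernels) equal groups and apply a
    different kernel size to each group, as in Mixed Depthwise Convolution."""
    group_size = len(channels) // len(group_kernels)
    out = []
    for g, kernel in enumerate(group_kernels):
        for ch in channels[g * group_size:(g + 1) * group_size]:
            out.append(depthwise_conv1d(ch, kernel))
    return out

# four channels in two groups: a 3-tap kernel for one group, a 5-tap kernel for the other
x = [[1.0, 2.0, 3.0, 4.0]] * 4
y = mixconv(x, [[1/3] * 3, [1/5] * 5])
```

Because each group is still convolved depthwise, the cost is the same as an ordinary depthwise convolution with mixed kernel sizes.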


2. Deformable Kernel

The Deformable Kernel (DK) is a convolution operator for deformation modeling. DK learns free-form offsets on the kernel coordinates, deforming the original kernel space toward specific data modalities rather than resampling the data. This directly adapts the effective receptive field (ERF) while leaving the receptive field itself untouched. DKs can be used as a drop-in replacement for rigid kernels.
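The key operation, resampling the kernel itself at offset coordinates, can be sketched in 1-D. Linear interpolation (bilinear in 2-D) is what keeps the learned offsets differentiable; the names below are illustrative.

```python
import math

def sample_kernel(kernel, pos):
    """Linearly interpolate the kernel value at a fractional coordinate."""
    lo = int(math.floor(pos))
    frac = pos - lo
    at = lambda i: kernel[i] if 0 <= i < len(kernel) else 0.0
    return (1 - frac) * at(lo) + frac * at(lo + 1)

def deform_kernel(kernel, offsets):
    """Resample the original kernel at offset tap coordinates, producing the
    deformed kernel that is then used in a standard convolution."""
    return [sample_kernel(kernel, i + off) for i, off in enumerate(offsets)]

original = [0.0, 1.0, 0.0]
deformed = deform_kernel(original, [0.0, 0.5, 0.0])  # middle tap shifted by 0.5
```

Note the contrast with deformable convolution: there the *data* is resampled at offset locations, whereas DK resamples the *kernel*.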


3. Dynamic Convolution

DynamicConv is a convolution for sequence modeling whose kernel changes over time, computed as a learned function of the individual time step. It is built on top of LightConv and takes the same form, but uses a time-step-dependent kernel predicted from the current input alone:

DynamicConv(X, i, c) = LightConv(X, f(X_i)_{h,:}, i, c)

where f is a learned linear function of the current input X_i.
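The mechanism can be sketched with scalar "tokens": at each step a kernel is generated from the current input, softmax-normalized as in LightConv, and applied over a fixed window. The kernel generator here is a toy stand-in for the learned linear function f.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(u - m) for u in v]
    s = sum(e)
    return [u / s for u in e]

def dynamic_conv(xs, kernel_logits_fn, k=3):
    """DynamicConv sketch: at each time step t, a kernel is predicted from
    the current input x_t alone, softmax-normalized (as in LightConv), and
    applied over the k most recent inputs."""
    out = []
    for t in range(len(xs)):
        kernel = softmax(kernel_logits_fn(xs[t]))   # time-step-dependent kernel
        window = [xs[t - j] if t - j >= 0 else 0.0 for j in range(k)]
        out.append(sum(w * v for w, v in zip(kernel, window)))
    return out

# toy kernel generator (hypothetical): kernel logits are fixed multiples of x_t
ys = dynamic_conv([1.0, 2.0, 3.0], lambda x: [x, 0.0, -x], k=3)
```

Unlike self-attention, the cost per step is constant in the sequence length, since the kernel only looks at a fixed window.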

4. Submanifold Convolution

Submanifold convolution (also called submanifold sparse convolution) is a convolution for spatially sparse inputs: outputs are computed only at sites that are already active in the input, so the set of active sites, and hence the sparsity pattern, is preserved from layer to layer instead of dilating with depth as it would under ordinary convolution.
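A minimal 1-D sketch of this sparsity-preserving rule, with the sparse input stored as a dictionary from active site index to value (names are illustrative):

```python
def submanifold_conv(active, kernel):
    """Submanifold sparse convolution on a 1-D grid: outputs are computed
    only at already-active sites, and only active neighbors contribute, so
    the sparsity pattern of `active` is preserved exactly."""
    k = len(kernel)
    r = k // 2
    out = {}
    for i in active:                     # only active sites produce output
        acc = 0.0
        for j in range(-r, r + 1):
            if i + j in active:          # only active neighbors contribute
                acc += kernel[j + r] * active[i + j]
        out[i] = acc
    return out

y = submanifold_conv({0: 1.0, 1: 2.0, 5: 3.0}, [0.5, 1.0, 0.5])
```

With an ordinary convolution, site 2 and site 4 would become active after one layer; here they stay empty.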

5. CondConv

CondConv (or conditionally parameterized convolution) is a type of convolution that learns a specialized convolution kernel for each example.

To effectively increase the capacity of the CondConv layer, developers can increase the number of experts. This is more computationally efficient than increasing the size of the kernel itself, since the kernel is applied to many different locations within the input, while experts are only combined once per input.
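The "experts combined once per input" point can be made concrete: routing weights are computed from the input, the expert kernels are mixed into a single kernel, and only then is one ordinary convolution run. The routing function below is a toy stand-in for the learned router; all names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def condconv(x, experts, routing_fn):
    """CondConv sketch (1-D): mix the expert kernels ONCE per example using
    input-dependent routing weights, then run a single convolution."""
    alphas = routing_fn(x)                              # one weight per expert
    k = len(experts[0])
    mixed = [sum(a * e[j] for a, e in zip(alphas, experts))
             for j in range(k)]                         # example-specific kernel
    pad = k // 2
    padded = [0.0] * pad + x + [0.0] * pad
    return [sum(mixed[j] * padded[i + j] for j in range(k))
            for i in range(len(x))]

# toy router (hypothetical): sigmoid of the mean input, one weight per expert
route = lambda x: [sigmoid(sum(x) / len(x)), sigmoid(-sum(x) / len(x))]
y = condconv([1.0, 2.0, 3.0], [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]], route)
```

Adding an expert adds one kernel's worth of parameters but only one extra multiply-add per kernel element at mixing time, which is why capacity scales cheaply.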


6. Active Convolution

Active convolution is a kind of convolution that has no fixed receptive field shape and can therefore take more diverse receptive field forms. The shape is learned through backpropagation during training. It can be seen as a generalization of convolution: it can represent not only all regular convolutions but also convolutions with fractional pixel coordinates. The shape of the convolution can be changed freely, which gives greater freedom in designing CNN structures; and since the shape is learned during training, no manual tuning is required.
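The fractional-coordinate idea can be sketched in 1-D: each kernel tap has a position (possibly fractional) relative to the output location, and the input is sampled there by linear interpolation, which is what makes the positions learnable by backpropagation. Names are illustrative.

```python
import math

def sample_input(x, pos):
    """Linearly interpolate the input at a fractional coordinate."""
    lo = int(math.floor(pos))
    frac = pos - lo
    at = lambda i: x[i] if 0 <= i < len(x) else 0.0
    return (1 - frac) * at(lo) + frac * at(lo + 1)

def active_conv(x, weights, positions):
    """Active convolution sketch (1-D): each kernel tap has a learnable,
    possibly fractional position relative to the output location."""
    return [sum(w * sample_input(x, i + p)
                for w, p in zip(weights, positions))
            for i in range(len(x))]

# the standard 3-tap shape (-1, 0, +1) versus a narrower fractional shape
y_std = active_conv([1.0, 2.0, 3.0, 4.0], [1/3] * 3, [-1.0, 0.0, 1.0])
y_frac = active_conv([1.0, 2.0, 3.0, 4.0], [1/3] * 3, [-0.5, 0.0, 0.5])
```

Setting the positions to integers recovers regular convolution, which is the sense in which active convolution generalizes it.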


7. Depthwise Dilated Separable Convolution

Depthwise dilated separable convolutions are a type of convolution that combines depthwise separability with dilated convolutions.
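The combination can be sketched in two stages: a dilated depthwise convolution per channel (taps spaced `dilation` apart), followed by a pointwise (1×1) mix across channels. This 1-D toy version uses illustrative names.

```python
def dilated_depthwise(channel, kernel, dilation):
    """Dilated depthwise step: kernel taps are `dilation` positions apart."""
    k = len(kernel)
    pad = (k - 1) * dilation // 2
    padded = [0.0] * pad + channel + [0.0] * pad
    return [sum(kernel[j] * padded[i + j * dilation] for j in range(k))
            for i in range(len(channel))]

def ddsconv(channels, kernel, dilation, pointwise):
    """Depthwise dilated separable convolution: a dilated depthwise conv per
    channel, then a pointwise (1x1) combination across channels."""
    spatial = [dilated_depthwise(ch, kernel, dilation) for ch in channels]
    n = len(spatial[0])
    return [[sum(w * spatial[c][i] for c, w in enumerate(row))
             for i in range(n)]
            for row in pointwise]

x = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
y = ddsconv(x, [1.0, 1.0, 1.0], dilation=2, pointwise=[[0.5, 0.5]])
```

Dilation enlarges the receptive field of the cheap depthwise stage without adding parameters, and the pointwise stage carries all cross-channel mixing.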


8. Involution

Involution is an atomic operation for deep neural networks that inverts the design principles of convolution: involution kernels are distinct across spatial positions but shared across channels. If an involution kernel were parameterized as a fixed-size matrix (like a convolution kernel) and updated by backpropagation, the learned kernels would not transfer between input images of variable resolution; instead, the kernel at each position is generated from the feature at that position.

The authors argue that involution has two benefits over convolution: (i) involution can summarize context over a wider spatial extent, thereby overcoming the difficulty of modeling long-range interactions; (ii) involution can adaptively allocate weights over different positions, thereby prioritizing the most informative visual elements in the spatial domain.
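The inverted sharing pattern can be sketched in 1-D: the kernel at position i is generated from the feature at i and shared by every channel. The kernel generator below is a hypothetical fixed linear map standing in for the learned generator.

```python
def involution(x, kernel_fn, k=3):
    """Involution sketch (1-D, single group): the kernel at position i is
    generated from the feature at i itself and shared across all channels,
    inverting convolution's spatially-shared, channel-specific design."""
    pad = k // 2
    n = len(x[0])
    out = [[] for _ in x]
    for i in range(n):
        # generate the kernel from the (channel-summed) feature at position i
        kernel = kernel_fn(sum(ch[i] for ch in x))
        for c, ch in enumerate(x):
            padded = [0.0] * pad + ch + [0.0] * pad
            out[c].append(sum(kernel[j] * padded[i + j] for j in range(k)))
    return out

# toy kernel generator (hypothetical): a fixed linear map of the local feature
gen = lambda v: [v / 6.0, v / 3.0, v / 6.0]
y = involution([[1.0, 2.0], [2.0, 4.0]], gen, k=3)
```

Because the kernel is generated on the fly from the input, the same generator works at any input resolution, which addresses the transferability issue above.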


9. Dilated Convolution with Learnable Spacings

Dilated convolution with learnable spacing (DCLS) is a type of convolution that allows the spacing between non-zero elements of the kernel to be learned during training. This makes it possible to increase the receptive field of the convolution without increasing the number of parameters, which can improve the performance of the network on tasks that require long-range dependencies.

Dilated convolution is a type of convolution that allows the kernel to skip certain input positions. This is done by inserting zeros between the non-zero elements of the kernel, which increases the receptive field of the convolution without increasing the number of parameters.

DCLS takes this idea one step further by allowing the spacing between the non-zero elements of the kernel to be learned during training. This means the network can learn to skip different input positions depending on the task at hand, which is particularly useful for tasks that require long-range dependencies, such as image segmentation and object detection.
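The standard trick for making tap positions trainable is to spread each weight onto its two nearest integer cells by linear interpolation when constructing the dense kernel. A minimal 1-D sketch with illustrative names:

```python
import math

def dcls_kernel(weights, positions, size):
    """Build a dense kernel of length `size` from a few weighted taps at
    learnable (possibly fractional) positions. Each weight is spread onto
    its two nearest integer cells, so the positions are differentiable."""
    dense = [0.0] * size
    for w, p in zip(weights, positions):
        lo = int(math.floor(p))
        frac = p - lo
        if 0 <= lo < size:
            dense[lo] += (1 - frac) * w
        if 0 <= lo + 1 < size:
            dense[lo + 1] += frac * w
    return dense

# three parameters spread over a size-7 kernel: large receptive field, few weights
k = dcls_kernel([1.0, 2.0, 1.0], [0.0, 3.5, 6.0], size=7)
```

The dense kernel is then used in an ordinary convolution; only the three weights and their three positions are trained.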

DCLS has been shown to be effective for a variety of tasks, including image classification, object detection, and semantic segmentation. It is a promising technique with the potential to improve the performance of convolutional neural networks across many tasks.


10. Attention-augmented Convolution

Attention-augmented convolution is a convolution with a two-dimensional relative self-attention mechanism that can replace convolution as a stand-alone computational primitive for image classification. Like the Transformer, it uses scaled dot-product attention and multi-head attention.

Like convolutions, attention-augmented convolutions are 1) equivariant to translation and 2) able to operate readily on inputs of different spatial dimensions.
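The core structure, concatenating convolutional features with attentional features, can be sketched in 1-D. For brevity the attention branch is single-head and uses the raw inputs as queries, keys, and values (no learned projections, no relative position embeddings); all names are illustrative.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(u - m) for u in v]
    s = sum(e)
    return [u / s for u in e]

def self_attention(x):
    """Scaled dot-product self-attention over a 1-D sequence (toy stand-in
    for the multi-head attention branch)."""
    d = 1.0  # key dimensionality used for scaling
    out = []
    for q in x:
        scores = softmax([q * k / math.sqrt(d) for k in x])
        out.append(sum(a * v for a, v in zip(scores, x)))
    return out

def conv1d(x, kernel):
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + x + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(x))]

def attention_augmented_conv(x, kernel):
    """Concatenate convolutional features with attentional features
    (channel-wise), the core of attention-augmented convolution."""
    return [conv1d(x, kernel), self_attention(x)]

y = attention_augmented_conv([1.0, 2.0, 3.0], [1/3] * 3)
```

The convolutional channel keeps locality and translation equivariance; the attentional channel contributes global context.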


11. PP-OCR

PP-OCR is an OCR system consisting of three parts: text detection, detection box rectification, and text recognition. The purpose of text detection is to locate the text regions in an image; PP-OCR uses Differentiable Binarization (DB), based on a simple segmentation network, as the text detector. The recognizer integrates feature extraction and sequence modeling, and employs the connectionist temporal classification (CTC) loss to avoid the inconsistency between predictions and labels.
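The CTC part has a simple, illustrative decoding rule: per-frame predictions of varying length are mapped to one label sequence by merging repeats and dropping blanks. A greedy-decoding sketch (names are illustrative; real systems decode over probability matrices):

```python
def ctc_collapse(path, blank="-"):
    """Greedy CTC decoding step: merge repeated symbols, then drop blanks,
    so frame-level predictions map to a single label sequence."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

decoded = ctc_collapse("--hh-e-ll-lo--")
```

The blank symbol is what lets CTC represent genuinely repeated characters: "ll" survives because a blank separates the two runs of "l".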


12. Displaced Aggregation Units

Displaced aggregation units (DAUs) replace the fixed grid of the classic convolutional layer in ConvNets with filter units at learnable positions. This introduces an explicit structure of hierarchical composition and brings several benefits:

Fully adjustable and learnable receptive fields through spatially displaceable filter units
Fewer parameters for the same spatial coverage, enabling efficient inference
Decoupling of the number of parameters from the receptive field size
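The decoupling point can be made concrete in 1-D: a response aggregates a few filter units placed at learnable, possibly fractional displacements, sampled by interpolation. Three parameters can reach arbitrarily far. Names are illustrative.

```python
import math

def dau_response(x, unit_weights, unit_positions, i):
    """Displaced aggregation unit sketch (1-D): the response at position i
    aggregates filter units at learnable fractional displacements, so the
    receptive field size is decoupled from the number of parameters."""
    def sample(pos):
        lo = int(math.floor(pos))
        frac = pos - lo
        at = lambda j: x[j] if 0 <= j < len(x) else 0.0
        return (1 - frac) * at(lo) + frac * at(lo + 1)
    return sum(w * sample(i + d) for w, d in zip(unit_weights, unit_positions))

# three units reaching 4 steps away: a 9-wide receptive field from 3 weights
r = dau_response([1.0, 2.0, 3.0, 4.0, 5.0], [0.2, 0.6, 0.2], [-4.0, 0.0, 4.0], 4)
```

A classic 9-tap kernel would need 9 parameters for the same coverage; the DAU needs only as many as there are units.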


13. Dimension-wise Convolution

Dimension-wise convolution extends depthwise convolution from the channel dimension to every dimension of the input tensor: lightweight convolutions are applied independently along the height, width, and channel axes, and their outputs are then fused. This encodes both spatial and channel-wise information at a fraction of the cost of a standard convolution; it forms the core of the DiCE unit used in DiCENet.

14. Local Relation Layer

The local relation layer is an image feature extractor and an alternative to the convolution operator. Intuitively, aggregation in convolution is basically a pattern-matching process that applies fixed filters, which is inefficient for modeling visual elements with varying spatial distributions. The local relation layer instead adaptively determines its aggregation weights from the compositional relationship of pixel pairs in a local area. It is argued that with this relational approach, visual elements can be composed into higher-level entities more efficiently, thereby facilitating semantic reasoning.
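A 1-D sketch of relation-based aggregation: the weight given to each neighbor comes from the composability of the pixel pair (here, simply the product of their values, standing in for learned query/key embeddings) plus a geometric prior over relative position, normalized with a softmax. All names and the toy composability measure are illustrative.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(u - m) for u in v]
    s = sum(e)
    return [u / s for u in e]

def local_relation(x, window=3, geom_prior=None):
    """Local relation layer sketch (1-D): aggregation weights are computed
    from local pixel-pair relations plus a geometric prior, rather than
    taken from a fixed filter."""
    r = window // 2
    if geom_prior is None:
        geom_prior = [0.0] * window
    out = []
    for i, q in enumerate(x):
        idx = [j for j in range(i - r, i + r + 1) if 0 <= j < len(x)]
        logits = [q * x[j] + geom_prior[j - i + r] for j in idx]
        w = softmax(logits)
        out.append(sum(a * x[j] for a, j in zip(w, idx)))
    return out

y = local_relation([1.0, 2.0, 3.0])
```

Because the weights are a softmax over relation scores, each output is a convex combination of its local window, but one whose mixing pattern adapts to the content.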


15. Lightweight Convolution

LightConv is a depthwise convolution for sequence modeling that shares weights over groups of output channels and normalizes the weights along the temporal dimension with a softmax. Compared with self-attention, LightConv has a fixed context window and determines the importance of context elements with a set of weights that does not change over time steps. For the i-th sequence element and output channel c, LightConv computes:

LightConv(X, W_{⌈cH/d⌉,:}, i, c) = DepthwiseConv(X, softmax(W_{⌈cH/d⌉,:}), i, c)

where d is the channel dimension and H is the number of weight-sharing heads, so that every d/H consecutive channels share one normalized kernel.
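The two ingredients, softmax-normalized kernels and cross-channel weight sharing, can be sketched directly (names are illustrative):

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(u - m) for u in v]
    s = sum(e)
    return [u / s for u in e]

def lightconv(channels, raw_weights, n_groups):
    """LightConv sketch: channels are split into n_groups groups that each
    share one kernel; each kernel is softmax-normalized along the temporal
    axis and applied as a depthwise convolution."""
    kernels = [softmax(w) for w in raw_weights]
    group = len(channels) // n_groups
    out = []
    for c, ch in enumerate(channels):
        kernel = kernels[c // group]          # weight sharing across channels
        k = len(kernel)
        pad = k // 2
        padded = [0.0] * pad + ch + [0.0] * pad
        out.append([sum(kernel[j] * padded[i + j] for j in range(k))
                    for i in range(len(ch))])
    return out

# four channels, two shared kernels of width 3
y = lightconv([[1.0, 2.0]] * 4, [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]], 2)
```

The softmax makes each kernel a distribution over the temporal window, which is what allows it to be read as fixed "attention" weights.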

16. Hamburger

Hamburger is a global context module that employs matrix factorization to decompose the learned representation into sub-matrices, thereby recovering a clean low-rank signal subspace. The key idea is that if we formulate an inductive bias such as global context as an objective function, then the optimization algorithm that minimizes this objective can construct a computational graph, i.e. the architecture we need in the network.
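A minimal sketch of the "optimization as architecture" idea, using one concrete choice of factorization objective: non-negative matrix factorization solved by multiplicative updates, whose iterations form the computational graph, and whose reconstruction is the cleaned low-rank signal. All names are illustrative, and this is only one of the factorization choices discussed for Hamburger.

```python
import random

def matmul(A, B):
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf_reconstruct(X, r, iters=300, eps=1e-9):
    """Factor X ~= D @ C with rank r via multiplicative NMF updates, then
    return the reconstruction D @ C as the recovered low-rank signal."""
    random.seed(0)
    n, m = len(X), len(X[0])
    D = [[random.random() + 0.1 for _ in range(r)] for _ in range(n)]
    C = [[random.random() + 0.1 for _ in range(m)] for _ in range(r)]
    for _ in range(iters):
        Dt = transpose(D)
        num, den = matmul(Dt, X), matmul(matmul(Dt, D), C)
        C = [[C[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(r)]
        Ct = transpose(C)
        num, den = matmul(X, Ct), matmul(D, matmul(C, Ct))
        D = [[D[i][j] * num[i][j] / (den[i][j] + eps) for j in range(r)]
             for i in range(n)]
    return matmul(D, C)

# a rank-1 matrix is recovered (almost) exactly by a rank-1 factorization
Y = nmf_reconstruct([[1.0, 2.0], [2.0, 4.0]], r=1)
```

In the network, X would be the flattened feature map, so the rank constraint acts as a global-context prior over all spatial positions at once.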


17. Span-Based Dynamic Convolution

Span-based dynamic convolution is a type of convolution used in the ConvBERT architecture to capture local dependencies between tokens. The kernel is generated from the local span of the current token, which makes better use of local dependencies and can distinguish different meanings of the same token (for example, if "a" precedes "can" in the input sentence, "can" is clearly a noun rather than a verb).

Specifically, classical convolution shares fixed parameters across all input tokens. Dynamic convolution is therefore preferable, thanks to its greater flexibility in capturing the local dependencies of different tokens: it uses a kernel generator to produce a different kernel for each input token. However, such dynamic convolution cannot distinguish the same token in different contexts, and generates the same kernel for it (e.g., the three "can"s in figure (b)).

Therefore, span-based dynamic convolutions are developed to produce more adaptive convolution kernels by receiving input spans instead of only single tokens, which enables the differentiation of generated kernels for the same token in different contexts. For example, as shown in Figure (c), span-based dynamic convolution generates different kernels for different “can” tokens.
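The contrast with plain dynamic convolution can be sketched with scalar "tokens": the kernel at position i is generated from the whole local span around i, so an identical token value receives different kernels when its neighbors differ. The identity kernel generator below is a toy stand-in for the learned one; names are illustrative.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(u - m) for u in v]
    s = sum(e)
    return [u / s for u in e]

def span_dynamic_conv(xs, kernel_from_span, k=3):
    """Span-based dynamic convolution sketch: the kernel at position i is
    generated from the local span around token i, not the token alone, so
    identical tokens in different contexts get different kernels."""
    r = k // 2
    out = []
    for i in range(len(xs)):
        span = [xs[j] if 0 <= j < len(xs) else 0.0
                for j in range(i - r, i + r + 1)]
        kernel = softmax(kernel_from_span(span))   # span-conditioned kernel
        out.append(sum(w * v for w, v in zip(kernel, span)))
    return out

# the token value 1.0 appears at positions 0 and 2, but its spans differ,
# so the two occurrences produce different outputs
ys = span_dynamic_conv([1.0, 5.0, 1.0, 2.0], lambda s: s, k=3)
```

A token-only dynamic convolution would map both occurrences of 1.0 to identical kernels, and hence identical outputs, regardless of context.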


18. ShapeConv

ShapeConv, the shape-aware convolutional layer, is a convolutional layer for processing depth features in indoor RGB-D semantic segmentation. The depth features are first decomposed into a shape component and a base component; two learnable weights are then introduced to reweight them independently; finally, convolution is applied to the reweighted combination of the two components.
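The decomposition can be sketched for a single patch: the base component is the patch mean, the shape component is the deviation from that mean, and each gets its own learnable scalar before the convolution. Names are illustrative.

```python
def shapeconv_patch(patch, w_base, w_shape, kernel):
    """ShapeConv sketch on one depth patch: decompose into a base component
    (the patch mean) and a shape component (deviations from the mean),
    reweight each with its own learnable scalar, recombine, then convolve."""
    base = sum(patch) / len(patch)
    shape = [v - base for v in patch]
    reweighted = [w_base * base + w_shape * s for s in shape]
    return sum(w * v for w, v in zip(kernel, reweighted))

# with w_base = w_shape = 1 this reduces to an ordinary convolution response;
# increasing w_shape emphasizes the local geometry over the absolute depth
y_plain = shapeconv_patch([1.0, 2.0, 3.0], 1.0, 1.0, [0.0, 0.0, 1.0])
y_shape = shapeconv_patch([1.0, 2.0, 3.0], 1.0, 2.0, [0.0, 0.0, 1.0])
```

The intuition is that the absolute depth value (base) and the local surface geometry (shape) carry different information for segmentation, so they deserve separate learnable scales.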



Origin blog.csdn.net/wzk4869/article/details/132923493