[Computer Vision | Image Segmentation] arXiv Computer Vision Academic Express on Image Segmentation (July 19 Paper Collection)

1. Segmentation | Semantics-Related (12 papers)

1.1 Disentangle then Parse: Night-time Semantic Segmentation with Illumination Disentanglement

https://arxiv.org/abs/2307.09362

Most existing semantic segmentation methods are developed for daytime scenes and typically perform poorly in nighttime scenes due to insufficient and complicated lighting conditions. In this work, we tackle this challenge by proposing a novel nighttime semantic segmentation paradigm, Disentangle then Parse (DTP). DTP explicitly disentangles nighttime images into light-invariant reflectance and light-specific illumination components, and then recognizes semantics based on their adaptive fusion. Specifically, the proposed DTP comprises two key components: 1) Instead of operating on illumination-entangled features as in previous works, our Semantics-Oriented Disentanglement (SOD) framework extracts reflectance components unimpaired by illumination, enabling the network to consistently recognize semantics under varying and complex lighting conditions. 2) Based on the observation that illumination components can serve as cues for some semantically confusing regions, we further introduce an Illumination-Aware Parser (IAParser) to explicitly learn the correlation between semantics and lighting and to aggregate illumination features for more precise predictions. Extensive experiments on nighttime segmentation tasks under various settings show that DTP significantly outperforms state-of-the-art methods. Furthermore, since the additional parameters are negligible, DTP can be directly applied to benefit existing day-night segmentation methods.
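
To make the decompose-and-fuse idea concrete, here is a minimal PyTorch sketch of a DTP-style forward pass. The module names, layer sizes, and the gated fusion are our own illustrative assumptions, not the authors' released architecture:

```python
# A minimal sketch of the disentangle-then-fuse idea behind DTP.
# All module names and sizes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class DisentangleThenParse(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        # Predicts a per-pixel illumination map L in (0, 1]; reflectance R = I / L.
        self.illum_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())
        self.reflect_encoder = nn.Conv2d(3, 64, 3, padding=1)  # stand-in backbone
        self.illum_encoder = nn.Conv2d(1, 64, 3, padding=1)
        # Learned gate deciding how much illumination evidence to mix back in.
        self.gate = nn.Sequential(nn.Conv2d(128, 64, 1), nn.Sigmoid())
        self.classifier = nn.Conv2d(64, num_classes, 1)

    def forward(self, image):
        illum = self.illum_net(image).clamp(min=1e-3)   # light-specific component
        reflect = (image / illum).clamp(0, 1)           # light-invariant component
        f_r = self.reflect_encoder(reflect)
        f_i = self.illum_encoder(illum)
        g = self.gate(torch.cat([f_r, f_i], dim=1))     # adaptive fusion
        return self.classifier(f_r + g * f_i)

logits = DisentangleThenParse()(torch.rand(2, 3, 128, 128))  # (2, 19, 128, 128)
```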

1.2 OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation

https://arxiv.org/abs/2307.09356

Referring video object segmentation (RVOS) aims to segment an object in a video following human instruction. Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with the text embedding for cross-modal understanding. They generally presume that the offline pattern is necessary for RVOS, yet it restricts temporal association to within each independent clip. In this work, we break with the previous offline belief and propose a simple yet effective online model using explicit query propagation, named OnlineRefer. Specifically, our approach propagates target queries carrying semantic information and position cues from previous frames, improving the accuracy and ease of referring prediction for the current frame. Furthermore, we generalize the online model into a semi-online framework for compatibility with video-based backbones. To demonstrate the effectiveness of our method, we evaluate it on four benchmarks, namely Refer-YouTube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F and 64.8 J&F on Refer-YouTube-VOS and Refer-DAVIS17, outperforming all offline counterparts.
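
A hedged sketch of the explicit query propagation described above: the object queries decoded at frame t become the initial queries at frame t+1. The transformer pieces below are generic placeholders, not the released OnlineRefer model:

```python
# Sketch of explicit query propagation for online referring segmentation.
# Shapes and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn

class OnlineQueryPropagation(nn.Module):
    def __init__(self, dim=256, num_queries=5):
        super().__init__()
        self.init_queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, frame_features):
        # frame_features: list of (B, HW, dim) visual tokens (already text-fused)
        B = frame_features[0].shape[0]
        queries = self.init_queries.unsqueeze(0).expand(B, -1, -1)
        outputs = []
        for feats in frame_features:                # online: one frame at a time
            queries = self.decoder(queries, feats)  # cross-attend to current frame
            outputs.append(queries)                 # box/mask heads would go here
        return outputs

frames = [torch.rand(2, 64, 256) for _ in range(3)]
outs = OnlineQueryPropagation()(frames)  # 3 per-frame query sets, each (2, 5, 256)
```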

1.3 MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds

https://arxiv.org/abs/2307.09316

3D semantic segmentation on multi-scan large-scale point clouds plays an important role in autonomous systems. Unlike the single-scan-based semantic segmentation task, this task requires distinguishing each point's motion state in addition to its semantic category. However, methods designed for single-scan-based segmentation tasks perform poorly on the multi-scan task due to the lack of an effective way to integrate temporal information. We present MarS3D, a plug-and-play motion-aware module for semantic segmentation on multi-scan 3D point clouds. This module can be flexibly combined with single-scan models, endowing them with multi-scan perception ability. It contains two key designs: a cross-frame feature embedding module for enriched representation learning and a motion-aware feature learning module for enhanced motion perception. Extensive experiments show that MarS3D can substantially improve the performance of baseline models. The code can be obtained from this https URL.
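
The plug-and-play character can be pictured as a thin wrapper around any single-scan backbone. The following sketch, with assumed names and feature sizes, merely illustrates the idea of a cross-frame (temporal) embedding plus a motion head:

```python
# Illustrative wrapper: give a single-scan segmentation backbone a per-point
# temporal embedding plus a motion head. Names and sizes are assumptions.
import torch
import torch.nn as nn

class MotionAwareWrapper(nn.Module):
    def __init__(self, backbone, feat_dim=64, num_scans=3, num_motion_states=2):
        super().__init__()
        self.backbone = backbone                    # any single-scan point model
        self.time_embed = nn.Embedding(num_scans, feat_dim)  # cross-frame cue
        self.motion_head = nn.Linear(feat_dim, num_motion_states)

    def forward(self, points, scan_idx):
        # points: (N, 3) merged multi-scan cloud; scan_idx: (N,) frame index
        feats = self.backbone(points)               # (N, feat_dim) semantics
        feats = feats + self.time_embed(scan_idx)   # inject temporal identity
        return feats, self.motion_head(feats)       # features + moving/static

backbone = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))
model = MotionAwareWrapper(backbone)
feats, motion_logits = model(torch.rand(1024, 3), torch.randint(0, 3, (1024,)))
```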

1.4 Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

https://arxiv.org/abs/2307.09267

3D visual grounding involves finding the target object in a 3D scene that corresponds to a given sentence query. Although many approaches have been proposed and achieved impressive performance, they all require dense object-sentence pair annotations in 3D point clouds, which are time-consuming and expensive to obtain. To address the difficulty of acquiring fine-grained annotated data, we propose to leverage weakly supervised annotations to learn the 3D visual grounding model, i.e., learning object-sentence links using only coarse scene-sentence correspondences. To achieve this, we design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner. Specifically, we first extract object proposals and coarsely select the top-K candidates based on feature and class similarity matrices. Next, we use each candidate one by one to reconstruct the masked keywords of the sentence, and the reconstruction accuracy finely reflects each candidate's semantic similarity to the query. Furthermore, we distill the coarse-to-fine semantic matching knowledge into a typical two-stage 3D visual grounding model, which reduces inference cost and improves performance by leveraging the well-studied structure of existing architectures. We conduct extensive experiments on ScanRefer, Nr3D, and Sr3D to demonstrate the effectiveness of our proposed method.
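
The coarse stage can be sketched as ranking proposals by combined feature and class similarity and keeping the top-K; the shapes and the simple additive score below are assumptions for illustration:

```python
# Rough sketch of the coarse stage: rank object proposals against the query
# and keep the top-K for the fine (keyword-reconstruction) stage.
import torch
import torch.nn.functional as F

def coarse_select(proposal_feats, proposal_cls, query_feat, query_cls, k=5):
    # proposal_feats: (P, D), query_feat: (D,) in a shared vision-language space
    feat_sim = F.cosine_similarity(proposal_feats, query_feat.unsqueeze(0), dim=1)
    cls_sim = proposal_cls @ query_cls        # (P, C) x (C,) class agreement
    score = feat_sim + cls_sim                # simple additive fusion (assumed)
    return score.topk(k).indices              # candidates for reconstruction

# The fine stage would mask keywords in the sentence and score each candidate
# by how well it lets the language model reconstruct them.
idx = coarse_select(torch.rand(50, 256), torch.rand(50, 20),
                    torch.rand(256), torch.rand(20))
```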

1.5 CG-fusion CAM: Online segmentation of laser-induced damage on large-aperture optics

https://arxiv.org/abs/2307.09161

The online segmentation of laser-induced damage on large-aperture optics in high-power laser facilities is challenged by complicated damage morphology, uneven illumination, and stray-light interference. Fully supervised semantic segmentation algorithms have achieved state-of-the-art performance but rely on a large number of pixel-level labels, which are time-consuming and labor-intensive to produce. LayerCAM, an advanced weakly supervised semantic segmentation algorithm, can generate pixel-accurate results using only image-level labels, but its scattered and partially inactive class activation regions degrade segmentation performance. In this paper, we propose a weakly supervised semantic segmentation method based on continuous-gradient CAM and its nonlinear multi-scale fusion (CG-fusion CAM). The method redesigns gradient back-propagation, nonlinearly activates the multi-scale fused heatmaps, and generates more fine-grained class activation maps with appropriate activation degrees for damage sites of different sizes. Experiments on our dataset show that the proposed method can achieve segmentation performance comparable to that of fully supervised algorithms.
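
For intuition, here is a rough sketch of nonlinear multi-scale fusion of class activation maps; it does not reproduce the continuous-gradient redesign itself, and the normalization and exponent are assumptions:

```python
# Hedged sketch: normalize per-layer activation maps, apply a nonlinearity to
# boost weak responses, upsample to a common size, and merge.
import torch
import torch.nn.functional as F

def fuse_multiscale_cams(cams, out_size=(256, 256), gamma=0.5):
    fused = torch.zeros(out_size)
    for cam in cams:                           # CAMs from shallow to deep layers
        cam = F.relu(cam)
        cam = cam / (cam.max() + 1e-8)         # per-scale normalization
        cam = cam.pow(gamma)                   # nonlinear boost of weak regions
        cam = F.interpolate(cam[None, None], size=out_size,
                            mode="bilinear", align_corners=False)[0, 0]
        fused = torch.maximum(fused, cam)      # keep the strongest response
    return fused / (fused.max() + 1e-8)

heat = fuse_multiscale_cams(
    [torch.rand(32, 32), torch.rand(64, 64), torch.rand(128, 128)])
```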

1.6 Mining of Single-Class by Active Learning for Semantic Segmentation

https://arxiv.org/abs/2307.09109

Some active learning (AL) strategies require multiple rounds of retraining the target model in order to identify the most informative samples, and they rarely offer the option to focus on samples from underrepresented categories. Here we introduce the Mining of Single-Class by Active Learning (MiSiCAL) paradigm, in which an AL policy is constructed via deep reinforcement learning and quantity-accuracy correlations are exploited to build datasets on which high-performance models can be trained for specific classes. MiSiCAL is particularly useful with very large batch sizes, as it does not require the repeated model training sessions common in other AL methods. This is thanks to its ability to exploit fixed representations of the candidate data points. We find that MiSiCAL is able to outperform a random policy on 150 of the 171 COCO10k categories, while the strongest baseline only outperforms random on 101 categories.
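
The key practical point, selection over fixed representations without retraining, can be illustrated as follows; the learned RL policy is replaced here by a toy prototype-distance heuristic, so this is an assumption-laden sketch rather than MiSiCAL itself:

```python
# Toy illustration: because candidate features are fixed, selecting samples for
# a target class is a single pass over embeddings, with no model retraining.
import numpy as np

def select_for_class(fixed_embeds, weak_labels, target_class, budget=100):
    # fixed_embeds: (N, D) frozen features; weak_labels: (N,) noisy class guesses
    proto = fixed_embeds[weak_labels == target_class].mean(axis=0)
    dists = np.linalg.norm(fixed_embeds - proto, axis=1)
    return np.argsort(dists)[:budget]  # closest candidates, no training loop

picked = select_for_class(np.random.rand(5000, 128),
                          np.random.randint(0, 10, 5000), target_class=3)
```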

1.7 Connections between Operator-splitting Methods and Deep Neural Networks with Applications in Image Segmentation

https://arxiv.org/abs/2307.09052

Deep neural networks are powerful tools for many tasks. Understanding why they are so successful and providing a mathematical explanation is an important problem and has been a hot research direction in recent years. In the literature on the mathematical analysis of deep neural networks, much effort has been devoted to establishing representation theories. How to connect deep neural networks with mathematical algorithms is still under development. In this paper, we give an algorithmic explanation of deep neural networks, in particular their connection to operator-splitting and multigrid methods. We show that with certain splitting strategies, operator-splitting methods have the same structure as networks. Using this connection and the Potts model for image segmentation, two networks inspired by operator-splitting methods are proposed. The two networks are essentially two operator-splitting algorithms for solving the Potts model. Numerical experiments demonstrate the effectiveness of the proposed networks.
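
As a toy illustration of the correspondence, one splitting iteration for a Potts-type segmentation energy alternates a linear smoothing step with a pointwise nonlinearity and projection, mirroring a network layer. The energy, step sizes, and discretization below are our own simplifications, not the paper's networks:

```python
# Toy operator-splitting iteration: a linear smoothing sub-step followed by a
# pointwise data sub-step and projection, structurally like one network layer.
import numpy as np

def splitting_iteration(u, image, lam=1.0, tau=0.2):
    # Step 1 (regularization sub-problem): local averaging, a fixed linear op.
    smoothed = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 4.0
    # Step 2 (data sub-problem): pointwise update, like an activation function.
    u = smoothed - tau * lam * (smoothed - image)
    return np.clip(u, 0.0, 1.0)            # projection back onto [0, 1]

u = np.random.rand(64, 64)
img = (np.random.rand(64, 64) > 0.5).astype(float)
for _ in range(50):                         # stacked iterations ~ stacked layers
    u = splitting_iteration(u, img)
segmentation = u > 0.5
```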

1.8 Online Self-Supervised Thermal Water Segmentation for Aerial Vehicles

https://arxiv.org/abs/2307.09027

We propose a novel approach for adapting an RGB-trained water segmentation network to target-domain aerial thermal imagery using online self-supervision, exploiting texture and motion cues as supervisory signals. This new thermal capability enables autonomous aerial robots currently operating in near-shore environments to perform tasks such as visual navigation, bathymetry, and flow tracking at night. Our method overcomes the scarcity and inaccessibility of near-shore thermal data, which hinders the application of conventional supervised and unsupervised methods. In this work, we curate the first aerial thermal near-shore dataset, show that our method outperforms fully supervised segmentation models trained on limited target-domain thermal data, and demonstrate real-time capability on an Nvidia Jetson embedded computing platform.
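
One way to picture the motion cue as a supervisory signal is sketched below: temporal intensity variance yields rough water/non-water pseudo-labels for online adaptation. The authors' actual cues and thresholds are richer than this assumption-laden toy:

```python
# Illustrative pseudo-labeling from a motion cue: water surfaces tend to show
# higher temporal intensity variation than the static shoreline.
import numpy as np

def motion_pseudo_labels(thermal_frames, var_thresh=0.01):
    # thermal_frames: list of registered (H, W) frames from a short window
    stack = np.stack(thermal_frames).astype(np.float32)
    temporal_var = stack.var(axis=0)
    labels = np.full(temporal_var.shape, 255, dtype=np.uint8)  # 255 = ignore
    labels[temporal_var > var_thresh * 4] = 1    # confident water
    labels[temporal_var < var_thresh] = 0        # confident non-water
    return labels  # online targets for adapting the RGB-trained network

labels = motion_pseudo_labels([np.random.rand(120, 160) for _ in range(8)])
```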

1.9 EVIL: Evidential Inference Learning for Trustworthy Semi-supervised Medical Image Segmentation

https://arxiv.org/abs/2307.08988

Recently, uncertainty-aware methods have attracted increasing attention in semi-supervised medical image segmentation. However, current methods usually struggle to balance computational cost, estimation accuracy, and theoretical support within a unified framework. To alleviate this problem, we introduce the Dempster-Shafer Theory of Evidence (DST) into semi-supervised medical image segmentation, dubbed EVidential Inference Learning (EVIL). EVIL provides a theoretically guaranteed solution for inferring accurate uncertainty quantification in a single forward pass. Trusted pseudo-labels on unlabeled data are generated after uncertainty estimation. Our framework adopts the recently proposed consistency-regularization-based training paradigm, which enforces consistency on perturbed predictions to enhance generalization from small amounts of labeled data. Experimental results show that EVIL achieves competitive performance compared with several state-of-the-art methods on public datasets.
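
The single-forward-pass uncertainty that evidential learning affords can be sketched with the standard Dirichlet formulation (evidence, concentration, vacuity). Whether EVIL uses exactly these activations is not stated here, so treat this as a generic sketch:

```python
# Generic evidential-segmentation sketch: one forward pass yields both class
# probabilities and a per-pixel uncertainty that can gate pseudo-labels.
import torch
import torch.nn.functional as F

def evidential_outputs(logits):
    # logits: (B, K, H, W) raw segmentation outputs over K classes
    evidence = F.softplus(logits)          # non-negative evidence per class
    alpha = evidence + 1.0                 # Dirichlet concentration parameters
    S = alpha.sum(dim=1, keepdim=True)
    probs = alpha / S                      # expected class probabilities
    K = logits.shape[1]
    uncertainty = K / S                    # vacuity: high when evidence is low
    return probs, uncertainty.squeeze(1)

probs, unc = evidential_outputs(torch.randn(2, 4, 64, 64))
trusted = unc < 0.3   # pixels whose pseudo-labels would be kept as "trusted"
```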

1.10 Semantic Counting from Self-Collages

https://arxiv.org/abs/2307.08727

While recent supervised methods for reference-based object counting continue to improve performance on benchmark datasets, they have to rely on small datasets due to the cost of manually annotating dozens of objects per image. We propose Unsupervised Counter (UnCo), a model that can learn this task without any manual annotation. To this end, we construct "SelfCollages", images with various pasted objects as training samples, which provide a rich learning signal covering arbitrary object types and counts. Our method builds on existing unsupervised representation and segmentation techniques and successfully demonstrates the ability to count objects without human supervision. Our experiments show that our method not only outperforms simple baselines and generic models such as FasterRCNN, but also matches the performance of supervised counting models in some domains.
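
Constructing a SelfCollage can be pictured as pasting segmented object crops onto a background, with the paste count serving as a free label. The naive version below ignores the blending and the unsupervised crop extraction the method relies on:

```python
# Toy SelfCollage construction: the number of pasted objects is itself the
# training label, so no manual annotation is needed.
import random
import numpy as np

def make_self_collage(background, object_crops, max_objects=20):
    canvas = background.copy()              # (H, W, 3) uint8 image
    n = random.randint(1, max_objects)
    H, W = canvas.shape[:2]
    for _ in range(n):
        crop = random.choice(object_crops)  # (h, w, 3) segmented object patch
        h, w = crop.shape[:2]
        y, x = random.randint(0, H - h), random.randint(0, W - w)
        canvas[y:y + h, x:x + w] = crop     # naive paste; the real method blends
    return canvas, n                        # training image and its count label

bg = np.zeros((256, 256, 3), dtype=np.uint8)
crops = [np.full((24, 24, 3), 200, dtype=np.uint8)]
image, count = make_self_collage(bg, crops)
```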

1.11 Evaluate Fine-tuning Strategies for Fetal Head Ultrasound Image Segmentation with U-Net

https://arxiv.org/abs/2307.09067

Fetal head segmentation is a crucial step in measuring the fetal head circumference (HC) during gestation, an important biometric in obstetrics for monitoring fetal growth. However, manual biometry generation is time-consuming and yields inconsistent accuracy. To address this issue, convolutional neural network (CNN) models have been utilized to improve the efficiency of medical biometry. But training a CNN network from scratch is a challenging task, so we instead propose a transfer learning (TL) method. Our approach involves fine-tuning (FT) a U-Net network with a lightweight MobileNet as the encoder to perform segmentation on a set of fetal head ultrasound (US) images with limited effort. This method addresses the challenges associated with training a CNN network from scratch. Our results suggest that the proposed FT strategy yields segmentation performance comparable to training from scratch while reducing the number of trainable parameters by 85.8%. Our proposed FT strategy also outperforms other strategies whose trainable parameter sizes are below 4.4 million. Thus, we believe it can serve as a reliable FT approach for reducing model size in medical image analysis. Our key findings highlight the importance of the balance between model performance and size when developing artificial intelligence (AI) applications via TL methods.
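
A plausible rendering of this FT recipe freezes the pretrained MobileNet encoder and trains only the decoder. The sketch uses the third-party segmentation_models_pytorch package as one possible stack; this is our choice of tooling, not necessarily the authors' code:

```python
# One possible FT setup: U-Net with a pretrained MobileNet encoder, encoder
# frozen, decoder trainable. Tooling choice is an assumption.
import torch
import segmentation_models_pytorch as smp

model = smp.Unet(encoder_name="mobilenet_v2",
                 encoder_weights="imagenet",  # transfer from ImageNet
                 in_channels=1,               # grayscale ultrasound
                 classes=1)                   # fetal head vs. background

for p in model.encoder.parameters():          # FT strategy: freeze the encoder
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
```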

1.12 Frequency-mixed Single-source Domain Generalization for Medical Image Segmentation

https://arxiv.org/abs/2307.09005

The scarcity of annotations for medical image segmentation poses challenges in collecting sufficient training data for deep learning models. Specifically, models trained on limited data may not generalize well to other unseen data domains, resulting in a domain shift issue. Therefore, domain generalization (DG) has been developed to boost the performance of segmentation models on unseen domains. However, the DG setup requires multiple source domains, which impedes the efficient deployment of segmentation algorithms in clinical scenarios. To address this challenge and improve the segmentation model's generalizability, we propose a novel approach called the Frequency-mixed Single-source Domain Generalization method (FreeSDG). By analyzing the effect of frequency on domain discrepancy, FreeSDG leverages a mixed frequency spectrum to augment the single-source domain. Additionally, self-supervision is constructed in the domain augmentation to learn robust context-aware representations for the segmentation task. Experimental results on five datasets of three modalities demonstrate the effectiveness of the proposed algorithm. FreeSDG outperforms state-of-the-art methods and significantly improves the segmentation model's generalizability. Therefore, FreeSDG provides a promising solution for enhancing the generalization of medical image segmentation models, especially when annotated data is scarce.
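
Frequency-based augmentation of a single source can be sketched with low-frequency amplitude mixing between two images of the same domain; the band width and mixing ratio below are assumptions, not the paper's schedule:

```python
# Sketch of frequency-domain augmentation in the spirit of FreeSDG: mix the
# low-frequency amplitude spectra of two single-source images to synthesize a
# "new domain" sample while keeping the phase (content) of the first image.
import numpy as np

def frequency_mix(img_a, img_b, beta=0.1, ratio=0.5):
    # img_a, img_b: (H, W) float arrays from the same (single) source domain
    fa, fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    amp_a, pha_a = np.abs(fa), np.angle(fa)
    amp_a = np.fft.fftshift(amp_a)
    amp_b = np.fft.fftshift(np.abs(fb))
    H, W = img_a.shape
    h, w = int(H * beta), int(W * beta)        # low-frequency band around center
    cy, cx = H // 2, W // 2
    band = (slice(cy - h, cy + h), slice(cx - w, cx + w))
    amp_a[band] = (1 - ratio) * amp_a[band] + ratio * amp_b[band]
    mixed = np.fft.ifft2(np.fft.ifftshift(amp_a) * np.exp(1j * pha_a))
    return np.real(mixed)                      # augmented single-source image

aug = frequency_mix(np.random.rand(128, 128), np.random.rand(128, 128))
```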
