[Computer Vision | Image Segmentation] arXiv Computer Vision Academic Express on Image Segmentation (a collection of papers from August 22)

Article directory

1. Segmentation | Semantics-Related (16 papers)

1.1 Test-time augmentation-based active learning and self-training for label-efficient segmentation

https://arxiv.org/abs/2308.10727

Deep learning techniques rely on large datasets whose annotation is time-consuming. To reduce the annotation burden, self-training (ST) and active learning (AL) methods, as well as methods combining them iteratively, have been developed. However, it is unclear when each approach is most useful and when combining them is advantageous. In this paper, we propose a new approach that combines ST with AL using test-time augmentation (TTA). First, TTA is performed on an initial teacher network. Then, cases for annotation are selected based on the lowest estimated Dice score, while cases with high estimated scores are used as soft pseudo-labels for ST. The selected annotated cases are trained on together with the existing annotated cases and the ST cases with boundary-slice annotations. We demonstrate the approach on MRI fetal body and placenta segmentation tasks with different data variability characteristics. Our results show that ST is highly effective for both tasks, improving performance on in-distribution (ID) and out-of-distribution (OOD) data. However, while self-training combined with AL improves performance for single-sequence fetal body segmentation, it slightly degrades performance for multi-sequence placenta segmentation on ID data. AL is helpful for the highly variable placenta data, but does not improve upon random selection for the single-sequence body data. For fetal body segmentation sequence transfer, combining AL with ST after ST iterations yielded a Dice of 0.961 with only 6 original scans and 2 new sequence scans. Results with only 15 cases of the highly variable placenta data were similar to those with 50 cases. Code is available at: https://github.com/Bella31/TTA-quality-estimation-ST-AL
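
For readers who want to experiment with the idea, here is a minimal sketch of the TTA-based quality estimation: run each case through several test-time augmentations, take the majority vote as a reference, and use the mean Dice agreement as an estimated quality score. Low-scoring cases would go to annotation (AL) and high-scoring cases become soft pseudo-labels (ST). The `predict_fn` and `augmentations` interfaces are placeholders, not the authors' code.

```python
import numpy as np

def dice(a, b, eps=1e-7):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def tta_quality_estimate(predict_fn, image, augmentations):
    """Estimate segmentation quality without labels.

    predict_fn(image) -> binary mask; augmentations is a list of
    (forward, inverse) spatial transforms (placeholders, not from the paper).
    """
    masks = []
    for forward, inverse in augmentations:
        masks.append(inverse(predict_fn(forward(image))).astype(bool))
    majority = np.mean(masks, axis=0) >= 0.5          # majority-vote reference
    score = float(np.mean([dice(m, majority) for m in masks]))
    return score, majority

# Illustrative split: lowest-scoring cases -> sent for manual annotation (AL),
# highest-scoring cases -> majority masks reused as soft pseudo-labels (ST).
```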

1.2 Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations

https://arxiv.org/abs/2308.10554

Training deep generative models usually requires large amounts of data. To alleviate the data collection cost, the task of zero-shot GAN adaptation aims to reuse a well-trained generator to synthesize images of an unseen target domain without any further training samples. Because no data is available, textual descriptions of the target domain and vision-language models (e.g., CLIP) are used to guide the generator. However, with only a single representative text feature instead of real images, the synthesized images gradually lose diversity as the generator is adapted, a problem also known as mode collapse. To address this problem, we propose a new method to find semantic variations of the target text in CLIP space. Specifically, we explore diverse semantic variations based on the informative text feature of the target domain, while regularizing uncontrolled deviations of the semantic information. With the obtained variations, we design a novel directional moment loss that matches the first and second moments of the image and text direction distributions. Furthermore, we introduce elastic weight consolidation and a relation consistency loss to effectively preserve valuable content information from the source domain, such as appearance. Extensive experiments demonstrate the effectiveness of our method for zero-shot GAN adaptation in various scenarios, and ablation studies verify the effect of each proposed component. Notably, our model achieves a new state-of-the-art in terms of both diversity and quality for zero-shot GAN adaptation.
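
A hedged reading of the directional moment loss mentioned above: match the first and second moments of the image-direction and text-direction distributions in CLIP space. The exact formulation in the paper may differ, and the shapes below are assumptions.

```python
import torch
import torch.nn.functional as F

def directional_moment_loss(img_dirs, txt_dirs):
    """Match 1st/2nd moments of image and text direction distributions.

    img_dirs: (B, D) CLIP-space directions, adapted minus source image features.
    txt_dirs: (K, D) semantic-variation text directions for the target domain.
    (An interpretation of the abstract, not the authors' implementation.)
    """
    img_dirs = F.normalize(img_dirs, dim=-1)
    txt_dirs = F.normalize(txt_dirs, dim=-1)
    mu_i, mu_t = img_dirs.mean(0), txt_dirs.mean(0)              # first moments
    cov_i = (img_dirs - mu_i).T @ (img_dirs - mu_i) / img_dirs.shape[0]
    cov_t = (txt_dirs - mu_t).T @ (txt_dirs - mu_t) / txt_dirs.shape[0]
    return (mu_i - mu_t).pow(2).sum() + (cov_i - cov_t).pow(2).sum()
```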

1.3 CVFC: Attention-Based Cross-View Feature Consistency for Weakly Supervised Semantic Segmentation of Pathology Images

https://arxiv.org/abs/2308.10449

Histopathological image segmentation is the gold standard for diagnosing cancer and can indicate cancer prognosis. However, it requires high-quality masks, so many recent studies use image-level labels to achieve pixel-level segmentation and reduce the need for fine-grained annotation. To address this issue, we propose CVFC, an attention-based end-to-end framework that generates pseudo-masks via cross-view feature consistency. Specifically, CVFC is a three-branch joint framework consisting of two ResNet-38 branches and one ResNet-50 branch; each branch independently integrates multi-scale feature maps to generate a class activation map (CAM), whose size is adjusted by downsampling and expansion. The middle branch projects the feature matrix into query and key feature spaces and produces a feature-space perception matrix through a connection layer and inner product, which adjusts and refines the CAM of each branch. Finally, a feature consistency loss and a feature cross loss optimize the parameters of CVFC in a co-training mode. In extensive experiments, CVFC achieved an IoU of 0.7122 and an fwIoU of 0.7018 on the WSSS4LUAD dataset, outperforming HistoSegNet, SEAM, C-CAM, WSSS-Tissue, and OEMM.
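
The attention-style CAM refinement described above could look roughly like the sketch below: project features into query and key spaces, form an affinity matrix by inner product, and use it to re-weight the CAM. This is an illustrative interpretation of the abstract, not the released CVFC code, and the channel sizes are placeholders.

```python
import torch
import torch.nn as nn

class AffinityCAMRefine(nn.Module):
    """Refine a CAM with a feature-space affinity matrix (illustrative only)."""
    def __init__(self, in_ch, key_ch=64):
        super().__init__()
        self.to_q = nn.Conv2d(in_ch, key_ch, 1)
        self.to_k = nn.Conv2d(in_ch, key_ch, 1)

    def forward(self, feats, cam):
        # feats: (B, C, H, W) branch features, cam: (B, num_classes, H, W)
        q = self.to_q(feats).flatten(2).transpose(1, 2)           # (B, HW, key_ch)
        k = self.to_k(feats).flatten(2)                           # (B, key_ch, HW)
        affinity = torch.softmax(q @ k / k.shape[1] ** 0.5, -1)   # (B, HW, HW)
        cam_flat = cam.flatten(2).transpose(1, 2)                 # (B, HW, classes)
        return (affinity @ cam_flat).transpose(1, 2).reshape_as(cam)
```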

1.4 Hyper Association Graph Matching with Uncertainty Quantification for Coronary Artery Semantic Labeling

https://arxiv.org/abs/2308.10320

Coronary artery disease (CAD) is one of the leading causes of death worldwide. Accurate extraction of individual arterial branches from invasive coronary angiograms (ICAs) is important for stenosis detection and CAD diagnosis. However, deep learning-based models face challenges in producing semantic segmentation of coronary arteries due to the morphological similarity among different types of coronary branches. To address this challenge, we propose an innovative approach, a hyper-association graph matching neural network with uncertainty quantification (HAGMN-UQ), for coronary artery semantic labeling on ICAs. The graph matching process maps arterial branches between two individual graphs, so that unlabeled arterial segments are classified by the labeled segments, enabling semantic labeling of coronary arteries. By incorporating an anatomical loss and uncertainty, our model achieves a coronary semantic labeling accuracy of 0.9345 with fast inference speed, making it effective and efficient for real-time clinical decision-making scenarios.
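
The paper's hyper-association graph matching network is far richer than this, but the core matching idea can be illustrated with a toy sketch: assign each unlabeled arterial segment the label of its most similar reference segment via the Hungarian algorithm. The feature descriptors and branch names below are hypothetical, and pairwise/hyper-edge terms are ignored.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def label_by_matching(unlabeled_feats, labeled_feats, labels):
    """Assign each unlabeled segment the label of its best-matching reference.

    unlabeled_feats: (N, D), labeled_feats: (M, D) segment descriptors;
    labels: list of M branch names (e.g. 'LAD', 'LCX'). A toy stand-in for
    graph matching, not the authors' HAGMN-UQ model.
    """
    u = unlabeled_feats / np.linalg.norm(unlabeled_feats, axis=1, keepdims=True)
    l = labeled_feats / np.linalg.norm(labeled_feats, axis=1, keepdims=True)
    cost = 1.0 - u @ l.T                       # cosine-distance cost matrix
    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one assignment
    return {int(r): labels[int(c)] for r, c in zip(rows, cols)}
```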

1.5 BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

https://arxiv.org/abs/2308.10175

Given an audio-visual pair, audio-visual segmentation (AVS) aims to localize sound sources by predicting pixel-wise maps. Previous methods assume that each sound component in the audio signal always has a visual counterpart in the image. However, this assumption ignores that off-screen sounds and background noise often contaminate audio recordings in real-world scenes. They pose significant challenges to establishing a consistent semantic mapping between audio and visual signals for AVS models, thereby hindering precise sound localization. In this work, we propose BAVS, a two-stage bootstrapping audio-visual segmentation framework that integrates multimodal foundation knowledge. In short, BAVS aims to keep background noise and off-screen sounds from interfering with segmentation by establishing audio-visual correspondences in an unambiguous manner. In the first stage, we employ a segmentation model to localize potential sounding objects from the visual data without being contaminated by the audio signal, and at the same time we use a foundation audio classification model to identify the audio semantics. Considering that the audio labels provided by the audio foundation model are noisy, associating object masks with audio labels is not trivial. Therefore, in the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the genuinely sounding objects. Here, we construct an audio-visual tree based on the hierarchical correspondence between sound and object categories. We then examine the label concurrency between the localized objects and the classified audio labels by tracing the audio-visual tree. With AVIS, we can effectively segment genuinely sounding objects. Extensive experiments demonstrate the superiority of our method on AVS datasets, particularly in scenarios involving background noise. Our project website is https://yenanliu.github.io/AVSS.github.io/.
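
A toy sketch of the AVIS matching step: keep an object mask only if its category is compatible with one of the classified audio labels under a category hierarchy. The hierarchy below is an invented example; the paper constructs its own audio-visual tree.

```python
# Toy audio-visual tree: visual object category -> set of compatible audio
# labels (hypothetical; the paper builds its own correspondence tree).
AV_TREE = {
    "dog":    {"dog barking", "animal sounds"},
    "guitar": {"guitar", "musical instrument"},
    "car":    {"car engine", "vehicle"},
}

def select_sounding_objects(object_masks, audio_labels):
    """Keep only masks whose category co-occurs with a classified audio label.

    object_masks: dict {category: mask} from the visual segmentation model,
    audio_labels: set of labels from the audio classification model.
    Off-screen sounds with no visual counterpart are simply ignored.
    """
    return {cat: mask for cat, mask in object_masks.items()
            if AV_TREE.get(cat, set()) & audio_labels}
```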

1.6 SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation

https://arxiv.org/abs/2308.10156

Despite significant progress in text-to-image (T2I) generative models, even long and complex textual descriptions still struggle to convey detailed control. In contrast, layout-to-image (L2I) generation, which aims to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods convert layout information into tokens or RGB images for conditional control during generation, resulting in insufficient control over space and semantics. To address these limitations, we propose SSMG, a novel spatial-semantic map guided diffusion model that employs a feature map derived from the layout as guidance. Owing to the rich spatial and semantic information encapsulated in the well-designed feature map, SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. Furthermore, we propose relation-sensitive attention (RSA) and location-sensitive attention (LSA) mechanisms. The former aims to model the relationships among multiple objects in a scene, while the latter aims to improve the model's sensitivity to the spatial information embedded in the guidance. Extensive experiments show that SSMG achieves highly promising results, setting a new state-of-the-art on metrics covering fidelity, diversity, and controllability.
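
One plausible way to build such a spatial-semantic guidance map (an assumption for illustration, not the paper's implementation) is to paint each object's semantic embedding into its box region of the layout:

```python
import torch

def layout_to_spatial_semantic_map(boxes, embeddings, height, width):
    """Rasterize a layout into a dense guidance map (illustrative only).

    boxes: (N, 4) normalized (x0, y0, x1, y1); embeddings: (N, D) per-object
    semantic features (e.g. from a text encoder - an assumption here).
    Returns a (D, H, W) map; overlapping objects are averaged.
    """
    D = embeddings.shape[1]
    canvas = torch.zeros(D, height, width)
    count = torch.zeros(1, height, width)
    for (x0, y0, x1, y1), emb in zip(boxes, embeddings):
        c0, r0 = int(x0 * width), int(y0 * height)
        c1, r1 = max(int(x1 * width), c0 + 1), max(int(y1 * height), r0 + 1)
        canvas[:, r0:r1, c0:c1] += emb[:, None, None]
        count[:, r0:r1, c0:c1] += 1
    return canvas / count.clamp(min=1)
```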

1.7 Controllable Multi-domain Semantic Artwork Synthesis

https://arxiv.org/abs/2308.10111

We propose a new framework for multi-domain synthesis of artwork from semantic layouts. One of the main limitations of this challenging task is the lack of publicly available segmentation datasets for art synthesis. To address this problem, we propose a dataset, which we call ArtSem, containing 40,000 artwork images from 4 different domains together with their corresponding semantic label maps. We first extract semantic maps from landscape photographs, and then propose a conditional generative adversarial network (GAN)-based method to generate high-quality artwork from semantic maps without paired training data. Furthermore, we propose an artwork synthesis model for high-quality multi-domain synthesis using a domain-dependent variational encoder. The model is improved and complemented by a simple yet effective normalization method that normalizes jointly over semantics and style, which we call Spatially Style-Adaptive Normalization (SSTAN). Compared to previous methods that only take a semantic layout as input, our model is able to learn a joint representation of style and semantic information, which leads to synthetic art images of better quality. Results show that our model learns to separate the domains in latent space, and thus, by identifying the hyperplanes that separate the different domains, we also gain fine-grained control over the synthesized artwork. By combining our proposed dataset and method, we are able to generate user-controllable artwork of higher quality than existing approaches.
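
SSTAN is only described at a high level above; the sketch below shows a SPADE-like layer in that spirit, where scale and bias are predicted jointly from the semantic map and a style code. Channel sizes and the exact conditioning are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialStyleAdaptiveNorm(nn.Module):
    """Normalize features, then modulate them with a semantic-map and
    style-code dependent scale and bias (a SPADE-like sketch, not SSTAN itself)."""
    def __init__(self, feat_ch, seg_ch, style_dim, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)
        self.shared = nn.Conv2d(seg_ch + style_dim, hidden, 3, padding=1)
        self.gamma = nn.Conv2d(hidden, feat_ch, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_ch, 3, padding=1)

    def forward(self, x, segmap, style):
        # x: (B, C, H, W), segmap: (B, seg_ch, Hs, Ws), style: (B, style_dim)
        segmap = F.interpolate(segmap, size=x.shape[2:], mode="nearest")
        style_map = style[:, :, None, None].expand(-1, -1, *x.shape[2:])
        h = F.relu(self.shared(torch.cat([segmap, style_map], dim=1)))
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)
```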

1.8 TSAR-MVS: Textureless-aware Segmentation and Correlative Refinement Guided Multi-View Stereo

https://arxiv.org/abs/2308.09990

Reconstruction of textureless regions has long been a challenging problem in MVS due to the lack of reliable pixel correspondences between images. In this paper, we propose Textureless-aware Segmentation and Correlative Refinement guided Multi-View Stereo (TSAR-MVS), a novel approach that effectively tackles the challenge of reconstructing textureless regions in 3D through filtering, refinement, and segmentation. First, we implement joint hypothesis filtering, a technique that combines a confidence estimator with a disparity discontinuity detector to remove incorrect depth estimates. Second, to propagate the depths of confidently estimated pixels, we introduce an iterative correlative refinement strategy that uses RANSAC to generate superpixels, followed by a median filter to extend the influence of accurately determined pixels. Finally, we propose a textureless-aware segmentation method that leverages edge detection and line detection to accurately identify large textureless regions, which are then fitted with 3D planes. Experiments on extensive datasets show that our method significantly outperforms most non-learning methods and is robust in textureless regions while preserving fine details.
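
The final plane-fitting step can be illustrated with a small RANSAC routine that fits z = a*x + b*y + c to the confident depths inside a detected textureless region. Thresholds and iteration counts are placeholders, not values from the paper.

```python
import numpy as np

def ransac_plane_fit(xs, ys, zs, iters=200, thresh=0.01, rng=None):
    """Fit z = a*x + b*y + c to confident depth samples with RANSAC
    (toy illustration of plane fitting for a textureless region)."""
    rng = rng or np.random.default_rng(0)
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=1)      # (N, 3) design matrix
    best_inliers, best_plane = None, None
    for _ in range(iters):
        idx = rng.choice(len(zs), size=3, replace=False)
        plane, *_ = np.linalg.lstsq(pts[idx], zs[idx], rcond=None)
        inliers = np.abs(pts @ plane - zs) < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, plane
    # refit on all inliers for the final plane parameters (a, b, c)
    best_plane, *_ = np.linalg.lstsq(pts[best_inliers], zs[best_inliers], rcond=None)
    return best_plane  # depth at any pixel inside the region: a*x + b*y + c
```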

1.9 Anomaly-Aware Semantic Segmentation via Style-Aligned OoD Augmentation

https://arxiv.org/abs/2308.09965

In the context of autonomous driving, encountering unknown objects becomes inevitable during deployment in the open world. Therefore, it is crucial to equip standard semantic segmentation models with anomaly awareness. Many previous approaches have exploited synthetic out-of-distribution (OoD) data augmentation to address this problem. In this work, we advance the OoD synthesis process by reducing the domain gap between the OoD data and the driving scenes, effectively mitigating style differences that might otherwise act as an obvious shortcut during training. Furthermore, we propose a simple fine-tuning loss that effectively induces a pre-trained semantic segmentation model to generate a "no class given" prediction, leveraging per-pixel OoD scores for anomaly segmentation. With minimal fine-tuning effort, our pipeline enables anomaly segmentation with pre-trained models while maintaining performance on the original task.
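
A hedged sketch of how such a fine-tuning loss and per-pixel OoD score might be implemented: standard cross-entropy on in-distribution pixels plus a term that suppresses all class probabilities on the pasted OoD pixels, with one minus the maximum softmax probability as the anomaly score at test time. This is one plausible instantiation, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def ood_finetune_loss(logits, labels, ood_mask, ignore_index=255):
    """CE on in-distribution pixels plus a term pushing class probabilities
    down on synthetic OoD pixels (illustrative formulation).

    logits: (B, C, H, W); labels: (B, H, W); ood_mask: (B, H, W) bool.
    """
    ce = F.cross_entropy(logits, labels, ignore_index=ignore_index, reduction="none")
    valid = (~ood_mask) & (labels != ignore_index)
    id_loss = ce[valid].mean()
    probs = torch.softmax(logits, dim=1)
    ood_loss = probs.max(dim=1).values[ood_mask].mean()   # suppress max class prob
    return id_loss + ood_loss

def ood_score(logits):
    """Per-pixel anomaly score: 1 - max softmax probability."""
    return 1.0 - torch.softmax(logits, dim=1).max(dim=1).values
```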

1.10 Learning Multiscale Consistency for Self-supervised Electron Microscopy Instance Segmentation

https://arxiv.org/abs/2308.09917

Instance segmentation of electron microscopy (EM) volumes poses a significant challenge due to the complex morphology of instances and insufficient annotations. Self-supervised learning has recently emerged as a promising solution, enabling the acquisition of prior knowledge about cellular organization that is essential for EM instance segmentation. However, existing pre-training methods often lack the ability to capture complex visual patterns and the relationships between voxels, which results in insufficient prior knowledge for downstream EM analysis tasks. In this paper, we propose a novel pre-training framework that leverages multi-scale visual representations to capture both voxel-level and feature-level consistency in EM volumes. Specifically, our framework enforces voxel-level consistency between the outputs of a Siamese network through a reconstruction function, and incorporates a cross-attention mechanism for soft feature matching to achieve fine-grained feature-level consistency. Moreover, we propose a feature pyramid with a contrastive learning scheme to extract discriminative features across multiple scales. We extensively pre-train our method on four large-scale EM datasets, achieving promising performance improvements on the representative tasks of neuron and mitochondria instance segmentation.
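
The two consistency terms could be sketched as follows (shapes and weighting are assumptions): a voxel-level reconstruction consistency between the Siamese outputs, and a feature-level consistency that softly matches tokens across branches with an attention-style similarity.

```python
import torch
import torch.nn.functional as F

def voxel_consistency(recon_a, recon_b, target):
    """Voxel-level consistency: both Siamese branches reconstruct the volume."""
    return F.mse_loss(recon_a, target) + F.mse_loss(recon_b, target)

def feature_consistency(feat_a, feat_b, tau=0.07):
    """Feature-level consistency via soft matching between the two branches.

    feat_a, feat_b: (B, N, D) token/voxel features. Each token in branch A is
    softly matched to branch B by attention, then pulled toward its match.
    """
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    attn = torch.softmax(a @ b.transpose(1, 2) / tau, dim=-1)   # (B, N, N)
    matched = attn @ b                                          # soft matches in B
    return (1 - (a * matched).sum(-1)).mean()                   # cosine distance
```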

1.11 Scalable Video Object Segmentation with Simplified Framework

https://arxiv.org/abs/2308.09903

Current prevailing video object segmentation (VOS) methods implement feature matching through several hand-crafted modules that perform feature extraction and matching separately. However, such hand-crafted designs empirically lead to insufficient target interaction, limiting dynamic target-aware feature learning in VOS. To address these limitations, this paper proposes a scalable Simplified VOS (SimVOS) framework that uses a single Transformer backbone for joint feature extraction and matching. Specifically, SimVOS employs a scalable ViT backbone to simultaneously extract and match query and reference features. This design enables SimVOS to learn better target-part features for accurate mask prediction. More importantly, SimVOS can directly apply well-pretrained ViT backbones (e.g., MAE), which bridges the gap between VOS and large-scale self-supervised pre-training. To achieve a better performance-speed trade-off, we further explore intra-frame attention and propose a new token refinement module to improve running speed and reduce computational cost. Experimentally, our SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks, namely DAVIS-2017 (88.0% J&F), DAVIS-2016 (92.9% J&F), and YouTube-VOS 2019 (84.2% J&F), without applying any synthetic video or the BL30K pre-training used in previous VOS approaches.
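
The key simplification, joint extraction and matching in one backbone, can be illustrated by concatenating reference and query tokens and running them through a single transformer. The module below is a toy stand-in for the pre-trained ViT used in the paper, and the mask-injection scheme is an assumption.

```python
import torch
import torch.nn as nn

class JointExtractMatch(nn.Module):
    """Joint feature extraction and matching over concatenated frame tokens
    (toy stand-in for the single ViT backbone described in the abstract)."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.mask_embed = nn.Linear(1, dim)   # inject the reference mask as a cue

    def forward(self, ref_tokens, ref_mask, query_tokens):
        # ref_tokens, query_tokens: (B, N, dim); ref_mask: (B, N, 1) in [0, 1]
        ref = ref_tokens + self.mask_embed(ref_mask)
        tokens = torch.cat([ref, query_tokens], dim=1)   # joint self-attention
        out = self.encoder(tokens)
        return out[:, ref_tokens.shape[1]:]              # updated query tokens
```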

1.12 Semantic-Human: Neural Rendering of Humans from Monocular Video with Human Parsing

https://arxiv.org/abs/2308.09894

Neural rendering of humans is a topic of great research interest. However, most previous work has focused on achieving photorealistic details while ignoring the exploration of human parsing. In addition, classical semantic approaches are limited in their ability to efficiently represent complex motions with fine-grained results. Human parsing is intrinsically related to radiance reconstruction, since similar appearance and geometry often correspond to similar semantic parts. Furthermore, previous works often design a motion field mapping from observation space to canonical space that tends to underfit or overfit, resulting in limited generalization. In this paper, we present Semantic-Human, a novel approach for neural rendering of humans that achieves both photorealistic details and viewpoint-consistent human parsing. Specifically, we extend Neural Radiance Fields (NeRF) to jointly encode semantics, appearance, and geometry, enabling accurate 2D semantic labels under supervision from noisy pseudo-labels. Leveraging the inherent consistency and smoothness of NeRF, Semantic-Human achieves consistent human parsing across continuous and novel views. We also introduce constraints on the motion field and regularize the recovered volumetric geometry using SMPL surfaces. We evaluate the model on the ZJU-MoCap dataset and obtain highly competitive results that demonstrate the effectiveness of the proposed Semantic-Human. We also demonstrate various compelling applications, including label denoising, label synthesis, and image editing, and empirically validate its beneficial properties.
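
A minimal sketch of the joint encoding idea: a NeRF-style MLP that outputs density, color, and semantic logits per point, with semantics composited along each ray using the same volume-rendering weights as color. Network sizes are placeholders, and the full model's motion-field and SMPL conditioning are omitted here.

```python
import torch
import torch.nn as nn

class SemanticNeRFHead(nn.Module):
    """Tiny NeRF-style MLP predicting density, color and semantic logits
    (illustrative; the paper's model also conditions on motion and SMPL)."""
    def __init__(self, in_dim, num_classes, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma = nn.Linear(hidden, 1)
        self.rgb = nn.Linear(hidden, 3)
        self.sem = nn.Linear(hidden, num_classes)

    def forward(self, x):
        h = self.trunk(x)
        return self.sigma(h), torch.sigmoid(self.rgb(h)), self.sem(h)

def composite(weights, rgb, sem_logits):
    """Volume rendering: weights (B, S, 1) derived from densities along each ray."""
    pixel_rgb = (weights * rgb).sum(dim=1)             # (B, 3)
    pixel_sem = (weights * sem_logits).sum(dim=1)      # (B, num_classes)
    return pixel_rgb, pixel_sem
```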

1.13 Microscopy Image Segmentation via Point and Shape Regularized Data Synthesis

https://arxiv.org/abs/2308.09835

Current deep learning-based methods for microscopy image segmentation rely heavily on large amounts of training data with dense annotations, which are expensive and laborious to obtain in practice. Compared to full annotations delineating the complete outline of an object, point annotations (especially object centroids) are much easier to acquire and still provide crucial information about the objects for subsequent segmentation. In this paper, we assume access only to point annotations during training and develop a unified pipeline for microscopy image segmentation using synthetically generated training data. Our framework consists of three stages: (1) take the point annotations and sample pseudo dense segmentation masks constrained by a shape prior; (2) use an image generation model trained in an unpaired fashion to translate the masks into realistic microscopy images; (3) the pseudo-masks, together with the synthetic images, then constitute a paired dataset for training a segmentation model. On the public MoNuSeg dataset, our synthesis pipeline produces more diverse and realistic images than baseline models while maintaining high consistency between input masks and generated images. With the same segmentation backbone, models trained on our synthetic dataset significantly outperform those trained on pseudo-labels or baseline-generated images. Moreover, our framework achieves results comparable to models trained on real microscopy images with dense labels, demonstrating its potential as a reliable and efficient alternative to labor-intensive, manual pixel-wise annotation for microscopy image segmentation. Code is available.
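
Stage (1) can be illustrated with a toy shape prior that places a random ellipse at each annotated centroid; the paper's prior is more principled, so treat the radii and shapes below as assumptions.

```python
import numpy as np

def points_to_pseudo_masks(points, height, width, radius_range=(8, 16), rng=None):
    """Sample a pseudo instance mask around each annotated centroid.

    points: list of (row, col) object centers. Ellipse radii and orientation are
    drawn at random - a stand-in shape prior, not the paper's.
    """
    rng = rng or np.random.default_rng(0)
    mask = np.zeros((height, width), dtype=np.uint8)
    yy, xx = np.mgrid[0:height, 0:width]
    for label, (r, c) in enumerate(points, start=1):
        ry, rx = rng.uniform(*radius_range, size=2)
        theta = rng.uniform(0, np.pi)
        y, x = yy - r, xx - c
        u = x * np.cos(theta) + y * np.sin(theta)     # rotate into ellipse frame
        v = -x * np.sin(theta) + y * np.cos(theta)
        mask[(u / rx) ** 2 + (v / ry) ** 2 <= 1.0] = label
    return mask  # instance-labelled pseudo segmentation mask
```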

1.14 EAVL: Explicitly Align Vision and Language for Referring Image Segmentation

https://arxiv.org/abs/2308.09779

Referring image segmentation aims to segment an object mentioned in natural language from an image. A main challenge is language-related localization, i.e., locating the object referred to by the language. Previous methods mainly focus on the fusion of vision and language features without fully addressing language-related localization. In those methods, the fused vision-language features are fed directly into a decoder and the result is obtained by convolutions with a fixed kernel, following a pattern similar to traditional image segmentation. This approach does not explicitly align language and vision features in the segmentation stage, leading to suboptimal language-related localization. Different from previous methods, we propose EAVL, which explicitly aligns vision and language for referring image segmentation. Instead of using a fixed convolution kernel, we propose an Aligner that explicitly aligns vision and language features in the segmentation stage. Specifically, a series of unfixed convolution kernels is generated based on the input language expression and then used to explicitly align vision and language features. To achieve this, we generate multiple queries that represent different emphases of the language expression. These queries are transformed into a series of query-based convolution kernels, which are used to perform convolutions in the segmentation stage and obtain a series of segmentation masks. The final result is obtained by aggregating all the masks. Our method not only effectively fuses vision and language features but also exploits their potential in the segmentation stage. Most importantly, we explicitly align language features with different emphases with image features for language-related localization. Our method outperforms previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
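
A hedged sketch of the Aligner idea: derive several query vectors from the language features, turn each into a 1x1 convolution kernel, convolve it with the visual features to obtain one mask per query, and aggregate. The dimensions and the aggregation rule are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageKernelAligner(nn.Module):
    """Generate query-based 1x1 conv kernels from language features and apply
    them to visual features (an illustrative reading of the abstract)."""
    def __init__(self, lang_dim, vis_dim, num_queries=8):
        super().__init__()
        self.num_queries = num_queries
        self.to_kernels = nn.Linear(lang_dim, num_queries * vis_dim)

    def forward(self, lang_feat, vis_feat):
        # lang_feat: (B, lang_dim); vis_feat: (B, vis_dim, H, W)
        B, C, H, W = vis_feat.shape
        kernels = self.to_kernels(lang_feat).view(B, self.num_queries, C, 1, 1)
        masks = []
        for b in range(B):   # per-sample dynamic convolution
            masks.append(F.conv2d(vis_feat[b:b + 1], kernels[b]))
        masks = torch.cat(masks, dim=0)          # (B, num_queries, H, W)
        return masks.mean(dim=1, keepdim=True)   # aggregate into final mask logits
```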

1.15 Enhancing Medical Image Segmentation: Optimizing Cross-Entropy Weights and Post-Processing with Autoencoders

https://arxiv.org/abs/2308.10488

The task of medical image segmentation presents unique challenges, requiring both local and global semantic understanding to accurately delineate regions of interest such as critical tissues or abnormal features. This complexity is heightened in medical images by the high degree of inter-class similarity, intra-class variation, and possible image blurring. The segmentation task becomes even more diverse when considering autoimmune diseases such as dermatomyositis on histopathology slides. Analysis of cellular inflammation and its interactions in these settings has been under-studied due to limitations in the data acquisition pipeline. Despite advances in medical science, we lack comprehensive datasets of autoimmune diseases, whose study is becoming increasingly important as their global prevalence continues to rise and as they have shown relevance to COVID-19. While existing studies have integrated artificial intelligence into the analysis of various autoimmune diseases, dermatomyositis remains relatively under-explored. In this paper, we propose a deep learning approach tailored for medical image segmentation. On the dermatomyositis dataset, our proposed method achieves an average improvement of 12.26% for U-Net and 12.04% for U-Net++ across the ResNet family of encoders. Furthermore, we explore the importance of optimizing the loss function weights and benchmark our approach on three challenging medical image segmentation tasks.
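
As a concrete, hypothetical example of the kind of cross-entropy weighting being optimized, the snippet below computes inverse-frequency class weights from the training masks and passes them to the loss. The paper tunes such weights rather than fixing them this way; this only shows the mechanism.

```python
import torch
import torch.nn as nn

def inverse_frequency_weights(label_maps, num_classes):
    """Per-class weights inversely proportional to pixel frequency.

    label_maps: (N, H, W) integer training masks. A simple starting point;
    the paper optimizes the weighting rather than fixing it like this.
    """
    counts = torch.bincount(label_maps.flatten(), minlength=num_classes).float()
    freqs = counts / counts.sum()
    weights = 1.0 / (freqs + 1e-6)
    return weights / weights.sum() * num_classes   # normalize around 1.0

# Usage with a segmentation head's logits (B, C, H, W) and targets (B, H, W):
# criterion = nn.CrossEntropyLoss(weight=inverse_frequency_weights(train_masks, C))
```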

1.16 EDDense-Net: Fully Dense Encoder Decoder Network for Joint Segmentation of Optic Cup and Disc

https://arxiv.org/abs/2308.10192

Glaucoma is an eye disease that damages the optic nerve and can lead to vision loss and permanent blindness, so early glaucoma detection is crucial. Estimation of the cup-to-disc ratio (CDR) during an optic disc (OD) examination is used in the diagnosis of glaucoma. In this paper, we propose EDDense-Net, a segmentation network for the joint segmentation of the optic cup (OC) and the optic disc (OD). The encoder and decoder of this network consist of dense blocks with grouped convolutional layers in each block, allowing the network to acquire and propagate spatial information from the image while reducing network complexity. To reduce the loss of spatial information, an optimal number of filters is used in all convolutional layers. The decoder adopts Dice pixel classification to alleviate the class imbalance problem in semantic segmentation. The proposed network is evaluated on two publicly available datasets, where it outperforms existing state-of-the-art methods in terms of accuracy and efficiency. The method can be used as a second-opinion system to assist ophthalmologists in the diagnosis and analysis of glaucoma.
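
Two of the ingredients named above, grouped-convolution dense layers and Dice-based pixel classification, can be sketched as follows. Layer sizes are placeholders and this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class GroupedDenseLayer(nn.Module):
    """One dense-block layer using a grouped 3x3 convolution; the input is
    concatenated with the new features, DenseNet-style (illustrative sizes).
    Note: in_ch and growth must both be divisible by groups."""
    def __init__(self, in_ch, growth=32, groups=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, growth, 3, padding=1, groups=groups))

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)

def soft_dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss for binary OC/OD masks; logits and target are (B, 1, H, W)."""
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    denom = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()
```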
