[Computer Vision | Object Detection] arXiv Computer Vision Academic Express on Object Detection (Collection of Papers from December 4) (Part 1)

1. Detection related (13 articles)

1.1 Rethinking Detection Based Table Structure Recognition for Visually Rich Documents

Rethinking detection-based table structure recognition for visually rich documents

https://arxiv.org/abs/2312.00699

Table Structure Recognition (TSR) aims to convert unstructured table images into structured formats such as HTML sequences. A popular solution is to use a detection model to detect table components such as columns and rows, and then apply rule-based post-processing to convert the detection results into HTML sequences. However, existing detection-based studies often have the following limitations. First, they usually focus on improving detection performance, which does not necessarily lead to better results on cell-level metrics such as TEDS. Second, some solutions oversimplify the problem and can miss key information. Third, although some studies define the problem as detecting more components to provide as much information as other types of solutions, they ignore the fact that this makes it a multi-label detection problem, because rows, projected row headers, and column headers can share the same bounding box. Furthermore, there is often a performance gap between two-stage and transformer-based detection models in terms of structure-only TEDS, even though they perform similarly on the COCO metric. We therefore revisit the limitations of existing detection-based solutions, compare two-stage and transformer-based detection models, and identify the critical design aspects for the success of two-stage detection models on TSR tasks, including the multi-class problem definition, the aspect ratios used for anchor box generation, and the feature generation of the backbone network. We apply simple methods to improve these aspects of the Cascade R-CNN model, achieving state-of-the-art performance and improving the baseline Cascade R-CNN model by 19.32%, 11.56%, and 14.77% in structure-only TEDS on the SciTSR, FinTabNet, and PubTables-1M datasets, respectively.
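
As a rough illustration of the rule-based post-processing idea described above (not the paper's actual rules), the sketch below turns detected row and column boxes into a plain HTML grid by intersecting each row with each column; cell spanning, header handling, and text assignment are omitted.

```python
# Illustrative sketch only: convert detected row/column boxes into an HTML grid.

def boxes_to_html(row_boxes, col_boxes):
    """row_boxes / col_boxes: lists of (x1, y1, x2, y2) detections."""
    # Sort rows top-to-bottom and columns left-to-right.
    rows = sorted(row_boxes, key=lambda b: b[1])
    cols = sorted(col_boxes, key=lambda b: b[0])

    html = ["<table>"]
    for r in rows:
        html.append("<tr>")
        for c in cols:
            # Each row/column intersection becomes one cell; a real pipeline
            # would also handle spanning cells, headers, and text assignment.
            x1, y1 = max(r[0], c[0]), max(r[1], c[1])
            x2, y2 = min(r[2], c[2]), min(r[3], c[3])
            html.append(f'<td data-bbox="{x1},{y1},{x2},{y2}"></td>')
        html.append("</tr>")
    html.append("</table>")
    return "".join(html)
```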

1.2 Object Detector Differences when using Synthetic and Real Training Data

Object detector differences when using synthetic and real training data

https://arxiv.org/abs/2312.00694

Training a neural network that generalizes well requires a sufficiently large and diverse dataset. Collecting data while complying with privacy legislation is becoming increasingly difficult, and annotating these large datasets is resource-heavy and time-consuming. One way to overcome these difficulties is to use synthetic data, which is inherently scalable and can be annotated automatically. However, how training on synthetic data affects the layers of a neural network remains unclear. In this paper, we train the YOLOv3 object detector on real and synthetic images from urban environments. We perform a layer-wise similarity analysis using Centered Kernel Alignment (CKA) to explore the effect of training on synthetic data. The analysis captures the architecture of the detector, showing both different and similar patterns across models. Through this similarity analysis, we aim to understand how training on synthetic data affects each layer and to gain better insight into the inner workings of complex neural networks. The results show that detectors trained on real data and detectors trained on synthetic data are most similar in the early layers and differ most in the head. The results also show no major differences in performance or similarity between frozen and unfrozen backbones.
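
For readers unfamiliar with CKA, the following is a minimal implementation of linear CKA between the activations of two layers, the kind of layer-wise comparison described above; it is a generic sketch, not the authors' evaluation code.

```python
# Minimal linear CKA (Kornblith et al., 2019) between two layers' activations.
import numpy as np

def linear_cka(X, Y):
    """X: (n_samples, d1), Y: (n_samples, d2) activations for the same inputs."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, "fro") ** 2
    denominator = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return numerator / denominator

# Example: compare layer k of a detector trained on real data vs. synthetic data
# (feature extraction helpers are hypothetical):
# print(linear_cka(acts_real_layer_k, acts_synth_layer_k))
```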

1.3 Towards Efficient 3D Object Detection in Bird’s-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach

Towards efficient 3D object detection in bird's-eye-view space for autonomous driving: a convolution-only approach

https://arxiv.org/abs/2312.00633

3D object detection in bird's-eye-view (BEV) space has recently become a popular approach in autonomous driving. Despite improvements in accuracy and velocity estimation compared to perspective-view methods, deploying BEV-based techniques in real-world autonomous vehicles remains challenging. This is mainly due to their reliance on vision-transformer (ViT) based architectures, which introduce quadratic complexity with respect to the input resolution. To address this problem, we propose an efficient BEV-based 3D detection framework called BEVENet, which uses a convolutional architectural design to circumvent the limitations of ViT models while maintaining the effectiveness of BEV-based methods. Our experiments show that BEVENet is 3x faster than contemporary state-of-the-art (SOTA) methods on the NuScenes challenge, achieving a mean average precision (mAP) of 0.456 and a nuScenes Detection Score (NDS) of 0.555 on the NuScenes validation dataset, with an inference speed of 47.6 frames per second. To our knowledge, this study is the first to achieve such significant efficiency improvements for BEV-based methods, highlighting their feasibility for real-world autonomous driving applications.
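
Purely as an illustration of the design trade-off (not BEVENet's published architecture), the sketch below shows a residual convolutional block operating on a BEV grid, whose cost grows linearly with the number of BEV cells rather than quadratically as with self-attention; the channel sizes are arbitrary assumptions.

```python
# Illustrative only: a convolutional BEV feature block of the kind a
# convolution-only detector might use in place of self-attention.
import torch
import torch.nn as nn

class ConvBEVBlock(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        # Cost scales linearly with the number of BEV cells (H * W),
        # unlike self-attention, whose cost grows quadratically with H * W.
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, bev):                       # bev: (B, C, H, W) BEV grid
        return self.act(bev + self.block(bev))    # residual refinement

# bev_feats = torch.randn(1, 128, 200, 200)
# refined = ConvBEVBlock()(bev_feats)
```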

1.4 Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version)

Tracking object positions in reinforcement learning: a metric for keypoint detection (extended version)

https://arxiv.org/abs/2312.00592

Reinforcement learning (RL) for robot control often requires detailed representations of the environment state, including information about task-relevant objects that cannot be measured directly. Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach for extracting low-dimensional representations from high-dimensional image data. SAEs target spatial features such as object positions, which are often useful representations for robotic RL. However, whether an SAE actually tracks the objects in a scene, and thereby produces a spatial state representation well suited to RL tasks, has rarely been examined due to the lack of established metrics. In this paper, we propose to evaluate SAE instances by measuring how well their keypoints track ground-truth objects in images. We propose a computationally lightweight metric and use it to evaluate common baseline SAE architectures on image data from simulated robotic tasks. We find that common SAEs differ greatly in their spatial extraction capability. Furthermore, we verify that SAEs performing well on our metric achieve superior performance when used in downstream RL. Our metric is therefore an efficient and lightweight indicator of RL performance before expensive RL training. Based on these insights, we identify three key modifications to the SAE architecture that improve tracking performance. Our code is available at anonymous.4open.science/r/sae-rl.
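
A simplified version of the evaluation idea, assuming keypoint coordinates and ground-truth object positions are available per frame, could look like the sketch below; it illustrates the concept, not the paper's exact metric.

```python
# Sketch of a keypoint-tracking score: for each ground-truth object trajectory,
# pick the keypoint that follows it most closely over time and report the
# mean tracking error. Not the paper's exact metric.
import numpy as np

def tracking_error(keypoints, gt_positions):
    """
    keypoints:    (T, K, 2) keypoint coordinates over T frames
    gt_positions: (T, M, 2) ground-truth object positions over T frames
    Returns the mean distance of each object to its best-tracking keypoint.
    """
    # Distances between every keypoint and every object in every frame: (T, K, M)
    dists = np.linalg.norm(
        keypoints[:, :, None, :] - gt_positions[:, None, :, :], axis=-1
    )
    per_pair = dists.mean(axis=0)            # (K, M): average over time
    best_per_object = per_pair.min(axis=0)   # (M,): best keypoint per object
    return best_per_object.mean()
```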

1.5 LiDAR-based curb detection for ground truth annotation in automated driving validation

LiDAR-based curb detection for ground-truth annotation in automated driving validation

https://arxiv.org/abs/2312.00534

Curb detection is essential for environment perception in automated driving (AD), as curbs typically delimit drivable and non-drivable areas. Annotated data are necessary for developing and validating AD functions, yet the number of public datasets with annotated point-cloud curbs is small. This paper presents a method to detect 3D curbs from point-cloud sequences captured by a LiDAR sensor, consisting of two main steps. First, our approach detects curbs in each scan using a segmentation deep neural network. Then, a sequence-level processing step estimates the 3D curbs in the reconstructed point cloud using the vehicle's odometry. From these 3D curb points, we obtain polylines structured according to the ASAM OpenLABEL standard. These detections can be used as pre-annotations in labeling pipelines to efficiently generate curb-related ground-truth data. We validate our approach through an experiment in which several human annotators labeled curbs on a set of LiDAR-based sequences, with and without our automatically generated pre-annotations. The results show that, thanks to our detections, manual annotation time is reduced by 50.99% while maintaining the same level of data quality.
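
A hedged sketch of the sequence-level step: per-scan curb points are transformed into a common frame using the odometry poses and accumulated. The pose format and the later polyline fitting are assumptions for illustration, not the paper's implementation.

```python
# Illustrative accumulation of per-scan curb detections into a map frame.
import numpy as np

def accumulate_curb_points(scan_points, poses):
    """
    scan_points: list of (N_i, 3) curb points per LiDAR scan (sensor frame)
    poses:       list of (4, 4) odometry poses mapping each scan to the map frame
    Returns a single (N, 3) array of curb points in the map frame.
    """
    world_points = []
    for pts, T in zip(scan_points, poses):
        homo = np.hstack([pts, np.ones((pts.shape[0], 1))])  # (N_i, 4) homogeneous
        world_points.append((homo @ T.T)[:, :3])             # apply rigid transform
    return np.vstack(world_points)

# The accumulated points would then be clustered and simplified into 3D
# polylines, which is the representation used for curbs in ASAM OpenLABEL.
```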

1.6 Unsupervised textile defect detection using convolutional neural networks

Unsupervised textile defect detection based on convolutional neural network

https://arxiv.org/abs/2312.00224

In this study, we propose a new pattern-based unsupervised method for textile anomaly detection that combines the advantages of traditional convolutional neural networks with an unsupervised learning paradigm. The method consists of five main steps: preprocessing, automatic extraction of the pattern period, patch extraction, feature selection, and anomaly detection. It adopts a new dynamic and heuristic feature-selection scheme that avoids the drawbacks of fixing the number of filters (neurons) and initializing their weights, as well as the drawbacks of the backpropagation mechanism, such as vanishing gradients, which are common in state-of-the-art methods. The network is designed and trained dynamically based on the input domain, so that only the number of layers and the strides need to be defined before building the model. We neither initialize the weights randomly nor define the filter size or the number of filters, as is usually done in CNN-based methods. This reduces the effort and time spent on hyperparameter initialization and fine-tuning. Only one defect-free sample is required for training, and no further labeled data are needed. The trained network is then used to detect anomalies on defective fabric samples. We demonstrate the effectiveness of our approach on a patterned-fabric benchmark dataset. Compared to state-of-the-art unsupervised methods, our algorithm produces reliable and competitive results (in terms of recall, precision, accuracy, and F1 measure) in less time, training effectively within a single epoch and at low computational cost.
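
To make two of the five steps concrete, the sketch below estimates the pattern period with a simple autocorrelation heuristic and scores patches by their distance to defect-free reference patches; it mirrors the structure of the pipeline, not its exact feature selection or network construction.

```python
# Illustrative helpers for pattern-period estimation and patch-level anomaly scoring.
import numpy as np

def estimate_period(profile):
    """profile: 1D intensity profile, e.g. column means of a grayscale fabric image."""
    profile = profile - profile.mean()
    ac = np.correlate(profile, profile, mode="full")[len(profile) - 1:]
    # Crude heuristic: skip the central lobe, then take the strongest peak,
    # whose lag approximates the pattern period.
    below = np.where(ac < 0)[0]
    start = below[0] if below.size else 1
    return int(start + np.argmax(ac[start:]))

def anomaly_scores(test_patches, reference_patches):
    """Distance of each test patch to its nearest defect-free reference patch."""
    t = test_patches.reshape(len(test_patches), -1)
    r = reference_patches.reshape(len(reference_patches), -1)
    dists = np.linalg.norm(t[:, None, :] - r[None, :, :], axis=-1)
    return dists.min(axis=1)   # high score = likely defect
```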

1.7 Raising the Bar of AI-generated Image Detection with CLIP

Raising the standard for AI-generated image detection using CLIP

https://arxiv.org/abs/2312.00195

The goal of this work is to explore the potential of pre-trained vision-language models (VLMs) for the universal detection of AI-generated images. We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios. We find that, contrary to previous belief, training on a large domain-specific dataset is neither necessary nor convenient. On the contrary, using only a handful of example images from a single generative model, a CLIP-based detector shows surprising generalization ability and high robustness across several different architectures, including recent commercial tools such as DALL-E 3, Midjourney v5, and Firefly. We match the SoTA on in-distribution data and achieve significant improvements in generalization to out-of-distribution data (+6% AUC) and robustness to impaired/laundered data (+13%). Our project is available at https://grip-unina.github.io/ClipBased-SyntheticImageDetection/
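
A generic few-shot baseline in this spirit (the authors' actual code is at the project page above) embeds a few real and generated images with a frozen CLIP backbone and fits a linear probe on the features; the model name, weights, and classifier choice below are assumptions, not the paper's configuration.

```python
# Few-shot CLIP-feature detector sketch: frozen CLIP features + linear probe.
import torch
import open_clip
from sklearn.linear_model import LogisticRegression

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
model.eval()

@torch.no_grad()
def clip_features(pil_images):
    batch = torch.stack([preprocess(im) for im in pil_images])
    feats = model.encode_image(batch)
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

# few_real, few_fake: small lists of PIL images (the "few examples" regime)
# X = np.vstack([clip_features(few_real), clip_features(few_fake)])
# y = [0] * len(few_real) + [1] * len(few_fake)
# probe = LogisticRegression(max_iter=1000).fit(X, y)
# scores = probe.predict_proba(clip_features(test_images))[:, 1]
```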
