Prof. Dai Jifeng: Latest Research Results on Super-Large-Scale General Vision Models


Tracking social hotspots, interpreting the AI frontier, spreading AI knowledge through open-source algorithms, and opening up a view of frontier AI applications from the vantage point of supercomputing/high-performance computing: the OpenMMLab open source community and the Beijing Super Cloud Computing Center have jointly launched the live broadcast column [AI Wonderful Night], airing at 8:00 p.m. on a Thursday each month, to open the door to the wonders of AI together with everyone.

Highlights of this issue

As deep learning technology evolves, super-large-scale general-purpose model technology is advancing rapidly, and an era is approaching in which a single model serves a wide range of tasks and exhibits some characteristics of general intelligence. Related technologies have made great progress in natural language processing, as exemplified by ChatGPT, but in computer vision many difficulties and open problems remain. This talk introduces the efforts and progress of Professor Dai Jifeng, a doctoral supervisor at Tsinghua University, and his team in this direction.

A little spoiler: during the live broadcast, lucky viewers will be drawn to receive 500-yuan computing resource vouchers, and plenty of exquisite merchandise gifts await you! The talk will be streamed simultaneously on the video accounts of OpenMMLab and the Beijing Super Cloud Computing Center. You are welcome to follow and reserve a spot.

Talk contents

  • Development status of ultra-large-scale general vision perception models

  • Research progress on multi-modal multi-task unified pre-training

  • Research progress on ultra-large-scale image backbone networks

  • Research progress on the Uni-Perceiver general visual task representation

  • Research progress on BEV surround-view autonomous driving perception

Talk time

Beijing time

May 4, 2023 (Thursday)

20:00 - 20:40 (talk)

20:40 - 21:00 (Q&A)

Speaker


Dai Jifeng

Associate Professor, Department of Electronic Engineering, Tsinghua University

Doctoral Supervisor

Core member of OpenGVLab

He received his bachelor's degree in engineering and his Ph.D. from the Department of Automation, Tsinghua University, in 2009 and 2014 respectively, under the supervision of Professor Zhou Jie. From 2014 to 2019, he worked in the Vision Group of Microsoft Research Asia as a principal researcher and research manager. From 2019 to 2022, he worked at SenseTime Research as executive research director and head of a second-level department. In July 2022, he joined the Department of Electronic Engineering of Tsinghua University full-time.

His research interests include computer vision and deep learning. He has published more than 50 papers in international journals and conferences in related fields, with more than 26,000 citations. Several of his papers have become milestone works in object recognition, have been included in the lecture notes of vision courses at world-class universities, and have been incorporated into the authoritative deep learning framework PyTorch as standard operators.

He won the authoritative COCO object recognition challenge for two consecutive years, and algorithms he proposed have been used in subsequent champion systems. An algorithm he proposed also won the authoritative Waymo 2022 challenge in autonomous driving perception. He serves as an editorial board member of the top journal IJCV, an area chair of the top conferences NeurIPS 2023, ICCV 2023, CVPR 2023, CVPR 2021, and ECCV 2020, and publicity chair of ICCV 2019.

Host


Li Yining

Young Researcher at Shanghai Artificial Intelligence Laboratory

Head of several OpenMMLab frameworks; Ph.D. from the Chinese University of Hong Kong. His main research direction is human-centric computer vision, including attribute recognition, pose estimation, image generation, and metric learning.

Details

General perception models are leading the progress toward general artificial intelligence; they originated in NLP and are developing toward more modalities. Multimodal technology broadens the range of applications of AIGC technology by integrating different modalities (image, sound, language, etc.) into a single model.


At the same time, general perception models face many challenges and difficulties, such as:

  1. Huge number of network parameters (more than one billion vs. fewer than ten million): training stability, convergence, and overfitting pose far greater challenges than in smaller networks;

  2. Complicated training process (billions of heterogeneous, low-quality images and image-text pairs vs. tens of millions of homogeneous, finely annotated images): multi-step training is required to exploit heterogeneous multi-modal multi-task data, making the pipeline complex, prone to catastrophic forgetting, and accuracy problems hard to localize;

  3. High experimental cost (thousands of GPUs training in parallel for several weeks vs. 8 GPUs training for several hours): researchers need keen analytical skills and solid knowledge;

  4. Numerous engineering challenges: the throughput of massive data, parallel algorithms on large GPU clusters, and memory management for models with extremely large parameter counts.


In response to the above problems, we introduce four of our recent research results below, hoping to offer useful inspiration to fellow researchers.

Research Progress 1: Multi-Modal Multi-Task Unified Pre-training

To efficiently train super-large-scale vision models on Internet-scale images and image-text pairs, we propose M3I Pre-training, a unified multi-modal multi-task pre-training approach based on maximizing multi-modal mutual information. It completes pre-training over multiple data sources in a single stage, so the training process is simple, efficient, and easy to monitor and troubleshoot. This resolves the problems of existing multi-modal multi-task training: complicated and fragile pipelines, training issues that are hard to analyze and localize, catastrophic forgetting, and the high cost of mistakes.


Code: https://github.com/OpenGVLab/M3I-Pretraining
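The method casts pre-training as maximizing the mutual information between modalities. As a rough, generic illustration of that idea (this is not the M3I code, and all names here are ours), the InfoNCE loss below is a standard contrastive lower bound on the mutual information between paired image and text embeddings:

```python
# A minimal InfoNCE sketch; illustrative only, not the M3I implementation.
import torch
import torch.nn.functional as F

def info_nce_loss(img_feats: torch.Tensor, txt_feats: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Contrastive lower bound on the mutual information of paired modalities.

    img_feats, txt_feats: (batch, dim) embeddings of paired image/text samples.
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature                   # (batch, batch) similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Matching pairs lie on the diagonal; treat them as positives in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage (hypothetical encoders): loss = info_nce_loss(image_enc(imgs), text_enc(txts))
```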

Research Progress 2: Ultra-Large-Scale Image Backbone Network

To obtain a high-quality image backbone network that can serve a wide variety of heterogeneous vision tasks, we propose the InternImage large model. Built on deformable convolution (illustrated in the sketch after the code link below), it achieves the best performance on benchmark tasks in the image domain, breaking the Vision Transformer's monopoly on large vision models and surpassing large vision models from organizations including Microsoft, Meta, and Google. Research on ultra-large-scale image backbone networks must address problems on several fronts:

  1. Design paradigms for large models: scaling-up strategies that balance network depth/width/resolution/number of groups, feature and gradient adjustment strategies for the unstable convergence of large networks, initialization strategies for the slow convergence of large models, training strategies for large models that overfit easily, etc.;

  2. Large-scale accelerated training frameworks: PyTorch DDP, FSDP, DeepSpeed ZeRO, mixed-precision computation, fused operators, kernel-level acceleration, gradient accumulation, gradient checkpointing, efficient data reading, data sharding, troubleshooting of cluster file and computing systems, automatic monitoring/alerting/restarting on training anomalies, profilers, etc. (see the sketch after this list for two of these techniques);

  3. Multi-task model training frameworks: joint multi-network/multi-task/multi-dataset/multi-modal training (with a Meta Dataloader & Sampler and a Meta Training & Inference Pipeline designed for highly modular flexibility), efficient simultaneous reading and preprocessing of dozens of task-dataset combinations, multi-task and multi-dataset sampling, automatic hyperparameter search based on proxy tasks, and comparison and monitoring of statistics such as per-task gradients/loss/accuracy.
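As a minimal sketch of two of the techniques named in item 2, the plain-PyTorch loop below combines mixed precision with gradient accumulation. It assumes `model`, `loader`, and `optimizer` are already defined and that `model` returns a scalar loss; this is a generic illustration, not the InternImage training code:

```python
# Mixed precision + gradient accumulation, sketched in plain PyTorch.
import torch

def train_one_epoch(model, loader, optimizer, accum_steps: int = 8):
    scaler = torch.cuda.amp.GradScaler()      # dynamic loss scaling for fp16
    optimizer.zero_grad(set_to_none=True)
    for step, (images, labels) in enumerate(loader):
        with torch.cuda.amp.autocast():       # half-precision forward pass
            loss = model(images.cuda(), labels.cuda()) / accum_steps
        scaler.scale(loss).backward()         # scaled backward avoids fp16 underflow
        if (step + 1) % accum_steps == 0:     # emulate a batch accum_steps x larger
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```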


The InternImage model we proposed leads on dozens of visual task datasets. It was open-sourced in early March 2023 and has already accumulated 1K+ GitHub stars, with the count growing rapidly.


Code: https://github.com/opengvlab/internimage
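As a rough illustration of the deformable-convolution building block, the sketch below uses torchvision's generic DeformConv2d, in which a small convolution predicts per-location sampling offsets for the kernel. InternImage's actual operator is DCNv3, which differs in detail; this only shows the underlying idea of learned sampling positions:

```python
# Deformable convolution sketch with torchvision's generic operator.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # A plain conv predicts a (dy, dx) offset for each kernel sample point.
        self.offset = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                kernel_size, padding=pad)
        self.dconv = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dconv(x, self.offset(x))  # sample inputs at shifted positions

x = torch.randn(1, 64, 56, 56)
print(DeformBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```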

Research Progress 3: Uni-Perceiver General Visual Task Representation

In computer vision, the representations used by different tasks vary greatly. To build a universal decoder network for visual tasks and achieve task-level generalization, we proposed the Uni-Perceiver series, pioneering work on general visual task representation models that, for the first time, unifies dozens of visual tasks under one representation framework. Among them, Uni-Perceiver v2 achieves performance comparable to task-specific models on core visual problems such as object detection and instance segmentation.


Code: https://github.com/fundamentalvision/Uni-Perceiver
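The unifying idea is that inputs and candidate targets (class names, captions, and other task outputs expressed as token sequences) are embedded by a shared encoder, and each task reduces to selecting the best-matching target. The sketch below is a rough, generic rendering of that formulation, not Uni-Perceiver's actual code; the function and tensor names are ours:

```python
# Generic "shared representation" task sketch; illustrative only.
import torch
import torch.nn.functional as F

def generic_task(encoder, inputs: torch.Tensor,
                 candidate_targets: torch.Tensor) -> torch.Tensor:
    """Return the index of the best-matching candidate target for each input.

    encoder: a shared network mapping either inputs or targets to (*, dim).
    """
    x = F.normalize(encoder(inputs), dim=-1)             # (batch, dim)
    t = F.normalize(encoder(candidate_targets), dim=-1)  # (num_targets, dim)
    scores = x @ t.t()                                   # (batch, num_targets)
    return scores.argmax(dim=-1)  # classification, retrieval, ... share this form
```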

Research Progress 4: BEV Surround View Autonomous Driving Perception


At present, the industry has explored different paths for camera-based 3D perception, which fall roughly into two types: Image-view methods and BEV methods. The Image-view scheme uses separate networks for the individual perception subtasks and finally merges their outputs through rule-based fusion. In contrast, the BEV scheme usually uses a Transformer to convert image features to the bird's-eye-view (BEV) perspective, where the perception tasks are carried out. Addressing the fact that current vision-based 3D object detection methods do not make full use of temporal information, BEVFormer proposes an end-to-end framework based on Deformable Attention that fuses multi-camera and temporal features. The framework suits a variety of autonomous driving perception tasks, and the detection algorithm is robust. BEVFormer was selected as one of the "Top-10 most influential papers of ECCV 2022" and won first place in the Waymo pure-vision 3D detection challenge.

Code: https://github.com/fundamentalvision/BEVFormer
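At its core, each BEV query owns a 3D reference point that is projected into every camera's feature map, where features are sampled and aggregated. The sketch below is a heavily simplified illustration of that projection-and-sampling step, not BEVFormer's actual code (which uses deformable attention across cameras and timestamps); names are ours, and `lidar2img` is assumed to map ego coordinates directly to feature-map pixel coordinates:

```python
# Simplified BEV feature sampling: project 3D reference points into each
# camera's feature map and bilinearly sample. Illustrative only.
import torch
import torch.nn.functional as F

def sample_bev_features(img_feats: torch.Tensor, ref_points_3d: torch.Tensor,
                        lidar2img: torch.Tensor) -> torch.Tensor:
    """
    img_feats:     (num_cams, C, H, W) per-camera feature maps
    ref_points_3d: (num_query, 3) BEV reference points in ego coordinates
    lidar2img:     (num_cams, 4, 4) ego-to-feature-map projection matrices
    returns:       (num_query, C) multi-camera-averaged BEV features
    """
    num_cams, C, H, W = img_feats.shape
    ones = torch.ones_like(ref_points_3d[:, :1])
    pts = torch.cat([ref_points_3d, ones], dim=-1)        # (Q, 4) homogeneous
    feats = []
    for cam in range(num_cams):
        uvd = (lidar2img[cam] @ pts.t()).t()              # (Q, 4) projected points
        uv = uvd[:, :2] / uvd[:, 2:3].clamp(min=1e-5)     # perspective divide
        grid = torch.empty_like(uv)
        grid[:, 0] = uv[:, 0] / (W - 1) * 2 - 1           # x to [-1, 1]
        grid[:, 1] = uv[:, 1] / (H - 1) * 2 - 1           # y to [-1, 1]
        sampled = F.grid_sample(img_feats[cam:cam + 1],   # out-of-view -> zeros
                                grid.view(1, -1, 1, 2),
                                align_corners=True)       # (1, C, Q, 1)
        feats.append(sampled.view(C, -1).t())             # (Q, C)
    # Real implementations also mask points behind each camera; omitted here.
    return torch.stack(feats).mean(dim=0)
```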

Relevant information

Papers:

Su et al., Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information. CVPR 2023.

Wang et al., InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. CVPR 2023.

Zhu et al., Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks. CVPR 2022.

Zhu et al., Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs. NeurIPS 2022.

Li et al., Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks. CVPR 2023.

Li et al., BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers. ECCV 2022.

Yang et al., BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision. CVPR 2023.

Interactive rewards

1. Watch the live broadcast and join the lottery for a chance to win a free 500-yuan computing resource voucher.


2. During the live broadcast, join the on-screen comment discussion on the video accounts or Bilibili channels of OpenMMLab and Beijing Supercomputing; on each platform the assistant will draw 2 viewers to each receive 1 piece of OpenMMLab's exquisite merchandise [add the assistant on WeChat: OpenMMLabwx, to receive your prize].


Discussion group

To make it easy for everyone to communicate, we have also set up a community themed around "AI Wonderful Night". All information related to the live broadcast will be shared in the group, where you can also chat one-on-one with the experts. Scan the QR code to join us!


Event organizers

Guiding units: China Computer Federation High Performance Computing Professional Committee, Beijing Science and Technology Association

Sponsors: OpenMMLab, Beijing Super Cloud Computing Center

Co-organizers: Beijing Image and Graphics Society, OpenGVLab, TechBeat Artificial Intelligence Community

OpenMMLab

The OpenMMLab open source community maintains the most comprehensive open-source computer vision algorithm system of the deep learning era, forming an open-source algorithm platform that integrates industry, academia, research, and application.

OpenMMLab focuses on deep learning for vision, covering 30+ computer vision directions, supporting 300+ algorithms, and providing 2,300+ pre-trained models. All toolboxes are built on a unified architecture, offering a well-engineered codebase and a large amount of high-quality algorithm content, complementing deep learning frameworks such as PyTorch that supply the model training capability.

OpenMMLab helps users reduce the difficulty of reproducing algorithms and makes it easier to reproduce and compare algorithm benchmarks. It also helps users avoid common pitfalls, resolves the version fragmentation that arises when algorithms are put into production, and improves the efficiency of applying and deploying artificial intelligence algorithms.

Beijing Super Cloud Computing Center

Beijing Super Cloud Computing Center ("Beijing Supercomputing"), established in 2011 under the leadership of the Beijing Municipal People's Government, is an important supercomputing and cloud computing infrastructure platform. It is now located in Huairou Science City, part of the Beijing Huairou Comprehensive National Science Center. Since 2019, Beijing Supercomputing has deployed three main computing power hubs in Beijing, Ningxia, Inner Mongolia, and other locations, building a cross-region resource scheduling system that coordinates computing power, improves scientific research and production efficiency, and reduces corporate R&D costs, in strong support of the national "Eastern Data, Western Computing" initiative.

In 2020, 2021, and 2022, Beijing Supercomputing was shortlisted in the China HPC TOP100 for three consecutive years and ranked "No. 1 in general-purpose CPU computing power performance" for three consecutive years. In the 2021 AIPerf 500 list, Beijing Supercomputing placed 10 AI computing systems, ranking first in total share.


(Scan the QR code to add the Miaomiao Assistant on WeChat)


