Multimodal Fusion Perception Technology for Robots

01
Introduction

With the rapid development of sensor technology and the Internet, big data of different modalities are emerging at an unprecedented speed. For a given thing to be described (a target, a scene, etc.), the coupled data samples collected through different methods or from different perspectives constitute multimodal data, and each such method or perspective is commonly referred to as a modality. Narrowly defined, multimodal information usually refers to modalities with different perceptual characteristics (such as image-text, video-speech, and visual-tactile data), while broadly defined multimodal fusion also includes the fusion of multiple features within the same modality and the fusion of data from multiple sensors of the same type. The problem of multimodal perception and learning is therefore closely related to "multi-source fusion" and "multi-sensor fusion" in the field of signal processing, and to "multi-view learning" or "multi-view fusion" in the field of machine learning. Multimodal data can provide more comprehensive and accurate information and enhance the reliability and fault tolerance of a system. Because different modalities have completely different forms of description and complex coupling correspondences, multimodal perception and learning must address multimodal perceptual representation and cognitive fusion in a unified way. The goal of multimodal perception and fusion is to make two seemingly unrelated data samples in different formats comparable and fusible through appropriate transformation or projection; in layman's terms, it is to stage "Guan Gong fighting Qin Qiong" across modalities, a Chinese idiom for pitting against each other two figures who could never actually have met (see Figure 1). Such fusion of heterogeneous data can often achieve unexpectedly good results.

Multimodal data has played a huge role in Internet information search, human-computer interaction, industrial fault diagnosis, and robotics. Multimodal learning between vision and language is currently the area where research results on multimodal fusion are most concentrated, while in the field of robotics many challenging problems remain to be explored. This paper focuses on multimodal information perception and fusion for robots, especially related work on the fusion of vision and touch.

02

Multimodal Perception of Robots
Robots are an important tool for situational awareness in command and control systems. However, in engineering systems typified by robots, sensors of different modalities are usually fused only after each has completed its own perception and recognition, which makes the design of the fusion logic very difficult. The most typical case is the fatal accident of a Tesla car operating in Autopilot mode in the United States in 2016: although the car was equipped with excellent sensors, layout problems prevented it from effectively fusing the information from its visual and distance sensors. On industrial production sites, the lack of ability to fuse perceptual modalities means that only very simple mechanical operations can currently be realized.

The sensors configured on robot systems are complex and diverse. From cameras to lidar, from hearing to touch, from taste to smell, almost every kind of sensor has found applications in robots. However, because of task complexity, cost, and efficiency of use, most of this work is still at the laboratory stage. Among the service robots currently popular on the market, vision and voice sensors are still the most widely used, and these two modalities are generally processed independently (vision for target detection, hearing for voice interaction). Since most robots still lack the ability to manipulate objects and physically interact with humans, tactile sensors have seen almost no application.

To achieve fine robotic manipulation in complex scenes, a robot must perceive the environment through modalities such as vision and distance, and also needs contact information such as touch to perceive objects. Sensors of various modalities provide the basis for robots to make more efficient and scientific decisions, offer new opportunities for intelligent robots, and pose new challenges for multimodal information fusion methods. This work is of great significance for robot environmental perception, target detection, and navigation control, and can be applied directly to platforms such as unmanned vehicles and manipulators for tasks including disaster rescue, counter-terrorism and explosive disposal, and emergency response.

However, the multimodal data collected by robotic systems has some distinct characteristics that pose great challenges to fusion perception research. These issues include:

(1) "Polluted" multimodal data: The operating environment of the robot is very complex, so the collected data usually has a lot of noise and wild points.

(2) "Dynamic" multimodal data: Robots always work in a dynamic environment, and the collected multimodal data must have complex dynamic characteristics.

(3) "Mismatched" multi-modal data: The operating frequency bands and service cycles of the sensors carried by the robot are very different, making it difficult to "match" the data between the various modalities.

These problems pose great challenges to robotic multimodal fusion perception. To achieve an organic fusion of the various modal information, it is necessary to establish a unified feature representation and an association-matching relationship among the modalities; a toy illustration of the "mismatch" aspect of this problem is sketched below.
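As a concrete (and deliberately simplified) example of the mismatch problem, the following Python sketch resamples a fast tactile stream onto the timestamps of a slower camera so that the two modalities can be fused frame by frame. The sampling rates, array names, and features are invented for illustration and are not taken from any robot system described in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a camera at 30 Hz and a tactile array at 100 Hz.
t_vision = np.arange(0.0, 2.0, 1.0 / 30.0)            # camera timestamps (s)
t_tactile = np.arange(0.0, 2.0, 1.0 / 100.0)           # tactile timestamps (s)
vision_feat = rng.normal(size=(t_vision.size, 8))       # per-frame visual features
tactile_sig = np.sin(2 * np.pi * 1.5 * t_tactile)       # a 1-D tactile signal

# Minimal "association matching": interpolate the tactile stream onto the camera
# clock, so every visual frame gets a time-aligned tactile value before fusion.
tactile_on_vision_clock = np.interp(t_vision, t_tactile, tactile_sig)

fused = np.column_stack([vision_feat, tactile_on_vision_clock])
print(fused.shape)   # (60, 9): one fused feature vector per camera frame
```

Real robotic systems of course face much harder versions of this problem (dropped packets, clock drift, event-driven sensors), but the need to bring the modalities onto a common reference before fusion is the same.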

03
Robot Visual-Tactile Modal Fusion Technology

Many robots today are equipped with vision sensors. In practice, however, conventional visual perception techniques are subject to many limitations, such as lighting and occlusion. Many intrinsic properties of objects, such as "soft" and "hard", are difficult to perceive through vision alone. For robots, touch is another important way of obtaining environmental information: unlike vision, tactile sensors can directly measure various properties of objects and the environment. Touch is also a basic mode through which humans perceive the external environment. As early as the 1980s, neuroscientists anesthetized the skin of volunteers in experiments to verify the importance of tactile perception for stable grasping. Introducing a tactile perception module into a robot therefore not only imitates the human perception and cognition mechanism to a certain extent, but also meets strong practical needs.

With the development of modern sensing, control, and artificial intelligence technologies, researchers have equipped dexterous hands with tactile sensors and, combining the collected tactile information with different algorithms, have carried out extensive research on analyzing grasp stability and on classifying and recognizing grasped objects. Tactile sensors are of great importance for the fine manipulation abilities of dexterous hands. Two academic papers published by Banks and colleagues at the University of California, Berkeley in Nature (2002)① and Science (2002)② revealed that humans naturally have the ability to optimally integrate visual and tactile information, but how to endow engineering systems with such capabilities through computational models is far from resolved.

Visual and tactile information are collected from different aspects of objects: the former is non-contact, while the latter is contact-based. The object characteristics they reflect are therefore obviously different, which gives visual and tactile information a very complex internal relationship. At this stage it is difficult to obtain a complete representation of this relationship through manual mechanism analysis, so data-driven methods are currently a more effective way to solve this kind of problem. To this end, under a unified framework of structured sparse coding, we established a computational model of multimodal fusion perception for robots (see Figure 2) and developed several fusion understanding methods. Some of the related research work is briefly introduced below.
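Before turning to the individual works, it may help to see what plain (unstructured) sparse coding looks like computationally. The sketch below is only a generic illustration, not the structured model used in the works cited here: it learns a dictionary from synthetic feature vectors with scikit-learn and represents each sample as a sparse combination of dictionary atoms. All sizes and names are assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Synthetic stand-in for tactile or visual feature vectors (100 samples, 32-D).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))

# Learn an overcomplete dictionary; each sample is then approximated by a
# sparse combination of its atoms.
dico = DictionaryLearning(n_components=64, alpha=1.0, max_iter=50,
                          transform_algorithm="lasso_lars", random_state=0)
codes = dico.fit_transform(X)        # sparse codes, shape (100, 64)
D = dico.components_                 # dictionary atoms, shape (64, 32)

print(codes.shape, "fraction of nonzero coefficients:", np.mean(codes != 0))
```

Structured sparse coding then adds constraints on which atoms, or groups of atoms, may be active together; this is how relationships between fingers, adjectives, or modalities are encoded in the works described below.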

(1) Tactile array fusion target recognition

The tactile information acquired by a robot during manipulation is array-like and sequential. Most existing work focuses on modeling the "fingertips" of a manipulator. For a multi-fingered hand, the tactile sequences acquired by different fingertips can be regarded as coming from different sensors. Conventional processing methods either treat them as independent sensors or simply stitch them together directly; the former ignores the commonalities between different fingertips, while the latter ignores their differences. Exploiting the characteristics of the tactile array, we developed a joint sparse coding model (see Figure 3) that models the relationships between different fingertips and applied it effectively to tactile target recognition, promoting "fingertip"-level modeling to "finger"-level modeling (see literature [2]).
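The exact joint sparse coding model of [2] is not reproduced here. The numpy sketch below only illustrates the generic idea behind joint (row-wise) sparsity: tactile features from several fingertips are coded over one shared dictionary, and an l2,1 penalty encourages all fingers to use the same small set of atoms while keeping their own coefficients, capturing commonality without erasing differences. The dictionary, data, and step sizes are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_fingers = 32, 48, 4              # feature dim, dictionary atoms, fingertips

D = rng.normal(size=(d, k))
D /= np.linalg.norm(D, axis=0)           # unit-norm dictionary atoms
X = rng.normal(size=(d, n_fingers))      # one tactile feature vector per fingertip

lam = 0.5
L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
Z = np.zeros((k, n_fingers))             # codes: one column per finger

# Proximal gradient for  min_Z  0.5*||X - D Z||_F^2 + lam * sum_i ||Z[i, :]||_2
for _ in range(300):
    G = Z - (1.0 / L) * (D.T @ (D @ Z - X))           # gradient step on the data term
    row_norms = np.linalg.norm(G, axis=1, keepdims=True)
    shrink = np.maximum(0.0, 1.0 - (lam / L) / np.maximum(row_norms, 1e-12))
    Z = shrink * G                                     # group soft-thresholding of rows

active = np.linalg.norm(Z, axis=1) > 1e-8
print("atoms jointly used by the fingers:", int(active.sum()), "of", k)
```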
(2) Tactile Affective Computing

If visual object recognition determines the noun attributes of objects (such as "stone" or "wood"), then the tactile modality is particularly suited to determining their adjective attributes (such as "hard" or "soft"). "Haptic adjectives" have therefore emerged as a useful tool for modeling tactile affective computing. It is worth noting that a specific target usually carries several different tactile-adjective attributes (see Figure 4, left), and there is often a definite relationship between different haptic adjectives: for example, "hard" and "soft" generally cannot appear at the same time, while "hard" and "solid" are strongly correlated. To this end, we built a correlation frequency matrix of common haptic adjectives (see Figure 4, right), which shows that these correlations are highly consistent with our intuitive understanding. By incorporating these associations into the encoding process, we modeled tactile affective computing. The model can effectively identify the tactile attributes of object materials; in particular, when it perceives attributes such as "hard" and "smooth", the system can automatically infer that the object also has the attribute "cold", effectively establishing a synaesthesia between touch and temperature. The specific algorithm and experimental results can be found in literature [3-4].
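The structured output-associated dictionary learning of [3-4] is not shown here. The toy sketch below only illustrates the final idea of the paragraph: per-adjective scores from some tactile classifier are refined through an adjective correlation matrix, so that strong evidence for "hard" and "smooth" also raises "cold". The adjective list, matrix entries, and scores are all invented for illustration.

```python
import numpy as np

adjectives = ["hard", "soft", "smooth", "cold", "solid"]

# Invented co-occurrence matrix between adjectives (rows normalised below);
# e.g. "hard" correlates with "solid" and "cold", but not with "soft".
C = np.array([
    [1.0, 0.0, 0.3, 0.4, 0.8],   # hard
    [0.0, 1.0, 0.4, 0.1, 0.0],   # soft
    [0.3, 0.4, 1.0, 0.5, 0.2],   # smooth
    [0.4, 0.1, 0.5, 1.0, 0.3],   # cold
    [0.8, 0.0, 0.2, 0.3, 1.0],   # solid
])
C = C / C.sum(axis=1, keepdims=True)

# Raw per-adjective scores from a tactile classifier: strong "hard" and "smooth",
# but no direct evidence for "cold".
raw = np.array([0.9, 0.05, 0.8, 0.0, 0.6])

# One propagation step through the correlation matrix mixes in correlated labels,
# so "cold" is inferred from "hard"/"smooth" even though its raw score was zero.
refined = 0.5 * raw + 0.5 * (C.T @ raw)

for name, r0, r1 in zip(adjectives, raw, refined):
    print(f"{name:7s} raw={r0:.2f} refined={r1:.2f}")
```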

(3) Visual-tactile modal fusion recognition

Visual and tactile modal information differ significantly. On the one hand, they differ in how hard they are to acquire: visual data is usually easy to collect, while tactile data is much harder to obtain, which often leads to a large imbalance in the amount of data between the two modalities. On the other hand, because "what you see is not what you touch", the visual and tactile information collected is often not aimed at the same part of an object and exhibits only a very weak "pairing" characteristic. The fusion perception of visual and tactile information is therefore extremely challenging, and since Banks' analysis of human visual-tactile fusion capabilities, progress toward efficient visual-tactile fusion perception on robots has been extremely slow. To address this problem, we used the associated sparse coding model we developed (see Figure 5) to solve, for the first time, the visual-tactile fusion target recognition problem under weak pairing (see literature [5]).
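The associated sparse coding model of [5] is likewise not reproduced here. As a generic stand-in for correlating two weakly paired modalities, the sketch below uses canonical correlation analysis (CCA) from scikit-learn: only the small paired subset is used to learn a shared latent space, and visual samples (paired or not) are then projected into that space and compared with the projections of the paired tactile samples. All array names and sizes are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Imbalanced, weakly paired toy data: many visual samples, far fewer tactile
# samples, and only a small subset of them is known to be paired.
X_vis = rng.normal(size=(500, 32))        # visual features
X_tac = rng.normal(size=(80, 16))         # tactile features
paired_vis, paired_tac = X_vis[:60], X_tac[:60]   # the known visual-tactile pairs

# Learn a shared latent space from the paired subset only.
cca = CCA(n_components=8)
cca.fit(paired_vis, paired_tac)

# Project samples into the shared space, where the two modalities become
# directly comparable (e.g. for cross-modal matching or joint classification).
Z_vis_all = cca.transform(X_vis)                          # (500, 8)
Z_vis_p, Z_tac_p = cca.transform(paired_vis, paired_tac)  # both views of the pairs

# Toy cross-modal retrieval: nearest visual sample for each paired tactile sample.
dists = np.linalg.norm(Z_vis_all[:, None, :] - Z_tac_p[None, :, :], axis=-1)
print("nearest visual index per tactile sample:", dists.argmin(axis=0)[:5])
```

CCA is only one of many possible choices here; the point is that a shared space learned from even a small number of pairs gives the remaining, unpaired data a common frame in which to be compared.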

04

Outlook

A robot is a complex engineering system, and its multimodal fusion perception must comprehensively consider task characteristics, environment characteristics, and sensor characteristics. Although the value of tactile modalities in robotic systems is now fully recognized, and many domestic institutions, such as Southeast University [6] and Beijing University of Aeronautics and Astronautics [8], have carried out research in this area for many years, progress in robotic tactile perception still lags far behind progress in visual perception. On the other hand, although research on how to integrate the visual and tactile modalities began in the 1980s, progress has been slow. In recent years, the University of Pennsylvania and the University of California, Berkeley [1] in the United States, and the Technical University of Munich [7] in Germany, have been carrying out such research for robot manipulation tasks. In the future, breakthroughs are needed in the cognitive mechanisms, computational models, data sets, and application systems of visual-tactile fusion, in order to comprehensively solve the problems of fusion representation, fusion perception, and fusion learning.

References

[1] V. Chu, I. McMahon, L. Riano, C. McDonald, Q. He, J. Perez-Tejada, M. Arrigo, T. Darrell, K. Kuchenbecker, Robotic learning of haptic adjectives through physical interaction, Robotics and Autonomous Systems, vol. 63, no. 3, pp. 279-292, 2015.

[2] H. Liu, D. Guo, F. Sun, Object recognition using tactile measurements: Kernel sparse coding methods, IEEE Transactions on Instrumentation and Measurement, vol. 65, no. 3, pp. 656-665, 2016.

[3] H. Liu, F. Sun, D. Guo, B. Fang, Structured output-associated dictionary learning for haptic understanding, IEEE Transactions on Systems, Man, and Cybernetics: Systems, in press.

[4] H. Liu, J. Qin, F. Sun, D. Guo, Extreme kernel sparse learning for tactile object recognition, IEEE Transactions on Cybernetics, in press.

[5] H. Liu, Y. Yu, F. Sun, J. Gu, Visual-tactile fusion for object recognition, IEEE Transactions on Automation Science and Engineering, in press.

[6] A. Song, Y. Han, H. Hu, J. Li, A novel texture sensor for fabric texture measurement and classification, IEEE Transactions on Instrumentation and Measurement, vol. 63, no. 7, pp. 1739-1747, 2014.

[7] M. Strese, C. Schuwerk, A. Iepure, E. Steinbach, Multimodal feature-based surface material classification, IEEE Transactions on Haptics, in press.

[8] D. Wang, X. Zhang, Y. Zhang, J. Xiao, Configuration-based optimization for six degree-of-freedom haptic rendering for fine manipulation, IEEE Transactions on Haptics, vol. 6, no. 2, pp. 167-180, 2013.

Author: Huaping Liu, Department of Computer Science and Technology, Tsinghua University

Huaping Liu is an associate professor and doctoral supervisor in the Department of Computer Science and Technology of Tsinghua University, a member of the Youth Working Committee of the Chinese Society of Command and Control, and a senior member of IEEE. His research focuses on multimodal perception, learning, and control technologies for intelligent robots. He was named a "Youth Innovation Star" of the National High-tech Research and Development Program during the Twelfth Five-Year Plan, and won the second prize of the 2016 Innovation Award of the Chinese Society of Command and Control.
"China Command and Control Society" WeChat public account released

Multimodal explanation
A multimodal system integrates two or more sensors with different characteristics (usually called modal sensors) on the same hardware platform. The modal sensors can be chosen freely and include multispectral image sensors, cameras, video sensors, laser scanners, and so on; the choice of modalities is independent of the medium, and the user can select freely from among the many available sensors.
