【Paper Reading】Group Emotion Detection Based on Social Robot Perception

Summary

This blog post refers to the paper Group Emotion Detection Based on Social Robot Perception, published in MDPI Sensors in 2022, and summarizes its main content in order to deepen understanding and memory.

1 Introduction

Social robots and perspectives:

Social robotics is an emerging field in which robots provide services, perform tasks, and interact with people in social environments, which requires more efficient and complex human-robot interaction (HRI) designs. One strategy to improve HRI is to give robots the ability to detect the emotions of the people around them, so that they can plan trajectories, modify their behavior, and generate appropriate interactions based on the analyzed information. Existing studies mainly focus on the detection of group cohesion and the recognition of group emotions; however, these works have not performed the recognition task from a robot-centric perspective.

Social robots are increasingly being incorporated into human spaces such as museums, hospitals, and restaurants to provide services, perform tasks, and interact with people. Social robots are considered physical agents with the ability to act in complex social environments [1]. They must mimic human social cognitive abilities, explore empathetic behaviors, and facilitate interactions between robots and humans [2,3], which in turn requires more efficient and sophisticated human-robot interaction (HRI) designs. HRI must include behavioral adaptation techniques, cognitive architectures, persuasive communication strategies, and empathy [4].

In a group, people can express different emotions, and the robot must process each individual's emotions and summarize them into a group emotion in order to define its actions. In this case, it is necessary to consider the robot's first-person perspective. Cameras mounted on the robot's head or chassis allow the robot to view the world from a first-person perspective; this field of study in computer vision is known as egocentric or first-person vision [11]. This is useful when a social robot interacts with more than one person, for example in social settings such as schools, hospitals, restaurants, and museums. The first-person perspective makes it possible to develop systems that allow robots to adapt to human social groups [12]. However, most existing studies related to group emotion detection are based on third-person cameras [17-21].

2. Related work

1) Group emotion recognition (GER)

① Image preprocessing: face, pose, skeleton, object, scene

② Feature extraction

③ Fusion methods and evaluation metrics: weighted fusion, LSTM, attention mechanisms; mean absolute error (MAE), root mean square error (RMSE), mean squared error (MSE), and precision (the most commonly used); see the formulas after this list

④ Comparative evaluation (from the facial point of view)
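
For reference, the standard definitions of these error metrics, with $y_i$ the ground truth, $\hat{y}_i$ the prediction, and $n$ the number of samples, are:

```latex
\mathrm{MAE}  = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|, \qquad
\mathrm{MSE}  = \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```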

2) Emotion recognition for social robots

In the context of HRI, emotion recognition has emerged as an important strategy for generating the behavior of social and service robots that share spaces with humans. Based on the detected emotion, the robot can change its behavior or navigate to exhibit socially acceptable attitudes.

  • The research presented in [41] reviews 232 papers focusing on emotional intelligence (i.e., how the system processes emotions, the algorithms used, and the use of external information); these three perspectives show trends and progress in improving HRI.
  • The authors mentioned the importance of emotion recognition for HRI in [9].
  • Emotion expression by robots is another area of concern in this field, as shown in [42]. That survey reviewed research papers from 2000 to 2020 focusing on the generation of artificial emotions in robots (stimuli), human recognition of artificial emotions in robots (organisms), and human responses to emotions in robots (responses), as contributions to the field of robot psychology.

The works described in these two surveys [41, 42] show that social robotics is a growing field where aspects of psychology and sociology converge [8].

  • The estimation of individual emotions also affects the behavior that a social robot should adopt. The separation between robot and human may be constrained by the reachable distance, the user's comfort distance, and the user's emotion. Based on these characteristics and the robot's ability to recognize a person's emotion or emotional state, the robot can plan an optimal route [15, 43, 44].

Various sensory capabilities enable the robot to capture a variety of multimedia content (e.g., images, video, speech, text), from which emotions can be detected. In this area, many studies have focused on recognizing facial emotions from images and videos to improve HRI or social navigation.

  • A survey of 101 papers from 2000 to 2020 on the detection of facial emotions in humans and the generation of facial expressions in robots is presented in [45]. The authors compared facial emotion recognition accuracy on in-the-wild images with that on images from controlled settings and found that the accuracy in the former case was much lower than in the latter.

To improve the accuracy of information obtained in the wild (as with social robots in service settings), an emerging strategy is to consider a multi-modal or multi-source approach. Therefore, some works have begun to adopt a multimodal approach, combining several modalities based on information captured by various robotic sensors, such as:

  • Recognition of emotions based on human facial expressions and gait from a Kinect camera, as studied in [46];
  • From cameras and speech systems on robots, some studies combine face and speech [47–53] or body gestures and speech [5] to detect human emotions and improve HRI or navigation accordingly;
  • From text and speech, emotions are identified by converting speech to text and then applying natural language processing (NLP), as described in [54].

However, this topic remains little explored in robotics, as reported in the survey in [55].

Regarding group emotion recognition in social robots, only a few studies have addressed group detection and recognition of individual emotions.

  • For the navigation of social robots, parameters such as the trajectory, position, or velocity of the humans or of the robot itself are considered, but the emotions of multiple people are not [12, 56-58].
  • Some studies considered the influence of robots on a group of people [13, 16, 59], but did not detect group emotions, let alone the emotion of a scenario.

Few studies have proposed methods for group emotion estimation:

  • In [60], a method for estimating group emotions from facial expressions and prosodic information was proposed, building on Bayesian-network-based individual emotion recognition.
  • In [61], a method for estimating group emotions is proposed, using Bayesian networks and individual facial expression recognition combined with environmental conditions (e.g., light, temperature); appropriate stimuli are then generated to induce a target group emotion.
  • In [62], a system for recognizing crowd emotions in entertainment robots based on individual facial expressions is described.
  • In [63], HRI in small groups was investigated, and it was concluded that small groups are complex, adaptive, and dynamic systems. The authors recommend developing robots suitable for group interaction and improving the methods used to measure human and robot behavior in situations involving HRI.

This review of related work does not pretend to be exhaustive, but it reveals some limitations and challenges.

3. Method

First, frames are captured by the robot's front-facing camera as the robot navigates the indoor space. In each frame, all faces are detected with the Viola–Jones algorithm and stored in a vector. For each stored face, the area of the face is calculated, features are extracted, and the individual emotion of each face in the frame is estimated. Then, the frame emotion is determined by fusing the individual emotions; if there is only one person in the frame, the frame emotion is that person's emotion.
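
As a rough sketch of this per-frame flow (not the authors' code), the skeleton below strings the stages together in Python with OpenCV; `detect_faces`, `estimate_emotion`, and `fuse_emotions` are placeholders for the steps detailed in the following subsections.

```python
import cv2

def run_pipeline(video_source, detect_faces, estimate_emotion, fuse_emotions):
    """Per-frame group-emotion flow: capture -> face detection ->
    individual emotion estimation -> fusion into a frame emotion."""
    capture = cv2.VideoCapture(video_source)   # robot's front-facing camera or a recorded video
    frame_emotions, face_areas = [], []

    while True:
        ok, frame = capture.read()
        if not ok:
            break
        boxes = detect_faces(frame)                            # list of (x, y, w, h) per face
        face_areas.append([w * h for (_, _, w, h) in boxes])   # areas, later used for scene detection

        emotions = [estimate_emotion(frame[y:y + h, x:x + w])  # one emotion per face crop
                    for (x, y, w, h) in boxes]
        if len(emotions) == 1:
            frame_emotions.append(emotions[0])             # one person: frame emotion = that person's
        elif emotions:
            frame_emotions.append(fuse_emotions(emotions)) # several people: fuse individual emotions
        else:
            frame_emotions.append(None)                    # no face detected in this frame

    capture.release()
    return frame_emotions, face_areas
```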

1) Face detection

A Viola–Jones classifier is used as the face detector, which stores the upper-left coordinates (x, y), width (w), and height (h) of each face. Using the values of w and h, the area (w × h) of each face is calculated, and this information is used for scene detection. In this case, the area of the faces captured by the robot increases as it moves toward a group of people and decreases as it moves away. With this information, the constraints used to determine the scenario are established.
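
A minimal version of this detection step with OpenCV's Haar-cascade (Viola–Jones) implementation could look as follows; the `scaleFactor` and `minNeighbors` values are illustrative assumptions, not taken from the paper.

```python
import cv2

# Pre-trained frontal-face Haar cascade shipped with OpenCV (Viola–Jones detector)
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame):
    """Return one (x, y, w, h) box per detected face in a BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return list(boxes)

def largest_face_area(boxes):
    """Face area w * h; the largest per frame is the quantity used for scene detection."""
    return max((w * h for (_, _, w, h) in boxes), default=0)
```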

2) Feature extraction

The VGGFace neural network is used, pre-trained on 2.6 million images. The image feature vector is taken from the flatten layer, where the multidimensional data produced by the convolutional layers is converted into a one-dimensional vector. According to the configuration of this neural network, the input image size must be 224 × 224 pixels.
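
As an illustrative sketch of this step, the snippet below builds a feature extractor that stops at the flatten layer. It uses Keras' ImageNet-trained VGG16 purely as a structural stand-in, since the paper uses VGGFace weights (trained on 2.6 million face images), which are distributed separately.

```python
import tensorflow as tf

# Structural stand-in for VGGFace: same topology (conv blocks + flatten + fc layers),
# but ImageNet weights; the paper uses weights pre-trained on 2.6 M face images.
backbone = tf.keras.applications.VGG16(weights="imagenet", include_top=True)

# Feature extractor that outputs the flatten layer (a 25088-d vector for a 224x224 input)
feature_extractor = tf.keras.Model(
    inputs=backbone.input,
    outputs=backbone.get_layer("flatten").output)

def extract_features(face_bgr):
    """Resize a face crop to 224x224 and return its flatten-layer feature vector."""
    face = tf.image.resize(face_bgr[..., ::-1], (224, 224))           # BGR -> RGB, 224x224
    face = tf.keras.applications.vgg16.preprocess_input(face[None])   # add batch dim, normalize
    return feature_extractor(face).numpy().ravel()
```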

3) Personal emotion estimation

After training, VGGFace can recognize 2622 classes. However, in this case there are not 2622 emotions to classify; therefore, the fully connected layers of the VGGFace model are modified, as shown in Table 2. Layers fc6 and fc7 have 512 nodes, and layer fc8 has 6 nodes, representing the emotions to be classified (happiness, sadness, anger, fear, disgust, surprise). In addition, dropout of 0.5 is added to reduce overfitting of the neural network (layers d1 and d2). With this setup, only the fully connected layers of the VGGFace neural network are trained on the image dataset.
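
A hedged sketch of this modified head, with the convolutional base frozen so that only the fully connected layers are trained, is shown below; the optimizer, loss, and the ImageNet stand-in weights are assumptions, while the layer sizes, dropout, and six output classes follow the description above.

```python
import tensorflow as tf

EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise"]

# Backbone up to the flatten layer (ImageNet VGG16 as a stand-in for VGGFace weights)
backbone = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
flatten = backbone.get_layer("flatten").output

# New fully connected head: fc6 and fc7 with 512 nodes, dropout 0.5 (d1, d2), fc8 with 6 nodes
fc6 = tf.keras.layers.Dense(512, activation="relu", name="fc6")(flatten)
d1 = tf.keras.layers.Dropout(0.5, name="d1")(fc6)
fc7 = tf.keras.layers.Dense(512, activation="relu", name="fc7")(d1)
d2 = tf.keras.layers.Dropout(0.5, name="d2")(fc7)
fc8 = tf.keras.layers.Dense(len(EMOTIONS), activation="softmax", name="fc8")(d2)

model = tf.keras.Model(inputs=backbone.input, outputs=fc8)
for layer in backbone.layers:          # freeze the pre-trained layers: only the new head is trained
    layer.trainable = False

model.compile(optimizer="adam",        # optimizer and loss are assumptions, not stated in the blog
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```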

4) Emotion estimation in each frame, scene and video

Surprise is considered a neutral emotion because it can be positive or negative.
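
The blog does not spell out the fusion rule itself; as a hedged illustration, using a standard valence grouping (an assumption here, except for surprise, which the text labels neutral), one simple possibility is to map each person's emotion to a valence and take the most frequent valence as the frame emotion:

```python
from collections import Counter

# Assumed valence grouping; the text only states that surprise is treated as neutral.
VALENCE = {
    "happiness": "positive",
    "sadness": "negative",
    "anger": "negative",
    "fear": "negative",
    "disgust": "negative",
    "surprise": "neutral",
}

def fuse_emotions(individual_emotions):
    """Illustrative fusion rule (not necessarily the paper's): map each person's
    emotion to a valence and return the most frequent valence in the frame."""
    valences = [VALENCE[e] for e in individual_emotions]
    return Counter(valences).most_common(1)[0][0]

# Example: two negative faces and one positive face -> frame emotion "negative"
print(fuse_emotions(["anger", "sadness", "happiness"]))
```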

5) Scene detection (scene emotion)

The extent of a detected scene is determined by the area of the recognized faces in the frames. As the robot approaches or moves away from a group of people, the face area increases or decreases accordingly. In this approach, the robot analyzes blocks of b frames (BOFs) to differentiate scenes. An example is shown in Figure 2, where b = 10: every 10 frames form a BOF, and the robot extracts the largest face area in each frame of that BOF (blue dots in Figure 2). Since the largest face area keeps increasing through BOF1, BOF2, BOF3, and BOF4, all of these frames belong to the first scene, which ends at frame 32 in BOF4 (green line in Figure 2); from that frame until the end of BOF6, the frames belong to the second scene. Thus, a scene consists of all the frames captured by the robot while it approaches a group (the face area increases); when the robot sees faces moving farther away (the face area decreases), this is treated as another scene.
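
As a rough illustration of this block-of-frames (BOF) logic, the sketch below splits a sequence of per-frame largest-face areas into scenes whenever the trend between consecutive blocks of b frames flips from growing to shrinking or vice versa. The exact boundary rule in the paper (which can fall inside a BOF, e.g. at frame 32) may differ; this version only cuts at block boundaries.

```python
def split_scenes(largest_face_areas, b=10):
    """Group frame indices into scenes from the per-frame largest-face areas.

    A new scene starts when the maximum face area of a block of b frames (a BOF)
    stops growing and starts shrinking, or vice versa, i.e. when the robot
    switches between approaching a group and moving away from it.
    """
    # Maximum face area within each block of b frames
    block_max = [max(largest_face_areas[i:i + b])
                 for i in range(0, len(largest_face_areas), b)]

    scenes, start, growing = [], 0, None
    for k in range(1, len(block_max)):
        now_growing = block_max[k] >= block_max[k - 1]
        if growing is not None and now_growing != growing:
            scenes.append((start, k * b - 1))   # close the current scene at the block boundary
            start = k * b
        growing = now_growing
    scenes.append((start, len(largest_face_areas) - 1))
    return scenes

# Example: the face area grows for 30 frames (approach) and then shrinks (move away)
areas = [100 + 5 * i for i in range(30)] + [250 - 5 * i for i in range(30)]
print(split_scenes(areas, b=10))      # two scenes: [(0, 39), (40, 59)]
```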

4. Dataset generation

With the help of Pepper, a social robot simulated in ROS and Gazebo, the authors created a dataset, Dataset-in-ROS, containing images for training and videos for scene detection.

5. Simulation and Results

1) Simulation environment

2) Individual emotion recognition (IER)

3) The emotion of the video (museums, cafeterias)

4) Simulation in ROS/Gazebo (Emoción de una Escena en ROS - YouTube)

6. Discussion

1) Through the recognition of groups of people and their emotions, the concept of "scene emotion" is proposed. This concept can be applied in contexts related to group behavior, for example, identifying the emotions that an artwork (in a museum), a speaker (at a conference), animals (at a zoo), or food (in a restaurant) produces in a group of people.

2) Why would a robot want to detect the emotions of a group of people?

An obvious application is designing robot behavior and improving HRI to make robots more socially acceptable. For example, if the robot recognizes negative emotions in the group, it moves away to avoid conflict or responds submissively; conversely, if the detected emotion is positive, the robot can approach and talk to the group.

Another application example is a robot tasked with monitoring and recording the emotions of a group of people attending a class, a conference, a music performance, a museum, etc. This goes beyond recognizing the emotion of a group: it recognizes the emotion of a scenario defined for the group. This information can be used not only to define the robot's behavior and improve HRI, but also for post-analysis.

Origin blog.csdn.net/qq_44930244/article/details/130910644