[Paper Notes] Combining Reinforcement Learning and Rule-based Method to Manipulate Objects in Clutter

Abstract

To reduce the complexity of strategy learning, we propose a framework for robots to pick up the objects in clutter on table based on deep reinforcement learning and rule-based method.

Deep reinforcement learning + a rule-based method.

To manipulate the objects on table, we mainly divide the robot actions into two categories: one is pushing that uses the reinforcement learning method, while the other one is grasping that is inferred by image morphological processing.

Object repositioning (pushing): reinforcement learning; grasping: image morphological processing.

The pushing action can separate stacked objects and create a robust grasp point for the following grasp.

The push action breaks up piles of objects and creates suitable grasp points for grasping.

With reinforcement learning, a reward is returned once a suitable point has been found.

Taking images as input, our framework can keep a high grasp rate with low computational complexity, which makes it achieve clutter clearing quickly.

Taking images as input, the proposed framework maintains a high grasp rate with low computational complexity, so the clutter is cleared quickly.

Introduction

Especially for grasping, few positive samples and diverse objects mean that hundreds of hours of data collection are inescapable.

For grasping in particular, scarce positive samples and diverse objects make hundreds of hours of data collection unavoidable.

This kind of problem is hard to define manually and doesn’t require a very precise solution, hence it is suitable for reinforcement learning to deal with this problem.

This kind of problem is hard to specify by hand and does not need a highly precise solution, so reinforcement learning is a good fit.

Compared with their work, we try to employ the reinforcement learning network with continuous output to remedy this issue.

Compared with prior work, the authors use a reinforcement learning network with continuous output to address this issue.

We find that the grasp algorithm based on supervised learning is mostly trained on Cornell Grasping Dataset or Jacquard Dataset, whose depth image is strikingly different from the depth image in simulation because of different shooting angles.

Supervised grasp algorithms are mostly trained on the Cornell Grasping Dataset or the Jacquard Dataset, whose depth images differ markedly from the simulated depth images because of different shooting angles.

We make use of the twin delayed deep deterministic policy gradient [6] to train our policy, which determines where to start pushing and the pushing direction according to the current image.

TD3 (twin delayed DDPG) is used to learn the pushing policy.

The grasp detection is processed with a rule-based method, mainly based on the recognition of the minimum bounding convex hull and minimum bounding rectangle of connected regions.

Grasp detection uses a rule-based method, mainly the minimum bounding convex hull and minimum bounding rectangle of connected regions.

The grasp detecting algorithm determines whether the region is graspable, the grasp center, and the grasp orientation.

The grasp detection algorithm computes graspability, the grasp center, and the grasp orientation.

Yuan et al. learn the nonprehensile rearrangement based on deep Q-learning [1], pushing an object to the predefined goal pose in an environment with obstacles.

Yuan et al. learn nonprehensile rearrangement with deep Q-learning [1], pushing an object to a predefined goal pose in an environment with obstacles.

Nair et al. utilize variational auto-encoder to encode the input image [15], calculate the reward based on the Euclidean distance of encoded vector, and verify this algorithm in the experiment of reaching and pushing.

Nair et al. use a variational auto-encoder to encode the input image [15], compute the reward from the Euclidean distance between encoded vectors, and validate the algorithm on reaching and pushing tasks.

The large-scale exploration space and delayed reward make it hard to get training data of high quality, and thus lots of time is needed to collect data.

The large exploration space plus delayed rewards make high-quality training data hard to obtain, so a lot of time goes into data collection.

In [20] they achieve pixel-wise grasp rectangle detection by using the fully convolutional network like U-net to predict rectangle for every pixel. Without fully connected layers, their network is significantly smaller than other networks.

A fully convolutional network performs the detection; without fully connected layers the network is much smaller.

In the face of cleaning clustered objects that needs to combine pushing and grasping, we are inspired by the algorithm that maps the image to the high-level actions instead of continuous actions of low level based on the mapping relation between image and workspace [9] [22].

For clearing clutter, which requires combining pushing and grasping, the authors are inspired by algorithms that map the image to high-level actions rather than low-level continuous actions, based on the mapping between image and workspace [9] [22].

Pushing and Grasping

Pushing

We employ the Twin Delayed DDPG (TD3) to learn the policy, which consists of one policy network, two critic networks, and their corresponding target networks.

The policy is:
a_t = \pi_{\phi}(s_t)
The critic loss is (TD3's clipped double-Q target takes the minimum over the two target critics):
loss = \left( R(s_t, s_{t+1}) + \gamma \min_{i=1,2} Q_{\theta_i'}(s_{t+1}, a') - Q_{\theta_i}(s_t, a_t) \right)^2
The policy update is:
\nabla_{\phi} J(\phi) = \nabla_{a} Q_{\theta_1}(s, a)\big|_{a=\pi_{\phi}(s)} \, \nabla_{\phi} \pi_{\phi}(s)
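As a rough, non-authoritative sketch of how these updates could be implemented (PyTorch; the network classes, replay-buffer batch format, and hyperparameters are assumptions, not the authors' released code):

```python
# Minimal TD3-style update sketch; all names and values are illustrative assumptions.
import torch
import torch.nn.functional as F

def td3_update(policy, policy_target, critic1, critic2, critic1_target, critic2_target,
               policy_opt, critic_opt, batch, gamma=0.5, tau=0.01,
               policy_noise=0.1, noise_clip=0.3, update_policy=True):
    # critic_opt is assumed to optimize the parameters of both critics.
    state, action, reward, next_state = batch  # tensors sampled from a replay buffer

    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (policy_target(next_state) + noise).clamp(-1.0, 1.0)
        # Clipped double-Q: take the minimum of the two target critics.
        target_q = torch.min(critic1_target(next_state, next_action),
                             critic2_target(next_state, next_action))
        y = reward + gamma * target_q

    # Both critics regress toward the shared target y.
    critic_loss = F.mse_loss(critic1(state, action), y) + F.mse_loss(critic2(state, action), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    if update_policy:  # delayed policy update
        policy_loss = -critic1(state, policy(state)).mean()
        policy_opt.zero_grad()
        policy_loss.backward()
        policy_opt.step()

        # Soft (Polyak) update of all target networks.
        for net, target in [(policy, policy_target), (critic1, critic1_target),
                            (critic2, critic2_target)]:
            for p, p_t in zip(net.parameters(), target.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
```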
In this work, only depth image is used as the state that is captured by the camera over the table. The pixel plane is parallel to table surface so that pixel coordinate and table planimetric position are linearly proportional.

The depth image captured by the overhead camera serves as the state.

The camera's image plane is parallel to the table surface, so pixel coordinates map linearly to positions on the table.
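Because the image plane is parallel to the table, converting a pixel location to a table-plane position is just a linear scaling. A minimal sketch, assuming an 84×84 image and illustrative workspace bounds (both values are assumptions, not from the paper):

```python
import numpy as np

# Assumed workspace bounds (metres) and image size; not values from the paper.
TABLE_X = (-0.3, 0.3)
TABLE_Y = (-0.3, 0.3)
IMG_SIZE = 84

def pixel_to_table(px, py):
    """Map pixel coordinates to table-plane coordinates with a linear scaling."""
    x = TABLE_X[0] + (px / (IMG_SIZE - 1)) * (TABLE_X[1] - TABLE_X[0])
    y = TABLE_Y[0] + (py / (IMG_SIZE - 1)) * (TABLE_Y[1] - TABLE_Y[0])
    return np.array([x, y])
```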

The policy network outputs an action with four dimensions (a1, a2, a3, a4), each limited to (−1, 1). They represent the x and y coordinates on the table surface, which side to push toward, and the pushing angle, respectively.

Specifically, (a1, a2) decides the position where to start pushing, and (a3, a4) decides the pushing orientation.

Of the four action dimensions, the first two are the table coordinates where the push starts.

The third selects which side to push toward and the fourth the push angle, so the last two together determine the push direction.

To avoid pushing objects off the table, we limit the area where a push can start to 0.6 times the length of the table surface.

To avoid pushing objects off the table, the region where a push may start is limited to 0.6 times the table length.

Although cosine-sine encoder is widely used in supervised learning [20] to represent the angle at the circumference, we found it hard to master the many-to-one mapping for reinforcement learning in the absence of direct oversight of the target.

The cosine-sine encoding is common in supervised learning, but for reinforcement learning the many-to-one mapping is hard to learn without direct supervision of the target angle.
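One plausible way to decode the four-dimensional action into a push start point and direction is sketched below; the exact mapping of (a3, a4) to an angle is not spelled out in these notes, so the half-circle convention and table size here are assumptions:

```python
import numpy as np

PUSH_REGION_SCALE = 0.6   # pushes may only start within 0.6x the table length
TABLE_HALF = 0.3          # assumed half-length of the table, in metres

def decode_action(a):
    """Turn a network output a = (a1, a2, a3, a4) in (-1, 1)^4 into a push command."""
    a1, a2, a3, a4 = a
    # (a1, a2): push start point, restricted to the central 0.6x region of the table.
    start_xy = np.array([a1, a2]) * TABLE_HALF * PUSH_REGION_SCALE
    # (a3, a4): a3 selects the side (half circle), a4 the angle within it -- one
    # plausible way to avoid the many-to-one cosine-sine encoding.
    angle = 0.5 * np.pi * a4 + (np.pi if a3 < 0 else 0.0)
    direction = np.array([np.cos(angle), np.sin(angle)])
    return start_xy, direction
```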

The robot end-effector reaches the position that is 30cm over the pushing start point decided by (a1, a2). The robot end-effector moves straight down until it contacts with objects or it is 1.5cm above the table surface. The robot end-effector pushes a constant distance in a given orientation decided by (a3, a4).

The end-effector first moves to a point 30 cm above the start position given by (a1, a2), descends until it touches an object or is 1.5 cm above the table, then pushes a fixed distance in the direction given by (a3, a4).
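A sketch of this push primitive; the robot-interface methods (move_to, descend_until_contact, retract) and the push distance are hypothetical placeholders, not an actual API:

```python
import numpy as np

APPROACH_HEIGHT = 0.30   # 30 cm above the push start point
CONTACT_HEIGHT = 0.015   # stop descending 1.5 cm above the table
PUSH_DISTANCE = 0.10     # assumed fixed push length, in metres

def execute_push(robot, start_xy, direction):
    """Approach from above, descend, then push a fixed distance (hypothetical robot API)."""
    robot.move_to([*start_xy, APPROACH_HEIGHT])              # 1. hover over the start point
    robot.descend_until_contact(min_height=CONTACT_HEIGHT)   # 2. go down to the object/table
    target_xy = np.asarray(start_xy) + PUSH_DISTANCE * np.asarray(direction)
    robot.move_to([*target_xy, CONTACT_HEIGHT])              # 3. push along the chosen direction
    robot.retract()                                          # 4. move away so the camera can see
```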

If a grasp can be performed after the push action, the reward R (st, st+1) = 1.

If the push action results in enough change of the clustered object positions which can be judged by calculating the difference between depth images before and after pushing, the reward R (st, st+1) = 0.5.

If the push makes a grasp possible, the reward is 1; if it only shifts the objects noticeably without enabling a grasp, the reward is 0.5.
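A minimal sketch of this reward, assuming a grasp-detector callback and an illustrative threshold on the summed depth change (both are assumptions):

```python
import numpy as np

CHANGE_THRESHOLD = 300.0  # assumed threshold on total depth-image change

def push_reward(depth_before, depth_after, grasp_detector):
    """Return 1.0 if a grasp is now available, 0.5 if the push moved objects enough, else 0."""
    if grasp_detector(depth_after):                       # any valid grasp after the push?
        return 1.0
    change = np.abs(depth_after.astype(np.float32) - depth_before.astype(np.float32)).sum()
    if change > CHANGE_THRESHOLD:                         # scene changed noticeably
        return 0.5
    return 0.0
```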

Both the policy network and critic network have the same convolutional layers to extract image feature.

The policy and critic networks share the same convolutional architecture for feature extraction.

SELU works better as the activation function; batch normalization keeps the back-propagated gradients well scaled.
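A sketch of such a shared convolutional trunk with SELU activations and batch normalization; the layer sizes are assumptions since the notes do not give the exact architecture:

```python
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Shared convolutional trunk for the policy and critic networks (illustrative sizes)."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),  # 84x84 -> 20x20
            nn.BatchNorm2d(32),
            nn.SELU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),           # 20x20 -> 9x9
            nn.BatchNorm2d(64),
            nn.SELU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),           # 9x9 -> 7x7
            nn.BatchNorm2d(64),
            nn.SELU(),
            nn.Flatten(),
        )

    def forward(self, x):
        return self.net(x)
```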

Grasping

grasp rectangle g:
g = \{\, x,\ y,\ \theta,\ h,\ w \,\}
(x, y) is the grasp center, θ the gripper angle, w the gripper opening width, and h the gripper (jaw) thickness.

We start by making a binary image to separate the objects in the picture from the background based on depth image. Due to the ideal simulation environment, the pixel intensity of objects in an image is always greater than that of desktop background. It is simple to make binary processing with a fixed threshold. Then, we detect a grasp configuration for every connected region in binary image and make up a grasp list. Every element in this grasp list is a grasp configuration (x, y, θ, w).

The original method works well for a single object, but when many objects are piled together the gripper cannot fit in between them.

Therefore each candidate grasp is checked for validity, where τ is a hyperparameter:
is\_valid = \begin{cases} \text{True}, & I(\text{center point}) < I(\text{end point}) - \tau \\ \text{False}, & \text{otherwise} \end{cases}
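A rough OpenCV-style sketch of the grasp detector: binarize the depth image with a fixed threshold, take each connected region, fit its minimum-area rectangle, and keep the candidate only if the is_valid rule above holds at the finger end points. The thresholds, gripper opening, and angle convention are assumptions, and the sign of the intensity check depends on how the depth image is encoded:

```python
import cv2
import numpy as np

BACKGROUND_LEVEL = 50    # assumed fixed threshold separating objects from the table
TAU = 10                 # assumed validity margin from the is_valid rule
GRIPPER_HALF_OPEN = 12   # assumed half of the gripper opening, in pixels

def detect_grasps(depth):
    """Return (x, y, theta, w) grasp candidates from an 8-bit depth image."""
    # 1. Fixed-threshold binarization (objects vs. table background).
    _, binary = cv2.threshold(depth, BACKGROUND_LEVEL, 255, cv2.THRESH_BINARY)
    # 2. One candidate per connected region.
    n_labels, labels = cv2.connectedComponents(binary.astype(np.uint8))
    grasps = []
    for label in range(1, n_labels):
        pts = np.column_stack(np.where(labels == label)[::-1]).astype(np.float32)
        (cx, cy), (w, h), angle = cv2.minAreaRect(pts)   # minimum bounding rectangle
        theta = np.deg2rad(angle)                        # convention depends on OpenCV version
        if w < h:
            theta += np.pi / 2                           # grasp across the shorter side
        # 3. Validity: both finger end points must satisfy I(center) < I(end) - tau.
        valid = True
        for s in (-1.0, 1.0):
            ex = int(np.clip(round(cx + s * GRIPPER_HALF_OPEN * np.cos(theta)), 0, depth.shape[1] - 1))
            ey = int(np.clip(round(cy + s * GRIPPER_HALF_OPEN * np.sin(theta)), 0, depth.shape[0] - 1))
            if not int(depth[int(cy), int(cx)]) < int(depth[ey, ex]) - TAU:
                valid = False
        if valid:
            grasps.append((cx, cy, theta, min(w, h)))
    return grasps
```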

Experiment

After a grasp or a push, the robot arm is reset to a position out of the camera's field of view. Then, the camera captures an image for the next detection.

After each grasp or push, the arm moves aside so the depth camera can take the next picture.

We perform the experiment in a simulation environment called MuJoCo. The module is built with a toolkit called robosuite [24], which contains a modularized design of APIs for building new environments.

MuJoCo provides the simulation for reinforcement learning; robosuite is used to build the environment.

Experiment settings (item: value):
Input: 84×84 pixels
Termination conditions: (1) all objects on the table have been taken away; (2) the push action has been performed 15 times
CPU: Intel Core i7-8700
GPU: NVIDIA 2080Ti
Optimizer: Adam
Learning rate: 0.0003
Batch size: 128
Target network update delay: 0.01
Exploration noise: Gaussian noise (not applied to a3)

Grasp algorithm verification

Therefore, we reset the environment if no grasp is detected in the condition of multiple objects.

If no grasp is detected, the environment is reset.

The original algorithm handles a single object very well but struggles once several objects are clustered.

The main reason for grasp failure presently is that two objects next to each other are recognized as one object, and the grasp center is on where they connect.

Main failure mode: two adjacent objects are recognized as one, so the grasp center lands on the gap where they meet.

Fix: push the two objects apart first, after which the grasp succeeds.

Reinforcement Learning Training

Therefore, we evaluate pushing performance for 50 episodes after 300 episodes of training.

Evaluation is over the 50 episodes that follow 300 episodes of training.

And we can see that the discount factor γ has a great impact on pushing performance.

The discount factor γ has a large effect on pushing performance.

The paper uses γ of around 0.5.

In this kind of task, the pushing action should have an immediate positive effect that helps the grasp, and consecutive pushes are only weakly related. Therefore, a small γ reduces the weight on future states and gives better performance.

Why γ should be small: this task values the immediate effect of each push, i.e. the policy can afford to be short-sighted; since consecutive pushes are only weakly coupled, long-horizon returns matter little.
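As a quick illustration of why a small γ makes the policy appropriately short-sighted, consider the discounted return

G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}

With γ = 0.5 a reward earned two pushes later is weighted by γ² = 0.25, whereas with γ = 0.99 it would still be weighted by about 0.98, so the small γ lets the critic focus on the immediate effect of the current push.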

From the perspective of input type, depth image is the main factor affecting performance.

Depth input outperforms RGB input and is the main factor in performance.

The effect is faster convergence.

Clutter clearing

Objects are placed by hand, packed tightly against each other.

The main reason for unsuccessful clearing is that two objects stay in a corner, and the robot can’t separate them by pushing because of the limitation of the push working range.

Clearing fails when two adjacent objects sit in a corner: because of the limited push working range, the robot cannot separate them.

Computing the push action from an image: 1 ms
Detecting the best grasp: 5 ms

Conclusion

In the future, we will try to further improve the grasp rate, transfer this framework to real Baxter robot, and test it with more objects of different shapes.

Future work: improve the grasp rate further and transfer the framework to a real Baxter robot for tests with more objects of different shapes.
