Deep Reinforcement Learning with Smooth Policy Update for Robotic Cloth Manipulation

Yoshihisa Tsurumine


Deep Reinforcement Learning (DRL), which can learn complex policies from high-dimensional observations such as images, has been successfully applied to various tasks. It is therefore a promising approach for robots learning daily activities such as washing and folding clothes, cooking, and cleaning. However, only a few studies have applied DRL to real robots. In this thesis, we apply DRL to cloth manipulation, a representative daily task, and focus on two main problems: (1) generating a huge number of samples on a real robot system is arduous because of the high sampling cost, and (2) learning requires a reward function to evaluate selected behaviors, but designing rewards for cloth, whose shape deforms flexibly, is challenging.

The first motivation is to apply sample-efficient DRL to real robotic cloth manipulation tasks. Previous value function-based DRL stabilizes value function approximation with Deep Neural Networks (DNNs) by learning from a huge number of samples. In this thesis, we instead employ a smooth policy update to enable stable learning from a small number of samples. We propose two sample-efficient DRL algorithms: Deep P-Network (DPN) and Dueling Deep P-Network (Dueling DPN). Both are value function-based DRL methods with a smooth policy update that use the Kullback-Leibler (KL) divergence to penalize overly large policy updates. Dueling DPN adds a DNN structure suited to approximating value functions and improves sample efficiency in tasks with large action spaces.
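As a minimal sketch of the smooth policy update idea (our illustration, not the exact DPN implementation), the KL-penalized objective has a closed-form solution in the tabular case: the new policy reweights the previous policy by exponentiated action values, so a single update can only move the policy a bounded distance from its predecessor. The function name and the temperature parameter `eta` below are our own choices.

```python
import numpy as np

def smooth_policy_update(prev_policy, q_values, eta=1.0):
    """KL-regularized (smooth) policy update, tabular sketch.

    Solves  max_pi  <pi, q> - (1/eta) * KL(pi || prev_policy)
    in closed form: pi(a) proportional to prev_policy(a) * exp(eta * q(a)).

    prev_policy : (num_actions,) previous action probabilities
    q_values    : (num_actions,) action-value estimates for one state
    eta         : inverse temperature; smaller eta -> smaller policy change
    """
    logits = np.log(prev_policy + 1e-12) + eta * q_values
    logits -= logits.max()  # subtract max for numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

# Example: even a large Q-value advantage shifts the policy only gradually.
prev = np.array([0.5, 0.3, 0.2])
q = np.array([0.0, 1.0, 0.0])
print(smooth_policy_update(prev, q, eta=0.5))
```

With a small `eta`, the updated policy stays close to the previous one, which is what keeps learning stable when value estimates are noisy due to few samples.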

The second motivation is to learn a cloth manipulation policy without explicit reward function design. This thesis explores Generative Adversarial Imitation Learning (GAIL) for robotic cloth manipulation tasks, which allows an agent to learn near-optimal behaviors from expert demonstrations and self-exploration without an explicit reward function. However, performance on real robots remains poor because it is difficult for humans to collect appropriate expert samples in the robot's state-action space. In this thesis, we therefore focus on target state labels, which humans can present adequately. GAIL with target state labels improves policy learning by estimating higher rewards for target states. We propose Double Discriminator P-Generative Adversarial Imitation Learning (DDP-GAIL), which learns a policy from a learned reward composed of an expert discriminator and a target discriminator. DDP-GAIL employs DPN for policy updates to enable stable learning from the complex reward function produced by the two discriminators.
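To make the two-discriminator reward concrete, the sketch below shows one plausible way to combine an expert discriminator over state-action pairs with a target discriminator over states into a single learning reward, using the standard GAIL reward form -log(1 - D). The network sizes, the weighting factor `lam`, and the exact combination rule are our assumptions, not the thesis's precise formulation.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Small MLP discriminator: outputs probability of 'expert' or 'target'."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def combined_reward(d_expert, d_target, state, action, lam=0.5, eps=1e-8):
    """Hypothetical DDP-GAIL-style reward (assumed combination).

    d_expert scores (state, action) pairs against expert demonstrations;
    d_target scores states against human-provided target state labels.
    Both enter through the standard GAIL reward -log(1 - D); `lam`
    weights the target-state term.
    """
    sa = torch.cat([state, action], dim=-1)
    r_expert = -torch.log(1.0 - d_expert(sa) + eps)
    r_target = -torch.log(1.0 - d_target(state) + eps)
    return r_expert + lam * r_target
```

Because this reward is the sum of two adversarially trained terms, it is non-stationary and noisy, which is why a conservative policy-update rule such as DPN's smooth update is used downstream.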

The proposed methods were first investigated on a simulated robot-arm reaching task and compared with previous methods in performance and sample efficiency. In real robot experiments, we applied them to two cloth manipulation tasks: 1) flipping a handkerchief and 2) folding clothes. The proposed methods were evaluated on the handkerchief-flipping task with a hand-engineered reward function and investigated on clothes-folding tasks similar to daily tasks.