Practical Model-free Reinforcement Learning in Complex Robot Systems with High Dimensional States

Yunduan Cui (1461206)


Reinforcement learning, a learning paradigm that has shown great promise in recent years, learns good policies by interacting with an unknown environment and is therefore well suited to controlling robots that must explore in challenging tasks. On the other hand, the two main families of reinforcement learning algorithms, the value function approach and policy search, remain impractical for model-free learning in complex robot systems due to several limitations. The value function approach learns a value function over all states and actions without any prior knowledge, but it suffers from an unstable learning process when real-world samples are insufficient and from intractable computational complexity in high-dimensional systems. Policy search efficiently finds an optimal solution in a local area, but it is sensitive to the initialization of a well-parameterized policy, which requires some knowledge of the task and model.

The motivation of this thesis is to explore practical model-free reinforcement learning algorithms for controlling complex robot systems. Our main idea is to combine the advantages of the value function approach and policy search. We propose a new approach that focuses on learning the global value function from the local sample space defined by the current policy. The Kullback-Leibler divergence is employed to penalize overly large policy updates so that samples are generated in continuous, local regions. Other machine learning methods are then applied to these local samples to approximate the value function locally. This framework overcomes the high sampling cost and intractable computational complexity without requiring any prior knowledge of the model or task.
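As a concrete sketch of this idea, consider a standard form of KL-regularized policy iteration (shown here for exposition rather than as the exact update used in the thesis; $Q_t$ denotes the current action-value estimate and $\eta$ an inverse-temperature parameter, both introduced only for this illustration):

\[
\pi_{t+1} = \arg\max_{\pi} \; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ Q_t(s,a) \right] \;-\; \frac{1}{\eta}\, \mathrm{KL}\!\left( \pi(\cdot \mid s) \,\middle\|\, \pi_t(\cdot \mid s) \right),
\qquad
\pi_{t+1}(a \mid s) \;\propto\; \pi_t(a \mid s)\, \exp\!\left( \eta\, Q_t(s,a) \right).
\]

The closed-form solution makes the role of the KL term explicit: each new policy is a Boltzmann reweighting of the previous one, so consecutive policies, and hence the states they visit, stay close together, which is what keeps the generated samples confined to a continuous, local region.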

Two algorithms are proposed as instances of this framework: Local Update Dynamic Policy Programming (LUDPP) and Kernel Dynamic Policy Programming (KDPP). We first investigate the learning performance of the proposed methods in a range of simulation tasks, including pendulum swing-up and multi-DOF manipulator reaching, where the proposed algorithms significantly outperform conventional algorithms in high-dimensional cases. Both LUDPP and KDPP are then successfully applied to control a Pneumatic Artificial Muscle (PAM) driven robotic hand, a high-dimensional system, on a finger position control task and a bottle cap unscrewing task, respectively, given limited samples and ordinary computing resources. All results indicate the practicality of the proposed framework for controlling complex robot systems.