Colloquium B Presentations

Date: Wednesday, September 21, 1st period (9:20-10:50)

Venue: L1

Chair: 織田泰彰

岡田　颯太	M, 2nd presentation	Robot Learning	松原　崇充, 和田　隆広, 杉本　謙二, 小林　泰介
title: Pneumatic Artificial Muscle Control Robust to Hysteresis and Individual Differences by Recurrent Distributed Reinforcement Learning

title (Japanese): リカレント分散強化学習によるヒステリシスと個体差に頑健な空気圧人工筋の制御

abstract: In recent years, pneumatic artificial muscles, which offer an excellent power-to-weight ratio, have been expected to reduce the cost and weight of robots. However, they are difficult to control because their input-output relationship is nonlinear, as typified by hysteresis, and because individual units differ owing to aging and deterioration. In this study, we apply R2D2, a distributed reinforcement learning method that uses a recurrent neural network and can therefore handle hysteresis, to the control of pneumatic artificial muscles. Because learning draws on the experience of distributed systems that are each affected by individual differences, robustness to those differences can be expected. In this presentation, we verify the feasibility of controlling pneumatic artificial muscles with R2D2.

language of the presentation: Japanese

神　孝典	M, 2nd presentation	Robot Learning	松原　崇充, 和田　隆広, 杉本　謙二, 小林　泰介
title: Bipedal Walking on Soft-Stepping Stones Using Model-Based Reinforcement Learning in PDAC

title (Japanese): PDAC型二足歩行におけるモデルベース強化学習を用いたソフト着地位置制約の実現

abstract: Bipedal robots have been the focus of much research because of their adaptability to real-world environments. Gait control is one of their most important elements, and various methods have been proposed. Model-based methods that linearize the dynamics to simplify control are widely used, but they tend to produce quasi-static motion. Learning-based methods can acquire a variety of gaits through model-free reinforcement learning, but they require an enormous amount of computation during training. Among model-based methods, there is an approach based on virtual constraints that takes advantage of nonlinear dynamics. Although this approach can achieve a dynamic gait, it is difficult to construct a stable transition model of the landing position, and therefore difficult to satisfy hard landing-position (stepping-stone) constraints. This study therefore integrates model-based reinforcement learning with the virtual-constraint-based method so that future states can be predicted over a long horizon, with the aim of achieving walking under soft landing-position constraints. In this presentation, as a preliminary verification, we show that a state transition model can be obtained by combining these methods and that walking is possible using that model.

language of the presentation: Japanese

高橋　慶一郎	M, 2nd presentation	Robot Learning	松原　崇充, 和田　隆広, 杉本　謙二, 小林　泰介
title: Temporal Difference Learning with Fechner's Law derived from Reinforcement Learning as Probabilistic Inference

title (Japanese): フェヒナーの法則に従うTD学習則を導く確率的推論問題としての強化学習

abstract: New methods that treat reinforcement learning problems as probabilistic inference problems have been proposed and shown to be effective. In these methods, the gradients of the learning models that compute the value function and the policy are derived approximately, by eliminating elements that cannot be computed. However, if the signal is split into rewards above zero and punishments below zero, instead of using a general reward that can take either positive or negative values, the uncomputable elements of the conventional methods no longer arise and the approximation becomes unnecessary. The gradients derived in this way have properties similar to Fechner's law, which describes a characteristic of human perception, and they may produce learning behavior different from that of conventional gradients. In this research, we investigate the effect of the newly derived gradients on learning. By analyzing the behavior of the gradients and through experiments, we confirmed that the proposed method tends to actively avoid punishments. In this presentation, we show the differences between the conventional methods and the proposed method, the properties of the new gradients, and their effect on learning.

language of the presentation: Japanese
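
For reference, Fechner's law itself states that perceived magnitude grows with the logarithm of stimulus intensity. The sketch below illustrates only that law, not the gradient derived in this work, which the abstract does not spell out. The function name, the constant `k`, and the default threshold are illustrative assumptions.

```python
import math


def fechner_sensation(intensity, threshold=1.0, k=1.0):
    """Perceived magnitude under Fechner's law: S = k * ln(I / I0).

    `threshold` (I0) and `k` are illustrative constants, not values from the talk.
    """
    if intensity <= 0 or threshold <= 0:
        raise ValueError("intensity and threshold must be positive")
    return k * math.log(intensity / threshold)


# Equal ratios of stimulus give equal increments of sensation:
# going from 1 to 2 is perceived the same as going from 10 to 20.
step_small = fechner_sensation(2.0) - fechner_sensation(1.0)
step_large = fechner_sensation(20.0) - fechner_sensation(10.0)
```

Both `step_small` and `step_large` equal `k * ln 2`, which is the sense in which a Fechner-like quantity responds to relative rather than absolute changes.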

米澤　壮太郎	M, 2nd presentation	Robot Learning	松原　崇充, 和田　隆広, 杉本　謙二, 小林　泰介
title: Experience replay that makes use of data widely with prioritization with reachability by current policy

title (Japanese): 現方策に従った際の到達可能性を用いてデータを広く活用する経験再生

abstract: In reinforcement learning, where an agent acquires a control policy by trial and error in an unknown environment, poor sample efficiency is a major obstacle to practical application. To improve it, experience replay, in which experience data is stored in a buffer and reused, is widely employed. Recent studies have shown, however, that experiences with higher reachability under the current policy should be replayed with higher priority. In this study, we propose a new experience replay method that quantitatively evaluates the reachability of both the current state and the next state contained in each experience, and that uses weighting to prioritize experiences for which at least one of the two is highly reachable. We report the results of comparing the proposed method with conventional methods on a reinforcement learning benchmark task.

language of the presentation: Japanese
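
The general idea of weighting replayed experiences can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenter's method: the class name, the use of `max` to combine the two reachability scores, and the small `eps` floor are all hypothetical, and how reachability is actually estimated is not described in the abstract, so the scores are simply passed in.

```python
import random
from collections import deque


class WeightedReplayBuffer:
    """Replay buffer that samples transitions in proportion to a priority weight."""

    def __init__(self, capacity, eps=1e-6):
        # deque(maxlen=...) silently drops the oldest entry once full.
        self.transitions = deque(maxlen=capacity)
        self.weights = deque(maxlen=capacity)
        self.eps = eps  # keeps every weight strictly positive

    def add(self, transition, reach_state, reach_next_state):
        # Assumption: an experience is prioritized if EITHER its state or its
        # next state is highly reachable, so the larger score is used.
        weight = max(reach_state, reach_next_state) + self.eps
        self.transitions.append(transition)
        self.weights.append(weight)

    def sample(self, batch_size, rng=random):
        # Weighted sampling with replacement.
        return rng.choices(
            list(self.transitions), weights=list(self.weights), k=batch_size
        )

    def __len__(self):
        return len(self.transitions)


buffer = WeightedReplayBuffer(capacity=1000)
buffer.add(("s0", "a0", 0.0, "s1"), reach_state=0.9, reach_next_state=0.2)
buffer.add(("s5", "a1", 1.0, "s6"), reach_state=0.0, reach_next_state=0.0)
batch = buffer.sample(batch_size=4, rng=random.Random(0))
```

Sampling with replacement keeps the sketch short; practical prioritized replay implementations usually also correct the resulting sampling bias, which is omitted here.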