Trust Region Policy Optimization
0x00 Abstract
By making approximations to a theoretically-justified procedure, the paper proposes a new algorithm, TRPO; although the approximations deviate from the theory, TRPO achieves monotonic policy improvement while requiring very little hyperparameter tuning.
KEY WORDS: Reinforcement Learning, TRPO, On-Policy
0x01 Introduction
Current approaches to policy optimization fall roughly into three categories:
- Policy iteration methods: alternate between estimating the value function of the current policy and improving the policy
- Policy gradient methods: estimate the gradient of the expected return from sample trajectories (see the sketch after this list)
- Derivative-free optimization methods: treat the return as a black box to be optimized, e.g., the cross-entropy method (CEM) and covariance matrix adaptation (CMA)
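As a concrete illustration of the second category, here is a minimal REINFORCE-style policy gradient sketch. It is not code from the TRPO paper; the tabular softmax policy and the names `n_states`, `n_actions`, and `sample_episode` are hypothetical placeholders, assumed only for this example.

```python
# Minimal REINFORCE-style policy gradient sketch (illustrative only).
# Assumes a small discrete-action environment; `env_step` is a placeholder
# for the environment's transition function.
import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # logits of a tabular softmax policy

def policy(state):
    """Softmax over the action logits for the given state."""
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sample_episode(env_step, horizon=20):
    """Roll out one trajectory; env_step(state, action) -> (next_state, reward)."""
    state, traj = 0, []
    for _ in range(horizon):
        probs = policy(state)
        action = np.random.choice(n_actions, p=probs)
        next_state, reward = env_step(state, action)
        traj.append((state, action, reward))
        state = next_state
    return traj

def policy_gradient(traj, gamma=0.99):
    """Estimate grad of expected return: sum_t grad log pi(a_t|s_t) * G_t."""
    grad = np.zeros_like(theta)
    returns = 0.0
    # Walk backwards so the return G_t can be accumulated incrementally.
    for state, action, reward in reversed(traj):
        returns = reward + gamma * returns
        probs = policy(state)
        grad_logp = -probs
        grad_logp[action] += 1.0  # gradient of log-softmax w.r.t. the logits
        grad[state] += grad_logp * returns
    return grad
```

A plain gradient-ascent update such as `theta += learning_rate * policy_gradient(traj)` increases the expected return only in expectation and is sensitive to the step size; TRPO's contribution is to replace this unconstrained step with a trust-region-constrained one.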
Contributions of the paper:
- Prove that minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes (see the formulas after this list)
- By making approximations to the theoretically-justified procedure, obtain a practical algorithm, TRPO
- Two variants of the algorithm:
	- Single-path: can be applied in the model-free setting
	- Vine: requires restoring the system to particular states, which is typically only possible in simulation
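For reference, the monotonic-improvement bound and the resulting trust-region problem from the TRPO paper can be summarized as follows (notation follows the paper: $\eta$ is the expected return, $L_{\pi}$ the surrogate objective, $A$ the advantage function, $\rho$ the state visitation distribution, and $\delta$ the KL step-size bound):

```latex
% Monotonic improvement bound:
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad C = \frac{4\epsilon\gamma}{(1-\gamma)^2}

% Practical trust-region problem solved by TRPO:
\max_{\theta}\;
\mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}
\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A_{\theta_{\text{old}}}(s, a) \right]
\quad \text{s.t.}\quad
\mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\!\left[
D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_{\theta}(\cdot \mid s)\big)
\right] \le \delta
```

The Single-path and Vine procedures are two Monte Carlo schemes for estimating the expectations in this objective and constraint from sampled trajectories.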
0x02 Preliminaries
0x03 Monotonic Improvement Guarantee for General Stochastic Policies
0x04 Optimization of Parameterized Policies
0x05 Sample-Based Estimation of the Objective and Constraint
0x06 Practical Algorithm
0x07 Connections with Prior Work
0x08 Experiments
0x09 Discussion