What follows is a list of papers in deep RL that are worth reading. This is far from comprehensive, but should provide a useful starting point for someone looking to do research in the field.

Model-Free RL
=============

.. [#] `Playing Atari with Deep Reinforcement Learning <https://arxiv.org/abs/1312.5602>`_, Mnih et al, 2013. Algorithm: DQN.

.. [#] `Deep Recurrent Q-Learning for Partially Observable MDPs <https://arxiv.org/abs/1507.06527>`_, Hausknecht and Stone, 2015. Algorithm: Deep Recurrent Q-Learning.

.. [#] `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_, Wang et al, 2015. Algorithm: Dueling DQN.

.. [#] `Deep Reinforcement Learning with Double Q-learning <https://arxiv.org/abs/1509.06461>`_, Hasselt et al, 2015. Algorithm: Double DQN.

.. [#] `Prioritized Experience Replay <https://arxiv.org/abs/1511.05952>`_, Schaul et al, 2015. Algorithm: Prioritized Experience Replay (PER).

.. [#] `Rainbow: Combining Improvements in Deep Reinforcement Learning <https://arxiv.org/abs/1710.02298>`_, Hessel et al, 2017. Algorithm: Rainbow DQN.

.. [#] `Asynchronous Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1602.01783>`_, Mnih et al, 2016. Algorithm: A3C.

.. [#] `Trust Region Policy Optimization <https://arxiv.org/abs/1502.05477>`_, Schulman et al, 2015. Algorithm: TRPO.

.. [#] `High-Dimensional Continuous Control Using Generalized Advantage Estimation <https://arxiv.org/abs/1506.02438>`_, Schulman et al, 2015. Algorithm: GAE.

.. [#] `Proximal Policy Optimization Algorithms <https://arxiv.org/abs/1707.06347>`_, Schulman et al, 2017. Algorithm: PPO-Clip, PPO-Penalty. (A minimal sketch of the PPO-Clip objective appears at the end of this section.)

.. [#] `Emergence of Locomotion Behaviours in Rich Environments <https://arxiv.org/abs/1707.02286>`_, Heess et al, 2017. Algorithm: PPO-Penalty.

.. [#] `Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation <https://arxiv.org/abs/1708.05144>`_, Wu et al, 2017. Algorithm: ACKTR.

.. [#] `Sample Efficient Actor-Critic with Experience Replay <https://arxiv.org/abs/1611.01224>`_, Wang et al, 2016. Algorithm: ACER.

.. [#] `Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor <https://arxiv.org/abs/1801.01290>`_, Haarnoja et al, 2018. Algorithm: SAC.

.. [#] `Deterministic Policy Gradient Algorithms <http://proceedings.mlr.press/v32/silver14.pdf>`_, Silver et al, 2014. Algorithm: DPG.

.. [#] `Continuous Control With Deep Reinforcement Learning <https://arxiv.org/abs/1509.02971>`_, Lillicrap et al, 2015. Algorithm: DDPG.

.. [#] `Addressing Function Approximation Error in Actor-Critic Methods <https://arxiv.org/abs/1802.09477>`_, Fujimoto et al, 2018. Algorithm: TD3.

.. [#] `A Distributional Perspective on Reinforcement Learning <https://arxiv.org/abs/1707.06887>`_, Bellemare et al, 2017. Algorithm: C51.

.. [#] `Distributional Reinforcement Learning with Quantile Regression <https://arxiv.org/abs/1710.10044>`_, Dabney et al, 2017. Algorithm: QR-DQN.

.. [#] `Implicit Quantile Networks for Distributional Reinforcement Learning <https://arxiv.org/abs/1806.06923>`_, Dabney et al, 2018. Algorithm: IQN.

.. [#] `Dopamine: A Research Framework for Deep Reinforcement Learning <https://openreview.net/forum?id=ByG_3s09KX>`_, Anonymous, 2018. Contribution: Introduces Dopamine, a code repository containing implementations of DQN, C51, IQN, and Rainbow. `Code link. <https://github.com/google/dopamine>`_

.. [#] `Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic <https://arxiv.org/abs/1611.02247>`_, Gu et al, 2016. Algorithm: Q-Prop.

.. [#] `Action-dependent Control Variates for Policy Optimization via Stein's Identity <https://arxiv.org/abs/1710.11198>`_, Liu et al, 2017. Algorithm: Stein Control Variates.

.. [#] `The Mirage of Action-Dependent Baselines in Reinforcement Learning <https://arxiv.org/abs/1802.10031>`_, Tucker et al, 2018. Contribution: Critiques and reevaluates claims from earlier papers (including Q-Prop and Stein control variates) and finds important methodological errors in them.

.. [#] `Bridging the Gap Between Value and Policy Based Reinforcement Learning <https://arxiv.org/abs/1702.08892>`_, Nachum et al, 2017. Algorithm: PCL.

.. [#] `Trust-PCL: An Off-Policy Trust Region Method for Continuous Control <https://arxiv.org/abs/1707.01891>`_, Nachum et al, 2017. Algorithm: Trust-PCL.

.. [#] `Combining Policy Gradient and Q-learning <https://arxiv.org/abs/1611.01626>`_, O'Donoghue et al, 2016. Algorithm: PGQL.

.. [#] `The Reactor: A Fast and Sample-Efficient Actor-Critic Agent for Reinforcement Learning <https://arxiv.org/abs/1704.04651>`_, Gruslys et al, 2017. Algorithm: Reactor.

.. [#] `Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning <http://papers.nips.cc/paper/6974-interpolated-policy-gradient-merging-on-policy-and-off-policy-gradient-estimation-for-deep-reinforcement-learning>`_, Gu et al, 2017. Algorithm: IPG.

.. [#] `Equivalence Between Policy Gradients and Soft Q-Learning <https://arxiv.org/abs/1704.06440>`_, Schulman et al, 2017. Contribution: Reveals a theoretical link between these two families of RL algorithms.

.. [#] `Evolution Strategies as a Scalable Alternative to Reinforcement Learning <https://arxiv.org/abs/1703.03864>`_, Salimans et al, 2017. Algorithm: ES.
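
As a point of orientation for the policy-optimization entries above, here is a minimal numpy sketch of the PPO-Clip surrogate objective from Schulman et al, 2017. The function and variable names are illustrative only; real implementations work from log-probabilities and typically add value-function and entropy terms.

.. code-block:: python

    import numpy as np

    def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
        """Clipped surrogate objective (to be maximized).

        ratio:     pi_theta(a|s) / pi_theta_old(a|s) for each sampled (s, a)
        advantage: corresponding advantage estimates (e.g. from GAE)
        """
        unclipped = ratio * advantage
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
        # Take the pessimistic (minimum) of the two terms, averaged over the batch.
        return np.mean(np.minimum(unclipped, clipped))

    # Toy usage with made-up numbers:
    print(ppo_clip_objective(np.array([0.9, 1.3, 1.05]), np.array([1.0, -0.5, 2.0])))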

Exploration
===========

.. [#] `VIME: Variational Information Maximizing Exploration <https://arxiv.org/abs/1605.09674>`_, Houthooft et al, 2016. Algorithm: VIME.

.. [#] `Unifying Count-Based Exploration and Intrinsic Motivation <https://arxiv.org/abs/1606.01868>`_, Bellemare et al, 2016. Algorithm: CTS-based Pseudocounts.

.. [#] `Count-Based Exploration with Neural Density Models <https://arxiv.org/abs/1703.01310>`_, Ostrovski et al, 2017. Algorithm: PixelCNN-based Pseudocounts.

.. [#] `#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning <https://arxiv.org/abs/1611.04717>`_, Tang et al, 2016. Algorithm: Hash-based Counts.

.. [#] `EX2: Exploration with Exemplar Models for Deep Reinforcement Learning <https://arxiv.org/abs/1703.01260>`_, Fu et al, 2017. Algorithm: EX2.

.. [#] `Curiosity-driven Exploration by Self-supervised Prediction <https://arxiv.org/abs/1705.05363>`_, Pathak et al, 2017. Algorithm: Intrinsic Curiosity Module (ICM).

.. [#] `Large-Scale Study of Curiosity-Driven Learning <https://arxiv.org/abs/1808.04355>`_, Burda et al, 2018. Contribution: Systematic analysis of how surprisal-based intrinsic motivation performs in a wide variety of environments.

.. [#] `Exploration by Random Network Distillation <https://arxiv.org/abs/1810.12894>`_, Burda et al, 2018. Algorithm: RND. (A minimal sketch of the RND intrinsic reward appears at the end of this section.)

.. [#] `Variational Intrinsic Control <https://arxiv.org/abs/1611.07507>`_, Gregor et al, 2016. Algorithm: VIC.

.. [#] `Diversity is All You Need: Learning Skills without a Reward Function <https://arxiv.org/abs/1802.06070>`_, Eysenbach et al, 2018. Algorithm: DIAYN.

.. [#] `Variational Option Discovery Algorithms <https://arxiv.org/abs/1807.10299>`_, Achiam et al, 2018. Algorithm: VALOR.
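
To make the random network distillation idea above concrete, here is a minimal sketch that assumes linear "networks" and plain gradient descent; the real method uses deep networks, observation normalization, and combines this intrinsic reward with the extrinsic one. All names are illustrative.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    OBS_DIM, FEAT_DIM = 8, 4

    target_W = rng.normal(size=(OBS_DIM, FEAT_DIM))     # fixed random target network
    predictor_W = rng.normal(size=(OBS_DIM, FEAT_DIM))  # predictor, trained by regression

    def intrinsic_reward(obs):
        """Prediction error of the predictor against the fixed random target."""
        err = obs @ predictor_W - obs @ target_W
        return float(np.sum(err ** 2))

    def predictor_update(obs, lr=1e-2):
        """One gradient step on 0.5 * ||predictor(obs) - target(obs)||^2."""
        global predictor_W
        err = obs @ predictor_W - obs @ target_W
        predictor_W -= lr * np.outer(obs, err)

    obs = rng.normal(size=OBS_DIM)
    print(intrinsic_reward(obs))
    predictor_update(obs)
    print(intrinsic_reward(obs))  # smaller after the update: this state is now "familiar"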

Transfer and Multitask RL
=========================

.. [#] `Progressive Neural Networks <https://arxiv.org/abs/1606.04671>`_, Rusu et al, 2016. Algorithm: Progressive Networks.

.. [#] `Universal Value Function Approximators <http://proceedings.mlr.press/v37/schaul15.pdf>`_, Schaul et al, 2015. Algorithm: UVFA.

.. [#] `Reinforcement Learning with Unsupervised Auxiliary Tasks <https://arxiv.org/abs/1611.05397>`_, Jaderberg et al, 2016. Algorithm: UNREAL.

.. [#] `The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously <https://arxiv.org/abs/1707.03300>`_, Cabi et al, 2017. Algorithm: IU Agent.

.. [#] `PathNet: Evolution Channels Gradient Descent in Super Neural Networks <https://arxiv.org/abs/1701.08734>`_, Fernando et al, 2017. Algorithm: PathNet.

.. [#] `Mutual Alignment Transfer Learning <https://arxiv.org/abs/1707.07907>`_, Wulfmeier et al, 2017. Algorithm: MATL.

.. [#] `Learning an Embedding Space for Transferable Robot Skills <https://openreview.net/forum?id=rk07ZXZRb&noteId=rk07ZXZRb>`_, Hausman et al, 2018.

.. [#] `Hindsight Experience Replay <https://arxiv.org/abs/1707.01495>`_, Andrychowicz et al, 2017. Algorithm: Hindsight Experience Replay (HER). (A minimal sketch of hindsight relabeling appears at the end of this section.)
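
The core trick in the hindsight experience replay entry above is easy to sketch. The snippet below assumes a goal-conditioned, sparse-reward setup and uses the "final" relabeling strategy; all function and variable names are illustrative, not from any particular codebase.

.. code-block:: python

    def her_relabel(trajectory, compute_reward):
        """Replay an episode as if the goal had been the state actually achieved at the end.

        trajectory: list of (state, action, next_state, goal) tuples
        compute_reward(next_state, goal) -> reward under the relabeled goal
        """
        achieved_goal = trajectory[-1][2]  # final achieved state stands in as the goal
        relabeled = []
        for state, action, next_state, _ in trajectory:
            reward = compute_reward(next_state, achieved_goal)
            relabeled.append((state, action, next_state, achieved_goal, reward))
        return relabeled

    # Toy usage: 1-D point mass, sparse reward for reaching the goal.
    compute_reward = lambda s, g: 0.0 if abs(s - g) < 0.1 else -1.0
    traj = [(0.0, +1, 0.5, 3.0), (0.5, +1, 1.0, 3.0)]
    print(her_relabel(traj, compute_reward))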

Hierarchy
=========

.. [#] `Strategic Attentive Writer for Learning Macro-Actions <https://arxiv.org/abs/1606.04695>`_, Vezhnevets et al, 2016. Algorithm: STRAW.

.. [#] `FeUdal Networks for Hierarchical Reinforcement Learning <https://arxiv.org/abs/1703.01161>`_, Vezhnevets et al, 2017. Algorithm: Feudal Networks.

.. [#] `Data-Efficient Hierarchical Reinforcement Learning <https://arxiv.org/abs/1805.08296>`_, Nachum et al, 2018. Algorithm: HIRO.

Memory
======

.. [#] `Model-Free Episodic Control <https://arxiv.org/abs/1606.04460>`_, Blundell et al, 2016. Algorithm: MFEC.

.. [#] `Neural Episodic Control <https://arxiv.org/abs/1703.01988>`_, Pritzel et al, 2017. Algorithm: NEC.

.. [#] `Neural Map: Structured Memory for Deep Reinforcement Learning <https://arxiv.org/abs/1702.08360>`_, Parisotto and Salakhutdinov, 2017. Algorithm: Neural Map.

.. [#] `Unsupervised Predictive Memory in a Goal-Directed Agent <https://arxiv.org/abs/1803.10760>`_, Wayne et al, 2018. Algorithm: MERLIN.

.. [#] `Relational Recurrent Neural Networks <https://arxiv.org/abs/1806.01822>`_, Santoro et al, 2018. Algorithm: RMC.

Model-Based RL
==============

.. [#] `Imagination-Augmented Agents for Deep Reinforcement Learning <https://arxiv.org/abs/1707.06203>`_, Weber et al, 2017. Algorithm: I2A.

.. [#] `Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning <https://arxiv.org/abs/1708.02596>`_, Nagabandi et al, 2017. Algorithm: MBMF.

.. [#] `Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning <https://arxiv.org/abs/1803.00101>`_, Feinberg et al, 2018. Algorithm: MVE. (A schematic form of the value-expansion target appears at the end of this section.)

.. [#] `Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion <https://arxiv.org/abs/1807.01675>`_, Buckman et al, 2018. Algorithm: STEVE.

.. [#] `Model-Ensemble Trust-Region Policy Optimization <https://openreview.net/forum?id=SJJinbWRZ&noteId=SJJinbWRZ>`_, Kurutach et al, 2018. Algorithm: ME-TRPO.

.. [#] `Model-Based Reinforcement Learning via Meta-Policy Optimization <https://arxiv.org/abs/1809.05214>`_, Clavera et al, 2018. Algorithm: MB-MPO.

.. [#] `Recurrent World Models Facilitate Policy Evolution <https://arxiv.org/abs/1809.01999>`_, Ha and Schmidhuber, 2018.

.. [#] `Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm <https://arxiv.org/abs/1712.01815>`_, Silver et al, 2017. Algorithm: AlphaZero.

.. [#] `Thinking Fast and Slow with Deep Learning and Tree Search <https://arxiv.org/abs/1705.08439>`_, Anthony et al, 2017. Algorithm: ExIt.
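
Schematically, the value-expansion family above (MVE, STEVE) builds bootstrapped value targets by rolling a learned model forward for a short horizon and then bootstrapping with a learned value function. In illustrative notation (not the papers' exact formulation):

.. math::

    \hat{V}(s_0) \;\approx\; \sum_{t=0}^{H-1} \gamma^t\, \hat{r}(\hat{s}_t, a_t) \;+\; \gamma^H\, V_\phi(\hat{s}_H),
    \qquad \hat{s}_0 = s_0, \quad \hat{s}_{t+1} = \hat{f}(\hat{s}_t, a_t),

where :math:`\hat{f}` and :math:`\hat{r}` are the learned dynamics and reward models and :math:`V_\phi` is the learned value function.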

Meta-RL
=======

.. [#] `RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning <https://arxiv.org/abs/1611.02779>`_, Duan et al, 2016. Algorithm: RL^2.

.. [#] `Learning to Reinforcement Learn <https://arxiv.org/abs/1611.05763>`_, Wang et al, 2016.

.. [#] `Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks <https://arxiv.org/abs/1703.03400>`_, Finn et al, 2017. Algorithm: MAML. (Its one-step objective is written out at the end of this section.)

.. [#] `A Simple Neural Attentive Meta-Learner <https://openreview.net/forum?id=B1DmUzWAW&noteId=B1DmUzWAW>`_, Mishra et al, 2018. Algorithm: SNAIL.
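
For reference, MAML (cited above) optimizes the initial parameters so that one or a few inner gradient steps on a new task's loss already perform well. In its simplest one-step form:

.. math::

    \min_\theta \; \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}\!\left( \theta - \alpha\, \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta) \right),

where :math:`\alpha` is the inner-loop step size and :math:`\mathcal{L}_{\mathcal{T}_i}` is the loss on task :math:`\mathcal{T}_i`.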

Scaling RL
==========

.. [#] `Accelerated Methods for Deep Reinforcement Learning <https://arxiv.org/abs/1803.02811>`_, Stooke and Abbeel, 2018. Contribution: Systematic analysis of parallelization in deep RL across algorithms.

.. [#] `IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures <https://arxiv.org/abs/1802.01561>`_, Espeholt et al, 2018. Algorithm: IMPALA.

.. [#] `Distributed Prioritized Experience Replay <https://openreview.net/forum?id=H1Dy---0Z>`_, Horgan et al, 2018. Algorithm: Ape-X.

.. [#] `Recurrent Experience Replay in Distributed Reinforcement Learning <https://openreview.net/forum?id=r1lyTjAqYX>`_, Anonymous, 2018. Algorithm: R2D2.

.. [#] `RLlib: Abstractions for Distributed Reinforcement Learning <https://arxiv.org/abs/1712.09381>`_, Liang et al, 2017. Contribution: A scalable library of RL algorithm implementations. `Documentation link. <https://ray.readthedocs.io/en/latest/rllib.html>`_

RL in the Real World
====================

.. [#] `Benchmarking Reinforcement Learning Algorithms on Real-World Robots <https://arxiv.org/abs/1809.07731>`_, Mahmood et al, 2018.

.. [#] `Learning Dexterous In-Hand Manipulation <https://arxiv.org/abs/1808.00177>`_, OpenAI, 2018.

.. [#] `QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation <https://arxiv.org/abs/1806.10293>`_, Kalashnikov et al, 2018. Algorithm: QT-Opt.

.. [#] `Horizon: Facebook's Open Source Applied Reinforcement Learning Platform <https://arxiv.org/abs/1811.00260>`_, Gauci et al, 2018.

Safety
======

.. [#] `Concrete Problems in AI Safety <https://arxiv.org/abs/1606.06565>`_, Amodei et al, 2016. Contribution: Establishes a taxonomy of safety problems, serving as an important jumping-off point for future research. We need to solve these!

.. [#] `Deep Reinforcement Learning From Human Preferences <https://arxiv.org/abs/1706.03741>`_, Christiano et al, 2017. Algorithm: LFP.

.. [#] `Constrained Policy Optimization <https://arxiv.org/abs/1705.10528>`_, Achiam et al, 2017. Algorithm: CPO. (The constrained RL problem it addresses is written out at the end of this section.)

.. [#] `Safe Exploration in Continuous Action Spaces <https://arxiv.org/abs/1801.08757>`_, Dalal et al, 2018. Algorithm: DDPG+Safety Layer.

.. [#] `Trial without Error: Towards Safe Reinforcement Learning via Human Intervention <https://arxiv.org/abs/1707.05173>`_, Saunders et al, 2017. Algorithm: HIRL.

.. [#] `Leave No Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning <https://arxiv.org/abs/1711.06782>`_, Eysenbach et al, 2017. Algorithm: Leave No Trace.
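
Several of the entries above (constrained policy optimization and the safety-layer approach in particular) work in the constrained RL setting, which can be written schematically as

.. math::

    \max_\pi \; J_R(\pi) \quad \text{s.t.} \quad J_{C_i}(\pi) \le d_i, \quad i = 1, \dots, m,

where :math:`J_R` is the usual expected return, each :math:`J_{C_i}` is an expected cost, and the :math:`d_i` are constraint limits. The notation here is generic rather than any one paper's.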

Imitation Learning and Inverse Reinforcement Learning
======================================================

.. [#] `Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy <http://www.cs.cmu.edu/~bziebart/publications/thesis-bziebart.pdf>`_, Ziebart, 2010. Contributions: Crisp formulation of maximum entropy IRL.

.. [#] `Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization <https://arxiv.org/abs/1603.00448>`_, Finn et al, 2016. Algorithm: GCL.

.. [#] `Generative Adversarial Imitation Learning <https://arxiv.org/abs/1606.03476>`_, Ho and Ermon, 2016. Algorithm: GAIL. (A minimal sketch of a discriminator-derived reward appears at the end of this section.)

.. [#] `DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills <https://xbpeng.github.io/projects/DeepMimic/2018_TOG_DeepMimic.pdf>`_, Peng et al, 2018. Algorithm: DeepMimic.

.. [#] `Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow <https://arxiv.org/abs/1810.00821>`_, Peng et al, 2018. Algorithm: VAIL.

.. [#] `One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL <https://arxiv.org/abs/1810.05017>`_, Le Paine et al, 2018. Algorithm: MetaMimic.
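
The adversarial imitation entries above (GAIL, VAIL) replace a hand-written reward with a discriminator-derived signal. The sketch below assumes the convention that D(s, a) is the discriminator's probability that (s, a) came from the expert; exact sign and log conventions vary across papers and implementations, and the names are illustrative.

.. code-block:: python

    import numpy as np

    def imitation_reward(discriminator_logit):
        """Surrogate reward for the policy: it is rewarded for state-action pairs
        the discriminator mistakes for expert data. Here D = sigmoid(logit) and
        the reward is -log(1 - D); other conventions are also common."""
        d = 1.0 / (1.0 + np.exp(-discriminator_logit))
        return -np.log(1.0 - d + 1e-8)

    print(imitation_reward(np.array([-2.0, 0.0, 2.0])))  # higher logit -> higher reward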

Reproducibility, Analysis, and Critique
=======================================

.. [#] `Benchmarking Deep Reinforcement Learning for Continuous Control <https://arxiv.org/abs/1604.06778>`_, Duan et al, 2016. Contribution: rllab.

.. [#] `Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control <https://arxiv.org/abs/1708.04133>`_, Islam et al, 2017.

.. [#] `Deep Reinforcement Learning that Matters <https://arxiv.org/abs/1709.06560>`_, Henderson et al, 2017.

.. [#] `Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods <https://arxiv.org/abs/1810.02525>`_, Henderson et al, 2018.

.. [#] `Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? <https://arxiv.org/abs/1811.02553>`_, Ilyas et al, 2018.

.. [#] `Simple Random Search Provides a Competitive Approach to Reinforcement Learning <https://arxiv.org/abs/1803.07055>`_, Mania et al, 2018.

Bonus: Classic Papers in RL Theory or Review
============================================

.. [#] `Policy Gradient Methods for Reinforcement Learning with Function Approximation <https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf>`_, Sutton et al, 2000. Contributions: Established the policy gradient theorem and showed convergence of a policy gradient algorithm for arbitrary policy classes. (The theorem is stated below this list.)

.. [#] `An Analysis of Temporal-Difference Learning with Function Approximation <http://web.mit.edu/jnt/www/Papers/J063-97-bvr-td.pdf>`_, Tsitsiklis and Van Roy, 1997. Contributions: Variety of convergence results and counter-examples for value-learning methods in RL.

.. [#] `Reinforcement Learning of Motor Skills with Policy Gradients <http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/Neural-Netw-2008-21-682_4867%5b0%5d.pdf>`_, Peters and Schaal, 2008. Contributions: Thorough review of policy gradient methods at the time, many of which are still serviceable descriptions of deep RL methods.

.. [#] `Approximately Optimal Approximate Reinforcement Learning <https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf>`_, Kakade and Langford, 2002. Contributions: Early roots of monotonic improvement theory, which later provided the theoretical justification for TRPO and related algorithms.

.. [#] `A Natural Policy Gradient <https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf>`_, Kakade, 2002. Contributions: Brought natural gradients into RL, later leading to TRPO, ACKTR, and several other methods in deep RL.

.. [#] `Algorithms for Reinforcement Learning <https://sites.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf>`_, Szepesvari, 2009. Contributions: Unbeatable reference on RL before deep RL, containing foundations and theoretical background.
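
For reference, the policy gradient theorem established in the Sutton et al entry above can be stated (in one common form, suppressing discount-factor bookkeeping) as

.. math::

    \nabla_\theta J(\pi_\theta) \;=\; \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right],

where :math:`d^{\pi_\theta}` is the state-visitation distribution induced by the policy. This identity underlies most of the policy-optimization papers listed earlier.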