Implementation of *First return, then explore* (Go-Explore) by Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. The result is a neural network policy that reaches a score of 2500 on the Atari environment MontezumaRevenge. The pipeline consists of two phases:
- Exploration Phase: build an archive of cells by repeatedly returning to archived states and exploring from them, producing high-score demonstrations (see the first sketch below the list)
- Robustification Phase: turn the demonstrations into a robust policy using PPO, self-imitation learning (SIL), and the Backward algorithm (see the second sketch below the list)
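
Below is a minimal sketch of the exploration phase's return-then-explore loop. It uses a toy deterministic `ChainEnv` as a stand-in for an ALE emulator whose state can be cloned and restored; the names `ChainEnv`, `cell_of`, `Cell`, and `explore`, as well as the cell binning and selection weights, are illustrative assumptions, not the exact code used in this repo.

```python
# Sketch of Go-Explore Phase 1: archive cells, return to them by restoring
# emulator state, explore randomly, and keep the best trajectory per cell.
# All names and hyperparameters here are illustrative.
import random
from dataclasses import dataclass, field

class ChainEnv:
    """Toy deterministic env: walk right along a chain, reward at the end.
    Stand-in for an Atari emulator exposing state clone/restore."""
    N = 20
    def __init__(self):
        self.pos = 0
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):           # action in {0: left, 1: right}
        self.pos = max(0, min(self.N, self.pos + (1 if action == 1 else -1)))
        reward = 1.0 if self.pos == self.N else 0.0
        return self.pos, reward, self.pos == self.N
    def clone_state(self):            # analogous to ALE clone_state()
        return self.pos
    def restore_state(self, state):   # analogous to ALE restore_state(state)
        self.pos = state

def cell_of(obs):
    """Cell representation; the paper downscales Atari frames, here we bin."""
    return obs // 2

@dataclass
class Cell:
    state: object                      # emulator state used to "first return"
    trajectory: list = field(default_factory=list)
    score: float = 0.0
    visits: int = 0

def explore(iterations=500, steps_per_rollout=10, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    obs = env.reset()
    archive = {cell_of(obs): Cell(state=env.clone_state())}

    for _ in range(iterations):
        # Select a cell to return to; less-visited cells are preferred.
        weights = [1.0 / (c.visits + 1) ** 2 for c in archive.values()]
        key = rng.choices(list(archive), weights=weights)[0]
        cell = archive[key]
        cell.visits += 1
        # "First return": restore the emulator state, then explore from there.
        env.restore_state(cell.state)
        traj, score = list(cell.trajectory), cell.score
        for _ in range(steps_per_rollout):
            action = rng.randrange(2)
            obs, reward, done = env.step(action)
            traj.append(action)
            score += reward
            k = cell_of(obs)
            old = archive.get(k)
            better = (old is None or score > old.score
                      or (score == old.score and len(traj) < len(old.trajectory)))
            if better:  # keep the best (highest score, then shortest) way to reach each cell
                visits = old.visits if old else 0
                archive[k] = Cell(env.clone_state(), list(traj), score, visits)
            if done:
                break
    # The highest-scoring trajectory becomes the demonstration for Phase 2.
    best = max(archive.values(), key=lambda c: c.score)
    return best.trajectory, best.score

if __name__ == "__main__":
    demo, score = explore()
    print(f"demonstration length={len(demo)}, score={score}")
```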
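
The second sketch outlines the Backward-algorithm curriculum used during robustification. It assumes a PPO (+ SIL) learner whose per-episode returns can be reported back to the curriculum; `BackwardCurriculum` and its thresholds are illustrative simplifications (for example, in practice the return to beat is the demonstration's return from the current start point onward), not the exact schedule used in the paper or this repo.

```python
# Sketch of the Backward algorithm: start episodes near the end of the
# demonstration and move the start point backward once the agent reliably
# matches the demonstration's return. Names and thresholds are illustrative.
class BackwardCurriculum:
    def __init__(self, demo_length, demo_return, window=50, success_rate=0.5, shift=5):
        self.start = max(0, demo_length - shift)  # begin training near the demo's end
        self.demo_return = demo_return            # return the agent must match
        self.window = window                      # rollouts evaluated per start point
        self.success_rate = success_rate          # fraction that must match the demo
        self.shift = shift                        # how far to move the start point back
        self.results = []

    def start_index(self):
        """Index into the demonstration from which episodes should be launched."""
        return self.start

    def report(self, episode_return):
        """Record one rollout's return; shift the start point backward once the
        agent matches the demonstration often enough from the current start."""
        self.results.append(episode_return >= self.demo_return)
        if len(self.results) >= self.window:
            if sum(self.results) / len(self.results) >= self.success_rate:
                self.start = max(0, self.start - self.shift)
            self.results = []
        return self.start

# Usage inside a (hypothetical) PPO + SIL training loop:
#   curriculum = BackwardCurriculum(len(demo), demo_score)
#   restore_env_to(demo, curriculum.start_index())  # reset env to a demo state
#   ret = run_ppo_sil_episode(...)                   # collect one rollout and update
#   curriculum.report(ret)
# Training ends once curriculum.start_index() reaches 0, i.e. the policy plays
# from the true initial state without relying on the demonstration.
```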