Your goal is to implement the REINFORCE algorithm. The key difference between REINFORCE and DQN is that REINFORCE learns the policy rather than the action-value function Q. It can also be applied to systems with a continuous state space and (unlike DQN) a continuous action space, although you will focus in this assignment on its application to a system with finite state and action spaces.
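For reference, REINFORCE performs stochastic gradient ascent on the expected discounted return $J(\theta)$ using Monte Carlo estimates of the policy gradient; one standard form of the estimator (included here only for context, not as a prescribed variant) is

```
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right],
\qquad
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} r_k,
```

where $G_t$ is the discounted return from time $t$ over a finite-length rollout and $\gamma$ is the discount factor.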
You will test your algorithms in the following provided environment:
- The simple grid-world from HW1. The environment is defined in gridworld.py. An example of how to use the environment is provided in test_gridworld.py.
You should express the reinforcement learning problem as a Markov Decision Process with an infinite time horizon and a discount factor of γ.
Include a train_gridworld.py file that can be run to generate all necessary plots and data.
Please apply REINFORCE to the easy version of the grid-world environment with a tabular policy, using softmax to compute action probabilities, using auto-differentiation to compute payoff gradients (the categorical distribution may be helpful), and using the SGD optimizer without momentum to take descent steps.
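A minimal sketch of this setup is given below, assuming PyTorch for the auto-differentiation. The environment interface (env.reset() returning an integer state, env.step(a) returning (next_state, reward, done)), the state/action counts, and the hyperparameter values are assumptions for illustration; adapt them to the provided gridworld.py.

```python
import torch

# Assumed sizes and hyperparameters for illustration only:
num_states, num_actions = 25, 4
learning_rate, gamma = 0.01, 0.95

# Tabular policy: one logit per (state, action) pair; a softmax over each row
# gives the action probabilities pi(a | s).
logits = torch.zeros(num_states, num_actions, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=learning_rate, momentum=0.0)

def reinforce_update(env, max_steps=100):
    """Roll out one episode and take one REINFORCE step.

    Assumes env.reset() returns an integer state and env.step(a) returns
    (next_state, reward, done); adapt to the actual gridworld interface.
    """
    s, log_probs, rewards = env.reset(), [], []
    for _ in range(max_steps):
        dist = torch.distributions.Categorical(logits=logits[s])
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())
        rewards.append(r)
        if done:
            break
    # Discounted return-to-go G_t, computed backwards through the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    # Ascend the expected return by descending the negative weighted log-likelihood.
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
    optimizer.zero_grad()
    loss.backward()   # auto-differentiation of the payoff
    optimizer.step()
    return sum(rewards), len(rewards)  # total reward and episode length, for logging
```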
More specifically, please write code to generate, at a minimum, the following results:
- A plot that contains a learning curve (the total reward versus the number of simulation steps); a minimal logging sketch is given after this list
- A plot of the policy for at least one trained agent (compared to the optimal policy, if known)
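For the learning curve, one possible logging scheme (a sketch, not a requirement) is to record the total reward of each episode against the cumulative number of simulation steps and plot the two. Here, reinforce_update and env refer to the earlier sketch, and the number of episodes is a placeholder.

```python
import matplotlib.pyplot as plt

curve_steps, curve_returns, steps_so_far = [], [], 0
for episode in range(5000):  # placeholder number of episodes
    total_reward, episode_length = reinforce_update(env)
    steps_so_far += episode_length
    curve_steps.append(steps_so_far)
    curve_returns.append(total_reward)

plt.plot(curve_steps, curve_returns)
plt.xlabel("simulation steps")
plt.ylabel("total reward per episode")
plt.savefig("learning_curve.png")
```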
Please do at least one of the following additional investigations:
- Test your algorithm on both the "easy" version of gridworld (deterministic state transition) and the "hard" version (stochastic state transition).
- Compare your results when gradient steps are implemented with the SGD optimizer and with the Adam optimizer.
- Apply REINFORCE with a tabular policy to the discrete pendulum environment, with finite state and action spaces, that you considered in HW1.
- Apply REINFORCE to the pendulum environment, with continuous state and action spaces, assuming the use of a Gaussian policy that is described by a neural network. By default, this pendulum environment uses a "dense reward" instead of the sparse reward we have assumed in prior versions of the pendulum. Beware! Even with a bug-free implementation and a good choice of hyperparameters, it is likely that you will need a very large number of simulation steps in order for REINFORCE to produce good results (e.g., tens of millions). We will be impressed if you get this to work (and astonished if you get it to work with sparse reward). A minimal sketch of one such Gaussian policy is given after this list.
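If you attempt the continuous pendulum, one common way to describe a Gaussian policy with a neural network is to have the network output the mean of a Normal distribution and to learn a state-independent log standard deviation. The sketch below assumes placeholder observation/action dimensions and hidden-layer sizes, which are not taken from the provided environment; the optimizer choice is yours (Adam is shown only as an example).

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Gaussian policy: network outputs the mean, log standard deviation is learned."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        # Action distribution pi(. | obs).
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

# Placeholder dimensions; replace with those of the actual pendulum environment.
policy = GaussianPolicy(obs_dim=2, act_dim=1)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

obs = torch.zeros(2)                    # placeholder observation
dist = policy.dist(obs)
action = dist.sample()
log_prob = dist.log_prob(action).sum()  # used in the REINFORCE loss
```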
Create a pull request (PR) to enable a review of your code. Your code should be functional - that is, each algorithm should be working and ready to be reviewed for improvements. Remember that you need to write your own code from scratch.
Name your PR "Initial hwX for Firstname Lastname (netid)".
Due: 10am on Tuesday, April 18
Review the code of at least one colleague. That is, you should:
- Choose a PR that does not already have a reviewer, and assign yourself as a reviewer.
- Perform a code review. See Resources for guidance on how to perform code reviews (at a minimum, look at Google's best practices).
- Modify your README.md file to specify whose code you reviewed (using their GitHub username).
The goal of this review process is to arrive at a version of your code that is functional, reasonably efficient, and easy for others to understand. The goal is not to make all of our code the same (there are many different ways of doing things). The goal is also not to grade the work of your colleagues - your reviews will have no impact on others' grades. Don't forget to remind your colleagues to do the simple things like name their PR and training file correctly!
Due: 10am on Wednesday, April 19
Improve your own code based on reviews that you receive. Respond to every comment and merge your PR. If you address a comment fully (e.g., by changing your code), mark it as resolved. If you disagree with or remain uncertain about a comment, engage in follow-up discussion with the reviewer on GitHub. Don't forget to reply to follow-ups on code you reviewed as well.
Submit your repository, containing your final code and a (very brief) report titled hwX-netid.pdf, to Gradescope. The report should be formatted using either a typical IEEE or AIAA conference/journal paper format and include the following, at a minimum:
- Plots discussed in Results.
- An introduction (e.g., discussion of the problem(s)), discussion of methods (e.g., including any specific details needed to understand your algorithm implementations like hyperparameters chosen), discussion of results, and general conclusions.
Due: 10am on Thursday, April 20
Here are some resources that may be helpful:
- Google's best practices for code review
- A Microsoft blog post on code review and study of the review process
- A RedHat blog post on python-specific code review
- A classic reference on writing code: The Art of Readable Code (Boswell and Foucher, O'Reilly, 2012)
Many other resources are out there - we will happily accept a PR to add more to this list!