Home
Note: most algorithm explanations are included in their respective PRs.
| Algorithm | Variants Implemented | PR |
|---|---|---|
| 1. Dynamic Programming | | |
| Value iteration | - | |
| Policy iteration | - | |
| 2. Cross Entropy method | | |
| Cross Entropy method on CartPole environment | cross_entropy/cross_entropy_cartpole.ipynb | #4 |
| 3. Monte Carlo | | |
| MC Prediction and Control on custom Gridworld environment | monte_carlo/mc_prediction.ipynb, monte_carlo/mc_control.ipynb | PR #1 |
| MC Control on FrozenLake environment | monte_carlo/mc_control_frozenlake.ipynb | PR #1 |
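For reference, value iteration (the first Dynamic Programming variant in the table above) reduces to repeated Bellman optimality backups over a known transition model. Below is a minimal sketch, assuming a tabular MDP given in the Gym toolbox convention `P[s][a] = [(prob, next_state, reward, done), ...]`; the function name and arguments are illustrative, not this repo's API.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    """Tabular value iteration on a known MDP.

    Assumes P[s][a] = list of (prob, next_state, reward, done) tuples.
    """
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: best one-step lookahead over actions.
            q = [sum(p * (r + gamma * V[s2] * (not done))
                     for p, s2, r, done in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # stop once the value function has converged
            break
    # Extract the greedy policy from the converged values.
    policy = np.array([
        int(np.argmax([sum(p * (r + gamma * V[s2] * (not done))
                           for p, s2, r, done in P[s][a])
                       for a in range(n_actions)]))
        for s in range(n_states)])
    return V, policy
```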
4. Temporal Difference
- (PR #2) Feat: add Temporal Difference algorithm
- (PR #5) Feat: add Double Q-Learning on Taxi environment
- N-step SARSA, SARSAmax, and Expected SARSA on CliffWalking environment: temporal_difference/sarsa_cliffwalking.ipynb (see the sketch after this list)
- The same algorithms packaged as importable classes: temporal_difference/algorithms.py
- Double Q-Learning on Taxi environment: temporal_difference/double_qlearning_taxi.ipynb
- Function Approximation on MountainCar environment: ref github
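A bare-bones SARSAmax (Q-learning) loop, assuming a Gymnasium-style discrete environment; names here are illustrative, not the notebooks' code:

```python
import numpy as np
import gymnasium as gym

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSAmax (Q-learning) on a discrete Gymnasium environment."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # off-policy target: max over next-state actions
            target = r + gamma * (0.0 if terminated else np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q

# Example: Q = q_learning(gym.make("CliffWalking-v0"))
```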
5. Deep Q-Network
- (PR #7) Feat: add DQN algorithm on MountainCar environment
- Deep Q-Network (DQN) on MountainCar environment: dqn/dqn_mountaincar.ipynb (a condensed sketch follows this list)
- Further variants: N-step Deep Q-Network, Categorical Deep Q-Network, Double Deep Q-Network, Dueling Deep Q-Network, Dueling Double Deep Q-Network
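As a reading aid for the notebook, the core DQN loop combines an epsilon-greedy policy, a replay buffer, and a periodically synced target network. This is my own condensed sketch with assumed hyperparameters, not the notebook's code:

```python
import random, collections
import torch, torch.nn as nn
import gymnasium as gym

env = gym.make("MountainCar-v0")
obs_dim, n_actions = env.observation_space.shape[0], env.action_space.n

def make_net():
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = collections.deque(maxlen=50_000)  # replay buffer of transitions
gamma, eps, batch_size = 0.99, 0.1, 64

state, _ = env.reset()
for step in range(20_000):
    # epsilon-greedy action selection
    if random.random() < eps:
        action = env.action_space.sample()
    else:
        with torch.no_grad():
            action = int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())
    next_state, reward, terminated, truncated, _ = env.step(action)
    buffer.append((state, action, reward, next_state, terminated))
    state = next_state if not (terminated or truncated) else env.reset()[0]

    if len(buffer) >= batch_size:
        s, a, r, s2, d = map(lambda x: torch.as_tensor(x, dtype=torch.float32),
                             zip(*random.sample(buffer, batch_size)))
        q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # bootstrap from the frozen target network
            target = r + gamma * target_net(s2).max(1).values * (1 - d)
        loss = nn.functional.mse_loss(q, target)
        opt.zero_grad(); loss.backward(); opt.step()

    if step % 500 == 0:  # periodically hard-sync the target network
        target_net.load_state_dict(q_net.state_dict())
```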
6. Policy Gradient (REINFORCE)
- (PR #8) Feat: add REINFORCE algorithm on LunarLander environment
- (PR #9) Feat: add REINFORCE for CartPole environment
- REINFORCE on LunarLander environment: policy_gradient/reinforce_lunarlander.ipynb
- REINFORCE on CartPole environment: policy_gradient/reinforce_cartpole.ipynb (see the sketch after this list)
- REINFORCE for continuous action space
- REINFORCE with baseline: ref doc
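The REINFORCE notebooks follow the standard Monte Carlo policy-gradient recipe: roll out an episode, compute discounted returns, then ascend log-probability weighted by return. A minimal sketch, assuming PyTorch and Gymnasium (illustrative naming, not the notebooks' code):

```python
import torch, torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(1000):
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(int(action))
        rewards.append(reward)
        done = terminated or truncated

    # discounted return G_t for every step, computed backwards
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction

    loss = -(torch.stack(log_probs) * returns).sum()  # negate for gradient ascent
    opt.zero_grad(); loss.backward(); opt.step()
```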
7. TRPO and PPO
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO) (its clipped objective is sketched after this list)
- ref: All types of Policy Gradient: https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
- ref: PPO: https://docs.cleanrl.dev/rl-algorithms/ppo/#experiment-results_2
- ref: PPO explanation: https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12
- ref: PPO from scratch: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
- ref: https://github.com/lucifer2859/Policy-Gradients/blob/master/ppo.py
- ref: https://towardsdatascience.com/proximal-policy-optimization-ppo-explained-abed1952457b
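The clipped surrogate objective that the PPO references above revolve around fits in a few lines. Given old and new action log-probabilities and an advantage estimate, a sketch of the per-batch loss (my phrasing, not any particular library's API):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (loss = -objective)."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # pessimistic bound: elementwise minimum, then negate to get a loss
    return -torch.min(unclipped, clipped).mean()
```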
8. Actor-Critic
- Q Actor Critic
- TD Actor Critic
- Advantage Actor Critic (A2C) (a one-step update is sketched after this list)
- Advantage Actor Critic (A2C) with continuous actions
- Asynchronous Advantage Actor Critic (A3C)
- ref: https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f
- ref: https://medium.com/intro-to-artificial-intelligence/the-actor-critic-reinforcement-learning-algorithm-c8095a655c14
- ref: https://huggingface.co/learn/deep-rl-course/unit6/introduction?fw=pt
- ref: https://spikingjelly.readthedocs.io/zh_CN/0.0.0.0.8/clock_driven_en/7_a2c_cart_pole.html
- ref: https://github.com/lucifer2859/Policy-Gradients/blob/master/actor-critic.py
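The common thread in these variants is a learned critic that replaces REINFORCE's Monte Carlo return. A one-step advantage actor-critic update, sketched with hypothetical shapes and PyTorch (inputs assumed to be tensors; not this repo's code):

```python
import torch, torch.nn as nn

# Hypothetical setup: obs_dim-dimensional state, n_actions discrete actions.
obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam([*actor.parameters(), *critic.parameters()], lr=1e-3)

def a2c_update(state, action, reward, next_state, terminated):
    """One-step A2C: the advantage is the critic's TD error."""
    v = critic(state).squeeze(-1)
    with torch.no_grad():
        v_next = critic(next_state).squeeze(-1)
        td_target = reward + gamma * v_next * (1.0 - terminated)
    advantage = (td_target - v).detach()

    dist = torch.distributions.Categorical(logits=actor(state))
    actor_loss = -(dist.log_prob(action) * advantage).mean()
    critic_loss = nn.functional.mse_loss(v, td_target)

    loss = actor_loss + 0.5 * critic_loss  # weighted joint objective
    opt.zero_grad(); loss.backward(); opt.step()
```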
9. DDPG, TD3, and SAC
- Deep Deterministic Policy Gradient (DDPG): ref (see the soft-update sketch after this list)
- Twin Delayed DDPG (TD3): ref
- Soft Actor Critic (SAC): ref
- https://github.com/trackmania-rl/tmrl
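A detail shared by DDPG, TD3, and SAC is the Polyak-averaged ("soft") target-network update, in contrast to DQN's periodic hard copy. A minimal sketch (generic PyTorch modules, illustrative tau):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * o_param)
```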
- On-policy vs. Off-policy
- Online Learning vs. Offline Learning
- "World model"
- Exploration techniques
- Self-Imitation Learning on DQN to increase the training speed
- https://docs.cleanrl.dev/
- https://docs.ray.io/en/latest/rllib/index.html#
- https://tianshou.readthedocs.io/en/master/
- https://github.com/thu-ml/tianshou/blob/master/docs/index.rst
- https://github.com/tinkoff-ai/CORL
- http://github.com/HumanCompatibleAI/imitation
- Hierarchical Reinforcement Learning
- Multi-Agent Reinforcement Learning
- (PR #1) Feat: add Monte Carlo algorithm and custom Gridworld environment
- Custom Gridworld environment script: environments/grid_world.py (a generic skeleton is sketched after this list)
- ref: https://github.com/linesd/tabular-methods
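For orientation, a custom environment like the Gridworld script typically subclasses `gymnasium.Env` and implements `reset` and `step`. The skeleton below is a generic illustration with assumed grid size and rewards, not the contents of environments/grid_world.py:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class GridWorldEnv(gym.Env):
    """Illustrative 5x5 gridworld: start top-left, goal bottom-right."""

    def __init__(self, size=5):
        self.size = size
        self.observation_space = spaces.Discrete(size * size)
        self.action_space = spaces.Discrete(4)  # up, down, left, right

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = np.array([0, 0])
        return self._obs(), {}

    def step(self, action):
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
        self.pos = np.clip(self.pos + moves[action], 0, self.size - 1)
        terminated = bool((self.pos == self.size - 1).all())  # reached goal
        reward = 0.0 if terminated else -1.0  # step cost favors short paths
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        # flatten (row, col) into a single discrete state index
        return int(self.pos[0] * self.size + self.pos[1])
```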
- (PR #3) Feat: add algorithm conversion to ONNX format
- Sample model conversion and usage script: onnx_export/sample_ppo.ipynb (a generic export sketch follows)
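The general pattern in such a conversion is to trace a trained policy network with a dummy observation, then run it back through onnxruntime. A hedged sketch with a stand-in network, not the PPO model from the notebook:

```python
import numpy as np
import torch, torch.nn as nn
import onnxruntime as ort

# Stand-in policy network; the notebook exports an actual trained policy.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

dummy_obs = torch.zeros(1, 4)  # batch of one observation for tracing
torch.onnx.export(policy, dummy_obs, "policy.onnx",
                  input_names=["obs"], output_names=["logits"])

# Inference without PyTorch: load the exported graph with onnxruntime.
session = ort.InferenceSession("policy.onnx")
logits = session.run(None, {"obs": np.zeros((1, 4), dtype=np.float32)})[0]
action = int(logits.argmax())
```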