Translation style, terminology
Agent, Policy, Exploration, Action value, On-policy, Environment, Action, Exploitation, Discount rate, Off-policy, State, Reward, Value, Discount factor, T-horizon, Observation, Return, State value, MDP, Epoch, Bellman equation, Bellman optimality equation, Multi-armed bandit problem, Dynamic programming, Offline reinforcement learning, Backup, Episode, History, Trajectory, Model-based, Planning, Prediction, Control, Actor-Critic, Model-free, Rollout, Policy evaluation, Policy iteration (improvement), Value iteration, Temporal Difference, Monte Carlo
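Several of the terms above (Return, Discount factor, Episode/Trajectory, State value, Backup, Bellman optimality equation, Dynamic programming, Value iteration, Policy) fit together in one small example. The sketch below is only an illustration under stated assumptions: the three-state MDP, its rewards, and the function names `discounted_return` and `value_iteration` are hypothetical and do not come from the original note.

```python
# Hypothetical toy MDP for illustration only (not from the original note).
# States 0, 1, 2; state 2 is an absorbing terminal state.
# Actions: 0 = "stay", 1 = "move right". Reaching state 2 from state 1 gives reward 1.

# TRANSITIONS[state][action] = list of (probability, next_state, reward)
TRANSITIONS = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def discounted_return(rewards, gamma):
    """Return G_0 = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one episode (trajectory)."""
    g = 0.0
    for r in reversed(rewards):  # backward accumulation: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g


def q_value(transitions, v, s, a, gamma):
    """One-step lookahead: expected reward plus discounted value of the next state."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in transitions[s][a])


def value_iteration(transitions, gamma, tol=1e-10):
    """Dynamic-programming value iteration using the Bellman optimality backup."""
    v = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            # Bellman optimality backup: V(s) <- max_a sum_s' p(s'|s,a) [r + gamma V(s')]
            best = max(q_value(transitions, v, s, a, gamma) for a in actions)
            delta = max(delta, abs(best - v[s]))
            v[s] = best  # in-place update, reused by later states in the same sweep
        if delta < tol:
            break
    # Greedy policy with respect to the converged state values (ties -> first action)
    policy = {
        s: max(actions, key=lambda a, s=s: q_value(transitions, v, s, a, gamma))
        for s, actions in transitions.items()
    }
    return v, policy


if __name__ == "__main__":
    v, policy = value_iteration(TRANSITIONS, gamma=0.9)
    print("state values:", v)
    print("greedy policy:", policy)
    # Return of the trajectory 0 -> 1 -> 2 (rewards 0, then 1) under gamma = 0.9
    print("return from state 0:", discounted_return([0.0, 1.0], 0.9))
```

With a discount factor of 0.9, the converged state value of state 0 equals the discounted return of the optimal trajectory 0 → 1 → 2, which is one way to see that a state value is the expected return when following the (here deterministic) greedy policy.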