Review (pt. 2)
Signed-off-by: Andrej Orsula <[email protected]>
AndrejOrsula committed Jun 2, 2021
1 parent 6d35561 commit 31da409
Showing 4 changed files with 9 additions and 10 deletions.
2 changes: 1 addition & 1 deletion content/appendix/joint_trajectory_controller.tex
@@ -1,5 +1,5 @@
\section{Joint Trajectory Controller}\label{app:joint_trajectory_controller}

- In order to enable execution of planned motions for robot manipulators inside Ignition Gazebo, a standard joint trajectory controller was implemented as a system plugin and contributed upstream. In its simplest form, it provides simultaneous control of multiple joints, which can be used to follow trajectories generated by a motion planning framework such as MoveIt 2. Each trajectory consists of discrete temporal points that each contain per-joint targets for position, velocity, acceleration and effort. Control of each joint is accomplished by the use of PID controllers for position and velocity control. The effort computed by these controllers is combined with the feed-forward effort from the trajectory itself and then applied to the joint for physics computations.
+ In order to enable execution of planned motions for robotic manipulators inside Ignition Gazebo, a standard joint trajectory controller was implemented as a system plugin and contributed upstream. In its simplest form, it provides simultaneous control of multiple joints, which can be used to follow trajectories generated by a motion planning framework such as MoveIt 2. Each trajectory consists of discrete temporal points that each contain per-joint targets for position, velocity, acceleration and effort. Control of each joint is accomplished by the use of PID controllers for position and velocity control. The effort computed by these controllers is combined with the feed-forward effort from the trajectory itself and then applied to the joint for physics computations.

In this work, trajectories generated by MoveIt 2 are followed with position-controlled joints, where PID gains for both UR5 and Panda robots were manually tuned.
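For reference, the per-joint control law described above can be summarised by the following equation. This is only a conceptual sketch derived from the description (assuming the two PID efforts and the feed-forward effort are simply summed), not the plugin's exact implementation:

\begin{equation*}
    \tau_j \;=\; \mathrm{PID}^{\mathrm{pos}}_j\!\left(p^{*}_j - p_j\right) \;+\; \mathrm{PID}^{\mathrm{vel}}_j\!\left(v^{*}_j - v_j\right) \;+\; \tau^{*}_j\,,
\end{equation*}

where $p^{*}_j$, $v^{*}_j$ and $\tau^{*}_j$ denote the position, velocity and effort targets of the current trajectory point, $p_j$ and $v_j$ are the current position and velocity of joint~$j$, and $\tau_j$ is the effort applied to the joint for physics computations.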
2 changes: 1 addition & 1 deletion content/discussion_and_conclusion.tex
@@ -12,7 +12,7 @@ \chapter{Discussion and Conclusion}\label{ch:discussion_and_conclusion}

When comparing the same hyperparameters across three actor-critic algorithms, an agent using TD3 for training was unable to solve the task. There might be a set of hyperparameters that would make TD3 applicable and allow it to achieve a comparable success rate; however, the results indicate that TD3 is at the very least more sensitive to hyperparameters than the other two algorithms. The distributional representation of TQC's critics provides it with faster learning and a better success rate of~77\%, compared to~64\% for SAC. Therefore, these results support the claim of \citet{kuznetsov_controlling_2020} that TQC outperforms SAC as the new state-of-the-art RL algorithm for continuous control in robotics. Experiments conducted in this work extend this claim to the task of robotic grasping with visual observations and actions in Cartesian space.

- Policies trained with SAC and TQC for 500,000 time steps optimised a behaviour that repeatedly attempts slightly different grasps of an object if previous attempts failed. This behaviour of end-to-end control therefore resembles the entropy maximisation that is targeted by these algorithms during training. Quantitatively, it provides the agent with a better chance of grasping an object during each episode, thereby maximising the accumulated reward. Given a longer episode duration, the success rate could be artificially increased. However, such a policy is qualitatively considered to be excessively chaotic and unsafe. Real-world applications of robotic manipulation are required to meet safety standards and to interact with the environment in a more structured manner. Agents trained in this work struggle to provide such guarantees, and their unsupervised use on real robots is limited to compliant objects that reduce the risk of accidental damage. With this in mind, discrete action spaces, e.g.~a pixel-wise action space with predefined action primitives and safety limits, might currently be more suitable for real-world applications due to their more deterministic behaviour, despite a reduced ability to learn the more complex policies that would improve their task-solving capabilities. It is therefore believed that a theory of safety needs to be developed for RL before it is applicable to solving real-world robot manipulation tasks with continuous end-to-end control.
+ Policies trained with SAC and TQC for 500,000 time steps optimised a behaviour that repeatedly attempts slightly different grasps of an object if previous attempts failed. This behaviour of end-to-end control therefore resembles the entropy maximisation that is targeted by these algorithms during training. Quantitatively, it provides the agent with a better chance of grasping an object during each episode, thereby maximising the accumulated reward. Given a longer episode duration, the success rate could be artificially increased. However, such a policy is qualitatively considered to be excessively chaotic and unsafe. Real-world applications of robotic manipulation are required to meet safety standards and to interact with the environment in a more structured manner. Agents trained in this work struggle to provide such guarantees, and their unsupervised use on real robots is limited to compliant objects that reduce the risk of accidental damage. With this in mind, discrete action spaces, e.g.~a pixel-wise action space with predefined action primitives and safety limits, might currently be more suitable for real-world applications due to their more deterministic behaviour, despite a reduced ability to learn the more complex policies that would improve their task-solving capabilities. It is therefore believed that a theory of safety needs to be developed for RL before it is applicable to solving real-world robotic manipulation tasks with continuous end-to-end control.

The study of ablations brought some unexpected results. Notably, the use of demonstrations reduced the attainable success rate on novel scenes by~7\%, despite faster learning in the early stages. It can be argued that this significant downgrade in performance is caused by a bias introduced by the suboptimal scripted policy, which led to eventual convergence to a locally optimal policy. An agent that needs to explore completely from scratch has a better chance of converging to a policy that is globally optimal and unaffected by such bias. Therefore, the experimental results indicate that the use of demonstrations for RL should be discouraged, if possible, and that other methods with better guarantees, such as curriculum learning, should be applied instead.
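For context, the maximum-entropy objective optimised by SAC and TQC takes the standard form below (reproduced from the literature rather than from this work); it helps explain why the learned policies favour varied, repeated grasp attempts:

\begin{equation*}
    J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \Big[ r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]\,,
\end{equation*}

where $\alpha$ is the temperature that weights the entropy term $\mathcal{H}$ against the reward.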
