Submission
Signed-off-by: Andrej Orsula <[email protected]>
AndrejOrsula committed Jun 3, 2021
1 parent 31da409 commit 70a569d
Showing 13 changed files with 30 additions and 22 deletions.
11 changes: 11 additions & 0 deletions README.md
@@ -10,6 +10,17 @@ Compiled PDF: [**master_thesis.pdf**](./master_thesis.pdf)

Accompanying recordings can be viewed in the [YouTube playlist](https://youtube.com/playlist?list=PLzcIGFRbGF3Qr4XSzAjNwOMPaeDn5J6i1). Data acquired during the experimental evaluation can be found inside the [experimental_evaluation directory](./experimental_evaluation).

## Citation

```bibtex
@mastersthesis{orsula_deep_2021,
author = {Andrej Orsula},
title = {{Deep} {Reinforcement} {Learning} for {Robotic} {Grasping} from {Octrees}},
school = {Aalborg University},
year = {2021}
}
```

## Disclaimer (LaTeX Template)

Parts of the frontmatter are adapted from [jkjaer/aauLatexTemplates](https://github.com/jkjaer/aauLatexTemplates) and modified for use with the `memoir` class.
2 changes: 1 addition & 1 deletion _frontmatter/titlepage.tex
@@ -26,7 +26,7 @@
\textbf{Author:} \\\thesisauthor\bigskip\par
\textbf{Supervisor:} \\\supervisor\bigskip\par
\textbf{Number of Pages:} \\\pageref*{LastPage}\bigskip\par
-\textbf{Submission Date:} \\\today
+\textbf{Submission Date:} \\June 3, 2021
} &
\parbox[t]{\titlepagerightcolumnwidth}{%
{\large\textbf{Abstract:}\bigskip\par}
5 changes: 1 addition & 4 deletions bibliography/bibliography.bib
@@ -40,7 +40,7 @@ @book{sutton_reinforcement_2018
year = {2018},
address = {Cambridge, MA, USA},
isbn = {978-0-262-03924-6},
-publisher = {A Bradford Book},
+publisher = {A~Bradford Book},
shorttitle = {Reinforcement {Learning}}
}

@@ -124,7 +124,6 @@ @article{nian_review_2020
year = {2020},
journal = {Computers \& Chemical Engineering},
volume = {139},
-pages = {106886},
doi = {10.1016/j.compchemeng.2020.106886},
issn = {0098-1354},
language = {en},
@@ -525,7 +524,6 @@ @article{breyer_comparing_2019
year = {2019},
journal = {IEEE Robotics and Automation Letters},
volume = {PP},
-pages = {1--1},
doi = {10.1109/LRA.2019.2896467},
month = jan,
url = {https://doi.org/10.1109/LRA.2019.2896467}
@@ -619,7 +617,6 @@ @article{bousmalis_using_2018
author = {Bousmalis, Konstantinos and Irpan, Alex and Wohlhart, Paul and Bai, Yunfei and Kelcey, Matthew and Kalakrishnan, Mrinal and Downs, Laura and Ibarz, Julian and Pastor, Peter and Konolige, Kurt and Levine, Sergey and Vanhoucke, Vincent},
title = {Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping},
year = {2017},
-pages = {},
month = sep
}

@@ -2,7 +2,7 @@ \section{Camera Configuration and Post-Processing}\label{app:camera_configuratio

In order to improve the success of sim-to-real transfer, the quality of visual observations is of great importance. However, the default configuration of the utilised D435 camera produces a very noisy depth map with many holes. The primary reason for this is the utilised workspace setup, which consists of a reflective surface inside a laboratory with a large amount of ambient illumination. Not only does the smooth metallic surface of the workspace result in a specular reflection of the ceiling lights, but the pattern projected by the laser emitter of the camera is also completely reflected. The lack of such a pattern results in limited material texture of the surface, which further decreases the attainable depth quality.

-To improve the quality of the raw depth map, a few steps are taken. First, the automatic exposure of the camera's IR sensors is configured for a region of interest that covers only the workspace. This significantly reduces hot-spot clipping caused by the specular reflection, which in turn decreases the number of holes. To mitigate noise, spatial and temporal filters are applied to the depth image. In order to achieve the best results, these filters are applied to a corresponding disparity map with a high resolution of~1280~\(\times\)720~px at~30~FPS. Furthermore, the depth map is clipped to the range of interest in order to reduce the computational load. Once filtered, the image is decimated to a more manageable resolution of~320~\(\times\)180~px and converted to a point cloud, which can then be converted to an octree. The post-processed point cloud can be seen in \autoref{app_fig:camera_config_and_post_processing}.
+To improve the quality of the raw depth map, a few steps are taken. First, the automatic exposure of the camera's IR sensors is configured for a region of interest that covers only the workspace. This significantly reduces hot-spot clipping caused by the specular reflection, which in turn decreases the number of holes. To mitigate noise, spatial and temporal filters are applied to the depth image. In order to achieve the best results, these filters are applied to a corresponding disparity map with a high resolution of~1280\(\times\)720~px at~30~FPS. Furthermore, the depth map is clipped to the range of interest in order to reduce the computational load. Once filtered, the image is decimated to a more manageable resolution of~320\(\times\)180~px and converted to a point cloud, which can then be converted to an octree. The post-processed point cloud can be seen in \autoref{app_fig:camera_config_and_post_processing}.
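
For reference, the described chain maps onto the standard `pyrealsense2` post-processing blocks; the following is a minimal sketch under assumed ROI bounds, clipping range and filter settings, not the exact thesis configuration:

```python
# Sketch of the described D435 post-processing chain (pyrealsense2).
# ROI bounds, clipping range and filter settings are illustrative assumptions.
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
profile = pipeline.start(config)

# Limit auto-exposure to a region of interest that covers only the workspace.
roi_sensor = profile.get_device().first_depth_sensor().as_roi_sensor()
roi = roi_sensor.get_region_of_interest()
roi.min_x, roi.min_y, roi.max_x, roi.max_y = 320, 180, 960, 540  # assumed bounds
roi_sensor.set_region_of_interest(roi)

# Spatial/temporal filtering is done in disparity space at full resolution,
# followed by clipping to the range of interest and decimation (1280x720 -> 320x180).
to_disparity = rs.disparity_transform(True)
to_depth = rs.disparity_transform(False)
spatial = rs.spatial_filter()
temporal = rs.temporal_filter()
threshold = rs.threshold_filter(0.2, 1.2)  # assumed range of interest [m]
decimation = rs.decimation_filter(4)
pointcloud = rs.pointcloud()

for _ in range(30):  # a few frames warm up the temporal filter
    depth = pipeline.wait_for_frames().get_depth_frame()
    for f in (to_disparity, spatial, temporal, to_depth, threshold, decimation):
        depth = f.process(depth)
points = pointcloud.calculate(depth)  # point cloud, later converted to an octree
pipeline.stop()
```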

\setcounter{figure}{0}
\begin{figure}[ht]
2 changes: 1 addition & 1 deletion content/background.tex
@@ -7,7 +7,7 @@ \section{Markov Decision Process}

The goal of an RL agent is to maximise the total reward that is accumulated during a sequential interaction with the environment. This paradigm can be expressed with the classical formulation of a Markov decision process (MDP), whose basic interaction loop is illustrated in \autoref{fig:bg_mdp_loop}. In an MDP, the actions of the agent within the environment make it traverse different states and receive corresponding rewards. An MDP is an extension of a Markov chain, with the addition that the agent is allowed to select the actions it executes. Both of these satisfy the Markov property, which assumes that each state depends only on the previous state, i.e.~a memoryless property where each state contains all information that is necessary to predict the next state. Therefore, the MDP formulation is commonly used within the context of RL because it captures a variety of tasks that general-purpose RL algorithms can be applied to, including robotic manipulation tasks.

-It should be noted that a partially observable Markov decision process (POMDP) is a more accurate characterisation of most robotic tasks because the states are commonly unobservable or only partially observable; however, the difficulty of solving POMDPs limits their usage \cite{kroemer_review_2021}. Therefore, this chapter presents only on MDPs where observations and states are considered to be the same.
+It should be noted that a partially observable Markov decision process (POMDP) is a more accurate characterisation of most robotic tasks because the states are commonly unobservable or only partially observable; however, the difficulty of solving POMDPs limits their usage \cite{kroemer_review_2021}. Therefore, this chapter focuses only on MDPs where observations and states are considered to be the same.
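
As a concrete illustration of this interaction loop, the sketch below runs one episode against a Gym-style environment (the classic pre-0.26 `gym` API is assumed, and a random policy stands in for the agent):

```python
# Minimal MDP interaction loop (agent-environment), as in fig:bg_mdp_loop.
# Assumes the classic gym API (reset() -> obs; step() -> obs, reward, done, info).
import gym

env = gym.make("CartPole-v1")  # placeholder environment
state = env.reset()
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()  # a learned policy pi(a|s) would go here
    # Markov property: the next state depends only on the current state and action.
    state, reward, done, _ = env.step(action)
    episode_return += reward
env.close()
```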

\begin{figure}[ht]
\centering
8 changes: 4 additions & 4 deletions content/discussion_and_conclusion.tex
@@ -2,15 +2,15 @@ \chapter{Discussion and Conclusion}\label{ch:discussion_and_conclusion}

The experimental evaluation from the previous chapter indicates that DRL with 3D visual observations can be successfully applied to end-to-end robotic grasping of diverse objects and provides advantages over 2D and 2.5D observations, especially in terms of camera pose invariance. With a static camera pose, 3D octree-based observations were able to reach a success rate of~81.5\% on novel scenes, whereas~59\% was achieved with 2.5D RGB-D observations and~35\% with 2D RGB images. The lower success rate of RGB image observations implies that depth perception is of great importance for robotic manipulation in scenes where objects can be difficult to distinguish from a textured background surface. Octrees provide a better success rate than RGB-D images, which is considered to be due to the better ability of 3D convolutions to generalise over spatial positions and orientations compared to 2D convolutions that only generalise over pixel positions.

-However, the primary strength of 3D observations comes from their invariance to the camera pose. Agents with RGB and RGB-D image observations were unable to learn a policy that would solve robotic grasping if the camera pose is randomised, whereas an agent with octree observations was still able to achieve a success rate of~77\% on novel scenes and camera poses, even with a configuration that has much lower computation complexity than images. Although the use of 3D visual observations requires the relative pose between the camera and the robot to be known or estimated via calibration, this is considered to be a valid assumption for the majority of vision-based manipulation setups.
+However, the primary strength of 3D observations comes from their invariance to the camera pose. Agents with RGB and RGB-D image observations were unable to learn a policy that would solve robotic grasping if the camera pose is randomised, whereas an agent with octree observations was still able to achieve a success rate of~77\% on novel scenes and camera poses, even with a configuration that has a much lower computational complexity than images. Although the use of 3D visual observations requires the relative pose between the camera and the robot to be known or estimated via calibration, this is considered to be a valid assumption for the majority of vision-based manipulation setups.
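
As a sketch of that calibration assumption, a point cloud expressed in the camera frame can be mapped into the robot base frame with a single homogeneous transform; the 4x4 matrix `T_robot_cam` and the helper below are hypothetical stand-ins for the result of hand-eye calibration:

```python
# Sketch: expressing camera-frame points in the robot base frame, given a
# calibrated camera-to-robot extrinsic transform (hypothetical helper).
import numpy as np

def to_robot_frame(points_cam: np.ndarray, T_robot_cam: np.ndarray) -> np.ndarray:
    """points_cam: (N, 3) point cloud; T_robot_cam: (4, 4) homogeneous matrix."""
    ones = np.ones((points_cam.shape[0], 1))
    homogeneous = np.hstack([points_cam, ones])  # (N, 4)
    return (homogeneous @ T_robot_cam.T)[:, :3]  # back to (N, 3)
```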

The combination of PBR rendering and domain randomisation in the implemented simulation environment enabled sim-to-real transfer. In the real-world domain, a policy that was trained solely inside the simulation was able to achieve a success rate of~68.3\% on a variety of real everyday objects. Due to the invariance of octree observations to the camera pose, the setup for the evaluation of sim-to-real transfer did not require an exact replication of its digital counterpart, which resulted in a simpler transfer. This is considered to be advantageous because it allows a single learned policy to be employed in a variety of real-world setups.

-The primary cause of failed grasps on a real robot originates in cases where a finger of the utilised gripper is obscured by another object, which prevents the gripper from closing due to its safety features. The lower success rate compared to the evaluation in simulation could also be attributed to the noise profile of the depth map acquired by a real stereo camera, which is complex and cannot be modelled by the simple Gaussian noise that was applied inside the simulation. Therefore, better modelling of noise patterns in observations, combined with further data augmentation, could result in a more robust policy that would better adapt to real-world visual observations. The addition of more extensive domain randomisation of physical interactions could also bring significant benefits, especially if multiple physics engine implementations with random configurations are used during the training.
+The primary cause of failed grasps on a real robot originates in cases where a finger of the utilised gripper is obstructed by another object, which prevents the gripper from closing due to its safety features. The lower success rate compared to the evaluation in simulation could also be attributed to the noise profile of the depth map acquired by a real stereo camera, which is complex and cannot be modelled by the simple Gaussian noise that was applied inside the simulation. Therefore, better modelling of noise patterns in observations, combined with further data augmentation, could result in a more robust policy that would better adapt to real-world visual observations. The addition of more extensive domain randomisation of physical interactions could also bring significant benefits, especially if multiple physics engine implementations with random configurations are used during the training.
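
For context, the Gaussian noise applied inside the simulation amounts to a per-pixel additive perturbation of the depth map, along the lines of the sketch below (the standard deviation is an assumed value, not the thesis setting):

```python
# Sketch of a simple Gaussian depth-noise model as used in simulation;
# the standard deviation is an illustrative assumption.
import numpy as np

def perturb_depth(depth_m: np.ndarray, sigma_m: float = 0.005) -> np.ndarray:
    """Add zero-mean Gaussian noise to a depth map given in metres."""
    return depth_m + np.random.normal(0.0, sigma_m, size=depth_m.shape)
```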

Having a policy that is completely invariant to the utilised robot could also significantly improve its applicability and ease the transfer to real robots. Experiments show that it is significantly simpler to learn manipulation when using UR5 with the RG2 gripper compared to Panda with its default gripper. A policy transferred from Panda to UR5 also performs much better than a transfer in the opposite direction. The reason for this is presumably the smaller size of Panda's fingers, which affects how precise the gripper pose needs to be before activating it. This difference in task difficulty is often overlooked in robot learning research, which makes comparing reported results across different setups nearly impossible. Therefore, there is a need for a common open-source benchmark for a variety of manipulation tasks that values generalisation over single-object performance.

-When comparing the same hyperparameters on three actor-critic algorithms, an agent using TD3 for training was unable to solve the task. There might be a set of hyperparameters that would make TD3 applicable and allow it achieve a comparable success rate; however, the results indicate that TD3 is at the very least more sensitive to hyperparameters than the other two algorithms. The distributional representation of TQC's critics provides it with faster learning and a better success rate of~77\%, compared to~64\% for SAC. Therefore, these results support the claim of \citet{kuznetsov_controlling_2020} that TQC outperforms SAC as the new state-of-the-art RL algorithm for continuous control in robotics. Experiments conducted in this work extend this claim to the task of robotic grasping with visual observations and actions in Cartesian space.
+When comparing the same hyperparameters on three actor-critic algorithms, an agent using TD3 for training was unable to solve the task. There might be a set of hyperparameters that would make TD3 applicable and allow it to achieve a comparable success rate; however, the results indicate that TD3 is at the very least more sensitive to hyperparameters than the other two algorithms. The distributional representation of TQC's critics provides it with faster learning and a better success rate of~77\%, compared to~64\% for SAC. Therefore, these results support the claim of \citet{kuznetsov_controlling_2020} that TQC outperforms SAC as the new state-of-the-art RL algorithm for continuous control in robotics. Experiments conducted in this work extend this claim to the task of robotic grasping with visual observations and actions in Cartesian space.
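
For reference, TQC differs from SAC mainly in its critics, which predict a set of return quantiles and drop the most optimistic ones to curb overestimation; a minimal training sketch with `sb3_contrib` is shown below (the environment and hyperparameter values are placeholders, not the thesis configuration):

```python
# Sketch: TQC with truncated distributional critics via sb3_contrib.
# Environment and hyperparameter values are placeholders.
import gym
from sb3_contrib import TQC

env = gym.make("Pendulum-v0")  # any continuous-control task
model = TQC(
    "MlpPolicy",
    env,
    top_quantiles_to_drop_per_net=2,                  # truncate optimistic quantiles
    policy_kwargs=dict(n_critics=2, n_quantiles=25),  # distributional critic ensemble
    learning_rate=3e-4,
)
model.learn(total_timesteps=500_000)
```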

Policies trained with SAC and TQC for 500,000 time steps optimised a behaviour that repeatedly attempts slightly different grasps of an object if previous attempts failed. This behaviour of end-to-end control therefore shows signs of the entropy maximisation that is targeted by these algorithms during the training. Quantitatively, it provides the agent with a better chance of grasping an object during each episode and therefore maximises the accumulated reward. Given a longer episode duration, the success rate could be artificially increased. However, qualitative analysis shows such a policy to be excessively chaotic and unsafe. Real-world applications of robotic manipulation are required to meet safety standards and to use a more structured interaction with the environment. Agents trained in this work struggle to provide such guarantees, and their unsupervised use on real robots is limited to compliant objects that reduce the risk of accidental damage. With this in mind, discrete action spaces, e.g.~a pixel-wise action space with predefined action primitives and safety limits, might currently be more suitable for real-world applications due to their more deterministic behaviour, despite a reduced ability to learn the more complex policies that would improve their task-solving capabilities. It is therefore believed that a theory of safety needs to be developed for RL before it is applicable to solving real-world robotic manipulation tasks with continuous end-to-end control.

@@ -20,4 +20,4 @@ \chapter{Discussion and Conclusion}\label{ch:discussion_and_conclusion}

The analysis of feature extractor parameter sharing brings some interesting results. When separate feature extractors are used for each observation stack, the initial learning is much faster than with a single shared feature extractor, which indicates that a different set of features can be useful for historic observations compared to the current one. This result is counterintuitive because separate feature extractors have a much larger number of combined learnable parameters. However, both approaches are able to reach a very similar final success rate, which means that a shared feature extractor is eventually capable of extracting features from octrees that are time-independent.
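
The two variants can be sketched as follows, with a hypothetical `OctreeEncoder` standing in for the 3D-convolutional feature extractor; sharing simply reuses one module (and thus its weights) across all entries of the observation stack:

```python
# Sketch: shared vs. separate feature extractors over a stack of observations.
# `OctreeEncoder` is a hypothetical stand-in for the octree-based 3D CNN.
import torch
import torch.nn as nn

class OctreeEncoder(nn.Module):
    """Placeholder encoder producing a fixed-size feature vector."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class StackedExtractor(nn.Module):
    def __init__(self, stack_size: int = 2, shared: bool = True):
        super().__init__()
        if shared:
            encoder = OctreeEncoder()
            # The same module (and weights) is reused for every stacked observation.
            self.encoders = nn.ModuleList([encoder] * stack_size)
        else:
            # Separate weights per stack entry: more parameters, faster initial learning.
            self.encoders = nn.ModuleList(OctreeEncoder() for _ in range(stack_size))

    def forward(self, obs_stack: list) -> torch.Tensor:
        # Concatenate per-observation features into a single vector.
        return torch.cat([enc(o) for enc, o in zip(self.encoders, obs_stack)], dim=-1)
```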

-Despite the large potential and significant advancements of RL in recent years, its applicability to real-world robotic manipulation tasks is still limited. There are several challenges that need to be addressed before end-to-end policies learned by DRL can be robustly integrated into real robotic systems. Although there have been attempts to improve the sample efficiency of model-free RL algorithms, even off-policy algorithms with experience replay often require millions of transitions to learn the optimal policy. Algorithms based on the maximum entropy reinforcement learning framework, such as SAC and TQC, provide a good step towards balancing the trade-off between exploration \& exploitation; however, a guarantee of safe exploration and subsequent operation is required for safety-critical systems. Sensitivity to hyperparameters is another significant problem that needs to be addressed in order to enable large-scale use of RL. Optimisation of hyperparameters for every task is a very time-consuming procedure due to the long training duration of each trial. Similarly, reproducibility in RL is very challenging for continuous tasks due to the high stochasticity of the environments that many robots operate in. Inexpensive parallelised simulations with high-fidelity physics and rendering could alleviate some of these issues in the near future. It is therefore believed that DRL will have a promising future in the field of robotic manipulation. Themes such as model-based RL, hierarchical RL and aspects of broader generalisation are expected to be extensively studied within this context, where 3D visual observations could be employed to bridge some of concepts together.
+Despite the large potential and significant advancements of RL in recent years, its applicability to real-world robotic manipulation tasks is still limited. There are several challenges that need to be addressed before end-to-end policies learned by DRL can be robustly integrated into real robotic systems. Although there have been attempts to improve the sample efficiency of model-free RL algorithms, even off-policy algorithms with experience replay often require millions of transitions to learn the optimal policy. Algorithms based on the maximum entropy reinforcement learning framework, such as SAC and TQC, provide a good step towards balancing the trade-off between exploration \& exploitation; however, a guarantee of safe exploration and subsequent operation is required for safety-critical systems. Sensitivity to hyperparameters is another significant problem that needs to be addressed in order to enable large-scale use of RL. Optimisation of hyperparameters for every task is a very time-consuming procedure due to the long training duration of each trial. Similarly, reproducibility in RL is very challenging for continuous tasks due to the high stochasticity of the environments that many robots operate in. Inexpensive parallelised simulations with high-fidelity physics and rendering could alleviate some of these issues in the near future. It is therefore believed that DRL will have a promising future in the field of robotic manipulation. Themes such as model-based RL, hierarchical RL and aspects of broader generalisation are expected to be extensively studied within this context, where 3D visual observations could be employed to bridge some of these concepts together.