diff --git a/_frontmatter/titlepage.tex b/_frontmatter/titlepage.tex index 798032f..97d52a5 100644 --- a/_frontmatter/titlepage.tex +++ b/_frontmatter/titlepage.tex @@ -37,8 +37,5 @@ \end{tabular} \capstarttrue% \vfill -\makeatletter -\def\blfootnote{\gdef\@thefnmark{}\@footnotetext} -\makeatother -\blfootnote{\href{mailto:\thesisauthormail}{{\includegraphics[height=6pt]{_misc/email_logo.pdf}}~\thesisauthormail}} +{\noindent\tiny\href{mailto:\thesisauthormail}{{\includegraphics[height=6pt]{_misc/email_logo.pdf}}~\thesisauthormail}} \cleardoublepage diff --git a/_style/_style.tex b/_style/_style.tex index db35913..4aca7fd 100644 --- a/_style/_style.tex +++ b/_style/_style.tex @@ -10,7 +10,7 @@ \usepackage{framed} \usepackage{geometry} \usepackage[dvips]{graphicx} -\usepackage{hyperref} +\usepackage[hyperfootnotes=false]{hyperref} \usepackage[all]{hypcap} \usepackage[utf8]{inputenc} \usepackage{lastpage} diff --git a/content/appendix/_appendix.tex b/content/appendix/_appendix.tex index 4d09972..f39f750 100644 --- a/content/appendix/_appendix.tex +++ b/content/appendix/_appendix.tex @@ -13,5 +13,4 @@ \chapter*{Appendices} \input{content/appendix/camera_pose_calibration} \newpage \input{content/appendix/camera_configuration_and_postprocessing} -\newpage \input{content/appendix/feature_extraction_from_rgb_and_rgbd_observations} diff --git a/content/appendix/camera_configuration_and_postprocessing.tex b/content/appendix/camera_configuration_and_postprocessing.tex index ca351e9..0fbbe41 100644 --- a/content/appendix/camera_configuration_and_postprocessing.tex +++ b/content/appendix/camera_configuration_and_postprocessing.tex @@ -1,8 +1,8 @@ \section{Camera Configuration and Post-Processing}\label{app:camera_configuration_and_postprocessing} -In order to improve success of sim-to-real transfer, the quality of visual observations is of great importance. However, the default configuration of the utilised D435 camera produced a very noisy depth map with many holes. Primary reason for this is the utilised workspace setup that consisted of a reflective surface inside a laboratory with large amount of ambient illumination. Not only does the polished metallic surface of the workspace result in a specular reflection of ceiling lights, the pattern projected by the laser emitter of the camera is completely reflected. Lack of such pattern results in limited material texture of the surface, which further decreases the attainable depth quality. +In order to improve success of sim-to-real transfer, the quality of visual observations is of great importance. However, the default configuration of the utilised D435 camera produces a very noisy depth map with many holes. Primary reason for this is the utilised workspace setup that consisted of a reflective surface inside a laboratory with large amount of ambient illumination. Not only does the smooth metallic surface of the workspace result in a specular reflection of ceiling lights, but the pattern projected by the laser emitter of the camera is completely reflected. Lack of such pattern results in limited material texture of the surface, which further decreases the attainable depth quality. -To improve quality of the raw depth map, few steps are taken. First, automatic expose of the camera's IR sensors is configured for a region of interest that covers only the workspace. This significantly reduces hot-spot clipping caused by the specular reflection, which in turn decreases the amount of holes. 
To mitigate noise, spatial and temporal filters are applied to the depth image. In order to achieve best results, these filters are applied to a corresponding disparity map with a high resolution of~1280~\(\times\)720~px at~30~FPS. Furthermore, the depth map is clipped only to depth rage of interest in order to reduce computational load. Once filtered, the image is decimated to a more manageable resolution of~320~\(\times\)180~px and converted to a point cloud, which can then be converted to an octree. Post-processed point cloud can be seen in \autoref{app_fig:camera_config_and_post_processing}. +To improve the quality of the raw depth map, a few steps are taken. First, automatic exposure of the camera's IR sensors is configured for a region of interest that covers only the workspace. This significantly reduces hot-spot clipping caused by the specular reflection, which in turn decreases the amount of holes. To mitigate noise, spatial and temporal filters are applied to the depth image. In order to achieve the best results, these filters are applied to a corresponding disparity map with a high resolution of~1280~\(\times\)720~px at~30~FPS. Furthermore, the depth map is clipped only to the range of interest in order to reduce computational load. Once filtered, the image is decimated to a more manageable resolution of~320~\(\times\)180~px and converted to a point cloud, which can then be converted to an octree. The post-processed point cloud can be seen in \autoref{app_fig:camera_config_and_post_processing}. \setcounter{figure}{0} \begin{figure}[ht] diff --git a/content/appendix/camera_pose_calibration.tex b/content/appendix/camera_pose_calibration.tex index 1d7c882..a3e8d0b 100644 --- a/content/appendix/camera_pose_calibration.tex +++ b/content/appendix/camera_pose_calibration.tex @@ -1,6 +1,6 @@ \section{Camera Pose Calibration}\label{app:camera_pose_calibration} -For evaluation of sim-to-real transfer, the camera pose is calibrated with respect to the robot base frame. For this, a calibration board with ArUcO markers \cite{garrido-jurado_automatic_2014} is used as an intermediate reference. \autoref{app_fig:calibration_setup} shows the utilised setup. Position of this intermediate reference is first found in the robot coordinate system by positioning robot's tool centre point above origin of the calibration board, and using robot's joint encoders together with forward kinematics. Hereafter, ArUcO pattern is detected from RGB images of the utilised camera. The perceived pixel positions of the pattern are then used with its known design to solve perspective-n-point problem and determine camera pose with respect to the pattern. Once known, pose of the camera is determined with respect to the robot and the calibration board is removed from the scene. +For evaluation of sim-to-real transfer, the camera pose is calibrated with respect to the robot base frame. For this, a calibration board with ArUcO markers \cite{garrido-jurado_automatic_2014} is used as an intermediate reference. \autoref{app_fig:calibration_setup} shows the utilised setup. The position of this intermediate reference is first found in the robot coordinate system by positioning the robot's tool centre point above the origin of the calibration board, and using the robot's joint encoders together with forward kinematics. Hereafter, the ArUcO pattern is detected from RGB images of the utilised camera.
The perceived pixel positions of the pattern are then used with its known design to solve a perspective-n-point problem and determine the camera pose with respect to the pattern. Once known, the pose of the camera is expressed with respect to the robot and the calibration board is removed from the scene. \setcounter{figure}{0} \begin{figure}[ht] diff --git a/content/background.tex b/content/background.tex index c7b97b1..1b76d4d 100644 --- a/content/background.tex +++ b/content/background.tex @@ -5,7 +5,7 @@ \chapter{Background}\label{ch:background} \section{Markov Decision Process} -The goal of RL agent is to maximize the total reward that is accumulated during a sequential interaction with the environment. This paradigm can be expressed with a classical formulation of Markov decision process (MDP), where \autoref{fig:bg_mdp_loop} illustrates its basic interaction loop. In MDPs, actions of agent within the environment make it traverse different states and receive corresponding rewards. MDP is an extension of Markov chains, with an addition that agents are allowed to select the actions they execute. Both of these satisfy the Markov property, which assumes that each state is only dependent on the previous state, i.e.~a memoryless property where each state contains all information that is necessary to predict the next state. Therefore, MDP formulation is commonly used within the context of RL because it captures a variety of tasks that general-purpose RL algorithms can be applied to, including robotic manipulation tasks. +The goal of an RL agent is to maximise the total reward that is accumulated during a sequential interaction with the environment. This paradigm can be expressed with the classical formulation of a Markov decision process (MDP), where \autoref{fig:bg_mdp_loop} illustrates its basic interaction loop. In MDPs, actions of the agent within the environment make it traverse different states and receive corresponding rewards. An MDP is an extension of a Markov chain, with the addition that agents are allowed to select the actions they execute. Both of these satisfy the Markov property, which assumes that each state is only dependent on the previous state, i.e.~a memoryless property where each state contains all information that is necessary to predict the next state. Therefore, the MDP formulation is commonly used within the context of RL because it captures a variety of tasks that general-purpose RL algorithms can be applied to, including robotic manipulation tasks. It should be noted that a partially observable Markov decision process (POMDP) is a more accurate characterisation of most robotics tasks because the states are commonly unobservable or only partially observable, however, the difficulty of solving POMDPs limits their usage \cite{kroemer_review_2021}. Therefore, this chapter focuses only on MDPs, where observations and states are considered to be the same. @@ -85,7 +85,7 @@ \subsection{Value-Based Methods}\label{subsec:bg_value_based_methods} \subsection{Policy-Based Methods} -Instead of determining actions based on their value, policy-based methods directly optimize a stochastic policy~\(\pi\) as a probability distribution~\(\pi(a \vert s, \theta)\) that is parameterised by~\(\theta\). +Instead of determining actions based on their value, policy-based methods directly optimise a stochastic policy~\(\pi\) as a probability distribution~\(\pi(a \vert s, \theta)\) that is parameterised by~\(\theta\).
\begin{equation} \pi(a \vert s, \theta) = \Pr\{A_{t}{=}a \vert S_{t}{=}s, \theta_{t}{=}\theta \} \end{equation} @@ -98,7 +98,7 @@ \subsection{Actor-Critic Methods} In contrast to value- and policy-based methods as the two primary categories, actor-critic methods include algorithms that utilise both a parameterised policy, i.e.~the actor, and a value function, i.e.~the critic. This is achieved by using separate networks, where the actor and critic can sometimes share some common parameters. Such a combination allows actor-critic algorithms to simultaneously possess advantages of both approaches such as sample efficiency and continuous action space. Therefore, these properties have made actor-critic methods popular for robotic manipulation while achieving state-of-the-art performance among other RL approaches in this domain. -Similar to policy-based methods, the actor network learns the probability of selecting a specific action~\(a\) in a given state~\(s\) as~\(\pi(a \vert s, \theta)\). The critic network estimates action-value function~\(Q(s, a)\) by minimising TD error~\(\delta_{t}\) via \autoref{eq:q_learning}, which is used to critique the actor based on how good the selected action is. This process is visualized in \autoref{fig:bg_actor_critic_loop}. It is however argued that the co-dependence of each other's output distribution can result in instability during learning and make them difficult to tune \cite{quillen_deep_2018}. Despite of this, actor-critic model-free RL algorithms are utilised in this work. +Similar to policy-based methods, the actor network learns the probability of selecting a specific action~\(a\) in a given state~\(s\) as~\(\pi(a \vert s, \theta)\). The critic network estimates the action-value function~\(Q(s, a)\) by minimising the TD error~\(\delta_{t}\) via \autoref{eq:q_learning}, which is used to critique the actor based on how good the selected action is. This process is visualised in \autoref{fig:bg_actor_critic_loop}. It is however argued that the co-dependence of the actor and critic on each other's output distributions can result in instability during learning and make them difficult to tune \cite{quillen_deep_2018}. Despite this, actor-critic model-free RL algorithms are utilised in this work. \begin{figure}[ht] \centering diff --git a/content/experimental_evaluation.tex b/content/experimental_evaluation.tex index 4a9e021..a2b6341 100644 --- a/content/experimental_evaluation.tex +++ b/content/experimental_evaluation.tex @@ -21,7 +21,7 @@ \subsection{Real}\label{subsec:real_setup} \begin{figure}[ht] \centering - \includegraphics[width=0.75\textwidth]{experimental_evaluation/real_setup.png} + \includegraphics[width=0.66\textwidth]{experimental_evaluation/real_setup.png} \caption{UR5 robot with RG2 gripper and RealSense D435 camera in a setup that is used to evaluate sim-to-real transfer.} \label{fig:real_setup} \end{figure} @@ -30,7 +30,7 @@ \subsection{Real}\label{subsec:real_setup} \begin{figure}[ht] \centering - \includegraphics[width=0.75\textwidth]{experimental_evaluation/real_objects.png} + \includegraphics[width=0.66\textwidth]{experimental_evaluation/real_objects.png} \caption{A set of 18 objects that were used during the evaluation of sim-to-real transfer.} \label{fig:real_objects} \end{figure} @@ -40,11 +40,11 @@ \section{Results} Results of the following experiments are presented in this section. First, actor-critic algorithms are compared on the created simulation environment in order to select the best performing one for the task of robotic grasping.
Hereafter, octree-based 3D observations are compared to traditional 2D and 2.5D image observations, and studied with respect to camera pose invariance. Similarly, invariance to the utilised robot is evaluated for both training process and transfer of already learned policy. Lastly, results of sim-to-real transfer are presented. -All agents were trained over the duration of~500,000 time steps, which is assumed to provide a comparative analysis among the different experiments from this work. It is expected, that the final performance for many of these agents can be improved with a longer training duration. On average, each agent takes~65~hours to complete~500,000 steps while training on a laptop with Intel Core i7-10875H CPU and Nvidia Quadro T2000 GPU. Therefore, only a single random seed is employed for all agents due to the time-consuming training procedure and time constraints. However, use of several different seeds with longer training duration is encouraged as it would provide more definitive results. During the training of each agent, success rate and episode lengths are logged for grasps on the training dataset while agent follows its current policy that also contains exploration noise and it is stochastic for SAC and TQC. After training, each agent is evaluated on novel scenes for~200 episodes, where deterministic actions are selected each step based on the learned policy. +All agents were trained over the duration of~500,000 time steps, which is assumed to provide a comparative analysis among the different experiments from this work. It is expected, that the final performance for many of these agents can be improved with a longer training duration. On average, each agent takes~65~hours to complete~500,000 steps while training on a laptop with Intel Core i7-10875H CPU and Nvidia Quadro T2000 GPU. Therefore, only a single random seed is employed for all agents due to the time-consuming training procedure and time constraints. However, use of several different seeds with a longer training duration is encouraged as it would provide more definitive results. During the training of each agent, success rate is logged for grasps on the training dataset while the agent follows its current stochastic policy that contains exploration noise. After training, each agent is evaluated on novel scenes for~200 episodes, where deterministic actions are selected each step based on the learned policy. \subsection{Comparison of Actor-Critic Algorithms} -TD3, SAC and TQC were trained using the same grasping environment with a network architecture presented in \autoref{subsec:actor_critic_network_architecture} and hyperparameters from \hyperref[app:hyperparameters]{appendix~\ref*{app:hyperparameters}}. The success rate during training and the final success rate on novel objects and textures is presented in \hyperref[fig:training_curve_comparison_actor_critic_algorithms]{\figurename~\ref*{fig:training_curve_comparison_actor_critic_algorithms}~\&~\tablename~\ref*{tab:training_curve_comparison_actor_critic_algorithms}} for all three algorithms. +TQC, SAC and TD3 were trained using the same grasping environment with a network architecture presented in \autoref{subsec:actor_critic_network_architecture} and hyperparameters from \hyperref[app:hyperparameters]{appendix~\ref*{app:hyperparameters}}. 
The success rate during training and the final success rate on novel objects and textures is presented in \hyperref[fig:training_curve_comparison_actor_critic_algorithms]{\figurename~\ref*{fig:training_curve_comparison_actor_critic_algorithms}~\&~\tablename~\ref*{tab:training_curve_comparison_actor_critic_algorithms}} for all three algorithms. The episode lengths of successful episodes are also logged in order to determine how fast an agent can grasp previously unseen objects. \begin{figure}[ht] \centering @@ -67,7 +67,7 @@ \subsection{Comparison of Actor-Critic Algorithms} \addtocounter{table}{1} \captionlistentry[table]{} \label{tab:training_curve_comparison_actor_critic_algorithms} - \caption{Success rate of TD3, SAC and TQC algorithms on the created grasping environment. All plots are processed with moving average,~\(n\)~=~\(100\), and exponential smoothing,~\(\alpha\)~=~\(0.002\).} + \caption{Comparison of TQC, SAC and TD3 algorithms on the created grasping environment. The training success rate is processed with a moving average,~\(n = 100\), and exponential smoothing,~\(\alpha = 0.002\), for all agents.} \label{fig:training_curve_comparison_actor_critic_algorithms} \end{figure} @@ -85,7 +85,7 @@ \subsection{Comparison of 2D/2.5D/3D Observations}\label{subsec:comparison_of_2d \textbf{Octree} & \textbf{RGB-D} & \textbf{RGB} \\ \hline Learnable Parameters & 226,494 & 229,680 & 229,248 \end{tabular} - \caption{Number of learnable parameters per each observation stack for the utilised RGB, RGB-D and octree-based feature extractors.} + \caption{Number of learnable parameters per each observation stack for the utilised octree-based, RGB and RGB-D feature extractors.} \label{tab:feature_extractor_number_of_learnable_parameters_comparison} \end{table} @@ -102,7 +102,7 @@ \subsection{Comparison of 2D/2.5D/3D Observations}\label{subsec:comparison_of_2d \centering \begin{tabular}{c|ccc} & \textbf{Octree} & \textbf{RGB-D} & \textbf{RGB} \\ \hline - \begin{tabular}[c|]{@{}c@{}}Success\\Rate\end{tabular} & 77\% & 5\% & 3\% \\[4mm] + \begin{tabular}[c|]{@{}c@{}}Success\\Rate\end{tabular} & 77\% & 5\% & 3\% \\[4mm] \begin{tabular}[c|]{@{}c@{}}Episode\\Length\end{tabular} & 14.0 & 36.5 & 51.0 \end{tabular} \caption*{\textit{Evaluation on novel scenes}} @@ -112,7 +112,7 @@ \subsection{Comparison of 2D/2.5D/3D Observations}\label{subsec:comparison_of_2d \addtocounter{table}{1} \captionlistentry[table]{} \label{tab:training_curve_comparison_2d_2_5d_3d_random_camera_pose} - \caption{Success rate of RGB, RGB-D and octree-based feature extractors on environment with randomised camera pose.} + \caption{Results of octree-based, RGB and RGB-D feature extractors on the full environment that randomises camera pose on each episode.} \label{fig:training_curve_comparison_2d_2_5d_3d_random_camera_pose} \end{figure} @@ -129,7 +129,7 @@ \subsection{Comparison of 2D/2.5D/3D Observations}\label{subsec:comparison_of_2d \centering \begin{tabular}{c|ccc} & \textbf{Octree} & \textbf{RGB-D} & \textbf{RGB} \\ \hline - \begin{tabular}[c|]{@{}c@{}}Success\\Rate\end{tabular} & 81.5\% & 59\% & 35\% \\[4mm] + \begin{tabular}[c|]{@{}c@{}}Success\\Rate\end{tabular} & 81.5\% & 59\% & 35\% \\[4mm] \begin{tabular}[c|]{@{}c@{}}Episode\\Length\end{tabular} & 24.6 & 9.4 & 9.3 \end{tabular} \caption*{\textit{Evaluation on novel scenes}} @@ -139,18 +139,20 @@ \subsection{Comparison of 2D/2.5D/3D Observations}\label{subsec:comparison_of_2d \addtocounter{table}{1} \captionlistentry[table]{} 
\label{tab:training_curve_comparison_2d_2_5d_3d_fixed_camera_pose} - \caption{Success rate of RGB, RGB-D and octree-based feature extractors on environment with a fixed camera pose.} + \caption{Results of octree-based, RGB and RGB-D feature extractors on environment with a fixed camera pose.} \label{fig:training_curve_comparison_2d_2_5d_3d_fixed_camera_pose} \end{figure} -Lastly, comparison of memory usage and computational time is presented in \autoref{tab:feature_extractor_memory_and_computational_time}. +\newpage + +Lastly, a comparison of memory usage and computational time for the three observation types and feature extractors is presented in \autoref{tab:feature_extractor_memory_and_computational_time}. \begin{table}[ht] \centering \begin{tabular}{r|ccc} & \textbf{Octree} & \textbf{RGB-D} & \textbf{RGB} \\ \hline - Resolution \textit{(per sample)} & 16\(\times\)16\(\times\)16 & 128\(\times\)128 & 128\(\times\)128 \\ + Shape \textit{(per sample)} & 16\(\times\)16\(\times\)16 & 128\(\times\)128 & 128\(\times\)128 \\ Cell Count \textit{(per sample)} & 4096 octets (theoretical) & 16384 px & 16384 px \\ \multirow{2}{*}{Size \textit{(per sample)}} & 27 kB (average) & \multirow{2}{*}{49 kB} & \multirow{2}{*}{115 kB} \\ & 44 kB (maximum) & & \\ \hline @@ -159,28 +161,16 @@ \subsection{Comparison of 2D/2.5D/3D Observations}\label{subsec:comparison_of_2d Forward \textit{(average, batch of 32)} & 2.1 ms & 0.8 ms & 0.7 ms \\ TQC Update \textit{(average, batch of 32)} & 32.4 ms & 141.7 ms & 82.3 ms \end{tabular} - \caption{Comparison of computational complexity for octree, RGB and RGB-D observations with their corresponding feature extractors. Pre-processing of octrees is performed during data collection and consists of point cloud processing, estimation of normals and creation of octree. Colour features are stored in octree as 32-bit floating point values, whereas RGB and RGB-D utilise byte array for memory efficiency in order to allow use of the same replay buffer size. Therefore, the time of batch formation includes conversion of colour channels to floating point values for RGB and RGB-D images.} + \caption{Comparison of computational complexity for octree-based, RGB and RGB-D observations with their corresponding feature extractors. Pre-processing of octrees is performed during data collection and consists of point cloud processing, estimation of normals and creation of octree. Colour features are stored in octree as 32-bit floating point values, whereas RGB and RGB-D utilise byte arrays for memory efficiency in order to allow use of the same replay buffer size. Therefore, the time of batch formation includes conversion of colour channels to floating point values for RGB and RGB-D images.} \label{tab:feature_extractor_memory_and_computational_time} \end{table} - -\subsection{Invariance to Camera Pose} - -Based on the results of the previous experiment, agent with octree observations and fully randomised camera pose is evaluated with respect to its learned generalisation to different camera poses on novel scenes. A total of X various poses with different azimuth and height are evaluated, with corresponding results shown in \autoref{fig:invariance_to_camera_pose}. 
- -\begin{figure}[ht] - \centering - % \includegraphics[width=1.0\textwidth]{experimental_evaluation/.pdf} - \caption{Success rate to novel scenes for different camera poses.} - \label{fig:invariance_to_camera_pose} -\end{figure} - +\newpage \subsection{Invariance to Robot} In addition to training agents with octree observations on UR5 robot with RG2 gripper, an agent is also trained on Panda robot in order to study the robustness of state-of-the-art actor-critic algorithm with octree observations to different kinematic chains and gripper designs. Comparison of success rate between UR5 and Panda can be seen in \hyperref[fig:invariance_to_robot]{\figurename~\ref*{fig:invariance_to_robot}~\&~\tablename~\ref*{tab:invariance_to_robot}}. -% Consider skipping this figure if section gets too crowded \begin{figure}[ht] \centering \begin{subfigure}[ht]{0.5845\textwidth} @@ -202,7 +192,7 @@ \subsection{Invariance to Robot} \addtocounter{table}{1} \captionlistentry[table]{} \label{tab:invariance_to_robot} - \caption{Success rate of UR5 and Panda robots using the same environment, algorithm and hyperparameters.} + \caption{Results of using the same algorithm and hyperparameters on the created environment with UR5 and Panda robots.} \label{fig:invariance_to_robot} \end{figure} @@ -216,14 +206,15 @@ \subsection{Invariance to Robot} \multirow{2}{*}{\begin{tabular}[c]{@{}c@{}}Training\end{tabular}} & \textbf{UR5} & 77\% & 27.5\% \\ & \textbf{Panda} & 75\% & 61.5\% \end{tabular} - \caption{Comparison of success rate on novel scenes for policies trained one robot and evaluated on another. UR5 robot with RG2 gripper and Panda robot with its default gripper were evaluated.} + \caption{Comparison of success rate on novel scenes for policies trained on one robot and evaluated on another, for UR5 with RG2 gripper and Panda robot with its default gripper.} \label{tab:results_robot_transfer} \end{table} +\newpage \subsection{Sim-to-Real Transfer} -Finally, an agent trained inside simulation is evaluated in real-world domain to study the feasibility of sim-to-real transfer for environment with extensive domain randomisation and octree-based observations. Setup described in \autoref{subsec:real_setup} is used, where objects are randomly replaced after each success or~100 time steps. With this setup,~41 out of~60 episodes were successful, which results in a success rate of~68\%. \autoref{fig:sim_to_real_success_examples} shows few examples of successful grasps and a recording is available on YouTube\footnote{\href{https://youtube.com/watch?v=btxqzFOgCyQ&list=PLzcIGFRbGF3Qr4XSzAjNwOMPaeDn5J6i1}{https://youtube.com/watch?v=btxqzFOgCyQ}}. +Finally, an agent trained inside simulation is evaluated in real-world domain to study the feasibility of sim-to-real transfer for environment with domain randomisation and octree-based observations. Setup described in \autoref{subsec:real_setup} is used, where objects are randomly replaced after each success or after~100 time steps have elapsed. With this setup,~41 out of~60 episodes were successful, which results in a success rate of~68.3\%. \autoref{fig:sim_to_real_success_examples} shows few examples of successful grasps and a recording is available on YouTube\footnote{\href{https://youtube.com/watch?v=btxqzFOgCyQ&list=PLzcIGFRbGF3Qr4XSzAjNwOMPaeDn5J6i1}{https://youtube.com/watch?v=btxqzFOgCyQ}}. 
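For illustration, the following is a minimal sketch of how such a deterministic evaluation loop can be structured with Stable Baselines3; the model path, the \texttt{make\_grasping\_env} constructor and the \texttt{is\_success} entry of the info dictionary are illustrative placeholders rather than the exact implementation used in this work.
\begin{verbatim}
from sb3_contrib import TQC  # TQC implementation from the SB3-Contrib package

# Load the policy trained in simulation; the path is a placeholder.
model = TQC.load("octree_grasping_agent")
env = make_grasping_env()  # hypothetical constructor of the Gym environment

episodes, successes = 60, 0
for _ in range(episodes):
    obs, done, info = env.reset(), False, {}
    while not done:
        # Deterministic actions from the learned policy (no exploration noise).
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
    # Assumes the environment reports grasp success in the info dictionary.
    successes += int(info.get("is_success", False))

print(f"Success rate: {successes / episodes:.1%}")  # 41/60 corresponds to 68.3%
\end{verbatim}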
\begin{figure}[ht] \centering @@ -232,6 +223,7 @@ \subsection{Sim-to-Real Transfer} \label{fig:sim_to_real_success_examples} \end{figure} +\newpage \section{Ablation Studies} @@ -273,7 +265,6 @@ \section{Ablation Studies} \begin{tabular}[c|]{@{}c@{}}Episode\\Length\end{tabular} & 14.0 & 24.5 & 19.9 & 29.4 & 23.0 & 27.5 & 22.8 \end{tabular} - \caption*{\textit{Evaluation on novel scenes}} \end{subfigure}% \captionsetup{labelformat=figure_and_table} \addtocounter{figure}{-1} diff --git a/content/implementation.tex b/content/implementation.tex index 1614050..3a94da7 100644 --- a/content/implementation.tex +++ b/content/implementation.tex @@ -10,7 +10,9 @@ \section{Simulation Environment}\label{sec:impl_simulation_environment} \subsection{Selection of Robotics Simulator} -There is a variety of simulation tools that could be applied for training RL agents for robotics, some of which are based on video game engines due to their mature state. Generally, a trade-off between accuracy, stability and performance that must be considered. Although most everyday objects have certain properties of soft bodies, rigid-body dynamics usually provide a satisfactory degree of realism for generic robotic grasping without suffering much performance loss. Therefore, a considered simulator shall have an appropriate physics engine for handling environments with a number of rigid bodies, and a support for actuated joints that can be used to connect links of a robot. Similarly, PBR rendering capabilities are highly preferred because of the utilised visual observations. Some of the popular simulators for robotics RL research are therefore described with aim to select one that will be used to implement the environment. +There is a variety of simulation tools that could be applied for training of RL robotics agents, some of which are based on video game engines due to their mature state. Generally, a trade-off between accuracy, stability and performance must be considered. Although most everyday objects have certain properties of soft bodies, rigid-body dynamics usually provide a satisfactory degree of realism for generic robotic grasping without suffering much performance loss. Therefore, a considered simulator shall have an appropriate physics engine for handling environments with a number of rigid bodies, and a support for actuated joints that can be used to connect links of a robot. Similarly, PBR rendering capabilities are highly preferred because of the utilised visual observations. Some of the popular simulators for robotics RL research are therefore described with aim to select one that will be used to implement the environment. + +\newpage \paragraph{MuJoCo~\protect\cite{todorov_mujoco_2012}} MuJoCo is a physics engine that can accurately model physical interactions. It has been a popular choice for robotics research for years, including RL applications. Unfortunately, MuJoCo is a proprietary software, which has resulted in the decline of its use over the recent years in favour of open-source alternatives. Furthermore, it has limited rendering capabilities. @@ -20,13 +22,13 @@ \subsection{Selection of Robotics Simulator} \paragraph{Ignition Gazebo\protect\footnote{\href{https://ignitionrobotics.org}{https://ignitionrobotics.org}}} Due to the limitations and outdated architecture, Gazebo Classic is planned to be deprecated in favour of Ignition Gazebo, i.e.~the next generation of Gazebo. Although it is in its early development, Ignition Gazebo supports DART physics engine and has an upcoming support for Bullet. 
In addition to OGRE~1, PBR rendering is enabled by using OGRE~2, and there is also partial support for ray tracing with OptiX\footnote{\href{https://developer.nvidia.com/optix}{https://developer.nvidia.com/optix}}. Both physics and rendering engines can be loaded during runtime due to the utilised plugin-based architecture. Although little RL robotics research has been conducted with the use of Ignition Gazebo so far, \citet{ferigo_gym-ignition_2020} introduced Gym-Ignition as a framework that simplifies its usage for RL research. -\paragraph{Isaac\protect\footnote{\href{https://developer.nvidia.com/isaac-sim}{https://developer.nvidia.com/isaac-sim}, \href{https://developer.nvidia.com/isaac-gym}{https://developer.nvidia.com/isaac-gym}}} Isaac Sim is a new and promising robotics simulator that is being developed by Nvidia. It utilises PhysX physics engine and has support for state-of-the-art PBR rendering. Isaac Gym is extension of Isaac for RL. One of its significant advantages is that physics computations, rendering as well as the process of determining rewards can be offloaded to GPU and enable running large number of environments in parallel. Unfortunately, the proprietary nature of Isaac might limit its use and possible customisation. Furthermore, Isaac Gym is still available only as an early access as of May~2021 with limited functionalities. +\paragraph{Isaac\protect\footnote{\href{https://developer.nvidia.com/isaac-sim}{https://developer.nvidia.com/isaac-sim}, \href{https://developer.nvidia.com/isaac-gym}{https://developer.nvidia.com/isaac-gym}}} Isaac Sim is a new and promising robotics simulator that is being developed by Nvidia. It utilises the PhysX physics engine and has support for state-of-the-art PBR rendering. Isaac Gym is an extension of Isaac for RL. One of its significant advantages is that physics computations, rendering as well as the process of determining rewards can be offloaded to the GPU in order to enable running a large number of environments in parallel. Unfortunately, the proprietary nature of Isaac might limit its use and possible customisation. As of May~2021, Isaac Gym is still available only as early access and its functionalities are limited. \bigskip From the considered robotics simulators, Ignition Gazebo is selected in this work due to the following reasons. Compared to MuJoCo, which requires a license, it is open-source, which significantly encourages reproducibility. Although Isaac might be a very promising choice for robotics RL research in the future, it is still under development and its proprietary nature could make it difficult to extend for the needs of this work. PyBullet is currently considered to be one of the best open-source options due to its maturity and a large amount of RL research that has already been conducted with it. However, it lacks PBR rendering capabilities that are already part of Ignition Gazebo. Furthermore, the plugin-based architecture of Ignition Gazebo simplifies the addition of new physics engines, where Bullet support is already pending. Its ability to switch between various physics engines during run-time could eventually provide Ignition Gazebo with one of the best physics-based domain randomisation capabilities, as it would not only allow randomising physics parameters but also the entire physics implementation. The major disadvantage of the selected Ignition Gazebo robotics simulator is its relatively early stage and a very limited amount of RL research conducted with it.
Despite this, the full availability of its source code makes it possible to extend where needed. Gazebo Classic was excluded from these considerations due to its planned deprecation. -Therefore, Ignition Gazebo is used to create an environment for robotic grasping with RL. For the physics engine, the default option of DART is kept unchanged. For rendering engine, OGRE~2 is selected due to its PBR capabilities. Gym-Ignition \cite{ferigo_gym-ignition_2020} is utilised because it simplifies interaction with Ignition Gazebo with focus on RL research. Furthermore, Gym-Ignition facilitates the process of exposing OpenAI Gym interface for the environments, which provides a standardised form that makes environments compatible with most RL frameworks that contain implementations of algorithms. +Therefore, Ignition Gazebo is used to create an environment for robotic grasping with RL. For the physics engine, the default option of DART is kept unchanged. For the rendering engine, OGRE~2 is selected due to its PBR capabilities. Gym-Ignition \cite{ferigo_gym-ignition_2020} is utilised because it simplifies interaction with Ignition Gazebo with a focus on RL research. Furthermore, Gym-Ignition facilitates the process of exposing an OpenAI Gym \cite{brockman_openai_2016} interface for the environments, which provides a standardised form that makes environments compatible with most RL frameworks that contain implementations of algorithms. \subsection{Environment for Robotic Grasping} @@ -42,13 +44,13 @@ \subsubsection{Robot Models} \centering \begin{subfigure}[ht]{0.4975\textwidth} \centering - % \includegraphics[width=0.75\textwidth]{implementation/ur5_robot.png} + \includegraphics[width=0.75\textwidth]{implementation/ur5_rg2.png} \caption*{UR5 with RG2 sweeping-parallel gripper} \end{subfigure}% ~% \begin{subfigure}[ht]{0.4975\textwidth} \centering - % \includegraphics[width=0.75\textwidth]{implementation/panda_robot.png} + \includegraphics[width=0.75\textwidth]{implementation/panda.png} \caption*{Panda with its default parallel gripper} \end{subfigure}% \caption{Robot models used inside the simulation environment for robotic grasping.} @@ -59,6 +61,8 @@ \subsubsection{Robot Models} With this information, a description that uses the Simulation Description Format (SDF) compatible with Ignition Gazebo was created for both robots \cite{orsula_manipulators_2021}. A simplification for the sweeping-parallel RG2 gripper was made in order to provide better stability. It was modelled by using a single actuated revolute joint per finger, whereas the full model would use three additional passive joints on each finger. The parallel gripper for Panda is modelled with two prismatic joints, i.e.~one for each finger. +\newpage + \paragraph{Motion Planning} To control the motion of both robots, a joint trajectory controller described in \hyperref[app:joint_trajectory_controller]{appendix~\ref*{app:joint_trajectory_controller}} was implemented for Ignition Gazebo. It follows trajectories that are generated in Cartesian space by the use of the MoveIt~2\footnote{\href{https://moveit.ros.org}{https://moveit.ros.org}} motion planning framework. In this framework, the default configurations of TRAC-IK \cite{beeson_trac-ik_2015} and RRTConnect \cite{kuffner_rrt-connect_2000} were used for solving kinematics and motion planning, respectively. An advantage of utilising MoveIt~2 is that a single interface can be used to control both simulated and real robots during sim-to-real transfer.
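Returning to the standardised OpenAI Gym interface mentioned above, the following is an illustrative skeleton of the structure that such a grasping environment exposes to RL frameworks; the class name, space dimensions and method bodies are placeholders and not the actual Gym-Ignition implementation.
\begin{verbatim}
import gym
import numpy as np
from gym import spaces

class GraspingEnv(gym.Env):
    """Illustrative skeleton of the standardised OpenAI Gym interface."""

    def __init__(self):
        # Placeholder spaces: the real observations are octrees combined with
        # proprioceptive values, and actions command the end effector.
        self.observation_space = spaces.Box(low=-1.0, high=1.0,
                                            shape=(64,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0,
                                       shape=(5,), dtype=np.float32)

    def reset(self):
        # Respawn and randomise the scene, then return the first observation.
        return self.observation_space.sample()

    def step(self, action):
        # Execute the action via the motion planner, advance the simulation,
        # acquire a new observation and compute the reward.
        observation = self.observation_space.sample()
        reward, done, info = 0.0, False, {}
        return observation, reward, done, info
\end{verbatim}
Any RL framework that expects this interface, such as the one employed later in this chapter, can then interact with the environment without knowledge of the underlying simulator.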
@@ -112,36 +116,40 @@ \subsubsection{Dataset} All of these objects contain only their corresponding mesh geometry and material texture but lack all other properties. Inertial properties were therefore estimated from their geometry in a procedure that is similar to the aforementioned robot models. The mass of each object used during such estimation was randomly selected alongside other properties of the model, which is detailed below in \autoref{sec:impl_domain_randomisation}. This also includes the scale of their geometry, as many of these objects would be too large to fit inside the utilised grippers. -The 3D scanned objects contain meshes with a very high resolution, which makes them unsuitable for computing physical interactions due to the enormous computational cost it would bring. Therefore, a low resolution copy of each mesh is created for use as a collision geometry, alongside the original mesh that is kept for visual appearance. Such copy is automatically generated for each model by simplifying the original mesh geometry though decimation procedure based quadric error metrics by \citet{garland_surface_1997}. The algorithm was configured to reduce the geometry to~2.5\% of the original faces but clipped to the range of~[8,~400] faces in order to avoid outliers. +The 3D scanned objects contain meshes with a very high resolution, which makes them unsuitable for computing physical interactions due to the enormous computational cost it would bring. Therefore, a low-resolution copy of each mesh is created for use as a collision geometry, alongside the original mesh that is kept for visual appearance. Such a copy is automatically generated for each model by simplifying the original mesh geometry through a decimation procedure based on quadric error metrics by \citet{garland_surface_1997}. The algorithm was configured to reduce the geometry to~2.5\% of the original faces but clipped to the range of~[8,~400] faces in order to avoid outliers. \subsubsection{Performance of Simulation} Having a performant simulation accelerates the data collection, which can in turn enable faster iteration for RL research due to reduced training duration. Besides reducing computational load by decimating the geometry of objects, a few more tricks are applied in this work. +\newpage + \paragraph{Disabling of Collision for Robot Links} During early trials, it was found that a collision never occurs between robot links and the environment. This is primarily because MoveIt~2 is used to plan collision-free trajectories. Furthermore, the action space is restricted only to the yaw rotation, which further reduces the possibility of collisions. Therefore, the collision geometry of robot links is disabled during the training with the aim of bringing a slight performance gain. The collision geometry of the gripper, i.e.~hand and fingers, is kept enabled for both robots as these are required for interaction with the objects. \paragraph{Larger Simulation Step Size} As previously mentioned, dynamic properties of robot joints were manually tuned in order to obtain stable manipulation across a variety of control frequencies. The primary purpose of this tuning is to allow the use of a larger simulation step size, which determines the rate at which the simulation progresses. This in turn affects the accuracy of physics as well as the frequency of the low-level controller. A step size of~4~ms is used for the grasping environment because it was found to have a balanced trade-off between physics stability and performance.
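As an illustration of the collision-mesh simplification described earlier in this subsection, the following sketch shows how the decimation with clipping to~[8,~400] faces could be performed; Open3D's quadric-decimation routine is used here as an assumed stand-in, since the exact tooling is not prescribed by this section.
\begin{verbatim}
import open3d as o3d

def build_collision_mesh(visual_mesh_path, ratio=0.025,
                         min_faces=8, max_faces=400):
    # Load the high-resolution visual mesh of the 3D scanned object.
    mesh = o3d.io.read_triangle_mesh(visual_mesh_path)
    # Target ~2.5% of the original faces, clipped to [8, 400] to avoid outliers.
    target = int(len(mesh.triangles) * ratio)
    target = max(min_faces, min(max_faces, target))
    # Decimation based on quadric error metrics (Garland and Heckbert).
    return mesh.simplify_quadric_decimation(target_number_of_triangles=target)
\end{verbatim}
The resulting low-resolution mesh is then referenced as the collision geometry of the object, while the original mesh is kept for its visual appearance.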
\bigskip -With performance in mind, the control rate of RL agent is set to a lower frequency of~2.5~Hz. This is because the agent only provides high-level control, whereas the motion planner and low-level joint controllers take care of interactions that require faster reaction times. On a laptop with Intel Core i7-10875H CPU and Nvidia Quadro T2000 GPU, the simulated environment with physics and the aforementioned perception progresses at real-time factor of~329\% for a single object and~196\% for four objects. +With performance in mind, the control rate of RL agent is set to a lower frequency of~2.5~Hz. This is because the agent only provides high-level control, whereas the motion planner and low-level joint controllers take care of interactions that require faster reaction times. On a laptop with Intel Core i7-10875H CPU and Nvidia Quadro T2000 GPU, the simulated environment with physics and the aforementioned perception progresses at a real-time factor of~329\% for a single object and~196\% for four objects. \subsection{Domain Randomisation}\label{sec:impl_domain_randomisation} -Even though the simulation environment uses objects with realistic appearance and PBR-capable rendering engine, domain randomisation can still provide advantages for sim-to-real transfer as described in \autoref{subsec:sim2real}. Therefore, domain randomisation is applied for several properties at each reset of the environment, i.e.~before the beginning of every episode. Unless otherwise state, uniform distribution is used for sampling of random variables. +Even though the simulation environment uses objects with realistic appearance and PBR-capable rendering engine, domain randomisation can still provide advantages for sim-to-real transfer as described in \autoref{subsec:sim2real}. Therefore, domain randomisation is applied for several properties at each reset of the environment, i.e.~before the beginning of every episode. Unless otherwise stated, a uniform distribution is used for sampling of random variables. \paragraph{Random Objects} At each reset, a number of random objects from the utilised dataset is spawned. Each object is first randomly and uniformly scaled, such that its longest side is between~12.5 and~17.5~cm. Hereafter, object's inertial properties are recomputed to account for the new scale, while also randomising its mass to be in range~[0.05,~0.5]~kg. Lastly~the coefficient of friction for the object is randomised in range~[0.75,~1.5]. In this way, visual, inertial and mechanical properties of each object are random for every episode. \paragraph{Random Pose of Objects} Besides randomising the type and attributes of each object, the pose at which they spawn is also randomised. It is randomly sampled for each object from a predefined volume in 3D space. In case two objects are overlapping, one of them is spawned again with a new unique pose. -\paragraph{Random Ground Plane Material Textures} To further randomize visuals of the environment at each reset, a random material texture is given to the ground plane. Similar to the objects,~100 different PBR materials are used with a split of~80/20 for training and testing, respectively. PBR materials are used, therefore, each of them uses four different texture maps, i.e.~albedo, normal, specular and roughness. +\paragraph{Random Ground Plane Material Textures} To further randomise visuals of the environment at each reset, a random material texture is given to the ground plane. 
Similar to the objects,~100 different PBR materials are used with a split of~80/20 for training and testing, respectively. Since PBR materials are used, each of them uses four different texture maps, i.e.~albedo, normal, specular and roughness. \paragraph{Random Camera Pose} In order to further increase variety in observations and provide invariance to camera pose, it is randomised at each reset. The pose of the camera is randomly sampled from an arc around the centre of workspace, except for~\(\pm\)22.5\textdegree\ behind the robot in order to avoid complete occlusion of the scene. Thereafter, a random height for the camera in a range~[0.1,~0.7]~m is selected. The camera is then oriented towards the workspace centre and placed~\(1\)~m away from it. This step is expected to provide significant benefits for sim-to-real transfer by allowing camera to be positioned in a location that is suitable for the real-world setup, instead of trying to reproduce simulation setup as closely as possible. \paragraph{Random Initial Joint Configuration} Finally, the initial joint configuration of the utilised robot is randomised. At the beginning of each episode, Gaussian noise~\(\mathcal{N}(0, 6\)\textdegree\()\) is added to each joint in the default configuration. -Examples of fully randomised scenes are shown in \autoref{fig:impl_domain_randomisation}. The aim of this variety in observations is to enable sim-to-real transfer that would allow agent to achieve similar degree of success rate in real-world domain after training only in the simulation +\bigskip + +Examples of fully randomised scenes are shown in \autoref{fig:impl_domain_randomisation}. The aim of this variety in observations is to enable sim-to-real transfer that would allow agent to achieve similar degree of success rate in real-world domain after training only inside the simulation environment. \begin{figure}[ht] \centering @@ -155,9 +163,9 @@ \subsection{Demonstrations and Curriculum} As mentioned in \autoref{sec:rw_reinforcement_learning}, use of demonstrations and curriculum learning can mitigate issues with lengthy exploration. Both of these concepts are therefore investigated in this work and implemented in the following way. -\paragraph{Demonstrations} For demonstrations, approach by \citet{kalashnikov_qt-opt_2018} with the use of a scripted policy is applied. Since off-policy RL algorithms with experience replay buffer are employed, the demonstrations can be simply loaded into such buffer at the beginning of training. More specifically,~\(5000\) transitions are loaded into the replay buffer. A very simple scripted policy is implemented as a state machine that moves gripper towards one of the objects, and once it is reached, the gripper is closed and a lifting motion is performed. Due to its simplicity, it only achieves~19\% success rate on objects with diverse geometry. However, that is considered to be adequate as its sole purpose is to provide few successful attempts that RL agents can improve upon. +\paragraph{Demonstrations} For demonstrations, approach by \citet{kalashnikov_qt-opt_2018} with the use of a scripted policy is applied. Since off-policy RL algorithms with experience replay buffer are employed, the demonstrations can be simply loaded into such buffer at the beginning of training. More specifically,~\(5000\) transitions are loaded into the replay buffer. 
A very simple scripted policy is implemented as a state machine that moves gripper towards one of the objects, and once it is reached, the gripper is closed and a predefined lifting motion is performed. Due to its simplicity, it only achieves~19\% success rate on objects with diverse geometry. However, it is considered to be adequate as its sole purpose is to provide few successful attempts that RL agents can improve upon. -\paragraph{Curriculum} Similar to demonstrations, the use of curriculum can improve learning for tasks in complex environment. This work utilises a curriculum that progressively increases the number of spawned objects and area on top of which these objects are spawned based on current success rate determined by moving average with~\(n\)~=~\(100\). The spawn area increases linearly from~2.4\({\times}\)2.4~cm at~0\% success to~24\({\times}\)24~cm at success rate of~60\%. Similarly, training begins with a single object and and additional one every~20\% until reaching a maximum of four objects at~60\% success rate. +\paragraph{Curriculum} Similar to demonstrations, the use of curriculum can improve learning for tasks in complex environment. This work utilises a curriculum that progressively increases the number of objects and the area on top of which these objects are spawned based on the current success rate determined by moving average with~\(n = 100\). The spawn area increases linearly from~2.4\({\times}\)2.4~cm at~0\% success to~24\({\times}\)24~cm at success rate of~60\%. Similarly, training begins with a single object, and an additional one is added every~20\% until reaching a maximum of four objects at~60\% success rate. \section{Deep Reinforcement Learning} @@ -169,21 +177,21 @@ \subsection{Framework for Reinforcement Learning} It can be very time-consuming and error-prone to implement DRL algorithms from scratch due to several issues that could arise. Therefore, a framework with pre-existing implementations of the utilised actor-critic algorithms from \autoref{sec:bg_actor_critic_algorithms}, i.e.~TD3, SAC and TQC, is utilised. After a brief investigation of the available frameworks for model-free RL, Stable Baselines3 by \citet{raffin_stable-baselines3_2019} was selected due to its reliable implementation of the utilised algorithms, open-source nature and active development. Underneath, PyTorch \cite{paszke_pytorch_2019} is utilised as a machine learning backend that enables training of NNs via its automatic differentiation engine. -In order to enable octree-based feature extraction, the implementation of algorithms was extended with few modifications. These primarily consisted of support for octrees inside replay buffer, formation of octree batches and integration of the octree-based feature extractor with PyTorch-based NNs of actor and critics. All other configuration of the algorithms was performed through their hyperparameters. +In order to enable octree-based feature extraction, the implementation of algorithms was extended with few modifications. These primarily consisted of support for octrees inside replay buffer, formation of octree batches and integration of the octree-based feature extractor with PyTorch-based NNs of actor and critics. All other configurations of these algorithms were performed through their hyperparameters. \subsection{Feature Extraction}\label{subsec:feature_extraction} -With visual features, the first part of the network can often be considered as a feature extractor that transforms raw data into more abstract features. 
This fact is often employed in network architectures for actor-critic DRL methods, where a feature extractor CNN network is shared between the actor and critics. This work therefore utilises the same approach, where a common feature extractor transforms raw input into features that are then provided as input for actor and critic networks. To extract features from octrees, O-CNN implementation by \citet{wang_o-cnn_2017} is used as a base for the employed feature extractor. +With visual features, the first part of the network can often be considered as a feature extractor that transforms raw data into more abstract features. This fact is often employed in network architectures for actor-critic DRL methods, where a CNN feature extractor network is shared between the actor and critics. This work therefore utilises the same approach, where a common feature extractor transforms raw input into features that are then provided as input for actor and critic networks. To extract features from octrees, O-CNN implementation by \citet{wang_o-cnn_2017} is used as a base for the employed feature extractor. \subsubsection{Construction of Octree} First, an octree is constructed from the aforementioned transformed point cloud of the scene during each step. For this, a volume of~24\({\times}\)24\({\times}\)24~cm is defined to be the observable workspace and set to be coincidental with the spawn volume of objects. Therefore, each point cloud is cropped to occupy only this volume in order to preserve assumption about volumetric 3D data representations from \autoref{subsec:problem_formulation_octree}. -Maximum depth of the octree was selected as~\(d_{max}\)~=~\(4\) in order to provide metric resolution of each finest leaf octant of~1.5\({\times}\)1.5\({\times}\)1.5~cm. This depth was found to provide enough detail for grasping of objects from the utilised dataset, while not slowing down the training due to enormous number of cells. Every octree therefore contains a theoretical maximum of~4096 cells, however, an average of~13\% of these cells are occupied at any given time in the created simulation environment. This is primarily because only a single view of the scene is used, where each occlusion prohibits the formation of new cells in the occluded regions behind the visible surfaces. Therefore, it is expected that for each additional depth, the workspace volume can be increased eightfold for the utilised dataset while the actual number of occupied cells would be increased at a much slower rate. +Maximum depth of the octree was selected as~\(d_{max} = 4\) in order to provide metric resolution of each finest leaf octant of~1.5\({\times}\)1.5\({\times}\)1.5~cm. This depth was found to provide enough detail for grasping of objects from the utilised dataset, while not slowing down the training due to enormous number of cells. Every octree therefore contains a theoretical maximum of~4096 cells, however, an average of~13\% of these cells are occupied at any given time in the created simulation environment. This is primarily because only a single view of the scene is used, where each occlusion prohibits the formation of new cells in the occluded regions behind the visible surfaces. Therefore, it is expected that for each additional depth, the workspace volume can be increased eightfold while the actual number of occupied cells would be increased at a much slower rate. 
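To make the construction of such an octree concrete, the following is a minimal sketch of the cropping, normal-estimation and octree-creation steps; Open3D is used here as an assumed stand-in, whereas the actual implementation relies on the O-CNN octree format, which additionally stores the averaged feature channels per finest leaf octant as described next.
\begin{verbatim}
import numpy as np
import open3d as o3d

def point_cloud_to_octree(pcd, camera_position, workspace_centre,
                          half_extent=0.12, max_depth=4):
    # Crop the cloud to the 24x24x24 cm observable workspace volume.
    bounds = o3d.geometry.AxisAlignedBoundingBox(
        min_bound=workspace_centre - half_extent,
        max_bound=workspace_centre + half_extent)
    cropped = pcd.crop(bounds)
    # Estimate per-point normals from at most 10 neighbours within 5 cm and
    # orient them consistently towards the camera.
    cropped.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=10))
    cropped.orient_normals_towards_camera_location(camera_position)
    # Hierarchically subdivide the volume into an octree of maximum depth 4,
    # which yields finest leaf octants of roughly 1.5 cm per side.
    octree = o3d.geometry.Octree(max_depth=max_depth)
    octree.convert_from_point_cloud(cropped)
    return octree

# Illustrative usage; the camera and workspace positions are placeholders.
# octree = point_cloud_to_octree(pcd, camera_position=np.array([1.0, 0.0, 0.5]),
#                                workspace_centre=np.array([0.0, 0.0, 0.12]))
\end{verbatim}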
-As it was previously described in \autoref{subsec:problem_formulation_octree}, each occupied finest leaf octant contains the average unit normal vector~\(\overline{n}\), the average distance between the centre of the cell and points that formed it~\(\overline{d}\), and the average colour~\(\overline{rgb}\). All of these features are extracted directly from the point cloud that is used to create the octree, where each octet considers only the points that belong to its volume. Since the cropped point cloud does not contain normals, these are estimated for each point from their nearest neighbourhood, where maximum of~10 closest neighbours at a maximum distance of~5~cm are considered. Position of the camera is then used to orient all normals correctly. Once these are found, an octree is created from the point cloud by hierarchical subdivision of the cells. An example of created octree is visualised in \autoref{fig:octree_example}. +As it was previously described in \autoref{subsec:problem_formulation_octree}, each occupied finest leaf octant contains the average unit normal vector~\(\overline{n}\), the average distance between the centre of the cell and points that formed it~\(\overline{d}\), and the average colour~\(\overline{rgb}\). All of these features are extracted directly from the point cloud that is used to create the octree, where each octet considers only the points that belong to its volume. Since the cropped point cloud does not contain normals, these are estimated for each point from their nearest neighbourhood, where maximum of~10 closest neighbours at a maximum distance of~5~cm are considered. Position of the camera is then used to orient all normals correctly. Once these are found, an octree is created from the point cloud by hierarchical subdivision of the cells. An example of a created octree is visualised in \autoref{fig:octree_example}. \begin{figure}[ht] \centering @@ -204,9 +212,9 @@ \subsubsection{Network Architecture of Feature Extractor} \label{fig:feature_extractor_architecture} \end{figure} -The network begins with processing octrees at the maximum depth~\(d\)~=~\(d_{max}\)~=~\(4\). Each octree contains seven channels, which encompass the aforementioned features. From this depth, the octree is processed through a series of 3D convolutions, ReLU (Rectified Linear Unit) activation function and maximum pooling. After each pooling operation, the depth of the octree is decremented such that next convolutional layer computes features that are at a larger scale. This series of modules is applied twice, such that the depth of the octree is reduced to~\(d\)~=~\(2\). While doing so, the dimensionality of channels is increased to provide a wider feature space. However, it is necessary to reduce the number of channels before the next step. For this, 1D convolution is applied in order to compress the feature space by combining together features from the different channels for each cell. Once the dimensionality is reduced, the octree is voxelised in order to acquire a structure that has a static size regardless on the input, which enables use of more traditional DL layers. It is achieved by padding the octree at~\(d\)~=~\(2\) with~0s wherever a cell is not already occupied. Once voxelised, the feature space is flattened into a feature vector that is then processed by a single fully connected layer followed by ReLU activation in order to provide the final set of features from octree observations. +The network begins with processing octrees at the maximum depth~\(d = d_{max} = 4\). 
Each octree contains seven channels, which encompass the aforementioned features. From this depth, the octree is processed through a series of 3D convolutions, ReLU (Rectified Linear Unit) activation functions and maximum pooling. Each pooling operation decrements the depth of the octree such that the next convolutional layer computes features at a larger scale. This series of modules is applied twice, such that the depth of the octree is reduced to~\(d = 2\). While doing so, the dimensionality of channels is increased to provide a wider feature space. However, it is necessary to reduce the number of channels before the next step. For this, 1D convolution is applied in order to compress the feature space by combining together features from the different channels for each cell. Once the dimensionality is reduced, the octree is voxelised in order to acquire a structure that has a static size regardless of the input, which enables the use of more traditional DL layers. This is achieved by padding the octree at~\(d = 2\) with~0s wherever a cell is not already occupied. Once voxelised, the feature space is flattened into a feature vector that is then processed by a single fully connected layer followed by ReLU activation in order to provide the final set of features from octree observations. -The proprioceptive observations are also processed by the same feature extractor. However, only a single linear layer with ReLU activation of the same dimensionality is used because these features are already at a higher level compared to the raw octrees. Hereafter, the features extracted from octree are combined with proprioceptive features into a single feature vector. The number of utilised channels and the dimensionality of feature vectors is presented in \autoref{fig:feature_extractor_architecture}, which results in total of~226,494 learnable parameters. +The proprioceptive observations are also processed by the same feature extractor. However, only a single linear layer with ReLU activation of the same dimensionality is used because these features are already at a higher level compared to the raw octrees. Hereafter, the features extracted from the octree are combined with proprioceptive features into a single feature vector. The number of utilised channels and the dimensionality of feature vectors is presented in the aforementioned \autoref{fig:feature_extractor_architecture}, which results in a total of~226,494 learnable parameters. In order to enable observation stacking described in \autoref{subsec:observation_stacking}, the feature extractor is duplicated for each of the three stacks. Once all observation stacks are processed individually, their output is concatenated into a single feature vector that can be used by actor and critic networks. A separate network for each stack is utilised instead of a common network because it allows agent to extract different set of features from historical and current observations. The disadvantage of this approach is increased number of parameters that must be learned, which could potentially slow down the training process. Such effect is therefore investigated during experimental evaluation. @@ -229,7 +237,7 @@ \subsection{Hyperparameter Optimisation} Selection of hyperparameters can significantly affect the learning curve as well as the final performance of a learned policy. This brittleness of DRL to hyperparameters therefore means that their optimisation is of great importance and needs to be performed for each environment.
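As a rough sketch of the pipeline described above, the following PyTorch-style module uses dense 3D convolutions on the voxelised grid as a stand-in for the octree-based convolutions of \citet{wang_o-cnn_2017}; the channel widths and the 10-dimensional proprioceptive input (gripper state, position and 6D orientation) are assumptions, as only the total parameter count is given in the text:

\begin{verbatim}
import torch
import torch.nn as nn

class OctreeFeatureExtractor(nn.Module):
    # Dense stand-in: two blocks of 3D convolution + ReLU + pooling
    # (depth 4 -> 2), a 1x1 convolution to compress channels, voxelisation
    # and flattening, and a fully connected layer, plus a single linear
    # layer for the proprioceptive observations.
    def __init__(self, in_channels=7, hidden=(16, 32), compressed=8, out_features=128):
        super().__init__()
        self.conv_blocks = nn.Sequential(
            nn.Conv3d(in_channels, hidden[0], kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),   # octree depth 4 -> 3
            nn.Conv3d(hidden[0], hidden[1], kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),   # octree depth 3 -> 2
        )
        # 1x1x1 convolution compresses the channel dimension per cell.
        self.compress = nn.Conv3d(hidden[1], compressed, kernel_size=1)
        # Assuming a 16^3 grid at depth 4, the grid at depth 2 is 4^3 cells.
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(compressed * 4 ** 3, out_features), nn.ReLU())
        self.proprio = nn.Sequential(nn.Linear(10, out_features), nn.ReLU())

    def forward(self, voxels, proprio):
        octree_features = self.fc(self.compress(self.conv_blocks(voxels)))
        return torch.cat([octree_features, self.proprio(proprio)], dim=-1)
\end{verbatim}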
In this work, both automatic optimisation and manual fine-tuning is performed with aim to obtain a set of hyperparameters that would allow robust learning of policy for the created environment, observations and utilised RL algorithms. -First, an automatic hyperparameter optimisation is applied by the use of Optuna, which is a hyperparameter optimisation framework developed by \citet{akiba_optuna_2019}. Optuna and other similar frameworks address the problem of selecting a viable combination of hyperparameters for DL by performing a number of different trials that are used to iteratively search the hyperparameter space and find a combination that provides the best results according to some metric. In terms of RL, this metric is a reward that an agent is able to accumulate over the course of some evaluation period. Optuna generally consists of two parts, which are the sampler and the pruner. Sampler selects a set of hyperparameters from the hyperparameter search-space for the next trial. Such selection can either be completely random, e.g.~at the beginning of an experiment, or by applying algorithms that perform statistical analysis from all previous trials. Pruner in this context is a strategy that allows early stopping of non-promising trials with aim to limit the amount of wasted resources. Pruning requires that evaluation episodes of each trial are run at regular intervals, where each new trial is compared to the performance of all previous trials and pruned if the accumulated reward is comparably too low. +First, an automatic hyperparameter optimisation is applied by the use of Optuna, which is a hyperparameter optimisation framework developed by \citet{akiba_optuna_2019}. Optuna and other similar frameworks address the problem of selecting a viable combination of hyperparameters for DL by performing a number of different trials that are used to iteratively search the hyperparameter space and find a combination that provides the best results according to some metric. In terms of RL, this metric is a reward that an agent is able to accumulate over the course of some evaluation period. Optuna consists of two parts, which are the sampler and the pruner. The sampler selects a set of hyperparameters from the hyperparameter search space for the next trial. Such selection can either be completely random, e.g.~at the beginning of an experiment, or by applying algorithms that perform statistical analysis from all previous trials. The pruner in this context is a strategy that allows early stopping of non-promising trials with the aim of limiting the amount of wasted resources. Pruning requires that evaluation episodes of each trial are run at regular intervals, where each new trial is compared to the performance of all previous trials and pruned if the accumulated reward is comparably too low. For the grasping environment, Optuna is first applied to optimise hyperparameters in order to get a baseline that provides a reliable performance. This optimisation was performed using SAC, where the search space consisted of most hyperparameters including the size of the feature extractor and actor-critic networks. Size of the replay buffer, batch size and initial entropy were not optimised automatically. Replay buffer and batch size were selected to be adequately large for the utilised system in terms of maximum RAM and VRAM usage, respectively.
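A schematic of such a pruned study in Optuna, using the trial budget and evaluation interval quoted in this subsection, is sketched below; the sampler, pruner and search space shown here are illustrative assumptions, and make_sac_agent and evaluate are hypothetical helpers around the SAC training and evaluation loops:

\begin{verbatim}
import optuna

def objective(trial):
    # Sample a candidate hyperparameter set (illustrative search space only).
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "gamma": trial.suggest_float("gamma", 0.95, 0.999),
        "net_width": trial.suggest_categorical("net_width", [128, 256, 512]),
    }
    agent = make_sac_agent(**params)           # hypothetical helper
    for step in range(0, 100_000, 25_000):     # up to 100k steps per trial
        agent.train(timesteps=25_000)
        mean_reward = evaluate(agent, episodes=20)  # 20 evaluation episodes
        trial.report(mean_reward, step)
        if trial.should_prune():               # stop non-promising trials early
            raise optuna.TrialPruned()
    return mean_reward

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(),
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=70)
\end{verbatim}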
Initial entropy is kept consistent because it directly influences the performance during the early stages of each trial, where large initial entropy could result in undesired pruning. Total of~70 trials with a maximum trial duration of~100,000 time steps were used. A set of~20 evaluation episodes was performed every~25,000 time steps, which could trigger pruning. At the end, the best performing set of hyperparameters was used for subsequent manual tuning. diff --git a/content/introduction.tex b/content/introduction.tex index ab82f70..8d77f7e 100644 --- a/content/introduction.tex +++ b/content/introduction.tex @@ -1,7 +1,7 @@ \chapter{Introduction} % Intro to grasping -Grasping is a fundamental manipulation skill that is essential for a variety of everyday tasks. Stacking, inserting, pouring, cutting and writing are all examples of such tasks that require an object or a tool to be firmly grasped prior to performing them. A hierarchy of subroutines can be assembled together in order to accomplish more complex goals, which in turn requires grasping of diverse objects that can differ in their appearance, geometry as well as inertial and mechanical properties. Despite the uniqueness this might bring to each individual grasp, a versatile robot should generalize over different objects and scenarios instead of treating them as distinct subtasks. +Grasping is a fundamental manipulation skill that is essential for a variety of everyday tasks. Stacking, inserting, pouring, cutting and writing are all examples of such tasks that require an object or a tool to be firmly grasped prior to performing them. A hierarchy of subroutines can be assembled together in order to accomplish more complex goals, which in turn requires grasping of diverse objects that can differ in their appearance, geometry as well as inertial and mechanical properties. Despite the uniqueness this might bring to each individual grasp, a versatile robot should generalise over different objects and scenarios instead of treating them as distinct subtasks. % Brief related works; analytical approaches, supervised learning, imitation learning, reinforcement learning (elaborated more with popular examples) Task-specific algorithms are often analytically developed for a specific gripper on a set of objects via time-consuming approach. Despite effectiveness of such methods, they usually lead to a solution that lacks the required generalization and even slight differences in the process or manipulated objects might require manual reprogramming \cite{sahbani_overview_2012}. Empirical approaches were introduced to overcome the difficulties with analytical grasping by progressively learning through sampling and training. In this way, supervised learning provides a way to learn grasp synthesis from a dataset that is labelled with analytical grasp metrics, however, this approach requires a large volume of data in order to achieve the desired generalization \cite{mahler_dex-net_2017}. Although imitation learning allows robots to quickly learn simple grasps \cite{zhang_deep_2018}, the amount of required human expert demonstrations can also become too costly and time-consuming before a general policy is learned. Reinforcement learning (RL) \cite{sutton_reinforcement_2018} could offer a solution to this problem, as self-supervision provides the means for a robot to progressively become better at grasping via repeated experience and minimal human involvement. 
The popularity of RL has significantly increased in recent years, especially due to the noteworthy results obtained by deep reinforcement learning (DRL). Several publications demonstrated that DRL can be used to achieve human level performance in tasks such as playing Atari games \cite{mnih_human-level_2015}, or even beating world champions in the boardgame Go \cite{silver_mastering_2017} and real-time strategy game StarCraft II \cite{vinyals_grandmaster_2019}. Moreover, \citet{schrittwieser_mastering_2020} established just how far DRL has come with a single algorithm that can achieve superhuman performance by learning a model without any prior knowledge of the game rules in multiple domains, i.e.~Go, Chess, Shogi and 57 Atari games. diff --git a/content/problem_formulation.tex b/content/problem_formulation.tex index 97161fa..9c43434 100644 --- a/content/problem_formulation.tex +++ b/content/problem_formulation.tex @@ -4,14 +4,14 @@ \chapter{Problem Formulation}\label{ch:problem_formulation} \section{Task Definition}\label{sec:problem_formulation_task_definition} -In this work, agent is assumed to be a high-level controller that provides sequential decision making in form of gripper poses and actions. Therefore, the environment is considered to include all objects and physical interactions in addition to the robot with its actuators and low-level controllers. Episodic formulation of the grasping task is studied, where a new set of objects is introduced into the scene at the beginning of each episode. During each episode, the aim of agent is to grasp and lift an object to a certain height above the ground plane, which also terminates the current episode. Furthermore, an episode is also terminated after~100 time steps and whenever the agent pushes all objects outside the union of the perceived and reachable workspace. Placing of objects after their picking is not investigated in this work. +In this work, the agent is assumed to be a high-level controller that provides sequential decision making in the form of gripper poses and actions. Therefore, the environment is considered to include the robot with its actuators and low-level controllers in addition to all objects and physical interactions. Episodic formulation of the grasping task is studied, where a new set of objects is introduced into the scene at the beginning of each episode. During each episode, the aim of the agent is to grasp and lift an object to a certain height above the ground plane, which also terminates the current episode. Furthermore, an episode is also terminated after~100 time steps and whenever the agent pushes all objects outside the union of the perceived and reachable workspace. Placing of objects after their picking is not investigated in this work. Due to the benefits of employing robotics simulators to train RL agents, e.g.~safe and inexpensive data collection, robotics simulator will be used in this work. Once an agent is trained in a virtual environment, the learned policy will subsequently be evaluated in a real-world setup via sim-to-real transfer. The conceptual setup of this work that should be similar in both domains is illustrated in \autoref{fig:problem_formulation_setup_sketch}.
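The episode termination logic described above could be summarised along these lines; the object and workspace abstractions and the concrete lift height are placeholders, since the text only fixes the 100-step limit:

\begin{verbatim}
def is_episode_done(objects, workspace, step, lift_height=0.2, max_steps=100):
    # Success: some object has been lifted to the required height above the ground plane.
    success = any(obj.position[2] >= lift_height for obj in objects)
    # Termination: time limit reached, or every object was pushed outside
    # the union of the perceived and reachable workspace.
    timeout = step >= max_steps
    all_outside = all(not workspace.contains(obj.position) for obj in objects)
    return success or timeout or all_outside
\end{verbatim}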
\begin{figure}[ht] \centering \includegraphics[width=0.49\textwidth]{problem_formulation/setup_sketch.pdf} - \caption{Conceptual setup for the task of robotic grasping that needs to be constructed inside a robotics simulator for training and in real-world domain for subsequent evaluation.} + \caption{Conceptual setup for the task of robotic grasping that needs to be constructed inside a robotics simulator for training, and in real-world domain for subsequent evaluation.} \label{fig:problem_formulation_setup_sketch} \end{figure} @@ -27,16 +27,16 @@ \subsection{Octree}\label{subsec:problem_formulation_octree} Hereafter, three assumptions about the use of volumetric 3D data representation for end-to-end robotic manipulation are set forth. First, aspect ratio of~1:1:1 is considered to provide generalisation over all possible directions of movement, i.e.~traversing a fixed distance along any of the primary axes should result in a movement over the same number of cells. Second assumption considers the volume that each cell occupies, which shall remain consistent over the entire duration of training and evaluation. This is considered to be beneficial because a persistent scale of cells provides a consistency over distances between any two cells. Lastly, each cell should correspond to a specific position of space that remains fixed with respect to the robot pose, regardless of the camera pose. This assumption is considered to be necessary as it allows NNs to create relations among individual cells and their respective significance in space. -Due on these assumptions, the approach that is commonly used in classification and segmentation tasks, i.e.~rescale a point cloud to fit inside a fixed volume \cite{wang_o-cnn_2017}, cannot be applied in this work. However, it is assumed that the relative pose of camera with respect to robot is known, e.g.~through calibration process, therefore, the previously obtained point cloud is transformed into the robot coordinate frame in order to achieve invariance to camera pose. Furthermore, such point cloud is subsequently cropped in order to occupy a fixed volume in space with aspect ratio of~1:1:1. This volume is considered to be the observed workspace and it is subsequently used to construct the octree observations as illustrated in \autoref{fig:problem_formulation_octree_creation_sketch}. - -\begin{figure}[ht] +\begin{figure}[b] \centering - \includegraphics[width=\textwidth]{problem_formulation/octree_creation_sketch.pdf} + \includegraphics[width=0.73\textwidth]{problem_formulation/octree_creation_sketch.pdf} \caption{Process of constructing an octree from depth map and RGB image via an intermediate point cloud, which is transformed into the robot coordinate frame and cropped to a fixed volume.} \label{fig:problem_formulation_octree_creation_sketch} \end{figure} -The octree structure by \citet{wang_o-cnn_2017} allows arbitrary data to be stored at the finest leaf octants. Three distinct features are utilised in this work, namely the average unit normal vector~\(\overline{n}\), the average distance between the centre of a cell and all points that formed it~\(\overline{d}\), and the average colour~\(\overline{rgb}\). As illustrated in \autoref{fig:problem_formulation_octree_features}, all of these features are computed independently for each octant based on the points from the point cloud that produced it. 
Normals~\(n_{i}~{=}~(n_{x_{i}},n_{y_{i}},n_{z_{i}})\) are selected because they provide smoothness-preserving description of the object surfaces, as previously shown in \autoref{fig:rw_ocnn_occupancy_vs_normals}. Since point cloud acquired from RGB-D camera does not usually contain normals, they must be estimated from a local neighbourhood prior to constructing the corresponding octree. The average distance to the points~\(\overline{d}\) allows the perceived surface to be offset in the direction of normals, which allows octrees with lower resolution to be used while still preserving smooth transitions between the cells. Colour features~\(rgb_{i}~{=}~(r_{i},g_{i},b_{i})\) are expected to provide an agent with additional input that could allow semantic analysis in addition to shape analysis, which might be especially beneficial for distinguishing dissimilar objects that are in contact. Besides~\(\overline{n}\) being normalised as a unit vector,~\(\overline{d}\) and all channels of~\(\overline{rgb}\) are normalised to be in a range~\([0,1]\). +Due to these assumptions, the approach that is commonly used in classification and segmentation tasks, i.e.~rescale a point cloud to fit inside a fixed volume \cite{wang_o-cnn_2017}, cannot be applied in this work. However, it is assumed that the relative pose of the camera with respect to the robot is known, e.g.~through a calibration process. Therefore, the previously obtained point cloud is transformed into the robot coordinate frame in order to achieve invariance to camera pose as illustrated in \autoref{fig:problem_formulation_octree_creation_sketch}. Furthermore, such point cloud is subsequently cropped in order to occupy a fixed volume in space with an aspect ratio of~1:1:1. This volume is considered to be the observed workspace and it is subsequently used to construct the octree observations. + +The octree structure by \citet{wang_o-cnn_2017} allows arbitrary data to be stored at the finest leaf octants. Three distinct features are utilised in this work, namely the average unit normal vector~\(\overline{n}\), the average distance between the centre of a cell and all points that formed it~\(\overline{d}\), and the average colour~\(\overline{rgb}\). As illustrated in \autoref{fig:problem_formulation_octree_features}, all of these features are computed independently for each octant based on the points from the point cloud that produced it. Normals~\(n_{i}~{=}~(n_{x_{i}},n_{y_{i}},n_{z_{i}})\) are selected because they provide a smoothness-preserving description of the object surfaces, as previously shown in \autoref{fig:rw_ocnn_occupancy_vs_normals}. Since a point cloud acquired from an RGB-D camera does not usually contain normals, they must be estimated from a local neighbourhood prior to constructing the corresponding octree. The average distance to the points~\(\overline{d}\) allows the perceived surface to be offset in the direction of normals, which allows octrees with lower resolution to be used while still preserving smooth transitions between the cells. Colour features~\(rgb_{i}~{=}~(r_{i},g_{i},b_{i})\) are expected to provide an agent with additional input that could allow semantic analysis in addition to shape analysis, which might be especially beneficial for distinguishing dissimilar objects that are in contact. Besides~\(\overline{n}\) being normalised as a unit vector,~\(\overline{d}\) and all channels of~\(\overline{rgb}\) are normalised to be in a range~\([0, 1]\).
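For a single finest leaf octant, the three features could be computed along the following lines; the normalisation of~\(\overline{d}\) by half of the cell diagonal is an assumption, as the text only states that it ends up in the range~\([0, 1]\):

\begin{verbatim}
import numpy as np

def octant_features(points, normals, colors, cell_center, cell_size):
    # Average the features of all points that fall inside this octant.
    n_mean = normals.mean(axis=0)
    n_mean /= np.linalg.norm(n_mean) + 1e-9        # keep it a unit vector
    d_mean = np.linalg.norm(points - cell_center, axis=1).mean()
    d_mean /= (np.sqrt(3.0) / 2.0) * cell_size     # assumed scaling: half the cell diagonal
    rgb_mean = colors.mean(axis=0)                 # colours assumed already in [0, 1]
    # 7 channels per occupied octant: (nx, ny, nz, d, r, g, b)
    return np.concatenate([n_mean, [d_mean], rgb_mean])
\end{verbatim}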
\begin{figure}[ht] \centering @@ -48,48 +48,49 @@ \subsection{Octree}\label{subsec:problem_formulation_octree} \subsection{Proprioceptive Observations}\label{subsec:problem_formulation_proprioceptive_observations} -In addition to the visual observations acquired by an RGB-D camera, it is considered to be beneficial to also include proprioceptive observations. Gripper pose and gripper state are used in this work because these observations are independent of the utilised robot. Although both of these could be determined solely from the visual observations, occlusion can introduce significant uncertainty. Furthermore, these readings are easily obtainable from any robot. The state of the gripper~\(g_{s}\) is represented as~\(\{closed: -1, opened: 1\}\). The position of the gripper is encoded as~\((x,y,z)\) vector represented with respect to robot's base frame. Gripper orientation is also with respect to robot's base frame, and represented as the first two columns of the rotation matrix~\([(R_{11},R_{21},R_{31}),(R_{12},R_{22},R_{32})]\) because they provide continuous description of 3D orientation without ambiguities, contrary to Euler angles or quaternions \cite{zhou_continuity_2020}. +In addition to the visual observations acquired by an RGB-D camera, it is considered to be beneficial to also include proprioceptive observations. Gripper pose and gripper state are used in this work because these observations are independent of the utilised robot. Although both of these could be determined solely from the visual observations, occlusion can introduce uncertainties. Furthermore, these readings are easily obtainable from any robot. The state of the gripper~\(g_{s}\) is represented as~\(\{closed: -1, opened: 1\}\). The position of the gripper is encoded as an~\((x,y,z)\) vector represented with respect to the robot's base frame. Gripper orientation is also with respect to the robot's base frame, and represented as the first two columns of the rotation matrix~\([(R_{11},R_{21},R_{31}),(R_{12},R_{22},R_{32})]\) because they provide a continuous description of 3D orientation without ambiguities, contrary to Euler angles or quaternions \cite{zhou_continuity_2020}. \subsection{Observation Stacking}\label{subsec:observation_stacking} A single set of visual and proprioceptive observations does not fully describe the state of the environment. In order to better satisfy Markov assumption, dynamics of the system must also be observed, including all data based on the temporal information. \citet{mnih_human-level_2015} addressed this in a simple way by stacking last~\(n\) historical observations together and combining them into a single observation that fully describes the state. -Despite the increase in the amount of similar data that needs to be processed, this work applies a similar observation stacking method due to the simplicity of such solution. More specifically, three sequential octrees and proprioceptive observations are stacked together, i.e.~\(n\)~=~\(3\). At the beginning of each episode when three observations are not available yet, the first observations is duplicated multiple times to form the stacked observation. +Despite the increase in the amount of similar data that needs to be processed, this work applies a similar observation stacking method due to the simplicity of such a solution. More specifically, three sequential octrees and proprioceptive observations are stacked together, i.e.~\(n = 3\).
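Packed together, the proprioceptive readings listed above amount to ten values per time step; a small sketch of this encoding (the function itself is illustrative):

\begin{verbatim}
import numpy as np

def proprioceptive_observation(gripper_closed, position, rotation_matrix):
    # Gripper state in {-1, 1}, gripper position (x, y, z) and the first two
    # columns of the 3x3 rotation matrix as a continuous 6D orientation.
    g_s = -1.0 if gripper_closed else 1.0
    orientation_6d = np.asarray(rotation_matrix)[:, :2].flatten(order="F")
    # orientation_6d = (R11, R21, R31, R12, R22, R32)
    return np.concatenate([[g_s], np.asarray(position), orientation_6d])
\end{verbatim}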
At the beginning of each episode when three observations are not available yet, the first observation is duplicated multiple times to form the stacked observation. \section{Action Space} -In this work, the action space for end-to-end robotic grasping comprises of continuous actions in Cartesian space. By utilising actions in Cartesian space instead of joint space, the action space is invariant to the specific kinematic configuration of a robot. Furthermore, Cartesian actions provide better safety guarantees, where traditional IK and motion planning approaches can be employed to reliably provide commands for low-level joint controllers while avoiding self-collisions. +In this work, the action space for end-to-end robotic grasping comprises continuous actions in Cartesian space. By utilising actions in Cartesian space instead of joint space, the action space is invariant to the specific kinematic configuration of a robot. Furthermore, Cartesian actions provide better safety guarantees, where traditional IK and motion planning approaches can be employed to reliably provide commands for low-level joint controllers while avoiding self-collisions. Actions available to agents are illustrated in \autoref{fig:problem_formulation_action_space}. -The utilised actions are illustrated in \autoref{fig:problem_formulation_action_space}. For gripper pose, the actions comprise of translational displacement~\((d_{x},d_{y},d_{z})\) and relative rotation around~\(z\)-axis~\(d_{\phi}\) that are both expressed with respect to robot base coordinate frame. These actions are normalised in the range~\([-1, 1]\) and subsequently rescaled to metric and angular units before applying them. The gripper action~\(g\) is also in a continuous range~\([-1, 1]\), where positive values open the gripper and negative values prompt closing of the gripper. Therefore, RL agent is allowed to take any combination of continuous actions by selecting the corresponding values for a tuple~\((d_{x},d_{y},d_{z},d_{\phi},g)\). - -\begin{figure}[ht] +\begin{figure}[b] \centering \includegraphics[width=0.333\textwidth]{problem_formulation/action_space.pdf} \caption{Action space of the grasping task, where~\((d_{x},d_{y},d_{z})\) indicates a translational displacement,~\(d_{\phi}\) is a relative yaw rotation, and the gripper closing and opening is denoted by~\(g\).} \label{fig:problem_formulation_action_space} \end{figure} +For gripper pose, the actions comprise a translational displacement~\((d_{x},d_{y},d_{z})\) and a relative rotation around the~\(z\)-axis~\(d_{\phi}\) that are both expressed with respect to the robot base coordinate frame. These actions are normalised in the range~\([-1, 1]\) and subsequently rescaled to metric and angular units before applying them. The gripper action~\(g\) is also in a continuous range~\([-1, 1]\), where positive values open the gripper and negative values prompt closing of the gripper. Therefore, the RL agent is allowed to take any combination of continuous actions by selecting the corresponding values for a tuple~\((d_{x},d_{y},d_{z},d_{\phi},g)\). + \section{Reward Function} Although it would be desirable to provide the agent only with a very sparse reward after successfully grasping and lifting an object, such approach would prolong the training due the sparsity of achieving a success through random exploration. Therefore, this work makes use of a composite reward function that combines together sparse rewards from four distinct stages, i.e.~reaching, touching, grasping and lifting.
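A sketch of how the normalised action tuple could be decoded before execution; the translation and yaw limits are placeholders, since the text does not state the actual rescaling factors:

\begin{verbatim}
import numpy as np

MAX_TRANSLATION = 0.1            # m per step (assumed)
MAX_YAW = np.deg2rad(45.0)       # rad per step (assumed)

def decode_action(action):
    # Split and rescale the normalised tuple (dx, dy, dz, dphi, g).
    action = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    displacement = action[:3] * MAX_TRANSLATION   # metric units
    yaw = action[3] * MAX_YAW                     # angular units
    open_gripper = action[4] > 0.0                # positive opens, negative closes
    return displacement, yaw, open_gripper
\end{verbatim}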
These stages follow a hierarchical flow, where the agent must first approach an object, then touch, grasp and finally lift it. During each episode, the agent is allowed to obtain a reward from each of these stages only once in order to discourage any rewarding behaviour that would not lead to a desired goal of the final stage, such as repeatedly pushing an object in order to continually accumulate reward for touching. -The proportion and scale of each component from the reward function can be treated as a tunable environment hyperparameter because it directly influences the policy that the agent aims to optimise. Generally, reward at the last stage should be much higher than the reward given at first stage, which is only meant to guide the training of the agent. Therefore, an exponential function~\(r_{exp}^{i-1}\) is used to determine the individual reward for each stage~\(i\). The base~\(r_{exp} \in [1,\infty)\) can be tuned, where~\(r_{exp}\)~=~\(7\) was empirically found to provide satisfactory results for the implemented grasping task, with theoretical maximum achievable reward of~\(r_{max}\)~=~\(400\). +The proportion and scale of each component from the reward function can be treated as a tunable environment hyperparameter because it directly influences the policy that the agent aims to optimise. Generally, the reward at the last stage should be much higher than the reward given at the first stage, which is only meant to guide the training of the agent. Therefore, an exponential function~\(r_{exp}^{i-1}\) is used to determine the individual reward for each stage~\(i\). The base~\(r_{exp} \in [1,\infty)\) can be tuned, where~\(r_{exp} = 7\) was empirically found to provide satisfactory results for the implemented grasping task, with a theoretical maximum achievable reward of~\(r_{max} = 400\). In addition to positive reward for accomplishing the task, the agent is also given negative reward of~\(-1\) for each time step during which the robot is in collision with the ground plane in order to discourage the number of undesired collisions. Furthermore, a small reward of~\(-0.005\) is subtracted at each time step until termination in order to encourage the agent to accomplish the task as fast as possible. All rewards are summarised in \autoref{tab:reward_function}.
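The composite reward described above and summarised in the table below could be computed per time step roughly as follows; how the individual stages are detected is omitted here:

\begin{verbatim}
R_EXP = 7.0
STAGES = ("reaching", "touching", "grasping", "lifting")

def step_reward(completed_stages, rewarded_stages, in_collision):
    # Every step costs -0.005 to encourage acting quickly, and contact with
    # the ground plane costs an additional -1 for that step.
    reward = -0.005
    if in_collision:
        reward -= 1.0
    # Each stage pays r_exp^(i-1), i.e. 1, 7, 49, 343, at most once per episode.
    for i, stage in enumerate(STAGES):
        if stage in completed_stages and stage not in rewarded_stages:
            reward += R_EXP ** i
            rewarded_stages.add(stage)
    return reward
\end{verbatim}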
\begin{table}[ht] \centering \begin{tabular}{cr|lc} - \multirow{4}{*}{\rotatebox[origin=c]{90}{\textbf{Composite}}} & \textbf{Reaching} & \(r_{exp}^{0}\)~=~\(1\) & \multirow{4}{*}{\textit{(once per episode)}} \\ - & \textbf{Touching} & \(r_{exp}^{1}\)~=~\(7\) & \\ - & \textbf{Grasping} & \(r_{exp}^{2}\)~=~\(49\) & \\ - & \textbf{Lifting} & \(r_{exp}^{3}\)~=~\(343\) & \\ \hline - \multicolumn{2}{r|}{\textbf{Collision}} & \(-1\) & \multirow{2}{*}{\textit{(each time step)}} \\ - \multicolumn{2}{r|}{\textbf{Act Quickly}} & \(-0.005\) & \\ + \multirow{4}{*}{\rotatebox[origin=c]{90}{Composite}} + & \textbf{Reaching} & \(r_{exp}^{0} = 1\) & \multirow{4}{*}{\textit{(once per episode)}} \\ + & \textbf{Touching} & \(r_{exp}^{1} = 7\) & \\ + & \textbf{Grasping} & \(r_{exp}^{2} = 49\) & \\ + & \textbf{Lifting} & \(r_{exp}^{3} = 343\) & \\ \hline + \multicolumn{2}{r|}{\textbf{Collision}} & \(-1\) & \multirow{2}{*}{\textit{(each time step)}} \\ + \multicolumn{2}{r|}{\textbf{Act Quickly}} & \(-0.005\) & \\ \end{tabular} - \caption{Overview of the reward function that is utilised in this work for the grasping task, where~\(r_{exp}\)~=~\(7\) was tuned and each episode has at most~100 time steps.} + \caption{Overview of the reward function that is utilised in this work for the grasping task, where~\(r_{exp} = 7\) was tuned and each episode has at most~100 time steps.} \label{tab:reward_function} \end{table} diff --git a/content/related_work.tex b/content/related_work.tex index 97eb7bd..8fa0e23 100644 --- a/content/related_work.tex +++ b/content/related_work.tex @@ -43,7 +43,7 @@ \section{Imitation Learning} Behavioural cloning is the simplest form of imitation learning, in which a policy that directly maps states to actions is learned through techniques such as non-linear regression or support vector machines \cite{osa_algorithmic_2018}. Recently, \citet{zhang_deep_2018} showed that DL allows behavioural cloning to be an effective way for robots to acquire complex skills. They used a virtual reality headset and hand-tracking controller to acquire teleoperated demonstrations in the form of RGB-D images, which were subsequently used to train a deep policy by the use of CNN. With this approach, \citeauthor{zhang_deep_2018} managed to train a simple grasping task with one object to~\(97\)\% success rate while using 180 distinct demonstrations. Learning from observation is an emerging category that similarly aims to learn policy from visual demonstrations but without any labels associated with them, where the state might not be fully known \cite{kroemer_review_2021}. % Difficulties -Even though imitation learning provides a quick way of acquiring new policies, demonstrations usually do not contain all possible states that the robot might experience because collecting expert demonstrations for all scenarios can become too expensive and time-consuming \cite{osa_algorithmic_2018}. For this reason, the learned policy might struggle to generalize to novel objects and situations. +Even though imitation learning provides a quick way of acquiring new policies, demonstrations usually do not contain all possible states that the robot might experience because collecting expert demonstrations for all scenarios can become too expensive and time-consuming \cite{osa_algorithmic_2018}. For this reason, the learned policy might struggle to generalise to novel objects and situations. 
\section{Reinforcement Learning}\label{sec:rw_reinforcement_learning} @@ -159,14 +159,14 @@ \subsection{Sim-to-Real}\label{subsec:sim2real} \end{figure} -\paragraph{Domain Randomization} Another way to easily expand the variety in data is by randomly changing the simulated environment. \citet{tobin_domain_2017} applied this method in order to randomize visual attributes shown in \autoref{fig:sim2real_domain_randomization}, such as object colours, table texture, camera pose and characteristics of the illumination. Furthermore, domain randomization can be extended also to other non-visual simulation attributes such as inertial properties of robot links and hyperparameters of the utilised physics solver. +\paragraph{Domain Randomization} Another way to easily expand the variety in data is by randomly changing the simulated environment. \citet{tobin_domain_2017} applied this method in order to randomise visual attributes shown in \autoref{fig:sim2real_domain_randomization}, such as object colours, table texture, camera pose and characteristics of the illumination. Furthermore, domain randomization can be extended also to other non-visual simulation attributes such as inertial properties of robot links and hyperparameters of the utilised physics solver. \begin{figure}[ht] \centering \begin{subfigure}[ht]{0.745\textwidth} \centering \includegraphics[height=3.75cm]{related_work/sim2real_randomization_synthetic.png} - \caption*{Randomized} + \caption*{Randomised} \end{subfigure}% ~ \begin{subfigure}[ht]{0.245\textwidth} diff --git a/master_thesis.pdf b/master_thesis.pdf index e1a14cb..a49fcc8 100644 Binary files a/master_thesis.pdf and b/master_thesis.pdf differ