% -*- mode: latex; mode: visual-line; fill-column: 9999; coding: utf-8 -*-
\section{Introduction}
\label{sec:introduction}
Molecular dynamics (MD) simulations are a powerful method to generate new insights into the function of biomolecules \cite{Borhani:2012mi, Dror:2012cr, Orozco:2014dq, Perilla:2015kx, Bottaro:2018aa}.
These simulations produce trajectories---time series of atomic coordinates---that now routinely include millions of time steps and can reach terabytes in size.
These trajectories need to be analyzed using statistical mechanics approaches \cite{Tuckerman:2010cr, Mura:2014kx}, but because of the increasing size of the data, trajectory analysis is becoming a bottleneck in typical biomolecular simulation workflows~\cite{Cheatham:2015}.
Many data analysis tools and libraries have been developed to extract the desired information from the output trajectories of MD simulations~\cite{nmoldyn, nmoldyn-2012, Hum96, Hinsen:2000kx, Grant:2006ud, himach-2008, Romo:2009zr, Romo:2014bh, Michaud-Agrawal:2011fu, Gowers:2016aa, cpptraj-2013, McGibbon:2015aa, pteros2015, Doerr:2016aa}, but few can efficiently use modern High Performance Computing (HPC) resources to accelerate the analysis stage.
MD trajectory analysis primarily requires \emph{reading} data from the file system; the processed output is typically negligible in size compared to the input, and therefore we exclusively investigate the reading aspect of trajectory I/O (i.e., the ``I'').
We focus on the \package{MDAnalysis} package \cite{Gowers:2016aa,Michaud-Agrawal:2011fu}, which is an open-source object-oriented Python library for structural and temporal analysis of MD simulation trajectories and individual protein structures.
Although \package{MDAnalysis} accelerates selected algorithms with OpenMP, it is not clear how best to use it for scaling up analysis on multi-node supercomputers.
Here we discuss the challenges and lessons learned in making parallel analysis on HPC resources feasible with \package{MDAnalysis}, which should also be broadly applicable to other general-purpose trajectory analysis libraries.
Previously, we had used a parallel split-apply-combine approach \cite{Wickham:2011aa} to study the performance of the commonly performed ``RMSD fitting'' analysis problem~\cite{Khoshlessan:2017ab, ICCP-2018, Fan:2019aa}, which calculates the minimal root mean squared distance (RMSD) of the positions of a subset of atoms to a reference conformation under optimization of rigid body translations and rotations \cite{Liu:2010kx, Lea96, Mura:2014kx}.
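For reference, for a selection of $N$ atoms with positions $\mathbf{x}_i(t)$ at time $t$ and reference positions $\mathbf{x}_i^{\mathrm{ref}}$, this quantity is
\begin{equation}
  \mathrm{RMSD}(t) = \min_{\mathsf{R},\,\mathbf{b}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \mathsf{R}\mathbf{x}_i(t) + \mathbf{b} - \mathbf{x}_i^{\mathrm{ref}} \right\|^2},
\end{equation}
where the minimization runs over all rigid-body rotations $\mathsf{R}$ and translations $\mathbf{b}$.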
We had investigated two parallel implementations, one using \package{Dask}~\cite{Rocklin:2015aa} and one using the message passing interface (MPI) with \package{mpi4py}~\cite{Dalcin:2011aa, Dalcin:2005aa}.
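As a minimal illustration of this split-apply-combine pattern (a sketch, not the benchmark code used here; the file names, block count, and atom selection are placeholders, and the \texttt{results.rmsd} attribute assumes a recent \package{MDAnalysis} version), the trajectory is split into contiguous frame blocks, the RMSD of each block is computed as an independent \package{Dask} task, and the per-block results are stacked:
\begin{verbatim}
import dask
import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis import rms

TOP, TRJ = "adk.psf", "adk.dcd"        # placeholder topology/trajectory files

def block_rmsd(start, stop):
    # "apply": every task opens its own Universe and processes one frame block
    u = mda.Universe(TOP, TRJ)
    ref = mda.Universe(TOP, TRJ)       # reference positions: first frame
    R = rms.RMSD(u, ref, select="protein and name CA")
    R.run(start=start, stop=stop)
    return R.results.rmsd              # columns: frame index, time, RMSD

n_frames = mda.Universe(TOP, TRJ).trajectory.n_frames
bounds = np.linspace(0, n_frames, 4 + 1, dtype=int)    # "split" into 4 blocks
tasks = [dask.delayed(block_rmsd)(b, e)
         for b, e in zip(bounds[:-1], bounds[1:])]
rmsd = np.vstack(dask.compute(*tasks))                 # "combine" block results
\end{verbatim}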
For both \package{Dask} and MPI, we had previously only been able to obtain good strong scaling performance within a single node.
Beyond a single node, performance had dropped due to \emph{straggler} tasks, a subset of tasks that had run abnormally slowly compared to the typical task execution time; the total execution time had become dominated by stragglers and overall performance had decreased.
Stragglers are a well-known challenge to improving performance on HPC resources \cite{Garraghan2016} but there has been little discussion of their impact in the biomolecular simulation community.
In the present study, we analyzed the MPI case in more detail to better understand the origin of stragglers, with the goal of finding parallelization approaches to speed up parallel post-processing of MD trajectories in the \package{MDAnalysis} library.
We especially wanted to make efficient use of the resources provided by current supercomputers such as multiple nodes with hundreds of CPU cores and a Lustre parallel file system.
As in our previous study~\cite{Khoshlessan:2017ab}, we selected the commonly used RMSD algorithm implemented in \package{MDAnalysis} as a typical use case.
We employed the single program multiple data (SPMD) paradigm to parallelize this algorithm on three different HPC resources (XSEDE's \emph{SDSC Comet}, \emph{LSU SuperMic}, and \emph{PSC Bridges} \cite{xsede}).
With SPMD, each process executes essentially the same operations on different parts of the data.
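A minimal sketch of this approach with \package{mpi4py} (illustrative only; the frame count and the \texttt{analyze} helper are placeholders) partitions the frame indices into one contiguous block per MPI rank:
\begin{verbatim}
from mpi4py import MPI
import numpy as np

def analyze(start, stop):
    # placeholder for the per-block work, e.g. an RMSD calculation over [start, stop)
    return np.array([start, stop])

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_frames = 1000000                                # placeholder trajectory length
bounds = np.linspace(0, n_frames, size + 1, dtype=int)
start, stop = bounds[rank], bounds[rank + 1]      # this rank's contiguous frame block

local = analyze(start, stop)                      # every rank runs the same program
results = comm.gather(local, root=0)              # rank 0 collects the per-rank results
\end{verbatim}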
The three clusters differed in their architecture but all used Lustre as their parallel file system.
We used Python (\url{https://www.python.org/}), a machine-independent, byte-code interpreted, object-oriented programming language, which is well-established in the biomolecular simulation community with good support for parallel programming for HPC~\cite{Dalcin:2011aa, GAiN}.
We found that communication and read I/O were the two main scalability bottlenecks, with some indication that read I/O might have been interfering with communication.
Stragglers and therefore poor scaling were a consequence of inefficient use of the parallel Lustre file system.
We therefore focused on two different approaches to better leverage Lustre and mitigate I/O bottlenecks: MPI parallel I/O (MPI-IO) with the HDF5 file format and subfiling (trajectory file splitting).
MPI-IO eliminated stragglers and improved the performance with near-ideal scaling, $S(N) = N$, i.e., the speed-up $S$ scaled linearly with the number $N$ of CPU cores with a slope of one.
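As an illustration of the kind of collective access that MPI-IO enables, the following sketch opens an HDF5 file through the MPI-IO driver and lets each rank read only its own block of frames; it assumes \package{h5py} built against a parallel HDF5 library and a hypothetical dataset named \texttt{positions} of shape (frames, atoms, 3):
\begin{verbatim}
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# collective open via the MPI-IO ("mpio") driver of parallel HDF5
with h5py.File("trajectory.h5", "r", driver="mpio", comm=comm) as f:
    positions = f["positions"]            # hypothetical (frames, atoms, 3) dataset
    bounds = np.linspace(0, positions.shape[0], size + 1, dtype=int)
    block = positions[bounds[rank]:bounds[rank + 1]]   # each rank reads its own frames
\end{verbatim}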
The paper is organized as follows:
We first review stragglers and existing approaches to parallelizing MD trajectory analysis in section \ref{sec:background}.
We describe the software packages and algorithms in section \ref{sec:packages} and the benchmarking environment in section \ref{sec:system}.
Section \ref{sec:methods} explains how we measured performance.
The main results are presented in section \ref{sec:experiments}, with section \ref{sec:clusters} demonstrating reproducibility on different supercomputers.
We provide general guidelines and lessons learned in section \ref{sec:guidelines} and finish with conclusions in section \ref{sec:conclusions}.