Skip to content

Commit

Permalink
Add RNTuple update blog post
Browse files Browse the repository at this point in the history
  • Loading branch information
enirolf committed Jan 26, 2024
1 parent a4817ed commit 9a4102f
Show file tree
Hide file tree
Showing 3 changed files with 188 additions and 0 deletions.
Binary file added assets/images/rntuple_blog_adl.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added assets/images/rntuple_blog_size_mc.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
188 changes: 188 additions & 0 deletions blog/_posts/2024-01-26-rntuple-update.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,188 @@
---
title: "RNTuple: Where are we now and what's next?"
layout: archive
author: Florine de Geus
---

Hello, this is Florine from the ROOT team! Over the past year, I’ve been working
as a technical student funded by ATLAS to evaluate and help further develop
RNTuple. As you may already be aware, RNTuple [[1]](#rntuple-evolution) is
currently being developed as the successor to TTree, and is projected to be
used in Run 4. I imagine you might be wondering why there is a need for a
completely new (TTree-incompatible) system, and what this looks like. That’s why
in this blog post, I will try to answer this question, as well as give you an
overview of the current status of RNTuple, what we’re still working on before
its first production release (and what we will work on beyond this), and
finally how you can already try it out!

## Why do we need RNTuple?

At this point, ROOT has been around for more than a quarter of a century – and
TTree for just as long. And as you might imagine, the computing landscape today
looks vastly different compared to 25 years ago. Just to set the scene: when
ROOT was first released, there was no C++ standard yet and parallel (let alone
distributed) computing really wasn’t a thing yet. On the hardware side, modern
storage technologies such as SSDs and object stores were still unheard of, and
let’s not forget to mention the evolution of networking technologies!
Naturally, TTree wasn’t designed and implemented with these things in mind. Now
of course, over the years a lot of effort has been put into improving the
performance and stability of TTree to make it compatible with modern computing
practices as much as possible. However, there are limits to what is possible in
this regard, especially given the fact that backwards- and
forwards-compatibility are two major requirements for ROOT’s I/O system. This
has led to the fact that with the High-Luminosity LHC on the horizon, where 90%
of the total amount of LHC data is expected to be produced [[2]](#atlas-roadmap),
we need to think about more optimized ways to store physics
data. The challenge here is that this data is unique in the sense that events
(or, in computer science terms, “entry” or “row”) are statistically independent
of each other. At the same time one event typically contains many (complex) data
structures, of which we often only need a small subset at a time, and we found
out that standard technologies are not well-tuned for this type of data
storage [[3]](#hep-data-formats). That is why we decided to combine the years
of experience with TTree and various industry best-practices and invest in the
next generation of high-energy physics data storage. Enter RNTuple!

## Where we are now?

For the past four years, a lot of effort has been put into making RNTuple the
best it can be. We are working closely with the experiments to make sure that
RNTuple can support their data models across all relevant stages in the
production pipeline. Simultaneously, we want to make sure that it is as
optimized as possible. This means making sure that the data stored in RNTuple
is as compact as possible, and at the same time coming up with ways in which we
can make reading and writing RNTuples to and from memory as fast as possible.
To give you an idea of where we’re currently at, the plot below shows the
average on-disk event size for ATLAS’s DAOD_PHYS data model [[4]](#atlas-edm),
comparing TTree and RNTuple. With RNTuple, we could potentially
save 20-35% of storage space, and in turn reduce the consumed network bandwidth
when reading the data from a remote location. When we’re talking about exabytes
of event data, this is quite significant!

<center>
<img
src="{{'/assets/images/rntuple_blog_size_mc.png' | relative_url}}"
alt="DAOD_PHYS storage efficiency"
style="width: 80%" width="1600" height="749" />
</center>

Besides storage efficiency, we’re also seeing very promising results when it
comes to read throughput. The two plots below show the number of events
processed per second for two different types of tasks, comparing ATLAS
DAOD_PHYSLITE data sets stored in TTree and RNTuple (stored on an SSD). As you
can see, RNTuple is remarkably faster than TTree, and similar observations are
made for other data sets [[1]](#rntuple-evolution), [[5]](#rntuple-status).

<center>
<img
src="{{'/assets/images/rntuple_blog_adl.png' | relative_url}}"
alt="PHYSLITE ADL benchmarks"
style="width: 80%" width="1600" height="749" />
</center>

Beyond performance, we have also been working hard on RNTuple’s interface and
supported features. This includes compatibility with RDataFrame, being able to
read and write C++ STL types as well as user-defined types and various other
features to support existing experiment frameworks.

## Can I try it out?

Yes! To be able to read and write RNTuples, the first thing you’ll need is a ROOT
installation that includes the
[ROOT 7 experimental features enabled](https://root.cern/install/build_from_source/#enabling-experimental-features-aka-root7).
This is the case for the default LXPLUS installation, which runs ROOT’s (at the
time of writing) latest release, [6.30.02](https://root.cern/releases/release-63002/)!
If you are running ROOT in a different way, you can easily check if ROOT 7 is
enabled for your installation by running `root-config --has-root7` in your terminal.
If this returns `yes`, you’re all set! If you get a `no`, you will need to use a different
installation of ROOT that does. Check out the [ROOT installation page](https://root.cern/install)
to get it. We strongly recommend using the most recent release in order to get
the latest and greatest from RNTuple.

Now, on to the fun part: using RNTuple! Of course, you could write a new RNTuple
completely from scratch, using fields and data that you come up with. This is
done using the [`RNTupleWriter`](https://root.cern/doc/master/classROOT_1_1Experimental_1_1RNTupleWriter.html)
interface. Reading an RNTuple is then naturally done through the
[`RNTupleReader`](https://root.cern/doc/master/classROOT_1_1Experimental_1_1RNTupleReader.html).
To get an idea of what this looks like in practice, check out for example
[this tutorial](https://root.cern/doc/master/ntpl001__staff_8C.html).

Of course, it would be more interesting to try out RNTuple with real data, for
example with data from an analysis ntuple that is currently stored as a TTree.
Well, good news! RNTuple also comes with an [`RNTupleImporter`](https://root.cern/doc/master/classROOT_1_1Experimental_1_1RNTupleImporter.html)
class that allows you to automatically convert your TTrees to RNTuples. This
can be as simple as executing the following two lines in the ROOT prompt. The
input file containing the source TTree is read remotely, meaning you can
directly copy-paste these lines into your ROOT prompt. Of course, it’s entirely
possible to use your own existing TTrees.

```cpp
root [0] auto importer = ROOT::Experimental::RNTupleImporter::Create(
"http://root.cern/files/HiggsTauTauReduced/GluGluToHToTauTau.root",
"Events",
"my_rntuple.root")
root [1] importer->Import()
```

This will convert your TTree (called `Events` here) into an RNTuple also called
`Events` and write it to `my_rntuple.root`. Easy enough, but maybe you want more
control over this newly created RNTuple. For example, you might want to change
its name, or set the compression settings to something other than the default.
This (and more) can all be tweaked! Check out
[the reference](https://root.cern/doc/master/classROOT_1_1Experimental_1_1RNTupleImporter.html)
or [this tutorial](https://root.cern/doc/master/ntpl008__import_8C.html) to see
what options are possible.

Now, I already mentioned that we have been working on RNTuple compatibility with
RDataFrame. Currently, with just one line change, you will be able to use
your existing analysis code with data stored in RNTuple:

```cpp
// Change this:
ROOT::RDataFrame df("Events", "http://root.cern/files/HiggsTauTauReduced/GluGluToHToTauTau.root");

// To this to use the RNTuple you just imported into "my_rntuple.root":
ROOT::RDataFrame df = ROOT::RDF::Experimental::FromRNTuple("Events", "my_rntuple.root");

// Use your existing analysis as-is!
```
> 💡 _The automatic detection of RNTuples in RDataFrame is currently available in
ROOT’s `master` branch and will be available in ROOT 6.32.00!_
## Next steps for RNTuple
So, what’s next? Performance is always one of our main concerns. We are
currently working on parallelizing the writing of RNTuples. In addition, we are
working on what we like to call “interface ergonomics”, i.e. the way developers
will interact with RNTuple. Be aware that this means that the RNTuple
interfaces might still change a little in the coming months! Next to all of
this, we are preparing for larger-scale performance testing to see in what
areas we could further improve. Another area of work for the near future will
be in the direction of _data set combinatorics_ – that is, finding smart(er)
ways of accessing and combining existing RNTuple data. And of course, we
will continue to work with the experiments to make sure the transition to
RNTuple will be as smooth as possible.
To wrap things up, things are looking good for RNTuple, and while there is still
enough work to be done, we’re excited and eager to make RNTuple as good as it
can be! If you want to know more about the evolution and performance of
RNTuple, be sure to check out the references below, as well as
[our other publications](https://root.cern/about/publications/#io).
If you are eager to dive deeper into the specifics of the RNTuple binary format, you can
[read the specification here](https://github.com/root-project/root/blob/368dd4999733efe3eb9eb171bfc1b57560259443/tree/ntuple/v7/doc/specifications.md#L1).
Finally, reach out to us on the [forum](https://root-forum.cern.ch/) if you have
any questions or if you would like to
[contribute to RNTuple or ROOT](https://root.cern/contribute/) in general!
## References
<a name="rntuple-evolution">[1]</a> J. Blomer, P. Canal, A. Naumann, and D. Piparo, “Evolution of the ROOT Tree I/O,” EPJ Web Conf., vol. 245, 2020, doi: [10.1051/epjconf/202024502030](https://doi.org/10.1051/epjconf/202024502030).
<a name="atlas-roadmap">[2]</a> ATLAS Collaboration, “ATLAS Software and Computing HL-LHC Roadmap,” CERN, Geneva, CERN-LHCC-2022-005, LHCC-G-182, 2022. Accessed: May 02, 2023. [Online]. Available: <http://cds.cern.ch/record/2802918>.
<a name="hep-data-formats">[3]</a> J. Blomer, “A quantitative review of data formats for HEP analyses,” J. Phys. Conf. Ser., vol. 1085, p. 032020, Sep. 2018, doi: [10.1088/1742-6596/1085/3/032020](https://doi.org/10.1088/1742-6596/1085/3/032020).
<a name="atlas-edm">[4]</a> J. Elmsheuser et al., “Evolution of the ATLAS analysis model for Run-3 and prospects for HL-LHC,” EPJ Web Conf., vol. 245, 2020, doi: [10.1051/epjconf/202024506014](https://doi.org/10.1051/epjconf/202024506014).
<a name="rntuple-status">[5]</a> J. Lopez-Gomez and J. Blomer, “RNTuple performance: Status and Outlook.” arXiv, Apr. 07, 2022. doi: [10.48550/arXiv.2204.09043](https://doi.org/10.48550/arXiv.2204.09043).

0 comments on commit 9a4102f

Please sign in to comment.