chore: update docs (#83)
* refigured prepare_docstrings, started adding pages for CLI documentation

* adding logo to readme

* formatting...

* Merge branch 'update-docs' of https://github.com/zbilodea/odapt into update-docs

* issues with merge...

* issues with merge

* chore: update README.md

* small changes to main

* merging

* cli changes should be complete

* formatting...

* still formatting...

* guide done for now

---------

Co-authored-by: Eduardo Rodrigues <[email protected]>
zbilodea and eduardo-rodrigues authored Apr 7, 2024
1 parent ad7fa50 commit 517289b
Showing 37 changed files with 868 additions and 167 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -1,4 +1,4 @@
# hepconvert
<img src="https://github.com/scikit-hep/hepconvert/blob/f461872ec41473b14fdb7ebd76e68798ef8bb394/docs/docs-img/hepconvert_logo.svg" width="400px">

[![Actions Status][actions-badge]][actions-link]
[![Documentation Status][rtd-badge]][rtd-link]
@@ -24,7 +24,7 @@
[rtd-badge]: https://readthedocs.org/projects/hepconvert/badge/?version=latest
[rtd-link]: https://hepconvert.readthedocs.io/en/latest/

The hepconvert library is a bridge between columnar file formats, currently **ROOT, and Parquet** and soon eventually include **Feather, and HDF5.** It aims to simplify file conversions in Python, replacing what is usually a multi-step process with one line of code, with builtin features for managing large datasets and choosing compression levels.
The hepconvert library is a bridge between columnar file formats, currently **ROOT and Parquet**, and will soon include **Feather and HDF5**. It aims to simplify file conversions in Python, replacing what is usually a multi-step process with one line of code, with built-in features for managing large datasets and choosing compression levels.

# Installation

43 changes: 43 additions & 0 deletions docs/source/add.rst
@@ -0,0 +1,43 @@
CLI Guide for add_histograms (add)
==================================

Instructions for function `add_histograms <https://hepconvert.readthedocs.io/en/latest/hepconvert.histogram_adding.add_histograms.html>`__.

Command:
--------

.. code-block:: bash

    hepconvert add [options] [OUT_FILE] [IN_FILES]

Examples:
---------

.. code-block:: bash

    hepconvert add -f --progress-bar --union summed_hists.root hist1.root hist2.root hist3.root

Or, if files are in a directory:

.. code-block:: bash

    hepconvert add --append --same-names summed_hists.root path/directory/

Options:
--------

``--force``, ``-f`` Use flag to overwrite a file if it already exists.

``--progress-bar`` Will show a basic progress bar to show how many histograms have been summed, and how many files have been read.

``--append``, ``-a`` Will append histograms to an existing file.

``--compression``, ``-c`` Compression type. Options are "lzma", "zlib", "lz4", and "zstd". Default is "zlib".

``--compression-level`` Level of compression set by an integer. Default is 1.

``--union`` Use flag to add together histograms that have the same name and append all others to the new file.

``--same-names`` Use flag to only add histograms together if they have the same name.
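
The command form above can also be assembled from Python, for instance when scripting several runs. The following is a minimal sketch that only builds and prints the argument list, using flags documented on this page. ``build_add_command`` is a hypothetical helper, not part of hepconvert, and nothing is executed.

```python
import shlex


def build_add_command(out_file, in_files, *, force=False, union=False,
                      progress_bar=False, compression=None):
    """Build an argv list for ``hepconvert add`` (hypothetical helper)."""
    argv = ["hepconvert", "add"]
    if force:
        argv.append("--force")
    if progress_bar:
        argv.append("--progress-bar")
    if union:
        argv.append("--union")
    if compression is not None:
        argv += ["--compression", compression]
    # Positional arguments: the output file first, then the input files.
    argv.append(out_file)
    argv.extend(in_files)
    return argv


argv = build_add_command(
    "summed_hists.root",
    ["hist1.root", "hist2.root"],
    force=True,
    union=True,
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

With hepconvert installed, a list built this way could be handed to ``subprocess.run(argv, check=True)``.
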
9 changes: 9 additions & 0 deletions docs/source/cli.toctree
@@ -0,0 +1,9 @@
.. toctree::
    :caption: Command Line Interface Instructions
    :hidden:

    parquet-to-root <parquet_to_root>
    root-to-parquet <root_to_parquet>
    copy-root <copy_root>
    merge-root <merge_root>
    add (add_histograms) <add>
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -36,4 +36,4 @@
# Additional stuff
master_doc = "index"

# exec(open("prepare_docstrings.py").read(), dict(globals()))
exec(open("prepare_docstrings.py").read(), dict(globals()))
57 changes: 57 additions & 0 deletions docs/source/copy_root.rst
@@ -0,0 +1,57 @@
Command Line Interface Guide: copy_root
=======================================

Instructions for function `hepconvert.copy_root <https://hepconvert.readthedocs.io/en/latest/hepconvert.copy_root.copy_root.html>`__

Command:
--------

.. code-block:: bash

    hepconvert copy-root [options] [OUT_FILE] [IN_FILE]

Examples:
---------

.. code-block:: bash

    hepconvert copy-root -f --progress-bar --keep-branches 'Jet_*' out_file.root in_file.root

Branch skimming using ``cut``:

.. code-block:: bash

    hepconvert copy-root -f --keep-branches 'Jet_*' --cut 'Jet_Px > 5' out_file.root in_file.root

Options:
--------

``--drop-branches``, ``-db`` and ``--keep-branches``, ``-kb`` list, str or dict. Specify branch names to remove from or keep in the ROOT file. Either a str, a list of str (for multiple branches), or a dict of the form {'tree': 'branches'} to apply the selection only to certain TTrees. Wildcarding accepted.

``--drop-trees``, ``-dt`` and ``--keep-trees``, ``-kt`` list of str, or str. Specify tree names to remove/keep TTrees in the ROOT files. Wildcarding accepted.

``--cut`` For branch skimming, passed to `uproot.iterate <https://uproot.readthedocs.io/en/latest/uproot.behaviors.TBranch.iterate.html>`__. str, if not None, this expression filters all of the expressions.

``--expressions`` For branch skimming, passed to `uproot.iterate <https://uproot.readthedocs.io/en/latest/uproot.behaviors.TBranch.iterate.html>`__. Names of TBranches or aliases to convert to arrays or mathematical expressions of them. If None, all TBranches selected by the filters are included.

``--force``, ``-f`` Use flag to overwrite a file if it already exists.

``--progress-bar`` Will show a basic progress bar to show how many TTrees have been copied and written.

``--append``, ``-a`` Will append new TTree to an existing file.

``--compression``, ``-c`` Compression type. Options are "lzma", "zlib", "lz4", and "zstd". Default is "zlib".

``--compression-level`` Level of compression set by an integer. Default is 1.

``--name`` Give a name to the new TTree. Default is "tree".

``--title`` Give a title to the new TTree.

``--initial-basket-capacity`` (int) Number of TBaskets that can be written to the TTree without rewriting the TTree metadata to make room. Default is 10.

``--resize-factor`` (float) When the TTree metadata needs to be rewritten, this specifies how many more TBasket slots to allocate as a multiplicative factor. Default is 10.0.

``--step-size`` Size of batches of data to read and write. If an integer, the maximum number of entries to include in each iteration step; if a string, the maximum memory size to include. The string must be a number followed by a memory unit, such as "100 MB". Default is "100 MB".
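
To illustrate what wildcard selection with ``--keep-branches`` and ``--drop-branches`` means, here is a minimal sketch using shell-style patterns from Python's standard library. It assumes glob-style matching; hepconvert's actual matching rules may differ, and ``select_branches`` is a hypothetical helper, not a hepconvert function.

```python
from fnmatch import fnmatchcase


def select_branches(names, keep=None, drop=None):
    """Return the branch names that survive keep/drop patterns (illustrative)."""
    def as_list(patterns):
        # Accept either a single pattern string or a list of patterns.
        return [patterns] if isinstance(patterns, str) else list(patterns)

    selected = list(names)
    if keep is not None:
        selected = [n for n in selected
                    if any(fnmatchcase(n, p) for p in as_list(keep))]
    if drop is not None:
        selected = [n for n in selected
                    if not any(fnmatchcase(n, p) for p in as_list(drop))]
    return selected


branches = ["Jet_Px", "Jet_Py", "Muon_Px", "NJet"]
kept = select_branches(branches, keep="Jet_*")
dropped = select_branches(branches, drop=["Jet_*", "NJet"])
print(kept)
print(dropped)
```

Quoting the pattern on the command line, as in ``'Jet_*'``, keeps the shell from expanding it before hepconvert sees it.
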
229 changes: 229 additions & 0 deletions docs/source/general_guide.rst
@@ -0,0 +1,229 @@
General Guide and Examples:
===========================
Is something missing from this guide? Please post your questions on the `discussions page <https://github.com/scikit-hep/hepconvert/discussions>`__!

Features of all (or most) functions:
----------------------------------------

**Automatic handling of Uproot duplicate counter issue:**
If you are using a hepconvert function that goes ROOT -> ROOT (both the input and output files are ROOT)
and working with data in jagged arrays, if branches have the same "fLeafCount", hepconvert
will group branches automatically so that Uproot will not create a `counter branch for each branch <https://github.com/scikit-hep/uproot5/discussions/903>`__.
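
The grouping idea can be pictured with plain Python. This is only a sketch of grouping branch names by a shared counter; the mapping below is made-up example data, and ``group_by_counter`` is a hypothetical helper rather than hepconvert's implementation.

```python
from collections import defaultdict

# Hypothetical mapping: jagged branch name -> name of its counter (fLeafCount).
leaf_counts = {
    "Jet_Px": "NJet",
    "Jet_Py": "NJet",
    "Muon_Px": "NMuon",
    "Muon_Py": "NMuon",
}


def group_by_counter(leaf_counts):
    """Group branch names that share the same counter."""
    groups = defaultdict(list)
    for branch, counter in leaf_counts.items():
        groups[counter].append(branch)
    return dict(groups)


groups = group_by_counter(leaf_counts)
print(groups)
```

Each resulting group needs only one counter, instead of one counter per branch.
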

**Quick Modifications of ROOT files and TTrees:**

Functions ``copy_root``, ``merge_root``, and ``root_to_parquet`` have a few options for applying quick
modifications to ROOT files and TTree data.

**Branch slimming:**
Parameters ``keep_branches`` or ``drop_branches`` (list or dict) control branch slimming.
Examples:

.. code:: python

    >>> hepconvert.copy_root("out_file.root", "in_file.root", keep_branches="x*", progress_bar=True, force=True)

    # Before:
    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # x1                   | int64_t                  | AsDtype('>i8')
    # x2                   | int64_t                  | AsDtype('>i8')
    # y1                   | int64_t                  | AsDtype('>i8')
    # y2                   | int64_t                  | AsDtype('>i8')

    # After:
    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # x1                   | int64_t                  | AsDtype('>i8')
    # x2                   | int64_t                  | AsDtype('>i8')

.. code:: python

    >>> hepconvert.copy_root("out_file.root", "in_file.root", keep_branches={"tree1": ["branch2", "branch3"], "tree2": ["branch2"]}, progress_bar=True, force=True)

    # Before:
    # Tree1:
    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # branch1              | int64_t                  | AsDtype('>i8')
    # branch2              | int64_t                  | AsDtype('>i8')
    # branch3              | int64_t                  | AsDtype('>i8')
    # Tree2:
    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # branch1              | int64_t                  | AsDtype('>i8')
    # branch2              | int64_t                  | AsDtype('>i8')
    # branch3              | int64_t                  | AsDtype('>i8')

    # After:
    # Tree1:
    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # branch2              | int64_t                  | AsDtype('>i8')
    # branch3              | int64_t                  | AsDtype('>i8')
    # Tree2:
    # name                 | typename                 | interpretation
    # ---------------------+--------------------------+-------------------------------
    # branch2              | int64_t                  | AsDtype('>i8')

**Branch skimming:**
Parameters ``cut`` and ``expressions`` control branch skimming. Both of these parameters go to Uproot's `iterate
<https://uproot.readthedocs.io/en/latest/uproot.behaviors.TBranch.iterate.html>`__
function. See Uproot's documentation for more details.

Basic example:

.. code:: python

    hepconvert.copy_root("skimmed_HZZ.root", "HZZ.root", keep_branches="Jet_*",
                         force=True, expressions="Jet_Px", cut="Jet_Px >= 10")

**Remove TTrees:**
Use parameters ``keep_trees`` or ``drop_trees`` to remove TTrees.

.. code:: python

    import numpy as np
    import uproot
    import hepconvert

    # Creating example data:
    with uproot.recreate("two_trees.root") as file:
        file["tree"] = {"x": np.array([1, 2, 3])}
        file["tree1"] = {"x": np.array([1, 2, 3])}

    # Keep only the TTree named "tree".
    hepconvert.copy_root("one_tree.root", "two_trees.root", keep_trees="tree", force=True)

**How hepconvert works with ROOT**

hepconvert uses Uproot for reading and writing ROOT files, so it shares Uproot's limitations.
It currently only works with flat TTrees (nanoAOD-like data), and cannot yet read or write RNTuples.

As described in Uproot's documentation:

.. note::

    A small but growing list of data types can be written to files:

    * strings: TObjString
    * histograms: TH1*, TH2*, TH3*
    * profile plots: TProfile, TProfile2D, TProfile3D
    * NumPy histograms created with `np.histogram <https://numpy.org/doc/stable/reference/generated/numpy.histogram.html>`__, `np.histogram2d <https://numpy.org/doc/stable/reference/generated/numpy.histogram2d.html>`__, and `np.histogramdd <https://numpy.org/doc/stable/reference/generated/numpy.histogramdd.html>`__ with 3 dimensions or fewer
    * histograms that satisfy the `Universal Histogram Interface <https://uhi.readthedocs.io/>`__ (UHI) with 3 dimensions or fewer; this includes `boost-histogram <https://boost-histogram.readthedocs.io/>`__ and `hist <https://hist.readthedocs.io/>`__
    * PyROOT objects

**Memory Management**

Each hepconvert function has automatic and customizable memory management for working with large files.

Functions reading **ROOT** files will read in batches controlled by the parameter ``step_size``.
Set ``step_size`` to either an `int` to set the batch size to a number of entries, or a `string` in
form of "100 MB".
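
The two accepted forms of ``step_size`` can be told apart as in the sketch below. It is illustrative only: ``describe_step_size`` is a hypothetical helper, and the decimal unit multipliers are an assumption, so Uproot's own parser may interpret units differently.

```python
import re

# Decimal multipliers are an assumption made for this sketch.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def describe_step_size(step_size):
    """Classify a step_size value as an entry count or a memory budget in bytes."""
    if isinstance(step_size, int):
        return ("entries", step_size)
    # Expect a number followed by a unit, for example "100 MB".
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", step_size)
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"cannot interpret step_size {step_size!r}")
    number, unit = float(match.group(1)), match.group(2).upper()
    return ("bytes", int(number * _UNITS[unit]))


print(describe_step_size(50_000))
print(describe_step_size("100 MB"))
```

An integer caps the number of entries per batch regardless of their size, while a memory string lets the batch length adapt to how large each entry is.
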


**Progress Bars**
hepconvert uses the package tqdm for progress bars. If you do not have the package installed, an error message will provide installation instructions.
They are controlled with the ``progress_bar`` argument.
For example, to use a default progress bar with copy_root, set progress_bar to True:

.. code:: python

    hepconvert.copy_root("out_file.root", "in_file.root", progress_bar=True)

Some functions can handle a customized tqdm progress bar.
To use a customized tqdm progress bar, make a progress bar object and pass it to the hepconvert function as follows:

.. code:: python

    >>> import tqdm
    >>> bar_obj = tqdm.tqdm(colour="GREEN", desc="Description")
    >>> hepconvert.add_histograms("out_file.root", "path/in_files/", progress_bar=bar_obj)

.. image:: https://raw.githubusercontent.com/scikit-hep/hepconvert/main/docs/docs-img/progress_bar.png
    :width: 450px
    :alt: hepconvert
    :target: https://github.com/scikit-hep/hepconvert


Some types of tqdm progress bar objects may not work in this way.


**Command Line Interface**

All functions can be run from the command line. See the "Command Line Interface Instructions" tab on the left for CLI
instructions on individual functions.

Adding Histograms
-----------------
``hepconvert.add_histograms`` adds the values of many histograms
and writes the summed histograms to an output file (like ROOT's hadd, but limited
to histograms).


**Parameters of note:**

``union`` If True, adds the histograms that have the same name and appends all others
to the new file.

``append`` If True, appends histograms to an existing file. Force and append
cannot both be True.

``same_names`` If True, only adds together histograms which have the same name (key). If False,
histograms are added together based on TTree structure (bins must be equal).

Memory:
``add_histograms`` has no memory customization available currently. To maintain
performance it stores the summed histograms in memory until all files have
been read, then the summed histograms are written to the output file. Only
one input ROOT file is read and kept in memory at a time.
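
The accumulation strategy described above can be sketched with plain Python. In this illustration, lists of bin counts stand in for histograms and each dict stands in for one input file; it models the ``union`` behavior (sum histograms with the same name, carry over the rest) and is not hepconvert's implementation.

```python
def add_into(total, hists):
    """Accumulate one file's histograms into the running totals (union behavior)."""
    for name, counts in hists.items():
        if name in total:
            # Histograms can only be summed if their binning matches.
            if len(total[name]) != len(counts):
                raise ValueError(f"binning mismatch for {name!r}")
            total[name] = [a + b for a, b in zip(total[name], counts)]
        else:
            # A histogram seen for the first time is carried over as-is.
            total[name] = list(counts)
    return total


# Each dict stands in for the histograms read from one input file.
files = [
    {"pt": [1, 2, 3], "eta": [0, 1, 0]},
    {"pt": [4, 5, 6]},
    {"pt": [1, 1, 1], "phi": [2, 2, 2]},
]

summed = {}
for hists in files:
    # Only the running totals and the current file's histograms are in memory.
    add_into(summed, hists)
print(summed)
```

The totals are written out once, after the loop, which matches the description of holding summed histograms in memory until all files have been read.
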


Merging TTrees
--------------
``hepconvert.merge_root`` merges TTrees in multiple ROOT files together. The end result is a single file containing data from all input files (again like ROOT's hadd, but can handle flat TTrees and histograms).

.. warning::
    At the moment, hepconvert.merge_root can only merge TTrees that have the same
    number of branches, with the same names and datatypes.
    We are working on adding backfill capabilities for mismatched TTrees.
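
Before merging, it can help to check that the TTrees actually match. The sketch below compares two mappings of branch name to type name, such as what Uproot's ``typenames()`` returns for a TTree; ``schema_mismatches`` is a hypothetical helper, and the example mappings are made up.

```python
def schema_mismatches(first, other):
    """List differences between two branch-name -> type-name mappings."""
    problems = []
    for name in sorted(first.keys() - other.keys()):
        problems.append(f"missing branch {name!r}")
    for name in sorted(other.keys() - first.keys()):
        problems.append(f"unexpected branch {name!r}")
    for name in sorted(first.keys() & other.keys()):
        if first[name] != other[name]:
            problems.append(f"{name!r}: {first[name]} != {other[name]}")
    return problems


reference = {"x": "int64_t", "y": "double"}
reordered = {"y": "double", "x": "int64_t"}
retyped = {"x": "int64_t", "y": "float"}
shorter = {"x": "int64_t"}

print(schema_mismatches(reference, reordered))
print(schema_mismatches(reference, retyped))
print(schema_mismatches(reference, shorter))
```

An empty list means the two trees have the same branch names and types, which is the condition the warning above describes.
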

**Features:**
merge_root has parameters ``cut``, ``expressions``, ``drop_branches``, ``keep_branches``, ``drop_trees`` and ``keep_trees``.


Copying TTrees
--------------
``hepconvert.copy_root`` copies TTrees from one ROOT file to a new ROOT file, optionally slimming or skimming the data along the way.

**Features:**
copy_root has parameters ``cut``, ``expressions``, ``drop_branches``, ``keep_branches``, ``drop_trees`` and ``keep_trees``.


Parquet to ROOT
---------------

Writes the data from a single Parquet file to one TTree in a ROOT file.
This function creates a new TTree (name the new tree with parameter ``tree``).


ROOT to Parquet
---------------

Writes the data from one TTree in a ROOT file to a single Parquet file.
If there are multiple TTrees in the file, specify one TTree to write to the Parquet file using the ``tree`` parameter.

**Features:**
root_to_parquet has parameters ``cut``, ``expressions``, ``drop_branches``, ``keep_branches``.
5 changes: 5 additions & 0 deletions docs/source/guide.toctree
@@ -0,0 +1,5 @@
.. toctree::
    :caption: Guide with Examples
    :hidden:

    general_guide
6 changes: 6 additions & 0 deletions docs/source/hepconvert.add_histograms.rst
@@ -0,0 +1,6 @@
hepconvert.add_histograms
=========================

Defined in `hepconvert.histogram_adding <https://github.com/zbilodea/hepconvert/blob/52e6cbfbbf81c669ca31b8a538d8f3e8984b35a5/src/hepconvert/histogram_adding.py>`__ on `line 345 <https://github.com/zbilodea/hepconvert/blob/52e6cbfbbf81c669ca31b8a538d8f3e8984b35a5/src/hepconvert/histogram_adding.py#L345>`__.

.. autofunction:: hepconvert.add_histograms
6 changes: 0 additions & 6 deletions docs/source/hepconvert.copy_root.copy_root.rst

This file was deleted.

6 changes: 6 additions & 0 deletions docs/source/hepconvert.copy_root.rst
@@ -0,0 +1,6 @@
hepconvert.copy_root
====================

Defined in `hepconvert.copy_root <https://github.com/zbilodea/hepconvert/blob/52e6cbfbbf81c669ca31b8a538d8f3e8984b35a5/src/hepconvert/copy_root.py>`__ on `line 15 <https://github.com/zbilodea/hepconvert/blob/52e6cbfbbf81c669ca31b8a538d8f3e8984b35a5/src/hepconvert/copy_root.py#L15>`__.

.. autofunction:: hepconvert.copy_root
3 changes: 1 addition & 2 deletions docs/source/hepconvert.copy_root.toctree
@@ -2,5 +2,4 @@
:caption: copy_root
:hidden:

hepconvert.copy_root (module) <hepconvert.copy_root>
hepconvert.copy_root.copy_root <hepconvert.copy_root.copy_root>
hepconvert.copy_root <hepconvert.copy_root>