Add blog post for improvements to data access (#72)
GeigerJ2 authored Oct 4, 2024
1 parent 1e5a1f2 commit f5b1daa
Showing 6 changed files with 257 additions and 3 deletions.
1 change: 1 addition & 0 deletions create_post.py
@@ -13,6 +13,7 @@
    "event": "Events",
    "report": "Reports",
    "release": "Releases",
    "blog": "Blog",
}


2 changes: 1 addition & 1 deletion docs/news/posts/2020-05-27-new-mc-archive.md
@@ -11,7 +11,7 @@ We would like to announce the launch of a newly engineered [Materials Cloud Arch

The [Materials Cloud Archive](https://archive.materialscloud.org/), active since March 2017, is a public, free, open-access repository for research data and tools in computational materials science and in related experimental efforts, inspired by the archive initiatives for preprints. It provides the capability to upload and persist arbitrary data records from anyone in the community with a minimum guaranteed 10-year retention time per record. Currently, 0.5 petabytes are already allocated; the  limits for standard submissions are of 5 GB for data sets in any format, and of 50 GB for AiiDA databases; moderators can approve larger data sets upon request. Each entry is assigned a globally unique and persistent digital object identifier (DOI) and harvestable metadata. The new Invenio platform makes it easier for authors to submit and later update data records, provides full-text searches, and powers streamlined workflows for content moderation.

The Archive is an integral part of the [Materials Cloud](https://www.materialscloud.org/) FAIR data infrastructure, in partnership with several European and national centres - these include the [MaX](http://www.max-centre.eu/) Centre of Excellence, the  [MARVEL](https://nccr-marvel.ch/) NCCR, the H2020 [MarketPlace](https://www.the-marketplace-project.eu/), [NFFA](http://www.nffa.eu/), and [Intersect](http://intersect-project.eu/) projects, [EMMC](https://emmc.eu/), [swissuniversities](https://www.materialscloud.org/swissuniversities), [PASC](https://www.pasc-ch.org/), and [OSSCAR](https://www.osscar.org/). It is a [recommended repository for Nature’s Scientific Data](https://www.nature.com/sdata/policies/repositories#materials), it is indexed by [FAIRsharing](https://fairsharing.org/biodbcore-001089/), [Google Dataset Search](https://datasetsearch.research.google.com/search?query=Materials%20Cloud) and [EOSC-hub](https://www.eosc-hub.eu/)/[EUDAT](https://www.eudat.eu/)’s service [B2FIND](http://b2find.eudat.eu/organization/materialscloud), and it is registered on [re3data](https://www.re3data.org/repository/r3d100012611). Finally, it is an official [implementation network](https://www.go-fair.org/implementation-networks/overview/materials-cloud/) of the [GO FAIR initiative](https://www.go-fair.org/).
The Archive is an integral part of the [Materials Cloud](https://www.materialscloud.org/) FAIR data infrastructure, in partnership with several European and national centres - these include the [MaX](http://www.max-centre.eu/) Centre of Excellence, the  [MARVEL](https://nccr-marvel.ch/) NCCR, the H2020 [MarketPlace](https://www.the-marketplace-project.eu/), [NFFA](http://www.nffa.eu/), and [Intersect](http://intersect-project.eu/) projects, [EMMC](https://emmc.eu/), [swissuniversities](https://www.materialscloud.org/swissuniversities), [PASC](https://www.pasc-ch.org/), and [OSSCAR](https://www.osscar.org/). It is a [recommended repository for Nature’s Scientific Data](https://www.nature.com/sdata/policies/repositories#materials), it is indexed by [FAIRsharing](https://fairsharing.org/biodbcore-001089/), [Google Dataset Search](https://datasetsearch.research.google.com/search?query=Materials%20Cloud) and [EOSC-hub](https://www.eosc-hub.eu/)/[EUDAT](https://www.eudat.eu/)’s service [B2FIND](http://b2find.eudat.eu/organization/materialscloud), and it is registered on [re3data](https://www.re3data.org/repository/r3d100012611).

More information on the Materials Cloud integration of data, workflows and codes can be found in [L. Talirz et al., Materials Cloud, a platform for open computational science, arXiv:2003.12510 (2020)](https://arxiv.org/abs/2003.12510) and in [S. Huber et al., AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance, arXiv:2003.12476 (2020)](https://arxiv.org/abs/2003.12476).

253 changes: 253 additions & 0 deletions docs/news/posts/2024-10-04-data-access.md
@@ -0,0 +1,253 @@
---
blogpost: true
category: Blog
tags: usability
date: 2024-10-04
---

# Improvements in the ways to get your data out of AiiDA

Dear users, as the saying goes, "Data is the gold of the 21st century". In today's blog post, we would like to showcase
improvements in how you can get your data out of AiiDA's internal storage. We hope that these new features will enrich
not only you, but also the science you conduct.

## Dumping process data to disk

_From AiiDA's internal storage to classical directory trees_

As you might be aware, AiiDA uses an SQL
[database](https://aiida.readthedocs.io/projects/aiida-core/en/v2.6.2/topics/storage.html), as well as an internal [file
repository](https://aiida.readthedocs.io/projects/aiida-core/en/stable/topics/repository.html#repository) [^1] to store
your data locally [^2]. Both are optimized towards high performance and therefore constructed to be machine-readable
rather than human-readable. Hence, the difference between AiiDA's internal data storage and the typical file-system
approach (that most of us are familiar with) can make it cumbersome to get your data out of AiiDA onto your file system
in an easily understandable form.

Therefore, you, the user, are effectively forced to use the `verdi` CLI or AiiDA's Python API (e.g. the
`QueryBuilder` class) to access your data, making the transition towards AiiDA more challenging. To ease this
transition, we have [added
functionality](https://aiida.readthedocs.io/projects/aiida-core/en/v2.6.2/howto/data.html#dumping-data-to-disk) to dump
AiiDA `Process` data to disk in an intuitive directory structure via:

```shell
verdi process dump <pk>
```

The following video shows the result of running the command for a `PwCalculation` that was used to execute the `pw.x`
executable of Quantum ESPRESSO:

![PwCalculation dump](./_gifs/calculation-dump-white-10fps-2160p.gif)

And for a more complex `PwBandsWorkChain` (which actually contains the previously shown `PwCalculation` as one of its steps):

![PwBandsWorkChain dump](./_gifs/workflow-dump-white-10fps-2160p.gif)

As you can see, the command works both for individual calculations and for nested workflows, resulting in
the following output directories [^3].

**`tree` on a dumped example `CalcJob`:**

```shell
dump-PwCalculation-54
├── README.md
├── inputs
│   ├── _aiidasubmit.sh
│   └── aiida.in
├── outputs
│   ├── _scheduler-stderr.txt
│   ├── _scheduler-stdout.txt
│   ├── aiida.out
│   └── data-file-schema.xml
└── node_inputs
    └── pseudos
        └── Si
            └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
```

**`tree -d` on a dumped example `WorkChain`:**

```shell
dump-PwBandsWorkChain-70
├── 01-relax-PwRelaxWorkChain
│   ├── 01-iteration_01-PwBaseWorkChain
│   │   ├── 01-create_kpoints_from_distance
│   │   │   └── inputs
│   │   └── 02-iteration_01-PwCalculation
│   │       ├── inputs
│   │       ├── node_inputs
│   │       │   └── pseudos
│   │       │       └── Si
│   │       └── outputs
│   └── 02-iteration_02-PwBaseWorkChain
│       ├── 01-create_kpoints_from_distance
│       │   └── inputs
│       └── 02-iteration_01-PwCalculation
│           ├── inputs
│           ├── node_inputs
│           │   └── pseudos
│           │       └── Si
│           └── outputs
├── 02-seekpath-seekpath_structure_analysis
│   └── inputs
├── 03-scf-PwBaseWorkChain
│   ├── ...
...
```

Therefore, after running the command once, you'll have all data involved in the execution of your workflow directly
accessible as a standard folder [^4]. This allows you to explore it with your favorite file explorer or command-line
tool.
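If you prefer to stay in Python, the dumped folder can also be searched with nothing but the standard library. The following is a minimal sketch, not part of AiiDA: the helper name `grep_dump`, the `*.out` file pattern, and the search string `"total energy"` are assumptions chosen for illustration, and the directory name is simply the one from the example above.

```python
from pathlib import Path


def grep_dump(dump_dir: Path, pattern: str, glob: str = "*.out") -> list[tuple[Path, int, str]]:
    """Return (file, line number, line) for every line that contains `pattern`."""
    hits = []
    # rglob descends through all nested workflow and calculation directories.
    for path in sorted(dump_dir.rglob(glob)):
        if not path.is_file():
            continue
        # errors="replace" keeps the search going if a file is not valid UTF-8.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if pattern in line:
                    hits.append((path, number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Directory name taken from the example above; adjust it to your own dump.
    for path, number, line in grep_dump(Path("dump-PwBandsWorkChain-70"), "total energy"):
        print(f"{path}:{number}: {line}")
```

Because the dump is a plain directory tree, the same loop works unchanged for a single calculation or a deeply nested workflow.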

Happy grepping!

## New QueryBuilder Syntax

_SQL queries, but intuitive!_

In addition to accessing raw files as outlined above, AiiDA's powerful SQL database allows querying for stored nodes,
which can be achieved with the `QueryBuilder` class (as documented
[here](https://aiida.readthedocs.io/projects/aiida-core/en/v2.6.2/howto/query.html)). While using the `QueryBuilder` is
(at least for most of us) easier than writing raw SQL queries, its syntax typically requires some familiarization [^5].

Recent improvements have therefore enabled an alternative, more intuitive way to construct queries. Let us explain with
the following example: Assume you wanted to obtain all integers with values in a range between 1 and 10 (both excluded)
from a `Group` called "integers", and return their respective PKs and values. To achieve this, you'd have to construct
the following, rather convoluted query:

```python
from aiida import orm

qb = orm.QueryBuilder()
qb.append(
    orm.Group,
    filters={
        "label": "integers",
    },
    project=["label"],
    tag="group",
)
qb.append(
    orm.Int,
    with_group="group",
    filters={
        "and": [
            {"attributes.value": {">": 1}},
            {"attributes.value": {"<": 10}},
        ]
    },
    project=["pk", "attributes.value"],
)
```

In the code snippet above, we first import AiiDA's [object-relational
mapping](https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping) (`orm`) module, and then instantiate the
`QueryBuilder` class. The query is then gradually built up by adding the desired specifications using the `append`
method. Here, we first apply filtering for groups that are labelled "integers" and tag this filter as "group" so that we
can link it with the second `append`. In this second call of the method, we only filter for integers of AiiDA's integer
data type (`orm.Int`) that are part of our previously defined group via `with_group="group"`. We then apply the filter
that the values of the integers should be in our desired range between 1 and 10, and, lastly, using
`project=["pk", "attributes.value"]`, we only return the primary keys and actual values of the AiiDA `orm.Int` nodes we
obtain from our query (rather than, say, the entire AiiDA `Node` instance).

Instead, the new QueryBuilder syntax allows accessing attributes of AiiDA nodes via the [new `fields`
specifier](https://aiida.readthedocs.io/projects/aiida-core/en/v2.6.2/reference/_changelog.html#programmatic-syntax-for-query-builder-filters-and-projections),
with which the filtering logic can be applied to them directly:

```python
from aiida import orm

qb = orm.QueryBuilder()
qb.append(
    orm.Group,
    filters=orm.Group.fields.label == "integers",
    project=[orm.Group.fields.label],
    tag="group",
)
qb.append(
    orm.Int,
    with_group="group",
    filters=(orm.Int.fields.value > 1) & (orm.Int.fields.value < 10),
    project=[orm.Int.fields.pk, orm.Int.fields.value],
)
```

For example, the filter on the values of the integer nodes reduces from:

```python
filters={
    "and": [
        {"attributes.value": {">": 1}},
        {"attributes.value": {"<": 10}},
    ]
}
```

to the more concise:

```python
filters=(orm.Int.fields.value > 1) & (orm.Int.fields.value < 10),
```

in which the `"and"` condition can be expressed via the ampersand (`&`) and directly be applied on the relevant
entities. Furthermore, accessing through the `.fields` attribute, e.g. in the updated `project` specifier:

```python
project=[orm.Int.fields.pk, orm.Int.fields.value]
```

albeit being slightly more verbose, is less prone to errors than access via string identifiers in the previous version:

```python
project=["pk", "attributes.value"]
```

as it allows for autocompletion.
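If you are curious how comparison operators and `&` can stand in for nested dictionaries, the toy sketch below illustrates the general idea with Python operator overloading. It is not AiiDA's implementation: the classes `Field` and `Filter` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """A filter expression that carries its dictionary-style equivalent."""

    spec: dict

    def __and__(self, other: "Filter") -> "Filter":
        # `a & b` nests both operands under an "and" key.
        return Filter({"and": [self.spec, other.spec]})


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for an entry such as `orm.Int.fields.value`."""

    key: str

    def __gt__(self, value) -> Filter:
        return Filter({self.key: {">": value}})

    def __lt__(self, value) -> Filter:
        return Filter({self.key: {"<": value}})


value = Field("attributes.value")
combined = (value > 1) & (value < 10)
print(combined.spec)
```

The parentheses around each comparison matter: in Python, `&` binds more tightly than `>` and `<`, so without them the expression would be grouped differently.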

Any feedback on the new QueryBuilder syntax is welcome!

***

## Relevant PRs

For the more tech-savvy among us, here are the relevant PRs of the changes outlined in this blog post:

- [[#6276]](https://github.com/aiidateam/aiida-core/pull/6276) Add CLI command to dump inputs/outputs of `CalcJob`/`WorkChain`
- [[#6245]](https://github.com/aiidateam/aiida-core/pull/6245) ✨ NEW: Add `orm.Entity.fields` interface for `QueryBuilder` [and linked PRs]

## Footnotes

[^1]:
    The file repository is based on the [`disk-objectstore`](https://github.com/aiidateam/disk-objectstore)
    implementation. If you ever wondered what the `_dos` appendix of the `core.psql_dos` and `core.sqlite_dos` storage
    backends means, now you know! 😉

[^2]:
    The discussion in the main text of this blog post refers to the files and data stored by AiiDA on the local computer
    where AiiDA is installed, and which are preserved long-term in its internal file repository. These files are
    obtained, e.g. by retrieval from the _remote_ computer once a calculation finishes, or could be parsed data or
    inputs provided by the user. Instead, during the execution of your calculations _on the remote computer_, files are
    located in a subfolder of the `work_directory` of the used `Computer` (typically in the `scratch`), where the
    subfolder name is generated from the UUID of the AiiDA `CalculationNode`. This directory has a three-level depth,
    obtained by "sharding" the UUID based on the first characters. For instance, if the UUID is
    `6861d8fb-4694-46be-b0e6-7282989f069d`, the calculation will run in a subfolder named
    `68/61/d8fb-4694-46be-b0e6-7282989f069d`.

[^3]:
    The workflow is recursively traversed, and files are written to disk for each calculation (remember, it's the
    `Calculation` that actually creates the data, while the `Workflow` can only return it, as [outlined
    here](https://aiida.readthedocs.io/projects/aiida-core/en/v2.6.2/topics/processes/concepts.html#process-types)). In
    the `verdi process dump` feature, the subdirectory naming is automatically determined based on the iteration
    counter, the link label, and the class name, leading to a directory structure that mirrors the execution logic of
    the workflow.

[^4]:
    The `verdi process dump` feature is still under active development to enable obtaining remote and stashed data
    entities, e.g., intermediate files of the workflow that weren't originally retrieved from the remote
    (high-performance) computer or data that was moved to tape. In addition, we are working on allowing you to `dump`
    larger collections of data, such as groups, or even all data contained in an AiiDA profile, again, in an easily
    understandable folder structure. So stay tuned!

[^5]:
    Modern LLMs like ChatGPT and Claude can actually generate (somewhat correct) AiiDA `QueryBuilder` queries (at least
    with the syntax until their training data cutoff date), so they can provide a good starting point for your queries.
4 changes: 2 additions & 2 deletions docs/sections/science.md
@@ -93,7 +93,7 @@ _<sup>*</sup> This is an incomplete list of research papers that utilise AiiDA.
- M. G. Mottet, [Accelerating materials discovery for solid state electrolytes](http://dx.doi.org/10.5075/epfl-thesis-7179), _EPFL THESIS_ (2020).
- E. Bosoni, [Material Selection for Spin-Transfer-Torque Magnetic Random Access Memories: a High-Throughput approach](http://hdl.handle.net/2262/91467), PhD diss., Trinity College Dublin (2020).
- J. S. Yu, J. H. Liao, Y. J. Zhao, Yin-Chang Zhao, and X. B. Yang, [Motif based high-throughput structure prediction of superconducting monolayer titanium boride](https://pubs.rsc.org/en/content/articlelanding/2020/CP/D0CP01540G#!divAbstract), _Physical Chemistry Chemical Physics_ **22,** no. 28, 16236-16243 (2020).
- A. Togo, Y. Inoue, and I. Tanaka, [Phonon structure of titanium under shear deformation along {10 1¯ 2} twinning mode](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.102.024106), _Physical Review B_ **102**, no. 2, 024106 (2020).
- A. Togo, Y. Inoue, and I. Tanaka, [Phonon structure of titanium under shear deformation along {10 1¯ 2} twinning mode](https://doi.org/10.1103/PhysRevB.102.024106), _Physical Review B_ **102**, no. 2, 024106 (2020).
- S. Mishra, X. Yao, Q. Chen, K. Eimre, O. Groening, R. Ortiz, M. Di Giovannantonio et al. [Giant magnetic exchange coupling in rhombus-shaped nanographenes with zigzag periphery](https://arxiv.org/abs/2003.03577). _arXiv preprint_ arXiv:2003.03577 (2020).
- Q. Sun, X. Yao, O. Gröning, K. Eimre, C. A. Pignedoli, K. Müllen, A. Narita, R. Fasel, and P. Ruffieux, [Coupled spin states in armchair graphene nanoribbons with asymmetric zigzag edge extensions](https://pubs.acs.org/doi/10.1021/acs.nanolett.0c02077), _Nano letters_ **20**, no. 9, 6429-6436 (2020).
- M. Kappeler, A. Marusczyk, and B. Ziebarth, [Simulation of nickel surfaces using ab-initio and empirical methods](https://www.sciencedirect.com/science/article/pii/S2589152920300922), _Materialia_ **12**, 100675 (2020).
@@ -111,7 +111,7 @@ _<sup>*</sup> This is an incomplete list of research papers that utilise AiiDA.
- T. Sohier, M. Gibertini, and N. Marzari. [Profiling novel high-conductivity 2D semiconductors](https://iopscience.iop.org/article/10.1088/2053-1583/abc5d0/meta). _2D Materials_ **8**, no. 1, 015025 (2020).
- A. García, N. Papior, A. Akhtar, E. Artacho, V. Blum, E. Bosoni, P. Brandimarte et al., [Siesta: Recent developments and applications](https://aip.scitation.org/doi/full/10.1063/5.0005077), _The Journal of chemical physics_ **152**, no. 20, 204108 (2020).
- D. Ongari, L. Talirz, and B. Smit. [Too Many Materials and Too Many Applications: An Experimental Problem Waiting for a Computational Solution](https://pubs.acs.org/doi/abs/10.1021/acscentsci.0c00988). _ACS central science_ **6**, no. 11, 1890-1900 (2020).
- D. Marchand, A. Jain, A. Glensk, and W. A. Curtin, [Machine learning for metallurgy I. A neural-network potential for Al-Cu](https://journals.aps.org/prmaterials/abstract/10.1103/PhysRevMaterials.4.103601), _Physical Review Materials_ **4**, no. 10, 103601 (2020).
- D. Marchand, A. Jain, A. Glensk, and W. A. Curtin, [Machine learning for metallurgy I. A neural-network potential for Al-Cu](https://doi.org/10.1103/PhysRevMaterials.4.103601), _Physical Review Materials_ **4**, no. 10, 103601 (2020).
- G. Pizzi, V. Vitale, R. Arita, S. Blügel, F. Freimuth, G. Géranton, M. Gibertini et al., [Wannier90 as a community code: new features and applications](https://iopscience.iop.org/article/10.1088/1361-648X/ab51ff/meta), *Journal of Physics: Condensed Matter* **32**, no. 16, 165902 (2020).
- Y. Qu, X. Meng, Z. Jia, X. Liu, D. Liu, S. Li, and F. Bian. [Accelerating Materials Discovery Based on Generalized Low-dimensional Conformation Performance Relationships](https://iopscience.iop.org/article/10.1088/1757-899X/746/1/012020/meta). _IOP Conference Series: Materials Science and Engineering_, Vol. **746**. No. 1. (2020).
- L. Kahle, A. Marcolongo, N. Marzari, [High-throughput computational screening for solid-state Li-ion conductors](http://doi.org/10.1039/c9ee02457c), _Energy & Environmental Science_ **13**, 928–948 (2020).