[RFC] Added numa_support rfc (#1535)
With sub-RFC about increased NUMA availability.
Showing 2 changed files with 276 additions and 0 deletions.
@@ -0,0 +1,156 @@
# NUMA support

## Introduction

In Non-Uniform Memory Access (NUMA) systems, the cost of memory accesses depends on the
*nearness* of the processor to the memory resource on which the accessed data resides.
While oneTBB has core support that enables developers to tune for NUMA systems, we believe
this support can be simplified and improved to provide a better user experience.

This RFC acts as an umbrella for sub-proposals that address four areas for improvement:

1. improved reliability of HWLOC-dependent topology and pinning support,
2. addition of NUMA-aware allocation,
3. simplified approaches to associate task distribution with data placement, and
4. where possible, improved out-of-the-box performance for high-level oneTBB features.

We expect that this draft proposal will spawn sub-proposals that will progress
independently based on feedback and prioritization of the suggested features.

The features for NUMA tuning already available in the oneTBB 1.3 specification include:

- Functions in the `tbb::info` namespace **[info_namespace]**
  - `std::vector<numa_node_id> numa_nodes()`
  - `int default_concurrency(numa_node_id id = oneapi::tbb::task_arena::automatic)`
- `tbb::task_arena::constraints` in **[scheduler.task_arena]**

The example below, based on existing oneTBB documentation, demonstrates the use of these APIs
to create one arena pinned to each of the NUMA nodes available on a system, submit work
across those `task_arena` objects and into associated `task_group` objects, and then wait for the
work to complete, again using both the `task_arena` and `task_group` objects.

    void constrain_for_numa_nodes() {
      std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes();
      std::vector<tbb::task_arena> arenas(numa_nodes.size());
      std::vector<tbb::task_group> task_groups(numa_nodes.size());

      // initialize each arena, each constrained to a different NUMA node
      for (int i = 0; i < numa_nodes.size(); i++)
        arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]), 0);

      // enqueue work to all but the first arena, using the task_group to track work
      // by using defer, the task_group reference count is incremented immediately
      for (int i = 1; i < numa_nodes.size(); i++)
        arenas[i].enqueue(
          task_groups[i].defer([] {
            tbb::parallel_for(0, N, [](int j) { f(w); });
          })
        );

      // directly execute the work to completion in the remaining arena
      arenas[0].execute([] {
        tbb::parallel_for(0, N, [](int j) { f(w); });
      });

      // join the other arenas to wait on their task_groups
      for (int i = 1; i < numa_nodes.size(); i++)
        arenas[i].execute([&task_groups, i] { task_groups[i].wait(); });
    }

### The need for application-specific knowledge

In general, when tuning a parallel application for NUMA systems, the goal is to expose sufficient
parallelism while minimizing (or at least controlling) data access and communication costs. The
tradeoffs involved in this tuning often rely on application-specific knowledge.

In particular, NUMA tuning typically involves:

1. Understanding the overall application problem and its use of algorithms and data containers
2. Placement/allocation of data container objects onto memory resources
3. Distribution of tasks to hardware resources that optimize for data placement

As shown in the previous example, the oneTBB 1.3 specification only provides low-level
support for NUMA optimization. The `tbb::info` namespace provides topology discovery, and the
combination of `task_arena`, `task_arena::constraints` and `task_group` provides a mechanism for
placing tasks onto specific processors. There is no high-level support for memory allocation
or placement, or for guiding the task distribution of algorithms.

### Issues that should be resolved in the oneTBB library

**The behavior of existing features is not always predictable.** Section **[info_namespace]**
of the oneTBB specification, which describes the function
`std::vector<numa_node_id> numa_nodes()`, contains this note: "If error occurs during system topology
parsing, returns vector containing single element that equals to `task_arena::automatic`."

In practice, the error can occur because HWLOC is not detected on the system. While the
oneTBB documentation states in several places that HWLOC is required for NUMA support and
even provides guidance on
[how to check for HWLOC](https://www.intel.com/content/www/us/en/docs/onetbb/get-started-guide/2021-12/next-steps.html),
the inability to resolve HWLOC at runtime silently returns a default of `task_arena::automatic`. This
default does not pin threads to NUMA nodes. It is too easy to write code similar to the preceding
example and be unaware that a HWLOC installation error (or lack of HWLOC) has undone all your effort.
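
One way to guard against this silent fallback is to inspect the result of `numa_nodes()` before
building per-node arenas. The sketch below is a minimal illustration, not part of oneTBB:
`topology_unavailable` is a hypothetical helper name, and it is written as a template over the
node-id type so that it compiles without oneTBB. It relies only on the fallback behavior quoted
above from the specification.

```cpp
#include <vector>

// Hypothetical helper (not a oneTBB API): returns true when the node list
// matches the documented error fallback, that is, a single element equal to
// the "automatic" sentinel value.
template <typename NodeId>
bool topology_unavailable(const std::vector<NodeId>& nodes, NodeId automatic_id) {
    return nodes.size() == 1 && nodes.front() == automatic_id;
}
```

With oneTBB, the call would presumably look like
`topology_unavailable(tbb::info::numa_nodes(), tbb::task_arena::automatic)`. A `true` result only
says that no topology information was returned; it cannot say why.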

**Getting good performance using these tools requires notable manual coding effort by users.** As we
can see in the preceding example, if we want to spread work across the NUMA nodes in
a system we might need to query the topology using functions in the `tbb::info` namespace, create
one `task_arena` per NUMA node, along with one `task_group` per NUMA node, and then add an
extra loop that iterates over these `task_arena` and `task_group` objects to execute the
work on the desired NUMA nodes. We also need to handle all container allocations using OS-specific
APIs (or behaviors, such as first-touch) to allocate or place them on the appropriate NUMA nodes.

**The out-of-the-box performance of the generic TBB APIs on NUMA systems is not good enough.**
Should the oneTBB library do anything special by default if the system is a NUMA system? Or should
regular random stealing distribute the work across all of the cores, regardless of which NUMA node
first touched the data?

Is it reasonable for a developer to expect that a series of loops, such as the ones that follow, will
try to create a NUMA-friendly distribution of tasks so that accesses to the same elements of `b` and `c`
in the two loops are from the same NUMA nodes? Or is this too much to expect without providing hints?

    tbb::parallel_for(0, N,
      [](int i) {
        b[i] = f(i);
        c[i] = g(i);
      });

    tbb::parallel_for(0, N,
      [](int i) {
        a[i] = b[i] + c[i];
      });
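
One possible form of such a hint is a fixed mapping from index ranges to NUMA nodes that both
loops share. The sketch below is an illustration only: `node_block` and `block_for_node` are
hypothetical names, not oneTBB APIs. It splits `[0, n)` into one contiguous block per node. If
block `k` is processed in the arena constrained to node `k` in both loops, accesses to the same
elements of `b` and `c` come from the same node, assuming a first-touch placement policy and
threads that stay on their node.

```cpp
#include <cassert>
#include <cstddef>

// Half-open index range [begin, end) assigned to one NUMA node.
struct node_block {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper: returns block k when n elements are split into `nodes`
// contiguous blocks. The first (n % nodes) blocks receive one extra element,
// so block sizes differ by at most one.
inline node_block block_for_node(std::size_t n, std::size_t nodes, std::size_t k) {
    assert(nodes > 0 && k < nodes);
    const std::size_t base = n / nodes;
    const std::size_t extra = n % nodes;
    const std::size_t begin = k * base + (k < extra ? k : extra);
    const std::size_t size = base + (k < extra ? 1 : 0);
    return node_block{begin, begin + size};
}
```

In the arena-per-node example shown earlier, arena `i` could run
`tbb::parallel_for(r.begin, r.end, body)` with `r = block_for_node(N, numa_nodes.size(), i)` for
each of the two loops, so that the mapping from elements to nodes does not change between them.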

## Possible Sub-Proposals

### Increased availability of NUMA support

See [sub-RFC for increased availability of NUMA API](tbbbind-link-static-hwloc.org)

### Add NUMA-constrained arenas

See [sub-RFC for creation and use of NUMA-constrained arenas](numa-arenas-creation-and-use.org)

### NUMA-aware allocation

Define allocators or other features that simplify the process of allocating or placing data onto
specific NUMA nodes.

### Simplified approaches to associate task distribution with data placement

As discussed earlier, NUMA-aware allocation is just the first step in optimizing for NUMA architectures.
We also need to deliver mechanisms to guide task distribution so that tasks are executed on execution
resources that are near to the data they access. oneTBB already provides low-level support through
`tbb::info` and `tbb::task_arena`, but we should up-level this support into the high-level algorithms,
flow graph and containers where appropriate.

### Improved out-of-the-box performance for high-level oneTBB features

For high-level oneTBB features that are modified to provide improved NUMA support, we can try to
align default behaviors for those features with user expectations when used on NUMA systems.

## Open Questions

1. Do we need simplified support, or are users who want NUMA support in oneTBB
   willing, or do they perhaps even prefer, to manage the details manually?
2. Is it reasonable to expect good out-of-the-box performance on NUMA systems
   without user hints or guidance?
120 changes: 120 additions & 0 deletions
rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org
@@ -0,0 +1,120 @@
# -*- fill-column: 80; -*-

#+title: Link ~tbbbind~ with Static HWLOC for NUMA API predictability

*Note:* This document is a sub-RFC of the [[file:README.md][umbrella RFC about improving NUMA
support]], specifically its "Increased availability of NUMA support" section.

* Introduction
oneTBB has a soft dependency on several variants of ~tbbbind~, which the library
loads during the initialization stage. Each ~tbbbind~, in turn, has a hard
dependency on a specific version of the HWLOC library [1, 2]. The soft
dependency means that the library continues the execution even if the system
loader fails to resolve the hard dependency on HWLOC for ~tbbbind~. In this
case, oneTBB does not discover the hardware topology. Instead, it defaults to
viewing all CPU cores as uniform, consistent with TBB behavior when NUMA
constraints are not used. As a result, the following code returns values that
do not reflect the actual topology:

#+begin_src C++
std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes();
std::vector<oneapi::tbb::core_type_id> core_types = oneapi::tbb::info::core_types();
#+end_src

This lack of valid hardware topology, caused by the absence of a third-party
library, is the major problem with the current oneTBB behavior. Because no
diagnostics are issued, the problem is difficult for developers to detect. As a
result, the code continues to run but fails to use NUMA as intended.

Dependency on a shared HWLOC library has the following benefits:
1. Code reuse, with all of its positive consequences: relying on the same code
   that has been tested and debugged, and allowing the OS to share it among
   different processes, which improves cache locality and memory footprint.
   That is the primary purpose of shared libraries.
2. A drop-in replacement. Users are able to use their own version of HWLOC
   without recompilation of oneTBB. This specific version of HWLOC could include
   a hotfix to support particular or new hardware that a customer has, but
   whose support is not yet upstreamed to the HWLOC project. It is also possible
   that such support won't be upstreamed at all if that hardware is not going to
   be widely available. It could also be a development version of
   HWLOC that someone wants to test on their systems first. Of course, they can
   do it with the static version as well, but that's more cumbersome as it
   requires recompilation of every dependent component.

The only disadvantage of depending on the HWLOC library dynamically is that
developers who use oneTBB's NUMA support API need to make sure the library is
available and can be found by oneTBB. Depending on the distribution model of a
developer's code, this is achieved either by:
1. Asking the end user to have the necessary version of the dependency pre-installed.
2. Bundling the necessary HWLOC version together with other pieces of a product
   release.

However, the requirement to fulfill one of the above steps for the NUMA API to
start paying off may be considered an inconvenience and, more importantly, it
is not always obvious that one of these steps is needed, especially because
oneTBB stays silent when the HWLOC library cannot be found in the environment.

The proposal is to reduce the effect of the disadvantage of relying on a dynamic
HWLOC library. The improvement involves statically linking HWLOC with one of the
~tbbbind~ libraries distributed together with oneTBB. At the same time, users
retain the flexibility to specify a different version of the HWLOC library if needed.

Since HWLOC 1.x is an older version and modern operating systems install HWLOC
2.x by default, the probability of users being restricted to HWLOC 1.x is
relatively small. Thus, we can reuse the filename of the ~tbbbind~ library
linked to HWLOC 1.x for the library linked against a static HWLOC 2.x.

* Proposal
1. Replace the ~tbbbind~ library that is currently dynamically linked
   against HWLOC 1.x with one that is statically linked against HWLOC 2.x.
2. Add loading of that ~tbbbind~ variant as the last attempt to resolve the
   dependency on functionality provided by the ~tbbbind~ layer.
3. Update the oneTBB documentation, including
   [[https://uxlfoundation.github.io/oneTBB/search.html?q=tbb%3A%3Ainfo][these
   pages]], to detail the steps for identifying which ~tbbbind~ is being used.
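
The lookup order implied by items 1 and 2 can be sketched as follows. This is an
illustration under assumptions rather than the oneTBB loader: ~first_loadable~
and the ~try_load~ callback are hypothetical, and the caller supplies whatever
variant names apply. The only point it makes is that the statically linked
variant goes last in the candidate list, so user-provided HWLOC builds are still
preferred.

#+begin_src C++
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch: try each candidate in order and return the name of the
// first one that loads. An empty string means that no variant could be loaded,
// in which case the topology stays unknown.
inline std::string first_loadable(
    const std::vector<std::string>& candidates,
    const std::function<bool(const std::string&)>& try_load) {
    for (const std::string& name : candidates) {
        if (try_load(name)) return name;
    }
    return std::string();
}
#+end_src

A caller would list the dynamically linked variants first and the statically
linked one last, and pass a ~try_load~ callback that wraps the platform's
dynamic loading facility.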

** Advantages
1. The proposed behavior introduces a fallback mechanism for resolving the HWLOC
   library dependency when it is not in the environment, while still preferring
   user-provided versions. As a result, the problematic oneTBB API usage works
   as expected, returning an enumerated list of actual NUMA nodes and core types
   on the system the code is running on, provided that the loaded HWLOC library
   works on that system, that the application properly distributes all
   oneTBB binaries, and that it sets up the environment so that the necessary
   variant of the ~tbbbind~ library can be found and loaded.
2. Dropping support for HWLOC 1.x avoids introducing an additional ~tbbbind~
   variant while maintaining support for widely used versions of HWLOC.

** Disadvantages
By default, there are still no diagnostics if you fail to correctly set up an
environment with your version of HWLOC. However, specifying the ~TBB_VERSION=1~
environment variable helps identify configuration issues quickly.

* Alternative Handling for Missing System Topology
An alternative behavior when the HWLOC library cannot be found is to be more
explicit about the missing component: either issue a warning or refuse to work
unless one of the ~tbbbind~ variants is loaded (e.g., throw an exception).
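
The two alternatives can be sketched as a policy applied to the result of
~numa_nodes()~. This is a minimal illustration under assumptions, not oneTBB
code: ~missing_topology_policy~, ~topology_unavailable_error~, and
~check_topology~ are hypothetical names, and the check relies on the fallback
described in the specification, in which ~numa_nodes()~ returns a single element
equal to ~task_arena::automatic~.

#+begin_src C++
#include <iostream>
#include <stdexcept>
#include <vector>

// Hypothetical policies: keep today's silent behavior, warn, or refuse to work.
enum class missing_topology_policy { silent, warn, fail };

// Hypothetical exception type for the "refuse to work" alternative.
struct topology_unavailable_error : std::runtime_error {
    topology_unavailable_error()
        : std::runtime_error("system topology is unavailable") {}
};

// Applies the chosen policy to a node list obtained elsewhere.
// Returns true when real topology information is present.
template <typename NodeId>
bool check_topology(const std::vector<NodeId>& nodes, NodeId automatic_id,
                    missing_topology_policy policy) {
    const bool missing = nodes.size() == 1 && nodes.front() == automatic_id;
    if (!missing) return true;
    if (policy == missing_topology_policy::warn) {
        std::cerr << "warning: system topology is unavailable\n";
    } else if (policy == missing_topology_policy::fail) {
        throw topology_unavailable_error{};
    }
    return false;
}
#+end_src

The ~warn~ branch writes to the standard error stream, and the ~fail~ branch
needs a new exception type; both points come up in the comparison of the
alternatives.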

The following sections compare these alternative approaches to the one proposed.
** Common Advantages
- Explicitly indicates that the functionality being used does not work, instead
  of failing silently.
- Avoids the need to distribute an additional variant of the ~tbbbind~ library.

** Common Disadvantages
- Requires an additional step from the user to resolve the problem. In other
  words, it does not provide a complete solution to the problem.

*** Disadvantages of Issuing a Warning
- The warning may go unnoticed, especially if standard streams are closed.

*** Disadvantages of Throwing an Exception
- May break existing code that does not expect an exception to be thrown.
- Requires introduction of an additional exception hierarchy.

* References
1. [[https://www.open-mpi.org/projects/hwloc/][HWLOC project main page]]
2. [[https://github.com/open-mpi/hwloc][HWLOC project repository on GitHub]]