Add RFC for creation and use of NUMA-constrained arenas #1559

# NUMA support

## Introduction

In Non-Uniform Memory Access (NUMA) systems, the cost of memory accesses depends on the
*nearness* of the processor to the memory resource on which the accessed data resides.
While oneTBB has core support that enables developers to tune for NUMA
systems, we believe this support can be simplified and extended to provide a
better user experience.

This RFC acts as an umbrella for sub-proposals that address four areas for improvement:

1. improved reliability of HWLOC-dependent topology and pinning support,
2. addition of NUMA-aware allocation,
3. simplified approaches to associate task distribution with data placement, and
4. where possible, improved out-of-the-box performance for high-level oneTBB features.

We expect that this draft proposal will spawn sub-proposals that will progress
independently based on feedback and prioritization of the suggested features.

The features for NUMA tuning already available in the oneTBB 1.3 specification include:

- Functions in the `tbb::info` namespace **[info_namespace]**
- `std::vector<numa_node_id> numa_nodes()`
- `int default_concurrency(numa_node_id id = oneapi::tbb::task_arena::automatic)`
- `tbb::task_arena::constraints` in **[scheduler.task_arena]**

Below is an example, based on the existing oneTBB documentation, that
demonstrates the use of these APIs to pin threads to a separate arena for each
of the NUMA nodes available on a system, submit work across those `task_arena`
objects and into associated `task_group` objects, and then wait for that work
using both the `task_arena` and `task_group` objects.

```cpp
#include "oneapi/tbb/task_group.h"
#include "oneapi/tbb/task_arena.h"
#include "oneapi/tbb/info.h"

#include <vector>

const int NUM_STEPS = 100; // number of times to iterate over the data

int main() {
    std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes();
    std::vector<oneapi::tbb::task_arena> arenas(numa_nodes.size());
    std::vector<oneapi::tbb::task_group> task_groups(numa_nodes.size());

    // Initialize the arenas and place memory
    for (std::size_t i = 0; i < numa_nodes.size(); i++) {
        arenas[i].initialize(oneapi::tbb::task_arena::constraints(numa_nodes[i]), 0);
        arenas[i].execute([i] {
            // allocate/place memory on NUMA node i
        });
    }

    for (int step = 0; step < NUM_STEPS; ++step) {

        // Distribute work across the arenas / NUMA nodes
        for (std::size_t i = 0; i < numa_nodes.size(); i++) {
            arenas[i].execute([&task_groups, i] {
                task_groups[i].run([] {
                    /* executed by a thread pinned to the i-th NUMA node */
                });
            });
        }

        // Wait for the work in each arena / NUMA node to complete
        for (std::size_t i = 0; i < numa_nodes.size(); i++) {
            arenas[i].execute([&task_groups, i] {
                task_groups[i].wait();
            });
        }
    }

    return 0;
}
```

### The need for application-specific knowledge

In general when tuning a parallel application for NUMA systems, the goal is to expose sufficient
parallelism while minimizing (or at least controlling) data access and communication costs. The
tradeoffs involved in this tuning often rely on application-specific knowledge.

In particular, NUMA tuning typically involves:

1. Understanding the overall application problem and its use of algorithms and data containers
2. Placement/allocation of data container objects onto memory resources
3. Distribution of tasks to hardware resources that optimize for data placement

As shown in the previous example, the oneTBB 1.3 specification provides only
low-level support for NUMA optimization. The `tbb::info` namespace provides
topology discovery, and the combination of `task_arena`,
`task_arena::constraints`, and `task_group` provides a mechanism for placing
tasks onto specific processors. There is no high-level support for memory
allocation or placement, or for guiding the task distribution of algorithms.

### Issues that should be resolved in the oneTBB library

**The behavior of existing features is not always predictable.** There is a note in
section **[info_namespace]** of the oneTBB specification that describes
the function `std::vector<numa_node_id> numa_nodes()`, "If error occurs during system topology
parsing, returns vector containing single element that equals to `task_arena::automatic`."

In practice, the error often occurs because HWLOC is not detected on the system. While the
oneTBB documentation states in several places that HWLOC is required for NUMA support and
even provides guidance on
[how to check for HWLOC](https://www.intel.com/content/www/us/en/docs/onetbb/get-started-guide/2021-12/next-steps.html),
the failure to resolve HWLOC at runtime silently returns a default of `task_arena::automatic`. This
default does not pin threads to NUMA nodes. It is too easy to write code similar to the preceding
example and be unaware that an HWLOC installation error (or a missing HWLOC) has undone all of the
tuning effort.

**Getting good performance using these tools requires notable manual coding effort by users.** As we
can see in the preceding example, if we want to spread work across the NUMA nodes in
a system we need to query the topology using functions in the `tbb::info` namespace, create
one `task_arena` per NUMA node, along with one `task_group` per NUMA node, and then add an
extra loop that iterates over these `task_arena` and `task_group` objects to execute the
work on the desired NUMA nodes. We also need to handle all container allocations using OS-specific
APIs (or behaviors, such as first-touch) to allocate or place them on the appropriate NUMA nodes.
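To make the manual placement step concrete, here is a minimal sketch of the first-touch approach, assuming oneTBB is installed and the OS places pages on the node of the thread that first writes them (the default policy on Linux). The vector sizes and data layout are illustrative only:

```cpp
#include "oneapi/tbb/task_arena.h"
#include "oneapi/tbb/info.h"

#include <vector>

int main() {
    std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes();
    std::vector<oneapi::tbb::task_arena> arenas(numa_nodes.size());
    std::vector<std::vector<double>> data(numa_nodes.size());

    for (std::size_t i = 0; i < numa_nodes.size(); ++i) {
        arenas[i].initialize(oneapi::tbb::task_arena::constraints(numa_nodes[i]), 0);
        arenas[i].execute([&data, i] {
            // The first touch happens here, on a thread pinned to NUMA node i,
            // so a first-touch OS policy backs these pages with local memory.
            data[i].assign(1u << 20, 0.0);
        });
    }
    return 0;
}
```

Note that nothing in the code itself guarantees locality; the placement is an emergent property of the OS policy plus the thread pinning, which is exactly the kind of implicit coupling this RFC aims to simplify.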

**The out-of-the-box performance of the generic TBB APIs on NUMA systems is not good enough.**
Should the oneTBB library do anything special by default if the system is a NUMA system? Or should
regular random stealing distribute the work across all of the cores, regardless of which NUMA node
first touched the data?

Is it reasonable for a developer to expect that a series of loops, such as the ones that follow, will
try to create a NUMA-friendly distribution of tasks so that accesses to the same elements of `b` and `c`
in the two loops are from the same NUMA nodes? Or is this too much to expect without providing hints?

```cpp
// Assumes a, b, and c are arrays of size N accessible in the enclosing scope.
tbb::parallel_for(0, N,
    [&](int i) {
        b[i] = f(i);
        c[i] = g(i);
    });

tbb::parallel_for(0, N,
    [&](int i) {
        a[i] = b[i] + c[i];
    });
```

## Proposal

### Increased availability of NUMA support

The oneTBB 1.3 specification states for `tbb::info::numa_nodes`, "If error occurs during system
topology parsing, returns vector containing single element that equals to task_arena::automatic."

Since the oneTBB library loads the HWLOC library dynamically, a misconfiguration can cause HWLOC
to fail to load. In that case, a call like:

```cpp
std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes();
```

will return a vector with a single element equal to `task_arena::automatic`. As we have seen
through user questions, this behavior can lead to unexpected performance from NUMA optimizations.
When running on a NUMA system, a developer who has not fully read the documentation may expect
`numa_nodes()` to give a proper accounting of the NUMA nodes. When the call, without raising any
alarm, returns only a single valid element due to the environment's configuration (such as a
missing HWLOC), it is too easy for developers to miss that the code is acting in a valid but
unexpected way.

We propose that the oneTBB library implementation include, wherever possible, a statically-linked
fallback to decrease the likelihood of such failures. The oneTBB specification will remain unchanged.
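Until such a fallback exists, applications can at least detect this failure mode at startup. A minimal defensive check, assuming oneTBB is installed (the warning text is illustrative, not part of any API):

```cpp
#include "oneapi/tbb/task_arena.h"
#include "oneapi/tbb/info.h"

#include <cstdio>
#include <vector>

int main() {
    std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes();
    // Per the specification, a single element equal to task_arena::automatic
    // means topology parsing failed (for example, HWLOC was not found).
    if (numa_nodes.size() == 1 &&
        numa_nodes[0] == oneapi::tbb::task_arena::automatic) {
        std::fprintf(stderr, "warning: NUMA topology unavailable; "
                             "threads will not be pinned to NUMA nodes\n");
    }
    return 0;
}
```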

### NUMA-aware allocation

We will define allocators or other features that simplify the process of allocating or placing data onto
specific NUMA nodes.

### Simplified approaches to associate task distribution with data placement

As discussed earlier, NUMA-aware allocation is just the first step in optimizing for NUMA architectures.
We also need to deliver mechanisms to guide task distribution so that tasks are executed on execution
resources that are near to the data they access. oneTBB already provides low-level support through
`tbb::info` and `tbb::task_arena`, but we should up-level this support into the high-level algorithms,
flow graph and containers where appropriate.
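To illustrate the boilerplate that such up-leveling could hide, the following sketch distributes one logical loop across per-node arenas using only today's public APIs. `numa_parallel_for` is an invented helper name, not a oneTBB API, and the even range split is a simplification that ignores load balance between nodes:

```cpp
#include "oneapi/tbb/parallel_for.h"
#include "oneapi/tbb/task_arena.h"
#include "oneapi/tbb/task_group.h"

#include <vector>

// Run body(i) for i in [0, n), with each contiguous chunk executed inside the
// arena (and therefore on the NUMA node) that owns it.
void numa_parallel_for(std::vector<oneapi::tbb::task_arena>& arenas,
                       std::vector<oneapi::tbb::task_group>& groups,
                       int n, void (*body)(int)) {
    int per_node = n / static_cast<int>(arenas.size());
    for (std::size_t i = 0; i < arenas.size(); ++i) {
        int begin = static_cast<int>(i) * per_node;
        int end = (i + 1 == arenas.size()) ? n : begin + per_node;
        arenas[i].execute([&groups, i, begin, end, body] {
            groups[i].run([=] {
                oneapi::tbb::parallel_for(begin, end, body);
            });
        });
    }
    for (std::size_t i = 0; i < arenas.size(); ++i)
        arenas[i].execute([&groups, i] { groups[i].wait(); });
}
```

A NUMA-aware `parallel_for` built into the library could perform this kind of distribution internally, and could additionally align the range splitting with where the data was placed.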

### Improved out-of-the-box performance for high-level oneTBB features

For high-level oneTBB features that are modified to provide improved NUMA support, we should try to
align their default behaviors with user expectations on NUMA systems.

## Open Questions

1. Do we need simplified support, or are users that want NUMA support in oneTBB
   willing, or perhaps even preferring, to manage the details manually?
2. Is it reasonable to expect good out-of-the-box performance on NUMA systems
   without user hints or guidance?
#+title: API to Facilitate Instantiation and Use of oneTBB's Task Arenas Constrained to NUMA Nodes

*Note:* This is a sub-RFC of the https://github.com/oneapi-src/oneTBB/pull/1535.

* Introduction
Let's consider the example from "Setting the preferred NUMA node" section of the
[[https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Guiding_Task_Scheduler_Execution.html][Guiding Task Scheduler Execution]] page of oneTBB Developer Guide.

** Motivating example
#+begin_src C++
std::vector<tbb::numa_node_id> numa_indexes = tbb::info::numa_nodes(); // [0]
std::vector<tbb::task_arena> arenas(numa_indexes.size()); // [1]
std::vector<tbb::task_group> task_groups(numa_indexes.size()); // [2]

for (unsigned j = 0; j < numa_indexes.size(); j++) {
    arenas[j].initialize(tbb::task_arena::constraints(numa_indexes[j])); // [3]
    arenas[j].execute([&task_groups, &j](){ // [4]
        task_groups[j].run([](){/*some parallel stuff*/});
    });
}

for (unsigned j = 0; j < numa_indexes.size(); j++) {
    arenas[j].execute([&task_groups, &j](){ task_groups[j].wait(); }); // [5]
}
#+end_src

oneTBB users typically employ this technique to tie worker threads to NUMA
nodes while still utilizing all of the parallelism available on the platform.
The pattern starts by finding the number of NUMA nodes on the system. With that
number, the user creates that many ~tbb::task_arena~ objects, constraining each
to a dedicated NUMA node. Along with the ~tbb::task_arena~ objects, the user
instantiates the same number of ~tbb::task_group~ objects, with which the
oneTBB tasks are going to be associated. The ~tbb::task_group~ objects are
needed because they allow waiting for work completion; the ~tbb::task_arena~
class does not provide synchronization semantics of its own. The work is then
submitted in each of the arena objects and waited upon at the end.

** Interface issues and inconveniences:
- [0] - Getting the list of NUMA nodes is not a goal in itself, but rather a
  necessity for knowing how many objects to initialize further.
- [1] - An explicit step for creating one ~tbb::task_arena~ object per NUMA
  node. Note that by default the arena objects are constructed with a slot
  reserved for the master thread, which in this particular example usually
  results in undersubscription, because the master thread can join only one
  arena at a time to help with work processing.
- [2] - The necessity to instantiate the same number of ~tbb::task_group~
  objects for the actual work to be submitted; that is, the size of
  ~task_groups~ must match the size of ~arenas~.
- [3] - The actual tying of ~tbb::task_arena~ instances to the corresponding
  NUMA nodes. In this compact loop the index is naturally used consistently,
  but code that separates these steps can mismatch arenas and NUMA nodes.
- [4] - The actual work submission point. It is relatively easy to make a
  mistake here by using the ~tbb::task_arena::enqueue~ method instead. In that
  case, not only might the work be submitted after the synchronization point
  [5], but the loop counter ~j~ can also be mistakenly captured by reference,
  which at best results in submitting the work into the wrong
  ~tbb::task_group~ and at worst in a segmentation fault, since the loop
  counter might no longer exist by the time the functor starts executing.
- [5] - The synchronization point, where the user again needs to make sure
  that corresponding indices are used; otherwise, the waiting may happen in an
  unrelated ~tbb::task_arena~. It is also possible to mistakenly use the
  ~tbb::task_arena::enqueue~ method here, with the same consequences as
  outlined in the previous bullet, although at a synchronization point the
  blocking call is usually what gets written.

The proposal below addresses these issues.

* Proposal
Introduce a simplified interface to:
- Constrain a task arena to a specific NUMA node,
- Submit work into the constrained task arenas, and
- Wait for completion of the submitted work.

Since the new interface represents a constrained ~tbb::task_arena~, the
proposed name is ~tbb::constrained_task_arena~. Not including the word "numa"
in the name leaves room for extending it in the future to other types of
constraints.

** Usage Example
#+begin_src C++
std::vector<tbb::constrained_task_arena> numa_arenas =
    tbb::initialize_numa_constrained_arenas();

for (unsigned j = 0; j < numa_arenas.size(); j++) {
    numa_arenas[j].enqueue( [](){/*some parallel stuff*/} );
}

for (unsigned j = 0; j < numa_arenas.size(); j++) {
    numa_arenas[j].wait();
}
#+end_src

** New arena interface
The example above requires a new class named ~tbb::constrained_task_arena~. On
one hand, it is a ~tbb::task_arena~ that isolates work execution from other
parallel work executed by oneTBB. On the other hand, it is a constrained
arena, associated with a certain NUMA node, that allows efficient and less
error-prone work submission in this particular usage scenario.

#+begin_src C++
namespace tbb {

class constrained_task_arena : protected task_arena {
public:
    using task_arena::is_active;
    using task_arena::terminate;

    using task_arena::max_concurrency;

    using task_arena::enqueue;

    void wait();

private:
    constrained_task_arena(tbb::task_arena::constraints);

    friend std::vector<constrained_task_arena> initialize_numa_constrained_arenas();
};

}
#+end_src

The interface exposes only the methods necessary to submit and wait for
parallel work. Most of the exposed member functions are taken from the base
~tbb::task_arena~ class. Implementation-wise, the new task arena would include
an associated ~tbb::task_group~ instance, with which the enqueued work will be
implicitly associated.

The ~tbb::constrained_task_arena::wait~ method waits for the work in the
associated ~tbb::task_group~ to finish, if any was submitted through the
~tbb::constrained_task_arena::enqueue~ method.

An instance of the ~tbb::constrained_task_arena~ class can be created only by
the ~tbb::initialize_numa_constrained_arenas~ function, whose sole purpose is
to instantiate and return a ~std::vector~ of initialized
~tbb::constrained_task_arena~ objects, each constrained to its own NUMA node
of the platform and having no reserved slots.
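To make the proposal more concrete, the following is an implementation sketch only, built on today's public oneTBB API and assuming a recent oneTBB that provides the deferred-task facilities (~task_group::defer~ and ~task_arena::enqueue~ of a ~task_handle~). The names mirror the proposed interface, but the details are illustrative:

#+begin_src C++
namespace sketch {

class constrained_task_arena : protected tbb::task_arena {
public:
    using tbb::task_arena::is_active;
    using tbb::task_arena::terminate;
    using tbb::task_arena::max_concurrency;

    template <typename F>
    void enqueue(F&& f) {
        // Track completion through the associated task_group by enqueueing
        // a task deferred from it (the "deferred tasks" pattern).
        tbb::task_arena::enqueue(my_group.defer(std::forward<F>(f)));
    }

    void wait() {
        // Join the arena and wait for everything submitted via enqueue().
        tbb::task_arena::execute([this] { my_group.wait(); });
    }

private:
    explicit constrained_task_arena(tbb::task_arena::constraints c)
        : tbb::task_arena(c, /*reserved slots*/ 0) {}

    tbb::task_group my_group;
    friend std::vector<constrained_task_arena> initialize_numa_constrained_arenas();
};

// One arena per NUMA node; left as a declaration because the task_group
// member makes the class non-copyable, so storing it in a std::vector would
// itself require move support that this sketch does not spell out.
std::vector<constrained_task_arena> initialize_numa_constrained_arenas();

} // namespace sketch
#+end_src

The non-copyability issue noted in the sketch is itself an input to open question 3 below.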

* Open Questions
1. Should the interface for creating constrained task arenas support other
   construction parameters (e.g., max_concurrency, number of reserved slots,
   priority, other constraints) from the very beginning, or is this enough for
   a first iteration, with these parameters added in the future when the need
   arises?
2. Should the new task arena allow re-initialization with, possibly, different
   parameters after its creation?
3. Should the new task arena interface allow copying of its settings by
   exposing a copy constructor, similar to what ~tbb::task_arena~ does?