From 3c97ea9b28a0b93052e6fa79d6b5038be6950d4b Mon Sep 17 00:00:00 2001 From: Michael Voss Date: Wed, 23 Oct 2024 09:56:30 -0500 Subject: [PATCH 01/11] Added numa_support rfc --- .../simplified_numa_support/README.md | 179 ++++++++++++++++++ 1 file changed, 179 insertions(+) create mode 100755 rfcs/proposed/simplified_numa_support/README.md diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md new file mode 100755 index 0000000000..fbac6efd62 --- /dev/null +++ b/rfcs/proposed/simplified_numa_support/README.md @@ -0,0 +1,179 @@ +# Simplified NUMA support in oneTBB + +## Introduction + +In Non-Uniform Memory Access (NUMA) systems, the cost of memory accesses depends on the +*nearness* of the processor to the memory resource on which the accessed data resides. +While oneTBB has core support that enables developers to tune for Non-Uniform Memory +Access (NUMA) systems, we believe this support can be simplified and improved to provide +an improved user experience. + +This early proposal recommends addressing for areas for improvement: + +1. improved reliability of HWLOC-dependent topology and pinning support in, +2. addition of a NUMA-aware allocation, +3. simplified approaches to associate task distribution with data placement and +4. where possible, improved out-of-the-box performance for high-level oneTBB features. + +We expect that this draft proposal may be broken into smaller proposals based on feedback +and prioritization of the suggested features. 
+ +The features for NUMA tuning already available in the oneTBB 1.3 specification include: + +- Functions in the `tbb::info` namespace **[info_namespace]** + - `std::vector<numa_node_id> numa_nodes()` + - `int default_concurrency(numa_node_id id = oneapi::tbb::task_arena::automatic)` +- `tbb::task_arena::constraints` in **[scheduler.task_arena]** + +Below is the example that demonstrates the use of these APIs to pin threads to different +arenas to each of the NUMA nodes available on a system, submit work across those `task_arena` +objects and into associated `task_group`` objects, and then wait for work again using both +the `task_arena` and `task_group` objects. + + #include "oneapi/tbb/task_group.h" + #include "oneapi/tbb/task_arena.h" + + #include <vector> + + int main() { + std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes(); + std::vector<oneapi::tbb::task_arena> arenas(numa_nodes.size()); + std::vector<oneapi::tbb::task_group> task_groups(numa_nodes.size()); + + // Initialize the arenas and place memory + for (int i = 0; i < numa_nodes.size(); i++) { + arenas[i].initialize(oneapi::tbb::task_arena::constraints(numa_nodes[i])); + arenas[i].execute([i] { + // allocate/place memory on NUMA node i + }); + } + + for (int j 0; j < NUM_STEPS; ++i) { + + // Distribute work across the arenas / NUMA nodes + for (int i = 0; i < numa_nodes.size(); i++) { + arenas[i].execute([&task_groups, i] { + task_groups[i].run([] { + /* executed by the thread pinned to specified NUMA node */ + }); + }); + } + + // Wait for the work in each arena / NUMA node to complete + for (int i = 0; i < numa_nodes.size(); i++) { + arenas[i].execute([&task_groups, i] { + task_groups[i].wait(); + }); + } + } + + return 0; + } + +### The need for application-specific knowledge + +In general when tuning a parallel application for NUMA systems, the goal is to expose sufficient +parallelism while minimizing (or at least controlling) data access and communication costs. The +tradeoffs involved in this tuning often rely on application-specific knowledge. 
+ +In particular, NUMA tuning typically involves: + +1. Understanding the overall application problem and its use of algorithms and data containers +2. Placement of data container objects onto memory resources +3. Distribution of tasks to hardware resources that optimize for data placement + +As shown in the previous example, the oneTBB 1.3 specification only provides low-level +support for NUMA optimization. The `tbb::info` namespace provides topology discovery. And the +combination of `task_arena`, `task_arena::constraints` and `task_group` provide a mechanism for +placing tasks onto specific processors. There is no high-level support for memory allocation +or placement, or for guiding the task distribution of algorithms. + +### Issues that should be resolved in the oneTBB library + +**The behavior of existing features is not always predictable.** There is a note in +section **[info_namespace]** of the oneTBB specification that describes +the function `std::vector<numa_node_id> numa_nodes()`, "If error occurs during system topology +parsing, returns vector containing single element that equals to `task_arena::automatic`." + +In practice, the error often occurs because HWLOC is not detected on the system. While the +oneTBB documentation states in several places that HWLOC is required for NUMA support and +even provides guidance on +[how to check for HWLOC](https://www.intel.com/content/www/us/en/docs/onetbb/get-started-guide/2021-12/next-steps.html), +the failure to resolve HWLOC at runtime silently returns a default of `task_arena::automatic`. This +default does not pin threads to NUMA nodes. It is too easy to write code similar to the preceding +example and be unaware that a HWLOC installation error (or lack of HWLOC) has undone all your effort. 
+ +**Getting good performance using these tools requres notable manual coding effort by users.** As we +can see in the preceding example, if we want to spread work across the NUMA nodes in +a system we need to query the topology using functions in the `tbb::info` namespace, create +one `task_arena` per NUMA node, along with one `task_group` per NUMA node, and then add an +extra loop that iterates overs these `task_arena` and `task_group` objects to execute the +work on the desired NUMA nodes. We also need to handle all container allocations using OS-specific +APIs (or behaviors, such as first-touch) to allocator or place them on the appropriate NUMA nodes. + +**The out-of-the-box performance of the generic TBB APIs on NUMA systems is not good enough.** +Should the oneTBB library do anything special be default if the system is a NUMA system? Or should +regular random stealing distribute the work across all of the cores, regardless of which NUMA first +touched the data? + +Is it reasonable for a developer to expect that a series of loops, such as the ones that follow, will +try to create a NUMA-friendly distribution of tasks so that accesses to the same elements of `b` and `c` +in the two loops are from the same NUMA nodes? Or is this too much to expect without providing hints? + + tbb::parallel_for(0, N, + [](int i) { + b[i] = f(i); + c[i] = g(i); + }); + + tbb::parallel_for(0, N, + [](int i) { + a[i] = b[i] + c[i]; + }); + +## Proposal + +### Increased availability of NUMA support + +The oneTBB 1.3 specification states for `tbb::info::numa_nodes`, "If error occurs during system +topology parsing, returns vector containing single element that equals to task_arena::automatic." + +Since the oneTBB library dynamically loads the HWLOC library, a misconfiguration can cause the HWLOC +to fail to be found. In that case, a call like: + + std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes(); + +will return a vector with a single element of `task_arena::automatic`. 
This behavior, as we have noticed +through user questions, can lead to unexpected performance from NUMA optimizations. When running +on a NUMA system, a developer that has not fully read the documentation may expect that `numa_nodes()` +will give a proper accounting of the NUMA nodes. When the code, without raising any alarm, returns only +a single, valid element due to the environmental configuation (such as lack of HWLOCK), it is too easy +for developers to not notice that the code is acting in a valid, but unexpected way. + +We propose that the oneTBB library implementation include, wherever possibly, a statically-linked fallback +to decrease that likelihood of such failures. The oneTBB specification will remain unchanged. + +### NUMA-aware allocation + +We will define allocators of other features that simplify the process of allocating or places data onto +specific NUMA nodes. + +### Simplified approaches to associate task distribution with data placement + +As discussed earlier, NUMA-aware allocation is just the first step in optimizing for NUMA architectures. +We also need to deliver mechanisms to guide task distribution so that tasks are executed on execution +resources that are near to the data they access. oneTBB already provides low-level support through +`tbb::info` and `tbb::task_arena`, but we should up-level this support into the high-level algorithms, +flow graph and containers where appropriate. + +### Improved out-of-the-box performance for high-level oneTBB features. + +For high-level oneTBB features that are modified to provide improved NUMA support, we should try to +align default behaviors for those features with user-expectations when used on NUMA systems. + +## Open Questions + +1. Do we need simplified support, or are users that want NUMA support in oneTBB +willing to, or perhaps even prefer, to manage the details manually? +2. Is it reasonable to expect good out-of-the-box performance on NUMA systems +without user hints or guidance. 
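
Review note on PATCH 01/11 (illustrative, not part of the patch): the README text above quotes the specification as saying that `numa_nodes()` returns a vector with a single element equal to `task_arena::automatic` when topology parsing fails. A caller could at least test for that shape explicitly instead of proceeding silently. The sketch below is a minimal version of such a check. It assumes node ids are plain `int` values and that the caller passes in the value of `task_arena::automatic`; the helper name `numa_topology_detected` is hypothetical and is not a oneTBB API.

```cpp
#include <vector>

// Hypothetical helper (not a oneTBB API).
// Returns false when `nodes` is empty or has the shape the specification
// describes for a topology-parsing error: exactly one element, equal to the
// "automatic" sentinel supplied by the caller.
// Assumption: node ids are plain ints.
bool numa_topology_detected(const std::vector<int>& nodes, int automatic_id) {
    if (nodes.empty()) {
        return false;
    }
    const bool fallback_shape = (nodes.size() == 1 && nodes.front() == automatic_id);
    return !fallback_shape;
}

// Intended call site, assuming the oneTBB headers are available:
//   bool ok = numa_topology_detected(oneapi::tbb::info::numa_nodes(),
//                                    oneapi::tbb::task_arena::automatic);
//   if (!ok) { /* report that arenas will not be constrained to NUMA nodes */ }
```

A `false` result only says that the fallback value came back. It does not say why (a missing HWLOC, or some other parsing problem), so it complements rather than replaces the statically linked fallback proposed in this series.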
From de552dfc9da577f6f8efad205fa9e90ef5242289 Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Wed, 13 Nov 2024 08:07:44 -0600 Subject: [PATCH 02/11] Update rfcs/proposed/simplified_numa_support/README.md Co-authored-by: Aleksei Fedotov --- rfcs/proposed/simplified_numa_support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md index fbac6efd62..a45d77c611 100755 --- a/rfcs/proposed/simplified_numa_support/README.md +++ b/rfcs/proposed/simplified_numa_support/README.md @@ -8,7 +8,7 @@ While oneTBB has core support that enables developers to tune for Non-Uniform Me Access (NUMA) systems, we believe this support can be simplified and improved to provide an improved user experience. -This early proposal recommends addressing for areas for improvement: +This early proposal recommends addressing four areas for improvement: 1. improved reliability of HWLOC-dependent topology and pinning support in, 2. addition of a NUMA-aware allocation, From 54ae854675ea9ac5da078520545c152624834c55 Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Wed, 13 Nov 2024 08:08:28 -0600 Subject: [PATCH 03/11] Update rfcs/proposed/simplified_numa_support/README.md Co-authored-by: Aleksei Fedotov --- rfcs/proposed/simplified_numa_support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md index a45d77c611..b4bfc0742b 100755 --- a/rfcs/proposed/simplified_numa_support/README.md +++ b/rfcs/proposed/simplified_numa_support/README.md @@ -103,7 +103,7 @@ the failure to resolve HWLOC at runtime silently returns a default of `task_aren default does not pin threads to NUMA nodes. It is too easy to write code similar to the preceding example and be unaware that a HWLOC installation error (or lack of HWLOC) has undone all your effort. 
-**Getting good performance using these tools requres notable manual coding effort by users.** As we +**Getting good performance using these tools requires notable manual coding effort by users.** As we can see in the preceding example, if we want to spread work across the NUMA nodes in a system we need to query the topology using functions in the `tbb::info` namespace, create one `task_arena` per NUMA node, along with one `task_group` per NUMA node, and then add an From 87cf469767469d068a2fa9c848819e6ff0cc55b5 Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Wed, 13 Nov 2024 08:08:43 -0600 Subject: [PATCH 04/11] Update rfcs/proposed/simplified_numa_support/README.md Co-authored-by: Aleksei Fedotov --- rfcs/proposed/simplified_numa_support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md index b4bfc0742b..806e53ccba 100755 --- a/rfcs/proposed/simplified_numa_support/README.md +++ b/rfcs/proposed/simplified_numa_support/README.md @@ -107,7 +107,7 @@ example and be unaware that a HWLOC installation error (or lack of HWLOC) has un can see in the preceding example, if we want to spread work across the NUMA nodes in a system we need to query the topology using functions in the `tbb::info` namespace, create one `task_arena` per NUMA node, along with one `task_group` per NUMA node, and then add an -extra loop that iterates overs these `task_arena` and `task_group` objects to execute the +extra loop that iterates over these `task_arena` and `task_group` objects to execute the work on the desired NUMA nodes. We also need to handle all container allocations using OS-specific APIs (or behaviors, such as first-touch) to allocator or place them on the appropriate NUMA nodes. 
From 94d0d357a8e89d045b3f3db3de92b1b0317fcb7f Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Wed, 13 Nov 2024 08:08:54 -0600 Subject: [PATCH 05/11] Update rfcs/proposed/simplified_numa_support/README.md Co-authored-by: Aleksei Fedotov --- rfcs/proposed/simplified_numa_support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md index 806e53ccba..b297a68722 100755 --- a/rfcs/proposed/simplified_numa_support/README.md +++ b/rfcs/proposed/simplified_numa_support/README.md @@ -112,7 +112,7 @@ work on the desired NUMA nodes. We also need to handle all container allocations APIs (or behaviors, such as first-touch) to allocator or place them on the appropriate NUMA nodes. **The out-of-the-box performance of the generic TBB APIs on NUMA systems is not good enough.** -Should the oneTBB library do anything special be default if the system is a NUMA system? Or should +Should the oneTBB library do anything special by default if the system is a NUMA system? Or should regular random stealing distribute the work across all of the cores, regardless of which NUMA first touched the data? From aa14760efb40ef87cb089a79058f04dfa6164c08 Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Wed, 13 Nov 2024 08:09:05 -0600 Subject: [PATCH 06/11] Update rfcs/proposed/simplified_numa_support/README.md Co-authored-by: Aleksei Fedotov --- rfcs/proposed/simplified_numa_support/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/simplified_numa_support/README.md index b297a68722..ca36b262db 100755 --- a/rfcs/proposed/simplified_numa_support/README.md +++ b/rfcs/proposed/simplified_numa_support/README.md @@ -147,7 +147,7 @@ will return a vector with a single element of `task_arena::automatic`. This beha through user questions, can lead to unexpected performance from NUMA optimizations. 
When running on a NUMA system, a developer that has not fully read the documentation may expect that `numa_nodes()` will give a proper accounting of the NUMA nodes. When the code, without raising any alarm, returns only -a single, valid element due to the environmental configuation (such as lack of HWLOCK), it is too easy +a single, valid element due to the environmental configuation (such as lack of HWLOC), it is too easy for developers to not notice that the code is acting in a valid, but unexpected way. We propose that the oneTBB library implementation include, wherever possibly, a statically-linked fallback From 6a57193b81eb372f89cd3027d5680c375e6285f7 Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Wed, 13 Nov 2024 09:28:52 -0600 Subject: [PATCH 07/11] Renamed numa_support RFC --- .../README.md | 23 ++++++++++--------- 1 file changed, 12 insertions(+), 11 deletions(-) rename rfcs/proposed/{simplified_numa_support => numa_support}/README.md (90%) diff --git a/rfcs/proposed/simplified_numa_support/README.md b/rfcs/proposed/numa_support/README.md similarity index 90% rename from rfcs/proposed/simplified_numa_support/README.md rename to rfcs/proposed/numa_support/README.md index ca36b262db..0a0b822830 100755 --- a/rfcs/proposed/simplified_numa_support/README.md +++ b/rfcs/proposed/numa_support/README.md @@ -1,4 +1,4 @@ -# Simplified NUMA support in oneTBB +# NUMA support ## Introduction @@ -8,15 +8,15 @@ While oneTBB has core support that enables developers to tune for Non-Uniform Me Access (NUMA) systems, we believe this support can be simplified and improved to provide an improved user experience. -This early proposal recommends addressing four areas for improvement: +This RFC acts as an umbrella for sub-proposals that address four areas for improvement: 1. improved reliability of HWLOC-dependent topology and pinning support in, 2. addition of a NUMA-aware allocation, 3. simplified approaches to associate task distribution with data placement and 4. 
where possible, improved out-of-the-box performance for high-level oneTBB features. -We expect that this draft proposal may be broken into smaller proposals based on feedback -and prioritization of the suggested features. +We expect that this draft proposal will spawn sub-proposals that will progress +independently based on feedback and prioritization of the suggested features. The features for NUMA tuning already available in the oneTBB 1.3 specification include: @@ -25,10 +25,11 @@ The features for NUMA tuning already available in the oneTBB 1.3 specification i - `int default_concurrency(numa_node_id id = oneapi::tbb::task_arena::automatic)` - `tbb::task_arena::constraints` in **[scheduler.task_arena]** -Below is the example that demonstrates the use of these APIs to pin threads to different -arenas to each of the NUMA nodes available on a system, submit work across those `task_arena` -objects and into associated `task_group`` objects, and then wait for work again using both -the `task_arena` and `task_group` objects. +Below is the example based on existing oneTBB documentation that demonstrates the use +of these APIs to pin threads to different arenas to each of the NUMA nodes available +on a system, submit work across those `task_arena` objects and into associated +`task_group`` objects, and then wait for work again using both the `task_arena` +and `task_group` objects. #include "oneapi/tbb/task_group.h" #include "oneapi/tbb/task_arena.h" @@ -42,7 +43,7 @@ the `task_arena` and `task_group` objects. // Initialize the arenas and place memory for (int i = 0; i < numa_nodes.size(); i++) { - arenas[i].initialize(oneapi::tbb::task_arena::constraints(numa_nodes[i])); + arenas[i].initialize(oneapi::tbb::task_arena::constraints(numa_nodes[i]),0); arenas[i].execute([i] { // allocate/place memory on NUMA node i }); @@ -79,7 +80,7 @@ tradeoffs involved in this tuning often rely on application-specific knowledge. In particular, NUMA tuning typically involves: 1. 
Understanding the overall application problem and its use of algorithms and data containers -2. Placement of data container objects onto memory resources +2. Placement/allocation of data container objects onto memory resources 3. Distribution of tasks to hardware resources that optimize for data placement As shown in the previous example, the oneTBB 1.3 specification only provides low-level @@ -155,7 +156,7 @@ to decrease that likelihood of such failures. The oneTBB specification will rema ### NUMA-aware allocation -We will define allocators of other features that simplify the process of allocating or places data onto +We will define allocators or other features that simplify the process of allocating or placing data onto specific NUMA nodes. ### Simplified approaches to associate task distribution with data placement From bf798c813cf9a4cbf82093cf68423a8b0f848827 Mon Sep 17 00:00:00 2001 From: Mike Voss Date: Thu, 9 Jan 2025 13:24:52 -0600 Subject: [PATCH 08/11] Fixed example and added links to sub-RFCs --- rfcs/proposed/numa_support/README.md | 93 +++++++++++----------------- 1 file changed, 35 insertions(+), 58 deletions(-) diff --git a/rfcs/proposed/numa_support/README.md b/rfcs/proposed/numa_support/README.md index 0a0b822830..f17b18a4cb 100755 --- a/rfcs/proposed/numa_support/README.md +++ b/rfcs/proposed/numa_support/README.md @@ -31,44 +31,32 @@ on a system, submit work across those `task_arena` objects and into associated `task_group`` objects, and then wait for work again using both the `task_arena` and `task_group` objects. 
- #include "oneapi/tbb/task_group.h" - #include "oneapi/tbb/task_arena.h" - - #include <vector> - - int main() { - std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes(); - std::vector<oneapi::tbb::task_arena> arenas(numa_nodes.size()); - std::vector<oneapi::tbb::task_group> task_groups(numa_nodes.size()); - - // Initialize the arenas and place memory - for (int i = 0; i < numa_nodes.size(); i++) { - arenas[i].initialize(oneapi::tbb::task_arena::constraints(numa_nodes[i]),0); - arenas[i].execute([i] { - // allocate/place memory on NUMA node i - }); - } - - for (int j 0; j < NUM_STEPS; ++i) { - - // Distribute work across the arenas / NUMA nodes - for (int i = 0; i < numa_nodes.size(); i++) { - arenas[i].execute([&task_groups, i] { - task_groups[i].run([] { - /* executed by the thread pinned to specified NUMA node */ - }); - }); - } - - // Wait for the work in each arena / NUMA node to complete - for (int i = 0; i < numa_nodes.size(); i++) { - arenas[i].execute([&task_groups, i] { - task_groups[i].wait(); - }); - } - } - - return 0; + void constrain_for_numa_nodes() { + std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes(); + std::vector<tbb::task_arena> arenas(numa_nodes.size()); + std::vector<tbb::task_group> task_groups(numa_nodes.size()); + + // initialize each arena, each constrained to a different NUMA node + for (int i = 0; i < numa_nodes.size(); i++) + arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]), 0); + + // enqueue work to all but the first arena, using the task_group to track work + // by using defer, the task_group reference count is incremented immediately + for (int i = 1; i < numa_nodes.size(); i++) + arenas[i].enqueue( + task_groups[i].defer([] { + tbb::parallel_for(0, N, [](int j) { f(w); }); + }) + ); + + // directly execute the work to completion in the remaining arena + arenas[0].execute([] { + tbb::parallel_for(0, N, [](int j) { f(w); }); + }); + + // join the other arenas to wait on their task_groups + for (int i = 1; i < numa_nodes.size(); i++) + arenas[i].execute([&task_groups, i] { task_groups[i].wait(); }); } ### 
The need for application-specific knowledge @@ -96,17 +84,17 @@ section **[info_namespace]** of the oneTBB specification that describes the function `std::vector<numa_node_id> numa_nodes()`, "If error occurs during system topology parsing, returns vector containing single element that equals to `task_arena::automatic`." -In practice, the error often occurs because HWLOC is not detected on the system. While the +In practice, the error can occur because HWLOC is not detected on the system. While the oneTBB documentation states in several places that HWLOC is required for NUMA support and even provides guidance on [how to check for HWLOC](https://www.intel.com/content/www/us/en/docs/onetbb/get-started-guide/2021-12/next-steps.html), -the failure to resolve HWLOC at runtime silently returns a default of `task_arena::automatic`. This +the inability to resolve HWLOC at runtime silently returns a default of `task_arena::automatic`. This default does not pin threads to NUMA nodes. It is too easy to write code similar to the preceding example and be unaware that a HWLOC installation error (or lack of HWLOC) has undone all your effort. **Getting good performance using these tools requires notable manual coding effort by users.** As we can see in the preceding example, if we want to spread work across the NUMA nodes in -a system we need to query the topology using functions in the `tbb::info` namespace, create +a system we might need to query the topology using functions in the `tbb::info` namespace, create one `task_arena` per NUMA node, along with one `task_group` per NUMA node, and then add an extra loop that iterates over these `task_arena` and `task_group` objects to execute the work on the desired NUMA nodes. We also need to handle all container allocations using OS-specific @@ -132,31 +120,20 @@ in the two loops are from the same NUMA nodes? 
Or is this too much to expect wit a[i] = b[i] + c[i]; }); -## Proposal +## Possible Sub-Proposals ### Increased availability of NUMA support -The oneTBB 1.3 specification states for `tbb::info::numa_nodes`, "If error occurs during system -topology parsing, returns vector containing single element that equals to task_arena::automatic." - -Since the oneTBB library dynamically loads the HWLOC library, a misconfiguration can cause the HWLOC -to fail to be found. In that case, a call like: +See [sub-RFC for increased availability of NUMA API](https://github.com/uxlfoundation/oneTBB/pull/1545) - std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes(); -will return a vector with a single element of `task_arena::automatic`. This behavior, as we have noticed -through user questions, can lead to unexpected performance from NUMA optimizations. When running -on a NUMA system, a developer that has not fully read the documentation may expect that `numa_nodes()` -will give a proper accounting of the NUMA nodes. When the code, without raising any alarm, returns only -a single, valid element due to the environmental configuation (such as lack of HWLOC), it is too easy -for developers to not notice that the code is acting in a valid, but unexpected way. +### Add NUMA-constrained arenas -We propose that the oneTBB library implementation include, wherever possibly, a statically-linked fallback -to decrease that likelihood of such failures. The oneTBB specification will remain unchanged. +See [sub-RFC for creation and use of NUMA-constrained arenas](https://github.com/uxlfoundation/oneTBB/pull/1559) ### NUMA-aware allocation -We will define allocators or other features that simplify the process of allocating or placing data onto +Define allocators or other features that simplify the process of allocating or placing data onto specific NUMA nodes. ### Simplified approaches to associate task distribution with data placement @@ -169,7 +146,7 @@ flow graph and containers where appropriate. 
### Improved out-of-the-box performance for high-level oneTBB features. -For high-level oneTBB features that are modified to provide improved NUMA support, we should try to +For high-level oneTBB features that are modified to provide improved NUMA support, we can try to align default behaviors for those features with user-expectations when used on NUMA systems. ## Open Questions From 91ca50b55553d0907776d0ed260540f150da3957 Mon Sep 17 00:00:00 2001 From: Aleksei Fedotov Date: Thu, 23 Jan 2025 15:06:33 +0100 Subject: [PATCH 09/11] Add sub-RFC for increased availability of NUMA API (#1545) --- .../tbbbind-link-static-hwloc.org | 119 ++++++++++++++++++ 1 file changed, 119 insertions(+) create mode 100755 rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org diff --git a/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org b/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org new file mode 100755 index 0000000000..d108ac1283 --- /dev/null +++ b/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org @@ -0,0 +1,119 @@ +# -*- fill-column: 80; -*- + +#+title: Link ~tbbbind~ with Static HWLOC for NUMA API predictability + +*Note:* This document is a sub-RFC of the [[file:README.md][umbrella RFC about improving NUMA +support]]. Specifically, the "Increased availability of NUMA support" section. + +* Introduction +oneTBB has a soft dependency on several variants of ~tbbbind~, which the library +loads during the initialization stage. Each ~tbbbind~, in turn, has a hard +dependency on a specific version of the HWLOC library [1, 2]. The soft +dependency means that the library continues the execution even if the system +loader fails to resolve the hard dependency on HWLOC for ~tbbbind~. In this +case, oneTBB does not discover the hardware topology. Instead, it defaults to +viewing all CPU cores as uniform, consistent with TBB behavior when NUMA +constraints are not used. 
As a result, the following code returns irrelevant +values that do not reflect the actual topology: + +#+begin_src C++ +std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes(); +std::vector<oneapi::tbb::core_type_id> core_types = oneapi::tbb::info::core_types(); +#+end_src + +This lack of valid HW topology, caused by the absence of a third-party library, +is the major problem with the current oneTBB behavior. The problem lies in the +lack of diagnostics making it difficult for developers to detect. As a result, +the code continues to run but fails to use NUMA as intended. + +Dependency on a shared HWLOC library has the following benefits: +1. Code reuse with all of the positive consequences out of this, including + relying on the same code that has been tested and debugged, allowing the OS + to share it among different processes, which consequently improves on cache + locality and memory footprint. That's the primary purpose of shared + libraries. +2. A drop-in replacement. Users are able to use their own version of HWLOC + without recompilation of oneTBB. This specific version of HWLOC could include + a hotfix to support a particular and/or new hardware that a customer has, but + whose support is not yet upstreamed to HWLOC project. It is also possible + that such support won't be upstreamed at all if that hardware is not going to + be available for massive users. It could also be a development version of + HWLOC that someone wants to test on their systems first. Of course, they can + do it with the static version as well, but that's more cumbersome as it + requires recompilation of every dependent component. + +The only disadvantage from depending on HWLOC library dynamically is that the +developers that use oneTBB's NUMA support API need to make sure the library is +available and can be found by oneTBB. Depending on the distribution model of a +developer's code, this is achieved either by: +1. Asking the end user to have the necessary version of a dependency pre-installed. +2. 
Bundling the necessary HWLOC version together with other pieces of a product + release. + +However, the requirement to fulfill one of the above steps for the NUMA API to +start paying off may be considered as an inconvenience and, what is more +important, it is not always obvious that one of these steps is needed, +especially given the silent behavior when the HWLOC library cannot be found in the +environment. + +The proposal is to reduce the effect of the disadvantage of relying on a dynamic +HWLOC library. The improvements involve statically linking HWLOC with one of the +~tbbbind~ libraries distributed together with oneTBB. At the same time, you +retain the flexibility to specify a different version of the HWLOC library if needed. + +Since HWLOC 1.x is an older version and modern operating systems install HWLOC +2.x by default, the probability of users being restricted to HWLOC 1.x is +relatively small. Thus, we can reuse the filename of the ~tbbbind~ library +linked to HWLOC 1.x for the library linked against a static HWLOC 2.x. + +* Proposal +1. Replace the dynamic link of ~tbbbind~ library currently linked + against HWLOC 1.x with a link to a static HWLOC library version 2.x. +2. Add loading of that ~tbbbind~ variant as the last attempt to resolve the + dependency on functionality provided by the ~tbbbind~ layer. +3. Update the oneTBB documentation, including [[https://oneapi-src.github.io/oneTBB/search.html?q=tbb%3A%3Ainfo][these pages]], to + detail the steps for identifying which ~tbbbind~ is being used. + +** Advantages +1. The proposed behavior introduces a fallback mechanism for resolving the HWLOC + library dependency when it is not in the environment, while still preferring + user-provided versions. 
As a result, the problematic oneTBB API usage works + as expected, returning an enumerated list of actual NUMA nodes and core types + on the system the code is running on, provided that the loaded HWLOC library + works on that system and that the application properly distributes all + binaries of oneTBB and sets the environment so that the necessary variant of + the ~tbbbind~ library can be found and loaded. +2. Dropping support for HWLOC 1.x avoids introducing an additional ~tbbbind~ + variant while maintaining support for widely used versions of HWLOC. + +** Disadvantages +By default, there are still no diagnostics if you fail to correctly set up an +environment with your version of HWLOC. However, specifying the ~TBB_VERSION=1~ +environment variable helps identify configuration issues quickly. + +* Alternative Handling for Missing System Topology +The alternative behavior when the HWLOC library cannot be found is to be more explicit +about the problem of a missing component and to either issue a warning or +refuse to work, requiring one of the ~tbbbind~ variants to be loaded (e.g., throw +an exception). + +These alternative approaches compare to the proposed one as follows. +** Common Advantages +- Explicitly indicates that the functionality being used does not work, instead + of failing silently. +- Avoids the need to distribute an additional variant of the ~tbbbind~ library. + +** Common Disadvantages +- Requires an additional step from the user side to resolve the problem. In other + words, it does not provide a complete solution to the problem. + +*** Disadvantages of Issuing a Warning +- The warning may go unnoticed, especially if standard streams are closed. + +*** Disadvantages of Throwing an Exception +- May break existing code that does not expect an exception to be thrown. +- Requires introduction of an additional exception hierarchy. + +* References +1. [[https://www.open-mpi.org/projects/hwloc/][HWLOC project main page]] +2. 
[[https://github.com/open-mpi/hwloc][HWLOC project repository on GitHub]] From f9a8ec0fbe508b2562762d1ca93cdf00f788c5c3 Mon Sep 17 00:00:00 2001 From: "Fedotov, Aleksei" Date: Thu, 23 Jan 2025 15:18:53 +0100 Subject: [PATCH 10/11] Address review remark and update the link --- rfcs/proposed/numa_support/README.md | 9 ++++----- rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org | 5 +++-- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/rfcs/proposed/numa_support/README.md b/rfcs/proposed/numa_support/README.md index f17b18a4cb..c07d05084d 100755 --- a/rfcs/proposed/numa_support/README.md +++ b/rfcs/proposed/numa_support/README.md @@ -25,11 +25,10 @@ The features for NUMA tuning already available in the oneTBB 1.3 specification i - `int default_concurrency(numa_node_id id = oneapi::tbb::task_arena::automatic)` - `tbb::task_arena::constraints` in **[scheduler.task_arena]** -Below is the example based on existing oneTBB documentation that demonstrates the use -of these APIs to pin threads to different arenas to each of the NUMA nodes available -on a system, submit work across those `task_arena` objects and into associated -`task_group`` objects, and then wait for work again using both the `task_arena` -and `task_group` objects. +Below is the example based on existing oneTBB documentation that demonstrates the use of these APIs +to pin threads to different arenas to each of the NUMA nodes available on a system, submit work +across those `task_arena` objects and into associated `task_group` objects, and then wait for work +again using both the `task_arena` and `task_group` objects. 
void constrain_for_numa_nodes() { std::vector numa_nodes = tbb::info::numa_nodes(); diff --git a/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org b/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org index d108ac1283..ebda06992e 100755 --- a/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org +++ b/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org @@ -71,8 +71,9 @@ linked to HWLOC 1.x for the library linked against a static HWLOC 2.x. against HWLOC 1.x with a link to a static HWLOC library version 2.x. 2. Add loading of that ~tbbbind~ variant as the last attempt to resolve the dependency on functionality provided by the ~tbbbind~ layer. -3. Update the oneTBB documentation, including [[https://oneapi-src.github.io/oneTBB/search.html?q=tbb%3A%3Ainfo][these pages]], to - detail the steps for identifying which ~tbbbind~ is being used. +3. Update the oneTBB documentation, including + [[https://uxlfoundation.github.io/oneTBB/search.html?q=tbb%3A%3Ainfo][these + pages]], to detail the steps for identifying which ~tbbbind~ is being used. ** Advantages 1. The proposed behavior introduces a fallback mechanism for resolving the HWLOC From 85126b7411fa068e3c77357fe07e835ace927313 Mon Sep 17 00:00:00 2001 From: "Fedotov, Aleksei" Date: Thu, 23 Jan 2025 15:19:14 +0100 Subject: [PATCH 11/11] Update the links to point to the documents --- rfcs/proposed/numa_support/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfcs/proposed/numa_support/README.md b/rfcs/proposed/numa_support/README.md index c07d05084d..c19927f4a6 100755 --- a/rfcs/proposed/numa_support/README.md +++ b/rfcs/proposed/numa_support/README.md @@ -123,12 +123,12 @@ in the two loops are from the same NUMA nodes? 
Or is this too much to expect wit ### Increased availability of NUMA support -See [sub-RFC for increased availability of NUMA API](https://github.com/uxlfoundation/oneTBB/pull/1545) +See [sub-RFC for increased availability of NUMA API](tbbbind-link-static-hwloc.org) ### Add NUMA-constrained arenas -See [sub-RFC for creation and use of NUMA-constrained arenas](https://github.com/uxlfoundation/oneTBB/pull/1559) +See [sub-RFC for creation and use of NUMA-constrained arenas](numa-arenas-creation-and-use.org) ### NUMA-aware allocation