diff --git a/docs/abbreviations.md b/docs/abbreviations.md index 88761fd..92ec0ec 100644 --- a/docs/abbreviations.md +++ b/docs/abbreviations.md @@ -36,6 +36,7 @@ *[ISA]: Instruction Set Architecture *[ISAs]: Instruction Set Architectures *[JSON]: JavaScript Object Notation +*[LLC]: Last Level Cache *[LUT]: Look Up Table *[LUTs]: Look Up Tables *[MKL]: Intel Math Kernel Library @@ -45,6 +46,7 @@ *[MSVC]: Microsoft Visual C++ *[MT 19937]: Mersenne Twister 19937 *[NEON]: ARM SIMD instructions +*[NUMA]: Non Uniform Memory Access *[OS]: Operating System *[OSs]: Operating Systems *[PRNG]: Pseudo Random Number Generator @@ -68,3 +70,4 @@ *[SSE4.1]: Streaming SIMD Extensions 4.1 *[SSE4.2]: Streaming SIMD Extensions 4.2 *[STD]: Standard +*[UMA]: Uniform Memory Access diff --git a/docs/thread_pinning.md b/docs/thread_pinning.md index e0e4830..a0ae961 100644 --- a/docs/thread_pinning.md +++ b/docs/thread_pinning.md @@ -1,6 +1,6 @@ # Thread Pinning -`AFF3CT-core` enables to select on which CPU process units (PUs) the threads are +`AFF3CT-core` enables selecting the process units (PUs) on which the threads are effectively run. This is called *thread pinning* and it can significantly benefit to the performance, especially on modern heterogeneous architectures. To do so, the runtime relies on the @@ -25,7 +25,7 @@ To do so, the runtime relies on the *Portable Hardware Locality* (`hwloc` in short) is a library which provides a **portable abstraction** of the **hierarchical topology of modern -architectures** (see the illustration below). +architectures** (see the figure below).
![Orange Pi 5](./assets/hwloc_orangepi5.svg) @@ -36,52 +36,53 @@ architectures** (see the illustration below).
-`hwloc` gives the ability to pin threads over any level of hierarchy with a tree -view, where the process units are the leaves and there are intern nodes which -represent a set of PUs that are physically close (share the same LLC or are in -the same NUMA node). +`hwloc` gives the ability to pin threads over various levels of the hierarchy, +represented by a tree structure. The deepest/lowest nodes (the leaves) are the +PUs while higher nodes represent sets of PUs that are physically close. For +instance, a set of PUs can share the same UMA node (in the case of a NUMA +architecture), the same LLC or the same package. -For instance, we can choose to pin a thread over a *package* and it will be able -to execute on all the PUs that are in this level. In the Orange Pi 5 SBC, if we -choose `Package L#0` the thread will run over the following set of PUs: -`PU L#0`, `PU L#1`, `PU L#2` and `PU L#3`. Consequently, **the pinned thread can -move in the selected `hwloc` object during the execution** and it is up to the -OS to schedule the thread on the available set of PUs. +In the Orange Pi 5 SBC, if we pin a thread on `Package L#0`, it will run +over the following set of PUs: `PU L#0`, `PU L#1`, `PU L#2` and `PU L#3`. +Thus, **the pinned thread can move within the selected `hwloc` node during the +execution** and it is up to the OS to schedule the thread on the selected set +of PUs. !!! warning - The indexes given by `hwloc` are different from those given by the OS: they - are logical indexes that express the real locality. **Consequently, in + The indexes given by `hwloc` can be different from those given by the OS: + they are logical indexes that express the real locality. **Consequently, in `AFF3CT-core`, it is important to use `hwloc` logical indexes.** The `hwloc-ls` command gives an overview of the current topology with these logical indexes. ## Sequence & Pipeline -In `AFF3CT-core`, the thread pinning can be set in `runtime::Sequence` and -`runtime::Pipeline` classes constructor. 
In both cases, there is a dedicated -argument of `std::string` type: `sequence_pinning_policy` for -`runtime::Sequence` and `pipeline_pinning_policy` for `runtime::Pipeline`. +In `AFF3CT-core`, thread pinning can be set in `runtime::Sequence` and +`runtime::Pipeline` class constructors. In both cases, there is a dedicated +argument of `std::string` type named `sequence_pinning_policy` for +`runtime::Sequence` or `pipeline_pinning_policy` for `runtime::Pipeline`. !!! info - It is important to specify the thread pinning at the construction of the - `runtime::Sequence`/`runtime::Pipeline` object to guarantee that the data - will be allocated and initialized (first touch policy) on the right memory - banks during the replication process. + For NUMA architectures, it is important to specify thread pinning at the + construction of the `runtime::Sequence`/`runtime::Pipeline` object to + guarantee that the data will be allocated and initialized on the right + memory banks (according to the first touch policy) during the replication + process. -To specify the pinning policy, we defined a syntax to express `hwloc` with three -different separators: +To specify the pinning policy, we defined a syntax to express `hwloc` objects +with three different separators: - Pipeline stage (does not concern `runtime::Sequence`): `|` - Replicated stage (= replicated sequence = one thread): `;` - For one thread, the list of pinned `hwloc` objects (= logical or): `,` -Then, the pinning can contains all the available `hwloc` objects. Below is -the correspondence between the `std::string` and the `hwloc` objects type -enumerate: +Then, the pinning policy can contain any of the available `hwloc` objects. 
Below +is the correspondence between the `std::string` and the `hwloc` object types: ```cpp -static std::map object_map = -{ /* global containers */ /* data caches */ /* instruction caches */ +std::map str_to_hwloc_obj = +{ + /* global containers */ /* data caches */ /* instruction caches */ { "GROUP", HWLOC_OBJ_GROUP }, { "L5D", HWLOC_OBJ_L5CACHE }, { "L3I", HWLOC_OBJ_L3ICACHE }, { "NUMA", HWLOC_OBJ_NUMANODE }, { "L4D", HWLOC_OBJ_L4CACHE }, { "L2I", HWLOC_OBJ_L2ICACHE }, { "PACKAGE", HWLOC_OBJ_PACKAGE }, { "L3D", HWLOC_OBJ_L3CACHE }, { "L1I", HWLOC_OBJ_L1ICACHE }, @@ -91,26 +92,24 @@ static std::map object_map = }; ``` -The following syntax is used to specify the object index `X`: `OBJECT_X`. - -`OBJECT` can be all the `std::string` defined in the previous listing -(ex: `PU_10` refers to the logical process unit n°10). +To specify the index `X` of an `hwloc` object, the following syntax is used: +`OBJECT_X` (e.g. `PU_5` refers to the logical PU n°5). !!! info - `CORE` and `PU` objects can be confusing. If the CPU cores does not support + `CORE` and `PU` objects can be confusing. If the CPU cores do not support SMT, then `CORE` and `PU` are the same. However, if the CPU cores support SMT, then the `PU` is the hardware thread identifier inside a given `CORE`. ### Illustrative Examples -The section proposes some examples to understand how the syntax works. Only the -simplest `hwloc` object is used: the `PU`. Let's suppose that we have a -octo-core CPU with 8 process units (`PU_0, PU_1, PU_2, PU_3, PU_4, PU_5, PU_6, -PU_7`), see the topology of the Orange Pi 5 Plus above). +This section gives some examples to understand how the syntax works. We +suppose that we have a CPU with 8 PUs with the same topology as the Orange +Pi 5 Plus SBC presented above. 
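To make the three separators concrete, here is a minimal sketch of how a pinning policy string could be split into stages (`|`), threads (`;`) and `hwloc` object names (`,`). The types `ObjectList`, `StagePolicy`, `PipelinePolicy` and the function `parse_pinning_policy` are hypothetical names introduced for illustration only; this is not the parser used by `AFF3CT-core`. The sketch only records what is written in the string: it does not check that the names exist in the table above, does not resolve `OBJECT_X` indexes, and does not replicate a single thread entry over all the threads of a stage.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types, for illustration only (not part of AFF3CT-core).
using ObjectList = std::vector<std::string>;     // objects one thread may run on (logical or)
using StagePolicy = std::vector<ObjectList>;     // one entry per thread written in the stage
using PipelinePolicy = std::vector<StagePolicy>; // one entry per pipeline stage

// Remove leading and trailing spaces and tabs.
static std::string trim(const std::string& s)
{
    const std::size_t first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Split `s` on `sep` and trim each field. Empty fields are kept so that an
// empty stage (no pinning given) is still counted as a stage.
static std::vector<std::string> split(const std::string& s, char sep)
{
    std::vector<std::string> fields;
    std::size_t begin = 0;
    while (true)
    {
        const std::size_t end = s.find(sep, begin);
        if (end == std::string::npos)
        {
            fields.push_back(trim(s.substr(begin)));
            return fields;
        }
        fields.push_back(trim(s.substr(begin, end - begin)));
        begin = end + 1;
    }
}

// Split a policy string into stages, then threads, then object names.
PipelinePolicy parse_pinning_policy(const std::string& policy)
{
    PipelinePolicy stages;
    for (const std::string& stage : split(policy, '|'))
    {
        StagePolicy threads;
        if (!stage.empty()) // an empty stage carries no pinning information
            for (const std::string& thread : split(stage, ';'))
                threads.push_back(split(thread, ','));
        stages.push_back(threads);
    }
    return stages;
}
```

For instance, calling `parse_pinning_policy("| PU_4, PU_5; PU_6, PU_7 | PACKAGE_0")` yields three stages: the first one is empty, the second one holds two thread entries (`PU_4` or `PU_5`, then `PU_6` or `PU_7`) and the third one holds a single entry with `PACKAGE_0`.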
#### Example 1 -We want to describe a 3 stages pipeline with: +Let's suppose we want to set up a 3-stage pipeline with the following +characteristics: - **Stage 1** - No replication (= 1 thread): - Pinned to `PU_0` @@ -136,15 +135,18 @@ S2T4(Stage 2, thread 4 - pin: PU_6 or PU_7)-->SYNC2; SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3); ``` -The input parameters will be: +In the previous configuration, 6 threads will execute simultaneously (even if +the given architecture supports up to 8 executions in parallel). + +To instantiate this `runtime::Pipeline`, here are the corresponding constructor +parameters: - Number of replications (= threads) per stage: `{ 1, 4, 1 }` -- Enabling pinning: `{ true, true, true }` +- Enabling pinning per stage: `{ true, true, true }` - Pinning policy: `"PU_0 | PU_4, PU_5; PU_4, PU_5; PU_6, PU_7; PU_6, PU_7 | PU_0, PU_1, PU_2, PU_3"` -The previous pinning policy syntax can be compressed a little bit. It is -possible to use the following equivalent `std::string`: +The previous pinning policy syntax can be compressed a little bit as follows: - Pinning policy : `"PU_0 | PACKAGE_1; PACKAGE_1; PACKAGE_2; PACKAGE_2 | PACKAGE_0"` @@ -153,7 +155,7 @@ possible to use the following equivalent `std::string`: Let's now consider that we want to pin all the threads of the stage 2 on the `PU_4`, `PU_5`, `PU_6` or `PU_7` (this is less restrictive than the previous -example). The pinning strategy for stage 1 and 3 is the same as before. +example). The pinning strategy for stages 1 and 3 is unchanged. 
```mermaid graph LR; @@ -169,6 +171,10 @@ S2T4(Stage 2, thread 4 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2; SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3); ``` +Here are the corresponding parameters: + +- Number of replications (= threads) per stage: `{ 1, 4, 1 }` +- Enabling pinning per stage: `{ true, true, true }` - Pinning policy : `"PU_0 | PACKAGE_1, PACKAGE_2 | PACKAGE_0"` With the previous syntax, the 4 threads of the stage 2 will apply the @@ -176,12 +182,12 @@ With the previous syntax, the 4 threads of the stage 2 will apply the #### Example 3 -It is also possible to choose the stages we want to pin using a vector of -`boolean`. For instance, if we don't want to pin the first stage, we can do: +It is also possible to choose the stages we want to pin or not using a vector of +`boolean`. Let's suppose we do not want to specify any pinning for the stage 1. ```mermaid graph LR; -S1T1(Stage 1, thread 1 - no pin)-->SYNC1; +S1T1(Stage 1, thread 1 - no pinning)-->SYNC1; SYNC1(Sync)-->S2T1; SYNC1(Sync)-->S2T2; SYNC1(Sync)-->S2T3; @@ -193,11 +199,13 @@ S2T4(Stage 2, thread 4 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2; SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3); ``` -- Enabling pinning: `{false, true, true}` +Here are the corresponding parameters: + +- Number of replications (= threads) per stage: `{ 1, 4, 1 }` +- Enabling pinning per stage: `{false, true, true}` - Pinning policy: `"| PACKAGE_1, PACKAGE_2 | PACKAGE_0"` -Thus, the operating system will be in charge of pinning the thread of the first -stage. +In this case, the OS will be in charge of pinning the thread of the first stage. ### Unpin