diff --git a/docs/abbreviations.md b/docs/abbreviations.md
index 88761fd..92ec0ec 100644
--- a/docs/abbreviations.md
+++ b/docs/abbreviations.md
@@ -36,6 +36,7 @@
*[ISA]: Instruction Set Architecture
*[ISAs]: Instruction Set Architectures
*[JSON]: JavaScript Object Notation
+*[LLC]: Last Level Cache
*[LUT]: Look Up Table
*[LUTs]: Look Up Tables
*[MKL]: Intel Math Kernel Library
@@ -45,6 +46,7 @@
*[MSVC]: Microsoft Visual C++
*[MT 19937]: Mersenne Twister 19937
*[NEON]: ARM SIMD instructions
+*[NUMA]: Non Uniform Memory Access
*[OS]: Operating System
*[OSs]: Operating Systems
*[PRNG]: Pseudo Random Number Generator
@@ -68,3 +70,4 @@
*[SSE4.1]: Streaming SIMD Extensions 4.1
*[SSE4.2]: Streaming SIMD Extensions 4.2
*[STD]: Standard
+*[UMA]: Uniform Memory Access
diff --git a/docs/thread_pinning.md b/docs/thread_pinning.md
index e0e4830..a0ae961 100644
--- a/docs/thread_pinning.md
+++ b/docs/thread_pinning.md
@@ -1,6 +1,6 @@
# Thread Pinning
-`AFF3CT-core` enables to select on which CPU process units (PUs) the threads are
+`AFF3CT-core` allows selecting on which process units (PUs) the threads are
effectively run. This is called *thread pinning* and it can significantly
benefit to the performance, especially on modern heterogeneous architectures.
To do so, the runtime relies on the
@@ -25,7 +25,7 @@ To do so, the runtime relies on the
*Portable Hardware Locality* (`hwloc` in short) is a library which provides a
**portable abstraction** of the **hierarchical topology of modern
-architectures** (see the illustration below).
+architectures** (see the figure below).
-`hwloc` gives the ability to pin threads over any level of hierarchy with a tree
-view, where the process units are the leaves and there are intern nodes which
-represent a set of PUs that are physically close (share the same LLC or are in
-the same NUMA node).
+`hwloc` gives the ability to pin threads at various levels of the hierarchy,
+which is represented by a tree structure. The deepest nodes (the leaves) are the
+PUs while higher nodes represent sets of PUs that are physically close. For
+instance, a set of PUs can share the same UMA node (in the case of a NUMA
+architecture), the same LLC or the same package.
-For instance, we can choose to pin a thread over a *package* and it will be able
-to execute on all the PUs that are in this level. In the Orange Pi 5 SBC, if we
-choose `Package L#0` the thread will run over the following set of PUs:
-`PU L#0`, `PU L#1`, `PU L#2` and `PU L#3`. Consequently, **the pinned thread can
-move in the selected `hwloc` object during the execution** and it is up to the
-OS to schedule the thread on the available set of PUs.
+On the Orange Pi 5 SBC, if we pin a thread to `Package L#0`, it can run on any
+PU of the following set: `PU L#0`, `PU L#1`, `PU L#2` and `PU L#3`.
+Thus, **the pinned thread can move within the selected `hwloc` node during
+execution** and it is up to the OS to schedule the thread on the selected set
+of PUs.
!!! warning
- The indexes given by `hwloc` are different from those given by the OS: they
- are logical indexes that express the real locality. **Consequently, in
+ The indexes given by `hwloc` can be different from those given by the OS:
+ they are logical indexes that express the real locality. **Consequently, in
`AFF3CT-core`, it is important to use `hwloc` logical indexes.** The
`hwloc-ls` command gives an overview of the current topology with these
logical indexes.
## Sequence & Pipeline
-In `AFF3CT-core`, the thread pinning can be set in `runtime::Sequence` and
-`runtime::Pipeline` classes constructor. In both cases, there is a dedicated
-argument of `std::string` type: `sequence_pinning_policy` for
-`runtime::Sequence` and `pipeline_pinning_policy` for `runtime::Pipeline`.
+In `AFF3CT-core`, thread pinning can be set in the `runtime::Sequence` and
+`runtime::Pipeline` class constructors. In both cases, there is a dedicated
+argument of `std::string` type named `sequence_pinning_policy` for
+`runtime::Sequence` or `pipeline_pinning_policy` for `runtime::Pipeline`.
!!! info
- It is important to specify the thread pinning at the construction of the
- `runtime::Sequence`/`runtime::Pipeline` object to guarantee that the data
- will be allocated and initialized (first touch policy) on the right memory
- banks during the replication process.
+ For NUMA architectures, it is important to specify thread pinning at the
+ construction of the `runtime::Sequence`/`runtime::Pipeline` object to
+ guarantee that the data will be allocated and initialized on the right
+ memory banks (according to the first touch policy) during the replication
+ process.
-To specify the pinning policy, we defined a syntax to express `hwloc` with three
-different separators:
+To specify the pinning policy, we defined a syntax to express `hwloc` objects
+with three different separators:
- Pipeline stage (does not concern `runtime::Sequence`): `|`
- Replicated stage (= replicated sequence = one thread): `;`
- For one thread, the list of pinned `hwloc` objects (= logical or): `,`
-Then, the pinning can contains all the available `hwloc` objects. Below is
-the correspondence between the `std::string` and the `hwloc` objects type
-enumerate:
+The pinning policy can then contain any of the available `hwloc` objects. Below
+is the correspondence between the `std::string` and the `hwloc` object types:
```cpp
-static std::map object_map =
-{ /* global containers */ /* data caches */ /* instruction caches */
+std::map<std::string, hwloc_obj_type_t> str_to_hwloc_obj =
+{
+ /* global containers */ /* data caches */ /* instruction caches */
{ "GROUP", HWLOC_OBJ_GROUP }, { "L5D", HWLOC_OBJ_L5CACHE }, { "L3I", HWLOC_OBJ_L3ICACHE },
{ "NUMA", HWLOC_OBJ_NUMANODE }, { "L4D", HWLOC_OBJ_L4CACHE }, { "L2I", HWLOC_OBJ_L2ICACHE },
{ "PACKAGE", HWLOC_OBJ_PACKAGE }, { "L3D", HWLOC_OBJ_L3CACHE }, { "L1I", HWLOC_OBJ_L1ICACHE },
@@ -91,26 +92,24 @@ static std::map object_map =
};
```
-The following syntax is used to specify the object index `X`: `OBJECT_X`.
-
-`OBJECT` can be all the `std::string` defined in the previous listing
-(ex: `PU_10` refers to the logical process unit n°10).
+To specify the index `X` of an `hwloc` object, the following syntax is used:
+`OBJECT_X` (e.g. `PU_5` refers to the logical PU number 5).
!!! info
- `CORE` and `PU` objects can be confusing. If the CPU cores does not support
+ `CORE` and `PU` objects can be confusing. If the CPU cores do not support
SMT, then `CORE` and `PU` are the same. However, if the CPU cores support
SMT, then the `PU` is the hardware thread identifier inside a given `CORE`.
### Illustrative Examples
-The section proposes some examples to understand how the syntax works. Only the
-simplest `hwloc` object is used: the `PU`. Let's suppose that we have a
-octo-core CPU with 8 process units (`PU_0, PU_1, PU_2, PU_3, PU_4, PU_5, PU_6,
-PU_7`), see the topology of the Orange Pi 5 Plus above).
+This section gives some examples to illustrate how the syntax works. We
+suppose that we have a CPU with 8 PUs and the same topology as the Orange
+Pi 5 Plus SBC presented before.
#### Example 1
-We want to describe a 3 stages pipeline with:
+Let's suppose we want to set up a 3-stage pipeline with the following
+characteristics:
- **Stage 1** - No replication (= 1 thread):
- Pinned to `PU_0`
@@ -136,15 +135,18 @@ S2T4(Stage 2, thread 4 - pin: PU_6 or PU_7)-->SYNC2;
SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3);
```
-The input parameters will be:
+In the previous configuration, 6 threads will execute simultaneously (even
+though the given architecture supports up to 8 threads running in parallel).
+
+To instantiate this `runtime::Pipeline`, here are the corresponding constructor
+parameters:
- Number of replications (= threads) per stage: `{ 1, 4, 1 }`
-- Enabling pinning: `{ true, true, true }`
+- Enabling pinning per stage: `{ true, true, true }`
- Pinning policy:
`"PU_0 | PU_4, PU_5; PU_4, PU_5; PU_6, PU_7; PU_6, PU_7 | PU_0, PU_1, PU_2, PU_3"`
-The previous pinning policy syntax can be compressed a little bit. It is
-possible to use the following equivalent `std::string`:
+The previous pinning policy can be written more compactly as follows:
- Pinning policy :
`"PU_0 | PACKAGE_1; PACKAGE_1; PACKAGE_2; PACKAGE_2 | PACKAGE_0"`
@@ -153,7 +155,7 @@ possible to use the following equivalent `std::string`:
Let's now consider that we want to pin all the threads of the stage 2 on the
`PU_4`, `PU_5`, `PU_6` or `PU_7` (this is less restrictive than the previous
-example). The pinning strategy for stage 1 and 3 is the same as before.
+example). The pinning strategy for stages 1 and 3 is unchanged.
```mermaid
graph LR;
@@ -169,6 +171,10 @@ S2T4(Stage 2, thread 4 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3);
```
+Here are the corresponding parameters:
+
+- Number of replications (= threads) per stage: `{ 1, 4, 1 }`
+- Enabling pinning per stage: `{ true, true, true }`
- Pinning policy : `"PU_0 | PACKAGE_1, PACKAGE_2 | PACKAGE_0"`
With the previous syntax, the 4 threads of the stage 2 will apply the
@@ -176,12 +182,12 @@ With the previous syntax, the 4 threads of the stage 2 will apply the
#### Example 3
-It is also possible to choose the stages we want to pin using a vector of
-`boolean`. For instance, if we don't want to pin the first stage, we can do:
+It is also possible to choose which stages are pinned using a vector of
+booleans. Let's suppose we do not want to specify any pinning for stage 1.
```mermaid
graph LR;
-S1T1(Stage 1, thread 1 - no pin)-->SYNC1;
+S1T1(Stage 1, thread 1 - no pinning)-->SYNC1;
SYNC1(Sync)-->S2T1;
SYNC1(Sync)-->S2T2;
SYNC1(Sync)-->S2T3;
@@ -193,11 +199,13 @@ S2T4(Stage 2, thread 4 - pin: PU_4, PU_5, PU_6 or PU_7)-->SYNC2;
SYNC2(Sync)-->S3T1(Stage 3, thread 1 - pin: PU_0, PU_1, PU_2 or PU_3);
```
-- Enabling pinning: `{false, true, true}`
+Here are the corresponding parameters:
+
+- Number of replications (= threads) per stage: `{ 1, 4, 1 }`
+- Enabling pinning per stage: `{ false, true, true }`
- Pinning policy: `"| PACKAGE_1, PACKAGE_2 | PACKAGE_0"`
-Thus, the operating system will be in charge of pinning the thread of the first
-stage.
+In this case, the OS is free to schedule the thread of the first stage on any PU.
### Unpin