-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Threading Model #388
base: main
Are you sure you want to change the base?
Threading Model #388
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,150 @@ | ||
Use from Multiple Threads | ||
######################### | ||
|
||
This document describes how various Kokkos functionality can safely be called when using more than one thread of host execution. | ||
It also describes intentions for what Kokkos aspires to implement in support of known or anticipated use cases, with clear indications that they are not currently supported. | ||
Some discussion of implementation characteristics and performance expectations is included as well. | ||
|
||
We will use terms as defined by the C++ standard memory model. In that model, executions that include a data race have undefined behavior. | ||
Thus, use of Kokkos in ways that don't conform to the safety guidance given here may be considered to result in undefined behavior. | ||
|
||
In formal terms, we address two distinct issues here. | ||
*Concurrency* is the matter of what operations may safely execute in different threads. | ||
*Linearizability* is the matter of how the results of concurrently-executed operations may map to potential observed orders in a hypothetical non-concurrent execution. | ||
|
||
Core | ||
==== | ||
|
||
Fundamental Operations | ||
---------------------- | ||
|
||
For purposes of multi-threaded execution, all free and member functions provided by Kokkos can be thought of as performing a sequence of one or more of the following *fundamental operations* | ||
|
||
|
||
* *Initialization* | ||
|
||
.. Not just Kokkos::init, but also whatever device-specific or thread-specific stuff we have Legion doing now | ||
|
||
* *Finalization* | ||
|
||
.. Ditto Initialization | ||
|
||
* *Memory Allocation* | ||
|
||
|
||
* *Memory Deallocation* | ||
|
||
|
||
* *Parallel Dispatch* | ||
``parallel_for``, ``parallel_reduce``, ``parallel_scan``, each of which occurs on a particular *Execution Space Instance*. If none is specified, Kokkos internally uses a default *Execution Space Instance*. | ||
|
||
* *Data Movement* | ||
``deep_copy`` which occurs on a particular *Execution Space Instance*. | ||
.. The 'two-argument' overload of the ``deep_copy()`` function performs a *Global Synchronization*, a ``deep_copy`` operation on an internal *Execution Space Instance*, and another *Global Synchronization* operation. | ||
|
||
* *Global Synchronization* | ||
|
||
.. Like what ``Kokkos::fence()`` does | ||
|
||
* *Local Synchronization* | ||
|
||
.. Like what we get from ``execution_space_instance.fence()``, hypothetical ``is_running`` that returns ``false`` | ||
|
||
* *Data Access* | ||
``View::operator()``, to memory that is accessible from the host. | ||
|
||
Comment on lines
+53
to
+55
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not quite sure if we want to promise anything about data access outside of kernels. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we have to, or else we can't suitably address either usage of unmanaged views, or UVM |
||
|
||
|
||
Fundamental Object Instances | ||
---------------------------- | ||
|
||
The fundamental objects of Kokkos programming, such as ``View`` and instances of execution space types, have general semantics similar to a reference-counted pointer. | ||
These objects provide no surface-level concurrency control of operations like assignment. Thus, client code is responsible for serializing any such operations between threads, such that they establish a *happens-before* relationship between writes and any other operations (e.g. by locking). | ||
The descriptions that follow refer to operations performed on the underlying object that is pointed-to by the C++ object that client code uses. | ||
|
||
Basic *Concurrency* | ||
------------------- | ||
|
||
A program that is single-threaded outside its use of any multi-threaded Kokkos backend can execute code that performs any otherwise-valid sequence of *Fundamental Operations*. (Note: This is analogous to ``MPI_THREAD_SINGLE``) | ||
|
||
A program that performs *Initialization* in a particular thread, and only executes code that performs any subsequent *Fundamental Operations* on that thread, is equivalent to a single-threaded program for these purposes. (Note: This is analogous to ``MPI_THREAD_FUNNELED``) | ||
|
||
A multi-threaded program structured such that there is a *happens-before* relationship between each call to perform a *Fundamental Operation* will behave equivalently to a single-threaded program that performs the same sequence of *Fundamental Operations*. (Note: This is analogous to ``MPI_THREAD_SERIALIZED``) | ||
|
||
.. Do we actually want to guarantee that every Fundamental Operation is serializing? Should that just mean that we don't require call sites to have *happens-before* relationships, or should they also internally create such *happens-before* relationships? I.e. that the calling threads *synchronize-with* each other at those points? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's a key question. My understanding is that we want to serialize parallel dispatch to the same execution space instance but I don't think we want to promise anything with respect to data access outside of kernels. |
||
|
||
|
||
Execution Space Instance | ||
------------------------ | ||
|
||
An *Execution Space Instance* denotes a thread-like entity on which *Fundamental Operations* can be performed. Those *Fundamental Operations* have a chain of *sequenced-before* relationships corresponding to the order in which host code performs them. | ||
|
||
.. Assuming we guarantee internal serialization, the following would apply | ||
.. - If distinct host threads perform *Fundamental Operations* on a common *Execution Space Instance* without a *happens-before* relationship between the calls, then their sequence in such a chain is indeterminate. | ||
.. Otherwise, it would be undefined behavior | ||
|
||
*Local Synchronization* creates a *happens-before* relationship between the completion of every *Fundamental Operation* on the specified *Execution Space Instance* that *happens-before* the *Local Synchronization* and the thread that performs the *Local Synchronization*. | ||
|
||
*Global Synchronization* creates a *happens-before* relationship between the completion of every *Fundamental Operation* on any *Execution Space Instance* that *happens-before* the *Global Synchronization* and the thread that performs the *Global Synchronization*. | ||
|
||
.. Should the above actually be *synchronizes-with*? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there really much of a difference when we talk about fence? |
||
|
||
A multi-threaded program may concurrently perform *Fundamental Operations* on distinct *Execution Space Instances*. The order in which these operations execute is indeterminate. | ||
|
||
``View`` | ||
-------- | ||
|
||
* Managed Construction | ||
Managed construction of a Kokkos View performs a *Memory Allocation*, potentially followed by a *Parallel Dispatch* to initialize the memory (depending on whether ``WithoutInitializing`` was passed), potentially followed by a *Synchronization* (if no execution space instance was passed, so that allocation and initialization *happen-before* any subsequent operation that may reference the ``View``'s memory'). | ||
.. Do we want that to be *Global Synchronization* or *Local Synchronization*? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We effectively do a device-wide (or at least execution space-wide) synchronization at the moment, see https://github.com/kokkos/kokkos/blob/5d81422daea73f5a2a69771cc0dfafc19f785003/core/src/Cuda/Kokkos_CudaSpace.cpp#L160-L205. The intent is to make sure that memory can't be accessed before allocation is complete and thus it should be (IMHO) enough to fence the active execution space instance on the current thread. |
||
(Note that particular backends may internally perform additional synchronization operations) | ||
* Unmanaged Construction | ||
Unmanaged construction performs no fundamental operations. | ||
* Destruction | ||
Destruction occurs when the reference count of the underlying object to which a ``View`` refers falls to zero. Destruction performs a *Global Synchronization*, followed by a *Memory Deallocation*. | ||
The *Global Synchronization* is performed to ensure that any preceding operations that may reference the memory owned by the underlying object *happen-before* that memory is released. | ||
Any *Fundamental Operation* which accesses the ``View`` instance performed in a way that it does not have a *happens-before* relationship with Destruction results in undefined behavior. | ||
(Note this could occur via improper concurrent use of a ``View`` variable shared between threads, one of which passes it to an operation, and another which reassigns it to reference a different instance) | ||
* Metadata Query | ||
* Element Access | ||
Element Access performs a Data Access operation. | ||
Comment on lines
+108
to
+110
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not quite sure if we need these. |
||
|
||
|
||
Backend-Specific Details | ||
------------------------ | ||
|
||
.. Local or Global synchronizations below? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It might be enough to group backends into synchronous and asynchronous backends clarifying that kernels submitted by multiple kernels are serialized (if we decide to make that promise). |
||
|
||
* ``Serial`` | ||
This backend is *synchronous* - there is implicitly a *Synchronization* after any *Memory Allocation*, *Memory Deallocation*, *Parallel Dispatch*, or *Data Movement* operation. | ||
|
||
* ``OpenMP`` | ||
This backend is *synchronous* - there is implicitly a *Synchronization* after any *Memory Allocation*, *Memory Deallocation*, *Parallel Dispatch*, or *Data Movement* operation. | ||
|
||
* ``Threads`` | ||
This backend is *synchronous* - there is implicitly a *Synchronization* after any *Memory Allocation*, *Memory Deallocation*, *Parallel Dispatch*, or *Data Movement* operation. | ||
|
||
* ``CUDA`` and ``HIP`` | ||
|
||
* ``HPX`` | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should talk more about parallel dispatch and the behavior of independent threads (without a happens-before relationship between them) accessing the same data. |
||
|
||
Performing Fundamental Operations on Device Execution Space Instances Within Operations on Host Execution Space Instances | ||
------------------------------------------------------------------------------------------------------------------------------------------ | ||
|
||
.. ??? | ||
|
||
|
||
Containers | ||
========== | ||
|
||
|
||
|
||
|
||
Algorithms | ||
========== | ||
|
||
|
||
|
||
Kokkos Kernels | ||
============== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backends can still only be initialized or finalized once. I'm not quite sure if it's worth mentioning initialization/finalization then. At the very least, we need to clarify what we mean here (execution space instance initialization/finalization maybe sensible).