From c1cb81668df00081824e732c438e520f55aabdc3 Mon Sep 17 00:00:00 2001
From: Adrian Sampson
Date: Fri, 5 Apr 2024 15:25:58 -0400
Subject: [PATCH] TOC

---
 OpenTOC/exhet24.html | 97 ++++++++++++++++++++++++++++++++++++++++++++
 _data/OpenTOC.yaml   |  4 ++
 2 files changed, 101 insertions(+)
 create mode 100644 OpenTOC/exhet24.html

diff --git a/OpenTOC/exhet24.html b/OpenTOC/exhet24.html
new file mode 100644
index 0000000..4af84a2
--- /dev/null
+++ b/OpenTOC/exhet24.html
@@ -0,0 +1,97 @@
+ExHET '24: Proceedings of the 3rd International Workshop on Extreme Heterogeneity Solutions

ExHET '24: Proceedings of the 3rd International Workshop on Extreme Heterogeneity Solutions

Full Citation in the ACM Digital Library

SESSION: Publications

GPU-Initiated Resource Allocation for Irregular Workloads

  • Ilyas Turimbetov
  • Muhammad Aditya Sasongko
  • Didem Unat

GPU kernels may suffer from resource underutilization in multi-GPU systems due to insufficient workload to saturate devices when incorporated within an irregular application. To better utilize the resources in multi-GPU systems, we propose a GPU-sided resource allocation method that can increase or decrease the number of GPUs in use as the workload changes over time. Our method employs GPU-to-CPU callbacks to allow GPU device(s) to request additional devices while the kernel execution is in flight. We implemented and tested multiple callback methods required for GPU-initiated workload offloading to other devices and measured their overheads on Nvidia and AMD platforms. To showcase the usage of callbacks in irregular applications, we implemented Breadth-First Search (BFS) that uses device-initiated workload offloading. Apart from allowing dynamic device allocation in persistently running kernels, it reduces time to solution on average by 15.7% at the cost of callback overheads with a minimum of 6.50 microseconds on AMD and 4.83 microseconds on Nvidia, depending on the chosen callback mechanism. Moreover, the proposed model can reduce the total device usage by up to 35%, which is associated with higher energy efficiency.

Enhancing Intra-Node GPU-to-GPU Performance in MPI+UCX through Multi-Path Communication

  • Amirhossein Sojoodi
  • Yiltan H. Temucin
  • Ahmad Afsahi

Efficient communication among GPUs is crucial for achieving high performance in modern GPU-accelerated applications. This paper introduces a multi-path communication framework within the MPI+UCX library to enhance P2P communication performance between intra-node GPUs, by concurrently leveraging multiple paths, including available NVLinks and PCIe through the host. Through extensive experiments, we demonstrate significant performance gains achieved by our approach, surpassing baseline P2P communication methods. More specifically, in a 4-GPU node, multi-path P2P improves UCX Put bandwidth by up to 2.85x when utilizing the host path and 2 other GPU paths. Furthermore, we demonstrate the effectiveness of our approach in accelerating the Jacobi iterative solver, achieving up to 1.27x runtime speedup.

Preparing for Future Heterogeneous Systems Using Migrating Threads

  • Peter Michael Kogge
  • Jayden Vap
  • Derek Pepple

Heterogeneity in computing systems is clearly increasing, especially as “accelerators” burrow deeper and deeper into different parts of an architecture. What is new, however, is a rapid change in not only the number of such heterogeneous processors, but in their connectivity to other structures, such as cores with different ISAs or smart memory interfaces. Technologies such as chiplets are accelerating this trend. This paper is focused on the problem of how to architect efficient systems that combine multiple heterogeneous concurrent threads, especially when the underlying heterogeneous cores are separated by networks or have no shared-memory access paths. The goal is to eliminate today’s need to invoke significant software stacks to cross any of these boundaries. A suggestion is made of using migrating threads as the glue. Two experiments are described: using a heterogeneous platform where all threads share the same memory to solve a rich ML problem, and a fast PageRank approximation that mirrors the kind of computation for which thread migration may be useful. Architectural “lessons learned” are developed that should help guide future development of such systems.

\ No newline at end of file
diff --git a/_data/OpenTOC.yaml b/_data/OpenTOC.yaml
index 571b2cc..6817ba5 100644
--- a/_data/OpenTOC.yaml
+++ b/_data/OpenTOC.yaml
@@ -1373,3 +1373,7 @@
   event: PMAM
   year: 2024
   title: "Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores"
+-
+  event: ExHET
+  year: 2024
+  title: "Proceedings of the 3rd International Workshop on Extreme Heterogeneity Solutions"