From 031f149e98cc94aa3597eda6b3a313faaee65017 Mon Sep 17 00:00:00 2001
From: Adrian Sampson
Date: Wed, 21 Feb 2024 20:03:33 -0500
Subject: [PATCH] TOC
---
 OpenTOC/cc24.html    | 224 +++++++++++++++++++++++++++++
 OpenTOC/ppopp24.html | 330 +++++++++++++++++++++++++++++++++++++++++++
 _data/OpenTOC.yaml   |   8 ++
 3 files changed, 562 insertions(+)
 create mode 100644 OpenTOC/cc24.html
 create mode 100644 OpenTOC/ppopp24.html

diff --git a/OpenTOC/cc24.html b/OpenTOC/cc24.html
new file mode 100644
index 0000000..1304739
--- /dev/null
+++ b/OpenTOC/cc24.html
@@ -0,0 +1,224 @@

CC 2024: Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction

Full Citation in the ACM Digital Library

SESSION: Code Generation and Synthesis


Fast Template-Based Code Generation for MLIR

  • Florian Drescher
  • Alexis Engelke

Fast compilation is essential for JIT-compilation use cases like dynamic languages or databases, as well as for development productivity when compiling static languages. Template-based compilation enables fast compilation times, but in existing approaches, templates are generally handwritten, limiting flexibility and causing substantial engineering effort.

In this paper, we introduce an approach based on MLIR that derives code templates for the instructions of any dialect automatically ahead-of-time. Template generation re-uses the existing compilation path present in the MLIR lowering of the instructions and thereby inherently supports code generation from different abstraction levels in a single step.

Our results on compiling database queries and standard C programs show a compile-time improvement of 10–30x compared to LLVM -O0 with only moderate run-time slowdowns of 1–3x, resulting in an overall improvement of 2x in a JIT-compilation-based database setting.


A Unified Memory Dependency Framework for Speculative High-Level Synthesis

  • Jean-Michel Gorius
  • Simon Rokicki
  • Steven Derrien

Heterogeneous hardware platforms that leverage application-specific hardware accelerators are becoming increasingly popular as the demand for high-performance, compute-intensive applications rises. The design of such high-performance hardware accelerators is a complex task. High-Level Synthesis (HLS) promises to ease this process by synthesizing hardware from a high-level algorithmic description. Recent works have demonstrated that speculative execution can be inferred from such high-level descriptions by leveraging compiler transformation and analysis techniques in HLS flows. However, existing work on speculative HLS lacks support for the intricate memory interactions in data-processing applications. In this paper, we introduce a unified memory speculation framework, which allows aggressive scheduling and high-throughput accelerator synthesis in the presence of complex memory dependencies. We show that our technique can generate high-throughput designs for various applications and describe a complete implementation inside an existing speculative HLS toolchain.


SESSION: Static and Dynamic Analysis


If-Convert as Early as You Must

  • Dorit Nuzman
  • Ayal Zaks
  • Ziv Ben-Zion

Optimizing compilers employ a rich set of transformations that generate highly efficient code for a variety of source languages and target architectures. These transformations typically operate on general control flow constructs which trigger a range of optimization opportunities, such as moving code to less frequently executed paths, and more. Regular loop nests are specifically relevant for accelerating certain domains, leveraging architectural features including vector instructions, hardware-controlled loops and data flows, provided their internal control-flow is eliminated. Compilers typically apply predicating if-conversion late, in their backend, to remove control-flow undesired by the target. Until then, transformations triggered by control-flow constructs that are destined to be removed may end up doing more harm than good.

We present an approach that leverages the existing powerful and general optimization flow of LLVM when compiling for targets without control-flow in loops. Rather than trying to teach various transformations how to avoid misoptimizing for such targets, we propose to introduce an aggressive if-conversion pass as early as possible, along with carefully addressing pass-ordering implications. This solution outperforms the traditional compilation flow with only a modest tuning effort, thereby offering a robust and promising compilation approach for branch-restricted targets.
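To picture the transformation at the heart of this approach (our sketch, not code from the paper; function names are hypothetical), predicating if-conversion replaces in-loop control flow with straight-line, predicated code:

// Before: control flow inside the loop, which branch-restricted targets
// (vector units, hardware-controlled loops) cannot execute efficiently.
void scale_before(float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i) {
        if (b[i] > 0.0f)
            a[i] = b[i] * 2.0f;
    }
}

// After if-conversion: the branch becomes a predicate feeding a select,
// leaving a single straight-line loop body.
void scale_after(float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i) {
        bool p = b[i] > 0.0f;              // predicate
        a[i] = p ? b[i] * 2.0f : a[i];     // select instead of branch
    }
}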


Paguroidea: Fused Parser Generator with Transparent Semantic Actions

  • Yifan Zhu
  • Quartic Cat
  • Boluo Ge
  • Shaotong Sun

Parser generators have long been a savior for programmers, liberating them from the daunting task of crafting correct and maintainable parsers. Yet, this much-needed simplicity often comes at the expense of efficiency.


We present Paguroidea, a parser generator that harnesses the power of lexer-parser fusion techniques to create parsers that boast user-friendly grammar definitions while delivering performance that rivals specialized parsers. Building upon the foundations of the flap parser, our work introduces a series of extensions.


One of our key contributions is a novel approach to the normalization method. By encoding reduction actions directly into the Deterministic Greibach Normal Form (DGNF), we provide parser generators with flexibility in manipulating semantic actions. This unique approach empowers developers with the freedom to customize their parser generators to their specific needs while maintaining semantic correctness.


Furthermore, we formulate the execution of the parser in substructural logic, providing an elegant way to prove the correctness of the amended normalization procedure. In this exposition, we offer a glimpse into efficacious, user-friendly, and correctness-provable parser generation.


Region-Based Data Layout via Data Reuse Analysis

  • Caio Salvador Rohwedder
  • João P. L. De Carvalho
  • José Nelson Amaral

Data-structure splicing techniques, such as structure splitting, field reordering, and pointer inlining reorganize data structures to improve cache and translation look-aside buffer (TLB) utilization. Structure types are typically transformed globally in the program, requiring updates to all references to elements of a transformed type. These techniques often rely on instrumentation, tracing, or sampling to create models that guide their transformations. Furthermore, compilers often cannot prove that their transformations are legal and must rely on manual inspection and manual transformation. Applying data-layout transformations locally -- as opposed to globally -- to regions of code removes the need for expensive profiling and simplifies legality verification. This work introduces RebaseDL, a static analysis that finds profitable and legal region-based data layout transformation opportunities that improve access locality. These opportunities are found within code regions that exhibit data reuse. Going beyond structure splicing, RebaseDL also identifies transformation opportunities that do not involve structure types, that is, it identifies data packing transformations. The analysis is implemented in LLVM and it detects multiple transformation opportunities within the SPEC CPU benchmark suite, where the transformation obtains speedups of up to 1.34x for transformed regions.
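As a hedged illustration of the kind of region-local transformation the paper targets (our example; the type and function names are invented, not RebaseDL output), hot fields can be packed into dense arrays just before a reuse-heavy region, leaving the global structure type untouched:

#include <vector>
#include <cstddef>

// Hot fields of the hypothetical Particle type are packed locally, so the
// reuse-heavy loops below stream over dense arrays instead of wide structs.
struct Particle { double x, y, z, mass; char tag[40]; };

double weighted_x_sum(const std::vector<Particle>& ps, int steps) {
    std::vector<double> x(ps.size()), m(ps.size());
    for (std::size_t i = 0; i < ps.size(); ++i) {   // one-time local packing
        x[i] = ps[i].x;
        m[i] = ps[i].mass;
    }
    double sum = 0.0;
    for (int s = 0; s < steps; ++s)                 // reuse across steps amortizes packing
        for (std::size_t i = 0; i < x.size(); ++i)
            sum += x[i] * m[i];                     // dense, cache-friendly accesses
    return sum;
}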


A Context-Sensitive Pointer Analysis Framework for Rust and Its Application to Call Graph Construction

  • Wei Li
  • Dongjie He
  • Yujiang Gui
  • Wenguang Chen
  • Jingling Xue

Existing program analysis tools for Rust lack the ability to effectively detect security vulnerabilities due to the absence of an accurate call graph and precise points-to information. We present Rupta, the first context-sensitive pointer analysis framework designed for Rust, with a particular focus on its role in constructing call graphs. Operating on Rust MIR, Rupta employs callsite-based context-sensitivity and on-the-fly call graph construction to address a range of pointer analysis challenges, including method/function calls, pointer casts, and nested structs, while preserving type information.


Our assessment of Rupta against two state-of-the-art call graph construction techniques, Rurta (Rapid Type Analysis-based) and Ruscg (static dispatch-only), across 13 real-world Rust programs demonstrates its high efficiency and precision. In particular, our results reveal that Rupta surpasses Ruscg in soundness by discovering 29% more call graph edges and outperforms Rurta in precision by eliminating approximately 70% of spurious dynamic call edges. Consequently, Rupta has the potential to enhance existing security analysis tools, enabling them to identify a greater number of security vulnerabilities in Rust programs.


CoSense: Compiler Optimizations using Sensor Technical Specifications

  • Pei Mu
  • Nikolaos Mavrogeorgis
  • Christos Vasiladiotis
  • Vasileios Tsoutsouras
  • Orestis Kaparounakis
  • Phillip Stanley-Marbell
  • Antonio Barbalace

Embedded systems are ubiquitous, but maximizing their lifetime on batteries demands faster code execution – i.e., higher energy efficiency – and reduced memory usage. The large number of sensors integrated into embedded systems gives us the opportunity to exploit sensors’ technical specifications, like a sensor’s value range, to guide compiler optimizations for faster code execution, smaller binaries, etc. We design and implement such an idea in COSENSE, a novel compiler (extension) based on the LLVM infrastructure, using an existing domain-specific language (DSL), NEWTON, to describe the bounds of and relations between physical quantities measured by sensors. COSENSE utilizes previously unexploited physical information correlated to program variables to drive code optimizations. COSENSE computes value ranges of variables and uses them to overload functions, compress variable types, substitute code with constants, and simplify condition statements. We evaluated COSENSE using several microbenchmarks and two real-world applications on various platforms and CPUs. For microbenchmarks, COSENSE achieves 1.18× geomean speedup in execution time and 12.35% average reduction in binary code size with 4.66% compilation time overhead on x86, and 1.23× geomean speedup in execution time and 10.95% average reduction in binary code size with 5.67% compilation time overhead on ARM. For real-world applications, COSENSE achieves 1.70× and 1.50× speedup in execution time, 12.96% and 0.60% binary code reduction, and 9.69% and 30.43% lower energy consumption, with 26.58% and 24.01% compilation time overhead, respectively.
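For a flavor of what such range-driven optimization can justify, here is a minimal sketch under assumed sensor bounds (read_temperature and the [-40, 125] range are our inventions for illustration, not from the paper):

#include <cstdint>

extern int32_t read_temperature();  // hypothetical sensor read; assume a
                                    // Newton-style spec bounds it to [-40, 125]

int16_t sample_scaled() {
    int32_t celsius = read_temperature();  // known range: [-40, 125]
    if (celsius > 200)                     // provably false under the spec,
        return 0;                          // so the branch can be deleted
    int32_t scaled = celsius * 10;         // range [-400, 1250] fits in 16 bits,
    return static_cast<int16_t>(scaled);   // so the type can be compressed
}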


SESSION: Runtime Techniques


UNIFICO: Thread Migration in Heterogeneous-ISA CPUs without State Transformation

  • Nikolaos Mavrogeorgis
  • Christos Vasiladiotis
  • Pei Mu
  • Amir Khordadi
  • Björn Franke
  • Antonio Barbalace

Heterogeneous-ISA processor designs have attracted considerable research interest. However, unlike their homogeneous-ISA counterparts, they require explicit software support for bridging ISA heterogeneity. The lack of a compilation toolchain ready to support heterogeneous-ISA targets has been a major factor hindering research in this exciting emerging area. For any such compiler, getting the mechanics of state transformation upon migration right, and doing so efficiently, is of critical importance. In particular, any runtime conversion of the current program stack from one architecture to another would be prohibitively expensive. In this paper, we design and develop Unifico, a new multi-ISA compiler that generates binaries that maintain the same stack layout during their execution on either architecture. Unifico avoids the need for runtime stack transformation, thus eliminating overheads associated with ISA migration. Additional responsibilities of the Unifico compiler backend include maintenance of a uniform ABI and virtual address space across ISAs. Unifico is implemented using the LLVM compiler infrastructure, and we are currently targeting the x86-64 and ARMv8 ISAs. We have evaluated Unifico across a range of compute-intensive NAS benchmarks and show its minimal impact on overall execution time, where less than 6% overhead is introduced on average. When compared against the state-of-the-art Popcorn compiler, Unifico reduces binary size overhead from ∼200% to ∼10%, whilst eliminating the stack transformation overhead during ISA migration.


BLQ: Light-Weight Locality-Aware Runtime for Blocking-Less Queuing

  • Qinzhe Wu
  • Ruihao Li
  • Jonathan Beard
  • Lizy John

Message queues are used widely in parallel processing systems for worker thread synchronization. When there is a throughput mismatch between the upstream and downstream tasks, the message queue buffer will often sit either empty or full. Polling on an empty or full queue affects the performance of upstream or downstream threads, since such polling cycles could have been spent on other computation. Non-blocking queues are an alternative that allows polling cycles to be spared for other tasks at the application’s choice. However, application programmers should not have to bear this burden, because a good decision about what to do upon blocking must take a great deal of runtime environment information into consideration.


This paper proposes Blocking-Less Queuing Runtime (BLQ), a systematic solution capable of finding the proper strategies at (or before) blocking, as well as lightening the programmers’ burden. BLQ collects a set of solutions, including yielding, advanced dynamic queue buffer resizing, and resource-aware task scheduling. Evaluation on high-end servers shows that BLQ reduces blocking and lowers cache misses across a set of diverse parallel queuing workloads. BLQ outperforms the baseline runtime considerably (with up to 3.8× peak speedup).


SESSION: Debugging, Profiling, and Parallelism


APPy: Annotated Parallelism for Python on GPUs

  • Tong Zhou
  • Jun Shirako
  • Vivek Sarkar

GPUs are increasingly being used to speed up Python applications in the scientific computing and machine learning domains. Currently, the two common approaches to leveraging GPU acceleration in Python are to 1) create a custom native GPU kernel and import it as a function that can be called from Python; or 2) use libraries such as CuPy, which provides pre-defined GPU-implementation-backed tensor operators. The first approach is very flexible but requires tremendous manual effort to create a correct and high-performance GPU kernel. While the second approach dramatically improves productivity, it is limited in its generality, as many applications cannot be expressed purely using CuPy’s pre-defined tensor operators. Additionally, redundant memory access can often occur between adjacent tensor operators due to the materialization of intermediate results. In this work, we present APPy (Annotated Parallelism for Python), which enables users to parallelize generic Python loops and tensor expressions for execution on GPUs by adding simple compiler directives (annotations) to Python code. Empirical evaluation on 20 scientific computing kernels from the literature on a server with an AMD Ryzen 7 5800X 8-Core CPU and an NVIDIA RTX 3090 GPU demonstrates that with simple pragmas APPy is able to generate more efficient GPU code and achieves significant geometric mean speedup relative to CuPy (30× on average), and to three state-of-the-art Python compilers, Numba (8.3× on average), DaCe-GPU (3.1× on average) and JAX-GPU (18.8× on average).


Accurate Coverage Metrics for Compiler-Generated Debugging Information

  • J. Ryan Stinnett
  • Stephen Kell

Many debugging tools rely on compiler-produced metadata to present a source-language view of program states, such as variable values and source line numbers. While this tends to work for unoptimised programs, current compilers often generate only partial debugging information in optimised programs.

Current approaches for measuring the extent of coverage of local variables are based on crude assumptions (for example, assuming variables could cover their whole parent scope) and are not comparable from one compilation to another. In this work, we propose some new metrics, computable by our tools, which could serve as motivation for language implementations to improve debugging quality.


FlowProf: Profiling Multi-threaded Programs using Information-Flow

  • Ahamed Al Nahian
  • Brian Demsky

Amdahl's law implies that even small sequential bottlenecks can seriously limit the scalability of multi-threaded programs. To achieve scalability, developers must painstakingly identify sequential bottlenecks in their program and eliminate these bottlenecks by either changing synchronization strategies or rearchitecting and rewriting any code with sequential bottlenecks. This can require significant effort by the developer to find and understand how to fix sequential bottlenecks. To address the issue, we bring a new tool, information flow, to the problem of understanding sequential bottlenecks. Information flow can help developers understand whether a bottleneck is fundamental to the computation, or merely an artifact of the implementation.


First, our strategy tracks memory access conflicts to find over-synchronized applications where redesigning the synchronization strategy on the existing implementation can improve performance. Then, information flow analysis finds optimization opportunities where changing the existing implementation can improve the performance of applications that have bottlenecks due to unnecessary memory access conflicts. We implemented this approach in FlowProf. We have evaluated FlowProf on a set of multi-threaded Java applications where the generated optimization insights achieve performance gains of up to 58%.


Reducing the Overhead of Exact Profiling by Reusing Affine Variables

  • Leon Frenot
  • Fernando Magno Quintão Pereira

An exact profiler inserts counters in a program to record how many times each edge of that program's control-flow graph has been traversed during an execution. It is common practice to instrument only edges in the complement of a minimum spanning tree of the program's control-flow graph, following the algorithm proposed by Knuth and Stevenson in 1973. Yet, even with this optimization, the overhead of exact profiling is high. As a consequence, mainstream profile-guided code optimizers resort to sampling-based, i.e., approximate, profiling instead of exact frequency counts. This paper introduces a technique to reduce the overhead of exact profiling. We show that it is possible to use the values of variables incremented by constant steps within loops---henceforth called SESE counters---as a replacement for some profiling counters. Such affine variables are common, for they include the induction variables of typical loops. This technique, although simple, is effective. We have implemented it in the LLVM compilation infrastructure. Standard Knuth-Stevenson instrumentation increases the running time of the 135 programs in the LLVM test suite from 648 to 817 seconds. The optimization suggested in this paper brings this time down to 738 seconds. In the 949 Jotai programs, standard instrumentation increases the number of processed x86 instructions from 2.96 billion to 3.34 billion, whereas the proposed technique causes 3.07 billion instructions to be fetched.
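A minimal sketch of the idea (ours, not the paper's implementation): a loop whose induction variable already advances by a constant step needs no dedicated edge counter, because the count is recoverable from the variable's final value:

extern void work(int);

// Standard Knuth-Stevenson instrumentation: a dedicated counter on the back edge.
long profiled_loop(int n) {
    long edge_count = 0;
    for (int i = 0; i < n; ++i) {
        ++edge_count;   // extra increment executed every iteration
        work(i);
    }
    return edge_count;
}

// Counter elided: i is affine (start 0, step 1), so the back-edge frequency
// can be recovered from its final value as (i - start) / step.
long optimized_loop(int n) {
    int i = 0;
    for (; i < n; ++i)
        work(i);
    return (i - 0) / 1;   // same profile value, no extra increments
}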


Stale Profile Matching

  • Amir Ayupov
  • Maksim Panchenko
  • Sergey Pupyrev

Profile-guided optimizations rely on profile data for directing compilers to generate optimized code. To achieve the maximum performance boost, profile data needs to be collected on the same version of the binary that is being optimized. In practice, however, there is typically a gap between the profile collection and the release, which makes a portion of the profile invalid for optimizations. This phenomenon is known as profile staleness, and it is a serious practical problem for data-center workloads, both for compilers and binary optimizers.

In this paper we thoroughly study the staleness problem and propose the first practical solution for utilizing profiles collected on binaries built from several revisions behind the release. Our algorithm is developed and implemented in a mainstream open-source post-link optimizer, BOLT. An extensive evaluation on a variety of standalone benchmarks and production services indicates that the new method recovers up to 0.8 of the maximum BOLT benefit, even when most of the input profile data is stale and would have been discarded by the optimizer otherwise.


SESSION: Safety and Correctness


From Low-Level Fault Modeling (of a Pipeline Attack) to a Proven Hardening Scheme

  • Sébastien Michelland
  • Christophe Deleuze
  • Laure Gonnord

Fault attacks present unique safety and security challenges that require dedicated countermeasures, even for bug-free programs. Models of these complex attacks are made workable by approximating their effects to a suitable level of abstraction. The common practice of targeting the Instruction Set Architecture (ISA) level isn't ideal because it discards important micro-architectural information, leading to weaker security guarantees. Conversely, including micro-architectural details makes countermeasures harder to model and reason about, creating a new challenge in validating and trusting protections.


We show that a semantic approach to modeling faults makes micro-architectural models workable, and enables precise cooperation between software and hardware in the design of countermeasures. We demonstrate the approach by designing and implementing a compiler/hardware countermeasure, which protects against a state-of-the-art pipeline fetch attack that generalizes multi-fault instruction skips. Crucially, we provide a formal security proof that guarantees faults are detected by the end of every basic block. This result shows that carefully embracing the complexity of low-level systems enables finer, more secure countermeasures.


Clog: A Declarative Language for C Static Code Checkers

  • Alexandru Dura
  • Christoph Reichenbach

We present Clog, a declarative language for describing static code checkers for C. Unlike other extensible state-of-the-art checker frameworks, Clog enables powerful interprocedural checkers without exposing the underlying program representation: Clog checkers consist of Datalog-style recursive rules that access the program under analysis via syntactic pattern matching and control flow edges only. We have implemented Clog on top of Clang, using a custom Datalog evaluation strategy that piggy-backs on Clang's AST matching facilities while working around Clang's limitations to achieve our design goal of representation independence.


Our experiments demonstrate that Clog can concisely express a wide variety of checkers for different security vulnerabilities, with performance that is similar to Clang's own analyses and highly competitive on real-world programs.


SESSION: Compilers and Machine Learning


Compiler-Based Memory Encryption for Machine Learning on Commodity Low-Power Devices

  • Kiwan Maeng
  • Brandon Lucia

Running machine learning (ML) on low-power IoT devices exposes unique security concerns. Attackers can easily steal or manipulate sensitive user data or proprietary ML models from the devices’ off-chip memory by leveraging their simple hardware structure and the lack of memory encryption hardware. To protect against these real-world threats, we propose a lightweight compiler-based memory encryption scheme, Spitz. Spitz achieves full off-chip memory encryption only with common architectural components on commodity devices, such as programmable on-chip SRAM, AES hardware, and Direct-Memory Access (DMA). Our evaluation on real hardware shows that Spitz maintains competitive performance while realizing full off-chip memory encryption. Spitz is only 1.16–1.73× slower than our best-effort non-secure baseline, and is even 1.5–2.23× faster than a popular non-secure vendor library.


YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs

  • Cyrus Zhou
  • Zack Hassman
  • Dhirpal Shah
  • Vaughn Richard
  • Yanjing Li

We address the challenges associated with deploying neural networks on CPUs, with a particular focus on minimizing inference time while maintaining accuracy. Our novel approach is to use the dataflow (i.e., computation order) of a neural network to explore data reuse opportunities using heuristic-guided analysis and a code generation framework, which enables exploration of various Single Instruction, Multiple Data (SIMD) implementations to achieve optimized neural network execution. Our results demonstrate that the dataflow that keeps outputs in SIMD registers while also maximizing both input and weight reuse consistently yields the best performance for a wide variety of inference workloads, achieving up to 3× speedup for 8-bit neural networks and up to 4.8× for binary neural networks over today’s optimized neural network implementations.
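The winning dataflow can be pictured with a small AVX2 micro-kernel (our sketch under simplifying assumptions: a 1-D convolution, n divisible by 8, and input padded with k extra elements; this is not code from the paper):

#include <immintrin.h>

// Outputs stay pinned in a SIMD register across the reduction loop and are
// written back exactly once; each scalar weight is broadcast and reused
// across all eight lanes.
void conv1d_row(const float* in, const float* w, float* out, int n, int k) {
    for (int o = 0; o < n; o += 8) {
        __m256 acc = _mm256_setzero_ps();          // output lives in a register
        for (int j = 0; j < k; ++j) {
            __m256 x  = _mm256_loadu_ps(in + o + j);
            __m256 wj = _mm256_set1_ps(w[j]);      // weight reuse across lanes
            acc = _mm256_fmadd_ps(x, wj, acc);     // fused multiply-accumulate
        }
        _mm256_storeu_ps(out + o, acc);            // single write-back
    }
}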


Fast and Accurate Context-Aware Basic Block Timing Prediction using Transformers

  • Abderaouf Nassim Amalou
  • Elisa Fromont
  • Isabelle Puaut

This paper introduces ORXESTRA, a context-aware execution time prediction model based on Transformers XL, specifically designed to accurately estimate performance in embedded system applications. Unlike traditional machine learning models that often overlook contextual information, resulting in biased predictions for individual isolated basic blocks, ORXESTRA overcomes this limitation by incorporating execution context awareness. By doing so, ORXESTRA effectively accounts for the processor micro-architecture without explicitly modeling micro-architectural elements such as caches, pipelines, and branch predictors. Our evaluations demonstrate ORXESTRA's ability to provide precise timing estimations for different ARM targets (Cortex M4, M7, A53, and A72), surpassing existing machine learning-based approaches in both prediction accuracy and prediction speed.


The Next 700 ML-Enabled Compiler Optimizations

  • S. VenkataKeerthy
  • Siddharth Jain
  • Umesh Kalvakuntla
  • Pranav Sai Gorantla
  • Rajiv Shailesh Chitale
  • Eugene Brevdo
  • Albert Cohen
  • Mircea Trofin
  • Ramakrishna Upadrasta

There is a growing interest in enhancing compiler optimizations with ML models, yet interactions between compilers and ML frameworks remain challenging. Some optimizations require tightly coupled models and compiler internals, raising issues with modularity, performance and framework independence. Practical deployment and transparency for the end-user are also important concerns. We propose ML-Compiler-Bridge to enable ML model development within a traditional Python framework while making end-to-end integration with an optimizing compiler possible and efficient. We evaluate it on both research and production use cases, for training and inference, over several optimization problems, multiple compilers and their versions, and gym infrastructures.


Exponentially Expanding the Phase-Ordering Search Space via Dormant Information

  • Ruobing Han
  • Hyesoon Kim

Applying compilation transformations in optimal sequences can significantly improve program speed and reduce code size. However, finding these optimal sequences—a problem known as the phase-ordering problem—remains a long-standing challenge. Specifically, modern compilers offer hundreds of available transformations, making the search space too large to explore efficiently within a reasonable timeframe. Existing solutions address this problem by grouping transformations into short sequences based on prior knowledge from human experts, and then searching for optimal orders among these sequences. Such pruning methods are aggressive, potentially excluding optimal solutions from the search space. Additionally, they rely on prior knowledge and lack scalability when applied to new transformations.

In this paper, we propose a more conservative pruning approach. The insight of this new approach is to capture the dormant information and utilize it to guide the search process. By excluding dormant transformations, this approach significantly prunes the search space while retaining the optimal solutions. Moreover, it does not rely on any prior human knowledge, making it scalable to new transformations.

To demonstrate the efficacy of the conservative approach, we integrate it with a classical Reinforcement Learning model, which was previously used with aggressive pruning methods. Our solution, named FlexPO, is capable of exploring a search space exponentially larger than those considered in existing solutions. Experimental results show that FlexPO generates programs that are 12% faster or 17.6% smaller than the programs produced by modern compilers.

\ No newline at end of file

diff --git a/OpenTOC/ppopp24.html b/OpenTOC/ppopp24.html
new file mode 100644
index 0000000..251c242
--- /dev/null
+++ b/OpenTOC/ppopp24.html
@@ -0,0 +1,330 @@

PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

Full Citation in the ACM Digital Library

SESSION: Keynote


Sparsity in Deep Neural Nets (Keynote)

  • Nir N Shavit

Our brain executes very sparse computation, allowing for great speed and energy savings. Deep neural networks can also be made to exhibit high levels of sparsity without significant accuracy loss. As their size grows, it is becoming imperative that we use sparsity to improve their efficiency. This is a challenging task because the memory systems and SIMD operations that dominate today’s CPUs and GPUs do not lend themselves easily to the irregular data patterns sparsity introduces. This talk will survey the role of sparsity in neural network computation, and the parallel algorithms and hardware features that nevertheless allow us to make effective use of it.


SESSION: Synchronization and Concurrency Control I


Scaling Up Transactions with Slower Clocks

  • Pedro Ramalhete
  • Andreia Correia

Concurrency controls with optimistic read accesses and pessimistic write accesses are among the fastest in the literature. However, during write transactions these algorithms need to increment an atomic variable, the central clock, limiting parallelism and preventing scalability at high core counts.


In this paper, we propose a new concurrency control, Deferred Clock Transactional Locking (DCTL), which significantly reduces the heartbeat of the central clock, thus increasing scalability. DCTL does not increment the clock for consecutive disjoint transactions. An optimized variant, named DCOTL, allows consecutive transactions with nondisjoint write-accesses to commit without incrementing the clock. Moreover, we show variants of these two algorithms with starvation-free transactions.


Transactions in DCTL are opaque, which means DCTL can be applied to concurrent data structures, Database Management Systems, Software Transactional Memory, and Persistent Transactional Memory. Our experiments show that these DCTL algorithms match or surpass the current state of the art for most workloads. We adapted both algorithms using an existing durability technique and implemented a fully transactional DBMS with disk persistence, whose scalability in write transactions exceeds the current state of the art.
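A rough sketch of the deferred-clock idea on top of a TL2-style commit (our illustration of the stated behavior, not the authors' algorithm):

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> global_clock{0};

// Classic TL2 bumps the central clock on every write commit. The
// deferred-clock idea skips the increment when the committing transaction is
// disjoint from all concurrently running ones, so consecutive disjoint
// transactions share one timestamp and the clock's heartbeat slows down.
uint64_t commit_timestamp(bool disjoint_from_concurrent) {
    if (disjoint_from_concurrent)
        return global_clock.load();        // reuse the current time
    return global_clock.fetch_add(1) + 1;  // contended case: advance the clock
}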


Locks as a Resource: Fairly Scheduling Lock Occupation with CFL

  • Jonggyu Park
  • Young Ik Eom

In multi-container environments, applications oftentimes experience unexpected performance fluctuations due to undesirable interference among applications. Synchronization primitives such as locks have been identified as one of the causes, yet they remain an unscheduled resource even though large sets of locks are shared across applications. In this paper, we demonstrate that this lack of lock scheduling incurs significant real-world problems, including performance unfairness and interference among applications. To address this problem, we propose a new synchronization design with an embedded scheduling capability, called CFL (Completely Fair Locking). CFL fairly distributes lock occupation time among applications, considering their priorities and cgroup information. For scalability, CFL also considers the NUMA topology in the case of NUMA machines. Experimental results demonstrate that CFL significantly improves performance fairness while achieving comparable or sometimes even superior performance to state-of-the-art locks.


Are Your Epochs Too Epic? Batch Free Can Be Harmful

  • Daewoo Kim
  • Trevor Brown
  • Ajay Singh

Epoch-based memory reclamation (EBR) is one of the most popular techniques for reclaiming memory in lock-free and optimistic locking data structures, due to its ease of use and good performance in practice. However, EBR is known to be sensitive to thread delays, which can result in performance degradation. Moreover, the exact mechanism for this performance degradation is not well understood.


This paper illustrates this performance degradation in a popular data structure benchmark, and does a deep dive to uncover its root cause---a subtle interaction between EBR and state-of-the-art memory allocators. In essence, modern allocators attempt to reduce the overhead of freeing by maintaining bounded thread caches of objects for local reuse, actually freeing them (a very high-latency operation) only when thread caches become too large. EBR immediately bypasses these mechanisms whenever a particularly large batch of objects is freed, substantially increasing overheads and latencies. Beyond EBR, many memory reclamation algorithms, and data structures, that reclaim objects in large batches suffer similar deleterious interactions with popular allocators.


We propose a simple algorithmic fix for such algorithms to amortize the freeing of large object batches over time, and apply this technique to ten existing memory reclamation algorithms, observing performance improvements for nine out of ten, and over 50% improvement for six out of ten in experiments on a high-performance lock-free ABtree. We also present an extremely simple token passing variant of EBR and show that, with our fix, it performs 1.5--2.6× faster than the fastest known memory reclamation algorithm, and 1.2--1.5× faster than not reclaiming at all, on a 192-thread four-socket Intel system.
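A minimal sketch of the proposed amortization (our code with a hypothetical interface; a real implementation would keep this state per thread):

#include <cstdlib>
#include <deque>
#include <vector>

std::deque<void*> pending;      // expired-epoch objects not yet freed
const int kFreesPerOp = 4;      // amortization rate per data-structure op

// Instead of free()ing an entire reclaimed batch at once, which overflows the
// allocator's thread cache and triggers its high-latency slow path, defer it.
void on_batch_reclaimed(const std::vector<void*>& batch) {
    pending.insert(pending.end(), batch.begin(), batch.end());
}

// Called from each subsequent operation: frees a few objects at a time.
void amortized_drain() {
    for (int i = 0; i < kFreesPerOp && !pending.empty(); ++i) {
        std::free(pending.front());
        pending.pop_front();
    }
}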


SESSION: Compilers and Runtimes for Parallel Systems


Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference

  • Jiangsu Du
  • Jinhui Wei
  • Jiazhi Jiang
  • Shenggan Cheng
  • Dan Huang
  • Zhiguang Chen
  • Yutong Lu

Distributed large model inference still faces a dilemma in balancing cost and effectiveness. Online scenarios demand intra-operator parallelism to achieve low latency, but its intensive communication makes it costly. Conversely, inter-operator parallelism can achieve high throughput with much less communication, but it fails to deliver low latency.


In this paper, we present Liger, a distributed large model inference runtime system capable of achieving low latency at high throughput on multi-GPU architectures. The key idea lies in a novel interleaved parallelism, which interleaves the computation and communication across requests. Liger enables this parallelism by carefully scheduling computation and communication kernels across requests onto multiple streams of multiple GPUs. It achieves precise and efficient control of kernel execution order by combining CPU-GPU synchronization with inter-stream synchronization. To prevent scheduling failures caused by resource contention, Liger introduces a contention factor strategy to anticipate the penalty of contention. It enables a higher degree of overlap by decomposing lengthy kernels into smaller, more manageable units at runtime.


Extensive evaluations show that Liger, in most cases, outperforms existing parallelism approaches across models and devices, presenting the best latency and throughput results. In a 4-device case, Liger reduces the average latency by 36.0% while maintaining the same throughput compared to the inter-operator approach. Meanwhile, it improves the throughput by 1.34× with improved average latency compared to the intra-operator approach.


A Holistic Approach to Automatic Mixed-Precision Code Generation and Tuning for Affine Programs

  • Jinchen Xu
  • Guanghui Song
  • Bei Zhou
  • Fei Li
  • Jiangwei Hao
  • Jie Zhao

Reducing floating-point (FP) precision is used to trade the quality degradation of a numerical program's output for performance, but this optimization coincides with type casting, whose overhead is undisclosed until a mixed-precision code version is generated. This uncertainty enforces the decoupled implementation of mixed-precision code generation and autotuning in prior work. In this paper, we present a holistic approach called PrecTuner that consolidates the mixed-precision code generator and the autotuner by defining one parameter. This parameter is first initialized by some automatically sampled values and used to generate several code variants, with various loop transformations also taken into account. The generated code variants are next profiled to solve a performance model formulated using the aforementioned parameter, possibly under a pre-defined quality degradation budget. The best-performing value of the defined parameter is finally predicted without evaluating all code variants. Experimental results of the PolyBench benchmarks on CPU demonstrate that PrecTuner outperforms LuIs by 3.28× while achieving smaller errors, and we also validate its effectiveness in optimizing a real-life large-scale application. In addition, PrecTuner also obtains a mean speedup of 1.81× and 1.52×-1.73× over Pluto on single- and multi-core CPU, respectively, and 1.71× over PPCG on GPU.


Language-Agnostic Static Deadlock Detection for Futures

  • Stefan K Muller

Deadlocks, in which threads wait on each other in a cyclic fashion and can't make progress, have plagued parallel programs for decades. In recent years, as the parallel programming mechanism known as futures has gained popularity, interest in preventing deadlocks in programs with futures has increased as well. Various static and dynamic algorithms exist to detect and prevent deadlock in programs with futures, generally by constructing some approximation of the dependency graph of the program but, as far as we are aware, all are specialized to a particular programming language.


A recent paper introduced graph types, by which one can statically approximate the dependency graphs of a program in a language-independent fashion. By analyzing the graph type directly instead of the source code, a graph-based program analysis, such as one to detect deadlock, can be made language-independent. Indeed, the paper that proposed graph types also proposed a deadlock detection algorithm. Unfortunately, the algorithm was based on an unproven conjecture which we show to be false. In this paper, we present, and prove sound, a type system for finding possible deadlocks in programs that operates over graph types and can therefore be applied to many different languages. As a proof of concept, we have implemented the algorithm over a subset of the OCaml language extended with built-in futures.


Recurrence Analysis for Automatic Parallelization of Subscripted Subscripts

  • Akshay Bhosale
  • Rudolf Eigenmann

Introducing correct and optimal OpenMP parallelization directives in applications is a challenge. To parallelize a loop in an input application code automatically, parallelizing compilers need to disprove dependences with respect to variables across iterations of the loop. Performing such dependence analysis in the presence of index arrays or subscripted subscripts - a[b[i]] - has long been a challenge for automatic parallelizers. Loops with subscripted subscripts can be parallelized if the subscript array is known to possess a property such as monotonicity. This paper presents a compile-time algorithm that can analyze complex recurrence relations and determine irregular or intermittent monotonicity of one-dimensional and monotonicity of multi-dimensional subscript arrays. The new algorithm builds on a prior approach that is capable of analyzing simple recurrence relations and determining monotonic one-dimensional subscript arrays. Experimental results show that automatic parallelizers equipped with our new analysis techniques can substantially improve the performance of ten out of twelve or 83.33% of the benchmarks evaluated, 25--33.33% more than possible with state-of-the-art compile-time automatic parallelization techniques.
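The enabling property can be seen in a small example (our sketch with a hypothetical loop, not from the paper): if the compiler proves b[] strictly increasing, distinct iterations touch distinct elements of a, so the loop is safely parallel:

void scatter_add(double* a, const int* b, const double* c, int n) {
    // Legal only because b is (provably) strictly increasing, i.e., monotonic:
    // b[i] != b[j] for i != j, so no two iterations write the same a[b[i]].
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[b[i]] += c[i];
}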


SESSION: High Performance Computing


OsirisBFT: Say No to Task Replication for Scalable Byzantine Fault Tolerant Analytics

  • Kasra Jamshidi
  • Keval Vora

We present a verification-based Byzantine Fault Tolerant processing system, called OsirisBFT, for distributed task-parallel applications. OsirisBFT treats computation tasks differently from state update tasks, allowing the application to scale independently of the number of expected failures. OsirisBFT captures application-specific verification semantics via generic verification operators and employs lightweight verification strategies with little coordination during graceful execution. Evaluation across multiple applications and workloads shows that OsirisBFT delivers high processing throughput and scalability compared to replicated processing. Importantly, the scalable nature of OsirisBFT enables it to reduce the performance gap compared to a baseline with no fault tolerance by simply scaling out.


Towards Scalable Unstructured Mesh Computations on Shared Memory Many-Cores

  • Haozhong Qiu
  • Chuanfu Xu
  • Jianbin Fang
  • Liang Deng
  • Jian Zhang
  • Qingsong Wang
  • Yue Ding
  • Zhe Dai
  • Yonggang Che
  • Shizhao Chen
  • Jie Liu

Due to data conflicts or data dependences, exploiting shared memory parallelism on unstructured mesh applications is highly challenging. Prior approaches are neither general nor scalable on emerging many-core processors. This paper presents a general and scalable shared memory approach for unstructured mesh computations. We recursively divide and reorder an unstructured mesh to construct a task dependency tree (TDT), where massive parallelism is exposed and data conflicts as well as data dependences are respected. We propose two recursion strategies to support popular programming models on both CPUs and GPUs for TDT. We evaluate our approach by applying it to an industrial unstructured Computational Fluid Dynamics (CFD) software. Experimental results show that our approach significantly outperforms the prior shared memory approaches, delivering up to 8.1× performance improvement over the engineer-tuned implementations.


Extreme-scale Direct Numerical Simulation of Incompressible Turbulence on the Heterogeneous Many-core System

  • Jiabin Xie
  • Guangnan Feng
  • Han Huang
  • Junxuan Feng
  • Zhiguang Chen
  • Yutong Lu

Direct numerical simulation (DNS) is a technique that directly solves the fluid Navier-Stokes equations with high spatial and temporal resolutions, which has driven much research regarding the nature of turbulence. For high-Reynolds number (Re) incompressible turbulence of particular interest, where the nondimensional Re characterizes the flow regime, the application of DNS is hindered by the fact that the numerical grid size (i.e., the memory requirement) scales with Re³, while the overall computational cost scales with Re⁴. Recent studies have shown that developing efficient parallel methods for heterogeneous many-core systems is promising to solve this computational challenge.


We develop PowerLLEL++, a high-performance and scalable implicit finite difference solver for heterogeneous many-core systems, to accelerate the extreme-scale DNS of incompressible turbulence. To achieve this goal, an adaptive multi-level parallelization strategy is first proposed to fully exploit the multi-level parallelism and computing power of heterogeneous many-core systems. Second, a hierarchical-memory-adapted data reuse/tiling strategy and kernel fusion are adopted to improve the performance of memory-bound stencil-like operations. Third, a parallel tridiagonal solver based on the parallel diagonal dominant (PDD) algorithm is developed to minimize the number of global data transposes. Fourth, three effective communication optimizations are implemented by Remote Direct Memory Access (RDMA) to maximize the performance of the remaining global transposes and halo exchange.


Results show that the solver exploits the heterogeneous computing power of the new Tianhe supercomputer and achieves a speedup of up to 10.6× (against the CPU-only performance). Linear strong scaling is obtained with a grid size of up to 25.8 billion.


Pure: Evolving Message Passing To Better Leverage Shared Memory Within Nodes

  • James Psota
  • Armando Solar-Lezama

Pure is a new programming model and runtime system explicitly designed to take advantage of shared memory within nodes in the context of a mostly message passing interface enhanced with the ability to use tasks to make use of idle cores. Pure leverages shared memory in two ways: (a) by allowing cores to steal work from each other while waiting on messages to arrive, and, (b) by leveraging efficient lock-free data structures in shared memory to achieve high-performance messaging and collective operations between the ranks within nodes. We use microbenchmarks to evaluate Pure's key messaging and collective features and also show application speedups up to 2.1× on the CoMD molecular dynamics and the miniAMR adaptive mesh refinement applications scaling up to 4,096 cores.


SESSION: Graph Processing


INFINEL: An efficient GPU-based processing method for unpredictable large output graph queries

  • Sungwoo Park
  • Seyeon Oh
  • Min-Soo Kim

With the introduction of GPUs, which are specialized for iterative parallel computations, the execution of computation-intensive graph queries using a GPU has seen significant performance improvements. However, due to the memory constraints of GPUs, there has been limited research on handling large-scale output graph queries with unpredictable output sizes on a GPU. Traditionally, two-phase methods have been used, where the query is re-executed after splitting it into sub-tasks while only considering the size of the output in a static manner. However, two-phase methods become highly inefficient when used with graph data with extreme skew, failing to maximize the GPU performance. This paper proposes INFINEL, which handles unpredictable large output graph queries in a one-phase method through chunk allocation per thread and kernel stop/restart methods. We also propose optimization techniques that exploit INFINEL's unique characteristics of operating with low time/space overhead and not relying heavily on the GPU output buffer size. Through extensive experiments, we demonstrate that our one-phase method of INFINEL improves the performance by up to 31.5 times over the conventional two-phase methods for the triangle listing ULO query.


GraphCube: Interconnection Hierarchy-aware Graph Processing

  • Xinbiao Gan
  • Guang Wu
  • Shenghao Qiu
  • Feng Xiong
  • Jiaqi Si
  • Jianbin Fang
  • Dezun Dong
  • Chunye Gong
  • Tiejun Li
  • Zheng Wang

Processing large-scale graphs with billions to trillions of edges requires efficiently utilizing parallel systems. However, current graph processing engines do not scale well beyond a few tens of computing nodes because they are oblivious to the communication cost variations across the interconnection hierarchy. We introduce GraphCube, a better approach to optimizing graph processing on large-scale parallel systems with complex interconnections. GraphCube features a new graph partitioning approach to achieve better load balancing and minimize communication overhead across multiple levels of the interconnection hierarchy. We evaluate GraphCube by applying it to fundamental graph operations performed on synthetic and real-world graph datasets. Our evaluation used up to 79,024 computing nodes and 1.2+ million processor cores. Our large-scale experiments show that GraphCube outperforms state-of-the-art parallel graph processing methods in throughput and scalability. Furthermore, GraphCube outperformed the top-ranked systems on the Graph 500 list.


Exploiting Fine-Grained Redundancy in Set-Centric Graph Pattern Mining

  • Zhiheng Lin
  • Ke Meng
  • Chaoyang Shui
  • Kewei Zhang
  • Junmin Xiao
  • Guangming Tan

Graph Pattern Mining (GPM) applications are memory-intensive as they require a tremendous number of edge checks. In recent years, the "set-centric" abstraction has gained attention for its powerful expressive abilities. By leveraging relational algebra, set-centric systems optimize algorithms with methods like matching orders, early termination, automorphism-breaking, and result reuse to reduce redundancy. However, these approaches primarily address coarse-grained redundancy from exactly the same set formulas, neglecting that the data graph's inherent locality may lead to fine-grained duplicated edge checks. In fact, even unrelated set operations may check the same pair of vertices. This paper introduces the set union operation to the set-centric abstraction to fuse duplicated edge checks into one. It maintains the expressive power of relational algebra and previous optimizations while effectively avoiding fine-grained redundancy in GPM tasks. Compared to state-of-the-art methods, our method achieves significant speedup on a V100 GPU cluster, demonstrating up to 305× faster performance than the state-of-the-art GPM system G2Miner.


SESSION: Synchronization and Concurrency Control II


Memory Bounds for Concurrent Bounded Queues

  • Vitaly Aksenov
  • Nikita Koval
  • Petr Kuznetsov
  • Anton Paramonov

Concurrent data structures often require additional memory for handling synchronization issues in addition to memory for storing elements. Depending on the amount of this additional memory, implementations can be more or less memory-friendly. A memory-optimal implementation enjoys the minimal possible memory overhead, which, in practice, reduces cache misses and unnecessary memory reclamation.


In this paper, we discuss the memory-optimality of non-blocking bounded queues. Essentially, we investigate the possibility of constructing an implementation that uses a pre-allocated array to store elements plus only constant additional memory, e.g., two positioning counters for the enqueue(..) and dequeue() operations. Such an implementation can be readily constructed when the ABA problem is precluded, e.g., assuming that the hardware supports LL/SC instructions or all inserted elements are distinct. However, in the general case, we show that a memory-optimal non-blocking bounded queue incurs linear overhead in the number of concurrent processes. These results not only provide helpful intuition for concurrent algorithm developers but also open a new research avenue on the memory-optimality phenomenon in concurrent data structures.


The full version of this paper is available on arXiv [2].
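The layout in question can be written down in a few lines (our sketch of the memory-optimal shape only, not a correct general-case algorithm; the paper shows that avoiding ABA under plain CAS forces linear extra memory):

#include <atomic>
#include <cstddef>

// A pre-allocated array plus two positioning counters: constant overhead.
// This shape is achievable when ABA is precluded (e.g., LL/SC hardware or
// all-distinct elements); in the general CAS setting it is not.
template <typename T, std::size_t N>
struct BoundedQueueLayout {
    std::atomic<T>           slots[N];  // element storage, allocated once
    std::atomic<std::size_t> tail{0};   // enqueue(..) position, mod N
    std::atomic<std::size_t> head{0};   // dequeue() position, mod N
};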


VERLIB: Concurrent Versioned Pointers

  • Guy E. Blelloch
  • Yuanhao Wei

Recent work has shown how to augment any CAS-based concurrent data structure to support taking a snapshot of the current memory state. Taking a snapshot, as well as performing loads and CAS (Compare and Swap) operations, takes constant time. Importantly, such snapshotting can be used to easily implement linearizable queries, such as range queries, over any part of a data structure.


In this paper, we make two significant improvements over this approach. The first improvement removes a subtle and hard-to-reason-about restriction that was needed to avoid a level of indirection on pointers. We introduce an approach, which we refer to as indirection-on-need, that removes the restriction yet almost always avoids indirection. The second improvement is to efficiently support snapshotting with lock-free locks. This requires supporting an idempotent CAS. We show a particularly simple solution to the problem that leverages the data structures used for snapshotting.


Based on these ideas, we implemented an easy-to-use C++ library, verlib, centered around a versioned pointer type. The library works with lock-based (standard or lock-free) and CAS-based algorithms, or any combination. Converting existing concurrent data structures to use the library takes minimal effort. We present results for experiments that use verlib to convert state-of-the-art data structures for ordered maps (a B-tree), radix-ordered maps (an ART-tree), and unordered maps (an optimized hash table) to be snapshottable. The snapshottable versions perform almost as well as the original versions and far outperform any previous implementations that support atomic range queries.


Practical Hardware Transactional vEB Trees

  • Mohammad Khalaji
  • Trevor Brown
  • Khuzaima Daudjee
  • Vitaly Aksenov

van Emde Boas (vEB) trees are sequential data structures optimized for extremely fast predecessor and successor queries. Such queries are an important incentive to use ordered sets or maps such as vEB trees. All operations in a vEB tree are doubly logarithmic in the universe size. Attempts to implement concurrent vEB trees have either simplified their structure in a way that eliminated their ability to perform fast predecessor and successor queries, or have otherwise compromised on doubly logarithmic complexity. In this work, we leverage Hardware Transactional Memory (HTM) to implement vEB tree-based sets and maps in which operations are doubly logarithmic in the absence of contention. Our proposed concurrent vEB tree is the first to implement recursive summaries, the key algorithmic component of fast predecessor and successor operations. Through extensive experiments, we demonstrate that our algorithm outperforms state-of-the-art concurrent maps by an average of 5× in a moderately skewed workload, and the single-threaded C++ standard ordered map and its unordered map by 70% and 14%, respectively. And, it does so while using two orders of magnitude less memory than traditional vEB trees.
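The execution pattern is roughly the following (our sketch with hypothetical names, using Intel RTM intrinsics; the paper's actual design is more involved):

#include <immintrin.h>
#include <mutex>

struct VebTree {
    int successor(int x);  // unmodified sequential, doubly-logarithmic query
};

int successor_htm(VebTree& t, int x, std::mutex& fallback) {
    for (int attempt = 0; attempt < 3; ++attempt) {
        if (_xbegin() == _XBEGIN_STARTED) {
            int r = t.successor(x);  // runs atomically in a hardware transaction
            _xend();
            return r;
        }
        // Transaction aborted (e.g., contention): retry a few times.
    }
    // Serialized fallback; a production version must also read this lock
    // inside the transaction and abort if held, to keep the paths consistent.
    std::lock_guard<std::mutex> g(fallback);
    return t.successor(x);
}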


SESSION: ML Workloads


Tetris: Accelerating Sparse Convolution by Exploiting Memory Reuse on GPU

  • Xiaoyan Liu
  • Xuegui Zheng
  • Hailong Yang
  • Zhongzhi Luan
  • Depei Qian

Convolutional neural networks (CNNs) have achieved remarkable success in various application fields. Although model compression techniques mitigate the ever-increasing resource demands of large CNN models, the compressed models usually exhibit irregular memory access and unstructured sparsity, which make it difficult for dominant operators such as sparse convolution to achieve the expected speedup on popular inference platforms such as GPUs. In this paper, we propose Tetris, an efficient sparse convolution approach optimized for GPU. Tetris first fully exploits the input reuse opportunity of sparse convolution to reduce the memory accesses to global memory. It then adopts a stride packed filter (SPF) format and a bank-sensing reorganization scheme to eliminate the irregular memory accesses caused by unstructured sparsity. It also leverages a filter group reorder technique to address load imbalance among threads, and a parameter tuning method to determine the optimal parameters of the sparse convolution implementation. The experiment results show that Tetris outperforms dense/sparse convolution libraries and cutting-edge implementations with promising performance speedup.


Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips

  • Ismet Dagli
  • Mehmet E. Belviranli

Two distinguishing features of state-of-the-art mobile and autonomous systems are: 1) there are often multiple workloads, mainly deep neural network (DNN) inference, running concurrently and continuously; and 2) they operate on shared-memory Systems-on-Chip (SoCs) that embed heterogeneous accelerators tailored for specific operations. State-of-the-art systems lack the efficient performance and resource management techniques necessary to either maximize total system throughput or minimize end-to-end workload latency. In this work, we propose HaX-CoNN, a novel scheme that characterizes and maps layers of concurrently executing DNN inference workloads to a diverse set of accelerators within an SoC. Our scheme uniquely takes per-layer execution characteristics, shared memory (SM) contention, and inter-accelerator transitions into account to find optimal schedules. We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SoCs. Our experimental results indicate that HaX-CoNN can reduce memory contention by up to 45% and improve total latency and throughput by up to 32% and 29%, respectively, compared to the state of the art.


Training one DeePMD Model in Minutes: a Step towards Online Learning

  • Siyu Hu
  • Tong Zhao
  • Qiuchen Sha
  • Enji Li
  • Xiangyu Meng
  • Liping Liu
  • Lin-Wang Wang
  • Guangming Tan
  • Weile Jia

Neural Network Molecular Dynamics (NNMD) has become a major approach in material simulations: it can speed up molecular dynamics (MD) simulation by thousands of times while maintaining ab initio accuracy, and thus has the potential to fundamentally change the paradigm of material simulation. However, NNMD development has two time-consuming bottlenecks. One is access to the results of ab initio calculations. The other, which is the focus of the current work, is the training time of the NNMD model. Training an NNMD model differs from most other neural network training because the atomic force (which is related to the gradient of the network) is an important physical property to be fitted. Tests show that traditional stochastic gradient methods, such as the Adam algorithm, cannot efficiently exploit multi-sample minibatches. As a result, a typical training run (taking Deep Potential Molecular Dynamics (DeePMD) as an example) can take many hours. In this work, we design a heuristic minibatch quasi-Newtonian optimizer based on the Extended Kalman Filter method. An early reduction of gradients and errors is adopted to reduce the memory footprint and communication. The memory footprint, communication, and hyper-parameter settings of the new method are analyzed in detail. Computational innovations such as customized kernels for the symmetry-preserving descriptor are applied to exploit the computing power of the heterogeneous architecture. Experiments are performed on 8 datasets representing different real-world situations, and numerical results show that our method achieves an average speedup of 32.2× over the Reorganized Layer-wise Extended Kalman Filter on 1 GPU, reducing the absolute training time of one DeePMD model from hours to several minutes and taking one step toward online training.


SESSION: Parallel Algorithms


ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate Nearest Neighbor Search Algorithms

  • Magdalen Dobson Manohar
  • Zheqi Shen
  • Guy Blelloch
  • Laxman Dhulipala
  • Yan Gu
  • Harsha Vardhan Simhadri
  • Yihan Sun

Approximate nearest-neighbor search (ANNS) algorithms are a key part of the modern deep learning stack because they enable efficient similarity search over high-dimensional vector-space representations (i.e., embeddings) of data. Among ANNS algorithms, graph-based ones are known to achieve the best throughput-recall tradeoffs. Despite the large scale of modern ANNS datasets, existing parallel graph-based implementations struggle to scale to large datasets due to heavy use of locks and other sequential bottlenecks, which 1) prevent them from efficiently scaling to a large number of processors, and 2) result in non-determinism that is undesirable in certain applications.


In this paper, we introduce ParlayANN, a library of deterministic and parallel graph-based approximate nearest neighbor search algorithms, along with a set of useful tools for developing such algorithms. In this library, we develop novel parallel implementations of four state-of-the-art graph-based ANNS algorithms that scale to billion-scale datasets. Our algorithms are deterministic and achieve high scalability across a diverse set of challenging datasets. In addition to the new algorithmic ideas, we conduct a detailed experimental study of our new algorithms as well as two existing non-graph approaches. Our experimental results both validate the effectiveness of our new techniques and lead to a comprehensive comparison among ANNS algorithms on large-scale datasets, with a number of interesting findings.
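
The primitive underlying such graph-based algorithms is a greedy walk over a proximity graph toward the query. A minimal sequential version is sketched below (illustrative only; production systems use beam search with visited sets, and ParlayANN's contribution is the deterministic parallel construction and search, not this loop):

    // Minimal greedy walk on a proximity graph, the primitive underlying
    // graph-based ANNS: hop to whichever neighbor is closest to the query
    // until no neighbor improves, returning a locally closest point.
    #include <cstddef>
    #include <iostream>
    #include <vector>

    using Point = std::vector<float>;

    float dist2(const Point& a, const Point& b) {
      float s = 0;
      for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
      return s;
    }

    int greedy_search(const std::vector<Point>& pts,
                      const std::vector<std::vector<int>>& graph,
                      const Point& query, int start) {
      int cur = start;
      for (;;) {
        int best = cur;
        float bestd = dist2(pts[cur], query);
        for (int nb : graph[cur]) {
          float d = dist2(pts[nb], query);
          if (d < bestd) { bestd = d; best = nb; }
        }
        if (best == cur) return cur;   // local minimum reached
        cur = best;
      }
    }

    int main() {
      std::vector<Point> pts = {{0, 0}, {1, 0}, {2, 0}, {3, 0}};
      std::vector<std::vector<int>> graph = {{1}, {0, 2}, {1, 3}, {2}};  // path
      std::cout << greedy_search(pts, graph, {2.6f, 0.0f}, 0) << "\n";   // 3
    }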


Parallel k-Core Decomposition with Batched Updates and Asynchronous Reads

  • Quanquan C. Liu
  • Julian Shun
  • Igor Zablotchi

Maintaining a dynamic k-core decomposition is an important problem that identifies dense subgraphs in dynamically changing graphs. Recent work by Liu et al. [SPAA 2022] presents a parallel batch-dynamic algorithm for maintaining an approximate k-core decomposition. In their solution, both reads and updates must be batched, so each type of operation can incur high latency waiting for the other type to finish. To tackle most real-world workloads, which are dominated by reads, this paper presents a novel hybrid concurrent-parallel dynamic k-core data structure in which asynchronous reads proceed concurrently with batches of updates, leading to significantly lower read latencies. Our approach is based on tracking causal dependencies between updates, so that causally related groups of updates appear atomic to concurrent readers. Our data structure guarantees linearizability and liveness for both reads and updates, and maintains the same approximation guarantees as prior work. Our experimental evaluation on a 30-core machine shows that our approach reduces read latency compared to the batch-dynamic algorithm by orders of magnitude, up to a factor of 4.05 × 10⁵. Compared to an unsynchronized (non-linearizable) baseline, our read latency is at most 3.21× higher, while we improve the accuracy of coreness estimates by up to a factor of 52.7.


Parallel Integer Sort: Theory and Practice

  • Xiaojun Dong
  • Laxman Dhulipala
  • Yan Gu
  • Yihan Sun

Integer sorting is a fundamental problem in computer science. This paper studies parallel integer sort both in theory and in practice. In theory, we show tighter bounds for a class of existing practical integer sort algorithms, providing a solid theoretical foundation for their strong performance and widespread use in practice. In practice, we design a new integer sorting algorithm, DovetailSort, that is theoretically efficient and performs well in practice.


In particular, DovetailSort overcomes a common challenge in existing parallel integer sorting algorithms, which is the difficulty of detecting and taking advantage of duplicate keys. The key insight in DovetailSort is to combine algorithmic ideas from both integer- and comparison-sorting algorithms. In our experiments, DovetailSort achieves competitive or better performance than existing state-of-the-art parallel integer and comparison sorting algorithms on various synthetic and real-world datasets.
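
For intuition only, here is a toy sequential "dovetailing" of the two ideas: one most-significant-byte radix pass followed by comparison sorts on each bucket, with a crude duplicate-run check. This is not the paper's algorithm; the bucket count and the all-equal test are invented for illustration:

    // Illustrative hybrid (not DovetailSort itself): an MSD radix pass splits
    // keys by their top byte, then each bucket is finished by a comparison
    // sort; a bucket containing only one distinct key is a duplicate run and
    // can be emitted as-is.
    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    void hybrid_sort(std::vector<uint32_t>& a) {
      std::array<std::vector<uint32_t>, 256> bucket;
      for (uint32_t x : a) bucket[x >> 24].push_back(x);   // radix pass
      a.clear();
      for (auto& b : bucket) {
        bool dup_run = !b.empty() &&
            std::all_of(b.begin(), b.end(),
                        [&](uint32_t x) { return x == b.front(); });
        if (!dup_run) std::sort(b.begin(), b.end());       // comparison sort
        a.insert(a.end(), b.begin(), b.end());
      }
    }

    int main() {
      std::vector<uint32_t> v = {7, 3, 7, 1u << 30, 7, 42};
      hybrid_sort(v);
      for (uint32_t x : v) std::cout << x << ' ';   // 3 7 7 7 42 1073741824
    }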


Fast American Option Pricing using Nonlinear Stencils

  • Zafar Ahmad
  • Reilly Browne
  • Rezaul Chowdhury
  • Rathish Das
  • Yushen Huang
  • Yimin Zhu

We study the binomial, trinomial, and Black-Scholes-Merton models of option pricing. We present fast parallel discrete-time finite-difference algorithms for American call option pricing under the binomial and trinomial models and American put option pricing under the Black-Scholes-Merton model. For T-step finite differences, each algorithm runs in O((T log² T)/p + T) time under a greedy scheduler on p processing cores, a significant improvement over the Θ(T²/p) + Ω(T log T) time taken by the corresponding state-of-the-art parallel algorithm. Even when run on a single core, the O(T log² T) time taken by our algorithms is asymptotically much smaller than the Θ(T²) running time of the fastest known serial algorithms. Implementations of our algorithms significantly outperform the fastest implementations of existing algorithms in practice: when run for T ≈ 1000 steps on a 48-core machine, our algorithm for the binomial model runs at least 15× faster than the fastest existing parallel program for the same model, with the speedup factor gradually growing beyond 500× for T ≈ 0.5 × 10⁶. It saves more than 80% energy for T ≈ 4000, and more than 99% energy for T > 60,000.


Our algorithms can be viewed as solving a class of nonlinear 1D stencil (i.e., finite-difference) computation problems efficiently using the Fast Fourier Transform (FFT). To our knowledge, ours are the first algorithms to handle such stencils in o(T²) time. These contributions are of independent interest, as stencil computations have a wide range of applications beyond quantitative finance.
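
For context, the classical Θ(T²) baseline for the binomial model that these algorithms improve upon asymptotically is an ordinary backward-induction sweep, where the early-exercise max is the nonlinearity in the stencil (a standard textbook pricer; the parameters below are illustrative):

    // Standard Θ(T^2) backward-induction pricer for an American call under
    // the Cox-Ross-Rubinstein binomial model.
    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <vector>

    double american_call_binomial(double S0, double K, double r, double sigma,
                                  double years, int T) {
      double dt = years / T;
      double u = std::exp(sigma * std::sqrt(dt)), d = 1.0 / u;
      double disc = std::exp(-r * dt);
      double p = (std::exp(r * dt) - d) / (u - d);   // risk-neutral probability
      std::vector<double> v(T + 1);
      for (int i = 0; i <= T; ++i)                   // payoffs at expiry
        v[i] = std::max(S0 * std::pow(u, i) * std::pow(d, T - i) - K, 0.0);
      for (int t = T - 1; t >= 0; --t)               // backward induction
        for (int i = 0; i <= t; ++i) {
          double cont = disc * (p * v[i + 1] + (1 - p) * v[i]);
          double exer = S0 * std::pow(u, i) * std::pow(d, t - i) - K;
          v[i] = std::max(cont, exer);               // early-exercise max
        }
      return v[0];
    }

    int main() {
      std::cout << american_call_binomial(100, 100, 0.05, 0.2, 1.0, 1000) << "\n";
    }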


SESSION: Optimizing for Memory


ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores

  • Yuetao Chen
  • Kun Li
  • Yuhao Wang
  • Donglin Bai
  • Lei Wang
  • Lingxiao Ma
  • Liang Yuan
  • Yunquan Zhang
  • Ting Cao
  • Mao Yang

Tensor Core Units (TCUs) are increasingly integrated into modern high-performance processors to enhance matrix multiplication performance. However, constrained by their narrow specialization, their potential for improving other critical scientific operations such as stencil computations remains untapped.


This paper presents ConvStencil, a novel stencil computing system designed to efficiently transform stencil computation to matrix multiplication on Tensor Cores. We first develop a performance model for ConvStencil to guide algorithm design and optimization on TCUs. Based on this model, we propose three techniques: (1) Memory-efficient Layout Transformation using the stencil2row method; (2) Computation-dense Compute Adaptation with Dual Tessellation and kernel fusion; and (3) Performance-boosting Conflict Removal using a Lookup Table and Dirty Bits Padding. ConvStencil outperforms other stencil optimization frameworks, achieving significant speedups compared to solutions like AMOS, cuDNN, Brick, DRStencil, and TCStencil. By transforming stencil computation on Tensor Cores, ConvStencil promises to improve the performance of various scientific and engineering applications.


CPMA: An Efficient Batch-Parallel Compressed Set Without Pointers

  • Brian Wheatman
  • Randal Burns
  • Aydin Buluc
  • Helen Xu

This paper introduces the batch-parallel Compressed Packed Memory Array (CPMA), a compressed, dynamic, ordered set data structure based on the Packed Memory Array (PMA). Traditionally, batch-parallel sets are built on pointer-based data structures such as trees because pointer-based structures enable fast parallel unions via pointer manipulation. When compared with cache-optimized trees, PMAs were slower to update but faster to scan.


The batch-parallel CPMA overcomes this tradeoff between updates and scans by optimizing for cache-friendliness. On average, the CPMA achieves 3× faster batch-insert throughput and 4× faster range-query throughput compared with compressed PaC-trees, a state-of-the-art batch-parallel set library based on cache-optimized trees.


We further evaluate the CPMA compared with compressed PaC-trees and Aspen, a state-of-the-art system, on a real-world application of dynamic-graph processing. The CPMA is on average 1.2× faster on a suite of graph algorithms and 2× faster on batch inserts when compared with compressed PaC-trees. Furthermore, the CPMA is on average 1.3× faster on graph algorithms and 2× faster on batch inserts compared with Aspen.
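
The gap-based layout behind PMAs can be sketched in a few lines (a toy without the density-window rebalancing, compression, or batch-parallelism that PMAs and the CPMA add; the type and names are invented for illustration):

    // Toy gap-based sorted array in the spirit of a PMA: an insert shifts
    // elements only up to the nearest free slot, while scans stream through
    // one contiguous array. Real PMAs rebalance density windows instead of
    // growing at the end as this toy does.
    #include <cstddef>
    #include <iostream>
    #include <optional>
    #include <vector>

    struct ToyPMA {
      std::vector<std::optional<int>> slot;
      explicit ToyPMA(size_t n) : slot(n) {}

      void insert(int key) {
        size_t succ = 0;   // first occupied slot holding a value >= key
        while (succ < slot.size() && (!slot[succ] || *slot[succ] < key)) ++succ;
        size_t gap = succ; // nearest free slot at or after succ
        while (gap < slot.size() && slot[gap]) ++gap;
        if (gap == slot.size()) slot.emplace_back();        // grow, no rebalance
        for (size_t i = gap; i > succ; --i) slot[i] = slot[i - 1];  // short shift
        slot[succ] = key;
      }
      void scan() const {  // cache-friendly linear pass over the array
        for (const auto& s : slot) if (s) std::cout << *s << ' ';
        std::cout << '\n';
      }
    };

    int main() {
      ToyPMA pma(8);
      for (int k : {5, 1, 9, 3}) pma.insert(k);
      pma.scan();   // prints: 1 3 5 9
    }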


Gallatin: A General-Purpose GPU Memory Manager

  • Hunter Mccoy
  • Prashant Pandey

Dynamic memory management is critical for efficiently porting modern data processing pipelines to GPUs. However, building a general-purpose dynamic memory manager on GPUs is challenging due to the massive parallelism and weak memory coherence. Existing state-of-the-art GPU memory managers, Ouroboros and Reg-Eff, employ traditional data structures such as arrays and linked lists to manage memory objects. They build specialized pipelines to achieve performance for a fixed set of allocation sizes and fall back to the CUDA allocator for allocating large sizes. In the process, they lose general-purpose usability and fail to support critical applications such as streaming graph processing.


In this paper, we introduce Gallatin, a general-purpose and high-performance GPU memory manager. Gallatin uses the van Emde Boas (vEB) tree data structure to manage memory objects efficiently and supports allocations of any size. Furthermore, we develop a highly-concurrent GPU implementation of the vEB tree which can be broadly used in other GPU applications. It supports constant time insertions, deletions, and successor operations for a given memory size.


In our evaluation, we compare Gallatin with state-of-the-art specialized allocator variants. Gallatin is up to 374× faster on single-sized allocations and up to 264× faster on mixed-size allocations than the next-best allocator. In scalability benchmarks, Gallatin is up to 254× faster than the next-best allocator as the number of threads increases. For the graph benchmarks, Gallatin is 1.5× faster than the state of the art for bulk insertions, slightly faster for bulk deletions, and 3× faster than the next-best allocator for all graph expansion tests.


SESSION: Linear Algebra


A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs

  • Meng Pang
  • Xiang Fei
  • Peng Qu
  • Youhui Zhang
  • Zhaolin Li

Sparse-Matrix Dense-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) are important sparse kernels in various computation domains. The uneven distribution of nonzeros in the sparse matrix and the tight data dependence between the sparse and dense matrices make it challenging to run sparse matrix multiplication efficiently on GPUs. By analyzing these problems, we propose a row decomposition (RoDe)-based approach to optimize the two kernels on GPUs, using the standard Compressed Sparse Row (CSR) format. Specifically, RoDe divides the sparse matrix rows into regular parts and residual parts, to fully optimize their computations separately. We also devise the corresponding load balancing and fine-grained pipelining techniques. Profiling results show that RoDe achieves more efficient memory access and significantly reduces warp stall cycles. On the SuiteSparse dataset, RoDe achieves a speedup of up to 7.86× (geometric mean 1.45×) for SpMM and up to 8.99× (geometric mean 1.49×) for SDDMM over the state-of-the-art (SOTA) alternatives. RoDe also outperforms its counterparts on deep learning datasets. Furthermore, its preprocessing overhead is significantly smaller, averaging only 16% of the SOTA's.
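
The row-splitting idea can be sketched sequentially (illustrative only; shown for a sparse matrix-vector product for brevity, and the chunk width VEC is an invented stand-in — RoDe's contribution is scheduling the two parts with different GPU strategies):

    // Sketch of regular/residual row splitting over CSR: each row is handled
    // as full-width "regular" chunks of VEC nonzeros plus a short "residual"
    // tail, so the regular part has a perfectly uniform shape.
    #include <iostream>
    #include <vector>

    constexpr int VEC = 4;   // nonzeros per regular chunk (illustrative)

    void spmv_split(int n, const std::vector<int>& rowptr,
                    const std::vector<int>& col,
                    const std::vector<double>& val,
                    const std::vector<double>& x, std::vector<double>& y) {
      for (int r = 0; r < n; ++r) {
        int begin = rowptr[r], end = rowptr[r + 1];
        int regular_end = begin + (end - begin) / VEC * VEC;
        double sum = 0;
        for (int j = begin; j < regular_end; j += VEC)   // regular part:
          for (int k = 0; k < VEC; ++k)                  // uniform chunks
            sum += val[j + k] * x[col[j + k]];
        for (int j = regular_end; j < end; ++j)          // residual part
          sum += val[j] * x[col[j]];
        y[r] = sum;
      }
    }

    int main() {
      // 2x2 matrix [[1,2],[0,3]] in CSR, multiplied by x = (1,1)
      std::vector<int> rowptr = {0, 2, 3}, col = {0, 1, 1};
      std::vector<double> val = {1, 2, 3}, x = {1, 1}, y(2);
      spmv_split(2, rowptr, col, val, x, y);
      std::cout << y[0] << ' ' << y[1] << '\n';   // 3 3
    }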


Fast Kronecker Matrix-Matrix Multiplication on GPUs

  • Abhinav Jangda
  • Mohit Yadav

Kronecker Matrix-Matrix Multiplication (Kron-Matmul) is the multiplication of a matrix with the Kronecker product of several smaller matrices. Kron-Matmul is a core operation in many scientific and machine learning computations. State-of-the-art Kron-Matmul implementations utilize existing tensor algebra operations, such as matrix multiplication, transpose, and tensor-matrix multiplication. However, this design choice prevents several Kron-Matmul-specific optimizations, thus leaving significant performance on the table.


To address this issue, we present FastKron, an efficient technique for Kron-Matmul on single and multiple GPUs. FastKron is independent of existing linear algebra operations, which enables several new optimizations for Kron-Matmul. It thus performs up to 40.7× and 7.85× faster than existing implementations on 1 and 16 GPUs, respectively.


Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication

  • Lukas Gianinazzi
  • Alexandros Nikolaos Ziogas
  • Langwen Huang
  • Piotr Luczynski
  • Saleh Ashkboosh
  • Florian Scheidl
  • Armon Carigiet
  • Chio Ge
  • Nabil Abubaker
  • Maciej Besta
  • Tal Ben-Nun
  • Torsten Hoefler

We propose a novel approach to iterated sparse-matrix dense-matrix multiplication, a fundamental computational kernel in scientific computing and graph neural network training. When matrix sizes exceed the memory of a single compute node, data transfer becomes a bottleneck. An approach based on dense matrix multiplication algorithms leads to sub-optimal scalability and fails to exploit the sparsity in the problem. To address these challenges, we propose decomposing the sparse matrix into a small number of highly structured matrices called arrow matrices, which are connected by permutations. Our approach enables communication-avoiding multiplications, achieving a polynomial reduction in communication volume per iteration for matrices corresponding to planar graphs and other minor-excluded families of graphs. Our evaluation demonstrates that our approach outperforms a state-of-the-art method for sparse matrix multiplication on matrices with hundreds of millions of rows, offering near-linear strong and weak scaling.


SESSION: Applications


FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters

  • Shenggan Cheng
  • Xuanlei Zhao
  • Guangyang Lu
  • Jiarui Fang
  • Tian Zheng
  • Ruidong Wu
  • Xiwen Zhang
  • Jian Peng
  • Yang You

Protein structure prediction helps to understand gene translation and protein function, and is of growing interest and importance in structural biology. The AlphaFold model, which uses a transformer architecture to achieve atomic-level accuracy in protein structure prediction, was a significant breakthrough. However, training and inference of the AlphaFold model are challenging due to its high computation and memory cost. In this work, we present FastFold, an efficient implementation of AlphaFold for both training and inference. We propose Dynamic Axial Parallelism (DAP) as a novel model parallelism method. Additionally, we implement a series of low-level optimizations aimed at reducing communication, computation, and memory costs, including Duality Async Operations, highly optimized kernels, and AutoChunk (an automated search algorithm that finds the best chunking strategy to reduce memory peaks). Experimental results show that FastFold efficiently scales to more GPUs using DAP, reduces overall training time from 11 days to 67 hours, and achieves 7.5–9.5× speedup for long-sequence inference. Furthermore, AutoChunk reduces memory cost by over 80% during inference by automatically partitioning the intermediate tensors during computation.


AGAThA: Fast and Efficient GPU Acceleration of Guided Sequence Alignment for Long Read Mapping

  • Seongyeon Park
  • Junguk Hong
  • Jaeyong Song
  • Hajin Kim
  • Youngsok Kim
  • Jinho Lee

With advances in genome sequencing technology, the lengths of deoxyribonucleic acid (DNA) sequencing results are rapidly increasing at lower prices than ever. However, the longer lengths come at the cost of a heavy computational burden for aligning them; for example, aligning sequences to a human reference genome can take tens or even hundreds of hours. The current de facto standard approach for alignment is based on the guided dynamic programming method. Although this takes a long time and could potentially benefit from high-throughput graphics processing units (GPUs), existing GPU-accelerated approaches often compromise the algorithm's structure, due to the GPU-unfriendly nature of its computational pattern. Unfortunately, such algorithmic compromises are not tolerable in the field, because sequence alignment is part of complicated bioinformatics analysis pipelines. In these circumstances, we propose AGAThA, an exact and efficient GPU-based acceleration of guided sequence alignment. We diagnose and address the aspects of the algorithm that are unfriendly to GPUs: strided/redundant memory accesses and workload imbalances that are difficult to predict. In experiments on modern GPUs, AGAThA achieves 18.8× speedup over the CPU-based baseline, 9.6× over the best GPU-based baseline, and 3.6× over GPU-based algorithms with different heuristics.


POSTER SESSION: Posters


POSTER: Accelerating High-Precision Integer Multiplication used in Cryptosystems with GPUs

  • Zhuoran Ji
  • Zhaorui Zhang
  • Jiming Xu
  • Lei Ju

High-precision integer multiplication is crucial in privacy-preserving computational techniques but poses acceleration challenges on GPUs due to its complexity and the diverse bit lengths in cryptosystems. This paper introduces GIM, an efficient high-precision integer multiplication algorithm accelerated with GPUs. It employs a novel segmented integer multiplication algorithm that separates implementation details from bit length, facilitating code optimizations. We also present a computation diagram to analyze parallelization strategies, leading to a series of enhancements. Experiments demonstrate that this approach achieves a 4.47× speedup over the commonly used baseline.
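
For background, the schoolbook limb-by-limb product that such GPU algorithms parallelize looks as follows (GIM's segmented algorithm and computation diagram are not reproduced here; this is the standard sequential baseline):

    // Schoolbook multiplication of little-endian base-2^32 big integers:
    // every limb product a[i]*b[j] is accumulated into result limb i+j,
    // with carries propagated in 64-bit arithmetic.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    std::vector<uint32_t> bigmul(const std::vector<uint32_t>& a,
                                 const std::vector<uint32_t>& b) {
      std::vector<uint32_t> r(a.size() + b.size(), 0);
      for (size_t i = 0; i < a.size(); ++i) {
        uint64_t carry = 0;
        for (size_t j = 0; j < b.size(); ++j) {
          uint64_t cur = (uint64_t)a[i] * b[j] + r[i + j] + carry;
          r[i + j] = (uint32_t)cur;
          carry = cur >> 32;
        }
        r[i + b.size()] += (uint32_t)carry;
      }
      return r;
    }

    int main() {
      auto r = bigmul({0xffffffffu, 0xffffffffu}, {2u});  // (2^64 - 1) * 2
      for (auto it = r.rbegin(); it != r.rend(); ++it)
        std::cout << std::hex << *it << ' ';   // 1 ffffffff fffffffe
    }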


POSTER: Enabling Extreme-Scale Phase Field Simulation with In-situ Feature Extraction

  • Zhichen Feng
  • Jialin Li
  • Yaqian Gao
  • Shaobo Tian
  • Huang Ye
  • Jian Zhang

In this paper, we present an integrated framework composed of a highly efficient phase field simulator and an in-situ feature extraction library. This novel framework enables us to conduct extreme-scale micro-structure evolution simulations while the characteristic features of each individual grain are extracted on the fly. After systematic design and optimization on the new generation Sunway supercomputer, the code scales up to 39 million cores and achieves 582 PFlops in double precision and 637 POps in mixed precision.


POSTER: FineCo: Fine-grained Heterogeneous Resource Management for Concurrent DNN Inferences

  • Lixian Ma
  • Haoruo Chen
  • En Shao
  • Leping Wang
  • Quan Chen
  • Guangming Tan

Co-locating multiple DNN serving workloads to share a GPU is widely used to improve resource utilization while guaranteeing user QoS. However, existing GPU sharing mechanisms are restricted to the model level, and fluctuations in kernel-level resource demands expose the suboptimal utilization of such mechanisms. We design FineCo, a multi-DNN serving system that leverages a novel fine-grained resource sharing mechanism to optimize concurrent inference without modifications to the hardware or operating system. Our prototype implementation demonstrates that FineCo achieves up to 40% throughput improvement over the state of the art.


POSTER: Optimizing Collective Communications with Error-bounded Lossy Compression for GPU Clusters

  • Jiajun Huang
  • Sheng Di
  • Xiaodong Yu
  • Yujia Zhai
  • Jinyang Liu
  • Yafan Huang
  • Ken Raffenetti
  • Hui Zhou
  • Kai Zhao
  • Zizhong Chen
  • Franck Cappello
  • Yanfei Guo
  • Rajeev Thakur

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, existing approaches integrate lossy compression directly into GPU-aware collectives, but they still suffer from serious problems such as underutilized GPU devices and uncontrolled data distortion. In this paper, we propose GPU-LCC, a general framework that designs and optimizes GPU-aware, compression-enabled collectives with well-controlled error propagation. To validate our framework, we evaluate its performance on up to 64 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our GPU-LCC-accelerated collective computation (Allreduce) can outperform NCCL and Cray MPI by up to 3.4× and 18.7×, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high quality of the reconstructed data under our accuracy-aware framework.


POSTER: Optimizing Sparse Tensor Contraction with Revisiting Hash Table Design

  • Guofeng Feng
  • Weile Jia
  • Ninghui Sun
  • Guangming Tan
  • Jiajia Li

Sparse tensor contraction (SpTC) is an essential operation in high-performance applications. The high dimensionality of sparse tensors makes SpTC fundamentally challenging in aspects such as costly multidimensional index search, extensive intermediate output data, and indirect addressing. Previous state-of-the-art work addresses some of these challenges through a hash-table implementation. In this paper, we propose a fully optimized hash-table-based SpTC: we provide a more carefully customized hash table design, propose an architecture-aware algorithm for hash table selection with size prediction, and apply cross-stage optimizations to exploit shared information and avoid redundant operations. Evaluated on a set of tensors extracted from real-world applications, our method achieves superior speedups and substantially reduces the memory footprint compared to the current state-of-the-art work.
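
The hash-table accumulation pattern at the core of hash-based SpTC can be sketched as follows (the paper's customized table, size prediction, and cross-stage optimizations are not shown; the index-packing scheme below is an illustrative stand-in):

    // Accumulating partial products of a sparse contraction: multidimensional
    // output indices are packed into a single key, and products landing on
    // the same output coordinate are summed in a hash table.
    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    int main() {
      std::unordered_map<uint64_t, double> acc;
      auto key = [](uint32_t i, uint32_t j) {   // pack a 2-D index into a word
        return (uint64_t(i) << 32) | j;
      };
      acc[key(3, 7)] += 1.5;   // two partial products hit output entry (3,7)
      acc[key(3, 7)] += 2.0;
      acc[key(4, 0)] += 1.0;
      for (const auto& [k, v] : acc)
        std::cout << "C(" << (k >> 32) << "," << uint32_t(k) << ") = " << v << "\n";
    }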


POSTER: LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

  • Juntao Zhao
  • Borui Wan
  • Chuan Wu
  • Yanghua Peng
  • Haibin Lin

The immense sizes of large language models (LLMs) have led to high resource demand and cost for running them. Though LLMs are largely served on uniform clusters of high-capacity GPUs nowadays, utilizing a heterogeneous cluster with a mix of available high- and low-capacity GPUs can potentially reduce the serving cost substantially. This paper proposes LLM-PQ, a system that advocates adaptive model quantization and phase-aware partition to improve LLM serving efficiency on heterogeneous GPU clusters. Extensive experiments on production inference workloads demonstrate throughput improvements in inference, showing great advantages over state-of-the-art works.
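
For background, uniform symmetric quantization, the per-layer primitive whose bit-width an adaptive scheme selects, works as follows (a generic sketch, not LLM-PQ's specific quantizer; the weights are made up):

    // Uniform symmetric int8 quantization of a weight tensor: scale by the
    // maximum magnitude, round to an 8-bit integer, and dequantize by
    // multiplying the scale back.
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
      std::vector<float> w = {-0.62f, 0.05f, 0.31f, -0.11f};
      float maxabs = 0;
      for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
      float scale = maxabs / 127.0f;                // int8 range [-127, 127]
      for (float x : w) {
        int8_t q = (int8_t)std::lround(x / scale);  // quantize
        std::cout << x << " -> " << int(q) << " -> " << q * scale << "\n";
      }
    }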


POSTER: OCToPus: Semantic-aware Concurrency Control for Blockchain Transactions

  • dePaul Miller
  • Henry F. Korth
  • Roberto Palmieri

Many blockchain implementations offer APIs exclusively for sending and receiving money between accounts. In this paper, we introduce OCToPus, a deterministic concurrency control scheme that uses a semantic-aware fast path and a GPU-accelerated, directed-acyclic-graph-based fallback path to aggressively parallelize the execution of a block.


POSTER: Pattern-Aware Sparse Communication for Scalable Recommendation Model Training

  • Jiaao He
  • Shengqi Chen
  • Jidong Zhai

Recommendation models are an important category of deep learning models whose sizes have grown enormous. They consist of a sparse part with TBs of memory footprint and a dense part that demands PFLOPs of computing capability to train. Unfortunately, the high sparse-communication cost of re-organizing data between the different parallel strategies of the two parts impedes scalability in training.


Based on observations of sparse access patterns, we design a two-fold fine-grained parallel strategy to accelerate sparse communication. A performance model is built to select an optimal set of items to replicate across all GPUs, so that all-to-all communication volume is reduced while memory consumption stays acceptable. The all-to-all overhead is further reduced by parallel scheduling techniques. In our evaluation on 32 GPUs over real-world datasets, a 2.16–16.8× end-to-end speedup is achieved over the baselines.
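
A greedy stand-in for the replication step might look as follows (the paper selects the set with a performance model; the frequency-sorted rule and all names here are invented for illustration):

    // Pick "hot" sparse items to replicate on every GPU: sort by access
    // frequency and replicate the most frequent items until a memory budget
    // is exhausted, so those items drop out of all-to-all exchanges.
    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <utility>
    #include <vector>

    std::vector<int> pick_replicated(std::vector<std::pair<int, long>> item_freq,
                                     size_t budget) {
      std::sort(item_freq.begin(), item_freq.end(),
                [](const auto& a, const auto& b) { return a.second > b.second; });
      std::vector<int> rep;
      for (const auto& [item, freq] : item_freq) {
        if (rep.size() == budget) break;
        rep.push_back(item);   // replicated: served locally on every GPU
      }
      return rep;
    }

    int main() {
      auto rep = pick_replicated({{0, 9100}, {1, 12}, {2, 7800}, {3, 40}}, 2);
      for (int i : rep) std::cout << i << ' ';   // 0 2
    }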


POSTER: ParGNN: Efficient Training for Large-Scale Graph Neural Network on GPU Clusters

  • Shunde Li
  • Junyu Gu
  • Jue Wang
  • Tiechui Yao
  • Zhiqiang Liang
  • Yumeng Shi
  • Shigang Li
  • Weiting Xi
  • Shushen Li
  • Chunbao Zhou
  • Yangang Wang
  • Xuebin Chi

Full-batch graph neural network (GNN) training is essential for interdisciplinary applications. To train GNNs on large-scale graphs, the graph data is usually divided into subgraphs and distributed across multiple compute units. The state-of-the-art load balancing methods, based on direct graph partitioning, are too coarse to achieve true load balance on GPU clusters. We propose ParGNN, which employs a profiler-guided load balancing workflow in conjunction with graph repartitioning to alleviate load imbalance and minimize communication traffic. Experiments verify that ParGNN can scale to larger clusters.


POSTER: RadiK: Scalable Radix Top-K Selection on GPUs

  • Yifei Li
  • Bole Zhou
  • Jiejing Zhang
  • Xuechao Wei
  • Yinghan Li
  • Yingda Chen

By identifying the k largest or smallest elements in a set of data, top-k selection is critical for modern high-performance databases and machine learning systems, especially with large data volumes. However, previous studies on its GPU implementation are mostly merge-based and rely heavily on the high-speed but size-limited on-chip memory, thereby resulting in a restricted upper bound on k. This paper introduces RadiK, a highly optimized GPU-parallel radix top-k selection that is scalable with k, input length, and batch size. With a carefully designed optimization framework targeting high memory bandwidth and resource utilization, RadiK supports far larger k than the prior art, achieving up to 2.5× speedup for non-batch queries and up to 4.8× speedup for batch queries. We also propose a lightweight refinement that strengthens the robustness of RadiK against skewed distributions by adaptively scaling the input elements.
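
The radix-selection strategy that RadiK parallelizes can be sketched sequentially (illustrative only; assumes 1 ≤ k ≤ n): histogram the current 8-bit digit, locate the bucket containing the k-th largest key, and recurse into that boundary bucket alone.

    // Sequential radix top-k selection: returns the k-th largest value;
    // every key >= that value belongs to the top-k.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    uint32_t radix_kth_largest(std::vector<uint32_t> keys, size_t k) {
      uint32_t prefix = 0;
      for (int shift = 24; shift >= 0; shift -= 8) {
        size_t hist[256] = {};
        for (uint32_t x : keys) ++hist[(x >> shift) & 0xff];
        int d = 255;                       // walk buckets from high to low
        for (; d >= 0; --d) {
          if (hist[d] >= k) break;         // k-th largest lives in bucket d
          k -= hist[d];
        }
        prefix |= (uint32_t)d << shift;
        std::vector<uint32_t> next;        // keep only the boundary bucket
        for (uint32_t x : keys)
          if (((x >> shift) & 0xff) == (uint32_t)d) next.push_back(x);
        keys.swap(next);
      }
      return prefix;
    }

    int main() {
      std::vector<uint32_t> v = {10, 99, 42, 7, 77, 99, 3};
      std::cout << radix_kth_largest(v, 3) << "\n";   // prints 77
    }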


POSTER: RELAX: Durable Data Structures with Swift Recovery

  • Almog Zur
  • Nachshon Cohen
  • Michal Friedman
  • Erez Petrank

Recent non-volatile main memory technology gave rise to an abundance of research on building persistent data structures, whose content can be recovered after a system crash. While there has been significant progress in making durable data structures efficient, shortening the length of the recovery phase after a crash has not received much attention. In this paper we present the RELAX general transformation. RELAX generates lock-free durable data structures that provide the best of both worlds: almost zero recovery time and high performance.


POSTER: StructMG: A Fast and Scalable Structured Multigrid

  • Yi Zong
  • Xinliang Wang
  • Haopeng Huang
  • Chensong Zhang
  • Xiaowen Xu
  • Jian Sun
  • Bowen Yan
  • Qin Wang
  • Sicong Li
  • Zhaohui Ding
  • Wei Xue

Parallel multigrid is widely used as a preconditioner for solving large-scale sparse linear systems. However, current multigrid libraries still deliver unsatisfactory speed and scalability for structured-grid problems. To this end, we design and implement StructMG, a fast and scalable structured multigrid that constructs hierarchical grids automatically from the original matrix. As a preconditioner, StructMG achieves both low cost per iteration and good convergence. Two idealized and five real-world problems from four application fields, including radiation hydrodynamics, petroleum reservoir simulation, numerical weather prediction, and solid mechanics, are evaluated on ARM and X86 platforms. Compared to hypre's multigrid preconditioners, StructMG achieves the fastest time-to-solution in all cases, with average speedups of 17.6×, 5.7×, 4.6×, and 8.5× over SMG, PFMG, SysPFMG, and BoomerAMG, respectively. StructMG also significantly improves strong and weak scaling efficiencies in most tests.

\ No newline at end of file
diff --git a/_data/OpenTOC.yaml b/_data/OpenTOC.yaml
index 57bff15..75943cb 100644
--- a/_data/OpenTOC.yaml
+++ b/_data/OpenTOC.yaml
@@ -1361,3 +1361,11 @@
   event: PEPM
   year: 2024
   title: "Proceedings of the 2024 ACM SIGPLAN International Workshop on Partial Evaluation and Program Manipulation"
+-
+  event: PPoPP
+  year: 2024
+  title: "Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming"
+-
+  event: CC
+  year: 2024
+  title: "Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction"