Journal: IEEE Transactions on Parallel and Distributed Systems

Abbreviation

IEEE Trans. Parallel Distrib. Syst.

Publisher

IEEE

ISSN

1045-9219
1558-2183
2161-9883

Search Results

Publications 1 - 10 of 44
  • Weakley, Le Mai; Robinson, Tim (2023)
    IEEE Transactions on Parallel and Distributed Systems
  • Glaser, Florian; Tagliavini, Giuseppe; Rossi, Davide; et al. (2021)
    IEEE Transactions on Parallel and Distributed Systems
    The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the Internet-of-Things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance on the order of GOPS and over 100 GOPS/W of energy efficiency. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. To this end, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this article, we describe a lightweight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22 nm FDX technology and evaluated performance and energy efficiency with tunable microbenchmarks and a set of real-life applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41× smaller than the baseline implementation based on fast test-and-set access to L1 memory when constraining the microbenchmarks to 10 percent synchronization overhead. When evaluated on the real-life DSP applications, the proposed SCU improves performance by up to 92 percent (23 percent on average) and energy efficiency by up to 98 percent (39 percent on average).
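    The baseline the SCU is compared against builds synchronization out of test-and-set accesses to shared L1 memory. A minimal sketch of such a counter-based barrier, with Python's threading primitives standing in for the RISC-V atomics (the class and method names are illustrative, not from the paper):

```python
import threading

class TasBarrier:
    """Single-use barrier built from a lock (playing the role of a
    test-and-set flag) and a shared arrival counter."""
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.lock = threading.Lock()      # "test-and-set" critical section
        self.release = threading.Event()

    def wait(self):
        with self.lock:                   # atomically bump the counter
            self.count += 1
            arrived = self.count
        if arrived == self.n:             # last PE to arrive releases all
            self.release.set()
        else:
            self.release.wait()           # others wait until released
```

    Each waiter spends time in the locked section, which is why the paper measures how small a synchronization-free region can get before this software scheme dominates runtime.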
  • Mastoras, Aristeidis; Gross, Thomas (2018)
    IEEE Transactions on Parallel and Distributed Systems
  • Burger, Manuel; Kleine, Jan (2021)
    IEEE Transactions on Parallel and Distributed Systems
    This report analyzes the reproducibility of the article 'Computing Planetary Interior Normal Modes with A Highly Parallel Polynomial Filtering Eigensolver' by Jia Shi et al. (Shi, 2018). To reproduce the results we perform different weak and strong scaling studies using a series of Mars models. All experimental runs were performed during the SC19 Student Cluster Competition on a four node Intel Skylake cluster. We show that the findings of the original article can be reproduced in a different environment.
  • Besta, Maciej; Fischer, Marc; Kalavri, Vasiliki; et al. (2023)
    IEEE Transactions on Parallel and Distributed Systems
    Graph processing has become an important part of various areas of computing, including machine learning, medical applications, social network analysis, computational sciences, and others. A growing amount of the associated graph processing workloads are dynamic, with millions of edges added or removed per second. Graph streaming frameworks are specifically crafted to enable the processing of such highly dynamic workloads. Recent years have seen the development of many such frameworks. However, they differ in their general architectures (with key details such as the support for the concurrent execution of graph updates and queries, or the incorporated graph data organization), the types of updates and workloads allowed, and many others. To facilitate the understanding of this growing field, we provide the first analysis and taxonomy of dynamic and streaming graph processing. We focus on identifying the fundamental system designs and on understanding their support for concurrency, and for different graph updates as well as analytics workloads. We also crystallize the meaning of different concepts associated with streaming graph processing, such as dynamic, temporal, online, and time-evolving graphs, edge-centric processing, models for the maintenance of updates, and graph databases. Moreover, we provide a bridge with the very rich landscape of graph streaming theory by giving a broad overview of recent theoretical related advances, and by discussing which graph streaming models and settings could be helpful in developing more powerful streaming frameworks and designs. We also outline graph streaming workloads and research challenges.
  • Kishani, Mostafa; Ahmadi, Sina; Ahmadian, Saba; et al. (2025)
    IEEE Transactions on Parallel and Distributed Systems
    Hyperconverged Infrastructures (HCIs) combine processing and storage elements to meet the requirements of data-intensive applications in performance, scalability, and quality of service. As an emerging paradigm, HCI should couple with a variety of traditional performance improvement approaches such as I/O caching in virtualized platforms. Contemporary I/O caching schemes are optimized for traditional single-node storage architectures and suffer from two major shortcomings for multi-node architectures: a) imbalanced cache space requirement and b) imbalanced I/O traffic and load. This makes existing schemes inefficient in distributing cache resources over an array of separate physical nodes. In this paper, we propose an Efficient and Load Balanced I/O Cache Architecture (ELICA), managing the solid-state drive (SSD) cache resources across HCI nodes to enhance I/O performance. ELICA dynamically reconfigures and distributes the SSD cache resources throughout the array of HCI nodes and also balances the network traffic and I/O cache load by dynamic reallocation of cache resources. To maximize the performance, we further present an optimization problem defined by Integer Linear Programming to efficiently distribute cache resources and balance the network traffic and I/O cache relocations. Our experimental results on a real platform show that ELICA improves quality of service in terms of average and worst-case latency in HCIs by 3.1× and 23%, respectively, compared to the state-of-the-art.
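    The resource-distribution question that ELICA answers with an ILP can be illustrated with a much simpler greedy stand-in: split a fixed SSD cache budget across nodes in proportion to each node's observed I/O load. The function name and the purely proportional policy are illustrative assumptions, not the paper's formulation:

```python
# Illustrative sketch (not ELICA itself): distribute a fixed SSD cache
# budget across HCI nodes in proportion to each node's I/O load,
# a greedy simplification of the paper's ILP-based allocation.

def distribute_cache(total_blocks, node_loads):
    """Assign cache blocks proportionally to load; leftover blocks from
    integer rounding go to the most loaded nodes so the full budget is used."""
    total_load = sum(node_loads)
    shares = [total_blocks * load // total_load for load in node_loads]
    leftover = total_blocks - sum(shares)
    for i in sorted(range(len(node_loads)), key=lambda i: -node_loads[i])[:leftover]:
        shares[i] += 1
    return shares

print(distribute_cache(1000, [50, 30, 20]))  # -> [500, 300, 200]
```

    The actual ILP additionally accounts for network traffic and the cost of relocating cached blocks between nodes, which a one-shot proportional split ignores.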
  • Bridi, Thomas; Bartolini, Andrea; Lombardi, Michele; et al. (2016)
    IEEE Transactions on Parallel and Distributed Systems
  • Li, Shigang; Ben-Nun, Tal; Nadiradze, Giorgi; et al. (2021)
    IEEE Transactions on Parallel and Distributed Systems
    Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1× on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).
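    The core idea behind the subgroup exchange can be sketched in a few lines: rather than a global allreduce over all workers, each group of workers averages weights only among itself. This is a minimal illustration of group model averaging, not the paper's wait-avoiding implementation (which also overlaps communication with computation):

```python
# Minimal sketch of subgroup weight averaging: workers are split into
# fixed-size groups, and each group averages its weights internally,
# replacing a global allreduce with cheaper per-group communication.

def group_average(weights, group_size):
    """weights: one scalar (or vector) per worker; each group of
    `group_size` consecutive workers averages among themselves."""
    averaged = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        mean = sum(group) / len(group)
        averaged.extend([mean] * len(group))
    return averaged

print(group_average([1.0, 3.0, 5.0, 7.0], 2))  # -> [2.0, 2.0, 6.0, 6.0]
```

    With rotating group membership across iterations, information still diffuses to all workers over time, which is what the convergence proof has to account for.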
  • Scheffler, Paul; Zaruba, Florian; Schuiki, Fabian; et al. (2023)
    IEEE Transactions on Parallel and Distributed Systems
    Sparse linear algebra is crucial in many application domains, but challenging to handle efficiently in both software and hardware, with one- and two-sided operand sparsity handled with distinct approaches. In this work, we enhance an existing memory-streaming RISC-V ISA extension to accelerate both one- and two-sided operand sparsity on widespread sparse tensor formats like compressed sparse row (CSR) and compressed sparse fiber (CSF) by accelerating the underlying operations of streaming indirection, intersection, and union. Our extensions enable single-core speedups over an optimized RISC-V baseline of up to 7.0×, 7.7×, and 9.8× on sparse-dense multiply, sparse-sparse multiply, and sparse-sparse addition, respectively, and peak FPU utilizations of up to 80% on sparse-dense problems. On an eight-core cluster, sparse-dense and sparse-sparse matrix-vector multiply using real-world matrices are up to 4.9× and 5.9× faster and up to 2.9× and 3.0× more energy efficient. We explore further applications for our extensions, such as stencil codes and graph pattern matching. Compared to recent CPU, GPU, and accelerator approaches, our extensions enable higher flexibility on data representation, degree of sparsity, and dataflow at a minimal hardware footprint, adding only 1.8% in area to a compute cluster. A cluster with our extensions running CSR matrix-vector multiplication achieves 9.9× and 1.7× higher peak floating-point utilizations than recent highly optimized sparse data structures and libraries for CPU and GPU, respectively, even when accounting for off-chip main memory (HBM) and on-chip interconnect latency and bandwidth effects.
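    The streaming-intersection primitive at the heart of sparse-sparse multiply is a two-pointer merge over the sorted column indices of two CSR rows, multiplying values only where indices coincide. A software sketch of that primitive (the function name is illustrative; the paper moves this loop into the memory-streaming hardware):

```python
# Streaming intersection of two sparse vectors: walk both sorted index
# lists in lockstep, accumulating products where the indices match.

def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given as sorted, parallel
    (index, value) arrays, as in one row/column of a CSR matrix."""
    i = j = 0
    acc = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            acc += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1   # advance the stream with the smaller index
        else:
            j += 1
    return acc

print(sparse_dot([0, 2, 5], [1.0, 2.0, 3.0], [2, 3, 5], [4.0, 5.0, 6.0]))
# -> 26.0  (2.0*4.0 at index 2 + 3.0*6.0 at index 5)
```

    Sparse-sparse addition needs the analogous union merge, which emits an element whenever either stream has one; these index-comparison loops are exactly what the extension offloads so the FPU stays busy.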
  • Chen, Jinfan; Li, Shigang; Guo, Ran; et al. (2024)
    IEEE Transactions on Parallel and Distributed Systems
    Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline parallelism, which exerts heavy burden on machine learning practitioners. To this end, we propose AutoDDL, a distributed training framework that automatically explores and exploits new parallelization schemes with near-optimal bandwidth cost. AutoDDL facilitates the description and implementation of different schemes by utilizing OneFlow's Split, Broadcast, and Partial Sum (SBP) abstraction. AutoDDL is equipped with an analytical performance model combined with a customized Coordinate Descent algorithm, which significantly reduces the scheme searching overhead. We conduct evaluations on Multi-Node-Single-GPU and Multi-Node-Multi-GPU machines using different models, including VGG and Transformer. Compared to the expert-optimized implementations, AutoDDL reduces the end-to-end training time by up to 31.1% and 10% for Transformer and up to 17.7% and 71.5% for VGG on the two parallel systems, respectively.
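    The scheme search that AutoDDL performs with a customized coordinate descent can be illustrated on a toy two-dimensional space of parallelization degrees. Everything here is a hypothetical stand-in: the cost function is invented for illustration, and the real search optimizes an analytical bandwidth-cost model over SBP layouts:

```python
# Toy coordinate-descent search over a 2-D parallelization scheme
# (data-parallel degree d, model-parallel degree m): sweep one axis at
# a time, keeping the value that minimizes the cost, until convergence.

def coordinate_descent(cost, dims, start):
    """dims: list of candidate values per coordinate; returns a local
    minimum of `cost` reachable by single-coordinate moves."""
    point = list(start)
    improved = True
    while improved:
        improved = False
        for axis, choices in enumerate(dims):
            best = min(choices,
                       key=lambda v: cost(point[:axis] + [v] + point[axis + 1:]))
            if best != point[axis]:
                point[axis] = best
                improved = True
    return point

# 16 GPUs total: pick degrees whose product fits the machine, under a
# made-up communication cost that rewards data parallelism.
dims = [[1, 2, 4, 8, 16], [1, 2, 4, 8, 16]]
cost = lambda p: float("inf") if p[0] * p[1] > 16 else 100 / p[0] + 20 * p[1]
print(coordinate_descent(cost, dims, [1, 1]))  # -> [16, 1]
```

    Because each sweep evaluates only one axis at a time, the search visits far fewer schemes than exhaustive enumeration, which is the source of the reduced searching overhead the abstract mentions.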