Search Results

StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). Conference Paper.
Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes ...

FBLAS: Streaming Linear Algebra on FPGA
(2020) SC20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Conference Paper.
Spatial computing architectures pose an attractive alternative to mitigate control and data movement overheads typical of load-store architectures. In practice, these devices are rarely considered in the HPC community due to the steep learning curve, low productivity, and the lack of available libraries for fundamental operations. High-level synthesis (HLS) tools are facilitating hardware programming, but optimizing for these architectures ...

Chimera: Efficiently training large-scale neural networks with bidirectional pipelines
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, making it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles ...

On the parallel I/O optimality of linear algebra kernels: Near-optimal matrix factorizations
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
Matrix factorizations are among the most important building blocks of scientific computing. However, state-of-the-art libraries are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra ...

Distributed quantum computing with QMPI
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
Practical applications of quantum computers require millions of physical qubits, and it will be challenging for individual quantum processors to reach such qubit numbers. It is therefore timely to investigate the resource requirements of quantum algorithms in a distributed setting, where multiple quantum processors are interconnected by a coherent network. We introduce an extension of the Message Passing Interface (MPI) to enable ...

Productivity, portability, performance: Data-centric Python
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. In this work, we present a workflow that retains Python's high ...

Clairvoyant prefetching for distributed machine learning I/O
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing this I/O bottleneck necessitates careful optimization, as optimal data ingestion pipelines differ between systems, and require a delicate balance between access to local storage, external filesystems, and remote nodes. We introduce NoPFS, a machine learning ...

An Efficient Algorithm for Sparse Quantum State Preparation
(2021) 2021 58th ACM/IEEE Design Automation Conference (DAC). Conference Paper.
Generating quantum circuits that prepare specific states is an essential part of quantum compilation. Algorithms that solve this problem for general states generate circuits that grow exponentially in the number of qubits. However, in contrast to general states, many practically relevant states are sparse in the standard basis. In this paper we show how sparsity can be used for efficient state preparation. We present a polynomial-time ...

Asynchronous Distributed-Memory Triangle Counting and LCC with RMA Caching
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Conference Paper.
Triangle count and local clustering coefficient are two core metrics for graph analysis. They find broad application in analyses such as community detection and link recommendation. To cope with the computational and memory demands that stem from the size of today's graph datasets, distributed-memory algorithms have to be developed. Current state-of-the-art solutions suffer from synchronization overheads or expensive pre-computations ...

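As background for the entry above: triangle counting and the local clustering coefficient are standard graph metrics. A minimal single-node sketch of the two (illustrative only; the paper's contribution is a distributed-memory algorithm with RMA caching, which this does not reproduce) could look like:

```python
from itertools import combinations

def triangle_counts(adj):
    """Per-vertex triangle counts for an undirected graph given as
    {vertex: set of neighbors}. A triangle at v is a pair of
    neighbors of v that are themselves adjacent."""
    counts = {v: 0 for v in adj}
    for v in adj:
        for u, w in combinations(adj[v], 2):
            if w in adj[u]:
                counts[v] += 1
    return counts

def local_clustering(adj):
    """Local clustering coefficient (LCC): triangles at v divided by
    the number of possible edges among v's neighbors, d*(d-1)/2."""
    tri = triangle_counts(adj)
    lcc = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        lcc[v] = 2 * tri[v] / (d * (d - 1)) if d > 1 else 0.0
    return lcc
```

For example, a triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2 gives vertex 0 an LCC of 1.0 and vertex 2 an LCC of 1/3.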
Flare: Flexible in-network allreduce
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, which aggregate the data received from the hosts and send back the aggregated result. However, existing solutions provide limited customization opportunities and might provide suboptimal performance ...

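For context on the entry above: allreduce combines one value per rank under a reduction operator and distributes the result back to every rank. A pure-Python simulation of the classic host-side recursive-doubling schedule (textbook background only; Flare's point is to move this aggregation into network switches) might look like:

```python
def allreduce_recursive_doubling(values, op=lambda a, b: a + b):
    """Simulate recursive-doubling allreduce over len(values) ranks.
    At step k, rank r exchanges its partial result with rank r XOR 2^k
    and combines the two; after log2(n) steps every rank holds the
    full reduction. Assumes a power-of-two rank count."""
    n = len(values)
    assert n & (n - 1) == 0, "power-of-two rank count assumed"
    vals = list(values)
    step = 1
    while step < n:
        new = list(vals)
        for r in range(n):
            partner = r ^ step  # pairwise exchange partner this step
            new[r] = op(vals[r], vals[partner])
        vals = new
        step *= 2
    return vals
```

With four ranks holding 1, 2, 3, 4, every rank ends up with the sum 10 after two exchange steps; any associative, commutative operator (e.g. max) can be substituted for addition.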