Search Results

StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems
2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). Conference Paper.
Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes ...

FBLAS: Streaming Linear Algebra on FPGA
(2020) SC20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Conference Paper.
Spatial computing architectures pose an attractive alternative to mitigate control and data movement overheads typical of load-store architectures. In practice, these devices are rarely considered in the HPC community due to the steep learning curve, low productivity, and the lack of available libraries for fundamental operations. High-level synthesis (HLS) tools are facilitating hardware programming, but optimizing for these architectures ...

Chimera: Efficiently training large-scale neural networks with bidirectional pipelines
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, making it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles ...

On the parallel I/O optimality of linear algebra kernels: Near-optimal matrix factorizations
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
Matrix factorizations are among the most important building blocks of scientific computing. However, state-of-the-art libraries are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra ...

Distributed quantum computing with QMPI
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
Practical applications of quantum computers require millions of physical qubits, and it will be challenging for individual quantum processors to reach such qubit numbers. It is therefore timely to investigate the resource requirements of quantum algorithms in a distributed setting, where multiple quantum processors are interconnected by a coherent network. We introduce an extension of the Message Passing Interface (MPI) to enable ...

Productivity, portability, performance: Data-centric Python
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. In this work, we present a workflow that retains Python's high ...

Clairvoyant prefetching for distributed machine learning I/O
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing this I/O bottleneck necessitates careful optimization, as optimal data ingestion pipelines differ between systems, and require a delicate balance between access to local storage, external filesystems, and remote nodes. We introduce NoPFS, a machine learning ...

An Efficient Algorithm for Sparse Quantum State Preparation
(2021) 2021 58th ACM/IEEE Design Automation Conference (DAC). Conference Paper.
Generating quantum circuits that prepare specific states is an essential part of quantum compilation. Algorithms that solve this problem for general states generate circuits that grow exponentially in the number of qubits. However, in contrast to general states, many practically relevant states are sparse in the standard basis. In this paper we show how sparsity can be used for efficient state preparation. We present a polynomial-time ...

Asynchronous Distributed-Memory Triangle Counting and LCC with RMA Caching
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Conference Paper.
Triangle count and local clustering coefficient are two core metrics for graph analysis. They find broad application in analyses such as community detection and link recommendation. To cope with the computational and memory demands that stem from the size of today's graph datasets, distributed-memory algorithms have to be developed. Current state-of-the-art solutions suffer from synchronization overheads or expensive pre-computations ...

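As background for the entry above: triangle counting and the local clustering coefficient are standard graph metrics. A minimal single-node sketch of the two (illustrative only; the paper's contribution is a distributed-memory algorithm with RMA caching, which this does not reproduce) could look like:

```python
from itertools import combinations

def triangle_counts(adj):
    """Per-vertex triangle counts for an undirected graph given as
    {vertex: set of neighbors}. A triangle at v is a pair of
    neighbors of v that are themselves adjacent."""
    counts = {v: 0 for v in adj}
    for v in adj:
        for u, w in combinations(adj[v], 2):
            if w in adj[u]:
                counts[v] += 1
    return counts

def local_clustering(adj):
    """Local clustering coefficient (LCC): triangles at v divided by
    the number of possible edges among v's neighbors, d*(d-1)/2."""
    tri = triangle_counts(adj)
    lcc = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        lcc[v] = 2 * tri[v] / (d * (d - 1)) if d > 1 else 0.0
    return lcc
```

For example, a triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2 gives vertex 0 an LCC of 1.0 and vertex 2 an LCC of 1/3.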
Flare: Flexible in-network allreduce
(2021) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). Conference Paper.
The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, which aggregate the data received from the hosts and send back the aggregated result. However, existing solutions provide limited customization opportunities and might provide suboptimal performance ...

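For context on the entry above: allreduce combines one value per rank under a reduction operator and distributes the result back to every rank. A pure-Python simulation of the classic host-side recursive-doubling schedule (textbook background only; Flare's point is to move this aggregation into network switches) might look like:

```python
def allreduce_recursive_doubling(values, op=lambda a, b: a + b):
    """Simulate recursive-doubling allreduce over len(values) ranks.
    At step k, rank r exchanges its partial result with rank r XOR 2^k
    and combines the two; after log2(n) steps every rank holds the
    full reduction. Assumes a power-of-two rank count."""
    n = len(values)
    assert n & (n - 1) == 0, "power-of-two rank count assumed"
    vals = list(values)
    step = 1
    while step < n:
        new = list(vals)
        for r in range(n):
            partner = r ^ step  # pairwise exchange partner this step
            new[r] = op(vals[r], vals[partner])
        vals = new
        step *= 2
    return vals
```

With four ranks holding 1, 2, 3, 4, every rank ends up with the sum 10 after two exchange steps; any associative, commutative operator (e.g. max) can be substituted for addition.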