Journal: ACM Transactions on Reconfigurable Technology and Systems

Abbreviation

ACM Trans. Reconfig. Technol. Syst.

Publisher

Association for Computing Machinery

ISSN

1936-7406
1936-7414

Search Results

Publications 1 - 10 of 13
  • Serre, François; Püschel, Markus (2019)
    ACM Transactions on Reconfigurable Technology and Systems
  • Kara, Kaan; Alonso, Gustavo (2021)
    ACM Transactions on Reconfigurable Technology and Systems
    Data processing systems based on FPGAs offer high performance and energy efficiency for a variety of applications. However, these advantages are achieved through highly specialized designs. The high degree of specialization leads to accelerators with narrow functionality and designs adhering to a rigid execution flow. For multi-tenant systems this limits the scope of applicability of FPGA-based accelerators, because, first, supporting a single operation is unlikely to have any significant impact on the overall performance of the system, and, second, serving multiple users satisfactorily is difficult due to simplistic scheduling policies enforced when using the accelerator. Standard operating system and database management system features that would help address these limitations, such as context-switching, preemptive scheduling, and thread migration are practically non-existent in current FPGA accelerator efforts. In this work, we propose PipeArch, an open-source project1 for developing FPGA-based accelerators that combine the high efficiency of specialized hardware designs with the generality and functionality known from conventional CPU threads. PipeArch provides programmability and extensibility in the accelerator without losing the advantages of SIMD-parallelism and deep pipelining. PipeArch supports context-switching and thread migration, thereby enabling for the first time new capabilities such as preemptive scheduling in FPGA accelerators within a high-performance data processing setting. We have used PipeArch to implement a variety of machine learning methods for generalized linear model training and recommender systems showing empirically their advantages over a high-end CPU and even over fully specialized FPGA designs. © 2020 ACM
  • Besta, Maciej; Fischer, Marc; Ben-Nun, Tal; et al. (2020)
    ACM Transactions on Reconfigurable Technology and Systems
    Developing high-performance and energy-efficient algorithms for maximum matchings is becoming increasingly important in social network analysis, computational sciences, scheduling, and others. In this work, we propose the first maximum matching algorithm designed for FPGAs; it is energy-efficient and has provable guarantees on accuracy, performance, and storage utilization. To achieve this, we forego popular graph processing paradigms, such as vertex-centric programming, that often entail large communication costs. Instead, we propose a substream-centric approach, in which the input stream of data is divided into substreams processed independently to enable more parallelism while lowering communication costs. We base our work on the theory of streaming graph algorithms and analyze 14 models and 28 algorithms. We use this analysis to provide theoretical underpinning that matches the physical constraints of FPGA platforms. Our algorithm delivers high performance (more than 4× speedup over tuned parallel CPU variants), low memory, high accuracy, and effective usage of FPGA resources. The substream-centric approach could easily be extended to other algorithms to offer low-power and high-performance graph processing on FPGAs.
  • Owaida, Muhsen; Kulkarni, Amit; Alonso, Gustavo (2019)
    ACM Transactions on Reconfigurable Technology and Systems
  • Singh, Gagandeep; Diamantopoulos, Dionysios; Gómez Luna, Juan; et al. (2022)
    ACM Transactions on Reconfigurable Technology and Systems
    Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils that are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, an FPGA+HBM-based accelerator connected through the Open Coherent Accelerator Processor Interface to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system when running two different compound stencil kernels. NERO reduces the energy consumption for the same two kernels over the POWER9 system, with an energy efficiency of 1.61 GFLOPS/W and 21.01 GFLOPS/W. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.
  • Woods, Louis; Alonso, Gustavo; Teubner, Jens (2015)
    ACM Transactions on Reconfigurable Technology and Systems
  • Strega: An HTTP Server for FPGAs
    Item type: Journal Article
    Maschi, Fabio; Alonso, Gustavo (2024)
    ACM Transactions on Reconfigurable Technology and Systems
    The computer architecture landscape is being reshaped by the new opportunities, challenges and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to hardware accelerators. While it is well understood how to do this for, e.g., network function virtualisation and protocols such as TCP/IP, support for higher networking layers is still largely missing, limiting the potential of accelerators. In this paper, we present Strega, an open-source light-weight HTTP server that enables crucial functionality such as FPGA-accelerated functions being called through a RESTful protocol (FPGA-as-a-Function). Our experimental analysis shows that a single Strega node sustains a throughput of 1.7M HTTP requests per second with an end-to-end latency as low as 16 μs, outperforming nginx running on 32 vCPUs in both metrics, and can even be an alternative to the traditional OpenCL flow over the PCIe bus. Through this work, we pave the way for running microservices directly on FPGAs, bypassing CPU overhead and realising the full potential of FPGA acceleration in distributed cloud applications.
  • Shi, Runbin; Kara, Kaan; Hagleitner, Christoph; et al. (2022)
    ACM Transactions on Reconfigurable Technology and Systems
    FPGAs are increasingly being used in data centers and the cloud due to their potential to accelerate certain workloads as well as for their architectural flexibility, since they can be used as accelerators, as smart NICs, or as stand-alone processors. To meet the challenges posed by these new use cases, FPGAs are quickly evolving in terms of their capabilities and organization. The utilization of High Bandwidth Memory (HBM) in FPGA devices is one recent example of such a trend. In this paper, we study the potential of FPGAs equipped with HBM from a data analytics perspective. We consider three workloads common in analytics-oriented databases and implement them on an FPGA, showing in which cases they benefit from HBM: range selection, hash join, and stochastic gradient descent for linear model training. We integrate our designs into a columnar database (MonetDB) and show the trade-offs arising from the integration related to data movement and partitioning. We consider two possible configurations of the HBM, using a single-clock and a dual-clock version of the design. With the right design, FPGA+HBM-based solutions are able to surpass the highest performance provided by either a 2-socket POWER9 system or a 14-core Xeon E5 by up to 5.9x (range selection), 18.3x (hash join), and 6.1x (SGD).
  • István, Zsolt; Alonso, Gustavo; Blott, Michaela; et al. (2015)
    ACM Transactions on Reconfigurable Technology and Systems
  • Resource Sharing in Dataflow Circuits
    Item type: Journal Article
    Josipović, Lana; Marmet, Axel; Guerrieri, Andrea; et al. (2023)
    ACM Transactions on Reconfigurable Technology and Systems
    To achieve resource-efficient hardware designs, high-level synthesis (HLS) tools share (i.e., time-multiplex) functional units among operations of the same type. This optimization is typically performed in conjunction with operation scheduling to ensure the best possible unit usage at each point in time. Dataflow circuits have emerged as an alternative HLS approach to efficiently handle irregular and control-dominated code. However, these circuits do not have a predetermined schedule—in its absence, it is challenging to determine which operations can share a functional unit without a performance penalty. More critically, although sharing seems to imply only some trivial circuitry, time-multiplexing units in dataflow circuits may cause deadlock by blocking certain data transfers and preventing operations from executing. In this paper, we present a technique to automatically identify performance-acceptable resource sharing opportunities in dataflow circuits. More importantly, we describe a sharing mechanism which achieves functionally correct and deadlock-free dataflow designs. On a set of benchmarks obtained from C code, we show that our approach effectively implements resource sharing. It results in significant area savings at a minor performance penalty compared to dataflow circuits which do not support this feature (i.e., it achieves a 64%, 2%, and 18% average reduction in DSPs, LUTs, and FFs, respectively, with an average increase in total execution time of only 2%) and matches the sharing capabilities of a state-of-the-art HLS tool.