Andreas Kurth
19 results
Search Results
Publications 1 - 10 of 19
- HERO: an Open-Source Research Platform for HW/SW Exploration of Heterogeneous Manycore Systems
Item type: Conference Paper
Kurth, Andreas; Capotondi, Alessandro; Vogel, Pirmin; et al. (2018)
Heterogeneous systems on chip (HeSoCs) co-integrate a high-performance multicore host processor with programmable manycore accelerators (PMCAs) to combine “standard platform” software support (e.g. the Linux OS) with energy-efficient, domain-specific, highly parallel processing capabilities. In this work, we present HERO, a HeSoC platform that tackles this challenge in a novel way. HERO’s host processor is an industry-standard ARM Cortex-A multicore complex, while its PMCA is a scalable, silicon-proven, open-source many-core processing engine, based on the extensible, open RISC-V ISA. We evaluate a prototype implementation of HERO, where the PMCA implemented on an FPGA fabric is coupled with a hard ARM Cortex-A host processor, and show that the run time overhead compared to manually written PMCA code operating on private physical memory is lower than 10 % for pivotal benchmarks and operating conditions. Thus, HERO demonstrates that ARM and RISC-V can productively coexist in a dual-ISA HW-SW platform.
- PATRONoC: Parallel AXI Transport Reducing Overhead for Networks-on-Chip targeting Multi-Accelerator DNN Platforms at the Edge
Item type: Conference Paper
2023 60th ACM/IEEE Design Automation Conference (DAC)
Jain, Vikram; Cavalcante, Matheus; Bruschi, Nazareno; et al. (2023)
Emerging deep neural network (DNN) applications require high-performance multi-core hardware acceleration with large data bursts. Classical networks-on-chip (NoCs) use serial packet-based protocols that suffer from significant protocol translation overheads towards the endpoints. This paper proposes PATRONoC, an open-source fully AXI-compliant NoC fabric to better address the specific needs of multi-core DNN computing platforms. Evaluation of PATRONoC in a 2D-mesh topology shows 34% higher area efficiency compared to a state-of-the-art classical NoC at 1 GHz. PATRONoC's throughput outperforms a baseline NoC by 2-8x on uniform random traffic and provides a high aggregated throughput of up to 350 GiB/s on synthetic and DNN workload traffic.
- Analyzing Memory Interference of FPGA Accelerators on Multicore Hosts in Heterogeneous Reconfigurable SoCs
Item type: Conference Paper
Proceedings of the 2021 Design, Automation & Test in Europe (DATE 2021)
Mattheeuws, Maxim; Forsberg, Björn; Kurth, Andreas; et al. (2021)
Reconfigurable heterogeneous systems-on-chips (SoCs) integrating multiple accelerators are cost-effective and feature the processing power required for complex embedded applications. However, to enable their usage in real-time settings, it is crucial to control interference on the shared main memory for reliable performance. Interference causes performance degradation due to simultaneous memory requests by components such as CPUs, caches, accelerators, and DMAs. We propose a methodology to characterize the interference to multicore host processors caused by accelerators implemented in the FPGA fabric of reconfigurable heterogeneous SoCs. Based on it, we extend the roofline model to account for performance degradation of the computing platform. The extended model makes it possible to determine efficiently the point at which memory interference becomes critical for a given platform and workload. We apply our methodology to a modern Xilinx UltraScale+ SoC integrating a multicore ARM Cortex-A CPU and a Kintex-grade FPGA. To the best of our knowledge, our results experimentally show for the first time that programs with intensities below 5 flop/byte (workloads with low cache locality) can suffer from slowdowns of up to an order of magnitude.
- Mobile Ultrasound Imaging on Heterogeneous Multi-Core Platforms
Item type: Conference Paper
Proceedings of the 14th ACM/IEEE Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia'16)
Kurth, Andreas; Tretter, Andreas; Hager, Pascal A.; et al. (2016)
- An Open-Source Research Platform for Heterogeneous Systems on Chip
Item type: Doctoral Thesis
Kurth, Andreas (2022)
Heterogeneous systems on chip (HeSoCs) combine general-purpose, feature-rich multi-core host processors with domain-specific programmable many-core accelerators (PMCAs) to unite versatility with energy efficiency and peak performance. By virtue of their heterogeneity, HeSoCs hold the promise of increasing performance and energy efficiency compared to homogeneous multiprocessors, because applications can be executed on hardware that is designed for them. However, this heterogeneity also increases system complexity substantially. This thesis presents the first research platform for HeSoCs where all components, from accelerator cores to application programming interface, are available under permissive open-source licenses. We begin by identifying the hardware and software components that are required in HeSoCs and by designing a representative hardware and software architecture. We then design, implement, and evaluate four critical HeSoC components that have not been discussed in research at the level required for an open-source implementation: First, we present a modular, topology-agnostic, high-performance on-chip communication platform, which adheres to a state-of-the-art industry-standard protocol. We show that the platform can be used to build high-bandwidth (e.g., 2.5 GHz and 1024 bit data width) end-to-end communication fabrics with high degrees of concurrency (e.g., up to 256 independent concurrent transactions).
Second, we present a modular and efficient solution for implementing atomic memory operations in highly scalable many-core processors, which demonstrates near-optimal linear throughput scaling for various synthetic and real-world workloads and requires only 0.5 kGE per core. Third, we present a hardware-software solution for shared virtual memory that avoids the majority of translation lookaside buffer misses with prefetching, supports parallel burst transfers without additional buffers, and can be scaled with the workload and number of parallel processors. Our work improves accelerator performance for memory-intensive kernels by up to 4×. Fourth, we present a software toolchain for mixed-data-model heterogeneous compilation and OpenMP offloading. Our work enables transparent memory sharing between a 64-bit host processor and a 32-bit accelerator at overheads below 0.7 % compared to 32-bit-only execution. Finally, we combine our contributions into a research platform for state-of-the-art HeSoCs and demonstrate its performance and flexibility in multiple case studies.
- Design of an open-source bridge between non-coherent burst-based and coherent cache-line-based memory systems
Item type: Conference Paper
Proceedings of the 17th ACM International Conference on Computing Frontiers (CF 2020)
Cavalcante, Matheus; Kurth, Andreas; Schuiki, Fabian; et al. (2020)
- A Synergistic Approach to Predictable Compilation and Scheduling on Commodity Multi-Cores
Item type: Conference Paper
LCTES '20: The 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems
Forsberg, Björn; Mattheeuws, Maxim; Kurth, Andreas; et al. (2020)
- General in-network processing – time is ripe!
Item type: Presentation
Hoefler, Torsten; Di Girolamo, Salvatore; Taranov, Konstantin; et al. (2020)
Remote direct memory access (RDMA) networks have been around for more than a decade. RDMA hardware enables basic put/get operations into userland at very high speeds and reduces CPU overheads significantly. However, we observe that CPU requirements for processing data at modern speeds of 400 or 800 Gbit/s are still huge. Modern smart NICs add various processing capabilities ranging from fully-fledged ARM cores to FPGA-accelerated NICs. However, all current implementations are either relatively inefficient for line-rate packet processing or offer only limited functions such as header rewriting. We advocate for a fully flexible model that allows executing arbitrary C code on each packet. We show that 'streaming Processing in the Network' (sPIN) enables such a model. Our implementation based on RISC-V demonstrates that generic network acceleration is feasible and delivers an efficiency improvement of up to 100x. We release our implementations as open source and expect that more vendors will adopt generic in-network computations in addition to RDMA.
- Mixed-data-model heterogeneous compilation and OpenMP offloading
Item type: Conference Paper
CC 2020: Proceedings of the 29th International Conference on Compiler Construction
Kurth, Andreas; Wolters, Koen; Forsberg, Björn; et al. (2020)
- LLHD: A multi-level intermediate representation for hardware description languages
Item type: Conference Paper
Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation
Schuiki, Fabian; Kurth, Andreas; Grosser, Tobias; et al. (2020)
Modern Hardware Description Languages (HDLs) such as SystemVerilog or VHDL are, due to their sheer complexity, insufficient to transport designs through modern circuit design flows. Instead, each design automation tool lowers HDLs to its own Intermediate Representation (IR). These tools are monolithic and mostly proprietary, disagree in their implementation of HDLs, and while many redundant IRs exist, no IR today can be used through the entire circuit design flow. To solve this problem, we propose the LLHD multi-level IR. LLHD is designed as a simple, unambiguous reference description of a digital circuit, yet fully captures existing HDLs. We show this with our reference compiler on designs as complex as full CPU cores. LLHD comes with lowering passes to a hardware-near structural IR, which readily integrates with existing tools. LLHD establishes the basis for innovation in HDLs and tools without redundant compilers or disjoint IRs. For instance, we implement an LLHD simulator that runs up to 2.4× faster than commercial simulators but produces equivalent, cycle-accurate results. An initial vertically-integrated research prototype is capable of representing all levels of the IR, implements lowering from the behavioural to the structural IR, and covers a sufficient subset of SystemVerilog to support a full CPU design. © 2020 ACM.