Thomas Benz
Last Name: Benz
First Name: Thomas
ORCID
Organisational unit: 03996 - Benini, Luca / Benini, Luca
33 results
Search Results
Publications 1 - 10 of 33
- PATRONoC: Parallel AXI Transport Reducing Overhead for Networks-on-Chip targeting Multi-Accelerator DNN Platforms at the Edge
Item type: Conference Paper
2023 60th ACM/IEEE Design Automation Conference (DAC)
Jain, Vikram; Cavalcante, Matheus; Bruschi, Nazareno; et al. (2023)
Emerging deep neural network (DNN) applications require high-performance multi-core hardware acceleration with large data bursts. Classical network-on-chips (NoCs) use serial packet-based protocols suffering from significant protocol translation overheads towards the endpoints. This paper proposes PATRONoC, an open-source fully AXI-compliant NoC fabric to better address the specific needs of multi-core DNN computing platforms. Evaluation of PATRONoC in a 2D-mesh topology shows 34% higher area efficiency compared to a state-of-the-art classical NoC at 1 GHz. PATRONoC's throughput outperforms a baseline NoC by 2-8x on uniform random traffic and provides a high aggregated throughput of up to 350 GiB/s on synthetic and DNN workload traffic.
- Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET
Item type: Other Conference Item
2024 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits)
Paulin, Gianna; Scheffler, Paul; Benz, Thomas; et al. (2024)
We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stencils (83 %), sparse-dense (42 %), and sparse-sparse (49 %) matrix multiply.
- OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs
Item type: Conference Paper
ATC'24: Proceedings of the 2024 USENIX Annual Technical Conference
Khalilov, Mikhail; Chrapek, Marcin; Shen, Siyuan; et al. (2024)
Multi-tenancy is essential for unleashing SmartNIC's potential in datacenters. Our systematic analysis in this work shows that existing on-path SmartNICs have resource multiplexing limitations. For example, existing solutions lack multi-tenancy capabilities such as performance isolation and QoS provisioning for compute and IO resources. Compared to standard NIC data paths with a well-defined set of offloaded functions, unpredictable execution times of SmartNIC kernels make conventional approaches for multi-tenancy and QoS insufficient. We fill this gap with OSMOSIS, a SmartNICs resource manager co-design. OSMOSIS extends existing OS mechanisms to enable dynamic hardware resource multiplexing of the on-path packet processing data plane. We integrate OSMOSIS within an open-source RISC-V-based 400Gbit/s SmartNIC. Our performance results demonstrate that OSMOSIS fully supports multi-tenancy and enables broader adoption of SmartNICs in datacenters with low overhead.
- A Gigabit, DMA-enhanced Open-Source Ethernet Controller for Mixed-Criticality Systems
Item type: Conference Paper
CF '24 Companion: Proceedings of the 21st ACM International Conference on Computing Frontiers: Workshops and Special Sessions
Liang, Chaoqun; Ottaviano, Alessandro; Benz, Thomas; et al. (2024)
The ongoing revolution in application domains targeting autonomous navigation, first and foremost automotive "zonalization", has increased the importance of certain off-chip communication interfaces, particularly Ethernet. The latter will play an essential role in next-generation vehicle architectures as the backbone connecting simultaneously and instantaneously the zonal/domain controllers. There is thereby an incumbent need to introduce a performant Ethernet controller in the open-source HW community, to be used as a proxy for architectural explorations and prototyping of mixed-criticality systems (MCSs). Driven by this trend, in this work, we propose a fully open-source, DMA-enhanced, technology-agnostic Gigabit Ethernet architecture that overcomes the limitations of existing open-source architectures, such as Lowrisc's Ethernet, often tied to FPGA implementation, performance-bound by sub-optimal design choices such as large memory buffers, and in general not mature enough to bridge the gap between academia and industry. Besides the area advantage, the proposed design increases packet transmission speed up to almost 3x compared to Lowrisc's and is validated through implementation and FPGA prototyping into two open-source, heterogeneous MCSs.
- Basilisk: Achieving Competitive Performance with Open EDA Tools on an Open-Source Linux-Capable RISC-V SoC
Item type: Other Conference Item
Sauter, Phillippe; Benz, Thomas; Scheffler, Paul; et al. (2024)
We introduce Basilisk, an optimized application-specific integrated circuit (ASIC) implementation and design flow building on the end-to-end open-source Iguana system-on-chip (SoC). We present enhancements to synthesis tools and logic optimization scripts improving quality of results (QoR), as well as an optimized physical design with an improved power grid and cell placement integration enabling a higher core utilization. The tapeout-ready version of Basilisk implemented in IHP's open 130 nm technology achieves an operation frequency of 77 MHz (51 logic levels) under typical conditions, a 2.3x improvement compared to the baseline open-source EDA design flow presented in Iguana, and a higher 55 % core utilization compared to 50 % in the baseline design. Through collaboration with EDA tool developers and domain experts, Basilisk exemplifies a synergistic effort towards competitive open-source electronic design automation (EDA) tools for research and industry applications.
- Snitch Scale-Out on Amazon F1 Instances
Item type: Master Thesis
Benz, Thomas (2020)
High performance computing systems provide the computational power required by many modern applications especially in the field of machine learning and big data. Simulating such large systems on the register-transfer level is essential to verify their correctness, but becomes increasingly more difficult with increasing size. We propose a FPGA-based alternative centered around an ecosystem of applications providing the necessary observability and controllability needed by such a simulation replacement. Compared to the traditional RTL simulation our approach easily achieves a speedup of 70000 x. With the data collected by this FPGA-based simulation alternative, we can bootstrap an online power estimation flow, whose prediction differs less than 20% from a state-of-the-art power estimation flow. In the process of developing the RTL simulation alternative, we optimized the memory system of the snitch cluster by introducing a 512 bit memory plug and the Frankensnitch, a programmable data movement unit featuring the new AXI DMA backend. These changes allowed us to increase the bandwidth into the cluster by a factor of 200 enabling efficient data movement.
- AXI-Pack: Near-Memory Bus Packing for Bandwidth-Efficient Irregular Workloads
Item type: Conference Paper
2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Zhang, Chi; Scheffler, Paul; Benz, Thomas; et al. (2023)
Data-intensive applications involving irregular memory streams are inefficiently handled by modern processors and memory systems highly optimized for regular, contiguous data. Recent work tackles these inefficiencies in hardware through core-side stream extensions or memory-side prefetchers and accelerators, but fails to provide end-to-end solutions which also achieve high efficiency in on-chip interconnects. We propose AXI-Pack, an extension to ARM's AXI4 protocol introducing bandwidth-efficient strided and indirect bursts to enable end-to-end irregular streams. AXI-Pack adds irregular stream semantics to memory requests and avoids inefficient narrow-bus transfers by packing multiple narrow data elements onto a wide bus. It retains full compatibility with AXI4 and does not require modifications to non-burst-reshaping interconnect IPs. To demonstrate our approach end-to-end, we extend an open-source RISC-V vector processor to leverage AXI-Pack at its memory interface for strided and indexed accesses. On the memory side, we design a banked memory controller efficiently handling AXI-Pack requests. On a system with a 256-bit-wide interconnect running FP32 workloads, AXI-Pack achieves near-ideal peak on-chip bus utilizations of 87% and 39%, speedups of 5.4x and 2.4x, and energy efficiency improvements of 5.3x and 2.1x over a baseline using an AXI4 bus on strided and indirect benchmarks, respectively.
- ControlPULPlet: A Flexible Real-time Multicore RISC-V Controller for 2.5-D Systems-in-Package
Item type: Journal Article
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Ottaviano, Alessandro; Balas, Robert; Fischer, Tim; et al. (2025)
The growing complexity of real-time (RT) control algorithms with increasing performance demands along with the shift to 2.5-D technology drive the need for scalable controllers to manage chiplets’ coupled operation in 2.5-D systems-in-package (SiPs). These controllers must offer RT computing capabilities, as well as SiP-compatible IO interfaces for communicating with the controlled dies. Due to RT constraints, a key challenge is minimizing the performance penalty of die-to-die (D2D) communication with respect to native on-chip control interfaces. We address this challenge with ControlPULPlet, an open-source, RT multicore RISC-V controller designed specifically for SiP integration. ControlPULPlet features a 32-bit CV32RT core for fast interrupt handling and a specialized direct memory access engine to automate periodic sensor readout. A tightly coupled programmable multicore cluster for acceleration of advanced control algorithms is integrated through a dedicated AXI4 port. A flexible AXI4-compatible D2D link enables efficient communication in 2.5-D SiPs. We implemented and fabricated ControlPULPlet as a silicon demonstrator called Kairos in TSMC’s 65-nm CMOS. Kairos runs model predictive control algorithms at up to 290 MHz in a 30 mW power envelope. The D2D link attains a peak duplex transfer rate of 51 Gbit/s at 200 MHz, at the minimal costs of just 7.6 kGE in PHY area per channel, adding just 2.9% to the total system area.
- A Reliable, Time-Predictable Heterogeneous SoC for AI-Enhanced Mixed-Criticality Edge Applications
Item type: Journal Article
IEEE Transactions on Circuits and Systems II. Express Briefs
Garofalo, Angelo; Ottaviano, Alessandro; Perotti, Matteo; et al. (2025)
Next-generation mixed-criticality Systems-on-chip (SoCs) must execute mixed-criticality AI-enhanced sensor processing and control workloads, ensuring reliable and time-predictable execution of critical tasks while fitting within a sub-2W power envelope. To tackle these challenges, we present a 16nm, reliable, time-predictable heterogeneous SoC with multiple programmable accelerators. Within a 1.2W power envelope, the SoC integrates software-configurable hardware IPs to ensure predictable access to shared resources, such as the on-chip interconnect and memory system, leading to tight upper bounds on execution times of critical applications. To accelerate mission-critical AI, the SoC integrates a reliable multi-core accelerator achieving 304.9 GOPS peak performance at 1.6 TOPS/W energy efficiency. Non-critical, compute-intensive, floating-point workloads are accelerated by a vector cluster, achieving 1.1 TFLOPS/W and 106.8 GFLOPS/mm2.
- AXI-REALM: A Lightweight and Modular Interconnect Extension for Traffic Regulation and Monitoring of Heterogeneous Real-Time SoCs
Item type: Conference Paper
2024 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Benz, Thomas; Ottaviano, Alessandro; Balas, Robert; et al. (2024)
The increasing demand for heterogeneous functionality in the automotive industry and the evolution of chip manufacturing processes have led to the transition from federated to integrated critical real-time embedded systems (CRTESs). This leads to higher integration challenges of conventional timing predictability techniques due to access contention on shared resources, which can be resolved by providing system-level observability and controllability in hardware. We focus on the interconnect as a shared resource and propose AXI-REALM, a lightweight, modular, and technology-independent real-time extension to industry-standard AXI4 interconnects, available open-source. AXI-REALM uses a credit-based mechanism to distribute and control the bandwidth in a multi-subordinate system on periodic time windows, proactively prevents denial of service from malicious actors in the system, and tracks each manager's access and interference statistics for optimal budget and period selection. We provide detailed performance and implementation cost assessment in a 12nm node and an end-to-end functional case study implementing AXI-REALM into an open-source Linux-capable RISC-V SoC. In a system with a general-purpose core and a hardware accelerator's DMA engine causing interference on the interconnect, AXI-REALM achieves fair bandwidth distribution among managers, allowing the core to recover 68.2 % of its performance compared to the case without contention. Moreover, near-ideal performance (above 95 %) can be achieved by distributing the available bandwidth in favor of the core, improving the worst-case memory access latency from 264 to below eight cycles. Our approach minimizes buffering compared to other solutions and introduces only 2.45 % area overhead compared to the original SoC.