Thomas Benz
Loading...
Last Name
Benz
First Name
Thomas
ORCID
Organisational unit
03996 - Benini, Luca / Benini, Luca
29 results
Filters
Reset filtersSearch Results
Publications 1 - 10 of 29
- OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICsItem type: Conference Paper
ATC'24: Proceedings of the 2024 USENIX Annual Technical ConferenceKhalilov, Mikhail; Chrapek, Marcin; Shen, Siyuan; et al. (2024)Multi-tenancy is essential for unleashing SmartNIC's potential in datacenters. Our systematic analysis in this work shows that existing on-path SmartNICs have resource multiplexing limitations. For example, existing solutions lack multi-tenancy capabilities such as performance isolation and QoS provisioning for compute and 10 resources. Compared to standard NIC data paths with a well-defined set of offloaded functions, unpredictable execution times of SmartNIC kernels make conventional approaches for multi-tenancy and QoS insufficient. We fill this gap with OSMOSIS, a SmartNICs resource. manager co-design, OSMOSIS extends existing OS mechanisms to enable dynamic hardware resource multiplexing of the on-path packet processing data plane. We integrate OSMOSIS within an open-source RISC-V-based 400Gbit/s SmartNIC. Our performance results demonstrate that OSMOSIS fully supports multi-tenancy and enables broader adoption of SmartNICs in datacenters with low overhead. - A Data-Driven Approach to Lightweight DVFS-Aware Counter-Based Power Modeling for Heterogeneous PlatformsItem type: Conference Paper
Lecture Notes in Computer Science ~ Embedded Computer Systems: Architectures, Modeling, and SimulationMazzola, Sergio; Benz, Thomas; Forsberg, Björn; et al. (2022)Computing systems have shifted towards highly parallel and heterogeneous architectures to tackle the challenges imposed by limited power budgets. These architectures must be supported by novel power management paradigms addressing the increasing design size, parallelism, and heterogeneity while ensuring high accuracy and low overhead. In this work, we propose a systematic, automated, and architecture-agnostic approach to accurate and lightweight DVFS-aware statistical power modeling of the CPU and GPU sub-systems of a heterogeneous platform, driven by the sub-systems' local performance monitoring counters (PMCs). Counter selection is guided by a generally applicable statistical method that identifies the minimal subsets of counters robustly correlating to power dissipation. Based on the selected counters, we train a set of lightweight, linear models characterizing each sub-system over a range of frequencies. Such models compose a lookup-table-based system-level model that efficiently captures the non-linearity of power consumption, showing desirable responsiveness and decomposability. We validate the system-level model on real hardware by measuring the total energy consumption of an NVIDIA Jetson AGX Xavier platform over a set of benchmarks. The resulting average estimation error is 1.3%, with a maximum of 3.1%. Furthermore, the model shows a maximum evaluation runtime of 500 ns, thus implying a negligible impact on system utilization and applicability to online dynamic power management (DPM). - Iguana: An End-to-End Open-Source Linux-capable RISC-V SoC in 130nm CMOSItem type: Other Conference ItemBenz, Thomas; Scheffler, Paul; Schönleber, Jannis; et al. (2023)Open-source architecture design and register-transfer level (RTL) descriptions, particularly around RISC-V, have made huge strides in the past decade. While open-source physical design and implementation are still lagging behind by comparison, they have recently been catching up: open EDA tools are nearing feature completeness, and some proprietary PDKs are being opened. We present Iguana, the first end-to-end open-source Linux-capable, 64-bit RV64GC, RISC-V System-on-Chip (SoC). Scheduled to tape out in IHP’s 130nm technology, which is currently being open-sourced, Iguana sets important milestones for both open-source silicon and European silicon sovereignty. It implements our Cheshire architecture, which uses IP-based high-level synthesis (iHLS) to generate SoCs from carefully designed, silicon-proven open-source IPs. It includes a variety of standard peripherals, a DRAM controller, VGA display output, and a parallel die-to-die link. Cheshire, all its IPs and physical layers (PHYs) are released under a liberal Apache-based license. We implement Iguana with a fully open RTL-to-GDS toolchain, using established tools where possible and filling gaps with our open tools. Iguana is not a one-shot, but the first in a series of SoCs we will progressively extend and improve on. Furthermore, we have pre-validated our architecture through an FPGA mapping and a tested silicon prototype in TSMC’s proprietary 65nm node.
- A RISC-V in-network accelerator for flexible high-performance low-power packet processingItem type: Conference Paper
2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)Di Girolamo, Salvatore; Kurth, Andreas; Calotoiu, Alexandru; et al. (2021)The capacity of offloading data and control tasks to the network is becoming increasingly important, especially if we consider the faster growth of network speed when compared to CPU frequencies. In-network compute alleviates the host CPU load by running tasks directly in the network, enabling additional computation/communication overlap and potentially improving overall application performance. However, sustaining bandwidths provided by next-generation networks, e.g., 400 Gbit/s, can become a challenge. sPIN is a programming model for in-NIC compute, where users specify handler functions that are executed on the NIC, for each incoming packet belonging to a given message or flow. It enables a CUDA-like acceleration, where the NIC is equipped with lightweight processing elements that process network packets in parallel. We investigate the architectural specialties that a sPIN NIC should provide to enable high-performance, low-power, and flexible packet processing. We introduce PsPIN, a first open-source sPIN implementation, based on a multi-cluster RISC-V architecture and designed according to the identified architectural specialties. We investigate the performance of PsPIN with cycle-accurate simulations, showing that it can process packets at 400 Gbit/s for several use cases, introducing minimal latencies (26 ns for 64 B packets) and occupying a total area of 18.5 mm2 (22 nm FDSOI). - DUTCTL: A Flexible Open-Source Framework for Rapid Bring-Up, Characterization, and Remote Operation of Custom-Silicon RISC-V SoCsItem type: Other Conference ItemBenz, Thomas; Scheffler, Paul; Holborn, Jennifer; et al. (2024)Automatic test equipment (ATE) systems are industry-standard tools to test and characterize silicon at high resolution and volume. With open-source hardware initiatives on the rise, small organizations and research groups increasingly design their own prototype systems-on-chip (SoCs), but often struggle to acquire and maintain ATE. Thanks to RISC-V's Debug Module (DM), SoCs can control and observe their own state using standardized interfaces which can be driven through commodity adapters. However, this on-chip controllability and observability by itself is insufficient to replace ATE for functional tests and silicon characterization; it lacks coordination and automation of clocks, resets, power supplies, and debugging. We present DUTCTL, an open-source framework automating the bring-up and characterization of RISC-V-based SoCs by coordinating internal and external resources without an ATE. We evaluate our approach by bringing up an open-source 64-bit Linux-capable RISC-V SoC and characterizing its performance, speed, and power consumption.
- Cheshire: A Lightweight, Linux-Capable RISC-V Host Platform for Domain-Specific Accelerator Plug-InItem type: Journal Article
IEEE Transactions on Circuits and Systems II. Express BriefsOttaviano, Alessandro; Benz, Thomas; Scheffler, Paul; et al. (2023)Power and cost constraints in the Internet-of-Things (IoT) extreme-edge and TinyML domains, coupled with increasing performance requirements, motivate a trend toward heterogeneous architectures. These designs use energy-efficient application-class host processors to coordinate compute-specialized multicore accelerators, amortizing the architectural costs of operating system support and external communication. This brief presents Cheshire, a lightweight and modular 64-bit Linux-capable host platform designed for the seamless plug-in of domain-specific accelerators. It features a unique low-pin-count DRAM interface, a last-level cache configurable as scratchpad memory, and a DMA engine enabling efficient data movement to or from accelerators or DRAM. It also provides numerous optional IO peripherals including UART, SPI, I2C, VGA, and GPIOs. Cheshire’s synthesizable RTL description, comprising all of its peripherals and its fully digital DRAM interface, is available free and open-source. We implemented and fabricated Cheshire as a silicon demonstrator called Neo in TSMC’s 65nm CMOS technology. At 1.2V, Neo achieves clock frequencies of up to 325 MHz while not exceeding 300 mW in total power on data-intensive computational workloads. Its RPC DRAM interface consumes only 250 pJ/B and incurs only 3.5 kGE in area for its PHY while attaining a peak transfer rate of 750 MB/s at 200 MHz. - Near-Memory Parallel Indexing and Coalescing: Enabling Highly Efficient Indirect Access for SpMVItem type: Conference Paper
2024 Design, Automation & Test in Europe Conference & Exhibition (DATE)Zhang, Chi; Scheffler, Paul; Benz, Thomas; et al. (2024)Sparse matrix-vector multiplication (SpMV) is central to numerous data-intensive applications, but requires streaming indirect memory accesses that severely degrade both processing and memory throughput in state-of-the-art architectures. Near-memory hardware units, decoupling indirect streams from processing elements, partially alleviate the bottleneck, but rely on low DRAM access granularity, which is highly inefficient for modern DRAM standards like HBM and LPDDR. To fully address the end-to-end challenge, we propose a low-overhead data coa- lescer combined with a near-memory indirect streaming unit for AXI-Pack, an extension to the widespread AXI4 protocol packing narrow irregular stream elements onto wide memory buses. Our combined solution leverages the memory-level parallelism and coalescence of streaming indirect accesses in irregular applications like SpMV to maximize the performance and bandwidth efficiency attained on wide memory interfaces. Our solution delivers an average speedup of 8x in effective indirect access, often reaching the full memory bandwidth. As a result, we achieve an average end-to-end speedup on SpMV of 3x. Moreover, our approach demonstrates remarkable on-chip efficiency, requiring merely 27kB of on-chip storage and a very compact implementation area of O.2-0.3mm² in a 12nm node. - Basilisk: An End-to-End Open-Source Linux-Capable RISC-V SoC in 130nm CMOSItem type: Conference Paper
Proceedings of the 2nd Safety and Security in Heterogeneous Open System-on-Chip Platforms Workshop (SSH-SoC 2024)Scheffler, Paul; Sauter, Philippe; Benz, Thomas; et al. (2024)Open-source hardware (OSHW) is rapidly gaining traction in academia and industry. The availability of open RTL descriptions, EDA tools, and even PDKs enables a fully auditable supply chain for end-to-end (RTL to layout) open-source silicon, significantly strengthening security and transparency. Despite promising developments, existing OSHW efforts have so far fallen short of producing end-to-end open-source SoCs at the complexity and performance level needed to run a general-purpose OS. We present Basilisk, the first end-to-end open-source, Linux-capable RISC-V SoC taped out in IHP's open 130 nm technology. Basilisk features a 64-bit RISC-V core, a fully digital HyperRAM DRAM controller, and a rich set of IO peripherals including USB 1.1 and VGA. To tape out Basilisk, we create a reusable tool pipeline to convert its industry-grade SystemVerilog description to Verilog. We optimized logic synthesis in the open source Yosys synthesis tool, obtaining an increase in Basilisk's peak clock speed by 2.3x to 77 MHz and reducing its cell area by 1.6x to 1.1 MGE while also reducing synthesis runtime and RAM usage. We further optimized place and route in OpenROAD, enabling convergence to zero DRC violations while increasing core area utilization by 10% and reducing die area by 12%. - Towards Reliable Systems: A Scalable Approach to AXI4 Transaction MonitoringItem type: Conference PaperLiang, Chaoqun; Benz, Thomas; Ottaviano, Alessandro; et al. (2025)In safety-critical SoC applications such as automotive and aerospace, reliable transaction monitoring is crucial for maintaining system integrity. This paper introduces a drop-in Transaction Monitoring Unit (TMU) for AXI4 subordinate endpoints that detects transaction failures including protocol violations or timeouts and triggers recovery by resetting the affected subordinates. Two TMU variants address different constraints: a Tiny-Counter solution for tightly area-constrained systems and a Full-Counter solution for critical subordinates in mixed-criticality SoCs. The Tiny-Counter employs a single counter per outstanding transaction, while the Full-Counter uses multiple counters to track distinct transaction stages, offering finer-grained monitoring and reducing detection latencies by up to hundreds of cycles at roughly 2.5x the area cost. The Full-Counter also provides detailed error logs for performance and bottleneck analysis. Evaluations at both IP and system levels confirm the TMU's effectiveness and low overhead. In GF12 technology, monitoring 16-32 outstanding transactions occupies 1330-2616 um2 for the Tiny-Counter and 3452-6787 um2 for the Full-Counter; moderate prescaler steps reduce these figures by 18-39% and 19-32%, respectively, with no loss of functionality. Results from a full-system integration demonstrate the TMU's robust and precise monitoring capabilities in safety-critical SoC environments.
- AXI-Pack: Near-Memory Bus Packing for Bandwidth-Efficient Irregular WorkloadsItem type: Conference Paper
2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)Zhang, Chi; Scheffler, Paul; Benz, Thomas; et al. (2023)Data-intensive applications involving irregular memory streams are inefficiently handled by modern processors and memory systems highly optimized for regular, contiguous data. Recent work tackles these inefficiencies in hardware through core-side stream extensions or memory-side prefetchers and accelerators, but fails to provide end-to-end solutions which also achieve high efficiency in on-chip interconnects. We propose AXI-Pack, an extension to ARM's AXI4 protocol introducing bandwidth-efficient strided and indirect bursts to enable end-to-end irregular streams. AXI-Pack adds irregular stream semantics to memory requests and avoids inefficient narrow-bus transfers by packing multiple narrow data elements onto a wide bus. It retains full compatibility with AXI4 and does not require modifications to non-burst-reshaping interconnect IPs. To demonstrate our approach end-to-end, we extend an open-source RISC-V vector processor to leverage AXI-Pack at its memory interface for strided and indexed accesses. On the memory side, we design a banked memory controller efficiently handling AXI-Pack requests. On a system with a 256-bit-wide interconnect running FP32 workloads, AXI-Pack achieves near-ideal peak on-chip bus utilizations of 87% and 39%, speedups of 5.4x and 2.4x, and energy efficiency improvements of 5.3x and 2.1x over a baseline using an AXI4 bus on strided and indirect benchmarks, respectively.
Publications 1 - 10 of 29