Samuel Riedel



Search Results

Publications 1 - 10 of 23
  • Mazzola, Sergio; Riedel, Samuel; Benini, Luca (2024)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
    Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the first excel with regular dataflow at the cost of rigid architectures and complex programming models, the second are versatile and easy to program but require explicit dataflow management and synchronization. This work aims at enabling efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient cores act as the systolic array's processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster's shared memory. We introduce two low-overhead instruction set architecture (ISA) extensions for efficient systolic execution, namely Xqueue and queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in MemPool, an open-source shared-memory cluster with 256 PEs, and analyze the hybrid systolic-shared-memory architecture's trade-offs on several digital signal processing (DSP) kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double MemPool's compute unit utilization, reaching up to 73%. In typical conditions (TT/0.80 V/25 °C), in a 22-nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
  • Zhang, Yichao; Bertuletti, Marco; Mazzola, Sergio; et al. (2025)
    2025 IEEE European Solid-State Electronics Research Conference (ESSERC)
    We present HeartStream, a 64-RV-core shared-L1-memory cluster (410 GFLOP/s peak performance and 204.8 GBps L1 bandwidth) for energy-efficient AI-enhanced O-RAN. The cores and cluster architecture are customized for baseband processing, supporting complex (16-bit real&imaginary) instructions: multiply&accumulate, division&square-root, SIMD instructions, and hardware-managed systolic queues, improving the energy efficiency of key baseband kernels by up to 1.89×. At 800 MHz@0.8 V, HeartStream delivers up to 243 GFLOP/s on complex-valued wireless workloads. Furthermore, the cores also support efficient AI processing on received data at up to 72 GOP/s. HeartStream is fully compatible with base station power and processing latency limits: it achieves leading-edge software-defined PUSCH efficiency (49.6 GFLOP/s/W) and consumes just 0.68 W (645 MHz@0.65 V), within the 4 ms end-to-end constraint for B5G/6G uplink.
  • Riedel, Samuel; Zhang, Yichao; Bertuletti, Marco; et al. (2025)
    2025 10th International Workshop on Advances in Sensors and Interfaces (IWASI)
    Next-generation wireless technologies (for immersive-massive communication, joint communication and sensing) demand highly parallel architectures for massive data processing. A common architectural template scales up by grouping tens to hundreds of cores into shared-memory clusters, which are then scaled out as multi-cluster manycore systems. This hierarchical design, used in GPUs and accelerators, requires a balancing act between fewer large clusters and more smaller clusters, affecting design complexity, synchronization, communication efficiency, and programmability. While all multi-cluster architectures must balance these trade-offs, there is limited insight into optimal cluster sizes. This paper analyzes various cluster configurations, focusing on synchronization, data movement overhead, and programmability for typical wireless sensing and communication workloads. We extend the open-source shared-memory cluster MemPool into a multi-cluster architecture and propose a novel double-buffering barrier that decouples processor and DMA. Our results show a single 256-core cluster can be twice as fast as 16 16-core clusters for memory-bound kernels and up to 24% faster for compute-bound kernels due to reduced synchronization and communication overheads.
  • Kurth, Andreas; Riedel, Samuel; Zaruba, Florian; et al. (2020)
    2020 57th ACM/IEEE Design Automation Conference (DAC)
    Atomic operations are crucial for most modern parallel and concurrent algorithms, which necessitates their optimized implementation in highly scalable manycore processors. We propose a modular and efficient, open-source ATomic UNit (ATUN) architecture that can be placed flexibly at different levels of the memory hierarchy. ATUN demonstrates near-optimal linear scaling for various synthetic and real-world workloads on an FPGA prototype with 32 RISC-V cores. We characterize the hardware complexity of our ATUN design in 22 nm FDSOI and find that it scales linearly in area (only 0.5 kGE per core) and logarithmically in the critical path.
  • Riedel, Samuel; Schuiki, Fabian; Scheffler, Paul; et al. (2021)
    2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
    System simulators are essential for the exploration, evaluation, and verification of manycore processors and are vital for writing software and developing programming models in conjunction with architecture design. A promising approach to fast, scalable, and instruction-accurate simulation is binary translation. In this paper, we present Banshee, an instruction-accurate full-system RISC-V multi-core simulator based on LLVM-powered ahead-of-time binary translation that can simulate systems with thousands of cores. Banshee supports the RV32IMAFD instruction set. It also models peripherals, custom ISA extensions, and a multi-level, actively-managed memory hierarchy used in existing multi-cluster systems. Banshee is agnostic to the host architecture, fully open-source, and easily extensible to facilitate the exploration and evaluation of new ISA extensions. As a key novelty with respect to existing binary translation approaches, Banshee supports performance estimation through a lightweight extension, modeling the effect of architectural latencies with an average deviation of only 2% from their actual impact. We evaluate Banshee by simulating various compute-intensive workloads on two large-scale open-source RISC-V manycore systems, Manticore and MemPool (with 4096 and 256 cores, respectively). We achieve simulation speeds of up to 618 MIPS per core or 72 GIPS for complete systems, exhibiting almost perfect scaling, competitive single-core performance, and leading multi-core performance. We demonstrate Banshee’s extensibility by implementing multiple custom RISC-V ISA extensions.
  • Bethur, Nesara Eranna; Agnesina, Anthony; Brunion, Moritz; et al. (2024)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
    Hierarchical very-large-scale integration (VLSI) flows are an understudied yet critical approach to achieving design closure at giga-scale complexity and gigahertz frequency targets. This article proposes a novel hierarchical physical design flow enabling the building of high-density and commercial-quality two-tier face-to-face-bonded hierarchical 3-D ICs. Complemented with an automated floor planning solution, the flow allows for system-level physical and architectural exploration of 3-D designs. As a result, we significantly reduce the associated manufacturing cost compared to existing 3-D implementation flows and, for the first time, achieve cost competitiveness against the 2-D reference in large modern designs. Experimental results on complex industrial and open manycore processors demonstrate in two advanced nodes that the proposed flow provides major power, performance, and area/cost (PPAC) improvements of 1.2-2.2× compared with 2-D, where all metrics are improved simultaneously, including up to 20% power savings.
  • Riedel, Samuel; Khov, Gua Hao; Mazzola, Sergio; et al. (2023)
    2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)
    Systolic arrays and shared-memory manycore clusters are two widely used architectural templates that offer vastly different trade-offs. Systolic arrays achieve exceptional performance for workloads with regular dataflow at the cost of a rigid architecture and programming model. Shared-memory manycore systems are more flexible and easy to program, but data must be moved explicitly to/from cores. This work combines the best of both worlds by adding a systolic overlay to a general-purpose shared-memory manycore cluster allowing for efficient systolic execution while maintaining flexibility. We propose and implement two instruction set architecture extensions enabling native and automatic communication between cores through shared memory. Our hybrid approach allows configuring different systolic topologies at execution time and running hybrid systolic-shared-memory computations. The hybrid architecture's convolution kernel outperforms the optimized shared-memory one by 18%.
  • Cavalcante, Matheus; Agnesina, Anthony; Riedel, Samuel; et al. (2022)
    2022 Design, Automation & Test in Europe Conference & Exhibition (DATE)
    Three-dimensional integrated circuits promise power, performance, and footprint gains compared to their 2D counterparts, thanks to drastic reductions in the interconnects' length through their smaller form factor. We can leverage the potential of 3D integration by enhancing MemPool, an open-source many-core design with 256 cores and a shared pool of L1 scratchpad memory connected with a low-latency interconnect. MemPool's baseline 2D design is severely limited by routing congestion and wire propagation delay, making the design ideal for 3D integration. In architectural terms, we increase MemPool's scratchpad memory capacity beyond the sweet spot for 2D designs, improving performance in a common digital signal processing kernel. We propose a 3D MemPool design that leverages a smart partitioning of the memory resources across two layers to balance the size and utilization of the stacked dies. In this paper, we explore the architectural and the technology parameter spaces by analyzing the power, performance, area, and energy efficiency of MemPool instances in 2D and 3D with 1 MiB, 2 MiB, 4 MiB, and 8 MiB of scratchpad memory in a commercial 28nm technology node. We observe a performance gain of 9.1% when running a matrix multiplication on MemPool-3D with 4 MiB of scratchpad memory compared to the MemPool 2D counterpart. In terms of energy efficiency, we can implement the MemPool-3D instance with 4 MiB of L1 memory on an energy budget 15% smaller than its 2D counterpart, and 3.7% smaller than the MemPool-2D instance with a fourth of the L1 scratchpad memory capacity.
  • Riedel, Samuel (2024)
  • Das, Sudipta; Riedel, Samuel; Naeim, Mohamed; et al. (2025)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
    The ongoing integration of advanced functionalities in contemporary system-on-chips (SoCs) poses significant challenges related to memory bandwidth, capacity, and thermal stability. These challenges are further amplified with the advancement of artificial intelligence (AI), necessitating enhanced memory and interconnect bandwidth and latency. This article presents a comprehensive study encompassing architectural modifications of an interconnect-dominated many-core SoC targeting the significant increase of intermediate, on-chip cache memory bandwidth and access latency tuning. The proposed SoC has been implemented in 3-D using A10 nanosheet technology and early thermal analysis has been performed. Our workload simulations reveal, respectively, up to 12- and 2.5-fold acceleration in the 64-core and 16-core versions of the SoC. Such speed-up comes at a 40% increase in die area and a 60% rise in power dissipation when implemented in 2-D. In contrast, the 3-D counterpart not only minimizes the footprint but also yields 20% power savings, attributable to a 40% reduction in wirelength. The article further highlights the importance of pipeline restructuring to leverage the potential of 3-D technology for achieving lower latency and more efficient memory access. Finally, we discuss the thermal implications of various 3-D partitioning schemes in High Performance Computing (HPC) and mobile applications. Our analysis reveals that, unlike the high-power-density HPC case, the 3-D mobile case increases Tmax by only 2 °C to 3 °C compared to 2-D, while the HPC scenario requires multiconstrained efficient partitioning for 3-D implementations.
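The queue-based systolic execution described in the Xqueue/QLR work (Mazzola et al., 2024; Riedel et al., 2023) can be made concrete with a toy software model. The sketch below (class name, queue depth, and values are illustrative assumptions, not MemPool's actual hardware layout) shows the bounded push/pop-with-backpressure behavior that Xqueue collapses into a single instruction and QLRs make implicit:

```python
from collections import deque

class SystolicQueue:
    """Bounded FIFO between two PEs; full/empty conditions stall the cores."""
    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()

    def push(self, value):
        # On a plain shared-memory core this takes several instructions
        # (check, store, pointer update); Xqueue makes it a single one.
        if len(self.fifo) == self.depth:
            return False  # queue full: the producer PE stalls
        self.fifo.append(value)
        return True

    def pop(self):
        # With QLRs, reading an operand register pops the queue implicitly.
        if not self.fifo:
            return None   # queue empty: the consumer PE stalls
        return self.fifo.popleft()

# Two PEs in a systolic chain: PE0 streams partial sums to PE1.
q = SystolicQueue(depth=2)
q.push(11)
q.push(22)
stalled = q.push(33)              # depth exceeded: producer would stall
first, second = q.pop(), q.pop()  # consumer drains in FIFO order
```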
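The double-buffering barrier in the multi-cluster study (Riedel et al., 2025, IWASI) decouples the cores from the DMA engine: while the cores compute on one buffer, the DMA fills the other. A simplified software model of that pattern follows (the function names and tile size are assumptions; the paper's barrier is a hardware/runtime mechanism, not this Python loop):

```python
def process(tile):
    # Stand-in compute kernel running on the cluster's cores.
    return [x * 2 for x in tile]

def double_buffered(stream, tile_size=4):
    results = []
    buffers = [stream[0:tile_size], None]  # buffer 0 pre-filled by the DMA
    i, offset = 0, tile_size
    while buffers[i]:
        # The "DMA" prefetches the next tile into the other buffer...
        buffers[1 - i] = stream[offset:offset + tile_size] or None
        offset += tile_size
        # ...while the cores compute on the current one. The barrier only
        # synchronizes once both the compute and the transfer have finished.
        results.extend(process(buffers[i]))
        i = 1 - i
    return results

out = double_buffered(list(range(8)), tile_size=4)
```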
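The ATUN work (Kurth et al., 2020) places atomic units at the memory side, so read-modify-write operations are serialized where the data lives rather than in the cores. In the Python model below, a lock stands in for the serialized RMW port (class and method names are made up for illustration; the real ATUN is an RTL block):

```python
import threading

class AtomicMemory:
    """Memory with an ATUN-style atomic unit at its port."""
    def __init__(self, size):
        self.mem = [0] * size
        self._lock = threading.Lock()  # stands in for the serialized RMW port

    def amo_add(self, addr, value):
        # RISC-V-style AMOADD: atomically add, return the old value.
        with self._lock:
            old = self.mem[addr]
            self.mem[addr] = old + value
            return old

mem = AtomicMemory(size=1)

def worker():
    for _ in range(1000):
        mem.amo_add(0, 1)

# Eight "cores" hammer the same counter; no increment is lost.
threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = mem.mem[0]  # 8 workers x 1000 increments each
```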
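Banshee (Riedel et al., 2021) owes its simulation speed to ahead-of-time binary translation: guest instructions are translated once into host code, then executed without per-instruction decoding. The toy translator below conveys the idea (a conceptual sketch with a made-up two-operation ISA, not Banshee's LLVM-based flow):

```python
def translate(program):
    """'Compile' each guest instruction into a host callable, once."""
    handlers = []
    for op, rd, rs1, rs2 in program:
        if op == "add":
            handlers.append(lambda r, d=rd, a=rs1, b=rs2: r.__setitem__(d, r[a] + r[b]))
        elif op == "mul":
            handlers.append(lambda r, d=rd, a=rs1, b=rs2: r.__setitem__(d, r[a] * r[b]))
        else:
            raise ValueError(f"unknown opcode: {op}")

    def run(regs):
        for h in handlers:  # no decode loop at run time
            h(regs)
        return regs

    return run

# (op, rd, rs1, rs2): x2 = x0 + x1; x3 = x2 * x2
prog = [("add", 2, 0, 1), ("mul", 3, 2, 2)]
run = translate(prog)
regs = run([3, 4, 0, 0])
```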