Marco Bertuletti
Loading...
Last Name
Bertuletti
First Name
Marco
ORCID
Organisational unit
03996 - Benini, Luca / Benini, Luca
13 results
Filters
Reset filtersSearch Results
Publications 1 - 10 of 13
- A 410 GFLOP/s, 64 RISC-V Cores, 204.8 GBps Shared-Memory Cluster in 12 nm FinFET with Systolic Execution Support for Efficient B5G/6G AI-Enhanced O-RANItem type: Conference Paper
2025 IEEE European Solid-State Electronics Research Conference (ESSERC)Zhang, Yichao; Bertuletti, Marco; Mazzola, Sergio; et al. (2025)We present HeartStream, a 64-RV-core shared-L1-memory cluster (410 GFLOP/s peak performance and 204.8 GBps L1 bandwidth) for energy-efficient AI-enhanced O-RAN. The cores and cluster architecture are customized for baseband processing, supporting complex (16-bit real&imaginary) instructions: multiply&accumulate, division&square-root, SIMD instructions, and hardware-managed systolic queues, improving up to 1.89× the energy efficiency of key baseband kernels. At 800 MHz@0.8 V, HeartStream delivers up to 243 GFLOP/s on complex-valued wireless workloads. Furthermore, the cores also support efficient AI processing on received data at up to 72 GOP/s. HeartStream is fully compatible with base station power and processing latency limits: it achieves leading-edge software-defined PUSCH efficiency (49.6 GFLOP/s/W) and consumes just 0.68 W(645 MHz@0.65 V, within the 4 ms end-to-end constraint for B5G/6G uplink. - Optimizing Scalable Multi-Cluster Architectures for Next-Generation Wireless Sensing and CommunicationItem type: Conference Paper
2025 10th International Workshop on Advances in Sensors and Interfaces (IWASI)Riedel, Samuel; Zhang, Yichao; Bertuletti, Marco; et al. (2025)Next-generation wireless technologies (for immersive-massive communication, joint communication and sensing) demand highly parallel architectures for massive data processing. A common architectural template scales up by grouping tens to hundreds of cores into shared-memory clusters, which are then scaled out as multi-cluster manycore systems. This hierarchical design, used in GPUs and accelerators, requires a balancing act between fewer large clusters and more smaller clusters, affecting design complexity, synchronization, communication efficiency, and programmability. While all multi-cluster architectures must balance these trade-offs, there is limited insight into optimal cluster sizes. This paper analyzes various cluster configurations, focusing on synchronization, data movement overhead, and programmability for typical wireless sensing and communication workloads. We extend the open-source shared-memory cluster MemPool into a multi-cluster architecture and propose a novel double-buffering barrier that decouples processor and DMA. Our results show a single 256-core cluster can be twice as fast as 16 16-core clusters for memory-bound kernels and up to 24% faster for compute-bound kernels due to reduced synchronization and communication overheads. - Quadrilatero: A RISC-V programmable matrix coprocessor for low-power edge applicationsItem type: Other Conference Item
CF '25 Companion: Proceedings of the 22nd ACM International Conference on Computing Frontiers: Workshops and Special SessionsCammarata, Danilo; Perotti, Matteo; Bertuletti, Marco; et al. (2025)The rapid growth of AI-based Internet-of-Things applications in creased the demand for high-performance edge processing engines on a low-power budget and tight area constraints. As a conse quence, vector processor architectures, traditionally designed for high-performance computing (HPC), made their way into edge devices, promising high utilization of floating-point units (FPUs) and low power consumption. However, vector processors can only exploit a single dimension of parallelism, leading to expensive ac cesses to the vector register file (VRF) when performing matrix computations, which are pervasive in AI workloads. To overcome these limitations while guaranteeing programmability, many re searchers and companies are developing dedicated instructions for a more efficient matrix multiplication (MatMul) execution. In this context, we propose Quadrilatero, an open-source RISC-V pro grammable systolic array coprocessor for low-power edge appli cations that implements a streamlined matrix ISA extension. We evaluate the post-synthesis power, performance, and area (PPA) metrics of Quadrilatero in a mature 65-nm technology node, show ing that it requires only 0.65 𝑚𝑚2 and that it can reach up to 99.4% of FPU utilization. Compared to a state-of-the-art open-source RISC V vector processor and a hybrid vector-matrix processor optimized for embedded applications, Quadrilatero improves area efficiency and energy efficiency by up to 77% and 15%, respectively. - A Dynamic Allocation Scheme for Adaptive Shared-Memory Mapping on Kilo-Core RV Clusters for Attention-Based Model DeploymentItem type: Conference Paper
2025 IEEE 36th International Conference on Application-specific Systems, Architectures and Processors (ASAP)Wang, Bowen; Bertuletti, Marco; Zhang, Yichao; et al. (2025)Attention-based models demand flexible hardware to manage diverse kernels with varying arithmetic intensities and memory access patterns. Large clusters with shared L1 memory, a common architectural pattern, struggle to fully utilize their processing elements (PEs) when scaled up due to reduced throughput in the hierarchical PE-to-L1 intra-cluster interconnect. This paper presents Dynamic Allocation Scheme (DAS), a runtime programmable address remapping hardware unit coupled with a unified memory allocator, designed to minimize data access contention of PEs onto the multi-banked L1. We evaluated DAS on an aggressively scaled-up 1024-PE RISC-V cluster with Non-Uniform Memory Access (NUMA) PE-to-L1 interconnect to demonstrate its potential for improving data locality in large parallel machine learning workloads. For a Vision Transformer (ViT)-L/16 model, each encoder layer executes in 5.67 ms, achieving a 1.94× speedup over the fixed word-level interleaved baseline with 0.81 PE utilization. Implemented in 12nm FinFET technology, DAS incurs <0.1% area overhead. - Fast Shared-Memory Barrier Synchronization for a 1024-Cores RISC-V Many-Core ClusterItem type: Conference Paper
Lecture Notes in Computer Science ~ Embedded Computer Systems: Architectures, Modeling, and SimulationBertuletti, Marco; Riedel, Samuel; Zhang, Yichao; et al. (2023)Synchronization is likely the most critical performance killer in shared-memory parallel programs. With the rise of multi-core and many-core processors, the relative impact on performance and energy overhead of synchronization is bound to grow. This paper focuses on barrier synchronization for TeraPool, a cluster of 1024 RISC-V processors with non-uniform memory access to a tightly coupled 4 MB shared L1 data memory. We compare the synchronization strategies available in other multi-core and many-core clusters to identify the optimal native barrier kernel for TeraPool. We benchmark a set of optimized barrier implementations and evaluate their performance in the framework of the widespread fork-join Open-MP style programming model. We test parallel kernels from the signal-processing and telecommunications domain, achieving less than 10% synchronization overhead over the total runtime for problems that fit TeraPool’s L1 memory. By fine-tuning our tree barriers, we achieve 1.6x speed-up with respect to a naive central counter barrier and just 6.2% overhead on a typical 5G application, including a challenging multistage synchronization kernel. To our knowledge, this is the first work where shared-memory barriers are used for the synchronization of a thousand processing elements tightly coupled to shared data memory. - Efficient Parallelization of 5G-PUSCH on a Scalable RISC-V Many-Core ProcessorItem type: Conference Paper
2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)Bertuletti, Marco; Zhang, Yichao; Vanelli Coralli, Alessandro; et al. (2023)5G Radio access network disaggregation and soft-warization pose challenges in terms of computational performance to the processing units. At the physical layer level, the baseband processing computational effort is typically offloaded to specialized hardware accelerators. However, the trend toward software-defined radio-access networks demands flexible, programmable architectures. In this paper, we explore the software design, parallelization and optimization of the key kernels of the lower physical layer (PHY) for physical uplink shared channel (PUSCH) reception on MemPool and TeraPool, two manycore systems having respectively 256 and 1024 small and efficient RISC-V cores with a large shared L1 data memory. PUSCH processing is demanding and strictly time-constrained, it represents a challenge for the baseband processors, and it is also common to most of the uplink channels. Our analysis thus generalizes to the entire lower PHY of the uplink receiver at gNodeB (gNB). Based on the evaluation of the computational effort (in multiply-accumulate operations) required by the PUSCH algorithmic stages, we focus on the parallel implementation of the dominant kernels, namely fast Fourier transform, matrix-matrix multiplication, and matrix decomposition kernels for the solution of linear systems. Our optimized parallel kernels achieve respectively on MemPool and TeraPool speedups of 211, 225, 158, and 762, 880, 722, at high utilization (0.81, 0.89, 0.71, and 0.74, 0.88, 0.71), comparable a single-core serial execution, moving a step closer toward a full-software PUSCH implementation. - Bandwidth-Latency-Thermal Co-Optimization of Interconnect-Dominated Many-Core 3D-ICItem type: Journal Article
IEEE Transactions on Very Large Scale Integration (VLSI) SystemsDas, Sudipta; Riedel, Samuel; Naeim, Mohamed; et al. (2025)The ongoing integration of advanced functionalities in contemporary system-on-chips (SoCs) poses significant challenges related to memory bandwidth, capacity, and thermal stability. These challenges are further amplified with the advancement of artificial intelligence (AI), necessitating enhanced memory and interconnect bandwidth and latency. This article presents a comprehensive study encompassing architectural modifications of an interconnect-dominated many-core SoC targeting the significant increase of intermediate, on-chip cache memory bandwidth and access latency tuning. The proposed SoC has been implemented in 3-D using A10 nanosheet technology and early thermal analysis has been performed. Our workload simulations reveal, respectively, up to 12-and 2.5-fold acceleration in the 64-core and 16-core versions of the SoC. Such speed-up comes at 40% increase in die-area and a 60% rise in power dissipation when implemented in 2-D. In contrast, the 3-D counterpart not only minimizes the footprint but also yields 20% power savings, attributable to a 40% reduction in wirelength. The article further highlights the importance of pipeline restructuring to leverage the potential of 3-D technology for achieving lower latency and more efficient memory access. Finally, we discuss the thermal implications of various 3-D partitioning schemes in High Performance Computing (HPC) and mobile applications. Our analysis reveals that, unlike high-power density HPC cases, 3-D mobile case increases T-max only by 2 degree celsius -3degree celsius compared to 2-D, while the HPC scenario analysis requires multiconstrained efficient partitioning for 3-D implementations. - 3D Partitioning with Pipeline Optimization for Low-Latency Memory Access in Many-Core SoCsItem type: Conference Paper
2024 IEEE International Symposium on Circuits and Systems (ISCAS)Das, Sudipta; Riedel, Samuel; Bertuletti, Marco; et al. (2024)This paper presents an investigation of System-onChip (SoC) communication latency optimization for 3D system integration and highlights the role of architectural modifications to maximize the Power, Performance, & Area (PPA) benefits. An instance of a highly configurable RISC-V SoC is implemented using similar to 2nm nanosheet technology and different 3D stacking options using design flow from sign-off tools. The proposed implementation targets performance optimization for different 3D partitioning scenarios: Memory-on-Logic (MoL) & Logicon-Logic (LoL). We target 2-die 3D Integrated Circuits (3DIC) with high density 3D interconnect using Face-to-Face (F2F) hybrid bonding (similar to 1 mu m), and 3-die stack, as Face-to-Back (F2B) on top of F2F. Our analysis of the 16-core SoC instance shows that the proposed architectural optimizations bring a significant reduction of 4 pipeline stages in the design hierarchy at a marginal cost of 9% effective frequency loss when implemented in 3D in comparison to the baseline 2D architecture. Further, going from 2D to 3D allows more than 40% total system wire-length reduction & 10% less cell area, resulting in 20% power savings. These findings hold promise for further explorations on manycore SoC instances (256 & more) facing system interconnect challenges. - TeraPool-SDR: An 1.89TOPS 1024 RV-Cores 4MiB Shared-L1 Cluster for Next-Generation Open-Source Software-Defined RadiosItem type: Conference Paper
GLSVLSI '24: Proceedings of the Great Lakes Symposium on VLSI 2024Zhang, Yichao; Bertuletti, Marco; Riedel, Samuel; et al. (2024)Radio Access Networks (RAN) workloads are rapidly scaling up in data processing intensity and throughput as the 5G (and beyond) standards grow in number of antennas and sub-carriers. Offering flexible Processing Elements (PEs), efficient memory access, and a productive parallel programming model, many-core clusters are a well-matched architecture for next-generation software-defined RANs, but staggering performance requirements demand a high number of PEs coupled with extreme Power, Performance and Area (PPA) efficiency. We present the architecture, design, and full physical implementation of Terapool-SDR, a cluster for Software Defined Radio (SDR) with 1024 latency-tolerant, compact RV32 PEs, sharing a global view of a 4 MiB, 4096-banked, L1 memory. We report various feasible configurations of TeraPool-SDR featuring an ultra-high bandwidth PE-to-L1-memory interconnect, clocked at 730 MHz, 880 MHz, and 924 MHz (TT/0.80 V/25 °C) in 12 nm FinFET technology. The TeraPool-SDR cluster achieves high energy efficiency on all SDR key kernels for 5G RANs: Fast Fourier Transform (93 GOPS/W), Matrix-Multiplication (125 GOPS/W), Channel Estimation (96 GOPS/W), and Linear System Inversion (61 GOPS/W). For all the kernels, it consumes less than 10 W, in compliance with industry standards. - Maestro: A 302 GFLOPS/W and 19.8GFLOPS RISC-V Vector-Tensor Architecture for Wearable Ultrasound Edge ComputingItem type: Journal Article
IEEE Transactions on Circuits and Systems I: Regular PapersSinigaglia, Mattia; Kiamarzi, Amirhossein; Bertuletti, Marco; et al. (2025)Most Wearable Ultrasound (WUS) devices lack the computational power to process signals at the edge, instead relying on remote offload, which introduces latency, high power consumption, and privacy concerns. We present Maestro, a RISC-V SoC with unified Vector-Tensor Unit (VTU) and memory-coupled Fast Fourier Transform (FFT) accelerators targeting edge processing for wearable ultrasound devices, fabricated using low-cost TSMC 65nm CMOS technology. The VTU achieves peak 302GFLOPS/W and 19.8GFLOPS at FP16, while the multi-precision 16/32-bit floating-point FFT accelerator delivers peak 60.6GFLOPS/W and 3.6GFLOPS at FP16. We evaluate Maestro on a US-based gesture recognition task, achieving 1.62GFLOPS in signal processing at 26.68GFLOPS/W, and 19.52GFLOPS in Convolutional Neural Network (CNN) workloads at 298.03GFLOPS/W. Compared to a state-of-the-art SoC with a similar mission profile, Maestro achieves a 5x speedup while consuming only 12mW, with an energy consumption of 2.5mJ in a wearable US channel preprocessing and ML-based postprocessing pipeline.
Publications 1 - 10 of 13