Matteo Perotti



Last Name

Perotti

First Name

Matteo


Publications 1 - 9 of 9
  • Cammarata, Danilo; Perotti, Matteo; Bertuletti, Marco; et al. (2025)
    CF '25 Companion: Proceedings of the 22nd ACM International Conference on Computing Frontiers: Workshops and Special Sessions
    The rapid growth of AI-based Internet-of-Things applications increased the demand for high-performance edge processing engines on a low-power budget and tight area constraints. As a consequence, vector processor architectures, traditionally designed for high-performance computing (HPC), made their way into edge devices, promising high utilization of floating-point units (FPUs) and low power consumption. However, vector processors can only exploit a single dimension of parallelism, leading to expensive accesses to the vector register file (VRF) when performing matrix computations, which are pervasive in AI workloads. To overcome these limitations while guaranteeing programmability, many researchers and companies are developing dedicated instructions for a more efficient matrix multiplication (MatMul) execution. In this context, we propose Quadrilatero, an open-source RISC-V programmable systolic array coprocessor for low-power edge applications that implements a streamlined matrix ISA extension. We evaluate the post-synthesis power, performance, and area (PPA) metrics of Quadrilatero in a mature 65-nm technology node, showing that it requires only 0.65 mm² and that it can reach up to 99.4% of FPU utilization. Compared to a state-of-the-art open-source RISC-V vector processor and a hybrid vector-matrix processor optimized for embedded applications, Quadrilatero improves area efficiency and energy efficiency by up to 77% and 15%, respectively.
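    To make the VRF-traffic argument above concrete, the plain-C sketch below shows the register-tile reuse that matrix units exploit: a T×T accumulator tile reuses each loaded operand T times, so per-FLOP register and memory traffic shrinks with the tile dimension, whereas a one-dimensional vector formulation re-reads operands for every multiply-accumulate. The tile size is illustrative and no Quadrilatero-specific instructions are shown.

      #include <stddef.h>

      #define T 4  /* illustrative register-tile dimension */

      /* C[M×N] += A[M×K] * B[K×N]; M and N assumed multiples of T. */
      void matmul_tiled(size_t M, size_t N, size_t K,
                        const float *A, const float *B, float *C)
      {
          for (size_t i = 0; i < M; i += T)
              for (size_t j = 0; j < N; j += T) {
                  float acc[T][T] = {{0}};  /* accumulators stay register-resident */
                  for (size_t k = 0; k < K; ++k) {
                      float a[T], b[T];
                      for (size_t t = 0; t < T; ++t) a[t] = A[(i + t) * K + k];
                      for (size_t t = 0; t < T; ++t) b[t] = B[k * N + j + t];
                      /* T*T MACs per 2*T loads: each operand is reused T times */
                      for (size_t r = 0; r < T; ++r)
                          for (size_t c = 0; c < T; ++c)
                              acc[r][c] += a[r] * b[c];
                  }
                  for (size_t r = 0; r < T; ++r)
                      for (size_t c = 0; c < T; ++c)
                          C[(i + r) * N + j + c] += acc[r][c];
              }
      }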
  • Perotti, Matteo; Maisto, Vincenzo; Imfeld, Moritz; et al. (2025)
    Vector processor architectures offer an efficient solution for accelerating data-parallel workloads (e.g., ML, AI), reducing instruction count, and enhancing processing efficiency. This is evidenced by the increasing adoption of vector ISAs, such as Arm’s SVE/SVE2 and RISC-V’s RVV, not only in high-performance computers but also in embedded systems. The open-source nature of RVV has particularly encouraged the development of numerous vector processor designs across industry and academia. However, despite the growing number of open-source RVV processors, there is a lack of published data on their performance in a complex application environment hosted by a full-fledged operating system (Linux). In this work, we add OS support to the open-source bare-metal Ara2 vector processor (AraOS) by sharing the MMU of CVA6, the scalar core used for instruction dispatch to Ara2, and integrate AraOS into the open-source Cheshire SoC platform. We evaluate the performance overhead of virtual-to-physical address translation by benchmarking matrix multiplication kernels across several problem sizes and translation lookaside buffer (TLB) configurations in CVA6’s shared MMU, providing insights into vector performance in a full-system environment with virtual memory. With at least 16 TLB entries, the virtual memory overhead remains below 3.5%. Finally, we benchmark a 2-lane AraOS instance with the open-source RiVEC benchmark suite for RVV architectures, with peak average speedups of 3.2× against scalar-only execution.
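    For flavor, the following is a minimal strip-mined RVV kernel (daxpy) of the kind that runs as an ordinary user-space Linux program once the OS manages vector state. It uses the standard RVV C intrinsics from riscv_vector.h (compile with, e.g., -march=rv64gcv) and is an illustrative sketch, not code from the paper.

      #include <riscv_vector.h>
      #include <stddef.h>

      /* y += a * x, strip-mined: vsetvl picks how many elements each pass handles. */
      void daxpy(size_t n, double a, const double *x, double *y)
      {
          while (n > 0) {
              size_t vl = __riscv_vsetvl_e64m8(n);          /* elements this pass */
              vfloat64m8_t vx = __riscv_vle64_v_f64m8(x, vl);
              vfloat64m8_t vy = __riscv_vle64_v_f64m8(y, vl);
              vy = __riscv_vfmacc_vf_f64m8(vy, a, vx, vl);  /* vy += a * vx */
              __riscv_vse64_v_f64m8(y, vy, vl);
              x += vl; y += vl; n -= vl;
          }
      }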
  • Kunhi Purayil, Navaneeth; Shen, Diyou; Perotti, Matteo; et al. (2025)
    2025 IEEE 43rd International Conference on Computer Design (ICCD)
    The fast evolution of Machine Learning (ML) models requires flexible and efficient hardware solutions as hardwired accelerators face rapid obsolescence. Vector processors are fully programmable and achieve high energy efficiencies by exploiting data parallelism, amortizing instruction fetch and decoding costs. Hence, a promising design choice is to build accelerators based on shared L1-memory clusters of streamlined Vector Processing Elements (VPEs). However, current state-of-the-art VPEs are limited in L1 memory bandwidth and achieve high efficiency only for computational kernels with high data reuse in the Vector Register File (VRF), such as General Matrix Multiplication (GEMM). Performance is sub-optimal for workloads with lower data reuse like General Matrix-Vector Multiplication (GEMV). To fully exploit available bandwidth at the L1 memory interface, the VPE micro-architecture must be optimized to achieve near-ideal utilization, i.e., to be as close as possible to the L1 memory roofline (at-the-roofline). In this work, we propose TROOP, a set of hardware optimizations that include decoupled load-store interfaces, improved vector chaining, shadow buffers to hide VRF conflicts, and address scrambling techniques to achieve at-the-roofline performance for VPEs without compromising their area and energy efficiency. We implement TROOP on an open-source streamlined vector processor in a 12 nm FinFET technology. TROOP achieves significant speedups of 1.5×, 2.2×, and 2.6×, respectively, for key memory-intensive kernels such as GEMV, DOTP and AXPY, achieving at-the-roofline performance. Additionally, TROOP enhances the energy efficiency by up to 45%, reaching 38 DP-GFLOPs/W (1 GHz, TT, 0.8 V) for DOTP while maintaining a high energy efficiency of 61 DP-GFLOPs/W for GEMMs, incurring only a minor area overhead of less than 7%.
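    As a back-of-the-envelope roofline check on why these kernels need the full L1 bandwidth (textbook arithmetic with 64-bit operands; generic numbers, not figures from the paper): attainable performance is capped by arithmetic intensity times memory bandwidth, and for DOTP and AXPY that intensity is tiny.

      P = \min\bigl(P_{\text{peak}},\ I \cdot B\bigr), \qquad
      I_{\text{DOTP}} = \frac{2n\ \text{FLOP}}{2n \cdot 8\ \text{B}} = \frac{1}{8}\ \frac{\text{FLOP}}{\text{B}}, \qquad
      I_{\text{AXPY}} = \frac{2n\ \text{FLOP}}{3n \cdot 8\ \text{B}} = \frac{1}{12}\ \frac{\text{FLOP}}{\text{B}}

    At intensities this low, performance equals I·B, so every stalled L1 access shows up directly as lost FLOP/s; this is the regime that TROOP's decoupled load-store interfaces and shadow buffers target.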
  • Garofalo, Angelo; Ottaviano, Alessandro; Perotti, Matteo; et al. (2025)
    IEEE Transactions on Circuits and Systems II. Express Briefs
    Next-generation mixed-criticality Systems-on-chip (SoCs) must execute mixed-criticality AI-enhanced sensor processing and control workloads, ensuring reliable and time-predictable execution of critical tasks while fitting within a sub-2 W power envelope. To tackle these challenges, we present a 16 nm, reliable, time-predictable heterogeneous SoC with multiple programmable accelerators. Within a 1.2 W power envelope, the SoC integrates software-configurable hardware IPs to ensure predictable access to shared resources, such as the on-chip interconnect and memory system, leading to tight upper bounds on execution times of critical applications. To accelerate mission-critical AI, the SoC integrates a reliable multi-core accelerator achieving 304.9 GOPS peak performance at 1.6 TOPS/W energy efficiency. Non-critical, compute-intensive, floating-point workloads are accelerated by a vector cluster, achieving 1.1 TFLOPS/W and 106.8 GFLOPS/mm².
  • Schönleber, Jannis; Cavigelli, Lukas Arno Jakob; Perotti, Matteo; et al. (2025)
    2025 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)
    Artificial intelligence has surged in recent years, with advancements in machine learning rapidly impacting nearly every area of life. However, the growing complexity of these models has far outpaced advancements in available hardware accelerators, leading to significant computational and energy demands, primarily due to matrix multiplications, which dominate the compute workload. MADDNESS (i.e., Multiply-ADDitioN-lESS) presents a hash-based version of product quantization, which renders matrix multiplications into lookups and additions, eliminating the need for multipliers entirely. We present STELLA NERA, the first MADDNESS-based accelerator, achieving an energy efficiency of 161 TOp/s/W @ 0.55 V, 25× better than conventional MatMul accelerators due to its small components and reduced computational complexity. We further enhance MADDNESS with a differentiable approximation, allowing for gradient-based fine-tuning and achieving an end-to-end performance of 92.5% Top-1 accuracy on CIFAR-10.
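    The lookup-add idea can be sketched in a few lines of C: split each input row into C subspaces, encode each subspace as a small code, precompute per-column lookup tables once, and replace every inner product with C table lookups plus additions. This sketch uses a plain nearest-centroid encoder and illustrative sizes; MADDNESS itself learns fast hash-tree encoders, which are not shown.

      #include <stddef.h>
      #include <stdint.h>
      #include <float.h>

      enum { C = 4, V = 8, K = 16, D = C * V };  /* illustrative sizes */

      /* Step 1 (per input row): encode the row as C small codes, one per
       * subspace. Plain nearest-centroid search here; MADDNESS replaces
       * this with learned hash trees so encoding avoids multipliers too. */
      static void encode(const float row[D], const float proto[C][K][V],
                         uint8_t codes[C])
      {
          for (int c = 0; c < C; ++c) {
              float best = FLT_MAX;
              for (int k = 0; k < K; ++k) {
                  float d = 0.0f;
                  for (int v = 0; v < V; ++v) {
                      float t = row[c * V + v] - proto[c][k][v];
                      d += t * t;
                  }
                  if (d < best) { best = d; codes[c] = (uint8_t)k; }
              }
          }
      }

      /* Step 2 (once per B column): tabulate each prototype's partial dot
       * product with the column. */
      static void build_lut(const float bcol[D], const float proto[C][K][V],
                            float lut[C][K])
      {
          for (int c = 0; c < C; ++c)
              for (int k = 0; k < K; ++k) {
                  float s = 0.0f;
                  for (int v = 0; v < V; ++v)
                      s += proto[c][k][v] * bcol[c * V + v];
                  lut[c][k] = s;
              }
      }

      /* Step 3 (online): each output element is C lookups and C-1 adds,
       * with no multiplication in the loop. */
      static float approx_dot(const uint8_t codes[C], const float lut[C][K])
      {
          float s = 0.0f;
          for (int c = 0; c < C; ++c)
              s += lut[c][codes[c]];
          return s;
      }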
  • Kunhi Purayil, Navaneeth; Perotti, Matteo; Benini, Luca; et al. (2025)
    2025 Design, Automation & Test in Europe Conference (DATE)
    The ever-growing scale of data parallelism in today's HPC and ML applications presents a big challenge for computing architectures' energy efficiency and performance. Vector processors address the scale-up challenge by decoupling Vector Register File (VRF) and datapath widths, allowing the VRF to host long vectors and increase register-stored data reuse while reducing the relative cost of instruction fetch and decode. However, even the largest vector processor designs today struggle to scale to more than 8 vector lanes with double-precision Floating Point Units (FPUs) and 256 64-bit elements per vector register. This limitation is induced by difficulties in the physical implementation, which becomes wire-dominated and inefficient. In this work, we present AraXL, a modular and scalable 64-bit RISC-V V vector architecture targeting long-vector applications for HPC and ML. AraXL addresses the physical scalability challenges of state-of-the-art vector processors with a distributed and hierarchical interconnect, supporting up to 64 parallel vector lanes and reaching the maximum Vector Register File size of 64 Kibit/vreg permitted by the RISC-V V 1.0 ISA specification. Implemented in a 22-nm technology node, our 64-lane AraXL achieves a performance peak of 146 GFLOPs on computation-intensive HPC/ML kernels (>99% FPU utilization) and energy efficiency of 40.1 GFLOPs/W (1.15 GHz, TT, 0.8 V), with only 3.8× the area of a 16-lane instance.
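    Two of the headline figures can be sanity-checked directly (assuming one double-precision FMA per lane per cycle, which the >99% FPU utilization implies): 64 Kibit/vreg is exactly the RVV 1.0 ceiling of VLEN = 65536 bits, and the peak compute works out to

      64\ \text{lanes} \times 2\ \frac{\text{FLOP}}{\text{FMA}} \times 1.15\ \text{GHz} = 147.2\ \text{GFLOP/s}, \qquad \frac{146}{147.2} \approx 99.2\%.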
  • Perotti, Matteo; Cavalcante, Matheus; Wistoff, Nils; et al. (2022)
    2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP)
    Vector architectures are gaining traction for highly efficient processing of data-parallel workloads, driven by all major ISAs (RISC-V, Arm, Intel), and boosted by landmark chips, like the Arm SVE-based Fujitsu A64FX, powering the TOP500 leader Fugaku. The RISC-V V extension has recently reached 1.0-Frozen status. Here, we present its first open-source implementation, discuss the new specification's impact on the micro-architecture of a lane-based design, and provide insights on performance-oriented design of coupled scalar-vector processors. Our system achieves comparable/better PPA than state-of-the-art vector engines that implement older RVV versions: 15% better area, 6% improved throughput, and FPU utilization >98.5% on crucial kernels.
  • Perotti, Matteo; Riedel, Samuel; Cavalcante, Matheus; et al. (2025)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
    The ever-increasing computational and storage requirements of modern applications and the slowdown of technology scaling pose major challenges to designing and implementing efficient computer architectures. To mitigate the bottlenecks of typical processor-based architectures on both the instruction and data sides of the memory, we present Spatz, a compact 64-bit floating-point-capable vector processor based on RISC-V's vector extension Zve64d. Using Spatz as the main Processing Element (PE), we design an open-source dual-core vector processor architecture based on a modular and scalable cluster sharing a Scratchpad Memory (SCM). Unlike typical vector processors, whose Vector Register Files (VRFs) are hundreds of KiB large, we prove that Spatz can achieve peak energy efficiency with a latch-based VRF of only 2 KiB. An implementation of the Spatz-based cluster in GlobalFoundries' 12LPP process with eight double-precision Floating Point Units (FPUs) achieves an FPU utilization just 3.4% lower than the ideal upper bound on a double-precision, floating-point matrix multiplication. The cluster reaches 7.7 FMA/cycle, corresponding to 15.7 DP-GFLOPS and 95.7 DP-GFLOPS/W at 1 GHz and nominal operating conditions (TT, 0.80 V, and 25 °C), with more than 55% of the power spent on the FPUs. Furthermore, the optimally balanced Spatz-based cluster reaches a 95.0% FPU utilization (7.6 FMA/cycle), 15.2 DP-GFLOPS, and 99.3 DP-GFLOPS/W (61% of the power spent in the FPU) on a 2D workload with a 7×7 kernel, resulting in an outstanding area/energy efficiency of 171 DP-GFLOPS/W/mm². At equi-area, the computing cluster built upon compact vector processors reaches a 30% higher energy efficiency than a cluster with the same FPU count built upon scalar cores specialized for stream-based floating-point computation.
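    Dividing the two nominal-condition figures gives the implied cluster power (simple arithmetic, not a number quoted in the abstract):

      P_{\text{cluster}} \approx \frac{15.7\ \text{DP-GFLOPS}}{95.7\ \text{DP-GFLOPS/W}} \approx 0.16\ \text{W}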