Sergio Mazzola
Last Name: Mazzola
First Name: Sergio
Organisational unit: 03996 - Benini, Luca / Benini, Luca
5 results
Search Results
Publications 1 - 5 of 5
- A Data-Driven Approach to Lightweight DVFS-Aware Counter-Based Power Modeling for Heterogeneous Platforms
  Item type: Conference Paper
  Lecture Notes in Computer Science ~ Embedded Computer Systems: Architectures, Modeling, and Simulation
  Mazzola, Sergio; Benz, Thomas; Forsberg, Björn; et al. (2022)
  Computing systems have shifted towards highly parallel and heterogeneous architectures to tackle the challenges imposed by limited power budgets. These architectures must be supported by novel power management paradigms addressing the increasing design size, parallelism, and heterogeneity while ensuring high accuracy and low overhead. In this work, we propose a systematic, automated, and architecture-agnostic approach to accurate and lightweight DVFS-aware statistical power modeling of the CPU and GPU sub-systems of a heterogeneous platform, driven by the sub-systems' local performance monitoring counters (PMCs). Counter selection is guided by a generally applicable statistical method that identifies the minimal subsets of counters robustly correlating to power dissipation. Based on the selected counters, we train a set of lightweight, linear models characterizing each sub-system over a range of frequencies. Such models compose a lookup-table-based system-level model that efficiently captures the non-linearity of power consumption, showing desirable responsiveness and decomposability. We validate the system-level model on real hardware by measuring the total energy consumption of an NVIDIA Jetson AGX Xavier platform over a set of benchmarks. The resulting average estimation error is 1.3%, with a maximum of 3.1%. Furthermore, the model shows a maximum evaluation runtime of 500 ns, thus implying a negligible impact on system utilization and applicability to online dynamic power management (DPM).
- Enabling Efficient Hybrid Systolic Computation in Shared-L1-Memory Manycore Clusters
  Item type: Journal Article
  IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  Mazzola, Sergio; Riedel, Samuel; Benini, Luca (2024)
  Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the former excel with regular dataflow at the cost of rigid architectures and complex programming models, the latter are versatile and easy to program but require explicit dataflow management and synchronization. This work aims at enabling efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient cores act as the systolic array's processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster's shared memory. We introduce two low-overhead instruction set architecture (ISA) extensions for efficient systolic execution, namely Xqueue and queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in MemPool, an open-source shared-memory cluster with 256 PEs, and analyze the hybrid systolic-shared-memory architecture's trade-offs on several digital signal processing (DSP) kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double MemPool's compute unit utilization, reaching up to 73%. In typical conditions (TT/0.80 V/25 °C), in a 22-nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
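The queue-mapped systolic communication described in this work can be illustrated with a toy software analogy. This is a sketch only: in the actual hardware, Xqueue and QLR accesses are single instructions (or fully implicit register reads/writes), whereas here Python threads and `queue.Queue` stand in for PEs and shared-memory-mapped queues, and the multiply-accumulate chain and weights are made up for illustration.

```python
import queue
import threading

def pe(weight, q_in, q_out):
    # One "processing element": pop (sample, partial_sum) from its input
    # queue, accumulate weight * sample, and push to the next queue -- a
    # software stand-in for hardware-managed, shared-memory-mapped queues.
    while True:
        item = q_in.get()
        if item is None:              # end-of-stream marker
            q_out.put(None)
            return
        x, acc = item
        q_out.put((x, acc + weight * x))

def run_chain(samples, weights):
    # Build a 1D systolic chain: one queue between consecutive PEs.
    qs = [queue.Queue() for _ in range(len(weights) + 1)]
    pes = [threading.Thread(target=pe, args=(w, qs[i], qs[i + 1]))
           for i, w in enumerate(weights)]
    for t in pes:
        t.start()
    for x in samples:
        qs[0].put((x, 0))             # stream samples into the first PE
    qs[0].put(None)
    results = []
    while (item := qs[-1].get()) is not None:
        results.append(item[1])
    for t in pes:
        t.join()
    return results
```

Each sample traverses every stage and leaves with the accumulated sum of all per-stage products, so `run_chain([1, 2, 3], [10, 1])` yields `[11, 22, 33]`; the PEs run autonomously and synchronize only through the queues, mirroring how QLRs relieve cores of explicit communication instructions.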
- A 410 GFLOP/s, 64 RISC-V Cores, 204.8 GBps Shared-Memory Cluster in 12 nm FinFET with Systolic Execution Support for Efficient B5G/6G AI-Enhanced O-RAN
  Item type: Conference Paper
  2025 IEEE European Solid-State Electronics Research Conference (ESSERC)
  Zhang, Yichao; Bertuletti, Marco; Mazzola, Sergio; et al. (2025)
  We present HeartStream, a 64-RV-core shared-L1-memory cluster (410 GFLOP/s peak performance and 204.8 GBps L1 bandwidth) for energy-efficient AI-enhanced O-RAN. The cores and cluster architecture are customized for baseband processing, supporting complex (16-bit real & imaginary) instructions: multiply & accumulate, division & square root, SIMD instructions, and hardware-managed systolic queues, improving the energy efficiency of key baseband kernels by up to 1.89×. At 800 MHz @ 0.8 V, HeartStream delivers up to 243 GFLOP/s on complex-valued wireless workloads. Furthermore, the cores support efficient AI processing on received data at up to 72 GOP/s. HeartStream is fully compatible with base-station power and processing-latency limits: it achieves leading-edge software-defined PUSCH efficiency (49.6 GFLOP/s/W) and consumes just 0.68 W (645 MHz @ 0.65 V), within the 4 ms end-to-end constraint for B5G/6G uplink.
- MemPool Meets Systolic: Flexible Systolic Computation in a Large Shared-Memory Processor Cluster
  Item type: Conference Paper
  2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)
  Riedel, Samuel; Khov, Gua Hao; Mazzola, Sergio; et al. (2023)
  Systolic arrays and shared-memory manycore clusters are two widely used architectural templates that offer vastly different trade-offs. Systolic arrays achieve exceptional performance for workloads with regular dataflow at the cost of a rigid architecture and programming model. Shared-memory manycore systems are more flexible and easy to program, but data must be moved explicitly to/from cores. This work combines the best of both worlds by adding a systolic overlay to a general-purpose shared-memory manycore cluster, allowing for efficient systolic execution while maintaining flexibility. We propose and implement two instruction set architecture extensions enabling native and automatic communication between cores through shared memory. Our hybrid approach allows configuring different systolic topologies at execution time and running hybrid systolic-shared-memory computations. The hybrid architecture's convolution kernel outperforms the optimized shared-memory one by 18%.
- Data-driven power modeling and monitoring via hardware performance counter tracking
  Item type: Journal Article
  Journal of Systems Architecture
  Mazzola, Sergio; Ara, Gabriele; Benz, Thomas; et al. (2025)
  Energy-centric design is paramount in the current embedded computing era: use cases require increasingly high performance at an affordable power budget, often under real-time constraints. Hardware heterogeneity and parallelism help address the efficiency challenge, but greatly complicate online power consumption assessments, which are essential for dynamic hardware and software stack adaptations. We introduce a novel power modeling methodology with state-of-the-art accuracy, low overhead, and high responsiveness, whose implementation does not rely on microarchitectural details. Our methodology identifies the Performance Monitoring Counters (PMCs) with the highest linear correlation to the power consumption of each hardware sub-system, for each Dynamic Voltage and Frequency Scaling (DVFS) state. The individual, simple models are composed into a complete model that effectively describes the power consumption of the whole system, achieving high accuracy and low overhead. Our evaluation reports an average estimation error of 7.5% for power consumption and 1.3% for energy. We integrate these models in the Linux kernel with Runmeter, an open-source, PMC-based monitoring framework. Runmeter manages PMC sampling and processing, enabling the execution of our power models at runtime. With a worst-case time overhead of only 0.7%, Runmeter provides responsive and accurate power measurements directly in the kernel. This information can be employed for actuation policies in workload-aware DVFS and power-aware, closed-loop task scheduling.
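The lookup-table composition of per-DVFS-state linear models, shared by the two power-modeling publications above, can be sketched as follows. This is a minimal illustration, not the papers' implementation: the counter name, sub-system labels, and training data below are hypothetical, and the real methodology additionally performs statistical counter selection and in-kernel integration.

```python
import numpy as np

class DvfsAwarePowerModel:
    """Lookup table of linear models, one per (sub-system, DVFS state);
    the system-level estimate sums the models of the active states."""

    def __init__(self):
        self.lut = {}  # (subsystem, freq) -> (counter_names, coefficients)

    def fit(self, subsystem, freq, pmc_samples, power_samples):
        # pmc_samples: {counter_name: [values]}. Least-squares fit of
        # power = b0 + sum_i b_i * counter_i for this DVFS state.
        names = sorted(pmc_samples)
        X = np.column_stack([np.ones(len(power_samples))]
                            + [pmc_samples[n] for n in names])
        coef, *_ = np.linalg.lstsq(X, np.asarray(power_samples, float),
                                   rcond=None)
        self.lut[(subsystem, freq)] = (names, coef)

    def estimate(self, readings):
        # readings: {(subsystem, freq): {counter_name: value}} for the
        # currently active DVFS state of each sub-system.
        total = 0.0
        for key, counters in readings.items():
            names, coef = self.lut[key]
            x = np.array([1.0] + [float(counters[n]) for n in names])
            total += float(coef @ x)
        return total
```

Fitting on samples that follow, say, power = 2 + 3·cycles recovers those coefficients exactly, so a later reading of 10 cycles at the same DVFS state estimates 32 units of power; switching frequency simply selects a different entry of the lookup table, which is how the composed model captures the non-linearity of power across DVFS states while each entry stays linear.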