Luca Benini


Last Name: Benini
First Name: Luca
Organisational unit: 03996 - Benini, Luca

Search Results

Publications 1 - 10 of 1003
  • Borghesi, Andrea; Di Santi, Carmine; Molan, Martin; et al. (2023)
    Scientific Data
    Supercomputers are the most powerful computing machines available to society. They play a central role in economic, industrial, and societal development. While they are used by scientists, engineers, decision-makers, and data analysts to computationally solve complex problems, supercomputers and their hosting datacenters are themselves complex, power-hungry systems. Improving their efficiency, availability, and resiliency is vital and the subject of many research and engineering efforts. Still, a major roadblock hinders researchers: the dearth of reliable data describing the behavior of production supercomputers. In this paper, we present the result of a ten-year-long project to design a monitoring framework (EXAMON), deployed on the Italian supercomputers at the CINECA datacenter. We disclose the first holistic dataset of a tier-0 Top10 supercomputer. It includes the management, workload, facility, and infrastructure data of the Marconi100 supercomputer for two and a half years of operation. The dataset (published via Zenodo) is the largest ever made public, with a size of 49.9 TB before compression. We also provide open-source software modules to simplify access to the data and provide direct usage examples.
  • Burgio, Paolo; Tagliavini, Giuseppe; Conti, Francesco; et al. (2014)
    2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)
    Modern designs for embedded systems are increasingly embracing cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems which i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment with minimal overhead for the scheduling of parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections, and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) which fits the semantics of all three constructs. The HWSE is designed as a block tightly coupled to the PEs within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, fundamental to achieving fast dynamic scheduling and, ultimately, to enabling fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.
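    A minimal, generic OpenMP/C sketch of the three constructs studied above (dynamic loops, sections, and tasks) is given below; it uses only standard OpenMP pragmas and is not the paper's HWSE-accelerated runtime.

        #include <omp.h>
        #include <stdio.h>

        #define N 64

        int main(void) {
            int data[N];

            #pragma omp parallel
            {
                /* Dynamic loop: iterations are handed out in chunks at run time,
                   so the runtime scheduler (or, in the paper, the HWSE) is
                   invoked repeatedly. */
                #pragma omp for schedule(dynamic, 4)
                for (int i = 0; i < N; i++)
                    data[i] = i * i;

                /* Sections: each block is an independent unit of work assigned
                   to one thread by the runtime. */
                #pragma omp sections
                {
                    #pragma omp section
                    { printf("section A on thread %d\n", omp_get_thread_num()); }
                    #pragma omp section
                    { printf("section B on thread %d\n", omp_get_thread_num()); }
                }

                /* Tasks: created by one thread, queued, and executed by any
                   thread; the finest-grained and most scheduler-intensive of
                   the three constructs. */
                #pragma omp single
                {
                    for (int i = 0; i < 8; i++) {
                        #pragma omp task firstprivate(i)
                        data[i] += i;
                    }
                    #pragma omp taskwait
                }
            }
            printf("data[7] = %d\n", data[7]);
            return 0;
        }

    Each scheduling point in this code (chunk hand-out, section assignment, task enqueue/dequeue) is the kind of runtime interaction the HWSE moves into hardware.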
  • Meloni, Paolo; Deriu, Gianfranco; Conti, Francesco; et al. (2016)
    Proceedings of the ACM International Conference on Computing Frontiers (CF '16)
  • Verdecchia, Andrea; Brunelli, Davide; Tinti, Francesco; et al. (2016)
    EESMS 2016, 2016 IEEE Workshop on Environmental, Energy, and Structural Monitoring Systems, Proceedings
  • Spenza, Dora; Magno, Michele; Basagni, Stefano; et al. (2015)
    2015 IEEE Conference on Computer Communications (INFOCOM)
  • Rossi, Davide; Pullini, Antonio; Loi, Igor; et al. (2016)
    Proceedings of the IEEE Symposium in Low-Power and High-Speed Chips, 2016 (IEEE COOL CHIPS XIX)
  • Kurth, Andreas; Capotondi, Alessandro; Vogel, Pirmin; et al. (2018)
    Heterogeneous systems on chip (HeSoCs) co-integrate a high-performance multicore host processor with programmable manycore accelerators (PMCAs) to combine “standard platform” software support (e.g., the Linux OS) with energy-efficient, domain-specific, highly parallel processing capabilities. In this work, we present HERO, a HeSoC platform that tackles this challenge in a novel way: HERO’s host processor is an industry-standard ARM Cortex-A multicore complex, while its PMCA is a scalable, silicon-proven, open-source many-core processing engine based on the extensible, open RISC-V ISA. We evaluate a prototype implementation of HERO, where the PMCA, implemented on an FPGA fabric, is coupled with a hard ARM Cortex-A host processor, and show that the runtime overhead compared to manually written PMCA code operating on private physical memory is lower than 10% for pivotal benchmarks and operating conditions. Thus, HERO demonstrates that ARM and RISC-V can productively coexist in a dual-ISA HW-SW platform.
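    HERO's PMCA is programmed from the host through accelerator offloading. Purely as an illustration (a generic OpenMP 4.x device-offload sketch in C, not HERO-specific code or its toolchain), a host-to-PMCA offload of a data-parallel loop could look like:

        #include <stdio.h>

        #define N 1024

        static float a[N], b[N], c[N];

        int main(void) {
            for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

            /* Offload region: the arrays are mapped to the accelerator's
               address space and the loop is distributed over its parallel
               teams; without a device, the region falls back to the host. */
            #pragma omp target map(to: a, b) map(from: c)
            #pragma omp teams distribute parallel for
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];

            printf("c[10] = %f\n", c[10]);
            return 0;
        }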
  • Pittino, Federico; Diversi, Roberto; Benini, Luca; et al. (2020)
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
    Power and thermal management are critical components of high-performance computing (HPC) systems, due to their high power density and large total power consumption. Assessing thermal dissipation by means of compact models derived directly from the thermal response of the final device enables more robust and precise thermal control strategies as well as automated diagnosis. However, when dealing with large-scale systems “in production,” the accuracy of learned thermal models depends on the dynamics of the power excitation, which in turn depends on the executed workload, and on measurement nonidealities such as quantization. In this article, we show that, using an advanced system identification algorithm, we are able to generate very accurate thermal models (average error lower than our sensors' quantization step of 1 °C) for a large-scale HPC system on real workloads over very long time periods. However, we also show that: 1) not all real workloads allow for the identification of a good model, and 2) starting from the theory of system identification, it is very difficult to evaluate whether a trace of data leads to a good estimated model. We then propose and validate a set of techniques based on machine learning and deep learning algorithms for choosing the data traces to be used for model identification. We also show that deep learning techniques are necessary to correctly choose such traces up to 96% of the time.
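    Purely as an illustration of the underlying idea (not the authors' identification algorithm), a first-order discrete-time thermal model T[k+1] ≈ a·T[k] + b·P[k] + c can be fitted to a temperature/power trace by ordinary least squares; the trace below is made up for the sketch.

        #include <stdio.h>

        #define K 6  /* number of samples in the toy trace */

        /* Solve the 3x3 system M x = y in place (naive Gaussian elimination,
           no pivoting; adequate for this well-conditioned toy example). */
        static void solve3(double M[3][3], double y[3], double x[3]) {
            for (int i = 0; i < 3; i++)
                for (int j = i + 1; j < 3; j++) {
                    double f = M[j][i] / M[i][i];
                    for (int k = i; k < 3; k++) M[j][k] -= f * M[i][k];
                    y[j] -= f * y[i];
                }
            for (int i = 2; i >= 0; i--) {
                x[i] = y[i];
                for (int j = i + 1; j < 3; j++) x[i] -= M[i][j] * x[j];
                x[i] /= M[i][i];
            }
        }

        int main(void) {
            /* Hypothetical trace: core temperature [degC] and power [W]. */
            double T[K] = {40.0, 42.0, 45.0, 46.5, 47.5, 48.2};
            double P[K] = {30.0, 50.0, 50.0, 40.0, 40.0, 40.0};

            /* Accumulate the normal equations for T[k+1] = a*T[k] + b*P[k] + c. */
            double M[3][3] = {{0.0}}, y[3] = {0.0}, theta[3];
            for (int k = 0; k < K - 1; k++) {
                double phi[3] = {T[k], P[k], 1.0};
                for (int i = 0; i < 3; i++) {
                    for (int j = 0; j < 3; j++) M[i][j] += phi[i] * phi[j];
                    y[i] += phi[i] * T[k + 1];
                }
            }
            solve3(M, y, theta);
            printf("a = %.4f  b = %.4f  c = %.4f\n", theta[0], theta[1], theta[2]);
            return 0;
        }

    Whether such a fit turns out well depends strongly on the dynamics of the power excitation in the chosen trace, which is exactly what the paper's ML/DL trace-selection techniques try to predict.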
  • Cioflan, Cristian; Cavigelli, Lukas Arno Jakob; Benini, Luca (2024)
    arXiv
    Keyword spotting systems for always-on, TinyML-constrained applications require on-site tuning to boost the accuracy of offline-trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, which are often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture composed of a pretrained backbone and a user-aware embedding that learns the user's speech characteristics. The generated features are fused and used to classify the input utterance. For domain shifts generated by unseen speakers, we measure error rate reductions of up to 19% (from 30.1% to 24.3%) on the 35-class problem of the Google Speech Commands dataset, through the inexpensive update of the user projections. We moreover demonstrate the few-shot learning capabilities of our proposed architecture in sample- and class-scarce learning conditions. With 23.7 k parameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers.
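    A loose, microcontroller-style C sketch of the fused inference path is given below; the dimensions, the concatenation-based fusion, and the linear classifier head are assumptions made for illustration, not the configuration described in the paper.

        #include <stdio.h>

        #define FEAT_DIM  8   /* backbone output features (assumed size) */
        #define USER_DIM  4   /* user-aware embedding size (assumed)     */
        #define NUM_CLASS 3   /* keyword classes (toy example)           */

        /* Frozen classifier head acting on the fused [backbone | user] vector;
           weights are zeroed here as placeholders. */
        static const float head_w[NUM_CLASS][FEAT_DIM + USER_DIM] = {{0.0f}};
        static const float head_b[NUM_CLASS] = {0.0f};

        /* Only this small vector would be updated during on-device adaptation
           to a new speaker. */
        static float user_emb[USER_DIM];

        static int classify(const float feat[FEAT_DIM]) {
            float fused[FEAT_DIM + USER_DIM];
            for (int i = 0; i < FEAT_DIM; i++) fused[i] = feat[i];
            for (int i = 0; i < USER_DIM; i++) fused[FEAT_DIM + i] = user_emb[i];

            int best = 0;
            float best_score = -1e30f;
            for (int c = 0; c < NUM_CLASS; c++) {
                float s = head_b[c];
                for (int i = 0; i < FEAT_DIM + USER_DIM; i++)
                    s += head_w[c][i] * fused[i];
                if (s > best_score) { best_score = s; best = c; }
            }
            return best;
        }

        int main(void) {
            float feat[FEAT_DIM] = {0.0f};  /* would come from the pretrained backbone */
            printf("predicted class: %d\n", classify(feat));
            return 0;
        }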
  • Wang, Xiaying; Magno, Michele; Cavigelli, Lukas; et al. (2018)
    2018 IEEE Biomedical Circuits and Systems Conference (BioCAS)
    This paper focuses on ultra-low-power embedded classification of neural activities. The machine learning (ML) algorithm was trained using evoked local field potentials (LFPs) recorded with an implanted 16x16 multi-electrode array (MEA) from the rat barrel cortex while stimulating the whisker. Experimental results demonstrate that ML can be successfully applied to noisy single-trial LFPs. We achieved up to 95.8% test accuracy in predicting the whisker deflection. The trained ML model is successfully implemented on a low-power embedded system with an average power consumption of 2.6 mW.