Luca Benini
Last Name: Benini
First Name: Luca
ORCID:
Organisational unit: 03996 - Benini, Luca
1018 results
Search Results
Publications 1–10 of 1018
- Dual-Issue Execution of Mixed Integer and Floating-Point Workloads on Energy-Efficient In-Order RISC-V Cores
  Item type: Conference Paper
  2025 62nd ACM/IEEE Design Automation Conference (DAC)
  Colagrande, Luca; Benini, Luca (2025)
  To meet the computational requirements of modern workloads under tight energy constraints, general-purpose accelerator architectures have to integrate an ever-increasing number of extremely area- and energy-efficient processing elements (PEs). In this context, single-issue in-order cores are commonplace, but lean dual-issue cores could boost PE IPC, especially for the common case of mixed integer and floating-point workloads. We develop the COPIFT methodology and RISC-V ISA extensions to enable low-cost and flexible dual-issue execution of mixed integer and floating-point instruction sequences. On such kernels, our methodology achieves speedups of 1.47×, reaching a peak of 1.75 instructions per cycle, and energy improvements of 1.37× on average, over optimized RV32G baselines.
- Paper, pen and ink: An innovative system and software framework to assist writing rehabilitation
  Item type: Conference Paper
  2015 Design, Automation & Test in Europe Conference & Exhibition (DATE 2015): Grenoble, France, 9–13 March 2015
  Guardati, Leonardo; Casamassima, Filippo; Farella, Elisabetta; et al. (2015)
- Optimizing BFloat16 Deployment of Tiny Transformers on Ultra-Low Power Extreme Edge SoCs
  Item type: Journal Article
  Journal of Low Power Electronics and Applications
  Dequino, Alberto; Bompani, Luca; Benini, Luca; et al. (2025)
  Transformers have emerged as the central backbone architecture for modern generative AI. However, most ML applications targeting low-power, low-cost SoCs (TinyML apps) do not employ Transformers, as these models are thought to be challenging to quantize and deploy on small devices. This work proposes a methodology to reduce Transformer dimensions with an extensive pruning search. We exploit the intrinsic redundancy of these models to fit them on resource-constrained devices with a well-controlled accuracy tradeoff. We then propose an optimized library to deploy the reduced models using BFloat16 with no accuracy loss on Commercial Off-The-Shelf (COTS) RISC-V multi-core microcontrollers, enabling the execution of these models at the extreme edge without the need for complex and accuracy-critical quantization schemes. Our solution achieves up to a 220× speedup with respect to a naïve C port of the Multi-Head Self-Attention PyTorch kernel: we reduce the MobileBert and TinyViT memory footprints by up to ~94% and ~57%, respectively, and we deploy a tinyLLAMA SLM on a microcontroller, achieving a throughput of 1219 tokens/s at an average power of just 57 mW.
- Paving the Way Toward Energy-Aware and Automated Datacentre
  Item type: Conference Paper
  ACM International Conference Proceeding Series ~ ICPP 2019: Proceedings of the 48th International Conference on Parallel Processing: Workshops, August 2019
  Bartolini, Andrea; Beneventi, Francesco; Borghesi, Andrea; et al. (2019)
- A neuro-vector-symbolic architecture for solving Raven's progressive matrices
  Item type: Journal Article
  Nature Machine Intelligence
  Hersche, Michael; Zeqiri, Mustafa; Benini, Luca; et al. (2023)
  Neither deep neural networks nor symbolic artificial intelligence (AI) alone has approached the kind of intelligence expressed in humans. This is mainly because neural networks are not able to decompose joint representations to obtain distinct objects (the so-called binding problem), while symbolic AI suffers from exhaustive rule searches, among other problems. These two problems are still pronounced in neuro-symbolic AI, which aims to combine the best of the two paradigms. Here we show that the two problems can be addressed with our proposed neuro-vector-symbolic architecture (NVSA) by exploiting its powerful operators on high-dimensional distributed representations that serve as a common language between neural networks and symbolic AI. The efficacy of NVSA is demonstrated by solving Raven's progressive matrices datasets. Compared with state-of-the-art deep neural network and neuro-symbolic approaches, end-to-end training of NVSA achieves a new record of 87.7% average accuracy on the RAVEN dataset and 88.1% on I-RAVEN. Moreover, compared with the symbolic reasoning within the neuro-symbolic approaches, the probabilistic reasoning of NVSA, with less expensive operations on the distributed representations, is two orders of magnitude faster.
- Low-cost micro-thermal response test system for characterizing very shallow geothermal energy
  Item type: Conference Paper
  EESMS 2016, 2016 IEEE Workshop on Environmental, Energy, and Structural Monitoring Systems, Proceedings
  Verdecchia, Andrea; Brunelli, Davide; Tinti, Francesco; et al. (2016)
- Tightly-coupled hardware support to dynamic parallelism acceleration in embedded shared memory clusters
  Item type: Conference Paper
  2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)
  Burgio, Paolo; Tagliavini, Giuseppe; Conti, Francesco; et al. (2014)
  Modern designs for embedded systems are increasingly embracing cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems that i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment with minimal overhead for the scheduling of parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections, and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) that fits the semantics of all three constructs. The HWSE is designed as a block tightly coupled to the PEs within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, which is fundamental to achieving fast dynamic scheduling and, ultimately, to enabling fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.
- Beyond duty cycling: Wake-up radio with selective awakenings for long-lived wireless sensing systems
  Item type: Conference Paper
  2015 IEEE Conference on Computer Communications (INFOCOM)
  Spenza, Dora; Magno, Michele; Basagni, Stefano; et al. (2015)
- Curbing the roofline: A scalable and flexible architecture for CNNs on FPGA
  Item type: Conference Paper
  Proceedings of the ACM International Conference on Computing Frontiers (CF '16)
  Meloni, Paolo; Deriu, Gianfranco; Conti, Francesco; et al. (2016)
- Scalable EEG seizure detection on an ultra low power multi-core architecture
  Item type: Conference Paper
  2016 IEEE Biomedical Circuits and Systems Conference (BioCAS 2016)
  Benatti, Simone; Montagna, Fabio; Rossi, Davide; et al. (2016)