Luca Benini



Last Name: Benini
First Name: Luca
Organisational unit: 03996 - Benini, Luca

Search Results

Publications 1 - 10 of 1018
  • Colagrande, Luca; Benini, Luca (2025)
    2025 62nd ACM/IEEE Design Automation Conference (DAC)
To meet the computational requirements of modern workloads under tight energy constraints, general-purpose accelerator architectures have to integrate an ever-increasing number of extremely area- and energy-efficient processing elements (PEs). In this context, single-issue in-order cores are commonplace, but lean dual-issue cores could boost PE IPC, especially for the common case of mixed integer and floating-point workloads. We develop the COPIFT methodology and RISC-V ISA extensions to enable low-cost and flexible dual-issue execution of mixed integer and floating-point instruction sequences. On such kernels, our methodology achieves speedups of 1.47x, reaching a peak 1.75 instructions per cycle, and 1.37x energy improvements on average, over optimized RV32G baselines.
  • Guardati, Leonardo; Casamassima, Filippo; Farella, Elisabetta; et al. (2015)
    2015 Design, Automation & Test in Europe Conference & Exhibition (DATE 2015): Grenoble, France, 9 - 13 March 2015
  • Dequino, Alberto; Bompani, Luca; Benini, Luca; et al. (2025)
    Journal of Low Power Electronics and Applications
Transformers have emerged as the central backbone architecture for modern generative AI. However, most ML applications targeting low-power, low-cost SoCs (TinyML apps) do not employ Transformers, as these models are thought to be challenging to quantize and deploy on small devices. This work proposes a methodology to reduce Transformer dimensions with an extensive pruning search. We exploit the intrinsic redundancy of these models to fit them on resource-constrained devices with a well-controlled accuracy tradeoff. We then propose an optimized library to deploy the reduced models using BFloat16 with no accuracy loss on Commercial Off-The-Shelf (COTS) RISC-V multi-core micro-controllers, enabling the execution of these models at the extreme edge, without the need for complex and accuracy-critical quantization schemes. Our solution achieves up to 220x speedup with respect to a naïve C port of the Multi-Head Self Attention PyTorch kernel: we reduced the MobileBert and TinyViT memory footprints by up to ~94% and ~57%, respectively, and we deployed a tinyLLAMA SLM on a microcontroller, achieving a throughput of 1219 tokens/s with an average power of just 57 mW.
  • Bartolini, Andrea; Beneventi, Francesco; Borghesi, Andrea; et al. (2019)
    ACM International Conference Proceeding Series ~ ICPP 2019: Proceedings of the 48th International Conference on Parallel Processing: Workshops, August 2019
  • Hersche, Michael; Zeqiri, Mustafa; Benini, Luca; et al. (2023)
    Nature Machine Intelligence
Neither deep neural networks nor symbolic artificial intelligence (AI) alone has approached the kind of intelligence expressed in humans. This is mainly because neural networks are not able to decompose joint representations to obtain distinct objects (the so-called binding problem), while symbolic AI suffers from exhaustive rule searches, among other problems. These two problems are still pronounced in neuro-symbolic AI, which aims to combine the best of the two paradigms. Here we show that the two problems can be addressed with our proposed neuro-vector-symbolic architecture (NVSA) by exploiting its powerful operators on high-dimensional distributed representations that serve as a common language between neural networks and symbolic AI. The efficacy of NVSA is demonstrated by solving Raven’s progressive matrices datasets. Compared with state-of-the-art deep neural network and neuro-symbolic approaches, end-to-end training of NVSA achieves a new record of 87.7% average accuracy on the RAVEN dataset and 88.1% on I-RAVEN. Moreover, compared with the symbolic reasoning within the neuro-symbolic approaches, the probabilistic reasoning of NVSA, using less expensive operations on the distributed representations, is two orders of magnitude faster.
  • Verdecchia, Andrea; Brunelli, Davide; Tinti, Francesco; et al. (2016)
    EESMS 2016, 2016 IEEE Workshop on Environmental, Energy, and Structural Monitoring Systems, Proceedings
  • Burgio, Paolo; Tagliavini, Giuseppe; Conti, Francesco; et al. (2014)
    2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)
    Modern designs for embedded systems are increasingly embracing cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems which i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment which has minimal overhead for the scheduling of parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) which fits the semantics of the three constructs. The HWSE is designed as a tightly-coupled block to the PEs within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, fundamental to achieving fast dynamic scheduling, and ultimately to enable fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.
  • Spenza, Dora; Magno, Michele; Basagni, Stefano; et al. (2015)
    2015 IEEE Conference on Computer Communications (INFOCOM)
  • Meloni, Paolo; Deriu, Gianfranco; Conti, Francesco; et al. (2016)
    Proceedings of the ACM International Conference on Computing Frontiers (CF '16)
  • Benatti, Simone; Montagna, Fabio; Rossi, Davide; et al. (2016)
    2016 IEEE Biomedical Circuits and Systems Conference (BioCAS 2016)
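The NVSA entry above hinges on binding and unbinding operators over high-dimensional distributed representations. A minimal sketch of that mechanism, assuming random bipolar vectors and Hadamard (element-wise) binding; this is an illustrative vector-symbolic-architecture toy, not the actual NVSA implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # dimensionality of the distributed representations

# Random bipolar (+1/-1) codevectors for two attributes of an object.
shape = rng.choice([-1, 1], size=D)
color = rng.choice([-1, 1], size=D)

# Binding: element-wise multiply fuses the two attributes into one vector
# of the same dimensionality.
obj = shape * color

# Unbinding: Hadamard binding of bipolar vectors is self-inverse
# (color * color == 1 element-wise), so multiplying by one factor
# recovers the other exactly.
recovered = obj * color
print(np.array_equal(recovered, shape))  # True

# Unrelated random vectors are nearly orthogonal (cosine similarity ~ 0),
# which is what lets many bound pairs coexist without interference.
unrelated = rng.choice([-1, 1], size=D)
cosine = (shape @ unrelated) / D
print(abs(cosine) < 0.1)  # True with overwhelming probability at D = 10,000
```

Because binding is cheap element-wise arithmetic rather than a rule search, reasoning over such representations can be orders of magnitude faster than explicit symbolic search, which is the intuition behind the speedup reported in the abstract.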