Francesco Conti




Search Results

Publications 1 - 10 of 77
  • Burgio, Paolo; Tagliavini, Giuseppe; Conti, Francesco; et al. (2014)
    2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)
    Modern designs for embedded systems are increasingly embracing cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems which i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment with minimal overhead for the scheduling of parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) which fits the semantics of the three constructs. The HWSE is designed as a block tightly coupled to the PEs within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, fundamental to achieving fast dynamic scheduling, and ultimately to enabling fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.
  • Rossi, Davide; Conti, Francesco; Eggimann, Manuel; et al. (2022)
    IEEE Journal of Solid-State Circuits
    The Internet-of-Things (IoT) requires endnodes with ultra-low-power always-on capability for a long battery lifetime, as well as high performance, energy efficiency, and extreme flexibility to deal with complex and fast-evolving near-sensor analytics algorithms (NSAAs). We present Vega, an IoT endnode system on chip (SoC) capable of scaling from a 1.7-µW fully retentive cognitive sleep mode up to 32.2-GOPS (at 49.4 mW) peak performance on NSAAs, including mobile deep neural network (DNN) inference, exploiting 1.6 MB of state-retentive SRAM, and 4 MB of non-volatile magnetoresistive random access memory (MRAM). To meet the performance and flexibility requirements of NSAAs, the SoC features ten RISC-V cores: one core for SoC and IO management and a nine-core cluster supporting multi-precision single instruction multiple data (SIMD) integer and floating-point (FP) computation. Vega achieves the state-of-the-art (SoA)-leading efficiency of 615 GOPS/W on 8-bit INT computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware acceleration). On FP computation, it achieves the SoA-leading efficiency of 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively. Two programmable machine learning (ML) accelerators boost energy efficiency in cognitive sleep and active states.
  • Conti, Francesco; Palossi, Daniele; Andri, Renzo; et al. (2017)
    IEEE Transactions on Human-Machine Systems
  • Paulin, Gianna; Andri, Renzo; Conti, Francesco; et al. (2021)
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems
    Radio resource management (RRM) is critical in 5G mobile communications due to its ubiquity on every radio device and its low latency constraints. The rapidly evolving RRM algorithms with low latency requirements combined with the dense and massive 5G base station deployment ask for an on-the-edge RRM acceleration system with a tradeoff between flexibility, efficiency, and cost, making application-specific instruction-set processors (ASIPs) an optimal choice. In this work, we start from a baseline, simple RISC-V core and introduce instruction extensions coupled with software optimizations for maximizing the throughput of a selected set of recently proposed RRM algorithms based on models using multilayer perceptrons (MLPs) and recurrent neural networks (RNNs). Furthermore, we scale from a single-ASIP to a multi-ASIP acceleration system to further improve RRM throughput. For the single-ASIP system, we demonstrate an energy efficiency of 218 GMAC/s/W and a throughput of 566 MMAC/s, corresponding to improvements of 10× and 10.6×, respectively, over the single-core system with a baseline RV32IMC core. For the multi-ASIP system, we analyze the parallel speedup dependency on the input and output feature map (FM) size for fully connected and LSTM layers, achieving up to 10.2× speedup with 16 cores over a single extended RI5CY core for single LSTM layers and a speedup of 13.8× for a single fully connected layer. On the full RRM benchmark suite, we achieve an average overall speedup of 16.4×, 25.2×, 31.9×, and 38.8× on two, four, eight, and 16 cores, respectively, compared to our single-core RV32IMC baseline implementation.
  • Palossi, Daniele; Loquercio, Antonio; Conti, Francesco; et al. (2019)
    IEEE Internet of Things Journal
  • Bertaccini, Luca; Benini, Luca; Conti, Francesco (2021)
    2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP)
    Hardware-accelerated multicore clusters have recently emerged as a viable approach to deploy advanced digital signal processing (DSP) capabilities in ultra-low-power extreme edge nodes. As a critical basic block for DSP, Fast Fourier Transforms (FFTs) are one of the best candidates for implementation on a dedicated accelerator core; however, their peculiar memory access patterns make direct integration of an FFT accelerator with a core cluster challenging. In this paper, we compare two different approaches for cluster-coupled FFT accelerators: one with a large internal buffer to store and shuffle partial results; and a buffer-less accelerator sharing all memory with the cluster cores. Both versions can work on complex data with 8/16/32-bit real and imaginary parts. We show that, thanks to a newly proposed scheme to reorder data access and exploit full bandwidth also for sub-word FFTs, the buffer-less accelerator can be made as fast as the buffered one at only 0.26× the area cost. We report post-layout performance and power results showing that the buffer-less accelerator can provide up to 4/2/1 butterfly/cycle performance, with an average power consumption of 4.1/5.5/6.8 mW @ 350 MHz, 0.65 V operating point in 22 nm CMOS technology, respectively for complex data with 8/16/32-bit real and imaginary parts. The buffer-less accelerator is 8× faster than an optimized multicore software implementation working on 16-bit data and compares favorably with FFT accelerators presented in the recent literature.
  • Meloni, Paolo; Loi, Daniela; Deriu, Gianfranco; et al. (2019)
    2018 30th International Conference on Microelectronics (ICM)
  • Zanghieri, Marcello; Benatti, Simone; Burrello, Alessio; et al. (2020)
    IEEE Transactions on Biomedical Circuits and Systems
  • Lamberti, Lorenzo; Cereda, Elia; Abbate, Gabriele; et al. (2024)
    IEEE Robotics and Automation Letters
  • Zanghieri, Marcello; Indirli, Fabrizio; Latella, Antonio; et al. (2024)
    IEEE Access
    Modern manufacturing industry relies on complex machinery that requires skills, attention, and precise safety certifications. Protecting operators in the machine's surroundings while at the same time reducing the impact on the normal workflow is a major challenge. In particular, safety systems based on proximity sensing of humans or obstacles require that the detection is accurate, low-latency, and robust against variations in environmental conditions. This work proposes a functional safety solution for collision avoidance relying on Ultrasounds (US) and a Temporal Convolutional Network (TCN) suitable for deployment directly at the edge on a low-power Microcontroller Unit (MCU). The setup enabled the acquisition of a sensor-fusion dataset with 9 US sensors mounted on a real industrial woodworking machine. Applying incremental training, the proposed TCN achieved a sensitivity of 90.5%, a specificity of 95.2%, and an AUROC of 0.972 on data affected by the typical acoustic noise of an industrial facility, an accuracy comparable with the State-of-the-Art (SoA). Deployment on an STM32H7 MCU yielded a memory footprint of 560 B (3x less than SoA), with an extremely low latency of 5.0 ms and an energy consumption of 8.2 mJ per inference (both >2.3x less than SoA). The proposed solution increases its robustness against acoustic noise by leveraging new data, and it fits the resource budget of real-time operation execution on resource-constrained embedded devices. It is thus promising for generalization to different industrial settings and for scale-up to wider monitored spaces.
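The Burgio et al. (2014) abstract above argues that dynamic scheduling overhead dominates when parallel work units are tiny. A minimal software sketch of that situation, assuming nothing about the paper's actual HWSE design: a centralized shared task queue where every dequeue and result append is pure scheduling overhead relative to the (small) unit of work.

```python
# Minimal sketch (NOT the paper's HWSE): a centralized software task queue.
# When each task is tiny, the queue/lock traffic per task is exactly the
# scheduling overhead a hardware scheduling engine aims to eliminate.
import queue
import threading

def run_dynamic(tasks, n_workers=4):
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()   # dynamic scheduling: grab the next work unit
            except queue.Empty:
                return
            r = t()                  # the (small) unit of work itself
            with lock:
                results.append(r)    # more synchronization overhead per task

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# Usage: 1000 tiny tasks; scheduling cost per task rivals the work itself.
out = run_dynamic([(lambda i=i: i * i) for i in range(1000)])
```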
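The Paulin et al. (2021) abstract above reports that multi-ASIP speedup depends on feature-map size for fully connected and LSTM layers. A back-of-the-envelope sketch of why, using standard MAC-count formulas (ours, not taken from the paper): small layers simply expose too few MACs to keep many cores busy.

```python
# Illustrative MAC counts for the two layer types benchmarked in the paper.
# Formulas are the textbook ones; layer sizes below are hypothetical examples.

def fc_macs(n_in, n_out):
    """A fully connected layer performs one MAC per weight."""
    return n_in * n_out

def lstm_macs(n_in, n_hidden):
    """Per timestep: 4 gates, each with an input and a recurrent projection."""
    return 4 * (n_in * n_hidden + n_hidden * n_hidden)

# With few MACs per layer, per-core work shrinks and parallel efficiency drops.
small_fc = fc_macs(16, 16)       # little work to split across 16 cores
large_fc = fc_macs(256, 256)     # enough work for near-linear speedup
```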
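The Bertaccini et al. (2021) abstract above quotes performance in butterflies per cycle. For readers unfamiliar with the term, a hedged software sketch of the radix-2 butterfly that an FFT accelerator implements in hardware (structure and names are ours, not the paper's RTL): each butterfly combines a pair (a, b) into (a + w·b, a − w·b) with twiddle factor w.

```python
# Reference radix-2 decimation-in-time FFT; the inner loop body is one
# "butterfly", the unit the accelerator executes at up to 4 per cycle.
import cmath

def fft_radix2(x):
    n = len(x)                   # n must be a power of two
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor
        out[k] = even[k] + w * odd[k]           # butterfly: upper output
        out[k + n // 2] = even[k] - w * odd[k]  # butterfly: lower output
    return out
```

An n-point FFT needs (n/2)·log2(n) butterflies, which is what makes butterflies/cycle a natural throughput metric.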
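The Zanghieri et al. (2024) abstract above deploys a Temporal Convolutional Network on an MCU. A minimal sketch of the TCN's core building block, the dilated causal 1-D convolution; kernel and dilation values below are illustrative, not the deployed model's.

```python
# Hedged sketch of a TCN building block (not the paper's trained network).

def causal_dilated_conv1d(x, kernel, dilation):
    """y[t] depends only on x[t], x[t-d], x[t-2d], ... (causal: no lookahead)."""
    k = len(kernel)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i in range(k):
            j = t - i * dilation     # look back in time, never ahead
            if j >= 0:
                acc += kernel[i] * x[j]
        y.append(acc)
    return y

def receptive_field(kernel_size, dilations):
    # Stacking layers with dilations 1, 2, 4, ... grows the receptive field
    # exponentially with depth at only linear parameter cost -- the property
    # that keeps TCN memory footprints MCU-friendly.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

rf = receptive_field(3, [1, 2, 4, 8])   # -> 31 input samples
```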