Daniele Cesarini



Publications 1-10 of 11
  • Silvano, Cristina; Agosta, Giovanni; Bartolini, Andrea; et al. (2019)
    Microprocessors and Microsystems
  • Libri, Antonio; Bartolini, Andrea; Cesarini, Daniele; et al. (2018)
    ACM International Conference Proceeding Series ~ ANDARE '18 Proceedings of the 2nd Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems
    Fine-grain time synchronization is important to address several challenges in today's and future High Performance Computing (HPC) centers. Among the many, (i) co-scheduling techniques for parallel applications with sensitive bulk-synchronous workloads, (ii) performance analysis tools, and (iii) autotuning strategies that want to exploit state-of-the-art (SoA) high-resolution monitoring systems are three examples where synchronization within a few microseconds is required. Previous works report custom solutions that reach this performance without incurring the extra cost of dedicated hardware. On the other hand, the benefits of using robust standards that are widely supported by the community, such as the Network Time Protocol (NTP) and the Precision Time Protocol (PTP), are evident. With today's software and hardware improvements of these two protocols, and their off-the-shelf integration in SoA HPC servers, no expensive extra hardware is required anymore, but an evaluation of their performance in supercomputing clusters is needed. Our results show that NTP can reach, on computing nodes, an accuracy of 2.6 us and a precision below 2.7 us, with negligible overhead. These values can be bounded below one microsecond with PTP and low-cost switches (no need for a GPS antenna). Both protocols are also suitable for data time-stamping in SoA HPC monitoring infrastructures. We validate their performance with two real use cases and quantify scalability and CPU overhead. Finally, we report the software settings and low-cost network configuration needed to reach these high-precision synchronization results.
  • Cesarini, Daniele; Bartolini, Andrea; Bonfà, Pietro; et al. (2021)
    IEEE Transactions on Computers
    Power and energy consumption are becoming key challenges in the supercomputers' exascale race. HPC processors waste active power during communication and synchronization among the MPI processes of large-scale HPC applications. However, due to the time scale at which communication happens, transitioning into low-power states while waiting for the completion of each communication may introduce unacceptable overhead. In this article, we present COUNTDOWN, a run-time library for identifying and automatically reducing the power consumption of the CPUs during communication and synchronization. COUNTDOWN saves energy without penalizing the time-to-completion by lowering CPU power consumption only during idle times for which the power-state transition overhead is negligible. This is done transparently to the user, without requiring labor-intensive and error-prone application code modifications or recompilation of the application. We test our methodology on a production Tier-1 system. For the NAS benchmarks, COUNTDOWN saves between 6 and 50 percent energy, with a time-to-solution penalty lower than 5 percent. In a production run of Quantum ESPRESSO on 3.5K cores, COUNTDOWN saves 22.36 percent energy, with a performance penalty below 3 percent. Energy saving increases to 37 percent, with a performance penalty of 6.38 percent, if the application is executed without communication tuning.
  • Cesarini, Daniele; Bartolini, Andrea; Bonfà, Piero; et al. (2018)
    ANDARE '18 Proceedings of the 2nd Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems
    Energy and power consumption are prominent issues in today's supercomputers and are foreseen as a limiting factor of future installations. In scientific computing, a significant amount of power is spent in the communication- and synchronization-related idle times among distributed processes participating in the same application. However, due to the time scale at which communication happens, taking advantage of low-power states to reduce power during idle times may introduce significant overheads. In this paper we present COUNTDOWN, a methodology and a tool for identifying and automatically reducing the frequency of the computing elements in order to save energy during communication and synchronization primitives. COUNTDOWN is able to filter out phases that would worsen the time to solution of the application, transparently to the user, without touching the application code or requiring recompilation. We tested our methodology on a production Tier-0 system with a production application, Quantum ESPRESSO (QE), and production datasets scaling up to 3.5K cores. Experimental results show that our methodology saves 22.36% of energy consumption with a performance penalty of 2.88% in a real production MPI-based application.
  • Bartolini, Andrea; Beneventi, Francesco; Borghesi, Andrea; et al. (2019)
    ACM International Conference Proceeding Series ~ ICPP 2019: Proceedings of the 48th International Conference on Parallel Processing: Workshops, August 2019
  • Cesarini, Daniele; Bartolini, Andrea; Benini, Luca (2019)
    IFIP Advances in Information and Communication Technology ~ VLSI-SoC: Opportunities and Challenges Beyond the Internet of Things
    As a side effect of the end of Dennard scaling, power and thermal technological walls stand in the way of the evolution of supercomputers towards the exaflops era. Energy and temperature walls are big challenges to face in order to assure a constant growth of performance in the future. New-generation architectures for HPC systems implement HW and SW components that address energy and thermal issues to increase power-efficient computing on scientific workloads. In thermal-bound HPC machines, workload-aware runtimes can leverage hardware knobs to guarantee the best operating point in terms of performance and power saving without violating thermal constraints. In this paper, we present an integer-linear programming formulation for job mapping and frequency selection on thermal-bound HPC nodes. We use a fast solver and workload traces extracted from a real supercomputer to test our methodology. Our runtime is integrated into the MPI library and is capable of assigning high-performance cores to performance-critical processes. Critical processes are identified at execution time through a mathematical formulation that relies on the characterization of the application workload and on the global synchronization barriers. We demonstrate that by combining long- and short-horizon predictions with information on the critical processes retrieved from the programming model, we can drastically improve the performance of the target application w.r.t. state-of-the-art DTM solutions.
  • De Sensi, Daniele; Pichetti, Lorenzo; Vella, Flavio; et al. (2024)
    SC24: International Conference for High Performance Computing, Networking, Storage and Analysis
    Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers, Alps, Leonardo, and LUMI, each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4,096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing their limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.
  • Molan, Martin; Borghesi, Andrea; Cesarini, Daniele; et al. (2023)
    Future Generation Computer Systems
    The increasing complexity of modern high-performance computing (HPC) systems necessitates the introduction of automated and data-driven methodologies to support system administrators' effort towards increasing the system's availability. Anomaly detection is an integral part of improving the availability, as it eases the system administrator's burden and reduces the time between an anomaly and its resolution. However, current state-of-the-art (SOTA) approaches to anomaly detection are supervised and semi-supervised, so they require a human-labelled dataset with anomalies, which is often impractical to collect in production HPC systems. Unsupervised anomaly detection approaches based on clustering, aimed at alleviating the need for accurate anomaly data, have so far shown poor performance. In this work, we overcome these limitations by proposing RUAD, a novel Recurrent Unsupervised Anomaly Detection model. RUAD achieves better results than the current semi-supervised and unsupervised SOTA approaches. This is achieved by considering temporal dependencies in the data and including long short-term memory (LSTM) cells in the model architecture. The proposed approach is assessed on a complete ten-month history of a Tier-0 system (Marconi100 from CINECA with 980 nodes). RUAD achieves an area under the curve (AUC) of 0.763 in semi-supervised training and an AUC of 0.767 in unsupervised training, which improves upon the SOTA approach that achieves an AUC of 0.747 in semi-supervised training and an AUC of 0.734 in unsupervised training. It also vastly outperforms the current SOTA unsupervised anomaly detection approach based on clustering, which achieves an AUC of 0.548.
  • Silvano, Cristina; Agosta, Giovanni; Bartolini, Andrea; et al. (2019)
    2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
  • Cesarini, Daniele; Bartolini, Andrea; Benini, Luca (2017)
    2017 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC)
    We are entering the era of thermally-bound computing: advanced and costly cooling solutions are needed to sustain the high computing densities of high-performance computing equipment. To reduce cooling costs and cooling overprovisioning, dynamic thermal management (DTM) strategies aim at controlling the device temperature by modulating the performance of processing elements online. While operating systems allow the migration of threads between cores, in HPC systems the threads of parallel applications are pinned to the allocated cores at start time to avoid job-migration overheads. In this scenario, state-of-the-art DTM solutions, which use thermal models to map jobs to cores, are based on long-term predictions to map the most critical job to the coldest core. Turbo-mode and DVFS controllers, instead, are based on short-term predictions to squeeze the thermal capacitance, allowing for short performance boosts that are thermally unsustainable over longer periods. In this work we propose an integer-linear programming formulation and a fast solver for controlling, at the same time, the job mapping and core frequency selections in HPC nodes, tested with real supercomputer workloads. Our approach can be integrated with MPI runtimes and OpenMP libraries and is capable of assigning high-performance cores to performance-critical threads. We show that by combining long- and short-term predictions with information from the programming model, we can significantly improve the performance of the final application w.r.t. state-of-the-art DTM solutions.
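The fine-grain synchronization results above (Libri et al., 2018) rest on the standard two-way timestamp exchange that both NTP and PTP use. A minimal sketch of that offset/delay computation; the function name and the example numbers are illustrative, not taken from the paper:

```python
def ptp_offset_delay(t1, t2, t3, t4):
    """Two-way time-transfer math at the core of PTP's delay
    request-response mechanism (NTP's clock filter consumes the same
    four timestamps): t1 = master send, t2 = slave receive,
    t3 = slave send, t4 = master receive, all in seconds.
    Assumes a symmetric network path."""
    offset = ((t2 - t1) - (t4 - t3)) / 2.0  # slave clock minus master clock
    delay = ((t2 - t1) + (t4 - t3)) / 2.0   # one-way path delay
    return offset, delay

# Slave clock 10 us ahead of the master, 50 us one-way delay:
# t2 = t1 + delay + offset, t4 = t3 + delay - offset.
off, d = ptp_offset_delay(0.0, 60e-6, 100e-6, 140e-6)
```

The symmetric-path assumption is exactly why the paper's PTP numbers depend on switch quality: path asymmetry translates directly into an offset error the formulas cannot see.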
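COUNTDOWN (Cesarini et al., 2018 and 2021, above) saves energy by entering low-power states only during communication phases long enough to amortize the power-state transition cost. A toy sketch of that filtering decision; the function name, threshold, and overhead figure are hypothetical, not COUNTDOWN's actual implementation:

```python
def should_enter_low_power(expected_idle_s, transition_overhead_s=500e-6,
                           min_gain_ratio=10.0):
    """Enter a low-power state only if the expected MPI idle phase is
    long enough that the power-state transition cost is negligible
    (here: at most 1/min_gain_ratio of the phase duration)."""
    return expected_idle_s >= transition_overhead_s * min_gain_ratio

# A 50 ms collective wait easily amortizes the transition cost,
# while a 100 us point-to-point wait does not.
long_wait = should_enter_low_power(50e-3)    # True
short_wait = should_enter_low_power(100e-6)  # False
```

The real library makes this decision transparently inside the MPI layer (e.g. by deferring the transition with a timer), so short phases never pay the overhead; the sketch only captures the amortization criterion.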
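Both DTM abstracts above (Cesarini et al., 2017 and 2019) cast job mapping and frequency selection as an integer-linear program that places the most critical process on the coldest core. As a brute-force toy stand-in for that ILP, with hypothetical core temperatures and a naive linear heating model:

```python
from itertools import permutations

def best_mapping(criticality, core_temps, t_max=85.0, heat_per_unit=5.0):
    """Toy stand-in for the ILP: assign each process (with a workload
    criticality score) to one core so that the hottest resulting core
    is as cool as possible and no core exceeds t_max Celsius.
    Returns (mapping, peak_temp) where mapping[i] is the core of
    process i, or None if no feasible assignment exists."""
    best = None
    for perm in permutations(range(len(core_temps))):
        temps = [core_temps[core] + heat_per_unit * criticality[proc]
                 for proc, core in enumerate(perm)]
        peak = max(temps)
        if peak <= t_max and (best is None or peak < best[1]):
            best = (perm, peak)
    return best

# The critical process (score 3.0) should land on the coldest core (60 C).
mapping, peak = best_mapping([3.0, 1.0, 1.0], [60.0, 70.0, 72.0])
```

An actual ILP solver scales far beyond this factorial enumeration and can fold in the frequency-selection variables; the sketch only shows the objective and thermal constraint.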
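RUAD (Molan et al., 2023, above) scores anomalies by how badly a model with temporal state reconstructs the node telemetry. An LSTM is out of scope for a short stdlib sketch, so the following uses an exponential moving average as a stand-in for the recurrent model; only the reconstruction-error idea is from the paper, and all names and numbers are illustrative:

```python
def anomaly_scores(series, alpha=0.5):
    """Unsupervised anomaly scoring in the spirit of RUAD: a model that
    carries temporal state predicts each sample, and the prediction
    (reconstruction) error is the anomaly score. An exponential moving
    average stands in here for RUAD's LSTM-based model."""
    scores, ema = [], series[0]
    for x in series[1:]:
        scores.append(abs(x - ema))      # reconstruction error
        ema = alpha * x + (1 - alpha) * ema  # update temporal state
    return scores

# Steady (hypothetical) node telemetry with one anomalous spike:
series = [1.0, 1.0, 1.0, 1.0, 8.0, 1.0, 1.0]
scores = anomaly_scores(series)
# the spike yields the largest score, so thresholding flags it
```

Because no labels are involved, this matches the unsupervised setting the paper targets; the threshold itself would be chosen from the score distribution on normal data.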