EMOI: CSCS Extensible Monitoring and Observability Infrastructure
METADATA ONLY
Date
2024
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
The Swiss National Supercomputing Centre (CSCS) is enhancing its computational capabilities through the expansion of the Alps architecture, an HPE Cray EX system equipped with approximately 5000 GH200 modules, in addition to the pre-existing 1000 nodes with a diverse combination of CPUs and GPUs. CSCS has developed an Extensible Monitoring and Observability Infrastructure (EMOI), designed to manage the substantial data influx and provide insightful analysis of the infrastructure's behavior. This paper presents the architecture and capabilities of EMOI at CSCS, emphasizing its scalability and adaptability to handle the increasing volume of monitoring data generated by the Alps infrastructure. We detail the integration of the Cray System Management (CSM) and Cray System Monitoring Application (SMA) within EMOI. The paper describes our hardware infrastructure, which leverages Kubernetes for the dynamic deployment of data collection and analysis tools, and outlines our GitOps strategy for efficient service management. We also explore the distinctions in data models across the various node architectures within the Alps system, focusing on power consumption data and its relevance to global supercomputing challenges. The insights and methodologies presented in this paper are anticipated to be beneficial not only to CSCS, but also to other HPE/Cray sites facing similar challenges in supercomputing infrastructure management.
Publication status
published
Book title
CUG2024 Proceedings
Publisher
CUG
Event
CUG 2024 "Diverse Universe"
Subject
Monitoring and Observability Infrastructure
Organisational unit
00080 - CSCS / CSCS
Notes
Conference lecture held on May 7, 2024