EMOI: CSCS Extensible Monitoring and Observability Infrastructure



Date

2024

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

The Swiss National Supercomputing Centre (CSCS) is enhancing its computational capabilities through the expansion of the Alps architecture, a Cray HPE EX system equipped with approximately 5000 GH200 modules, in addition to the pre-existing 1000 nodes of a diverse combination of CPUs and GPUs. CSCS has developed an Extensible Monitoring and Observability Infrastructure (EMOI), designed to manage the substantial data influx and provide insightful analysis of the infrastructure's behavior. This paper presents the architecture and capabilities of EMOI at CSCS, emphasizing its scalability and adaptability to handle the increasing volume of monitoring data generated by the Alps infrastructure. We detail the integration of the Cray System Management (CSM) and Cray System Monitoring Application (SMA) within EMOI. The paper describes our hardware infrastructure, leveraging Kubernetes for dynamic data collection and analysis tools deployment, and outlines our GitOps strategy for efficient service management. We also explore the distinctions in data models across various node architectures within the Alps system, focusing on power consumption data and its relevance concerning global supercomputing challenges. The insights and methodologies presented in this paper are anticipated to be beneficial not only to CSCS, but also to other HPE/Cray sites facing similar challenges in supercomputing infrastructure management.

Publication status

published

Book title

CUG2024 Proceedings

Publisher

CUG

Event

CUG 2024 "Diverse Universe"

Subject

Monitoring and Observability Infrastructure

Organisational unit

00080 - CSCS / CSCS

Notes

Conference lecture held on May 7, 2024
