Continual Benchmarking of LLM-Based Systems on Networking Operations
OPEN ACCESS
Author / Producer
Date
2025
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
The inherent complexity of operating modern network infrastructures has led to growing interest in using Large Language Models (LLMs) to support network operators, particularly in the area of Incident Management (IM). Yet, the absence of standardized benchmarks for evaluating such systems makes it difficult to track progress, compare approaches, and uncover their limitations. As LLM-based tools become widespread, there is a clear need for a comprehensive benchmarking suite that reflects the diversity and complexity of operational tasks encountered in real-world networks. This poster outlines our vision for designing such a modular benchmarking suite. We describe an approach for generating operational tasks of varying complexity and discuss how to evaluate LLMs on these tasks and assess system-level performance. As a preliminary evaluation, we benchmark three LLMs (GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 Sonnet) across over 100 test cases and two pipeline variants.
Publication status
published
Book title
ACM SIGCOMM Posters and Demos '25: Proceedings of the ACM SIGCOMM 2025 Posters and Demos
Pages / Article No.
70–72
Publisher
Association for Computing Machinery
Event
39th ACM SIGCOMM Conference (SIGCOMM 2025)
Subject
Large language models; Network management; Benchmarking; Incident management
Organisational unit
09477 - Vanbever, Laurent / Vanbever, Laurent
Notes
Short paper