Continual Benchmarking of LLM-Based Systems on Networking Operations
Date
2025-09-10
Publication Type
Conference Poster
ETH Bibliography
yes
Abstract
The inherent complexity of operating modern network infrastructures has led to growing interest in using Large Language Models (LLMs) to support network operators, particularly in the area of Incident Management (IM). Yet, the absence of standardized benchmarks for evaluating such systems poses challenges in tracking progress, comparing approaches, and uncovering their limitations. As LLM-based tools become widespread, there is a clear need for a comprehensive benchmarking suite that reflects the diversity and complexity of operational tasks encountered in real-world networks.
This poster outlines our vision for such a modular benchmarking suite. We describe an approach for generating operational tasks of varying complexity and discuss how to evaluate LLMs on these tasks and assess system-level performance. As a preliminary evaluation, we benchmark three LLMs (GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 Sonnet) across more than 100 test cases and two pipeline variants.
Publication status
accepted
Event
SIGCOMM '25
Organisational unit
09477 - Vanbever, Laurent / Vanbever, Laurent