Continual Benchmarking of LLM-Based Systems on Networking Operations


Date

2025

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

The inherent complexity of operating modern network infrastructures has led to growing interest in using Large Language Models (LLMs) to support network operators, particularly in the area of Incident Management (IM). Yet, the absence of standardized benchmarks for evaluating such systems poses challenges in tracking progress, comparing approaches, and uncovering their limitations. As LLM-based tools become widespread, there is a clear need for a comprehensive benchmarking suite that reflects the diversity and complexity of operational tasks encountered in real-world networks. This poster outlines our vision for designing such a modular benchmarking suite. We describe an approach for generating operational tasks of varying complexity and discuss how to evaluate LLMs on these tasks and assess system-level performance. As a preliminary evaluation, we benchmark three LLMs — GPT-4.1, Gemini 2.5-Pro, and Claude 3.7 Sonnet — across over 100 test cases and two pipeline variants.

Publication status

published

Book title

ACM SIGCOMM Posters and Demos '25: Proceedings of the ACM SIGCOMM 2025 Posters and Demos

Pages / Article No.

70–72

Publisher

Association for Computing Machinery

Event

39th ACM SIGCOMM Conference (SIGCOMM 2025)

Subject

Large language models; Network management; Benchmarking; Incident management

Organisational unit

09477 - Vanbever, Laurent

Notes

Short paper
