Autoscaling is a core capability in cloud computing with significant impact on service quality and cost. Modern applications, like microservices and serverless functions, consist of many containers that enable fine-grained, component-wise scaling. Effective autoscaling across large, heterogeneous service landscapes remains challenging. As cloud adoption increases, workloads have become more diverse, exhibiting highly variable request patterns, payload characteristics, and response time requirements. This limits the effectiveness of conventional autoscalers, whose fixed intervals and cooldown periods restrict responsiveness. At the same time, the growing number of services and frequent updates strain approaches based on predefined models, motivating more adaptive solutions.
To overcome these limitations, we introduce a decentralized, continuous autoscaling approach for modern cloud applications. Individual instances make scaling decisions solely from locally observed metrics. Distributing decisions across instances and time enables much higher decision frequencies than interval-based methods and approximates continuous behavior at scale. The approach is evaluated at three abstraction levels: (i) a discrete-time queueing model analyzing decision speed and convergence, (ii) discrete-event simulations assessing scaling dynamics across configurations and system sizes, and (iii) real-world cloud experiments comparing against widely used autoscalers.
A central result is that, despite the absence of explicit coordination among instances, the proposed method achieves scaling performance comparable to that of established autoscalers. This behavior is consistently observed across different system scales, applications, and workload patterns, and is supported by results from the analytical model, simulation, and prototype implementation. A comprehensive parameter study demonstrates high configurability: instance-level parameters yield predictable global effects, and diverse behaviors can be realized by tuning only three parameters. Under highly dynamic workloads, the approach outperforms baselines due to rapid, frequent decisions within a limited action space, reducing delays in upscaling even when metrics such as CPU utilization provide limited expressiveness.
Beyond performance considerations, the approach differs fundamentally from traditional autoscalers in its conceptual design. In particular, it eliminates global monitoring and enables arbitrary instance-local scaling policies. A key advantage is the flexibility in specifying scaling functions without being restricted to a predefined configuration space. At the same time, we demonstrate that certain canonical configurations, such as exponential scaling strategies for both up- and downscaling, perform well across a range of applications. This supports a balance between application-specific tuning and general-purpose applicability. Finally, our results indicate that the proposed analytical model and simulation framework can serve as effective tools for estimating scaling behavior under different scenarios and configurations. The prototype implementation is compatible with and deployable in modern container orchestration environments. Overall, the evaluation confirms the approach's convergence, robustness, configurability, and performance across diverse workloads, environments, and application scenarios.
Performance alert triage in large-scale software systems is constrained by limited engineering attention and long time-to-diagnosis. This paper addresses the task of predicting whether an aggregated Mozilla performance alert summary is associated with a reported bug. We propose a leakage-free, time-aware approach based primarily on structured alert metadata and multi-scale time-series features extracted from historical performance measurements. Evaluation follows a strict chronological protocol in which the last 20% of alert summaries constitute a held-out future test set. The best configuration, combining tabular metadata with time-series enrichment, achieves a test AUPRC of 0.851 and strong ranking performance (P@50 0.920, P@100 0.810, P@200 0.475, R@50 0.455, R@100 0.802, R@200 0.941), supporting both high-precision prioritization at the top of the ranking and broader coverage when reviewing larger alert sets. The resulting pipeline is computationally efficient and suitable for near real-time deployment. Feature importance analysis indicates that alert aggregation statistics, temporal context, and historical performance dynamics provide complementary signals for effective performance regression detection.
The automatic detection of performance regressions is a key requirement of modern performance testing infrastructures. Recently, this problem has been addressed using change-point detection algorithms, which aim to identify abrupt shifts in time series, thereby enabling the detection of regressions introduced during software evolution. Despite these advancements, the application of time series analysis to performance regression detection still presents many underexplored opportunities. For instance, the field of time series analysis has recently seen a growing adoption of machine learning techniques, which have been successfully applied to a wide range of time series tasks. Building on this trend, in this paper we investigate a novel methodology for performance regression detection in software systems based on learning-driven Time Series Classification (TSC). Specifically, we train three TSC models to predict performance regressions using time series collected from Mozilla Firefox's performance testing infrastructure, and we evaluate their ability to detect previously unseen software performance regressions. The results show that TSC models can effectively detect performance regressions, achieving reasonably high balanced accuracies across all evaluated models (up to 0.738). However, they also reveal notable limitations in terms of false positives that may hinder their practical adoption.
Cloud capacity data from leading technology companies is rarely made available for public research. Cloud providers tend to treat resource consumption patterns and infrastructure spending as proprietary information, which limits the ability of researchers to study real-world demand behavior. To our knowledge, few studies evaluate real capacity time series in detail, and the existing literature relies largely on simulations or synthetic datasets. The release of Snowflake's Shaved Ice dataset presents an opportunity to analyze multi-year VM demand and offers insights that would otherwise remain inaccessible to the broader research community.
Some capacity planning workflows rely on simple, backward-looking heuristics such as rolling averages to estimate future demand. While these methods are easy to implement, they fail to account for systematic temporal patterns, particularly weekly seasonality, and can lead to persistent forecast error even when demand is summed across all resources. Related network monitoring research has observed that such approaches "apply a static criterion" rather than dynamic forecasting that accounts for recurring patterns.
In this undergraduate study, we analyze Snowflake's publicly available Shaved Ice dataset to explore demand patterns in global virtual machine usage and evaluate the performance of common rolling-average baselines against simple seasonal time-series models. After aggregating demand across instance types and regions into a single daily global time series, we compare rolling averages computed over multiple window sizes against a seasonal naive (sNaive) baseline, a Seasonal ARIMA (SARIMA) model, and Exponential Smoothing (ETS). Forecast accuracy is evaluated using mean absolute percentage error (MAPE) on a held-out test period.
Our results show that rolling averages are highly sensitive to window selection and consistently underperform methods that encode weekly structure, which achieve MAPE near 3% versus 13-16% for rolling averages. Notably, a parameter-free seasonal naive baseline performs comparably to SARIMA and ETS, indicating that encoding weekly periodicity, not model sophistication, drives the accuracy gain.
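The contrast the study draws between rolling averages and seasonality-aware baselines can be illustrated with a minimal sketch. The data below is synthetic (a sine wave with weekly period plus noise), not the Shaved Ice series itself, and the window size and horizon are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily demand with weekly seasonality (illustrative only;
# not the actual Shaved Ice data).
days = np.arange(364)
demand = 100 + 20 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 2, days.size)

train, test = demand[:-28], demand[-28:]

def rolling_avg_forecast(history, horizon, w):
    """Rolling-average forecast: repeat the mean of the last w observations."""
    return np.full(horizon, history[-w:].mean())

def snaive_forecast(history, horizon, period=7):
    """Seasonal naive forecast: tile the last full seasonal cycle forward."""
    return np.resize(history[-period:], horizon)

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100 * np.mean(np.abs((actual - forecast) / actual))

print("rolling (w=14) MAPE:", round(mape(test, rolling_avg_forecast(train, 28, 14)), 2))
print("sNaive MAPE:", round(mape(test, snaive_forecast(train, 28)), 2))
```

On this toy series the flat rolling-average forecast cannot track the weekly swing, so its MAPE is roughly an order of magnitude worse than the parameter-free seasonal naive forecast, mirroring the qualitative gap reported above.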
Mixed-precision arithmetic can reduce time-to-solution and energy-to-solution on modern heterogeneous HPC systems. Yet tool-based mixed-precision autotuning succeeds unevenly across real applications. A key missing piece is application-based guidance: which characteristics make a code a good candidate for mixed-precision autotuners, and which characteristics make the process costly, fragile, or inconclusive.
This paper presents a forward-looking vision for application-centric mixed-precision tuning by proposing a taxonomy of properties that shape feasibility and payoff. We relate these property categories to a generic mixed-precision autotuner workflow, producing an impact matrix that clarifies why the same tuning stage can constitute fundamentally different problems across applications, and why tool comparisons without an explicit application-property frame can be misleading. We conclude by outlining how these properties can be operationalized as a checklist for assessing application tuning readiness.
Incident management is essential to maintain the reliability and availability of cloud computing services. Cloud vendors typically disclose incident reports to the public, summarizing the failures and recovery process to help minimize their impact. However, such reports are often lengthy and unstructured, making them difficult to understand, analyze, and use for long-term dependability improvements. The emergence of LLMs offers new opportunities to address this challenge, but how to achieve this is currently understudied. In this paper, we explore the use of cutting-edge LLMs to extract key information from unstructured cloud incident reports. First, we collect more than 3,000 incident reports from 3 leading cloud service providers (AWS, Azure, and GCP), and manually annotate these collected samples. Then, we design and compare 6 prompt strategies to extract and classify different types of information. We consider 6 LLM models, including 3 lightweight and 3 state-of-the-art (SotA), and evaluate model accuracy, latency, and token cost across datasets, models, prompts, and extracted fields. Finally, we identify the best-performing prompt strategy and model to extract information from all reports and perform a statistical dependability analysis. Our study has uncovered 12 key findings, among which: (1) LLMs achieve high metadata extraction accuracy, 75%-95% depending on the dataset. (2) Few-shot prompting generally improves accuracy for metadata fields except for classification, and achieves lower latency due to shorter outputs, but requires 1.5-2× more input tokens. (3) Lightweight models (e.g., Gemini 2.0, GPT 3.5) offer favorable trade-offs in accuracy, cost, and latency; SotA models yield higher accuracy at significantly greater cost and latency. (4) Incident durations vary by cloud vendor; Azure incidents last 2.6× longer than AWS incidents.
Our study provides tools, methodologies, and insights for leveraging LLMs to accurately and efficiently extract incident-report information. The FAIR data and code are publicly available at https://github.com/atlarge-research/llm-cloud-incident-extraction.
Multi-Agentic AI systems, powered by large language models (LLMs), are inherently non-deterministic due to stochastic generation, evolving internal states, and complex inter-agent interactions. This non-determinism can lead to silent failures such as behavioral drift over time or cyclic reasoning loops, none of which necessarily trigger explicit error signals. Such silent failures rapidly increase operational costs, including computational resources and token usage, and degrade overall system performance by blocking resources that could serve other tasks requiring a faster turnaround. A trajectory captures the interactions among agents as well as their interactions with external tools. Since agents influence each other's behaviour, the evolving trajectory ultimately determines whether a particular set of operations produced the expected correct output or deviated with intertwined silent failure manifestations. In this work, we introduce the task of detecting silent failures in agentic applications, where failures occur without explicit error signals. To enable this, we present a dataset curation pipeline that captures user behavior, agent non-determinism, and LLM variations, covering a large agentic ecosystem state-space. Using this pipeline, we curate and label two benchmark datasets comprising 4,275 and 894 trajectories from Multi-Agentic AI systems. We show that supervised (XGBoost) and semi-supervised (SVDD) approaches perform comparably, achieving accuracies up to 98% and 96%, respectively, serving as potential indicators of these methods towards robust silent failure detection. This work provides the first systematic study of silent-failure detection in Multi-Agentic AI systems, offering datasets, benchmarks, and insights to guide future research.
Agentic applications powered by Large Language Models exhibit non-deterministic behaviors that can form hidden execution cycles, silently consuming resources without triggering explicit errors. Traditional observability platforms fail to detect these costly inefficiencies. We present an unsupervised cycle detection framework that combines structural and semantic analysis. Our approach first applies computationally efficient temporal call stack analysis to identify explicit loops and then leverages semantic similarity analysis to uncover subtle cycles characterized by redundant content generation. Evaluated on 1575 trajectories from a LangGraph-based stock market application, our hybrid approach achieves an F1 score of 0.72 (precision: 0.62, recall: 0.86), significantly outperforming individual structural (F1: 0.08) and semantic methods (F1: 0.28). Although the initial results are promising, additional enhancements are needed. Ongoing efforts focus on refining the approach, expanding experiments, mitigating its limitations, and broadening the scope of this work in the emerging agent-ops space.
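The structural component of such a hybrid detector can be sketched in a few lines. The function below is a simplified stand-in for temporal call-stack analysis, not the paper's implementation: it flags any call subsequence that repeats consecutively in a trajectory, with hypothetical thresholds `max_len` and `min_repeats`:

```python
def detect_structural_cycles(calls, max_len=4, min_repeats=3):
    """Flag consecutive repetitions of the same call subsequence in an
    agent trajectory; a simple proxy for temporal call-stack analysis.

    calls: ordered list of agent/tool call identifiers.
    Returns the set of repeated subsequences found.
    """
    found = set()
    n = len(calls)
    for length in range(1, max_len + 1):
        i = 0
        while i + length <= n:
            pattern = tuple(calls[i:i + length])
            repeats, j = 1, i + length
            # Count how many times the pattern repeats back-to-back.
            while tuple(calls[j:j + length]) == pattern:
                repeats += 1
                j += length
            if repeats >= min_repeats:
                found.add(pattern)
            i += 1
    return found

trace = ["plan", "search", "summarize",
         "search", "summarize",
         "search", "summarize", "answer"]
print(detect_structural_cycles(trace))  # flags the (search, summarize) loop
```

A purely structural check like this misses cycles whose calls differ syntactically but repeat the same content, which is why the paper pairs it with semantic similarity analysis over the generated text.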
Deep neural networks (DNNs) are pervasive across various domains, with inference requests often generated at the network edge, where resources are limited and energy efficiency is critical. Techniques like Post-Training Quantization (PTQ) have also emerged to facilitate inference at the edge, trading off resource demand against accuracy. However, running inference entirely on devices can lead to high latency and excessive battery drain, while executing it exclusively in the cloud introduces communication delays and may result in a significant environmental impact. As such, inference tasks must carefully exploit both edge and cloud computing resources, leveraging DNN model splitting (or partitioning).
In this work, we present a multi-objective optimization problem to distribute DNN model inference across the edge–cloud continuum while integrating PTQ. We develop a prototype architecture to profile DNN models and the underlying computing infrastructure, and we address the issue of estimating quantization noise. Evaluated on YOLO11 vision models, our approach achieves significant reductions in both inference times and energy consumption (up to 30% for both metrics) compared to device-only inference execution.
Performance testing is essential for contemporary software systems because it ensures reliability and user satisfaction (while keeping operational costs under control). Conventional performance testing activities still require expert knowledge, laborious scripting, and substantial time investment within modern DevOps workflows. Recent advances in generative and agentic AI create the opportunity to automate these processes. This work-in-progress paper explores the use of agentic AI to support end-to-end automation of software performance testing. We present an initial prototype that coordinates multiple specialized agents to extract contextual information from existing project artifacts, generate realistic performance test scenarios, execute tests using an established performance testing framework, and provide structured interpretations of the resulting metrics. We report early results from exploratory case studies of microservice-based systems from the DeathStarBench benchmark suite, which suggest that agentic workflows can autonomously produce executable performance tests and meaningful performance reports with limited human intervention, even when available context is incomplete or partially misleading. Such results illustrate the feasibility of agent-based performance testing and motivate a broader research agenda on AI-augmented performance engineering.
Identifying root causes of performance failures in microservice systems remains challenging, despite rich observability data such as distributed traces and profiling metrics. Existing root cause analysis techniques largely rely on predefined statistical models or anomaly propagation heuristics, which often struggle to scale and adapt to the complexity, heterogeneity, and evolution of real-world microservice workloads. We argue that effective root cause analysis must move beyond raw observability signals toward pattern-centric representations that better align with how engineers reason about performance failures. In this position paper, we examine Large Language Models as reasoning engines for root cause analysis in distributed systems and investigate how observability data representation shapes their effectiveness. Through exploratory experiments on microservice traces and profiling metrics, we show that root cause analysis using Large Language Models benefits substantially more from structured abstractions—such as invocation-level summaries and implicit anomaly patterns—than from raw, high-volume observability data. These abstractions not only improve diagnostic accuracy but also significantly reduce token usage and inference cost. Our findings suggest that a dominant practical bottleneck in applying Large Language Models to performance engineering is often how observability data is distilled into hypothesis-friendly representations, rather than model capability alone. Based on these insights, we outline a research agenda for Large Language Model-assisted performance diagnosis, highlighting open challenges in representation design, cost-aware reasoning, and evaluation beyond accuracy. This work motivates a shift toward representation-aware root cause analysis pipelines and opens new directions for performance engineering research.
Vehicle platooning has traditionally relied on a leader–follower architecture in which a specific vehicle persistently manages coordination and generates motion references for others. Although this structure simplifies control design, it inherently creates performance imbalance, limits scalability, and becomes fragile when operating conditions vary dynamically. Most prior studies implicitly treat leadership as inevitable and therefore concentrate on leader failure handling, redundancy, or replacement strategies rather than questioning the architecture itself. This paper revisits that assumption and advances a distinct research direction: leaderless platooning enabled by distributed performance stress balancing. Instead of viewing a platoon as a hierarchy, we conceptualize it as a collective performance system in which authority is not fixed. Each vehicle continuously assesses its own performance stress based on factors such as communication delay, control jitter, and computational workload, and dynamically adjusts its coordination role in response. Platoon-level coordination then arises through local negotiation and adaptation rather than centralized decision making. The primary contribution is not a novel controller or protocol, but an architectural reframing of platooning that defines a new research direction for performance engineering in cyber-physical systems.
Profilers are widely used tools in performance engineering, often employed to estimate method cost (i.e. the percentage of CPU time spent in a method) and to identify the costliest method in a program. However, recent work has shown that profiler outputs can be inconsistent, especially for languages running on virtual machines. For the same program, different profilers may report different costliest methods and different cost estimates for the same method. To mitigate this, one might aggregate results from multiple profilers (commonly by majority voting or averaging percentages). While simple and widely used, these methods assume all profilers are equally reliable, and fail to account for systematic bias in profiler measurements. We present a Bayesian statistical model that explicitly estimates profiler accuracy by looking at cost estimates for various programs, and using this information to correct cost estimates. In this paper, we describe the model and present an evaluation, which includes a simulation study and a case study using profiler data from prior research. We find our model is more robust to outliers than averaging percentages and performs comparably to majority voting at identifying the hottest method, while additionally providing estimates of method cost and profiler reliability.
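The two baselines the paper compares against can be made concrete with a small sketch. The cost estimates below are hypothetical, constructed so that one profiler (`p4`) is systematically biased; they are not data from the paper:

```python
import statistics

# Hypothetical cost estimates (% of CPU time) for three methods from
# four profilers; "p4" is an outlier biased toward method "b".
estimates = {
    "p1": {"a": 40.0, "b": 30.0, "c": 30.0},
    "p2": {"a": 42.0, "b": 28.0, "c": 30.0},
    "p3": {"a": 41.0, "b": 29.0, "c": 30.0},
    "p4": {"a": 10.0, "b": 80.0, "c": 10.0},  # biased profiler
}

def hottest_by_majority(estimates):
    """Each profiler votes for its costliest method; the mode wins."""
    votes = [max(e, key=e.get) for e in estimates.values()]
    return statistics.mode(votes)

def hottest_by_average(estimates):
    """Average the percentage estimates per method, then pick the max."""
    methods = next(iter(estimates.values())).keys()
    avg = {m: statistics.mean(e[m] for e in estimates.values()) for m in methods}
    return max(avg, key=avg.get)

print(hottest_by_majority(estimates))  # "a": voting resists the outlier
print(hottest_by_average(estimates))   # "b": averaging is swayed by it
```

Here a single biased profiler flips the averaged ranking while majority voting is unaffected, which illustrates why a model that explicitly estimates per-profiler reliability, as the paper proposes, can improve on naive averaging while still recovering calibrated cost estimates that voting alone cannot provide.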
Performance regressions are notoriously difficult to detect due to their silent nature and dependency on complex, structure-specific input conditions. While state-of-the-art performance fuzzers effectively maximize execution path lengths, they often lack semantic direction, wasting significant resources probing stable regions. In this paper, we propose Issue-Driven Performance Fuzzing, which posits that a reported regression is rarely an isolated incident but a visible symptom of a broader cluster of latent vulnerabilities. To operationalize this, we introduce an automated framework utilizing LLMs to bridge the semantic gap between unstructured issue reports and executable fuzzing seeds. By extracting the essence of a historical bug to synthesize targeted mutation strategies, our approach explores the bug's structural neighborhood for hidden defects. We evaluated this framework on Apache PDFBox, where it successfully isolated latent degradation factors and revealed resilience regressions, i.e., the instances where optimized versions exhibit higher algorithmic fragility. In total, our framework generated 22 valid mutation strategies and identified 2 previously unknown performance weaknesses.
This demo showcases an open-source simulator for evaluating the performance of cloud-native applications in Cyber-Physical clouds. Applications are modelled as DAGs of possibly stateful microservices, each implemented as a pool of load-balanced, distributed containers which can be subjected to different kinds of random failures. Such applications are simulated as DAGs of special FIFO queues to capture statistics such as node utilization, end-to-end latency distributions, and other relevant metrics via discrete event simulation. This allows users to evaluate the performance impact of, e.g., different load-balancing strategies, fail-over techniques, or state management policies. In this demo, we will introduce the audience to the simulator through some simple but significant examples, showing how to use it to identify performance bottlenecks in a cloud-native application and how to correctly configure a Cyber-Physical cloud to serve it.
This paper introduces EVA-rt-Engine, a new schedulability analysis and design library released under the GPLv3 open-source license. The tool features the implementation of standard state-of-the-art analysis techniques for uniprocessor and global multiprocessor scheduling platforms, as well as more advanced hierarchical scheduling models, analyses, and design tools to aid in the development of component-based real-time software systems. The tool, which is compared with the state of the art in this paper, is currently used to aid the development of a Linux kernel patch that introduces hierarchical scheduling policies.
The planning of IoT applications requires consideration of the performance, energy consumption, and security of resource-constrained devices. In particular, n-to-n communication patterns and security mechanisms significantly affect the runtime, energy consumption, and reliability of IoT systems. In this article, we summarize our previous work on the performance and security analysis of secure group communication and attribute-based encryption methods, as well as on the evaluation of publish/subscribe brokers. Based on measurements on real-world IoT hardware, we outline an approach for modeling and combining individual IoT operations to estimate time and energy consumption in advance. The goal is to support developers in performance-oriented planning and development of energy-efficient, secure IoT applications.
Adaptive computing systems must react to workload and configuration shifts under hard constraints such as capacity limits, stability, and reconfiguration costs. Model Predictive Control (MPC) provides a principled framework by repeatedly solving a finite-horizon optimization problem online, but MPC's runtime optimization can become the bottleneck as horizon length and discrete decision spaces grow. In recent work, we expressed discrete-time MPC optimization as a QUBO problem that is solved via classical and quantum methods. This poster paper synthesizes two complementary case studies, a simulated tandem queueing model controlled via buffer tuning and a running Apache HTTP server controlled by tuning the number of concurrent users in the system, showing tradeoffs that determine feasibility in real-time control loops.
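The QUBO encoding at the core of this approach can be illustrated with a toy instance: minimize x^T Q x over binary x, here solved by exhaustive enumeration. The matrix Q below is an arbitrary illustrative example, not one derived from the paper's queueing or HTTP case studies:

```python
import itertools
import numpy as np

# Toy QUBO instance: minimize x^T Q x over x in {0,1}^n.
# Q is hypothetical; a real MPC encoding would derive it from the
# horizon, cost function, and constraint penalty terms.
Q = np.array([
    [-2.0, 1.2, 0.0],
    [ 1.2, -3.5, 1.5],
    [ 0.0, 1.5, -1.0],
])

def solve_qubo_bruteforce(Q):
    """Enumerate all binary vectors and return the minimizer and its energy."""
    best_x, best_e = None, float("inf")
    for bits in itertools.product([0, 1], repeat=Q.shape[0]):
        x = np.array(bits, dtype=float)
        e = x @ Q @ x
        if e < best_e:
            best_x, best_e = bits, e
    return best_x, best_e

x, energy = solve_qubo_bruteforce(Q)
print(x, energy)  # → (0, 1, 0) -3.5
```

Brute force scales exponentially with the number of decision variables, which is precisely why the paper turns to specialized classical and quantum QUBO solvers once the horizon and discrete action space grow.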
For almost two decades, the Standard Performance Evaluation Corporation (SPEC) has developed widely used tools to reliably evaluate system efficiency: the SPEC PTDaemon® interface, the SPECpower_ssj® 2008 benchmark, and the SPEC SERT® suite. To ensure relevant, high-quality, and reproducible measurements, and to adapt continuously to evolving industry and regulatory requirements, these tools are under continuous development. We present an overview of their evolution and describe current progress towards the next major versions of the benchmarks.
Over the past few decades, workflow management systems have established themselves as a major tool in scientific computing, enabling the composition and execution of complex analysis tasks on distributed computing resources. This paper presents a proof-of-concept workflow orchestrator based on the blackboard paradigm, designed to decouple workflow logic from resource management while enabling bidirectional synchronization between them. The proposed blackboard architecture provides a shared state and communication layer for workflow models, executions, and platform metadata, supporting coordinated scheduling, placement, and monitoring. Its modular design integrates heterogeneous resource managers and provides a foundation for self-aware learning, workflow-based performance engineering, and resource demand estimation of scientific workflows, as pursued in the DFG research unit SOS.
This poster paper links performance engineering and cybersecurity by presenting an architecture for performance-based attack detection in microservice systems. We define a workflow based on service-level runtime behavior rather than network signatures or logs. The architecture integrates workload generation, attack injection, and continuous collection of CPU, memory, filesystem activity, and response-time metrics. In future work, we will evaluate this architecture on benchmarking applications that reflect realistic architectures reported by practitioners. The goal is to assess whether runtime performance behavior can support service-level attack detection.
Modern software systems operate within increasingly complex execution environments, making performance diagnosis a critical and challenging task for developers, engineers, and researchers. Effectively identifying performance bottlenecks requires powerful tools and a clear methodology for selecting, configuring, and combining complementary analysis techniques. This tutorial presents a comprehensive, hands-on introduction to three widely used and production-ready Linux performance analysis tools: perf, LTTng, and Trace Compass. These tools complement each other, and when used together, they can provide a comprehensive view of the performance of the target system.
The tutorial guides participants through a complete performance analysis workflow, combining statistical profiling, low-overhead tracing, and advanced trace visualization to diagnose performance issues in Linux systems. Attendees will learn how to install and configure each tool, understand their respective strengths and trade-offs, and apply them effectively to analyze application- and system-level behavior. Through guided hands-on exercises, participants will gain practical experience in profiling execution hotspots, tracing system and application events, correlating performance data across layers, and identifying the root causes of performance bottlenecks.
By the end of the session, participants will be equipped with the practical skills and methodological knowledge needed to perform reproducible and efficient performance analyses using state-of-the-art Linux tools, enabling them to diagnose and resolve performance issues in both research and production environments.
Identifying performance bottlenecks in complex AI/ML workloads, which span intricate application software and diverse hardware architectures, presents a significant challenge. Pinpointing performance inhibitors across this vast stack is difficult, making an understanding of a workload's unique execution signature crucial for developing targeted and effective performance optimizations. The Ampere Performance Toolkit addresses this by providing a hierarchical, top-down methodology for workload observability and profiling. This paper demonstrates how to effectively pinpoint performance inhibitors across combined application software and hardware architecture.
The approach begins with a multi-stage process. First, the Ampere-System-Profiler offers a holistic platform view by monitoring CPU, network, disk, and kernel activity. This identifies and rules out high-level system bottlenecks, which is vital before advancing to more targeted, lower-level profiling efforts. If no apparent system-level issues are found, the analysis proceeds. The Ampere-PMU-Profiler then takes a deeper dive, collecting Performance Monitoring Unit (PMU) event data visualized through sunburst plots, illustrating how instructions are distributed across the CPU. Finally, because methodical benchmarking helps isolate performance issues and enables efficient root-cause analysis, the Ampere PerfKit Benchmarker (APB) facilitates systematic benchmarking. This methodical application of the Ampere Performance Toolkit facilitates precise bottleneck identification, driving significant performance optimizations for demanding workloads.
The rapid adoption of GPU-accelerated workloads for AI and high-performance computing has driven a sharp rise in data center energy consumption, making reliable energy efficiency measurement critical for research, system design, and operational decisions. Unlike conventional performance benchmarking, energy efficiency benchmarking introduces new sources of variability (e.g., power-domain selection, sampling strategy, thermal and power-management policies, and workload non-determinism) that must be explicitly controlled and documented. Small changes in drivers, firmware, or tuning can produce large, misleading differences in reported efficiency, particularly on modern GPU platforms where hardware/software co-optimization and vendor-specific features are pervasive. In this paper, we summarize best practices and lessons learned from implementing GPU workloads for the latest SPEC efficiency benchmarks, providing guidance for reliable, reproducible energy efficiency experiments on GPU servers. Our contributions aim to standardize methodology, reduce ambiguity in comparisons, and enable trusted energy efficiency benchmarking of GPU-accelerated systems.
The Workshop on Education and Practice of Performance Engineering, in its 6th edition, brings together university and industry performance engineers to share education and practice experiences. Specifically, the goal of the workshop is to discuss the gap between university performance engineering programs and the skills required to implement performance engineering in industrial settings. In this edition, we have built a workshop program consisting of a keynote talk, seven invited talks, and two peer-reviewed papers. The workshop presentations cover industry experiences, pedagogical practice, and technical transfer between the two environments.
Performance engineering education traditionally relies on analytical models supported by specialized queueing solvers. While effective, this approach creates strong tool dependencies and limits opportunities for qualitative reasoning when solvers are unavailable, difficult to maintain, or inappropriate for exploratory learning. This position paper argues that Model Interchange Formats (MIFs), combined with recent advances in Artificial Intelligence (AI), enable a solver-independent approach to teaching performance engineering. Our experience has been specifically grounded in the use of PMIF+ (Performance Model Interchange Format), one instance of a Model Interchange Format, which provides a structured yet solver-agnostic representation of Queueing Network-based models. Rather than replacing analytical methods, AI can act as a pedagogical assistant that supports reasoning, interpretation, and exploratory analysis of MIF-based performance models. We discuss what this approach enables, where it fails, and its implications for performance engineering education.
Performance analysis is essential for understanding the behavior of parallel applications on High-Performance Computing (HPC) systems, identifying bottlenecks and load imbalances. Traditional analysis workflows are post-mortem: applications are instrumented, performance data is collected at runtime, and insights become available only after execution. This workflow is time-consuming, especially for large-scale Message Passing Interface (MPI) applications (i.e., those running on many processes/nodes) that require repeated analysis after code changes, and it demands substantial expertise, posing a steep learning barrier for beginners.
To address these challenges, in previous work, we proposed EduMPI Suite, which provides near-real-time visualization of MPI communication in an interactive GUI, making performance analysis immediately accessible in educational settings. While our previous work focused on usability and classroom integration, this paper presents the technical architecture and performance evaluation of EduMPI Suite's measurement and data-management system.
We couple EduMPI, an extended Open MPI fork with integrated performance measurement, with EduStore, a time-series database that processes performance-relevant events in near-real-time for visualization in the EduMPI GUI. EduMPI leverages a binary-based ingestion layer for high-throughput data insertion, avoiding costly data conversions. Combined with TimescaleDB's optimized time-series processing, this design ensures continuous data availability with negligible delay. On-the-fly aggregation and indexing enable responsive queries and smooth, interactive visualization of communication traces. Our evaluation demonstrates that EduMPI Suite introduces less than 3.8% runtime overhead while maintaining query latencies below 0.09 s, enabling interactive performance analysis of MPI applications. Usability studies confirm that EduMPI Suite significantly reduces entry barriers for students and improves their ability to identify performance issues compared to conventional tools.
Typical tasks in performance engineering education and practice include anomaly detection, outlier detection, change point detection, and trend analysis, with the goal of anticipating future threshold breaches and capacity risks.
This paper presents Statistical Exception and Trend Detection (SETDS), a method for detecting both past and future change points in performance time series and discusses its use as both an analytical technique and a teaching tool. The paper provides an intuitive yet rigorous description of the SETDS method and demonstrates its practical application through Perfomalist, a free web-based tool that implements the approach. Perfomalist enables students and practitioners to visualize anomalies, trends, and change points using IT Control Charts and Exception Values, making advanced concepts in performance analysis more accessible. Perfomalist was also used as a baseline tool in a CMG.org Hackathon, where participants were tasked with identifying change points and anomalies in time-stamped performance data to reveal distinct phases and patterns. Finally, the method and tool have been successfully used in an online course on Performance Anomaly Detection offered through CMG.org, illustrating how real-world performance data and tooling can be effectively integrated into performance engineering education.
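The core idea behind an IT Control Chart can be sketched in a few lines: derive per-slot control limits from historical periods and integrate excursions beyond the band into a single exception value. The sketch below is a simplified reading of this concept, not Perfomalist's or SETDS's actual implementation; the function names and the choice of k=3 sigma limits are illustrative assumptions.

```python
import statistics

def control_limits(history, k=3.0):
    """Per-slot control limits from historical periods.
    history: list of periods (e.g., weeks), each a list of values
    for the same time slots (e.g., hours of the week)."""
    slots = list(zip(*history))  # transpose: one tuple per time slot
    means = [statistics.mean(s) for s in slots]
    sds = [statistics.pstdev(s) for s in slots]
    ucl = [m + k * d for m, d in zip(means, sds)]           # upper limit
    lcl = [max(0.0, m - k * d) for m, d in zip(means, sds)]  # lower limit
    return ucl, lcl

def exception_value(current, ucl, lcl):
    """Sum of excursions beyond the control band: positive above the
    UCL, negative below the LCL. A large magnitude flags a potential
    anomaly or change point in the current period."""
    ev = 0.0
    for x, u, l in zip(current, ucl, lcl):
        if x > u:
            ev += x - u
        elif x < l:
            ev -= l - x
    return ev
```

A period whose values stay inside the band yields an exception value of zero; a sustained shift accumulates a large positive or negative value, which is what trend analysis over successive exception values exploits.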
The HotCloudPerf workshop provides a meeting venue for academics and practitioners---from experts to trainees---in the field of Cloud computing performance. The new understanding of Cloud computing covers the full computing continuum from data centers to edge to IoT sensors and devices. The workshop aims to engage this community and lead to the development of new methodological aspects for gaining a deeper understanding not only of cloud performance, but also of cloud operation and behavior, through diverse quantitative evaluation tools, including benchmarks, metrics, and workload generators. The workshop focuses on novel cloud properties such as elasticity, performance isolation, dependability, and other non-functional system properties, in addition to classical performance-related metrics such as response time, throughput, scalability, and efficiency. HotCloudPerf 2026, co-located with the 17th ACM/SPEC International Conference on Performance Engineering (ICPE 2026), is held on May 5th, 2026.
Information and Communications Technologies have been undergoing a relentless evolution; the widespread availability of affordable high-speed Internet connections caused a major paradigm shift in software development and engineering towards distributed computing. This led to the rise of Cloud Computing as the de-facto standard for managing the distributed infrastructures on top of which an ever-increasing number of online services are made available, impacting our everyday lives.
Cloud infrastructures have become increasingly attractive for a wide range of modern and future distributed computing scenarios, including those involving the deployment of time-sensitive applications, where the ability to provide end-to-end performance and availability guarantees for the hosted applications, in conjunction with resource and energy efficiency, becomes paramount. This is relevant for several application domains: interactive multimedia, virtual and augmented reality and on-line gaming; offloading of advanced functionality in robotics, factory automation, Industry 4.0 and automotive; real-time packet processing and software-defined telecommunications; and others. In these scenarios, it is important to provide performance guarantees to end-users, despite temporal interference and noisy neighbor effects due to resource sharing and multi-tenancy, while tolerating failures at various levels. Among the challenges a provider has to tackle in this context are key design and architectural decisions related to the necessary trade-offs between performance, costs, and energy efficiency.
In this keynote talk, I provide an overview of the research projects I've been overseeing on these topics, spanning nearly two decades of research carried out jointly with international academic and industrial partners, highlighting which challenges have been tackled and which mechanisms have been realized. These include our relentless efforts to develop and contribute a real-time CPU scheduler for the Linux kernel to support deadline-driven scheduling of real-time embedded and virtualized workloads: from the early-stage developments within the IRMOS EU project, to the integration of SCHED_DEADLINE within the mainline Linux kernel, to its hierarchical extension and its proposed integration within the OpenStack and Kubernetes Cloud orchestration platforms. The talk concludes with an overview of the challenges that still lie ahead in this area for future research.
Datacenters are the backbone of our digital society, but raise numerous operational challenges. We envision digital twins becoming primary instruments in datacenter operations, continuously and autonomously helping with major operational decisions and with adapting ICT infrastructure, live, with a human-in-the-loop. Although fields such as aviation and autonomous driving successfully employ digital twins, an open-source digital twin for datacenters has not been demonstrated to the community. Addressing this challenge, we design, implement, and experiment using OpenDT, an Open-source Digital Twin for monitoring and operating datacenters through a continuous integration cycle that includes: (1) live and continuous telemetry data; (2) discrete-event simulation using live telemetry from the physical ICT, with self-calibration; and (3) SLO-aware and human-approved feedback to physical ICT. Through trace-driven experiments with a prototype mainly covering stages 1 and 2 of the cycle, we show that (i) OpenDT can be used to reproduce peer-reviewed experiments and extend the analysis with performance and energy-efficiency results; (ii) OpenDT's online re-calibration can increase digital-twinning accuracy, quantified as a MAPE of 4.39% vs. 7.86% in peer-reviewed work. OpenDT adheres to FAIR/FOSS principles and is available at: https://github.com/atlarge-research/opendt/tree/hcp
Today, researchers and domain experts in remote sensing typically design their workflows from scratch with limited sharing and reusability across teams and organizations. A component-based approach to workflow engineering addresses these issues by modeling workflows as a DAG of self-contained, reusable tasks, each characterized by its input and output schema, semantics, and metadata. This abstraction enables interchangeable implementations and reuse in future workflows. Furthermore, it would enable the fine-grained resource allocation required for serverless execution of scientific workflows. We demonstrate the flexibility gained in new execution paradigms by implementing a component-based workflow in the Earth Observation (EO) domain. Specifically, we deploy the workflow on a monolithic system and a serverless Function-as-a-Service (FaaS) platform. We show that the monolithic deployment outperforms serverless execution due to greater availability of computing resources and reduced data transfer requirements. Furthermore, our custom scheduler in the local deployment can leverage the DAG structure of inter-task dependencies, in contrast to the limited parallelization options available in application-agnostic schedulers in FaaS systems. In contrast, the serverless deployment could significantly reduce the estimated cost of a workflow execution. Our component-based approach to workflow engineering enabled the decomposition of the computation into a parallel chain of self-contained serverless functions, thereby improving resource utilization, component reuse, and cost efficiency.
Configuring stream processing systems for efficient performance, especially in cloud-native deployments, is a challenging and largely manual task. We present an experiment-driven approach for automated configuration optimization that combines three phases: Latin Hypercube Sampling for initial exploration, Simulated Annealing for guided stochastic search, and Hill Climbing for local refinement. The workflow is integrated with the cloud-native Theodolite benchmarking framework, enabling automated experiment orchestration on Kubernetes and early termination of underperforming configurations. In an experimental evaluation with Kafka Streams and a Kubernetes-based cloud testbed, our approach identifies configurations that improve throughput by up to 23% over the default. The results indicate that Latin Hypercube Sampling with early termination and Simulated Annealing are particularly effective in navigating the configuration space, whereas additional fine-tuning via Hill Climbing yields limited benefits.
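The three-phase search described above can be sketched in a few dozen lines. The sketch below is illustrative only, not the Theodolite-integrated implementation: the toy objective, parameter bounds, perturbation scales, and cooling schedule are all our assumptions.

```python
import math
import random

def latin_hypercube(n, bounds, rng):
    """n stratified samples: each dimension's range is split into n
    equal strata, and each stratum is used exactly once (classic LHS)."""
    dims = len(bounds)
    samples = [[0.0] * dims for _ in range(n)]
    for d, (lo, hi) in enumerate(bounds):
        strata = list(range(n))
        rng.shuffle(strata)
        for i, s in enumerate(strata):
            samples[i][d] = lo + (hi - lo) * (s + rng.random()) / n
    return samples

def perturb(point, bounds, rng, scale):
    """Gaussian move, clipped to the search bounds."""
    return [min(hi, max(lo, x + rng.gauss(0, scale * (hi - lo))))
            for x, (lo, hi) in zip(point, bounds)]

def optimize(objective, bounds, rng, n_init=10, sa_iters=50, hc_iters=20):
    # Phase 1: Latin Hypercube Sampling for initial exploration.
    best = max(latin_hypercube(n_init, bounds, rng), key=objective)
    best_f = objective(best)
    # Phase 2: Simulated Annealing for guided stochastic search.
    cur, cur_f, temp = best, best_f, 1.0
    for _ in range(sa_iters):
        cand = perturb(cur, bounds, rng, 0.1)
        f = objective(cand)
        if f > cur_f or rng.random() < math.exp((f - cur_f) / temp):
            cur, cur_f = cand, f
            if f > best_f:
                best, best_f = cand, f
        temp *= 0.95  # geometric cooling
    # Phase 3: Hill Climbing for local refinement of the incumbent.
    for _ in range(hc_iters):
        cand = perturb(best, bounds, rng, 0.02)
        if objective(cand) > best_f:
            best, best_f = cand, objective(cand)
    return best, best_f
```

In the real workflow each `objective` call is a benchmark experiment, which is why early termination of underperforming configurations matters far more than the search arithmetic itself.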
The Italian Conference on System and Service Quality (QualITA) is the annual event of the CINI (Consorzio Interuniversitario Nazionale per l'Informatica) System and Service Quality Working Group (https://qualitawg.github.io/). Its main goal is to bring together Italian researchers, practitioners, and professionals from academia, industry, and public administration interested in the qualities of computing systems and services, such as performance, dependability, trustworthiness, efficiency, resilience, and sustainability.
QualITA aims to highlight quality aspects, considering the complex, heterogeneous, and multidisciplinary environment in which modern computing systems operate, as well as their application domains, characterized by an increasing demand for resilient and sustainable solutions that meet efficiency and time-constrained requirements.
QualITA provides a venue for showcasing consolidated knowledge, enabling the community to learn from robust findings, appreciate the evolution of ongoing research lines, and foster discussions on the impact, lessons learned, and future directions emerging from mature contributions.
After a thorough review process, QualITA26 features 5 short papers and 4 full papers, covering methodological and practical aspects of quality science and engineering, ranging from modeling and design techniques to evaluation tools and case studies on domains such as cloud, edge, IoT, cyber-physical systems, high-performance computing, artificial intelligence, software and data quality. The workshop also includes the keynote talk titled ''We Optimized the Models and Broke the Community: Social Awareness as a Missing Requirement in ML-Enabled Software Engineering'', by Gemma Catolino from the University of Salerno.
In addition to these contributions, the Workshop organizes an 'Emerging Ideas, Early Results, Feedback & Collaboration' track that provides researchers and practitioners with the opportunity to share novel ideas, preliminary findings, or ongoing work that could benefit from early visibility within the community. These contributions, although not included in the proceedings, present innovative directions, exploratory research, or open problems that can benefit from peer feedback to inform subsequent stages of development.
This track also serves as a space to propose collaborative initiatives, seek partners for future studies, or initiate discussions on emerging challenges and opportunities. The goal is to foster an open, constructive, and forward-looking environment in which early-stage research can grow, mature, and connect with potential collaborators across academia and industry. We would like to thank Professor Salvatore Distefano, chair of the CINI System and Service Quality Working Group, for his active leadership of the community. We are grateful to the authors for their contributions and to the program committee, who worked very hard in reviewing papers and providing feedback to authors. Finally, we thank ICPE for hosting our workshop.
Machine learning (ML) and large language models are increasingly integrated into software engineering (SE) workflows, supporting tasks such as code generation, documentation, testing, and maintenance. While these techniques provide measurable productivity benefits, their impact on the social structure of development teams remains largely underexplored. Existing research primarily focuses on model performance, leaving socio-technical consequences such as fairness, collaboration, and organizational sustainability insufficiently addressed. This keynote positions ML-enabled software engineering as a socio-technical system in which intelligent tools influence communication, coordination, and decision-making. Drawing on empirical studies on fairness-aware practices, community smells, and social debt, we identify two key challenges. First, fairness concerns propagate across the entire ML lifecycle, requiring lifecycle-wide mitigation strategies rather than post-hoc corrections. Second, ML-based assistants reshape collaboration structures, potentially introducing new coordination issues and amplifying existing socio-technical anti-patterns. The goal of the talk is to establish social awareness as a first-class requirement for ML-enabled SE. We outline a research agenda centered on socio-technical metrics, human-centered AI tools, and governance mechanisms to support sustainable and collaborative AI-enabled development ecosystems.
Large language model (LLM) services have become an integral part of search, assistance, and decision-making applications. However, unlike traditional web or microservices, the hardware and software stack enabling LLM inference deployment is of higher complexity and far less field-tested, making it more susceptible to failures that are difficult to resolve. Keeping outage costs and quality of service degradations in check depends on shortening mean time to repair, which in practice is gated by how quickly the fault is identified, located, and diagnosed. Automated root cause analysis (RCA) accelerates failure localization by identifying the system component that failed and tracing how the failure propagated. Numerous RCA methods have been developed for traditional services, using request path tracing, resource metric and log data analysis. Yet, existing RCA methods have not been designed for LLM deployments that present distinct runtime characteristics. In this study, we evaluate the effectiveness of RCA methods on a best-practice LLM inference deployment under controlled failure injections. Across 24 methods—20 metric-based, two trace-based, and two multi-source—we find that multi-source approaches achieve the highest accuracy, metric-based methods show fault-type-dependent performance, and trace-based methods largely fail. These results reveal that existing RCA tools do not generalize to LLM systems, motivating tailored analysis techniques and enhanced observability, for which we formulate guidelines.
When dealing with privacy risk assessment of ICT systems, the results of the process are captured in reports that are difficult to reproduce, compare, and audit, affecting the efficiency of decision processes. In this regard, we present an integrated decision-based workflow which combines (i) the Analytic Hierarchy Process (AHP) to derive explicit weights for privacy/cost/performance criteria and select the preferred ICT configuration, and (ii) ISO/IEC 27005 to validate the selected configuration from a risk-management perspective (risk register, processing options, residual risk). By means of a case study, we show that privacy-oriented alternatives, when correctly weighted, can turn out to be optimal even when their cost is higher. The ISO/IEC 27005 layer strengthens the quality of the final decision for governance and compliance purposes, guaranteeing compliance with privacy-by-design/privacy-by-default principles in a widespread and accepted form. For the sake of completeness, we report the results of a TOPSIS analysis as a benchmark against an existing baseline study, rather than as a step of the proposed workflow.
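The AHP step above can be sketched concisely: a pairwise-comparison matrix over the criteria yields priority weights, whose reliability is checked with Saaty's consistency ratio. This is a generic textbook sketch, not the paper's workflow; the geometric-mean approximation of the eigenvector and the example judgments below are our assumptions.

```python
import math

def ahp_weights(matrix):
    """Priority weights from a pairwise-comparison matrix, using the
    geometric-mean (row product) approximation of the principal
    eigenvector."""
    n = len(matrix)
    gm = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(gm)
    return [g / total for g in gm]

def consistency_ratio(matrix, weights, ri={3: 0.58, 4: 0.90, 5: 1.12}):
    """Saaty's consistency ratio; values below 0.10 are conventionally
    considered acceptable. ri: random-index table by matrix size."""
    n = len(matrix)
    aw = [sum(a * w for a, w in zip(row, weights)) for row in matrix]
    lam = sum(x / w for x, w in zip(aw, weights)) / n  # approx. lambda_max
    ci = (lam - n) / (n - 1)
    return ci / ri[n]
```

For instance, judging privacy moderately more important than cost and strongly more important than performance gives a matrix like `[[1, 3, 5], [1/3, 1, 2], [1/5, 1/2, 1]]`, whose weights rank privacy first, exactly the kind of explicit weighting that lets a costlier privacy-oriented alternative come out optimal.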
As AI-powered vision components are increasingly embedded in production software systems and services, system-level quality, reliability, diagnosability, and accountable decision-making require explanations that engineers can trust, debug, and operate at scale. Recent research on explainable computer vision has highlighted pixel-level feature attributions as a practical way to expose how visual evidence drives black-box predictions. Hierarchical Shapley explanations built on the Owen coalition formulation offer a principled, model-agnostic foundation, yet existing approaches typically rely on rigid, data-agnostic partitions that overlook the multiscale structure of images, leading to slow convergence and saliency maps that poorly follow true object morphology.
We review ShapBPT, a recently published data-aware attribution method that computes hierarchical Shapley coefficients over a Binary Partition Tree (BPT) tailored to the image being explained. By aligning the coalition hierarchy with intrinsic morphological cues and combining it with an adaptive Owen-style recursion, ShapBPT focuses the evaluation budget on semantically coherent regions, yielding crisper explanations while reducing computational overhead.
Experiments across multiple computer-vision tasks, datasets, and model families have shown that ShapBPT improves structural alignment and efficiency compared to existing explainers for understanding model decisions.
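ShapBPT's BPT hierarchy and adaptive Owen-style recursion are beyond a short sketch, but the quantity such methods approximate, the Shapley value of an image region, can be illustrated by exact enumeration over a handful of regions. The code below is a generic Shapley computation under a hypothetical value function, not the published algorithm; it scales exponentially, which is precisely why hierarchical schemes are needed.

```python
from itertools import combinations
from math import factorial

def shapley_values(regions, value):
    """Exact Shapley value of each region ("player"): the weighted
    average of its marginal contribution over all coalitions of the
    remaining regions. value(S) maps a set of regions to a model score."""
    n = len(regions)
    phi = {r: 0.0 for r in regions}
    for r in regions:
        others = [x for x in regions if x != r]
        for k in range(n):
            for coal in combinations(others, k):
                # Classic Shapley weight for a coalition of size k.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[r] += weight * (value(set(coal) | {r}) - value(set(coal)))
    return phi
```

For an additive value function the Shapley value of each region equals its individual contribution, a useful sanity check before applying any approximation scheme.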
In workflows with stochastic durations, the end-to-end (E2E) response time distribution is jointly determined by several factors, including the workflow topology, the response time distribution of elementary services, and their sensitivity to resource scaling. This complexity is further exacerbated when workflows are deployed on microservices architectures, where additional factors related to the orchestration infrastructure may affect the E2E response time.
In this paper, we apply a state-of-the-art resource provisioning method to a real container orchestration system. Specifically, the method coordinates resource allocation for elementary services by jointly considering the factors mentioned above. We consider workflows with randomly generated topologies and with elementary service durations drawn from a data set used in the literature. We implement these workflows as microservices and deploy them on Kubernetes using the proposed provisioning strategy. Experimental results demonstrate the effectiveness of the approach compared to alternative baselines, under both low and high workload conditions.
As autonomous UAVs face increasing risks from cyber-attacks and software aging, countermeasures are mandatory. This paper proposes software rejuvenation as a mitigation strategy against UAV attacks, providing a Generalized Stochastic Petri Net model and a tool to assess its effectiveness. By formalizing the stochastic competition between adversarial progression, represented by a Cyber Kill Chain attack process, and rejuvenation-based defense, we quantitatively evaluate the trade-offs between safety, availability, and security. Our analysis identifies optimal rejuvenation rates and policies that maximize system resilience while satisfying mission-critical constraints, providing a rigorous decision-making tool for securing resource-constrained edge nodes.
Machine learning (ML) models are increasingly deployed as components of production-grade systems, making the systematic assessment of their quality a central concern. While ML metrics have long played an important role in benchmarking ML models, they are also becoming increasingly important for productive operation. The selected metric not only influences predictive performance during development; it also has a vital impact on the capabilities and goodness of fit of the model for the productive use case. Although there is some awareness of the significance of metric selection, the choice is often made on an ad hoc basis and systematic selection guidance is missing. To address this research gap, we provide a structured analysis of ML metrics and their properties for classification and regression tasks. With this knowledge base, we propose a decision tree that aims to help both researchers and practitioners working with ML to identify the best-fitting metrics. Using this decision tree, we highlight practical implications in two case studies: spam filtering and machine failure prediction. Our decision tree is completely use-case agnostic and can be used by ML practitioners in a multitude of application areas.
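To make the decision-tree idea concrete, such guidance can be encoded as a small branching function over use-case properties. The tree below is a deliberately tiny illustration of the pattern, not the tree proposed in the paper; the branch conditions and recommended metrics are our assumptions.

```python
def suggest_metric(task, imbalanced=False, cost_sensitive=False,
                   outlier_sensitive=False):
    """Toy metric-selection decision tree: branch on task type and a
    few use-case properties, return a metric recommendation."""
    if task == "classification":
        if cost_sensitive:
            # Misclassification costs differ (e.g., failure prediction).
            return "cost-weighted F-beta"
        if imbalanced:
            # Rare positive class (e.g., spam): accuracy is misleading.
            return "precision/recall, PR-AUC"
        return "accuracy, ROC-AUC"
    if task == "regression":
        if outlier_sensitive:
            return "RMSE"  # squares errors, penalizes large deviations
        return "MAE"       # linear in error, robust to outliers
    raise ValueError("unknown task: " + task)
```

The value of the real tree lies in making choices like these explicit and reviewable instead of ad hoc; for the spam-filtering case study, the imbalanced-classification branch would fire.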
Agent-Based Models (ABMs) are invaluable for capturing heterogeneity in complex systems but typically suffer from severe scalability issues, where computational cost grows linearly or super-linearly with population size. To overcome this bottleneck without sacrificing stochastic exactness, we propose a novel Symbolic Macro-Agent paradigm. Drawing theoretical parallels to the folding of Colored Petri Nets, this approach aggregates indistinguishable entities into dynamic clusters, shifting the simulation logic from iterating over discrete individuals to operating on state-based quantities. By applying an adapted Gillespie Algorithm (SSA) directly to these symbolic aggregates, we effectively move the computational complexity from the total number of agents to the number of active states. To validate this theoretical contribution, we developed a prototype simulation engine in C++ designed to test the Macro-Agent abstraction. Experimental results confirm that this architecture has the potential to effectively reduce the computational effort typical of ABMs characterized by semi-homogeneous agents, offering a scalable alternative to traditional individual-based approaches.
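The key mechanism, running Gillespie's SSA on aggregate state counts rather than individuals, can be illustrated with a stock epidemic model. The sketch below is our own minimal stand-in (an SIR model with assumed rates), not the paper's C++ engine or its Macro-Agent clustering; it shows why per-step cost depends on the number of states/reactions, not the population size.

```python
import random

def gillespie_sir(s, i, r, beta, gamma, t_max, rng):
    """Gillespie SSA on aggregate counts for an SIR model.
    Two reactions: infection S+I -> 2I (rate beta*S*I/N) and
    recovery I -> R (rate gamma*I). Each step touches only the
    state counts, regardless of how many individuals they represent."""
    t, history = 0.0, [(0.0, s, i, r)]
    while t < t_max and i > 0:
        n = s + i + r
        rate_inf = beta * s * i / n
        rate_rec = gamma * i
        total = rate_inf + rate_rec
        t += rng.expovariate(total)      # exponential waiting time
        if rng.random() * total < rate_inf:
            s, i = s - 1, i + 1          # one infection event
        else:
            i, r = i - 1, r + 1          # one recovery event
        history.append((t, s, i, r))
    return history
```

Here the "macro-agents" are simply the three compartments; the folding idea generalizes this to dynamically formed clusters of indistinguishable agents while preserving stochastic exactness.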
Auto-scaling mechanisms aim to fully exploit the benefits of the cloud computing paradigm by adapting the allocated resources to the requirements of the workload being processed. The auto-scaler proposed in this paper focuses on scenarios where heterogeneous cloud resources can be provisioned. Notably, by leveraging the wide range of Virtual Machine types offered by cloud providers, our auto-scaler selects an appropriate mix of resources that provides the estimated processing capacity needed to cope with the predicted incoming workload while ensuring the desired utilization level of the resources. Simulation experiments clearly demonstrate the effectiveness of our approach.
In Industry 4.0, Digital Twins enable real-time monitoring and decision-making through a digital replica of physical systems. However, their effectiveness depends on the ability to timely mirror the corresponding system's behavior. While systems are subject to unbounded delays, decisions must often be taken within a strict time window. In critical contexts, waiting for accurate data is crucial, as decisions based on stale data could be dangerous. Conversely, in safer contexts, such as Industry 4.0, where machines are equipped with autonomous safety mechanisms, decisions could be made in a timely manner with a proper mechanism to adjust the Digital Twin state and enhance production line resilience. In this work, we propose a data-driven method to enhance Digital Twin synchronization in the presence of a delay bound, ensuring the timely availability of data when required to support more informed decisions. We validate the approach through an industrial use case featuring a packaging line and a KUKA LBR iiwa collaborative robot. This approach moves towards an ''Always In-Sync'' Digital Twin capable of proactive, trustworthy decision-making in dynamic industrial environments.
We are delighted to welcome participants to the 14th International Workshop on Load Testing and Benchmarking of Software Systems (LTB 2026). Held in conjunction with ICPE 2026, LTB continues its tradition of fostering discussion among researchers and practitioners interested in performance engineering, load testing methodologies, benchmarking practices, and system resilience under realistic workloads. The workshop provides a forum for exchanging ideas on emerging challenges in performance validation, benchmarking infrastructures, workload modeling, and performance evaluation of modern software systems, including cloud-native applications, distributed services, and AI-driven systems. LTB 2026 features a keynote address and a selected set of research and presentation-track contributions covering diverse aspects of performance engineering and benchmarking.
Jakarta EE application servers are critical middleware for enterprise systems, yet their performance can vary across implementations and versions. This paper presents a controlled performance-regression study comparing IBM OpenLiberty and Red Hat WildFly using the SPECjEnterprise 2018 Web Profile benchmark.
We evaluate four research questions: whether the servers differ in response time and memory utilization, how performance evolves across versions, whether relative performance depends on Jakarta EE technology choices (REST, JSF, or WebSocket), and whether sustained execution exposes memory leaks. We perform profiling runs using JDK Flight Recorder to characterize runtime behavior.
Results show that OpenLiberty achieves lower latency for JSF workloads (typically 2-3 ms faster), while REST operations exhibit comparable performance, except for a degradation in OpenLiberty on invalid-input validation (approximately 8x slower). Memory analysis reveals stable heap usage with no evidence of leaks, though OpenLiberty sustains a higher steady-state footprint.
Observability of software systems requires the collection of runtime data. The collection can be performed using either sampling or instrumentation. These two techniques have different properties with respect to collectible data, the exactness of change detection, and the overhead in terms of resource consumption. During software development, these techniques are used to detect performance changes between different commits or variants of the developed software. While prior comparisons of these techniques focused on the accuracy of the measurements, we focus on the exactness of change detection. This paper presents: (1) a benchmark for comparing the change detection exactness and overhead of sampling and instrumentation, and (2) an experimental evaluation of this benchmark on current hardware. In our evaluation, we find that the tracing overhead caused by instrumentation is 0.5 µs per method call in our setting, while sampling does not cause overhead that can be identified with statistical significance. For change detection, too, sampling proves much more effective. From these measurements, we conclude that developers should use instrumentation only if it is necessary to trace the full behavior of a system, e.g., if single REST requests need to be identified, or if it is possible to preselect a small set of methods to observe. If no preselection of methods is possible and performance changes of complex programs should be detected, sampling is the only feasible technique due to the high overhead of instrumentation.
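The per-call cost of instrumentation is easy to observe in any language: wrap a method so that each call records trace events, then compare against the unwrapped version. The sketch below is a generic illustration of that measurement pattern, not the paper's benchmark; function names and the workload are our assumptions, and absolute numbers will differ by runtime and hardware.

```python
import time

def instrumented(fn, trace):
    """Minimal instrumentation: record a timestamped enter/exit event
    per call, mimicking what a tracing agent must do."""
    def wrapper(*args, **kwargs):
        trace.append(("enter", fn.__name__, time.perf_counter()))
        result = fn(*args, **kwargs)
        trace.append(("exit", fn.__name__, time.perf_counter()))
        return result
    return wrapper

def measure_overhead(n=100_000):
    """Per-call instrumentation overhead (seconds), estimated by
    timing n plain calls vs. n instrumented calls of a trivial method."""
    def work(x):
        return x + 1
    t0 = time.perf_counter()
    for j in range(n):
        work(j)
    plain = time.perf_counter() - t0
    trace = []
    traced = instrumented(work, trace)
    t0 = time.perf_counter()
    for j in range(n):
        traced(j)
    with_trace = time.perf_counter() - t0
    return (with_trace - plain) / n
```

A sampling profiler, by contrast, interrupts the process at a fixed rate independent of call frequency, which is why its overhead stays flat while trace-based overhead grows with the number of method calls.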
Benchmarking of stream processing systems is challenging due to heterogeneous deployment settings and complex interaction effects among chained operators and operator buffers within streaming pipelines. Backpressure, or reactive rate throttling, occurs when an operator receives data faster than it can be processed, causing earlier operators to slow or stop as processing delays propagate upstream. In closed-loop stream benchmarks, backpressure can propagate to the benchmark generator, producing unreliable results.
This paper makes four contributions. First, we introduce a conceptual model of rate-throttling mechanisms in stream processing systems that clarifies the interaction between backpressure and benchmarking workloads. Second, we analyse the stream benchmarking literature to assess how explicitly backpressure effects are considered. Third, we empirically demonstrate how closed-loop benchmarks can propagate backpressure from the system under test (SUT) to the workload generator, distorting benchmark behaviour. Finally, we propose practical mechanisms for monitoring, mitigating, and moderating generator backpressure using system metrics and modifications to benchmark workload generation algorithms.
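The distinction between closed-loop and open-loop generators under backpressure can be illustrated with a toy discrete-time pipeline: a fast generator feeding a bounded operator buffer that drains more slowly. This is our own illustrative model of the general mechanism, not the paper's conceptual model or any real streaming engine; rates and capacities are arbitrary.

```python
from collections import deque

def simulate(steps, gen_rate, op_rate, buf_cap, closed_loop):
    """Discrete-time backpressure sketch. Each step, the generator
    offers gen_rate items into a bounded buffer and the operator
    drains op_rate items. A closed-loop generator stalls when the
    buffer is full (backpressure propagates to it); an open-loop
    generator keeps offering and we count the rejected items."""
    buf = deque()
    produced = dropped = processed = 0
    for _ in range(steps):
        for _ in range(gen_rate):
            if len(buf) < buf_cap:
                buf.append(1)
                produced += 1
            elif closed_loop:
                break            # generator throttled by backpressure
            else:
                dropped += 1     # open loop: offered load is rejected
        for _ in range(op_rate):
            if buf:
                buf.popleft()
                processed += 1
    return produced, dropped, processed
```

In the closed-loop case the measured "generator throughput" silently degrades to the operator's drain rate, which is exactly how backpressure propagating into a benchmark's load generator distorts results; the open-loop variant keeps the offered load visible as drops.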
Artificial Intelligence (AI) has been widely adopted in mainstream domains, yet its role in performance evaluation and modeling remains under-explored. Traditional AI tools are often used as black-box solutions not tailored to performance engineering, leading to models that demand extensive time, data, and expert interpretation. Simultaneously, the rise of Large Language Models (LLMs) has brought new challenges in terms of infrastructure cost, energy usage, and specialized skills required. For instance, pre-training GPT-3 (the model behind ChatGPT) reportedly consumed around 1,287,000 kWh of electricity, generating a notable carbon footprint and incurring high hardware expenses. These considerations underscore the urgent need to develop systematic performance engineering approaches that balance efficiency, scalability, and sustainability. This workshop aims to bridge this gap by convening researchers and industry practitioners to share techniques and insights on applying AI methods (including specialized or explainable AI approaches) for performance engineering of LLMs and similar large-scale systems. The objective is to identify best practices, new tools, and open research directions that facilitate optimized performance while reducing resource consumption.
AI systems increasingly rely on multi-step agentic workflows that orchestrate heterogeneous computational agents, including Large and Small Language Models (LLMs and SLMs), each configurable at runtime with different inference methods and parameters. While this flexibility enables specialization and resource efficiency, it introduces complex trade-offs between accuracy, latency, and energy consumption, particularly under strict operational budgets. The sequential nature of these workflows further amplifies such challenges: errors propagate across steps, and resource decisions made early in the pipeline constrain downstream options. To address these challenges, this paper presents a formal optimization framework for configuring multi-step agentic pipelines under global time and energy constraints. Given a task decomposition and a heterogeneous pool of agent configurations, the framework selects exactly one agent–inference method–parameter triple per task to maximize end-to-end workflow quality. We formalize this as a Mixed-Integer Linear Programming problem and introduce two complementary objective formulations: a max-min accuracy objective that prioritizes robustness by improving the least accurate step, and a multiplicative accuracy objective that captures cumulative performance across the workflow. We further propose an iterative solution strategy that re-optimizes at each step, adapting to deviations between predicted and realized resource consumption. Experimental results demonstrate distinct accuracy-resource trade-offs between the two formulations and show that the approach scales effectively to workflows with up to a thousand steps.
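The difference between the two objectives can be illustrated with a brute-force sketch on a tiny workflow (the MILP solver is replaced by exhaustive enumeration here, and all task names, agents, accuracies, and budgets are invented for illustration): the max-min objective lifts the weakest step, while the multiplicative objective maximizes cumulative end-to-end accuracy, and the two can pick different plans under the same budgets.

```python
from itertools import product

# Hypothetical three-step workflow; each step offers candidate
# (agent-method label, accuracy, time cost, energy cost) configurations.
configs = {
    "parse":  [("slm-greedy", 0.80, 1, 2),  ("llm-greedy", 0.98, 4, 10)],
    "reason": [("slm-cot",    0.75, 2, 3),  ("llm-cot",    0.90, 6, 14)],
    "answer": [("slm-greedy", 0.85, 1, 2),  ("llm-sample", 0.97, 5, 12)],
}
TIME_BUDGET, ENERGY_BUDGET = 10, 25  # global constraints

def best(objective):
    """Pick one configuration per step, maximizing `objective` over
    the per-step accuracies, subject to global time/energy budgets."""
    best_plan, best_val = None, -1.0
    for plan in product(*configs.values()):
        t = sum(c[2] for c in plan)
        e = sum(c[3] for c in plan)
        if t > TIME_BUDGET or e > ENERGY_BUDGET:
            continue  # infeasible under the global budgets
        val = objective([c[1] for c in plan])
        if val > best_val:
            best_plan, best_val = plan, val
    return [c[0] for c in best_plan], best_val

def prod_acc(accs):
    p = 1.0
    for a in accs:
        p *= a  # cumulative quality across sequential steps
    return p

print(best(min))       # max-min: robustness via the weakest step
print(best(prod_acc))  # multiplicative: end-to-end accuracy
```

On these numbers the max-min objective spends budget to strengthen the weakest step (choosing `llm-cot` for reasoning), whereas the multiplicative objective prefers a plan with a higher accuracy product but a lower minimum.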
Performance profiling is essential for software optimization, yet integrating profiling data with large language models (LLMs) presents significant challenges due to context limits and representation choices. We present a systematic comparison of five profiling data representations: raw text, summarized text, text-as-image, flamegraph, and DOT graph, across six real-world workloads using two multimodal LLMs (Qwen3-VL and GPT-4o). Experiments reveal that raw profiles frequently exceed context limits (67% failure rate), making compression essential. Among viable representations, visual formats achieve 60-200× compression with constant token cost regardless of profile complexity. Crucially, our accuracy analysis shows that DOT graphs achieve the highest and most consistent accuracy (67% on both models), while flamegraphs are model-dependent (67% on Qwen3-VL but only 33% on GPT-4o). Text-based formats show moderate to poor accuracy (33-50%). These findings demonstrate that effective LLM-based performance analysis requires careful consideration of both representation format and model characteristics. Additionally, we release torch2pprof, an open-source tool for converting PyTorch Profiler traces to pprof format.
Anomaly detection is widely used in software performance engineering to identify performance degradations in production systems and to automatically trigger alerts. Over the years, several algorithms have been employed for this task, ranging from recurrent neural networks to traditional statistical models. Recently, a new class of algorithms, known as time series foundation models, has demonstrated remarkable effectiveness across a variety of time series analysis tasks. Despite these promising results, there is still limited understanding of how such models behave in the context of software anomaly detection. In this paper, we provide preliminary empirical insights into the effectiveness of two time series foundation models, Chronos and TSPulse, evaluated on two publicly available software anomaly detection datasets, AIOps and MSCloud. Our results show that foundation models achieve performance comparable to four representative time series anomaly detection (TSAD) baselines, with the advantage of a zero-shot setup that eliminates training stages and training data.
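A common zero-shot TSAD recipe is to score each point by the residual between the observed value and a model's forecast. The sketch below illustrates only that recipe: a plain moving average stands in for the foundation model's forecast, and the latency series is invented; the actual Chronos and TSPulse interfaces are not shown here.

```python
# Generic forecast-residual anomaly scoring. In the zero-shot setting,
# a pretrained foundation model would produce `forecast` directly, with
# no training stage; a moving average is used as a stand-in.
def residual_scores(series, window=5):
    scores = []
    for i in range(window, len(series)):
        forecast = sum(series[i - window:i]) / window
        scores.append(abs(series[i] - forecast))  # large residual = anomaly
    return scores

latency = [10, 11, 10, 12, 11, 10, 11, 55, 12, 11]  # injected spike at index 7
scores = residual_scores(latency, window=5)
print(scores.index(max(scores)) + 5)  # -> 7, the spike's index
```

Thresholding these residual scores then yields the alerting decision; the appeal of foundation models is that the forecaster requires no task-specific training data.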
Provisioning language models as inference services poses significant challenges for cloud providers, who must balance service-level objectives (SLOs) with capital and operational costs. The heterogeneous nature of user queries, combined with temporal variations in workload patterns and the high energy consumption of GPUs, warrants an intelligent resource-allocation strategy. We present OASIS (Optimal Allocation Strategy for Inference Services), a comprehensive methodology that combines static and dynamic optimizations to provision inference services efficiently. OASIS employs a two-phase approach: (1) pre-deployment profiling to determine optimal parameter configurations for different query types across various hardware options, and (2) run-time query routing and dynamic provisioning based on observed workload patterns. We evaluate OASIS using real-world BurstGPT traces on NVIDIA A100 and H100 GPUs. Our results show that: (1) GPU frequency scaling from 1320 MHz to 855 MHz reduces power by 39% with only 7% throughput loss; (2) query type classification reveals long-input-short-output workloads achieve 2.1× higher throughput than short-input-long-output; (3) MIG-based multi-tenancy achieves 42-43% power reduction when running two models simultaneously; and (4) runtime adaptation on real campus traces achieves statistically significant 12.4% power reduction while maintaining SLO compliance and identical throughput.
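The frequency-scaling result above implies an even larger gain in energy per served request, which a back-of-the-envelope calculation from the reported percentages makes explicit (only the abstract's -39% power and -7% throughput figures are used):

```python
# Energy per request scales as power / throughput, so a 39% power
# reduction at 7% throughput loss cuts per-request energy by about 34%.
power_ratio = 1 - 0.39        # power at 855 MHz relative to 1320 MHz
throughput_ratio = 1 - 0.07   # throughput at 855 MHz relative to 1320 MHz
energy_per_request_ratio = power_ratio / throughput_ratio
print(round(1 - energy_per_request_ratio, 3))  # ≈ 0.344, i.e. ~34% less energy/request
```

This kind of ratio is what makes frequency scaling attractive when SLOs leave throughput headroom.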