ICPE '26: Proceedings of the 17th ACM/SPEC International Conference on Performance Engineering

SESSION: Keynote Talks

Systems for AI: Predicting Performance of Machine Learning Workloads

Deep learning is enabled by new training techniques, large datasets, enormous compute power, and easy-to-use ML frameworks. We focus on approaches to predicting the performance of ML workloads in order to facilitate resource management, neural architecture search, and efficient training and inference, with the aim of supporting the efficient development of deep learning models and their novel applications.

Illuminating Multi-GPU Data Movement

GPUs have become the accelerators of choice for HPC and machine learning applications, thanks to their massive parallelism and high memory bandwidth. However, as GPU counts per node and across clusters continue to grow, inter-GPU communication has emerged as a major scalability bottleneck, requiring debugging and profiling tool support. In this talk, I will provide an overview of GPU-centric communication within and across compute nodes, highlighting vendor mechanisms and existing tool support for profiling multi-GPU communication. I will then discuss research challenges and opportunities in developing modern profiling tools. Finally, I will emphasize the need for more user-friendly profiling and analysis frameworks that illuminate communication data paths and help developers better understand and optimize data movement across networks. I will conclude by outlining ongoing efforts in our research group to address these challenges and advance the state of the art.

SESSION: Session 1: Heterogeneous Architecture Performance

Improving Energy Efficiency and Performance of Weather and Climate Simulations by Leveraging the Heterogeneity of Modern Systems

The increasing need for higher resolution and greater accuracy in weather forecasts and climate simulations continues to drive software and hardware developments in high-performance computing (HPC) systems. To achieve increasingly faster simulations, the adoption of hardware accelerators has proven highly effective in recent years. However, these HPC codes exhibit, in parts, divergent memory access and computation patterns. Hence, their performance benefits strongly depend on how well the software's computational characteristics align with the underlying hardware architecture. While an accelerator may be well-suited for some code regions, other architectures may achieve better performance and energy efficiency elsewhere. We therefore propose a highly heterogeneous setup for a performant and energy-efficient execution of complex HPC codes such as those used in the weather and climate domain, incorporating a variety of processor and accelerator architectures.

Using the climate and weather model ICON, we explore the potential of mapping components onto compute architectures. We describe the design of a highly heterogeneous test cluster and discuss its implications for middleware and software management. Furthermore, we present our comprehensive energy measurement infrastructure which allows for comparisons with water-cooled production systems. Leveraging this, we demonstrate that a heterogeneous cluster can achieve up to 40% higher energy efficiency compared to a traditional homogeneous configuration.

Cross-Platform, Cross-Framework Development of Hybrid-Parallel Matrix-Multiplication Codes

Matrix-multiplication is an important kernel in domains ranging from machine learning to high-performance computing. Developers devote significant time and effort to optimizing the matrix-multiplication kernel. In this paper, we simplify the development of optimized matrix-multiplication codes for various platforms and heterogeneous systems. We target codes for CPUs on x86 (AVX, AVX2) and ARM (Neon) platforms, as well as Nvidia GPUs and Jetson Nano. To achieve this, we employ a tool to generate a novel, hybrid-parallel, implementation of matrix-multiplication that exploits parallelism within a core, across cores, across nodes, and across GPU devices.

We create a specification of the recursive matrix-multiplication algorithm and feed it to the tool, which emits the implementation with an empty recursion base case. The base case is then filled in (manually) with pre-existing hand-optimized codes using AVX, AVX2, ARM Neon, OpenCilk, OMP, and BLAS APIs for CPUs and GPUs. Experiments with different programming models (OpenCilk, OpenMP, library-based) and multiple floating-point precisions on CPU-based, GPU-based, and Jetson Nano systems show that: (i) the hybrid-parallel multi-GPU codes achieve up to 5.5× higher energy efficiency and execute at up to 1.5× lower peak temperatures compared to the cuBLAS library-based GPU implementations; furthermore, for double-precision data, the hybrid multi-GPU code on the RTX5000 achieves better performance than the cuBLAS implementation; (ii) the recursive CPU implementations are faster, more energy-efficient, and also execute at lower peak temperatures compared to iterative CPU codes. We introduce a cost-of-ownership metric to analyse hybrid-parallel code in a multi-platform, multi-framework setting and show that executions on Jetson Nano are the most cost-effective.
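The recursive structure the tool generates can be sketched as follows. This is an illustrative Python rendering, not the tool's actual output: the cutoff value is an assumption, power-of-two matrix sizes are assumed, and a plain triple loop stands in for the hand-optimized base-case kernels.

```python
CUTOFF = 2  # assumed recursion cutoff; real kernels would use a larger tile


def rec_mm(A, B, C, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0):
    """Accumulate C += A @ B on n x n tiles addressed by index offsets."""
    if n <= CUTOFF:
        # Base case: stand-in for the hand-optimized AVX/Neon/BLAS kernels
        # that the paper plugs into the empty recursion base case.
        for i in range(n):
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    # Eight recursive quadrant products; the independent (i, j) quadrants
    # expose the task parallelism that OpenCilk/OpenMP versions exploit.
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                rec_mm(A, B, C, h,
                       ai + i * h, aj + k * h,
                       bi + k * h, bj + j * h,
                       ci + i * h, cj + j * h)
```

Because the recursion only manipulates index offsets, swapping the base case for a platform-specific kernel leaves the divide-and-conquer skeleton untouched, which is the portability property the paper relies on.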

SESSION: Session 2: Cloud Systems and Resource Efficiency

CarbonShare: Carbon-Fair Allocation for Shared Clusters

Computing's energy demand, and thus its carbon emissions, are rapidly accelerating with the emergence of a wide range of useful, but computationally-intensive, AI-driven applications. At the same time, computing, along with the rest of society, must rapidly reduce its emissions to avoid the worst consequences of climate change, e.g., population displacement, agricultural collapse, mass extinction, extreme weather, etc. To do so, computing and other industries will eventually need to limit their carbon emissions. Such a limit imposes a new allocation problem: how should datacenters allocate their limited carbon emissions across multiple applications?

To address the problem, we introduce the notion of carbon fairness, which enables clusters to enforce a limit on their aggregate carbon emissions while fairly allocating them to applications. As we show, enforcing carbon fairness has multiple desirable properties: it fairly distributes any performance penalty from a cluster-wide carbon cap across all applications; it incentivizes applications to optimize their energy- and carbon-efficiency; and it is scheduler agnostic by simply limiting the usage of allocated resources. We implement and evaluate a range of carbon fairness policies in LXD, and show i) how these policies differ from enforcing resource and energy fairness and ii) the importance of enabling flexible policies, which permit periodic bursts above carbon limits, to maintain performance despite limits on carbon. For example, we show that our most flexible footprint-fair policy outperforms a stricter rate-fair policy on a large-scale industry workload by up to 30%.

Kill Smart, Run Fast: Using Job Termination for Resource Efficiency in Data Centers

For the last few years, job scheduling in large data centers has attracted considerable research interest, especially because of the growing scale of these infrastructures and the drastic change in workload characteristics introduced by AI jobs. The high variability in resource demands (in terms of memory, number of cores, GPUs, etc.) makes scheduler design extremely complicated, and the maximum achievable throughput is limited relative to the amount of available resources. This is due to head-of-line (HOL) blocking, i.e., the scheduler has to wait for some jobs to terminate to make space for a job demanding many resources. In this paper, we study a killing policy that serves jobs in a nearly First-Come First-Served (FCFS) order, i.e., the scheduler preempts jobs in service to make space for a large job. To make the analysis realistic, we assume that preempted jobs must be restarted from scratch once they re-enter service. Our results are summarized as follows: (i) we show that the policy can achieve higher throughput than standard FCFS despite the computational resources wasted by restarts; (ii) we study a queueing model that captures the trade-off among throughput, fairness, and energy waste in a simplified scenario; and (iii) using discrete-event simulation on workloads with characteristics derived from Google's publicly released traces of the Borg scheduler, we show that the killing policy yields higher throughput and lower expected job response times than FCFS.

Understanding Foundational Library Energy Consumption

The energy consumption of a foundational library providing core functionality for programs--the C standard library--is considered here. The most common realization of this library in the Linux world, glibc, is studied against two popular alternatives, musl and uClibc (via its currently active fork, uClibc-ng). The study utilizes the popular CoreMark benchmark, which tests core CPU functionality in code exercising common, basic mechanisms. By considering metrics for power usage, time, and computational productivity in conjunction with the computational strategies and runtime conditions exhibited while running CoreMark, this study delivers guidance for achieving computation goals related to computation time and energy consumption. Tradeoffs between the two, including marginal energy costs and time benefits or computational performance improvements, are also explored. As musl and uClibc are also known for being parsimonious in memory usage, memory-related considerations are examined as well. Results for a wide array of experiments are reported.

The Impact of Memory Configuration on Server Efficiency

The SPEC SERT suite is the industry-standard benchmark for evaluating server energy efficiency and is widely adopted in government regulations and certification programs. Current certification rules require every CPU memory channel to be populated with at least one Dual In-line Memory Module (DIMM). In real-world deployments, however, servers are sometimes configured with fewer DIMMs, leaving some channels unpopulated. This discrepancy introduces a significant gap between certified efficiency scores and the actual efficiency of deployed systems. To address this issue, there are ongoing discussions about certifying servers with partially populated memory channels. However, the impact of memory configuration on SPEC SERT results has not been systematically studied. In this paper, we present a comprehensive analysis of how different memory configurations affect performance, power consumption, and energy efficiency on two state-of-the-art server systems using the SPEC SERT 2 suite. We vary both the number of populated channels and the type of DIMMs. Our experimental results show that server efficiency scores can be up to 3.4 times lower with only one channel populated and still up to 1.3 times lower with half the channels, compared to fully populated configurations. Detailed analysis of individual SPEC SERT 2 worklets reveals that some CPU worklets are highly sensitive to memory bandwidth, and that the impact of memory configuration is dependent on both server architecture and workload intensity. These findings underscore the need to reconsider certification criteria and highlight the importance of memory configuration for accurate energy efficiency assessment.

On the Efficiency and Disruption Trade-Offs of Kubernetes Packing Heuristics

Kubernetes addresses load changes by provisioning and disrupting nodes. Tools such as Cluster Autoscaler and Karpenter implement heuristics that decide when to add or remove nodes and how to pack and repack pods to reduce the size of the infrastructure. Despite its popularity, users report that consolidation is often too conservative, leading to wasted resources. This paper quantifies the trade-off between consolidation efficiency and disruption in production Kubernetes clusters. Using one week of traces from VTEX, a large e-commerce platform, we replay pod lifecycles in a trace-driven simulator that models pod placement as a two-dimensional bin packing problem over CPU and memory. We evaluate First Fit Decreasing and Best Fit Decreasing with a threshold-based repacking policy that migrates pods from underutilized nodes. Results show that average CPU utilization improves from 77% to 88--90% as node occupancy thresholds increase, but disruption---measured as the ratio of pod migrations to pod insertions---grows rapidly beyond moderate thresholds. Thresholds around 60--70% achieve near-peak utilization with disruption ratios close to one migration per insertion, while higher thresholds yield negligible additional utilization and sharply higher disruption. We show that a node occupancy threshold acts as a control knob for practitioners to tune consolidation aggressiveness and bound disruption in production clusters.

SESSION: Session 3: AI and LLM Performance

SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference

Large Language Model (LLM) inference is central to modern AI applications and dominates worldwide datacenter workloads, making it critical to predict its energy footprint. Existing approaches estimate energy consumption as a simple linear function of input and output sequence lengths. However, by analyzing the autoregressive structure of Transformers, which implies a fundamentally non-linear relationship between sequence lengths and energy consumption, we demonstrate the existence of a generation energy minimum. Peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs. Consequently, we propose SweetSpot, an analytical model derived from the computational and memory-access complexity of the Transformer architecture, which accurately characterizes the efficiency curve as a function of input and output lengths. To assess accuracy, we measure energy consumption using TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite. We test input and output lengths from 64 to 4096 tokens and achieve a mean MAPE of 1.79%. Our results show that aligning sequence lengths with these efficiency ''sweet spots'' reduces energy usage by up to 33.41×, enabling informed truncation, summarization, and adaptive generation strategies in production systems.
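The shape of such an efficiency curve can be illustrated with a toy cost model (purely an assumption for illustration; the actual SweetSpot formulation is derived in the paper): a fixed per-request energy amortizes over more output tokens, while the per-token decode cost grows with context length, so energy per generated token has a minimum at medium output lengths.

```python
# All constants below are hypothetical, chosen only to show the curve shape.
E_FIXED = 5.0      # assumed fixed energy per request (J)
E_PREFILL = 0.002  # assumed prefill energy coefficient (J per input token^2)
E_DECODE = 0.05    # assumed base decode energy per output token (J)
CTX_SCALE = 4096   # assumed context length at which decode cost doubles


def energy_per_token(n_in, n_out):
    """Toy energy per generated token as a function of sequence lengths."""
    prefill = E_PREFILL * n_in * n_in           # attention-like prefill cost
    decode = sum(E_DECODE * (1 + (n_in + t) / CTX_SCALE)
                 for t in range(n_out))          # context grows each step
    return (E_FIXED + prefill + decode) / n_out
```

Very short outputs pay the fixed and prefill costs over few tokens, while very long outputs accumulate ever-costlier decode steps; medium-length outputs sit at the "sweet spot" between the two regimes.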

B-Perf: Black-box Performance Antipattern Detection Using System-level Execution Tracing

Performance antipatterns capture recurring behaviours that degrade software efficiency. Black-box approaches aim to detect such issues without modifying the application. This paper presents B-Perf, a system-level black-box method that reconstructs execution, memory, and messaging behaviour from kernel-level traces. By analysing scheduling, allocation, and communication events, B-Perf derives workload-dependent behavioural trends and reports antipattern indicators grounded in resource usage and contention. To handle large trace volumes, the approach follows a pipeline of workload generation, event gathering, trace handling, and antipattern inference.

We evaluate B-Perf on three representative antipatterns: One Lane Bridge, Empty Semi Trucks, and Excessive Dynamic Allocation, and apply it to traces from real multi-threaded applications. The results show that system-level events are often sufficient to expose bottlenecks linked to resource contention and system-level interactions. A key limitation is that kernel traces provide limited visibility into fine-grained in-process behaviour. When performance issues are driven by internal logic or function-level interactions, B-Perf may capture only indirect symptoms and may not reveal the full root cause. Within this scope, B-Perf provides practical and efficient black-box detection for antipatterns driven by resource interaction and competition.

ORION: Integrated Runtime Modelling for Predicting Deep Learning Training Time

Training deep learning models, especially Transformer-based and large-scale architectures, entails significant computational requirements and relies on precise coordination between accelerator execution and host-side data delivery. While prior performance prediction efforts have primarily focused on GPU compute modelling, the influence of the input pipeline, including CPU parallelism, data preprocessing, and storage I/O throughput, remains far less understood. This paper introduces ORION, an integrated runtime modelling framework that predicts iteration-level neural network training time by jointly characterizing GPU computation and host-side resource behaviour. ORION incorporates empirical measurements of data loading latency, CPU preprocessing scalability, and disk throughput into a unified analytical formulation that distinguishes compute-bound and I/O-bound execution regimes. By modelling the interaction between data ingestion and accelerator execution, ORION reveals critical host-induced bottlenecks and quantifies their contribution to end-to-end training latency. Across diverse vCPU counts, storage tiers, and neural network architectures, ORION achieves an average reduction of 44.36% in prediction error compared to the state-of-the-art GPU-centric baseline. Overall, ORION enables accurate, hardware-aware training time prediction and provides practical guidance for selecting balanced system configurations in modern deep learning environments.

SwiftSNNI: Optimized Scheduling for Secure Neural Network Inference (SNNI) on Multi-Core Systems

Secure Neural Network Inference (SNNI) enables privacy-preserving inference on encrypted data with strong cryptographic guarantees. However, practical deployments suffer from high preprocessing overhead, significant communication costs, and sequential execution. These limitations lead to low throughput, underutilized system resources, long queueing delays, and poor scalability. This work introduces SwiftSNNI, a unified, resource-aware scheduling framework for SNNI. It implements a hybrid offline–online strategy that orchestrates offline preprocessing (Tpre,i) and online inference (Ton,i) jobs to maximize parallelism. By formulating SNNI scheduling as a constrained optimization problem, SwiftSNNI overlaps Tpre,i phase execution of future requests with active Ton,j jobs. SwiftSNNI also incorporates optional advance notices to enable proactive Tpre,i, which further reduces average input delay (D). Evaluations using five benchmark neural networks (M1, M2, HiNet, AlexNet, VGG-16) under diverse workloads and stochastic arrival rates confirm substantial performance gains. Compared to a parallelized sequential baseline (MS-SHARK), SwiftSNNI achieves up to 97% lower average input delay (D), an 81% reduction in makespan (≈5.4× speedup), and delivers a 5.6× increase in throughput. Furthermore, SwiftSNNI reduces average waiting time (W) by over 99%, demonstrating robust starvation prevention for high-concurrency workloads. SwiftSNNI supports concurrent execution, scales to larger neural networks, and provides an efficient runtime for SNNI deployments. The SwiftSNNI implementation is available online.

Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization

As Generative AI (GenAI), particularly inference, rapidly emerges as a dominant workload category, the Kubernetes ecosystem is proactively evolving to natively support its unique demands. This industry paper demonstrates how emerging Kubernetes-native projects can be combined to deliver the benefits of container orchestration, such as scalability and resource efficiency, to complex AI workflows. We implement and evaluate an illustrative, multi-stage use case consisting of automatic speech recognition and summarization. First, we address batch inference by using Kueue to manage jobs that transcribe audio files with Whisper models and Dynamic Accelerator Slicer (DAS) to increase parallel job execution. Second, we address a discrete online inference scenario by feeding the transcripts to a Large Language Model for summarization hosted using llm-d, a novel solution utilizing the recent developments around the Kubernetes Gateway API Inference Extension (GAIE) for optimized routing of inference requests. Our findings illustrate that these complementary components (Kueue, DAS, and GAIE) form a cohesive, high-performance platform, proving Kubernetes' capability to serve as a unified foundation for demanding GenAI workloads: Kueue reduced total makespan by up to 15%; DAS shortened mean job completion time by 36%; and GAIE working in conjunction with llm-d improved tail Time to First Token latency by up to 90% even under high loads.

SESSION: Session 4: System Observability and Latency

Benchmarking the Overhead of Distributed Tracing Agents

Tracing is a fundamental technique for analyzing the runtime behavior of software systems. By recording the start and end times of method executions together with contextual metadata, tracing enables detailed performance analysis, architecture reconstruction, and program comprehension. However, such instrumentation inevitably introduces runtime overhead that can distort performance measurements and increase variability. Quantifying and comparing this overhead across tracing frameworks and configurations is therefore essential for selecting suitable tools and ensuring reliable performance evaluations.

The overhead of different tracing frameworks and their configurations can be measured with the MooBench microbenchmark. In this work, we extend the MooBench microbenchmark to support the established Java tracing frameworks Elastic APM Agent, inspectIT, Kieker, OpenTelemetry, Pinpoint, Scouter, and SkyWalking. By executing MooBench with these agents, we find (1) significant differences in performance overhead, whereby the industry-standard OpenTelemetry implementation is comparatively slow, while the Kieker agent has the lowest overhead among the functionally correct frameworks, (2) that the agents of Pinpoint and Scouter do not store all records, so their behavior does not fulfill the functional requirements, and (3) that some frameworks incur avoidable overhead from extensive metadata gathering and needless copying of data.

Modeling Extreme End-to-End Delays for Availability Assessment on Latency Datasets

Mission-critical applications depend on networks that consistently meet strict end-to-end (e2e) latency bounds. In these systems, a network becomes effectively unavailable whenever delay exceeds the required deadline, making availability a question of delay compliance rather than simple uptime. This paper proposes a methodology for assessing delay-based availability using Extreme Value Theory (EVT). The contribution lies in the systematic integration and automation of established EVT techniques to enable reproducible and diagnostically validated tail analysis of latency data. The approach includes algorithmic selection of thresholds and block sizes, formal validation of approximate independence through declustering, and stability diagnostics for reliable tail modeling. We demonstrate the methodology on a publicly available latency dataset from a commercial 5G non-standalone (NSA) network, used strictly as a case-study example. Results show consistent tail-index estimates across both EVT methods, enabling extrapolation of delay-violation probabilities under conditions where EVT assumptions are verified. The proposed framework provides a principled foundation for evaluating delay-based availability in communication systems where rare extreme delays dominate reliability and sufficient measurement data are available.
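The tail-index estimation at the heart of such EVT analyses can be illustrated with the classical Hill estimator over the k largest order statistics. This is a sketch of the general technique, not the paper's validated pipeline; the Pareto sample generation and parameter choices below are assumptions for demonstration.

```python
import math


def hill_estimator(samples, k):
    """Hill estimator of the tail index from the k largest observations.

    For heavy-tailed data with P(X > x) ~ x^(-alpha), the mean log-excess
    over the (k+1)-th largest observation estimates 1/alpha.
    """
    x = sorted(samples, reverse=True)
    assert 0 < k < len(x) and x[k] > 0
    h = sum(math.log(x[i] / x[k]) for i in range(k)) / k
    return 1.0 / h  # estimated tail index alpha
```

A heavier tail (smaller alpha) means extreme delays decay more slowly, which is why a stable tail-index estimate is the prerequisite for extrapolating delay-violation probabilities beyond the observed data.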

SESSION: Session 5: Benchmarking and Profiling Infrastructure

benchkit: A Declarative Framework for Composable Performance Evaluation of System Software

Performance-evaluation pipelines in systems research often combine benchmarks, system configuration steps, profiling tools, and analysis scripts. In practice, these components are glued together with ad-hoc shell scripts, notebooks, and bespoke tooling, making experiment dimensions difficult to explore systematically and results hard to reproduce or extend. We present benchkit, a lightweight Python library that provides a structured way to express performance experiments declaratively and to automate their full lifecycle—from build and execution to system configuration, profiling, and result collection. Instead of relying on monolithic scripts, benchkit provides a structured way to compose existing system tools (e.g., CPU-placement utilities, frequency controllers, and performance profilers) while keeping benchmark code untouched. We illustrate benchkit through two representative studies: (1) a drilldown of performance anomalies in SPEC CPU workloads on hybrid-core x86 processors, enabled by systematic exploration of CPU placement policies; and (2) an analysis of lock implementations and scheduling strategies on a many-core ARM server, where benchkit coordinates system tools and visualizations to interpret performance differences. We evaluate the overhead of benchkit and show that it introduces no measurable cost compared to hand-written shell workflows, both on the host and inside containers. These results show that benchkit provides a reproducible, extensible, and principled foundation for system-level performance experimentation.

Performance and Cost Implications of Migrating Serverless Functions from x86 to ARM based Servers

Serverless computing allows users to run code without managing servers, offering automatic scaling and pay-as-you-go pricing. AWS Lambda, the first major commercial serverless platform to offer ARM support alongside traditional x86, enables a direct comparison of the two architectures. In this study, we first analyze the architectural differences between ARM and x86 to understand potential performance and cost implications. We then empirically evaluate both architectures using a set of real-world serverless benchmarks, measuring execution time, response time, peak memory usage, cold start times, total cost, and performance-to-cost ratio. Our results show that ARM consistently delivers lower cost across all workloads and memory configurations, with savings of up to 30%, while performance differences vary by workload and memory allocation. Based on our findings, we provide recommendations for users who consider migrating their AWS Lambda functions from x86 to the ARM architecture.

SESSION: Session 6: Performance Modeling and Optimization of Complex Systems

Variability-Guided Performance Optimization

The past few decades have seen software and hardware growing more heterogeneous and layered in abstractions. This trend produced many benefits for hiding complexity and increasing efficiency and modularity. But it also makes reasoning about performance and identifying its underlying factors more challenging because of the presence of performance variability. Moreover, performance variability can prevent synchronous applications from scaling and server applications from meeting service-level agreements.

In this paper, we present a variability-guided optimization (VGO) workflow that leverages information in performance distributions to optimize performance, and more importantly, to reduce performance variability. It works by uncovering software, hardware, and compiler factors associated with specific aspects of variability and then suggesting measures for reducing it. Our experimental evaluation on a set of CPU and GPU benchmarks and applications shows that our tool successfully reduces the standard deviation and coefficient of variation of application run times to 0.374× and 0.444× of their original values, while also reducing mean run time to 0.843×. This technique enables tuning applications and their environments, improving their performance and predictability.
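The variability metrics used above can be made concrete with a small worked example (the run times below are hypothetical, not the paper's data): the coefficient of variation normalizes the standard deviation by the mean, and the reduction factors are simply ratios of the optimized to the baseline metrics.

```python
import statistics


def cv(run_times):
    """Coefficient of variation: population std dev divided by the mean."""
    return statistics.pstdev(run_times) / statistics.fmean(run_times)


# Hypothetical repeated run times (seconds) before and after tuning.
baseline = [10.0, 12.0, 11.0, 15.0, 10.5]
optimized = [9.0, 9.4, 9.2, 9.8, 9.1]

# Reduction factors: a value below 1.0 means the metric improved.
std_factor = statistics.pstdev(optimized) / statistics.pstdev(baseline)
cv_factor = cv(optimized) / cv(baseline)
mean_factor = statistics.fmean(optimized) / statistics.fmean(baseline)
```

Reporting the CV factor alongside the raw standard deviation matters because a tool that only shortens run times would shrink the standard deviation without making performance any more predictable relative to the mean.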

Energy-Efficient Right-Sizing of Kafka-like Message Brokers for IoT Workloads

IoT data pipelines rely on message brokers, such as Apache Kafka and Redpanda, for continuous telemetry ingestion. When it comes to capacity planning of these systems, the absence of clear sizing guidance often leads to conservative over-provisioning and unnecessary energy use. We present a calibration-based methodology for energy-efficient right-sizing of Kafka-compatible clusters for IoT ingest. Using a small set of initial experiments on 3--4 nodes, we fit a performance model that predicts maximum sustainable throughput and per-node power, enabling operators to choose the smallest cluster that satisfies a target ingest rate with headroom while minimizing energy consumption. We substantiate the approach with an experimental study of Kafka and Redpanda across three hardware generations (HDD, SATA SSD, NVMe), varying partition counts, node counts, and resource limits. We find that storage technology is the primary determinant of throughput, horizontal scaling is near-linear, and vertical CPU scaling yields diminishing returns; the two brokers exhibit distinct energy proportionality properties. On previously unseen hardware, the model predicts throughput and power with median errors below 10% and 7%, respectively. Our results provide a practical, reproducible capacity-planning workflow that maps IoT workload requirements (message size and rate) to concrete, energy-aware deployment decisions.
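The calibration step can be sketched as a simple least-squares fit (the linear forms and all numbers below are assumptions for illustration; the paper fits its own performance model): measure a few small clusters, fit throughput as a function of node count, and pick the smallest cluster whose predicted throughput covers the target rate plus headroom.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1


def right_size(calib_nodes, calib_tput, target_rate, headroom=0.2):
    """Smallest node count whose predicted throughput meets target + headroom.

    calib_nodes/calib_tput: measurements from a few small calibration runs,
    assuming the near-linear horizontal scaling observed in the paper.
    """
    b0, b1 = fit_line(calib_nodes, calib_tput)
    n = 1
    while b0 + b1 * n < target_rate * (1 + headroom):
        n += 1
    return n
```

The same fit applied to per-node power measurements would let an operator compare candidate cluster sizes by predicted energy as well as by throughput, which is the energy-aware half of the paper's workflow.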

Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures

Diffusion models have become essential for high-fidelity 3D MRI synthesis, yet their deployment remains constrained by substantial GPU resource demands arising from hundreds of U-Net evaluations per sample and highly heterogeneous kernel behavior. This paper performs a comprehensive performance analysis of the state-of-the-art medical diffusion model, Med-DDPM, across three generations of NVIDIA architectures to study kernel-level runtime breakdowns, instruction-mix characteristics, memory system utilization, warp-level activities, and profiler priority-score estimates. We show that training is overwhelmingly dominated by cuDNN convolution and implicit-GEMM kernels, with inefficiencies arising from memory-access patterns, tensor-layout conversions, and limited Tensor Core utilization. Guided by these insights, we evaluate two architecture-aware optimizations, TF32 Tensor Core activation and a 3D channels-last layout, and demonstrate that they reduce SM cycles by up to 100×, cut dynamic instructions by 100×, raise Tensor Core utilization from 1.45 to 9.98×, and increase IPC by 7% on A100, all without degrading synthesis quality.

A Comparative Evaluation of Imputation Models for Agricultural Weather Networks

High-resolution weather data are essential for irrigation scheduling, frost protection, and pest and disease risk modeling. However, weather stations frequently experience multi-hour to multi-day outages, leading to substantial downtime for weather-driven decision-making. To mitigate this, stakeholders often rely on ''imputation'' models to reconstruct missing data. Despite the existence of many statistical and machine-learning models, it remains unclear which imputation methods are most suitable for operational agricultural settings. This paper evaluates twelve imputation methods---statistical models, classical machine learning algorithms, and deep neural networks---to identify the most suitable model for agricultural applications. We tested the models using data from five meteorological towers (three in Jena, Germany, and two in Sunnyside, Washington, USA). We perform a thorough performance-engineering study: in addition to accuracy, we evaluate runtime, inference throughput, peak memory usage, GPU usage, energy consumption, and monetary cost. Our findings are surprising: among all complex and advanced (neural network) models, a properly tuned Random Forest (RF) model consistently outperforms them across multiple evaluation categories (e.g., accuracy, latency, throughput, and cost). Specifically, an RF achieves competitive error with no GPU dependencies, modest memory usage, and substantially lower energy expenditure than deep learning baselines. Our research shows that classical machine learning models remain a compelling choice for scalable, cost-aware weather data imputation in agricultural decision-support systems.

SESSION: Session 7: GPU & Heterogeneous Computing

MQGPU: A Multi-Queue Scheduling Framework For GPU Accelerated Serverless Functions

Hardware accelerators like GPUs are now ubiquitous in data centers, but are not fully supported by common cloud abstractions such as Functions as a Service (FaaS). Many popular and emerging FaaS applications such as machine learning and scientific computing can benefit from GPU acceleration. However, FaaS frameworks (such as OpenWhisk) are not capable of providing this acceleration because of the impedance mismatch between GPUs and the FaaS programming model, which requires virtualization and sandboxing of each function. The challenges are amplified due to the highly dynamic and heterogeneous FaaS workloads.

This paper presents the design and implementation of a FaaS system for providing GPU acceleration in a black-box manner (without modifying function code). Running small functions in containerized sandboxes is challenging due to limited GPU concurrency and high cold-start overheads, resulting in heavy queueing of function invocations. We develop MQGPU, an integrated fair queueing and GPU memory management approach, which balances the tradeoffs between locality, fairness, and latency. Using principles from I/O scheduling, we develop a new fair-queueing policy for GPUs, which reduces function latency by 1.2×-20× compared to FCFS, continuous batching, and Paella. Each of our solution components reduces function latency several-fold, and by more than 300× when combined, compared to standard NVIDIA-docker GPU containers.
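The fair-queueing idea borrowed from I/O scheduling can be illustrated with a minimal start-time fair queueing (SFQ) sketch over per-tenant virtual times. The names and structure here are illustrative only, not MQGPU's actual implementation:

```python
import heapq
from itertools import count

def sfq_schedule(arrivals):
    """Order jobs by start-time fair queueing.

    arrivals: list of (tenant, cost) pairs in arrival order.
    Each tenant's jobs receive virtual start tags spaced by their cost,
    so a heavy tenant cannot starve a light one the way FCFS can.
    """
    finish = {}            # per-tenant virtual finish time
    heap, seq = [], count()
    for tenant, cost in arrivals:
        start = finish.get(tenant, 0.0)
        finish[tenant] = start + cost
        heapq.heappush(heap, (start, next(seq), tenant))
    order = []
    while heap:
        _, _, tenant = heapq.heappop(heap)
        order.append(tenant)
    return order

# Tenant A floods the queue, yet SFQ still interleaves B's jobs.
print(sfq_schedule([("A", 1), ("A", 1), ("A", 1), ("B", 1), ("B", 1)]))
# ['A', 'B', 'A', 'B', 'A']
```

Under FCFS the same arrival order would serve all of A's backlog before B; the virtual-time tags are what restore fairness without per-tenant coordination.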

LSTC: Large-Scale Triangle Counting on Single GPU

Triangle counting in graphs has applications in a wide range of domains. For many years, researchers have been improving triangle-counting performance by exploiting recent architectures, such as multicores and accelerators like GPUs. Improving the performance of triangle counting on GPUs poses several challenges: (1) GPUs are equipped with comparatively little memory, so large real-world graphs cannot be processed within the capacity of a single GPU's memory; (2) GPUs follow SIMD execution, whereas graphs exhibit irregular data parallelism; and (3) triangle counting incurs a huge number of memory accesses, so minimizing global memory accesses is crucial for good performance.

To address these challenges, we propose Large-Scale Triangle Counting (LSTC), which can perform triangle counting on large graphs on a single GPU, even when they exceed the GPU's memory capacity. To achieve this, we first propose a novel workload partitioning scheme that partitions a large graph so that triangles can be counted using a single GPU without communication overhead. The proposed partitioning scheme reduces the number of duplicated edges and vertices across the partitions, which results in minimizing the memory footprint. Further, we propose a triangle-counting algorithm for each partition to improve performance by efficiently exploiting GPU architectural resources.
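The per-partition counting step can be illustrated with the standard edge-orientation-and-intersection algorithm that GPU triangle counters parallelize over edges. This CPU sketch is illustrative only, not LSTC's actual kernel:

```python
def count_triangles(edges):
    """Count triangles by intersecting the neighbor sets of each edge's
    endpoints, orienting edges low->high so every triangle is counted
    exactly once."""
    adj = {}
    for u, v in edges:
        a, b = min(u, v), max(u, v)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set())
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            # Common higher-numbered neighbors of u and v close a triangle.
            total += len(nbrs & adj[v])
    return total

# The complete graph K4 contains exactly 4 triangles.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(edges))  # 4
```

A GPU version assigns edges to warps and replaces the set intersection with merge-path or binary-search intersection over sorted adjacency lists, which is where the global-memory-access concerns from the abstract arise.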

We evaluated LSTC on a wide range of graph datasets. LSTC not only achieves an average speedup of 2.8× on large datasets compared with state-of-the-art multi-GPU implementations, but also requires fewer computational resources.

Pulse: A Profiling and Visualization Infrastructure for Heterogeneous Managed Systems

TornadoVM is an open-source framework that enables Java applications to execute on heterogeneous hardware accelerators, such as GPUs and FPGAs. While TornadoVM simplifies programmability, understanding and optimizing the performance of applications on heterogeneous systems remains a significant challenge.

This paper introduces Pulse, a profiling infrastructure designed to collect, correlate, and visualize detailed performance metrics for application components offloaded to accelerators. At its core, Pulse includes a profiler that integrates information from both the managed runtime and the underlying hardware. The collected metrics include fine-grained execution timing, data-transfer costs, compilation overheads, and power consumption during accelerator execution. In addition, Pulse provides an interactive visualization layer that presents these metrics in an accessible and actionable manner, helping developers analyze performance bottlenecks and improve the efficiency of Java applications running on heterogeneous hardware through TornadoVM.

Using a Java-native implementation of the Llama3 inference pipeline (GPULlama3.java) as a case study, this paper demonstrates how Pulse facilitates profiling-guided optimization. By following the profiling information generated by Pulse, developers were able to refactor and fuse fine-grained GPU kernels, reducing kernel launches per layer from 19 to 13, cutting data-transfer cost by 38%, and improving per-layer execution time by 18.4% overall. Finally, the instrumentation overhead of Pulse is evaluated at 1.5% for coarse-grained microbenchmarks and up to 30% for workloads dominated by short-lived kernels such as GPULlama3.java, which represents a worst-case scenario for fine-grained profiling.

Low-Latency ML Offloading Across Edge and IoT Devices

Deploying machine learning (ML) inference across the Cloud-Edge-IoT continuum faces challenges from device heterogeneity, limited resources, and dynamic networks. This work presents MLIoT, a cloud-native framework enabling seamless ML workload offloading with secure, dynamic device discovery. Building on Akri and vAccel, MLIoT balances local inference on constrained microcontrollers with offloading to edge accelerators, optimizing latency, resource use, and accuracy. In this work, we highlight the trade-offs between performing inference locally and transparently offloading compute-intensive components to edge accelerators. Large models often cannot fit in constrained memory, and quantized local models can suffer accuracy loss. Similarly, compute-limited devices may experience high latency or energy consumption when running inference locally. We describe performance engineering techniques including custom discovery, transport optimizations, and scheduling. Experiments on ESP32, Raspberry Pi, NVIDIA Jetson, and x86 cloud nodes demonstrate reduced latency and improved inference efficiency, highlighting practical trade-offs and challenges in managing dynamic device availability. Our findings showcase the potential of cloud-native performance engineering for accurate, low-latency ML inference at the edge.

A Transparent and Efficient Performance Analysis Approach to Enhance DPDK Observability

In recent years, the rapid growth of network traffic and the performance bottlenecks inherent in kernel networking stacks have driven the widespread adoption of userspace networking frameworks. While kernel-bypass solutions such as the Data Plane Development Kit (DPDK) effectively eliminate kernel overhead, they also limit observability for traditional monitoring tools, complicating fault diagnosis and performance tuning. This observability gap, coupled with the complexity of modern packet-processing software, makes diagnosing performance issues increasingly difficult. This paper presents a performance analysis framework tailored for DPDK-based applications. The framework leverages trace data collected through DPDK's native tracer to derive targeted performance metrics, which are visualized through interactive, domain-specific analyses in Trace Compass. By enabling fine-grained observability with minimal runtime overhead, the approach bridges the gap between low-level tracing and actionable performance insights. To ground our design in real-world needs, we surveyed 19 industry practitioners to validate our design choices and capture empirical evidence of the debugging challenges encountered when diagnosing DPDK-based applications. We further demonstrate how the proposed analyses can reveal and explain performance bottlenecks in a widely used software router.

SESSION: Session 8: Adaptive Cloud & Edge

WASL: Harmonizing Uncoordinated Adaptive Modules in Multi-Tenant Cloud Systems

Modern cloud applications increasingly rely on adaptive control modules, such as dynamic resource tuning or system reconfiguration, to meet strict quality-of-service (QoS) objectives. However, when multiple independently developed adaptation modules are colocated on a shared infrastructure, their uncoordinated behavior causes interference, leading to QoS violations. Existing approaches require centralized control or inter-module communication, violating modularity and limiting adoption in multi-tenant environments.

We present WASL, a modular runtime coordination technique that enables colocated adaptive workloads to operate harmoniously without information sharing or control coupling. WASL estimates the deviation between expected and observed behavior using only local feedback and dynamically adjusts each module's adaptation rate to reduce interference. It acts as a lightweight plug-in with constant-time overhead and can be integrated into diverse adaptation strategies without requiring any changes to control logic.
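The deviation-driven rate adjustment can be pictured as a small multiplicative controller driven only by local feedback. The update rule and parameter names below are a toy illustration, not WASL's actual formula:

```python
def adjust_rate(rate, expected, observed, gain=0.5,
                min_rate=0.05, max_rate=1.0):
    """Scale a module's adaptation rate by how far observed behavior
    deviates from what the module's own actions predicted. A large
    deviation suggests interference from co-located adapters, so the
    module backs off; accurate predictions let it keep its pace."""
    deviation = abs(observed - expected) / max(abs(expected), 1e-9)
    factor = 1.0 - gain * min(deviation, 1.0)
    return min(max_rate, max(min_rate, rate * factor))

# Heavy interference (observed far from expected) slows adaptation...
print(adjust_rate(1.0, expected=100, observed=180))  # 0.6
# ...while a near-perfect prediction leaves the rate almost untouched.
print(adjust_rate(0.6, expected=100, observed=102))
```

Because each module runs this locally, no inter-module channel is needed, which matches the modularity constraint stated in the abstract.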

We implement WASL across five latency-sensitive applications from the TailBench suite, incorporating three adaptation paradigms. Across single- and multi-application scenarios, WASL reduces tail latency by up to 84% compared to uncoordinated adaptation and achieves performance comparable to centralized coordination approaches, while preserving modularity, avoiding information leakage, and aligning with resource isolation policies. WASL provides a general and practical solution for runtime coordination in adaptive cloud systems, enabling scalable deployment of independently managed services without compromising QoS.

KLUE: A Framework for Cost-Effective Experimentation in Emulated Kubernetes Clusters

Executing performance experiments in Kubernetes clusters is a resource- and time-intensive task, especially in large-scale environments typical of production systems. However, such experiments are essential for understanding system behavior and supporting operational decisions that improve efficiency and reliability. This paper introduces KLUE (Kubernetes Lite execUtion Environment), a lightweight framework that enables performance experimentation in emulated Kubernetes clusters. KLUE provides a practical, cost-effective approach for testing and validating configurations, policies, and workloads without the need for extensive physical infrastructure. Using KLUE, we successfully reproduced an experiment originally conducted in a real Kubernetes cluster, achieving a 93.5% reduction in execution cost. We also leveraged the framework to study the impact of multiple application spreading strategies using a 24-hour production-scale trace from a large technology company by spending only 0.14% of the estimated cost required to run the same study on real infrastructure—an analysis that would be economically infeasible in real environments. These results highlight KLUE's potential to accelerate experimentation, reduce costs, and improve decision-making in Kubernetes-based environments, offering a valuable tool for both research and industry settings.

FLYT: Transparent and Elastic GPU Provisioning for Multi-Tenant Cloud Services

Modern cloud services such as AI inference, video analytics, and scientific computing exhibit highly variable and bursty GPU demand patterns that static provisioning and coarse-grained sharing mechanisms struggle to accommodate efficiently. Existing GPU multiplexing approaches, including NVIDIA MPS and MIG, provide limited flexibility in multi-tenant environments, often leading to resource fragmentation, under-utilization, or unpredictable latency. We present Flyt, a transparent, latency-aware GPU orchestration framework for virtualized cloud services. Flyt enables fine-grained runtime scaling of Streaming Multiprocessors (SMs) and breaks the traditional VM–GPU binding by allowing applications inside a VM to execute on different GPUs over time. This design supports elastic scaling and live inter-node GPU migration without application or guest OS modifications, by virtualizing GPU memory through address translation and enforcing elastic SM execution caps.

An evaluation on heterogeneous GPUs using TorchServe and Rodinia benchmark applications demonstrates that Flyt maintains predictable, bounded latency under dynamic workloads while significantly improving GPU utilization compared to static provisioning. In co-located VM–GPU deployments using shared-memory transport, Flyt achieves performance within 12--15% of native execution for most workloads while providing latency isolation and elasticity under contention, demonstrating that elastic SM allocation can maintain latency targets under bursty load without hardware partitioning.

To Offload or Not To Offload: Model-driven Comparison of Edge-native and On-device Processing In the Era of Accelerators

Computational offloading is a promising approach to overcome client device resource constraints by moving application computations to remote servers. With the advent of specialized hardware accelerators, client devices can now perform fast local processing of tasks such as machine learning inference, reducing the need for offloading. However, edge servers with accelerators also offer faster offloading performance than was previously possible. In this paper, we present an analytic and experimental comparison of on-device processing and edge offloading across accelerator, multi-tenant, and workload scenarios to understand when to use local processing versus offloading. We present models that leverage queuing theory to derive explainable closed-form equations for end-to-end latencies, yielding quantitative performance crossover predictions to guide adaptive offloading. We validate our models across various settings and show that they achieve a mean absolute percentage error of 2.2% compared to observed latencies. We further use these models to develop a resource manager for adaptive offloading and demonstrate its effectiveness in dynamic multi-tenant edge environments.
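The flavor of the closed-form models can be conveyed with textbook M/M/1 response times: offloading wins once local queueing delay exceeds the network round trip plus the faster server's response time. This is a simplified sketch under assumed parameters; the paper's models additionally cover accelerator and multi-tenant scenarios:

```python
def mm1_latency(service_time, arrival_rate):
    """Mean response time of an M/M/1 queue: T = 1 / (mu - lambda)."""
    mu = 1.0 / service_time
    if arrival_rate >= mu:
        return float("inf")          # unstable queue
    return 1.0 / (mu - arrival_rate)

def offload_wins(local_svc, edge_svc, rtt, arrival_rate):
    """Compare on-device latency against network RTT + edge latency."""
    local = mm1_latency(local_svc, arrival_rate)
    remote = rtt + mm1_latency(edge_svc, arrival_rate)
    return remote < local

# A fast local accelerator (20 ms per inference) beats offloading over a
# 30 ms RTT at low load, but loses once queueing inflates local latency.
print(offload_wins(0.020, 0.005, 0.030, arrival_rate=10))   # False
print(offload_wins(0.020, 0.005, 0.030, arrival_rate=45))   # True
```

Solving `remote == local` for the arrival rate gives exactly the kind of explainable crossover point the abstract describes, which an adaptive resource manager can evaluate at runtime.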

SESSION: Session 9: Java & Heap Performance

MapReplay: Trace-Driven Benchmark Generation for Java HashMap

Hash-based maps, particularly java.util.HashMap, are pervasive in Java applications and the JVM, making their performance critical. Evaluating optimizations is challenging because performance depends on factors such as operation patterns, key distributions, and resizing behavior. Microbenchmarks are fast and repeatable but often oversimplify workloads, failing to capture realistic usage patterns. Application benchmarks (e.g., DaCapo, Renaissance) provide realistic usage but are more expensive to run, prone to variability, and dominated by non-HashMap computations, making map-related performance changes difficult to observe. To address this challenge, we propose MapReplay, a benchmarking methodology that combines the realism of application benchmarks with the efficiency of microbenchmarks. MapReplay traces HashMap API usage and generates a replay workload that reproduces the same operation sequence while faithfully reconstructing internal map states. This enables efficient evaluation of alternative implementations under realistic usage patterns. Applying MapReplay to DaCapo-Chopin and Renaissance, the resulting suite, MapReplayBench, reproduces application-level performance trends while reducing experimentation time and revealing insights difficult to obtain from full benchmarks.
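The trace-and-replay methodology can be sketched, in miniature, as a recorded operation log replayed against any candidate map implementation. This Python analogue of the Java setup is purely illustrative:

```python
def replay(trace, map_impl):
    """Replay a recorded sequence of map operations against a candidate
    implementation, reproducing the original operation mix and final
    map state. Returns a checksum of lookups plus the final size, which
    lets two implementations be checked for behavioral equivalence."""
    checksum = 0
    for op, key, value in trace:
        if op == "put":
            map_impl[key] = value
        elif op == "get":
            checksum += map_impl.get(key, 0)
        elif op == "remove":
            map_impl.pop(key, None)
    return checksum, len(map_impl)

# A tiny trace: two inserts, a hit, a miss, and a removal.
trace = [("put", "a", 1), ("put", "b", 2),
         ("get", "a", None), ("get", "c", None), ("remove", "b", None)]
print(replay(trace, {}))  # (1, 1)
```

Timing the `replay` call isolates map costs from the surrounding application, which is the efficiency advantage the abstract claims over running the full benchmark.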

G1HeapVis: Visualizing and Measuring Heap Fragmentation

Performance predictability and memory efficiency are key concerns in managed runtime environments such as the Java Virtual Machine (JVM). These factors depend not only on garbage collection (GC) efficiency but also on how memory is organized within the heap, which shapes allocation rates and the timing and frequency of GC cycles. Although the G1 Garbage Collector (G1GC) in the JVM is designed to balance throughput and latency, the effects of heap fragmentation on object allocation efficiency, GC pause times, and overall application performance remain insufficiently characterized and largely unmeasured in empirical studies.

This paper presents G1HeapVis, a tool that parses G1GC logs and analyzes region-level occupancy data, deriving metrics for analyzing and visualizing heap fragmentation. These data and metrics can be visualized to understand fragmentation patterns and their dynamics during program execution.

We evaluate our tool on the DaCapo benchmark suite, correlating heap memory fragmentation with performance metrics such as execution speed, allocation throughput, and GC pause times. The results indicate that heap fragmentation can introduce measurable performance overhead, highlighting its role as a significant contributing factor to the inefficiency of Java applications. By providing a quantitative fragmentation metric, G1HeapVis contributes to a deeper understanding of G1GC behavior across various workloads and enables comparative studies of heap fragmentation.
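One simple region-level metric of the kind such a tool can derive is the share of reserved region space not occupied by live data. The exact metric below is an illustrative assumption, not necessarily the one G1HeapVis computes:

```python
def fragmentation(region_occupancy, region_size):
    """Fragmentation of a region-based heap (as in G1GC): 1 minus the
    ratio of live bytes to the bytes reserved by the non-empty regions
    holding them. 0.0 means perfectly packed; values near 1.0 mean
    live data is smeared thinly across many regions."""
    live = sum(region_occupancy)
    used_regions = sum(1 for occ in region_occupancy if occ > 0)
    if used_regions == 0:
        return 0.0
    return 1.0 - live / (used_regions * region_size)

# Eight 1 MiB regions, each only 25% full: the same live bytes would
# fit in two packed regions, so the footprint is four times larger.
occ = [256 * 1024] * 8
print(round(fragmentation(occ, 1024 * 1024), 2))  # 0.75
```

Plotting this value over successive GC log snapshots is what turns raw region occupancy into the fragmentation dynamics the paper visualizes.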

SESSION: Session 10: Adaptive Systems & Predictive Management

Are We There Yet? Predicting if Executing Applications are Near Completion

Predicting the running time or remaining time of batch-style applications is useful to schedulers and resource managers. However, it is fundamentally challenging to make such predictions accurately for applications that have not been seen before or that run on datasets with varying sizes. For this reason, we aim to answer a simpler, but nevertheless instrumental, question: is an executing application about to finish executing? To this end, we present AWTY, a workflow for predicting whether or not a running batch-style application is near completion. AWTY analyzes application profiles to identify what applications' last phases look like while treating applications as black boxes. It then uses this data to train classifiers that can identify whether or not an executing application is in its last phase. AWTY employs both single-application classifiers that work on applications that have been seen before and general classifiers that work on applications that have not been seen before. Our evaluation shows that AWTY can predict if an application is near completion reasonably well. AWTY can inform schedulers and resource managers in making decisions about whether to kill applications that have overstayed their time or to let them finish.

Holpaca: Holistic and Adaptable Cache Management for Shared Environments

Modern data-intensive systems rely on in-memory caching to achieve high throughput and low latency. CacheLib, Meta's general-purpose caching engine, provides high performance and flexibility for building specialized caches for a variety of applications. However, despite its wide adoption in large-scale infrastructures, CacheLib's data management mechanisms exhibit inefficiencies in shared environments. Particularly, its static and uncoordinated memory allocation leads to fragmented resource usage, unfair memory distribution, and degraded performance across tenants and instances.

We present Holpaca, a general-purpose caching middleware that enables holistic and adaptable orchestration of shared caching environments. Holpaca introduces a shim data layer co-located with each cache instance and a centralized orchestrator with system-wide visibility, enabling global memory management and per-tenant QoS policies. Using production traces from Twitter, we show that, by continuously readjusting memory allocations based on workload dynamics, Holpaca achieves up to 3× higher throughput in multi-tenant settings and a 2.2× improvement in multi-instance settings over CacheLib's rigid built-in mechanisms.
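The orchestrator's continuous readjustment can be pictured as a feedback loop that shifts memory toward the tenant suffering the most misses. The step-based policy below is an illustrative assumption, not Holpaca's actual algorithm:

```python
def rebalance(alloc, miss_rate, step=64):
    """One orchestration round: move `step` MiB of cache from the
    tenant that misses least to the tenant that misses most, so
    allocations track workload dynamics without a full re-partition.

    alloc:     {tenant: cache MiB}, mutated in place and returned.
    miss_rate: {tenant: observed miss ratio} from the shim data layer.
    """
    donor = min(alloc, key=lambda t: miss_rate[t])
    taker = max(alloc, key=lambda t: miss_rate[t])
    if donor != taker and alloc[donor] >= step:
        alloc[donor] -= step
        alloc[taker] += step
    return alloc

# tenant_b misses eight times as often, so it gains memory this round.
alloc = {"tenant_a": 512, "tenant_b": 512}
print(rebalance(alloc, {"tenant_a": 0.05, "tenant_b": 0.40}))
# {'tenant_a': 448, 'tenant_b': 576}
```

Running small steps continuously, rather than recomputing a static partition, is what lets a centralized orchestrator of this kind adapt without disrupting tenants.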

Energy-efficient Dynamic Partitioning and Tensors Compression of AI Applications in Smart Eyewears

Resource-constrained smart eyewear (SEW) devices face significant challenges when deploying deep neural networks due to limited computational capacity and battery life. Computational offloading to companion devices like smartphones and cloud servers addresses processing limitations, but data transmission becomes a critical bottleneck, consuming over 50% of total energy in some scenarios. Although lossless compression methods provide limited data reduction for intermediate tensors, lossy techniques such as Vector Quantization (VQ) offer higher compression ratios (requiring only 3.3 bits per float) at the expense of inference accuracy degradation. This paper presents an adaptive multi-stage compression framework that dynamically balances these trade-offs across the SEW-phone-cloud continuum. We employ VQ at the SEW-phone interface where aggressive compression is essential (achieving 89.6% tensor size reduction with 90% retained accuracy), followed by adaptive selection between quantization and run-length encoding for phone-to-cloud transmission based on network conditions. A Deep Q-Network (DQN) agent jointly optimizes network partitioning points and compression strategies to minimize energy consumption while preserving accuracy and meeting latency constraints. A large simulation campaign considering object detection and human pose estimation tasks demonstrates that our method achieves 55--70% energy savings and 86--91% violation reduction compared to Neurosurgeon (a dynamic partitioning baseline without compression), 45.8% energy savings versus local execution, and 61.1% savings over uncompressed offloading, with latency violation rates below 9% and acceptable accuracy loss (8.0--8.1%). These results enable practical deployment of AI applications on battery-limited SEW devices.
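The VQ stage maps each small block of tensor values to an index into a shared codebook, so only indices cross the SEW-phone link. The tiny encode/decode sketch below uses an assumed codebook and block size; the paper's 3.3 bits per float follows from its own codebook design, not this example's:

```python
def vq_encode(vec, codebook, dim=2):
    """Split a flat tensor into dim-sized blocks and map each block to
    the index of its nearest codeword (squared Euclidean distance)."""
    idxs = []
    for i in range(0, len(vec), dim):
        block = vec[i:i + dim]
        idxs.append(min(
            range(len(codebook)),
            key=lambda j: sum((b - c) ** 2
                              for b, c in zip(block, codebook[j]))))
    return idxs

def vq_decode(idxs, codebook):
    """Rebuild an approximate tensor from the codeword indices."""
    return [x for j in idxs for x in codebook[j]]

# Four codewords of two floats each: every 2-float block compresses to
# a 2-bit index, at the cost of the reconstruction error visible below.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
tensor = [0.1, -0.05, 0.9, 1.1, 0.2, 0.8]
codes = vq_encode(tensor, codebook)
print(codes)                       # [0, 1, 2]
print(vq_decode(codes, codebook))  # lossy reconstruction of `tensor`
```

The accuracy-versus-ratio trade-off the abstract describes shows up directly here: a larger codebook or block dimension shrinks the reconstruction error but raises the bits transmitted per float.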

An Evaluation Study of Generative AI Systems: Framework-Aware Performance Under Real-World Constraints

The widespread adoption of Large Language Models (LLMs) in enterprise applications has created a critical need for systematic evaluation of Generative AI Systems (GenAIS) that integrate orchestration frameworks, foundation models, and deployment optimizations. This paper presents the first comprehensive study specifically designed to evaluate the performance trade-offs between orchestration frameworks (LangChain, LlamaIndex), foundation models, and deployment constraints across multiple application domains. Our study systematically evaluates eight foundation models across question-answering with Retrieval-Augmented Generation (RAG) and mathematical reasoning tasks, measuring latency, accuracy, resource utilization, and power consumption under various optimization strategies including adaptive context windowing and concurrent request processing. Through extensive empirical analysis, we demonstrate that framework selection significantly impacts system performance independent of model choice, with task-specific trade-offs emerging across workload types. For retrieval-heavy RAG workloads, LlamaIndex delivers 3-15% lower latency, around 60% lower peak memory usage, and 2-4% more energy consumption. For reasoning-intensive mathematical tasks, LangChain achieves higher accuracy across all tested models and provides more predictable latency profiles and reasonable resource consumption, making it preferable for strict SLA environments. Adaptive context windowing does not yield uniform gains and offers only modest, model-dependent improvements over fixed policies, whereas increasing concurrency improves aggregate throughput but, beyond moderate loads, consistently inflates tail latency and degrades SLA predictability. These findings underscore the decisive role of orchestration design in GenAIS performance and provide empirical guidance for balancing efficiency, accuracy, and scalability in real-world deployments.