Proceedings of the ACM on Measurement and Analysis of Computing Systems: Vol. 10, No. 1. 2026

POMACS V10, N1, March 2026 Editorial

The Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) focuses on the measurement and performance evaluation of computer systems and operates in close collaboration with the ACM Special Interest Group SIGMETRICS. All papers in this issue of POMACS will be presented at the ACM SIGMETRICS 2026 conference on June 8-12, 2026 in Ann Arbor, Michigan, USA.

The papers in this issue were selected during the Fall submission round by the 113 members of the ACM SIGMETRICS 2026 program committee via a rigorous review process. Each paper was conditionally accepted (and shepherded), or allowed to be resubmitted to one of the subsequent three SIGMETRICS deadlines, or rejected (with resubmission allowed only after a year).

Of the 149 papers submitted, 25 were accepted by the program committee (including two revised papers from previous rounds, out of four resubmissions). Each submission received between three and five reviews. Although most papers were decided in the online discussion phase, borderline cases were discussed during the online program committee meeting. There were two types of submissions this year: regular papers and operational systems track papers, the latter intended to report measurement studies on currently deployed systems. The vast majority were regular papers (144 out of 149). Each regular-paper submission could indicate one or two possible tracks: roughly 36% of submissions indicated Measurement & Applied Modeling, 33% Theory, 46% Systems, and 20% Learning (the percentages sum to more than 100% because a submission could select two tracks). All papers went through the same review process.

Many individuals contributed to the success of this POMACS issue. We thank the authors, who submitted their best work to SIGMETRICS/POMACS. We also thank the program committee members who provided constructive feedback in their reviews to authors and participated in the online discussions and program committee meeting. In addition, we thank several external reviewers who provided their expert opinions on specific submissions that required additional input. Finally, we are also grateful to the ACM SIGMETRICS Chair, Mor Harchol-Balter, to the SIGMETRICS Organization Committee, and to the SIGMETRICS Executive Committee for their ongoing efforts and initiatives for creating an exciting program for ACM SIGMETRICS 2026.

A Comprehensive Study of Satellite Network Performance During Severe or Extreme Geomagnetic Storms over 1.5 Years (May 2024 – Oct 2025)

Geomagnetic storms, also known as solar storms, are often associated with disruptions in satellite communications, yet their impact on real-world performance remains under-explored even as satellite broadband usage grows exponentially due to Starlink and others. This is particularly critical now, as we are currently near the peak of the solar cycle, a period associated with increased solar storm activity. Using a combination of active and passive measurements, we conduct a comprehensive study of the network effects of all 15 strong, severe, or extreme solar storms that occurred in the past 1.5 years (May 2024 – Oct 2025). We bring together three different datasets. First, we use LEOScope, a public global LEO testbed, to schedule controlled and fine-grained measurements at seven locations globally during solar storms. The results reveal a severe degradation in performance during such events, manifesting as a 20% decrease in throughput, a 10% increase in latency, and a doubling of the packet loss rate. Second, we use data from M-Lab (Google) speedtests conducted by satellite network users to enhance global coverage. We obtain over 4 million records, which allows us to conduct a comparative analysis: we study variation in performance degradation across different latitudes within the same longitude range (over the North and South American continents) and effects across similar (40°–55°) latitudes during periods of high geomagnetic disturbance. Finally, we supplement this with data from Cloudflare AIM, which, in addition to common network speed metrics, also provides an estimate of how such storms affect user experience for common applications such as gaming, streaming, and VoIP. These findings provide new insights into how space weather affects LEO, MEO, and GEO satellite networks.

A Fine-Grained and Efficient Reliability Analysis Framework for Noisy Quantum Circuits

Evaluating the reliability of noisy quantum circuits is essential for implementing quantum algorithms on noisy quantum devices. However, current quantum hardware exhibits diverse noise mechanisms whose compounded effects make accurate and efficient reliability evaluation challenging. While state fidelity is the most faithful indicator of circuit reliability, it is experimentally and computationally prohibitive to obtain. Alternative metrics, although easier to compute, often fail to accurately reflect circuit reliability, lack universality across circuit types, or offer limited interpretability. To address these challenges, we propose a fine-grained, scalable, and interpretable framework for efficient and accurate reliability evaluation of noisy quantum circuits. Our approach performs a state-independent analysis to model how circuit reliability progressively degrades during execution. We introduce the Noise Proxy Circuit (NPC), which removes all logical operations while preserving the complete sequence of noise channels, thereby providing an abstraction of cumulative noise effects. Based on the NPC, we define Proxy Fidelity, a reliability metric that quantifies both qubit-level and circuit-level reliability. We further develop an analytical algorithm to estimate Proxy Fidelity under depolarizing, thermal relaxation, and readout error channels. The proposed framework achieves fidelity-level reliability estimation while remaining execution-free, scalable, and interpretable. Experimental results show that our method accurately estimates circuit fidelity, with an average absolute difference (AAD) ranging from 0.031 to 0.069 across diverse circuits and devices.

A Variegated Look at Direct-to-Cell Satellites in the Wild

Direct-to-cell satellites aim to deliver ubiquitous cellular network connectivity for our regular phones from space. Starlink, as the most successful direct-to-cell satellite operator, has already provided messaging and data services to unmodified smartphones across multiple countries, proving valuable in emergency communications, disaster response, and outdoor activities. However, how do Starlink direct-to-cell satellites behave in the wild? This work presents the first in-depth cellular measurement study of Starlink's direct-to-cell satellites by collecting a two-month full-stack dataset. Our analysis reveals that Starlink strives to mitigate the satellite's high latency and extreme mobility through infrastructure-side mechanisms, including reasonable function split, Doppler and delay offset compensation, and rapid cell switching strategies. However, its backward-compatibility requirements prevent cellular protocol and signaling modifications, yielding unresolved issues including frequent access failures, ping-pong handovers, and extensive retransmissions. These findings expose fundamental limitations in adapting legacy terrestrial 5G/4G to space. In the long term, these issues cannot be resolved simply by allocating more spectrum resources or deploying more satellites. We further conduct data-driven emulation using 3GPP non-terrestrial network (NTN) protocol stacks and satellite channel simulators to characterize how these issues could be resolved through protocol modifications implemented in 5G NTN and beyond.

Cache Your Prompt When It's Green — Carbon-Aware Caching for Large Language Model Serving

As large language models (LLMs) become widely used, their environmental impact — especially carbon emission — has attracted more attention. Prior studies focus on compute-related carbon emissions. In this paper, we find that storage is another key contributor. LLM caching, which saves and reuses KV caches for repeated context, reduces operational carbon by avoiding redundant computation. However, this benefit comes at the cost of embodied carbon from high-capacity, high-speed SSDs. As LLMs scale, the embodied carbon of storage grows significantly. To address this tradeoff, we present GreenCache, a carbon-aware cache management framework that dynamically derives resource allocation plans for LLM serving. GreenCache analyzes the correlation between carbon emission and SLO satisfaction, reconfiguring resources over time to balance SLO satisfaction and carbon emission under dynamic workloads. Evaluations on real traces demonstrate that GreenCache achieves an average carbon reduction of 15.1% when serving Llama-3 70B in the FR grid, with reductions reaching up to 25.3%, while staying within latency constraints for > 90% of requests.

CAPSULE: A Storage Prefetcher Harnessing Spatio-Temporal Locality for Cloud-Scale Workloads

Effective prefetching is essential in modern storage systems to reduce I/O latency and improve cache hit rates, especially under challenging access patterns. While temporal locality has been the dominant foundation for most prefetching algorithms, real-world storage workloads often exhibit long stack distances (LSD), where blocks are reused only after millions of other accesses. These patterns weaken the effectiveness of conventional prefetchers, even when strong spatial locality is present.

In this paper, we present CAPSULE (Clustering-Assisted Prefetching Scheme Utilizing Locality Exploration), a novel prefetching framework that exploits spatial locality through adaptive clustering. By dynamically grouping logical block addresses (LBAs) and prefetching across neighboring clusters, CAPSULE effectively bridges the gap left by temporal-only approaches.

We evaluate CAPSULE across 729 real-world workloads drawn from five major benchmark suites (MSR, CloudPhysics, Tencent CBS, Alibaba Block and Meta Tectonic), reflecting diverse cloud-scale environments. CAPSULE improves cache hit rates by up to 6.1× and achieves up to 1.89× speedup in task completion time, outperforming traditional and learned prefetchers alike. Our results demonstrate that CAPSULE is particularly well-suited for modern cloud storage systems, where massive working sets and temporal locality erosion are increasingly the norm.

Characterizing Performance–Energy Trade-offs of Large Language Models in Multi-Request Workflows

Large language models (LLMs) are increasingly deployed in applications forming multi-request workflows, such as document summarization, search-based copilots, and multi-agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inference. Existing measurement and benchmarking efforts either focus on assessing the performance of LLM inference systems or consider single-request evaluations, overlooking the workflow dependencies and cross-request interactions unique to multi-request workflows. Moreover, the energy usage of such interdependent LLM calls has not been explored in depth.

To address these gaps, this paper presents the first systematic characterization of performance–energy trade-offs in multi-request LLM inference. We develop and evaluate four representative workloads that capture sequential, interactive, agentic, and composite patterns common in modern deployments. Using an empirical NVIDIA A100 testbed with state-of-the-art serving systems (vLLM and Parrot), we systematically analyze how key energy knobs (e.g., input-output length, batch size, and GPU power cap) reshape latency, throughput, and component-level (e.g., CPU, GPU, and DRAM) energy use. Our findings reveal that batch size is the most impactful lever, though its benefits are highly workload dependent. While optimal batching benefits workloads with large shared prompts, it is ineffective for sequential summarization and only partially effective for multi-agent coding. GPU power capping provides modest but predictable savings, while output length induces linear energy scaling with limited efficiency gains. We further demonstrate that engine-level optimizations in vLLM (e.g., continuous batching, PagedAttention) maintain higher GPU utilization and efficiency, especially for decode-heavy workloads, while Parrot's workflow-aware scheduling achieves lower energy consumption under stringent power constraints. These findings offer actionable guidelines for developers and system operators in designing performance- and energy-aware LLM serving systems for emerging multi-request workflows.

Dynamic SLA-aware Network Slice Monitoring

Next-generation networks increasingly rely on network slices - logical networks tailored to specific application requirements, each with distinct Service-Level Agreements (SLAs). Ensuring compliance with these SLAs requires continuous, real-time monitoring of end-to-end performance metrics for each slice, within a limited telemetry budget. However, we find that existing solutions face two fundamental limitations: they either lack end-to-end visibility (e.g., sketches, probabilistic sampling) or provide visibility but lack the control mechanisms to dynamically allocate monitoring resources according to slice SLAs.

We address this through a formal framework that reframes slice monitoring as a closed-loop control problem and defines the minimal data-plane requirements for SLA-aware slice monitoring via a telemetry primitive contract. We then present SliceScope, a realization of this framework that combines: (1) a control strategy that dynamically allocates monitoring resources across diverse slices according to their SLA criticality, and (2) a data plane based on change-triggered INT that provides per-packet end-to-end visibility with tunable accuracy-overhead trade-offs, satisfying the telemetry contract. Our evaluation on programmable switches and in large-scale simulations with a mixture of different slice types demonstrates that SliceScope tracks critical slices up to 4× more accurately than static baselines, while showing that change-triggered INT outperforms alternative primitives for realizing the telemetry primitive contract.

Empirical Gittins for data-driven M/G/1 scheduling with arbitrary job size distributions

When scheduling in the M/G/1 queue to minimize mean response time, a classic result is that if job sizes are unknown, then the Gittins policy is optimal. Gittins is best described as a policy construction: it takes as input the queue's job size distribution, and it outputs a job prioritization rule that optimizes mean response time for that particular distribution. But in practice, instead of knowing the exact job size distribution, one usually only has samples from it. We therefore ask: given finitely many samples from the job size distribution, how can one construct a scheduling policy with near-optimal mean response time?

Our main result is that to achieve near-optimal mean response time, it suffices to simply apply the Gittins construction to the empirical distribution of the job size samples. We call this policy empirical Gittins, and we prove an explicit high-probability bound on its mean response time. Our bound implies convergence to the optimal mean response time as one increases the number of samples. We also show that if one has even vague knowledge of the true distribution's tail asymptotics, one can make empirical Gittins more robust using truncation, resulting in better convergence rates.

It is surprising that empirical Gittins works well even for continuous job size distributions. This is because the Gittins construction is sensitive to the distribution's density, yet the empirical distribution, being discrete, cannot possibly approximate a continuous density. Our main technical contribution is to show that despite its sensitivity to density, the Gittins construction yields a good policy as long as one gives it a distribution with an approximately correct tail, even if the density is completely wrong. Underlying this finding are two new extensions of the WINE queueing identity.
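The Gittins construction the abstract applies to the empirical distribution can be sketched concretely. The following is a minimal illustration under our reading of the standard M/G/1 Gittins definition, not code from the paper: for a job with attained service a, the Gittins index is the supremum over service quanta b > a of the conditional completion probability divided by the expected service invested, and with an empirical sample it suffices to scan candidate quanta among the sample values.

```python
import numpy as np

def empirical_gittins_index(samples, age):
    """Gittins index of a job at attained service `age`, computed from the
    empirical distribution of job-size samples:
        index(a) = sup_{b > a} P(S <= b | S > a) / E[min(S, b) - a | S > a].
    Jobs with a higher index receive higher priority."""
    s = np.asarray(samples, dtype=float)
    tail = s[s > age]                    # samples conditioned on S > age
    if tail.size == 0:
        return 0.0                       # no empirical mass beyond this age
    best = 0.0
    for b in np.unique(tail):            # candidate service quanta
        p_done = np.mean(tail <= b)      # completion probability by quantum b
        e_work = np.mean(np.minimum(tail, b) - age)  # expected service spent
        best = max(best, p_done / e_work)
    return best
```

For example, with all samples equal to 2 the index at age 0 is 1/2 and at age 1 it is 1, reflecting that for deterministic sizes a job's priority rises as it nears completion.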

Experimental Study on System-Level Performance Impact of Read Disturbance in Modern SSDs

This work investigates the system-level performance impact of read disturbance in modern NAND flash-based SSDs, aiming to provide new insights that can help develop better storage architectures and optimize system software. Continuous improvement in storage density over decades has led NAND flash memory to play a vital role in modern computing systems, but it also comes at a cost of significant reliability degradation. Among various error sources, read disturbance has gained growing attention as a major reliability concern due to its rapidly increasing impact, which can significantly affect system I/O performance by exacerbating SSD-internal reliability-management overheads. Although a large body of prior work has focused on device-level characterizations and optimizations, the system-level performance impact of read disturbance still remains largely uninvestigated. To address this gap, this work conducts a rigorous experimental study using 15 modern NVMe SSDs from 10 major vendors in two ways. First, we comprehensively analyze the system-level performance impact of read disturbance under diverse workloads and operating conditions. Second, to highlight the importance of efficient read-disturbance management, we showcase a new possible SSD-performance attack as a case study, demonstrating that an adversary can significantly degrade the I/O performance of other concurrently running processes by exploiting read disturbance alone in commodity SSDs. Based on our experimental study, we make 16 new observations and 7 takeaway lessons, which lead to 6 key directions for future improvements at the host-system and SSD-architecture levels to better cope with read disturbance.

Green Bin Packing

The online bin packing problem and its variants are regularly used to model server allocation problems. Modern concerns surrounding sustainability and overcommitment in cloud computing motivate bin packing models that capture costs associated with highly utilized servers. In this work, we introduce the green bin packing problem, an online variant with a linear cost β for filling above a fixed level G. For a given instance, the goal is to minimize the sum of the number of opened bins and the linear cost. We show that when β ≤ 1/G, classical online bin packing algorithms such as FirstFit or Harmonic perform well, and can achieve competitive ratios lower than in the classic setting. However, when β > 1/G, new algorithmic solutions can improve both worst-case and typical performance. We introduce variants of classic online bin packing algorithms and establish theoretical bounds, as well as test their empirical performance.

Guiding the Recommender: Information-Aware Auto-Bidding for Content Promotion

Modern content platforms offer paid promotion to mitigate cold start by allocating exposure via auctions. Our empirical analysis reveals a counterintuitive flaw in this paradigm: while promotion rescues low-to-medium quality content, it can harm high-quality content by forcing exposure to suboptimal audiences, polluting engagement signals and downgrading future recommendations. We recast content promotion as a dual-objective optimization that balances short-term value acquisition with long-term model improvement. To make this tractable at bid time, we introduce a decomposable surrogate objective, gradient coverage, and establish its formal connection to Fisher Information and optimal experimental design. We design a two-stage auto-bidding algorithm based on Lagrange duality that dynamically paces budget through a shadow price and optimizes impression-level bids using per-impression marginal utilities. To address missing labels at bid time, we propose a confidence-gated gradient heuristic, paired with a zeroth-order variant for black-box models, that reliably estimates learning signals in real time. We provide theoretical guarantees, proving monotone submodularity of the composite objective, sublinear regret in online auctions, and budget feasibility. Extensive offline experiments on synthetic and real-world datasets validate the framework: it outperforms baselines, achieves superior final AUC/LogLoss, adheres closely to budget targets, and remains effective when gradients are approximated zeroth-order. These results show that strategic, information-aware promotion can improve long-term model performance and organic outcomes beyond naive impression-maximization strategies.

Higher-Order Approximations of Sojourn Times in M/G/1 Queues via Stein's Method

We study the stationary sojourn time distribution in an M/G/1 queue operating under heavy traffic. It is known that the sojourn time converges to an exponential distribution in the limit. Our focus is on obtaining pre-asymptotic, higher-order approximations that go beyond the classical exponential limit. Using Stein's method, we develop an approach based on higher-order expansions of the generator of the underlying Markov process. The key technical step is to represent higher-order derivatives in terms of lower-order ones and control the resulting error via derivative bounds of the Stein equation. Under suitable moment-matching conditions on the service distribution, we show that the approximation error decays as a high-order power of the slack parameter ε = 1 − ρ. Error bounds are established in the Zolotarev metric, which further imply bounds on the Wasserstein distance as well as on the moments. Our results demonstrate that the accuracy of the exponential approximation can be systematically improved by matching progressively more moments of the service distribution.
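For reference, the classical exponential limit these expansions refine can be stated in textbook notation (our formulation, not the paper's): for an M/G/1 queue with arrival rate λ, service time S, and load ρ = λE[S], the Pollaczek–Khinchine formula gives the exact mean, and in heavy traffic the scaled sojourn time T becomes exponential.

```latex
% Heavy-traffic exponential limit for the M/G/1 sojourn time T,
% with mean given exactly by the Pollaczek--Khinchine formula:
(1-\rho)\,T \;\Rightarrow\; \mathrm{Exp}\!\left(\frac{2}{\lambda\,\mathbb{E}[S^2]}\right)
\quad \text{as } \rho \uparrow 1,
\qquad
\mathbb{E}[T] \;=\; \frac{\lambda\,\mathbb{E}[S^2]}{2(1-\rho)} + \mathbb{E}[S].
```

The higher-order approximations described above correct this exponential law by terms that vanish as powers of ε = 1 − ρ.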

Janus: A Dual-Mask Attention Transformer for Log-based Anomaly Detection in Cellular Networks

Modern cellular networks, such as 5G, generate complex operational logs that challenge traditional anomaly detection techniques. Existing deep learning approaches, including standard transformer models, treat logs as monolithic text streams and lack the specialization to reason about procedural correctness and semantic integrity, a key requirement for telecommunications software. We address this problem with Janus, a log-based anomaly detection system featuring a novel Single-Pass Dual-Mask (SPDM) attention mechanism. Janus introduces a domain-specific inductive bias by partitioning attention heads into two groups: global heads learn the valid temporal grammar of 5G procedures using a causal mask, and local heads perform fine-grained audits on the consistency of critical data fields using a tag-based semantic mask. A multi-stage curriculum learning framework progressively adapts Janus from domain pre-training to discriminative fine-tuning, learning to detect complex, real-world software failures. Experimental evaluation on several 5G log datasets demonstrates that Janus consistently outperforms prior systems, achieving on average a 3× performance improvement over a DNN-based baseline and an 80% gain over a transformer-based system.

Mapping the Landscape of LLM Deployment in the Wild: Prevalence, Patterns, and Perils

Large language models (LLMs) are increasingly deployed through open-source and commercial frameworks, enabling individuals and organizations to self-host advanced LLM capabilities. As LLM deployments become prevalent, ensuring their secure and reliable operation has become a critical issue. However, insecure defaults and misconfigurations often expose LLM services to the public internet, posing serious security risks. This study presents a large-scale empirical investigation of public-facing LLM deployments, focusing on the prevalence of services, exposure characteristics, systemic vulnerabilities, and associated risks. Through internet-wide measurements, we identified 320,102 public-facing LLM services across 15 frameworks and extracted 158 unique API endpoints, categorized into 12 groups based on functionality and security risk. Our analysis found that over 40% of endpoints used plain HTTP, and over 210,000 endpoints lacked valid TLS metadata. API exposure was highly inconsistent: some frameworks, such as Ollama, responded to over 35% of unauthenticated API requests, with about 15% leaking model or system information, while other frameworks implemented stricter controls. We observed widespread use of insecure protocols, poor TLS configurations, and unauthenticated access to critical operations. These security risks, such as model leakage, system compromise, and unauthorized access, are pervasive and highlight the need for secure-by-default frameworks and stronger deployment practices.

On Abnormal Execution Timing of Conditional Jump Instructions

An extensive line of work on modern computing architectures has shown that the execution time of instructions can (i) depend on the operand of the instruction or (ii) be influenced by system optimizations, e.g., branch prediction and speculative execution paradigms.

In this paper, we systematically measure and analyze timing variabilities in conditional jump instructions that can be macro-fused with a preceding instruction, depending on their placement within the binary. Our measurements indicate that these timing variations stem from the µop cache placement and the jump's offset in the L1 instruction cache of modern processors. We demonstrate that this behavior is consistent across multiple microarchitectures, including Skylake, Coffee Lake, and Kaby Lake, as well as various real-world implementations. We confirm the prevalence of this variability through extensive experiments on a large-scale set of popular binaries, including libraries from Ubuntu 24.04, Windows 10 Pro, and several open-source cryptographic libraries. We also show that one can easily avoid this timing variability by ensuring that macro-fusible instructions are 32-byte aligned - an approach initially suggested in 2019 by Intel in an overlooked short report. We quantify the performance impact of this approach across the cryptographic libraries, showing a speedup of 2.15% on average (and up to 10.54%) when avoiding the timing variability. As a by-product, we show that this variability can be exploited as a covert channel, achieving a maximum throughput of 16.14 Mbps.

Prediction-Specific Design of Learning-Augmented Algorithms

Algorithms with predictions have emerged as a powerful framework to combine the robustness of traditional online algorithms with the data-driven performance benefits of machine-learned (ML) predictions. However, most existing approaches in this paradigm are overly conservative, as they do not leverage problem structure to optimize performance in a prediction-specific manner. In this paper, we show that such prediction-specific performance criteria can enable significant performance improvements over the coarser notions of consistency and robustness considered in prior work. Specifically, we propose a notion of strongly-optimal algorithms with predictions, which obtain Pareto optimality not just in the worst-case tradeoff between robustness and consistency, but also in the prediction-specific tradeoff between these metrics. We develop a general bi-level optimization framework that enables systematically designing strongly-optimal algorithms in a wide variety of problem settings, and we propose explicit strongly-optimal algorithms for several classic online problems: deterministic and randomized ski rental, and one-max search. Our analysis reveals new structural insights into how predictions can be optimally integrated into online algorithms by leveraging a prediction-specific design. To validate the benefits of our proposed framework, we empirically evaluate our algorithms in case studies on problems including dynamic power management and volatility-based index trading. Our results demonstrate that prediction-specific, strongly-optimal algorithms can significantly improve performance across a variety of online decision-making settings.

Robustness of the 2-Choices Dynamics to Node Failures

In many applications, it becomes necessary for a set of distributed network nodes to agree on a common value or opinion as quickly as possible and with minimal communication overhead. The classical 2-choices rule is a well-known distributed algorithm designed to achieve this goal. Under this rule, each node in a network updates its opinion at random instants by sampling two neighbours uniformly at random and then adopting the common opinion held by these neighbours if they agree. For a sufficiently well-connected network of n nodes and two initial opinions, this simple rule results in the network being absorbed in a consensus state in O(log n) time (with high probability), and the consensus is obtained on the opinion held by the majority of nodes initially.

In this paper, we study the robustness of this algorithm to node failures. In particular, we assume that with a constant probability α, a node may fail to update according to the 2-choices rule and erroneously adopt any one of the two opinions uniformly at random. This is a strong form of failure under which the network can no longer be absorbed in a consensus state. However, we show that as long as the error probability α is less than a threshold value, the network is able to retain the majority support of the initially prevailing opinion for an exponentially long time (Ω(exp(Θ(n)))). In contrast, when the error probability is above a threshold value, we show that any opinion quickly (in O(log n) time) loses its majority support and the network reaches a state where (nearly) an equal proportion of nodes support each opinion. We establish the above phase transition in the dynamics for both complete graphs and expander graphs with sufficiently large spectral gaps and sufficiently homogeneous degrees. Our analysis combines spectral graph theory with Markov chain mixing and hitting time analyses.
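The noisy update rule described above can be sketched in a few lines. This is a minimal illustration of the failure model as we read it (asynchronous updates on a complete graph), not the authors' code:

```python
import random

def noisy_two_choices_step(opinions, alpha, rng=random):
    """One asynchronous update of the 2-choices rule with node failures on a
    complete graph. With probability alpha the updating node errs and adopts
    one of the two opinions uniformly at random; otherwise it samples two
    other nodes and adopts their common opinion only if they agree."""
    n = len(opinions)
    i = rng.randrange(n)
    if rng.random() < alpha:
        opinions[i] = rng.choice((0, 1))   # erroneous (failed) update
    else:
        j, k = rng.sample([x for x in range(n) if x != i], 2)
        if opinions[j] == opinions[k]:
            opinions[i] = opinions[j]      # sampled neighbours agree: adopt
    return opinions
```

With α = 0 this reduces to the classical 2-choices rule, under which a consensus state is absorbing; with α > 0 every configuration remains reachable, which is why consensus can only be metastable.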

Shiny Objects: Object-Centric Characterization of Chromium

Modern web browsers manage millions of dynamic objects across tabs, frames, DOM elements, and JavaScript contexts. However, fine-grained behaviors related to object allocation, lifetime, and memory usage in production browsers remain elusive. Chromium's modular and extensible design, use of specialized memory allocators, and sensitivity to instrumentation overhead further complicate precise object tracking. To this end, we develop a lightweight, thread-safe, and non-intrusive profiling framework. Using this infrastructure, we present an empirical characterization of Chromium's memory object behavior across twelve diverse, user-centric workloads. We examine object lifetime events, size diversity, spatial locality, type diversity, and memory activity, and reflect on their broader software and architectural implications. Our study offers a systems-oriented view into Chromium's architecture and memory behavior, and highlights structural challenges in efficient memory management in large-scale and diverse systems.

Stochastic Network Utility Maximization in Strategic Queueing Systems: A Game-Theoretic Approach

Stochastic Network Utility Maximization (NUM) has been a dominant framework for many queueing network resource allocation and control problems. Its original model seeks to optimize social welfare, which usually takes the form of the sum of local utilities of participating entities. However, such a centralized utility maximization approach is unsuitable for many modern multi-agent systems, in which each agent may selfishly optimize its local utility without regard to the overall utility. In this paper, we formulate the stochastic NUM problem in strategic queueing systems as a repeated game with queue stability constraints. In particular, the agents repeatedly make decisions to satisfy both their local constraints and global constraints, shared among them, while maintaining queue stability. The goal is to design a policy that constitutes a generalized Nash equilibrium (GNE) for the game.

We first derive the fluid model characterization of the strategic queueing NUM problem via a static one-shot game formulation. This characterization motivates a primal-dual algorithm that constitutes an approximate GNE by ensuring last-iterate convergence to a solution of the regularized static one-shot game. However, similar to primal-dual methods developed for the classical NUM problem, this approach does not leverage real-time queue lengths in decision making, leading to suboptimal queueing delay in practice, and offers no explicit performance guarantees. To address this, we propose the Strategic Drift-plus-Penalty (SDP) algorithm and show that it constitutes an ε-GNE and has a uniformly bounded expected queue length of order O(1/ε³) for any ε > 0. Under an additional mild assumption that holds for a wide class of problems, we show that our algorithms achieve long-term average social welfare arbitrarily close to that of a welfare-maximizing GNE policy. Simulations validate our theory and demonstrate the favorable performance of our algorithms.
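The classical single-agent drift-plus-penalty mechanism underlying this line of work can be illustrated with a toy controller (a hedged sketch with made-up parameters, not the paper's SDP algorithm): each slot, the admitted amount x maximizes V·U(x) − Q·x, trading utility against queue drift, and the virtual queue Q stays bounded by roughly V.

```python
import math
import random

def drift_plus_penalty(T=20000, V=50.0, service_prob=0.7, seed=1):
    """Toy single-queue drift-plus-penalty controller.

    Each slot admits x in [0, 1] maximizing V*log(1 + x) - Q*x,
    then a Bernoulli(service_prob) departure occurs.  Larger V
    buys utility at the cost of longer queues (the O(V) tradeoff)."""
    rng = random.Random(seed)
    Q, total_util = 0.0, 0.0
    for _ in range(T):
        # argmax of V*log(1+x) - Q*x over [0, 1] is clamp(V/Q - 1)
        x = 1.0 if Q == 0 else min(1.0, max(0.0, V / Q - 1.0))
        total_util += math.log(1.0 + x)
        served = 1.0 if rng.random() < service_prob else 0.0
        Q = max(Q + x - served, 0.0)
    return total_util / T, Q

avg_util, final_Q = drift_plus_penalty()
```

Note that admission stops whenever Q ≥ V (the argmax clamps to zero), so the queue never exceeds V + 1; this is the deterministic queue bound that the strategic, multi-agent setting of the paper generalizes.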

Sumeru: Towards Understanding and Achieving Cache-Optimal Inbound Network I/O

The slow growth of DRAM performance and ever-increasing memory bandwidth demands have made receiver-side memory a critical bottleneck for end-to-end data movement in cutting-edge data centers. Although Direct Cache Access (DCA) allows for memory-bypass I/O, existing implementations like Intel's Data Direct I/O (DDIO) have proven ineffective on 100 Gbps links, leading to a widespread belief that current processor caches are simply too small to serve modern high-speed links. This paper challenges this conclusion, arguing that the fundamental problem is not insufficient cache capacity, but inefficient cache usage. Our novel cache model reveals that software queue dynamics determine a receive buffer's path through the non-inclusive cache hierarchy (i.e., its "cache trajectory"), opening the path toward cache-optimal DRAM-bypass inbound I/O on commodity hardware through software modifications alone. Guided by the model, we design and implement Sumeru, which approaches cache-optimal I/O through four synergistic innovations: (1) a dual-path stack architecture with a shallow fast path for large flows, (2) cache-aware buffer pools enforcing optimal trajectories, (3) host-based active queue management preventing bufferbloat, and (4) trajectory-aware dynamic cache partitioning. These designs work together to consistently keep network buffers on their optimal trajectory. The result is near-100% cache hit rates on a wide range of workloads and network settings. This eliminates memory-induced intra-host congestion, improving performance for both the target throughput-bound application and co-located latency-sensitive or memory-intensive neighbors. On real-world resource-contending deployments, Sumeru achieves a Pareto improvement: It boosts SPDK NVMe/TCP goodput by up to 51.2% while simultaneously boosting co-located SPEC CPU 2017 suite scores by up to 30.1%.
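The claim that queue dynamics, not cache capacity, govern hit rate can be made concrete with a toy model (an illustrative sketch under simplified assumptions, not Sumeru's cache model): an LRU cache far smaller than the buffer pool yields near-perfect hits when the free list recycles buffers LIFO, and zero hits when it recycles them FIFO.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache model that counts hits over references."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.refs = 0

    def access(self, buf):
        self.refs += 1
        if buf in self.lines:
            self.hits += 1
            self.lines.move_to_end(buf)
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict LRU line
            self.lines[buf] = True

def recycle_hit_rate(policy, n_buffers=64, cache_capacity=8, n_packets=1000):
    """LIFO recycling keeps a hot working set resident in the cache;
    FIFO recycling cycles every buffer through it and thrashes."""
    cache = LRUCache(cache_capacity)
    free_list = list(range(n_buffers))
    for _ in range(n_packets):
        buf = free_list.pop() if policy == "lifo" else free_list.pop(0)
        cache.access(buf)  # NIC writes, then the stack reads, this buffer
        free_list.append(buf)
    return cache.hits / cache.refs

lifo_rate = recycle_hit_rate("lifo")
fifo_rate = recycle_hit_rate("fifo")
```

The cache is identical in both runs; only the software free-list discipline changes, which is the abstract's core argument in miniature.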

The Multiserver-Job Stochastic Recurrence Equation for Cloud Computing Performance Evaluation

Cloud computing data centers handle highly variable workloads: job resource requirements can range from one or a few cores to thousands, and job service times can range from milliseconds to hours or days. This variability significantly limits the maximum achievable utilization of infrastructures. Queueing theory has addressed these systems through the Multiserver-Job Queuing Model (MJQM), in which s identical servers are present and job n requires αₙ of the s servers for a random amount of time σₙ. The αₙ servers are occupied and released simultaneously. Unfortunately, despite its simple formulation, the MJQM remains elusive; for example, its stability condition has been derived only in particular cases. As a consequence, even applying Discrete-Event Simulation (DES) under high load becomes challenging, because stability cannot be determined a priori. In this paper, we analyze the MJQM with general independent arrival processes and service times under FCFS scheduling, using stochastic recurrence equations (SREs) and ergodic theory. Starting from the definition of the MJQM SRE, we prove the monotonicity and separability properties that allow us to apply an extension of Loynes' theorem, known as the monotone-separable framework, and formally define the MJQM stability condition. From these results, we introduce and implement two algorithms: the first draws sub-perfect samples (SPS) of the system's workload, and the second estimates the system's stability condition given the statistics of the jobs' input stream. The nature of the SPS algorithm allows for massive GPU parallelization, greatly improving the efficiency of performance-metric estimation. The algorithm for estimating the stability condition solves an important problem in the analysis of MJQMs. We also define new metrics that capture the synchronization loss in MJQM systems, and we show how these metrics can be efficiently evaluated using the SRE approach. Finally, we show that the approach proposed in this paper extends to more complex systems, including MJQMs where resources have types.
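The FCFS multiserver-job dynamics the model captures can be sketched with a small recursion (an illustrative simulation under simplified assumptions, not the paper's SRE or SPS algorithm): job n may not overtake job n−1, and it must additionally wait until αₙ servers are simultaneously free, which is what creates synchronization loss.

```python
import heapq

def mjqm_fcfs_start_times(arrivals, cores, services, s):
    """Start times in a multiserver-job FCFS queue with s servers.

    Job n seizes cores[n] servers simultaneously for services[n]
    time units; FCFS forbids overtaking.  Assumes cores[n] <= s."""
    busy = []          # min-heap of (finish_time, servers_held)
    free = s
    starts = []
    prev_start = 0.0
    for a, c, dur in zip(arrivals, cores, services):
        t = max(a, prev_start)          # cannot overtake job n-1
        while True:
            while busy and busy[0][0] <= t:   # release finished jobs
                free += heapq.heappop(busy)[1]
            if free >= c:
                break
            t = busy[0][0]              # wait for the next departure
        free -= c
        heapq.heappush(busy, (t + dur, c))
        starts.append(t)
        prev_start = t
    return starts

# two servers: the 1-core job leaves a server idle while the next
# 2-core job waits at the head of the line -- synchronization loss
starts = mjqm_fcfs_start_times([0, 0, 0], [2, 1, 2], [1.0, 1.0, 1.0], s=2)
```

In this example one server sits idle during the middle job's service even though work is queued, which is exactly the loss the new metrics quantify.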

Towards Scalable Storage Architectures for GPU Clusters Running Large Language Models

In modern multi-accelerator nodes, GPU throughput is increasingly constrained by storage and I/O bottlenecks, leaving accelerators idle while data transfer is throttled in software. In this study, we focus on single-node, multi-GPU systems with small-to-medium-scale models and present a systematic, phase-aware evaluation of datapath performance for Large Language Models (LLMs), covering technologies such as in-kernel libaio, hybrid user-kernel io_uring, user-space NVMe via the Storage Performance Development Kit (SPDK), and GPUDirect Storage (GDS). These approaches are evaluated across various storage media including SATA Solid State Drives (SSDs), NVMe SSDs, Optane NVMe, and Optane Persistent Memory (PMem). Leveraging an automated evaluation framework, we explore over 25,000 configurations, measuring throughput, latency, I/O operations per second (IOPS), and CPU cost.

Our study covers LLM storage scenarios drawn from both standardized benchmarks and real-world production traces, ensuring that our workload models accurately reflect the I/O demands of pre-training, fine-tuning, and inference. We find that for inference, io_uring achieves the lowest latency and competitive IOPS for small random I/O on NVMe. In contrast, SPDK is limited to raw block-device evaluation due to its lack of POSIX file-system support. For pre-training and fine-tuning, workloads are dominated by coarse-grained sequential reads and writes, where GDS excels in reducing load times and host CPU usage. Among CPU-mediated datapaths, CPU efficiency, measured as GB/s per core, emerges as the key differentiator.

Taken together, these results yield actionable design guidelines: align the choice of datapath with the LLM pipeline phase. Use io_uring for inference to optimize data transfer efficiency and minimize latency, and leverage GDS for pre-training and fine-tuning to improve throughput per core, thereby narrowing the storage-to-compute gap in GPU LLM clusters.
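The access-pattern distinction driving these guidelines (coarse sequential reads for training checkpoints versus small random reads for inference) can be probed with a minimal script. This is a hedged illustration only: it uses plain buffered POSIX pread, exercises none of libaio, io_uring, SPDK, or GDS, and on a small scratch file the page cache dominates the numbers.

```python
import os
import random
import tempfile
import time

def read_throughput(path, block_size, n_blocks, sequential=True, seed=0):
    """Time block reads over a file, in order or shuffled.

    Returns (bytes_read, MB/s).  Buffered POSIX I/O only; a real
    datapath study would pin the engine (io_uring, SPDK, GDS) and
    bypass or control the page cache."""
    offsets = [i * block_size for i in range(n_blocks)]
    if not sequential:
        random.Random(seed).shuffle(offsets)
    fd = os.open(path, os.O_RDONLY)
    total = 0
    start = time.perf_counter()
    for off in offsets:
        total += len(os.pread(fd, block_size, off))
    elapsed = time.perf_counter() - start
    os.close(fd)
    return total, total / elapsed / 1e6

# scratch file standing in for a checkpoint shard: 256 blocks of 4 KiB
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(4096 * 256))
    scratch = f.name
seq_bytes, seq_mbps = read_throughput(scratch, 4096, 256, sequential=True)
rnd_bytes, rnd_mbps = read_throughput(scratch, 4096, 256, sequential=False)
os.unlink(scratch)
```

Scaling the file beyond DRAM and swapping the read loop for an asynchronous engine is where the phase-dependent differences the abstract reports would emerge.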