SIGMETRICS Abstracts '26: Abstracts of the 2026 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems

Full Citation in the ACM Digital Library

SESSION: Session 1A: Online Learning

Learning to Allocate Reusable Resources under Stochastic Rewards and Durations

Ativ Joshi
Qingsong Liu
Mohammad Hajiesmaili

We study sequential allocation of a single reusable resource to stochastically arriving tasks in continuous time, where rewards and service durations are random with unknown statistics. When the resource is idle, the controller observes the arriving task's type and duration and must immediately accept or reject it; while the resource is busy, new arrivals are unobserved, which tightly couples learning and control. We first show that no non-anticipatory online policy can guarantee more than a 1/2 fraction of the full-information offline prophet benchmark, motivating the use of 1/2-approximate regret. To approach this limit, we derive an infinite-dimensional steady-state linear program that upper bounds the offline optimum and reveals a threshold structure: a task should be accepted when its profitability, defined as reward divided by duration plus expected waiting time, exceeds a critical threshold. Guided by this characterization, we develop online threshold-learning algorithms that achieve O(√T) 1/2-approximate regret when rewards are known and O(T^2/3) when reward functions are unknown. We further show that if durations are unobserved upon arrival, meaningful guarantees become impossible without additional flexibility, but can be recovered via task termination.

Offline Local Search for Online Stochastic Bandits

Gerdus Benade
Rathish Das
Thomas Lavastida

Combinatorial multi-armed bandits provide a fundamental online decision-making environment where a decision-maker interacts with an environment across T time steps, each time selecting an action and learning the cost of that action. The goal is to minimize regret, defined as the loss compared to the optimal fixed action in hindsight under full-information. There has been substantial interest in leveraging what is known about offline algorithm design in this online setting. Offline greedy and linear optimization algorithms (both exact and approximate) have been shown to provide useful guarantees when deployed online. We investigate local search methods, a broad class of algorithms used widely in both theory and practice, which have thus far been under-explored in this context. We focus on problems where offline local search terminates in an approximately optimal solution and give a generic method for converting such an offline algorithm into an online stochastic combinatorial bandit algorithm with O(log³ T) (approximate) regret. In contrast, existing offline-to-online frameworks yield regret (and approximate regret) which depend sub-linearly, but polynomially on T. We demonstrate the flexibility of our framework by applying it to three online stochastic combinatorial optimization problems: scheduling to minimize total completion time, finding a minimum cost base of a matroid and uncertain clustering.

Heterogeneous Multi-agent Multi-armed Bandit on Stochastic Block Models

Mengfan Xu
Liren Shan
Fatemeh Ghaffari
Xuchuang Wang
Xutong Liu
Mohammad Hajiesmaili

We study a novel heterogeneous multi-agent multi-armed bandit problem with a cluster structure induced by stochastic block models (SBMs), influencing not only graph topology, but also reward heterogeneity. Specifically, agents are distributed on random graphs based on SBMs, a generalized Erdos-Renyi model with heterogeneous edge probabilities: agents are grouped into clusters (known or unknown); edge probabilities for agents within the same cluster differ from those across clusters. The same cluster structure in SBMs also determines our heterogeneous rewards. Rewards distributions of the same arm vary across agents in different clusters but remain consistent within a cluster, unifying homogeneous and heterogeneous settings and varying degree of heterogeneity. Rewards are independent samples from these distributions. The objective is to minimize system-wide regret across all agents. To address this, we propose a novel algorithm applicable to both known and unknown cluster settings. The algorithm combines an averaging-based consensus approach with a newly introduced information aggregation and weighting technique, resulting in a UCB-type strategy. It accounts for graph randomness, leverages both intra-cluster (homogeneous) and inter-cluster (heterogeneous) information from rewards and graphs, and incorporates cluster detection for unknown cluster settings. We derive optimal instance-dependent regret upper bounds of order log T under sub-Gaussian rewards. Importantly, our regret bounds capture the degree of heterogeneity in the system (an additional layer of complexity), exhibit smaller constants, scale better for large systems, and impose significantly relaxed assumptions on edge probabilities.

A Unified Framework for Online Combinatorial Allocation of Reusable Resources under Endogenous Deterioration

Qingsong Liu
Mohammad Hajiesmaili

This paper studies an online resource allocation problem, where resources are reusable, allocated in combinatorial bundles, and subject to endogenous performance deterioration. Unlike existing models that assume singleton decisions or stationary or exogenous dynamics, we consider a combinatorial decision structure where decisions involve selecting among resource-bundled options, while capturing a self-induced feedback loop in which system performance, like rewards, activation probabilities, and service times, degrade as a function of cumulative resource usage. We propose a unified framework that subsumes a broad class of existing models as special cases, and develop a ranking-like online policy, termed IPRANK, which employs an indicator-perturbation mechanism to induce a randomized ranking over configurations based on their size-penalized rewards. Our analysis is based on a queue-coupling argument and shows that under adversarial arrivals, the greedy policy inevitably suffers a competitive ratio with linear dependence on reward heterogeneity, whereas IPRANK achieves a competitive ratio with only logarithmic dependence, making it far more robust in heterogeneous environments.

SESSION: Session 1B: Blockchain 1

Geographical Centralization Resilience in Ethereum's Block-Building Paradigms

Sen Yang
Burak Öz
Fei Wu
Fan Zhang

Decentralization has an important geographic dimension that conventional metrics, such as stake distribution, often overlook. Validator location affects resilience to regional shocks and fairness in reward access, yet in major blockchains, geographical locations are not encoded in the rules but arise from economic incentives. When some regions offer systematic advantages, validators may strategically co-locate, as seen in Ethereum. In this paper, we study validators' geographic positioning incentives under Ethereum's protocol design by modeling how its two block-building paradigms, local and external block building, interact with the geographic distribution of validators and information sources. We analyze the model under a mean-field approximation and complement it with agent-based simulations calibrated using real-world latency data.

Our results show that Ethereum's block-building architecture is not geographically neutral: both paradigms generate location-dependent payoffs and incentives to relocate closer to payoff-relevant parties in order to reduce propagation delays, although through different underlying mechanisms. Asymmetric access to information sources further amplifies geographical centralization. We also demonstrate that consensus parameters, such as attestation thresholds and slot times, modulate latency sensitivity and can amplify these effects, acting as protocol-level levers. Finally, we discuss the implications of our findings for protocol design and outline potential mitigation directions informed by our analysis.

Beyond Single-Tokenomics: How Farcaster's Pluralistic Incentives Reshape Social Networking

Wen Yang
Qiming Ye
Onur Ascigil
Saidu Sokoto
Leonhard Balduf
Michał Król
Gareth Tyson

This paper presents the first empirical analysis of how diverse token-based reward mechanisms impact platform dynamics and user behaviors. For this, we gather a unique, large-scale dataset from Farcaster. This blockchain-based, decentralized social network incorporates multiple incentive mechanisms spanning platformnative rewards, third-party token programs, and peer-to-peer tipping. Our dataset captures token transactions and social interactions from 574,829 wallet-linked users, representing 64.25% of the platform's user base. Our socioeconomic analyses reveal how different tokenomics design shape varying participation rates (7.6%–70%) and wealth concentration patterns (Gini 0.72–0.94), whereas intercommunity tipping is 1.3–2x more frequent among non-following pairs, thereby mitigating echo chambers. Our causal analyses further uncover several critical trade-offs: (1)while most token rewards boost content creation, they often fail to enhance—sometimes undermining—content quality; (2) token rewards increase follower acquisition but show neutral or negative effects on outbound following, suggesting potential asymmetric network growth; (3) repeated algorithmic rewards demonstrate strong cumulative effects that may encourage strategic optimization. Our findings advance understanding of cryptocurrency integration in social platforms and highlight challenges in aligning economic incentives with authentic social value.

Cross-Chain Arbitrage: The Next Frontier of MEV in Decentralized Finance

Burak Öz
Christof Ferreira Torres
Christoph Schlegel
Bruno Mazorra
Jonas Gebele
Filip Rezabek
Florian Matthes

Decentralized finance (DeFi) markets are fragmented across block-chains, creating price differences that arbitrage helps eliminate. Today, much of this price alignment occurs against centralized exchanges, whose deep liquidity and fast execution make them dominant venues for price discovery. As trading activity moves on-chain, cross-chain arbitrage between decentralized exchanges is becoming increasingly important. Yet despite its relevance to DeFi, prior work has largely been limited to hypothetical opportunity analyses and conceptual overviews.

We conduct a year-long empirical study of cross-chain arbitrage from September 2023 to August 2024 across nine blockchains and identify 242,535 executed arbitrages totaling USD 868.64 million in volume. Activity grows by 5.5x over the study period and surges after the Dencun upgrade on March 13, 2024. Most trades rely on pre-positioned inventory (66.96%) and settle in a median of 9s, whereas bridge-based arbitrages take 242s, highlighting the latency cost of today's bridges. Market concentration is substantial: the five largest addresses execute more than half of all trades, and one address alone captures nearly 40% of daily volume post-Dencun. Taken together, these patterns suggest that cross-chain arbitrage can encourage vertical integration, concentrating sequencing infrastructure and economic power, thereby increasing censorship and consensus security risks. Decentralizing block building and lowering entry barriers are therefore important mitigation directions.

Multiple Sides of 36 Coins: Measuring Peer-to-Peer Infrastructure Across Cryptocurrencies

Lucianna Kiffer
Lioba Heimbach
Dennis Trautwein
Yann Vonlanthen
Oliver Gasser

We present the first longitudinal, cross-network measurement study of 36 public blockchain peer-to-peer (P2P) networks, spanning nine months from late 2024. We deployed 15 active crawlers with hourly connectivity probes, performed Internet-wide scans, and decomposed shared discovery substrates to identify 19 auxiliary networks. We find that active node counts span three orders of magnitude (<10 to >10,000) and vary substantially in reachability, discovery coverage, and deployment structure. To characterize these differences, we quantify discovery effectiveness, churn dynamics, peer-table coverage, and AS- and geographic concentration (via the Herfindahl--Hirschman Index). Our results reveal pronounced heterogeneity in network structure and highlight fundamental observability limitations in widely deployed peer-discovery mechanisms. This extended abstract accompanies the full paper [10].

SESSION: Session 1C: LLM Benchmarking and Caching

Benchmarking and Characterization of Large Language Model Inference on Apple Silicon

Afsara Benazir
Felix Xiaozhu Lin

A systematic understanding of Apple Silicon is lacking in the current landscape of hardware efficiency; research focus is largely centered on accelerating GPUs for large-scale training or inference on CUDA devices. This paper investigates Apple Silicon's unique memory architecture that offers a unified memory integrating CPU and GPU memory and its implications for on-device LLM inference.

We decipher myths about whether Apple Silicon is efficient for on-device inference compared to competitors such as NVIDIA GPUs by directly conducting latency and throughput comparison benchmarks. We explain the performance gap between them through profiling low level hardware metrics - ALU utilization, memory bandwidth, buffer usage, cache residency etc. at runtime. We draw several insights regarding performance bottlenecks such as dequantization overhead, compute throughput and memory bandwidth. We debunk existing false claims regarding large language model inference such as compressing models to lower bit precision is a defacto promise for faster inference across all hardware platforms. We find that the large unified memory enables Apple Silicon to be both cost effective and efficient against NVIDIA GPUs for ultra large language models.

Our work provides a comprehensive perspective of on-device inference on commodity GPUs; our large scale evaluation on 5 hardware testbeds incorporating three Apple M-series devices: M2 Ultra, M2 Max and M4 Pro and two NVIDIA GPUs: NVIDIA RTX A6000, a multi GPU setup with 2xNVIDIA RTX A6000, 5 model scales ranging from 8B to 405B parameters and 14 quantization schemes gives an understanding of how Apple Silicon fits within the paradigm of on-device LLM inference. Our analysis reveals multiple resource interdependencies and unexpected findings, while also quantifying established insights. To the best of our knowledge, this study makes the first attempt to present a thorough characterization and analysis of Apple Silicon for on-device inference.

WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching

Xiangchen Li
Jiakun Fan
Qingyuan Wang
Dimitrios Spatharakis
Saeid Ghafouri
Hans Vandierendonck
Deepu John
Bo Ji
Ali R. Butt
Dimitrios Nikolopoulos

LLM inference is increasingly initiated at edge devices but still executed in centralized GPU clusters. Distributed speculative edge serving offloads drafting to devices, yet existing systems scale poorly under heterogeneous, latency-sensitive workloads. We identify two bottlenecks: Wasted Drafting Time (WDT), where devices over-draft tokens later rejected by the verifier, and Verification Interference, where long verifications delay short requests within a micro-batch. We present WISP, a lossless distributed speculative serving system that combines dynamic drafting with SLO-aware verification batching. WISP uses a lightweight rejection predictor, a batch-latency estimator, and an interference-aware scheduler. Across simulation and prototype experiments, WISP improves system capacity by up to 2.1× over centralized serving and 4.1× over SLED, and improves system goodput by up to 1.94× and 3.7×, while reducing token-speed SLO violations under bursty multi-tenant workloads.

UniCache: Unifying Prefix Cache Eviction for Heterogeneous LLM Serving Workloads

Bei Ouyang
Yifan Qiao
Jiarong Xing

Prefix caching is a key performance optimization in Large Language Model (LLM) serving systems, enabling reuse of attention Key-Value (KV) states across requests with shared prompt prefixes. However, the size of GPU memory limits cache capacity, making the eviction policy a critical factor in overall system performance. Existing systems primarily rely on simple heuristics, such as LRU, and apply the same policy across task categories, implicitly assuming homogeneous workloads. In practice, however, modern LLMs serve heterogeneous workloads that mix multi-turn conversational traffic with diverse single-turn API requests, leading to fundamentally different prefix reuse patterns. In this work, we first design a trace-driven prefix cache simulator built on vLLM to systematically characterize prefix reuse across representative workloads. Our analysis reveals two dominant reuse patterns---session reuse and structural reuse---that vary significantly across task types. Motivated by these observations, we propose \sys, a unified eviction policy that jointly captures both reuse patterns and dynamically balances cache allocation across tasks. When implemented in vLLM, UniCache achieves substantial improvements under heterogeneous workloads, improving prefix cache hit rates by up to 17.32% and reducing inference latency by up to 3.63× compared to existing policies.

Mapping the Landscape of LLM Deployment in the Wild: Prevalence, Patterns, and Perils

Xinyi Hou
Jiahao Han
Yanjie Zhao
Shenao Wang
Haoyu Wang

Large language models (LLMs) are increasingly deployed through open-source and commercial frameworks, enabling individuals and organizations to self-host advanced LLM capabilities. As LLM deployments become prevalent, ensuring their secure and reliable operation has become a critical issue. However, insecure defaults and misconfigurations often expose LLM services to the public internet, posing serious security risks. This study conducted a large-scale empirical investigation of public-facing LLM deployments, focusing on the prevalence of services, exposure characteristics, systemic vulnerabilities, and associated risks. Through internet-wide measurements, we identified 320,102 public-facing LLM services across 15 frameworks and extracted 158 unique API endpoints, categorized into 12 groups based on functionality and security risk. Our analysis found that over 40% of endpoints used plain HTTP, and over 210,000 endpoints lacked valid TLS metadata. API exposure was highly inconsistent: some frameworks, such as Ollama, responded to over 35% of unauthenticated API requests, with about 15% leaking model or system information, while other frameworks implemented stricter controls. We observed widespread use of insecure protocols, poor TLS configurations, and unauthenticated access to critical operations. These security risks, such as model leakage, system compromise, and unauthorized access, are pervasive and highlight the need for a secure-by-default framework and stronger deployment practices.

SESSION: Session 2A: Distributed Learning

SemanticDFL: Similarity-Aware Pull-based Personalized Decentralized Federated Learning

Javad Dogani
Mostafa Khastkhodaei
Farshad Khunjush
Nikolaos Laoutaris

Personalized decentralized federated learning (PDFL) aims to learn client-specific models from heterogeneous data without a central coordinator. However, push-based gossip over large peer graphs is graph-oblivious: it mixes semantically mismatched updates, attenuates minority signals, and assumes broad peer reachability. We present SemanticDFL, a fully decentralized, pull-based PDFL framework built on a hierarchical semantic overlay network (SON). Each client exports a compact top-P model signature; bounded peer discovery forms zones; affinity propagation clusters similar peers; and replica-backed super-peers route lightweight bounded-fanout similarity queries. Each client then pulls only its top-K most similar peers for personalized aggregation. Experiments on multiple benchmarks with dozens to hundreds of peers on the EU SLICES testbed show that SemanticDFL improves final accuracy, reaches target accuracy in substantially fewer rounds than FedAvg and the strongest PDFL baseline, while incurring only modest per-round overhead.

Analyzing Symbolic Properties for DRL Agents in Systems and Networking

Mohammad Zangooei
Jannis Weil
Amr Rizk
Mina Tahmasbi Arashloo
Raouf Boutaba

Deep reinforcement learning (DRL) has shown remarkable performance on complex control problems in systems and networking, including adaptive video streaming, wireless resource management, and congestion control. For safe deployment, however, it is critical to reason about how agents behave across the range of system states they may encounter in practice. Existing verification-based approaches in this domain primarily focus on point properties -- properties defined around fixed input states -- which offer limited coverage and require substantial manual effort to identify relevant input-output pairs for analysis.

In this paper, we study symbolic properties -- properties that specify expected behaviors over ranges of input states -- for DRL agents in systems and networking. We present a generic formulation for symbolic properties, with monotonicity and robustness as concrete examples, and show how they can be analyzed using existing Deep Neural Network (DNN) verification engines. Our approach encodes symbolic properties as comparisons between related executions of the same agent and decomposes them into practically tractable sub-properties. These techniques serve as practical enablers for applying existing verification tools to symbolic properties.

Using our framework, diffRL, we conduct an extensive empirical study across three representative DRL-based control systems -- adaptive video streaming, wireless resource allocation, and congestion control -- covering both discrete and continuous action spaces. Through these case studies, we analyze symbolic properties over broad input ranges, examine how property satisfaction evolves during training, study the impact of model size on verifiability, and compare multiple verification backends. Our results show that symbolic properties provide substantially broader coverage than point properties and can uncover non-obvious, operationally meaningful counterexamples, while also revealing practical solver trade-offs and limitations.

Further details and comprehensive experimental results are provided in the full paper [10].

SESSION: Session 2B: Caching

Performance Analysis of the Randomized SIEVE/CLOCK Cache Replacement Algorithm

Yirong Wang
Peter Desnoyers
Benny Van Houdt

SIEVE and CLOCK have attracted significant interest as cache replacement algorithms due to their simplicity in implementation and, in the case of CLOCK, frugality in memory usage. Unfortunately, they are susceptible to pathologies such as performance cliffs, wherein increasing cache capacity has little effect over a range of sizes until a workload-dependent threshold is crossed, after which hit rate improves abruptly and disproportionately.

Performance cliffs arise precisely because scan-based eviction paths create a synchronized, deterministic aging frontier in which many objects are admitted and removed in lockstep, causing catastrophic performance degradation under scan-like workloads. This observation motivates a design principle for cache replacement: randomize the eviction path to mitigate worst-case scans while keeping the per-object state minimal. Intuitively, randomization mitigates performance cliffs by dispersing eviction pressure across objects; from a queueing-theoretic perspective, this dispersion is akin to a per-object birth-death process.

We thus introduce a randomized SIEVE/CLOCK variant that uses ⌈ log₂ (K+1) ⌉ ≥ 1 access bits per cached item; it is as cheap to implement as CLOCK using a single circular array, yet remains robust to scans and amenable to a natural CTMC description, allowing accurate performance approximation via a heterogeneous mean-field model.

Compared with popular baselines and their generalizations, the randomized SIEVE/CLOCK achieves hit rates of 1.5× or more on production workloads with long scan sequences. Furthermore, though the cache hit rate improves with the number of access bits, we show that the majority of the gain is already achieved with as few as 4 access bits per cached item.

EviDex: Provenance-Weighted Evidence-Path Indexing for Fresh and Auditable Retrieval under Continuous Updates

Rui Li
Shuang Cao
Ruihua Liu
Alexandre Duprey

Continuously updated corpora make freshness a first-class systems objective, yet periodic TTL refresh introduces unavoidable stale windows. We formalize a staleness-latency-cost tradeoff and show that any periodic refresh policy has violation rate at least p_dep(τ)/2, where p_dep(τ) is the probability that a query depends on an update committed within the TTL window τ. We present EviDex, a provenance-weighted evidence-path index that continuously compacts a commit stream into intent-partitioned path buckets and supports auditable multi-hop retrieval with bounded visibility delay. On clinical and Wikipedia update streams, EviDex reduces evidence-set violation to 1.3% and 1.1% at T=15 minutes, versus 2.4% and 2.1% for the strongest streaming-aware baseline, while reducing clinical cost by 42%. In a physician-rated clinical QA study, EviDex reaches 0.884 correctness with 0.84% macro-SCER.

SESSION: Session 2C: Networking 1: Congestion Control

Can Multipath TCP Congestion Control do better without adhering to its First Design Principle?

Xuan Wang
Hao Che
Zhijun Wang
Hong Jiang

Multipath TCP (MPTCP) enables a transport connection to utilize multiple network paths simultaneously and is expected to achieve higher throughput than single-path TCP. To ensure compatibility with the existing Internet, MPTCP congestion control follows three classical design principles: (P1) an MPTCP flow should perform at least as well as TCP on its best path, (P2) it should remain fair to TCP flows, and (P3) it should pool resources across paths. However, extensive experiments consistently show that MPTCP often delivers only modest throughput improvement over best-path TCP. In this paper, we show that this phenomenon reflects a structural limitation imposed by the joint enforcement of P1–P3 under end-to-end transport. Our analysis characterizes the achievable throughput envelope of MPTCP and proves that, even under favorable conditions, the gain over best-path TCP is fundamentally bounded. Using a network utility maximization (NUM) framework, we further show that relaxing P1 enlarges the feasible allocation region and admits significantly higher-throughput operating points. Under a restricted symmetric setting, the achievable gain approaches 4-2√2 ≈ 17.2% with logarithmic utilities. Guided by this insight, we derive a NUM-optimal congestion control family, N-MPTCP(ω^→), and design a practical instance with adaptive weights that realizes these gains in packet-level simulations.

PEMI: Transparent Performance Enhancements for QUIC

Jie Zhang
Lei Zhang
Ziyi Wang
Chenxiang Sun
Yuming Hu
Xiaohui Xie
Zeqi Lai
Yong Cui

QUIC, the transport layer of the next-generation Web stack (HTTP/3), provides native security and performance improvements over TCP. However, since QUIC encrypts both payload and headers, in-network assistance like Performance-Enhancing Proxy (PEP) is unavailable for QUIC. Existing efforts seek to collaborate endpoints and middleboxes to provide in-network assistance for QUIC, which increases the deployment difficulty in the Internet.

By analyzing the QUIC standard, implementations, and the locality of application traffic, we identify opportunities for transparent middleboxes to measure RTT and infer packet loss for QUIC connections, despite the absence of plaintext ACK information. We then propose PEMI that continuously measures RTT and infers lost packets, enabling fast retransmissions for QUIC in a completely transparent manner. To maintain fairness and reduce harmful fast retransmissions during congestion, PEMI utilizes a delay-based congestion control and feedback-based CWND enforcement. Extensive evaluation results, including Mininet and trace-driven dynamic experiments, show that PEMI can significantly improve the performance of QUIC. For example, in the Mininet experiments, PEMI increases the goodput of file transfers by up to 2.5×, and reduces the 90th percentile jitter of RTC frames by 20-75%.

SESSION: Session 3A: Queueing and Scheduling

Higher-Order Approximations of Sojourn Times in M/G/1 Queues via Stein's Method

Bihan Chatterjee
Siva Theja Maguluri
Debankur Mukherjee

We study the stationary sojourn time distribution in an M/G/1 queue operating under heavy traffic. It is known that the sojourn time converges to an exponential distribution in the limit. Our focus is on obtaining pre-asymptotic, higher-order approximations that go beyond the classical exponential limit. Using Stein's method, we develop an approach based on higher-order expansions of the generator of the underlying Markov process. The key technical step is to represent higher-order derivatives in terms of lower-order ones and control the resulting error via derivative bounds of the Stein equation. Under suitable moment-matching conditions on the service distribution, we show that the approximation error decays as a high-order power of the slack parameter ε=1-ρ. Error bounds are established in the Zolotarev metric, which further imply bounds on the Wasserstein distance as well as the moments. Our results demonstrate that the accuracy of the exponential approximation can be systematically improved by matching progressively more moments of the service distribution.

Empirical Gittins for Data-driven M/G/1 Scheduling with Arbitrary Job Size Distributions

Shefali Ramakrishna
Amit Harlev
Ziv Scully

When scheduling in the M/G/1 queue to minimize mean response time, a classic result is that if job sizes are unknown, then the Gittins policy is optimal. Gittins is best described as a policy construction: it takes as input the queue's job size distribution, and it outputs a job prioritization rule that optimizes mean response time for that particular distribution. But in practice, instead of knowing the exact job size distribution, one usually only has samples from it. We therefore ask: given finitely many samples from the job size distribution, how can one construct a scheduling policy with near-optimal mean response time?

Our main result is that to achieve near-optimal mean response time, it suffices to simply apply the Gittins construction to the empirical distribution of the job size samples. We call this policy empirical Gittins, and we prove an explicit high-probability bound on its mean response time. Our bound implies convergence to the optimal mean response time as one increases the number of samples. We also show that if one has even vague knowledge of the true distribution's tail asymptotics, one can make empirical Gittins more robust using truncation, resulting in better convergence rates.

It is surprising that empirical Gittins works well even for continuous job size distributions. This is because the Gittins construction is sensitive to the distribution's density, yet the empirical distribution, being discrete, cannot possibly approximate a continuous density. Our main technical contribution is to show that despite its sensitivity to density, the Gittins construction yields a good policy as long as one gives it a distribution with an approximately correct tail, even if the density is completely wrong. Underlying this finding are two new extensions of the WINE queueing identity.

A Tale of Two Traffics: Optimizing Tail Latency in the Light-Tailed M/G/k

George Yu
Amit Harlev
Reevu Adakroy
Ziv Scully

We consider the problem of scheduling to minimize asymptotic tail latency in the M/G/k queue with light-tailed job size distribution. This problem combines the challenges of scheduling for tail latency and scheduling in multiserver queues, but there is hope. In the simpler setting of the single-server M/G/1, the recently proposed γ-Boost policy is tail constant optimal, and it has excellent empirical tail latency. And for the simpler objective of mean latency, it is known that the optimal policy in the M/G/1, namely SRPT (shortest remaining processing time), is also excellent in the M/G/k: it is provably optimal in heavy traffic and has state-of-the-art empirical performance in lighter traffic.

One might therefore hope that γ-Boost is similarly effective in the M/G/k, but our results paint a more complicated picture. In heavy traffic, we prove that γ-Boost is indeed tail constant optimal. We also prove an analogous result for scheduling with unknown sizes, where γ-Boost is replaced by its unknown-size counterpart. But in lighter traffic, we find empirically that γ-Boost can be even worse than FCFS (first-come, first-served). This is a significant shortcoming, as the boundary between ''lighter'' and ''heavy'' traffic occurs at higher load when the number of servers k is larger. To overcome this, we design a new variant of γ-Boost that outperforms the original by, counterintuitively, giving more priority to larger jobs. The new variant, which we prove is also heavy-traffic optimal, has state-of-the-art empirical tail latency at lighter loads, outperforming even a much more computationally intensive mixed-integer-programming heuristic.

Outperforming Multiserver SRPT at All Loads

Izzy Grosof
Daniela Hurtado-Lange

A well-designed scheduling policy can unlock significant performance improvements with no additional resources. Multiserver SRPT (SRPT-k) is known to achieve asymptotically optimal mean response time in the heavy traffic limit, as load approaches capacity. No better policy is known for the M/G/k queue in any regime. We introduce a new policy, SRPT-Except-k+1 & Modified SRPT (SEK-SMOD), which is the first policy to provably achieve lower mean response time than SRPT-k. SEK-SMOD outperforms SRPT-k across all loads and all job size distributions. The key idea behind SEK-SMOD is to prioritize large jobs over small jobs in specific scenarios to improve server utilization, and thereby improve the response time of subsequent jobs in expectation. Our proof is a novel application of hybrid worst-case and stochastic techniques to relative analysis, where we analyze the deviations of our proposed SEK-SMOD policy away from the SRPT-k baseline policy. Furthermore, we design Practical-SEK (a simplified variant of SEK-SMOD) and empirically verify the improvement over SRPT-k via simulation.

SESSION: Session 3B: Storage & Memory

Optimizing Storage Overhead of User Behavior Log for ML-embedded Mobile Apps

Chen Gong
Yan Zhuang
Zhenzhe Zheng
Yiliu Chen
Sheng Wang
Fan Wu
Guihai Chen

Machine learning (ML) models are increasingly integrated into modern mobile apps to enable personalized and intelligent services. These models typically rely on massive input features derived from historical user behaviors to capture user intents. However, as ML-driven services become more prevalent, recording necessary user behavior data imposes high storage cost on mobile apps, leading to lower system responsiveness and more app uninstalls. In this work, we present AdaLog system to improve the storage efficiency of user behavior log in ML-embedded mobile apps, without compromising model inference accuracy or latency. We identify two key inefficiencies in current industrial practices of user behavior log: (1) redundant logging of overlapping behavior data across different features and models, and (2) sparse storage caused by storing behaviors with heterogeneous attribute descriptions in a single log file. To solve these issues, AdaLog first formulates the elimination of feature-level redundant data as a maximum weighted matching problem in hypergraphs, and proposes an efficient designs an incremental update mechanism to minimize the I/O operations needed for adapting outdated behavior log. We implement a prototype of AdaLog and evaluate it in popular mobile apps in collaboration with our industry partner. Evaluations on real-world user data show that AdaLog reduces behavior log size by 19% to 44% with minimal system overhead (only 2 s latency and 15 MB memory usage), providing a more efficient data foundation for broader adoption of on-device ML.

Experimental Study on System-Level Performance Impact of Read Disturbance in Modern SSDs

Yonggon Park
Hyunuk Cho
Onur Mutlu
Sungjin Lee
Jisung Park

This work investigates the system-level performance impact of read disturbance in modern NAND flash-based SSDs, aiming to provide new insights that can help develop better storage architectures and optimize system software. Read disturbance in NAND flash memory has gained growing attention as a major reliability concern due to its rapidly increasing impact, which can significantly affect system I/O performance by exacerbating reliability-management overheads. Although a large body of prior work has focused on device-level characterizations and optimizations, the system-level performance impact of read disturbance still remains largely uninvestigated. To address this gap, this work conducts a rigorous experimental study using 15 modern SSDs in two ways. First, we comprehensively analyze the system-level performance impact of read disturbance under diverse workloads and operating conditions. Second, to highlight the importance of efficient read-disturbance management, we showcase a new possible SSD-performance attack as a case study, demonstrating that an adversary can significantly degrade the I/O performance of other concurrently running processes by exploiting read disturbance alone. Based on our experimental study, we make 16 new observations and 7 takeaway lessons, which we hope to help and inspire future work to develop NAND flash-based storage systems more robust to read disturbance.

cPMEM: A Design of Byte-Addressable Persistent Memory with Compute Express Link for Data Center Applications

Gaocong Liu
Mengting Lu
Kun Wang
Feng Zhu
Shu Li

Persistent memory has emerged as a groundbreaking solution for byte-addressable storage-class memory (SCM). While persistent memory is highly desirable due to its advantages in high capacity, performance, and cost-effectiveness, novel system architecture solutions are required for its adoption in data center applications that demand high performance and reliability. The major challenges are issues related to persistent memory media, such as endurance, retention, and reliability, along with hardware and software system integration within the data center platform. To reduce the total service cost for end users by utilizing persistent memory media as the main memory, we have developed a comprehensive system comprising persistent memory chips, a persistent memory controller, firmware, and system software layers that effectively mitigate persistent memory issues and are compatible with data center applications.

In this paper, we discuss the system design of a high-performance byte-addressable persistent memory with a Compute Express Link (CXL) interface. This design choice was based on real-world persistent memory limitations. It was implemented on Application-Specific Integrated Circuit (ASIC) platform, and tested on in-memory database (IMDB) applications. Bandwidth and latency results show that our design approach for CXL persistent memory offers comparable performance to CXL dynamic random access memory (DRAM) under certain test cases while providing increased memory capacity and reduced cost.

Our main contributions are summarized as follows:

(1) We have conducted a systematic study on persistent memory media issues and have proposed a hardware and software architecture that ensures reliable and efficient utilization of persistent memory media. Media management algorithms for endurance, disturbance, retention, and drifting have been developed and validated.

(2) We have pioneered the first SCM-based data-center CXL memory. By utilizing the CXL interface, the barrier to adapting persistent memory to server platforms in data centers has been largely reduced, as it uses Peripheral Component Interconnect Express (PCIe) interconnect slots instead of a limited number of DIMM slots. Our persistent memory solution with the CXL interface offers enhanced compatibility across various platforms and unlocks the potential for memory pooling.

(3) We have developed a software path that integrates persistent memory into our server systems, spanning from the application layer to the driver and firmware layers. To evaluate its performance, we have conducted benchmarking tests using in-memory databases as the application and compared the results with those of CXL DRAM devices and Intel's Optane product.

CAPSULE: A Storage Prefetcher Harnessing Spatio-Temporal Locality for Cloud-Scale Workloads

Agung Rahmat Ramadhan
Seehwan Yoo
Jongmoo Choi

Effective prefetching is essential in modern storage systems to reduce I/O latency and improve cache hit rates, especially under challenging access patterns. While temporal locality has been the dominant foundation for most prefetching algorithms, real-world storage workloads often exhibit long stack distances (LSD), where blocks are reused only after millions of other accesses. These patterns weaken the effectiveness of conventional prefetchers, even when strong spatial locality is present.

In this paper, we present CAPSULE (Clustering-Assisted Prefetching Scheme Utilizing Locality Exploration), a novel prefetching framework that exploits spatial locality through adaptive clustering. By dynamically grouping logical block addresses (LBAs) and prefetching across neighboring clusters, CAPSULE effectively bridges the gap left by temporal-only approaches.

We evaluate CAPSULE across 729 real-world workloads drawn from five major benchmark suites (MSR, CloudPhysics, Tencent CBS, Alibaba Block and Meta Tectonic), reflecting diverse cloud-scale environments. CAPSULE improves cache hit rates by up to 6.1× and achieves up to 1.89× speedup in task completion time, outperforming traditional and learned prefetchers alike. Our results demonstrate that CAPSULE is particularly well-suited for modern cloud storage systems, where massive working sets and temporal locality erosion are increasingly the norm.

SESSION: Session 3C: Learning-augmented and Sequential learning

Sequential Fair Allocation With Replenishments: A Little Envy Goes An Exponentially Long Way

Chido Onyeze
Sean R. Sinclair
Chamsi Hssaine
Siddhartha Banerjee

We study the trade-off between envy and inefficiency in repeated resource allocation settings with stochastic replenishments, motivated by real-world systems such as food banks and medical supply chains. Specifically, we consider a model in which a decision-maker faced with stochastic demand and resource donations must trade off between an equitable and efficient allocation of resources over an infinite horizon. The decision-maker has access to storage with fixed capacity M, and incurs efficiency losses when storage is empty (stockouts) or full (overflows). We provide a nearly tight (up to constant factors) characterization of achievable envy-inefficiency pairs. Namely, we introduce a class of Bang-Bang control policies whose inefficiency exhibits a sharp phase transition, dropping from Θ(1/M) when Δ = 0 to e^{-Ω(Δ M)} when Δ > 0, where Δ is used to denote the target envy of the policy. We complement this with matching lower bounds. Our results demonstrate that envy-inefficiency trade-offs not only persist in settings with dynamic replenishment, but are shaped by the decision-maker's available capacity, and are therefore qualitatively different compared to previously studied settings with fixed supply.

Prediction-Specific Design of Learning-Augmented Algorithms

Sizhe Li
Nicolas Christianson
Tongxin Li

Algorithms with predictions have emerged as a powerful framework to combine the robustness of traditional online algorithms with the data-driven performance benefits of machine-learned (ML) predictions. However, most existing approaches in this paradigm are overly conservative, as they do not leverage problem structure to optimize performance in a prediction-specific manner. In this paper, we show that prediction-specific performance criteria can enable significant improvements over the coarser notions of consistency and robustness considered in prior work. We propose a notion of strongly-optimal algorithms with predictions, which obtain Pareto optimality not just in the worst-case tradeoff between robustness and consistency, but also in the prediction-specific tradeoff between these metrics. We develop a general bi-level optimization framework that enables systematically designing strongly-optimal algorithms in a wide variety of problem settings, and we propose explicit strongly-optimal algorithms for several classic online problems: deterministic and randomized ski rental, and one-max search. Our analysis reveals new structural insights into how predictions can be optimally integrated into online algorithms by leveraging a prediction-specific design. Empirical evaluations on dynamic power management and VIX trading validate the practical benefits of our framework.

Online Smoothed Demand Management

Adam Lechowicz
Nicolas Christianson
Mohammad Hajiesmaili
Adam Wierman
Prashant Shenoy

We introduce and study a class of online problems called online smoothed demand management ( OSDM ), motivated by paradigm shifts in grid integration and energy storage for large energy consumers such as data centers. In OSDM , an operator makes two decisions at each time step: an amount of energy to be purchased, and an amount of energy to be delivered (i.e., used for computation). The difference between these decisions charges (or discharges) the operator's energy storage (e.g., a battery). Two types of demand arrive online: base demand, which must be covered at the current time, and flexible demand, which can be satisfied at any time steps before a demand-specific deadline Δ_t. The operator's goal is to minimize a cost (subject to the constraints above) that combines a cost of purchasing energy, a cost for delivering energy (if applicable), and smoothness penalties on the purchasing and delivery rates to discourage fluctuations and encourage ''grid healthy'' decisions. OSDM generalizes several problems in the online algorithms literature while being the first to fully model applications of interest. We propose a competitive algorithm for OSDM called PAAD (partitioned accounting & aggregated decisions) and show it achieves the optimal competitive ratio. To overcome the pessimism typical of worst-case analysis, we also propose a novel learning framework for OSDM that provides guarantees on the worst-case competitive ratio (i.e., to provide robustness against nonstationarity) while allowing end-to-end differentiable learning of the best algorithm on historical instances of the problem. We evaluate our algorithms in a case study of a grid-integrated data center with battery storage, showing that PAAD effectively solves the problem and end-to-end learning achieves substantial performance improvements compared to PAAD .

Green Bin Packing

Jackson Bibbens
Cooper Sigrist
Bo Sun
Shahin Kamali
Mohammad Hajiesmaili

The online bin packing problem and its variants are regularly used to model server allocation problems. Modern concerns surrounding sustainability and overcommitment in cloud computing motivate bin packing models that capture costs associated with highly utilized servers. In this work, we introduce the green bin packing problem, an online variant with a linear cost β for filling above a fixed level G. For a given instance, the goal is to minimize the sum of the number of opened bins and the linear cost. We show that when β łe 1/G, classical online bin packing algorithms such as FirstFit or Harmonic perform well, and can achieve competitive ratios lower than in the classic setting. However, when β > 1/G, new algorithmic solutions can improve both worst-case and typical performance. We introduce variants of classic online bin packing algorithms and establish theoretical bounds, as well as test their empirical performance.

SESSION: Session 3D: Substainability

Characterizing Performance–Energy Trade-offs of Large Language Models in Multi-Request Workflows

Md. Monzurul Amin Ifath
Israat Haque

Large language models (LLMs) are increasingly deployed in applications forming multi-request workflows like document summarization, search-based copilots, and multi-agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inferences. Existing measurement and benchmarking efforts either focus on assessing performance of LLM inference systems or consider single-request evaluations, overlooking workflow dependencies and cross-request interactions unique to multi-request workflows. Moreover, the energy usage of such interdependent LLM calls is not explored in-depth.

To address these gaps, this paper presents the first systematic characterization of performance--energy trade-offs in multi-request LLM inference. We develop and evaluate four representative workloads that capture sequential, interactive, agentic, and composite patterns common in modern deployments. Using an empirical NVIDIA A100 testbed with state-of-the-art serving systems (vLLM and Parrot), we systematically analyze how key energy knobs (e.g., input-output length, batch size, and GPU power cap) reshape latency, throughput, and component-level (e.g., CPU, GPU, and DRAM) energy use. Our findings reveal that batch size is the most impactful lever, though its benefits are highly workload dependent. While optimal batching benefits workloads with large shared prompts, it is ineffective for sequential summarization and only partially effective for mutli-agent coding. GPU power capping provides modest but predictable savings, while output length induces linear energy scaling with limited efficiency gains. We further demonstrate that engine-level optimizations in vLLM (e.g., continuous batching, PagedAttention) maintain higher GPU utilization and efficiency, especially for decode-heavy workloads, while Parrot's workflow-aware scheduling achieves lower energy consumption under stringent power constraints. These findings offer actionable guidelines for developers and system operators in designing performance- and energy-aware LLM serving systems in emerging multi-request workflows.

Cache Your Prompt When It's Green - Carbon-aware Caching for Large Language Model Serving

Yuyang Tian
Desen Sun
Yi Ding
Sihang Liu

As large language models (LLMs) become widely used, their environmental impact — especially carbon emission — has attracted more attention. Prior studies focus on compute-related carbon emissions. In this paper, we find that storage is another key contributor. LLM caching, which saves and reuses KV caches for repeated context, reduces operational carbon by avoiding redundant computation. However, this benefit comes at the cost of embodied carbon from high-capacity, high-speed SSDs. As LLMs scale, the embodied carbon of storage grows significantly. To address this tradeoff, we present GreenCache, a carbon-aware cache management framework that dynamically derives resource allocation plans for LLM serving. GreenCache analyzes the correlation between carbon emission and SLO satisfaction, reconfiguring the resource over time to keep the balance between SLO and carbon emission under dynamic workloads. Evaluations from real traces demonstrate that GreenCache achieves an average carbon reduction of 15.1 % when serving Llama-3 70B in the FR grid, with reductions reaching up to 25.3 %, while staying within latency constraints for > 90 % of requests.

From Measurement to Emissions: Assessing the Carbon Footprint of Traffic Flows

Sawsan El-Zahr
Noa Zilberman

As sustainability becomes a requirement in network operations, accurately quantifying the carbon footprint of Internet traffic is essential. While energy-aware networking has seen significant attention, the ability to trace carbon emissions at the flow-level remains an open challenge due to the complexity of shared infrastructure and lack of appropriate telemetry. In this paper, we present a methodology to obtain fine-grained per-flow carbon emissions from traffic statistics. To this end, we collect power measurements from three switches under varying traffic conditions, including synthetic and real-world traces. From these measurements, we derive a regression model that accurately estimates instantaneous router power consumption using only throughput and packet rate counters--achieving >96% accuracy across all switch types and traces. We then extend this model to compute per-flow carbon emissions, distinguishing between consequential and attributional perspectives, and validate the results using traces from CAIDA and Google services. Our findings uncover actionable insights into how flow and network characteristics such as packet size, packet rate, and network utilization influence carbon cost. Finally, we propose feasible deployment strategies for flow-level carbon estimation frameworks. This work provides a foundational step towards enabling carbon-aware flow-level decision-making for users, applications, and network operators.

HealthServe: A Public Health-Aware Inference Framework

Sicheng Zhou
Rohan Basu Roy

Large language model (LLM) inference consumes substantial datacenter energy and produces both carbon emissions and criteria pollutants (PM2.5, SO2, NOx). These criteria pollutants impact public health depending on the meteorological conditions of emission regions and population exposure. We apply health impact assessment to LLM inference, a use case of growing concern due to its rapidly rising energy consumption, and characterize health impacts from datacenter energy consumption as quantifiable health cost metrics. We then present HealthServe, a systems-level framework for health-aware LLM inference serving, which jointly minimizes carbon emissions and health costs across geo-distributed GPU clusters while meeting latency SLOs. HealthServe employs a hierarchical scheduling architecture with Social Cognitive Optimization (SCO) that maintains condition-indexed decision libraries for configuration reuse under recurring operating conditions. Evaluation on a heterogeneous GPU cluster across different U.S. datacenter locations demonstrates over 50% carbon footprint and over 25% health cost reductions over the state-of-the-art energy-aware LLM inference serving solutions. Additionally, we show that HealthServe can be augmented with existing carbon-aware serving approaches to provide additive sustainability benefits.

SESSION: Session 4A: Game Theory

Learning with Episodic Hypothesis Testing in General Games: A Framework for Equilibrium Selection

Ruifan Yang
Manxi Wu

We introduce a new hypothesis testing-based learning dynamics in which players update their strategies by combining hypothesis testing with utility-driven exploration. In this dynamics, each player forms beliefs about opponents' strategies and episodically tests these beliefs using empirical observations. Beliefs are resampled either when the hypothesis test is rejected or through exploration, where the probability of exploration decreases with the player's (transformed) utility. In general finite normal-form games, we show that the learning process converges to a set of approximate Nash equilibria and, more importantly, to a refinement that selects equilibria maximizing the minimum (transformed) utility across all players. Our result establishes convergence to equilibrium in general finite games and reveals a novel mechanism for equilibrium selection induced by the structure of the learning dynamics.

Information Design for Prosocial Behavior

Alexandre Reiffers-Masson
Rajesh Sundaresan

We study the idea of information design for inducing prosocial behavior. As an example, we ground our study in the context of electricity consumption. Electricity utility providers would like to reduce the total power consumed by their residential consumer segment. Supply to this segment is often subsidized, and the saved power can be diverted to more profitable segments. Alternatively, the provider may be keen on earning carbon credits by inducing reduced consumption in this segment. The societal network in which the consumers reside may have a prevailing norm, for example, saving power is environment friendly and is considered good. Those that consume less are considered more prosocial and may derive a larger reputational benefit. How can the service provider, who is familiar with the prevailing norm and the consumption of all users, design suitable feedback signals that exploit reputation benefits to reduce net consumption? We call this a problem of information design and address this question in this paper. We consider a continuum of agents. Each agent has a different intrinsic motivation to reduce her power consumption. Each agent models the power consumption of the others via a distribution. Using this distribution, agents will anticipate their reputational benefit and choose a power consumption by trading off their own intrinsic motivation to do a prosocial action, the cost of this prosocial action and their reputation. We assume that the service provider can provide two types of quantized feedbacks of the power consumption. We study their advantages and disadvantages. For each feedback, we characterize the corresponding mean field equilibrium, using a fixed point equation. Besides computing the mean field equilibrium, we highlight that, in some situations, revealing less information can lead to more prosociality. We illustrate the result on London smart-meter data. We also introduce the notion of privacy and provide a new quantized feedback respecting agents' privacy concerns yet improving prosociality. The results of this study are not restricted to the framework of energy efficiency but are also applicable to congestion problems in road traffic and other resource sharing problems.

The Price of Strategic Information: Degree-Based Bounds on Inefficiency in Distributed Learning Games

Tuan Ngoc Do
Hai Duc Nguyen
Khanh Quoc Nguyen
Hiep Quang Dao
Anh Duc-Duy Nguyen
Luong Cong Nguyen

We study the Strategic Information Game (SIG), a distributed learning model over a fixed graph in which useful transfer on an edge requires bilateral commitment. Our central result is a sharp degree-based law: for every connected graph, the Price of Strategic Information (PoSI) satisfies PoSI(G) ≤ 1 + d_max, where d_max = max_i∈V d_i is the maximum degree, and in the negligible-cost regime this becomes exact for regular graphs. We also prove equilibrium existence, an exact harmonic-mean formula for arbitrary graphs when communication and competition costs vanish, polynomial-time topology design under edge budgets, no-regret convergence to coarse correlated equilibria, and an impossibility theorem showing that transfer-free reciprocal protocols cannot eliminate the all-zero equilibrium.

Nash Equilibria in Uniform Price Auctions: Theory, Computation, and Market Insights

Yifan Feng
Negin Golrezaei
Qinzhen Li

We study pure-strategy Nash equilibria in multi-unit auctions under the K-th price rule. Despite their widespread use in resource allocation markets such as treasury bond sales, electricity markets, and carbon credit trading, these auctions remain poorly understood due to their combinatorial complexity and bidders' strategic incentives for bid shading. In this paper, we provide a complete characterization of the set of allocations and clearing prices that can arise in equilibrium. This enables us to develop a mixed-integer programming formulation to compute Nash equilibrium outcomes and select those with desirable properties, such as welfare-maximizing or revenue-maximizing outcomes. We further extend our analysis to incorporate reserve-price optimization. Finally, we use our characterizations to study market stability. Numerical experiments show that markets with fewer bidders and coarser bid grids are more prone to tacit collusion. These inefficiencies can be mitigated by increasing the number of bidders to enhance competition or by using reserve prices to curb revenue loss. We also examine market dynamics in repeated multi-unit auctions. Simulations reveal that despite bidders having private valuations, adaptive bidding algorithms often converge to pure-strategy Nash equilibrium, which underscores the relevance of pure-strategy Nash equilibrium analysis in understanding the dynamics and efficiency of multi-unit auctions.

Guiding the Recommender: Information-Aware Auto-Bidding for Content Promotion

Yumou Liu
Zhenzhe Zheng
Jiang Rong
Yao Hu
Fan Wu
Guihai Chen

Modern content platforms increasingly rely on paid promotion services to mitigate the cold-start problem for new creators and posts. However, our empirical analysis reveals a critical paradox: while promotional campaigns successfully rescue low-to-medium quality content, they frequently degrade the long-term organic performance of high-quality content by forcing distribution to suboptimal audiences, thereby polluting engagement signals and downgrading future recommendation rankings. We reconceptualize content promotion as a dual-objective optimization problem that balances short-term value acquisition with long-term platform model improvement. To make this formulation tractable in millisecond-scale auctions, we introduce a decomposable surrogate objective called gradient coverage, which maximizes the similarity between acquired impression gradients and a representative validation set, and we formally link it to Fisher Information and optimal experimental design. We design a two-stage auto-bidding algorithm grounded in Lagrange duality that dynamically paces budgets through a shadow price while optimizing per-impression bids using marginal utilities. To resolve the missing-label challenge at bid time, we propose a confidence-gated gradient heuristic and a zeroth-order variant for black-box models. We provide rigorous theoretical guarantees, proving monotone submodularity of the composite objective, sublinear regret in online first-price auctions, and budget feasibility. Extensive offline evaluations on synthetic and real-world datasets demonstrate that our framework consistently outperforms baselines, achieves superior final AUC and LogLoss, strictly adheres to budget targets, and remains robust under gradient approximation. This work establishes a principled paradigm for strategic, information-aware content promotion that enhances long-term model performance and creator outcomes beyond naive impression maximization.

SESSION: Session 4B: HPC & GPU Computing

Minos: Systematically Classifying Performance and Power Characteristics of GPU Workloads on HPC Clusters

Rutwik Jain
Yiwei Jiang
Matthew D. Sinclair
Shivaram Venkataraman

As large-scale HPC compute clusters increasingly adopt accelerators such as GPUs to meet the voracious demands of modern workloads, these clusters are increasingly becoming power constrained. Given that modern applications can often temporarily exceed the power ratings of the accelerators (''power spikes''), current and future HPC systems must optimize for both power and performance together. However, this is made difficult by increasingly diverse applications, which often require bespoke optimizations to run efficiently on each cluster. Traditionally researchers overcome this problem by profiling applications on specific clusters and optimizing, but the scale, algorithmic diversity, and lack of effective tools make this challenging. To overcome these inefficiencies, we propose Minos, a systematic classification mechanism that identifies similar application characteristics via low-cost profiling for power and performance. This allows us to group similarly behaving workloads into a finite number of distinct classes and reduce the overhead of extensively profiling new workloads. For example, when predicting frequency capping behavior for a previously unseen application, Minos reduces profiling time by 89%. Moreover, across 18 popular graph analytics, HPC, HPC+ML, and ML workloads, Minos achieves a mean error of 4% for power predictions and 3% for performance predictions, significantly improving predictions over state-of-the-art approaches by 10%.

LPGSim: A Lightweight Parallel GPU Simulator Maximizing Speed with Trustworthy Simulation

Hyunwoo Nam
Jay Hwan Lee
Yeonsoo Kim
Mengzhao Zhang
Jeonggeun Kim
Bernd Burgstaller

In designing an effective GPU for large-scale workloads, a trustworthy and fast simulator is required to evaluate performance and explore the design space. However, existing GPU simulators suffer from long execution times due to detailed component simulation, limiting their utility for evaluating the effects of architectural modifications. It is necessary to improve the performance of a GPU simulator such that quick architecture exploration and evaluation for large-scale workloads are available, at the expense of accuracy. This paper presents LPGSim, a trace-driven and cycle-level GPU simulator. LPGSim aims to provide fast and accurate GPU simulation. To this end, LPGSim first eliminates instruction metadata that has minimal impact on simulation accuracy. Next, LPGSim parallelizes GPU simulation. LPGSim partitions a GPU architecture into three parallelizable subsystems and introduces local-clock-based parallelization. LPGSim further employs parallelization methods to achieve scalability on NUMA systems with an acceptable trade-off in accuracy. LPGSim shows 21.4%, 23.3%, and 22.7% errors across three different GPU architectures, while achieving a total 197.4x speedup over the state-of-the-art simulator.

Towards Scalable Storage Architectures for GPU Clusters Running Large Language Models

Ali Sedaghatgoo
Reza Salkhordeh
André Brinkmann
Hossein Asadi

Accelerators in multi-GPU nodes increasingly sit idle due to storage and I/O inefficiencies. We present a phase-aware evaluation of storage datapaths for Large Language Models (LLMs) on single-node, multi-GPU systems, comparing libaio, io_uring, Storage Performance Development Kit (SPDK), and GPUDirect Storage (GDS) across SATA SSDs, NVMe SSDs, Optane NVMe, and Optane Persistent Memory. Using an automated framework, we test over 25,000 configurations and measure throughput, latency, Input Output Per Second (IOPS), and CPU overhead using workloads derived from real LLM traces spanning pre-training, fine-tuning, and inference.

For inference workloads dominated by small random I/O, io_uring provides the best latency and competitive IOPS on NVMe devices. For pre-training and fine-tuning, which rely on large sequential transfers, GDS significantly reduces model load times and CPU usage. These findings show that optimal storage datapaths are phase-dependent and offer practical guidance for narrowing the storage--compute gap in GPU-based LLM systems.

cuPTW: Leveraging Idle Compute Units for Massively Parallel GPU Page Table Walks

Zihang Chen
Tianao Ge
Lieven Eeckhout
Hongyuan Liu
Jiayi Huang

Virtual memory is essential for modern GPUs, yet address translation remains a major bottleneck. In irregular workloads, frequent TLB misses and costly page table walks (PTWs) dominate memory latency. We propose Compute Unit Page Table Walk (cuPTW), which repurposes idle GPU compute resources to accelerate PTWs. By offloading requests to idle functional units, cuPTW transforms address translation into a massively parallel task. We further optimize this with cuPTW-SW, which caches walks in local data store (LDS) memory, and cuPTW-MT, which parallelizes walks across SIMD lanes. Evaluation shows that cuPTW-Full achieves a 4.43× average speedup (up to 76.09×) and improves PTW throughput by 9.92×. Compared to state-of-the-art designs, Marching Page Walks and SnakeByte, cuPTW-Full delivers a 2.08× and 1.97× performance gain, respectively.

RELL-STC: An Efficient SpMM Extension of Sparse Tensor Core with Re-structured ELLPACK and Gustavson Dataflow

Jianfeng Cui
Bo Yuan
Zekun Jiang
Kai Lu
Sheng Liu

Sparsity is crucial for deep neural networks, reducing storage and bandwidth while improving computational efficiency. With the rapid development of deep neural networks, sparse computation accelerators have become a key research focus. NVIDIA introduced 2:4 fine-grained structured sparsity in the Ampere architecture, doubling Tensor Core peak throughput by retaining two out of four elements, but this may compromise model accuracy. In contrast, unstructured sparsity achieves maximum sparsity without accuracy loss. However, the irregular distribution of non-zero elements (NNZs) in unstructured sparse matrices causes complex memory access and load imbalance, preventing direct use of GPU Tensor Cores. Thus, computation relies on CUDA cores via sparse libraries like cuSPARSE, underutilizing Tensor Cores' capabilities.

To address this, we extend the Tensor Core architecture based on Gustavson's dataflow and propose a re-structured row offline compression format for inference. This format integrates with the extended architecture to enable efficient unstructured sparse-dense matrix multiplication. Experiments show negligible area and power overheads, with an average speedup of 3.54x over cuSPARSE and 1.2x over state-of-the-art Tensor Core extensions.

SESSION: Session 4C: Networking 2

MATER: Mutually Aware Framework for Teleoperated-robot with Extended Reality

Ziliang Zhang
Cong Liu
Hyoseung Kim

Robot teleoperation with extended reality (XR teleoperation) enables intuitive interaction by mapping user motions to remote robots with real-time 3D feedback. However, existing systems suffer from large completion delays and trajectory deviations under prolonged network latency, rooted in their exclusive reliance on network communication and strict synchronous execution architecture. Moreover, network fluctuations destabilize teleoperation accuracy, while dynamic user motions amplify teleoperation errors.

We present MATER, an end-to-end XR teleoperation framework that introduces a mutually-aware architecture in which each side reconstructs its counterpart's delayed or missing state to decouple the execution from network dependency. MATER includes latency-adaptive input window and user motion gap interpolation techniques to handle unstable network communication. It also proposes motion-driven robot state rollback and robot trajectory coordination to handle complex motions. The key idea behind these techniques is to adapt on the fly by reshaping reconstruction and filling gaps as network conditions fluctuate, and by realigning states when motions become fast or complex. Together with lightweight local synchronization and bandwidth optimizations, these system-level advances make MATER resilient to both network and motion dynamics.

We implement MATER across three hardware settings, including simulated and physical robots, and evaluate it on 9,500 real-world teleoperation trials from the RoboSet dataset, covering single- and multi-step missions. Compared to state-of-the-art XR teleoperation frameworks, MATER reduces teleoperation error by up to 69.8% on WLAN and 73.1% on cellular networks with only 6.7% maximum runtime overhead. It also shortens mission completion time by up to 47.7%, enabling smoother teleoperation. A real-world case study on ten stationary and mobile missions further shows MATER achieves up to 37.7% faster completion while lowering average teleoperation error by up to 57.2%. MATER code is available at: https://github.com/rtenlab/mater

Assessing Resilience in Authoritative DNS Infrastructure Supporting Government Services

Agung Septiadi
Minzhao Lyu
Hassan Habibi Gharakheili
Vijay Sivaraman

Online government services are increasingly regarded as critical national functions, yet their supporting infrastructures remain vulnerable to outages caused by natural disasters, geopolitical tensions, and targeted attacks. Central to their operation is the authoritative Domain Name System (DNS) infrastructure, the single source of truth that maps government domain names to service endpoints. In this paper, we introduce a comprehensive assessment framework to systematically evaluate the operational resilience of authoritative DNS infrastructure per government service. Our first contribution develops a data schema that characterizes a domain's authoritative DNS infrastructure across four hierarchical levels: entire hosting infrastructure, server functionality, name servers, and hosting instances. Our second contribution identifies resilience attributes at their finest hierarchy across three operational phases, namely infrastructure placement, service configuration, and DNS record dispatch. Numerical scores are assigned and aggregated algorithmically to enable consistent and cross-domain comparisons. Lastly, we apply our method to government domains in six representative countries.

CAP: Detecting Network Device Misconfigurations with Context-Aware Prompting of LLMs

Xi Jiang
Aaron Gember-Jacobson
Nick Feamster

Traditional tools for verifying network configurations: (1) require high manual effort or lack semantic depth, and (2) lack interpretable, human-readable explanations to support diagnosis. Large Language Models (LLMs) present a promising alternative, but simple approaches like full-file, partition-based, or chain-of-thought prompting fail to handle the large, interconnected nature of configuration files. We introduce the Context-Aware Prompting (CAP) framework, which enables an LLM to reason about network configurations like a human expert. CAP first identifies all potentially relevant neighboring, similar, and referenced configuration segments. CAP then engages the LLM in a structured dialogue, where the model requests the specific context it needs before performing a focused analysis. This on-demand approach provides the necessary context for deep reasoning while avoiding overload. Our evaluation on production network configurations demonstrates that CAP reliably detects a wide range of known and previously unknown configuration errors. CAP outperforms existing LLM-based baselines, and achieves performance comparable to state-of-the-art non-LLM tools.

Janus: A Dual-Mask Attention Transformer for Log-based Anomaly Detection in Cellular Networks

Umakant Kulkarni
Sonia Fahmy

Modern cellular networks, such as 5G networks, generate complex operational logs that challenge traditional anomaly detection techniques. Existing deep learning approaches, including standard transformer models, treat logs as monolithic text streams and lack the specialization to reason about procedural correctness and semantic integrity, a key requirement for telecommunications software. We tackle this problem in our system Janus, a log-based anomaly detection system featuring a novel Single-Pass Dual-Mask (SPDM) attention mechanism. Janus introduces a domain-specific inductive bias by partitioning attention heads into two groups. Global heads learn the valid temporal grammar of 5G procedures using a causal mask, and local heads perform fine-grained audits on the consistency of critical data fields using a tag-based semantic mask. A multi-stage curriculum learning framework progressively adapts Janus from domain pre-training to discriminative fine-tuning and learns to detect complex, real-world software failures. Experimental evaluation with several 5G log datasets demonstrates that Janus consistently outperforms prior systems, achieving on average a 3× performance improvement over a DNN-based baseline and an 80% gain over a transformer-based system.

Dynamic SLA-aware Network Slice Monitoring

Niloy Saha
Mina Tahmasbi Arashloo
Nashid Shahriar
Raouf Boutaba

Next-generation networks increasingly rely on network slices — logical networks tailored to specific application requirements, each with distinct Service-Level Agreements (SLAs). Ensuring compliance with these SLAs requires continuous, real-time monitoring of end-to-end performance metrics for each slice, within a limited telemetry budget. However, we find that existing solutions face two fundamental limitations: they either lack end-to-end visibility (e.g., sketches, probabilistic sampling) or provide visibility but lack the control mechanisms to dynamically allocate monitoring resources according to slice SLAs.

We address this through a formal framework that reframes slice monitoring as a closed-loop control problem, and defines the minimal data plane requirements for SLA-aware slice monitoring via a telemetry primitive contract. We then present SliceScope, a realization of this framework that combines: (1) a control strategy that dynamically allocates the monitoring resources across diverse slices according to their SLA criticality, and (2) a data-plane based on change-triggered INT that provides per-packet end-to-end visibility with tunable accuracy-overhead trade-offs, satisfying the telemetry contract. Our evaluation results on programmable switches and in large-scale simulations with a mixture of different slice types, demonstrate that SliceScope tracks critical slices up to 4× more accurately compared to static baselines, while showing that change-triggered INT outperforms alternative primitives for realizing the telemetry primitive contract.

SESSION: Session 5A: Queueing

Stochastic Network Utility Maximization in Strategic Queueing Systems: A Game-Theoretic Approach

Quang Minh Nguyen
Randall Berry
Eytan Modiano

Stochastic Network Utility Maximization (NUM) has been a dominant framework for many queueing network resource allocation and control problems. Its original model seeks to optimize social welfare, which usually takes the form of the sum of local utilities of participating entities. However, such a centralized utility maximization approach is unsuitable for many modern multi-agent systems, in which each agent may selfishly optimize its local utility without regard to the overall utility. In this paper, we formulate the stochastic NUM problem in strategic queueing systems as a repeated game with queue stability constraints. Based on the fluid model characterization of the strategic queueing NUM problem, we develop a a primal-dual algorithm that constitutes an approximate generalized Nash equilibrium (GNE) for the game. However, similar to primal-dual methods developed for the classical NUM problem, this approach does not leverage real-time queue lengths in decision making, leading to suboptimal queueing delay in practice, and has no explicit performance guarantees. To this end, we propose the Strategic Drift-plus-Penalty (SDP) algorithm and show that it constitutes an ε-GNE and has a uniformly bounded expected queue length of order O(1/ε³) for any ε > 0. Under an additional mild assumption, we show that our algorithms achieve long-term average social welfare arbitrarily close to that of a welfare-maximizing GNE policy.

The Multiserver-Job Stochastic Recurrence Equation for Cloud Computing Performance Evaluation

François Baccelli
Diletta Olliaro
Marco Ajmone Marsan
Andrea Marin

We study the Multiserver-Job Queuing Model (MJQM) with general independent arrivals and service times under FCFS scheduling, using stochastic recurrence equations (SREs) and ergodic theory. We prove the monotonicity and separability properties of the MJQM SRE, enabling the application of the monotone-separable extension of Loynes' theorem and the formal definition of the MJQM stability condition. Based on these results, we introduce and implement two algorithms: one for drawing sub-perfect samples (SPS) of the system's workload and the second one to estimate the system's stability condition given the statistics of the jobs' input stream. The SPS algorithm allows for a massive GPU parallelization, greatly improving the efficiency of performance metrics evaluation. We also show that this approach extends to more complex systems, including MJQMs with typed resources.

Efficiency-Reward Trade-Off in Queues with Dynamic Arrivals

Tianze Qu
Sushil Mahavir Varma

Motivated by emerging applications in various online marketplaces, such as ride-hailing platforms and payment channel networks, we study a single-server queue with service rate μ and state (queue length) dependent arrival rate λ(q) ∈ [0, λ_max]. The service operator receives a reward of F(λ) by setting an arrival rate of λ, where F can be used to model objectives like throughput, revenue, and social welfare. The goal is to design a control policy {λ(q)}_{q ∈ Z₊} that is both reward-optimal and efficient. Specifically, we consider the long-run average reward E[F(λ(q))] and the steady-state expected queue length E[q], where q denotes the steady-state queue length, and characterize the Pareto frontier between these two objectives.

We first establish a benchmark upper bound on the operating reward and define regret with respect to this benchmark. We then constrain the regret to be at most ε and focus on minimizing E[q]. Under mild smoothness assumptions on F, we show that in the small-market regime (λ_max ≤ μ), any control policy incurs E[q] ≥ Ω(1/ε). In contrast, in the large-market regime (λ_max >> μ), the system is significantly more efficient: E[q] = Θ(1/√ε) when F is concave-like near capacity, and E[q] = Θ(log (1/ε)) otherwise. In both cases, we establish matching lower bounds and construct state-dependent control policies that achieve them. These results provide a unified view of efficiency–reward trade-offs and can be interpreted as a non-asymptotic heavy-traffic theory for queues with dynamic arrivals.

Robustness of the 2-Choices Dynamics to Node Failures

Luke Meredith
Arpan Mukhopadhyay

In many applications, it becomes necessary for a set of distributed network nodes to agree on a common value or opinion as quickly as possible and with minimal communication overhead. The classical 2-choices rule is a well-known distributed algorithm designed to achieve this goal. Under this rule, each node in a network updates its opinion at random instants by sampling two neighbours uniformly at random and then adopting the common opinion held by these neighbours if they agree.

In this paper, we study the robustness of this algorithm to node failures. In particular, we assume that with a constant probability α, a node may fail to update according to the 2-choices rule and erroneously adopt any one of the two opinions uniformly at random. This is a strong form of failure under which the network can no longer be absorbed in a consensus state. However, we show that as long as the error probability α is less than a threshold value, the network is able to retain the majority support of the initially prevailing opinion for an exponentially long time (Ømega(poly(\exp(n)))). In contrast, when the error probability is above a threshold value, we show that any opinion quickly (O(log n) time) loses its majority support and the network reaches a state where (nearly) an equal proportion of nodes support each opinion. We establish the above phase transition in the dynamics for both complete graphs and expander graphs with sufficiently large spectral gaps and sufficiently homogeneous degrees. Our analysis combines spectral graph theory with Markov chain mixing and hitting time analyses.

SESSION: Session 5B: Quantum Systems 1

ScaleQsim: Highly Scalable Quantum Circuit Simulation Framework for Exascale HPC Systems

Changjong Kim
Ehan Sohn
Seunghwan Kim
Alex Sim
Kesheng Wu
Houjun Tang
Yongseok Son
Sunggon Kim

Large-scale quantum circuit simulation on high-performance computing (HPC) systems is crucial for developing and verifying quantum algorithms to overcome the limitations of current noisy quantum computers. However, existing simulators face scalability bottlenecks due to memory limits and communication overhead. Many state-of-the-art approaches rely on static circuit partitioning, which makes it difficult to address the exponential growth in problem size as the number of qubits increases. To overcome these challenges, we present ScaleQsim, a highly scalable quantum circuit simulation framework for large-scale HPC systems. ScaleQsim focuses on providing a unified representation of the full qubit state, yet provides scalability through efficient synchronization and communication. ScaleQsim adopts a novel full state vector partitioning strategy that evenly distributes the full state vector across multiple nodes and GPUs. This distributed structure enables efficient parallel gate execution without costly synchronization, and ScaleQsim applies adaptive kernel configuration, which adjusts execution parameters based on GPU resources and task granularity to enhance simulation efficiency. Our evaluation on the Perlmutter supercomputer with up to 512 GPUs demonstrates that ScaleQsim simulates quantum circuits with up to 42 qubits and outperforms state-of-the-art simulators by up to 77.40× across various quantum circuits.

Anchor: Reducing Temporal and Spatial Output Performance Variability on Quantum Computers

Yuqian Huo
Daniel Leeds
Jason Z. Ludmir
Nicholas S. DiBrita
Tirthak Patel

Quantum computing, which has the power to accelerate many computing applications, is currently a technology under development. As a result, the existing noisy intermediate-scale quantum (NISQ) computers suffer from different hardware noise effects, which cause errors in the output of quantum programs. These errors cause a high degree of variability in the performance (i.e., output fidelity) of quantum programs, which varies from one computer to another and from one day to another. Consequently, users are unable to get consistent results even when running the same program multiple times. Current solutions, while focusing on reducing the errors faced by quantum programs, do not address the variability challenge. To address this challenge, we propose Anchor, a first-of-its-kind technique that leverages linear programming to reduce the performance variability by 73% on average over the state-of-the-art implementation focused on error reduction.

ObliQ: Solving Quadratic Unconstrained Binary Optimization Problems on Real Photonic Quantum Machines

Aditya Ranjan
Daniel Silver
Tirthak Patel
Bhargav Chickmagalurnanjun
Devesh Tiwari

Quadratic Unconstrained Binary Optimization (QUBO) problems are central to combinatorial optimization but challenging due to their exponential solution space. We aim to solve QUBO problems on photonic quantum computers via novel problem-customized optimizations and exploiting unique aspects of photonic quantum machine architecture. We present a novel solution, ObliQ, that creates a hybrid quantum circuit composed of two parts: a theoretically grounded, non-tunable, problem-customized quantum circuit and a parameterized variational quantum circuit -- underlined by the co-design of photonic quantum hardware and software.

A Fine-Grained and Efficient Reliability Analysis Framework for Noisy Quantum Circuits

Jindi Wu
Tianjie Hu
Qun Li

Evaluating the reliability of noisy quantum circuits is essential for implementing quantum algorithms on noisy quantum devices. While state fidelity is the most faithful indicator of circuit reliability, it is experimentally and computationally prohibitive to obtain. Alternative metrics, although easier to compute, often fail to accurately reflect circuit reliability. We propose a fine-grained, scalable, and interpretable framework for efficient reliability evaluation of noisy quantum circuits. We introduce the Noise Proxy Circuit (NPC), which removes all logical operations while preserving the complete sequence of noise channels, thereby providing an abstraction of cumulative noise effects. Based on the NPC, we define Proxy Fidelity, a reliability metric that quantifies both qubit-level and circuit-level reliability. We further develop an analytical algorithm to estimate Proxy Fidelity under depolarizing, thermal relaxation, and SPAM error channels. The proposed framework achieves fidelity-level reliability estimation while remaining execution-free, scalable, and interpretable. Experimental results on real and simulated IBM quantum devices show that our method accurately estimates circuit fidelity, with an average absolute difference (AAD) ranging from 0.031 to 0.069 across diverse circuits and devices.

SESSION: Session 5C: Networking 3: Satellite

Horizon: Understanding and Predicting Global Starlink Performance

Cristian Benghe
Vlad Graure
Tanya Shreedhar
Nitinder Mohan

Starlink has deployed over 7,800 satellites serving millions of subscribers, yet predicting its performance remains an open challenge. Rapid orbital dynamics, frequent handovers, and weather-induced signal attenuation create variability that existing models, built on a handful of instrumented terminals in limited regions, cannot capture at global scale. We present Horizon, the first global-scale machine learning system for predicting LEO satellite Internet performance. Our key insight is that crowdsourced measurement platforms, while noisier than controlled experiments, provide the geographic diversity necessary to build globally generalizable models. Horizon integrates 11 months of measurements from M-Lab and Cloudflare spanning 90+ countries with meteorological data and satellite orbital propagation features. On a fully held-out one-week temporal window, Horizon achieves mean absolute errors of 17.76 ms for latency and 25.63 Mbps for throughput; on a standard 80/20 split it outperforms all baselines, including adaptations of state-of-the-art architectures. Feature importance analysis reveals that geographic position dominates prediction, with latitude alone contributing 42-46%, while weather features account for 14-15%, quantifying the impact of atmospheric conditions on Ku/Ka-band links. Leave-one-location-out experiments confirm that Horizon generalizes to regions absent from training, enabling performance estimation where measurement infrastructure does not yet exist. Our dataset and pipeline are publicly available, providing a foundation for global LEO network performance visibility.

Exploring the 5G Digital Divide in the Non-Contiguous US: LEO Satellites to the Rescue?

Sizhe Wang
Moinak Ghoshal
Yufei Feng
Imran Khan
Phuc Dinh
Omar Basit
Zhekun Yu
Y. Charlie Hu
Dimitrios Koutsonikolas

5G cellular networks and Low-Earth-Orbit (LEO) satellite networks, such as Starlink, promise enhanced performance and coverage capabilities. While a large number of research works have evaluated these technologies in the mainland US, their performance in non-contiguous US regions remains under-explored, despite their unique challenges and significant visitor demands. Through extensive drive tests covering over 4,200 km across Alaska, Maui (Hawaii), and the mainland US (as a baseline), while simultaneously measuring the performance of the three major US mobile operators and Starlink, this work presents the first detailed evaluation of cellular and Starlink network coverage and performance in the non-contiguous US regions. Our study shows a persistent digital divide between mainland and non-contiguous US for cellular networks in terms of coverage and performance, highlighting the challenges of cellular deployments in non-contiguous US regions. Starlink provides substantially higher performance than cellular networks most of the time, but area-specific challenges, including unique terrains in Hawaii and sparse satellite deployment in Alaska, significantly degrade performance compared to the mainland US. Additionally, we explore the spatiotemporal diversity between cellular and Starlink performance and study the potential of multipath transport to bridge the connectivity gap in non-contiguous US regions.

A Comprehensive Study of Satellite Network Performance During Severe or Extreme Geomagnetic Storms over 1.5 Years (May 2024 – Oct 2025)

Saeed Fadaei
Aravindh Raman
Prince Bhardwaj Pawankumar Sharma
Nishanth Sastry

As LEO megaconstellations such as Starlink become critical infrastructure for millions of users, it becomes important to examine the effects of space weather phenomena such as geomagnetic or solar storms. The network-level impact of these events remains poorly characterised, especially across orbit types and at application scale. Using a three-pronged measurement strategy that combines real-time controlled experiments via LEOScope, post-hoc analysis of over 4 million M-Lab NDT7 speedtests spanning 240 countries, and Cloudflare AIM application quality scores, we study the effects on Starlink of all the fifteen strong, severe, or extreme geomagnetic storms that occurred from May 2024 to October 2025. We find that LEO networks suffer severe, latitude-bounded degradation: packet loss doubles, throughput drops by 20–40%, and connection drops nearly quadruple. LEO is disproportionately affected relative to MEO and GEO. Application scores for gaming, streaming, and VoIP degrade by one quality tier. BBR consistently outperforms CUBIC under storm conditions. We also find that satellite operators manoeuvre nearly half their LEO fleets during storms, a necessary but performance-costly response. These findings call for space-weather-aware satellite network operations and end-user transparency which may allow for user-side adaptation during geomagnetic storms.

A Variegated Look at Direct-to-Cell Satellites in the Wild

Wei Liu
Jingyi Lan
Yuanjie Li
Bowen Sun
Yekai Dong
Yufeng Wang
Hewu Li
Qian Wu
Zeqi Lai
Jun Liu

Direct-to-cell (DtC) satellites aim to deliver ubiquitous cellular connectivity for regular phones from space. Starlink, as the most successful DtC satellite operator, has already provided messaging and data services to unmodified smartphones across multiple countries. This work presents the first in-depth cellular measurement study of Starlink's DtC satellites by collecting a two-month full-stack dataset. Our analysis reveals that Starlink strives to mitigate the satellite's long distance and extreme mobility through infrastructure-side mechanisms, including physical-layer pre-compensation, reasonable function split, and rapid cell switching strategies. However, its backward-compatibility requirements prevent cellular protocol modifications, yielding unresolved issues including frequent access failures, ping-pong handovers, and extensive retransmissions. These findings expose fundamental limitations in adapting legacy terrestrial 5G/4G to space. In the long term, these issues cannot be resolved simply by allocating more spectrum or deploying more satellites. We further conduct data-driven emulation using 3GPP NTN protocol stacks and satellite channel simulators to characterize how next-generation NTN designs may address these issues.

SESSION: Session 5D: Measurement

Are We Ready to Enable Satellite Autonomy Through on Board Image Registration?

Claudio Di Salvo
Emanuele Del Sozzo
Giuseppe Sorrentino
Eleonora D'Arnese
Paolo Panicucci
Davide Conficconi

Enabling autonomous CubeSats requires shifting data processing from ground stations directly to on-board hardware to bypass severe downlink bottlenecks, using a paradigm known as Orbital Edge Computing (OEC). A critical, yet overlooked, OEC prerequisite is Image Registration (IR), whose pipelines are usually developed for unconstrained platforms, failing to meet strict on-board latency, accuracy, and energy trade-offs. For this reason, we introduce STARBench, a modular framework to standardize the evaluation of IR methods. Leveraging this framework, we proposed the Quality-Adjusted Cost (QAC) as a new system-level metric that quantifies the latency-accuracy-energy trade-off and reveals a fundamental ''specialization trade-off.'' Based on these insights we propose ETNA, a mission-aware solution that dynamically orchestrates heterogeneous IR pipelines. Through a hardware-software co-designed pyramidal Field Programmable Gate Array (FPGA) accelerator, ETNA achieves real-time multimodal registration, improving the QAC by up to 12.03× over Pareto-optimal baselines.

μ-VF: Enabling Virtualization of Embedded FPGAs

Vincenzo Alessio Bucaria
Francesco Longo
Giovanni Merlino
Francesco Restuccia

Despite growing interest in virtualization of Field-Programmable Gate Arrays (FPGAs), existing approaches predominantly target datacenter-class FPGAs, which heavily rely on external servers for hypervisor execution and resource management. This significantly limits their suitability for edge environments where autonomy, energy efficiency, and direct low-latency access to physical Input/Output (I/O) are critical. To address this goal, this work presents µ-VF, a lightweight virtualization framework specifically designed to enable robust multi-tenancy on embedded FPGAs operating autonomously at the network edge. µ-VF embeds all virtualization logic entirely onboard the FPGA unit, eliminating the need for any off-chip infrastructure and thus significantly reducing overall system power consumption. Each tenant operates within a secure and isolated container on the on-chip Processing System (PS), coupled with exclusive access to a dedicated Programmable Logic (PL) region. Additionally, µ-VF fully virtualizes external General-Purpose Input/Output (GPIO) directly within the PL fabric, enabling independent, concurrent, and latency-sensitive access to shared peripherals. We have implemented a prototype of µ-VF on a Zynq UltraScale+ ZCU102 board with PL operating at 100 MHz. Experimental results demonstrate that the hardware virtualization layer utilizes less than 10% of the FPGA's logic resources, with 85% available for tenant applications compared to 50% in prior work. Moreover, µ-VF adds only 2.93% to Memory-Mapped I/O (MMIO) access latency compared to native execution for single-tenant operation, increasing to 6.5% with four concurrent tenants. Memory throughput overhead remains below 1.8% for write operations and negligible for reads, with aggregate throughput 20.58% higher than previous frameworks. Hardware-based GPIO remapping completes in 20 ns. By operating entirely on-device, µ-VF transforms embedded FPGAs into autonomous, multi-tenant computing nodes suitable for edge Infrastructure-as-a-Service (IaaS) deployments.

On Abnormal Execution Timing of Conditional Jump Instructions

Annika Wilde
Samira Briongos
Claudio Soriente
Ghassan Karame

An extensive line of work on modern computing architectures has shown that the execution time of instructions can (i) depend on the operand of the instruction or (ii) be influenced by system optimizations, e.g., branch prediction and speculative execution paradigms. In this paper, we systematically measure and analyze timing variabilities in conditional jump instructions that can be macro-fused with a preceding instruction, depending on their placement within the binary. Our measurements indicate that these timing variations stem from the µop cache placement and the jump's offset in the L1 instruction cache of modern processors. We demonstrate that this behavior is consistent across multiple microarchitectures, including Skylake, Coffee Lake, and Kaby Lake, as well as various real-world implementations. We confirm the prevalence of this variability through extensive experiments on a large-scale set of popular binaries, including libraries from Ubuntu 24.04, Windows 10 Pro, and several open-source cryptographic libraries. We also show that one can easily avoid this timing variability by ensuring that macro-fusible instructions are 32-byte aligned - an approach initially suggested in 2019 by Intel in an overlooked short report. We quantify the performance impact of this approach across the cryptographic libraries, showing a speedup of 2.15% on average (and up to 10.54%) when avoiding the timing variability. As a by-product, we show that this variability can be exploited as a covert channel, achieving a maximum throughput of 16.14 Mbps.

Shiny Objects: Object-Centric Characterization of Chromium

Saket Upadhyay
Ashish Venkat

Modern web browsers manage millions of dynamic objects across tabs, frames, DOM elements, and JavaScript contexts. However, fine-grained behaviors related to object allocation, lifetime, and memory usage in production browsers remain elusive. Chromium's modular and extensible design, use of specialized memory allocators, and sensitivity to instrumentation overhead further complicate precise object tracking. To this end, we develop a lightweight, thread-safe, and non-intrusive profiling framework. Using this infrastructure, we present an empirical characterization of Chromium's memory object behavior across twelve diverse, user-centric workloads. We examine object lifetime events, size diversity, spatial locality, type diversity, and memory activity, and reflect on their broader software and architectural implications. Our study offers a systems-oriented view into Chromium's architecture and memory behavior, and highlights structural challenges in efficient memory management in large-scale and diverse systems.

SESSION: Session 6A: Blockchain 2

Shedding Light on Shadows: Automatically Tracing Illicit Money Flows on EVM-Compatible Blockchains

Yicheng Huo
Yufeng Hu
Yajin Zhou
Ting Yu
Lei Wu
Cong Wang

The pseudo-anonymity and rapidly expanding ecosystem of Decentralized Finance (DeFi) have brought about significant liquidity on EVM-compatible blockchains, making them lucrative targets for cybercriminals. In the modern financial landscape, the need for an automated, high-speed, and effective illicit money tracing system is more urgent than ever to support regulators, on-chain service providers and security practitioners in their efforts to combat the frequent and large-scale occurrences of cyber financial crimes.

In this paper, we propose MFTracer, an automated system for tracing illicit money flows on EVM-compatible blockchains. Against the backdrop of a domain where tracing remains labor-intensive and expert-driven, MFTracer is developed in response to two pressing real-world demands: operational efficiency and forensic effectiveness. In response to the sophisticated fund transfer mechanisms enabled by the EVM environment, we introduce a novel fine-grained technique that enables protocol-agnostic transaction-level fund flow analysis. We further propose MFA, a lightweight and purpose-built graph abstraction with a tailored storage backend, to support efficient data retrieval. We also present a simulation algorithm for downstream illicit flow discovery. We implemented MFTracer. Its infrastructure for data retrieval achieves 3.7× to 9.4× higher storage efficiency while being 14.1× to 300× faster than the leading graph database systems. Furthermore, applied to real-world cybercrime incidents, MFTracer achieved 94.09% coverage of illicit money flows. It also newly reported 686 blockchain addresses and 4183 related transactions involved in money laundering that were previously undiscovered. MFTracer was able to reconstruct complete fund flow trajectories and provide strong evidence to investigators for 120.9 million in stolen assets.

The Price of Interoperability: Exploring Cross-Chain Bridges and Their Economic Consequences

Yiyue Cao
Mingzhe Zheng
Lin William Cong
Siguang Li
Xuechao Wang

Cross-chain bridges have become the primary infrastructure mediating asset and liquidity mobility across heterogeneous blockchain networks, yet their system-level and economic consequences remain poorly understood. This paper presents a large-scale empirical measurement study covering 20 blockchains and 16 major bridge protocols from 2022 to 2025. We model the multi-chain ecosystem as a time-varying weighted hypergraph and decompose interoperability into two dimensions: structural interoperability, the connectivity implied by deployed bridge infrastructure independent of user behavior, and active interoperability, the realized intensity of cross-chain transfers normalized by chain size. Our analysis yields five findings. First, the cross-chain network evolves from a sparse hub-and-spoke topology into a dense, multi-hub core dominated by EVM-compatible chains. Second, cross-chain capital flows are return-chasing and cost-sensitive, whereas structural interoperability serves as an enabling factor rather than a directional driver of flows. Third, we identify a growth-return paradox: higher interoperability expands ecosystem scale but dilutes native asset value. Fourth, we uncover an efficiency-fragility trade-off: structural interoperability acts as a load balancer that reduces average gas fees, yet synchronizes economic cycles, transforming independent networks into a tightly coupled system. Finally, architectural heterogeneity across bridge designs leads to distinct economic and system-level outcomes, yielding direct implications for the design of scalable and resilient cross-chain infrastructure.

SESSION: Session 6B: Quantum Systems 2

ECCentric: An Empirical Analysis of Quantum Error Correction Codes

Aleksandra Świerkowska
Jannik Pflieger
Emmanouil Giortamis
Pramod Bhatotia

Quantum Error Correction (QEC) is essential for building scalable quantum computers, but a lack of systematic, end-to-end evaluation methods makes it difficult to assess how different QEC codes perform under realistic conditions. The vast diversity of codes, an expansive experimental search space, and the absence of a standardized framework prevent a thorough, holistic analysis. To address this, we introduce ECCentric, an end-to-end benchmarking framework designed to systematically evaluate QEC codes across the full quantum computing stack. ECCentric is designed to be modular, extensible, and general, allowing for a comprehensive analysis of QEC code families under varying hardware topologies, noise models, and compilation strategies.

Using ECCentric, we conduct the first systematic benchmarking of major QEC code families against realistic, mid-term quantum device parameters. Our empirical analysis reveals that intra-QPU execution significantly outperforms distributed methods, that qubit connectivity is a far more critical factor for reducing logical errors than increasing code distance, and that compiler overhead remains a major source of error. Furthermore, our findings suggest that trapped-ion architectures with qubit shuttling are the most promising near-term platforms and that on noisy devices, a strategic and selective application of QEC is necessary to avoid introducing more errors than are corrected. This study provides crucial, actionable insights for both hardware designers and practitioners, guiding the development of fault-tolerant quantum systems.

Three Birds with One Stone: Improving Performance, Convergence, and System Throughput with NEST

Yuqian Huo
David Quiroga
Anastasios Kyrillidis
Tirthak Patel

Variational quantum algorithms (VQAs) have the potential to demonstrate quantum utility on near-term quantum computers. However, these algorithms often get executed on the highest-fidelity qubits and computers to achieve the best performance, causing low system throughput. Recent efforts have shown that VQAs can be run on low-fidelity qubits initially and high-fidelity qubits later on to still achieve good performance. We take this effort forward and show that carefully varying the qubit fidelity map of the VQA over its execution using our technique, NEST, does not just (1) improve performance (i.e., help achieve close to optimal results), but also (2) lead to faster convergence. We also use NEST to co-locate multiple VQAs concurrently on the same computer, thus (3) increasing the system throughput, and therefore, balancing and optimizing three conflicting metrics simultaneously.

SESSION: Session 6C: Networking 4

MeshAgent: Enabling Reliable Network Management with Large Language Models

Yajie Zhou
Kevin Hsieh
Sathiya Kumaran Mani
Srikanth Kandula
Zaoxing Liu

Large language models (LLMs) are promising for domain-specific agents, but adapting them to network management remains difficult because existing specialization methods depend on large amounts of high-quality task data, while real network queries are diverse and unpredictable. To address this, we propose MeshAgent, a workflow that extracts reusable domain invariants from sample queries and encodes them as constraints to guide LLM generation and validation. Across three network management applications and a user study with industry professionals, MeshAgent consistently improves accuracy and reliability, including the ability to abstain when confidence is low. MeshAgent achieves over 95% accuracy, reaches 100% when combined with fine-tuned agents, and improves accuracy by up to 26% over baselines. These results show that constraint-based adaptation is a practical and scalable alternative to data-heavy LLM specialization for network management.

The Impact of IP Version on Household Internet Speed: A Comparative Study

Amanda Hsu
Nick Feamster
Paul Pearce
Frank Li

Despite extensive research measuring broadband quality, limited work has considered the interaction between residential Internet throughput and IP protocol. Public speed tests used to determine Internet quality do not, by and large, control for IP version; results, analysis, and recommendations are made using a hybrid of IPv4 and IPv6 data. Given the range of protocol designs, software stacks, and network infrastructure differences between the protocols, combined with the recent significant increase in IPv6 adoption, it is critical to understand what role the IP protocol plays in measuring Internet speeds.

In this work, we systematically compare IPv4 and IPv6 speeds in residential access networks, examining differences in throughput experienced by households depending on IP version. Our findings demonstrate that IPv4 and IPv6 throughput differ in many instances and motivate a large-scale re-evaluation of our assumptions on IP version in future speed test analysis. Specifically, we find that IPv4 and IPv6 speeds differ in a significant number of cases, with up to 18.3% of our measurements differing by over 5%. Our findings indicate that substantial speed differences between IP versions can be driven by provider-specific factors such as speed tiers. Furthermore, we observe differences in IPv4 and IPv6 data depending on the speed-test software and testing infrastructure. Thus, this work guides future research about Internet speeds on how to consider and control for IP version.

SESSION: Session 7A: Theoretical Machine Learning

A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

Phalguni Nanda
Zaiwei Chen

In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for E[||Q_k - Q^*||_∞²], which implies a sample complexity of order O(1/ξ²) for achieving E[||Q_k - Q^*||_∞]≤ ξ. This rate matches that of off-policy Q-learning, but with a worse dependence on exploration-related parameters. We also derive an explicit finite-time rate for E[||Q^π_k - Q^*||_∞²], where π_k denotes the learning policy at iteration k. Together, these results highlight the exploration--exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage since the learning policy itself converges to an optimal one. From a technical perspective, the combination of rapidly time-varying learning policies, which induce time-inhomogeneous Markovian noise, and minimal exploration assumptions presents significant analytical challenges. To address these challenges, we develop a Poisson-equation-based decomposition of the Markovian noise associated with a lazy transition matrix, separating it into a martingale-difference term and residual terms. We then control the residual terms through a sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These techniques may facilitate the analysis of other reinforcement learning algorithms with rapidly time-varying learning policies, such as single-timescale actor--critic methods and learning-in-games algorithms.

Wasserstein-p Central Limit Theorem Rates: From Local Dependence to Markov Chains

Yixuan Zhang
Qiaomin Xie

Non-asymptotic central limit theorem (CLT) rates play a central role in modern machine learning. In this paper, we study CLT rates for multivariate dependent data in Wasserstein-p (W_p) distance, for general p ≥ 1. We focus on two fundamental dependence structures that commonly arise in machine learning: locally dependent sequences and geometrically ergodic Markov chains. In both settings, we establish the first optimal O(n^-1/2 ) rate in W₁, as well as the first W_p (p ≥ 2) CLT rates under mild moment assumptions, substantially improving the best previously known bounds in these dependent-data regimes. As an application of our optimal W₁ rate for locally dependent sequences, we further obtain the first optimal W₁-CLT rate for multivariate U-statistics. On the technical side, we derive a tractable auxiliary bound for W₁ Gaussian approximation errors that is well suited for studying dependent data. For Markov chains, we further prove that the regeneration time of the split chain associated with a geometrically ergodic chain has a geometric tail without assuming strong aperiodicity or other restrictive conditions. These tools may be of independent interests and enable our optimal W₁ rates and underpin our W_p (p ≥ 2) results.

Better than Optimal: Improving Adaptive Stochastic Quantization Using Shared Randomness

Ran Ben Basat
Yaniv Ben-Itzhak
Michael Mitzenmacher
Shay Vargaftik

Quantization reduces memory, computation, and communication costs by converting high-precision values into compact representations. Recent adaptive stochastic quantization (ASQ) methods optimize quantization values for a specific input vector, but they do not exploit shared randomness available to both the quantizer and the dequantizer. We introduce adaptive unbiased quantization (AUQ), which extends ASQ with shared randomness and can asymptotically improve achievable accuracy without increasing the encoded size. We then present Simba, a practical AUQ method that uses multiple sets of quantization values and selects among them using shared randomness while preserving unbiasedness. Simba starts by computing a fast approximate ASQ solution and refines it through a structured shared-randomness search. Across common synthetic distributions, Simba is up to about 24× more accurate than the state-of-the-art optimal ASQ method and can be up to two orders of magnitude faster. We also show that Simba can improve QUIC-FL by replacing its solver-generated lookup tables with tables computed by Simba.

SESSION: Session 7B: HPC

ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration

Wei-Fen Lin
Jen-Chien Chang
Yen-Po Chen
Zi-Yi Tai
Yu-Cheng Chang
Chia-Pao Chiang
Yu-Yang Lee
Yu-Jie Wang

Architectural simulation has become the critical bottleneck limiting design space exploration for high-performance computing systems. The growing complexity of modern GPUs and AI accelerators, with hundreds to thousands of tightly-coupled components, demands simulation frameworks that deliver efficient parallelism and scalable single-node execution. Existing frameworks such as SST, gem5, and GPGPU-Sim fail to meet these requirements: SST focuses on multi-node MPI scalability but struggles with intra-node scaling; GPGPU-Sim remains largely single-threaded. Critically, these frameworks provide fixed threading models with no mechanism for users to optimize simulation performance for their specific workloads.

We introduce ACALSim, a scalable parallel simulation framework designed to accelerate design space exploration for high-performance systems. ACALSim's key innovation is a pluggable thread management architecture that enables developers to implement custom scheduling strategies optimized for specific simulation patterns---a capability absent in existing frameworks. This is complemented by (1) event-driven execution with fast-forward to eliminate idle cycle overhead, (2) a shared-memory data model enabling zero-copy communication, and (3) a two-phase parallel execution model for deterministic thread scaling. We demonstrate ACALSim's effectiveness through HPCSim, a GPU simulator targeting A100-class architectures. Direct comparison with an SST implementation---using identical shared timing cores to isolate framework overhead---shows ACALSim achieves over 14× speedup with 41% lower memory footprint, while hardware validation confirms 0.72-1.22× cycle count correlation with A100 measurements.

Sumeru: Towards Understanding and Achieving Cache-Optimal Inbound Network I/O

Minhu Wang
Mingwei Xu
Jianping Wu

The slow growth of DRAM performance and ever-increasing memory bandwidth demands have made receiver-side memory a critical bottleneck for end-to-end data movement in cutting-edge data centers. Although Direct Cache Access (DCA) allows for memory-bypass I/O, existing implementations like Intel's Data Direct I/O (DDIO) have proven ineffective on 100 Gbps links, leading to a widespread belief that current processor caches are simply too small to serve modern high-speed links. This paper challenges this conclusion, arguing that the fundamental problem is not insufficient cache capacity, but inefficient cache usage. Our novel cache model reveals that software queue dynamics determine a receive buffer's path through the non-inclusive cache hierarchy (i.e., its ''cache trajectory''), opening the path toward cache-optimal DRAM-bypass inbound I/O on commodity hardware with pure software modifications. Guided by the model, we design and implement Sumeru, which approaches cache-optimal I/O through four synergistic innovations: (1) a dual-path stack architecture with a shallow fast path for large flows, (2) cache-aware buffer pools enforcing optimal trajectories, (3) host-based active queue management preventing bufferbloat, and (4) trajectory-aware dynamic cache partitioning. These designs work together to consistently keep network buffers on their optimal trajectory. The result is near-100% cache hit rates on a wide range of workloads and network settings. This eliminates memory-induced intra-host congestion, improving performance for both the target throughput-bound application and co-located latency-sensitive or memory-intensive neighbors. On real-world resource-contending deployments, Sumeru achieves a Pareto improvement: It boosts SPDK NVMe/TCP goodput by up to 51.2% while simultaneously boosting co-located SPEC CPU 2017 suite scores by up to 30.1%.

d-HNSW: A High-performance Vector Search Engine on Disaggregated Memory

Fei Fang
Yi Liu
Chen Qian

We present d-HNSW, an RDMA-based vector search engine for disaggregated memory that overcomes network bottlenecks and layout fragmentation through hardware-algorithm co-designed representative index caching, RDMA-friendly remote storage layout, query-aware loading, and pipelined execution.

SESSION: Session 7C: LLM Scheduling

SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling

Shashwat Jaiswal
Kunal Jain
Yogesh Simmhan
Anjaly Parayil
Ankur Mallick
Rujia Wang
Renee St. Amant
Chetan Bansal
Victor Rühle
Anoop Kulkarni
Steve Kofsky
Saravan Rajmohan

Global cloud service providers handle inference workloads for LLMs that span latency-sensitive (e.g., chatbots) and insensitive (e.g., report writing) tasks, with diverse and often conflicting Service Level Agreement (SLA) needs. Serving such mixed workloads is challenging due to the multiple models, diverse GPU hardware, and globally distributed data centers. Solutions that silo fast and slow tasks on separate GPU pools cause under-utilization of expensive accelerators. Here, we characterize the large-scale LLM inference workloads (> 10M requests per day) at Microsoft Office 365, and highlight key observations in one of the first Internet-scale public studies. These insights drive SageServe, an LLM serving framework that dynamically adapts to workload demands. It combines short-term request routing to data centers with long-term scaling of GPU VMs and model placement with higher lead times, and co-optimizes the routing and resource allocation problem using a traffic forecast model and an ILP solution. We evaluate SageServe on traces comprising 10M production requests across three regions and four open-source models. SageServe achieves up to 25% savings in GPU-hours compared to the current baseline deployment and reduces GPU-hour wastage due to inefficient auto-scaling by 80%, potentially resulting in monthly cost savings of up to $2.5 million, while maintaining tail latency and meeting SLAs.

Optimizing Dropout in LLM Training: Performance Comparison of Fusion and Overlap

Haiyue Ma
Jian Liu
Ronny Krashinsky

This work proposes overlapping Random Number Generation (RNG), the main runtime contributor of Dropout, with preceding GEneral Matrix Multiplication (GEMM) layers to hide RNG latency during LLM training. The state-of-the-art optimization is to fuse Dropout into the Flash-Attention kernel; however, evaluating fine-grained architecture resource constraints beyond traditional compute or memory metrics reveals that fusion fails to fully hide RNG latency due to shared lower-level architecture bottlenecks. RNG and GEMM have distinct hardware bottlenecks, so they can run together without compromising each other's performance. Our analytical model, validated on GH100 GPUs, shows 1.26× speedup over sequential execution and 1.22× over state-of-the-art fusion on Llama3 for a single Transformer block. Our methodology generalizes to fusion-versus-overlap decisions for new LLM operators across various architectures, models, and configurations.

Optimal Scheduling Algorithms for LLM Inference: Theory and Practice

Agrim Bari
Parikshit Hegde
Gustavo de Veciana

With the growing use of Large Language Model (LLM)-based tools like ChatGPT, Perplexity, and Gemini across industries, there is a rising need for efficient LLM inference systems. These systems handle requests with a unique two-phase computation structure: a prefill-phase that processes the full input prompt and a decode-phase that autoregressively generates tokens one at a time. This structure calls for new strategies for routing and scheduling requests.

In this paper, we take a comprehensive approach to this challenge by developing a theoretical framework that models routing and scheduling in LLM inference systems. We identify two key design principles—optimal tiling and dynamic resource allocation—that are essential for achieving high throughput. Guided by these principles, we propose the Resource-Aware Dynamic (RAD) scheduler and prove that it achieves throughput optimality under mild conditions. To address practical Service Level Objectives (SLOs) such as serving requests with different Time Between Token (TBT) constraints, we design the SLO-Aware LLM Inference (SLAI) scheduler. SLAI uses real-time measurements to prioritize decode requests that are close to missing their TBT deadlines and reorders prefill requests based on known prompt lengths to further reduce the Time To First Token (TTFT) delays.

We evaluate SLAI on the openchat_shareGPT4 dataset using the Mistral-7B model on an NVIDIA RTX ADA 6000 GPU. Compared to Sarathi-Serve, SLAI reduces the median TTFT by 53% and increases the maximum serving capacity by 26% such that median TTFT is below 0.5 seconds, while meeting tail TBT latency constraints. The complete source code is available at: https://github.com/agrimUT/SLAI.