SIGMETRICS '23: Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems

Full Citation in the ACM Digital Library

SESSION: Cloud & Datacenter

Memtrade: Marketplace for Disaggregated Memory Clouds

We present Memtrade, the first practical marketplace for disaggregated memory clouds. Clouds introduce a set of unique challenges for resource disaggregation across different tenants, including resource harvesting, isolation, and matching. Memtrade allows producer virtual machines (VMs) to lease both their unallocated memory and allocated-but-idle application memory to remote consumer VMs for a limited period of time. Memtrade does not require any modifications to host-level system software or support from the cloud provider. It harvests producer memory using an application-aware control loop to form a distributed transient remote memory pool with minimal performance impact; it employs a broker to match producers with consumers while satisfying performance constraints; and it exposes the matched memory to consumers through different abstractions. As a proof of concept, we propose two such memory access interfaces for Memtrade consumers -- a transient KV cache for specified applications and a swap interface that is application-transparent. Our evaluation shows that Memtrade provides significant performance benefits for consumers (improving average read latency up to 2.8X) while preserving confidentiality and integrity, with little impact on producer applications (degrading performance by less than 2.1%).

Mars: Near-Optimal Throughput with Shallow Buffers in Reconfigurable Datacenter Networks

The performance of large-scale computing systems often critically depends on high-performance communication networks. Dynamically reconfigurable topologies, e.g., based on optical circuit switches, are emerging as an innovative new technology to deal with the explosive growth of datacenter traffic. Specifically, periodic reconfigurable datacenter networks (RDCNs) such as RotorNet (SIGCOMM 2017), Opera (NSDI 2020) and Sirius (SIGCOMM 2020) have been shown to provide high throughput, by emulating a complete graph through fast periodic circuit switch scheduling.

However, to achieve such a high throughput, existing reconfigurable network designs pay a high price: in terms of potentially high delays, but also, as we show as a first contribution in this paper, in terms of the high buffer requirements. In particular, we show that under buffer constraints, emulating the high-throughput complete graph is infeasible at scale, and we uncover a spectrum of unvisited and attractive alternative RDCNs, which emulate regular graphs, but with lower node degree than the complete graph.

We present Mars, a periodic reconfigurable topology which emulates a d-regular graph with near-optimal throughput. In particular, we systematically analyze how the degree~d can be optimized for throughput given the available buffer and delay tolerance of the datacenter. We further show empirically that Mars achieves higher throughput compared to existing systems when buffer sizes are bounded.

Architectural Support for Efficient Data Movement in Fully Disaggregated Systems

Traditional data centers include monolithic servers that tightly integrate CPU, memory and disk (Figure 1a). Instead, Disaggregated Systems (DSs) [8, 13, 18, 27] organize multiple compute (CC), memory (MC) and storage devices as independent, failure-isolated components interconnected over a high-bandwidth network (Figure 1b). DSs can greatly reduce data center costs by providing improved resource utilization, resource scaling, failure-handling and elasticity in modern data centers [5, 8-10, 10, 11, 13, 18, 27]

The MCs provide large pools of main memory (remote memory), while the CCs include the on-chip caches and a few GBs of DRAM (local memory) that acts as a cache of remote memory. In this context, a large fraction of the application's data (~ 80%) [8, 18, 27] is located in remote memory, and can cause large performance penalties from remotely accessing data over the network.

Duo: A High-Throughput Reconfigurable Datacenter Network Using Local Routing and Control

The performance of many cloud-based applications critically depends on the capacity of the underlying datacenter network. A particularly innovative approach to improve the throughput in datacenters is enabled by emerging optical technologies, which allow to dynamically adjust the physical network topology, both in an oblivious or demand-aware manner. However, such topology engineering, i.e., the operation and control of dynamic datacenter networks, is considered complex and currently comes with restrictions and overheads.

We present Duo, a novel demand-aware reconfigurable rack-to-rack datacenter network design realized with a simple and efficient control plane. Duo is based on the well-known de Bruijn topology (implemented using a small number of optical circuit switches) and the key observation that this topology can be enhanced using dynamic ("opportunistic") links between its nodes.

In contrast to previous systems, Duo has several desired features: i) It makes effective use of the network capacity by supporting integrated and multi-hop routing (paths that combine both static and dynamic links). ii) It uses a work-conserving queue scheduling which enables out-of-the-box TCP support. iii) Duo employs greedy routing that is implemented using standard IP longest prefix match with small forwarding tables. And iv) during topological reconfigurations, routing tables require only local updates, making this approach ideal for dynamic networks.

We evaluate Duo in end-to-end packet-level simulations, comparing it to the state-of-the-art static and dynamic networks designs. We show that Duo provides higher throughput, shorter paths, lower flow completion times for high priority flows, and minimal packet reordering, all using existing network and transport layer protocols. We also report on a proof-of-concept implementation of \system's control and data plane.

Towards Accelerating Data Intensive Application's Shuffle Process Using SmartNICs

Emerging SmartNIC creates new opportunities to offload application-level computation into the networking layer. Shuffle, the all-to-all data exchange process, is a critical building block for network communication in distributed data-intensive applications and can potentially benefit from SmartNICs.

In this paper, we develop SmartShuffle, which accelerates the data-intensive application's shuffle process by offloading computation tasks into the SmartNIC devices. SmartShuffle supports offloading both low-level network functions, including data partitioning and network transport, and high-level computation tasks, including filtering, aggregation, and sorting. SmartShuffle adopts a coordinated offload architecture to make sender-side and receiver-side SmartNICs jointly contribute to the computation offload. SmartShuffle manages the computation and memory constraints on the device using liquid offloading, which dynamically migrates computations between the host CPU and the SmartNIC at runtime.

We prototype SmartShuffle on the Stingray SoC SmartNICs and plug it into Spark. Our evaluation shows that SmartShuffle outperforms Spark, and Spark RDMA by up to 40% on TPC-H.

DiffForward: On Balancing Forwarding Traffic for Modern Cloud Block Services via Differentiated Forwarding

Modern cloud block service provides cloud users with virtual block disks (VDisks), and it usually relies on a forwarding layer consisting of multiple proxy servers to forward the block-level writes from applications to the underlying distributed storage. However, we discover that severe traffic imbalance exists among the proxy servers at the forwarding layer, thus creating a performance bottleneck which severely prolongs the latency of accessing VDisks. Worse yet, due to the diverse access patterns of VDisks, stable traffic and burst traffic coexist at the forwarding layer, and thus making existing load balancing designs inefficient for balancing the traffic at the forwarding layer of VDisks, as they are unaware of and also lack the ability to differentiate the decomposable burst and stable traffic. To this end, we propose a novel traffic forwarding scheme DiffForward for cloud block services. DiffForward differentiates the burst traffic from stable traffic in an accurate and efficient way at the client side, then it forwards the burst traffic to a decentralized distributed log store to realize real-time load balance by writing the data in a round-robin manner and balances the stable traffic by segmentation. DiffForward also judiciously coordinates the stable and burst traffic and preserves strong consistency under differentiated forwarding. Extensive experiments with reallife workloads on our prototype show that DiffForward effectively balances the traffic at the forwarding layer at a fine-grained subsecond level, thus significantly reducing the write latency of VDisks.

SplitRPC: A {Control + Data} Path Splitting RPC Stack for ML Inference Serving

The growing adoption of hardware accelerators driven by their intelligent compiler and runtime system counterparts has democratized ML services and precipitously reduced their execution times. This motivates us to shift our attention to characterize the overheads imposed by the RPC mechanism (`RPC tax') when serving them on accelerators. Conventional RPC implementations implicitly assume the host CPU services the requests, and we focus on expanding such works towards accelerator-based services. While SmartNIC based solutions work well for simple applications, serving complex ML models requires a more nuanced view to optimize both the data-path and the control/orchestration of these accelerators. We program commodity network interface cards (NICs) to split the control and data paths for effective transfer of control while efficiently transferring the payload to the accelerator. As opposed to unified approaches that bundle these paths together, limiting the flexibility in each of these paths, we design and implement SplitRPC - a {control + data} path optimizing RPC mechanism for ML inference serving. SplitRPC allows us to optimize the datapath to the accelerator while simultaneously allowing the CPU to maintain full orchestration capabilities. We implement SplitRPC on both commodity NICs and SmartNICs and demonstrate that SplitRPC is effective in minimizing the RPC tax while providing significant gains in throughput and latency.

Go-to-Controller is Better: Efficient and Optimal LPM Caching with Splicing

Data center networks must support huge forwarding policies as they handle the traffic of the various tenants. Since such policies cannot be stored within the limited memory available at commodity switches, SDN controllers can manage the memory available at the switch as a cache, updating and changing the forwarding rules in the cache according to the policy and workloads dynamics. Most policies, such as Longest-prefix-match (LPM) policies, include dependencies between the forwarding rules, which introduce consistency constraints on the structure of the cached content, affecting the performance in terms of throughput and delay. Previous work suggested the concept of splicing to address such deficiencies, where modified Go-to-Controller rules can be inserted into the cache to improve performance while maintaining consistency.

We present the first optimal algorithm for determining the cache content with splicing, as well as several efficient heuristics with some performance guarantees. We evaluate our solutions using traces derived from real systems and traffic, and show that splicing can reduce the cache miss ratio by as much as 30%, without increasing the cache size. We further propose a new metric which can provide a quick estimate as to the potential benefits of splicing compared to classical LPM-caching. The full version of our work appeared in [2].

Noise in the Clouds: Influence of Network Performance Variability on Application Scalability

Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introduce network noise that can eventually limit the application scaling. This work analyzes network performance, scalability, and cost of running HPC workloads on cloud systems. First, we consider latency, bandwidth, and collective communication patterns in detailed small-scale measurements, and then we simulate network performance at a larger scale. We validate our approach on four popular cloud providers and three on-premise HPC systems, showing that network (and also OS) noise can significantly impact performance and cost both at small and large scale. The full paper of this abstract can be found at

Smash: Flexible, Fast, and Resource-efficient Placement and Lookup of Distributed Storage

Smash is a new placement and lookup method for distributed storage systems. It achieves full placement flexibility and low DRAM cost to store ID-to-location mappings, two desired features that could not be achieved simultaneously by any prior method.

SESSION: Emerging Architectures

SLITS: Sparsity-Lightened Intelligent Thread Scheduling

To make the most of hardware resources in multi-core architectures, effective thread scheduling is crucial. To achieve this, various scheduling objectives have been developed, such as reducing hardware resource contention [1, 11], allocating resources evenly for co-running threads [ 5, 6], and following priority-based policies [7]. Current thread scheduling designs can be categorized into two types. The first type involves fixed-rule scheduling1, which does not depend on workload characteristics and cannot meet the needs of different scheduling objectives. The second type takes the scheduling objectives into account by collecting run-time information on threads together with their correlations (e.g., Cache Miss Count [10, 13], Thread IPC [20], dynamic priority requirements like Earliest Deadline First [2 , 8, 17]), and making scheduling decisions based on thread-to-thread interactions.

A unified approach to these two types of scheduler designs can be achieved by focusing on Thread-Interaction statistics. To this end, we introduce the Thread-Interaction Matrix (TIM), which stores statistics on thread-to-thread interaction. These statistics can be any type of run-time statistics concerning the thread-to-thread pairs (e.g., Cache Miss Count [10 , 13], Thread IPC [20], dynamic priority requirements like Earliest Deadline First [2 , 8, 17])2. For fixed-rule scheduling, the TIM contains static values as the statistics do not affect the scheduling decisions. Based on the TIM, scheduling policies can be customized by specifying the rules of thread rescheduling, such as reschedule conditions and strategy. Combining the Thread-Interaction Matrix and scheduling policy provides a formalization of existing thread scheduler designs. Therefore, it is essential to design a general scheduler that can be tailored to different scheduling objectives by co-designing the Thread-Interaction Matrix and the scheduling policy in a synergistic manner.

Asynchronous Automata Processing on GPUs

Finite-state automata serve as compute kernels for application domains such as pattern matching and data analytics. Existing approaches on GPUs exploit three levels of parallelism in automata processing tasks: 1) input stream level, 2) automaton-level, and 3) state-level. Among these, only state-level parallelism is intrinsic to automata while the other two levels of parallelism depend on the number of automata and input streams to be processed. As GPU resources increase, a parallelism-limited automata processing task can underutilize GPU compute resources. To overcome this, we propose AsyncAP, a low-overhead approach that optimizes scalability and throughput. Our insight is that most automata processing tasks have an additional source of parallelism originating from the input symbols which has not been leveraged before. By making the matching process asynchronous, which involves having parallel GPU threads process an input stream from different input locations instead of processing it serially, AsyncAP is able to significantly improve throughput and scale with input length. Detailed evaluation across 12 applications shows that AsyncAP achieves an average speedup of 58x speedup over the state-of-the-art GPU automata processing engine when the task does not have enough parallelism to utilize all GPU cores. When tasks have enough parallelism to utilize GPU cores, AsyncAP still achieves 2.4x speedup.

SESSION: Game Theory, Pricing&Mean-Field Analysis

Gacha Game Analysis and Design

Gacha game is a special opaque selling approach, where the seller is selling gacha pulls to the buyer. Each gacha pull provides a certain probability for the buyer to win the gacha game reward. The gacha game has been enthusiastically embraced in numerous online video games and has a wide range of potential applications. In this work, we model the complex interaction between the seller and the buyer as a Stackelberg game, where the sequential decision of the buyer is modeled as a Markov Decision Process (MDP). We define the whale property in the context of gacha games. Then, we show that this is the necessary condition to achieve optimal revenue. Moreover, we provide the revenue-optimal gacha game design and show that it is equivalent to the single-item single-bidder Myerson auction. We further explore two popular multi-item gacha games, namely, the sequential multi-item gacha game and the banner-based multi-item gacha game. We also discuss the subsidies in the gacha game and demonstrate how subsidies may encourage the buyer to engage in grinding behavior. Finally, we provide a case study on blockchain systems as gacha games.

Mean-field Analysis for Load Balancing on Spatial Graphs

A pivotal methodological tool behind the analysis of large-scale load balancing systems is mean-field analysis. The high-level idea is to represent the system state by aggregate quantities and characterize their rate of change as the system size grows large. An assumption for the above scheme to work is that the aggregate quantity is Markovian such that its rate of change can be expressed as a function of its current state. If the aggregate quantity is not Markovian, not only does this technique break down, the mean-field approximation may even turn out to be highly inaccurate.

In load balancing systems, if servers are exchangeable, then the aggregate quantity is indeed Markovian. However, the growing heterogeneity in the types of tasks processed by modern data centers has recently motivated the research community to consider systems beyond the exchangeability assumption. The main reason stems from data locality, i.e., the fact that servers need to store resources to process tasks of a particular type locally and have only limited storage space. An emerging line of work thus considers a bipartite graph between task types and servers [2, 3, 5 -7]. In this compatibility graph, an edge between a server and a task type represents the server's ability to process these tasks. In practice, storage capacity or geographical constraints force a server to process only a small subset of all task types, leading to sparse network topologies. This motivates the study of load balancing in systems with suitably sparse bipartite compatibility graphs.

Bias and Refinement of Multiscale Mean Field Models

We analyze the error of an ODE approximation of a generic two-timescale model (X, Y), where the slow component X describes a population of interacting particles which is fully coupled with a rapidly changing environment Y. The model is parametrized by a scaling factor N, which can be the number of particles. As N grows, the jump sizes of the slow component decrease in contrast to the unchanged dynamics of the fast component. A typical example is the random access CSMA model that we study.

By using an averaging principle, one can construct an ODE approximation of X, that we call the 'average' mean field approximation. We show that under relatively mild conditions, this approximation has a bias of order O(1/N) compared to \mathbbE [X]. This holds true under any continuous performance metric in the transient regime, as well as for the steady-state if the model is exponentially stable. To go one step further, we derive a bias correction term for the steady-state, from which we define a new approximation called the refined 'average' mean field approximation whose bias is of order O(1/N2). This refined 'average' mean field approximation allows computing an accurate approximation even for small scaling factors, i.e., N ≈ 10 - 50.

Enabling Long-term Fairness in Dynamic Resource Allocation

We study the fairness of dynamic resource allocation problem under the α-fairness criterion. We recognize two different fairness objectives that naturally arise in this problem: the well-understood slot-fairness objective that aims to ensure fairness at every timeslot, and the less explored horizon-fairness objective that aims to ensure fairness across utilities accumulated over a time horizon. We argue that horizon-fairness comes at a lower price in terms of social welfare. We study horizon-fairness with the regret as a performance metric and show that vanishing regret cannot be achieved in presence of an unrestricted adversary. We propose restrictions on the adversary's capabilities corresponding to realistic scenarios and an online policy that indeed guarantees vanishing regret under these restrictions.

SESSION: ML for Systems

PEACH: Proactive and Environment Aware Channel State Information Prediction with Depth Images

Up-to-date and accurate prediction of Channel State Information (CSI) is of paramount importance in Ultra-Reliable Low-Latency Communications (URLLC), specifically in dynamic environments where unpredictable mobility is inherent. CSI can be meticulously tracked by means of frequent pilot transmissions, which on the downside lead to an increase in metadata (overhead signaling) and latency, which are both detrimental for URLLC. To overcome these issues, in this paper, we take a fundamentally different approach and propose PEACH, a machine learning system which utilizes environmental information with depth images to predict CSI amplitude in beyond 5G systems, without requiring metadata radio resources, such as pilot overheads or any feedback mechanism. PEACH exploits depth images by employing a convolutional neural network to predict the current and the next 100 ms CSI amplitudes. The proposed system is experimentally validated with extensive measurements conducted in an indoor environment. We prove that environmental information can be instrumental towards proactive CSI amplitude acquisition of both static and mobile users on base stations, while completely avoiding the dependency on feedback and pilot transmission for both downlink and uplink CSI information. Furthermore, compared to demodulation reference signal based traditional pilot estimation, in ideal conditions without interference, our experimental results show that PEACH yields the similar performance in terms of average bit error rate. More importantly, in the realistic cases with interference taken into account, our experiments demonstrate considerable improvements introduced by PEACH in terms of normalized mean square error of CSI amplitude estimation when compared to traditional approaches.

FuncPipe: A Pipelined Serverless Framework for Fast and Cost-efficient Training of Deep Learning Models

Training deep learning (DL) models in the cloud has become a norm. With the emergence of serverless computing and its benefits of true pay-as-you-go pricing and scalability, systems researchers have recently started to provide support for serverless-based training. However, the ability to train DL models on serverless platforms is hindered by the resource limitations of today's serverless infrastructure and DL models' explosive requirement for memory and bandwidth. This paper describes FUNCPIPE, a novel pipelined training framework specifically designed for serverless platforms that enable fast and low-cost training of DL models. FUNCPIPE is designed with the key insight that model partitioning can be leveraged to bridge both memory and bandwidth gaps between the capacity of serverless functions and the requirement of DL training. Conceptually simple, we have to answer several design questions, including how to partition the model, configure each serverless function, and exploit each function's uplink/downlink bandwidth. We implement FUNCPIPE on two popular cloud serverless platforms and show that it achieves 7%-77% cost savings and 1.3X-2.2X speedup compared to state-of-the-art serverless-based frameworks.

Characterizing the Performance of Accelerated Jetson Edge Devices for Training Deep Learning Models

Deep Neural Network (DNN) models are becoming ubiquitous in a variety of contemporary domains such as Autonomous Vehicles, Smart cities and Healthcare. They help drones to navigate, identify suspicious activities from safety cameras, and perform diagnostics over medical imaging. Fast DNN inferencing close to the data source is enabled by a growing class of accelerated edge devices such as NVIDIA Jetson and Google Coral which host low-power Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) along with ARM CPUs in a compact form-factor to offer a superior performance-to-energy ratio. E.g., the NVIDIA Jetson AGX Xavier kit has a 512-core Volta GPU, an 8-core ARM CPU and 32GB LPDDR4x memory, that operates within 65W of power, costs US999 and is smaller than a paperback novel.

Recently, there has been a push towards training DNN models on the edge [2]. This is driven by the massive growth in data collected from edge devices in Cyber-Physical Systems (CPS) and Internet of Things (IoT), the need to refresh the models periodically, the bandwidth constraints in moving all this data to Cloud data centers for training, and a heightened emphasis on privacy by retaining data on the edge. This has led to techniques like federated and geo-distributed learning that train DNN models locally on data on an edge device and aggregate them centrally. In this abstract, we summarise and highlight key results from our full paper [5].

Malcolm: Multi-agent Learning for Cooperative Load Management at Rack Scale

We consider the problem of balancing the load among servers in dense racks for microsecond-scale workloads. To balance the load in such settings, tens of millions of scheduling decisions have to be made per second. Achieving this throughput while providing microsecond-scale latency is extremely challenging. To address this challenge, we design a fully decentralized load-balancing framework, which allows servers to collectively balance the load in the system. We model the interactions among servers as a cooperative stochastic game. To find the game's parametric Nash equilibrium, we design and implement a decentralized algorithm based on multi-agent-learning theory. We empirically show that our proposed algorithm is adaptive and scalable while outperforming state-of-the art alternatives. The full paper of this abstract can be found at

SESSION: Network Measurement

A Comparative Analysis of Ookla Speedtest and Measurement Labs Network Diagnostic Test (NDT7)

Consumers, regulators, and ISPs all use client-based "speed tests" to measure network performance, both in single-user settings and in aggregate. Two prevalent speed tests, Ookla's Speedtest and Measurement Lab's Network Diagnostic Test (NDT), are often used for similar purposes, despite having significant differences in both the test design and implementation, and in the infrastructure used to perform measurements. In this paper, we present the first-ever comparative evaluation of Ookla and NDT7 (the latest version of NDT), both in controlled and wide-area settings. Our goal is to characterize when and to what extent these two speed tests yield different results, as well as the factors that contribute to the differences. To study the effects of the test design, we conduct a series of controlled, in-lab experiments under a comprehensive set of network conditions and usage modes (e.g., TCP congestion control, native vs. browser client). Our results show that Ookla and NDT7 report similar speeds under most in-lab conditions, with the exception of networks that experience high latency, where Ookla consistently reports higher throughput. To characterize the behavior of these tools in wide-area deployment, we collect more than 80,000 pairs of Ookla and NDT7 measurements across nine months and 126 households, with a range of ISPs and speed tiers. This first-of-its-kind paired-test analysis reveals many previously unknown systemic issues, including high variability in NDT7 test results and systematically under-performing servers in the Ookla network.

Each at its Own Pace: Third-Party Dependency and Centralization Around the World

We describe the results of a large-scale study of third-party dependencies around the world based on regional top-500 popular websites accessed from vantage points in 50 countries, together covering all inhabited continents. This broad perspective shows that dependencies on a third-party DNS, CDN or CA provider vary widely around the world. The critical dependencies of websites -- where the site depends on a single third-party provider -- are equally spread. Even more concerning, these differences persist a year later with increasing dependencies, particularly for DNS and CDNs. We briefly explore various factors that may help explain the differences and similarities in degrees of third-party dependency across countries, including economic conditions, Internet development, economic trading partners, categories, home countries, and traffic skewness of the country's top-500 sites.

Detecting and Measuring Aggressive Location Harvesting in Mobile Apps via Data-flow Path Embedding

Today, location-based services have become prevalent in the mobile platform, where mobile apps provide specific services to a user based on his or her location. Unfortunately, mobile apps can aggressively harvest location data with much higher accuracy and frequency than they need because the coarse-grained access control mechanism currently implemented in mobile operating systems (e.g., Android) cannot regulate such behavior. This unnecessary data collection violates the data minimization policy, yet no previous studies have investigated privacy violations from this perspective, and existing techniques are insufficient to address this violation. To fill this knowledge gap, we take the first step toward detecting and measuring this privacy risk in mobile apps at scale. Particularly, we annotate and release the first dataset to characterize those aggressive location harvesting apps and understand the challenges of automatic detection and classification. Next, we present a novel system, LocationScope, to address these challenges by (i) uncovering how an app collects locations and how to use such data through a fine-tuned value set analysis technique, (ii) recognizing the fine-grained location-based services an app provides via embedding data-flow paths, which is a combination of program analysis and machine learning techniques, extracted from its location data usages, and (iii) identifying aggressive apps with an outlier detection technique achieving a precision of 97% in aggressive app detection. Our technique has further been applied to millions of free Android apps from Google Play as of 2019 and 2021. Highlights of our measurements on detected aggressive apps include their growing trend from 2019 to 2021 and the app generators' significant contribution of aggressive location harvesting apps.

Fiat Lux: Illuminating IPv6 Apportionment with Different Datasets

IPv6 adoption continues to grow, making up more than 40% of client traffic to Google globally. While the ubiquity of the IPv4 address space makes it comparably easier to understand, the vast and less studied IPv6 address space motivates a variety of works detailing methodology to collect and analyze IPv6 properties, many of which use knowledge from specific data sources as a lens for answering research questions. Despite such work, questions remain on basic properties such as the appropriate prefix size for research tasks.

Our work fills this knowledge gap by presenting an analysis of the apportionment of the IPv6 address space from the ground-up, using data and knowledge from numerous data sources simultaneously, aimed at identifying how to leverage IPv6 address information for a variety of research tasks. Utilizing WHOIS data from RIRs, routing data, and hitlists, we highlight fundamental differences in apportionment sizes and structural properties depending on data source and examination method. We focus on the different perspectives each dataset offers and the disjoint, heterogeneous nature of these datasets when taken together. We additionally leverage a graph-based analysis method for these datasets that allows us to draw conclusions regarding when and how to intersect the datasets and their utility. The differences in each dataset's perspective is not due to dataset problems but rather stems from a variety of differing structural and deployment behaviors across RIRs and IPv6 providers alike. In light of these inconsistencies, we discuss network address partitioning, best practices, and considerations for future IPv6 measurement and analysis projects.

Network Monitoring on Multi-Pipe Switches

Programmable switches have been widely used to design network monitoring solutions that operate in the fast data-plane level, e.g., detecting heavy hitters, super-spreaders, computing flow size distributions and their entropy. Existing works assume packets access the same memory region in a switch. However, high-speed ASIC switches deploy multiple packet processing pipes, each equipped with its own independent memory.

In this work, we first quantify the accuracy degradation due to splitting a monitoring data structure across multiple pipes (e.g., up to 3000x worse flow-size estimation average error). We then present PipeCache, a system that adapts existing data-plane mechanisms to multi-pipe switches by storing monitoring information for a traffic class into a single pipe. PipeCache stores monitoring information into a cache and piggybacks this information onto existing data packets to the correct pipe. Our implementation shows a 2-20x memory reduction to achieve an accuracy similar to single-pipe deployments.

Real-time Spread Burst Detection in Data Streaming

Data streaming has many applications in network monitoring, web services, e-commerce, stock trading, social networks, and distributed sensing. This paper introduces a new problem of real-time burst detection in flow spread. It is a challenging problem because estimating flow spread requires us to remember all past data items and detecting bursts in real time requires us to minimize spread estimation overhead, which was not the priority in most prior work. This paper provides the first efficient, real-time solution for spread burst detection. It is designed based on a new real-time super spreader identifier, which outperforms the state of the art in terms of both accuracy and processing overhead. The super spreader identifier is in turn based on a new sketch design for real-time spread estimation, which outperforms the best existing sketches.

JS Capsules: A Framework for Capturing Fine-grained JavaScript Memory Measurements for the Mobile Web.

Understanding the resource consumption of the mobile web is an important topic that has garnered much attention in recent years. However, existing works mostly focus on the networking or computational aspects of the mobile web and largely ignore memory, which is an important aspect given the mobile web's reliance on resource-heavy JavaScript.

In this paper, we propose a framework, called JS Capsule, for characterizing the memory of JavaScript functions and, using this framework, we investigate the key browser mechanics that contribute to the memory overhead. Leveraging our framework on a testbed of Android mobile phones, we conduct measurements of the Alexa top 1K websites. While most existing frameworks focus on V8 --- the JavaScript engine used in most popular browsers --- in the context of memory, our measurements show that the memory implications of JavaScript extends far beyond V8 due to the cascading effects that certain JavaScript calls have on the browser's rendering mechanics. We quantify and highlight the direct impact that website DOM have on JavaScript memory overhead and present, to our knowledge, the first root-cause analysis to dissect and characterize their impact on JavaScript memory overheads.

SESSION: Online Optimization

Online Fair Allocation of Perishable Resources

We consider a practically motivated variant of the canonical online fair allocation problem: a decision-maker has a budget of resources to allocate over a fixed number of rounds. Each round sees a random number of arrivals, and the decision-maker must commit to an allocation for these individuals before moving on to the next round. In contrast to prior work, we consider a setting in which resources are perishable and individuals' utilities are potentially non-linear (e.g., goods exhibit complementarities). The goal is to construct a sequence of allocations that is envy-free and efficient. We design an algorithm that takes as input (i) a prediction of the perishing order, and (ii) a desired bound on envy. Given the remaining budget in each period, the algorithm uses forecasts of future demand and perishing to adaptively choose one of two carefully constructed guardrail quantities. We characterize conditions under which our algorithm achieves the optimal envy-efficiency Pareto frontier. We moreover demonstrate its strong numerical performance using data from a partnering food bank.

Dynamic Bin Packing with Predictions

The MinUsageTime Dynamic Bin Packing (DBP) problem aims to minimize the accumulated bin usage time for packing a sequence of items into bins. It is often used to model job dispatching for optimizing the busy time of servers, where the items and bins match the jobs and servers respectively. It is known that the competitiveness of MinUsageTime DBP has tight bounds of Θ(√, log μ) and Θ(μ) in the clairvoyant and non-clairvoyant settings respectively, where μ is the max/min duration ratio of all items. In practice, the information about items' durations (i.e., job lengths) obtained via predictions is usually prone to errors. In this paper, we study the MinUsageTime DBP problem with predictions of items' durations. We find that an existing O(√ log μ)-competitive clairvoyant algorithm, if using predicted durations rather than real durations for packing, does not provide any bounded performance guarantee when the predictions are adversarially bad. We develop a new online algorithm with a competitive ratio of {O(∈2 √ log(∈2 μ)}, O(μ) (where ε is the maximum multiplicative error of prediction among all items), achieving O(√ log μ) consistency (competitiveness under perfect predictions where ∈ = 1) and O(μ) robustness (competitiveness under terrible predictions), both of which are asymptotically optimal.

The Online Knapsack Problem with Departures

The online knapsack problem is a classic online resource allocation problem in networking and operations research. Its basic version studies how to pack online arriving items of different sizes and values into a capacity-limited knapsack. In this paper, we study a general version that includes item departures, while also considering multiple knapsacks and multi-dimensional item sizes. We design a threshold-based online algorithm and prove that the algorithm can achieve order-optimal competitive ratios. Beyond worst-case optimized algorithms, we also propose a data-driven online algorithm that can achieve near-optimal average performance under typical instances while guaranteeing the worst-case performance.

(Private) Kernelized Bandits with Distributed Biased Feedback

We study kernelized bandits with distributed biased feedback. This problem is motivated by several real-world applications (such as dynamic pricing, cellular network configuration, and policy making), where users from a large population contribute to the reward of the action chosen by a central entity, but it is difficult to collect feedback from all users. Instead, only biased feedback (due to user heterogeneity) from a subset of users may be available. In addition to such biased feedback, we are also faced with two practical challenges due to communication cost and computation complexity. To tackle these challenges, we carefully design a new distributed phase-then-batch-based elimination (DPBE) algorithm, which samples users in phases for collecting feedback to reduce the bias and employs maximum variance reduction to select actions in batches within each phase. By properly choosing the phase length, the batch size, and the confidence width used for eliminating suboptimal actions, we show that DPBE achieves a sublinear regret of Õ(T1-α/2 +γTT), where α∈ (0,1) is the user-sampling parameter one can tune. Moreover, DPBE can significantly reduce both communication cost and computation complexity in distributed kernelized bandits, compared to some variants of the state-of-the-art algorithms (originally developed for standard kernelized bandits). Furthermore, by incorporating various differential privacy models, we generalize DPBE to provide privacy guarantees for users participating in the distributed learning process. The algorithm design, analyses, and numerical experiments are provided in the full version of this paper [4].

Online Resource Allocation under Horizon Uncertainty

We study stochastic online resource allocation: a decision maker needs to allocate limited resources to stochastically-generated sequentially-arriving requests in order to maximize reward. At each time step, requests are drawn independently from a distribution that is unknown to the decision maker. Online resource allocation and its special cases have been studied extensively in the past, but prior results crucially and universally rely on the strong assumption that the total number of requests (the horizon) is known to the decision maker in advance. In many applications, such as revenue management and online advertising, the number of requests can vary widely because of fluctuations in demand or user traffic intensity. In this work, we develop online algorithms that are robust to horizon uncertainty. In sharp contrast to the known-horizon setting, no algorithm can achieve even a constant asymptotic competitive ratio that is independent of the horizon uncertainty. We introduce a novel generalization of dual mirror descent which allows the decision maker to specify a schedule of time-varying target consumption rates, and prove corresponding performance guarantees. We go on to give a fast algorithm for computing a schedule of target consumption rates that leads to near-optimal performance in the unknown-horizon setting. In particular, our competitive ratio attains the optimal rate of growth (up to logarithmic factors) as the horizon uncertainty grows large. Finally, we also provide a way to incorporate machine-learned predictions about the horizon which interpolates between the known and unknown horizon settings.

Streaming Algorithms for Constrained Submodular Maximization

Due to the pervasive "diminishing returns" property appeared in data-intensive applications, submodular maximization problems have aroused great attention from both the machine learning community and the computation theory community. During the last decades, a lot of algorithms have been proposed for submodular maximization subject to various constraints [4, 6, 8], and these algorithms can be used in numerous applications including sensor placement [9], clustering [5], network design [13], and so on.

The existing algorithms for submodular maximization can be roughly classified into offline algorithms and streaming algorithms; the former assume full access to the whole dataset at any time (e.g.,[4, 10]), while the latter only require an amount of space which is nearly linear in the maximum size of a feasible solution (e.g., [1, 7]). Apparently, streaming algorithms are more useful in big data applications, as the whole data set is usually too large to be fit into memory in practice. However, compared to the offline algorithms, the existing streaming algorithms for submodular maximization generally have weaker capabilities in that they handle more limited problem constraints or achieve weaker performance bounds, due to the more stringent requirements under the streaming setting. Another classification of the existing algorithms is that they concentrate on either monotone or non-monotone submodular functions. As monotone submodular function is a special case of non-monotone submodular function, we will concentrate on non-monotone submodular maximization in this paper.

Robust Multi-Agent Bandits Over Undirected Graphs

We consider a multi-agent multi-armed bandit setting in which n honest agents collaborate over a network to minimize regret but m malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur O((m + K/n) log (T) / Δ) regret in this setting, where K is the number of arms and Δ is the arm gap. For m << K, this improves over the single-agent baseline regret of O(K log(T)/Δ). In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we provide an instance for which honest agents using the state-of-the-art algorithm suffer (nearly) linear regret until time is doubly exponential in n. In light of this negative result, we propose a new algorithm for which the i-th agent has regret O((dmal (i) + K/n) log(T)/Δ) on any connected and undirected graph, where dmal (i) is the number of i's neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph and show the effect of malicious agents is entirely local.

Optimistic No-regret Algorithms for Discrete Caching

We take a systematic look at the problem of storing whole files in a cache with limited capacity in the context of optimistic learning, where the caching policy has access to a prediction oracle. The successive file requests are assumed to be generated by an adversary, and no assumption is made on the accuracy of the oracle. We provide a universal lower bound for prediction-assisted online caching and proceed to design a suite of policies with a range of performance-complexity trade-offs. All proposed policies offer sublinear regret bounds commensurate with the accuracy of the oracle. In this pursuit, we design, to the best of our knowledge, the first optimistic Follow-the-Perturbed leader policy, which generalizes beyond the caching problem. We also study the problem of caching files with different sizes and the bipartite network caching problem.

Smoothed Online Optimization with Unreliable Predictions

We consider online optimization with switching costs in a normed vector space (X, ||·||) wherein, at each time t, a decision maker observes a non-convex hitting cost function ƒ : t X →[0, ∞] and must decide upon some xt∈X→, paying ƒt (xt) + || xt-xt-1||, where ||·|| characterizes the switching cost. Throughout, we assume that ƒt is globally α-polyhedral, i.e., ƒt has a unique minimizer υt ∈X, and, for all x ∈ X, ƒ t) (x) ≥ ƒt + α · ||x - υ t. Moreover, we assume that the decision maker has access to an untrusted prediction xt of the optimal decision during each round, such as the decision suggested by a black-box AI tool.

Online Adversarial Stabilization of Unknown Networked Systems

We investigate the problem of stabilizing an unknown networked linear system under communication constraints and adversarial disturbances. We propose the first provably stabilizing algorithm for the problem. The algorithm uses a distributed version of nested convex body chasing to maintain a consistent estimate of the network dynamics and applies system level synthesis to determine a distributed controller based on this estimated model. Our approach avoids the need for system identification and accommodates a broad class of communication delay while being fully distributed and scaling favorably with the number of subsystems.

SESSION: Queueing Theory

The M/M/k with Deterministic Setup Times

Capacity management, whether it involves servers in a data center, or human staff in a call center, or doctors in a hospital, is largely about balancing a resource-delay tradeoff. On the one hand, one would like to turn off servers when not in use (or send home staff that are idle) to save on resources. On the other hand, one wants to avoid the considerable setup time required to turn an off server back on. This paper aims to understand the delay component of this tradeoff, namely, what is the effect of setup time on average delay in a multi-server system?

Surprisingly little is known about the effect of setup times on delay. While there has been some work on studying the M/M/k with Exponentially-distributed setup times, these works provide only iterative methods for computing mean delay, giving little insight as to how delay is affected by k, by load, and by the setup time. Furthermore, setup time in practice is much better modeled by a Deterministic random variable, and, as this paper shows, the effect of a Deterministic setup time is nothing like that of an Exponentially-distributed setup time.

This paper provides the first analysis of the M/M/k with Deterministic setup times. We prove a lower bound on the effect of setup on delay. Our result is a simple algebraic formula which provides insight into how delay scales with the input parameters. For more details, see the full paper.

Joint Learning and Control in Stochastic Queueing Networks with Unknown Utilities

We study the optimal control problem in stochastic queueing networks with a set of job dispatchers connected to a set of parallel servers with queues. Jobs arrive at the dispatchers and get routed to the servers following some routing policy. The arrival processes of jobs and the service processes of servers are stochastic with unknown arrival rates and service rates. Upon the completion of each job from dispatcher un at server sm, a random utility whose mean is unknown is obtained. We seek to design a control policy that makes routing decisions at the dispatchers and scheduling decisions at the servers to maximize the total utility obtained by the end of a finite time horizon T. The performance of policies is measured by regret, which is defined as the difference in total expected utility with respect to the optimal dynamic policy that has access to arrival rates, service rates and underlying utilities.

We first show that the expected utility of the optimal dynamic policy is upper bounded by T times the solution to a static linear program, where the optimization variables correspond to rates of jobs from dispatchers to servers and the feasibility region is parameterized by arrival rates and service rates. We next propose a policy for the optimal control problem that is an integration of a learning algorithm and a control policy. The learning algorithm seeks to learn the optimal extreme point solution to the static linear program based on the information available in the optimal control problem. The control policy, a mixture of priority-based and Joint-the-Shortest-Queue routing at the dispatchers and priority-based scheduling at the servers, makes decisions based on the graphical structure induced by the extreme point solutions provided by the learning algorithm. We prove that our policy achieves logarithmic regret whereas application of existing techniques to the optimal control problem would lead to Ω(√T)-regret. The theoretical analysis is further complemented with simulations to evaluate the empirical performance of our policy.

Constant Regret Primal-Dual Policy for Multi-way Dynamic Matching

We study a discrete-time dynamic multi-way matching model. There are finitely many agent types that arrive stochastically and wait to be matched. State-of-the-art dynamic matching policies in the literature require the knowledge of all system parameters to determine an optimal basis of the fluid relaxation, and focus on controlling the number of waiting agents using only matches in the optimal basis [4,6,7]. In this paper, we propose a primal-dual policy that schedule matches for future arrivals based on an estimator for the dual solution. Our policy does not require the knowledge of optimal bases, and is the first to achieve constant regret at all times under unknown arrival rates. In addition, we show that when the arrival rates are known, the primal-dual policy achieves the optimal scaling as the lower-bound described in [6,7]. Furthermore, we find that when the arrival rates are known, the primal-dual policy can significantly outperform alternative dynamic matching policies in numerical simulations.

SESSION: Reinforcement Learning

Bias and Extrapolation in Markovian Linear Stochastic Approximation with Constant Stepsizes

We consider Linear Stochastic Approximation (LSA) with constant stepsize and Markovian data. Viewing the joint process of the data and LSA iterate as a time-homogeneous Markov chain, we prove its convergence to a unique limiting and stationary distribution in Wasserstein distance and establish non-asymptotic, geometric convergence rates. Furthermore, we show that the bias vector of this limit admits an infinite series expansion with respect to the stepsize. Consequently, the bias is proportional to the stepsize up to higher order terms. This result stands in contrast with LSA under i.i.d. data, for which the bias vanishes. In the reversible chain setting, we provide a general characterization of the relationship between the bias and the mixing time of the Markovian data, establishing that they are roughly proportional to each other.

Polyak-Ruppert averaging reduces the variance of the LSA iterates but does not affect the bias. The above characterization allows us to show that the bias can be reduced using Richardson-Romberg extrapolation with m≥ 2 stepsizes, which eliminates the m-1 leading terms in the bias expansion. This extrapolation scheme leads to an exponentially smaller bias and an improved mean squared error, both in theory and empirically. Our results immediately apply to the Temporal Difference learning algorithm with linear function approximation, Markovian data, and constant stepsizes.

Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning

We study a multi-agent reinforcement learning (MARL) problem where the agents interact over a given network. The goal of the agents is to cooperatively maximize the average of their entropy-regularized long-term rewards. To overcome the curse of dimensionality and to reduce communication, we propose a Localized Policy Iteration (LPI) algorithm that provably learns a near-globally-optimal policy using only local information. In particular, we show that, despite restricting each agent's attention to only its κ-hop neighborhood, the agents are able to learn a policy with an optimality gap that decays polynomially in κ. In addition, we show the finite-sample convergence of LPI to the global optimal policy, which explicitly captures the trade-off between optimality and computational complexity in choosing κ. Numerical simulations demonstrate the effectiveness of LPI. This extended abstract is an abridged version of [12].

Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure

Reinforcement learning (RL) methods have been increasingly popular in sequential decision making tasks due to its empirical success. However, large state and action spaces in real-world problems modeled as a Markov decision processes (MDPs) limit the use of RL algorithms. Given a standard finite-horizon MDP (S, A, P, R, H) with state space S, action space A, transition kernel P = {Ph} ∈ []H, reward function R = {R h} ∈ [H] bounded between zero and one, and time horizon H, one needs Ω (|S||A|H3/∈2 samples given a generative model to learn an optimal policy [3], which can be impractical when S and A are large. The above tabular RL framework does not capture the fact that many real-world systems in fact have additional structure that if exploited should improve computational and statistical efficiency. Moreover, [1] empirically verifies that optimal and near-optimal action-value functions (both viewed as |S|-by-|A| matrices) of classical stochastic control tasks have low rank. Thus, the critical question is what are the minimal low rank structural assumptions that allow for computationally and statistically efficient learning?

SESSION: Security & Privacy

Detecting and Measuring Security Risks of Hosting-Based Dangling Domains

Public hosting services offer a convenient and secure option for creating web applications. However, adversaries can take over a domain by exploiting released service endpoints, leading to hosting-based domain takeover. This threat has affected numerous popular websites, including the subdomains of However, no effective detection system for identifying vulnerable domains at scale exists to date. This paper fills the research gap by presenting a novel framework, HostingChecker, for detecting domain takeovers. HostingChecker expands detection scope and improves efficiency compared to previous work by: (i) identifying vulnerable hosting services using a semi-automated method; and (ii) detecting vulnerable domains through passive reconstruction of domain dependency chains. The framework enables us to detect the subdomains of Tranco sites on a daily basis. It discovers 10,351 vulnerable subdomains under Tranco Top-1M apex domains, which is over 8× more than previous findings, demonstrating its effectiveness. Furthermore, we conduct an in-depth security analysis on the affected vendors (e.g., Amazon, Alibaba) and gain a suite of new insights, including flawed domain ownership validation implementation. In the end, we have reported the issues to the security response centers of affected vendors, and some (e.g., Baidu and Tencent) have adopted our mitigation. The full paper is provided in [2].

Batching of Tasks by Users of Pseudonymous Forums: Anonymity Compromise and Protection

In a number of applications where anonymity is critical, users act under pseudonyms to preserve their privacy. For instance, in scientific peer review using forums like, reviewers make comments on papers that are publicly viewable. Reviewers who have been assigned multiple papers operate under different pseudonyms across their papers to remain anonymous. Other examples of publicly visible tasks where users operate under pseudonyms include Wikipedia editing and cryptocurrency transactions.

In these settings, it is common for users to engage in batching - the completion of several similar tasks at the same time. Batching occurs both due to natural bursts in activity (e.g., a person visits a website and makes many comments at once) or as a productivity strategy used to streamline work.

In peer-review forums such as computer science conferences, reviewers and meta-reviewers are often assigned multiple papers. We find empirically that reviewers are highly likely to batch their comments and/or reviews across papers. In analysis of data from a top Computer Science conference with thousands of papers, reviewers, and discussion comments we find that when reviewers and meta-reviewers comment on multiple papers, they have a 30.10% chance of batching their comments within 5 minutes of one other. In comparison, any randomly chosen pair of reviewers and meta- reviewers had only a 0.66% chance of making comments on different papers within 5 minutes of each other.

Characterizing Cryptocurrency-themed Malicious Browser Extensions

Due to the surging popularity of various cryptocurrencies in recent years, a large number of browser extensions have been developed as portals to access relevant services, such as cryptocurrency exchanges and wallets. This has stimulated a wild growth of cryptocurrency-themed malicious extensions that cause heavy financial losses to the users and legitimate service providers. They have shown their capability of evading the stringent vetting processes of the extension stores, highlighting a lack of understanding of this emerging type of malware in our community. In this work, we conduct the first systematic study to identify and characterize cryptocurrency-themed malicious extensions. We monitor seven official and third-party extension distribution venues for 18 months (December 2020 to June 2022) and have collected around 3600 unique cryptocurrency-themed extensions. Leveraging a hybrid analysis, we have identified 186 malicious extensions that belong to five categories. We then characterize those extensions from various perspectives including their distribution channels, life cycles, developers, illicit behaviors, and illegal gains. Our work unveils the status quo of the cryptocurrency-themed malicious extensions and reveals their disguises and programmatic features on which detection techniques can be based. Our work serves as a warning to extension users, and an appeal to extension store operators to enact dedicated countermeasures. To facilitate future research in this area, we release our dataset of the identified malicious extensions and open-source our analyzer.

Strategic Latency Reduction in Blockchain Peer-to-Peer Networks

Most permissionless blockchain networks run on peer-to-peer (P2P) networks, which offer flexibility and decentralization at the expense of performance (e.g., network latency). Historically, this tradeoff has not been a bottleneck for most blockchains. However, an emerging host of blockchain-based applications (e.g., decentralized finance) are increasingly sensitive to latency; users who can reduce their network latency relative to other users can accrue (sometimes significant) financial gains. In this work, we initiate the study of strategic latency reduction in blockchain P2P networks. We first define two classes of latency that are of interest in blockchain applications. We then show empirically that a strategic agent who controls only their local peering decisions can manipulate both types of latency, achieving 60% of the global latency gains provided by the centralized, paid service bloXroute, or, in targeted scenarios, comparable gains. Finally, we show that our results are not due to the poor design of existing P2P networks. Under a simple network model, we theoretically prove that an adversary can always manipulate the P2P network's latency to their advantage, provided the network experiences sufficient peer churn and transaction activity.

SESSION: Stochastic Scheduling

Power-of-d Choices Load Balancing in the Sub-Halfin Whitt Regime

We characterize the steady-state queue length distribution for the Power-of-d choices routing algorithm for almost all values of d in the sub-Halfin Whitt asymptotic regime.

On the Stochastic and Asymptotic Improvement of First-Come First-Served and Nudge Scheduling

Recently it was shown that, contrary to expectations, the First-Come-First-Served (FCFS) scheduling algorithm can be stochastically improved upon by a scheduling algorithm called Nudge for light-tailed job size distributions. Nudge partitions jobs into 4 types based on their size, say small, medium, large and huge jobs. Nudge operates identical to FCFS, except that whenever a small job arrives that finds a large job waiting at the back of the queue, Nudge swaps the small job with the large one unless the large job was already involved in an earlier swap.

In this paper, we show that FCFS can be stochastically improved upon under far weaker conditions. We consider a system with 2 job types and limited swapping between type-1 and type-2 jobs, but where a type-1 job is not necessarily smaller than a type-2 job. More specifically, we introduce and study the Nudge-K scheduling algorithm which allows type-1 jobs to be swapped with up to K type-2 jobs waiting at the back of the queue, while type-2 jobs can be involved in at most one swap. We present an explicit expression for the response time distribution under Nudge-K when both job types follow a phase-type distribution. Regarding the asymptotic tail improvement ratio (ATIR), we derive a simple expression for the ATIR, as well as for the K that maximizes the ATIR. We show that the ATIR is positive and the optimal K tends to infinity in heavy traffic as long as the type-2 jobs are on average longer than the type-1 jobs.

Optimal Scheduling in the Multiserver-job Model under Heavy Traffic

Multiserver-job systems, where jobs require concurrent service at many servers, occur widely in practice. Essentially all of the theoretical work on multiserver-job systems focuses on maximizing utilization, with almost nothing known about mean response time. Our goal in this paper is to minimize mean response time in a multiserver-job setting. Minimizing mean response time requires prioritizing small jobs while simultaneously maximizing utilization. Our question is how to achieve these joint objectives.

We devise the ServerFilling-SRPT scheduling policy, which is the first policy to minimize mean response time in the multiserver-job model in the heavy traffic limit. In addition to proving this heavy-traffic result, we present empirical evidence that ServerFilling-SRPT outperforms all existing scheduling policies for all loads, with orders of magnitude improvements at high load.

Because ServerFilling-SRPT requires knowing job sizes, we also define the ServerFilling-Gittins policy, which is optimal when sizes are unknown or partially known.

For more detail, see the full paper

SESSION: Wireless Networks

Switching in the Rain: Predictive Wireless x-haul Network Reconfiguration

4G, 5G, and smart city networks often rely on microwave and millimeter-wave x-haul links. A major challenge associated with these high frequency links is their susceptibility to weather conditions. In particular, precipitation may cause severe signal attenuation, which significantly degrades the network performance. In this paper, we develop a Predictive Network Reconfiguration (PNR) framework that uses historical data to predict the future condition of each link and then prepares the network ahead of time for imminent disturbances. The PNR framework has two components: (i) an Attenuation Prediction (AP) mechanism; and (ii) a Multi-Step Network Reconfiguration (MSNR) algorithm. The AP mechanism employs an encoder-decoder Long Short-Term Memory (LSTM) model to predict the sequence of future attenuation levels of each link. The MSNR algorithm leverages these predictions to dynamically optimize routing and admission control decisions aiming to maximize network utilization, while preserving max-min fairness among the nodes using the network (e.g., base-stations) and preventing transient congestion that may be caused by switching routes. We train, validate, and evaluate the PNR framework using a dataset containing over 2 million measurements collected from a real-world city-scale backhaul network. The results show that the framework: (i) predicts attenuation with high accuracy, with an RMSE of less than 0.4 dB for a prediction horizon of 50 seconds; and (ii) can improve the instantaneous network utilization by more than 200% when compared to reactive network reconfiguration algorithms that cannot leverage information about future disturbances. The full paper associated with this abstract can be found at

A First Look at Wi-Fi 6 in Action: Throughput, Latency, Energy Efficiency, and Security

We present the performance measurement of Wi-Fi 6 (IEEE 802.11ax). Our experiments focus on multi-client scenarios. The results reveal the impact of the new channel access mechanisms (i.e., OFDMA and TWT) on spectrum efficiency, latency, energy consumption, and security. (i) A comparison with the legacy CSMA/CA scheme shows that the commodity Wi-Fi 6 achieves 3× overall throughput and dramatically reduces the latency (5×) when coexisting with the legacy Wi-Fi network. (ii) However, the current OFDMA implementation significantly increases the power consumption (6×), implying a design tradeoff between the gain of throughput/latency and the cost of energy consumption. We believe that our findings provide critical insights for the scheduling algorithm design, power optimization, and security protection of the next-generation WLANs. The full paper is provided in \cite10.1145/3579451.

CoBF: Coordinated Beamforming in Dense mmWave Networks

With MIMO and enhanced beamforming features, IEEE 802.11ay is poised to create the next generation of mmWave WLANs that can provide over 100 Gbps data rate. However, beamforming between densely deployed APs and clients incurs unacceptable overhead. On the other hand, the absence of up-to-date beamforming information restricts the diversity gains available through MIMO and multi-users, reducing the overall network capacity. This paper presents a novel approach of "coordinated beamforming" (called CoBF) where only a small subset of APs are selected for beamforming in the 802.11ay mmWave WLANs. Based on the concept of uncertainty, CoBF predicts the APs whose beamforming information is likely outdated and needs updating. The proposed approach complements existing per-link beamforming solutions and extends their effectiveness from link-level to network-level. Furthermore, CoBF leverages the AP uncertainty to create MU-MIMO groups through interference-aware scheduling in 802.11ay WLANs. With extensive experimentation and simulations, we show that CoBF can significantly reduce beamforming overhead and improve network capacity for 802.11ay WLANs.

Leveraging the Properties of mmWave Signals for 3D Finger Motion Tracking for Interactive IoT Applications

Wireless signals, which are mainly used for communication networks, also have the potential to extend our senses, enabling us to see behind closed doors and track moving objects through walls [3, 10]. Accordingly, there is a growing interest in the community recently to develop novel IoT applications for sensing by exploiting radio frequency signals [7]. Given the compact size of modern wireless devices, this enables ubiquitous applications in the areas of smart healthcare, sports analytics, AR/VR etc. Specifically, as these signals travel in the medium, they traverse occlusions and bounce off different objects before arriving at a receiver; hence, the reflected signals carry information about the environment. By exploiting this property, this paper shows the feasibility of tracking precise 3D finger motion using mmWave signals that are popularly used in 5G networks.

Motivation and Application: This paper presents mm4Arm, a system that quantifies the performance of finger motion tracking for interactive applications using mmWave signals through a carefully designed simulation and measurement study. We considered using mmWave signal because FMCW-based radars are being used for ubiquitous applications in the areas of smart healthcare [4], sports analytics, AR/VR [17], autonomous driving [5], etc. Similar to the popular Google Soli platform [15], our main motivation is to enable wearable, mobile computing, and AR/VR applications where conventional touch interaction may be hard. Finger motion-based interfaces over the air are known to be a popular form of human-computer interaction [9, 14]. In contrast to Soli, which can only detect 11 predefined gestures, mm4Arm can perform arbitrary 3D motion tracking, thus allowing highly precise control. Decades of prior research have shown that such a finer control can enable rapid and fluid manipulation for highly intuitive interaction [16]. Therefore, regardless of the application, we focus on enabling the core motion tracking framework by solving the underlying challenges.

Tracking Fingers by Observing the ForeArm: In this paper, we not only focus on tracking the 3D finger motion using mmWave reflections, but based on observations via simulations and measurements, we also identify the underlying conditions that enable precise tracking. A critical observation is that the small size of fingers does not provide stable reflections to the level required for tracking. However, the data-driven analysis reveals that it is possible to indirectly track fingers by measuring reflections from the forearm. Finger motion activation involves neuro-muscular interactions, which induce minute muscular motions in the forearm. Such muscular motion produces vibrations in the forearm. Thanks to the short wavelength of mmWave signals, the phase measurements are extremely sensitive to small vibrations (up to 0.63 μm), thus opening up opportunities for precise motion tracking. Moreover, the forearm offers a rich texture and curvature and a much bigger surface for reflections, in contrast to the small size of fingers, which facilitates robust tracking. mm4Arm analyzes such forearm vibrations for 3D finger motion tracking.

We reiterate two critical observations made in this paper: (i) When 3D finger motion tracking is of interest in contrast to predefined gesture classification, the reflections obtained directly from fingers do not provide sufficient information. Very few reflections come back to the radar due to the small size of fingers and dominant specular reflections. A similar observation on specularity has been made earlier in the context of autonomous cars [6, 13]. (ii) Vibrations in the forearm during finger motion can capture rich information. Because of the large surface of the forearm and its curvature, the reflections are more stable and robust to natural variation in arm position, height, and orientation. This can be leveraged for 3D finger motion tracking.