# SIGMETRICS '19- Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems

Full Citation in the ACM Digital Library

## SESSION: Session 1: Graph Learning

### Collaboratively Learning the Best Option on Graphs, Using Bounded Local Memory

• Lili Su
• Martin Zubeldia
• Nancy Lynch

We consider multi-armed bandit problems in social groups wherein each individual has bounded memory and shares the common goal of learning the best arm/option. We say an individual learns the best option if eventually (as $t\diverge$) it pulls only the arm with the highest expected reward. While this goal is provably impossible for an isolated individual due to bounded memory, we show that, in social groups, this goal can be achieved easily with the aid of social persuasion (i.e., communication) as long as the communication networks/graphs satisfy some mild conditions. In this work, we model and analyze a type of learning dynamics which are well-observed in social groups. Specifically, under the learning dynamics of interest, an individual sequentially decides on which arm to pull next based on not only its private reward feedback but also the suggestion provided by a randomly chosen neighbor. To deal with the interplay between the randomness in the rewards and in the social interaction, we employ the \em mean-field approximation method. Considering the possibility that the individuals in the networks may not be exchangeable when the communication networks are not cliques, we go beyond the classic mean-field techniques and apply a refined version of mean-field approximation: \beginitemize \item Using coupling we show that, if the communication graph is connected and is either regular or has doubly-stochastic degree-weighted adjacency matrix, with probability $\to 1$ as the social group size $N \to \infty$, every individual in the social group learns the best option. \item If the minimum degree of the graph diverges as $N \diverge$, over an arbitrary but given finite time horizon, the sample paths describing the opinion evolutions of the individuals are asymptotically independent. In addition, the proportions of the population with different opinions converge to the unique solution of a system of ODEs. Interestingly, the obtained system of ODEs are invariant to the structures of the communication graphs. In the solution of the obtained ODEs, the proportion of the population holding the correct opinion converges to $1$ exponentially fast in time. \enditemize Notably, our results hold even if the communication graphs are highly sparse.

### Learning Graphs from Noisy Epidemic Cascades

• Jessica H. Hoffmann
• Constantine Caramanis

Epidemic models accurately represent (among other processes) the spread of diseases, information (rumors, viral videos, news stories, etc.), the spread of malevolent agents in a network (computer viruses, malicious apps, etc.), or even biological processes (pathways in cell signaling networks, chains of activation in the gene regulatory network, etc.). We focus on epidemics that spread on an underlying graph [5].

## SESSION: Session 2A: Load Balancing and Multiserver Systems

### Heavy-traffic Delay Optimality in Pull-based Load Balancing Systems: Necessary and Sufficient Conditions

• Xingyu Zhou
• Jian Tan
• Ness Shroff

In this paper, we consider a load balancing system under a general pull-based policy. In particular, each arrival is randomly dispatched to any server whose queue length is below a threshold; if no such server exists, then the arrival is randomly assigned to any server. We are interested in the fundamental relationship between the threshold and the delay performance of the system in heavy traffic. To this end, we first establish the following necessary condition to guarantee heavy-traffic delay optimality: the threshold needs to grow to infinity as the exogenous arrival rate approaches the boundary of the capacity region (i.e., the load intensity approaches one) but the growth rate should be slower than a polynomial function of the mean number of tasks in the system. As a special case of this result, we directly show that the delay performance of the popular pull-based policy Join-Idle-Queue (JIQ) is not heavy traffic optimal, but performs strictly better than random routing. We further show that a sufficient condition for heavy-traffic delay optimality is that the threshold grows logarithmically with the mean number of tasks in the system. This result directly resolves a generalized version of the conjecture by Kelly and Laws.

• Tim Hellemans
• Tejas Bodas
• Benny Van Houdt

Load balancing plays a crucial role in achieving low latency in large distributed systems. Recent load balancing strategies often rely on replication or use placeholders to further improve latency. However assessing the performance and stability of these strategies is challenging and is therefore often simulation based. In this paper we introduce a unified approach to analyze the performance and stability of a broad class of workload dependent load balancing strategies. This class includes many replication policies, such as replicate below threshold, delayed replication and replicate only small jobs, as well as strategies for fork-join systems.

We consider systems with general job size distributions where jobs may experience server slowdown. We show that the equilibrium workload distribution of the cavity process satisfies a functional differential equation and conjecture that the cavity process captures the limiting behavior of the system as its size tends to infinity.

We study this functional differential equation in more detail for a variety of load balancing policies and propose a numerical method to solve it. The numerical method relies on a fixed point iteration or a simple Euler iteration depending on the type of functional differential equation involved. We further show that additional simplifications can be made if certain distributions are assumed to be phase-type.

Various numerical examples are included that validate the numerical method and illustrate its strength and flexibility.

• Isaac Grosof
• Ziv Scully
• Mor Harchol-Balter

Load balancing systems, comprising a central dispatcher and a scheduling policy at each server, are widely used in practice, and their response time has been extensively studied in the theoretical literature. While much is known about the scenario where the scheduling at the servers is First-Come-First-Served (FCFS), to minimize mean response time we must use Shortest-Remaining-Processing-Time (SRPT) scheduling at the servers. Much less is known about dispatching polices when SRPT scheduling is used. Unfortunately, traditional dispatching policies that are used in practice in systems with FCFS servers often have poor performance in systems with SRPT servers. In this paper, we devise a simple fix that can be applied to any dispatching policy. This fix, called guardrails, ensures that the dispatching policy yields optimal mean response time under heavy traffic when used in a system with SRPT servers. Any dispatching policy, when augmented with guardrails, becomes heavy-traffic optimal. Our results yield the first analytical bounds on mean response time for load balancing systems with SRPT scheduling at the servers.

### Random Walk Based Sampling for Load Balancing in Multi-Server Systems

• Dengwang Tang
• Vijay G. Subramanian

In multi-server systems, a classical job assignment algorithm works as follows: at the arrival of each job, pick d servers independently and uniformly at random and send the job to the least loaded server among the d servers. This model is known as the power-of-d choices algorithm. In this paper, we analyze a variant of this algorithm, where d servers are sampled through d independent non-backtracking random walks on a k-regular graph. The random walkers are periodically reset to independent uniform random positions. Under some assumptions on the underlying graph, we show that the system dynamics under this new algorithm converges to the solution of a deterministic ordinary differential equation (ODE), which is the same ODE as the classical power-of-d choices. We also show that the new algorithm stablizes the system, and the stationary distribution of the system converges to the stationary solution of the ODE. The new scheme can be considered as a derandomized version of power-of-d choices as it reduces the use of randomness while maintaining the performance of power-of-d choices.

## SESSION: Session 2B: Network Measurement and Performance

### Network Resilience Assessment via QoS Degradation Metrics: An Algorithmic Approach

• Lan N. Nguyen
• My T. Thai

This paper focuses on network resilience to perturbation of edge weight. Other than connectivity, many network applications nowadays rely upon some measure of network distance between a pair of connected nodes. In these systems, a metric related to network functionality is associated to each edge. A pair of nodes only being functional if the weighted, shortest-path distance between the pair is below a given threshold T. Consequently, a natural question is on which degree the change of edge weights can damage the network functionality? With this motivation, we study a new problem, Quality of Service Degradation : given a set of pairs, find a minimum budget to increase the edge weights which ensures the distance between each pair exceeds T. We introduce four algorithms with theoretical performance guarantees for this problem. Each of them has its own strength in trade-off between effectiveness and running time, which are illustrated both in theory and comprehensive experimental evaluation.

### A TTL-based Approach for Data Aggregation in Geo-distributed Streaming Analytics

• Dhruv Kumar
• Jian Li
• Abhishek Chandra
• Ramesh K. Sitaraman

Streaming data analytics has been an important topic of research in recent years. Large quantities of data are generated continuously over time across a variety of application domains such as web and social analytics, scientific computing and energy analytics. One of the key requirements in modern data analytics services is the real-time analysis of these data streams to extract useful and timely information for the analyst. Several distributed data analytics platforms have been developed in recent times to meet this growing requirement of real-time streaming analytics. Nowadays, a large amount of data is generated continuously by geographically distributed sources (e.g., agents, sensors, mobile devices, edge nodes, etc.) in many streaming applications. For instance, services like Facebook, Twitter and Netflix continuously gather data from the end users for a variety of analytical purposes such as finding the popular web content amongst their users or monitoring the QoS metrics. Large content delivery networks (CDNs) like Akamai that serve a significant fraction of content on the Internet continuously collect data from their edge servers and clients from around the globe to understand what, where and how content is accessed for the purpose of providing content analytics insights to businesses.

### Retroactive Packet Sampling for Traffic Receipts

• Pavlos Nikolopoulos
• Christos Pappas
• Katerina Argyraki

Is it possible to design a packet-sampling algorithm that prevents the network node that performs the sampling from treating the sampled packets preferentially? We study this problem in the context of designing a "network-transparency'' system. In this system, networks emit receipts for a small sample of the packets they observe, and a monitor collects these receipts to estimate each network's loss and delay performance. Sampling is a good building block for this system, because it enables a solution that is flexible and combines low resource cost with quantifiable accuracy. The challenge is cheating resistance: when a network's performance is assessed based on the conditions experienced by a small traffic sample, the network has a strong incentive to treat the sampled packets better than the rest. We contribute a sampling algorithm that is provably robust to such prioritization attacks, enables network performance estimation with quantifiable accuracy, and requires minimal resources. We confirm our analysis using real traffic traces.

### Inferring Catchment in Internet Routing

• Pavlos Sermpezis
• Vasileios Kotronis

BGP is the de-facto Internet routing protocol for interconnecting Autonomous Systems (AS). Each AS selects its preferred routes based on its routing policies, which are typically not disclosed. Due to the distributed route selection and information hiding, answering questions such as "what is the expected catchment of the anycast sites of a content provider at the AS-level, if new sites are deployed?", or "how will load-balancing behave if an ISP changes its routing policy for a prefix?", is a hard challenge. In this work, we propose a framework and methodology to infer the routing behavior in existing or hypothetical routing configurations, and provide new capabilities and insights for informative route inference (e.g., isolating the effect of randomness that is present in prior simulation-based approaches). The proposed framework can be useful in a number of applications: measurements/monitoring, traffic engineering, network planning, Internet routing models, etc.

## SESSION: Session 3A: Computation and Memory Management

### Crystal Gazer: Profile-Driven Write-Rationing Garbage Collection for Hybrid Memories

• Shoaib Akram
• Jennifer B. Sartor
• Kathryn S. McKinley
• Lieven Eeckhout

Emerging non-volatile memory (NVM) technologies offer greater capacity than DRAM. Unfortunately, production NVM exhibits high latency and low write endurance. Hybrid memory combines DRAM and NVM to deliver greater capacity, low latency, high en- durance, and low energy consumption. Write-rationing garbage col- lection mitigates NVM wear-out by placing highly-written objects in DRAM and the rest in NVM. Existing write-rationing garbage collectors dynamically monitor object writes to place highly writ- ten objects in DRAM. Unfortunately, monitoring writes incurs a non-negligible performance overhead. This work proposes Crystal Gazer, profile-driven write-rationing garbage collection for hybrid memories. Allocation sites are statically profiled, and highly writ- ten objects are predicted based on previous program executions. Unlike prior work, this paper exposes a Pareto trade-off between DRAM usage and NVM lifetime. Experimental results on an emula- tion platform show that Crystal Gazer eliminates the performance overhead of dynamic monitoring, while reducing more NVM writes than state-of-the-art write-rationing garbage collectors.

### Architecture-Aware Approximate Computing

• Mustafa Karakoy
• Orhan Kislal
• Xulong Tang
• Mahmut Taylan Kandemir
• Meenakshi Arunachalam

Observing that many application programs from different domains can live with less-than-perfect accuracy, existing techniques try to trade off program output accuracy with performance-energy savings. While these works provide point solutions, they leave three critical questions regarding approximate computing unanswered: (i) what is the maximum potential of skipping (i.e., not performing) data accesses under a given inaccuracy bound?; (ii) can we identify the data accesses to drop randomly, or is being architecture aware critical?; and (iii) do two executions that skip the same number of data accesses always result in the same output quality (error)? This paper first provides answers to these questions using ten multithreaded workloads, and then presents a program slicing-based approach that identifies the set of data accesses to drop. Results indicate 8.8% performance improvement and 13.7% energy saving are possible when we set the error bound to 2%, and the corresponding improvements jump to 15% and 25%, respectively, when the error bound is raised to 4%.

### Quantifying Data Locality in Dynamic Parallelism in GPUs

• Xulong Tang
• Ashutosh Pattnaik
• Onur Kayiran
• Mahmut Taylan Kandemir
• Chita Das

Dynamic parallelism (DP) is a new feature of emerging GPUs that allows new kernels to be generated and scheduled from the device-side (GPU) without the host-side (CPU) intervention. To efficiently support DP, one of the major challenges is to saturate the GPU processing elements and provide them with the required data in a timely fashion. In this paper, we first conduct a limit study on the performance improvements that can be achieved by hardware schedulers that are provided with accurate data reuse information. We next propose LASER, a Locality-Aware SchedulER, where the hardware schedulers employ data reuse monitors to help make scheduling decisions to improve data locality at runtime. Experimental results on 16 benchmarks show that LASER, on an average, can improve performance by 11.3%.

### Computing with Near Data

• Xulong Tang
• Mahmut Taylan Kandemir
• Hui Zhao
• Myoungsoo Jung
• Mustafa Karakoy

The cost of moving data between compute elements and storage elements plays a significant role in shaping the overall performance of applications. We present a compiler-driven approach to reducing data movement costs. Our approach, referred to as Computing with Near Data (CND), is built upon a concept called "recomputation", in which a costly data access is replaced by a few less costly data accesses plus some extra computation, if the cumulative cost of the latter is less than that of the costly data access. Experimental result reveals that i) the average recomputability across our benchmarks is 51.1%, ii) our compiler-driven strategy is able to exploit 79.3% of the recomputation opportunities presented by our workloads, and iii) our enhancements increase the value of the recomputability metric significantly.

## SESSION: Session 3B: Online Optimization and Pricing

### Dynamic Pricing of Relocating Resources in Large Networks: Extended Abstract

• Santiago R. Balseiro
• David B. Brown
• Chen Chen

We study dynamic pricing of resources that are distributed over a network of locations (e.g., shared vehicle systems and logistics networks). Customers with private willingness-to-pay sequentially request to relocate a resource from one location to another. We focus on networks with a hub-and-spoke structure. We develop a dynamic pricing policy and a performance bound based on a Lagrangian relaxation. This relaxation decomposes the problem over spokes and is thus far easier to solve than the original problem. We analyze the performance of the Lagrangian-based policy and focus on a large network regime in which the number of spokes (n) and number of resources grow at the same rate. We show that our policy loses no more than $O\Big(\sqrtłn n/n \Big)$ in performance compared to an optimal policy, thus implying asymptotic optimality as n grows large. We provide examples that show that upper bounds and static policies based on fluid relaxations fail to work well in this asymptotic regime. Finally, we discuss how our approach extends to more general networks involving multiple, interconnected hubs.

### The Segmentation-Thickness Tradeoff in Online Marketplaces

• Reza Alijan
• Siddhartha Banerjee
• Sreenivas Gollapud
• Kostas Kollias
• Kamesh Munagala

A core tension in the operations of online marketplaces is between segmentation (wherein platforms can increase revenue by segmenting the market into ever smaller sub-markets) and thickness (wherein the size of the sub-market affects the utility experienced by an agent). An important example of this is in dynamic online marketplaces, where buyers and sellers, in addition to preferences for different matches, also have finite patience (or deadlines) for being matched. We formalize this trade-off via a novel optimization problem that we term as 'Two-sided Facility Location': we consider a market wherein agents arrive at nodes embedded in an underlying metric space, where the distance between a buyer and seller captures the quality of the corresponding match. The platform posts prices and wages at the nodes, and opens a set of virtual clearinghouses where agents are routed for matching. To ensure high match-quality, the platform imposes a distance constraint between an agent and its clearinghouse; to ensure thickness, the platform requires the flow to any clearinghouse be at least a pre-specified lower bound. Subject to these constraints, the goal of the platform is to maximize the social surplus subject to weak budget balance, i.e., profit being non-negative. Our work characterizes the complexity of this problem by providing both hardness results as well as algorithms for this setting; in particular, we present an algorithm that for any constant ε > 0 yields a (1 + ε) approximation for the gains from trade, while relaxing the match quality (i.e., maximum distance of any match) by a constant factor.

### On the Value of Look-Ahead in Competitive Online Convex Optimization

• Ming Shi
• Xiaojun Lin
• Lei Jiao

Although using look-ahead information is known to improve the competitive ratios of online convex optimization (OCO) problems with switching costs, the competitive ratios obtained from existing results often depend on the cost coefficients of the problem, and can potentially be large. In this paper, we propose new online algorithms that can utilize look-ahead to achieve much lower competitive ratios for OCO problems with switching costs and hard constraints. For the perfect look-ahead case where the algorithm is provided with the exact inputs in a future look-ahead window of size K, we design an Averaging Regularized Moving Horizon Control (ARMHC) algorithm that can achieve a competitive ratio of K+1/K. To the best of our knowledge, ARMHC is the first to attain a low competitive ratio that is independent of either the coefficients of the switching costs and service costs, or the upper and lower bounds of the inputs. Then, for the case when the future look-ahead has errors, we develop a Weighting Regularized Moving Horizon Control (WRMHC) algorithm that carefully weights the decisions inside the look-ahead window based on the accuracy of the look-ahead information. As a result, WRMHC also achieves a low competitive ratio that is independent of the cost coefficients, even with uncertain hard constraints. Finally, our analysis extends online primal-dual analysis to the case with look-ahead by introducing a novel "re-stitching" idea, which is of independent interest.

### Competitive Online Optimization under Inventory Constraints

• Qiulin Lin
• Hanling Yi
• John Pang
• Minghua Chen
• Michael Honig
• Yuanzhang Xiao

This paper studies online optimization under inventory (budget) constraints. While online optimization is a well-studied topic, versions with inventory constraints have proven difficult. We consider a formulation of inventory-constrained optimization that is a generalization of the classic one-way trading problem and has a wide range of applications. We present a new algorithmic framework, CR-Pursuit, and prove that it achieves the optimal competitive ratio among all deterministic algorithms (up to a problem-dependent constant factor) for inventory-constrained online optimization. Our algorithm and its analysis not only simplify and unify the state-of-the-art results for the standard one-way trading problem, but they also establish novel bounds for generalizations including concave revenue functions. For example, for one-way trading with price elasticity, CR-Pursuit achieves a competitive ratio within a small additive constant (i.e., 1/3) to the lower bound of lnӨ+1, where Ө is the ratio between the maximum and minimum base prices.

## SESSION: Session 4: Optimal Control and Estimation

### Analyzing Location-Based Advertising for Vehicle Service Providers Using Effective Resistances

• Haoran Yu
• Ermin Wei
• Randall A. Berry

Vehicle service providers can display commercial ads in their vehicles based on passengers' origins and destinations to create a new revenue stream. We study a vehicle service provider who can generate different ad revenues when displaying ads on different arcs (i.e., origin-destination pairs). The provider needs to ensure the vehicle flow balance at each location, which makes it challenging to analyze the provider's vehicle assignment and pricing decisions for different arcs. To tackle the problem, we show that certain properties of the traffic network can be captured by a corresponding electrical network. When the effective resistance between two locations is small, there are many paths between the two locations and the provider can easily route vehicles between them. We derive the provider's optimal vehicle assignment and pricing decisions based on effective resistances.

### A Structural Result for Personalized PageRank and its Algorithmic Consequences

• Daniel Vial
• Vijay Subramanian

Many systems, such as the Internet, social networks, and the power grid, can be represented as graphs. When analyzing graphs, it is often useful to compute scores describing the relative importance or distance between nodes. One example is Personalized PageRank (PPR), which assigns to each node v a vector whose i-th entry describes the importance of the i-th node from the perspective of v. PPR has proven useful in many applications, such as recommending who users should follow on social networks (if this i-th entry is large, v may be interested in following the i-th user). Unfortunately, computing n PPR vectors exactly for a graph of n nodes has complexity O(n3), which is infeasible for many graphs of interest.

In this work, we devise a scheme to estimate all n PPR vectors with bounded l_1 error and complexity O(nc), where c<2 depends on the degrees of the graph at hand, the desired error tolerance, and a parameter that defines PPR. This improves upon existing methods, the best of which have complexity O(n2 log n) in our setting. Our complexity guarantee holds with high probability, for certain choices of the PPR parameter, and for a certain class of random graphs (roughly speaking, the sparse directed configuration model with heavy-tailed in-degrees); our accuracy guarantee holds with probability 1 and for arbitrary graphs and PPR parameters. The complexity result arises as a consequence of our main (structural) result, which shows that the dimensionality of the set of PPR vectors scales sublinearly in n with high probability, for the same class of random graphs and for a notion of dimensionality similar to matrix rank. It is this coupling of the PPR vectors for the nodes on a common underlying graph that allows for estimating them faster. Hence, at a high level, our scheme is analogous to (but distinct from) low-rank matrix approximation. We also note that our scheme is similar to one that was proposed in [Jeh and Widom 2003] but lacked accuracy and complexity guarantees, so another contribution of our paper is to address this gap in the literature.

### Learning to Control Renewal Processes with Bandit Feedback

• Semih Cayci
• Atilla Eryilmaz
• R. Srikant

## SESSION: Session 6A: Workload Optimization and Cache Management

• Monika Henzinger
• Stefan Neumann
• Stefan Schmid

Modern networked systems are increasingly reconfigurable, enabling demand-aware infrastructures whose resources can be adjusted according to the workload they currently serve. Such dynamic adjustments can be exploited to improve network utilization and hence performance, by moving frequently interacting communication partners closer, e.g., collocating them in the same server or datacenter. However, dynamically changing the embedding of workloads is algorithmically challenging: communication patterns are often not known ahead of time, but must be learned. During the learning process, overheads related to unnecessary moves (i.e., re-embeddings) should be minimized. This paper studies a fundamental model which captures the tradeoff between the benefits and costs of dynamically collocating communication partners on l servers, in an online manner. Our main contribution is a distributed online algorithm which is asymptotically almost optimal, i.e., almost matches the lower bound (also derived in this paper) on the competitive ratio of any (distributed or centralized) online algorithm.

### Optimizing the Cost of Executing Mixed Interactive and Batch Workloads on Transient VMs

• David Irwin

Container Orchestration Platforms (COPs), such as Kubernetes, are increasingly used to manage large-scale clusters by automating resource allocation between applications encapsulated in containers. Increasingly, the resources underlying COPs are virtual machines (VMs) dynamically acquired from cloud platforms. COPs may choose from many different types of VMs offered by cloud platforms, which differ in their cost, performance, and availability. While transient VMs cost significantly less than on-demand VMs, platforms may revoke them at any time, causing them to become unavailable. While transient VMs' price is attractive, their unreliability is a problem for COPs designed to support mixed workloads composed of, not only delay-tolerant batch jobs, but also long-lived interactive services with high availability requirements.

To address the problem, we design TR-Kubernetes, a COP that optimizes the cost of executing mixed interactive and batch workloads on cloud platforms using transient VMs. To do so, TR-Kubernetes enforces arbitrary availability requirements specified by interactive services despite transient VM unavailability by acquiring many more transient VMs than necessary most of the time, which it then leverages to opportunistically execute batch jobs when excess resources are available. When cloud platforms revoke transient VMs, TR-Kubernetes relies on existing Kubernetes functions to internally revoke resources from batch jobs to maintain interactive services' availability requirements. We show that TR-Kubernetes requires minimal extensions to Kubernetes, and is capable of lowering the cost (by 53%) and improving the availability (99.999%) of a representative interactive/batch workload on Amazon EC2 when using transient compared to on-demand VMs.

### Online Optimization in Cloud Resource Provisioning: Predictions, Regrets, and Algorithms

• Joshua Comden
• Sijie Yao
• Niangjun Chen
• Haipeng Xing
• Zhenhua Liu

Several different control methods are used in practice or have been proposed to cost-effectively provision IT resources. Due to the dependency of many control methods on having accurate predictions of the future to make good provisioning decisions, there has been a great deal of literature on prediction workload demand. However, even with all of this literature on workload predictions and their utilization in control algorithms, the understanding of prediction error and how to handle it remains an important open issue and research challenge.

In this paper we aim to mend this gap by making the following contributions: (i) Prediction error is modeled to aid in proving worst-case dynamic regret bounds for control algorithms. (ii) Upper bounds on dynamic regret are proven for a variety of algorithms in terms of the prediction error model. In order to choose which algorithm to run without prediction error knowledge, a simple online meta-algorithm is designed. (iii) A detailed analysis of prediction accuracy is done for cloud computing by fitting real-world CPU utilization traces of Azure virtual machines to popular prediction models. (iv) Using real-world trace based simulations of CPU allocation for virtual machines, the proposed meta-algorithm is shown to outperform a popular algorithm selection policy and perform very closely to that of the best algorithm chosen in hindsight.

### A New Flexible Multi-flow LRU Cache Management Paradigm for Minimizing Misses

• Guocong Quan
• Jian Tan
• Atilla Eryilmaz
• Ness Shroff

The Least Recently Used (LRU) caching and its variants are used in large-scale data systems in order to provide high-speed data access for a wide class of applications. Nonetheless, a fundamental question still remains: in order to minimize the miss probabilities, how should the cache space be organized to serve multiple data flows? Commonly used strategies can be categorized into two designs: pooled LRU (PLRU) caching and separated LRU (SLRU) caching. However, neither of these designs can satisfactorily solve this problem. PLRU caching is easy to implement and self-adaptive, but does not often achieve optimal or even efficient performance because its set of feasible solutions are limited. SLRU caching can be statically configured to achieve optimal performance for stationary workload, which nevertheless could suffer in a dynamically changing environment and from a cold-start problem. To this end, we propose a new insertion based pooled LRU paradigm, named I-PLRU, where the data flows can be inserted at different positions of a pooled cache. This new design can achieve the optimal performance of the static SLRU, but retains the adaptability of PLRU for resource sharing. Theoretically, we characterize the asymptotic miss probabilities of I-PLRU, and prove that, for any given SLRU design, there always exists an I-PLRU configuration that achieves the same asymptotic miss probability, and vice versa. We next design a policy to minimize the miss probabilities. However, the miss probability minimization problem turns out to be non-convex under the I-PLRU paradigm. Notably, we utilize an equivalence mapping between I-PLRU and SLRU to efficiently find the optimal I-PLRU configuration. We prove that I-PLRU outperforms PLRU and achieves the same miss probability as the optimal SLRU under a stationary request arrival process. From an engineering perspective, the flexibility of I-PLRU avoids separating the memory space, supports dynamic and refined configurations, and alleviates the cold-start problem, potentially yielding better performance than both SLRU and PLRU.

## SESSION: Session 6B: Control and Resource Allocation

### Axiomatizing Congestion Control

• Doron Zarchy
• Michael Schapira
• Scott Shenker

Recent years have witnessed a revival of both industrial and academic interest in improving congestion control designs. The quest for better congestion control is complicated by the extreme diversity and range of (i) the design space (as exemplified by the stark conceptual and operational differences between recent proposals~\citebbr,vivace, COPA ), (ii) the desired properties (ranging from high performance to fairness to TCP-friendliness), (iii) the envisioned operational setting (inter- and intra-datacenter, wireless, the commercial Internet, satellite), and (iv) the application loads and requirements (small vs. large traffic demands, latency- vs. bandwidth-sensitive).

Most congestion control research uses simulation and experiments under a limited range of network conditions. This is extremely important for understanding the detailed performance of particular schemes in specific settings, but provides limited insight into the more general properties of these schemes and no information about the inherent limits %of congestion control designs (such as, which properties are simultaneously achievable and which are mutually exclusive). In contrast, traditional theoretical approaches are typically focused on the design of protocols that achieve specific, predetermined objectives (e.g., network utility maximization~\citekelly2009 ), or the analysis of specific protocols (e.g., from control-theoretic perspectives~\citepaganini2003control ), as opposed to exploring the inherent tensions/derivations between desired properties.

We advocate an axiomatic approach to congestion control, which is complementary to the experimental and theoretical work currently being pursued. Our approach, modeled on similar efforts in social choice theory and game theory~\citearrow1950difficulty, identifies a set of requirements ("axioms'') and then identifies (i) which of its subsets of requirements can coexist (i.e., there are designs that achieve all of them) and which subsets cannot be met simultaneously (i.e., no design can simultaneously achieve all of them), and (ii) whether some requirements immediately follow from satisfying other requirements. Thus, the axiomatic approach can shed light on the inherent tradeoffs involved in congestion control protocol design, and can be leveraged to classify existing and proposed solutions according to the properties they satisfy.

The axiomatic approach has been applied to many computer science environments, e.g. reputation systems, recommendation systems, link prediction, and more. To the best of our knowledge, ours is the first application of this approach to congestion control protocols (though \citeshenker1990theoretical touches on the subject briefly).

We introduce a simple network model where we can evaluate congestion control designs and formulate several natural axioms (or requirements) for congestion control protocols, including efficient link-utilization, loss-avoidance, fairness, stability, and TCP-friendliness. Congestion control protocols can be regarded as points in a multidimensional space reflecting the extent to which they satisfy these requirements, and we show how classical families of congestion control protocols (e.g., additive-increase-multiplicative-decrease, multiplicative-increase-multiplicative-decrease, and more) can be mapped to points in this space.

We leverage our axiomatic framework to derive basic results on the feasibility of simultaneously achieving different requirements within a single design. Our results formalize and shed light on various empirical/experimental observations about tensions between different desiderata, including (1) the tension between attaining high performance and being friendly to legacy TCP connections~\citelakshman1997performance,padhye1998modeling, (2) the tension between achieving high bandwidth and maintaining low latency under dynamic environments~\citebrakmo1995tcp,mo1999analysis,sprout, and (3) the tension between being robust to non-congestion loss and not incurring high loss upon convergence~\citevivace. From a protocol design perspective, desirable congestion control protocols are those that reside on the Pareto frontier in the multidimensional space induced by our requirements, and this Pareto frontier is characterized by our theoretical results.

To be sure, our axiomatic approach has its limitations, as it revolves around investigating these properties in a simplified model. However, we feel that the results, in terms of which axioms can coexist and which cannot, provide insights that apply far beyond the simple model (even if the detailed theoretical results do not). Thus, we contend that the axiomatic approach is a useful addition to the evaluatory arsenal that researchers should apply to congestion control.

We view our results as a first step and leave the reader with many research directions relating to the extension of our model and the re-examination of our axioms/metrics. Thus far, axiomatic approaches have been applied to a few, very specific, networking environments~\citeshenker1990theoretical,lev2015axiomatic,cohen2015axiomatic. We believe that applying the axiomatic approach to other networking contexts (e.g., intradomain~\citelev2015axiomatic and interdomain routing, traffic engineering, in-network queueing~\citesivaraman2013no, network security) could contribute to more principled discussions about these contexts.

### Information, Memory and Capacity in Dynamic Resource Allocation

• Kuang Xu
• Yuan Zhong

We propose a general framework, dubbed Stochastic Processing under Imperfect Information (SPII), to study the impact of information constraints and memories on dynamic re- source allocation. The framework involves a Stochastic Processing Network (SPN) scheduling problem in which the decision maker may access the system state only through a noisy channel, and resource allocation decisions must be carried out through the interaction between an encoding policy (who observes the state) and allocation policy (who chooses the allocation). Applications in the management of large-scale data centers and human-in-the-loop service systems are among our chief motivations.

We quantify the degree to which information constraints reduce the size of the capacity region in general SPNs, and how such reduction depends on the amount of memories available to the encoding and allocation policies. Using a novel metric, capacity factor, our main theorem characterizes the reduction in capacity region (under "optimal" policies) for all non-degenerate channels, and across almost all combinations of memory sizes. Notably, the theorem demonstrates, in substantial generality, that (1) the presence of a noisy channel always reduces capacity, (2) more memories for the allocation policy always improve capacity, and (3) more memories for the encoding policy have little to no effect on capacity. Finally, all of our positive (achievability) results are established through constructive, implementable policies.

Our proof program involves the development of a host of new techniques by combining ideas from information theory, learning and queueing theory. We create a simple yet powerful generalization of the Max-Weight policy, in which individual Markov chains are selected dynamically, in a manner analogous to how schedules are used in a conventional Max-Weight policy.

### mRSC: Multidimensional Robust Synthetic Control

• Vishal Misra
• Devavrat Shah
• Dennis Shen

When evaluating the impact of a policy (e.g., gun control) on a metric of interest (e.g., crime-rate), it may not be possible or feasible to conduct a randomized control trial. In such settings where only observational data is available, synthetic control (SC) methods \citeabadie1, abadie2, abadie3 provide a popular data-driven approach to estimate a "synthetic'' or "virtual'' control by combining measurements of "similar'' alternatives or units (called "donors'').

Recently, robust synthetic control (RSC) \citersc1 was proposed as a generalization of SC to overcome the challenges of missing data and high levels of noise, while removing the reliance on expert domain knowledge for selecting donors. However, both SC and RSC (and its variants) suffer from poor estimation when the pre-intervention period is too short.

As the main contribution of this work, we propose a generalization of unidimensional RSC to multi-dimensional Robust Synthetic Control, mRSC. Our proposed mechanism, mRSC, incorporates multiple types of measurements (or metrics) in addition to the measurement of interest for estimating a synthetic control, thus overcoming the challenge of poor inference due to limited amounts of pre-intervention data. We show that the mRSC algorithm, when using K relevant metrics, leads to a consistent estimator of the synthetic control for the target unit of interest under any metric. Our finite-sample analysis suggests that the mean-squared error (MSE) of our predictions decays to zero at a rate faster than the RSC algorithm by a factor of K and $\sqrtK$ for the training (pre-intervention) and testing (post-intervention) periods, respectively. Additionally, we propose a principled scheme to combine multiple metrics of interest via a diagnostic test that evaluates if adding a metric can be expected to result in improved inference.

Our mechanism for validating mRSC performance is also an important and related contribution of this work: time series prediction. We propose a method to predict the future evolution of a time series based on limited data when the notion of time is relative and not absolute, i.e., where we have access to a donor pool that has already undergone the desired future evolution.

We conduct extensive experimentation to establish the efficacy of mRSC in three different scenarios: predicting the evolution of a metric of interest using synthetically generated data from a known factor model, and forecasting weekly sales and score trajectories of a Walmart store and Cricket game, respectively.

### A Distributed Algorithm to Calculate Max-Min Fair Rates Without Per-Flow State

• Lavanya Jose
• Stephen Ibanez
• Nick McKeown

Most congestion control algorithms, like TCP, rely on a reactive control system that detects congestion, then marches carefully towards a desired operating point (e.g. by modifying the window size or adjusting a rate). In an effort to balance stability and convergence speed, they often take hundreds of RTTs to converge; an increasing problem as networks get faster, with less time to react.

This paper is about an alternative class of congestion control algorithms based on proactive-scheduling: switches and NICs "pro-actively" exchange control messages to run a distributed algorithm to pick "explicit rates" for each flow. We call these Proactive Explicit Rate Control (PERC) algorithms. They take as input the routing matrix and link speeds, but not a congestion signal. By exploiting information such as the number of flows at a link, they can converge an order of magnitude faster than reactive algorithms.

Our main contributions are (1) s-PERC ("stateless" PERC), a new practical distributed PERC algorithm without per-flow state at the switches, and (2) a proof that s-PERC computes exact max-min fair rates in a known bounded time, the first such algorithm to do so without per-flow state. To analyze s-PERC, we introduce a parallel variant of standard waterfilling, 2-Waterfilling. We prove that s-PERC converges to max-min fair in 6N rounds, where N is the number of iterations 2-Waterfilling takes for the same routing matrix.

We describe how to make s-PERC practical and robust to deploy in real networks. We confirm using realistic simulations and an FPGA hardware testbed that s-PERC converges 10-100x faster than reactive algorithms like TCP, DCTCP and RCP in data-center networks and 1.3--6x faster in wide-area networks (WANs). Long flows complete in close to the ideal time, while short-lived flows are prioritized, making it appropriate for data-centers and WANs.

## SESSION: Session 7A: Queueing and Scheduling

### Global Attraction of ODE-based Mean Field Models with Hyperexponential Job Sizes

• Benny Van Houdt

Mean field modeling is a popular approach to assess the performance of large scale computer systems. The evolution of many mean field models is characterized by a set of ordinary differential equations that have a unique fixed point. In order to prove that this unique fixed point corresponds to the limit of the stationary measures of the finite systems, the unique fixed point must be a global attractor. While global attraction was established for various systems in case of exponential job sizes, it is often unclear whether these proof techniques can be generalized to non-exponential job sizes.

In this paper we show how simple monotonicity arguments can be used to prove global attraction for a broad class of ordinary differential equations that capture the evolution of mean field models with hyperexponential job sizes. This class includes both existing as well as previously unstudied load balancing schemes and can be used for systems with either finite or infinite buffers.

The main novelty of the approach exists in using a Coxian representation for the hyperexponential job sizes and a partial order that is stronger than the componentwise partial order used in the exponential case.

### Hyper-Scalable JSQ with Sparse Feedback

• Mark van der Boor
• Sem Borst
• Johan van Leeuwaarden

Load balancing algorithms play a vital role in enhancing performance in data centers and cloud networks. Due to the massive size of these systems, scalability challenges, and especially the communication overhead associated with load balancing mechanisms, have emerged as major concerns. Motivated by these issues, we introduce and analyze a novel class of load balancing schemes where the various servers provide occasional queue updates to guide the load assignment.

We show that the proposed schemes strongly outperform JSQ(d) strategies with comparable communication overhead per job, and can achieve a vanishing waiting time in the many-server limit with just one message per job, just like the popular JIQ scheme. The proposed schemes are particularly geared however towards the sparse feedback regime with less than one message per job, where they outperform corresponding sparsified JIQ versions.

We investigate fluid limits for synchronous updates as well as asynchronous exponential update intervals. The fixed point of the fluid limit is identified in the latter case, and used to derive the queue length distribution. We also demonstrate that in the ultra-low feedback regime the mean stationary waiting time tends to a constant in the synchronous case, but grows without bound in the asynchronous case.

### Two Extensions of Kingman's GI/G/1 Bound

• Florin Ciucu
• Felix Poloczek

A simple bound in GI/G/1 queues was obtained by Kingman using a discrete martingale transform. We extend this technique to 1) multiclass ΣGI/G/1 queues and 2) Markov Additive Processes (MAPs) whose background processes can be time-inhomogeneous or have an uncountable state-space. Both extensions are facilitated by a necessary and sufficient ordinary differential equation (ODE) condition for MAPs to admit continuous martingale transforms. Simulations show that the bounds on waiting time distributions are almost exact in heavy-traffic, including the cases of 1) heterogeneous input, e.g., mixing Weibull and Erlang-k classes and 2) Generalized Markovian Arrival Processes, a new class extending the Batch Markovian Arrival Processes to continuous batch sizes.

### Queue and Loss Distributions in Finite-Buffer Queues

• Florin Ciucu
• Felix Poloczek
• Amr Rizk

We derive simple bounds on the queue distribution in finite-buffer queues with Markovian arrivals. The bounds capture a truncated exponential behavior, involving joint horizontal and vertical shifts of an exponential function; this is fundamentally different than existing results capturing horizontal shifts only. We also obtain similar bounds on the loss distribution, which is a key metric to understand the impact of finite-buffer queues on real-time applications. Simulations show that the bounds are accurate in heavy-traffic regimes, and improve existing ones by orders of magnitude.

• Jiaming Xu
• Yuan Zhong

### Computationally Efficient Estimation of the Spectral Gap of a Markov Chain

• Richard Combes
• Mikael Touati

We consider the problem of estimating from sample paths the absolute spectral gap 1-λ⋆ of a reversible, irreducible and aperiodic Markov chain (Xt)t∈N over a finite state space Ω. We propose the UCPI (Upper Confidence Power Iteration) algorithm for this problem, a low-complexity algorithm which estimates the spectral gap in time O(n) and memory space O((ln n)2 given n samples. This is in stark contrast with most known methods which require at least memory space O(|Ω|), so that they cannot be applied to large state spaces. Furthermore, UCPI is amenable to parallel implementation.