SIGMETRICS '20: Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems

SESSION: Online Optimization

Uniform Loss Algorithms for Online Stochastic Decision-Making With Applications to Bin Packing

We consider a general class of finite-horizon online decision-making problems, where in each period a controller is presented with a stochastic arrival and must choose an action from a set of permissible actions, and the final objective depends only on the aggregate type-action counts. Such a framework encapsulates many online stochastic variants of common optimization problems including bin packing, generalized assignment, and network revenue management. In such settings, we study a natural model-predictive control algorithm that, in each period, acts greedily based on an updated certainty-equivalent optimization problem. We introduce a simple, yet general, condition under which this algorithm obtains uniform additive loss (independent of the horizon) compared to an optimal solution with full knowledge of arrivals. Our condition is fulfilled by the above-mentioned problems, as well as more general settings involving piecewise linear objectives and offline index policies, including an airline overbooking problem.
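
As a rough illustration of this style of certainty-equivalent greedy control, the sketch below resolves a fluid (expected-arrivals) LP each period on a toy network-revenue-management instance and acts on its solution; the instance, the acceptance rule, and the use of scipy are all illustrative assumptions, not the paper's algorithm.

```python
# Hedged sketch: certainty-equivalent greedy control on a toy instance.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
T = 200                                   # horizon
p = np.array([0.5, 0.3, 0.2])             # arrival probabilities per type
v = np.array([1.0, 2.0, 5.0])             # value of accepting each type
A = np.array([[1, 1, 0],                  # resource consumption per type
              [0, 1, 1]], dtype=float)
cap = np.array([60.0, 40.0])              # resource capacities

def ce_plan(cap_left, periods_left):
    """Solve the certainty-equivalent LP over expected future arrivals."""
    ub = p * periods_left                  # expected remaining arrivals per type
    res = linprog(-v, A_ub=A, b_ub=cap_left,
                  bounds=list(zip(np.zeros_like(ub), ub)), method="highs")
    return res.x

reward = 0.0
for t in range(T):
    j = rng.choice(len(p), p=p)            # stochastic arrival of type j
    x = ce_plan(cap, T - t)
    # Greedy rule: accept if the CE plan serves at least half of the
    # expected remaining type-j arrivals (an illustrative rounding).
    if x[j] >= 0.5 * p[j] * (T - t) and np.all(A[:, j] <= cap):
        cap -= A[:, j]
        reward += v[j]
print("collected value:", reward)
```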

Online Primal-Dual Mirror Descent under Stochastic Constraints

We consider online convex optimization with stochastic constraints where the objective functions are arbitrarily time-varying and the constraint functions are independent and identically distributed (i.i.d.) over time. Both the objective and constraint functions are revealed after the decision is made at each time slot. The best known expected regret for solving such a problem is O(√T), with a coefficient that is polynomial in the dimension of the decision variable and relies on the Slater condition (i.e., the assumption that an interior point exists), which is restrictive and in particular precludes treating equality constraints. In this paper, we show that such a Slater condition is in fact not needed. We propose a new primal-dual mirror descent algorithm and show that one can attain O(√T) regret and constraint violation under a much weaker Lagrange multiplier assumption, allowing general equality constraints and significantly relaxing the previous Slater conditions. Along the way, for the case where decisions are contained in a probability simplex, we reduce the coefficient to have only a logarithmic dependence on the decision variable dimension. Such a dependence has long been known in the literature on mirror descent but seems new in this constrained online learning scenario. Simulation experiments on a data center server provision problem with real electricity price traces further demonstrate the performance of our proposed algorithm.
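
A minimal sketch of one online primal-dual mirror-descent step of the kind described above, using the entropic mirror map on the probability simplex (the source of the logarithmic dependence on the dimension); the instance, step sizes, and single-constraint setup are illustrative assumptions.

```python
# Hedged sketch: online primal-dual mirror descent on the simplex.
import numpy as np

rng = np.random.default_rng(1)
d, T = 10, 5000
x = np.full(d, 1.0 / d)                 # primal iterate on the simplex
lam = 0.0                               # dual variable (one constraint)
eta = 1.0 / np.sqrt(T)                  # primal/dual step size

c_true = rng.normal(size=d)             # mean of the i.i.d. constraint slope
for t in range(T):
    grad_f = rng.normal(size=d)         # arbitrarily time-varying objective gradient
    c_t = c_true + rng.normal(scale=0.1, size=d)   # i.i.d. constraint: c_t @ x <= 0
    # Entropic mirror-descent (multiplicative-weights) primal update on the
    # gradient of the Lagrangian.
    x *= np.exp(-eta * (grad_f + lam * c_t))
    x /= x.sum()
    # Dual ascent on the observed constraint violation.
    lam = max(0.0, lam + eta * (c_t @ x))
print("final constraint value:", c_true @ x)
```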

Dynamic Weighted Fairness with Minimal Disruptions

In this paper, we consider the following dynamic fair allocation problem: Given a sequence of job arrivals and departures, the goal is to maintain an approximately fair allocation of the resource against a target fair allocation policy, while minimizing the total number of disruptions, which is the number of times the allocation of any job is changed. We consider a rich class of fair allocation policies that significantly generalize those considered in previous work.

We first consider the models where jobs only arrive, or jobs only depart. We present tight upper and lower bounds for the number of disruptions required to maintain a constant approximate fair allocation every time step. In particular, for the canonical case where jobs have weights and the resource allocation is proportional to the job's weight, we show that maintaining a constant approximate fair allocation requires Θ(log* n) disruptions per job, almost matching the bounds in prior work for the unit weight case. For the more general setting where the allocation policy only decreases the allocation to a job when new jobs arrive, we show that maintaining a constant approximate fair allocation requires Θ(log n) disruptions per job. We then consider the model where jobs can both arrive and depart. We first show strong lower bounds on the number of disruptions required to maintain constant approximate fairness for arbitrary instances. In contrast, we then show that there is an algorithm that can maintain constant approximate fairness with O(1) expected disruptions per job if the weights of the jobs are independent of the jobs' arrival and departure order. We finally show how our results can be extended to the setting with multiple resources.

Online Linear Optimization with Inventory Management Constraints

This paper considers the problem of online linear optimization with inventory management constraints. Specifically, we consider an online scenario where a decision maker needs to satisfy her time-varying demand for some units of an asset, either from a market with a time-varying price or from her own inventory. In each time slot, the decision maker is presented with a (linear) price and must immediately decide the amount to purchase for covering the demand and/or for storing in the inventory for future use. The inventory has a limited capacity and can be used to buy and store assets at low prices and cover the demand when the price is high. The ultimate goal of the decision maker is to cover the demand at each time slot while minimizing the cost of buying assets from the market. We propose ARP, an online algorithm for linear programming with inventory constraints, and ARPRate, an extended version that handles rate constraints to/from the inventory. Both ARP and ARPRate achieve optimal competitive ratios, meaning that no other online algorithm can achieve a better theoretical guarantee. To illustrate the results, we use the proposed algorithms in a case study focused on energy procurement and storage management strategies for data centers.
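
The sketch below is a simple buy-low/cover-demand threshold policy in the spirit of this setting; it is not the authors' ARP algorithm, and the price bounds and threshold are illustrative assumptions.

```python
# Hedged sketch: threshold procurement with a capacity-limited inventory.
import numpy as np

rng = np.random.default_rng(2)
p_min, p_max = 1.0, 4.0
threshold = np.sqrt(p_min * p_max)   # a simple heuristic switching price
C = 10.0                             # inventory capacity
inv, cost = 0.0, 0.0

for t in range(100):
    price = rng.uniform(p_min, p_max)
    demand = rng.uniform(0.0, 2.0)
    if price <= threshold:
        buy = demand + (C - inv)     # cover demand and top up storage cheaply
        inv = C
    else:
        use = min(inv, demand)       # draw down storage when the price is high
        inv -= use
        buy = demand - use
    cost += price * buy
print("total cost:", round(cost, 2), "ending inventory:", inv)
```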

Online Optimization with Predictions and Non-convex Losses

We study online optimization in a setting where an online learner seeks to optimize a per-round hitting cost, which may be non-convex, while incurring a movement cost when changing actions between rounds. We ask: under what general conditions is it possible for an online learner to leverage predictions of future cost functions in order to achieve near-optimal costs? Prior work has provided near-optimal online algorithms for specific combinations of assumptions about hitting and switching costs, but no general results are known. In this work, we give two general sufficient conditions that specify a relationship between the hitting and movement costs which guarantees that a new algorithm, Synchronized Fixed Horizon Control (SFHC), achieves a 1+O(1/w) competitive ratio, where w is the number of predictions available to the learner. Our conditions do not require the cost functions to be convex, and we also derive competitive ratio results for non-convex hitting and movement costs. Our results provide the first constant, dimension-free competitive ratio for online non-convex optimization with movement costs. We also give an example of a natural problem, Convex Body Chasing (CBC), where the sufficient conditions are not satisfied and prove that no online algorithm can have a competitive ratio that converges to 1.
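
As a hedged sketch of how a prediction window of size w can be used, the toy below implements fixed-horizon control with scalar quadratic hitting and movement costs; SFHC additionally synchronizes several such controllers, which this sketch omits, and all costs and constants are assumptions.

```python
# Hedged sketch: fixed-horizon control with w predictions, quadratic costs.
import numpy as np

rng = np.random.default_rng(3)
T, w = 60, 5
targets = rng.uniform(-1, 1, size=T)     # minimizers of the predicted hitting costs

def plan(x0, theta):
    """Step through the window, minimizing (x - th)^2 + (x - prev)^2 greedily."""
    traj, x = [], x0
    for th in theta:
        x = 0.5 * (x + th)               # closed-form minimizer of the two quadratics
        traj.append(x)
    return traj

x, total = 0.0, 0.0
for start in range(0, T, w):
    window = targets[start:start + w]    # the w predictions available now
    for th, x_next in zip(window, plan(x, window)):
        total += (x_next - th) ** 2 + (x_next - x) ** 2
        x = x_next
print("total cost:", round(total, 3))
```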

Mechanism Design for Online Resource Allocation: A Unified Approach

This paper concerns the mechanism design for online resource allocation in a strategic setting. In this setting, a single supplier allocates capacity-limited resources to requests that arrive in a sequential and arbitrary manner. Each request is associated with an agent who may act selfishly to misreport the requirement and valuation of her request. The supplier collects payments from agents whose requests are satisfied, but incurs a load-dependent supply cost. The goal is to design an incentive compatible online mechanism, which determines not only the resource allocation of each request, but also the payment of each agent, so as to (approximately) maximize the social welfare (i.e., aggregate valuations minus supply cost). We study this problem under the framework of competitive analysis. The major contribution of this paper is the development of a unified approach that achieves the best-possible competitive ratios for setups with different supply costs. Specifically, we show that when there is no supply cost or the supply cost function is linear, our model is essentially a standard 0-1 knapsack problem, for which our approach achieves logarithmic competitive ratios that match the state-of-the-art (which is optimal). For the more challenging setup where the supply cost is strictly convex, we provide online mechanisms, for the first time, that lead to the optimal competitive ratios as well. To the best of our knowledge, this is the first approach that unifies the characterization of optimal competitive ratios in online resource allocation for different setups including zero, linear and strictly convex supply costs.
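
For intuition on how such a mechanism can be implemented with posted prices, the sketch below uses a classic exponential marginal-price curve from the online-knapsack literature (value densities assumed in a known range [L, U]); the paper's unified mechanism generalizes this idea across supply-cost regimes, and everything below is illustrative.

```python
# Hedged sketch: a posted-price online mechanism for a capacity-limited resource.
L, U = 1.0, 8.0     # assumed bounds on per-unit valuations
CAP = 100.0         # resource capacity
used = 0.0
welfare = 0.0

def marginal_price(y):
    """Posted per-unit price as a function of current utilization y in [0, 1]."""
    return L * (U / L) ** y

# Each request: (units demanded, per-unit valuation), arriving online.
requests = [(10, 1.5), (30, 3.0), (20, 2.0), (40, 7.5), (25, 5.0)]
for units, val in requests:
    price = marginal_price(used / CAP)
    # Truthful take-it-or-leave-it offer: the agent accepts iff her value
    # exceeds the posted price and capacity remains.
    if val >= price and used + units <= CAP:
        used += units
        welfare += units * val
print("utilization:", used / CAP, "welfare:", welfare)
```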

Predict and Match: Prophet Inequalities with Uncertain Supply

We consider the problem of selling perishable items to a stream of buyers in order to maximize social welfare. A seller starts with a set of identical items, and each arriving buyer wants any one item, and has a valuation drawn i.i.d. from a known distribution. Each item, however, disappears after an a priori unknown amount of time that we term the horizon for that item. The seller knows the (possibly different) distribution of the horizon for each item, but not its realization until the item actually disappears. As with the classic prophet inequalities, the goal is to design an online pricing scheme that competes with the prophet that knows the horizon and extracts full social surplus (or welfare).

Our main results are for the setting where items have independent horizon distributions satisfying the monotone-hazard-rate (MHR) condition. Here, for any number of items, we achieve a constant-competitive bound via a conceptually simple policy that balances the rate at which buyers are accepted with the rate at which items are removed from the system. We implement this policy via a novel technique of matching via probabilistically simulating departures of the items at future times. Moreover, for a single item and MHR horizon distribution with mean μ, we show a tight result: There is a fixed pricing scheme that has competitive ratio at most 2 - 1/μ, and this is the best achievable in this class.

We further show that our results are best possible. First, we show that the competitive ratio is unbounded without the MHR assumption even for one item. Further, even when the horizon distributions are i.i.d. MHR and the number of items becomes large, the competitive ratio of any policy is lower bounded by a constant greater than 1, which is in sharp contrast to the setting with identical deterministic horizons.

Fundamental Limits on the Regret of Online Network-Caching

Optimal caching of files in a content distribution network (CDN) is a problem of fundamental and growing commercial interest. Although many different caching algorithms are in use today, the fundamental performance limits of network caching algorithms from an online learning point of view remain poorly understood. In this paper, we resolve this question in the following two settings: (1) a single user connected to a single cache, and (2) a set of users and a set of caches interconnected through a bipartite network. Recently, an online gradient-based coded caching policy was shown to enjoy sub-linear regret. However, due to the lack of known regret lower bounds, the question of the optimality of the proposed policy was left open. In this paper, we settle this question by deriving tight non-asymptotic regret lower bounds in the above settings. In addition, we propose a new Follow-the-Perturbed-Leader-based uncoded caching policy with near-optimal regret. Technically, the lower bounds are obtained by relating the online caching problem to the classic probabilistic paradigm of balls-into-bins. Our proofs make extensive use of a new result on the expected load in the most populated half of the bins, which might also be of independent interest. We evaluate the performance of the caching policies by experimenting with the popular MovieLens dataset and conclude the paper with design recommendations and a list of open problems.
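
A minimal sketch of a Follow-the-Perturbed-Leader caching rule of the kind proposed: cache the C files whose cumulative request counts, perturbed by fresh Gaussian noise with scale growing like √t, are largest. The request process and the perturbation constant are assumptions.

```python
# Hedged sketch: FTPL-style caching for a single cache of capacity C.
import numpy as np

rng = np.random.default_rng(4)
N, C, T = 50, 5, 10000
counts = np.zeros(N)                      # cumulative request counts
hits = 0
popularity = rng.dirichlet(np.ones(N))    # synthetic skewed request process

for t in range(1, T + 1):
    eta = np.sqrt(t)                      # perturbation scale ~ sqrt(t)
    perturbed = counts + eta * rng.normal(size=N)
    cache = np.argpartition(-perturbed, C)[:C]   # top-C perturbed scores
    req = rng.choice(N, p=popularity)
    hits += req in cache
    counts[req] += 1
print("hit rate:", hits / T)
```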

SESSION: Learning

Adaptive Discretization for Episodic Reinforcement Learning in Metric Spaces

We present an efficient algorithm for model-free episodic reinforcement learning on large (potentially continuous) state-action spaces. Our algorithm is based on a novel Q-learning policy with adaptive data-driven discretization. The central idea is to maintain a finer partition of the state-action space in regions that are frequently visited in historical trajectories and have higher payoff estimates. We demonstrate how our adaptive partitions take advantage of the shape of the optimal Q-function and the joint space, without sacrificing the worst-case performance. In particular, we recover the regret guarantees of prior algorithms for continuous state-action spaces, which additionally require an optimal discretization as input and/or access to a simulation oracle. Moreover, experiments demonstrate how our algorithm automatically adapts to the underlying structure of the problem, resulting in much better performance compared both to heuristics and Q-learning with uniform discretization.
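
The toy below sketches the adaptive-discretization idea in its simplest (one-dimensional, bandit) form: keep a partition of the space, select the cell with the best optimistic estimate plus a width-dependent term, and split cells that are visited often. The splitting budget and constants are assumptions; the paper applies this to Q-learning over the joint state-action space.

```python
# Hedged sketch: adaptive discretization of a continuous decision space.
import math
import random

random.seed(0)
f = lambda a: 1.0 - abs(a - 0.37)          # unknown payoff, maximized at 0.37

# Each cell: [left, right, pulls, total_reward]
cells = [[0.0, 1.0, 0, 0.0]]
for t in range(1, 3001):
    def ucb(c):
        l, r, n, s = c
        if n == 0:
            return float("inf")
        # optimism = sampling error + discretization error (cell width)
        return s / n + math.sqrt(2 * math.log(t) / n) + (r - l)
    c = max(cells, key=ucb)
    a = random.uniform(c[0], c[1])         # play an action from the chosen cell
    c[2] += 1; c[3] += f(a) + random.gauss(0, 0.1)
    # Split when visits exceed (1/width)^2, so promising regions get refined.
    if c[2] >= (1.0 / (c[1] - c[0])) ** 2:
        mid = (c[0] + c[1]) / 2
        cells.remove(c)
        cells += [[c[0], mid, 0, 0.0], [mid, c[1], 0, 0.0]]
best = max(cells, key=lambda c: c[3] / max(c[2], 1))
print("refined around:", round((best[0] + best[1]) / 2, 3), "| cells:", len(cells))
```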

Staleness Control for Edge Data Analytics

A new generation of cyber-physical systems has emerged with a large number of devices that continuously generate and consume massive amounts of data in a distributed and mobile manner. Accurate and near real-time decisions based on such streaming data are in high demand in many areas of optimization for such systems. Edge data analytics bring processing power close to the data sources, reduce the network delay for data transmission, allow large-scale distributed training, and consequently help meet real-time requirements. Nevertheless, the multiplicity of data sources leads to multiple distributed machine learning models that may suffer from sub-optimal performance due to the inconsistency in their states. In this work, we tackle the insularity, concept drift, and connectivity issues in edge data analytics to minimize its accuracy handicap without losing its timeliness benefits. Thus, we propose an efficient model synchronization mechanism for distributed and stateful data analytics. Staleness Control for Edge Data Analytics (SCEDA) ensures the high adaptability of synchronization frequency in the face of an unpredictable environment by addressing the trade-off between the generality and timeliness of the model.

Fundamental Limits of Approximate Gradient Coding

In the distributed gradient coding problem, it has been established that, to exactly recover the gradient under s slow machines, the minimum computation load (number of stored data partitions) of each worker is at least linear (s+1), which incurs a large overhead when s is large [13]. In this paper, we focus on approximate gradient coding that aims to recover the gradient with bounded error ε. Theoretically, our main contributions are three-fold: (i) we analyze the structure of optimal gradient codes, and derive the information-theoretic lower bound on the minimum computation load: O(log(n)/log(n/s)) for ε = 0 and d ≥ O(log(1/ε)/log(n/s)) for ε > 0, where d is the computation load, and ε is the error in the gradient computation; (ii) we design two approximate gradient coding schemes that exactly match these lower bounds based on a random edge-removal process; (iii) we implement our schemes and demonstrate the advantage of the approaches over the current fastest gradient coding strategies. The proposed schemes provide order-wise improvement over the state of the art in terms of computation load, and are also optimal in terms of both computation load and latency.
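
A toy sketch of the approximate-recovery idea: assign each worker d random partitions (a stand-in for the paper's random edge-removal construction) and average whatever the non-straggling workers cover; partitions covered by no surviving worker contribute the error ε.

```python
# Hedged sketch: randomized assignment for approximate gradient coding.
import numpy as np

rng = np.random.default_rng(5)
n, d, s = 20, 3, 4                       # partitions/workers, load, stragglers
partial = rng.normal(size=n)             # per-partition gradient (scalar toy)
true_grad = partial.mean()

# Worker i stores d random partitions.
stored = [rng.choice(n, size=d, replace=False) for _ in range(n)]
alive = rng.choice(n, size=n - s, replace=False)   # non-straggling workers

# Estimate each partition's gradient from the copies that survived.
est = np.zeros(n)
for k in range(n):
    copies = [partial[k] for i in alive if k in stored[i]]
    est[k] = np.mean(copies) if copies else 0.0    # missing partitions -> error
print("recovery error:", abs(est.mean() - true_grad))
```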

Forecasting with Alternative Data

We consider the problem of forecasting fine-grained company financials, such as daily revenue, from two input types: noisy proxy signals à la alternative data (e.g., credit card transactions) and sparse ground-truth observations (e.g., quarterly earnings reports). We utilize a classical linear systems model to capture both the evolution of the hidden or latent state (e.g., daily revenue), as well as the proxy signal (e.g., credit card transactions). The linear system model is particularly well suited here as data is extremely sparse (4 quarterly reports per year). In classical system identification, where the central theme is to learn parameters for such linear systems, unbiased and consistent estimation of parameters is not feasible: the likelihood is non-convex; and worse, the global optimum for maximum likelihood estimation is often non-unique.

As the main contribution of this work, we provide a simple, consistent estimator of all parameters for the linear system model of interest; in addition, the estimation is unbiased for some of the parameters. In effect, the additional sparse observations of aggregate hidden state (e.g., quarterly reports) enable system identification in our setup that is not feasible in general. For estimating and forecasting hidden state (actual earnings) using the noisy observations (daily credit card transactions), we utilize the learned linear model along with a natural adaptation of classical Kalman filtering (or Belief Propagation). This leads to optimal inference with respect to mean-squared error. Analytically, we argue that even though the underlying linear system may be "unstable," "uncontrollable," or "undetectable" in the classical setting, our setup and inference algorithm allow for estimation of hidden state with bounded error. Further, the estimation error of the algorithm monotonically decreases as the frequency of the sparse observations increases. This seemingly intuitive insight contradicts the word on the Street. Finally, we utilize our framework to estimate quarterly earnings of 34 public companies using credit card transaction data. Our data-driven method convincingly outperforms the Wall Street consensus (analyst) estimates even though our method uses only credit card data as input, while the Wall Street consensus is based on various data sources including experts' input.
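
A hedged sketch of the inference step: a Kalman filter on an augmented state [daily state, within-quarter running sum] that absorbs both the daily proxy observation and the sparse quarterly aggregate. The dynamics, loadings, and noise levels below are made up for illustration.

```python
# Hedged sketch: Kalman filtering with daily proxies and quarterly aggregates.
import numpy as np

rng = np.random.default_rng(6)
a, b = 0.98, 2.0                  # state transition and proxy loading (assumed)
q2, r2, rq2 = 0.1, 0.5, 0.01      # process, proxy, and report noise variances
Q_LEN = 90                        # days per "quarter"

F = np.array([[a, 0.0], [a, 1.0]])            # [x', s'] = F @ [x, s] + noise
Qm = q2 * np.array([[1.0, 1.0], [1.0, 1.0]])  # shared process noise
H_day = np.array([[b, 0.0]])                  # daily proxy observes b*x
H_qtr = np.array([[0.0, 1.0]])                # quarterly report observes the sum

m, P = np.zeros(2), np.eye(2)                 # filter mean and covariance
x, s = 5.0, 0.0                               # true hidden state and running sum
for t in range(1, 4 * Q_LEN + 1):
    x = a * x + rng.normal(scale=np.sqrt(q2)); s += x    # simulate the truth
    m, P = F @ m, F @ P @ F.T + Qm                       # predict
    obs = [(H_day, b * x + rng.normal(scale=np.sqrt(r2)), r2)]
    if t % Q_LEN == 0:                                   # sparse aggregate report
        obs.append((H_qtr, s + rng.normal(scale=np.sqrt(rq2)), rq2))
    for H, z, R in obs:
        S = H @ P @ H.T + R                              # innovation variance
        K = P @ H.T / S                                  # Kalman gain (2x1)
        m = m + (K * (z - H @ m)).ravel()
        P = P - K @ H @ P
    if t % Q_LEN == 0:                                   # reset the sum component
        s = 0.0
        J = np.array([[1.0, 0.0], [0.0, 0.0]])
        m, P = J @ m, J @ P @ J.T
print("final estimate of daily state:", m[0], "truth:", x)
```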

Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment

Root cause analysis in a large-scale production environment is challenging due to the complexity and scale of the services running across global data centers. It is often difficult to review the logs jointly for understanding production issues given the distributed nature of the system. Additionally, there could easily be millions of entities, each described by hundreds of features. In this paper we present a fast dimensional analysis framework that automates the root cause analysis on structured logs with improved scalability.

We first explore item-sets, i.e., combinations of feature values, that could identify groups of samples with sufficient support for the target failures using the Apriori algorithm and a subsequent improvement, FP-Growth. These algorithms were designed for frequent item-set mining and association rule learning over transactional databases. After applying them on structured logs, we select the item-sets that are most unique to the target failures based on lift. We propose pre-processing steps with the use of a large-scale real-time database and post-processing techniques and parallelism to further speed up the analysis and improve interpretability, and demonstrate that such optimization is necessary for handling large-scale production datasets. We have successfully rolled out this approach for root cause investigation purposes within Facebook's infrastructure. We also present the setup and results from multiple production use cases in this paper.
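
The core of the analysis can be sketched in a few lines: count the support of feature-value combinations among failing rows and rank them by lift against the background failure rate. The hand-rolled counting below stands in for Apriori/FP-Growth, and the toy log rows are assumptions.

```python
# Hedged sketch: item-set support and lift over structured log rows.
from itertools import combinations
from collections import Counter

# Structured log rows: feature -> value, plus a failure flag (toy data).
rows = [
    {"dc": "east", "build": "v1", "fail": True},
    {"dc": "east", "build": "v1", "fail": True},
    {"dc": "east", "build": "v2", "fail": False},
    {"dc": "west", "build": "v1", "fail": False},
    {"dc": "west", "build": "v2", "fail": False},
]
MIN_SUPPORT = 2
fail_rate = sum(r["fail"] for r in rows) / len(rows)

support, fail_support = Counter(), Counter()
for r in rows:
    items = [(k, v) for k, v in r.items() if k != "fail"]
    for size in (1, 2):
        for combo in combinations(sorted(items), size):
            support[combo] += 1
            fail_support[combo] += r["fail"]

for combo, cnt in support.items():
    if fail_support[combo] >= MIN_SUPPORT:
        # lift = P(fail | item-set) / P(fail)
        lift = (fail_support[combo] / cnt) / fail_rate
        print(combo, "support:", cnt, "lift:", round(lift, 2))
```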

Inferring Streaming Video Quality from Encrypted Traffic: Practical Models and Deployment Experience

Inferring the quality of streaming video applications is important for Internet service providers, but the fact that most video streams are encrypted makes it difficult to do so. We develop models that infer quality metrics (i.e., startup delay and resolution) for encrypted streaming video services. Our paper builds on previous work, but extends it in several ways. First, the models work in deployment settings where the video sessions and segments must be identified from a mix of traffic and the time precision of the collected traffic statistics is more coarse (e.g., due to aggregation). Second, we develop a single composite model that works for a range of different services (i.e., Netflix, YouTube, Amazon, and Twitch), as opposed to just a single service. Third, unlike many previous models, our models perform predictions at finer granularity (e.g., the precise startup delay instead of just detecting short versus long delays), allowing us to draw better conclusions on the ongoing streaming quality. Fourth, we demonstrate the models are practical through a 16-month deployment in 66 homes and provide new insights about the relationships between Internet "speed" and the quality of the corresponding video streams, for a variety of services; we find that higher speeds provide only minimal improvements to startup delay and resolution.

Social Learning in Multi Agent Multi Armed Bandits

We introduce a novel decentralized, multi-agent version of the classical multi-armed bandit (MAB) problem, consisting of n agents that collaboratively and simultaneously solve the same instance of a K-armed MAB to minimize individual regret. The agents can communicate and collaborate among each other only through a pairwise asynchronous gossip-based protocol that exchanges a limited number of bits. In our model, agents at each point decide on (i) which arm to play, (ii) whether to communicate, and if so (iii) what and with whom. We develop a novel algorithm in which agents, whenever they choose, communicate only arm-ids and not samples, with another agent chosen uniformly and independently at random. The per-agent regret achieved by our algorithm is O(⌈K/n⌉ + log(n)/Δ log(T)), where Δ is the difference between the means of the best and second best arms. Furthermore, any agent in our algorithm communicates (arm-ids to a uniformly and independently chosen agent) only a total of Θ(log(T)) times over a time interval of T. We compare our results to two benchmarks: one where there is no communication among agents, and one corresponding to complete interaction, where an agent has access to the entire system history of arms played and rewards obtained by all agents. We show, both theoretically and empirically, that our algorithm experiences a significant reduction both in per-agent regret when compared to the case when agents do not collaborate and each agent plays the standard MAB problem (where regret would scale linearly in K), and in communication complexity when compared to the full interaction setting, which requires T communication attempts by an agent over T arm pulls. Our result thus demonstrates that even a minimal level of collaboration among agents enables a significant reduction in per-agent regret.
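
A sketch of the collaboration pattern described above: each agent runs UCB over a small active arm set and, at sparse (here, power-of-two) communication times, sends only its current best arm-id to one uniformly random agent. The arm partition, schedule, and constants are illustrative, not the paper's exact algorithm.

```python
# Hedged sketch: gossip of arm-ids among UCB agents.
import numpy as np

rng = np.random.default_rng(7)
K, n, T = 20, 4, 20000
means = rng.uniform(0, 1, size=K)
# Partition the arms so each agent starts with K/n of them.
active = [set(range(a * (K // n), (a + 1) * (K // n))) for a in range(n)]
pulls = np.zeros((n, K)); rew = np.zeros((n, K))

for t in range(1, T + 1):
    for a in range(n):
        arms = list(active[a])
        ucb = [rew[a, k] / max(pulls[a, k], 1)
               + np.sqrt(2 * np.log(t) / max(pulls[a, k], 1)) for k in arms]
        k = arms[int(np.argmax(ucb))]
        pulls[a, k] += 1
        rew[a, k] += rng.binomial(1, means[k])
    if t & (t - 1) == 0:                    # gossip at powers of two: O(log T) total
        for a in range(n):
            best = max(active[a], key=lambda k: rew[a, k] / max(pulls[a, k], 1))
            peer = rng.integers(n)
            active[peer].add(best)          # communicate only the arm-id
print("agent 0 best arm:",
      max(active[0], key=lambda k: rew[0, k] / max(pulls[0, k], 1)),
      "| true best:", int(np.argmax(means)))
```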

Non-Asymptotic Analysis of Monte Carlo Tree Search

In this work, we consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo Tree Search (MCTS), in the context of an infinite-horizon discounted-cost Markov Decision Process (MDP) with deterministic transitions. While MCTS is believed to provide an approximate value function for a given state with enough simulations, cf. [Kocsis and Szepesvari 2006; Kocsis et al. 2006], the claimed proof of this property is incomplete. This is due to the fact that the variant of MCTS, the Upper Confidence Bound for Trees (UCT), analyzed in prior works utilizes a "logarithmic" bonus term for balancing exploration and exploitation within the tree-based search, following the insights from the stochastic multi-arm bandit (MAB) literature, cf. [Agrawal 1995; Auer et al. 2002]. In effect, such an approach assumes that the regret of the underlying recursively dependent non-stationary MABs concentrates around their mean exponentially in the number of steps, which is unlikely to hold as pointed out in [Audibert et al. 2009], even for stationary MABs.

As the key contribution of this work, we establish the polynomial concentration property of regret for a class of non-stationary multi-arm bandits. This in turn establishes that MCTS with an appropriate polynomial, rather than logarithmic, bonus term in UCB has the claimed property of [Kocsis and Szepesvari 2006; Kocsis et al. 2006]. Interestingly enough, empirically successful approaches (cf. [Silver et al. 2017]) utilize a similar polynomial form of MCTS as suggested by our result. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a "policy improvement" operator, i.e., it iteratively improves value function approximation for all states, due to combining with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an ε-approximation of the value function for deterministic MDPs with respect to the ℓ∞ norm, MCTS combined with nearest neighbor requires a sample size scaling as Õ(ε^-(d+4)), where d is the dimension of the state space. This is nearly optimal due to a minimax lower bound of Ω̃(ε^-(d+2)) [Shah and Xie 2018], suggesting the strength of the variant of MCTS we propose here and of our resulting analysis.
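
The algorithmic change at the heart of this result is easy to state in code: replace the classical logarithmic UCB bonus with a polynomial one. The stationary-bandit sketch below shows the modified selection rule; the exponents are assumed for illustration, and in MCTS the same rule is applied recursively at each tree node.

```python
# Hedged sketch: UCB selection with a polynomial (not logarithmic) bonus.
import numpy as np

rng = np.random.default_rng(8)
K, T = 5, 20000
means = rng.uniform(0, 1, size=K)
alpha, beta = 0.25, 0.5                 # polynomial bonus exponents (assumed)
n = np.zeros(K); s = np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                        # pull each arm once
    else:
        bonus = t ** alpha / n ** beta   # polynomial, not sqrt(log t / n)
        a = int(np.argmax(s / n + bonus))
    r = rng.normal(means[a], 0.1)        # in MCTS this would be a child-node value
    n[a] += 1; s[a] += r
print("empirical best:", int(np.argmax(s / n)), "true best:", int(np.argmax(means)))
```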

SESSION: Scheduling +

Heavy-traffic Analysis of the Generalized Switch under Multidimensional State Space Collapse

Stochastic Processing Networks that model wired and wireless networks, and other queueing systems, have been studied in the heavy-traffic limit in the literature under the so-called Complete Resource Pooling (CRP) condition. Under the CRP condition, these systems behave like a single server queue. When the CRP condition is not satisfied, heavy-traffic results are known only in the special cases of the input-queued switch and bandwidth-sharing networks.

In this paper, we consider a very general queueing system called the 'generalized switch' that includes wireless networks under fading, data center networks, input-queued switch, etc. The primary contribution of this paper is to present the exact value of the steady-state mean of certain linear combinations of queue lengths in the heavy-traffic limit under the MaxWeight scheduling algorithm. We do this using the Drift method, and we also present a negative result that it is not possible to obtain the remaining linear combinations (and consequently all the individual mean queue lengths) using this method. We do this by presenting an alternate view of the Drift method in terms of an (under-determined) system of linear equations. Finally, we use this system of equations to obtain upper and lower bounds on all linear combinations of queue lengths.

Characterizing Policies with Optimal Response Time Tails under Heavy-Tailed Job Sizes

We consider the tail behavior of the response time distribution in an M/G/1 queue with heavy-tailed job sizes, specifically those with intermediately regularly varying tails. In this setting, the response time tail of many individual policies has been characterized, and it is known that policies such as Shortest Remaining Processing Time (SRPT) and Foreground-Background (FB) have response time tails of the same order as the job size tail, and thus such policies are tail-optimal. Our goal in this work is to move beyond individual policies and characterize the set of policies that are tail-optimal. Toward that end, we use the recently introduced SOAP framework to derive sufficient conditions on the form of prioritization used by a scheduling policy that ensure the policy is tail-optimal. These conditions are general and lead to new results for important policies that have previously resisted analysis, including the Gittins policy, which minimizes mean response time among policies that do not have access to job size information. As a by-product of our analysis, we derive a general upper bound for fractional moments of M/G/1 busy periods, which is of independent interest.

Simple Near-Optimal Scheduling for the M/G/1

We consider the problem of preemptively scheduling jobs to minimize mean response time of an M/G/1 queue. When we know each job's size, the shortest remaining processing time (SRPT) policy is optimal. Unfortunately, in many settings we do not have access to each job's size. Instead, we know only the job size distribution. In this setting the Gittins policy is known to minimize mean response time, but its complex priority structure can be computationally intractable. A much simpler alternative to Gittins is the shortest expected remaining processing time (SERPT) policy. While SERPT is a natural extension of SRPT to unknown job sizes, it is unknown whether or not SERPT is close to optimal for mean response time.

We present a new variant of SERPT called monotonic SERPT (M-SERPT) which is as simple as SERPT but has provably near-optimal mean response time at all loads for any job size distribution. Specifically, we prove the mean response time ratio between M-SERPT and Gittins is at most 3 for load ρ ≤ 8/9 and at most 5 for any load. This makes M-SERPT the only non-Gittins scheduling policy known to have a constant-factor approximation ratio for mean response time.
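
As we read the abstract, M-SERPT ranks jobs by a monotone (running-maximum) envelope of the SERPT index E[S - a | S > a], serving the job with the lowest rank. The sketch below computes both indices for a toy discrete job-size distribution; the distribution and age grid are assumptions.

```python
# Hedged sketch: SERPT index and its monotone (M-SERPT) envelope.
import numpy as np

# Toy discrete job-size distribution: P(S = sizes[i]) = probs[i].
sizes = np.array([1.0, 2.0, 10.0])
probs = np.array([0.6, 0.3, 0.1])

ages = np.linspace(0.0, sizes.max(), 101)[:-1]
serpt = np.empty_like(ages)
for i, a in enumerate(ages):
    tail = sizes > a
    mass = probs[tail].sum()
    serpt[i] = ((sizes[tail] - a) * probs[tail]).sum() / mass   # E[S - a | S > a]
m_serpt = np.maximum.accumulate(serpt)   # monotone envelope of the SERPT index

print("age  SERPT  M-SERPT")
for a, r1, r2 in zip(ages[::20], serpt[::20], m_serpt[::20]):
    print(f"{a:4.1f}  {r1:5.2f}  {r2:6.2f}")
```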

Delay-Optimal Policies in Partial Fork-Join Systems with Redundancy and Random Slowdowns

We consider a large distributed service system consisting of n homogeneous servers with infinite capacity FIFO queues. Jobs arrive as a Poisson process of rate λn/k_n (for some positive constant λ and integer k_n). Each incoming job consists of k_n identical tasks that can be executed in parallel, and that can be encoded into at least k_n "replicas" of the same size (by introducing redundancy) so that the job is considered to be completed when any k_n replicas associated with it finish their service. Moreover, we assume that servers can experience random slowdowns in their processing rate so that the service time of a replica is the product of its size and a random slowdown.

First, we assume that the server slowdowns are shifted exponential and independent of the replica sizes. In this setting we show that the delay of a typical job is asymptotically minimized (as n → ∞) when the number of replicas per task is a constant that only depends on the arrival rate λ, and on the expected slowdown of servers.

Second, we introduce a new model for the server slowdowns in which larger tasks experience less variable slowdowns than smaller tasks. In this setting we show that, under the class of policies where all replicas start their service at the same time, the delay of a typical job is asymptotically minimized (as n → ∞) when the number of replicas per task is made to depend on the actual size of the tasks being replicated, with smaller tasks being replicated more than larger tasks.

Mean Field Analysis of Join-Below-Threshold Load Balancing for Resource Sharing Servers

Load balancing plays a crucial role in many large scale computer systems. Much prior work has focused on systems with First-Come-First-Served (FCFS) servers. However, servers in practical systems are more complicated. They serve multiple jobs at once, and their service rate can depend on the number of jobs in service. Motivated by this, we study load balancing for systems using Limited-Processor-Sharing (LPS). Our model has heterogeneous servers, meaning the service rate curve and multiprogramming level (limit on the number of jobs sharing the processor) differs between servers. We focus on a specific load balancing policy: Join-Below-Threshold (JBT), which associates a threshold with each server and, whenever possible, dispatches to a server which has fewer jobs than its threshold. Given this setup, we ask: how should we configure the system to optimize objectives such as mean response time? Configuring the system means choosing both a load balancing threshold and a multiprogramming level for each server. To make this question tractable, we study the many-server mean field regime.

In this paper we provide a comprehensive study of JBT in the mean field regime. We begin by developing a mean field model for the case of exponentially distributed job sizes. The evolution of our model is described by a differential inclusion, which complicates its analysis. We prove that the sequence of stationary measures of the finite systems converges to the fixed point of the differential inclusion, provided a unique fixed point exists. We derive simple conditions on the service rate curves to guarantee the existence of a unique fixed point. We demonstrate that when these conditions are not satisfied, there may be multiple fixed points, meaning metastability may occur. Finally, we give a simple method for determining the optimal system configuration to minimize the mean response time and related metrics.

While our theoretical results are proven for the special case of exponentially distributed job sizes, we provide evidence from simulation that the system becomes insensitive to the job size distribution in the mean field regime, suggesting our results are more generally applicable.
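
A minimal simulation sketch of the JBT rule studied above: dispatch each arrival to a uniformly random server below its threshold if one exists, otherwise to a uniformly random server. The thresholds, multiprogramming level, service-rate curve, and crude discrete-time dynamics are all illustrative assumptions.

```python
# Hedged sketch: Join-Below-Threshold dispatch with LPS-style servers.
import numpy as np

rng = np.random.default_rng(9)
n = 100
threshold = np.full(n, 4)            # JBT threshold per server
queue = np.zeros(n, dtype=int)       # jobs present at each server

def dispatch():
    below = np.flatnonzero(queue < threshold)
    return rng.choice(below) if below.size else rng.integers(n)

rate = lambda k: min(k, 3) / 3.0     # concave service-rate curve, MPL = 3
for _ in range(10000):               # crude discrete-time dynamics, load ~ 0.8
    for _ in range(rng.poisson(8.0)):
        queue[dispatch()] += 1
    p_done = np.array([rate(k) / 10.0 for k in queue])   # completion prob. per tick
    queue -= ((rng.random(n) < p_done) & (queue > 0)).astype(int)
print("mean jobs per server:", queue.mean())
```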

Achieving Efficient Routing in Reconfigurable DCNs

With the fast growth of cloud services and network scales, the heavy and highly dynamic traffic demands pose great challenges to efficient traffic engineering in today's data center networks (DCNs) [21]. DCN flows can be broadly classified into two main categories: delay-sensitive small flows (e.g., queries or real-time small messages) and throughput-sensitive large flows (e.g., backup traffic). In general, more than 80% of flows in data centers are small flows, while the majority of the traffic volume is contributed by the top 10% of large flows [3, 7]. To handle the mixed traffic, today's data centers [1, 14] generally follow tree-based topologies (e.g., fat-tree) and adopt load-agnostic routing strategies based on random path selection (e.g., ECMP) [14, 19]. Although this is adequate for routing small flows, which are highly random, these strategies are likely to route several large flows through the same output link and lead to long-lived congestion [2, 8]. With the limited switch buffer occupied by large flows for a long time, small flows are reported to experience delays an order of magnitude larger, which compromises the performance of DCNs and makes users suffer [3].

SESSION: Networking

Fundamental Limits of Volume-based Network DoS Attacks

Volume-based network denial-of-service (DoS) attacks refer to a class of cyber attacks where an adversary seeks to block user traffic from service by sending adversarial traffic that reduces the available user capacity. In this paper, we explore the fundamental limits of volume-based network DoS attacks by studying the minimum required rate of adversarial traffic and investigating optimal attack strategies. We start our analysis with single-hop networks where user traffic is routed to servers following the Join-the-Shortest-Queue (JSQ) rule. Given the service rates of servers and arrival rates of user traffic, we first characterize the feasibility region of the attack and show that the attack is feasible if and only if the rate of the adversarial traffic lies in the region. We then design an attack strategy that is (i) optimal: it guarantees the success of the attack whenever the adversarial traffic rate lies in the feasibility region, and (ii) oblivious: it does not rely on knowledge of service rates or user traffic rates. Finally, we extend our results on the feasibility region of the attack and the optimal attack strategy to multi-hop networks that employ Back-pressure (Max-Weight) routing. At a higher level, this paper addresses a class of dual problems of stochastic network stability, i.e., how to optimally de-stabilize a network.

On the Complexity of Traffic Traces and Implications

This paper presents a systematic approach to identify and quantify the types of structures featured by packet traces in communication networks. Our approach leverages an information-theoretic methodology, based on iterative randomization and compression of the packet trace, which allows us to systematically remove and measure dimensions of structure in the trace. In particular, we introduce the notion of trace complexity which approximates the entropy rate of a packet trace. Considering several real-world traces, we show that trace complexity can provide unique insights into the characteristics of various applications. Based on our approach, we also propose a traffic generator model able to produce a synthetic trace that matches the complexity levels of its corresponding real-world trace. Using a case study in the context of datacenters, we show that insights into the structure of packet traces can lead to improved demand-aware network designs: datacenter topologies that are optimized for specific traffic patterns.
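
The randomize-and-compress idea can be sketched directly: approximate a trace's entropy rate by its compressed size, and measure one dimension of structure (here, temporal ordering) by comparing against a shuffled copy that destroys it. The toy trace and the use of zlib as the compressor are assumptions.

```python
# Hedged sketch: compression-based complexity of a packet trace.
import zlib
import random

random.seed(0)
# Toy "trace": (src, dst) pairs with strong temporal locality.
trace = [(i % 8, (i + 1) % 8) for i in range(5000)]

def compressed_bits_per_entry(entries):
    blob = ",".join(f"{s}>{d}" for s, d in entries).encode()
    return 8 * len(zlib.compress(blob, 9)) / len(entries)

shuffled = trace[:]
random.shuffle(shuffled)               # randomization removes temporal structure
original = compressed_bits_per_entry(trace)
randomized = compressed_bits_per_entry(shuffled)
print(f"original: {original:.2f} bits/entry, shuffled: {randomized:.2f} bits/entry")
print(f"temporal structure accounts for {randomized - original:.2f} bits/entry")
```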

On the Analysis of a Multipartite Entanglement Distribution Switch

We study a quantum switch that distributes maximally entangled multipartite states to sets of users. The entanglement switching process requires two steps: first, each user attempts to generate bipartite entanglement between itself and the switch; and second, the switch performs local operations and a measurement to create multipartite entanglement for a set of users. In this work, we study a simple variant of this system, wherein the switch has infinite memory and the links that connect the users to the switch are identical. Further, we assume that all quantum states, if generated successfully, have perfect fidelity and that decoherence is negligible. This problem formulation is of interest to several distributed quantum applications, while the technical aspects of this work result in new contributions within queueing theory. Via extensive use of Lyapunov functions, we derive necessary and sufficient conditions for the stability of the system and closed-form expressions for the switch capacity and the expected number of qubits in memory.

On Time Synchronization Issues in Time-Sensitive Networks with Regulators and Nonideal Clocks

Flow reshaping is used in time-sensitive networks (as in the context of IEEE TSN and IETF Detnet) in order to reduce burstiness inside the network and to support the computation of guaranteed latency bounds. This is performed using per-flow regulators (such as the Token Bucket Filter) or interleaved regulators (as with IEEE TSN Asynchronous Traffic Shaping, ATS). The former use one FIFO queue per flow, whereas the latter use one FIFO queue per input port. Both types of regulators are beneficial as they cancel the increase of burstiness due to multiplexing inside the network. It was demonstrated, by using network calculus, that they do not increase the worst-case latency. However, the properties of regulators were established assuming that time is perfect in all network nodes. In reality, nodes use local, imperfect clocks. Time-sensitive networks exist in two flavours: (1) in non-synchronized networks, local clocks run independently at every node and their deviations are not controlled and (2) in synchronized networks, the deviations of local clocks are kept within very small bounds using for example a synchronization protocol (such as PTP) or a satellite based geo-positioning system (such as GPS). We revisit the properties of regulators in both cases. In non-synchronized networks, we show that ignoring the timing inaccuracies can lead to network instability due to unbounded delay in per-flow or interleaved regulators. We propose and analyze two methods (rate and burst cascade, and asynchronous dual arrival-curve method) for avoiding this problem. In synchronized networks, we show that there is no instability with per-flow regulators but, surprisingly, interleaved regulators can lead to instability. To establish these results, we develop a new framework that captures industrial requirements on clocks in both non-synchronized and synchronized networks, and we develop a toolbox that extends network calculus to account for clock imperfections.

Lancet: Better network resilience by designing for pruned failure sets

Recently, researchers have started exploring the design of route protection schemes that ensure networks can sustain traffic demand without congestion under failures. Existing approaches focus on ensuring that worst-case performance over simultaneous f-failure scenarios is acceptable. Unfortunately, even a single bad scenario may render the schemes unable to protect against any f-failure scenario. In this paper, we present Lancet, a system designed to handle most failures when not all can be tackled. Lancet comprises three components: (i) an algorithm to analyze which failure scenarios the network can intrinsically handle, which provides a benchmark for any protection routing scheme and guides the design of new schemes; (ii) an approach to efficiently design protection schemes for more general failure sets than all f-failure scenarios; and (iii) techniques to determine which of the combinatorially many scenarios to design for. Our evaluations with real topologies and validations on an emulation testbed show that Lancet outperforms a worst-case approach by protecting against many more scenarios, and can even match the scenarios handled by optimal network response.

vrfinder: Finding Outbound Addresses in Traceroute

Current methods to analyze the Internet's router-level topology with paths collected using traceroute assume that the source address for each router in the path is either an inbound or an off-path address. In this work, we show that outbound addresses are common in our Internet-wide traceroute dataset collected by CAIDA's Ark vantage points in January 2020, accounting for 1.7% - 5.8% of the addresses seen at some point before the end of a traceroute. This phenomenon can lead to mistakes in Internet topology analysis, such as inferring router ownership and identifying interdomain links. We hypothesize that the primary contributor to outbound addresses is Layer 3 Virtual Private Networks (L3VPNs), and propose vrfinder, a technique for identifying L3VPN outbound addresses in traceroute collections. We validate vrfinder against ground truth from two large research and education networks, demonstrating high precision (100.0%) and recall (82.1% - 95.3%). We also show the benefit of accounting for L3VPNs in traceroute analysis through extensions to bdrmapIT, increasing the accuracy of its router ownership inferences for L3VPN outbound addresses from 61.5% - 79.4% to 88.9% - 95.5%.

Ludo Hashing: Compact, Fast, and Dynamic Key-value Lookups for Practical Network Systems

Key-value lookup engines running in fast memory are crucial components of many networked and distributed systems such as packet forwarding, virtual network functions, content distribution networks, distributed storage, and cloud/edge computing. These lookup engines must be memory-efficient because fast memory is small and expensive. This work presents a new key-value lookup design, called Ludo Hashing, which costs the least space (3.76 + 1.05ℓ bits per key-value item for ℓ-bit values) among known compact lookup solutions including the recently proposed partial-key Cuckoo and Bloomier perfect hashing. In addition to its space efficiency, Ludo Hashing works well with most practical systems by supporting fast lookups, fast updates, and concurrent writing/reading. We implement Ludo Hashing and evaluate it with both micro-benchmark and two network systems deployed in CloudLab. The results show that in practice Ludo Hashing saves 40% to 80%+ memory cost compared to existing dynamic solutions. It costs only a few GB memory for 1 billion key-value items and achieves high lookup throughput: over 65 million queries per second on a single node with multiple threads.

SESSION: Network Measurement

The Great Internet TCP Congestion Control Census

In 2016, Google proposed and deployed a new TCP variant called BBR. BBR represents a major departure from traditional congestion control as it uses estimates of bandwidth and round-trip delays to regulate its sending rate. BBR has since been introduced in the upstream Linux kernel and deployed by Google across its data centers. Since the last major study to identify TCP congestion control variants on the Internet was done before BBR, it is timely to conduct a new census to give us a sense of the current distribution of congestion control variants on the Internet. To this end, we designed and implemented Gordon, a tool that allows us to measure the congestion window (cwnd) of a congestion control algorithm over each successive RTT of a TCP connection. To compare a measured flow to the known variants, we created a localized bottleneck and introduced a variety of network changes, like loss events and changes in bandwidth and delay, while normalizing all measurements by RTT. We built an offline classifier to identify the TCP variant based on the cwnd trace over time.

Our results suggest that CUBIC is currently the dominant TCP variant on the Internet, and is deployed on about 36% of the websites in the Alexa Top 20,000 list. While BBR and its variant BBR G1.1 are currently in second place with a 22% share by website count, their present share of total Internet traffic volume is estimated to be larger than 40%. We also found that Akamai has deployed a unique loss-agnostic rate-based TCP variant on some 6% of the Alexa Top 20,000 websites and there are likely other undocumented variants. Therefore, the traditional assumption that TCP variants "in the wild" will come from a small known set is not likely to be true anymore. Our results suggest that some variant of BBR seems poised to replace CUBIC as the next dominant TCP variant on the Internet.

I Know What You Did Last Summer: Network Monitoring using Interval Queries

Modern telemetry systems require advanced analytic capabilities such as drill down queries. These queries can be used to detect the beginning and end of a network anomaly by efficiently refining the search space. We present the first integral solution that (i) enables multiple measurement tasks inside the same data structure, (ii) supports specifying the time frame of interest as part of its queries, and (iii) is sketch-based and thus space efficient. Namely, our approach allows the user to define both the measurement task (e.g., heavy hitters, entropy estimation, cardinality estimation) and the time frame of relevance (e.g., 5PM-6PM) at query time. Our approach provides accuracy guarantees and is the only space-efficient solution that offers such capabilities. Finally, we demonstrate how the algorithm can be used to accurately pinpoint the beginning of a realistic DDoS attack.

Generalized Sketch Families for Network Traffic Measurement

Traffic measurement provides critical information for network management, resource allocation, traffic engineering, and attack detection. Most prior art has been geared towards specific application needs with specific performance objectives. To support diverse requirements with efficient and future-proof implementation, this paper takes a new approach to establish common frameworks, each for a family of traffic measurement solutions that share the same implementation structure, providing a high level of generality, for both size and spread measurements and for all flows. The designs support many options of performance-overhead tradeoff with as few as one memory update per packet and as little space as several bits per flow on average. Such a family-based approach will unify implementation by removing redundancy from different measurement tasks and support reconfigurability in a plug-n-play manner. We demonstrate the connection and difference in the design of these traffic measurement families and perform experimental comparisons on hardware/software platforms to find their tradeoffs, which provides practical guidance on which solutions to use under given performance goals.

Latency Imbalance Among Internet Load-Balanced Paths: A Cloud-Centric View

Load balancers choose among load-balanced paths to distribute traffic as if it makes no difference using one path or another. This work shows that the latency difference between load-balanced paths (called latency imbalance), previously deemed insignificant, is now prevalent from the perspective of the cloud and affects various latency-sensitive applications. In this work, we present the first large-scale measurement study of latency imbalance from a cloud-centric view. Using public clouds around the globe, we measure latency imbalance both between data centers (DCs) in the cloud and from the cloud to the public Internet. Our key findings include that (1) Amazon's and Alibaba's clouds together have latency differences between load-balanced paths larger than 20ms to 21% of public IPv4 addresses; (2) Google's secret to having lower latency imbalance than other clouds is to use its own well-balanced private WANs to transit traffic close to the destinations; and (3) latency imbalance is also prevalent between DCs in the cloud, where 8 pairs of DCs are found to have load-balanced paths with latency differences larger than 40ms. We further evaluate the impact of latency imbalance on three applications (i.e., NTP, delay-based geolocation and VoIP) and propose potential solutions to improve application performance. Our experiments show that all three applications can benefit from considering latency imbalance, where the accuracy of delay-based geolocation can be greatly improved by simply changing how ping measures the minimum path latency.

On the Bottleneck Structure of Congestion-Controlled Networks

In this paper, we introduce the Theory of Bottleneck Ordering, a mathematical framework that reveals the bottleneck structure of data networks. This theoretical framework provides insights into the inherent topological properties of a network in at least three areas: (1) It identifies the regions of influence of each bottleneck; (2) it reveals the order in which bottlenecks (and flows traversing them) converge to their steady state transmission rates in distributed congestion control algorithms; and (3) it provides key insights into the design of optimized traffic engineering policies. We demonstrate the efficacy of the proposed theory in TCP congestion-controlled networks for two broad classes of algorithms: Congestion-based algorithms (TCP BBR) and loss-based additive-increase/multiplicative-decrease algorithms (TCP Cubic and Reno). Among other results, our network experiments show that: (1) Qualitatively, both classes of congestion control algorithms behave as predicted by the bottleneck structure of the network; (2) flows compete for bandwidth only with other flows operating at the same bottleneck level; (3) BBR flows achieve higher performance and fairness than Cubic and Reno flows due to their ability to operate at the right bottleneck level; (4) the bottleneck structure of a network is continuously changing and its levels can be folded due to variations in the flows' round trip times; and (5) against conventional wisdom, low-hitter flows can have a large impact on the overall performance of a network.

Characterizing Transnational Internet Performance and the Great Bottleneck of China

Transnational Internet performance is an important indication of a country's level of infrastructure investment, globalization, and openness. We conduct a large-scale measurement study of transnational Internet performance in and out of 29 countries and regions, and find six countries that have surprisingly low performance. Five of them are African countries and the last is mainland China, a significant outlier with major discrepancies between downstream and upstream performance. We then conduct a comprehensive investigation of the unusual transnational Internet performance of mainland China, which we refer to as the "Great Bottleneck of China". Our results show that this bottleneck is widespread, affecting 79% of the receiver-sender pairs we measured. More than 70% of the pairs suffer from extremely slow speed (less than 1 Mbps) for more than 5 hours every day. In most tests the bottleneck appeared to be located deep inside China, suggesting poor network infrastructure to handle transnational traffic. The phenomenon has far-reaching implications for Chinese users' browsing habits as well as for the ability of foreign Internet services to reach Chinese customers.

SESSION: Privacy & Blockchain

Stability and Scalability of Blockchain Systems

The blockchain paradigm, introduced in the Bitcoin whitepaper [10], enables distributed consensus over a peer-to-peer network. Each peer constantly mines new information called blocks. Thus, blocks in the network are created over time. Each peer that creates (mines) a block also creates references to one or more previously created blocks. Peers also communicate blocks in order to synchronize their information sets; i.e., the sets of blocks and references the peers are aware of.

Measuring Membership Privacy on Aggregate Location Time-Series

While location data is extremely valuable for various applications, disclosing it prompts serious threats to individuals' privacy. To limit such concerns, organizations often provide analysts with aggregate time-series that indicate, e.g., how many people are in a location at a time interval, rather than raw individual traces. In this paper, we perform a measurement study to understand Membership Inference Attacks (MIAs) on aggregate location time-series, where an adversary tries to infer whether a specific user contributed to the aggregates. We find that the volume of contributed data, as well as the regularity and particularity of users' mobility patterns, play a crucial role in the attack's success. We experiment with a wide range of defenses based on generalization, hiding, and perturbation, and evaluate their ability to thwart the attack vis-à-vis the utility loss they introduce for various mobility analytics tasks. Our results show that some defenses fail across the board, while others work for specific tasks on aggregate location time-series. For instance, suppressing small counts can be used for ranking hotspots, data generalization for forecasting traffic, hotspot discovery, and map inference, while sampling is effective for location labeling and anomaly detection when the dataset is sparse. Differentially private techniques provide reasonable accuracy only in very specific settings, e.g., discovering hotspots and forecasting their traffic, and more so when using weaker privacy notions like crowd-blending privacy. Overall, our measurements show that there does not exist a unique generic defense that can preserve the utility of the analytics for arbitrary applications, and provide useful insights regarding the disclosure of sanitized aggregate location time-series.

Who Filters the Filters: Understanding the Growth, Usefulness and Efficiency of Crowdsourced Ad Blocking

Ad and tracking blocking extensions are popular tools for improving web performance, privacy and aesthetics. Content blocking extensions generally rely on filter lists to decide whether a web request is associated with tracking or advertising, and so should be blocked. Millions of web users rely on filter lists to protect their privacy and improve their browsing experience.

Despite their importance, the growth and health of filter lists are poorly understood. Filter lists are maintained by a small number of contributors who use undocumented heuristics and intuitions to determine what rules should be included. Lists quickly accumulate rules, and rules are rarely removed. As a result, users' browsing experiences are degraded as the number of stale, dead, or otherwise not useful rules increasingly dwarfs the number of useful rules, with no attenuating benefit. An accumulation of "dead weight" rules also makes it difficult to apply filter lists on resource-limited mobile devices.

This paper improves the understanding of crowdsourced filter lists by studying EasyList, the most popular filter list. We measure how EasyList affects web browsing by applying EasyList to a sample of 10,000 websites. We find that 90.16% of the resource blocking rules in EasyList provide no benefit to users in common browsing scenarios. We use our measurements of rule application rates to taxonomize the ways advertisers evade EasyList rules. Finally, we propose optimizations for popular ad-blocking tools that (i) allow EasyList to be applied on performance-constrained mobile devices and (ii) improve desktop performance by 62.5%, while preserving over 99% of blocking coverage. We expect these optimizations to be most useful for users in non-English locales, who rely on supplemental filter lists for effective blocking and protections.

Under the Concealing Surface: Detecting and Understanding Live Webcams in the Wild

Given the central role of webcams in monitoring physical surroundings, it behooves the research community to understand the characteristics of webcams' distribution and their privacy/security implications. In this paper, we conduct the first systematic study on live webcams from both aggregation sites and individual webcams (webpages/IP hosts). We propose a series of efficient, automated techniques for detecting and fingerprinting live webcams. In particular, we leverage distributed algorithms to detect aggregation sites and generate webcam fingerprints by utilizing the Graphical User Interface (GUI) of the built-in web server of a device. Overall, we observe 0.85 million webpages from aggregation sites hosting live webcams and 2.2 million live webcams in the public IPv4 space. Our study reveals that aggregation sites exhibit a typical long-tail distribution in hosting live streams (5.8% of sites contain 90.44% of live streaming content), and 85.4% of aggregation websites scrape webcams from others. Further, we observe that (1) 277,239 webcams from aggregation sites and IP hosts (11.7%) directly expose live streams to the public, (2) aggregation sites expose 187,897 geolocation names and 23,083 more precise longitude/latitude pairs of webcams, (3) the default usernames and passwords of 38,942 webcams are visible on aggregation sites in plaintext, and (4) 1,237 webcams are detected as having been compromised to conduct malicious behavior.
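
A minimal sketch of GUI-based webcam fingerprinting is shown below: the idea is to match markers from a device's built-in web server pages against a dictionary of known device families. The fingerprints and device names here are invented for illustration; the paper derives real ones automatically.

```python
# Invented fingerprints: each device family is identified by a set of
# markers expected to appear in the web GUI served by the device.
FINGERPRINTS = {
    "vendor-a-cam": ["<title>WebCam Viewer</title>", "liveview.cgi"],
    "vendor-b-cam": ["/cgi-bin/video.jpg", "Camera Web Server"],
}

def fingerprint(http_body: str):
    """Return the first family whose markers all appear in the page."""
    for family, markers in FINGERPRINTS.items():
        if all(m in http_body for m in markers):
            return family
    return None

page = "<html><title>WebCam Viewer</title> ... liveview.cgi ...</html>"
print(fingerprint(page))  # -> "vendor-a-cam"
```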

Your Noise, My Signal: Exploiting Switching Noise for Stealthy Data Exfiltration from Desktop Computers

Attacks based on power analysis have long existed and been studied, with some recent works focused on data exfiltration from victim systems without using conventional communications (e.g., WiFi). Nonetheless, prior works typically rely on intrusive direct power measurement, either by implanting meters in the power outlet or by tapping into the power cable, thus jeopardizing the stealthiness of attacks. In this paper, we propose NoDE (Noise for Data Exfiltration), a new system for stealthy data exfiltration from enterprise desktop computers. Specifically, NoDE achieves data exfiltration over a building's power network by exploiting high-frequency voltage ripples (i.e., switching noises) generated by power factor correction circuits built into today's computers. Located at a distance, and even in a different room, the receiver can non-intrusively measure the voltage of a power outlet to capture the high-frequency switching noises for online information decoding without supervised training/learning. To evaluate NoDE, we run experiments on seven different computers from top vendors using top-brand power supply units. Our results show that, for a single transmitter, NoDE achieves a rate of up to 28.48 bits/second at a distance of 90 feet (27.4 meters) without line of sight, demonstrating a practical stealthy threat. Based on the orthogonality of the switching noise frequencies of different computers, we also demonstrate simultaneous data exfiltration from four computers using only one receiver. Finally, we present a few possible defenses, such as installing noise filters, and discuss their limitations.
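
To make the receiver side concrete, here is a toy on-off-keyed decoder in the spirit of NoDE: each bit period either contains energy near the (computer-specific) switching frequency or not. The sample rate, carrier, bit length, and threshold are all made-up stand-ins; the actual system decodes real outlet voltage measurements.

```python
import numpy as np

FS = 1_000_000        # samples/second (hypothetical)
CARRIER = 65_000      # Hz, stand-in for a PFC switching frequency
BIT_SAMPLES = 10_000  # samples per bit period

def decode_ook(signal, threshold):
    """Decode one bit per window from energy near the carrier."""
    bits = []
    for i in range(0, len(signal) - BIT_SAMPLES + 1, BIT_SAMPLES):
        chunk = signal[i:i + BIT_SAMPLES]
        spectrum = np.abs(np.fft.rfft(chunk))
        freqs = np.fft.rfftfreq(len(chunk), d=1 / FS)
        band = (freqs > CARRIER - 1_000) & (freqs < CARRIER + 1_000)
        bits.append(1 if spectrum[band].max() > threshold else 0)
    return bits

# Synthesize a "1" bit (tone present) followed by a "0" bit (tone absent):
t = np.arange(2 * BIT_SAMPLES) / FS
tone = np.sin(2 * np.pi * CARRIER * t)
tone[BIT_SAMPLES:] = 0
noisy = tone + 0.1 * np.random.randn(len(t))
print(decode_ook(noisy, threshold=100.0))  # -> [1, 0]
```

The orthogonality observation in the abstract corresponds to different computers occupying disjoint frequency bands, so one receiver can run such a decoder once per band.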

Privacy-Utility Tradeoffs in Routing Cryptocurrency over Payment Channel Networks

Understanding (Mis)Behavior on the EOSIO Blockchain

EOSIO has become one of the most popular blockchain platforms since its mainnet launch in June 2018. In contrast to traditional PoW-based systems (e.g., Bitcoin and Ethereum), which are limited by low throughput, EOSIO is the first high-throughput Delegated Proof of Stake system to be widely adopted by many decentralized applications. Although EOSIO has millions of accounts and billions of transactions, little is known about its ecosystem, especially with respect to security and fraud. In this paper, we perform a large-scale measurement study of the EOSIO blockchain and its associated DApps. We gather a large-scale dataset of EOSIO and characterize activities including money transfers, account creation, and contract invocation. Using our insights, we then develop techniques to automatically detect bots and fraudulent activity. We discover thousands of bot accounts (over 30% of the accounts on the platform) and a number of real-world attacks (301 attack accounts). By the time of our study, 80 of the attack accounts we identified had been confirmed by DApp teams, with losses totaling 828,824 EOS tokens (roughly $2.6 million).

SESSION: Systems - Various Topics

Optimal Data Placement for Heterogeneous Cache, Memory, and Storage Systems

New memory technologies are blurring the previously distinctive performance characteristics of adjacent layers in the memory hierarchy. No longer are such layers orders of magnitude different in request latency or capacity. Beyond the traditional single-layer view of caching, we now must re-cast the problem as a data placement challenge: which data should be cached in faster memory if it could instead be served directly from slower memory?

We present CHOPT, an offline algorithm for data placement across multiple tiers of memory with asymmetric read and write costs. We show that CHOPT is optimal and can therefore serve as the upper bound of performance gain for any data placement algorithm. We also demonstrate an approximation of CHOPT that makes its execution time practical for long traces using spatial sampling of requests, incurring a small 0.2% average error on representative workloads at a sampling ratio of 1%. Our evaluation of CHOPT on more than 30 production traces and benchmarks shows that optimal data placement decisions could improve average request latency by 8.2%-44.8% when compared with the long-established gold standard: Belady and Mattson's offline, evict-farthest-in-the-future optimal algorithms. Our results identify substantial improvement opportunities for future online memory management research.
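
The spatial sampling idea can be sketched in a few lines: rather than sampling individual requests, keep every request to a small, hash-selected subset of addresses so that per-address reuse patterns survive intact. The hash function and ratio below are illustrative choices.

```python
import hashlib

SAMPLE_RATIO = 0.01   # the paper reports ~0.2% average error at 1%

def keep(address: str) -> bool:
    """Keep *all* requests to a pseudorandom 1% subset of addresses."""
    h = int(hashlib.md5(address.encode()).hexdigest(), 16)
    return (h % 10_000) < int(SAMPLE_RATIO * 10_000)

trace = ["0xA0", "0xB4", "0xA0", "0xC8", "0xA0"]
mini_trace = [a for a in trace if keep(a)]
# With only a handful of toy addresses, the mini-trace is usually empty;
# on a real multi-million-request trace it retains ~1% of the footprint.
print(mini_trace)
```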

Set the Configuration for the Heart of the OS: On the Practicality of Operating System Kernel Debloating

This paper presents a study on the practicality of operating system (OS) kernel debloating---reducing kernel code that is not needed by the target applications---in real-world systems. Despite their significant benefits regarding security (attack surface reduction) and performance (fast boot times and reduced memory footprints), state-of-the-art OS kernel debloating techniques are seldom adopted in practice, especially in production systems. We identify the limitations of existing kernel debloating techniques that hinder their practical adoption, including both accidental and essential limitations. To understand these limitations, we build an advanced debloating framework which enables us to conduct a number of experiments on different types of OS kernels (including Linux and the L4 microkernel) with a wide variety of applications (including HTTPD, Memcached, MySQL, NGINX, PHP and Redis). Our experimental results reveal the challenges and opportunities in making kernel debloating techniques practical for real-world systems. The main goal of this paper is to share these insights and our experiences to shed light on addressing the limitations of kernel debloating in future research and development efforts.

User-level Threading: Have Your Cake and Eat It Too

An important class of computer software, such as network servers, exhibits concurrency through many loosely coupled and potentially long-running communication sessions. For these applications, a long-standing open question is whether thread-per-session programming can deliver comparable performance to event-driven programming. This paper clearly demonstrates, for the first time, that it is possible to employ user-level threading for building thread-per-session applications without compromising functionality, efficiency, performance, or scalability. We present the design and implementation of a general-purpose, yet nimble, user-level M:N threading runtime that is built from scratch to accomplish these objectives. Its key components are efficient and effective load balancing and user-level I/O blocking. While no other runtime exists with comparable characteristics, an important fundamental finding of this work is that building this runtime does not require particularly intricate data structures or algorithms. The runtime is thus a straightforward existence proof for user-level threading without performance compromises and can serve as a reference platform for future research. It is evaluated in comparison to event-driven software, system-level threading, and several other user-level threading runtimes. An experimental evaluation is conducted using benchmark programs, as well as the popular Memcached application. We demonstrate that our user-level runtime outperforms other threading runtimes and enables thread-per-session programming at high levels of concurrency and hardware parallelism without sacrificing performance.
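
The essence of user-level threading, many logical threads multiplexed onto few kernel threads by a runtime scheduler rather than the OS, can be sketched with Python generators. This is only a single-kernel-thread (M:1) toy; the runtime in the paper adds multiple kernel threads, load balancing, and user-level I/O blocking.

```python
from collections import deque

def session(name, steps):
    """One long-running 'thread-per-session' task; `yield` is a switch point."""
    for i in range(steps):
        print(f"{name}: step {i}")
        yield  # cooperatively hand control back to the scheduler

ready = deque([session("A", 2), session("B", 3)])
while ready:
    thread = ready.popleft()
    try:
        next(thread)           # run the thread until its next yield
        ready.append(thread)   # still runnable: requeue at the back
    except StopIteration:
        pass                   # session finished
```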

DSM: A Case for Hardware-Assisted Merging of DRAM Rows with Same Content

The number of cores and the capacity of main memory in modern systems have been growing significantly. Specifically, memory scaling, although at a slower pace than computation scaling, has provided opportunities for very large DRAMs with terabyte (TB) capacities. Consequently, addressing the performance and energy-consumption bottlenecks of DRAM is more important than ever.

The DRAM refresh operation is one of the main contributors to memory overhead, especially for the large-capacity DRAMs used in modern servers and emerging large-scale data centers. This paper addresses the memory refresh problem by leveraging the fact that most cloud servers host virtualized systems that use similar kernels, libraries, etc. We propose and experimentally evaluate a novel approach that exploits this observation to address the DRAM refresh overhead in such systems.

More specifically, in this work, we present DSM, a lightweight hardware extension in the memory controller that detects pages with the same content in memory, refreshes only one copy, and redirects requests for the other copies to that page. Our detailed experimental analysis shows that the proposed DSM design can reduce 99th-percentile memory access latency by up to 2.01x, and it also reduces overall memory energy consumption by up to 8.5%.
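
A software analogue of the DSM idea is sketched below: hash each page's contents, treat the first page with a given hash as canonical, and refresh only canonical pages while redirecting accesses to duplicates. All page contents are toy values, and a real design would verify matches rather than trust hashes.

```python
pages = {0: b"kernel" * 100, 1: b"libc" * 150, 2: b"kernel" * 100}

canonical = {}   # content hash -> canonical page number
redirect = {}    # duplicate page -> canonical page
for page_no, content in sorted(pages.items()):
    key = hash(content)          # hardware would compare contents on match
    if key in canonical:
        redirect[page_no] = canonical[key]
    else:
        canonical[key] = page_no

refresh_set = set(canonical.values())  # only these rows need refreshing
print(redirect, refresh_set)           # {2: 0} {0, 1}
```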

Centaur: A Novel Architecture for Reliable, Low-Wear, High-Density 3D NAND Storage

Due to the high-density storage demand coming from applications in different domains, 3D NAND flash is becoming a promising candidate to replace 2D NAND flash as the dominant non-volatile memory. However, denser 3D NAND presents various performance and reliability issues, which can be addressed by the 3D NAND-specific full-sequence program (FSP) operation. FSP programs multiple pages simultaneously to mitigate the performance degradation caused by the long-latency 3D NAND baseline program operation. However, FSP-enabled 3D NAND-based SSDs suffer lifetime degradation due to the larger write granularities used by FSP. To address this lifetime issue, we propose and experimentally evaluate Centaur, a heterogeneous 2D/3D NAND SSD, as a solution. Centaur has three main components: a lifetime-aware inter-NAND request dispatcher, a lifetime-aware inter-NAND work stealer, and a data migration strategy from 2D NAND to 3D NAND. We used twelve SSD workloads to compare Centaur against a state-of-the-art 3D NAND-based SSD with the same capacity. Our experimental results indicate that SSD lifetime and performance improve by 3.7x and 1.11x, respectively, when using our 2D/3D heterogeneous SSD.
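
To illustrate what a lifetime-aware dispatcher might look like, the sketch below routes small writes to 2D NAND, where programming granularity is small, and large writes to FSP-enabled 3D NAND. The threshold and the rule itself are plausible stand-ins, not Centaur's actual policy.

```python
FSP_GRANULARITY = 4 * 16 * 1024  # hypothetical multi-page FSP write size

def dispatch(write_size: int) -> str:
    """Toy lifetime-aware routing between the 2D and 3D NAND pools."""
    if write_size < FSP_GRANULARITY:
        # A small write programmed as a full FSP unit on 3D NAND would
        # wear many pages for little payload, so send it to 2D NAND.
        return "2D"
    return "3D"

for size in (4 * 1024, 256 * 1024):
    print(size, "->", dispatch(size))
```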

SESSION: Theory - Various Topics

Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication

Large-scale machine learning and data mining applications require computer systems to perform massive matrix-vector and matrix-matrix multiplication operations that need to be parallelized across multiple nodes. The presence of stragglers -- nodes that unpredictably slow down or fail -- is a major bottleneck in such distributed computations. We propose a rateless fountain coding strategy to address this issue. Our idea is to create linear combinations of the m rows of the matrix and assign these encoded rows to different worker nodes. The original matrix-vector product can be decoded as soon as slightly more than m row-vector products are collectively finished by the nodes. We show that our approach achieves optimal latency and performs zero redundant computations asymptotically. Experiments on Amazon EC2 show that rateless coding gives as much as a 3x speed-up over uncoded schemes.
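
The scheme can be sketched end to end in a few lines. For simplicity, the sketch uses dense Gaussian combinations and least-squares decoding in place of a true fountain code with peeling decoding, but it shows the key property: the product is recovered from any slightly-more-than-m responses, so stragglers never block completion.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 8
A, x = rng.normal(size=(m, n)), rng.normal(size=n)

overhead = 1.2                # wait for ~20% more than m responses
k = int(overhead * m)
G = rng.normal(size=(k, m))   # each encoded row is a combination of rows of A
encoded_rows = G @ A          # distributed among workers ahead of time

# Suppose the first k worker responses arrive (stragglers never finish):
responses = encoded_rows @ x  # each worker returns one dot product

# Decode A @ x by solving the consistent system G @ y = responses:
decoded, *_ = np.linalg.lstsq(G, responses, rcond=None)
assert np.allclose(decoded, A @ x)
```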

Logarithmic Communication for Distributed Optimization in Multi-Agent Systems

Classically, the design of multi-agent systems is approached using techniques from distributed optimization such as dual descent and consensus algorithms. Such algorithms depend on convergence to global consensus before any individual agent can determine its local action. This leads to challenges with respect to communication overhead and robustness, and improving algorithms with respect to these measures has been a focus of the community for decades.

This paper presents a new approach for multi-agent system design based on ideas from the emerging field of local computation algorithms. The framework we develop, LOcal Convex Optimization (LOCO), is the first local computation algorithm for convex optimization problems and can be applied in a wide variety of settings. We demonstrate the generality of the framework via applications to Network Utility Maximization (NUM) and the distributed training of Support Vector Machines (SVMs), providing numerical results illustrating the improvement over classical distributed optimization approaches in each case.
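
For contrast with LOCO, here is a minimal sketch of the classical dual-decomposition baseline for NUM that the abstract alludes to: every source must see converged link prices before it can fix its rate, which is exactly the global-consensus communication pattern local computation tries to avoid. The network, utilities, and step size are toy choices.

```python
import numpy as np

# maximize sum(log(x_i)) subject to R @ x <= c
R = np.array([[1, 1, 0],      # routing matrix: links x sources
              [0, 1, 1]], dtype=float)
c = np.array([1.0, 2.0])      # link capacities
lam = np.ones(2)              # link prices (dual variables)
step = 0.05

for _ in range(2000):
    # Each source maximizes log(x_i) - x_i * (total price on its path):
    x = 1.0 / (R.T @ lam + 1e-12)
    # Each link raises its price when overloaded, lowers it when idle:
    lam = np.maximum(lam + step * (R @ x - c), 0.0)

print(x, R @ x)               # rates approach a feasible optimum
```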

Partial Recovery of Erdős-Rényi Graph Alignment via k-Core Alignment

We determine information-theoretic conditions under which it is possible to partially recover the alignment used to generate a pair of sparse, correlated Erdős-Rényi graphs. To prove our achievability result, we introduce the k-core alignment estimator, which searches for an alignment under which the intersection of the correlated graphs has minimum degree k. We prove a matching converse bound. As the number of vertices grows, recovery of the alignment for a fraction of the vertices tending to one is possible when the average degree of the intersection of the graph pair tends to infinity. It was previously known that exact alignment is possible when this average degree grows faster than the logarithm of the number of vertices.
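
The evaluation step inside the estimator is easy to sketch: given a candidate alignment pi, form the intersection graph and peel it to its k-core. (The estimator itself searches over alignments; that search is omitted here.) Graphs are toy edge lists over vertices 0..n-1.

```python
def intersection_edges(edges1, edges2, pi):
    """Edges present in G1 (mapped through pi) and in G2."""
    e1 = {tuple(sorted((pi[u], pi[v]))) for (u, v) in edges1}
    e2 = {tuple(sorted(e)) for e in edges2}
    return e1 & e2

def k_core(n, edges, k):
    """Iteratively remove vertices of degree < k; return the survivors."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if v in adj and len(adj[v]) < k:
                for w in adj.pop(v):
                    adj[w].discard(v)
                changed = True
    return set(adj)

edges1 = [(0, 1), (1, 2), (2, 0), (0, 3)]
edges2 = [(0, 1), (1, 2), (2, 0), (1, 3)]
pi = {0: 0, 1: 1, 2: 2, 3: 3}              # identity alignment
print(k_core(4, intersection_edges(edges1, edges2, pi), k=2))  # {0, 1, 2}
```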

Fiedler Vector Approximation via Interacting Random Walks

The Fiedler vector of a graph, namely the eigenvector corresponding to the second smallest eigenvalue of a graph Laplacian matrix, plays an important role in spectral graph theory with applications in problems such as graph bi-partitioning and envelope reduction. Algorithms designed to estimate this quantity usually rely on a priori knowledge of the entire graph, and employ techniques such as graph sparsification and power iterations, which have obvious shortcomings in cases where the graph is unknown or changing dynamically. In this paper, we develop a framework in which we construct a stochastic process based on a set of interacting random walks on a graph and show that a suitably scaled version of our stochastic process converges to the Fiedler vector for a sufficiently large number of walks.

We also provide numerical results to confirm our theoretical findings on different graphs, and show that our algorithm performs well over a wide range of parameters and numbers of random walks. Simulation results over time-varying dynamic graphs are also provided to show the efficacy of our random walk based technique in such settings. As an important contribution, we extend our results and show that our framework is applicable to approximating not just the Fiedler vector of graph Laplacians, but also the second eigenvector of any time-reversible Markov chain kernel via interacting random walks. To the best of our knowledge, our attempt to approximate the second eigenvector of any time-reversible Markov chain using random walks is the first of its kind, opening up possibilities for approximating higher-order eigenvectors using random walks on graphs.
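
For reference, the quantity being approximated can be computed directly when the whole graph is known; a dense-eigensolver baseline on a toy graph is shown below. The random-walk framework in the paper targets exactly this vector without any global view of the graph.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],   # adjacency matrix of a small toy graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian L = D - A
eigvals, eigvecs = np.linalg.eigh(L)    # eigh sorts eigenvalues ascending
fiedler_value, fiedler_vector = eigvals[1], eigvecs[:, 1]
print(fiedler_value, fiedler_vector)
# The sign pattern of the Fiedler vector suggests a bi-partition of the graph.
```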

Third-Party Data Providers Ruin Simple Mechanisms

Motivated by the growing prominence of third-party data providers in online marketplaces, this paper studies the impact of the presence of third-party data providers on mechanism design. When no data provider is present, it has been shown that simple mechanisms are "good enough" --- they can achieve a constant fraction of the revenue of optimal mechanisms. The results in this paper demonstrate that this is no longer true in the presence of a third-party data provider who can provide the bidder with a signal that is correlated with the item type. Specifically, even with a single seller, a single bidder, and a single item of uncertain type for sale, the strategies of pricing each item-type separately (the analog of item pricing for multi-item auctions) and bundling all item-types under a single price (the analog of grand bundling) can both simultaneously be a logarithmic factor worse than the optimal revenue. Further, in the presence of a data provider, item-type partitioning mechanisms---a more general class of mechanisms which divide item-types into disjoint groups and offer prices for each group---still cannot achieve within a log log factor of the optimal revenue. Thus, our results highlight that the presence of a data provider forces the use of more complicated mechanisms in order to achieve a constant fraction of the optimal revenue.

Dynamic Pricing and Matching for Two-Sided Queues

Motivated by diverse applications in the sharing economy and online marketplaces, we consider optimal pricing and matching control in a two-sided queueing system. We assume that heterogeneous customers and servers arrive to the system with price-dependent arrival rates. The compatibility between servers and customers is specified by a bipartite graph. Once a customer and a server are matched, they depart from the system instantaneously. The objective is to maximize the long-run average profit of the system while minimizing average waiting time. We first propose a static pricing and max-weight matching policy, which achieves an O(√η) optimality rate when all of the arrival rates are scaled by η. We further show that a dynamic pricing and modified max-weight matching policy achieves an improved O(η^{1/3}) optimality rate. In addition, we propose a constraint generation algorithm that solves a value function approximation of the MDP and demonstrate the strong numerical performance of this algorithm.
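
One matching step under a max-weight policy can be sketched as follows: among compatible pairs with non-empty queues, match the pair whose combined queue lengths are largest. The compatibility graph and queue contents are toy values, and pricing is omitted.

```python
compat = {("c1", "s1"), ("c1", "s2"), ("c2", "s2")}  # bipartite edges
queues = {"c1": 3, "c2": 5, "s1": 2, "s2": 4}        # waiting counts

def max_weight_match(queues, compat):
    feasible = [(c, s) for (c, s) in compat
                if queues[c] > 0 and queues[s] > 0]
    if not feasible:
        return None
    c, s = max(feasible, key=lambda e: queues[e[0]] + queues[e[1]])
    queues[c] -= 1; queues[s] -= 1   # the matched pair departs immediately
    return (c, s)

print(max_weight_match(queues, compat))  # -> ('c2', 's2')
```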

Unimodal Bandits with Continuous Arms: Order-optimal Regret without Smoothness

We consider stochastic bandit problems with a continuous set of arms and where the expected reward is a continuous and unimodal function of the arm. For these problems, we propose the Stochastic Polychotomy (SP) algorithms, and derive finite-time upper bounds on their regret and optimization error. We show that, for a class of reward functions, the SP algorithm achieves a regret and an optimization error with optimal scalings, i.e., O(√T) and O(1/√T) (up to a logarithmic factor), respectively.
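
To give a feel for the setting (though this is a generic routine, not the paper's SP algorithm), a stochastic trisection over a unimodal reward repeatedly samples two interior points, averages away the noise, and discards a third of the interval on the side of the worse probe.

```python
import random

def noisy_reward(x):
    """Unimodal reward with peak at x = 0.3, observed with Gaussian noise."""
    return -(x - 0.3) ** 2 + random.gauss(0, 0.01)

def stochastic_trisection(lo=0.0, hi=1.0, rounds=30, samples=200):
    for _ in range(rounds):
        a, b = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        mean_a = sum(noisy_reward(a) for _ in range(samples)) / samples
        mean_b = sum(noisy_reward(b) for _ in range(samples)) / samples
        if mean_a < mean_b:
            lo = a   # by unimodality, the maximizer is not left of a
        else:
            hi = b
    return (lo + hi) / 2

print(stochastic_trisection())  # ≈ 0.3
```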

Optimal Bidding Strategies for Online Ad Auctions with Overlapping Targeting Criteria

We analyze the problem of how to optimally bid for ad spaces in online ad auctions. For this, we consider the general case of multiple ad campaigns with overlapping targeting criteria. In our analysis, we first characterize the structure of an optimal bidding strategy. In particular, we show that an optimal bidding strategy decomposes the problem into disjoint sets of campaigns and targeting groups. In addition, we show that pure bidding strategies, which use only a single bid value for each campaign, are not optimal when the supply curves are not continuous. For this case, we derive a lower bound on the optimal cost of any bidding strategy, as well as mixed bidding strategies that either achieve the lower bound or get arbitrarily close to it.