SIGMETRICS/PERFORMANCE '22: Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems

Full Citation in the ACM Digital Library

SESSION: Session: Networking

Curvature-based Analysis of Network Connectivity in Private Backbone Infrastructures

The main premise of this work is that since large cloud providers can and do manipulate probe packets that traverse their privately owned and operated backbones, standard traceroute-based measurement techniques are no longer a reliable means for assessing network connectivity in large cloud provider infrastructures. In response to these developments, we present a new empirical approach for elucidating private connectivity in today's Internet. Our approach relies on using only "light-weight" (i.e., simple, easily-interpretable, and readily available) measurements, but requires applying a "heavy-weight" or advanced mathematical analysis. In particular, we describe a new method for assessing the characteristics of network path connectivity that is based on concepts from Riemannian geometry (i.e., Ricci curvature) and also relies on an array of carefully crafted visualizations (e.g., a novel manifold view of a network's delay space). We demonstrate our method by utilizing latency measurements from RIPE Atlas anchors and virtual machines running in data centers of three large cloud providers to (i) study different aspects of connectivity in their private backbones and (ii) show how our manifold-based view enables us to expose and visualize critical aspects of this connectivity over different geographic scales.

Automatic Inference of BGP Location Communities

We present a set of techniques to infer the semantics of BGP communities from public BGP data. Our techniques infer communities related to the entities or locations traversed by a route by correlating communities with AS paths. We also propose a set of heuristics to filter incorrect inferences introduced by misbehaving networks, sharing of BGP communities among sibling autonomous systems, and inconsistent BGP dumps. We apply our techniques to billions of routing records from public BGP collectors and make available a public database with more than 15 thousand location communities. Our comparison with manually-built databases shows our techniques provide high precision (up to 93%), better coverage (up to 81% recall), and dynamic updates, complementing operators' and researchers' abilities to reason about BGP community semantics.

Understanding I/O Direct Cache Access Performance for End Host Networking

Direct Cache Access (DCA) enables a network interface card (NIC) to load and store data directly on the processor cache, as conventional Direct Memory Access (DMA) is no longer suitable as the bridge between NIC and CPU in the era of 100 Gigabit Ethernet. As numerous I/O devices and cores compete for scarce cache resources, making the most of DCA for networking applications with varied objectives and constraints is a challenge, especially given the increasing complexity of modern cache hardware and I/O stacks. In this paper, we reverse engineer details of one commercial implementation of DCA, Intel's Data Direct I/O (DDIO), to explicate the importance of hardware-level investigation into DCA. Based on the learned knowledge of DCA and network I/O stacks, we (1) develop an analytical framework to predict the effectiveness of DCA (i.e., its hit rate) under certain hardware specifications, system configurations, and application properties; (2) measure penalties of the ineffective use of DCA (i.e., its miss penalty) to characterize its benefits; and (3) show that our reverse engineering, measurement, and model contribute to a deeper understanding of DCA, which in turn helps diagnose, optimize, and design end-host networking.

Traffic Refinery: Cost-Aware Data Representation for Machine Learning on Network Traffic

Network management often relies on machine learning to make predictions about performance and security from network traffic. Often, the representation of the traffic is as important as the choice of the model. The features that the model relies on, and the representation of those features, ultimately determine model accuracy, as well as where and whether the model can be deployed in practice. Thus, the design and evaluation of these models ultimately requires understanding not only model accuracy but also the systems costs associated with deploying the model in an operational network. Towards this goal, this paper develops a new framework and system that enables a joint evaluation of both the conventional notions of machine learning performance (model accuracy) and the systems-level costs of different representations of network traffic. We highlight these two dimensions for two practical network management tasks, video streaming quality inference and malware detection, to demonstrate the importance of exploring different representations to find the appropriate operating point. We demonstrate the benefit of exploring a range of representations of network traffic and present Traffic Refinery, a proof-of-concept implementation that both monitors network traffic at 10~Gbps and transforms traffic in real time to produce a variety of feature representations for machine learning. Traffic Refinery both highlights this design space and makes it possible to explore different representations for learning, balancing systems costs related to feature extraction and model training against model accuracy.

SESSION: Session: Streaming, Gaming, and the Decentralized Web

Xatu: Richer Neural Network Based Prediction for Video Streaming

The performance of Adaptive Bitrate (ABR) algorithms for video streaming depends on accurately predicting the download time of video chunks. Existing prediction approaches (i) assume chunk download times are dominated by network throughput; and (ii) apriori cluster sessions (e.g., based on ISP and CDN) and only learn from sessions in the same cluster. We make three contributions. First, through analysis of data from real-world video streaming sessions, we show (i) apriori clustering prevents learning from related clusters; and (ii) factors such as the Time to First Byte (TTFB) are key components of chunk download times but not easily incorporated into existing prediction approaches. Second, we propose Xatu, a new prediction approach that jointly learns a neural network sequence model with an interpretable automatic session clustering method. Xatu learns clustering rules across all sessions it deems relevant, and models sequences with multiple chunk-dependent features (e.g., TTFB) rather than just throughput. Third, evaluations using the above datasets and emulation experiments show that Xatu significantly improves prediction accuracies by 23.8% relative to CS2P (a state-of-the-art predictor). We show Xatu provides substantial performance benefits when integrated with multiple ABR algorithms including MPC (a well studied ABR algorithm), and FuguABR (a recent algorithm using stochastic control) relative to their default predictors (CS2P and a fully connected neural network respectively). Further, Xatu combined with MPC outperforms Pensieve, an ABR based on deep reinforcement learning.

End-to-end Characterization of Game Streaming Applications on Mobile Platforms

With the advent of 5G, hosting high-quality game streaming applications on mobile devices has become a reality. To our knowledge, no prior study systematically investigates the < QoS, Energy > tuple on the end-to-end game streaming pipeline across the cloud, network, and edge devices to understand the individual contributions of the different pipeline stages. In this paper, we present a comprehensive performance and power analysis of the entire game streaming pipeline through extensive measurements with a high-end workstation mimicking the cloud end, an open-source platform (Moonlight-GameStreaming) emulating the edge device/mobile platform, and two network settings (WiFi and 5G). Our study shows that the rendering stage and the encoding stage at the cloud end are the bottlenecks for 4K streaming. While 5G is certainly more suitable for supporting enhanced video quality with 4K streaming, it is more expensive in terms of power consumption compared to WiFi. Further, the network interface and the decoder units in mobile devices need more energy-efficient design to support high quality games at a lower cost. These observations should help in designing more cost-effective future cloud gaming platforms.

Dissecting Cloud Gaming Performance with DECAF

Cloud gaming platforms have witnessed tremendous growth over the past two years, with a number of large Internet companies including Amazon, Facebook, Google, Microsoft, and Nvidia publicly launching their own platforms. However, there is an absence of systematic performance measurement methodologies which can generally be applied. In this paper, we implement DECAF, a methodology to systematically analyze and dissect the performance of cloud gaming platforms across different game genres and game platforms. By applying DECAF, we measure the performance of Google Stadia, Amazon Luna, and Nvidia GeForceNow, and uncover a number of important findings such as processing delays in the cloud comprise majority of the total round trip delay (≈73.54%), the video streams delivered by these platforms are characterized by high variability of bitrate, frame rate, and resolution. Our work has important implications for cloud gaming platforms and opens the door for further research on measurement methodologies for cloud gaming.

Toxicity in the Decentralized Web and the Potential for Model Sharing

The "Decentralised Web" (DW) is an evolving concept, which encompasses technologies aimed at providing greater transparency and openness on the web. The DW relies on independent servers (aka instances) that mesh together in a peer-to-peer fashion to deliver a range of services (e.g. micro-blogs, image sharing, video streaming). However, toxic content moderation in this decentralised context is challenging. This is because there is no central entity that can define toxicity, nor a large central pool of data that can be used to build universal classifiers. It is therefore unsurprising that there have been several high-profile cases of the DW being misused to coordinate and disseminate harmful material. Using a dataset of 9.9M posts from 117K users on Pleroma (a popular DW microblogging service), we quantify the presence of toxic content. We find that toxic content is prevalent and spreads rapidly between instances. We show that automating per-instance content moderation is challenging due to the lack of sufficient training data available and the effort required in labelling. We therefore propose and evaluate ModPair, a model sharing system that effectively detects toxic content, gaining an average per-instance macro-F1 score 0.89.

SESSION: Session: Measurements and Security

Understanding the Practices of Global Censorship through Accurate, End-to-End Measurements

It is challenging to conduct a large scale Internet censorship measurement, as it involves triggering censors through artificial requests and identifying abnormalities from corresponding responses. Due to the lack of ground truth on the expected responses from legitimate services, previous studies typically require a heavy, unscalable manual inspection to identify false positives while still leaving false negatives undetected. In this paper, we propose Disguiser, a novel framework that enables end-to-end measurement to accurately detect the censorship activities and reveal the censor deployment without manual efforts. The core of Disguiser is a control server that replies with a static payload to provide the ground truth of server responses. As such, we send requests from various types of vantage points across the world to our control server, and the censorship activities can be recognized if a vantage point receives a different response. In particular, we design and conduct a cache test to pre-exclude the vantage points that could be interfered by cache proxies along the network path. Then we perform application traceroute towards our control server to explore censors' behaviors and their deployment. With Disguiser, we conduct 58 million measurements from vantage points in 177 countries. We observe 292 thousand censorship activities that block DNS, HTTP, or HTTPS requests inside 122 countries, achieving a 106 false positive rate and zero false negative rate. Furthermore, Disguiser reveals the censor deployment in 13 countries.

Monetizing Spare Bandwidth: The Case of Distributed VPNs

MalRadar: Demystifying Android Malware in the New Era

A reliable and up-to-date malware dataset is critical to evaluate the effectiveness of malware detection approaches. Although there are several widely-used malware benchmarks in our community (e.g., MalGenome, Drebin, Piggybacking and AMD, etc.), these benchmarks face several limitations including out-of-date, size, coverage, and reliability issues, etc. In this paper, we first make effort to create MalRadar, a growing and up-to-date Android malware dataset using the most reliable way, i.e., by collecting malware based on the analysis reports of security experts. We have crawled all the mobile security related reports released by ten leading security companies, and used an automated approach to extract and label the useful ones describing new Android malware and containing Indicators of Compromise (IoC) information. We have successfully compiled MalRadar, a dataset that contains 4,534 unique Android malware samples (including both apks and metadata) released from 2014 to April 2021 by the time of this paper, all of which were manually verified by security experts with detailed behavior analysis. Then we characterize the MalRadar dataset from malware distribution channels, app installation methods, malware activation, malicious behaviors and anti-analysis techniques. We further investigate the malware evolution over the last decade. At last, we measure the effectiveness of commercial anti-virus engines and malware detection techniques on detecting malware in MalRadar. Our dataset can be served as the representative Android malware benchmark in the new era, and our observations can positively contribute to the community and boost a series of studies on mobile security.

Trade or Trick?: Detecting and Characterizing Scam Tokens on Uniswap Decentralized Exchange

The prosperity of the cryptocurrency ecosystem drives the need for digital asset trading platforms. Uniswap, as the most prominent cryptocurrency decentralized exchange (DEX), is continuing to attract scammers, with fraudulent cryptocurrencies flooding in the ecosystem. In this paper, we take the first step to detect and characterize scam tokens on Uniswap. We first investigate the landscape of cryptocurrency trading on Uniswap from different perspectives based on its transactions. Then, we propose an accurate approach for flagging scam tokens on Uniswap. We have identified over 10K scam tokens listed on Uniswap, which suggests that roughly 50% of the tokens listed on Uniswap are scam tokens. All the scam tokens are created specialized for the "rug pull" scams, and some scam tokens have embedded tricks and backdoors in the smart contracts. We further observe that thousands of collusion addresses help carry out the scams. The scammers have gained a profit of at least $16 million from 39,762 potential victims. Our observations in this paper suggest the urgency to identify and stop scams in the decentralized finance ecosystem.

SESSION: Session: Wireless and Mobile Networks

A Detailed Look at MIMO Performance in 60 GHz WLANs

One of the key enhancements in the upcoming 802.11ay standard for 60 GHz WLANs is the support for simultaneous transmissions of up to 8 data streams via SU- and MU-MIMO, which has the potential to enable data rates up to 100 Gbps. However, in spite of the key role MIMO is expected to play in 802.11ay, experimental evaluation of MIMO performance in 60 GHz WLANs has been limited to date, primarily due to lack of hardware supporting MIMO transmissions at millimeter wave frequencies. In this work, we fill this gap by conducting the first large-scale experimental evaluation of SU- and MU-MIMO performance in 60 GHz WLANs. Unlike previous studies, our study involves multiple environments with very different multipath characteristics. We analyze the performance in each environment, identify the factors that affect it, and compare it against the performance of SISO. Further, we seek to identify factors that can guide beam and user selection to limit the (often prohibitive in practice) overhead of exhaustive search. Finally, we propose two heuristics that perform both user and beam selection with low overhead, and show that they perform close to an Oracle solution and outperform previously proposed approaches in both static and mobile scenarios, regardless of the environment and number of users.

NG-Scope: Fine-Grained Telemetry for NextG Cellular Networks

Accurate and highly-granular channel capacity telemetry of the cellular last hop is crucial for the effective operation of transport layer protocols and cutting edge applications, such as video on demand and videotelephony. This paper presents the design, implementation, and experimental performance evaluation of NG-Scope, the first such telemetry tool able to fuse physical-layer channel occupancy readings from the cellular control channel with higher-layer packet arrival statistics and make accurate capacity estimates. NG-Scope handles the latest cellular innovations, such as when multiple base stations aggregate their signals together to serve mobile users. End-to-end experiments in a commercial cellular network demonstrate that wireless capacity varies significantly with channel quality, mobility, competing traffic within each cell, and the number of aggregated cells. Our experiments demonstrate significantly improved cell load estimation accuracy, missing the detection of less than 1% of data capacity overall, a reduction of 82% compared to OWL [2], the state-of-the-art in cellular monitoring. Further experiments show that MobileInsight-based [4] CLAW [6] has a root-mean-squared capacity error of 30.5 Mbit/s, which is 3.3× larger than NG-Scope (9.2 Mbit/s).

Argus: Predictable Millimeter-Wave Picocells with Vision and Learning Augmentation

We propose Argus, a system to enable millimeter-wave (mmWave) deployers to quickly complete site-surveys without sacrificing the accuracy and effectiveness of thorough network deployment surveys. Argus first models the mmWave reflection profile of an environment, considering dominant reflectors, and then uses this model to find locations that maximize the usability of the reflectors. The key component in Argus is an effective deep learning model that can map the visual data to the mmWave signal reflections of an environment and can accurately predict mmWave signal profile at any unobserved locations. It allows Argus to find the best picocell locations to provide maximum coverage and also lets users self-localize accurately anywhere in the environment. Furthermore, Argus allows mmWave picocells to predict device's orientation accurately and enables object tagging and retrieval for VR/AR applications.

We implement and validate Argus on two different buildings consisting of multiple different indoor environments. However, the generalization capability of Argus can easily update the model for unseen environments; so, Argus can be deployed to any indoor environment with little or no model fine-tuning.

The First 5G-LTE Comparative Study in Extreme Mobility

5G claims to support mobility up to 500 km/h according to the 3GPP standard. However, its field performance under high-speed scenes remains in mystery. In this paper, we conduct the first large-scale measurement campaign on a high-speed railway route operating at the maximum speed of 350 km/h, with full coverage of LTE and 5G (NSA and SA) along the track. Our study consumed 1788.8 GiB of cellular data in six months, covering the three major carriers in China and the recent standardized QUIC protocol. Based on our dataset, we reveal the key characteristics of 5G and LTE in extreme mobility in terms of throughput, RTT, loss rate, signal quality, and physical resource utilization. We further develop a taxonomy of handovers in both LTE and 5G and carry out the link-layer latency breakdown analysis. Our study pinpoints the deficiencies in the user equipment, radio access network, and core network which hinder seamless connectivity and better utilization of 5G's high bandwidth. Our findings highlight the directions of the next step in the 5G evolution.

SESSION: Session: Memory and GPUs

Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures

Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV has been characterized as one of the most significant and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due its algorithmic nature, the compressed matrix format used, and the sparsity patterns of the input matrices given. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make two key contributions. First, we design efficient SpMV algorithms to accelerate the SpMV kernel in current and future PIM systems, while covering a wide variety of sparse matrices with diverse sparsity patterns. Second, we provide the first comprehensive analysis of SpMV on a real PIM architecture. Specifically, we conduct our rigorous experimental analysis of SpMV kernels in the UPMEM PIM system, the first publicly-available real-world PIM architecture. Our extensive evaluation provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems. For more information about our thorough characterization on the SpMV PIM execution, results, insights and the open-source SparseP software package [21], we refer the reader to the full version of the paper [3, 4]. The SparseP software package is publicly and freely available at

Memory Space Recycling

Many program codes from different application domains process very large amounts of data, making their data locality/cache memory behavior critical for high performance. Prior work has addressed the data locality optimization problem in the context of both single-core and multi-core systems. Another dimension of optimization, which can be as equally important/beneficial as improving data access pattern is to reduce the data volume (total number of addresses) accessed by the program code. In this work, we explore the idea of rewriting an application program code to reduce its memory space footprint. The main idea behind this approach is to reuse/recycle, for a given data element, a memory location that has originally been assigned to another data element, provided that the lifetimes of these two data elements do not overlap with each other. We present a detailed experimental evaluation of our proposed memory space recycling strategy. The experimental results show that our proposed approach brings, respectively, 33.2%, 48.6%, 46.5%, 31.8%, and 27.9% average improvements in these metrics, in the case of single-threaded applications. With the multi-threaded versions of the same applications, the achieved improvements are 39.5%, 55.5%, 53.4%, 26.2%, and 22.2%, in the same order.

Data Convection: A GPU-Driven Case Study for Thermal-Aware Data Placement in 3D DRAMs

Stacked DRAMs have been studied and productized in the last decade. The large available bandwidth they offer makes them an attractive choice, particularly, in high-performance computing (HPC) environments. Consequently, many prior research efforts have studied and evaluated 3D stacked DRAM-based designs. Despite offering high bandwidth, stacked DRAMs are severely constrained by the overall memory capacity offered. In this paper, we study and evaluate integrating stacked DRAM on top of a GPU in a 3D manner which in tandem with the 2.5D stacked DRAM boosts the capacity and the bandwidth without increasing the package size. It also helps meet the capacity needs of emergent workloads like deep learning. However, the bandwidth given by these 3D stacked DRAMs is significantly constrained by the GPU's heat production. Our investigations on a cycle-level simulator show that the 3D stacked DRAM portions closest to the GPU have shorter retention times than the layers further away. Depending on the retention period, certain regions of 3D stacked DRAM are refreshed more frequently than others, leading to thermally-induced NUMA paradigms. Our proposed approach attempts to place the most frequently requested data in a thermally conscious manner, taking into consideration both bank-level parallelism and channel-level parallelism. The results collected with a cycle-level GPU simulator indicate that the three implementations of our proposed approach lead to 1.8%, 11.7%, and 14.4% performance improvements, over a baseline that already includes 3D+2.5D stacked DRAMs.

NURA: A Framework for Supporting Non-Uniform Resource Accesses in GPUs

Multi-application execution in Graphics Processing Units (GPUs), a promising way to utilize GPU resources, is still challenging. Some pieces of prior work (e.g. spatial multitasking) have limited opportunity to improve resource utilization, while others, e.g. simultaneous multi-kernel, provide fine-grained resource sharing at the price of unfair execution. This paper proposes a new multi-application paradigm for GPUs, called NURA, that provides high potential to improve resource utilization and ensure fairness and Quality-of-Service(QoS). The key idea is that each streaming multiprocessor (SM) executes the Cooperative Thread Arrays (CTAs) that belong to only one application (similar to spatial multi-tasking) and shares its unused resources with the SMs running other applications demanding more resources. NURA handles resource sharing process mainly using a software approach to provide simplicity, low hardware overhead, and flexibility.We also perform some hardware modifications as an architectural support for our software-based proposal. Our conservative analysis reveals that the hardware area overhead of our proposal is less than 1.07% with respect to the whole GPU die. Our experimental results over various mixes of GPU workloads show that NURA improves throughput by 26% compared to the state-of-the-art spatial multi-tasking, on average, while meeting QoS targets. In terms of fairness, NURA has almost similar results to spatial multitasking, while it outperforms simultaneous multi-kernel by 76%, on average.

SESSION: Session: Pricing and Speed

YourAdvalue: Measuring Advertising Price Dynamics without Bankrupting User Privacy

The Real Time Bidding (RTB) protocol is by now more than a decade old. During this time, a handful of measurement papers have looked at bidding strategies, personal information flow, and cost of display advertising through RTB. In this paper, we present YourAdvalue, a privacy-preserving tool for displaying to end-users in a simple and intuitive manner their advertising value as seen through RTB. Using YourAdvalue, we measure desktopRTB prices in the wild, and compare them with desktop and mobileRTB prices reported by past work. We present how it estimates ad prices that are encrypted, and how it preserves user privacy while reporting results back to a data-server for analysis. We deployed our system, disseminated its browser extension, and collected data from 200 users, including 12000 ad impressions over 11 months.

By analyzing this dataset, we show that desktop RTB prices have grown 4.6x over desktop RTB prices measured in 2013, and 3.8x over mobile RTB prices measured in 2015. We also study how user demographics associate with the intensity of RTB ecosystem tracking, leading to higher ad prices. We find that exchanging data between advertisers and/or data brokers through cookie- syncronization increases the median value of displayed ads by 19%. We also find that female and younger users are more targeted, suffering more tracking (via cookie synchronization) than male or elder users. As a result of this targeting in our dataset, the advertising value (i) of women is 2.4x higher than that of men, (ii) of 25-34 year-olds is 2.5x higher than that of 35-44 year-olds, (iii) is most expensive on weekends and early mornings.

Power of Bonus in Pricing for Crowdsourcing

We consider a simple form of pricing for a crowdsourcing system, where pricing policy is published a priori, and workers then decide their task acceptance. Such a pricing form is widely adopted in practice for its simplicity, e.g., Amazon Mechanical Turk, although additional sophistication to pricing rule can enhance budget efficiency. With the goal of designing efficient and simple pricing rules, we study the impact of the following two design features in pricing policies: (i) personalization tailoring policy worker-by-worker and (ii) bonus payment to qualified task completion. In the Bayesian setting, where the only prior distribution of workers' profiles is available, we first study the Price of Agnosticism (PoA) that quantifies the utility gap between personalized and common pricing policies. We show that PoA is bounded within a constant factor under some mild conditions, and the impact of bonus is essential in common pricing. These analytic results imply that complex personalized pricing can be replaced by simple common pricing once it is equipped with a proper bonus payment. To provide insights on efficient common pricing, we then study the efficient mechanisms of bonus payment for several profile distribution regimes which may exist in practice. We provide primitive experiments on Amazon Mechanical Turk, which support our analytical findings[5].

Fusing Speed Index during Web Page Loading

With conventional web page load metrics (e.g., Page Load Time) being blamed for deviating from actual user experiences, in recent years a more sensible and complex metric called Speed Index (SI) has been widely adopted to measure the user's quality of experience (QoE). In brief, SI indicates how quickly a page is filled up with above-the-fold visible elements (or crucial elements for short). To date, however, SI has been used as a metric for performance evaluation, rather than as an explicit heuristic to improve page loading. To demystify this, we examine the entire loading process of various pages and ascribe such incapability to three-fold fundamental uncertainties in terms of network, browser execution, and viewport size. In this paper, we design SipLoader, an SI-oriented page load scheduler through a novel cumulative reactive scheduling framework. It does not attempt to deal with uncertainties in advance or in one shot, but schedules page loading by "repairing" the anticipated (nearly) SI-optimal scheduling when uncertainties actually occur. This is achieved with a suite of efficient designs that fully exploit the cumulative nature of SI calculation. Evaluations show that SipLoader improves the median SI by 41%, and provides 1.43 times to 1.99 times more benefits than state-of-the-art solutions.

Data-Driven Network Path Simulation with iBox

While network simulation is widely used for evaluating network protocols and applications, ensuring realism remains a key challenge. There has been much work on simulating network mechanisms faithfully (e.g., links, buffers, etc.), but less attention on the critical task of configuring the simulator to reflect reality.

We present iBox ("Internet in a Box"), which enables data-driven network path simulation, using input/output packet traces gathered at the sender/receiver in the target network to create a model of the end-to-end behaviour of a network path. Our work builds on recent work in this direction [2, 6] and makes three contributions: (1) estimation of a lightweight non-reactive cross-traffic model, (2) estimation of a more powerful reactive cross-traffic model based on Bayesian optimization, and (3) evaluation of iBox in the context of congestion control variants in an Internet research testbed and also controlled experiments with known ground truth.

This paper represents an abridged version of [3].

SESSION: Session: System Performance

CachePerf: A Unified Cache Miss Classifier via Hybrid Hardware Sampling

The cache plays a key role in determining the performance of applications, no matter for sequential or concurrent programs on homogeneous and heterogeneous architecture. Fixing cache misses requires to understand the origin and the type of cache misses. However, this remains to be an unresolved issue even after decades of research. This paper proposes a unified profiling tool--CachePerf--that could correctly identify different types of cache misses, differentiate allocator-induced issues from those of applications, and exclude minor issues without much performance impact. The core idea behind CachePerf is a hybrid sampling scheme: it employs the PMU-based coarse-grained sampling to select very few susceptible instructions (with frequent cache misses) and then employs the breakpoint-based fine-grained sampling to collect the memory access pattern of these instructions. Based on our evaluation, CachePerf only imposes 14% performance overhead and 19% memory overhead (for applications with large footprints), while identifying the types of cache misses correctly. CachePerf detected 9 previous-unknown bugs. Fixing the reported bugs achieves from 3% to 3788% performance speedup. CachePerf will be an indispensable complementary to existing profilers due to its effectiveness and low overhead.

A Comprehensive Empirical Study of Query Performance Across GPU DBMSes

In recent years, GPU database management systems (DBMSes) have rapidly become popular largely due to their remarkable acceleration capability obtained through extreme parallelism in query evaluations. However, there has been relatively little study on the characteristics of these GPU DBMSes for a better understanding of their query performance in various contexts. To fill this gap, we have conducted a rigorous empirical study to identify such factors and to propose a structural causal model, including key factors and their relationships, to explicate the variances of the query execution times on the GPU DBMSes. To test the model, we have designed and run comprehensive experiments and conducted in-depth statistical analyses on the obtained data. As a result, our model achieves about 77% amount of variance explained on the query time and indicates that reducing kernel time and data transfer time are the key factors to improve the query time. Also, our results show that the studied systems still need to resolve several concerns such as bounded processing within GPU memory, lack of rich query evaluation operators, limited scalability, and GPU under-utilization.

A Formalism of DNN Accelerator Flexibility

One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search

Convolutional neural networks (CNNs) are used in numerous realworld applications such as vision-based autonomous driving and video content analysis.

SESSION: Session: Systems

WISEFUSE: Workload Characterization and DAG Transformation for Serverless Workflows

We characterize production workloads of serverless DAGs at a major cloud provider. Our analysis highlights two major factors that limit performance: (a) lack of efficient communication methods between the serverless functions in the DAG, and (b) stragglers when a DAG stage invokes a set of parallel functions that must complete before starting the next DAG stage. To address these limitations, we propose WISEFUSE, an automated approach to generate an optimized execution plan for serverless DAGs for a user-specified latency objective or $ budget. We introduce three optimizations: (1) Fusion combines in-series functions together in a single VM to reduce the communication overhead between cascaded functions. (2) Bundling executes a group of parallel invocations of a function in one VM to improve resource sharing among the parallel workers to reduce skew. (3) Resource Allocation assigns the right VM size to each function or function bundle in the DAG to reduce the E2E latency and cost. We implement WISEFUSE to evaluate it experimentally using three popular serverless applications with different DAG structures, memory footprints, and intermediate data sizes. Compared to competing approaches and other alternatives, WISEFUSE shows significant improvements in E2E latency and cost. Specifically, for a machine learning pipeline, WISEFUSE achieves P95 latency that is 67% lower than Photons, 39% lower than Faastlane, and 90% lower than SONIC without increasing the $ cost.

An Enterprise-Grade Open-Source Data Reduction Architecture for All-Flash Storage Systems

Data reduction technologies have proven their effectiveness to decrease the ever-growing demands on storage system capacities, but also introduce new complexity in the system I/O stack that can easily invalidate well-known best practices. In this paper, we conduct an extensive set of experiments on an enterprise all-flash storage (AFS) system equipped with an open-source data reduction module, i.e., RedHat VDO, and reveal novel observations on the performance gap between the state-of-the-art and the optimal AFS stack with integrated data reduction. We then offer cross-layer optimizations to enhance the performance of AFS, which range from deriving new optimal hardware RAID configurations up to modifications of the enterprise storage stack tailored to the major bottlenecks observed. By implementing all proposed optimizations in an enterprise AFS, we show up to 12.5x speedup over the baseline AFS with integrated data reduction, and up to 57x performance/cost improvement over an optimized AFS (with no data reduction) for 100% write, high-reduction workload scenarios.

Dremel: Adaptive Configuration Tuning of RocksDB KV-Store

LSM-tree-based key-value stores like RocksDB are widely used to support many applications. However, configuring a RocksDB instance is challenging for the following reasons: 1) RocksDB has a massive parameter space to configure; 2) there are inherent trade-offs and dependencies between parameters; 3) optimal configurations are dependent on workload and hardware; and 4) evaluating configurations is time-consuming. Prior works struggle with handling the curse of dimensionality, capturing relationships between parameters, adapting configurations to workload and hardware, and evaluating quickly.

We present a system, Dremel, to adaptively and quickly configure RocksDB with strategies based on the Multi-Armed Bandit model. To handle the large parameter space, we propose using fused features, which encode domain-specific knowledge, to work as a compact and powerful representation for configurations. To adapt to the workload and hardware, we build an online bandit model to identify the best configuration. To evaluate quickly, we enable multi-fidelity evaluation and upper-confidence-bound sampling to speed up configuration search. Dremel not only achieves up to ×2.61 higher IOPS and 57% less latency than default configurations but also achieves up to 63% improvement over prior works on 18 different settings with the same or smaller time budget.

This paper is an abridged version.

Tuxedo: Maximizing Smart Contract Computation in PoW Blockchains

Proof-of-Work~(PoW) based blockchains typically allocate only a tiny fraction (e.g., less than 1% for Ethereum) of the average interarrival time~$\mathbbI $ between blocks for validating smart contracts present in transactions. In such systems, block validation and PoW mining are typically performed sequentially, the former by CPUs and the latter by ASICs. A trivial increase in validation time~$(τ)$ introduces the popularly known Verifier's Dilemma, and as we demonstrate, causes more forking and hurts fairness. Large τ also reduces the tolerance for safety against a Byzantine adversary. Solutions that offload validation to a set of non-chain nodes (a.k.a. off-chain approaches) suffer from trust and performance issues that are non-trivial to resolve.

In this paper, we present Tuxedo, the first on-chain protocol to theoretically scale τ/\mathbbI \approx 1$ in PoW blockchains. The key innovation in Tuxedo is to perform CPU-based block processing in \em parallel to ASIC mining. We achieve this by allowing miners to delay validation of transactions in a block by up to ζ blocks, where ζ is a system parameter. We perform security analysis of Tuxedo considering all possible adversarial strategies in a synchronous network with maximum end-to-end delay Δ and demonstrate that Tuxedo achieves security equivalent to known results for longest chain PoW Nakamoto consensus. Our prototype implementation of Tuxedo atop Ethereum demonstrates that it can scale τ without suffering the harmful effects of naïve scaling up of τ/\mathbbI $ in existing blockchains.

SESSION: Session: Learning meets Systems

Metamorphic Testing of Deep Learning Compilers

The prosperous trend of deploying deep neural network (DNN) models to diverse hardware platforms has boosted the development of deep learning (DL) compilers. DL compilers take high-level DNN model specifications as input and generate optimized DNN executables for diverse hardware architectures like CPUs, GPUs, and hardware accelerators. We introduce MT-DLComp, a metamorphic testing framework specifically designed for DL compilers to uncover erroneous compilations. Our approach leverages deliberately-designed metamorphic relations (MRs) to launch semantics-preserving mutations toward DNN models to generate their variants. This way, DL compilers can be automatically tested for compilation correctness by comparing the execution outputs of the compiled DNN models and their variants without manual intervention. We detected over 435 inputs that can result in erroneous compilations in four popular DL compilers, all of which are industry-strength products maintained by Amazon, Facebook, Microsoft, and Google. We uncovered four bugs in these compilers by debugging them using the error-triggering inputs.

SimNet: Accurate and High-Performance Computer Architecture Simulation using Deep Learning

While cycle-accurate simulators are essential tools for architecture research, design, and development, their practicality is limited by an extremely long time-to-solution for realistic applications under investigation. This work describes a concerted effort, where machine learning (ML) is used to accelerate microarchitecture simulation. First, an ML-based instruction latency prediction framework that accounts for both static instruction properties and dynamic processor states is constructed. Then, a GPU-accelerated parallel simulator is implemented based on the proposed instruction latency predictor, and its simulation accuracy and throughput are validated and evaluated against a state-of-the-art simulator. Leveraging modern GPUs, the ML-based simulator outperforms traditional CPU-based simulators significantly.

Prediction of the Resource Consumption of Distributed Deep Learning Systems

Predicting resource consumption for the distributed training of deep learning models is of paramount importance, as it can inform a priori users of how long their training would take and enable users to manage the cost of training. Yet, no such prediction is available for users because the resource consumption itself varies significantly according to "settings" such as GPU types and also by "workloads" like deep learning models. Previous studies have attempted to derive or model such a prediction, but they fall short of accommodating the various combinations of settings and workloads together. This study presents Driple, which designs graph neural networks to predict the resource consumption of diverse workloads. Driple also designs transfer learning to extend the graph neural networks to adapt to differences in settings. The evaluation results show that Driple effectively predicts a wide range of workloads and settings. In addition, Driple can efficiently reduce the time required to tailor the prediction for different settings by up to 7.3×.

Hierarchical Learning Algorithms for Multi-scale Expert Problems

In this work,1 we study the multi-scale expert problem, where the rewards of different experts vary in different reward ranges. The performance of existing algorithms for the multi-scale expert problem degrades linearly proportional to the maximum reward range of any expert or the best expert and does not capture the non-uniform heterogeneity in the reward ranges among experts. In this work, we propose learning algorithms that construct a hierarchical tree structure based on the heterogeneity of the reward range of experts and then determine differentiated learning rates based on the reward upper bounds and cumulative empirical feedback over time. We then characterize the regret of the proposed algorithms as a function of non-uniform reward ranges and show that their regrets outperform prior algorithms when the rewards of experts exhibit non-uniform heterogeneity in different ranges. Last, our numerical experiments verify our algorithms' efficiency compared to previous algorithms.

SESSION: Session: Optimization I

Tensor Completion with Nearly Linear Samples Given Weak Side Information

Tensor completion exhibits an interesting computational-statistical gap in terms of the number of samples needed to perform tensor estimation. While there are only Θ(tn) degrees of freedom in a t-order tensor with nt entries, the best known polynomial time algorithm requires O(nt/2) samples in order to guarantee consistent estimation. In this paper, we show that weak side information is sufficient to reduce the sample complexity to O(n). The side information consists of a weight vector for each of the modes which is not orthogonal to any of the latent factors along that mode; this is significantly weaker than assuming noisy knowledge of the subspaces. We provide an algorithm that utilizes this side information to produce a consistent estimator with O(n1+κ) samples for any small constant κ > 0. We also provide experiments on both synthetic and real-world datasets that validate our theoretical insights.

Dynamic Regret Minimization for Control of Non-stationary Linear Dynamical Systems

We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon T with fixed and known cost matrices Q,R, but unknown and non-stationary dynamics At, Bt. The sequence of dynamics matrices can be arbitrary, but with a total variation, VT, assumed to be o(T) and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal controllers is available for all t, we present an algorithm that achieves the optimal dynamic regret of Õ(VT2/5 T3/5). With piecewise constant dynamics, our algorithm achieves the optimal regret of Õ(√ST) where S is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual Multi-armed Bandit problems. We also argue that non-adaptive forgetting (e.g., restarting or using sliding window learning with a static window size) may not be regret optimal for the LQR problem, even when the window size is optimally tuned with the knowledge of VT. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and therefore we believe our results should find wider application.

Differentially Private Reinforcement Learning with Linear Function Approximation

Motivated by the wide adoption of reinforcement learning (RL) in real-world personalized services, where users' sensitive and private information needs to be protected, we study regret minimization in finite-horizon Markov decision processes (MDPs) under the constraints of differential privacy (DP). Compared to existing private RL algorithms that work only on tabular finite-state, finite-actions MDPs, we take the first step towards privacy-preserving learning in MDPs with large state and action spaces. Specifically, we consider MDPs with linear function approximation (in particular linear mixture MDPs) under the notion of joint differential privacy (JDP), where the RL agent is responsible for protecting users' sensitive data. We design two private RL algorithms that are based on value iteration and policy optimization, respectively, and show that they enjoy sub-linear regret performance while guaranteeing privacy protection. Moreover, the regret bounds are independent of the number of states, and scale at most logarithmically with the number of actions, making the algorithms suitable for privacy protection in nowadays large-scale personalized services. Our results are achieved via a general procedure for learning in linear mixture MDPs under changing regularizers, which not only generalizes previous results for non-private learning, but also serves as a building block for general private reinforcement learning.

On Multivariate Singular Spectrum Analysis and Its Variants

We introduce and analyze a simpler, practically useful variant of multivariate singular spectrum analysis (mSSA), a known time series method to impute (or de-noise) and forecast a multivariate time series. Towards this, we introduce a spatio-temporal factor model to analyze mSSA. This model includes the usual components used to model dynamics in time series analysis, such as trends (low order polynomials), seasonality (finite sum of harmonics), and linear time-invariant systems. We establish that given N time series and T observations per time series, the in-sample prediction error for both imputation and forecasting under mSSA scales as 1/√ min(N, T)T. This is an improvement over: (i) the 1/√T error scaling of SSA, which is the restriction of mSSA to univariate time series; (ii) the 1/min(N, T) error scaling for Temporal Regularized Matrix Factorized (TRMF), a matrix factorization based method for time series prediction. That is, mSSA exploits both the 'temporal' and 'spatial' structure in a multivariate time series. Our experimental results using various benchmark datasets confirm the characteristics of the spatio-temporal factor model and our theoretical findings---our variant of mSSA empirically performs as well or better compared to neural network based time series methods, LSTM and DeepAR.

SESSION: Session: Online Optimization

Online Optimization with Feedback Delay and Nonlinear Switching Cost

We study a variant of online optimization in which the learner receives k-round delayed feedback about hitting cost and there is a multi-step nonlinear switching cost, i.e., costs depend on multiple previous actions in a nonlinear manner. Our main result shows that a novel Iterative Regularized Online Balanced Descent (iROBD) algorithm has a constant, dimension-free competitive ratio that is O(L2k), where L is the Lipschitz constant of the nonlinear switching cost. Additionally, we provide lower bounds that illustrate the Lipschitz condition is required and the dependencies on k and L are tight. Finally, via reductions, we show that this setting is closely related to online control problems with delay, nonlinear dynamics, and adversarial disturbances, where iROBD directly offers constant-competitive online policies. This extended abstract is an abridged version of [2].

Competitive Online Optimization with Multiple Inventories: A Divide-and-Conquer Approach

We study an online inventory trading problem where a user seeks to maximize the aggregate revenue of trading multiple inventories over a time horizon. The trading constraints and concave revenue functions are revealed sequentially in time, and the user needs to make irrevocable decisions. The problem has wide applications in various engineering domains. Existing works employ the primal-dual framework to design online algorithms with sub-optimal, albeit near-optimal, competitive ratios (CR). We exploit the problem structure to develop a new divide-and-conquer approach to solve the online multi-inventory problem by solving multiple calibrated single-inventory ones separately and combining their solutions. The approach achieves the optimal CR of $łn θ + 1$ if $Nłeq łn θ + 1$, where N is the number of inventories and θ represents the revenue function uncertainty; it attains a CR of $1/[1-e^-1/(łnθ+1) ] \in [łn θ +1, łn θ +2)$ otherwise. The divide-and-conquer approach reveals novel structural insights for the problem, (partially) closes a gap in existing studies, and generalizes to broader settings. For example, it gives an algorithm with a CR within a constant factor to the lower bound for a generalized one-way trading problem with price elasticity with no previous results. When developing the above results, we also extend a recent CR-Pursuit algorithmic framework and introduce an online allocation problem with allowance augmentation, both of which can be of independent interest.

Expert-Calibrated Learning for Online Optimization with Switching Costs

Competitive Algorithms for Online Multidimensional Knapsack Problems

In this work, we study the online multidimensional knapsack problem (called OMdKP) in which there is a knapsack whose capacity is represented in m dimensions, each dimension could have a different capacity. Then, n items with different scalar profit values and m-dimensional weights arrive in an online manner and the goal is to admit or decline items upon their arrival such that the total profit obtained by admitted items is maximized and the capacity of knapsack across all dimensions is respected. This is a natural generalization of the classic single-dimension knapsack problem with several relevant applications such as in virtual machine allocation, job scheduling, and all-or-nothing flow maximization over a graph. We develop an online algorithm for OMdKP that uses an exponential reservation function to make online admission decisions. Our competitive analysis shows that the proposed online algorithm achieves the competitive ratio of O(log (Θ α)), where α is the ratio between the aggregate knapsack capacity and the minimum capacity over a single dimension and θ is the ratio between the maximum and minimum item unit values. We also show that the competitive ratio of our algorithm with exponential reservation function matches the lower bound up to a constant factor.

SESSION: Session: Resource Allocation

Offline and Online Algorithms for SSD Management

The abundance of system-level optimizations for reducing SSD write amplification, which are usually based on experimental evaluation, stands in contrast to the lack of theoretical algorithmic results in this problem domain. To bridge this gap, we explore the problem of reducing write amplification from an algorithmic perspective, considering it in both offline and online settings. In the offline setting, we present a near-optimal algorithm. In the online setting, we first consider algorithms that have no prior knowledge about the input. We present a worst case lower bound and show that the greedy algorithm is optimal in this setting. Then we design an online algorithm that uses predictions about the input. We show that when predictions are pretty accurate, our algorithm circumvents the above lower bound. We complement our theoretical findings with an empirical evaluation of our algorithms, comparing them with the state-of-the-art scheme. The results confirm that our algorithms exhibit an improved performance for a wide range of input traces.

Online Caching Networks with Adversarial Guarantees

We study a cache network under arbitrary adversarial request arrivals. We propose a distributed online policy based on the online tabular greedy algorithm. Our distributed policy achieves sublinear (1-1/e)-regret, also in the case when update costs cannot be neglected. Numerical evaluation over several topologies supports our theoretical results and demonstrates that our algorithm outperforms state-of-art online cache algorithms.

Real-time Bidding for Time Constrained Impression Contracts in First and Second Price Auctions - Theory and Algorithms

We study an optimal bidding and allocation problem that builds upon the earlier work of [4]. The model we study is of interest in online real-time advertising where the demand of a large number of advertisers is aggregated by an intermediary who bids in the advertising auction market on their behalf. This intermediary is referred to as a Demand Side Platform (DSP).

Sequential Fair Allocation: Achieving the Optimal Envy-Efficiency Tradeoff Curve

We consider the problem of dividing limited resources to individuals arriving over T rounds. Each round has a random number of individuals arrive, and individuals can be characterized by their type (i.e. preferences over the different resources). A standard notion of 'fairness' in this setting is that an allocation simultaneously satisfy envy-freeness and efficiency. For divisible resources, when the number of individuals of each type are known upfront, the above desiderata are simultaneously achievable for a large class of utility functions. However, in an online setting when the number of individuals of each type are only revealed round by round, no policy can guarantee these desiderata simultaneously.

We show that in the online setting, the two desired properties (envy-freeness and efficiency) are in direct contention, in that any algorithm achieving additive counterfactual envy-freeness up to a factor of LT necessarily suffers a efficiency loss of at least 1 / LT. We complement this uncertainty principle with a simple algorithm, Guarded-Hope, which allocates resources based on an adaptive threshold policy and is able to achieve any fairness-efficiency point on this frontier.

SESSION: Session: Topology Design and Load Balancing

Understanding the Performance Guarantee of Physical Topology Design for Optical Circuit Switched Data Centers

As a first step of designing O ptical-circuit-switched D ata C enters (ODC), physical topology design is critical as it determines the scalability and the performance limit of the entire ODC. However, prior works on ODC have not yet paid much attention to physical topology design, and the adopted physical topologies either scale poorly, or lack performance guarantee. We offer a mathematical foundation for the design and performance analysis of ODC physical topologies in this paper. We introduce a new performance metric β(\mathcalG )$ to evaluate the gap between a physical topology $\mathcalG $ and the ideal physical topology. We develop a coupling technique that bypasses a significant amount of computational complexity of calculating β(\mathcalG )$. Using β(\mathcalG )$ and the coupling technique, we study four physical topologies that are representative of those in literature, analyze their scalabilities and prove their performance guarantees. Our analysis may provide new guidance for network operators to design better physical topologies for ODCs.

Cerberus: The Power of Choices in Datacenter Topology Design - A Throughput Perspective

The bandwidth and latency requirements of modern datacenter applications have led researchers to propose various topology designs using static, dynamic demand-oblivious (rotor), and/or dynamic demand-aware switches. However, given the diverse nature of datacenter traffic, there is little consensus about how these designs would fare against each other. In this work, we analyze the throughput of existing topology designs under different traffic patterns and study their unique advantages and potential costs in terms of bandwidth and latency "tax''. To overcome the identified inefficiencies, we propose Cerberus, a unified, two-layer leaf-spine optical datacenter design with three topology types. Cerberus systematically matches different traffic patterns with their most suitable topology type: e.g., latency-sensitive flows are transmitted via a static topology, all-to-all traffic via a rotor topology, and elephant flows via a demand-aware topology. We show analytically and in simulations that Cerberus can improve throughput significantly compared to alternative approaches and operate datacenters at higher loads while being throughput-proportional.

Large-System Insensitivity of Zero-Waiting Load Balancing Algorithms

This paper studies the sensitivity (or insensitivity) of a class of load balancing algorithms that achieve asymptotic zero-waiting in the sub-Halfin-Whitt regime, named LB-zero. Most existing results on zero-waiting load balancing algorithms assume the service time distribution is exponential. This paper establishes the large-system insensitivity of LB-zero for jobs whose service time follows a Coxian distribution with a finite number of phases. This result suggests that LB-zero achieves asymptotic zero-waiting for a large class of service time distributions. To prove this result, this paper develops a new technique, called "Iterative State-Space Peeling'' (or ISSP for short). ISSP first identifies an iterative relation between the upper and lower bounds on the queue states and then proves that the system lives near the fixed point of the iterative bounds with a high probability. Based on ISSP, the steady-state distribution of the system is further analyzed by applying Stein's method in the neighborhood of the fixed point. ISSP, like state-space collapse in heavy-traffic analysis, is a general approach that may be used to study other complex stochastic systems.

Mean Field and Refined Mean Field Approximations for Heterogeneous Systems: It Works!

Mean field approximation is a powerful technique to study the performance of large stochastic systems represented as n interacting objects. Applications include load balancing models, epidemic spreading, cache replacement policies, or large-scale data centers. Mean field approximation is asymptotically exact for systems composed of n homogeneous objects under mild conditions. In this paper, we study what happens when objects are heterogeneous. This can represent servers with different speeds or contents with different popularities. We define an interaction model that allows obtaining asymptotic convergence results for stochastic systems with heterogeneous object behavior and show that the error of the mean field approximation is of order O(1/n). More importantly, we show how to adapt the refined mean field approximation, developed by the authors of Gast et al. 2019, and show that the error of this approximation is reduced to O(1/n2). To illustrate the applicability of our result, we present two examples. The first addresses a list-based cache replacement model, RANDOM(m), which is an extension of the RANDOM policy. The second is a heterogeneous supermarket model. These examples show that the proposed approximations are computationally tractable and very accurate. For moderate system sizes (n ≈ 30) the refined mean field approximation tends to be more accurate than simulations for any reasonable simulation time.

SESSION: Session: Optimization II

Asymptotic Convergence Rate of Dropout on Shallow Linear Neural Networks

We analyze the convergence rate of gradient flows on objective functions induced by Dropout and Dropconnect, when applying them to shallow linear Neural Networks (NNs) ---which can also be viewed as doing matrix factorization using a particular regularizer. Dropout algorithms such as these are thus regularization techniques that use {0,1} -valued random variables to filter weights during training in order to avoid coadaptation of features. By leveraging a recent result on nonconvex optimization and conducting a careful analysis of the set of minimizers as well as the Hessian of the loss function, we are able to obtain (i) a local convergence proof of the gradient flow and (ii) a bound on the convergence rate that depends on the data, the dropout probability, and the width of the NN. Finally, we compare this theoretical bound to numerical simulations, which are in qualitative agreement with the convergence bound and match it when starting sufficiently close to a minimizer.

Robustness and Consistency in Linear Quadratic Control with Untrusted Predictions

We study the problem of learning-augmented predictive linear quadratic control. Our goal is to design a controller that balances "consistency'', which measures the competitive ratio when predictions are accurate, and "robustness'', which bounds the competitive ratio when predictions are inaccurate. We propose a novel λ-confident controller and prove that it maintains a competitive ratio upper bound of 1 + min {O(λ2ε)+ O(1-λ)2,O(1)+O(λ2)} where λ∈ [0,1] is a trust parameter set based on the confidence in the predictions, and ε is the prediction error. Further, motivated by online learning methods, we design a self-tuning policy that adaptively learns the trust parameter λ with a competitive ratio that depends on ε and the variation of system perturbations and predictions. We show that its competitive ratio is bounded from above by 1+O(ε) /(Θ)(1)+Θ(ε))+O(μVar) where μVar measures the variation of perturbations and predictions. It implies that by automatically adjusting the trust parameter online, the self-tuning scheme ensures a competitive ratio that does not scale up with the prediction error ε.

Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization

Stochastic approximation (SA) and stochastic gradient descent (SGD) algorithms are work-horses for modern machine learning algorithms. Their constant stepsize variants are preferred in practice due to fast convergence behavior. However, constant stepsize SA algorithms do not converge to the optimal solution, but instead have a stationary distribution, which in general cannot be analytically characterized. In this work, we study the asymptotic behavior of the appropriately scaled stationary distribution, in the limit when the constant stepsize goes to zero. Specifically, we consider the following three settings: (1) SGD algorithm with a smooth and strongly convex objective, (2) linear SA algorithm involving a Hurwitz matrix, and (3) nonlinear SA algorithm involving a contractive operator. When the iterate is scaled by 1/√α, where α is the constant stepsize, we show that the limiting scaled stationary distribution is a solution of an implicit equation. Under a uniqueness assumption (which can be removed in certain settings) on this equation, we further characterize the limiting distribution as a Gaussian distribution whose covariance matrix is the unique solution of an appropriate Lyapunov equation. For SA algorithms beyond these cases, our numerical experiments suggest that unlike central limit theorem type results: (1) the scaling factor need not be 1/√α, and (2) the limiting distribution need not be Gaussian. Based on the numerical study, we come up with a heuristic formula to determine the right scaling factor, and make a connection to the Euler-Maruyama discretization scheme for approximating stochastic differential equations.

Learning To Maximize Welfare with a Reusable Resource

SESSION: Session: Miscellaneous

Free2Shard: Adversary-resistant Distributed Resource Allocation for Blockchains

In this paper, we formulate and study a new, but basic, distributed resource allocation problem arising in scaling blockchain performance. While distributed resource allocation is a well-studied problem in networking, the blockchain setting additionally requires the solution to be resilient to adversarial behavior from a fraction of nodes. Scaling blockchain performance is a basic research topic; a plethora of solutions (under the umbrella of sharding) have been proposed in recent years. Although the various sharding solutions share a common thread (they cryptographically stitch together multiple parallel chains), architectural differences lead to differing resource allocation problems. In this paper we make three main contributions: (a) we categorize the different sharding proposals under a common architectural framework, allowing for the emergence of a new, uniformly improved, uni-consensus sharding architecture. (b) We formulate and exactly solve a core resource allocation problem in the uni-consensus sharding architecture -- our solution, Free2shard, is adversary-resistant and achieves optimal throughput. The key technical contribution is a mathematical connection to the classical work of Blackwell approachability in dynamic game theory. (c) We implement the sharding architecture atop a full-stack blockchain in 3000 lines of code in Rust -- we achieve a throughput of more than 250,000 transactions per second with 6 shards, a vast improvement over state-of-the-art.

Age-Dependent Differential Privacy

The proliferation of real-time applications has motivated extensive research on analyzing and optimizing data freshness in the context of age of information. However, classical frameworks of privacy (e.g., differential privacy (DP)) have overlooked the impact of data freshness on privacy guarantees, and hence may lead to unnecessary accuracy loss when trying to achieve meaningful privacy guarantees in time-varying databases. In this work, we introduce age-dependent DP, taking into account the underlying stochastic nature of a time-varying database. In this new framework, we establish a connection between classical DP and age-dependent DP, based on which we characterize the impact of data staleness and temporal correlation on privacy guarantees. Our characterization demonstrates that aging, i.e., using stale data inputs and/or postponing the release of outputs, can be a new strategy to protect data privacy in addition to noise injection in the traditional DP framework. Furthermore, to generalize our results to a multi-query scenario, we present a sequential composition result for age-dependent DP. We then characterize and achieve the optimal tradeoffs between privacy risk and utility. Finally, case studies show that, when achieving a target of an arbitrarily small privacy risk in a single-query case, the approach of combining aging and noise injection can achieve a bounded accuracy loss, whereas using noise injection only (as in the DP benchmark) will lead to an unbounded accuracy loss.

Unleashing the Power of Paying Multiplexing Only Once in Stochastic Network Calculus

The stochastic network calculus (SNC) holds promise as a versatile and uniform framework to calculate probabilistic performance bounds in networks of queues. A great challenge to accurate bounds and efficient calculations are stochastic dependencies between flows due to resource sharing inside the network. However, by carefully utilizing the basic SNC concepts in the network analysis the necessity of taking these dependencies into account can be minimized. To that end, we unleash the power of the pay multiplexing only once principle (PMOO, known from the deterministic network calculus) in the SNC analysis. We choose an analytic combinatorics presentation of the results in order to ease complex calculations. In tree-reducible networks, a subclass of general feedforward networks, we obtain an effective analysis in terms of avoiding the need to take internal flow dependencies into account. In a comprehensive numerical evaluation, we demonstrate how this unleashed PMOO analysis can reduce the known gap between simulations and SNC calculations significantly, and how it favourably compares to state-of-the art SNC calculations in terms of accuracy and computational effort. Motivated by these promising results, we also consider general feedforward networks, when some flow dependencies have to be taken into account. To that end, the unleashed PMOO analysis is extended to the partially dependent case and a case study of a canonical topology, known as the diamond network, is provided, again displaying favourable results over the state of the art.