Proceedings of the ACM on Measurement and Analysis of Computing Systems: POMACS: Vol. 6, No. 3. 2022


POMACS V6, N3 December 2022 Editorial

The Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) focuses on the measurement and performance evaluation of computer systems and operates in close collaboration with the ACM Special Interest Group SIGMETRICS. All papers in this issue of POMACS will be presented at the ACM SIGMETRICS 2023 conference on June 19-23, 2023, in Orlando, Florida, USA. These papers have been selected by the 82 members of the ACM SIGMETRICS 2023 program committee via a rigorous review process. Each paper was conditionally accepted (and shepherded), allowed a "one-shot" revision (to be resubmitted to one of the subsequent three SIGMETRICS deadlines), or rejected (with re-submission allowed after a year). For this issue, which represents the summer deadline, POMACS is publishing 17 papers out of 93 submissions. All submissions received at least 3 reviews, and borderline cases were extensively discussed during the online program committee meeting. Based on the indicated track(s), roughly 30% of the submissions were in the Theory track, 39% were in the Measurement & Applied Modeling track, 33% were in the Systems track, and 29% were in the Learning track.

Many individuals contributed to the success of this issue of POMACS. First, we would like to thank the authors, who submitted their best work to SIGMETRICS/POMACS. Second, we would like to thank the program committee members, who provided constructive feedback in their reviews and participated in the online discussions and program committee meeting. We also thank the several external reviewers who provided their expert opinion on specific submissions that required additional input. We are also grateful to the SIGMETRICS Board Chair, Giuliano Casale, and to past program committee Chairs, Niklas Carlsson, Edith Cohen, and Philippe Robert, who provided a wealth of information and guidance. Finally, we are grateful to the Organization Committee and to the SIGMETRICS Board for their ongoing efforts and initiatives in creating an exciting program for ACM SIGMETRICS 2023.

Characterizing Cryptocurrency-themed Malicious Browser Extensions

Due to the surging popularity of various cryptocurrencies in recent years, a large number of browser extensions have been developed as portals to access relevant services, such as cryptocurrency exchanges and wallets. This has stimulated a wild growth of cryptocurrency-themed malicious extensions that cause heavy financial losses to users and legitimate service providers. These malicious extensions have shown their capability of evading the stringent vetting processes of the extension stores, highlighting a lack of understanding of this emerging type of malware in our community. In this work, we conduct the first systematic study to identify and characterize cryptocurrency-themed malicious extensions. We monitor seven official and third-party extension distribution venues for 18 months (December 2020 to June 2022) and collect around 3,600 unique cryptocurrency-themed extensions. Leveraging a hybrid analysis, we identify 186 malicious extensions that belong to five categories. We then characterize those extensions from various perspectives, including their distribution channels, life cycles, developers, illicit behaviors, and illegal gains. Our work unveils the status quo of cryptocurrency-themed malicious extensions and reveals their disguises and programmatic features, on which detection techniques can be based. It serves as a warning to extension users and an appeal to extension store operators to enact dedicated countermeasures. To facilitate future research in this area, we release our dataset of the identified malicious extensions and open-source our analyzer.

Characterizing the Performance of Accelerated Jetson Edge Devices for Training Deep Learning Models

Deep Neural Networks (DNNs) have had a significant impact on domains like autonomous vehicles and smart cities through low-latency inferencing on edge computing devices close to the data source. However, DNN training on the edge is poorly explored. Techniques like federated learning and the growing capacity of GPU-accelerated edge devices like the NVIDIA Jetson motivate the need for a holistic characterization of DNN training on the edge. Training DNNs is resource-intensive and can stress an edge device's GPU, CPU, memory, and storage capacities. Edge devices also have different resources compared to workstations and servers, such as slower shared memory and diverse storage media. Here, we perform a principled study of DNN training on individual devices of three contemporary Jetson device types (AGX Xavier, Xavier NX, and Nano) for three diverse DNN model-dataset combinations. We vary device and training parameters such as I/O pipelining and parallelism, storage media, mini-batch sizes, and power modes, and examine their effect on CPU and GPU utilization, fetch stalls, training time, energy usage, and variability. Our analysis exposes several resource inter-dependencies and counter-intuitive insights, while also helping quantify known wisdom. Our rigorous study can help tune training performance on the edge, trade off time and energy usage on constrained devices, and even select the ideal edge hardware for a DNN workload, and, in the future, extend to federated learning too. As an illustration, we use these results to build a simple model that predicts the training time and energy per epoch for any given DNN across different power modes, with minimal additional profiling.

Dynamic Bin Packing with Predictions

The MinUsageTime Dynamic Bin Packing (DBP) problem aims to minimize the accumulated bin usage time for packing a sequence of items into bins. It is often used to model job dispatching for optimizing the busy time of servers, where the items and bins match the jobs and servers respectively. It is known that the competitiveness of MinUsageTime DBP has tight bounds of Θ(√(log μ)) and Θ(μ) in the clairvoyant and non-clairvoyant settings respectively, where μ is the max/min duration ratio of all items. In practice, the information about the items' durations (i.e., job lengths) obtained via predictions is usually prone to errors. In this paper, we study the MinUsageTime DBP problem with predictions of the items' durations. We find that an existing O(√(log μ))-competitive clairvoyant algorithm, if using predicted durations rather than real durations for packing, does not provide any bounded performance guarantee when the predictions are adversarially bad. We develop a new online algorithm with a competitive ratio of min{O(ε^2 √(log(ε^2 μ))), O(μ)} (where ε is the maximum multiplicative error of prediction among all items), achieving O(√(log μ)) consistency (competitiveness under perfect predictions where ε = 1) and O(μ) robustness (competitiveness under terrible predictions), both of which are asymptotically optimal.

Enabling Long-term Fairness in Dynamic Resource Allocation

We study fairness in the dynamic resource allocation problem under the α-fairness criterion. We recognize two different fairness objectives that naturally arise in this problem: the well-understood slot-fairness objective, which aims to ensure fairness at every timeslot, and the less explored horizon-fairness objective, which aims to ensure fairness across utilities accumulated over a time horizon. We argue that horizon-fairness comes at a lower price in terms of social welfare. We study horizon-fairness with regret as the performance metric and show that vanishing regret cannot be achieved in the presence of an unrestricted adversary. We propose restrictions on the adversary's capabilities corresponding to realistic scenarios, and an online policy that indeed guarantees vanishing regret under these restrictions. We demonstrate the applicability of the proposed fairness framework to a representative resource management problem, considering a virtualized caching system where different caches cooperate to serve content requests.
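
For reference, the α-fair utility assigned to an allocation x > 0 is the standard one from the fairness literature (not specific to this paper); in LaTeX:

U_\alpha(x) =
\begin{cases}
\frac{x^{1-\alpha}}{1-\alpha}, & \alpha \ge 0,\ \alpha \neq 1, \\
\log x, & \alpha = 1,
\end{cases}

so that α = 0 recovers the utilitarian (sum-of-utilities) objective, α = 1 corresponds to proportional fairness, and α → ∞ approaches max-min fairness.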

FuncPipe: A Pipelined Serverless Framework for Fast and Cost-Efficient Training of Deep Learning Models

Training deep learning (DL) models in the cloud has become the norm. With the emergence of serverless computing and its benefits of true pay-as-you-go pricing and scalability, systems researchers have recently started to provide support for serverless-based training. However, the ability to train DL models on serverless platforms is hindered by the resource limitations of today's serverless infrastructure and by DL models' explosive requirements for memory and bandwidth. This paper describes FuncPipe, a novel pipelined training framework specifically designed for serverless platforms that enables fast and low-cost training of DL models. FuncPipe is built on the key insight that model partitioning can be leveraged to bridge both the memory and bandwidth gaps between the capacity of serverless functions and the requirements of DL training. While conceptually simple, this approach raises several design questions, including how to partition the model, how to configure each serverless function, and how to exploit each function's uplink/downlink bandwidth. In particular, we tailor a micro-batch scheduling policy for the serverless environment, which serves as the basis for the subsequent optimization. Our Mixed-Integer Quadratic Programming formulation automatically and simultaneously configures serverless resources and partitions models to fit within the resource constraints. Lastly, we improve the bandwidth efficiency of storage-based synchronization with a novel pipelined scatter-reduce algorithm. We implement FuncPipe on two popular cloud serverless platforms and show that it achieves 7%-77% cost savings and 1.3X-2.2X speedup compared to state-of-the-art serverless-based frameworks.

Optimistic No-regret Algorithms for Discrete Caching

We take a systematic look at the problem of storing whole files in a cache with limited capacity in the context of optimistic learning, where the caching policy has access to a prediction oracle (provided by, e.g., a neural network). The successive file requests are assumed to be generated by an adversary, and no assumption is made on the accuracy of the oracle. In this setting, we provide a universal lower bound for prediction-assisted online caching and proceed to design a suite of policies with a range of performance-complexity trade-offs. All proposed policies offer sublinear regret bounds commensurate with the accuracy of the oracle. Our results substantially improve upon all recently proposed online caching policies, which, being unable to exploit the oracle predictions, offer only O(√T) regret. In this pursuit, we design, to the best of our knowledge, the first comprehensive optimistic Follow-the-Perturbed-Leader policy, which generalizes beyond the caching problem. We also study the problem of caching files with different sizes and the bipartite network caching problem. Finally, we evaluate the efficacy of the proposed policies through extensive numerical experiments using real-world traces.
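
To give a flavor of what an optimistic Follow-the-Perturbed-Leader caching policy looks like, here is a minimal Python sketch: the cache holds the top-C files ranked by cumulative request counts, boosted by the oracle's prediction of the upcoming request and a Gaussian perturbation. The √t noise scaling and the single-step prediction format are illustrative assumptions; the paper's policies tune the perturbation to the oracle's accuracy and handle richer settings.

import numpy as np

def optimistic_ftpl_cache(requests, predictions, num_files, cache_size, noise=1.0, seed=0):
    """Sketch of optimistic FTPL for discrete caching (illustrative only)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(num_files)                  # cumulative request counts
    hits = 0
    for t, (req, pred) in enumerate(zip(requests, predictions), start=1):
        score = counts.copy()
        score[pred] += 1.0                        # optimism: trust the oracle's next-request guess
        score += noise * np.sqrt(t) * rng.standard_normal(num_files)
        cache = np.argpartition(score, -cache_size)[-cache_size:]   # top-C files
        hits += int(req in cache)                 # serve the true request
        counts[req] += 1.0                        # then update the history
    return hits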

Noise in the Clouds: Influence of Network Performance Variability on Application Scalability

Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introduce network noise that can eventually limit the application scaling. This work analyzes network performance, scalability, and cost of running HPC workloads on cloud systems. First, we consider latency, bandwidth, and collective communication patterns in detailed small-scale measurements, and then we simulate network performance at a larger scale. We validate our approach on four popular cloud providers and three on-premise HPC systems, showing that network (and also OS) noise can significantly impact performance and cost both at small and large scale.

On the Stochastic and Asymptotic Improvement of First-Come First-Served and Nudge Scheduling

Recently it was shown that, contrary to expectations, the First-Come-First-Served (FCFS) scheduling algorithm can be stochastically improved upon by a scheduling algorithm called Nudge for light-tailed job size distributions. Nudge partitions jobs into four types based on their size: small, medium, large, and huge. Nudge operates identically to FCFS, except that whenever an arriving small job finds a large job waiting at the back of the queue, Nudge swaps the small job with the large one, unless the large job was already involved in an earlier swap.

In this paper, we show that FCFS can be stochastically improved upon under far weaker conditions. We consider a system with two job types and limited swapping between type-1 and type-2 jobs, but where a type-1 job is not necessarily smaller than a type-2 job. More specifically, we introduce and study the Nudge-K scheduling algorithm, which allows type-1 jobs to be swapped with up to K type-2 jobs waiting at the back of the queue, while type-2 jobs can be involved in at most one swap. We present an explicit expression for the response time distribution under Nudge-K when both job types follow a phase-type distribution. We further derive a simple expression for the asymptotic tail improvement ratio (ATIR), as well as for the value of K that maximizes it. We show that the ATIR is positive and the optimal K tends to infinity in heavy traffic, as long as the type-2 jobs are on average longer than the type-1 jobs.
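
The following Python sketch illustrates one natural reading of the Nudge-K swap rule at arrival instants: an arriving type-1 job moves ahead of up to K consecutive, not-yet-swapped type-2 jobs at the back of the queue, and every type-2 job can be passed at most once. The data representation and stopping details are our own illustrative choices, not the paper's specification.

from collections import deque

def nudge_k_insert(queue, job_type, K):
    """Insert an arriving job into the waiting line under a Nudge-K-style rule (sketch).

    queue    : deque of [type, passed] entries; the front is served first
    job_type : 1 or 2
    K        : maximum number of type-2 jobs a type-1 arrival may pass
    """
    entry = [job_type, False]                 # [job type, already passed by a type-1 job?]
    if job_type == 2:
        queue.append(entry)                   # type-2 jobs never jump ahead
        return
    pos = len(queue)                          # type-1 arrival: walk in from the back
    passed = 0
    while passed < K and pos > 0:
        jt, already_passed = queue[pos - 1]
        if jt == 2 and not already_passed:
            queue[pos - 1][1] = True          # this type-2 job has now been passed once
            pos -= 1
            passed += 1
        else:
            break                             # stop at a type-1 job or an already-passed job
    queue.insert(pos, entry)

# Hypothetical usage: q = deque(); nudge_k_insert(q, 2, K=3); nudge_k_insert(q, 1, K=3)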

Optimal Scheduling in the Multiserver-job Model under Heavy Traffic

Multiserver-job systems, where jobs require concurrent service at many servers, occur widely in practice. Essentially all of the theoretical work on multiserver-job systems focuses on maximizing utilization, with almost nothing known about mean response time. In simpler settings, such as various known-size single-server-job settings, minimizing mean response time is merely a matter of prioritizing small jobs. However, for the multiserver-job system, prioritizing small jobs is not enough, because we must also ensure servers are not unnecessarily left idle. Thus, minimizing mean response time requires prioritizing small jobs while simultaneously maximizing throughput. Our question is how to achieve these joint objectives.

We devise the ServerFilling-SRPT scheduling policy, which is the first policy to minimize mean response time in the multiserver-job model in the heavy traffic limit. In addition to proving this heavy-traffic result, we present empirical evidence that ServerFilling-SRPT outperforms all existing scheduling policies for all loads, with improvements by orders of magnitude at higher loads.

Because ServerFilling-SRPT requires knowing job sizes, we also define the ServerFilling-Gittins policy, which is optimal when sizes are unknown or partially known.
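
As a rough illustration of the idea behind ServerFilling-SRPT, the sketch below selects the set of jobs to run at a scheduling instant: order jobs by remaining size (the SRPT priority), take the minimal prefix whose total server need reaches k, and within that prefix place the largest-need jobs first so that all k servers are filled whenever enough demand is present. The sketch assumes, as we understand the construction, that k and all server needs are powers of two; the paper's exact rule and assumptions may differ in details.

def serverfilling_srpt(jobs, k):
    """Pick the jobs to serve right now (illustrative sketch, not the paper's exact policy).

    jobs : list of dicts with 'need' (servers required) and 'remaining' (work left)
    k    : total number of servers (assumed a power of two, as are all needs)
    """
    order = sorted(jobs, key=lambda j: j['remaining'])    # SRPT priority: shortest first
    prefix, total = [], 0
    for job in order:                                     # minimal prefix with enough demand
        prefix.append(job)
        total += job['need']
        if total >= k:
            break
    prefix.sort(key=lambda j: j['need'], reverse=True)    # pack largest needs first
    serving, used = [], 0
    for job in prefix:
        if used + job['need'] <= k:
            serving.append(job)
            used += job['need']
    return serving                                        # fills all k servers when total >= k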

Leveraging the Properties of mmWave Signals for 3D Finger Motion Tracking for Interactive IoT Applications

mmWave signals form a critical component of 5G and next-generation wireless networks, and they are also being increasingly considered for sensing the environment around us to enable ubiquitous IoT applications. In this context, this paper leverages the properties of mmWave signals to track 3D finger motion for interactive IoT applications. While conventional vision-based solutions break down under poor lighting and occlusions, and also suffer from privacy concerns, mmWave signals work under typical occlusions and non-line-of-sight conditions while being privacy-preserving. In contrast to prior works on mmWave sensing that focus on predefined gesture classification, this work performs continuous 3D finger motion tracking. Towards this end, we first observe via simulations and experiments that the small size of fingers, coupled with specular reflections, does not yield stable mmWave reflections. However, we make the interesting observation that focusing on the forearm instead of the fingers can provide stable reflections for 3D finger motion tracking. The muscles that activate the fingers extend through the forearm, and their motion manifests as vibrations on the forearm. By analyzing the variation in the phases of mmWave signals reflected from the forearm, this paper designs mm4Arm, a system that tracks 3D finger motion. Nontrivial challenges arise due to the high-dimensional search space, complex vibration patterns, diversity across users, hardware noise, etc. mm4Arm exploits anatomical constraints on finger motions and fuses them with machine learning architectures based on encoder-decoder networks and ResNets to enable accurate tracking. A systematic performance evaluation with 10 users demonstrates a median error of 5.73° (location error of 4.07 mm) with robustness to multipath and natural variation in hand position/orientation. The accuracy is also consistent under non-line-of-sight conditions and clothing that might occlude the forearm. mm4Arm runs on smartphones with a latency of 19 ms and low energy overhead.

Robust Multi-Agent Bandits Over Undirected Graphs

We consider a multi-agent multi-armed bandit setting in which n honest agents collaborate over a network to minimize regret but m malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur O((m + K/n) log(T) / Δ) regret in this setting, where K is the number of arms and Δ is the arm gap. For m ≪ K, this improves over the single-agent baseline regret of O(K log(T)/Δ). In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in K and n. In light of this negative result, we propose a new algorithm for which the i-th agent has regret O((d_mal(i) + K/n) log(T)/Δ) on any connected and undirected graph, where d_mal(i) is the number of i's neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where d_mal(i) = m), and show the effect of malicious agents is entirely local (in the sense that only the d_mal(i) malicious agents directly connected to i affect its long-term regret).

Streaming Algorithms for Constrained Submodular Maximization

It is of great importance to design streaming algorithms for submodular maximization, as many applications (e.g., crowdsourcing) involve large volumes of data satisfying the well-known "diminishing returns" property, which cannot be handled by offline algorithms requiring full access to the whole dataset. However, streaming submodular maximization has been studied less than its offline counterpart due to the hardness brought by more stringent requirements on memory consumption. In this paper, we consider the fundamental problem of Submodular Maximization under k-System and d-Knapsack constraints (SMSK), which has only been successfully addressed by offline algorithms in previous studies, and we propose the first streaming algorithm for it with provable performance bounds. Our approach adopts a novel algorithmic framework dubbed MultiplexGreedy, which also performs well under a single k-system constraint. For the special case of SMSK with only d-knapsack constraints, we further propose a streaming algorithm with better performance ratios than the state-of-the-art algorithms. As the SMSK problem generalizes most of the major problems studied in submodular maximization, our algorithms have wide applications in big data processing.
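
For readers new to the area, the sketch below shows the classic single-pass thresholding rule for monotone submodular maximization under a simple cardinality constraint (in the style of Sieve-Streaming). It is included only to convey what a streaming submodular algorithm looks like; it is not the paper's MultiplexGreedy, which handles the far more general k-system plus d-knapsack setting.

def threshold_streaming(stream, f, k, opt_guess):
    """Single-threshold streaming rule for monotone submodular f under |S| <= k (sketch).

    stream    : iterable of elements arriving one by one
    f         : monotone submodular set function, f(frozenset) -> float, with f(empty) = 0
    opt_guess : a guess of the optimal value (in practice a geometric grid of guesses
                is maintained in parallel)
    """
    S = frozenset()
    for e in stream:
        if len(S) >= k:
            break
        gain = f(S | {e}) - f(S)
        # Keep e only if it contributes its fair share toward reaching opt_guess / 2.
        if gain >= (opt_guess / 2.0 - f(S)) / (k - len(S)):
            S = S | {e}
    return S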

Switching in the Rain: Predictive Wireless x-haul Network Reconfiguration

4G, 5G, and smart city networks often rely on microwave and millimeter-wave x-haul links. A major challenge associated with these high frequency links is their susceptibility to weather conditions. In particular, precipitation may cause severe signal attenuation, which significantly degrades the network performance. In this paper, we develop a Predictive Network Reconfiguration (PNR) framework that uses historical data to predict the future condition of each link and then prepares the network ahead of time for imminent disturbances. The PNR framework has two components: (i) an Attenuation Prediction (AP) mechanism; and (ii) a Multi-Step Network Reconfiguration (MSNR) algorithm. The AP mechanism employs an encoder-decoder Long Short-Term Memory (LSTM) model to predict the sequence of future attenuation levels of each link. The MSNR algorithm leverages these predictions to dynamically optimize routing and admission control decisions aiming to maximize network utilization, while preserving max-min fairness among the nodes using the network (e.g., base-stations) and preventing transient congestion that may be caused by switching routes. We train, validate, and evaluate the PNR framework using a dataset containing over 2 million measurements collected from a real-world city-scale backhaul network. The results show that the framework: (i) predicts attenuation with high accuracy, with an RMSE of less than 0.4 dB for a prediction horizon of 50 seconds; and (ii) can improve the instantaneous network utilization by more than 200% when compared to reactive network reconfiguration algorithms that cannot leverage information about future disturbances.
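
As an illustration of the encoder-decoder LSTM component, here is a minimal PyTorch sketch that maps a window of past attenuation samples of a link to a multi-step forecast. The layer sizes, input features, and decoding scheme are our own placeholder choices; the AP mechanism in the paper will differ in architecture and training details.

import torch
import torch.nn as nn

class AttenuationSeq2Seq(nn.Module):
    """Minimal encoder-decoder LSTM for multi-step attenuation forecasting (sketch)."""
    def __init__(self, hidden=64, horizon=5):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.decoder = nn.LSTMCell(input_size=1, hidden_size=hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, past):                       # past: (batch, window, 1)
        _, (h, c) = self.encoder(past)             # summarize the observed window
        h, c = h.squeeze(0), c.squeeze(0)
        step = past[:, -1, :]                      # start from the last observed sample
        preds = []
        for _ in range(self.horizon):              # autoregressive multi-step decoding
            h, c = self.decoder(step, (h, c))
            step = self.head(h)                    # next attenuation estimate, shape (batch, 1)
            preds.append(step)
        return torch.stack(preds, dim=1)           # (batch, horizon, 1)

# Hypothetical usage: forecast 5 future samples from a 60-sample history of one link.
model = AttenuationSeq2Seq(hidden=64, horizon=5)
future = model(torch.randn(8, 60, 1))              # shape (8, 5, 1)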

The M/M/k with Deterministic Setup Times

Capacity management, whether it involves servers in a data center, or human staff in a call center, or doctors in a hospital, is largely about balancing a resource-delay tradeoff. On the one hand, one would like to turn off servers when not in use (or send home staff that are idle) to save on resources. On the other hand, one wants to avoid the considerable setup time required to turn an "off" server back "on." This paper aims to understand the delay component of this tradeoff, namely: what is the effect of setup time on average delay in a multi-server system?

Surprisingly little is known about the effect of setup times on delay. While there has been some work on studying the M/M/k with Exponentially-distributed setup times, these works provide only iterative methods for computing mean delay, giving little insight as to how delay is affected by k, by load, and by the setup time. Furthermore, setup time in practice is much better modeled by a Deterministic random variable, and, as this paper shows, the scaling effect of a Deterministic setup time is nothing like that of an Exponentially-distributed setup time.

This paper provides the first analysis of the M/M/k with Deterministic setup times. We prove a lower bound on the effect of setup on delay, where our bound is highly accurate for the common case where the setup time is much higher than the job service time. Our result is a relatively simple algebraic formula which provides insights on how delay scales with the input parameters. Our proof uses a combination of renewal theory, martingale arguments and novel probabilistic arguments, providing strong intuition on the transient behavior of a system that turns servers on and off.
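
To make the setting concrete, the following Python sketch simulates an M/M/k queue with Deterministic setup times under one common on/off variant: a server powers down as soon as it has no job to serve, each arrival that finds an off server starts setting one up, and a server that finishes setup (or a service) takes the head-of-line job if one is waiting and powers down otherwise. The on/off policy, warm-up handling, and parameters are illustrative assumptions, not the paper's exact model.

import heapq, random

def simulate_mmk_det_setup(lam, mu, k, setup, num_jobs=200_000, seed=1):
    """Estimate the mean waiting time via event-driven simulation (sketch)."""
    rng = random.Random(seed)
    events, seq = [], 0                          # min-heap of (time, tie-breaker, kind)

    def push(time, kind):
        nonlocal seq
        heapq.heappush(events, (time, seq, kind))
        seq += 1

    push(rng.expovariate(lam), 'arrival')
    queue = []                                   # arrival times of waiting jobs
    busy, in_setup, off = 0, 0, k                # invariant: busy + in_setup + off == k
    served, total_wait = 0, 0.0

    while served < num_jobs:
        now, _, kind = heapq.heappop(events)
        if kind == 'arrival':
            queue.append(now)
            if off > 0:                          # wake an off server (Deterministic setup)
                off -= 1
                in_setup += 1
                push(now + setup, 'setup_done')
            push(now + rng.expovariate(lam), 'arrival')
        else:                                    # a server becomes free: 'setup_done' or 'departure'
            if kind == 'setup_done':
                in_setup -= 1
            else:
                busy -= 1
            if queue:                            # serve the head-of-line job
                total_wait += now - queue.pop(0)
                served += 1
                busy += 1
                push(now + rng.expovariate(mu), 'departure')
            else:
                off += 1                         # nothing to do: power down again
    return total_wait / served

# Hypothetical usage: k = 10 servers, load 0.8, setup 50x the mean job size.
# print(simulate_mmk_det_setup(lam=8.0, mu=1.0, k=10, setup=50.0))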

The Online Knapsack Problem with Departures

The online knapsack problem is a classic online resource allocation problem in networking and operations research. Its basic version studies how to pack online arriving items of different sizes and values into a capacity-limited knapsack. In this paper, we study a general version that includes item departures, while also considering multiple knapsacks and multi-dimensional item sizes. We design a threshold-based online algorithm and prove that the algorithm can achieve order-optimal competitive ratios. Beyond worst-case performance guarantees, we also aim to achieve near-optimal average performance under typical instances. Towards this goal, we propose a data-driven online algorithm that learns within a policy-class that guarantees a worst-case performance bound. In trace-driven experiments, we show that our data-driven algorithm outperforms other benchmark algorithms in an application of online knapsack to job scheduling for cloud computing.
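
To illustrate what a threshold-based online knapsack algorithm looks like, here is a sketch of the classic exponential-threshold rule for the basic setting (value densities in a known range [L, U], items small relative to capacity, and no departures). The paper's algorithm builds on this idea and extends it to item departures, multiple knapsacks, and multi-dimensional sizes.

import math

def make_threshold(L, U):
    """Classic exponential threshold on value density as a function of utilization z."""
    def phi(z):
        return (U * math.e / L) ** z * (L / math.e)
    return phi

def admit_online(items, capacity, L, U):
    """items: iterable of (value, size) arriving online; returns total admitted value."""
    phi = make_threshold(L, U)
    used, gained = 0.0, 0.0
    for value, size in items:
        z = used / capacity                      # current fraction of capacity in use
        if used + size <= capacity and value / size >= phi(z):
            used += size                         # admit: density clears the threshold
            gained += value
    return gained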

Joint Learning and Control in Stochastic Queueing Networks with Unknown Utilities

We study the optimal control problem in stochastic queueing networks with a set of job dispatchers connected to a set of parallel servers with queues. Jobs arrive at the dispatchers and get routed to the servers following some routing policy. The arrival processes of jobs and the service processes of servers are stochastic with unknown arrival rates and service rates. Upon the completion of each job from dispatcher u_n at server s_m, a random utility whose mean is unknown is obtained. We seek to design a control policy that makes routing decisions at the dispatchers and scheduling decisions at the servers to maximize the total utility obtained by the end of a finite time horizon T. The performance of policies is measured by regret, which is defined as the difference in total expected utility with respect to the optimal dynamic policy that has access to arrival rates, service rates and underlying utilities.

We first show that the expected utility of the optimal dynamic policy is upper bounded by T times the solution to a static linear program, where the optimization variables correspond to rates of jobs from dispatchers to servers and the feasibility region is parameterized by the arrival rates and service rates. We next propose a policy for the optimal control problem that integrates a learning algorithm with a control policy. The learning algorithm seeks to learn the optimal extreme point solution to the static linear program based on the information available in the optimal control problem. The control policy, a mixture of priority-based and Join-the-Shortest-Queue routing at the dispatchers and priority-based scheduling at the servers, makes decisions based on the graphical structure induced by the extreme point solutions provided by the learning algorithm. We prove that our policy achieves logarithmic regret, whereas application of existing techniques to the optimal control problem would lead to Ω(√T) regret. The theoretical analysis is further complemented with simulations that evaluate the empirical performance of our policy.
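
One plausible form of the static linear program described above, reconstructed from the abstract (the paper's exact formulation may include further constraints), with x_nm the rate of jobs routed from dispatcher u_n to server s_m, w*_nm the unknown mean utilities, λ_n the arrival rates, and μ_m the service rates; in LaTeX:

\max_{x \ge 0} \; \sum_{n,m} w^{*}_{nm}\, x_{nm}
\quad \text{s.t.} \quad
\sum_{m} x_{nm} \le \lambda_n \;\; \forall n,
\qquad
\sum_{n} x_{nm} \le \mu_m \;\; \forall m.

The expected utility of the optimal dynamic policy is then upper bounded by T times the optimal value of this program.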

Malcolm: Multi-agent Learning for Cooperative Load Management at Rack Scale

We consider the problem of balancing the load among servers in dense racks for microsecond-scale workloads. To balance the load in such settings tens of millions of scheduling decisions have to be made per second. Achieving this throughput while providing microsecond-scale latency and high availability is extremely challenging. To address this challenge, we design a fully decentralized load-balancing framework. In this framework, servers collectively balance the load in the system. We model the interactions among servers as a cooperative stochastic game. To find the game's parametric Nash equilibrium, we design and implement a decentralized algorithm based on multi-agent-learning theory. We empirically show that our proposed algorithm is adaptive and scalable while outperforming state-of-the art alternatives. In homogeneous settings, Malcolm performs as well as the best alternative among other baselines. In heterogeneous settings, compared to other baselines, for lower loads, Malcolm improves tail latency by up to a factor of four. And for the same tail latency, Malcolm achieves up to 60% more throughput compared to the best alternative among other baselines.