The Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) focuses on the measurement and performance evaluation of computer systems and operates in close collaboration with the ACM Special Interest Group SIGMETRICS. All papers in this issue of POMACS will be presented at the ACM SIGMETRICS 2023 conference on June 19-23, 2023, in Orlando, Florida, USA. These papers were selected during the fall submission round by the 91 members of the ACM SIGMETRICS 2023 program committee via a rigorous review process. Each paper was conditionally accepted (and shepherded), allowed a "one-shot" revision (to be resubmitted to one of the subsequent three SIGMETRICS deadlines), or rejected (with re-submission allowed after a year). For this issue, which represents the fall deadline, POMACS is publishing 26 papers out of 119 submissions. All submissions received at least 3 reviews, and borderline cases were extensively discussed during the online program committee meeting. Based on the indicated track(s), roughly 21% of the submissions were in the Theory track, 40% in the Measurement & Applied Modeling track, 29% in the Systems track, and 39% in the Learning track.

Many individuals contributed to the success of this issue of POMACS. First, we would like to thank the authors, who submitted their best work to SIGMETRICS/POMACS. Second, we would like to thank the program committee members, who provided constructive feedback in their reviews to authors and participated in the online discussions and program committee meeting. We also thank the several external reviewers who provided their expert opinion on specific submissions that required additional input. We are also grateful to the SIGMETRICS Board Chair, Giuliano Casale, and to past program committee Chairs, Niklas Carlsson, Edith Cohen, and Philippe Robert, who provided a wealth of information and guidance. Finally, we are grateful to the Organization Committee and to the SIGMETRICS Board for their ongoing efforts and initiatives in creating an exciting program for ACM SIGMETRICS 2023.
The performance of large-scale computing systems often critically depends on high-performance communication networks. Dynamically reconfigurable topologies, e.g., based on optical circuit switches, are emerging as an innovative technology for coping with the explosive growth of datacenter traffic. Specifically, periodic reconfigurable datacenter networks (RDCNs) such as RotorNet (SIGCOMM 2017), Opera (NSDI 2020), and Sirius (SIGCOMM 2020) have been shown to provide high throughput by emulating a complete graph through fast periodic circuit-switch scheduling.
However, to achieve such high throughput, existing reconfigurable network designs pay a high price: potentially high delays and, as we show as a first contribution of this paper, high buffer requirements. In particular, we show that under buffer constraints, emulating the high-throughput complete graph is infeasible at scale, and we uncover a spectrum of unexplored and attractive alternative RDCNs, which emulate regular graphs of lower node degree than the complete graph.
We present Mars, a periodic reconfigurable topology which emulates a d-regular graph with near-optimal throughput. In particular, we systematically analyze how the degree d can be optimized for throughput given the available buffer and the delay tolerance of the datacenter. We further show empirically that Mars achieves higher throughput than existing systems when buffer sizes are bounded.
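To make the notion of periodic emulation concrete, the sketch below builds a round-robin (circle-method) schedule of perfect matchings: cycling through all n-1 matchings emulates the complete graph, while cycling through only the first d of them yields a d-regular emulated graph with a shorter period. This is an illustrative construction only and is not claimed to be Mars's actual schedule.

```python
def round_robin_matchings(n):
    """Circle-method schedule: n-1 perfect matchings whose union is the
    complete graph K_n (n must be even). Round r pairs node n-1 with node r
    and pairs (r+k) mod (n-1) with (r-k) mod (n-1) for k = 1..n/2-1."""
    assert n % 2 == 0, "circle method needs an even number of nodes"
    schedule = []
    for r in range(n - 1):
        matching = [(n - 1, r)]
        for k in range(1, n // 2):
            matching.append(((r + k) % (n - 1), (r - k) % (n - 1)))
        schedule.append(matching)
    return schedule

def truncated_schedule(schedule, d):
    """Keeping only the first d matchings of the period gives a d-regular
    emulated graph, since each node appears exactly once per matching."""
    return schedule[:d]
```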
A diverse set of scheduling objectives (e.g., resource contention, fairness, priority, etc.) has bred a series of objective-specific schedulers for multi-core architectures. Existing designs incorporate thread-to-thread statistics at runtime and schedule threads based on this abstraction (we formalize thread-to-thread interaction as the Thread-Interaction Matrix). However, this abstraction also reveals a consistently overlooked issue: the Thread-Interaction Matrix (TIM) is highly sparse. Therefore, existing designs can only deliver sub-optimal decisions, since the sparsity issue limits the number of thread permutations (and their statistics) that can be exploited when making scheduling decisions.
We introduce Sparsity-Lightened Intelligent Thread Scheduling (SLITS), a general scheduler design that mitigates the sparsity issue of the TIM while remaining customizable for different scheduling objectives. SLITS is built on the key insight that the sparsity of the TIM can be effectively mitigated via Machine Learning (ML) techniques. SLITS has three components. First, SLITS profiles thread interactions for only a small number of thread permutations and forms the TIM from the run-time statistics. Second, SLITS estimates the missing values in the TIM using a Factorization Machine (FM), an ML technique that can fill in the missing values of a large-scale sparse matrix from limited information. Third, SLITS leverages Lazy Reschedule, a general mechanism that serves as the building block for customizing scheduling policies for different scheduling objectives. We show how SLITS can be (1) customized for different scheduling objectives, including resource contention and fairness, and (2) implemented with only negligible hardware costs. We also discuss how SLITS can potentially be applied to other thread-scheduling contexts.
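As a rough illustration of the second component, the sketch below completes a sparse TIM with a biased low-rank factorization trained by stochastic gradient descent; a second-order FM over one-hot (thread_i, thread_j) features essentially reduces to this form. The hyperparameters and the exact model used by SLITS are not specified in the abstract, so everything below is illustrative.

```python
import numpy as np

def complete_tim(tim, mask, rank=8, lr=0.01, reg=0.1, epochs=200, seed=0):
    """Fill in a sparse Thread-Interaction Matrix (TIM) by training a biased
    low-rank factorization on the observed entries only.
    tim:  n x n array of profiled interaction statistics (arbitrary elsewhere)
    mask: n x n boolean array, True where an entry was actually profiled."""
    rng = np.random.default_rng(seed)
    n = tim.shape[0]
    P = 0.1 * rng.standard_normal((n, rank))  # latent factors, "row" threads
    Q = 0.1 * rng.standard_normal((n, rank))  # latent factors, "column" threads
    b = float(tim[mask].mean())               # global bias term
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            pi, qj = P[i].copy(), Q[j].copy()
            err = tim[i, j] - (b + pi @ qj)   # prediction error on observed cell
            P[i] += lr * (err * qj - reg * pi)
            Q[j] += lr * (err * pi - reg * qj)
    return b + P @ Q.T                        # dense estimate for every thread pair
```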
We evaluate two SLITS variants against four state-of-the-art scheduler designs. We highlight that, averaged across 11 benchmarks, SLITS achieves an average speedup of 1.08X over the de facto standard thread scheduler, the Completely Fair Scheduler, under a 16-core setting for a variety of thread counts (i.e., 32, 64, and 128). Our analysis reveals that the benefits of SLITS are credited to significant improvements in cache utilization. In addition, our experimental results confirm that SLITS is scalable and that its benefits are robust as the number of threads increases. We also perform extensive studies to (1) break down the SLITS components to justify the synergy of our design choices, (2) examine the impact of varying the estimation coverage of the FM, (3) justify the necessity of Lazy Reschedule rather than periodic rescheduling, and (4) demonstrate that the hardware overhead of SLITS implementations can be marginal (<1% chip area and power).
We describe the results of a large-scale study of third-party dependencies around the world, based on the regional top-500 popular websites accessed from vantage points in 50 countries, together covering all inhabited continents. This broad perspective shows that dependencies on a third-party DNS, CDN, or CA provider vary widely around the world, ranging from 19% to as much as 76% of websites across countries. The critical dependencies of websites -- where a site depends on a single third-party provider -- are similarly spread, ranging from 5% to 60% (CDN in Costa Rica and DNS in China, respectively). Interestingly, despite this high variability, our results suggest a highly concentrated market of third-party providers: across all countries, three providers serve an average of 92% of the surveyed websites, and Google by itself serves an average of 70%. Even more concerning, these differences persist a year later, with increasing dependencies, particularly for DNS and CDNs. We briefly explore various factors that may help explain the differences and similarities in degrees of third-party dependency across countries, including economic conditions, Internet development, economic trading partners, website categories, home countries, and the traffic skewness of each country's top-500 sites.
In this paper, we study kernelized bandits with distributed biased feedback. This problem is motivated by several real-world applications (such as dynamic pricing, cellular network configuration, and policy making), where users from a large population contribute to the reward of the action chosen by a central entity, but it is difficult to collect feedback from all users. Instead, only biased feedback (due to user heterogeneity) from a subset of users may be available. In addition to such partial biased feedback, we also face two practical challenges: communication cost and computation complexity. To tackle these challenges, we carefully design a new distributed phase-then-batch-based elimination (DPBE) algorithm, which samples users in phases to collect feedback and reduce the bias, and employs maximum variance reduction to select actions in batches within each phase. By properly choosing the phase length, the batch size, and the confidence width used for eliminating suboptimal actions, we show that DPBE achieves a sublinear regret of Õ(T^{1-α/2} + √(γ_T T)), where α ∈ (0,1) is a user-sampling parameter one can tune. Moreover, DPBE can significantly reduce both communication cost and computation complexity in distributed kernelized bandits, compared to some variants of state-of-the-art algorithms (originally developed for standard kernelized bandits). Furthermore, by incorporating various differential privacy models (including the central, local, and shuffle models), we generalize DPBE to provide privacy guarantees for users participating in the distributed learning process. Finally, we conduct extensive simulations to validate our theoretical results and evaluate the empirical performance.
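For readability, the regret bound can be typeset as below; γ_T is presumably the maximum information gain of the kernel after T rounds, a standard quantity in kernelized bandits that the abstract does not define explicitly:

\[
R(T) \;=\; \tilde{O}\!\left(T^{\,1-\alpha/2} + \sqrt{\gamma_T\, T}\right), \qquad \alpha \in (0,1).
\]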
A gacha game is a special opaque selling approach in which the seller sells gacha pulls to the buyer. Each gacha pull gives the buyer a certain probability of winning the gacha game reward. Gacha games have been enthusiastically embraced in numerous online video games and have a wide range of potential applications. In this work, we model the complex interaction between the seller and the buyer as a Stackelberg game, where the sequential decision of the buyer is modeled as a Markov Decision Process (MDP). We define the whale property in the context of gacha games and then show that it is a necessary condition for achieving optimal revenue. Moreover, we provide the revenue-optimal gacha game design and show that it is equivalent to the single-item single-bidder Myerson auction. We further explore two popular multi-item gacha games, namely the sequential multi-item gacha game and the banner-based multi-item gacha game. We also discuss subsidies in gacha games and demonstrate how subsidies may encourage the buyer to engage in grinding behavior. Finally, we provide a case study on blockchain systems as gacha games.
Programmable switches have been widely used to design network monitoring solutions that operate entirely in the fast data plane, e.g., detecting heavy hitters and super-spreaders, or computing flow-size distributions and their entropy. Many existing works on network monitoring assume that switches deploy a single memory that is accessible to each processed packet. However, high-speed ASIC switches increasingly deploy multiple independent pipes, each equipped with its own independent memory that cannot be accessed by the other pipes.
In this work, we initiate the study of deploying existing heavy-hitter data-plane monitoring solutions on multi-pipe switches, where the packets of a "flow" may spread over multiple pipes, i.e., be stored in distinct memories. We first quantify the accuracy degradation due to splitting a monitoring data structure across multiple pipes (e.g., up to 3000x worse average flow-size estimation error). We then present PipeCache, a system that adapts existing data-plane mechanisms to multi-pipe switches by carefully storing all the monitoring information of each traffic class in exactly one specific pipe (as opposed to replicating the information across multiple pipes). PipeCache relies on the idea of briefly storing monitoring information in a per-pipe cache and then piggybacking this information onto existing data packets headed to the correct pipe, entirely at data-plane speed. We implement PipeCache on ASIC switches and evaluate it using a real-world trace. We show that, when augmented with PipeCache, existing data-plane mechanisms achieve accuracy levels and memory requirements similar to those of single-pipe deployments (i.e., up to 16x lower memory requirements).
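A minimal software sketch of the per-pipe cache-and-piggyback idea is shown below; the hash-based owner-pipe mapping, the data structures, and the update format are hypothetical choices made for illustration, not PipeCache's actual ASIC design.

```python
import zlib
from collections import defaultdict

NUM_PIPES = 4

def owner_pipe(flow_key):
    """Each traffic class is owned by exactly one pipe (illustrative hash policy)."""
    return zlib.crc32(flow_key.encode()) % NUM_PIPES

class PipeMonitor:
    """Per-pipe state: monitoring counters for owned flows, plus a small cache of
    updates for flows owned by other pipes, to be piggybacked onto data packets."""
    def __init__(self, pipe_id):
        self.pipe_id = pipe_id
        self.local_counts = defaultdict(int)                   # owned-flow state
        self.pending = defaultdict(lambda: defaultdict(int))   # owner_pipe -> {flow: delta}

    def on_packet(self, flow_key, size):
        owner = owner_pipe(flow_key)
        if owner == self.pipe_id:
            self.local_counts[flow_key] += size
        else:
            self.pending[owner][flow_key] += size              # cache until a packet heads there

    def piggyback_for(self, egress_pipe):
        """Called when a data packet is about to cross to egress_pipe: attach cached deltas."""
        return self.pending.pop(egress_pipe, {})

    def absorb(self, deltas):
        """Merge piggybacked deltas received at the owner pipe."""
        for flow_key, size in deltas.items():
            self.local_counts[flow_key] += size
```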
Public hosting services provide convenience for domain owners to build web applications with better scalability and security. However, if a domain name points to released service endpoints (e.g., nameservers allocated by a provider), adversaries can take over the domain by applying for the same endpoints. Such a security threat is called "hosting-based domain takeover". In recent years, a large number of domain takeover incidents have occurred; even well-known websites like the subdomains of microsoft.com have been impacted. However, until now, there has been no effective detection system to identify these vulnerable domains on a large scale. In this paper, we fill this research gap by presenting a novel framework, HostingChecker, for detecting domain takeovers. Compared with previous work, HostingChecker expands the detection scope and improves the detection efficiency by (i) systematically identifying vulnerable hosting services using a semi-automated method, and (ii) effectively detecting vulnerable domains through passive reconstruction of domain dependency chains. The framework enables us to check the subdomains of Tranco sites on a daily basis. We evaluate the effectiveness of HostingChecker and detect 10,351 subdomains of Tranco Top-1M apex domains that are vulnerable to domain takeover, over 8× more than previous findings. Furthermore, we conduct an in-depth security analysis of the affected vendors, such as Amazon and Alibaba, and gain a suite of new insights, including flawed implementations of domain ownership validation. Following responsible disclosure processes, we have reported the issues to the security response centers of the affected vendors, and some (e.g., Baidu and Tencent) have already adopted our mitigation.
Modern data center networks are required to support huge and complex forwarding policies as they handle the traffic of various tenants. However, these policies cannot be stored in their entirety within the limited memory available at commodity switches. The common approach in such scenarios is to have SDN controllers manage the memory available at the switch as a fast cache, updating and changing the forwarding rules in the cache according to the workload dynamics and the global policy at hand. Many such policies, such as longest-prefix-match (LPM) policies, introduce dependencies between the forwarding rules. Ensuring that the cache content is always consistent with the global policy often requires the switch to store (potentially many) superfluous rules, which may lead to suboptimal performance in terms of delay and throughput.
To overcome these deficiencies, previous work suggested the concept of splicing, where modified go-to-controller rules can be inserted into the cache to improve performance while maintaining consistency. These works focused mostly on heuristics, and it was conjectured that the problem is computationally intractable. As our main result, we show that the problem of determining the optimal set of rules, with splicing, can actually be solved efficiently, by presenting a polynomial-time algorithm that produces an optimal solution: for a given cache size, we find an optimal set of rules, some of which are go-to-controller, that maximizes the total weight of the cache while maintaining consistency. However, such optimality comes at a cost: our algorithm has a significantly larger running time than state-of-the-art (SoTA) solutions that do not employ splicing. We therefore further present a heuristic exhibiting close-to-optimal performance with significantly improved running time, matching that of the best algorithm that does not employ splicing. In addition, we present the results of an evaluation study that compares the performance of our solutions with that of SoTA approaches, showing that splicing can reduce the cache miss ratio by as much as 30% without increasing the cache size. Lastly, we propose a simple and fast-to-compute (consistency-oblivious) metric to evaluate the potential benefit of splicing over classical LPM caching for a given policy and traffic distribution. We show that our metric is highly correlated with this benefit, and thus serves as an indication of whether splicing should be incorporated within the system architecture.
We examine the problem of smoothed online optimization, where a decision maker must sequentially choose points in a normed vector space to minimize the sum of per-round, non-convex hitting costs and the costs of switching decisions between rounds. The decision maker has access to a black-box oracle, such as a machine learning model, that provides untrusted and potentially inaccurate predictions of the optimal decision in each round. The goal of the decision maker is to exploit the predictions if they are accurate, while guaranteeing performance that is not much worse than the hindsight-optimal sequence of decisions even when predictions are inaccurate. We impose the standard assumption that hitting costs are globally α-polyhedral. We propose a novel algorithm, Adaptive Online Switching (AOS), and prove that, for a large set of feasible δ > 0, it is (1+δ)-competitive if predictions are perfect, while maintaining a uniformly bounded competitive ratio of 2^{Õ(1/(αδ))} even when predictions are adversarial. Further, we prove that this trade-off is necessary and nearly optimal, in the sense that any deterministic algorithm that is (1+δ)-competitive when predictions are perfect must be at least 2^{Ω̃(1/(αδ))}-competitive when predictions are inaccurate. In fact, we observe a unique threshold-type behavior in this trade-off: if δ is not in the set of feasible options, then no algorithm is simultaneously (1+δ)-competitive if predictions are perfect and ζ-competitive when predictions are inaccurate for any ζ < ∞. Furthermore, we show that memory is crucial in AOS by proving that any algorithm that does not use memory cannot benefit from predictions. We complement our theoretical results with a numerical study of a microgrid application.
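In compact form, the consistency-robustness trade-off stated above reads as follows (a restatement of the abstract's claims; CR denotes the competitive ratio):

\[
\mathrm{CR}(\mathrm{AOS}) \le 1+\delta \ \ \text{(perfect predictions)}, \qquad
\mathrm{CR}(\mathrm{AOS}) \le 2^{\tilde{O}(1/(\alpha\delta))} \ \ \text{(adversarial predictions)},
\]
\[
\mathrm{CR}(\mathcal{A}) \ge 2^{\tilde{\Omega}(1/(\alpha\delta))} \ \ \text{under inaccurate predictions, for any deterministic } \mathcal{A} \text{ that is } (1+\delta)\text{-competitive with perfect predictions.}
\]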
We study a multi-agent reinforcement learning (MARL) problem where the agents interact over a given network. The goal of the agents is to cooperatively maximize the average of their entropy-regularized long-term rewards. To overcome the curse of dimensionality and to reduce communication, we propose a Localized Policy Iteration (LPI) algorithm that provably learns a near-globally-optimal policy using only local information. In particular, we show that, despite restricting each agent's attention to only its κ-hop neighborhood, the agents are able to learn a policy with an optimality gap that decays polynomially in κ. In addition, we show the finite-sample convergence of LPI to the globally optimal policy, which explicitly captures the trade-off between optimality and computational complexity in choosing κ. Numerical simulations demonstrate the effectiveness of LPI.
Understanding the resource consumption of the mobile web is an important topic that has garnered much attention in recent years. However, existing works mostly focus on the networking or computational aspects of the mobile web and largely ignore memory, which is an important aspect given the mobile web's reliance on resource-heavy JavaScript. In this paper, we propose a framework, called JS Capsules, for characterizing the memory of JavaScript functions, and, using this framework, we investigate the key browser mechanics that contribute to memory overhead. Leveraging our framework on a testbed of Android mobile phones, we conduct measurements of the Alexa top 1K websites. While most existing memory-focused frameworks concentrate on V8, the JavaScript engine used in most popular browsers, our measurements show that the memory implications of JavaScript extend far beyond V8, owing to the cascading effects that certain JavaScript calls have on the browser's rendering mechanics. We quantify and highlight the direct impact that a website's DOM has on JavaScript memory overhead and present, to our knowledge, the first root-cause analysis to dissect and characterize this impact on JavaScript memory overheads.
Modern cloud block services provide cloud users with virtual block disks (VDisks) and usually rely on a forwarding layer consisting of multiple proxy servers to forward block-level writes from applications to the underlying distributed storage. However, we discover that severe traffic imbalance exists among the proxy servers at the forwarding layer, creating a performance bottleneck that severely prolongs the latency of accessing VDisks. Worse yet, due to the diverse access patterns of VDisks, stable traffic and burst traffic coexist at the forwarding layer, making existing load-balancing designs inefficient at balancing the traffic at the forwarding layer of VDisks, as they are unaware of, and lack the ability to differentiate, the decomposable burst and stable traffic. To this end, we propose DiffForward, a novel traffic forwarding scheme for cloud block services. DiffForward differentiates burst traffic from stable traffic in an accurate and efficient way at the client side; it then forwards the burst traffic to a decentralized distributed log store, which realizes real-time load balance by writing data in a round-robin manner, and balances the stable traffic by segmentation. DiffForward also judiciously coordinates the stable and burst traffic and preserves strong consistency under differentiated forwarding. Extensive experiments with real-life workloads on our prototype show that DiffForward effectively balances the traffic at the forwarding layer at a fine-grained subsecond level, significantly reducing the write latency of VDisks.
Resource disaggregation offers a cost-effective solution to resource scaling, utilization, and failure handling in data centers by physically separating hardware devices within a server. Servers are architected as pools of processor, memory, and storage devices, organized as independent, failure-isolated components interconnected by a high-bandwidth network. A critical challenge, however, is the high performance penalty of accessing data from a remote memory module over the network. Addressing this challenge is difficult, as disaggregated systems exhibit high runtime variability in network latencies and bandwidth, and page migration can significantly delay critical-path cache-line accesses in other pages.
This paper conducts a characterization analysis of different data movement strategies in fully disaggregated systems, evaluates their performance overheads on a variety of workloads, and introduces DaeMon, the first software-transparent mechanism to significantly alleviate data movement overheads in fully disaggregated systems. First, to enable scalability to multiple hardware components in the system, we enhance each compute and memory unit with specialized engines that transparently handle data migrations. Second, to achieve high performance and provide robustness across various network, architecture, and application characteristics, we implement a synergistic approach of bandwidth partitioning, link compression, decoupled data movement at multiple granularities, and adaptive granularity selection in data movements. We evaluate DaeMon on a wide variety of workloads under different network and architecture configurations using a state-of-the-art simulator. DaeMon improves system performance by 2.39× and reduces data access costs by 3.06× over the widely adopted approach of moving data at page granularity.
Today, location-based services have become prevalent on mobile platforms, where mobile apps provide specific services to a user based on his or her location. Unfortunately, mobile apps can aggressively harvest location data with much higher accuracy and frequency than they need, because the coarse-grained access control mechanism currently implemented in mobile operating systems (e.g., Android) cannot regulate such behavior. This unnecessary data collection violates the data minimization policy, yet no previous studies have investigated privacy violations from this perspective, and existing techniques are insufficient to address this violation. To fill this knowledge gap, we take the first step toward detecting and measuring this privacy risk in mobile apps at scale. In particular, we annotate and release the first dataset to characterize aggressive location harvesting apps and to understand the challenges of automatic detection and classification. Next, we present a novel system, LocationScope, to address these challenges by (i) uncovering how an app collects locations and how it uses such data through a fine-tuned value set analysis technique, (ii) recognizing the fine-grained location-based services an app provides by embedding the data-flow paths extracted from its location data usage, a combination of program analysis and machine learning techniques, and (iii) identifying aggressive apps with an outlier detection technique, achieving a precision of 97% in aggressive app detection. We have further applied our technique to millions of free Android apps from Google Play as of 2019 and 2021. Highlights of our measurements on the detected aggressive apps include their growing trend from 2019 to 2021 and the significant contribution of app generators to aggressive location harvesting apps.
Consumers, regulators, and ISPs all use client-based "speed tests" to measure network performance, both in single-user settings and in aggregate. Two prevalent speed tests, Ookla's Speedtest and Measurement Lab's Network Diagnostic Test (NDT), are often used for similar purposes, despite having significant differences in both the test design and implementation, and in the infrastructure used to perform measurements. In this paper, we present the first-ever comparative evaluation of Ookla and NDT7 (the latest version of NDT), both in controlled and wide-area settings. Our goal is to characterize when and to what extent these two speed tests yield different results, as well as the factors that contribute to the differences. To study the effects of the test design, we conduct a series of controlled, in-lab experiments under a comprehensive set of network conditions and usage modes (e.g., TCP congestion control, native vs. browser client). Our results show that Ookla and NDT7 report similar speeds under most in-lab conditions, with the exception of networks that experience high latency, where Ookla consistently reports higher throughput. To characterize the behavior of these tools in wide-area deployment, we collect more than 80,000 pairs of Ookla and NDT7 measurements across nine months and 126 households, with a range of ISPs and speed tiers. This first-of-its-kind paired-test analysis reveals many previously unknown systemic issues, including high variability in NDT7 test results and systematically under-performing servers in the Ookla network.
The performance of many cloud-based applications critically depends on the capacity of the underlying datacenter network. A particularly innovative approach to improving throughput in datacenters is enabled by emerging optical technologies, which allow the physical network topology to be adjusted dynamically, in either an oblivious or a demand-aware manner. However, such topology engineering, i.e., the operation and control of dynamic datacenter networks, is considered complex and currently comes with restrictions and overheads.
We present Duo, a novel demand-aware reconfigurable rack-to-rack datacenter network design realized with a simple and efficient control plane. Duo is based on the well-known de Bruijn topology (implemented using a small number of optical circuit switches) and the key observation that this topology can be enhanced using dynamic ("opportunistic") links between its nodes.
In contrast to previous systems, Duo has several desirable features: i) It makes effective use of the network capacity by supporting integrated and multi-hop routing (paths that combine both static and dynamic links). ii) It uses work-conserving queue scheduling, which enables out-of-the-box TCP support. iii) It employs greedy routing, implemented using standard IP longest-prefix match with small forwarding tables. iv) During topological reconfigurations, its routing tables require only local updates, making this approach ideal for dynamic networks.
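For intuition, greedy routing on a de Bruijn topology can be sketched as a bit-shifting next-hop rule, as below; Duo's actual data plane realizes this with IP longest-prefix-match forwarding over its static and dynamic links, and the binary node labeling here is purely illustrative.

```python
def next_hop(curr, dst, k):
    """One greedy hop on a binary de Bruijn graph with 2**k nodes.
    Nodes are k-bit integers; node u has edges to ((u << 1) | b) mod 2**k for b in {0, 1}.
    We shift in the bit of dst that extends the longest suffix of curr
    already matching a prefix of dst, so the overlap grows every hop."""
    mask = (1 << k) - 1
    if curr == dst:
        return curr
    for overlap in range(k - 1, -1, -1):
        if overlap == 0 or (curr & ((1 << overlap) - 1)) == (dst >> (k - overlap)):
            next_bit = (dst >> (k - overlap - 1)) & 1
            return ((curr << 1) | next_bit) & mask

def route(src, dst, k):
    """Greedy path from src to dst; at most k hops on the static topology."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], dst, k))
    return path
```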
We evaluate Duo in end-to-end packet-level simulations, comparing it to state-of-the-art static and dynamic network designs. We show that Duo provides higher throughput, shorter paths, lower flow completion times for high-priority flows, and minimal packet reordering, all using existing network- and transport-layer protocols. We also report on a proof-of-concept implementation of Duo's control and data planes.
IPv6 adoption continues to grow, making up more than 40% of client traffic to Google globally. While the ubiquity of the IPv4 address space makes it comparatively easy to understand, the vast and less studied IPv6 address space has motivated a variety of works detailing methodologies to collect and analyze IPv6 properties, many of which use knowledge from specific data sources as a lens for answering research questions. Despite such work, questions remain about basic properties such as the appropriate prefix size for different research tasks.
Our work fills this knowledge gap by presenting a ground-up analysis of the apportionment of the IPv6 address space, using data and knowledge from numerous data sources simultaneously, aimed at identifying how to leverage IPv6 address information for a variety of research tasks. Utilizing WHOIS data from the RIRs, routing data, and hitlists, we highlight fundamental differences in apportionment sizes and structural properties depending on the data source and examination method. We focus on the different perspectives each dataset offers and on the disjoint, heterogeneous nature of these datasets when taken together. We additionally leverage a graph-based analysis method for these datasets that allows us to draw conclusions about when and how to intersect the datasets and about their utility. The differences in each dataset's perspective are not due to dataset problems but rather stem from a variety of differing structural and deployment behaviors across RIRs and IPv6 providers alike. In light of these inconsistencies, we discuss network address partitioning, best practices, and considerations for future IPv6 measurement and analysis projects.
There are a number of forums where people participate under pseudonyms. One example is peer review, where the identity of reviewers for any paper is confidential. When participating in these forums, people frequently engage in "batching": executing multiple related tasks (e.g., commenting on multiple papers) at nearly the same time. Our empirical analysis shows that batching is common in two applications we consider -- peer review and Wikipedia edits. In this paper, we identify and address the risk of deanonymization arising from linking batched tasks. To protect against linkage attacks, we take the approach of adding delay to the posting time of batched tasks. We first show that under some natural assumptions, no delay mechanism can provide a meaningful differential privacy guarantee. We therefore propose a "one-sided" formulation of differential privacy for protecting against linkage attacks. We design a mechanism that adds zero-inflated uniform delay to events and show it can preserve privacy. We prove that this noise distribution is in fact optimal in minimizing expected delay among mechanisms adding independent noise to each event, thereby establishing the Pareto frontier of the trade-off between the expected delay for batched and unbatched events. Finally, we conduct a series of experiments on Wikipedia and Bitcoin data that corroborate the practical utility of our algorithm in obfuscating batching without introducing onerous delay to a system.
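The delay mechanism can be sketched as follows; the parameter values (zero-inflation probability and maximum delay) are placeholders for illustration, since the paper's privacy-optimal calibration is not given in the abstract.

```python
import random

def zero_inflated_uniform_delay(p_zero, max_delay, rng=random):
    """Sample a posting delay from a zero-inflated uniform distribution:
    with probability p_zero the event is posted immediately (delay 0);
    otherwise the delay is drawn uniformly from [0, max_delay]."""
    if rng.random() < p_zero:
        return 0.0
    return rng.uniform(0.0, max_delay)

def delayed_post_times(event_times, p_zero=0.3, max_delay=3600.0, seed=42):
    """Independently delay each event's visible timestamp, obfuscating batches."""
    rng = random.Random(seed)
    return [t + zero_inflated_uniform_delay(p_zero, max_delay, rng) for t in event_times]
```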
Mean field approximation is a powerful technique which has been used in many settings to study large-scale stochastic systems. In the case of two-timescale systems, the approximation is obtained by a combination of scaling arguments and the use of the averaging principle. This paper analyzes the approximation error of this 'average' mean field model for a two-timescale model (X, Y), where the slow component X describes a population of interacting particles which is fully coupled with a rapidly changing environment Y. The model is parametrized by a scaling factor N, e.g. the population size, which as N gets large decreases the jump size of the slow component in contrast to the unchanged dynamics of the fast component. We show that under relatively mild conditions, the 'average' mean field approximation has a bias of order O(1/N) compared to E[X]. This holds true under any continuous performance metric in the transient regime, as well as for the steady-state if the model is exponentially stable. To go one step further, we derive a bias correction term for the steady-state, from which we define a new approximation called the refined 'average' mean field approximation, whose bias is of order O(1/N²). This refined 'average' mean field approximation allows computing an accurate approximation even for small scaling factors, i.e., N ≈ 10-50. We illustrate the developed framework and accuracy results through an application to a random access CSMA model.
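In symbols, writing X^avg for the 'average' mean field model and C for the derived bias-correction term (names chosen here only for illustration), the steady-state result above says roughly:

\[
\mathbb{E}[X_N] \;=\; X^{\mathrm{avg}} + \frac{C}{N} + O\!\left(\frac{1}{N^2}\right),
\]

so the plain approximation X^avg has bias O(1/N), while the refined approximation X^avg + C/N has bias O(1/N²).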
Up-to-date and accurate prediction of Channel State Information (CSI) is of paramount importance in Ultra-Reliable Low-Latency Communications (URLLC), specifically in dynamic environments where unpredictable mobility is inherent. CSI can be meticulously tracked by means of frequent pilot transmissions, which, on the downside, increase metadata (overhead signaling) and latency, both of which are detrimental to URLLC. To overcome these issues, in this paper we take a fundamentally different approach and propose PEACH, a machine learning system which utilizes environmental information in the form of depth images to predict CSI amplitude in beyond-5G systems, without requiring metadata radio resources such as pilot overheads or any feedback mechanism. PEACH exploits depth images by employing a convolutional neural network to predict the current and the next 100 ms CSI amplitudes. The proposed system is experimentally validated with extensive measurements conducted in an indoor environment, involving two static receivers and two transmitters, one of which is placed on top of a mobile robot. We show that environmental information can be instrumental for proactive CSI amplitude acquisition at base stations for both static and mobile users, while providing performance almost similar to that of pilot-based methods and completely avoiding the dependency on feedback and pilot transmission for both downlink and uplink CSI. Furthermore, compared to traditional pilot estimation based on demodulation reference signals in ideal conditions without interference, our experimental results show that PEACH yields the same performance in terms of average bit error rate when channel conditions are poor (using low-order modulation), while not being much worse when using higher modulation orders, such as 16-QAM or 64-QAM. More importantly, in realistic cases with interference taken into account, our experiments demonstrate considerable improvements introduced by PEACH in terms of the normalized mean square error of CSI amplitude estimation, up to 6 dB, compared to traditional approaches.
This paper presents a first-of-its-kind performance measurement of Wi-Fi 6 (IEEE 802.11ax) using real experiments. Our experiments focus on multi-client scenarios. The results reveal the impact of the new channel access mechanisms (i.e., OFDMA and TWT) on spectrum efficiency, energy consumption, latency, and network security. (i) A comparison with the legacy CSMA/CA scheme shows that commodity Wi-Fi 6 achieves 3× the overall throughput and dramatically reduces latency (by 5×) when coexisting with a legacy Wi-Fi network. (ii) However, the current OFDMA implementation significantly increases power consumption (by 6×), implying a design tradeoff between the throughput and latency gains and the cost of energy consumption. (iii) Finally, the TWT negotiation procedure is vulnerable to various malicious attacks. We believe that our findings provide critical insights for the scheduling algorithm design, power optimization, and security protection of next-generation WLANs.
We investigate the problem of stabilizing an unknown networked linear system under communication constraints and adversarial disturbances. We propose the first provably stabilizing algorithm for this problem. The algorithm uses a distributed version of nested convex body chasing to maintain a consistent estimate of the network dynamics and applies system level synthesis to determine a distributed controller based on this estimated model. Our approach avoids the need for system identification and accommodates a broad class of communication delays, while being fully distributed and scaling favorably with the number of subsystems.
Finite-state automata serve as compute kernels for many application domains such as pattern matching and data analytics. Existing approaches on GPUs exploit three levels of parallelism in automata processing tasks: 1) input-stream level, 2) automaton level, and 3) state level. Among these, only state-level parallelism is intrinsic to automata, while the other two levels depend on the number of automata and input streams to be processed. As GPU resources increase, a parallelism-limited automata processing task can underutilize GPU compute resources.
To this end, we propose AsyncAP, a low-overhead approach that optimizes for both scalability and throughput. Our insight is that most automata processing tasks have an additional source of parallelism, originating from the input symbols, which has not been leveraged before. Making the matching process associated with automata tasks asynchronous, i.e., letting parallel GPU threads start processing an input stream from different input locations instead of processing it serially, improves throughput significantly and scales with input length.
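The idea can be caricatured in a few lines of host-side code; this toy version uses substring search and CPU threads purely to illustrate input-offset parallelism, whereas AsyncAP applies the same principle to NFA matching with GPU threads.

```python
from concurrent.futures import ThreadPoolExecutor

def match_from(text, pattern, start, end):
    """Report every match of `pattern` that *starts* in [start, end); a worker may
    read past `end` into the neighbouring chunk to finish a match it started."""
    return [i for i in range(start, end) if text.startswith(pattern, i)]

def async_match(text, pattern, num_workers=4):
    """Workers begin matching at different offsets of the same input stream
    instead of scanning it serially, so work scales with input length."""
    chunk = (len(text) + num_workers - 1) // num_workers
    ranges = [(w * chunk, min((w + 1) * chunk, len(text))) for w in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(match_from, text, pattern, s, e) for s, e in ranges]
    return sorted(hit for f in futures for hit in f.result())
```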
A detailed evaluation across 12 applications shows that, when a task does not have enough parallelism to utilize all the GPU cores, AsyncAP achieves up to 58× average speedup over the state-of-the-art GPU automata processing engine. When tasks have enough parallelism to utilize the GPU cores, AsyncAP still achieves a 2.4× speedup.