Proceedings of the ACM on Measurement and Analysis of Computing Systems: POMACS: Vol. 6, No. 1. 2022

Full Citation in the ACM Digital Library

POMACS V6, N1, March 2022 Editorial

The ACM Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) focuses on the measurement and performance evaluation of computer systems and operates in close collaboration with the ACM Special Interest Group SIGMETRICS. All papers in this issue of POMACS will be presented during the ACM SIGMETRICS/Performance 2022 conference. The issue contains papers selected by the editorial board via a rigorous review process that follows a hybrid conference and journal model, with reviews conducted by the 97 members of our POMACS editorial board. Each paper was either conditionally accepted (and shepherded), allowed a "one-shot" revision (to be resubmitted to one of the subsequent two deadlines), or rejected (with resubmission allowed after a year). For this issue, which represents the fall deadline, we accepted 25 papers out of 106 submissions (including 4 papers that had been given a "one-shot" revision opportunity). All submitted papers received at least 3 reviews and we held an online TPC meeting. Based on the indicated primary track, roughly 30% of the submissions were in the Measurement & Applied Modeling track, 28% were in the Systems track, 26% were in the Theory track, and 15% were in the Learning track. Many people contributed to the success of this issue of POMACS. First, we would like to thank the authors, who submitted their best work to SIGMETRICS/POMACS. Second, we would like to thank the TPC members who provided constructive feedback in their reviews to authors and participated in the online discussions and TPC meetings. We also thank several external reviewers who provided their expert opinion on specific submissions that required additional input. We are also grateful to the SIGMETRICS Board Chair, Giuliano Casale, and to past TPC Chairs. Finally, we are grateful to the Organization Committee and to the SIGMETRICS Board for their ongoing efforts and initiatives for creating an exciting program for ACM SIGMETRICS/Performance 2022.

Argus: Predictable Millimeter-Wave Picocells with Vision and Learning Augmentation

We propose Argus, a system to enable millimeter-wave (mmWave) deployers to quickly complete site-surveys without sacrificing the accuracy and effectiveness of thorough network deployment surveys. Argus first models the mmWave reflection profile of an environment, considering dominant reflectors, and then use this model to find locations that maximize the usability of the reflectors. The key component in Argus is an efficient machine learning model that can map the visual data to the mmWave signal reflections of an environment and can accurately predict mmWave signal profile at any unobserved locations. It allows Argus to find the best picocell locations to provide maximum coverage and also lets users self-localize accurately anywhere in the environment. Furthermore, Argus allows mmWave picocells to predict device's orientation accurately and enables object tagging and retrieval for VR/AR applications. Currently, we implement and test Argus on two different buildings consisting of multiple different indoor environments. However, the generalization capability of Argus can easily update the model for unseen environments, and thus, Argus can be deployed to any indoor environment with little or no model fine-tuning.

Automatic Inference of BGP Location Communities

The Border Gateway Protocol (BGP) orchestrates Internet communications between and inside Autonomous Systems. BGP's flexibility allows operators to express complex policies and deploy advanced traffic engineering systems. A key mechanism to provide this flexibility is tagging route announcements with BGP communities, which have arbitrary, operator-defined semantics, to pass information or requests from router to router. Typical uses of BGP communities include attaching metadata to route announcements, such as where a route was learned or whether it was received from a customer, and controlling route propagation, for example to steer traffic to preferred paths or blackhole DDoS traffic. However, there is no standard for specifying the semantics nor a centralized repository that catalogs the meaning of BGP communities. The lack of standards and central repositories complicates the use of communities by the operator and research communities. In this paper, we present a set of techniques to infer the semantics of BGP communities from public BGP data. Our techniques infer communities related to the entities or locations traversed by a route by correlating communities with AS paths. We also propose a set of heuristics to filter incorrect inferences introduced by misbehaving networks, sharing of BGP communities among sibling autonomous systems, and inconsistent BGP dumps. We apply our techniques to billions of routing records from public BGP collectors and make available a public database with more than 15 thousand location communities. Our comparison with manually-built databases shows our techniques provide high precision (up to 93%), better coverage (up to 81% recall), and dynamic updates, complementing operators' and researchers' abilities to reason about BGP community semantics.

A Comprehensive Empirical Study of Query Performance Across GPU DBMSes

In recent years, GPU database management systems (DBMSes) have rapidly become popular largely due to their remarkable acceleration capability obtained through extreme parallelism in query evaluations. However, there has been relatively little study on the characteristics of these GPU DBMSes for a better understanding of their query performance in various contexts. Also, little has been known about what the potential factors could be that affect the query processing jobs within the GPU DBMSes. To fill this gap, we have conducted a study to identify such factors and to propose a structural causal model, including key factors and their relationships, to explicate the variances of the query execution times on the GPU DBMSes. We have also established a set of hypotheses drawn from the model that explained the performance characteristics. To test the model, we have designed and run comprehensive experiments and conducted in-depth statistical analyses on the obtained empirical data. As a result, our model achieves about 77% amount of variance explained on the query time and indicates that reducing kernel time and data transfer time are the key factors to improve the query time. Also, our results show that the studied systems should resolve several concerns such as bounded processing within GPU memory, lack of rich query evaluation operators, limited scalability, and GPU under-utilization.

Curvature-based Analysis of Network Connectivity in Private Backbone Infrastructures

The main premise of this work is that since large cloud providers can and do manipulate probe packets that traverse their privately owned and operated backbones, standard traceroute-based measurement techniques are no longer a reliable means for assessing network connectivity in large cloud provider infrastructures. In response to these developments, we present a new empirical approach for elucidating private connectivity in today's Internet. Our approach relies on using only "light-weight" ( i.e., simple, easily-interpretable, and readily available) measurements, but requires applying a "heavy-weight" or advanced mathematical analysis. In particular, we describe a new method for assessing the characteristics of network path connectivity that is based on concepts from Riemannian geometry ( i.e., Ricci curvature) and also relies on an array of carefully crafted visualizations ( e.g., a novel manifold view of a network's delay space). We demonstrate our method by utilizing latency measurements from RIPE Atlas anchors and virtual machines running in data centers of three large cloud providers to (i) study different aspects of connectivity in their private backbones and (ii) show how our manifold-based view enables us to expose and visualize critical aspects of this connectivity over different geographic scales.

Data-Driven Network Path Simulation with iBox

While network simulation is widely used for evaluating network protocols and applications, ensuring realism remains a key challenge. There has been much work on simulating network mechanisms faithfully (e.g., links, buffers, etc.), but less attention on the critical task of configuring the simulator to reflect reality. We present iBox ("Internet in a Box"), which enables data-driven network path simulation, using input/output packet traces gathered at the sender/receiver in the target network to create a model of the end-to-end behaviour of a network path. Our work builds on recent work in this direction and makes three contributions: (1) estimation of a lightweight non reactive cross-traffic model, (2) estimation of a more powerful reactive cross-traffic model based on Bayesian optimization, and (3) evaluation of iBox in the context of congestion control variants in an Internet research testbed and also controlled experiments with known ground truth.

Data Convection: A GPU-Driven Case Study for Thermal-Aware Data Placement in 3D DRAMs

Stacked DRAMs have been studied, evaluated in multiple scenarios, and even productized in the last decade. The large available bandwidth they offer make them an attractive choice, particularly, in high-performance computing (HPC) environments. Consequently, many prior research efforts have studied and evaluated 3D stacked DRAM-based designs. Despite offering high bandwidth, stacked DRAMs are severely constrained by the overall memory capacity offered. In this paper, we study and evaluate integrating stacked DRAM on top of a GPU in a 3D manner which in tandem with the 2.5D stacked DRAM increases the capacity and the bandwidth without increasing the package size. This integration of 3D stacked DRAMs aids in satisfying the capacity requirements of emerging workloads like deep learning. Though this vertical 3D integration of stacked DRAMs also increases the total available bandwidth, we observe that the bandwidth offered by these 3D stacked DRAMs is severely limited by the heat generated on the GPU. Based on our experiments on a cycle-level simulator, we make a key observation that the sections of the 3D stacked DRAM that are closer to the GPU have lower retention-times compared to the farther layers of stacked DRAM. This thermal-induced variable retention-times causes certain sections of 3D stacked DRAM to be refreshed more frequently compared to the others, thereby resulting in thermal-induced NUMA paradigms. To alleviate such thermal-induced NUMA behavior, we propose and experimentally evaluate three different incarnations of Data Convection, i.e., Intra-layer, Inter-layer, and Intra + Inter-layer, that aim at placing the most-frequently accessed data in a thermal-induced retention-aware fashion, taking into account both bank-level and channel-level parallelism. Our evaluations on a cycle-level GPU simulator indicate that, in a multi-application scenario, our Intra-layer, Inter-layer and Intra + Inter-layer algorithms improve the overall performance by 1.8%, 11.7%, and 14.4%, respectively, over a baseline that already encompasses 3D+2.5D stacked DRAMs.

Differentially Private Reinforcement Learning with Linear Function Approximation

Motivated by the wide adoption of reinforcement learning (RL) in real-world personalized services, where users' sensitive and private information needs to be protected, we study regret minimization in finite-horizon Markov decision processes (MDPs) under the constraints of differential privacy (DP). Compared to existing private RL algorithms that work only on tabular finite-state, finite-actions MDPs, we take the first step towards privacy-preserving learning in MDPs with large state and action spaces. Specifically, we consider MDPs with linear function approximation (in particular linear mixture MDPs) under the notion of joint differential privacy (JDP), where the RL agent is responsible for protecting users' sensitive data. We design two private RL algorithms that are based on value iteration and policy optimization, respectively, and show that they enjoy sub-linear regret performance while guaranteeing privacy protection. Moreover, the regret bounds are independent of the number of states, and scale at most logarithmically with the number of actions, making the algorithms suitable for privacy protection in nowadays large-scale personalized services. Our results are achieved via a general procedure for learning in linear mixture MDPs under changing regularizers, which not only generalizes previous results for non-private learning, but also serves as a building block for general private reinforcement learning.

Dynamic Regret Minimization for Control of Non-stationary Linear Dynamical Systems

We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon T with fixed and known cost matrices Q,R, but unknown and non-stationary dynamics A_t, B_t. The sequence of dynamics matrices can be arbitrary, but with a total variation, V_T, assumed to be o(T) and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal controllers is available for all t, we present an algorithm that achieves the optimal dynamic regret of O(V_T^2/5 T^3/5 ). With piecewise constant dynamics, our algorithm achieves the optimal regret of O(sqrtST ) where S is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual Multi-armed Bandit problems. We also argue that non-adaptive forgetting (e.g., restarting or using sliding window learning with a static window size) may not be regret optimal for the LQR problem, even when the window size is optimally tuned with the knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and therefore we believe our results should find wider application.

End-to-end Characterization of Game Streaming Applications on Mobile Platforms

With the advent of 5G, supporting high-quality game streaming applications on edge devices has become a reality. This is evidenced by a recent surge in cloud gaming applications on mobile devices. In contrast to video streaming applications, interactive games require much more compute power for supporting improved rendering (such as 4K streaming) with the stipulated frames-per second (FPS) constraints. This in turn consumes more battery power in a power-constrained mobile device. Thus, the state-of-the-art gaming applications suffer from lower video quality (QoS) and/or energy efficiency. While there has been a plethora of recent works on optimizing game streaming applications, to our knowledge, there is no study that systematically investigates the <QoS, Energy> design pairs on the end-to-end game streaming pipeline across the cloud, network, and edge devices to understand the individual contributions of the different stages of the pipeline for improving the overall QoS and energy efficiency. In this context, this paper presents a comprehensive performance and power analysis of the entire game streaming pipeline consisting of the server/cloud side, network, and edge. Through extensive measurements with a high-end workstation mimicking the cloud end, an open-source platform (Moonlight-GameStreaming) emulating the edge device/mobile platform, and two network settings (WiFi and 5G) we conduct a detailed measurement-based study with seven representative games with different characteristics. We characterize the performance in terms of frame latency, QoS, bitrate, and energy consumption for different stages of the gaming pipeline. Our study shows that the rendering stage and the encoding stage at the cloud end are the bottlenecks to support 4K streaming. While 5G is certainly more suitable for supporting enhanced video quality with 4K streaming, it is more expensive in terms of power consumption compared to WiFi. Further, fluctuations in 5G network quality can lead to huge frame drops thus affecting QoS, which needs to be addressed by a coordinated design between the edge device and the server. Finally, the network interface and the decoder units in a mobile platform need more energy-efficient design to support high quality games at a lower cost. These observations should help in designing more cost-effective future cloud gaming platforms.

Free2Shard: Adversary-resistant Distributed Resource Allocation for Blockchains

In this paper, we study a canonical distributed resource allocation problem arising in blockchains. While distributed resource allocation is a well-studied problem in networking, the blockchain setting additionally requires the solution to be resilient to adversarial behavior from a fraction of nodes. Scaling blockchain performance is a basic research topic; a plethora of solutions (under the umbrella of sharding ) have been proposed in recent years. Although the various sharding solutions share a common thread (they cryptographically stitch together multiple parallel chains), architectural differences lead to differing resource allocation problems. In this paper we make three main contributions: (a) we categorize the different sharding proposals under a common architectural framework, allowing for the emergence of a new, uniformly improved, uni-consensus sharding architecture. (b) We formulate and exactly solve a core resource allocation problem in the uni-consensus sharding architecture -- our solution, Free2shard, is adversary-resistant and achieves optimal throughput. The key technical contribution is a mathematical connection to the classical work of Blackwell approachability in dynamic game theory. (c) We implement the sharding architecture atop a full-stack blockchain in 3000 lines of code in Rust -- we achieve a throughput of more than 250,000 transactions per second with 6 shards, a vast improvement over state-of-the-art.

NG-Scope: Fine-Grained Telemetry for NextG Cellular Networks

Accurate and highly-granular channel capacity telemetry of the cellular last hop is crucial for the effective operation of transport layer protocols and cutting edge applications, such as video on demand and video telephony. This paper presents the design, implementation, and experimental performance evaluation of NG-Scope, the first such telemetry tool able to fuse physical-layer channel occupancy readings from the cellular control channel with higher-layer packet arrival statistics and make accurate capacity estimates. NG-Scope handles the latest cellular innovations, such as when multiple base stations aggregate their signals together to serve mobile users. End-to-end experiments in a commercial cellular network demonstrate that wireless capacity varies significantly with channel quality, mobility, competing traffic within each cell, and the number of aggregated cells. Our experiments demonstrate significantly improved cell load estimation accuracy, missing the detection of less than 1% of data capacity overall, a reduction of 82% compared to OWL, the state-of-the-art in cellular monitoring. Further experiments show that MobileInsight-based CLAW has a root-mean-squared capacity error of 30.5 Mbit/s, which is 3.3× larger than NG-Scope (9.2 Mbit/s).

Mean Field and Refined Mean Field Approximations for Heterogeneous Systems: It Works!

Mean field approximation is a powerful technique to study the performance of large stochastic systems represented as n interacting objects. Applications include load balancing models, epidemic spreading, cache replacement policies, or large-scale data centers. Mean field approximation is asymptotically exact for systems composed of n homogeneous objects under mild conditions. In this paper, we study what happens when objects are heterogeneous. This can represent servers with different speeds or contents with different popularities. We define an interaction model that allows obtaining asymptotic convergence results for stochastic systems with heterogeneous object behavior, and show that the error of the mean field approximation is of order $O(1/n)$. More importantly, we show how to adapt the refined mean field approximation, developed by Gast et al., and show that the error of this approximation is reduced to O(1/n^2). To illustrate the applicability of our result, we present two examples. The first addresses a list-based cache replacement model, RANDOM(m), which is an extension of the RANDOM policy. The second is a heterogeneous supermarket model. These examples show that the proposed approximations are computationally tractable and very accurate. They also show that for moderate system sizes (30) the refined mean field approximation tends to be more accurate than simulations for any reasonable simulation time.

Memory Space Recycling

Many program codes from different application domains process very large amounts of data, making their cache memory behavior critical for high performance. Most of the existing work targeting cache memory hierarchies focus on improving data access patterns, e.g., maximizing sequential accesses to program data structures via code and/or data layout restructuring strategies. Prior work has addressed this data locality optimization problem in the context of both single-core and multi-core systems. Another dimension of optimization, which can be as equally important/beneficial as improving data access pattern is to reduce the data volume (total number of addresses) accessed by the program code. Compared to data access pattern restructuring, this volume minimization problem has relatively taken much less attention. In this work, we focus on this volume minimization problem and address it in both single-core and multi-core execution scenarios. Specifically, we explore the idea of rewriting an application program code to reduce its "memory space footprint". The main idea behind this approach is to reuse/recycle, for a given data element, a memory location that has originally been assigned to another data element, provided that the lifetimes of these two data elements do not overlap with each other. A unique aspect is that it is "distance aware", i.e., in identifying the memory/cache locations to recycle it takes into account the physical distance between the location of the core and the memory/cache location to be recycled. We present a detailed experimental evaluation of our proposed memory space recycling strategy, using five different metrics: memory space consumption, network footprint, data access distance, cache miss rate, and execution time. The experimental results show that our proposed approach brings, respectively, 33.2%, 48.6%, 46.5%, 31.8%, and 27.9% average improvements in these metrics, in the case of single-threaded applications. With the multi-threaded versions of the same applications, the achieved improvements are 39.5%, 55.5%, 53.4%, 26.2%, and 22.2%, in the same order.

Metamorphic Testing of Deep Learning Compilers

The prosperous trend of deploying deep neural network (DNN) models to diverse hardware platforms has boosted the development of deep learning (DL) compilers. DL compilers take the high-level DNN model specifications as input and generate optimized DNN executables for diverse hardware architectures like CPUs, GPUs, and various hardware accelerators. Compiling DNN models into high-efficiency executables is not easy: the compilation procedure often involves converting high-level model specifications into several different intermediate representations (IR), e.g., graph IR and operator IR, and performing rule-based or learning-based optimizations from both platform-independent and platform-dependent perspectives. Despite the prosperous adoption of DL compilers in real-world scenarios, principled and systematic understanding toward the correctness of DL compilers does not yet exist. To fill this critical gap, this paper introduces MT-DLComp, a metamorphic testing framework specifically designed for DL compilers to effectively uncover erroneous compilations. Our approach leverages deliberately-designed metamorphic relations (MRs) to launch semantics-preserving mutations toward DNN models to generate their variants. This way, DL compilers can be automatically examined for compilation correctness utilizing DNN models and their variants without requiring manual intervention. We also develop a set of practical techniques to realize an effective workflow and localize identified error-revealing inputs. Real-world DL compilers exhibit a high level of engineering quality. Nevertheless, we detected over 435 inputs that can result in erroneous compilations in four popular DL compilers, all of which are industry-strength products maintained by Amazon, Facebook, Microsoft, and Google. While the discovered error-triggering inputs do not cause the DL compilers to crash directly, they can lead to the generation of incorrect DNN executables. With substantial manual effort and help from the DL compiler developers, we uncovered four bugs in these DL compilers by debugging them using the error-triggering inputs. Our proposed testing frameworks and findings can be used to guide developers in their efforts to improve DL compilers.

NURA: A Framework for Supporting Non-Uniform Resource Accesses in GPUs

Multi-application execution in Graphics Processing Units (GPUs), a promising way to utilize GPU resources, is still challenging. Some pieces of prior work (e.g., spatial multitasking) have limited opportunity to improve resource utilization, while other works, e.g., simultaneous multi-kernel, provide fine-grained resource sharing at the price of unfair execution. This paper proposes a new multi-application paradigm for GPUs, called NURA, that provides high potential to improve resource utilization and ensures fairness and Quality-of-Service (QoS). The key idea is that each streaming multiprocessor (SM) executes Cooperative Thread Arrays (CTAs) belong to only one application (similar to the spatial multi-tasking) and shares its unused resources with the SMs running other applications demanding more resources. NURA handles resource sharing process mainly using a software approach to provide simplicity, low hardware cost, and flexibility. We also perform some hardware modifications as an architectural support for our software-based proposal. We conservatively analyze the hardware cost of our proposal, and observe less than 1.07% area overhead with respect to the whole GPU die. Our experimental results over various mixes of GPU workloads show that NURA improves GPU system throughput by 26% compared to state-of-the-art spatial multi-tasking, on average, while meeting the QoS target. In terms of fairness, NURA has almost similar results to spatial multitasking, while it outperforms simultaneous multi-kernel by an average of 76%.

Online Optimization with Feedback Delay and Nonlinear Switching Cost

We study a variant of online optimization in which the learner receives k-rounddelayed feedback about hitting cost and there is a multi-step nonlinear switching cost, i.e., costs depend on multiple previous actions in a nonlinear manner. Our main result shows that a novel Iterative Regularized Online Balanced Descent (iROBD) algorithm has a constant, dimension-free competitive ratio that is $O(L^2k )$, where L is the Lipschitz constant of the switching cost. Additionally, we provide lower bounds that illustrate the Lipschitz condition is required and the dependencies on k and L are tight. Finally, via reductions, we show that this setting is closely related to online control problems with delay, nonlinear dynamics, and adversarial disturbances, where iROBD directly offers constant-competitive online policies.

Robustness and Consistency in Linear Quadratic Control with Untrusted Predictions

We study the problem of learning-augmented predictive linear quadratic control. Our goal is to design a controller that balances consistency, which measures the competitive ratio when predictions are accurate, and robustness, which bounds the competitive ratio when predictions are inaccurate.

Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization

Stochastic approximation (SA) and stochastic gradient descent (SGD) algorithms are work-horses for modern machine learning algorithms. Their constant stepsize variants are preferred in practice due to fast convergence behavior. However, constant stepsize SA algorithms do not converge to the optimal solution, but instead have a stationary distribution, which in general cannot be analytically characterized. In this work, we study the asymptotic behavior of the appropriately scaled stationary distribution, in the limit when the constant stepsize goes to zero. Specifically, we consider the following three settings: (1) SGD algorithm with a smooth and strongly convex objective, (2) linear SA algorithm involving a Hurwitz matrix, and (3) nonlinear SA algorithm involving a contractive operator. When the iterate is scaled by 1/α, where α is the constant stepsize, we show that the limiting scaled stationary distribution is a solution of an implicit equation. Under a uniqueness assumption (which can be removed in certain settings) on this equation, we further characterize the limiting distribution as a Gaussian distribution whose covariance matrix is the unique solution of a suitable Lyapunov equation. For SA algorithms beyond these cases, our numerical experiments suggest that unlike central limit theorem type results: (1) the scaling factor need not be 1/α, and (2) the limiting distribution need not be Gaussian. Based on the numerical study, we come up with a heuristic formula to determine the right scaling factor, and make insightful connection to the Euler-Maruyama discretization scheme for approximating stochastic differential equations.

The First 5G-LTE Comparative Study in Extreme Mobility

5G claims to support mobility up to 500 km/h according to the 3GPP standard. However, its field performance under high-speed scenes remains in mystery. In this paper, we conduct the first large-scale measurement campaign on a high-speed railway route operating at the maximum speed of 350 km/h, with full coverage of LTE and 5G (NSA and SA) along the track. Our study consumed 1788.8 GiB of cellular data in six months, covering the three major carriers in China and the recent standardized QUIC protocol. Based on our dataset, we reveal the key characteristics of 5G and LTE in extreme mobility in terms of throughput, RTT, loss rate, signal quality, and physical resource utilization. We further develop a taxonomy of handovers in both LTE and 5G and carry out the link-layer latency breakdown analysis. Our study pinpoints the deficiencies in the user equipment, radio access network, and core network which hinder seamless connectivity and better utilization of 5G's high bandwidth. Our findings highlight the directions of the next step in the 5G evolution.

SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures

Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures, after decades of research efforts. Near-bank PIM architectures place simple cores close to DRAM banks. Recent research demonstrates that they can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the Sparse Matrix Vector Multiplication (SpMV) kernel. SpMV has been characterized as one of the most significant and thoroughly studied scientific computation kernels. It is primarily a memory-bound kernel with intensive memory accesses due its algorithmic nature, the compressed matrix format used, and the sparsity patterns of the input matrices given. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies on SpMV for a multithreaded PIM core, including (1) various compressed matrix formats, (2) load balancing schemes across parallel threads and (3) synchronization approaches, and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels to perform the complete SpMV computation only using PIM cores, and (2) 2D-partitioned kernels to strive a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to an Intel Xeon CPU and an NVIDIA Tesla V100 GPU to study the performance and energy efficiency of various devices, i.e., both memory-centric PIM systems and conventional processor-centric CPU/GPU systems, for the SpMV kernel. SparseP software package provides 25 SpMV kernels for real PIM systems supporting the four most widely used compressed matrix formats, i.e., CSR, COO, BCSR and BCOO, and a wide range of data types. SparseP is publicly and freely available at Our extensive evaluation using 26 matrices with various sparsity patterns provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems.

Understanding I/O Direct Cache Access Performance for End Host Networking

Direct Cache Access (DCA) enables a network interface card (NIC) to load and store data directly on the processor cache, as conventional Direct Memory Access (DMA) is no longer suitable as the bridge between NIC and CPU in the era of 100 Gigabit Ethernet. As numerous I/O devices and cores compete for scarce cache resources, making the most of DCA for networking applications with varied objectives and constraints is a challenge, especially given the increasing complexity of modern cache hardware and I/O stacks. In this paper, we reverse engineer details of one commercial implementation of DCA, Intel's Data Direct I/O (DDIO), to explicate the importance of hardware-level investigation into DCA. Based on the learned knowledge of DCA and network I/O stacks, we (1) develop an analytical framework to predict the effectiveness of DCA (i.e., its hit rate) under certain hardware specifications, system configurations, and application properties; (2) measure penalties of the ineffective use of DCA (i.e., its miss penalty) to characterize its benefits; and (3) show that our reverse engineering, measurement, and model contribute to a deeper understanding of DCA, which in turn helps diagnose, optimize, and design end-host networking.

Fusing Speed Index during Web Page Loading

With conventional web page load metrics (e.g., Page Load Time) being blamed for deviating from actual user experiences, in recent years a more sensible and complex metric called Speed Index (SI) has been widely adopted to measure the user's quality of experience (QoE). In brief, SI indicates how quickly a page is filled up with above-the-fold visible elements (or crucial elements for short). To date, however, SI has been used as a metric for performance evaluation, rather than as an explicit heuristic to improve page loading. To demystify this, we examine the entire loading process of various pages and ascribe such incapability to three-fold fundamental uncertainties in terms of network, browser execution, and viewport size. In this paper, we design SipLoader, an SI-oriented page load scheduler through a novel cumulative reactive scheduling framework. It does not attempt to deal with uncertainties in advance or in one shot, but schedules page loading by "repairing" the anticipated (nearly) SI-optimal scheduling when uncertainties actually occur. This is achieved with a suite of efficient designs that fully exploit the cumulative nature of SI calculation. Evaluations show that SipLoader improves the median SI by 41%, and provides 1.43 times to 1.99 times more benefits than state-of-the-art solutions.