The Proceedings of the ACM series present the highest quality research conducted in diverse areas of computer science, as represented by the ACM Special Interest Groups (SIGs). The ACM Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) focuses on the measurement and performance evaluation of computer systems and operates in close collaboration with the Special Interest Group SIGMETRICS. All the papers in the last three issues of POMACS will be presented during the joint ACM SIGMETRICS / IFIP Performance 2019 conference. Despite the extraordinary progress of networks and computer systems in the last decades, these are far from perfect. There are still performance issues and security threats, and the complexity of these systems make them often difficult to manage and operate. Research is expected to contribute to the further improvements of these systems, to make them more efficient, more robust, more flexible and more predictable. In the era of global warming, there is also an urgent need to make the digital economy more green, and thus beneficial to all in the long term. The papers that have appeared in POMACS during the past year reflect the ability of the journal to attract high-quality submissions in this area. The last three issues have included papers on mainstream topics of the ACM SIGMETRICS / IFIP Performance communities, including load balancing, scheduling, resource allocation, performance evaluation and measurement, as well as emerging topics that have become prominent in recent years, such as virtualization, network science, and machine learning. As part of a new initiative to highlight work from outside the SIGMETRICS community that may spur research within the community, the 2019 conference will include a Highlights Beyond SIGMETRICS session featuring leading papers from conferences in neighboring disciplines. We thank Adam Wierman for leading this initiative.
Most congestion control algorithms, like TCP, rely on a reactive control system that detects congestion, then marches carefully towards a desired operating point (e.g. by modifying the window size or adjusting a rate). In an effort to balance stability and convergence speed, they often take hundreds of RTTs to converge; an increasing problem as networks get faster, with less time to react.
This paper is about an alternative class of congestion control algorithms based on proactive-scheduling: switches and NICs "pro-actively" exchange control messages to run a \em distributed algorithm to pick "explicit rates for each flow. We call these Proactive Explicit Rate Control (PERC) algorithms. They take as input the routing matrix and link speeds, but not a congestion signal. By exploiting information such as the number of flows at a link, they can converge an order of magnitude faster than reactive algorithms.
Our main contributions are (1) s-PERC ("stateless" PERC), a new practical distributed PERC algorithm without per-flow state at the switches, and (2) a proof that s-PERC computes exact max-min fair rates in a known bounded time, the first such algorithm to do so without per-flow state. To analyze s-PERC, we introduce a parallel variant of standard waterfilling, 2-Waterfilling. We prove that s-PERC converges to max-min fair in 6N rounds, where N is the number of iterations 2-Waterfilling takes for the same routing matrix.
We describe how to make s-PERC practical and robust to deploy in real networks. We confirm using realistic simulations and an FPGA hardware testbed that s-PERC converges 10-100x faster than reactive algorithms like TCP, DCTCP and RCP in data-center networks and 1.3--6x faster in wide-area networks (WANs). Long flows complete in close to the ideal time, while short-lived flows are prioritized, making it appropriate for data-centers and WANs.
Although using look-ahead information is known to improve the competitive ratios of online convex optimization (OCO) problems with switching costs, the competitive ratios obtained from existing results often depend on the cost coefficients of the problem, and can potentially be large. In this paper, we propose new online algorithms that can utilize look-ahead to achieve much lower competitive ratios for OCO problems with switching costs and hard constraints. For the perfect look-ahead case where the algorithm is provided with the exact inputs in a future look-ahead window of size K, we design an Averaging Regularized Moving Horizon Control (ARMHC) algorithm that can achieve a competitive ratio of K+1/K. To the best of our knowledge, ARMHC is the first to attain a low competitive ratio that is independent of either the coefficients of the switching costs and service costs, or the upper and lower bounds of the inputs. Then, for the case when the future look-ahead has errors, we develop a Weighting Regularized Moving Horizon Control (WRMHC) algorithm that carefully weights the decisions inside the look-ahead window based on the accuracy of the look-ahead information. As a result, WRMHC also achieves a low competitive ratio that is independent of the cost coefficients, even with uncertain hard constraints. Finally, our analysis extends online primal-dual analysis to the case with look-ahead by introducing a novel "re-stitching" idea, which is of independent interest.
Mean field modeling is a popular approach to assess the performance of large scale computer systems. The evolution of many mean field models is characterized by a set of ordinary differential equations that have a unique fixed point. In order to prove that this unique fixed point corresponds to the limit of the stationary measures of the finite systems, the unique fixed point must be a global attractor. While global attraction was established for various systems in case of exponential job sizes, it is often unclear whether these proof techniques can be generalized to non-exponential job sizes. In this paper we show how simple monotonicity arguments can be used to prove global attraction for a broad class of ordinary differential equations that capture the evolution of mean field models with hyperexponential job sizes. This class includes both existing as well as previously unstudied load balancing schemes and can be used for systems with either finite or infinite buffers. The main novelty of the approach exists in using a Coxian representation for the hyperexponential job sizes and a partial order that is stronger than the componentwise partial order used in the exponential case.
Virtualization is becoming increasingly common in data centers due to its various advantages. However, how to choose among different platforms, including both software and hardware, is a considerable challenge. In this context, evaluating the virtualization capabilities of different platforms is critically important. Regrettably, the existing benchmarks are not qualified for meeting this requirement. Different hardware mechanisms and hypervisor designs introduce many different hypervisor-level events, such as transitions between VMs and the hypervisor, two-dimensional page walk, and binary translation. These events are key factors affecting virtualization performance. Existing benchmarks either overlook these changes or are tightly coupled to a particular hypervisor. In this paper, we present HyperBench, a benchmark suite that focuses on the capabilities of different virtualization platforms. Currently, we design 15 hypervisor benchmarks covering CPU, memory, and I/O. The virtualization-sensitive operation in each benchmark triggers hypervisor-level events, which examines the platform's ability in the target area. HyperBench is designed as a custom kernel which can adapt to different hypervisors and architectures. What's more, adding a new benchmark is pretty easy. Finally, we perform a series of experiments on the host machine and several popular hypervisors, such as QEMU, KVM, and Xen, demonstrating that HyperBench is capable of revealing the performance implications of the hardware mechanism and hypervisor design.
Many systems, such as the Internet, social networks, and the power grid, can be represented as graphs. When analyzing graphs, it is often useful to compute scores describing the relative importance or distance between nodes. One example is Personalized PageRank (PPR), which assigns to each node v a vector whose i-th entry describes the importance of the i-th node from the perspective of v. PPR has proven useful in many applications, such as recommending who users should follow on social networks (if this i-th entry is large, v may be interested in following the i-th user). Unfortunately, computing n PPR vectors exactly for a graph of n nodes has complexity O(n^3), which is infeasible for many graphs of interest. In this work, we devise a scheme to estimate all n PPR vectors with bounded l_1 error and complexity O(nc), where c < 2 depends on the degrees of the graph at hand, the desired error tolerance, and a parameter that defines PPR. This improves upon existing methods, the best of which have complexity O(n2 łog n) in our setting. Our complexity guarantee holds with high probability, for certain choices of the PPR parameter, and for a certain class of random graphs (roughly speaking, the sparse directed configuration model with heavy-tailed in-degrees); our accuracy guarantee holds with probability 1 and for arbitrary graphs and PPR parameters. The complexity result arises as a consequence of our main (structural) result, which shows that the dimensionality of the set of PPR vectors scales sublinearly in n with high probability, for the same class of random graphs and for a notion of dimensionality similar to matrix rank. It is this coupling of the PPR vectors for the nodes on a common underlying graph that allows for estimating them faster. Hence, at a high level, our scheme is analogous to (but distinct from) low-rank matrix approximation. We also note that our scheme is similar to one that was proposed in [Jeh and Widom 2003] but lacked accuracy and complexity guarantees, so another contribution of our paper is to address this gap in the literature.
Container Orchestration Platforms (COPs), such as Kubernetes, are increasingly used to manage large-scale clusters by automating resource allocation between applications encapsulated in containers. Increasingly, the resources underlying COPs are virtual machines (VMs) dynamically acquired from cloud platforms. COPs may choose from many different types of VMs offered by cloud platforms, which differ in their cost, performance, and availability. In particular, while transient VMs cost significantly less than on-demand VMs, platforms may revoke them at any time, causing them to become unavailable. While transient VMs' price is attractive, their unreliability is a problem for COPs designed to support mixed workloads composed of, not only delay-tolerant batch jobs, but also long-lived interactive services with high availability requirements. To address the problem, we design TR-Kubernetes, a COP that optimizes the cost of executing mixed interactive and batch workloads on cloud platforms using transient VMs. To do so, TR-Kubernetes enforces arbitrary availability requirements specified by interactive services despite transient VM unavailability by acquiring many more transient VMs than necessary most of the time, which it then leverages to opportunistically execute batch jobs when excess resources are available. When cloud platforms revoke transient VMs, TR-Kubernetes relies on existing Kubernetes functions to internally revoke resources from batch jobs to maintain interactive services' availability requirements. We show that TR-Kubernetes requires minimal extensions to Kubernetes, and is capable of lowering the cost (by 53%) and improving the availability (99.999%) of a representative interactive/batch workload on Amazon EC2 when using transient compared to on-demand VMs.
Streaming analytics require real-time aggregation and processing of geographically distributed data streams continuously over time. The typical analytics infrastructure for processing such streams follow a hub-and-spoke model, comprising multiple edges connected to a center by a wide-area network (WAN). The aggregation of such streams often require that the results be available at the center within a certain acceptable delay bound. Further, the WAN bandwidth available between the edges and the center is often scarce or expensive, requiring that the traffic between the edges and the center be minimized. We propose a novel Time-to-Live (TTL-)based mechanism for real-time aggregation that provably optimizes both delay and traffic, providing a theoretical basis for understanding the delay-traffic tradeoff that is fundamental to streaming analytics. Our TTL-based optimization model provides analytical answers to how much aggregation should be performed at the edge versus the center, how much delay can be incurred at the edges, and how the edge-to-center bandwidth must be apportioned across applications with different delay requirements. To evaluate our approach, we implement our TTL-based aggregation mechanism in Apache Flink, a popular stream analytics framework. We deploy our Flink implementation in a hub-and-spoke architecture on geo-distributed Amazon EC2 data centers and a WAN-emulated local testbed, and run aggregation tasks for realistic workloads derived from extensive Akamai and Twitter traces. The delay-traffic tradeoff achieved by our Flink implementation agrees closely with theoretical predictions of our model. We show that by deriving the optimal TTLs using our model, our system can achieve a "sweet spot" where both delay and traffic are minimized, in comparison to traditional aggregation schemes such as batching and streaming.
BGP is the de-facto Internet routing protocol for exchanging prefix reachability information between Autonomous Systems (AS). It is a dynamic, distributed, path-vector protocol that enables rich expressions of network policies (typically treated as secrets). In this regime, where complexity is interwoven with information hiding, answering questions such as "what is the expected catchment of the anycast sites of a content provider on the AS-level, if new sites are deployed?", or "how will load-balancing behave if an ISP changes its routing policy for a prefix?", is a hard challenge. In this work, we present a formal model and methodology that takes into account policy-based routing and topological properties of the Internet graph, to predict the routing behavior of networks. We design algorithms that provide new capabilities for informative route inference (e.g., isolating the effect of randomness that is present in prior simulation-based approaches). We analyze the properties of these inference algorithms, and evaluate them using publicly available routing datasets and real-world experiments. The proposed framework can be useful in a number of applications: measurements, traffic engineering, network planning, Internet routing models, etc. As a use case, we study the problem of selecting a set of measurement vantage points to maximize route inference. Our methodology is general and can capture standard valley-free routing, as well as more complex topological and routing setups appearing in practice.
We derive simple bounds on the queue distribution in finite-buffer queues with Markovian arrivals. Our technique relies on a subtle equivalence between tail events and stopping times orderings. The bounds capture a truncated exponential behavior, involving joint horizontal and vertical shifts of an exponential function; this is fundamentally different than existing results capturing horizontal shifts only. Using the same technique, we obtain similar bounds on the loss distribution, which is a key metric to understand the impact of finite-buffer queues on real-time applications. Simulations show that the bounds are accurate in heavy-traffic regimes, and improve existing ones by orders of magnitude. In the limiting regime with utilization ρ=1 and iid arrivals, the bounds on the queue size distribution are insensitive to the arrivals distribution.
The overwhelmingly large design space of congestion control protocols, along with the increasingly diverse range of application environments, makes evaluating such protocols a daunting task. Simulation and experiments are very helpful in evaluating the performance of designs in specific contexts, but give limited insight into the more general properties of these schemes and provide no information about the inherent limits of congestion control designs (such as, which properties are simultaneously achievable and which are mutually exclusive). In contrast, traditional theoretical approaches are typically focused on the design of protocols that achieve to specific, predetermined objectives (e.g., network utility maximization), or the analysis of specific protocols (e.g., from control-theoretic perspectives), as opposed to the inherent tensions/derivations between desired properties. To complement today's prevalent experimental and theoretical approaches, we put forth a novel principled framework for reasoning about congestion control protocols, which is inspired by the axiomatic approach from social choice theory and game theory. We consider several natural requirements ("axioms'') from congestion control protocols -- e.g., efficient resource-utilization, loss-avoidance, fairness, stability, and TCP-friendliness -- and investigate which combinations of these can be achieved within a single design. Thus, our framework allows us to investigate the fundamental tradeoffs between desiderata, and to identify where existing and new congestion control architectures fit within the space of possible outcomes.
The approaching end of DRAM scaling and expansion of emerging memory technologies is motivating a lot of research in future memory systems. Novel memory systems are typically explored by hardware simulators that are slow and often have a simplified or obsolete abstraction of the CPU. This study presents PROFET, an analytical model that predicts how an application's performance and energy consumption changes when it is executed on different memory systems. The model is based on instrumentation of an application execution on actual hardware, so it already takes into account CPU microarchitectural details such as the data prefetcher and out-of-order engine. PROFET is evaluated on two real platforms: Sandy Bridge-EP E5-2670 and Knights Landing Xeon Phi platforms with various memory configurations. The evaluation results show that PROFET's predictions are accurate, typically with only 2% difference from the values measured on actual hardware. We release the PROFET source code and all input data required for memory system and application profiling. The released package can be seamlessly installed and used on high-end Intel platforms.
Load balancing plays a crucial role in achieving low latency in large distributed systems. Recent load balancing strategies often rely on replication or use placeholders to further improve latency. However assessing the performance and stability of these strategies is challenging and is therefore often simulation based. In this paper we introduce a unified approach to analyze the performance and stability of a broad class of workload dependent load balancing strategies. This class includes many replication policies, such as replicate below threshold, delayed replication and replicate only small jobs, as well as strategies for fork-join systems. We consider systems with general job size distributions where jobs may experience server slowdown. We show that the equilibrium workload distribution of the cavity process satisfies a functional differential equation and conjecture that the cavity process captures the limiting behavior of the system as its size tends to infinity. We study this functional differential equation in more detail for a variety of load balancing policies and propose a numerical method to solve it. The numerical method relies on a fixed point iteration or a simple Euler iteration depending on the type of functional differential equation involved. We further show that additional simplifications can be made if certain distributions are assumed to be phase-type. Various numerical examples are included that validate the numerical method and illustrate its strength and flexibility.
Graph alignment in two correlated random graphs refers to the task of identifying the correspondence between vertex sets of the graphs. Recent results have characterized the exact information-theoretic threshold for graph alignment in correlated Erdös-Rényi graphs. However, very little is known about the existence of efficient algorithms to achieve graph alignment without seeds. In this work we identify a region in which a straightforward O(n11/5 log n)-time canonical labeling algorithm, initially introduced in the context of graph isomorphism, succeeds in aligning correlated Erdos-Rényi graphs. The algorithm has two steps. In the first step, all vertices are labeled by their degrees and a trivial minimum distance alignment (i.e., sorting vertices according to their degrees) matches a fixed number of highest degree vertices in the two graphs. Having identified this subset of vertices, the remaining vertices are matched using a alignment algorithm for bipartite graphs. Finally, we show that the implementation of a variant of this algorithm allows for the efficient alignment of large graphs under limited noise.
When evaluating the impact of a policy (e.g., gun control) on a metric of interest (e.g., crime-rate), it may not be possible or feasible to conduct a randomized control trial. In such settings where only observational data is available, synthetic control (SC) methods [2-4] provide a popular data-driven approach to estimate a "synthetic" or "virtual" control by combining measurements of "similar" alternatives or units (called "donors"). Recently, robust synthetic control (RSC)  was proposed as a generalization of SC to overcome the challenges of missing data and high levels of noise, while removing the reliance on expert domain knowledge for selecting donors. However, both SC and RSC (and its variants) suffer from poor estimation when the pre-intervention period is too short. As the main contribution of this work, we propose a generalization of unidimensional RSC to multi-dimensional Robust Synthetic Control, mRSC. Our proposed mechanism, mRSC, incorporates multiple types of measurements (or metrics) in addition to the measurement of interest for estimating a synthetic control, thus overcoming the challenge of poor inference due to limited amounts of pre-intervention data. We show that the mRSC algorithm, when using K relevant metrics, leads to a consistent estimator of the synthetic control for the target unit of interest under any metric. Our finite-sample analysis suggests that the mean-squared error (MSE) of our predictions decays to zero at a rate faster than the RSC algorithm by a factor of K and √K for the training (pre-intervention) and testing (post-intervention) periods, respectively. Additionally, we propose a principled scheme to combine multiple metrics of interest via a diagnostic test that evaluates if adding a metric can be expected to result in improved inference. Our mechanism for validating mRSC performance is also an important and related contribution of this work: time series prediction. We propose a method to predict the future evolution of a time series based on limited data when the notion of time is relative and not absolute, i.e., where we have access to a donor pool that has already undergone the desired future evolution. We conduct extensive experimentation to establish the efficacy of mRSC in three different scenarios: predicting the evolution of a metric of interest using synthetically generated data from a known factor model, and forecasting weekly sales and score trajectories of a Walmart store and Cricket game, respectively.
Deliberate use of approximate computing has been an active research area recently. Observing that many application programs from different domains can live with less-than-perfect accuracy, existing techniques try to trade off program output accuracy with performance-energy savings. While these works provide point solutions, they leave three critical questions regarding approximate computing unanswered, especially in the context of dropping/skipping costly data accesses: (i) what is the maximum potential of skipping (i.e., not performing) data accesses under a given inaccuracy bound?; (ii) can we identify the data accesses to drop randomly, or is being architecture aware (i.e., identifying the costliest data accesses in a given architecture) critical?; and (iii) do two executions that skip the same number of data accesses always result in the same output quality (error)? This paper first provides answers to these questions using ten multithreaded workloads, and then, motivated by the negative answer to the third question, presents a program slicing-based approach that identifies the set of data accesses to drop such that (i) the resulting performance/energy benefits are maximized and (ii) the execution remains within the error (inaccuracy) bound specified by the user. Our slicing-based approach first uses backward slicing and then forward slicing to decide the set of data accesses to drop. Our experimental evaluations using ten multithreaded workloads show that, when averaged over all benchmark programs we have, 8.8% performance improvement and 13.7% energy saving are possible when we set the error bound to 2%, and the corresponding improvements jump to 15% and 25%, respectively, when the error bound is raised to 4%.
The Least Recently Used (LRU) caching and its variants are used in large-scale data systems in order to provide high-speed data access for a wide class of applications. Nonetheless, a fundamental question still remains open: in order to minimize miss probabilities, how should the cache space be organized to serve multiple data flows? Commonly used strategies can be categorized into two designs: pooled LRU (PLRU) caching and separated LRU (SLRU) caching. However, neither of these designs can satisfactorily solve this problem. PLRU caching is easy to implement and self-adaptive, but does not often achieve optimal or even efficient performance because its set of feasible solutions are limited. SLRU caching can be statically configured to achieve optimal performance for stationary workload, which nevertheless could suffer in a dynamically changing environment and from a cold-start problem. To this end, we propose a new insertion based pooled LRU paradigm, termed I-PLRU, where data flows can be inserted at different positions of a pooled cache. This new design can achieve the optimal performance of the static SLRU, and retains the adaptability of PLRU in virtue of resource sharing. Theoretically, we characterize the asymptotic miss probabilities of I-PLRU, and prove that, for any given SLRU design, there always exists an I-PLRU configuration that achieves the same asymptotic miss probability, and vice versa. We next design a policy to minimize the miss probabilities. However, the miss probability minimization problem turns out to be non-convex under the I-PLRU paradigm. Notably, we utilize an equivalence mapping between I-PLRU and SLRU to efficiently find the optimal I-PLRU configuration. We prove that I-PLRU outperforms PLRU and achieves the same miss probability as the optimal SLRU for stationary workload. Engineeringly, the flexibility of I-PLRU avoids separating the memory space, supports dynamic and refined configurations, and alleviates the cold-start problem, potentially yielding better performance than both SLRU and PLRU.
We consider the problem of learning the weighted edges of a graph by observing the noisy times of infection for multiple epidemic cascades on this graph. Past work has considered this problem when the cascade information, i.e., infection times, are known exactly. Though the noisy setting is well motivated by many epidemic processes (e.g., most human epidemics), to the best of our knowledge, very little is known about when it is solvable. Previous work on the no-noise setting critically uses the ordering information. If noise can reverse this -- a node's reported (noisy) infection time comes after the reported infection time of some node it infected -- then we are unable to see how previous results can be extended. We therefore tackle two versions of the noisy setting: the limited-noise setting, where we know noisy times of infections, and the extreme-noise setting, in which we only know whether or not a node was infected. We provide a polynomial time algorithm for recovering the structure of bidirectional trees in the extreme-noise setting, and show our algorithm matches lower bounds established in the no-noise setting, and hence is optimal. We extend our results for general degree-bounded graphs, where again we show that our (poly-time) algorithm can recover the structure of the graph with optimal sample complexity. We also provide the first efficient algorithm to learn the weights of the bidirectional tree in the limited-noise setting. Finally, we give a polynomial time algorithm for learning the weights of general bounded-degree graphs in the limited-noise setting. This algorithm extends to general graphs (at the price of exponential running time), proving the problem is solvable in the general case. All our algorithms work for any noise distribution, without any restriction on the variance.
This paper combines data-driven and model-driven methods for real-time misinformation detection. Our algorithm, named QuickStop, is an optimal stopping algorithm based on a probabilistic information spreading model obtained from labeled data. The algorithm consists of an offline machine learning algorithm for learning the probabilistic information spreading model and an online optimal stopping algorithm to detect misinformation. The online detection algorithm has both low computational and memory complexities. Our numerical evaluations with a real-world dataset show that QuickStop outperforms existing misinformation detection algorithms in terms of both accuracy and detection time (number of observations needed for detection). Our evaluations with synthetic data further show that QuickStop is robust to (offline) learning errors.
Load balancing systems, comprising a central dispatcher and a scheduling policy at each server, are widely used in practice, and their response time has been extensively studied in the theoretical literature. While much is known about the scenario where the scheduling at the servers is First-Come-First-Served (FCFS), to minimize mean response time we must use Shortest-Remaining-Processing-Time (SRPT) scheduling at the servers. Much less is known about dispatching polices when SRPT scheduling is used. Unfortunately, traditional dispatching policies that are used in practice in systems with FCFS servers often have poor performance in systems with SRPT servers. In this paper, we devise a simple fix that can be applied to any dispatching policy. This fix, called guardrails, ensures that the dispatching policy yields optimal mean response time under heavy traffic when used in a system with SRPT servers. Any dispatching policy, when augmented with guardrails, becomes heavy-traffic optimal. Our results yield the first analytical bounds on mean response time for load balancing systems with SRPT scheduling at the servers.
We consider a bandit problem with K task types from which the controller activates one task at a time. Each task takes a random and possibly heavy-tailed completion time, and a reward is obtained only after the task is completed. The task types are independent from each other, and have distinct and unknown distributions for completion time and reward. For a given time horizon τ, the goal of the controller is to schedule tasks adaptively so as to maximize the reward collected until τ expires. In addition, we allow the controller to interrupt a task and initiate a new one. In addition to the traditional exploration-exploitation dilemma, this interrupt mechanism introduces a new one: should the controller complete the task and get the reward, or interrupt the task for a possibly shorter and more rewarding alternative? We show that for all heavy-tailed and some light-tailed completion time distributions, this interruption mechanism improves the reward linearly over time. Applications of this model include server scheduling, optimal free sampling strategies in advertising and adaptive content selection. From a learning perspective, the interrupt mechanism necessitates learning the whole arm distribution from truncated observations. For this purpose, we propose a robust learning algorithm named UCB-BwI based on median-of-means estimator for possibly heavy-tailed reward and completion time distributions. We show that, in a K-armed bandit setting with an arbitrary set of L possible interrupt times, UCB-BwI achieves O(Kłog(τ)+KL) regret. We also prove that the regret under any admissible policy is Ømega(Kłog(τ)), which implies that UCB-BwI is order optimal.