Ubisoft constantly pushes the boundaries of game development to create immersive worlds that capture the imagination of millions of players worldwide. To achieve this, performance engineering plays a crucial role in ensuring that games run smoothly on various platforms and devices.
In this talk, we will explore the latest advancements in the field of performance engineering for video games, focusing on runtime performance, network optimization, backend and database optimization, and cloud gaming. We will discuss how machine learning techniques enhance classical profiling and optimize game engine scheduling.
Additionally, we will address the challenges of deterministic replication of assets between clients and optimizing micro-services for cloud gaming experiences. Lastly, we will touch on the importance of performance engineering for non-code aspects of game development, such as animation, textures, props, and assets.
Realizing modern software systems poses new challenges to software engineers: users of applications running on limited-capability devices still demand acceptable performance [2, 5, 13, 15]; users of systems that rely on artificial intelligence to make decisions (rightly) demand fair treatment [4, 7, 12]; users of social networking systems expect to be protected against malicious behaviours [1]. Moreover, AI-enabled software systems are so energy-hungry that their usage is causing an alarming surge in energy consumption, with a significant increase in CO2 emissions [10].
Equipping software with appealing functionalities and minimising faults is not enough if the emerging non-functional properties of these systems, such as fairness, safety and sustainability, are not taken into account. Mobile users will stop using an app if it is too slow or uses too much bandwidth [5, 13]. Human bias can be transferred to various real-world systems relying on ML: bias has been found in advertisement, recruitment and admission processes [3, 9, 19], among others, and even in contexts affecting human rights [16]. A growing number of malicious users exploit well-intentioned software platforms as a tool to attack the innocent users with whom they share the platform. Examples of such harmful acts are sadly too many to list; they include bullying, harassment, hate speech, misinformation, election interference, scamming and spamming. ChatGPT is an AI model able to answer a variety of questions, compose essays, have philosophical conversations, and even write or fix code [18]. However, all of this comes at a high cost: ChatGPT has been estimated to consume as much electricity per month as 175,000 people in Denmark.
In this keynote, I will discuss the necessity of taking these properties into account when realizing such systems, and the extent to which their optimization can be automated. I will discuss existing solutions mainly based on, but not limited to, multi-objective optimisation [5, 6, 8, 10, 14, 17]. In fact, we cannot expect that a software engineer, regardless of their level of expertise, would be able to manually find all opportunities for optimising these non-functional properties [11]. I will review research trends, presenting results from the SOLAR group and others. I will also discuss directions for future work and open challenges towards achieving better, fairer, safer and greener software.
In High Performance Computing, resource efficiency is paramount. Expensive systems need to be utilized to the maximum of their capabilities, but deep insight into the bottlenecks of a particular hardware-software combination is often lacking on the users' side. Analytic, first-principles performance models can provide such insight. They are built on simplified descriptions of the machine, the software, and how they interact. This goes, to some extent, against the general trend towards automation in computer science; the individual conducting the analysis does require some knowledge of the application and the hardware in order to make performance engineering a scientific process instead of blindly generating data with tools that are poorly understood. This talk uses examples from parallel high-performance computing to demonstrate how analytic performance models can support scientific thinking in performance engineering: Sparse matrix-vector multiplication, the HPCG benchmark, the CloverLeaf proxy app, and a lattice-Boltzmann solver. Interestingly, the most intriguing insights emerge from the failure of analytic models to accurately predict performance measurements.
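As a concrete illustration of this style of first-principles reasoning, the sketch below applies a simple roofline estimate to sparse matrix-vector multiplication in CRS format, one of the examples mentioned above. The peak performance and memory bandwidth figures are placeholder assumptions, not measurements from the talk.

```python
# Hypothetical roofline estimate for sparse matrix-vector multiplication (SpMV)
# in CRS format with 8-byte values and 4-byte column indices.
# Machine parameters are illustrative assumptions, not measured values.

PEAK_FLOPS = 2.6e12      # assumed peak double-precision performance [flop/s]
MEM_BW     = 200e9       # assumed attainable memory bandwidth [byte/s]

def spmv_roofline(nnz_per_row: float, alpha: float = 0.0) -> float:
    """Roofline performance prediction for CRS SpMV in flop/s.

    Per nonzero: 2 flops (multiply + add), 8 B value, 4 B column index, plus
    alpha bytes of RHS-vector traffic (0 = perfect cache reuse, 8 = none).
    Per row: 8 B load + 8 B store of the result entry and a 4 B row pointer.
    """
    flops_per_row = 2.0 * nnz_per_row
    bytes_per_row = nnz_per_row * (8 + 4 + alpha) + 8 + 8 + 4
    intensity = flops_per_row / bytes_per_row          # arithmetic intensity [flop/byte]
    return min(PEAK_FLOPS, MEM_BW * intensity)

for nnzr in (5, 20, 100):
    print(f"nnz/row = {nnzr:4d}: predicted {spmv_roofline(nnzr) / 1e9:6.1f} Gflop/s")
```

Comparing such a prediction with a measurement, and investigating the discrepancy when the two disagree, is exactly the kind of scientific process the talk advocates.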
Many organizations maintain and operate large shared computing clusters, since leveraging statistical multiplexing to amortize infrastructure costs across all users can substantially reduce computing costs. Importantly, such shared clusters are generally not free to use, but have an internal pricing model that funds their operation. Since employees at many large organizations, especially universities, have some budgetary autonomy over purchase decisions, internal shared clusters are increasingly competing for users with cloud platforms, which may offer lower costs and better performance. As a result, many organizations are shifting their shared clusters to operate on cloud resources. This paper empirically analyzes the user incentives for shared cloud clusters under two different pricing models, using an 8-year job trace from a large shared cluster operated for a large university system.
Our analysis shows that, with either pricing model, a large fraction of users have little financial incentive to participate in a shared cloud cluster compared to directly acquiring resources from a cloud platform. While shared cloud clusters can provide some limited cost reductions by leveraging reserved instances at a discount, due to bursty workloads, realizing these reductions generally requires imposing long job waiting times, which for many users are likely not worth the cost reduction. In particular, we show that, assuming users defect from the shared cluster if their wait time is greater than 15x their average job runtime, over 80% of the users would defect, which increases the price for the remaining users to the point that it eliminates any incentive to participate in a shared cluster. Thus, while shared cloud clusters may provide users other benefits, their financial incentives are weak.
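The defection rule described above is simple to express; the sketch below is a hypothetical illustration of applying the 15x wait-time threshold to per-user trace statistics, not the analysis code used in the paper.

```python
# Illustrative sketch of the defection rule: a user leaves the shared cluster
# if their mean wait time exceeds 15x their mean job runtime. The trace
# records and field names are hypothetical placeholders.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Job:
    user: str
    runtime_s: float   # job execution time
    wait_s: float      # time spent queued before the job started

def defecting_users(jobs: list[Job], threshold: float = 15.0) -> set[str]:
    runtime, wait, count = defaultdict(float), defaultdict(float), defaultdict(int)
    for j in jobs:
        runtime[j.user] += j.runtime_s
        wait[j.user] += j.wait_s
        count[j.user] += 1
    return {u for u in count
            if wait[u] / count[u] > threshold * (runtime[u] / count[u])}

# Example: one user with long jobs tolerates queueing, one with short jobs does not.
trace = [Job("alice", 3600, 600), Job("bob", 60, 1800), Job("bob", 30, 900)]
print(defecting_users(trace))   # {'bob'}
```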
With containers becoming a prevalent method of software deployment, there is an increasing interest in using container orchestration frameworks not only in data centers, but also on resource-constrained hardware, such as Internet-of-Things devices, Edge gateways, or developer workstations. Consequently, software vendors have released several lightweight Kubernetes (K8s) distributions for container orchestration in the last few years, but it remains difficult for software developers to select an appropriate solution. Existing studies on lightweight K8s distribution performance tested only small workloads, showed inconclusive results, and did not cover recently released distributions. The contribution of this paper is a comparison of MicroK8s, k3s, k0s, and MicroShift, investigating their minimal resource usage as well as control plane and data plane performance in stress scenarios. While k3s and k0s showed the highest control plane throughput by a small margin and MicroShift showed the highest data plane throughput, usability, security, and maintainability are additional factors that drive the decision for an appropriate distribution.
Autoscalers are indispensable parts of modern cloud deployments and determine the service quality and cost of a cloud application in dynamic workloads. The configuration of an autoscaler strongly influences its performance and is also one of the biggest challenges and showstoppers for the practical applicability of many research autoscalers. Many proposed cloud experiment methodologies can only be partially applied in practice, and many autoscaling papers use custom evaluation methods and metrics. This paper presents a practical guideline for obtaining meaningful and interpretable results on autoscaler performance with reasonable overhead. We provide step-by-step instructions for defining realistic usage behaviors and traffic patterns. We divide the analysis of autoscaler performance into a qualitative antipattern-based analysis and a quantitative analysis. To demonstrate the applicability of our guideline, we conduct several experiments with a microservice of our industry partner in a realistic test environment.
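One element of such a guideline is deriving a realistic load profile to replay against the autoscaler. The sketch below generates a simple diurnal arrival-rate profile; the base rate, amplitude, and noise level are assumptions for illustration, not values from the paper.

```python
# Hedged sketch: generate an hourly, diurnal request-rate profile that an
# autoscaler experiment could replay as an open workload. All parameters are
# illustrative assumptions.
import math
import random

def diurnal_profile(hours: int = 24, base_rps: float = 50.0,
                    amplitude: float = 40.0, noise: float = 5.0) -> list[float]:
    """Requests per second for each hour, peaking in the mid-afternoon."""
    random.seed(42)
    return [max(1.0, base_rps
                + amplitude * math.sin(2 * math.pi * (h - 9) / 24)
                + random.gauss(0, noise))
            for h in range(hours)]

print([round(r) for r in diurnal_profile()])
```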
GPUs have become common in HPC systems to accelerate scientific computing and machine learning applications. Efficiently mapping these applications onto rapidly evolving GPU architectures for high performance is a well-known challenge. Various performance inefficiencies exist in GPU kernels that impede applications from obtaining bare-metal performance. While existing tools are able to measure these inefficiencies, they mostly focus on data collection and presentation, requiring significant manual effort to understand the root causes for actionable optimization. Thus, we develop DrGPU, a novel profiler that performs top-down analysis to guide GPU code optimization. As its salient feature, DrGPU leverages hardware performance counters available in commodity GPUs to quantify stall cycles, decompose them into various stall reasons, pinpoint root causes, and provide intuitive optimization guidance. With the help of DrGPU, we are able to analyze important GPU benchmarks and applications and obtain nontrivial speedups of up to 1.77X on V100 and 2.03X on GTX 1650.
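To make the top-down idea concrete, the sketch below ranks stall reasons by their share of total stall cycles so that the dominant one is investigated first. The counter names and numbers are illustrative placeholders, not the exact metrics exposed by any particular GPU or by DrGPU.

```python
# Hypothetical top-down stall breakdown: given per-kernel stall-cycle
# counters, report each stall reason's share of the total, largest first.
def stall_breakdown(counters: dict[str, int]) -> list[tuple[str, float]]:
    total = sum(counters.values())
    if total == 0:
        return []
    return sorted(((reason, cycles / total) for reason, cycles in counters.items()),
                  key=lambda kv: kv[1], reverse=True)

kernel_counters = {                 # illustrative numbers only
    "memory_dependency": 5_400_000,
    "execution_dependency": 2_100_000,
    "synchronization": 900_000,
    "not_selected": 600_000,
}
for reason, share in stall_breakdown(kernel_counters):
    print(f"{reason:22s} {share:6.1%}")
```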
We present a novel benchmark suite for implementations of vector fields in high-performance computing environments to aid developers in quantifying and ranking their performance. We decompose the design space of such benchmarks into access patterns and storage backends, the latter of which can be further decomposed into components with different functional and non-functional properties. Through compile-time meta-programming, we generate a large number of benchmarks with minimal effort and ensure the extensibility of our suite. Our empirical analysis, based on real-world applications in high-energy physics, demonstrates the feasibility of our approach on CPU and GPU platforms, and highlights that our suite is able to evaluate performance-critical design choices. Finally, we propose that our work towards composing vector fields from elementary components is not only useful for the purposes of benchmarking, but that it naturally gives rise to a novel library for implementing such fields in domain applications.
Dependable power measurements are the backbone of energy-efficient computing systems. The IBM PowerNV platform offers such power measurements through an embedded PowerPC 405 processor: the On-Chip Controller (OCC). Among other system-control tasks, the OCC provides power measurements for several domains, such as system, CPU, and GPU. This paper provides a detailed description and an in-depth evaluation of these OCC-provided power measurements. For that, we describe the provided interfaces themselves and experimentally verify their overhead (3.6 µs to 10.8 µs per access) and readout rate (24.95 Sa/s). We also study the consistency of the reported sensor readouts across the measurement domains and compare it to externally measured data. Furthermore, we estimate the internal sampling rate (1996 Sa/s) by provoking aliasing errors with artificial workloads, and quantify the errors that such aliasing could introduce in practice (12% of processor power consumption in our experimental worst-case scenario). Given these insights, practitioners using the IBM PowerNV platform can assess the quality of the embedded measurements, permitting sought-after energy efficiency improvements.
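As a worked illustration of the aliasing effect discussed above: with an internal sampling rate of 1996 Sa/s, any power fluctuation faster than roughly 998 Hz folds back into the measurement band. The helper below computes the apparent (alias) frequency for an assumed workload modulation frequency; the example frequencies are hypothetical.

```python
# Alias frequency of a periodic power signal sampled at the OCC's internal
# rate of 1996 Sa/s (value taken from the abstract above).
FS_INTERNAL = 1996.0   # internal sampling rate [Sa/s]

def alias_frequency(f_signal: float, fs: float = FS_INTERNAL) -> float:
    """Frequency at which a periodic signal of f_signal Hz appears after sampling at fs."""
    return abs(f_signal - round(f_signal / fs) * fs)

for f in (500.0, 1500.0, 1995.0, 2000.0):
    print(f"{f:7.1f} Hz workload modulation -> {alias_frequency(f):7.1f} Hz alias")
```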
Model transformation languages are special-purpose languages, which are designed to define transformations as comfortably as possible, i.e., often in a declarative way. Typically, developers create their transformations based on small input models which systematically cover the language of the input models. This makes it difficult for the developers to estimate how the transformations would perform for a large and diverse set of input models.
Hence, developers would benefit from an approach for predicting the performance of model transformations based on just abstract characteristics of input models. Regression approaches based on machine learning lend themselves well to such predictions. However, it is currently unknown, whether and which regression approach is suitable in this context as well as how a model should be abstractly characterized for this purpose.
We conducted several experiments to analyze how well different machine learning methods predict the execution time of transformations defined in the Atlas Transformation Language (ATL) for distinct sets of model characteristics. As possible methods, we investigated linear regression, random forests, and support vector regression using a radial basis function kernel.
The results of our experiments show that support vector regression is the best choice in terms of usability and prediction accuracy for the model transformation modules covered in our experiments and is thus suited for a prediction approach. In addition, simple model characterizations based only on the number of model elements, the number of references, and the number of attributes are a suitable way to easily describe a model and to achieve decent prediction accuracy.
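A minimal sketch of this prediction setup is shown below: support vector regression with an RBF kernel over the three simple model characteristics named above. The use of scikit-learn and the synthetic training data are assumptions for illustration; the paper trains on measured ATL transformation execution times.

```python
# Sketch: predict transformation execution time from (#elements, #references,
# #attributes) with RBF-kernel SVR. Data below is synthetic.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.integers(100, 100_000, size=(200, 3)).astype(float)   # elements, references, attributes
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 500, 200)   # execution time [ms]

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model = TransformedTargetRegressor(regressor=svr, transformer=StandardScaler())
model.fit(X[:150], y[:150])
print("predicted execution times [ms]:", model.predict(X[150:153]).round(1))
```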
Predicting the performance and energy consumption of computing hardware is critical for many modern applications: such predictions inform procurement decisions, deployment decisions, and autonomic scaling. Existing approaches to understanding the performance of hardware largely focus on benchmarking, i.e., leveraging standardised workloads that seek to be representative of an end-user's needs. Two key challenges remain: benchmark workloads may not be representative of an end-user's workload, and benchmark scores are not easily obtained for all hardware. Within this paper, we demonstrate the potential to build Deep Learning models that predict benchmark scores for unseen hardware. We undertake our evaluation with the openly available SPEC 2017 benchmark results. We evaluate three different networks, one fully-connected network along with two Convolutional Neural Networks (one bespoke and one ResNet-inspired), and demonstrate impressive R² scores of 0.96, 0.98 and 0.94, respectively.
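A hedged sketch of the fully-connected variant is given below: a small multilayer perceptron mapping hardware descriptors to a benchmark score. The choice of PyTorch, the four input features, and the synthetic stand-in data are assumptions; the paper trains on published SPEC CPU 2017 results.

```python
# Sketch: MLP regressor from hardware features (e.g., core count, base clock,
# cache size, memory size; all normalized) to a benchmark score.
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),              # predicted benchmark score
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Synthetic stand-in data: 256 machines with 4 normalized hardware features.
X = torch.rand(256, 4)
y = (3.0 * X[:, 0] + 1.5 * X[:, 2] + 0.2 * torch.randn(256)).unsqueeze(1)

for _ in range(200):               # tiny training loop for illustration
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
print(f"final training MSE: {loss.item():.4f}")
```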
Due to the proliferation of inference tasks on mobile devices, state-of-the-art neural architectures are typically designed using Neural Architecture Search (NAS) to achieve good tradeoffs between machine learning accuracy and inference latency. Since measuring the inference latency of a huge set of candidate architectures during NAS is not feasible, latency must instead be predicted; yet latency prediction for mobile devices is challenging because of hardware heterogeneity, optimizations applied by machine learning frameworks, and the diversity of neural architectures. Motivated by these challenges, we first quantitatively assess the characteristics of neural architectures and mobile devices that have significant effects on inference latency. Based on this assessment, we propose an operation-wise framework that addresses these challenges by developing operation-wise latency predictors and achieves high accuracy in end-to-end latency predictions, as shown by our comprehensive evaluations on multiple mobile devices using multicore CPUs and GPUs. To illustrate that our approach does not require expensive data collection, we also show that accurate predictions can be achieved on real-world neural architectures using only small amounts of profiling data.
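The operation-wise idea can be illustrated compactly: the end-to-end latency of a candidate architecture is estimated by summing per-operation predictions from operation-specific predictors. The operation names, cost functions, and numbers below are trivial stand-ins for the trained per-operation predictors described in the paper.

```python
# Sketch: end-to-end latency estimate as the sum of per-operation predictions.
from typing import Callable

# Hypothetical per-operation predictors: operation configuration -> latency [ms].
OP_PREDICTORS: dict[str, Callable[[dict], float]] = {
    "conv2d": lambda cfg: 0.002 * cfg["cin"] * cfg["cout"] * cfg["k"] ** 2 / cfg["stride"],
    "depthwise_conv": lambda cfg: 0.0004 * cfg["cin"] * cfg["k"] ** 2,
    "dense": lambda cfg: 0.000001 * cfg["cin"] * cfg["cout"],
}

def predict_end_to_end(ops: list[tuple[str, dict]]) -> float:
    """Sum per-operation latency predictions over the whole network."""
    return sum(OP_PREDICTORS[name](cfg) for name, cfg in ops)

candidate = [
    ("conv2d", {"cin": 3, "cout": 32, "k": 3, "stride": 2}),
    ("depthwise_conv", {"cin": 32, "k": 3}),
    ("dense", {"cin": 1280, "cout": 1000}),
]
print(f"predicted latency: {predict_end_to_end(candidate):.2f} ms")
```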
Cyber-Physical Systems (CPS) rely on sensing to control and optimize their operation. Nevertheless, sensing itself is prone to errors that can originate at several stages, from sampling to communication. In this context, several systems adopt multivariate predictors to assess the quality of the sensed data, to replace data from faulty sensors, or to derive variables that cannot be directly sensed. These predictors are often evaluated based on their accuracy and computing demands; however, such evaluations often do not consider the system's architecture from a broader perspective, ignoring the way components are interconnected and how their outputs cascade as inputs of other Machine Learning (ML) models. In this work, we introduce a method to evaluate the performance of interdependent predictors based on the stability of the estimation-error dynamics in faulty scenarios. The proposed method estimates the ability of a predictor to produce accurate predictions while accounting for the impact of cascading predicted values as its inputs. The prediction correctness is estimated based solely on information acquired during the training of the multivariate predictors and mathematical properties of the ML activation functions. The proposed method is evaluated with a meaningful dataset in the scope of monitoring and control of a Cyber-Physical System, and the evaluation demonstrates the ability of the proposed method to account for the interdependence of data predictors.
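One common way to reason about how estimation errors propagate through cascaded predictors, in the spirit of the method described above, is to bound the amplification of an input perturbation by the product of each layer's spectral norm and the Lipschitz constant of its activation function (1 for ReLU and tanh). The sketch below is a generic bound of this kind, not the paper's exact stability criterion.

```python
# Generic sketch: worst-case amplification of an input error by a
# feed-forward predictor, bounded by the product of layer spectral norms
# and activation Lipschitz constants.
import numpy as np

def lipschitz_upper_bound(weights: list[np.ndarray], act_lipschitz: float = 1.0) -> float:
    bound = 1.0
    for W in weights:
        bound *= np.linalg.norm(W, ord=2) * act_lipschitz   # spectral norm per layer
    return bound

rng = np.random.default_rng(1)
layers = [rng.normal(0, 0.3, (16, 8)), rng.normal(0, 0.3, (8, 1))]
print(f"worst-case error amplification <= {lipschitz_upper_bound(layers):.2f}")
```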
Cloud Computing has revolutionized the information technology world and the application offering over the last two decades. At the same time, recent trends in Network Function Virtualization (NFV) and Software-Defined Wide Area Networks (SD-WAN), and their combination with the Cloud paradigm, have allowed an unprecedented shift of enterprise networking services towards the Public Cloud. Even though this evolutionary approach to networking brings many benefits, it still presents several drawbacks. The performance stability and service continuity achievable over a black-box Public Cloud infrastructure can hinder the formal service guarantees that many new emerging applications may require. To this end, in this paper, we aim to shed light on the overall performance achieved when deploying coast-to-coast and intercontinental Service Function Chains (SFCs) that interconnect geographically distributed enterprise branches over the Amazon Web Services (AWS) infrastructure. In particular, we investigate the impact of region, Virtual Machine (VM) instance type, time of day, and day of the week on the overall throughput and delay attained. The obtained results show the strengths and weaknesses of relying entirely on the AWS infrastructure to offer networking services and reveal possible hidden performance bottlenecks.
HHVM is widely deployed for large online web services, yet there remains much room for optimizing HHVM performance. This paper discusses challenges and techniques in optimizing HHVM performance for Meta's web service. We begin by evaluating the effectiveness of semantic request routing, a request routing method aimed at enhancing code cache performance in HHVM, and examine its implications for optimizing HHVM performance. Second, we characterize HHVM performance in a large-scale datacenter and identify the challenges brought by uncontrollable confounding factors. Finally, we present a performance management framework for autotuning HHVM performance at scale.
The capability to isolate system resources is an essential characteristic of virtualisation technologies and is therefore important for research and industry alike. It allows the co-location of experiments and workloads, the partitioning of system resources, and enables multi-tenant business models such as cloud computing. Poor isolation among tenants bears the risk of noisy-neighbour and contention effects, which negatively impact all of these use cases. These effects describe the negative impact of one tenant on another through the use of shared resources. Both industry and research provide many different concepts and technologies to realise isolation. Yet, the isolation capabilities of all these different approaches are not well understood, nor is there an established way to measure the quality of their isolation capabilities. Such an understanding, however, is of utmost importance in practice to make an informed decision for a suitable implementation. Hence, in this work, we present a novel methodology to measure the isolation capabilities of virtualisation technologies for system resources that fulfils all benchmarking requirements, including reliability. It relies on an immutable approach based on Experiment-as-Code. The complete process holistically includes everything from bare-metal resource provisioning to the actual experiment enactment.
The results obtained with this methodology support the decision for a virtualisation technology with regard to its capability to isolate given resources. Such results are presented here as a closing example in order to validate the proposed methodology.
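A simple way to summarise results of this kind is an isolation score that compares a tenant's performance with and without a disruptive neighbour. The metric and numbers below are illustrative assumptions, not the paper's definition or data.

```python
# Illustrative isolation score: 1.0 means no degradation under contention,
# 0.0 means the observed tenant is fully starved by its neighbour.
def isolation_score(baseline_throughput: float, contended_throughput: float) -> float:
    return max(0.0, min(1.0, contended_throughput / baseline_throughput))

print(isolation_score(baseline_throughput=1200.0, contended_throughput=900.0))   # 0.75
```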
Computer architectures have evolved from single-core chips to chips with thousands of cores. Loop- and instruction-level parallelism techniques such as software pipelining, which are successful on single cores, have limitations in the multi-core era. We extend software pipelining technology beyond the limits of fine-grained, instruction-level parallelism. We accomplish this through dataflow software pipelining technology and its extension. Specifically, we present extensions to the dataflow-based codelet model and its abstract machine to exploit pipelined parallelism across loops.
We extend the runtime implementation of the codelet model with our proposed extensions to exploit dataflow software pipelining principles, using efficient single-owner FIFO buffers across codelet dependencies. We show promising improvements from the use of dataflow software pipelining techniques through an in-depth case study of Cannon's algorithm for matrix multiplication.
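The core idea, pipelined parallelism across loops via a single-owner FIFO, can be sketched conceptually as a producer loop streaming items to a consumer loop so that the two overlap in time. The Python-thread model below illustrates the principle only; the actual codelet runtime is a native dataflow implementation, not Python threads.

```python
# Conceptual sketch: two loop stages overlapped through a bounded FIFO with a
# single producer (owner) and a single consumer.
import queue
import threading

fifo: "queue.Queue[int | None]" = queue.Queue(maxsize=4)   # bounded, single writer

def producer(n_items: int) -> None:
    for i in range(n_items):
        fifo.put(i * i)             # stand-in for producing one result of loop 1
    fifo.put(None)                  # end-of-stream marker

def consumer() -> None:
    total = 0
    while (item := fifo.get()) is not None:
        total += item               # stand-in for consuming the item in loop 2
    print("consumer result:", total)

t1 = threading.Thread(target=producer, args=(8,))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
```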
Due to increasing popularity and strict performance requirements, online games have become a workload of interest for the performance engineering community. One of the most popular types of online games is the Minecraft-like Game (MLG), in which players can terraform the environment. The most popular MLG, Minecraft, provides not only entertainment, but also educational support and social interaction, to over 130 million people world-wide. MLGs currently support their many players by replicating isolated instances that each support only up to a few hundred players under favorable conditions. In practice, as we show here, the real upper limit of supported players can be much lower. In this work, we posit that performance variability is a key cause for the lack of scalability in MLGs, investigate experimentally the causes of performance variability, and derive actionable insights. We propose a novel operational model for MLGs and use it to design the first benchmark that focuses on MLG performance variability, defining specialized workloads, metrics, and processes. We conduct real-world benchmarking of MLGs, both cloud-based and self-hosted, and find environment-based workloads and cloud deployment to be significant sources of performance variability: peak latency degrades sharply to 20.7 times the arithmetic mean and exceeds the performance requirements by a factor of 7.4. We derive actionable insights for game developers, game operators, and other stakeholders to tame performance variability.
Container orchestration frameworks play a critical role in modern cloud computing paradigms such as cloud-native or serverless computing. They significantly impact the quality and cost of service deployment as they manage many performance-critical tasks such as container provisioning, scheduling, scaling, and networking. Consequently, a comprehensive performance assessment of container orchestration frameworks is essential. However, until now, there has been no benchmarking approach that covers the many different tasks implemented in such platforms and supports evaluating different technology stacks. In this paper, we present a systematic approach that enables benchmarking of container orchestrators. Based on a definition of container orchestration, we define the core requirements and benchmarking scope for such platforms. Each requirement is then linked to metrics and measurement methods, and a benchmark architecture is proposed. With COFFEE, we introduce a benchmarking tool supporting the definition of complex test campaigns for container orchestration frameworks. We demonstrate the potential of our approach with case studies of the frameworks Kubernetes and Nomad in a self-hosted environment and on the Google Cloud Platform. The presented case studies focus on container startup times, crash recovery, rolling updates, and more.
Change point detection has recently gained popularity as a method of detecting performance changes in software due to its ability to cope with noisy data. In this paper, we present Hunter, an open source tool that automatically detects performance regressions and improvements in time-series data. Hunter uses a modified E-divisive means algorithm to identify statistically significant changes in normally distributed performance metrics. We describe the changes we made to the E-divisive means algorithm along with their motivation. The main change we adopted was to replace the significance test using randomized permutations with a Student's t-test, as we discovered that the randomized approach did not produce deterministic results, at least not with a reasonable number of iterations. In addition, we have made tweaks that allow us to find change points the original algorithm would not, such as two nearby changes. For evaluation, we developed a method to generate realistic time series with artificially injected changes in latency. We used these data sets to compare Hunter against two other well-known algorithms, PELT and DYNP. Finally, we conclude with lessons we have learned supporting Hunter across teams with individual responsibility for the performance of their project.
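The replacement significance test can be illustrated compactly: given a candidate change point, compare the samples before and after it with a t-test and accept the change only if the difference in means is significant. The sketch below shows the idea with SciPy's Welch t-test; it is a simplification, not Hunter's actual implementation.

```python
# Simplified significance check for a candidate change point at index idx.
from scipy import stats

def is_significant_change(series: list[float], idx: int, alpha: float = 0.001) -> bool:
    before, after = series[:idx], series[idx:]
    t_stat, p_value = stats.ttest_ind(before, after, equal_var=False)
    return p_value < alpha

latencies = [10.1, 10.3, 9.9, 10.2, 10.0, 10.1,     # before the regression
             12.4, 12.6, 12.5, 12.3, 12.7, 12.5]    # after the regression
print(is_significant_change(latencies, 6))           # True: a clear shift in the mean
```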
Widely used in datacenters and clouds, network traffic shaping is a performance-influencing factor that is often overlooked when benchmarking or simply deploying distributed applications. While in theory traffic shaping should allow for a fairer sharing of network resources, in practice it also introduces new problems: performance (measurement) inconsistency and long tails. In this paper we investigate the effects of traffic shaping mechanisms on common distributed applications. We characterize the performance of a distributed key-value store, big data workloads, and high-performance computing under state-of-the-art benchmarks, while the underlying network's traffic is shaped using state-of-the-art mechanisms such as token buckets or priority queues. Our results show that the impact of traffic shaping needs to be taken into account when benchmarking or deploying distributed applications. To help researchers, practitioners, and application developers, we uncover several practical implications and make recommendations on how certain applications should be deployed so that performance is least impacted by the shaping mechanisms.
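For readers unfamiliar with the mechanisms studied, the sketch below models a token bucket: packets pass only if enough tokens have accumulated, which caps the average rate while allowing bursts up to the bucket size. It is a conceptual model, not the shaper configuration evaluated in the paper.

```python
# Minimal token-bucket model: average rate capped at rate_bps, bursts allowed
# up to burst_bytes. Parameters below are illustrative assumptions.
import time

class TokenBucket:
    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate = rate_bps / 8.0          # refill rate in bytes/s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True
        return False                        # packet is delayed or dropped

bucket = TokenBucket(rate_bps=10e6, burst_bytes=15_000)   # 10 Mbit/s, ~10-packet burst
print([bucket.allow(1500) for _ in range(12)])            # burst passes, then shaping kicks in
```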
In this paper, we use Wireshark packet-level traces to study the performance of the Zoom network application. Our work is motivated by several anecdotal reports of Zoom performance problems on our campus network during the Fall 2021 semester. Through the collection and analysis of Wireshark traces from different vantage points, we are able to pinpoint the root cause of the Zoom performance problems, which is a congested external Internet link for our campus network. We also identify several characteristics of the Zoom application that exacerbate its performance issues on congested and lossy networks, due to multi-layer protocol interactions.