{2020}, {Haoran Qiu}, {OSDI}
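Fill this in last, after finishing the notes. Summarize the paper's content; when consulting these notes later, read this paragraph first. Note: a summary must come from your own thinking and be written in your own words. Do not copy the original text with Ctrl + C.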
User-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of computing resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect/localize microservices that cause SLO violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16x while reducing the overall requested CPU limit by up to 62%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11x.
These microservices must handle diverse load characteristics while
efficiently multiplexing shared resources in order to maintain SLOs like end-to-end latency.
Problems that prior work failed to solve
Unfortunately, these approaches suffer from two main problems. First, they fail to efficiently multiplex resources, such as caches, memory, I/O channels and network links, at fine granularity, and thus may not reduce SLO violations.
Second, significant human effort and training are needed to build high-fidelity performance models of large-scale microservice deployments that capture low-level resource contention.
When we mention CP alone, without a target microservice m, we mean the critical path of the “Service Response” to the client (see Fig. 2(b)), i.e., the path that determines end-to-end latency.
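To make the idea of a critical path concrete, here is a minimal sketch that finds the slowest chain of calls in a small request graph. It is not FIRM's extraction algorithm: the service names, the timings, and the assumption that a service calls all of its children in parallel are hypothetical and chosen only for illustration.

```python
from functools import lru_cache

# Hypothetical call graph: each service maps to the services it calls in parallel.
CALLS = {
    "frontend": ["compose"],
    "compose": ["text", "media", "user"],
    "text": [],
    "media": [],
    "user": [],
}
# Hypothetical time (ms) each service spends on its own work.
SELF_MS = {"frontend": 2, "compose": 5, "text": 12, "media": 30, "user": 8}


def critical_path(root, calls, self_ms):
    """Return (latency_ms, path) for the slowest chain starting at root."""

    @lru_cache(maxsize=None)
    def visit(node):
        best_latency, best_path = 0, ()
        # With parallel children, only the slowest child delays the parent.
        for child in calls[node]:
            latency, path = visit(child)
            if latency > best_latency:
                best_latency, best_path = latency, path
        return self_ms[node] + best_latency, (node,) + best_path

    latency, path = visit(root)
    return latency, list(path)


latency_ms, path = critical_path("frontend", CALLS, SELF_MS)
print(latency_ms, path)
```

Under these assumptions the end-to-end latency equals the latency of the critical path, so speeding up a service that is off the path (here `text` or `user`) would not shorten the response.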
In this section, we describe the overall architecture of the FIRM framework and its implementation.
Given the list of critical service instances, FIRM’s Resource Estimator is designed to analyze resource contention and provide reprovisioning actions for the cluster manager to take. FIRM estimates and controls a fine-grained set of resources, including CPU time, memory bandwidth, LLC capacity, disk I/O bandwidth, and network bandwidth. It makes decisions on scaling each type of resource or the number of containers by using measurements of tracing and telemetry data collected from the Tracing Coordinator.
FIRM leverages reinforcement learning (RL) to optimize resource management policies for long-term reward in dynamic microservice environments.
An RL agent solves a sequential decision-making problem (modeled as a Markov decision process) by interacting with an environment. At each discrete time step $t$, the agent observes a state $s_t$ of the environment and performs an action $a_t$. At the following time step $t+1$, the agent observes an immediate reward $r_t$ given by a reward function. The immediate reward represents the loss/gain in transitioning from $s_t$ to $s_{t+1}$ because of action $a_t$. The tuple $(s_t, a_t, r_t, s_{t+1})$ is called one transition.
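The interaction loop described above can be sketched as follows. This is a toy illustration, not FIRM's agent: the environment, the reward shape, and the threshold policy are all assumptions made up for this example, and the state is reduced to a single number (latency divided by the SLO target).

```python
from typing import Callable, List, NamedTuple, Tuple


class Transition(NamedTuple):
    state: float       # s_t
    action: int        # a_t
    reward: float      # r_t
    next_state: float  # s_{t+1}


class ToyEnv:
    """Hypothetical environment. State = latency / SLO target (> 1 means violation)."""

    def __init__(self, ratio: float = 1.5):
        self.ratio = ratio

    def step(self, action: int) -> Tuple[float, float]:
        # action: +1 adds resources (latency falls), -1 removes them, 0 does nothing.
        self.ratio = max(0.1, self.ratio - 0.2 * action)
        # Toy reward: penalise the SLO violation plus a small cost for adding resources.
        reward = -max(0.0, self.ratio - 1.0) - 0.05 * max(0, action)
        return self.ratio, reward


def collect(env: ToyEnv, policy: Callable[[float], int], steps: int) -> List[Transition]:
    """Run the policy for a number of steps and record one transition per step."""
    transitions = []
    state = env.ratio
    for _ in range(steps):
        action = policy(state)
        next_state, reward = env.step(action)
        transitions.append(Transition(state, action, reward, next_state))
        state = next_state
    return transitions


def threshold_policy(state: float) -> int:
    # Hand-written stand-in for a learned policy: add resources while violating.
    return 1 if state > 1.0 else 0


trace = collect(ToyEnv(1.5), threshold_policy, steps=4)
for tr in trace:
    print(tr)
```

A learning agent would store such transitions and use them to improve the policy; here the policy is fixed so that the data flow stays easy to follow.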
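How do the authors evaluate their method? What is the experimental setup? Which experimental data and results are of interest? Are there any problems, or anything worth borrowing?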
We drove the services with various open-loop asynchronous workload generators to represent an active production environment. We uniformly generated workloads for every request type across all microservice benchmarks. The parameters for the workload generators were the same as those for DeathStarBench and varied from predictable loads (constant, diurnal, and distributions such as Poisson) to unpredictable loads with spikes in user demand. The workload generators and the microservice benchmark applications were never co-located.
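As a rough illustration of the load shapes named above, the sketch below defines a constant profile, a diurnal profile, and a spike wrapper, then turns a profile into a per-second send schedule. The sinusoidal shape and every parameter value are assumptions for this example, not the DeathStarBench settings. The schedule is open-loop in the sense that it depends only on the clock, never on how quickly the service responds.

```python
import math
from typing import Callable, List

Profile = Callable[[float], float]  # time in seconds -> target requests per second


def constant(t_s: float, base: float = 100.0) -> float:
    return base


def diurnal(t_s: float, base: float = 100.0, amplitude: float = 50.0,
            period_s: float = 86_400.0) -> float:
    # Assumed shape: a sinusoid with a one-day period around the base rate.
    return base + amplitude * math.sin(2 * math.pi * t_s / period_s)


def with_spike(profile: Profile, start_s: float, length_s: float,
               factor: float) -> Profile:
    """Wrap a profile so the rate is multiplied by `factor` during the spike window."""

    def spiky(t_s: float) -> float:
        rate = profile(t_s)
        return rate * factor if start_s <= t_s < start_s + length_s else rate

    return spiky


def schedule(profile: Profile, duration_s: int) -> List[int]:
    # Requests to send in each one-second tick, fixed in advance (open loop).
    return [round(profile(float(t))) for t in range(duration_s)]


spiky_constant = with_spike(constant, start_s=10, length_s=5, factor=3.0)
plan = schedule(spiky_constant, duration_s=20)
print(plan)
```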
We used our performance anomaly injector to inject various types of performance anomalies into containers uniformly at random with configurable injection timing and intensity. Following the common way to study resource interference, our experiments on SLO violation mitigation with anomalies were designed to be comprehensive by covering the worst-case scenarios, given the random and nondeterministic nature of shared-resource interference in production environments. Unless otherwise specified, (i) the anomaly injection time interval followed an exponential distribution with $$ and (ii) the anomaly type and intensity were selected uniformly at random.
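The injection policy described above can be sketched as a schedule generator. This is only an illustration of the sampling scheme, not the authors' injector: the rate of the exponential distribution is missing from these notes, so `rate_per_s` below is a hypothetical parameter, and the anomaly labels, container names, and intensity bounds are likewise made up.

```python
import random
from dataclasses import dataclass
from typing import List

# Illustrative labels modelled on the resources FIRM is said to control.
ANOMALY_TYPES = ["cpu", "memory_bandwidth", "llc", "disk_io", "network"]


@dataclass
class Injection:
    time_s: float     # when the anomaly starts
    container: str    # target container, chosen uniformly at random
    kind: str         # anomaly type, chosen uniformly at random
    intensity: float  # chosen uniformly at random within [lo, hi]


def injection_plan(containers: List[str], rate_per_s: float, duration_s: float,
                   lo: float, hi: float, seed: int = 0) -> List[Injection]:
    """Sample injections whose gaps are exponentially distributed."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    t, plan = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential gap to the next injection
        if t >= duration_s:
            return plan
        plan.append(Injection(
            time_s=t,
            container=rng.choice(containers),
            kind=rng.choice(ANOMALY_TYPES),
            intensity=rng.uniform(lo, hi),
        ))


# Hypothetical values: two containers, one injection every 2 s on average.
plan = injection_plan(["c1", "c2"], rate_per_s=0.5, duration_s=100.0, lo=0.2, hi=0.9)
for inj in plan[:3]:
    print(inj)
```

Seeding the generator is a design choice for this sketch: it keeps the schedule random across seeds while making any single experiment repeatable.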
We first evaluated how well FIRM localizes the microservice instances that are responsible for SLO violations under different types of single-anomaly injections. For each type of performance anomaly and each type of request, we gradually increased the intensity of injected resource interference and recorded end-to-end latencies. The intensity parameter was chosen uniformly at random from the interval [start-point, end-point], where the start-point is the intensity that starts to trigger SLO violations, and the end-point is the intensity at which either the anomaly injector has consumed all possible resources or over 80% of user requests have been dropped. We observe that the localization accuracy of FIRM, when subject to different types of anomalies, does not vary significantly.
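What conclusions do the authors draw? Which are strong conclusions, and which are weak ones (i.e., the authors give no experimental evidence and only mention them in the discussion, or the experimental data do not provide sufficient evidence)?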
(optional) Notes that do not fit the list above but need to be specifically recorded.
(optional) List highly relevant literature so that it can be tracked later.
https://gitlab.engr.illinois.edu/DEPEND/firm