FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices

上官羽

2023-12-01

{FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices}

{2020}, {Haoran Qiu}, {OSDI}

Summary

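Fill this in last, after finishing the notes. Summarize the content of the paper; when consulting these notes later, read this paragraph first. Note: the summary must come from your own thinking and be written in your own words. Do not copy and paste from the original text.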

Research Objective(s)

User-facing, latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of computing resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect/localize microservices that cause SLO violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16x while reducing the overall requested CPU limit by up to 62%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11x.

Background / Problem Statement

These microservices must handle diverse load characteristics while efficiently multiplexing shared resources in order to maintain SLOs like end-to-end latency.

Problems that prior work failed to solve
Unfortunately, these approaches suffer from two main problems. First, they fail to efficiently multiplex resources, such as caches, memory, I/O channels and network links, at fine granularity, and thus may not reduce SLO violations.
Second, significant human effort and training are needed to build high-fidelity performance models of large-scale microservice deployments that capture low-level resource contention.

Method(s)

Support vector machine (SVM) driven detection and localization of SLO violations to individual microservice instances. FIRM first identifies the “critical paths”, and then uses per-critical-path and per-microservice-instance performance variability metrics (e.g., sojourn time[1]) to output a binary decision on whether or not a microservice instance is responsible for SLO violations.
Reinforcement learning (RL) driven mitigation of SLO violations that reduces contention on shared resources. FIRM then uses resource utilization, workload characteristics, and performance metrics to make dynamic reprovisioning decisions, which include (a) increasing or reducing the partition portion or limit for a resource type, (b) scaling up/down, i.e., adding or reducing the amount of resources attached to a container, and (c) scaling out/in, i.e., scaling the number of replicas for services. By continuing to learn mitigation policies through reinforcement, FIRM can optimize for dynamic workload-specific characteristics.
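The detection step in the first item above ends in a binary decision per microservice instance. A minimal sketch of that idea follows, assuming a linear SVM trained by plain subgradient descent on the hinge loss. The two features, the training data, and the function names are all hypothetical; these notes do not record which SVM variant or which exact features FIRM uses.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.01, steps=5000):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(steps):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:  # this sample violates the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def is_culprit(w, b, features):
    """Binary decision: True if the instance is flagged as responsible."""
    score = sum(wj * xj for wj, xj in zip(w, features)) + b
    return score > 0


# Hypothetical features per instance, both scaled to [0, 1]:
# (variability along the critical path, variability of the instance itself).
TRAIN_X = [(0.9, 0.8), (0.8, 0.9), (1.0, 0.7), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3)]
TRAIN_Y = [1, 1, 1, -1, -1, -1]  # +1: responsible for SLO violations, -1: not

W, B = train_linear_svm(TRAIN_X, TRAIN_Y)
```

With this toy data, `is_culprit(W, B, (0.95, 0.85))` flags a highly variable instance, while `is_culprit(W, B, (0.05, 0.1))` does not.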

Definition 2.3 The critical path (CP) to a microservice m in the execution history graph of a request is the path of maximal duration that starts with the client request and ends with m.

When we mention CP alone without the target microservice m, it means the critical path of the “Service Response” to the client (See Fig. 2(b)), i.e., end-to-end latency.
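The definition can be illustrated with a small longest-path computation. This is a sketch under simplifying assumptions: the execution history graph is modeled as a DAG whose nodes carry a duration, and the node names and durations are made up for illustration.

```python
from functools import lru_cache


def critical_path(durations, edges, source, target):
    """Return (total_duration, path) for the maximal-duration path
    from `source` to `target`, or None if `target` is unreachable.

    durations: dict mapping node -> time spent in that node
    edges: dict mapping node -> list of successor nodes (must be a DAG)
    """
    preds = {}
    for u, successors in edges.items():
        for v in successors:
            preds.setdefault(v, []).append(u)

    @lru_cache(maxsize=None)
    def best(node):
        # Longest path that starts at `source` and ends at `node`.
        if node == source:
            return (durations[node], (node,))
        candidates = [best(p) for p in preds.get(node, [])]
        candidates = [c for c in candidates if c is not None]
        if not candidates:
            return None
        total, path = max(candidates, key=lambda c: c[0])
        return (total + durations[node], path + (node,))

    return best(target)


# Hypothetical request: the frontend fans out to two services in parallel,
# and a compose step waits for both of them.
DURATIONS = {"client": 0, "nginx": 2, "video": 10, "text": 4, "compose": 3}
EDGES = {
    "client": ["nginx"],
    "nginx": ["video", "text"],
    "video": ["compose"],
    "text": ["compose"],
}

CP_TO_COMPOSE = critical_path(DURATIONS, EDGES, "client", "compose")
CP_TO_TEXT = critical_path(DURATIONS, EDGES, "client", "text")
```

Here the CP to `compose` runs through `video` (total duration 15) because `video` is the slower of the two parallel branches, while the CP to `text` has total duration 6.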

Insight 1: Dynamic Behavior of CPs.
However, CPs do not remain static over the execution of requests in microservices, but rather change dynamically based on the performance of individual services and their sensitivity to interference from shared-resource contention.
Insight 2: Microservices with Larger Latency Are Not Necessarily Root Causes of SLO Violations.
Insight 3: Mitigation Policies Vary with User Load and Resource in Contention.
The only way to mitigate the effects of dynamically changing CPs, which in turn cause dynamically changing latencies and tail behaviors, is to efficiently identify microservice instances on the CP that are resource-starved or contending for resources and then provide them with more of the resources.

The FIRM Framework

In this section, we describe the overall architecture of the FIRM framework and its implementation.

Based on the insight that resource contention manifests as dynamically evolving CPs, FIRM first detects CP changes and extracts critical microservice instances from them. It does so using the Tracing Coordinator, which is marked as 1 in Figure 6. The Tracing Coordinator collects tracing and telemetry data from every microservice instance and stores them in a centralized graph database for processing.
The Extractor detects SLO violations and queries the Tracing Coordinator with collected real-time data (a) to extract CPs and (b) to localize critical microservice instances that are likely the causes of SLO violations.
Using the telemetry data collected in 1 and the critical instances identified in 3, FIRM makes mitigation decisions to scale and reprovision resources for the critical instances. The policy used to make such decisions is automatically generated using RL. The RL agent jointly analyzes the contextual information about resource utilization (i.e., low-level performance counter data collected from the CPU, memory, and network) and performance metrics (i.e., per-microservice and end-to-end latency distributions).
Finally, actions are validated and executed on the underlying Kubernetes clusters through the deployment module.
In order to train the ML models in the Extractor as well as the RL agent, FIRM includes a performance anomaly injection framework that triggers SLO violations with configurable intensity and timing.

3.4 SLO Violation Mitigation Using RL

Given the list of critical service instances, FIRM’s Resource Estimator is designed to analyze resource contention and provide reprovisioning actions for the cluster manager to take. FIRM estimates and controls a fine-grained set of resources, including CPU time, memory bandwidth, LLC capacity, disk I/O bandwidth, and network bandwidth. It makes decisions on scaling each type of resource or the number of containers by using measurements of tracing and telemetry data collected from the Tracing Coordinator.
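To make the shape of a reprovisioning decision concrete, the sketch below models the five resource types named above plus a replica count. The class names, the representation of limits as fractions of node capacity, and the clamping bounds are assumptions for illustration, not FIRM's actual data structures.

```python
from dataclasses import dataclass, field
from enum import Enum


class Resource(Enum):
    CPU_TIME = "cpu_time"
    MEMORY_BANDWIDTH = "memory_bandwidth"
    LLC_CAPACITY = "llc_capacity"
    DISK_IO_BANDWIDTH = "disk_io_bandwidth"
    NETWORK_BANDWIDTH = "network_bandwidth"


@dataclass
class ContainerState:
    limits: dict  # Resource -> fraction of the node's capacity (assumed unit)
    replicas: int = 1


@dataclass
class Action:
    limit_deltas: dict = field(default_factory=dict)  # Resource -> change in fraction
    replica_delta: int = 0  # > 0 scales out, < 0 scales in


def apply_action(state, action, min_limit=0.05, max_limit=1.0):
    """Return a new state with limits clamped to [min_limit, max_limit]
    and at least one replica kept alive."""
    new_limits = dict(state.limits)
    for resource, delta in action.limit_deltas.items():
        updated = new_limits.get(resource, min_limit) + delta
        new_limits[resource] = min(max_limit, max(min_limit, updated))
    new_replicas = max(1, state.replicas + action.replica_delta)
    return ContainerState(new_limits, new_replicas)


BEFORE = ContainerState({Resource.CPU_TIME: 0.5, Resource.LLC_CAPACITY: 1.0})
SCALE = Action({Resource.CPU_TIME: 0.25, Resource.LLC_CAPACITY: 0.25}, replica_delta=1)
AFTER = apply_action(BEFORE, SCALE)
```

In this example the CPU limit rises from 0.5 to 0.75, the LLC limit stays capped at 1.0, and one replica is added. Returning a new state instead of mutating the old one keeps the before and after values available for validation.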

FIRM leverages reinforcement learning (RL) to optimize resource management policies for long-term reward in dynamic microservice environments.

RL Primer.

An RL agent solves a sequential decision-making problem (modeled as a Markov decision process) by interacting with an environment. At each discrete time step $t$, the agent observes a state $s_t$ of the environment and performs an action $a_t$. At the following time step $t+1$, the agent observes an immediate reward $r_t$ given by a reward function. The immediate reward represents the loss/gain in transitioning from $s_t$ to $s_{t+1}$ because of action $a_t$. The tuple $(s_t, a_t, r_t, s_{t+1})$ is called one transition.
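The interaction loop described in this primer can be sketched with a toy environment. Everything below is hypothetical: the latency model, the SLO threshold, the reward, and the hand-written policy exist only to show how transitions are produced, and the policy is fixed rather than learned.

```python
from collections import namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ToyEnv:
    """Hypothetical environment: latency falls as the CPU limit rises."""

    SLO_MS = 100

    def __init__(self):
        self.cpu_limit = 1  # cores

    def observe(self):
        latency_ms = 240 // self.cpu_limit
        return (self.cpu_limit, latency_ms)

    def step(self, action):
        # Apply the action (-1, 0, or +1 core) and return the immediate reward.
        self.cpu_limit = min(8, max(1, self.cpu_limit + action))
        cpu, latency_ms = self.observe()
        violation = max(0, latency_ms - self.SLO_MS)
        # Penalize SLO violations and, more mildly, the allocated cores.
        return -violation - cpu


def policy(state):
    """Fixed rule standing in for a learned policy: add a core while the SLO is violated."""
    _, latency_ms = state
    return 1 if latency_ms > ToyEnv.SLO_MS else 0


def rollout(env, steps):
    """Run the agent-environment loop and record one transition per step."""
    transitions = []
    state = env.observe()
    for _ in range(steps):
        action = policy(state)
        reward = env.step(action)
        next_state = env.observe()
        transitions.append(Transition(state, action, reward, next_state))
        state = next_state
    return transitions


TRANSITIONS = rollout(ToyEnv(), 3)
```

The first recorded transition is `((1, 240), 1, -22, (2, 120))`: from one core and 240 ms latency, adding a core still leaves a 20 ms violation, so the reward is -20 for the violation and -2 for the cores.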

Evaluation

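How do the authors evaluate their method? What is the experimental setup? Which experimental data and results are of interest? Are there any problems, or anything worth borrowing?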

Load Generation.

We drove the services with various open-loop asynchronous workload generators to represent an active production environment. We uniformly generated workloads for every request type across all microservice benchmarks. The parameters for the workload generators were the same as those for DeathStarBench and varied from predictable patterns (constant, diurnal, and distributions such as Poisson) to unpredictable loads with spikes in user demand. The workload generators and the microservice benchmark applications were never co-located.
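In an open-loop generator, send times depend only on the target rate, never on when earlier responses come back. The sketch below illustrates that property for a constant and a diurnal rate profile. The sinusoidal shape and all function names and parameters are assumptions; the notes do not record the actual generator settings.

```python
import math


def diurnal_rate(t_s, base_rps, amplitude, period_s=86400):
    """Requests per second at time t_s for a sinusoidal day/night pattern.
    `amplitude` must stay below 1 so that the rate remains positive."""
    return base_rps * (1.0 + amplitude * math.sin(2.0 * math.pi * t_s / period_s))


def open_loop_send_times(rate_fn, horizon_s):
    """Send times computed from the rate alone (open loop): the next request
    is scheduled without waiting for any response."""
    times = []
    t = 0.0
    while t < horizon_s:
        times.append(t)
        t += 1.0 / rate_fn(t)
    return times


# Constant load of 4 requests per second for 2 seconds.
CONSTANT_TIMES = open_loop_send_times(lambda t: 4.0, 2.0)

# Diurnal load around 100 requests per second, sampled over the first second.
DIURNAL_TIMES = open_loop_send_times(lambda t: diurnal_rate(t, 100.0, 0.5), 1.0)
```

A closed-loop generator would instead wait for each response before sending the next request, which hides queueing delay when the service slows down.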

Injection and Comparison baselines.

We used our performance anomaly injector to inject various types of performance anomalies into containers uniformly at random with configurable injection timing and intensity. Following the common way to study resource interference, our experiments on SLO violation mitigation with anomalies were designed to be comprehensive by covering the worst-case scenarios, given the random and nondeterministic nature of shared-resource interference in production environments. Unless otherwise specified, (i) the anomaly injection time interval followed an exponential distribution with rate $\lambda$ and (ii) the anomaly type and intensity were selected uniformly at random.
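An injection schedule of this kind can be sketched as follows. The anomaly type labels, the intensity range of [0, 1], the function name, and the rate used in the example are assumptions for illustration; the rate value used in the paper is not recorded in these notes.

```python
import random

# Hypothetical labels for the kinds of resource interference to inject.
ANOMALY_TYPES = ["cpu", "memory_bandwidth", "llc", "disk_io", "network"]


def injection_schedule(rate_per_s, horizon_s, containers, seed=0):
    """Build a list of injection events whose inter-arrival times are
    exponentially distributed; target container, anomaly type, and
    intensity are each drawn uniformly at random."""
    rng = random.Random(seed)  # seeded so that a schedule can be replayed
    events = []
    t = rng.expovariate(rate_per_s)
    while t < horizon_s:
        events.append({
            "time_s": t,
            "container": rng.choice(containers),
            "type": rng.choice(ANOMALY_TYPES),
            "intensity": rng.uniform(0.0, 1.0),
        })
        t += rng.expovariate(rate_per_s)
    return events


CONTAINERS = ["frontend-0", "cache-1", "db-2"]  # hypothetical container names
EVENTS = injection_schedule(rate_per_s=0.5, horizon_s=60.0, containers=CONTAINERS)
```

With a rate of 0.5 per second, injections arrive on average every 2 seconds, so a 60-second horizon yields roughly 30 events.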

Single anomaly localization.

We first evaluated how well FIRM localizes the microservice instances that are responsible for SLO violations under different types of single-anomaly injections. For each type of performance anomaly and each type of request, we gradually increased the intensity of injected resource interference and recorded end-to-end latencies. The intensity parameter was chosen uniformly at random from the interval [start-point, end-point], where the start-point is the intensity that starts to trigger SLO violations, and the end-point is the intensity at which either the anomaly injector has consumed all possible resources or over 80% of user requests have been dropped. We observe that the localization accuracy of FIRM does not vary significantly across the different types of anomalies.

Conclusion

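What conclusions do the authors draw? Which are strong conclusions, and which are weak conclusions (i.e., the authors provide no experimental evidence and only mention them in the discussion, or the experimental data do not give sufficient evidence)?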

Notes

(optional) Notes that do not fit any of the sections above but need to be recorded.

References

(optional) List highly relevant literature so that it can be tracked later.
https://gitlab.engr.illinois.edu/DEPEND/firm

tc: Traffic Control in the Linux kernel.
Jaeger: Open source, end-to-end distributed tracing.
OpenZipkin: A distributed tracing system.
Trickle: A lightweight userspace bandwidth shaper.
wrk2: An HTTP benchmarking tool based mostly on wrk.
Sysbench. github
Sysdig. github
stress-ng. ubuntu
OpenTracing
pmbw: Parallel memory bandwidth benchmark.
Intel cache allocation technology. github
Intel memory bandwidth allocation. github
HTB: Hierarchical token bucket.
Neo4j: Native Graph Database. github