{2016}, {}, {OSDI}
Fill this in last, after finishing the notes: summarize the paper so future readings of these notes can start here. Note: write the summary from your own understanding, in your own words. Never copy-paste from the original text.
Modern resource management frameworks for large-scale analytics leave unsolved the problematic tension between high cluster utilization and a job's performance predictability, coveted respectively by operators and users. We address this in Morpheus, a new system that: 1) codifies implicit user expectations as explicit Service Level Objectives (SLOs) inferred from historical data, 2) enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and 3) mitigates inherent performance variance by dynamically reprovisioning jobs. We validate these ideas against production traces from a 50k-node cluster, and show that Morpheus can lower the number of deadline violations by 5x to 13x, while retaining cluster utilization and lowering the cluster footprint by 14% to 28%. We demonstrate the scalability and practicality of our implementation by deploying Morpheus on a 2700-node cluster and running it against production-derived workloads.
Commercial enterprises ranging from Fortune-500 companies to venture-capital-funded startups increasingly rely on multi-tenant clusters to run their business-analytics jobs. These jobs comprise multiple tasks that run on different cluster nodes, where the unit of per-task resource allocation is a container (i.e., a bundle of resources such as CPU, RAM, and disk I/O) on an individual machine. From an analysis of large-scale production workloads, we observe significant variance in job runtimes, which sometimes results in missed deadlines and negative business impact. Users perceive this as an unpredictable execution experience, and it accounts for 25% of user escalations in Microsoft's big-data clusters.
Unpredictability comes from several sources, which we roughly group as follows:
Unpredictability is most noticeable to users who submit periodic jobs (scheduled runs of the same job on newly arriving data). Their recurrent nature prompts users to form expectations about a job's runtime performance and to react to any deviation from it, particularly if the job is business-critical.
Unfortunately, widely deployed resource managers provide only limited mechanisms for users to cope with the unpredictability of such jobs.
Morpheus is a system that continuously observes and learns as periodic jobs run over time. The findings are used to economically reserve resources ahead of job execution, and to dynamically adapt to changing conditions at runtime.
We next give a brief overview of the different components of Morpheus, highlighting the different timescales at which they operate. We start with the automatic inference module, which consists of two sub-components.
Extractor of target SLOs. This sub-component operates on the Provenance Graph (PG), a graph capturing all inter-job data dependencies as well as all ingress/egress operations. The SLO extractor leverages the precise timing information stored in the PG to statistically derive a deadline for a periodic job, namely the time at which downstream consumers read the job's output.
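As a toy illustration of deriving a deadline from downstream read times (the data, the safety margin, and the "earliest observed read" rule are my assumptions; the paper derives the target statistically from the PG's timing information):

```python
# Hypothetical first-read times (minutes after midnight) at which any
# downstream consumer first read this periodic job's output, one entry
# per historical run recorded in the Provenance Graph.
first_reads = [370, 365, 380, 372, 368, 500, 371]

# A conservative inferred SLO: the job must finish before the earliest
# observed downstream read, minus a safety margin (5 minutes here is an
# arbitrary choice for this sketch).
safety_margin = 5
deadline = min(first_reads) - safety_margin
print(deadline)  # 360, i.e., the job should complete by 6:00am
```

A real extractor would be more robust, e.g., discarding outlier runs before taking the minimum.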
Job resource model. This sub-component takes as input detailed telemetry information on job runs. The information includes a time series of resource utilization (the "skyline"): the amount of resources (measured in containers) used by the job at a certain time granularity, typically one minute. Based on the time series of multiple runs, the sub-component constructs a tight envelope R* around the typical behavior of the job.
The second part of our inference module produces a resource allocation R* that has high fidelity to the actual requirements of a periodic job j. In a nutshell, Morpheus collects the resource-usage patterns of a periodic job over N past instances, and solves an LP that "best fits" all of them.
Formally, the skyline of the i-th instance is the sequence s_{i,k}: the average number of containers used at each time-step k. Given this collection of sequences as input, the optimization problem outputs the vector s = (s_1, ..., s_K), the number of containers reserved at each time-step.
Our optimization objective is a cost function that is a linear combination of two terms: one penalizing "over-allocation", A_o(s), and one penalizing "under-allocation", A_u(s), both illustrated in Fig. 5. Formally, we wish to minimize a * A_o(s) + (1 - a) * A_u(s). Next, we describe these terms.
Over-allocation penalty. The over-allocation penalty is defined as the average over-allocation of containers. Formally, max{s_k - s_{i,k}, 0} is the instantaneous over-allocation for instance i at time-step k. Accordingly, the over-allocation penalty averages this quantity over all N instances and K time-steps: A_o(s) = (1 / NK) * sum_i sum_k max{s_k - s_{i,k}, 0}.
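A minimal numeric sketch of this cost function (the skyline data is invented, and the under-allocation term here is a simplified instantaneous mirror of the over-allocation one, ignoring the elasticity discussed next; the paper solves a full LP):

```python
import numpy as np

# Hypothetical skylines: N = 4 past instances, K = 5 one-minute steps;
# entries are average container counts per step.
skylines = np.array([
    [2, 6, 9, 4, 1],
    [3, 6, 8, 5, 1],
    [2, 7, 9, 4, 2],
    [2, 5, 8, 4, 1],
], dtype=float)

def over_alloc(s, sky):
    # A_o(s): mean over instances i and steps k of max(s_k - s_{i,k}, 0)
    return np.mean(np.maximum(s - sky, 0.0))

def under_alloc(s, sky):
    # Instantaneous under-allocation: mean of max(s_{i,k} - s_k, 0)
    return np.mean(np.maximum(sky - s, 0.0))

def cost(s, sky, a):
    return a * over_alloc(s, sky) + (1 - a) * under_alloc(s, sky)

# For this simplified cost the per-step optimum is the (1 - a) quantile
# of observed usage (the pinball-loss property); with a = 0.5 that is
# the per-step median.
a = 0.5
s_star = np.quantile(skylines, 1 - a, axis=0)
```

Reserving `skylines.max(axis=0)` would drive under-allocation to zero but pay maximal over-allocation, which is exactly the trade-off the weight `a` controls.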
The idea behind choosing these particular penalty forms is to model, as closely as possible, how a job actually uses the resources it requests. In particular, the over-allocation penalty models the amount of allocated resources left unused because the job instance does not need them. Resources wasted in one time-step cannot be recovered at a later time-step. A shortage of resources at a time-step, however, can be satisfied at a later point in time, assuming the job is elastic.
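To make the elasticity point concrete, here is a sketch of my own (not the paper's exact under-allocation term) contrasting an inelastic, step-by-step deficit with a backlog-based one in which delayed work can consume later spare capacity:

```python
import numpy as np

def instantaneous_deficit(supply, demand):
    # Shortage counted step by step: containers missing at step k are
    # treated as lost, as for an inelastic job.
    return float(np.maximum(demand - supply, 0.0).sum())

def elastic_deficit(supply, demand):
    # Running backlog: work that could not run at step k is carried to
    # step k+1, where it can consume any spare allocated capacity.
    backlog = 0.0
    for d, s in zip(demand, supply):
        backlog = max(backlog + d - s, 0.0)
    return backlog  # work still outstanding when the reservation ends

demand = np.array([4.0, 4.0, 0.0, 0.0])  # job front-loads its work
supply = np.array([2.0, 2.0, 2.0, 2.0])  # flat reservation, same total

print(instantaneous_deficit(supply, demand))  # 4.0: steps 1-2 each short by 2
print(elastic_deficit(supply, demand))        # 0.0: shortage repaid later
```

The flat reservation fully serves the elastic job (final backlog 0) even though it under-provisions every early step, which is why over- and under-allocation must be penalized asymmetrically.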
Regardless of the packing algorithm we use, we face a practical challenge: how to reserve resources for the many (possibly unbounded) instances of a periodic job. It is inefficient to calculate and store a separate reservation for each instance. To address this challenge, we enforce the constraint that all instances associated with the same periodic job share the same reservation across runs. E.g., a daily job that needs 10 containers for one hour, anywhere between 10am and 4pm, may be forced to execute between noon and 1pm every day. While this design choice may reduce the flexibility of the reservation algorithm, it provides stronger predictability to users and reduces allocation complexity.
How do the authors evaluate their approach? What is the experimental setup? Which experimental data and results are of interest? Are there any problems, or ideas worth borrowing?
What conclusions do the authors draw? Which are strong conclusions, and which are weak ones (i.e., claims only mentioned in the discussion without experimental evidence, or claims for which the experimental data do not provide sufficient evidence)?
(optional) Notes worth recording that do not fit the categories above.
(optional) List closely related work, to make it easier to follow up later.