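A classic paper, about Borg, the scheduling platform Google runs in-house. I also suspect it doubles as Google's bait paper: back when people first learned that Borg existed, plenty of them called on Google, in all sorts of places, to open-source it or at least describe it in more detail. Google seized the moment and pushed Kubernetes: "Borg itself is not open source, but we have open-sourced Kubernetes, a newer and more general system built on that foundation, so come and use it, everybody!!!" And so Kubernetes took off.
The most impressive thing about Borg is that it runs long-running services and batch jobs side by side in the same cells. According to the paper, this saves a lot of hardware: keeping the two apart would take roughly 20-30% more machines. In the paper's own words: “Since many other organizations run user-facing and batch jobs in separate clusters, we examined what would happen if we did the same. Figure 5 shows that segregating prod and non-prod work would need 20–30% more machines in the median cell to run our workload.” In other words: everyone else runs user-facing work and batch jobs in separate clusters, so we tried it to see what would happen. The result, good grief, was that we would need 20-30% more machines.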
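To get a feel for where a number like that can come from, here is a back-of-the-envelope sketch. It is not the paper's methodology (the paper measures real cells), and every number below is invented for illustration. The assumption is that prod services reserve capacity for rare spikes but usually use less, and that a shared cell lets batch work run in that reserved-but-idle capacity.

```python
import math


def machines_needed(cpu_cores: float, cores_per_machine: float) -> int:
    """Smallest whole number of machines that covers the given CPU demand."""
    return math.ceil(cpu_cores / cores_per_machine)


# All numbers are made up for illustration only.
cores_per_machine = 10.0
prod_reserved = 600.0     # cores prod services reserve to survive rare spikes
prod_typical_use = 400.0  # cores prod services actually use most of the time
batch_demand = 300.0      # cores wanted by batch jobs

# Segregated: prod and batch each get their own cluster, sized independently.
segregated = (machines_needed(prod_reserved, cores_per_machine)
              + machines_needed(batch_demand, cores_per_machine))

# Shared: batch runs in the slack prod has reserved but is not using
# (and is assumed to be evicted when prod spikes), so the cell only has
# to cover the larger of the two totals.
shared_demand = max(prod_reserved, prod_typical_use + batch_demand)
shared = machines_needed(shared_demand, cores_per_machine)

extra_fraction = segregated / shared - 1
print(f"segregated={segregated}, shared={shared}, extra={extra_fraction:.1%}")
```

The toy model ignores fragmentation, memory, and everything else that makes real bin-packing hard. It only shows the direction of the effect.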
Purpose
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.
Benefits:
1, Hides the details of resource management and failure handling.
2, Operates with very high reliability and availability, and supports applications that do the same.
3, Lets users run workloads across tens of thousands of machines.
Concepts:
1, Borg cell: a set of machines that are managed as a unit.
2, Workload: Borg cells run a heterogeneous workload with two main parts.
The first is long-running services that should “never” go down, and handle short-lived latency-sensitive requests (a few ms to a few hundred ms). Such services are used for end-user-facing products such as Gmail, Google Docs, and web search, and for internal infrastructure services (e.g., BigTable).
The second is batch jobs that take from a few seconds to a few days to complete; these are much less sensitive to short-term performance fluctuations.
3, Cluster: The machines in a cell belong to a single cluster, defined by the high-performance datacenter-scale network fabric that connects them. A cluster lives inside a single datacenter building, and a collection of buildings makes up a site.
4, Jobs: A Borg job’s properties include its name, owner, and the number of tasks it has. Jobs can have constraints to force its tasks to run on machines with particular attributes such as processor architecture, OS version, or an external IP address.
5, Task: Each task maps to a set of Linux processes running in a container on a machine.
6, Alloc: A Borg alloc (short for allocation) is a reserved set of resources on a machine in which one or more tasks can be run; the resources remain assigned whether or not they are used.
7, Quota: Quota is used to decide which jobs to admit for scheduling. Quota is expressed as a vector of resource quantities (CPU, RAM, disk, etc.) at a given priority, for a period of time (typically months).
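To make the Job and Quota concepts a bit more concrete, here is a minimal sketch of how they might fit together. Everything in it is my own assumption for illustration: the `Resources`, `Job`, and `admit` names, the three resource fields, and the admission rule are hypothetical and are not Borg's real schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resources:
    """A resource vector. The three fields are illustrative only."""
    cpu: float = 0.0
    ram_gib: float = 0.0
    disk_gib: float = 0.0

    def __add__(self, other: Resources) -> Resources:
        return Resources(self.cpu + other.cpu,
                         self.ram_gib + other.ram_gib,
                         self.disk_gib + other.disk_gib)

    def scaled(self, n: int) -> Resources:
        return Resources(self.cpu * n, self.ram_gib * n, self.disk_gib * n)

    def fits_in(self, limit: Resources) -> bool:
        return (self.cpu <= limit.cpu
                and self.ram_gib <= limit.ram_gib
                and self.disk_gib <= limit.disk_gib)


@dataclass
class Job:
    """Name, owner, and task count come from the paper's description;
    the remaining fields are assumptions made for this sketch."""
    name: str
    owner: str
    task_count: int
    priority: int
    per_task: Resources
    # e.g. {"arch": "x86_64"}: attributes a machine must have to run the tasks
    constraints: dict[str, str] = field(default_factory=dict)

    def total(self) -> Resources:
        return self.per_task.scaled(self.task_count)


def admit(job: Job, quota: Resources, in_use: Resources) -> bool:
    """Hypothetical admission rule: accept the job only if its total request,
    added to what the owner already uses, fits in the owner's quota.
    The caller is expected to pass the quota for the job's priority."""
    return (in_use + job.total()).fits_in(quota)
```

For example, with a quota of `Resources(cpu=100, ram_gib=400, disk_gib=1000)` and 60 cores already in use, a job of 10 tasks at 2 cores each is admitted, while a job of 50 tasks at 1 core each is rejected before the scheduler ever sees it.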
Architecture
Borgmaster: Each cell’s Borgmaster consists of two processes: the main Borgmaster process and a separate scheduler (§3.2). The main Borgmaster process handles client RPCs that either mutate state (e.g., create job) or provide read-only access to data (e.g., lookup job). It also manages state machines for all of the objects in the system (machines, tasks, allocs, etc.), communicates with the Borglets, and offers a web UI as a backup to Sigma.
Scheduling: When a job is submitted, the Borgmaster records it persistently in the Paxos store and adds the job’s tasks to the pending queue. This is scanned asynchronously by the scheduler, which assigns tasks to machines if there are sufficient available resources that meet the job’s constraints. (The scheduler primarily operates on tasks, not jobs.)
Borglet: The Borglet is a local Borg agent that is present on every machine in a cell. It starts and stops tasks; restarts them if they fail; manages local resources by manipulating OS kernel settings; rolls over debug logs; and reports the state of the machine to the Borgmaster and other monitoring systems.
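The scheduling step described above (scan the pending queue, place each task on a machine that has enough free resources and satisfies the job's constraints) can be sketched roughly as follows. This is a toy: the `Machine`, `Task`, `feasible`, and `schedule_pass` names are hypothetical, and first-fit placement is my simplification, not Borg's actual placement policy.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    free_cpu: float
    free_ram_gib: float
    attributes: dict = field(default_factory=dict)


@dataclass
class Task:
    job: str
    index: int
    cpu: float
    ram_gib: float
    constraints: dict = field(default_factory=dict)


def feasible(task: Task, machine: Machine) -> bool:
    """True if the machine has room for the task and matches every constraint."""
    has_room = (task.cpu <= machine.free_cpu
                and task.ram_gib <= machine.free_ram_gib)
    meets_constraints = all(machine.attributes.get(key) == value
                            for key, value in task.constraints.items())
    return has_room and meets_constraints


def schedule_pass(pending: deque, machines: list) -> dict:
    """One scan over the pending queue. Placed tasks are removed from the
    queue; tasks with no feasible machine are put back and stay pending."""
    placements = {}
    for _ in range(len(pending)):
        task = pending.popleft()
        target = next((m for m in machines if feasible(task, m)), None)
        if target is None:
            pending.append(task)  # try again on a later pass
            continue
        target.free_cpu -= task.cpu
        target.free_ram_gib -= task.ram_gib
        placements[(task.job, task.index)] = target.name
    return placements
```

Note that the unit of work is the task, not the job, which matches the paper's remark that the scheduler primarily operates on tasks.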
A few small details
The vast majority of the Borg workload does not run inside virtual machines.
Borg writes the task's hostname and port into a consistent, highly-available file in Chubby.
All components of Borg are written in C++.
A key design feature in Borg is that already-running tasks continue to run even if the Borgmaster or a task's Borglet goes down.
Performance
Impressive across the board.
And finally, the plug for K8s:
The Kubernetes architecture goes further: it has an API server at its core that is responsible only for processing requests and manipulating the underlying state objects. The cluster management logic is built as small, composable micro-services that are clients of this API server, such as the replication controller, which maintains the desired number of replicas of a pod in the face of failures, and the node controller, which manages the machine lifecycle.
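The split described in that passage, a dumb API server that only stores state plus small controllers that are its clients, can be sketched like this. It is a toy in-memory model and not the Kubernetes API: `ApiServer`, its methods, and `reconcile_replicas` are hypothetical names I made up for the example.

```python
class ApiServer:
    """Toy stand-in for an API server: it only stores and mutates state
    objects and contains no cluster-management logic of its own."""

    def __init__(self):
        self._pods = {}   # pod name -> label
        self._next_id = 0

    def list_pods(self, label):
        return [name for name, pod_label in self._pods.items()
                if pod_label == label]

    def create_pod(self, label):
        name = f"{label}-{self._next_id}"
        self._next_id += 1
        self._pods[name] = label
        return name

    def delete_pod(self, name):
        self._pods.pop(name, None)


def reconcile_replicas(api, label, desired):
    """One pass of a replication-controller-style loop: compare the observed
    pod count with the desired count and create or delete pods to match.
    Assumes desired is a non-negative integer."""
    current = api.list_pods(label)
    for _ in range(desired - len(current)):  # too few: create the shortfall
        api.create_pod(label)
    for name in current[desired:]:           # too many: delete the surplus
        api.delete_pod(name)
```

Run in a loop, a controller like this keeps the replica count at the desired value: if a pod disappears (say, `api.delete_pod(...)` simulating a failure), the next pass creates a replacement. The controller holds no state of its own, so it can crash and restart without losing anything.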