The kube-scheduler component is one of the core components of the Kubernetes system. It is responsible for scheduling Pod resource objects across the whole cluster: using built-in or extended scheduling algorithms (the predicate and priority algorithms), it places unscheduled Pods onto the most suitable worker nodes, making fuller and more sensible use of cluster resources.
kube-scheduler assigns Pods to nodes in the cluster. It watches kube-apiserver for Pods that have not yet been assigned a node, then picks a node for each of them according to its scheduling policy (by updating the Pod's NodeName field).
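For example, the Pods still waiting for a node can be listed by filtering on an empty spec.nodeName field; this uses kubectl's standard field selectors and is shown purely as an illustration:

kubectl get pods --all-namespaces --field-selector spec.nodeName=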
The scheduler has to weigh many factors, including fair scheduling, efficient resource utilization, QoS, affinity and anti-affinity, data locality, inter-workload interference, and deadlines.
There are three ways to make a Pod run only on specific nodes: nodeSelector, nodeAffinity, and podAffinity.
First, label the node:
kubectl label nodes node-01 disktype=ssd
Then set nodeSelector to disktype=ssd in the DaemonSet spec:
spec:
  nodeSelector:
    disktype: ssd
nodeAffinity currently supports two forms, requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution, which express hard requirements and soft preferences respectively. For example, the manifest below schedules the Pod only onto nodes carrying the label kubernetes.io/e2e-az-name with value e2e-az1 or e2e-az2, and among those prefers nodes that also carry the label another-node-label-key=another-node-label-value.
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: gcr.io/google_containers/pause:2.0
podAffinity selects nodes based on the labels of Pods already running on them: a Pod is scheduled only onto nodes that satisfy conditions on other Pods. Both podAffinity and podAntiAffinity are supported. The semantics take a moment to unpack, so consider the following example:

- required: the Pod may only be scheduled onto a node whose zone (topologyKey failure-domain.beta.kubernetes.io/zone) already contains at least one running Pod with the label security=S1
- preferred: the Pod should avoid nodes (topologyKey kubernetes.io/hostname) that are running a Pod with the label security=S2

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: gcr.io/google_containers/pause:2.0
Taints and tolerations ensure that Pods are not scheduled onto unsuitable nodes: taints are applied to nodes, while tolerations are applied to Pods.
Three taint effects are currently supported:
- NoSchedule: new Pods are not scheduled onto the node
- PreferNoSchedule: the scheduler tries not to schedule new Pods onto the node, but may still do so
- NoExecute: new Pods are not scheduled onto the node, and Pods already running there are evicted
However, when a Pod's tolerations match all of a node's taints, it can still be scheduled onto that node; and if it is already running there, it is not evicted. In addition, for NoExecute, if the Pod sets tolerationSeconds, it is evicted only after that period has elapsed.
For example, suppose the following taints are applied to node1:
kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule
The Pod below cannot be scheduled onto node1 because it does not tolerate key2=value2:NoSchedule:
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
A Pod that is already running and carries tolerationSeconds, however, is evicted only after 600s:
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 600
- key: "key2"
  operator: "Equal"
  value: "value2"
  effect: "NoSchedule"
Note that Pods created by a DaemonSet automatically get NoExecute tolerations for node.alpha.kubernetes.io/unreachable and node.alpha.kubernetes.io/notReady, so that they are not evicted for these conditions.
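As a sketch, these automatically added tolerations look roughly like this in the Pod spec (key names taken from the note above; newer releases use the node.kubernetes.io/ prefix instead):

tolerations:
- key: "node.alpha.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
- key: "node.alpha.kubernetes.io/notReady"
  operator: "Exists"
  effect: "NoExecute"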
Starting with v1.8, kube-scheduler supports Pod priorities, so that higher-priority Pods are scheduled first. The feature is enabled by default starting with v1.11.
Note: in v1.8-v1.10 it has to be enabled explicitly:
- kube-apiserver: --feature-gates=PodPriority=true and --runtime-config=scheduling.k8s.io/v1alpha1=true
- kube-scheduler: --feature-gates=PodPriority=true
Before a Pod can be given a priority, a PriorityClass (a non-namespaced resource) has to be defined, for example:
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
where:
- value is the priority as a 32-bit integer; the larger the value, the higher the priority
- globalDefault applies to Pods that do not set a PriorityClassName; at most one PriorityClass in the cluster should set it to true

Then set the Pod's priority via PriorityClassName in the PodSpec:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
If the default scheduler does not meet your needs, you can deploy a custom scheduler. Multiple scheduler instances can even run in the same cluster at the same time; podSpec.schedulerName selects which one to use (the built-in scheduler is used by default).
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  # use the custom scheduler my-scheduler
  schedulerName: my-scheduler
  containers:
  - name: nginx
    image: nginx:1.10
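At its core, a custom scheduler only has to watch for its Pods and POST a Binding object for each of them. Below is a minimal, illustrative shell sketch, not the real my-scheduler: it assumes kubectl proxy is running on localhost:8001 and jq is installed, and it picks nodes at random, so it is not suitable for real use:

#!/bin/bash
# Toy scheduler: binds pending Pods that request schedulerName=my-scheduler
# to a randomly chosen node by POSTing a Binding object (illustration only).
SERVER='localhost:8001'   # kubectl proxy endpoint (assumption)
while true; do
  for PODNAME in $(kubectl get pods -o json \
      | jq -r '.items[] | select(.spec.schedulerName == "my-scheduler")
                        | select(.spec.nodeName == null) | .metadata.name'); do
    NODES=($(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'))
    NODE=${NODES[$RANDOM % ${#NODES[@]}]}   # pick a node at random
    curl --header "Content-Type:application/json" --request POST \
      --data '{"apiVersion":"v1","kind":"Binding","metadata":{"name":"'"$PODNAME"'"},"target":{"apiVersion":"v1","kind":"Node","name":"'"$NODE"'"}}' \
      "http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/"
    echo "Assigned $PODNAME to $NODE"
  done
  sleep 1
done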
kube-scheduler also accepts --policy-config-file, which points to a scheduling policy file for customizing the scheduling policy, for example:
{ "kind" : "Policy", "apiVersion" : "v1", "predicates" : [ {"name" : "PodFitsHostPorts"}, {"name" : "PodFitsResources"}, {"name" : "NoDiskConflict"}, {"name" : "MatchNodeSelector"}, {"name" : "HostName"} ], "priorities" : [ {"name" : "LeastRequestedPriority", "weight" : 1}, {"name" : "BalancedResourceAllocation", "weight" : 1}, {"name" : "ServiceSpreadingPriority", "weight" : 1}, {"name" : "EqualPriority", "weight" : 1} ], "extenders":[ { "urlPrefix": "http://127.0.0.1:12346/scheduler", "apiVersion": "v1beta1", "filterVerb": "filter", "prioritizeVerb": "prioritize", "weight": 5, "enableHttps": false, "nodeCacheCapable": false } ] }
Kubernetes also guarantees scheduling for critical add-on Pods (such as kube-dns). A Pod can be marked as critical either by adding the annotation scheduler.alpha.kubernetes.io/critical-pod='' together with the toleration [{"key":"CriticalAddonsOnly", "operator":"Exists"}], or, now that priorities are available, by setting its priorityClassName to system-cluster-critical or system-node-critical.
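A minimal sketch of the priority-class variant (the Pod name and image are placeholders; on the releases discussed here these classes may only be used in the kube-system namespace):

apiVersion: v1
kind: Pod
metadata:
  name: critical-addon          # placeholder name
  namespace: kube-system        # critical classes are restricted to kube-system
spec:
  priorityClassName: system-cluster-critical
  containers:
  - name: addon                 # placeholder container
    image: gcr.io/google_containers/pause:2.0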
A typical kube-scheduler startup command looks like:

kube-scheduler --address=127.0.0.1 --leader-elect=true --kubeconfig=/etc/kubernetes/scheduler.conf
How kube-scheduler picks a node for a given Pod:
For given pod:
+---------------------------------------------+
| Schedulable nodes: |
| |
| +--------+ +--------+ +--------+ |
| | node 1 | | node 2 | | node 3 | |
| +--------+ +--------+ +--------+ |
| |
+-------------------+-------------------------+
|
|
v
+-------------------+-------------------------+
Pred. filters: node 3 doesn't have enough resource
+-------------------+-------------------------+
|
|
v
+-------------------+-------------------------+
| remaining nodes: |
| +--------+ +--------+ |
| | node 1 | | node 2 | |
| +--------+ +--------+ |
| |
+-------------------+-------------------------+
|
|
v
+-------------------+-------------------------+
Priority function: node 1: p=2
node 2: p=5
+-------------------+-------------------------+
|
|
v
select max{node priority} = node 2
kube-scheduler schedules in two phases, predicate and priority:
predicates strategies filter out nodes that cannot run the Pod, for example:
- HostName: checks whether pod.Spec.NodeName matches the candidate node
- MatchNodeSelector: checks whether pod.Spec.NodeSelector matches the node's labels

priorities strategies then rank the remaining nodes, for example LeastRequestedPriority and BalancedResourceAllocation (as in the policy file above); the highest-scoring node wins.
Code entry path: in release-1.9 and earlier the entry point is plugin/cmd/kube-scheduler; since release-1.10 the core kube-scheduler code has moved to pkg/scheduler, and the entry point to cmd/kube-scheduler.