Scheduling policies in Kubernetes fall into two categories: global scheduling policies and runtime scheduling policies. Global scheduling policies are configured when the scheduler starts, while runtime scheduling policies mainly comprise node selection (nodeSelector), node affinity (nodeAffinity), and pod affinity/anti-affinity (podAffinity and podAntiAffinity). nodeAffinity, podAffinity/podAntiAffinity, and the taints and tolerations features introduced later in this article were all in Beta as of Kubernetes 1.6.
This article focuses on runtime scheduling policies.
Label is one of Kubernetes' core concepts. Labels are attached as key/value pairs to objects such as Pods, Services, Deployments, and Nodes, and serve to identify those objects and to manage the associations between them, for example the association between a Node and a Pod.
List all nodes in the current cluster:
[root@master ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master.example.com Ready control-plane,master 2d2h v1.23.1
node1.example.com Ready <none> 2d2h v1.23.1
node2.example.com Ready <none> 2d2h v1.23.1
View a node's default labels:
[root@master ~]# kubectl get node node1.example.com --show-labels
NAME STATUS ROLES AGE VERSION LABELS
node1.example.com Ready <none> 2d2h v1.23.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1.example.com,kubernetes.io/os=linux
Set a label on a specific node:
[root@master ~]# kubectl label node node1.example.com disktype=ssd
node/node1.example.com labeled
Confirm that the label was set successfully:
[root@master ~]# kubectl get node node1.example.com --show-labels
NAME STATUS ROLES AGE VERSION LABELS
node1.example.com Ready <none> 2d2h v1.23.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1.example.com,kubernetes.io/os=linux
[root@master ~]# kubectl get nodes -l disktype=ssd
NAME STATUS ROLES AGE VERSION
node1.example.com Ready <none> 2d2h v1.23.1
nodeSelector is the simplest form of runtime scheduling constraint. Pod.spec.nodeSelector selects nodes through Kubernetes' label-selector mechanism: the scheduler matches labels and then places the pod on a matching node, and the match is a hard constraint. The nodeAffinity feature described below covers all of nodeSelector's functionality, so Kubernetes plans to deprecate nodeSelector eventually.
Set the label:
[root@master ~]# kubectl label node node1.example.com disktype=ssd
node/node1.example.com labeled
List nodes that are not master nodes and have disktype=ssd:
[root@master ~]# kubectl get nodes -l 'role!=master, disktype=ssd'
NAME STATUS ROLES AGE VERSION
node1.example.com Ready <none> 2d2h v1.23.1
Contents of pod.yml:
[root@master ~]# vi pod.yml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd
Create the pod:
[root@master ~]# kubectl apply -f pod.yml
pod/nginx created
Confirm that pod nginx was scheduled onto the expected node:
[root@master ~]# kubectl get pod nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 9m35s 10.244.1.31 node1.example.com <none> <none>
Note: when working in a non-default namespace, specify it explicitly, for example:
kubectl -n kube-system get pods -o wide
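nodeSelector works the same way inside the pod template of higher-level controllers. A minimal sketch of a Deployment that pins every replica to disktype=ssd nodes (the name web-nginx is made up for illustration):

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-nginx          # hypothetical name, for illustration only
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-nginx
  template:
    metadata:
      labels:
        app: web-nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
      nodeSelector:        # applies to every replica of the Deployment
        disktype: ssd
```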
Since v1.4, Kubernetes nodes carry a number of built-in labels, visible in the --show-labels output above: kubernetes.io/hostname, kubernetes.io/os, kubernetes.io/arch, and their legacy equivalents beta.kubernetes.io/os and beta.kubernetes.io/arch. These can be used in a nodeSelector directly.
Contents of pod.yml:
[root@master ~]# vi pod.yml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    kubernetes.io/hostname: node1.example.com
Create the pod and confirm the result: the pod is scheduled onto the designated node node1.example.com.
[root@master ~]# kubectl apply -f pod.yml
pod/nginx unchanged
[root@master ~]# kubectl get pod nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 63s 10.244.1.32 node1.example.com <none> <none>
The nodeSelector described above constrains pod scheduling in only one, very simple way: a hard label match against nodes. Affinity and anti-affinity rules specify the desired placement far more flexibly. Compared with nodeSelector, their advantages are:
1. a richer expression language: matchExpressions support operators such as In, NotIn, Exists, DoesNotExist, Gt, and Lt, rather than only exact equality;
2. rules can be soft preferences instead of hard requirements, so the pod is still scheduled even when no node satisfies them;
3. rules can match the labels of pods already running on a node, not just the node's own labels, allowing pods to be co-located with or kept apart from each other.
Affinity comes in three types: node affinity, plus inter-pod affinity and anti-affinity, described in detail below.
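As a taste of inter-pod anti-affinity, here is a sketch that prevents two nginx pods from landing on the same node, assuming pods labeled app: nginx; kubernetes.io/hostname is the standard per-node topology label:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - nginx
      topologyKey: kubernetes.io/hostname   # at most one matching pod per node
```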
Node affinity was introduced as alpha in Kubernetes 1.2 and covers everything nodeSelector can do. It comes in two flavors: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. The former is a hard constraint: the pod is placed only on nodes whose labels satisfy the rules, and scheduling fails if no such node exists. The latter is a soft constraint, or preference: the scheduler favors matching nodes but will still place the pod elsewhere if none qualify. In both cases the IgnoredDuringExecution suffix means that if a node's labels change after the pod has been scheduled there, the pod keeps running rather than being evicted.
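The matchExpressions inside these rules support operators beyond exact equality: In, NotIn, Exists, DoesNotExist, Gt, and Lt. A sketch of a hard requirement that avoids nodes labeled disktype=hdd and demands that a gpu label be present (the gpu key is made up for illustration; Exists takes no values):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype      # skip any node explicitly labeled disktype=hdd
          operator: NotIn
          values:
          - hdd
        - key: gpu           # hypothetical label: the node must carry it (any value)
          operator: Exists
```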
Set node labels:
[root@master ~]# kubectl label nodes node1.example.com cpu=high
node/node1.example.com labeled
[root@master ~]# kubectl label node node1.example.com disktype=ssd
node/node1.example.com labeled
[root@master ~]# kubectl label nodes node2.example.com cpu=low
node/node2.example.com labeled
The goal is to place the pod on a machine with an SSD disk (disktype=ssd) and, preferably, a high-spec CPU (cpu=high).
List the nodes that satisfy both conditions:
[root@master ~]# kubectl get nodes -l 'cpu=high, disktype=ssd'
NAME STATUS ROLES AGE VERSION
node1.example.com Ready <none> 2d3h v1.23.1
Contents of pod.yml:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: cpu
            operator: In
            values:
            - high
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
Confirm the result: pod nginx is deployed on the machine with an SSD disk and a high-spec CPU, as expected.
[root@master ~]# kubectl get pod nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 27s 10.244.1.33 node1.example.com <none> <none>
Node affinity, whether as a hard requirement or a preference, attracts pods to chosen nodes. Taints do the opposite: a node marked with a taint repels pods, and no pod is scheduled onto it unless the pod declares a toleration for that taint. Taints and tolerations were likewise in beta at the time of writing.
Typical uses of taints include reserving the Kubernetes master node for system components, or setting aside a group of nodes with special resources for particular pods; ordinary pods are no longer scheduled onto tainted nodes. Taint node1.example.com and observe that a newly created pod stays Pending:
[root@master ~]# kubectl taint node node1.example.com cpu=high:NoSchedule
node/node1.example.com tainted
[root@master ~]# kubectl apply -f pod.yml
pod/nginx created
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 0/1 Pending 0 6s
If you still want a pod to land on the tainted node, its spec must define a matching toleration, for example:
[root@master ~]# vim pod.yml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: cpu
            operator: In
            values:
            - high
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "cpu"
    operator: "Equal"
    value: "high"
    effect: "NoSchedule"
[root@master ~]# kubectl apply -f pod.yml
pod/nginx configured
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 1/1 Running 0 3m48s
The effect field has three options, chosen according to need:
NoSchedule: new pods without a matching toleration are not scheduled onto the node; pods already running there are unaffected.
PreferNoSchedule: a soft version of NoSchedule; the scheduler tries to avoid the node but may still use it.
NoExecute: new pods are not scheduled onto the node, and pods already running there without a matching toleration are evicted.
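Pairing the NoExecute effect with tolerationSeconds lets a pod stay on a freshly tainted node for a limited time before being evicted. A sketch (the maintenance=true taint key/value is made up for illustration):

```yaml
tolerations:
- key: "maintenance"       # hypothetical taint key
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  tolerationSeconds: 300   # evicted 300s after the taint is applied
```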