K8S scheduler

司寇正志

2023-12-01

Definition

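The scheduler is the Kubernetes component whose main job is to assign defined Pods to nodes in the cluster. It is designed around four goals:

Fairness
Efficient resource utilization
Efficiency
Flexibility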

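The scheduler runs as a separate process. Once started, it continuously watches the API server for Pods whose PodSpec.NodeName is empty, and for each such Pod it creates a binding that records which node the Pod should be placed on.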
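The watch-and-bind behaviour can be illustrated with a small in-memory Python sketch. It does not talk to a real API server: the pods list and the helper names unscheduled and make_binding are hypothetical, and the node name passed to make_binding is simply assumed to have been chosen already. The dictionary returned by make_binding mirrors the shape of a v1 Binding object.

```python
from typing import Dict, List


def unscheduled(pods: List[Dict]) -> List[Dict]:
    """Return the pods whose spec.nodeName is empty or missing."""
    return [p for p in pods if not p["spec"].get("nodeName")]


def make_binding(pod_name: str, node_name: str) -> Dict:
    """Build a dict shaped like a v1 Binding that targets the given node."""
    return {
        "apiVersion": "v1",
        "kind": "Binding",
        "metadata": {"name": pod_name},
        "target": {"apiVersion": "v1", "kind": "Node", "name": node_name},
    }


# Hypothetical cluster state: one pod waiting for a node, one already placed.
pods = [
    {"metadata": {"name": "nginx"}, "spec": {"nodeName": ""}},
    {"metadata": {"name": "db"}, "spec": {"nodeName": "k8snode1"}},
]

for pod in unscheduled(pods):
    # Only "nginx" is processed; "db" already has a node.
    print(make_binding(pod["metadata"]["name"], "k8snode2"))
```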

Scheduling process

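Scheduling first filters out the nodes that do not satisfy the Pod's requirements; this step is called predicate (filtering, in current Kubernetes terminology). The remaining nodes are then ranked by priority; this step is called priority (scoring). Finally, the node with the highest priority is selected.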

If no node passes the predicate step, the Pod stays in the Pending state and scheduling is retried until some node satisfies the conditions.
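The two phases can be sketched in a few lines of Python. This is a toy model, not the real kube-scheduler: the node dictionaries, the free_cpu field, and the functions fits, score, and schedule are hypothetical names chosen for illustration, and the scoring rule (prefer the node with the most CPU left over) is an assumption.

```python
from typing import Dict, List, Optional


def fits(pod: Dict, node: Dict) -> bool:
    """Predicate (filter) step: can the node hold the pod at all?"""
    return node["free_cpu"] >= pod["cpu"]


def score(pod: Dict, node: Dict) -> int:
    """Priority (score) step: assumed rule, more CPU left over is better."""
    return node["free_cpu"] - pod["cpu"]


def schedule(pod: Dict, nodes: List[Dict]) -> Optional[str]:
    """Return the chosen node name, or None if the pod must stay Pending."""
    feasible = [n for n in nodes if fits(pod, n)]
    if not feasible:
        return None  # nothing passed the predicate step
    best = max(feasible, key=lambda n: score(pod, n))
    return best["name"]


nodes = [
    {"name": "k8snode1", "free_cpu": 2},
    {"name": "k8snode2", "free_cpu": 8},
]

print(schedule({"cpu": 4}, nodes))   # k8snode2
print(schedule({"cpu": 16}, nodes))  # None (the pod would stay Pending)
```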

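Node affinity

Assigning Pods to nodes

You can constrain a Pod so that it runs only on particular nodes. nodeSelector is the simplest form of node selection constraint.

1. Add a label to the node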

kubectl label nodes k8snode1 distribute=k8snode1

kubectl get nodes --show-labels

2. Add a nodeSelector field to the Pod configuration

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    distribute: k8snode1
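The rule nodeSelector applies is simple: every key/value pair in the selector must appear among the node's labels. A minimal Python sketch of that check follows; the function name matches_node_selector and the sample label sets are hypothetical.

```python
from typing import Dict


def matches_node_selector(node_selector: Dict[str, str],
                          node_labels: Dict[str, str]) -> bool:
    """True when every selector pair is present, with the same value, on the node."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


selector = {"distribute": "k8snode1"}

labeled = {"kubernetes.io/hostname": "k8snode1", "distribute": "k8snode1"}
unlabeled = {"kubernetes.io/hostname": "k8snode2"}

print(matches_node_selector(selector, labeled))    # True
print(matches_node_selector(selector, unlabeled))  # False
```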

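There are currently two types of node affinity: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as a "hard requirement" and a "soft requirement" respectively.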

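An example of requiredDuringSchedulingIgnoredDuringExecution would be "only run the Pod on nodes with Intel CPUs", while an example of preferredDuringSchedulingIgnoredDuringExecution would be "try to run this set of Pods in failure zone XYZ, but if that is not possible, allow some of them to run elsewhere".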
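The difference between the two types can be sketched in Python: the required terms act as a filter, while the preferred terms only add weight to a node's score. This is a simplified model under stated assumptions. The function names and the cpu-vendor label are hypothetical, and only the In, NotIn, Exists, and DoesNotExist operators are handled.

```python
from typing import Dict, List


def expr_matches(expr: Dict, labels: Dict[str, str]) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError("operator not handled in this sketch: " + op)


def term_matches(term: Dict, labels: Dict[str, str]) -> bool:
    """Expressions inside one term are ANDed; an empty term matches nothing."""
    exprs: List[Dict] = term.get("matchExpressions", [])
    return bool(exprs) and all(expr_matches(e, labels) for e in exprs)


def required_ok(affinity: Dict, labels: Dict[str, str]) -> bool:
    """Hard requirement: at least one nodeSelectorTerm must match (terms are ORed)."""
    required = affinity.get("requiredDuringSchedulingIgnoredDuringExecution")
    if required is None:
        return True
    return any(term_matches(t, labels) for t in required["nodeSelectorTerms"])


def preferred_score(affinity: Dict, labels: Dict[str, str]) -> int:
    """Soft requirement: sum the weights of the preferences the node satisfies."""
    prefs = affinity.get("preferredDuringSchedulingIgnoredDuringExecution", [])
    return sum(p["weight"] for p in prefs if term_matches(p["preference"], labels))


affinity = {
    "requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [{"matchExpressions": [
            {"key": "cpu-vendor", "operator": "In", "values": ["intel"]}]}]
    },
    "preferredDuringSchedulingIgnoredDuringExecution": [
        {"weight": 50, "preference": {"matchExpressions": [
            {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["xyz"]}]}}
    ],
}

node_labels = {
    "intel-xyz": {"cpu-vendor": "intel", "topology.kubernetes.io/zone": "xyz"},
    "intel-abc": {"cpu-vendor": "intel", "topology.kubernetes.io/zone": "abc"},
    "amd-xyz": {"cpu-vendor": "amd", "topology.kubernetes.io/zone": "xyz"},
}

for name, labels in node_labels.items():
    print(name, required_ok(affinity, labels), preferred_score(affinity, labels))
```

In this model amd-xyz is rejected outright, while intel-abc remains eligible and merely scores lower than intel-xyz.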

Types of affinity

nodeAffinity

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration

profiles:
  - schedulerName: default-scheduler
  - schedulerName: foo-scheduler
    pluginConfig:
      - name: NodeAffinity
        args:
          addedAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: scheduler-profile
                  operator: In
                  values:
                  - foo

Pod affinity

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0

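By matching on a label key such as app, with values set to the label values defined in the Pods' metadata, you can co-locate multiple Pods on the same node or spread them across multiple nodes.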
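How topologyKey ties Pod affinity to a group of nodes can be sketched as follows. This is a simplified Python model under stated assumptions: the label selector is reduced to a single In expression, and the function name, node names, and Pod list are hypothetical. A candidate node satisfies a required Pod affinity term when some existing matching Pod runs in the same topology domain; anti-affinity is the negation of that check.

```python
from typing import Dict, List


def same_domain_has_match(candidate: str,
                          nodes: Dict[str, Dict[str, str]],
                          pods: List[Dict],
                          key: str,
                          values: List[str],
                          topology_key: str) -> bool:
    """True if a pod whose label `key` is in `values` runs in the candidate's domain."""
    domain = nodes[candidate].get(topology_key)
    if domain is None:
        return False  # a node without the topology label belongs to no domain
    for pod in pods:
        pod_domain = nodes[pod["node"]].get(topology_key)
        if pod_domain == domain and pod["labels"].get(key) in values:
            return True
    return False


ZONE = "topology.kubernetes.io/zone"
nodes = {
    "n1": {ZONE: "zone-a"},
    "n2": {ZONE: "zone-a"},
    "n3": {ZONE: "zone-b"},
}
# One existing pod labelled security=S1 is running on n1 (zone-a).
pods = [{"node": "n1", "labels": {"security": "S1"}}]

# Affinity to security=S1: n2 shares zone-a with the pod, n3 does not.
print(same_domain_has_match("n2", nodes, pods, "security", ["S1"], ZONE))  # True
print(same_domain_has_match("n3", nodes, pods, "security", ["S1"], ZONE))  # False

# Anti-affinity to security=S1 would accept exactly the nodes rejected above.
print(not same_domain_has_match("n3", nodes, pods, "security", ["S1"], ZONE))  # True
```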

Taints

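Node affinity is a property of Pods that attracts them to a particular set of nodes, either as a preference or as a hard requirement. Taints are the opposite: they allow a node to repel a particular set of Pods.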

Tolerations are applied to Pods. A toleration allows (but does not require) a Pod to be scheduled onto nodes that carry a matching taint.

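Taints and tolerations work together to keep Pods from being scheduled onto inappropriate nodes. One or more taints can be applied to a node, which means the node will not accept any Pod that does not tolerate those taints.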

Composition

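kubectl taint sets a taint on a node. Each taint has a key and a value that act as its label (the value may be empty) and an effect that describes what the taint does, written together as key=value:effect. effect supports three options: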

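NoSchedule: Kubernetes will not schedule a Pod onto a node with this taint unless the Pod tolerates it

PreferNoSchedule: Kubernetes will try to avoid scheduling a Pod onto a node with this taint, but does not guarantee it

NoExecute: Kubernetes will not schedule a Pod onto the node, and Pods already running on the node that do not tolerate the taint are evicted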

Setting, viewing, and removing taints

kubectl taint nodes node1 key1=value1:NoSchedule

To view taints, look for the Taints field in the output of kubectl describe node NodeName

kubectl taint nodes node1 key1=value1:NoSchedule-

Pods with tolerations

If a Pod's toleration matches a node's taint, the Pod can be scheduled onto that node

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "example-key"
    operator: "Exists"
    effect: "NoSchedule"
    # tolerationSeconds is only valid with effect "NoExecute", so it is omitted here

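1. If key is omitted and operator is Exists, the toleration matches every taint key, so all taints are tolerated

2. If effect is omitted, the toleration matches every effect for the given key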
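The matching rules, including the two special cases above, can be sketched in Python. The logic follows the documented behaviour (an empty effect or key acts as a wildcard, Exists ignores the value, and Equal is the default operator), but the function names and sample taints are hypothetical, and the sketch covers only the scheduling decision, not eviction timing.

```python
from typing import Dict, List


def tolerates(toleration: Dict, taint: Dict) -> bool:
    """True if this single toleration matches this single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect matches every effect
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False  # an empty key matches every key
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True   # the value is ignored
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def untolerated(taints: List[Dict], tolerations: List[Dict]) -> List[Dict]:
    """Taints that no toleration of the pod matches."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


def can_schedule(taints: List[Dict], tolerations: List[Dict]) -> bool:
    """Only untolerated NoSchedule/NoExecute taints block scheduling."""
    return not any(t["effect"] in ("NoSchedule", "NoExecute")
                   for t in untolerated(taints, tolerations))


taint = {"key": "key1", "value": "value1", "effect": "NoSchedule"}

print(tolerates({"key": "key1", "operator": "Equal", "value": "value1",
                 "effect": "NoSchedule"}, taint))         # True
print(tolerates({"operator": "Exists"}, taint))           # True: no key, any taint
print(tolerates({"key": "key1", "operator": "Exists"}, taint))  # True: no effect
print(tolerates({"key": "example-key", "operator": "Exists",
                 "effect": "NoSchedule"}, taint))         # False: different key
print(can_schedule([taint], []))                          # False
```

Note that an untolerated PreferNoSchedule taint does not make can_schedule return False, since that effect is only a preference.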