
kube-scheduler configuration file and plugins

邓俊英
2023-12-01

kube-scheduler is one of the core Kubernetes components. Its main job is to choose a suitable node for each newly created pod whose nodeName is empty.

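The workflow is roughly as follows. kube-scheduler uses the informer mechanism to watch for changes to pod resources (it also watches nodes, PVs and so on, but we focus on pods here). If a pod's pod.Spec.NodeName field is empty, the pod has not been assigned a node yet (if the user sets pod.Spec.NodeName to a specific node, no scheduling is performed for it). If, in addition, pod.Spec.SchedulerName matches one of the configured schedulers, the pod is added to the scheduling queue to wait for scheduling.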
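The queueing decision described above can be sketched in a few lines of Go. This is only an illustration: the Pod struct and the shouldEnqueue function are hypothetical stand-ins, not the real v1.Pod type or the scheduler's own event handlers.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for the two pod.Spec fields
// discussed above. It is not the real v1.Pod type.
type Pod struct {
	Name          string
	NodeName      string
	SchedulerName string
}

// shouldEnqueue reports whether a scheduler serving the given scheduler
// names would add the pod to its scheduling queue.
func shouldEnqueue(pod Pod, schedulerNames map[string]bool) bool {
	// A pod that already has a node assigned needs no scheduling.
	if pod.NodeName != "" {
		return false
	}
	// Only pods that name one of the configured schedulers are picked up.
	return schedulerNames[pod.SchedulerName]
}

func main() {
	schedulerNames := map[string]bool{"default-scheduler": true, "scheduler1": true}
	pods := []Pod{
		{Name: "a", SchedulerName: "default-scheduler"},
		{Name: "b", NodeName: "node-1", SchedulerName: "default-scheduler"},
		{Name: "c", SchedulerName: "other-scheduler"},
	}
	for _, p := range pods {
		fmt.Printf("%s enqueue=%v\n", p.Name, shouldEnqueue(p, schedulerNames))
	}
}
```

Only pod a is queued here: pod b already has a node, and pod c names a scheduler that is not configured.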

This article looks at the configuration file needed to create a scheduler and at how the plugins used during scheduling are configured.

Configuration file
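The configuration file is passed to the kube-scheduler process with the --config option. The file uses the configuration API format. This configuration API is not exposed through the RESTful apiserver; it can only be supplied as a file. For the meaning of each field and the plugin arguments, see the official Kubernetes documentation. Here is a sample configuration file: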

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
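# Leader election settings
leaderElection:
  leaderElect: true
# Settings for talking to the apiserver
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
# profiles defines the schedulers; more than one can be listed
profiles:
  # Scheduler [0], named 'default-scheduler'
- schedulerName: default-scheduler
  # Plugins at each extension point
  plugins:
    # At the queueSort extension point, disable all default plugins and enable Test
    queueSort:
      enabled:
      - name: Test
      disabled:
      - name: "*"
    # At the preFilter extension point, enable Test
    preFilter:
      enabled:
      - name: Test
  # Plugin arguments
  pluginConfig:
  - name: Test
    args:
      abcd: efg
  # Scheduler [1], named 'scheduler1'
- schedulerName: scheduler1
  # Plugins at each extension point
  plugins:
    # At the queueSort extension point, disable all default plugins and enable Test and test2
    # (shown for syntax only: the framework expects a single queueSort plugin per scheduler)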
    queueSort:
      enabled:
      - name: Test
      - name: test2
      disabled:
      - name: "*"

The configuration API maps to the following structs in the code:

//pkg/scheduler/apis/config/types.go
// KubeSchedulerConfiguration configures a scheduler
type KubeSchedulerConfiguration struct {
	// Metadata common to every k8s API object; specifies APIVersion and Kind
	metav1.TypeMeta

	// Degree of parallelism, 16 by default. When the scheduling algorithm runs, Parallelism goroutines are started to execute the filters
	Parallelism int32

	// Not covered here
	LeaderElection componentbaseconfig.LeaderElectionConfiguration

	// Connection settings for talking to the apiserver
	ClientConnection componentbaseconfig.ClientConnectionConfiguration
	// Address (IP and port) the health check server listens on, 0.0.0.0:10251 by default
	HealthzBindAddress string
	// Address (IP and port) the metrics server listens on, 0.0.0.0:10251 by default
	MetricsBindAddress string

	// Debugging settings, ignored for now
	componentbaseconfig.DebuggingConfiguration

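	// Trying every node on each scheduling cycle is inefficient, so this sets the percentage of nodes that must be found feasible before the scheduler stops searching for more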
	PercentageOfNodesToScore int32

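	// After a pod fails scheduling it first goes into the unschedulable queue; a goroutine or some other event later moves it into the podBackoff queue.
	// After the first failure the pod stays in the podBackoff queue for PodInitialBackoffSeconds*1 (1s by default), i.e. it is retried 1s after the failure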
	PodInitialBackoffSeconds int64

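	// After the second failure the pod stays in the podBackoff queue for PodInitialBackoffSeconds*2, and so on, up to a maximum of PodMaxBackoffSeconds.
	// See the function calculateBackoffDuration. Scheduling is retried for as long as it fails, unless the pod is deleted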
	PodMaxBackoffSeconds int64

	// The list of schedulers. It is a slice, so several schedulers can be defined.
	// A pod selects one through pod.Spec.SchedulerName; if it names none, the default scheduler default-scheduler is used
	Profiles []KubeSchedulerProfile

	// Ignored for now
	Extenders []Extender
}
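The backoff behaviour described in the comments on PodInitialBackoffSeconds and PodMaxBackoffSeconds can be illustrated with a small sketch. It is modelled on the doubling rule described above, not copied from calculateBackoffDuration; the function name backoffDuration and the 10s cap used in main are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDuration returns how long a pod waits in the backoff queue after
// the given number of failed attempts: the initial value after the first
// failure, doubled for each further failure, and never more than max.
func backoffDuration(attempts int, initial, max time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		// Compare by subtraction so that the doubling below cannot overflow.
		if d > max-d {
			return max
		}
		d += d
	}
	return d
}

func main() {
	// Assumed settings for illustration: 1s initial backoff, 10s cap.
	for attempts := 1; attempts <= 6; attempts++ {
		fmt.Println(attempts, backoffDuration(attempts, time.Second, 10*time.Second))
	}
}
```

With these values the wait grows as 1s, 2s, 4s, 8s and then stays at the 10s cap for every later attempt.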

KubeSchedulerProfile represents one scheduler.

//pkg/scheduler/apis/config/types.go
// KubeSchedulerProfile is a scheduling profile.
type KubeSchedulerProfile struct {
	// Scheduler name. A pod that sets pod.Spec.SchedulerName to this name is scheduled by this scheduler
	SchedulerName string

	// Holds the extension points, each of which holds a list of plugins
	Plugins *Plugins

	// Plugin arguments; plugins that take arguments are configured here
	PluginConfig []PluginConfig
}

Plugins lists the scheduler's extension points. During scheduling, the plugins at each extension point are run in order.

type Plugins struct {
	// QueueSort is a list of plugins that should be invoked when sorting pods in the scheduling queue.
	QueueSort PluginSet

	// PreFilter is a list of plugins that should be invoked at "PreFilter" extension point of the scheduling framework.
	PreFilter PluginSet

	// Filter is a list of plugins that should be invoked when filtering out nodes that cannot run the Pod.
	Filter PluginSet

	// PostFilter is a list of plugins that are invoked after filtering phase, no matter whether filtering succeeds or not.
	PostFilter PluginSet

	// PreScore is a list of plugins that are invoked before scoring.
	PreScore PluginSet

	// Score is a list of plugins that should be invoked when ranking nodes that have passed the filtering phase.
	Score PluginSet

	// Reserve is a list of plugins invoked when reserving/unreserving resources
	// after a node is assigned to run the pod.
	Reserve PluginSet

	// Permit is a list of plugins that control binding of a Pod. These plugins can prevent or delay binding of a Pod.
	Permit PluginSet

	// PreBind is a list of plugins that should be invoked before a pod is bound.
	PreBind PluginSet

	// Bind is a list of plugins that should be invoked at "Bind" extension point of the scheduling framework.
	// The scheduler call these plugins in order. Scheduler skips the rest of these plugins as soon as one returns success.
	Bind PluginSet

	// PostBind is a list of plugins that should be invoked after a pod is successfully bound.
	PostBind PluginSet
}

Every extension point above has the type PluginSet, which lists the plugins enabled and disabled at that extension point.

type PluginSet struct {
	// Enabled specifies plugins that should be enabled in addition to default plugins.
	// These are called after default plugins and in the same order specified here.
	Enabled []Plugin
	// Disabled specifies default plugins that should be disabled.
	// When all default plugins need to be disabled, an array containing only one "*" should be provided.
	Disabled []Plugin
}

Plugin describes a single plugin by name and weight. The weight only applies at the score extension point.

type Plugin struct {
	// Name defines the name of plugin
	Name string
	// Weight defines the weight of plugin, only used for Score plugins.
	Weight int32
}
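To make the role of Weight concrete, here is a minimal sketch of how per-plugin scores for one node could be combined into a weighted total. The weightedTotal function and the score values are invented for illustration, and the real framework also normalizes scores before weighting them; only the plugin names and weights come from the default list shown later in this article.

```go
package main

import "fmt"

// Plugin mirrors the two fields of the struct shown above.
type Plugin struct {
	Name   string
	Weight int32
}

// weightedTotal sums score*weight over the enabled score plugins for a
// single node. scores maps a plugin name to that plugin's score for the node.
func weightedTotal(plugins []Plugin, scores map[string]int64) int64 {
	var total int64
	for _, p := range plugins {
		total += scores[p.Name] * int64(p.Weight)
	}
	return total
}

func main() {
	plugins := []Plugin{
		{Name: "ImageLocality", Weight: 1},
		{Name: "PodTopologySpread", Weight: 2},
	}
	// Made-up scores for one candidate node.
	scores := map[string]int64{"ImageLocality": 40, "PodTopologySpread": 30}
	fmt.Println(weightedTotal(plugins, scores)) // 40*1 + 30*2 = 100
}
```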

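Plugins
This section does not explain what each plugin does. It only introduces plugins in question-and-answer form.
a. If no config file is given, or the config file does not configure any plugins, are there default plugins? If so, where are they set?
Yes, there are plugins enabled by default. See the function getDefaultPlugins, which returns the plugins enabled by default at each extension point.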

//pkg/scheduler/apis/config/v1beta2/default_plugins.go
// getDefaultPlugins returns the default set of plugins.
func getDefaultPlugins() *v1beta2.Plugins {
	plugins := &v1beta2.Plugins{
		QueueSort: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.PrioritySort},
			},
		},
		PreFilter: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.NodeResourcesFit},
				{Name: names.NodePorts},
				{Name: names.VolumeRestrictions},
				{Name: names.PodTopologySpread},
				{Name: names.InterPodAffinity},
				{Name: names.VolumeBinding},
				{Name: names.NodeAffinity},
			},
		},
		Filter: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.NodeUnschedulable},
				{Name: names.NodeName},
				{Name: names.TaintToleration},
				{Name: names.NodeAffinity},
				{Name: names.NodePorts},
				{Name: names.NodeResourcesFit},
				{Name: names.VolumeRestrictions},
				{Name: names.EBSLimits},
				{Name: names.GCEPDLimits},
				{Name: names.NodeVolumeLimits},
				{Name: names.AzureDiskLimits},
				{Name: names.VolumeBinding},
				{Name: names.VolumeZone},
				{Name: names.PodTopologySpread},
				{Name: names.InterPodAffinity},
			},
		},
		PostFilter: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.DefaultPreemption},
			},
		},
		PreScore: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.InterPodAffinity},
				{Name: names.PodTopologySpread},
				{Name: names.TaintToleration},
				{Name: names.NodeAffinity},
			},
		},
		Score: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.NodeResourcesBalancedAllocation, Weight: pointer.Int32Ptr(1)},
				{Name: names.ImageLocality, Weight: pointer.Int32Ptr(1)},
				{Name: names.InterPodAffinity, Weight: pointer.Int32Ptr(1)},
				{Name: names.NodeResourcesFit, Weight: pointer.Int32Ptr(1)},
				{Name: names.NodeAffinity, Weight: pointer.Int32Ptr(1)},
				// Weight is doubled because:
				// - This is a score coming from user preference.
				// - It makes its signal comparable to NodeResourcesFit.LeastAllocated.
				{Name: names.PodTopologySpread, Weight: pointer.Int32Ptr(2)},
				{Name: names.TaintToleration, Weight: pointer.Int32Ptr(1)},
			},
		},
		Reserve: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.VolumeBinding},
			},
		},
		PreBind: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.VolumeBinding},
			},
		},
		Bind: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.DefaultBinder},
			},
		},
	}
	applyFeatureGates(plugins)

	return plugins
}

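b. If a config file is given, are only the configured plugins used? Do the default plugins still take effect?
The default plugins still take effect. The plugins that end up enabled are the union of the plugins listed in the config file and the default plugins, minus any default plugins the config file disables. See mergePlugins in the code below.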

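c. How are the plugins listed under enabled/disabled in the scheduler config combined with the default plugins?
See mergePlugins in the code below. Roughly:

If disabled in the config file contains *, all default plugins are turned off and only the plugins listed under enabled in the config file end up enabled.
If disabled in the config file lists specific plugins rather than *, those default plugins are dropped, and the final set is the remaining default plugins plus the plugins listed under enabled in the config file. An enabled entry with the same name as a default plugin replaces that default in place.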

//pkg/scheduler/apis/config/v1beta2/default_plugins.go
// mergePlugins merges the custom set into the given default one, handling disabled sets.
func mergePlugins(defaultPlugins, customPlugins *v1beta2.Plugins) *v1beta2.Plugins {
	if customPlugins == nil {
		return defaultPlugins
	}

	defaultPlugins.QueueSort = mergePluginSet(defaultPlugins.QueueSort, customPlugins.QueueSort)
	defaultPlugins.PreFilter = mergePluginSet(defaultPlugins.PreFilter, customPlugins.PreFilter)
	defaultPlugins.Filter = mergePluginSet(defaultPlugins.Filter, customPlugins.Filter)
	defaultPlugins.PostFilter = mergePluginSet(defaultPlugins.PostFilter, customPlugins.PostFilter)
	defaultPlugins.PreScore = mergePluginSet(defaultPlugins.PreScore, customPlugins.PreScore)
	defaultPlugins.Score = mergePluginSet(defaultPlugins.Score, customPlugins.Score)
	defaultPlugins.Reserve = mergePluginSet(defaultPlugins.Reserve, customPlugins.Reserve)
	defaultPlugins.Permit = mergePluginSet(defaultPlugins.Permit, customPlugins.Permit)
	defaultPlugins.PreBind = mergePluginSet(defaultPlugins.PreBind, customPlugins.PreBind)
	defaultPlugins.Bind = mergePluginSet(defaultPlugins.Bind, customPlugins.Bind)
	defaultPlugins.PostBind = mergePluginSet(defaultPlugins.PostBind, customPlugins.PostBind)
	return defaultPlugins
}

func mergePluginSet(defaultPluginSet, customPluginSet v1beta2.PluginSet) v1beta2.PluginSet {
	disabledPlugins := sets.NewString()
	enabledCustomPlugins := make(map[string]pluginIndex)
	// replacedPluginIndex is a set of index of plugins, which have replaced the default plugins.
	replacedPluginIndex := sets.NewInt()
	for _, disabledPlugin := range customPluginSet.Disabled {
		disabledPlugins.Insert(disabledPlugin.Name)
	}
	for index, enabledPlugin := range customPluginSet.Enabled {
		enabledCustomPlugins[enabledPlugin.Name] = pluginIndex{index, enabledPlugin}
	}
	var enabledPlugins []v1beta2.Plugin
	if !disabledPlugins.Has("*") {
		for _, defaultEnabledPlugin := range defaultPluginSet.Enabled {
			if disabledPlugins.Has(defaultEnabledPlugin.Name) {
				continue
			}
			// The default plugin is explicitly re-configured, update the default plugin accordingly.
			if customPlugin, ok := enabledCustomPlugins[defaultEnabledPlugin.Name]; ok {
				klog.InfoS("Default plugin is explicitly re-configured; overriding", "plugin", defaultEnabledPlugin.Name)
				// Update the default plugin in place to preserve order.
				defaultEnabledPlugin = customPlugin.plugin
				replacedPluginIndex.Insert(customPlugin.index)
			}
			enabledPlugins = append(enabledPlugins, defaultEnabledPlugin)
		}
	}

	// Append all the custom plugins which haven't replaced any default plugins.
	// Note: duplicated custom plugins will still be appended here.
	// If so, the instantiation of scheduler framework will detect it and abort.
	for index, plugin := range customPluginSet.Enabled {
		if !replacedPluginIndex.Has(index) {
			enabledPlugins = append(enabledPlugins, plugin)
		}
	}
	return v1beta2.PluginSet{Enabled: enabledPlugins}
}
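To see the merge rules in action without the Kubernetes packages, the following standalone sketch re-implements the same logic with plain maps and runs it on two made-up inputs. The types, the merge and names functions, and the plugin names A, B, C and X are all hypothetical simplifications, not the upstream code.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for the Plugin and PluginSet types.
type Plugin struct {
	Name   string
	Weight int32
}

type PluginSet struct {
	Enabled  []Plugin
	Disabled []Plugin
}

// merge follows the same steps as mergePluginSet: drop disabled defaults,
// override re-configured defaults in place, then append the remaining
// custom plugins.
func merge(defaults, custom PluginSet) []Plugin {
	disabled := map[string]bool{}
	for _, p := range custom.Disabled {
		disabled[p.Name] = true
	}
	customIndex := map[string]int{}
	for i, p := range custom.Enabled {
		customIndex[p.Name] = i
	}
	replaced := map[int]bool{}
	var out []Plugin
	if !disabled["*"] {
		for _, p := range defaults.Enabled {
			if disabled[p.Name] {
				continue
			}
			// A custom entry with the same name overrides the default in place.
			if i, ok := customIndex[p.Name]; ok {
				p = custom.Enabled[i]
				replaced[i] = true
			}
			out = append(out, p)
		}
	}
	// Append the custom plugins that did not replace a default.
	for i, p := range custom.Enabled {
		if !replaced[i] {
			out = append(out, p)
		}
	}
	return out
}

// names joins the plugin names for compact printing.
func names(ps []Plugin) string {
	parts := make([]string, 0, len(ps))
	for _, p := range ps {
		parts = append(parts, p.Name)
	}
	return strings.Join(parts, ",")
}

func main() {
	defaults := PluginSet{Enabled: []Plugin{{Name: "A", Weight: 1}, {Name: "B", Weight: 1}, {Name: "C", Weight: 1}}}

	// Disable C, re-configure B with a new weight, add X.
	custom := PluginSet{
		Enabled:  []Plugin{{Name: "B", Weight: 5}, {Name: "X", Weight: 1}},
		Disabled: []Plugin{{Name: "C"}},
	}
	fmt.Println(names(merge(defaults, custom))) // A,B,X

	// Disable every default with "*", keep only X.
	onlyX := PluginSet{Enabled: []Plugin{{Name: "X"}}, Disabled: []Plugin{{Name: "*"}}}
	fmt.Println(names(merge(defaults, onlyX))) // X
}
```

In the first case B keeps its position among the defaults but carries the custom weight 5, which matches the "update in place to preserve order" comment in the original function.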

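d. If a config file is given but it does not configure a default-scheduler scheduler, and a pod is created without setting pod.Spec.SchedulerName, will the pod be scheduled? Is there still a default scheduler named default-scheduler?
No, it will not be scheduled. When a config file is given, only the schedulers defined in it exist.
A pod that leaves pod.Spec.SchedulerName unset then cannot find default-scheduler, never gets scheduled, and stays in the Pending state.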
