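kube-scheduler is a core Kubernetes component. Its main job is to pick a suitable node for each newly created pod whose nodeName is empty.
The workflow is roughly as follows. kube-scheduler uses the informer mechanism to watch for changes to pod resources (it also watches nodes, PVs and so on, but only pods matter here). If a pod's pod.Spec.NodeName field is empty, the pod has not been assigned a node yet (if the user sets pod.Spec.NodeName to a specific node, no scheduling is needed). If, in addition, pod.Spec.SchedulerName matches one of the configured schedulers, the pod is added to the scheduling queue to wait for scheduling.
This article looks at the configuration file needed to create a scheduler, and at how the plugins used during scheduling are configured.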
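The enqueue decision described above can be sketched in a few lines of Go. This is only an illustration: the Pod type and the shouldEnqueue function are made-up stand-ins, not the real kube-scheduler types or event handlers.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for v1.Pod.
type Pod struct {
	Name          string
	NodeName      string // mirrors pod.Spec.NodeName
	SchedulerName string // mirrors pod.Spec.SchedulerName
}

// shouldEnqueue reports whether a pod belongs in the scheduling queue:
// it has no node yet, and it names a scheduler this process serves.
func shouldEnqueue(pod Pod, profiles map[string]bool) bool {
	assigned := pod.NodeName != ""
	return !assigned && profiles[pod.SchedulerName]
}

func main() {
	profiles := map[string]bool{"default-scheduler": true, "scheduler1": true}
	pods := []Pod{
		{Name: "a", SchedulerName: "default-scheduler"},                    // unassigned, known scheduler
		{Name: "b", NodeName: "node-1", SchedulerName: "default-scheduler"}, // node already chosen
		{Name: "c", SchedulerName: "someone-else"},                         // served by another scheduler
	}
	for _, p := range pods {
		fmt.Printf("%s enqueue=%v\n", p.Name, shouldEnqueue(p, profiles))
	}
}
```

Only pod a ends up in the queue; b already has a node and c asks for a scheduler this process does not know.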
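Configuration file
The configuration file is passed to the kube-scheduler process with the --config option. The file is written in a configuration API format. This configuration API is not exposed through the RESTful interface, so it can only be supplied as a file. The meaning of each field and the plugin arguments are described in the official documentation. Here is an example configuration file: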
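apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
# Leader election settings
leaderElection:
  leaderElect: true
# Settings for talking to the apiserver
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
# profiles defines the schedulers; more than one can be listed
profiles:
  # Scheduler [0], named 'default-scheduler'
  - schedulerName: default-scheduler
    # Plugins at each extension point
    plugins:
      # At queueSort, disable all default plugins and enable Test
      queueSort:
        enabled:
          - name: Test
        disabled:
          - name: "*"
      # At preFilter, enable Test
      preFilter:
        enabled:
          - name: Test
    # Plugin arguments
    pluginConfig:
      - name: Test
        args:
          abcd: efg
  # Scheduler [1], named 'scheduler1'
  - schedulerName: scheduler1
    # Plugins at each extension point
    plugins:
      # At queueSort, disable all default plugins and enable Test and test2.
      # This only illustrates the syntax: the framework accepts exactly one
      # queueSort plugin, and every profile must use the same one.
      queueSort:
        enabled:
          - name: Test
          - name: test2
        disabled:
          - name: "*"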
The configuration API maps to the following struct in the code:
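//pkg/scheduler/apis/config/types.go
// KubeSchedulerConfiguration configures a scheduler
type KubeSchedulerConfiguration struct {
	// Metadata that every Kubernetes API object has; it carries APIVersion and Kind
	metav1.TypeMeta
	// Degree of parallelism, default 16. When the scheduling algorithm runs, Parallelism goroutines are started to execute the filters
	Parallelism int32
	// Not covered here
	LeaderElection componentbaseconfig.LeaderElectionConfiguration
	// Settings for talking to the apiserver
	ClientConnection componentbaseconfig.ClientConnectionConfiguration
	// Address (IP:port) the health check server listens on, default 0.0.0.0:10251
	HealthzBindAddress string
	// Address (IP:port) the metrics server listens on, default 0.0.0.0:10251
	MetricsBindAddress string
	// Debugging-related settings, ignored for now
	componentbaseconfig.DebuggingConfiguration
	// Trying every node on every scheduling attempt is inefficient, so this sets the percentage of nodes
	// that take part: once that share of nodes has been found feasible, the scheduler stops looking for more
	PercentageOfNodesToScore int32
	// When scheduling fails, the pod first goes into the unschedulable queue; a goroutine or some other event later moves it into the podBackoff queue.
	// After the first failure the pod stays in the podBackoff queue for PodInitialBackoffSeconds*1 (default 1s), so the second attempt happens 1s after the failure
	PodInitialBackoffSeconds int64
	// After the second failure the pod stays in the podBackoff queue for PodInitialBackoffSeconds*2, and so on
	// (the wait doubles with each further failure), capped at PodMaxBackoffSeconds.
	// See calculateBackoffDuration. The scheduler keeps retrying after every failure until the pod is deleted
	PodMaxBackoffSeconds int64
	// Defines the schedulers. It is a slice, so several schedulers can be configured.
	// A pod selects its scheduler through pod.Spec.SchedulerName; if none is given, default-scheduler is used
	Profiles []KubeSchedulerProfile
	// Ignored for now
	Extenders []Extender
}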
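KubeSchedulerProfile represents one scheduler:
//pkg/scheduler/apis/config/types.go
// KubeSchedulerProfile is a scheduling profile.
type KubeSchedulerProfile struct {
	// Scheduler name. A pod that sets pod.Spec.SchedulerName is scheduled by the scheduler with that name
	SchedulerName string
	// Holds several extension points, each of which holds several plugins
	Plugins *Plugins
	// Plugin arguments. Some plugins need arguments, which are supplied here
	PluginConfig []PluginConfig
}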
Plugins defines the scheduler's extension points. During scheduling, the plugins at each extension point are run in order:
type Plugins struct {
	// QueueSort is a list of plugins that should be invoked when sorting pods in the scheduling queue.
	QueueSort PluginSet
	// PreFilter is a list of plugins that should be invoked at "PreFilter" extension point of the scheduling framework.
	PreFilter PluginSet
	// Filter is a list of plugins that should be invoked when filtering out nodes that cannot run the Pod.
	Filter PluginSet
	// PostFilter is a list of plugins that are invoked after filtering phase, no matter whether filtering succeeds or not.
	PostFilter PluginSet
	// PreScore is a list of plugins that are invoked before scoring.
	PreScore PluginSet
	// Score is a list of plugins that should be invoked when ranking nodes that have passed the filtering phase.
	Score PluginSet
	// Reserve is a list of plugins invoked when reserving/unreserving resources
	// after a node is assigned to run the pod.
	Reserve PluginSet
	// Permit is a list of plugins that control binding of a Pod. These plugins can prevent or delay binding of a Pod.
	Permit PluginSet
	// PreBind is a list of plugins that should be invoked before a pod is bound.
	PreBind PluginSet
	// Bind is a list of plugins that should be invoked at "Bind" extension point of the scheduling framework.
	// The scheduler call these plugins in order. Scheduler skips the rest of these plugins as soon as one returns success.
	Bind PluginSet
	// PostBind is a list of plugins that should be invoked after a pod is successfully bound.
	PostBind PluginSet
}
Every extension point above has the type PluginSet, which lists the plugins enabled and disabled at that extension point:
type PluginSet struct {
	// Enabled specifies plugins that should be enabled in addition to default plugins.
	// These are called after default plugins and in the same order specified here.
	Enabled []Plugin
	// Disabled specifies default plugins that should be disabled.
	// When all default plugins need to be disabled, an array containing only one "*" should be provided.
	Disabled []Plugin
}
Plugin describes a single plugin by name and weight. The weight only takes effect at the Score extension point:
type Plugin struct {
	// Name defines the name of plugin
	Name string
	// Weight defines the weight of plugin, only used for Score plugins.
	Weight int32
}
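To give a feel for what the weight does at the Score extension point, here is a small sketch that sums each plugin's score for one node multiplied by that plugin's weight. It is an illustration under assumptions, not the framework's scoring code: the pluginScore type, the weightedTotal function and the score values are all invented for the example.

```go
package main

import "fmt"

// pluginScore is a hypothetical pairing of a Score plugin's configured
// weight with the score that plugin gave to one node.
type pluginScore struct {
	Name   string
	Weight int64
	Score  int64
}

// weightedTotal adds up score*weight over all Score plugins for one node.
func weightedTotal(scores []pluginScore) int64 {
	var total int64
	for _, s := range scores {
		total += s.Weight * s.Score
	}
	return total
}

func main() {
	// Scores are invented; a weight of 2 makes a plugin count twice as much.
	node := []pluginScore{
		{Name: "NodeResourcesFit", Weight: 1, Score: 80},
		{Name: "PodTopologySpread", Weight: 2, Score: 50},
		{Name: "TaintToleration", Weight: 1, Score: 100},
	}
	fmt.Println(weightedTotal(node)) // 80*1 + 50*2 + 100*1 = 280
}
```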
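Plugins
This section does not explain what each plugin does. It only introduces plugins through a few questions and answers.
a. If no config file is given, or the config file configures no plugins, are there default plugins? If so, where are they set?
Yes, some plugins are enabled by default. See the function getDefaultPlugins, which returns the plugins enabled by default at each extension point: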
//pkg/scheduler/apis/config/v1beta2/default_plugins.go
// getDefaultPlugins returns the default set of plugins.
func getDefaultPlugins() *v1beta2.Plugins {
	plugins := &v1beta2.Plugins{
		QueueSort: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.PrioritySort},
			},
		},
		PreFilter: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.NodeResourcesFit},
				{Name: names.NodePorts},
				{Name: names.VolumeRestrictions},
				{Name: names.PodTopologySpread},
				{Name: names.InterPodAffinity},
				{Name: names.VolumeBinding},
				{Name: names.NodeAffinity},
			},
		},
		Filter: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.NodeUnschedulable},
				{Name: names.NodeName},
				{Name: names.TaintToleration},
				{Name: names.NodeAffinity},
				{Name: names.NodePorts},
				{Name: names.NodeResourcesFit},
				{Name: names.VolumeRestrictions},
				{Name: names.EBSLimits},
				{Name: names.GCEPDLimits},
				{Name: names.NodeVolumeLimits},
				{Name: names.AzureDiskLimits},
				{Name: names.VolumeBinding},
				{Name: names.VolumeZone},
				{Name: names.PodTopologySpread},
				{Name: names.InterPodAffinity},
			},
		},
		PostFilter: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.DefaultPreemption},
			},
		},
		PreScore: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.InterPodAffinity},
				{Name: names.PodTopologySpread},
				{Name: names.TaintToleration},
				{Name: names.NodeAffinity},
			},
		},
		Score: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.NodeResourcesBalancedAllocation, Weight: pointer.Int32Ptr(1)},
				{Name: names.ImageLocality, Weight: pointer.Int32Ptr(1)},
				{Name: names.InterPodAffinity, Weight: pointer.Int32Ptr(1)},
				{Name: names.NodeResourcesFit, Weight: pointer.Int32Ptr(1)},
				{Name: names.NodeAffinity, Weight: pointer.Int32Ptr(1)},
				// Weight is doubled because:
				// - This is a score coming from user preference.
				// - It makes its signal comparable to NodeResourcesFit.LeastAllocated.
				{Name: names.PodTopologySpread, Weight: pointer.Int32Ptr(2)},
				{Name: names.TaintToleration, Weight: pointer.Int32Ptr(1)},
			},
		},
		Reserve: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.VolumeBinding},
			},
		},
		PreBind: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.VolumeBinding},
			},
		},
		Bind: v1beta2.PluginSet{
			Enabled: []v1beta2.Plugin{
				{Name: names.DefaultBinder},
			},
		},
	}
	applyFeatureGates(plugins)
	return plugins
}
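b. If a config is specified, are only the configured plugins used? Do the default plugins still take effect?
The default plugins still take effect. The plugins that end up enabled are the union of the plugins named in the config file and the default plugins, minus any defaults the config disables. See mergePlugins in the code below.
c. How are the plugins listed under enabled/disabled in the scheduler config combined with the default plugins?
See mergePlugins in the code below. Roughly:
If disabled in the config file contains *, all default plugins are turned off, and only the plugins listed under enabled in the config file remain.
If disabled does not contain *, the result is the default plugins, minus those disabled by name, plus the plugins listed under enabled. A plugin under enabled that has the same name as a default plugin replaces that default in place; the others are appended after the defaults.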
//pkg/scheduler/apis/config/v1beta2/default_plugins.go
// mergePlugins merges the custom set into the given default one, handling disabled sets.
func mergePlugins(defaultPlugins, customPlugins *v1beta2.Plugins) *v1beta2.Plugins {
	if customPlugins == nil {
		return defaultPlugins
	}
	defaultPlugins.QueueSort = mergePluginSet(defaultPlugins.QueueSort, customPlugins.QueueSort)
	defaultPlugins.PreFilter = mergePluginSet(defaultPlugins.PreFilter, customPlugins.PreFilter)
	defaultPlugins.Filter = mergePluginSet(defaultPlugins.Filter, customPlugins.Filter)
	defaultPlugins.PostFilter = mergePluginSet(defaultPlugins.PostFilter, customPlugins.PostFilter)
	defaultPlugins.PreScore = mergePluginSet(defaultPlugins.PreScore, customPlugins.PreScore)
	defaultPlugins.Score = mergePluginSet(defaultPlugins.Score, customPlugins.Score)
	defaultPlugins.Reserve = mergePluginSet(defaultPlugins.Reserve, customPlugins.Reserve)
	defaultPlugins.Permit = mergePluginSet(defaultPlugins.Permit, customPlugins.Permit)
	defaultPlugins.PreBind = mergePluginSet(defaultPlugins.PreBind, customPlugins.PreBind)
	defaultPlugins.Bind = mergePluginSet(defaultPlugins.Bind, customPlugins.Bind)
	defaultPlugins.PostBind = mergePluginSet(defaultPlugins.PostBind, customPlugins.PostBind)
	return defaultPlugins
}

func mergePluginSet(defaultPluginSet, customPluginSet v1beta2.PluginSet) v1beta2.PluginSet {
	disabledPlugins := sets.NewString()
	enabledCustomPlugins := make(map[string]pluginIndex)
	// replacedPluginIndex is a set of index of plugins, which have replaced the default plugins.
	replacedPluginIndex := sets.NewInt()
	for _, disabledPlugin := range customPluginSet.Disabled {
		disabledPlugins.Insert(disabledPlugin.Name)
	}
	for index, enabledPlugin := range customPluginSet.Enabled {
		enabledCustomPlugins[enabledPlugin.Name] = pluginIndex{index, enabledPlugin}
	}
	var enabledPlugins []v1beta2.Plugin
	if !disabledPlugins.Has("*") {
		for _, defaultEnabledPlugin := range defaultPluginSet.Enabled {
			if disabledPlugins.Has(defaultEnabledPlugin.Name) {
				continue
			}
			// The default plugin is explicitly re-configured, update the default plugin accordingly.
			if customPlugin, ok := enabledCustomPlugins[defaultEnabledPlugin.Name]; ok {
				klog.InfoS("Default plugin is explicitly re-configured; overriding", "plugin", defaultEnabledPlugin.Name)
				// Update the default plugin in place to preserve order.
				defaultEnabledPlugin = customPlugin.plugin
				replacedPluginIndex.Insert(customPlugin.index)
			}
			enabledPlugins = append(enabledPlugins, defaultEnabledPlugin)
		}
	}
	// Append all the custom plugins which haven't replaced any default plugins.
	// Note: duplicated custom plugins will still be appended here.
	// If so, the instantiation of scheduler framework will detect it and abort.
	for index, plugin := range customPluginSet.Enabled {
		if !replacedPluginIndex.Has(index) {
			enabledPlugins = append(enabledPlugins, plugin)
		}
	}
	return v1beta2.PluginSet{Enabled: enabledPlugins}
}
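d. If a config is specified but it contains no default-scheduler profile, and a pod is created without setting pod.Spec.SchedulerName, will the pod be scheduled? Is there still a default-scheduler?
No, it will not be scheduled. Once the config defines its own profiles, only the schedulers listed there exist.
A pod created without pod.Spec.SchedulerName then never gets scheduled, because no default-scheduler can be found, and it stays in the Pending state.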
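The situation in question d can be modelled with a short sketch. It rests on two assumptions that are stated here rather than taken from the code shown in this article: an empty profile list is defaulted to a single default-scheduler profile, and an empty pod.Spec.SchedulerName is treated as default-scheduler. The function names effectiveProfiles and podSchedulerName are made up for the example.

```go
package main

import "fmt"

const defaultSchedulerName = "default-scheduler"

// effectiveProfiles models the assumed defaulting: with no profiles
// configured, a default-scheduler profile is added; otherwise the
// configured list is used exactly as written.
func effectiveProfiles(configured []string) map[string]bool {
	if len(configured) == 0 {
		configured = []string{defaultSchedulerName}
	}
	out := make(map[string]bool, len(configured))
	for _, name := range configured {
		out[name] = true
	}
	return out
}

// podSchedulerName models the assumed defaulting of an empty
// pod.Spec.SchedulerName.
func podSchedulerName(specName string) string {
	if specName == "" {
		return defaultSchedulerName
	}
	return specName
}

func main() {
	cases := [][]string{
		nil,            // no profiles in the config
		{"scheduler1"}, // config lists only scheduler1
	}
	for _, configured := range cases {
		profiles := effectiveProfiles(configured)
		name := podSchedulerName("") // pod created without a scheduler name
		fmt.Printf("configured=%v pod wants %q, handled=%v\n", configured, name, profiles[name])
	}
}
```

With no profiles configured the pod is handled; with only scheduler1 configured nothing claims it, which corresponds to the pod staying Pending.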