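I deployed a Kubernetes v1.24.1 cluster from binary packages on CentOS 7.9, with kubelet using containerd as the container runtime. kubelet failed to start; this post walks through the troubleshooting and the fix.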
Component | Version |
CentOS | 7.9 |
Kernel | 5.4.195-1.el7.elrepo.x86_64 |
Kubernetes | v1.24.1 |
containerd | v1.6.4 |
[root @ machine5 ~]$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Fri 2022-06-10 21:56:47 CST; 304ms ago
[root @ machine5 ~]$ journalctl -xe -u kubelet
Jun 10 22:23:33 machine5 kubelet[11122]: I0610 22:23:33.098633 11122 remote_runtime.go:114] "Finding the CRI API runtime version"
Jun 10 22:23:33 machine5 kubelet[11122]: W0610 22:23:33.838519 11122 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to { <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix: missing address". Reconnecting...
Jun 10 22:23:33 machine5 kubelet[11122]: Error: failed to run Kubelet: unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix: missing address"
The error is "failed to run Kubelet: unable to determine runtime API version".
Judging from the error, kubelet cannot reach the CRI interface that containerd provides, even though the containerd service is already running:
[root @ machine5 ~]$ systemctl status containerd -l
● containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2022-06-10 22:20:06 CST; 6s ago
Docs: https://containerd.io
Process: 9923 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 9925 (containerd)
Tasks: 9
Memory: 26.0M
CGroup: /system.slice/containerd.service
└─9925 /usr/bin/containerd
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913907117+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913913347+08:00" level=info msg="loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." type=io.containerd.tracing.processor.v1
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913926772+08:00" level=info msg="skip loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." error="no OpenTelemetry endpoint: skip plugin" type=io.containerd.tracing.processor.v1
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913933530+08:00" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913947444+08:00" level=error msg="failed to initialize a tracing processor \"otlp\"" error="no OpenTelemetry endpoint: skip plugin"
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913983093+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.cri\"..." type=io.containerd.grpc.v1
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.914128628+08:00" level=warning msg="failed to load plugin io.containerd.grpc.v1.cri" error="invalid plugin config: `systemd_cgroup` only works for runtime io.containerd.runtime.v1.linux"
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.914306062+08:00" level=info msg=serving... address=/run/containerd/containerd.sock.ttrpc
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.914331072+08:00" level=info msg=serving... address=/run/containerd/containerd.sock
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.914370232+08:00" level=info msg="containerd successfully booted in 0.022563s"
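The last log lines above show containerd serving on /run/containerd/containerd.sock. As a minimal sketch, the check below confirms that a given endpoint points at an existing Unix domain socket. The function names cri_socket_path and check_cri_socket are hypothetical helpers, not part of containerd or kubelet. A present socket only shows that containerd is listening; it does not prove that the CRI plugin loaded.

```shell
#!/usr/bin/env bash
# Strip the unix:// scheme from a CRI endpoint to get the socket's filesystem path.
# (hypothetical helper name)
cri_socket_path() {
  printf '%s\n' "${1#unix://}"
}

# Succeed only if the endpoint's path exists and is a Unix domain socket.
# (hypothetical helper name)
check_cri_socket() {
  local path
  path=$(cri_socket_path "$1")
  if [ -S "$path" ]; then
    echo "socket present: $path"
  else
    echo "no socket at: $path" >&2
    return 1
  fi
}

# Address taken from the "serving..." log line above.
check_cri_socket unix:///run/containerd/containerd.sock || true
```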
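Looking more closely at the kubelet startup log, I found that --container-runtime-endpoint was empty at startup, while the deprecated --containerd flag defaulted to containerd's socket address. That explains the "dial unix: missing address" error: kubelet was dialing an empty endpoint.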
Jun 10 22:23:26 machine5 kubelet[11122]: I0610 22:23:26.358097 11122 flags.go:64] FLAG: --container-runtime-endpoint=""
Jun 10 22:23:26 machine5 kubelet[11122]: I0610 22:23:26.358099 11122 flags.go:64] FLAG: --containerd="/run/containerd/containerd.sock"
... (omitted) ...
Jun 10 22:23:33 machine5 kubelet[11122]: --container-runtime-endpoint string The endpoint of remote runtime service. Unix Domain Sockets are supported on Linux, while npipe and tcp en
Jun 10 22:23:33 machine5 kubelet[11122]: --containerd string containerd endpoint (default "/run/containerd/containerd.sock")
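kubelet logs every flag value at startup in the form FLAG: --name="value", so the effective value of a flag can be pulled out of the journal instead of being read by eye. The following is a rough sketch; kubelet_flag_value is a hypothetical helper name, and the sample lines are copied from the excerpt above.

```shell
#!/usr/bin/env bash
# Print the value kubelet logged for one flag at startup.
# Reads journal text on stdin; $1 is the flag name without the leading dashes.
# (hypothetical helper name)
kubelet_flag_value() {
  local flag="$1"
  # Startup lines look like: ... flags.go:64] FLAG: --name="value"
  # If the flag was logged several times (restarts), keep the most recent one.
  sed -n "s/.*FLAG: --${flag}=\"\(.*\)\"\$/\1/p" | tail -n 1
}

# Sample lines copied from the log excerpt above.
sample='Jun 10 22:23:26 machine5 kubelet[11122]: I0610 22:23:26.358097 11122 flags.go:64] FLAG: --container-runtime-endpoint=""
Jun 10 22:23:26 machine5 kubelet[11122]: I0610 22:23:26.358099 11122 flags.go:64] FLAG: --containerd="/run/containerd/containerd.sock"'

endpoint=$(printf '%s\n' "$sample" | kubelet_flag_value container-runtime-endpoint)
if [ -z "$endpoint" ]; then
  echo "container-runtime-endpoint is empty"
fi
```

On a live node the same function could be fed from the journal, for example: journalctl -u kubelet --no-pager | kubelet_flag_value container-runtime-endpoint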
I then consulted two pages of the official documentation, "Changing the Container Runtime on a Node from Docker Engine to containerd" and "Component tools - Kubelet", for how to configure kubelet to use containerd as the container runtime and for the description of the kubelet --container-runtime-endpoint flag.
Configuring kubelet to use containerd as the container runtime
--- from "Changing the Container Runtime on a Node from Docker Engine to containerd"
Configure the kubelet to use containerd as its container runtime
Edit the file /var/lib/kubelet/kubeadm-flags.env and add the containerd runtime to the flags. --container-runtime=remote and
--container-runtime-endpoint=unix:///run/containerd/containerd.sock.
... (omitted) ...
Note that new CRI socket paths must be prefixed with unix:// ideally.
Description of the kubelet flags
--- from "Component tools - Kubelet"
--container-runtime string Default: docker
The container runtime to use. Possible values: docker, remote.
--container-runtime-endpoint string Default: unix:///var/run/dockershim.sock
[Experimental] The endpoint of remote runtime service. Currently unix socket endpoint is supported on Linux, while npipe and tcp endpoints are supported on Windows. Examples: unix:///var/run/dockershim.sock, npipe:////./pipe/dockershim.
The documentation and the kubelet flag descriptions show that, to use containerd as the container runtime, kubelet must be started with both --container-runtime-endpoint and --container-runtime set.
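The documentation quoted above notes that CRI socket paths should carry the unix:// prefix. As a small sketch of that rule, the function below adds the scheme to a bare absolute path and rejects an empty value, which is the case that produced the "missing address" error. normalize_cri_endpoint is a hypothetical helper name, not a kubelet feature.

```shell
#!/usr/bin/env bash
# Add the unix:// scheme to a bare socket path; leave values that already
# carry a scheme untouched; reject empty or unrecognized input.
# (hypothetical helper name)
normalize_cri_endpoint() {
  case "$1" in
    "")    echo "error: empty endpoint" >&2; return 1 ;;
    *://*) printf '%s\n' "$1" ;;
    /*)    printf 'unix://%s\n' "$1" ;;
    *)     echo "error: unrecognized endpoint: $1" >&2; return 1 ;;
  esac
}

normalize_cri_endpoint /run/containerd/containerd.sock
# prints unix:///run/containerd/containerd.sock
```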
Because I manage the kubelet service with systemd, the kubelet startup flags have to be changed in kubelet.service, as follows:
[root @ machine5 ~]$ vim /usr/lib/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=containerd.service
Requires=containerd.service
[Service]
WorkingDirectory=/data/kubernetes/kubelet
ExecStart=/usr/local/bin/kubelet \
... (omitted) ...
--container-runtime=remote \
--container-runtime-endpoint=unix:///run/containerd/containerd.sock
... (omitted) ...
[root @ machine5 ~]$ systemctl daemon-reload
[root @ machine5 ~]$ systemctl start kubelet
[root @ machine5 ~]$ journalctl -xe -u kubelet
Jun 10 23:05:31 machine5 kubelet[25811]: I0610 23:05:31.877416 25811 kubelet.go:376] "Attempting to sync node with API server"
Jun 10 23:05:31 machine5 kubelet[25811]: I0610 23:05:31.877443 25811 kubelet.go:278] "Adding apiserver pod source"
Jun 10 23:05:31 machine5 kubelet[25811]: I0610 23:05:31.877457 25811 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
Jun 10 23:05:31 machine5 kubelet[25811]: E0610 23:05:31.878947 25811 remote_runtime.go:168] "Version from runtime service failed" err="rpc error: code = Unimplemented desc = unknown service runtime.v1alph"
Jun 10 23:05:31 machine5 kubelet[25811]: E0610 23:05:31.878995 25811 kuberuntime_manager.go:225] "Get runtime version failed" err="get remote runtime typed version failed: rpc error: code = Unimplemented"
Jun 10 23:05:31 machine5 kubelet[25811]: Error: failed to run Kubelet: failed to create kubelet: get remote runtime typed version failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alp
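Judging from the error, the problem is still on the containerd side: containerd 1.6.4 reports the CRI service that kubelet asks for (its name is cut off in the log as "runtime.v1alp") as unknown, which suggests that its CRI plugin is not being served.
Many analyses online attribute this to the "cri" plugin being disabled in /etc/containerd/config.toml, and the suggested fix is to delete "/etc/containerd/config.toml" and restart containerd.
However, my containerd configuration does not disable the cri plugin, and I had already configured it for my setup. I also want to use containerd as the container runtime with systemd as the cgroup driver, so that fix does not suit my case. See "container-runtimes: containerd".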
While going back through the containerd log, I noticed a warning about "systemd_cgroup" in the startup output:
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.914128628+08:00" level=warning msg="failed to load plugin io.containerd.grpc.v1.cri" error="invalid plugin config: `systemd_cgroup` only works for runtime io.containerd.runtime.v1.linux"
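This says that in containerd v1.6.4, "systemd_cgroup" only works when the runtime type is "io.containerd.runtime.v1.linux". Because of this validation error the cri plugin failed to load, so containerd was running without a CRI service.
So the containerd configuration is at fault, and I changed runtime_type accordingly.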
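This failure is easy to miss because containerd still reports "successfully booted" and systemctl shows it as active. As a rough sketch, the filter below pulls only the "failed to load plugin" entries out of containerd's log and prints the plugin name with its error. failed_plugins is a hypothetical helper name, and the sample line is the warning quoted above.

```shell
#!/usr/bin/env bash
# Print "plugin: error" for every plugin containerd failed to load.
# Reads containerd log text on stdin. (hypothetical helper name)
failed_plugins() {
  sed -n 's/.*msg="failed to load plugin \([^"]*\)" error="\(.*\)"$/\1: \2/p'
}

# Sample line copied from the containerd startup log above.
sample='Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.914128628+08:00" level=warning msg="failed to load plugin io.containerd.grpc.v1.cri" error="invalid plugin config: `systemd_cgroup` only works for runtime io.containerd.runtime.v1.linux"'

printf '%s\n' "$sample" | failed_plugins
```

On a live node the input could come from the journal, for example: journalctl -u containerd --no-pager | failed_plugins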
[root @ machine5 ~]$ vim /etc/containerd/config.toml
... (omitted) ...
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
... (omitted) ...
runtime_root = ""
runtime_type = "io.containerd.runtime.v1.linux"
[root @ machine5 ~]$ systemctl restart containerd
[root @ machine5 ~]$ systemctl status containerd
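After the restart, the containerd warning above is gone.
Somewhat awkwardly, another warning shows up: level=warning msg="runtime v1 is deprecated since containerd v1.4, consider using runtime v2"
In other words, containerd has deprecated runtime v1 (the io.containerd.runtime.v1.linux shim) since v1.4. This refers to containerd's own shim runtime, not to the CRI API version that kubelet speaks, so kubelet 1.24.1 still works with this configuration, but the setup now depends on a deprecated runtime. The "container-runtimes: containerd" page referenced above instead describes setting SystemdCgroup = true in the runc options of the io.containerd.runc.v2 runtime, which would avoid the deprecated one.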
[root @ machine5 ~]$ systemctl start kubelet
[root @ machine5 ~]$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2022-06-10 23:37:42 CST; 4min 19s ago
#################### separator ####################
[root @ machine1 ~]$ kubectl get node
NAME STATUS ROLES AGE VERSION
machine1 Ready <none> 20h v1.24.1
machine2 Ready <none> 20h v1.24.1
machine3 Ready <none> 20h v1.24.1
machine4 Ready <none> 20h v1.24.1
machine5 Ready <none> 5m13s v1.24.1
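With more nodes, scanning this table by eye gets tedious. As a small sketch, the filter below reads the output of kubectl get node and prints the names of nodes whose STATUS column is not exactly "Ready". not_ready_nodes is a hypothetical helper name, and the NotReady row in the sample is invented for illustration.

```shell
#!/usr/bin/env bash
# Print the NAME of every node whose STATUS is not exactly "Ready".
# Expects the default `kubectl get node` table on stdin and skips its header row.
# A status such as "Ready,SchedulingDisabled" is reported as well.
# (hypothetical helper name)
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Sample table; the NotReady row is made up for illustration.
sample='NAME STATUS ROLES AGE VERSION
machine1 Ready <none> 20h v1.24.1
machine5 NotReady <none> 5m13s v1.24.1'

printf '%s\n' "$sample" | not_ready_nodes
# prints machine5
```

On the cluster it could be used as: kubectl get node | not_ready_nodes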
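With that, the problem is solved.
A solution from someone who works in operations and has also been a product manager for operations tooling.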