Kubernetes Deployment Troubleshooting (Part 1): kubelet fails to start with "failed to run Kubelet: unable to determine runtime API version"

吕俊美
2023-12-01

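Background

I deployed a Kubernetes v1.24.1 cluster from binary packages on CentOS 7.9, with the kubelet using containerd as its container runtime. The kubelet failed to start. This post records the troubleshooting and the fix.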

Version information

Service       Version
CentOS        7.9
Kernel        5.4.195-1.el7.elrepo.x86_64
Kubernetes    v1.24.1
containerd    v1.6.4

Troubleshooting and fix

The kubelet fails to start

[root @ machine5 ~]$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2022-06-10 21:56:47 CST; 304ms ago

Check the error messages

[root @ machine5 ~]$ journalctl -xe -u kubelet
Jun 10 22:23:33 machine5 kubelet[11122]: I0610 22:23:33.098633   11122 remote_runtime.go:114] "Finding the CRI API runtime version"
Jun 10 22:23:33 machine5 kubelet[11122]: W0610 22:23:33.838519   11122 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {  <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix: missing address". Reconnecting...
Jun 10 22:23:33 machine5 kubelet[11122]: Error: failed to run Kubelet: unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix: missing address"

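The error is "failed to run Kubelet: unable to determine runtime API version".

Judging from the error, the kubelet cannot reach the interface served by containerd, even though the containerd service is already running.

containerd service status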

[root @ machine5 ~]$  systemctl status containerd -l
● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-06-10 22:20:06 CST; 6s ago
     Docs: https://containerd.io
  Process: 9923 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 9925 (containerd)
    Tasks: 9
   Memory: 26.0M
   CGroup: /system.slice/containerd.service
           └─9925 /usr/bin/containerd

Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913907117+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913913347+08:00" level=info msg="loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." type=io.containerd.tracing.processor.v1
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913926772+08:00" level=info msg="skip loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." error="no OpenTelemetry endpoint: skip plugin" type=io.containerd.tracing.processor.v1
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913933530+08:00" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913947444+08:00" level=error msg="failed to initialize a tracing processor \"otlp\"" error="no OpenTelemetry endpoint: skip plugin"
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.913983093+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.cri\"..." type=io.containerd.grpc.v1
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.914128628+08:00" level=warning msg="failed to load plugin io.containerd.grpc.v1.cri" error="invalid plugin config: `systemd_cgroup` only works for runtime io.containerd.runtime.v1.linux"
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.914306062+08:00" level=info msg=serving... address=/run/containerd/containerd.sock.ttrpc
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.914331072+08:00" level=info msg=serving... address=/run/containerd/containerd.sock
Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.914370232+08:00" level=info msg="containerd successfully booted in 0.022563s"

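A closer look at the kubelet startup log shows that --container-runtime-endpoint is empty at startup, which matches the "dial unix: missing address" error, while the (deprecated) --containerd flag defaults to the containerd socket address.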

Jun 10 22:23:26 machine5 kubelet[11122]: I0610 22:23:26.358097   11122 flags.go:64] FLAG: --container-runtime-endpoint=""
Jun 10 22:23:26 machine5 kubelet[11122]: I0610 22:23:26.358099   11122 flags.go:64] FLAG: --containerd="/run/containerd/containerd.sock"
... (omitted) ...
Jun 10 22:23:33 machine5 kubelet[11122]: --container-runtime-endpoint string                        The endpoint of remote runtime service. Unix Domain Sockets are supported on Linux, while npipe and tcp en
Jun 10 22:23:33 machine5 kubelet[11122]: --containerd string                                        containerd endpoint (default "/run/containerd/containerd.sock")
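
To pull a single flag value out of a long kubelet journal, a small filter along these lines may help. It is only a sketch: the function name kubelet_flag_value is made up for illustration, and it assumes the FLAG: --name="value" line format seen in the excerpt above.

```shell
# Hypothetical helper: print the value the kubelet logged for one flag.
# Reads journal text on stdin and assumes lines ending in: FLAG: --<name>="<value>"
kubelet_flag_value() {
  flag="$1"
  # Keep only the quoted value; if the flag was logged several times, use the last one.
  sed -n "s/.*FLAG: --${flag}=\"\(.*\)\"[[:space:]]*\$/\1/p" | tail -n 1
}
```

For example, `journalctl -u kubelet --no-pager | kubelet_flag_value container-runtime-endpoint` prints an empty line when the flag was logged with an empty value, as it was here, and prints nothing at all if the flag never appears in the journal.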

I then consulted two official documents, "Changing the Container Runtime on a Node from Docker Engine to containerd" and "Component tools - Kubelet", for how to configure the kubelet to use containerd as the container runtime and for the description of the kubelet "--container-runtime-endpoint" flag.

Configuring the kubelet to use containerd as the container runtime

--- from "Changing the Container Runtime on a Node from Docker Engine to containerd"

Configure the kubelet to use containerd as its container runtime 

Edit the file /var/lib/kubelet/kubeadm-flags.env and add the containerd runtime to the flags. --container-runtime=remote and --container-runtime-endpoint=unix:///run/containerd/containerd.sock.

... (omitted) ...

Note that new CRI socket paths must be prefixed with unix:// ideally.

Description of the kubelet "--container-runtime-endpoint" flag

--container-runtime string     Default: docker

The container runtime to use. Possible values: docker, remote.

--container-runtime-endpoint string     Default: unix:///var/run/dockershim.sock

[Experimental] The endpoint of remote runtime service. Currently unix socket endpoint is supported on Linux, while npipe and tcp endpoints are supported on windows. Examples: unix:///var/run/dockershim.sock, npipe:./pipe/dockershim.

The documentation and the flag descriptions show that, when containerd is the container runtime, the kubelet must be started with both "--container-runtime-endpoint" and "--container-runtime" set.

Since I manage the kubelet with systemd, the startup flags have to be changed in kubelet.service, as follows:

[root @ machine5 ~]$ vim /usr/lib/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=containerd.service
Requires=containerd.service
[Service]
WorkingDirectory=/data/kubernetes/kubelet
ExecStart=/usr/local/bin/kubelet \
... (omitted) ...
  --container-runtime=remote \
  --container-runtime-endpoint=unix:///run/containerd/containerd.sock
... (omitted) ...
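
Before restarting the kubelet, the endpoint value can be sanity-checked. The sketch below uses a hypothetical helper, check_cri_endpoint (not a kubelet or containerd command). It only tests that the string is non-empty, carries the unix:// scheme, and points at an existing socket file. It does not talk to the CRI service.

```shell
# Hypothetical helper: basic checks on a CRI endpoint string.
# Prints a short verdict and returns non-zero on the first problem found.
check_cri_endpoint() {
  endpoint="$1"
  case "$endpoint" in
    "")       echo "empty"; return 1 ;;
    unix://*) ;;  # scheme present, continue with the socket check
    *)        echo "missing unix:// scheme"; return 1 ;;
  esac
  # Strip the scheme to get the filesystem path of the socket.
  path="${endpoint#unix://}"
  if [ -S "$path" ]; then
    echo "ok"
  else
    echo "no socket at $path"
    return 1
  fi
}
```

On this node the call would be `check_cri_endpoint unix:///run/containerd/containerd.sock`.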

Starting the kubelet now produces a new error, "unknown service runtime.v1alp" (the service name is cut off in the journal output)


[root @ machine5 ~]$ systemctl daemon-reload
[root @ machine5 ~]$ systemctl start kubelet
[root @ machine5 ~]$ journalctl -xe -u kubelet
Jun 10 23:05:31 machine5 kubelet[25811]: I0610 23:05:31.877416   25811 kubelet.go:376] "Attempting to sync node with API server"
Jun 10 23:05:31 machine5 kubelet[25811]: I0610 23:05:31.877443   25811 kubelet.go:278] "Adding apiserver pod source"
Jun 10 23:05:31 machine5 kubelet[25811]: I0610 23:05:31.877457   25811 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
Jun 10 23:05:31 machine5 kubelet[25811]: E0610 23:05:31.878947   25811 remote_runtime.go:168] "Version from runtime service failed" err="rpc error: code = Unimplemented desc = unknown service runtime.v1alph"
Jun 10 23:05:31 machine5 kubelet[25811]: E0610 23:05:31.878995   25811 kuberuntime_manager.go:225] "Get runtime version failed" err="get remote runtime typed version failed: rpc error: code = Unimplemented"
Jun 10 23:05:31 machine5 kubelet[25811]: Error: failed to run Kubelet: failed to create kubelet: get remote runtime typed version failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alp

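Judging from the error, the problem is still on the containerd side: containerd 1.6.4 does not serve runtime.v1alp.

Many analyses found online attribute this to the "cri" plugin being disabled in the containerd configuration /etc/containerd/config.toml, and the suggested fix is to delete "/etc/containerd/config.toml" and restart containerd.

However, my containerd configuration does not disable the cri plugin, and I had configured it deliberately. I also want to use containerd as the container runtime with systemd as the cgroup driver, so that fix does not really solve my problem. (See "container-runtimes: containerd".)

Resolving the "unknown service runtime.v1alp" error

While going through the containerd logs, I noticed a warning about "systemd_cgroup" in the startup output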

Jun 10 22:20:06 machine5 containerd[9925]: time="2022-06-10T22:20:06.914128628+08:00" level=warning msg="failed to load plugin io.containerd.grpc.v1.cri" error="invalid plugin config: `systemd_cgroup` only works for runtime io.containerd.runtime.v1.linux"

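This means that in containerd v1.6.4, "systemd_cgroup" only works when the runtime type is "io.containerd.runtime.v1.linux". Because the CRI plugin failed to load, containerd still booted and served its socket, but without the CRI service, which is consistent with the "Unimplemented ... unknown service" error seen by the kubelet.

So the containerd configuration is at fault.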
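
Since containerd reports "successfully booted" even when a plugin fails to load, it can be worth scanning its log for such failures explicitly. The following is a minimal sketch. The function name failed_plugins is hypothetical, and it assumes the msg="failed to load plugin ..." wording shown in the warning above.

```shell
# Hypothetical helper: list containerd plugins that failed to load.
# Reads containerd log text on stdin and prints each plugin name once.
failed_plugins() {
  sed -n 's/.*msg="failed to load plugin \([^"]*\)".*/\1/p' | sort -u
}
```

A possible invocation is `journalctl -u containerd --no-pager | failed_plugins`. An io.containerd.grpc.v1.cri entry in the output means the kubelet will have no CRI service to talk to.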

Modify the containerd configuration, then restart containerd and the kubelet

[root @ machine5 ~]$ vim /etc/containerd/config.toml
... (omitted) ...
      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
... (omitted) ...
        runtime_root = ""
        runtime_type = "io.containerd.runtime.v1.linux"                
[root @ machine5 ~]$ systemctl restart containerd
[root @ machine5 ~]$ systemctl status containerd

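After the restart, the containerd warning above is gone.

Somewhat awkwardly, a different warning now appears: level=warning msg="runtime v1 is deprecated since containerd v1.4, consider using runtime v2"

In other words, containerd has deprecated runtime v1 (io.containerd.runtime.v1.linux) since 1.4, yet with this configuration the kubelet 1.24.1 node ends up running on it. Note that "runtime v1" here refers to containerd's own runtime type and is separate from the CRI API version named in the kubelet error.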
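
The full config.toml is not shown, so the following rests on an assumption: the earlier warning suggests that the CRI plugin section sets systemd_cgroup = true. The Kubernetes "Container Runtimes" page cited earlier enables the systemd cgroup driver differently, through SystemdCgroup = true in the runc runtime options with runtime type io.containerd.runc.v2, which would avoid the deprecated runtime v1. As a rough check, a hypothetical helper, uses_legacy_systemd_cgroup, can flag the legacy key:

```shell
# Hypothetical helper: succeed if the legacy CRI-plugin key is enabled.
# Takes config file paths as arguments, or reads config text on stdin.
# The match is case-sensitive, so "SystemdCgroup = true" is not reported,
# and commented-out lines are ignored.
uses_legacy_systemd_cgroup() {
  grep -Eq '^[[:space:]]*systemd_cgroup[[:space:]]*=[[:space:]]*true' "$@"
}
```

For instance, `uses_legacy_systemd_cgroup /etc/containerd/config.toml && echo "legacy key found"` would point at the setting to revisit.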

Start the kubelet and check the node status

[root @ machine5 ~]$ systemctl start kubelet
[root @ machine5 ~]$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-06-10 23:37:42 CST; 4min 19s ago


#################### divider ####################

[root @ machine1 ~]$ kubectl get node
NAME       STATUS   ROLES    AGE     VERSION
machine1   Ready    <none>   20h     v1.24.1
machine2   Ready    <none>   20h     v1.24.1
machine3   Ready    <none>   20h     v1.24.1
machine4   Ready    <none>   20h     v1.24.1
machine5   Ready    <none>   5m13s   v1.24.1

With that, the problem is solved.

A solution from someone who works in ops and has also been an ops product manager
