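In a test environment, one node showed an abnormal status, so I manually ran delete node on it. I then restarted the services to rejoin the node to the cluster, but it never came back.
The only option was to go through the logs carefully.
A quick analysis: if the node cannot join the cluster, kubelet is the likely culprit.
The log shows: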
Aug 6 16:33:22 node125 kubelet: E0806 16:33:22.591193 20451 eviction_manager.go:247] eviction manager: failed to get summary stats: failed to get node info: node "node125" not found
Aug 6 16:33:22 node125 kubelet: I0806 16:33:22.595423 20451 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
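For this error, I checked the configuration file and the hosts resolution, and both were fine. Looking closer at kubelet, the log shows it restarting over and over with no other errors, so the problem is probably on the Docker side. The relevant lines: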
Aug 6 16:28:32 node125 kubelet: E0806 16:28:32.073254 19188 node_container_manager_linux.go:50] Failed to create ["kubepods"] cgroup
Aug 6 16:28:32 node125 kubelet: F0806 16:28:32.073276 19188 kubelet.go:1372] Failed to start ContainerManager Cannot set property TasksAccounting, or unknown property.
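When kubelet is stuck in a restart loop, the informational lines drown out the one that matters. A minimal sketch for filtering them is below. It relies on the klog convention visible in the excerpts above, where each message starts with a severity letter (E for error, F for fatal) followed by the date. The function name kubelet_errors is my own, and /var/log/messages in the usage line is an assumption about where syslog writes on this host.

```shell
#!/usr/bin/env bash
# Print only the klog error (E) and fatal (F) lines emitted by kubelet.
# klog prefixes each message with a severity letter followed by MMDD,
# for example "E0806" or "F0806".
# With no arguments the function reads stdin; otherwise it reads the
# files passed as arguments.
kubelet_errors() {
    grep -E 'kubelet: [EF][0-9]{4} ' "$@"
}
```

Possible usage, assuming the syslog file location: kubelet_errors /var/log/messages | tail -n 20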
Docker configuration (docker info output):
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 28
Server Version: 1.13.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: /usr/libexec/docker/docker-init-current
containerd version: (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: e9c345b3f906d5dc5e8100b05ce37073a811c74a (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: 5b117de7f824f3d3825737cf09581645abbe35d4 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
seccomp
WARNING: You're not using the default seccomp profile
Profile: /etc/docker/seccomp.json
Kernel Version: 4.4.54-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 4
Total Memory: 7.797 GiB
Name: node125
ID: V442:WMKW:IUXK:P57X:ZFRG:NKVS:6ROQ:U7FA:LAMQ:GAPS:CKJ5:B6PI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
192.168.100.119
registry:5000
127.0.0.0/8
Live Restore Enabled: false
Registries: docker.io (secure)
Docker is using Cgroup Driver: systemd.
Check the kubelet configuration:
[root@node125 kubernetes]# cat kubelet.yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
address: 192.168.60.125
port: 10250
readOnlyPort: 10255
clusterDNS:
- 10.254.10.20
clusterDomain: cluster.local
cgroupDriver: systemd
kubeletCgroups: /systemd/system.slice
failSwapOn: false
authentication:
  webhook:
    enabled: false
    cacheTTL: "2m0s"
  anonymous:
    enabled: true
Both sides use cgroupDriver: systemd, so they are consistent and this is not the problem.
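The comparison above was done by eye. A small sketch of the same check as a script follows. The three function names are hypothetical helpers of mine, and the parsing assumes the "Cgroup Driver:" line from docker info and the cgroupDriver key from kubelet.yaml look the way they do in the output shown here.

```shell
#!/usr/bin/env bash
# Extract the cgroup driver from `docker info` output read on stdin.
docker_cgroup_driver() {
    sed -n 's/^[[:space:]]*Cgroup Driver:[[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Extract cgroupDriver from a KubeletConfiguration file read on stdin.
kubelet_cgroup_driver() {
    sed -n 's/^[[:space:]]*cgroupDriver:[[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Compare the two values: $1 is Docker's driver, $2 is kubelet's.
# Prints the result and returns non-zero on a mismatch.
compare_cgroup_drivers() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: docker=$1 kubelet=$2"
        return 1
    fi
}
```

Possible usage from the directory that holds kubelet.yaml: compare_cgroup_drivers "$(docker info 2>/dev/null | docker_cgroup_driver)" "$(kubelet_cgroup_driver < kubelet.yaml)"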
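So the error is reported by kubelet but comes from the layer underneath it. Looking into it, this really is a known problem with this Docker setup: a version issue in which the installed systemd does not recognize the TasksAccounting property, so creating the kubepods cgroup through the systemd driver fails. Updating systemd and restarting Docker fixed it: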
yum update systemd
systemctl restart docker
After the restart, the kubelet log shows the node registering:
Aug 6 16:33:22 node125 kubelet: E0806 16:33:22.591193 20451 eviction_manager.go:247] eviction manager: failed to get summary stats: failed to get node info: node "node125" not found
Aug 6 16:33:22 node125 kubelet: I0806 16:33:22.595423 20451 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Aug 6 16:33:22 node125 kubelet: E0806 16:33:22.595487 20451 kubelet.go:2252] node "node125" not found
Aug 6 16:33:22 node125 kubelet: I0806 16:33:22.597140 20451 kubelet_node_status.go:72] Attempting to register node node125
Aug 6 16:33:22 node125 kubelet: I0806 16:33:22.599621 20451 plugin_manager.go:116] Starting Kubelet Plugin Manager
Aug 6 16:33:22 node125 kubelet: I0806 16:33:22.602856 20451 kubelet_node_status.go:75] Successfully registered node node125
Aug 6 16:33:22 node125 kubelet: I0806 16:33:22.695543 20451 reconciler.go:150] Reconciler: start to sync state
Running kubectl get node again now lists the node.
[root@node125 kubernetes]# kubelet --help| grep cg
--cgroup-driver string Driver that the kubelet uses to manipulate cgroups on the host. Possible values: 'cgroupfs', 'systemd' (default "cgroupfs") (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
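The kubelet default (cgroupfs) does not match the default of the Docker package used here (systemd). I recommend changing Docker's cgroup driver to cgroupfs; I have run that setup in other environments without hitting this kind of problem.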
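One way to make that change is through Docker's daemon configuration file, using the exec-opts setting native.cgroupdriver. The sketch below is only an illustration: the function name is hypothetical, it overwrites whatever file it is pointed at, and it assumes the driver is not already set through a --exec-opt flag in the Docker unit file or /etc/sysconfig/docker. If it is set there, change it there instead, because dockerd refuses to start when the same option is given both as a flag and in the file.

```shell
#!/usr/bin/env bash
# Write a minimal Docker daemon configuration that selects the cgroupfs driver.
# The target path is passed as $1 so the file can be reviewed before it is
# put in place. An existing file at that path is overwritten.
write_cgroupfs_daemon_json() {
    cat > "$1" <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF
}
```

Possible usage: write_cgroupfs_daemon_json /etc/docker/daemon.json, then systemctl restart docker. The kubelet side has to follow, so cgroupDriver in kubelet.yaml also needs to be set to cgroupfs before restarting kubelet.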