Notes on a problem caused by an abnormal shutdown of a test VM
The symptom:
[root@master log]# kubectl get node
The connection to the server 192.168.0.105:6443 was refused - did you specify the right host or port?
Most of the solutions you find online for this error boil down to running kubeadm init again.
Re-initializing will certainly bring the cluster back, but those posts rarely confirm the cause or show any actual troubleshooting.
The approach here is:
1. First, check the status of kubelet
[root@master log]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Sun 2022-08-07 21:53:47 CST; 7min ago
Docs: https://kubernetes.io/docs/
Main PID: 896 (kubelet)
CGroup: /system.slice/kubelet.service
└─896 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=k8...
Aug 07 22:01:31 master kubelet[896]: E0807 22:01:31.099750 896 kubelet.go:2466] "Error getting node" err="node \"master\" not found"
Aug 07 22:01:31 master kubelet[896]: E0807 22:01:31.200690 896 kubelet.go:2466] "Error getting node" err="node \"master\" not found"
Aug 07 22:01:31 master kubelet[896]: E0807 22:01:31.302463 896 kubelet.go:2466] "Error getting node" err="node \"master\" not found"
kubelet itself is running, but it keeps failing to look up its own Node object, which points back at the same thing as the error above: the apiserver on port 6443 is not reachable.
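As a side note, if the tail shown by systemctl status is not enough, the full kubelet log lives in the systemd journal. A minimal sketch with standard journalctl options; it was not needed here, but it is the natural next place to look:
journalctl -u kubelet --since "10 minutes ago" --no-pager    # full recent kubelet log
journalctl -u kubelet -b | grep -i "connection refused"      # apiserver connection errors since boot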
2. Check the port and the service behind it
[root@master log]# netstat -an |grep 443
tcp 0 0 192.168.0.105:45700 101.6.15.130:443 TIME_WAIT
tcp 0 0 192.168.0.105:53188 54.230.61.3:443 TIME_WAIT
[root@master log]# netstat -an |grep 6443
netstat confirms that nothing is listening on 6443.
The next thought is to check the process that should be serving it. Port 6443 is what kube-apiserver listens on, and since the apiserver runs as a container rather than as a host service, the container side is where to look next.
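For reference, ss from iproute2 answers the same question and also prints the owning process when something is listening; a small sketch, assuming iproute2 is installed:
ss -tlnp | grep 6443    # -t TCP, -l listening sockets, -n numeric, -p show the owning process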
3. Check the apiserver container
[root@master log]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
288e14904ca7 5bc0062e9555 "kube-apiserver --ad…" 13 seconds ago Up 12 seconds k8s_kube-apiserver_kube-apiserver-master_kube-system_400ffc84fa5a476e6e92ca8f68ba4b4f_17
123224d20e8f 0198979b7707 "kube-scheduler --au…" 4 minutes ago Up 4 minutes k8s_kube-scheduler_kube-scheduler-master_kube-system_979e4fa280c61a85414b1eeb67f9a06c_4
86026c6ec248 f6bc1b780606 "kube-controller-man…" 4 minutes ago Up 4 minutes k8s_kube-controller-manager_kube-controller-manager-master_kube-system_66054ac4c4bc791e96e1b8b3f0a88fda_4
e042794651b6 k8s.gcr.io/pause:3.6 "/pause" 4 minutes ago Up 4 minutes k8s_POD_etcd-master_kube-system_a7d996a74340d0ec859869d3f547fd5e_5
6ca6069627c1 k8s.gcr.io/pause:3.6 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-scheduler-master_kube-system_979e4fa280c61a85414b1eeb67f9a06c_5
96890c46a350 k8s.gcr.io/pause:3.6 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-controller-manager-master_kube-system_66054ac4c4bc791e96e1b8b3f0a88fda_5
b16ae5e49636 k8s.gcr.io/pause:3.6 "/pause" 4 minutes ago Up 4 minutes k8s_POD_kube-apiserver-master_kube-system_400ffc84fa5a476e6e92ca8f68ba4b4f_5
The apiserver container has been up for only 12 seconds, while everything else has been up for 4 minutes?
That looks like the problem: the apiserver is almost certainly crashing and being restarted over and over.
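On a busier node it helps to narrow docker ps down to the control-plane containers; a sketch using standard docker ps flags and the k8s_ container names kubelet creates on this kind of kubeadm-plus-Docker setup:
docker ps -a --filter "name=k8s_kube-apiserver" --format "table {{.ID}}\t{{.Status}}\t{{.Names}}"
docker ps -a --filter "name=k8s_etcd" --format "table {{.ID}}\t{{.Status}}\t{{.Names}}"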
4. Check the container logs
First, the apiserver logs:
[root@master log]# docker logs 288e14904ca7
I0807 13:58:23.892217 1 server.go:565] external host was not specified, using 192.168.0.105
I0807 13:58:23.892642 1 server.go:172] Version: v1.23.9
I0807 13:58:24.284096 1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
I0807 13:58:24.284850 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0807 13:58:24.284859 1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I0807 13:58:24.285508 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0807 13:58:24.285517 1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
W0807 13:58:24.287476 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0807 13:58:25.285443 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0807 13:58:25.287937 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0807 13:58:26.287159 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0807 13:58:26.776841 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0807 13:58:28.127012 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0807 13:58:28.864148 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0807 13:58:30.382813 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0807 13:58:33.431992 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0807 13:58:35.267596 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0807 13:58:39.993202 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W0807 13:58:41.380614 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
E0807 13:58:44.288612 1 run.go:74] "command failed" err="context deadline exceeded"
Most of this wall of output is not that interesting, but "127.0.0.1:2379: connect: connection refused" stands out. Isn't 2379 the etcd client port?
Wait.
The docker ps output above had no etcd container either. Could it be...
[root@master log]# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f0c92ea6d056 25f8c7f3da61 "etcd --advertise-cl…" 43 seconds ago Exited (2) 42 seconds ago k8s_etcd_etcd-master_kube-system_a7d996a74340d0ec859869d3f547fd5e_20
75dded3ea121 5bc0062e9555 "kube-apiserver --ad…" About a minute ago Exited (1) 44 seconds ago k8s_kube-apiserver_kube-apiserver-master_kube-system_400ffc84fa5a476e6e92ca8f68ba4b4f_19
123224d20e8f 0198979b7707 "kube-scheduler --au…" 11 minutes ago Up 11 minutes k8s_kube-scheduler_kube-scheduler-master_kube-system_979e4fa280c61a85414b1eeb67f9a06c_4
86026c6ec248 f6bc1b780606 "kube-controller-man…" 11 minutes ago Up 11 minutes k8s_kube-controller-manager_kube-controller-manager-master_kube-system_66054ac4c4bc791e96e1b8b3f0a88fda_4
e042794651b6 k8s.gcr.io/pause:3.6 "/pause" 11 minutes ago Up 11 minutes k8s_POD_etcd-master_kube-system_a7d996a74340d0ec859869d3f547fd5e_5
6ca6069627c1 k8s.gcr.io/pause:3.6 "/pause" 11 minutes ago Up 11 minutes k8s_POD_kube-scheduler-master_kube-system_979e4fa280c61a85414b1eeb67f9a06c_5
96890c46a350 k8s.gcr.io/pause:3.6 "/pause" 11 minutes ago Up 11 minutes k8s_POD_kube-controller-manager-master_kube-system_66054ac4c4bc791e96e1b8b3f0a88fda_5
b16ae5e49636 k8s.gcr.io/pause:3.6 "/pause" 11 minutes ago Up 11 minutes k8s_POD_kube-apiserver-master_kube-system_400ffc84fa5a476e6e92ca8f68ba4b4f_5
[root@master log]# docker logs f0c92ea6d056
{"level":"info","ts":"2022-08-07T14:04:41.931Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["etcd","--advertise-client-urls=https://192.168.0.105:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-initial-corrupt-check=true","--initial-advertise-peer-urls=https://192.168.0.105:2380","--initial-cluster=master=https://192.168.0.105:2380","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.0.105:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.0.105:2380","--name=master","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]}
{"level":"info","ts":"2022-08-07T14:04:41.931Z","caller":"etcdmain/etcd.go:115","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{"level":"info","ts":"2022-08-07T14:04:41.931Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.0.105:2380"]}
{"level":"info","ts":"2022-08-07T14:04:41.931Z","caller":"embed/etcd.go:478","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2022-08-07T14:04:41.932Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.0.105:2379"]}
{"level":"info","ts":"2022-08-07T14:04:41.932Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.1","git-sha":"e8732fb5f","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2,"member-initialized":true,"name":"master","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.0.105:2380"],"listen-peer-urls":["https://192.168.0.105:2380"],"advertise-client-urls":["https://192.168.0.105:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.0.105:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"info","ts":"2022-08-07T14:04:41.945Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"11.913748ms"}
{"level":"info","ts":"2022-08-07T14:04:41.971Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":10001,"snapshot-size":"7.5 kB"}
{"level":"warn","ts":"2022-08-07T14:04:41.972Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":10001,"snapshot-file-path":"/var/lib/etcd/member/snap/0000000000002711.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-08-07T14:04:41.972Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/embed/etcd.go:244\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/etcdmain/etcd.go:227\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/etcdmain/etcd.go:122\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/home/remote/sbatsche/.gvm/gos/go1.16.3/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000194600, 0xc0005420c0, 0x1, 0x1)
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0005d4190, 0x122f41e, 0x2a, 0xc0005420c0, 0x1, 0x1)
/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7ffd911e0e4d, 0x6, 0x0, 0x0, 0x0, 0x0, 0xc0000e1d40, 0x1, 0x1, 0xc00021c360, ...)
/tmp/etcd-release-3.5.1/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000524000, 0xc000524600, 0x0, 0x0)
/tmp/etcd-release-3.5.1/etcd/release/etcd/server/embed/etcd.go:244 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000524000, 0x1203b54, 0x6, 0xc000522101, 0x2)
/tmp/etcd-release-3.5.1/etcd/release/etcd/server/etcdmain/etcd.go:227 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a140, 0x13, 0x14)
/tmp/etcd-release-3.5.1/etcd/release/etcd/server/etcdmain/etcd.go:122 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a140, 0x13, 0x14)
/tmp/etcd-release-3.5.1/etcd/release/etcd/server/etcdmain/main.go:40 +0x11f
main.main()
/tmp/etcd-release-3.5.1/etcd/release/etcd/server/main.go:32 +0x45
Seeing "failed to find [SNAPSHOT-INDEX].snap.db", "snapshot-index":10001, "snapshot-file-path":"/var/lib/etcd/member/snap/0000000000002711.snap.db", "error":"snap: snapshot file doesn't exist", things finally click: etcd panics on startup because a snapshot file it expects is missing, so the apiserver cannot reach etcd, and kubectl in turn cannot reach the apiserver. Could this really be the root cause?
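Before changing anything under the etcd data directory it is worth taking a full offline copy, not just of the member subdirectory. A minimal sketch with plain tar, the path taken from the --data-dir flag in the etcd log above:
tar czf /root/etcd-data-$(date +%Y%m%d-%H%M%S).tar.gz -C /var/lib etcd    # full copy of /var/lib/etcd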
5. Attempt recovery
[root@master etcd]# ls
member
[root@master etcd]# cp member/ member-bak -r
[root@master etcd]# ls
member member-bak
[root@master etcd]# cd member
[root@master member]# ls
snap wal
[root@master member]# cd snap/
[root@master snap]# ls
0000000000000003-0000000000002711.snap db
[root@master snap]# rm 0000000000000003-0000000000002711.snap
rm: remove regular file ‘0000000000000003-0000000000002711.snap’? y
[root@master snap]# ls
db
[root@master snap]# cd ..
So: the member directory was backed up, and the stale snapshot metadata file /var/lib/etcd/member/snap/0000000000000003-0000000000002711.snap was deleted. That is the file referring to the 0000000000002711.snap.db which no longer exists and which etcd panics on.
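Why restarting kubelet is enough here: on a kubeadm cluster the control-plane components, etcd included, run as static pods whose manifests live under /etc/kubernetes/manifests, so kubelet itself recreates their containers. A quick check, assuming the default kubeadm layout:
ls /etc/kubernetes/manifests/    # etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml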
Now restart kubelet:
[root@master member]# systemctl restart kubelet
[root@master member]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
faf9b1f4843d 5bc0062e9555 "kube-apiserver --ad…" 1 second ago Up 1 second k8s_kube-apiserver_kube-apiserver-master_kube-system_400ffc84fa5a476e6e92ca8f68ba4b4f_20
2c3e2b324974 25f8c7f3da61 "etcd --advertise-cl…" 1 second ago Up Less than a second k8s_etcd_etcd-master_kube-system_a7d996a74340d0ec859869d3f547fd5e_21
123224d20e8f 0198979b7707 "kube-scheduler --au…" 14 minutes ago Up 14 minutes k8s_kube-scheduler_kube-scheduler-master_kube-system_979e4fa280c61a85414b1eeb67f9a06c_4
86026c6ec248 f6bc1b780606 "kube-controller-man…" 14 minutes ago Up 14 minutes k8s_kube-controller-manager_kube-controller-manager-master_kube-system_66054ac4c4bc791e96e1b8b3f0a88fda_4
e042794651b6 k8s.gcr.io/pause:3.6 "/pause" 14 minutes ago Up 14 minutes k8s_POD_etcd-master_kube-system_a7d996a74340d0ec859869d3f547fd5e_5
6ca6069627c1 k8s.gcr.io/pause:3.6 "/pause" 14 minutes ago Up 14 minutes k8s_POD_kube-scheduler-master_kube-system_979e4fa280c61a85414b1eeb67f9a06c_5
96890c46a350 k8s.gcr.io/pause:3.6 "/pause" 14 minutes ago Up 14 minutes k8s_POD_kube-controller-manager-master_kube-system_66054ac4c4bc791e96e1b8b3f0a88fda_5
b16ae5e49636 k8s.gcr.io/pause:3.6 "/pause" 14 minutes ago Up 14 minutes k8s_POD_kube-apiserver-master_kube-system_400ffc84fa5a476e6e92ca8f68ba4b4f_5
[root@master member]# kubectl get node
NAME STATUS ROLES AGE VERSION
master Ready control-plane,master 7d2h v1.23.9
node1 Ready <none> 7d2h v1.23.9
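Both containers are back up and kubectl answers again. For good measure, etcd itself can be queried directly with etcdctl inside the etcd container; a sketch assuming the certificate paths from the etcd startup flags shown earlier:
docker exec $(docker ps -qf name=k8s_etcd) etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health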
As the title already hinted, the root cause was the abnormal VM shutdown: it left the etcd data directory inconsistent, with a .snap file referring to a .snap.db that was never written.
No other problems have shown up since.
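Deleting the stale .snap file worked here only because the rest of the data directory was intact; a regular etcd snapshot would make this kind of recovery trivial. A sketch of a one-off backup run through the etcd container (its image ships etcdctl, and on a kubeadm setup the data dir is host-mounted), to be scheduled via cron or a systemd timer as appropriate:
docker exec $(docker ps -qf name=k8s_etcd) etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/lib/etcd/etcd-snapshot-$(date +%Y%m%d).db
mkdir -p /root/etcd-backups && mv /var/lib/etcd/etcd-snapshot-*.db /root/etcd-backups/    # keep backups outside the data dir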