Monitoring environment: HA Kubernetes cluster + Ceph + Prometheus + MySQL master-slave replication with read/write splitting + Grafana
Cause: the disk space given to Ceph on the physical server was thin-provisioned (it only consumes physical disk as data is actually written), but each node was allocated 200 GB and the whole cluster sits on a single physical server that also hosts other production workloads. Eventually only 9 MB of disk space was left, the monitoring services failed, and remote access was no longer possible.
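A periodic free-space check would have caught this before the disk filled up. Below is a minimal sketch (the function name, the 1 GB threshold, and the choice of mount point are illustrative assumptions, not part of the original setup):

```shell
# warn_low_disk: warn when available space on a mount drops below a threshold (KB).
warn_low_disk() {
    path="$1"
    min_free_kb="$2"
    # df -P: POSIX one-record-per-line output; field 4 = available 1K-blocks
    avail_kb=$(df -P "$path" | awk 'NR==2 {print $4}')
    if [ "$avail_kb" -lt "$min_free_kb" ]; then
        echo "WARNING: only ${avail_kb} KB free on ${path}"
        return 1
    fi
    echo "OK: ${avail_kb} KB free on ${path}"
}

# Example: warn when / drops below 1 GB (1048576 KB) free
warn_low_disk / 1048576 || echo "disk space is low -- investigate before services fail"
```

Run from cron against the mount backing the Ceph OSDs, this turns a silent thin-provisioning overrun into an early alert.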
1. Manually restarted the k8s server cluster.
2. Found that cluster nodes node1 and node2 were NotReady.
3. Re-join the NotReady nodes to the cluster.
1) Create and print the join token
[root@k8s-master2 prometheus]# kubeadm token create --print-join-command
W0324 16:38:22.275556 19965 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
kubeadm join 192.168.2.151:7443 --token o7ws26.w80jrx4q8kbo5hx1 --discovery-token-ca-cert-hash sha256:137ef2971de69ada01c5a37a9833638f8b9a3c0ed1e52627b48ea03f4b5a35ee
Note: update the --token value in the script below (node_net_reset.sh) to this newly generated token:
--token o7ws26.w80jrx4q8kbo5hx1
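Since the reset script used in the next step embeds the token on its kubeadm join line, the swap can be scripted rather than edited by hand. A minimal sketch using sed (to stay self-contained, the demo first creates a one-line stand-in file; on the real node you would run only the sed line against the existing node_net_reset.sh):

```shell
# The join script still contains the previous token; swap it for the new one.
OLD_TOKEN="qwbnyh.p8hi90borg1r48c2"
NEW_TOKEN="o7ws26.w80jrx4q8kbo5hx1"
SCRIPT="node_net_reset.sh"

# Self-contained demo: create a one-line stand-in for the real script
echo "kubeadm join 192.168.2.151:7443 --token ${OLD_TOKEN} --discovery-token-ca-cert-hash sha256:137ef2971de69ada01c5a37a9833638f8b9a3c0ed1e52627b48ea03f4b5a35ee" > "$SCRIPT"

# Replace the stale token in place, then show the result
sed -i "s/--token ${OLD_TOKEN}/--token ${NEW_TOKEN}/" "$SCRIPT"
grep -- "--token" "$SCRIPT"
```

This assumes GNU sed (the -i in-place flag), which is standard on CentOS/RHEL hosts like the ones shown here.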
2) Re-deploy the nodes
# Run on each node to reset its network state and re-join the cluster
[root@k8s-node-1 ~]# cat node_net_reset.sh
# Reset kubeadm state on this node and stop the services
kubeadm reset
systemctl stop kubelet
systemctl stop docker
# Remove leftover CNI and kubelet data
rm -rf /var/lib/cni/ /var/lib/kubelet/* /etc/cni/
# Tear down the old virtual network interfaces
ifconfig cni0 down
ifconfig flannel.1 down
ifconfig docker0 down
ip link delete cni0
ip link delete flannel.1
# Restart docker, then re-join the cluster (update the token first!)
systemctl start docker
sleep 3
kubeadm join 192.168.2.151:7443 --token qwbnyh.p8hi90borg1r48c2 --discovery-token-ca-cert-hash sha256:137ef2971de69ada01c5a37a9833638f8b9a3c0ed1e52627b48ea03f4b5a35ee
3) Check the cluster nodes again
[root@k8s-master2 prometheus]# kubectl get nodes
NAME          STATUS     ROLES    AGE    VERSION
k8s-master1   Ready      master   224d   v1.18.6
k8s-master2   Ready      master   224d   v1.18.6
k8s-node-1    NotReady   <none>   61m    v1.18.6
k8s-node-2    NotReady   <none>   172m   v1.18.6
k8s-node-3    Ready      <none>   197d   v1.18.6
!!!!! k8s-node-1 and k8s-node-2 are still NotReady: the cluster still has a problem.
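When re-checking repeatedly, a small helper that pulls the NotReady names out of `kubectl get nodes` output is handy for scripting. A sketch (the function name is an illustrative assumption; on the cluster you would pipe `kubectl get nodes` into it, here it is fed the captured output above):

```shell
# not_ready_nodes: read `kubectl get nodes` output on stdin and print the
# names of nodes whose STATUS column is NotReady (skipping the header row).
not_ready_nodes() {
    awk 'NR > 1 && $2 == "NotReady" {print $1}'
}

# Demo with the output captured above:
not_ready_nodes <<'EOF'
NAME          STATUS     ROLES    AGE    VERSION
k8s-master1   Ready      master   224d   v1.18.6
k8s-master2   Ready      master   224d   v1.18.6
k8s-node-1    NotReady   <none>   61m    v1.18.6
k8s-node-2    NotReady   <none>   172m   v1.18.6
k8s-node-3    Ready      <none>   197d   v1.18.6
EOF
# prints:
# k8s-node-1
# k8s-node-2
```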
4. The running pods report errors: they cannot connect to the network. Check kubelet on the affected node:
[root@k8s-node-1 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Wed 2021-03-24 14:17:47 CST; 7min ago
     Docs: https://kubernetes.io/docs/
 Main PID: 15282 (kubelet)
   Memory: 40.4M
   CGroup: /system.slice/kubelet.service
           └─15282 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd...
Mar 24 14:25:18 k8s-node-1 kubelet[15282]: E0324 14:25:18.912086 15282 pod_workers.go:191] Error syncing pod c33a7e20-dd68-42d5-b289-a21387d02a3e ("rook-discover-gjjt2_rook-ceph(c33a7e20-dd68-42d5-b...
Mar 24 14:25:20 k8s-node-1 kubelet[15282]: E0324 14:25:20.912057 15282 pod_workers.go:191] Error syncing pod c33a7e20-dd68-42d5-b289-a21387d02a3e ("rook-discover-gjjt2_rook-ceph(c33a7e20-dd68-42d5-b...
Mar 24 14:25:21 k8s-node-1 kubelet[15282]: E0324 14:25:21.478052 15282 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:dock...uninitialized
Mar 24 14:25:22 k8s-node-1 kubelet[15282]: E0324 14:25:22.912055 15282 pod_workers.go:191] Error syncing pod c33a7e20-dd68-42d5-b289-a21387d02a3e ("rook-discover-gjjt2_rook-ceph(c33a7e20-dd68-42d5-b...
Mar 24 14:25:23 k8s-node-1 kubelet[15282]: W0324 14:25:23.546665 15282 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
Mar 24 14:25:24 k8s-node-1 kubelet[15282]: E0324 14:25:24.912216 15282 pod_workers.go:191] Error syncing pod c33a7e20-dd68-42d5-b289-a21387d02a3e ("rook-discover-gjjt2_rook-ceph(c33a7e20-dd68-42d5-b...
Mar 24 14:25:26 k8s-node-1 kubelet[15282]: E0324 14:25:26.494655 15282 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:dock...uninitialized
Mar 24 14:25:26 k8s-node-1 kubelet[15282]: E0324 14:25:26.912302 15282 pod_workers.go:191] Error syncing pod c33a7e20-dd68-42d5-b289-a21387d02a3e ("rook-discover-gjjt2_rook-ceph(c33a7e20-dd68-42d5-b...
Mar 24 14:25:28 k8s-node-1 kubelet[15282]: W0324 14:25:28.547008 15282 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
Mar 24 14:25:28 k8s-node-1 kubelet[15282]: E0324 14:25:28.912030 15282 pod_workers.go:191] Error syncing pod c33a7e20-dd68-42d5-b289-a21387d02a3e ("rook-discover-gjjt2_rook-ceph(c33a7e20-dd68-42d5-b...
Hint: Some lines were ellipsized, use -l to show in full.
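The repeated `no networks found in /etc/cni/net.d` warnings mean kubelet cannot find any CNI configuration, i.e. the network plugin is not (or no longer) deployed on the node. A quick reusable check, sketched as a shell function (the function name is an assumption; the directory and file extensions are the standard kubelet/CNI ones, and flannel normally drops a 10-flannel.conflist file there):

```shell
# cni_config_ok: succeed only if the given CNI config dir contains at least
# one .conf or .conflist file (kubelet reads /etc/cni/net.d by default).
cni_config_ok() {
    dir="${1:-/etc/cni/net.d}"
    for f in "$dir"/*.conf "$dir"/*.conflist; do
        [ -e "$f" ] && return 0
    done
    return 1
}

if cni_config_ok /etc/cni/net.d; then
    echo "CNI config present"
else
    echo "CNI config missing -- the network plugin is not deployed on this node"
fi
```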
5. Locate the network deployment YAML file (kube-flannel.yml), delete the resources, and re-apply the YAML to redeploy the network.
# Delete
[root@master flannel]# kubectl delete -f kube-flannel.yml
# Re-deploy
[root@master flannel]# kubectl apply -f kube-flannel.yml
!!! None of the network pods came up after the re-deployment.
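When pods silently fail to appear, the controller events usually say why; a ResourceQuota violation shows up as a "forbidden: exceeded quota" event. A sketch that scans event text for it (the helper name is an assumption, and the event line in the demo is illustrative, modeled on the standard ResourceQuota admission error; on the cluster you would pipe `kubectl get events -n kube-system` into it):

```shell
# quota_blocked: read event text on stdin and report whether anything was
# rejected by an exceeded ResourceQuota.
quota_blocked() {
    if grep -q "exceeded quota"; then
        echo "blocked by ResourceQuota"
    else
        echo "no quota errors found"
    fi
}

# Self-contained demo with an illustrative event message:
quota_blocked <<'EOF'
Warning  FailedCreate  daemonset-controller  Error creating: pods "kube-flannel-ds-xxxxx" is forbidden: exceeded quota: compute-resources
EOF
# prints: blocked by ResourceQuota
```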
6. Checking the logs revealed that a resource limit (ResourceQuota) was the cause.
[Blog posts explaining Kubernetes resource limits]
https://blog.csdn.net/weixin_44723434/article/details/97948289
https://blog.csdn.net/skh2015java/article/details/108409883
https://blog.csdn.net/shida_csdn/article/details/88838762
7. Delete the resource quota:
kubectl delete quota -n kube-system compute-resources
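After removing the quota, verify it is gone (`kubectl get quota -n kube-system` should return nothing) and that the flannel pods actually reach Running. A sketch of a pod-counting helper for the second check (the helper is a hypothetical convenience; on the cluster you would feed it `kubectl get pods -n kube-system`, here it gets illustrative output):

```shell
# running_count: count Running pods whose name matches a pattern in
# `kubectl get pods` output read from stdin (NAME=$1, STATUS=$3).
running_count() {
    pattern="$1"
    awk -v p="$pattern" '$1 ~ p && $3 == "Running" {n++} END {print n+0}'
}

# Demo with illustrative `kubectl get pods -n kube-system` output:
running_count kube-flannel <<'EOF'
NAME                          READY   STATUS    RESTARTS   AGE
coredns-66bff467f8-abcde      1/1     Running   0          224d
kube-flannel-ds-aaaaa         1/1     Running   0          2m
kube-flannel-ds-bbbbb         1/1     Running   0          2m
EOF
# prints: 2
```

Once the count matches the number of nodes, the NotReady nodes should flip back to Ready.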