KubeVirt lets you manage virtual machines on Kubernetes (k8s).
First, set up a k8s cluster; I used version 1.18. The cluster setup itself is skipped here.
For installation, just follow the official KubeVirt docs: https://kubevirt.io/user-guide/operations/installation/
export RELEASE=v0.35.0
kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-operator.yaml
kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-cr.yaml
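Before checking the pods, you can also wait for the KubeVirt CR itself to report Available (this command is from the same installation docs; kv is the short name for the kubevirt resource, though short-name availability may vary by version):
[root@master ~]# kubectl -n kubevirt wait kv kubevirt --for condition=Available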
After deployment, check that everything is running:
[root@master ~]# kubectl get pods -n kubevirt
NAME                               READY   STATUS    RESTARTS   AGE
virt-api-64999f7bf5-hjv6h          1/1     Running   1          99m
virt-api-64999f7bf5-ktjsm          1/1     Running   1          99m
virt-controller-8696ccdf44-d57z7   1/1     Running   1          98m
virt-controller-8696ccdf44-wdmjm   1/1     Running   1          98m
virt-handler-hltv7                 1/1     Running   1          98m
virt-handler-mdgzf                 1/1     Running   1          98m
virt-handler-r9695                 1/1     Running   1          98m
virt-operator-78fbcdfdf4-l5wf6     1/1     Running   1          100m
virt-operator-78fbcdfdf4-skrqb     1/1     Running   1          100m
[root@master ~]# kubectl get svc -n kubevirt
NAME                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
kubevirt-operator-webhook     ClusterIP   10.103.27.69     <none>        443/TCP   101m
kubevirt-prometheus-metrics   ClusterIP   10.110.171.192   <none>        443/TCP   101m
virt-api                      ClusterIP   10.111.187.72    <none>        443/TCP   101m
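The docs also cover installing the virtctl client, which comes in handy later for starting VMs and opening consoles; a sketch for linux-amd64, reusing the RELEASE variable from above:
[root@master ~]# curl -L -o virtctl https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/virtctl-${RELEASE}-linux-amd64
[root@master ~]# chmod +x virtctl && mv virtctl /usr/local/bin/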
Create a VM using the official example. Prepare a vm.yaml with the following content:
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: vm-cirros
  name: vm-cirros
spec:
  running: false
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm-cirros
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
        machine:
          type: ""
        resources:
          requests:
            memory: 64M
      terminationGracePeriodSeconds: 0
      volumes:
      - name: containerdisk
        containerDisk:
          image: kubevirt/cirros-container-disk-demo:latest
      - cloudInitNoCloud:
          userDataBase64: IyEvYmluL3NoCgplY2hvICdwcmludGVkIGZyb20gY2xvdWQtaW5pdCB1c2VyZGF0YScK
        name: cloudinitdisk
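The userDataBase64 field is just cloud-init user data, base64-encoded; decoding it shows the script the guest runs on boot:
[root@master ~]# echo IyEvYmluL3NoCgplY2hvICdwcmludGVkIGZyb20gY2xvdWQtaW5pdCB1c2VyZGF0YScK | base64 -d
#!/bin/sh

echo 'printed from cloud-init userdata'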
I didn't record much while troubleshooting, so here I can only roughly reconstruct the errors.
kubectl apply -f vm.yaml
This times out, with an error roughly like the following:
Error from server (InternalError): error when creating "vm.yaml": Internal error occurred: failed calling webhook "virtualmachine-validator.kubevirt.io": Post https://virt-api.kubevirt.svc:443/virtualmachines-validate?timeout=4s: context deadline exceeded
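The failing webhook lives in a ValidatingWebhookConfiguration whose clientConfig points at the virt-api service, so the timeout means the apiserver cannot reach virt-api over the service network. One way to confirm which service backs it is to grep for the path from the error URL:
[root@master ~]# kubectl get validatingwebhookconfigurations
[root@master ~]# kubectl get validatingwebhookconfigurations -o yaml | grep -B 2 -A 6 virtualmachines-validate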
Checking api-resources shows a related error too:
[root@master ~]# kubectl api-resources
unable to retrieve the complete list of server APIs: subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
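That group is served by an aggregated APIService backed by the same virt-api service, and its status explains the failure more directly (APIService objects are named <version>.<group>):
[root@master ~]# kubectl get apiservice v1alpha3.subresources.kubevirt.io
[root@master ~]# kubectl get apiservice v1alpha3.subresources.kubevirt.io -o jsonpath='{.status.conditions[0].message}'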
Next, inspect the virt-api service:
[root@master ~]# kubectl -n kubevirt describe svc virt-api
Name:              virt-api
Namespace:         kubevirt
Labels:            app.kubernetes.io/component=kubevirt
                   app.kubernetes.io/managed-by=kubevirt-operator
                   kubevirt.io=virt-api
Annotations:       kubevirt.io/customizer-identifier: bf21a9e8fbc5a3846fb05b4fa0859e0917b2202f
                   kubevirt.io/generation: 2
                   kubevirt.io/install-strategy-identifier: a52e28fafafe1ab43f512e85012027a1d69903e0
                   kubevirt.io/install-strategy-registry: index.docker.io/kubevirt
                   kubevirt.io/install-strategy-version: v0.35.0
Selector:          kubevirt.io=virt-api
Type:              ClusterIP
IP:                10.111.187.72
Port:              <unset>  443/TCP
TargetPort:        8443/TCP
Endpoints:         10.244.2.30:8443,10.244.3.27:8443
Session Affinity:  None
Events:            <none>
Curling the endpoint IPs directly works fine:
[root@master ~]# curl https://10.244.2.30:8443 -k
{
  "paths": [
    "/apis",
    "/apis/",
    "/openapi/v2",
    "/apis/subresources.kubevirt.io",
    "/apis/subresources.kubevirt.io/v1alpha3"
  ]
}[root@master ~]# curl https://10.244.3.27:8443 -k
{
  "paths": [
    "/apis",
    "/apis/",
    "/openapi/v2",
    "/apis/subresources.kubevirt.io",
    "/apis/subresources.kubevirt.io/v1alpha3"
  ]
}
But curl against the service IP is very slow, even though it does eventually respond.
[root@master ~]# curl https://10.111.187.72 -k
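curl's -w timers make the slowness measurable; most of the stall presumably shows up in the TCP connect phase, since the service-to-endpoint translation is what is misbehaving:
[root@master ~]# curl -k -s -o /dev/null -w 'connect: %{time_connect}s  total: %{time_total}s\n' https://10.111.187.72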
Next, check the kube-proxy logs.
They contained roughly the following:
mode set "" use iptables
With no kube-proxy mode set on this 1.18 cluster, it defaulted to iptables. That didn't seem like it should matter, but I switched it to ipvs first anyway.
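Worth noting before the switch: ipvs mode needs the ipvs kernel modules loaded on every node, or kube-proxy falls back to iptables. A quick check/load, assuming a kernel older than 4.19 (where the conntrack module is still named nf_conntrack_ipv4):
[root@master ~]# for m in ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack_ipv4; do modprobe $m; done
[root@master ~]# lsmod | grep ip_vs
Then edit the kube-proxy ConfigMap and set the mode: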
[root@master ~]# kubectl -n kube-system edit cm kube-proxy
metricsBindAddress: ""
mode: "" ---> mode: "ipvs"
nodePortAddresses: null
oomScoreAdj: null
Delete the kube-proxy pods directly; the DaemonSet recreates them, and the new pods run in ipvs mode:
[root@master ~]# kubectl -n kube-system delete pods `kubectl -n kube-system get pods | grep kube-proxy | awk '{print $1}'`
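The active mode can be verified on a node without digging through logs: kube-proxy exposes it on its metrics port (10249 by default; this assumes metricsBindAddress is left at its default, which binds to localhost as in the config above):
[root@master ~]# curl localhost:10249/proxyMode
ipvs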
Check the kube-proxy logs again. This time they were full of errors like the following:
E0417 02:35:16.855674 1 proxier.go:1192] Failed to sync endpoint for service: 10.101.104.171:3000/TCP, err: parseIP Error ip=[10 244 2 32 0 0 0 0 0 0 0 0 0 0 0 0]
E0417 02:35:16.855990 1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[10 244 3 27 0 0 0 0 0 0 0 0 0 0 0 0]
Reportedly this is caused by the kernel version being too old:
[root@master kubevirt_compute]# uname -a
Linux master 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24
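Kernel 3.10 is the stock CentOS 7 kernel, and presumably every node needs the same upgrade; kubectl can show all nodes' kernels at once (-o wide includes a KERNEL-VERSION column):
[root@master ~]# kubectl get nodes -o wide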
So, on to upgrading the kernel.
Import the ELRepo GPG key:
[root@master ~]# rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
Install the ELRepo repository:
[root@master ~]# yum install https://www.elrepo.org/elrepo-release-7.0-4.el7.elrepo.noarch.rpm -y
List the available kernel versions:
[root@master ~]# yum --disablerepo="*" --enablerepo="elrepo-kernel" list available
Available Packages
elrepo-release.noarch               7.0-5.el7.elrepo       elrepo-kernel
kernel-lt-devel.x86_64              5.4.113-1.el7.elrepo   elrepo-kernel
kernel-lt-doc.noarch                5.4.113-1.el7.elrepo   elrepo-kernel
kernel-lt-headers.x86_64            5.4.113-1.el7.elrepo   elrepo-kernel
kernel-lt-tools.x86_64              5.4.113-1.el7.elrepo   elrepo-kernel
kernel-lt-tools-libs.x86_64         5.4.113-1.el7.elrepo   elrepo-kernel
kernel-lt-tools-libs-devel.x86_64   5.4.113-1.el7.elrepo   elrepo-kernel
kernel-ml.x86_64                    5.11.15-1.el7.elrepo   elrepo-kernel
kernel-ml-devel.x86_64              5.11.15-1.el7.elrepo   elrepo-kernel
kernel-ml-doc.noarch                5.11.15-1.el7.elrepo   elrepo-kernel
kernel-ml-headers.x86_64            5.11.15-1.el7.elrepo   elrepo-kernel
kernel-ml-tools.x86_64              5.11.15-1.el7.elrepo   elrepo-kernel
kernel-ml-tools-libs.x86_64         5.11.15-1.el7.elrepo   elrepo-kernel
kernel-ml-tools-libs-devel.x86_64   5.11.15-1.el7.elrepo   elrepo-kernel
perf.x86_64                         5.11.15-1.el7.elrepo   elrepo-kernel
python-perf.x86_64                  5.11.15-1.el7.elrepo   elrepo-kernel
Install the new kernel (kernel-lt is the long-term support branch):
[root@master ~]# yum --disablerepo='*' --enablerepo=elrepo-kernel install kernel-lt
Set the default boot entry to the new kernel:
[root@master ~]# cat /etc/grub2.cfg | grep "menuentry " | awk -F"'" '{print $2}'
CentOS Linux (5.4.113-1.el7.elrepo.x86_64) 7 (Core)
CentOS Linux (3.10.0-862.el7.x86_64) 7 (Core)
CentOS Linux (0-rescue-d8c9bde751d74f5398f997244ed6c161) 7 (Core)
[root@master ~]# grub2-set-default 'CentOS Linux (5.4.113-1.el7.elrepo.x86_64) 7 (Core)'
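Before rebooting, grub2-editenv can confirm the saved default entry:
[root@master ~]# grub2-editenv list
saved_entry=CentOS Linux (5.4.113-1.el7.elrepo.x86_64) 7 (Core)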
Reboot:
[root@master ~]# reboot
After the reboot, check the kube-proxy logs again:
[root@master kubevirt_compute]# kubectl -n kube-system logs -f kube-proxy-qd5bj --tail=200
I0417 05:42:55.649976 1 node.go:136] Successfully retrieved node IP: 10.10.128.103
I0417 05:42:55.650107 1 server_others.go:259] Using ipvs Proxier.
W0417 05:42:55.650763 1 proxier.go:429] IPVS scheduler not specified, use rr by default
I0417 05:42:55.651334 1 server.go:583] Version: v1.18.0
I0417 05:42:55.652266 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0417 05:42:55.652927 1 config.go:133] Starting endpoints config controller
I0417 05:42:55.652991 1 config.go:315] Starting service config controller
I0417 05:42:55.652997 1 shared_informer.go:223] Waiting for caches to sync for endpoints config
I0417 05:42:55.653015 1 shared_informer.go:223] Waiting for caches to sync for service config
I0417 05:42:55.753408 1 shared_informer.go:230] Caches are synced for endpoints config
I0417 05:42:55.753475 1 shared_informer.go:230] Caches are synced for service config
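With ipvs syncing properly, the service-to-endpoint mappings should now exist as real ipvs rules; assuming ipvsadm is installed (yum install -y ipvsadm), the virt-api service IP can be checked directly:
[root@master ~]# ipvsadm -Ln | grep -A 2 10.111.187.72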
And then, somewhat mysteriously, everything just worked. Curl the virt-api service IP again:
[root@master ~]# curl https://10.111.187.72 -k
{
  "paths": [
    "/apis",
    "/apis/",
    "/openapi/v2",
    "/apis/subresources.kubevirt.io",
    "/apis/subresources.kubevirt.io/v1alpha3"
  ]
}
Try deploying the VM again; this also works now:
[root@master kubevirt_compute]# kubectl apply -f vm.yaml
[root@master kubevirt_compute]# kubectl get vm
NAME        AGE   VOLUME
vm-cirros   8s
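Note that the example sets spec.running: false, so at this point only the VirtualMachine object exists. Starting it spawns a VirtualMachineInstance (vmi), and virtctl (if installed, as sketched earlier) gives you a serial console, where cirros should print the cloud-init message from the decoded userdata on boot:
[root@master kubevirt_compute]# virtctl start vm-cirros
[root@master kubevirt_compute]# kubectl get vmi
[root@master kubevirt_compute]# virtctl console vm-cirros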
The problem is solved, but quite a few things are still unclear to me, and many of the details along the way are something I need to go back and study properly.