We deploy Prometheus on Kubernetes with kube-prometheus, using the upstream manifest files directly. First, create a dedicated namespace for monitoring:
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
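If you want to create the namespace ahead of time (manifests/setup also creates it later), save the YAML above, e.g. as monitoring-ns.yaml (hypothetical file name), and apply it:
kubectl apply -f monitoring-ns.yaml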
Reference links:
## Reference 1
https://github.com/prometheus-operator/kube-prometheus
###
https://github.com/prometheus-operator/kube-prometheus/tree/main/manifests/setup
## Reference 2
https://github.com/camilb/prometheus-kubernetes
### Alerting configuration
https://www.qikqiak.com/post/prometheus-operator-custom-alert/
Installation and deployment:
### First check which Kubernetes version the cluster is running, then switch to the matching kube-prometheus release branch
git checkout -b <local-branch> origin/<remote-branch>
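For example, on a v1.23 cluster one compatible branch is release-0.10 (an illustration only — always confirm the branch against the compatibility matrix in the kube-prometheus README):
# check the Kubernetes server version
kubectl version
# clone the repo and switch to the matching release branch
git clone https://github.com/prometheus-operator/kube-prometheus.git
cd kube-prometheus
git checkout -b release-0.10 origin/release-0.10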
###
# Create the namespace and CRDs, and then wait for them to be available before creating the remaining resources
kubectl apply --server-side -f manifests/setup
until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
kubectl apply -f manifests/
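Before going further, it helps to confirm that the Prometheus Operator CRDs are registered and the monitoring pods are coming up:
kubectl get crd | grep monitoring.coreos.com
kubectl get pods -n monitoring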
Component analysis
(1) How do kube-state-metrics and metrics-server compare?
While a service is running we want to know its state: has a pod restarted, did a scale-out succeed, what phase is a pod in, and so on. That is what kube-state-metrics is for; it focuses on the state of Kubernetes objects such as Deployments, Nodes, and Pods. metrics-server, on the other hand, collects resource metrics such as CPU and memory usage for nodes and pods.
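A quick way to see the difference in practice (assuming metrics-server is installed and Prometheus is already scraping kube-state-metrics):
# metrics-server: resource usage (CPU/memory) of nodes and pods
kubectl top node
kubectl top pod -n monitoring
# kube-state-metrics: object state, queried via PromQL, for example
#   kube_pod_status_phase{namespace="monitoring"}      pod phase (Running/Pending/...)
#   kube_pod_container_status_restarts_total           container restart counts
#   kube_deployment_status_replicas_available          available replicas per Deployment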
Troubleshooting
The kube-state-metrics image cannot be pulled because k8s.gcr.io is unreachable from the nodes. Inspect the pod:
$ kubectl describe po kube-state-metrics-5fcb7d6fcb-k6sfd -n monitoring
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 19m default-scheduler Successfully assigned monitoring/kube-state-metrics-5fcb7d6fcb-k6sfd to 172.19.193.25
Normal SuccessfulMountVolume 19m kubelet Successfully mounted volumes for pod "kube-state-metrics-5fcb7d6fcb-k6sfd_monitoring(0c0134c9-120c-4fd7-adcf-e61b2dae680a)"
Normal Pulling 18m kubelet Pulling image "quay.io/brancz/kube-rbac-proxy:v0.11.0"
Normal Pulled 18m kubelet Successfully pulled image "quay.io/brancz/kube-rbac-proxy:v0.11.0" in 39.040594662s
Normal Pulled 18m kubelet Container image "quay.io/brancz/kube-rbac-proxy:v0.11.0" already present on machine
Normal Started 18m kubelet Started container kube-rbac-proxy-main
Normal SuccessfulCreate 18m kubelet Created container kube-rbac-proxy-main
Normal SuccessfulCreate 18m kubelet Created container kube-rbac-proxy-self
Normal Started 18m kubelet Started container kube-rbac-proxy-self
Normal Pulling 17m (x3 over 19m) kubelet Pulling image "k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0"
Warning FailedCreate 16m (x3 over 18m) kubelet Error: ErrImagePull
Warning FailedPullImage 16m (x3 over 18m) kubelet Failed to pull image "k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0": rpc error: code = Unknown desc = Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning FailedCreate 16m (x4 over 17m) kubelet Error: ImagePullBackOff
Warning BackOffPullImage 4m29s (x50 over 17m) kubelet Back-off pulling image "k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0"
Switch to an image that can actually be pulled:
##
vim kubeStateMetrics-deployment.yaml
##
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: kube-state-metrics
      app.kubernetes.io/part-of: kube-prometheus
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: kube-state-metrics
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 2.3.0
    spec:
      containers:
      - args:
        - --host=127.0.0.1
        - --port=8081
        - --telemetry-host=127.0.0.1
        - --telemetry-port=8082
        #image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0
        image: quay.io/coreos/kube-state-metrics:v1.9.8 # changed to an image that can be pulled
The prometheus-adapter image cannot be pulled either:
$ kubectl describe po prometheus-adapter-58668f79bc-lgj95 -n monitoring
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 30m default-scheduler Successfully assigned monitoring/prometheus-adapter-58668f79bc-lgj95 to 172.19.193.102
Normal SuccessfulMountVolume 30m kubelet Successfully mounted volumes for pod "prometheus-adapter-58668f79bc-lgj95_monitoring(518e3bbd-23d5-4944-ad63-25948338122d)"
Normal Pulling 27m (x4 over 30m) kubelet Pulling image "k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1"
Warning FailedPullImage 27m (x4 over 30m) kubelet Failed to pull image "k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1": rpc error: code = Unknown desc = Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning FailedCreate 27m (x4 over 30m) kubelet Error: ErrImagePull
Warning FailedCreate 27m (x6 over 30m) kubelet Error: ImagePullBackOff
Warning BackOffPullImage 38s (x115 over 30m) kubelet Back-off pulling image "k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1"
Fix (see also: "Installing Prometheus-Adapter", Huawei Cloud MCP Multi-Cloud Container Platform User Guide, Monitoring Center):
##
vim prometheusAdapter-deployment.yaml
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/component: metrics-adapter
      app.kubernetes.io/name: prometheus-adapter
      app.kubernetes.io/part-of: kube-prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/component: metrics-adapter
        app.kubernetes.io/name: prometheus-adapter
        app.kubernetes.io/part-of: kube-prometheus
    .... (fields omitted)
        #image: k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1
        image: directxman12/k8s-prometheus-adapter-amd64:v0.7.0 # changed to this image, which can be pulled
        name: prometheus-adapter
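After editing both files, re-apply them and wait for the rollouts to finish (file names as used above):
kubectl apply -f kubeStateMetrics-deployment.yaml
kubectl apply -f prometheusAdapter-deployment.yaml
kubectl -n monitoring rollout status deploy/kube-state-metrics
kubectl -n monitoring rollout status deploy/prometheus-adapter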
Finally, check that all the pods are up:
$ kubectl get po -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 38m
alertmanager-main-1 2/2 Running 0 38m
alertmanager-main-2 2/2 Running 0 38m
blackbox-exporter-776596fdf8-82qj7 3/3 Running 0 39m
grafana-667874d57-xvvpt 1/1 Running 0 39m
kube-state-metrics-584858f6fc-24jlx 3/3 Running 0 12m
node-exporter-hn88p 2/2 Running 0 39m
node-exporter-jt7b8 2/2 Running 0 39m
prometheus-adapter-544596c9f5-gsbzp 1/1 Running 0 42s
prometheus-adapter-544596c9f5-rsb7d 1/1 Running 0 42s
prometheus-k8s-0 2/2 Running 0 38m
prometheus-k8s-1 2/2 Running 0 38m
prometheus-operator-7ddc6877d5-d58rd 2/2 Running 0 39m
(1) Modify the Prometheus Service (expose it via NodePort)
# vi prometheus-service.yaml
##
[root@k8s-01 manifests]# cat prometheus-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.36.1
  name: prometheus-k8s
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: web
    port: 9090
    targetPort: web
    nodePort: 30100 # external access
  # - name: reloader-web
  #   port: 8080
  #   targetPort: reloader-web
  selector:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
  sessionAffinity: ClientIP
[root@k8s-01 manifests]#
(2) Modify the Grafana Service (expose it via NodePort)
[root@k8s-01 manifests]# cat grafana-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 8.5.5
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: http
    port: 3000
    targetPort: http
    nodePort: 30200
  selector:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
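Apply both modified Services and confirm that the NodePorts are in place:
kubectl apply -f prometheus-service.yaml
kubectl apply -f grafana-service.yaml
kubectl -n monitoring get svc prometheus-k8s grafana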
(3) Access:
## Grafana
http://xx.cn:30200
## Prometheus
http://xx.cn:30100
### The default Grafana username and password are
admin/admin
# Query metrics for a specific namespace
container_cpu_usage_seconds_total{namespace="car-stg"}
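Raw counters like container_cpu_usage_seconds_total are usually wrapped in rate(); for example, per-pod CPU usage over the last 5 minutes in that namespace (sample query, adjust namespace and labels to your environment):
sum(rate(container_cpu_usage_seconds_total{namespace="car-stg", container!=""}[5m])) by (pod)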
Alert rule references:
## Reference 1
https://awesome-prometheus-alerts.grep.to/rules.html
## Reference 2
https://github.com/camilb/prometheus-kubernetes/blob/master/manifests/prometheus/prometheus-k8s-rules.yaml
How do you modify an alert rule?
#### Method 1: edit the rule ConfigMap directly
## edit
kubectl edit cm prometheus-k8s-rulefiles-0 -n monitoring
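Note that prometheus-k8s-rulefiles-0 is generated by the Prometheus Operator from PrometheusRule objects, so direct edits to the ConfigMap may be overwritten. It is usually safer to edit the PrometheusRule behind it (resource names vary by kube-prometheus version):
kubectl get prometheusrules -n monitoring
kubectl edit prometheusrule <rule-name> -n monitoring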
#### Method 2: edit the PrometheusRule manifest and re-apply it
cd /opt/proms-k8s/kube-prometheus/manifests
vim kubePrometheus-prometheusRule.yaml
###
kubectl apply -f kubePrometheus-prometheusRule.yaml
- alert: KubernetesNodeReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes Node ready (instance {{ $labels.instance }})
    description: "Node {{ $labels.node }} has been unready for a long time\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesMemoryPressure
  expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes memory pressure (instance {{ $labels.instance }})
    description: "{{ $labels.node }} has MemoryPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesOutOfDisk
  expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes out of disk (instance {{ $labels.instance }})
    description: "{{ $labels.node }} has OutOfDisk condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
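To add custom alerts such as these without touching the bundled files, another option is a separate PrometheusRule object. A minimal sketch, assuming the default kube-prometheus rule-selector labels (prometheus: k8s, role: alert-rules) — verify them against prometheus-prometheus.yaml in your manifests:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-node-rules   # hypothetical name
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: custom-node.rules
    rules:
    - alert: KubernetesNodeReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: Kubernetes Node ready (instance {{ $labels.instance }})
        description: "Node {{ $labels.node }} has been unready for a long time"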
##