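These notes record how to set up monitoring with Prometheus.
Prometheus stores time series data: each series is identified by a metric name plus a set of labels, and holds the sequence of timestamped samples recorded for that identity.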
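To make the "name plus labels" identity concrete, here is a minimal sketch of a toy in-memory store. It is only an illustration, not how Prometheus stores data on disk; the `store`, `series_key`, and `append` names and the second instance address are hypothetical.

```python
from collections import defaultdict

# Toy store: one list of (timestamp, value) samples per series.
# A series is identified by its metric name together with its full label set.
store = defaultdict(list)

def series_key(name, labels):
    """Build a hashable identity from the metric name and its labels."""
    return (name, frozenset(labels.items()))

def append(name, labels, timestamp, value):
    """Append one sample to the series identified by name and labels."""
    store[series_key(name, labels)].append((timestamp, value))

append("node_load1", {"instance": "10.72.88.200:9100"}, 1700000000, 0.42)
append("node_load1", {"instance": "10.72.88.200:9100"}, 1700000015, 0.40)
# A different label value creates a separate series (hypothetical host).
append("node_load1", {"instance": "10.72.88.201:9100"}, 1700000000, 1.10)

print(len(store))  # number of distinct series
```

Two samples with identical name and labels land in the same series; changing any label value starts a new one.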
Monitoring targets can be registered in, and discovered through, Consul:
Installing Consul:
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
sudo yum -y install consul
Adjust the parameters in /etc/consul/consul.hcl to match the arguments of the command below. The HTTP API and UI listen on port 8500.
(consul agent -server -ui -bootstrap-expect=1 -data-dir=/opt/consul -node=consul-1 -client=0.0.0.0 -bind=10.72.88.200 -datacenter=dc1)
curl -X PUT -d '{"id": "test-key-value","name": "10.72.88.200","address": "10.72.88.200","port": 9100,"tags": ["node","hf004"],"meta":{"cloud":"geely","project":"bond"},"checks": [{"http": "http://10.72.88.200:9100/metrics", "interval": "5s"}]}' http://10.72.88.200:8500/v1/agent/service/register
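The command above registers the node with Consul as a service and attaches tags and metadata to it.
curl -X PUT http://10.72.88.200:8500/v1/agent/service/deregister/node-exporter1
This removes a service from Consul. The last path segment is the service id (node-exporter1 here; the registration above used test-key-value).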
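The two curl calls can be sketched in Python as follows. This is a minimal sketch using only the standard library; the helper names (`build_registration`, `register_request`, `deregister_request`) are hypothetical, and the requests are built but not sent so the example has no side effects.

```python
import json
import urllib.request

CONSUL = "http://10.72.88.200:8500"  # agent address taken from the notes

def build_registration(service_id, address, port, tags, meta):
    """Return the JSON body for PUT /v1/agent/service/register."""
    return {
        "id": service_id,
        "name": address,
        "address": address,
        "port": port,
        "tags": tags,
        "meta": meta,
        "checks": [{"http": f"http://{address}:{port}/metrics", "interval": "5s"}],
    }

def register_request(body):
    """Build (but do not send) the registration request."""
    return urllib.request.Request(
        f"{CONSUL}/v1/agent/service/register",
        data=json.dumps(body).encode(),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

def deregister_request(service_id):
    """Build (but do not send) the deregistration request for a service id."""
    return urllib.request.Request(
        f"{CONSUL}/v1/agent/service/deregister/{service_id}", method="PUT"
    )

body = build_registration("node-exporter1", "10.72.88.200", 9100,
                          ["node", "hf004"], {"cloud": "geely", "project": "bond"})
req = register_request(body)
# urllib.request.urlopen(req) would actually send it.
```

Using the same id for registration and deregistration avoids the mismatch between test-key-value and node-exporter1 seen in the curl commands.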
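relabel_configs:
  # Drop targets whose tag list matches the regex; this removes the Consul server's own entry (port 8300) from the targets.
  # Relabel regexes are fully anchored, and a service registered without tags exposes ",," in this label, so try ",," if "," matches nothing.
  - source_labels: [__meta_consul_tags]
    regex: ","
    action: drop
  # Keep the custom metadata: every __meta_consul_service_metadata_<key> label is copied to a label named <key>.
  - regex: __meta_consul_service_metadata_(.+)
    action: labelmap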
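The effect of the two rules can be simulated in a few lines. This is a rough sketch of the drop and labelmap semantics, not Prometheus's own code; the `relabel` function is hypothetical, and it assumes the drop pattern is `,,` (the value a tagless service exposes in `__meta_consul_tags`).

```python
import re

def relabel(labels):
    """Apply a drop rule on __meta_consul_tags, then a labelmap rule.

    Returns the new label dict, or None when the target is dropped.
    """
    # action: drop -> discard the target if the regex matches the whole value.
    if re.fullmatch(r",,", labels.get("__meta_consul_tags", "")):
        return None
    out = dict(labels)
    # action: labelmap -> copy matching labels to the name captured by group 1.
    for name, value in labels.items():
        m = re.fullmatch(r"__meta_consul_service_metadata_(.+)", name)
        if m:
            out[m.group(1)] = value
    return out

node = relabel({
    "__meta_consul_tags": ",node,hf004,",
    "__meta_consul_service_metadata_cloud": "geely",
    "__meta_consul_service_metadata_project": "bond",
})
consul_itself = relabel({"__meta_consul_tags": ",,"})
```

The node target keeps its metadata as plain `cloud` and `project` labels, while the tagless Consul entry is dropped.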
./prometheus --web.enable-lifecycle
Starting Prometheus with this flag allows the configuration file to be hot-reloaded:
curl -X POST http://IP:PORT/-/reload    (PORT defaults to 9090)
global:
  scrape_interval: 15s      # how often to scrape targets
  evaluation_interval: 15s  # how often to evaluate rules
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration (several can be listed)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - 127.0.0.1:9093
# Rule files (several can be listed)
rule_files:
  # - "rules/first_rules.yml"
  # - "rules/second_rules.yml"
# Scrape target configuration
scrape_configs:
  - job_name: "consul_test"
    consul_sd_configs:
      - server: '172.30.12.167:8500'
        services: []
  - job_name: "prometheus1"
    static_configs:  # targets added by hand
      - targets: ["localhost:9090"]
      - targets: ["localhost:9100"]
        # Custom labels, usable later in Alertmanager
        labels:
          idc: shanghai
          system: baidu
          owner: xxx
  - job_name: "prometheus2"  # job names must be unique within scrape_configs
    file_sd_configs:
      - files:
          - /usr/local/prometheus/test.yaml
        refresh_interval: 5s
    # Existing labels can be rewritten
    relabel_configs:
      - action: replace
        source_labels: ["__address__"]
        regex: "(.*)"
        target_label: "instance"  # instance is a label Prometheus adds automatically
        replacement: "$1"
      # or, relying on the default action (replace):
      - source_labels: ["__address__"]
        regex: "(.*)"
        target_label: "test"
        replacement: $1
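How a replace rule computes its result can be sketched as follows. This is a simplified model, assuming the documented behaviour that source label values are joined with `;`, the regex must match the whole joined value, and `$1` refers to the first capture group; the `relabel_replace` function is hypothetical and does not handle the `${1}` form.

```python
import re

def relabel_replace(labels, source_labels, regex, target_label, replacement,
                    separator=";"):
    """Simulate a relabel rule with action: replace."""
    # Join the values of all source labels with the separator.
    value = separator.join(labels.get(name, "") for name in source_labels)
    m = re.fullmatch(regex, value)
    if m is None:
        return dict(labels)  # no match: labels are left untouched
    # Convert $1-style references to Python's \g<1> syntax before expanding.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    out = dict(labels)
    out[target_label] = m.expand(template)
    return out

before = {"__address__": "10.1.9.1:9100", "service": "aaa"}
after = relabel_replace(before, ["__address__"], "(.*)", "test", "$1")
```

With `regex: "(.*)"` and `replacement: $1` the rule simply copies the address into the target label.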
Contents of test.yaml:
- targets:
    - 10.1.9.1xx
    - 10.1.9.2xx
  labels:
    service: aaa
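Files for file_sd_configs are often generated by a script. Prometheus accepts JSON as well as YAML for these files, so a minimal generator can rely on the standard library alone. The `render_file_sd` function and the concrete addresses are hypothetical; the JSON output would need a .json file name in the files list.

```python
import json

def render_file_sd(targets, labels):
    """Return file_sd content (JSON form) holding a single target group."""
    return json.dumps([{"targets": targets, "labels": labels}], indent=2)

# Hypothetical host:port values standing in for the elided addresses.
content = render_file_sd(["10.1.9.1:9100", "10.1.9.2:9100"], {"service": "aaa"})
print(content)
```

Writing the result to a temporary file and renaming it into place keeps Prometheus from reading a half-written file on its next refresh.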
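To monitor endpoints, ports, and similar checks, run blackbox_exporter:
  - job_name: 'http_status'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ['10.72.88.200:80']
        labels:
          instance: 'port_status'  # overwritten by the relabel rule that sets instance
          group: 'port'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.72.88.200:9115  # rewrite the scrape address to the blackbox_exporter host:port
The relabel_configs block is required here. Prometheus has to scrape blackbox_exporter rather than the target itself, so the original address is moved into the target URL parameter (__param_target), copied to instance, and __address__ is then pointed at the exporter.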
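The request Prometheus ends up making for a blackbox job can be reconstructed by hand, which helps when testing the exporter directly. This is a small sketch; `probe_url` is a hypothetical helper, and the addresses are the ones used in the job above.

```python
from urllib.parse import urlencode

def probe_url(exporter, target, module):
    """Build the URL scraped for one blackbox target."""
    # __param_target and params.module become query parameters,
    # while __address__ (the exporter) becomes the host part.
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"

url = probe_url("10.72.88.200:9115", "10.72.88.200:80", "tcp_connect")
print(url)
```

Fetching that URL with curl should return the probe metrics, including probe_success, for the given target.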
promtool check rules /path/to/example.rules.yml    (checks that the rule file syntax is valid)
groups:
  - name: node_monitor
    rules:
      # Alert for any instance that is unreachable for more than 1 minute.
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: 'critical'
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 1 minute. {{ $labels.test }}"
  - name: cpu_test
    rules:
      - alert: CPU
        # The threshold of 1 is a test value; use > 90 to match the summary text.
        # The expression is evaluated per CPU core because node_cpu_seconds_total carries a cpu label.
        expr: (1-rate(node_cpu_seconds_total{mode="idle"}[1m]))*100 > 1
        for: 5s
        labels:
          severity: 'warning'
        annotations:
          summary: "CPU utilization above 90%, {{ $labels.instance }} current value: {{ $value }}%"
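The arithmetic behind the CPU expression is easier to follow with numbers. The sketch below approximates rate() as a plain difference quotient over the window, ignoring the extrapolation and counter-reset handling Prometheus performs; `cpu_usage_percent` and the sample values are made up for illustration.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate (1 - rate(idle counter)) * 100 for one CPU core."""
    # The counter holds seconds spent idle, so its per-second rate
    # is the fraction of time the core was idle.
    idle_rate = (idle_end - idle_start) / window_seconds
    return (1 - idle_rate) * 100

# The idle counter advanced by 6 seconds during a 60-second window.
usage = cpu_usage_percent(1000.0, 1006.0, 60)
print(round(usage, 2))
```

A core that was idle for 6 of 60 seconds was busy about 90% of the time, which is the value the alert compares against its threshold.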
global:
  resolve_timeout: 5m
  smtp_from: "archive@qq.com"
  smtp_smarthost: "smtp.partner.com:587"
  smtp_auth_username: "archive@qq.com"
  smtp_auth_password: "mi1PooI7F%Ht9m0#"
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5s
  receiver: 'email'  # this only sets the default receiver
  routes:
    - match:  # exact match
        service: foo1
      receiver: "email1"
    - match_re:  # regular-expression match
        owner: "xxxx"
      receiver: "email"
receivers:  # define the receivers here: email, webhook, and so on
  - name: 'email'
    email_configs:
      - to: 'test@qq.com'
        send_resolved: true  # also notify when the alert is resolved
  - name: 'email1'  # one receiver can hold several notification integrations
    webhook_configs:
      - url: 'http://prometheus-webhook-dingtalk.kube.com'
    email_configs:
      - to: 'test@qq.com'
        send_resolved: true
inhibit_rules:  # inhibition rules
  - source_match:  # while an alert matching the source labels is firing, alerts matching the target labels are suppressed
      severity: 'warning'  # author's note: the labels matched here must also be configured in the top-level route, otherwise a missing-key error is reported
    target_match:
      severity: 'critical'  # target label value; for a regular expression such as ".*MySQL.*" use target_match_re
    equal: ['alertname','instance']  # inhibition applies only when both labels have the same values in the source and target alerts
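The routing and inhibition logic in this config can be modelled roughly as follows. This is a simplified sketch, not Alertmanager's implementation: it ignores nested routes, the continue option, and grouping, and the `pick_receiver` and `is_inhibited` functions are hypothetical.

```python
import re

ROUTES = [
    {"match": {"service": "foo1"}, "receiver": "email1"},
    {"match_re": {"owner": "xxxx"}, "receiver": "email"},
]
DEFAULT_RECEIVER = "email"

def pick_receiver(labels):
    """Return the receiver of the first matching child route, else the default."""
    for route in ROUTES:
        exact = route.get("match", {})
        regex = route.get("match_re", {})
        if all(labels.get(k) == v for k, v in exact.items()) and all(
            re.fullmatch(v, labels.get(k, "")) for k, v in regex.items()
        ):
            return route["receiver"]
    return DEFAULT_RECEIVER

def is_inhibited(target, firing, source_match, target_match, equal):
    """True if some firing source alert suppresses the target alert."""
    if not all(target.get(k) == v for k, v in target_match.items()):
        return False
    for src in firing:
        same_source = all(src.get(k) == v for k, v in source_match.items())
        same_values = all(src.get(k) == target.get(k) for k in equal)
        if same_source and same_values:
            return True
    return False

warning = {"alertname": "CPU", "instance": "a", "severity": "warning"}
critical = {"alertname": "CPU", "instance": "a", "severity": "critical"}
other = {"alertname": "CPU", "instance": "b", "severity": "critical"}
rule = ({"severity": "warning"}, {"severity": "critical"}, ["alertname", "instance"])
```

With the rule as written, a firing warning silences the critical alert for the same alertname and instance. Many setups want the opposite direction, with critical as the source and warning as the target.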
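4. PrometheusAlert
Search for feiyu563/PrometheusAlert on GitHub or Gitee.
After downloading, edit app.conf and then run PrometheusAlert.
Open the web UI, log in with the username and password from app.conf, and edit the templates from the template page.
Note that Alertmanager needs a webhook receiver pointing at PrometheusAlert, configured under:
receivers:
Trigger an alert manually, or wait for Prometheus to fire one, then look at the received message in the PrometheusAlert log. Use the keys in that JSON to adjust the fields referenced in the template.
If the time format differs, specify it in the template with TimeFormat $v.startsAt "2006/01/02 15:04:05", or call GetCSTtime "" to get the current time.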
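The layout string "2006/01/02 15:04:05" is Go's reference-time notation and corresponds to `%Y/%m/%d %H:%M:%S` in strftime terms. As a rough illustration of the conversion to China Standard Time (UTC+8), here is a sketch; `to_cst` is a hypothetical helper, not a PrometheusAlert function, and it simply discards fractional seconds.

```python
import re
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # China Standard Time, UTC+8

def to_cst(starts_at):
    """Convert an RFC 3339 UTC timestamp to 'YYYY/MM/DD HH:MM:SS' in CST."""
    # Drop fractional seconds and spell out the UTC offset so that
    # datetime.fromisoformat accepts the string on older Python versions.
    cleaned = re.sub(r"\.\d+", "", starts_at).replace("Z", "+00:00")
    moment = datetime.fromisoformat(cleaned)
    return moment.astimezone(CST).strftime("%Y/%m/%d %H:%M:%S")

print(to_cst("2024-01-01T16:30:00.123456789Z"))
```

An alert that started at 16:30 UTC on 1 January is shown as 00:30 on 2 January in CST.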
5. Endpoint monitoring
**prometheus:**
  - job_name: "http"
    metrics_path: /probe
    static_configs:
      - targets:
          - 10.72.88.200:80
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 1x.xx.xx.xx:9115  # blackbox_exporter address
Endpoint checks through blackbox_exporter need this relabel_configs block: it passes the real target as the target URL parameter and points the scrape address at the exporter.
**rules:**
  - name: blackbox_network_stats
    rules:
      - alert: blackbox_network_stats
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          # instance already holds host:port (10.72.88.200:80), so the port is not appended again
          summary: "Instance {{ $labels.instance }} is down"
          description: "This requires immediate action!"
6. PrometheusAlert template
{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
<h1><a href={{$v.generatorURL}}>Prometheus recovery notice</a></h1>
<h2><a href={{$var}}>{{$v.labels.alertname}}</a></h2>
<h5>Severity: {{$v.labels.severity}}</h5>
<h5>Start time: {{GetCSTtime $v.startsAt}}</h5>
<h5>End time: {{GetCSTtime $v.endsAt}}</h5>
<h5>Affected host IP: {{$v.labels.instance}}</h5>
<h5>cloud: {{$v.labels.cloud}}</h5>
<h5>project: {{$v.labels.project}}</h5>
<h3>{{$v.annotations.summary}}</h3>
{{else}}
<h1><a href={{$v.generatorURL}}>Prometheus alert notice</a></h1>
<h2><a href={{$var}}>{{$v.labels.alertname}}</a></h2>
<h5>Severity: {{$v.labels.severity}}</h5>
<h5>Start time: {{GetCSTtime $v.startsAt}}</h5>
<h5>End time: {{GetCSTtime $v.endsAt}}</h5>
<h5>Affected host IP: {{$v.labels.instance}}</h5>
<h5>cloud: {{$v.labels.cloud}}</h5>
<h5>project: {{$v.labels.project}}</h5>
<h3>{{$v.annotations.summary}}</h3>
{{end}}
{{ end }}
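The template walks over the JSON body that Alertmanager posts to its webhook receivers. To show which keys it reads, here is a minimal sketch that performs the same iteration in Python. The payload is an abbreviated, made-up example of that webhook format, and `render` is a hypothetical stand-in for the template.

```python
# Abbreviated, made-up example of an Alertmanager webhook payload.
payload = {
    "status": "firing",
    "externalURL": "http://alertmanager.example:9093",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "InstanceDown", "instance": "10.72.88.200:9100",
                       "severity": "critical", "cloud": "geely", "project": "bond"},
            "annotations": {"summary": "Instance 10.72.88.200:9100 down"},
            "startsAt": "2024-01-01T16:30:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "http://prometheus.example:9090/graph",
        },
        {
            "status": "resolved",
            "labels": {"alertname": "CPU", "instance": "10.72.88.200:9100",
                       "severity": "warning", "cloud": "geely", "project": "bond"},
            "annotations": {"summary": "CPU utilization back to normal"},
            "startsAt": "2024-01-01T16:00:00Z",
            "endsAt": "2024-01-01T16:20:00Z",
            "generatorURL": "http://prometheus.example:9090/graph",
        },
    ],
}

def render(payload):
    """Produce one line per alert, choosing the title by alert status."""
    lines = []
    for alert in payload["alerts"]:
        title = "recovery" if alert["status"] == "resolved" else "alert"
        labels = alert["labels"]
        lines.append(f'{title}: {labels["alertname"]} on {labels["instance"]}')
    return lines

for line in render(payload):
    print(line)
```

Custom labels such as cloud and project are available in the template only if they survive relabeling on the Prometheus side.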
If Alertmanager sends the email itself over SMTP, the configuration and template look like this:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.partner.outlo:25'
  smtp_from: 'gitlab_notific.com'
  smtp_auth_username: 'gitla'
  smtp_auth_password: 'Joq3440'
  smtp_require_tls: true
templates:
  - '/usr/local/alertmanager/*.tmp'
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 3m
  receiver: email
  routes:
    - receiver: email
      group_wait: 30s
      match:
        severity: critical
    - receiver: web-hook
      group_wait: 30s
      match:
        severity: warning
receivers:
  - name: 'web-hook'
    webhook_configs:
      - url: 'http://10.172.88.200:8888/prometheusalert?type=email&tpl=prometheus-email&email=minglo@tech.com'
        send_resolved: true
  - name: 'email'
    email_configs:
      - to: 'minglo@tech.com'
        html: '{{ template "email.to.html" . }}'
        headers: { Subject: " {{ .CommonLabels.instance }} {{ .CommonLabels.alertname}}" }
        send_resolved: true
alert.tmp
{{ define "email.from" }}12345671@qq.com{{ end }}
{{ define "email.to 1" }}minglo@tech.com{{ end }}
{{ define "email.to 2" }}minglo@tech.com{{ end }}
{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts.Firing }}
<h2>@Alert notification</h2>
Alert source: prometheus_alert <br>
Severity: {{ .Labels.severity }} <br>
Alert type: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }} <br>
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts.Resolved }}
<h2>@Alert resolved</h2>
Alert source: prometheus_alert <br>
Affected host: {{ .Labels.instance }}<br>
Summary: {{ .Annotations.summary }}<br>
Details: {{ .Annotations.description }}<br>
Triggered at: {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}<br>
Resolved at: {{ .EndsAt.Local.Format "2006-01-02 15:04:05" }}<br>
{{ end }}{{ end -}}
{{- end }}
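A notification group can contain firing and resolved alerts at the same time, so it matters whether each section of the email iterates over all alerts or only over its own subset. The sketch below models that split; the `sections` function and the sample alerts are hypothetical.

```python
def sections(alerts):
    """Split a notification's alerts into firing and resolved lists."""
    firing = [a for a in alerts if a["status"] == "firing"]
    resolved = [a for a in alerts if a["status"] == "resolved"]
    return firing, resolved

alerts = [
    {"status": "firing", "alertname": "InstanceDown"},
    {"status": "resolved", "alertname": "CPU"},
    {"status": "firing", "alertname": "blackbox_network_stats"},
]
firing, resolved = sections(alerts)

# Iterating over every alert inside both sections would print each alert twice:
entries_if_ranging_over_all = len(alerts) * 2
entries_if_ranging_over_subsets = len(firing) + len(resolved)
print(entries_if_ranging_over_all, entries_if_ranging_over_subsets)
```

Ranging over the firing and resolved subsets keeps each alert in exactly one section of the message.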