Project links
confd
https://github.com/kelseyhightower/confd/releases
https://github.com/kelseyhightower/confd/blob/master/docs/templates.md
etcd
https://github.com/coreos/etcd/
Alertmanager
https://github.com/prometheus/alertmanager
Prometheus
https://github.com/prometheus
Grafana
https://github.com/grafana/grafana
node_exporter
https://github.com/prometheus/node_exporter
All software is installed under the /root directory.
Starting each component
/root/alertmanager/alertmanager --config.file=/root/alertmanager/alertmanager.yml
/root/etcd-v3.4.15/etcd
/root/prometheus/prometheus --web.enable-lifecycle --config.file=/root/prometheus/prometheus.yml --storage.tsdb.path=/root/prometheus/data
/root/node_exporter/node_exporter
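/root/grafana-7.0.3/bin/grafana-server -homepath /root/grafana-7.0.3
The Prometheus startup flag --web.enable-lifecycle enables the HTTP lifecycle endpoints, so the configuration file can be reloaded through the API (a POST to /-/reload).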
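The reload call enabled by --web.enable-lifecycle can also be issued from a script. The following is a minimal Python sketch, assuming Prometheus listens on 192.168.73.100:9090 (the address used in the reload_cmd settings in this document). The helper names build_reload_request and reload_prometheus are invented for illustration.

```python
import urllib.request

# Assumed address; taken from the reload_cmd used elsewhere in this document.
PROMETHEUS_URL = "http://192.168.73.100:9090"


def build_reload_request(base_url: str = PROMETHEUS_URL) -> urllib.request.Request:
    """Build the POST request for the lifecycle reload endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/-/reload", data=b"", method="POST"
    )


def reload_prometheus(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError if Prometheus answers with an
    error status, for example when the new configuration is invalid.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    print(reload_prometheus())
```

Building the request separately from sending it keeps the URL and method easy to inspect without a running server.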
The etcd configuration file is as follows:
[root@bogon alertmanager]# cat /etc/etcd/etcd.conf
ETCD_DATA_DIR="/var/lib/etcd/"
ETCD_LISTEN_CLIENT_URLS="http://192.168.73.101:2379"
ETCD_NAME="default"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.73.101:2379"
Automated management of Prometheus
The confd conf.d resource file and template are as follows:
[root@bogon conf.d]# cat /etc/confd/conf.d/prometheus.conf.toml
[template]
#prefix = "/prometheus"
src = "prometheus.yml.tmpl"
dest = "/root/prometheus/prometheus.yml"
mode = "0755"
keys = [
"/job/",
]
reload_cmd = "curl -XPOST 'http://192.168.73.100:9090/-/reload'"
[root@bogon templates]# cat /etc/confd/templates/prometheus.yml.tmpl
# Global configuration
global:
  scrape_interval: 15s # scrape (pull) interval; the default is 1m
  evaluation_interval: 15s # rule evaluation interval; the default is 1m
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration (not used yet; default settings)
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093
# Load rules and evaluate them periodically at the configured interval (not used yet; default settings)
rule_files:
  - /root/prometheus/alert.rules
  - /root/prometheus/prometheus.rules
  # - "first_rules.yml"
  # - "second_rules.yml"
# Scrape (pull) configuration, i.e. the monitoring targets
# By default only the host itself is monitored
scrape_configs:
  # The job name is added as the label job=<job_name> to every time series scraped from this config.
  # A job is a logical group of targets, not a specific host; one host can expose several targets.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    # Overrides the global scrape interval: 5s instead of 15s.
    scrape_interval: 5s
    # Targets are listed statically; no service discovery is used here
    static_configs:
      - targets: ['192.168.73.100:9090']
  # (optional) Add a second job for the monitored host
  - job_name: 'server'
    static_configs:
      - targets: ['192.168.73.100:9100']
{{range $job_name := gets "/job/*"}}
{{$jobJson := json $job_name.Value}}
  - job_name: '{{$jobJson.name}}'
    scheme: '{{$jobJson.scheme}}'
    metrics_path: '{{$jobJson.metrics}}'
    static_configs:
{{$target := printf "%s/*" $job_name.Key}}{{range $ins_name := gets $target}}
{{$insJson := json $ins_name.Value}}
      - targets: ['{{$insJson.instance}}']
        labels:
          name: '{{$insJson.name}}'
          ip: '{{$insJson.ip}}'
{{end}}
{{end}}
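The template walks two levels of keys: gets "/job/*" returns the job entries, and for each job a second gets on "<job key>/*" returns that job's instances. The Python sketch below emulates this traversal on an in-memory dictionary. It assumes that "*" does not match across "/" (the behaviour of Go's path.Match, which confd's key lookup is assumed to follow here). The function names are hypothetical and are not part of confd.

```python
import json
import re


def gets(store: dict, pattern: str) -> list:
    """Return (key, value) pairs whose key matches pattern, sorted by key.

    Assumption: '*' matches any run of characters except '/',
    mirroring Go's path.Match semantics.
    """
    regex = re.compile("[^/]*".join(re.escape(p) for p in pattern.split("*")))
    return sorted((k, v) for k, v in store.items() if regex.fullmatch(k))


def render_jobs(store: dict) -> list:
    """Build scrape_config entries the same way the template nests its two ranges."""
    jobs = []
    for job_key, job_value in gets(store, "/job/*"):
        job = json.loads(job_value)
        static_configs = []
        # Second level: every key directly below this job is one instance.
        for _, ins_value in gets(store, job_key + "/*"):
            ins = json.loads(ins_value)
            static_configs.append({
                "targets": [ins["instance"]],
                "labels": {"name": ins["name"], "ip": ins["ip"]},
            })
        jobs.append({
            "job_name": job["name"],
            "scheme": job["scheme"],
            "metrics_path": job["metrics"],
            "static_configs": static_configs,
        })
    return jobs


if __name__ == "__main__":
    sample = {
        "/job/test": '{"scheme":"http","metrics":"/metrics","name":"test"}',
        "/job/test/test2": '{"name":"test2","instance":"2.2.2.2:9093","ip":"2.2.2.2"}',
    }
    print(json.dumps(render_jobs(sample), indent=2))
```

Because the first pattern stops at the next "/", an instance key such as /job/test/test2 is not mistaken for a job.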
Start confd
confd -watch -backend etcdv3 -node http://192.168.73.100:2379 &
Write data to etcd
etcdctl --endpoints="http://192.168.73.100:2379" put /job/test '{"scheme":"http","metrics":"/metrics","name":"test"}'
etcdctl --endpoints="http://192.168.73.100:2379" put /job/test/test2 '{"name":"test2","instance":"2.2.2.2:9093","ip":"2.2.2.2"}'
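The two put commands follow a fixed layout: /job/<job> holds the job-level settings and /job/<job>/<instance> holds one target. As a sketch of how such commands could be generated for more jobs, the Python below builds the same command lines. The function names are hypothetical, the endpoint is the one used in this document, and deriving ip from the part of the address before the port is an assumption.

```python
import json
import shlex

ENDPOINT = "http://192.168.73.100:2379"  # etcd endpoint used in this document


def put_command(key: str, value: dict, endpoint: str = ENDPOINT) -> str:
    """Return an `etcdctl put` command line that stores value as compact JSON."""
    payload = json.dumps(value, separators=(",", ":"))
    return "etcdctl --endpoints={} put {} {}".format(
        shlex.quote(endpoint), shlex.quote(key), shlex.quote(payload)
    )


def job_commands(job: str, scheme: str, metrics_path: str, instances: dict) -> list:
    """Commands for one job plus its instances ({instance_name: "host:port"})."""
    cmds = [put_command(
        "/job/" + job,
        {"scheme": scheme, "metrics": metrics_path, "name": job},
    )]
    for name, address in instances.items():
        # Assumption: the ip label is the address without its port.
        ip = address.rsplit(":", 1)[0]
        cmds.append(put_command(
            "/job/{}/{}".format(job, name),
            {"name": name, "instance": address, "ip": ip},
        ))
    return cmds


if __name__ == "__main__":
    for cmd in job_commands("test", "http", "/metrics", {"test2": "2.2.2.2:9093"}):
        print(cmd)
```

The script only prints the commands; running them is left to the operator, which keeps the sketch free of side effects.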
Check the generated Prometheus configuration file
[root@localhost ~]# cat /root/prometheus/prometheus.yml
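# Global configuration
global:
  scrape_interval: 15s # scrape (pull) interval; the default is 1m
  evaluation_interval: 15s # rule evaluation interval; the default is 1m
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration (not used yet; default settings)
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093
# Load rules and evaluate them periodically at the configured interval (not used yet; default settings)
rule_files:
  - /root/prometheus/alert.rules
  - /root/prometheus/prometheus.rules
  # - "first_rules.yml"
  # - "second_rules.yml"
# Scrape (pull) configuration, i.e. the monitoring targets
# By default only the host itself is monitored
scrape_configs:
  # The job name is added as the label job=<job_name> to every time series scraped from this config.
  # A job is a logical group of targets, not a specific host; one host can expose several targets.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    # Overrides the global scrape interval: 5s instead of 15s.
    scrape_interval: 5s
    # Targets are listed statically; no service discovery is used here
    static_configs:
      - targets: ['192.168.73.100:9090']
  # (optional) Add a second job for the monitored host
  - job_name: 'server'
    static_configs:
      - targets: ['192.168.73.100:9100']
  - job_name: 'test'
    scheme: 'http'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['2.2.2.2:9093']
        labels:
          name: 'test2'
          ip: '2.2.2.2'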
Check the Prometheus web UI: the new job should now be listed on the Targets page.
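The same check can be scripted against the Prometheus HTTP API instead of the web page. The sketch below queries /api/v1/targets and groups the active targets by job. It assumes Prometheus is reachable at 192.168.73.100:9090 as elsewhere in this document; the function names are invented for illustration.

```python
import json
import urllib.request

PROMETHEUS_URL = "http://192.168.73.100:9090"  # address used in this document


def jobs_from_targets(payload: dict) -> dict:
    """Map each job name to the list of (instance, health) pairs it scrapes."""
    result = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        result.setdefault(labels["job"], []).append(
            (labels["instance"], target["health"])
        )
    return result


def fetch_targets(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> dict:
    """Query the /api/v1/targets endpoint and decode the JSON response."""
    with urllib.request.urlopen(base_url + "/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for job, targets in jobs_from_targets(fetch_targets()).items():
        print(job, targets)
```

A job written to etcd should appear as a key in the returned mapping once confd has regenerated the file and Prometheus has reloaded it.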
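Automated management of alerting rules
Alerting rule definitions can be written to different files according to their key prefix.
confd resource files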
[root@localhost conf.d]# cat /etc/confd/conf.d/HCYrules.conf.toml
[template]
src = "HCYrules.yml.tmpl"
dest = "/tmp/HCYrules.yml"
mode = "0755"
keys = [
"/rules/alert/HCY/",
"/rules/alert/HCY/alert",
]
reload_cmd = "curl -XPOST 'http://192.168.73.100:9090/-/reload'"
[root@localhost conf.d]#
[root@localhost conf.d]# cat /etc/confd/conf.d/HLYrules.conf.toml
[template]
src = "HLYrules.yml.tmpl"
dest = "/tmp/HLYrules.yml"
mode = "0755"
keys = [
"/rules/alert/HLY/",
"/rules/alert/HLY/alert",
]
reload_cmd = "curl -XPOST 'http://192.168.73.100:9090/-/reload'"
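The two resource files differ only in the group name (HCY or HLY). As a sketch of how further groups could be added without copying files by hand, the Python below renders the same resource text from a name. The function names are hypothetical, and the paths and reload URL are the ones used in this document.

```python
CONF_DIR = "/etc/confd/conf.d"
PROMETHEUS_URL = "http://192.168.73.100:9090"  # address used in this document

# Same layout as the resource files shown above; {name} is the rule group prefix.
RESOURCE = """[template]
src = "{name}rules.yml.tmpl"
dest = "/tmp/{name}rules.yml"
mode = "0755"
keys = [
"/rules/alert/{name}/",
"/rules/alert/{name}/alert",
]
reload_cmd = "curl -XPOST '{url}/-/reload'"
"""


def resource_path(name: str) -> str:
    """Path of the confd resource file for one rule group."""
    return "{}/{}rules.conf.toml".format(CONF_DIR, name)


def render_resource(name: str, url: str = PROMETHEUS_URL) -> str:
    """Text of the confd resource file for one rule group."""
    return RESOURCE.format(name=name, url=url)


if __name__ == "__main__":
    for group in ("HCY", "HLY"):
        print("#", resource_path(group))
        print(render_resource(group))
```

Each new group still needs its own template file and an entry under rule_files in the Prometheus configuration.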
Template files
[root@localhost templates]# cat /etc/confd/templates/HCYrules.yml.tmpl
groups:
{{range $alert_name := gets "/rules/alert/HCY/*"}}
{{$alertJson := json $alert_name.Value}}
- name: {{$alertJson.name}}
  rules:
{{$alert := printf "%s/*" $alert_name.Key}}{{range $ins_name := gets $alert}}
{{$insJson := json $ins_name.Value}}
  - alert: {{$insJson.alert}}
    expr: {{$insJson.expr}}
    for: {{$insJson.for}}
    labels:
      severity: {{$insJson.labels.serverity}}
    annotations:
      summary: {{$insJson.annotations.summary}}
      description: {{$insJson.annotations.description}}
{{end}}
{{end}}
[root@localhost templates]#
[root@localhost templates]# cat /etc/confd/templates/HLYrules.yml.tmpl
groups:
{{range $alert_name := gets "/rules/alert/HLY/*"}}
{{$alertJson := json $alert_name.Value}}
- name: {{$alertJson.name}}
  rules:
{{$alert := printf "%s/*" $alert_name.Key}}{{range $ins_name := gets $alert}}
{{$insJson := json $ins_name.Value}}
  - alert: {{$insJson.alert}}
    expr: {{$insJson.expr}}
    for: {{$insJson.for}}
    labels:
      severity: {{$insJson.labels.serverity}}
    annotations:
      summary: {{$insJson.annotations.summary}}
      description: {{$insJson.annotations.description}}
{{end}}
{{end}}
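Update the Prometheus configuration. Make the same change in confd's prometheus.yml.tmpl template, because confd regenerates prometheus.yml from that template.
Add /tmp/HCYrules.yml and /tmp/HLYrules.yml to rule_files:
# Load rules and evaluate them periodically at the configured interval
rule_files:
  - /root/prometheus/alert.rules
  - /root/prometheus/prometheus.rules
  - /tmp/HCYrules.yml
  - /tmp/HLYrules.yml
  # - "first_rules.yml"
  # - "second_rules.yml"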
Start confd
confd -watch -backend etcdv3 -node http://192.168.73.100:2379 &
Write data to etcd
etcdctl --endpoints="http://192.168.73.100:2379" put /rules/alert/HCY/alerrule '{"name":"HCY"}'
etcdctl --endpoints="http://192.168.73.100:2379" put /rules/alert/HCY/alerrule/alerrule '{"alert":"HCY","expr":"up == 0","for":"1m","labels":{"serverity":"page"},"annotations":{"summary":"hcy","description":"hcy"}}'
etcdctl --endpoints="http://192.168.73.100:2379" put /rules/alert/HCY/alerrule/alerrules '{"alert":"HCY","expr":"up == 0","for":"1m","labels":{"serverity":"page"},"annotations":{"summary":"hcy","description":"hcy"}}'
etcdctl --endpoints="http://192.168.73.100:2379" put /rules/alert/HLY/hello '{"name":"HLY"}'
etcdctl --endpoints="http://192.168.73.100:2379" put /rules/alert/HLY/hello/helloworld '{"alert":"HLY","expr":"up == 0","for":"1m","labels":{"serverity":"page"},"annotations":{"summary":"hcy","description":"hcy"}}'
etcdctl --endpoints="http://192.168.73.100:2379" put /rules/alert/HLY/hello/helloworlds '{"alert":"HLY","expr":"up == 0","for":"1m","labels":{"serverity":"page"},"annotations":{"summary":"hcy","description":"hcy"}}'
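The rule templates read a fixed set of fields from each JSON value, including labels.serverity, which is spelled that way in both the templates and the data above. A value that omits a field or spells it differently would leave the rendered rule without a usable value there. The following sketch checks a value before it is written; the function name and the REQUIRED list are illustrative and simply mirror the fields the templates in this document reference.

```python
import json

# Field paths read by the rule templates in this document.
REQUIRED = [
    ("alert",),
    ("expr",),
    ("for",),
    ("labels", "serverity"),  # spelling matches the templates and etcd data
    ("annotations", "summary"),
    ("annotations", "description"),
]


def missing_fields(raw: str) -> list:
    """Return dotted paths that the templates read but the JSON value lacks."""
    data = json.loads(raw)
    missing = []
    for path in REQUIRED:
        node = data
        for part in path:
            if not isinstance(node, dict) or part not in node:
                missing.append(".".join(path))
                break
            node = node[part]
    return missing


if __name__ == "__main__":
    value = (
        '{"alert":"HCY","expr":"up == 0","for":"1m",'
        '"labels":{"serverity":"page"},'
        '"annotations":{"summary":"hcy","description":"hcy"}}'
    )
    print(missing_fields(value))
```

An empty list means the value carries every field the templates expect.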
Check the newly generated alerting rule files
[root@localhost conf.d]# cat /tmp/HCYrules.yml
groups:
- name: HCY
  rules:
  - alert: HCY
    expr: up == 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: hcy
      description: hcy
  - alert: HCY
    expr: up == 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: hcy
      description: hcy
[root@localhost conf.d]#
[root@localhost conf.d]# cat /tmp/HLYrules.yml
groups:
- name: HLY
  rules:
  - alert: HLY
    expr: up == 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: hcy
      description: hcy
  - alert: HLY
    expr: up == 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: hcy
      description: hcy
Check the Prometheus web UI to confirm that the alerting rules were loaded successfully.
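This check can also be done through the Prometheus HTTP API. The sketch below reads /api/v1/rules and lists the alerting rule names per rule file, so /tmp/HCYrules.yml and /tmp/HLYrules.yml can be looked up in the result. It assumes Prometheus is reachable at 192.168.73.100:9090 as elsewhere in this document; the function names are invented for illustration.

```python
import json
import urllib.request

PROMETHEUS_URL = "http://192.168.73.100:9090"  # address used in this document


def alerts_by_file(payload: dict) -> dict:
    """Map each rule file to the names of the alerting rules loaded from it."""
    result = {}
    for group in payload["data"]["groups"]:
        names = [r["name"] for r in group["rules"] if r.get("type") == "alerting"]
        result.setdefault(group["file"], []).extend(names)
    return result


def fetch_rules(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> dict:
    """Query the /api/v1/rules endpoint and decode the JSON response."""
    with urllib.request.urlopen(base_url + "/api/v1/rules", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for rule_file, names in alerts_by_file(fetch_rules()).items():
        print(rule_file, names)
```

If a generated file is missing from the output, the reload likely failed or the file is not listed under rule_files.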