Prometheus Monitoring and Alerting: Email Alerts with AlertManager

水铭晨
2023-12-01

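Alertmanager receives the alerts that Prometheus sends. It supports many notification channels, such as email, WeChat, DingTalk, and Slack, and it makes it easy to deduplicate alerts, reduce noise, and group related alerts, which makes it a very practical alert notification system.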

I. Install Alertmanager and configure email alerts

1. Configure email alerts: simulate a node going down, with one alert when it fails and another when it recovers

cd /usr/local


wget https://github.com/prometheus/alertmanager/releases/download/v0.22.1/alertmanager-0.22.1.linux-amd64.tar.gz


tar xf alertmanager-0.22.1.linux-amd64.tar.gz


ln -s alertmanager-0.22.1.linux-amd64 alertmanager

Edit the Alertmanager configuration file:
vim /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:25'
  smtp_from: '*********@qq.com'              # sender address
  smtp_auth_username: '********@qq.com'    # sender account name
  smtp_auth_password: '*********'    # mailbox authorization code (generated in your mailbox settings after you log in)
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: '*********@163.com'                # recipient address
    headers: {Subject: "WARNING - alert email"}
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
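
Before starting the service, it can help to validate the file, since a YAML mistake such as an unclosed bracket will stop Alertmanager from starting. The sketch below assumes the amtool binary that ships in the same tarball sits next to alertmanager in /usr/local/alertmanager; the AM_DIR variable and the check_am_config function are names made up for this example.

```shell
# Assumed install location, taken from the steps above; override AM_DIR if yours differs.
AM_DIR="${AM_DIR:-/usr/local/alertmanager}"

check_am_config() {
  # amtool is distributed in the Alertmanager release tarball.
  if [ -x "$AM_DIR/amtool" ]; then
    # Parses alertmanager.yml and reports syntax or schema errors.
    "$AM_DIR/amtool" check-config "$AM_DIR/alertmanager.yml"
  else
    echo "amtool not found in $AM_DIR, skipping check"
    return 1
  fi
}

check_am_config || echo "config not validated"
```

A non-zero exit status means the file should be fixed before the service is started.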

Add a systemd service for Alertmanager:
vim /lib/systemd/system/alertmanager.service

[Unit]
Description=Prometheus Alertmanager Service daemon
After=network.target

[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/local/alertmanager/alertmanager --config.file="/usr/local/alertmanager/alertmanager.yml" --storage.path="/usr/local/alertmanager/data/" --data.retention=120h --web.external-url="http://xxx.xxx.xxx.133:9093" --web.listen-address=":9093"
Restart=on-failure

[Install]
WantedBy=multi-user.target

systemctl daemon-reload        # reload the systemd unit files

Edit the Prometheus configuration file:
vim /usr/local/prometheus/prometheus.yml
......
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - xxx.xxx.xxx.133:9093    # host IP; 9093 is the default port

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - rules/*.yaml         # path to the alert rule files
......

Create the rules directory and write the alert rule:
mkdir /usr/local/prometheus/rules


cd /usr/local/prometheus/rules


vim node_rule.yaml


groups:
- name: UP
  rules:
  - alert: nodes
    expr: up{job="node_exporter_discovery"} == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."
      summary: "{{ $labels.instance }} down,up=={{ $value }}"
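
Rule files are indentation-sensitive, so a quick syntax check before restarting Prometheus may save a failed restart. This is a minimal sketch that assumes promtool (shipped in the Prometheus tarball) lives in /usr/local/prometheus alongside prometheus.yml; PROM_DIR and check_rules are hypothetical names used only here.

```shell
# Assumed Prometheus directory, inferred from the config path used above.
PROM_DIR="${PROM_DIR:-/usr/local/prometheus}"

check_rules() {
  # promtool is distributed in the Prometheus release tarball.
  if [ -x "$PROM_DIR/promtool" ]; then
    # Validates every rule file matched by the rules/*.yaml pattern.
    "$PROM_DIR/promtool" check rules "$PROM_DIR"/rules/*.yaml
  else
    echo "promtool not found in $PROM_DIR, skipping check"
    return 1
  fi
}

check_rules || echo "rules not validated"
```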

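Note: you can try out expr expressions on the Graph page of the Prometheus web UI (port 9090) to find one that matches the condition you want to alert on. Once the expression matches and keeps matching for the rule's "for" duration, the alert fires.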

Restart Prometheus and start Alertmanager:
systemctl restart prometheus
systemctl start alertmanager
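
To confirm that Alertmanager actually came up, one option is to query its health endpoint. The sketch below assumes curl is installed and that it runs on the Alertmanager host itself, so localhost:9093 is reachable; AM_URL and check_am_health are names invented for this example.

```shell
# Assumed base URL; replace with the host IP if you run this from another machine.
AM_URL="${AM_URL:-http://localhost:9093}"

check_am_health() {
  # /-/healthy answers with HTTP 200 while the process is running;
  # -f makes curl return non-zero on HTTP errors.
  curl -fsS --max-time 5 "$AM_URL/-/healthy"
}

check_am_health || echo "Alertmanager is not reachable at $AM_URL"
```

If the check fails, systemctl status alertmanager usually shows why the service did not start.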

Try taking a monitored node down or powering it off, then check that you receive an alert email, and a recovery email once the node is back up.
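
If shutting down a node is inconvenient, the mail path can also be exercised by posting a hand-made alert to Alertmanager's v2 API. This is only a sketch: the alert name TestAlert, the instance test-host:9100, the AM_URL variable, and both function names are made up for illustration, and curl is assumed to be installed.

```shell
# Assumed base URL of the Alertmanager started above.
AM_URL="${AM_URL:-http://localhost:9093}"

build_test_alert() {
  # $1 = alert name, $2 = instance label (both hypothetical test values).
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"critical"},"annotations":{"summary":"manual test alert"}}]' "$1" "$2"
}

send_test_alert() {
  # POST the JSON array of alerts to the v2 alerts endpoint.
  curl -fsS --max-time 5 -X POST \
    -H 'Content-Type: application/json' \
    -d "$(build_test_alert "$1" "$2")" \
    "$AM_URL/api/v2/alerts"
}

send_test_alert TestAlert test-host:9100 || echo "could not post test alert to $AM_URL"
```

Because the payload sets no end time, the alert should be treated as resolved once resolve_timeout (5m in the configuration above) has passed, so with send_resolved: true a recovery email is expected to follow the first one.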
