Alertmanager Notes

闻昊英
2023-12-01

1 The Prometheus philosophy

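Every alert should be handled immediately; no alert should stay unresolved for a long time. In practice this shows up as high-frequency data collection and automatic alert resolution (5 minutes by default).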

2 Calling the Alertmanager API

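The following command creates an alert by hand. Note that startsAt and endsAt must reflect the actual current time, written as RFC 3339 timestamps (in UTC or with an explicit UTC offset).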

curl -H "Content-Type: application/json" -X POST -d '[{"labels":{"field1": "value1", "field2": "value2", "field3": "value3"},"annotations":{"desc": "xxxx"},"generatorURL":"http://1.1.1.1","startsAt":"2022-08-10T20:57:46.000+08:00"}]' "http://127.0.0.1:9093/api/v2/alerts"
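
The same request can be built programmatically so that startsAt always reflects the current time. Below is a minimal Python sketch, assuming Alertmanager listens on 127.0.0.1:9093 as in the curl command. The names build_alert and post_alerts are hypothetical helpers for illustration, not part of any Alertmanager client library.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Same endpoint as the curl command above.
ALERTS_URL = "http://127.0.0.1:9093/api/v2/alerts"


def build_alert(labels, annotations, generator_url, now=None):
    """Return one alert dict whose startsAt is `now` as an RFC 3339 UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return {
        "labels": labels,
        "annotations": annotations,
        "generatorURL": generator_url,
        "startsAt": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z"),
    }


def post_alerts(alerts, url=ALERTS_URL):
    """POST a list of alerts; the request body is a JSON array even for one alert."""
    body = json.dumps(alerts).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


if __name__ == "__main__":
    alert = build_alert(
        {"field1": "value1", "field2": "value2", "field3": "value3"},
        {"desc": "xxxx"},
        "http://1.1.1.1",
    )
    print(json.dumps([alert]))
    # post_alerts([alert])  # uncomment when an Alertmanager instance is reachable
```

Keeping payload construction separate from the HTTP call makes it possible to inspect the JSON before anything is sent.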

3 The Alertmanager alert JSON

Alertmanager sends the receiver a single JSON document in which multiple alerts are collected into the alerts array. An example:

{"receiver": "email", "status": "firing", "alerts": [{"status": "firing", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}, "annotations": {"desc": "xxxx"}, "startsAt": "2023-02-09T09:58:45+08:00", "endsAt": "2023-02-09T10:00:45+08:00", "generatorURL": "http://1.1.1.1", "fingerprint": "12345"},{"status": "firing", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}, "annotations": {"desc": "xxxx"}, "startsAt": "2023-02-09T09:58:45+08:00", "endsAt": "2023-02-09T10:00:45+08:00", "generatorURL": "http://1.1.1.1", "fingerprint": "12345"},{"status": "firing", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}, "annotations": {"desc": "xxxx"}, "startsAt": "2023-02-09T09:58:45+08:00", "endsAt": "2023-02-09T10:00:45+08:00", "generatorURL": "http://1.1.1.1", "fingerprint": "12345"}], "groupLabels": {"field1": "value1"}, "commonLabels": {"field1": "value1", "field2": "value2"}, "commonAnnotations": {"desc": "xxxx"}, "externalURL": "http://prometheus:9093", "version": "4", "truncatedAlerts": 0}

Once an alert recovers, its own status field is set to resolved. The top-level status of the JSON becomes resolved only when every alert in the alerts array is resolved.
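
A receiver can apply this rule when it reads the payload. The Python sketch below recomputes the top-level status from the alerts array and counts firing and resolved alerts. The name summarize_payload is a hypothetical helper, and the sample payload is cut down to the fields the function actually reads.

```python
import json


def summarize_payload(raw):
    """Parse a webhook JSON body and return (status, firing_count, resolved_count)."""
    payload = json.loads(raw)
    statuses = [alert["status"] for alert in payload["alerts"]]
    firing = statuses.count("firing")
    resolved = statuses.count("resolved")
    # The group counts as resolved only when no alert in it is still firing.
    expected = "firing" if firing else "resolved"
    if payload["status"] != expected:
        raise ValueError(f"top-level status {payload['status']!r} != {expected!r}")
    return payload["status"], firing, resolved


sample = json.dumps({
    "receiver": "email",
    "status": "firing",
    "alerts": [{"status": "resolved"}, {"status": "firing"}, {"status": "firing"}],
})
print(summarize_payload(sample))  # ('firing', 2, 1)
```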

4 Parameters

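  • group_wait: when the first alert of a group arrives, sending is delayed by this duration. Any other alerts merged into the same group during that window are sent to the receiver together in the same JSON. Every new group incurs this delay.
  • group_interval: once group_wait has elapsed, the group is evaluated every group_interval, and a JSON is sent to the receiver if the group has changed (alerts added or resolved).
  • repeat_interval: if the group has not changed at all, the JSON is sent to the receiver again only after repeat_interval has passed.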
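
Taken together, the three parameters amount to a simple decision each time a group is evaluated. The Python sketch below is a simplified model of that decision as described in these notes, not Alertmanager's actual implementation. The name should_notify and its arguments are hypothetical.

```python
from datetime import timedelta


def should_notify(group_changed, since_last_sent, repeat_interval):
    """Decide whether to notify at one evaluation of a group.

    group_changed: True if alerts were added or resolved since the last notification.
    since_last_sent: timedelta since the last notification, or None if none was sent yet.
    """
    if since_last_sent is None:
        # First notification for the group: sent once group_wait has elapsed.
        return True
    if group_changed:
        # Changes go out at the next group_interval evaluation.
        return True
    # An unchanged group is repeated only after repeat_interval.
    return since_last_sent >= repeat_interval


repeat = timedelta(minutes=10)
print(should_notify(False, timedelta(minutes=1), repeat))   # False
print(should_notify(True, timedelta(minutes=1), repeat))    # True
print(should_notify(False, timedelta(minutes=10), repeat))  # True
```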

4.1 Example

Assume group_wait is set to 30 seconds, group_interval to 1 minute, and repeat_interval to 10 minutes.

  1. The first alert is received at 10:00:00 (t0) and the second at 10:00:20. At 10:00:30 (t0+group_wait) the first JSON is sent:
{"receiver": "email", "status": "firing", "alerts": [{"status": "firing", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}...},{"status": "firing", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}...}], ...}
  2. A third alert is raised at 10:00:40. At 10:01:30 (t0+group_wait+group_interval) the second JSON is sent:
{"receiver": "email", "status": "firing", "alerts": [{"status": "firing", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}...},{"status": "firing", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}...},{"status": "firing", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}...}], ...}
  3. The first alert recovers at 10:01:40. At 10:02:30 (t0+group_wait+group_interval*2) the third JSON is sent:
{"receiver": "email", "status": "firing", "alerts": [{"status": "resolved", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}...},{"status": "firing", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}...},{"status": "firing", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}...}], ...}
  4. The other two alerts also recover at 10:02:40. At 10:03:30 (t0+group_wait+group_interval*3) the fourth JSON is sent:
{"receiver": "email", "status": "resolved", "alerts": [{"status": "resolved", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}...},{"status": "resolved", "labels": {"field1": "value1", "field2": "value2", "field3": "value3"}...}], ...}

If, after the first JSON is sent at 10:00:30, steps 2, 3 and 4 never happen and the alerts never recover, the first JSON is sent again at 10:10:30 (t0+group_wait+repeat_interval).
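
The timestamps in this example follow directly from the three settings. Here is a short Python sketch of the arithmetic, assuming an arbitrary date for t0 (only the time of day matters):

```python
from datetime import datetime, timedelta

group_wait = timedelta(seconds=30)
group_interval = timedelta(minutes=1)
repeat_interval = timedelta(minutes=10)

t0 = datetime(2023, 2, 9, 10, 0, 0)  # first alert received; the date is arbitrary

first_send = t0 + group_wait
# Steps 2 to 4: one notification per group_interval while the group keeps changing.
change_sends = [first_send + group_interval * n for n in range(1, 4)]
# No changes at all: the first JSON is repeated repeat_interval after it was sent.
repeat_send = first_send + repeat_interval

fmt = "%H:%M:%S"
print(first_send.strftime(fmt))                 # 10:00:30
print([t.strftime(fmt) for t in change_sends])  # ['10:01:30', '10:02:30', '10:03:30']
print(repeat_send.strftime(fmt))                # 10:10:30
```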
