os: ubuntu 16.04
db: postgresql 9.6.8
db: influxdb 1.7.5
grafana: 6.1.3
python3
After running for a while, the pgwatch2 binary compiled from the pgwatch2 source can no longer write to InfluxDB and reports a timeout error ({"error":"timeout"}).
5432 - Postgres configuration (or metrics storage) DB
8080 - Management Web UI (monitored hosts, metrics, metrics configurations)
8081 - Gatherer healthcheck / statistics on number of gathered metrics (JSON).
3000 - Grafana dashboarding
8086 - InfluxDB API (when using the InfluxDB version)
8088 - InfluxDB Backup port (when using the InfluxDB version)
# /usr/pgwatch2/pgwatch2-1.5.1/pgwatch2/pgwatch2 --verbose \
--host=127.0.0.1 --port=5432 --dbname=pgwatch2 --user=pgwatch2 --password=xyz \
--ihost=127.0.0.1 --iport=8086 --idbname2=pgwatch2 --iuser=pgwatch2 --ipassword=xyz
2019/04/17 12:24:11 INFO FetchMetrics: fetched 1 rows for [pgsql_56_92:stat_statements_calls] in 1.8 ms
2019/04/17 12:24:11 INFO FetchMetrics: fetched 1 rows for [pgsql_56_92:archiver] in 1.2 ms
2019/04/17 12:24:21 ERRO MetricsPersister: Failed to write into datastore 0: {"error":"timeout"}
2019/04/17 12:24:41 ERRO MetricsPersister: Error processing retry queue 0 : {"error":"timeout"}
2019/04/17 12:25:01 ERRO MetricsPersister: Error processing retry queue 0 : {"error":"timeout"}
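Datastore 0 is the InfluxDB data source configured in the web UI.
The only way to recover is to restart InfluxDB, which makes this a serious nuisance in practice: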
# systemctl restart influxdb.service;
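Until the root cause is found, the manual restart could be automated as a stopgap. The following is only a sketch in Python 3 (listed in the environment above). `should_restart` and `restart_influxdb` are hypothetical helper names, and the threshold of three consecutive failed write probes is an arbitrary assumption, not something pgwatch2 or InfluxDB provides.

```python
import subprocess


def should_restart(history, threshold=3):
    """Return True when the last `threshold` probe results are all failures.

    `history` is a list of booleans, oldest first: True means a write probe
    succeeded, False means it failed (for example with HTTP 500 "timeout").
    """
    return len(history) >= threshold and not any(history[-threshold:])


def restart_influxdb():
    """Run the same restart command as above (requires root privileges)."""
    subprocess.run(["systemctl", "restart", "influxdb.service"], check=True)
```

A cron job or small loop could append one probe result per minute to `history` and call `restart_influxdb()` whenever `should_restart(history)` is true. It is worth saving the syslog entries first, because a restart makes the hang harder to diagnose afterwards.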
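The error messages show that data cannot be written to InfluxDB, so the cause needs to be investigated. Start by checking that the influxd process is still running: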
# ps -ef|head -1;ps -ef|grep -i influ |grep -v grep
UID PID PPID C STIME TTY TIME CMD
influxdb 1648 1 0 10:10 ? 00:00:07 /usr/bin/influxd -config /etc/influxdb/influxdb.conf
The syslog contains the following entries:
Apr 17 12:23:11 pgw influxd[1648]: [httpd] 127.0.0.1 - pgwatch2 [17/Apr/2019:12:23:11 +0800] "POST /write?consistency=&db=pgwatch2&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" 82b80aeb-60c8-11e9-828d-0800277bd51d 4837
Apr 17 12:23:11 pgw influxd[1648]: [httpd] 127.0.0.1 - pgwatch2 [17/Apr/2019:12:23:11 +0800] "POST /write?consistency=&db=pgwatch2&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" 82d85daa-60c8-11e9-828e-0800277bd51d 19923
Apr 17 12:23:11 pgw influxd[1648]: [httpd] 127.0.0.1 - pgwatch2 [17/Apr/2019:12:23:11 +0800] "POST /write?consistency=&db=pgwatch2&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" 82fab8be-60c8-11e9-828f-0800277bd51d 6068
Apr 17 12:24:21 pgw influxd[1648]: [httpd] 127.0.0.1 - pgwatch2 [17/Apr/2019:12:24:11 +0800] "POST /write?consistency=&db=pgwatch2&precision=ns&rp= HTTP/1.1" 500 20 "-" "InfluxDBClient" a6794d3b-60c8-11e9-8290-0800277bd51d 10047387
Apr 17 12:24:21 pgw influxd[1648]: ts=2019-04-17T04:24:21.467821Z lvl=error msg="[500] - \"timeout\"" log_id=0ErGnFmG000 service=httpd
Apr 17 12:24:41 pgw influxd[1648]: [httpd] 127.0.0.1 - pgwatch2 [17/Apr/2019:12:24:31 +0800] "POST /write?consistency=&db=pgwatch2&precision=ns&rp= HTTP/1.1" 500 20 "-" "InfluxDBClient" b270e8cf-60c8-11e9-8291-0800277bd51d 10005719
Apr 17 12:24:41 pgw influxd[1648]: ts=2019-04-17T04:24:41.503993Z lvl=error msg="[500] - \"timeout\"" log_id=0ErGnFmG000 service=httpd
Apr 17 12:25:01 pgw influxd[1648]: [httpd] 127.0.0.1 - pgwatch2 [17/Apr/2019:12:24:51 +0800] "POST /write?consistency=&db=pgwatch2&precision=ns&rp= HTTP/1.1" 500 20 "-" "InfluxDBClient" be694300-60c8-11e9-8292-0800277bd51d 10001542
Apr 17 12:25:01 pgw influxd[1648]: ts=2019-04-17T04:25:01.582161Z lvl=error msg="[500] - \"timeout\"" log_id=0ErGnFmG000 service=httpd
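The log shows that the HTTP status of the write requests changes from 204 to 500. Each failed request is logged about 10 seconds after it started (12:24:11 to 12:24:21), which matches the last field of the line (about 10,000,000, apparently the request duration in microseconds).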
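To find the moment the writes start failing without reading the syslog by eye, the access-log lines can be parsed. This is a rough sketch, assuming the InfluxDB 1.x httpd log layout seen above (request line, status, response size, referer, user agent, request ID, then a duration that is taken here to be microseconds). `ACCESS_RE`, `parse_access_line` and `first_failure` are names made up for this example.

```python
import re

# Tail of an httpd access-log line:
# "<METHOD> <URI> <PROTO>" <status> <bytes> "<referer>" "<agent>" <request-id> <duration>
ACCESS_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"[^"]*" "[^"]*" (?P<request_id>\S+) (?P<duration_us>\d+)\s*$'
)


def parse_access_line(line):
    """Return a dict for an access-log line, or None for any other line."""
    m = ACCESS_RE.search(line)
    if m is None:
        return None
    return {
        "method": m.group("method"),
        "status": int(m.group("status")),
        "size": int(m.group("size")),
        # Assumption: the last field is the request duration in microseconds.
        "duration_s": int(m.group("duration_us")) / 1_000_000,
    }


def first_failure(lines):
    """Return the first parsed request with a 5xx status, or None."""
    for line in lines:
        rec = parse_access_line(line)
        if rec is not None and rec["status"] >= 500:
            return rec
    return None


# Three lines copied from the syslog excerpt above.
SAMPLE = [
    r'Apr 17 12:23:11 pgw influxd[1648]: [httpd] 127.0.0.1 - pgwatch2 [17/Apr/2019:12:23:11 +0800] "POST /write?consistency=&db=pgwatch2&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" 82fab8be-60c8-11e9-828f-0800277bd51d 6068',
    r'Apr 17 12:24:21 pgw influxd[1648]: [httpd] 127.0.0.1 - pgwatch2 [17/Apr/2019:12:24:11 +0800] "POST /write?consistency=&db=pgwatch2&precision=ns&rp= HTTP/1.1" 500 20 "-" "InfluxDBClient" a6794d3b-60c8-11e9-8290-0800277bd51d 10047387',
    r'Apr 17 12:24:21 pgw influxd[1648]: ts=2019-04-17T04:24:21.467821Z lvl=error msg="[500] - \"timeout\"" log_id=0ErGnFmG000 service=httpd',
]
```

For these sample lines, `first_failure(SAMPLE)` returns the 12:24 request with status 500 and a duration of about 10.05 seconds, while the `lvl=error` line is skipped because it is not an access-log entry.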
Check the listening ports:
# netstat -lntp |head -2;netstat -lntp |grep -i infl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:8086 0.0.0.0:* LISTEN 1648/influxd
tcp 0 0 127.0.0.1:8088 0.0.0.0:* LISTEN 1648/influxd
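The same check can be scripted. The sketch below only tests whether a TCP connection can be opened, using the two ports from the netstat output; `is_listening`, `INFLUX_PORTS` and `report` are hypothetical names chosen for illustration.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable host, or connect timeout.
        return False


# Ports taken from the netstat output above.
INFLUX_PORTS = {"HTTP API": 8086, "backup": 8088}


def report(host="127.0.0.1"):
    """Map each InfluxDB port name to whether something accepts connections on it."""
    return {name: is_listening(host, port) for name, port in INFLUX_PORTS.items()}
```

As in this case, both ports can accept connections while writes still time out, so an open port only rules out a crashed or unbound influxd.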
Try inserting a point manually:
# curl -i -XPOST 'http://127.0.0.1:8086/write?db=pgwatch2&u=pgwatch2&p=xyz' --data-binary 'mc_resource_double,internal_id=2 value=1337'
HTTP/1.1 500 Internal Server Error
Content-Type: application/json
Request-Id: b0a9ed8c-60dd-11e9-845c-0800277bd51d
X-Influxdb-Build: OSS
X-Influxdb-Error: timeout
X-Influxdb-Version: 1.7.5
X-Request-Id: b0a9ed8c-60dd-11e9-845c-0800277bd51d
Date: Wed, 17 Apr 2019 06:54:57 GMT
Content-Length: 20
{"error":"timeout"}
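The manual write can also be turned into a repeatable probe. This is a minimal sketch using only the Python standard library and the same /write endpoint, query parameters and line-protocol point as the curl command; `build_write_url` and `probe_write` are hypothetical helper names, and the 15 second client timeout is an assumption chosen to be longer than the roughly 10 second server-side timeout seen in the logs.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_write_url(base, db, user, password):
    """Build the /write URL with the same query parameters as the curl call."""
    query = urllib.parse.urlencode({"db": db, "u": user, "p": password})
    return f"{base}/write?{query}"


def probe_write(url, line, timeout=15.0):
    """POST one line-protocol point and return (http_status, error_text)."""
    req = urllib.request.Request(url, data=line.encode("utf-8"), method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, None  # 204 means the point was accepted
    except urllib.error.HTTPError as exc:
        # InfluxDB repeats the error reason in the X-Influxdb-Error header.
        return exc.code, exc.headers.get("X-Influxdb-Error")
    except OSError as exc:
        # Connection refused, client-side timeout, and similar failures.
        return None, str(exc)
```

Calling `probe_write(build_write_url("http://127.0.0.1:8086", "pgwatch2", "pgwatch2", "xyz"), "mc_resource_double,internal_id=2 value=1337")` against the hung server should give the same status and error as the curl response above, and a 204 status once InfluxDB has been restarted.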
Even running show stats hangs.
# influx
Connected to http://localhost:8086 version 1.7.5
InfluxDB shell version: 1.7.5
Enter an InfluxQL query
> show databases;
name: databases
name
----
_internal
pgwatch2
> show stats
Running show diagnostics hangs as well.
Hitting a bug this soon? A search on GitHub shows that the same problem has indeed been reported:
<<After several hours of running normally, server enters a state where it returns error=timeout for all writes and httpd logs wrongly show 10mb+ data size #13342>>
<<Influx 1.3 Error 500’s until restart #8533>>
<<InfluxDB goes unresponsive #8500>>
<<failed to store statistics: timeout 1.2.0 #8036>>
<<Influxdb 1.7.5 stops responding while ingesting data, 1.7.4 does not #13010>>
There are currently two options:
References:
https://www.influxdata.com/blog/how-to-use-the-show-stats-command-and-the-_internal-database-to-monitor-influxdb/
https://github.com/influxdata/influxdb/issues/3349
https://github.com/influxdata/influxdb/issues/13342
https://github.com/influxdata/influxdb/issues/8533
https://github.com/influxdata/influxdb/issues/8500
https://github.com/influxdata/influxdb/issues/8036
https://github.com/influxdata/influxdb/issues/13010