os: ubuntu 16.04
db: postgresql 9.6.8
The HA architecture is Patroni + etcd + PostgreSQL asynchronous streaming replication, one primary and one standby. The standby was detected as unhealthy, with an error.
# patronictl -c patroni_postgresql.yml list "9.6/main"
+----------+----------+--------------+--------+--------------+-----------+
| Cluster | Member | Host | Role | State | Lag in MB |
+----------+----------+--------------+--------+--------------+-----------+
| 9.6/main | pgsql_90 | 192.168.0.90 | Leader | running | 0.0 |
| 9.6/main | pgsql_92 | 192.168.0.92 | | start failed | unknown |
+----------+----------+--------------+--------+--------------+-----------+
Sure enough, there is an error.
Searching the logs on the machine that is currently master, we find the promotion:
# tail -f /var/log/syslog
Jan 31 10:27:57 node2 patroni[26010]: 2019-01-31 10:27:57,185 INFO: promoted self to leader by acquiring session lock
Jan 31 10:27:57 node2 patroni[26010]: 2019-01-31 10:27:57,190 WARNING: Loop time exceeded, rescheduling immediately.
Jan 31 10:27:57 node2 patroni[26010]: server promoting
Jan 31 10:27:57 node2 patroni[26010]: 2019-01-31 10:27:57,205 INFO: cleared rewind state after becoming the leader
Jan 31 10:27:57 node2 patroni[26010]: 2019-01-31 10:27:57,228 FATAL: patroni callback on_role_change master 9.6/main
Jan 31 10:27:57 node2 patroni[26010]: 2019-01-31 10:27:57,230 INFO: /sbin/ip addr add 192.168.0.119/24 brd 192.168.0.255 dev enp134s0f0 label enp134s0f0:1
Jan 31 10:27:57 node2 patroni[26010]: 2019-01-31 10:27:57,326 INFO: Lock owner: pgsql_90; I am pgsql_90
Jan 31 10:27:57 node2 patroni[26010]: 2019-01-31 10:27:57,576 INFO: updated leader lock during promote
Jan 31 10:27:58 node2 patroni[26010]: 2019-01-31 10:27:58,371 INFO: Lock owner: pgsql_90; I am pgsql_90
Jan 31 10:27:58 node2 patroni[26010]: 2019-01-31 10:27:58,909 INFO: no action. i am the leader with the lock
# tail -f /var/log/syslog
Jan 31 11:12:38 node1 patroni[38728]: 2019-01-31 11:12:38,649 WARNING: Postgresql is not running.
Jan 31 11:12:38 node1 patroni[38728]: 2019-01-31 11:12:38,649 INFO: Lock owner: pgsql_90; I am pgsql_92
Jan 31 11:12:38 node1 patroni[38728]: 2019-01-31 11:12:38,776 INFO: Lock owner: pgsql_90; I am pgsql_92
Jan 31 11:12:38 node1 patroni[38728]: 2019-01-31 11:12:38,777 INFO: Lock owner: pgsql_90; I am pgsql_92
Jan 31 11:12:38 node1 patroni[38728]: 2019-01-31 11:12:38,889 INFO: starting as a secondary
Jan 31 11:12:38 node1 patroni[38728]: 2019-01-31 11:12:38,937 INFO: postmaster pid=16817
Jan 31 11:12:39 node1 patroni[38728]: localhost:5432 - no response
Jan 31 11:12:40 node1 patroni[38728]: 2019-01-31 11:12:40,076 ERROR: postmaster is not running
Jan 31 11:12:40 node1 patroni[38728]: 2019-01-31 11:12:40,239 INFO: Lock owner: pgsql_90; I am pgsql_92
Jan 31 11:12:40 node1 patroni[38728]: 2019-01-31 11:12:40,376 INFO: failed to start postgres
Find the PostgreSQL log entries at the corresponding point in time:
$ tail -f $PGDATA/pg_log/postgresql-2019-01-31.csv
2019-01-31 11:12:39.056 CST,,,16817,,5c5267a6.41b1,1,,2019-01-31 11:12:38 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2019-01-31 11:12:39.059 CST,,,16820,,5c5267a7.41b4,1,,2019-01-31 11:12:39 CST,,0,LOG,00000,"database system was shut down in recovery at 2019-01-31 10:27:57 CST",,,,,,,,,""
2019-01-31 11:12:39.059 CST,,,16820,,5c5267a7.41b4,2,,2019-01-31 11:12:39 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2019-01-31 11:12:39.059 CST,,,16820,,5c5267a7.41b4,3,,2019-01-31 11:12:39 CST,,0,FATAL,XX000,"requested timeline 7 does not contain minimum recovery point 0/465BCC90 on timeline 6",,,,,,,,,""
2019-01-31 11:12:39.060 CST,,,16821,"127.0.0.1:64140",5c5267a7.41b5,1,"",2019-01-31 11:12:39 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=64140",,,,,,,,,""
2019-01-31 11:12:39.062 CST,,,16817,,5c5267a6.41b1,2,,2019-01-31 11:12:38 CST,,0,LOG,00000,"startup process (PID 16820) exited with exit code 1",,,,,,,,,""
2019-01-31 11:12:39.062 CST,,,16817,,5c5267a6.41b1,3,,2019-01-31 11:12:38 CST,,0,LOG,00000,"aborting startup due to startup process failure",,,,,,,,,""
2019-01-31 11:12:39.067 CST,,,16817,,5c5267a6.41b1,4,,2019-01-31 11:12:38 CST,,0,LOG,00000,"database system is shut down",,,,,,,,,""
2019-01-31 11:12:39.072 CST,,,16821,"127.0.0.1:64140",5c5267a7.41b5,2,"",2019-01-31 11:12:39 CST,,0,FATAL,57P01,"terminating connection due to unexpected postmaster exit",,,,,,,,,""
The key error message: requested timeline 7 does not contain minimum recovery point 0/465BCC90 on timeline 6
$ ./pg_controldata -D /data/pg9.6/main.bak
pg_control version number: 960
Catalog version number: 201608131
Database system identifier: 6652128518978933053
Database cluster state: shut down in recovery
pg_control last modified: Thu 31 Jan 2019 10:27:57 AM CST
Latest checkpoint location: 0/464D5868
Prior checkpoint location: 0/464D5868
Latest checkpoint's REDO location: 0/4638E2A8
Latest checkpoint's REDO WAL file: 000000050000000000000046
Latest checkpoint's TimeLineID: 5
Latest checkpoint's PrevTimeLineID: 5
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:666539
Latest checkpoint's NextOID: 27652
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 543
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 666539
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Thu 31 Jan 2019 10:25:54 AM CST
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/465BCC90
Min recovery ending loc's timeline: 6
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: logical
wal_log_hints setting: on
max_connections setting: 1000
max_worker_processes setting: 40
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
The key piece of information: Minimum recovery ending location: 0/465BCC90
$ cat $PGDATA/pg_xlog/00000007.history
1 0/5000098 no recovery target specified
2 0/6000098 no recovery target specified
3 0/8000060 no recovery target specified
4 0/4638E278 no recovery target specified
5 0/4655A098 no recovery target specified
6 0/46589E00 no recovery target specified
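Each line of the history file records the LSN at which that timeline ended (i.e. where the switch to the next timeline happened). A small sketch, with the switch points copied from 00000007.history above, shows which timeline a given LSN belongs to:

```python
# Decide which timeline an LSN falls on, given the parsed history file.
# (timeline, switch LSN) pairs are copied from 00000007.history above.
def parse_lsn(s):
    hi, lo = s.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

history = [
    (1, "0/5000098"), (2, "0/6000098"), (3, "0/8000060"),
    (4, "0/4638E278"), (5, "0/4655A098"), (6, "0/46589E00"),
]

def timeline_for(lsn_str, newest_tli=7):
    lsn = parse_lsn(lsn_str)
    for tli, switch_lsn in history:
        if lsn < parse_lsn(switch_lsn):
            return tli  # the LSN predates this timeline's switch point
    return newest_tli   # past every switch point: the newest timeline

print(timeline_for("0/465BCC90"))  # -> 7
```

Since 0/465BCC90 sorts after timeline 6's end point 0/46589E00, the minimum recovery point belongs on timeline 7, yet pg_control records its timeline as 6 — exactly the mismatch the FATAL message complains about.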
Judging from the error message "requested timeline 7 does not contain minimum recovery point 0/465BCC90 on timeline 6", timeline 7 really ought to contain 0/465BCC90.
Let's inspect it with pg_xlogdump.
$ pg_xlogdump -s 0/465BCB00 $PGDATA/pg_xlog/000000070000000000000046 |more
first record is after 0/465BCB00, at 0/465BCB28, skipping over 40 bytes
rmgr: Btree len (rec/tot): 80/ 80, tx: 667887, lsn: 0/465BCB28, prev 0/465BCAD8, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16385/19274 blk 1
rmgr: Btree len (rec/tot): 88/ 88, tx: 667887, lsn: 0/465BCB78, prev 0/465BCB28, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16385/19275 blk 1
rmgr: Btree len (rec/tot): 88/ 88, tx: 667887, lsn: 0/465BCBD0, prev 0/465BCB78, desc: INSERT_LEAF off 19, blkref #0: rel 1663/16385/19276 blk 1
rmgr: Btree len (rec/tot): 88/ 88, tx: 667887, lsn: 0/465BCC28, prev 0/465BCBD0, desc: INSERT_LEAF off 2, blkref #0: rel 1663/16385/19277 blk 1
rmgr: Btree len (rec/tot): 80/ 80, tx: 667887, lsn: 0/465BCC80, prev 0/465BCC28, desc: INSERT_LEAF off 138, blkref #0: rel 1663/16385/19278 blk 6
rmgr: Btree len (rec/tot): 96/ 96, tx: 667887, lsn: 0/465BCCD0, prev 0/465BCC80, desc: INSERT_LEAF off 124, blkref #0: rel 1663/16385/19279 blk 8
rmgr: Btree len (rec/tot): 88/ 88, tx: 667887, lsn: 0/465BCD30, prev 0/465BCCD0, desc: INSERT_LEAF off 157, blkref #0: rel 1663/16385/19280 blk 2
rmgr: Btree len (rec/tot): 104/ 104, tx: 667887, lsn: 0/465BCD88, prev 0/465BCD30, desc: INSERT_LEAF off 108, blkref #0: rel 1663/16385/19281 blk 9
rmgr: Btree len (rec/tot): 104/ 104, tx: 667887, lsn: 0/465BCDF0, prev 0/465BCD88, desc: INSERT_LEAF off 52, blkref #0: rel 1663/16385/19282 blk 10
rmgr: Heap len (rec/tot): 238/ 238, tx: 667887, lsn: 0/465BCE58, prev 0/465BCDF0, desc: INSERT off 99, blkref #0: rel 1663/16385/19192 blk 4
rmgr: Btree len (rec/tot): 144/ 144, tx: 667887, lsn: 0/465BCF48, prev 0/465BCE58, desc: INSERT_LEAF off 37, blkref #0: rel 1663/16385/19198 blk 8
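To confirm that the WAL on timeline 7 really does cover the minimum recovery point, we can walk the dump above and find the record whose byte range contains 0/465BCC90. A sketch, with the LSNs and total lengths copied from the pg_xlogdump output:

```python
# Find the WAL record from the dump whose byte range contains a target
# LSN. (lsn, total length) pairs are copied from the pg_xlogdump output;
# record totals here are already MAXALIGN-padded, so start + length is
# the start of the next record.
def parse_lsn(s):
    hi, lo = s.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

records = [
    ("0/465BCB28", 80), ("0/465BCB78", 88), ("0/465BCBD0", 88),
    ("0/465BCC28", 88), ("0/465BCC80", 80), ("0/465BCCD0", 96),
]

def covering_record(target_str):
    target = parse_lsn(target_str)
    for lsn, length in records:
        if parse_lsn(lsn) <= target < parse_lsn(lsn) + length:
            return lsn
    return None

print(covering_record("0/465BCC90"))  # -> 0/465BCC80
```

So 0/465BCC90 falls inside the INSERT_LEAF record starting at 0/465BCC80 in 000000070000000000000046 — the data is there on timeline 7.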
Let's change the approach and search on Bing.
$ cat postgresql-2019-01-31.csv |grep -i "checkpoint starting: force"
2019-01-31 10:25:54.851 CST,,,39397,,5c51239e.99e5,133,,2019-01-30 12:10:06 CST,,0,LOG,00000,"checkpoint starting: force",,,,,,,,,""
2019-01-31 10:27:22.051 CST,,,10415,,5c525d08.28af,1,,2019-01-31 10:27:20 CST,,0,LOG,00000,"checkpoint starting: force",,,,,,,,,""
$ cat postgresql-2019-01-31.csv |grep -i "database system is ready to accept connections"
2019-01-31 10:25:54.862 CST,,,39392,,5c51239e.99e0,3,,2019-01-30 12:10:06 CST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""
2019-01-31 10:27:22.054 CST,,,10411,,5c525d08.28ab,3,,2019-01-31 10:27:20 CST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""
Note that the relevant text here is that the checkpoint is starting; it hasn’t completed yet.
The hypothesis is (not 100% confirmed, but seems likely) that the pg_rewind on the new secondary was done before the checkpoint had finished. Since one of the last things a checkpoint does is write the pg_control file, and one of the first things that pg_rewind does is read the control file from the source machine, there’s a window where the pg_control file on disk will be out of date. The result is a rather confused situation on the new secondary.
Thus, if you are going to do a pg_rewind, make sure the initial checkpoint on the source machine has completed before doing the rewind. In a scripted environment, this can be tricky, since the initial checkpoint can take any amount of time depending on how much has to be flushed out to disk. One option is to issue a CHECKPOINT statement to the newly-promoted primary to make sure a checkpoint has been completed; this does a fast checkpoint, and doesn’t return until the checkpoint is completed. (Edited to recommend CHECKPOINT.)
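Following that advice, a scripted failover could force a fast checkpoint on the newly promoted primary before rewinding the old one. The sketch below only builds the commands (the host, user, and data directory are illustrative assumptions, not taken from this setup); `psql -c "CHECKPOINT;"` does not return until the checkpoint completes, so pg_control on the source is current when pg_rewind reads it:

```python
# Sketch: force a completed checkpoint on the new primary, then rewind.
# Host, user, and paths below are illustrative assumptions.
import shlex

source_conn = "host=192.168.0.90 port=5432 user=postgres"
target_pgdata = "/data/pg9.6/main"

# CHECKPOINT is a fast checkpoint and blocks until it has completed.
checkpoint_cmd = ["psql", "-h", "192.168.0.90", "-U", "postgres",
                  "-c", "CHECKPOINT;"]
rewind_cmd = ["pg_rewind",
              "--target-pgdata=" + target_pgdata,
              "--source-server=" + source_conn,
              "--progress"]

print(" ".join(shlex.quote(a) for a in checkpoint_cmd))
print(" ".join(shlex.quote(a) for a in rewind_cmd))
```

In a real script the two commands would be executed in order (e.g. via subprocess.check_call), and pg_rewind would only run after the CHECKPOINT statement has returned.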
Roughly, this passage means that pg_rewind finished before the checkpoint on the source had completed, which produced this confused state.
References:
https://www.postgresql.org/message-id/006901cdc723$ac6a41c0$053ec540$@kapila@huawei.com
https://thebuild.com/blog/2018/07/18/pg_rewind-and-checkpoints-caution/
https://www.postgresql-archive.org