os: ubuntu 16.04
db: postgresql 9.6.8
pacemaker: Pacemaker 1.1.14 Written by Andrew Beekhof
corosync: Corosync Cluster Engine, version '2.3.5'
The current cluster is as follows:
vip-mas 192.168.56.119
vip-sla 192.168.56.120
node1 192.168.56.92
node2 192.168.56.90
node3 192.168.56.88
Now add a new node, node4:
node4 192.168.56.86
root@node1:~# crm_mon -Afr -1
Last updated: Tue Feb 19 16:00:40 2019 Last change: Tue Feb 19 15:51:27 2019 by root via crm_attribute on node1
Stack: corosync
Current DC: node1 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 7 resources configured
Online: [ node1 node2 node3 ]
Full list of resources:
Master/Slave Set: msPostgresql [pgsql]
Masters: [ node1 ]
Slaves: [ node2 node3 ]
Resource Group: master-group
vip-mas (ocf::heartbeat:IPaddr2): Started node1
vip-sla (ocf::heartbeat:IPaddr2): Started node1
Node Attributes:
* Node node1:
+ master-pgsql : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 0000000006000098
+ pgsql-status : PRI
* Node node2:
+ master-pgsql : -INFINITY
+ pgsql-data-status : STREAMING|ASYNC
+ pgsql-status : HS:async
* Node node3:
+ master-pgsql : -INFINITY
+ pgsql-data-status : STREAMING|ASYNC
+ pgsql-status : HS:async
Migration Summary:
* Node node1:
* Node node3:
* Node node2:
node1 currently holds the master role.
root@node1:~# su - postgres
postgres@node1:~$ psql -c "select * from pg_stat_replication;"
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
14728 | 16384 | repl | node2 | 192.168.56.90 | | 52404 | 2019-02-19 15:51:22.288195+08 | 588 | streaming | 0/8000060 | 0/8000060 | 0/8000060 | 0/8000060 | 0 | async
15093 | 16384 | repl | node3 | 192.168.56.88 | | 47440 | 2019-02-19 15:51:26.645476+08 | 588 | streaming | 0/8000060 | 0/8000060 | 0/8000060 | 0/8000060 | 0 | async
(2 rows)
# iptables -F
# systemctl stop ufw;
systemctl disable ufw;
Disable SELinux: edit the config file if it exists, otherwise skip this step (it depends on policycoreutils)
# vi /etc/selinux/config
SELINUX=disabled
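The "edit it only if it exists" rule above can be scripted. The following is a minimal sketch, not part of the original procedure: the function name set_selinux_disabled is hypothetical, and the demo runs against a scratch file instead of /etc/selinux/config.

```shell
# Hypothetical helper: set SELINUX=disabled in the given config file,
# and do nothing when the file does not exist.
set_selinux_disabled() {
    cfg_path="$1"
    [ -f "$cfg_path" ] || return 0          # no SELinux config: skip
    # sed -i as provided by GNU sed (the one shipped on Ubuntu).
    # "^SELINUX=" does not match the SELINUXTYPE= line.
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$cfg_path"
}

# Demo on a scratch file (assumption: sample contents for illustration only).
cfg="$(mktemp)"
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$cfg"
set_selinux_disabled "$cfg"
cat "$cfg"
```

For real use, call set_selinux_disabled /etc/selinux/config as root.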
# vi /etc/hosts
192.168.56.92 node1
192.168.56.90 node2
192.168.56.88 node3
192.168.56.86 node4
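Because the same four entries have to exist on every node, it can help to add them idempotently instead of editing by hand. This is a sketch under assumptions: add_host and HOSTS_FILE are hypothetical names, and HOSTS_FILE defaults to a scratch file so the snippet can be tried safely.

```shell
# Scratch file by default; set HOSTS_FILE=/etc/hosts (as root) for real use.
HOSTS_FILE="${HOSTS_FILE:-$(mktemp)}"

# Hypothetical helper: append "IP name" unless the name already appears
# as a host name on a non-comment line.
add_host() {
    ip="$1"
    name="$2"
    if ! awk -v n="$name" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$HOSTS_FILE"; then
        printf '%s %s\n' "$ip" "$name" >> "$HOSTS_FILE"
    fi
}

add_host 192.168.56.92 node1
add_host 192.168.56.90 node2
add_host 192.168.56.88 node3
add_host 192.168.56.86 node4
add_host 192.168.56.86 node4   # repeated call adds nothing
cat "$HOSTS_FILE"
```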
Configure SSH trust
# ssh-keygen -t rsa
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@node1;
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node2;
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node3;
ssh-copy-id -i ~/.ssh/id_rsa.pub root@node4;
Also run the following on node1, node2 and node3:
# ssh-copy-id -i ~/.ssh/id_rsa.pub root@node4;
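The repeated ssh-copy-id calls can be written as a loop. A minimal sketch: NODES, DRY_RUN and copy_key_to_nodes are hypothetical names, and the loop only prints the commands unless DRY_RUN=0 is set.

```shell
NODES="node1 node2 node3 node4"
DRY_RUN="${DRY_RUN:-1}"    # 1 = print the commands, 0 = actually run them

# Hypothetical helper: copy root's public key to every node in NODES.
copy_key_to_nodes() {
    for node in $NODES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@$node"
        else
            ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "root@$node"
        fi
    done
}

copy_key_to_nodes
```

On node1, node2 and node3 it is enough to run the same function with NODES="node4".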
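Have PostgreSQL installed and configured as 1 master and 2 slaves with async streaming replication.
For the detailed steps, see the other blog post. Note that PostgreSQL must not start at boot; pacemaker + corosync manage PostgreSQL.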
# systemctl disable postgresql
Check whether port 2224 is already in use
# netstat -lntp |grep -i 2224
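A plain grep for 2224 also matches PIDs or other ports that happen to contain those digits. The sketch below matches only addresses that end in ":2224"; port_listening is a hypothetical helper and the sample text is made up for illustration.

```shell
# Hypothetical helper: read `netstat -lnt` style output on stdin and succeed
# if some field ends in ":<port>" (a listening socket on that port).
port_listening() {
    awk -v p=":$1" '
        { for (i = 1; i <= NF; i++) if ($i ~ (p "$")) found = 1 }
        END { exit !found }'
}

# Illustrative sample, not captured from the cluster in this post.
sample='Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp6 0 0 :::2224 :::* LISTEN'

if printf '%s\n' "$sample" | port_listening 2224; then
    echo "port 2224 is in use"
else
    echo "port 2224 is free"
fi
```

For real use: netstat -lnt | port_listening 2224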
Install the required software on node4
# apt install -y pacemaker corosync corosync-dev pcs psmisc fence-agents crmsh
# dpkg -l |grep -Ei "pacemaker|corosync|pcs|psmisc|fence-agents|crmsh"
The corresponding command for a complete uninstall
# apt-get -y remove --purge corosync corosync-dev libcorosync-common-dev libcorosync-common4 pacemaker pacemaker-cli-utils pacemaker-common pacemaker-resource-agents pcs psmisc fence-agents crmsh
Change the hacluster user's password on node4
# passwd hacluster
Check the services on node4 and enable them at boot
# systemctl status pacemaker corosync pcsd
# systemctl enable pacemaker corosync pcsd
# ls -l /lib/systemd/system/corosync.service;
ls -l /lib/systemd/system/pacemaker.service;
ls -l /lib/systemd/system/pcsd.service;
Back up the corosync configuration file /etc/corosync/corosync.conf on node4
# mv /etc/corosync/corosync.conf /etc/corosync/corosync.conf.bak
Delete the pacemaker information (the CIB files) on node4
# ls -l /var/lib/pacemaker/cib/cib*
# rm -f /var/lib/pacemaker/cib/cib*
On node4, stop pacemaker and corosync, then restart pcsd
# systemctl stop pacemaker corosync
# systemctl restart pcsd
# systemctl status pacemaker corosync pcsd
Update the PostgreSQL cluster to add the new node; this causes a brief interruption
Run on node1
# pcs cluster auth -u hacluster -p rootroot 192.168.56.92 192.168.56.90 192.168.56.88 192.168.56.86
# pcs cluster node add 192.168.56.86 --start
# pcs resource update msPostgresql pgsql master-max=1 master-node-max=1 clone-max=5 clone-node-max=1 notify=true
# pcs resource update pgsql pgsql node_list="node1 node2 node3 node4"
# pcs cluster enable --all
192.168.56.92: Cluster Enabled
192.168.56.90: Cluster Enabled
192.168.56.88: Cluster Enabled
192.168.56.86: Cluster Enabled
Restart corosync, pacemaker and pcsd on node4
# systemctl restart pacemaker corosync pcsd
# pcs status
Cluster name: pgcluster
WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
Last updated: Tue Feb 19 17:04:43 2019 Last change: Tue Feb 19 17:02:17 2019 by root via crm_attribute on node1
Stack: corosync
Current DC: node1 (version 1.1.14-70404b0) - partition with quorum
4 nodes and 7 resources configured
Online: [ node1 node2 node3 node4 ]
Full list of resources:
Master/Slave Set: msPostgresql [pgsql]
Masters: [ node1 ]
Slaves: [ node4 ]
Stopped: [ node2 node3 ]
Resource Group: master-group
vip-mas (ocf::heartbeat:IPaddr2): Started node1
vip-sla (ocf::heartbeat:IPaddr2): Started node1
Failed Actions:
* pgsql_start_0 on node3 'unknown error' (1): call=46, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
last-rc-change='Tue Feb 19 17:01:45 2019', queued=0ms, exec=191ms
* pgsql_monitor_4000 on node2 'not running' (7): call=62, status=complete, exitreason='none',
last-rc-change='Tue Feb 19 17:01:47 2019', queued=0ms, exec=91ms
PCSD Status:
node1 (192.168.56.92): Online
node2 (192.168.56.90): Online
node3 (192.168.56.88): Online
node4 (192.168.56.86): Online
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
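Note the line Stopped: [ node2 node3 ]. I do not understand why these two nodes stopped. The failed action on node3 asks for /var/lib/pgsql/tmp/PGSQL.lock to be removed, so remove the lock file and clean up the resource: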
# rm /var/lib/pgsql/tmp/PGSQL.lock
# pcs resource cleanup msPostgresql
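To see quickly which nodes still need attention, the "Stopped: [ ... ]" line can be pulled out of the status output. A minimal sketch: stopped_nodes is a hypothetical helper, and the sample is taken from the pcs status output above.

```shell
# Hypothetical helper: read `pcs status` or `crm_mon -1` output on stdin and
# print the node names from "Stopped: [ ... ]" lines, one per line.
stopped_nodes() {
    sed -n 's/^[[:space:]]*Stopped: \[ \(.*\) \][[:space:]]*$/\1/p' | tr ' ' '\n'
}

status_sample='Master/Slave Set: msPostgresql [pgsql]
Masters: [ node1 ]
Slaves: [ node4 ]
Stopped: [ node2 node3 ]'

printf '%s\n' "$status_sample" | stopped_nodes
```

For real use: pcs status | stopped_nodes. Empty output means no resource instance is reported as stopped.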
Final result
# crm_mon -Afr -1
Last updated: Tue Feb 19 17:09:21 2019 Last change: Tue Feb 19 17:09:07 2019 by root via crm_attribute on node1
Stack: corosync
Current DC: node1 (version 1.1.14-70404b0) - partition with quorum
4 nodes and 7 resources configured
Online: [ node1 node2 node3 node4 ]
Full list of resources:
Master/Slave Set: msPostgresql [pgsql]
Masters: [ node1 ]
Slaves: [ node2 node3 node4 ]
Resource Group: master-group
vip-mas (ocf::heartbeat:IPaddr2): Started node1
vip-sla (ocf::heartbeat:IPaddr2): Started node1
Node Attributes:
* Node node1:
+ master-pgsql : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 000000000D000098
+ pgsql-status : PRI
* Node node2:
+ master-pgsql : -INFINITY
+ pgsql-data-status : STREAMING|ASYNC
+ pgsql-status : HS:async
+ pgsql-xlog-loc : 000000000D000140
* Node node3:
+ master-pgsql : -INFINITY
+ pgsql-data-status : STREAMING|ASYNC
+ pgsql-status : HS:async
+ pgsql-xlog-loc : 000000000D000140
* Node node4:
+ master-pgsql : -INFINITY
+ pgsql-data-status : STREAMING|ASYNC
+ pgsql-status : HS:async
+ pgsql-xlog-loc : 000000000D000140
Migration Summary:
* Node node1:
* Node node3:
* Node node2:
* Node node4:
postgres@node1:~$ psql -c "select * from pg_stat_replication;"
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
29873 | 16384 | repl | node4 | 192.168.56.86 | | 50746 | 2019-02-19 17:02:15.923324+08 | 588 | streaming | 0/D000140 | 0/D000140 | 0/D000140 | 0/D000140 | 0 | async
27221 | 16384 | repl | node2 | 192.168.56.90 | | 52994 | 2019-02-19 17:09:05.656079+08 | 588 | streaming | 0/D000140 | 0/D000140 | 0/D000140 | 0/D000140 | 0 | async
27222 | 16384 | repl | node3 | 192.168.56.88 | | 48038 | 2019-02-19 17:09:05.672641+08 | 588 | streaming | 0/D000140 | 0/D000140 | 0/D000140 | 0/D000140 | 0 | async
(3 rows)
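The final check, that every standby appears in pg_stat_replication with state streaming, can be automated. This is a sketch under assumptions: missing_standbys is a hypothetical helper that expects "application_name|state" lines, the shape psql prints with the -A and -t options, and the sample includes a non-streaming row purely for illustration.

```shell
# Hypothetical helper: arguments are the expected application_name values,
# stdin carries "name|state" lines. Prints every expected standby that is
# not currently in state "streaming".
missing_standbys() {
    input="$(cat)"
    for name in "$@"; do
        printf '%s\n' "$input" | grep -qxF "${name}|streaming" || echo "$name"
    done
}

# Illustrative sample; the real output above shows all three as streaming.
repl_sample='node4|streaming
node2|streaming
node3|startup'

printf '%s\n' "$repl_sample" | missing_standbys node2 node3 node4
```

For real use, as the postgres user on the master: psql -Atc "select application_name, state from pg_stat_replication;" | missing_standbys node2 node3 node4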