OS: Ubuntu 16.04
DB: MySQL 5.7.25
The cluster layout is as follows:
192.168.56.92 node1 # mysql
192.168.56.90 node2 # mysql
192.168.56.88 node3 # mysql
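The hostnames above have to resolve on every node; assuming no DNS, a minimal /etc/hosts sketch (an illustration, not the exact file used here) would be:

```
192.168.56.92 node1
192.168.56.90 node2
192.168.56.88 node3
```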
The title is truncated; the full error message is:
[ERROR] Plugin group_replication reported: ‘Member was expelled from the group due to network failures, changing member status to ERROR.’
All three nodes run in multi-primary mode, as shown below:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 4bd106a1-3bee-11e9-8034-080027c780f8 | node1 | 3306 | ONLINE |
| group_replication_applier | 523c134d-3bee-11e9-b57a-08002756ee51 | node2 | 3306 | ONLINE |
| group_replication_applier | 56cf559d-3bee-11e9-abb2-080027366485 | node3 | 3306 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)
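For reference, multi-primary mode is normally switched on with my.cnf settings along these lines (a sketch; the exact configuration of this cluster is not shown in the post):

```ini
# both at their defaults would mean single-primary mode
loose-group_replication_single_primary_mode = OFF
# required safety checks when every member accepts writes
loose-group_replication_enforce_update_everywhere_checks = ON
```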
However, after running for a while, the member states went wrong.
On node1:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 4bd106a1-3bee-11e9-8034-080027c780f8 | node1 | 3306 | ERROR |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)
On node2 and node3:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 523c134d-3bee-11e9-b57a-08002756ee51 | node2 | 3306 | ONLINE |
| group_replication_applier | 56cf559d-3bee-11e9-abb2-080027366485 | node3 | 3306 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)
node1's error log around that time shows:
2019-03-01T08:29:22.766248Z 59 [Note] Aborted connection 59 to db: 'unconnected' user: 'repl' host: 'node3' (failed on flush_net())
2019-03-01T09:39:21.784167Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6301ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
2019-03-01T09:39:21.797758Z 0 [ERROR] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2019-03-01T09:39:21.799307Z 0 [Note] Plugin group_replication reported: 'Going to wait for view modification'
2019-03-01T09:39:25.058328Z 0 [ERROR] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2019-03-01T09:39:25.059169Z 0 [Note] Plugin group_replication reported: 'Going to wait for view modification'
The log says "Member was expelled from the group due to network failures, changing member status to ERROR." Yet from node1 both node2 and node3 can be pinged, and connecting as the repl user also works.
node2's error log:
2019-03-01T09:39:21.723121Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6143ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
2019-03-01T09:39:21.731315Z 0 [Warning] Plugin group_replication reported: 'Member with address node1:3306 has become unreachable.'
2019-03-01T09:39:21.736068Z 0 [Note] Plugin group_replication reported: '[GCS] Removing members that have failed while processing new view.'
2019-03-01T09:39:22.725045Z 0 [Warning] Plugin group_replication reported: 'Members removed from the group: node1:3306'
2019-03-01T09:39:22.725102Z 0 [Note] Plugin group_replication reported: 'Group membership changed to node2:3306, node3:3306 on view 15514286216262210:8.'
2019-03-04T00:32:29.260469Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 225820058ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
2019-03-04T00:32:29.261682Z 0 [Warning] Plugin group_replication reported: 'Member with address node3:3306 has become unreachable.'
2019-03-04T00:32:29.261934Z 0 [ERROR] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
2019-03-04T00:32:30.264808Z 0 [Warning] Plugin group_replication reported: 'Member with address node3:3306 is reachable again.'
2019-03-04T00:32:30.264858Z 0 [Warning] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
node3's error log:
2019-03-01T09:39:20.433168Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4475ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
2019-03-01T09:39:21.427791Z 0 [Warning] Plugin group_replication reported: 'Members removed from the group: node1:3306'
2019-03-01T09:39:21.427850Z 0 [Note] Plugin group_replication reported: 'Group membership changed to node2:3306, node3:3306 on view 15514286216262210:8.'
2019-03-04T00:32:42.911818Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 225821503ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
2019-03-04T00:32:42.912527Z 0 [Warning] Plugin group_replication reported: 'Member with address node2:3306 has become unreachable.'
2019-03-04T00:32:42.912546Z 0 [ERROR] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
2019-03-04T00:32:43.912580Z 0 [Warning] Plugin group_replication reported: 'Member with address node2:3306 is reachable again.'
2019-03-04T00:32:43.912617Z 0 [Warning] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
Both the node2 and node3 logs suggest "It is possible to use group_replication_force_members to force a new group membership."
node1's relevant Group Replication network settings are:
loose-group_replication_local_address= "192.168.56.92:24901"
loose-group_replication_group_seeds="192.168.56.92:24901,192.168.56.90:24901,192.168.56.88:24901"
loose-group_replication_ip_whitelist="192.168.56.0/24,127.0.0.1/8"
It turns out that node2 and node3 cannot telnet to port 24901 on node1 (the XCom communication port configured in group_replication_local_address).
root@node2:~# telnet 192.168.56.92 24901
Trying 192.168.56.92...
telnet: Unable to connect to remote host: Connection refused
root@node3:~# telnet 192.168.56.92 24901
Trying 192.168.56.92...
telnet: Unable to connect to remote host: Connection refused
This goes some way toward explaining the "network failures" reported in the logs.
Troubleshooting on node1 confirms that nothing is actually listening on port 24901 locally:
root@node1:~# netstat -lntp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 1865/mysqld
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1516/sshd
tcp 0 0 127.0.0.1:6010 0.0.0.0:* LISTEN 1953/0
tcp 0 0 127.0.0.1:6011 0.0.0.0:* LISTEN 3043/1
tcp6 0 0 :::22 :::* LISTEN 1516/sshd
tcp6 0 0 ::1:6010 :::* LISTEN 1953/0
tcp6 0 0 ::1:6011 :::* LISTEN 3043/1
Why is port 24901 not being listened on? Presumably because the XCom listener is only open while the member is participating in the group; once node1 was expelled and its state changed to ERROR, the listener was closed (the start log below shows XCom binding the port again only during START GROUP_REPLICATION).
Run STOP GROUP_REPLICATION on node1:
mysql> stop group_replication;
Query OK, 0 rows affected (6.01 sec)
2019-03-04T01:33:00.679764Z 7 [Note] Plugin group_replication reported: 'Plugin 'group_replication' is stopping.'
2019-03-04T01:33:00.679805Z 7 [Note] Plugin group_replication reported: 'Going to wait for view modification'
2019-03-04T01:33:00.679936Z 0 [Warning] Plugin group_replication reported: 'read failed'
2019-03-04T01:33:00.680085Z 0 [Note] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2019-03-04T01:33:05.680674Z 7 [Note] Plugin group_replication reported: 'auto_increment_increment is reset to 1'
2019-03-04T01:33:05.680710Z 7 [Note] Plugin group_replication reported: 'auto_increment_offset is reset to 1'
2019-03-04T01:33:05.680919Z 47 [Note] Error reading relay log event for channel 'group_replication_applier': slave SQL thread was killed
2019-03-04T01:33:05.680991Z 47 [Note] Slave SQL thread for channel 'group_replication_applier' exiting, replication stopped in log 'FIRST' at position 85
2019-03-04T01:33:05.685005Z 44 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'
2019-03-04T01:33:05.685302Z 7 [Note] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.'
Run START GROUP_REPLICATION on node1:
mysql> start group_replication;
Query OK, 0 rows affected, 1 warning (2.04 sec)
2019-03-04T01:33:43.316266Z 7 [Note] Plugin group_replication reported: 'Group communication SSL configuration: group_replication_ssl_mode: "DISABLED"'
2019-03-04T01:33:43.316548Z 7 [Warning] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the whitelist. It is mandatory that it is added.'
2019-03-04T01:33:43.316664Z 7 [Note] Plugin group_replication reported: '[GCS] SSL was not enabled'
2019-03-04T01:33:43.316692Z 7 [Note] Plugin group_replication reported: 'Initialized group communication with configuration: group_replication_group_name: "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"; group_replication_local_address: "192.168.56.92:24901"; group_replication_group_seeds: "192.168.56.92:24901,192.168.56.90:24901,192.168.56.88:24901"; group_replication_bootstrap_group: true; group_replication_poll_spin_loops: 0; group_replication_compression_threshold: 1000000; group_replication_ip_whitelist: "192.168.56.0/24,127.0.0.1/8"'
2019-03-04T01:33:43.316715Z 7 [Note] Plugin group_replication reported: '[GCS] Configured number of attempts to join: 0'
2019-03-04T01:33:43.316728Z 7 [Note] Plugin group_replication reported: '[GCS] Configured time between attempts to join: 5 seconds'
2019-03-04T01:33:43.316748Z 7 [Note] Plugin group_replication reported: 'Member configuration: member_id: 1; member_uuid: "4bd106a1-3bee-11e9-8034-080027c780f8"; single-primary mode: "false"; group_replication_auto_increment_increment: 7; '
2019-03-04T01:33:43.317032Z 64 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 2155, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2019-03-04T01:33:43.348320Z 67 [Note] Slave SQL thread for channel 'group_replication_applier' initialized, starting replication in log 'FIRST' at position 0, relay log './mysql-relay-bin-group_replication_applier.000013' position: 2420
2019-03-04T01:33:43.348443Z 7 [Note] Plugin group_replication reported: 'Group Replication applier module successfully initialized!'
2019-03-04T01:33:43.348467Z 7 [Note] Plugin group_replication reported: 'auto_increment_increment is set to 7'
2019-03-04T01:33:43.348649Z 7 [Note] Plugin group_replication reported: 'auto_increment_offset is set to 1'
2019-03-04T01:33:43.349124Z 0 [Note] Plugin group_replication reported: 'XCom protocol version: 3'
2019-03-04T01:33:43.349148Z 0 [Note] Plugin group_replication reported: 'XCom initialized and ready to accept incoming connections on port 24901'
2019-03-04T01:33:44.359931Z 70 [Note] Plugin group_replication reported: 'Only one server alive. Declaring this server as online within the replication group'
2019-03-04T01:33:44.359980Z 0 [Note] Plugin group_replication reported: 'Group membership changed to node1:3306 on view 15516632243591695:1.'
2019-03-04T01:33:44.363891Z 0 [Note] Plugin group_replication reported: 'This server was declared online within the replication group'
Check replication_group_members on node1 again:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 4bd106a1-3bee-11e9-8034-080027c780f8 | node1 | 3306 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)
Why are the other group members not visible? One likely reason is in the start log above: group_replication_bootstrap_group: true. With bootstrap enabled, node1 bootstrapped a brand-new group of its own ("Only one server alive. Declaring this server as online") instead of joining the existing group formed by node2 and node3.
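The usual way to rejoin an existing group, rather than bootstrap a new one, is to make sure bootstrap is off before starting the plugin. A sketch (assuming node2 and node3 still form a healthy majority):

```sql
-- on node1: rejoin the running group instead of bootstrapping a fresh one
STOP GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;
START GROUP_REPLICATION;
```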
In the end the node was added back by enabling group_replication_allow_local_disjoint_gtids_join. (Note: this variable has been deprecated since MySQL 5.7.21, because it can mask genuine GTID divergence between the joining member and the group; use it with care.)
mysql> stop group_replication;
Query OK, 0 rows affected (0.00 sec)
mysql> set global group_replication_allow_local_disjoint_gtids_join=ON;
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> start group_replication;
Query OK, 0 rows affected, 1 warning (3.19 sec)
mysql>
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 4bd106a1-3bee-11e9-8034-080027c780f8 | node1 | 3306 | ONLINE |
| group_replication_applier | 523c134d-3bee-11e9-b57a-08002756ee51 | node2 | 3306 | ONLINE |
| group_replication_applier | 56cf559d-3bee-11e9-abb2-080027366485 | node3 | 3306 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)
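Since group_replication_allow_local_disjoint_gtids_join lets a member join even when its local GTID set diverges from the group's, it is worth comparing executed GTID sets across all three nodes afterwards, for example:

```sql
-- run on each node and compare; the sets should be identical
-- (or differ only by transactions still being applied)
SELECT @@global.gtid_executed;
```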