mysql galera status_MySQL galera cluster集群的监控

车嘉实
2023-12-01

一、集群复制状态检查

1、SHOW GLOBAL STATUS LIKE 'wsrep_%';

+------------------------------+-------------------------------------------------------------+

| Variable_name | Value |

+------------------------------+-------------------------------------------------------------+

| wsrep_local_state_uuid | 9f6a992a-7dd9-11e5-9f85-f760745ffb39 |

| wsrep_protocol_version | 7 |

| wsrep_last_committed | 53 |

| wsrep_replicated | 6 |

| wsrep_replicated_bytes | 1368 |

| wsrep_repl_keys | 9 |

| wsrep_repl_keys_bytes | 210 |

| wsrep_repl_data_bytes | 774 |

| wsrep_repl_other_bytes | 0 |

| wsrep_received | 37 |

| wsrep_received_bytes | 23347 |

| wsrep_local_commits | 0 |

| wsrep_local_cert_failures | 0 |

| wsrep_local_replays | 0 |

| wsrep_local_send_queue | 0 |

| wsrep_local_send_queue_max | 2 |

| wsrep_local_send_queue_min | 0 |

| wsrep_local_send_queue_avg | 0.125000 |

| wsrep_local_recv_queue | 0 |

| wsrep_local_recv_queue_max | 2 |

| wsrep_local_recv_queue_min | 0 |

| wsrep_local_recv_queue_avg | 0.027027 |

| wsrep_local_cached_downto | 14 |

| wsrep_flow_control_paused_ns | 0 |

| wsrep_flow_control_paused | 0.000000 |

| wsrep_flow_control_sent | 0 |

| wsrep_flow_control_recv | 0 |

| wsrep_cert_deps_distance | 1.000000 |

| wsrep_apply_oooe | 0.100000 |

| wsrep_apply_oool | 0.000000 |

| wsrep_apply_window | 1.250000 |

| wsrep_commit_oooe | 0.000000 |

| wsrep_commit_oool | 0.000000 |

| wsrep_commit_window | 1.250000 |

| wsrep_local_state | 4 |

| wsrep_local_state_comment | Synced |

| wsrep_cert_index_size | 10 |

| wsrep_cert_bucket_count | 22 |

| wsrep_gcache_pool_size | 27144 |

| wsrep_causal_reads | 0 |

| wsrep_cert_interval | 0.325000 |

| wsrep_incoming_addresses | |

| wsrep_evs_delayed | |

| wsrep_evs_evict_list | |

| wsrep_evs_repl_latency | 0/0/0/0/0 |

| wsrep_evs_state | OPERATIONAL |

| wsrep_gcomm_uuid | 5e28860a-829e-11e5-9c06-665d7fe4003d |

| wsrep_cluster_conf_id | 3 |

| wsrep_cluster_size | 3 |

| wsrep_cluster_state_uuid | 9f6a992a-7dd9-11e5-9f85-f760745ffb39 |

| wsrep_cluster_status | Primary |

| wsrep_connected | ON |

| wsrep_local_bf_aborts | 0 |

| wsrep_local_index | 1 |

| wsrep_provider_name | Galera |

| wsrep_provider_vendor | Codership Oy |

| wsrep_provider_version | 3.12(rXXXX) |

| wsrep_ready | ON |

+------------------------------+-------------------------------------------------------------+

wsrep_notify_cmd.sh——监控状态的变化。使用方法参见http://galeracluster.com/documentation-webpages/notificationcmd.html

2、wsrep_cluster_state_uuid显示了cluster的state UUID,由此可看出节点是否还是集群的一员

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_state_uuid'

集群内每个节点的value都应该是一样的,否则说明该节点不在集群中了

+--------------------------+--------------------------------------+

| Variable_name | Value |

+--------------------------+--------------------------------------+

| wsrep_cluster_state_uuid | 9f6a992a-7dd9-11e5-9f85-f760745ffb39 |

+--------------------------+--------------------------------------+

3、wsrep_cluster_conf_id显示了整个集群的变化次数。所有节点都应相同,否则说明某个节点与集群断开了

4、wsrep_cluster_size显示了集群中节点的个数

5、wsrep_cluster_status显示集群里节点的主状态。标准返回primary。如返回non-Primary或其他值说明是多个节点改变导致的节点丢失或者脑裂。如果所有节点都返回不是Primary,则要重设quorum。具体参见http://galeracluster.com/documentation-webpages/quorumreset.html如果返回都正常,说明复制机制在每个节点都能正常工作,下一步该检查每个节点的状态确保他们都能收到write-set

show global status like 'wsrep_cluster_status';

+----------------------+---------+

| Variable_name | Value |

+----------------------+---------+

| wsrep_cluster_status | Primary |

+----------------------+---------+

二、检查节点状态

节点状态显示了集群中的节点接受和更新write-set状态,以及可能阻止复制的一些问题

1、wsrep_ready显示了节点是否可以接受queries。ON表示正常,如果是OFF几乎所有的query都会报错,报错信息提示“ERROR 1047 (08501) Unknown Command”

SHOW GLOBAL STATUS LIKE 'wsrep_ready';

+---------------+-------+

| Variable_name | Value |

+---------------+-------+

| wsrep_ready | ON |

+---------------+-------+

2、SHOW GLOBAL STATUS LIKE 'wsrep_connected’显示该节点是否与其他节点有网络连接。(实验得知,当把某节点的网卡down掉之后,该值仍为on。说明网络还在)丢失连接的问题可能在于配置wsrep_cluster_address或wsrep_cluster_name的错误

+-----------------+-------+

| Variable_name | Value |

+-----------------+-------+

| wsrep_connected | ON |

+-----------------+-------+

3、wsrep_local_state_comment 以人能读懂的方式显示节点的状态,正常的返回值是Joining, Waiting on SST, Joined, Synced or Donor,返回Initialized说明已不在正常工作状态

+---------------------------+--------+

| Variable_name | Value |

+---------------------------+--------+

| wsrep_local_state_comment | Synced |

+---------------------------+--------+

三、查看复制的健康状态

通过Flow Control的反馈机制来管理复制进程。当本地收到的write-set超过某一阀值时,该节点会启动flow control来暂停复制直到它赶上进度。监控本地收到的请求和flow control,有如下几个参数:

1、wsrep_local_recv_queue_avg——平均请求队列长度。当返回值大于0时,说明apply write-sets比收write-set慢,有等待。堆积太多可能导致启动flow control

+----------------------------+----------+

| Variable_name | Value |

+----------------------------+----------+

| wsrep_local_recv_queue_avg | 0.027027 |

+----------------------------+----------+

wsrep_local_recv_queue_max 和 wsrep_local_recv_queue_min可以看队列设置的最大最小值

2、wsrep_flow_control_paused 显示了自从上次查询之后,节点由于flow control而暂停的时间占整个查询间隔时间比。总体反映节点落后集群的状况。如果返回值为1,说明自上次查询之后,节点一直在暂停状态。如果发现某节点频繁落后集群,则应该调整wsrep_slave_threads或者把节点剔除

+---------------------------+----------+

| Variable_name | Value |

+---------------------------+----------+

| wsrep_flow_control_paused | 0.000000 |

+---------------------------+----------+

3、wsrep_cert_deps_distance显示了平行apply的最低和最高排序编号或者sql编号之间的平均距离值。这代表了节点潜在的并行程度,和线程相关

+--------------------------+----------+

| Variable_name | Value |

+--------------------------+----------+

| wsrep_cert_deps_distance | 1.000000 |

+--------------------------+----------+

四、检测网络慢的问题

通过检查发送队列来看传出的连接状况

1、wsrep_local_send_queue_avg显示自上次查询之后的平均发送队列长度。比如网络瓶颈和flow control都可能是原因

+----------------------------+----------+

| Variable_name | Value |

+----------------------------+----------+

| wsrep_local_send_queue_avg | 0.033333 |

+----------------------------+----------+

wsrep_local_send_queue_max 和 wsrep_local_send_queue_min可以看队列设置的最大值和最小值

五、日志监控

在my.cnf中做如下配置

# wsrep Log Options

wsrep_log_conflicts=ON   #会将冲突信息写入错误日志中,例如两个节点同时写同一行数据

wsrep_provider_options="cert.log_conflicts=ON"    #复制过程中的错误信息写在日志中

wsrep_debug=ON    #显示debug 信息在日志中,其中也包括鉴权信息,例如账号密码。因此在生产环境中不开启

六、附加的日志

当某节点在从节点上应用一个事件失败时,数据库服务器会创建一个特殊的binary log文件。文件名默认是GRA_*.log

 类似资料: