Oracle Coherence Operations and Monitoring

许亦

2023-12-01

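1. Checking and Setting Environment Parameters

For details, see Chapter 6, Performance Tuning, of the Oracle® Coherence Administrator's Guide. For the AIX environment used in this project, the following parameters should be adjusted:

1.1. AIX Operating System Parameters

1.1.1. Socket Buffer Sizes

The default socket buffer sizes are usually small, in which case Coherence logs the following warning: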

UnicastUdpSocket failed to set receive buffer size to 1428 packets (2096304 bytes); actual size is 89 packets (131071 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.

Run the following commands as root to raise the limits:

no -o rfc1323=1

no -o sb_max=4194304
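One way to confirm that the new limit took effect is to request the buffer size quoted in the warning and read back what the operating system actually granted. The following is a minimal sketch under that assumption; the class and method names are illustrative and are not part of Coherence.

```java
import java.io.UncheckedIOException;
import java.net.DatagramSocket;
import java.net.SocketException;

/** Requests a UDP receive buffer and reports the size the OS actually granted. */
class SocketBufferCheck {

    /** Returns the receive buffer size in bytes granted for the requested size. */
    static int grantedReceiveBuffer(int requestedBytes) {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setReceiveBufferSize(requestedBytes); // only a hint to the OS
            return socket.getReceiveBufferSize();        // the value actually in effect
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int requested = 2096304; // size taken from the warning message above
        int granted = grantedReceiveBuffer(requested);
        System.out.printf("requested=%d bytes, granted=%d bytes%n", requested, granted);
        if (granted < requested) {
            System.out.println("The OS limit is lower than requested; check sb_max.");
        }
    }
}
```

Running it before and after the `no` commands shows whether the granted size changed. Some operating systems report a larger value than requested, so only a smaller value is treated as a problem here.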

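1.1.2. Multicast and IPv6 Options

AIX 5.2 and later use IPv6 for multicast by default. When starting Coherence services and applications, set the following JVM system property to force IPv4:

-Djava.net.preferIPv4Stack=true

Also set hosts=local,bind4 in /etc/netsvc.conf.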

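1.2. IBM JVM-Specific Configuration

1.2.1. OutOfMemoryError

A node in an OutOfMemoryError state can destabilize the whole cluster, so such a node should exit rather than attempt to recover. Add the following to the IBM JVM startup options: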

UNIX:

-Xdump:tool:events=throw,filter=java/lang/OutOfMemoryError,exec="kill -9 %pid"

1.2.2. Heap Sizing

IBM does not recommend a fixed-size heap for its JVM, so set only -Xmx and leave -Xms at its default. For details, see: http://www.ibm.com/developerworks/java/jdk/diagnosis/

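2. Start and Stop Scripts

2.1. Start Script

2.2. Data Loading Script

2.3. Stop Script

3. Coherence Log Management

3.1. Log Overview

Coherence has its own logging framework and also supports log4j, SLF4J, and Java logging, which gives applications a common logging environment. Coherence logging runs on a dedicated low-priority thread to reduce the impact of logging on critical parts of the system. Logging is preconfigured, and the defaults can be changed as needed.

The log level determines which messages are emitted. The default level emits errors, warnings, informational messages, and some debug messages. During development, raise the log level to its maximum so that all debug messages are recorded; the higher the level, the more detailed the output. In production, level 3 is a reasonable setting. The default is 5. The log levels are: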

· 0 – This level includes messages that are not associated with a logging level.

· 1 – This level includes the previous level's messages plus error messages.

· 2 – This level includes the previous levels' messages plus warning messages.

· 3 – This level includes the previous levels' messages plus informational messages.

· 4-9 – These levels include the previous levels' messages plus internal debugging messages. More log messages are emitted as the log level is increased. The default log level is 5.

· -1 – No log messages are emitted.

3.2. Setting the Log Level

The Coherence log level can be configured in the tangosol-coherence-override.xml file, as shown below:

<logging-config>

<destination system-property="tangosol.coherence.log">log4j</destination>

<severity-level system-property="tangosol.coherence.log.level">3</severity-level>

</logging-config>

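3.3. Log Monitoring

If the Coherence or application log files are numerous or large, clean them up promptly so that they do not exhaust disk space. Check the Coherence logs regularly and pay attention to messages at warning level and above, in particular the following: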

1. Un-indexed data access. Log content to watch for:

1) at com.tangosol...readSerializable(ExternalizableHelper.java:2180

2) YYYY-MM-DD HH:MM:SS.mmm/55.838 Oracle Coherence GE 12.1.2.0.0 <…> . . . Timeout while delivering a packet; requesting the departure confirmation for Member(. . . ) by MemberSet(. . . )

2. Heap exhaustion. Log content to watch for:

java.lang.OutOfMemoryError: GC overhead limit exceeded Dumping heap to java_pid6199.hprof. . .

Heap dump file created [16864871 bytes in 1.921 secs]

3. Unresponsive service

(thread=Cluster, member=2): Detected soft timeout) of {WrapperGuardableGuard{Daemon=DistributedCache}

4. Swap-related messages (an unusually long GC pause such as the one below can indicate swapping)

2013/09/17 10:20:26 | [GC 938176K->865107K(1021376K), 19.7179554 secs]

5. Potential Bandwidth Messages

a) Experienced a XXX ms communication delay (probable remote GC) with Member YYY

b) A potential communication problem has been detected.

c) This node appears to have become disconnected

6. Potential Disconnect Messages

a) (thread=Cluster, member=5): Failed to reach address /192.168.1.103 within the IpMonitor timeout. Members [Member(Id=3. . . )] are suspect.

b) (thread=Cluster, member=5): Timed-out members MemberSet(Size=4, BitSetCount=2 Member(Id=1, Timestamp=2011-02-05

7. Detecting Split Brain

a) 2013-01-25 08:16:59.555/638.831 Oracle Coherence GE 12.1.2.0.0/465p4 <D5> An existence of a cluster island

b) 2010-01-25 09:38:43.213/460.877 Oracle Coherence GE 12.1.2.0.0/465p4 Received panic from senior Member,. . .
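The regular check described above can be partly automated by scanning log lines for the message fragments in this list. The following is a minimal sketch, not a complete monitor: the category names are made up for illustration, and the patterns cover only the fragments quoted above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Flags Coherence log lines that match the warning signs listed above. */
class CoherenceLogScanner {

    // Hypothetical category name -> fragment of a message quoted in the list above.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();

    static {
        rule("HEAP_EXHAUSTION", "java\\.lang\\.OutOfMemoryError");
        rule("UNRESPONSIVE_SERVICE", "Detected soft timeout");
        rule("PACKET_TIMEOUT", "Timeout while delivering a packet");
        rule("COMMUNICATION_DELAY",
                "communication delay|potential communication problem|become disconnected");
        rule("DISCONNECT", "IpMonitor timeout|Timed-out members");
        rule("SPLIT_BRAIN", "cluster island|panic from senior");
    }

    private static void rule(String category, String regex) {
        RULES.put(category, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
    }

    /** Returns the first matching category, or "OK" when no rule matches. */
    static String classify(String line) {
        for (Map.Entry<String, Pattern> entry : RULES.entrySet()) {
            if (entry.getValue().matcher(line).find()) {
                return entry.getKey();
            }
        }
        return "OK";
    }

    /** Reads the file named by the first argument, or standard input if none is given. */
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = args.length > 0
                ? Files.newBufferedReader(Paths.get(args[0]))
                : new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String category = classify(line);
                if (!"OK".equals(category)) {
                    System.out.println(category + ": " + line);
                }
            }
        }
    }
}
```

Long GC pauses (item 4) are not covered because detecting them requires parsing the pause duration rather than matching a fixed phrase.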

4. Coherence Cluster Monitoring

4.1. Overview of Coherence Cluster Monitoring

Several tools can monitor a Coherence cluster, mainly:

1. Using JMX to Manage Oracle Coherence

JMX tools, mainly JConsole or Java VisualVM.

2. Using Oracle Coherence Reporting

A feature built into Coherence that generates statistical reports in text format.

3. Using Oracle WebLogic Server

The WebLogic Console can monitor the health of Coherence nodes and start and stop them.

4. Using Oracle Enterprise Manager

That is, the OEM Management Pack for Oracle Coherence. For details, see: https://docs.oracle.com/cd/E24628_01/install.121/e24215/coherence_getstarted.htm

To monitor with a JMX tool, add the following parameters to the Coherence start script:

-Dcom.sun.management.jmxremote -Dtangosol.coherence.management=all -Dtangosol.coherence.management.remote=true

For remote monitoring, also add:

-Dcom.sun.management.jmxremote.host=10.46.158.140 -Dcom.sun.management.jmxremote.port=7091 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false

If the connection still fails, also add:

-Dcom.sun.management.jmxremote.local.only=false

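To limit the impact on cluster performance, only one node in the cluster needs the JMX parameters above; there is no need to configure every node.

A JMX tool only shows the state of the Coherence cluster between the time the tool is started and the time it is stopped, whereas OEM stores the collected monitoring data in a database so that history can be reviewed.

Memory is the main thing to monitor in Coherence. If memory is not being reclaimed in time and is about to run out, trigger a manual GC; both JConsole and Java VisualVM can do this, as described below.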
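The same memory check and manual GC can also be done in code through the standard platform Memory MXBean. The following is a minimal sketch that runs against the local JVM; the 0.85 threshold and the class name are assumptions chosen for illustration. For a remote node, the connection would come from a JMX connector instead of the local platform MBean server.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;

/** Checks heap usage through the Memory MXBean and requests a GC when usage is high. */
class HeapWatch {

    /** Returns used heap divided by the heap limit (committed size if no maximum is defined). */
    static double heapUsedRatio(MemoryUsage heap) {
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return limit > 0 ? (double) heap.getUsed() / limit : 0.0;
    }

    /** Requests a GC if heap usage exceeds the threshold; returns true if a GC was requested. */
    static boolean gcIfAbove(MBeanServerConnection connection, double threshold)
            throws IOException {
        MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                connection, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
        double ratio = heapUsedRatio(memory.getHeapMemoryUsage());
        System.out.printf("heap usage: %.1f%%%n", ratio * 100);
        if (ratio > threshold) {
            memory.gc(); // same effect as the "Perform GC" button in JConsole or VisualVM
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Local JVM only; the 0.85 threshold is an illustrative assumption.
        boolean requested = gcIfAbove(ManagementFactory.getPlatformMBeanServer(), 0.85);
        System.out.println(requested ? "GC requested" : "no GC needed");
    }
}
```

A forced GC is a stopgap. If usage climbs back quickly afterwards, the heap is undersized or the caches hold too much data.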

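4.2. Monitoring with Java VisualVM

4.2.1. Installing the Coherence Plug-in

4.2.2. Monitoring Machine Status in the Coherence Cluster

4.2.3. Monitoring Coherence Cluster Members

Watch metrics such as publisher success rate, receiver success rate, and send Q size, and check whether each node has enough memory (for example, the Free memory metric).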
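The member metrics named above are also exposed as attributes of the per-node MBean, so they can be checked in code. The following is a sketch under several assumptions: the object name pattern `Coherence:type=Node,*` and the attribute names (`PublisherSuccessRate`, `ReceiverSuccessRate`, `SendQueueSize`, `MemoryAvailableMB`, `MemoryMaxMB`) should be verified against the MBean reference for the Coherence version in use, and the thresholds are illustrative values, not Oracle recommendations.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Reads per-node metrics over JMX and lists the ones that look unhealthy. */
class NodeHealth {

    /** Applies illustrative thresholds to one node's metrics and returns the findings. */
    static List<String> findings(String node, double publisherRate, double receiverRate,
                                 int sendQueueSize, int freeMb, int maxMb) {
        List<String> result = new ArrayList<>();
        if (publisherRate < 0.95) result.add(node + ": low publisher success rate " + publisherRate);
        if (receiverRate < 0.95) result.add(node + ": low receiver success rate " + receiverRate);
        if (sendQueueSize > 100) result.add(node + ": send queue backing up: " + sendQueueSize);
        if (maxMb > 0 && freeMb < maxMb / 10) result.add(node + ": less than 10% memory free");
        return result;
    }

    /** Queries every node MBean visible through the connection. */
    static List<String> scan(MBeanServerConnection connection) throws IOException, JMException {
        List<String> all = new ArrayList<>();
        ObjectName pattern = new ObjectName("Coherence:type=Node,*"); // assumed naming
        for (ObjectName name : connection.queryNames(pattern, null)) {
            all.addAll(findings(
                    name.getKeyProperty("nodeId"),
                    number(connection, name, "PublisherSuccessRate").doubleValue(),
                    number(connection, name, "ReceiverSuccessRate").doubleValue(),
                    number(connection, name, "SendQueueSize").intValue(),
                    number(connection, name, "MemoryAvailableMB").intValue(),
                    number(connection, name, "MemoryMaxMB").intValue()));
        }
        return all;
    }

    private static Number number(MBeanServerConnection connection, ObjectName name, String attribute)
            throws IOException, JMException {
        return (Number) connection.getAttribute(name, attribute);
    }

    public static void main(String[] args) throws IOException, JMException {
        // Finds nothing unless this JVM is a Coherence member with management enabled.
        scan(ManagementFactory.getPlatformMBeanServer()).forEach(System.out::println);
    }
}
```

Keeping the threshold logic in `findings` separate from the JMX calls makes it possible to tune the limits without a running cluster.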

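4.2.4. Monitoring Coherence Cluster Services

Check that every Service is in a normal state, that task average duration and request average duration are normal, and that Task backlog is 0.

For example, a Service in the ENDANGERED state with an unusually high request average duration is not healthy.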
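The service checks described above can be expressed as a small rule applied to the service MBean attributes. This is a sketch, not a definitive check: the object name pattern `Coherence:type=Service,*` and the attribute names `StatusHA`, `TaskBacklog`, and `RequestAverageDuration` are assumptions to verify against the MBean reference, and the 50 ms limit is an arbitrary illustrative value.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Applies the service checks from this section to service MBean attributes. */
class ServiceHealth {

    /** A service is treated as healthy if it is not ENDANGERED, has no backlog, and answers quickly. */
    static boolean isHealthy(String statusHA, int taskBacklog,
                             double requestAvgMillis, double maxRequestAvgMillis) {
        return !"ENDANGERED".equals(statusHA)
                && taskBacklog == 0
                && requestAvgMillis <= maxRequestAvgMillis;
    }

    /** Prints one line per service MBean visible through the connection. */
    static void report(MBeanServerConnection connection, double maxRequestAvgMillis)
            throws IOException, JMException {
        ObjectName pattern = new ObjectName("Coherence:type=Service,*"); // assumed naming
        for (ObjectName name : connection.queryNames(pattern, null)) {
            String statusHA = String.valueOf(connection.getAttribute(name, "StatusHA"));
            int backlog = ((Number) connection.getAttribute(name, "TaskBacklog")).intValue();
            double requestAvg =
                    ((Number) connection.getAttribute(name, "RequestAverageDuration")).doubleValue();
            String verdict = isHealthy(statusHA, backlog, requestAvg, maxRequestAvgMillis)
                    ? "OK" : "CHECK";
            System.out.printf("%s %s statusHA=%s backlog=%d requestAvg=%.1f ms%n",
                    verdict, name, statusHA, backlog, requestAvg);
        }
    }

    public static void main(String[] args) throws IOException, JMException {
        // Prints nothing unless this JVM is a Coherence member with management enabled.
        report(ManagementFactory.getPlatformMBeanServer(), 50.0);
    }
}
```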

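4.2.5. Monitoring Coherence Cluster Caches

4.2.6. Monitoring CPU and Memory on a Coherence Node

VisualVM can monitor the CPU and memory usage of an individual node and can trigger a manual GC.

4.3. Monitoring with JConsole

JConsole can monitor the CPU, memory, and process state of an individual Coherence node, and can trigger a manual GC.

In addition, JConsole's MBean view exposes more detail, which is where JConsole is stronger than VisualVM.

4.4. Monitoring Programmatically with JMX

Coherence can be managed through JMX: MBean data gives concise operational information about the cluster, which enables real-time monitoring and analysis. The Coherence JVisualVM plug-in provides much Coherence-related information, such as the cluster's Machines, Members, Services, and Caches.

The Coherence MBeans are listed below: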

CacheMBean	Represents a cache. A cluster member includes zero or more instances of this managed bean.
ClusterMBean	Represents a cluster. Each cluster member includes a single instance of this managed bean.
ClusterNodeMBean	Represents a cluster member. Each cluster member includes a single instance of this managed bean.
ConnectionManagerMBean	Represents an Oracle Coherence*Extend proxy. A cluster member includes zero or more instances of this managed bean.
ConnectionMBean	Represents a remote client connection through Oracle Coherence*Extend. A cluster member includes zero or more instances of this managed bean.
FlashJournalRM	Represents a flash journal resource manager. The managed bean is an instance of the JournalMBean interface. Each cluster member includes a single instance of this managed bean.
ManagementMBean	Represents the grid JMX infrastructure. Each cluster member includes a single instance of this managed bean.
PointToPointMBean	Represents the network status between two cluster members. Each cluster member includes a single instance of this managed bean.
RamJournalRM	Represents a RAM journal resource manager. The managed bean is an instance of the JournalMBean interface. Each cluster member includes a single instance of this managed bean.
ReporterMBean	Represents the Oracle Coherence reporter. Each cluster member includes a single instance of this managed bean.
ServiceMBean	Represents a clustered service. A cluster member includes zero or more instances of this managed bean.
StorageManagerMBean	Represents a storage instance for a storage-enabled distributed cache service. A cluster member includes zero or more instances of this managed bean.
TransactionManagerMBean	Represents a transaction manager. A cluster member includes zero or more instances of this managed bean.

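Each MBean has its own attributes, some read-only and some writable, which help with managing and monitoring Coherence. The attributes of several MBeans are listed below. For more information, see Oracle® Fusion Middleware Managing Oracle Coherence.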
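As a starting point for JMX programming, the registered Coherence MBeans can be enumerated and counted by type. The following is a minimal sketch that assumes the MBeans are registered under the `Coherence` JMX domain with a `type` key; it queries the local platform MBean server, so it lists nothing unless the JVM is a Coherence member with management enabled.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Counts the Coherence MBeans registered in an MBean server, grouped by type. */
class CoherenceMBeanInventory {

    /** Groups object names by their "type" key property and counts each group. */
    static Map<String, Integer> countByType(Set<ObjectName> names) {
        Map<String, Integer> counts = new TreeMap<>();
        for (ObjectName name : names) {
            String type = name.getKeyProperty("type");
            counts.merge(type == null ? "(none)" : type, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws MalformedObjectNameException {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // "Coherence:*" matches every MBean in the assumed Coherence domain.
        Set<ObjectName> names = server.queryNames(new ObjectName("Coherence:*"), null);
        countByType(names).forEach((type, count) -> System.out.println(type + ": " + count));
    }
}
```

Individual attributes can then be read with `getAttribute` on the object names returned by the query.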