[Linux] collectl, a system resource monitoring tool

逑衡

2023-12-01

Collectl is a lightweight performance monitoring tool that can monitor CPU, disk, network, memory, processes, and more.

Install collectl on the Linux host, for example: rpm -ivh collectl-xxx.rpm

I. Common collectl options

1. With no options, collectl displays CPU, disk, and network information.

[root@xxx ~]# collectl
waiting for 1 second sample...
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   0   0   374    935      0      0      0      0      0      1      0       1
   0   0   426    971      0      0      0      0      0      1      0       1
   0   0   503   1149      0      0      0      0      0      1      0       1
Ouch!
[root@xxx ~]#

Note: "Ouch!" appears where Ctrl+C was pressed to stop the monitoring output. The same applies to the outputs below.
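
Because header lines in the brief output start with '#', the samples are easy to post-process. A minimal sketch, assuming the 12-column layout shown above (ctxsw is the fourth column); the sample text is copied from the output above, and on a live system the same awk filter could be fed from collectl directly:

```shell
#!/bin/sh
# Sample lines copied from the collectl output above.
sample='#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   0   0   374    935      0      0      0      0      0      1      0       1
   0   0   426    971      0      0      0      0      0      1      0       1
   0   0   503   1149      0      0      0      0      0      1      0       1
Ouch!'

# Skip header lines (leading '#') and anything that is not a 12-column data row,
# then average column 4 (ctxsw).
avg_ctxsw=$(printf '%s\n' "$sample" |
    awk '!/^#/ && NF == 12 { sum += $4; n++ } END { if (n) printf "%.2f", sum / n }')
echo "average ctxsw: $avg_ctxsw"
```

The NF == 12 check is what drops the trailing "Ouch!" line. It would need adjusting if a different -s list changes the number of columns.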

2. The --all option displays statistics for every subsystem except slab

(Note: slab is a memory allocation mechanism in the Linux kernel)

[root@xxx ~]# collectl --all
waiting for 1 second sample...
#<----CPU[HYPER]-----><-----------------Int------------------><-----------------Memory-----------------><----------Disks-----------><----------Network----------><-------TCP--------><------Sockets-----><----Files---><------NFS Totals------>
#cpu sys inter  ctxsw Cpu0 Cpu1 Cpu2 Cpu3 Cpu4 Cpu5 Cpu6 Cpu7 Free Buff Cach Inac Slab  Map   Fragments KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut   IP  Tcp  Udp Icmp  Tcp  Udp  Raw Frag Handle Inodes  Reads Writes Meta Comm
   0   0   431   1013   78  102   74   64   32   17   36   27   2G  24K   1G 690M 366M  10G ttttssrk713      0      0      0      0      0      1      0       1    0    0    0    0 1439    0    0    0  14848 104610      0      0    0    0
   0   0   497   1188   85  115  100   55   30   32   49   33   2G  24K   1G 690M 366M  10G ttttssrk713      0      0    156     17      1     11      1      11    0    0    0    0 1439    0    0    0  14848 104611      0      0    0    0
Ouch!
[root@xxx ~]#

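The --ALL option is an extended version of --all that shows more detailed information, but it does not show the detailed TCP data (use -P or -f to see it).

3. The -s, --subsys subsystem option selects which subsystems to report

By default the c, d, and n subsystems are collected, so collectl and collectl -scdn produce the same output.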

SUMMARY SUBSYSTEMS

b - buddy info (memory fragmentation)
c - CPU
d - Disk
f - NFS V3 Data
i - Inode and File System
j - Interrupts
l - Lustre
m - Memory
n - Networks
s - Sockets
t - TCP
x - Interconnect
y - Slabs (system object caches)

DETAIL SUBSYSTEMS

Note: "Environmental" and "Process" below have no corresponding summary data (compare with SUMMARY SUBSYSTEMS above).

C - CPU
D - Disk
E - Environmental data (fan, power, temp), via ipmitool
F - NFS Data
J - Interrupts
L - Lustre OST detail OR client Filesystem detail
M - Memory node data, which is also known as numa data
N - Networks
T - 65 TCP counters only available in plot format
X - Interconnect
Y - Slabs (system object caches)
Z - Processes

(1) View CPU usage

[root@xxx ~]# collectl -sc
waiting for 1 second sample...
#<----CPU[HYPER]----->
#cpu sys inter  ctxsw
   0   0   453   1000
   0   0   587   1249
Ouch!

(2) View disk usage
-sd shows aggregate disk usage, -sD shows usage per disk, and -sdD shows both.

[root@xxx ~]# collectl -sd
waiting for 1 second sample...
#<----------Disks----------->
#KBRead  Reads KBWrit Writes
      0      0      0      0
Ouch!
[root@xxx ~]# collectl -sD
waiting for 1 second sample...

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sda              0      0    0    0       0      0    0    0       0     0     0      0    0
dm-0             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-1             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-2             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-3             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-4             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-5             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-6             0      0    0    0       0      0    0    0       0     0     0      0    0
Ouch!
[root@xxx ~]# collectl -sdD
waiting for 1 second sample...

### RECORD    1 >>> xxx <<< (1578897244.001) (Mon Jan 13 14:34:04 2020) ###

# DISK SUMMARY (/sec)
#KBRead RMerged  Reads SizeKB  KBWrite WMerged Writes SizeKB
      0       0      0      0        0       0      0      0

# DISK STATISTICS (/sec)
#          <---------reads---------><---------writes---------><--------averages--------> Pct
#Name       KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  Wait SvcTim Util
sda              0      0    0    0       0      0    0    0       0     0     0      0    0
dm-0             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-1             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-2             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-3             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-4             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-5             0      0    0    0       0      0    0    0       0     0     0      0    0
dm-6             0      0    0    0       0      0    0    0       0     0     0      0    0
Ouch!
[root@xxx ~]#

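Note: dm-0, dm-1, and so on (dm = device mapper) are logical volumes (LVs) created by the Logical Volume Manager (LVM) when the machine was set up.
(By contrast, names such as sda1 and sda2, common elsewhere, are partitions of the physical disk sda attached to the machine.)

LVM maps each LV to a /dev/dm-x device node. This node is not a real disk, so it has no partition table, and the dm-x name by itself does not identify which LV it is.
@ The iostat -d command shows I/O statistics per device (averages since boot when no interval is given)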

[root@xxx ~]# iostat -d
Linux 3.10.0-1062.1.2.el7.x86_64 (xoam86)       01/13/2020      _x86_64_        (8 CPU)

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               7.87         1.97       108.20    8187843  449198326
dm-0              0.73         1.40        13.43    5820775   55757200
dm-1              0.11         0.07         0.37     297772    1551708
dm-2              0.01         0.00         0.12       4653     490694
dm-3              0.09         0.45         3.33    1873778   13830625
dm-4              4.98         0.01        75.11      58662  311807765
dm-5              0.00         0.00         0.00       2148       3028
dm-6              0.38         0.00        14.00       2259   58118824

[root@xxx  ~]#

Note: an interval can be given, for example iostat -d 2 prints every 2 seconds.

@ The dmsetup ls and dmsetup info (more detailed) commands show how the dm devices are mapped

[root@xxx ~]# dmsetup ls
rhel-home       (253:2)
rhel-swap       (253:1)
rhel-root       (253:0)
rhel-opt        (253:3)
……
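
The minor number in each (major:minor) pair is the N in dm-N, so this listing can be turned into a table from dm-N to LV name. A rough sketch that parses the sample lines above; it assumes the "(253:2)" format shown here, which may differ between dmsetup versions. On a live system, dmsetup ls could be piped into the same awk command:

```shell
#!/bin/sh
# Sample lines copied from the dmsetup ls output above.
dm_list='rhel-home       (253:2)
rhel-swap       (253:1)
rhel-root       (253:0)
rhel-opt        (253:3)'

# Field 2 looks like "(253:2)"; strip everything up to ':' and the trailing ')'.
dm_map=$(printf '%s\n' "$dm_list" |
    awk '{ minor = $2; sub(/^.*:/, "", minor); sub(/\)$/, "", minor); print "dm-" minor, $1 }' |
    sort)
printf '%s\n' "$dm_map"
```

This makes it possible to read the dm-x rows in the collectl -sD and iostat -d output as rhel-root, rhel-swap, and so on.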

(3) View network usage

[root@xxx ~]# collectl -sn
waiting for 1 second sample...
#<----------Network---------->
#  KBIn  PktIn  KBOut  PktOut
      0      1      0       1
      0      1      0       1
Ouch!

(4) View memory usage

[root@xxx ~]# collectl -sm
waiting for 1 second sample...
#<-----------Memory----------->
#Free Buff Cach Inac Slab  Map
   2G  24K   1G 673M 367M  10G
   2G  24K   1G 673M 367M  10G
Ouch!

(5) View TCP data

[root@xxx ~]# collectl -st
waiting for 1 second sample...
#<-------TCP-------->
#  IP  Tcp  Udp Icmp
    0    0    0    0
    0    0    0    0
    0    0    0    0
Ouch!

(6) collectl -sZ shows processes, sampled every 60 seconds by default

[root@xxx ~]# collectl -sZ
waiting for 60 second sample...

### RECORD    1 >>> xxx <<< (1578897643.002) (Mon Jan 13 14:40:43 2020) ###

# PROCESS SUMMARY (counters are /sec)
# PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT  UsrT Pct  AccuTime  RKB  WKB MajF MinF Command
    1  root     20     0    0 S  189M    5M  0  0.02  0.01   0  10:33.90    0    0    0    1 /usr/lib/systemd/systemd
    2  root     20     0    0 S     0     0  6  0.00  0.00   0  00:04.31    0    0    0    0 kthreadd
……

4. The -P option outputs data in plot format

(1) Without -f, the output is printed to the screen

[root@xxx tmp]# collectl -st -P
waiting for 1 second sample...
#Date Time [TCP]IpErr [TCP]TcpErr [TCP]UdpErr [TCP]IcmpErr [TCP]Loss [TCP]FTrans
20200113 18:07:05 0 0 0 0 0 0
20200113 18:07:06 0 0 0 0 0 0
20200113 18:07:07 0 0 0 0 0 0
Ouch!

(2) With -f, the output is written under /tmp (a file named from the hostname and a timestamp is created automatically)

[root@xxx tmp]# collectl -st -P -f /tmp
Ouch!
[root@xxx tmp]# vim xxx-20200113.tab.gz

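Note: cat prints unreadable bytes because the file is gzip-compressed, while vim decompresses .gz files transparently and shows readable text. The contents are as follows.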

################################################################################
# Collectl:   V4.3.0-1  HiRes: 1  Options: -st -P -f /tmp
# Host:       xxx  DaemonOpts:
# (omitted...)
################################################################################
#Date Time [TCP]IpErr [TCP]TcpErr [TCP]UdpErr [TCP]IcmpErr [TCP]Loss [TCP]FTrans
20200113 18:07:23 0 0 0 0 0 0
20200113 18:07:24 0 0 0 0 0 0
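
The .gz suffix indicates a gzip-compressed file, so it can also be read on the command line with gzip -dc (or zcat) instead of an editor. A small self-contained sketch that builds a stand-in file and reads it back; the file name and its two lines are illustrative, not real collectl output:

```shell
#!/bin/sh
# Build a small gzip file standing in for a collectl .tab.gz file.
workdir=$(mktemp -d)
printf '%s\n' '#Date Time [TCP]IpErr' '20200113 18:07:23 0' | gzip > "$workdir/sample.tab.gz"

# cat would dump the compressed bytes; gzip -dc decompresses to stdout.
decoded=$(gzip -dc "$workdir/sample.tab.gz")
printf '%s\n' "$decoded"

# Clean up the temporary directory.
rm -rf "$workdir"
```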

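(3) Use --sep separator to set the delimiter
The default delimiter is a space. --sep accepts either the separator character or its ASCII code ("--sep :" and "--sep 58" both mean a colon), so the following two commands are equivalent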
collectl -st -P -f /tmp --sep ,
collectl -st -P -f /tmp --sep 44
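
The character-to-code mapping used by --sep can be checked in the shell. A minimal sketch; ascii_code is a hypothetical helper name, not part of collectl:

```shell
#!/bin/sh
# ascii_code CHAR -> decimal character code, e.g. for use with --sep
ascii_code() {
    # A leading quote makes printf print the numeric value of the next character.
    printf '%d\n' "'$1"
}

ascii_code ','   # prints 44
ascii_code ':'   # prints 58
```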

(4) Use -i, --interval interval[:interval2[:interval3]] to set the sampling interval
This is the sampling interval in seconds. The default is 10 seconds when run as a daemon and 1 second otherwise.
The process subsystem and slabs (-sY and -sZ) are sampled at the lower rate of interval2.
Environmentals(-sE), which only apply to a subset of hardware, are sampled at interval3.
Both interval2 and interval3, if specified, must be an even multiple of interval1.
The daemon default is -i10:60:300 and all other modes are -i1:60:300.
To sample only processes once every 10 seconds use -i:10.
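
The multiple-of-interval1 rule can be checked before starting a long collection. A sketch only: it reads "even multiple" as "whole-number multiple", expects all three fields to be present, and check_intervals is a hypothetical helper, not a collectl feature:

```shell
#!/bin/sh
# check_intervals I1:I2:I3
# Prints "ok" when I2 and I3 are whole multiples of I1, otherwise names the bad field.
check_intervals() {
    i1=${1%%:*}          # text before the first ':'
    rest=${1#*:}         # text after the first ':'
    i2=${rest%%:*}
    i3=${rest#*:}
    if [ $((i2 % i1)) -ne 0 ]; then
        echo "interval2 ($i2) is not a multiple of interval1 ($i1)"
    elif [ $((i3 % i1)) -ne 0 ]; then
        echo "interval3 ($i3) is not a multiple of interval1 ($i1)"
    else
        echo "ok"
    fi
}

check_intervals 10:60:300   # the daemon default
check_intervals 10:45:300   # 45 is not a multiple of 10
```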

5. The -f option saves data to a file; add -P to save it in plot format

[root@xxx tmp]# collectl -st -f /tmp
[root@xxx tmp]# vim xxx-20200113-181330.raw.gz

The contents are as follows

################################################################################
# Collectl:   V4.3.0-1  HiRes: 1  Options: -st -f /tmp
# Host:       xxx  DaemonOpts:
# (omitted...)
################################################################################
>>> 1578910411.001 <<<
tcp-Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
tcp-Ip: 1 64 204761492 0 0 0 0 0 204760068 207503242 119 48 0 0 0 0 0 0 0
tcp-Icmp: InMsgs InErrors InCsumErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps OutTimestampReps OutAddrMasks OutAddrMaskReps
tcp-Icmp: 1654 36 0 1613 3 0 0 0 28 10 0 0 0 0 1998 0 1573 0 0 0 0 397 28 0 0 0 0
tcp-Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts InCsumErrors
tcp-Tcp: 1 200 120000 -1 2529296 1004890 1434620 34627 232 213188012 216569679 11079 2 1446991 0
tcp-Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
tcp-Udp: 158408 1484 0 155387 0 0 0
>>> 1578910412.002 <<<
……
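
The numbers between >>> and <<< appear to be Unix epoch timestamps with millisecond fractions (the RECORD headers elsewhere in this article pair the same kind of number with a wall-clock time). A short sketch for converting one by hand, assuming GNU date, whose -d @N form is not available in every date implementation:

```shell
#!/bin/sh
# A record marker as it appears in the raw file above.
marker='>>> 1578910411.001 <<<'

# Keep the integer seconds between ">>> " and the decimal point.
epoch=$(printf '%s\n' "$marker" | sed 's/^>>> \([0-9]*\)\..*$/\1/')

# GNU date: -d @N interprets N as seconds since the epoch; -u prints UTC.
utc_time=$(date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S')
echo "$epoch -> $utc_time UTC"
```

Dropping -u prints the time in the local time zone of the machine running the command.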

6. The -p, --playback Filename option reads data from a file

collectl has a record mode and a playback mode. The examples above all use record mode.

Record Mode - read data from live system and write to file or display on terminal

collectl [-f file] [options]

Playback Mode - read data from one or more raw data files and display on terminal

collectl -p file1 [file2 ...] [options]

Playback mode requires the -p option and reads data from the specified file(s). File names must end in raw or raw.gz.
[Example] collectl -p xxx-20200114-000100.raw.gz

[root@xxx collectl]# collectl -p xxx-20200114-000100.raw.gz

### RECORD    1 >>> xxx <<< (1578931270.002) (Tue Jan 14 00:01:10 2020) ###

# CPU[HYPER] SUMMARY (INTR, CTXSW & PROC /sec)
# User  Nice   Sys  Wait   IRQ  Soft Steal Guest NiceG  Idle  CPUs  Intr  Ctxsw  Proc  RunQ   Run   Avg1  Avg5 Avg15 RunT BlkT
     0     0     0     0     0     0     0     0     0    99     8   700   1555    14  2032     0   0.03  0.05  0.05    0    0
……

II. Running as a daemon

To run collectl as a daemon, edit the following line in the configuration file /etc/collectl.conf
#DaemonCommands = -f /var/log/collectl -r00:00,7 -m -F60 -s+YZ
DaemonCommands = -f /var/log/collectl -r00:05,7 -m -F60 -s+Zms --procfilt cjava,cmysql

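Notes:
-f /var/log/collectl stores each day's raw file under /var/log/collectl
-r00:05,7 rolls to a new file at 00:05 each day and keeps 7 days of files
-m writes status messages to a monthly log file
-F60 flushes the output buffer every 60 seconds
-s+Zms collects processes (Z), memory (m), and sockets (s) in addition to the default cdn (CPU, disk, and network)
--procfilt cjava,cmysql filters the process data down to specific processes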
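
Since the line beginning with # is a comment, only the uncommented DaemonCommands line takes effect. One way to confirm which options the daemon would use, sketched here against a temporary stand-in file rather than the real /etc/collectl.conf:

```shell
#!/bin/sh
# Stand-in for /etc/collectl.conf containing the two lines shown above.
conf=$(mktemp)
cat > "$conf" <<'EOF'
#DaemonCommands = -f /var/log/collectl -r00:00,7 -m -F60 -s+YZ
DaemonCommands = -f /var/log/collectl -r00:05,7 -m -F60 -s+Zms --procfilt cjava,cmysql
EOF

# Lines starting with '#' do not match; print only the active setting's value.
daemon_opts=$(sed -n 's/^DaemonCommands[[:space:]]*=[[:space:]]*//p' "$conf")
echo "$daemon_opts"

# Clean up the temporary file.
rm -f "$conf"
```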

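Below is a brief summary, taken from man collectl, of the options used above:
(1) -r, --rolllogs time[[,days[:months]][,minutes]]
With this option set, collectl runs indefinitely (or at least until the system reboots). The maximum number of raw and plot files retained (older files are deleted automatically) is set by the days field, which defaults to 7.
If -m is also specified, which tells collectl to write messages to a log file in the logging directory, the number of months of those logs to retain is set by the months field, which defaults to 12.
The increment is optional as well. It is set by the minutes field, which gives the duration of a single collection file in minutes and defaults to 1440 (1 day).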

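(2) --procfilt Process Filters
These filters restrict which processes are selected for collection and display. When they are used, collectl builds a blacklist, which significantly reduces the overhead of collecting process data.
A filter consists of one of the following letters followed by the string to match. Separate multiple filters with commas.
c - any substring of the command being executed, as read from /proc/<pid>/stat; may be a perl expression.
C - any command that starts with the specified string
f - the full path of the command, including arguments, as read from /proc/<pid>/cmdline; may be a perl expression.
p - pid
P - parent pid
u - any process owned by this user's UID, or by a UID in the range given as uxxx-yyy
U - any process owned by this username
Note: collectl tries to match the c and C filters against the second field of /proc/<pid>/stat. If needed, check that field to confirm the match is the expected one.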
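
To see what the c and C filters would be compared against, the second field of /proc/<pid>/stat can be printed directly. A minimal sketch for Linux systems with /proc mounted; stat_comm is a hypothetical helper name:

```shell
#!/bin/sh
# Print the command name (second field, inside parentheses) from /proc/<pid>/stat.
# Requires a Linux /proc filesystem.
stat_comm() {
    sed 's/^[0-9]* (\(.*\)) .*$/\1/' "/proc/$1/stat"
}

# $$ is the PID of the current shell.
self_comm=$(stat_comm "$$")
echo "pid $$ -> $self_comm"
```

Running it against the PID of a mysqld or java process shows the short name that a filter such as cmysql or cjava has to match.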

[root@xxx ~]# collectl -sZ --procfilt cmysql
waiting for 60 second sample...

# PROCESS SUMMARY (counters are /sec)
# PID  User     PR  PPID THRD S   VSZ   RSS CP  SysT  UsrT Pct  AccuTime  RKB  WKB MajF MinF Command
20924  dbuser   20     1  102 S    6G  923M  3  0.04  0.08   0  02:17:19    0    5    0    0 /opt/mysql/bin/mysqld
20924  dbuser   20     1  102 S    6G  923M  3  0.12  0.27   0  02:17:20    0  197    0    0 /opt/mysql/bin/mysqld
^COuch!
[root@xxx ~]#

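(3) -F, --flush seconds
Flush the output buffers after this number of seconds. If 0, a flush occurs at every data collection interval.

(4) -m, --messages
Write status messages to a monthly log file in the same directory as the output file (this also requires -f). The file is named collectl-yyyymm.log.

(5) -s, --subsys subsystem
Use + or - to add subsystems to, or remove them from, the defaults.
For example, "-s-cdn+N" removes CPU, disk, and network monitoring from the defaults while adding network detail.
SUMMARY SUBSYSTEMS include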
c - CPU
d - Disk
m - Memory
n - Networks
s - Sockets
t - TCP
……

DETAIL SUBSYSTEMS include
N - Networks
Y - Slabs (system object caches)
Z - Processes
……

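The raw files saved under /var/log/collectl/ look like the following. Today's file is still being written, and there are 7 completed files (each last modified at 00:05 the next day).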

[root@xxx ~]# cd /var/log/collectl/
[root@xxx collectl]# ls -l
total 75492
-rw-r--r--. 1 root root 10083493 Jan  8 00:05 xxx-20200107-000100.raw.gz
-rw-r--r--. 1 root root 10086579 Jan  9 00:05 xxx-20200108-000100.raw.gz
-rw-r--r--. 1 root root 10128225 Jan 10 00:05 xxx-20200109-000100.raw.gz
-rw-r--r--. 1 root root 10185719 Jan 11 00:05 xxx-20200110-000100.raw.gz
-rw-r--r--. 1 root root 10249894 Jan 12 00:05 xxx-20200111-000100.raw.gz
-rw-r--r--. 1 root root 10261303 Jan 13 00:05 xxx-20200112-000100.raw.gz
-rw-r--r--. 1 root root 10187449 Jan 14 00:05 xxx-20200113-000100.raw.gz
-rw-r--r--. 1 root root  4463579 Jan 14 10:32 xxx-20200114-000100.raw.gz
-rw-r--r--. 1 root root      341 Nov 30 00:05 xxx-collectl-201911.log
-rw-r--r--. 1 root root      992 Dec 31 00:05 xxx-collectl-201912.log
-rw-r--r--. 1 root root      448 Jan 14 00:05 xxx-collectl-202001.log
[root@xxx collectl]#

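III. Extracting selected statistics from raw.gz files into a CSV file

Read the data from a raw.gz file with -p, convert it to plot format with -P, and then use cut to pull out the columns of interest. A script can do all of this in one step.

[Example] The contents of collectl_system_data.sh are as follows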

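#!/bin/sh
# Extract selected columns from collectl raw files into one CSV file.

hostname=$(hostname)
timestamp=$(date +%Y%m%d%H%M)
time=00:00:00-23:59:59
output_file_name="/srv/data/system_data-$hostname-$timestamp.csv"

echo "Date,Time,[CPU]Totl%,[MEM]Used(GB),[NET]RxMBTot,[NET]TxMBTot" > "$output_file_name"
for file in "$@"
do
    # Play back the raw file in plot format, drop header lines, and keep
    # Date, Time, CPU total, memory used, and network Rx/Tx totals.
    # After cut these are fields 1-6. The divisions assume the memory and
    # network values are in KB (converted to GB and MB respectively).
    # The column numbers passed to cut depend on the collectl version and
    # the -s list, so check them against the plot header line.
    collectl -p "$file" -scmnd -P --sep , --from "$time" | grep -v '^#' | cut -d"," -f1-2,11,24,28-29 | awk 'BEGIN{ FS=OFS="," }{ printf "%s,%s,%d,%.2f,%.2f,%.2f\n", $1,$2,$3,$4/1024/1024,$5/1024,$6/1024 }' >> "$output_file_name"
done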

Run: sh collectl_system_data.sh <path of filename.raw.gz>

[root@xxx tmp]# ./collectl_system_data.sh /tmp/xxx-20200113-180222.raw.gz
[root@xxx tmp]# cat system_record-xxx-202001131804.csv
Date,Time,[CPU]Totl%,[MEM]Used(GB),[NET]RxMBTot,[NET]TxMBTot,[DSK]ReadMBTot,[DSK]WriteMBTot
20200113,18:02:24,0,0.00,0.00,0.00
20200113,18:02:25,1,0.00,0.00,0.00
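
The column numbers passed to cut have to match the plot header of the collectl version in use. One way to look them up is to number the fields of the header line. The sketch below does this for the -st -P header shown earlier in this article; number_columns is a hypothetical helper, and for the script above the header would come from the collectl -p ... -P --sep , output instead:

```shell
#!/bin/sh
# Header line copied from the -st -P output earlier in this article.
header='#Date Time [TCP]IpErr [TCP]TcpErr [TCP]UdpErr [TCP]IcmpErr [TCP]Loss [TCP]FTrans'

# Print "index name" for every column; pass the same separator that --sep used.
number_columns() {
    printf '%s\n' "$1" | awk -v sep="$2" 'BEGIN { FS = sep } { for (i = 1; i <= NF; i++) print i, $i }'
}

columns=$(number_columns "$header" ' ')
printf '%s\n' "$columns"
```

The printed index is the value to use with cut -f or as $N in awk.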

=====================================================

References for common collectl options:
Collectl: Linux 性能监控的全能冠军 https://linux.cn/article-3154-1.html
Collectl 监控进程和产生plot格式文件 https://blog.csdn.net/guoguangwu/article/details/100084356

References for other server performance monitoring tools:
25个Linux性能监控工具 https://blog.csdn.net/hdyrz/article/details/75452499
Linux CPU占用率监控工具小结 https://www.cnblogs.com/arnoldlu/p/9462221.html
Linux vmstat命令详解 https://www.cnblogs.com/ftl1012/p/vmstat.html
Linux iostat命令详解 https://www.cnblogs.com/ftl1012/p/iostat.html
Linux netstat命令详解 https://www.cnblogs.com/ftl1012/p/netstat.html