Performance Analysis with pmu-tools

穆城

2023-12-01

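What is pmu-tools

pmu-tools is a Linux toolkit built on hardware performance counters for performance analysis and debugging. It can measure hardware events (such as retired CPU instructions and cache accesses) as well as process-level and system-level performance data.
The project is hosted on GitHub: pmu-tools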

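Installing and using pmu-tools

Installing pmu-tools

pmu-tools does not need to be installed through apt. Clone the repository with git, then add the resulting /path/to/pmu-tools directory to PATH.
The recommended way is to edit PATH in ~/.bashrc.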

vim ~/.bashrc

Append the pmu-tools directory to PATH (adjust the path to wherever you cloned the repository):

export PATH=$PATH:/opt/riscv/bin:/home/chentaowu/chentaowu/c++/for_csdn/pmu-tools

Using pmu-tools

Changing system settings

The README on GitHub recommends setting kernel.perf_event_paranoid to -1.
The command given there is:

sysctl -p 'kernel.perf_event_paranoid=-1'

When I ran that command, however, it failed with the following error:

sysctl: cannot open "kernel.perf_event_paranoid=-1": No such file or directory

The sysctl help describes the -p option as follows:

-p, --load[=<file>]  read values from file

kernel.perf_event_paranoid=-1 is a key=value setting, not a file, so -p is the wrong option here. The -w option (write a value) is the one to use:

sudo sysctl -w kernel.perf_event_paranoid=-1

This sets kernel.perf_event_paranoid to -1.
Next, disable the NMI watchdog:

sudo sysctl -w  kernel.nmi_watchdog=0

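This sets kernel.nmi_watchdog to 0, which frees the hardware counter the watchdog would otherwise occupy. Values written with -w do not persist across a reboot.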
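
To confirm that both settings took effect, the values can be read back from /proc/sys, where every sysctl key appears as a file with the dots replaced by slashes. The following is a minimal C++17 sketch of such a check. The function names (sysctl_key_to_path, read_sysctl, toplev_sysctls_ready) are hypothetical helpers written for this article and are not part of pmu-tools.

```cpp
#include <fstream>
#include <optional>
#include <string>

// Map a sysctl key such as "kernel.perf_event_paranoid" to its /proc/sys file.
std::string sysctl_key_to_path(const std::string &key) {
  std::string path = "/proc/sys/";
  for (char c : key) {
    path += (c == '.') ? '/' : c;  // dots in the key become directory separators
  }
  return path;
}

// Read an integer sysctl value. Returns std::nullopt when the file does not
// exist or does not start with an integer.
std::optional<long> read_sysctl(const std::string &key) {
  std::ifstream in(sysctl_key_to_path(key));
  long value = 0;
  if (in >> value) {
    return value;
  }
  return std::nullopt;
}

// True when both values match what this article sets before running toplev.
bool toplev_sysctls_ready() {
  return read_sysctl("kernel.perf_event_paranoid") == std::optional<long>(-1) &&
         read_sysctl("kernel.nmi_watchdog") == std::optional<long>(0);
}
```

Calling toplev_sysctls_ready() from a small main() after the two sysctl -w commands should return true on a Linux machine. If it returns false, one of the values was not applied or has been reset.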

Checking that toplev runs

Run the following command:

toplev -l1  bash -c 'echo "7^199999" | bc > /dev/null'

If toplev runs correctly, it prints output similar to this:

Will measure complete system.
# 4.5-full-perf on Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz [skl/skylake]
C0    BAD            Bad_Speculation  % Slots                      32.5   [50.1%]<==
        This category represents fraction of slots wasted due to
        incorrect speculations...
C0-T0 MUX                             %                            50.07 
        PerfMon Event Multiplexing accuracy indicator
C0-T1 MUX                             %                            50.07 
Run toplev --describe Bad_Speculation^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Bad_Speculation*/2,+MUX' for breakdown.
Idle CPUs 1-7,9-15 may have been hidden. Override with --idle-threshold 10

This confirms that pmu-tools is installed and working.

Measuring a program with toplev

The test code is as follows:

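#define CACHE_LINE __attribute__((aligned(64)))
// Each S1 is padded and aligned to 64 bytes, a typical cache-line size.
struct S1 {
  int r1;
  int r2;
  int r3;
  S1() : r1(1), r2(2), r3(3) {}
} CACHE_LINE;

// Add every field of smember[0 .. members-1] into total.
void add(const S1 smember[], int members, long &total) {
  // Walk from the last valid index down to 0. Starting at smember[members]
  // would read one element past the end and never visit element 0.
  for (int idx = members - 1; idx >= 0; --idx) {
    total += smember[idx].r1;
    total += smember[idx].r2;
    total += smember[idx].r3;
  }
}
int main(int argc, char *argv[]) {
  const int SIZE = 204800;
  // new[] runs the S1 constructor for every element (malloc would leave the
  // fields uninitialized) and, from C++17 on, honors the 64-byte alignment.
  S1 *smember = new S1[SIZE];
  long total = 0L;
  int loop = 10000;
  while (--loop) { // repeat the pass so runs are long enough to compare
    add(smember, SIZE, total);
  }
  delete[] smember;
  return 0;
}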
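
The CACHE_LINE attribute is what makes this program worth comparing: it pads each element to a full cache line. The sketch below shows the effect on element size and total array footprint. PackedS1, AlignedS1, and footprint_bytes are hypothetical names introduced only for this illustration, the constructor is omitted for brevity, and a 64-byte cache line is assumed.

```cpp
#include <cstddef>

// Same three fields as S1, without the alignment attribute.
struct PackedS1 {
  int r1, r2, r3;
};

// Same three fields, padded and aligned to an assumed 64-byte cache line.
struct alignas(64) AlignedS1 {
  int r1, r2, r3;
};

// Total bytes occupied by an array of `count` elements of type T.
template <typename T>
constexpr std::size_t footprint_bytes(std::size_t count) {
  return sizeof(T) * count;
}

// The aligned variant places exactly one element on each 64-byte line.
static_assert(alignof(AlignedS1) == 64, "alignment follows alignas(64)");
static_assert(sizeof(AlignedS1) == 64, "size is rounded up to the alignment");
```

With a 4-byte int, PackedS1 is 12 bytes, so the 204800-element array shrinks from footprint_bytes<AlignedS1>(204800) = 13107200 bytes to footprint_bytes<PackedS1>(204800) = 2457600 bytes. Removing CACHE_LINE from the test program and rerunning toplev is one way to see how that difference shows up in the metrics.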

Save the code as cache_line.cpp, then compile it and run it pinned to CPU 1:

g++ cache_line.cpp -o cache_line ; taskset -c 1 ./cache_line

Then profile it with:

toplev -l3 --single-thread --force-events -D8 ./cache_line

which produces the following output:

# 4.5-full-perf on Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz [skl/skylake]
BE             Backend_Bound                                % Slots                      29.8   [ 4.0%]
BE/Core        Backend_Bound.Core_Bound                     % Slots                      28.7   [ 4.0%]
RET            Retiring.Light_Operations                    % Slots                      66.5   [ 4.0%]
BE/Core        Backend_Bound.Core_Bound.Ports_Utilization   % Clocks                     24.0   [ 4.0%]<==
        This metric estimates fraction of cycles the CPU performance
        was potentially limited due to Core computation issues (non
        divider-related)...
RET            Retiring.Light_Operations.Memory_Operations  % Slots                      34.0   [ 4.0%]
        This metric represents fraction of slots where the CPU was
        retiring memory operations -- uops for memory load or store
        accesses...
MUX                                                         %                             3.99 
        PerfMon Event Multiplexing accuracy indicator

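Some of the values in this output are explained briefly below.

BE: Backend_Bound

The back end of the pipeline does the following:
1. Accepts the micro-ops (uops) delivered by the front end
2. Reorders those uops when necessary
3. Fetches the operands the instructions need from memory
4. Executes the uops and commits the results to memory
Backend_Bound is the fraction of pipeline slots in which no uops were delivered to the back end because the back end lacked a resource it needed to accept them.
In general, the lower the Backend_Bound percentage, the better the program performs.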

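RET: Retiring

Retiring is the fraction of pipeline slots that ran useful uops, that is, uops[3] that eventually retire. (A uop ends in one of two ways: it is discarded, or it retires and its result is written back to a register.) It gives a fairly realistic picture of how effectively the program uses the CPU. Ideally every pipeline slot would be Retiring. 100% Retiring means the number of uops retired per cycle is at its maximum, and pushing Retiring as high as possible raises the instructions per cycle (IPC).
In general, the higher the Retiring percentage, the better the program performs.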

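In the toplev output for the test program, the RET row (Retiring.Light_Operations) is 66.5% and the BE row (Backend_Bound) is 29.8%, which suggests the program still has room for optimization.

That concludes this brief introduction to pmu-tools. Later posts will go deeper into using pmu-tools for finer-grained performance analysis and into improving program performance based on the results.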