perf 常见使用方法

师腾

2023-12-01

perf的使用场景

分析CPU cache、CPU迁移，分支预测、指令周期等各种硬件事件。
寻找程序运行过程中的热点函数，做性能分析，从而定位性能问题，具体的做法是对事件进行采样，然后再根据采样数，评估调用流程中各个函数的调用频率。
自定义追踪感兴趣的事件或者是函数。

限制：并不是所有/proc/kallsyms 内核符号中的函数都可以被跟踪！！!

perf list

列出所有能够出发Perf采样点的事件，类似于ftrace中/sys/kernel/debug/tracing/available_events中包含的事件，但Perf支持的事件比ftrace要多。

perf trace

类似于strace跟踪进程系统调用，相较于strace有更好的性能。

$ perf strace ls
         ? (         ): ls/7996  ... [continued]: execve())                                           = 0
     0.078 ( 0.004 ms): ls/7996 brk()                                                                 = 0x556e616e7000
     0.090 ( 0.003 ms): ls/7996 arch_prctl(option: 0x3001, arg2: 0x7ffd2d59b9f0)                      = -1 EINVAL (Invalid argument)
     0.130 ( 0.012 ms): ls/7996 access(filename: 0x4895c9e0, mode: R)                                 = -1 ENOENT (No such file or directory)
     0.154 ( 0.010 ms): ls/7996 openat(dfd: CWD, filename: 0x48959b80, flags: RDONLY|CLOEXEC)         = 3
kernel-hook-framework_4.x  text_poke  yin
     0.168 ( 0.005 ms): ls/7996 fstat(fd: 3, statbuf: 0x7ffd2d59abf0)                                 = 0
     0.177 ( 0.010 ms): ls/7996 mmap(len: 75723, prot: READ, flags: PRIVATE, fd: 3)                   = 0x7fae48921000
     0.190 ( 0.003 ms): ls/7996 close(fd: 3)                                                          = 0
     0.209 ( 0.007 ms): ls/7996 openat(dfd: CWD, filename: 0x48963e10, flags: RDONLY|CLOEXEC)         = 3
     0.221 ( 0.006 ms): ls/7996 read(fd: 3, buf: 0x7ffd2d59ad98, count: 832)                          = 832
     0.231 ( 0.003 ms): ls/7996 fstat(fd: 3, statbuf: 0x7ffd2d59ac40)                                 = 0
....

perf start

运行命令并收集性能统计信息。

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	printf("hello, pid=%d\n", getpid());
	while(1) {
		printf("begin sleep...\n");
		sleep(1);
	}
}
// gcc main.c -o main

$ perf stat ./main
hello, pid=8119
begin sleep...
begin sleep...
begin sleep...
^C./main: Interrupt

 Performance counter stats for './main':

              0.84 msec task-clock                #    0.000 CPUs utilized          
                 3      context-switches          #    3.576 K/sec                  
                 1      cpu-migrations            #    1.192 K/sec                  
                54      page-faults               #   64.373 K/sec                  
         1,364,290      cycles                    #    1.626 GHz                    
           848,212      instructions              #    0.62  insn per cycle         
           172,372      branches                  #  205.482 M/sec                  
             8,017      branch-misses             #    4.65% of all branches        

       3.000841844 seconds time elapsed

       0.001547000 seconds user
       0.000000000 seconds sys
       
# task-clock:用于执行程序的cpu时间；
# context-switches：程序在运行过程中经历的上下文切换次数；
# page-faults：进程运行过程中产生的缺页次数；
# cpu-migrations：程序在运行过程中发生的CPU迁移次数，即被调度器从一个CPU转移到另外一个CPU上运行；
# instructions：该进程在这段时间内完成的CPU指令数；
# cycles：CPU时钟周期；

perf record

运行命令并将产生的数据写入perf.data文件。

# -p $pid 记录进程pid的events
# -a 采集所有cpu上的events
# -e $event 指定PMU（处理器监控单元）event，默认是cycles:app（cpu周期数）
# -g 启动堆栈/栈回溯功能
# -F $freq	采用频率
# -o $path 指定采样文件输出路径

#include <iostream>
#include <math.h>
#include <sys/types.h>
#include <unistd.h>

using namespace std;

void for_loop()
{
        for (int i = 0; i < 1000; i++) {
                for (int j = 0; j < 10000; j++) {
                        int x = sin(i) + cos(j);
                }
        }
}

void loop_small()
{
        for (int i = 0; i < 10; i++) {
                for_loop();
        }
}

void loop_big()
{
        for (int i = 0; i < 100; i++) {
                for_loop();
        }
}

int main(void)
{
        printf("pid=%d\n", getpid());
        loop_big();
        loop_small();

        return 0;

}

// g++ test.c -o test

# 使用示例
# 采样时间10,采样频率99，可以采集到约1000个事件
$ perf record -p 10086 -a -g -F 100 -- sleep 10
$ perf record -p 10086 -a -g -F 1000 -- sleep 10
$ perf record -g -e cpu-clock ./test

perf report

对perf record采样的数据进行分析，可视化显示。

--no-childern：不统计chiledren开销
	Self：Self记录的是最后一列符号（可以理解为函数）本身采样数占总采样数的百分比，从而找到热点函数
	Children：记录的是这个符号调用的其他符号（理解为子函数，包括间接调用和直接调用）的采样数之和占总采样数的百分比，
	从而找到较高层热点函数

perf script

从perf.data读取数据并显示详细的采样信息，一般有多少个时间，就有多少条记录。

perf kmem

跟踪/测量内核内存属性

record：记录kmem events
–slab：记录slab申请器的events
–page：记录page申请器的events
stat：报告内核内存统计信息
–slab：记录slab申请器的events
–page：记录page申请器的events

perf mem

分析内存访问

perf lock

分析锁的性能

perf kvm

针对kvm虚拟化分析

perf sched

分析内核调度器性能

# record：采集和记录scheduling events
$ perf sched record --sleep 10
$ pref sched record -p 10086 --sleep 10

# script：报告采集到的事件
# latency：报告每个任务的调度延迟和进程的其他调度事件
# timehist：提供调度事件分析报告
时间格式：msec.usec

    wait time
    time between sched-out and next sched-in events for the task;

    task scheduling delay
    time between wakeup and actually running;

    run time
    run time for the task;

perf probe

用于自定义添加函数的probe点，可以用perf list来看支持的tracepoint，自定义添加probe点会自动去找内核符号表中的函数，来判断函数是否存在，如果不存在，在添加probe的时候会报错，在内核符号表中找不到需要添加的符号信息，这种方法相较于Kprobe来说免去了驱动编写的过程，提高了效率。

# 查看所有以及注册的probe函数
$ perf probe --list

# 添加自定义probe函数，只能跟踪内核或者模块中导出的符号（T）
$ perf probe --add submit_bio

# 追踪submit_bio probe事件
# -g表示记录函数的调用栈，sleep标志采样事件
# -g                    enables call-graph recording
# -a, --all-cpus        system-wide collection from all CPUs
# -R, --raw-samples     collect raw sample records from all opened counters
$ perf record -e probe:submit_bio -aR -g sleep 10

# 删除自定义的probe点
$ perf probe --del

简单使用 -g 参数存在大量的Unkown函数，从perf record产生的结果来看，这部分对应的地址大部分是非法地址。
perf 支持集中栈回溯和栈追踪的集中方法，可以添加一下几个参数，详细信息看以下帮助信息：

$ perf record -h
--call-graph <record_mode[,record_size]>	//设置栈回溯的方式
                  setup and enables call-graph (stack chain/backtrace):

                        record_mode:    call graph recording mode (fp|dwarf|lbr)
                        record_size:    if record_mode is 'dwarf', max size of stack recording (<bytes>)
                                        default: 8192 (bytes)

                        Default: fp

栈回溯支持三种模式，（fp|dwarf|lbr），默认为fp模式，三种模式的说明：

fp

fp 就是 Frame Pointer，即 x86 中的 EBP 寄存器，fp 指向当前栈帧栈底地址，此地址保存着上一栈帧的 EBP 值，具体可参考此文章的介绍，根据 fp 就可以逐级回溯调用栈。然而这一特性是会被优化掉的，而且这还是 GCC 的默认行为，在不手动指定 -fno-omit-frame-pointer 时默认都会进行此优化，此时 EBP 被当作一般的通用寄存器使用，以此为依据进行栈回溯显然是错误的。不过尝试指定 -fno-omit-frame-pointer 后依然没法获取到正确的调用栈，根据 GCC 手册的说明，指定了此选项后也并不保证所有函数调用都会使用 fp…… 看来只有放弃使用 fp 进行回溯了。

dwarf

dwarf 是一种调试文件格式，GCC 编译时附加的 -g 参数生成的就是 dwarf 格式的调试信息，其中包括了栈回溯所需的全部信息，使用 libunwind 即可展开这些信息。dwarf 的进一步介绍可参考 “关于 DWARF”，值得一提的是，GDB 进行栈回溯时使用的正是 dwarf 调试信息。实际测试表明使用 dwarf 可以很好的获取到准确的调用栈。

lbr

最后 perf 还支持通过 lbr 获取调用栈，lbr 即 Last Branch Records，是较新的 Intel CPU 中提供的一组硬件寄存器，其作用是记录之前若干次分支跳转的地址，主要目的就是用来支持 perf 这类性能分析工具，其详细说明可参考 “An introduction to last branch records” & “Advanced usage of last branch records”。此方法是性能与准确性最高的手段，然而它存在一个很大的局限性，由于硬件 Ring Buffer 寄存器的大小是有限的，lbr 能记录的栈深度也是有限的，具体值取决于特定 CPU 实现，一般就是 32 层，若超过此限制会得到错误的调用栈。

perf ftrace

通过perf 的ftrace功能，可以获取内核目标函数的调用信息。

root@curtis-PC:/home/curtis/work# perf ftrace -h

 Usage: perf ftrace [<options>] [<command>]
    or: perf ftrace [<options>] -- <command> [<options>]

    -a, --all-cpus        System-wide collection from all CPUs	# 跟踪所有cpu
    -C, --cpu <cpu>       List of cpus to monitor	# 跟踪指定cpu
    -D, --delay <n>       Number of milliseconds to wait before starting tracing after program start
    -F, --funcs <[FILTER]>
                          Show available functions to filter
    -G, --graph-funcs <func>
                          Trace given functions using function_graph tracer
    -g, --nograph-funcs <func>
                          Set nograph filter on given functions
    -m, --buffer-size <size>
                          Size of per cpu buffer, needs to use a B, K, M or G suffix.
    -N, --notrace-funcs <func>
                          Do not trace given functions
    -p, --pid <pid>       Trace on existing process id # 跟踪已经存在的进程pid号
    -T, --trace-funcs <func>
                          Trace given functions using function tracer
    -t, --tracer <tracer>
                          Tracer to use: function_graph(default) or function
    -v, --verbose         Be more verbose
        --func-opts <options>
                          Function tracer options, available options: call-graph,irq-info
        --graph-opts <options>
                          Graph tracer options, available options: nosleep-time,noirqs,verbose,thresh=<n>,depth=<n>
        --inherit         Trace children processes
        --tid <tid>       Trace on existing thread id (exclusive to --pid)

以下示例为跟踪vfs_open函数的调用信息，从以下信息可以获取到每一个调用路径详细信息。

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
   4)   0.968 us    |  mutex_unlock();
   0)               |  vfs_open() {
   0)   0.217 us    |    irq_enter_rcu();
   0)               |    __sysvec_irq_work() {
   0)               |      __wake_up() {
   0)               |        __wake_up_common_lock() {
   0)   0.093 us    |          _raw_spin_lock_irqsave();
   0)               |          __wake_up_common() {
   0)               |            pollwake() {
   0)               |              default_wake_function() {
   0)               |                try_to_wake_up() {
   0)   0.209 us    |                  _raw_spin_lock_irqsave();
   0)               |                  select_task_rq_fair() {
   0)   0.157 us    |                    available_idle_cpu();
   0)   0.142 us    |                    available_idle_cpu();
   0)   0.069 us    |                    available_idle_cpu();
   0)   0.068 us    |                    rcu_read_unlock_strict();
   0)   0.891 us    |                  }
   0)               |                  ttwu_queue_wakelist() {
   0)               |                    __smp_call_single_queue() {
   0)               |                      send_call_function_single_ipi() {
   0)               |                        native_send_call_func_single_ipi() {
   0)   7.103 us    |                          x2apic_send_IPI();
   0)   7.451 us    |                        }
   0)   7.628 us    |                      }
   0)   7.879 us    |                    }
   0)   8.077 us    |                  }
   0)   0.146 us    |                  _raw_spin_unlock_irqrestore();
   0)   9.893 us    |                }
   0) + 10.027 us   |              }
   0) + 10.175 us   |            }
   0) + 10.562 us   |          }
   0)   0.065 us    |          _raw_spin_unlock_irqrestore();
   0) + 11.002 us   |        }
   0) + 11.136 us   |      }
   0) + 11.537 us   |    }
   0)               |    irq_exit_rcu() {
   0)   0.084 us    |      idle_cpu();
   0)   0.230 us    |    }
   0)               |    do_dentry_open() {
   0)               |      path_get() {
   0)   0.127 us    |        mntget();
   0)   0.377 us    |      }
   0)   0.121 us    |      try_module_get();
   0)               |      security_file_open() {
   0)   0.154 us    |        apparmor_file_open();
   0)   0.400 us    |      }
   0)               |      proc_single_open() {
   0)               |        single_open() {
   0)               |          kmem_cache_alloc_trace() {
   0)               |            __cond_resched() {
   0)   0.066 us    |              rcu_all_qs();
   0)   0.204 us    |            }
   0)   0.069 us    |            should_failslab();
   0)   0.087 us    |            mem_cgroup_from_task();
## 以上仅为部分函数调用信息