
1 perf

For detailed usage of the perf command, see Linux-Frequently-Used-Commands

1.1 event

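perf list shows every event supported in the current environment. Its output covers several classes (hardware events among them); the two discussed here are Software events and Tracepoint events

Software event: an event that is sampled, for example with perf record -F 99 to set the sampling frequency
Tracepoint event: an event that is not sampled. It is a fixed instrumentation point and is counted every time execution reaches it. Tracepoint events are further divided into many subcategories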

1.2 libperf

1.3 Reference

perf Examples

2 Flame Graph

2.1 CPU Flame Graph

Flame Graphs

Related git project:

FlameGraph

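# Sample the specified process at 99 Hz for 60s; this writes perf.data to the current directory
# -F: sampling frequency; if omitted, perf samples at 4000 Hz by default. The maximum sampling frequency is set by the kernel parameter kernel.perf_event_max_sample_rate (/proc/sys/kernel/perf_event_max_sample_rate)
# -g: enable call-graph (stack trace) recording
sudo perf record -F 99 -g -p <pid> -- sleep 60

# Parse the perf.data file in the current directory
sudo perf script > out.perf

# Generate the flame graph
# The two scripts below come from the FlameGraph project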
FlameGraph_path=xxx
${FlameGraph_path}/stackcollapse-perf.pl out.perf > out.folded
${FlameGraph_path}/flamegraph.pl out.folded > out.svg
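
The intermediate out.folded file is plain text: one line per unique stack, with frames joined by `;`, followed by a space and the sample count. A minimal sketch for ranking the hottest stacks without opening the SVG is shown below. `top_stacks` is a hypothetical helper name introduced here for illustration, not part of FlameGraph.

```shell
# top_stacks FILE [N]: print the N stacks with the most samples (default 10).
# Assumes the folded format "frame1;frame2;... count" (count is the last field).
top_stacks() {
    local file="$1" n="${2:-10}"
    # Move the count to the front of each line so sort can order on it.
    awk '{ count = $NF; sub(/ [0-9]+$/, ""); print count "\t" $0 }' "$file" |
        sort -rn | head -n "$n"
}
```

Usage: `top_stacks out.folded 5` prints the five stacks with the most samples, one per line, as `count<TAB>stack`.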

2.2 CPU Flame Graph for Java

Java Flame Graphs

Related git projects:

perf-map-agent

# Install
yum install -y centos-release-scl
yum install -y devtoolset-9
yum install -y cmake
scl enable devtoolset-9 bash
cd perf-map-agent && cmake . && make

# This is required by the runtime
yum install -y java-1.8.0-openjdk-devel
export JAVA_HOME=/usr/lib/jvm/java-1.8.0

FlameGraph

JVM option required for the Java process:

-XX:+PreserveFramePointer

# Sample the specified process at 99 Hz for 300s; this writes perf.data to the current directory
# -g: enable call-graph (stack trace) recording
sudo perf record -F 99 -g -p <pid> -- sleep 300

# Download and build perf-map-agent
# Dependencies: cmake and a full JDK (a JRE alone is not enough)
PerfMapAgent_path=xxx
sudo ${PerfMapAgent_path}/bin/create-java-perf-map.sh <pid>

# Parse the perf.data file in the current directory
sudo perf script > out.perf

# Generate the flame graph
# The two scripts below come from the FlameGraph project
FlameGraph_path=xxx
${FlameGraph_path}/stackcollapse-perf.pl out.perf > out.folded
${FlameGraph_path}/flamegraph.pl out.folded > out.svg

2.3 Cache Miss Flame Graph

2.4 CPI Flame Graph

CPI Flame Graphs: Catching Your CPUs Napping

2.5 Summary

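With no -e option, perf record samples a CPU-time event (the hardware cycles event where available, otherwise the software cpu-clock event), so the procedure above produces a CPU flame graph
Passing -e to perf record selects the event type, which makes it possible to build a flame graph for any event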
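
For instance, a cache-miss flame graph only changes the recording step. The sketch below assumes the `cache-misses` event appears in `perf list` on the machine (it is often missing in virtualized environments). `record_cmd` is a hypothetical helper that only prints the command so it can be reviewed before it is run, and the pid 1234 is a placeholder.

```shell
# record_cmd EVENT PID SECONDS: print the perf record command for one event.
# It only prints; copy the output (or pipe it to sh) to actually run it.
record_cmd() {
    local event="$1" pid="$2" secs="$3"
    printf 'sudo perf record -e %s -g -p %s -- sleep %s\n' "$event" "$pid" "$secs"
}

# Example: sample on cache misses instead of CPU time.
record_cmd cache-misses 1234 60
```

The remaining steps (perf script, stackcollapse-perf.pl, flamegraph.pl) stay the same as for the CPU flame graph.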

2.6 Reference

3 Off-CPU Analysis

Off-CPU Analysis

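Analysis tools:

>= Linux 4.8: eBPF (extended BPF)
- Requires Linux 4.8 or later
- Lower overhead, because it only captures and translates unique stacks
- Linux eBPF Off-CPU Flame Graph
< Linux 4.8: each blocking type (I/O, scheduler, lock) needs its own analysis tool, for example SystemTap, perf event logging, or BPF
- Linux perf_events Off-CPU Time Flame Graph
- Linux eBPF Off-CPU Flame Graph
Other tools
- time: a very simple measurement tool
  - real: total elapsed (wall-clock) time
  - user: CPU time spent in user mode
  - sys: CPU time spent in kernel mode
  - real - user - sys: off-CPU time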
- brpc
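
The arithmetic in the time entry above can be sketched as follows. `offcpu_seconds` is a hypothetical helper, and the three arguments are assumed to be the real, user, and sys values in seconds as reported by time. For a multi-threaded process user + sys can exceed real, so the estimate is clamped at zero and is only meaningful for mostly single-threaded work.

```shell
# offcpu_seconds REAL USER SYS: estimate off-CPU time from `time` output.
# Clamped at 0 because user+sys can exceed real when threads run in parallel.
offcpu_seconds() {
    awk -v r="$1" -v u="$2" -v s="$3" 'BEGIN {
        off = r - u - s
        if (off < 0) off = 0
        printf "%.2f\n", off
    }'
}

offcpu_seconds 10.0 3.5 1.5   # prints 5.00
```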

Other references:

Off-CPU Flame Graphs

3.1 Using perf

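# Enable the scheduler statistics tracepoints. Run this in a root shell:
# with `sudo echo ...` from a regular account the redirection is still done by the unprivileged shell and may fail.
# Alternative from a regular account: echo 1 | sudo tee /proc/sys/kernel/sched_schedstats
echo 1 > /proc/sys/kernel/sched_schedstats

# Collect data; to profile a single process, replace -a with -p <pid>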
sudo perf record \
    -e sched:sched_stat_sleep \
    -e sched:sched_switch \
    -e sched:sched_process_exit \
    -a \
    -g \
    -o perf.data.raw \
    sleep 30

# The -s option merges the sched_stat and sched_switch events to compute the corresponding sleep time
sudo perf inject -v -s \
    -i perf.data.raw \
    -o perf.data

sudo perf script -F comm,pid,tid,cpu,time,period,event,ip,sym,dso | \
    sudo awk '
    NF > 4 { exec = $1; period_ms = int($5 / 1000000) } 
    NF > 1 && NF <= 4 && period_ms > 0 { print $2 } 
    NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }
    ' | \
    sudo ${FlameGraph_path}/stackcollapse.pl | \
    sudo ${FlameGraph_path}/flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg

3.2 Using BPF

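Installation:

yum install bcc
Tool directory: /usr/share/bcc/tools/
- /usr/share/bcc/tools/offcputime

Generating an off-CPU flame graph:

Stacks that account for a very small share of the total are omitted from the graph

# Sample the specified process for 30s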
/usr/share/bcc/tools/offcputime -df -p <pid> 30 > out.bcc

# Generate the flame graph
# flamegraph.pl comes from the FlameGraph project
FlameGraph_path=xxx
${FlameGraph_path}/flamegraph.pl --colors=io --title="Off-CPU Time Flame Graph" --countname=us out.bcc > out.svg
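
Because out.bcc is already in folded form, it can also be summarized directly. The sketch below assumes each line looks like `comm;frame;...;frame micros`, which is how offcputime -f writes it. `offcpu_by_comm` is a hypothetical helper that totals the blocked time per command (thread) name.

```shell
# offcpu_by_comm FILE: sum off-CPU microseconds per command name, largest first.
# Assumes folded lines "comm;frame;...;frame micros" (micros is the last field).
offcpu_by_comm() {
    awk '{
        us = $NF                      # last field: microseconds off-CPU
        line = $0
        sub(/ [0-9]+$/, "", line)     # drop the trailing count
        split(line, frames, ";")      # first element is the command name
        total[frames[1]] += us
    }
    END { for (c in total) printf "%d %s\n", total[c], c }' "$1" | sort -rn
}
```

Usage: `offcpu_by_comm out.bcc | head` lists the names that spent the most time blocked.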

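Viewing the off-CPU stacks of a single thread:

This prints every stack together with its off-CPU time (in microseconds); stacks nearer the end of the output occur more often

# Sample the specified thread for 30s
sudo /usr/share/bcc/tools/offcputime -d -t <tid> 30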

4 VTune

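Installing Vtune-Profile:

Download and run the Offline Installer; the default install path is ~/intel/oneapi/vtune
Binary directory: ${install_dir}/intel/oneapi/vtune/latest/bin64, referred to below as vtune_bin_dir
- vtune-gui: the graphical program; requires the X Window System
  - Menu
  - Project Navigator
    - Default project path: ~/intel/vtune/projects. Data collected on another machine can be opened after copying it into this directory
    - Open project appears to be broken
  - Config Analysis
    - The >_ button in the lower-right corner shows the vtune collection command equivalent to the GUI configuration
  - Compare results
  - Open Results
    - Appears to be broken
- vtune: the command-line tool; does not require the X Window System
  - ${vtune_bin_dir}/vtune -collect hotspots --duration 30 --target-pid <pid>: creates a directory with a name like r000hs in the current directory and stores the collected data there
  - ${vtune_bin_dir}/vtune -collect hotspots --duration 30 --target-pid <pid> -r <target_dir>: stores the collected data in the specified directory
- vtune-self-checker.sh: environment self-check
The machine that runs vtune-gui is usually not the target machine (servers generally do not have the X Window System installed). There are two ways to handle this:
- Install Vtune-Profile on the target machine (✅ recommended)
- Install Vtune-Profile-Target (only the packages needed for data collection) on the target machine; this has pitfalls (❌ not recommended):
  - Automatic install: Configure Analysis -> Remote Linux(ssh) -> Deploy
  - Manual install: copy the archive under ${install_dir}/intel/oneapi/vtune/latest/target/linux to the target machine and extract it
My macOS version is Monterey 12.0.1, and on this version remote analysis of a Linux machine does not work. The workaround: install the X Window System and Vtune-Profile on the target Linux system, log in to it through a remote desktop tool such as VNC or NX, then launch vtune-gui there and analyze the program locally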

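General workflow:

Assume two machines, A and B
- A: requires the X Window System
- B: does not require the X Window System
- A and B may be the same machine
Install Vtune-Profile on both A and B
On machine B, sample with vtune; assume the data is written to the r000hs directory
Copy the r000hs directory from machine B into ~/intel/vtune/projects on machine A
Open vtune-gui on machine A and analyze the r000hs project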

4.1 Reference

5 Chrome tracing view

https://github.com/StarRocks/starrocks/pull/7649

6 pcm

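Processor Counter Monitor (pcm) includes the following tools:

pcm: the basic monitoring tool
pcm-sensor-server: runs a local HTTP service that returns metrics as JSON
pcm-memory: monitors memory bandwidth
pcm-latency: monitors L1 cache miss and DDR/PMM memory latency
pcm-pcie: monitors PCIe bandwidth per socket
pcm-iio: monitors PCIe bandwidth per PCIe device
pcm-numa: monitors local and remote memory accesses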
pcm-power
pcm-tsx
pcm-core/pmu-query
pcm-raw
pcm-bw-histogram

7 sysbench

sysbench

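Examples (these use the legacy option syntax; sysbench 1.0 and later deprecate --test and --num-threads in favor of the form sysbench memory ... --threads=1 run):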

sysbench --test=memory --memory-block-size=1M --memory-total-size=10G --num-threads=1 run
sysbench --test=cpu run
sysbench --test=fileio --file-test-mode=seqwr run
sysbench --test=threads run
sysbench --test=mutex run

8 valgrind (by andy pavlo)

valgrind

Tools:

memcheck: Detects memory-related errors.
cachegrind: Profiles cache behavior. It simulates how the code uses the CPU's cache hierarchy and memory, which helps locate code with inefficient memory usage or poor cache locality.
callgrind: Profiles function calls and CPU performance.
helgrind: Detects data races and threading bugs.
drd: Analyzes threading issues, similar to Helgrind.
massif: Memory profiler focused on heap and stack usage.
dhat: A tool for examining how programs use their heap allocations.
lackey: An experimental tool that performs basic code instrumentation.

Example:

cat > main.cpp << 'EOF'
#include <cstdlib>
#include <iostream>
#include <vector>

const int SIZE = 1000;

void inefficientFunction() {
    std::vector<std::vector<int>> matrix(SIZE, std::vector<int>(SIZE));

    // Fill matrix with values, accessing elements in a column-major order
    // which is inefficient for cache locality.
    for (int i = 0; i < SIZE; ++i) {
        for (int j = 0; j < SIZE; ++j) {
            matrix[j][i] = rand() % 100; // Random values in matrix
        }
    }

    // Sum up all elements
    int total = 0;
    for (int i = 0; i < SIZE; ++i) {
        for (int j = 0; j < SIZE; ++j) {
            total += matrix[i][j];
        }
    }

    std::cout << "Total: " << total << std::endl;
}

int main() {
    inefficientFunction();
    return 0;
}
EOF

gcc -o main main.cpp -lstdc++ -std=gnu++17 -g

valgrind --tool=massif ./main
find . -name "massif.out.*" | xargs ms_print

valgrind --tool=cachegrind ./main
find . -name "cachegrind.out.*" | xargs cg_annotate

valgrind --tool=dhat ./main

Reference:

9 Best Practice

9.1 The primary metrics that performance analysis should prioritize

Cycles
IPC
Instructions
L1 Miss
LLC Miss, Last Level Cache
Branch Miss
Contention
%usr, %sys
bandwidth, packet rate, irq
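
Of these metrics, IPC is derived rather than counted directly: it is instructions divided by cycles. A minimal sketch follows, assuming the two counts were read from something like `perf stat -e cycles,instructions -p <pid> -- sleep 10`. `ipc` is a hypothetical helper, and the numbers in the example call are made up.

```shell
# ipc INSTRUCTIONS CYCLES: instructions retired per CPU cycle.
ipc() {
    awk -v i="$1" -v c="$2" 'BEGIN {
        if (c + 0 == 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.2f\n", i / c
    }'
}

ipc 3000000 2000000   # prints 1.50
```

As a rough reading, a low IPC suggests the CPU is often stalled (for example on memory), which is where the L1 and LLC miss counters become the next thing to check.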

9.2 What is the approach for performance bottleneck analysis?

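When the CPU cannot be saturated, possible causes include:
- Insufficient parallelism
- Serialization points (for example std::mutex)
- Another resource, such as the NIC or disk, is already saturated and keeps CPU utilization from rising further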

9.3 What is the approach for high system cpu analysis?

Check system calls: sudo perf stat -e "syscalls:sys_enter_*"
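
The output of that command has one line per syscall tracepoint, most of them zero. The sketch below ranks the non-zero ones. It assumes the default human-readable perf stat layout, where each line starts with a count using comma thousands separators followed by the event name. `top_syscalls` is a hypothetical helper.

```shell
# top_syscalls FILE [N]: rank syscalls by call count from saved perf stat output.
# Assumes lines of the form "<count with commas>  syscalls:sys_enter_<name>".
top_syscalls() {
    local file="$1" n="${2:-10}"
    awk '$2 ~ /^syscalls:sys_enter_/ {
        gsub(/,/, "", $1)                    # "1,234" -> "1234"
        sub(/^syscalls:sys_enter_/, "", $2)  # keep only the syscall name
        if ($1 + 0 > 0) print $1, $2         # skip syscalls that never fired
    }' "$file" | sort -rn | head -n "$n"
}
```

One way to use it: save the statistics with `sudo perf stat -e 'syscalls:sys_enter_*' -p <pid> -o stat.txt -- sleep 10`, then run `top_syscalls stat.txt 5`.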
