SRS Performance (CPU) and Memory Optimization Tools: Usage

2021-03-05 winlinvip

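SRS provides a set of tools for locating performance bottlenecks and memory leaks. Their usage is printed in the summary after ./configure && make, but that is not convenient to consult, so this article collects the usage in one place.

All the tools in this article are also useful for other Linux programs.

Note: every one of these tools degrades SRS performance, so do not enable these options unless you are troubleshooting a problem.

Note: this article is also published on CSDN and on Jianshu; if they differ, the Jianshu version takes precedence.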

RTC

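RTC runs over UDP, so first set the network receive and send buffer sizes. The following commands are commonly used for UDP analysis:

# Show the UDP buffer sizes; the default is only about 200KB.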
sysctl net.core.rmem_max &&
sysctl net.core.rmem_default &&
sysctl net.core.wmem_max &&
sysctl net.core.wmem_default

# Set the buffer sizes to 16MB
sysctl net.core.rmem_max=16777216
sysctl net.core.rmem_default=16777216
sysctl net.core.wmem_max=16777216
sysctl net.core.wmem_default=16777216

Alternatively, edit the system file /etc/sysctl.conf so that the settings also take effect after a reboot:

# vi /etc/sysctl.conf
# For RTC
net.core.rmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_max=16777216
net.core.wmem_default=16777216

Check the receive and send packet drops:

# Show the drop counters
netstat -suna
# Show how the drop counters change over 30 seconds
netstat -suna && sleep 30 && netstat -suna

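Example output explained:
224911319 packets received: the total number of packets received.
65731106 receive buffer errors: packets dropped on receive because they could not be processed in time.
123534411 packets sent: the total number of packets sent.
0 send buffer errors: packets dropped on send.

Note: the SRS log prints the UDP receive and send drops, for example loss=(r:49,s:0), which means that 49 packets per second could not be received in time and no packets were dropped on send.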
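The 30-second comparison can be scripted instead of read by eye. The following is a minimal sketch, assuming the counter wording shown in the example ("receive buffer errors", "send buffer errors") as printed by net-tools netstat. The function names udp_rcv_errors, udp_snd_errors and drop_rate are hypothetical helpers, not part of SRS or netstat.

```shell
# Print the first "receive buffer errors" counter found in
# `netstat -suna` output read from stdin.
udp_rcv_errors() {
    awk '!seen && /receive buffer errors/ { print $1; seen = 1 }'
}

# Print the first "send buffer errors" counter found in
# `netstat -suna` output read from stdin.
udp_snd_errors() {
    awk '!seen && /send buffer errors/ { print $1; seen = 1 }'
}

# Average drops per second between two counter readings (integer division).
drop_rate() {  # usage: drop_rate BEFORE AFTER SECONDS
    echo $(( ($2 - $1) / $3 ))
}
```

For example, run before=$(netstat -suna | udp_rcv_errors), wait with sleep 30, run after=$(netstat -suna | udp_rcv_errors), and then drop_rate "$before" "$after" 30 prints the average number of receive drops per second, rounded down.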

Check the receive and send queue lengths:

netstat -lpun

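Example output explained:
Recv-Q 427008: the number of bytes waiting in the program's receive queue. From the netstat manual: "Established: The count of bytes not copied by the user program connected to this socket."
Send-Q 0: the number of bytes waiting in the program's send queue. From the netstat manual: "Established: The count of bytes not acknowledged by the remote host."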
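A persistently large Recv-Q suggests that the program is not reading its socket fast enough. As a sketch, the filter below lists the UDP sockets whose Recv-Q exceeds a threshold. It assumes the usual netstat -lpun column order (Proto, Recv-Q, Send-Q, Local Address, ...), and busy_udp_sockets is a hypothetical name.

```shell
# List UDP sockets whose Recv-Q (bytes waiting to be read) is above MIN_BYTES.
# Reads `netstat -lpun` output on stdin and prints "local-address recv-q"
# for each matching socket. MIN_BYTES defaults to 0.
busy_udp_sockets() {  # usage: netstat -lpun | busy_udp_sockets [MIN_BYTES]
    awk -v min="${1:-0}" '$1 ~ /^udp/ && $2 + 0 > min + 0 { print $4, $2 }'
}
```

For example, netstat -lpun | busy_udp_sockets 100000 prints only the sockets with more than 100000 bytes queued.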

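Some of the netstat options:

--udp|-u Filter for the UDP protocol.
--numeric|-n Show numeric IP addresses and ports instead of names; for example, http is shown as 80.
--statistics|-s Show per-protocol network statistics.
--all|-a Show both listening and non-listening sockets.
--listening|-l Show only listening sockets.
--program|-p Show the PID and name of the program that owns each socket.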

PERF

PERF is a Linux performance analysis tool; see PERF, for example perf record -e block:block_rq_issue -ag.

To watch the current SRS hot functions in real time:

perf top -p `ps aux|grep srs|grep conf|awk '{print $2}'`

Or record data for a period of time:

perf record -p `ps aux|grep srs|grep conf|awk '{print $2}'`
# Press CTRL+C to stop recording, then run the following
perf report

To record stacks and show the call graph:

perf record -a --call-graph fp -p `ps aux|grep srs|grep conf|awk '{print $2}'`
perf report --call-graph --stdio

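Note: the report can also be written to a file: perf report --call-graph --stdio >t.txt.

Remark: ST (state-threads) uses non-standard stacks, so the stacks that perf records with -g are scrambled. perf can therefore show the SRS hot spots but not stack information. If you need stacks, use GPERF: GCP, described in a later section.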

GPROF

GPROF is the GNU CPU profiling tool. See SRS GPROF and GNU GPROF.

Usage:

# Build SRS with GPROF
./configure --gprof=on && make

# Start SRS with GPROF
./objs/srs -c conf/console.conf

# Stop SRS with SIGINT (same as pressing CTRL+C) to end GPROF profiling
killall -2 srs

# Analyze the result.
gprof -b ./objs/srs gmon.out

GPERF

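GPERF is the set of CPU and memory tools provided with Google's tcmalloc; see GPERF.

GPERF: GCP

GCP is the CPU profiling tool. It finds performance bottlenecks in the usual sense, showing which function calls consume too much CPU. See GCP.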

Usage:

# Build SRS with GCP
./configure --gperf=on --gcp=on && make

# Start SRS with GCP
./objs/srs -c conf/console.conf

# Stop SRS with SIGINT (same as pressing CTRL+C) to end GCP profiling
killall -2 srs

# Analyze the CPU profile
./objs/pprof --text objs/srs gperf.srs.gcp*

Note: for usage, see cpu-profiler.

For a graphical view, install dot on CentOS:

yum install -y graphviz

Then generate an SVG image, which can be opened in Chrome:

./objs/pprof --svg ./objs/srs gperf.srs.gcp >t.svg

GPERF: GMD

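GMD is the memory defense tool provided by GPERF; it detects out-of-bounds memory access and wild pointers. An out-of-bounds write often does not cause visible damage immediately. The corruption is noticed only later, when another thread uses the damaged object, which makes this kind of memory problem hard to track down. With GMD, the process core dumps at the moment of the out-of-bounds access or wild-pointer use, which pinpoints where the problem occurs. See GMD.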

Usage:

# Build SRS with GMD.
./configure --gperf=on --gmd=on && make

# Start SRS with GMD.
env TCMALLOC_PAGE_FENCE=1 ./objs/srs -c conf/console.conf

Note: for usage, see heap-defense.

Note: GMD requires linking against libtcmalloc_debug.a and setting the environment variable TCMALLOC_PAGE_FENCE.

GPERF: GMC

GMC is the memory leak detection tool; see GMC.

Usage:

# Build SRS with GMC
./configure --gperf=on --gmc=on && make

# Start SRS with GMC
env PPROF_PATH=./objs/pprof HEAPCHECK=normal ./objs/srs -c conf/console.conf 2>gmc.log 

# Stop SRS with SIGINT (same as pressing CTRL+C) to end the GMC check
killall -2 srs

# Inspect the memory leak report
cat gmc.log

Note: for usage, see heap-checker.

GPERF: GMP

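GMP is the memory profiling tool. For example, it can detect performance problems caused by frequent allocation and release of heap memory. See GMP.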

Usage:

# Build SRS with GMP
./configure --gperf=on --gmp=on && make

# Start SRS with GMP
./objs/srs -c conf/console.conf

# Stop SRS with SIGINT (same as pressing CTRL+C) to end GMP profiling
killall -2 srs

# Analyze the memory profile
./objs/pprof --text objs/srs gperf.srs.gmp*

Note: for usage, see heap-profiler.

VALGRIND

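VALGRIND is the well-known analysis tool for C programs, and it is supported from SRS3 onward. Before SRS3, because SRS uses ST, ST had to be patched before VALGRIND could be used.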

valgrind --leak-check=full ./objs/srs -c conf/console.conf

Remark: for versions before SRS3, you can patch ST manually to support VALGRIND; see state-threads, and see ST#2 for details.

Syscall

For troubleshooting system-call performance, see "centos6的性能分析工具集合" (a collection of performance analysis tools for CentOS 6).

OSX

On OSX/Darwin/Mac systems you can use Instruments. In Xcode, choose Open Developer Tool to find Instruments, or launch the application directly. See Profiling c++ on mac os x.

instruments -l 30000 -t Time\ Profiler -p 72030

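Remark: you can also select the process in Activity Monitor and then choose Sample to sample it.

DTrace is also available; see "动态追踪技术（中） - Dtrace、SystemTap、火焰图" (Dynamic Tracing Techniques, Part 2: DTrace, SystemTap, Flame Graphs) or "浅谈动态跟踪技术之DTrace" (A Brief Look at Dynamic Tracing with DTrace).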

Multi-core and softirq

On multi-core machines, the NIC softirqs usually run on CPU0, so SRS can be scheduled onto the other CPUs:

taskset -p 0xfe `cat objs/srs.pid`

Or pin SRS to CPU1:

taskset -pc 1 `cat objs/srs.pid`

After the change, run top and press 1 to see the load on each CPU:

top # press 1 once top is running
#%Cpu0  :  1.8 us,  1.1 sy,  0.0 ni, 90.8 id,  0.0 wa,  0.0 hi,  6.2 si,  0.0 st
#%Cpu1  : 67.6 us, 17.6 sy,  0.0 ni, 14.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

Or use mpstat -P ALL:

mpstat -P ALL
#01:23:14 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
#01:23:14 PM  all   33.33    0.00    8.61    0.04    0.00    3.00    0.00    0.00    0.00   55.02
#01:23:14 PM    0    2.46    0.00    1.32    0.06    0.00    6.27    0.00    0.00    0.00   89.88
#01:23:14 PM    1   61.65    0.00   15.29    0.02    0.00    0.00    0.00    0.00    0.00   23.03

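Use cat /proc/softirqs to see the specific softirq types on every CPU; see Introduction to deferred interrupts (Softirq, Tasklets and Workqueues).

If SRS is forcibly bound to CPU0, softirq usage rises noticeably, probably because both the process and the system softirqs then run on CPU0; si is also much higher than when the two are separated.

With more CPUs, for example 4, the NIC interrupts may be bound to several CPUs. The following commands show how the NIC interrupts are bound: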

# grep virtio /proc/interrupts | grep -e in -e out
 29:   64580032          0          0          0   PCI-MSI-edge      virtio0-input.0
 30:          1         49          0          0   PCI-MSI-edge      virtio0-output.0
 31:   48663403          0   11845792          0   PCI-MSI-edge      virtio0-input.1
 32:          1          0          0         52   PCI-MSI-edge      virtio0-output.1

# cat /proc/irq/29/smp_affinity
1 # virtio0-input.0 (receive queue 0) is bound to CPU0
# cat /proc/irq/30/smp_affinity
2 # virtio0-output.0 (send queue 0) is bound to CPU1
# cat /proc/irq/31/smp_affinity
4 # virtio0-input.1 (receive queue 1) is bound to CPU2
# cat /proc/irq/32/smp_affinity
8 # virtio0-output.1 (send queue 1) is bound to CPU3
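
Each smp_affinity value is a hexadecimal CPU bitmask in which bit N stands for CPU N, so 1, 2, 4 and 8 mean CPU0 to CPU3. A small sketch for decoding such a mask follows. mask_to_cpus is a hypothetical helper, and it assumes a single-word mask without commas, as in the output above.

```shell
# Print the CPU numbers selected by a hexadecimal affinity mask,
# separated by spaces. The mask is given without a 0x prefix,
# in the form that /proc/irq/N/smp_affinity prints it.
mask_to_cpus() {  # usage: mask_to_cpus HEX_MASK
    local mask cpu out
    mask=$(( 0x$1 ))
    cpu=0
    out=""
    while [ "$mask" -gt 0 ]; do
        # The lowest bit tells whether the current CPU is in the mask.
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="${out:+$out }$cpu"
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$out"
}
```

For example, mask_to_cpus "$(cat /proc/irq/29/smp_affinity)" prints the CPUs that IRQ 29 may run on.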

We can force the NIC softirqs onto CPU0; see Linux: scaling softirq among many CPU cores and SMP IRQ affinity:

for irq in $(grep virtio /proc/interrupts | grep -e in -e out | cut -d: -f1); do 
    echo 1 > /proc/irq/$irq/smp_affinity
done

Note: to bind to CPUs 0-1 instead, run echo 3 > /proc/irq/$irq/smp_affinity

Then bind all SRS threads to the CPUs other than CPU0:

taskset -a -p 0xfe $(cat objs/srs.pid)
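
The mask 0xfe selects CPUs 1 to 7, which corresponds to an 8-core machine. For other core counts the equivalent mask can be computed. The sketch below uses a hypothetical helper, mask_without_cpu0, and assumes at least two CPUs.

```shell
# Print the taskset mask that selects every CPU except CPU0,
# for example 8 CPUs -> 0xfe. With a single CPU the result is 0x0,
# which selects no CPU and is therefore not usable.
mask_without_cpu0() {  # usage: mask_without_cpu0 NUM_CPUS
    printf '0x%x\n' $(( (1 << $1) - 2 ))
}
```

For example, taskset -a -p "$(mask_without_cpu0 "$(nproc)")" "$(cat objs/srs.pid)" applies the computed mask to all SRS threads.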

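With the default softirq distribution, CPU usage is higher.

With the softirqs concentrated on CPU0, CPU usage drops by about 20%.

For maximum performance, run the CPU-binding and softirq-binding commands from the SRS startup script, before starting SRS.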

Process priority

SRS can be given a higher priority so that it gets more CPU time:

renice -n -15 -p `cat objs/srs.pid`

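Note: nice values range from -20 to 19 and default to 0. High-priority processes on an ECS instance typically run at -10, so -15 is used here.

The nice value of the process is shown in the NI column of top: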

top -n1 -p `cat objs/srs.pid`
#  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                
# 1505 root       5 -15  519920 421556   4376 S  66.7  5.3   4:41.12 srs
