Linux学习笔记之理解平均负载

我们常用top和uptime查看系统的负载情况，那什么是平均负载呢？

平均负载：指单位时间内，系统处于可运行状态和不可中断状态的平均进程数，即平均活跃进程数，它和cpu使用率没有直接关系。

可运行状态的进程：正在使用cpu或者正在等待cpu的进程，也就是我们常用ps命令看到的，处于R状态（Running或Runnable）的进程

不可中断状态的进程：正处于内核态关键流程中的进程，并且这些流程是不可打断的，比如最常见的是等待硬件设备的I/O响应，也就是我们在ps命令中看到的D状态（UninterruptibleSleep，也称为Disk Sleep）的进程。不可中断状态实际上是系统对进程和硬件设备的一种保护机制。

平均负载为多少合适？

最理想的情况是每个cpu上都刚好运行一个进程，这样cpu就得到了充分的利用

查看系统的cpu数：grep ‘model name’ /proc/cpuinfo | wc -l

平均负载和cpu使用率

常见疑惑：既然平均负载代表的是活跃的进程数，那平均负载高了，是不是就意味着cpu使用率高？

平均负载：单位时间内，处于可运行状态和不可中断状态的进程数。所以不仅包括了正在使用的cpu的进程，还包括等待cpu和等待i/o的进程。

cpu使用率：单位时间内cpu繁忙情况的统计，和平均负载并不完全对应。比如：

cpu密集型进程，使用大量cpu会导致平均负载升高，此时两者一致。

i/o密集型进程，等待i/o也会导致平均负载升高，但cpu使用率不一定很高

大量等待cpu的进程调度也会导致平均负载升高，此时的cpu使用率也会比较高

用iostat、mpstat、pidstat找出平均负载高的根源

mpstat：多核cpu性能分析工具，用来实时查看每个cpu的性能指标，以及cpu的平均指标

pidstat：进程性能分析工具，用来实时查看进程的cpu、内存、i/o以及上下文切换等性能指标

环境准备：

Centos7系统

安装stress（Linux系统压力测试工具）和sysstat（Linux性能工具）

yum install stress 一直找不到镜像所以用了rpm方式安装

http://ftp.tu-chemnitz.de/pub/linux/dag/redhat/el7/en/x86_64/rpmforge/RPMS/stress-1.0.2-1.el7.rf.x86_64.rpm

然后 rpm -Uvh stress-1.0.2-1.el7.rf.x86_64.rpm 安装

sysstat使用yum安装 yum install sysstat

场景一：CPU密集型进程

首先，我们开一个终端，用stress命令，模拟一个CPU使用率 100%的场景

[root@localhost ~]# stress --cpu 1 --timeout 600
stress: info: [5643] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd

然后，打开第二个终端，运行uptime查看平均负载的变化，使用watch -d表示高亮显示变化的区域

最后，打开第三个终端，运行 mpstat 查看CPU使用率的变化情况

# -P ALL 表示监控所有的CPU，后面数字 5 表示间隔 5秒后输出一组数据
mpstat -P ALL 5
15时57分42秒  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15时57分47秒  all   50.15    0.00    0.30    0.00    0.00    0.10    0.00    0.00    0.00   49.45
15时57分47秒    0   45.97    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   53.83
15时57分47秒    1   54.53    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   45.27

分析：在终端二中能看到1分钟的平均负载一直升高，逼近1，在终端3中看到一个CPU的使用率为100%,而iowait为0，说明平均负载升高的原因只有cpu，定位使CPU使用率为100%的进程用 pidstat命令

[root@localhost ~]# pidstat -u 5 1
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain) 	2018年11月24日 	_x86_64_	(2 CPU)

15时53分50秒   UID       PID    %usr %system  %guest    %CPU   CPU  Command
15时53分55秒     0      1043    0.20    0.20    0.00    0.40     0  containerd
15时53分55秒     0      5266    0.00    0.20    0.00    0.20     0  kworker/0:1
15时53分55秒     0      5644   99.60    0.00    0.00   99.60     1  stress
15时53分55秒     0      5933    0.00    0.20    0.00    0.20     0  pidstat

平均时间:   UID       PID    %usr %system  %guest    %CPU   CPU  Command
平均时间:     0      1043    0.20    0.20    0.00    0.40     -  containerd
平均时间:     0      5266    0.00    0.20    0.00    0.20     -  kworker/0:1
平均时间:     0      5644   99.60    0.00    0.00   99.60     -  stress
平均时间:     0      5933    0.00    0.20    0.00    0.20     -  pidstat
[root@localhost ~]#

从结果中，能看到是 stress进程导致CPU使用率为100%

场景二：I/O密集型进程

首先运行 stress 命令，模拟 I/O压力，即不停的执行sync

[root@localhost ~]# stress -i 300 --timeout 600
stress: info: [8184] dispatching hogs: 0 cpu, 300 io, 0 vm, 0 hdd

然后在第二个终端运行 uptime 查看平均负载变化

最后，在第三个终端运行 mpstat查看cpu使用率变化情况

平均时间:  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
平均时间:  all    0.22    0.00   99.75    0.00    0.00    0.03    0.00    0.00    0.00    0.00
平均时间:    0    0.21    0.00   99.76    0.00    0.00    0.03    0.00    0.00    0.00    0.00
平均时间:    1    0.22    0.00   99.75    0.00    0.00    0.03    0.00    0.00    0.00    0.00

发现iowait一直是0，很奇怪。找到原因后回来更新。先截张老师电脑上的结果

解决上面实操iowait没有升高的的问题：

iowait无法升高是因为案例中stress使用的是 sync()系统调用，它的作用是刷新缓冲区内存到磁盘中，对于新安装的虚拟机，缓冲区可能比较小，无法产生大的io压力，这样大部分都是系统调用的消耗了。所以，只看到系统CPU使用率升高。解决办法是使用stress的下一代stree-ng，如 stress-ng -i 1 -hdd 1 --timout 600（-hdd表示读写临时文件）

终端一使用 stress-ng -i 1 -hdd 1 --timout 600

[root@localhost ~]# stress-ng -i 1 --hdd 1 --timeout 600
stress-ng: info:  [18042] dispatching hogs: 1 hdd, 1 io

终端二监控平均负载

终端三查看CPU使用率

[root@localhost ~]# mpstat -P ALL 5 20
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain) 	2018年11月25日 	_x86_64_	(2 CPU)


20时13分19秒  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
20时13分24秒  all    0.57    0.00   32.65   57.82    0.00    7.82    0.00    0.00    0.00    1.13
20时13分24秒    0    0.46    0.00   28.93   59.91    0.00    8.66    0.00    0.00    0.00    2.05
20时13分24秒    1    0.45    0.00   36.20   56.11    0.00    7.01    0.00    0.00    0.00    0.23

[root@localhost ~]# pidstat -u 5 1
Linux 3.10.0-693.el7.x86_64 (localhost.localdomain) 	2018年11月25日 	_x86_64_	(2 CPU)

20时16分00秒   UID       PID    %usr %system  %guest    %CPU   CPU  Command
20时16分05秒     0         3    0.00    1.17    0.00    1.17     0  ksoftirqd/0
20时16分05秒     0         9    0.00    0.19    0.00    0.19     0  rcu_sched
20时16分05秒     0        13    0.00    1.36    0.00    1.36     1  ksoftirqd/1
20时16分05秒     0        30    0.00   12.45    0.00   12.45     0  kswapd0
20时16分05秒     0       669    0.39    0.58    0.00    0.97     1  vmtoolsd
20时16分05秒     0       760    0.00    0.39    0.00    0.39     1  kworker/1:1H
20时16分05秒     0      1043    6.81    3.11    0.00    9.92     0  containerd
20时16分05秒     0      6449    0.00    8.37    0.00    8.37     0  kworker/u256:1
20时16分05秒     0     16847    0.00    0.39    0.00    0.39     0  kworker/u256:2
20时16分05秒     0     17425    0.00    2.33    0.00    2.33     0  kworker/0:0
20时16分05秒     0     17865    0.00    3.11    0.00    3.11     1  kworker/1:2
20时16分05秒     0     17974    0.00    0.19    0.00    0.19     0  watch
20时16分05秒     0     18043    0.39   49.22    0.00   49.61     1  stress-ng-hdd
20时16分05秒     0     18044    0.19   28.21    0.00   28.40     1  stress-ng-io
20时16分05秒     0     18048    0.00    0.19    0.00    0.19     0  kworker/0:4
20时16分05秒     0     18284    0.19    0.39    0.00    0.58     1  pidstat

平均时间:   UID       PID    %usr %system  %guest    %CPU   CPU  Command
平均时间:     0         3    0.00    1.17    0.00    1.17     -  ksoftirqd/0
平均时间:     0         9    0.00    0.19    0.00    0.19     -  rcu_sched
平均时间:     0        13    0.00    1.36    0.00    1.36     -  ksoftirqd/1
平均时间:     0        30    0.00   12.45    0.00   12.45     -  kswapd0
平均时间:     0       669    0.39    0.58    0.00    0.97     -  vmtoolsd
平均时间:     0       760    0.00    0.39    0.00    0.39     -  kworker/1:1H
平均时间:     0      1043    6.81    3.11    0.00    9.92     -  containerd
平均时间:     0      6449    0.00    8.37    0.00    8.37     -  kworker/u256:1
平均时间:     0     16847    0.00    0.39    0.00    0.39     -  kworker/u256:2
平均时间:     0     17425    0.00    2.33    0.00    2.33     -  kworker/0:0
平均时间:     0     17865    0.00    3.11    0.00    3.11     -  kworker/1:2
平均时间:     0     17974    0.00    0.19    0.00    0.19     -  watch
平均时间:     0     18043    0.39   49.22    0.00   49.61     -  stress-ng-hdd
平均时间:     0     18044    0.19   28.21    0.00   28.40     -  stress-ng-io
平均时间:     0     18048    0.00    0.19    0.00    0.19     -  kworker/0:4
平均时间:     0     18284    0.19    0.39    0.00    0.58     -  pidstat

看到iowait高是由于stress-ng-hdd 和 stress-ng-io进程导致

场景三：大量进程的场景

当系统中运行进程超出CPU运行能力时，就会出现等待CPU的进程。

使用stress 模拟8个进程

[root@localhost ~]# stress -c 8 --timeout 600

因为系统只有2个cpu，少于8个进程，所以系统的cpu处于严重过载状态，

用 pidstat看进程的情况

CentOs系统 sysstat版本低，结果中没有出现iowait，说需要升级到11.5.5版本后，但Centos版本没有对应的版本

问题解决后，来更新。先用老师的例子截个图

Linux学习笔记之理解平均负载

猜你喜欢