Big Data Testing Metrics (Part 2)

Date: 2023-05-19 23:05:04

1 Metrics

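Performance testing metrics and standards are not yet fully established. Different services call for different metrics and correspondingly different standards; for example, the metrics for an access-layer service differ from those for a backend service.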

2 Metric Descriptions

2.1 Load

(1) What is load

Load is an important metric for a Linux machine; it gives a direct picture of the machine's current state.

Here is how load is defined:

In UNIX computing, the system load is a measure of the amount of computational work that a computer system performs. The load average represents the average system load over a period of time. It conventionally appears in the form of three numbers which represent the system load during the last one-, five-, and fifteen-minute periods. (Wikipedia)

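In brief: on a UNIX system, the system load measures how much work the CPU is currently handling, and is defined as the average number of threads in the run queue over a given time interval. The load average is the machine's average load over a period of time. The lower this value, the better. An excessively high load can leave the machine unable to handle other requests and operations, and can even cause it to hang.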

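High load on Linux mainly comes from three sources: CPU usage, memory usage, and I/O consumption. Overuse of any one of them will drive the server's load up sharply.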

/proc/loadavg
The first three fields in this file are load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes. They are the same as the load average numbers given by uptime(1) and other programs. The fourth field consists of two numbers separated by a slash (/). The first of these is the number of currently executing kernel scheduling entities (processes, threads); this will be less than or equal to the number of CPUs. The value after the slash is the number of kernel scheduling entities that currently exist on the system. The fifth field is the PID of the process that was most recently created on the system.

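Roughly, this passage says that the first three fields of the loadavg file are load averages: the average number of jobs over the last 1, 5, and 15 minutes, where a job is either in the run queue (state R) or waiting for disk I/O (state D). Several points follow from this. The first three numbers in /proc/loadavg are the load1, load5, and load15 values. A load value is the average number of jobs over the corresponding period; load1, for example, is the average number of jobs over the past minute. "Job" is mainly a shell concept, close to a process group, so the wording here is arguably imprecise (as analyzed later, the accurate term would be tasks in the kernel, or threads in user space). Finally, only jobs in state R or D are counted; jobs in any other state are excluded.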
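The field layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference parser: the names LoadAvg, parse_loadavg, and read_loadavg are hypothetical, and the sample line is made up for the example rather than captured from a real machine.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    load1: float    # average over the last 1 minute
    load5: float    # average over the last 5 minutes
    load15: float   # average over the last 15 minutes
    running: int    # scheduling entities currently executing
    total: int      # scheduling entities that exist on the system
    last_pid: int   # PID of the most recently created process


def parse_loadavg(text: str) -> LoadAvg:
    """Parse the single line of text found in /proc/loadavg."""
    fields = text.split()
    # The fourth field has the form "running/total".
    running, total = fields[3].split("/")
    return LoadAvg(
        load1=float(fields[0]),
        load5=float(fields[1]),
        load15=float(fields[2]),
        running=int(running),
        total=int(total),
        last_pid=int(fields[4]),
    )


def read_loadavg(path: str = "/proc/loadavg") -> LoadAvg:
    """Read and parse the file on a Linux machine."""
    with open(path) as f:
        return parse_loadavg(f.read())


# Illustrative sample line (an assumption, not real output).
sample = "0.20 0.18 0.12 1/80 11206"
print(parse_loadavg(sample))
```

On a Linux machine, calling read_loadavg() returns the live values; the first three fields should match what uptime reports.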

(2) Normal load range for a machine

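How high a load counts as normal has always been contested, and different people hold different views. For a single CPU, some consider a load above 0.7 to be out of the normal range. Others hold that anything not exceeding 1 is fine. Still others consider a load below 2 on a single CPU acceptable.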
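Because these opinions are all stated per CPU, a raw load figure has to be divided by the CPU count before it can be compared with them. The sketch below does that and reports which of the three quoted thresholds a value exceeds. The function names are hypothetical, and treating os.cpu_count() as the divisor is an assumption made for this example (it counts logical CPUs).

```python
import os

# The three per-CPU opinions quoted above; none of them is authoritative.
THRESHOLDS = (0.7, 1.0, 2.0)


def load_per_cpu(load: float, cpus: int) -> float:
    """Normalize a load average by the number of CPUs."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return load / cpus


def exceeded_thresholds(per_cpu: float) -> list:
    """Return the quoted thresholds that the per-CPU load is above."""
    return [t for t in THRESHOLDS if per_cpu > t]


def current_load_per_cpu() -> float:
    """Per-CPU 1-minute load of this machine (Unix only)."""
    load1, _load5, _load15 = os.getloadavg()
    # os.cpu_count() may return None; fall back to 1 in that case.
    return load_per_cpu(load1, os.cpu_count() or 1)
```

For instance, a load of 6.0 on a 4-CPU machine gives 1.5 per CPU, which is above the 0.7 and 1.0 opinions but within the 2.0 one.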

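Why so many different views? Because machines differ in more than the CPU: the programs they run, their memory, and even the temperature of the machine room can all vary.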

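For example, some machines run large batch jobs on a schedule. During that window the load may spike quite high, while at other times it may be fairly low. Should we investigate that spike as a problem?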

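My suggestion is to build a baseline for the metric from your own machine's actual behavior (for example, the average over the past month). As long as day-to-day load does not stray too far above or below the baseline, it is acceptable; if the gap is large, manual investigation is probably needed.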
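One way to make the baseline idea concrete is sketched below. It is only an illustration under stated assumptions: the baseline is a plain mean of historical samples, and the 50% tolerance band is a hypothetical default chosen for the example, not a value from this article. The function names are likewise hypothetical.

```python
from statistics import mean


def build_baseline(samples: list) -> float:
    """Baseline as the mean of historical load samples (e.g. one month)."""
    if not samples:
        raise ValueError("need at least one sample")
    return mean(samples)


def deviates(current: float, baseline: float, tolerance: float = 0.5) -> bool:
    """True when current is more than tolerance * baseline away from baseline.

    tolerance=0.5 is an assumed default; tune it to your own machines.
    """
    return abs(current - baseline) > tolerance * baseline


history = [1.0, 2.0, 3.0]   # made-up samples for illustration
baseline = build_baseline(history)
print(deviates(2.5, baseline))  # within the band
print(deviates(4.0, baseline))  # outside the band
```

For machines with scheduled batch jobs, it may make sense to keep a separate baseline for the batch window so that an expected spike is not flagged.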

Still, a suggested threshold is useful. On this point, Ruan Yifeng (阮一峰) offers the following advice on his blog: