Load Average

From Noah.org
Revision as of 17:01, 11 July 2012 by Root (Talk | contribs) (Load versus CPU Usage)



What does Load Average as shown by `uptime` mean?

Load Average, as given by `uptime`, is the average number of processes that are running or waiting to run, that is, the average length of the system run queue. On a single CPU, a Load Average above 1.0 means that processes were forced to wait. Another way to think about it is as the number of CPUs you would need to handle the current load. If your Load Average is around 2.0 then the kernel could have handled all requests without waiting if you had two CPUs. This is a rule of thumb. Load is actually much more complex and differs between UNIX systems.

On a single-user system that is not doing much you should usually see the Load Average below 0.10. Anything below 1.0 is OK. If your Load Average hovers around 2.0 all day then you need another CPU. That's for a single core -- on a multi-core system you have to divide the Load Average by the number of cores. If you have two dual-core CPUs (4 cores) then a Load Average of 4.0 is the upper range of "OK". As a rule of thumb, a load close to 1.0 per core is barely OK -- you should keep some breathing room. If my Load Average stays over 0.50 per core for most of the day then I start to plan for extra capacity.
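The per-core rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the helper names `load_per_core` and `describe` are made up for this example, the 0.50 and 1.0 thresholds are the ones from the paragraph above, and `os.getloadavg()` is available only on Unix-like systems.

```python
import os

def load_per_core(load, cores):
    """Divide a load average by the number of cores (hypothetical helper)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores

def describe(load, cores, plan_threshold=0.50):
    """Classify a load average using the rule of thumb from the text."""
    per_core = load_per_core(load, cores)
    if per_core > 1.0:
        return "overloaded"          # processes are being forced to wait
    if per_core > plan_threshold:
        return "plan for capacity"   # still OK, but little breathing room
    return "ok"

if __name__ == "__main__":
    one, five, fifteen = os.getloadavg()   # the three numbers `uptime` prints
    cores = os.cpu_count() or 1            # cpu_count() may return None
    print(f"15-minute load {fifteen:.2f} on {cores} cores: "
          f"{describe(fifteen, cores)}")
```

With these thresholds a load of 4.0 on 4 cores comes out as "plan for capacity", which matches the "upper range of OK" reading above.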

Load versus CPU Usage

Load is not the same as CPU usage. If you have a single process using 99% of the CPU cycles your load will still hover around 1.0. To see a combination of CPU usage and Load Average run `top` or `procinfo`. You might have hundreds of processes open, but most of them should be idle (sleeping -- waiting to be interrupted to do something). Every shell window you have open and even the web browser you are using to read this are not doing much of anything. On my system a `ps ax | wc -l` shows that I have 150 processes open, but my Load Average is 0.13 and my CPU Idle time is 95%, so despite the number of processes my CPU is not working hard at all. For an example of this, run `top` and note all the processes open. Now press 'i'. This hides all the idle processes. You should see only one or two processes now and `top` will likely be one of them.

One easy way to get a feel for CPU usage is to look at the %idle of the CPU. That's the relative amount of time the CPU has spent doing nothing. Take that value and subtract it from 100% to see how much time the CPU has spent doing something useful. Figuring out what "useful" is can be a lot more complex. Linux breaks that time into user, nice, system (kernel), iowait, irq, softirq, and virtual processor steal.

uninterruptible sleep

Unfortunately, Load Average counts every process that is not sleeping. A process can be in a state called uninterruptible sleep, which is not counted as sleeping, so such a process contributes to the Load Average. These processes do not contribute to CPU usage, so they can push the Load Average up without actually slowing the system down. A process usually enters uninterruptible sleep while it waits for disk or network I/O, or for a system call to return. If the network goes down while a process is reading from an NFS mount then the process may get stuck in uninterruptible sleep.

To see this run:

ps maxwwo pid,ppid,lwp,pcpu,stat,etime,command

Observe the STAT column. If a process is in the D state then it is in uninterruptible sleep.
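If you would rather not scan the STAT column by eye, the filtering can be scripted. The following is a minimal sketch that assumes three-column input in the form produced by `ps axo pid,stat,command`. The function name `uninterruptible` and the sample text are made up for illustration.

```python
def uninterruptible(ps_output):
    """Return (pid, command) pairs whose STAT column starts with 'D'.

    Expects a header line followed by rows of 'PID STAT COMMAND',
    for example the output of `ps axo pid,stat,command`.
    """
    found = []
    for line in ps_output.splitlines()[1:]:   # skip the header row
        parts = line.split(None, 2)           # the command may contain spaces
        if len(parts) < 3:
            continue                          # ignore blank or short lines
        pid, stat, command = parts
        if stat.startswith("D"):              # D = uninterruptible sleep
            found.append((int(pid), command.rstrip()))
    return found

# Made-up sample output, for illustration only.
SAMPLE = """\
  PID STAT COMMAND
    1 Ss   /sbin/init
  812 D    cp /mnt/nfs/big.iso /tmp
  990 R+   ps axo pid,stat,command
 1044 D+   ls /mnt/nfs
"""

print(uninterruptible(SAMPLE))
```

To run it against a live system, pass in the text returned by `subprocess.run(["ps", "axo", "pid,stat,command"], capture_output=True, text=True).stdout`.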

/proc

All the good kernel info is accessible in the /proc filesystem. The file /proc/stat contains the statistics that the kernel records about CPU usage; the load average itself is reported in /proc/loadavg. These are not static files. Every time you read one it shows the latest values of the kernel statistics. Many of the measuring tools take information from /proc/stat and present it in an easy to read form. You can look at /proc/stat directly just by cat'ing the file:

cat /proc/stat
cpu  3036642 30370 666929 56555797 864944 19982 105267 0 0
cpu0 1634940 20751 369270 27422772 550582 19982 103219 0 0
cpu1 1401701 9619 297658 29133025 314361 0 2048 0 0
intr 235236340 46970713 0 0 0 0 0 0 0 61 0 0 0 0 0 3140273 0 12733177 93671868 9 1413708 0 7218308 10401382 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1418813 0 0
ctxt 321997790
btime 1230949026
processes 339983
procs_running 3
procs_blocked 0
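
The columns of each 'cpu' line are listed below, in order. Times are measured in units of USER_HZ (typically 1/100 of a second), not milliseconds.

  • user: normal processes executing in user mode
  • nice: niced processes executing in user mode
  • system: processes executing in kernel mode
  • idle: time spent doing nothing
  • iowait: time spent waiting for I/O to complete
  • irq: time spent servicing irqs
  • softirq: time spent servicing softirqs
  • steal: time taken from us (we are a virtual machine). This is time a hypervisor forced us to wait. NEW
  • guest: time we (as a host) spent running any of our guests' operating systems. NEW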

See "man -S 5 proc" for more information. The last two are newer stat fields: "steal" is reported since kernel version 2.6.11 and "guest" since version 2.6.24.

The 'intr' line starts with the total of all interrupts serviced since the kernel booted; each following column is the count for a specific interrupt number, starting at interrupt 0. This is an x86 CPU, so it has 224 interrupts.
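Most of the 'intr' columns are zero, so it helps to pull out only the interrupt numbers that have fired. Here is a small sketch. The function name `active_interrupts` is made up, and the sample is just the first few columns of the line shown above, so its total will not equal the sum of the columns listed.

```python
def active_interrupts(intr_line):
    """Return (total, {interrupt_number: count}) for non-zero columns.

    The first number after 'intr' is the grand total; the remaining
    columns are per-interrupt counts starting at interrupt 0.
    """
    fields = intr_line.split()
    if not fields or fields[0] != "intr":
        raise ValueError("not an intr line")
    total = int(fields[1])
    counts = {irq: int(n) for irq, n in enumerate(fields[2:]) if int(n) > 0}
    return total, counts

# First columns of the 'intr' line shown above, truncated for brevity.
SAMPLE = "intr 235236340 46970713 0 0 0 0 0 0 0 61 0 0 0"

total, counts = active_interrupts(SAMPLE)
print(total)    # 235236340
print(counts)   # {0: 46970713, 8: 61}
```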

Note that this shows a dual core CPU. The first 'cpu' line is the sum of the next two lines, 'cpu0' and 'cpu1'. These are total counts since the kernel first booted, so the numbers won't mean much unless you take the difference between two readings over a period of time.
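Taking the difference between two readings can be sketched as follows. The first reading is the aggregate 'cpu' line from the sample output above. The second reading is made up for illustration, and the names `parse_cpu_line` and `usage_percent` are hypothetical. A real tool would read /proc/stat twice with a sleep in between.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest")

def parse_cpu_line(line):
    """Turn one 'cpu...' line from /proc/stat into {field: ticks}."""
    name, *values = line.split()
    if not name.startswith("cpu"):
        raise ValueError("not a cpu line")
    # zip() stops at the shorter sequence, so extra columns printed by
    # newer kernels are ignored.
    return dict(zip(FIELDS, map(int, values)))

def usage_percent(before, after):
    """Percentage of time spent in each state between two readings."""
    # 'guest' is left out: the kernel already counts guest time as
    # user time, so adding it again would count it twice.
    delta = {k: after[k] - before[k] for k in before if k != "guest"}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("readings are identical or out of order")
    return {k: 100.0 * v / total for k, v in delta.items()}

# First reading: the aggregate 'cpu' line from the output above.
FIRST = "cpu  3036642 30370 666929 56555797 864944 19982 105267 0 0"
# Second reading: made-up values, 800 ticks later.
SECOND = "cpu  3036742 30370 666979 56556397 864994 19982 105267 0 0"

pct = usage_percent(parse_cpu_line(FIRST), parse_cpu_line(SECOND))
print(f"idle {pct['idle']:.1f}%  busy {100 - pct['idle']:.1f}%")
```

Here "busy" is simply 100% minus idle, so it includes iowait. Subtract `pct['iowait']` as well if you only want time the CPU spent executing code.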

mpstat

mpstat -P ALL 4

The command above reports statistics for every CPU at 4 second intervals. Very short update intervals might not be as accurate. It's better to use as long an averaging period as practical to smooth out noise. The activity of mpstat itself will raise the results in the intr/s column.

procstat

This tool is a little old. See mpstat.