Benchmarks


Units of Measurement of Speed

The common CD-ROM (74 minute, 12 cm, ISO-9660) can store 681984000 bytes, which is approximately 650 MB.

(650 bytes * (1024 * 1024)) / (124 * (kbit / s)) = 11.9283154 hours

Note that Google uses powers of 2 for unit prefixes K, M, and G (Kilo, Mega, Giga).

1 KB 1024 bytes "1 KB in bytes"
1 MB 1048576 bytes "1 MB in bytes"
1 GB 1073741824 bytes "1 GB in bytes"
1 TB 1099511627776 bytes "1 TB in bytes"

Google Calculator search expressions for calculating bandwidth:
(650 bytes*(1024*1024))/(124*(kbit/s))
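
The same figure can be reproduced locally with awk. This is a minimal sketch, assuming the same conventions Google uses (1 MB = 1048576 bytes and 1 kbit = 1024 bits):

awk 'BEGIN { bits = 650*1024*1024*8; rate = 124*1024; printf ("%.7f hours\n", bits / rate / 3600) }'
11.9283154 hours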

<form name="input" action="http://www.google.com/search" method="get"> Google query: <input type="text" name="q" /> <input type="submit" value="submit" /> </form>

Remember when putting units in a formula that you must not label every scalar number. For example:

correct: 650 bytes * 1024 * 1024 = 650 MB
wrong: 650 bytes * 1024 bytes * 1024 bytes = 681574400 Bytes³

`dd` units

Note that when the `dd` command prints a total of MB/s it uses 1000000 Bytes = 1 MB, not 1048576 Bytes (almost 5% difference); and it uses 1000000000 Bytes = 1 GB, not 1073741824 Bytes (almost 7% difference).
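
Those percentages are easy to check; a quick sketch, measuring the shortfall relative to the power-of-2 units:

awk 'BEGIN { printf ("MB: %.1f%%  GB: %.1f%%\n", (1048576-1000000)/1048576*100, (1073741824-1000000000)/1073741824*100) }'
MB: 4.6%  GB: 6.9%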

For example, a memory speed test using `dd` to copy 128 MB gives 609 MB/s, but it thinks you copied 134 MB:

# dd if=/proc/kcore of=/dev/shm/mem bs=$((1024*1024)) count=128
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 0.220513 s, 609 MB/s

Now if you use 1048576 Bytes for a MegaByte then you get a lower figure of only 580 MB/s:

echo "134217728 0.220513" | awk '{printf ("%4.2f MB/s\n", $1 / $2 / (1024*1024))}'
580.46 MB/s

But since `dd` does print the total bytes and the time, we can put that all together to get MB/s where a MegaByte is taken as 1048576 Bytes and 128 MB = 1024*1024*128 bytes:

dd if=/proc/kcore of=/dev/shm/mem bs=$((1024*1024)) count=128 2>&1 | grep "copied" | cut -f1,6 -d" " | awk '{printf ("%4.2f MB/s\n", $1 / $2 / (1024*1024))}'
565.95 MB/s

Benchmark CPU Speed

openssl speed aes-256-cbc
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 CPU         E7400  @ 2.80GHz
stepping        : 10
cpu MHz         : 2798.369
cache size      : 3072 KB

openssl speed aes-256-cbc
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256 cbc      83294.15k   102960.03k   108277.81k   110150.31k   111374.09k
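
The openssl numbers are in 1000s of bytes per second, so to compare them with the dd figures elsewhere on this page (1 MB = 1048576 bytes) convert a column like this; a small sketch using the 8192-byte figure above:

echo "111374.09" | awk '{printf ("%.2f MB/s\n", $1 * 1000 / (1024*1024))}'
106.21 MB/s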

Benchmark Memory Speed

This assumes your /dev/shm device has over 128 MB free. Remember, `dd` uses 1000000 Bytes = 1 MB.
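
To confirm there is enough free space on /dev/shm before running the test, a quick check:

df -h /dev/shm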

dd if=/proc/kcore of=/dev/shm/mem bs=$((1000*1000)) count=128
128+0 records in
128+0 records out
128000000 bytes (128 MB) copied, 0.221084 s, 579 MB/s

If you like 1048576 Bytes = 1 MB then that would be 552 MB/s. You can get this directly with this command:

dd if=/proc/kcore of=/dev/shm/mem bs=$((1024*1024)) count=128 2>&1 | grep "copied" | cut -f1,6 -d" " | awk '{printf ("%4.2f MB/s\n", $1 / $2 / (1024*1024))}'
556.27 MB/s
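
For reference, the 552 MB/s figure quoted above is just the first run's byte count divided by its elapsed time and by 1048576, following the same pattern as earlier:

echo "128000000 0.221084" | awk '{printf ("%4.2f MB/s\n", $1 / $2 / (1024*1024))}'
552.14 MB/s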

Benchmark Network Speed
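
One simple way to measure raw TCP throughput between two hosts is `iperf`; a minimal sketch, assuming iperf is installed on both machines and using a placeholder hostname:

iperf -s                      # on the server machine
iperf -c server.example.com   # on the client machine; prints bandwidth in Mbits/sec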

Benchmark Disk Speed

See also drive_speed_tests for a table of primitive test results on various storage devices.

Trivial Disk Speed Testing

Often I want to run simple read and write tests of a disk for sanity testing. Performance geeks will object to simple tests as being meaningless, but these are often good enough for quick comparison testing.

The /dev/urandom device is based on SHA1 which has a small, but significant computational expense. If you need random data then generate a source data file in shared memory (the special device mounted here: /dev/shm/). This is a virtual filesystem stored entirely in memory. You couldn't store the test data on a separate drive because then you would be including the read speed of that drive in the test. By putting the file on /dev/shm you test only copying from memory to a drive.

dd if=/dev/urandom of=/dev/shm/random-data.bin bs=104857600 count=1

You cannot use /dev/null as the input file because `dd` treats it the same as reading from a closed file and immediately quits the test. However, I have found that the test results are about the same when using /dev/zero as the input file as when using /dev/shm/random-data.bin. You might think that /dev/zero would short-circuit some part of the read logic in the kernel since it does not have to do any large memory copies, whereas reading from /dev/shm/random-data.bin must transfer a large chunk of memory to the drive. You might also wonder if the drive or kernel can somehow compress the data stream since it is all zeros, whereas random data cannot be compressed. Whatever may be happening under the hood, I have not found the /dev/zero results to be much faster than reading from /dev/shm/random-data.bin.
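
If you want to check this on your own hardware, here is a minimal sketch that runs the same sync'd write once with each input (using the same flags as the loop test below):

dd if=/dev/zero of=test_data.bin oflag=dsync conv=fdatasync bs=104857600 count=1 2>&1 | grep --only-matching -E "[0-9]+\.?[0-9]+ [kKmMgGtT]B/s"
dd if=/dev/shm/random-data.bin of=test_data.bin oflag=dsync conv=fdatasync bs=104857600 count=1 2>&1 | grep --only-matching -E "[0-9]+\.?[0-9]+ [kKmMgGtT]B/s"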

for x in $(seq 1 100); do echo $x $(dd if=/dev/zero of=test_data.bin oflag=dsync conv=fdatasync bs=104857600 count=1 2>&1 | grep --only-matching -E "[0-9]+\.?[0-9]+ [kKmMgGtT]B/s"); done | tee test_data_write_times.txt

You can then plot the write timing data with GNU plotutils and ImageMagick `display`:

cat test_data_write_times.txt | cut -f1,2 -d" " | graph -y 0 --bitmap-size 1024x768 -F HersheySans -T png | display -

Check your block size (even though it's always 512 Bytes)

# dd if=/dev/urandom of=TESTDATA count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000458444 s, 1.1 MB/s

Create a 1MB test data file (random bytes)

Oh, look! Now we're almost testing speed, since `dd` reports its own statistics.

dd if=/dev/urandom of=TESTDATA count=2048
2048+0 records in
2048+0 records out
1048576 bytes (1.0 MB) copied, 0.309145 s, 3.4 MB/s

This does the same, but will make the math easier in future tests. This sets the blocksize to 1MB and count to 1 block. It is good to see that `dd` doesn't show any odd behavior here. The speed is about the same.

# dd if=/dev/urandom of=TESTDATA bs=1048576 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.296067 s, 3.5 MB/s

Now create a 10 MB file. Setting the blocksize makes it clearer how big a file we want.

# dd if=/dev/urandom of=TESTDATA bs=10485760 count=1
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 2.39607 s, 4.4 MB/s

But the kernel will cache output data and keep writing after the process is done and has closed the file. How do we know the data is really all there? Add 'oflag=sync'. Note that this does slow down the total speed a little bit.

# dd if=/dev/urandom of=TESTDATA oflag=sync bs=10485760 count=1
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 2.62692 s, 4.0 MB/s

Syncing shouldn't make a big difference for large blobs of data, but for that to hold the dataset would have to be larger than the page cache, which is bounded by physical RAM. Even a 100 MB file is plenty small enough to fit; the kernel will cache the entire thing and then sync it to disk at its leisure.

So we still see a drop in speed when using sync even for large blocks. With smaller block sizes and larger block counts we begin to see a penalty -- from 4.0 MB/s down to 1.2 MB/s.

# dd if=/dev/urandom of=TESTDATA oflag=sync bs=4096 count=2560
2560+0 records in
2560+0 records out
10485760 bytes (10 MB) copied, 8.68235 s, 1.2 MB/s

What about the default blocksize of `dd`, 512 bytes? To get 10 MB we need 20480 blocks:

# dd if=/dev/urandom of=TESTDATA count=20480
20480+0 records in
20480+0 records out
10485760 bytes (10 MB) copied, 2.47843 s, 4.2 MB/s

That speed seems close to using one block with a size of 10485760 bytes, so blocksize does not seem to affect the speed very much. But with sync turned on that changes, and we can see that `dd` must be syncing the disk much more often. Ouch:

# dd if=/dev/urandom of=TESTDATA oflag=sync count=20480
20480+0 records in
20480+0 records out
10485760 bytes (10 MB) copied, 35.4203 s, 296 kB/s

Linux page cache uses 4096 bytes per page, but there doesn't seem to be any special relationship here:

# dd if=/dev/urandom of=TESTDATA oflag=sync bs=4096 count=2560
2560+0 records in
2560+0 records out
10485760 bytes (10 MB) copied, 8.62401 s, 1.2 MB/s

It's debatable whether one should use the 'fsync' or 'fdatasync' option. The 'fsync' option also makes sure filesystem metadata is written to disk. That makes more sense if you are testing the whole filesystem speed. The 'fdatasync' option only ensures the file's contents are on the disk. That would be better if you care about testing raw disk performance... But this is all theoretical. In general you won't see a difference, and these are primitive tests that you wouldn't want to take to a performance testing debate.
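
If you want to see the difference (or lack of one) for yourself, the two variants differ only in the conv option; a small sketch reusing the 10 MB test size from above:

dd if=/dev/zero of=TESTDATA conv=fsync bs=1048576 count=10 2>&1 | grep --only-matching -E "[0-9]+\.?[0-9]+ [kKmMgGtT]B/s"
dd if=/dev/zero of=TESTDATA conv=fdatasync bs=1048576 count=10 2>&1 | grep --only-matching -E "[0-9]+\.?[0-9]+ [kKmMgGtT]B/s"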

Uh oh...

This shows why primitive testing can be bad. You have to know what you are doing. Look at these terrible results. It turns out that `dd` reads one byte then writes one byte over and over until done.

# dd if=/dev/urandom of=TESTDATA bs=1 count=1048576
1048576+0 records in
1048576+0 records out
1048576 bytes (1.0 MB) copied, 6.32568 s, 166 kB/s

Things get even worse if you add the 'sync' option because now `dd` will read then write then sync the disk. Ouch. Super slow. It seems that disks are not designed to write one byte at a time and guarantee that the byte was actually written to the disk before going on to the next byte. This is only 1K of data! But it is nice to see that the sync and blocksize options actually do seem to do what they say they will -- tune performance.

# dd if=/dev/urandom of=TESTDATA oflag=sync bs=1 count=1024
1024+0 records in
1024+0 records out
1024 bytes (1.0 kB) copied, 1.42996 s, 0.7 kB/s

Some aliases for spot checks

Updated tests

Write block size = 8 MB. Total file size = 8 MB * 16 blocks = 128 MB:

dd if=/dev/zero of=test_data.bin oflag=dsync conv=fdatasync  bs=8388608 count=16 2>&1 | grep "copied" | cut -f1,6 -d" " | awk '{printf ("write-speed: %7.2f MB/s, ", $1 / $2 / (1024*1024))}' && \
dd if=test_data.bin iflag=direct conv=fdatasync of=/dev/null bs=8388608 count=16 2>&1 | grep "copied" | cut -f1,6 -d" " | awk '{printf ("read-speed:  %7.2f MB/s\n", $1 / $2 / (1024*1024))}'
write-speed:   93.94 MB/s, read-speed:   135.74 MB/s

Note that 128 MB = 32768 drive sectors, where 1 sector = 4096 bytes.

This can even be made into an alias:

alias test-drive-speed='dd if=/dev/zero of=test_data.bin oflag=dsync conv=fdatasync  bs=8388608 count=16 2>&1 | grep "copied" | cut -f1,6 -d" " | awk '\''{printf ("write-speed: %7.2f MB/s, ", $1 / $2 / (1024*1024))}'\'' && dd if=test_data.bin iflag=direct conv=fdatasync of=/dev/null bs=8388608 count=16 2>&1 | grep "copied" | cut -f1,6 -d" " | awk '\''{printf ("read-speed:  %7.2f MB/s\n", $1 / $2 / (1024*1024))}'\'''
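
The 4096-byte sector size assumed in the note above can be checked for your own drive. A sketch, assuming the drive is /dev/sda (adjust the device name):

blockdev --getss /dev/sda     # logical sector size in bytes
blockdev --getpbsz /dev/sda   # physical sector size in bytes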

Older tests

These tests give the write speed for a 100 MB file using block sizes of 4 KB, 100 KB, 1 MB, and 10 MB.

command         block size (human)   block size (bytes)   block size (MB)    * block count = 100 MB   Note
test-write-tt   4 KB                 4096 bytes           0.00390625 MB      * 25600 blocks           typical minimal sector size; typical Linux page size
test-write-sm   100 KB               102400 bytes         0.09765625 MB      * 1024 blocks            1/10th X
test-write-md   1 MB                 1048576 bytes        1.00000000 MB      * 100 blocks             1 X
test-write-lg   10 MB                10485760 bytes       10.00000000 MB     * 10 blocks              10 X
alias test-write-tt='dd if=/dev/urandom of=random_data.bin oflag=dsync conv=fdatasync bs=4096  count=25600 2>&1 | grep --only-matching -E "[0-9]+\.?[0-9]+ [kKmMgGtT]B/s"'
alias test-write-sm='dd if=/dev/urandom of=random_data.bin oflag=dsync conv=fdatasync bs=102400 count=1024 2>&1 | grep --only-matching -E "[0-9]+\.?[0-9]+ [kKmMgGtT]B/s"'
alias test-write-md='dd if=/dev/urandom of=random_data.bin oflag=dsync conv=fdatasync bs=1048576 count=100 2>&1 | grep --only-matching -E "[0-9]+\.?[0-9]+ [kKmMgGtT]B/s"'
alias test-write-lg='dd if=/dev/urandom of=random_data.bin oflag=dsync conv=fdatasync bs=10485760 count=10 2>&1 | grep --only-matching -E "[0-9]+\.?[0-9]+ [kKmMgGtT]B/s"'

Streaming big bursts or lots of little discrete chunks

So now we see that performance measurement depends on how we want to use the disk. Do we only care how fast we can write a giant blob of data? Or do we care how fast it can write lots of little blocks of data? If you are recording video then you probably care more about writing giant blobs of data. If you are writing small transactions in a log file (such as a server log) then you probably care more about small block performance.

If you really are just trying to test the raw disk write speed then you should just test a large burst. It really is the operating system's job to worry about how to manage lots of small requests and still get performance.

Read speed testing

For these tests I want a 10MB data file, so first I create a fresh one:

# dd if=/dev/urandom of=TESTDATA oflag=sync bs=10485760 count=1
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 2.69033 s, 3.9 MB/s

When using `dd` for read testing don't forget the 'iflag=direct' option. All files in Linux pass through the page cache, so successive testing of read performance on a file will represent the speed to read from cache, not disk. That is usually not what people want to see in speed tests, but it has its place. Note that the data still goes through a buffer since most disk IO goes through a DMA channel, but that buffer is in userspace, not the kernel, so the 'direct' option for reading a disk can actually speed up IO in some applications since you get rid of the kernel overhead.

# dd if=TESTDATA iflag=direct of=/dev/null
20480+0 records in
20480+0 records out
10485760 bytes (10 MB) copied, 3.58198 s, 2.9 MB/s

That seems slower than writing... Oh, I forgot to set the blocksize. Silly me. Ha! Quite a bit faster now:

# dd if=TESTDATA iflag=direct of=/dev/null bs=10485760
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 0.200038 s, 52.4 MB/s

Let's explore this a little bit. How bad is reading just 1K at a time? Still faster than writing, but not by much.

# dd if=TESTDATA iflag=direct of=/dev/null bs=1024
10240+0 records in
10240+0 records out
10485760 bytes (10 MB) copied, 1.65979 s, 6.3 MB/s

And then we can see a weird piece of machinery if we use the 'direct' flag wrong. It turns out that 'direct' requires alignment with the device block size (512 on my disk). See what happens when I set a blocksize that doesn't align with the disk block size:

# dd if=TESTDATA iflag=direct of=/dev/null bs=513
dd: reading `TESTDATA': Invalid argument
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000634682 s, 0.0 kB/s

That makes the direct flag a big pain in the ass to use. In fact, it's not really that useful in the real world since you usually don't want to avoid the kernel cache in the first place. So what's the point? Well, it does help with writing benchmark tools :-) And here we can see something even more mysterious.

Run a few times without 'iflag=direct'. Notice that the first run is slow, but subsequent runs are much faster because the data comes from Linux buffer cache:

# dd if=TESTDATA of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.394607 s, 26.6 MB/s
root@home: /root 0
# dd if=TESTDATA of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.0110787 s, 946 MB/s
root@home: /root 0
# dd if=TESTDATA of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.00873244 s, 1.2 GB/s
root@home: /root 0
# dd if=TESTDATA of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.00825178 s, 1.3 GB/s

We can flush the kernel cache so that subsequent tests should not be affected by the previous test, but this is not what happens. Notice in the following runs that later tests still get faster. Why is this? This is the effect of the cache built into the disk. I don't know how to suppress this artifact.

# echo 3 > /proc/sys/vm/drop_caches
root@home: /root 0
# dd if=TESTDATA of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.918925 s, 11.4 MB/s
root@home: /root 0
# echo 3 > /proc/sys/vm/drop_caches
root@home: /root 0
# dd if=TESTDATA of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.689502 s, 15.2 MB/s
root@home: /root 0
# echo 3 > /proc/sys/vm/drop_caches
root@home: /root 0
# dd if=TESTDATA of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.425424 s, 24.6 MB/s

Reading a different file should cause the cache on the disk to be flushed (depending on the size), but something weird is going on here:

# echo 3 > /proc/sys/vm/drop_caches
root@home: /root 0
# dd if=TESTDATA of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.180207 s, 58.2 MB/s
root@home: /root 0
# dd if=TESTDATA2 of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.601255 s, 17.4 MB/s
root@home: /root 0
# echo 3 > /proc/sys/vm/drop_caches
root@home: /root 0
# dd if=TESTDATA of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.222922 s, 47.0 MB/s
root@home: /root 0
# echo 3 > /proc/sys/vm/drop_caches
root@home: /root 0
# dd if=TESTDATA2 of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.672152 s, 15.6 MB/s
root@home: /root 0
# echo 3 > /proc/sys/vm/drop_caches
root@home: /root 0
# dd if=TESTDATA of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.211442 s, 49.6 MB/s
root@home: /root 0
root@home: /root 0
# dd if=/dev/urandom of=TESTDATA oflag=sync bs=10485760 count=1
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 2.63246 s, 4.0 MB/s
root@home: /root 0
# dd if=TESTDATA of=/dev/null bs=1048576
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.00826575 s, 1.3 GB/s

Well-known Benchmark Suites

Phoronix is probably the best known of the open test suites.

HardInfo is a GUI tool intended to give an overview of system hardware. It also contains a section with benchmarks. It is simple and quick and makes it easy to compare the performance of one machine to another. If run without an X DISPLAY it will generate an HTML text report describing the system and performance.