Tips for benchmarking a compressor

You're about to evaluate Oodle (thanks for having a look!) or some other compressor. Before you start, consider these tips :

1. Time only the compressor.

Place your time measurements only around the compressor. Not IO, not your parsing, not mallocs, just the compress or decompress calls. I understand that in the end what you care about is total time to load, but there can be a lot of issues there that need fixing, and they can cloud the comparison of just the compression part. eg. if your parsing is really slow, that will dominate the CPU time and hide the differences between the compressors.

2. Time what you actually care about.

If you care about decode time, time the decompression. If you care about encode time, time compression. If you care about round-trip time, add the two times. Compressors are not just "fast" or "slow" at both ends, you can't time encoding and decide that it's a fast or slow compressor if what you care about is decoding.

3. Choose the right options.

Most compressors have the ability to target slightly different use cases. The most common option is the ability to trade off encode time vs. compression ratio. So, if what you care about is smallest size, then run the compressor at its highest encode effort level. It can be tricky to get the options right in most compression libraries; we are woefully non-standardized and not well documented. Aside from the simple "level" parameter, there may be other options that are relevant to your goals, perhaps trading off decompressor memory usage, or decompression speed. With Oodle the best option is always to email us and ask what options will best suit your goals.

4. Run apples-to-apples (threads-to-threads) comparisons.

It can be tricky to compare compressors fairly. As much as possible they should be run in the same way, and they should be run in the way that you will actually use them in your final application. Don't profile them with threads if you will not use them threaded in your shipping application.

Threads are a common problem. Compressors should either be tested all threaded (if you will use threads in your final application), or all non-threaded. Unfortunately the defaults are not the same. "lzma" (7z) and LZHAM create threads by default. You have to change their options to tell them to *not* create threads. The normal Oodle_Compress calls will not use threads by default, you have to specifically call one of the _Async threaded routines. (my personal preference is to benchmark everything without threads to compare single-threaded performance, and you can always add threads for production use)

5. Take the MIN of N run times.

To get reliable timing, you need to run the loop many times, and take the MIN of all times. The min will give you the time it takes when the OS isn't interrupting you with task switches, the CPU isn't clocking-down for speedstep, etc. I usually do 30 *per core* but you can probably get a way with a bit less.

6. Wipe the cache.

Assuming you are now doing N loops, you need to invalidate the cache between iterations. If you don't, you will be running the compressor in a "hot cache" scenario, with some buffers already in cache.

7. Don't pack a bunch of files together in a tar if that's not how you load.

It may seem like a good way to test to grab your bunch of test files and pack them together in a tar (or zip -0 or similar package) and run the compression tests on that tar. That's a fine option if that's really how you load data in your final application - as one big contiguous chunk that must be loaded in one big blob. But most people don't. You need to test the compressors in the same way they will be used in the final application. If you load whole file at a time, test the compressors on whole file units. Many people do loading on some kind of paging unit, like perhaps 1 MB chunks. If you do that, then test the compressor on the same thing.

8. Choose your test set.

If you could test on the entire set of buffers that your final application will load, that would be an accurate test. (though actually, even that is a bit subtle, since some buffers are more latency sensitive than others, so for example you might care more about the first few things you load to get into a running application as quickly as possible). That's probably not practical, so you want to choose a set that is representative of what you will actually load. Don't exclude things like already-compressed files (JPEGs and so on) *if* you will be running them through the compressor. (though consider *not* running them your compressed-file loading path, in which case you should exclude them from testing). It's pretty hard to get an accurate representative sample, so it's generally best to just get a variety of files and look at individual per-file results.

9. Look at the spectrum of results, not the sum.

After you run on your test set, don't just add up the compressed sizes and times to make a "total" result. Sums can be misleading. One issue is there are some large incompressible files, they can hide the differences on the more compressible files. But a bigger and more subtle trap is the way that sums weight the combination of results. A sum is a weighting by the size of each file in the test set. That's fine if your test set is all of your data, or is a perfectly proportionally representative sampling of all of your data (a subset which acts like the whole). But most likely it's not. It's best to keep the results per file separate and just have a look at individual cases to see what's going on, how the results differ, and try not to simplify to just looking at the sum.

10. If you do sum, sum *time* not speed, sum *size* not ratio.

Speed (like mb/s) and ratio (raw size/comp size) are inverted measures and shouldn't be summed. What you actually care about is total compressed size, and total time to decode. So if you run over a set of files, don't look at "average speed" or "average ratio" , because those are inverted meaures that will oddly weight the accumulation. Instead accumulate total time to decode, total raw size, and total compressed size, and then if you like you can make "overall speed" and "overall ratio" from those total.

11. Try not to malloc in the timing loop.

Your malloc might be fast, it might be slow, it's best to not have that as a variable in the timing. In general try to allocate the memory for the compressor or decompressor outside of the timing loop. (In Oodle this is done by passing in your own pointer for the "decoderMemory" argument of OodleLZ_Decompress). That would be an unfair test if you didn't also do that in the final application - so do it in the final application too! (similarly, make sure there's no logging inside the timing loop).

12. Consider excluding almost-incompressible files.

This is something you should consider for final shipping application, and if you do it in your shipping application, then you should do it for the benchmark too. The most common case is already-compressed files like JPEG images and MP3 audio. These files can usually be compressed slightly, maybe saving 1% of their size, but the time to decode them is not worth it overall - you can get more total size savings by running a more powerful compressor on other files. So it's most efficient to just send them uncompressed.

13. Tiny files should either be excluded or packed together.

There's almost never a use case where you really want to compress tiny files (< 16k bytes or so) as independent units. There's too much per-unit overhead in the compressor, and more importantly there's too much per-unit overhead in IO - you don't want to eat a disk seek to just to get one tiny file. So in a real application tiny files should always be grouped into paging units that are 256k or more, a size where loading them won't just be a total waste of disk seek time. So, when benchmarking compressors you also shouldn't run them on tiny independent files, because you will never do that in a shipping application.

1 comment:

Rich Geldreich said...

This is a really good list!

I need to wipe the caches between runs in my benchmark. Hmm - now I'm wondering, how do I do this reliably (and elegantly if possible) on a wide range of CPU's.

old rants