cbloom rants: The Weissman Score

Wikipedia suggests the Weissman score should be

which ignoring constants is just W = r/logT

That's just wrong. You don't take a logarithm of something with units. But there are aspects of it that are correct. W should be proportional to r (compression ratio), and a logarithm of time should be involved. Just not like that.

I present a formula which I call the correct Weissman Score :


W = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )  -
    comp_ratio * log10( 1 + speed/(disk_speed_hi *comp_ratio) )

or

W = comp_ratio * log10( ( comp_ratio + speed/disk_speed_lo ) / ( comp_ratio + speed/disk_speed_hi ) )

You can have a Weissman score for encode speed or decode speed. It's a measure of space-speed tradeoff success.

I suggest the range should be 1-256. disk_speed_lo = 1 MB/s (to evaluate performance on very slow channels, favoring small files), disk_speed_hi = 256 MB/s (to evalue performance on very fast disks, favoring speed). And because 1 and 256 are amongst programmers' favorite numbers.

You could also just let the hi range go to infinity. Then you don't need a hi disk speed parameter and you get :


Weissman-infinity = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )

with disk_speed_lo = 1 MB/s ; which is neater, though this favors fast compressors more than you might like. While it's a cleaner formula, I think it's less useful for practical purposes, where the bounded hi range focuses the score more on the area that most people care about.

I came up with this formula because I started thinking about summarizing a score from the Pareto charts I've made . What if you took the speedup value at several (log-scale) disk speeds; like you could take the speedup at 1 MB/s,2 MB/s,4 MB/s, and just average them? speedup is a good way to measure a compressor even if you don't actually care about speed. Well, rather than just average a bunch of points, what if I average *all* points? eg. integrate to get the area under the curve? Note that we're integrating in log-scale of disk speed.

Turns out you can just do that integral :

    speedup = (time to load uncompressed) / (time to load compressed + decompress)
    speedup = (raw_size/disk_speed) / (comp_size/disk_speed + raw_size/ decompress_speed)
    speedup = (1/disk_speed) / (1/(disk_speed*compression_ratio) + 1 / decompress_speed)
    speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)
    speedup = 1 / (1/compression_ratio + exp( log_disk_speed ) / decompress_speed)
    speedup = compression_ratio / (1 + exp( log_disk_speed ) * compression_ratio/decompress_speed)
    speedup = compression_ratio * 1 / (1 + exp( log_disk_speed + log(compression_ratio/decompress_speed)))

speedup is a sigmoid :

    y = 1 / (1 + e^-x ) 
    
    Integral{y} = ln( 1 + e^x )

    x = - ( log_disk_speed + log(compression_ratio/decompress_speed) )

so substitute some variables and you get the above formula for the Weissman score.

In the final formula, I changed from natural log to log-base-10, which is just a constant scaling factor.

The Weissman (decode Core i7-3770 3.4 GHz; 1-256 range) scores on Silesia are :

lz4hc    : 6.243931
zstdmax  : 7.520236
lzham    : 6.924379
lzma     : 5.460073
zlib9    : 5.198510
Kraken   : 8.431461

Weissman-infinity scores are :

lz4hc    : 7.983104
zstdmax  : 8.168544
lzham    : 7.277707
lzma     : 5.589155
zlib9    : 5.630476
Kraken   : 9.551152

Goal : beat 10.0 !

ADD : this post was a not-sure-if-joking. But I actually think it's useful. I find it useful anyway.

When you're trying to tweak out some space-speed tradeoff decisions, you get different sizes and speeds, and it can be hard to tell if that tradeoff was good. You can do things like plot all your options on a space-speed graph and try to guess the pareto frontier and take those. But when iterating an optimization of a parameter you want just a simple score.

This corrected Weissman score is a nice way to do that. You have to choose what domain you're optimizing for, size-dominant slower compressors should use Weissman 1-256 , for balance of space and super speed use Weissman 1-inf (or 40-800), for the fast domain (LZ4-ish) use a range like 100-inf. Then you can just iterate to maximize that number!

For whatever space-speed tradeoff domain you're interested in, there exists a Weissman score range (lo-hi disk speed paramaters) such that maximizing the Weissman score in that range gives you the best space-speed tradeoff in the domain you wanted. The trick is choosing what that lo-hi range is (it doesn't necessarily directly correspond to actual disk or channel speeds; there are other factors to consider like latency, storage usage, etc. that might cause you to bias the lo-hi away from the actual channel speeds in some way; for example high speed decoders should always set the upper speed to infinity, which corresponds to the use case that the compressed data might be already resident in RAM so it has zero time to load).

cbloom rants

5/17/2016

The Weissman Score

No comments:

Post a Comment