The problem is amortized hashing. "Amortized" here just means we stop walking the hash chain after A steps. This limits the worst case. Technically it makes hash1 O(A*N), which is O(N) for constant A.
Without it, hash1 has disastrous worst cases, which makes it just not viable. The problem is that the "amortize" can hurt compression quite a bit and unpredictably so.
One perhaps surprising thing is that the value of A (the walk limit) doesn't actually matter that much. It can be 32 or 256 and either way you will save your speed from the cliff. Surprisingly, even an A of 1024 on a 128k window helps immensely.
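To make that concrete, here is a minimal sketch (in C) of a hash-chain matcher with an amortize limit. All of the names and sizes here (head, chain, WINDOW_SIZE, the A parameter) are illustrative, not taken from any actual coder; head[] initialization to NO_POS and window expiry are left out for brevity :

/* minimal sketch : hash-chain matcher with an "amortize" step limit
   (head[] must be initialized to NO_POS; window expiry is omitted) */
#define WINDOW_SIZE  (128*1024)
#define WINDOW_MASK  (WINDOW_SIZE-1)
#define HASH_BITS    15
#define HASH_SIZE    (1<<HASH_BITS)
#define NO_POS       (-1)

static int head[HASH_SIZE];    /* most recent position seen for each hash bucket */
static int chain[WINDOW_SIZE]; /* link to the previous position with the same hash */

/* add position 'pos' (whose bytes hash to 'h') at the front of its chain */
static void insert_pos(int pos, int h)
{
    chain[pos & WINDOW_MASK] = head[h];
    head[h] = pos;
}

/* walk the chain for hash 'h', stopping after at most A links;
   returns the best match position (or NO_POS) and its length in *pLen */
static int find_match(const unsigned char *buf, int bufLen, int pos,
                      int h, int A, int *pLen)
{
    int bestPos = NO_POS, bestLen = 0;
    int cur = head[h];
    int steps;
    for (steps = 0; steps < A && cur != NO_POS; steps++)
    {
        int len = 0;
        while (pos + len < bufLen && buf[cur + len] == buf[pos + len])
            len++;
        if (len > bestLen) { bestLen = len; bestPos = cur; }
        cur = chain[cur & WINDOW_MASK];  /* next older occurrence of this hash */
    }
    *pLen = bestLen;
    return bestPos;
}

The only difference from an unbounded matcher is the "steps < A" term in the loop condition; that single test is the whole "amortize".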
not "amortized" : file, encode, decode, out size lzt00, 225.556, 35.223, 4980 lzt01, 143.144, 1.925, 198288 lzt02, 202.040, 3.067, 227799 lzt03, 272.027, 2.164, 1764774 lzt04, 177.060, 5.454, 11218 lzt05, 756.387, 3.940, 383035 lzt06, 227.348, 6.078, 423591 lzt07, 624.699, 4.179, 219707 lzt08, 311.441, 7.329, 261242 lzt09, 302.741, 3.746, 299418 lzt10, 777.609, 1.647, 10981 lzt11, 1073.232, 4.999, 19962 lzt12, 3250.114, 3.134, 25997 lzt13, 101.577, 5.644, 978493 lzt14, 278.619, 6.026, 47540 lzt15, 1379.194, 9.396, 10911 lzt16, 148.908, 12.558, 10053 lzt17, 135.530, 5.693, 18517 lzt18, 171.413, 6.003, 68028 lzt19, 540.656, 3.433, 300354 lzt20, 109.678, 5.488, 900001 lzt21, 155.648, 3.605, 147000 lzt22, 118.776, 6.671, 290907 lzt23, 103.056, 6.350, 822619 lzt24, 218.596, 4.439, 2110882 lzt25, 266.006, 2.498, 123068 lzt26, 118.093, 7.062, 209321 lzt27, 913.469, 4.340, 250911 lzt28, 627.070, 2.576, 322822 lzt29, 1237.463, 4.090, 1757619 lzt30, 75.217, 0.646, 100001 "amortized" to 128 steps : file, encode, decode, out size lzt00, 216.417, 30.567, 4978 lzt01, 99.315, 1.620, 198288 lzt02, 85.209, 3.556, 227816 lzt03, 79.299, 2.189, 1767316 lzt04, 90.983, 7.073, 12071 lzt05, 86.225, 4.715, 382841 lzt06, 91.544, 6.930, 423629 lzt07, 127.232, 4.502, 220087 lzt08, 161.590, 7.725, 261366 lzt09, 119.749, 4.696, 301442 lzt10, 55.662, 1.980, 11165 lzt11, 108.619, 6.072, 19978 lzt12, 112.264, 3.119, 26977 lzt13, 103.460, 6.215, 978493 lzt14, 87.520, 5.529, 47558 lzt15, 98.902, 7.568, 10934 lzt16, 90.138, 12.503, 10061 lzt17, 115.166, 6.016, 18550 lzt18, 176.121, 5.402, 68035 lzt19, 272.349, 3.310, 304212 lzt20, 107.739, 5.589, 900016 lzt21, 68.255, 3.568, 147058 lzt22, 108.045, 5.867, 290954 lzt23, 108.023, 6.701, 822619 lzt24, 78.380, 4.631, 2112700 lzt25, 93.013, 2.554, 123219 lzt26, 108.348, 6.143, 209321 lzt27, 103.226, 3.468, 249081 lzt28, 145.280, 2.658, 324569 lzt29, 199.174, 4.063, 1751916 lzt30, 75.093, 1.019, 100001The times are in clocks per byte. In particular let's look at some files that are really slow without amortize :
no amortize :

file, encode, decode, out size
lzt11, 1073.232, 4.999, 19962
lzt12, 3250.114, 3.134, 25997
lzt15, 1379.194, 9.396, 10911
lzt27, 913.469, 4.340, 250911
lzt29, 1237.463, 4.090, 1757619

amortize 128 :

file, encode, decode, out size
lzt11, 108.619, 6.072, 19978
lzt12, 112.264, 3.119, 26977
lzt15, 98.902, 7.568, 10934
lzt27, 103.226, 3.468, 249081
lzt29, 199.174, 4.063, 1751916

Massive speed differences. The funny thing is that the only file where the compression ratio drastically changes is lzt12. It's changed by around 4%. (lzt29 is a bigger absolute difference, but only 0.34%)
So amortized hashing saved us massive amounts of time, and only cost us 4% on one file in the test set. Let me summarize the cases. There are three main classes of file :
1. 80% : Not really affected by the amortize limit at all. These files don't have lots of degeneracy, so there aren't a ton of links in the same hash bucket.
2. 15% : Very slow without amortize, but compression not really affected. These files have some kind of degeneracy (such as long runs of one character), but the most recent occurrences of that hash are the good ones, so the amortize doesn't hurt compression.
3. 5% : Has lots of hash collisions, and we do need the long offsets to make good matches. This case is rare.
So obviously it should occur to us that some kind of "introspective" algorithm could be used : somehow monitor what's happening and adjust your amortize limit (it doesn't need to be a constant), or switch to another algorithm for part of the hash table.
The problem is that we can't tell classes 2 and 3 apart without running the slow compressor. That is, it's easy to tell that you have a lot of long hash chains, but you can't tell whether they are needed for good compression without running the slow mode of the compressor.
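For what it's worth, one very simple form of such an introspective scheme (basically the "adaptive amortization" mentioned in the comments below) would be to feed recent match results back into the limit. This sketch only addresses the easy part, speeding up degenerate or incompressible regions; it does not solve the class 2 vs class 3 problem above, and the constants and feedback signal are purely hypothetical :

/* hypothetical : adapt the amortize limit from local history */
#define A_MIN 16
#define A_MAX 256

static int adapt_amortize(int curA, int lastMatchLen)
{
    if (lastMatchLen >= 8)
    {
        /* matches are paying off here; spend more chain steps */
        curA *= 2;
        if (curA > A_MAX) curA = A_MAX;
    }
    else
    {
        /* little or no match; back off so degenerate data stays fast */
        curA /= 2;
        if (curA < A_MIN) curA = A_MIN;
    }
    return curA;
}

You would call this once per match attempt and pass the result as A to the matcher; the limit then rises and falls with the local match history instead of being a single global constant.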
4 comments:
got it
Why do you call this "amortized"? It seems more like "truncated".
The first description of this that I ever saw back on comp.compression in the long long ago called it "amortized hashing" so I'm following that. If you look back at very early LZ77 coders they refer to "amortized hashing" but I don't believe it's a standard term in any other discipline.
BTW my string matchers now use "adaptive amortization". The amortize count changes based on local history to speed up incompressible regions.