There have been some unfortunate claims made about ANS being "faster than Huffman". That's simply not true. In fact it should be obvious that it's impossible for ANS to be faster than Huffman, since ANS is a strict superset of Huffman. You can always implement your Huffman coder by putting the Huffman code tree into your ANS coder, therefore the speed of Huffman is always >= that of ANS.
In practice, the table-based ANS decoder is so extremely similar to a table-based Huffman decoder that they are nearly identical in speed, and all the variation comes from minor implementation details (such as how you do your bit IO).
The "tANS" (table ANS, aka FSE) decode is :
int sym = decodeTable[state].symbol;
*op++ = sym;
int nBits = decodeTable[state].nbBits;
state = decodeTable[state].newState + getbits(nBits);
while a standard table-based Huffman decode is :
int sym = decodeTable[state].symbol;
*op++ = sym;
int nBits = codelen[sym];
state = ((state<<nBits)&STATE_MASK) + getbits(nBits);
where for similarity I'm using a Huffman code with the first bits at the top. In the Huffman case, "state" is
just a portion of the bit stream that you keep in a variable. In the ANS case, "state" is a position in the decoder
state machine that has memory; this allows it to carry fractional bits forward.
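Both decode loops lean on getbits(). For concreteness, here is a minimal MSB-first bit reader sketch; the struct and refill policy are my own illustration, not taken from either implementation (a real decoder refills from the input pointer inside the loop):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal MSB-first bit reader sketch. bitbuf holds the unconsumed
   bits left-justified; bitcount says how many are valid. */
typedef struct {
    uint64_t bitbuf;
    int      bitcount;
} BitReader;

static uint32_t getbits(BitReader *br, int nBits)
{
    if (nBits == 0) return 0;           /* avoid a 64-bit shift (UB) */
    assert(nBits <= br->bitcount);
    uint32_t bits = (uint32_t)(br->bitbuf >> (64 - nBits));
    br->bitbuf  <<= nBits;
    br->bitcount -= nBits;
    return bits;
}
```

A real implementation's speed largely comes down to how this refill is done, which is exactly the "minor implementation details" mentioned above.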
If you so chose, you could of course put the Huffman codelen and next state into decodeTable just like for ANS and they would be identical.
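As a toy illustration of that point (my own construction; the code, table size, and symbols are made up for the example), here is the 3-symbol Huffman code 'a'=0, 'b'=10, 'c'=11 expressed as a tANS-style decode table with a 2-bit state, run through exactly the tANS decode loop above:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy sketch: a Huffman code expressed as a tANS-style decode table.
   Code: 'a' = 0, 'b' = 10, 'c' = 11 ; the 2-bit decoder state holds
   the next 2 bits of the stream, first bit at the top. */
#define STATE_MASK 3

typedef struct { char symbol; int nbBits; int newState; } DecodeEntry;

/* For Huffman-as-ANS, newState is just (state<<nbBits)&STATE_MASK,
   precomputed so the decode loop is literally the tANS loop. */
static const DecodeEntry decodeTable[4] = {
    { 'a', 1, (0<<1)&STATE_MASK },  /* state 00 : code '0'  */
    { 'a', 1, (1<<1)&STATE_MASK },  /* state 01 : code '0'  */
    { 'b', 2, (2<<2)&STATE_MASK },  /* state 10 : code '10' */
    { 'c', 2, (3<<2)&STATE_MASK },  /* state 11 : code '11' */
};

/* Decode n symbols; for readability the bit stream is a "0"/"1"
   text string standing in for getbits(). */
static void toy_decode(const char *bits, char *out, int n)
{
    int pos = 0, state = 0;
    for (int i = 0; i < 2; i++)               /* prime the state */
        state = (state << 1) | (bits[pos++] - '0');
    while (n--) {
        *out++ = decodeTable[state].symbol;   /* the tANS decode loop */
        int nBits = decodeTable[state].nbBits;
        int got = 0;
        for (int i = 0; i < nBits; i++)
            got = (got << 1) | (bits[pos++] - '0');
        state = decodeTable[state].newState + got;
    }
    *out = 0;
}
```

The only thing that makes this table "Huffman" rather than general tANS is that newState is always a pure shift of the old state, so no fractional bits are carried.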
So, let's see some concrete results comparing some decent real world implementations.
I'm going to compare four compressors :
huf = order-0 static Huffman
fse = Yann's implementation of tANS
rans = ryg's implementation of rANS
arith = arithmetic coder with static power of 2 cumulative frequency total and decode table
For fse, rans, and arith I use a 12-bit table (the default in fse.c)
huf uses a 10-bit table and does not limit code length
Runs are on x64 code, but the implementations are 32 bit. (no 64-bit math used)
Some notes on the four implementations will follow.
First the raw results :
inName : book1
H = 4.527
arith 12: 768,771 -> 435,378 = 4.531 bpb = 1.766 to 1
arith encode : 0.006 seconds, 69.44 b/kc, rate= 120.08 mb/s
arith decode : 0.011 seconds, 40.35 b/kc, rate= 69.77 mb/s
rans 12: 768,771 -> 435,378 = 4.531 bpb = 1.766 to 1
rans encode : 0.010 seconds, 44.04 b/kc, rate= 76.15 mb/s
rans decode : 0.006 seconds, 80.59 b/kc, rate= 139.36 mb/s
fse : 768,771 -> 435,981 = 4.537 bpb = 1.763 to 1
fse encode : 0.005 seconds, 94.51 b/kc, rate= 163.44 mb/s
fse decode : 0.003 seconds, 166.95 b/kc, rate= 288.67 mb/s
huf : 768,771 -> 438,437 = 4.562 bpb = 1.753 to 1
huf encode : 0.003 seconds, 147.09 b/kc, rate= 254.34 mb/s
huf decode : 0.003 seconds, 163.54 b/kc, rate= 282.82 mb/s
huf decode : 0.003 seconds, 175.21 b/kc, rate= 302.96 mb/s (*1)
inName : pic
H = 1.210
arith 12: 513,216 -> 79,473 = 1.239 bpb = 6.458 to 1
arith encode : 0.003 seconds, 91.91 b/kc, rate= 158.90 mb/s
arith decode : 0.007 seconds, 45.07 b/kc, rate= 77.93 mb/s
rans 12: 513,216 -> 79,474 = 1.239 bpb = 6.458 to 1
rans encode : 0.007 seconds, 45.52 b/kc, rate= 78.72 mb/s
rans decode : 0.003 seconds, 96.50 b/kc, rate= 166.85 mb/s
fse : 513,216 -> 80,112 = 1.249 bpb = 6.406 to 1
fse encode : 0.003 seconds, 93.86 b/kc, rate= 162.29 mb/s
fse decode : 0.002 seconds, 164.42 b/kc, rate= 284.33 mb/s
huf : 513,216 -> 106,691 = 1.663 bpb = 4.810 to 1
huf encode : 0.002 seconds, 162.57 b/kc, rate= 281.10 mb/s
huf decode : 0.002 seconds, 189.66 b/kc, rate= 328.02 mb/s
And some conclusions :
1. "tANS" (fse) is almost the same speed to decode as huffman, but provides fractional bits. Obviously this is a huge win on skewed files like "pic". But even on more balanced distributions, it's a decent little compression win for no decode speed hit, so probably worth doing everywhere.
2. Huffman encode is still significantly faster than tANS encode.
3. "rANS" and "arith" almost have their encode and decode speeds swapped. Round trip time is nearly identical. They use identical tables for encode and decode. In fact they are deeply related, which is something we will explore more in the future.
4. "tANS" is about twice as fast as "rANS". (at this point)
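To make the rANS step behind point 3 a bit more concrete, here is a toy sketch of the core transform; this is my own minimal illustration (64-bit state, power-of-2 total M, no renormalization), not ryg's real byte-at-a-time implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Toy rANS sketch: 3 symbols, power-of-2 total M = 4, and no
   renormalization; a 64-bit state is plenty for a few symbols. */
#define M_BITS 2
#define M      (1u << M_BITS)

static const uint32_t freq[3] = { 2, 1, 1 };  /* 'a','b','c'    */
static const uint32_t cum[3]  = { 0, 2, 3 };  /* cumulative sums */
static const int slot2sym[4]  = { 0, 0, 1, 2 };

/* Encode: grow the state; symbols must be fed in reverse order so
   the decoder can read them forwards. */
static uint64_t rans_encode(uint64_t x, int s)
{
    return (x / freq[s]) * M + cum[s] + (x % freq[s]);
}

/* Decode: the exact inverse of the encode step. */
static uint64_t rans_decode(uint64_t x, int *s)
{
    uint32_t slot = (uint32_t)(x & (M - 1));
    *s = slot2sym[slot];
    return freq[*s] * (x >> M_BITS) + slot - cum[*s];
}
```

Note the encoder walks the symbols in reverse and the decoder recovers them forwards, which is the backwards-encoding property noted in the implementation notes; the slot/cumfreq lookup is also exactly the table a static arithmetic decoder uses.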
And some implementation notes for the record :
"fse" and "rans" encode the array by walking it backwards. "fse" writes its bits forwards and
consumes them backwards, while "rans" writes bits backwards and consumes them forwards.
"huf" and "fse" are transmitting their code tables. "arith" and "rans" are not.
They should add about 256 bytes of header to be fair.
"arith" is a standard Schindler range coder, with byte-at-a-time renormalization.
"arith" and "rans" here are nearly identical, both byte-at-a-time, and use the exact same tables
for encode and decode.
All times include their table-build times, and the time to histogram and normalize counts.
If you didn't include those times, the encodes would appear faster.
"huf" here is not length-limited. A huf decoder with a 12-bit table and 12-bit length-limited
codes (like "fse" uses) should be faster.
(*1 = I did a length-limited version with a non-overflow handling decoder)
"huf" here was implemented with PowerPC and SPU in mind. A more x86/x64 oriented version should be
a little faster. ("fse" is pretty x86/x64 oriented.)
and todo : compare binary rANS with a comparable binary arithmetic coder.