2/01/2014

02-01-14 - Understanding ANS - 3

I'm gonna take an aside from the slow exposition and jump way ahead to some results. Skip to the bottom for summary.

There have been some unfortunate claims made about ANS being "faster than Huffman". That's simply not true. In fact it should be obvious that it's impossible for ANS to be faster than Huffman, since ANS is a strict superset of Huffman: you can always implement your Huffman coder by putting the Huffman code tree into your ANS coder, so the speed of Huffman is always >= the speed of ANS.

In practice, the table-based ANS decoder is so extremely similar to a table-based Huffman decoder that they are nearly identical in speed, and all the variation comes from minor implementation details (such as how you do your bit IO).

The "tANS" (table ANS, aka FSE) decode is :


{
  int sym = decodeTable[state].symbol;
  *op++ = sym;
  int nBits = decodeTable[state].nbBits;
  state = decodeTable[state].newState + getbits(nBits);
}

while a standard table-based Huffman decode is :

{
  int sym = decodeTable[state].symbol;
  *op++ = sym;
  int nBits = codelen[sym];
  state = ((state<<nBits)&STATE_MASK) + getbits(nBits);  
}

where for similarity I'm using a Huffman code with the first bits at the bottom. In the Huffman case, "state" is just a portion of the bit stream that you keep in a variable. In the ANS case, "state" is a position in the decoder state machine that has memory; this allows it to carry fractional bits forward.

If you so chose, you could of course put the Huffman codelen and next state into decodeTable[] just like for ANS and they would be identical.
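To make that concrete: if you bake the Huffman codelen and the precomputed next-state base into decodeTable[], the Huffman decode loop becomes literally the tANS loop. Here is a minimal runnable sketch with a toy fixed code (a=0, b=10, c=11); the struct layout and helper names are my own for illustration, not from either library:

```c
#include <stdint.h>

#define TABLE_BITS 2
#define STATE_MASK ((1u << TABLE_BITS) - 1)

typedef struct { uint8_t symbol, nbBits, newState; } Entry;

/* Huffman-as-tANS table: newState = (state << nbBits) & STATE_MASK, precomputed. */
static const Entry decodeTable[4] = {
    { 'a', 1, (0u << 1) & STATE_MASK },   /* state 00: top bit 0  -> 'a' */
    { 'a', 1, (1u << 1) & STATE_MASK },   /* state 01: top bit 0  -> 'a' */
    { 'b', 2, (2u << 2) & STATE_MASK },   /* state 10             -> 'b' */
    { 'c', 2, (3u << 2) & STATE_MASK },   /* state 11             -> 'c' */
};

typedef struct { const uint8_t *p; uint32_t bitpos; } BitIn;

static uint32_t getbits(BitIn *b, int n)  /* MSB-first bit reader */
{
    uint32_t v = 0;
    while (n--) {
        v = (v << 1) | ((b->p[b->bitpos >> 3] >> (7 - (b->bitpos & 7))) & 1);
        b->bitpos++;
    }
    return v;
}

static void decode(BitIn *in, char *out, int count)
{
    uint32_t state = getbits(in, TABLE_BITS);   /* prime with the first bits */
    while (count--) {
        int sym   = decodeTable[state].symbol;  /* the tANS-shaped loop body */
        int nBits = decodeTable[state].nbBits;
        *out++ = (char)sym;
        state = decodeTable[state].newState + getbits(in, nBits);
    }
}
```

Priming state with the first TABLE_BITS bits and feeding codes MSB-first makes the loop body identical to the tANS decode above; only the table contents differ.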

So, let's see some concrete results comparing some decent real world implementations.

I'm going to compare four compressors :


huf = order-0 static Huffman

fse = Yann's implementation of tANS

rans = ryg's implementation of rANS

arith = arithmetic coder with static power of 2 cumulative frequency total and decode table

For fse, rans, and arith I use a 12-bit table (the default in fse.c)
huf uses a 10-bit table and does not limit code length

Runs are on x64 code, but the implementations are 32 bit. (no 64-bit math used)

Some notes on the four implementations will follow. First the raw results :


inName : book1
H = 4.527

arith 12:   768,771 ->   435,378 =  4.531 bpb =  1.766 to 1 
arith encode     : 0.006 seconds, 69.44 b/kc, rate= 120.08 mb/s
arith decode     : 0.011 seconds, 40.35 b/kc, rate= 69.77 mb/s

rans 12:   768,771 ->   435,378 =  4.531 bpb =  1.766 to 1 
rans encode      : 0.010 seconds, 44.04 b/kc, rate= 76.15 mb/s
rans decode      : 0.006 seconds, 80.59 b/kc, rate= 139.36 mb/s

fse :   768,771 ->   435,981 =  4.537 bpb =  1.763 to 1 
fse encode       : 0.005 seconds, 94.51 b/kc, rate= 163.44 mb/s
fse decode       : 0.003 seconds, 166.95 b/kc, rate= 288.67 mb/s

huf :   768,771 ->   438,437 =  4.562 bpb =  1.753 to 1 
huf encode       : 0.003 seconds, 147.09 b/kc, rate= 254.34 mb/s
huf decode       : 0.003 seconds, 163.54 b/kc, rate= 282.82 mb/s
huf decode       : 0.003 seconds, 175.21 b/kc, rate= 302.96 mb/s (*1)


inName : pic
H = 1.210

arith 12:   513,216 ->    79,473 =  1.239 bpb =  6.458 to 1 
arith encode     : 0.003 seconds, 91.91 b/kc, rate= 158.90 mb/s
arith decode     : 0.007 seconds, 45.07 b/kc, rate= 77.93 mb/s

rans 12:   513,216 ->    79,474 =  1.239 bpb =  6.458 to 1 
rans encode      : 0.007 seconds, 45.52 b/kc, rate= 78.72 mb/s
rans decode      : 0.003 seconds, 96.50 b/kc, rate= 166.85 mb/s

fse :   513,216 ->    80,112 =  1.249 bpb =  6.406 to 1 
fse encode       : 0.003 seconds, 93.86 b/kc, rate= 162.29 mb/s
fse decode       : 0.002 seconds, 164.42 b/kc, rate= 284.33 mb/s

huf :   513,216 ->   106,691 =  1.663 bpb =  4.810 to 1 
huf encode       : 0.002 seconds, 162.57 b/kc, rate= 281.10 mb/s
huf decode       : 0.002 seconds, 189.66 b/kc, rate= 328.02 mb/s

And some conclusions :

1. "tANS" (fse) is almost the same speed to decode as huffman, but provides fractional bits. Obviously this is a huge win on skewed files like "pic". But even on more balanced distributions, it's a decent little compression win for no decode speed hit, so probably worth doing everywhere.

2. Huffman encode is still significantly faster than tANS encode.

3. "rANS" and "arith" almost have their encode and decode speeds swapped. Round trip time is nearly identical. They use identical tables for encode and decode. In fact they are deeply related, which is something we will explore more in the future.

4. "tANS" is about twice as fast as "rANS". (at this point)

And some implementation notes for the record :


"fse" and "rans" encode the array by walking backwards.  The "fse" encoder outputs bits forwards and
consumes them backwards, while the "rans" encoder writes bits backwards and consumes them forwards.
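For concreteness, the backwards-encode / forwards-decode shape looks like this. A minimal byte-at-a-time rANS round trip, modeled on ryg's public ryg_rans (my own names and simplifications; not the benchmarked implementation):

```c
#include <stdint.h>

#define PROB_BITS 12
#define PROB_SCALE (1u << PROB_BITS)
#define RANS_L     (1u << 23)            /* normalization interval [L, L*256) */

/* Encode one symbol (cumulative start 'start', frequency 'freq').
   Symbols must be fed in REVERSE order; bytes are written backwards. */
static uint8_t *rans_enc_put(uint32_t *state, uint8_t *ptr,
                             uint32_t start, uint32_t freq)
{
    uint32_t x = *state;
    uint32_t x_max = ((RANS_L >> PROB_BITS) << 8) * freq;
    while (x >= x_max) {                 /* renormalize: shift out low bytes */
        *--ptr = (uint8_t)(x & 0xff);
        x >>= 8;
    }
    *state = ((x / freq) << PROB_BITS) + (x % freq) + start;   /* C(s,x) */
    return ptr;
}

/* After all symbols: store the final state as 4 bytes in front of the data. */
static uint8_t *rans_enc_flush(uint32_t state, uint8_t *ptr)
{
    ptr -= 4;
    ptr[0] = (uint8_t)state;
    ptr[1] = (uint8_t)(state >> 8);
    ptr[2] = (uint8_t)(state >> 16);
    ptr[3] = (uint8_t)(state >> 24);
    return ptr;
}

static uint32_t rans_dec_init(uint8_t **pptr)
{
    uint8_t *p = *pptr;
    uint32_t x = (uint32_t)p[0] | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    *pptr = p + 4;
    return x;
}

/* Decode one symbol; consumes bytes forwards. Linear symbol search for
   brevity -- a real decoder uses a 2^PROB_BITS lookup table instead. */
static int rans_dec_get(uint32_t *state, uint8_t **pptr,
                        const uint32_t *start, const uint32_t *freq)
{
    uint32_t x = *state;
    uint32_t slot = x & (PROB_SCALE - 1);
    int s = 0;
    while (slot >= start[s] + freq[s]) s++;
    x = freq[s] * (x >> PROB_BITS) + slot - start[s];          /* D(x) */
    while (x < RANS_L)                   /* renormalize: pull in low bytes */
        x = (x << 8) | *(*pptr)++;
    *state = x;
    return s;
}
```

Note the asymmetry: the encoder is fed the symbols in reverse and writes output bytes with *--ptr, while the decoder then consumes both symbols and bytes in forward order.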

"huf" and "fse" are transmitting their code tables.  "arith" and "rans" are not.  
They should add about 256 bytes of header to be fair.

"arith" is a standard Schindler range coder, with byte-at-a-time renormalization

"arith" and "rans" here are nearly identical, both byte-at-a-time, and use the exact same tables
for encode and decode.

All times include their table-build times, and the time to histogram and normalize counts.
If you didn't include those times, the encodes would appear faster.

"huf" here is not length-limited.  A huf decoder with a 12-bit table and 12-bit length-limited
codes (like "fse" uses) should be faster.
(*1 = I did a length-limited version with a non-overflow handling decoder)

"huf" here was implemented with PowerPC and SPU in mind.  A more x86/x64 oriented version should be
a little faster.  ("fse" is pretty x86/x64 oriented).

and todo : compare binary rANS with a comparable binary arithmetic coder.

36 comments:

Fabian 'ryg' Giesen said...

Note: that rANS implementation has an encoder that's much slower than necessary, and I don't think it's representative enough to judge algorithm speeds by it.

Yann Collet said...

Yeah, as stated earlier, Huffman can't be slower than ANS, since it is a restricted case of it.

I've received a faster open-source implementation of Huffman that I shall include soon into the comparative with FSE, phasing out the slower one.

cbloom said...

@ryg - yeah, fastest rANS seems to be a rapidly moving target!

These numbers are for a rANS impl that looks almost exactly like the arith impl, so in a sense they reflect the speed of just doing the basic operations to code.

But rANS seems to have properties that allow it to be optimized in ways that arith can't.

enotuss said...

Any source?

Fabian 'ryg' Giesen said...

cbloom: My various crazy-ass experiments aside :), the one obvious practical change was just to go from 32 bits to 31 bits resolution and precompute the reciprocals, and that's a fairly simple (and isolated) change.

Completely divide-free multi-symbol arithmetic coding is kind of nice.

enotuss: Updated version of the rANS code Charles is using is here: https://github.com/rygorous/ryg_rans (I'll probably blog about it too). The rest is various standard arithmetic/"range" coders.

enotuss said...

2Fabian 'ryg' Giesen

thx. rANS is most interesting part.

enotuss said...

2cbloom

According to source rans must be faster than AC and RC. Encode and decode, without tricks (2^scale_bits). Must be faster. Implementations not good ...

enotuss said...

Probably mess with precision.
32bit RC vs 64bit rANS

DudaJarek said...

Thanks Fabian and Charles,
I wonder how its speed compares with RC?
I have just started a separate thread - you can post benchmarks there:
http://encode.ru/threads/1870-Range-ANS-%28rANS%29-faster-direct-replacement-for-range-coding
Best,
Jarek

enotuss said...

2DudaJarek

Cannot register on encode.ru.

First part - base program for testing (speed measuring, file comparison) and copy implementation (with statistic table).
RC can be taken from 7zip as is.

And i suggest not using "2^n range" trick. It will not work for dynamic statistics.

ps just installed c++ ... its so hard to recall after so many years

Fabian 'ryg' Giesen said...

enotuss: "According to source rans must be faster than AC and RC. Encode and decode, without tricks (2^scale_bits)"

The "2^scale_bits" thing is not a "trick" in this case, unfortunately. The rANS construction *requires* that the denominator of the probability distribution ('m' in the paper) divide the lower bound of your normalization interval ('l' in the paper) for decodability. You can use non-power-of-2 m (but now your decoding requires divs and mods too!), but it must still divide l, which means you can't change it easily at run time. Powers-of-2 for both m and l are the most practical compromise.

You might miss that in the paper since the presentation isn't exactly clear, to say the least.

As a general rule, the "source" is full of bold claims (like "faster than Huffman") with absolutely no hard data to back any of it up.

No offense to Jarek, but while I absolutely agree the algorithm is very cool and interesting, he's been continually making unfounded (and often untrue) claims about the performance of his algorithms: No, tANS is *not* faster than a table-driven Huffman decoder (see other posts by Charles). Comparing your highly-optimized pet algorithm against a sub-optimal Huffman implementation is hardly scientific (not that there's any actual numbers in the paper). Similarly, while the ABS-based (binary) coders are interesting, I have yet to see an implementation that actually beats a conventional binary range coder (or current state-of-the-art table-driven coders like the M-coder for tABS). So far, for everything I've seen, they're either the same or ABS slightly slower.

The paper also has several serious problems: no proof of the fact that under C, x is monotonically non-decreasing, which is crucial for the renormalization strategy to work; no mention of the actual tests to ensure b-uniqueness in rABS/rANS, or any proof thereof; the definition of b-uniqueness is a mess (the fact that an interval with the required three properties has the form {l, l+1, ..., b*l-1} requires proof - admittedly a short one - yet in the paper it's just stated as part of the definition).

And then there's the awful "cryptography" section in the paper. No analysis of biases in the output stream, no discussion of attacks (in fact, it reads like Jarek has no idea of what cryptographic attacks look like), nothing. Just 3 images, and some hand-waving of "I don't know how to crack this so I guess it's secure". That's not how cryptography works. Everyone can design an algorithm that they themselves can't break.

Jarek: "I wonder how its speed compares with RC?"

That's what this entire post is about - "arith" is a range coder.

enotuss said...

2Fabian 'ryg' Giesen

>Powers-of-2 for both m and l are the most practical compromise

In most cases, data model probabilities are array of counters.
m=sum(counters)
If m=2^k then you need rescale counters. Rescale needs division. If probabilities are dynamic rescale will take all time.

m=2^k is special case for static probabilities.

Direct comparison RC and rANS must use any m.


>You might miss that in the paper

I dont get most from paper. Only two rANS formulas: encode and decode. Idea of rANS is simple and clear.

enotuss said...

2Fabian 'ryg' Giesen

>the fact that an interval with the required three properties has the form {l, l+1, ..., b*l-1} requires proof

You dont need this. You can rescale as you wish, if you ensure same rescaling in encoder and decoder. But if x<lMax (max range or max counter) you will lose compression.

Fabian 'ryg' Giesen said...

enotuss: I know how standard models look, and I know that arbitrary m would be desirable. What I'm trying to tell you is that the "streaming" (finite-precision) versions of rANS just can't do it, not as given in the paper anyway. The renormalization algorithm Jarek describes requires l=k*m for some positive integer k. The infinite-precision version does not, but that's little use if you want to write a fast encoder/decoder to say the least.

Furthermore, once you support arbitrary m, the decoder now also needs to do a div/mod operation.

"Rescale needs division. If probabilities are dynamic rescale will take all time.

m=2^k is special case for static probabilities.

Direct comparison RC and rANS must use any m."

Exactly! The rANS encoder/decoder I posted can only support static probability distributions (efficiently), and the same goes for the table-based (tANS) variants. You can have several baked tables and switch between them quickly, but adaptive updates are off the table for the non-binary variants.

What Charles is measuring is rANS against an arithmetic (range) coder that also assumes m is a power of two, so it's a fair comparison. However, the general arithmetic coder can support arbitrary m easily while rANS cannot. So yes, comparing rANS vs. AC directly and claiming ANS is faster is disingenuous. That's precisely what I'm complaining about when I say Jarek's over-selling it (on that axis anyway).

enotuss said...

2Fabian 'ryg' Giesen

>The renormalization algorithm Jarek describes requires l=k*m for some positive integer k.

You dont need this. You dont need to follow paper.


>Furthermore, once you support arbitrary m, the decoder now also needs to do a div/mod operation.

rANS with arbitrary m uses one div+mod for encoder and decoder:
C(s,x) = m*floor(x/ls) + bs + mod(x,ls)
D(x) = (s, ls*floor(x/m) + mod(x,m) - bs)

RC uses one division too.

So for arbitrary m rANS must be faster than RC and, i think, rANS must give little better compression (due to limited RC precision).
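In exact (infinite-precision) arithmetic those two maps do invert each other; a quick sketch checking that, with floor division written out (helper names are mine, and there is no renormalization here):

```c
#include <stdint.h>

/* One exact rANS step with arbitrary total m:
   ls[s] = freq of symbol s, bs[s] = cumulative freq below s, m = sum(ls). */
static uint64_t rans_C(uint64_t x, uint64_t m, uint64_t ls, uint64_t bs)
{
    return m * (x / ls) + bs + (x % ls);     /* one div+mod to encode */
}

static uint64_t rans_D(uint64_t x, uint64_t m, const uint64_t *ls,
                       const uint64_t *bs, int nsyms, int *s_out)
{
    uint64_t slot = x % m;                   /* one div+mod to decode */
    int s = nsyms - 1;
    while (bs[s] > slot) s--;                /* symbol owning this slot */
    *s_out = s;
    return ls[s] * (x / m) + slot - bs[s];
}
```

The streaming difficulty ryg describes is entirely about renormalization, not about these two formulas.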

DudaJarek said...

Hi Fabian,
The paper was supposed to be theoretical, I wrote "combining Huffman speed" in the title, and in the abstract I have clearly referred to Yann's implementation, which back then seemed to be a fast implementation (I haven't made tests myself) - I will definitely be more careful in the next version.

Regarding tABS, Pascal Massimino (skal) has written that after some "obvious optimizations" the speed of his version of tANS has significantly improved - unfortunately he hasn't posted improved results for the binary version: https://groups.google.com/a/webmproject.org/d/msg/codec-devel/idezdUoV1yY/dPqqSeaQhhYJ

The fact that C(s,x) increases with x seems obvious from construction?
The renormalization condition for uABS and range versions is discussed in Section 3.3.
Regarding cryptographic applications, I have made some larger discussion in the previous paper ( 0902.0271 ), but decided to leave only some basic remarks in this version - it just requires further work of someone with cryptographic background.
Thanks,
Jarek

enotuss said...

Speed of tANS is well covered by cbloom. Not much slower than Huffman.

Question is: how fast rANS is.

Fabian 'ryg' Giesen said...

enotuss: "You dont need this. You dont need to follow paper."

As long as you don't come up with a different renormalization algorithm that works with finite precision (or a laxer sufficient condition for b-uniqueness), yes I do.

Example: l=5, b=2, m=3 (note m does not divide l). Two symbols, l_0=2, l_1=1 and therefore b_0=0, b_1=2.

I = { 5,6,7,8,9 }

if you work out the I_s, you get:

I_0 = { 4,5,6 }
I_1 = { 1,2 }

I_0 isn't b-unique. You're toast. Suppose you're at x=7 and want to encode a 0. x=7 is not in I_0, so you emit a bit and divide x by 2. Now you have x=3 - still not in I_0, and there's no way to get back into it now, you're already below its lower bound.
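This kind of counterexample is easy to check mechanically. A small sketch (my own helper names) that enumerates the precursor sets I_s = { x : C(s,x) in I } and tests the b-unique form {k, ..., b*k-1}:

```c
#include <stdint.h>

/* C(s,x) is the exact rANS step with total m;
   I = { l, ..., b*l - 1 } is the normalization interval. */
static int C(int x, int m, int ls, int bs) { return m * (x / ls) + bs + x % ls; }

static int in_I(int y, int l, int b) { return y >= l && y < b * l; }

/* Collect I_s = { x : C(s,x) in I }; returns the member count. */
static int precursors(int m, int ls, int bs, int l, int b, int *out)
{
    int n = 0;
    for (int x = 1; x < b * l; x++)
        if (in_I(C(x, m, ls, bs), l, b))
            out[n++] = x;
    return n;
}

/* b-unique: the set must be exactly { k, k+1, ..., b*k - 1 } for some k. */
static int is_b_unique(const int *set, int n, int b)
{
    for (int i = 1; i < n; i++)
        if (set[i] != set[0] + i) return 0;   /* must be contiguous */
    return n > 0 && set[n - 1] == b * set[0] - 1;
}
```

Running it on these parameters gives I_0 = {4,5,6}, which fails the test; changing l to 6 (so that m=3 divides l) gives I_0 = {4,5,6,7} = {k,...,2k-1} with k=4, which passes.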

cbloom said...

Yeah, ryg's right - the "b-unique" issue is the crucial thing with ANS. I'll talk about it in the context of tANS in the next post or two.

@enotuss in general - rANS is super fast for order-0 coding of arrays. If you're doing order-0 arrays you should of course scale the frequencies to a power of 2 total. That's what the report here is comparing.

rANS for arbitrary frequency sums and for adaptive coding is a more questionable area (eg. could you use it for PPM?). Unclear if it's a win vs arith. The need to buffer and reverse the encoding side is definitely a pain. But it might still be a win when you mainly care about decoder speed.

Most modern adaptive compressors are binary arith only. Comparison of that to rABS is something we have yet to do.
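A crude sketch of that frequency-scaling step (mine; fse.c's normalization is considerably more careful about rounding):

```c
#include <stdint.h>

/* Scale raw counts so they sum to exactly 1 << scale_bits, keeping every
   symbol that occurred codable (count >= 1). */
static void normalize_freqs(const uint32_t *count, uint32_t *norm,
                            int nsyms, uint32_t scale_bits)
{
    uint32_t target = 1u << scale_bits;
    uint64_t total = 0;
    uint32_t sum = 0;
    int i, best = 0;
    for (i = 0; i < nsyms; i++) total += count[i];
    for (i = 0; i < nsyms; i++) {
        if (count[i] == 0) { norm[i] = 0; continue; }
        norm[i] = (uint32_t)(((uint64_t)count[i] * target) / total);
        if (norm[i] == 0) norm[i] = 1;   /* seen symbols must stay codable */
        sum += norm[i];
    }
    for (i = 1; i < nsyms; i++)
        if (norm[i] > norm[best]) best = i;
    /* dump the rounding error on the most frequent symbol; assumes it can
       absorb the correction (true unless the distribution is pathological) */
    norm[best] += target - sum;
}
```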

enotuss said...

2Fabian 'ryg' Giesen

hah It appears ensuring same rescaling in encoder and decoder is big problem.

Following solution is not good but will work:

x - state
b - shift bits (2^32 for example)
limit - limit (2^16 for example)

Decoding:
if(x<2^limit){x=(x<>b)
if(xNext>=2^limit){output_b_lower_bits(x};x=xNext}
else{x=encode_symbol(s,x)}

enotuss said...

comments system sucks


enotuss said...

x - state
b - shift bits (2^32 for example)
limit - limit (2^16 for example)

Decoding:
if(x<2^limit){x=(x shl b)|input_b_bits()}
(s,x)=decode_symbol(x)

Encoding:
xNext=encode_symbol(s,x shr b)
if(xNext>=2^limit){output_b_lower_bits(x);x=xNext}
else{x=encode_symbol(s,x)}

enotuss said...

Some addition to above code.

m=sum(counters)<2^32-1
b=32
limit=32
x precision=64

Covering corner case:

Decoding:
if(x<2^limit){x=((x-1) shiftLeft b)|input_b_bits()}
(s,x)=decode_symbol(x)

Encoding:
xNext=encode_symbol(s,x shiftRight b +1)
if(xNext>=2^limit){output_b_lower_bits(x);x=xNext}
else{x=encode_symbol(s,x)}

Fabian 'ryg' Giesen said...

enotuss: None of this works.

1. Supporting only one "shift" of b is not enough to make that work (e.g. take b=2, m=1024 and a symbol with frequency 512 - you need to shift 9 times).
2. You can get into cases where none of the options you're considering are viable (i.e. none lead to a valid x). My example above shows that for the renormalization as in the paper; your moving it around doesn't get rid of the problem, it just makes it happen elsewhere.

Discussing this further here is pointless. It might well be possible that there's a different way to renormalize that works with less restrictive constraints, but unless you can actually present a working implementation (not pseudocode) I'm not interested in wasting more time on this.

enotuss said...

I checked. My rescaling code works. Except for corner case. Big files (>1m) corrupt. Little better compression than carryless RC.

Can anybody help me fixing RC code? It work, but slow.
http://pastebin.com/gd8EcUM9

enotuss said...

2Fabian 'ryg' Giesen

>None of this works.

It works!! 1M wiki8 compress and decompress. Equal data. Only corner case needs fixing. I was afraid that corner case will show up on 200 byte data, but its working on 1M file.

One or two days, i will fix corner case, optimize code ...

Fabian 'ryg' Giesen said...

Try encoding a file consisting of one 'a' followed by 999999 'b's.

enotuss said...

File length 78890 bytes, first 'a', rest 'b'. Worked well with not fixed rANS.

org file size 78890
CompressorRC
cmp 12 bytes 0.001585 sec 0.015211% 47.463245 mb/s
dcm 78890 bytes 0.002393 sec 100.000000% 31.435408 mb/s
Good

CompressorRAns
cmp 12 bytes 0.002172 sec 0.015211% 34.642339 mb/s
dcm 78890 bytes 0.002423 sec 100.000000% 31.046799 mb/s
Good

enotuss said...

bigger file works too

org file size 1261520
CompressorRC
cmp 45 bytes 0.029616 sec 0.003567% 40.622053 mb/s
dcm 1261520 bytes 0.037891 sec 100.000000% 31.751321 mb/s
Good

CompressorRAns
cmp 44 bytes 0.034650 sec 0.003488% 34.720928 mb/s
dcm 1261520 bytes 0.039045 sec 100.000000% 30.812467 mb/s
Good

DudaJarek said...

enotuss,
I also thought about such reversed renormalization: transfer digits after encoding step (not decoding). Unfortunately, in this case we don't know how many digits we should transfer before decoding step ...
I couldn't think of any other practical renormalization, but the current one seems ok(?) - especially that for uABS, rABS, rANS you can use large b=2^8, 2^16 to take one or two bytes at once.
Alternatively, most of processors should support finding the number of the most significant "0"s ( http://en.wikipedia.org/wiki/Find_first_set ), which allows to immediately deduce the number of bits to transfer while decoding.

Regarding division, in Matt's fpaqc it was changed into multiplication thanks to storing 1/f in a table.
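That reciprocal trick looks roughly like this: precompute ceil(2^(32+l)/f) once (one real division at table-build time), and every later x/f becomes a multiply and two shifts. A sketch of the standard round-up reciprocal method (names are mine); it is exact for x < 2^31, which covers rANS state values:

```c
#include <stdint.h>

typedef struct {
    uint64_t mult;   /* ceil(2^(32+shift) / freq) */
    uint32_t shift;  /* ceil(log2 freq) */
    uint32_t freq;
} Reciprocal;

static Reciprocal rcp_init(uint32_t f)
{
    Reciprocal r;
    uint32_t l = 0;
    while ((1u << l) < f) l++;                        /* l = ceil(log2 f) */
    r.shift = l;
    r.mult  = ((1ull << (32 + l)) + f - 1) / f;       /* the only real div */
    r.freq  = f;
    return r;
}

static uint32_t rcp_div(uint32_t x, const Reciprocal *r)
{
    /* floor(x/f) as a 64-bit multiply plus shifts */
    return (uint32_t)(((uint64_t)x * r->mult) >> 32) >> r->shift;
}

static uint32_t rcp_mod(uint32_t x, const Reciprocal *r)
{
    return x - rcp_div(x, r) * r->freq;
}
```

With one Reciprocal stored per symbol frequency, the div/mod in the rANS encode step goes away entirely.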

enotuss said...

2DudaJarek

>which allows to immediately deduce the number of bits to transfer while decoding.

This was my first rescale way. But it needs bit buffers and bit counters variables. So it will be slow.

My currently used rescaling is faster in decoding. Only this stupid corner case with carry. Only way i see - use ugly ifs and reencoding tries for (x==2^32-1)...

enotuss said...

i write carry flag to stream and all work good, except loss of 0.01% compression.

enotuss said...

i implemented fast version of rANS
64 bit registers
sum(counters)=2^12
lookup table

decode take about 15 CPU clocks per byte
perfect compression

main decode loop:
http://pastebin.com/LM8Au01p

enotuss said...

difference between tANS and fast rANS - additional and+shift+mul+add

but rANS gets input by words, while tANS gets input by bits

so approximately mul+add (about 4 cpu clocks of rANS's 15 clocks) is all the difference, while rANS gives perfect compression

Yann Collet said...

> "huf" and "fse" are transmitting their code tables. "arith" and "rans" are not.
They should add about 256 bytes of header to be fair.

Quick comment, just in case it would matter : FSE exposes its internal functions, so it's possible to retrieve the size of the compressed stream "without the code tables" :
int FSE_compress_usingCTable (void* dest, const unsigned char* source, int sourceSize, const void* CTable);

In this case, only a small 4-byte header remains.
