02-18-14 - ans_fast implementation notes

Some notes about the ans_fast I posted earlier .

ans_fast contains a tANS (table-ANS) implementation and a rANS (range-ANS) implementation.

First, the benchmarking. You can compare to the more naive implementations I posted earlier . However, do not compare this tANS impl to Huffman or arithmetic and conclude "ANS is faster" because the tANS impl here has had rather more loving than those. Most of the tricks used on "ans_fast" can be equally used for other algorithms (though not all).

Here L=4096 to match the 12-bits used in the previous test. This is x64 on my lappy (1.7 Ghz Core i7 with turbo disabled). Compressed sizes do not include sending the counts. Time "withtable" includes all table construction but not histogramming or count normalization (that affects encoder only). ("fse" and "huf" on the previous page included table transmission and histogramming time)


tANS 768771 -> 435252.75

ticks to encode: 4.64 decode: 3.39
mbps encode: 372.92 decode: 509.63

withtable ticks to encode: 4.69 decode: 3.44
withtable mbps encode: 368.65 decode: 501.95

rANS 768771 -> 435980 bytes (v2)

ticks to encode: 6.97 decode: 5.06
mbps encode: 248.02 decode: 341.63

withtable ticks to encode: 6.97 decode: 5.07
withtable mbps encode: 247.92 decode: 341.27


tANS 513216 -> 78856.88

ticks to encode: 4.53 decode: 3.47
mbps encode: 382.02 decode: 497.75

withtable ticks to encode: 4.62 decode: 3.56
withtable mbps encode: 374.45 decode: 485.40

rANS 513216 -> 79480 bytes (v2)

ticks to encode: 5.62 decode: 3.53
mbps encode: 307.78 decode: 490.32

withtable ticks to encode: 5.63 decode: 3.54
withtable mbps encode: 307.26 decode: 488.88

First a note on file sizes : rANS file sizes are a few bytes larger than the "rans 12" posted last time. That's because that was a 32-bit impl. The rANS here is 64-bit and dual-state so I have to flush 16 bytes instead of 4. There are ways to recover some of those bytes.

The tANS file sizes here are smaller than comparable coders. The win comes from the improvements to normalizing counts and making the sort order. In fact, the +1 bias heuristic lets me beat "arith 12" and "rans 12" from the last post, which were coding nearly perfectly to the expected codelen of the normalized counts.

If you run "ans_learning" you will often see that the written bits are less than the predicted codelen :

H = 1.210176
CL = 1.238785
wrote 1.229845 bpb
this is because the +1 bias heuristic lets the codelens match the data better than the normalized counts do.

Okay, so on to the speed.

The biggest thing is that the above reported speeds are for 2x interleaved coders. That is, two independent states encoding the single buffer to a single compressed stream. I believe ryg will talk about this more soon. You can read his paper on arxiv now. Note that this is not just unrolling. Because the states are independent they allow independent execution chains to be in flight at the same time.

The speedup from interleaving is pretty huge (around 1.4X) :


rANS non-interleaved (v1)

ticks to encode: 26.84 decode: 7.33
mbps encode: 64.41 decode: 235.97

withtable ticks to encode: 26.85 decode: 7.38
withtable mbps encode: 64.41 decode: 234.19

rANS 2x interleaved (v1)

ticks to encode: 17.15 decode: 5.16
mbps encode: 100.84 decode: 334.95

withtable ticks to encode: 17.15 decode: 5.22
withtable mbps encode: 100.83 decode: 331.31

tANS non-interleaved

ticks to encode: 6.43 decode: 4.68
mbps encode: 269.10 decode: 369.44

withtable ticks to encode: 6.48 decode: 4.73
withtable mbps encode: 266.86 decode: 365.39

tANS 2x interleaved

ticks to encode: 4.64 decode: 3.39
mbps encode: 372.92 decode: 509.63

withtable ticks to encode: 4.69 decode: 3.44
withtable mbps encode: 368.65 decode: 501.95

But even non-interleaved it's fast. (note that interleaved tANS is using only a single shared bit buffer). The rest of the implementation discussion will use the non-interleaved versions for simplicity.

The tANS implementation is pretty straightforward.

Decoding one symbol is :

    struct decode_entry { uint16 next_state; uint8 num_bits; uint8 sym; };

    decode_entry * detable = table - L;

    #define DECODE_ONE() do { \
        de = detable + state; \
        nb = de->num_bits; \
        state = de->next_state; \
        BITIN_OR(bitin_bits,bitin_numbits,nb,state); \
        *toptr++ = (uint8) de->sym; \
    } while(0)

where BITIN_OR reads "nb" bits and ors them onto state.

With a 64-bit bit buffer, I can ensure >= 56 bits are in the buffer. That means with L up to 14 bits, I can do four decodes before checking for more bits needed. So the primary decode loop is :

        // I know >= 56 bits are available  
        // each decode consumes <= 14 bits

        // now >= 56 bits again

The fastest way I could find to do the bit IO was "big endian style". That's the next bits at the top of the word. Bits in the word are in order of bits in the file. This lets you unconditionally grab the next 8 bytes to refill, but requires a bswap (on little endian platforms). eg :

#define BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr) do { \
        ASSERT( bitin_numbits > 0 && bitin_numbits <= 64 ); \
        int64 bytesToGet = (64 - bitin_numbits)>>3; \
        uint64 next8 = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        bitin_ptr += bytesToGet; \
        bitin_bits |= (next8 >> 1) >> (bitin_numbits-1); \
        bitin_numbits += bytesToGet<<3; \
        ASSERT( bitin_numbits >= 56 && bitin_numbits <= 64 ); \
    } while(0)

The other nice thing about the bits-at-top style is that the encoder can put bits in the word without any masking. The encoder is :

    #define ENCODE_ONE() do { \
        sym = *bufptr--; ee = eetable+sym; \
        msnb = ee->max_state_numbits; \
        msnb += ( state >= ee->max_state_thresh ); \
        BITOUT_PUT(bout_bits,bout_numbits, state,msnb); \
        state = ee->packed_table_ptr[ state>>msnb ]; \
        } while(0)

    #define BITOUT_PUT(bout_bits,bout_numbits,val,nb) do { \
        ASSERT( (bout_numbits+nb) <= 64 ); \
        bout_bits >>= nb; \
        bout_bits |= ((uint64)val) << (64 - nb); \
        bout_numbits += nb; \
    } while(0)

the key interesting part being that the encoder just does BITOUT_PUT with "state", and by shifting it up to the top of the word for the bitio, it gets automatically masked. (and rotate-right is a way to make that even faster).

Similarly to the decoder, the encoder can write 4 symbols before it has to check if the bit buffer needs any output.

The other crucial thing for fast tANS is the sort order construction. I do a real sort, using a radix sort. I do the first step of radix sorting (generating a histogram), and then I directly build the tables from that, reading out of the radix histogram. There's no need to explicitly generate the sorted symbol list as an intermediate step. I use only an 8-bit radix here (256 entries) but it's not significantly different (in speed or compression) than using a larger radix table.

The rANS implementation is pretty straightforward and I didn't spend much time on it, so it could probably be faster (particularly encoding which I didn't spend any time on (ADDENDUM : v2 rANS now sped up and encoder uses fast reciprocals)). I use a 64-bit state with 32-bit renormalization. The basic decode operation is :

        uint64 xm = x & mask;   
        const rans_decode_table::entry & e = table[xm];
        x = e.freq * (x >> cumprobtot_bits) + e.xm_minus_low;
        buffer[i] = (uint8) e.sym;
        if ( x < min_x )
            x <<= 32;
            x |= *((uint32 *)comp_ptr);
            comp_ptr += 4;

One thing I should note is that my rANS decode table is 2x bigger than the tANS decode table. I found it was fastest to use an 8-byte decode entry for rANS :

    // 8-byte decode entry
    struct entry { uint16 freq; uint16 xm_minus_low; uint8 sym; uint16 pad; };
obviously you can pack that a lot smaller (32 bits from 12+12+8) but it hurts speed.

For both tANS and rANS I make the encoder write backwards and the decoder read forwards to bias in favor of decoder speed. I make "L" a variable, not a constant, which hurts speed a little.


Fabian 'ryg' Giesen said...

"The tANS file sizes here are much smaller than fse or rANS. If I'm fair and transmit the counts they should go up a bit. The rest of the win comes from the improvements to normalizing counts and making the sort order."
Well, your comparison vs. rANS in your code is fair since it uses the same normalized counts (and your rANS doesn't store them either); it's really just FSE that's penalized.

cbloom said...

Yeah, fse and "huf" ; revising to make that a little clearer...

Yann Collet said...

Interestingly :) the current "beta" branch of FSE feature some equivalent speed trick (the "ILP" mode store fractional bits into multiple accumulators), with approximately equivalent impact on speed. It's the same concept.

The rest of the code though is different, since your version seems more accurate during "preparation steps" (while fse favor speed for the time being).

Beta is expected to be pushed to master later this week.

Fabian 'ryg' Giesen said...

The paper went live a bit more than a day ago: http://arxiv.org/abs/1402.3392

I'll also go over this in my blog but the key ideas are all in there.

I thought ANS was kinda neat before, but the ability to do SIMD/GPU encoding/decoding (without the resulting bit stream being dependent on the SIMD width!) is fairly unique, and not something I ever expected to see in an entropy coder. (Note that the general interleaving construction given in the paper will not, generally, result in a bitstream that's SIMD-width independent; one-renormalize ANS is special in that regard.)

Ethatron said...

"The fastest way I could find to do the bit IO was "big endian style""

Can't you just invert the bit-sequence and the tables? With huffman coding you can convert the prefix-code into a postfix-code this way and do little-endian style decoding (from least to most significant bit and byte).

cbloom said...

@Ethatron - I don't see how to make that work here. Maybe someone can figure it out.

DudaJarek said...

Hi Charles,
As in many applications there is needed bitwise entropy coder - with probabilities varying from bit to bit, I wonder how ABS compares to arithmetic coding?
uABS or tABS should be accurate enough even for 8bit * 16 SIMD.

cbloom said...

Jarek, I just don't see a usage where that's compelling. ABS is not any faster than binary arithmetic coding, which is already fast. Yes you can SIMD it, which would make it very fast indeed, but you have to gather the probability model for each bit - AND those models can't be dependent on those bits. eg. you can't use it for coding the 8 bits in a byte, since each bit decode would be dependent on the previous. It could only be used for doing like many macroblock mode bits at the same time in a video coder. The usability of it is just vanishingly rare and not even a big speedup when it is usable.

DudaJarek said...

Charles, you are right - it can speedup (many times) encoding as encoder knows all probabilities, but there is a big issue with decoding - there cannot be any dependencies between probabilities of such e.g. 8 SIMD simultaneously decoded symbols.
The question is when such independence appears in real applications, especially video compression, where such speedup would translate into longer battery life of mobile devices.
As you have mentioned, there are used independent regions like macroblocks (or slices in h.265), what is indeed difficult to use here ... but there are also DCT coefficients, which, after subtracting predictions, are encoded independently(?) - couldn't we decode 8 of them simultaneously this way?
Such decoder would rather need a pass mode for some positions.

Fabian 'ryg' Giesen said...

Yes, you can SIMD/GPU-decode e.g. AC coefficient magnitudes. But you wouldn't want to use a binary coder there. Use rANS/tANS to code log(|x|) as one symbol then send sign+suffix bits raw.

One of the reasons to use binary coders over multi-symbol coders (aside from the speed) is that adaptive modeling of bits is easy and simple. But with ANS you lose that simplicity, on the encoder side anyway (with static models that's less of a concern because you have a two-pass split between gathering stats and encoding anyway).

Also, adaptive binary coders do a *lot* of model updates that need to be visible to everyone using that model afterwards, which forces sequence points and kills parallelism.

Finally, note that the symbols/models in a SIMD decoder needn't be completely independent;
there are *certain* kinds of dependencies between symbols that can be handled efficiently in a SIMD model.

For example, you can switch models in the middle of a SIMD "run", if all lanes can determine the location of model switch points (and thus which model they should be using) with a parallel prefix operation that only depends on the initial state vector. So you can do things like "use model 0 until you've seen three consecutive zero symbols, then switch to model 1".

But the sweet spot for that is still switching between a handful of multisymbol contexts, not between tons of binary contexts (since the switching step is comparatively expensive).

DudaJarek said...

Fabian, sure there is usually dependence in the binary case, but still e.g. tABS has a few advantages. For example let say 64 tables for different quantized probability of the least probable symbol on 64 states - a few kilobytes of tables for deltaH~0.0001 bits/symbol (or less than a kilobyte for perfect for cheap hardware implementation qABS):

Encoding: buffering (symbol,probability) we can use SIMD version - getting a few times speedup.

Decoding: tABS step is just a table use and BitOr - CABAC M-coder has additionally renormalization and addition - tABS should be at least 20% faster, and both Charles and Yann's implementations seem to be.
Additionally, sometimes there can appear independence of a few succeeding probabilities, what allows to use a few times faster SIMD variant - there can be a few separate decoding step variants depending on the number of bits to decode in this step.

DudaJarek said...

Here is an article about CABAC performance: http://iphome.hhi.de/marpe/download/iwk06_marpe_et_al.pdf
Its graphs end on 40MB/s, while directly using Charles or Yann's tANS implementation for binary case we would get ~70MB/s.
One reason is that CABAC renormalization is bit by bit, while tANS just takes the required number of bits once per step.

Fabian Giesen said...

Did you post the wrong link? The only graph in that paper is about the bit rate improvement of CABAC over CAVLC. There is no mention of absolute perf numbers that I can see, and their measurements were done on a 10-year old Pentium 4 - which is notable for having variable bit shifts that are 12x slower than integer adds in terms of latency.

DudaJarek said...

Fabian, I couldn't find a better paper, but look at its relatively complicated code (fig 2B) - especially bit-by-bit renormalization.
In comparison tANS is just a single table use and BitOr taking all bits at once - it is at least two "if"s less per step.

cbloom said...

Jarek, you're being silly. CABAC is not designed to be the fastest arithmetic encoder in software. To do fast arithmetic coding in software you would just use a range coder with byte or dword renormalization.

cbloom said...

For the record, here is rABS and binary arithmetic code :


including renorm and modeling. The speed will be nearly identical and depend on quirks of usage.

Yes tABS might be a little faster than rABS.

Fabian Giesen said...

tABS by itself means binary coder with static model though. That's not a very useful primitive.

DudaJarek said...

Charles, range coder is for large alphabets, while now I want to say that e.g. tABS has also advantages for the binary case - CABAC seems to be best for comparison here(?).

tANS step:
t = DecodingTable[m][x];
value = valMPS XOR t.symbol;
x = t.NewX + read_bits(t.nbBits);

CABAC decoding step from article:
R_LPS = RTAB[m][(R > > 6) & 3];
R_MPS = R - R_LPS;
if (V < (R_MPS < < BitsLeft))
{R=R_MPS; value=valMPS;
Rnorm = (R_MPS > > 8) XOR 1;}
{V = V – (R_MPS < < BitsLeft)
R = R_LPS; value = !valMPS
Rnorm = RnormTAB[R_LPS > > 3];}
R = R << Rnorm;
BitsLeft = BitsLeft - Rnorm;
if (BitsLeft < = 0)
{V = (V < < M_BITS) | read_bits(M_BITS);
BitsLeft = BitsLeft + M_BITS;}

cbloom said...

Please look at the code I posted at pastebin. Stop comparing to CABAC, it's not a fair comparison, it's designed for hardware. You need to compare equivalent things.

In tANS, "read_bits" is not free. It's actually hiding the "if" for renormalization.

DudaJarek said...

Fabian, single tABS is not interesting, but exactly like for CABAC, we can use multiple small tables for different quantized probabilities of less probable symbol.
For example use 32 small tables: t=decodingTable[m][x], where m = p/64 <= 1/2.

DudaJarek said...
This comment has been removed by the author.
DudaJarek said...

Charles, so what is a fair comparison?
regarding read_bits, e.g.:
{bits = bitBuffer & (1 < < k - 1);
bitBuffer = bitBuffer > > k;
Using 64 bit BitBuffer, every let say 4 symbols you check if you didn't get below 32 bits - if so, read 32 bits.
You need one branch per a few symbols.

Fabian Giesen said...

A fair comparison is what Charles just posted, or Matt Mahoney's fpaq[abc].

DudaJarek said...

Ok, Charles code ( http://pastebin.com/qqyLNXFA ) clearly shows that decoding should have practically the same speed in both cases for extremely accurate entropy coding. Encoding could be vectorized for rANS to get a few times speedup, but it is not an essential advantage.

However, for a bit less accurate cases like CABAC, tANS seems much simpler. For example 16 tables for 16 states (qANS from paper) to operate deltaH~0.001 bits/symbol would need 16*16=256 bytes (4 bits for new state, 1 bit for symbol, 2 bits for nbBits).

old rants