5/11/2015

05-11-15 - ANS Minimal Flush

A detail for the record :

ANS (TANS or RANS) in the straightforward implementation writes a large minimum number of bytes.

To be concrete I'll consider a particular extremely bad case : 64-bit RANS with 32-bit renormalization.

The standard coder is :


initialize encoder (at end of stream) :

x = 1<<31

renormalize so x stays in the range x >= (1<<31) and x < (1<<63)

flush encoder (at the beginning of the stream) :

output all 8 bytes of x

decoder initializes by reading 8 bytes of x

decoder renormalizes via :

if ( x < (1<<31) )
{
  x <<= 32;  x |= get32(ptr); ptr += 4;
}

decoder terminates and can assert that x == 1<<31

this coder outputs a minimum of 8 bytes, which means it wastes up to 7 bytes on low-entropy data (assuming 1 byte minimum output and that the 1 byte required to byte-align output is not "waste").

In contrast, it's well known how to do minimal flush of arithmetic coders. When the arithmetic coder reaches the end, it has a "low" and "range" specifying an interval. "low" might be 64-bits, but you don't need to output them all, you only need to output enough such that the decoder will get something in the correct interval between "low" and "low+range".

Historically people often did arithmetic coder minimum flush assuming that the decoder would read zero-valued bytes after EOF. I no longer do that. I prefer to do a minimum flush such that decoder will get something in the correct interval no matter what byte follows EOF. This allows the decoder to just read past the end of your buffer with no extra work. (the arithmetic coder reads some # of bytes past EOF because it reads enough to fill "low" with bits, even though the top bits are all that are needed at the end of the stream).

The arithmetic coder minimum flush outputs a number of bytes proportional to log2(1/range) , which is the number of bits of information that are currently held pending in the arithmetic coder state, which is good. The excess is at most 1 byte.

So, to make ANS as clean as arithmetic coding we need a minimal flush. There are two sources of the waste in the normal ANS procedure outlined above.

One is the initial value of x (at the end of the stream). By setting x to (1<<31) , the low end of the renormalization interval, we have essentually filled it with bits it has to flush. (the pending bits in x is log2(x)). But those bits don't contain anything useful (except a value we can check at the end of decoding). One way to remove that waste is to stuff some other value in the initial state which contains bits you care about. Any value you initialize x with, you get back at the end of decoding, so then those bits aren't "wasted". But this can be annoying to find something useful to put in there, since you don't get that value out until the end of decoding.

The other source of waste is the final flush of x (at the beginning of the stream). This one is obvious - the # of pending bits stored in x at any time is log2(x). Clearly we should be flushing the final value of x in a # of bits proportional to log2(x).

So to do ANS minimal flush, here's one way :


initialize encoder (at end of stream) :

x = 0

renormalize so x stays in the range x < (1<<63)

flush encoder (at the beginning of the stream) :

output # of bytes with bits set in x, and those bytes

decoder initializes by reading variable # of bytes of x

decoder renormalizes via :

if ( x < (1<<31) )
{
  if ( ptr < ptrend )
  {
    x <<= 32;  x |= get32(ptr); ptr += 4;
  }
}

decoder terminates and can assert that x == 0

This ANS variant will output only 1 byte on very-low-entropy data.

There are now two phases of the coder. In the beginning of encoding (at the ending of the stream), x is allowed to be way below the renormalization range. During this phase, encoding just puts information into x, and the value of x grows. (note that x can actually stay 0 and never hold any bits if your consists of entirely the bottom symbol in RANS). Once x grows up into the renormalization interval, you enter the next phase where bits of x are pushed to the output to keep x in the renormalization interval. Decoding, in the first phase you read bytes from the stread to fill x with bits and keep it in the renormalization interval. Once the decoder read pointer hits the end, you switch to the second phase, and now x is allowed to shrink below the renormalization minimum and you can continue to decode the remaining information held in it.

This appears to add an extra branch to the decoder renormalization, but that can be removed by duplicating your decoder into "not near the end" and "near the end" variants.

The #sigbit output of x at the head is just the right thing and should always be done in all variants of ANS.

The checking ptr vs. ptrend and starting x = 0 is the variant that I call "minimal ANS".

Unfortunately "minimal ANS" doesn't play well with the ILP multi-state interleaved ANS. To do interleaved ANS like this you would need an EOF marker for each state. That's possible in theory (and could be done compactly in theory) but is a pain in the butt in practice.

9 comments:

Fabian Giesen said...

I think I mentioned this in a mail, but again for the record: initializing the state to 1 and letting it grow does not buy you as much as it does with a minimal-flush arithmetic coder.

With a minimum-flush ari, you can initialize an arithmetic coder, send 7 random bits with a uniform distrib, and when you flush you'll always get a single byte, no problem.

The rANS update is "x = ((x / freq) << prob_bits) + (x % freq) + base;". With a uniform binary distrib, one symbol will have base = (1 << prob_bits)/2, and so your x will instantly grow to prob_bits-1 bits and keep growing from there; so with a 16-bit probability resolution, a single bit can grow your output to 2 bytes.

This is purely an artifact of the rANS "output permutation". A tANS-style interleaved permutation (or uABS-style binary coder) grows x more "smoothly".

cbloom said...

Yup true.

The RANS approximation hurts more for smaller x (relative to prob_bits). So it's particularly bad at x=0 !

These are early days for ANS. I suspect in 20 years we'll have worked out standard ways of doing these details that are better. (it took ages to tweak out the details or arithmetic coding, 40 years or something? crazy)

Yann Collet said...

I initially tried to "store" something useful in the final decoded state value (the one by which encoding is started...). This way, on reaching the end of ANS decoding, there was still "something useful" to decode from the state.

But then, I realized that to properly finish decoding, I needed to indicate the number of symbols to decode from the ANS stream. This is, in itself, a header cost.

In contrast, by keeping the final state value at a known "zero", it embedded one way to check that the end of ANS stream is reached.

In my case, state is 9-11 bits, while stream size is a 2-3 bytes value. Therefore, it's a bit more advantageous to use the state.

But the difference is very minor.
So I'm not yet settled on the best way to proceed.
Let's just say that both costs are comparable, and I found no secure way to get rid of both.

Fabian Giesen said...

Not sure I understand what you mean here. Do you mean that you reserve one state value for EOF? How is that different from including an explicit EOF symbol in your alphabet?

Unless you stop normalizing near the end like Charles does (and for that, you need to know where the end is!), you'll be maintaining normalization the whole way through. But reserving a normalized state value to denote EOF really is just adding an EOF symbol to your alphabet. Or am I misunderstanding you?

If it is an EOF symbol, that's a perfectly reasonable thing to do (and a common choice) but it's not a fixed cost; yes, you pay ~11 bits near the end to actually send the EOF, but you also knock out one state you could otherwise use, which has a small but nonzero tax of -log2((2^11 - 1) / 2^11) = 0.0007 bits per symbol.

Say a block has 16384 symbols, then that adds up to about 11.54 bits overhead - plus the final expected 11 bits to send an EOF flag with probability 2^-11. Which comes out to... about 3 bytes. :) (Worse if your block has more symbols, better if it's less.)

Yann Collet said...

Hi Fabian

Yes, it looks similar from my earlier comment, but short answer : it's different from an EOF symbol.

Long answer :
There are 2 conditions to tell that a compressed stream is finished.

First, the compressed stream must be entirely and strictly consumed. That means that the size of the compressed stream is known, which in my implementation is necessary, since decompression read is done backward.

This is properly checked : the beginning of the stream is known, so it's trivial to ensure we never read beyond that limit. The stream is consumed when not a single bit remains (and there is an error if it requires 1+ more bits than available).

But that's not enough. Since some symbols may have a probability > 50%, in such case, they can be compressed in less than a bit. So just reaching exactly the end of bitstream is not enough.

Then, it's also necessary to check the value of "state", and ensure it reaches a pre-determined value (which is proposed to be "zero"). So if some high probability symbols are still there, they can be decoded, just from decreasing state value, up to zero.

Such check wouldn't be necessary if the number of symbols was known beforehand.
So that's where the trade-off is : either provide the number of symbols to decode, or agree on a pre-determined final state value.


Alternatively, note that if no symbol is > 50%, then such additional check is unnecessary, and just following the bit stream would be enough. Still, checking the state value can still be useful as an "integrated error checker", since if it has not reached the desired final state value, it means the compressed stream is malformed.

Hope it makes sense.

Yann Collet said...

@cbloom

While I do understand how to make use of the initial encoding state (last decoding state) to store something useful in order to minimize flush losses,

I don't really understand the same for the last encoding state (hence first decoding state) which is described in your blog post.

I suspect the difference is related to rANS/tANS different behaviors.

For tANS, any bit stored is mandatory to correctly interpret the state, hence the symbol retrieved, and therefore the next symbol. Not a single bit can be missed, or guessed.

I suspect this is because in tANS, symbols are spread, while in rANS, they lay next to each other, in a contiguous segment. So it's possible to not use all bits to guess the appropriate segment, hence appropriate symbol.

But still, you have to make sure those "remaining" bits don't have an avalanche effect later on, to deduce some future symbol in the stream.

cbloom said...

In the encoder final flush (at the head of the stream), Minimal ANS does not and cannot discard any bits.

But the state x may have many leading 0 bits. Those do not need to be sent.

This is true for TANS and RANS.

Basically we are using the fact that the probability distribution of x is 1/x. Higher values are less likely.

This corresponds to the buffered information content of x being log2(x).

Fortunately the ideal coder for 1/x is very simple - just send # of bits in x, then the value of those bits.

For TANS this is a much smaller savings because the # of bits in state (L) is typically very small, perhaps 12. With RANS the number of bits in x is much larger (32 or 64).

cbloom said...

I guess it's not clear in my comment -

while the same theory applies to TANS , in practice there's no point in using that flush TANS.

TANS state x varies from L to 2L (in implementation we usually ignore the +L shift). So the probability difference from the bottom to the top of the range is only a factor of *1/2 (this corresponds to the fact that a maximum of just under 1 bit can be pending in x).

So you have to write some bits to save 1 bit, doesn't make sense.

The difference comes not from TANS vs RANS per se, but from the base used for output; if you renorm to output bits, or bytes, or dwords. In TANS we typically choose to output bits, and in RANS we choose to output something larger.

Yann Collet said...

Thanks Charles. It's pretty clear and does make sense now.

old rants