AAAAAAABBBBBBBBBBBBBCCCCCCCCCCCD
What's wrong with that sort?
(That's the naive rANS sort order; it's just a "cum2sym" table: each symbol repeated Fs times, in consecutive blocks. It has M=32 entries. M = sum{Fs} , L = coding precision)
(here I'm talking about a tANS implementation with L=M ; the larger (L/M) is, the more you preserve the information in the state x)
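For concreteness, building that naive cum2sym string from the normalized counts is just this (a sketch of my own, in the style of the code below) :

uint32 cum = 0;
for LOOP(sym,alphabet)
{
    uint32 count = normalized_counts[sym];
    for LOOP(c,(int)count)
        cum2sym[cum++] = (uint8)sym; // each symbol Fs times, in consecutive blocks
}
ASSERT( cum == (uint32)M ); // M = sum{Fs}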
Think about what the state variable "x" does as you are coding. In the renormalization range it's in [32,63]. Its position in that range is a slider for the number of fractional bits it contains: at the bottom of the range log2(x) is 5, at the top log2(x) is 6, so eg. x = 48 is carrying log2(48)-5 ~= 0.58 fractional bits.
Any time you want to encode a "D" you must go back to a singleton precursor range, Is = [1,1]. That means you have to output all the bits in x, so all fractional bits are thrown away. All information about where you were in that I range is gone. Then from that singleton Is range you jump to the end of the I range (the "D" is the last slot in the string, so you always land at x = 63).
(if Fs=2 , then you quantize the fractional bits to 1/2 a bit ; if Fs=3, you quantize to 1/3 of a bit, etc.)
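To see where the bits go, here's a minimal sketch of the tANS encode step for the L=M case (my code, not from the post ; put_bit() is a hypothetical bit-writer, and the table is the one built later in the post) :

uint32 tans_encode_one(uint32 x, int sym, uint32 Fs)
{
    // renormalize : bring x down into the precursor range Is = [Fs, 2*Fs-1]
    while ( x >= 2*Fs )
    {
        put_bit( x & 1 ); // the fractional bits of x stream out here
        x >>= 1; // when Fs == 1 this flushes *all* the bits of x
    }
    // state transition back up into I = [L, 2L-1]
    return encode_packed_table_ptr[sym][x];
}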
Obviously the actual codelen for a "D" is longer than that for an "A". But the codelen for a "C" is also too long (even though C is more probable than A), and the codelen for "A" is too short. Another way to think of it is that you're taking an initial state x that spans the whole interval [32,63] and thus has variable fractional bits, and you're mapping it into only a portion of the interval.
In order to preserve the fractional bit state size, you want to map from the whole interval back to the whole interval. In the most extreme case, something like :
ACABACACABACABAD
(M=16) . When you encode an A you go from [16,31] down to [8,15] and then back up to the A slots in that string. The net result is that the state just lost its bottom bit. That is, x &= ~1. You still have the full range of possible fractional bits in [0,1] , you just lost the bottom bit of precision.
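A quick check of that claim (my sketch ; it relies on the A's sitting in the even slots of that M=16 string) :

for ( uint32 x = 16; x < 32; x++ )
{
    uint32 xr = x >> 1; // renormalize one bit : [16,31] -> Is = [8,15]
    uint32 to = 16 + 2*(xr - 8); // the (xr-8)'th A is at even string index 2*(xr-8)
    ASSERT( to == (x & ~1U) ); // net effect : state just lost its bottom bit
}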
I was thinking about this because I was making some weird alternative tANS tables. In fact I suppose they're not actually ANS tables, but more general coding tables.
For background, you can make one of the heuristic tANS tables thusly :
shuffle(s) = some permutation function
shuffle is one-to-one over the range [0,L-1]
such as Yann's stepping prime-mod-L
or bit reverse
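For example, two possible shuffles might look like this (sketches of my own ; the step constant is the FSE-style heuristic, and treat it as an assumption rather than Yann's exact choice) :

// bit reversal of the low log2(L) bits ; one-to-one and self-inverting
uint32 shuffle_bitreverse(uint32 i, uint32 L_log2)
{
    uint32 r = 0;
    for LOOP(b,(int)L_log2)
    {
        r = (r << 1) | (i & 1);
        i >>= 1;
    }
    return r;
}

// step by a constant coprime to L ; one-to-one but not self-inverting
uint32 shuffle_step(uint32 i, uint32 L)
{
    const uint32 step = (L>>1) + (L>>3) + 3; // odd (hence coprime to L) for power-of-two L >= 16
    return (i * step) & (L-1); // (i*step) mod L , L a power of two
}

With some shuffle in hand, the construction is :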
make_tans_shuffle()
{
    int next_state[256];
    uint8 permutation[MAX_L];

    // make permutation :
    uint32 cumulative_count = 0;
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;

        next_state[sym] = count;

        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;
            permutation[index] = (uint8)sym;
        }
    }
    ASSERT( cumulative_count == (uint32)L );

    // permutation is now our "output string"
    for LOOP(i,L) // iterating over destination state
    {
        int sym = permutation[i];

        // step through states for this symbol
        int from_state = next_state[sym];
        next_state[sym] ++;

        int to_state = L + i;
        encode_packed_table_ptr[sym][from_state] = to_state;
    }
}
which is all well and good. But I started thinking - can I eliminate the intermediate permutation[] table
entirely? Well, yes. There are a few ways.
If you have a "cum2sym" table already handy, then you can just use shuffle() to look up directly into cum2sym[], and that is identical to the above. But you probably don't have cum2sym.
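For the record, that variant is just (a sketch ; next_state[sym] starts at Fs, as in make_tans_shuffle above) :

for LOOP(i,L) // iterating over destination state
{
    int sym = cum2sym[ shuffle(i) ]; // shuffle used in the inverse direction here
    int from_state = next_state[sym];
    next_state[sym] ++;
    encode_packed_table_ptr[sym][from_state] = L + i;
}

(identical output to make_tans_shuffle when shuffle is self-inverting)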
Well what if we just use shuffle() to make the destination state? Note that this is calling it in the opposite direction (from cum2sym index to to_state , rather than from to_state to cum2sym). If your shuffle is self-inverting like bit reversal is, then it's the same.
It gives you a very simple table construction :
make_tans_shuffle_direct()
{
    uint32 cumulative_count = 0;
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;

        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;

            uint32 to_state = index + L;
            int from_state = count + c;
            encode_packed_table_ptr[sym][from_state] = to_state;
        }
    }
    ASSERT( cumulative_count == (uint32)L );
}
make_tans_shuffle_direct walks the Fs in a kind of cum2sym order and then scatters those symbols out to
semi-random target locations using the shuffle() function.
It doesn't work. Or rather, it works - it encodes & decodes data correctly - but the total coded size is worse.
The problem is that the encode table is no longer monotonic. That is, as "from_state" increases, "to_state" does not necessarily increase. The Fs encode table entries for each symbol are not numerically in order.
In the pictures from earlier in the post we can see the problem. Some initial state x is renormalized down to the Is coding range. We then follow the state transition back to the I range - but we go somewhere random. We don't go back to the same neighborhood where we started, so we randomly get more or fewer fractional bits.
You can fix it thusly :
make_tans_shuffle_direct_fixed()
{
    uint32 cumulative_count = 0;
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;

        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;

            uint32 to_state = index + L;
            int from_state = count + c;
            encode_packed_table_ptr[sym][from_state] = to_state;
        }

        // fix - to_states not monotonic
        // sort the destination states for this symbol :
        std::sort( encode_packed_table_ptr[sym]+count, encode_packed_table_ptr[sym]+2*count );
    }
    ASSERT( cumulative_count == (uint32)L );
}
and then it is identical to "make_tans_shuffle" (identical if shuffle is self-inverting, and if not then it's
different but equal, since shuffle is really just a random number generator so running it backwards
doesn't hurt compression).
For the record the compression penalty for getting the state transition order wrong is 1-2% :
CCC total bytes out :
correct sort : 1788631
shuffle fixed: 1789655
shuffle bad : 1813450
Interesting. I was looking for something equivalent, but got scared by the final sorting stage. Is it actually any faster than using an intermediate distribution table ?
Rgds
With the final sort included, no it's not any faster than just making the permutation table.
Obviously the non-sorting version is faster, but not by so much that it's worth the compression loss.
I think there is something there, but it would require some more careful work. For example, if "shuffle" is the prime stepping heuristic, then the sort order should be easier to generate than doing a full general purpose sort.
eg. at L = 32 it makes things like
{7,14,21,28,3,10,17}
which obviously you just divide into blocks of 4 and then interleave. There should be a fast way to do something like that.
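eg. something like this sketch (my own, assuming shuffle(i) = (i*step) mod L with step < L) : the raw destinations for one symbol are d_i = (start + i*step) mod L, which split into ascending runs that break at each mod-L wrap ; each run is itself an arithmetic sequence of stride step, so a small merge of the runs replaces the general sort :

void sorted_step_destinations(uint32 start, uint32 step, uint32 count, uint32 L, uint32 * out)
{
    // find the ascending runs ; a run breaks whenever adding step wraps past L
    uint32 run_head[64]; uint32 run_len[64]; // fixed size is fine for small steps like the example ; a sketch, not production code
    int nruns = 1;
    uint32 d = start % L;
    run_head[0] = d; run_len[0] = 1;
    for LOOP(i,(int)count-1)
    {
        d += step;
        if ( d >= L ) { d -= L; run_head[nruns] = d; run_len[nruns] = 1; nruns++; }
        else run_len[nruns-1] ++;
    }

    // merge the runs ; each is ascending with stride step, and there are only a few
    for LOOP(n,(int)count)
    {
        int best = -1;
        for LOOP(r,nruns)
            if ( run_len[r] && ( best < 0 || run_head[r] < run_head[best] ) ) best = r;
        *out++ = run_head[best];
        run_head[best] += step; run_len[best] --;
    }
}

On the {7,14,21,28,3,10,17} example this finds the two runs {7,14,21,28} and {3,10,17} and interleaves them into {3,7,10,14,17,21,28}.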