    low += (symlow * range) / symtot;
    range = (symrange * range) / symtot;

Instead we do :

    U32 r = range / symtot;
    low += symlow * r;
    range = symrange * r;

This changes the parenthesization of the multiply and divide, which lets us do the math in 32 bits and also saves a divide. It's inaccurate and would waste some range, so we check for being at the end of the range and put the wasted space there :

    if ( (symlow + symrange) == symtot )
    {
        U32 t = symlow * (range / symtot);
        low += t;
        range -= t;
    }
    else
    {
        U32 r = range / symtot;
        low += symlow * r;
        range = symrange * r;
    }

We can see the waste by comparing the range update for the top symbol in the two cases :

    case1 = range - symlow * (range / symtot);
    case2 = symrange * (range / symtot);
    waste = case1 - case2;
    waste = range - symlow * (range / symtot) - symrange * (range / symtot);
    waste = range - (symtot - symrange) * (range / symtot) - symrange * (range / symtot);
    waste = range - symtot * (range / symtot);
    waste = range % symtot;

Putting the waste on the last symbol is not great. To make the most of it, you should really reorganize your alphabet so that the most probable symbol is the last one. Symtot is usually around (1<<14), which means the waste is on average (1<<13). Range is between (1<<24) and (1<<32), so the fractional waste is between 2^-11 and 2^-19. This appears to come out to about 1 byte wasted in every 4096 on text with the alphabet rearranged to put the space (0x20) at the end.
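The identity waste = range % symtot can be checked numerically. The following is a minimal sketch, not the coder from this post: `CoderState`, `encode_step` and `top_symbol_waste` are hypothetical names, and carry propagation and renormalization are left out on purpose.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t U32;

/* Hypothetical coder state for this sketch; carry propagation and
   renormalization are deliberately left out. */
typedef struct { U32 low; U32 range; } CoderState;

/* Approximate range update with the top-symbol check from the text. */
static void encode_step(CoderState *s, U32 symlow, U32 symrange, U32 symtot)
{
    assert(symtot > 0 && symrange > 0 && symlow + symrange <= symtot);
    U32 r = s->range / symtot;   /* one divide, shared by both updates */
    U32 t = symlow * r;          /* cannot overflow: t <= range */
    s->low += t;
    if (symlow + symrange == symtot)
        s->range -= t;           /* top symbol keeps the leftover range */
    else
        s->range = symrange * r;
}

/* Extra range the top symbol gets from the check: case1 - case2. */
static U32 top_symbol_waste(U32 range, U32 symrange, U32 symtot)
{
    U32 symlow = symtot - symrange;
    U32 r = range / symtot;
    U32 case1 = range - symlow * r;  /* update with the check */
    U32 case2 = symrange * r;        /* update without the check */
    return case1 - case2;            /* equals range % symtot */
}
```

With range = 16777221 and symtot = 1<<14, `top_symbol_waste` returns 5, which is exactly range % symtot, whatever `symrange` the top symbol has.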

I should note that many, many people have arithcoders that do this check to put the wasted range on the last symbol, but they *don't* do the work to make sure the last symbol is the MPS (the most probable symbol). If you don't do that, it's rather pointless. With English text, for example, you might be putting the extra range on the symbol "255", which basically never happens, so you are doing work for nothing.
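One simple way to get the MPS into the last slot is to swap it with whatever symbol currently sits there before building the cumulative frequencies. This is a rough sketch under that assumption; `build_mps_last` and its parameters are hypothetical names, and a static (non-adaptive) model is assumed.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t U32;

/* Builds a symbol -> slot map that swaps the most probable symbol into
   the last slot, then fills cumfreq[0..n] in slot order.
   For symbol s: symlow = cumfreq[slot_of_sym[s]], symrange = counts[s],
   symtot = cumfreq[n]. */
static void build_mps_last(const U32 *counts, int n,
                           int *slot_of_sym, U32 *cumfreq)
{
    assert(n > 0);

    /* find the most probable symbol */
    int mps = 0;
    for (int i = 1; i < n; i++)
        if (counts[i] > counts[mps])
            mps = i;

    /* identity map, then swap the MPS with the last symbol */
    for (int i = 0; i < n; i++)
        slot_of_sym[i] = i;
    slot_of_sym[mps] = n - 1;
    slot_of_sym[n - 1] = mps;

    /* a single swap is its own inverse, so slot_of_sym[k] is also the
       symbol stored in slot k */
    cumfreq[0] = 0;
    for (int k = 0; k < n; k++)
        cumfreq[k + 1] = cumfreq[k] + counts[slot_of_sym[k]];
}
```

The decoder has to apply the same map, so for a static model the swap costs nothing per symbol beyond one table lookup.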

Sean asked a good question that had me a little confused yesterday. Why do some of the renormalization updates shift in 1's to "high" and some don't? You might see :

    while ( range < MinRange )
    {
        *outptr++ = low >> 24;
        low <<= 8;
        range <<= 8;
    }

or :

    while ( range < MinRange )
    {
        *outptr++ = low >> 24;
        low <<= 8;
        range <<= 8;
        range += 255;
    }

I thought maybe there was some subtle math reason why you would choose not to put in the 255. Nope. It's just another approximation + optimization. Shifting the 1 bits into range is the more accurate way to do it, but at that point range is already in 2^24 to 2^32, so adding on 255 doesn't amount to much. In practice on my real file tests it doesn't even save 1 byte. I guess it should save something like 1 byte every 2^20. So, most people just leave it out to save the 1 instruction.
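To see how little the 255 changes, here is a small sketch that runs both renorm variants on the same state. `renorm`, the `fill_ones` flag and `MIN_RANGE` are names made up for this example; MinRange is assumed to be 1<<24 based on the range bounds quoted earlier, and carry propagation is not handled.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t U32;

#define MIN_RANGE (1u << 24)  /* assumed from the 2^24 to 2^32 bounds */

/* Renormalize, writing the top byte of low on each iteration.
   fill_ones != 0 selects the variant that adds 255 to range.
   Carry propagation is omitted. Returns the advanced output pointer. */
static unsigned char *renorm(U32 *low, U32 *range,
                             unsigned char *outptr, int fill_ones)
{
    while (*range < MIN_RANGE) {
        *outptr++ = (unsigned char)(*low >> 24);
        *low <<= 8;
        *range <<= 8;
        if (fill_ones)
            *range += 255;
    }
    return outptr;
}
```

Starting from low = 0x12345678 and range = 0x1234, both variants emit the bytes 0x12 and 0x34; the plain one ends with range = 0x12340000 and the other with range = 0x1234FFFF, a difference of well under one part in 2^12 even in this deliberately tiny-range case.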

Here are some numbers :

    r:\bigtest.bin : inLen : 25651384
    True Entropy : 18090226.9 : 5.642

    // these are all using cum prob max = 1<<14 and doing a divide for range

    radArith cycles: 64.9
    radArith : encLen : 18090853 = 5.642 bpb
    // radArith is a Michael Schindler 31 bit range encoder with check to put waste range on top sym

    Arithmetic_Codec cycles: 58.2
    Arithmetic_Codec : encLen : 18091236 = 5.642 bpb
    // Arithmetic_Codec is FastAC as modified by me with no check for waste on top sym

    rrArithCoder cycles: 57.3
    rrArithCoder : encLen : 18091236 = 5.642 bpb
    // rrArithCoder is my reimplemented 32 bit range coder similar to FastAC ; yay I beat them by 1 clock !

    rrArithBinaryModel cycles: 35.1
    rrArithBinaryModel : encLen : 18844782 = 5.877 bpb
    // rrArithBinaryModel is a bit-by-bit adaptive model + coder ; that's cycles *per bit* , it's 280 cycles/byte

    rrArithCoderRawBits cycles: 24.1
    rrArithCoderRawBits : encLen : 25651385 = 8.000 bpb
    // rrArithCoderRawBits just puts the bytes out through the arithmetic coder
    // this basically just measures the speed of the renorm loop

    rrHuffman cycles: 14.5
    rrHuffman : encLen : 18179876 = 5.670 bpb
    // Huffman encoding is still 4X faster and doesn't cost you a whole lot in compression efficiency
