2/29/2016

Bit Input Notes

1. The big win of U64 branchless bit input is having >= 56 (or 57) bits available after refill. The refill operation itself is not faster than a branchy 32-at-a-time refill, but the branchy version only guarantees >= 32 (or 33) bits after refill. The advantage comes when you can unconditionally consume bits knowing that count. e.g. if you have a 12-bit limited Huffman code, you can decode 4 symbols without needing to refill.
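A minimal sketch of that kind of branchless U64 reader (LSB-first variant; the names BitIn/Refill/GetBits are just illustrative, and it assumes the input buffer has at least 8 bytes of padding past the end so the unaligned load is always safe) :

#include <stdint.h>
#include <string.h>

typedef struct
{
    uint64_t       bits;   // LSB-first bit buffer; low 'count' bits are valid
    int            count;  // number of valid bits
    const uint8_t *ptr;    // next input byte
} BitIn;

static uint64_t Load64LE(const uint8_t *p)
{
    uint64_t v; memcpy(&v,p,8); return v;   // assumes a little-endian target
}

// branchless refill : afterwards count is always in [56,63]
static void Refill(BitIn *bi)
{
    bi->bits  |= Load64LE(bi->ptr) << bi->count;
    bi->ptr   += (63 - bi->count) >> 3;   // advance only past fully-consumed bytes
    bi->count |= 56;
}

static uint32_t GetBits(BitIn *bi, int n)   // n in [1,32]
{
    uint32_t ret = (uint32_t)( bi->bits & ((1ull << n) - 1) );
    bi->bits  >>= n;
    bi->count  -= n;
    return ret;
}

With a 12-bit limited Huffman code, one Refill then lets you do 4 table lookups back to back (4*12 = 48 <= 56) with no refill checks in between.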

2. The best case for bit input is when the length you consume is not very variable. e.g. the Huffman case, 1-12 bits, has a reasonably low limit. The worst case is when the consumed length has a high maximum and is quite random; then you can't avoid refill checks, and (if you do the branchy case) they're quite unpredictable.

3. If the consumed lengths have a large maximum but a low average, branchy can be faster than branchless. Because the maximum is high (e.g. maybe a max of 32 bits consumed), you can only do one decode op before checking for refill. Branchless will then always refill. Branchy can skip the refill if the average is low - particularly if it's predictably low.
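A sketch of the branchy check, reusing the hypothetical BitIn reader from note 1 (the 32 here is an assumed worst-case consumption per decode op) :

static void RefillIfNeeded(BitIn *bi, int need)
{
    if ( bi->count < need )   // usually not taken (and well predicted) if average consumption is low
        Refill(bi);
}

// per decode op, where one op may consume up to 32 bits :
// RefillIfNeeded(bi,32);
// val = GetBits(bi,len);   // len <= 32 , but typically much smaller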

4. If using branchy refills, try to make the refill branch predictable. An interesting idea is to use multiple bit buffers so that each consumption spot gets its own buffer and can create a regular pattern. A very specific case is when each spot consumes a fixed number of bits. Something like :


bitbuffer

if ( random )
{
  consume 4 bits from bitbuffer
  if bitbuffer out -> refill
}
else
{
  consume 6 bits from bitbuffer
  if bitbuffer out -> refill
}

These branches (for bitbuffer refill) will be very random, because two different sites consume different amounts from the same buffer. However, this :

bitbuffer1, bitbuffer2

if ( random )
{
  consume 4 bits from bitbuffer1
  if bitbuffer1 out -> refill
}
else
{
  consume 6 bits from bitbuffer2
  if bitbuffer2 out -> refill
}

These branches for refill are now perfectly predictable : each one follows a fixed pattern and is taken exactly every Nth time.
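In C the two-buffer version might look like this (a sketch, reusing the hypothetical BitIn reader from note 1, and assuming the encoder wrote two separate streams so each buffer has its own source bytes) :

BitIn b1, b2;   // two independent bit buffers, each fed from its own stream

uint32_t DecodeOne(int random_flag)
{
    uint32_t val;
    if ( random_flag )
    {
        val = GetBits(&b1,4);
        if ( b1.count < 4 )   // only 4-bit consumes hit this buffer -> refill follows a fixed period
            Refill(&b1);
    }
    else
    {
        val = GetBits(&b2,6);
        if ( b2.count < 6 )   // only 6-bit consumes hit this buffer -> refill follows a fixed period
            Refill(&b2);
    }
    return val;
}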

5. Bit buffer work is slow, but it's "mathy". On modern processors that are typically math-starved, it can be cheap *if* you have enough ILP to fully use all the execution units. The problem is that a single bit buffer on its own is extremely serial work, so you need multiple bit buffers running simultaneously, or enough other independent work.
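A sketch of that interleaving, again with the hypothetical BitIn reader from note 1, assuming two separately-encoded streams of 8-bit symbols :

void DecodeInterleaved(BitIn *a, BitIn *b, uint8_t *out, int count)
{
    for (int i=0; i+8 <= count; i += 8)
    {
        Refill(a);   // each refill guarantees >= 56 bits
        Refill(b);
        // 4 symbols from each stream per refill (fixed 8-bit reads here for simplicity);
        // the a chain and the b chain are independent, so the CPU can overlap their work
        out[i+0] = (uint8_t) GetBits(a,8);  out[i+1] = (uint8_t) GetBits(b,8);
        out[i+2] = (uint8_t) GetBits(a,8);  out[i+3] = (uint8_t) GetBits(b,8);
        out[i+4] = (uint8_t) GetBits(a,8);  out[i+5] = (uint8_t) GetBits(b,8);
        out[i+6] = (uint8_t) GetBits(a,8);  out[i+7] = (uint8_t) GetBits(b,8);
    }
}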

For example, it can actually be *faster* than byte-aligned input (using something like "EncodeMod") if the byte-aligned input does a branch and that branch is unpredictable (in the bad 25-75% randomly-taken range).
