4/18/2018

Race in EOF on Windows Pipes

There appears to be a race in the EOF check on Windows pipes. I haven't seen this written about elsewhere, so here it is.

TLDR : don't call eof on pipes in Windows! Use read returning zero instead to detect eof.

We observed that semi-randomly pipes would report EOF too soon and we'd get a truncated stream. (this is not the issue with ctrl-Z ; of course you have to make sure your pipe is set to binary).

To reproduce the issue you may use this simple piper :


// piper :

#include <stdio.h>   // _fileno, stdin, stdout
#include <errno.h>   // errno, EAGAIN
#include <fcntl.h>   // _O_BINARY
#include <io.h>      // _setmode, _read, _write, _eof

int main(int argc,char *argv[])
{
    int f_in  = _fileno(stdin);
    int f_out = _fileno(stdout);
    // f_in & out are always 0 and 1

    _setmode(f_in,  _O_BINARY);
    _setmode(f_out, _O_BINARY);

    char buf[4096];
    
    // NO : sees early eof!
    //  don't ask about eof until _read returns 0
    while ( ! _eof(f_in) )
    // YES : just for loop here
    //for(;;)
    {
        int n = _read(f_in,buf,sizeof(buf));
        if ( n > 0 )
        {
            _write(f_out,buf,n);
        }
        else if ( n == 0 )
        {
            // I think eof is always true here :
            if ( _eof(f_in) )
                break;
        }
        else
        {
            int err = errno;
            
            if ( err == EAGAIN )
                continue;
                
            // respond to other errors?
            
            if ( _eof(f_in) )
                break;
        }
    }

    return 0;
}

the behavior of piper is this :

vc80.pdb                 1,044,480

redirect input & output :

piper.exe < vc80.pdb > r:\out

r:\out                      1,044,480

consistently copies whole stream, no problems.

Now using a pipe :

piper.exe < vc80.pdb | piper > r:\out

r:\out                         16,384
r:\out                         28,672
r:\out                         12,288

semi-random output sizes due to hitting eof early

If the eof check marked with "NO" in the code is removed, and the for loop is used instead, piping works fine.

I can only guess at what's happening, but here's a shot :

If the pipe reader asks about EOF and there is nothing currently pending to read in the pipe, eof() returns true.

Windows anonymous pipes are unbuffered. That is, they are lock-step between the reader & writer. When the reader calls read() it blocks until the writer puts something, and vice-versa. The bytes are copied directly from writer to reader without going through an internal OS buffer.

In this context, what this means is if the reader drains out everything the writer had to put, and then races ahead and calls eof() before the writer puts anything, it sees eof true. If the writer puts something first, it sees eof false. It's just a straight up race.


time    reader                          writer
-----------------------------------------------------------
0                                       _write blocks waiting for reader
1       _eof returns false
2       _read consumes from writer
3                                       _write blocks waiting for reader
4       _eof returns false
5       _read consumes from writer
6       _eof returns true
7                                       _write blocks waiting for reader

at times 1 and 4, the eof check returns false because the writer had gotten to the pipe first. At time 6 the reader runs faster and checks eof before the writer can put anything, now it seems eof is true.

As a check of this hypothesis : if you add a Sleep(1) call immediately before the _eof check (in the loop), there is no early eof observed, because the writer always gets a chance to put data in the pipe before the eof check.

Having this behavior be a race is pretty nasty. To avoid the problem, never ask about eof on pipes in Windows, instead use the return value of read(). The difference is that read() blocks on the writer process, waiting for it to either put some bytes or terminate. I believe this is a bug in Windows. Pipes should never be returning EOF as long as the process writing to them is alive.
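For reference, here is the eof-free copy loop on its own (the same as the "YES" path in piper above, using the same f_in / f_out / buf) :


// copy loop without ever calling _eof() :
// _read blocks until the writer provides data or closes its end of the pipe;
// a return of 0 is the real end of stream
for(;;)
{
    int n = _read(f_in,buf,sizeof(buf));
    if ( n > 0 )
    {
        _write(f_out,buf,n);
    }
    else if ( n == 0 )
    {
        break;  // true eof : writer closed the pipe
    }
    else
    {
        if ( errno == EAGAIN )
            continue;

        break;  // other read error
    }
}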

The Kraft Number, Binary Arithmetic, and Incremental Package Merge

The Kraft number that we compute for length limited Huffman codelen construction can be thought of as a set of bit flags that tell you which codelens to change.

Recall


K = Sum{ 2^-L_i }

K <= 1 is prefix codeable

you can think of K as the sum of effective coding probabilities (P = 2^-L), so K over 1 is a total probability over 100%.

when we initially make Huffman codelens and then apply a limit, we use too much code space. That corresponds to K > 1.

If you write K as a binary decimal, it will be something like :


K = 1.00101100

Those 1 bits that are below K = 1.0 are exactly the excess codelens that we need to correct.

That is, if you have an extra symbol of length L, then K is too high by 2^-L ; that's a 1 bit in the binary at position L to the right of the decimal.


codelen set of {1,2,2,3}

a : len 1
b : len 2
c : len 2
d : len 3

K = 2^-1 + 2^-2 + 2^-2 + 2^-3

K = 0.100 +
    0.010 +
    0.010 +
    0.001

K = 1.001

take (K-1) , the part below the decimal :

K-1 = .001

we have an excess K of 2^-3 ; that's one len 3 too many.

To fix that we can change a code of len 2 to len 3

that does 

-= 0.010
+= 0.001

same as

-= 0.001

That is, when (K-1) has 2 leading zeros, you correct it by promoting a code of len 2 to len 3

so if we compute K in fixed point (integer), we can just take K - one and do a count leading zeros on it, and it tells you which code len to correct. The bits that are on in K tell us exactly what's wrong with our code.
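As a small sketch of that (helper names are mine; assuming 12-bit fixed point, so "one" is 1<<12 to match a codelen limit of 12) :


// sketch : find which codelen to correct from the bits of K
#include <stdint.h>

#define KRAFT_ONE_SHIFT 12
#define KRAFT_ONE       (1u << KRAFT_ONE_SHIFT)

// K = Sum{ 2^-L_i } in fixed point :
static uint32_t kraft_fixed(const int * codelens, int num)
{
    uint32_t K = 0;
    for (int i = 0; i < num; i++)
        K += KRAFT_ONE >> codelens[i];
    return K;
}

// given K > one (and excess < one), which codelen should be promoted one step?
// the highest set bit of (K - one), at position p to the right of the decimal,
// is an excess of 2^-p ; that's fixed by incrementing a code of len (p-1)
static int codelen_to_correct(uint32_t K)
{
    uint32_t excess = K - KRAFT_ONE;   // the bits below the decimal
    int top = 31;
    while ( ! (excess & (1u << top)) ) top--;   // find the leading one
    int p = KRAFT_ONE_SHIFT - top;
    return p - 1;
}


For the {1,2,2,3} example above : K = 4608 in this fixed point, the excess is 512 = bit 9, p = 3, so a len 2 code gets promoted to len 3.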

Now, similarly, we can think of the compound operations in the same way.

Any time we need to do a correction of K by 2^-L we could do one code from (L-1) -> L , or we could do two codes from L -> (L+1), or ... that can be seen as just an expression of the mathematics of how you can add bits together to make the desired delta.

That is :


You want 0.001 (increment codelen 2 to 3)

that can be :

0.001 = 0.0001
       +0.0001

(increment two codes at the next longer len : two len 3 codes go to len 4)

or :

0.001 = 0.01
       -0.001

increment a shorter codelen (a len 1 code to len 2), then decrement one at your level (a len 3 code to len 2)

Now, thinking about it this way we can try to enumerate all the possible moves.

To reduce the space of all possible moves, we need a few assumptions :

1. No moves that just shuffle symbols between the existing len groups are profitable. That is, the current assignment of symbols to len groups is the best possible one; eg. it's not profitable to do something like { symbol a from len 2 to 3 and symbol b from len 3 to 2 }. If there are any profitable moves like that, do them separately first. What this means is the counts are sorted; eg. if a symbol is at a higher codelen, its count is less than or equal to the count of any symbol at a lower codelen.

2. I only need to enumerate the moves that can be the cheapest (in terms of total code len).

In that case I think that you can enumerate all the moves thusly :


a change of 2^-L can be accomplished via

inc(L-1)   (that means increment a codelen at L-1)

inc(L) can be substituted with { inc(L+1), inc(L+1) }

And each of those inc(L+1) can also be substituted with a pair at L+2.
You take either a codelen at L or two at the next higher len, 
whichever has a smaller effect on CL (total codelen).

a change of 2^-L can also be accomplished via :

inc(L-2) and dec(L-1)

OR

inc(L-3) and dec(L-2) and dec(L-1)

(here dec(X) means decrement a codelen from X+1 down to X, which adds 2^-(X+1) back to K)

again these can be understood as binary decimals :

0.0001 = 0.0010 - 0.0001
0.0001 = 0.0100 - 0.0011
0.0001 = 0.1000 - 0.0111

and finally the decs are also a tree of pairs :

dec(L) can instead be { dec(L+1), dec(L+1) }

this seems like a lot, but compared to all possible ways to make the number X from adds & subtractions of any power of two, it's quite small. The reason we can consider such a reduced set of moves is because we only need the one best way to toggle a bit of K, not all possible ways.

Really we just do :


enumerate the position of the lowest codelen to inc,
somewhere between 1 and (L-1)

decrement at all codelens below the one you incremented,
down to (L-1)

this produces the desired decrease in K of 2^-L

each "inc" & "dec" can either be at that code len, or a pair of next codelen

(somebody smarter than me : prove that these are in fact all the necessary moves)

Let's look at how dynamic programming reduces the amount of work we have to do.


Say we need to do an inc() at L = 3

(inc means increment a codelen, that decreases K by 2^-4)

We can either increment a single symbol at L = 3
or a pair from L = 4

(this is just the same kind of operation you do in normal Huffman tree building)

The best symbol at L = 3 is just the lowest count symbol (if any exists)

Ask for the best two nodes at L = 4

Those can also be a single symbol, or a pair from L = 5

When you ask for the first node at L = 4, it gets the best two at L = 5

but then imagine the single symbol at L = 4 was lower count and is taken

Now you ask for the second node at L = 4, it again needs the best two at L = 5

we already have them, no more work is needed.

Any time you choose a symbol rather than a pair of higher len, the 2^n tree stops growing.

[3] -> { [4a], [4b] }

[4a] -> sym(4) or { [5a], [5b] }

[4a] takes sym(4)

[4b] -> sym2(4) or { [5a], [5b] }

[4b] doesn't need a new evaluation at level 5

Another issue I need to mention is that as you increment and decrement codelens, they move between the lists, so the cached dynamic programming lists cannot be retained, or can they? (for example, you want to keep the symbols at each len sorted by count)

In fact the accounting for symbols moving is simple and doesn't need to invalidate the cached lists.


When you do an inc(L) , that symbol moves to L+1 and is now available for a further inc(L+1)

(this does not occur with dec(L) since it moves in the opposite direction)

Say you wanted an inc(3) ; you consider doing a pair of { inc(4), inc(4) }

One of the inc(4)'s can be a pair of inc(5)'s , and one of those len 5 symbols can be the one you did inc 4 on.

That is, say you have 'A' at len 4 and 'B' at len 5

inc(3) <- { inc( 'A' @ 4 ) , { inc( 'A' @ 5 ) , inc( 'B' @ 5 ) } }

This is a legal move and something you have to consider.

But the rule for it is quite simple - if a symbol occurs earlier in the list of chosen increments, it is
available at the next level.

If you're familiar with the way Package Merge makes its lists, this is quite similar. It just means when you choose the lowest count symbol at the current level, you can also draw from the previous increments in your list if they have lower count.

These queues we are building are exactly the same thing you would need to do the full package merge algorithm. The difference is, in traditional Package Merge you would start with all the symbols at codelen = max (K is too low), and then incrementally apply the best decrements to increase K. Here we are starting with K pretty close to 1 , with K greater than 1. The result is that in many cases we can do far fewer package merge steps. I call this Incremental Package Merge. It allows you to start from a nearly-Kraft set of codelens and get to the same optimal solution as if you did full package merge.

Let's look at a concrete example or two :


codelen : symbol+count

len 3 : a+7
len 4 : b+3 , c+3
len 5 : d+1 , e+1

you need an inc(3) to get K -= 2^-4

you can :

inc(a) ; cost 7
inc(b + c) ; cost 6
inc(b + { d + e } ) ; cost 5

The best choice here is to inc {b,d,e} ; cost 5
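Here's a minimal sketch of that "cheapest inc node" recursion (names and layout are mine, not the actual Oodle code). It simply consumes symbols as it picks them and ignores the moved-symbol rule described above, so it reproduces this example but would miss moves like the "inc b twice" case in the next one :


// take_inc(L) : cost of the cheapest "inc node" at level L,
//   either the cheapest unused symbol at L, or a pair of inc nodes at L+1 ;
//   consumes whatever it picks
#include <stdio.h>
#include <string.h>
#include <limits.h>

#define MAX_LEN 6   // codelen limit assumed for the sketch

static int counts [MAX_LEN+1][8];  // counts[L][i] : symbol counts at codelen L, sorted ascending
static int num_at [MAX_LEN+1];     // how many symbols at codelen L
static int used_at[MAX_LEN+1];     // how many of them are already taken

static int take_inc(int L)
{
    if ( L > MAX_LEN ) return INT_MAX;

    int sym_cost = ( used_at[L] < num_at[L] ) ? counts[L][used_at[L]] : INT_MAX;

    // speculatively build a pair at the next level, remembering state so we can undo :
    int saved[MAX_LEN+1];
    memcpy(saved, used_at, sizeof(saved));
    int pair_cost = INT_MAX;
    int a = take_inc(L+1);
    if ( a != INT_MAX )
    {
        int b = take_inc(L+1);
        if ( b != INT_MAX ) pair_cost = a + b;
    }

    if ( sym_cost <= pair_cost )
    {
        memcpy(used_at, saved, sizeof(saved));   // undo the speculative pair
        if ( sym_cost != INT_MAX ) used_at[L]++; // take the single symbol instead
        return sym_cost;
    }
    return pair_cost;   // the pair is already consumed
}

int main(void)
{
    // len 3 : a+7 ; len 4 : b+3 , c+3 ; len 5 : d+1 , e+1
    counts[3][0] = 7;                   num_at[3] = 1;
    counts[4][0] = 3; counts[4][1] = 3; num_at[4] = 2;
    counts[5][0] = 1; counts[5][1] = 1; num_at[5] = 2;

    printf("cheapest inc(3) costs %d\n", take_inc(3));  // prints 5 : the {b,d,e} choice
    return 0;
}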

Another example :

len 3 : a+7
len 4 : b+2
len 5 : c+1

again say you want an inc(3)

the best is

inc( b + { b + c } ) ; cost 5

here the best option is to inc b twice

And finally let's think again about how Package Merge is related to the "numismatic" or "coin collector problem".

If you play with this you will see what we're really doing is working on a two-cost accountancy. Each symbol has a cost in K which is determined only by its current codelen (2^-L). It also has a cost in total codelen CL which is determined only by its symbol count. We are trying to pay off a debt in K (or spend a credit in K) and maximize the value we get in CL.

4/04/2018

Engel Coding and Length Limited Huffman Heuristic Revisited

Joern Engel has written about a technique he described to me for forming prefix code lengths here on his blog.

Engel Coding is a fast/approximate way of forming length limited prefix code lengths.

I'm going to back up first and remind us what a prefix code is and the use of the Kraft inequality.

We want to entropy code some alphabet using integer code lengths. We want those codes to be decodeable without side information, eg. by only seeing the bits themselves, not transmitting the length in bits.

0  : a
10 : b
11 : c

is a prefix code. If you see the sequence "11100" it can only decode to "11=c,10=b,0=a".

0  : a
1  : b
11 : c

is not a prefix code. Any occurrence of "11" can either be two b's or one c. They can't be resolved without knowing the length of the code. There is no bit sequence you can possibly assign to c that works. The fact that this is impossible is a function of the code lengths.

You can construct a prefix code from code lengths if and only if the lengths satisfy the Kraft inequality :


Sum{ 2^-L_i } <= 1

It's pretty easy to understand this intuitively if you think like an arithmetic coder. 2^-L is the effective probability for a code length, so this is just saying the probabilities must sum to <= 100%

That is, think of the binary code as dividing the range [0,1] like an arithmetic coder does. The first bit divides it in half, so a single bit code would take half the range. A two bit code takes half of that, so a quarter of the original range, eg. 2^-L.

The two numbers that we care about are the Kraft code space used by the code lengths, and the total code length of the alphabet under this encoding :


Kraft code space :

K = Sum{ 2^-L_i }

Total code length :

CL = Sum{ C_i * L_i }

L_i = code length in bits of symbol i
C_i = count of symbol i


Minimize CL subject to K <= 1 (the "Kraft inequality")

We want the minimum total code length subject to the prefix code constraint.
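As a tiny illustration, here's the direct computation of those two quantities (a made-up helper, not from the post) :


// compute the Kraft sum K and the total code length CL for a set of codelens & counts
#include <stdio.h>

static double kraft_and_codelen(const int * L, const int * C, int n, long long * pCL)
{
    double K = 0;
    long long CL = 0;
    for (int i = 0; i < n; i++)
    {
        K  += 1.0 / (double)(1 << L[i]);  // 2^-L_i
        CL += (long long)C[i] * L[i];     // C_i * L_i
    }
    *pCL = CL;
    return K;
}

int main(void)
{
    int L[4] = { 1, 2, 2, 3 };        // codelens
    int C[4] = { 10, 6, 5, 2 };       // made-up counts
    long long CL;
    double K = kraft_and_codelen(L, C, 4, &CL);
    // K = 1.125 here : over 1, so these lens are not prefix codeable
    printf("K = %.3f , prefix codeable : %s , CL = %lld\n", K, K <= 1.0 ? "yes" : "no", CL);
    return 0;
}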

The well known solution to this problem is Huffman's algorithm. There are of course lots of other ways to make prefix code lengths which do not minimize CL. A famous historical one is Shannon-Fano coding, but there have been many others, particularly in the early days of data compression before Huffman's discovery.

Now for a length-limited code we add the extra constraint :


max{ L_i } <= limit

now Huffman's standard algorithm can't be used. Again the exact solution is known; to minimize CL under the two constraints of the Kraft inequality and the maximum codelength limit, the algorithm is "Package Merge".

In Oodle we (uniquely) actually use Package Merge at the higher compression levels, but it is too slow and complicated to use when you want fast encoding, so at the lower compression levels we use a heuristic.

The goal of the heuristic is to find a set of code lengths that satisfies the constraints and gets CL reasonably close to the minimum (what Package Merge would find).

The Oodle heuristic works by first finding the true Huffman code lengths, then if any are over the limit, they are changed to equal the limit. This now violates the Kraft inequality (they are not prefix decodeable), so we apply corrections to get them to K = 1. ZStd uses a similar method (and I imagine lots of other people have in the past; this is pretty much how length-limited near-Huffman is done). My previous post on the heuristic length limited code is below, with some other Huffman background :

cbloom rants: 07-03-10 - Length-Limitted Huffman Codes Heuristic
cbloom rants 07-02-10 - Length-Limitted Huffman Codes
cbloom rants 05-22-09 - A little more Huffman
cbloom rants 08-12-10 - The Lost Huffman Paper
cbloom rants Huffman Performance
cbloom rants Huffman Correction

(Engel correctly points out that most of the places where I say "Huffman coding" I should really be saying "prefix coding". The decoding methods and canonical code assignment and so on can be done with any prefix code. A Huffman code is only the special case of a prefix code with optimal lengths. That is, Huffman's algorithm is only the part about code length assignment; the rest is just prefix coding.)

So Engel's idea is : if we're going to limit the code lengths and muck them up with some heuristic anyway, don't bother with first finding the optimal non-length-limited Huffman code lengths. Just start with heuristic code lengths.

His heuristic is (conceptually) :


L_i = round( -log2( P_i ) )

which is intuitively a reasonable place to start. If your code lengths didn't need to be an integer number of bits, then you would want them to be as close to -log2(P) as possible.
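In code, the conceptual form is just this (only the idea; as noted below you wouldn't actually use a floating point log2, and the clamp here is my addition) :


// conceptual initial guess : codelen ~ -log2(P), clamped to [1,limit]
#include <math.h>

static int initial_codelen(int count, int total, int limit)
{
    int L = (int) lrint( -log2( (double)count / (double)total ) );
    if ( L < 1 ) L = 1;
    if ( L > limit ) L = limit;
    return L;
}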

Then apply the limit and fix the lengths to satisfy the Kraft inequality. Note that in this case the tweaking of lens to satisfy Kraft is not just caused by lens that exceed the limit. After the heuristic codelens are made, even if they are short, they might not be Kraft. eg. you can get code lengths like { 2,3,3,3,3,3,4,4,4 } which are not prefix (one of the 3's needs to be changed to a 4). The idea is that unlike Huffman or Shannon-Fano which explicitly work by creating a prefix code by construction, Engel coding instead makes code lengths which could be non-prefix and relies on a fix up phase.

When Joern told me about this it reminded me of "Polar Coding" (Andrew Polar's, not the more common use of the term for error correction). Andrew Polar's code is similar in the sense that it tries to roughly assign log2(P) codelens to symbols, and then uses a fix-up phase to make them prefix. The details of the heuristic are not the same. (I suspect that there are lots of these heuristic entropy coders that have been invented over the years and usually not written down).

Obviously you don't actually want to do a floating log2; for the details of Engel's heuristic see his blog.

But actually the details of the initial codelen guess are not very important to Engel coding. His codelen adjustment phase is what actually determines the codelens. You can start the codelens all at len 1 and let the adjustment phase do all the work to set them, and in fact that gives the same final codelens!

I tried four methods of initial codelen assignment and they all produced the exact same final codelens. The only difference is how many steps of the iterative refinement were needed to get them to Kraft equality.


all initial codelens = 1    : num_adjustment_iterations = 2350943

codelens = floor( log2(P) ) : num_adjustment_iterations = 136925

codelens = round( log2(P) ) : num_adjustment_iterations = 25823

Engel Coding heuristic      : num_adjustment_iterations = 28419

The crucial thing is how the refinement is done.

To get to the refinement, let's go over some basics. I'll first describe the way we were actually doing the length limit heuristic in Oodle (which is not the same as what I described in the old blog post above).

In the Oodle heuristic, we start with Huffman, then clamp the lens to the limit. At this point, the Kraft K is too big. That is, we are using more code space than we are allowed to. We need to raise some codelens somewhere to free up more code space. But raising codelens increases the total codelen (CL). So the goal is to bump up some codelens to get K = 1, with a minimum increase to CL.


If you do L_i ++

K -= 2^(-L_i)
K += 2^(-(L_i+1))

for a net change of :

K -= 2^(-(L_i+1))

(shorter codelen symbols make a bigger change to K)

and CL does :

CL -= C_i * L_i
CL += C_i * (L_i + 1)

net :

CL += C_i

(lower count symbols hurt total codelen less)

K_delta_i = 2^(-(L_i+1))
CL_delta_i = C_i

To get under the K budget, we want to find the lowest CL_delta with the maximum K_delta. That is, code space (K) is the resource you want to buy, and code len (CL) is the currency you use to pay for that code space. You want the best price :

price = CL_delta / K_delta

price = C_i * 2^(L_i+1)

What I was doing in Oodle was taking the step with the best "price" that didn't overshoot the target of K = 1.

If your symbols are sorted by count (which they usually are for Huffman codelen algorithms), then you don't need to compute "price" for all your symbols. The minimum price will always occur at the lowest count (first in the sorted list) at each codelen. So rather than making a full heap of up to 256 symbols (or whatever your alphabet size is), you only need a heap of 16 entries (or whatever the codelen limit is) : the lowest count symbol at each codelen.
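A sketch of that candidate scan (my own names; a real implementation would keep these candidates in a small heap rather than rescanning, and would also apply the overshoot rule discussed next) :


// among the lowest-count symbol at each codelen, pick the one with the
// lowest price = C_i * 2^(L_i+1) ; returns the codelen to increment, or -1
#include <stdint.h>

#define CODELEN_LIMIT 12

// lowest_count[L] : count of the cheapest symbol currently at codelen L (0 if none)
static int pick_cheapest_increment(const uint32_t * lowest_count)
{
    int best_L = -1;
    uint64_t best_price = UINT64_MAX;
    for (int L = 1; L < CODELEN_LIMIT; L++)   // can't increment a len already at the limit
    {
        if ( lowest_count[L] == 0 ) continue; // no symbol at this len
        uint64_t price = (uint64_t)lowest_count[L] << (L + 1);  // C_i * 2^(L_i+1)
        if ( price < best_price ) { best_price = price; best_L = L; }
    }
    return best_L;
}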

The big improvement in Engel's refinement heuristic is that it allows overshooting K if the absolute value of the distance to K decreases.

Consider K in fixed point with a 12 bit codelen limit. Then "one" is at K = 4096. Say you had K = 4099. It's 3 too big. My heuristic could only consider K steps of -=2 and -=1 (only power of two steps are possible). Engel can also take a step of -= 4 , changing K to 4095. It's now 1 too small (codeable but wasteful) and rather than increasing codelens to fit in the code space, we can decrease a symbol codelen somewhere to gain some total codelen.

Engel converges to K = one by (possibly) taking successively smaller overshooting steps, so K wiggles around the target, delta going positive & negative. This does not always converge, so a bail out to a simpler approach is needed. This overshooting lets it get to K by doing a combination of positive and negative steps (eg. 3 = 4 - 1 , not just 3 = 1 + 2), which is a little bit of a step towards Package Merge (the difference being that package merge find the cheapest path to get the desired sum, while Engel's heuristic is greedy, taking the single cheapest step each time).
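As a sketch of the difference between the two acceptance rules, with K in the same 12-bit fixed point (one = 4096) and step_K = the K reduction 2^-(L_i+1) of a candidate increment (hypothetical helpers, not the actual Oodle or Engel code) :


#include <stdlib.h>   // abs

#define KRAFT_ONE 4096

// monotonic heuristic : only take steps that don't overshoot K = one
static int accept_monotonic(int K, int step_K)
{
    return ( K - step_K ) >= KRAFT_ONE;
}

// overshooting (Engel-style) : take the step if it moves K strictly closer to one,
// even if it lands below one
static int accept_overshooting(int K, int step_K)
{
    return abs( (K - step_K) - KRAFT_ONE ) < abs( K - KRAFT_ONE );
}


With K = 4099 as in the example above, a step of 4 passes the overshooting test (|4095 - 4096| = 1 < 3) but not the monotonic one.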

In practice this turns out to be much better than only taking non-overshooting steps.

Time to look at the results on some real data :


"seven" test set, cut into 64k chunks, order 0 entropy coded

comparing code len to package merge (ideal)

The length in excess (percent) is :

excess percent = (codelen(heuristic) - codelen(packagemerge))*100/codelen(packagemerge)

Huffman then limit monotonic      mean : 0.027% max : 2.512%
Huffman then limit overshooting   mean : 0.002% max : 0.229%
Engel coding                      mean : 0.016% max : 10.712%

(codelen limit of 12, 256 symbol byte alphabet)

The heuristic (overshooting) limit is really very good, extremely close to package merge and even the maximum excess len is small. Engel coding (non-Huffman initial code lengths) works fine on average but does have (very rare) bad cases. This is not surprising; there's a reason we use Huffman's algorithm to get the code lengths right.

In that bad 10.71% excess case, the package merge average code len is 1.604 but Engel coding produces an average code length of 1.776

Note that many of the blocks in this test did not hit the codelen limit; in that case "Huffman then limit" produces the best possible codelens, but Engel coding might not.

For most applications, it's probably best to make the true Huffman code and then limit the lengths with a heuristic. The time saved from the approximate initial code lengths is pretty small compared to other operations needed to do entropy coding (histogramming for example is annoyingly slow). Nevertheless I found this technique to be an interesting reminder to keep an open mind about approximations and understand where our algorithms came from and why we use the ones we do.


Another thing I find interesting is how Engel Coding points back to Package Merge again.

First there's the fact that you can start Engel Coding with just all the codelens set at 1 , and let the Kraft fixup make the codelens. That's how Package Merge works. It starts all codelens at 1, and then increments the cheapest ones until it gets to Kraft equality. The log2ish starting guess for Engel Coding is just a way of jump-starting the codelens closer to the final answer to avoid lots of steps.

Engel Coding's overshooting heuristic improves on the monotonic heuristic by allowing you to take some +- steps. That is, increment one codelen and decrement another. In particular it can do things like : rather than increment a len 3 codelen , instead increment a len 2 codelen and decrement a len 3 codelen. This is the kind of move that you need to make to get to real optimal code lens.

The key missing element is considering the costs of all possible steps and finding a path to the desired K. That is, Engel coding takes greedy (locally cheapest price) steps, which may not give the optimal path overall. The way to turn this greedy algorithm into an optimal one is dynamic programming. Lo and behold, that's what Package Merge is.