3/30/2014

03-30-14 - Decoding GIF

So I'm writing a little image viewer for myself because I got fed up with ACDSee sucking so bad. Anyway, I had almost every image format except GIF, so I've been adding that.

It's mostly straightforward except for a few odd quirks, so I'm writing them down.

Links :

spec of gif89a
A breakdown of a GIF decoder
The GIFLIB Library
Frame Delay Times for Animated GIFs by humpy77 on deviantART
gif_timing test
ImageMagick - Animation Basics -- IM v6 Examples
(Optional) Building zlib, libjpeg, libpng, libtiff and giflib - Leptonica & Visual Studio 2008
theimage.com gif Disposal Methods 2
theimage.com GIF Table of Contents

My notes :

A GIF is a "canvas" and the size of the canvas is in the file header as the "screen width/height". There are then multiple "images" in the file drawn into the canvas.

In theory multiple images could be used even for non-animated gifs. Each image can have its own palette, which lets you do true color gifs by assigning different palettes to different parts of the image. So you should not assume that a GIF decodes into an 8-bit palettized image. I have yet to see any GIF in the wild that does this. (and if you had one, most viewers would interpret it as an animated gif, since delay=0 is not respected literally)

(one hacky way around that which I have seen suggested elsewhere : a gif with multiple images but no GCE blocks should be treated as compositing to form a single still image, whereas GCE blocks even with delay of 0 must be interpreted as animation frames)

Note that animation frames which only update part of the image *are* common. Also the transparency in a GIF must be used when drawing frames onto the canvas - it does not indicate that the final pixel should be transparent. That is, an animation frame may mark some pixels as transparent, and that means don't update those pixels.
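To make that concrete, here's a minimal compositing sketch ; the struct and field names are hypothetical, not from any real decoder, and this assumes the frame's LZW data has already been decoded to palette indices :

// hypothetical frame representation, post-LZW-decode :
typedef struct
{
    int left, top, width, height;   // image rect within the canvas
    int transparent_index;          // from the GCE ; -1 if none
    const unsigned char * indices;  // width*height palette indices
    const unsigned int  * palette;  // RGBA palette (local or global)
} GifFrame;

// draw one frame onto the canvas ; transparent pixels leave
// whatever was already on the canvas untouched
static void composite_frame( unsigned int * canvas, int canvas_w,
                             const GifFrame * f )
{
    for (int y = 0; y < f->height; y++)
    {
        for (int x = 0; x < f->width; x++)
        {
            int idx = f->indices[ y * f->width + x ];
            if ( idx == f->transparent_index )
                continue; // transparent = don't update this canvas pixel
            canvas[ (f->top + y) * canvas_w + (f->left + x) ] = f->palette[ idx ];
        }
    }
}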

There is an (optional) global palette and an (optional) per-image palette. In the global header there is a "background color". That is an index to the global palette, if it exists. The background color will be visible in parts of the canvas where there is no image rectangle, and also where images are transparent all the way down to the canvas. However, the ImageMagick link above has this note :


        There is some thinking that rather than clearing the overlaid area to
        the transparent color, this disposal should clear it to the 'background'
        color meta-data setting stored in the GIF animation. In fact the old
        "Netscape" browser (version 2 and 3), did exactly that. But then it also
        failed to implement the 'Previous' dispose method correctly.

        On the other hand the initial canvas should also be set from the formats
        'background' color too, and that is also not done. However all modern
        web browsers clear just the area that was last overlaid to transparency,
        as such this is now accepted practice, and what IM now follows. 
        
which makes me think that many decoders (eg. web browsers) ignore background and just make those pixels transparent.

(ADD : I've seen quite a few cases where the "background" value is larger than the global palette. eg. global palette has 64 colors, and "background" is 80 or 152.)
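So in practice the decoder has to bounds-check the background index ; something like this, falling back to transparent black, which matches what modern browsers effectively do anyway :

// returns RGBA ; bg_index comes from the global header
static unsigned int gif_background_color( int bg_index,
                                          const unsigned int * global_pal,
                                          int global_pal_count )
{
    if ( global_pal == NULL || bg_index < 0 || bg_index >= global_pal_count )
        return 0; // transparent black, per modern-browser behavior
    return global_pal[ bg_index ];
}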

In the optional GCE block, each image can have a transparent color set. This is a palette index which acts as a color-key transparency. Transparent pixels should show whatever was beneath them in the canvas. That is, they do not necessarily result in transparent pixels in the output if there was a solid pixel beneath them in the canvas from a previous image.

Each image has a "delay" time and "dispose" setting in the optional GCE block (which occurs before the image data). These apply *after* that frame.

Delay is the frame time; it can vary per frame, there is no global constant fps. Delay is in centiseconds, and the support for delay < 10 is nasty. In practice you must interpret a delay of 0 or 1 centiseconds to mean "not set" rather than to mean they actually wanted a delay of 0 or 1. (people who take delay too literally are why some gifs play way too fast in some viewers).
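That is, something like this ; the 100 ms fallback is my choice (it seems to be what most viewers substitute), not anything in the spec :

// delay is in centiseconds ; 0 and 1 mean "not set" in practice
static int gif_frame_delay_ms( int delay_centiseconds )
{
    if ( delay_centiseconds <= 1 )
        return 100; // what most viewers substitute for "not set"
    return delay_centiseconds * 10;
}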

Dispose is an action to take on the image after it has been displayed and held for delay time. Note that it applies to the *image* not the canvas (the image may be just a rectangular portion of the canvas). It essentially tells how that image's data will be committed to the canvas for future frames. Unfortunately the dispose value of 0 for "not specified" is extremely common. It appears to be correct to treat that the same as a value of 1 (leave in place).

(ADD : I've seen several cases of a dispose value of "5". Disposal is supposed to be a 3-bit value, of which only the values 0-3 are defined (and fucking 0 means "undefined"). Values 4-7 are supposed to be reserved.)
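So a tolerant decoder sanitizes the field. A sketch of how I'd map the raw 3-bit value ; treating the reserved values 4-7 as "leave in place" is just my guess at the safest behavior :

enum
{
    GIF_DISPOSE_NONE       = 0, // "not specified" ; extremely common
    GIF_DISPOSE_LEAVE      = 1, // leave image in place on the canvas
    GIF_DISPOSE_BACKGROUND = 2, // clear image rect (to transparent, in practice)
    GIF_DISPOSE_PREVIOUS   = 3  // restore canvas to pre-image state
};

static int sanitize_dispose( int dispose )
{
    // treat "not specified" and reserved values 4-7 as "leave in place" :
    if ( dispose == GIF_DISPOSE_NONE || dispose > GIF_DISPOSE_PREVIOUS )
        return GIF_DISPOSE_LEAVE;
    return dispose;
}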

The ImageMagick and "theimage.com" links above are very handy for testing disposal and other funny animation details.

It's a shame that zero-delay is so widely mis-used and not supported, because it is the most interesting feature in GIF for encoder flexibility.

3/25/2014

03-25-14 - deduper

So here's my little dedupe :

dedupe.zip (x64 exe)

This is a file level deduper (not a block or sector deduper) eg. it finds entire files that are identical.

dedupe.exe does not delete dupes or do anything to them. Instead it outputs a batch file which contains commands to do something to the dupes. The intention is that you then go view/edit that batch file and decide which files you actually want to delete or link or whatever.

It runs on Oodle, using multi-threaded dir enumeration and all that. (see also tabdir )

It finds possible dedupes first by checking file sizes. For any file size where there is more than one file of that size, it then hashes and collects possible dupes that have the same hash. Those are then verified to be actual byte-by-byte dupes.
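In sketch form the pipeline is : group by size, hash the size-collision groups, then verify. Here's a toy stand-in for the hash-and-verify part, not the Oodle code ; FNV-1a is just a placeholder hash :

#include <stdio.h>
#include <stdint.h>
#include <string.h>

// placeholder hash ; any fast 64-bit file hash works here
static uint64_t hash_file( const char * path )
{
    uint64_t h = 14695981039346656037ULL; // FNV-1a offset basis
    FILE * f = fopen(path,"rb");
    if ( ! f ) return 0;
    unsigned char buf[65536];
    size_t n;
    while ( (n = fread(buf,1,sizeof(buf),f)) > 0 )
        for (size_t i = 0; i < n; i++)
            h = (h ^ buf[i]) * 1099511628211ULL;
    fclose(f);
    return h;
}

// final verification : equal sizes + equal hashes still gets
// a byte-by-byte compare before anything is called a dupe
static int files_identical( const char * a, const char * b )
{
    FILE * fa = fopen(a,"rb");
    FILE * fb = fopen(b,"rb");
    int same = ( fa != NULL && fb != NULL );
    unsigned char ba[65536], bb[65536];
    while ( same )
    {
        size_t na = fread(ba,1,sizeof(ba),fa);
        size_t nb = fread(bb,1,sizeof(bb),fb);
        if ( na != nb || memcmp(ba,bb,na) != 0 ) { same = 0; break; }
        if ( na == 0 ) break; // clean EOF on both -> identical
    }
    if ( fa ) fclose(fa);
    if ( fb ) fclose(fb);
    return same;
}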

Obviously deleting a bunch of dupes is dangerous, be careful, it's not my fault, etc.

Possible todos to make it faster :

1. Use an (optional) cache of hashes and modtimes to accelerate re-runs. One option is to store the hash of each file in an NTFS alternate data stream on that file (which allows the file to be moved or renamed and retain its hash); probably better just to have an explicit cache file and not muck about with rarely-used OS features.

2. If there are only 2 or 3 files of a given size, I should just jump directly to the byte-by-byte compare. Currently they are first opened and hashed, and then if the hashes match they are byte-by-byte compared, which means an extra scan of those files.

3/15/2014

03-15-14 - Bit IO of Elias Gamma

Making a record of something not new :

Elias Gamma codes are made by writing the position of the top bit using unary, and then the lower bits raw. eg. to send 30 = 11110 : the top bit is in position 4, so you send that in unary as "00001", and then the bottom 4 bits "1110". The first few values (starting at 1) are :


1 =   1 : 1
2 =  10 : 01 + 0
3 =  11 : 01 + 1
4 = 100 : 001 + 00
5 = 101 : 001 + 01
...

The naive way to send this code is :

void PutEliasGamma( BITOUT, unsigned val )
{
    ASSERT( val >= 1 );

    int topbit = bsr(val);

    ASSERT( val >= (1<<topbit) && val < (2<<topbit) );

    PutUnary( BITOUT, topbit );

    val -= (1<<topbit); // or &= (1<<topbit)-1;

    PutBits( BITOUT, val, topbit );
} 

But it can be done more succinctly.

We should notice two things. First of all PutUnary is very simple :


void PutUnary( BITOUT, unsigned val )
{
    PutBits( BITOUT , 1, val+1 );
}

That is, it's just putting the value 1 in a variable number of bits. This gives you 'val' leading zero bits and then a 1 bit, which is the unary encoding.

The next is that the 1 from the unary is just the same as the 1 we remove from the top position of 'val'. That is, we can think of the bits thusly :


5 = 101 : 001 + 01

unary of two + remaining 01
or

5 = 101 : 00 + 101

two zeros + the value 5

The result is a much simplified elias gamma coder :

void PutEliasGamma( BITOUT, unsigned val )
{
    ASSERT( val >= 1 );

    int topbit = bsr(val);

    ASSERT( val >= (1<<topbit) && val < (2<<topbit) );

    PutBits( BITOUT, val, 2*topbit+1 );
} 

note that if your bit IO is backwards then this all works slightly differently. (I'm assuming you can combine two PutBits into one, with the first PutBits in the top of the second; that is,
PutBits(a,na) + PutBits(b,nb) == PutBits((a<<nb)|b, na+nb) .)

Perhaps more importantly, we can do a similar transformation on the reading side.

The naive reader is :


int GetEliasGamma( BITIN )
{
    int bits = GetUnary( BITIN );

    int ret = (1<<bits) + GetBits( BITIN, bits );

    return ret;
}

(assuming your GetBits can handle getting zero bits; note the decoded value is always >= 1). The naive unary reader is :

int GetUnary( BITIN )
{
    int ret = 0;
    while( GetOneBit( BITIN ) == 0 )
    {
        ret++;
    }
    return ret;
}

but if your bits are top-justified in your bit input word (as in ans_fast for example, or see the end of this post for a reference implementation), then you can use count_leading_zeros to read unary :

int GetUnary( BITIN )
{
    int clz = count_leading_zeros( BITIN );

    ASSERT( clz < NumBitsAvailable(BITIN) );

    int one = GetBits( BITIN, clz+1 );
    ASSERT( one == 1 );

    return clz;
}

here the GetBits is just consuming the zeros and the terminating 1 bit of the unary code. Just like in the Put case, the key thing is that the trailing 1 bit of the unary is the same as the top bit value ( "(1<<bits)" ) that we added in the naive reader. That is :

int GetEliasGamma( BITIN )
{
    int bits = count_leading_zeros( BITIN );

    ASSERT( bits < NumBitsAvailable(BITIN) );

    int one = GetBits( BITIN, bits+1 );
    ASSERT( one == 1 );

    int ret = (1<<bits) + GetBits( BITIN, bits );

    return ret;
}

can be simplified to combine the GetBits :

int GetEliasGamma( BITIN )
{
    int bits = count_leading_zeros( BITIN );

    ASSERT( bits < NumBitsAvailable(BITIN) );

    int ret = GetBits( BITIN, 2*bits + 1 );

    ASSERT( ret >= (1<<bits) && ret < (2<<bits) );

    return ret;
}

again assuming that your GetBits combines like big-endian style.

You can do the same for "Exp Golomb" of course, which is just Elias Gamma + some raw bits. (Exp Golomb is the special case of Golomb codes with a power of two divisor).

Summary :

//===============================================================================

// Elias Gamma works on vals >= 1
// these assume that the value fits in your bit word
// and your bit reader is big-endian and top-justified

#define BITOUT_PUT_ELIASGAMMA(bout_bits,bout_numbits,val) do { \
    ASSERT( val >= 1 ); \
    uint32 topbit = bsr64(val); \
    BITOUT_PUT(bout_bits,bout_numbits, val, 2*topbit + 1 ); \
    } while(0)

#define BITIN_GET_ELIASGAMMA(bitin_bits,bitin_numbits,val) do { \
    uint32 nlz = clz64(bitin_bits); \
    uint32 nbits = 2*nlz+1; \
    BITIN_GET(bitin_bits,bitin_numbits,nbits,val); \
    } while(0)

//===============================================================================
// MSVC implementations of bsr and clz :

#include <intrin.h>   // _BitScanReverse64
#include <stdlib.h>   // _byteswap_uint64 (used by the bitio below)

static inline uint32 bsr64( uint64 val )
{
    ASSERT( val != 0 );
    unsigned long b = 0;
    _BitScanReverse64( &b, val );
    return b;
}

static inline uint32 clz64(uint64 val)
{
    return 63 - bsr64(val);
}

//===============================================================================
// and for completeness, reference bitio that works with those functions :
//  (big endian ; bit input word top-justified)

#define BITOUT_VARS(bout_bits,bout_numbits,bout_ptr) \
    uint64 bout_bits; \
    int64 bout_numbits; \
    uint8 * bout_ptr;

#define BITOUT_START(bout_bits,bout_numbits,bout_ptr,buf) do { \
        bout_bits = 0; \
        bout_numbits = 0; \
        bout_ptr = (uint8 *)buf; \
    } while(0)

#define BITOUT_PUT(bout_bits,bout_numbits,val,nb) do { \
        ASSERT( (bout_numbits+nb) <= 64 ); \
        ASSERT( (val) < (1ULL<<(nb)) ); \
        bout_numbits += nb; \
        bout_bits |= ((uint64)(val)) << (64 - bout_numbits); \
    } while(0)
    
#define BITOUT_FLUSH(bout_bits,bout_numbits,bout_ptr) do { \
        *((uint64 *)bout_ptr) = _byteswap_uint64( bout_bits ); \
        bout_bits <<= (bout_numbits&~7); \
        bout_ptr += (bout_numbits>>3); \
        bout_numbits &= 7; \
    } while(0)
    
#define BITOUT_END(bout_bits,bout_numbits,bout_ptr) do { \
        BITOUT_FLUSH(bout_bits,bout_numbits,bout_ptr); \
        if ( bout_numbits > 0 ) bout_ptr++; \
    } while(0)

//===============================================================

#define BITIN_VARS(bitin_bits,bitin_numbits,bitin_ptr) \
    uint64 bitin_bits; \
    int64 bitin_numbits; \
    uint8 * bitin_ptr;

#define BITIN_START(bitin_bits,bitin_numbits,bitin_ptr,begin_ptr) do { \
        bitin_ptr = (uint8 *)begin_ptr; \
        bitin_bits = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        bitin_ptr += 8; \
        bitin_numbits = 64; \
    } while(0)

#define BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr) do { if ( bitin_numbits <= 56 ) { \
        ASSERT( bitin_numbits > 0 && bitin_numbits <= 64 ); \
        uint64 next8 = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        int64 bytesToGet = (64 - bitin_numbits)>>3; \
        bitin_ptr += bytesToGet; \
        bitin_bits |= next8 >> bitin_numbits; \
        bitin_numbits += bytesToGet<<3; \
        ASSERT( bitin_numbits >= 56 && bitin_numbits <= 64 ); \
    } } while(0)

#define BITIN_GET(bitin_bits,bitin_numbits,nb,ret) do { \
        ASSERT( nb <= bitin_numbits ); \
        ret = (bitin_bits >> 1) >> (63 - nb); \
        bitin_bits <<= nb; \
        bitin_numbits -= nb; \
    } while(0)

//=========================================================

and yeah yeah I know this bitio could be faster. It's a reference implementation that's trying to avoid obfuscations. GTFO.
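For completeness, a little round-trip smoke test showing how the pieces compose. The loop bounds and buffer size here are arbitrary; the flush-per-put is just to keep the 64-bit bit buffer from overflowing, and the buffer is oversized so the 8-byte reads/writes stay in bounds :

void test_eliasgamma( void )
{
    uint8 buf[256]; // oversized ; the 64-bit reads/writes need slack

    BITOUT_VARS(bb,bn,bp);
    BITOUT_START(bb,bn,bp,buf);
    for (uint64 v = 1; v <= 20; v++)
    {
        BITOUT_PUT_ELIASGAMMA(bb,bn,v);
        BITOUT_FLUSH(bb,bn,bp); // keep bout_numbits small
    }
    BITOUT_END(bb,bn,bp);

    BITIN_VARS(ib,in,ip);
    BITIN_START(ib,in,ip,buf);
    for (uint64 v = 1; v <= 20; v++)
    {
        BITIN_REFILL(ib,in,ip);
        uint64 got;
        BITIN_GET_ELIASGAMMA(ib,in,got);
        ASSERT( got == v );
    }
}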

Added exp-golomb. The naive put is :


PutEliasGamma( (val >> r) + 1 ); // +1 because Elias Gamma needs vals >= 1
PutBits( val & ((1<<r)-1) , r );

but you do various reductions and get to :
//===============================================================================

// this Exp Golomb is for val >= 0
// Exp Golomb is Elias Gamma + 'r' raw bits

#define BITOUT_PUT_EXPGOLOMB(bout_bits,bout_numbits,r,val) do { \
    ASSERT( val >= 0 ); \
    uint64 up = (val) + (1ULL<<(r)); \
    uint32 topbit = bsr64(up); \
    ASSERT( topbit >= (uint32)(r) ); \
    BITOUT_PUT(bout_bits,bout_numbits, up, 2*topbit + 1 - r ); \
    } while(0)
    
#define BITIN_GET_EXPGOLOMB(bitin_bits,bitin_numbits,r,val) do { \
    uint32 nbits = 2*clz64(bitin_bits)+1+r; \
    BITIN_GET(bitin_bits,bitin_numbits,nbits,val); \
    ASSERT( val >= (1ULL<<r) ); \
    val -= (1ULL<<r); \
    } while(0)

//=========================================================

3/14/2014

03-14-14 - Fold Up Negatives

Making a record of something not new :

Say you want to take the integers {-inf,inf} and map them to just the non-negatives {0,1,..inf}. (and/or vice-versa)

(this is common for example when you want to send a signed value using a variable length code, like unary or golomb or whatever; yes yes there are other ways, for now assume you want to do this).

We need to generate a number line with the negatives "folded up" and interleaved with the positives, like


0,-1,1,-2,2,...

The naive way is :

// fold_up makes positives even
//   and negatives odd

unsigned fold_up_negatives(int i)
{
    if ( i >= 0 )
        return i+i;
    else
        return (unsigned)(-i-i-1); 
}

int unfold_negatives(unsigned i)
{
    if ( i&1 ) 
        return - (int)((i+1)>>1);
    else
        return (i>>1);
}

Now we want to do it branchless.

Let's start with folding up. What we want to achieve mathematically is :


fold_up_i = 2*i           if i >= 0
fold_up_i = 2*abs(i) - 1  if i is negative

To do this we will use some tricks on 2's complement integers.

The first is getting the sign. Assuming 32-bit integers for now, we can use :


minus_one_if_i_is_negative = (i >> 31);

= 0 if i >= 0
= -1 if i < 0

which works by taking the sign bit and replicating it down. (make sure to use signed right shift, and yes right-shifting a negative is technically implementation-defined blah blah gtfo etc).

The other trick is to use the way a negative is made in 2's complement.


(x > 0)

-x = (x^-1) + 1

or

-x = (x-1)^(-1)

and of course x^-1 is the same as (~x), that is flip all the bits. This also gives us :

x^-1 = -x -1

And it leads obviously to a branchless abs :


minus_one_if_i_is_negative = (i >> 31);
abs_of_i = (i ^ minus_one_if_i_is_negative) - minus_one_if_i_is_negative;

since if i is negative this is

-x = (x^-1) + 1

and if i is positive it's

x = (x^0) + 0

So we can plug this in :

fold_up_i = 2*abs(i) - 1 if i is negative

fold_up_i = abs(2i) - 1 if i is negative

minus_one_if_i_is_negative = (i >> 31);
abs(2i) = (2i ^ minus_one_if_i_is_negative) - minus_one_if_i_is_negative;

fold_up_i = abs(2i) + minus_one_if_i_is_negative

fold_up_i = (2i) ^ minus_one_if_i_is_negative

or in code :

unsigned fold_up_negatives(int i)
{
    unsigned two_i = ((unsigned)i) << 1;
    int sign_i = i >> 31;
    return two_i ^ sign_i;
}

For unfold we use the same tricks. I'll work it backwards from the answer for variety and brevity. The answer is :

int unfold_negatives(unsigned i)
{
    unsigned half_i = i >> 1;
    int sign_i = - (int)( i & 1 );
    return half_i ^ sign_i;
}

and let's prove it's right :

if i is even

half_i = i>>1;
sign_i = 0;

return half_i ^ 0 = i/2;
// 0,2,4,... -> 0,1,2,...

if i is odd

half_i = i>>1; // i is odd, this truncates
sign_i = -1;

return half_i ^ -1 
 = -half_i -1
 = -(i>>1) -1
 = -((i+1)>>1)
// 1,3,5,... -> -1,-2,-3,...

which is what we wanted.

Small note : on x86 you might rather use cdq to get the replicated sign bit of an integer rather than >>31 ; there are probably similar instructions on other architectures. Is there a neat way to make C generate that? I dunno. Not sure it ever matters. In practice you should use an "int32" type or compiler_assert( sizeof(int) == 4 );

Summary :


unsigned fold_up_negatives(int i)
{
    unsigned two_i = ((unsigned)i) << 1;
    int sign_i = i >> 31;
    return two_i ^ sign_i;
}

int unfold_negatives(unsigned i)
{
    unsigned half_i = i >> 1;
    int sign_i = - (int)( i & 1 );
    return half_i ^ sign_i;
}
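And a quick sanity check of the pair :

#include <assert.h>

int main(void)
{
    for (int i = -100000; i <= 100000; i++)
    {
        unsigned f = fold_up_negatives(i);
        assert( unfold_negatives(f) == i );
        // non-negatives land on evens, negatives on odds :
        assert( (f & 1) == (unsigned)( i < 0 ) );
    }
    return 0;
}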

3/13/2014

03-13-14 - Hilbert Curve Testing

So I stumbled on this blog post about counting the rationals which I found rather amusing.

The snippet that's relevant here is that if you iterate through the rationals naively by doing something like

1/1 2/1 3/1 4/1 ...
1/2 2/2 3/2 4/2 ...
1/3 2/3 ...
...
then you will never even reach 1/2 because the first line is infinitely long. But if you take a special kind of path, you can reach any particular rational in a finite number of steps. Much like the way a Hilbert curve lets you walk the 2d integer grid using only a 1d path.

It reminded me of something that I've recently changed in my testing practices.

When running code we are always stepping along a 1d path, which you can think of as discrete time. That is, you run test 1, then test 2, etc. You want to be able to walk over some large multi-dimensional space of tests, but you can only do so by stepping along a 1d path through that space.

I've had a lot of problems testing Oodle, because there are lots of options on the compressors, lots of APIs, lots of compression levels and different algorithms - it's impossible to test all combinations, and particularly impossible to test all combinations on lots of different files. So I keep getting bitten by some weird corner case.

(Total diversion - actually this is a good example of why I'm opposed to the "I tested it" style of coding in general. You know when you stumble on some code that looks like total garbage spaghetti, and you ask the author "wtf is going on here, do you even understand it?" and they "well it works, I tested it". Umm, no, wrong answer. Maybe it passed the test, but who knows how it's going to be used down the road and fail in mysterious ways? Anyway, that's an old school cbloom coding practices rant that I don't bother with anymore ... and I haven't been great about following my own advice ...)

If you try to set up a big test by just iterating over each option :

for each file
  for each compressor
    for each compression_level
      for each chunking
        for each parallel branching
          for each dictionary size
            for each sliding window size
              ...

then you'll never even get to the second file.

The better approach is to get a broad sampling of options. An approach I like is to enumerate all the tests I want to run, using a loop like the above, put them all in a list, then randomly permute the list. Because it's just a permutation, I will still only run each test once, and will cover all the cases in the enumeration, but by permuting I get a broader sampling more quickly.

(you could also add the nice technique that we worked out here long ago - generate a consistent permutation using a pseudorandom generator with known seed, and save your place in the list with a file or the registry or something. That way when you stop and start the tests, you will resume where you left off, and eventually cover more of the space (also when a test fails you will automatically repeat the same test if you restart)).
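A minimal sketch of that whole pattern ; everything here (the names, the progress file) is made up for illustration, it's not Oodle's actual test harness :

#include <stdio.h>
#include <stdlib.h>

typedef struct { int file, compressor, level; } Test;

int main(void)
{
    // enumerate the full cross product up front :
    enum { NFILES = 10, NCOMP = 4, NLEVELS = 8 };
    static Test tests[ NFILES*NCOMP*NLEVELS ];
    int n = 0;
    for (int f = 0; f < NFILES; f++)
    for (int c = 0; c < NCOMP; c++)
    for (int l = 0; l < NLEVELS; l++)
    {
        Test t = { f, c, l };
        tests[n++] = t;
    }

    // Fisher-Yates with a fixed seed -> the same permutation every run :
    srand(12345);
    for (int i = n-1; i > 0; i--)
    {
        int j = rand() % (i+1);
        Test tmp = tests[i]; tests[i] = tests[j]; tests[j] = tmp;
    }

    // resume where we left off ; a failed test is re-run on restart
    // because the index is saved *before* the test executes :
    int start = 0;
    FILE * fp = fopen("test_progress.txt","r");
    if ( fp ) { fscanf(fp,"%d",&start); fclose(fp); }

    for (int i = start; i < n; i++)
    {
        fp = fopen("test_progress.txt","w");
        if ( fp ) { fprintf(fp,"%d",i); fclose(fp); }

        // run_test( &tests[i] ); // hypothetical

    }

    remove("test_progress.txt"); // clean finish
    return 0;
}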

The other trick that's useful in practice is to front-load the quicker tests. You do want to have a test where you run on 8 GB files to make sure that works, but if that's your first test you have to wait forever to get a failure. This is particularly for the case that something dumb is broken, it should show up as quickly as possible so you can just cancel the test and fix it. So you want an initial broad sampling of options on very quick tests.

3/03/2014

03-03-14 - Windows Links

I wrote a deduper in Oodle today. I was considering making the default action be to replace duplicate files with a link to the original.

I wasn't sure whether to use "hard" or "soft" links, so I did a little research.

In Windows a "hard link" means that multiple file names all point at the same file data. It's a bit of a misnomer, it's not really a "link" to the original. There is no "original" or base file - all instances of a hard link are equal peers.

A "soft link" is just an OS-level shortcut. There is an original base file, and the soft links point at it.

Both are ridiculously broken concepts and IMO should almost never be used.

With "hard links" the problem is that if you accidentally edit any of the links, you have editted the master data. If you did not intend that, you may have fucked up something severely.

Hard links are reasonable *if* the files linked are read-only (and somehow actually kept read-only, not just attrib'ed away).

The problem with "soft links" is that the links are not protected; if you rename or move or delete the original file, all the links are broken, and again you may have just severely fucked something up.

The big problem is that you get no warning in either case. Clearly what you want when you try to rename a file which has a bunch of soft links pointing at it is some kind of dialog that says "hey all these links point at this file, do you really want to rename it and break the links? or should I update the links to point at the new name?". Similarly with hard links, obviously what you want is some prompt like "hey if you modify this, do you want these hard links to see the new version or the old version?".

Now obviously you can't solve this problem in general without user prompts. But I believe that a refcounted copy-on-write link would have been a much more robust and safe solution. Open for write should have done a COW by default unless you passed a special flag to indicate you intend to edit shared data.

Even ignoring the fundamental logic of how links work, there are some very bad practical issues for links in windows.

1. Soft links show a file size of 0 in the dir enumeration file info. This breaks the assumption that most apps make that the file size they get from the dir enumeration will be the same as the file size they get if they open that file handle and ask for its size. It can also screw up enumerations that are just trying to skip zero-size files.

Hard link file sizes are out of date. If the file data is modified, only the directory entry for the one that was used to modify the data is updated. All other links still have the old file sizes, mod times, etc.

2. Hard links break the assumption that saving to a file is the same as saving to a temp and then renaming onto the file. Renaming a temp over the file replaces the directory entry and leaves the other hard links pointing at the old data, while writing in place changes the data under all of them. Apps may or may not use the "write to temp then rename" pattern, so you get massively different results in a very unexpected way depending on which one they picked.

3. Mod times are hosed. In general attributes are hosed; neither type of link reflects the attributes of the actual file data in the link - until they are opened, then they get updated. Mod times are particularly bad because many apps use them to detect changes, and with links the file data can be changed but the mod time won't reflect it.

Dear lord. So non-robust.
