3/25/2015

03-25-15 - My Chameleon

I did my own implementation of the Chameleon compression algorithm. (The original is distributed via the density project.)

This is the core of Chameleon's encoder :

    cur = *fm32++; h = CHAMELEON_HASH(cur); flags <<= 1;
    if ( c->hash[h] == cur ) { flags ++; *to16++ = (uint16) h; }
    else { c->hash[h] = cur; *((uint32 *)to16) = cur; to16 += 2; }

This is the decoder :

    if ( (int16)flags < 0 ) { cur = c->hash[ *fm16++ ]; }
    else { cur = *((const uint32 *)fm16); fm16 += 2; c->hash[ CHAMELEON_HASH(cur) ] = cur; }
    flags <<= 1; *to32++ = cur;
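
If you want to poke at the scheme without the header, here is a minimal self-contained sketch of the same two loops. It is not the code from Chameleon.h. Everything named Sketch/sketch_/SK_ is made up for illustration, the hash multiplier is a placeholder, and storing the 16-bit flags word ahead of each group of 16 dwords is my assumption about layout. It also assumes the input size is a multiple of 64 bytes and uses memcpy for loads and stores to dodge alignment issues (the output is still not endian-invariant).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-bit multiplicative hash; the multiplier is a placeholder. */
#define SK_HASH_BITS 16
#define SK_HASH(x) ((uint32_t)((x) * 2654435761u) >> (32 - SK_HASH_BITS))

typedef struct { uint32_t hash[1 << SK_HASH_BITS]; } SketchState;

static void sketch_reset(SketchState *s) { memset(s, 0, sizeof *s); }

/* Encodes in_size bytes (must be a multiple of 64). Returns compressed size.
   Each group is: 2 bytes of flags, then 16 items of 2 bytes (hash index)
   or 4 bytes (literal dword). */
static size_t sketch_encode(SketchState *s, uint8_t *out,
                            const uint8_t *in, size_t in_size)
{
    uint8_t *op = out;
    assert(in_size % 64 == 0);
    for (size_t pos = 0; pos < in_size; pos += 64) {
        uint8_t *flag_ptr = op;   /* reserve room for this group's flags */
        uint16_t flags = 0;
        op += 2;
        for (int i = 0; i < 16; i++) {
            uint32_t cur, h;
            memcpy(&cur, in + pos + 4 * i, 4);
            h = SK_HASH(cur);
            flags = (uint16_t)(flags << 1);
            if (s->hash[h] == cur) {          /* hit: send the 16-bit index */
                uint16_t h16 = (uint16_t)h;
                flags |= 1;
                memcpy(op, &h16, 2); op += 2;
            } else {                          /* miss: update table, send dword */
                s->hash[h] = cur;
                memcpy(op, &cur, 4); op += 4;
            }
        }
        memcpy(flag_ptr, &flags, 2);
    }
    return (size_t)(op - out);
}

/* Decodes exactly out_size bytes (must be a multiple of 64). */
static void sketch_decode(SketchState *s, uint8_t *out, size_t out_size,
                          const uint8_t *in)
{
    const uint8_t *ip = in;
    assert(out_size % 64 == 0);
    for (size_t pos = 0; pos < out_size; pos += 64) {
        uint16_t flags;
        memcpy(&flags, ip, 2); ip += 2;
        for (int i = 0; i < 16; i++) {
            uint32_t cur;
            if (flags & 0x8000u) {            /* top bit = oldest item in group */
                uint16_t h16;
                memcpy(&h16, ip, 2); ip += 2;
                cur = s->hash[h16];
            } else {
                memcpy(&cur, ip, 4); ip += 4;
                s->hash[SK_HASH(cur)] = cur;  /* mirror the encoder's update */
            }
            flags = (uint16_t)(flags << 1);
            memcpy(out + pos + 4 * i, &cur, 4);
        }
    }
}
```

The round trip works because both sides touch the table only on a miss, and in the same order, so the decoder's table always equals the encoder's table at the matching step. Both sides must start from the same reset state.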

I thought it deserved a super-simple STB-style header-only dashfuly-described implementation :

Chameleon.h

My Chameleon.h is not portable or safe or any of that jizzle. Maybe it will be someday. (Update : now builds on GCC & clang. Tested on PS4. Still not Endian-invariant.)


// Usage :

#define CHAMELEON_IMPL
#include "Chameleon.h"

Chameleon c;

Chameleon_Reset(&c);

size_t comp_buf_size = CHAMELEON_MAXIMUM_OUTPUT_SIZE(in_size);

void * comp_buf = malloc(comp_buf_size);

size_t comp_len = Chameleon_Encode(&c, comp_buf, in_buf, in_size );

Chameleon_Reset(&c);

Chameleon_Decode(&c, out_buf, in_size, comp_buf );

int cmp = memcmp(in_buf,out_buf,in_size);
assert( cmp == 0 );


ADD : Chameleon2 SIMD prototype now posted : (NOTE : this is not good, do not use)

Chameleon2.h - experimental SIMD wide Chameleon
both Chameleons in a zip

The SIMD encoder is not fast. Even on SSE4 it only barely beats scalar Chameleon. So this is a dead end. Maybe some day when we get fast hardware scatter/gather it will be good (*).

(* = though use of hardware scatter here is always going to be treacherous, because hashes may be repeated, and the order in which collisions resolve must be consistent)
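
To make the ordering hazard concrete, here is a tiny scalar stand-in for a scatter, committed in two different lane orders. The function names and the slot/value arrays are hypothetical and only for illustration; this is not code from Chameleon2.h and it says nothing about what any particular hardware guarantees.

```c
#include <stddef.h>
#include <stdint.h>

/* Commit lanes in ascending order: when two lanes target the same slot,
   the higher lane's value survives. This matches a scalar loop that
   processes dwords in input order. */
static void scatter_ascending(uint32_t *table, const uint16_t *slot,
                              const uint32_t *val, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        table[slot[i]] = val[i];
}

/* Commit lanes in descending order: the lower lane's value survives. */
static void scatter_descending(uint32_t *table, const uint16_t *slot,
                               const uint32_t *val, size_t lanes)
{
    for (size_t i = lanes; i-- > 0; )
        table[slot[i]] = val[i];
}
```

With slots {5, 9, 5, 3} and values {10, 20, 30, 40}, lanes 0 and 2 both target slot 5. The ascending version leaves 30 there and the descending version leaves 10. If the encoder and decoder disagree on which one happens, their hash tables diverge and every later hit on that slot decodes to the wrong dword.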

2 comments:

AJ Baxter said...

a note in your header suggests the output size may exceed the input size, and to ensure the buffer is big enough. By how much? Have you any idea about what size is required?

cbloom said...

Maximum output size is

(33/32)*in_size + 2

I believe.

You could also add a check for output overrun in each unrolled loop step. It's only done once per 64 input bytes so it should not affect speed much.
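
Where the 33/32 comes from: each group of 16 dwords is 64 input bytes and emits one 16-bit flags word plus at most 16 literal dwords, so at worst 66 bytes out per 64 in, and 66/64 = 33/32. Below is a rough sketch of that bound in integer arithmetic, plus the once-per-group overrun check. The names sketch_max_output and sketch_group_fits are hypothetical. This is not the header's CHAMELEON_MAXIMUM_OUTPUT_SIZE macro, whose definition is not shown here, and the "+ 2" is simply carried over from the formula above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* (33/32)*in_size + 2, with the in_size/32 term rounded up. */
static size_t sketch_max_output(size_t in_size)
{
    return in_size + (in_size + 31) / 32 + 2;
}

/* Worst case for one group: 2 flag bytes + 16 literal dwords. */
#define SKETCH_GROUP_WORST_BYTES (2 + 16 * 4)

/* Call once before encoding each 64-byte group; returns nonzero if the
   group is guaranteed to fit between op and out_end. */
static int sketch_group_fits(const uint8_t *op, const uint8_t *out_end)
{
    assert(op <= out_end);
    return (size_t)(out_end - op) >= SKETCH_GROUP_WORST_BYTES;
}
```

For example, sketch_max_output(1024) is 1024 + 32 + 2 = 1058. The check costs one compare per 64 input bytes, so an encoder can bail out cleanly instead of writing past the end of the buffer.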
