11/13/2012

11-13-12 - Another Oodle Rewrite Note

Of course this is not going to happen. But in the imaginary world in which I rewrite from scratch :

I've got a million (actually several hundred) APIs that start an Async op. All of those APIs take a bunch of standard arguments that they all share, so they all look like :


OodleHandle Oodle_Read_Async(

                // function-specific args :
                OodleIOQFile file,void * memory,SINTa size,S64 position,

                // standard args on every _Async function :
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

The idea was that you pass in everything needed to start the op, and when it's returned you get a fully valid Handle which is enqueued to run.

What I should have done was make all the little _Async functions create an incomplete handle, and then have a standard function to start it. Something like :


// prep an async handle but don't start it :
OodleHandleStaging Oodle_Make_Read(
                OodleIOQFile file,void * memory,SINTa size,S64 position
                );

// standard function to run any op :
OodleHandle Oodle_Start( OodleHandleStaging handle,
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

it would remove a ton of boiler-plate out of all my functions, and make it a lot easier to add more standard args, or have different ways of firing off handles. It would also allow things like creating a bunch of "Staging" handles that aren't enqueued yet, and then firing them off all at once, or even just holding them un-fired for a while, etc.

It's sort of ugly to make clients call two functions to run an async op, but you can always get client code that looks just like the old way by doing :


OodleHandle Oodle_Start( Oodle_Make_Read( OodleIOQFile file,void * memory,SINTa size,S64 position ) ,
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

and I could easily make macros that make that look like one function call.

Having that interval of a partially-constructed op would also let me add more attributes that you could set on the Staging handle before firing it off.

(for example : when I was testing compresses on enwik, some of the tasks could allocate something like 256MB each; it occurred to me that a robust task system should understand limitting the number of tasks that run at the same time if their usage of some resource exceeds the max. eg. for memory usage, if you know you have 2 GB free, don't run more than 8 of those 256 MB tasks at once, but you could run other non-memory-hungry tasks during that time. (I guess the general way to do that would be to make task groups and assign tasks to groups and then limit the number from a certain group that can run simultaneously))

11/08/2012

11-08-12 - Job System Task Types

I think I've written this before, but a bit more verbosely :

It's useful to have at least 3 classes of task in your task system :

1. "Normal" tasks that want to run on one of the worker threads. These take some amount of time, you don't want them to block other threads, they might have dependencies and create other jobs.

2. "IO" tasks. This is CPU work, but the main thing it does is wait on an IO and perhaps spawn another one. For example something like copying a file through a double-buffer is basically just wait on an IO op then do a tiny bit of math, then start another IO op. This should be run on the IO thread, because it avoids all thread switching and thread communication as it mediates the IO jobs.

(of course the same applies to other subsytems if you have them)

3. "Tiny" / "Run anywhere" tasks. These are tasks that take very little CPU time and should not be enqueued onto worker threads, because the time to wake up a thread for them dwarfs the time to just run them.

The only reason you would run these as async tasks at all is because they depend on something. Generally this is something trivial like "set a bool after this other job is done".

These tasks should be run immediately when they become ready-to-run on whatever thread made them rtr. So they might run on the IO thread, or a worker thread, or the main thread (any client thread). eg. if a tiny task is enqueued on the main thread and its dependencies are already done, it should just be run immediately.

It's possible that class #2 could be merged into class #3. That is, eliminate the IO-tasks (or GPU-tasks or whatever) and just call them all "tiny tasks". You might lose a tiny bit of efficiency from that, but the simplicity of having only two classes of tasks is probably preferable. If the IO tasks are made into just generic tiny tasks, then it's important that the IO thread be able to execute tiny tasks from the generic job system itself, otherwise it might go to sleep thinking there is no IO to be done, when a pending tiny IO task could create new IO work for it.

Okay.

Beyond that, for "normal" tasks there's the question of typical duration, which tells you whether it's worth it to fire up more threads.

eg. say you shoot 10 tasks at your thread-pool worker system. Should you wake up 1 thread and let it do all 10 ? Or wake up 10 threads and give each one a task? Or maybe wake 2?

One issue that still is bugging me is when you have a worker thread, and in doing some work it makes some more tasks ready-to-run. Should it fire up new worker threads to take those tasks, or should it just finish its task and then do them itself? You need two pieces of information : 1. are the new tasks significant enough to warrant a new thread? and 2. how close to the end of my current task am I? (eg. if I'm in the middle of some big work I might want to fire up a new thread even though the new RTR tasks are tiny).

When you have "tiny" and "normal" tasks at the same priority level, it's probably worth running all the tinies before any normals.

Good lord.

11/06/2012

11-06-12 - Using your own malloc with operator new

In Oodle, I use my own allocator, and I wish to be able to still construct & destruct classes. (Oodle's public API is C only, but I use C++ internally).

The traditional way to do this is to write your own "operator new" implementation which will link in place of the library implementation. This way sucks for various reasons. The important one to me is that it changes all the news of any other statically-linked code, which is just not an okay thing for a library to do. You may want to have different mallocs for different purposes; the whole idea of a single global allocator is kind of broken in the modern world.

(the presence of global state in the C standard lib is part of what makes C code so hard to share. The entire C stdlib should be a passed-in vtable argument. Perhaps more on this in a later post.)

Anyway, what I want is a way to do a "new" without interfering with client code or other libraries. It's relatively straightforward (*), but there are a few little details that took me a while to get right, so here they are.

(* = ADDENDUM = not straightforward at all if multiple-inheritance is used and deletion can be done on arbitrary parts of the MI class)

//==================================================================

/*

subtlety : just calling placement new can be problematic; it's safer to make an explicit selection
of placement new.  This is how we call the constructor.

*/

enum EnumForPlacementNew { ePlacementNew };

// explicit access to placement new when there's ambiguity :
//  if there are *any* custom overrides to new() then placement new becomes ambiguous   
inline void* operator new   (size_t, EnumForPlacementNew, void* pReturn) { return pReturn; }
inline void  operator delete(void*,  EnumForPlacementNew, void*) { }

#ifdef __STRICT_ALIGNED
// subtlety : some stdlibs have a non-standard operator new with alignment (second arg is alignment)
//  note that the alignment is not getting passed to our malloc here, so you must ensure you are
//    getting it in some other way
inline void* operator new   (size_t , size_t, EnumForPlacementNew, void* pReturn) { return pReturn; }
#endif

// "MyNew" macro is how you do construction

/*

subtlety : trailing the arg list off the macro means we don't need to do this kind of nonsense :

    template <typename Entry,typename t_arg1,typename t_arg2,typename t_arg3,typename t_arg4,typename t_arg5,typename t_arg6,typename t_arg7,typename t_arg8,typename t_arg9>
    static inline Entry * construct(Entry * pEntry, t_arg1 arg1, t_arg2 arg2, t_arg3 arg3, t_arg4 arg4, t_arg5 arg5, t_arg6 arg6, t_arg7 arg7, t_arg8 arg8, t_arg9 arg9)

*/

//  Stuff * ptr = MyNew(Stuff) (constructor args); 
//  eg. for void args :
//  Stuff * ptr = MyNew(Stuff) ();
#define MyNew(t_type)   new (ePlacementNew, (t_type *) MyMalloc(sizeof(t_type)) ) t_type 

// call the destructor :
template <typename t_type>
static inline t_type * destruct(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    ptr->~t_type();
    return ptr;
}

// MyDelete is how you kill a class

/*

subtlety : I like to use a Free() which takes the size of the object.  This is a big optimization
for the allocator in some cases (or lets you not store the memory size in a header of the allocation).
*But* if you do this, you must ensure that you don't use sizeof() if the object is polymorphic.
Here I use MSVC's nice __has_virtual_destructor() extension to detect if a type is polymorphic.

*/

template <typename t_type>
void MyDeleteNonVirtual(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    #ifdef _MSC_VER
    RR_COMPILER_ASSERT( ! __has_virtual_destructor(t_type) );
    #endif
    destruct(ptr);
    MyFree_Sized((void *)ptr,sizeof(t_type));
}

template <typename t_type>
void MyDeleteVirtual(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    destruct(ptr);
    // can't use size :
    MyFree_NoSize((void *)ptr);
}

#ifdef _MSC_VER

// on MSVC , MyDelete can select the right call at compile time

template <typename t_type>
void MyDelete(t_type * ptr)
{
    if ( __has_virtual_destructor(t_type) )
    {
        MyDeleteVirtual(ptr);
    }
    else
    {
        MyDeleteNonVirtual(ptr);
    }
}

#else

// must be safe and use the polymorphic call :

#define MyDelete    MyDeleteVirtual

#endif

and the end result is that you can do :

    foo * f = MyNew(foo) ();

    MyDelete(f);

and you get normal construction and destruction but with your own allocator, and without polluting (or depending on) the global linker space. Yay.

11-06-12 - Protect Ad-Hoc Multi-Threading

Code with un-protected multi-threading is bad code just waiting to be a nightmare of a bug.

"ad-hoc" multi-threading refers to sharing of data across threads without an explicit sharing mechanism (such as a queue or a mutex). There's nothing wrong per-se with ad-hoc multi-threading, but too often people use it as an excuse for "comment moderated handoff" which is no good.

The point of this post is : protect your threading! Use name-changes and protection classes to make access lifetimes very explicit and compiler (or at least assert) moderated rather than comment-moderated.

Let's look at some examples to be super clear. Ad-Hoc multi-threading is something like this :


int shared;


thread0 :
{

shared = 7; // no atomics or protection or anything

// shared is now set up

start thread1;

// .. do other stuff ..

kill thread1;
wait thread1;

print shared;

}

thread1 :
{

shared ++;

}

this code works (assuming that thread creation and waiting has some kind of memory barrier in it, which it usually does), but the hand-offs and synchronization are all ad-hoc and "comment moderated". This is terrible code.

I believe that even with something like a mutex, you should make the protection compiler-enforced, not comment-enforced.

Comment-enforced mutex protection is something like :


struct MyStruct s_data;
Mutex s_data_mutex;
// lock s_data_mutex before touching s_data

That's okay, but comment-enforced code is always brittle and bug-prone. Better is something like :

struct MyStruct s_data_needs_mutex;
Mutex s_data_mutex;
#define MYSTRUCT_SCOPE(name)    MUTEX_IN_SCOPE(s_data_mutex); MyStruct & name = s_data_needs_mutex;

assuming you have some kind of mutex-scoper class and macro. This makes it impossible to accidentally touch the protected stuff outside of a lock.

Even cleaner is to make a lock-scoper class that un-hides the data for you. Something like :

//-----------------------------------

template <typename t_data> class ThinLockProtectedHolder;

template <typename t_data>
class ThinLockProtected
{
public:

    ThinLockProtected() : m_lock(0), m_data() { }
    ~ThinLockProtected() { }    

protected:

    friend class ThinLockProtectedHolder<t_data>;
    
    OodleThinLock   m_lock;
    t_data          m_data;
};

template <typename t_data>
class ThinLockProtectedHolder
{
public:

    typedef ThinLockProtected<t_data>   t_protected;
    
    ThinLockProtectedHolder(t_protected * ptr) : m_protected(ptr) { OodleThinLock_Lock(&(m_protected->m_lock)); }
    ~ThinLockProtectedHolder() { OodleThinLock_Unlock(&(m_protected->m_lock)); }
    
    t_data & Data() { return m_protected->m_data; }
    
protected:

    t_protected *   m_protected;

};

#define TLP_SCOPE(t_data,ptr,data)  ThinLockProtectedHolder<t_data> RR_STRING_JOIN(tlph,data) (ptr); t_data & data = RR_STRING_JOIN(tlph,data).Data();

//--------
/*

// use like :

    ThinLockProtected<int>  tlpi;
    {
    TLP_SCOPE(int,&tlpi,shared_int);
    shared_int = 7;
    }

*/

//-----------------------------------
Errkay.

So the point of this whole post is that even when you are just doing ad-hoc thread ownership, you should still use a robustness mechanism like this. For example by direct analogy you could use something like :

//=========================================================================

template <typename t_data> class AdHocProtectedHolder;

template <typename t_data>
class AdHocProtected
{
public:

    AdHocProtected() : 
    #ifdef RR_DO_ASSERTS
        m_lock(0), 
    #endif
        m_data() { }
    ~AdHocProtected() { }

protected:

    friend class AdHocProtectedHolder<t_data>;
    
    #ifdef RR_DO_ASSERTS
    U32             m_lock;
    #endif
    t_data          m_data;
};

#ifdef RR_DO_ASSERTS
void AdHoc_Lock( U32 * pb)  { U32 old = rrAtomicAddExchange32(pb,1); RR_ASSERT( old == 0 ); }
void AdHoc_Unlock(U32 * pb) { U32 old = rrAtomicAddExchange32(pb,-1); RR_ASSERT( old == 1 ); }
#else
#define AdHoc_Lock(xx)
#define AdHoc_Unlock(xx)
#endif


template <typename t_data>
class AdHocProtectedHolder
{
public:

    typedef AdHocProtected<t_data>  t_protected;
    
    AdHocProtectedHolder(t_protected * ptr) : m_protected(ptr) { AdHoc_Lock(&(m_protected->m_lock)); }
    ~AdHocProtectedHolder() { AdHoc_Unlock(&(m_protected->m_lock)); }
    
    t_data & Data() { return m_protected->m_data; }
    
protected:

    t_protected *   m_protected;

};

#define ADHOC_SCOPE(t_data,ptr,data)    AdHocProtectedHolder<t_data> RR_STRING_JOIN(tlph,data) (ptr); t_data & data = RR_STRING_JOIN(tlph,data).Data();

//==================================================================
which provides scoped checked ownership of variable hand-offs without any explicit mutex.

We can now revisit our original example :


AdHocProtected<int> ahp_shared;


thread0 :
{

{
ADHOC_SCOPE(int,&ahp_shared,shared);

shared = 7; // no atomics or protection or anything

// shared is now set up
}

start thread1;

// .. do other stuff ..

kill thread1;
wait thread1;

{
ADHOC_SCOPE(int,&ahp_shared,shared);
print shared;
}

}

thread1 :
{
ADHOC_SCOPE(int,&ahp_shared,shared);

shared ++;

}

And now we have code which is efficient, robust, and safe from accidents.

10/26/2012

10-26-12 - Oodle Rewrite Thoughts

I'm getting increasingly annoyed with the C-style Oodle threading code. It's just such a nightmare to manually manage things like object lifetimes in an async / multi-threaded environment.

Even something as simple as "write part of this buffer to a file" constantly causes me pain, because implied in that operation is "the buffer must not be freed until the write is done" , "the buffer should not be changed in the area being written until the write is done" , and "the file should not be closed until the write is done".

When you first start out and aren't doing a lot of complicated ops, it doesn't seem too bad, you can keep those things in your head; they become "comment-enforced" rules; that is, the code doesn't make itself correct, you have to write comments like "// write is pending, don't free buffer yet" (often you don't actually write the comments, but they're still "comment-enforced" as opposed to "code-enforced").

I think the better way is the very-C++-y Oodle futures .

Oodle futures rely on every object they take as inputs having refcounts, so there is no issue of free before exit. Some key points about the Oodle futures that I think are good :

A. Dependencies are automatic based on your arguments. You depend on anything you take as arguments. If the arguments themselves depend on async ops, then you depend on the chain of ops automatically. This is super-sweet and just removes a ton of bugs. You are then required to write code such that all your dependencies are in the form of function arguments, which at first is a pain in the ass, but actually results in much cleaner code overall because it makes the expression of dependencies really clear (as opposed to just touching some global deep inside your function, which creates a dependency in a really nasty way).

B. Futures create implicit async handles; the async handles in Oodle future are all ref-counted so they clean themselves automatically when you no longer care about them. This is way better than the manual lifetime management in Oodle right now, in which you either have to hold a bunch of handles.

C. It's an easy way to plug in the result of one async op into the input of the next one. It's like an imperative way of using code to do that graph drawing thing ; "this op has an output which goes into this input slot". Without an automated system for this, what I'm doing at the moment is writing lots of little stub functions that just wait on one op, gather up its results and starts the next op. There's no inefficiency in this, it's the same thing the future system does, but it's a pain in the ass.

If I was restarting from scratch I would go even further. Something like :

1. Every object has a refcount AND a read-write lock built into. Maybe the refcount and RW lock count go together in one U32 or U64 which is maintained by lockfree ops.

Refcounting is obvious. Lifetimes of async ops are way too complicated without it.

The RW lock in every object is something that sophomoric programmers don't see the need for. They think "hey it's a simple struct, I fill it on one thread, then pass it to another thread, and he touches it". No no no, you're a horrible programmer and I don't want to work with you. It seems simple at first, but it's just so fragile and prone to bugs any time you change anything, it's not worth it. If every object doesn't just come with an RW lock it's too easy to be lazy and skip adding one, which is very bad. If the lock is uncontended, as in the simple struct handoff case above, then it's very cheap, so just use it anyway.

2. Whenever you start an async op on an object, it takes a ref and also takes either a read lock or write lock.

3. Buffers are special in that you RW lock them in ranges. Same thing with textures and such. So you can write non-overlapping ranges simultaneously.

4. Every object has a list of the ops that are pending on that object. Any time you start a new op on an object, it is delayed until those pending ops are done. Similarly, every op has a list of objects that it takes as input, and won't run until those objects are ready.

The other big thing I would do in a rewrite from scratch is the basic architecture :

1. Write all my own threading primitives (semaphore, mutex, etc) and base them on a single waitset. (I basically have this already).

2. Write stack-ful coroutines.

3. When the low level Wait() is called on a stackful coroutine, instead yield the coroutine.

That way the coroutine code can just use Semaphore or whatever, and when it goes to wait on the semaphore, it will yield instead. It makes the coroutine code exactly the same as non-coroutine code and makes it "composable" (eg. you can call functions and they actually work), which I believe is crucial to real programming. This lets you write stackful coroutine code that does file IO or waits on async ops or whatever, and when you hit some blocking code it just automatically yields the coroutine (instead of blocking the whole worker thread).

This would mean that you could write coroutine code without any special syntax; so eg. you can call the same functions from coroutines as you do from non-coroutines and it Just Works the way you want. Hmm I think I wrote the same sentence like 3 times, but it's significant enough to bear repetition.

10/22/2012

10-22-12 - Windows 7 Start Menu Input Race

I've been super annoyed by some inconsistent behavior in the Windows 7 start menu for a while now. Sometimes I hit "Start - programname - enter" and nothing happens. I just sort of put it down to "god damn Windows is flakey and shit" but I finally realized yesterday exactly what's happening.

It's an input race , as previously discussed here

What happens is, you hit Start, and you get your focus in the type-in-a-program edit box. That part is fine. You type in a program name. At that point it does the search in the start menu thing in the background (it doesn't stall after each key press). In many cases there will be a bit of a delay before it updates the list of matching programs found.

If you hit Enter before it finds the program and highlights it, it just closes the dialog and doesn't run anything. If you wait a beat before hitting enter, the background program-finder will highlight the thing and hitting enter will work.

Very shitty. The start menu should not have keyboard input races. In this case the solution is obvious and trivial - when you hit enter it should wait on the background search task before acting on that key (but if you hit escape it should immediately close the window and abort the task without waiting).

I've long been an advocate of video game programmers doing "flakiness" testing by playing the game at 1 fps, or capturing recordings of the game at the normal 30 fps and then watching them play back at 1 fps. When you do that you see all sorts of janky shit that should be eliminated, like single frame horrible animation pops, or in normal GUIs you'll see things like the whole thing redraw twice in a row, or single frames where GUI elements flash in for 1 frame in the wrong place, etc.

Things like input races can be very easily found if you artificially slow down the program by 100X or so, so that you can see what it's actually doing step by step.

I'm a big believer in eliminating this kind of flakiness. Almost nobody that I've ever met in development puts it as a high priority, and it does take a lot of work for apparently little reward, and if you ask consumers they will never rate it highly on their wish list. But I think it's more important than people realize; I think it creates a feeling of solidness and trust in the application. It makes you feel like the app is doing what you tell it to, and if your avatar dies in the game it's because of your own actions, not because the stupid game didn't jump even though you hit the jump button because there was one frame where it wasn't responding to input.

10-22-12 - LZ-Bytewise conclusions

Wrapping this up + index post. Previous posts in the series :

cbloom rants 09-02-12 - Encoding Values in Bytes Part 1
cbloom rants 09-02-12 - Encoding Values in Bytes Part 2
cbloom rants 09-02-12 - Encoding Values in Bytes Part 3
cbloom rants 09-04-12 - Encoding Values in Bytes Part 4
cbloom rants 09-04-12 - LZ4 Optimal Parse
cbloom rants 09-10-12 - LZ4 - Large Window
cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies
cbloom rants 09-13-12 - LZNib
cbloom rants 09-14-12 - Things Most Compressors Leave On the Table
cbloom rants 09-15-12 - Some compression comparison charts
cbloom rants 09-23-12 - Patches and Deltas
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
cbloom rants 09-28-12 - LZNib on enwik8 with Long Range Matcher
cbloom rants 09-30-12 - Long Range Matcher Notes
cbloom rants 10-02-12 - Small note on LZHAM
cbloom rants 10-04-12 - Hash-Link match finder tricks
cbloom rants 10-05-12 - OodleLZ Encoder Speed Variation with Worker Count
cbloom rants 10-07-12 - Small Notes on LZNib
cbloom rants: 10-16-12 - Two more small notes on LZNib

And some little additions :

First a correction/addendum on cbloom rants 09-04-12 - LZ4 Optimal Parse :

I wrote before that going beyond the 15 states needed to capture the LRL overflowing the control byte doesn't help much (or at all). That's true if you only go up to 20 or 30 or 200 states, but if you go all the way to 270 states, so that you capture the transition to needing another byte, there is some win to be had (LZ4P-LO-332 got lztestset to 12714031 with small optimal state set, 12492631 with large state set).

If you just do it naively, it greatly increases memory use and run time. However, I realized that there is a better way. The key is to use the fact that there are so many code-cost ties. In LZ-Bytewise with the large state set, often the coding decision in a large number of states will have the same cost, and furthermore often the end point states will all have the same cost. When this happens, you don't need to make the decision independently for each state, instead you make one decision for the entire block, and you store a decision for a range of states, instead of one for each state.

eg. to be explicit, instead of doing :


in state 20 at pos P
consider coding a literal (takes me to state 21 at pos P+1)
consider various matches (takes me to state 0 at pos P+L)
store best choice in table[P][20]

in state 21 ...

do :

in states 16-260 at pos P
consider coding a literal (takes me to states 17-261 at pos P+1 which I saw all have the same cost)
consider various matches (takes me to state 0 at pos P+L)
store in table[P] : range {16-260} makes decision X

in states 261-263 ...

so you actually can do the very large optimal parse state set with not much increase in run time or memory use.

Second : I did a more complex variant of LZ4P (large window). LZ4P-LO includes "last offset". LZ4P-LO-332 uses a 3-bit-3-bit-2-bit control word (as described previously here : cbloom rants 09-10-12 - LZ4 - Large Window ) ; the 2 bit offset reserves one value for LO and 3 values for normal offsets.

(I consider this an "LZ4" variant because (unlike LZNib) it sends LZ codes as a strictly alternating LRL-ML pairs (LRL can be zero) and the control word of LRL and ML is in one byte)

Slightly better than LZ4P-LO-332 is LZ4P-LO-695 , where the numbering has switched from bits to number of values (so 332 should be 884 for consistency). You may have noticed that 6*9*5 = 270 does not fit in a byte, but that's fixed easily by forbidding some of the possibilities. 6-9-5 = 6 values for literals, 9 for match lengths, and 5 for offsets. The 5 offsets are LO + 2 bits of normal offset. So for example one of the ways that the 270 values is reduced is because an LO match can never occur after an LRL of 0 (the previous match would have just been longer), so those combinations are removed from the control byte.

LZ4P-LO-695 is not competitive with LZNib unless you spill the excess LRL and ML (the amount that is too large to fit in the control word) to nibbles, instead of spilling to bytes as in the original LZ4 and LZ4P. Even with spilling to nibbles, it's no better than LZNib. Doing LZ4P-LO-695, I found a few bugs in LZNib, so its results also got better.

Thirdly, current numbers :

raw lz4 lz4p332 lz4plo695 lznib d8 zlib OodleLZHLW
lzt00 16914 6473 6068 6012 5749 4896 4909
lzt01 200000 198900 198880 198107 198107 198199 198271
lzt02 755121 410695 292427 265490 253935 386203 174946
lzt03 3471552 1820761 1795951 1745594 1732491 1789728 1698003
lzt04 48649 16709 15584 15230 14352 11903 10679
lzt05 927796 460889 440742 420541 413894 422484 357308
lzt06 563160 493055 419768 407437 398780 446533 347495
lzt07 500000 265688 248500 240004 237120 229426 210182
lzt08 355400 331454 322959 297694 302303 277666 232863
lzt09 786488 344792 325124 313076 298340 325921 268715
lzt10 154624 15139 13299 11774 11995 12577 10274
lzt11 58524 25832 23870 22381 22219 21637 19132
lzt12 164423 33666 30864 29023 29214 27583 24101
lzt13 1041576 1042749 1040033 1039169 1009055 969636 923798
lzt14 102400 56525 53395 51328 51522 48155 46422
lzt15 34664 14062 12723 11610 11696 11464 10349
lzt16 21504 12349 11392 10881 10889 10311 9936
lzt17 53161 23141 22028 21877 20857 18518 17931
lzt18 102400 85659 79138 74459 76335 68392 59919
lzt19 768771 363217 335912 323886 299498 312257 268329
lzt20 1179702 1045179 993442 973791 955546 952365 855231
lzt21 679936 194075 113461 107860 102857 148267 83825
lzt22 400000 361733 348347 336715 331960 309569 279646
lzt23 1048576 1040701 1035197 1008638 989387 777633 798045
lzt24 3471552 2369885 1934129 1757927 1649592 2289316 1398291
lzt25 1029744 324190 332747 269047 230931 210363 96745
lzt26 262144 246465 244990 239816 239509 222808 207600
lzt27 857241 430350 353497 315394 328666 333120 223125
lzt28 1591760 445806 388712 376137 345343 335243 259488
lzt29 3953035 2235299 1519904 1451801 1424026 1805289 1132368
lzt30 100000 100394 100393 100010 100013 100020 100001
total 24700817 14815832 13053476 12442709 12096181 13077482 10327927

And comparison charts on the aggregated single file lzt99 :

Speeds are the best of 20 trials on each core; speed is the best of either x86 or x64 (usually x64 is faster). The decode times measured are slightly lower for everybody in this post (vs the last post of this type) because of the slightly more rigorous timing runs. For reference the decode speeds I measured are (mb/s) :


LZ4 :      1715.10235
LZNib :     869.1924302
OodleLZHLW: 287.2821629
zlib :      226.9286645
LZMA :       31.41397495

Also LZNib current enwik8 size : (parallel chunking (8 MB chunks) and LRM 12/12 with bubble)

LZNib enwik8 mml3 : 30719351
LZNib enwik8 stepml : 30548818

(all other LZNib results are for mml3)

10/16/2012

10-16-12 - Two more small notes on LZNib

Followup to Small Notes on LZNib

1. Because cost ties are common, and ties are not actually ties (due to "last offset"), just changing the order that you visit matches can change your compression. eg. if you walk matches from long to short or short to long or low offset to high offset, etc.

Another important way to break ties is for speed. Basically prefer long matches and long literal runs vs. a series of shorter ones that make the same output length. Because the code cost is integer bytes, you can do this pretty easily by just adding a small bias to the cost (one thousandth of a byte or whatever) each time you start a new match or LRL.

(more generally in an ideal world every compressor should have a lagrange parameter for space-speed tradeoff, but that's the kind of thing nobody ever gets around to)

2. Traditional LZ coders did not output matches unless they were cheaper than literals. That is, say you send a match len in 4 bits and an offset in 12 bits, so a match is 2 bytes - you would think that the minimum match length should be 3 - not 2 - because sending a 2 byte match is pointless (it's cheaper or the same cost to send those 2 bytes as literals (cheaper as literals if you are in a literal run-len already)). By using a larger MML, you can send higher match lengths in your 4 bits, so it should be a win.

This is not true if you have "last offset". With LO in your coder, it is often beneficial to send matches which are not a win (vs literals) on their own. eg. in the above example, minimum match length should be 2 in an LO coder.

This is one of those cases where text and binary data differ drastically. If you never tested on structured data you would not see this. Really the nature of LZ compression on text and binary is so different that it's worth considering two totally independent compressors (or at least some different tweaked config vals). Text match offsets fall off very steadily in a perfect curve, and "last offsets" are only used for interrupted matches, not for re-using an offset (and generally don't help that much). Binary match offsets have very sparse histograms with lots of strong peaks at the record sizes in the file, and "last offset" is used often just as a way of cheaply encoding the common record distance.

On text, it is in fact best to use an MML which makes matches strictly smaller than literals.

If I keep at this work in the future I'm sure I'll get around to doing an LZ specifically designed for structured data; it's sort of hopeless trying to find a compromise that works great on both; I see a lot more win possible.

10-16-12 - Thoughts on Bit-Packing Structs Before Compression

If you're trying to transmit some data compactly, and you are *not* using any back-end compression, then it's relatively straightforward to pack the structs through ad-hoc "bit packing" - you just want to squeeze them into as few bits as possible. But if you are going to apply a standard compressor after bit packing, it's a little less clear. In particular, a lot of people make mistakes that result in larger final data than necessary.

To be clear, there are two compression steps :


{ Raw structs } --[ad hoc]--> { Bit Packed } --[compressor]--> { Transmitted Data }

What you actually want to minimize is the size of the final transmitted data, which is not necessarily achieved with the smallest bit packed data.

The ideal scenario is if you know your back-end compressor, simply try a variety of ways of packing and measure the final size. You should always start with completely un-packed data, which often is a reasonable way to go. It's also important to keep in mind the speed hit of bit packing. Compressors (in particular, decompressors) are very fast, so even though your bit-packing may just consist of some simple math, it actually can very easily be much slower than the back-end decompressor. Many people incorrectly spend CPU time doing pre-compression bit-packing, when they would be better off spending that same CPU time by just running a stronger compressor and not doing any twiddling themselves.

The goal of bit-packing should really be to put the data in a form that the compressor can model efficienctly. Almost all compressors assume an 8-bit alphabet, so you want your data to stay in 8-bit form (eg. use bit-aligned packing, don't use non-power-of-2 multiplies to tightly pack values if they will cross a byte boundary). Also almost all compressors, even the best in the world (PAQ, etc) primarily achieve compression by modeling correlation between neighboring bytes. That means if you have data that does not have the property of maximum correlation to its immediate neighbor (and steady falloff) then some swizzling may help, just rearranging bytes to put the correlated bytes near each other and the uncorrelated bytes far away.

Some issues to consider :

1. Lossy bit packing.

Any time you can throw away bits completely, you have a big opportunity that you should exploit (which no back end compressor can ever do, because it sends data exactly). The most common case of this is if you have floats in your struct. Almost always there are several bits in a float which are pure garbage, just random noise which is way below the error tolerance of your app. Those bits are impossible to compress and if you can throw them away, that's pure win. Most floats are better transmitted as something like a 16 bit fixed point, but this requires application-specific knowledge about how much precision is really needed.

Even if you decide you can't throw away those bits, something that can help is just to get them out of the main stream. Having some random bytes mixed in to an otherwise nicely compressible stream really mucks up the order-0 statistics, so just putting them on the side is a nice way to go. eg. you might take the bottom 4 or 8 bits out of each float and just pass them uncompressed.

(in practical bone-head tips, it's pretty common for un-initialized memory to be passed to compressors; eg. if your structs are padded by C so there are gaps between values, put something highly compressible in the gap, like zero or a duplicate of the neighboring byte)

2. Relationships between values.

Any time you have a struct where the values are not completely independent, you have a good opportunity for packing. Obviously there are cases where one value in a struct can be computed from another and should just not be sent.

There are more subtle cases, like if A = 1 then B has certain statistics (perhaps it's usually high), while if A = 0 then B has other statistics (perhaps it's usually low). In these cases there are a few options. One is just to rearrange the transmission order so that A and B are adjacent. Most back end compressors model correlation between values that are adjacent, so putting the most-related values in a struct next to each other will let the back end find that correlation.

There are also often complicated mathematical relationships. A common case is a normalized vector; the 3 values are constrained in a way that the compressor will never be able to figure out (proof that current compressors are still very far away from the ideal of perfect compression). When possible you want to reduce these related values to their minimal set; another common case is rotation matrices, where 9 floats (36 bytes) can be reduced to 3 fixed points (6-9 bytes).

This is really exactly the same as the kinds of variable changes that you want to do physics; when you have a lot of values in a struct that are constrained together in some way, you want to identify the true number of degrees of freedom, and try to convert your values into independent unconstrained variables.

When numerical values are correlated to their neighbors, delta transformation may help. (this particularly helps with larger-than-byte values where a compressor will have a harder time figuring it out)

3. Don't mash together statistics.

A common mistake is to get too aggressive with mashing together values into bits in a way that wrecks the back-end statistical model. Most back end compressors work best if the bytes in the file all have the same probability histogram; that is, they are drawn from the same "source". (as noted in some of the other points, if there are multiple unrelated "sources" in your one data stream, the best thing to do is to separate them from each other in the buffer)

Let me give a really concrete example of this. Say you have some data which has lots of unused space in its bytes, something like :


bytes in the original have values :

0000 + 4 bits from source "A"
0001 + 4 bits from source "B"

(when I say "from source" I mean a random value drawn under a certain probability distribution)

You might be tempted to bit-pack these to compact them before the back end compressor. You might do something like this :


Take the top 4 bits to make a flag bit
Take 8 flag bits and put them in a byte

Then take the 4 bits of either A or B and put them together in the high and low nibble of a byte

eg, in nibbles :

0A 1B 1B 0A 0A 0A 1B 0A 

--[bit packed]-->

01100010 (binary) + ABBAAABA (nibbles)

(and A and B are not the hex numbers but mean 4 bits drawn from that source)

It looks like you have done a nice job of packing, but in fact you've really wrecked the data. The sources A and B had different statistics, and in the original form the compressor would have been able to learn that, because the flag bit was right there in the byte with the payload. But by packing it up tightly what you have done is made a bunch of bytes whose probability model is a mix of {bit flags},{source A},{source B}, which is a big mess.

I guess a related point is :

4. Even straightforward bit packing doesn't work for the reasons you think it does.

Say for example you have a bunch of bytes which only take on the values 0-3 (eg. use 2 bits). You might think that it would be a big win to do your own bit packing before the compressor and cram 4 bytes together into one. Well, maybe.

The issue is that the back end compressor will be able to do that exact same thing just as well. It can see that the bytes only take values 0-3 and thus will send them as 2 bits. It doesn't really need your help to see that. (you could help it if you had say some values that you knew were in 0-3 and some other values you knew were in 0-7, you might de-interleave those values so they are separated in the file, or somehow include their identity in the value so that their statistics don't get mixed up; see #5)

However, packing the bytes down can help in some cases. One is if the values are correlated to their neighbors; by packing them you get more of them near each other, so the correlation is modeled at an effective higher order. (eg. if the back end only used order-0 literals, then by packing you get order-3 (for one of the values anyway)). If the values are not neighbor-correlated, then packing will actually hurt.

(with a Huffman back end packing can also help because it allows you to get fractional bits per original value)

Also for small window LZ, packing down effectively increases the window size. Many people see advantages to packing data down before feeding it to Zip, but largely that is just reflective of the tiny 32k window in Zip (left over from the DOS days and totally insane that we're still using it).

5. Separating values that are independent :

I guess I've covered this in other points but it's significant enough to be redundant about. If you have two different sources (A and B); and there's not much correlation between the two, eg. A's and B's are unrelated, but the A's are correlated to other A's - you should try to deinterleave them.

A common simple case is AOS vs SOA. When you have a bunch of structs, often each value in the struct is more related to the same value in its neighbor struct than to other values within its own struct (eg. struct0.x is related to struct1.x more than to struct0.y). In this case, you should transform from array-of-structs to struct-of-arrays ; that is, put all the .x's together.

For example, it's well known that DXT1 compresses better if you de-interleave the end point colors from the palette interpolation indeces. Note that AOS-SOA transformation is very slow if done naively so this has to be considered as a tradeoff in the larger picture.

More generally when given a struct you want to use app-specific knowledge to pack together values that are strongly correlated and de-interleave values that are not.

10/07/2012

10-07-12 - Small Notes on LZNib

Some little thoughts.

1. It's kind of amazing to me how well LZNib does. (currently 30,986,634 on enwik8 with parallel chunked compress and LRM). I guess it's just the "asymptotic optimality" of LZ77; as the dictionary gets bigger, LZ77 approaches perfect compression (assuming the data source is static, which of course it never is, which is why LZ77 does not in fact approach the best compressor). But anyway, the point is with basic LZ the way matches are encoded becomes less and less important as the window gets bigger (and the average match length thus gets longer).

2. With byte-wise coders you have something funny in the optimal parser than you don't run into much with huffman or arithmetic coders : *ties*. That is, there are frequently many ways to code that have exactly the same code length. (in fact it's not uncommon for *all* the coding choices at a given position to produce the same total length).

You might think ties don't matter but in fact they do. One way you can break a tie is to favor speed; eg. break the tie by picking the encoding that decodes the fastest. But beyond that if your format has some feedback, the tie is important. For example in LZNib the "divider" value could be dynamic and set by feedback from the previous encoding.

In my LZNib I have "last offset" (repeat match), which is affected by ties.

3. My current decoder is around 800 mb/s on my machine. That's almost half the speed of LZ4 (around 1500 mb/s). I think there are a few things I could do to get a little more speed, but it's never going to get all the way. Presumably the main factor is the large window - LZ4 matches mostly come from L1 and if not then they are in L2. LZNib gets a lot of large offsets, thus more cache misses. It might help to do a lagrangian space-speed thing that picks smaller offsets when they don't hurt too much (certainly for breaking ties). (LZNib is also somewhat more branchy than LZ4 which is the other major source of speed loss)

4. One of the nice things about optimal parsing LZNib is that you can strictly pick the set of matches you need to consider. (and there are also enough choices for the optimal parser to make interesting decisions). Offsets can be sent in 12 bits, 20 bits, 28 bits, etc. so for each offset size you just pick the longest match in that window. (this is in contrast to any entropy-coded scheme where reducing to only a few matches is an approximation that hurts compression, or a fixed-size scheme like LZ4 that doesn't give the optimal parser any choices to make)

5. As usual I'm giving up some compression in the optimal parser by not considering all possible lengths for each match. eg. if you find a match of length 10 you should consider only using 3,4,5... ; I don't do that, I only consider lengths that result in a shorter match length code word. That is a small approximation but helps encoder speed a lot.

6. Since LZNib uses "last offset", the optimal parse is only approximate and that is an unsolved problem. Because big groups of offsets code to the same output size, the choice between those offsets should be made by how useful they are in the future as repeat matches, which is something I'm not doing yet.

old rants