3/27/2012

03-27-12 - DXT is not enough

I've been sent this GDC talk a few times now, so I will write some notes about it. (BTW thanks to all who sent it; I don't really follow game dev news much, so I rely on you all to tell me when there's something interesting I should know about).

There's nothing wrong with the general subject matter of the talk, and the points are along the right track in a vague sort of way, but there's just absolutely no effort put into it. I put more effort into the average blog post. If you aren't going to actually do the testing against other methods and measurement on a real (public) data set and see if your ideas are actually good, then just don't do a talk.

Anyhoo, a quick summary of the talk in plain text :

JPEG and then realtime DXTC is kind of bonkers (I agree). To make DXTC smaller, he applies Zip, then pre-huffman, pre-delta of colors, rearranging the colors & indices to be in two separate blocks, and then "codebooks", and finally 8x8 and even 16x16 blocks.

There are a lot of problems with this talk. The first is the assumption that you are using Zip on the back end. (BTW Zip is not a type of LZW at all). Zip is ancient and has a tiny window, there's no reason to use zip. If you just use a better back end, most of what he does next is irrelevant. Essentially a lot of what he does (such as the codebooks and the pre-huffman) are just ways of extending Zip, effectively making the sliding window larger.

Second, whenever you are doing these things you need to consider the memory use and processor use tradeoffs.

For example, reorganizing the DXTC data to separate the colors and the indices does in fact help. (I do it in Oodle, optionally). But that doesn't make it a clear win. It actually takes a huge amount of CPU. Just swizzling memory around like that can be slower than a very advanced LZ decompressor (unless you are lucky enough to be on a PC, which has an amazing big cache, amazing fast memory, and an amazing out-of-order processor that can hide cache misses). So you have to consider the speed cost of doing that reorg vs. other ways you could use the CPU time to improve compression (eg. running LZMA or LZX or whatever instead of Zip). Even on a PC, the reorg will ruin large-block write-combining. For me, reorg took me from 500 MB/s to 300 MB/s or so, and the gain is only a few percent, not obviously worth it. (my back end is much better than Zip so the gains are much smaller, or not there at all).
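To be concrete about what the reorg is, here's a minimal sketch (hypothetical function name, not the Oodle code; a DXT1 block is 8 bytes: 4 bytes of endpoint colors followed by 4 bytes of 2-bit selector indices):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* De-interleave DXT1 blocks into a color stream and an index stream,
   so the back-end compressor sees two homogeneous streams. */
void reorg_dxt1(const uint8_t *blocks, size_t nblocks,
                uint8_t *colors, uint8_t *indices)
{
    for (size_t i = 0; i < nblocks; i++) {
        memcpy(colors  + i*4, blocks + i*8,     4); /* two 16-bit endpoint colors */
        memcpy(indices + i*4, blocks + i*8 + 4, 4); /* 16 x 2-bit selectors       */
    }
}
```

Note this touches every byte of the texture with scattered writes, which is where the speed cost above comes from.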

The only real idea in the talk is going to 8x8 blocks. That is in fact a valid thing to do, but here the rigor of the talk completely falls apart, and instead we get "look at this blurry slide in a terrible lighting environment, can you see the loss?". Errmmm, okay. To be fair it's no worse than the typical GDC graphics talk where you get "look at this picture of my global illumination technique, can you see any artifacts?" ; ermm, well you have chosen the scene to show me, and I'm a hundred feet away looking at a slide, and I can't zoom in or examine the areas I think look funny, so of course it should look good, but in fact, yes I do see lighting artifacts!

Any time you start introducing loss you have to ask : how does this loss I'm injecting compare to other ways I could reduce bitrate and increase loss? An easy one to check is just to halve the resolution of your image (in both dimensions). That's a 4:1 compression, and quite often looks just fine visually (eg. on smooth data it is one of the best possible ways to create loss). And of course since we're in this domain you need to compare against JPEG-DXTC.

CRUNCH addresses this subject much better, even doing some actual tests, and it has some much more interesting ideas.

See my previous writings on DXTC in general.

Now some actual rigor :

DXT1 is a 4 bpp (bits per pixel) format. Additional lossless compression can get you below 4 bpp, but getting to 1 bpp is unrealistic. Here I will show the results for compressors of increasing compression : zip, rrlzhlw, and lzma. The "reorg" here is just separating colors and indices; other reorgs do not help if the back end compressor is rrlzhlw or better.
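For concreteness, bpp here is just compressed bits over pixel count; a DXT1 surface packs each 4x4 pixel block into 8 bytes, which is exactly 4 bpp. A tiny sketch (the kodim images are 768x512):

```c
#include <assert.h>
#include <stddef.h>

/* bits per pixel of a compressed texture */
double bits_per_pixel(size_t compressed_bytes, int width, int height)
{
    return (compressed_bytes * 8.0) / ((double)width * height);
}
```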

file (bpp)    zip | rrlzhlw | lzma | reorg zip | reorg rrlzhlw | reorg lzma
kodim01.bmp 3.187 2.962 2.786 2.98 2.794 2.683
kodim02.bmp 2.984 2.738 2.574 2.703 2.575 2.484
kodim03.bmp 2.768 2.534 2.373 2.494 2.344 2.254
kodim04.bmp 3.167 2.931 2.751 2.913 2.727 2.625
kodim05.bmp 3.463 3.272 3.155 3.238 3.108 2.999
kodim06.bmp 3.039 2.827 2.626 2.755 2.635 2.514
kodim07.bmp 2.862 2.622 2.489 2.634 2.469 2.366
kodim08.bmp 3.416 3.197 3.073 3.211 3.041 2.936
kodim09.bmp 2.919 2.701 2.497 2.658 2.525 2.4
kodim10.bmp 3.074 2.838 2.644 2.803 2.638 2.525
kodim11.bmp 3.001 2.827 2.655 2.791 2.668 2.542
kodim12.bmp 2.86 2.645 2.446 2.583 2.451 2.343
kodim13.bmp 3.517 3.331 3.182 3.299 3.159 3.042
kodim14.bmp 3.296 3.104 2.94 3.078 2.922 2.803
kodim15.bmp 3.067 2.835 2.675 2.798 2.632 2.547
kodim16.bmp 2.779 2.565 2.362 2.543 2.401 2.276
kodim17.bmp 3.077 2.849 2.659 2.788 2.653 2.544
kodim18.bmp 3.495 3.315 3.181 3.255 3.106 3.025
kodim19.bmp 3.09 2.878 2.685 2.827 2.698 2.571
kodim20.bmp 2.667 2.486 2.302 2.406 2.308 2.22
kodim21.bmp 3.087 2.893 2.7 2.804 2.712 2.582
kodim22.bmp 3.39 3.213 3.046 3.168 3.005 2.901
kodim23.bmp 3.221 2.985 2.826 2.949 2.758 2.646
kodim24.bmp 3.212 2.986 2.86 3.009 2.826 2.724
clegg.bmp 2.987 2.75 2.598 2.712 2.576 2.459
FRYMIRE.bmp 1.502 1.318 1.224 1.417 1.3 1.209
LENA.bmp 3.524 3.332 3.209 3.304 3.136 3.062
MONARCH.bmp 3.28 3.055 2.916 2.999 2.835 2.741
PEPPERS.bmp 3.381 3.2 3.073 3.131 2.962 2.881
SAIL.bmp 3.425 3.234 3.123 3.197 3.047 2.967
SERRANO.bmp 1.601 1.39 1.289 1.484 1.352 1.26
TULIPS.bmp 3.511 3.27 3.164 3.227 3.061 2.974
total 97.849 91.083 86.083 90.158 85.424 82.105
gain 7.691 5.659 3.978

What you should be able to see : reorg zip is roughly the same as rrlzhlw (without reorg), and reorg rrlzhlw is about the same as lzma (without reorg). Note that reorg is *slow* ; rrlzhlw without reorg decodes quite a bit faster than zip with reorg, so speed is not a good reason to prefer that. (I suppose simplicity of coding is one advantage it has). The gain from reorging decreases as you go to better back-ends.

I should also point out that doing something like reorg lzma is kind of silly. If you really want the maximum compression of DXTC textures, then surely a domain-specific context-based arithmetic coder will do better, and be faster too. (see for example "Lossless Compression of Already Compressed Textures" , Strom and Wennersten ; not a great paper, just the very obvious application of normal compression techniques to ETC (similar to DXTC) texture data).

In the next post I'll ramble a bit about future possibilities.

3/12/2012

03-12-12 - Comparing Compressors

It's always hard to compare compressors fairly in a way that's easily understood by the layman. I think the basic LZH compressors in Oodle are very good, but do they compress as much as LZMA ? No. Are they as fast as LZO? No. So if I really make a fair comparison chart that lists lots of other compressors, I will be neither the fastest nor the highest ratio.

(The only truly fair way to test, as always, is to test in the actual target application, with the actual target data. Other than that, it's easiest to compare "trumps", eg. if compressor A has the same speed as B, but more compaction on all files, then it is just 100% better and we can remove B from consideration)

I wrote before about the total time method of comparing compressors. Basically you assume the disk has some given speed D. Then you see what is the total time to load the compressed file (eg. compressed size/D) and the time to do the decompression.

"Total time" is not really the right metric for various reasons; it assumes that one CPU is fully available to compression and not needed for anything else. It assumes single threaded operation only. But the nice thing about it is you can vary D and see how the best compressor changes with D.

In particular, for two compressors you can solve for the disk speed at which their total time is equal :


D = disk speed
C = decompressor speed
R = compression ratio (compressed size / raw size) (eg. unitless and less than 1)

disk speed where two compressors are equal :

D = C1 * C2 * (R1 - R2) / (C1 - C2)

at lower disk speeds, the one with more compression is preferred, and at higher disk speeds the faster one with lower compression is preferred.
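The algebra: setting the two total times per raw byte equal, R1/D + 1/C1 = R2/D + 1/C2, and solving for D gives the formula above. A small sketch with made-up numbers (not measurements from any real compressor):

```c
#include <assert.h>
#include <math.h>

/* total time per raw byte : load compressed (R/D) plus decompress (1/C) */
double total_time(double D, double R, double C)
{
    return R / D + 1.0 / C;
}

/* disk speed at which two compressors have equal total time */
double crossover_disk_speed(double C1, double R1, double C2, double R2)
{
    return C1 * C2 * (R1 - R2) / (C1 - C2);
}
```

eg. for hypothetical compressors (C1=100 MB/s, R1=0.5) and (C2=200 MB/s, R2=0.7), the crossover is D = 40 MB/s, and the two total times agree there.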

The other thing you can do is show "effective speed" instead of total time. If you imagine the client just gets back the raw file at the end, they don't know if you just loaded the raw file or if you loaded the compressed file and then decompressed it. Your effective speed is :


D = disk speed
C = decompressor speed
R = compression ratio (compressed size / raw size) (eg. unitless and less than 1)

S = 1 / ( R/D + 1/C )

So for example, if your compressor is "none", then R = 1.0 and C = infinity, so S = D : your effective speed is just the disk speed.
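As a sketch, with the same symbols and made-up numbers:

```c
#include <assert.h>
#include <math.h>

/* effective speed seen by the client : 1 / (time to load + time to decode) */
double effective_speed(double D, double R, double C)
{
    return 1.0 / (R / D + 1.0 / C);
}
```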

If we have two compressors that have a different ratio/speed tradeoff, we can compare them in this way. I was going to compare my stuff to Zip/Zlib, but I can't. On the PC I'm both faster than Zip and get more compression, so there is no tradeoff. (*1) (*2)

(*1 = this is not anything big to brag about, Zip is ancient and any good modern compressor should be able to beat it on both speed and ratio. Also Zlib is not very well optimized; my code is also not particularly optimized for the PC, I optimize for the consoles because they are so much slower. It's kind of ironic that some of the most pervasive code in the world is not particularly great).

(*2 = actually there are more dimensions to the "Pareto space"; we usually show it as a curve in 2d, but there's also memory use, and Zip is quite low in memory use (which is why it's so easy to beat - all you have to do is increase the window size and you gain both ratio and speed (you gain speed because you get more long matches)); a full tradeoff analysis would be a manifold in 3d with axes of ratio,speed, and size)

Anyhoo, on my x64 laptop running single threaded and using the timing technique here I get :


zlib9 : 24,700,820 ->13,115,250 =  1.883 to 1, rate= 231.44 M/s

lzhlw : 24,700,820 ->10,171,779 =  2.428 to 1, rate= 256.23 M/s

rrlzh : 24,700,820 ->11,648,928 =  2.120 to 1, rate =273.00 M/s

so we can at least compare rrlzh (the faster/simpler of my LZH's) with lzhlw (my LZH with Large Window).

The nice thing to do is to compute the effective speed S for various possible disk speeds D, and make a chart :

On the left is effective speed vs. disk speed, on the right is a log-log plot of the same thing. The blue 45 degree line is the "none" compressor, eg. just read the uncompressed file at disk speed. The axes are MB/sec, and here (as is most common for me) I use M = millions, not megas (1024*1024) (but the numbers I was showing at GDC were megas, which makes everything seem a little slower).

We see that on the PC, lzhlw is the better choice at any reasonable disk speed. They are equal somewhere around D = 280 MB/sec, but it's kind of irrelevant because at that point they are worse than just loading uncompressed.

The gap between lines in a log-log plot is the *ratio* of the original numbers; eg. the speedup multiplier for LZH over RAW is maximum at the lowest speeds (1 MB/sec, = 0 on the log-log chart) and decreases as the disk gets faster.

3/08/2012

03-08-12 - Oodle Coroutines

I wrote a year ago ( 03-11-11 | Worklets , IO , and Coroutines ) about coroutines for doing tasks with external delays. (a year ago! wtf)

This is a small followup to say that yes, in fact coroutines are awesome for this. I never bothered to try to preserve local variables, it's not worth it. You can just store data that you want to save across a "yield" into a struct. In Oodle whenever you are doing a threaded task, you always have a Work context, so it's easy to just stuff your data in there.

I've always been a big fan of coroutines for game scripting languages. You can do things like :


Walk to TV
Turn on TV
if exists Couch
  Walk to Couch

etc. You just write it like linear imperative code, but in fact some of those things take time and yield out of the coroutine.

So obviously the same thing is great for IO or SPU tasks or whatever. You can write :


Vec3 * array = malloc(...);

io_read( array , ... );  //! yields

Mat3 m = camera.view * ...;

spu_transform(array, m); //! yields

object->m_array = array;

and it just looks like nice linear code, but actually you lose execution there at the ! marks and you will only proceed after that operation finishes.

To actually implement the coroutines I have to use macros, which is a bit ugly, but not intolerable. I use the C switch method as previously described; normally I auto-generate the labels for the switch with __COUNTER__ so you can just write :


 .. code ..

YIELD

 .. code ..

and the YIELD macro does something like :

  N = __COUNTER__;
  work->next = N;
  return eCoroutine_Refire;
  }
case N:
  {

(note the braces which mean that variables in one yield chunk are not visible to the next; this means that the failure of the coroutine to maintain local variables is a compile error and thus not surprising).

The exception is if you want to jump around to different blocks of the coroutine, then you need to manually specify a label and you can jump to that label.
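Here's a minimal self-contained sketch of the whole scheme (using __LINE__ for the case labels instead of the __COUNTER__ plumbing, and a trivial Work struct; the names are illustrative, not the actual Oodle macros):

```c
#include <assert.h>

typedef enum { eCoroutine_Refire, eCoroutine_Done } ECoroutine;

typedef struct {
    int next;  /* resume label ; 0 = start from the top */
    int sum;   /* data that must survive a yield lives in the struct */
} Work;

/* switch-method coroutine : YIELD records where to resume and returns;
   __LINE__ is the same value in both spots since the macro is one line */
#define COROUTINE_START  switch (work->next) { case 0: {
#define YIELD            work->next = __LINE__; return eCoroutine_Refire; } case __LINE__: {
#define COROUTINE_END    } } return eCoroutine_Done;

ECoroutine WorkFunc(Work *work)
{
    COROUTINE_START

    work->sum += 1;

    YIELD

    work->sum += 10;

    YIELD

    work->sum += 100;

    COROUTINE_END
}
```

Note the braces around each chunk, as described above: a local declared before a YIELD is a compile error if used after it, which forces you to put cross-yield state in the Work struct.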

Note that yielding without a dependency is kind of pointless; these coroutines are not yielding to share the CPU, they are yielding because they need to wait on some async handle to finish. So generally when you yield it's because you have some handle (or handles) to async tasks.

The way a yielding call like "spu_transform(array, m);" in the previous example has to be implemented is by starting an async spu task, and then setting the handle as a dependency. It would be something like :


#define spu_transform(args) \
  Handle h = start_spu_transform(args); \
  Work_SetDeps(work,h); \
  YIELD

The coroutine yield will then stop running the work, and the work now has a dependency, so it won't resume until the dependency is done. eg. it waits for the spu task to complete.

I use coroutines basically every time I have to do some processing on a file. For one thing, to minimize memory use I need to stream the file through a double-buffer. For another, you often need to open the file before you start processing, and that needs to be part of the async operation chain as well. So a typical processing coroutine looks something like :


int WorkFunc( Work * work )
{
COROUTINE_START

  Handle h = ioq_open_file( work->filename );

COROUTINE_YIELD_TO(h)

  if open failed -> abort

  get file size

  h = start read chunk 0
  chunkI = 1; // the next chunk is 1

YIELD(h)

  // read of chunkI^1 just finished

  // start the next read :  
  h = ioq_read( chunk[chunkI] );
  chunkI ^= 1;

  // process the chunk we just received :
  process( chunk[chunkI] );

  if done
  {
    ioq_close_file();
    return
  }

YIELD_REPEAT
}

where "YIELD_REPEAT" means resume at the same label so you repeat the current block.

The last block of the coroutine runs over and over, ping-ponging on the double buffer, and yields if the next IO is not done yet when the processing of each block is done.

3/06/2012

03-06-12 - The Worker Wake and Semaphore Delay Issue

Let's say you have a pool of worker threads, and some of them may be asleep. How should you wake them up?

The straightforward way is to use a semaphore which counts the work items, and wait the worker threads on the semaphore. Workers will go to sleep when there is no work for them, and wake up when new work is pushed.

But this is rather too aggressive about waking workers. If you push N items (N less than the number of worker threads) it will wake N workers. But by the time some of those workers wake there may be nothing for them to do.

Let's look at a few specific issues.

First of all, when you're making a bunch of work items, you might want to delay inc'ing the semaphore until you have made all the items, rather than inc'ing it each time. eg. instead of :


1 : incremental push :

push item A
inc sem
push item B 
inc sem
...

instead do

2 : batch push :

push item A
push item B
inc sem twice

There are a few differences. The only disadvantage of batch pushing is that the work doesn't start getting done until all the pushes are done. If you're creating a lot of jobs and there's a decent amount of processing to get them started, this adds latency.

But what actually happens with incremental push? One possibility is like this :


bad incremental push :

push item A
inc sem

sem inc causes work thread to wake up
pusher thread loses execution

worker pops item A
worker does item A
worker sees empty queue and goes to sleep

pusher thread wakes up

push item B 
inc sem
...

That's very bad. The possible slight loss of efficiency from batch push is worth it to avoid this kind of bad execution flow.

There's a related issue when you are creating work items from a worker thread itself. Say a work item does some computation and also creates another work item :


Worker pops item A
does some stuff
push new work item B
inc sem
do some other stuff
item A done

Is this okay? Typically, no.

The problem is if the other worker threads are asleep, that inc sem might wake one up. Then the original worker finishes item A and sees nothing else to do and goes to sleep. It would have been much better if the worker just stayed awake and did item B itself.

We can fix this pretty easily. For work items pushed on worker threads, I typically use "batch push" (that is, delayed semaphore increment) with an extra wrinkle - I delay it up until my own thread tries to do a semaphore decrement.

That is, the way a worker thread runs is something like :


decrement semaphore (wait if count <= 0)
pop item
do work item (may create new work)

decrement semaphore (wait if count <= 0)
pop item
do work item ...

instead we do :

decrement semaphore (wait if count <= 0)
pop item
do work item (may create new work)

 push new work items but don't post the semaphore
 instead set N = number of incs to sem that are delayed

decrement semaphore AND add N (*)
pop item
do work item ...

The key operation is at (*) , where I post the sem for any work items I made, and also try to dec one.

The gain can be seen from a special case - if I made one work item, then the operation at (*) is a nop - I had one increment to the sem delayed, and I want to take one for myself, so I just take the one I had delayed. (if I made two, I post one and take one for myself). etc.
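To make the counter arithmetic at (*) concrete, a tiny model (just the math on the semaphore count, not a real semaphore):

```c
#include <assert.h>

/* Combined "post my N delayed increments and take one for myself" :
   on the count it is just count + N - 1.  With N == 1 the count is
   untouched, so no other (sleeping) worker gets woken. */
int sem_post_n_and_dec(int count, int n_delayed)
{
    return count + n_delayed - 1;
}
```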

There is one extra little optimization you can do for the edge case - if you have some threads that are creating work items and some threads doing them, there is a sort of "performance race" between them. You really want them to be running alongside each other, with the creator feeding the poppers, neither getting too far ahead. If the poppers are running slightly faster than the creators, you can fall off a huge performance cliff when the poppers see no work available and go into an OS sleep. Now, obviously you use a spin in your semaphore, but an enhanced version is something like this :


delayed/batched work creation :

push various items
...
inc sem N times


work popper :

spin { try pop queue }
try dec sem
if didn't get pop , dec sem (may wait)

In words, the work popper can "shortcut" the delayed sem inc. That is, the pusher has created a delay between the queue push and the sem inc, but the delay only applies to thread wakeups!! (this is the important point). The delay does not apply to the work being available to already running worker threads.

That is, if the pusher is using delay sem incs, and the popper is using sem shortcuts - then an active pusher makes work available to active workers as soon as possible. The thing that is delayed is only thread wakeup, so that we can avoid waking threads that don't need to wake up, or threads that will steal the execution context from the pusher, etc.

Here's an example of how things can go wrong if you aren't careful about these things :

Each work item is creating a followup work item, but that wakes up the other worker thread, who quickly grabs the item, and I go back to sleep.

(you might ask why the work item is creating a followup instead of just doing more work; it's because the followup is dependent on an IO; in this particular case the IO is running faster than the computation, so the IO dependency for each item is already done, and they can be run immediately)

With delayed push & early pop it's all cleaned up :

03-06-12 - Oodle Handle Table - WFMO

The other big advantage of a unified handle system is that you can do things like a true WFMO (wait for multiple objects).

Often you have the case that you have launched several asynchronous operations (let's call them A,B, and C), and you wish to do something more, but only after all three are done.

You can always do this manually by just waiting on all three :


Wait(A);
Wait(B);
Wait(C);
*- now all three are done

(note : the invariant at * is only true if the state transition from "not done" to "done" is one way; this is an example of what I mentioned last time that reasoning and coding about these state transitions is much easier if it's one way; if it was not then you would have absolutely no idea what the state of the objects was at *).

Obviously the disadvantage of this is that your thread may have to wake up and go to sleep several times before it reaches *. You can minimize this by waiting first on the item you expect to finish last, but that only works in some cases.

The overhead of these extra thread sleeps/wakes is enormous; the cost of a thread transition is on the order of 5000-50,000 clocks, whereas the cost of an OS threading call like signalling an event or locking a mutex is on the order of 50-500 clocks. So it's worth really working on these.

So to fix it we use a true WFMO call like WFMO(A,B,C).

What WFMO does is essentially just use the forward permits system that the unified handle table uses for all dependencies. It creates a pseudo-handle H which does nothing and has no data :


WFMO(A,B,C) :

create handle H
set H depends on {A,B,C}
  sets A to permit H , etc.
  sets H needs 3 permits to run
Wait(H);
free handle H

The result is just a single thread sleep, and then when your thread wakes you know all ABC are done and there is no need to poll their status and possibly sleep again.
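A sketch of the permit counting that makes this work (a single-threaded model; the real thing decrements atomically and does the wakeup):

```c
#include <assert.h>

/* The pseudo-handle H just carries a count of outstanding permits. */
typedef struct {
    int permits_needed;
} PseudoHandle;

/* Called when one dependency (A, B or C) completes.  Returns nonzero
   exactly once : when the last permit arrives and H becomes runnable,
   which is when the single waiting thread gets woken. */
int permit_done(PseudoHandle *h)
{
    return --h->permits_needed == 0;
}
```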

3/05/2012

03-05-12 - Oodle Handle Table

A few months ago I created a new core structure for Oodle which has worked out really well and tied the whole concept together. I'm going to make a few notes about it here.

The foundation is a lock-free weak reference handle table. One of the goals of Oodle library design was that I didn't want to force a coding structure on the client. eg. if it was me I would use a PagingResource base class with virtual OnLoad functions or whatever, and probably reference counting. But I understand many people don't want to do that, and one of the nice things about RAD products is that they fit into *your* engine structure, they don't force an engine structure on you. So initially I was trying to write the Oodle systems so that each system is very independent and doesn't force any overall concept on you. But that was a bit of a mess, and bug prone; for example object lifetime management was left up to the client, eg. if you fired an async task off, you could delete it before it was done, which could be a crash if the SPU was still using that memory.

The weak reference handle table fixes those bugs in a relatively low overhead way (and I mean "low overhead" both in terms of CPU time, but more importantly in terms of coding structure or intellectual burden).

Handles can be locked (with a mutex per handle (*)) which gives you access to any payload associated with them. It also prevents deletion. Simple status checks, however, don't require taking the lock. So other people can monitor your handle while you work on it without blocking.

(* = not an OS mutex of course, but my own; it's one of the mutexes described in the previous series. The mutex in Oodle uses only a few bits per handle, plus more data per thread which gives you the actual OS waits; game work loads typically involve something like 1000's of objects but only 10's of threads, so it's much better to use a per-thread "waitset" type of model)

One of the crucial things has been that the state of handles can basically only make one state transition, from "not done" to "done". Once they are done, they can never go back to not done; if you decide you have more followup work you create a new handle. This is as opposed to a normal Event or something which can flip states over and over. The nice thing about the single state transition is it makes waiting on that event much simpler and much easier to do race-free. There's no issue like "eventcount" where you have to do a prepare_wait / commit_wait.

The state of the handle is in the same word as the GUID which tells you if the handle is alive. So you only have to do one atomic op and it tells you if the weak reference is alive, and if so what the status is.
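For illustration, the packing might look like this (a sketch; the actual field sizes in Oodle are my assumption here):

```c
#include <assert.h>
#include <stdint.h>

/* GUID and status share one 64-bit word, so a single atomic load
   answers both "is this weak reference alive?" (guid match) and
   "what is its status?". */
uint64_t pack_handle_word(uint32_t guid, uint32_t status)
{
    return ((uint64_t)guid << 32) | status;
}

uint32_t handle_guid(uint64_t w)   { return (uint32_t)(w >> 32); }
uint32_t handle_status(uint64_t w) { return (uint32_t)w; }
```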

The other big thing is that "forward dependencies" (that I call "permits") are built in to the handle table, so all the systems automatically get dependencies and can wait on each other. (as described previously here : Worker Thread system with reverse dependencies )

You can make any kind of handle, eg. some async SPU job, and then mark it as depending on an IO, so the SPU job will get kicked when the IO is done.

A little example image (click for high res) :

The row of pink is the IO thread doing a bunch of reads. There are pre-made SPU jobs to decompress each chunk, and you can see they fire as each read completes. Then another thread (the brown below the 4 SPUs) waits on each SPU handle and processes completion for them.

This all comes from


(for each chunk:)
OodleAsyncHandle h1 = fire read
OodleAsyncHandle h2 = fire SPU decompress , depends on h1

add h2 to list
wait for all handles in list


Aside :

There's only one disadvantage I haven't figured out yet. If something depends on an SPU job, then when the SPU work item completes it needs to check if it "permits" anything else, and if so mark them as ready to run.

The problem is that it can't do that work from the SPU. (actually it probably could do some of it from the SPU, by chasing pointers and doing DMA's for each chunk, and using the SPU atomic cache line stuff; but in the end it will run into a problem that it needs to call some system call on the PPU).

So when an SPU job completes I need to send back to the PPU a message that "hey this handle is done see if it permits anything and do the processing for that". The problem is there seems to be no good mechanism for that on the PS3, which kind of blows my mind. I mean, I can just tack my messages onto a lock-free queue from the SPU, that part is fine, but then I need a way to signal the PPU (or run an interrupt on the PPU) to get the PPU to process that queue (and the threads that need the processing might be asleep and need to get woken), and that is where it's not great.

3/03/2012

03-03-12 - Stranger's Wrath HD

I just noticed that Stranger's Wrath HD for PS3 (on PSN) finally came out a few months ago. So I bought it and started playing through.

It seems fine from what I've seen so far. No major bugs or framerate problems or anything like that. It supposedly has some "HD" up-resing, but I can't see any, it looks just like the XBox, and that's not entirely bad. (actually the only difference I notice is that the menus seem to have been redone and they use a horrible janky-looking font; I don't think that we had such terrible looking text on the XBox). (maybe the content is up-res'ed and I just can't tell because the screen resolution is also higher?)

Anyhoo, I'm pleased to say it holds up decently still. Some of the realtime art is fantastic, for such an old game it still looks really good, mainly due to all the use of "decorators". The first 30 minutes of play are pretty terrible, but it gets better.

Every time Stranger jumps and does the Al Pacino hoo-ah I bust out laughing. WTF were we thinking? And lots of his monologues and his vocal pacing is just super bizarre. That's a good thing, bizarre is good.

In unrelated news, I tried Rage on PS3 ; holy texture popping batman! I do like the art style; the lighting is like super blown out, really flat looking; it looks cool, it reminds me of Anime. Actually the best thing about Rage is the menus; the menus are superb, really fast, responsive, circular, simple, no loading or unnecessary animation or whatever; every game developer should copy the Rage menus.

1/09/2012

01-09-12 - LZ Optimal Parse with A Star Part 5

Wrapping up the series with lots of numbers.

Previous parts :

cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 4
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 3
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 2
cbloom rants 10-24-11 - LZ Optimal Parse with A Star Part 1

I'm delirious with fever right now so I might write something inane, but I'm so bored of lying in bed so I'm trying to wrap this up. Anyhoo..

So first of all we have to talk a bit about what we're comparing the A Star parse to.

"Normal" is a complex forward lazy parse using heuristics to guide parsing, as described in Part 1. "Fast" is like Normal but uses simpler heuristics and simpler match finder.

"Chain" is more interesting. Chain is a complex "lazy"-type parser which considers N decisions ahead (eg. Chain 4 considers 4 decisions ahead). It works thusly :

Chain Parse : first do a full parse of the file using some other parser; this provides a baseline cost to end from each point. Now do a forward parse. At each position, consider all match and literal options. For each option, step ahead by that option and consider all the options at the next position. Add up the cost of each coding step. After N steps (for chain N) add on the cost to end from the first baseline parse. Go back to the original position and finalize the choice with the lowest cost. Basically it's a full graph walk for N steps, then use an estimate of the cost to the end from the final nodes of that sub-graph.

To make Chain parsing viable you have to reduce the number of match options to a maximum of 8 or so. Still Chain N has a complexity of 8^N , so it becomes slow very quickly as N grows.

Chain forward parse is significantly better than LZSS style backwards optimal parse for these LZ coders that have important adaptive state. The baseline parse I use for Chain actually is a backwards LZSS optimal parse, so you can see how it does by looking at the "Chain 0" results.


First overall results. Chain 6 is the most amount of steps I can run in reasonable time, and AStar 2048 means the quantum length for dividing up the file for AStar was 2048.

raw Fast Normal Chain 6 AStar 2048
lzt00 16914 5179 5016 4923 4920
lzt01 200000 198313 198321 198312 198312
lzt02 755121 181109 177792 173220 173315
lzt03 3471552 1746443 1713023 1698949 1690655
lzt04 48649 13088 12412 10407 10249
lzt05 927796 368346 367598 355804 354230
lzt06 563160 352827 351051 344721 343173
lzt07 500000 226533 215996 209133 208566
lzt08 355400 250503 249987 230541 230220
lzt09 786488 302927 287479 268544 265525
lzt10 154624 11508 10958 10307 10291
lzt11 58524 20553 19628 19139 19087
lzt12 164423 29001 26488 23966 23622
lzt13 1041576 935484 931415 924510 922745
lzt14 102400 47690 47298 46417 46350
lzt15 34664 10832 10688 10269 10260
lzt16 21504 10110 10055 9952 9927
lzt17 53161 19526 18514 17971 17970
lzt18 102400 64280 63251 59772 59635
lzt19 768771 322951 288872 269132 269162
lzt20 1179702 888881 872315 856369 855588
lzt21 679936 91677 88011 83529 83184
lzt22 400000 287715 284378 279674 279459
lzt23 1048576 807253 804048 798369 798334
lzt24 3471552 1418076 1411387 1399197 1388105
lzt25 1029744 113085 107882 97320 100175
lzt26 262144 212445 210836 207701 207552
lzt27 857241 237253 235137 222023 220837
lzt28 1591760 332660 308940 260547 252808
lzt29 3953035 1193914 1180823 1147160 1135603
lzt30 100000 100001 100001 100001 100001
total 10800163 10609600 10337879 10289860


Now number of Chain steps for the chain parser : (that's O0 - O6)

U N O0 O1 O2 O3 O4 O5 O6
lzt00 16914 5016 5024 4922 4922 4922 4922 4923 4923
lzt01 200000 198321 198321 198312 198312 198312 198312 198312 198312
lzt02 755121 177792 177877 175905 174835 174073 173759 173509 173220
lzt03 3471552 1713023 1712337 1704417 1703873 1702651 1701635 1700282 1698949
lzt04 48649 12412 11315 10516 10481 10457 10427 10416 10407
lzt05 927796 367598 368729 365743 364332 360630 356403 355968 355804
lzt06 563160 351051 350995 346856 345500 344778 344739 344702 344721
lzt07 500000 215996 215644 211336 209481 209259 209244 209138 209133
lzt08 355400 249987 249372 239375 237320 231554 231435 233324 230541
lzt09 786488 287479 284875 280683 275679 270721 269754 269107 268544
lzt10 154624 10958 10792 10367 10335 10330 10311 10301 10307
lzt11 58524 19628 19604 19247 19175 19225 19162 19159 19139
lzt12 164423 26488 25644 24217 24177 24094 24108 24011 23966
lzt13 1041576 931415 931415 929713 927841 926162 924515 924513 924510
lzt14 102400 47298 47300 46518 46483 46461 46437 46429 46417
lzt15 34664 10688 10656 10317 10301 10275 10278 10267 10269
lzt16 21504 10055 10053 9960 9966 9959 9952 9948 9952
lzt17 53161 18514 18549 17971 17970 17974 17971 17973 17971
lzt18 102400 63251 63248 59863 59850 59799 59790 59764 59772
lzt19 768771 288872 281959 277661 273316 269157 269141 269133 269132
lzt20 1179702 872315 872022 868088 865376 863236 859727 856408 856369
lzt21 679936 88011 88068 84848 83851 83733 83674 83599 83529
lzt22 400000 284378 284297 281902 279711 279685 279689 279696 279674
lzt23 1048576 804048 804064 802742 801324 799891 798367 798368 798369
lzt24 3471552 1411387 1410226 1404736 1403314 1402345 1401064 1400193 1399197
lzt25 1029744 107882 107414 99839 100154 99710 98552 98132 97320
lzt26 262144 210836 210855 207775 207763 207738 207725 207706 207701
lzt27 857241 235137 236568 233524 228073 223123 222884 222540 222023
lzt28 1591760 308940 295072 286018 276905 273520 269611 264726 260547
lzt29 3953035 1180823 1183407 1180733 1177854 1170944 1162310 1152482 1147160
lzt30 100000 100001 100001 100001 100001 100001 100001 100001 100001
total 10609600 10585703 10494105 10448475 10404719 10375899 10355030 10337879

Some notes : up to 6 steps (the most I can run), more chain steps is better - for the sum, but not for all files. In some cases more steps is worse, which should never really happen; it's an issue with approximate optimal parsers that I'll discuss later. (*)

On most files, going past 4 chain steps helps very little, but on some files it seems to monotonically keep improving. For example lzt29 stands out. Those files are ones that get helped the most by AStar.


Now the effect of quantum size on AStar. In all cases I only output codes from the first 3/4 of each quantum.

raw 256 512 1024 2048 4096 8192 16384
lzt00 16914 4923 4923 4920 4920 4920 4921 4921
lzt01 200000 198312 198312 198312 198312 198312 198314 198314
lzt02 755121 175242 173355 173368 173315 173331 173454 173479
lzt03 3471552 1699795 1691530 1690878 1690655 1690594 1690603 1690617
lzt04 48649 10243 10245 10234 10249 10248 10241 10241
lzt05 927796 357166 354629 354235 354230 354233 354242 354257
lzt06 563160 346663 343202 343139 343173 343194 343263 343238
lzt07 500000 209934 208669 208584 208566 208556 208553 208562
lzt08 355400 228389 229447 229975 230220 230300 230374 230408
lzt09 786488 266571 265564 265487 265525 265559 265542 265527
lzt10 154624 10701 10468 10330 10291 10273 10273 10272
lzt11 58524 19139 19123 19096 19087 19085 19084 19084
lzt12 164423 23712 23654 23616 23622 23628 23630 23627
lzt13 1041576 923258 922853 922747 922745 922753 922751 922753
lzt14 102400 46397 46364 46351 46350 46350 46348 46350
lzt15 34664 10376 10272 10260 10260 10251 10258 10254
lzt16 21504 9944 9931 9926 9927 9927 9927 9927
lzt17 53161 17937 17970 17968 17970 17969 17969 17969
lzt18 102400 59703 59613 59632 59635 59637 59640 59640
lzt19 768771 269213 269151 269128 269162 269193 269218 269229
lzt20 1179702 855992 855580 855478 855588 855671 855685 855707
lzt21 679936 83882 83291 83215 83184 83172 83171 83169
lzt22 400000 279803 279368 279414 279459 279605 279630 279647
lzt23 1048576 798325 798319 798321 798334 798354 798357 798358
lzt24 3471552 1393742 1388636 1388031 1388105 1388317 1388628 1388671
lzt25 1029744 97910 101246 101302 100175 100484 100272 100149
lzt26 262144 207779 207563 207541 207552 207559 207577 207576
lzt27 857241 222229 220832 220770 220837 220773 220756 220757
lzt28 1591760 256404 253257 252933 252808 252737 252735 252699
lzt29 3953035 1136193 1135442 1135543 1135603 1135710 1135689 1135713
lzt30 100000 100001 100001 100001 100001 100001 100001 100001
total 10319878 10292810 10290735 10289860 10290696 10291106 10291116

The best sum is at 2048, but 1024 is a lot faster and almost the same.

Again, as the previous note at (*), we should really see just improvement with larger quantum sizes, but past 2048 we start seeing it go backwards in some cases.


Lastly a look at where the AStar parse is spending its time. This is for a 1024 quantum.

The x axis here is the log2 of the number of nodes visited to parse a quantum. So, log2=20 means a million nodes were needed to parse that quantum. So for speed purposes a cell one to the right is twice as bad. The values in the cells are the percentage of quanta in the file that needed that number of nodes.

(note : log2=20 means one million nodes were visited to output 768 bytes worth of codes, so it's quite a lot)

log2 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
lzt00 0 0 0 18.18 59.09 18.18 4.55
lzt01 3.75 0.75 41.2 34.08 13.86 5.62 0.75
lzt02 1.81 1.36 25.37 34.09 13.59 13.02 8.15 1.93 0.23 0.23
lzt03 1.46 1.18 17.51 18.46 14.16 13.17 6.95 4.81 3.66 4.81 9.54 2.79 0.96 0.11 0.03
lzt04 1.67 0 0 1.67 0 21.67 5 18.33 3.33 10 16.67 16.67 5
lzt05 0.59 0.25 4.41 10.77 9.92 18.32 13.23 10.09 9.67 6.02 12.47 3.22 0.51 0.08 0.08
lzt06 0.8 0.93 6.81 23.77 14.69 16.96 21.09 11.48 2.67 0.8
lzt07 0.46 0.46 8.66 7.88 6.8 15.3 17 14.53 5.56 9.58 8.19 4.79 0.31 0.31
lzt08 0 0 0 0 0 0 1.68 1.68 1.47 27.67 53.88 11.95 1.68
lzt09 0.29 0.48 0.76 0.86 0.95 3.9 28.07 47.76 16.18 0.38
lzt10 0 0.56 10.17 12.99 9.04 9.04 10.17 41.24 4.52 0.56 1.13
lzt11 0 0 7.89 10.53 14.47 17.11 6.58 9.21 21.05 10.53 2.63
lzt12 0 0 0 0 0 4.27 28.91 59.24 7.58
lzt13 0 0 0.07 0.14 0.57 1.72 3.36 5.72 39.24 42.03 7.08 0.07
lzt14 0 0.83 0 2.5 8.33 34.17 42.5 5 2.5 1.67 0.83 0 0.83
lzt15 0 2.27 4.55 15.91 13.64 15.91 13.64 6.82 11.36 11.36 4.55
lzt16 0 0 3.57 0 14.29 42.86 32.14 3.57
lzt17 1.39 1.39 2.78 1.39 4.17 75 13.89
lzt18 0 0 0 0 0 0.72 0 2.17 2.9 11.59 56.52 23.19 2.9
lzt19 0 0 1.26 2.81 0.39 7.56 87.11 0.87
lzt20 0 0.13 2.08 2.02 4.29 67.07 24.29 0.06
lzt21 0.2 0.78 6.07 6.07 5.28 19.77 35.62 22.9 1.96 0.2 0.2
lzt22 0 0.56 2.98 5.59 26.82 62.94 1.12
lzt23 0 0 0 0 0 0.07 1.35 2.63 0.92 70.88 23.15 0.14 0.36 0.5
lzt24 0.44 0.61 4.14 37.41 7.62 12.68 12.72 8.52 6.11 5.19 3.11 0.94 0.31 0.04
lzt25 0.22 0.43 1.52 1.74 2.68 6.44 15.69 27.19 30.22 13.09 0.72
lzt26 0 0 0 1.15 3.15 2.58 77.65 14.61 0.57
lzt27 0.61 0.1 7.55 6.53 1.22 4.39 5 4.08 7.76 44.8 16.43 1.43
lzt28 0.25 0.1 3.71 0.94 0.74 6.77 15.56 10.08 10.97 14.82 18.68 11.41 4.05 1.24 0.1
lzt29 0.3 0.73 1.61 22.37 5.28 6.16 26.34 2.97 0.48 0.85 19.63 12.47 0.73
lzt30 3.7 0.74 47.41 34.07 12.59 0.74

Well, there's no easy answer - the files are all very different in character.

In many cases the A Star parse is reasonably fast (comparable to Chain 3 or something). But in some cases it's quite slow, eg. lzt04, lzt08, lzt28.


Okay, I think that's all the data. We have one point to discuss :

(*) = in all these types of endeavors, we see these anomalies where, as we give the optimizer more space to make decisions, it gets better for a while, then starts getting worse. I saw the same thing, but more extreme, with video coding.

Basically what causes this is that you aren't optimizing for your real final goal. If you were optimizing for the total output size, then giving it more freedom should never hurt. But you aren't. With Chain N or with A Star in both cases you are optimizing just some local portion, and it turns out that if you let it make really aggressive decisions trying to optimize the local bit, that can hurt overall.

A similar issue happens with a Huffman optimal parse, because you are using the Huffman code lengths from the previous parse to do the current parse. That's fine as long as your parse is reasonably similar, but if you let the optimal parser really go nuts, it can get pretty far off those statistics, which makes them wrong, so that more optimizing actually gives worse results.

With video coding the main issue I had was that the optimization was generally local (eg. just on one macro block at a time or some such), but it of course affects the future as a source for motion compensation (and in other ways), and it turns out if you do really aggressive optimization on the local decisions, that can wind up hurting overall.

A similar thing can happen in image and video coding if you let optimization proceed very aggressively, because you have to use some simple analytic criterion (such as RMSE - though even if you use a fancier metric the same problems arise). The issue is that the coder can wind up finding strange states that are a good trade-off for RMSE, but wind up looking just horrible visually.

Obviously the correct solution is to optimize with the true final goal in mind. But that's not always possible, either computationally, or because the final goal is subjective.

Generally the solution is to moderate the optimization in some way. You have some heuristic idea of what kind of solutions will provide good globally optimal solutions. (for example, in image/video coding, you might require that the bit rate allocation not create too big of a difference between adjacent blocks). So you sort of want to guide your optimization to start around where you suspect the answer to be, and then you tune it so that you don't allow it to be too aggressive in making whatever decision it thinks is locally optimal.

12/17/2011

12-17-11 - LZ Optimal Parse with A Star Part 4

Continuing ...
Part 1
Part 2
Part 3

So we have our A star parse from last time.

First of all, when we "early out" we still actually fill out that hash_node. That is, you pop a certain "arrival", then you evaluate the early out conditions and decide this arrival is not worth pursuing. You need to make a hash_node and mark it as a dead end, so that when you pop earlier arrivals that see this node, they won't try to visit it again.

One option would be to use a separate hash of just bools that mark dead ends. This could be a super-efficient smaller hash table of bit flags or bloom filters or something, which would save memory and perhaps speed.

I didn't do this because you can get some win from considering parses that have been "early outed". What you do is when you decide to "early out" an arrival, you will not walk to any future nodes that are not yet done, but you *will* consider paths that go to nodes that were already there. In pseudo-code :


pop an arrival

check arrival early outs and just set a flag

for all coding choices at current pos
{
  find next_node
  if next_node exists
    compute cost to end
  else
    if ! early out flag
       push next_node on arrivals stack
}

So the early out stops you from creating any new nodes in the graph walk that you wouldn't have visited anyway, but you can still find new connections through that graph. What this lets you do in practice is drive the early out thresholds tighter.
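A minimal runnable sketch of that "soft" early out, matching the pseudo-code above (the node and choice representations here are hypothetical, not the actual Oodle structures):

```python
# Sketch of the "soft" early out: an early-outed arrival may still route
# through nodes that already have a cost_to_tail, but creates no new nodes.
# (Hypothetical toy structures, not actual Oodle code.)

def expand_arrival(arrival, nodes, stack, coding_choices, early_out):
    """nodes: dict mapping (pos, state) -> cost_to_tail (None marks a dead end).
       arrival: (pos, state, cost_from_head).
       Returns (best cost-to-tail found through finished nodes, early-out flag)."""
    pos, state, cost_from_head = arrival
    flagged = early_out(arrival)            # just set a flag, don't return yet
    best = None
    for code_cost, next_pos, next_state in coding_choices(pos, state):
        key = (next_pos, next_state)
        if key in nodes:
            if nodes[key] is not None:      # finished node: reuse its cost
                c = code_cost + nodes[key]
                best = c if best is None or c < best else best
        elif not flagged:                   # only grow the graph if not early-outed
            stack.append((next_pos, next_state, cost_from_head + code_cost))
    return best, flagged
```

The point is the asymmetry: the flag gates `stack.append` (graph growth) but not the lookup into already-finished nodes, so new connections through existing nodes are still found.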

The other subtlety is that it helps a lot to have two (or more) stages of early out. Rather than just stop considering all exit coding choices once you don't like your arrival, you have a couple of levels. If your arrival looks sort of bad but not terrible, then you still consider some of the coding choices. Instead of considering 8 or 16 coding choices, you reduce it to 2 or 4 which you believe are likely advantageous.

The exact details depend on the structure of your back end coder, but some examples of "likely advantageous" coding choices that you would consider in the intermediate early out case : if you have a "repeat recent offset" structure like LZX/LZMA, then those are obvious things to include in the "likely advantageous" set. Another might be RLE or continue-previous types of match codes. Another would be the literal, if it codes below a certain number of bits with the current statistics. Also the longest match, if it's longer than a certain amount.
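Something like the following candidate filter, sketched in Python (the choice tuples and the notion of a numeric "badness" level are made up for illustration; a real version would also keep cheap literals, RLE codes, etc. as described above):

```python
# Sketch of two-level early outs: a "bad" arrival still tries a few
# likely-advantageous codings (repeat offsets, the longest match) instead of
# the full candidate set; a "terrible" arrival tries none.
# (Hypothetical representation: choices are (kind, cost_bits, length) tuples.)

def select_candidates(all_choices, badness):
    """all_choices: list of (kind, cost_bits, length), kind in
       {'rep', 'match', 'literal'}; badness: 0 = fine, 1 = bad, 2 = terrible."""
    if badness >= 2:
        return []                                   # hard early out
    if badness == 1:
        # intermediate early out: keep only the likely winners
        keep = [c for c in all_choices if c[0] == 'rep']         # repeat offsets
        longest = max((c for c in all_choices if c[0] == 'match'),
                      key=lambda c: c[2], default=None)          # longest match
        if longest is not None:
            keep.append(longest)
        return keep
    return all_choices                              # full consideration
```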

Okay, so our A star is working now, but we have a problem. We're still just not getting enough early outs, and if you ran this on a big file it will take forever (sometimes).

The solution is to use another aspect we expect from our LZ back end, which is "semi-locality". Locality means that a decision we make now will not have a huge effect way in the future. Yes, it has some effect, because it may change the state and that affects the future, but over time the state changes so many times and adapts to future coding that the decision 4000 bytes ago doesn't matter all that much.

Another key point is that the bad (slow) case occurs when there are lots of parses that cost approximately the same. Because of our early out structure, if there is a really good cheap parse we will generally converge towards it, and then the other choices will be more expensive and they will early out and we won't consider too many paths. We only get into bad degeneracy if there are lots of parses with similar cost. And the thing is, in that case we really don't care which one we pick. So when we find an area of the file that has a huge branching factor that's hard to make a decision about, we are imperfect but it doesn't cost us much overall.

The result is that we can cut up the file to make the parse space tractable. What I do is work in "quanta". You take the current chunk of the file as your quantum and parse it as if it was its own little file. The parse at the beginning of the quantum will be mostly unaffected by the quantum cut, but the parse at the end will be highly affected by the false EOF, so you just throw it out. That is, advance through the first 50% or 75% of the parse, and then start the next quantum there.

There is one special case for the quantum cutting which is long matches that extend past the end of the quantum. What you would see is when outputting the first 50% of the parse, the last code will be a match that goes to the end of the quantum. Instead I just output the full length of the match. This is not ideal but the loss is negligible.
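The quantum-cutting loop can be sketched like so (a toy under stated assumptions: `parse_quantum` stands in for the full A* parse of one chunk, and the codes it returns are (position, step length) pairs; not the actual Oodle code):

```python
# Sketch of quantum cutting: parse each quantum as if it were its own file,
# emit only the first 75% of its codes (the false EOF distorts the tail),
# and let a match that crosses the cut be emitted at full length.
# Assumes parse_quantum(start, end) returns codes starting exactly at 'start'.

def parse_by_quanta(n, quantum_len, parse_quantum):
    """parse_quantum(start, end) -> list of (pos, step_len) codes for [start,end).
       Returns the concatenated codes for the whole file."""
    out = []
    pos = 0
    while pos < n:
        end = min(pos + quantum_len, n)
        codes = parse_quantum(pos, end)
        emit_limit = pos + (end - pos) * 3 // 4      # keep first 75% of the quantum
        if end == n:
            emit_limit = n                           # last quantum: keep it all
        for cpos, step in codes:
            if cpos >= emit_limit:
                break
            out.append((cpos, step))
            pos = cpos + step        # a match past the cut advances pos past
                                     # emit_limit: it's output at full length
    return out
```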

For speed you can go even further and use adaptive quantum lengths. On highly degenerate parts of the file, there may be a huge node space to parse that doesn't get early-out'ed. When you detect one of these, you can just reduce the quantum length for that part of the file. eg. you start with a quantum length of 4096 ; if as you are parsing that quantum you find that the hash table occupancy is beyond some threshold (like 1 million nodes for example), you decide the branching factor is too great and reduce the quantum length to 2048 and resume parsing on just the beginning of that chunk. You might hit 1 million nodes again, then you reduce to 1024, etc.
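That retry loop is simple enough to sketch directly (`try_parse` and the node budget are hypothetical stand-ins for the real parser and hash-occupancy threshold):

```python
# Sketch of adaptive quantum lengths: if parsing a quantum visits too many
# nodes, halve the quantum length and retry on its front half.
# (Hypothetical harness, not actual Oodle code.)

def parse_adaptive(start, quantum_len, try_parse, node_budget, min_len=256):
    """try_parse(start, length) -> (codes, nodes_visited); nodes_visited over
       node_budget means the branching factor blew up.  Returns (codes, length
       actually parsed)."""
    length = quantum_len
    while True:
        codes, visited = try_parse(start, length)
        if visited <= node_budget or length <= min_len:
            return codes, length
        length //= 2                  # e.g. 4096 -> 2048 -> 1024 ...
```

(In practice you'd want to detect the blowup during the parse and abort early, rather than finish the too-big quantum and throw it away as this toy does.)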

That's it! Probably a followup post with some results numbers and maybe some more notes about subtle issues. I could also do several long posts about ideas I tried that didn't work which I think are sort of interesting.

12-17-11 - LZ Optimal Parse with A Star Part 3

Continuing ...
Part 1
Part 2

At the end of Part 2 we looked at how to do a forward LZSS optimal parse. Now we're going to add adaptive "state" to the mix.

Each node in the walk of parses represents a certain {Pos,State} pair. There are now too many possible nodes to store them all, so we can't just use an array indexed by node. Hopefully we will not actually visit them all, so we store the nodes we do visit in a hash table.

We are parsing forward, so for any node we visit (a {Pos,State} will be called a "node") we know how we got there. There can be many ways of reaching the same node, but we only care about the cheapest one. So we only need to store one entering link into each node, and the total cost from the beginning of the path to get to that node.

If you think about the flow of how the forward LZSS parse completes, it's sort of like an ice tendril reaching out which then suddenly crystallizes. You start at the beginning and you are always pushing the longest length choice first - that is, you are taking big steps into the parse towards the end without filling in all the gaps. Once you get to the end with that first long path (which is actually the greedy parse - the parse made by taking the longest match available at each step), then it starts popping backwards and filling in all the gaps. It then does all the dense work, filling backwards towards the beginning.

So it's like the parse goes in two directions - reaching from the beginning to get to the end (with nodes that don't have enough information), and then densely bubbling back from the end (and making final decisions). (if I was less lazy I would make a video of this).

Anyhoo, we'll make that structure more explicit. The hash table, for each node, stores the cost to get to the end from that node, and the coding choice that gives that cost.

The forward parse uses entry links, which I will henceforth call "arrivals". This is a destination node (a {pos,state}), and the cost from the beginning. (you don't need to store how you got here from the beginning since that can be reproduced at the end by rewalking from the beginning).


Full cost of parse through this node =

arrival.cost_from_head + hash_node.cost_to_tail

Once a node has a cost in the hash table, it is done, because it had all the information it needed at that node. But more arrivals can come in later as we fill in the gaps, so the full cost from the beginning of the parse to the end of the parse is not known.

Okay, so let's start looking at the parse, based on our simple LZSS pseudo-code from last time :


hash table of node-to-end costs starts empty
stack of arrivals from head starts empty

Push {Pos 1,state initial} on stack of arrivals

While stack is not empty :

pop stack; gives you an arrival to node {P,state}

see if node {P,state} is already in the hash
if so
{
  total cost is arrival.cost_from_head + hash_node.cost_to_tail
  done with this arrival
  continue (back to stack popping);
}

For each coding choice {C} at the current pos
{
  find next_state = state transition from cur state after coding choice C
  next_pos = P + C.len
  next_node = {next_pos,next_state}

  if next_node is in the hash table :
  {
    compute cost to end from code cost of {C} plus next_node.cost_to_tail
  }
  else
  {
    push next_node to the arrivals stack (*1)
  }
}

if no pushes were done
{
  then processing of current node is done
  choose the best cost to end from the choices above
  create a node {P,state} in the hash with that cost
}

(*1 = if any pushes are done, then the current node is also repushed first (before other pushes). The pushes should be done in order from lowest pos to highest pos, just as with LZSS, so that the deep walk is done first).
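As a concrete check on the pseudo-code above, here's a minimal runnable version in Python. The graph walk is faithful; the coder itself (the `choices` function a caller supplies) is a made-up toy, and there are no early outs yet - it visits every reachable node:

```python
# Runnable toy of the stack-driven forward parse with state.
# nodes[(pos, state)] ends up holding the optimal cost-to-tail for every
# visited node; choices(pos, state) is a hypothetical stand-in for the coder.

def forward_parse(n, choices):
    """choices(pos, state) -> list of (step_len, code_cost, next_state)."""
    nodes = {}                       # {(pos, state) : cost_to_tail}
    stack = [(0, 0)]                 # arrivals; initial state is 0
    while stack:
        pos, state = stack.pop()
        if (pos, state) in nodes:
            continue                 # already done
        if pos >= n:
            nodes[(pos, state)] = 0  # reached the tail
            continue
        pushes, best = [], None
        for step, cost, nstate in choices(pos, state):
            key = (pos + step, nstate)
            if key in nodes:
                c = cost + nodes[key]
                best = c if best is None or c < best else best
            else:
                pushes.append(key)
        if pushes:
            stack.append((pos, state))      # repush current node first
            for key in sorted(pushes):      # lowest pos first, so the
                stack.append(key)           # deepest child pops first
        else:
            nodes[(pos, state)] = best      # all choices resolved; finalize
    return nodes[(0, 0)]
```

Because positions only increase, the walk is over a DAG and terminates; each node is finalized exactly once, with the optimal cost to the tail.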

So, we have a parse, but it's walking every node, which is way too many. Currently this is a full graph walk. What we need are some early outs to avoid walking the whole thing.

The key is to use our intuition about LZ parsing a bit. Because we step deep first, we quickly get one parse for the whole segment (the greedy parse). Then we start stepping back and considering variations on that parse.

The parse doesn't collapse the way it did with LZSS because of the presence of state. That is, say I parsed to the end and now I'm bubbling back and I get back to some pos P. I already walked the long length, so I'm going to consider a shorter one. With LZSS, when I walked the shorter one, the nodes I needed would already be done. But now the nodes aren't done - though importantly, the positions have been visited. That is -


At pos P, state S
many future node positions are already done
 (I already walked the longest match length forward)

eg. maybe {P+3, S1} and {P+5, S2} and {P+7, S3} have been done

I take a shorter length now; eg. to {P+2,S4}

from there I consider {P+5, S5}

the node is not done, but a different state at P+5 was done.

If the state didn't matter, we would be able to reuse that node and collapse back to O(N) like LZSS.

Now of course state does matter, but crucially it doesn't matter *that much*. In particular, there is sort of a limit on how much it can help.

Consider for example if "state" is some semi-adaptive statistics. Those statistics are adaptive, so if you go far enough into the future, the state will adapt to the coding parse, and the initial state won't have helped that much. So maybe the initial state helps a lot for the next 8 coding steps. And maybe it helps at most 4 bits each time. Then having a better initial state can help at most 32 bits.

When you see that some other parse has been through this same position P, albeit with different state at this position, if that parse has completed and has a total cost, then we know it is the optimal cost through that node, not just the greedy parse or whatever. That is, whenever a hash node has a cost_to_tail, it is the optimal parse cost to tail. If there is a good parse later on in the file, the optimal parse is going to find that parse, even if it starts from a non-ideal state.

This is the form of our early outs :


When you pop an arrival to node {P,S} , look at the best cost to arrive to pos P for any state, 

if arrival.cost_from_head - best_cost_from_head[P] > threshold
  -> early out

if arrival.cost_from_head + best_cost_to_tail[P] > best_cost_total + threshold
  -> early out

where we've introduced two arrays that track the best seen cost to head & tail at each pos, regardless of state. We also keep a best total cost, which is initially set to infinity until we get through a total parse, and then is updated any time we see a new whole-walk cost.

This is just A star. From each node we are trying to find a lower bound for the cost to get to the end. What we use is previous encodings from that position to the end, and we assume that starting from a different state can't help more than some amount.
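Those two checks, with the per-position tracking arrays, can be sketched as a small bookkeeping class (hypothetical shape, not the actual Oodle code; note the cost-to-tail check should only fire once some parse has actually reached the tail from that position):

```python
# Sketch of the two A*-style early outs, tracking per-position bests
# regardless of state.  (Hypothetical toy, not actual Oodle code.)

INF = float('inf')

class EarlyOut:
    def __init__(self, n, threshold):
        self.best_from_head = [INF] * (n + 1)   # best arrival cost at each pos
        self.best_to_tail   = [INF] * (n + 1)   # best known cost-to-end at each pos
        self.best_total     = INF               # best complete parse seen so far
        self.threshold      = threshold         # slack for how much state can help

    def should_prune(self, pos, cost_from_head):
        b = self.best_from_head[pos]
        if cost_from_head < b:
            self.best_from_head[pos] = cost_from_head
        elif cost_from_head - b > self.threshold:
            return True                  # far worse than the best arrival here
        t = self.best_to_tail[pos]
        if t < INF and cost_from_head + t > self.best_total + self.threshold:
            return True                  # can't beat the best complete parse
        return False
```

`threshold` is the bound on how much a better state can possibly help (the "at most 32 bits" kind of argument from Part 3); tightening it trades compression for speed.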

Next time, some subtleties.

old rants