4/04/2013

04-04-13 - Worker Thread Forward Permit Delay-Kicks

I've had this small annoying issue for a while, and finally thought of a pretty simple solution.

You may recall, I use a worker thread system with forward "permits" (reversed dependencies) . When any handle completes it sees if that completion should trigger any followup handles, and if so those are then launched. Handles may be SPU jobs or IOs or CPU jobs or whatever. The problem I will talk about occurred when the predessor and the followup were both CPU jobs.

I'll talk about a specific case to be concrete : decoding compressed data while reading it from disk.

To decode each chunk of LZ data, a chunk-decompress job is made. That job depends on the IO(s) that read in the compressed data for that chunk. It also depends on the previous chunk if the chunk is not a seek-reset point. So in the case of a non-reset chunk, you have a dependency on an IO and a previous CPU job. Your job will be started by one or the other, whichever finishes last.

Now, when decompression was IO bound, then the IO completions were kicking off the decompress jobs, and everything was fine.

In these timelines, the second line is IO and the bottom four are workers. (click images for higher res)

LZ Read and Decompress without seek-resets, IO bound :

You can see the funny fans of lines that show the dependency on the previous decompress job and also the IO. Yellow is a thread that's sleeping.

You may notice that the worker threads are cycling around. That's not really ideal, but it's not related to the problem I'm talking about today. (that cycling is caused by the fact that the OS semaphore is FIFO. For something like worker threads, we'd actually rather have a LIFO semaphore, because it makes it more likely that you get a thread with something useful still hot in cache. Someday I'll replace my OS semaphore with my own LIFO one, but for now this is a minor performance bug). (Win32 docs say that they don't gaurantee any particular order, but in my experience threads of equal priority are always FIFO in Win32 semaphores)

Okay, now for the problem. When the IO was going fast, so we were CPU bound, it's the prior decompress job that triggers the followup work.

But something bad happened due to the forward permit system. The control flow was something like this :


On worker thread 0

wake from semaphore
do on an LZ decompress job
mark job done
completion change causes a permits check
permits check sees that there is a pending job triggered by this completion
  -> fire off that handle
   handle is pushed to worker thread system
   no worker is available to do it, so wake a new worker and give him the job
finalize (usually delete) job I just finished
look for more work to do
   there is none because it was already handed to a new worker

And it looked like this :

LZ Read and Decompress without seek-resets, CPU bound, naive permits :

You can see each subsequent decompress job is moving to another worker thread. Yuck, bad.

So the fix in Oodle is to use the "delay-kick" mechanism, which I'd already been using for coroutine refires (which had a similar problem; the problem occurred when you yielded a coroutine on something like an IO, and the IO was done almost immediately; the coroutine would get moved to another worker thread instead of just staying on the same one and continuing from the yield as if it wasn't there).

The scheme is something like this :


On each worker thread :

Try to pop a work item from the "delay kick queue"
  if there is more than one item in the DKQ,
    take one for myself and "kick" the remainder
    (kick means wake worker threads to do the jobs)

If nothing on DKQ, pop from the main queue
  if nothing on main queue, wait on work semaphore

Do your job

Set "delay kick" = true
  ("delay kick" has to be in TLS of course)
Mark job as done
Permit system checks for successor handles that can now run 
  if they exist, they are put in the DKQ instead of immediately firing
Set "delay kick" = false

Repeat

In brief : work that is made runnable by the completion of work is not fired until the worker thread that did the completion gets its own shot at grabbing that new work. If the completion made 4 jobs runnable, the worker will grab 1 for itself and kick the other 3. The kick is no longer in the completion phase, it's in the pop phase.

And the result is :

LZ Read and Decompress without seek-resets, CPU bound, delay-kick permits :

Molto superiore.

These last two timelines are on the same time scale, so you can see just from the visual that eliminating the unnecessary thread switching is about a 10% speedup.

Anyway, this particular issue may not apply to your worker thread system, or you may have other solutions. I think the main take-away is that while worker thread systems seem very simple to write at first, there's actually a huge amount of careful fiddling required to make them really run well. You have to be constantly vigilant about doing test runs and checking threadprofile views like this to ensure that what's actually happening matches what you think is happening. Err, oops, I think I just accidentally wrote an advertisement for Telemetry .

04-04-13 - Oodle Compression on BC1 and WAV

I put some stupid filters in, so here are some results for the record and my reference.

BC1 (DXT1/S3TC) DDS textures :

All compressors run in max-compress mode. Note that it's not entirely fair because Oodle has the BC1 swizzle and the others don't.

Some day I'd like to do a BC1-specific encoder. Various ideas and possibilities there. Also RD-DXTC.

I also did a WAV filter. This one is particularly ridiculous because nobody uses WAV, and if you want to compress audio you should use a domain-specific compressor, not just OodleLZ with a simple delta filter. I did it because I was annoyed that RAR beat me on WAVs (due to its having a multimedia filter), and RAR should never beat me.

WAV compression :

See also : same chart with 7z (not really fair cuz 7z doesn't have a WAV filter)

Happy to see that Oodle-filt handily beats RAR-filt. I'm using just a trivial linear gradient predictor :


out[i] = in[i] - 2*in[i-1] + in[i-2]

this could surely be better, but whatever, WAV filtering is not important.

I also did a simple BMP delta filter and EXE (BCJ/relative-call) transform. I don't really want to get into the business of offering all kinds of special case filters the way some of the more insane modern archivers do (like undoing ZLIB compression so you can recompress it, or WRT), but anyhoo there's a few.


ADDED : I will say something perhaps useful about the WAV filter.

There's a bit of a funny issue because the WAV data is 16 bit (or 24 or 32), and the back-end entropy coder in a simple LZ is 8 bit.

If you just take a 16-bit delta and put it into bytes, then most of your values will be around zero, and you'll make a stream like :

[00 00] [00 01] [FF FF] [FF F8] [00 04] ...
The bad thing you should notice here are the high bytes are switching between 00 and FF even though the values have quite a small range. (Note that the common thing of centering the values with +32768 doesn't change this at all).

You can make this much better just by doing a bias of +128. That makes it so the most important range of values (around zero (specifically [-128,127])) all have the same top byte.

I think it might be even slightly better to do a "folded" signed->unsigned map, like

{ 0,-1,1,-2,2,-3,...,32767,-32768 }
The main difference being that values like -129 and +128 get the same high byte in this mapping, rather than two different high bytes in the simple +128 bias scheme.

Of course you really want a separate 8-bit huffman for alternating pairs of bytes. One way to get that is to use a few bottom bits of position as part of the literal context. Also, the high byte should really be used as context for the low byte. But both of those are beyond the capabilities of my simple LZ-huffs so I just deinterleave the high and low bytes to two streams.

4/02/2013

04-02-13 - The Method of Random Holdouts

Quick and dirty game-programming style hacky way to make fitting model parameters somewhat better.

You have some testset {T} of many items, and you wish to fit some heuristic model M over T which has some parameters. There may be multiple forms of the model and you aren't sure which is best, so you wish to compare models against each other.

For concreteness, you might imagine that T is a bunch of images, and you are trying to make a perceptual DXTC coder; you measure block error in the encoder as something like (SSD + a * SATD ^ b + c * SSIM_8x8 ) , and the goal is to minimize the total image error in the outer loop, measured using something complex like IW-MS-SSIM or "MyDCTDelta" or whatever. So you are trying to fit the parameters {a,b,c} to minimize an error.

For reference, the naive training method is : run the model on all data in {T}, optimize parameters to minimize error over {T}.

The method of random holdouts goes like this :


Run many trials

On each trial, take the testset T and randomly separate it into a training set and a verification set.
Typically training set is something like 75% of the data and verification is 25%.

Optimize the model parameters on the {training set} to minimize the error measure over {training set}.

Now run the optimized model on the {verification set} and measure the error there.
This is the error that will be used to rate the model.

When you make the average error, compensate for the size of the model thusly :
average_error = sum_error / ( [num in {verification set}] - [dof in model] )

Record the optimal parameters and the error for that trial

Now you have optimized parameters for each trial, and an error for each trial. You can take the average over all trials, but you can also take the sdev. The sdev tells you how well your model is really working - if it's not close to zero then you are missing something important in your model. A term with a large sdev might just be a random factor that's not useful in the model, and you should try again without it.

The method of random holdouts reduces over-training risk, because in each run you are measuring error only on data samples that were not used in training.

The method of random holdouts gives you a decent way to compare models which may have different numbers of DOF. If you just use the naive method of training, then models with more DOF will always appear better, because they are just fitting your data set.

That is, in our example say that (SSD + a * SATD ^ b) is actually the ideal model and the extra term ( + c * SSIM_8x8 ) is not useful. As long as it's not just a linear combo of other terms, then naive training will find a "c" such that that term is used to compensate for variations in your particular testset. And in fact that incorrect "c" can be quite a large value (along with a negative "a").

This kind of method can also be used for fancier stuff like building complex models from ensembles of simple models, "boosting" models, etc. But it's useful even in this case where we wind up just using a simple linear model, because you can see how it varies over the random holdouts.

3/31/2013

03-31-13 - Market Equilibrium

I'm sure there's some standard economic theory of all this but hey let's reinvent the wheel without any background.

There's a fundamental principal of any healthy (*) market that the reward for some labor is equal across all fields - proportional only to standard factors like the risk factor, the scarcity of labor, the capital required for entry, etc. (* = more on "healthy" later). The point is that those factors have *nothing* to do with the details of the field.

The basic factor at play is that if some field changes and suddenly becomes much more profitable, then people will flood into that field, and the risk-equal-capital-return will keep going down until it becomes equal to other fields. Water flows downhill, you know.

When people like Alan Greenspan try to tell you that oh this new field is completely unlike anything we've seen in the past because of blah blah - it doesn't matter, they may have lots of great points that seem reasonable in isolation, but the equilibrium still applies. The pay of a computer programmer is set by the pay of a farmer, because if the difference were out of whack, the farmer would quit farming and start programming; the pay of programmers will go down and the wages of farmers will go up, then the price of lettuce will go up, and in the end a programmer won't be able to buy any more lettuce than anyone else in a similar job. ("similar" only in terms of risk, ease of entry, rarity of talent, etc.)

We went through a drive-through car wash yesterday and Tasha idly wondered how much the car wash operator makes from an operation like that. Well, I bet it's about the same as a quick-lube place makes, and that's about the same as a dry cleaner, and it's about the same as a pizza place (which has less capital outlay but more risk), because if one of them was much more profitable, there would be more competition until equilibrium was reached.

Specifically I've been thinking about this because of the current indie game boom on the PC, which seems to be a bit of a magic gold rush at the moment. That almost inevitably has to die out, it's just a question of when. (so hurry up and get your game out before it does!).

But of course that leads us into the issue of broken markets, since all current game sales avenues are deeply broken markets.

Equilibrium (like most naive economic theory) only applies to markets where there's fluidity, robust competition, no monopolistic control, free information, etc. And of course those don't happen in the real world.

Whenever a market is not healthy, it provides an opportunity for unbalanced reward, well out of equilibrium.

Lack of information can be particularly be a factor in small niches. There can be a company that does something random like make height-adjustable massage tables. If they're a private operation and nobody really pays attention to them, they can have super high profit levels for something that's not particularly difficult - way out of equilibrium. If other people knew how easy that business was, lots of others would enter, but due to lack of information they don't.

Patents and other such mechanisms that create legally enforced distortions of the market. Of course things like the cable and utility systems are even worse.

On a large scale, government distortion means that huge fields like health care, finance, insurance, oil, farming, etc. are all more profitable than they should be.

Perhaps the biggest issue in downloadable games is the oligopoly of App Store and Steam. This creates an unhealthy market distortion and it's hard to say exactly what the long term affect of that will be. (of course you don't see it as "unhealthy" if you are the one benefiting from the favor of the great powers; it's unhealthy in a market fluidity and fair competition sense, and may slow or prevent equilibrium)

Of course new fields are not yet in equilibrium, and one of the best ways to "get rich quick" is to chase new fields. Software has been out of equilibrium for the past 50 years, and is only recently settling down. Eventually software will be a very poorly paid field, because it requires very little capital to become a programmer, it's low risk, and there are lots of people who can do it.

Note that in *every* field the best will always rise to the top and be paid accordingly.

Games used to be a great field to work in because it was a new field. New fields are exciting, they offer great opportunities for innovation, and they attract the best people. Mature industries are well into equilibrium and the only chances for great success are through either big risk, big capital investment, or crookedness.

03-31-13 - Index - Game Threading Architecture

Gathering the series for an index post :

cbloom rants 08-01-11 - A game threading model
cbloom rants 12-03-11 - Worker Thread system with reverse dependencies
cbloom rants 03-05-12 - Oodle Handle Table
cbloom rants 03-08-12 - Oodle Coroutines
cbloom rants 06-21-12 - Two Alternative Oodles
cbloom rants 07-19-12 - Experimental Futures in Oodle
cbloom rants 10-26-12 - Oodle Rewrite Thoughts
cbloom rants 12-18-12 - Async-Await ; Microsoft's Coroutines
cbloom rants 12-21-12 - Coroutine-centric Architecture
cbloom rants 12-21-12 - Coroutines From Lambdas
cbloom rants 12-06-12 - Theoretical Oodle Rewrite Continued
cbloom rants 02-23-13 - Threading - Reasoning Behind Coroutine Centric Design

I believe this is a good architecture, using the techniques that we currently have available, without doing anything that I consider bananas like writing your own programming language (*). Of course if you are platform-specific or know you can use C++11 there are small ways to make things more convenient, but the fundamental architecture would be about the same (and assuming that you will never need to port to a broken platform is a mistake I know well).

(* = a lot of people that I consider usually smart seem to think that writing a custom language is a great solution for lots of problems. Whenever we're talking about "oh reflection in C is a mess" or "dependency analysis should be automatic", they'll throw out "well if you had the time you would just write a custom language that does all this better". Would you? I certainly wouldn't. I like using tools that actually work, that new hires are familiar with, etc. etc. I don't have to list the pros of sticking with standard languages. In my experience every clever custom language for games is a huge fucking disaster and I would never advocate that as a good solution for any problem. It's not a question of limitted dev times and budgets.)

03-31-13 - Endian-Independent Structs

I dunno, maybe this is common practice, but I've never seen it before.

The easy way to load many file formats (I'll use a BMP here to be concrete) is just to point a struct at it :


struct BITMAPFILEHEADER
{
    U16 bfType; 
    U32 bfSize; 
    U16 bfReserved1; 
    U16 bfReserved2; 
    U32 bfOffBits; 
} __attribute__ ((__packed__));


BITMAPFILEHEADER * bmfh = (BITMAPFILEHEADER *)data;

if ( bmfh->bfType != 0x4D42 )
    ERROR_RETURN("not a BM",0);

etc..

but of course this doesn't work cross platform.

So people do all kinds of convoluted things (which I have usually done), like changing to a method like :


U16 bfType = Get16LE(&ptr);
U32 bfSize = Get32LE(&ptr);

or they'll do some crazy struct-parse fixup thing which I've always found to be bananas.

But there's a super trivial and convenient solution :


struct BITMAPFILEHEADER
{
    U16LE bfType; 
    U32LE bfSize; 
    U16LE bfReserved1; 
    U16LE bfReserved2; 
    U32LE bfOffBits; 
} __attribute__ ((__packed__));

where U16LE is just U16 on little-endian platforms and is a class that does bswap on itself on big-endian platforms.

Then you can still just use the old struct-pointing method and everything just works. Duh, I can't believe I didn't think of this earlier.

Similarly, here's a WAV header :


struct WAV_header_LE
{
    U32LE FOURCC_RIFF; // RIFF Header 
    U32LE riffChunkSize; // RIFF Chunk Size 
    U32LE FOURCC_WAVE; // WAVE Header 
    U32LE FOURCC_FMT; // FMT header 
    U32LE fmtChunkSize; // Size of the fmt chunk 
    U16LE audioFormat; // Audio format 1=PCM,6=mulaw,7=alaw, 257=IBM Mu-Law, 258=IBM A-Law, 259=ADPCM 
    U16LE numChan; // Number of channels 1=Mono 2=Sterio 
    U32LE samplesPerSec; // Sampling Frequency in Hz 
    U32LE bytesPerSec; // bytes per second 
    U16LE blockAlign; // normall NumChan* bytes per sample
    U16LE bitsPerSample; // Number of bits per sample 
}  __attribute__ ((__packed__));;

easy.

For file-input type structs, you just do this and there's no penalty. For structs you keep in memory you wouldn't want to eat the bswap all the time, but even in that case this provides a simple way to get the swizzle into native structs by just copying all the members over.

Of course if you have the Reflection-Visitor system that I'm fond of, that's also a good way to go. (cursed C, give me a "do this macro on all members").

3/30/2013

03-30-13 - Error codes

Some random rambling on the topic of returning error codes.

Recently I've been fixing up a bunch of code that does things like

void  MutexLock( Mutex * m )
{
    if ( ! m ) return;
    ...

yikes. Invalid argument and you just silently do nothing. No thank you.

We should all know that silently nopping in failure cases is pretty horrible. But I'm also dealing with a lot of error code returns, and it occurs to me that returning an error code in that situation is not much better.

Personally I want unexpected or unhandleable errors to just blow up my app. In my own code I would just assert; unfortunately that's not viable in OS code or perhaps even in a library.

The classic example is malloc. I hate mallocs that return null. If I run out of memory, there's no way I'm handling it cleanly and reducing my footprint and carrying on. Just blow up my app. Personally whenever I implement an allocator if it can't get memory from the OS it just prints a message and exits (*).

(* = aside : even better is "functions that don't fail" which I might write more about later; basically the idea is the function tries to handle the failure case itself and never returns it out to the larger app. So in the case of malloc it might print a message like "tried to alloc N bytes; (a)bort/(r)etry/return (n)ull?". Another common case is when you try to open a file for write and it fails for whatever reason, it should just handle that at the low level and say "couldn't open X for write; (a)bort/(r)etry/change (n)ame?" )

I think error code returns are okay for *expected* and *recoverable* errors.

On functions that you realistically expect to always succeed and will not check error codes for, they shouldn't return error codes at all. I wrote recently about wrapping system APIs for portable code ; an example of the style of level 2 wrapping that I like is to "fix" the error returns.

(obviously this is not something the OS should do, they just have to return every error; it requires app-specific knowledge about what kind of errors your app can encounter and successfully recover from and continue, vs. ones that just mean you have a catastrophic unexpected bug)

For example, functions like lock & unlock a mutex shouldn't fail (in my code). 99% of the user code in the world that locks and unlocks mutexes doesn't check the return value, they just call lock and then proceed assuming the lock succeeded - so don't return it :


void mypthread_mutex_lock(mypthread_mutex_t *mutex)
{
    int ret = pthread_mutex_lock(mutex);
    if ( ret != 0 )
        CB_FAIL("pthread_mutex_lock",ret);
}

When you get a crazy unexpected error like that, the app should just blow up right at the call site (rather than silently failing and then blowing up somewhere weird later on because the mutex wasn't actually locked).

In other cases there are a mix of expected failures and unexpected ones, and the level-2 wrapper should differentiate between them :


bool mysem_trywait(mysem * sem)
{
    for(;;)
    {
        int res = sem_trywait( sem );
        if ( res == 0 ) return true; // got it

        int err = errno;
        if ( err == EINTR )
        {
            // UNIX is such balls
            continue;
        }
        else if ( err == EAGAIN )
        {
            // expected failure, no count in sem to dec :
            return false;
        }
        else
        {
            // crazy failure; blow up :
            CB_FAIL("sem_trywait",err);
        }
    }
}

(BTW best practice these days is always to copy "errno" out to an int, because errno may actually be #defined to a function call in the multithreaded world)

And since I just stumbled into it by accident, I may as well talk about EINTR. Now I understand that there may be legitimate reasons why you *want* an OS API that's interrupted by signals - we're going to ignore that, because that's not what the EINTR debate is about. So for purposes of discussion pretend that you never have a use case where you want EINTR and it's just a question of whether the API should put that trouble on the user or not.

I ranted about EINTR at RAD a while ago and was informed (reminded) this was an ancient argument that I was on the wrong side of.

Mmm. One thing certainly is true : if you want to write an operating system (or any piece of software) such that it is easy to port to lots of platforms and maintain for a long time, then it should be absolutely as simple as possible (meaning simple to implement, not simple in the API or simple to use), even at the cost of "rightness" and pain to the user. That I certainly agree with; UNIX has succeeded at being easy to port (and also succeeded at being a pain to the user).

But most people who argue on the pro-EINTR side of the argument are just wrong; they are confused about what the advantage of the pro-EINTR argument is (for example Jeff Atwood takes off on a general rant against complexity ; I think we all should know by now that huge complex APIs are bad; that's not interesting, and that's not what "Worse is Better" is about; or Jeff's example of INI files vs the registry - INI files are just massively better in every way, it's not related at all, there's no pro-con there).

(to be clear and simple : the pro-EINTR argument is entirely about simplicity of implementation and porting of the API; it's about requiring the minimum from the system)

The EINTR-returning API is not simpler (than one that doesn't force you to loop). Consider an API like this :


U64 system( U64 code );

doc :

if the top 32 bits of code are 77 this is a file open and the bottom 32 bits specify a device; the
return values then are 0 = call the same function again with the first 8 chars of the file name ...
if it returns 7 then you must sleep at least 1 milli and then call again with code = 44 ...
etc.. docs for 100 pages ...

what you should now realize is that *the docs are part of the API*. (that is not a "simple" API)

An API that requires you to carefully read about the weird special cases and understand what is going on inside the system is NOT a simple API. It might look simple, but it's in disguise. A simple API does what you expect it to. You should be able to just look at the function signature and guess what it does and be right 99% of the time.

Aside from the issue of simplicity, any API that requires you to write the exact same boiler-plate every time you use it is just a broken fucking API.

Also, I strongly believe that any API which returns error codes should be usable if you don't check the error code at all. Yeah yeah in real production code of course you check the error code, but for little test apps you should be able to do :


int fd = open("blah");

read(fd,buf);

close(fd);

and that should work okay in my hack test app. Nope, not in UNIX it doesn't. Thanks to its wonderful "simplicity" you have to call "read" in a loop because it might decide to return before the whole read is done.

Another example that occurs to me is the reuse of keywords and syntax in C. Things like making "static" mean something completely different depending on how you use it makes the number of special keywords smaller. But I believe it actually makes the "API" of the language much *more* complex. Instead of having intuitive and obvious separate clear keywords for each meaning that you could perhaps figure out just by looking at them, you instead have to read a bunch of docs and have very technical knowledge of the internals of what the keywords mean in each usage. (there are legitimate advantages to minimizing the number of keywords, of course, like leaving as many names available to users as possible). Knowledge required to use an API is part of the API. Simplicity is determined by the amount of knowledge required to do things correctly.

3/26/2013

03-26-13 - Simulating Deep Yield with a Wait

I'm becoming increasingly annoyed at my lack of "deep yield" for coroutines.

Any time you are in a work item, if you decide that you can get some more parallelism by doing a branch-merge inside that item, you need deep yield.

Remember you should never ever do an OS wait on a coroutine thread (with normal threads anyway; on a WinRT threadpool thread you can). The reason is the OS wait disables that worker thread, so you have one less. In the worst case, it leads to deadlock, because all your worker threads can be asleep waiting on work items, and with no worker threads they will never get done.

Anyway, I've cooked up a temporary work-around, it looks like this :


I'm in some function and I want to branch-merge

If I'm not on on a worker thread
  -> just do a normal branch-merge, send the work off and use a Wait for completion

If I am on a worker thread :

inc target worker thread count
if # currently live worker threads is < target count
  start a new worker thread (either create or wake from pool)

now do the branch-merge and use OS Wait
dec the target worker thread count


on each worker thread, after completing a work item and before popping more work :
if target worker thread count < currently live count
  stop self (go back into a sleeping state in the pool)

this is basically using OS threads to implement stack-saving deep yield. It's not awesome, but it is okay if deep yield is rare.

03-26-13 - Oodle 1.1 and GDC

Hey it's GDC time again, so if you're here come on by the RAD booth and say "hi" (or "fuck you", or whatever).

The Oodle web site just went live a few days ago.

Sometimes I feel embarassed (ashamed? humiliated?) that it's taken me five years to write a file IO and data compression library. Other times I think I've basically written an entire OS by myself (and all the docs, and marketing materials, and a video compressor, and aborted paging engine, and a bunch of other crap) and that doesn't sound so bad. I suppose the truth is somewhere in the middle. (perhaps with Oodle finally being officially released and selling, I might write a little post-mortem about how it's gone, try to honestly look back at it a bit. (because lord knows what I need is more introspection in my life)).

Oodle 1.1 will be out any day now. Main new features :


Lots more platforms.  Almost everything except mobile platforms now.

LZNIB!  I think LZNIB is pretty great.  8X faster to decode than ZLIB and usually
makes smaller files.

Other junk :
All the compressors can run parallel encode & decode now.
Long-range-matcher for LZ matching on huge files (still only in-memory though).
Incremental compressors for online transmission, and faster resets.

Personally I'm excited the core architecture is finally settling down, and we have a more focused direction to go forward, which is mainly the compressors. I hope to be able to work on some new compressors for 1.2 (like a very-high-compression option, which I currently don't have), and then eventually move on to some image compression stuff.

3/19/2013

03-19-13 - Windows Sleep Variation

Hmm, I've discovered that Sleep(n) behaves very differently on my three Windows boxes.

(Also remember there are a lot of other issues with Sleep(n) ; the times are only reliable here because this is in a no-op test app)

This actually started because I was looking into Linux thread sleep timing, so I wrote a little test to just Sleep(n) a bunch of times and measure the observed duration of the sleep.

(Of course on Windows I do timeBeginPeriod(1) and bump my thread to very high priority (and timeGetDevCaps says the minp is 1)).

Anyway, what I'm seeing is this :


Win7 :
sleep(1) : average = 0.999 , sdev = 0.035 ,min = 0.175 , max = 1.568
sleep(2) : average = 2.000 , sdev = 0.041 ,min = 1.344 , max = 2.660
sleep(3) : average = 3.000 , sdev = 0.040 ,min = 2.200 , max = 3.774

Sleep(n) averages n
duration in [n-1,n+1]

WinXP :
sleep(1) : average = 1.952 , sdev = 0.001 ,min = 1.902 , max = 1.966
sleep(2) : average = 2.929 , sdev = 0.004 ,min = 2.665 , max = 2.961
sleep(3) : average = 3.905 , sdev = 0.004 ,min = 3.640 , max = 3.927

Sleep(n) averages (n+1)
duration very close to (n+1) every time (tiny sdev)

Win8 :
sleep(1) : average = 2.002 , sdev = 0.111 ,min = 1.015 , max = 2.101
sleep(2) : average = 2.703 , sdev = 0.439 ,min = 2.017 , max = 3.085
sleep(3) : average = 3.630 , sdev = 0.452 ,min = 3.003 , max = 4.130

average no good
Sleep(n) minimum very precisely n
duration in [n,n+1] (+ a little error)
rather larger sdev

it's like completely different logic on each of my 3 machines. XP is the most precise, but it's sleeping for (n+1) millis instead of (n) ! Win8 has a very precise min of n, but the average and max is quite sloppy (sdev of almost half a milli, very high variation even with nothing happening on the system). Win7 hits the average really nicely but has a large range, and is the only one that will go well below the requested duration.

As noted before, I had a look at this because I'm running Linux in a VM and seeing very poor performance from my threading code under Linux-VM. So I ran this experiment :


Sleep(1) on Linux :

native : average = 1.094 , sdev = 0.015 , min = 1.054 , max = 1.224
in VM  : average = 3.270 , sdev =14.748 , min = 1.058 , max = 656.297

(added)
in VM2 : average = 1.308 , sdev = 2.757 , min = 1.052 , max = 154.025

obviously being inside a VM on Windows is not being very kind to Linux's threading system. On the native box, Linux's sleep time is way more reliable than Windows (small min-max range) (and this is just with default priority threads and SCHED_OTHER, not even using a high priority trick like with the Windows tests above).

added "in VM2". So the VM threading seems to be much better if you let it see many fewer cores than you have. I'm running on a 4 core (8 hypercore) machine; the base "in VM" numbers are with the VM set to see 4 cores. "in VM2" is with the VM set to 2 cores. Still a really bad max in there, but much better overall.

old rants