The interaction of threads/Jobs and memory allocation is a nasty problem.
Say you're trying to process some 8 GB file on a 32-bit system. You'd like to fire up a bunch of threads and let them all crank on chunks of the file simultaneously. But how big of a chunk can each thread work on? And how many threads can you run?
The problem is that those threads may need to allocate memory to do their processing. With free-form allocations you don't necessarily know in advance how much they need to allocate (it might depend on processing options or the data they see or whatever). With a multi-process OS you also don't know in advance how much memory you have available (it may shrink while you're running). So you can't just say "I have the memory to run 4 threads at a time". You don't know. You can run out of memory, and you have to abort the whole process and try again with fewer threads.
In case it's not obvious, you can't just try running 4 threads and then, when one of them runs out of memory, pause that thread and run others, or kill that thread. The thread may do work and allocations incrementally (work, alloc, work, alloc, etc.), so by the time an alloc fails you're already holding a bunch of other allocs, and no other thread may be able to run either.
To be really clear, imagine you have 2 MB free and your threads do { alloc 1 MB, work A, alloc 1 MB, work B }. You try to run 2 threads, and they both get up to work A. Now neither thread can continue because you're out of memory.
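Here's a tiny sketch of that stall (not real code from anything, just the scenario above simulated without actual threads so the failure is deterministic):

    #include <cstdio>

    int g_freeMB = 2; // pretend the system has 2 MB free

    bool Alloc(int mb) {
        if (g_freeMB < mb) return false;
        g_freeMB -= mb;
        return true;
    }

    int main() {
        // Both "threads" run their first step: { alloc 1 MB, reach work A }.
        bool t0 = Alloc(1); // thread 0 holds 1 MB, is at work A
        bool t1 = Alloc(1); // thread 1 holds 1 MB, is at work A
        printf("first allocs: t0=%d t1=%d\n", t0, t1);      // both succeed
        // Now each needs another 1 MB to reach work B, but the pool is empty:
        printf("second allocs: t0=%d t1=%d\n", Alloc(1), Alloc(1)); // both fail
        // Neither thread can continue, and neither releases memory until it finishes.
        return 0;
    }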
The real solution is for each Job to pre-declare its resource requirements. Like "I need 80 MB to run". Then it becomes the responsibility of the Job Manager to do the allocation, so when the Job is started it is handed the memory and it knows it can run; all allocations within the Job then come from the reserved pool, not from the system.
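Something like this rough sketch (all names made up, not the actual Oodle job system): the Job declares how much it needs, the Job Manager reserves that much up front, and the Job only ever allocates out of that arena.

    #include <cstddef>
    #include <cstdlib>

    // Dead-simple linear arena; the Job can only allocate out of this.
    struct Arena {
        char * base = nullptr;
        size_t size = 0, used = 0;
        void * Alloc(size_t bytes) {
            if (used + bytes > size) return nullptr; // job declared too little: a bug, not a system OOM
            void * p = base + used; used += bytes; return p;
        }
    };

    struct Job {
        size_t memNeeded;          // pre-declared: "I need this much to run"
        void (*run)(Arena &);      // all allocations inside come from the arena
    };

    struct JobManager {
        size_t memFree;            // the memory budget the manager hands out
        bool TryRun(const Job & job) {
            if (job.memNeeded > memFree) return false;     // don't start what can't finish
            Arena arena;
            arena.base = (char *) malloc(job.memNeeded);   // reserve the whole pool up front
            arena.size = job.memNeeded;
            memFree -= job.memNeeded;
            job.run(arena);        // the Job's allocations are guaranteed to succeed
            free(arena.base);
            memFree += job.memNeeded;
            return true;
        }
    };

    static void MyJobRun(Arena & a) {
        void * scratch = a.Alloc(64 << 10); // succeeds as long as the declaration was honest
        (void) scratch; // ... do the work ...
    }

    int main() {
        JobManager jm { 80u << 20 };     // say we have 80 MB to hand out
        Job job { 1u << 20, MyJobRun };  // this Job declares it needs 1 MB
        return jm.TryRun(job) ? 0 : 1;
    }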
(there are of course other solutions; for example you could make all your Jobs rewindable, so if one fails an allocation it is aborted (and any changes to global state undone); or, similarly, all your Jobs could work in two phases: a "gather" phase where allocs are allowed but no changes to global state are, and a "commit" phase where the changes are applied; the Job can be aborted during "gather" but must not fail during "commit").
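The two-phase idea might look something like this (hypothetical interface, just to make the contract concrete):

    // Gather() may allocate and may fail, but must not touch global state.
    // Commit() applies the results and is written so it cannot fail (no allocs).
    struct TwoPhaseJob {
        virtual bool Gather() = 0;   // false = allocation failed; nothing global changed, retry later
        virtual void Commit() = 0;   // only uses memory already acquired, so it can't fail partway
        virtual ~TwoPhaseJob() {}
    };

    bool RunJob(TwoPhaseJob & job) {
        if (!job.Gather()) return false; // safe to abort; no side effects to undo
        job.Commit();                    // must not fail
        return true;
    }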
So the Job Manager might try to allocate memory for a job, fail, and run some other jobs that need less memory. eg. if you have jobs that need { 10, 1, 10, 1 } units of memory and you have only 12 units free, you can't run the two 10's at the same time, but you can run the 1's while a 10 is running.
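A toy version of that scheduling pass (made-up code, greedy and single-pass just to show the idea):

    #include <cstdio>
    #include <vector>

    int main() {
        int memFree = 12;
        std::vector<int> pending = { 10, 1, 10, 1 };  // declared memory per job
        std::vector<int> running;

        // One scheduling pass: start every pending job that fits right now.
        for (size_t i = 0; i < pending.size(); ) {
            if (pending[i] <= memFree) {
                memFree -= pending[i];
                running.push_back(pending[i]);
                pending.erase(pending.begin() + i);
            } else {
                ++i; // doesn't fit yet; leave it pending instead of failing it
            }
        }

        printf("running %zu jobs, %zu still pending, %d memory free\n",
               running.size(), pending.size(), memFree);
        // -> running 3 jobs (10,1,1), 1 still pending (the other 10), 0 free
        return 0;
    }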
While you're at it, you may as well put some load-balancing information in your Jobs. You could have each Job mark to what extent it needs CPU, GPU, or IO resources (in addition to memory use). Then the Job Manager can try to run jobs that are of different types (eg. don't run two IO-heavy jobs at the same time).
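In sketch form (hypothetical flags; a real scheduler would weight these rather than hard-block, and would allow as many CPU-heavy jobs as there are cores):

    #include <cstddef>

    enum ResourceFlags {
        kCPUHeavy = 1,
        kGPUHeavy = 2,
        kIOHeavy  = 4,
    };

    struct JobDecl {
        size_t   memNeeded;   // as before
        unsigned resources;   // bitmask of ResourceFlags
    };

    // Avoid stacking jobs that hammer the same contended resource.
    bool PreferToStart(const JobDecl & job, unsigned resourcesInUse) {
        if ((job.resources & kIOHeavy)  && (resourcesInUse & kIOHeavy))  return false;
        if ((job.resources & kGPUHeavy) && (resourcesInUse & kGPUHeavy)) return false;
        return true;
    }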
If you want to go even more extreme, you could have Jobs pre-declare the shared system resources that they need locks on, and the Job Manager can try to schedule jobs to avoid lock contention. (the even super extreme version of this is to pre-declare *all* your locks and have the Job Manager take them for you, so that you are guaranteed to get them; at that point you're essentially making Jobs into snippets that you know cannot ever fail and cannot even ever *block*; that is, they won't even start unless they can run straight to completion).
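The extreme version might look roughly like this (made-up names; the point is the manager takes every declared lock, in one consistent order, before the Job starts):

    #include <mutex>
    #include <vector>
    #include <algorithm>

    struct LockedJob {
        std::vector<std::mutex *> locks;  // pre-declared shared resources
        void (*run)();                    // body must not take any other locks
    };

    bool TryRunWithLocks(LockedJob & job) {
        // Sort by address so all acquisitions use the same global order.
        std::sort(job.locks.begin(), job.locks.end());
        std::vector<std::mutex *> held;
        for (std::mutex * m : job.locks) {
            if (!m->try_lock()) {
                for (std::mutex * h : held) h->unlock(); // couldn't get them all; back off
                return false;                            // manager can schedule another job instead
            }
            held.push_back(m);
        }
        job.run();                                       // runs straight to completion, never blocks on these
        for (std::mutex * h : held) h->unlock();
        return true;
    }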
I haven't wanted to go down this route because it violates one of my Fundamental Theorems of Jobs, which is that job code should be the same as main-thread code, not some weird meta-language that requires lots of markup and is totally different code from what you would write in the non-threaded case.
Anyway, because I haven't properly addressed this, it means that in low-memory scenarios (eg. any 32-bit platform), the Oodle compressors (at the optimal parse level) can run out of memory if you use too many worker threads, and it's hard to really know that's going to happen in advance (since the exact memory use depends on a bunch of options and is hard to measure). Bleh.
(and obviously what I need to do for Oodle, rather than solving this problem correctly and generally, is just to special-case my LZ string matchers and have them allocate their memory before starting the parallel compress, so I know how many threads I can run)