
10-09-13 - Urgh ; Threads and Memory

This is a problem I've been trying to avoid really facing, so I keep hacking around it, but it keeps coming back to bite me every few months.

Threads/Jobs and memory allocation is a nasty problem.

Say you're trying to process some 8 GB file on a 32-bit system. You'd like to fire up a bunch of threads and let them all crank on chunks of the file simultaneously. But how big of a chunk can each thread work on? And how many threads can you run?

The problem is those threads may need to do allocations to do their processing. With free-form allocations you don't necessarily know in advance how much they need to allocate (it might depend on processing options or the data they see or whatever). With a multi-process OS you also don't know in advance how much memory you have available (it may shrink while you're running). So you can't just say "I have the memory to run 4 threads at a time". You don't know. You can run out of memory, and then you have to abort the whole process and try again with fewer threads.

In case it's not obvious, you can't just try running 4 threads, and when one of them runs out of memory pause it and run others, or kill it, because the thread may do work and allocations incrementally, like work, alloc, work, alloc, etc., so that by the time an alloc fails, you're already holding a bunch of other allocs and no other thread may be able to run.

To be really clear, imagine you have 2 MB free and your threads do { alloc 1 MB, work A, alloc 1 MB, work B }. You try to run 2 threads, and they both get up to work A. Now neither thread can continue because you're out of memory.

The real solution is for each Job to pre-declare its resource requirements. Like "I need 80 MB to run". Then it becomes the responsibility of the Job Manager to do the allocation, so when the Job is started, it is handed the memory and it knows it can run; all allocations within the Job then come from the reserved pool, not from the system.
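Something like this, purely as a sketch (the names and the interface here are invented, not actual Oodle API) :

#include <cstddef>
#include <cstdlib>

struct JobDesc
{
    size_t memNeeded;                        // pre-declared : "I need 80 MB to run"
    void (*run)(void * mem, size_t memSize); // all allocs inside come from 'mem'
};

// the Job Manager owns the budget; a job only starts if its reservation succeeds
struct JobManager
{
    size_t memBudget;   // what we're allowed to hand out in total

    bool TryRun(const JobDesc & job)
    {
        if (job.memNeeded > memBudget)
            return false;                    // don't start it; run something smaller instead
        void * mem = malloc(job.memNeeded);
        if (!mem)
            return false;
        memBudget -= job.memNeeded;
        job.run(mem, job.memNeeded);         // job allocates out of 'mem', never the system
        free(mem);
        memBudget += job.memNeeded;
        return true;
    }
};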

(there are of course other solutions; for example you could make all your jobs rewindable, so if one fails an allocation it is aborted (and any changes to global state undone), or similarly all your jobs could work in two stages, a "gather" stage where allocs are allowed, but no changes to the global state are allowed, and a "commit" phase where the changes are applied; the job can be aborted during "gather" but must not fail during "commit").
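(rough sketch of the gather/commit shape, with made-up names :)

struct TwoStageJob
{
    // "gather" : may allocate and may fail, but must not touch global state
    virtual bool Gather() = 0;
    // "commit" : applies results to global state; must not be able to fail
    virtual void Commit() = 0;
    virtual ~TwoStageJob() {}
};

inline bool RunTwoStage(TwoStageJob & job)
{
    if (!job.Gather())
        return false;   // safe to abort/retry here, nothing global has changed
    job.Commit();       // past this point failure is not allowed
    return true;
}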

So the Job Manager might try to allocate memory for a job, fail, and run some other jobs that need less memory. eg. if you have jobs that take { 10, 1, 10, 1 } of memories, and you have only 12 memories free, you can't run the two 10's at the same time, but you can run the 1's while a 10 is running.
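In code that part is about as dumb as it sounds; a sketch (names invented) :

#include <vector>
#include <cstddef>

// returns index of a pending job that fits in memFree, or -1 if none does
int PickJobThatFits(const std::vector<size_t> & pendingMemNeeded, size_t memFree)
{
    for (int i = 0; i < (int)pendingMemNeeded.size(); i++)
        if (pendingMemNeeded[i] <= memFree)
            return i;
    return -1;  // nothing fits right now; wait for a running job to release memory
}

// eg. pending = {10,1,10,1}, memFree = 12 : while one 10 runs (2 left),
// the 1's can still be picked, but the second 10 has to wait.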

While you're at it, you may as well put some load-balancing in your Jobs. You could have each Job mark to what extent it needs CPU, GPU, or IO resources (in addition to memory use). Then the Job Manager can try to run jobs that are of different types (eg. don't run two IO-heavy jobs at the same time).
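(again just a sketch, with invented names, of what that markup might look like :)

enum JobResourceType { kJobCPU, kJobGPU, kJobIO };

struct JobLoad
{
    JobResourceType dominant;   // what this job mostly hammers
};

// eg. fine to pair a CPU-heavy job with an IO-heavy one, but not two IO-heavy ones :
inline bool OkToRunTogether(JobLoad a, JobLoad b)
{
    return !(a.dominant == kJobIO && b.dominant == kJobIO);
}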

If you want to go even more extreme, you could have Jobs pre-declare the shared system resources that they need locks on, and the Job Manager can try to schedule jobs to avoid lock contention. (the even super extreme version of this is to pre-declare *all* your locks and have the Job Manager take them for you, so that you are guaranteed to get them; at this point you're essentially making Jobs into snippets that you know cannot ever fail and cannot even ever *block*, that is they won't even start unless they can run straight to completion).

I haven't wanted to go down this route because it violates one of my Fundamental Theorems of Jobs, which is that job code should be the same as main-thread code, not some weird meta-language that requires lots of markup and is totally different code from what you would write in the non-threaded case.

Anyway, because I haven't properly addressed this, it means that in low-memory scenarios (eg. any 32-bit platform), the Oodle compressors (at the optimal parse level) can run out of memory if you use too many worker threads, and it's hard to really know that's going to happen in advance (since the exact memory use depends on a bunch of options and is hard to measure). Bleh.

(and obviously what I need to do for Oodle, rather than solving this problem correctly and generally, is just to special case my LZ string matchers and have them allocate their memory before starting the parallel compress, so I know how many threads I can run)

10 comments:

johnb said...

Writing the core computation code to have two phases (allocate-resources/compute) is good but painful. I'm not sure it needs to violate your Fundamental Theorems of Jobs, because you can write single-threaded code the same way, with basically the same advantages. But it's just too painful to actually write the code like that, and sometimes it's impossible, either because the resource requirements are too heavily dependent on the input data, or because resource availability is changing while you run and you can't practically lock down all resources at the start.

I think the gather/commit separation is more viable in general. It still means accepting that if you run out of resources you've done a bunch of work that you have to throw away, but there will always be cases where you can't avoid that.

Arguably those two approaches sit on a continuum anyway: they're both based on a split between "stuff that can fail due to resource constraints" and "stuff that we know we can complete".

cbloom said...

Arguably those two approaches sit on a continuum anyway: they're both based on a split between "stuff that can fail due to resource constraints" and "stuff that we know we can complete".


True enough, and something I've ranted about in the past, so I should listen to myself.

But in practice it's awfully cumbersome.

Any time you're doing any big CPU process, you should try to get anything that could fail out of the way before you start. (eg. if you need to write a file at the end - open it *first*). Don't spend 12 hours grinding on a video and then fail.

Same principle in some ways.

Fabian 'ryg' Giesen said...

An allocate/compute split will not save you. Even declaring all resource constraints ahead of time and handing it over to the job scheduler will not, in general, solve this problem.

Assume there's a chain of jobs A->B, and you have lots of instances of that.

A uses a known, fixed amount of memory that gets passed to B. B additionally allocates an amount of memory that depends on the input data - this is decided as A runs, but before B is enqueued. B does its work and then frees both the memory it allocated, and the memory allocated by its preceding A.

You have 100 instances of that chain and they're all independent, so the job scheduler happily goes ahead and runs them on your 100-hardware-thread machine. At that point, you have 100MB of memory left. And then the A's all start crunching for a while and eventually it turns out that every single one of the B's wants at least 120MB of memory to proceed.

At that point, you're toast (unless you can back out and throw away some A's and their allocations), even though each job knows and declares its resource requirements at creation time.

If you're dealing with a soft-failure condition (e.g. paging), you can just soldier on and live with the fact that things are gonna suck for a while until you're done paging. For hard-failure conditions (e.g. out of address space) there's just no general way to deal with this in a setting that has side effects. You can fix it using truly hideous constraints - "every job must allocate enough memory for the entire sub-tree of jobs it might eventually spawn" - but that's a cure just as bad as the disease.

cbloom said...

@ryg - yeah totally.

To make it completely robust requires knowing the full execution path in advance, and pre-checking all possible failures along a line of execution (eg. allocations), etc. big mess.

Actually this reminds me that I have a similar nasty issue internal to Oodle because I use a fixed size handle table. You have to decide in advance the max # of jobs that can be outstanding.

The bad thing about that is you have to decide in advance how big to make the table, and there are cases where you can start executing a graph and then get into a spot where nothing can resolve.

That is, you can't just go "oh crap we hit the job limit, let's allow previous jobs to complete before we start any new ones" - because the previous jobs might not be able to complete because they require sub-jobs to finish.

There are heuristics that make you less likely to fail, that work in both cases.

For example, running your job graph deep-first instead of wide-first makes you more likely to resolve chains (and thus free memory and job handles). Running wide-first can give better parallelism, but is also more likely to get you in a stuck situation where you run out of resources and nobody can finish.
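(one cheap way to get that bias, purely as an illustration and not how Oodle actually does it : make the ready list a LIFO so a worker keeps chasing the chain it just unblocked, instead of a FIFO that fans out across the whole graph :)

#include <deque>

struct Job;  // whatever your job type is

struct ReadyList
{
    std::deque<Job *> jobs;
    bool              depthFirst;

    void  Push(Job * j) { jobs.push_back(j); }

    Job * Pop()
    {
        if (jobs.empty()) return 0;
        Job * j;
        if (depthFirst) { j = jobs.back();  jobs.pop_back();  } // newest first : finish chains, free resources sooner
        else            { j = jobs.front(); jobs.pop_front(); } // oldest first : maximize width
        return j;
    }
};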

In general I think fixed size pools of anything are very bad practice because of these kind of issues. It's much better to degrade performance than just to hit some hard limit and crash.

Fabian 'ryg' Giesen said...

"To make it completely robust requires knowing the full execution path in advance, and pre-checking all possible failures along a line of execution (eg. allocations), etc. big mess."
This is not a "big mess", this is simply flat-out impossible, in the strong sense of "requires computing an uncomputable function" - "how much memory does this need" is a non-trivial property of a partial function and hence covered by Rice's theorem, so there's a canonical halting problem reduction:

void jobA(JobHandle *me)
{
    if (someFunc())
        me->buffer = new char[5ull << 60];
}

How much memory does A allocate? Well, whatever someFunc needs, plus 5 exabytes if and only if someFunc halts and returns true.

This may sound like pedantry / CS wankery. But Sean wrote up a telling example that answers why stb_truetype requires a dynamic memory allocator and can't just work out of a fixed-size pool. He can't answer "how much memory does rendering a character need" directly without allocating memory, and so forth. So he ends up with "how much memory does this need" functions nested 3-deep.

The halting problem reduction just shows that for sufficiently tricky problems, you can end up with an infinite regress of such functions, making this not even theoretically answerable. In practice, even the 3-level-deep example he gives is horrible enough to never actually get written unless you're truly desperate.

cbloom said...

Nonsense, I think you are being intentionally difficult.

Bringing in the question of the halting problem is just a total CS cop-out. Hey the entire job system might never complete and we have no way of knowing for sure if it will or not. Great, that's true, but also who cares.

I'm not suggesting that the system automatically figure out the maximum memory needed, rather that the jobs report what they need.

Now there's the question of how hard it is for the jobs to report what they need, and that's where it is sometimes a "big mess".

First of all, there is a large and interesting class of work where it is *not* a big mess, and the jobs can pretty straightforwardly pre-reserve memory. In practice very large scale systems like this run very nicely every day on your GPU.

Then there are cases where it's possible, but quite inconvenient, or requires a lot of extra work, sometimes duplicating lots of the job's work. eg. for anything that depends on a file size, you have to go and query the file size in the HowMuchMemory function and then do it again when you actually run.

And before you go off into difficult wonkery again, I'm assuming that this is the real world where I'm allowed to be practical, so I'm allowed to do things like say "my HowMuchMemory function can allocate up to 1M to do its work" and things like "I can only handle directories with up to 10k files". So no, I don't need a HowMuchMemory for my HowMuchMemory call.
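eg. a sketch for a compress-a-file kind of job (GetFileSize64 is an assumed helper and the 2x-plus-1MB estimate is made up) :

#include <cstdint>
#include <cstddef>

int64_t GetFileSize64(const char * path);   // assumed helper

size_t HowMuchMemory_CompressFileJob(const char * path)
{
    int64_t size = GetFileSize64(path);     // query the size here...
    if (size < 0) return 0;
    // say one input buffer + one output buffer + fixed overhead for tables :
    return (size_t)size * 2 + (1 << 20);
}

void Run_CompressFileJob(const char * path, void * mem, size_t memSize)
{
    int64_t size = GetFileSize64(path);     // ...and again when the job actually runs
    // carve the input/output buffers out of 'mem' and do the work
    (void)size; (void)memSize;
}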

Finally there are cases where you actually can't bound your resource usage without allocating a huge amount (maybe the same amount as the job requires). I guess something like if your jobs load a bat file and that can run arbitrary code, you're fucked.

So hey you can't do tight memory bounds in every case.

That doesn't mean you should do it in *no* cases! That's absurd. That's the worst kind of armchair intellectualism; oh integer arithmetic can't be proven to be consistent, therefore we shouldn't do any math.

If you can solve the problem in lots of real-world useful cases, of course you should.

And I believe this is one of those cases.

Response to Sean's writing in the next comment.

cbloom said...

About Sean's article -

http://nothings.org/gamedev/font_rendering_malloc.txt

I think Sean makes very good points and I basically agree.

But I think you are twisting its message to make it mean something it doesn't.

What I think Sean presents well -

1. Converting code that was not designed to run with pre-allocated memory into something that can run with pre-allocated memory is very hard / impossible.

2. Algorithms that run with pre-allocated memory can be much slower (because they need to do a lot of work to compute memory usage), and/or can use much more memory than late allocation (if you are just conservative and over-allocate with some max).

That's true and I agree.

However what he's not saying and what I don't agree with is :

1. Rasterizing truetype fonts with pre-allocated memory is difficult/impossible.

Of course it's not. You can do a linear scan to rearrange the edge data. You can do a rasterization pass that only counts the number of edges per scan line. You could do a tiled rasterizer that shrinks the tiles when you exceed a max edge count. etc. etc.
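(sketch of that counting pass, with made-up types :)

#include <vector>
#include <algorithm>

struct Edge { float yMin, yMax; /* x, slope, ... */ };

// pass 1 : only count edges per scan line, so pass 2 can carve exact-size
// per-scanline edge lists out of one block allocated up front
std::vector<int> CountEdgesPerScanline(const std::vector<Edge> & edges, int height)
{
    std::vector<int> counts(height, 0);     // bounded by 'height', known in advance
    for (size_t i = 0; i < edges.size(); i++)
    {
        int y0 = std::max(0,      (int)edges[i].yMin);
        int y1 = std::min(height, (int)edges[i].yMax);
        for (int y = y0; y < y1; y++)
            counts[y]++;
    }
    return counts;
}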

cbloom said...

Anyhoo...

I believe the real takeaway is that there are practical cases where this is pretty simple and genuinely useful.

The basic case is something like a branch-merge where you are free to choose your branching factor.

You're doing some processing, and you can do it in parallel at the granularity of your choice.

Quite often it is possible to get at least a rough estimate of the memory use of each branch.

If you are memory-use limited, you should constrain your branch count so that you don't exceed system memory size. Even if you have a swap file, you'll get much better performance by staying out of it.

eg. I know my SuffixTrie takes a max of 256 MB, so parallel string matching should have its branch factor limited by system memory size / 256.
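(in code it's just a clamp; sketch, names invented :)

#include <cstddef>

int MaxBranches(size_t memFreeBytes, size_t memPerBranchBytes, int numCores)
{
    int byMemory = (int)(memFreeBytes / memPerBranchBytes);
    if (byMemory < 1) byMemory = 1;
    return byMemory < numCores ? byMemory : numCores;
}

// eg. SuffixTrie string matching at a max of 256 MB per branch :
//   int branches = MaxBranches(systemFreeMemory, 256u << 20, numCores);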

Fabian 'ryg' Giesen said...

You're misunderstanding where I'm going with this. My point is not that "this is hard so throw up your hands since there's nothing we can do"; it's that it's pointless to try and build a system that tries to solve this for you.

A tasklet scheduler that doesn't care about memory usage is a well-scoped problem. So is a tasklet scheduler that is aware of memory pressure and keeps overall resource usage below a given limit.

The magic task graph analyzer that can figure out memory usage for a general job graph in advance given tons of annotations is not, and I'd argue that it wouldn't be a good thing to have even if it existed. There's lots of easy practical cases that you can solve without this, based on simple local invariants, e.g. "no job in this graph can ever use more than 256k of scratch memory" (the SPU/GPU solution), or "max memory usage along a chain of dependent jobs is required to be strictly nonincreasing" (monotonicity), or "the graph of jobs is required to be 2-colorable, red jobs allocate memory, black jobs free it" (e.g. frontend/backend jobs in Pixomatic) - you get the idea.
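(The first of those is basically just a fixed per-worker arena; rough sketch, names invented :)

#include <cstddef>
#include <cassert>

// each worker owns one of these; jobs may only allocate from the arena of the
// worker running them, so total usage is numWorkers * 256k by construction
struct ScratchArena
{
    enum { kSize = 256 * 1024 };
    char   buf[kSize];
    size_t used;

    ScratchArena() : used(0) {}

    void * Alloc(size_t n)
    {
        assert(used + n <= (size_t)kSize);  // exceeding it is a bug in the job, not a runtime condition
        void * p = buf + used;
        used += n;
        return p;
    }
    void Reset() { used = 0; }              // called between jobs
};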

All of these invariants have simple, explainable, instantly comprehensible reasons why they work.

The magic graph scheduler is just a black box that will produce *some* schedule and at no point is there ever a good explanation for why it does what it does beyond "well, that's the solution the solver found". The only thing that is more frustrating than designing such a system is using it.

And of course, in practice any implementation is not the magic oracle implementation, because that can't actually be done. So not only does such a system deliver something that would be a complexity trap even if it worked perfectly, it by necessity only delivers some rough approximation that also adds its own wrinkles to the pile.

This is the difference between "I know this lock-based algorithm is deadlock-free because there's a strict lock hierarchy" vs. "I know this lock-based algorithm is deadlock-free because of this tricky 6-page proof that hinges on several implementation details". I don't care how ingenious that solution is, I don't want it anywhere near my code.

TL;DR:
Good:
- A simple work scheduler
- Several "frontends" for said work scheduler that can guarantee resource limits (using various models).

Bad:
- The grand unified "do what I mean" scheduler that requires tons of annotations, doesn't do quite what I mean, might still fail badly if someone gets the annotations wrong (and of course that will happen all the time) and even when it does work does so for reasons that nobody is quite sure about.

cbloom said...

Yeah, ok, I don't actually think anyone should have a complicated research project at the core of their job system. (though MS has done that to all of us with WinRT, so...)

But I think there is some middle ground.

Maybe I'll do it as a separate post because writing in this comment box sucks.
