
10-08-12 - Oh dear god, what a relief

For the past few days I've had a terrible heisen-bug. It mainly showed up as decompression failing to reproduce the original data. At first I thought it was just a regular bug, some weird case exposing a broken pathway, but I could never get it to repro, and it only seemed to happen on very long runs; for example, if I compressed 10,000 files it would show up around the 7000th one, but then if I ran on just the file that failed it would never repro.

I do a lot of weird threading stuff so my first fear was that I had some kind of race. So I turned off all my threading, but it kept happening.

My next thought was some kind of uninitialized-memory or out-of-bounds problem. The circumstances of failure jibe with a bug that only manifests after I have touched a lot of memory and maybe moved into a weird part of address space, or maybe I'm writing past the end of a buffer somewhere and it doesn't show up and hurt me until much later.

So I turned on my various debug allocator features and tried a bunch of things to stress that, but still couldn't get it to fail in any kind of repeatable way.

Yesterday I saw the exact same kind of bug happen in a few of my different compressors and the lightbulb finally came on in my head : maybe I have bad RAM. I ran Memtest86, and just a few seconds in, yep, bad RAM.

Phew. As pissed as I am to have to deal with this (getting into the RAM on my lappy is a serious pain in the ass) it's nice to not actually have a bizarro bug.

The failure rate of RAM in desktop-replacement lappies is around 100% in my experience. I've had two different desktop replacement lappies in the past 8 years and I have burned out 3 RAM chips; I've blown the OEM RAM on both of them and on this one I also toasted the replacement RAM. Presumably the problem is that it just gets too hot in there and they don't have sufficient cooling. (and yes I keep them on a screen for air flow and all that, and never actually use them on a lap or pillow or anything bad like that). (perhaps I should get one of those laptop stands that has active cooling fans).

Also, shouldn't we have better debugging features by now?

I should be able to take any range of memory, not just page boundaries, and mark it as "no access". So for example I could take compression buffers and put little no access regions at the head and tail.
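The closest today's MMUs get to this is page granularity: the classic "electric fence" trick of right-aligning an allocation against a `PROT_NONE` guard page. A minimal sketch (the `guard_alloc` name is mine, and error handling and freeing are omitted):

```cpp
// Page-granularity guard allocation: put an inaccessible page just past
// the end of the buffer so any overrun faults immediately. This is the
// best the hardware offers today -- protection only works on page
// boundaries, not arbitrary byte ranges.
#include <cassert>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Returns 'size' usable bytes, right-aligned against a PROT_NONE tail
// guard page. (Leaks the mapping; a real allocator would record the
// base so it could be unmapped on free.)
static void *guard_alloc(size_t size)
{
    const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    const size_t data_pages = (size + page - 1) / page;
    const size_t total = (data_pages + 1) * page; // +1 guard page at the tail

    void *raw = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return nullptr;
    uint8_t *base = static_cast<uint8_t *>(raw);

    mprotect(base + data_pages * page, page, PROT_NONE); // tail is off-limits
    return base + data_pages * page - size;              // right-align the block
}
```

Left-align against a head guard page instead to catch underruns; you can't get byte-exact guards on both ends at once, which is exactly the limitation being complained about here.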

For uninitialized memory you want to be able to mark every allocation as "fault if it's read before it's written". (this requires a bit per byte which is cleared on write).

You could enforce true const in C by making a true_const template that marks its memory as read-only.
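A hypothetical `true_const<T>` along those lines: park the object on its own page(s) and `mprotect()` them read-only after construction, so any write, even through a `const_cast`, faults in hardware. It costs a whole page per object, so it's strictly a debugging tool:

```cpp
// Sketch of hardware-enforced const: the object lives on a private
// mapping that is made read-only once constructed. Error handling for
// mmap failure is omitted for brevity.
#include <cassert>
#include <new>
#include <sys/mman.h>
#include <unistd.h>

template <typename T>
class true_const {
public:
    explicit true_const(const T &value) {
        const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
        bytes_ = ((sizeof(T) + page - 1) / page) * page;
        mem_ = mmap(nullptr, bytes_, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        obj_ = new (mem_) T(value);          // construct in place
        mprotect(mem_, bytes_, PROT_READ);   // writes now fault
    }
    ~true_const() {
        mprotect(mem_, bytes_, PROT_READ | PROT_WRITE); // unlock to destroy
        obj_->~T();
        munmap(mem_, bytes_);
    }
    const T &operator*() const { return *obj_; }

    true_const(const true_const &) = delete;
    true_const &operator=(const true_const &) = delete;

private:
    void  *mem_;
    T     *obj_;
    size_t bytes_;
};
```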

I've ranted before about how thread debugging would be much better if we could mark memory as "fault unless you are thread X", eg. give exclusive access of a memory region to a thread.

I see two good solutions for this : 1. a VM that could run your exe and add these features, or 2. special memory chips and MMUs for programmers. I certainly would pay extra for RAM that had an extra 2 bits per byte with access flags. Hell, with how cheap RAM is these days I would pay extra for more error-correction bits too; maybe even completely duplicate bytes. And self-healing RAM wouldn't be bad either (just mark a page as unusable if it sees failures in that page).

(for thread debugging we should also have a VM that can record exact execution traces and replay them, of course).

15 comments:

Dave Moore said...

Valgrind? I think it does some of this, including the "read before written" trick.

ryg said...

"I should be able to take any range of memory, not just page boundaries, and mark it as "no access". So for example I could take compression buffers and put little no access regions at the head and tail.

For uninitialized memory you want to be able to mark every allocation as "fault if it's read before it's written". (this requires a bit per byte which is cleared on write).
"
Try Valgrind (www.valgrind.org) - provided you run on one of their supported platforms. It's instrumentation-based, not HW-based, but especially the original module (the memory error checker) is great: it catches the usual memory leaks and so forth (boring), but also reliably finds double-frees and dereferences of deallocated pointers (not so boring), and it tracks whether memory is initialized or not per bit (sounds like overkill, but means it works even with bit fields or sparse bit vectors). It's good enough that I've on occasion ported stuff to Linux just so I could Valgrind it (things like [de]compressors are easy since they tend to be self-contained with few platform deps).

won3d said...

http://www.chromium.org/developers/testing/addresssanitizer

nothings said...

I think getting a custom CPU for programmers might not work (the overhead in authoring a real chip is too huge for a tiny audience).

But with multicore machines these days, I feel like we should be able to have one "debugging core" which has extra features for debugging but which otherwise behaves like normal.

I can still imagine Intel not being satisfied with the value proposition there. But on consoles, there's just no excuse for not having this.

cbloom said...

@Sean -

I think Intel/others could provide a generic facility for hardware instruction virtualization.

Like provide a way to replace any instruction with a subroutine instead. It would be done in the instruction decode to micro-ops. Obviously it would be much slower, but the idea is to run at full speed if no subroutine is specified.

Then you could replace load [mem] with a subroutine that does whatever you want.

cbloom said...

Yeah, I guess I should get my Linux port done, then I can use the fancy tools there.

Jeff Roberts said...

You don't need a special chip for allocating memory per-thread - the OS could already do it (it just needs to manage CR3 registers per thread instead of per process).

alex/bluespoon said...

on the bad ram front - @antirez (author of redis, in memory k-v store) got so frustrated with really-hard-to-repro bugs turning out to be bad-ram, that he built a mem-test style checker into the main redis executable, such that he could say to people 'please run redis --test-memory on your bug-repro-ing server, with no new installation of software / setup necessary'.
so frustrating.
related: bug in PC game from years ago where a feature triggered by a keypress inexplicably didn't work; turned out to be a broken keyboard...
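The core of such a built-in tester can be sketched in a few lines: write a walking-ones pattern and its inverse over a buffer and read both back. (The function names here are mine; a real tester like Memtest86 or `redis --test-memory` uses many more patterns plus address-line tricks.)

```cpp
// Minimal memory self-test: verify a region can hold a pattern and its
// complement. On healthy RAM this always passes; a stuck or flipped
// bit makes some pass fail.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true iff the region faithfully held both pattern and ~pattern.
static bool hold_pattern(uint64_t *mem, size_t words, uint64_t pattern)
{
    for (size_t i = 0; i < words; ++i) mem[i] = pattern;
    for (size_t i = 0; i < words; ++i)
        if (mem[i] != pattern) return false;     // stuck or flipped bit

    for (size_t i = 0; i < words; ++i) mem[i] = ~pattern;
    for (size_t i = 0; i < words; ++i)
        if (mem[i] != ~pattern) return false;
    return true;
}

static bool test_memory(uint64_t *mem, size_t words)
{
    // Walking ones: each pass stresses a different bit position.
    for (int bit = 0; bit < 64; ++bit)
        if (!hold_pattern(mem, words, 1ull << bit))
            return false;
    return true;
}
```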

rossjudson said...

Experience tells us that it's not the RAM, it's not the compiler...that those are the last places to look. Except when it isn't. I am going to flip my BIOS over to slow-boot, RAM-test mode. It's probably worth a few seconds at boot-time to save myself from suicide-inducing intermittent failures like you experienced!

ryg said...

Somewhat related, at current RAM sizes and usual RAM failure rates, you really want ECC memory for absolutely everything.

Even with "just" 4GB of memory and without bad RAM, the likelihood of getting a bit error within a day is above 66%. (See here and here for some updated DRAM bit error rates).

cbloom said...

Yup, I was just looking into ECC.

If you go to Newegg or something they don't even carry ECC RAM

(specifically : when I filter for DDR3 1333 SODIMM every single choice is non-ECC)

but you can go to one of the RAM makers' web sites and buy it directly from them. It's only about twice the price, so totally worth it.

(eg. I found it at Corsair.com)

Also I just read that unless all the RAM in the machine is ECC, it may fall back to non-ECC mode.

Aaron said...

Make sure your motherboard supports it too. Many don't.

Newegg's search/filter just sucks. Try looking for "ddr3 ecc 1333", they do sell some.

erno-iki said...

This is an Intel tool that lets you instrument unmodified binaries:

http://www.pintool.org/

Re motherboards. The memory controllers are integrated into CPUs now (current desktop Intel/AMD at least). I guess some motherboards might still be missing BIOS code to initialize them.

Re memory error incidence. The average error rate per DIMM is not that good a metric for predicting your exposure to memory errors ("two-thirds of their machines (67.8%) detected absolutely no memory errors in two and a half years", from http://lambda-diode.com/opinion/ecc-memory-2, paraphrasing http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf).
