7/15/2020

Two News Items

1. Mea Culpa.

We shipped Oodle Texture with a silly performance bug that made it slower than it should have been.

The good news is the next version will be much faster on very large images, with no algorithmic changes (same results and quality). The bad news is we have lots of people testing it and seeing slower speeds than we expected.

2. Fastmail tua culpa.

Some of my sent emails have not been reaching their destination. If you sent me a question and did not get a response, I may have responded and it just got lost. Please contact me again!

Details for each :

1. Mea Culpa.

We shipped Oodle Texture with a silly performance bug that made it slower than it should have been.

The good news is the next version will be much faster on very large images, with no algorithmic changes (same results and quality). The bad news is we have lots of people testing it and seeing slower speeds than we expected.

It was sort of a story of being too "mature" again.

In our image analysis process, we do a lowpass filter with a Gaussian. In coding that up, I was experimenting with lots of different ideas, so I just did a first quick dumb implementation as a temp thing to get the results and see how it worked. I always intended to come back and rewrite it in the optimization phase if it worked out. (90% of the stuff I try in the experimentation phase just gets deleted, so I try to avoid spending too much time on early implementation until we work out what method is the one we want to ship).

So we tried various things and eventually settled on a process, then came back to optimize what we'd settled on. I immediately thought, oh well this Gaussian filter I did was a really dumb implementation, and obviously we know there are various ways to do fast implementations of that, so that's an obvious place to look for speed.

But rather than just dive in and optimize it, I decided to be "mature". The mature programmer doesn't just optimize code that is obviously a bad implementation. Instead they profile, and measure how much time it is actually taking. That way you can prioritize your efforts to spend your programming time where it has the biggest impact. Programmer time is zero-sum; if you spend time on X it takes away time from Y, so you can't just say "yes, of course we should do X", you have to say "X is more important than Y". If I'm optimizing the Gaussian I'm not doing something else important.

So I profiled it, and it was ~1% of total CPU time. So I thought hrmm, well that's surprising, but I guess it's not important to total CPU time, so I won't optimize it.

I was wrong. The problem was I tested on an image that was too small. There's a huge cliff in performance that happens when the image doesn't fit in cache.

(for people who are aware of the performance issues in image filtering, this is obvious. The main issue for CPU image filtering is the cache usage pattern; there are various ways to fix that, tiles and strips and different local access patterns; that's well known)

Images up to 1024*1024 easily fit in cache (even in 4-float format at 16 bytes per pel, that's 16 MB). Up to 2k x 2k can almost fit in the big 64 MB L3 that is increasingly common.

At 8k x 8k, a 4-float image is 1 GB (it's unintuitive how fast the memory footprint grows as resolution goes up: doubling the dimensions quadruples the size). At that size you get a huge performance penalty from naive filtering implementations, which are constantly cache missing.
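
To make the access pattern concrete, here's a rough sketch (not the actual Oodle code; one float per pel instead of four, borders ignored) of the kind of thing that bites you: a separable filter whose vertical pass walks columns, versus the same pass with the inner loop streaming along rows.

    #include <cstddef>
    #include <vector>

    // Illustrative only, not the Oodle implementation.
    struct ImageF {
        int w, h;
        std::vector<float> pix;
        float& at(int x, int y) { return pix[(size_t)y * w + x]; }
    };

    // Vertical pass, column-major: consecutive inner-loop reads are w floats
    // apart.  While the whole image fits in cache this is harmless; once it
    // doesn't, the strided accesses tend to make the pass memory-bound.
    void vpass_columns(ImageF& dst, ImageF& src, const float* taps, int r)
    {
        for (int x = 0; x < src.w; x++)
            for (int y = r; y < src.h - r; y++) {
                float acc = 0;
                for (int t = -r; t <= r; t++)
                    acc += taps[t + r] * src.at(x, y + t);
                dst.at(x, y) = acc;
            }
    }

    // Same filter with x innermost: each output row reads 2r+1 source rows
    // sequentially, so the hot working set is a handful of rows rather than
    // the whole image.
    void vpass_rows(ImageF& dst, ImageF& src, const float* taps, int r)
    {
        for (int y = r; y < src.h - r; y++)
            for (int x = 0; x < src.w; x++) {
                float acc = 0;
                for (int t = -r; t <= r; t++)
                    acc += taps[t + r] * src.at(x, y + t);
                dst.at(x, y) = acc;
            }
    }

Tiles and strips go further by keeping the intermediate output of the horizontal pass in cache while the vertical pass consumes it, but even the loop-order change above illustrates why the same code can be fine at 1k x 1k and terrible at 8k x 8k.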

Foolishly, I did my one profile of this code section on a 1k x 1k image, so it looked totally fine.

The solution is simple and we'll have it out soon. (in typical Charles & Fabian obsessive perfectionism style, we can't just fix it "good enough", we have to fix it the best way possible, so we'll wind up with the super over-engineered world's best implementation) I just feel a bit embarrassed about it because doing good profiling and making smart implementation decisions is our specialty and I totally F'ed it.

I think it is an instructive case of some general principles :

1A. Profiling is hard, and a little bit of profiling is worse than none.

In this case there's a huge performance cliff when you go from working sets that fit in cache to ones that don't. That depends on cache size and machine; it can also depend on how much other CPU work is happening that's competing for cache. It depends on machine architecture too; for example, we've seen many compressors perform horribly on ARM big.LITTLE systems, where latency to main memory can be much bigger than is typical on x86/64 desktops, because their authors did not profile on that type of machine.

Profiling is a special case of the more general "measurement fallacy". People have this very misplaced faith in a measured number. That can be extremely misleading, and in fact a bad measurement can be worse than not measuring at all. For example, medical trials without valid controls or with insufficiently large samples can lead to very harmful public policy decisions if their results are not ignored.

You can be making a completely garbage point, but if you show that the number was 17.20 and put up a chart with some data points, all of a sudden people start thinking "this is rigorous". To trust any measurement you have to dig into how it was done: does it actually measure what you want to know? Were noise and biasing factors controlled and measured? You have to pose the right question, measure the right thing in the right way, sample the right group, do statistical analysis of error and bias, etc.; without that it's fucking pseudoscience garbage.

I see far too many people who know about this measurement problem, but then ignore it. For example, pretty much everyone knows that GDP is a terrible measure of the overall economic health of a country, and yet they still talk about GDP all the time. Maybe they'll toss in a little aside ("GDP isn't really what we should talk about, but...") and then after the "but" they proceed to do a whole article looking at GDP growth. This is the trap! When you have a bad measurement, you MUST ignore it and not even think about it at all. (see also: graduation rates, diet, cost of social programs, etc. etc.)

You see this all the time with profiling where people measure some micro-benchmark of a hash table, or a mutex lock, and find the "fastest" implementation. These things are massively context dependent and measuring them accurately in a synthetic benchmark is nearly impossible (it would require very complex simulation of different input types, memory layouts and working set sizes, different numbers of threads in different thread usage patterns).

The problem with a bad measurement is it gives a number which then people can brandish as if it's unimpeachable (X was 4 cycles and Y was 5 cycles, therefore we must use X even though it's complicated and fragile and harder to use, and in fact after all the surrounding implementation it winds up being much worse). It far too often makes people believe that the result they saw in one measurement is universally true, when in fact all you observed is that *if* measured in that particular way in that particular situation, this is what you saw that one time. (reminds me of the old "black sheep" joke about the engineer, physicist and the mathematician).

There are lots of common mistakes in profiling that we see all the time, unfortunately, as people try Oodle and feel the need to measure performance for themselves. It's not that easy to just "measure performance". We try to be very careful about using data sets that are realistic samples of expected data, we remove fluctuations due to thermal throttling or single-core boosts, we run multiple times to check repeatability of results, etc. This is literally our job and we spend a lot of time thinking about it, and sometimes we still get it wrong, and yet every single day we get people going "oh I just cooked up this benchmark in two seconds and I'm getting weird results". See also : Tips for benchmarking a compressor and The Perils of Holistic Profiling .
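
As one small illustration of the kind of care involved (a generic sketch, not our actual harness): warm up first, run many repetitions, and look at the median and the spread rather than trusting a single number.

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Run fn many times and report median and spread; a single timing is
    // nearly meaningless with boost clocks, thermal throttling, and noise.
    template <typename Fn>
    void report_timing(const char* name, Fn&& fn, int reps = 20)
    {
        using clock = std::chrono::steady_clock;
        fn(); fn();  // warm-up: page in code and data, let clocks settle
        std::vector<double> ms;
        for (int i = 0; i < reps; i++) {
            auto t0 = clock::now();
            fn();
            auto t1 = clock::now();
            ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
        }
        std::sort(ms.begin(), ms.end());
        std::printf("%s: median %.3f ms (min %.3f, max %.3f over %d runs)\n",
                    name, ms[ms.size() / 2], ms.front(), ms.back(), reps);
    }

Even this much says nothing about whether the input data is representative or what else is running on the machine; it just keeps one obvious source of garbage out of the number.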

In the modern world you have to consider profiling with N other threads running that you don't control, you can't assume that you get the whole machine to yourself. For example a very common huge mistake that I see is unnecessary thread switches; let's just hand off to this other thread very briefly then come back to our first thread to continue the work. That may be totally fine when you test it on a machine that is otherwise idle, but if you're competing for CPU time on a machine that has a ton of other threads running, that "little thread switch" to pop over to a different async task might take seconds. Over-threading tends to improve benchmarks when run on machines in isolation but hurt performance in the real world.
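
A sketch of that anti-pattern (illustrative only; std::async stands in for whatever task system is involved): the hop looks free on an idle test machine, but the wait is a scheduling dependency on a machine you don't control.

    #include <future>

    int tiny_task(int x) { return x * 2 + 1; }   // cheap work, not worth a hop

    int do_it_with_a_hop(int x)
    {
        // blocks until some other thread gets scheduled to run tiny_task;
        // on a loaded machine that can take far longer than the work itself
        auto f = std::async(std::launch::async, tiny_task, x);
        return f.get();
    }

    int do_it_inline(int x)
    {
        return tiny_task(x);                     // same result, no dependency on the scheduler
    }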

(See also *2 at end)

1B. Optimization is good for its own sake.

The whole idea that you should "avoid premature optimization" has many flaws and should be one of those lessons you unlearn. Yes yes, of course don't go off and spend a month writing an assembly version of a loop without being sure it's an important thing to do, and that you've got the right overall algorithmic structure and memory access pattern and so on. I'm not advocating just being dumb.

But also, don't use a really slow LogPrintf() implementation just because it doesn't show up in profiles.

When you have bad/slow code, it changes the way you use it. You wind up avoiding that function in high performance areas. It makes you code around the performance bug rather than just writing things the way you should.

I've worked at a number of companies where they disable asserts in debug builds because they've gotten too slow. I of course try turning on asserts, and a thousand of them fire because nobody else is testing with asserts on. The solution should have been to fix the speed of the debug build to something usable, not to avoid important programming tools.
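
One piece of that is making sure the assert mechanism itself stays cheap; here's a generic sketch (hypothetical macro, not any particular codebase's) where the common path is a single predictable branch and the failure path lives out of line. In practice the bigger cost is usually the conditions you put inside the asserts, which is where the real work of speeding up a debug build goes.

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical assert macro: one branch in the common case, the noisy
    // failure path kept out of line.
    [[noreturn]] inline void my_assert_fail(const char* expr, const char* file, int line)
    {
        std::fprintf(stderr, "ASSERT failed: %s (%s:%d)\n", expr, file, line);
        std::abort();
    }

    #define MY_ASSERT(expr) \
        do { if (!(expr)) my_assert_fail(#expr, __FILE__, __LINE__); } while (0)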

Sometimes when you do a good implementation of something (even when it wasn't necessary for any particular profile of total app performance), it becomes a really useful component that you then wind up using all over. Like maybe you do a really cool memcpy that can do interleaves and shuffles; it winds up being a really useful tool that you can build things with, one you wouldn't have thought of until you did the good implementation of it.
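
A hypothetical example of that kind of building block (illustrative, not a real Oodle routine): a copy that interleaves two planar streams into one. A tuned version would use SIMD shuffles; the point is just that once a fast one exists, it gets reached for everywhere.

    #include <cstddef>
    #include <cstdint>

    // Interleave two planar byte streams: dst gets A0,B0,A1,B1,...
    void copy_interleave2(uint8_t* dst, const uint8_t* srcA, const uint8_t* srcB, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            dst[2 * i + 0] = srcA[i];
            dst[2 * i + 1] = srcB[i];
        }
    }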

It's also just fun and fun is good.

1C. Trust what is obviously true.

When the truth is staring you in the face, but some measurement, or some complex second thoughts contradict it, you need to stop and reevaluate. The obvious truth is probably right and your overthinking or being too "mature" with measuring things may be misleading you.

In this case the fact that a naive filter implementation was a temp place-holder and needed to be optimized was obviously true, and some over-thinking clouded that.

2. Fastmail tua culpa.

Some of my sent emails have not been reaching their destination. If you sent me a question and did not get a response, I may have responded and it just got lost. Please contact me again!

What was happening was that fastmail (*1) was generating emails that failed the SPF check. This caused my sent emails to be silently rejected by some receivers, with no "undelivered" response at all, so I didn't know it was happening.

The SPF record is supposed to verify that an email came from the sending mail host it claims to have come from (but not the sending address). Emails coming from the fastmail mail host mark themselves as being from fastmail; the receiver can then look up the SPF record at fastmail.com and see the IPs it should have come from, to verify it actually came from there. This prevents spammers from claiming to send mail from fastmail servers while actually using a different server, which makes it possible for receivers to keep white and black lists of hosts. (SPF records do *not* verify the "from" field of the email)

I had my fastmail email set up to forward to an alias account (also inside fastmail). When I then replied to these (via SMTP through smtp.fastmail.com), it was going out identified as :

    helo=wforward1-smtp.messagingengine.com;
    client-ip=64.147.123.30
then receivers would check the SPF record for fastmail and get :

    v=spf1 include:spf.messagingengine.com ?all

which resolves to this list of allowed IPs :

    64.147.123.17
    64.147.123.18
    64.147.123.19
    64.147.123.20
    64.147.123.21
    64.147.123.24
    64.147.123.25
    64.147.123.26
    64.147.123.27
    64.147.123.28
    64.147.123.29
which does not include the .30 IP, therefore my email was marked as an SPF failure.
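
In other words the receiver's decision comes down to a set-membership test on the connecting IP. A minimal sketch of just that final comparison (the list stands in for the expanded SPF record above; real SPF evaluation also handles includes, CIDR ranges, and the ~all / ?all / -all qualifiers):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Is the connecting IP one of the IPs the SPF record allows?
    bool spf_ip_allowed(const std::string& client_ip,
                        const std::vector<std::string>& allowed_ips)
    {
        return std::find(allowed_ips.begin(), allowed_ips.end(), client_ip)
               != allowed_ips.end();
    }
    // With client_ip = "64.147.123.30" and the list above (which stops at .29),
    // this returns false, and the receiver treats the mail as failing SPF.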

Fastmail tech support was useless and unhelpful about figuring this out. It also sucks that I get no notification of the undelivered mail.

Some things that were useful :

NIST Email Authentication Tester
dmarcanalyzer SPF checker

*1: I switched to fastmail from dreamhost because dreamhost was failing to deliver my sent email. Deja vu. Why is it so fucking hard to deliver a god damn email !? (in dreamhost's case it's because they intentionally provide smtp service to lots of spammers, so the dreamhost smtp servers get into lots of blacklists)

*2: Another common problem with profiling and benchmarking I've been thinking about recently is the drawback of large tests, which you then average or sum.

People now often have access to large amounts of data to test on. That may or may not be great. It depends on whether that data is an unbiased random sampling of real world data, one that reflects the data whose performance you actually care about in your final application.

The problem is that you often don't know exactly what data your code will be run on, and the data you have is just "some stuff"; you don't really know whether it reflects the distribution of data that will be observed later. (this is a bit like the machine learning problem of making sure your training set is a good reflection of what will be seen in production use)

Again like the "measurement fallacy" the "big data" test can lead to a false sense of getting an accurate number. If you test on 4 TB of sample data that does not mean the numbers that come out are more useful than a test on 1 MB of sample data.

Large data averages and totals can swamp interesting cases with lots of other cases. There might be some extreme outliers in there where your performance is very bad, but they get washed away in the total. That would be fine if the test set were in fact a good representation of what you will see in real use, but if it's not, you could be doing very badly on the cases that actually matter.

The worst case is for a library provider like us: we don't know what data types are important to the client. That one weird case where we do badly might be 90% of the client's data.

Any time you're working with test sets where you take averages and totals, you have to be aware of how you're pooling: weighted by size (adding times is effectively weighting by size)? by file? are categories equally weighted? If your test set is 20% text and 40% executables, that is assigning an effective importance weight to those data types.

In data compression we also have the issue of incompressible files, such as already-compressed files, which are not something you should ever be running through your compressor. People who run "lots of data" by just iterating over every file on their personal disk, thinking they're getting a good measurement, are actually biasing the inputs toward weird things that should never be fed to the compressor.

Because of these considerations and more, I have been increasingly using the method of "minimizing the maximum" of bad performance, or boosting the worst case.

Rather than using a big test set to take an average performance, I use a big test set to find the one file with the worst performance, and then do all I can to optimize that bad case. Measure again, find the new worst case, attack that one.
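
A minimal sketch of that loop (generic, not our actual tooling; the scoring function is whatever "badness" you care about, e.g. time per byte, or compressed-size gap to a reference codec):

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    // Find the single worst-scoring file in the test set; optimize for it,
    // then re-run to find the new worst case.
    const std::string* find_worst(const std::vector<std::string>& files,
                                  const std::function<double(const std::string&)>& badness)
    {
        const std::string* worst = nullptr;
        double worst_score = 0.0;
        for (const std::string& f : files) {
            double s = badness(f);
            if (!worst || s > worst_score) { worst_score = s; worst = &f; }
        }
        if (worst)
            std::printf("worst case: %s (score %f)\n", worst->c_str(), worst_score);
        return worst;
    }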

This has many advantages. It prevents clients from ever seeing a really bad case. That one worst case might actually be the type of data they really care about. It also tends to find interesting corner cases and reveals flaws you don't see on average cases (like oh this one weird file runs most of the loop iterations in the tail/safe loop), that lets you find and fix those cases. It's sort of a case of "you learn from your mistakes" by really digging into these examples of bad performance.

Another nice thing about the "fix the worst" method is that it's strictly additive for bigger test sets. You can always toss more into your test set, and you just get more chances to find a worst case. You don't have to worry about how the data is distributed and whether that reflects real world distributions. For example, say someone gives you a terabyte of images that are all grayscale. You don't have to worry that this is going to bias your image test set towards a weird over-weighting of grayscale.

This approach was used on both Oodle Leviathan and Oodle Texture. One of the guiding principles of Leviathan was that we not only be good on average, but that we minimize the gap to the best compressor on every type of data. (we can't be the best possible compressor on every type of data, since specialized compressors can excel in some cases, but we wanted to minimize the worst difference) That led to lots of good discoveries in Leviathan that also helped the average case, and we used a similar principle in Oodle Texture. I think of it as a bit like the machine learning technique AdaBoost: you take your worst cases and train on them more to get better at them, keep repeating that, and you wind up with a good classifier overall.

2 comments:

Jörn Engel said...

> I've worked at a number of companies where they disable asserts in debug builds because they've gotten too slow. I of course try turning on asserts, and a thousand of them fire because nobody else is testing with asserts on. The solution should have been to fix the speed of the debug build to something usable, not to avoid important programming tools.

Not sure if I disagree with you on this one or agree 100%. I no longer want to run debug builds at all.

The fundamental problem with debug builds is that you have some extra checks that are so slow that you wouldn't dare run them in production. So now you have a release build and a debug build. Your testing is split between the two or even heavily weighted to the build you don't run in production. As a result, you test the wrong thing.

Then there are heisenbugs. You see a problem in production, it may even be a common problem, and you can never reproduce it with debug builds. Or those rare bugs that take 1000 machine-years to reproduce. You will never hit 1000 machine-years with debug builds, but depending on your scale you can hit those daily in production.


My answer is to optimize the debug code, then ship it in production. The easy answers like "more logging" and "assert everything" are out. You will quickly need more logging than the system can take. And while the assertions might be cheap, the conditions inside the assertions are often expensive. So you have to think a bit, coming up with debug tools that are both cheap and effective isn't easy. But it is absolutely worth it.

cbloom said...

I'm a big fan of asserts in debug builds, but I agree on your general sentiment that developers should be running the same build that you ship.

One thing I think we can agree on is that having a separate "developer" and "ship" release build is a HUGE mistake and is far far too common in the industry. Whatever benefit you think you get from it is massively outweighed by the potential for bugs and other anomalies that only happen in the shipping build.

The best software is stuff that the developers actually use for their own computing on a daily basis.

(eg. I would hope that cloud/net developers run on real servers on the other side of the world during dev, not just doing local test servers).
