cbloom rants: 07/2020

7/27/2020

Performance of various compressors on Oodle Texture RDO data

Oodle Texture RDO can be used with any lossless back-end compressor. RDO does not itself make data smaller, it makes the data more compressible for the following lossless compressor, which you use for package compression. For example it works great with the hardware compressors in the PS5 and the Xbox Series X.

I thought I'd have a look at how various options for the back end lossless compressor do on BCN texture data after Oodle Texture RDO. (Oodle 2.8.9)

127,822,976 bytes of BC1-7 sample data from a game. BC1,3,4,5, and 7. Mix of diffuse, normals, etc. The compressors here are run on the data cut into 256 KB chunks to simulate more typical game usage.
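The chunked measurement can be sketched roughly as below. This is only an illustration of the setup, not the actual test harness: `CompressedSizeFn` and `chunked_ratio` are hypothetical names, and the callback stands in for whichever back-end compressor is being measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback: returns the compressed size in bytes of one chunk.
// Plug in whatever back-end compressor you are evaluating.
using CompressedSizeFn =
    std::function<std::size_t(const std::uint8_t* data, std::size_t size)>;

// Compress `data` as independent chunks and return rawBytes / compressedBytes.
double chunked_ratio(const std::vector<std::uint8_t>& data,
                     const CompressedSizeFn& compressed_size,
                     std::size_t chunk_size = 256 * 1024)
{
    assert(chunk_size > 0);
    std::size_t total_comp = 0;
    for (std::size_t pos = 0; pos < data.size(); pos += chunk_size)
    {
        // the last chunk may be shorter than chunk_size
        const std::size_t len = std::min(chunk_size, data.size() - pos);
        total_comp += compressed_size(data.data() + pos, len);
    }
    return total_comp ? double(data.size()) / double(total_comp) : 0.0;
}
```

Because each chunk is compressed on its own, no match can reach back further than 256 KB, which is why chunking matters for compressors that like long-distance matches.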

"baseline" is the non-RDO encoding to BCN by Oodle Texture. "rdo lambda 40" is a medium quality RDO run; at that level visual degradation is just starting to become easier to spot (lambda 30 and below is high quality).

baseline:

by ratio:
ooLeviathan8    :  1.79:1 ,    1.4 enc MB/s , 1069.7 dec MB/s
lzma_def9       :  1.79:1 ,    8.7 enc MB/s ,   34.4 dec MB/s
ooKraken8       :  1.76:1 ,    2.2 enc MB/s , 1743.5 dec MB/s
ooMermaid8      :  1.71:1 ,    4.9 enc MB/s , 3268.7 dec MB/s
zstd22          :  1.70:1 ,    4.5 enc MB/s ,  648.7 dec MB/s
zlib9           :  1.64:1 ,   15.1 enc MB/s ,  316.3 dec MB/s
lz4hc1          :  1.55:1 ,   72.9 enc MB/s , 4657.8 dec MB/s
ooSelkie8       :  1.53:1 ,    7.4 enc MB/s , 7028.2 dec MB/s

rdo lambda=40:

by ratio:
lzma_def9       :  3.19:1 ,    7.7 enc MB/s ,   60.7 dec MB/s
ooLeviathan8    :  3.18:1 ,    1.1 enc MB/s , 1139.3 dec MB/s
ooKraken8       :  3.13:1 ,    1.7 enc MB/s , 1902.9 dec MB/s
ooMermaid8      :  3.01:1 ,    4.2 enc MB/s , 3050.6 dec MB/s
zstd22          :  2.88:1 ,    3.3 enc MB/s ,  733.9 dec MB/s
zlib9           :  2.69:1 ,   16.5 enc MB/s ,  415.3 dec MB/s
ooSelkie8       :  2.41:1 ,    6.6 enc MB/s , 6010.1 dec MB/s
lz4hc1          :  2.41:1 ,  106.6 enc MB/s , 4244.5 dec MB/s

If you compare the log-log charts before & after RDO, it's easy to see that the relative position of all the compressors is basically unchanged, they just all get more compression.

The output size from baseline divided by the output size from post-RDO is the compression improvement factor. For each compressor it is :

ooLeviathan8    : 1.7765
ooKraken8       : 1.7784
ooMermaid8      : 1.7602
ooSelkie8       : 1.5548

lzma_def9       : 1.7821
zstd22          : 1.6941
zlib9           : 1.6402
lz4hc1          : 1.5751

Leviathan, Kraken, Mermaid and LZMA all improve around 1.77 X ; ZStd and Zlib a little bit less (1.64-1.69X), LZ4 and Selkie by less (1.55X - 1.58X). Basically the stronger compressors (on this type of data) get more help from RDO and their advantage grows. ZStd is stronger than Mermaid on many types of data, but Mermaid is particularly good on BCN.
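For reference, the improvement factor is just a division, and since the raw BCN size is the same before and after RDO it can be computed from either the output sizes or the ratios. A minimal sketch (function names are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Improvement factor from output sizes: baseline bytes / post-RDO bytes.
double improvement_from_sizes(std::uint64_t baseline_bytes, std::uint64_t rdo_bytes)
{
    assert(rdo_bytes > 0);
    return double(baseline_bytes) / double(rdo_bytes);
}

// Same quantity from compression ratios (raw:compressed) on the same raw input:
// (raw/rdo) / (raw/baseline) == baseline/rdo.
double improvement_from_ratios(double baseline_ratio, double rdo_ratio)
{
    assert(baseline_ratio > 0.0);
    return rdo_ratio / baseline_ratio;
}
```

With the rounded table values for Leviathan, `improvement_from_ratios(1.79, 3.18)` comes out at about 1.78, in line with the 1.7765 listed above.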

* : Caveat on ZStd & LZ4 speed here : this is a run of all compressors built with MSVC 2017 on my AMD reference machine. ZStd & LZ4 have very poor speed in their MSVC build, they do much better in a clang build. Their clang build can be around 1.5X faster; ZStd-clang is usually slightly faster to decode than Leviathan, not slower. LZ4-clang is probably similar in decode speed to Selkie. The speed numbers for ZStd & LZ4 here should not be taken literally.

It is common that the more powerful compressors speed up (decompression) slightly on RDO data because they speed up with higher compression ratios, while the weaker compressors (LZ4 and Selkie) slow down slightly on RDO data (because they are often in the incompressible path on baseline BCN, which is a fast path).

Looking at the log-log plots some things stand out to me as different than generic data :

Leviathan, Kraken & Mermaid have a smaller gap than usual. Their compression ratio on this data is quite similar, usually there's a bigger step, but here the line connecting them in log-log space is more horizontal. This makes Mermaid more attractive because you're not losing much compression ratio for the speed gains. (for example, Mermaid + BC7Prep is much better for space & speed than Kraken alone).

ZStd is relatively poor on this type of data. Usually it has more compression than Mermaid and is closer to Kraken, here it's lagging quite far behind, and Mermaid is significantly better.

Selkie is relatively poor on this type of data. Usually Selkie beats LZ4 for compression ratio (sometimes it even beats zlib), but here it's just slightly worse than LZ4. Part of that is the 256 KB chunking is not allowing Selkie to do long-distance matches, but that's not the main issue. Mermaid looks like a much better choice than Selkie here.

Another BCN data set :

358,883,720 bytes of BCN data. Mostly BC7 with a bit of BC6. Mix of diffuse, normals, etc. The compressors here are run on the data cut into 256 KB chunks to simulate more typical game usage.

baseline :

by ratio:
ooLeviathan8    :  1.89:1 ,    1.1 enc MB/s ,  937.0 dec MB/s
lzma_def9       :  1.88:1 ,    7.6 enc MB/s ,   35.9 dec MB/s
ooKraken8       :  1.85:1 ,    1.7 enc MB/s , 1567.5 dec MB/s
ooMermaid8      :  1.77:1 ,    4.3 enc MB/s , 3295.8 dec MB/s
zstd22          :  1.76:1 ,    3.9 enc MB/s ,  645.6 dec MB/s
zlib9           :  1.69:1 ,   11.1 enc MB/s ,  312.2 dec MB/s
lz4hc1          :  1.60:1 ,   73.3 enc MB/s , 4659.9 dec MB/s
ooSelkie8       :  1.60:1 ,    7.0 enc MB/s , 8084.8 dec MB/s

rdo lambda=40 :

by ratio:
lzma_def9       :  4.06:1 ,    7.2 enc MB/s ,   75.2 dec MB/s
ooLeviathan8    :  4.05:1 ,    0.8 enc MB/s , 1167.3 dec MB/s
ooKraken8       :  3.99:1 ,    1.3 enc MB/s , 1919.3 dec MB/s
ooMermaid8      :  3.69:1 ,    3.9 enc MB/s , 2917.8 dec MB/s
zstd22          :  3.65:1 ,    2.9 enc MB/s ,  760.0 dec MB/s
zlib9           :  3.36:1 ,   19.1 enc MB/s ,  438.9 dec MB/s
ooSelkie8       :  2.93:1 ,    6.2 enc MB/s , 4987.6 dec MB/s
lz4hc1          :  2.80:1 ,  114.8 enc MB/s , 4529.0 dec MB/s

On this data set, Mermaid lags behind the stronger compressors more, and it's almost equal to ZStd. On BCN data, the strong compressors (LZMA, Leviathan, & Kraken) have less difference in compression ratio than they do on some other types of data. On this data set, Selkie pulls ahead of LZ4 after RDO, as the increased compressibility of post-RDO data helps it find some gains. Zlib, LZ4, and Selkie have almost identical compression ratios on the baseline pre-RDO data but zlib pulls ahead post-RDO.

The improvement factors are :

ooLeviathan8   :    2.154
ooKraken8      :    2.157
ooMermaid8     :    2.085
ooSelkie8      :    1.831

lzma_def9      :    2.148
zstd22         :    2.074
zlib9          :    1.988
lz4hc1         :    1.750

Similar pattern, around 2.15X for the stronger compressors, around 2.08X for the medium ones, and under 2.0 for the weaker ones.

Conclusion:

Oodle Texture works great with all the lossless LZ coders tested here. We expect it to work well with all packaging systems.

The compression improvement factor from Oodle Texture is similar and good for all the compressors, but stronger compressors like Oodle Kraken are able to get even more benefit from the entropy reduction of Oodle Texture. Not only do they start out with more compression on baseline non-RDO data, they also improve by a larger multiplier on RDO data.

The Oodle Data lossless compressors are particularly good on BCN data; their advantage over alternatives like zlib and ZStd is even larger here than it is on some other data types. For example Oodle Mermaid is often slightly lower compression than ZStd on other data types, but is slightly higher compression than ZStd on BCN.

Mermaid has a substantial compression advantage over zlib on post-RDO BCN data, and decompresses 5-10X faster, making Mermaid a huge win over software zlib (zip/deflate/inflate).

7/26/2020

Oodle 2.8.9 with Oodle Texture speed fix and UE4 integration

Oodle 2.8.9 is now shipping, with the aforementioned speed fix for large textures.

Oodle Texture RDO is always going to be slower than non-RDO encoding, it simply has to do a lot more work. It has to search many possible encodings of the block to BCN, and then it has to evaluate those possible encodings for both R & D, and it has to use more sophisticated D functions, and it has to search for possible good encodings in a non-convex search space. It simply has to be something like 5X slower than non-RDO encoding. But previously we just had a perf bug where the working set got larger than the cache size, which caused a performance cliff, and that shouldn't happen. If you do find any performance anomalies, such as encoding on a specific texture or with specific options causing much slower performance, please contact RAD.
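To make "evaluate for both R & D" concrete, here is the textbook form of a Lagrangian rate-distortion choice: among candidate encodings of a block, pick the one minimizing J = D + lambda * R. This is a generic sketch, not Oodle Texture's actual search; the `Candidate` struct, the function name, and the scaling of lambda are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical candidate encoding of one block:
// distortion = error vs the source pixels, rate = estimated compressed cost.
struct Candidate { double distortion; double rate; };

// Return the index of the candidate minimizing J = D + lambda * R.
// lambda = 0 degenerates to plain minimum-distortion (non-RDO) selection.
std::size_t pick_rdo_candidate(const std::vector<Candidate>& cands, double lambda)
{
    assert(!cands.empty());
    std::size_t best = 0;
    double best_j = cands[0].distortion + lambda * cands[0].rate;
    for (std::size_t i = 1; i < cands.size(); ++i)
    {
        const double j = cands[i].distortion + lambda * cands[i].rate;
        if (j < best_j) { best_j = j; best = i; }
    }
    return best;
}
```

Even this toy version shows where the extra time goes: non-RDO encoding only needs D for each candidate, while RDO also needs a rate estimate for each one, and has to consider candidates it would otherwise have rejected on distortion alone.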

timerun 287 vs 289

hero_xxx_n.png ; 4096 x 4096
timerun textest bcn bc7 r:\hero_xxx_n.png r:\out.dds -r40 --w32
got opt: rdo_lagrange_parameter=40

Oodle 2.8.7 :

encode time: ~ 8.9 s
per-pixel rmse (bc7): 0.8238
---------------------------------------------
timerun: 10.881 seconds

Oodle 2.8.9 :

encode time: 4.948s
per-pixel rmse (bc7): 0.8229
---------------------------------------------
timerun: 6.818 seconds

the "timerun" time includes all loading and saving and startup, which appears to be about 1.9s ; the RDO encode time has gone from about 8.9s to 4.95 s

(Oodle 2.8.7 textest bcn didn't log encode time so that's estimated; the default number of worker threads has changed, so use --w32 to make it equal for both runs)

We are now shipping a UE4 integration for Oodle Texture!

The Oodle Texture integration is currently only for Oodle Texture RDO/non-RDO BCN encoders (not BC7Prep). It should be pretty simple, once you integrate it your Editor will just do Oodle Texture encodes. The texture previews in the Editor let you see how the encodings look, and that's what you pack in the game. It uses the Unreal Derived Data Cache to avoid regenerating the encodings.

We expose our "lambda" parameter via the "LossyCompressionAmount" field which is already in the Editor GUI per texture. Our engine patches further make it so that LossyCompressionAmount inherits from LODGroup, and if not set there, it inherits from a global default. So you can set lambda at :

per texture LossyCompressionAmount

if Default then look at :

LODGroup LossyCompressionAmount

if Default then look at :

global lambda

We believe that best practice is to avoid having artists tweaking lambda a lot per-texture. We recommend leaving that at "Default" (inherit) as much as possible. The tech leads should set up the global lambda to what's right for your game, and possibly set up the LODGroups to override that for specific texture classes. Only rarely should you need to override on specific textures.
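The inheritance chain above amounts to a three-level fallback. A minimal sketch, with made-up struct names and a sentinel value standing in for the "Default" setting (this does not model the real UE4 field types):

```cpp
#include <cassert>

// Hypothetical stand-in for the Editor's "Default" (inherit) setting.
const int kInherit = -1;

// Hypothetical settings holders; only the one field matters here.
struct TextureSettings  { int lossy_compression_amount; };
struct LodGroupSettings { int lossy_compression_amount; };

// Resolve the lambda actually used for a texture:
// per-texture value, else LODGroup value, else the global default.
int resolve_lambda(const TextureSettings& tex,
                   const LodGroupSettings& group,
                   int global_lambda)
{
    assert(global_lambda >= 0);
    if (tex.lossy_compression_amount != kInherit)
        return tex.lossy_compression_amount;
    if (group.lossy_compression_amount != kInherit)
        return group.lossy_compression_amount;
    return global_lambda;
}
```

With most textures left at "Default", changing the global lambda or one LODGroup retunes whole classes of textures at once, which is the point of the recommendation.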

LIMITATIONS :

Currently our Oodle Texture for UE4 integration only works for non-console builds. (eg. Windows,Linux,Mac, host PC builds). It cannot export content for PS4/5/Xbox/Switch console builds. We will hopefully be working with Epic to fix this ASAP.

If you are a console dev, you can still try Oodle Texture for UE4, and it will work in your Editor and if you package a build for Windows, but if you do "package for PS4" it won't be used.

Sample package sizes for "InfiltratorDemo" :

InfiltratorDemo-WindowsNoEditor.pak 

No compression :                            2,536,094,378

No Oodle Data (Zlib), no Oodle Texture :    1,175,375,893

Yes Oodle Data,  no Oodle Texture :           969,205,688

No Oodle Data (Zlib), yes Oodle Texture :     948,127,728

Oodle Data + Oodle Texture lambda=40 :        759,825,164

Oodle Texture provides great size benefit even with the default Zlib compression in Unreal, but it works even better when combined with Oodle Data.

7/15/2020

Two News Items

1. Mea Culpa.

We shipped Oodle Texture with a silly performance bug that made it slower than it should have been.

The good news is the next version will be much faster on very large images, with no algorithmic changes (same results and quality). The bad news is we have lots of people testing it and seeing slower speeds than we expected.

2. Fastmail tua culpa.

Some of my sent emails have not been reaching their destination. If you sent me a question and did not get a response, I may have responded and it just got lost. Please contact me again!

Details for each :

1. Mea Culpa.

We shipped Oodle Texture with a silly performance bug that made it slower than it should have been.

It was sort of a story of being too "mature" again.

In our image analysis process, we do a lowpass filter with a Gaussian. In coding that up, I was experimenting with lots of different ideas, so I just did a first quick dumb implementation as a temp thing to get the results and see how it worked. I always intended to come back and rewrite it in the optimization phase if it worked out. (90% of the stuff I try in the experimentation phase just gets deleted, so I try to avoid spending too much time on early implementation until we work out what method is the one we want to ship).

So we tried various things and eventually settled on a process, and came back to optimize what we settled on. I immediately thought, oh well this Gaussian filter I did was a really dumb implementation and obviously we know there are various ways to do fast implementations of that, that's an obvious place to look at speed.

But rather than just dive in and optimize it, I decided to be "mature". The mature programmer doesn't just optimize code that is obviously a bad implementation. Instead they profile, and measure how much time it is actually taking. That way you can prioritize your efforts to spend your programming time where it has the biggest impact. Any programmer work is not zero-sum; if you spend time on X it takes away time from Y, so you can't just say yes of course we should do X, you have to say "X is more important than Y". If I'm optimizing the Gaussian I'm not doing something else important.

So I profiled it, and it was ~1% of total CPU Time. So I thought hrmm, well that's surprising, but I guess it's not important to total CPU time, so I won't optimize it.

I was wrong. The problem was I tested on an image that was too small. There's a huge cliff in performance that happens when the image doesn't fit in cache.

(for people who are aware of the performance issues in image filtering, this is obvious. The main issue for CPU image filtering is the cache usage pattern; there are various ways to fix that, tiles and strips and different local access patterns; that's well known)
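As a small illustration of what "access pattern" means here (this is not the Oodle Texture filter, just a toy 3-tap vertical pass with hypothetical function names): both functions below compute the same result, but the first walks down columns with a stride of `w` floats between reads, while the second streams along rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Vertical [1 2 1]/4 filter on a single-channel float image, clamped at edges.
// Column order: consecutive reads are `w` floats apart, so on a large image
// nearly every read touches a different cache line.
void vfilter_column_order(const std::vector<float>& src, std::vector<float>& dst,
                          std::size_t w, std::size_t h)
{
    assert(src.size() == w * h && dst.size() == w * h);
    for (std::size_t x = 0; x < w; ++x)
        for (std::size_t y = 0; y < h; ++y)
        {
            const std::size_t ym = (y > 0) ? y - 1 : 0;
            const std::size_t yp = (y + 1 < h) ? y + 1 : h - 1;
            dst[y * w + x] = 0.25f * src[ym * w + x] + 0.5f * src[y * w + x]
                           + 0.25f * src[yp * w + x];
        }
}

// Row order: same arithmetic, but the inner loop reads three rows contiguously,
// so each cache line that is fetched gets fully used.
void vfilter_row_order(const std::vector<float>& src, std::vector<float>& dst,
                       std::size_t w, std::size_t h)
{
    assert(src.size() == w * h && dst.size() == w * h);
    for (std::size_t y = 0; y < h; ++y)
    {
        const std::size_t ym = (y > 0) ? y - 1 : 0;
        const std::size_t yp = (y + 1 < h) ? y + 1 : h - 1;
        for (std::size_t x = 0; x < w; ++x)
            dst[y * w + x] = 0.25f * src[ym * w + x] + 0.5f * src[y * w + x]
                           + 0.25f * src[yp * w + x];
    }
}
```

On a small image the two run about the same because everything stays in cache, which is exactly how a profile on a small test image can hide the problem.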

Images up to 1024*1024 easily fit in cache (even in 4-float format at 16 bytes per pel, that's 16 MB). Up to 2k x 2k can almost fit in the big 64 MB L3 that is increasingly common.

At 8k x 8k , a 4-float image is 1 GB. (it's unintuitive how fast quadratic growth goes up!). At that size you get a huge performance penalty from naive filtering implementations, which are constantly cache missing.
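The size arithmetic is easy to check. A tiny sketch (function names are made up; the cache size is whatever your machine has):

```cpp
#include <cassert>
#include <cstdint>

static_assert(sizeof(float) == 4, "this sketch assumes 32-bit floats");

// Bytes for a w x h image with 4 float channels (16 bytes per pixel).
std::uint64_t float4_image_bytes(std::uint64_t w, std::uint64_t h)
{
    return w * h * 4u * sizeof(float);
}

// True if the whole image would fit in a cache of the given size.
// In practice other data competes for the cache, so "exactly fits" means "almost".
bool fits_in_cache(std::uint64_t image_bytes, std::uint64_t cache_bytes)
{
    assert(cache_bytes > 0);
    return image_bytes <= cache_bytes;
}
```

This gives 16 MB for 1024 x 1024, 64 MB for 2048 x 2048 and 1 GB for 8192 x 8192; each doubling of the side length multiplies the working set by four.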

Foolishly, I did my one profile of this code section on a 1k x 1k image, so it looked totally fine.

The solution is simple and we'll have it out soon. (in typical Charles & Fabian obsessive perfectionism style, we can't just fix it "good enough", we have to fix it the best way possible, so we'll wind up with the super over-engineered world's best implementation) I just feel a bit embarrassed about it because doing good profiling and making smart implementation decisions is our specialty and I totally F'ed it.

I think it is an instructive case of some general principles :

1A. Profiling is hard, and a little bit of profiling is worse than none.

In this case there's a huge performance cliff when you go from working sets that fit in cache to ones that don't. That depends on cache size and machine; it can also depend on how much other CPU work is happening that's competing for cache. It depends on machine architecture, for example we've seen many compressors perform horribly on ARM big-little systems where latency to main memory can be much bigger than is typical on x86/64 desktops, because their architects did not profile on that type of machine.

Profiling is a special case of the more general "measurement fallacy". People have this very misplaced faith in a measured number. That can be extremely misleading, and in fact bad measurement can be worse than not measuring at all. For example medical trials without valid controls or insufficiently large samples can lead to very harmful public policy decisions if their results are not ignored.

You can be making a completely garbage point, but if you start showing that it was 17.20 and here's a chart with some points, all of a sudden people start thinking "this is rigorous". To trust any measurement you have to dig into how it was done: does it actually measure what you want to know? Were noise and biasing factors controlled and measured? You have to pose the right question, measure the right thing in the right way, sample the right group, do statistical analysis of error and bias, etc. Without that it's fucking pseudoscience garbage.

I see far too many people who know about this measurement problem, but then ignore it. For example pretty much everyone knows that GDP is a terrible measure of overall economic health of a country, and yet they still talk about GDP all the time. Maybe they'll toss in a little aside about ("GDP isn't really what we should talk about, but...") and then after the "but" they proceed to do a whole article looking at GDP growth. This is the trap! When you have a bad measurement, you MUST ignore it and not even think about it at all. (see also: graduation rates, diet, cost of social programs, etc. etc.)

You see this all the time with profiling where people measure some micro-benchmark of a hash table, or a mutex lock, and find the "fastest" implementation. These things are massively context dependent and measuring them accurately in a synthetic benchmark is nearly impossible (it would require very complex simulation of different input types, memory layouts and working set sizes, different numbers of threads in different thread usage patterns).

The problem with a bad measurement is it gives a number which then people can brandish as if it's unimpeachable (X was 4 cycles and Y was 5 cycles, therefore we must use X even though it's complicated and fragile and harder to use, and in fact after all the surrounding implementation it winds up being much worse). It far too often makes people believe that the result they saw in one measurement is universally true, when in fact all you observed is that *if* measured in that particular way in that particular situation, this is what you saw that one time. (reminds me of the old "black sheep" joke about the engineer, physicist and the mathematician).

There are lots of common mistakes in profiling that we see all the time, unfortunately, as people try Oodle and feel the need to measure performance for themselves. It's not that easy to just "measure performance". We try to be very careful about using data sets that are realistic samples of expected data, we remove fluctuations due to thermal throttling or single-core boosts, we run multiple times to check repeatability of results, etc. This is literally our job and we spend a lot of time thinking about it, and sometimes we still get it wrong, and yet every single day we get people going "oh I just cooked up this benchmark in two seconds and I'm getting weird results". See also : Tips for benchmarking a compressor and The Perils of Holistic Profiling .

In the modern world you have to consider profiling with N other threads running that you don't control, you can't assume that you get the whole machine to yourself. For example a very common huge mistake that I see is unnecessary thread switches; let's just hand off to this other thread very briefly then come back to our first thread to continue the work. That may be totally fine when you test it on a machine that is otherwise idle, but if you're competing for CPU time on a machine that has a ton of other threads running, that "little thread switch" to pop over to a different async task might take seconds. Over-threading tends to improve benchmarks when run on machines in isolation but hurt performance in the real world.

(See also *2 at end)

1B. Optimization is good for its own sake.

The whole idea that you should "avoid premature optimization" has many flaws and should be one of the learnings that you forget. Yes yes, of course don't go off and spend a month writing an assembly version of a loop without being sure it's an important thing to do, and also that you've got the right overall algorithmic structure and memory access pattern and so on. I'm not advocating just being dumb.

But also, don't use a really slow LogPrintf() implementation just because it doesn't show up in profiles.

When you have bad/slow code, it changes the way you use it. You wind up avoiding that function in high performance areas. It makes you code around the performance bug rather than just writing things the way you should.

I've worked at a number of companies where they disable asserts in debug builds because they've gotten too slow. I of course try turning on asserts, and a thousand of them fire because nobody else is testing with asserts on. The solution should have been to fix the speed of the debug build to something usable, not to avoid important programming tools.

Sometimes when you do a good implementation of something (even when it wasn't necessary for any particular profile of total app performance), it becomes a really useful component that you then wind up using all over. Like maybe you do a really cool memcpy that can do interleaves and shuffles, that winds up being a really useful tool that you can build things with, that you wouldn't have thought about until you did the good implementation of it.

It's also just fun and fun is good.

1C. Trust what is obviously true.

When the truth is staring you in the face, but some measurement, or some complex second thoughts contradict it, you need to stop and reevaluate. The obvious truth is probably right and your overthinking or being too "mature" with measuring things may be misleading you.

In this case the fact that a naive filter implementation was a temp place-holder and needed to be optimized was obviously true, and some over-thinking clouded that.

2. Fastmail tua culpa.

Some of my sent emails have not been reaching their destination. If you sent me a question and did not get a response, I may have responded and it just got lost. Please contact me again!

What was happening was fastmail (*1) was generating emails that failed SPF check. This would cause my sent emails to be just rejected by some receivers, with no "undelivered" response at all, so I didn't know it was happening.

The SPF record is supposed to verify that an email came from the sending mail host that it claims to (but not the sending address). Emails coming from the fastmail mail host mark themselves as being from fastmail, then the receiver can look up the SPF record at fastmail.com and see the IP's that it should have come from to verify it actually came from there. This prevents spammers from claiming to be sending mail from fastmail servers but actually using a different server. This makes it possible for receivers to have white & black lists for hosts. (SPF records do *not* verify the "from" field of the email)

I had my fastmail email set up to forward to an alias account (also inside fastmail). When I then replied to these (via SMTP through smtp.fastmail.com), it was going out identified as :

    helo=wforward1-smtp.messagingengine.com;
    client-ip=64.147.123.30

then receivers would check the SPF record for fastmail and get :

v=spf1 include:spf.messagingengine.com ?all

64.147.123.17
64.147.123.18
64.147.123.19
64.147.123.20
64.147.123.21
64.147.123.24
64.147.123.25
64.147.123.26
64.147.123.27
64.147.123.28
64.147.123.29

which does not include the .30 IP , therefore my email was marked as an SPF failure.
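The failing check boils down to "is the client IP in the allowed set". A deliberately simplified sketch follows: real SPF evaluation involves DNS lookups, `include:` expansion, address ranges and qualifiers, none of which is modelled here, and the function names are hypothetical. The address list is the one shown above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// The sender addresses listed above (what the SPF lookup returned at the time).
std::vector<std::string> allowed_sender_ips()
{
    return { "64.147.123.17", "64.147.123.18", "64.147.123.19", "64.147.123.20",
             "64.147.123.21", "64.147.123.24", "64.147.123.25", "64.147.123.26",
             "64.147.123.27", "64.147.123.28", "64.147.123.29" };
}

// Exact-match membership test; a stand-in for the receiver's SPF evaluation.
bool spf_ip_allowed(const std::vector<std::string>& allowed, const std::string& client_ip)
{
    assert(!client_ip.empty());
    return std::find(allowed.begin(), allowed.end(), client_ip) != allowed.end();
}
```

Here `spf_ip_allowed(allowed_sender_ips(), "64.147.123.30")` is false, which is the situation my forwarded replies were in.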

Fastmail tech support was useless and unhelpful about figuring this out. It also sucks that I get no notification of the undelivered mail.

Some things that were useful :

NIST Email Authentication Tester
dmarcanalyzer SPF checker

*1: I switched to fastmail from dreamhost because dreamhost was failing to deliver my sent email. Deja vu. Why is it so fucking hard to deliver a god damn email !? (in dreamhost's case it's because they intentionally provide smtp service to lots of spammers, so the dreamhost smtp servers get into lots of blacklists)

*2: Another common problem with profiling and benchmarking I've been thinking about recently is the drawback of large tests, which you then average or sum.

People now often have access to large amounts of data to test on. That may or may not be great. It depends on whether that data is an unbiased random sampling of real world data that reflects what you care about the performance on in your final application.

The problem is that you often don't know exactly what data you will be used on, and the data you have is just "some stuff" that you don't really know if it reflects the distribution of data that will be observed later. (this is a bit like the machine learning issue of having a training set that is a good reflection of what will be seen in production use).

Again like the "measurement fallacy" the "big data" test can lead to a false sense of getting an accurate number. If you test on 4 TB of sample data that does not mean the numbers that come out are more useful than a test on 1 MB of sample data.

Large data averages and totals can swamp interesting cases with lots of other cases. There might be some extreme outliers in there where your performance is very bad, but they get washed away in the total. That would be fine if that was in fact a good representation of what you will see in real use, but if it's not, your real-world performance could be very bad.

The worst case is for a library provider like us, we don't know what data types are important to the client. That one weird case where we do badly might be 90% of the client's data.

Any time you're working with test sets where you take averages and totals you have to be aware of how you're pooling (weighted by size? (eg. adding times is weighted by size), or by file? or are categories equally weighted?). If your test set is 20% text and 40% executable that is assigning an effective importance weight to those data types.
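A quick sketch of how much the pooling choice matters, using compression ratio as the metric (the struct and function names are made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One test file's result.
struct FileResult { double raw_bytes; double comp_bytes; };

// Pooled by size: total raw / total compressed. Large files dominate.
double ratio_weighted_by_size(const std::vector<FileResult>& files)
{
    double raw = 0.0, comp = 0.0;
    for (std::size_t i = 0; i < files.size(); ++i)
    {
        raw  += files[i].raw_bytes;
        comp += files[i].comp_bytes;
    }
    assert(comp > 0.0);
    return raw / comp;
}

// Pooled by file: mean of per-file ratios. Every file counts equally.
double ratio_weighted_by_file(const std::vector<FileResult>& files)
{
    assert(!files.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < files.size(); ++i)
        sum += files[i].raw_bytes / files[i].comp_bytes;
    return sum / double(files.size());
}
```

With one 1000-byte file that compresses 2:1 and one 10-byte file that doesn't compress at all, the size-weighted ratio is about 1.98 while the per-file average is 1.5. Neither is "the" answer; each encodes a different assumption about what matters.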

In data compression we also have the issue of incompressible files, such as already compressed files, which are not something you should ever be running through your compressor. People running "lots of data" that just iterate every file on their personal disk and think they're getting a good measurement are actually biasing the inputs toward weird things that should not be used.

Because of these considerations and more, I have been increasingly using the method of "minimizing the maximum" of bad performance, or boosting the worst case.

Rather than using a big testset to take an average performance, I use a big test set to find the one file with the worst performance, and then do all I can to optimize that bad case. Measure again, find the new worst case, attack that one.
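The selection step can be sketched as below. The scoring used here (our ratio relative to the best competing ratio on the same file) is one possible choice of "worst", picked for illustration; `FileScore` and `worst_case_index` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Per-file result: our compression ratio and the best ratio any other
// compressor achieved on the same file.
struct FileScore { std::string name; double our_ratio; double best_other_ratio; };

// Index of the file where we do worst relative to the best alternative.
std::size_t worst_case_index(const std::vector<FileScore>& files)
{
    assert(!files.empty());
    std::vector<FileScore>::const_iterator worst = std::min_element(
        files.begin(), files.end(),
        [](const FileScore& a, const FileScore& b) {
            return a.our_ratio / a.best_other_ratio
                 < b.our_ratio / b.best_other_ratio;
        });
    return std::size_t(worst - files.begin());
}
```

The loop is then: fix whatever is going wrong on that file, re-run the whole set, and call this again to get the next target.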

This has many advantages. It prevents clients from ever seeing a really bad case. That one worst case might actually be the type of data they really care about. It also tends to find interesting corner cases and reveals flaws you don't see on average cases (like oh this one weird file runs most of the loop iterations in the tail/safe loop), that lets you find and fix those cases. It's sort of a case of "you learn from your mistakes" by really digging into these examples of bad performance.

Another nice thing about the "fix the worst" method is that it's strictly additive for bigger test sets. You can just always toss more in your test set and you have more chances to find a worst case. You don't have to worry about how the data is distributed and if that reflects real world distributions. For example say someone gives you a terabyte of images that are all grayscale. You don't have to worry that this is going to bias your image test set towards a weird over-weighting of grayscale.

This approach has been used on both Oodle Leviathan and Oodle Texture. It was one of the guiding principles of Leviathan that we not only be good on average, but we minimize the gap to the best compressor on every type of data. (we can't be the best possible compressor on every type of data, where specialized compressors can excel in some cases, but we wanted to minimize the worst difference). That led to lots of good discoveries in Leviathan that also helped the average case, and we used a similar principle in Oodle Texture. I think of it as a bit like the machine learning technique AdaBoost, where you take your worst cases and train on them more to get better at them, then keep repeating that and you wind up with a good classifier in general.

7/13/2020

Robust Win32 IO

I see far too much code in production that does not use Win32 IO robustly. Some of the issues are subtle and tricky, but many of them just come down to checking error codes and return values. You cannot assume :

ReadFile(size) either successfully reads all "size" bytes, or fails mysteriously and we should abort

What you actually need to be handling is :

ReadFile(size)
succeeded but got less than size
failed but failed due to being already at EOF
failed but failed due to a temporary system condition that we should retry
succeeded but is not asynchronous the way we expected
succeeded and was asynchronous but then GetOverlappedResult does not wait as we expected
failed but failed due to IO size being too big and we should cut it into pieces

In a surely pointless attempt to improve matters, I've tried to make easy to use clean helpers that do all this for you, so you can just include this code and have robust IO :

robustwin32io.zip

Even if you are being careful and checking all the error codes, some issues you may not be handling :

Cut large IOs into pieces. You may have memory allocated for your large IO buffer, but when you create an IO request for that, the OS needs to mirror that into the disk cache, and into kernel memory space for the IO driver. If your buffer is too large, that can fail due to running out of resources.
(this is now rarely an issue on 64-bit windows, but was common on 32-bit windows, and can still happen on the MS consoles)
Retry IOs in case of (some) failures. One of the causes of IO failure is too many requests in the queue, for example if you are spamming the IO system generating lots of small requests. If you get these failures you should wait a bit for the queue to drain out then retry.
Always call GetOverlappedResult(FALSE) (no wait) before GetOverlappedResult(TRUE) (wait) to reset the event. If you don't do this, GetOverlappedResult(TRUE) can return without waiting for the IO to return, causing a race against the IO. This behavior was changed in Windows 7 so this might not be necessary any more, but there's some dangerous behavior with the manual-reset Event in the OVERLAPPED struct. When you start an async IO it is not necessarily reset to unsignaled. When you GetOverlappedResult(TRUE) it is supposed to be waiting on an event that is signalled when the IO completes, but if the event was already set to signalled before you called, it will just return immediately.
NOTE this is not the issue with trying to do GetOverlappedResult on the same OVERLAPPED struct from multiple threads - that is just wrong; access to the OVERLAPPED struct should be mutex protected if you will query results from multiple threads, and you should also track your own "is io async" status to check before calling GetOverlappedResult.
Always call SetLastError(0) before calling any Windows API and then doing GetLastError. See previous blog on this topic : 10-03-13 - SetLastError(0). This particular bug was fixed in Windows Vista (so a while ago), but I'm paranoid about it and it's harmless to do, so I still do it. GetLastError/SetLastError is just a variable in your thread-info-block, so it's only a few instructions to access it. It's best practice to always SetLastError(0) at the start of a sequence of operations, that way you know you aren't getting errors that were left over from before.
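The overall shape of a read loop that handles short reads, EOF, retries and large-IO splitting can be sketched portably as below. This is not the code in robustwin32io.zip: `ReadFn`, `IoStatus`, `IoResult` and `robust_read` are hypothetical, and the callback stands in for a ReadFile call plus classification of its error code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>

enum class IoStatus { ok, eof, retry, error };
struct IoResult { IoStatus status; std::size_t bytes; };

// Hypothetical low-level read: reads up to `size` bytes at the current file
// position into `dst` and reports what happened.
using ReadFn = std::function<IoResult(void* dst, std::size_t size)>;

// Read up to `size` bytes. Returns the bytes actually read (less than `size`
// only at EOF or on failure). *ok is false on a hard error or exhausted retries.
std::size_t robust_read(const ReadFn& read_some, void* dst, std::size_t size,
                        bool* ok, std::size_t max_piece = 1 << 20,
                        int max_retries = 8)
{
    assert(max_piece > 0 && ok != nullptr);
    *ok = true;
    std::size_t total = 0;
    int retries = 0;
    while (total < size)
    {
        // cut large IOs into pieces
        const std::size_t want = std::min(max_piece, size - total);
        const IoResult r = read_some(static_cast<char*>(dst) + total, want);
        if (r.status == IoStatus::retry)
        {
            // temporary condition; a real version would wait a bit here
            if (++retries > max_retries) { *ok = false; break; }
            continue;
        }
        if (r.status == IoStatus::error) { *ok = false; break; }
        assert(r.bytes <= want);
        total += r.bytes;                           // short reads just loop again
        retries = 0;
        if (r.status == IoStatus::eof || r.bytes == 0) break;  // EOF is not a failure
    }
    return total;
}
```

The caller gets one simple contract (bytes read plus an ok flag) instead of having to distinguish all the cases at every call site.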

For example, here's how to call GetOverlappedResult : (only call if st == win32_io_started_async)

BOOL win32_get_async_result(HANDLE handle,
                            OVERLAPPED * data,
                            DWORD * pSize)
{
    // only call this if you got "win32_io_started_async"
    // so you know IO is actually pending
            
    DWORD dwSize = 0;
    
    // first check result with no wait
    //  this also resets the event so that the next call to GOR works :
    
    if ( GetOverlappedResult(handle,data,&dwSize,FALSE) )
    {
        if ( dwSize > 0 )
        {
            *pSize = (DWORD) dwSize;
            return true;
        }
    }   
    
    // if you don't do the GOR(FALSE)
    //  then the GOR(TRUE) call here can return even though the IO is not actually done
    
    // call GOR with TRUE -> this yields our thread if the IO is still pending
    if ( ! GetOverlappedResult(handle,data,&dwSize,TRUE) )
    {
        DWORD err = GetLastError();
        
        if ( err == ERROR_HANDLE_EOF )
        {
            if ( dwSize > 0 )
            {
                *pSize = (DWORD) dwSize;
                return true;
            }
        }
        
        return false;       
    }
        
    *pSize = (DWORD) dwSize;
    
    return true;    
}

Get the code :

robustwin32io.zip

Also note that I don't recommend trying to do unbuffered writes on Windows. It is possible but complex and requires privilege elevation which puts up a UAC prompt, so it's not very usable in practice. Just do buffered writes. See also : 01-30-09 - SetFileValidData and async writing and 03-12-09 - ERROR_NO_SYSTEM_RESOURCES

7/07/2020

Integrating Oodle Texture in your Engine

Oodle Texture RDO should integrate into your engine very easily. It just replaces the BCN encoder you were using previously, and you magically get BC1-7 textures that compress much smaller. There are a couple of issues you may wish to consider which I'll talk about here.

(Integrating BC7Prep is rather different; see BC7Prep data flow here. Essentially BC7Prep will integrate like a compressor, you ship the runtime with a decompressor; it doesn't actually make texture data but rather something you can unpack into a texture. BC7Prep is not actually a compressor, it relies on your back-end compressor (Kraken or zip/deflate typically), but it integrates as if it was.)

Caching output for speed and patches

You may wish to cache the output of Oodle Texture BCN encoding, and reuse it when the source content hasn't changed, rather than regenerate it.

One reason is for speed of iteration; most likely you already have this system in some form so that artists can run levels without rebaking to BCN all the time. Perhaps you'd like to have a two-stage cache; a local cache on each person's machine, and also a baked content server that they can fetch from so they don't rebake locally when they encounter new levels.

Oodle Texture RDO encodes can be slow. You wouldn't like to have to rebake all the BCN textures in a level on a regular basis. We will be speeding it up in future versions and probably adding faster (lower quality) modes for quicker iteration, but it will never be realtime.

Caching can also be used to reduce unnecessary patch generation.

Oodle Texture guarantees deterministic output. That is, the same input on the same version of Oodle Texture, on the same platform, with the same options will make the same output. So you might think you can rely on there being no binary diff in the generated output to avoid patches.

The problem with that idea is it locks you into a specific version of Oodle Texture. This is a brand new product and it's not a good idea to assume that you will never have to update to a new version. Newer versions *will* make different output as we improve the algorithms. Relying on there being no binary diff to avoid making patches means never taking updates of Oodle Texture from RAD. While it is possible you could get away with this, it's very risky. It has the potential of leaving you in a situation where you are unable to update to a better new version because it would generate too many binary diffs and cause lots of patching.

It's much safer to make your patches not change if the source content hasn't changed. If the source art and options are the same, use the same cached BCN.

With these considerations, the cache should be indexed by : a hash of the source art bits (perhaps Meow hash ; not the file name and mod time), the options used (such as the lambda level), but NOT the Oodle Texture version.
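
A minimal sketch of such a cache key, under some assumptions: BakeOptions and bake_cache_key are hypothetical names, and FNV-1a is only a simple stand-in for a faster, wider hash like Meow hash (a real content cache would want more than 64 bits to make collisions a non-issue).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical encode options; only fields that change the output belong here.
struct BakeOptions
{
    int lambda;          // RDO lambda level
    std::string format;  // e.g. "BC7"
};

// 64-bit FNV-1a, chained through `h` so several buffers can be mixed into one hash.
uint64_t fnv1a(const void* data, size_t size, uint64_t h = 14695981039346656037ull)
{
    const unsigned char* p = static_cast<const unsigned char*>(data);
    for (size_t i = 0; i < size; ++i)
    {
        h ^= p[i];
        h *= 1099511628211ull;
    }
    return h;
}

// The key depends on the source art bits and the options, and deliberately
// NOT on the file name, mod time, or Oodle Texture version.
uint64_t bake_cache_key(const void* source_bits, size_t size, const BakeOptions& opt)
{
    uint64_t h = fnv1a(source_bits, size);
    h = fnv1a(&opt.lambda, sizeof(opt.lambda), h);
    h = fnv1a(opt.format.data(), opt.format.size(), h);
    return h;
}
```

Because the version is not in the key, updating the encoder does not invalidate the cache; unchanged source art keeps returning the same cached BCN, so it generates no patch.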

Lambda for texture types depends on your usage context

It's worth thinking a bit about how you want to expose lambda to control quality for your artists. Just exposing direct control of the lambda value per texture is probably not the right way. A few issues to consider :

You should make it possible to tweak lambda across the board at a later date. It's very common to not know your size target until very late in development. Perhaps only a month before ship you see your game is 9 GB and you'd like to hit 8 GB. You can do that very easily if you have a global multiplier to scale lambdas. What you don't want is to have lots of hard-coded lambda values associated with individual textures.

We try to make "lambda" have the same approximate meaning in terms of visual quality across various texture types, but we can only see how that affects error in the texels, not in how they are shown on screen. Transformations that happen in your shader can affect how important the errors are, and lambda should be scaled appropriately.

For example say you have some type of maps that use a funny shader :

fetch rgb
color *= 2

in that case, texel errors in the map are actually twice as important as we think. So if you were using lambda=40 for standard diffuse textures, you should use lambda=20 for these funny textures.

Now doubling the color is obviously silly, but that is effectively what you do with maps that become more important on screen.

Probably the most intuitive and well known example is normal maps. Normal maps can sometimes massively scale up errors from texel to screen; it depends on how they are used. If you only do diffuse lighting in smooth lighting environments, then normal map errors can be quite mild, and standard lambda scaling might be fine. But in other situations, for example if you did environment map reflections with very sharp contrast (the classic case is a rotating car with a "chrome" map), then any errors in normals become massively magnified and you will want very little error indeed.

(note that even in the "very little error" case you should still use Oodle Texture RDO, just set lambda to 1 for near lossless encoding; this can still save quite a lot of size with no distortion penalty; for maximum quality you should almost always still be doing RDO in near-lossless mode, not turning it off)

We (RAD) can't just say that "normal maps should always be at 1/2 the lambda of diffuse maps". It really depends on how you're using them, how high contrast the specular lighting is. What really matters is the error in the final image on screen, but what we measure is the error level in the texture; the ratio of the two is how you should multiply lambda :

lambda multiplier = (texture error) / (screen error)
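
As a tiny worked instance of that ratio (lambda_multiplier is just an illustrative name), the "color *= 2" shader from earlier doubles the on-screen error, so the multiplier comes out to 0.5 and lambda=40 becomes lambda=20 :

```cpp
// lambda multiplier = (texture error) / (screen error)
// Both errors just need to be in the same units; only the ratio matters.
double lambda_multiplier(double texture_error, double screen_error)
{
    return texture_error / screen_error;
}

// Example: a shader that doubles the color turns a texel error of 1 into a
// screen error of 2, so lambda_multiplier(1.0, 2.0) is 0.5.
```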

This kind of error magnification depends mainly on the type of map (normals, AO, gloss, metalness, translucency, etc.) and how your engine interprets them. If we think of diffuse albedo as the baseline, the other maps will have errors that are 75% or 200% or whatever the importance, and lambda should be scaled accordingly.

I suggest that you should have a lambda scaling per type of map. This should not be set per texture by artists, but should be set by the shader programmer or tech artists that know how maps feed the rendering pipeline.

In the end, the way you should tweak the map-type lambda scaling is by looking at how the errors come out on the screen in the final rendered image, not by looking at the errors in the texture itself. The transformations you do between texel fetch and screen affect how much those texel errors are visible.

Aside from the per-map-type lambda scaling, you probably want to provide artists with a per-texture override. I would encourage this to be used as little as possible. You don't want artists going through scaling every lambda because they don't like the default, rather get them to change the default. This should be used for cases where almost all the textures look great but we want to tweak a few.

Per-texture scaling can be used to tweak for things that are outside the scope of what we can see inside the texture. For example if the texture is used on the player's weapons so it's right in your face all the time, perhaps you'd like it higher quality. Another common case is human faces: observers are much more sensitive to errors in faces, so you might want them to be at higher quality.

I think a good way to expose per-texture scaling is as a percentage of the global map-type lambda. eg. you expose a default of 100% = a 1.0 multiplier, artists can slide that to 50% or 200% to get 0.5 or 2.0x the map-type lambda. So perhaps you set something up like :


global map type default lambdas :

diffuse/albedo lambda = 30
normal maps lambda = 10
AO lambda = 30
roughness lambda = 40

then an artist takes a specific normal map
because it's for a car, say
and slides the "rate reduction %" from "100%" down to "50%"

so it would get a lambda of 5

then late in dev you decide you want everything to be globally a bit smaller
you can go through and tweak just the global map type lambdas and everything adjusts
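
The setup above could be wired up roughly like this. It's a sketch with made-up names (LambdaSettings, effective_lambda), assuming the final lambda is just the product of a global scale, the map-type default, and the artist's percentage :

```cpp
#include <map>
#include <string>

// Hypothetical project-wide lambda settings.
struct LambdaSettings
{
    double global_scale = 1.0;                      // across-the-board knob for late in dev
    std::map<std::string, double> map_type_lambda;  // set by shader programmers / tech artists
};

// Final lambda for one texture. artist_percent is the per-texture override,
// where 100 means "use the map-type default unchanged".
// Throws std::out_of_range if the map type has no default.
double effective_lambda(const LambdaSettings& s,
                        const std::string& map_type,
                        double artist_percent = 100.0)
{
    return s.global_scale * s.map_type_lambda.at(map_type) * (artist_percent / 100.0);
}
```

With a normal map default of 10 and the artist slider at 50%, this gives the lambda of 5 from the example; changing global_scale or the map-type defaults later rescales every texture without touching the per-texture overrides.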

Delay baking and format choices

It's best practice to delay baking and format choices until right before the BCN encode, and do all the earlier steps at maximum precision.

For example don't hard-code the BCN choice; some older engines specify BC1 for diffuse and "DXT5n" (BC3) for normals. Those are not the formats you want in most modern games; you probably want BC7 for diffuse and BC5 for normals. It's probably best in your tools to not directly expose the BCN choice to artists, but rather just the texture type, and let your baker choose the format.

Oodle Texture is designed to be a very low level lib; we don't do a lot of texture processing for you, we only do the final RGB -> BCN encode step. We assume you will have a baker layer that's just above Oodle Texture that does things like mip maps and format conversions.

Normal maps require special care. If possible they should be kept at maximum precision all the way through the pipeline (from whatever tool that made them up to the Oodle Texture encode). If they come out of geometry normals as F32 float, just keep them that way, don't quantize down to U8 color maps early. You may decide later that you want to process them with something like a semi-octahedral encoding, and to do that you should feed them with full precision input, not quantized U8 values that have large steps. 16 bit maps also have plenty of precision, but with 16 bit integers ensure you are using a valid quantizer (restore to center of quantizer bucket), and the correct normalization (eg. signed float -1.0 to 1.0 should correspond to S16 -32767 to 32767 , -32768 unused). Our BC4/5 encoders are best fed S16 or U16 input.
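
One quantizer that fits that description, as a sketch (float_to_s16 and s16_to_float are illustrative names, not Oodle Texture API, and the input is assumed not to be NaN). With round-to-nearest on the way in, dividing by 32767 on the way out restores to the center of each bucket :

```cpp
#include <cmath>
#include <cstdint>

// Signed float [-1,1] -> S16 [-32767,32767], round to nearest. -32768 is never produced.
int16_t float_to_s16(float f)
{
    if (f < -1.0f) f = -1.0f;
    if (f > 1.0f) f = 1.0f;
    return static_cast<int16_t>(std::lround(f * 32767.0f));
}

// S16 -> signed float. Each value q maps to q/32767, the center of the
// range of floats that round to q, so 0 stays exactly 0 and +-32767 is +-1.
float s16_to_float(int16_t q)
{
    if (q < -32767) q = -32767; // -32768 is unused; treat it as -1.0
    return static_cast<float>(q) / 32767.0f;
}
```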

Delaying quantization to a specific type of int map lets you choose the best way to feed the BCN encoder.

In the Oodle Texture SDK help there's an extensive discussion in the "Texture Mastering Guide" on choosing which of the BC1-7 formats to use, and some tips on preparing textures for BCN.
