cbloom rants

1/11/2021

AVIF Test

AVIF is an image format derived from I-frames of AV1 video (similar to HEIC/HEIF from H265/HEVC). See also my 2014 Test of BPG , which is an H265 I-frame image format.

Here are some links I've found on AVIF :

AVIF image format supported by Cloudflare Image Resizing
GitHub - AOMediaCodec/libavif : Library for encoding and decoding .avif files
GitHub - google/brunsli : Practical JPEG Repacker
Releases · kornelski/cavif-rs · GitHub
GitHub - link-u/cavif : avif encoder, using libaom directly.
GitHub - xiph/rav1e : The fastest and safest AV1 encoder.
AVIF for Next-Generation Image Coding - Netflix TechBlog
Submissions from xiph.org - Hacker News
Image formats for the web: HEIC and AVIF - The Publishing Project
Squoosh
Comparing AVIF vs WebP file sizes at the same DSSIM
AV1 Codec
AVIF images color loss/change - AV1

Unfortunately I have not found a standard encoder with recommended settings. There appears to be zero guidance on settings anywhere. Because of that I am using the simplest encoder I could find, Kornelski's "cavif" which has a simple --quality parameter. I run thusly :

cavif --quality=X -o out.avif in.png
avifdec out.avif dec.png
imdiff in.png dec.png 
CALL FOR HELP : if you know better programs/settings for encoding AVIF, please let me know. In avifenc, what is the min/max Q parameter? Adaptive quantization appears to be off by default; don't we want that on?

I measure results using my imdiff.

I will compare AVIF to what I call "JPEG++", which is JPEG with the packjpg/Lepton back end and a deblocking decoder (my "jpegdec"). This is a rough stand-in for what I think a real JPEG++ should be (it should really have an R-D front-end and chroma-from-luma as well; that's all very easy and unequivocally good).

With no further ado, some results :

(imdiff "fit" is a quality in 0-10 , higher is better)


porsche640.bmp :


PDI_1200 :


Results are a bit disappointing. AVIF is much better on RMSE but slightly worse on my other two scores. That means it's most likely better overall, but not by a huge margin.

(I'm sure AVIF is a big win on graphic/text images where JPEG does poorly)

AVIF results here look worse than what I saw from BPG (HEIC). Perhaps better encoders/settings will fix that.

Looking at the results visually, AVIF preserves sharp lines much better, but is completely throwing away detail in some places. There are some places where I see AVIF actually *change* the image, whereas JPEG is always just making the same image but slightly worse.

7z of encoded images to compare (2 MB)

NOTE : the fact that AVIF wins strongly in RGB RMSE but not in my other perceptual metrics indicates that it is not optimizing for those metrics. Perhaps in other perceptual metrics it would show a strong win. The metrics I use here from imdiff were chosen because I found them to be the best fit to human quality scores. Lots of the standard scores that people use (like naive SSIM) I have found to be utter junk, with no better correlation to human quality than RMSE. MS-SSIM-IW is the best variant of SSIM I know, but I haven't tested some of the newer metrics that have come out in the last few years.

1/06/2021

Some JPEG Tools

A couple of tips and tools for working with JPEG files. I use :

jhead
jpegcrop

Of course you can use exiftool and jpegtran but these are rather simpler.

1. Strip all personal data before uploading images to the web.

JPEG EXIF headers contain things like date shot and location. If you don't want Google to scrape that and use it to track your movements and send drones to follow you carrying advertisements, you might want to strip all that private data before you upload images to the net. I use :

jhead -purejpg %*
jhead -mkexif %*
jhead -dsft %*
with the very handy jhead tool. This removes all non-image headers, then makes a blank exif, then sets the exif time from the mod time. The last two steps are optional; use them if you want to preserve the shot time (assuming the mod time of the file was equal to the exif time). Note that if the times are not equal you can use "jhead -ft" to set the mod time from the exif time before this.

Also note that if you use tools to modify images (resizing, changing exposure, what have you), they may or may not carry through the old exif data; whether that's good or bad is up to you, but it's very inconsistent. They also will probably set the modtime to now, whereas I prefer to keep the modtime = shot time so that I can find files by date more easily.

2. Remove orientation tags

iPhones are the only device I know of that consistently make use of the JPEG orientation tags rather than actually rotate the pixels. Unfortunately lots of loaders don't support these tags right, so if you load the image in a non-compliant loader it will appear sideways or upside down.

(this is a classic problem with data formats that have too many features; inevitably many implementers only support a subset of the features which they deem to be the necessary ones; then some bone-head says "oh look the format has this weird feature, let's use it", and they output data which most loaders won't load right (then they blame the loaders). This is a mistake by the people defining the format; don't include unnecessary features, and make the practical subset a well-defined subset, not something that forms ad-hoc from use.)

To fix this, you want to read the orientation tag, rotate the pixels, then blank out the orientation tag (you have to do the last step or compliant loaders will doubly rotate). I used to have scripts to do all that with jpegtran, but it's super easy to do with jhead, you just :

jhead -autorot %*

3. Lossless crop

Avoid loading a JPEG, modifying it, and saving it out again. It destroys the image by re-applying the lossy quantization. I think by far the most common modification people do is cropping and there's no damn reason to re-encode for a crop. You can crop at 8x8 granularity losslessly, and you should. (all the web image uploaders that give you a crop tool need to do this, please, stop sucking so bad).

jpegcrop is a GUI tool that provides a decent lossless crop.

FreeVimager is okay too. Lossless crop is hidden in "JPEG Advanced".

We've got JPEG-XL coming soon, which looks great, but all the tech in the world won't help if people like Instagram & Youtube keep re-encoding uploads, and re-encoding every time you modify.

11/18/2020

Oodle 2.8.13 Release

Oodle 2.8.13 fixes an issue in Oodle Texture with consistency of encodings across machine architectures.

We try to ensure that Oodle Texture creates the same encodings regardless of the machine you run on. So for example if you run on machines that have AVX2 or not, our optional AVX2 routines won't change the results, so you get binary identical encodings.

We had a mistake that was causing some BC1 RDO encodings to be different on AMD and Intel chips. This is now fixed.

The fixed encoding (for BC1 RDO) made by 2.8.13 can be different than either of the previous (AMD or Intel) encodings. I want to use this announcement as an opportunity to repeat a point I am trying to push :

Do not use the binary output of encodings to decide what content needs to be patched!

This is widespread practice in games and it is really a bad idea. The problem is that any changes to the encoders you use (either Oodle Texture, your compressor, or other) can cause you to patch all your content unnecessarily. We do NOT guarantee that different versions of Oodle Texture produce the same output, and in fact we can specifically promise they won't (always produce the same encodings), because we will continue to improve the encoder and find better encodings over time. Aside from version changes causing binary diffs, you might want to change settings, quality or speed levels, etc. and that shouldn't force a full patch down to customers.

The alternative to using binary output to make patches is to check whether the pre-encoding content has changed, and not patch if it is the same. I know there are difficult issues with that: often for textures you do something like load a bitmap, apply various transforms, then do the BCN encoding, and no snapshot is taken of the fully processed texture right before the BCN encoding.

If you are shipping a game with a short lifespan, where your intention is to just ship it and only patch for bugs, then patching based on binary compiled content diffs is probably fine. But if you are shipping a game that you intend to have a long lifetime, with various generations of DLC, or a long term online player base, then you seriously need to consider patching based on pre-compression content diffs. It is very likely you will at some point in the lifetime of the game have to face OS version changes, compiler changes, or perhaps bugs in the encoder which force you to get on a newer version of the compression tools. You don't want to be in a situation where that's impossible because it would generate big patches.


One elegant way to solve this, and also speed up your content cooking, is to implement a cooked content cache.

Take the source bitmap, and all the texture cooking & encoding options, and use those as a key to look up pre-cooked content from a network share or similar. If found, don't re-encode.

Every time you export a level, you don't want to have to re-encode all the textures with Oodle Texture. With huge art teams, when someone edits a texture, everyone else on the team can just fetch from the cook cache rather than encode it locally.

The same kind of system can be used to avoid unnecessary patches. If you then populate your cook-cache with the content from the last shipped version, you won't make new encodings unless the source art or options change, even if the encoder algorithm changes.
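One possible shape for that cache key, as a rough sketch (my own illustration, not an Oodle or engine API; the option fields and the share path are made up) : hash the source bitmap bytes together with every option that affects the encoding, and use the hash as the lookup key.

#include <cstdint>
#include <cstddef>
#include <string>
#include <vector>

// FNV-1a over raw bytes, just to keep the sketch self-contained;
// a real pipeline would want a strong content hash.
static uint64_t hash_bytes( const void * ptr, size_t len, uint64_t h = 14695981039346656037ull )
{
    const uint8_t * p = (const uint8_t *)ptr;
    for ( size_t i = 0; i < len; i++ ) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

struct CookOptions
{
    int      bcn_format;   // e.g. 1 for BC1, 7 for BC7
    int      rdo_lambda;   // 0 = non-RDO
    uint32_t cook_version; // bump only when you intend outputs to change
};

std::string cook_cache_key( const std::vector<uint8_t> & source_bitmap, const CookOptions & opt )
{
    uint64_t h = hash_bytes( source_bitmap.data(), source_bitmap.size() );
    h = hash_bytes( &opt.bcn_format,   sizeof(opt.bcn_format),   h );
    h = hash_bytes( &opt.rdo_lambda,   sizeof(opt.rdo_lambda),   h );
    h = hash_bytes( &opt.cook_version, sizeof(opt.cook_version), h );
    // look this path up on the shared cook cache before encoding locally
    return "//cookcache/" + std::to_string( h ) + ".bcn";
}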


For the lossless package compressor, ideally the patch generator would be able to decode the compression and wouldn't generate patches if only the compressed version changed.

To be clear, I'm assuming here that you are compressing assets individually, or even in smaller sub-asset chunks, or some kind of paging unit. I'm assuming you are NOT compressing whole packages as single compression units; if you did that, a single byte changing in any asset could change the entire compressed unit, which is a bad idea for patching.

(note that encryption also has the same property of taking a single byte change and spreading it around the chunk, so encryption should usually be done on the same or smaller chunk than the compression)

With most existing patchers that are not compression-aware, if you change the compressor (for example by changing the encode level option, or updating to a newer version), the compressed bytes will change and generate large patches. What they should ideally do is see that while the compressed bytes changed, the decompressed bytes are the same, so no patch is needed, and the old version of the compressed bytes can be retained. This would allow you to deploy new compressors and have them used for all new content and gradually roll out, without generating unnecessary large patches.
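A compression-aware patch check could look something like this sketch (my own illustration, not a real patcher API; the decompress callback stands in for whatever lossless decoder the package uses) :

#include <cstdint>
#include <vector>
#include <functional>

// returns true only if the *decompressed* bytes differ; the compressed bytes
// changing on their own (new compressor version, different encode level) is
// not a reason to patch, and the old compressed bytes can be kept in the package
bool chunk_needs_patch( const std::vector<uint8_t> & old_compressed,
                        const std::vector<uint8_t> & new_compressed,
                        const std::function< std::vector<uint8_t>( const std::vector<uint8_t> & ) > & decompress )
{
    return decompress( old_compressed ) != decompress( new_compressed );
}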

9/24/2020

How Oodle Kraken and Oodle Texture supercharge the IO system of the Sony PS5

The Sony PS5 will have the fastest data loading ever available in a mass market consumer device, and we think it may be even better than you have previously heard. What makes that possible is a fast SSD, an excellent IO stack that is fully independent of the CPU, and the Kraken hardware decoder. Kraken compression acts as a multiplier for the IO speed and disk capacity, storing more games and loading faster in proportion to the compression ratio.

Sony has previously published that the SSD is capable of 5.5 GB/s and expected decompressed bandwidth around 8-9 GB/s, based on measurements of average compression ratios of games around 1.5 to 1. While Kraken is an excellent generic compressor, it struggled to find usable patterns on a crucial type of content : GPU textures, which make up a large fraction of game content. Since then we've made huge progress on improving the compression ratio of GPU textures, with Oodle Texture which encodes them such that subsequent Kraken compression can find patterns it can exploit. The result is that we expect the average compression ratio of games to be much better in the future, closer to 2 to 1.

Oodle Kraken is the lossless data compression we invented at RAD Game Tools, which gets very high compression ratios and is also very fast to decode. Kraken is uniquely well suited to compress game content and keep up with the speed requirements of the fast SSD without ever being the bottleneck. We originally developed Oodle Kraken as software for modern CPUs. In Kraken our goal was to reformulate traditional dictionary compression to maximize instruction level parallelism in the CPU, with lots of independent work running at all times and a minimum of serial dependencies and branches. Adapting it for hardware was a new challenge, but it turned out that the design decisions we had made to make Kraken great on modern CPUs were also exactly what was needed to be good in hardware.

The Kraken decoder acts as an effective speed multiplier for data loading. Data is stored compressed on the SSD and decoded transparently at load time on PS5. What the game sees is the rate that it receives decompressed data, which is equal to the SSD speed multiplied by the compression ratio.

Good data compression also improves game download times, and lets you store more games on disk. Again the compression ratio acts as an effective multiplier for download speed and disk capacity. A game might use 80 GB uncompressed, but with 2 to 1 compression it only takes 40 GB on disk, letting you store twice as many games. A smaller disk with better compression can hold more games than a larger disk with worse compression.

When a game needs data on PS5, it makes a request to the IO system, which loads compressed data from the SSD; that is then handed to the hardware Kraken decoder, which outputs the decompressed data the game wanted to RAM. As far as the game is concerned, it just gets its decompressed data, but with higher throughput. On other platforms, Kraken can be run in software, getting the same compression gains but using CPU time to decode. When using software Kraken, you would first load the compressed data, then when that IO completes perform decompression on the CPU.

If the compression ratio was exactly 1.5 to 1, the 5.5 GB/s peak bandwidth of the SSD would decompress to 8.25 GB/s uncompressed bytes output from the Kraken decoder. Sony has estimated an average compression ratio of between 1.45 to 1 and 1.64 to 1 for games without Oodle Texture, resulting in expected decompressed bandwidth of 8-9 GB/s.

Since then, Sony has licensed our new technology Oodle Texture for all games on the PS4 and PS5. Oodle Texture lets games encode their textures so that they are drastically more compressible by Kraken, but with high visual quality. Textures often make up the majority of content of large games and prior to Oodle Texture were difficult to compress for general purpose compressors like Kraken.

The combination of Oodle Texture and Kraken can give very large gains in compression ratio. For example on a texture set from a recent game :

Zip 1.64 to 1
Kraken 1.82 to 1
Zip + Oodle Texture 2.69 to 1
Kraken + Oodle Texture 3.16 to 1

Kraken plus Oodle Texture gets nearly double the compression of Zip alone on this texture set.

Oodle Texture is a software library that game developers use at content creation time to compile their source art into GPU-ready BC1-7 formats. All games use GPU texture encoders, but previous encoders did not optimize the compiled textures for compression like Oodle Texture does. Not all games at launch of PS5 will be using Oodle Texture as it's a very new technology, but we expect it to be in the majority of PS5 games in the future. Because of this we expect the average compression ratio and therefore the effective IO speed to be even better than previously estimated.

How does Kraken do it?

The most common alternative to Kraken would be the well known Zip compressor (aka "zlib" or "deflate"). Zip hardware decoders are readily available, but Kraken has special advantages over Zip for this application. Kraken gets more compression than Zip because it's able to model patterns and redundancy in the data that Zip can't. Kraken is also inherently faster to decode than Zip, which in hardware translates to more bytes processed per cycle.

Kraken is a reinvention of dictionary compression for the modern world. Traditional compressors like Zip were built around the requirement of streaming with low delay. In the past it was important for compressors to be able to process a few bytes of input and immediately output a few bytes, so that encoding and decoding could be done incrementally. This was needed due to very small RAM budgets and very slow communication channels, and typical data sizes were far smaller than they are now. When loading from HDD or SSD, we always load data in chunks, so decompressing in smaller increments is not needed. Kraken is fundamentally built around decoding whole chunks, and by changing that requirement Kraken is able to work in different ways that are much more efficient for hardware.

All dictionary compressors send commands to the decoder to reproduce the uncompressed bytes. These are either a "match" to a previous substring of a specified length at an "offset" from the current output pointer in the uncompressed stream, or a "literal" for a raw byte that was not matched.

Old fashioned compressors like Zip parsed the compressed bit stream serially, acting on each bit in different ways, which requires lots of branches in the decoder - does this bit tell you it's a match or a literal, how many bits of offset should I fetch, etc. This also creates an inherent data dependency, where decoding each token depends on the last, because you have to know where the previous token ends to find the next one. This means the CPU has to wait for each step of the decoder before it begins the next step. Kraken can pre-decode all the tokens it needs to form the output, then fetch them all at once and do one branchless select to form output bytes.

Kraken creates optimized streams for the decoder

One of the special things about Kraken is that the encoded bit stream format is modular. Different features of the encoder can be turned on and off, such as entropy coding modes for the different components, data transforms, and string match modes. Crucially the Kraken encoder can choose these modes without re-encoding the entire stream, so it can optimize the way the encoder works for each chunk of data it sees. Orthogonality of bit stream options is a game changer; it means we can try N boolean options in only O(N) time by measuring the benefit of each option independently. If you had to re-encode for each set of options (as in traditional monolithic compressors), it would take O(2^N) time to find the best settings.
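As a toy illustration of that point (this is not the Oodle encoder, just a sketch of the search structure) : with orthogonal options you can evaluate each boolean flag against the current best and keep it only if it helps, which is N+1 trial measurements instead of 2^N.

#include <cstddef>
#include <vector>
#include <functional>

// encode_size stands in for "encode this chunk with these flags and report the size"
size_t choose_flags( const std::function< size_t( const std::vector<bool> & ) > & encode_size,
                     int num_flags, std::vector<bool> & out_flags )
{
    out_flags.assign( num_flags, false );
    size_t best = encode_size( out_flags );      // baseline with all options off
    for ( int i = 0; i < num_flags; i++ )
    {
        std::vector<bool> trial = out_flags;
        trial[i] = true;                         // toggle one option
        size_t s = encode_size( trial );
        if ( s < best ) { out_flags = trial; best = s; }  // keep it only if it helps
    }
    return best;                                 // N+1 encode measurements total
}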

The various bit stream options do well on different types of data, and they have different performance trade offs in terms of decoder speed vs compression ratio. On the Sony PS5 we use this to make encoded bit streams that can be consumed at the peak SSD bandwidth so that the Kraken decoder is never the bottleneck. As long as the Kraken decoder is running faster than 5.5 GB/s input, we can turn on slower modes that get more compression. This lets us tune the stream to make maximum use of the time budget, to maximize the compression ratio under the constraint of always reading compressed bits from the SSD at full speed. Without this ability to tune the stream you would have very variable decode speed, so you would have to way over-provision the decoder to ensure it was never the bottleneck, and it would often be wasting computational capacity.

There are a huge number of possible compressed streams that will all decode to the same uncompressed bytes. We think of the Kraken decoder as a virtual machine that executes instructions to make output bytes, and the compressed streams are programs for that virtual machine. The Kraken encoder is then like an optimizing compiler that tries to find the best possible program to run on that virtual machine (the decoder). Previous compressors only tried to minimize the size of the compressed stream without considering how choices affect decode time. When we're encoding for a software decoder, the Kraken encoder targets a blend of decode time and size. When encoding for the PS5 hardware decoder, we look for the smallest stream that meets the speed requirement.

We designed Kraken to inherently have less variable performance than traditional dictionary compressors like Zip. All dictionary compressors work by copying matches to frequently occurring substrings; therefore they have a fast mode of decompression when they are getting lots of long string matches, where they can output many bytes per step of the decoder. Prior compressors like Zip fall into a much slower mode on hard to compress data with few matches, where only one byte at a time is being output per step, and another slow mode when they have to switch back and forth between literals and short matches. In Kraken we rearrange the decoder so that more work needs to be done to output long matches, since that's already a super fast path, and we make sure the worst case is faster. Data with short matches or no matches or frequent switches between the two can still be decoded to output at least three bytes per step. This ensures that our performance is much more stable, which means the clock rate of the hardware Kraken decoder doesn't have to be as high to meet the minimum speed required.

Kraken plus Oodle Texture can double previous compression ratios

Kraken is a powerful generic compressor that can find good compression on data with repeated patterns or structure. Some types of data are scrambled in such a way that the compressibility is hard for Kraken to find unless that data is prepared in the right way to put it in a usable form. An important case of this for games is GPU textures.

Oodle Kraken offers even bigger advantages for games when combined with Oodle Texture. Often the majority of game content is in BC1-BC7 textures. BC1-7 textures are a lossy GPU format that encodes 4x4 blocks of pixels into 8 or 16 byte blocks. Oodle Kraken is designed to model patterns at this kind of granularity, but with previous BC1-BC7 texture encoders there simply wasn't any pattern there to find; they were nearly incompressible with both Zip and Kraken. Oodle Texture creates BC1-7 textures in a way that has patterns in the data that Kraken can find to improve compression, but that are not visible to the human eye. Kraken can see that certain structures in the data repeat (the lengths of matches and offsets and space between matches) and code them in fewer bits. This is done without expensive operations like context coding or arithmetic coding.

It's been a real pleasure working with Sony on the hardware implementation of Kraken for PS5. It has long been our mission at RAD to develop the best possible compression for games, so we're happy to see publishers and platforms taking data loading and sizes seriously.

9/11/2020

Topics in Quantization for Games

I want to address some topics in quantization, with some specifics for games.

We do "quantization" any time we take a high precision value (a floating point, or higher-bit integer) and store it in a smaller value. The quantized value has less precision. Dequantization takes you back to the space of the input and should be done to minimize the desired error function.

I want to encourage you to think of quantization like this :

quantization takes some interval or "bucket" and assigns it to a label

dequantization restores a given label to a certain restoration point

"quantization" does not necessarily take you to a linear numeric space with fewer bits

The total expected error might be what we want to minimize :

Total_Error = Sum_x P(x) * Error( x,  dequantization( quantization(x) ) )

Note that in general the input values x do not have uniform probability, and the Error is not just linear L1 or L2 error, you might care about some other type of error. (you might also care more about minimizing the maximum rather than the average error).
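As a small concrete sketch of that sum (my own code; the quantize/dequantize and error callbacks are whatever pair you are evaluating) :

#include <vector>

struct Sample { float x; float prob; };  // an input value and its probability P(x)

double total_expected_error( const std::vector<Sample> & samples,
                             int   (*quantize)( float x, int N ),
                             float (*dequantize)( int q, int N ),
                             int N,
                             double (*error_fn)( float original, float restored ) )
{
    double total = 0.0;
    for ( const Sample & s : samples )
    {
        float restored = dequantize( quantize( s.x, N ), N );
        total += s.prob * error_fn( s.x, restored );   // P(x) * Error( x, dequant(quant(x)) )
    }
    return total;
}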

I like to think of the quantized space as "labels" because it may not be just a linear numerical space where you can do distance metrics - you always dequantize back to your original value space before you do math on the quantization labels.

I started thinking about this because of my recent posts on Widespread error in RGBE and Alternative quantizers for RGBE, and I've been looking in various game-related code bases and found lots of mistakes in quantization code. These are really quite big errors compared to what we work very hard to reduce. I've found this kind of thing before outside of games too. For example it's very common for the YUV conversions in video and image codecs to be quite crap, giving up lots of error for no good reason. Common errors I have seen in the YUV conversions are : using the terribad 16-235 range, mixing up the rec601/bt709 matrices so that you encode with one and decode with the other, and using terribad down and/or up filters for the chroma downsample. It's frustrating when the actual H264 layer works very hard to minimize error, but then the YUV-RGB layer outside it adds some that could be easily avoided.

We do quantization all the time. A common case is converting 8-bit RGB colors to float colors, and vice versa. We do it over and over when we do rendering passes; every time you write values out to a render target and read them back, you are quantizing and dequantizing. It is important to take care to make sure that those quantization errors are not magnified by later passes. For example when writing something like normals or lighting information, a quantization error of 1/256 can become much larger in the next stage of rendering.

(a common example of that is dot products or cosines; if you have two vectors and store something that acts like a dot product between them (or a cosine of an angle), the quantization bucket around 1.0 for the two vectors being parallel corresponds to a huge amount of angular variation, and this is often right where you care most about having good precision; it's much better to store something like the acos of the dot product)

If you aren't going to do the analysis about how quantization errors propagate through your pipeline, then the easiest thing to do is to only quantize once, at the very end, and keep as much precision through the stages as possible. If you do something like a video codec, or an image processing pipeline, and try to work in limited precision (even 16 bit), it is important to recognize that each stage is an implicit quantization and to look at how those errors propagate through the stages.

(aside: I will mention just briefly that we commonly talk about a "float" as being the "unquantized" result of dequantization; of course that's not quite right. A "float" is a quantized representation of a real number, it just has variable size quantization bins, smaller bins for smaller numbers, but it's still quantized with steps of 1 ulp (unit in last place). More correctly, going to float is not dequantization, but rather requantization to a higher precision quantizer. The analysis of how quantization error propagates when you work in 8 bits or whatever is the same analysis you should do for how float error propagates through a series of operations. That said I will henceforth be sloppy and mostly talk about floats as "dequantized" and assume that 1 ulp is much smaller than the precision that we care about.)

So lets go back and start at the beginning :

Linear uniform scalar quantization

If our input values x are all equally probable ( P(x) is a constant ), and the error metric we care about is linear L1 or L2 norm, then the optimal quantizer is just equal size buckets with restoration to center of bucket.

(for L1 norm the total error is actually the same for any restoration point in the bucket; for L2 norm total error is minimized at center of bucket; for L1 norm the maximum error is minimized at center of bucket)

We'll now specifically look at the case of an input value in [0,1) and quantizing to N buckets. The primary options are :


int quantize_floor( float x , int N )
{
    return (int)( x * N );
    // or floor( x * N );
    // output is in [0, N-1] , input x in [0,1) not including 1.0
}

float dequantize_floor( int q, int N )
{
    return (q + 0.5f ) * (1.f / N);
}

int quantize_centered( float x, int N )
{
    return (int)( x * (N-1) + 0.5f );
    // or round( x * (N-1) )
    // output is in [0, N-1] , input x in [0,1] , including 1.0 is okay
}

float dequantize_centered( int q, int N )
{
    return q * (1.f / (N-1));
}

The rule of thumb for these quantizers is you either bias by 0.5 in the quantizer, or in the dequantizer. You must bias on one side or the other, not both and not neither! The "floor" quantizer is "bias on dequant", while the "centered" quantizer is "bias on quant".

Visually they look like this, for the case of N = 4 :

(the top is "floor" quantization, the bottom is "centered")


In both cases we have 4 buckets and 4 restoration points. In the "floor" case the terminal bucket boundaries correspond to the boundaries of the [0,1) input interval. In the "centered" case, the terminal buckets are centered on the [0,1) endpoint, which means the bucket boundaries actually go past the end, but they restore exactly to the endpoints.

If your input values are actually all equally likely and the error metric that you care about is just L2 norm, then "floor" quantization is strictly better. You can see that the bucket size for "floor" quantization is 1/4 vs. 1/3 for "centered", which means the maximum error after dequantization is 1/8 vs. 1/6.

In practice we often care more about the endpoints or the integers, not just average or maximum error; we suspect the probability P(x) for x = 0 and 1 is higher, and the error metric Error( dequantization( quantization(x) ) - x ) may also be non-linear, giving higher weight to the error when x = 0 and 1.

"centered" quantization also has the property of preserving integers. For example say your input range was [0,255) in floats. If you quantize to N=256 buckets with "centered" quantization, it will restore exactly to the integers.

Games should only be using centered quantization!

While in theory there are cases where you might want to use either type of quantization, if you are in games don't do that!

The reason is that the GPU standard for UNORM colors has chosen "centered" quantization, so you should do that too. Certainly you need to do that for anything that interacts with the GPU and textures, but I encourage you to just do it for all your quantization, because it leads to confusion and bugs if you have multiple different conventions of quantizer in your code base.

The GPU UNORM convention is :

float dequantize_U8_UNORM( unsigned char u8 )
{
  return u8 * (1.f/255);
}
which implies centered quantization, so please use centered quantization everywhere in games. That means : bias 0.5 on quantize, no bias on dequantize.
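The matching quantizer for that convention looks like this (my sketch; the clamp is just for safety on out-of-range inputs) :

unsigned char quantize_U8_UNORM( float f )
{
    // centered quantization : bias 0.5 on quantize, no bias on dequantize
    float v = f * 255.f + 0.5f;
    if ( v < 0.f ) v = 0.f;
    if ( v > 255.f ) v = 255.f;
    return (unsigned char) v;
}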

While on the topic of UNORM, let's look at conversion between quantized spaces with different precision. Let's do U8 UNORM to U16 UNORM for example.

The way to get that right is to think about it as dequantization followed by quantization. We dequantize the U8 UNORM back to real numbers, then quantize real numbers back to U16 :


dequant = u8 * (1.f/255);

u16 = round( dequant * 65535 );

u16 = round( u8 * (1.f/255) * 65535 );

u16 = round( u8 * 257 );

u16 = u8 * 257;

u16 = u8 * (256 + 1);

u16 = (u8<<8) + u8;

So U8 to U16 re-quantization for UNORM is : take the U8 value, and replicate it shifted up by 8.
requantize U8 UNORM to U16 UNORM :

0xAB -> 0xABAB

This obviously has the necessary property that 00 stays zero, and 0xFF becomes 0xFFFF, so 1.0 is preserved.
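You can verify that directly; this little sketch (mine) checks that bit replication matches the dequantize-then-requantize path for every U8 value :

#include <cassert>

void check_u8_to_u16_bit_replication()
{
    for ( int u8 = 0; u8 < 256; u8++ )
    {
        int by_replication = ( u8 << 8 ) | u8;                       // 0xAB -> 0xABAB
        int by_requant     = (int)( ( u8 / 255.0 ) * 65535.0 + 0.5 ); // dequant then requant with round
        assert( by_replication == by_requant );
    }
}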

Replicating the value into the low bits like that is something we call "bit replication". Let's take a moment to see why it works exactly in some cases and only approximately in others.

Bit Replication for re-quantization to higher bit counts

Bit replication is often used in games to change the bit count of a quantized value (to "requantize" it).

For example it's used to take 5-bit colors in BC1 to 8-bit :


The top 3 bits of the 5-bit value are replicated to the bottom :

abcde -> abcde|abc

giving an 8 bit value

Bit replication clearly gets the boundary cases right : all 0 bits to all 0's (dequantizes to 0.0), and all 1 bits to all 1 bits (dequantizes to 1.0); in between bit replication linearly increases the low bits between those endpoints, so it's obviously sort of what you want. In some cases bit replication corresponds exactly to requantization, but not in others.

With a B-bit UNORM value, it has N = 2^B values. The important thing for quantization is the denominator (N-1). For example with a 5-bit value, (N-1) = 31 is the denominator. It becomes clear if we think about requantization as changing the *denominator* of a fraction.


Requantization from 5 bits to 10 bits is changing the denominator from 31 to 1023 :

dequant( 5b ) = 5b / 31.0;
requant_10( x ) = round( x * 1023.0 );

requant_5_to_10 = round( x * 1023 / 31 );

1023/31 = 33 exactly, so :

requant_5_to_10 = x * 33

in integers.  And 33 = (32 + 1) = shift up 5 and replicate

requantization from 5 to 10 bits is just duplicating the bits shifted up
abcde -> abcde|abcde

What that means is bit replication from B to 2B is exactly equal to what you would get if you dequantized that number to UNORM and requantized it again.

This is of course general for any B :


denominator for B is (N-1)
denominator for 2B is (N^2 - 1)

requantization is *= (N^2 - 1) / (N-1)

(N^2 - 1) = (N-1) * (N+1)

so 

requantization is *= (N+1)

which is bit replication

Now more generally for bit replication to some number of bits that's not just double (but <= double, eg. between B and 2B) :

b between B and 2B
n = 2^b

requant_B_to_b(x) = round( x * (n-1) / (N-1) )

requant_B_to_b(x) = round( x * (N+1) * (n-1) / (N^2-1) )

requant_B_to_b(x) = round( (x bit replicated to 2B) * ( scale down ) )

bit replication from B to b is :

bitrep(x) = (x bit replicated to 2B) >> (2B - b)

that is, just replicate to 2B and then truncate low bits to get to b

when b = 2B , these are exactly equal as we showed above

obviously also at b = B (NOP)
and also at b = B+1 (adding one bit)

in the range b = [B+2, 2B-1] they are not quite exactly equal, but close

Let's look at an example, 5 bits -> 8 bits :

bitdouble( 5b ) = (5b * 33)

requant_5_to_8(5b) = round( (5b * 33) * ( 255.0 / 1023.0 ) )

bitrep_5_to_8(5b) = (5b * 33) >> 2

we can see where the small difference comes from :

bit replication just truncates off the 2 bottom bits

requantization does * (255/1023) , which is almost a /4 (like >>2) but not quite
and the requantization also rounds instead of truncating

So we can see that bit replication is similar to centered UNORM requantization, but not quite the same.

Now, bit replication is used in BC7, ASTC, etc. Is it a source of error? No, not if you do your encoder right. What it does mean is that you can't just find the 5-bit color value by doing a centered quantizer to 5 bits. Instead you have to ask what does the 5-bit value bit-replicate to, and find the closest value to your input.
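For example, picking a 5-bit BC1-style endpoint for an 8-bit target could brute-force over what each 5-bit value bit-replicates to (a sketch of the idea, not code from any particular encoder) :

int closest_5bit_endpoint_by_bit_replication( int target_8bit )
{
    int best = 0, best_err = 1 << 30;
    for ( int v = 0; v < 32; v++ )
    {
        int rep = ( v << 3 ) | ( v >> 2 );   // abcde -> abcde|abc , the 8-bit value the decoder will see
        int err = ( rep > target_8bit ) ? ( rep - target_8bit ) : ( target_8bit - rep );
        if ( err < best_err ) { best_err = err; best = v; }
    }
    return best;
}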

Quantizing infinite signed values and the deadzone quantizer

So far we've talked about quantizing finite ranges, specifically [0,1) but you can map any other finite range to that interval. Let's have a brief look at quantizing infinite ranges.

If you just quantize a signed number to a signed quantized number, then you can use the above _floor or _centered quantizers without thinking any more about it. You will have uniform buckets across the whole number line. But what we often want to do is take a signed input number and quantize it to *unsigned* and separate out the sign bit, to create a sign+magnitude representation. (this makes the most sense with values whose probability P(x) is symmetric about zero and whose mean is at zero; eg. after a transform that subtracts off the mean)

One reason we might want to do that is because most of our schemes for sending unbounded (variable length) numbers work on unsigned numbers. For example : Encode Mod and Exp Golomb .

Now one option would be to quantize to signed ints and then Fold up Negatives to make an unsigned number to feed to your variable length scheme.

There are reasons we don't like that in data compression. Folded up negatives have a number line like : {0, -1, 1, -2, 2, -3 ... }

The annoying thing about that for data compression is that if you have a probability model like a Laplacian that decreases with absolutely value of x, the probabilities have these steps where values are repeated : { P(0), P(1), P(1), P(2), P(2), ... } and coding them with something like exp-golomb is no longer quite correct as they don't progressively fall off. Some codecs in the past have used tricks to reduce this (eg. JPEG-LS and CALIC) by doing things like being able to flip the sign so that you get either {0, -1, 1, -2, ... } or {0, 1, -1, 2, ... } depending on whether positive or negative is more probable.

Rather than do all that, let's assume you want to extract the sign bit and send it separately. So you are sending only the magnitude.

So we have taken the sign out and now only have a one sided interval [0, inf) to quantize. You can take that one-sided interval and just apply floor or centered quantization to it :


unsigned half_line_quantize( float x )
{
    ASSERT( x >= 0.f );
    //return (unsigned) floorf( x );    // floor quantizer
    //return (unsigned) ( x + 0.5f );   // centered quantizer (round)
    const float bias = 0.5f; // 0.f for floor , 0.5f for centered
    return (unsigned) ( x + bias );
}

but something a bit funny has happened.

Floor and centered quantization now just act to shift where the boundary of the 0 bin is. But the 0 bin gets mirrored onto both sides of zero (since the sign was removed), so to make the 0 bin the same size as the other bins, its boundary on the half line should be at 0.5 (half the size of the other bins on the half interval). (I'm assuming here that your quantization bucket size is 1.0 ; for general sized quantization buckets just scale x before it gets here).

It's clear that the zero bin is a bit special, so we usually just go ahead and special case it :


pseudocode signed_line_quantizer( float x )
{
    // x signed

    float ax = fabsf(x);

    if ( ax < deadzone )
    {
        // special bucket for zero :
        // don't send sign bit
        return 0;
    }
    else
    {
        // do send sign bit of x
        // do floor quantizer above the zero bucket ;
        // labels start at 1 so they don't collide with the zero bucket :
        return 1 + floor(ax - deadzone);
    }
}

Now if you want the zero bucket to have the same size as all others, you would set deadzone = 0.5 (it's half the zero bucket size on the full line). If you want to use a uniform floor quantizer on the half line, that would correspond to deadzone = 1.0 (making the zero bucket actually twice the size of others after mirroring to the negative half of the line).
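For reference, a matching dequantizer might look like this (my sketch, in the same pseudocode style, assuming the label convention above where non-zero buckets start at 1) :

pseudocode signed_line_dequantizer( int q, bool negative, float deadzone )
{
    if ( q == 0 )
        return 0.f;   // zero bucket restores to zero ; no sign bit was sent

    // bucket q covers [ deadzone + (q-1) , deadzone + q ) on the half line ;
    // restore to its center here (see the dequantization section below for
    // better, probability-weighted restoration points)
    float ax = deadzone + (q - 1) + 0.5f;
    return negative ? -ax : ax;
}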

What's been found in data compression is that a "deadzone" larger than equal size buckets (larger than 0.5) is beneficial. There are two primary reasons :

We use codecs where coding zeros is especially cheap, so sending more zeros is very desirable. So larger deadzone in the quantizer will give you more zeros, hence cheaper coding, and this is a greater benefit than the loss in quality. This is sort of a hacky way of doing some rate-distortion optimization, like trellis quantization but without any work.

The other reason is perceptual modeling; many human perception systems (eyes and ears) are less sensitive to the initial onset of a signal than they are to variations once the signal is present. Signals near zero are not detected by humans at all until they reach some threshold, and then once they pass the threshold there's a finer discrimination of level. For example the human ear might not detect a harmonic until it is 10 dB, but then distinguish volume levels at 1 dB changes after that.

Essentially your quantizer has two parameters, the bucket size for zero, and then the bucket size for values above zero. This is a very simple form of a more general variable quantizer.

In theory you would like to have variable size bins, such that each bin corresponds to an equal amount of perceptual importance (eg. larger bins where the values are less important). For the most part we now do that by applying a nonlinear transformation to the value before it reaches the uniform quantizer, rather than trying to do variable size bins. For example you might take log(x) before quantizing if you think precision of high values is less important. Another common example is the "gamma corrected" color space (or sRGB) for images; that's a non-linear transform applied to the signal (roughly pow 2.2) to map it to a space that's more perceptually uniform so that the quantization buckets give more precision where it's needed.
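A tiny sketch of that idea (mine), using a plain gamma 2.2 power as a stand-in for the real sRGB transfer function, and the centered quantizer from above :

#include <math.h>

int quantize_gamma( float linear_x, int N )
{
    float perceptual = powf( linear_x, 1.f / 2.2f );  // to (approximate) gamma space
    return quantize_centered( perceptual, N );        // uniform buckets in gamma space
}

float dequantize_gamma( int q, int N )
{
    float perceptual = dequantize_centered( q, N );
    return powf( perceptual, 2.2f );                  // back to linear
}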

Something to watch out for is that a lot of code uses a deadzone quantizer without being clear about it. If you see something like :

int half_line_quantizer_thats_actually_a_deadzone( float x )
{
  ASSERT( x >= 0.f );
  return (int) x;
}
That's actually a deadzone quantizer with a 2x sized zero bin, if it's being used after sign removal.

In the olden days, variable-size quantization buckets were used as a kind of entropy coder. They would have smaller buckets in higher probability regions and larger buckets in lower probability regions, so that the quantized output value had equal probability for all bins. Then you could send the quantized value with no entropy coding. This is now almost never done, it's better to use quantization purely for error metric optimization and use a separate entropy coder on the output.

Topics in dequantization

Just briefly some topics in dequantization.

For values that are all equally likely, under an L2 (SSD/RMSE) error norm, dequantization to the center of the bucket is optimal. More generally the restoration point for each bucket should minimize the error metric weighted by the probability of that input value.

An easy case is with an L2 error metric but a non-uniform probability. Then the error in a given bucket for a restoration point is :

L2 error of restoring to r in this bucket :

E = Sum_x P(x) * ( r - x )^2

( Sum_x for x's in this bucket )

find r that minimizes E by taking d/dr and setting to zero :

d/dr E = 0

d/dr E = Sum_x P(x) * ( r - x ) * 2

Sum_x P(x) * ( r - x ) = 0

Sum_x P(x) * r = Sum_x P(x) * x

r = ( Sum_x P(x) * x ) / ( Sum_x P(x) )

that's just the expectation value of x in the bucket

we should restore to the average expected 'x' value in the bucket.

A common case of that is for a skewed probability distribution - something like Laplacian or Poisson with a falloff of probabilities away from the peak - we should restore each bucket to a value that's skewed slightly towards the peak, rather than restoring the center.

Now if you have a mathematical model of P(x) then you could compute where these centers should be, and perhaps store them in a table.

What's often better in practice is just to measure them experimentally. Do trial runs and record all the values that fall into each quantization bucket and take their mean - that's your restoration point.

Then you could store those measured restoration points in constants in your code, OR you could measure them and store them per-data item. (for example an image compressor could transmit them per image - maybe not all but a few of the most important ones).
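That measurement is just a per-bucket mean; something like this sketch (my own; the quantizer callback and bucket count are whatever your codec uses) :

#include <vector>

std::vector<float> measure_restoration_points( const std::vector<float> & training_values,
                                               int (*quantize)( float x, int N ), int N )
{
    std::vector<double> sum( N, 0.0 );
    std::vector<int>    count( N, 0 );
    for ( float x : training_values )
    {
        int q = quantize( x, N );
        sum[q] += x;                 // accumulate the values that land in each bucket
        count[q] += 1;
    }
    std::vector<float> restore( N );
    for ( int q = 0; q < N; q++ )
        restore[q] = count[q] ? (float)( sum[q] / count[q] )   // mean of the bucket
                              : ( q + 0.5f ) / N;              // empty bucket : fall back to center
    return restore;
}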

Another thing you can do in dequantization is to not always restore to the same point. I noted briefly previously that if what you care about is L1 norm, then any restoration point in the bucket has the same error. Rather than just pick one, you could restore to any random point in the bucket and that would give the same expected L1 norm.

L2 norm strongly prefers the mean (minimizing L2 is blurring or smoothing, while L1 allows lots of noise), but perceptually it may be better to add some randomness. You could restore to mean in the bucket plus a small amplitude of noise around there. Again this noise could be global constant, or could be sent per-image, or per-band; it could also be predicted from local context so you could have more or less noisy areas.

Note that adding noise in dequantization is not the same as just adding noise arbitrarily after the fact. The values are still within the quantization bucket, so they could have been the true source values. That is, we can reframe dequantization as trying to guess the source given the quantized version :


Encoder had original image I

made Q = quant( I )

Q was transmitted

rather than just run I' = dequant( Q )

we instead pose it as :

we want to find I'
such that
Q = quant( I' )
and I' has the maximum probability of being the original I
or I' has the most perceptual similarity to our guess of I

The key thing here is that noise within the quantization bucket keeps the constraint Q = quant(I') satisfied.

As an example I'll mention something I've done in the past for wavelet bit-plane truncation.

Wavelet coding converts an image into activity residuals at various frequency subbands. These are initially quantized with a uniform+deadzone quantizer (if a floating point wavelet transform was used). Then in many codecs they are sent progressively in bit planes, so the highest bits are sent first, then lower bits, so that you get the most important bits first. You can then truncate the stream, cutting off transmission of lower bits in the higher subbands, effectively increasing the quantizer there. This is done in JPEG2000 with the EBCOT scheme for example.

So a given wavelet residual might be sent like :


value 45

= 101101

only top 2 bits sent :

10xxxx

the others are cut off.

In the decoder you know which bits you got and which are missing, which is equivalent to a larger quantization bucket.

The classic option (eg. SPIHT) was just to fill the lost xx bits with zeros :

10xxxx -> 100000

This makes values that are too low and is generally very smoothing (high frequency detail just goes away)

You might think: it's a quantization bucket, we should restore to the middle, which is 0.5, which is the next bit on :

10xxxx -> 101000 or 100111

That is much too high; it's larger than the expectation and actually looks like a sharpen filter. The reason is that wavelet amplitudes have P(x) strongly skewed towards zero, so the mean value is way below the middle of the bucket.

Restoring to 0.25 is a bit better :

10xxxx -> 100100

but even better is to just measure the mean in the image for each missing bit count; that mean depends on how large our value was (the part that's not truncated).

Finally in addition to restoring the missing bits to mean, you could add randomness in the dequantization, either within the quantization bucket (below the bottom bit), or in the low part of the missing bits (eg. if 4 bits are missing the bottom 2 might get some randomness). You can compute the amount of randomness desired such that the decompressed image matches the high frequency energy of the original image.
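Putting those pieces together, restoring truncated bit planes might look like this sketch (mine) : shift the kept bits back up and fill the missing range with a fraction (0.5 for bucket center, ~0.25 as above, or a measured per-image mean), optionally with a little noise in the low part.

int restore_truncated( int kept_bits, int num_missing_bits, float fraction )
{
    int missing_range = 1 << num_missing_bits;          // size of the lost low part
    int fill = (int)( fraction * missing_range );       // e.g. 0.25 * range, or measured mean
    return ( kept_bits << num_missing_bits ) + fill;
}

// example from above : 45 = 101101 , top 2 bits kept (10), 4 bits missing :
// restore_truncated( 2, 4, 0.25f ) == 36 == 100100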

And that's enough on quantization for now!

8/20/2020

Oodle 2.8.11 with RDO for BC1_WithTransparency

Oodle Texture 2.8.11 adds support for RDO encoding of "BC1_WithTransparency" and BC2. We now support RDO encoding of all BC1-7 variants.

In Oodle, "BC1_WithTransparency" doesn't necessarily mean that the texture has any 1-bit transparency. (for background: BC1 can encoding alpha values of 0 or 255 (1.0), which binary on/off alpha; when alpha is 0 the color is always 0 or black). It means that the transparency bit is preserved. In our normal "BC1" encoder, we assume that the alpha value will not be read, so we are free to choose the encoding to maximize RGB quality.

The choice of "BC1_WithTransparency" vs "BC1" should not be made based on whether the source has alpha or not, it should be made based on whether your shader will *consume* alpha or not. So for example if you just have opaque RGB textures and you need to be able to pipe them to a shader that reads A to do alpha blending, and you are unable to manage the book-keeping to pipe constant 1.0 as a shader source for the A channel, then you must use "BC1_WithTransparency" to encode opaque alpha in the texture.

When possible, we think it is best to use the "BC1" format which does not preserve alpha. Most people do not actually use binary transparency in BC1; they use BC1 for opaque textures (even for textures where the alpha is binary in the top mip, you typically need full alpha to make decent mips, so you should use BC7). On opaque textures the "BC1" format that is free to change alpha can give much higher quality. You can then just map constant 1.0 as the A value source for the shader when binding a texture that is marked as opaque.

We understand that is not always practical in your pipeline, so we are trying to make "BC1_WithTransparency" work as well as possible.

Our RDO for "BC1_WithTransparency" will never change the binary alpha state of a pixel. Because of this the RMSE is actually only RGB, the A values will never differ from the original, assuming the original only had alpha values of 0 and 255 in U8.

An example of the quality of "BC1_WithTransparency" RDO on the "mysoup" image available in the Oodle Texture sample run :

otexdds bc1 mysoup1024.png r:\mysoup1024.dds --verbose
OodleTex_BC1 RMSE per texel: 7.0511

otexdds bc1a mysoup1024.png r:\mysoup1024.dds --verbose
OodleTex_BC1_WithTransparency RMSE per texel: 7.0510

otexdds bc1 mysoup1024.png r:\mysoup1024.dds --verbose --rdo
OodleTex_BC1 RMSE per texel: 7.5995

otexdds bc1a mysoup1024.png r:\mysoup1024.dds --verbose --rdo
OodleTex_BC1_WithTransparency RMSE per texel: 7.6006
On photographic images like this without a lot of opaque-black, the quality of "BC1_WithTransparency" is almost identical to "BC1".

On images that mix opaque black and other colors, the quality difference can be severe :

otexdds bc1 frymire.png r:\out.dds --verbose
OodleTex_BC1 RMSE per texel: 6.1506

otexdds bc1a frymire.png r:\out.dds --verbose
OodleTex_BC1_WithTransparency RMSE per texel: 12.1483
On "Frymire" from the Waterloo Bragzone set, the RMSE is nearly double with "BC1_WithTransparency".

We have also updated the Unreal integration for Oodle Texture to use "BC1_WithTransparency", as Unreal expects to be able to fetch opaque A from the texture on all BC1 encodings. Prior to 2.8.11 we were incorrectly using our "BC1" format in Unreal, which could change opaque black texels to transparent black.

Note that "BC1_WithTransparency" RDO's roughly the same as "BC1", so we expect compressed sizes to stay roughly the same.

7/27/2020

Performance of various compressors on Oodle Texture RDO data

Oodle Texture RDO can be used with any lossless back-end compressor. RDO does not itself make data smaller, it makes the data more compressible for the following lossless compressor, which you use for package compression. For example it works great with the hardware compressors in the PS5 and the Xbox Series X.

I thought I'd have a look at how various options for the back end lossless compressor do on BCN texture data after Oodle Texture RDO. (Oodle 2.8.9)

127,822,976 bytes of BC1-7 sample data from a game. BC1,3,4,5, and 7. Mix of diffuse, normals, etc. The compressors here are run on the data cut into 256 KB chunks to simulate more typical game usage.

"baseline" is the non-RDO encoding to BCN by Oodle Texture. "rdo lambda 40" is a medium quality RDO run; at that level visual degradation is just starting to become easier to spot (lambda 30 and below is high quality).

baseline:

by ratio:
ooLeviathan8    :  1.79:1 ,    1.4 enc MB/s , 1069.7 dec MB/s
lzma_def9       :  1.79:1 ,    8.7 enc MB/s ,   34.4 dec MB/s
ooKraken8       :  1.76:1 ,    2.2 enc MB/s , 1743.5 dec MB/s
ooMermaid8      :  1.71:1 ,    4.9 enc MB/s , 3268.7 dec MB/s
zstd22          :  1.70:1 ,    4.5 enc MB/s ,  648.7 dec MB/s
zlib9           :  1.64:1 ,   15.1 enc MB/s ,  316.3 dec MB/s
lz4hc1          :  1.55:1 ,   72.9 enc MB/s , 4657.8 dec MB/s
ooSelkie8       :  1.53:1 ,    7.4 enc MB/s , 7028.2 dec MB/s

rdo lambda=40:

by ratio:
lzma_def9       :  3.19:1 ,    7.7 enc MB/s ,   60.7 dec MB/s
ooLeviathan8    :  3.18:1 ,    1.1 enc MB/s , 1139.3 dec MB/s
ooKraken8       :  3.13:1 ,    1.7 enc MB/s , 1902.9 dec MB/s
ooMermaid8      :  3.01:1 ,    4.2 enc MB/s , 3050.6 dec MB/s
zstd22          :  2.88:1 ,    3.3 enc MB/s ,  733.9 dec MB/s
zlib9           :  2.69:1 ,   16.5 enc MB/s ,  415.3 dec MB/s
ooSelkie8       :  2.41:1 ,    6.6 enc MB/s , 6010.1 dec MB/s
lz4hc1          :  2.41:1 ,  106.6 enc MB/s , 4244.5 dec MB/s

If you compare the log-log charts before & after RDO, it's easy to see that the relative position of all the compressors is basically unchanged, they just all get more compression.

The output size from baseline divided by the output size from post-RDO is the compression improvement factor. For each compressor it is :

ooLeviathan8    : 1.7765
ooKraken8       : 1.7784
ooMermaid8      : 1.7602
ooSelkie8       : 1.5548

lzma_def9       : 1.7821
zstd22          : 1.6941
zlib9           : 1.6402
lz4hc1          : 1.5751
Leviathan, Kraken, Mermaid and LZMA all improve around 1.77 X ; ZStd and Zlib a little bit less (1.65-1.70X), LZ4 and Selkie by less (1.55X - 1.57X). Basically the stronger compressors (on this type of data) get more help from RDO and their advantage grows. ZStd is stronger than Mermaid on many types of data, but Mermaid is particularly good on BCN.

* : Caveat on ZStd & LZ4 speed here : this is a run of all compressors built with MSVC 2017 on my AMD reference machine. ZStd & LZ4 have very poor speed in their MSVC build; they do much better in a clang build. Their clang build can be around 1.5X faster; ZStd-clang is usually slightly faster to decode than Leviathan, not slower. LZ4-clang is probably similar in decode speed to Selkie. The speed numbers for ZStd & LZ4 here should not be taken literally.

It is common that the more powerful compressors speed up (decompression) slightly on RDO data because they speed up with higher compression ratios, while the weaker compressors (LZ4 and Selkie) slow down slightly on RDO data (because they are often in the incompressible path on baseline BCN, which is a fast path).

Looking at the log-log plots some things stand out to me as different than generic data :

Leviathan, Kraken & Mermaid have a smaller gap than usual. Their compression ratio on this data is quite similar, usually there's a bigger step, but here the line connecting them in log-log space is more horizontal. This makes Mermaid more attractive because you're not losing much compression ratio for the speed gains. (for example, Mermaid + BC7Prep is much better for space & speed than Kraken alone).

ZStd is relatively poor on this type of data. Usually it has more compression than Mermaid and is closer to Kraken, here it's lagging quite far behind, and Mermaid is significantly better.

Selkie is relatively poor on this type of data. Usually Selkie beats LZ4 for compression ratio (sometimes it even beats zlib), but here it's just slightly worse than LZ4. Part of that is the 256 KB chunking is not allowing Selkie to do long-distance matches, but that's not the main issue. Mermaid looks like a much better choice than Selkie here.


Another BCN data set :

358,883,720 bytes of BCN data. Mostly BC7 with a bit of BC6. Mix of diffuse, normals, etc. The compressors here are run on the data cut into 256 KB chunks to simulate more typical game usage.

baseline :

by ratio:
ooLeviathan8    :  1.89:1 ,    1.1 enc MB/s ,  937.0 dec MB/s
lzma_def9       :  1.88:1 ,    7.6 enc MB/s ,   35.9 dec MB/s
ooKraken8       :  1.85:1 ,    1.7 enc MB/s , 1567.5 dec MB/s
ooMermaid8      :  1.77:1 ,    4.3 enc MB/s , 3295.8 dec MB/s
zstd22          :  1.76:1 ,    3.9 enc MB/s ,  645.6 dec MB/s
zlib9           :  1.69:1 ,   11.1 enc MB/s ,  312.2 dec MB/s
lz4hc1          :  1.60:1 ,   73.3 enc MB/s , 4659.9 dec MB/s
ooSelkie8       :  1.60:1 ,    7.0 enc MB/s , 8084.8 dec MB/s

rdo lambda=40 :

by ratio:
lzma_def9       :  4.06:1 ,    7.2 enc MB/s ,   75.2 dec MB/s
ooLeviathan8    :  4.05:1 ,    0.8 enc MB/s , 1167.3 dec MB/s
ooKraken8       :  3.99:1 ,    1.3 enc MB/s , 1919.3 dec MB/s
ooMermaid8      :  3.69:1 ,    3.9 enc MB/s , 2917.8 dec MB/s
zstd22          :  3.65:1 ,    2.9 enc MB/s ,  760.0 dec MB/s
zlib9           :  3.36:1 ,   19.1 enc MB/s ,  438.9 dec MB/s
ooSelkie8       :  2.93:1 ,    6.2 enc MB/s , 4987.6 dec MB/s
lz4hc1          :  2.80:1 ,  114.8 enc MB/s , 4529.0 dec MB/s

On this data set, Mermaid lags behind the stronger compressors by more, and it's almost equal to ZStd. On BCN data, the strong compressors (LZMA, Leviathan, & Kraken) have less difference in compression ratio than they do on some other types of data. On this data set, Selkie pulls ahead of LZ4 after RDO, as the increased compressibility of post-RDO data helps it find some gains. Zlib, LZ4, and Selkie have almost identical compression ratios on the baseline pre-RDO data, but zlib pulls ahead post-RDO.

The improvement factors are :

ooLeviathan8   :    2.154
ooKraken8      :    2.157
ooMermaid8     :    2.085
ooSelkie8      :    1.831

lzma_def9      :    2.148
zstd22         :    2.074
zlib9          :    1.988
lz4hc1         :    1.750
Similar pattern, around 2.15X for the stronger compressors, around 2.08X for the medium ones, and under 2.0 for the weaker ones.
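
Since the input size is the same for both runs, the improvement factor is just the post-RDO ratio divided by the baseline ratio, so it can be sanity-checked directly against the tables above (approximately, since the ratios shown are rounded) :

// improvement factor = baseline output size / post-RDO output size
//                    = post-RDO ratio / baseline ratio  (the input size cancels)
double improvement_factor(double baseline_ratio, double rdo_ratio)
{
    return rdo_ratio / baseline_ratio;
}

// eg. Kraken on this set : 3.99 / 1.85 ~= 2.16 , matching the 2.157 above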


Conclusion:

Oodle Texture works great with all the lossless LZ coders tested here. We expect it to work well with all packaging systems.

The compression improvement factor from Oodle Texture is similar and good for all the compressors, but stronger compressors like Oodle Kraken are able to get even more benefit from the entropy reduction of Oodle Texture. Not only do they start out with more compression on baseline non-RDO data, they also improve by a larger multiplier on RDO data.

The Oodle Data lossless compressors are particularly good on BCN data; their advantage over alternatives like zlib and ZStd is even bigger here than it is on some other data types. For example, Oodle Mermaid is often slightly lower compression than ZStd on other data types, but is slightly higher compression than ZStd on BCN.

Mermaid has a substantial compression advantage over zlib on post-RDO BCN data, and decompresses 5-10X faster, making Mermaid a huge win over software zlib (zip/deflate/inflate).

7/26/2020

Oodle 2.8.9 with Oodle Texture speed fix and UE4 integration

Oodle 2.8.9 is now shipping, with the aforementioned speed fix for large textures.

Oodle Texture RDO is always going to be slower than non-RDO encoding; it simply has to do a lot more work. It has to search many possible encodings of the block to BCN, it has to evaluate those possible encodings for both R & D, it has to use more sophisticated D functions, and it has to search for good encodings in a non-convex search space. It simply has to be something like 5X slower than non-RDO encoding. But previously we also had a perf bug where the working set got larger than cache size, which caused a performance cliff, and that shouldn't happen. If you do find any performance anomalies, such as encoding a specific texture or encoding with specific options being much slower than expected, please contact RAD.
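
For intuition, the core of any RDO scheme is scoring candidate block encodings with a Lagrangian cost J = D + lambda*R and keeping the cheapest; something of this flavor has to run per block, over many candidates, which is why RDO encoding is inherently several times slower. This is only a generic sketch of that idea, not Oodle Texture's actual code (the names are made up) :

struct Candidate
{
    double D;   // distortion of this candidate block encoding
    double R;   // estimated rate (bits) of this candidate block encoding
};

// pick the candidate that minimizes J = D + lambda * R
int pick_block_encoding(const Candidate * c, int count, double lambda)
{
    int best = 0;
    double bestJ = c[0].D + lambda * c[0].R;
    for (int i = 1; i < count; i++)
    {
        double J = c[i].D + lambda * c[i].R;
        if ( J < bestJ ) { bestJ = J; best = i; }
    }
    return best;
}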

timerun 287 vs 289

hero_xxx_n.png ; 4096 x 4096
timerun textest bcn bc7 r:\hero_xxx_n.png r:\out.dds -r40 --w32
got opt: rdo_lagrange_parameter=40

Oodle 2.8.7 :

encode time: ~ 8.9 s
per-pixel rmse (bc7): 0.8238
---------------------------------------------
timerun: 10.881 seconds

Oodle 2.8.9 :

encode time: 4.948s
per-pixel rmse (bc7): 0.8229
---------------------------------------------
timerun: 6.818 seconds
the "timerun" time includes all loading and saving and startup, which appears to be about 1.9s ; the RDO encode time has gone from about 8.9s to 4.95 s

(Oodle 2.8.7 textest bcn didn't log encode time so that's estimated; the default number of worker threads has changed, so use --w32 to make it equal for both runs)

We are now shipping a UE4 integration for Oodle Texture!

The Oodle Texture integration is currently only for the Oodle Texture RDO/non-RDO BCN encoders (not BC7Prep). It should be pretty simple: once you integrate it, your Editor will just do Oodle Texture encodes. The texture previews in the Editor let you see how the encodings look, and that's what you pack in the game. It uses the Unreal Derived Data Cache to avoid regenerating the encodings.

We expose our "lambda" parameter via the "LossyCompressionAmount" field which is already in the Editor GUI per texture. Our engine patches further make it so that LossyCompressionAmount inherits from LODGroup, and if not set there, it inherits from a global default. So you can set lambda at :

per texture LossyCompressionAmount

if Default then look at :

LODGroup LossyCompressionAmount

if Default then look at :

global lambda

We believe that best practice is to avoid having artists tweak lambda a lot per-texture. We recommend leaving it at "Default" (inherit) as much as possible. The tech leads should set the global lambda to what's right for your game, and possibly set up the LODGroups to override that for specific texture classes. Only rarely should you need to override on specific textures.
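
Conceptually, the effective lambda is resolved by a simple chain of fallbacks; a sketch of that logic (hypothetical names, not the actual integration code) :

// "Default" means inherit from the next level up
int resolve_lambda(int texture_lambda,   // per-texture LossyCompressionAmount
                   int lodgroup_lambda,  // LODGroup LossyCompressionAmount
                   int global_lambda)    // project-wide default
{
    const int kDefault = -1;
    if ( texture_lambda  != kDefault ) return texture_lambda;
    if ( lodgroup_lambda != kDefault ) return lodgroup_lambda;
    return global_lambda;
}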

LIMITATIONS :

Currently our Oodle Texture for UE4 integration only works for non-console builds (eg. Windows, Linux, Mac host PC builds). It cannot export content for PS4/5/Xbox/Switch console builds. We will hopefully be working with Epic to fix this ASAP.

If you are a console dev, you can still try Oodle Texture for UE4, and it will work in your Editor and if you package a build for Windows, but if you do "package for PS4" it won't be used.

Sample package sizes for "InfiltratorDemo" :

InfiltratorDemo-WindowsNoEditor.pak 

No compression :                            2,536,094,378

No Oodle Data (Zlib), no Oodle Texture :    1,175,375,893

Yes Oodle Data,  no Oodle Texture :           969,205,688

No Oodle Data (Zlib), yes Oodle Texture :     948,127,728

Oodle Data + Oodle Texture lambda=40 :        759,825,164

Oodle Texture provides great size benefit even with the default Zlib compression in Unreal, but it works even better when combined with Oodle Data.
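
For a rough sense of the relative savings, computed from the pak sizes above :

// vs. the stock Zlib pak (1,175,375,893 bytes) :
//   Oodle Data alone       : 969,205,688 / 1,175,375,893 ~= 0.82  -> ~18% smaller
//   Oodle Texture alone    : 948,127,728 / 1,175,375,893 ~= 0.81  -> ~19% smaller
//   Oodle Data + Texture   : 759,825,164 / 1,175,375,893 ~= 0.65  -> ~35% smaller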

7/15/2020

Two News Items

1. Mea Culpa.

We shipped Oodle Texture with a silly performance bug that made it slower than it should have been.

The good news is the next version will be much faster on very large images, with no algorithmic changes (same results and quality). The bad news is we have lots of people testing it and seeing slower speeds than we expected.

2. Fastmail tua culpa.

Some of my sent emails have not been reaching their destination. If you sent me a question and did not get a response, I may have responded and it just got lost. Please contact me again!

Details for each :

1. Mea Culpa.

We shipped Oodle Texture with a silly performance bug that made it slower than it should have been.

The good news is the next version will be much faster on very large images, with no algorithmic changes (same results and quality). The bad news is we have lots of people testing it and seeing slower speeds than we expected.

It was sort of a story of being too "mature" again.

In our image analysis process, we do a lowpass filter with a Gaussian. In coding that up, I was experimenting with lots of different ideas, so I just did a first quick dumb implementation as a temp thing to get the results and see how it worked. I always intended to come back and rewrite it in the optimization phase if it worked out. (90% of the stuff I try in the experimentation phase just gets deleted, so I try to avoid spending too much time on early implementation until we work out what method is the one we want to ship).

So we tried various things and eventually settled on a process, and came back to optimize what we settled on. I immediately thought, oh well this Gaussian filter I did was a really dumb implementation and obviously we know there are various ways to do fast implementations of that, that's an obvious place to look at speed.

But rather than just dive in and optimize it, I decided to be "mature". The mature programmer doesn't just optimize code that is obviously a bad implementation. Instead they profile, and measure how much time it is actually taking. That way you can prioritize your efforts and spend your programming time where it has the biggest impact. Programmer time is zero-sum; if you spend time on X it takes away time from Y, so you can't just say "yes, of course we should do X", you have to say "X is more important than Y". If I'm optimizing the Gaussian I'm not doing something else important.

So I profiled it, and it was ~1% of total CPU Time. So I thought hrmm, well that's surprising, but I guess it's not important to total CPU time, so I won't optimize it.

I was wrong. The problem was I tested on an image that was too small. There's a huge cliff in performance that happens when the image doesn't fit in cache.

(for people who are aware of the performance issues in image filtering, this is obvious. The main issue for CPU image filtering is the cache usage pattern; there are various ways to fix that, tiles and strips and different local access patterns; that's well known)

Images up to 1024*1024 easily fit in cache (even in 4-float format at 16 bytes per pel, that's 16 MB). Up to 2k x 2k can almost fit in the big 64 MB L3 that is increasingly common.

At 8k x 8k , a 4-float image is 1 GB. (it's unintuitive how fast the memory footprint grows with resolution!). At that size you get a huge performance penalty from naive filtering implementations, which are constantly cache missing.
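
The footprint arithmetic, for concreteness :

// 4-float pixels = 16 bytes per pel :
//   1024 x 1024 x 16 =      16,777,216  (~16 MB , fits in a big L3)
//   2048 x 2048 x 16 =      67,108,864  (~64 MB , borderline even for a 64 MB L3)
//   8192 x 8192 x 16 =   1,073,741,824  (~1 GB  , nowhere near any cache)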

Foolishly, I did my one profile of this code section on a 1k x 1k image, so it looked totally fine.

The solution is simple and we'll have it out soon. (In typical Charles & Fabian obsessive perfectionism style, we can't just fix it "good enough", we have to fix it the best way possible, so we'll wind up with the super over-engineered world's best implementation.) I just feel a bit embarrassed about it because doing good profiling and making smart implementation decisions is our specialty and I totally F'ed it.

I think it is an instructive case of some general principles :

1A. Profiling is hard, and a little bit of profiling is worse than none.

In this case there's a huge performance cliff when you go from working sets that fit in cache to ones that don't. That depends on cache size and machine; it can also depend on how much other CPU work is happening that's competing for cache. It depends on machine architecture; for example, we've seen many compressors perform horribly on ARM big-little systems, where latency to main memory can be much bigger than is typical on x86/64 desktops, because their architects did not profile on that type of machine.

Profiling is a special case of the more general "measurement fallacy". People have a very misplaced faith in a measured number. That can be extremely misleading, and in fact a bad measurement can be worse than not measuring at all. For example, medical trials without valid controls or with insufficiently large samples can lead to very harmful public policy decisions if their results are not ignored.

You can be making a completely garbage point, but if you start showing that it was 17.20 and here's a chart with some points, all of a sudden people start thinking "this is rigorous". To trust any measurement you have to dig into how it was done: does it actually measure what you want to know? Were noise and biasing factors controlled and measured? You have to pose the right question, measure the right thing in the right way, sample the right group, do statistical analysis of error and bias, etc.; without that it's fucking pseudoscience garbage.

I see far too many people who know about this measurement problem, but then ignore it. For example, pretty much everyone knows that GDP is a terrible measure of the overall economic health of a country, and yet they still talk about GDP all the time. Maybe they'll toss in a little aside ("GDP isn't really what we should talk about, but...") and then after the "but" they proceed to do a whole article looking at GDP growth. This is the trap! When you have a bad measurement, you MUST ignore it and not even think about it at all. (see also: graduation rates, diet, cost of social programs, etc. etc.)

You see this all the time with profiling where people measure some micro-benchmark of a hash table, or a mutex lock, and find the "fastest" implementation. These things are massively context dependent and measuring them accurately in a synthetic benchmark is nearly impossible (it would require very complex simulation of different input types, memory layouts and working set sizes, different numbers of threads in different thread usage patterns).

The problem with a bad measurement is it gives a number which then people can brandish as if it's unimpeachable (X was 4 cycles and Y was 5 cycles, therefore we must use X even though it's complicated and fragile and harder to use, and in fact after all the surrounding implementation it winds up being much worse). It far too often makes people believe that the result they saw in one measurement is universally true, when in fact all you observed is that *if* measured in that particular way in that particular situation, this is what you saw that one time. (reminds me of the old "black sheep" joke about the engineer, physicist and the mathematician).

There are lots of common mistakes in profiling that we see all the time, unfortunately, as people try Oodle and feel the need to measure performance for themselves. It's not that easy to just "measure performance". We try to be very careful about using data sets that are realistic samples of expected data, we remove fluctuations due to thermal throttling or single-core boosts, we run multiple times to check repeatability of results, etc. This is literally our job and we spend a lot of time thinking about it, and sometimes we still get it wrong, and yet every single day we get people going "oh I just cooked up this benchmark in two seconds and I'm getting weird results". See also : Tips for benchmarking a compressor and The Perils of Holistic Profiling .

In the modern world you have to consider profiling with N other threads running that you don't control, you can't assume that you get the whole machine to yourself. For example a very common huge mistake that I see is unnecessary thread switches; let's just hand off to this other thread very briefly then come back to our first thread to continue the work. That may be totally fine when you test it on a machine that is otherwise idle, but if you're competing for CPU time on a machine that has a ton of other threads running, that "little thread switch" to pop over to a different async task might take seconds. Over-threading tends to improve benchmarks when run on machines in isolation but hurt performance in the real world.

(See also *2 at end)

1B. Optimization is good for its own sake.

The whole idea that you should "avoid premature optimization" has many flaws and should be one of the learnings that you forget. Yes yes, of course don't go off and spend a month writing an assembly version of a loop without being sure it's an important thing to do, and also that you've got the right overall algorithmic structure and memory access pattern and so on. I'm not advocating just being dumb.

But also, don't use a really slow LogPrintf() implementation just because it doesn't show up in profiles.

When you have bad/slow code, it changes the way you use it. You wind up avoiding that function in high performance areas. It makes you code around the performance bug rather than just writing things the way you should.

I've worked at a number of companies where they disable asserts in debug builds because they've gotten too slow. I of course try turning on asserts, and a thousand of them fire because nobody else is testing with asserts on. The solution should have been to fix the speed of the debug build to something usable, not to avoid important programming tools.

Sometimes when you do a good implementation of something (even when it wasn't necessary for any particular profile of total app performance), it becomes a really useful component that you then wind up using all over. Like maybe you do a really cool memcpy that can do interleaves and shuffles, that winds up being a really useful tool that you can build things with, that you wouldn't have thought about until you did the good implementation of it.

It's also just fun and fun is good.

1C. Trust what is obviously true.

When the truth is staring you in the face, but some measurement, or some complex second thoughts contradict it, you need to stop and reevaluate. The obvious truth is probably right and your overthinking or being too "mature" with measuring things may be misleading you.

In this case the fact that a naive filter implementation was a temp place-holder and needed to be optimized was obviously true, and some over-thinking clouded that.

2. Fastmail tua culpa.

Some of my sent emails have not been reaching their destination. If you sent me a question and did not get a response, I may have responded and it just got lost. Please contact me again!

What was happening was fastmail (*1) was generating emails that failed SPF check. This would cause my sent emails to be just rejected by some receivers, with no "undelivered" response at all, so I didn't know it was happening.

The SPF record is supposed to verify that an email came from the sending mail host that it claims to (but not the sending address). Emails coming from the fastmail mail host mark themselves as being from fastmail, then the receiver can look up the SPF record at fastmail.com and see the IP's that it should have come from to verify it actually came from there. This prevents spammers from claiming to be sending mail from fastmail servers but actually using a different server. This makes it possible for receivers to have white & black lists for hosts. (SPF records do *not* verify the "from" field of the email)

I had my fastmail email set up to forward to an alias account (also inside fastmail). When I then replied to these (via SMTP through smtp.fastmail.com), it was going out identified as :

    helo=wforward1-smtp.messagingengine.com;
    client-ip=64.147.123.30
then receivers would check the SPF record for fastmail and get :
v=spf1 include:spf.messagingengine.com ?all

64.147.123.17
64.147.123.18
64.147.123.19
64.147.123.20
64.147.123.21
64.147.123.24
64.147.123.25
64.147.123.26
64.147.123.27
64.147.123.28
64.147.123.29 
which does not include the .30 IP , therefore my email was marked as an SPF failure.

Fastmail tech support was useless and unhelpful about figuring this out. It also sucks that I get no notification of the undelivered mail.

Some things that were useful :

NIST Email Authentication Tester
dmarcanalyzer SPF checker

*1: I switched to fastmail from dreamhost because dreamhost was failing to deliver my sent email. Deja vu. Why is it so fucking hard to deliver a god damn email !? (in dreamhost's case it's because they intentionally provide smtp service to lots of spammers, so the dreamhost smtp servers get into lots of blacklists)

*2: Another common problem with profiling and benchmarking I've been thinking about recently is the drawback of large tests, which you then average or sum.

People now often have access to large amounts of data to test on. That may or may not be great. It depends on whether that data is an unbiased random sampling of real world data that reflects what you care about the performance on in your final application.

The problem is that you often don't know exactly what data your code will be run on, and the data you have is just "some stuff"; you don't really know if it reflects the distribution of data that will be observed later. (This is a bit like the machine learning issue of needing a training set that is a good reflection of what will be seen in production use.)

Again like the "measurement fallacy" the "big data" test can lead to a false sense of getting an accurate number. If you test on 4 TB of sample data that does not mean the numbers that come out are more useful than a test on 1 MB of sample data.

Large data averages and totals can swamp interesting cases with lots of other cases. There might be some extreme outliers in there where your performance is very bad, but they get washed away in the total. That would be fine if that was in fact a good representation of what you will see in real use, but if it's not, you could be doing very badly on the cases that matter.

The worst case is for a library provider like us, we don't know what data types are important to the client. That one weird case where we do badly might be 90% of the client's data.

Any time you're working with test sets where you take averages and totals, you have to be aware of how you're pooling (weighted by size? (eg. adding times is weighted by size), or by file? or are categories equally weighted?). If your test set is 20% text and 40% executable, that is assigning an effective importance weight to those data types.

In data compression we also have the issue of incompressible files, such as already compressed files, which are not something you should ever be running through your compressor. People who run "lots of data" by just iterating every file on their personal disk, thinking they're getting a good measurement, are actually biasing the inputs toward weird things that should not be used.

Because of these considerations and more, I have been increasingly using the method of "minimizing the maximum" of bad performance, or boosting the worst case.

Rather than using a big test set to take an average performance, I use a big test set to find the one file with the worst performance, and then do all I can to optimize that bad case. Measure again, find the new worst case, attack that one.
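
A sketch of what that loop looks like in practice; score_file here is a hypothetical stand-in for whatever metric you care about (eg. ratio vs. the best known compressor on that file) :

// find the single worst file in the test set ; optimize for it, then re-run
const char * find_worst_case(const char ** files, int count,
                             double (*score_file)(const char *))
{
    const char * worst = files[0];
    double worst_score = score_file(files[0]);
    for (int i = 1; i < count; i++)
    {
        double s = score_file(files[i]);
        if ( s < worst_score ) { worst_score = s; worst = files[i]; }
    }
    return worst;   // attack this case, then measure again and repeat
}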

This has many advantages. It prevents clients from ever seeing a really bad case. That one worst case might actually be the type of data they really care about. It also tends to find interesting corner cases and reveals flaws you don't see on average cases (like oh this one weird file runs most of the loop iterations in the tail/safe loop), that lets you find and fix those cases. It's sort of a case of "you learn from your mistakes" by really digging into these examples of bad performance.

Another nice thing about the "fix the worst" method is that it's strictly additive for bigger test sets. You can just always toss more into your test set and you have more chances to find a worst case. You don't have to worry about how the data is distributed and whether that reflects real world distributions. For example, say someone gives you a terabyte of images that are all grayscale. You don't have to worry that this is going to bias your image test set towards a weird over-weighting of grayscale.

This approach has been used on both Oodle Leviathan and Oodle Texture. It was one of the guiding principles of Leviathan that we not only be good on average, but we minimize the gap to the best compressor on every type of data. (we can't be the best possible compressor on every type of data, where specialized compressors can excel in some cases, but we wanted to minimize the worst difference). That led to lots of good discoveries in Leviathan that also helped the average case, and we used a similar principle in Oodle Texture. I think of it as a bit like the machine learning technique AdaBoost, where you take your worst cases and train on them more to get better at them, then keep repeating that and you wind up with a good classifier in general.

7/13/2020

Robust Win32 IO

I see far too much code in production that does not use Win32 IO robustly. Some of the issues are subtle and tricky, but many of them just come down to checking error codes and return values. You cannot assume :
ReadFile(size) either successfully reads all "size" bytes, or fails mysteriously and we should abort
What you actually need to handle is all of these cases (a sketch of the synchronous cases follows the list) :
ReadFile(size)
succeeded but got less than size
failed but failed due to being already at EOF
failed but failed due to a temporary system condition that we should retry
succeeded but is not asynchronous the way we expected
succeeded and was asynchronous but then GetOverlapped result does not wait as we expected
failed but failed due to IO size being too big and we should cut it into pieces
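
For the plain synchronous case, the loop you need looks something like the sketch below; this is simplified, and the helpers in the zip below also cover the async, retry, and size-cutting cases :

#include <windows.h>

// read exactly "size" bytes unless we hit EOF ; never assume one ReadFile call does it all
static bool win32_read_all(HANDLE h, void * buf, DWORD size, DWORD * pGot)
{
    DWORD total = 0;
    while ( total < size )
    {
        DWORD got = 0;
        SetLastError(0);
        if ( ! ReadFile(h, (char *)buf + total, size - total, &got, NULL) )
        {
            DWORD err = GetLastError();
            if ( err == ERROR_HANDLE_EOF )
                break;  // already at EOF ; return what we have
            // transient resource failures should be retried (real code should
            //  sleep, limit the retries, and cut the IO into smaller pieces) :
            if ( err == ERROR_NO_SYSTEM_RESOURCES || err == ERROR_WORKING_SET_QUOTA )
                continue;
            *pGot = total;
            return false;   // real failure
        }
        if ( got == 0 )
            break;  // synchronous ReadFile at EOF succeeds with 0 bytes read
        total += got;
    }
    *pGot = total;
    return ( total == size );
}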

In a surely pointless attempt to improve matters, I've tried to make easy-to-use clean helpers that do all this for you, so you can just include this code and have robust IO :

robustwin32io.zip

Even if you are being careful and checking all the error codes, some issues you may not be handling :

  • Cut large IOs into pieces. You may have memory allocated for your large IO buffer, but when you create an IO request for that, the OS needs to mirror that into the disk cache, and into kernel memory space for the IO driver. If your buffer is too large, that can fail due to running out of resources.

    (this is now rarely an issue on 64-bit windows, but was common on 32-bit windows, and can still happen on the MS consoles)

  • Retry IOs in case of (some) failures. One of the causes of IO failure is too many requests in the queue, for example if you are spamming the IO system with lots of small requests. If you get these failures you should wait a bit for the queue to drain out, then retry.

  • Always call GetOverlappedResult(FALSE) (no wait) before GetOverlappedResult(TRUE) (wait) to reset the event. If you don't do this, GetOverlappedResult(TRUE) can return without waiting for the IO to complete, causing a race against the IO. This behavior was changed in Windows 7 so this might not be necessary any more, but there's some dangerous behavior with the manual-reset Event in the OVERLAPPED struct. When you start an async IO it is not necessarily reset to unsignalled. When you call GetOverlappedResult(TRUE) it is supposed to wait on an event that is signalled when the IO completes, but if the event was already signalled before you called, it will just return immediately.

    NOTE this is not the issue with trying to do GetOverlappedResult on the same OVERLAPPED struct from multiple threads - that is just wrong; access to the OVERLAPPED struct should be mutex protected if you will query results from multiple threads, and you should also track your own "is io async" status to check before calling GetOverlappedResult.

  • Always call SetLastError(0) before calling any Windows API and then doing GetLastError. See previous blog on this topic : 10-03-13 - SetLastError(0). This particular bug was fixed in Windows Vista (so a while ago), but I'm paranoid about it and it's harmless to do, so I still do it. GetLastError/SetLastError is just a variable in your thread-info-block, so it's only a few instructions to access it. It's best practice to always SetLastError(0) at the start of a sequence of operations, that way you know you aren't getting errors that were left over from before.

For example, here's how to call GetOverlappedResult : (only call if st == win32_io_started_async)

BOOL win32_get_async_result(HANDLE handle,
                            OVERLAPPED * data,
                            DWORD * pSize)
{
    // only call this if you got "win32_io_started_async"
    // so you know IO is actually pending
            
    DWORD dwSize = 0;
    
    // first check result with no wait
    //  this also resets the event so that the next call to GOR works :
    
    if ( GetOverlappedResult(handle,data,&dwSize,FALSE) )
    {
        if ( dwSize > 0 )
        {
            *pSize = (DWORD) dwSize;
            return true;
        }
    }   
    
    // if you don't do the GOR(FALSE)
    //  then the GOR(TRUE) call here can return even though the IO is not actually done
    
    // call GOR with TRUE -> this yields our thread if the IO is still pending
    if ( ! GetOverlappedResult(handle,data,&dwSize,TRUE) )
    {
        DWORD err = GetLastError();
        
        if ( err == ERROR_HANDLE_EOF )
        {
            if ( dwSize > 0 )
            {
                *pSize = (DWORD) dwSize;
                return true;
            }
        }
        
        return false;       
    }
        
    *pSize = (DWORD) dwSize;
    
    return true;    
}           

Get the code :

robustwin32io.zip

Also note that I don't recommend trying to do unbuffered writes on Windows. It is possible but complex and requires privilege elevation which puts up a UAC prompt, so it's not very usable in practice. Just do buffered writes. See also : 01-30-09 - SetFileValidData and async writing and 03-12-09 - ERROR_NO_SYSTEM_RESOURCES
