6/16/2020

Oodle Texture bc7prep data flow

We mostly talk about Oodle Texture encoding to BCN from source art, usually with RDO (rate-distortion optimization).

In the previous post about Oodle Texture we looked at the data flow of texture content in games, for the case of Oodle Texture RDO.

There is a different technology in Oodle Texture called "bc7prep" that can be used differently, or in addition to RDO.

BC7prep is a lossless transform specifically for BC7 (not BC1-6) that rearranges the bits to improve the compression ratio. Unlike Oodle Texture RDO, BC7Prep requires a reverse transform at runtime to unpack it back to BC7. This is a super fast operation, and can also be done with a GPU compute shader so the CPU doesn't have to touch the bits at all.

BC7prep can be used in combination with Oodle Texture RDO encoding, or on BC7 encodings made from any source. It typically improves the compression of BC7 by 5-15%.

BC7 is particularly important because it makes up a lot of the size of modern games. It's a commonly used texture format, and without RDO or bc7prep it often doesn't compress much at all.

The similar data flow chart for BC7 textures that use bc7prep is :


RGB uncompressed source art content like BMP or PNG

|     <- this step is where Oodle Texture RDO goes
V

BC7 compressed texture  <- you can also start here for bc7prep

|     <- bc7prep transforms BC7 to "prepped" data
V

BC7Prep transformed texture

|
V

Kraken or zlib compression on the game package containing the BCN textures

|                                                                                   TOOLS ^
V
sent over network, stored on disk

|                                                                                   RUNTIME v
V

decompress Kraken or zlib to load (sometimes with hardware decompressor)

|
V

BC7Prep transformed texture

|     <- bc7prep unpacking, can be done on GPU
V

BC7 compressed texture in memory

|
V

rendered on GPU

Oodle Texture gives you several options for how you use it depending on what best fits your game. You can use only Oodle Texture RDO, in which case no runtime decoding is needed. You can use just bc7prep on existing BC7 encoded data, in which case you don't need to use Oodle Texture's BCN encoding from source art at all. Or you can use both together.

BC7Prep combined with Oodle Texture RDO at "near lossless" levels provides size savings for BC7 with almost no visual quality difference from a non-RDO BC7 encoding.

On varied BC7 textures :

total :
BC7 + Kraken                        : 1.387 to 1
BC7Prep + Kraken                    : 1.530 to 1
NearLossless RDO + BC7Prep + Kraken : 1.650 to 1

The size savings with BC7Prep are not huge and dramatic the way they are with RDO, because the BC7Prep transform is lossless, but they are very large compared to the normal differences between lossless compression options.

For example the compression ratio on that BC7 set with Oodle Leviathan is 1.409 to 1, not much better than Kraken, and a far smaller gain than BC7Prep gives. Oodle Leviathan is a very strong compressor that usually finds bigger gains over Kraken than that, but BC7 data is hard for compressors to parse. BC7Prep and Oodle Texture RDO put the data into a form that increases the benefit of strong compressors like Oodle over weaker ones like zlib. (on data that's incompressible, all compressors are the same).

The runtime BC7Prep untransform is extremely fast. If you're using Oodle Data compression in software, it's best to do the BC7Prep untransform in software right after decompression. If you're on a platform with hardware decompression, you may want to use BC7Prep untransform compute shaders so the texture data never has to be touched by the CPU.

Visit the RAD Game Tools website to read all about Oodle Texture.

Oodle Texture sample run

This is a sample run of Oodle Texture with files you can download to verify our results for yourself.

The example here is my photograph "mysoup" which I make CC0. While most game textures are not like photographs, this is typical of the results we see. To try Oodle Texture on your own images contact RAD Game Tools for an evaluation. You can also see more sample encodings on real game textures at the Oodle Texture web site.

I will be showing "mysoup" encoded to BC7 here, which is the format commonly used by games for high quality textures. It is 8 bits per texel (vs 24 for source RGB). The BC7 images I provide here are in DDS format; they are not viewable by normal image viewers. This is intended for game developers and technical artists to see real game data.

I have made these BC7 DDS with Oodle Texture in "BC7 RGBA" mode, which attempts to preserve the opaque alpha channel in the encoding. I would prefer to use our "BC7 RGB" mode, which ignores alpha; this would get slightly higher RGB quality but can output undefined values in alpha. Because many third party viewers don't handle this mode, I've not used it here.

I will also show the encoding of a few different maps from a physically based texture : coral_mud_01 from texturehaven.com (CC0), to show a sample of a full texture with diffuse, normal, and attribute maps.

I show RMSE here for reference; you should be able to reproduce the same numbers on the sample data. The RMSE I show is per texel (not per channel), and I compute only RGB RMSE (no A). Note that while I show RMSE, Oodle Texture has been run for Perceptual quality, and visual quality is what we hope to maximize (which sacrifices RMSE performance).

Oodle Texture is not a compressor itself; it works with the back end lossless compressor of your choice. Here I will show sizes with software Kraken at level 8 and zlib at level 9. You can download the data and try different back end compressors yourself. Oodle Texture works with many different lossless back end compressors to greatly increase their compression ratio.


The "mysoup" images are 1024x1024 but I'm showing them shrunk to 512x512 for the web site; you should always inspect images for visual quality without minification; click any image for the full size.

Download all the files for mysoup1024 here : (BC7 in DDS) :
mysoup1024_all.7z

Download all the files for coral_mud_01 here : (BC7,BC4,BC5 in DDS) :
coral_mud_01.7z


mysoup1024.png :
original uncompressed RGB :

click image for full resolution.


baseline BC7 :
(no RDO, max quality BC7)
RMSE per texel: 2.7190

Kraken          :  1,048,724 ->   990,347 =  7.555 bpb =  1.059 to 1
Kraken+bc7prep  :  1,080,696 ->   861,173 =  6.375 bpb =  1.255 to 1
zlib9           :  1,048,724 -> 1,021,869 =  7.795 bpb =  1.026 to 1

BC7 data is difficult for traditional compressors to handle. We can see here that neither Kraken nor zlib can get much compression on the baseline BC7, sending the data near the uncompressed 8 bits per texel size of BC7.

Oodle Texture provides "bc7prep", which is a lossless transform that makes the BC7 data more compressible. "bc7prep" can be used with Kraken, zlib, or any other back end compressor. "bc7prep" does require a runtime pass to transform the data back to BC7 that the GPU can read, but this can be done with a GPU compute shader so no CPU involvement is needed. Here bc7prep helps quite a bit in changing the data into a form that can be compressed by Kraken.


rdo lambda=5 :
RDO in near lossless mode
RMSE per texel: 2.8473

Kraken          :  1,048,724 ->   849,991 =  6.484 bpb =  1.234 to 1
Kraken+bc7prep  :  1,080,468 ->   767,149 =  5.680 bpb =  1.408 to 1
zlib9           :  1,048,724 ->   895,421 =  6.831 bpb =  1.171 to 1

In near lossless mode, Oodle Texture can make encodings that are visually indistinguishable from baseline, but compress much better. At this level, Oodle Texture RDO is finding blocks that are lower rate (eg. compress better with subsequent Kraken compression), but are no worse than the baseline choice. It's simply a smarter encoding that considers rate as well as distortion when choosing among the many possible ways a block can be encoded in BCN.

Note that when we say "near lossless RDO" we mean nearly the same quality as the baseline encoding. The baseline encoding to BC7 forces some quality loss from the original, and this RDO setting does not increase that. The RMSE difference to baseline is very small, but the visual quality difference is even smaller.

We believe that legacy non-RDO encodings of most texture types should never be used. Oodle Texture RDO provides huge wins in size with no compromise; if you need max quality just run in near lossless mode. It simply makes a better encoding to BC7 which is much more compressible. Many common BC7 encoders produce worse quality than Oodle Texture does in near-lossless mode.


rdo lambda=40 :
RDO in medium quality mode
RMSE per texel: 4.2264

Kraken          :  1,048,724 ->   509,639 =  3.888 bpb =  2.058 to 1
Kraken+bc7prep  :  1,079,916 ->   455,747 =  3.376 bpb =  2.370 to 1
zlib9           :  1,048,724 ->   576,918 =  4.401 bpb =  1.818 to 1

At lambda=40 we are now trading off some quality for larger rate savings. At this level, visual differences from the original may start to appear, but are still very small, and usually acceptable. (for example the errors here are far far smaller than if you encoded to BC1, or even if you encoded with a poor BC7 encoder that reduces choices in a hacky/heuristic way).

At this level, Kraken is now able to compress the image nearly 2 to 1, to 3.888 bits per texel, starting from a baseline which got almost no compression at all. We've shrunk the Kraken compressed size nearly in half. This also means the content can load twice as fast, giving us an effective 2X multiplier on the disk speed. This is a HUGE real world impact on game content sizes with very little down side.

zlib has also benefitted from RDO, going from 1,021,869 to 576,918 bytes after compression. Kraken does a bit better because it's a stronger compressor than zlib. The difference is not so much because Oodle Texture is specifically tuned for Kraken (it's in fact quite generic), but because more compressible data tends to show the difference between back end compressors more. The baseline BC7 data is nearly incompressible, so the difference between Kraken and zlib looks smaller there.


Download all the "mysoup" files here : (BC7 in DDS) :
mysoup1024_all.7z

Summary of all the compression results :

baseline BC7 :

Kraken          :  1,048,724 ->   990,347 =  7.555 bpb =  1.059 to 1
Kraken+bc7prep  :  1,080,696 ->   861,173 =  6.375 bpb =  1.255 to 1
zlib9           :  1,048,724 -> 1,021,869 =  7.795 bpb =  1.026 to 1

RDO lambda=5 :

Kraken          :  1,048,724 ->   849,991 =  6.484 bpb =  1.234 to 1
Kraken+bc7prep  :  1,080,468 ->   767,149 =  5.680 bpb =  1.408 to 1
zlib9           :  1,048,724 ->   895,421 =  6.831 bpb =  1.171 to 1

RDO lambda=40 :

Kraken          :  1,048,724 ->   509,639 =  3.888 bpb =  2.058 to 1
Kraken+bc7prep  :  1,079,916 ->   455,747 =  3.376 bpb =  2.370 to 1
zlib9           :  1,048,724 ->   576,918 =  4.401 bpb =  1.818 to 1


coral_mud_01 :

You can get the source art for coral_mud_01 at texturehaven.com. I used the 1k PNG option. On the web site here I am showing a 256x256 crop of the images so they can be seen without minification. Download the archive for the full res images.

coral_mud_01_diff_1k :
diffuse (albedo) color in BC7 (RGBA)

BC7 non-RDO      BC7 lambda=30    BC7 lambda=50
RMSE 3.5537      RMSE 4.9021      RMSE 5.7194
7.954 bpb        5.339 bpb        4.683 bpb
BC7Prep could also be used for additional compression, not shown here.

coral_mud_01_Nor_1k.png:
normal XY in RG channels only in BC5
BC5 decodes to 16 bit
RMSE is RG only

BC5 non-RDO      BC5 lambda=30    BC5 lambda=50
RMSE 3.4594      RMSE 5.8147      RMSE 7.3816
8.000 bpb        5.808 bpb        5.083 bpb

coral_mud_01_bump_1k.png:
bump map, single scalar channel
BC4 decodes to 16 bit

BC4 non-RDO      BC4 lambda=30    BC4 lambda=50
RMSE 3.2536      RMSE 4.1185      RMSE 5.2258
7.839 bpb        6.871 bpb        6.181 bpb

A single scalar channel in BC4 is an unusual usage for games. Typically several scalar channels would be combined into one BC7 texture.

Compressed sizes are with software Kraken at level 8, no additional filters. "bpb" means "bits per byte" : it's the compressed size in bits, per uncompressed byte. The non-RDO textures in coral_mud get almost no compression at all with Kraken. With RDO that improves to around 5 bpb, which is an 8:5 ratio, or 1.6 to 1.

With BC7 and BC5, the "bpb" size is also the number of bits per texel, because they start at 8 bits per texel when uncompressed. If RDO can improve BC7 to 4 bits per texel, that means it's now the same size on disk as uncompressed BC1, but at far higher visual quality. (2:1 on BC7 is a bit more than we typically expect; 8:5 or 8:6 is more common)

Download all the files for coral_mud_01 here : (BC7,BC4,BC5 in DDS) :
coral_mud_01.7z


Read more about Oodle Texture on the RAD Game Tools web site

6/09/2020

Followup tidbits on RGBE

As noted previously, RGBE 8888 is not a very good encoding for HDR in 32 bits. I haven't personally evaluated the other options, but from what I've read, the 16-8-8 LogLUV format looks okay. You want more bits of precision for luminance, and the only way to get that is to go to some kind of luma-chroma space.

In any case, we'll look at a couple RGBE followup topics because I think they may be educational. Do NOT use these. This is for our education only, don't copy paste these and put them in production! If you want an RGBE conversion you can use, see the previous post!

In the previous post I wrote that I generally prefer centered quantization that does bias on encode. This is different than what is standard for Radiance HDR RGBE files. (DO NOT USE THIS). But say you wanted to do that, what would it look like exactly?


// float RGB -> U8 RGBE quantization
void float_to_rgbe_centered(unsigned char * rgbe,const float * rgbf)
{
    // NOT HDR Radiance RGBE conversion! don't use me!
        
    float maxf = rgbf[0] > rgbf[1] ? rgbf[0] : rgbf[1];
    maxf = maxf > rgbf[2] ? maxf : rgbf[2];

    if ( maxf <= 1e-32f )
    {
        // Exponent byte = 0 is a special encoding that makes RGB output = 0
        rgbe[0] = rgbe[1] = rgbe[2] = rgbe[3] = 0;
    }
    else
    {
        int exponent;
        frexpf(maxf, &exponent);
        float scale = ldexpf(1.f, -exponent + 8);
    
        // bias might push us up to 256
        // instead increase the exponent and send 128
        if ( maxf*scale >= 255.5f )
        {
            exponent++;
            scale *= 0.5f;
        }
    
        // NOT HDR Radiance RGBE conversion! don't use me!
        rgbe[0] = (unsigned char)( rgbf[0] * scale + 0.5f );
        rgbe[1] = (unsigned char)( rgbf[1] * scale + 0.5f );
        rgbe[2] = (unsigned char)( rgbf[2] * scale + 0.5f );
        rgbe[3] = (unsigned char)( exponent + 128 );
    }
}

// U8 RGBE -> float RGB dequantization
void rgbe_to_float_centered(float * rgbf,const unsigned char * rgbe)
{
    // NOT HDR Radiance RGBE conversion! don't use me!

    if ( rgbe[3] == 0 )
    {
        rgbf[0] = rgbf[1] = rgbf[2] = 0.f;
    }
    else
    {
        // NOT HDR Radiance RGBE conversion! don't use me!

        float fexp = ldexpf(1.f, (int)rgbe[3] - (128 + 8));
        // centered restoration, no bias :
        rgbf[0] = rgbe[0] * fexp;
        rgbf[1] = rgbe[1] * fexp;
        rgbf[2] = rgbe[2] * fexp;
    }
}

what's the difference in practice ?

On random floats, there is no difference. This has the same 0.39% max round trip error as the reference implementation that does bias on decode.

The difference is that on integer colors, centered quantization restores them exactly. Specifically : for all the 24-bit LDR (low dynamic range) RGB colors, the "centered" version here has zero error, perfect restoration.

That sounds pretty sweet, but it's not actually helpful in practice, because the way we in games use HDR data typically has the LDR range scaled to [0,1.0], not [0,255]. The "centered" way does preserve 0.0 and 1.0 exactly.

The other thing I thought might be fun to look at is :

The Radiance RGBE conversion has 0.39% max round trip error. That's exactly the same as a flat quantizer from the unit interval to 7 bits. (the bad conversion that did floor-floor had max error of 0.78% - just the same as a flat quantizer to 6 bits).

But our RGBE all have 8 bits. We should be able to get 8 bits of precision. How would you do that?

Well one obvious issue is that we are sending the max component with the top bit on. It's in [128,255], we always have the top bit set and then only get 7 bits of precision. We could send that more like a real floating point encoding with an implicit top bit, and use all 8 bits.

If we do that, then the decoder needs to know which component was the max to put the implicit top bit back on. So we need to signal it. Well, fortunately we have 8 bits for the exponent which is way more dynamic range than we need for HDR imaging, so we can take 2 bits from there to send the max component index and leave 6 bits for exponent.

Then we also want to make sure we use the full 8 bits for the non-maximal components. To do that we can scale their fractional size relative to max up to 255.

Go through the work and we get what I call "rgbeplus" :


/**

! NOT Radiance HDR RGBE ! DONT USE ME !

"rgbeplus" packing

still doing 8888 RGBE, one field in each 8 bits, not the best possible general 32 bit packing

how to get a full 8 bits of precision for each component
(eg. maximum error 0.19% instead of 0.39% like RGBE)

for the max component, we store an 8-bit mantissa without the implicit top bit
  (like a real floating point encoding, unlike RGBE which stores the on bit)
  (normal RGBE has the max component in 128-255 so only 7 bits of precision)

because we aren't storing the top bit we need to know which component was the max
  so the decoder can find it

we put the max component index in the E field, so we only get 6 bits for exponent
  (6 is plenty of orders of magnitude for HDR images)
  
then for the non-max fields, we need to get a full 8 bits for them too  
  in normal RGBE they waste the bit space above max, because we know they are <= max
  eg. if max component was 150 , then the other components can only be in [0,150]
    and all the values above that are wasted precision
  therefore worst case in RGBE the off-max components also only have 7 bits of precision.
  To get a full 8, we convert them to fractions of max :
  frac = not_max / max
  which we know is in [0,1]
  and then scale that up by 255 so it uses all 8 bits

this all sounds a bit complicated but it's very simple to decode

I do centered quantization (bias on encode, not on decode)

**/

// float RGB -> U8 RGBE quantization
void float_to_rgbeplus(unsigned char * rgbe,const float * rgbf)
{
    // rgbf[] should all be >= 0 , RGBE does not support signed values
    
    // ! NOT Radiance HDR RGBE ! DONT USE ME !

    // find max component :
    int maxi = 0;
    if ( rgbf[1] > rgbf[0] ) maxi = 1;
    if ( rgbf[2] > rgbf[maxi] ) maxi = 2;
    float maxf = rgbf[maxi];

    // 0x1.p-32 ?
    if ( maxf <= 1e-10 ) // power of 10! that's around 2^-32
    {
        // Exponent byte = 0 is a special encoding that makes RGB output = 0
        rgbe[0] = rgbe[1] = rgbe[2] = rgbe[3] = 0;
    }
    else
    {
        int exponent;
        frexpf(maxf, &exponent);
        float scale = ldexpf(1.f, -exponent + 9);
        // "scale" is just a power of 2 to put maxf in [256,512)
        
        // 6 bits of exponent :
        if ( exponent < -32 )
        {
            // Exponent byte = 0 is a special encoding that makes RGB output = 0
            rgbe[0] = rgbe[1] = rgbe[2] = rgbe[3] = 0;
            return;
        }
        myassert( exponent < 32 );
        
        // bias quantizer in encoder (centered restoration quantization)
        int max_scaled = (int)( maxf * scale + 0.5f );
        if ( max_scaled == 512 )
        {
            // slipped up because of the round in the quantizer
            // instead do ++ on the exp
            scale *= 0.5f;
            exponent++;
            //max_scaled = (int)( maxf * scale + 0.5f );
            //myassert( max_scaled == 256 );
            max_scaled = 256;
        }
        myassert( max_scaled >= 256 && max_scaled < 512 );
        
        // grab the 8 bits below the top bit :
        rgbe[0] = (unsigned char) max_scaled;
        
        // to scale the other two components
        //  we need to use the maxf the *decoder* will see
        float maxf_dec = max_scaled / scale;
        myassert( fabsf(maxf - maxf_dec) <= (0.5/scale) );
        
        // scale lower components to use full 255 for their fractional magnitude :
        int i1 = (maxi+1)%3;
        int i2 = (maxi+2)%3;
        rgbe[1] = u8_check( rgbf[i1] * 255.f / maxf_dec + 0.4999f );
        rgbe[2] = u8_check( rgbf[i2] * 255.f / maxf_dec + 0.4999f );
        
        // rgbf[i1] <= maxf
        // so ( rgbf[i1] * 255.f / maxf ) <= 255
        // BUT
        // warning : maxf_dec can be lower than maxf
        // maxf_dec is lower by a maximum of (0.5/scale)
        // worst case is 
        // (rgbf[i1] * 255.f / maxf_dec ) <= 255.5
        // so you can't add + 0.5 or you will go to 256
        // therefore we use the fudged bias 0.4999f
        
        rgbe[3] = (unsigned char)( ( (exponent + 32) << 2 ) + maxi );
    }
}

// U8 RGBE -> float RGB dequantization
void rgbeplus_to_float(float * rgbf,const unsigned char * rgbe)
{
    // ! NOT Radiance HDR RGBE ! DONT USE ME !

    if ( rgbe[3] == 0 )
    {
        rgbf[0] = rgbf[1] = rgbf[2] = 0.f;
    }
    else
    {
        int maxi = rgbe[3]&3;
        int exp = (rgbe[3]>>2) - 32;
        float fexp = ldexpf(1.f, exp - 9);
        float maxf = (rgbe[0] + 256) * fexp;
        float f1 = rgbe[1] * maxf / 255.f;
        float f2 = rgbe[2] * maxf / 255.f;
        int i1 = (maxi+1)%3;
        int i2 = (maxi+2)%3;
        rgbf[maxi] = maxf;
        rgbf[i1] = f1;
        rgbf[i2] = f2;
    }
}

and this in fact gets a full 8 bits of precision. The max round trip error is 0.196%, the same as a flat quantizer to 8 bits.

(max error is always measured as a percent of the max component, not of the component that has the error; any shared exponent format has 100% max error if you measure as a percentage of the component)

Again repeating myself : this is a maximum precision encoding assuming you need to stick to the "RGBE" style of using RGB color space and putting each component in its own byte. That is not the best possible way to send HDR images in 32 bits, and there's no particular reason to use that constraint.

So I don't recommend using this in practice. But I think it's educational because these kinds of considerations should always be studied when designing a conversion. The errors from getting these trivial things wrong are very large compared to the errors that we spend years of research trying to save, so it's quite frustrating when they're done wrong.

6/07/2020

Widespread error in Radiance HDR RGBE conversions

It has come to my attention that broken RGBE image conversions used in the loading of Radiance HDR files are being spread widely. The broken routines are causing significant unnecessary errors in float reconstruction from RGBE 8888.

Note that RGBE 8888 is a very lossy encoding of floating point HDR images to begin with. It is really not appropriate as a working HDR image format if further processing will be applied that may magnify the errors. A half-float format is probably a better choice. If you do use RGBE HDR, at least use the correct conversions, which I will provide here. (See Greg Ward's article on HDR encodings, and also a nice table of dynamic range and accuracy of various HDR file formats.)

The Radiance HDR RGBE encoding stores 3-channel floating point RGB using four 8-bit components. It takes the RGB floats and finds the exponent of the largest component, sends that exponent in E, then shifts the components to put the largest component's top bit at 128 and sends them as linear 8 bits.

In making the corrected version, I have been guided by 3 principles in order of priority : 1. ensure existing RGBE HDR images are correct encodings, 2. treat the original Radiance implementation as defining the file format, 3. minimize the error of round trip encoding.

With no further ado, let's have the correct RGBE conversion (the full code is in Appendix 1 below), then we'll talk about the details :

RGBE encoding scales the largest component into the range [128,256), so it actually only has 7 bits of mantissa. The smaller components are put on a shared exponent with the largest component, which means if you ever care about tiny values in the non-maximal components this encoding is terrible (as is any shared-exponent encoding). There are much better packings of RGB floats to 32 bits that store more useful bits of precision.

Let's now look at the broken version that's being widely used. It uses the same encoder as above, but in the decoder the bad version does :

// RGBE 8888 -> float RGB
// BAD decode restoration, don't copy me

        rgbf[0] = rgbe[0] * fexp;
        rgbf[1] = rgbe[1] * fexp;
        rgbf[2] = rgbe[2] * fexp;

missing the 0.5 biases.

Essentially this is just a quantization problem. We are taking floats and quantizing them down to 8 bits. Each 8 bit index refers to a "bucket" or range of float values. When you restore after quantizing, you should restore to the middle of the bucket. (with non-uniform error metrics or underlying probability distributions, the ideal restoration point might not be the middle; more generally restore to the value that minimizes error over the expected distribution of input values).

The broken version is essentially doing a "floor" on both the quantization and restoration.


floor quantization :

+---------+---------+---------+---------+---------+---------+
|0        |1        |2        |3        |4        |5        |
+---------+---------+---------+---------+---------+---------+

->
floor restoration :

*0        *1        *2        *3        *4        *5

->
bias 0.5 restoration :

     *0.5      *1.5      *2.5      *3.5      *4.5      *5.5

The rule of thumb is that you need a 0.5 bias on either the quantization side or the restoration side. If you do floor quantization, do +0.5 in restoration. If you do centered quantization (+0.5 at encode), you can do integer restoration.

The broken version is just a bad quantizer that restores value ranges to one edge of the bucket. That creates a net downward shift of values, adding error that doesn't need to be there. On random RGB floats :

Maximum relative error :
Correct RGBE encoding : 0.3891%
Bad floor-floor RGBE encoding : 0.7812%

(percentage error relative to largest component).

Note this is a LOT of error. If you just took a real number in [0,1) and quantized it to 8 bits, the maximum error is 0.195% , so even the correct encoding at around 0.39% error is double that (reflecting that we only have 7 bits of precision in the RGBE encoding), and the bad encoding at around 0.78% is double that again (it's equal to the maximum error of uniform quantization if we only had 6 bits of precision).

Reference test code that will print these errors : test_rgbe_error.cpp

I have where possible tried to make this corrected RGBE quantizer match the original HDR file IO from Greg Ward's Radiance. I believe we should treat the Radiance version as canonical and make HDR files that match it. I have adopted a small difference which I note in Appendix 4. The change that I propose here actually makes the encoder match the description of its behavior better than it did before (see Appendix 4). The original Radiance code does floor quantization and midpoint restoration, as here. The broken version was introduced later.

The broken floor-floor code has been widely copied into many tools used in games (such as STBI and NVTT); hopefully those will be fixed soon. Fortunately the encoder does not need to change, only the decoder, so existing HDR images are okay. They can be loaded with the corrected restoration function and should see an improvement in quality for free.

That's it! We'll followup some details in appendices for those who are interested.


Appendix 1 :

Correct conversion in raw text if the pastebin isn't working :

// float RGB -> U8 RGBE quantization
void float_to_rgbe(unsigned char * rgbe,const float * rgbf)
{
    // rgbf[] should all be >= 0 , RGBE does not support signed values
        
    float maxf = rgbf[0] > rgbf[1] ? rgbf[0] : rgbf[1];
    maxf = maxf > rgbf[2] ? maxf : rgbf[2];

    if ( maxf <= 1e-32f )
    {
        // Exponent byte = 0 is a special encoding that makes RGB output = 0
        rgbe[0] = rgbe[1] = rgbe[2] = rgbe[3] = 0;
    }
    else
    {
        int exponent;
        float scale;
        frexpf(maxf, &exponent);
        scale = ldexpf(1.f, -exponent + 8);
                
        rgbe[0] = (unsigned char)( rgbf[0] * scale );
        rgbe[1] = (unsigned char)( rgbf[1] * scale );
        rgbe[2] = (unsigned char)( rgbf[2] * scale );
        rgbe[3] = (unsigned char)( exponent + 128 );
    }
}

// U8 RGBE -> float RGB dequantization
void rgbe_to_float(float * rgbf,const unsigned char * rgbe)
{
    if ( rgbe[3] == 0 )
    {
        rgbf[0] = rgbf[1] = rgbf[2] = 0.f;
    }
    else
    {
        // the extra 8 here does the /256
        float fexp = ldexpf(1.f, (int)rgbe[3] - (128 + 8));
        rgbf[0] = (rgbe[0] + 0.5f) * fexp;
        rgbf[1] = (rgbe[1] + 0.5f) * fexp;
        rgbf[2] = (rgbe[2] + 0.5f) * fexp;
    }
}


Appendix 2 :

Why I prefer centered quantization.

Radiance HDR does floor quantization & centered restoration. I think it should have been the other way around (centered quantization & int restoration), but I don't propose changing it here because we should stick to the reference Greg Ward implementation, and match how existing .hdr files have been encoded.

The reason I prefer centered quantization is that it exactly preserves integers and powers of two. (it has a small bonus of not needing a bias at decode time).

If your input real numbers are evenly distributed with no special values, then there's no reason to prefer either style, they're equal. But usually our inputs are not evenly distributed and random, we often deal with inputs where values like 0.0 and 1.0 or 256.0 are special and we'd like to preserve them exactly.

If you do centered quantization, these restore exactly. If you do floor quantization and then have an 0.5 bias on dequantization to do centered restoration, values like 0 shift to be in the center of their bucket.

In the correct RGBE encoding above, a float input of 1.0 is restored as 1.0039063 (= 1 + 1/256), because 1.0 corresponds to the bottom edge of a quantization bucket, and we restore to the middle of the bucket.

For example for non-HDR images the quantization I like is :

// U8 to float 
// map 1.0 to 255
// centered quantization

unsigned char unit_float_to_u8(float f)
{
    // f in [0,1.0]
    // scale up to [0,255] , then do centered quantization :
    //  eg. values after *255 in [0.5,1.5] -> 1
    // clamp before casting if you aren't sure your floats are bounded!
    return (unsigned char) (f * 255.f + 0.5f);
}

float u8_to_unit_float(unsigned char u8)
{
    // u8 in [0,255]
    // scale to [0,1.0]
    // do straight mapped dequantization :
    return u8 * (1.f/255);
}

It seems that perhaps the desire to preserve 1.0 exactly is what got us into this whole mess. A widely referenced extraction from Radiance was posted here : bjw rgbe, but with a crucial flaw. The reconstruction was changed to remove the bias :

/* standard conversion from rgbe to float pixels */
/* note: Ward uses ldexp(col+0.5,exp-(128+8)).  However we wanted pixels */
/*       in the range [0,1] to map back into the range [0,1].            */

Ward had a valid quantizer (floor encode, bias decode), but it disappeared in this version.

The correct way to get 1.0 preserved would have been to do a centered quantization (bias on the encode). As noted previously I don't recommend changing that now as it would mean existing hdr files were encoded wrong, and it deviates from Ward's original Radiance implementation, which should be taken as defining the format. We should consider the BJW implementation to simply have an incorrect decoder. (the BJW encoder is okay, but it does also suffer from the small flaw discussed in Appendix 4)


Appendix 3 :

RGBE encoder using IEEE floating point bit manipulations.

I don't post this as a suggested optimization, but rather because I think it illuminates what's actually going on in the RGBE encoding.

The IEEE floating points that we are encoding are sign-exponent-mantissa :

SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM
To send the largest component with RGBE, what we are doing is taking the top 7 bits of the mantissa, adding the implicit 1 bit ahead of those, and putting those 8 bits in a byte.
000000001MMMMMMM0000000000000000
Dropping the bits below the top 7 is the floor quantization : whatever those bits were, they're gone. The +0.5 bias on restoration is equivalent to saying we expect all of those dropped bits to be equally likely, hence their average value is 10000... (the highest dropped bit turned on, the rest left off).

000000001MMMMMMM1000000000000000

The components that aren't the highest are forced onto the same exponent as the highest; this shifts in zeros from the left, their mantissa bits are shifted out the bottom of the word, and then we grab the top 8 bits of them.

The "ldexpf" we do in the correct implementation is just making a pure power of two, which is just an exponent in IEEE floats (all mantissa bits zero). When you multiply that on another float, you're just adding to the exponent part. Unfortunately there aren't standard C ways to do simple float operations like getting an exponent or adding to an exponent so we'll have to dig into the bits.

The full operation is :

// min() helper, in case your environment doesn't already provide one :
#define min(a,b) ( (a) < (b) ? (a) : (b) )

union float_and_bits
{
    float f;
    unsigned int bits;  
};

// float RGB -> U8 RGBE quantization
void float_to_rgbe_bits(unsigned char * rgbe,const float * rgbf)
{
    float_and_bits fab[3];
    fab[0].f = rgbf[0];
    fab[1].f = rgbf[1];
    fab[2].f = rgbf[2];
    
    unsigned int max_bits = fab[0].bits > fab[1].bits ?  fab[0].bits : fab[1].bits;
    max_bits = max_bits > fab[2].bits ? max_bits : fab[2].bits;
    int max_exp = (max_bits >> 23) - 127;
    
    //max_bits == 0 is exact float zero
    //(int)max_bits < 0 means the sign bit is on (illegal negative input)
    //max_exp == 128 is NaN
    
    if ( (int)max_bits <= 0 || max_exp <= -100 || max_exp == 128 )
    {
        // Exponent byte = 0 is a special encoding that makes RGB output = 0
        rgbe[0] = rgbe[1] = rgbe[2] = rgbe[3] = 0;
    }
    else
    {
        // float value is in [1,2)*2^exp , we want [0.5,1) so do ++
        max_exp++;
        
        unsigned int mantissa = 1 << 23;
        int exp0 = (fab[0].bits>>23) - 127;
        int exp1 = (fab[1].bits>>23) - 127;
        int exp2 = (fab[2].bits>>23) - 127;
        int man0 = (fab[0].bits & (mantissa-1)) | mantissa;
        int man1 = (fab[1].bits & (mantissa-1)) | mantissa;
        int man2 = (fab[2].bits & (mantissa-1)) | mantissa;
        man0 >>= min(max_exp - exp0 - 8 + 23,31);
        man1 >>= min(max_exp - exp1 - 8 + 23,31);
        man2 >>= min(max_exp - exp2 - 8 + 23,31);
            
        rgbe[0] = (unsigned char)( man0 );
        rgbe[1] = (unsigned char)( man1 );
        rgbe[2] = (unsigned char)( man2 );
        rgbe[3] = (unsigned char)( max_exp + 128 );
    }
}

// U8 RGBE -> float RGB dequantization
void rgbe_to_float_bits(float * rgbf,const unsigned char * rgbe)
{
    if ( rgbe[3] == 0 )
    {
        rgbf[0] = rgbf[1] = rgbf[2] = 0.f;
    }
    else
    {
        float_and_bits fab;
        int exp = (int)rgbe[3] - 128 - 8;
        fab.bits = (exp+127)<<23;
        float fexp = fab.f;
        rgbf[0] = (rgbe[0] + 0.5f) * fexp;
        rgbf[1] = (rgbe[1] + 0.5f) * fexp;
        rgbf[2] = (rgbe[2] + 0.5f) * fexp;
    }
}
This version using bit manipulation produces exactly identical encodings to the correct version I have posted before.
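As a concrete round trip through condensed copies of the two routines above (imin is written out since min may not be defined everywhere), the input (1.0, 0.5, 0.25) encodes to the bytes {128, 64, 32, 129} and restores as (1.00390625, 0.50390625, 0.25390625), showing the floor-encode plus 0.5-bias-decode behavior discussed earlier :

```c
#include <assert.h>
#include <math.h>

typedef union { float f; unsigned int bits; } float_and_bits;

static int imin(int a, int b) { return a < b ? a : b; }

/* condensed copy of float_to_rgbe_bits above */
static void float_to_rgbe_bits(unsigned char * rgbe, const float * rgbf)
{
    float_and_bits fab[3];
    fab[0].f = rgbf[0]; fab[1].f = rgbf[1]; fab[2].f = rgbf[2];
    unsigned int max_bits = fab[0].bits > fab[1].bits ? fab[0].bits : fab[1].bits;
    max_bits = max_bits > fab[2].bits ? max_bits : fab[2].bits;
    int max_exp = (int)(max_bits >> 23) - 127;
    if ( (int)max_bits <= 0 || max_exp <= -100 || max_exp == 128 )
    {
        rgbe[0] = rgbe[1] = rgbe[2] = rgbe[3] = 0;
    }
    else
    {
        max_exp++; /* [1,2)*2^exp -> [0.5,1) convention */
        unsigned int mantissa = 1u << 23;
        for (int i = 0; i < 3; i++)
        {
            int e = (int)(fab[i].bits >> 23) - 127;
            int m = (int)(fab[i].bits & (mantissa-1)) | (int)mantissa;
            rgbe[i] = (unsigned char)( m >> imin(max_exp - e - 8 + 23, 31) );
        }
        rgbe[3] = (unsigned char)( max_exp + 128 );
    }
}

/* condensed copy of rgbe_to_float_bits above */
static void rgbe_to_float_bits(float * rgbf, const unsigned char * rgbe)
{
    if ( rgbe[3] == 0 ) { rgbf[0] = rgbf[1] = rgbf[2] = 0.f; return; }
    float_and_bits fab;
    fab.bits = (unsigned int)(((int)rgbe[3] - 128 - 8 + 127) << 23);
    for (int i = 0; i < 3; i++)
        rgbf[i] = (rgbe[i] + 0.5f) * fab.f; /* +0.5 bias restore */
}
```

Note 1.0 comes back as 1.00390625 (= 1 + 1/256), exactly as described at the top of this discussion.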


Appendix 4 :

Another common oddity in the RGBE encoders.

I have tried to stick to the original Radiance version where reasonable, but I have changed one part that is widely done strangely.

The major bias error we have talked about before is in the decoder. This error is in the encode side, but unlike the bias this error is not a large one, it's a small correction and doesn't change the values intended to be stored in the format.

In my correct version, this part :

        int exponent;
        float scale;
        frexpf(maxf, &exponent);
        scale = ldexpf(1.f, -exponent + 8);
is just trying to get the exponent of the largest component, and then form a pure power of 2 float that multiplies by 256 / 2^exponent.

This has been widely done strangely in other implementations. The majority of other implementations do this :

divide by self version :

    scale = frexp(maxf,&exponent) * 256.0/maxf;
frexp returns the mantissa part of the float, scaled to [0.5,1), so when you divide that by maxf what you're doing is cancelling out the mantissa part and leaving just the exponent part of maxf (2^-exponent).

This is not only a bit strange, it is in fact worse. scale made this way does not come out as an exact power of 2, because floating point math is not exact (a reciprocal then multiply does not give back exact 1.0). This divide-by-yourself method does not produce exactly the same encoding as the bit extraction reference version above.

Aside from small drifts of 'scale' causing small error in reconstruction, if it ever got too big, you could have another problem. Our value is supposed to be in [128,256) (inclusive on the bottom of the range, exclusive on the top). That means you can just cast to unsigned char. But if scale could ever be slightly over 1.0, you could get up to 256 exactly, then the cast to unsigned char would turn it into 0, which would be a huge error.

This appears to have motivated Greg Ward in the original Radiance code to add a safety margin :

original Radiance code version :
/ray/src/common/color.c line 272

    d = frexp(d, &e) * 255.9999 / d;
The 255.9999 fudge ensures that imprecise math can't make us incorrectly get to 256. It's an ugly way to handle it, and it does cause a small loss of quality, so despite my goal to stick to the original Radiance implementation I am not copying this.

(Radiance was started in 1985 when IEEE compliant floating point rounding was by no means standard, so we can certainly forgive it some fudgy bias factors)

The way I have shown creates a pure power of two scale factor, so it can't suffer from any drift of the scale due to floating point imprecision. This actually matches the described behavior of the RGBE encoding better than previous versions.
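To see concretely that the frexp/ldexp construction yields an exact power of two (a sketch, assuming IEEE single precision floats) :

```c
#include <assert.h>
#include <math.h>

/* nonzero if f is an exact power of two (IEEE binary32, f > 0) */
static int is_pow2_float(float f)
{
    union { float f; unsigned int bits; } u;
    u.f = f;
    return f > 0.f && (u.bits & ((1u<<23)-1)) == 0; /* all mantissa bits zero */
}

/* the scale factor as constructed in the text : 256 / 2^exponent */
static float rgbe_scale(float maxf)
{
    int exponent;
    frexpf(maxf, &exponent);
    return ldexpf(1.f, -exponent + 8);
}
```

By construction the result has an empty mantissa (it is purely an exponent), so no floating point drift is possible regardless of maxf.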

For example from Wikipedia :

Radiance calculates light values as floating point triplets, one each for red, green and blue. But storing a full double precision float for each channel (8 bytes × 3 = 24 bytes) is a burden even for modern systems. Two stages are used to compress the image data. The first scales the three floating point values to share a common 8-bit exponent, taken from the brightest of the three. Each value is then truncated to an 8-bit mantissa (fractional part). The result is four bytes, 32 bits, for each pixel. This results in a 6:1 compression, at the expense of reduced colour fidelity.

5/15/2020

Oodle and UE4 Loading Time

Jon Olick has done a study for us on what benefit Oodle Data compression has on Unreal load times :

Oodle and UE4 Loading Time

This is different than the benchmarks I've shown here, which are just the decompressors being run in isolation in a test bench. Jon measures times in Unreal. "Time to first frame" is the best measurement of load time, but it isn't total load time, as Unreal continues to do async loading after first frame.

Jon measured times on his PC, but with Unreal locked to one thread (as well as we could), and with various simulated disk speeds. This lets us see the times more clearly, and gives times more comparable to slower machines or the last generation of consoles. Consoles like the Switch will see the most load time benefit from Oodle, as they need help with both CPU time and IO time, so the double benefit of Oodle really helps there (Oodle gives more compression than zlib and is also much faster to decode, so you save time on both IO and decompression).

Since Unreal 4.24 we offer Oodle Data Compression and Network plugins that drop into Unreal without any changes to the Unreal Engine code. This uses the facilities that Epic added for us to make the compressor & network stack pluggable, so that we wouldn't have to patch Engine source code to do our integration. (thanks Epic!). Now you just drop in our plugins and Oodle automatically replaces the default compression in Unreal.

Some previous posts on Oodle in Unreal for more :

Oodle Data compression in Unreal

Oodle Network compression test (including Unreal)

5/10/2020

Oodle 2.8.6 Released

Oodle 2.8.6 is out now. See the full changelog at RAD.

Oodle 2.8.6 continues small fixes and tweaks in the 2.8.x major version. Leviathan decompression is now 5-10% faster on modern processors.

I have a new standard test machine, so I want to use this space to leave myself some checkpoint reference numbers on the new machine. My new standard machine is :

AMD Ryzen 9 3950X (CPU locked at 3393 MHz)
Zen2, 16 cores (32 hyper), TSC at 3490 MHz

I always down-clock my test machines and disable turbo (or "boost") for the core clocks. This gives me very reliable profiling (of single threaded work, anyway). If you don't do this, you will see not just variability of runs, but also the first thing you test will usually seem faster than later tests, so your testing may be order dependent. If you don't lock your cores at low clocks, then you should be repeating your tests many times, trying them in different orders, and seeing if your results are stable.

new machine :

ect seven Oodle 2.8.6
AMD Ryzen 9 3950X (CPU locked at 3393 MHz)

ooSelkie7       :  2.19:1 ,    7.7 enc MB/s , 4205.6 dec MB/s
ooMermaid7      :  2.85:1 ,    4.3 enc MB/s , 1985.1 dec MB/s
ooKraken7       :  3.08:1 ,    3.5 enc MB/s , 1258.7 dec MB/s
ooLeviathan7    :  3.22:1 ,    2.2 enc MB/s ,  778.4 dec MB/s

zlib9           :  2.33:1 ,    9.4 enc MB/s ,  328.1 dec MB/s

old machine :

ect seven Oodle 2.8.6
Core i7 3770 (locked at 3.4 GHz)

ooSelkie7       :  2.19:1 ,    7.5 enc MB/s , 3682.9 dec MB/s
ooMermaid7      :  2.85:1 ,    3.9 enc MB/s , 1722.3 dec MB/s
ooKraken7       :  3.08:1 ,    3.0 enc MB/s , 1024.9 dec MB/s
ooLeviathan7    :  3.22:1 ,    1.9 enc MB/s ,  679.4 dec MB/s

zlib9           :  2.33:1 ,    8.0 enc MB/s ,  310.2 dec MB/s

speeds are all single threaded, except the Oodle Optimal level encoders which use 2 threads for encoding (Jobify).

All reports on my blog before this post were on the Core i7 3770, where the run platform was not explicitly identified. All reports in the future will be on the Ryzen.

Here's an "example_lz_chart" run on the new machine :

AMD Ryzen 9 3950X (CPU locked at 3393 MHz)
Oodle 2.8.6 example_lz_chart <file>
lz_chart loading r:\testsets\lztestset\lzt99...
file size : 24700820
------------------------------------------------------------------------------
Selkie : super fast to encode & decode, least compression
Mermaid: fast decode with better-than-zlib compression
Kraken : good compression, fast decoding, great tradeoff!
Leviathan : very high compression, slowest decode
------------------------------------------------------------------------------
chart cell shows | raw/comp ratio : encode MB/s : decode MB/s |
All compressors run at various encoder effort levels (SuperFast - Optimal).
Many repetitions are run for accurate timing.
------------------------------------------------------------------------------
       |   HyperFast4|   HyperFast3|   HyperFast2|   HyperFast1|   SuperFast |
Selkie |1.41:834:4353|1.45:742:4355|1.53:557:4112|1.68:465:4257|1.70:412:4232|
Mermaid|1.54:702:3119|1.66:535:2591|1.79:434:2450|2.01:350:2429|2.04:324:2395|
Kraken |1.55:702:2247|1.71:532:1432|1.88:421:1367|2.10:364:1399|2.27:241:1272|
------------------------------------------------------------------------------
compression ratio (raw/comp):
       |   HyperFast4|   HyperFast3|   HyperFast2|   HyperFast1|   SuperFast |
Selkie |    1.412    |    1.447    |    1.526    |    1.678    |    1.698    |
Mermaid|    1.542    |    1.660    |    1.793    |    2.011    |    2.041    |
Kraken |    1.548    |    1.711    |    1.877    |    2.103    |    2.268    |
------------------------------------------------------------------------------
encode speed (MB/s):
       |   HyperFast4|   HyperFast3|   HyperFast2|   HyperFast1|   SuperFast |
Selkie |    834.386  |    742.003  |    557.065  |    465.025  |    412.442  |
Mermaid|    701.818  |    534.711  |    433.517  |    350.444  |    324.358  |
Kraken |    701.792  |    531.799  |    420.887  |    364.245  |    240.661  |
------------------------------------------------------------------------------
decode speed (MB/s):
       |   HyperFast4|   HyperFast3|   HyperFast2|   HyperFast1|   SuperFast |
Selkie |   4352.567  |   4355.253  |   4111.801  |   4256.927  |   4231.549  |
Mermaid|   3118.633  |   2590.950  |   2449.676  |   2429.461  |   2394.976  |
Kraken |   2247.102  |   1431.774  |   1366.672  |   1399.416  |   1272.313  |
------------------------------------------------------------------------------
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal3  |
Selkie |1.75:285:3847|1.83:127:4121|1.86: 55:4296|1.93: 10:4317|1.94:7.2:4297|
Mermaid|2.12:226:2307|2.19:115:2533|2.21: 52:2661|2.37:5.5:2320|2.44:4.2:2256|
Kraken |2.32:152:1387|2.39: 30:1483|2.44: 23:1469|2.55:9.8:1350|2.64:3.5:1292|
Leviath|2.48: 58: 899|2.56: 23: 937|2.62: 11: 968|2.71:3.9: 948|2.75:2.4: 932|
------------------------------------------------------------------------------
compression ratio (raw/comp):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal3  |
Selkie |    1.748    |    1.833    |    1.863    |    1.933    |    1.943    |
Mermaid|    2.118    |    2.194    |    2.207    |    2.370    |    2.439    |
Kraken |    2.320    |    2.390    |    2.435    |    2.553    |    2.640    |
Leviath|    2.479    |    2.557    |    2.616    |    2.708    |    2.749    |
------------------------------------------------------------------------------
encode speed (MB/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal3  |
Selkie |    284.979  |    127.375  |     55.468  |     10.398  |      7.168  |
Mermaid|    226.279  |    114.597  |     52.334  |      5.457  |      4.229  |
Kraken |    152.400  |     29.891  |     22.928  |      9.849  |      3.530  |
Leviath|     58.356  |     23.379  |     10.845  |      3.927  |      2.380  |
------------------------------------------------------------------------------
decode speed (MB/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal3  |
Selkie |   3846.881  |   4121.199  |   4296.318  |   4317.344  |   4297.364  |
Mermaid|   2307.345  |   2532.950  |   2660.551  |   2320.415  |   2255.556  |
Kraken |   1387.219  |   1483.488  |   1469.246  |   1350.332  |   1292.404  |
Leviath|    899.052  |    937.473  |    968.337  |    948.179  |    932.194  |
------------------------------------------------------------------------------

5/21/2019

On the entropy of distance coding

Let's look at a little math problem that I think is both amusing and illuminating, and is related to why LZ77 distance coding is fundamentally inefficient for sending a sequence of repeated tokens.

The toy problem we will study is this :

Given an array of random bytes, you for some silly reason decide to compress them by sending the distance to the most recent preceding occurrence of that byte.

So to send 'A' you count backward from your current position P ; if there's an A at P-1, you send a 0, if it's at P-2 send a 1, etc.

What is the entropy of this distance encoding?

This "compressor" is silly, it expands (it will send more than 8 bits per byte). How much does it expand?

First of all, why are we looking at this? It's a boiled down version of LZ77 on a specific type of data. LZ77 sends repeated strings by sending the distance to the previous occurrence of that string. Imagine your data consists of a finite set of tokens, each several bytes long, drawn from a random source, with 256 unique tokens. If you had the indices of the tokens, you would just have a random byte sequence. But LZ77 does not have that and cannot convert the stream back to the token indices. Instead the best it can do is to parse on token boundaries and send the distance to the previous occurrence of that token.

This does occur in the real world. For example consider something like a palettized image. The palette indices are bytes that refer to a dictionary of 256 colors. If you are given the image to compress after de-palettizing, you would see something like 32-bit RGBA tokens that repeat. Once you have seen the 256 unique colors, you no longer need to send anything but distances to previous occurrences of a color. English text is also a bit like this, with the tokens equal to a word+punctuation string of variable length.

So back to our toy problem. To get the entropy of the distance coding, we need the probability of the distances.

To find that, I think about coding the distance with a sequence of binary coding events. Start at distance = 0. Either the current byte matches ours, or it doesn't. Chance of matching is (1/256) for random bytes. The chance of no-match is (255/256). So we multiply our current probability by (1/256) and the probability of all future distances by (255/256), then move to the next distance. (perhaps it's easier to imagine that the preceding bytes don't exist yet; rather you are drawing a random number as you step back in distance; any time you get your own value you stop)

This gives you the probability distribution :

P(0) = (1/256)
P(1) = (255/256) * (1/256)
P(2) = (255/256)^2 * (1/256)
...
P(n) = (255/256)^n * (1/256)

An alternative way to get the same distribution is to look at distance D. First multiply by the probability that it is not a lower distance (one minus the sum of all lower distance probabilities). Then the probability that it is here is (1/256). That's :
P(0) = (1/256)
P(1) = (1 - P(0)) * (1/256)
P(2) = (1 - P(0) - P(1)) * (1/256)
...
P(n) = (1 - P(0) - P(1) .. - P(n-1)) * (1/256)

which is equal to the first way.

Given this distribution we can compute the entropy :

H = - Sum_n P(n) * log2( P(n) )

starting at n = 0

let x = (255/256)
P(n) = (1-x) * x^n

log2( P(n) ) = log2( 1-x ) + n * log2( x )


H = - Sum_n (1-x) * x^n * ( log2( 1-x ) + n * log2( x ) )

terms that don't depend on n pull out of the sum :

H = - (1-x) * log2( 1-x ) * Sum_n x^n 
    - (1-x) * log2( x ) * Sum_n n * x^n

we need two sums :

G = Sum_n x^n 
S = Sum_n n * x^n

G is just the geometric series
G = 1 / (1 - x)

recall the trick to find G is to look at G - x*G
we can use the same trick to find S
S - x*S = G - 1
the other standard trick to find S is to take the d/dx of G
either way you get:

S = x / (1-x)^2

H = - (1-x) * log2( 1-x ) * G
    - (1-x) * log2( x ) * S

H = - (1-x) * log2( 1-x ) * ( 1 / (1-x) )
    - (1-x) * log2( x ) * ( x / (1-x)^2 )

H = - log2( 1-x ) - log2( x ) * ( x / (1-x) )

putting back in x = (255/256)
1-x = 1/256

the first term "- log2( 1-x )" is just 8 bits, send a byte

H = 8 + 255 * log2( 256 / 255 )

H = 8 + 1.43987 bits

about 9.44 bits

or if we look at general alphabet size N now instead of source bytes

H = log2(N) + (N-1) * log2( N / (N-1) )

recalling of course log2(N) is just the number of bits it would take to code a random symbol of N possible values
we'll look at the expansion due to distance coding, H - log2(N)

if we now take the limit of N large

H - log2(N) -> (N-1) * log2( N / (N-1) )
H - log2(N) -> (N-1) * log2( 1 + 1 / (N-1) )
H - log2(N) -> log2( ( 1 + 1 / (N-1) ) ^ (N-1) )
H - log2(N) -> log2( ( 1 + 1/N ) ^ N )

( 1 + 1/N ) ^ N -> e  !!

H - log2(N) -> log2( e ) = 1.44269 bits

for large alphabet, the excess bits sent due to coding distances instead of indices is log2(e) !!

I thought it was pretty entertaining for Euler to show up in this problem.

Why is distance coding fundamentally inefficient (vs just coding the indices of these repeated tokens) ? It's due to repetitions of values.

If our preceding bytes were "ABCB" and we now wish to code an "A" , it's at distance = 3 because we had to count the two B's. Our average distance is getting bumped up because symbols may occur multiple times before we find our match. If we did an LZ77 coder that made all substrings of the match length unique, we would not have this inefficiency. (for example LZFG which sends suffix trie node indices rather than distances does this)

We can see where this inefficiency appears in the probabilities:

if you are drawing a random number from 256 at each step
keep stepping until you get a match
each time you have to draw from all 256 possibilities (repeats allowed)

P(0) = (1/256)
P(1) = (255/256) * (1/256)
P(2) = (255/256)^2 * (1/256)
...
P(n) = (255/256)^n * (1/256)

(as above)

instead imagine drawing balls from a hat
once you draw a ball it is discarded so it cannot repeat
stop when you match your byte
first draw has the same 1/256 chance of match :

P(0) = (1/256)

as before multiply by 1-P to get the probability of continuing
but now we only have 255 balls left, so the chance of a match is (1/255)

P(1) = (255/256) * (1/255) = (1/256)

current P was (1/255) so multiply the next by (254/255)
now we have 254 , so we match (1/254) of the time :

P(2) = (255/256) * (254/255) * (1/254) = (1/256)
...
P(n) = (1/256)

it's just a flat probability.

decrementing the alphabet size by one each time makes a big difference.
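The telescoping that makes the without-replacement distribution flat can be checked directly (a quick sketch of the product above) :

```c
#include <assert.h>
#include <math.h>

/* P(n) for drawing balls from a hat without replacement :
   fail to match while k balls have been removed : (255-k)/(256-k)
   then match among the remaining 256-n : 1/(256-n) */
static double without_replacement_P(int n)
{
    double p = 1.0;
    for (int k = 0; k < n; k++)
        p *= (255.0 - k) / (256.0 - k);
    return p * (1.0 / (256.0 - n));
}
```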

This theoretical coding loss is also observed exactly in the real world.


experiment :

make 100,000 random bytes

make a palette of 256 32-bit random dwords

use the 100,000 random bytes to index the palette to output 400,000 bytes

what can LZ77 do on these 400,000 bytes?

Our theoretical analysis says the best possible is 9.44 bits per palette index

plus we have to send 1024 bytes of the palette data as well

best = 100,000 * 9.44/8 + 1024 = 119,024
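The setup for this experiment can be sketched like so (rand() stands in for a real random source; the exact buffer contents don't matter, only the depalettized structure) :

```c
#include <stdlib.h>
#include <string.h>

/* depalettize 100,000 random byte indices through a palette of 256 random
   32-bit dwords; out must hold 400,000 bytes; returns the number of
   distinct palette entries actually used */
static int make_depal_test(unsigned char * out)
{
    unsigned int palette[256];
    int used[256] = { 0 };
    for (int i = 0; i < 256; i++)
        palette[i] = ((unsigned int)(rand() & 0xffff) << 16) | (unsigned int)(rand() & 0xffff);
    for (int i = 0; i < 100000; i++)
    {
        int idx = (rand() >> 7) & 0xff;  /* avoid low-order rand() bits */
        used[idx] = 1;
        memcpy(out + i*4, &palette[idx], 4);
    }
    int distinct = 0;
    for (int i = 0; i < 256; i++) distinct += used[i];
    return distinct;
}
```

Compressing the 400,000 byte output with an LZ then shows how close it gets to the ~119,024 byte bound.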

real world :

Leviathan Optimal5
random_bytes_depal.bin :    400,000 ->   119,034 =  2.381 bpb =  3.360 to 1

It turns out this theoretical limit is almost exactly achieved by Oodle Leviathan, only 10 bytes over due to headers and so on.

Leviathan is able to hit this limit (unlike simpler LZ77's) because it will entropy code all the match signals to nearly zero bits (all the matches will be "match len 4" commands, so they will have a probability near 1 and thus entropy code to zero bits). Also the offsets are in multiples of 4, but Leviathan will see the bottom bits are always zero and not send them. The result is that Leviathan sends the matches in this kind of data just as if it was sending a distance in tokens (as opposed to bytes) back to find the repeated token, which is exactly what our toy problem looked at.

We cannot do better than this with LZ77-style distance coding of matches. If we want to beat this we must induce the dictionary and send id references to whole words. Leviathan and similar LZ's cannot get as close to the optimum on text, because we don't handle tokens of varying length as well. In this kind of depalettized data, the match length is totally redundant and should be coded in zero bits. With text there is content-length correlation which we don't model at all.

Also note that we assume random tokens here. The loss due to sending distances instead of token id's gets even worse if the tokens are not equiprobable, as the distances are a poor way to capture the different token probabilities.

Summary :

The coding loss due to sending repeated tokens by distance rather than by index is at least log2(e) bits for large alphabet. This theoretical limit acts as a real world limit on the compression that LZ77 class algorithms can achieve on depalettized data.

4/09/2019

Oodle 2.8.0 Release

Oodle 2.8.0 is now out. This release continues to improve the Kraken, Mermaid, Selkie, Leviathan family of compressors. There are no new compressors or major changes in this release. We continue to move towards retiring the pre-Kraken codecs, but they are still available in Oodle 2.8.

The major new features in Oodle 2.8 are :

  • "Job" threading of the encoders in Oodle Core. This allows you to get multi-threaded encoding of medium size buffers using Oodle Core on your own thread system (or optionally use the one provided in OodleX). See a whole section on this below.

  • Faster encoders, particularly the optimals, and particularly Optimal1. They're 1 - 2X faster even without Jobify threading.

  • Better limits on memory use, particularly for the encoders. You can now query the memory sizes needed and allocate all the memory yourself before calling Oodle, and Oodle will then do no allocations. See OodleLZ_GetCompressScratchMemBound, example_lz_noallocs, and the FAQ.

An example of the encoder speed improvement on the "seven" test set, measured with ect on a Core i7 3770, Kraken levels 5 (Optimal1) and 7 (Optimal3) :


Oodle 2.7.5 :
ooKraken5       :  3.02:1 ,    3.3 enc MB/s , 1089.2 dec MB/s
ooKraken7       :  3.09:1 ,    1.5 enc MB/s , 1038.1 dec MB/s

Oodle 2.8.0 : (without Jobify)
ooKraken5       :  3.01:1 ,    4.6 enc MB/s , 1093.6 dec MB/s
ooKraken7       :  3.08:1 ,    2.3 enc MB/s , 1027.6 dec MB/s

Oodle 2.8.0 : (with Jobify)
ooKraken5       :  3.01:1 ,    7.2 enc MB/s , 1088.3 dec MB/s
ooKraken7       :  3.08:1 ,    2.9 enc MB/s , 1024.6 dec MB/s

See the full change log for more.


About the new Jobify threading of Oodle Core :

Oodle Core is a pure code lib (as much as possible) that just does memory to memory compression and decompression. It does not have IO, threading, or other system dependencies. (that's provided by Oodle Ext). The system functions that Oodle Core needs are accessed through function pointers that the user can provide, such as for allocations and logging. We have extended this so you can now plug in a Job threading system which Oodle Core can optionally use to multi-thread operations.

Currently the only thing we will multi-thread is OodleLZ_Compress encoding of the new LZ compressors (Kraken, Mermaid, Selkie, Leviathan) at the Optimal levels, on buffers larger than one BLOCK (256 KB). In the future we may multi-thread other things.

Previously if you wanted multi-threaded encoding you had to split your buffers into chunks and multi-thread at the chunk level (with or without overlap), or by encoding multiple files simultaneously. You still can and should do that. Oodle Ext for example provides functions to multi-thread at this granularity. Oodle Core does not do this for you. I refer to this as "macro" parallelism.

The new Jobify support in Oodle Core provides "micro" parallelism that uses multiple cores even on individual buffers. It parallelizes at the BLOCK level, hence it will not get any parallelism on buffers <= one BLOCK (256 KB).

Threading of OodleLZ_Compress is controlled by the OodleLZ_CompressOptions:jobify setting. If you don't touch it, the default value (Jobify_Default) is to use threads if a thread system is plugged in to Oodle Core, and to not use threads if no thread system is plugged in. You may change that option to explicitly control which calls try to use threads and which don't.

OodleX_Init plugs the Oodle Ext thread system in to Oodle Core. So if you use OodleX and don't touch anything, you will now have Jobify threading of OodleLZ_Compress automatically enabled. You can specify Threads_No in OodleX_Init if you don't want the OodleX thread system. If you use OodleX you should NOT plug in your own thread system or allocators into Oodle Core - you must let OodleX provide all the plugins. The Oodle Core plugins allow people who are not using OodleX to provide the systems from their own engine.

WHO WILL SEE AN ENCODE PERF WIN :

If you are encoding buffers larger than 1 BLOCK (256 KB).

If you are encoding at level Optimal1 (5) or higher.

If you use the new LZ codecs (Kraken, Mermaid, Selkie, Leviathan)

If you plug in a job system, either with OodleX or your own.

CAVEAT :

If you are already heavily macro-threading, eg. encoding lots of files multi-threaded, using all your cores, then Jobify probably won't help much. It also won't hurt, and might help ensure full utilization of all your cores. YMMV.

If you are encoding small chunks (say 64 KB or 256 KB), then you should be macro-threading, encoding those chunks simultaneously on many threads and Jobify does not apply to you. Note when encoding lots of small chunks you should be passing pre-allocated memory to Oodle and reusing that memory for all your compress calls (but not sharing it across threads - one scratch memory buffer per thread!). Allocation time overhead can be very significant on small chunks.

If you are encoding huge files, you should be macro-threading at the chunk level, possibly with dictionary backup for overlap. Contact RAD support for the "oozi" example that demonstrates multi-threaded encoding of huge files with async IO.

NOTE : All the perf numbers we post about and show graphs for are single threaded speeds. I will try to continue to stick to that.


A few APIs have changed & the CompressOptions struct has changed.

This is why the middle version number (8) was incremented. When the middle ("major") version of Oodle is the same, the Oodle lib is binary link compatible. That means you can just drop in a new DLL without recompiling. When the major version changes you must recompile.

A few APIs have small signature changes :

 OodleLZ_GetDecodeBufferSize, OodleLZ_GetCompressedBufferSizeNeeded and OodleLZ_GetInPlaceDecodeBufferSize :
    take compressor argument to return smaller padding for the new codecs.
 OodleLZ_GetChunkCompressor API : take compressed size argument to ensure it doesn't read past end

These should give compile errors and be easy to fix.

The CompressOptions struct has new fields. Those fields may be zero initialized to get default values. So if you were initializing the struct thusly :

struct OodleLZ_CompressOptions my_options = { 1, 2, 3 };
the new fields on the end will be implicitly zeroed by C, and that is fine.

NOTE : I do NOT recommend that style of initializing CompressOptions. The recommended pattern is to GetDefaults and then modify the fields you want to change :

struct OodleLZ_CompressOptions my_options = OodleLZ_CompressOptions_GetDefault();
my_options.seekChunkReset = true;
my_options.dictionarySize = 256*1024;
then after you set up options you should Validate :
OodleLZ_CompressOptions_Validate(&my_options);
Validate will clamp values into valid ranges and make sure that any constraints are met. Note that if Validate changes your options you should really look at why, you shouldn't be shipping code where you rely on Validate to clamp your options.


WARNINGS :

example_lz before 2.8.0 had a bug that caused it to stomp the user-provided input file, if one was provided.

YES IT WOULD STOMP YOUR FILE!

That bug is not in the Oodle library, it's in the example code, so we did not issue a critical bug fix for it, but please beware running the old example_lz with a file argument. If you update to the 2.8.0 SDK please make sure you update the *whole* SDK including the examples, not just the lib!

On Windows it is very important to not link both Oodle Core and Ext. The Oodle Ext DLL includes a whole copy of Oodle Core - if you use OodleX you should ONLY link to the Ext DLL, *not* both.

Unfortunately because of the C linkage model, if you link to both Core and Ext, the Oodle Core symbols will be multiply defined and just silently link without a warning or anything. That is not benign. (It's almost never benign and C needs to get its act together and fix linkage in general). It's specifically not benign here, because Oodle Ext will be calling its own copy of Core, but you might be calling to the other copy of Core, so the static variables will not be shared.

Benchmarking Oodle with ozip -b

The ozip utility is designed to act mostly like gzip. A compiled executable of ozip is provided with Oodle for easy testing, or you may download ozip source on github.

ozip now has a benchmarking option (ozip -b) which is an easy way to test Oodle.

ozip -b runs encode & decode many times to provide accurate timing. It does not include IO. It was designed to be similar to zstd -b so that they are directly comparable.

ozip -b can take a file or a dir (plus wildcard), in which case it will test all the files in the dir. You can set up the specific compressor and options you want to test to see how they affect performance and compression ratio.

So for example you can test the effect of spaceSpeedTradeoffBytes on Kraken level Optimal1 :

r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla -os512
K 5 mozilla          :  51220480 ->  14288373 (3.585),   3.5 MB/s, 1080.4 MB/s

r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla
K 5 mozilla          :  51220480 ->  14216948 (3.603),   3.5 MB/s, 1048.6 MB/s

r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla -os128
K 5 mozilla          :  51220480 ->  14164777 (3.616),   3.5 MB/s, 1004.6 MB/s
Or to test Kraken HyperFast3 on all the files in Silesia :
r:\>ozip -b -c8 -ol-3 r:\testsets\silesia\*
K-3 12 files         : 211938580 ->  81913142 (2.587), 339.0 MB/s, 1087.6 MB/s


Another option for easy testing with Oodle is example_lz_chart, which is also provided as a pre-compiled exe and also as source code.

example_lz_chart runs on a single file you provide and prints a report of the compression ratio and speed of various Oodle compressors and encode levels.

This gives you an overview of the different performance points you can hit with Oodle.


WARNING :

Do not try to profile Oodle by process timing ozip.

The normal ozip (not -b mode) uses stdio and is not designed to be as efficient as possible. It's designed for simplicity and to replicate gzip behavior when used for streaming pipes on UNIX.

In general it is not recommended to benchmark by timing with things like IO included because it's very difficult to do that right and can give misleading results.

See also :

Tips for benchmarking a compressor
The Perils of Holistic Profiling
Tips for using Oodle as Efficiently as Possible


NOTE :

ozip does not have any threading. ozip -b is benchmarking single threaded performance.

This is true even for the new Jobify threading because ozip initializes OodleX without threads :

    OodleX_Init_Default(OODLE_HEADER_VERSION,OodleX_Init_GetDefaults_DebugSystems_No,OodleX_Init_GetDefaults_Threads_No);

I believe that zstd -b is also single threaded so they are apples to apples. However some compressors use threads by default (LZMA, LZHAM, etc.), so if they are being compared they should be set to not use threads OR you should use Oodle with threads. Measuring multi-threaded performance is context dependent (for example, are you encoding many small chunks simultaneously?) and I generally don't recommend it; it's much easier to compare fairly with single threaded performance.

For high performance on large files, ask for the "oozi" example.

1/20/2019

The Oodle LZNIB Algorithm

Oodle LZNIB is an LZ77-family compressor which has very good decode speed (about 5X faster than Zip/deflate, about 1/2 the speed of LZ4) while getting better compression than Zip/deflate. This was quite unique at the time it came out. It is now made obsolete by Oodle Mermaid/Selkie.

I thought it might be interesting to write up LZNIB and look at some of the things we learned from it.

LZNIB is a non-entropy coded LZ (*) which writes values in nibbles and bytes. Literals are written uncompressed (8 bits).

(* = actually it's more like "semi entropy coded"; it doesn't use bit io, and doesn't use standard entropy coders like Huffman, but the variable length integer encoding was adjusted to minimize entropy, and can vary to fit the file's statistics; more details on this later)

LZNIB can send three actions : literal run ("lrl") , match with offset ("normal match"), match with repeat offset ("rep match").

One of the key realizations for LZNIB is that even though there are three actions, only two are possible at any time.


After Literals :

LRL should not occur (would have just been a longer LRL)
match & rep match are possible

After Match or Rep Match :

rep match should not occur (would have just been a longer match)
normal match & lrl are possible

because there are only two actions possible at each point, we can send this using a nibble with a decision threshold value :

Threshold T

values in [0 , T-1] are action type 1
values in [T , 15] are action type 2

So the core decode step of LZNIB is to grab the next nibble, do one branch, and then process that action. The values within your threshold group are the first part of your LRL or match len. There are at least two different thresholds, one for {match/rep-match} in the after-literals state, and one for {match/lrl} in the after-match state. In LZNIB we hard-coded the threshold for match/rep-match to 5 as this was found to be good on most data. The optimal {match/lrl} threshold is more dependent on the data.
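In code, that core decode step looks something like this (a toy sketch with made-up names, not the actual Oodle source):

```c
#include <stdint.h>

/* Hypothetical sketch of the LZNIB control-nibble split.
   A nibble below the threshold T selects action type 1 and carries
   the first part of that action's length; a nibble at or above T
   selects action type 2, remapped to start at zero. */
typedef struct {
    int action;   /* 1 or 2 */
    int len_bits; /* length value carried in the nibble */
} NibbleAction;

static NibbleAction decode_control_nibble(int nib, int T)
{
    NibbleAction a;
    if (nib < T) {
        a.action = 1;
        a.len_bits = nib;        /* values in [0, T-1] */
    } else {
        a.action = 2;
        a.len_bits = nib - T;    /* values in [T, 15] remapped to [0, 15-T] */
    }
    return a;
}
```

With the hard-coded after-literals threshold of 5, nibbles 0-4 are rep matches and 5-15 are normal matches.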

Approximately, (T/16) should be the probability of action type 1. This is in fact exactly what entropy coding this binary action choice would do. What entropy coding does is take your range of possible encodable values and splits them by the probability of the binary choice. The remaining values in the word can then be split to send further choices. Here we just do the first choice using a semi-entropy-coding, then use the remainder of the word to send as much of our next value ("len") as we can. (the fact that we do not entropy code "len" and that len has probability peaks is why the probability based choice of T might not be actually optimal)

In LZNIB the threshold T could be transmitted per chunk so that it could be optimized to fit the data. In practice that was rarely used, and mostly the default value of T=8 was used. Part of why it was rarely used is due to the difficulty of parse-statistics feedback. The parse must make decisions based on some assumed T, because it affects coding cost. Then after you have your parse choices you can make a choice for optimal T for that data, but if that T is different you might want to re-parse with the new T. This is a mess and the simplest way to address it here is just to parse for all values of T. You can reduce the set of likely useful T values to around 8 or 10 and just do the parse 8-10X times to make a T choice. This in fact works great and helps compression a lot on some data, but is obviously slow.

In contrast to something like LZ4, LZNIB has a flexible action sequence. LZ4 is always doing {LRL,match,LRL,match,...}; for example, to send two matches in a row you must send an LRL of length 0 between them. LZNIB's flexible action sequence is what requires the branch, but it means it can send {LRL,match,match,LRL,rep match,match,...}

LZNIB uses unbounded offsets for matches. They are sent using a variable length integer scheme. Initially 12 bits are sent, then a byte is added as necessary. The scheme used is "encodemod", which treats each value sent as being divided in two ranges. One range is for values that can terminate in the current word, the other range is for values that don't fit and includes a portion of the remaining value. See the encodemod post series for details.

The encodemod scheme is very flexible and can be tuned to fit the entropy characteristics of the data (unlike traditional bit-packing variable length integer schemes, which have a less flexible knob). To do this we gathered LZ parses for many files and found the best encodemod thresholds. This was done for the offsets, for LRL, match len (after match), match len (after literal), and rep match len.

All the lens (LRL, match len, etc.) are sent first in the control nibble. If they don't fit, the maximum is sent in the control nibble, then the remainder is sent using encodemod. The encodemod used for lens sends first another nibble, then any further values are sent with bytes.
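To make the encodemod idea concrete, here is a simplified byte-at-a-time sketch (hypothetical names and layout; the real LZNIB coder sends 12 bits first for offsets, uses nibbles for lens, and uses thresholds tuned from gathered parse statistics):

```c
#include <stdint.h>

/* Simplified encodemod-style variable length integer over bytes.
   Each byte has 256 codes; codes below 't' terminate the value,
   codes >= 't' carry a partial value and continue into the next byte. */

static int vlc_encode(uint32_t v, uint32_t t, uint8_t *out)
{
    uint32_t r = 256 - t;  /* number of continuation codes per byte */
    int n = 0;
    while (v >= t) {
        out[n++] = (uint8_t)(t + (v - t) % r);  /* continuation code */
        v = (v - t) / r;
    }
    out[n++] = (uint8_t)v;                      /* terminating code */
    return n;
}

static uint32_t vlc_decode(const uint8_t **pp, uint32_t t)
{
    uint32_t r = 256 - t;
    uint8_t w = *(*pp)++;
    if (w < t) return w;
    return t + (w - t) + r * vlc_decode(pp, t);
}
```

Raising `t` favors small values (more of them terminate in one byte) at the cost of longer codes for big values; that is the knob that was tuned per field from real parses.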

The full decoder for LZNIB in pseudocode is :


Do a literal run to start.

After-literal state :

Get control nibble
if nib < 5 
{
  rep match
  if nib == 4 get further ml using encodemod, first nibble then bytes

  copy rep match
}
else
{
  normal match
  if nib == 15 get further ml using encodemod, first nibble then bytes

  get offset using encodemod, first 12 bits then bytes

  copy match
}

After-match state :

Get control nibble
if nib < T 
{
  literal run
  if nib == T-1 get further lrl using encodemod, first nibble then bytes

  copy literal run

  goto After-literal
}
else
{
  normal match
  if nib == 15 get further ml using encodemod, first nibble then bytes

  get offset using encodemod, first 12 bits then bytes

  copy match  

  goto After-match
}

That's it.


LZNIB is simple to decode but it turned out to be quite hard to parse well. Making good choices in the encoder can have an absolutely huge effect on the decode speed, even at roughly equal compressed size. Some of those issues are non-local. LZNIB was the first time we encountered an LZ like this, and it turned out to be important again for Kraken & Mermaid/Selkie.

One of the issues with LZNIB is there are a lot of exact ties in compressed size, since it steps in units of nibbles. Those ties need to be broken in favor of faster decode speed.

To good approximation, decode speed is just about minimizing the number of control nibbles. You want as few transitions between actions as possible, you prefer to stay in long literal runs and long matches. You definitely don't want to be doing more mode switches if there's an encoding of the same compressed size with fewer mode switches.

Let's look at some simple examples to get a feel for these issues.

Consider a rep match of len 1. A rep match control takes 1 nibble, while sending a literal instead takes 2 nibbles, so sending a rep would save you 1 nibble. But if the rep is followed by another literal run, you have to send a new lrl control nibble there, whereas staying in the lrl would have avoided it.


L = literal , costs 2 nibbles + 1 at start of lrl
R = rep match, costs 1 control nibble
M = normal match, costs 1 control nibble + 3 or more nibbles for offset

LLR = 1+2*2+1 = 6 nibbles
LLL = 1+3*2 = 7 nibbles

so LLR is best right?  take the rep!

LLRL = 1+2*2+1 + 1+2 = 9 nibbles
LLLL = 1+4*2 = 9 nibbles

No!  Nibble count is the same but fewer controls = prefer the literal.
It depends what follows the rep.

LLRM
LLLM

in this case the rep is cheaper.

So if a literal follows the rep, don't take it, right?

LLLLRLLLL = 1+2*4 + 1 + 1+2*4 = 19 nibbles
LLLLLLLLL = 1+2*9 + 1 = 20 nibbles

No! In the second case the LRL of 9 can't fit in the control nibble, so an
extra lrl nibble must be sent.  So prefer the rep here.

So the simple choice of "do I take a rep of len 1 or stay in LRL" is not easy and can only be made non-locally.

A similar thing happens with normal matches. A normal match of len 3 with an offset that fits in 12 bits takes 4 nibbles, which saves you 2 vs sending 3 literals. But if you have to resume an LRL after the match, that costs you 1, so your savings is down to 1. There may be cheaper ways (in terms of decode speed) to get that 1 nibble savings, such as a rep of len 1 for example. Or if you can trade two len 3 matches for a single len 4 match :


MMMLMMML = 4 + 3 + 4 + 3 = 14
LLMMMMLL = 5 + 4 + 5     = 14

same nibble count, but fewer mode switches = prefer the single len 4 match over two len 3's

A len 3 match that doesn't fit in the 12 bit offset (but does fit in the next threshold, at 20 bits) takes 6 nibbles to send, which is a tie with 3 literals. But again it could actually save you a nibble if it causes the LRL to fit in control.
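The nibble counting in the examples above can be captured in a tiny cost function (a toy model with hypothetical names, using the costs from the examples: 2 nibbles per literal plus 1 control at run start and 1 extra past LRL 7; 1 nibble per rep token; 4 nibbles for one match token with a 12-bit offset):

```c
#include <string.h>

/* Toy nibble-cost counter for action strings like "LLRL".
   L = one literal byte; a run of L's is one LRL token.
   R = one rep match token (1 control nibble).
   M = normal match; a whole run of M's is one match token
       (1 control nibble + 3 nibbles of 12-bit offset). */
static int lznib_cost(const char *s)
{
    int cost = 0;
    size_t i = 0, n = strlen(s);
    while (i < n) {
        char c = s[i];
        size_t run = 0;
        while (i < n && s[i] == c) { run++; i++; }
        if (c == 'L')
            cost += 1 + 2 * (int)run + (run > 7 ? 1 : 0);
        else if (c == 'R')
            cost += 1;
        else if (c == 'M')
            cost += 1 + 3;
    }
    return cost;
}
```

Running this on the sequences above reproduces the counts: LLR = 6 vs LLL = 7, but LLRL ties LLLL at 9, and LLLLRLLLL = 19 beats LLLLLLLLL = 20 because the LRL of 9 overflows the control nibble.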

You might be tempted to just punt this, eg. make rep matches have a minimum length of 2, and normal matches have a minimum length of 4. However that is leaving a lot of compression on the table. The shorter matches only save a nibble here and there, but on some files there are many of those possible. For the optimal parsers we wanted to be able to get those wins when they are possible.


The challenge of optimal parsing LZNIB.

The standard approach for optimal parsing a format with rep matches is the "forward arrivals" style parse (as in LZMA). (This is in contrast to the classical backwards LZSS optimal parse, which can only be used in formats that do not have parse-path-dependent adaptive state.)

See some posts on forward-arrivals parse : here and here

The standard forward arrivals parse does not work well with LZNIB. In the forward-arrivals parse, you take the cheapest way to arrive at pos P, you cost all ways to go forward from P (with a literal, or various match options), and fill out the costs at P+1, P+len, etc.

The problem is that it doesn't account for the fact that the best way to arrive at pos P depends on what comes next (in particular, do literals or matches come next, and if literals, how many?). It treats each action as being independent.

We'll look at some ways to improve the forward-arrivals parse, but first a detour: fortunately, in this case it's possible to solve the problem systematically.

Whenever I face a difficult heuristic like this where we know we're approximating in some way and don't know quite the best way, the first thing I always look for is - can we solve it exactly? (or with some bounded error). The more exact solution might be too slow and we won't actually be able to ship it, but it gives us a target, it lets us know how close our approximation is, and may give us guidance on how to get there.

In this case what we can do is a full dynamic programming parse with the LRL as part of the state matrix.

Optimal parsing is always a form of dynamic programming. We often approximate it by ignoring the internal state of the parse and making an array of only [pos]. What I mean by "full dynamic programming" is to make the state explicit and use a 2d array of [pos][state]. Then on each parse choice, you look at how that steps position (eg. by match len) and also how that changes the internal state, and you move to the next array slot. In this case the important state variable is LRL.

(we're treating a single literal as a coding action, which it is not, and correcting that by considering LRL as a state variable. The result is that we get correct code costs for all LRL steps from each pos.)

(note that the offset which is used for rep matches is also an important internal state variable, but we are continuing to ignore that as is usual; we do store a different rep arrival in each [pos][lrl] entry, but we don't differentiate between arrivals that only differ by rep offset)

We consider LRL's in [0,21]. This lets us capture the transition of LRL not fitting in the control nibble (at 7, with the default threshold of 8), and then again when it overflows the first nibble of encodemod (at 7+15=21). LRL value of 21 is a state for all higher LRL's, so we don't account for when later bytes of encodemod overflow.

We make a 2d matrix that's (file len) * 22 entries.

At each pos P, you can advance all the entries by adding one literal. This does :


for all LRL

costs[P+1][LRL+1] = 2 + costs[P][LRL] + LRL_Delta_Cost(LRL)

(2 nibbles is the cost of a literal byte)

where LRL_Delta_Cost = 

1 at LRL = 0
1 at LRL = 7
1 at LRL = 21
otherwise zero

Or you can advance with a match. To advance with a match, start from the cheapest arrival with any LRL and step by the match len and fill costs[P+len][0]. You can also advance with a rep match, which is similar except that it cannot advance from the LRL=0 state. Each arrival stores where it came from (both pos and LRL).

When you reach the end of the file, you take the cheapest of all the LRL arrivals and trace it backwards to the root to find the parse.

This kind of full matrix dynamic programming parse completely captures the non-local effects caused by things like LRL running into thresholds that change the cost. Unfortunately it's too slow for production (and uses a lot of memory), but it is very useful as a reference point. It told us that a much better optimal parse was possible.
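A toy version of the literal-advance step of that [pos][lrl] dynamic program, with LRL_Delta_Cost exactly as described above (hypothetical names, not the shipping parser; the match advance and backtrace are omitted):

```c
#include <string.h>

#define MAX_LRL 21           /* catch-all state for all longer runs */
#define INF     (1 << 29)

/* extra nibble paid when a literal run starts (control nibble at
   LRL 0), overflows the control nibble (at 7), or overflows the
   first encodemod nibble (at 21) */
static int lrl_delta_cost(int lrl)
{
    return (lrl == 0 || lrl == 7 || lrl == 21) ? 1 : 0;
}

/* advance all LRL states at pos P by one literal into pos P+1 */
static void advance_literal(const int cur[MAX_LRL + 1], int next[MAX_LRL + 1])
{
    for (int i = 0; i <= MAX_LRL; i++) next[i] = INF;
    for (int lrl = 0; lrl < MAX_LRL; lrl++) {
        if (cur[lrl] == INF) continue;
        int c = cur[lrl] + 2 + lrl_delta_cost(lrl);  /* 2 nibbles per literal */
        if (c < next[lrl + 1]) next[lrl + 1] = c;
    }
    if (cur[MAX_LRL] != INF) {  /* state 21 is sticky */
        int c = cur[MAX_LRL] + 2 + lrl_delta_cost(MAX_LRL);
        if (c < next[MAX_LRL]) next[MAX_LRL] = c;
    }
}

/* cost in nibbles of a pure literal run of length n, via the DP */
static int literal_run_cost(int n)
{
    int cur[MAX_LRL + 1], next[MAX_LRL + 1];
    for (int i = 0; i <= MAX_LRL; i++) cur[i] = INF;
    cur[0] = 0;  /* arrive with no open literal run */
    for (int step = 0; step < n; step++) {
        advance_literal(cur, next);
        memcpy(cur, next, sizeof(cur));
    }
    return cur[n < MAX_LRL ? n : MAX_LRL];
}
```

The DP reproduces the hand counts from earlier: 2 literals cost 5 nibbles, 8 literals cost 18 (the run overflows the control nibble), 9 cost 20.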

An important note : The dynamic programming parse needs to be making space-speed decisions. As noted previously, there are in particular a lot of ties, and those need to be broken in favor of decode speed. The primary driver for decode speed is the number of control tokens. What we did is just add a cost penalty to each control token. The number we used is (1/4) of a nibble penalty for each control token. That is, we will give up 1 nibble of compression if it saves 4 mode switches. If we can send {LRL} instead of {LRL,match,LRL,rep,match} we will do it if the penalty is only one nibble.

(modern Oodle now uses a rate-distortion-like Lagrange multiplier to optimize for the desired tradeoff of space & speed, which the user can control. This work in LZNIB predates that system and just used a hard-coded tradeoff that we found to greatly improve decode speed without much penalty to compressed size.)

So, armed with the dynamic programming solution, we could generate stats to look at how it was parsing files, and compare that to how forward-arrivals was parsing. What we saw was :


dynamic programming :
--------------------------------------------------------- 
7.6 bytes per token ; 30.2% literals, 55.6% matches, 14.1% LO 
AM: 50.4% match / 49.6% lrl ; 8.9 average AM ML , 7.0 lrl 
AM: offlow 40.7% ml3 24.0% ml4 21.8%
AL: 49.2% match / 50.8% LO ; 7.6 average AL ML , 6.4 LO len 
AL: offlow 53.5% ml3 24.6% ml4 19.9% ; LO len1 11.4% len2 24.9%
---------------------------------------------------------

forward arrivals :
--------------------------------------------------------- 
5.9 bytes per token ; 21.0% literals, 61.1% matches, 17.9% LO
AM: 46.4% match / 53.6% lrl ; 8.4 average AM ML , 3.5 lrl
AM: offlow 43.1% ml3 37.5% ml4 19.9%
AL: 39.5% match / 60.5% LO ; 7.5 average AL ML , 5.0 LO len
AL: offlow 36.7% ml3 43.1% ml4 15.0% ; LO len1 30.1% len2 23.4%
---------------------------------------------------------

key :
--------------------------------------------------------- 
decompressed bytes per token ; % of control tokens of each type
AM = in after-match state
AM: % of tokens ; average len of that token
AM: offlow = offset in 12 bits , ml3 = % of matches that have len 3
AL = in after-literal state
LO means "last offset" which is a synonym for "rep match"
---------------------------------------------------------

In this case the compressed size was very similar, but the dynamic programming parse was much faster to decode (about 20% faster).

We can easily see what's going on :


DP parse has more bytes per token, eg. fewer tokens for the whole file.  This is why it is faster.  This is more of the end result
of the problem rather than the source of the problem.

DP parse has way more bytes in literals (30.2% vs 21%)

DP parse has way longer average LRL (7.0 vs 3.5)

forward-arrivals parse sends way more len-3 matches (37.5% vs 24.0% and 43.1% vs 24.6%)

forward-arrivals parse sends way more rep len 1 matches (30.1% vs 11.4%)

I found it quite interesting that two ways to parse the file could produce nearly the same compressed size, but get there in very different ways.

Clearly the forward arrivals parse needs a way to find the longer literal runs when they are best in a non-local way.

When you are walking along in a forward-arrivals parse, you just see the single cheapest arrival to your pos; consider something like this :


1234
LLLR
   
At pos 4, the cheapest arrival is via rep-len1. The standard forward arrivals parse fills that spot with the rep-len1 arrival, and then continues forward from there. There's no way to go back in time. You no longer see the arrival at pos 3.

A key thing we should observe is that when a non-local cost effect makes us change a decision (away from just using cheapest arrival), it is always in favor of LRL. The standard arrivals parse is taking too many matches & reps and we want to take literal runs instead in some of those places.

The solution is pretty simple :

Store only match (and rep-match) arrivals at each pos (not literal arrivals). When considering going forward, consider starting at current pos P from the match arrival there, OR go from [P-1] with 1 LRL, or [P-2] with 2 LRL, etc.


considering an example from before :

12345678
MMMLMMML
LLMMMMLL

at pos 8 I look at the ways to go forward (find matches & reps)
say I find a match

how should I arrive at pos 8 ?

I can arrive via

arrival[7] (MMMLMMM) + LRL 1

or

arrival[6] (LLMMMM) + LRL 2

You should choose the latter.

Even though arrival[7] was the cheaper way to get to pos 7, arrival[6] is the better way to get to pos 8.

You want to limit the LRL lookback to something reasonable. Perhaps 8 (or 22 as we did in the full matrix parse, but that isn't necessary). If you find no match arrivals in the 8-step lookback, you need a way to go back to the most recent preceding match arrival.

Instead of just looking at a forward step of {match} we consider {step back 1 + L + M} , {back2 + LLM }, etc. In LZNIB, costing the LRL lookback steps is simple because literals are just always 1 byte.

Essentially what we're doing here is treating the LZNIB parse as if it used an {LRL,match} combined packet. Even though the actual format has separate literal run and match actions, they act like they are combined.
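A sketch of that lookback arrival selection (toy code with hypothetical names; literal costs are simplified to 1 control nibble plus 2 per literal, and ties are broken toward the longer literal run, i.e. fewer tokens and faster decode):

```c
#define INF_COST (1 << 29)

/* Choose the cheapest way to begin a match at pos P: start from the
   match arrival at P itself (k = 0), or from a match arrival at
   P - k followed by a literal run of length k. */
static int best_arrival(const int *match_arrival_cost, int P,
                        int max_lookback, int *out_lrl)
{
    int best = INF_COST, best_lrl = 0;
    for (int k = 0; k <= max_lookback && k <= P; k++) {
        if (match_arrival_cost[P - k] >= INF_COST) continue;
        /* LRL of k costs: 1 control nibble if k > 0, plus 2 per literal */
        int lrl_cost = (k > 0 ? 1 : 0) + 2 * k;
        int c = match_arrival_cost[P - k] + lrl_cost;
        /* <= : on ties prefer the longer LRL (fewer tokens) */
        if (c <= best) { best = c; best_lrl = k; }
    }
    *out_lrl = best_lrl;
    return best;
}
```

On the MMMLMMM vs LLMMMM example: arriving at pos 8 via arrival[7] + LRL 1 ties arrival[6] + LRL 2 at 14 nibbles, and the tie-break picks the LRL 2 path with fewer tokens.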

In fact there's a different way to look at LZNIB. After an LRL nibble action token, there must always be a match or rep-match action. So we can think of those as instead being a byte action token for {LRL+match}. In that case rather than the after-literal / after-match states, there is only one state :


LZNIB always after-match :

nibble action : match
(no rep match is possible)
byte action : LRL then match
byte action : LRL then rep-match

If you parse using these actions, then the non-local effects go away.

In the end we found huge improvements to the LZNIB parse that gave us ~20% faster decompression just through making better decisions in the encoder. The way that we investigated this and developed the parse was later fruitful in the development of Kraken & Mermaid/Selkie, which have similar issues but are even far more complicated to parse well.


Some old reference links related to LZNIB, and some perf reports showing LZNIB in the pre-oceanic-cryptozoology days :

cbloom rants 10-22-12 - LZ-Bytewise conclusions
cbloom rants 03-02-15 - LZ Rep-Match after Match Strategies
cbloom rants 05-13-15 - Skewed Pareto Chart
cbloom rants Oodle Results Update
