cbloom rants

4/28/2021

Requantization to different UNORM bit depths

I wrote before in Topics in Quantization for Games some general basics on quantization. We're going to continue from there, but focus just on the specific case : GPU convention quantization of n-bit unsigned values (UNORM).

As noted before, there are two viable conventions for uniform linear quantizers; either "floor" quantization (with bias on reconstruct) or "round" quantization (bias on quantizer) with direct reconstruction (end points restored directly). The latter is what GPUs use, therefore games should always use this convention for all quantizers to avoid mistakes.

(for example RGBE quantization in the HDR file format uses the opposite convention; floor quantize and bias on dequantize; make sure to use the right convention to match the file format)

The case we will talk about here is :


n-bit value
N = 2^n
quantized values in [0, N-1]
source values in [0.0, 1.0] (inclusive)

quantize(f) = (int)( f * (N-1) + 0.5f )

dequantize(i) = (float) i / (N-1);

quantized 0 -> 0.0
dequantized N-1 -> 1.0

Now we want to talk about "requantization". Requantization is when you have some n-bit value and need to change the representation into an m-bit value; we will assume always that both values are in UNORM GPU convention (bias on quantize).

A correct requantizer is one that corresponds to dequantization and quantization to the new bit depth :


x is quantized in n bits

dequantize :

f = (float) x / (N-1);

quantize to m bits :

y = (int)( f * (M-1) + 0.5f);

y = (int)( x * (float)(M-1)/(N-1) + 0.5f);

y is the requantization of x from n bits to m bits

(N = 2^n , M = 2^m)

Okay so back in Topics in Quantization for Games we showed that "bit replication" is the correct requantizer for changing n to 2n.

In requantization what we focus on is the denominator of the quantizer, so changing n bits to m bits is changing denominator from (N-1) to (M-1). The denominator is also the quantized representation of "One" (1.0 in dequantized range), and we need that to be preserved. The fact that bit replication is the correct requantizer is because bit replication is (N+1) (because N+1 = (1<<n) + 1) and (N+1)*(N-1) (that's (N+1)*One) = (N^2 - 1) which is the new One.

To continue that, we can show that you can repeat bit replication to double bits again :


Start at n bits

Bit replicate once : *= (N+1)

now at 2n bits
2n bits range is N^2

Bit replicate again : *= (N^2+1)

What happens to One in n bits (N-1) :

(N-1)*(N+1)*(N^2+1)
(N^2-1)*(N^2+1)
(N^4-1)

becomes One in 4n bits

Alternatively just write out the requantizer :

y = (int)( (x/(N-1)) * (N^4-1) + 0.5f );

then using the factorization above this is just :

y = (int)( ((x/(N-1)) *  (N^2-1) * (N^2+1) + 0.5f );
y = (int)( (x * (N+1) * (N^2+1) + 0.5f );

y = x * (N+1) * (N^2+1);

exact in ints , which is bit replicate twice

So for example to go from 4 to 16 bits, you bit-replicate 0xA to 0xAAAA

Note that while bit doubling is the correct requantizer for bit doubling you can *not* just truncate bits to bit half. eg. to requantize from 16 bits to 8 bits, you can not just grab the top 8 bits. While grabbing the top 8 bits would be a round-trip from 8 bit values that had been bit-doubled, it would not put the bucket boundaries in the right place for values that are in between. eg. 16-bit values in [129,255] should requantize to "1" in 8-bit, not just the 0 you would get if you truncated to the top 8 bits.

Okay, so let's get into some requantizers. I want to note that just doing the simple form :

y = (int)( x * (float)(M-1)/(N-1) + 0.5f);
is totally fine and you can just do that and not worry about this. But that said, our goal here is to work out requantizers that are exactly equal to that result, but that stay in integer.

There are a couple general techniques that help with this, so I'll do examples of both.

1. Taylor series in the denominator method.

For example, let's find the requantizer from 16 bit to 8 bit. The simple form is :

y = (int)( x * 255/65535.f + 0.5f);
To find the int form, we can try changing 65535 to 65536. One way to do that is approximate :

1/65535 = 1/(65536-1) = 1/65536 * 1/(1 - 1/65536)

We can use the Taylor expansion of 1/(1 - x) for x small :

1/(1 - x) = 1 + x + O(x^2)

using x = 1/65536 , it's reasonable to drop terms in x^2

So :

1/65535 ~= 1/65536 * ( 1 + 1/65536 )

y = (int)( x * 255/65535.f + 0.5f);
y = (int)( (x * 255 + 32767.5.f)/65535.f );
y = (int)( (x * 255 + 32767.5.f) * (1/65536) * ( 1 + 1/65536 ) );

y = (int)( (x * 255 + 32767.5.f + x*255/65536 + 32767.5.f/65536) * (1/65536) );

65536/255 = 257.003922.. ; try 257
32767.5.f + 32767.5.f/65536 = 32768 - 0.5f/65536 ; try 32768

y ~= (int)( (x * 255 + (x/257.f) + 32768) * (1/65536) );
y ~= (int)( (x * 255 + (x/257.f) + 32768) ) >> 16;

y = (x * 255 + (x/257) + 32768) >> 16;

and in fact that is the correct 16 to 8 requantizer in integers (in the sense that it matches the simple float formula exactly).

There is however an even simpler 16 to 8 requantizer which also matches :

int requantize_16_to_8( int x )
{
    ASSERT( x >= 0 && x <= 65535 );

    return ( x * 255 + 32768 + 127 )>>16;
}
which can be made by replacing the (x/257) term with its average.

2. Bit replicating up to make an infinite repeating fraction

Another technique that sometimes works is based on a different conceptual picture.

We have (for example) an 8 bit value; 0x00 is "zero" , and 0xFF is "one" (that's what they dequantize to). To get 0xFF to mean "one" we think of it as a fixed point fraction, and bit replicate so for 0xFF think :


0xFF = 0. FF FF FF ... forever

then there is no number between that and 1.0 , therefore it is equal to 1.0

so to find requantizers or interpolators, we can bit replicate several times and approximate the infinite repetition with a finite number , but then we'll be a little bit too low so we'll have to bias up a bit.

To be clear, this repeated fraction is not just hand having and not just right at the end points; if you could repeat forever it would be exact. We can see it by using our Taylor expansion again and doing all the terms :


1/(1-x) = 1 + x + x^2 + x^3 + x^4 + ...

using x = 1/256 , this is :

256/255 = 1 + 2^-8 + 2^-16 + 2^-24 + 2^-32 + ...

or

1/255 = shift down 8 + shift down 16 + shift down 24 + ...

or

0x34/255 = 0 . 34 34 34 34 ...

Any denominator that's a power of two minus one is a repeated binary fraction.

We want to see if we can use just a few repeats and then bias up and get correct requantizer results.

As an example, let's work out the requantizer from 10 bits to 8 bits. As usual the simple way is :

f = x/1023.f
y = (int)( f * 255.f + 0.5f );
but we'll try to find an exact match to that formula that stays in int. We guess that we want to start by making a repeated fraction :
thinking of f as a fraction with decimal 20 places to the left

f = (x << 10) | x;

y = (int)( f * 255.f + 0.5f );

y = ( ((x << 10) | x) * 255 + (1<<19) + bias )>>20;

where we guess we need "bias" because of the fact that we are using a truncated 20 place fraction rather than forever
that is, we need to push up 0.FFFFF slightly to get to 1.0

It turns out it works with bias = (1<<9) (among others)

y = ( x * 1025 * 255 + (1<<19) + (1<<9) )>>20;

( using ( (x << 10) | x ) = x * 1025 )

This is in fact a correct requantizer from 10 to 8 bits :

int requantize_10_to_8( int x )
{
    ASSERT( x >= 0 && x <= 1023 );

    return ( x * 0x03FCFF + 0x080200 )>>20;
}
Okay, so we can see it's frequently possible to use an approximation of the infinite repeated fraction to find correct requantizers.

(you could also get this from our one term Taylor expansion as well; essentially we just did 1/1023 ~= 1025/(1024^2) )

This requantize_10_to_8 requires 28-bit intermediates, so it fits in 32-bit registers. For scalar that's fine, but for SIMD we'd like to be able to do these requantizers in as few intermediate bits as possible (ideally in 16 to get wider SIMD).

If we go back to the verbose form we can see how to get these in fewer bits :


y = ( ((x << 10) | x) * 255 + (1<<19) + (1<<9) )>>20;

y = ( ((x*255) << 10) + ( x * 255 ) + (1<<19) + (1<<9) )>>20;

t = (x*255) + (1<<9)

y = ( (t<<10) + t )>>20;

y = ( t + (t>>10) )>>10;

int requantize_10_to_8_alt( int x )
{
    ASSERT( x >= 0 && x <= 1023 );

    int t = (x*255) + (1<<9);

    return ( t + (t>>10) )>>10;
}

This gives the same result again; it now uses only 18 bit intermediates.

This form is familiar from the classic "mul8bit" from Jim Blinn's "Three Wrongs Make a Right" :

int mul8bit( int a, int b )
{
    int t = a*b + 128;
    return (t + (t>>8) )>>8;
}
mul8bit takes a [0,255] value and makes it act as an interpolator from 0% to 100% , eg. for alpha blending. It is equivalent to :
    (int)( ((a / 255.f) * b) + 0.5f );
which is dequantizing a to UNORM, using it to scale b, then requantizing.

We can now see that the "mul8bit" interpolator is a relative of our requantizers, and we can derive mul8bit by using the repeated fraction or Taylor series.

The end!

3/24/2021

Oodle 2.9.0 : old compressors removed

Oodle 2.9.0 is now out. The full changelog is :
## Release 2.9.0 - March 23, 2021

$* *fix* : OodleLZ_Compress had an encoder crash bug in the Optimal level encodes on data in sizes just slightly over a 256KB chunk (eg. 262145) with a repeated substring at the very end

$* *change* : Mac libs and dylibs are now fat binaries with x64 and ARM64
$* *change* : Tex : Oodle Texture no longer checks for license file
$* *change* : defining OODLE_IMPORT_LIB / OODLE_IMPORT_DLL is no longer needed; you can link with either type of lib without setting a define
$* *change* : Oodle public headers no longer define types like U8, SINTa, they are instead OO_U8, OO_SINTa, etc.
$* *change* : Oodle public headers now require stdint.h which on Windows/MSVC means VC2010 or higher

$* *change* : Net : OODLE_PLATFORM_HAS_SELECTDICTIONARYANDTRAIN define removed.  Call OodleNetwork1_SelectDictionarySupported instead.

$* *removed* : Core : support for the deprecated LZ compressors is now removed (LZH,LZA,etc.).  Only newlz (Kraken,Mermaid,Selkie,Leviathan,Hydra) and LZB16 are supported.
$* *removed* : Core : OodleLZ_CompressContext_* functions removed; streaming encoders no longer supported
$* *removed* : Ext : OODLEX_PATH_* public defines removed.
$* *removed* : Ext : OODLEX_WCHAR_SIZE public define removed.
$* *removed* : Tex : OodleTex_CheckLicense func removed ; Oodle Texture no longer checks for license files

$* *deprecation* : OodleConfigValues::m_OodleLZ_Small_Buffer_LZ_Fallback_Size no longer used; newlz compressors no longer ever drop down to LZB16 (default behavior unchanged)
The biggest change is the removal of all deprecated pre-Kraken compressors (LZH,LZHLW,LZNA,LZNIB,BitKnit,etc.) except for LZB16 which stays for the moment (but is also deprecated). Data compressed with the old codecs cannot be decoded by Oodle 2.9.0 ; up to Oodle 2.8.14 data made by any earlier version can always be decoded by later versions.

Note that old versions of the Kraken-family of compressors still be decoded (Oodle 2.9.0 will read data written by Oodle 2.3 Kraken), and that will always be the case, we never break reading old data (made by the supported codecs). So if you are using the Kraken-family of compressors, you can always update to newer versions and your old data will be readable.

If you were using one of the old codecs, we recommend changing to one of the Oceanic Cryptozoology codecs (Kraken, Mermaid, Selkie, Leviathan). They are better in every way. To do this, you need to decode that data with the old version of Oodle you have (or any version up through 2.8.14) because 2.9.0 cannot decode the old codecs. Assuming your old version is post-Kraken (2.1.5), you can re-encode to Kraken with your old sdk and the current version (2.9.0) will be able to load that data.

If you can't switch codecs for some reason, you can always keep using the old version of Oodle you are currently on.

We have also designated the last version before 2.9.0 (which was 2.8.14) as a "long term support" release. We will continue to push critical bug fixes to 2.8.14.lts for people who need to stay on old codecs.

(2.8.14.lts has already received a bug fix since the 2.8.14 first release; clients on 2.8.14 should update to the latest version)

If at all possible we recommend everyone to get on the Oceanic Cryptozoology codecs and update to 2.9.0 ; if you need the old codecs we recommend 2.8.14 for long term support.

LZB16 is still supported in 2.9.0 because some data may need it, but we strongly discourage its use.

3/10/2021

Faster Inverse BWT

The BWT (Burrows Wheeler Transform) has long fascinated people for its ability to capture complex correlations with a very simple inverse transform. Unfortunately despite that inverse transform being very simple, it is also slow. I will briefly review the inverse BWT (i-BWT) and then look at ways to speed it up.

Jump to the end for the punch line : speed results and source code.


Let's briefly review the basic BWT to establish the baseline algorithm and notation.

Given an input string like "inputstring" add an EOF symbol so S = "inputstring|" then form all rotations :

inputstring|
nputstring|i
putstring|in
utstring|inp
tstring|inpu
string|input
tring|inputs
ring|inputst
ing|inputstr
ng|inputstri
g|inputstrin
|inputstring
Then sort all the rows, this gives the matrix M :
g|inputstrin
ing|inputstr
inputstring|
ng|inputstri
nputstring|i
putstring|in
ring|inputst
string|input
tring|inputs
tstring|inpu
utstring|inp
|inputstring
^          ^
F          L
this is all the suffixes of the file, and we've sorted them all. We've labeled the first column F and the last column L.

The last column L is the BWT, which is what we transmit. (typically we don't transmit the EOF character, instead we transmit its location and leave it out of the BWT string that is sent).

In the matrix M each column is a permutation of the original string S (remember the rows are cyclic rotations). The first column F is just the alphabetic sort of S. F can obviously be reconstructed from a histogram of S, which can be counted from L. The column L is the original characters of S in order of their following suffixes.

The inverse transform is simple conceptually. We transmit L, the BWT string (the last column), we need to regenerate S the top row.

Because the rows are all just rotations of S, the character at F[i] on each row is the character that follows L[i]. This gives you an answer to "what's the next character after this one?" which is all you need :

Output decoded string D
Start with D=| the EOF character.

What's the next character?

L = nr|iinttsupg
F = giinnprsttu|

Find | in L , it's L[2]
so the next char is F[2] = i

D=|i

What's the next character?

Find i in L , it's L[4]  (NOT L[3])
so the next char is F[4] = n
D=|in

Then find n at L[5] (not L[0])
the next char is p, so D=|inp

after p is u, etc.
So you can just read out the characters one by one like that.

When a character occured multiple times in L we had to find the right one (which corresponds to the same location of that character in the original string S). eg. when finding 'i' we needed L[4] not L[3] because L[4] is the i that follows | in "input", while L[3] is the i in "string".

The way to do that is to find the same occurance number of that character. If it's the 2nd "i" in F, we want that to map to the 2nd "i" in L.

The reason is because L has the characters sorted by their suffixes, while F is sorted by the whole strings. So when they are the same symbol in L, then the order they occur in L is the same as the order the occur in F.

For example the n's in F :

ng|inputstri
nputstring|i
are in this order because "ng" is less than "np"; in the last column L, the n's occur as :
g|inputstrin
putstring|in
they are sorted by their suffixes "g|" and "pu", so they are in the same order.

So the nth occurance of each character in L/F corresponds to the nth occurance of each character in L/F. The main thing we need for inversion is this mapping between L and F for each character.

It turns out to be more convenient to go backwards and make the mapping from L to F (rather than F to L). This is because the F array can just be implicit, we never need to make it in memory. F is just the "cum prob" table. That is :

Each symbol has a histogram count[s]
start[s] = prefix sum of counts < s
cumulative probability range[s] = [ start[s] , start[s] + count[s] )
(closed interval on low end, open interval on high end)
F[] = fill s in range[s]

To find the nth occurance of s in F[] is just i = start[s] + n
This is the same array you would use in arithmetic coding or RANS.

We'll calling the mapping from L to F , LF[] and it is just constructed thusly :

scan L[]
s = L[i]
count C[s] as you go (start at zero)
find the C[s]th occurance of s in F[]
LF[i] = start[s] + C[s]
then C[s] ++
Now LF[i] takes us from the symbol at L[i] to the same symbol in F[] which will be at F[ LF[i] ], then from there you can step to the prior symbol, which is at L[ LF[i] ].

Note that this is backwards from what we were doing before, so we are now reading at the stream in reverse. To read out the i-BWT (backwards) is now : :


start at i = location of EOF in L[]

for N times :
{
find symbol L[i] in F[] ; this is i = LF[i]
then step to the prior symbol F -> L
output L[i]
}
that's it! A full BWT inverse transform is then :
L[] = the BWT string, we get this from the decoder

count[i] = histogram of L[]
start[i] = prefix sum of count[]

make LF :

C[] = 0
for i to N
{
    s = L[i];
    LF[i] = start[s] + C[s];
    C[s]++;
}

read out :
i = index of EOF (transmitted)
// L[i] is EOF sym which we don't need to output
for N times
{
    i = LF[i];
    *ptr++ = L[i];
}
This produces the string backwards, which we usually handle by reversing in the encoder before doing the BWT so that the decoder can just output forwards.

That's it!

I'm not going to talk about how the compressor transmits the BWT string L[] compactly. See the original BW94 paper or the bzip code for the classic MTF + RLE + switching Huffmans encoding. See Fenwick for analysis of BWT as a symbol ranker. In the modern world most people do BWT encoding without MTF; see Maniscalco M99, context induction, BCM and BBB.

I'm also not talking about the forward transform at all. The forward transform also has interesting efficiency issues, mainly also relating to cache misses for large buffers or to GPU implementations, and there has been lots of good research on that; see the papers. In this blog I will assume you don't care much about forward transform speed, so I'll push all the slow operations to the encoder side to make the decoder as fast as possible.

I will also just briefly mention that BWT of long string recurrances can be sped up with an LZP pre-pass. We will assume this has been done where appropriate so that there are no very long identical strings in our source input (eg. perhaps suffixes are always unique after 16 symbols).

End of review and we'll dive into speeding up the i-BWT.


So the transform is delightfully simple, but it's slow. Why?

(and by "slow" I mean *really* slow; specifically 200-400 cycles per iteration on my Intel machine "beasty")

The problem is that each instruction of the i-BWT intrinsically depends on the previous instruction (instrinsically meaning can't be removed with code reorganization), so no instruction-level-parallelism (ILP) can be used. On modern CPUs that might be able to do 4 instructions per clock this is bad. What's worse is that there is an intrinsic cache miss in the lookup step of the i-BWT.

The problem is the index of the next symbol in the sequence is semi-random and goes all over the BWT buffer. It cache misses often (almost every single iteration). Now you can obviously fix this by making the BWT chunk smaller, which is what the old bzip does. If the chunk is small enough to fit in cache, no problem. But that also kills compression and sort of ruins the whole point of the BWT (if small chunk BWT compression is good enough for you, then you should just use an LZ like Kraken which better fits that space-speed niche; high speed small cache is outside the operating range of where BWT makes sense vs alternatives). In-cache BWT is not a Pareto optimal way to do a fast compressor; we are only interested in BWT in the high-compression domain. So we want to use BWT chunk sizes that are bigger than the fast cache (16M chunks perhaps, which often fits in L3) and speed things up.

The core of the i-BWT is just this :

for N times
{
    i = LF[i];
    *ptr++ = L[i];
}
The problem is the LF[i] are a near-random walk around the buffer. They are in order of the sorted suffixes, which tend to be completely randomly distributed in N, assuming the input array is not in some degenerate pre-sorted order. You load at LF[i] and it cache misses, you can't do any more work, the loop completely stalls. As soon as the cache miss returns, you immediately do i = LF[i] and look up at LF[i] again and stall again.

Now one obvious thing we can do is to first combine the L[] and LF[] arrays into a single merged array, so that we get exactly 1 cache missing load per iteration instead of 2. But we still need fundamental improvements.

There are two big deficiencies we need to address :

1. Instruction sequence with sequential dependencies
2. Cache miss on looking up the next symbol

The most obvious approach to deal with these deficiencies is to use a technique we are now familiar with, which is to do multiple streams at the same time.

Independent multiple streams allow you to keep more execution units busy, especially in a case like this where there is a fundamental cache miss stall that you want to fill with useful CPU work during those cycles.

Say you have a single execution stream where every instruction is dependent on the last (c follows b follows a), only 1 can execute per cycle, and if one cache misses you get unused cycles with no work to do :


clock tick on the left
"1" is the independent stream number
a,b,c indicate a sequence of dependent instructions

0: 1a
1: 1b
2: 1c (cache miss)
3: .
4: .
5: .
6: .
7: 1d

If you run 4 streams of independent work, each stream dependent on the last instruction within its stream but no cross-stream dependencies, now the CPU can execute 2 instructions per cycle (say), and fill in during the cache miss :

0: 1a + 2a
1: 1b + 2b
2: 1c + 2c (both cache miss)
3: 3a + 4a
4: 3b + 4b
5: 3c + 4c (both cache miss)
6: .
7: 1d + 2d

It's a bit like what hyper-threading does for you, but we get it just by running more independent execution sequences on our single thread to give the out-of-order execution a bigger selection of work to find instructions to fill the gaps.

In our case with heavy cache missing dependent work, in the 1-stream case you are waiting on the latency of each cache miss with the processor doing nothing, so you get a full stall all that time. Going to more streams at least lets us get multiple memory operations pending at once, rather than stall, wake up, stall again on one load at a time.


In the case of i-BWT we can do this by taking the input data (before BWT) and think of it as T segments. We don't cut the string into pieces (remember chunking hurts compression and we don't want to do that), we still do the forward BWT on the whole string.

Like :

"inputstring|"

in 2 segments :
input+string|

do the BWT on the whole string "inputstring|"

then find the locations of the EOF symbol | in L[]
and the + segment boundary, by finding where "string|" starts


g|inputstrin
ing|inputstr
inputstring| <- EOF key
ng|inputstri
nputstring|i
putstring|in
ring|inputst
string|input <- segment key
tring|inputs
tstring|inpu
utstring|inp
|inputstring
^          ^
F          L

EOF key = 2
segment key = 7

Then you would transmit the BWT string L[] and the EOF key (2) as usual, and also transmit the segment key (7). The segment key lets you start the i-BWT core loop at that location and read out characters from there, in addition to the stream from the EOF key.

In the i-BWT you can start output cursors for all T segments and you can read them out of the i-BWT simultaneously. The i-BWT core loop just does "from this symbol, what's the next symbol?" so you can do T of those independently.


1->..+2->...|

cursor 1 starts at i=2 in L[]
cursor 2 starts at i=7 in L[]

decodes:
i....+s.....|
in...+st....|
etc..
input+string|

The N-stream decode is very simple, literally just doing the 1-stream process to N output pointers :

T segments
key[T] are the start indexes
i1,i2,etc. = key[]
ptr1 = start of output buffer
ptr2 = ptr1 + (N/T) , start of next segment, etc.

for N/T times
{
    i1 = LF[i1];
    *ptr1++ = L[i1];

    i2 = LF[i2];
    *ptr2++ = L[i2];
}

it just does the basic i-BWT loop on T segments in each iteration. Crucially these are independent, so when one of them stalls on a cache miss, the others can still do work. If all of them were stalling on cache misses and the memory system could service T cache line fetches simultaneously, we would be able to stall with all T requests in flight which would give us factor of T speedup. In practice it appears to be somewhat less than that.

You can also cut into segments like this for parallel decoding on multiple threads for very large blocks (without doing any chunking, which sacrifices compression). eg. you might find 16 segments to decode and give 4 segments each to 4 threads. The threads have no cache contention because they are writing linearly to different output segments, and the shared memory of the LF[] array is read-only in this loop.

Going to 4 streams provides around a 2.7X speedup, 8 streams gets us about 4X :

enwik7 :
ibwt_1x1 : seconds:0.7500 | cycles per: 299.330 | MB/s : 13.33
ibwt_1x4 : seconds:0.2801 | cycles per: 111.769 | MB/s : 35.71
ibwt_1x8 : seconds:0.1835 | cycles per: 73.218 | MB/s : 54.51
(see more speeds at end)


The next thing we can do is an idea from the paper "Cache Friendly Burrows-Wheeler Inversion". We can make our basic i-BWT step output words or dwords instead of bytes.

This is a simple idea, but I think it's quite interesting *why* and *when* it works.

What we want to do is just do our core i-BWT to output words at a time instead of bytes, by outputting a precomputed pair of one byte steps :

core i-BWT loop for words :

i = key (EOF location in L[])
for N/2 times
{
    *word_ptr++ = LL[i];
    i = LF2[i];
}
where LL = a word, two steps of L[] , and LF2[] = two steps of LF[], that is LF2[i] = LF[LF[i]].

(reminder that in practice we put the LL[] and LF2[] into a single struct so that we get one cache miss array lookup rather than two).

So to use that we just need to build the LL and LF2 arrays first. To do that we just need to precompute each pair of steps in the byte-at-a-time BWT :

prep for word i-BWT :

first build LF[] as usual

for i = 0 to N
{
    int i2 = LF[i];
    LL[i] = L[i] | (L[i2]<<8);
    LF2[i] = LF[i2];
}
At every index in L[] we precompute a two-character step by following LF[] once.

(side note : you could precompute a two-character step forward more simply; from L[i] the next character is just F[i], which you could get from the cumulative probability table at i, but that would give you a step forward and what we want is a step backward; L[i2] here is the character that precedes L[i]; it wouldn't help because we need a lookup at i2 to make LF2[] anyway)

Now at first glance this might not seem helpful, what we have done is transform :


byte at-a-time BWT :

core i-BWT loop :

N times :
    i <- LF[i]
    output byte from i


word at-a-time BWT :

prep :

N times : 
    LF2[i] = LF[LF[i]]
    word = L[i] + L[LF[i]]

N/2 times :
    i <- LF2[i]
    output word from i

We are actually doing more work; we now do 3/2 N iterations instead of N iterations. But it is in fact much faster.

Why?

The reason is that the N-step "prep" loop to build LF2 does not have the two big problems of the core loop. Remember the two big issues :

1. Instruction sequence with sequential dependencies
2. Cache miss on looking up the next symbol
these are what make the "core" i-BWT loop so slow.

First issue #1 : the "prep" loop naively allows for execution ahead, unlike the core loop. The reason is that it is just doing i++ to iterate the loop, rather than chasing i = LF[i] around a data-dependent serpentine walk through the arrays. This means that the processor can effectively unroll the loop to execute ahead :

prep loop :

iter 1 :
    LF2[i] = LF[LF[i]]
    word = L[i] + L[LF[i]]
    i++;

iter 2 :
    LF2[i] = LF[LF[i]]
    word = L[i] + L[LF[i]]
    i++;
    
iter 1 and iter 2 can run at the same time because i++ can be precomputed.
If LF[LF[i]] stalls on a cache miss in iter 1, iter 2 can still proceed.

core loop :

iter 1 :
    i = LF[i]
    output byte/word from i

iter 2 :
    i = LF[i]
    output byte/word from i

iter 2 can't start until iter 1 is done because we don't know "i" until the
results of the lookup at LF[i] are done.
A modern processor will be able to execute ahead in the "prep" loop to fully saturate execution, and we don't need to manually do the N-streams to get ILP because it just naturally exists.

issue #2: cache misses. The "prep" loop cache misses far less than the "core" loop.

Again this is not naively obvious. The byte "core" loop does N lookups at LF[i]. The "prep" loop also does N lookups at LF[i]. So they should cache miss the same, right?

Nope. The difference is because the "core" loop is doing i = LF[i] which takes you to a semi-random place each time, whereas the "prep" loop is doing i++ for subsequent lookups, and that means the lookups are often to nearby locations that stay in cache.

core loop does :

    starting from i
    i1 = LF[i]
    lookup at i1
    i2 = LF[i1]
    lookup at i2
    i3 = LF[i2]
    lookup at i3

prep loop does :

    starting from i
    i1 = LF[i]
    lookup at i1
    i2 = LF[i+1]
    lookup at i2
    i3 = LF[i+2]
    lookup at i3
The indexes i1,i2,i3 in the core loop are semi-random, in the prep loop they are not.

To understand, let's remember what LF[] is. It takes us from the symbol at L[i] to the location of the preceding symbol in the BWT (we're outputting backwards).

So say you were currently at "ing|" (in "inputstring|"), it takes you to the location of "ring|" then "tring|" ; those are i1,i2,i3 in the core loop.

In the "prep" loop you are taking the starting strings at i, i+1, i+2. These are subsequent in sorted order. These might be something like :

"hella ..."
"hellish a.."
"hellish b.."
"hello a..."
"hello b..."
So the steps LF[] gives us to i1,i2,i3 will be to tack on the preceding character and look that up. They might all be preceded by random characters, but in practice because these strings are similar their preceding characters tend to be similar. That might be :
" hella ..."
" hellish a.."
"shellish b.."
" hello a..."
"shello b..."
then you look up those locations in the suffix sort. As long as you tacked on the same preceding character, they will stay adjacent :
" hella ..." -> i =4000
" hellish a.." -> i = 4001
"shellish b.." -> i = 6078
" hello a..." -> i = 4002
"shello b..." -> i = 6079
The preceding character for the sorted suffixes is of course just the BWT string L[] itself. The portion here would be " s s".

So as long as the BWT string has repeated characters, the indexes of the "prep" loop lookup will be adjacent. If they are adjacent indexes, they will not cache miss because we already loaded their neighbor which brought in that cache line (load at i=4001 won't cache miss because we just did a load at i=4000).

Well, getting repeated characters in L[] is exactly what the BWT gives us! If we think of it in terms of post-MTF BWT strings, a 0 will be a load from the most likely location, 1 will be the second most likely, etc. so there will tend to be a few cache lines that we already have that provide most of our loads for the "prep" loop. Only rarely do you get unexpected characters that cause cache misses.

Note that this only works on *compressible* data where the BWT is giving the coherence we want. On random data that doesn't compress this breaks down. (though even in that case, there are still at most 256 hot cache lines = 16 KB that we need to read from, so it's still better than the "core" loop, and we still have property #1 that each iteration is independent).

Actual load locations so you can see how this works in practice :

LF[i] = 8853
LF[i] = 8854
LF[i] = 8855
LF[i] = 8856
LF[i] = 10553
LF[i] = 8857
LF[i] = 10554
LF[i] = 48
LF[i] = 49
LF[i] = 50
LF[i] = 10555
LF[i] = 10556
LF[i] = 51
LF[i] = 52
LF[i] = 5070
LF[i] = 10557
LF[i] = 2477
LF[i] = 53
LF[i] = 54
Where you have 10553,10554,10555 that means the same character was preceding those suffixes, so they lookup in adjacent places.

You can of course take this idea for word output and repeat it again to get dword (u32) output per iteration. In practice what I see is that dword output is only faster for compressible data where the cache coherence property above works well. On random input dword output gets a lot slower than word-at-a-time. Because of this I think word output is best if you only choose one i-BWT, but you can do even better by adaptively choosing the output algorithm by the compression ratio.

The dword (4x) variant does an N-loop to build LF2 which is quite coherent, the next N-loop to build LF4 is slightly less coherent unless the data is compressible (we're relying on two-symbols-away correlation now instead of neighbor correlation, which is less strong), then it does N/4 of the "core" loop which is still quite cache missy. So the net is :

!
1x byte loop : N steps of "core" , cache mising
2x word loop : N steps of LF2 + (N/2) steps of core
4x word loop : N steps of LF2 + N steps of LF4 + (N/4) steps of core

The authors of "Cache Friendly Burrows-Wheeler Inversion" have another method that accelerates long repeated substrings called "precopy". I think that a pre-pass with an LZP transform is probably a faster to way to accomplish the same thing on data where that is helpful. "precopy" also makes N-stream interleaving much more complex. I have not tested it.


Speeds measured on my Intel Core i7 3770 (locked at 3403 MHz) (Ivy Bridge)

ibwt 1x1 = classic byte at a time inverse BWT
ibwt 1x4 = byte x 4 streams
ibwt 2x4 = word x 4 streams
ibwt 4x4 = dword x 4 streams
ibwt 1x8 = byte x 8 streams
ibwt 2x8 = word x 8 streams
ibwt 4x8 = dword x 8 streams

cycles per byte for the ibwt :

name 1x1 1x4 2x4 4x4
dickens 350.6 123.7 70.0 47.9
enwik7 301.2 116.0 66.8 47.0
webster 376.9 139.0 78.5 51.2
lzt99 211.3 114.6 65.1 61.4
rand_16M 401.6 130.6 80.4 297.3

name 1x1 1x8 2x8 4x8
dickens 337.0 79.8 46.8 35.8
enwik7 299.1 73.2 46.0 36.1
webster 376.5 98.0 57.5 40.3
lzt99 208.6 77.9 48.1 53.7
rand_16M 401.3 84.1 55.0 273.4

We've improved the i-BWT speed from around 300 cycles/byte to 50 cycles/byte. That's still extremely slow. It's around the speed of Oodle LZNA (30 cycles/bytes) or 7z LZMA (70 cycles/byte). It's an order of magnitude slower than Oodle Leviathan (5 cycles/byte). But in some cases, such as text, BWT on large blocks gets a lot more compression than those, so it could be interesting on the right data types if you care about the very high compression domain. All speeds single threaded.

The i-BWT on a small buffer that stays in cache is around 10 cycles/byte (and could probably be faster; we haven't looked at all at micro-optimizing the actual instructions here, since if we're stalling on cache misses they don't matter), so at 50 cycles/byte we're still spending 40 cycles/byte stalled on the memory subsystem in our best algorithm, which is not great.

ADD : with MEM_LARGE_PAGES :

name 1x1 1x8 2x8 4x8
dickens 301.3 52.4 33.9 27.7
enwik7 265.8 50.5 33.1 28.2
webster 319.6 54.5 34.8 29.0
lzt99 177.5 50.6 32.5 35.6
rand_16M 336.6 53.5 41.4 154.1

Without large pages, the bottleneck appears to be in TLB/page mapping. With large pages it seems to be limited by the rate that cache missing reads can be retired (assuming the latency of that is fixed, this is the number of different cache lines that can have in-flight reads simultaneously). eg. if latency is 300 cycles, with 10 simultaneous reads we could get to 30 cycles/byte.


Downloads :

fast_ibwt_test.cpp

fast_ibwt_test exe (Win x64)

The source code is public domain. The ibwt routines are standard C. The test driver uses cblib

2/23/2021

Rate allocation in Oodle Texture

Oodle Texture does rate-distortion optimization (RDO) BCN encoding, optimizing the BCN encoding for an R-D two axis score such that the size of the BCN texture after a following lossless compression is reduced while the distortion (difference from original) is not increased too much.

One way to think about RDO conceptually is as rate allocation. Rate allocation is when an encoder intentionally puts more or fewer bits in parts of the data to control what the decoder will receive, so that it can use more bits where it helps more, and the result is a better quality for a given output size. I want to talk a bit about that viewpoint and how it works in Oodle Texture.

Traditional lossy encoders like JPEG didn't have active rate allocation; they used a fixed scheme (DCT and quantize) that did a fixed rate allocation (give more bits to the lower frequencies) and didn't try to optimize that based on image contents. Modern codecs like H264 use rate allocating encoders and furthermore have explicit tools in the format to make rate allocation more flexible, such as variable quantizers. Formats without explicit rate allocation controls can still be rate adjusted, which is what we do in Oodle Texture, but it's not as easy to dial.

In that context, we think of a traditional non-RDO BCN encoder as allocating rate evenly to all blocks; it doesn't care where the rate goes and just tries to minimize distortion. The BCN encoder itself produces fixed size blocks (8 bytes per block for BC1 or 16 bytes per block for BC7), but the "rate" we care about is the size of the block after subsequent LZ compression (eg. with Kraken or Zip/deflate).

In Oodle Texture, we use the Lagrange parameter method for RDO. Specifically we construct a Lagrange cost :

J = D + lambda * R
and then we can just minimize a single number, J, rather than a two-axis score of {R,D}. Obviously when lambda=0 this reduces to J=D , minimize D, which is a non-RDO encoding that only cares about quality and not rate. As lambda increases your score is more and more a balance of minimizing both distortion and rate.

We sometimes think of D(R) , distortion just being a function of R, but that's not how it works in an implementation, you don't actually know the functions for how R and D relate.

In practice there is some encoding, which I will index by "i" , and you have two separate functions :

R(i) and D(i)
That is, you just choose various encodings that look promising, and measure R and D of each. This will give you a scattering of points on an R-D plot. Some points are off the Pareto Frontier - strictly worse in terms of R & D and we just reject those points. See Fabian's blog on RDO for some pictures of this.

Henceforth I will assume that points off the Pareto Frontier have been rejected and we now only are looking at encodings on the Pareto R-D curve. Also I will sort the index of encodings "i" by rate.

The R(i) and D(i) curves both tend to be hyperbolic-shaped, with minima at opposite ends of the "i" spectrum. If you minimized either one on their own, they don't have any local minima so you would just go to the edge of what the encoding format is capable of doing. By adding them together with lambda, it makes a trough somewhere in the middle :

High D, low R on the left (very lossy), low D high R (non-RDO) on the right.

At low lambda, the minimum of J is in the high-R, low-D domain. As lambda increases it shifts the minimum of J to lower R. In the example chart, the red low-lambda curve has minimum at i=14, the teal medium-lambda curve has minimum at i=12, the gold high-lambda curve has minimum at i=10. In this way, lambda can be used to dial rate, but indirectly.

Concretely, if we reparameterize to imagine we have a function D(R) (Distortion as a function of R; there is only one encoding for each R because we discard non-Pareto encodings), we have :


J(R) = D(R) + lambda * R

minimum is at d/dR J = 0

d/dR J = 0 = d D /dR + lambda

lambda = - dD / dR

lambda is the slope on the D(R) curve.

Because the D(R) curve is hyperbolic-shaped, the slope acts like a parameter of D itself. That is, where D is higher, the curve is also steeper, so by dialing the slope we have control over the principle value D as well.

Aside for experts : we are assuming that D(R) is continuous and monotonic AND dD/dR is continuous and monotonic; that is, the slope steadily increases from low to high. The whole idea of lambda as a slope of D(R) only really works if these curves are smooth and hyperbolic shaped as expected. We also need J to only have one local minimum. If these assumptions fail (which in practice they do!) you can wind up at a minimum that's not where you want, and the lambda-dials-D relationship can break down. There are some tricks to avoid these pitfalls. Try to avoid large steps of R/D in your encoding choices; provide the encoder with a fine grained series of steps; don't use the true R which can have strange non-monotonic wiggles, instead use an approximation of R which tends to smooth things out; you generally want to pre-classify the encodings that you think are low R vs high R and force them to be monotonic rather than measuring true R.

Now, "lambda is the slope on the D(R) curve" sounds a bit unintuitive and abstract, but in fact it is a concrete simple thing.

Lambda is a price. It is the gain in D that you must get for a gain in R to be worth it.

By using lambda as your control parameter instead of D, you are dialing quality via the *cost* of quality in terms of rate. You are saying "this much gain in quality is worth this much rate to me". We typically measure R in bits, so let's say lambda is quality per bit. In order to make a file bigger by 1 bit, it must provide at least "lambda" quality gain. Smaller quality gain, it's not worth it, don't do it.

Having our control paramter be price instead of directly controlling R or D turns out to be very beneficial, both for us internally and for you externally.

First why it's right externally : lambda control (slope control) lets you set a single parameter that works for all kinds of data or images. It automatically gives more rate to images that are harder to compress ("harder" here very concretely means a steeper slope - they give up more distortion than average for each step or rate). So you can have a mix of very vastly different data, lightmaps and normal maps and cartoon drawings, and they automatically rate adjust to send bits where they help the most. Many novices prefer to think of "rate reduction" or a "constant quality" kind of setting, like "I want to RDO all my images to 30% rate reduction", but that's really wrong. On some images, getting to 30% rate reduction would do lots of quality damage, while on others you could easily get more rate reduction (say 50%) without much quality loss at all. Lambda does that for you automatically.

Internally it's important because it lets us make each coding decision in isolation, just by looking at the J cost of that one decision in isolation. It makes the problem separable (which is also great for parallelism), but still achieve a global minimum. Lagrange optimization automatically does rate allocation within the image to different blocks and decisions.

I think it helps to compare to an alternative manual rate allocation method to see how advantageous it is :


Algorithm Manual Rate Allocation :

you have an image of N BCN 4x4 blocks
for each block, find the lowest D (non-RDO) encoding
measure R of that encoding - this is the max rate for each block

now incrementally reduce total rate
you want to take rate from the best block to take rate from of the N

for all N blocks
find the next smaller encoding 
measure the D increase for that step down of R
slope(n) = delta(D) / delta(R)

take the one block change n with the lowest slope(n)

repeat until all slopes(n) are > max_desired_slope

vs.

Algorithm Lagrange Rate Allocation :

for all N blocks in parallel :

try various encodings of the block
compute J = D + lambda * R for each encoding
take the encoding on that block with the best J

In "Manual Rate Allocation" we have this palette of various bins (the blocks) to choose to take rate from, and you take it from the one that does the least damage to D, if you keep repeating that the slopes will converge until they are all roughly equal.

The powerful thing is that these two algorithms converge to exactly the same solution (caveat: assuming monotonic and smooth R/D curves, which in fact they are not). The Lagrange method is actually doing rate allocation between the blocks, even though it's not explicit, and each block can be considered in isolation. In an RDO encoder, rate is like water, it flows away from high ground (dD/dR is high) to low ground, until level is reached. The Lagrange method is able to do this separably because the total amount of water (rate) is not conserved.

We can control exactly what the rate allocation does through the choice of the D function. The R function for rate is relatively inflexible - we're counting bits of output size (though in practice we may want to use a smoothed rate rather than true rate), we don't have much choice in R - in contrast, the D function is not absolutely specified and there are lots of options there. The choice of D changes where rate flows.

For example consider the common choices of D = L1 norm (SAD) vs L2 norm (SSD). These metric score different errors differently, which causes rate to flow. In particular L2 (squaring) puts a very big weight on single large errors vs. smaller multiple errors.


Source data   = { 10, 10 }
Approximation = {  8, 12 }

SAD error = 2 + 2 = 4
SSD error = 2^2 + 2^2 = 8

Source data   = { 10, 10 }
Approximation = { 10, 14 }

SAD error = 0 + 4 = 4
SSD error = 0^2 + 4^2 = 16

Choosing D = SSD would cause rate to flow from the {8,12} approximation to the {10,14} approximation because the error is much bigger there (relative to where it is with SAD). All reasonable D functions will be zero for an exact match, and increase as the approximation gets further from the source data, but they can increase at different rates for different types of distortions. In a non-RDO encoding, these different D functions might find very similar encodings, but with RDO by choosing different D you get rate to flow between blocks to different characters of error.


With no further ado, we can get to the fun part : pictures of how rate allocation plays out in practice in Oodle Texture.

These images show the original image, followed by gray scale images of the rate allocation. These are for BC7 encodings with Oodle Texture. The non-RDO encoding is Oodle Texture at lambda=0 ; the RDO encodings are at lambda=40 (typical medium quality setting).

In the rate allocation images, each pixel represents one 4x4 block. White is full rate (16 bytes per block) and black is zero bits. These encodings are not normalized to the same total rate (in particular the non-RDO is of course much larger). The images are gamma corrected so that linear light intensity corresponds to bits per block. The original images are halved in size for display here. So for example "mysoup" was 1024x512, the original shown here is 512x256, and rate map is one pixel per block, so 256x128.

The "rmse" and "perceptual" images use the exact same coding procedure at the same lambda, they differ only in the D function used to measure distortion. "rmse" tries to minimize simple L2 norm, while "perceptual" has some estimation of visual quality (trying to avoid blocking artifacts, for example).

The R measured for these maps is the size of each block after subsequent compression by Oodle's LZA compressor.

non-RDORDO rmseRDO perceptual

The non-RDO "mysoup" has very high entropy, the BC7 blocks are nearly incompressible by the LZ (7.759 bpb). The RDO with D=rmse keeps very high bit rate in the blocks of the bowl rim, but allocates lots of bits away from the smooth regions in the egg and pumpkin. In the "perceptual" allocation, the bits in the rim are reduced, allowing more error there, and the bit rate of the smooth regions come up to avoid blocking artifacts.

non-RDORDO rmseRDO perceptual

In the house image "test6" the bit rate of the "rmse" allocation goes nearly to zero in the smooth dark areas of the beams under the eaves. That's a mistake perceptually and causes bad blocking artifacts, so the "perceptual" allocator shifts rate back to those blocks.

non-RDORDO rmseRDO perceptual

We can see that Oodle Texture is changing near-even rate allocation to very variable rate per block. Even though the BC7 blocks are always 16 bytes, we have made some more compressible than others, shifting rate to where it is needed. On many blocks, 16 bytes is just too much, that much rate is not needed to get a good encoding, and the difference in D to a reduced-size encoding is small; these are low slope blocks and will lose rate in RDO. After rate allocation, the blocks all have the same dD/dR slope; they reach an equilibrium where you can't take bits from one block and move it to another block and improve total quality.

Something you may notice in all the rate allocation maps is there are horizontal lines of higher rate going through the image. This is where we slice the images into chunks for parallelism of some operations that have to work per-chunk instead of per-block. The stripes of higher rate show that we could find some improvement there.


Links :

Oodle Texture at radgametools.com
Lagrange space-speed optimization - cbloom blog
Leviathan is a market trader - cbloom blog
Rate-Distortion Optimisation in Dirac
Rate-distortion optimization The ryg blog

2/15/2021

Oodle 2.8.14 with Mac ARM64

Oodle 2.8.14 is out. The full changelog is at RAD. The highlights are :
## Release 2.8.14 - February 15, 2021

$* *enhancement* : BC7 encoding is faster ; slightly different encodings at higher speed with similar quality

$* *new* : Mac ARM64 build now provided ; Mac example exes are fat x64+arm64
$* *new* : Apple tvOS build now provided

$* *deprecation* : Mac 32 bit x86 build no longer provided
We're also now shipping plugin integrations for Unreal 4.26

Kraken decompression is wicked fast on the M1 :

Kraken, Win-x64 msvc-1916, lzc99.kraken.zl6
lzt99 : 24,700,820 ->10,013,788 =  3.243 bpb =  2.467 to 1
decode           : 16.444 millis, 2.26 c/B, rate= 1502.09 MB/s

Kraken, Mac-x64 xcode-12.0.0, lzc99.kraken.zl6
lzt99 : 24,700,820 ->10,013,788 =  3.243 bpb =  2.467 to 1
decode           : 16.183 millis, 2.10 c/B, rate= 1526.36 MB/s
(emulated!)

Kraken, Mac-ARM64 xcode-12.0.0, lzc99.kraken.zl6
lzt99 : 24,700,820 ->10,013,788 =  3.243 bpb =  2.467 to 1
decode           : 11.967 millis, 1.55 c/B, rate= 2064.13 MB/s

Win64 run is :
AMD Ryzen 9 3950X (CPU locked at 3393 MHz, no turbo)
Zen2, 16 cores (32 hyper), TSC at 3490 MHz 

Mac runs on Macbook Mini M1 at 3205 MHz

Mac x64 is emulated on the M1
c/B = cycles per byte should be taken with some salt as we have trouble finding real clock rates, but it's clear the M1 has superior IPC (instructions per clock) to the Zen2. It seems to be about the same speed as the Zen2 in emulated x64!

It will be interesting to see what the M1 high performance variants can do.


Some more speeds cuz I like big numbers :

Macbook Mini M1 ARM64 :

Mermaid, Normal, lzt99 :
24,700,820 ->11,189,930 =  3.624 bpb =  2.207 to 1
encode (x1)      : 299.154 millis, 37.21 c/B, rate= 82.57 MB/s
decode (x30)     : 6.438 millis, 0.80 c/B, rate= 3836.60 MB/s

Mermaid, Optimal2, lzt99 :
24,700,820 ->10,381,175 =  3.362 bpb =  2.379 to 1
encode (x1)      : 3.292 seconds, 409.43 c/B, rate= 7.50 MB/s
decode (x30)     : 7.134 millis, 0.89 c/B, rate= 3462.57 MB/s

Selkie, Normal, lzt99 :
24,700,820 ->13,258,742 =  4.294 bpb =  1.863 to 1
encode (x1)      : 213.197 millis, 26.51 c/B, rate= 115.86 MB/s
decode (x30)     : 3.126 millis, 0.39 c/B, rate= 7901.52 MB/s

Selkie, Optimal2, lzt99 :
24,700,820 ->12,712,659 =  4.117 bpb =  1.943 to 1
encode (x1)      : 1.861 seconds, 231.49 c/B, rate= 13.27 MB/s
decode (x30)     : 3.102 millis, 0.39 c/B, rate= 7962.55 MB/s

1/11/2021

AVIF Test

AVIF is an image format derived from I-frames of AV1 video (similar to HEIC/HEIF from H265/HEVC). See also my 2014 Test of BPG , which is an H265 I-frame image format.

Here are some links I've found on AVIF :

AVIF image format supported by Cloudflare Image Resizing
GitHub - AOMediaCodeclibavif libavif - Library for encoding and decoding .avif files
GitHub - googlebrunsli Practical JPEG Repacker
Releases · kornelskicavif-rs · GitHub
GitHub - link-ucavif avif encoder, using libaom directly.
GitHub - xiphrav1e The fastest and safest AV1 encoder.
AVIF for Next-Generation Image Coding by Netflix Technology Blog Netflix TechBlog
Submissions from xiph.org Hacker News
Image formats for the web HEIC and AVIF – The Publishing Project
Squoosh
Comparing AVIF vs WebP file sizes at the same DSSIM
AV1 Codec
AVIF images color losschange AV1

Unfortunately I have not found a standard encoder with recommended settings. There appears to be zero guidance on settings anywhere. Because of that I am using the simplest encoder I could find, Kornelski's "cavif" which has a simple --quality parameter. I run thusly :

cavif --quality=X -o out.avif in.png
avifdef out.avif dec.png
imdiff in.png dec.png 
CALL FOR HELP : if you know better programs/settings for encoding AVIF, please let me know ; in avifenc , what is the min max Q parameter? adaptive quantization appears to be off by default, don't we want that on?

I measure results using my imdiff

I will compare AVIF to what I call "JPEG++" which is JPEG with the packjpg/Lepton back end, and a deblocking decoder (my "jpegdec"). This a rough stand-in for what I think a real JPEG++ should be (it should really have an R-D front-end and chroma-from-luma as well; that's all very easy and unequivocably good).

With no further ado, some results :

(imdiff "fit" is a quality in 0-10 , higher is better)


porsche640.bmp :


PDI_1200 :


Results are a bit disappointing. AVIF is much beter on RMSE but slightly worse on my other two scores. Overall that means it's most likely better overall, but it's not a huge margin.

(I'm sure AVIF is a big win on graphic/text images where JPEG does poorly)

AVIF results here look worse than what I saw from BPG (HEIC). Perhaps better encoders/settings will fix that.

Looking at the results visually, AVIF preserves sharp lines much better, but is completely throwing away detail in some places. There are some places where I see AVIF actually *change* the image, whereas JPEG is always just making the same image but slightly worse.

7z of encoded images to compare (2 MB)

NOTE : the fact that AVIF wins strongly in RGB RMSE but not in my other perceptual metrics indicates that is is not optimizing for those metrics. Perhaps in other perceptual metrics it would show a strong win. The metrics I use here from imdiff were chosen because I found them to be the best fit to human quality scores. Lots of the standard scores that people use (like naive SSIM) I have found to be utter junk, with no better correlation to human quality than RMSE. MS-SSIM-IW is the best variant of SSIM I know, but I haven't tested some of the newer metrics that have come out in the last few years.

1/06/2021

Some JPEG Tools

A couple of tips and tools for working with JPEG files. I use :

jhead
jpegcrop

Of course you can use exiftool and jpegtran but these are rather simpler.

1. Strip all personal data before uploading images to the web.

JPEG EXIF headers contain things like date shot and location. If you don't want Google to scrape that and use it to track your movements and send drones to follow you carrying advertisements, you might want to strip all that private data before you upload images to the net. I use :

jhead -purejpg %*
jhead -mkexif %*
jhead -dsft %*
with the very handy jhead tool. This removes all non-image headers, then makes a blank exif, then sets the exif time from the mod time. The last two steps are optional, if you want to preserve the shot time (assuming the mod time of the file was equal to the exif time). Note if the times are not equal you can use "jhead -ft" to set modtime from exif time before this.

Also note that if you use tools to modify images (resizing, changing exposure, what have you), they may or may not carry through the old exif data; whether that's good or bad is up to you, but it's very inconsistent. They also will probably set the modtime to now, whereas I prefer to keep the modtime = shot time so that I can find files by date more easily.

2. Remove orientation tags

iPhones are the only device I know of that consistently make use of the JPEG orientation tags rather than actually rotate the pixels. Unfortunately lots of loaders don't support these tags right, so if you load the image in a non-compliant loader it will appear sideways or upside down.

(this is a classic problem with data formats that have too many features; inevitabily many implementers only support a subset of the features which they deem to be the necessary ones; then some bone-head says "oh look the format has this weird feature, let's use it", and they output data which most loaders won't load right (then they blame the loaders). This is a mistake of the people defining the format; don't include unnecessary features, and make the practical subset a well-defined subset, not something that forms ad-hoc from use.)

To fix this, you want to read the orientation tag, rotate the pixels, then blank out the orientation tag (you have to do the last step or compliant loaders will doubly rotate). I used to have scripts to do all that with jpegtran, but it's super easy to do with jhead, you just :

jhead -autorot %*

3. Lossless crop

Avoid loading a JPEG, modifying it, and saving it out again. It destroys the image by re-applying the lossy quantization. I think by far the most common modification people do is cropping and there's no damn reason to re-encode for a crop. You can crop at 8x8 granularity losslessly, and you should. (all the web image uploaders that give you a crop tool need to do this, please, stop sucking so bad).

jpegcrop is a GUI tool that provides a decent lossless crop.

FreeVimager is okay too. Lossless crop is hidden in "JPEG Advanced".

We've got JPEG-XL coming soon, which looks great, but all the tech in the world won't help if people like Instagram & Youtube keep re-encoding uploads, and re-encoding every time you modify.

11/18/2020

Oodle 2.8.13 Release

Oodle 2.8.13 fixes an issue in Oodle Texture with consistency of encodings across machine architectures.

We try to ensure that Oodle Texture creates the same encodings regardless of the machine you run on. So for example if you run on machines that have AVX2 or not, our optional AVX2 routines won't change the results, so you get binary identical encodings.

We had a mistake that was causing some BC1 RDO encodings to be different on AMD and Intel chips. This is now fixed.

The fixed encoding (for BC1 RDO) made by 2.8.13 can be different than either of the previous (AMD or Intel) encodings. I want to use this announcement as an opportunity to repeat I point I am trying to push :

Do not use the binary output of encodings to decide what content needs to be patched!

This is widespread practice in games and it is really a bad idea. The problem is that any changes to the encoders you use (either Oodle Texture, your compressor, or other), can cause you to patch all your content unnecessarily. We do NOT gaurantee that different versions of Oodle Texture produce the same output, and in fact we can specifically promise they won't (always produce the same encodings) because we will continue to improve the encoder and find better encodings over time. Aside from version changes causing binary diffs, you might want to change settings, quality or speed levels, etc. and that shouldn't force a full patch down to customers.

The alternative to using binary output to make patches is to check if the pre-encoding content has changed, and don't patch if that is the same. I know there are difficult issues with that, often for textures you do something like load a bitmap, apply various transforms, then do the BCN encoding, and there's not a snapshot taken of the fully processed texture right before the BCN encoding.

If you are shipping a game with a short lifespan, your intention is to only ship it and only patch for bugs, then patching based on binary compiled content diffs is probably fine. But if you are shipping a game that you intend to have a long lifetime, with various generations of DLC, or a long term online player base, then you seriously need to consider patching based on pre-compression content diffs. It is very likely you will at some point in the life time of the game have to face OS version changes, compiler changes, or perhaps bugs in the encoder which force you to get on a newer version of the compression tools. You don't want to be in a situation where that's impossible because it would generate big patches.


One elegant way to solve this, and also speed up your content cooking, is to implement a cooked content cache.

Take the source bitmap, and all the texture cooking & encoding options, and use those as a key to look up pre-cooked content from a network share or similar. If found, don't re-encode.

Every time you export a level, you don't want to have to re-encode all the textures with Oodle Texture. With huge art teams, when someone edits a texture, everyone else on the team can just fetch from the cook cache rather than encode it locally.

The same kind of system can be used to avoid unnecessary patches. If you then populate your cook-cache with the content from the last shipped version, you won't make new encodings unless the source art or options change, even if the encoder algorithm changes.


For the lossless package compressor, ideally the patch generator would be able to decode the compression and wouldn't generate patches if only the compressed version changed.

To be clear, I'm assuming here that you are compressing assets individually, or even in smaller sub-asset chunks, or some kind of paging unit. I'm assuming you are NOT compressing whole packages as single compression units; if you did that any a single byte changing in any asset could change the entire compressed unit, that's a bad idea for patching.

(note that encryption also has the same property of taking a single byte change and spreading it around the chunk, so encryption should usually be done on the same or smaller chunk than the compression)

With most existing patchers that are not compression-aware, if you change the compressor (for example by changing the encode level option, or updating to a newer version), the compressed bytes will change and generate large patches. What they should ideally do is see that while the compressed bytes changed, the decompressed bytes are the same, so no patch is needed, and the old version of the compressed bytes can be retained. This would allow you to deploy new compressors and have them used for all new content and gradually roll out, without generating unnecessary large patches.

9/24/2020

How Oodle Kraken and Oodle Texture supercharge the IO system of the Sony PS5

The Sony PS5 will have the fastest data loading ever available in a mass market consumer device, and we think it may be even better than you have previously heard. What makes that possible is a fast SSD, an excellent IO stack that is fully independent of the CPU, and the Kraken hardware decoder. Kraken compression acts as a multiplier for the IO speed and disk capacity, storing more games and loading faster in proportion to the compression ratio.

Sony has previously published that the SSD is capable of 5.5 GB/s and expected decompressed bandwidth around 8-9 GB/s, based on measurements of average compression ratios of games around 1.5 to 1. While Kraken is an excellent generic compressor, it struggled to find usable patterns on a crucial type of content : GPU textures, which make up a large fraction of game content. Since then we've made huge progress on improving the compression ratio of GPU textures, with Oodle Texture which encodes them such that subsequent Kraken compression can find patterns it can exploit. The result is that we expect the average compression ratio of games to be much better in the future, closer to 2 to 1.

Oodle Kraken is the lossless data compression we invented at RAD Game Tools, which gets very high compression ratios and is also very fast to decode. Kraken is uniquely well suited to compress game content and keep up with the speed requirements of the fast SSD without ever being the bottleneck. We originally developed Oodle Kraken as software for modern CPUs. In Kraken our goal was to reformulate traditional dictionary compression to maximize instruction level parallelism in the CPU with lots independent work running at all times, and a minimum of serial dependencies and branches. Adapting it for hardware was a new challenge, but it turned out that the design decisions we had made to make Kraken great on modern CPUs were also exactly what was needed to be good in hardware.

The Kraken decoder acts as an effective speed multiplier for data loading. Data is stored compressed on the SSD and decoded transparently at load time on PS5. What the game sees is the rate that it receives decompressed data, which is equal to the SSD speed multiplied by the compression ratio.

Good data compression also improves game download times, and lets you store more games on disk. Again the compression ratio acts as an effective multiplier for download speed and disk capacity. A game might use 80 GB uncompressed, but with 2 to 1 compression it only take 40 GB on disk, letting you store twice as many games. A smaller disk with better compression can hold more games than a larger disk with worse compression.

When a game needs data on PS5, it makes a request to the IO system, which loads compressed data from the SSD; that is then handed to the hardware Kraken decoder, which outputs the decompressed data the game wanted to the RAM. As far the game is concerned, they just get their decompressed data, but with higher throughput. On other platforms, Kraken can be run in software, getting the same compression gains but using CPU time to decode. When using software Kraken, you would first load the compressed data, then when that IO completes perform decompression on the CPU.

If the compression ratio was exactly 1.5 to 1, the 5.5 GB/s peak bandwidth of the SSD would decompress to 8.25 GB/s uncompressed bytes output from the Kraken decoder. Sony has estimated an average compression ratio of between 1.45 to 1 and 1.64 to 1 for games without Oodle Texture, resulting in expected decompressed bandwidth of 8-9 GB/s.

Since then, Sony has licensed our new technology Oodle Texture for all games on the PS4 and PS5. Oodle Texture lets games encode their textures so that they are drastically more compressible by Kraken, but with high visual quality . Textures often make up the majority of content of large games and prior to Oodle Texture were difficult to compress for general purpose compressors like Kraken.

The combination of Oodle Texture and Kraken can give very large gains in compression ratio. For example on a texture set from a recent game :

Zip 1.64 to 1
Kraken 1.82 to 1
Zip + Oodle Texture 2.69 to 1
Kraken + Oodle Texture 3.16 to 1

Kraken plus Oodle Texture gets nearly double the compression of Zip alone on this texture set.

Oodle Texture is a software library that game developers use at content creation time to compile their source art into GPU-ready BC1-7 formats. All games use GPU texture encoders, but previous encoders did not optimize the compiled textures for compression like Oodle Texture does. Not all games at launch of PS5 will be using Oodle Texture as it's a very new technology, but we expect it to be in the majority of PS5 games in the future. Because of this we expect the average compression ratio and therefore the effective IO speed to be even better than previously estimated.

How does Kraken do it?

The most common alternative to Kraken would be the well known Zip compressor (aka "zlib" or "deflate"). Zip hardware decoders are readily available, but Kraken has special advantages over Zip for this application. Kraken gets more compression than Zip because it's able to model patterns and redundancy in the data that Zip can't. Kraken is also inherently faster to decode than Zip, which in hardware translates to more bytes processed per cycle.

Kraken is a reinvention of dictionary compression for the modern world. Traditional compressors like Zip were built around the requirement of streaming with low delay. In the past it was important for compressors to be able to process a few bytes of input and immediately output a few bytes, so that encoding and decoding could be done incrementally. This was needed due to very small RAM budgets and very slow communication channels, and typical data sizes were far smaller than they are now. When loading from HDD or SSD, we always load data in chunks, so decompressing in smaller increments is not needed. Kraken is fundamentally built around decoding whole chunks, and by changing that requirement Kraken is able to work in different ways that are much more efficient for hardware.

All dictionary compressors send commands to the decoder to reproduce the uncompressed bytes. These are either a "match" to a previous substring of a specified length at an "offset" from the current output pointer in the uncompressed stream, or a "literal" for a raw byte that was not matched.

Old fashioned compressors like Zip parsed the compressed bit stream serially, acting on each bit in different ways, which requires lots of branches in the decoder - does this bit tell you it's a match or a literal, how many bits of offset should I fetch, etc. This is also creates an inherent data dependency, where decoding each token depends on the last, because you have to know where the previous token ends to find the next one. This means the CPU has to wait for each step of the decoder before it begins the next step. Kraken can pre-decode all the tokens it needs to form the output, then fetch them all at once and do one branchless select to form output bytes.

Kraken creates optimized streams for the decoder

One of the special things about Kraken is that the encoded bit stream format is modular. Different features of the encoder can be turned on and off, such as entropy coding modes for the different components, data transforms, and string match modes. Crucially the Kraken encoder can choose these modes without re-encoding the entire stream, so it can optimize the way the encoder works for each chunk of data it sees. Orthogonality of bit stream options is a game changer; it means we can try N boolean options in only O(N) time by measuring the benefit of each option independently. If you had to re-encode for each set of options (as in traditional monolithic compressors), it would take O(2^N) time to find the best settings.

The various bit stream options do well on different types of data, and they have different performance trade offs in terms of decoder speed vs compression ratio. On the Sony PS5 we use this to make encoded bit streams that can be consumed at the peak SSD bandwidth so that the Kraken decoder is never the bottleneck. As long as the Kraken decoder is running faster than 5.5 GB/s input, we can turn on slower modes that get more compression. This lets us tune the stream to make maximum use of the time budget, to maximize the compression ratio under the constraint of always reading compressed bits from the SSD at full speed. Without this ability to tune the stream you would have very variable decode speed, so you would have to way over-provision the decoder to ensure it was never the bottleneck, and it would often be wasting computational capacity.

There are a huge number of possible compressed streams that will all decode to the same uncompressed bytes. We think of the Kraken decoder as a virtual machine that executes instructions to make output bytes, and the compressed streams are programs for that virtual machine. The Kraken encoder is then like an optimizing compiler that tries to find the best possible program to run on that virtual machine (the decoder). Previous compressors only tried to minimize the size of the compressed stream without considering how choices affect decode time. When we're encoding for a software decoder, the Kraken encoder targets a blend of decode time and size. When encoding for the PS5 hardware decoder, we look for the smallest stream that meets the speed requirement.

We designed Kraken to inherently have less variable performance than traditional dictionary compressors like Zip. All dictionary compressors work by copying matches to frequently occurring substrings; therefore they have a fast mode of decompression when they are getting lots of long string matches, they can output many bytes per step of the decoder. Prior compressors like Zip fall into a much slower mode on hard to compress data with few matches, where only one byte at a time is being output per step, and another slow mode when they have to switch back and forth between literals and short matches. In Kraken we rearrange the decoder so that more work needs to be done to output long matches, since that's already a super fast path, and we make sure the worst case is faster. Data with short matches or no matches or frequent switches between the two can still be decoded in one step to output at least three bytes per step. This ensures that our performance is much more stable, which means the clock rate of the hardware Kraken decoder doesn't have to be as high to meet the minimum speed required.

Kraken plus Oodle Texture can double previous compression ratios

Kraken is a powerful generic compressor that can find good compression on data with repeated patterns or structure. Some types of data are scrambled in such a way that the compressability is hard for Kraken to find unless that data is prepared in the right way to put it in a usable form. An important case of this for games is in GPU textures.

Oodle Kraken offers even bigger advantages for games when combined with Oodle Texture. Often the majority of game content is in BC1-BC7 textures. BC1-7 textures are a lossy format for GPU that encodes 4x4 blocks of pixels into 8 or 16 byte blocks. Oodle Kraken is designed to model patterns in this kind of granularity, but with previous BC1-BC7 texture encoders, there simply wasn't any pattern there to find, they were nearly incompressible with both Zip and Kraken. Oodle Texture creates BC1-7 textures in a way that has patterns in the data that Kraken can find to improve compression, but that are not visible to the human eye. Kraken can see that certain structures in the data repeat, the lengths of matches and offsets and space between matches, and code them in fewer bits. This is done without expensive operations like context coding or arithmetic coding.

It's been a real pleasure working with Sony on the hardware implementation of Kraken for PS5. It has long been our mission at RAD to develop the best possible compression for games, so we're happy to see publishers and platforms taking data loading and sizes seriously.

9/11/2020

Topics in Quantization for Games

I want to address some topics in quantization, with some specifics for games.

We do "quantization" any time we take a high precision value (a floating point, or higher-bit integer) and store it in a smaller value. The quantized value has less precision. Dequantization takes you back to the space of the input and should be done to minimize the desired error function.

I want to encourage you to think of quantization like this :

quantization takes some interval or "bucket" and assigns it to a label

dequantization restores a given label to a certain restoration point

"quantization" does not necessarily take you to a linear numeric space with fewer bits

The total expected error might be what we want to minimize :

Total_Error = Sum_x P(x) * Error( x,  dequantization( quantization(x) ) )
Note that in general the input values x do not have uniform probability, and the Error is not just linear L1 or L2 error, you might care about some other type of error. (you might also care more about minimizing the maximum rather than the average error).

I like to think of the quantized space as "labels" because it may not be just a linear numerical space where you can do distance metrics - you always dequantize back to your original value space before you do math on the quantization labels.

I started thinking about this because of my recent posts on Widespread error in RGBE and Alternative quantizers for RGBE, and I've been looking in various game-related code bases and found lots of mistakes in quantization code. These are really quite big errors compared to what we work very hard to reduce. I've found this kind of thing before outside of games too. For example it's very common for the YUV conversions in video and image codecs to be quite crap, giving up lots of error for no good reason. Common errors I have seem in the YUV conversions are : using the terribad 16-235 range, using the rec601/bt709 matrix so that you encode with one and decode with the other, using terribad down and/or up filters for the chroma downsample). It's frustrating when the actual H264 layer works very hard to minimize error, but then the YUV-RGB layer outside it adds some that could be easily avoided.

We do quantization all the time. A common case is for 8-bit RGB colors to float colors, and vice versa. We do it over and over when we do rendering passes; every time you write values out to a render target and read them back, you are quantizing and dequantizing. It is important to take care to make sure that those quantization errors are not magnified by later passes. For example when writing something like normals or lighting information, a quantization error of 1/256 can become much larger in the next stage of rendering.

(a common example of that is dot products or cosines; if you have two vectors and store something that acts like a dot product between them (or a cosine of an angle), the quantization bucket around 1.0 for the two vectors being parallel corresponds to a huge amount of angular variation, and this often right where you care most about having good precision, it's much better to store something that's like the acos of the dot product)

If you aren't going to do the analysis about how quantization errors propagate through your pipeline, then the easiest thing to do is to only quantize once, at the very end, and keep as much precision through the stages as possible. If you do something like a video codec, or an image processing pipeline, and try to work in limited precision (even 16 bit), it is important to recognize that each stage is an implicit quantization and to look at how those errors propagate through the stages.

(aside: I will mention just briefly that we commonly talk about a "float" as being the "unquantized" result of dequantization; of course that's not quite right. A "float" is a quantized representation of a real number, it just has variable size quantization bins, smaller bins for smaller numbers, but it's still quantized with steps of 1 ulp (unit in last place). More correctly, going to float is not dequantization, but rather requantization to a higher precision quantizer. The analysis of propagating through quantization error to work in 8 bits or whatever is the same you should do for how float error propagates through a series of operations. That said I will henceforth be sloppy and mostly talk about floats as "dequantized" and assume that 1 ulp is much smaller than precision that we care about.)

So lets go back and start at the beginning :

Linear uniform scalar quantization

If our input values x are all equally probable ( P(x) is a constant ), and the error metric we care about is linear L1 or L2 norm, then the optimal quantizer is just equal size buckets with restoration to center of bucket.

(for L1 norm the total error is actually the same for any restoration point in the bucket; for L2 norm total error is minimized at center of bucket; for L1 norm the maximum error is minimized at center of bucket)

We'll now specifically look at the case of an input value in [0,1) and quantizing to N buckets. The primary options are :


int quantize_floor( float x , int N )
{
    return (int)( x * N );
    // or floor( x * N );
    // output is in [0, N-1] , input x in [0,1) not including 1.0
}

float dequantize_floor( int q, int N )
{
    return (q + 0.5f ) * (1.f / N);
}

int quantize_centered( float x, int N )
{
    return (int)( x * (N-1) + 0.5f );
    // or round( x * (N-1) )
    // output is in [0, N-1] , input x in [0,1] , including 1.0 is okay
}

float dequantize_centered( int q, int N )
{
    return q * (1.f / (N-1));
}

The rule of thumb for these quantizers is you either bias by 0.5 in the quantizer, or in the dequantizer. You must bias on one side or the other, not both and not neither! The "floor" quantizer is "bias on dequant", while the "centered" quantizer is "bias on quant".

Visually they look like this, for the case of N = 4 :

(the top is "floor" quantization, the bottom is "centered")

(the top is "floor" quantization, the bottom is "centered")

In both cases we have 4 buckets and 4 restoration points. In the "floor" case the terminal bucket boundaries correspond to the boundaries of the [0,1) input interval. In the "centered" case, the terminal buckets are centered on the [0,1) endpoint, which means the bucket boundaries actually go past the end, but they restore exactly to the endpoints.

If your input values are actually all equally likely and the error metric that you care about is just L2 norm, then "floor" quantization is strictly better. You can see that the bucket size for "floor" quantization is 1/4 vs. 1/3 for "centered", which means the maximum error after dequantization is 1/8 vs. 1/6.

In practice we often care more about the endpoints or the integers, not just average or maximum error; we suspect the probability P(x) for x = 0 and 1 is higher, and the error metric Error( dequantization( quantization(x) ) - x ) may also be non-linear, giving higher weight to the error when x = 0 and 1.

"centered" quantization also has the property of preserving integers. For example say your input range was [0,255) in floats. If you quantize to N=256 buckets with "centered" quantization, it will restore exactly to the integers.

Games should only be using centered quantization!

While in theory there are cases where you might want to use either type of quantization, if you are in games don't do that!

The reason is that the GPU standard for UNORM colors has chosen "centered" quantization, so you should do that too. Certainly you need to do that for anything that interacts with the GPU and textures, but I encourage you to just do it for all your quantization, because it leads to confusion and bugs if you have multiple different conventions of quantizer in your code base.

The GPU UNORM convention is :

float dequantize_U8_UNORM( unsigned char u8 )
{
  return u8 * (1.f/255);
}
which implies centered quantization, so please use centered quantization everywhere in games. That means : bias 0.5 on quantize, no bias on dequantize.

While on the topic of UNORM, let's look at conversion between quantized spaces with different precision. Let's do U8 UNORM to U16 UNORM for example.

The way to get that right is to think about it as dequantization followed by quantization. We dequantize the U8 UNORM back to real numbers, then quantize real numbers back to U16 :


dequant = u8 * (1.f/255);

u16 = round( dequant * 65535 );

u16 = round( u8 * (1.f/255) * 65535 );

u16 = round( u8 * 257 );

u16 = u8 * 257;

u16 = u8 * (256 + 1);

u16 = (u8<<8) + u8;

So U8 to U16 re-quantization for UNORM is : take the U8 value, and replicate it shifted up by 8.
requantize U8 UNORM to U16 UNORM :

0xAB -> 0xABAB

This obviously has the necessary property that 00 stays zero, and 0xFF becomes 0xFFFF, so 1.0 is preserved.

This is something we call "bit replication". Let's take a moment to see why it works exactly in some cases and only approximately in others.

Bit Replication for re-quantization to higher bit counts

Bit replication is often used in games to change the bit count of a quantized value (to "requantize" it).

For example it's used to take 5-bit colors in BC1 to 8-bit :


The top 3 bits of the 5-bit value are replicated to the bottom :

abcde -> abcde|abc

giving an 8 bit value

Bit replication clearly gets the boundary cases right : all 0 bits to all 0's (dequantizes to 0.0), and all 1 bits to all 1 bits (dequantizes to 1.0); in between bit replication linearly increases the low bits between those endpoints, so it's obviously sort of what you want. In some cases bit replication corresponds exactly to requantization, but not in others.

With a B-bit UNORM value, it has N = 2^B values. The important thing for quantization is the denominator (N-1). For example with a 5-bit value, (N-1) = 31 is the denominator. It becomes clear if we think about requantization as changing the *denominator* of a fraction.


Requantization from 5 bits to 10 bits is changing the denominator from 31 to 1023 :

dequant( 5b ) = 5b / 31.0;
requant_10( x ) = round( x * 1023.0 );

requant_5_to_10 = round( x * 1023 / 31 );

1023/31 = 33 exactly, so :

requant_5_to_10 = x * 33

in integers.  And 33 = (32 + 1) = shift up 5 and replicate

requantization from 5 to 10 bits is just duplicating the bits shifted up
abcde -> abcde|abcde

What that means is bit replication from B to 2B is exactly equal to what you would get if you dequantized that number to UNORM and requantized it again.

This is of course general for any B :


denominator for B is (N-1)
denominator for 2B is (N^2 - 1)

requantiztion is *= (N^2 - 1) / (N-1)

(N^2 - 1) = (N-1) * (N+1)

so 

requantization is *= (N+1)

which is bit replication

Now more generally for bit replication to some number of bits that's not just double (but <= double, eg. between B and 2B) :

b between B and 2B
n = 2^b

requant_B_to_b(x) = round( x * (n-1) / (N-1) )

requant_B_to_b(x) = round( x * (N+1) * (n-1) / (N^2-1) )

requant_B_to_b(x) = round( (x bit replicated to 2B) * ( scale down ) )

bit replication from B to b is :

bitrep(x) = (x bit replicated to 2B) >> (2B - b)

that is, just replicate to 2B and then truncate low bits to get to b

when b = 2B , these are exactly equal as we showed above

obviously also at b = B (NOP)
and also at b = B+1 (adding one bit)

in the range b = [B+2, 2B-1] they are not quite exactly equal, but close

Let's look at an example, 5 bits -> 8 bits :

bitdouble( 5b ) = (5b * 33)

requant_5_to_8(5b) = round( (5b * 33) * ( 255.0 / 1023.0 ) )

bitrep_5_to_8(5b) = (5b * 33) >> 2

we can see where the small difference comes from :

bit replication just truncates off the 2 bottom bits

requantization does * (255/1023) , which is almost a /4 (like >>2) but not quite
and the requantization also rounds instead of truncating

so we should see how bit replication is similar to centered UNORM requantization, but not quite the same.

Now, bit replication is used in BC7, ASTC, etc. Is it a source of error? No, not if you do your encoder right. What it does mean is that you can't just find the 5-bit color value by doing a centered quantizer to 5 bits. Instead you have to ask what does the 5-bit value bit-replicate to, and find the closest value to your input.

Quantizing infinite signed values and the deadzone quantizer

So far we've talked about quantizing finite ranges, specifically [0,1) but you can map any other finite range to that interval. Let's have a brief look at quantizing infinite ranges.

If you just quantize a signed number to a signed quantized number, then you can use the above _floor or _centered quantizers without thinking any more about it. You will have uniform buckets across the whole number line. But what we often want to do is take a signed input number and quantize it to *unsigned* and separate out the sign bit, to create a sign+magnitude representation. (this makes the most sense with values whose probability P(x) is symmetric about zero and whose mean is at zero; eg. after a transform that subtracts off the mean)

One reason we might want to do that is because most of our schemes for sending unbounded (variable length) numbers work on unsigned numbers. For example : Encode Mod and Exp Golomb .

Now one option would be to quantize to signed ints and then Fold up Negatives to make an unsigned number to feed to your variable length scheme.

There are reasons we don't like that in data compression. Folded up negatives have a number line like : {0, -1, 1, -2, 2, -3 ... }

The annoying thing about that for data compression is that if you have a probability model like a Laplacian that decreases with absolutely value of x, the probabilities have these steps where values are repeated : { P(0), P(1), P(1), P(2), P(2), ... } and coding them with something like exp-golomb is no longer quite correct as they don't progressively fall off. Some codecs in the past have used tricks to reduce this (eg. JPEG-LS and CALIC) by doing things like being able to flip the sign so that you get either {0, -1, 1, -2, ... } or {0, 1, -1, 2, ... } depending on whether positive or negative is more probable.

Rather than do all that, let's assume you want to extract the sign bit and send it separately. So you are sending only the magnitude.

So we have taken the sign out and now only have a one sided interval [0, inf) to quantize. You can take that one-sided interval and just apply floor or centered quantization to it :


unsigned half_line_quantize( float x )
{
    ASSERT( x >= 0.f );
    //return floor( x ); // floor quantizer
    //return round( x ); // centered quantizer
    float bias = 0.f for floor and 0.5 for centered;
    return (unsigned) ( x + bias );
}

but something a bit funny has happened.

Floor and centered quantization now just act to shift where the boundary of the 0 bin is. But the 0 bin now occurs on both sides of the half interval, so to make the 0 bin the same size as the other bins, it should have a boundary at 0.5 (half the size of the other bins on the half interval). (I'm assuming here that your quantization bucket size is 1.0 ; for general sized quantization buckets just scale x before it gets here).

It's clear that the zero bin is a bit special, so we usually just go ahead and special case it :


pseduocode signed_line_quantizer( float x )
{
    // x signed

    float ax = fabsf(x);

    if ( ax < deadzone )
    {
        // special bucket for zero :
        // don't send sign bit
        return 0;
    }
    else
    {
        // do send sign bit of x
        // do floor quantizer above the zero bucket :
        return floor(ax - deadzone);
    }
}

Now if you want the zero bucket to have the same size as all others, you would set deadzone = 0.5 (it's half the zero bucket size on the full line). If you want to use a uniform floor quantizer on the half line, that would correspond to deadzone = 1.0 (making the zero bucket actually twice the size of others after mirroring to the negative half of the line).

What's been found in data compression is that a "deadzone" larger than equal size buckets (larger than 0.5) is beneficial. There are two primary reasons :

We use codecs where coding zeros is especially cheap, so sending more zeros is very desirable. So larger deadzone in the quantizer will give you more zeros, hence cheaper coding, and this is a greater benefit than the loss in quality. This is sort of a hacky way of doing some rate-distortion optimization, like trellis quantization but without any work.

The other reason is perceptual modeling; many human perception systems (eyes and ears) are less sensitive to the initial onset of a signal than they are to variations once the signal is present. Signals near zero are not detected by humans at all until they reach some threshold, and then once they pass the threshold there's a finer discrimination of level. For example the human ear might not detect a harmonic until it is 10 dB, but then distinguish volume levels at 1 dB changes after that.

Essentially your quantizer has two parameters, the bucket size for zero, and then the bucket size for values above zero. This is a very simple form of a more general variable quantizer.

In theory you would like to have variable size bins, such that each bin corresponds to an equal amount of perceptual importance (eg. larger bins where the values are less important). For the most part we now do that by applying a nonlinear transformation to the value before it reaches the uniform quantizer, rather than trying to do variable size bins. For example you might take log(x) before quantizing if you think precision of high values is less important. Another common example is the "gamma corrected" color space (or sRGB) for images; that's a non-linear transform applied to the signal (roughly pow 2.2) to map it to a space that's more perceptually uniform so that the quantization buckets give more precision where it's needed.

Something to watch out for is that a lot of code uses a deadzone quantizer without being clear about it. If you see something like :

!
int half_line_quantizer_thats_actually_a_deadzone( float x )
{
  ASSERT( x >= 0.f );
  return (int) x;
}
That's actually a deadzone quantizer with a 2x sized bin zero, if it's being used after sign removal.

In the olden days, variable-size quantization buckets were used as a kind of entropy coder. They would have smaller buckets in higher probability regions and larger buckets in lower probability regions, so that the quantized output value had equal probability for all bins. Then you could send the quantized value with no entropy coding. This is now almost never done, it's better to use quantization purely for error metric optimization and use a separate entropy coder on the output.

Topics in dequantization

Just briefly some topics in dequantization.

For values that are all equally likely, under an L2 (SSD/RMSE) error norm, dequantization to the center of the bucket is optimal. More generally the restoration point for each bucket should minimize the error metric weighted by the probability of that input value.

An easy case is with an L2 error metric but a non-uniform probability. Then the error in a given bucket for a restoration point is :

L2 error of restoring to r in this bucket :

E = Sum_x P(x) * ( r - x )^2

( Sum_x for x's in this bucket )

find r that minimizes E by taking d/dr and setting to zero :

d/dr E = 0

d/dr E = Sum_x P(x) * ( r - x ) * 2

Sum_x P(x) * ( r - x ) = 0

Sum_x P(x) * r = Sum_x P(x) * x

r = ( Sum_x P(x) * x ) / ( Sum_x P(x) )

that's just the expectation value of x in the bucket

we should restore to the average expected 'x' value in the bucket.

A common case of that is for a skewed probability distribution - something like Laplacian or Poisson with a falloff of probabilities away from the peak - we should restore each bucket to a value that's skewed slightly towards the peak, rather than restoring the center.

Now if you have a mathematical model of P(x) then you could compute where these centers should be, and perhaps store them in a table.

What's often better in practice is just to measure them experimentally. Do trial runs and record all the values that fall into each quantization bucket and take their mean - that's your restoration point.

Then you could store those measured restoration points in constants in your code, OR you could measure them and store them per-data item. (for example an image compressor could transmit them per image - maybe not all but a few of the most important ones).

Another thing you can do in dequantization is to not always restore to the same point. I noted briefly previously that if what you care about is L1 norm, then any restoration point in the bucket has the same error. Rather than just pick one, you could restore to any random point in the bucket and that would give the same expected L1 norm.

L2 norm strongly prefers the mean (minimizing L2 is blurring or smoothing, while L1 allows lots of noise), but perceptually it may be better to add some randomness. You could restore to mean in the bucket plus a small amplitude of noise around there. Again this noise could be global constant, or could be sent per-image, or per-band; it could also be predicted from local context so you could have more or less noisy areas.

Note that adding noise in dequantization is not the same as just adding noise arbitrarily after the fact. The values are still within the quantization bucket, so they could have been the true source values. That is, we can reframe dequantization as trying to guess the source given the quantized version :


Encoder had original image I

made Q = quant( I )

Q was transmitted

rather than just run I' = dequant( Q )

we instead pose it as :

we want to find I'
such that
Q = quant( I' )
and I' has the maximum probability of being the original I
or I' has the most perceptual similarity to our guess of I

The key thing here is that noise within the quantization bucket keeps the constraint Q = quant(I') satisfied.

As an example I'll mention something I've done in the past for wavelet bit-plane truncation.

Wavelet coding converts an image into activity residuals at various frequency subbands. These are initially quantized with a uniform+deadzone quantizer (if a floating point wavelet transform was used). Then in many codecs they are sent progressively in bit planes, so the highest bits are sent first, then lower bits, so that you get the most important bits first. You can then truncate the stream, cutting off transmission of lower bits in the higher subbands, effectively increasing the quantizer there. This is done in JPEG2000 with the EBCOT scheme for example.

So a given wavelet residual might be sent like :


value 45

= 101101

only top 2 bits sent :

10xxxx

the others are cut off.

In the decoder you know which bits you got and which are missing, which is equivalent to a larger quantization bucket.

The classic option (eg. SPIHT) was just to fill the lost xx bits with zeros :

10xxxx -> 100000

This makes values that are too low and is generally very smoothing (high frequency detail just goes away)

You might think, it's a quantization bucket, we should restore to the middle, which is 0.5 which is the
next bit on :

10xxxx -> 101000 or 100111

That is much too high, it's larger than the expectation and actually looks like a sharpen filter.
The reason is that wavelet amplitudes have P(x) strongly skewed towards zero, so the mean value is
way below the middle of the bucket.

Restoring to 0.25 is a bit better :

10xxxx -> 100100

but even better is to just measure what is the mean in the image for each missing bit count; that
mean depends on how large our value was (the part that's not truncated).

Finally in addition to restoring the missing bits to mean, you could add randomness in the dequantization, either within the quantization bucket (below the bottom bit), or in the low part of the missing bits (eg. if 4 bits are missing the bottom 2 might get some randomness). You can compute the amount of randomness desired such that the decompressed image matches the high frequency energy of the original image.

And that's enough on quantization for now!

old rants