cbloom rants: 04/2016

4/30/2016

Oodle Kraken Pareto Frontier

The whole idea of Kraken is to excel in the size-decodespeed tradeoff. From the beginning of its design, every decision was considered in terms of the size vs decodespeed.

So how does it do?

To visualize the size-decodespeed Pareto frontier, I like to use an imaginary loading problem. You want to load compressed data and then decompress it. One option is the null compressor (or "memcpy") that just loads uncompressed data and then memcpys it. Or you can use compressors that give smaller file sizes, thus quicker loads, but take more time to decompress. By dialing the virtual disk speed, you can see which compressor is best at different tradeoffs of space vs. speed.

Of course you may not actually be just trying to optimize load time, but you can still use this imaginary loading problem as a way to study the space-speed tradeoff. If you care more about size, you look at lower imaginary disk speeds, if you core more about CPU use, you look at higher imaginary disk speeds.

I like to show "speedup" which is the factor of increase in speed using a certain compressor to do load+decomp vs. the baseline (memcpy). So the left hand y-intercepts (disk speed -> 0) show the compression ratio, and the right hand y-intercepts side show the decompression speed (disk speed -> inf), and in between shows the tradeoff. (By "show" I mean "is linearly proportional to, so you can actually look at the ratio of Y-intercepts in the Pareto curve and it tells you compression ratio on the left and decompression speed on the right).

03-02-15 - Oodle LZ Pareto Frontier and 05-13-15 - Skewed Pareto Chart

The chart showing "millis" shows time, so obviously lower is better. I show a few ways to combine load & decomp. Sum is the time to load then decomp sequentially. Max is the larger of the two, which is the time if they were perfectly overlapped (not usually possible). Personally I think "l2 sum" is most useful. This is the sqrt of sum of squares, which is a kind of sum that biases towards the larger; it's kind of like a mix of "sum" and "max", it means you want the sum to be small, but you also want them to similar times.

Kraken tops the Pareto curve for decodespeed vs size for a huge range of virtual disk speed; from around 2 mb/s up to around 300 mb/s.

Of course the Pareto curve doesn't tell the whole picture. For one thing I don't have encode speed in the equation here at all (and sometimes you need to look at the memory use tradeoff too, so it's really a size-decodespeed-encodespeed-memoryuse surface). For another you sometimes have strict requirements, like I must hit a certain file size, and then you just pick the fastest decoder that meets that requirement. One thing the curves do clearly tell you is when a curve just completely covers another (that is, is greater in Y for all values of X), then that compressor is just never needed (in space & decodespeed).

Silesia :

Game Test Set :

Just for comparison to earlier posts on my Blog about the other Oodle compressors, here's a run of lzt99 with the full Oodle compressor set.

I've included ZStd for reference here just to show that Kraken jumping way up is not because the previous Oodle compressors were bad - far from it, they were already Pareto optimal. (I like to compare against ZStd not because it's bad, but because it's excellent; it's the only other compressor in the world I know of that's even close to Oodle in terms of good compression ratio and high speed (ZStd also has nice super-fast encode speeds, and it's targetted slightly differently, it's better on text and Oodle is better on binary, it's not a direct apples comparison)).

You can see that LZHLW is just completely covered by Kraken, and so is now deprecated in Oodle. Even LZNIB and BitKnit are barely peeking out of the curve, so the range where they are the right answer is greatly reduced to more specialized needs. (for example BitKnit still romps on strongly structured data, and is useful if you just need a size smaller than Kraken)

4/28/2016

Performance of Oodle Kraken

A continuing series reporting on the performance of recently released Oodle Kraken

First some general notes about Oodle before the big dump of numbers. (skip to the bottom for charts)

Oodle is not intended to solve every problem in data compression. Oodle Data Compression is mainly designed for compress-once load-many usage patterns, where the compressed size and decode speed is important, but encode speed is not as important. Things like distribution, packaging, when you bake content into a compressed archive and then serve it to many people. I do consider it a requirement to keep the encoders faster than 1 MB/s on my ancient laptop, because less than that is just too slow even for content baking.

Like most data compressors, Oodle has various compression level options that trade off encoder speed for compressed size. So there are faster and slower encoders; some of the fast modes are good tradeoffs for space-speed. Kraken turns out to be pretty good at getting high compression even in the fast modes. I'll do a post that goes into this in the future.

Oodle is mainly intended for packing binary data. Because Oodle is built for games, we focus on the types of data that games ship, such as textures, models, animations, levels, and other binary structured data that often contains various types of packed numeric data. Oodle is good on text but is not really built for text. In general if you are focusing on a specific type of data (eg. text/html/xml) you will get the best results with domain-specific preprocessors, such as xwrt or wbpe. Oodle+wbpe is an excellent text compressor.

Oodle Kraken is intended for high compression use cases, where you care about size. There's a whole family of compressors now that live in the "just below lzma (7zip)" compression ratio domain, such LZHAM, Brotli, Oodle BitKnit, and ZStd. In general the goal of these is to get as close to lzma compression ratio as possible while providing much better speed. This is where Kraken achieves far more than anyone has before, far more than we thought possible.

Today I will be focusing on high compression modes, looking at decode speed vs. compression ratio. All compressors will generally be run in their max compression mode, without going into the "ridiculously slow" below 1 MB/s range. (for Oodle this means running -zl6 Optimal2 but not -zl7 or higher).

Okay, on to lots of numbers.

Rather than post an average on a test set, which can be misleading due to the selection of the test set or the way the results are averaged (total compressed size or average ratio?), I'll post results on a selection of individual files :

-------------------------------------------------------
"silesia_mozilla"

by ratio:
lzma        :  3.88:1 ,    2.0 enc mbps ,   63.7 dec mbps
lzham       :  3.56:1 ,    1.5 enc mbps ,  186.4 dec mbps
Oodle Kraken:  3.51:1 ,    1.2 enc mbps ,  928.0 dec mbps
zstdmax     :  3.24:1 ,    2.8 enc mbps ,  401.0 dec mbps
zlib9       :  2.51:1 ,   12.4 enc mbps ,  291.5 dec mbps
lz4hc       :  2.32:1 ,   36.4 enc mbps , 2351.6 dec mbps

by encode speed:
lz4hc       :  2.32:1 ,   36.4 enc mbps , 2351.6 dec mbps
zlib9       :  2.51:1 ,   12.4 enc mbps ,  291.5 dec mbps
zstdmax     :  3.24:1 ,    2.8 enc mbps ,  401.0 dec mbps
lzma        :  3.88:1 ,    2.0 enc mbps ,   63.7 dec mbps
lzham       :  3.56:1 ,    1.5 enc mbps ,  186.4 dec mbps
Oodle Kraken:  3.51:1 ,    1.2 enc mbps ,  928.0 dec mbps

by decode speed:
lz4hc       :  2.32:1 ,   36.4 enc mbps , 2351.6 dec mbps
Oodle Kraken:  3.51:1 ,    1.2 enc mbps ,  928.0 dec mbps
zstdmax     :  3.24:1 ,    2.8 enc mbps ,  401.0 dec mbps
zlib9       :  2.51:1 ,   12.4 enc mbps ,  291.5 dec mbps
lzham       :  3.56:1 ,    1.5 enc mbps ,  186.4 dec mbps
lzma        :  3.88:1 ,    2.0 enc mbps ,   63.7 dec mbps
-------------------------------------------------------
"lzt99"

by ratio:
lzma        :  2.65:1 ,    3.1 enc mbps ,   42.3 dec mbps
Oodle Kraken:  2.46:1 ,    2.3 enc mbps ,  957.1 dec mbps
lzham       :  2.44:1 ,    1.9 enc mbps ,  166.0 dec mbps
zstdmax     :  2.27:1 ,    3.8 enc mbps ,  482.3 dec mbps
zlib9       :  1.77:1 ,   13.3 enc mbps ,  286.2 dec mbps
lz4hc       :  1.67:1 ,   30.3 enc mbps , 2737.4 dec mbps

by encode speed:
lz4hc       :  1.67:1 ,   30.3 enc mbps , 2737.4 dec mbps
zlib9       :  1.77:1 ,   13.3 enc mbps ,  286.2 dec mbps
zstdmax     :  2.27:1 ,    3.8 enc mbps ,  482.3 dec mbps
lzma        :  2.65:1 ,    3.1 enc mbps ,   42.3 dec mbps
Oodle Kraken:  2.46:1 ,    2.3 enc mbps ,  957.1 dec mbps
lzham       :  2.44:1 ,    1.9 enc mbps ,  166.0 dec mbps

by decode speed:
lz4hc       :  1.67:1 ,   30.3 enc mbps , 2737.4 dec mbps
Oodle Kraken:  2.46:1 ,    2.3 enc mbps ,  957.1 dec mbps
zstdmax     :  2.27:1 ,    3.8 enc mbps ,  482.3 dec mbps
zlib9       :  1.77:1 ,   13.3 enc mbps ,  286.2 dec mbps
lzham       :  2.44:1 ,    1.9 enc mbps ,  166.0 dec mbps
lzma        :  2.65:1 ,    3.1 enc mbps ,   42.3 dec mbps
-------------------------------------------------------
"all_dds"

by ratio:
lzma        :  2.37:1 ,    2.1 enc mbps ,   40.8 dec mbps
Oodle Kraken:  2.18:1 ,    1.0 enc mbps ,  684.6 dec mbps
lzham       :  2.17:1 ,    1.3 enc mbps ,  127.7 dec mbps
zstdmax     :  2.02:1 ,    3.3 enc mbps ,  289.4 dec mbps
zlib9       :  1.83:1 ,   13.3 enc mbps ,  242.9 dec mbps
lz4hc       :  1.67:1 ,   20.4 enc mbps , 2226.9 dec mbps

by encode speed:
lz4hc       :  1.67:1 ,   20.4 enc mbps , 2226.9 dec mbps
zlib9       :  1.83:1 ,   13.3 enc mbps ,  242.9 dec mbps
zstdmax     :  2.02:1 ,    3.3 enc mbps ,  289.4 dec mbps
lzma        :  2.37:1 ,    2.1 enc mbps ,   40.8 dec mbps
lzham       :  2.17:1 ,    1.3 enc mbps ,  127.7 dec mbps
Oodle Kraken:  2.18:1 ,    1.0 enc mbps ,  684.6 dec mbps

by decode speed:
lz4hc       :  1.67:1 ,   20.4 enc mbps , 2226.9 dec mbps
Oodle Kraken:  2.18:1 ,    1.0 enc mbps ,  684.6 dec mbps
zstdmax     :  2.02:1 ,    3.3 enc mbps ,  289.4 dec mbps
zlib9       :  1.83:1 ,   13.3 enc mbps ,  242.9 dec mbps
lzham       :  2.17:1 ,    1.3 enc mbps ,  127.7 dec mbps
lzma        :  2.37:1 ,    2.1 enc mbps ,   40.8 dec mbps
-------------------------------------------------------
"baby_robot_shell.gr2"

by ratio:
lzma        :  4.35:1 ,    3.1 enc mbps ,   59.3 dec mbps
Oodle Kraken:  3.77:1 ,    1.5 enc mbps ,  878.3 dec mbps
lzham       :  3.77:1 ,    1.6 enc mbps ,  162.5 dec mbps
zstdmax     :  2.77:1 ,    5.7 enc mbps ,  405.7 dec mbps
zlib9       :  2.19:1 ,   13.9 enc mbps ,  332.9 dec mbps
lz4hc       :  1.78:1 ,   40.1 enc mbps , 2364.4 dec mbps

by encode speed:
lz4hc       :  1.78:1 ,   40.1 enc mbps , 2364.4 dec mbps
zlib9       :  2.19:1 ,   13.9 enc mbps ,  332.9 dec mbps
zstdmax     :  2.77:1 ,    5.7 enc mbps ,  405.7 dec mbps
lzma        :  4.35:1 ,    3.1 enc mbps ,   59.3 dec mbps
lzham       :  3.77:1 ,    1.6 enc mbps ,  162.5 dec mbps
Oodle Kraken:  3.77:1 ,    1.5 enc mbps ,  878.3 dec mbps

by decode speed:
lz4hc       :  1.78:1 ,   40.1 enc mbps , 2364.4 dec mbps
Oodle Kraken:  3.77:1 ,    1.5 enc mbps ,  878.3 dec mbps
zstdmax     :  2.77:1 ,    5.7 enc mbps ,  405.7 dec mbps
zlib9       :  2.19:1 ,   13.9 enc mbps ,  332.9 dec mbps
lzham       :  3.77:1 ,    1.6 enc mbps ,  162.5 dec mbps
lzma        :  4.35:1 ,    3.1 enc mbps ,   59.3 dec mbps
-------------------------------------------------------
"win81"

by ratio:
lzma        :  2.95:1 ,    2.5 enc mbps ,   51.9 dec mbps
lzham       :  2.77:1 ,    1.6 enc mbps ,  177.6 dec mbps
Oodle Kraken:  2.70:1 ,    1.0 enc mbps ,  877.0 dec mbps
zstdmax     :  2.64:1 ,    3.5 enc mbps ,  417.8 dec mbps
zlib9       :  2.07:1 ,   16.8 enc mbps ,  269.6 dec mbps
lz4hc       :  1.91:1 ,   28.8 enc mbps , 2297.6 dec mbps

by encode speed:
lz4hc       :  1.91:1 ,   28.8 enc mbps , 2297.6 dec mbps
zlib9       :  2.07:1 ,   16.8 enc mbps ,  269.6 dec mbps
zstdmax     :  2.64:1 ,    3.5 enc mbps ,  417.8 dec mbps
lzma        :  2.95:1 ,    2.5 enc mbps ,   51.9 dec mbps
lzham       :  2.77:1 ,    1.6 enc mbps ,  177.6 dec mbps
Oodle Kraken:  2.70:1 ,    1.0 enc mbps ,  877.0 dec mbps

by decode speed:
lz4hc       :  1.91:1 ,   28.8 enc mbps , 2297.6 dec mbps
Oodle Kraken:  2.70:1 ,    1.0 enc mbps ,  877.0 dec mbps
zstdmax     :  2.64:1 ,    3.5 enc mbps ,  417.8 dec mbps
zlib9       :  2.07:1 ,   16.8 enc mbps ,  269.6 dec mbps
lzham       :  2.77:1 ,    1.6 enc mbps ,  177.6 dec mbps
lzma        :  2.95:1 ,    2.5 enc mbps ,   51.9 dec mbps
-------------------------------------------------------
"enwik7"

by ratio:
lzma        :  3.64:1 ,    1.8 enc mbps ,   79.5 dec mbps
lzham       :  3.60:1 ,    1.4 enc mbps ,  196.5 dec mbps
zstdmax     :  3.56:1 ,    2.2 enc mbps ,  394.6 dec mbps
Oodle Kraken:  3.49:1 ,    1.5 enc mbps ,  789.7 dec mbps
zlib9       :  2.38:1 ,   22.2 enc mbps ,  234.3 dec mbps
lz4hc       :  2.35:1 ,   27.5 enc mbps , 2059.6 dec mbps

by encode speed:
lz4hc       :  2.35:1 ,   27.5 enc mbps , 2059.6 dec mbps
zlib9       :  2.38:1 ,   22.2 enc mbps ,  234.3 dec mbps
zstdmax     :  3.56:1 ,    2.2 enc mbps ,  394.6 dec mbps
lzma        :  3.64:1 ,    1.8 enc mbps ,   79.5 dec mbps
Oodle Kraken:  3.49:1 ,    1.5 enc mbps ,  789.7 dec mbps
lzham       :  3.60:1 ,    1.4 enc mbps ,  196.5 dec mbps

by decode speed:
lz4hc       :  2.35:1 ,   27.5 enc mbps , 2059.6 dec mbps
Oodle Kraken:  3.49:1 ,    1.5 enc mbps ,  789.7 dec mbps
zstdmax     :  3.56:1 ,    2.2 enc mbps ,  394.6 dec mbps
zlib9       :  2.38:1 ,   22.2 enc mbps ,  234.3 dec mbps
lzham       :  3.60:1 ,    1.4 enc mbps ,  196.5 dec mbps
lzma        :  3.64:1 ,    1.8 enc mbps ,   79.5 dec mbps
-------------------------------------------------------

In chart form :

(lz4 decode speed is off the top of the chart)

I'm not including the other Oodle compressors here just to keep things as simple as possible. If you do want more compression than Kraken, and care about decode speed, then Oodle LZNA or BitKnit are much faster to decode than lzma (7zip) at comparable or better compression ratios.

Visit radgametools.com to learn more about Kraken and Oodle.

4/27/2016

Release the Kraken!

Announcing Oodle Kraken! - a new compression algorithm in the Oodle Data Compression library.

Oodle Kraken offers high compression with incredible decode speed, the likes of which has never been seen before.

Oodle Kraken decodes 3X faster than ZLib, and gets way more compression. There's really nothing even close to Kraken, its performance is revolutionary - it's never been possible to get such high compression and blazing fast decode speed before.

On the "Silesia" test set, the performance is :

Kraken :  4.05 to 1 compression ratio,   919.8 decode mb/s
zlib9  :  2.74 to 1 compression ratio,   306.9 decode mb/s
lzma   :  4.37 to 1 compression ratio,    78.8 decode mb/s

(speed measured on a Core i7-3770 3.4 Ghz , x64, single threaded)

Oodle Kraken's high compression makes it great for downloadables and distribution. Kraken's fast decode speed makes it amazing for in-game loading, paging and streaming, with less CPU load taking time away from your game. There's no need to decompress Kraken data when you install - just load compressed data directly. Kraken is so fast that loading compressed data and then decompressing is faster than just loading uncompressed data off disk!

Visit radgametools.com to learn more about Kraken and Oodle.

I'll be writing more about Oodle Kraken in the coming weeks, such as performance analysis on other data sets and vs. other compressors.

The overall performance on Silesia :

And a chart of the performance on each file in Silesia. This is a loglog scatter plot, with log2 of decode speed on the x axis and log2 of compression ratio on the y axis. (so for example from LZMA to Kraken on "sao" is 4.68 steps on the x axis, that means 25.76X faster)

Oodle Kraken is superb on the AMD Jaguar chip that's in the PS4 and XBox One. Most decompressors really struggle to run fast on those platforms, but not Kraken.

It's hard to write about Oodle Kraken without sounding ridiculous. It's really a huge step in data compression, where huge steps almost never happen.

4/25/2016

Huffman Correction

A reader pointed out an error in my blog post - 08-12-10 - The Lost Huffman Paper

I had :


if ( bits >= huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;
}

it should have been :


if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;
}

it's corrected now.

In my convention, branchCodeLeftAligned is the left-aligned bitbuff value that means you must go to a higher codelen.

I thought for clarity I'd go ahead and post the example I did with him :


You have this alphabet :

symbol_id, codeLen, code:
0 ; 2 ; 00
1 ; 3 ; 010
2 ; 3 ; 011
3 ; 3 ; 100
4 ; 4 ; 1010
5 ; 4 ; 1011
6 ; 4 ; 1100
7 ; 4 ; 1101
8 ; 4 ; 1110
9 ; 5 ; 11110
10; 5 ; 11111


baseCode[n] = first code of len n - # of codes of lower len


baseCode,
    [2]     0
    [3]     1       = 010 - 1
    [4]     6       = 1010 - 4
    [5]     21      = 11110 - 9

huff_branchCodeLeftAligned
    [2]   0x4000000000000000      010000...
    [3]   0xa000000000000000      101000...
    [4]   0xf000000000000000      111100...
    [5]   0xffffffffffffffff      111111...


My decode loop is :

for(int codeLen=1;;codeLen++) // actually unrolled, not a loop
{

if ( bitbuff < huff_branchCodeLeftAligned[codeLen] ) return symbolUnsort[ getbits(codeLen) - baseCode[codeLen] ];

}

or

int codeLen = minCodeLen;
while ( bitbuff >= huff_branchCodeLeftAligned[codeLen] ) codeLen++;
sym = symbolUnsort[ getbits(codeLen) - baseCode[codeLen] ];

so if bitbuff is

11010000...

codeLen starts at 2
we check

if ( 11010000.. < 0x4000... ) - false
if ( 11010000.. < 0xa000... ) - false
if ( 11010000.. < 0xf000... ) - true
  return ( 1101 - baseCode[4] ); = 13 - 6 = 7

And a full table-accelerated decoder for this code might be :


// 3-bit acceleration table :
#define TABLE_N_BITS 3
if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;
}

if ( bitbuff < huff_branchCodeLeftAligned[4] ) return symbolUnsort[ getbits(4) - baseCode[4] ];

// 5 is max codelen
// this compare is not always true (because of the bitbuff=~0 problem), but we take it anyway
//if ( bitbuff < huff_branchCodeLeftAligned[5] )

return symbolUnsort[ getbits(5) - baseCode[5] ];

And there you go. MSB-first Huffman that supports long code lengths that exceed the acceleration table size.

Data Compression History : Finnish

Finnish was perhaps the fastest (non-nop) compressor in the world around 1995 (? not sure on the exact year. Definitely before P-Pro and CMOV and branch penalties and such; this is a pre-Pentium-era optimized compressor; it definitely existed before LZP1 (1996) since it was one of the things I benchmarked against). (heck it might be a 286-era compressor, seeing as it's all 16-bit!)

Finnish was by some guys that I assume were from Finland. If anybody knows the correct attribution please let me know.

I was thinking about it the other day because we talked about the old segment register trick that we used to do, and I always thought this was such a neat little bit of code. It also uses the byte-regs as part of word-reg tricks.

Finnish :


; es = CharTable
; bx = hash index
; dl = control bits
; ds[si] = input
; ds[di] = output
; ax/al/ah = input char
; bp = control ptr

ProcessByte     macro SourceReg,BitVal
                        cmp     SourceReg, es:[bx]
                        je      ProcessByte_done
                        or      dl, BitVal
                        mov     es:[bx], SourceReg
                        mov     ds:[di], SourceReg
                        inc     di
ProcessByte_done:       mov     bh, bl
                        mov     bl, SourceReg
                        endm


ProcessBlockLoop:
                        mov bp, di           ; ControlPtr = CompPtr++;
                        inc di
                        xor dl, dl           ; Control = 0;

                        lodsw                ; AX = ds[si] , si += 2
                        ProcessByte al, 80h
                        ProcessByte ah, 40h
                        lodsw
                        ProcessByte al, 20h
                        ProcessByte ah, 10h
                        lodsw
                        ProcessByte al, 08h
                        ProcessByte ah, 04h
                        lodsw
                        ProcessByte al, 02h
                        ProcessByte ah, 01h

                        mov     ds:[bp], dl  ; *ControlPtr = Control

cbloom rants