9/24/2020

How Oodle Kraken and Oodle Texture supercharge the IO system of the Sony PS5

The Sony PS5 will have the fastest data loading ever available in a mass market consumer device, and we think it may be even better than you have previously heard. What makes that possible is a fast SSD, an excellent IO stack that is fully independent of the CPU, and the Kraken hardware decoder. Kraken compression acts as a multiplier for the IO speed and disk capacity, storing more games and loading faster in proportion to the compression ratio.

Sony has previously stated that the SSD is capable of 5.5 GB/s, with expected decompressed bandwidth of around 8-9 GB/s, based on measured average compression ratios for games of around 1.5 to 1. While Kraken is an excellent generic compressor, it struggled to find usable patterns in a crucial type of content: GPU textures, which make up a large fraction of game content. Since then we've made huge progress on improving the compression ratio of GPU textures with Oodle Texture, which encodes them so that subsequent Kraken compression can find patterns it can exploit. The result is that we expect the average compression ratio of games to be much better in the future, closer to 2 to 1.

Oodle Kraken is the lossless data compression we invented at RAD Game Tools, which gets very high compression ratios and is also very fast to decode. Kraken is uniquely well suited to compress game content and keep up with the speed requirements of the fast SSD without ever being the bottleneck. We originally developed Oodle Kraken as software for modern CPUs. In Kraken our goal was to reformulate traditional dictionary compression to maximize instruction level parallelism in the CPU, with lots of independent work running at all times and a minimum of serial dependencies and branches. Adapting it for hardware was a new challenge, but it turned out that the design decisions we had made to make Kraken great on modern CPUs were also exactly what was needed to be good in hardware.

The Kraken decoder acts as an effective speed multiplier for data loading. Data is stored compressed on the SSD and decoded transparently at load time on PS5. What the game sees is the rate that it receives decompressed data, which is equal to the SSD speed multiplied by the compression ratio.

Good data compression also improves game download times, and lets you store more games on disk. Again the compression ratio acts as an effective multiplier for download speed and disk capacity. A game might use 80 GB uncompressed, but with 2 to 1 compression it only takes 40 GB on disk, letting you store twice as many games. A smaller disk with better compression can hold more games than a larger disk with worse compression.
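
As a rough illustration of the capacity claim, here is a minimal sketch. The 80 GB game and the 2 to 1 ratio come from the paragraph above; the two disk sizes and the 1.5 to 1 ratio for the larger disk are hypothetical numbers chosen only to show the effect.

```python
def games_that_fit(disk_gb: float, game_uncompressed_gb: float, ratio: float) -> int:
    """Whole games that fit on a disk when each game is stored at the given ratio."""
    compressed_gb = game_uncompressed_gb / ratio
    return int(disk_gb // compressed_gb)

# From the text: an 80 GB game at 2 to 1 takes 40 GB on disk.
size_on_disk_gb = 80 / 2.0

# Hypothetical disks, for illustration only.
small_disk_better_ratio = games_that_fit(800, 80, 2.0)
large_disk_worse_ratio = games_that_fit(1000, 80, 1.5)

print(size_on_disk_gb, small_disk_better_ratio, large_disk_worse_ratio)
```

With these assumed numbers the smaller disk holds more whole games than the larger one, because each game occupies less space on it.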

When a game needs data on PS5, it makes a request to the IO system, which loads compressed data from the SSD; that is then handed to the hardware Kraken decoder, which outputs the decompressed data the game wanted to the RAM. As far as the game is concerned, it just gets its decompressed data, but with higher throughput. On other platforms, Kraken can be run in software, getting the same compression gains but using CPU time to decode. When using software Kraken, you would first load the compressed data, then when that IO completes perform decompression on the CPU.

If the compression ratio was exactly 1.5 to 1, the 5.5 GB/s peak bandwidth of the SSD would decompress to 8.25 GB/s uncompressed bytes output from the Kraken decoder. Sony has estimated an average compression ratio of between 1.45 to 1 and 1.64 to 1 for games without Oodle Texture, resulting in expected decompressed bandwidth of 8-9 GB/s.
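
The arithmetic behind these figures can be checked with a short sketch. The 5.5 GB/s SSD rate and the compression ratios are the ones quoted in this post; the function name is just an illustrative choice.

```python
SSD_GBPS = 5.5  # peak SSD bandwidth quoted above

def effective_bandwidth(raw_gbps: float, ratio: float) -> float:
    """Decompressed bytes per second seen by the game: raw rate times compression ratio."""
    return raw_gbps * ratio

# 1.45 and 1.64 bracket Sony's estimate, 1.5 is the round figure,
# and 2.0 is the ratio we expect to approach with Oodle Texture.
for ratio in (1.45, 1.5, 1.64, 2.0):
    print(f"{ratio:.2f} to 1 -> about {effective_bandwidth(SSD_GBPS, ratio):.1f} GB/s")
```

The low and high ends of Sony's estimate land just under 8 GB/s and just over 9 GB/s, which is where the 8-9 GB/s range comes from.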

Since then, Sony has licensed our new technology, Oodle Texture, for all games on the PS4 and PS5. Oodle Texture lets games encode their textures so that they are drastically more compressible by Kraken, but with high visual quality. Textures often make up the majority of content of large games and, prior to Oodle Texture, were difficult to compress for general purpose compressors like Kraken.

The combination of Oodle Texture and Kraken can give very large gains in compression ratio. For example, on a texture set from a recent game:

Zip                       1.64 to 1
Kraken                    1.82 to 1
Zip + Oodle Texture       2.69 to 1
Kraken + Oodle Texture    3.16 to 1

Kraken plus Oodle Texture gets nearly double the compression of Zip alone on this texture set.
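
To make the comparison concrete, the sketch below expresses each result relative to Zip alone, using only the four ratios listed above. The helper name is illustrative.

```python
# Compression ratios from the texture set above.
ratios = {
    "Zip": 1.64,
    "Kraken": 1.82,
    "Zip + Oodle Texture": 2.69,
    "Kraken + Oodle Texture": 3.16,
}

def gain_over(name: str, baseline: str = "Zip") -> float:
    """How many times more compression `name` gets than `baseline`."""
    return ratios[name] / ratios[baseline]

for name, ratio in ratios.items():
    # 100 / ratio is the compressed size as a percentage of the original.
    print(f"{name}: {100.0 / ratio:.0f}% of original size, {gain_over(name):.2f}x Zip")
```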

Oodle Texture is a software library that game developers use at content creation time to compile their source art into GPU-ready BC1-7 formats. All games use GPU texture encoders, but previous encoders did not optimize the compiled textures for compression like Oodle Texture does. Not all games at launch of PS5 will be using Oodle Texture as it's a very new technology, but we expect it to be in the majority of PS5 games in the future. Because of this we expect the average compression ratio and therefore the effective IO speed to be even better than previously estimated.

How does Kraken do it?

The most common alternative to Kraken would be the well known Zip compressor (aka "zlib" or "deflate"). Zip hardware decoders are readily available, but Kraken has special advantages over Zip for this application. Kraken gets more compression than Zip because it's able to model patterns and redundancy in the data that Zip can't. Kraken is also inherently faster to decode than Zip, which in hardware translates to more bytes processed per cycle.

Kraken is a reinvention of dictionary compression for the modern world. Traditional compressors like Zip were built around the requirement of streaming with low delay. In the past it was important for compressors to be able to process a few bytes of input and immediately output a few bytes, so that encoding and decoding could be done incrementally. This was needed due to very small RAM budgets and very slow communication channels, and typical data sizes were far smaller than they are now. When loading from HDD or SSD, we always load data in chunks, so decompressing in smaller increments is not needed. Kraken is fundamentally built around decoding whole chunks, and by changing that requirement Kraken is able to work in different ways that are much more efficient for hardware.

All dictionary compressors send commands to the decoder to reproduce the uncompressed bytes. These are either a "match" to a previous substring of a specified length at an "offset" from the current output pointer in the uncompressed stream, or a "literal" for a raw byte that was not matched.
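
The match and literal commands can be sketched with a toy decoder. This is a generic illustration of dictionary decoding, not Kraken's actual bit stream format; the tuple-based command representation is an assumption made for readability.

```python
def decode(commands) -> bytes:
    """Replay literal and match commands to rebuild the uncompressed bytes.

    Commands are hypothetical tuples: ("lit", byte_value) or ("match", offset, length).
    """
    out = bytearray()
    for cmd in commands:
        if cmd[0] == "lit":
            out.append(cmd[1])
        else:
            _, offset, length = cmd
            start = len(out) - offset
            if offset <= 0 or start < 0:
                raise ValueError("match offset points outside the decoded data")
            # Copy byte by byte so a match may overlap the bytes it is producing.
            for i in range(length):
                out.append(out[start + i])
    return bytes(out)

commands = [
    ("lit", ord("a")), ("lit", ord("b")), ("lit", ord("c")),
    ("match", 3, 6),   # repeat "abc" twice by copying from 3 bytes back
    ("lit", ord("d")),
]
decoded = decode(commands)
print(decoded)
```

Five commands reproduce ten bytes here, which is the basic source of compression in any dictionary coder.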

Old fashioned compressors like Zip parsed the compressed bit stream serially, acting on each bit in different ways, which requires lots of branches in the decoder - does this bit tell you it's a match or a literal, how many bits of offset should I fetch, etc. This also creates an inherent data dependency, where decoding each token depends on the last, because you have to know where the previous token ends to find the next one. This means the CPU has to wait for each step of the decoder before it begins the next step. Kraken can pre-decode all the tokens it needs to form the output, then fetch them all at once and do one branchless select to form output bytes.

Kraken creates optimized streams for the decoder

One of the special things about Kraken is that the encoded bit stream format is modular. Different features of the encoder can be turned on and off, such as entropy coding modes for the different components, data transforms, and string match modes. Crucially the Kraken encoder can choose these modes without re-encoding the entire stream, so it can optimize the way the encoder works for each chunk of data it sees. Orthogonality of bit stream options is a game changer; it means we can try N boolean options in only O(N) time by measuring the benefit of each option independently. If you had to re-encode for each set of options (as in traditional monolithic compressors), it would take O(2^N) time to find the best settings.
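
The O(N) versus O(2^N) point can be illustrated with a toy model. The sketch assumes each option changes the encoded size by a fixed, independent amount, which is a deliberate simplification and not how the real encoder measures benefit; the savings values and function names are made up.

```python
from itertools import product

def encoded_size(options, base_size, savings):
    """Toy cost model: every enabled option changes the size independently."""
    return base_size - sum(s for on, s in zip(options, savings) if on)

def pick_independently(base_size, savings):
    """Try each option on its own: N trials."""
    n = len(savings)
    chosen, trials = [], 0
    for i in range(n):
        only_i = tuple(j == i for j in range(n))
        trials += 1
        chosen.append(encoded_size(only_i, base_size, savings) < base_size)
    return tuple(chosen), trials

def pick_exhaustively(base_size, savings):
    """Try every combination of options: 2^N trials."""
    best, best_size, trials = None, None, 0
    for options in product((False, True), repeat=len(savings)):
        trials += 1
        size = encoded_size(options, base_size, savings)
        if best_size is None or size < best_size:
            best, best_size = options, size
    return best, trials

BASE_SIZE = 1000
SAVINGS = [120, -30, 45, 10, -5]  # hypothetical bytes saved (or lost) per option

fast_choice, fast_trials = pick_independently(BASE_SIZE, SAVINGS)
slow_choice, slow_trials = pick_exhaustively(BASE_SIZE, SAVINGS)
print(fast_choice == slow_choice, fast_trials, slow_trials)
```

When the options really are orthogonal, both searches agree on the settings, but the independent search needs 5 trials where the exhaustive one needs 32, and the gap doubles with every added option.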

The various bit stream options do well on different types of data, and they have different performance trade offs in terms of decoder speed vs compression ratio. On the Sony PS5 we use this to make encoded bit streams that can be consumed at the peak SSD bandwidth so that the Kraken decoder is never the bottleneck. As long as the Kraken decoder is running faster than 5.5 GB/s input, we can turn on slower modes that get more compression. This lets us tune the stream to make maximum use of the time budget, to maximize the compression ratio under the constraint of always reading compressed bits from the SSD at full speed. Without this ability to tune the stream you would have very variable decode speed, so you would have to way over-provision the decoder to ensure it was never the bottleneck, and it would often be wasting computational capacity.

There are a huge number of possible compressed streams that will all decode to the same uncompressed bytes. We think of the Kraken decoder as a virtual machine that executes instructions to make output bytes, and the compressed streams are programs for that virtual machine. The Kraken encoder is then like an optimizing compiler that tries to find the best possible program to run on that virtual machine (the decoder). Previous compressors only tried to minimize the size of the compressed stream without considering how choices affect decode time. When we're encoding for a software decoder, the Kraken encoder targets a blend of decode time and size. When encoding for the PS5 hardware decoder, we look for the smallest stream that meets the speed requirement.
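
The two encoder targets described above can be sketched as two selection rules over candidate streams. Everything here is hypothetical: the candidate names, sizes, decode times, the weighting factor, and the time budget are invented for illustration, and the real encoder's cost model is not described in this post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    size_bytes: int
    decode_seconds: float

def pick_for_software(candidates, bytes_per_second_weight):
    """Minimize a blend of size and decode time (time converted to bytes by a weight)."""
    return min(
        candidates,
        key=lambda c: c.size_bytes + bytes_per_second_weight * c.decode_seconds,
    )

def pick_for_hardware(candidates, time_budget_seconds):
    """Pick the smallest stream among those that meet the speed requirement."""
    feasible = [c for c in candidates if c.decode_seconds <= time_budget_seconds]
    if not feasible:
        raise ValueError("no candidate stream meets the time budget")
    return min(feasible, key=lambda c: c.size_bytes)

# Hypothetical encodings of the same chunk of data.
candidates = [
    Candidate("fast", 700_000, 0.0010),
    Candidate("balanced", 620_000, 0.0016),
    Candidate("dense", 580_000, 0.0040),
]

software_pick = pick_for_software(candidates, bytes_per_second_weight=50e6)
hardware_pick = pick_for_hardware(candidates, time_budget_seconds=0.005)
print(software_pick.name, hardware_pick.name)
```

With these made-up numbers the blended cost prefers the "balanced" stream, while the budgeted rule takes the "dense" stream because it still fits within the time allowed.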

We designed Kraken to inherently have less variable performance than traditional dictionary compressors like Zip. All dictionary compressors work by copying matches to frequently occurring substrings; therefore they have a fast mode of decompression when they are getting lots of long string matches, where they can output many bytes per step of the decoder. Prior compressors like Zip fall into a much slower mode on hard to compress data with few matches, where only one byte at a time is being output per step, and another slow mode when they have to switch back and forth between literals and short matches. In Kraken we rearrange the decoder so that more work needs to be done to output long matches, since that's already a super fast path, and we make sure the worst case is faster. Data with short matches or no matches or frequent switches between the two can still be decoded in one step to output at least three bytes per step. This ensures that our performance is much more stable, which means the clock rate of the hardware Kraken decoder doesn't have to be as high to meet the minimum speed required.

Kraken plus Oodle Texture can double previous compression ratios

Kraken is a powerful generic compressor that can find good compression on data with repeated patterns or structure. Some types of data are scrambled in such a way that the compressibility is hard for Kraken to find unless that data is prepared in the right way to put it in a usable form. An important case of this for games is in GPU textures.

Oodle Kraken offers even bigger advantages for games when combined with Oodle Texture. Often the majority of game content is in BC1-BC7 textures. BC1-7 textures are a lossy format for GPUs that encodes 4x4 blocks of pixels into 8 or 16 byte blocks. Oodle Kraken is designed to model patterns at this kind of granularity, but with previous BC1-BC7 texture encoders there simply wasn't any pattern there to find; the textures were nearly incompressible with both Zip and Kraken. Oodle Texture creates BC1-7 textures in a way that has patterns in the data that Kraken can find to improve compression, but that are not visible to the human eye. Kraken can see that certain structures in the data repeat, such as the lengths of matches and offsets and the space between matches, and code them in fewer bits. This is done without expensive operations like context coding or arithmetic coding.
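
For a sense of scale, the block sizes mentioned above determine the size of a BC texture before any Kraken compression is applied. The sketch below uses the standard per-block sizes of the BC formats (8 bytes for BC1 and BC4, 16 bytes for the others) and covers a single mip level only; the function name and the 1024x1024 example are illustrative.

```python
# Bytes per 4x4 pixel block for each BC format.
BLOCK_BYTES = {
    "BC1": 8, "BC2": 16, "BC3": 16, "BC4": 8,
    "BC5": 16, "BC6H": 16, "BC7": 16,
}

def bc_texture_bytes(width: int, height: int, fmt: str) -> int:
    """Size of one mip level; dimensions are rounded up to whole 4x4 blocks."""
    blocks_x = (width + 3) // 4
    blocks_y = (height + 3) // 4
    return blocks_x * blocks_y * BLOCK_BYTES[fmt]

bc1_size = bc_texture_bytes(1024, 1024, "BC1")
bc7_size = bc_texture_bytes(1024, 1024, "BC7")
rgba8_size = 1024 * 1024 * 4  # uncompressed 8-bit RGBA for comparison

print(bc1_size, bc7_size, rgba8_size)
```

This fixed-rate size is what the GPU consumes directly; the ratios discussed in this post are the further lossless savings that Kraken gets on top of it when the data is stored on disk.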

It's been a real pleasure working with Sony on the hardware implementation of Kraken for PS5. It has long been our mission at RAD to develop the best possible compression for games, so we're happy to see publishers and platforms taking data loading and sizes seriously.

20 comments:

  1. Thank you for this informative rant.

  2. Is there any comparison between Kraken + Oodle Texture and BCPack that you’re aware of? For example, is BCPack CPU dependent or is it fundamentally the same as Kraken? etc...

    1. I don't think it's on the same level, as Kraken is built into the PS5. Remember when Cerny stated that its power was equivalent to 9 Zen 2 cores and that it has dual processors to control the I/O throughput? My guess is they did this to free the CPU/GPU from doing the decompression; it basically does it itself. Some people say it's better than BCPack and can allow the PS5 17.38GB bandwidth, which possibly is 3x faster than the Xbox Series X.
      Moore's Law Is Dead recently uploaded a video that talks about Oodle Kraken around the 1:34:00 mark, if you want to have a listen, but it does sound like this tech is efficient.

  3. I don't think there's any public comparison of Kraken and BCPack. If there is an unofficial one around, beware that it might not include the effects of Oodle Texture. Oodle Texture dramatically changes the way textures compress; we believe it should always be used with textures in games.

    We're big fans of the Xbox Series X. Their approach is slightly different, but we're glad to see they are taking compression seriously.

    Oodle Texture works great for the Xbox as well, and we are working with a lot of game companies that are using it on Xbox, so consumers should see lots of games with those huge size and speed savings on Xbox as well. It's up to the individual game developers, as it's not been licensed platform-wide at this time.

    Game developers are also using Oodle Texture and Oodle Kraken for PC games; most multi-platform devs will be using the same Oodle Texture encoding of their textures for all platforms, it's not platform-specific. On the PC you don't have hardware Kraken, so software Kraken is used on the CPU. To keep up with the fastest SSD speeds this requires several cores; luckily high end PC's also have lots of CPU cores!

    At RAD we've always just tried to make the best compression possible for games. We plan to continue to work with all platforms in the future.

  4. (I also work at RAD on Oodle.)

    The Kraken decoders are not "equivalent to 9 Zen 2 cores", that's quoting wildly out of context; by the same rationale a Deflate decoder that hits 5-6 GB/s output would be "equivalent to 12 Zen 2 cores" which is just as misleading. That ratio is just meaningless. They're dedicated fixed-function hardware that does one specific task (that happens to be suitable for HW implementation), certainly not equivalent substitutes for a general-purpose CPU core. If they were, that'd be missing the point.

    Both PS5 and Xbox Series X decided to go for HW decompression because they noticed that with a SSD, decompression goes from a side task for one CPU core to a full-time job for several, at which point it makes sense to design dedicated hardware. Once you decide to go there, you have considerable freedom in how you design decompression units, how they are clocked, how many there are, etc., and you configure all that to meet your targets.

    In the PS5 case, the goal was for the decompressors to never be the bottleneck in real workloads, so they're dialed in to be fast enough to keep up with the SSD at all times, with a decent safety margin. That's all there is to it.

    Along the same lines, 2 helper processors in an IO block that has both a full Flash controller and the decompression/memory mapping/etc. units is not by itself remarkable. Every SSD controller has one. That's what processes the SATA/NVMe commands, does the wear leveling, bad block remapping and so forth. The special part is not that these processors exist, but rather that they run custom firmware that implements a protocol and feature set quite different from what you would get in an off-the-shelf SSD.

  5. And what's the maximum window size supported by the hardware decoder? Is there a dedicated SRAM array for the data in the window?

  6. Is ps5 the only platform that has Hardware Decompression for kraken + oodle? Or does pc & xbox series x/s have it as well? If so can they reach the ps5 theoretical I/O throughput speeds?(I’ve seen multiple reports claiming it’s peak is 22GBs, some say even more than that), also, the only way to attain such I/O speeds, does that require having the same ssd speeds as ps5? Sorry if my question doesn’t make any sense, I am not a tech expert, just an enthusiast trying to learn as much as I can.

  7. PS5 is the only system with a hardware Kraken decoder, and the only platform with platform-wide license to Oodle Texture so that every game can use it. In theory PC SSD's will keep getting faster, but you would need several CPU cores running software Kraken to match the decompressed bandwidth of the PS5 hardware Kraken. Even then, a typical game on the PC won't be able to achieve that IO speed because of other bottlenecks; once you're going that fast lots of other things in the system software can become problems, you have to address it all through the software.

    1. I see. Thanks for answering my question & thank you as well Fabian for answering my question. So you mentioned ps5 being the only platform with licensing to be used on every game, so the only way to use Oodle texture would be through licensing?

  8. svpv: We can't answer questions about the details of the HW implementation because it's Sony's, not ours.

    Micky: the design was focused on maximizing minimum decode speed; that is, making sure that even for data that is pathologically slow to decode, the decoder keeps up with (or ideally outpaces) the peak SSD read speeds.

    I'm happy the peak decompression speed came out the way it did but we always knew it was going to be high and didn't worry about it much.

  9. This blog is nerd-elicious! Looking forward to more entry posts regarding PS5 and the development of its capabilities. Thank you guys so much for sharing all this!

  10. When will the first games that use all of this new technology be released?

  11. I can't comment on specific games, but the whole combination of Oodle Texture + Kraken + a well optimized loading pipeline probably won't be in a shipping game on PS5 for a while yet.

    There are games rolling out Oodle Texture now on PC, such as Warframe:

    https://forums.warframe.com/topic/1223735-the-great-ensmallening/


    I've seen some press trying to compare load times of cross-platform games on the various systems. While it is tempting to think that's a scientific way to compare something equal across systems, it doesn't give a very accurate picture of what's actually happening. Cross platform games rarely have the time to carefully design their IO system to be optimal on all platforms, especially early launch titles where the pressure to get anything shipped is difficult enough. Many games' load stacks are CPU bound, which means you aren't really seeing the IO subsystem performance at all, and if you drive the IO system in a generic cross-platform way it probably isn't at peak performance. It's like taking a bunch of cars to a test track but giving them all bald tires in the rain: the higher performance cars won't be able to do a better lap time, so you won't really see the difference in what they're capable of.

  12. Thanks for your responses. I really appreciate having dialogue with someone who actually knows what they're talking about and doesn't mind taking the time to teach those of us that are still learning.
    That's really interesting and makes a lot of sense. Did Sony anticipate this technology coming which is why they included a 22 GB/s decompressor despite announcing the 8-9 GB/s number? It seemed like overkill at the time to me but this makes me think they knew exactly what they were doing. Also wouldn't this mean that once it is implemented loading times will rarely be more than half a second on well-optimized games since the PS5 has 13 GBs of usable RAM to fill max?

  13. The decompressor acts as a speed multiplier depending on the compression ratio. The input speed to the decompressor is always the same, determined by the disk rate, but the output speed varies. Game content doesn't just have a single uniform compression ratio, it will be a mix of content, some of which compresses better than others, so some decompresses near the min speed, and some much faster. When we talk about the overall game speed or compression ratio, that's an average. Also the average is done on time, not speed (which is the inverse of time), so for example the average of 5 GB/s and 20 GB/s is 8.

    As for the load times - basically yes, load times will be ridiculously fast in games that are designed for it, in fact we should see games in the future that have zero visible load time at all, you just jump right in to huge levels and never experience any load time.
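
The averaging described in this reply can be checked with a short sketch. It assumes equal amounts of output data are decoded at each speed, which is what makes 5 GB/s and 20 GB/s average to 8 GB/s; the function name is illustrative.

```python
def average_throughput(speeds_gbps):
    """Average over time, not speed: mean of the time per GB, then inverted.

    Assumes the same amount of data is produced at each speed.
    """
    seconds_per_gb = [1.0 / s for s in speeds_gbps]
    return len(speeds_gbps) / sum(seconds_per_gb)

time_average = average_throughput([5.0, 20.0])
naive_average = (5.0 + 20.0) / 2  # averaging the speeds directly overstates the result

print(time_average, naive_average)
```

The slow portions dominate the total time, so the time-based average sits much closer to the minimum speed than the plain average of the two speeds would suggest.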

  14. Were you guys consulted by Sony regarding the design of the decompression hardware or was just internal decision-making? Do you consider the jump in hardware of the console and particularly the 16GB of RAM enough to bring photorealistic graphics a step closer considering games now have to render in dynamic 4K? I assume RAM usage will be now much more efficient and dedicated to what you're currently seeing on screen and not idle but I wonder if most of the new processing power won't go to the need to render in dynamic 4K.

    And if I may, could you comment on how does the PS5 I/O unit compare with the Series X?

    Thank you guys for the answers. It's really interesting.

  15. Im so glad I found this blog! This is a great discussion! This comment was interesting:

    "cbloom said...
    PS5 is the only system with a hardware Kraken decoder, and the only platform with platform-wide license to Oodle Texture so that every game can use it. In theory PC SSD's will keep getting faster, but you would need several CPU cores running software Kraken to match the decompressed bandwidth of the PS5 hardware Kraken. Even then, a typical game on the PC won't be able to achieve that IO speed because of other bottlenecks; once you're going that fast lots of other things in the system software can become problems, you have to address it all through the software."


    While true, the "lots of other things in the system software" amounts to the work Microsoft is doing in DirectStorage on XBSX, and in porting to PC DX12, and the work they've done on BCPack. RAD is awesome, Oodle is super impressive, and the approach Sony has taken here from both a systems architecture, and developer experience, standpoint is really elegant. But all that said, I really don't think we're going to see gigantic real world differences in experience between the PS5 and XBSX/PC (which I think is what many of the comments are fishing for)

    Even today, just optimizing old game engines to be aware of, and utilize, NVMe already makes a *noticeable* difference on PC, and that's before *any* updates to DX12 or any leveraging of more advanced decompression pipelines by "yet to be built" game engines.

  16. The HW was designed mainly by Sony and AMD. We did get looped in once the architecture was settled to assist with validation (making sure the design works right), tooling, and add some functionality in our SW for use by the PS5 SDK.

    We emphatically did not consider the rest of the console HW in any way because we didn't need to know and weren't briefed on it until last year, long after the design was final (the pipeline for mass-market HW is quite long). Nor did we know anything about there being a SSD, SSD throughputs or clock rates. All we knew was that the design had ambitious targets for the number of bytes decoded per cycle.

    We can't comment on HW specifics or make comparisons, that's all covered by NDAs.

  17. Hi Charles, congratulations! I'm very impressed with what you've accomplished as an engineer/programmer/applied mathematician. It's great to see people like you succeeding and being rewarded for your skills.

    Regarding the Kraken decomp chip, I thought Kraken was quickly surpassed and superseded by your team's subsequent codecs. (Selkie?) Am I remembering this wrong? My impression was that Kraken was a sort of old, first stab at your end goals, and that you blew it out of the water with follow-on development.

    Relatedly, I'm stumped by the results in your table, with Kraken + Oodle Texture at 3.16, vs 2.69 for Zip + Oodle Texture. I would've expected a bigger difference, even if Oodle Texture wasn't specifically optimized or tuned for Kraken. Or is it not optimized for Kraken specifically? I'd expect something like Zstd or brotli to beat ZIP by that margin, and Kraken to be somewhat better than them. Is Kraken more about decomp speed than compression ratio?

    These new consoles are a revelation in terms of I/O architecture and throughput. They seem to be much more powerful than PCs in ways I didn't expect. Do you think hardware decompression could be just as easily applied to PCs and smartphones for the same wins? I'm fascinated by hardware implementations like that, in part because I know so little about them – how exactly code is instantiated in hardware at a low level and so forth. Have you looked at a combined decompression and decryption pipeline? I guess decryption would have to happen first, followed by decompression, unless I'm missing something. I wonder if there are possible approaches for combining encryption and compression into a unified codec, or at least very complementary or synergistic codecs. Something like BitLocker or Opal combined with an Oodle codec would be super.

  18. "Regarding the Kraken decomp chip, I thought Kraken was quickly surpassed and superseded by your team's subsequent codecs. (Selkie?)"

    No, Kraken is still very much state-of-the-art. All the new CryptoOceanicZoology codecs offer different trade offs of speed vs compression vs complexity. Kraken is also very tuneable for different applications.

    "Or is it not optimized for Kraken specifically?"

    Oodle Texture is not heavily Kraken-specific. It works well with any back-end compressor.

    "Is Kraken more about decomp speed than compression ratio?"

    Kraken is very tuneable. It is about space+speed not one or the other. It usually beats ZStd at both.

    "Do you think hardware decompression could be just as easily applied to PCs and smartphones for the same wins?"

    That's probably coming in the future. I think in general in computing we'll see more custom chips for various tasks because they provide superior performance per watt vs generalized computing.

    Most apps are not written to take advantage of fast IO, so there's a lot of work to do on the software levels above the hardware to see the benefit (from device drivers to memory mapping, to OS file buffers, to the way apps request and read data (more async and bigger chunks please!)).

    "Have you looked at a combined decompression and decryption pipeline?"

    They're separate tasks, but it does make sense to put them on the same chiplet so that they can both act on the same local cached buffer rather than pulling memory through two different units. Maybe we'll see both decryption and decompression in a unified IO controller?
