11/18/2008

11-18-08 - DXTC

(for links to all posts in this series see DXTC Post Summary )

I've been doing a little work on DXT1 compression. The main reason I'm looking at it is because I want to do something *better* than DXT1, and before I do that, I want to make sure I'm doing the DXTC right.

For now let's gather prior art :

The only real decision the coder gets to make is what the two 565 endpoints are. The other details you just have to get right - reconstruct the palette interpolants right, and assign the 16 indices to the closest palette entries right. For the moment I don't care about speed, I just want to make sure the indices are actually the best choices.

All the papers by Ignacio and J.M.P. van Waveren are modern classics. One thing of note : the "simplified" index finder that JMP has in his original paper is wrong. On page 12 he says "Evaluating the above expression reveals that the sub expression ( !b3 & b4 ) can be omitted because it does not significantly contribute to the final result" - apparently that is not true, because when I use the "rewritten" index finder it produces bad indices. His original version of the index finder bit twiddle is fine.

One thing I've found with the index finding is that the integer and nonlinear effects are pretty strong, and any ideas based on planar geometry fail if you assume your 4 points are collinear (they actually are not). For example, an obvious idea is to do dot product tests to pick between the 4 points : first dot product against the middle separating plane, then against the two planes between the adjacent pairs of points. This doesn't work. Part of the problem is the bad rounding in the generation of the 1/3 point. The JMP method actually compares against the true distances, so that part is good.


In more detail, an obvious idea is to do something like this :

// project the color onto the c0->c1 axis
int dot = Dot( color - c0, c1 - c0 );

// mid01, mid02, mid31 are precomputed dot products of the midpoints
// between adjacent palette entries along that axis
int bmid = (dot > mid01);
int b02  = (dot > mid02);
int b31  = (dot > mid31);

// using the DXT1 palette order label 0,2,3,1

int threebit = (bmid<<2) + (b02<<1) + b31;

// remap is an 8-entry table that remaps from 3 bits to 2 bits to give an index
int index = remap[threebit];

That "works" just as much as the "ryg" code or the "Extreme DXT" method of finding indices works - which is to say that it doesn't work.

FastDXT appears to be an implementation of the id "Real Time DXT" paper.

Ericsson Texture Compression (ETC2) is similar to DXTC but takes a different approach; it's a pretty good paper, with some interesting ideas like the T and H blocks. It gets slightly better quality in some cases. It's obvious that you can beat DXTC by having a base color and a log-scaled delta color rather than two endpoints. The two 565 endpoints are wasteful; you could for example do a 777 base color and a 444 log-scaled delta.

Tight Frame Normal Map Compression is similar to DXT5 (really to 3Dc) with some small improvement on normal maps.

Extreme DXT Compression by Peter Uliciansky is indeed super fast. However, it has some errors. The division method that he uses to assign the indices is not correct. You can improve it by offsetting the endpoints correctly to put the division quantization boundaries in the right place, but even then it's still not exactly correct unless the diagonal of the color bbox is along {1,1,1}. The error from this is pretty small, and he's going for max speed, so what he does is semi-reasonable. Note that using the full range of the bbox for the division but insetting it to make the DXT1 colors (as he does) improves the indices slightly, but it's not actually the correct range scale.

In more detail :

To find indices for the DXTC colors c0 and c1

You should use

base = c0 + (c0 - c1)/6

range = (c1 - c0) * 4/3

index = (color - base) * 4 / range;

or

index = (color - base) * 3 / (c1 - c0);

not :

index = (color - c0) * 4 / (c1 - c0);

But even if you do that it's still wrong because the planes through color space do not actually separate the colors from their closest palette entry.
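To make the corrected division concrete, here's a sketch in C operating on a scalar projection t along the c0->c1 axis (DivisionIndex is my name, not from any real codebase; it returns indices in linear palette order, which you'd then remap to the DXT1 order 0,2,3,1) :

```c
// Map a projected color value t (with c0 at t0 and c1 at t1, t1 > t0)
// to one of the 4 palette slots by division. The offset base and
// 3/(t1-t0) scale put the quantization boundaries midway between
// palette entries, per the formulas above.
static int DivisionIndex(int t, int t0, int t1)
{
    int base = t0 + (t0 - t1) / 6;        // = t0 - (t1 - t0)/6
    int i = ((t - base) * 3) / (t1 - t0); // = (t - base) * 4 / range
    if (i < 0) i = 0;
    if (i > 3) i = 3;
    return i; // linear order : c0, 1/3 point, 2/3 point, c1
}
```

For example with t0 = 0, t1 = 60, the palette sits at {0,20,40,60} and the division boundaries land at 10, 30, 50 - midway between entries, which the naive (color - c0) * 4 / (c1 - c0) version gets wrong.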

MSDN compressed format documentation now has the right rounding for the 1/3 points, but the old DirectX8 docs have the wrong rounding. The old docs said


color_2 = (2*color_0 + 1*color_1 + 1)/3
color_3 = (1*color_0 + 2*color_1 + 1)/3

That rounding is actually better, but it is not what the standard or the hardware does.
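Per channel (on the 8-bit expanded values) the two roundings look like this; the truncating no-bias version is my reading of what the current docs and hardware do, the +1 version is the old-docs formula quoted above :

```c
// 1/3-point interpolation (a*2 + b)/3 , two roundings :
static int Interp13_Spec(int a, int b)    { return (2*a + b) / 3; }     // truncating ; current docs
static int Interp13_OldDocs(int a, int b) { return (2*a + b + 1) / 3; } // old DirectX8 rounding
```

The difference is only ever one code value per channel, but if your encoder's index finder assumes one rounding and the hardware does the other, you pick wrong indices near the boundaries.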

Simon Brown's Squish is very good. His Cluster Fit idea is elegant and clever and produces quite good quality.

There's a thread on MollyRocket about DXTC with some convenient code from "ryg". It's nice simple code, but it has some errors. The way he finds the indices is the dot product method, which is wrong. Also the way he computes the 1/3 and 2/3 colors for the palette using LerpRGB is wrong - it rounds differently than the hardware really does.

One thing that some simple code gets wrong is the way to quantize from 24 bit to 565. You can't just take the top bits, because the 565 values are not reconstructed to the middle of their bins. You need to correct for the offsetting. You can either use the true range, which is something like -4 to 259, or you can skew the bits in the quantizer. The "ryg" code has a perfect quantizer that's very neat in the "As16Bit" function. An example is that the value 250 should map to 30 in 5 bits, not 31.
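A brute-force reference for one 5-bit channel makes the point; the "ryg" As16Bit function gets the same answer with clever arithmetic, and the (q<<3)|(q>>2) bit-replication expansion is the usual convention (an assumption on my part, not quoted from his code) :

```c
#include <stdlib.h>

// Quantize an 8-bit value to 5 bits by picking the 5-bit code whose
// reconstruction is nearest - correct by construction, unlike just
// taking the top 5 bits.
static int Quant5(int v)
{
    int best = 0, bestd = 256;
    for (int q = 0; q < 32; q++) {
        int recon = (q << 3) | (q >> 2); // 5->8 bit replication
        int d = abs(v - recon);
        if (d < bestd) { bestd = d; best = q; }
    }
    return best;
}
```

As in the example above : 250 reconstructs closer to 247 (code 30) than to 255 (code 31), so the right answer is 30 even though the top 5 bits of 250 are 31.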

BTW the YCoCg-DXT5 method is pretty good on current hardware, but I'm not really looking at it. It's obviously inherently wasteful, it's just a way of making use of the current hardware, but it's not really what you'd want to be doing. It's also very large, at around 2.67 bits per byte of 24-bit RGB input (8 bits per pixel).

More advanced topics to come.

11/11/2008

11-11-08 - REYES

Gritz' course on Renderman : To Infinity and Beyond is the best reference I've found on REYES. I'm still curious about the details of how to efficiently do the part from micropolygons to pixel colors.

The brute force way takes a micropolygon and puts a bound on it that covers its entire temporal sweep across the current frame. That bound is then binned into all the pixels it touches. In each pixel, 16 (or so) supersampling locations are checked with precomputed jittered sample locations and times. At each spot where a sample is found, the color & Z are added to a list for later blending. Note that with fast moving objects and depth of field, this can get really inefficient, because micropolygons have to get tested in many pixels.

The Pixar Library is a nice collection of papers. There's a new paper I hadn't seen on point-based GI that's based on the Bunnell disk method for realtime GI but enhanced for better quality. It seems extremely simple and the results look good. It's not super accurate in terms of solving the rendering equation with minimal error, but that's not really what we want anyway. Though actually, now that I think about it, it's just a variant on the ancient "render the world from each surfel's point of view" method for GI and doesn't really add much.

Anyway, back to REYES. Ignoring temporal sampling, the spatial sampling stuff can all be done with a normal rasterizer, I'm pretty sure. The micropolygon subdivision and binning is just your fragment rasterizer (a rasterizer is just a device that creates micropolygons that are at most pixel size on a grid that's aligned with the screen grid). When you decide a fragment goes in a pel, you don't just stuff the color in the frame buffer; instead you stick it in a list to be tested against the stochastic supersampled grid, just like REYES.

But it occurs to me that if you're building up a list of fragments in each pel like this, you may as well do analytic coverage rather than stochastic sampling. When you rasterize your triangles, compute the area of each pel that's covered. This can be done efficiently and incrementally, just like a Wu antialiased line drawer, for the case of only one triangle edge intersecting a pel; if 2 or 3 edges intersect one pel it's more work.

Add the fragment to the pel list, consisting of {color, alpha, Z, area}. Once you have all these fragments, sort them back-to-front and accumulate into the framebuffer. Fragments that have the same Z are first gathered together before being alpha-blended onto the current value in the frame buffer. That means two triangles that share an edge act like a continuous surface : they correctly give you 100% area coverage and correctly do not alpha blend with each other. This gives you exact anti-aliasing with alpha blending.
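Here's a rough sketch of that gather-and-blend resolve in C. Frag and ResolvePel are my names, and exact Z equality stands in for "same surface" (real code would use an epsilon or a surface id) :

```c
// One pel's fragment : color, alpha, depth, and analytic coverage area.
typedef struct { float r, g, b, alpha, z, area; } Frag;

// Resolve one pel's fragment list, assumed already sorted back-to-front
// (decreasing z). Fragments with equal z (eg. triangles sharing an edge)
// are gathered into one group first, so their areas sum to full coverage
// instead of alpha-blending against each other.
static void ResolvePel(const Frag *frags, int n, float out[3])
{
    int i = 0;
    while (i < n) {
        float area = 0, cr = 0, cg = 0, cb = 0, ca = 0;
        int j = i;
        while (j < n && frags[j].z == frags[i].z) {
            area += frags[j].area;
            cr += frags[j].r * frags[j].area;
            cg += frags[j].g * frags[j].area;
            cb += frags[j].b * frags[j].area;
            ca += frags[j].alpha * frags[j].area;
            j++;
        }
        // area-weighted average color & alpha for the group
        cr /= area; cg /= area; cb /= area; ca /= area;

        float f = area * ca; // effective blend factor : coverage * alpha
        out[0] = out[0] * (1 - f) + cr * f;
        out[1] = out[1] * (1 - f) + cg * f;
        out[2] = out[2] * (1 - f) + cb * f;
        i = j;
    }
}
```

Two edge-sharing fragments at the same Z with areas 0.5 + 0.5 blend as one fully-covering surface, rather than leaving a seam where they alpha-blend against each other.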

11-11-08 - Mapped Drives

Is there a way to make "Disconnected Network Drives" try to connect from DOS or from code? I'm talking about when you "net use" to make a network path to a drive letter, but you reboot or something so it's disconnected. It does a sort of lazy connect, where if you go click on it in Explorer it will hook right up, but if you go try to dir it from a CLI before you explore to it, you can't get to it.

I'd like to make my CLI file copy utils do something like "if target is a disconnected network drive, wake it up first".


The answer seems to lie somewhere in the functions RealDriveType, IsNetDrive, and WNetRestoreConnection.

These are functions that the DOJ forced MS to expose and the documentation is total balls. This page has better info than MSDN :

http://vbnet.mvps.org/index.html?code/network/isnetdrive.htm

I'm seeing some weird things with them though. For one thing, IsNetDrive is not a light query. It actually stalls and tries to wake up the drive right there, so most of the time all you have to do is call IsNetDrive and that actually will restore the connection. IsNetDrive can stall out for a very long time.

IsNetDrive also lies, as that page I linked details.

I haven't gotten WNetRestoreConnection to do anything useful. It always tells me "The local device name is already in use".

WNetRestoreConnection also pops up an interactive dialog box when it has errors which is fucking evil.


So ... in every case I've tried so far IsNetDrive actually refreshes the connection. I haven't yet found a need to actually call WNetRestoreConnection, which is good because it doesn't seem to work.

11/03/2008

11-03-08 - Curves

I finally read Ignacio's talk on displaced subdivision with the NV geometry shader pipeline. You can find the whole talk with audio and such online but I like just the PDF . There's a lot of good stuff in there. A lot of the issues are general issues for low-high poly work, it's just that most people so far have been just ignoring the problems and creating meshes with bad discontinuities.

Ignacio's stuff is based on the Charles Loop paper "Approximating Catmull-Clark Subdivision Surfaces with Bicubic Patches". It's funny that it's really just a fancy version of the PN Triangles hack; the actual geometry surface made is not continuous, but there's a separate tangent field over it which makes the shading look smooth.

While there I noticed a bunch of Loop papers I hadn't read. The Loop-Blinn thing from 05 was particularly interesting because I've talked to a few people in the last N years about rendering fonts better. Thatcher worked on it a bit at Oddworld, and recently Sean's been doing it at RAD for our new GUI product. The Loop Blinn Paper described how to use a pixel shader to shade quadratic or cubic curves. The cubic stuff seems like rather a pain in the butt, but piecewise quadratic is certainly good enough for glyphs. Also they sort of gloss over the details of anti-aliasing, which is of course crucial; there's not much point in being this careful with curves unless you're going to get the antialiasing right.

The MDK Curvy Blues blog covers the same material and claims there are some mistakes in the Loop-Blinn paper. There's also a nice GLSL conics paper and demo app of the Loop-Blinn technique.

Zheng Qin, Michael McCool and Craig Kaplan seem to be doing a lot of work on this stuff at Waterloo. I sort of assume you're familiar with the simple Valve Alpha-test Distance Field thing ; the Qin work is the basis of the Valve technique and is more robust and general, but much more complicated. The Valve technique is probably okay in game development scenarios where artists are monitoring how their work looks in game and are used to tweaking with weird algorithms, but it definitely has parameters that you have to tweak, and the use of a single distance (or even two) can cause bad degeneracies if you're not careful, whenever you have details that are close to the size of a pixel in the distance map.

10/10/2008

10-10-08 : On the Art of Good Arithmetic Coder Use


I think it's somewhat misleading to have these arithmetic coder libraries. People think they can take them and just "do arithmetic coding". The reality is that the coder is only a small part of the arithmetic coding process, and it's the easiest part. The other parts are subtle and somewhat of an art.

The other crucial parts are how you model your symbol, how you present symbols to the coder, how you break up the alphabet, and the different types of adaptation.

If you have a large alphabet, like 256 symbols or more, you don't want to just code with that alphabet directly. It has a few problems. For one, decoding is slow, because you need to do two divides, and then you have to binary search to find your symbol from a cumulative probability. Aside from the speed issue, you're not compressing super well. The problem is you have to assign probabilities to a whole mess of symbols. To get good information on all those symbols, you need to see a lot of coded characters. It's sort of like the DSP sampling problem - to get information on a big range of frequencies in audio you need a very large sampling window, which makes your temporal coherence really poor. In compression, many statistics change very rapidly, so you want quick adaptation, but if you try to do that on a large alphabet you won't be gathering good information on all the symbols. Peter Fenwick has some good old papers on multi-speed adaptation by decomposition for blocksorted values.

In compression most of our alphabets are highly skewed, or at least you can decompose them into a highly skewed portion and a flat portion. Highly skewed alphabets can be coded very fast using the right approaches. First of all, you generally want to sort them so that the most probable symbol is 0, the next most probable is 1, etc. (you might not actually run a sort, you may just do this conceptually so that you can address them in order from MPS to LPS). For example, blocksort output or wavelet coefficients are already sorted in this order. Now you can do cumulative probability searches and updates much faster by always starting at symbol 0 and summing from there, so that you rarely walk very far down the list.
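A sketch of that MPS-first linear search (my names; this is the decode-side symbol find for a small sorted alphabet, the kind of thing a cumulative-probability model does after the arithmetic decoder hands it a target) :

```c
// Decode-side symbol find for a small alphabet sorted MPS-first :
// walk cumulative frequencies from symbol 0 until the interval
// containing 'target' is found. With a skewed, sorted alphabet this
// usually exits after a step or two.
// Returns the symbol; *low gets the cumulative frequency below it.
static int FindSymbol(const int *freq, int numsym, int target, int *low)
{
    int s = 0, lo = 0;
    while (s < numsym - 1 && lo + freq[s] <= target) { lo += freq[s]; s++; }
    *low = lo;
    return s;
}
```

The update side gets the same benefit : bumping the MPS's frequency touches only the front of the list.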

You can also decompose your alphabet into symbols that more accurately represent its "nature" which will give you better adaptation. The natural alphabet is the one that mimics the actual source that generated your code stream. Of course that's impossible to know, but you can make good guesses. Consider an example :


The source generates symbols thusly :

It chooses Source1 or Source2 with probability P
Source1 codes {a,b,c,d} with fixed probabilities P1(a),P1(b),...
Source2 codes {e,f,g,h} with fixed probabilities P2(e),P2(f),...

The probability P of sources 1 and 2 changes over time with a semi-random walk
After each coding event, it either does P += 0.1 * (1 - P) or P -= 0.1 * P


If we tried to just code this with an alphabet {a,b,c,d,e,f,g,h} and track the adaptation
we would have a hell of a time and not do very well.

Instead if we decompose the alphabet and code a binary symbol {abcd,efgh} and then code
each half, we can easily do very well.  The coding of the sub-alphabets {a,b,c,d} and
{e,f,g,h} can adapt very slowly and gather a lot of statistics to learn the probabilities
P1 and P2 very well.  The coding of the binary symbol can adapt quickly and learn the
current state of the random decision probability P.

This may seem rather contrived, but in fact lots of real world sources look exactly like this. For example if you're trying to compress HTML, it switches between basically looking like English text and looking like markup code. Each of those is a separate set of symbols and probabilities. The probabilities of the characters within each set are roughly constant (not really, but they're relatively constant compared to the abruptness of the switch), but where a switch is made is random and hard to predict, so the probability of being in one section or another needs to be learned very quickly and adapt very quickly.

We can see how different rates of adaptation can greatly improve compression.

Good decomposition also improves coding speed. The main way we get this is by judicious use of binary coding. Binary arithmetic coders are much faster - especially in the decoder. A binary arithmetic decoder can do the decode, modeling, and symbol find all in about 30 clocks and without any division. Compare that to a multisymbol decoder which is around 70 clocks just for the decode (two divides), and that doesn't even include the modeling and symbol finding, which is like 200+ clocks.

Now, on general alphabets you could decompose your multisymbol alphabet into a series of binary arithmetic codes. The best possible way to do this is with a Huffman tree! The Huffman tree tries to make each binary decision as close to 50/50 as possible. It gives you the minimum total code length in binary symbols if you just wrote the Huffman codes, which means it gives you the minimum number of coding operations if you use it to do binary arithmetic coding. That is, you're making a binary tree of coding choices for your alphabet but you're skewing your tree so that you get to the more probable symbols with fewer branches down the tree.

(BTW using the Huffman tree like this is good for other applications. Say for example you're trying to make the best possible binary search tree. Many people just use balanced trees, but that's only optimal if all the entries have equal probability, which is rarely the case. With non-equal probabilities, the best possible binary search tree is the Huffman tree! Many people also use self-balancing binary trees with some sort of cheesy heuristic like moving recent nodes near the head. In fact the best way to do self-balancing binary trees with non equal probabilities is just an adaptive huffman tree, which has logN updates just like all the balanced trees and has the added bonus of actually being the right thing to do; BTW to really get that right you need some information about the statistics of queries; eg. are they from a constant-probability source, or is it a local source with very fast adaptation?).

Anyhoo, in practice you don't really ever want to do this Huffman thing. You sorted your alphabet and you usually know a lot about it so you can choose a good way to code it. You're trying to decompose your alphabet into roughly equal probability coding operations, not because of compression, but because that gives you the minimum number of coding operations.

A very common case is a log2 alphabet. You have symbols from 0 to N; 0 is most probable. The probabilities are close to geometric, like {P, P^2, P^3, ...}. A good way to code this is to write the log2 and then the remainder. The log2 symbol goes from 0 to log2(N) and contains most of the good information about your symbol. The nice thing is the log2 is a very small alphabet - like if N is 256 the log2 only goes up to 9 - so coding it is fast and you can adapt very quickly. The remainder for small log2's is also small and tracks quickly. The remainder at the end is a big alphabet, but that case is super rare so we don't care about it.
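A minimal sketch of the log2-plus-remainder decomposition (helper names are mine; the log2 symbol would go to a fast-adapting model and the l remainder bits to a slow or static one) :

```c
// Split v >= 0 into a small "log2" symbol plus raw remainder bits.
// Assumes v fits comfortably in an unsigned (v+1 must not overflow).
static void Log2Decompose(unsigned v, int *log2sym, unsigned *rem)
{
    int l = 0;
    while ((2u << l) <= v + 1) l++; // l = floor(log2(v+1))
    *log2sym = l;
    *rem = (v + 1) - (1u << l);     // l bits of remainder
}

static unsigned Log2Recompose(int l, unsigned rem)
{
    return (1u << l) + rem - 1;
}
```

So v in [0] codes log2 0 with no remainder bits, v in [1,2] codes log2 1 plus 1 bit, v in [3,6] codes log2 2 plus 2 bits, and so on.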

Most people now code LZ77 offsets and lengths using some kind of semi-log2 code. It's also a decent way to code wavelet or FFT amplitudes. As an example, for LZ77 match lengths you might do a semi-log2 code with a binary arithmetic coder. The {3,4,5,6} is super important and has most of the probability. So first code a binary symbol that's {3456} vs. {7+}. Now if it's 3456 send two more binary codes. If it's {7+} do a log2 code.

Another common case is the case that the 0 and 1 are super super probable and everything else is sort of irrelevant. This is common for example in wavelets or DCT images at high compression levels where 90% of the values have been quantized down to 0 or 1. You can do custom things like code run lengths of 0's, or code binary decisions first for {01},{2+} , but actually a decent way to generally handle any highly skewed alphabet is a unary code. A unary code is the huffman code for a geometric distribution in the case of P = 50% , that is {1/2,1/4,1/8,1/16,...} ; we code our symbols with a series of binary arithmetic codings of the unary representation. Note that this does not imply that we are assuming anything about the actual probability distribution matching the unary distribution - the arithmetic coder will adapt and match whatever distribution - it's just that we are optimal in terms of the minimum number of coding operations only when the probability distribution is equal to the unary distribution.

In practice I use four arithmetic coder models :

1. A binary model, I usually use my rung/ladder but you can use the fixed-at-pow2 fractional modeller too.

2. A small alphabet model for 0-20 symbols with skewed distribution. This sorts symbols from MPS to LPS and does linear searches and probability accumulates. It's good for order-N adaptive context modeling, N > 0

3. A Fenwick Tree for large alphabets with adaptive statistics. The Fenwick Tree is a binary tree for cumulative probabilities with logN updates. This is what I use for adaptive order-0 modeling, but really I try to avoid it as much as possible, because as I've said here, large alphabet adaptive modeling just sucks.

4. A Deferred Summation semi-adaptive order-0 model. This is good for the semi-static parts of a decomposed alphabet, such as the remainder portion of a log2 decomposition.

Something I haven't mentioned that's also very valuable is direct modeling of probability distributions. eg. if you know your probabilities are Laplacian, you should just model the Laplacian distribution directly; don't try to model each symbol's probability independently. The easiest way to do this usually is to track the average value, and then use a formula to turn the average into probabilities. In some cases this can also make for very good decoding, because you can make a formula to go from a decoded cumulative probability directly to a symbol.

ADDENDUM BTW : Ascii characters actually decompose really nicely. The top 3 bits is a "selector" and the bottom 5 bits is a "selection". The probabilities of the bottom 5 bits need a lot of accuracy and change slowly, the probabilities of the top 3 bits change very quickly based on what part of the file you're in. You can beat 8-bit order0 by doing a separated 3-bit then 5-bit. Of course this is how ascii was intentionally designed :


    0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F  0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0  NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
1   SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /  0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
2   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O  P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
3   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o  p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~ DEL

10-10-08 : On LZ Optimal Parsing


So the LZ-Huffman I've done for RAD/Oodle does some new stuff with optimal parsing. I want to write up what I'm doing so I remember it, and I'm also going to try to write about the semi-optimal parsing that a lot of people are doing now.

First let's review a bit. In LZ coding you have a lot of choices about how you code things. Greedily choosing the longest match at each point does not produce the best code stream. There is a very good ancient paper by Storer and Szymanski that introduced optimal parsing, and their method is both fast and 100% optimal, but it is only "optimal" in the sense of coding the fewest number of literals possible. See my old notes on Storer-Szymanski optimal parsing . Now, for the old LZSS coder that used fixed bit coding (8 bit literals, 24 bit matches, or whatever), that optimality is also the same as code length optimality. But in reality what we care about is code length optimality - we want to produce the shortest possible file - which is not necessarily the same as coding the fewest possible literals.

There are sort of two interesting domains of modern LZ coders, and optimal parsing matters a whole lot to both of them. One domain is the super fast, super simple LZ coder like my LZH. Optimal parsing is important to LZH because the matches are so important to compression, you really want to get the best matches you can. The other domain is something like LZMA or BALZ (an ROLZ) that does a lot of context coding of literals. In those cases, the context coder for literals is so good that you really want to only code good matches, and often the literals will be so cheap you want to send them as literals; also the cost of literals and matches is highly variable for those coders.

Optimal parsing is a path search in a graph. For game developers you can think about A* as I talk about this. You start yourself at the head of the file, and your goal is to walk to the end of the file in the shortest possible path. From each node in the graph, your exit paths are the possible coding operations at that point in the file, eg. write a literal, write match X of length Y. Each coding operation has a certain cost which is equal to the number of bits output if you do that coding operation, and then you step along that link to the next point in the file as specified by the match length.

Note that with the original LZSS we only needed to consider either matching or not matching at each position in the file, since any time you wrote a match the best thing to do was to write the maximum match length. With true code-optimal parsing, you have to consider all possible match offsets and all possible match lengths at each location (!!). This is a huge number of coding choices, but we can rule out almost all of them. (more on this later).

So, the brute force straightforward optimal parser should be obvious at this point. Start with an empty code stream and push position 0 on the stack. Repeat forever : pop the stack. Try all coding operations at the position specified in the stack. Push all of those coding operations on the stack with the code stream they made, and the position of the end of the match. This works fine and produces a 100% optimal parse, but is insanely slow. If you have N coding choices at each position it makes N^L operations for a length L file.

With adaptive coding (eg. you're using an adaptive entropy coder for literals, matches, and offsets), you cannot get a 100% optimal parse unless you do this full search, because the problem does not have the optimal substructure property. That is, if you have the optimal path from A to C, and that path goes through B, that path is not necessarily the same as taking the optimal path from A to B followed by the optimal path from B to C. To speed things up we are going to start making approximations, and we are going to generally assume a looser version of the optimal substructure property. What this means in terms of LZ coding is that if you are at a certain position in the file, the best way to code the whole file does not necessarily include the best way to code from {begin} to the current position.

Let's revisit the original LZSS parse now. LZSS does no entropy coding, so it doesn't matter how you get to a given byte of the stream, you will code the remainder in the same way. This is obviously a "Dynamic Programming" problem. You have shared sub-portions of the problem that can be collapsed by storing their result in a table. The optimal path from point P to the end is the same regardless of how you get to P. You could now proceed by doing the brute force approach and search depth-first, and each time you find a path, you store all the sub-paths that you visited, and then you reuse them instead of recomputing them. Or you can do it a more elegant way which is identical, which is to walk backwards through the file and at each position P store the optimal path from the P to the end. When you find the optimal path at P obviously all the future subgraphs that you might need are already done because you're going backwards. So, we now think of the LZSS optimal parse as a way of solving the graph search using dynamic programming.

The case of a static entropy coder is simpler so let me focus on that first. Offsets and lengths and literals are written in some # of bits that is known in advance and we wish to make the shortest possible code stream. We can do a sort of LZSS backwards parse to do this quickly. It's not as simple as LZSS because we cannot assume that the best match choice at each step is the longest one.

Optimal Parser for Static Statistics :

We're going to walk backwards like LZSS. At each position already considered (future positions in the file), we have already found the optimal choice at that spot. We have various coding choices at the current position, either a literal or various {offset,length} possibilities. The total cost of each of those choices is : {cost to code the current choice} + {table cost at current pos + match length} , that is, the already-computed cost to finish the walk after you step ahead by the match length.

It should be obvious that we can solve this just by considering every coding operation possible at each position and walk backwards. That gives us a total cost of O(N*L) , which is a lot better than O(N^L) , but it's still slow because N is very large. So let's start making approximations.
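A sketch of that backward walk with static costs. All names and bit costs here are illustrative, and only one match candidate per position is considered for brevity; real code considers several, which is exactly the N that the approximations below cut down :

```c
#define LITERAL_COST 9 /* illustrative flat literal cost in bits */

// One match candidate per position : length and its code cost in bits
// (len == 0 means no match available there).
typedef struct { int len; int costbits; } Choice;

// Backward walk : cost[i] = cheapest way to code from position i to the end.
// cost must have room for L+1 entries. Returns the total cost from the start.
static int OptimalParse(const Choice *match, int L, int *cost)
{
    cost[L] = 0;
    for (int i = L - 1; i >= 0; i--) {
        int best = LITERAL_COST + cost[i + 1]; // code a literal at i
        if (match[i].len > 0 && i + match[i].len <= L) {
            int c = match[i].costbits + cost[i + match[i].len];
            if (c < best) best = c; // or take the match
        }
        cost[i] = best;
    }
    return cost[0];
}
```

By the time position i is visited, every cost[i + len] it can step to has already been filled in, which is the whole point of walking backwards.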

What are all these coding choices that we have? We found various offsets that we matched at least min match length (usually 3). This gives us a bunch of possible offsets and the maximum match length at each offset : {Off1,Len1} , {Off2,Len2}. In general we also have to consider all lengths >= 3 at each offset.

The first approximation is the assumption that for any given match length, the lowest code cost comes from the smallest possible offset that matches with that length. That is, we assume that if Off1 < Off2 , then cost(Off1) <= cost(Off2). This is certainly true for all non-pathological files in the real world. This kind of assumption is also self-satisfying, that is by making this assumption, the optimal parser will prefer shorter offsets, which makes them more likely, which makes their code length shorter. It's a positive feedback loop; I'll talk more about feedback later.

With this approximation, we no longer need to consider all offsets and all match lengths. Now we only need to consider all match lengths, and the offset for any length is chosen for us - it's the lowest offset that codes that length. Our code choices now look like :


you find a match of length 7 at Off1, length 11 at Off2, length 22 at Off3
with Off1 < Off2 < Off3

choose from :

Length 3-7 at Off1
Length 8-11 at Off2
Length 12-22 at Off3

Okay, so we could do this, but it's still too slow. We need to reduce our choices further.

One option is to only consider the longest possible length at each offset. That is, assume that if you code a length at a given offset, the best choice is the longest length at that offset. So in our example you would only choose from {7 at Off1, 11 at Off2, 22 at Off3}. That is sort of okay, but it actually gives up a lot of compression. It misses too many good choices. Apparently there are some good coding paths that involve taking a shorter than maximal match.

To get to the really cool optimization, we need to stop and think about this intuitively. Why would you not write a maximal match? It should almost always give us a shorter total code len, because writing a longer match is very cheap and lets us not code symbols after the match. The cost of a given match is


Cost(offset) + Cost(match length) + CostTable( at pos + match length )

Let's assume for now that the cost of the offset and the match length are separable (they aren't used to predict each other). The cost of each match length is an entropy code that generally codes shorter lengths in fewer bits. It's sub-linear in match length, but it is generally increasing. However, it increases less than the average cost of future symbols. Note that for long match lengths, the cost of len and (len+1) may be exactly equal depending on how you code your lengths. For example if you use a semi-log2 code, then the cost of many large match lengths is identical. That means coding longer matches is free in terms of the current code cost.

The CostTable generally increases going backwards, linearly in proportion to the average entropy. That is :


CostTable[ {end} ] = 0
CostTable[ end - 1 ] = cost to code from (end-1) to end
CostTable[ end - N ] = N * H

on average ; where H is the average entropy of one symbol

However, even though CostTable is linearly increasing backwards on average, it is not always. Consider for example what CostTable looks like in the neighborhood of a very good very long match. Say at some position P this long match is available with length L. At position (P+1) the best coding choice will be that same match with length (L-1), same at (P+2) etc. The cost table in this area will look like :

CostTable[ P-1 ] = Cost to code literal at (P-1) + Cost(L) + CostTable[ P + L ];
CostTable[ P   ] = Cost(L  ) + CostTable[ P + L ];
CostTable[ P+1 ] = Cost(L-1) + CostTable[ P + L ];
CostTable[ P+2 ] = Cost(L-2) + CostTable[ P + L ];
CostTable[ P+3 ] = Cost(L-3) + CostTable[ P + L ];

If L is large, like say 100 or more, the Cost(L) and the Cost(L-1) will be very very similar. In this region, CostTable[] is increasing as you step backwards, but it's only increasing very slowly, far more slowly than the average slope of H. We see that CostTable goes backwards like N*H in a line, but it deviates up and down across the line. Some areas you code worse than average entropy, some areas you code much better than average entropy. The areas where you code better than average entropy are candidates for choosing a shorter than maximal match length!

What are we really doing as we step back through the file and choose the best match length? We have this constant table Cost(L) that tells us the cost of coding each match length. This table is increasing, it's smallest at Cost(3) and largest at Cost(infinity). We find the maximum match length at a given offset and the minimum at a given offset (the minimum is the smallest length that can't be coded from a lower offset). We take the Cost(L) table in that range and add it onto CostTable() in that range, then pick the lowest value after the sum. Generally CostTable increases backwards faster than Cost(L) increases forwards, but not always! As we saw in the example above with a long match, CostTable can increase backwards slower than Cost(L) increases forwards.
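
Concretely, the per-offset step is just this (a sketch with hypothetical array names; `len_cost[]` is the length coder's cost table, `cost_table[]` is the backward cost table):

```c
/* Return the length in [lo,hi] minimizing len_cost[len] + cost_table[pos+len].
   Usually cost_table rises backwards faster than len_cost rises forwards,
   so hi wins - but not in the cheap region around a very long match. */
int best_len_in_range(int pos, int lo, int hi,
                      const int len_cost[], const int cost_table[])
{
    int best_len = lo;
    int best = len_cost[lo] + cost_table[pos + lo];
    for (int len = lo + 1; len <= hi; len++)
    {
        int c = len_cost[len] + cost_table[pos + len];
        if (c < best) { best = c; best_len = len; }
    }
    return best_len;
}
```
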

So, how do we find these spots quickly? Well, when we are looking at coding at spot P, we have already visited all the CostTable spots > P. What we'd like to know is : is CostTable ahead of us increasing sublinearly? And if so, where is it MOST sublinear? That is, where is CostTable[P] farthest below (end - P)*H ? That will tell us the most likely spot to code a less than maximal match.

We could build a binary search tree for these spots as we go backwards, which would be logN and all, but I used a simpler solution. At each spot P, I store the best sublinear spot preceding that spot (preceding in file order, actually occurring later in the backwards walk). That is :


I chose a coding at pos P and updated CostTable[P]

BestSpotPreceding[P] = P;
int step = 1;
while( CostTable[P] + step * TinyCost < CostTable[P + step] )
{
 BestSpotPreceding[P + step] = P;
 step++;
}

Normally when CostTable is increasing a lot as you step back, this does not get triggered at all. That is, CostTable[P+1] is usually about H smaller than CostTable[P]. But in weird cases, the cost of coding this symbol was almost free, so BestSpotPreceding points back to us. BestSpotPreceding gets updated in waves that terminate at cheap code spots. Normally you keep stepping back, then you hit a really cheap code spot and a wave percolates forward, filling out BestSpotPreceding, then you step some more normally, then you hit another cheap code spot and it waves through again. Sort of like carries in an arithmetic coder. Anyway, this is technically still O(N*L), but it's a very cheap operation and it's way way less than N*L in practice.

So, how do we use this? Instead of considering only the maximum length at each offset ( {7 at Off1, 11 at Off2, 22 at Off3} in our previous example ), we also consider matches at BestSpotPreceding[] in that offset. That is :

Look in 3-7 at Off1
Consider a 7 at Off1
Also consider BestSpotPreceding[Pos + 7] at Off1 (if it's >= 3 )

Look in 8-11 at Off2
Consider an 11 at Off2
Also consider BestSpotPreceding[Pos + 11] at Off2 (if it's >= 8)

Look in 12-22 at Off3
Consider a 22 at Off3
Also consider BestSpotPreceding[Pos + 22] at Off3 (if it's >= 12)

And those are all the matches we have to consider. I did this and found it hurt less than 0.001 bpb compared to the full optimal parse that considers all possible coding operations at each position.

ADDENDUM #1 : I should mention there's another cool way to get this optimization if you don't want to do the BestSpotPreceding thing. We again want to ramp up our intuition for the problem. The cases where we need to make a non-trivial decision about match lengths are cases when we are writing two matches in a row, both greater than minimum length, and they overlap. That means there's a big range of ways we can code the same block. For example :

 

At pos P we find a match of length 7
At pos P+4 we find a match of length 10

The area from pos P to [P+14] can be coded with two matches, the ways are :

{4,10}
{5,9}
{6,8}
{7,7}

The area of overlap of the matches all codes to the same offset, it's the same offset that was found at P+4, but if you match past that at P, then you just match less of it. The amount of overlap is the number of different coding choices we have.

We're coding the same two offsets and skipping the same number of symbols, so the only difference in code cost is the cost to code the match lengths. The code costs of the choices are :


Cost(len 4) + Cost(len 10)
Cost(len 5) + Cost(len 9)
Cost(len 6) + Cost(len 8)
Cost(len 7) + Cost(len 7)

But we don't even need to compute these. The cost of coding a match len is almost always increasing in len, and increasing more slowly for larger lens, that is :

(len1 > len2) implies codecost( len1 ) >= codecost( len2 )

(len1 > len2) implies ( codecost( len1 +1 ) - codecost(len1) ) <= ( codecost( len2 + 1 ) - codecost( len2 ) )

That is, code costs go up for larger lens, but they go up more slowly. It's like a sqrt(x) function - sharp at the beginning and then leveling out. That means the lowest total code cost is the one that's most skewed. You want to extend the longer match as much as possible because that's very cheap, and you want to shorten the shorter match as much as possible because that saves you a lot. Intuitively, the difference between coding a len of 3 and 4 is very big, but the difference between 511 and 512 is tiny tiny.

In our example that means the best choice is :

Cost(len 4) + Cost(len 10)

The simple heuristic to do this in general is :


Consider coding option at pos P
Find max match len possible at each offset
Consider coding with max match len

len = max match len
while ( matchoffset[ P + len - 1] == matchoffset[ P + len ] )
 len--;
if ( len != max match len )
 consider coding at len

Note the similarity with the BestSpotPreceding heuristic. We try coding with our longest match len. Then we also try coding with the *shortest* match len that still reaches the same match after we jump forward. We're trying the two most skew possibilities, our match as long as possible and our match as short as possible, constrained to be getting the same match following. (matchoffset is already computed because we're going backwards).

ADDENDUM #2 : More early outs. LowestCostPreceding. @@ todo : write this.

There are some more minor issues we need to consider.

The first one is that if we are doing a Huffman (or static arithmetic) coder to entropy code the length, offset & literal, then we can't actually know the code length in advance. And the way we choose to do the coding will affect the statistics which will affect the code lengths, which affects how we choose to do the coding, ad infinitum. It's a feedback loop.

As usual in these kind of optimization searches, we make the semi-static assumption. We make a good guess for the code lengths, find the optimal parse for those code lengths, use that parse to guess new code lengths, optimal parse again, etc. Hopefully this smoothly converges. Well, it doesn't. The problem is that the coding alphabet is actually pretty sparse. On medium size files there are not that many literals and there are plenty of offsets and match lengths you might not code at all in one particular encoding choice. For example, in one pass you might not code any matches of length {31} at all. Now when you do the next pass you need something reasonable in there. If you use the actual code lengths, you can strangely skew how you do the coding, and it can actually ping pong a lot in the search and not settle down.

To improve this, I suggest smoothing the counts in the early passes. That is, the chance of a length 31 match should be somewhere between the chance of a length 30 and length 32 match, regardless of the coincidence that we happened not to code with it in the last pass. To do this, I fit a Laplacian probability model to lengths and offsets, and in the early passes I blend that model in to smooth out and regularize the distribution. As I do more passes I decrease the weight of that blended simple model and use the raw counts more.
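
A sketch of that regularization (the geometric decay here stands in for the fitted Laplacian model, and the `model_weight` / `decay` parameters are illustrative, not the article's actual fit):

```c
/* Blend raw symbol counts with a simple geometric model so that lengths
   and offsets we happened not to code still get sensible, smoothly
   varying counts. model_weight is decreased on later passes. */
void smooth_counts(double counts[], int n, double model_weight, double decay)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) total += counts[i];

    double p = 1.0 - decay;   /* geometric model : p_i = (1-decay)*decay^i */
    for (int i = 0; i < n; i++)
    {
        counts[i] = (1.0 - model_weight) * counts[i] + model_weight * total * p;
        p *= decay;
    }
}
```
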

In practice, you can avoid doing multiple passes of the actual optimal parser without hurting compression too much. You can do a first pass with just a normal non optimal parser and build statistics from that. Smooth those statistics with a regularizing simple model. Then do the optimal parse with those statistics, and you're done. Doing more passes will improve the coding decisions, though.

The other issue is that it's still a bit slow. Most of the slowness comes from the areas where very long matches are possible. If there's a long match of length L at pos P, we also consider (L-1) at (P+1), and (L-2) at (P+2), etc., and for each of those we generally find many closer offsets to also consider, roughly O(L) of them, which makes us O(L^2) over that region. On super degenerate files like "pic" in the Calgary Corpus, this makes us very slow. However, it's easy to fix. In this area with the long match, it really doesn't matter much what we do, because all of them are good choices. It's a general property that areas of super high compression matter much less to the whole file. You want to maximize your wins on the hard to compress spots, because in terms of output bytes they are more important. So, in these big long match areas, we just cut out the whole area inside the long match, and we always code the long match at the beginning. (actually I allow coding at P and P+1 up to P+K for a small K ; K is like 4 and L is like 100).

Optimal Parser for Adaptive Coders :

Now, with real modern coders you aren't doing static statistics, they're adapting as you go. You could just make a guess at the overall statistics and use the semi-static optimal parser that I just described, and then code the stream with your adaptive coders. That's better than not optimal parsing at all, but it's giving up a lot. Adaptive coders get a big win by changing their statistics as they scan through the file, and an optimal parser can make much better decisions if it uses the actual local statistics. For example, in one part of the file, literals might code very well, and the optimal parser should favor literals more, in another part of the file there might be lots of very short match lengths, and the optimal parser should code for that, etc.

One way to solve this is to use a different kind of semi-static approximation and optimal parse forward in blocks. For each block, you pretend your coders are static and use some fixed code lengths. Now, run the semi-static optimal parser that I described above to find the optimal coding for the next 1000 bytes or so. Now step up through that coding and output codes with your adaptive entropy coder. Only output 900 bytes or so, don't go all the way to the end of the block so that you avoid edge effects. Now use the current statistics to find the optimal parse for the next 1000 bytes. This in fact works very well, and I believe was first used by RKive many years ago. He called it "dynamic programming" which is sort of right in the sense that all backward optimal parses are a type of dynamic programming.

There's another form of adaptive optimal parse that a lot of people do now using forward walking. It's called "flexible parsing", and I believe a form of this is used in LZMA (add: no, that's not LZMA).

The simplest form of "flexible parse" is just the old "lazy evaluation" of Zip. Lazy matching walks through the file and considers either writing {match} or {literal + match at next pos}. It's a slightly non-greedy parser. Flexible parsing is basically the same thing, but we consider a few different choices.

To do this, we run through the whole file first and find the best match at each position. (you don't actually have to do this in practice, you could just look ahead some amount N from the current position to avoid multiple passes). In any case, the match search at each pos is only done once, and we have the matches ready far enough in the future to do the flexible parse.

So, we consider a few coding choices at the current location. We choose the one that gives the lowest code cost to the end of the file. To make good decisions, it needs to also consider the coding after you take the current operation. That is, it's like a recursive function :


CodeCostRecursive( P )
{
 consider n in N choices
  cost of choice n = Cost(code n) + CodeCostRecursive( P + len(n) )
 return best of N
}

now of course if we actually did this full recursion we would be doing the whole O(N^L) tree like we didn't want to do. But we don't need to do that because we're only using this to make the decision at the current byte. That is, we can use the "short horizon" assumption - coding choices far in the future don't matter that much to our current coding choice. It's like only considering a few moves ahead in a Chess program (which is a bad thing to do there, but a good thing here). In fact, you only need to look 2 moves ahead here to make good coding choices.

Also, you don't need to consider very many N choices. N = 4 or 8 is plenty. Certainly consider coding a literal and coding the longest possible match, and then heuristically pick some other likely good choices in between (such as the longest matches available with some shorter offsets, or the longest matches possible from previous-match-references if you have those).

How do we terminate the recursion? After 2 steps (considering one op and then one more op after that) we just need an estimate of the cost of coding the rest of the file. That's easy, it's just :


CodeCostEstimate( P ) = ( {end} - P ) * H

That is, just assume the remainder is coded at the entropy. The entropy you use here should be a conservative overestimate of the entropy; for example just use the entropy of what you've done so far and add a fudge factor. This ensures that you actually prefer writing matches and don't do something strange if you start the file in a very low entropy region.
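
Putting the pieces together, the two-ply lookahead might look like this (a sketch with hypothetical structures; real code would pull `choices` from the match finder at each position rather than a precomputed table):

```c
#define NUM_CHOICES 2

typedef struct { int len; double cost; } Op;   /* len == 0 : no op here */

/* Recurse "depth" coding ops ahead, then charge the remainder of the
   file at H bits per symbol - the conservative entropy estimate.
   choices[p] holds the candidate codings at position p. */
double flex_cost(int p, int end, int depth,
                 const Op choices[][NUM_CHOICES], double H)
{
    if (p >= end) return 0.0;
    if (depth == 0) return (end - p) * H;   /* CodeCostEstimate */

    double best = 1e30;
    for (int i = 0; i < NUM_CHOICES; i++)
    {
        const Op * op = &choices[p][i];
        if (op->len <= 0) continue;
        double c = op->cost + flex_cost(p + op->len, end, depth - 1, choices, H);
        if (c < best) best = c;
    }
    return best;
}
```

With depth = 2 and N small this is just N*N table lookups per decision, as described above.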

This flexible parse sounds kind of expensive, but in fact with N = 4, it's only 16 table lookups to compute all the code lengths, then we pick the best one. This of course is not a true optimal parse, but it's pretty damn good, and the fact that it's using the true current adaptive statistics makes up for the fact that it's not doing the best possible parse.

ADDENDUM : I should note that this kind of forward parse is also better at making good decisions with PMR's (Previous Match References). PMR's are cheap to code offsets, so they should always be considered in the optimal parse when they are possible matches. Going backwards you can't make good decisions about trying to use PMR's because you haven't done the earlier parts of the file yet.

10/08/2008

10-08-08 : Arithmetic coders throw away accuracy in lots of little places.



Arithmetic coders throw away accuracy in lots of little places. A full precision range update would be :


low += (symlow * range) / symtot;
range = (symrange * range) / symtot;

Instead we do :

U32 r = range / symtot;
low += symlow * r;
range = symrange * r;

This has changed the parentheses of the multiplication, which lets us do the math in 32 bits and also saves a divide. It's inaccurate - it wastes some range - so we check for being at the end of the range and put the wasted space on there :

if ( (symlow + symrange) == symtot )
{
 U32 t = symlow * (range / symtot);
 low += t;
 range -= t;
}
else
{
 U32 r = range / symtot;
 low += symlow * r;
 range = symrange * r;
}

We can see the waste by comparing the range update for the top symbol in the two cases :

case1 = range - symlow * (range / symtot);
case2 = symrange * (range / symtot);

waste = case1 - case2;
waste = range - symlow * (range / symtot) - symrange * (range / symtot);
waste = range - (symtot - symrange) * (range / symtot) - symrange * (range / symtot);
waste = range - symtot * (range / symtot);
waste = range % symtot;
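
A quick numeric check of that algebra (assuming symlow + symrange == symtot, i.e. the top symbol):

```c
#include <stdint.h>

/* Difference between the waste-on-top update and the plain truncated
   update for the top symbol; per the derivation it equals range % symtot. */
uint32_t coder_waste(uint32_t range, uint32_t symlow, uint32_t symtot)
{
    uint32_t symrange = symtot - symlow;                 /* top symbol */
    uint32_t case1 = range - symlow * (range / symtot);  /* waste-on-top */
    uint32_t case2 = symrange * (range / symtot);        /* truncated */
    return case1 - case2;
}
```
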

Putting the waste on the last symbol is not great. To make the most of it, you should really reorganize your alphabet so that the most probable symbol is the last one. Symtot is usually around (1<<14) which means the waste is on average (1<<13). Range is between (1<<24) and (1<<32) so the fractional waste is between 2^-11 and 2^-19. This appears to come out to about a waste of 1 byte every 4096 on text with the alphabet rearranged to put the space (0x20) at the end.

I should note that many many people have arithcoders that do this check to put the wasted range on the last symbol, but they *don't* do the work to make sure the last symbol is the MPS (the most probable symbol). If you don't do that, it's rather pointless. With english text for example you might be putting the extra range on the symbol "255" which basically never happens, so you are doing work for nothing.

Sean asked a good question that had me a little confused yesterday. Why do some of the renormalization updates shift in 1's to "high" and some don't? You might see :


while ( range < MinRange )
{
 *outptr++ = low >> 24;
 low <<= 8;
 range <<= 8;
}


or :


while ( range < MinRange )
{
 *outptr++ = low >> 24;
 low <<= 8;
 range <<= 8;
 range += 255;
}

I thought maybe there was some subtle math reason why you would choose not to put in the 255. Nope. It's just another approximation + optimization. Shifting in the 1 bits to range is the more accurate correct way to do it, but at that point range is already in 2^24 to 2^32, so adding on 255 doesn't amount to much. In practice on my real file tests it doesn't even save 1 byte. I guess it should save something like 1 every 2^20. So, most people just leave it out to save the 1 instruction.

Here are some numbers :


r:\bigtest.bin : inLen : 25651384

True Entropy : 18090226.9 : 5.642

// these are all using cum prob max = 1<<14 and doing a divide for range

radArith cycles: 64.9
radArith : encLen : 18090853 = 5.642 bpb

// radArith is a Michael Schindler 31 bit range encoder with check to put waste range on top sym

Arithmetic_Codec cycles: 58.2
Arithmetic_Codec : encLen : 18091236 = 5.642 bpb

// Arithmetic_Codec is FastAC as modified by me with no check for waste on top sym

rrArithCoder cycles: 57.3
rrArithCoder : encLen : 18091236 = 5.642 bpb

// rrArithCoder is my reimplemented 32 bit range coder similar to FastAC ; yay I beat them by 1 clock !

rrArithBinaryModel cycles: 35.1
rrArithBinaryModel : encLen : 18844782 = 5.877 bpb

// rrArithBinaryModel is a bit-by-bit adaptive model + coder ; that's cycles *per bit* , it's 280 cycles/byte

rrArithCoderRawBits cycles: 24.1
rrArithCoderRawBits : encLen : 25651385 = 8.000 bpb

// rrArithCoderRawBits just puts the bytes out through the arithmetic coder
// this basically just measures the speed of the renorm loop

rrHuffman cycles: 14.5
rrHuffman : encLen : 18179876 = 5.670 bpb

// Huffman encoding is still 4X faster and doesn't cost you a whole lot in compression efficiency

10/07/2008

10-07-08 : A little more on arithmetic coding ...



Frédéric Kayser sent me the link to this FastAC page with a bunch of papers and source code by Amir Said. It's pretty good stuff, I thought I'd give a synopsis of what's in there :

The coder is almost a vanilla Michael Schindler style "range coder". There are a few little tweaks in there that differ from the standard implementation.

1. Instead of a "virtual queue" for the bytes that might carry, he goes ahead and just writes them out, and then if a carry is detected he goes back through the output bytes and increments them while they're 0xFF. This violates one of the rules of standard arithmetic coder design, which is that you never touch output bytes after you stream them out, so that they can be transmitted or written to disk or whatever. Now of course if you're just outputting to a big buffer in memory his way is totally fine, and it appears to be faster than bothering with all the variables to account for the virtual queue.

2. He actually uses the full 32 bits for range instead of 31 bits like in the standard Schindler. We use 31 bits so that we can see the carry in the top bit. Said has a neat little trick for detecting the carry :


U32 old_low = range_low;

range = range_high - range_low;
range_high = range_low + symhigh * (range / symtot);
range_low += symlow * (range / symtot);

if ( range_low < old_low ) // carry

(This maybe could be even faster if there were an elegant way to directly get the carry flag in C). Using the full 32 bits is nice because it makes you byte aligned and eliminates some little ugly bit junk you have to do with shifting bytes 7 and 1 to get that last bit right. This lets you avoid a LOT of shift and mask work which is a big win.
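On GCC and Clang there actually is a reasonably elegant way now: `__builtin_add_overflow` compiles straight down to an add-and-check-carry. This is a compiler extension, not portable C, and of course not part of Said's code:

```c
#include <stdint.h>

/* Returns 1 if the 32-bit add wrapped, i.e. the hardware carry was set. */
static int add_carries(uint32_t a, uint32_t b, uint32_t * sum)
{
    return __builtin_add_overflow(a, b, sum);
}
```
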

Now, as for the models, his "Adaptive_Data_Model" is 100% exactly my Deferred Summation model from DCC96. You can read about the original DefSum in my papers section or get the ancient code in the statistical coders section. From what I can tell there's nothing new or interesting in the modeling, it's 100% just deferred summation to get you a power of two CumProbMax so you can turn your division into a shift, he uses my fast decode table, etc.

Anyway, the implementation is pretty good and the source code is pretty clean, so it's a decent place to start. However, I don't recommend actually using it as is. The models are terrible and the API interface is very bad.

The reason why "Deferred Summation" never amounted to much is that arithmetic coding a big hunk of symbols with order-0 modeling is kind of pointless. You just never actually want to do that. If you're doing arithmetic coding, you're also doing higher order modeling or LZ coding, or something, so that you don't just have static statistics. Providing API's that are locked to these silly models is lame. The correct API interfaces for an arithmetic coder are the ones I have in "arithc" , just

Encode( symlow, symfreq, symtot );
EncodeTotIsPow2( symlow, symfreq, symtotshift );

The binary models are also silly. He does deferred summation for the bit model as well, which is not necessary. You can either do the implicitly rescaling fixed-tot bit model that I just wrote about recently, or you can use a rung-ladder model like I have in crblib which uses a table to find power of 2 probabilities from a state.
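For reference, a minimal shift-update bit model with a power-of-two total looks something like this (a common construction sketched from memory, not crblib's actual code; the rate of 5 and 12-bit total are illustrative choices):

```c
#define BITMODEL_TOT_SHIFT 12
#define BITMODEL_TOT (1 << BITMODEL_TOT_SHIFT)

typedef struct { int p0; } BitModel;   /* P(bit==0) out of BITMODEL_TOT */

void bitmodel_init(BitModel * m) { m->p0 = BITMODEL_TOT / 2; }

/* Geometric decay toward the observed bit; the total implicitly stays
   fixed at BITMODEL_TOT, so the coder shifts instead of dividing. */
void bitmodel_update(BitModel * m, int bit)
{
    if (bit) m->p0 -= m->p0 >> 5;
    else     m->p0 += (BITMODEL_TOT - m->p0) >> 5;
}
```
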

Said's codec is in fact very fast. It's almost 10% faster than any arithmetic coder I've ever seen. That's pretty damn good. I'm going to modify it a bit to provide what I think are reasonable API's and post it.

Okay here it is : cb_Arithmetic_Codec.zip ; a tiny .h with a modified version of FastAC. This is pretty rough, I tried to modify the original FastAC code as little as possible.

10-07-08 : Random file stuff I've learned

Random file stuff I've learned :

Good article on Windows file caching . The Windows file cache architecture is pretty sweet really. Files are cached in 256k blocks. The caching is also file "buffering". That is, file IO in the application doesn't get any buffer just for that app - you get access to the cache pages, that is your buffer. It means multiple apps on the same file share buffers. This is all pretty sweet. The caching is just a file mapping of pages in the system address space. When you create a memory mapped file, you are just getting direct access to those pages, which you can then lock to get pointers in your address space. You can lock & unlock the pages just like any pages using VirtualAlloc and VirtualFree. When you write to pages, you can control them getting flushed out the write cache by unlocking them.

Windows file caching is *extremely* greedy. It will basically throw away anything it has cached in order to cache the latest request. This makes it look very good in benchmarks, but can be pretty bad in practice. For example, I was testing by doing linear streaming IO across a 2 GB file. Windows caches every single byte of the 2 GB file so that future touches to it are instant; in order to do that, it will evict the baby and the bathwater from the cache (including stuff like the cache of the NTFS MFT table).

One thing I haven't been able to figure out is a way to manually control the prefetching that windows does. When you do a bunch of ReadFile calls in a row, the cache will asynchronously prefetch ahead the next 256k chunk. This means that even if you just call the synchronous ReadFile (or fread) to get bits of the file, the return is almost immediate. What I would like, however, is a way to manually say "prefetch at this byte". The only way I can do that at the moment is to issue a read at that byte on a thread, but that's not really ideal, it would be better to be able to signal the underlying cache directly.

Also it appears that if you use memory mapped files, windows stops doing the smart async prefetch ahead. Say you memory map a huge file, then you run through streaming touching pages in order. Windows uses the virtual page fault from the memory manager to interrupt and pull in pages from disk. It seems to not prefetch because you get a lot of faults and your app just stalls when you get to pages that aren't in memory yet. Memory mapped files are pretty awesome, it's cool that you just get access to the cache, but without the ability to request prefetches they're pretty worthless since the plain old ReadFile beats them so badly. (if you are in fact doing totally random file accesses on a huge file, then they are just about perfect).

Stdio in MSVC has a 4k buffer and then calls straight to ReadFile to fill it. The 4k buffer isn't really providing you any "buffering" it just hides function call overhead and makes the reads page aligned; the real buffering is inside ReadFile.

If you're linking with the multithreaded libs, stdio becomes outrageously slow. Personally I like my threading model to work such that individual objects, such as a "FILE", are not protected from concurrent use by the library. That is, once the API hands out an atomic object to the app, it is up to the app if it wants to protect that object from multiple threads accessing it or not. The library should only do protection from access to shared/global data structures. (I know they can't do this with stdio because they want to make sure it's safe, and they want to protect stdout/stdin/stderr). Anyway, the result is that "getc" goes from being around 10 clocks to being around 170 clocks. Fortunately, you can fix that pretty easily by just bringing back the macro yourself. Just define this :

#define macrogetc(_stream)     (--(_stream)->_cnt >= 0 ? ((U8)*(_stream)->_ptr++) : _filbuf(_stream))

And use it to do your IO when you know the FILE is only being used by one thread.

Something I just discovered today : while Windows buffering is in fact very good for reads, it sucks for writes. If I write out a stream of data through a buffered file, I only get around 25 MB/sec. If I write out chunks with async non-buffered calls, I get 100 MB/sec. (conversely, buffered streaming input is actually faster than doing a bunch of async non-buffered reads).

10/06/2008

10-06-08 : Followup on the "Russian Range Coder"



Followup on the "Russian Range Coder". So, first of all let me stress once again that this thing is pointless and you should not use it. It's an "optimization" that gives up some compression and doesn't actually make you faster at all. You should never do approximations or optimizations unless they actually help, even if they look like they should be faster. Best to stick with full quality everywhere lest it come back to bite you.

But I was thinking about it and I realized you can improve the renorm loop. The original loop in the case of underflow just truncates the range down to the bottom part to make the top bits always equal to the top bits of "low". But that can be really bad in some cases, since you might be straddling a boundary, but most of your range could be above the straddle, not below. It's easy enough to just check whether you have more range above or below the straddle :


    while(rc->range < (1<<16))
    {
  
  int above = (rc->low + rc->range) & 0xFFFF;
  int below = (~rc->low) & 0xFFFF;
  //int above = rc->range - 1 - below;
  
  /*
  U32 hi = rc->low + rc->range;
  U32 hi_below = rc->low + below;
  U32 lo_above = hi & 0xFFFF0000;
  */
  
  if ( above > below )
  {
   rc->low += rc->range - above;
   rc->range = above; 
  }  
  else
  {
   rc->range = below;
  }
  
  *rc->ptr++ = (U8) (rc->low >> (RANGE_SIZE - 8));
  rc->range <<= 8;
  rc->low <<= 8;
    }

That is, we either push the top bits of low up to the top bits of hi, or push the top bits of hi down to the top bits of low, depending on which gives us more code space. Again, this doesn't hurt speed at all because it's rare. In fact, that's why you should just use a proper virtual-queue range coder that handles underflow right.

Just to make it super clear what's happening, here is an actual example :


    rc->low + rc->range            0x4d0043e4
    rc->low                        0x4cffa580
    above                          0x000043e4
    below                          0x00005a7f
    rc->low + below                0x4cffffff
    rc->low + rc->range - above    0x4d000000

Some numbers :


m:\ccc\book1 : inLen : 768,771

True Entropy : 435,042.6 : 4.527 bpb

radArith : encLen : 435,116 = 4.528 bpb
 bytes exceeding entropy : 73

original RRC : encLen : 435,284 = 4.530 bpb
 bytes exceeding entropy : 241

cbloom RRC : encLen : 435,216 = 4.529 bpb
 bytes exceeding entropy : 173

radArith cycles: 69.6
original rc cycles: 74.2
cbloom rc cycles: 70.3

Here "radArith" is a normal Michael Schindler range coder with virtual queue. My modified "russian range coder" has about half the waste of the original (compared to the canonical range coder, not compared to entropy). Note that the arithmetic coders here are all using a max cumulative frequency of 16384, while the "true entropy" uses the exact character counts in the file.

BTW these are the links to the authors of the "Russian Range Coder" - they seem to all be dead. If anyone has live links to these people let me know and I'll hook them up.

 

/*
  Arithmetic coder based on "carryless rangecoder" by D. Subbotin.
  * http://www.softcomplete.com/algo/pack/rus-range.txt
    (Original C++ code by Subbotin)
  * http://hem.spray.se/mikael.lundqvist/
    (C implementation by M. Lundqvist)
  * H. Okumura: C magazine, Vol.14, No.7, pp.13-35, Jul. 2002
    (Tutorial of data compression with sample codes, in japanese).
*/
