12/03/2014

12-03-14 - PVQ Inspired Rambling - Smooth Blocks

So there's a question of what exactly is a "smooth" block.

I'm not entirely sure. On the one hand you could just say that you only have "detail" blocks, and a detail block where all the AC energy is zero is a "smooth" block.

But there are some things about "smooth" that I think deserve a more global consideration in the coder as a special mode. I will try to ramble about them a bit :

1. Smooth blocks need more DC resolution. With no detail in the block the eye is much more sensitive to the color being wrong. In fact DC quantizer should go up with AC energy in general. In typical coders the DC's are all sent separately from the AC, before them, so that order should be changed. Either signal block type before sending the DC's, or send some information about the total AC energy.

1a. They particularly need more DC resolution when they have a similar DC to their neighbors. This can be done by sending the delta-DC with a non-uniform quantizer, which is a hack I wrote about previously.

2. Even with lapped transforms or standard deblocking filters, a zero-AC block is lumpy. They use filter shapes that make the block flat in the middle and sloped at the edges, which gives them a C1-smooth stair-step look.

3. Connected regions of "smooth" should be handled specially. They should get maximum-likelihood gradient restoration, constrained to stay within the quantization range.

Say you have a DC quantizer of 8, and you have a row of smooth blocks with DC's of [ 0 0 0 8 8 8 ] . Current methods will give you a run of all 0 pixels, then a smooth ramp up to 8, and then a run of all 8's. Obviously what should really be there is a long smooth gradient.

4. A "smooth" next to an "edge" block can imply that the smooth gradient region goes up to the edge found within that edge block and no further.

5. If you get to such low bit rates that you take a detail block and send zero AC's just because you can't afford them, you don't want that block to be considered "smooth" and put into the big-smooth-gradient pathway. Though really this needs to be something your RDO fights hard against (don't let detail blocks go to zero AC's even if it seems like the right RD decision because it actually isn't under a better D metric)

12-03-14 - PVQ Inspired Rambling - RD Ideals

Lossy coders with RD optimization issues.

You make some lossy coder which has various modes and decisions and quantization options. These affect both R and D on a micro scale. You want to run an automated RDO that can in theory try everything.

In practice this creates problems. The RDO winds up doing funny things. It is technically finding the best D for the desired R, but that winds up looking bad.

One issue is that to make RDO efficient we use a simplified D which is not the whole perceptual error.

Often we're measuring D locally (eg. on one block at a time) but there are non-local effects. Either far apart in the frame, or over time.

For example, RDO will often take away too many bits from part of the frame to put them on another, which creates an ugly variation of quality. RDO can create alternating smooth and detailed blocks when they should be uniform. In video, temporal variation can be even worse, with blocks flickering as RDO makes big changes. Even though it was doing the right thing in terms of D on each individual block, it winds up looking terrible.

Some principles that prevent this from happening :

1. There should be a lot of fine steps in R (and D). There should be coding opportunities where you can add a bit or half a bit and get some more quality. Big steps are bad, they cause blocks (either with spatial or temporal shifts) to jump a lot in R & D.

A common problem case is in video residuals, the jumps from "zero movec, zero residual" to "some movec, zero residual" to "some movec, some residual" can easily be very discontinuous if you're not careful (big steps in R) which leads to nasty chunkiness under RDO.

2. Big steps are also bad for local optimization. Assuming you aren't doing a full-search RDO, you want to make a nicely searchable coding space that has smooth and small steps as you vary coding choices.

3. The R(D) (or D(R)) curve should be as smooth as possible, and of course monotonic. Globally, the RDO that you do should result in that. But it should also be true locally (eg. per block) as much as possible.

4. Similar D's should have similar R's. (and vice versa, but it's harder to intuit the other way around). If there are two quite different ways of making a similar distortion number (but different looking) - those should have a similar R. If not, then your coder is visually biased.

eg. if horizontal noise is cheaper to code than vertical noise (at the same D), the RD optimization will kill one but not the other, and the image will visually change. It will appear smoothed in one direction.

Of course this has to be balanced against entropy - if the types of distortion are not equally probable, they must have different bit rates. But it should not be any more than necessary, which is a common flaw. Often rare distortion gets a codeword that is too long, but people don't care much because it's rare, the result being that it just gets killed by RDO.

Part of the problem here is that most coders (such as DCT and z-scan) are based on an implicit model of the image which is built on a global average ideal image. By the time you are coding a given block, you often have enough information to know that this particular image does not match the ideal, but we don't compensate correctly.


You can think about it this way - you start at 0 bits. You gradually increase the rate and give more bits out. Under RDO, you give each bit out to the place where it gives you the largest decrease in D. At each step as you do this, there should be lots of places where you can get that D win. It should be available all over the frame with just slightly different D's. If there are big steps in your coding scheme, it will only be in a few places.

12-03-14 - Lossy AC Location Quantization

Okay, I want to ramble a bit about the idea of lossy AC location quantization.

This is the idea that as you are quantizing the AC's, throwing out some information, one of the things you should be throwing out is the *location* of AC signals, not just their energy levels (as you do with a scalar quantizer).

Now, Daala's PVQ scheme (P = Predictive or Pyramid depending on context, confusingly) does this a bit. They send the L2 norm ("gain") of each subband (their "subbands" being chunks of similar AC coefficients), and they send a Pyramid VQ unit vector from a codebook describing the distribution of that gain within the subband. When they do the VQ index selection, that can wind up decreasing some AC values and increasing others, because you rotated a bit in the N-dimensional space. Unlike normal AC scalar quantization, this is energy preserving because it's a rotation - when one value goes down, others go up.

But Daala is not really super lossy on the location part of the AC's in the way my imaginary coder is. In particular the step from sending no AC's to sending one (at K=1) is a very large bit rate step, and there's no way to send one AC in a somewhat-specified location. The step from K=1 to K=2 is also large.

To be clear, this idea is all about low bit rates. At high bit rates, K is high, this all doesn't matter very much.

When I was doing video, I was looking at low bit rates. Well, typical internet video bit rates. It's around 4k bytes per frame. For 720p that's around 2.4 bits per 8x8 block. For 480p it's 7.3 bits per 8x8 block. The bit rates for video are low!

At those low bit rates, when you get to send any AC's at all, it's usually just AC10 and/or AC01. If you have blocks with energy out in the higher AC's, it usually just costs too much to send them at all in an R/D coder. They cost too many bits, because you have to send a long Z-scan run-length, and they don't help D enough. (of course I'm talking about mocomp residuals here)

But perceptually this is bad. As noted previously, it makes it too big of a step in R to get to the next D option. There needs to be something in between.

That is, you start by sending just { AC10 , end-of-block } ; then your next step is { AC10 , AC45 , eob } - that's like a 10 bit step or something. You need something in between that's a few bits and provides some D benefit, like { AC10 , AC-somewhere }. That is, the location of that AC needs to be lossy.

Now, none of the old coders do this because under RMSE it doesn't make any sense. Taking some AC signal energy and putting on a different basis function is just a big WTF in terms of classic transform theory.

But perceptually it is right.

Rather than pulling something out of my ass, this time I'll post the actual DCT categories that I found to be optimal in my perceptual metric studies -


const int c_dctCategory[64] = 
{
   0, 1, 1, 3, 3, 3, 3, 6,
   2, 5, 5, 5, 6, 6, 6, 6,
   2, 5, 5, 6, 6, 6, 6, 6,
   4, 5, 7, 5, 6, 6, 6, 6,
   4, 7, 7, 7, 8, 8, 8, 8,
   4, 7, 7, 7, 8, 8, 8, 8,
   4, 7, 7, 7, 8, 8, 8, 8,
   7, 7, 7, 7, 8, 8, 8, 8
};

(there's also inter-band masking; categories 1+2 mask 3-5, and then categories 1-5 mask 6-8).

If you can't get the AC's all right, the next best thing is to get the sums within the categories right.

The categories sort of grossly tell you what kind of detail is in the block perceptually; is it high frequency, is it vertical detail or horizontal detail. It's what the eye picks up most of all in low bit rate video. Hey there's big chunky edge here where it should have been smooth, or hey this textured water is smooth in one block. The most important thing is for there to be *some* texture in the water, not for it to be right texture.

Point being, there is some benefit to filling in the gaps in the R-D space. Between sending no AC's at all, and sending "AC64 is 1" you can send "something in group 8 is 1". (in Pyramid VQ speak, there are big gaps to fill between K=0 and K=1 and K=2)

(note that under a normal D metric, this won't be good, so RDO won't work. You either need a D metric that counts this right, or fudge it by interpolating)

(also note that I'm hand-waving in a nasty way here; the perceptual property that "subband sums make sense" is true when applied to the original block - not the residual after subtracting out mocomp, while the property that "almost all AC's are zero" is a property of the residual! This is a difficulty that leads to the Daala PVQ (P = Predictive) Householder business. We'll ignore it for now.)

So, how might you actually send AC's with lossy positions?

There are a few ways. I'll sketch one or two.

First of all, let's work one subband at a time. We'll assume for now that we want the "lossy positioning" to only move values around within a subband, where they look rather similar. (perhaps at very low bit rates being even more lossy would be good). So, we want to send the sum of energy on each subband, and then send the way that energy is distributed within the subband.

It's an open question to me what the best "energy" on the subband is to preserve; L2, L1 or something else? Some power of those norms? In any case, it's scalar quantized. Then we have to distribute the energy, knowing the sum.

At higher bit rates, you quantize the distribution somehow and send it. One method is the Pyramid VQ (PVQ) that Daala uses. Perhaps I'll describe this more later, or you can follow the links and read old reference material. For small subbands and low quantized sums, you can also just store a codebook.

When the quantized sum is 1 ("K" in PVQ speak), then you just need to encode the position of where that 1 pulse is; eg. in a subband of N slots you need to code the spot that gets the energy, which takes log2(N) bits for a flat code. (of course a flat code is wrong because the slots aren't equally likely; the lower frequencies are slightly more likely, even within a subband that has been selected to be somewhat uniform).

So, what about sending less than log2(N) bits?

Well, at K=1 you can think of the bits as being binary splitting planes, cutting the subband in half with each bit you send. If you send all log2(N) bits then you specify a value precisely.


eg.

subband 3 is { a b c d }

K = 0 ; no bits sent , position = { abcd }
1 bit : { ab | cd }
K = 1 : 2 bits : { a | b | c | d }

We now have a continuous way of sending more bits and getting more precision.

On larger subbands, the splitting planes could be not in the middle (binary encoding) but rather at the point where they divide the probability 50/50, so that the index is variable length and doesn't require entropy coding.

When you've specified a lossy range of possible energy locations, eg. at 1 bit we specified "either a or b", you have a lot of options for how to restore that in the decoder. It could be just the most probable spot (a), or a pseudo-random choice of either. If the encoder and decoder use the same pseudo-random choice, that allows the encoder to make an RD decision based on whether the restoration will match reality or not.

There's a tricky issue above K=1 when you want to add lossy detail for the next value distribution. It needs to be done such that D actually goes down, which is not trivial. You want D to be as continuous as possible as you add more lossy detail. The "embedded" style of truncating away bits of location is also not so obvious to me when you have two values to distribute.

More on this whole topic later.

12/01/2014

12-01-14 - AC Quantization

I just read the Xiph PVQ demo page ; pretty interesting, go have a read. (BTW it kills me that Daala is using lapped transforms, oy!) (also - thank god some people still write real blogs and RSS them so I can find them!)

It reminded me of some things I've been talking about for a long time, but I couldn't find a link to myself on my blog. Anyway, internally at RAD I've been saying that we should be sending AC coefficients with more precision given to the *sum* of the values, and less given to the *distribution* of the values.

ADD : some searching finds these which are relevant -

cbloom rants 08-25-09 - Oodle Image Compression Looking Back
cbloom rants 08-27-09 - Oodle Image Compression Looking Back Pictures
cbloom rants 03-10-10 - Distortion Measure
cbloom rants 10-30-10 - Detail Preservation in Images

I've never actually pursued this, so I don't have the answers. But it is something that's been nibbling at my brain for a long time, and I still think it's a good idea, so that tells me there's something there.

I'll scribble a few notes -

1. I've long believed that blocks should be categorized into "smooth", "detail" and "edge". For most of this discussion we're going to ignore smooth and edge and just talk about detail. That is, blocks that are not entirely smooth, and don't have a dominant edge through them (perhaps because that edge was predicted).

2. The most important thing in detail blocks is preserving the amount of energy in the various frequency subbands. This is something that I've talked about before in terms of perceptual metrics. In a standard DCT you might make categories something like :

01447777
23669999
566BBC99
56BBBCCC
8ABBBCCC
8ADDDEEE
8AADDEEE
8AADDEEE

and what's most important is preserving the sum in each category. (that chart was pulled out of my ass but you get the idea).

The sums should be preserved in a kind of wavelet subband quadtree type of way. Like preserve the sum of each of the 4x4 blocks; then go only to the upper-left 4x4 and divide it into 2x2's and preserve those sums, and then go to only the upper-left 2x2, etc.

3. You can take a standard type of codec and optimize the encoding towards this type of perceptual metric, and that helps a bit, but it's the wrong way to go. Because you're still spending bits to exactly specify the noise in the high frequency area. (doing the RD optimization just let you choose the cheapest way to specify that noise).

What you really want is a perceptual quantizer that fundamentally gives up information in the right way as you reduce bitrate. At low bits you just want to say "hey there's some noise" and not spend bits specifying the details of it.

The normal scalar quantizers that we use are just not right. As you remove bits, they kill energy, which looks bad. It looks better for that energy to be there, but wrong.

3 1/2. The normal zig-zag coding schemes we use are really bad.

In order to specify any energy way out in the highest-frequency region (E in the chart above) you have to send a ton of zeros to get there. This makes it prohibitively costly in bits.

One of the first things that you notice when implementing an R/D optimized coder with TQ is that it starts killing all your high frequency detail. This is because under any kind of normal D norm, with zig-zag-like schemes, the R to send those values is just not worth it.

But perceptually that's all wrong. It makes you over-smooth images.

3 3/4. Imagine that the guy allocating bits is standing on the shore. The shore is the DC coefficient at 00. He's looking out to sea. Way out at the horizon are the super-high-frequency coefficients in the bottom right of the DCT. In the foreground (that's AC10 and AC01) he can see individual waves and where rocks are. Way out at sea he shouldn't be spending bandwidth trying to describe individual waves, but he should still be saying things like "there are a lot of big waves out there" or "there's a big swell but no breakers". Yeah.

4. What you really want is a joint quantizer of summed energy and the distribution of that energy. At max bit rate you send all the coefficients exactly. As you reduce bitrate, the sum is preserved pretty well, but the distribution of the lower-right (highest frequency) coefficients becomes lossy. As you reduce bit rate more, the total sum is still pretty good and the overall distribution of energy is mostly right, but you get more loss in where the energy is going in the lower frequency subbands, and you also get more scalar quantization of the lower frequency subbands, etc.

Like by the time that AC11 has a scalar quantizer of 8, the highest frequency lower-right area has its total energy sent with a quantizer of 32, and zero bits are sent for the location of that energy.

5. When the energy is unspecified, you'd like to restore in some nice way. That is, don't just restore to the same quantization vector every time ("center" of the quantization bucket), since that could create patterns. I dunno. Maybe restore with some randomness; restore based on prediction from the neighborhood; restore to maximum likelihood? (ML based on neighborhood/prediction/image not just a global ML)

6. An idea I've tossed around for a while is a quadtree/wavelet-like coding scheme. Take the 8x8 block of coefficients (and as always exclude DC in some way). Send the sum of the whole thing. Divide into four children. So now you have to send a (lossy) distribution of that sum onto the 4 child slots. Go to the upper left (LL band) and do it again, etc.

7. The more energy you have, the less important its exact distribution, due to masking. As you have more energy to distribute, the number of vectors you need goes up a lot, but the loss you can tolerate also goes up. In terms of the bits to send a block, it should still increase as a function of the energy level of that block, but it should increase less quickly than naive log2(distributions) would indicate.

8. Not all AC's are equally likely or equally perceptually important. Specifically the vector codebook should contain more entries that preserve values in the upper-left (low frequency) area.

9. The interaction with prediction is ugly. (eg. I don't know how to do it right). The nature of AC values after mocomp or intra-prediction is not the same as the nature of AC's after just transform (as in JPEG). Specifically, ideas like variance masking and energy preservation apply to the transformed AC values, *not* to the deltas that you typically see in video coding.

10. You want to send the information about the AC in a useful order. That is, the things you send first should be very strong classifiers of the entropy of that block for coding purposes, and of the masking properties for quantization purposes.

For example, packJPG first sends the (quantized) location of the last-non-zero coefficient in Z-scan order. This turns out to be the best classifier of blocks for normal JPEG, so it is used as the primary bit of context for coding further information about the block.

You don't want sending the "category" or "masking" information to be separate side-band data. It should just be the first part of sending the coefficients. So your category is maybe something like the bit-vector of which coefficient groups have any non-zero coefficients. Something like that which is not redundant with sending them, it's just the first gross bit of information about their distribution.


I found this in email; I think there are some more somewhere ...

(feel free to ignore this) Currently every video coder in the world does something like mocomp then DCT (or a similar transform (eg. one that is orthonormal, complete, and )) on the residuals. The thing is, the DCT really doesn't make much sense on the residuals. Heck even just removing the DC doesn't usually do much on the residuals, because the mocomp will generally match the DC, so the DC of the residuals is near zero.

But the DCT still has value. The big advantage it has is the ability to send just a few coefficients and get a rough approximation of simple shapes like flat vs. gradient vs. noisy. For example if the residual is sort of speckly all over, to send that by coding values one by one you would have to send lots of values. With the DCT you can just send some signal in AC11 or AC22 and you get some rough approximation of splotches all over.

The DCT has another advantage; with quantization and R/D optimization, you want to be able to identify which coefficients are okay to crush. You want to be able to put the basis functions in order of perceptual importance so that you can send them in a sequence where truncating the tail is an okay way to send less data.

The subtle issue in all cases is that the coder choice implies a general type of error that you will create, because the coder creates an ordering, and under R/D optimization, you will prefer to send the parts of the signal that are cheaper. eg. in the normal DCT + zigzag type of coder, sending high frequency signals is much more expensive than low frequency. The result is that R/D optimization on DCT+zigzag coders tends to make things smoother. That is both good and bad. It means you can't really let the R/D optimization go nuts or you lose all your high frequency energy and your video appears to have no "detail". It would be nice to have coders that create different types of error under heavy R/D optimization. But maybe there's something better than the DCT?

1. Basis transform other than DCT.

This is just a small change, not a huge shift in method, but some subsets are special :

1.a. Multiple bases. This makes the coder over-complete, which is bad in theory but can help in practice. For example you could have 3 different bases possible - DCT (cosine), Haar (steps), and Legendre (polynomials), and the coder could first send a selection of bases then the residuals.

1.b. KLT - optimal bases per frame. The encoder looks at the actual residuals and finds the optimum (in the sense of "most energy compacting") basis. You must then transmit the basis at the start of the frame. I believe this is not possible for 8x8 transforms, because sending the basis costs too much; you would have to send 64*64 coefficients, which is large compared to typical video frame sizes. It is possible, however, for 4x4 bases, and that could be cascaded with a 2x2 Haar to make an 8x8 transform. There are some problems with this, though - because the KLT bases are computed, the perceptual character of them is unknown; eg. how to order them by energy is known, but that order might not be their perceptual order. Also there is no "fast" transform because the bases are unknown; you must use straight basis addition to do the reverse transform.

(there are weird cases where KLT basis is clearly good; for example scenes that don't match the "natural/smooth" assumption of DCT, such as movie titles or credits, scenes full of the same pattern (eg. videos of grass); but just like with color transforms, the disadvantage of KLT is that you no longer know exactly what your bases are, and you can't pre-decide their perceptual importance (eg. the JPEG quantizer matrix))

1.c. Overcomplete basis. Rather than multiple bases or per-frame bases, use a static pre-chosen overcomplete set. Maybe some cosines, some linear ramps, some exact hard edges at various locations and angles.

2. Quadtree with "don't care" distribution.

I haven't quite figured this out, but the thinking goes like this : Perhaps the most important (perceptual) single piece of information about the residuals is the total amount of AC energy. This is just the L-something sum of the residuals. Less important is exactly how that energy is distributed. So maybe we could come up with a way of encoding the residuals such that we first send the sum of energy, and then we send information about the distribution, and when we increase the "quantizer" or we do R/D truncation, what we lose first is information about the distribution, not the total. That is, loss causes the location of energy to become imprecise, not the total.

I don't have a great idea for how to do this, but maybe something like this : First send the AC total for the whole block. Then consider blocks in a quadtree hierarchy. For each block, you already know the total of energy, so transmit how that energy is split among the 4. Then for any child whose total is not zero, recurse onto that child and repeat. At this point it's lossless, so it's not very interesting; the question is how to introduce some loss. Two ideas :

2.a. Add some amount of "don't care" energy to each quad chunk. That is, you know the total for each quadtree block as you descend. Rather than sending the 4 portions which must add up to the total, send 4 portions which are <= the total; the remainder (missing energy) is "don't care" energy and is distributed in some semi-random way.

2.b. Quantize the distribution. Send the total first. Then the distribution is sent with some larger quantizer, and the missing parts are fixed up. eg. you send a total of 13. The distribution gets an extra quantizer of 3. So for the first block you would send a distribution of 13/3 = 4. So you send how the 4 is spread. That only accounts for 4*3 = 12, so 1 was not sent, you put it somewhere random.

If the distribution quantizer is 1, it's lossless; if the distribution quantizer is reasonably large, then you send no info about the distribution, just the total energy. This method is pretty poor at sending linear ramps or edges, though.

3. S/D/E separate coders ; I've been talking about this idea for a while without much substance. The idea is that the perceptually important thing really changes dramatically based on the local character. For smooth areas, you really must keep it smooth, but the exact level or type of ramp is not that important. For edges, you want to keep a sharp edge, and not introduce ringing or blocking, but any other detail near the edge gets severely visually masked, so you can fuck that up. For detail areas, the dominating thing is the amount of energy and the rough frequency of it, the exact speckles don't matter.

Anyway, it occurred to me that rather than trying to do an SDE coder from the ground up, what you can do is just provide various coders, like maybe 1c and 2b and also normal DCT coding. You don't specifically ever decide "this block is smooth" and use the smooth coder, instead you just try all the coders and take the one that comes out best. One of the difficulties with this is you don't know the perceptual impact of your coding decisions. With a normal DCT zigzag type coder, you can just use the magnitude of the values to be a rough estimate of the perceptual importance, and add some fudge factors for the DC being more important than AC and so on. With weird bases and weird coders you don't have a good way to make R/D decisions other than by actually running a perceptual D calculation for each R choice, which is several orders of magnitude slower.

(there were some good responses to that mail as well...)

(If someone wants to give me a professorship and a few grad students so that I can solve these problems, just let me know anytime...)

11/12/2014

11-12-14 - Intel TSX notes

Had a little look at Intel TSX. Some sloppy notes.

TSX are new instructions for (limited) hardware transactions.

I'm gonna talk mainly about the newer RTM (XBEGIN/XEND) because it's conceptually clearer than XACQUIRE/XRELEASE.

In general RTM allows a chunk of speculative code to be run, which can be fully rolled back, and which is not visible to any other core until it is committed. This allows you to do a transaction involving several memory ops, and only commit if there was no contention, so the entire transaction appears atomically to other cores.

The TSX documentation talks about it a lot in terms of "hardware lock elision" (HLE) but it's more general than that. (and HLE the algorithmic process is not to be confused with the annoyingly named "HLE instructions" (XACQUIRE/XRELEASE))

TSX does not guarantee forward progress, so there must always be a fallback non-TSX pathway. (complex transactions might always abort even without any contention because they overflow the speculation buffer. Even transactions that could run in theory might livelock forever if you don't have the right pauses to allow forward progress, so the fallback path is needed then too).

TSX works by keeping a speculative set of registers and processor state. It tracks all reads done in the speculation block, and enqueues all writes to be delayed until the transaction ends. The memory tracking of the transaction is currently done using the L1 cache and the standard cache line protocols. This means contention is only detected at cache line granularity, so you have the standard "false sharing" issue.

If your transaction reads a cache line, then any write to that cache line by another core causes the transaction to abort. (reads by other cores do not cause an abort).

If your transaction writes a cache line, then any read or write by another core causes the transaction to abort.

If your transaction aborts, then any cache lines written are evicted from L1. If any of the cache lines involved in the transaction are evicted during the transaction (eg. if you touch too much memory, or another core locks that line), the transaction is aborted.

TSX seems to allow quite a large working set (up to size of L1 ?). Obviously the more memory you touch the more likely to abort due to contention.

Obviously you will get aborts from anything "funny" that's not just plain code and memory access. Context switches, IO, kernel calls, etc. will abort transactions.

At the moment, TSX is quite slow, even if there's no contention and you don't do anything in the block. There's a lot of overhead. Using TSX naively may slow down even threaded code. Getting significant performance gains from it is non-trivial.

RTM Memory Ordering :

A successful RTM commit causes all memory operations in the RTM region to appear to execute atomically. A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK prefixed instruction. The XBEGIN instruction does not have fencing semantics. However, if an RTM execution aborts, all memory updates from within the RTM region are discarded and never made visible to any other logical processor.

One of the best resources is the new Intel Optimization Manual, which has a whole chapter on TSX.

RTM is very nice as a replacement for traditional lockfree algorithms based on atomics when those algorithms are very complex. Something simple like just an atomic increment, you obviously shouldn't use RTM, just do the atomic. Even something like a lockfree LIFO Stack will be better with traditional atomics. But something complex like a lockfree MPMC FIFO Queue will be appropriate for RTM. (an example MPMC FIFO requires two CAS ops, and an atomic load & store, even without contention; so you can replace all those atomic ops with one RTM section which either commits or aborts)

RTM handles nesting in the simplest way - nested transactions are absorbed into their parent. That is, no transaction commits until the topmost parent commits. Aborts in nested transactions will abort the parent.

BEWARE : the transactional code might not need the old-fashioned lock-free atomics, but you do still have to be careful about what the optimizer does. Use volatiles, or perhaps relaxed-order atomic loads/stores, to make sure that the variable reads/writes you think are happening actually happen where you expect!!
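A minimal sketch of what I mean, using C++11 relaxed atomics (hypothetical field names; the point is only that the compiler must emit each access where written - this is not a full RTM example) :

```cpp
#include <atomic>

struct Counters
{
    // relaxed atomics add no fencing overhead, but unlike plain ints
    // the compiler cannot elide, cache, or reorder these accesses
    std::atomic<int> m_a { 0 };
    std::atomic<int> m_b { 0 };
};

// imagine this is the body of an _xbegin/_xend region :
// the load and the store must actually happen inside the transaction
void TransactionBody( Counters * c )
{
    int a = c->m_a.load( std::memory_order_relaxed );
    c->m_b.store( a + 1, std::memory_order_relaxed );
}
```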


I think an interesting point is that elided locks don't act like normal locks.

Consider a simple object protected by a lock :


struct Thingy
{
    int m_lock; // 0 if unlocked
    int m_a;
    int m_b;
};

The lock provides mutual exclusion for work done on m_a and m_b.

Let's pretend we protected it using a simple spin lock, so a function that used it would be like :


void DoStuff( Thingy * t )
{

while( AtomicExchange(t->m_lock,1) != 0 )
{
  // someone else has the lock; spin :
  pause;
}

// I own the lock
// do stuff to m_a and m_b in here :
DoStuff_Locked(t);

AtomicStore(t->m_lock,0); // release the lock

}

Now we replace it with the RTM transactional version :

void DoStuff( Thingy * t )
{

if ( _xbegin() == _XBEGIN_STARTED )
{
  // transactional path

  // add "m_lock" to the transaction read-set :
  if ( t->m_lock != 0 )
  {
    // lock is held, abort transaction
    _xabort(0xff); // _xabort requires an immediate abort code
  }

    // do stuff to m_a and m_b in here :
    DoStuff_Locked(t);

  _xend();
}
else
{
  // transaction aborts go here
  // normal non-transactional path :

    while( AtomicExchange(t->m_lock,1) != 0 )
    {
      // someone else has the lock; spin :
      pause;
    }

    // I own the lock
    // do stuff to m_a and m_b in here :
    DoStuff_Locked(t);

    AtomicStore(t->m_lock,0); // release the lock
}

}

So, how does this work? Let's have a look :

In the transactional path, m_lock is not written to. The lock is not actually held. We do make sure to *read* m_lock so that if another thread takes the lock, it aborts our transaction. Our transaction will only complete if no other thread writes the memory we access.

In fact, the transactional path does not provide mutual exclusion. Multiple threads can read from the same object without conflicting. As long as "DoStuff_Locked" only reads, the transactions will all proceed. In this sense, RTM has converted a normal lock into a reader-writer lock!

The transactional path is fine-grained! The lock as we wrote it is coarse-grained: m_lock protects the entire object, not the individual fields. So if thread 1 tries to modify m_a, and thread 2 modifies m_b, they must wait for each other, even though they are not actually racing. The transactional path will let them both run at the same time, provided there's cache line padding to prevent false sharing.

To be clear :


Transactional :

T1 :
read m_lock
write m_a

T2 :
read m_lock
write m_b

can run simultaneously with no conflict

traditional/fallback lock :

T1 :
take m_lock
write m_a

T2 :
take m_lock
write m_b

must synchronize against each other.

NOTE : as written the transactional path will actually fail all the time, because all the variables are on the same cache line. They need cache-line-size padding between the fields. Perhaps most importantly, the mutex/lock variable that's checked should be on its own cache line that is not written to on the transactional path.
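A sketch of the padded layout (assuming 64-byte cache lines, which holds on current x86 parts) :

```cpp
#include <cstddef>

// each field gets its own cache line, so transactions touching m_a
// and m_b don't conflict, and the m_lock line - which is only read
// on the transactional path - is never invalidated by data writes
struct PaddedThingy
{
    alignas(64) int m_lock; // 0 if unlocked
    alignas(64) int m_a;
    alignas(64) int m_b;
};

static_assert( sizeof(PaddedThingy) >= 3*64, "fields must not share cache lines" );
```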

Obviously this doesn't seem too interesting on this little toy object, but on large objects with many fields, it means that operations working on different parts of the object don't synchronize. You could have many threads reading from part of an object while another thread writes a different part of the object with no conflict.

It's pretty interesting to me that the elided lock behaves very differently than the original lock. It changes to an RW lock and becomes fine-grained.


Overall I think TSX is pretty cool and I hope it becomes widespread. On the other hand, there is not much real world benefit to most code at the moment.

Some links :

Scaling Existing Lock-based Applications with Lock Elision - ACM Queue
Lock elision in glibc 01.org
TSX anti patterns in lock elision code Intel® Developer Zone
Exploring Intel® Transactional Synchronization Extensions with Intel® Software Development Emulator Intel® Developer Zone
Web Resources about Intel® Transactional Synchronization Extensions Intel® Developer Zone
Fun with Intel® Transactional Synchronization Extensions Intel® Developer Zone

11/11/2014

11-11-14 - x64 movdqa atomic test

How to do an atomic load/store of a 128 bit value on x64 is an ugly question.

The only guaranteed way is via cmpxchg16b. But that's an awfully slow way to do a load or store.

movdqa appears to be an atomic way to move 128 bits - on most chips. Not all. And Intel/AMD don't want to clearly identify the cases where it is atomic or not. (they specifically don't guarantee it)

At the moment, shipping code needs to use cmpx16 to be safe. (my tests indicate that the x64 chips in the modern game consoles *do* have atomic movdqa, so it seems safe to use there)

My main curiosity is whether there exist any modern ("Core 2" or newer) x64 chips that *do not* provide an atomic movdqa.
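For reference, this is what the movdqa load/store looks like (using long long in place of MSVC's __int64 so it compiles anywhere; remember that the atomicity of these single-instruction 16-byte accesses is exactly the thing that is NOT architecturally guaranteed) :

```cpp
#include <emmintrin.h>

// 128-bit load/store ; pointers must be 16-byte aligned or movdqa faults.
// Each of these compiles to a single movdqa, which is atomic on many
// x64 chips, but not guaranteed to be on all of them.
void Load128_movdqa( long long * out_Result, const long long * in_LoadFrom )
{
    __m128i v = _mm_load_si128( (const __m128i *)in_LoadFrom );
    _mm_store_si128( (__m128i *)out_Result, v );
}

void Store128_movdqa( long long * out_StoreTo, const long long * in_Value )
{
    __m128i v = _mm_load_si128( (const __m128i *)in_Value );
    _mm_store_si128( (__m128i *)out_StoreTo, v );
}
```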

Anyhoo, here's a test to check if movdqa is atomic on your machine. If you like, run it and send me the results : (Windows, x64 only)

cb_x64_atomics_test.7z

The atomic test will just run for a while. If it has a failure it will break.

(you will get some errors about not having a v: or r: drive; you can just ignore them.)

Copy the output, or you should be able to get the log file here :

"c:\oodlelogs\oodle_cb_test_x64_atomics.log"
"c:\oodlelogs\oodle_cb_test_x64_atomics.prev"

email me at cb at my domain.


For the record, an atomic 128 bit load/store for x64 Windows using cmpx16 :


#include <intrin.h>

void AtomicLoad128(__int64 * out_Result, __int64 volatile * in_LoadFrom)
{
    // do a swap of the value in out_Result to itself
    //  if it matches, it stays the same
    //  if it doesn't match, we get a load
    _InterlockedCompareExchange128(in_LoadFrom,out_Result[1],out_Result[0],out_Result);
}

void AtomicStore128(__int64 volatile * out_StoreTo,const __int64 * in_Value)
{
    // do an initial non-atomic load of StoreTo :
    __int64 check_StoreTo[2];
    check_StoreTo[0] = out_StoreTo[0];
    check_StoreTo[1] = out_StoreTo[1];
    // store with cmpx16 :
    while( ! _InterlockedCompareExchange128(out_StoreTo,in_Value[1],in_Value[0],check_StoreTo) )
    {
        // check_StoreTo was reloaded with the value in out_StoreTo
        _mm_pause();
    }
}

9/10/2014

09-10-14 - Suffix Trie EOF handling

I realized something.

In a Suffix Trie (or suffix sort, or anything similar), handling the end-of-buffer is a mess.

The typical way it's described in the literature is to treat the end-of-buffer (henceforth EOB) as if it is a special character with value out of range (such as -1 or 256). That way you can just compare strings that go up to EOB with other strings, and the EOB will always mismatch a normal character and cause those strings to sort in a predictable way.

eg. on "banana" when you insert the final "na-EOB" you compare against "nana" and wind up comparing EOB vs. 'n' to find the sort order for that suffix.
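A tiny sketch of that sort order: compare suffixes of "banana" treating EOB as a virtual character smaller than any real byte, so a suffix that ends first sorts first :

```cpp
#include <string>
#include <vector>
#include <algorithm>
#include <cstddef>

// true if the suffix starting at ia sorts before the one starting at ib,
// with EOB acting as a character that mismatches (and sorts below)
// every real character
static bool SuffixLess( const std::string & s, size_t ia, size_t ib )
{
    while ( ia < s.size() && ib < s.size() )
    {
        if ( s[ia] != s[ib] )
            return (unsigned char)s[ia] < (unsigned char)s[ib];
        ia++; ib++;
    }
    // one suffix hit EOB ; EOB vs. a real character always mismatches low
    return ia == s.size() && ib < s.size();
}

static std::vector<size_t> SuffixSort( const std::string & s )
{
    std::vector<size_t> idx;
    for ( size_t i = 0; i < s.size(); i++ ) idx.push_back(i);
    std::sort( idx.begin(), idx.end(),
        [&s]( size_t a, size_t b ) { return SuffixLess(s,a,b); } );
    return idx;
}
```

On "banana" this sorts the suffix start positions as "a","ana","anana","banana","na","nana" - note "ana" before "anana" exactly because EOB mismatched 'n' and sorted low.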

The problem is that this is just a mess in the code. All over my suffix trie code, everywhere that I do a string lookup, I had to do extra special case checking for "is it EOB" and handle that case.

In addition, when I find a mismatch of "na-EOB" and "nan", the normal path of the code would be to change the prefix "na" into a branch and add children for the two different paths - EOB and "n". But I can't actually do that in my code because the child is selected by an unsigned char (uint8), and EOB is out of bounds for that variable type. So in all the node branch construction paths I have to special case "is it a mismatch just because of EOB, then don't create a branch". Blah blah.

Anyway, I realized that can all go away.

The key point is this :

Once you add a suffix that hits EOB (eg. the first mismatch against the existing suffixes is at EOB), then all future suffixes will also hit EOB, and that will be the deepest match for all of them.

Furthermore, all future suffix nodes can be found immediately using "follows"

eg. in the "banana" case, construction of the trie goes like :


"ban" is in the suffix trie
  (eg. "ban..","an..","n.." are all in the trie)

add "ana" :

we find the existing "an.."

the strings we compare are "anana-EOB" vs "ana-EOB"

so the mismatch hits EOB

That means all future suffixes will also hit EOB, and their placement in the tree can be found
just by using "follows" from the current string.

"ana-EOB" inserts at "anana"
"na-EOB" inserts at "nana"
"a-EOB" inserts at "ana"

That is, at the end of every trie construction when you start hitting EOB you just jump out to this special case of very simple follows addition.

So all the special EOB handling can be pulled out of the normal Trie code and set off to the side, which is lovely for the code clarity.

In practice it's quite common for files to end with a run of "000000000" which now gets swallowed up neatly by this special case.


ADD : if you don't care about adding every occurrence of each suffix, then it gets even simpler -

If you hit EOB when adding a string - just don't add it. A full match of that suffix already exists, and your suffix that goes to EOB can't match any lookup better than what's in there.

(note that when I say "hit EOB" I mean you are at a branch node, and your current character doesn't match any of the branches because it is EOB. You will still add leaves that go to EOB, but you will never actually walk down those leaves to reach EOB.)

8/31/2014

08-31-14 - DLI Image Compression

I got pinged about DLI so I had a look.

DLI is a closed source image compressor. There's no public information about it. It may be the best lossy image compressor in the world at the moment.

(ASIDE : H265 looks promising but has not yet been developed as a still image compressor; it will need a lot of perceptual work; also as in my previous test of x264 you have to be careful to avoid the terrible YUV and subsamplers that are in the common tools)

I have no idea what the algorithms are in DLI. Based on looking at some of the decompressed images, I can see block transform artifacts, so it has something like an 8x8 DCT in it. I also see certain clues that make me think it uses something like intra prediction. (those clues are good detail and edge preservation, and a tendency to preserve detail even if it's the wrong detail; the same thing that you see in H264 stills)

Anyway, I thought I'd run my own comparo on DLI to see if it really is as good as the author claims.

I tested against JPEG + packjpg + my JPEG decoder. I'm using an unfinished version of my JPEG decoder which uses the "Nosratinia modified" reconstruction method. It could be a lot better. Note that this is still a super braindead simple JPEG. No per-block quantizer. No intra prediction. Only 8x8 transforms. Standard 420 YCbCr. No trellis quantization or rate-distortion. Just a modern entropy coding back end and a modern deblocking decoder.

I test with my perceptual image tester imdiff . The best metric is Combo which is a linear combo of SCIELAB_MyDelta + MS_SSIM_IW_Y + MyDctDelta_YUV.

You can see some previous tests on mysoup or moses or PDI

NOTE : "dlir.exe" is the super-slow optimizing variant. "dli.exe" is reasonably fast. I tested both. I ran dlir with -ov (optimize for visual quality) since my tests are mostly perceptual. I don't notice a huge difference between them.

My impressions :

DLI and jpeg+packjpg+jpegdec are both very good. Both are miles ahead of what is commonly used these days (old JPEG for example).

DLI preserves detail and contrast much better. JPEG tends to smooth and blur things at lower bit rates. Part of this may be something like a SATD heuristic metric + better bit allocation.

DLI does "mangle" the image. That is, it gets the detail *wrong* sometimes, which is something that JPEG really never does. The primary shapes are preserved by jpeg+packjpg+jpegdec, they just lose detail. With DLI, you sometimes get weird lumps appearing that weren't there before. If you just look at the decompressed image it can be hard to spot, because it looks like there's good detail there, but if you A-B test the decompressed image against the original, you'll see that DLI is actually changing the detail. I saw this before when analyzing x264.

DLI is similar looking to x264-still but better.

DLI seems to have a special mode for gradients. It preserves smooth gradients very well. JPEG-unblock creates a stepped look because it's a series of ramps that are flat in the middle.

DLI seems to make edges a bit chunky. Smooth curves get steppy. jpeg+packjpg+jpegdec is very good at preserving a smooth curved edge.

DLI is the only image coder I've seen that I would say is definitely slightly better than jpeg+packjpg+jpegdec. Though it is worse in some ways, I think the overall impression of the decoded image is definitely better. Much better contrast preservation, much better detail energy level preservation.

Despite jpeg often scoring better than DLI on the visual quality metrics I have, DLI usually looks much better to my eyes. This is a failure of the visual quality metrics.


Okay. Time for some charts.

In all cases I will show the "TID Fit" score. This is a 0-10 quality rating with higher better. This removes the issue of SSIM, RMSE, etc. all being on different scales.

NOTE : I am showing RMSE just for information. It tells you something about how the coders are working and why they look different, where the error is coming from. In both cases (DLI and JPEG) the runs are optimized for *visual* quality, not for RMSE, so this is not a comparison of how well they can do on an RMSE contest. (dlir should be run with -or and jpeg should be run with flat quantization matrices at least).

(see previous tests on mysoup or moses or PDI )

mysoup :

moses :

porsche640 :

pdi1200 :


Qualitative Comparison :

I looked at JPEG and DLI encodings at the same bit rate for each image. Generally I try to look around 1 bpp (that's logbpp of 0) which is the "sweet spot" for lossy image compression comparison.

Here are the original, a JPEG, and a DLI of Porsche640.
Download : RAR of Porsche640 comparison images (1 MB)

What I see :

DLI has very obvious DCT ringing artifacts. Look at the lower-right edge of the hood, for example. The sharp line of the hood has ringing ghosts in 8x8 chunks.

DLI preserves contrast overall much better. The most obvious places are in the background - the leaves, the pebbles. JPEG just blurs those and drops a lot of high frequency detail, DLI keeps it much better. DLI preserves a lot more high frequency data.

DLI adds a lot of noise. JPEG basically never adds noise. For example compare the centers of the wheels. The JPEG just looks like a slightly smoothed version of the original. The DLI has got lots of chunkiness and extra variation that isn't in the original.

In a few places DLI really mangles the image. One is the A-pillar of the car, another is the shadow on the hood, also the rear wheel.

Both DLI and JPEG do the same awful thing to the chroma. All the orange in the gravel is completely lost. The entire color of the laurel bush in the background is changed. Both just produce a desaturated image.

Based on the scores and what I see perceptually, my guess is this : DLI uses an 8x8 DCT. It uses a quantization matrix that is much flatter than JPEG's.

8/27/2014

08-27-14 - LZ Match Length Redundancy

A quick note on something that does not work.

I've written before about the redundancy in LZ77 codes. ( for example ). In particular the issue I had a look at was :

Any time you code a match, you know that it must be longer than any possible match at lower offsets.

eg. you won't sent a match of length of 3 to offset 30514 if you could have sent offset 1073 instead. You always choose the lowest possible offset that gives you a given match length.

The easy way to exploit this is to send match lengths as the delta from the next longest match length at lower offset. You only need to send the excess, and you know the excess is greater than zero. So if you have an ML of 3 at offset 1073, and you find a match of length 4 at offset 30514, then you send {30514,+1}

To implement this in the encoder is straightforward. If you walk your matches in order from lowest offset to highest offset, then you know the current best match length as you go. You only consider a match if it exceeds the previous best, and you record the delta in lengths that you will send.
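A sketch of that encoder walk (hypothetical Match struct; candidates assumed sorted by increasing offset, minimum match length 3) :

```cpp
#include <vector>

struct Match { int offset; int length; };

// keep only matches that beat the best length found at a lower offset,
// and replace each length with the excess to transmit (always >= 1)
std::vector<Match> DeltaMatchLengths( const std::vector<Match> & sorted_by_offset )
{
    std::vector<Match> out;
    int best_len = 2; // one below the minimum match length of 3
    for ( const Match & m : sorted_by_offset )
    {
        if ( m.length <= best_len )
            continue; // redundant : a lower offset matches at least as long
        out.push_back( { m.offset, m.length - best_len } ); // send only the excess
        best_len = m.length;
    }
    return out;
}
```

On the example above - ML 3 at offset 1073, then ML 4 at offset 30514 - this produces {1073,+1} and {30514,+1}.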

The same principle applies to the "last offsets" ; you don't send LO2 if you could send LO0 at the same length, so the higher index LO matches must be of greater length. And the same thing applies to ROLZ.

I tried this in all 3 cases (normal LZ matches, LO matches, ROLZ). No win. Not even tiny, but close to zero.

Part of the problem is that match lengths are just not where the bits are; they're small already. But I assume that part of what's happening is that match lengths have patterns that the delta-ing ruins. For example binary files will have patterns of 4 or 8 long matches, or in an LZMA-like you'll have certain patterns show up like at certain pos&3 intervals after a literal you get a 3-long match, etc.

I tried some obvious ideas like using the next-lowest-length as part of the context for coding the delta-length. In theory you could be able to recapture something like a next-lowest of 3 predicts a delta of 1 in places where an ML of 4 is likely. But I couldn't find a win there.

I believe this is a dead end. Even if you could find a small win, it's too slow in the decoder to be worth it.

7/14/2014

07-14-14 - Suffix-Trie Coded LZ

Idea : Suffix-Trie Coded LZ :

You are doing LZ77-style coding (eg. matches in the prior stream or literals), but send the matches in a different way.

You have a Suffix Trie built on the prior stream. To find the longest match for a normal LZ77 you would take the current string to code and look it up by walking it down the Trie. When you reach the point of deepest match, you see what string in the prior stream made that node in the Trie, and send the position of that string as an offset.

Essentially what the offset does is encode a position in the tree.

But there are many redundancies in the normal LZ77 scheme. For example if you only encode a match of length 3, then the offsets that point to "abcd.." and "abce.." are equivalent, and shouldn't be distinguished by the encoding. The fact that they both take up space in the numerical offset is a waste of bits. You only want to distinguish offsets that actually point at something different for the current match length.

The idea in a nutshell is that instead of sending an offset, you send the descent into the trie to find that string.

At each node, first send a single bit for does the next byte in the string match any of the children. (This is equivalent to a PPM escape). If not, then you're done matching. If you like, this is like sending the match length with unary : 1 bits as long as you're in a node that has a matching child, then a 0 bit when you run out of matches. (alternatively you could send the entire match length up front with a different scheme).
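The unary part can be sketched like this (simplified to matching against a single prior string rather than a real trie, and emitting raw bits rather than model-coded ones) :

```cpp
#include <string>
#include <vector>

// emit a 1 bit while a child matches the next byte, then a 0 (the escape)
std::vector<int> UnaryMatchLenBits( const std::string & prior, const std::string & cur )
{
    std::vector<int> bits;
    size_t n = 0;
    while ( n < prior.size() && n < cur.size() && prior[n] == cur[n] )
    {
        bits.push_back(1); // a matching child exists : descend one level
        n++;
    }
    bits.push_back(0); // no matching child : match length is n
    return bits;
}
```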

When one of the children matches, you must encode which one. This is just an encoding of the next character, selected from the previously seen characters in this context. If all offsets are equally likely (they aren't) then the correct thing is just Probability(child) = Trie_Leaf_Count(child) , because the number of leaves under a node is the number of times we've seen this substring in the past.

(More generally the probability of offsets is not uniform, so you should scale the probability of each child using some modeling of the offsets. Accumulate P(child) += P(offset) for each offset under a child. Ugly. This is unfortunately very important on binary data where the 4-8-struct offset patterns are very strong.)

Ignoring that aside - the big coding gain is that we are no longer uselessly distinguishing offsets that only differ at higher match length, AND instead of just wasting those bits, we instead use them to make those offsets code smaller.

For example : say we've matched "ab" so far. The previous stream contains "abcd","abce","abcf", and "abq". Pretend that somehow those are the only strings. Normal LZ77 needs 2 bits to select from them - but if our match len is only 3 that's a big waste. This way we would say the next char in the match can either be "c" or "q" and the probabilities are 3/4 and 1/4 respectively. So if the length-3 match is a "c" we send that selection in only log2(4/3) bits = 0.415
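Checking that arithmetic (just the standard -log2 code length, nothing trie-specific) :

```cpp
#include <cmath>

// ideal code length in bits for an event of probability p
double CodeCostBits( double p )
{
    return -std::log2( p );
}
```

CodeCostBits(3.0/4.0) is about 0.415, matching the number above, vs. the full 2.0 bits a flat code over the 4 offsets would spend.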

And the astute reader will already be thinking - this is just PPM! In fact it is exactly a kind of PPM, in which you start out at low order (min match length, typically 3 or so) and your order gets deeper as you match. When you escape you jump back to order 3 coding, and if that escapes it jumps back to order 0 (literal).

There are several major problems :

1. Decoding is slow because you have to maintain the Suffix Trie for both encode and decode. You lose the simple LZ77 decoder.

2. Modern LZ's benefit a lot from modeling the numerical value of the offset in binary files. That's ugly & hard to do in this framework. This method is a lot simpler on text-like data that doesn't have numerical offset patterns.

3. It's not Pareto. If you're doing all this work you may as well just do PPM.

In any case it's theoretically interesting as an ideal of how you would encode LZ offsets if you could.

(and yes I know there have been many similar ideas in the past; LZFG of course, and Langdon's great LZ-PPM equivalence proof)

7/03/2014

07-03-14 - Oodle 1.41 Comparison Charts

I did some work for Oodle 1.41 on speeding up compressors. Mainly the Fast/VeryFast encoders got faster. I also took a pass at trying to make sure the various options were "Pareto", that is the best possible space/speed tradeoff. I had some options that were off the curve, like much slower than they needed to be, or just worse with no benefit, so it was just a mistake to use them (LZNib Normal was particularly bad).

Oodle 1.40 got the new LZA compressor. LZA is a very high compression arithmetic-coded LZ. The goal of LZA is as much compression as possible while retaining somewhat reasonable (or at least tolerable) decode speeds. My belief is that LZA should be used for internet distribution, but not for runtime loading.

The charts :

compression ratio : (raw/comp ratio; higher is better)

compressor VeryFast Fast Normal Optimal1 Optimal2
LZA 2.362 2.508 2.541 2.645 2.698
LZHLW 2.161 2.299 2.33 2.352 2.432
LZH 1.901 1.979 2.039 2.121 2.134
LZNIB 1.727 1.884 1.853 2.079 2.079
LZBLW 1.636 1.761 1.833 1.873 1.873
LZB16 1.481 1.571 1.654 1.674 1.674
lzmamax  : 2.665 to 1
lzmafast : 2.314 to 1
zlib9 : 1.883 to 1 
zlib5 : 1.871 to 1
lz4hc : 1.667 to 1
lz4fast : 1.464 to 1

encode speed : (mb/s)

compressor VeryFast Fast Normal Optimal1 Optimal2
LZA 23.05 12.7 6.27 1.54 1.07
LZHLW 59.67 19.16 7.21 4.67 1.96
LZH 76.08 17.08 11.7 0.83 0.46
LZNIB 182.14 43.87 10.76 0.51 0.51
LZBLW 246.83 49.67 1.62 1.61 1.61
LZB16 511.36 107.11 36.98 4.02 4.02
lzmamax  : 5.55
lzmafast : 11.08
zlib9 : 4.86
zlib5 : 25.23
lz4hc : 32.32
lz4fast : 606.37

decode speed : (mb/s)

compressor VeryFast Fast Normal Optimal1 Optimal2
LZA 34.93 37.15 37.76 37.48 37.81
LZHLW 363.94 385.85 384.83 391.28 388.4
LZH 357.62 392.35 397.72 387.28 383.38
LZNIB 923.66 987.11 903.21 1195.66 1194.75
LZBLW 2545.35 2495.37 2465.99 2514.48 2515.25
LZB16 2752.65 2598.69 2687.85 2768.34 2765.92
lzmamax  : 42.17
lzmafast : 40.22
zlib9 : 308.93
zlib5 : 302.53
lz4hc : 2363.75
lz4fast : 2288.58

While working on LZA I found some encoder speed wins that I ported back to LZHLW (mainly in Fast and VeryFast). A big one is to early out for last offsets; when I get a last offset match > N long, I just take it and don't even look for non-last-offset matches. This is done in the non-Optimal modes, and surprisingly hurts compression almost not at all while helping speed a lot.
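A sketch of that early-out (hypothetical names and threshold; the normal match finder is a callback here just so the skip is visible) :

```cpp
#include <functional>

struct LZMatch { int offset; int length; };

// if the last-offset match is already long, take it and skip the
// (expensive) normal match finder entirely
LZMatch ChooseMatch( const LZMatch & last_offset_match,
                     const std::function<LZMatch()> & find_normal_match )
{
    const int kEarlyOutLen = 16; // hypothetical "long enough" threshold N
    if ( last_offset_match.length >= kEarlyOutLen )
        return last_offset_match; // early out : don't even look
    LZMatch m = find_normal_match();
    return ( m.length > last_offset_match.length ) ? m : last_offset_match;
}
```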

Four of the compressors are now in pretty good shape (LZA,LZHLW,LZNIB, and LZB16). There are a few minor issues to fix someday (someday = never unless the need arises) :

LZA decoder should be a little faster (currently lags LZMA a tiny bit). LZA Optimal1 would be better with a semi-greedy match finder like MMC (LZMA is much faster to encode than me at the same compression level; perhaps a different optimal parse scheme is needed too). LZA Optimal2 should seed with multi-parse. LZHLW Optimal could be faster. LZNIB Normal needs much better match selection heuristics, the ones I have are really just not right. LZNIB Optimal should be faster; needs a better way to do threshold-match-finding. LZB16 Optimal should be faster; needs a better 64k-sliding-window match finder.

The LZH and LZBLW compressors are a bit neglected and you can see they still have some of the anomalies in the space/speed tradeoff curve, like the Normal encode speed for LZBLW is so bad that you may as well just use Optimal. Put aside until there's a reason to fix them.


If another game developer tells me that "zlib is a great compromise and you probably can't beat it by much" I'm going to murder them. For the record :

zlib -9 :
4.86 MB/sec to encode
308.93 MB/sec to decode
1.883 to 1 compression

LZHLW Optimal1 :
4.67 MB/sec to encode
391.28 MB/sec to decode
2.352 to 1 compression
come on, zlib! The encoder is slow, the decoder is slow, and it compresses poorly.

LZMA in very high compression settings is a good tradeoff. In its low compression fast modes, it's very poor. zlib has the same flaw - they just don't have good encoders for fast compression modes.

LZ4 I have no issues with; in its designed zone it offers excellent tradeoffs.


In most cases the encoder implementations are :


VeryFast =
cache table match finder
single hash
greedy parse

Fast = 
cache table match finder
hash with ways
second hash
lazy parse
very simple heuristic decisions

Normal =
varies a lot for the different compressors
generally something like a hash-link match finder
or a cache table with more ways
more lazy eval
more careful "is match better" heuristics

Optimal =
exact match finder (SuffixTrie or similar)
cost-based match decision, not heuristic
backward exact parse of LZB16
all others have "last offset" so require an approximate forward parse

I'm mostly ripping out my Hash->Link match finders and replacing them with N-way cache tables. While the cache table is slightly worse for compression, it's a big speed win, which makes it better on the space-speed tradeoff spectrum.

I don't have a good solution for windowed optimal parse match finding (such as LZB16-Optimal). I'm currently using overlapped suffix arrays, but that's not awesome. Sliding window SuffixTrie is an engineering nightmare but would probably be good for that. MMC is a pretty good compromise in practice, though it's not exact and does have degenerate case breakdowns.


LZB16's encode speed is very sensitive to the hash table size.


-h12
24,700,820 ->16,944,823 =  5.488 bpb =  1.458 to 1
encode           : 0.045 seconds, 161.75 b/kc, rate= 550.51 mb/s
decode           : 0.009 seconds, 849.04 b/kc, rate= 2889.66 mb/s

-h13
24,700,820 ->16,682,108 =  5.403 bpb =  1.481 to 1
encode           : 0.049 seconds, 148.08 b/kc, rate= 503.97 mb/s
decode           : 0.009 seconds, 827.85 b/kc, rate= 2817.56 mb/s

-h14
24,700,820 ->16,491,675 =  5.341 bpb =  1.498 to 1
encode           : 0.055 seconds, 133.07 b/kc, rate= 452.89 mb/s
decode           : 0.009 seconds, 812.73 b/kc, rate= 2766.10 mb/s

-h15
24,700,820 ->16,409,957 =  5.315 bpb =  1.505 to 1
encode           : 0.064 seconds, 113.23 b/kc, rate= 385.37 mb/s
decode           : 0.009 seconds, 802.46 b/kc, rate= 2731.13 mb/s

If you accidentally set it too big you get a huge drop-off in speed. (The charts above show -h13 ; -h12 is more comparable to lz4fast (which was built with HASH_LOG=12)).

I stole an idea from LZ4 that helped the encoder speed a lot. (lz4fast is very good!) Instead of doing the basic loop like :


while(!eof)
{
  if ( match )
    output match
  else
    output literal
}

instead do :

while(!eof)
{
  while( ! match )
  {
    output literal
  }

  output match
}

This lets you make a tight loop just for outputting literals. It makes it clearer to you as a programmer what's happening in that loop and you can save work and simplify things. It winds up being a lot faster. (I've been doing the same thing in my decoders forever but hadn't done it in the encoder).

My LZB16 is very slightly more complex to encode than LZ4, because I do some things that let me have a faster decoder. For example my normal matches are all no-overlap, and I hide the overlap matches in the excess-match-length branch.

6/21/2014

06-21-14 - Suffix Trie Note

A small note on Suffix Tries for LZ compression.

See previously :

Sketch of Suffix Trie for Last Occurance

So. Reminder to myself : Suffix Tries for optimal parsing is clean and awesome. But *only* for finding the length of the longest match. *not* for finding the lowest offset of that match. And *not* for finding the longest match length and the lowest offset of any other (shorter) matches.

I wrote before about the heuristic I currently use in Oodle to solve this. I find the longest match in the trie, then I walk up to parent nodes and see if they provide lower offset / short length matches, because those may be also interesting to consider in the optimal parser.

(eg. for clarity, the situation you need to consider is something like a match of length 8 at offset 482313 vs. a match of length 6 at offset 4 ; it's important to find that lower-length lower-offset match so that you can consider the cost of it, since it might be much cheaper)

Now, I tested the heuristic of just doing parent-gathers and limited updates, and it performed well *in my LZH coder*. It does *not* necessarily perform well with other coders.

It can miss out on some very low offset short matches. You may need to supplement the Suffix Trie with an additional short range matcher, like even just a 1024 entry hash-chain matcher. Or maybe a [256*256*256] array of the last occurrence location of a trigram. Even just checking at offset=1 for the RLE match is helpful. Whether or not they are important depends on the back end coder, so you just have to try it.
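The trigram-array idea is about as simple as a matcher gets; a sketch (the full 16M-entry table is 64 MB as ints, so a real implementation might hash instead) :

```cpp
#include <vector>
#include <string>
#include <cstdint>

// last-occurrence position of every possible trigram ; -1 = never seen
struct TrigramMatcher
{
    std::vector<int> last_pos;
    TrigramMatcher() : last_pos( 1 << 24, -1 ) { }

    // report where buf[pos..pos+2] last occurred, then record this occurrence
    int FindAndUpdate( const std::string & buf, int pos )
    {
        uint32_t t = ( (uint32_t)(uint8_t)buf[pos]   << 16 )
                   | ( (uint32_t)(uint8_t)buf[pos+1] <<  8 )
                   |   (uint32_t)(uint8_t)buf[pos+2];
        int prev = last_pos[t];
        last_pos[t] = pos;
        return prev;
    }
};
```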

For LZA I ran into another problem :

The Suffix Trie exactly finds the length of the longest match in O(N). That's fine. The problem is when you go up to the parent nodes - the node depth is *not* the longest match length with the pointer there. It's just the *minimum* match length. The true match length might be anywhere up to *and including* the longest match length.

In LZH I was considering those matches with the node depth as the match length. And actually I re-tested it with the correct match length, and it makes very little difference.

Because LZA does LAM exclusion, it's crucial that you actually find what the longest ML is for that offset.

(note that the original LZMA exclude coder is actually just a *statistical* exclude coder; it is still capable of coding the excluded character, it just has very low probability. My modified version that only codes 7 bits instead of 8 is not capable of coding the excluded character, so you must not allow this.)

One bit of ugliness is that extending the match to find its true length is not part of the neat O(N) time query.
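That extension step is straightforward, just not free. A sketch, assuming you have the candidate offset and the node-depth minimum length in hand (names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: a parent trie node's depth is only a *minimum* match length
// for the offset stored there; the true length has to be found by
// extending byte-by-byte, outside the neat O(N) trie query.
static int ExtendMatch(const uint8_t * buf, int pos, int offset,
                       int minLen, int bufLen)
{
    int len = minLen;
    while ( pos + len < bufLen && buf[pos + len] == buf[pos + len - offset] )
        len++;
    return len;
}
```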

In any case, I think this is all a bit of a dead-end for me. I'd rather move my LZA parser to be forward-only and get away from the "find a match at every position" requirement. That allows you to take big steps when you find long matches and makes the whole thing faster.

6/18/2014

06-18-14 - Oodle Network Test Results

Well, it's been several months now that we've been testing out the beta of Oodle Network Compression on data from real game developers.

Most of the tests have been UDP, with a few TCP. We've done a few tests on data from the major engines (UE3/4, Source) that do delta property bit-packing. Some of the MMO type games with large packets were using zlib on packets.

This is a summary of all the major tests that I've run. This is not excluding any tests where we did badly. So far we have done very well on every single packet capture we've seen from game developers.


MMO game :
427.0 -> 182.7 average = 2.337:1 = 57.21% reduction
compare to zlib -5 : 427.0 -> 271.9 average


MMRTS game :
122.0 -> 75.6 average = 1.615:1 = 38.08% reduction


Unreal game :
220.9 -> 143.3 average = 1.542:1 = 35.15% reduction


Tiny packet game :
21.9 -> 15.6 average = 1.403:1 = 28.72% reduction


Large packet Source engine game :
798.2 -> 519.6 average = 1.536:1 = 34.90% reduction

Some of the tests surprised even me, particularly the tiny packet one. When I saw the average size was only 22 bytes I didn't think we had much room to work with, but we did!

Some notes :

  • Oodle Network Compression provides > 28% reduction in all cases seen so far.

  • Oodle works even on tiny packets!

  • Oodle greatly out-compresses zlib. Some developers think that zlib is a great compromise for space and speed. Oodle can do much better! Oodle is also generally faster to encode than zlib and uses less memory per channel.

  • Oodle works even on games that are already doing bit-packing and delta updates (like most of the major traditional engines do).

  • Oodle easily pays for itself; the savings in bandwidth are much greater than the cost of Oodle. In addition to the cost savings, Oodle means you can run more players.

6/16/2014

06-16-14 - Rep0 Exclusion in LZMA-like coders

For reference on this topic : see the last post.

I believe there's a mistake in LZMA. I could be totally wrong about that because reading the 7z code is very hard; in any case I'm going to talk about Rep0 exclusion. I believe that LZMA does not do this the way it should, and perhaps this change should be integrated into a future version. In general I have found LZMA to be very good and most of its design decisions have been excellent. My intention is not to nitpick it, but to give back to a very nice algorithm which has been generously made public by its author, Igor Pavlov.

LZMA does "literal-after-match" exclusion. I talked a bit about this last time. Basically, after a match you know that the next literal cannot be the one that would have continued the match. If it was you would have just written a longer match. This relies on always writing the maximum length for matches.

To model "LAM" exclusion, LZMA uses a funny way of doing the binary arithmetic model for those symbols. I wrote a bit about that last time, and the LZMA way to do that is good.

LZMA uses LAM exclusion on the first literal after a match, and then does normal 8-bit literal coding if there are more literals.

That all seems fine, and I worked on the Oodle LZ-Arith variant for about a month with a similar scheme, thinking it was right.

But there's a wrinkle. LZMA also does "rep0len1" coding.

For background, LZMA, like LZX before it, does "repeat match" coding. A rep match means using one of the last N offsets (usually N = 3); you flag that and send it in very few bits. I've talked about the surprising importance of repeat matches before (also here and other places).

LZMA, like LZX, codes rep matches with MML of 2.

But LZMA also has "rep0len1". rep0len1 codes a single symbol at the 0th repeat match offset. That's the last offset matched from. That's the same offset that provides the LAM exclusion. In fact you can state the LAM exclusion as "rep0len1 cannot occur on the symbol after a match". (and in fact LZMA gets that right and doesn't try to code the rep0len1 bit after a match).

rep0len1 is not a win on text, but it's a decent little win on binary structured data (see example at the bottom of this post). It lets you get things like the top byte of a series of floats (off=4 len1 match, then 3 literals).

The thing is, if you're doing the rep0len1 coding, then normal literals also cannot be the rep0len1 symbol. If they were, then you would just code them with the rep0len1 flag.

So *every* literal should be coded with rep0 exclusion. Not just the literal after a match. And in fact the normal 8-bit literal coding path without exclusion is never used.

To be concrete, coding a literal in LZMA should look like this :


cur_lit = buffer[ pos ];

rep0_lit = buffer[ pos - rep0_offset ];

if ( after match )
{
    // LAM exclusion means cur_lit should never be = rep0_lit
    ASSERT( cur_lit != rep0_lit );
}
else
{
    if ( cur_lit == rep0_lit )
    {
        // lit is rep0, send rep0len1 :
        ... encode rep0len1 flag ...

        // do not drop through
        return;
    }
}

// either way, we now have exclusion :
ASSERT( cur_lit != rep0_lit );

encode_with_exclude( cur_lit, rep0_lit );

and that provides a pretty solid win. Of all the things I did to try to beat LZMA, this was the only clear winner.


ADDENDUM : some notes on this.

Note that the LZMA binary-exclude coder is *not* just doing exclusion. It's also using the exclude symbol as modelling context. Pure exclusion would just take the probability of the excluded symbol and distribute it to the other symbols, in proportion to their probability.

It turns out that the rep0 literal is an excellent predictor, even beyond just exclusion.

That is, say you're doing normal 8-bit literal coding with no exclusion. You are allowed to use an 8-bit literal as context. You can either use the order-1 literal (that's buffer[pos-1]) or the rep0 literal (that's buffer[pos-rep0_offset]).

It's better to use the rep0 literal!

Of course the rep0 literal becomes a weaker predictor as you get away from the end of the match. It's very good on the literal right after a match (LAM exclusion), and still very good on the next literal, and then steadily weaker.

It turns out the transition point is 4-6 literals away from the end of the match; that's the point at which the o1 symbol becomes more correlated to the current symbol than the rep0 lit.
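A hypothetical sketch of the context selection this implies. The cutoff of 5 is just an assumption picked from the 4-6 range above, not a tuned value, and the function name is mine:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: near the end of a match, use the rep0 literal as the 8-bit
// literal context; once enough literals have gone by, the order-1
// literal becomes the better predictor, so switch to it.
static int ChooseLiteralContext(const uint8_t * buf, int pos,
                                int rep0_offset, int litsSinceMatch)
{
    const int kCutoff = 5; // assumed crossover, per the 4-6 range above
    if ( litsSinceMatch < kCutoff )
        return buf[pos - rep0_offset]; // rep0 literal context
    else
        return buf[pos - 1];           // order-1 context
}
```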

One of the ideas that I had for Oodle LZA was to remove the rep0len1 flag completely and instead get the same effect from context modeling. You can instead take the rep0 lit and use it as an 8-bit context for literal coding, and should get the same benefit. (the coding of the match flag is implicit in the probability model).

I believe the reason I couldn't find a win there is because it turns out that LZ literal coding needs to adapt very fast. You want very few context bits, you want super fast adaptation of the top bits. Part of the issue is that you don't see LZ literals very often; there are big gaps where you had matches, so you aren't getting as much data to adapt to the file. But I can't entirely explain it.

You can intuitively understand why the rep0 literal is such a strong predictor even when it doesn't match. You've taken a string from earlier in the file, and blacked out one symbol. You're told what the symbol was before, and you're told that in the new string it is not that symbol. It's something like :


"Ther" matched before
'r' is to be substituted
"The?"
What is ? , given that it was 'r' but isn't 'r' here.

Given only the o1 symbol ('e') and the substituted symbol ('r'), you can make a very good guess of what should be there ('n' probably, maybe 'm', or 's', ...). Obviously more context would help, but with limited information, the substituted symbol (rep0 lit) sort of gives you a hint about the whole area.

An even simpler case : given just the fact that the rep0 lit is upper or lower case - you're likely to substitute it with a character of the same case. Similarly if it's a vowel or consonant, you're likely to substitute with one of the same. etc. and of course I'm just using English text because it's intuitive, it works just as well on binary structured data.


There's another very small flaw in LZMA's exclude coder, which is more of a minor detail, but I'll include it here.

The LZMA binary exclude coder is equivalent to this clearer version that I posted last time :


void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        ASSERT( ctx < 256 );
        m_bins[256 + ctx + (exclude_bit<<8)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( ctx >= 256 )
            return;
        if ( bit != exclude_bit )
            break;
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}
This codes up to 8 bits while the bits of "val" match the bits of "exclude", and up to 8 bits while the bits of "val" don't match.

Now, obviously the very first bit can never be coded in the unmatched phase. So we could eliminate that from the unmatched bins. But that only saves us one slot.

(and actually I'm already wasting a slot intentionally; the "ctx" with place value bit like this is always >= 1 , so you should be indexing at "ctx-1" if you want a tight packed array. I intentionally don't do that, so that I have 256 bins instead of 255, because it makes the addressing work with "<<8" instead of "*255")

More importantly, in the "matched" phase, you don't actually need to code all 8 bits. If you code 7 bits, then you know that "val" and "exclude" match in all top 7 bits, so it must be that val == exclude^1. That is, it's just one bit flip away; the decoder will also know that so you can just not send it.

The fixed encoder is :


void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        ASSERT( ctx < 128 );
        m_bins[256 + ctx + (exclude_bit<<7)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( bit != exclude_bit )
            break;
        if ( ctx >= 128 )
        {
            // I've coded 7 bits
            // and they all match
            // no need to code the last one
            return;
        }
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}
Note that now ctx in the matched phase can only reach 128. That means this coder actually only needs 2*256 bins, not 3*256 bins as stated last time.

This is a little speed savings (tiny because we only get one less arith coding event on a rare path), a little compression savings (tiny because that bottom bit models very well), and a little memory use savings.

6/12/2014

06-12-14 - Some LZMA Notes

I've been working on an LZ-Arith for Oodle, and of course the benchmark to beat is LZMA, so I've had a look at a few things.

Some previous posts related to things I'll discuss today :

cbloom rants 09-27-08 - On LZ and ACB
cbloom rants 10-01-08 - First Look at LZMA
cbloom rants 10-05-08 - 5 Rant on New Arithmetic Coders
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-03-10 - LZ and Exclusions

Some non-trivial things I have noticed :

1. The standard way of coding literals with a binary arithmetic coder has a subtle quirk to it.

LZMA uses the now standard fractional update method for binary probability modeling. That's p0 -= (p0 >> updshift) and so on. See for example : 10-05-08 - 5 : Rant on New Arithmetic Coders .

The fractional update method is an approximation of a standard {num0,num1} binary model in which you are kept right at the renormalization threshold. That is, a counting model does :

P0 = num0 / (num0+num1);
... do coding ...
if ( bit ) num1++;
else num0++;
if ( (num0+num1) > renorm_threshold )
{
  // scale down somehow; traditionally num0 >>= 1; num1 >>= 1;
}
The fractional shift method is equivalent to :

num0 = P0;
num1 = (1<<frac_tot) - P0;
if ( bit ) num1++;
else num0++;

// num0+num1 is now ((1<<frac_tot)+1); rescale :
P0 = num0 * (1<<frac_tot)/((1<<frac_tot)+1);

That is, it assumes you're right at renormalization threshold and keeps you there.

The important thing about this is adaptation speed.

A traditional {num0,num1} model adapts very quickly at first. Each observed bit causes a big change to P0 because total is small. As total gets larger, it becomes more stable, it has more inertia and adapts more slowly. The renorm_threshold sets a minimum adaptation speed; that is, it prevents the model from becoming too full of old data and too slow to respond to new data.
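For concreteness, a minimal sketch of the fractional-shift update described above. The probability scale and shift here are illustrative constants, not LZMA's exact values:

```cpp
#include <cassert>

// Sketch of the standard fractional-shift binary model update,
// the "p0 -= (p0 >> updshift)" scheme. Smaller kUpdShift means
// faster adaptation (equivalent to a lower renorm threshold).
struct BinShiftModel
{
    static const int kProbBits = 12;
    static const int kProbOne  = 1 << kProbBits;
    static const int kUpdShift = 5; // illustrative choice

    int p0; // probability of a 0 bit, scaled to [0, kProbOne]

    BinShiftModel() : p0(kProbOne / 2) {}

    void Update(int bit)
    {
        if ( bit ) p0 -= (p0 >> kUpdShift);              // saw a 1 : p0 shrinks
        else       p0 += ((kProbOne - p0) >> kUpdShift); // saw a 0 : p0 grows
    }
};
```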

Okay, that's all background. Now let's look at coding literals.

The standard way to code an N bit literal using a binary arithmetic coder is to code each bit one by one, either top down or bottom up, and use the previous coded bits as context, so that each subtree of the binary tree gets its own probability models. Something like :


ctx = 1;
while( ctx < 256 ) // 8 codings
{
    int bit = (val >> 7)&1; // get top bit
    val <<= 1; // slide val for next coding
    BinaryCode( bit, p0[ctx-1] );
    // put bit in ctx for next event
    ctx = (ctx<<1) | bit;
}

Okay.

Now first of all there is a common misconception that binary coding is somehow different than N-ary arithmetic coding, or that it will work better on "binary data" that is somehow organized "bitwise" vs text-like data. That is not strictly true.

If we use a pure counting model for our N-ary code and our binary code, and we have not reached the renormalization threshold, then they are in fact *identical*. Exactly identical.

For example, say we're coding two-bit literals :


The initial counts are :

0: 3
1: 1
2: 5
3: 4
total = 13

we code a 2 with probability 5/13 in log2(13/5) bits = 1.37851
and its count becomes 6

With binary modeling the counts are :

no ctx
0: 4
1: 9

ctx=0
0: 3
1: 1

ctx=1
0: 5
1: 4

to code a "2"
we first code a 1 bit with no context
with probability 9/13 in log2(13/9) bits = 0.53051
and the counts become {4,10}

then we code a 0 bit with a 1 context
with probability 5/9 in log2(9/5) bits = 0.84800
and the counts become {6,4}

And of course 1.37851 = 0.53051 + 0.84800

The coding is exactly the same. (and furthermore, binary coding top down or bottom up is also exactly the same).
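The equivalence in the worked example above can be checked directly; this just recomputes the bit costs from the counts given:

```cpp
#include <cassert>
#include <cmath>

// N-ary : counts {3,1,5,4}, code symbol "2" with P = 5/13
static double NaryCost()
{
    return std::log2(13.0 / 5.0);
}

// Binary : top bit = 1 with P = 9/13, then bottom bit = 0
// in the ctx=1 bin with P = 5/9
static double BinaryCost()
{
    return std::log2(13.0 / 9.0) + std::log2(9.0 / 5.0);
}
```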

However, there is a difference, and this is where the quirk of LZMA comes in. Once you start hitting the renormalization threshold, so that the adaptation speed is clamped, they do behave differently.

In a binary model, you will see many more events at the top bit. The exact number depends on how spread your statistics are. If all 256 symbols are equally likely, then the top bit is coded 128X more often than the bottom bits (and each of the next bits is coded 64X, etc.). If only one symbol actually occurs then all the bit levels will be coded the same number of times. In practice it's somewhere in between.

If you were trying to match the normal N-ary counting model, then the binary model should have much *slower* adaptation for the top bit than it does for the bottom bit. With a "fractional shift" binary arithmetic coder that would mean using a different "update shift".

But LZMA, like most code I've seen that implements this kind of binary coding of literals, does not use different adaptation rates for each bit level. Instead they just blindly use the same binary coder for each bit level.

This is wrong, but it turns out to be right. I tested a bunch of variations and found that the LZMA way is best on my test set. It seems that having much faster adaptation of the top bits is a good thing.

Note that this is a consequence of using unequal contexts for the different bit levels. The top bit has 0 bits of context, while the bottom bit has 7 bits of context, which means its statistics are diluted 128X (or less). If you do an order-1 literal coder this way, the top bit has 8 bits of context while the bottom bit gets 15 bits.

2. The LZMA literal-after-match coding is just an exclude

I wrote before (here : cbloom rants 08-20-10 - Deobfuscating LZMA ) about the "funny xor thing" in the literal-after-match coder. Turns out I was wrong, it's not really funny at all.

In LZ coding, there's a very powerful exclusion that can be applied. If you always output matches of the maximum length (more on this later), then you know that the next symbol cannot be the one that followed in the match. Eg. if you just copied a match from "what" but only took 3 symbols, then you know the next symbol cannot be "t", since you would have just done a length-4 match in that case.

This is a particularly good exclusion because the symbol that followed in the match is what you would predict to be the most probable symbol at that spot!

That is, say you need to predict the MPS (most probable symbol) at any spot in the file. Well, what you do is look at the preceding context of symbols and find the longest previous match of the context, and take the symbol that follows that context. This is "PPM*" essentially.

So when you code a literal after a match in LZ, you really want to do exclusion of the last-match predicted symbol. In a normal N-ary arithmetic coder, you would simply set the count of that symbol to 0. But it's not so simple with the binary arithmetic coder.
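To spell out the easy N-ary case first, a sketch of exclusion in a counting model (illustrative code, not any particular coder's):

```cpp
#include <cassert>

// Sketch: exclusion in an N-ary counting model is trivial; treat the
// excluded symbol's count as zero, so its probability mass is
// redistributed to the other symbols in proportion to their counts.
static double ExcludedProb(const int * counts, int numSyms,
                           int sym, int exclude)
{
    int total = 0;
    for (int i = 0; i < numSyms; i++)
        if ( i != exclude )
            total += counts[i];
    return (sym == exclude) ? 0.0 : (double)counts[sym] / total;
}
```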

With a binary arithmetic coder, let's say you have the same top 7 bits as the exclude symbol. Well then, you know exactly what your bottom bit must be without doing any coding at all - it must be the bit that doesn't match the exclude symbol. At the next bit level above that, you can't strictly exclude, but you can probabilistically exclude. That is :


Working backwards from the bottom :

At bit level 0 :

if symbol top 7 bits == exclude top 7 bits
then full exclude

that is, probability of current bit == exclude bit is zero

At bit level 1 :

if symbol top 6 bits == exclude top 6 bits
then

if symbol current bit matches exclude current bit, I will get full exclusion in the next level
so chance of that path is reduced but not zero

the other binary path is unaffected

that is, we're currently coding to decide between 4 symbols.  Something like :

0 : {A,B}
1 : {C,D}

we should have P0 = (PA+PB)/(PA+PB+PC+PD)

but we exclude one; let's say B, so instead we want to code with P0 = PA/(PA+PC+PD)

etc..

That is, the exclude is strongest at the bottom bit level, and becomes less strong as you go back up to higher bit levels, because there are more symbols on each branch than just the exclude symbol.

The LZMA implementation of this is :


  static void LitEnc_EncodeMatched(CRangeEnc *p, CLzmaProb *probs, UInt32 symbol, UInt32 matchByte)
  {
    UInt32 offs = 0x100;
    symbol |= 0x100;
    do
    {
      matchByte <<= 1;
      RangeEnc_EncodeBit(p, probs + (offs + (matchByte & offs) + (symbol >> 8)), (symbol >> 7) & 1);
      symbol <<= 1;
      offs &= ~(matchByte ^ symbol);
    }
    while (symbol < 0x10000);
  }

I rewrote it to understand it; maybe this is clearer :

void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    // same thing but maybe clearer :
    bool matched = true;        
    val |= 0x100; // place holder top bit

    for(int i=0;i<8;i++) // 8 bit literal
    {
        int exclude_bit = (exclude >> (7-i)) & 1;
        int bit = (val >> (7-i)) & 1;

        int context = val >> (8-i);
        if ( matched )
            context += exclude_bit?512:256;

        m_probs[context].encode(arith,bit);

        if ( bit != exclude_bit )
            matched = false;
    }
}

We're tracking a running flag ("matched" or "offs") which tells us if we are on the same path of the binary tree as the exclude symbol. That is, do all prior bits match. If so, that steps us into another group of contexts, and we add the current bit from the exclude symbol to our context.

Now of course "matched" always starts true, and only turns to false once, and then stays false. So we can instead implement this as two loops with a break :


void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[256 + ctx + (exclude_bit<<8)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( ctx >= 256 )
            return;
        if ( bit != exclude_bit )
            break;
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}

It's actually not weird at all, it's just the way to do symbol exclusion with a binary coder.

ADDENDUM : maybe I'm going too far saying it's not weird. It is a bit weird; sort of like point 1, it's actually not right, but in a good way.

The thing that's weird is that when coding the top bits, it's only using the bits seen so far of the exclude symbol. If you wanted to do a correct probability exclusion, you need *all* the bits of the exclude symbol, so that you can see exactly what symbol it is, how much probability it contributes to that side of the binary tree.

The LZMA way appears to work significantly better than doing the full exclude.

That is, it's discarding some bits of the exclude as context, and that seems to help due to some issue with sparsity and adaptation rates. The LZMA way uses 3*256 binary probabilities, while full exclusion uses 9*256. (though in both cases, not all probs are actually used; eg. the first bit is always coded from the "matched" probs, not the "un-matched")

ADDENDUM2 : Let me say it again perhaps clearer.

The way to code a full exclude using binary modeling is :


coding "val" with exclusion of "exclude"

while bits of val coded so far match bits of exclude coded so far :
{
  N bits coded so far
  use 8 bits of exclude as context
  code current bit of val
  if current bit of val != same bit of exclude
    break;
}

while there are bits left to code in val
{
  N bits coded so far
  use N bits of val as context
  code current bit of val
}

The LZMA way is :

coding "val" with exclusion of "exclude"

while bits of val coded so far match bits of exclude coded so far :
{
  N bits coded so far
  use N+1 bits of exclude as context   // <- only difference is here
  code current bit of val
  if current bit of val != same bit of exclude
    break;
}

while there are bits left to code in val
{
  N bits coded so far
  use N bits of val as context
  code current bit of val
}

I also tried intermediate schemes like using N+2 bits of exclude (past bits+current bit+one lower bit) which should help a little to identify the exclusion probability without diluting statistics too much - they all hurt.

3. Optimal parsing and exclusions are either/or and equal

There are two major options for coding LZ-arith :

I. Do literal-after-match exclusion and always output the longest match. Use a very simplified optimal parser that only considers literal vs match (and a few other things). Essentially just a fancier lazy parse (sometimes called a "flexible parse").

II. Do not do literal-after-match exclusion, and consider many match lengths in an optimal parser.

It turns out that these give almost identical compression.

Case II has the simpler code stream because it doesn't require the literal-after-match special coder, but it's much much slower to encode at high compression because the optimal parser has to work much harder.

I've seen this same principle many times and it always sort of intrigues me. Either you can make a code format that explicitly avoids redundancy, or you can exploit that redundancy by writing an encoder that aggressively searches the coding space.

In this case the coding of exclude-after-match is quite simple, so it's definitely preferable to do that and not have to do the expensive optimal parse.

4. LZMA is very Pareto

I can't really find any modification to it that's a clear win. Obviously you can replace the statistical coders with either something faster (ANS) or something that gives more compression (CM) and you can move the space/speed tradeoff, but not in a clearly beneficial way.

That is, on the compression_ratio / speed / memory_use three-way tradeoff, if you hold any two of those constant, there's no improvement to be had in the other.

.. except for one flaw, which we'll talk about in the next post.
