cbloom rants: 2008

12/27/2008

12-26-08 - In Defense of Less Typing

In the C++ rant we got a bit into the discussion of "that just reduces typing". It's become a common wisdom these days that "anything that just reduces typing is not of significant importance in programming". This is sort of a reaction to the bad old days where developers put too much emphasis on reducing the time it took to make a first draft of the code. Now people like to repeat the plathitudes that "first draft is only 10% of dev time, debuggability and readability and maintainability are what really matter".

Yes, yes, that is all true, but I think people miss the forest for the trees here in some cases.

Every character you type is a chance for a bug or a mistake. For one thing there are typos, but there are also just brain farts. The less you type when writing a piece of code, the more likely it is to be correct.

(I should emphasize the fact that reducing code duplication is very good for many reasons that I don't go into detail much in this rant, and those are mainly the main reason to merge duplicate code; I'm talking about cases where the code is not exactly duplicated, but is similar, or where you have a choice between making a very simple API which requires a lot of typing by the client, or a more complex API which has very simple usage for the client).

A lot of good programmers now are adopting the idea of exposing simple minimal C-style APIs that leave usage up to the client. There are a lot of things to like about that (for example, Sean's stb.h style thing for simple functionality is in fact wonderfully easy to integrate and steal snippets from), but there are also bad things about it. I think good programmers overestimate their ability to write simple usage code without mistakes. For example, you might think that you don't need a class to encapsulate a 32-bit Color, you can easily just use a 4-byte array or pass around a dword and do the shifts by hand - but I have seen bug after bug from small mistakes in that kind of code, because if you write the same thing over and over, or copy-paste and try to change a bunch of code by hand, there is some small chance of mistake each time you do it.

It's funny to me that good programmers in game dev are going in two directions at the same time. One direction is to make code very easy to follow by simple visual inspection. Many major people in game dev are pretty high on this these days. The basic idea is to use C-style imperative programming, make the action of each line obvious to simple visual inspection, reduce segmentation of code into function calls and prefer simple long linear code blocks (rather than factored out functions, conditionals, objects, etc). There are certainly merits to that. The other direction people are going is custom metaprogramming and local language redefinition. Many of these people want coding languages where you can locally redefine the rules of the language and do custom markup for things like network mirroring, thread fiber spawning, local-global state memory variable differentiation, etc. This kind of stuff would make code completely impossible to understand by simple visual inspection without intimate undestanding of how all the extra meta-language rules work. These ideas also have a lot of merit, because writing micro-threaded massively parallel code in plain C-style C++ is really difficult and ugly, and having custom features would make it much better - but these two directions are totally conflicting.

While I'm ranting about opposite directions, let me also rant about the idea that something is a good idea for "library design" but not for within your app (or vice versa). IMO Coding = Library Design. Most of the complex code that I write is in "libraries" for myself. Libraries are just chunks of functionality that you want to expose in a compact interface. Well, that's what you should be doing all the time. Coding is just writing a bunch of libraries, then the "app" is just tying together the "libraries".

So, for example, Casey's excellent talk about good library design (things like exposing multiple levels of interface from very simple to nitty gritty, and not forcing a usage pattern on the client) are just good ways to write code *always*.

I don't trust the me of one year ago, nor do I trust the me of one year in the future. I need to write API's for myself that make me write the right code. Part of that is all the things I've often written about before (self-validating API's, API's that are impossible to use wrong), but part of it is just plain less typing. If the API makes me (the client) write a whole bunch of code to do the simple things I often want to do - that makes it far more likely I will do it wrong.

Also I believe the illusion of choice is a horrible thing. If there's really only one or two reasonable ways to use a system, then just expose that to me. Don't give me what looks like a bunch of flexible components, but they only really work right if you do one specific thing.

Addendum : okay I'm bored of this topic and I'm sure you are too, but I feel like I started it so I should wrap it up a bit more.

Paul Graham has this thing "Succinctness is Power" that's sort of similar to this rant. As usual he writes it well, but I think he's a little bit wrong. The issue that I believe is important, which is what I'm trying to talk about here is :

Reducing the number of choices that the programmer has to make in order to write correct code.

Part of that is reducing typing - but the crucial thing is reducing typing when there is actual choice in the alternatives. That is, if it's something you *have* to type, that's not bad. For example a very verbose language like COBOL is not inherently bad due to its verbosity (cobol is horrible for other reasons).

Making code that works correctly with minimal typing (and makes compile errors if you use it wrong) is the goal. So part of what I'm getting at here is using short common words that it's easy for the programmer to get right, using highly constrained classes instead of very general ones, etc.

Part of the credo goes like this :


remove the option to do things that you never want to do

make it hard to do things that you rarely want to do

make it easy to do the right thing

As an example - iterators are cool even when they save almost no work. Say for example you have something like a doubly linked list class. Many of the simple C guys would say "hey you can just write code to traverse the linked list", and you write client code like :

for(Node * n = list.head; n != list.head; n = n->next)
{
    ...
}

That's easy right, you're a good programmer, you can just write that loop. No. You can't, that's what I'm trying to say with this rant. I mean, yes, you can, but you had a 1% chance of introducing a bug because you had to write code.

Writing code is bad.

Instead you could make your Node look like an iterator. You could give it standard names like begin() and end(). Now instead of writing code you can just use for_each , or even just copy-paste or spit out a loop that you've done a million times :

for(Node::iterator it = list.begin(); it != list.end(); ++it)
{
    ...
}

Is safer because it's standard. On a similar note, using a constrained object like an iterator is safer than using an int, because every time you use an int people get tempted to do evil things. How many bugs have I seen because people try to get clever with their loop iteration? Maybe they count backwards for efficiency and use and unsigned type by mistake. Or they pull the ++i out of the for() and then forget to do it due to a continue. Or they use the "i" outside of the for loop and bugs get introduced.

Lots of people are anti-OOP these days. I love OOP ; no, not deep inheritance trees and lots of data inheritance, and whatnot, but the basic idea of coding in terms of objects that impose constraints and conditions on their use. The best way for me to program is to build components and helpers which make expressing the program easy.

12/18/2008

12-18-08 - Open Source Just Doesn't Work

There's a Netflix Plugin for Media Portal . I dare you to even try to figure out how to install it and get it to run. And then there are some big problems with control and integration once you get it running. I'm trying to get the developer to fix it; fingers crossed, it would be nice to have that. I keep hearing the Xbox Netflix integration is really awesome. I might have to get an Xbox just for that. Yay.

12/16/2008

12-16-08 - Libraries and Cinit

I need a kind of mini class-factory for Oodle. This is for when I load a paging bundle that's full of various resources, I need to make each one. (BTW there will be a "low level" usage pattern for Oodle where you just load a paging bundle and the game gets a handle to the bundle and does whatever it wants with it. This is for the "high level" layer that automates a lot of things for you but is optional).

I guess what I'm going to have to do is require the game to give me a creator function pointer that knows how to make all the objects. eg. it would give me something like


void * (*fp_factory) (int objectType);

and fp_factory would point to something like

void * GameObjectFactory(int type)
{
    switch(type)
    {
    case OBJECT_PLAYER : return (void *) new Player;
    case OBJECT_ENEMY: ...
    }
}

Or something. As a game developer I hate that kind of global registry, because when you add a new object type you have to remember to go update this global registry file, which becomes a frequent merge disaster for source control, etc. I really like having everything you need to make a game object in a single CPP file. That means objects should be self-registering.

The way I usually do self-registering objects is with a little class that runs at cinit. Something like :


#define IMPLEMENT_FACTORY_PRODUCT(classname)    namespace { ClassFactory::Registrar classname##registrar( classname::Factory , typeid(classname) ); }

then in Player.cpp you have :

IMPLEMENT_FACTORY_PRODUCT(Player);

That's all fine and dandy, but it's not so cool for me as a library maker.

For one thing, doing work during CINIT is kind of naughty as a library. I've been trying to make it a rule for myself that I don't use object constructors to do cinit work the way I sometimes like to do. It's a perfectly normal thing to do in C++, and if you are writing C++ code you should make all your systems instantiate-on-use like proper singletons so that CINIT objects work - BUT as a library maker I can't require the clients to have done all that right. In particular if I do something like allocations during CINIT, they might run before the client does its startup code to install custom allocators or whatever.

For another thing, there are problems with the linker and cinit in libraries. The linker can drop objects even though they are doing cinit calls that register them in global factory databases. There are various tricks to prevent this, but they are platform dependent and it is a little evil bit of spaghetti to get the client involved in.

I guess I probably also shouldn't rely on "typeid" or "dynamic_cast" or any other runtime type information existing either since people like to turn that off in the compiler options for no good reason (it has basically zero cost if you don't use it). So without that stuff I pretty much just have to rely on the game to give me type info manually anyway.

Bleh, I'm just rambling now...

12/15/2008

12-15-08 - Denoising

I've been playing around with denoising images a tiny bit. There's a ton of papers on this and I've barely only scratched the surface, but it strikes me that the majority of the researches seem to be doing silly things that are ill-conceived.

Almost all of them work in the same basic way. They create a prediction of what the current pixel should be with a local smooth predictor, let's call that 'S'. Then they take the difference from the actual pixel value 'A'. If the difference is greater than a certain threshold, |S - A| > T , they replace the pixel value with 'S'.

That's just very wrong. It assumes that images can be modeled with a single-lobe Gaussian probability distribution. In fact, images are better modeled as a blend of several lobes with different means. That is, there is not one single smooth generator, but many, which are switched or blended between based on some state.

Any single-lobe predictor will incorrectly identify some valid image detail as noise.

I like to make it clear that the problem has two parts : deciding if a pixel is noise or not noise, and then filling in a replacement value if you decide that the pixel is noise.

My feeling is that the second part is actually not the important or difficult part. Something like a median filter or a bilateral filter is probably an okay way to fill in a value once you decide that a pixel is noise. But the first part is tricky and as I've said any simple weighted linear predictor is no good.

Now, ideally we would have a rich model for the statistical generation of images. But I wrote about that before when talking about Image Doubling (aka Super-Resolution), and we're still very far from that.

In the mean time, the best thing we have at the moment, I believe, is the heuristic modes of something like CALIC, or the Augural Image Zooming paper, or Pyramid Workshop or TMW. Basically these methods have 6 - 100 simple models of local image neighborhoods. For example the basic CALIC models are : {locally smooth}, {vertical edge}, {horizontal edge}, {45 degree edge}, {135 degree edge}, {local digital}, {pattern/texture}. The neighborhood is first classified to one of the heuristic models, and then a prediction is made using that model.

We can thus propose a simple heuristic noise detection algorithm :

Bayesian Noise Detection :

N = current local neighborhood
A = actual pixel value

P(M|N) = probability of model M given neighborhood N
P(A|M) = probability of pixel A was generated by model M

let

P(A|N) = argmax{M} P(A|M) * P(M|N)

then classify A as noise if

P(A|N) < T

for some threshold T

(I don't specify how the P's are normalized because it just changes the scaling of T,
but they should be normalized in the same way for the whole image)

Note that a very crucial thing is that we are using the argmax on models, NOT the average on models. What we're saying is that if *any* of our heuristic local models had a high likelihood of generating the value A, then we do not consider it noise. The only values that are noise are ones which are unlikely under *all* models.

In a totally hand-wavey heuristic way, this is just saying that if a pixel is within threshold of being a locally smooth value, or an edge value, or a texture, etc. then it's not noise. If it fits none of those models within threshold, it's noise.

I started by looking at the Median Filter and the Bilateral Filter. There have been some cool recent papers on fast Median Filtering :
Constant Time Median Filter
Weiss Log(r) Median Filter
Fast Bilateral Filter ; Sylvain Paris and Fr�do Durand + good references
Siggraph Course on Bilateral Filtering

Those are all very worth reading even though I don't think it's actually super useful. The fast median filter approaches use cool ways of turning an operation over a sliding window into incremental operations that only process values getting added in and removed as you step the window. Median filter is a weird thing that works surprisingly well actually, but it does create a weird sort of Nagel-drawing type of look, with nothing but smooth gradients and hard edges. In fact it's a pretty good toon-shading process.

BTW the fast median filters seem useless for denoising, since they really only matter for large r (radius of the filter), and for denoising you really only want something like a 5x5 window, at which size a plain brute force median is faster.

Bilateral filter actually sort of magically does some of the heuristic cases that I've talked about it. Basically it makes a prediction using a filter weighted by distance and also value difference. So similar values contribute and disparate values don't. That actually does a sort of "feature selection". That is, if your pixel value is close to other pixel values in a vertical edge, then the bilateral filter will weight strongly on those other similar pixel values and ignore the other off-edge pixels. That's pretty good, and the results are in fact decent, but if you think about it more you see the bilateral filter is just a ghetto approximation of what you really want. Weighting based on pixel value difference is not the right way to blend models, it makes no distinction about the context of that value difference - eg. it doesn't care if the value difference comes from a smooth gradient or a sudden step. As others have noted, the Bilateral Filter makes the image converge towards piecewise-constant, which is obviously wrong. Getting towards piecewise linear would be better, piecewise bicubic would be better still - but even that is just the very easiest beginning of what the heuristic estimator can do.

NL Means is a denoising algorithm which is a bit closer to the right idea; he's definitely aware of the issues. However, the actual NL means method is poor. It relies on closely matching neighborhoods to form good predictions, which anyone who's worked in image compression or super-resolution knows is not a good approach. The problem is there are simply too many possible values in reasonably sized neighborhoods. That is, even for a moderately sized neighborhood like 5x5, you have 2^8^25 possible values = 2^200. No matter how much you train, the space is too sparse. It may seem from the NL Means formulation that you're weighting in various neighbors, but in reality in practice you only find a few neighbors that are reasonably close and they get almost all of the weight, and they might not even be very close. It's like doing K-means with 2^200 possible values - not good.

There's a lot of work on Wavelet Denoising which I haven't really read. There are some obvious appealing aspects of that. With wavelets you can almost turn an image into a sum of {smooth}+{edges}+{texture}+{noise} and then just kill the noise. But I don't really like the idea of working in wavelet domain, because you wind up affecting a lot of pixels. Most of the noise I care about comes from cameras, which means the noise is in fact isolated to 1 or 2 adjacent pixels. I also don't like all the research papers that want to use 9x9 or larger local windows. Real images are very local, their statistics change very fast, and pulling in shit from a 9x9 window is just going to mess them up. IMO a 5x5 window is the reasonable size for typical image resolutions of today.

BTW one thing I've noticed with my camera noise images is that the fucking camera JPEG makes the noise so much harder to remove. The noise looks like it's originally just true single pixel noise, but when it goes through the camera JPEG, that single-pixel peak is really unfriendly to the DCT, so it gets smeared out, and you wind up having noise lumps that look like little DCT shapes. To specifically denoise photos that have gone through JPEG you probably have to work on 8x8 blocks and work directly on the DCT coefficients. (also the Bayer pattern demosaic obviously spreads noise as well; ideally you'd get to work on the raw taps before jpeg, before the camera denoise, and before the Bayer demosaic).

ADDENDUM : a lot of the denoise people seem to be trying to perfect the Playboy Centerfold algorithm, that makes photos look extremely smoothed and airbrushed. Often if you're not sure a pixel is noise it's best to leave it alone. Also, all the methods that use a pixel-magnitude threshold value for noise are wrong. The threshold for noise needs to be context sensitive. That is, in smooth parts of the image, you might be able to say that a pixel is probably noise when it's off from expectation by only 1 or 2 pixel values. In chaotic textures parts of the image, a pixel might be off by 20 values or more and you still might not be able to say it's noise. The correct parameter to expose to the user is a *confidence*. That is, I want to do something like replace all pixels which the algorithm is >= 90% confident it can improve.

Another problem I've seen with the majority of the denoisers is that they create structures from noise. If you run them on just staticy junk, they will form local flat junks, or linear bits or weird feathery patterns. This is because even in random noise there will be little bits that have similar values so they become seeds to create structures. This is very bad, the weird structures that are formed by this "denoising" are much worse than the original static, which is pretty inoffensive to the eye.

Marc sent me the link to GREYCstoration a free denoiser based on the image manifold PDE research. I don't like that this technique is becoming known as "PDE" - PDE just means partial differential equation; in this case it's a diffusion equation, in particular a variant of anisotropic diffusion with different diffusion rates based on the local curvature - eg. it diffuses across smooth areas and not across edges. (that's actually an old basic technique, the new thing he does is the diffusion follows contour lines (but is actually 2d, just weighted) and works on all components). It looks pretty decent. It's actually more exciting to me for super-resolution, it looks like it does a pretty good job of image super-resolution.

12/08/2008

12-08-08 - File Hashes

So I'm sticking a file hash in Oodle and thought I'd test some of the stuff out there. Test candidates :

SHA1 (Sean's stb.h implementation)

MD5 (OpenSSL implementation)

BurtleBurtle Lookup3 (hashlittle2)

Cessu's SSE2 hash (+ extra tail code I added)

CRC32

CRC32+32

In all cases I create a 64 bit hash. Hey, it's plenty of bits, it's easier to pass around cuz I have a 64 bit type, and it makes it a fair competition. SHA1 makes 160 bits (= 5 dwords), MD5 makes 128 bits (= 4 dwords), so I use Bob's Mix method to get that down to 64 bits.

A lot of people think SHA1 or MD5 or something is the "right thing" to do for file hashes. That's not really true. Those hashes were designed for cryptography which is not the same purpose. In particular, they are slow *on purpose* because they want to be hard to invert. They also make tons of bits, not because you need tons of bits to tell files apart, but again to make them hard to invert by brute force attack. I don't care about my file hashes being vulnerable to attack, I just want the chance of accidental collisions to be microscopic.

CRC32+32 means doing CRC32 on alternating bytes and jamming them together to make 64 bits. This is not a true "CRC64" but I might refer to it as CRC64 sometimes. (suggestion by "Joe Smith" ; Joe Smith? Is that a pseudonym?)

Just for background, if the 64 bit hashes are "perfect" - that is the 64 bit words coming out of them are random in every bit, even for input that is very non-random - then the chance of collision is indeed microscopic. (in fact you only need maybe 40 bits). The number of items you can hash in B bits is around 2^(B/2) , so B = 32 is not enough bits since 2^16 = 64k and you may in fact run on 64k files. But even at B = 40, 2^20 = 1 Million is a lot, and certainly B = 64, means 2^32 = 4 Billion items before you expect a collision. So, anyway, the point is to test whether these hashes are actually close enough to perfect on real data that they don't generate an unexpected high number of collisions.

I ran these hashes on every file on my hard drive. I threw out files that were exactly equal to some other file so there would be no false collisions due to the files actually being identical. I have 24963 files. I made 2 variants of every file, one by taking a random byte and changing it to a random value, and another variant by flipping 4 random bits. So in total 74853 arrays were hashed.

First the speed numbers :

sha1                 : 48.218018
md5                  : 19.837351
burtle               : 7.122040
Cessu                : 6.370848
crc32+32             : 15.055287
crc32                : 21.550138

These are in clocks per byte. The CRC numbers are a little funny because the CRC32+32 loop is a bit unrolled, but the CRC32 loop goes byte by byte. In any case, even though CRC is very simple, it is not fast, because even unrolled it still works byte by byte and there's a hard data dependency - you have to completely process each byte before you can work on the next byte.

Cessu's hash is only barely faster than Bob's lookup3 even though it uses SSE2 and works on 16 bytes at a time. Bob's hash is really damn good. When I tested it on strings it did not perform well for me because I'm so close to the bone on strings that the rollup & rolldown overhead killed me, but on larger arrays or even long strings, lookup3 kicks butt. ( Bob's page )

So... how many total collisions in the hashes do you think I had? (only testing collisions versus hashes of the same type of course). Remember I tested on 74853 different arrays, made from 24963 files and 24963+24963 more tiny modifications.

One.

One collision. Of course it was in CRC32. None of the 64 bit hashes had any collisions.

I then tried making 8 variants of each file by 8 different random byte jams, so I was running 199704 arrays. Again zero collisions for any 64 bit hash.

So, in an attempt to actually get a collision, I made 10,000,000 test arrays by sprintf'ing the digits from 1 to 10,000,000 , and then tacked on 2 random bytes. (note : you should not test hashes by making random arrays, because any decent hash will return random output bits from random input bits; the test I am interested in is how close the hash output is to random on highly *nonrandom* input). I ran the hashes on all those strings and got :

collisions on 10,000,000 tests :

sha1                 : 0
md5                  : 0
burtle               : 0
Cessu                : 0
crc64                : 0
rand32               : 11,530
crc32                : 11,576

Again none of the 64 bit hashes has any collisions. CRC32 had quite a few of course - but only as many as a 32 bit random number generator !! That means the CRC is in fact performing as a perfect hash. It is perfectly randomizing the nonrandom input.

So, I have no idea which of the 64 bit hashes is "best" in terms of randomizing bits and detecting errors. Obviously if they are actually perfectly making 64 bits, the chance of me ever seeing a collision is tiny. I thought maybe the "crc32+32" might not have 64 real bits of quality and might fail sooner, but it's not bad enough to fail in any kind of real world scenario apparently.

So, anyway, I'm gonna use "lookup3" because it's both super fast, plenty good, and it has the Bob Jenkins seal of approval which means it should actually be "robust".

HOWEVER : SSE4.2 has a CRC32 instruction. If you were in some application where you could rely on having that, then that would definitely be the fastest way to go, and a CRC32+32 appears to be plenty high quality for file identification.

BTW I keep hearing that CRC32 has degeneracy failures on real world data, but I have yet to see it.

12-08-08 - DXTC Summary

I thought I should fix some things that were wrong or badly said in my original DXTC posts :

DXTC Part 1
DXTC Part 2
DXTC Part 3
DXTC Part 4

Added later :

DXTC Addendum
DXTC More Followup
DXTC is not enough
DXTC is not enough part 2

On the "ryg" coder : there was a bug/typo in the implementation I was using which gave bogus results, so you should just ignore the numbers in those tables. See for correction : Molly Rocket Forum on Ryg DXTC coder . Also I should note he does have some neat code in there. The color finder is indeed very fast; it is an approximation (not 100% right) but the quality is very close to perfect. Also his "RefineBlock" which does the Simon Brown style optimize-end-from-indices is a very clever fast implementation that collapses a lot of work. I like the way he uses the portions of one 32 bit word to accumulate three counters at a time.

I also mentioned in those posts that the optimized version of the Id "real time DXTC" bit math was "wrong". Well, yes, it is wrong in the sense that it doesn't give you exactly the same indeces, but apparently that was an intentional approximation by JMP, and in fact it's a very good one. While it does pick different indeces than the exact method, it only does so in cases where the error is zero or close to zero. On most real images the actual measured error of this approximation is exactly zero, and it is faster.

So, here are some numbers on a hard test set for different index finders :


    exact : err:  31.466375 , clocks: 1422.256522

    id    : err:  31.466377 , clocks: 1290.232239
            diff:  0.000002

    ryg   : err:  31.466939 , clocks:  723.051241
            diff:  0.000564

    ryg-t : err:  31.466939 , clocks:  645.445860
            diff:  0.000564

You can see the errors for all of them are very small. "ryg-t" is a new one I made which uses a table to turn the dot product checks into indexes, so that I can eliminate the branches. Start with the "ryg" dot product code and change the inner loop to :


    const int remap[8] = { 0 << 30,2 << 30,0 << 30,2 << 30,3 << 30,3 << 30,1 << 30,1 << 30 };

    for(int i=0;i < 16;i++)
    {
        int dot = block.colors[i].r*dirr + block.colors[i].g*dirg + block.colors[i].b*dirb;

        int bits =( (dot < halfPoint) ? 4 : 0 )
                | ( (dot < c0Point) ? 2 : 0 )
                | ( (dot < c3Point) ? 1 : 0 );

        mask >>= 2;
        mask |= remap[bits];
    }

I should note that these speed numbers are for pure C obvious implementations and if you really cared about speed you should use SSE and who knows what would win there.

Now this last bit is a little half baked but I thought I'd toss it up. It's a little bit magical to me that Ryg's Mul8Bit (which is actually Jim Blinn's) also happens to produce perfect quantization to 565 *including the MSB shifting into LSB reproduction*.

I mentioned before that the MSB shifting into LSB thing is actually "wrong" in that it would hurt RMSE on purely random data, because it is making poor use of the quantization bins. That is, for random data, to quantize [0,255] -> 32 values (5 bits) you should have quantization bins that each hold 8 values, and whose reconstruction is at the middle of each bin. That is, you should reconstruct to {4,12,20,...} Instead we reconstruct to {0,8,...247,255} - the two buckets at the edges only get 4 values, and there are some other ones that get 9 values. Now in practice this is a good thing because your original data is *not* random - it's far more likely to have exact 0 and 255 values in the input, so you want to reproduce those exactly. So anyway, it's not a uniform quantizer on the range [0,255]. In fact, it's closer to a uniform quantizer on the range [-4,259].


I think it might actually just be a numerical coincidence in the range [0,255].

The correct straight-forward quantizer for the DXTC style colors is

    return (32*(val+4))/(256+8);

for R.  Each quantization bin gets 8 values except the top and bottom which only get 4.  That's equivalent to quantizing the range [-4,256+4) with a uniform 8-bin quantizer.

Now

1/(256 + 8) = 1/256 * 1/(1 + 8/256)

We can do the Taylor series expansion of 1/(1+x) for small x on the second term and we get ( 1 - 8/256 + 64/256/256) up to powers of x^2

So we have

    ( (32*val+128) * ( 1 - 8/256 + 64/256/256) ) >> 8

And if we do a bunch of reduction we get 

    return ( 31*val+124 + ((8*val+32)>>8) ) >> 8

If we compare this to Mul8bit :

        return ( 31*val+128 + ((val*31 + 128)>>8)) >> 8;

it's not exactly the same math, but they are the same for all val in [0,255]

But I dunno. BTW another way to think about all this is to imagine your pixels are an 8.inf fixed point rep of an original floating point pixel, and you should replicate the 8 bit value continuously. So 0 = 0, 255 = FF.FFFFFF.... = 1.0 exactly , 127 = 7F.7F7F7F..

BTW this reminds me : When I do Bmp -> FloatImage conversions I used to do (int + 0.5)/256 , that is 0 -> 0.5/256 , 255 -> 255.5/256 . I no longer do that, I do 0->0, and 255 -> 1.0 , with a 1/255 quantizer.

12/05/2008

12-05-08 - lrotl

Well I found one x86 ASM widget. I've always known you could do nice fast barrel rotations on x86 but thought they were inaccessible from C. Huzzah! Stdlib has a function "_lrotl()" which is exactly what you want, and happily it is one of the magic functions the MSVC recognizes in their compiler and turns into assembly with all goodness. (They also have custom handling for strcmp, memcpy, etc.)

Oh, I noticed lrotl in OpenSSL which seems to have a ton of good code for different hashes/checksums/digests/whatever-the-fuck-you-call-them's.

As a test I tried it on Sean's hash, which is quite good and fast for C strings :


RADINLINE U32 stb_hash(const char *str)
{
   U32 hash = 0;
   while (*str)
   {
      hash = (hash << 7) + (hash >> 25) + *str++;
   }
   return hash;
}

RADINLINE U32 stb_hash_rot(const char *str)
{
   U32 hash = 0;
   while (*str)
   {
      hash = _lrotl(hash,7) + *str++;
   }
   return hash;
}

stb_hash : 6.43 clocks per char
stb_hash_rot : 3.24 clocks per char

Woot! Then I also remembered something neat I saw today at Paul Hsieh's Assembly Lab . A quick check for whether a 32 bit word has any null byte in it :


#define has_nullbyte(x) (((x) - 0x01010101) & ( ~(x) & 0x80808080 ) )

Which can of course be used to make an unrolled stb_hash :


RADINLINE U32 stb_hash_rot_dw(const char *str)
{
   U32 hash = 0;
   
   while ( ! has_nullbyte( *((U32 *)str) ) )
   {
      hash = _lrotl(hash,7) + str[0];
      hash = _lrotl(hash,7) + str[1];
      hash = _lrotl(hash,7) + str[2];
      hash = _lrotl(hash,7) + str[3];
      str += 4;
   }
   while (*str)
   {
      hash = _lrotl(hash,7) + *str++;
   }
   return hash;
}

stb_hash_rot_dw : 2.50 clocks

So anyway, I'm getting distracted by pointless nonsense, but it's nice to know lrotl works. (and yes, yes, you could be faster still by switching the hash algorithm to something that works directly on dwords)

12-05-08 - 64 Bit Multiply

Something that I've found myself wanting to do a lot recently is multiply two 32 bit numbers, and then take the top 32 bit dword from the 64 bit result. In C this looks like :


U32 mul32by32returnHi(U32 a,U32 b)
{
    U64 product = (U64) a * b;
    U32 ret = (U32)(product >> 32);
    return ret;
}

That works fine and all, but the C compiler doesn't understand that you're doing something very simple. It generates absolutely disgusting code. In particular, it actually promotes a & b to 64 bit, and then calls a function called "_allmul" in the Windows NT RTL. This allmul function does a bunch of carries and such to make 64-by-64 bit multiply work via the 32+32->64 multiply instruction in x86. You wind up with a function that takes 60 clocks when it could take 6 clocks :


U32 mul32by32returnHi(U32 a,U32 b)
{
    __asm
    {
        mov eax, a
        mul b
        mov eax,edx
    }
}

Now, that's nice and all, the problem is that tiny inline asm functions just don't work in MSVC these days. Calling a big chunk of ASM is fine, but using a tiny bit of ASM causes the optimizer to disable all its really fancy optimizations, and you wind up with a function that's slower overall.

What I'd like is a way to get at some of the x86 capabilities that you can't easily get from C. The other thing I often find myself wanting is a way to get at the carry flag, or something like "sbb" or "adc". The main uses of those are Terje's cool tricks to get the sign of an int or add with saturation

It would be awesome if you could define your own little mini assembly functions that acted just like the built-in "intrinsics". It would be totally fine if you had to add a bunch of extra data to tell the compiler about dependencies or side effects or whatever, if you could just make little assembly widgets that worked as neatly as their own intrinsics it would be totally awesome for little ops.

ADDENDUM : urg ; so I tested Won's suggestion of Int32x32To64 and found it was doing the right thing (actually UInt32x32To64 - and BTW that's just a macro that's identical to my first code snippet - it doesn't seem to be a magic intrinsic, though it does have platform switches so you should probably use that macro). That confused me and didn't match what I was seeing, so I did some more tests...

It turns out the first code snippet of mul32by32returnHi actually *will* compile to the right thing - but only if you call it from simple functions. If you call it from a complex function the compiler gets confused, but if you call it from a little tight test loop function it does the right thing. URG.

Here's what it does if I try to use it in my StringHash interpolation search test code :


            int start = mul32by32returnHi( query.hash, indexrange );
004087A5  xor         eax,eax 
004087A7  push        eax  
004087A8  mov         eax,dword ptr [esp+38h] 
004087AC  push        eax  
004087AD  push        0    
004087AF  push        ebx  
004087B0  mov         dword ptr [esi],ebx 
004087B2  call        _allmul (40EC00h) 
004087B7  mov         ecx,dword ptr [esp+14h]

You can see it's extending the dwords to 64 bits, pushing them all, calling the function, then grabbing just one dword from the results. Ugh.

And here's what it does in a tight little test loop :


      U32 dw = *dwp++;
      hash = mul32by32returnHi(hash,dw) ^ dw;
00401120  mov         eax,ecx 
00401122  mul         eax,edx 
00401124  add         esi,4 
00401127  xor         edx,ecx 
00401129  mov         ecx,dword ptr [esi]

Which not only is correct use of 'mul' but also has got crazy reordering of the loop which makes it a lot faster than my inline ASM version.

12/02/2008

12-02-08 - Oodle Happy

This is from a few weeks ago, but I just pulled it up again and realized it pleases me :

This is an image of the little thread profile viewer that I made which shows timelines of multiple threads. This particular portion is showing an LZ decompressor thread in the top bar and the IO thread in the bottom bar. The LZ decompressor is double buffered, you can see it ping-ponging between working on the two buffers. As it finished each buffer it fires an IO to get more compressed data. You can see those IO's get fired and run in the bottom. This portion is CPU bound, you can see the IO thread is stalling out a lot (that's partly because there's no seeking and the packed data is coming through the windows file cache).

Anyway, the thing that pleases me is that it's working perfectly. The CPU is 100% busy with decompression work and never wastes any time waiting for IO. The IO thread wakes and sleeps to fire the IOs. Yum. In the future complication and mess I will never again see such nice perfect threading behavior.

12-02-08 - H264

I hate H264. It's very good. If you were making a video codec from scratch right now you would be hard pressed to beat H264. And that's part of why I hate it. Because you have to compete with it, and it's a ridiculously over-complex bucket of bolts. There are so many unnecessary modes and different kinds of blocks, different entropy coders, different kinds of motion compensation, even making a fully compliant *decoder* is a huge pain in the ass.

And the encoder is where the real pain lies. H264 like many of the standards that I hate, is not a one-to-one transform from decompressed streams to code streams. There is no explicit algorithm to find the optimal stream for a given bit rate. With all the different choices that the encoder has of different block types, different bit allocation, different motion vectors, etc. there's a massive amount of search space, and getting good compression quality hinges entirely on having a good encoder that searches that space well.

All of this stifles innovation, and also means that there are very few decent implementations available because it's so damn hard to make a good implementation. It's such a big arcane standard that's tweaked to the Nth degree, there are literally thousands of papers about it (and the Chinese seem to have really latched on to working on H264 improvements which mean there are thousands of papers written by non-English speakers, yay).

I really don't like overcomplex standards, especially this style that specifies the decoder but not the encoder. Hey, it's a nice idea in theory, it sounds good - you specify the decoder, and then over the years people can innovate and come up with better encoders that are still compatible with the same decoder. Sounds nice, but it doesn't work. What happens in the real world is that a shitty encoder gains acceptance in the mass market and that's what everyone uses. Or NO encoder ever takes hold, such as with the so-called "MPEG 4" layered audio spec, for which there exists zero mainstream encoders because it's just too damn complex.

Even aside from all that annoyance, it also just bugs me because it's not optimal. There are lots of ways to encode the exact same decoded video, and that means you're wasting code space. Any time the encoder has choices that let it produce the same output with different code streams, it means you're wasting code space. I talked about this a bit in the past in the LZ optimal parser article, but it should be intuitively obvious - you could take some of those redundant code streams and make them decode to something different, which would give you more output possibilities and thus reduce error at the same bit rate. Obviously H264 still performs well so it's not a very significant waste of code space, but you could make the coder simpler and more efficient by eliminating those choices.

Furthermore, while the motion compensation and all that is very fancy, it's still "ghetto". It's still a gross approximation of what we really want to do, which is *predict* the new pixel from its neighbors and from the past sequence of frames. That is, don't just create motion vectors and subtract the value and encode the difference - doing subtraction is a very primitive form of prediction.

Making a single predicted value and subtracting is okay *if* the predicted probability spectrum is unimodal laplacian, and you also use what information you can to predict the width of the laplacian. But it often isn't. Often there are pixels that you can make a good prediction that this pixel is very likely either A or B, and each is laplacian, but making a single prediction you'd have to guess (A+B)/2 which is no good. (an example of this is along the edges of moving objects, where you can very strongly predict any given pixel to be either from the still background or from the edge of the moving object - a bimodal distribution).

12-01-08 - VM with Virtual Disk

I had this idea for a VM to completely sandbox individual programs. It goes like this :

Do a fresh install of Windows or whatever OS. Take a full snapshot of the disk and store it in a big file. This is now *const* and will be shared by all sandboxes.

Every program that you want to run in isolation gets its own sandbox. Initially a sandbox just points at the const OS snapshot which is shared. File reads fall through to that. When you run the installer on the sandbox, it will do a bunch of file writes - those go in a journal which is unique to this sandbox that stores all the file renames, writes, deletes, etc. That can be saved or simply thrown away after the program is done.

You can optionally browse to sandbox journals. They look just like a regular disk with files. What you're seeing is the const OS snapshot with the changes that the individual program made on top of it. You can then copy files in and out of the sandbox drive to get them to your real disk.

So, for example, when you download some program from the internet that you don't trust, you can just pop up a new sandbox and run it there. This is *instant* and the program is 100% isolated from being able to do file IO to your real system. But if it makes some files you want, you can easily grab them out.

You could also mount "portals" across the sandboxes if you want to. For example, say you don't trust shitty iTunes and you want to run it in a sandbox so it can't mess with your registry or anything. But you want your music files to be on your main drive and have those be accessible to iTunes. You can mount a portal like a net drive for the sandbox to be able to "see out" to just that directory. That way you don't have to like duplicate your music files into the iTunes sandbox or whatever.

Aside from isolating rogue programs, this fixes a lot of problems with Windows. It lets you do 100% clean uninstalls - boom you just delete the whole sandbox and the program has no fingers left over. Every program gets its own registry and set of DLLs and such so there can never be conflicts. You don't have that damn problem of Windows always mysteriously going to shit after 5 years.

If you put your OS on c: and all your data on d:, you could easily just let all the sandboxes of trusted programs have portals to d: so that you can just run Photoshop in a sandbox and browse to d: and work on images, and it feels like just run normal programs on a normal computer.

11/28/2008

11-28-08 - Optimization

Yes yes, I'm generally in the "measure carefully and only optimize where needed" camp which the pretentious and condescending love. Weblog pedants love to point out pointless optimizations and say "did you really measure it?". Fie.

There's a great advantage to just blindly optimizing (as long as it doesn't severely cost you in terms of clarity or maintainability or simplicity). It lets you stop worrying. It frees up head space. Say you do some really inefficient string class and then you use it all over your code. Every time you use you have to wonder "is this okay? am I getting killed by this inefficient class?" and then you measure it, and no, it's fine. Then a month later you worry again. Then a month later you worry again. It gives you great peace of mind to know that all your base components are great. It lets you worry about the real issue, which is the higher level algorithm. (note that micro-optimizing the higher level algorithm prematurely is just always bad).

11-28-08 - Chance of CRC being bad

Say I run the CRC32 on data chunks. I've got them identified by name so I only care about false equality (collisions) on objects of the same name. I'm going to be conservative and assume I only get 31 good bits out of the CRC so there are 2^31 values, not a full 2^32. I think around 20 versions of the data per name max. So the chance of collision is a birthday "paradox" thing (it's not a freaking paradox, it's just probability!) :


P = "1 - e^( - 20*19/(2*(2^31)) )" = 8.847564e-008

Now if you have N names, what's the chance that *any* of them has a collision?


C = 1 - (1 - P)^N

So, for what N of objects is C = 50% ?


log(0.5) = N * log(1 - P)

N = log(0.5)/(log(1-P)

N = 7,834,327

You can have 7 Million files before you get a 50% chance of any one of them having a collision.

For a more reasonable number like N = 100k , what's the chance of collision?


C = "1 - (1 - 8.847564*(10^-8))^100000" = 0.00881 = 0.881 %

These probabilities are very low, and I have been pessimistic so they're even lower, but perhaps they are too high. On any given project, an 0.881% chance of a problem is probably okay, but for me I have to care about what's the chance that *any customer* has a problem, which puts me closer to the 7 M number of files and means it is likely I would see one problem. Of course a collision is not a disaster. You just tell Oodle to refresh everything manually, and the chance is that only happens once to anybody ever in the next few years.

BTW we used CRC32 to hash all the strings in Oddworld : Stranger's Wrath. We had around 100k unique strings which is pretty close to the breaking point for CRC32 (unlike the above case, they all had to be unique against each other). Everything was fine until near the end of the project when we got a few name collisions. Oh noes! Disaster! Not really. I just changed one of the names from "blue_rock.tga" to "blue_rock_1.tga" and it no longer collided. Yep, that was a crazy tough problem. (of course I can't use solutions like that as a library designer).

11/26/2008

11-26-08 - Oodle

OMG I finally got hot loading sort of working with bundles and override bundles. So you can build a package of a bunch of files. The game runs and grabs the package that pages in. You touch some file, the game sees the change, that causes it to make a single-file-bundle for the changed file, it loads in the new bundle, and then patches the resource table so that the new version is used instead of the old one in the package. Then the Game Dictionary is updated so that future runs will also load the new version.

Phew. It works but it's fragile and I'm not happy with how careful you have to be. There are lots of ways it can fail. For example, the packages all load one by one asynchronously. If you fire the big package load first, it may come in and get hooked up before the override package. Then if you hook up your game resources, you have hooked up to the old (wrong) data. You can prevent this by doing a call on each resource that you get to say "is this the right version of this resource" before hooking up to it. But currently there's just a ton of stuff like that you have to be aware of and make sure to do the right call, etc. etc.

I keep coming back to the problem that I need to know whether I have the right version of the file in a package. There are three options I can see :

1. Fuck it. Do nothing. This requires the game to load the right versions all the time. Clients could, for example, always just work unpackaged while editing, then when they make paging packages they would simply not be editable or incremental-linkable in a nice way. This is not a happy solution but it "works" in the sense that it is *reliable* about not working.

2. Use mod times. This works fine and is easy and fast *as long as you only touch and make files on your local machine*. Which would fail for major game devs since they tend to compile files on some kind of build server, or check in and share compiled files. But if you think about the way programmers work - we don't ever check in our .OBJ files, we just check in sources and everybody has to rebuild on their local machine. If you make your artists do the same thing - only share source files and always local rebuild - OR sync to a full compiled set from the server and only run compiled, but never try to mix local changes with server compiles - then it works fine.

3. Use CRC's. This is the most robust and reliable. The packages store the CRC of the files they were made from. If the source file on disk has a different CRC, we assume that the source file is better. (we don't try to use mod times to tell who's newer because of the server-client mod time problem). If the CRC is different we repackage using the local source file. Then when we load we always prefer packages that have content that matches the local CRC. This works and is stable and good and all. The only problem is all the time you spend doing CRC's on files, which may or may not be a big deal. Obviously running over your whole tree and doing CRC's all the time would be ridiculous, but you can use mod time to tell when to redo the CRC, so you only incrementally redo it. Even getting all the mod times is too slow, so you can use a persistent Watcher app to cache the drive dir listing and file infos.

The "CRC" method is not necessarilly "right" in the sense that it doesn't load the newest version of the content, but it is reliable, it always does the same thing - it loads content that corresponds to the source version on the user's disk.

BTW this last thing is something the OS should really do but totally fails on. With the amount of memory we have now, the OS should just keep a listing of every dir you've ever visited in memory at all times. Even if you visit every dir on the disk it would only be 100 Megs. You could cap it at 64 MB or something and LRU, and obviously having 64 MB of dir listings is going to make your dir access instant all the time. I might just write an app to do this myself.

I'm kind of tempted to offer #1 and #3 to clients and not offer #2. I don't want to offer any features that sort of work or will have weird behavior that looks like "bugs".

Also BTW yes of course there will also just be a mode to load what's in the packages and not even look at what's newest or anything. This is what the real shipping game will use of course, once you have your final packages set up it just loads them and assumes you made them right.

11/25/2008

11-24-08 - Lua Coroutines

.. are cool. Coroutines is the awesome way to do game logic that takes time. The script looks like :

PutRightFootIn();
PutRightFootOut();
if ( GetState(Friends) == dancing )
{
    DoHokeyPokey();
    TurnAround();
}
else
{
    LookAt( Friends );
    speech = StartSpeech("Screw you guys");
    FlipOff();
    WaitFor(speech);
    Go( Gome );
}

These things take various amounts of time and actually yield back execution to the game while that happens. Each NPC or game object has its own coroutine which the game basically just "pings". The Lua code really only ever runs a few lines, all it does is pick a new task and make the object do that task, and then it yields again. (obviously if it was just a static list like the example above you could do that easily other ways, the good thing about script is it can do conditionals and make decisions as it goes).

Coco lets you do a real Lua yield & resume from C calls, which lets you build your "threading" into your game's helper functions rather than relying on your designers to write scripts that yield correctly. LuaJIT (at the Coco link) is cool too.

I don't really like the Lua syntax, and I also personally hate untyped languages. (I think the right thing for scripting style languages is "weakly typed", that is all numeric types convert silenty and thing do their ToString and such, but I like to just get a compile error when I try to pass an alligator to the factorial function. Also having explicit types of "location" and "NPC" and such would be nice.) But having an off the shelf scripting language that does the coroutine thing is a pretty big win.

I generally think of simple game AI as being in 2 (or more) parts. The 2 main parts are the "brain" (or reactions) and the "job" (or task). The job/task is what the guy is currently doing and involves a plan over time and a sequence of actions. This is best done with a coroutine of some kind. The brain/reaction part is just a tick every frame that looks around and maybe changes the current task. eg. it's stuff like "saw the player, go into attack job". That can be well implemented as a regular Lua function call that the game just calls every frame, it should just do stuff that doesn't take any time and return immediately.

11/24/2008

11-23-08 - Hashing & Cache Line Size

A while ago I wrote about hash tables and reprobing and cache sizes. I mentioned Cliff Click's nonblocking hash table which is sort of interesting to me in theory but also scary and not something I really need at the moment. Anyhoo I was looking at : Some dude's C version of the lock free hash table and I noticed a little thing that's relevant to my old discussion.

He explicitly uses the cache line size and does +1 linear stepping along the cache line, then does a complex reprobe by rehashing, then does +1 again while in the cache line, then rehashes, etc. That seems like a pretty reasonable thing to do. I guess I considered that at the time, but didn't want to do it because it requires you to explicitly know the cache line size of the machine you're on. It literally requires a hard coded :

NumLinearSteps = CACHE_LINE_SIZE / sizeof(table_entry);

Some other : blogs about the lock free hash have extensive link lists.

11/22/2008

11-22-08 - Rasterization

I just found this pretty sweet introduction article on barycentric rasterization from 2004. It's not super advanced, but it starts at the beginning and works through things and is very readable. There are some dumb things in the block checking, so if you care go to the last page and see the posts by "rarefluid".

BTW the "edge equations" are really 2d plane equations (edge cross products). Checking just the edge planes against 8x8 blocks is only a rough quick reject. You can have blocks that are right outside of one vertex at an acute corner, and those blocks are "inside" all the planes but outside of the triangle. The code they have posted also checks against the bounding box of the whole triangle which largely fixes this case. At most they will consider one extra 8x8 block which doesn't actually contain any pixels.

(it's also really not yet a full barycentric rasterizer, he's just doing the edge tests that way; from his other posts I figure he's doing interpolation using the normal homogenous way, but if you're doing the edge-tests like that then you should just go ahead and do your interpolation barycentric too).

This kind of block-based barycentric rasterizer is very similar to what hardware does. One of the nice things about it is the blocks can easily be dispatched to microthreads to rasterize in parallel, and the blocks are natural quanta to check against a coarse Z buffer.

The old Olano-Greer paper about homogenous coordinate rasterization is now online in HTML. Some other random junk I found that I think is just junk : Reducing divisions for polygon mappers & Triangle Setup .

This blog about Software occlusion culling is literally a blast from the past. As I've often suggested, if you care about that stuff, the SurRender Umbra technical manual is still the godly bible on all things occlusion. (or you can read my ancient article on it ). But I also still think that object-based occlusion like that is just a bad idea.

Coarse Z that's updated by the rasterizer is however a good thing. Doing your own on top of what the card does is pretty lame though. This is yet another awesome win from Larrabee. If we/they do a coarse Z buffer, it can get used by the CPUs to do whole-object rejections, or whole-triangle rejections, or macroblock rejections.

Apparently the guy who wrote that top article is Nicolas Capens ; he wrote "swShader" which was an open source DX9 software rasterizer, which got taken down and is now a commercial product (which was a silly thing to do, of course any customer would rather buy Pixomatic !!). I learned this from a random flame war he got in. Don't you love the internet?

11/21/2008

11-21-08 - DXTC Part 4

So I finally implemented the end point lsqr fit from indeces thing that Simon does for myself. One thing immediately fell out - doing just 4 means and then end point fit pretty much equals the best mode of Squish. That's very cool because it's quite fast. Note that this is not doing all the searches of all possible clusterings - I just pick one clustering from 4 means and then optimize the end points for those indeces. (when I do 4 means I actually try a few different ways as I mentioned previously, I try all the permutations of putting the 4 means on the 4 palette entries, which is 4!/2 ways = 12 ways, but then I only optimize the best one of those, so it's still very fast).

One thing I noticed is that the lsqr fit really doesn't do much other than shift an end point by one. That is, the end points are in 565 already, you can do this nice optimization in floats, but when you quantize back to 565 you pretty much always hit the point you started with or at most a step of 1.

So the new "CB" modes are :

CB 1 = just 4 means then lsqr fit , faster than Squish, a bit slower than ryg. Quality is roughly competitive with Squish, but they have different strengths and weakness, so taking the best of the two might be reasonable. Squish never beats "CB 1" by a lot, but "CB 1" kills it on the weird "frymire" image.

CB 2 = CB 1 followed by simulated annealing.

CB 3 = Very slow heavy optimization. This is an attempt to see what "optimal" would get you. It tries "CB 2" and then it also tries using all 16 colors in the block as end points, so 16*16 trials, does the lsqr optimization on each trial, and then anneals the best of those. There are still various other things to try that might find a better result, but this is already pretty crazy slow. This is too slow to use in real production, the idea is simply to get an idea of how close "CB 2" is to optimal. Of course "CB 3" could still be far off optimal, I'm only conjecturing that it's close.

One of the interesting things to look at is the curve of diminishing returns from CB1 to 2 to 3. In most cases there's only a small improvement from 2 to 3, but there are exceptions, mainly in the weird degenerate images. kodim02 is one case (this is a photo but it's almost all red), and frymire of course. That meets expectations. On noisy natural images, the cluster of colors is pretty close to Gaussian noise, which works well with the PCA single line fit and the least-squares contiuous distribution approximation. On weird images with degenerate cases there can be stranger optimal solutions (for example : putting one of the color end points outside of the volume of original colors, so that one of the interpolated 1/3 colors can hit a certain value more precisely).

ADDENDUM : you might validly object and ask why the annealing is not getting closer to the global optimum. There are some approximations in the annealing that are hurting. For one thing I only try wiggling the ends by 1 step in 565. Then I don't really run it for very long, so it doesn't have a chance to make big walks and get to really different solutions. All it can really do is local optimizations with small steps to tunnel past local barriers to find better minima - it's not trying huge steps to other parts of the solution space. Theoretically if I ran a much longer annealing schedule with more time spent at higher temperatures it would do a better job of finding the global minimum. But I'm happy with this approach - the annealing is just an improved local optimization that steps bast small barriers, and to find drastically different global solutions I have to seed the trial differently.

The new numbers : (RMSE per pixel)

file	CB 1	CB 2	CB 3	Squish opt	Squish	ryg	D3DX8	FastDXT
kodim01.bmp	8.447	8.3145	8.2678	8.2829	8.3553	8.9185	9.8466	9.9565
kodim02.bmp	5.6492	5.4759	5.2864	6.1079	6.2876	6.8011	7.4308	8.456
kodim03.bmp	4.7533	4.6776	4.6591	4.7869	4.9181	5.398	6.094	6.4839
kodim04.bmp	5.5234	5.4286	5.3967	5.6978	5.8116	6.3424	7.1032	7.3189
kodim05.bmp	9.7619	9.6171	9.5654	9.6493	9.7223	10.2522	11.273	12.0156
kodim06.bmp	7.2524	7.1395	7.1086	7.15	7.2171	7.6423	8.5195	8.6202
kodim07.bmp	5.7557	5.6602	5.634	5.784	5.8834	6.3181	7.2182	7.372
kodim08.bmp	10.3879	10.2587	10.2056	10.2401	10.3212	10.8534	11.8703	12.2668
kodim09.bmp	5.3242	5.2477	5.2247	5.2935	5.3659	5.7315	6.5332	6.6716
kodim10.bmp	5.2564	5.1818	5.1657	5.2478	5.3366	5.7089	6.4601	6.4592
kodim11.bmp	6.7614	6.6503	6.6139	6.731	6.8206	7.3099	8.1056	8.2492
kodim12.bmp	4.8159	4.747	4.7308	4.7968	4.8718	5.342	6.005	6.0748
kodim13.bmp	11.0183	10.8489	10.7894	10.8684	10.9428	11.6049	12.7139	12.9978
kodim14.bmp	8.4325	8.3105	8.2723	8.3062	8.3883	8.8656	9.896	10.8481
kodim15.bmp	5.6871	5.5891	5.548	5.8304	5.9525	6.3297	7.3085	7.4932
kodim16.bmp	5.1351	5.0439	5.0274	5.065	5.1629	5.5526	6.3361	6.1592
kodim17.bmp	5.5999	5.5146	5.4976	5.509	5.6127	6.0357	6.7395	6.8989
kodim18.bmp	8.1345	8.0103	7.9791	7.9924	8.0897	8.6925	9.5357	9.7857
kodim19.bmp	6.6903	6.5979	6.5645	6.5762	6.652	7.2684	7.9229	8.0096
kodim20.bmp	5.4532	5.3825	5.3582	5.4568	5.5303	5.9087	6.4878	6.8629
kodim21.bmp	7.2207	7.1046	7.0718	7.1351	7.2045	7.6764	8.4703	8.6508
kodim22.bmp	6.52	6.4191	6.3933	6.4348	6.5127	7.0705	8.0046	7.9488
kodim23.bmp	4.9599	4.8899	4.8722	4.9063	5.0098	5.3789	6.3057	6.888
kodim24.bmp	8.5761	8.4633	8.4226	8.4299	8.5274	8.9206	9.9389	10.5156
clegg.bmp	14.8934	14.8017	14.6102	14.9736	15.2566	15.7163	21.471	32.7192
FRYMIRE.bmp	7.4898	7.3461	6.0851	10.7105	12.541	12.681	16.7308	28.9283
LENA.bmp	7.1286	7.0286	6.9928	7.1432	7.2346	7.6053	8.742	9.5143
MONARCH.bmp	6.6617	6.5616	6.5281	6.5567	6.6292	7.0313	8.1053	8.6993
PEPPERS.bmp	6.0418	5.9523	5.8026	6.4036	6.5208	6.9006	8.1855	8.8893
SAIL.bmp	8.4339	8.3077	8.2665	8.3254	8.3903	8.9823	9.7838	10.5673
SERRANO.bmp	6.7347	6.2445	5.9454	6.3524	6.757	7.0722	9.0549	18.3631
TULIPS.bmp	7.6491	7.5406	7.5065	7.5805	7.656	8.0101	9.3817	10.5873
lena512ggg.bmp	4.8961	4.8395	4.8241	4.8426	4.915	5.1986	6.0059	5.5247
lena512pink.bmp	4.6736	4.5922	4.5767	4.5878	4.6726	5.0987	5.8064	5.838
lena512pink0g.bmp	3.7992	3.7523	3.7455	3.7572	3.8058	4.2756	5.0732	4.8933
linear_ramp1.BMP	1.5607	1.348	1.3513	1.4035	2.1243	2.0939	2.6317	3.981
linear_ramp2.BMP	1.5097	1.2772	1.2772	1.3427	2.3049	1.9306	2.5396	4.0756
orange_purple.BMP	2.9842	2.9048	2.9074	2.9125	3.0685	3.2684	4.4123	7.937
pink_green.BMP	3.2837	3.2041	3.2031	3.2121	3.3679	3.7949	4.5127	7.3481
sum :	250.8565	246.2747	243.2776	252.3828	259.7416	275.5826	318.5562	370.8691

11-21-08 - More Texture Compression Nonsense

I guess the DX11 Technical Preview was just publicly released a few weeks ago. Unfortunately it's semi-crippled and still doesn't have information about BC7. From what I gather though BC7 does seem to be pretty high quality.

There're multiple different issues here. There's providing data to the card in a way that's good for texture cache usage (DXT1 is a big winner here, especially on the RSX apparently). Another is keeping data small in memory so you can hold more. Obviously that's a bigger issue on the consoles than the PC, but you always want things smaller in memory if you can. Another issue is paging off disk quickly. Smaller data loads faster, though that's not a huge issue off hard disk, and even off DVD if you're doing something like SVT you are so dominated by seek times that file sizes may not be that big. Another issue is the size of your content for transmission, either on DVD or over the net; it's nice to make your game downloadable and reasonably small.

I guess that the Xenon guys sort of encourage you to use PCT to pack your textures tiny on disk or for transmission. PCT seems to be a variant of HD-photo. MCT is I guess a DXT-recompressor, but I can't find details on it. I'm not spilling any beans that you can't find at MS Research or that end users aren't figuring out or PS3 developers are hinting at.

The "Id way" I guess is storing data on disk in JPEG, paging that, decompressing, then recompressing to DXT5-YCoCg. That has the advantage of being reasonably small on disk, which is important if you have a huge unique textured world so you have N gigs of textures. But I wonder what kind of quality they really get from that. They're using two different lossy schemes, and when you compress through two different lossy schemes the errors usually add up. I would guess that the total error from running through both compressors puts them in the ballpark of DXT1. They're using 8 bits per pixel in memory, and presumably something like 1-2 bits per pixel on disk.

Instead you could just use DXT1 , at 4 bits per pixel in memory, and do a DXT1-recompressor, which I would guess could get around 2 bits per pixel on disk. DXT1-recompressed is lower quality than JPEG, but I wonder how it compares to JPEG-then-DXT5 ?

If I ignore the specifics of the Id method or the XBox360 for the moment, the general options are :

1. Compress textures to the desired hardware-ready format (DXT1 or DXT5 or whatever) with a high quality offline compressor. Store them in memory in this format. Recompress with a shuffler/delta algorithm to make them smaller for transmisssion on disk, but don't take them out of hardware-ready format. One disadvantage of this is that if you have to support multiple hardware-ready texture formats on the PC you have to transmit them all or convert.

2. Compress texture on disk with a lossy scheme that's very tight and has good RMSE/size performance. Decompress on load and then recompress to hardware-ready format (or not, if you have enough video memory). One advantage is you can look at the user's machine and decide what hardware-ready format to use. A disadvantange is the realtime DXT compressors are much lower quality and even though they are very fast the decompress-recompress is still a pretty big CPU load for paging.

3. Hybrid of the two : use some very small format for internet transmision or DVD storage, but when you install to HD do the decompress and recompress to hardware-ready format then. Still has the problem that you're running two different lossy compressors which is evil, but reduces CPU load during game runs.

I don't think that having smaller data on disk is much of a win for paging performance. If you're paging any kind of reasonable fine-grained unit you're just so dominated by seek time, and hard disk throughput is really very fast (and it's all async anyway so the time doesn't hurt). For example a 256 x 256 texture at 4 bits per pixel is 32k bytes which loads in 0.32 millis at 100 MB/sec. (BTW that's a handy rule of thumb - 100k takes 1 milli). So the real interesting goals are : A) be small in memory so that you can have a lot of goodies on screen, and B) be small for transmission to reduce install time, download time, number of DVDs, etc.

One idea I've been tossing around is the idea of using lossy compressed textures in memory as a texture cache to avoid hitting disk. For example, a 256 X 256 texture at 1 bit per pixel is only 8k bytes. If you wanted to page textures in that unit size you would be ridiculously dominated by seek time. Instead you could just hold a bunch of them in memory and decompress to "page" them into usable formats. The disk throughput is comparable to the decompresser throughput, but not having seek times means you can decompress them on demand. I'm not convinced that this is actually a useful thing though.

BTW I also found this Old Epic page about the NVidia DXT1 problem which happily is fixed but I guess there are still loads of those old chips around; this might still make using DXT1 exclusively on the PC impossible. The page also has some good sample textures to kill your compressor with.

11/20/2008

11-20-08 - DXTC Part 3

So we know DXT1 is bad compared to a variable bitrate coder, but it is a small block fixed rate coder, which is pretty hard to deal with.

Small block inherently gives up a lot of coding capability, because you aren't allowed to use any cross-block information. H264 and HDPhoto are both small block (4x4) but both make very heavy use of cross-block information for either context coding, DC prediction, or lapping. Even JPEG is not a true small block coder because it has side information in the Huffman table that captures whole-image statistics.

Fixed bitrate blocks inherently gives up even more. It kills your ability to do any rate-distortion type of optimization. You can't allocate bits where they're needed. You might have images with big flat sections where you are actually wasting bits (you don't need all 64 bits for a 4x4 block), and then you have other areas that desperately need a few more bits, but you can't gived them to them.

So, what if we keep ourselves constrained to the idea of a fixed size block and try to use a better coder? What is the limit on how well you can do with those constraints? I thought I'd see if I could answer that reasonably quickly.

What I made is an 8x8 pixel fixed rate coder. It has zero side information, eg. no per-image tables. (it does have about 16 constants that are used for all images). Each block is coded to a fixed bit rate. Here I'm coding to 4 bits per pixel (the same as DXT1) so that I can compare RMSE directly, which is a 32 byte block for 8x8 pixels. It also works pretty well at 24 byte blocks (which is 1 bit per byte), or 64 for high quality, etc.

This 8x8 coder does a lossless YCoCg transform and a lossy DCT. Unlike JPEG, there is no quantization, no subsampling of chroma, no huffman table, etc. Coding is via an embedded bitplane coder with zerotree-style context prediction. I haven't spent much time on this, so the coding schemes are very rough. CodeTree and CodeLinear are two different coding techniques, and neither one is ideal.

Obviously going to 8x8 instead of 4x4 is a big advantage, but it seems like a more reasonable size for future hardware. To really improve the coding significantly on 4x4 blocks you would have to start using something like VQ with a codebook which hardware people don't like.

In the table below you'll see that CodeTree and CodeLinear generally provide a nice improvement on the natural images, about 20%. In general they're pretty close to half way between DXTC and the full image coder "cbwave". They have a different kind of perceptual artifact when they have errors - unlike DXTC which just make things really blocky, these get the halo ringing artifacts like JPEG (it's inherent in truncating DCT's).

The new coders do really badly on the weird synthetic images from bragzone, like clegg, frymire and serrano. I'd have to fix that if I really cared about these things.

One thing that is encouraging is that this coder does *very* well on the simple synthetic images, like the "linear_ramp" and the "pink_green" and "orange_purple". I think these synthetic images are a lot like what game lightmaps are like, and the new schemes are near lossless on them.

BTW image compression for paging is sort of a whole other issue. For one thing, a per-image table is perfectly reasonable to have, and you could work on something like 32x32 blocks. But more important is that in the short term you still need to provide the image in DXT1 to the graphics hardware. So you either just have to page the data in DXT1 already, or you have to recompress it, and as we've seen here the "real-time" DXT1 recompressors are not high enough quality for ubiquitous use.

ADDENDUM I forgot but you may have noticed the "ryg" in this table is also not the same as previous "ryg" - I fixed a few of the little bugs and you can see the improvement here. It's still not competitive, I think there may be more errors in the best fit optimization portion of the code, but he's got that code so optimized and obfuscated I can't see what's going on.

Oh, BTW the "CB" in this table is different than the previous table; the one here uses 4-means instead of 2-means, seeded from the pca direction, and then I try using each of the 4 means as endpoints. It's still not quite as good as Squish, but it's closer. It does beat Squish on some of the more degenerate images at the bottom, such as "linear_ramp". It also beats Squish on artificial tests like images that contain only 2 colors. For example on linear_ramp2 without optimization, 4-means gets 1.5617 while Squish gets 2.3049 ; most of that difference goes away after annealing though.

RMSE per pixel :

file	Squish opt	Squish	CB opt	CB	ryg	D3DX8	FastDXT	cbwave KLTY	CodeTree	CodeLinear
kodim01.bmp	8.2808	8.3553	8.3035	8.516	8.9185	9.8466	9.9565	2.6068	5.757835	5.659023
kodim02.bmp	6.1086	6.2876	6.1159	6.25	6.8011	7.4308	8.456	1.6973	4.131007	4.144241
kodim03.bmp	4.7804	4.9181	4.7953	4.9309	5.398	6.094	6.4839	1.3405	3.369018	3.50115
kodim04.bmp	5.6913	5.8116	5.7201	5.8837	6.3424	7.1032	7.3189	1.8076	4.254454	4.174228
kodim05.bmp	9.6472	9.7223	9.6766	9.947	10.2522	11.273	12.0156	2.9739	6.556041	6.637885
kodim06.bmp	7.1472	7.2171	7.1596	7.3224	7.6423	8.5195	8.6202	2.0132	5.013081	4.858232
kodim07.bmp	5.7804	5.8834	5.7925	5.9546	6.3181	7.2182	7.372	1.4645	3.76087	3.79437
kodim08.bmp	10.2391	10.3212	10.2865	10.5499	10.8534	11.8703	12.2668	3.2936	6.861067	6.927792
kodim09.bmp	5.2871	5.3659	5.3026	5.4236	5.7315	6.5332	6.6716	1.6269	3.473094	3.479715
kodim10.bmp	5.2415	5.3366	5.2538	5.3737	5.7089	6.4601	6.4592	1.7459	3.545115	3.593297
kodim11.bmp	6.7261	6.8206	6.7409	6.9128	7.3099	8.1056	8.2492	1.8411	4.906141	4.744971
kodim12.bmp	4.7911	4.8718	4.799	4.9013	5.342	6.005	6.0748	1.5161	3.210518	3.231271
kodim13.bmp	10.8676	10.9428	10.9023	11.2169	11.6049	12.7139	12.9978	4.1355	9.044009	8.513297
kodim14.bmp	8.3034	8.3883	8.3199	8.5754	8.8656	9.896	10.8481	2.4191	6.212482	6.222196
kodim15.bmp	5.8233	5.9525	5.8432	6.0189	6.3297	7.3085	7.4932	1.6236	4.3074	4.441998
kodim16.bmp	5.0593	5.1629	5.0595	5.1637	5.5526	6.3361	6.1592	1.546	3.476671	3.333637
kodim17.bmp	5.5019	5.6127	5.51	5.6362	6.0357	6.7395	6.8989	1.7166	4.125859	4.007367
kodim18.bmp	7.9879	8.0897	8.0034	8.225	8.6925	9.5357	9.7857	2.9802	6.743892	6.376692
kodim19.bmp	6.5715	6.652	6.5961	6.7445	7.2684	7.9229	8.0096	2.0518	4.45822	4.353687
kodim20.bmp	5.4533	5.5303	5.47	5.5998	5.9087	6.4878	6.8629	1.5359	4.190565	4.154571
kodim21.bmp	7.1318	7.2045	7.1493	7.3203	7.6764	8.4703	8.6508	2.0659	5.269787	5.05321
kodim22.bmp	6.43	6.5127	6.4444	6.6185	7.0705	8.0046	7.9488	2.2574	5.217884	5.142252
kodim23.bmp	4.8995	5.0098	4.906	5.0156	5.3789	6.3057	6.888	1.3954	3.20464	3.378545
kodim24.bmp	8.4271	8.5274	8.442	8.7224	8.9206	9.9389	10.5156	2.4977	7.618436	7.389021
clegg.bmp	14.9733	15.2566	15.1516	16.0477	15.7163	21.471	32.7192	10.5426	21.797655	25.199576
FRYMIRE.bmp	10.7184	12.541	11.9631	12.9719	12.681	16.7308	28.9283	6.2394	21.543401	24.225852
LENA.bmp	7.138	7.2346	7.1691	7.3897	7.6053	8.742	9.5143	4.288	7.936599	8.465576
MONARCH.bmp	6.5526	6.6292	6.5809	6.7556	7.0313	8.1053	8.6993	1.6911	5.880189	5.915117
PEPPERS.bmp	6.3966	6.5208	6.436	6.6482	6.9006	8.1855	8.8893	2.3022	6.15367	6.228315
SAIL.bmp	8.3233	8.3903	8.3417	8.5561	8.9823	9.7838	10.5673	2.9003	6.642762	6.564393
SERRANO.bmp	6.3508	6.757	6.5572	6.991	7.0722	9.0549	18.3631	4.6489	13.516339	16.036401
TULIPS.bmp	7.5768	7.656	7.5959	7.8172	8.0101	9.3817	10.5873	2.2228	5.963537	6.384049
lena512ggg.bmp	4.8352	4.915	4.8261	4.877	5.1986	6.0059	5.5247		2.054319	2.276361
lena512pink.bmp	4.5786	4.6726	4.581	4.6863	5.0987	5.8064	5.838		3.653436	3.815336
lena512pink0g.bmp	3.7476	3.8058	3.7489	3.8034	4.2756	5.0732	4.8933		4.091045	5.587278
linear_ramp1.BMP	1.4045	2.1243	1.3741	1.6169	2.0939	2.6317	3.981		0.985808	0.984156
linear_ramp2.BMP	1.3377	2.3049	1.3021	1.5617	1.9306	2.5396	4.0756		0.628664	0.629358
orange_purple.BMP	2.9032	3.0685	2.9026	2.9653	3.2684	4.4123	7.937		1.471407	2.585087
pink_green.BMP	3.2058	3.3679	3.2	3.2569	3.7949	4.5127	7.3481		1.247967	1.726312

11/18/2008

11-18-08 - DXTC Part 2

First of all, let's dispell the illusion that some people have that DXT1 is "pretty good". DXT1 is fucking awful. It compresses at 1.33333 bits per byte (4 bits per pixel). That's very large as far as image compressors are concerned. For typical images, around 4.0 bpb is true lossless, around 1.5 is perceptually lossless, and around 0.5 is "very good". In fact wavelet compressors can get as low as 0.1 bpb and acheive about the same quality as DXT1. Despite this I've heard smart people saying that "DXT1 is pretty good". Yes, it is a convenient fixed size block, and yes the decoder is extremely fast and simple, but as far as quality is concerned it is not even close. At 1.33333 it should be near lossless.

Here are some numbers on various DXT1 compressors. The numbers in the table are the RMSE (sqrt of the L2 error). The far right column is a wavelet compressor for comparison; it's not the best wavelet compressor in the world by a long shot, it's "cbwave" which is very old and which I designed for speed, not maximum quality. In any case it gives you an idea how far off DXT1 is. (BTW I always try to show results in RMSE because it is linear in pixel magnitude, unlike MSE or PSNR which is a very weird nonlinear scale). More discussion after the table...

(ADD: the "ryg" results shown here have a bug and should be ignored; see DXTC Post Summary )

file	Squish opt	Squish	CB opt	CB	ryg	D3DX8	FastDXT	cbwave
kodim01.bmp	8.2808	8.3553	8.352	8.6924	9.374	9.8466	9.9565	2.6068
kodim02.bmp	6.1086	6.2876	6.1287	6.3025	7.52	7.4308	8.456	1.6973
kodim03.bmp	4.7804	4.9181	4.8312	5.0225	5.855	6.094	6.4839	1.3405
kodim04.bmp	5.6913	5.8116	5.7285	5.9394	6.9408	7.1032	7.3189	1.8076
kodim05.bmp	9.6472	9.7223	9.707	10.112	10.8934	11.273	12.0156	2.9739
kodim06.bmp	7.1472	7.2171	7.1777	7.44	8.1005	8.5195	8.6202	2.0132
kodim07.bmp	5.7804	5.8834	5.8379	6.0583	6.8153	7.2182	7.372	1.4645
kodim08.bmp	10.2391	10.3212	10.346	10.747	11.3992	11.8703	12.2668	3.2936
kodim09.bmp	5.2871	5.3659	5.3306	5.5234	5.9884	6.5332	6.6716	1.6269
kodim10.bmp	5.2415	5.3366	5.2777	5.4633	5.9377	6.4601	6.4592	1.7459
kodim11.bmp	6.7261	6.8206	6.7643	7.0216	7.8221	8.1056	8.2492	1.8411
kodim12.bmp	4.7911	4.8718	4.8204	4.9863	5.6651	6.005	6.0748	1.5161
kodim13.bmp	10.8676	10.9428	10.925	11.4237	12.402	12.7139	12.9978	4.1355
kodim14.bmp	8.3034	8.3883	8.3398	8.6722	9.4258	9.896	10.8481	2.4191
kodim15.bmp	5.8233	5.9525	5.8568	6.0862	6.6749	7.3085	7.4932	1.6236
kodim16.bmp	5.0593	5.1629	5.0863	5.2851	5.8093	6.3361	6.1592	1.546
kodim17.bmp	5.5019	5.6127	5.5313	5.7358	6.4975	6.7395	6.8989	1.7166
kodim18.bmp	7.9879	8.0897	8.0192	8.3716	9.7744	9.5357	9.7857	2.9802
kodim19.bmp	6.5715	6.652	6.6692	6.91	8.0128	7.9229	8.0096	2.0518
kodim20.bmp	5.4533	5.5303	5.4895	5.6864	6.3457	6.4878	6.8629	1.5359
kodim21.bmp	7.1318	7.2045	7.1724	7.4582	8.1637	8.4703	8.6508	2.0659
kodim22.bmp	6.43	6.5127	6.4644	6.7137	7.8264	8.0046	7.9488	2.2574
kodim23.bmp	4.8995	5.0098	4.9244	5.0906	5.6989	6.3057	6.888	1.3954
kodim24.bmp	8.4271	8.5274	8.4699	8.8564	9.3906	9.9389	10.5156	2.4977
clegg.bmp	14.9733	15.2566	15.1755	15.7168	16.3563	21.471	32.7192	10.5426
FRYMIRE.bmp	10.7184	12.541	12.132	12.8278	12.989	16.7308	28.9283	6.2394
LENA.bmp	7.138	7.2346	7.1763	7.4264	8.1203	8.742	9.5143	4.288
MONARCH.bmp	6.5526	6.6292	6.5949	6.846	7.5162	8.1053	8.6993	1.6911
PEPPERS.bmp	6.3966	6.5208	6.4557	6.677	7.3618	8.1855	8.8893	2.3022
SAIL.bmp	8.3233	8.3903	8.3598	8.6627	9.8685	9.7838	10.5673	2.9003
SERRANO.bmp	6.3508	6.757	6.8385	7.9064	7.5303	9.0549	18.3631	4.6489
TULIPS.bmp	7.5768	7.656	7.6146	7.8786	8.4084	9.3817	10.5873	2.2228

Back to comparing the DXT1 encoders. BTW the test set here is the Kodak image set plus the Waterloo Bragzone image set. The Kodak set is all photographs that are pretty noisy, and there's not a huge difference in the coders. The Bragzone image set has some synthetic images with things like gradients which are harder to compress well, and there you can really dramatically see the bad encoders fall apart. In particular if you look at the results on "clegg" and "frymire" and "serrano" you can see how bad the "FastDXT" coder is.

The "Squish" in the table is the iterative cluster fit with uniform weighting. All coders work on linear RGB error; and the MSE is mean per pixel not per component.

The "CB" encoder is a simple 2-means fit. I seed the means by doing the PCA to find the best fit line. I put a plane perpendicular to that line through the average and take all the points on each half to be the two clusters, average the cluster to seed the means, and then iteratively refine the means by reclustering. Once I have the 2-means I do a simple search to find the best 565 DXT end points to find the two means. There are 3 cases to try :

1. put c0 and c1 = the two means

2. put c2 and c3 = the two means (so c0 and c1 are pushed out)

3. make (c0,c2) straddle mean 1 and (c3,c1) straddle mean 2 - this is best for gaussian clusters around the mean

A 2-means like this is slightly better than doing a pca "range fit" like Simon's fast method or the "ryg" method. If the data was all Gaussian noise, they would be equivalent, but of course it's not. You often get blocks that have a bunch of pixels at the low end which are all exactly the same color ( for example, all perfect black), and then a spread of a bunch of junk at the high end (like some orange, some yellow, etc.). You want to put one end point exactly on perfectly black and put the other endpoint at the center of the cluster on the high end.

"CB opt" and "Squish opt" take the results of the CB and Squish compressors and then improve them by iterative search. Simon Brown on his page mentions something about doing greedy endpoint refinement but claims it "doesn't converge because the indeces change". That's silly, of course it converges.

To do a greedy search : try wiggling one or both end points in 565 space. Find new best index fit for the new end points. Measure new error. If the error is lower, take the step.

Of course that works and it does converge and it's pretty simple and improves the encoding after Squish.

In fact you can do even better by doing simulated annealing instead of a plain greedy search. We should know by now that any time we have a greedy search like this where we can measure the error and it can get stuck in local minima, we can improve it with something like simulated annealing. I use 256 steps of annealing with a sinusoidal decreasing temperature pattern. I randomly pick a way to wiggle the two end points (you need to consider wiggling both, not just single wiggles). I try the wiggle and measure the error delta. Negative errors (improvements) are always taken, positive errors are taken probabilitistically based on the temperature. This is what was done in the "opt" coders above.

Most of the time the annealing search just makes a small improvement, but once in a while (such as on "frymire" and "serrano") it really finds something remarkable and makes a big difference.

Simon's cluster fit alg is very sweet. I didn't really understand it at first from the description of the code, so I'm going to redescribe it for myself here.

The basic idea is you find an assignment of indices, and from that solve a best fit to give you the end points for those indices. If you think about all the colors in the block living in 3x3 color space, assigning indices is like giving labels to each point. All the points with the same label are a "cluster". Then each cluster is fit with either an end point or one of the interpolated points.

Once you know the clusters, putting the best two end points is a simple linear optimization. You can solve the least squares analytically, it's not like a matrix iteration least squares problem or anything.

So the trick is how to decide the indices. If you tried them all, it would be 4^16 ways, which is way too many. So what Simon does is create a linear ordering of the points using the PCA axis, then try all clusterings that can be created by planes perpendicular to that linear axis. That's equivalent to just trying all groupings of the indeces sorted by their dot along the PCA axis. That is, the clusters are

[0,i) , [i,j) [j,k) [k,16)

for all i,j,k

which is a manageable amount to search, and gives you most of the interesting clusterings that you care about. Something that might improve Squish is tring a few different axes and picking the best.

BTW this end point optimization is very approximate. One issue is that it assumes the best index for each point doesn't change. It also of course just uses real number arithmetic to make the 1/3 points, not the actual integer rounding that's done. Those factors are actually pretty big.