1/13/2010

01-13-10 - Lagrange Rate Control Part 3

Lambda seeking. How do we choose lambda to match a desired rate? I have a few ideas.

One idea is to train up a statistical fit to make good guesses. For any given video (or other data to compress), you can create a relationship between lambda and R that's a pretty simple functional fit. There are a few ways to do this. One way would be to just go ahead and compress the video 16 times with different lambda values, observe R each time, that gives you a bunch of data points, now use those to fit a function to lambda(R) , now you can dial a lambda for any rate. If you're going to be working on the same video a lot, this might not be so bad; in particular in the games or DVD publishing business this might be tolerable.

A slight tweak on that idea is instead of just making a fit per video, instead fit to some characteristics of that video. Sample some kind of moments from the video, like do one pass with simple mocomp and measure the L2 norm of the movec and the L2 norm of the residuals. Those are just two moments. Use those to build up a fit lambda(R,m1,m2). To make this fit, we could run on a variety of sample videos to build a database of some good seeds. Furthermore, whenever a user does a run, it would add a seed to their local database. That way if they run on the same video multiple times, those moments would be matched exactly and the fit would get better. If we don't find a close enough sample point in the database to make a good fit, then we just force a test compression of the video at a given lambda and measure the R. So the end result is that rather than doing a bunch of test compressions to build a model, we just do a quick scan to sample some moments, then take the R parameter and look up the fit and get a guess for lambda. Note that rather than doing this fit per video, you could do it *per frame* which would give you a lot more dense sample data very quickly.

Another way to get a decent guess for lambda is with interpolation search. The procedure would be something like this :


pick good brackets for lambda
    high and low lambdas that you're pretty sure the desired lambda is between, but are as close together as possible
    (requires some kind of moment sampling or heuristic training)

evalue R(lambda) at the two end points and half way between them
    that's 3 compression runs

from 3 points, fit a quadratic R(lambda)

use quadratic to choose 4th point lambda
    (bias it so its not too close to the 3 we already did)

run compression at 4th point
    if that R(lambda) is close enough to target -> done

make line between 4th point and neighbor that straddles target R
use lerp to choose 5th lambda

4 or 5 evaluations of R(lambda)

In many cases with 4 runs your R may be close enough to desired rate and you are just done. The quality of interpolation search depends on how simple/smooth the R(lambda) function is. In general it is a good fit to simple curves, but you can get unlucky in your sample point choices.

Finally, if you want to hit a rate more exactly, you can do so more efficiently by relaxing the global uniform lambda requirement. Say you have some lambda which is pretty close to giving you the right rate and you want to encode to hit that rate more exactly. You let lambda drift a little bit as you code to try to adjust towards hitting the exact bit rate.

I'm not sure exactly what the ideal algorithm is for this, so if you have better ideas let me know. It's a little bit of a nasty black box feedback problem, because I'm trying to hit R by dialing L, and I have this unknown function R(L). One assumption that we will use, which most video coders use in one way or another, is that the video is locally self similar, that is, frame N is usually similar to frame N+1 , so if I observe something about the function R(L) on frame N, it is a good guess that it behaves the same way on frame N+1 ; obviously this is grossly untrue for major cuts, but those are "rare" and we just accept the innaccuracy there (you could also have a panic mode when you see you got it really wrong).

So the idea is that rather than search L around on a given frame, we use previous frames to just make a guess for L and only encode the frame once. If we get it wrong by a little bit, no big deal, we'll make up for it on the next frame. This only works when we are only trying to make small corrections.

Specifically : you did a previous pass at lambda L1 and that gave you total size R1. It also gave you a size for each frame : F1(i). You now want to hit size R2 (which is very close to R1).

At frame i you have already written W2 bits, so you have (R2 - W2) remaining. In pass one you had (R1 - W1) remaining at the same spot. Compute the desired size for the current frame as :

F2 = F1 * (R2 - W2) / (R1 - W1)

To hit frame size F2 , you know if F2 is close to F1 then you should use L = L1. If F2 is a little bigger or smaller either way you should adjust L slightly. To do that, we track a running estimate of dL/dF , call it M for the slope. So we use :

L = L1 + (F2 - F1) * M

We then compress with this L which gives us a frame size F(L). We then compute the actual observed slope :

S = (L - L1) / ( F(L) - F1 )

and then update M using S via IIR or FIR.

There are a lot of kludgy things you'd need to do here, like seed M with a decent guess, don't blend in S updates if you get a weird result ( like F(L) = F1 ), forbid L from making big steps away from L1 even if the estimate really wants to, etc.

Also I'm not sure if dF/dL is really the right variable to estimate ; maybe it would be better to do d(F/F1)/ d(L/L1) , or something. That is, for frames of different characters, what is the most consistent response of frame size to L variation?

1/12/2010

01-12-10 - Lagrange Rate Control Part 2

Okay, so we've talked a bit about lagrange coding decisions. What I haven't mentioned yet is that we're implicitly talking about rate control for video coders. "Rate control" just means deciding how many bits to put in each frame. That's just a coding decision. If the frames were independent (eg. as in Motion JPEG - no mocomp) then we know that lagrange multiplier R/D decisions would in fact be the optimal way to allocate bits to frames. Thus, if we ignore the frame-dependence issue, and if we pretend that all distortions are equal - lagrange rate control should be optimal.

What does lagrange rate control mean for video? It means you pick a single global lambda for the whole video. This lambda tells you how much a bit of rate should be worth in terms of distortion gain. Any bit which doesn't give you at least lambda worth of distortion gain will not be coded. (we assume bits are of monotonically decreasing value - the first bit is the most important, the more bits you send the less value they have). On each frame of video you want to code the frame to maximize J. The biggest single control parameter to do this is the quantizer. The quantizer will largely control the size of the frame. So you dial Q to maximize J on that frame, this sets a frame size. (maximizing J will also affect movec choices, macroblock mode choices, etc).

Frames of different content will wind up getting different Q's and very different sizes. So this is very far from constant bit rate or constant quantizer. What it is is "constant bit value". That is, all frames are of the size where adding one more bit does not help by lambda or more. Harder to code (noisy, fast motion) frames will thus be coded with more error, because it takes more bits to get the same amount of gain in that type of frame. Easy to code (smooth, flat) frames will be coded with much less error. Whether or not this is good perceptually is unclear, but it's how you get optimal D or a given R assuming our D choice is what we want.

Ideally you want to use a single global lambda for your whole video. In practice that might not be possible. Usually the user wants to specificy a certain bit rate, either because they actually need to meet a bit rate maximum (eg. for DVD streaming) , or because they want to specify a maximum total size (eg. for fitting on your game's ship DVD), or simply because that's an intuitive way of specifying "quality" that people are familiar with from MP3 audio and such. So your goal is to hit a rate. To do that with a single global lambda, you would have to try various lambdas, search them up and down, re-encode the whole video each time. You could use binary search (or maybe interpolation search), but this is still a lot of re-encodings of the whole video to try to hit rate. (* more on this later)

Aside : specifying lambda is really how people should encode videos for distribution as downloads via torrents or whatever. When I'm storing a bunch of videos on my hard disk, the limitting factor is my total disk size and the total download time - I don't need to limit how big each individual movie is. What I want is for the bits to go where they help me most. That's exactly what lambda does for you. It makes no sense that I have some 700 MB half hour show that would look just fine in 400 MB , while I have some other 700 MB show that looks like shit and could really use some more bits. Lambda is the right way to allocate hard drive bytes for maximum total viewing quality.

Okay. The funny thing is that I can't find anyone else on the web or in papers talking about lagrange video rate control. It's such an obvious thing that I expected it to be the standard way, but it's not.

What do other people do? The de-facto standard seems to be what x264 and FFMPEG do, which I'll try to roughly outline (though I can't say I get all the details since the only documentation is the very messy source code). Their good mode is two pass, so I'll only talk about that.

The primary thing they do is pick a size for each frame somehow, and then try to hit that size. To hit that frame size, they search QP ( the quantization parameter) a bit. The specifically only search QP in the local neighborhood of the previous QP because they want to limit QP variation between frames (the range of search is a command line parameter - in fact almost everything in this is a command line parameter so I'll stop saying that). When they choose a QP, there's a heuristic formula for H264 which specifies a lambda for lagrange decisions that corresponds to that QP. Note that this lambda is only used for inside-the-frame coding decisions, not for choosing QP or global rate allocation. Also note that the lambda-QP relationship is not necessarily optimal; it's a formula (there are a bunch of H264 papers about making good lambda-QP functional fits and searches). They also do additional funny things like run a blurring pass on QP to smooth out variation; presumably this is a heuristic for perceptual purposes.

So the main issue is how do they pick this frame size to target? So far as I can tell it's a mess of heuristics. For each frame they have a "complexity" measure C. On the first pass C is computed from entropy of the delta or some such thing, raised to the 0.8 power (purely heuristic I believe). The C's are then munged by some fudge factors (that can be set on the command line) - I frame sizes are multiplied by a factor > 1 that makes them bigger, B frame sizes are multipled by a factor < 1 that makes them smaller. Once all the "complexities" are chosen, they are all scaled by (target total size) / (total complexity) to make them add up to the total target size. This sets the desired size for each frame.

Note that this heuristic method has many of the same qualitative properties as full lagrangian allocation - that is, more complex frames will get more bytes than simpler frames, but not *enough* more bytes to give them the same error, so more complex frames will have larger errors than simpler frames. However, quantitatively there's no gaurantee that it's doing the same thing.

So far as I can tell the lagrange method is just better (I'm a little concerned about this because it just seems to vastly obviously better that it disturbs me that not everyone is doing it). Ignoring real world issues we've glossed over, the next big problem is the fact that we have to do this search for lambda, so we'll talk about that next time.

ADDENDUM : x264/ffmpeg rate control references :

ratecontrol.txt - Loren is the main dev but this is very cursory
FFmpeg RateControlContext Struct Reference
FFmpeg libavcodecratecontrol.c Source File
[Ffmpeg-user] changing bitrate on the fly - detailed rate control document, but not written by one of the main devs so beware
x264 Macroblock Tree Ratecontrol testing (committed) - Doom9's Forum - this is about the dependency issue that we haven't discussed

01-12-10 - Lagrange Rate Control Part 1

I'm going to talk about video coding and code stream design. It's important to remember the end goal - we want to make a coded bit stream which is <= N bytes and plays back with the highest quality possible. In all cases I assume the decoder is a fixed known quantity.

First of all it's useful to consider what you would do if you had infinite CPU time. The answer is you should try all coded bit streams. That is, to make the optimal N byte stream, you should make 256 ^ N streams, run each through the decoder, measure the error somehow, and choose the best. That sounds ridiculous, but it is our goal; everything else we talk about will be approximations that are trying to get as close as possible to that.

Let's try to get into the realm of reality. You have some encoder that you think produces good code streams for your given decoder. In various places in your encoder you have decision points - what quantizer to use, what macroblock mode to use, whether to send this value as its true self or as any other value, etc. If you had lots of CPU power you should do a brute force search on all decisions - try every decision all possible ways; if there are M decisions and each has K choices this is K^M. Reject code streams produced that are bigger than N. Measure the error of the results and choose the best.

Okay, that's still ridiculously impractical. But let's consider a simpler case. Say you only have two coding decisions, and most importantly - they are independent, that is one decision does not affect the other. For example say you are coding two separate images, and you are only allowed to choose the quantizer on each image, and your goal is to minimize the total error. If there are K quantizer choices, a full search would mean trying K^2 ways. But in that's not necessary because of independence. All you have to do code the first image K ways & remember the rate (R) and distortion (D) of each way, then the second image K ways and remember R/D, so you have 2K codings. Now to find the optimal combination you could do K^2 table lookups, but even here you can usually do much less, because the R/D tables are monotonic (or very nearly monotonic) which means rather than doing K^2 you only need to pick an entry on each side and then slide each one up and down to search for improved D given the R constraint, which is O(K). This obviously extends to N independent coding choices ; rather than K^N we can do K*N codings.

The simplest way to do this in practice is with a lagrange multiplier framework. Rather than look for optimal D given R, we look for an optimal lagrange cost : J = D + lambda * R.

If J is maximized by our coding choices, it means dJ/dCode = 0. That means 0 = dD/dC + lambda * dR/dC , or dD/dR = - lambda

That is, lambda has selected a slope of the D(R) curve. If D(R) and dD/dR are both monotonic, then this has uniquely selected a rate & encoding. We shall henceforth assume that this monotonicity requirement is true. In practice it is not quite true but it is "mostly" true which we can deal with in hacky ways (* perhaps more on this niggle some day). With this assumption, if your goal is to hit a certain rate, you can monotonically search lambda to find an encoding that maximizes J at that rate.

Now the key point is that maximizing J for a given rate gives you the optimal encoding for independent coding choices. Consider our example of the two images above. Instead I make a quantizer choice from the K choices on each image independently. On each one I measure J and choose the one that maximizes J. (note that since R and D are monotonic I can actually just do a binary search here that's O( log(K) ) not O(K) ). Because they are independent, I have found the maximum J for the full stream. This is the optimal encoding. Why? Because it has made the slope dD/dR equal at each of the coding choices. That is, I can't move bits from one image to the other and get a better distortion at the same rate. If the slopes were not equal, then I would get a win from moving the bits away from where they affect D less to where they affect D more.

This is the same as before, but the win in terms of code flow and structure should be obvious. Before I had to do a bunch of encodings and then go back and consider different bit allocations to find the best. Now I can do one linear pass through my codings and make the optimal J decision at each step, and then after a decision is made I can forget about it and move on to the next and never revisit it. The disadvantage is now that my encoding is parameterized by the unintuitive lambda parameter rather than just R, the rate; if I want to hit a specific rate I have to search lambda in an outer loop (* perhaps more on this later).

Now we're going to make a big leap of faith and see what happens if we try to use this sort of one-pass lagrange optimization on more complicated real world scenarios. What goes wrong? Two big issues. One is that D might not be independent, the other is that coding is not independent.

To be clear, our proposal is to do one-pass lagrange coding decision making. Encode the data in streaming one pass scan. At each decision point, you try all possible ways for the current subset of the data only. You measure J on the current subset and take the decision that maximizes J on that subset. You then do the encoding using that decision and output the encoded bits, and move on to the next subset. R (the rate) can of course be measured on each subset independently (if arithmetic coding, you can count the fractional bits left in the state as well). D (the distortion) must be some measure that can be done locally.

What about D (the distortion measure) ? Well, if D is SSD (sum of square differences) (aka L2 error or MSE), then it is in fact independent from pixel to pixel or block to block, which means it is fully correct to measure D only locally on each decision. But you might object that this is not a good way to measure distortion. I'm going to ignore this for now and just posit that we choose D to be SSD and we're going to optimize for that. (* more on this later).

The big issue is that in the real world our coding decisions are not independent. A huge affect with video coding is of course motion compensation - blocks can be used as reference for later blocks. That means a coding decision now can have affects far in the future. Even if we ignore that and just talk about image coding, blocks affect the coding of other blocks through statistical coding - be it huffman, arithmetic, or context coding. Typically this has an affect of the form : if I chose mode A for the current block, that has the side effect of making mode A cheaper for future blocks, and all other modes slightly more expensive. I contend that the statistical affects on the future are not a huge issue. (you may recall when we talked about LZ77 optimal parsing, I made the same hand-wavey argument to dismiss this issue). There is one big issue about the statistical effects - early on in coding when statistics are sparse, decision 1 can have a huge affect on the statistics which heavily biases decision 2 and leads you far away from the true optimal choice. To stop this, the statistics for decision making purposes should be preconditioned with some "normal" data to reduce the affect of strange early decisions. Note that the statistics used for actual coding need not be the same as the statistics for decision making.

Sometimes a purely greedy coding decision can be very bad, though. Consider the case of something like trellis quantization of DCT coefficients. The true output of the DCT is X,Y,Z. But you need not transmit those values, you can transmit anything and it will just have some rate and distortion. For example, say the output is 0,2,3,0,X,0,0,0,0 . How should you transmit X ? If you're using something like run-length coding or end-of-block signalling, there is a big advantage to sending X as 0. You can't know that unless you could see the future after X. If you only saw 0,2,3,0,X, and didn't know what followed you wouldn't see that. This is a well known issue with quantization of trailing DCT coefficients, but it's also an issue with things like macroblock mode decisions in video coding. The reason is that mode equal to previous mode is coded so much smaller than any other mode, when you make a decision you very strongly affect not only your current rate, but also the following block. One remedy to this is "semi-non-greedy lookahead" type coding as is well known from LZ77 coders; that is, instead of just doing one coding decision, you try all ways for the next 2 or 3 or 4, choose the one that optimizes all of them together, and then discard all but the current one and step ahead one step. Okay I'm going to sort of ignore this issue for now, but we should keep it in the back of our mind that purely greedy forward-scan decision making can be pretty far from optimal.

The other big issue for video is motion compensation. Yeah, that's a big issue, but it's not specific to this type of code stream optimization, so I'm not going to discuss it now. It's an issue that you must address in any video coder that makes decisions. The way to address it in this framework is to choose some scaling factor for "D" on each block. You compute J = c * D + lambda * R , where normally c would be 1, instead you bias it up or down depending on whether you decide the current block is more or less important than average. (blocks which will be sources for future motion compensation should be considered more important in the forward greedy pass decision making). (* perhaps more on this later).

To be continued ...

12/18/2009

12-18-09 - The ACM

There's a bunch of hubbub lately in the game/graphics community about the anti-scientific actions of the ACM.

I find it all much ado about nothing. Of *course* the ACM is a bunch of cocks that are against scientific progress and dissemination of information. Ooo the group of evil bastards who claim all the rights to the paper that you yourself wrote are being dicks !? Well big fucking surprise. Woop-tee-do.

Everyone should boycott the ACM and the IEEE. Duh. The paper publications have absolutely zero purpose now. The conferences have almost no purpose, it's just a bunch of hob-nobbing and back-slapping. The internet is how real information is conveyed. I'm not quite sure why people are so enamored of conferences. I can get way more information in 10 minutes reading papers on the internet than you can get from a 4 day conference, because I am pulling the information I want at the pace that I want, not having it pushed to me.

Now, peer review and collection of papers does in fact provide a service. You don't just want researchers putting up papers on their own web sites with no organization and no peer review. (for example, Arxiv is great and all, but the lack of organization and peer review makes it weak as a primary publishing location).

If you actually cared about getting away from the evil iron grip of the ACM and IEEE bastards, you should work harder to organize and promote the free online journals, of which there are many, but they haven't caught on because the majority of major researchers are smitten by the prestige of the ACM/IEEE names.

Go to DOAJ , pick a journal, support that. Or you know, start a new one focused on higher performance / realtime computer graphics.

11/18/2009

11-18-09 - Raw Conversion

CR2 format is a big mess of complication. My god I hate TIFF. The main thing is the sensor data. It appears to be stored as "lossless JPEG" which is a new format that uses the JPEG-LS predictor but then just codes the residual with normal JPEG Huffman coding. The sensor data is RGGB which they either store as a 4-channel per pixel [RGGB per pixel] or as 2-channel [GR or GB]. Either way is clearly not optimal. One interesting thing I could do if I cracked the CR2 format is store all these raws smaller with a better compressor. The RAWs from the S90 are around 11M on average, it uses the 2-channel mode; the RAWs are 1872x2784 = 3744x2784 samples and 12 bits per sample. That means the JPEG is getting to 8.85 bits per sample. Not very good.

Of course I probably have to use dcraw to read it for me, but dcraw is just about the worst piece of code I've ever seen in my life. It's a miracle to me that people are able to write functioning software from code like that.

Paul Lee has a modified dcraw and some nice sample pictures of how demosaicing can go wrong (click the Moire or Aliasing links).

My idea for high quality RAW processing :


First of all, abandon your idea of an "image" as a matrix (grid) of colors (aka a bitmap).

The S90 sensor has barrel distortion that's corrected in software.

It also samples colors in an RGGB Bayer mosaic pattern (like most cameras).

The two of those things combined mean that you really just have a collection of independent R's, G's, and B's at
irregular positions (not on a grid due to barrel distortion).

Now, you should also know that you need to do things like denoising on these original samples, NOT on
the grid of colors after conversion to a bitmap.

So I want to denoise directly on the source data of irregular color samples.
Denoising R & B should make use of the higher quality G data.

Denoising should of course use edge detection and other models of the image prior to make a Bayesian
maximum likelihood estimate of the sample without noise.

To output a bitmap you need to sample from this irregular lattice of samples (mosaic'ed and distorted).

Resampling creates aliasing and loss of information, so you only want to do it once ever on an image.

There's absolutely no a-priori reason why we should be resampling to the same resolution as the sensor
here.  You should resample at this point directly to the final resolution that you want your image.

For example with the S90 rather than outputting the stupid resolution 3648x2736, I would just output 3200x2400
which would let me view images at 1600x1200 on monitors with a box down-filter which will make them appear
much higher quality in practice (vs 3648x2736 viewed at 1600x1200 which involves a nasty blurring down-filter).

The output from this should be a floating point bitmap so that we don't throw away any color resolution
information.

Exposure correction can then be done on the floating point bitmap without worrying about the irregular
lattice or any further resampling issues.

11/06/2009

11-06-09 - IsSameFile

I found myself wanting to know if two file names were the same file on disk. It's hard to check that just by looking at the name. Obviously you have issues like one might be absolute, one might be relative. Even if you fix that, they could be different A-code-pageizations of unicode names. And something I hit often is one of them might be on a "subst" or even a hard link. I want to know if they are actually the same file.

This appears to work :


bool IsSameFile(char * Name1,char * Name2)
{
    HANDLE f1,f2;
    BY_HANDLE_FILE_INFORMATION info1;
    BY_HANDLE_FILE_INFORMATION info2;
    
    f1 = CreateFile(Name1,GENERIC_READ,FILE_SHARE_READ,0,OPEN_EXISTING,0,0);
    if ( f1 == INVALID_HANDLE_VALUE )
        return false;
    
    f2 = CreateFile(Name2,GENERIC_READ,FILE_SHARE_READ,0,OPEN_EXISTING,0,0);
    if ( f2 == INVALID_HANDLE_VALUE )
    {
        CloseHandle(f1);
        return false;
    }   
    
    GetFileInformationByHandle(f1,&info1);
    GetFileInformationByHandle(f2,&info2);
    
    CloseHandle(f1);
    CloseHandle(f2);
    
    // BY_HANDLE_FILE_INFORMATION has a unique file ID and Volume Serial Number in it
    //  check those are the same
    //  heh fuck it just check they are all the same
    
    // confirmed : this does work across substs
    
    return memcmp(&info1,&info2,sizeof(info1)) == 0;
}

11/04/2009

11-04-09 - Video is not for Windows

Holy crap video is a flaming mess.

First, if you don't know, there are two main things : packages & streams. The packages are AVI, MKV, MP4, MOV ; they put together image data and audio data in some way (the layer that unpacks these is often called a "splitter" or "demuxer"). The packages send their streams to codecs which convert them to some format for display. On Windows this is all supposed to go through DirectShow which is supposed to use the 4CC codes and some priority information to automatically find the right handler for the various streams and packages. In theory.

The first problem you hit on Windows is that AVI packages are handled pretty well, but AVI can't hold H264 video because AVI can't handle the flexible B-frame ordering that H264 can generate. (limitted profiles of H264 can be put in AVI, and there are hacks around this problem, but you're getting into a world of hurt). So you need MKV or MP4 boxes, and those are handled poorly; some apps handle them okay, some don't. Some apps "cheat" and don't trust DirectShow like they're supposed to (the cheating apps often work better).

Things I've installed lately :


MP4Box : MP4 stream boxer/unboxer ; pretty decent app, recommended, but help is poor

YAMB : GUI for MP4Box.  Useful to help figure out command lines for MP4Box because the help is bad.
    YAMB has bad bugs though and will fail to launch MP4Box, so you have to copy out the command line
    and run it yourself

MKVVerify : MKV stream checker.  Useful because MKV support is so fucking borked.

MediaInfo : Media info reporter.  Questionable usefulness because I don't trust it and don't know where
    it's getting it's info for.

Graphedit : DirectShow graph visualizer and tester from MS

GSpot : AVI info tool.  Useless.

MSU VMT : Moscow State University Video Quality Measurement Tool.  This is pretty neat when it works,
    but far too often it fails to get the frames correctly, so you get totally bogus results.

MSU LS Codec : Moscow State University Lossless Codec.  Best compressing lossless codec, seems nice
    but crashes some tools when you try to use videos compressed with this.  Thus useless.

Lagarith Codec : This appears to be the one good working lossless codec.  Recommended.

HuffYUV Codec : Videos made with this crash me on read.  Jeff says it works great for him.  Avoid.

MeGUI : GUI for "mencoder" which can driver AviSynth and x264 ; like all of these big GUIs that try to
    run a bunch of other products, this mysteriously fails for me.  It seems to set everything up right
    and then it launches ten other programs and they fail to hook up in the way MeGUI expected them to.
    Garbage.

Handbrake : see MeGUI

FFDShow : hooks up the Linux video decoders (ffmpeg , libavc, etc.) to DirectShow.  This thing is
    pretty evil and fails to report frame rate and media info sometimes, but is also the only real
    choice. 

Haali Media Splitter : MKV unboxer, works with FFDShow.  Difficult to install correct manually.
    Even when installed correctly, does some weird shit with framerate; doesn't seem to report it
    correctly through DirectShow.  Probably best to get a codec pack like :

K-Lite Codec Pack : works for me but generally is considered malware
Matroska Codec Pack : didn't work for me
CCCP Codec Pack : not tried

MPlayer : Linux media player, now ported to Windows ; very flexible command line control of everything,
    alternate audio/video in/out.  Highly recommended.

MEncoder : video encode/decode partner to MPlayer.  I've had more success running mplayer and x264 manually
    than using this.  Still I can't complain about MEncoder from the command line.

MPUI : GUI for MPlayer.  This is horrific malware.  When you install it, it takes over your system without
    asking.  They do provide some tools for you to change this after the fact, but still should be avoided.
    Use Media Player Classic or VLC.

AviSynth : script thing to pipe video to other programs that read AVS scripts.  Dear lord.

Basically I've found that all the GUI's are broken, and all the video containers (AVI,MP4,MKV) are broken. The thing I've finally discovered that actually works is using MPlayer and X264 from the command line, and only working with split frames. Trying to work with video containers caused me all kinds of hurt because so many of these apps fail to unbox the containers right and screw up the frame rate or drop frames or other mistakes. Instead now if I want to work on a video I use MPlayer to convert it to raw frames.


mplayer -benchmark -ao null -vo png:z=5 video.avi

to dump frames to PNG

mplayer -benchmark -ao null -vo yuv4mpeg:file=test.y4m video.avi

to dump the video to YUV4MPEG format in "test.y4m" for input to x264

x264.exe --bitrate 10000 --output "out.mp4" test.y4m

x264 compress to "out.mp4"

Then use mp4box to put the audio back if wanted.

The cool thing about mplayer is that its audio/video decoders are the same ones used to view the video. So you can watch it, and if it plays right in the viewer, then it will extract correctly. I've found lots of videos that I can watch in MPC or VLC, but then fail to load the same way in whatever encoder/decoder when I try to process something.

The sucky thing about this method is you make ginormous temp files on your disk, which also slows things down a lot. But avoiding the fuckups of going through the borked DShow codecs and splitters is worth it.

Most of these tools now are originally Linux tools that are getting moved back to Windows. One very promising development is that many of them have the option to directly load libs for the codecs Linux-style (eg. just load libavc to play video) and avoid DirectShow completely. I haven't really tried that yet but it seems like it's almost possible to work with video just by manually picking a few of these libs and then you avoid the whole Windows borked media layer.

ADDENDUM : one of the difficulties I've seen in a lot of tools is reading the frame rate wrong. This is presumably due to the demuxers not reporting things back totally right. But there are also two related fundamental problems that make everything harder :

1. Most of these formats don't have real/useful headers. (if they do have header info, it's just added as "comment" information). This was done originally because theoretically if your AVI is being broadcast on TV and you change the channel into it, you will just start getting bytes in the middle and never see the header, thus they didn't put good headers on at all.

2. It's almost impossible to really reliably just get the frames out of video. DirectShow doesn't have a reliable call that's just "give me the next frame". Instead you have to ask for "when is the next frame" and then "give me an image at this time". The problem is that the "when" can get fucked up in various ways, and then when you say "give me an image at this time" you can either skip frames or get duplicate frames. (this is what fucks up the MSU VMT tool for me so badly, they are getting the time sampling all wrong quite often).

Even if it's not way off, this still causes subtle bugs because people don't agree exactly on how to represent the frame rate. Some people treat broadcast as exactly = 30000/1001 fps and use rational arithmetic for all timing. Some people use floats for frame rate and use 29.97003 and then wind up with floating point precision problems at high frame numbers. Many of the containers store the frame rate as a number of microseconds between frames, eg. 33367 ; so if they store "33367" in the header, should I use that as my frame time increment exactly, or should I use 33366.666666 ?

I'm guessing that tons of people get duplicate and/or dropped frames because of this and just don't notice it.

10/19/2009

10-19-09 - Against asynchronous GUIs

Most new Microsoft products have super asynchronous GUIs. This means they do GUI refreshes and window popups and threads and so on. This sounds like a nice thing, it removes big stalls and freezes in the GUIs and makes them "respond" more quickly, in the sense that when you request some new view it pops the window up immediately and then gradually fills it in, whereas before it would just freeze and wait for the new view to pop up.

In practice for power users this just sucks absolute balls. Power users use the keyboard. We use our peripherial vision. We don't sit around and wait for windows to pop up, we hit key combos that we know and we memorize sequences. So for example you might memorize that hitting Alt-F opens a find dialog and puts the focus in the entry window, so you can highlight the text you want to find, Ctrl-C, Alt-F, Ctrl-V, Enter, all in a row boom you're searching for that text.

BUT NO! No you aren't, instead you just pasted that text into your document because the fucking finder window was being built on a thread and wasn't opened in time to receive your ctrl-v so your ctrl-v went to the old window.

There are few obvious GUI design fails here. One is that asynchronicity and focus stealing should never go together. If a window opening is async then it should not transfer focus to that window. Also, if a key pops up a new window and transfers focus, do that immediately and queue up the key presses and don't process them until the action associated with each key press is complete. eg. if you get input like A,B,C, if you process A and it fires something async, then don't respond to B until the action from A is done if it's possible that they affect each other.

More generally, this programming pattern of finding clever complicated ways to hide the fact that your systems are overly bloated and slow is just not the win. You will only make the failure cases less common but more ugly. For example in this particular case I'd rather have a slow modal popup than an unpredictable async popup.

Now, there certainly is a place for asynchronicity in GUIs, namely in tasks spawned from GUIs that run in a pane. When you start some long process, that should be async from the GUI and run on its own thread, but I would argue that the entire GUI should run on one single thread and GUI drawing updates of things like icons and buttons should not be async (not any aspect of it that affects user input - purely visual updates could be async). So like for example, when you open a new folder, the population of the list for that folder should NOT be async because the user may be typing characters to navigate to names within the folder, and if you populate async you will screw up the response to their typing (it could be okay to be async here *if* you make it respond in the same way that it would if it was synchronous!). Ideally GUI operations should just be fast and not need to be async.


How do we not have a fucking universal laptop docking multi-port thing !? WTF.

The delay of Windows/PC's recognizing USB devices is really tilting. Every time I move my laptop or plug in the keyboard, it's wait .. wait .. wait .. okay now I can type. Urg.

I'm trying this new thing where I put my computer on automatic standby after 2 hours. I've tried lots of reminder apps before to make myself get up and stretch, but I always just ignore them. Standby is a hard enough obstacle that I think it might make me actually get up - if nothing else I'll get up in anger because the standby will fuck up what I'm doing. But when I come out of standby I have to deal with the fucking USB.


Computer-laid-out maps are abominations. Hand drawn maps are beautiful and convey so much more information. (what follows are just links related to maps, not examples of great hand-drawn maps)

Metskers Maps in Seattle is pretty great.

FeetFirst is a non-profit that supports walking in Seattle. Their maps are only okay.

See also : City of Seattle Map of public art and Historic Places in Seattle .

9/29/2009

09-29-09 - Aliasing is Pretty

I made some images of aliasing/moiray patterns : (BTW inherent to these images is the fact that they don't look at all right except at full zoom since the image structure is created by the pixel grid sampling aliasing, so click through). You may be surprised to know what these are an image of.

Photobucket

Photobucket

Photobucket

Photobucket

Photobucket

Photobucket

Photobucket

....

Answer : these are pictures of sin( x^2 + y^2 ) , that is, sin( r^2 ) a very simple radial trig function. You don't see the function at all, the images are 100% aliasing as the radial sin is scaled to very high frequency.

9/26/2009

09-26-09 - Habits - Code Inlining

There's been an off and on discussion with some peers in email about the issue of putting small code functionality directly inline in a bigger function vs. splitting it out. I don't think it's actually an interesting topic so I don't want to repeat it here, but it made me think of something I want to write about.

Basically the contention is that even if you have a logically separate piece of work, sometimes it's better to go ahead and write it inline in the larger function, because it makes the code flow more linear and imperative and thus easier to read and debug. There are also obvious disadvantages to writing code inline - less strong separation of functional bits, temptation to copy-paste code, less ability to debug and test functions independently, etc. etc. All the pros and cons are obvious and thus not interesting.

The key point to me is the issue that while code inline may be a win sometimes - it's a big loss at other times, and it requires you to make a careful smart logical decision about which way to go.

That's very bad. I think most smart people as they get older come to the realization that you shouldn't put yourself in positions where you have to make careful logical decisions over and over. Instead you should craft habits and rules for yourself that help you do the right thing without having to make careful decisions.

One obvious case that most smart people figure out is diet and exercise, or booze or other indulgences. Yes, it's perfectly fine to have a some dessert once in a while, but if you open that door for yourself you're putting yourself in a situation where you are consciously making a decision "is it okay for me to have cake today?" and you will inevitably get lazy and make the wrong choice sometimes.

As usual the best analogy is poker, and it's how this point was really made real for me. Smart people often start out playing poker trying to logically reason out every single action and think they don't need to be constrained by simple rules or habits. That's great if you really are correctly thinking through every situation, but you inevitably get tired or lazy or make mistakes, and if you're not constraining yourself with rules, you can make huge mistakes.

For example, there might be cases where the best play is to limp aces up front, or to not reraise with aces, but correctly making that decision requires a lot of careful thought, and the upside to making that decision is pretty small, while the downside to doing it in the wrong situation is very big. It's best to just make a rule for yourself that you always raise or reraise. It simplifies things, it removes decision points and lets you focus on more important issues. It might be +EV to call raises sometimes with hands like 68o, but it's best to just give yourself a rule that you never do that.

To be clear - these rules are specifically non-optimal. By making rules for yourself you are intentionally choosing to not try to be 100% optimal, so of course someone can say "you could have done something better there". That's not the point. The point is that if you try to make perfect decisions all the time you will occasionally fail very badly.

Winning poker play (or good coding, or good life living) are largely about making it easy for yourself to do the right thing.

9/25/2009

09-25-09 - Motion Search

I just read the paper on Patch Match and it makes me angry so I figure I'll write about the motion search method I'm developing for possible future use in the new RAD video stuff. PatchMatch is just so incredibly trivial and obvious, it's one of those things that never should have been a paper and never should have been accepted in a journal. It's a great thing for someone to write on their blog because you can describe it in about one sentence, and most experts in the field already know the idea and are probably doing it already. (I will say the good thing about the paper is they do a good job of gathering references to other papers that are related, such as stuff in texture synth and hole filling and so on which I find interesting).

Here's the one sentence version of PatchMatch : Seed your match field with some random guess or shitty initial matches; improve by incrementally propagating match offsets to neighbors and trying small random deltas to find improvements. (it's an absolute classic spin network magnetic moment relaxation kind of problem).

Here's what I've been doing : start with a match field set to all nulls (no match found yet). Then incrementally fill it in with matches and propagate them to neighbors. It proceeds in a few steps like this :

Step 1. Use computer vision methods to find feature points in a frame. Match feature points to the previous frame. This bit is a bit tricky and tweaky, you only want to make matches that you're pretty confident in. Note that this matching is done based on a "characteristic" of the feature point which has no distance limit, and is also somewhat immune to rotation and scaling and such. Sometimes this step finds some very good correspondences between the frames, but it's sparse - it only has high confidence at a few places in the frame, so you can't use it to find all the block matches (and you wouldn't want to even if you could). Generally this finds around 100 vectors.

Step 2. Find "distinctive" spots in the frame. The goal is to find some spots that are not degenerate - eg. not flat patches, not straight edges. The idea is that these are places where we can likely find a good motion vector with high confidence, unlike degenerate areas where there are lots of equally good match vectors. I use two mechanisms to find distinctive spots : one is the computer vision feature points that were not already used in the first matching step. The second is to take the "cornerness" map of the image using a Harris or Hessian operator on the derivative of gaussians (this is a lot like an edge map, but it kills straight edges). Find the top 5% highest cornerness values that are local maxima, and use those as distinctive spots. All of the distinctive spots do a long distance brute force block match (something like radius = 16 or 32) to try to find a good motion vector for them.

Step 3 : Flood fill to fill in the gaps. We now have presumably good motion vectors at a few key points in the frame. Go to their neighbors and search for match vectors that are close to the neighboring one that we already found. Put that in the blank and push its neighbors to the queue to continue the flood fill.

Step 4 : Relaxation pass. (this is not critical). We now have a motion vector everywhere in the frame. For each match vector in the frame, look at its 4 neighbors. Examine match vectors that are near my 4 neighboring vectors. If one is better, replace self. Continue to next. Theoretically you should do this pass a few times, but I find 1 or 2 is very close to infinite.

The key thing is that motion is usually semi-coherent (but not fully coherent, because we are not really trying to find true motion here, but rather just the best matching block, which is a lot more random than true motion is). By finding very good motion vectors in seed spots where we have high confidence, we can propagate that good information out to places where we don't have as much confidence. This lets us avoid doing large brute-force searches.

BTW I really do not understand the point of all the "diamond search" type shit in the video compression literature. It seems to just find really shitty motion vectors and is not making good use of the possibilities in the bit stream. Especially with GPU video encoding in this modern age, doing plain old big chunks of brute force motion search is preferrable. (yes, I know it's for speed, but it's a poor way to optimize, and the high quality encoders are still non-realtime anyway, so if you're not realtime you may as well take some more time and do better; plus the vast majority of use of non-realtime video encoders is in an encode-once decode-many type of scenario which means you should spend a lot of cpu and encode as well as possible).

With this method I find motion vectors using local searches of radius 8-16 that are the same quality as brute force searches of radius 50-100, which makes it about two orders of magnitude faster (and higher quality, since nobody does brute force searches that far).


ADDENDUM : To give this post a bit more weight, here are some numbers on quality from my video coder vs. brute force search radius :


 -s16  : rmse : 9.3725 , psnr : 28.7277
 -s26  : rmse : 9.2404 , psnr : 28.8510
 -s48  : rmse : 9.0279 , psnr : 29.0531
 -s64  : rmse : 8.9171 , psnr : 29.1603
 -s100 : rmse : 8.7842 , psnr : 29.2907
 -s9999: rmse : 8.5294 , psnr : 29.5465

(-s16 means it's searching a 33x33 grid for motion vectors) (-s9999 means it searches full frame).

The above described iterative feature point propagation method gets


 -sfast: rmse : 8.8154 , psnr : 29.2600

BTW for doing full-frame brute force search you obviously should use a block-space acceleration structure for high dimensional nearest neighbor search, like a kd-tree, a bd-tree (box decomposition) or vp-tree (vantage point). High dimensional spaces are nasty though; the typical idea of "find a cell then walk to its immediate neighbors" is not fast in high D because you have O(D) neighbors.

9/08/2009

09-08-09 - DXTC Addendum

Ryg pointed out that there are a few very important little details that I took for granted and didn't mention in my original DXTC postings , or was just not clear about :

One is that when I try all ways of hitting two given endpoints, I try both 4-color and 3-color versions. That is, given two endpoint colors C0 and C1, I quantize them to 16 bits, then try the DXT1 palette that you get from {C0,C1} and also the one from {C1,C0} (order of DXT1 determines whether it is 3-color or 4-color).

The second and related crucial thing, is that in 3-color mode, the extra color is transparent black. If the texture has no alpha at all, I assume the user will not be using it as an alpha source, so I treat the transparent black as just black. That is, I do color palette selection with alpha just ignored.

Apparently this is pretty important. I suspect this especially helps with the "4 means" method; if a bunch of the colors are near black, you want them to be classed together and then just ignored for the endpoint selection, so that they will go to the hard-coded black in 3-color mode and your interp end points will be chosen from the remaining colors.

8/27/2009

08-27-09 - Oodle Image Compression Looking Back Pictures

I thought for the record I should put up some pictures about what I talked about last time.

First of all the R/D trellis quantization issue. Very roughly what we're doing here is coding to a certain bit rate. The "RDO" lets us use a smaller quantization bucket size, which initially lowers distrortion and increases our rate, but then we hammer on some of the values - mainly we just force them to zero, which causes some distortion and decreases rate; we choose to hammer the values that save us the most rate per distortion. (99% of the time all you're doing is turning 1's into 0's, so it's a matter of picking the 1 to squash to 0 which saves you most the rate).

Here are the results on "Moses" at 0.5 bits per pixel :

No R/D : RMSE = 9.9381 :

Photobucket

Unconstrained R/D : RMSE = 9.7843 :

Photobucket

You should be able to see in the R/D image that some of the image looks better, but other parts look much worse. The RDO has stolen rate from places where it was expensive in terms of rate to encode a certain distortion, and moved those bits to parts of the image where you can get more distortion win at a cheaper rate. This is awesome if your goal is to minimize RMSE, but it's unclear to me whether this is *ever* good perceptually.

In this particular case, the RDO Moses image actually has a worse SSIM than the No-RD image; this type of mistake is actually something that SSIM is okay at detecting.

In practice I use some hacks to limit how much the RDO can do to any one block. With those hacks I almost always get an SSIM improvement from RDO, but it's still unclear to me whether or not it's actually a perceptual improvement on many images (in some cases it's a very clear win; images like kodim09 or kodim20 where you have big flat patches in some spots and then a lot of edge detail in other spots, the RDO does a good job of stealing from the flats to give to the edges, which the eye likes, because we don't mind it if an almost perfectly smooth area becomes perfectly smooth).

Now for the hacky perceptual smooth DC issue.

This is "kodim04" at 0.25 bpp ; no RDO ; no unblock , no perceptual DC quantization ; basically a naive DCT coder :

Photobucket

Now we turn on the hacky perceptual quantization that gives more precision to smooth DC's : (unblock still off) :

Photobucket

Note that the perceptual quant of DC means that we are using more of our bitrate for the DC band, so we give less bits to AC, which means using a larger quantizer for AC to match the bit rate constraint.

Now with unblocking , no perceptual DC quant : (RMSE = 12.8565 , SSIM = 58.62%)

Photobucket

With unblocking and perceptual DC quant : (RMSE = 12.9666, SSIM = 57.88%)

Photobucket

I think the improvement is clearest on the unblocked images - the perceptual DC quant one actually looks okay, the parts that are supposed to be smooth still look smooth. The one with uniform DC quant looks disgustingly bumpy. Note that the SSIM of the better image is actually quite a bit worse. Of course RMSE gets worse any time you do a perceptual improvement. You should also be able to see that the detail in the hat thatching is better in the nonperceptual version, but that doesn't bother the eye nearly as much as breaking smoothness.

ADDENDUM : some close up pictures of Moses' waddle area showing the R/D artifacts better. You should zoom these to full screen with a box filter and toggle between them to see most clearly. You should see the RDO killing blocks in the collar area very clearly. All you really need to do is look at the last picture of these four and you should be able to see what I'm talking about with the RDO :

Portion of Moses at 0.75 bpp : No lagrange optimization :

Photobucket

With Lagrange RDO :

Photobucket

Crop of No-L :

Photobucket

Crop of RDO :

Photobucket

8/25/2009

08-25-09 - Oodle Image Compression Looking Back

I did a little image compressor for RAD/Oodle. The goal was to make something with quality comparable to a good modern wavelet coder, but using a block-based scheme so that it's more compact and simple in memory use so that it will be easy to stream through the SPU and SIMD and all that good stuff, we also wanted an internal floating point core algorithm so that it extends to HDR and arbitrary bit depths. I wrote about it before, see : here or here . That's been done for a while but there were some interesting bits I never wrote about so I thought I'd note them quickly :

1. I did lagrange R-D optimization to do "trellis quantization" (see previous ). There are some nasty things about this though, and it's actually turned off by default. It usually gives you a pretty nice win in terms of RMSE (because it's measuring "D" (distortion) in terms of MSE, so by design it optimizes that for a given rate), but I find in practice that it actually hurts perceptual quality pretty often. By "perceptual" here I just mean my own eyeballing (because as I'll mention later, I found SSIM to be pretty worthless). The problem is that the R-D trellis optimization is basically taking coefficients and slamming them to zero where the distortion cost of doing that is not worth the rate it would take to have that coefficient. In practice what this does is take individual blocks and makes them very smooth. In some cases that's great, because it lets you put more bits where they're more important (for example on images of just human faces it works great because it takes bits away from the interior patches of skin and gives those bits to the edges and eyes and such).

One of the test images I use is the super high res PNG "moses.png" that I found here . Moses is wearing a herring bone jacket. At low bit rates with R-D Trellis enabled, what happens is the coder just starts tossing out entire blocks in the jacket because they are so expensive in terms of rate. The problem with that is it's not uniform. Perceptually the block that gets killed stands out very strongly and looks awful.

Obviously this could be fixed by using a better measure of "D" in the R-D optimization. This is almost a mantra for me : when you design a very aggressive optimizer and just let it run, you better be damn sure you are rating the target criterion correctly, or it will search off into unexpected spaces and give you bad results (even though they optimize exactly the rating that you told it to optimize).

2. It seems DCT-based coders are better than wavelets on very noisy images (eg. images with film grain, or just images with lots of energy in high frequency, such as images of grasses, etc). This might not be true with fancy shape-adaptive wavelets and such, but with normal wavelets the "prior" model is that the image has most of its energy in the smooth bands, and has important high frequency detail only in isolated areas like edges. When you run a wavelet coder at low bit rate, the result is a very smoothed looking version of the image. That's good in most cases, but on the "noisy" class of images, a good modern wavelet coder will actually look worse than JPEG. The reason (I'm guessing) is that DCT coders have those high frequency pattern basis functions. It might get the detail wrong, but at least there's still detail.

In some cases it makes a big difference to specifically inject noise in the decoder. One way to do this is to do a noisey restore of the quantization buckets. That is, coefficient J with quantizer Q would normally restore to Q*J. Instead we restore to something random in the range [ Q*(J-0.5) , Q*(J+0.5) ]. This ensures that the noisey output would re-encode to the same bit stream the decoder saw. I wound up not using this method for various reasons, instead I optionally inject noise directly in image space, using a simplified model of film grain noise. The noise magnitude can be manually specified by the user, or you can have the encoder measure how noisey the original is and compare to the baseline decoder output and see how much energy we lost, and have the noise injector restore that noise level.

To really do this in a rigorous and sophisticated way you should really have location-variable noise levels, or even context-adaptive noise levels. For example, an image of a smooth sphere on a background of static should detect the local neighborhood and only add noise on the staticy background. Exploring this kind of development is very difficult because any noise injection hurts RMSE a lot, and developing new algorithms without any metric to rate them is a fool's errand. I find that in some cases reintroducing noise clearly looks better to my eye, but there's no good metric that captures that.

3. As I mentioned in the earlier posts, lapping just seems to not be the win. A good post process unblocking filter gives you all the win of lapping without the penalties. Another thing I noticed for the first time is that the JPEG perceptual quantization matrix actually has a built-in bias against blocking artifacts. The key thing is that the AC10 and AC01 (the simplest horizontal and vertical ramps) are quantized *less* than the DC. That guarantees that if you have two adjacent blocks in a smooth gradient area, if the DC's quantize to being one step apart, then you will have at least one step of AC10 linear ramp to bridge between them.

If you don't use the funny JPEG perceptual quantization matrix (which I don't think you should) then a good unblocking filter is crucial at low bit rate. The unblocking filter was probably the single biggest perceptual improvement in the entire codec.

4. I also somewhat randomly found a tiny trick that's a huge improvement. We've long noticed that at high quantization you get this really nasty chroma drift problem. The problem occurs when you have adjacent blocks with very similar colors, but not quite the same, and they sit on different sides of quantization boundary, so one block shifts down and the neighbor shifts up. For example with Quantizer = 100 you might have two neighbors with values {49, 51} and they quantize to {0,1} which restores to {0,100} and becomes a huge step. This is just what quantization does, but when you apply quantization separately to the channels of a color (RGB or YUV or whatever), when one of the channels shifts like that, it causes a hue rotation. So rather than seeing a stair step, what you see is that a neighboring block has become a different color.

Now there are a lot of ideas you might have about how to address this. To really attack it thoroughly, you would need a stronger perceptual error metric, in particular one which can measure non-local patterns, which is something we don't have. The ideal perceptual error metric needs to be able to pick up on things like "this is a smooth gradient patch in the source, and the destination has a block that stands out from the others".

Instead we came up with just a simple hack that works amazingly well. Basically what we do is adaptively resize the quantization of the DC component, so that when you are in a smooth region ("smooth" meaning neighboring block DC's are similar to each other), then we use finer quantization bucket sizes. This lets you more accurately represent smooth gradients and avoid the chroma shift. Obviously this hurts RMSE so it's hard to measure the behavior analytically, but it looks amazingly much better to our eyes.

Of course while this is an exciting discovery it's also terrifying. It reminded me how bad our image quality metrics are, and the fact that we're optimizing for these broken metrics means we are making broken algorithms. There's a whole plethora of possible things you could do along this vein - various types of adaptive quantizer sizes, maybe log quantizers? maybe more coarse quantizers in noisy parts of the image? it's impossible to explore those ideas because we have no way to rate them.

As I mentioned previously, this experiment also convinced me that SSIM is just worthless. I know in the SSIM papers they show examples where it is slightly better than RMSE at telling which image is higher quality, but in practice within the context of a DCT-based image coder I find it almost never differs from RMSE; that is, if you do something like R-D optimized quantization of DCT coefficients with Distortion measured by RMSE, you will produce an image that has almost exactly the same SSIM as if you did R-D with D measured by SSIM. If RMSE and SSIM were significantly different, that would not be the case. I say this within the context of DCT-based image coding because obviously RMSE and SSIM can disagree a lot, but that axis of freedom is not explored by DCT image coders. The main thing is that SSIM is really not measuring anything important visual at all. A real visual metric needs to use global/neighborhood information, and knowledge of shapes and what is important about the image. For example, changing a pixel that's part of a perfect edge is way more important than changing an image that's in some noise. Changing a block from grey to pink is way worse than changing a block from green to blue-green, even if it's a smaller value change. etc. etc.

It seems to me there could very easily be massive improvements possible in perceptual quality without any complexity increase that we just can't get because we can't measure it.

8/05/2009

08-05-09 - Relacy License Notes

The lock-free code I posted with Relacy has a clarification to the license agreement added. If you have downloaded this please read and make sure you are in compliance. I've copied the added text here :

ADDENDUM ON RELACY LICENSE : (revised 9-14-09)

Relacy is now released under the BSD license :


    1 /*  Relacy Race Detector
    2  *  Copyright (c) 2009, Dmitry S. Vyukov
    3  *  All rights reserved.
    4  *  Redistribution and use in source and binary forms, with or without modification,
    5  *  are permitted provided that the following conditions are met:
    6  *    - Redistributions of source code must retain the above copyright notice,
    7  *      this list of conditions and the following disclaimer.
    8  *    - Redistributions in binary form must reproduce the above copyright notice, this list of conditions
    9  *      and the following disclaimer in the documentation and/or other materials provided with the distribution.
   10  *    - The name of the owner may not be used to endorse or promote products derived from this software
   11  *      without specific prior written permission.
   12  *  THIS SOFTWARE IS PROVIDED BY THE OWNER "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
   13  *  THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
   14  *  IN NO EVENT SHALL THE OWNER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
   15  *  OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   16  *  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
   17  *  STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
   18  *  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   19  */

My work with Relacy is 100% free for any use. However, the original Relacy license still applies to all work product made with Relacy, such as my code above.

The version of Relacy that I built my code with was released under a previous less clear and restrictive license. Dmitry says the new BSD license applies.

8/04/2009

08-04-09 - CINIT

Two questions I can't find answers to :

1. Is there a way to tell from a piece of code that you are being called from cinit ? eg. in C++ when a constructor causes some code to run, and that calls some function, and then I get called, is there anything I can check to see that I'm currently in cinit, not main?

(obviously a very evil thing I could do is run a stack trace and see what's at the top of the stack). I can't find anything in C that I can check, because the C stdlib is initialized before me, so to my cinit code it looks just like I'm in the app run.

The reason I want this is mainly for asserting & validation - I want to make sure that my own cinit code isn't calling certain things (such as memory allocation) so I want to put in checks like ASSERT( ! in_cinit() );

2. Is there a way to disallow cinit code in certain modules? For example, having cinit stuff in any library is very unsafe because you have to be wary that your OBJ could get dropped from the link and the cinit stuff will not be run, plus you have order of run issues, it's something I can handle fine in my own projects, but not something I want to force on clients. So I want to make sure that my actual deliverable libraries have no cinit stuff - but I do want cinit stuff in my test apps, so I don't want to just break it entirely. I'd really like a compiler error so I know I did a booboo right when I write it.

7/07/2009

07-07-09 - Small Image Compression Notes , Part 2

Deblocking survey :

There are a few different ways to come at the problem theoretically.

One is to work on post-decode data in spatial domain. These approaches basically work by explicitly trying to detect block edges and just filter them. This is the approach, for example, of the H264 "in loop deblocking filter" which there is a lot of literature on. See for example "Adaptive Deblocking Filter" by List, Joch, et.al. For an example of the filter-based approach on the 8x8 DCT case see "DCT-Based Image Compression using Wavelet-Based Algorithm with Efficient Deblocking Filter" by Yan and Chen. (BTW the JPEG standard contains a "block smoother" which basically predicts AC1 as a linear function from neighboring block coefficients. This is okay for the specific case of smooth images and very high quantization, but is generally not awesome and is an ancient technique. Ignore.)

A more hardcore version of the filtering approach is "Combined Frequency and Spatial Domain Algorithm for the Removal of Blocking Artifacts" which does adaptively-offset and adaptively-directed gaussian filters ; this is sort of like the image denoising stuff that creates pixel gradient flow vectors - the filters are local gradient adaptive so they don't go across real edges. This appears to perform quite well but is very expensive.

The other general approach is a more abstract maximum-likelihood idea. You received a lossy compressed image I. You know the original image was one of the many which when compressed produces I. You want to output the one that was most likely the true original image. This is a maximum likelihood problem, and requires some a-priori model of what you think "natural" images look like. In particular, for the case of quantized DCT coefficients, you have a quantized DCT coefficient C ; instead of just reproducing Q*C you can reproduce anything in the range { Q*C - Q/2 , Q*C + Q/2 } , and you should choose the thing in that range that makes the "best" image.

"Optimal JPEG Decoding" (1998) by Jung, Antonini, Barlaud takes this approach directly. Their results are not awesome though; presumably because their prior is not good. A more modern version of the same idea is "Block Artifact Reduction Using a Transform-Domain Markov Random Field Model" by Li and Delp which uses a better model for image likelihood, but is in the same vein of doing a brute force search in the allowed coefficient space to find the maximum-likelihood reproduction.

A related method that was popular for a while is "Projection onto Convex Sets". This is basically just a method of satisfying simple convex constraints in an optimization. Here our constraint is that the quantized coefficient stay the same, that is, repro in { Q*C - Q/2 , Q*C + Q/2 } . You then apply some target function, such as you want smoothness or something, and take iterative steps towards that goal and project onto the constraints one by one. There are a lot more details to this, I haven't paid too much attention to it because these are all crazy expensive and I want something realtime.

"Blocking Artifact Detection and Reduction in Compressed Data" by Triantafyllidis etal (2002) is in the same vein but simpler and more analytical. It again worse directly in DCT space on coefficients within their quantization range, but it directly solves for the ideal reconstruction value as a function of neighbors based on minimization of specific simple deblocking metric. You wind up with just some equations for how to modify each coefficient in terms of neighbor coefficients. While the paper is good, I think one of their base assumptions - that the frequencies can be dealt with independently - is not sound, and most other people do not make that assumption.

"Derivation of Prediction Equations for Blocking Effect Reduction" by Gopal Lakhani and Norman Zhong (1999) is an older, simpler still version of the Triantafyllidis paper. They only correct the first few coefficients and solve for optimal reconstruction to minimize MSDS (mean squared difference of slopes). You can actually look at the equations here and they're very intuitively obviously right. For example, the first AC coefficient should be corrected using the difference of the neighboring DC coefficients. In case you don't see that that's obviously right, if you have DC's like [8],[16],[24] after dequantization at Q=8, and your AC's all got quantized to zero, obviously the original image most likely had a smooth slope, so the first AC in the middle block should be predicted to be the linear interpolation.

An interesting one I found that's related to the stuff I tried with smooth reconstruction of the DC band is : "Improvement of DCT-based Compression Algorithms Using Poisson�s Equation" by Yamatani and Saito (2006) .

BTW a related issue that often comes up is the incorrectness of center dequantization of AC coefficients. I've written about this before and lots of these papers mention it; the best full note on it is : "Biased Reconstruction for JPEG Decoding" by Price.

The very modern stuff has gotten quite arcane. People now are doing things like directional overcomplete wavelets on the reproduced image; with this they can detect both block artifacts and also ringing and other quantized transform artifacts. They then use maximum-likelihood markov models to guess what the source image was that produced this output. This stuff is extremely complex and I haven't really followed it because it's nowhere near realtime, but probably the best solution for offline very high quality JPEG decoders.

An interesting outlier is John Costella's Unblock . It's based on a clever simple idea that I've never seen anywhere else. Unblock is based on the assumption that pixels near the block boundaries come from the same model as pixels in the centers of blocks. That sounds obvious but it's quite profound. It means that pixels near the edges of blocks should have the same statistics as pixels in the centers (in the maximum likelihood lingo, this is a prior we can use to choose an optimal output). In particular, it's useful because in the DCT the interior pixels are much more accurate than the edge pixels. What Unblock does is looks at the statistics of the decompressed interior pixels and assumes those are our goal, and then it forces the pixels near the edge to match the statistics of the interior. The corrections are applied as wide smooth filters.

7/06/2009

07-06-09 - Small Image Compression Notes

Lapping appears to be a complete red herring. I've wasted a lot of time on it and I'm very angry. I've been trying to work up a lapped block DCT image coder. The idea is that block-DCT-based is good for speed and parallelization for micro-core architectures, good for memory bandwidth, etc. and the lapping theoretically lets you avoid some of the nasty block artifacts by effectively extending your basis functions.

In practice it just doesn't work. I've tried lots of different lapping methods, and in all of them if I make a parameterized lap amount based on a kaiser-bessel-derived window and then tweak the lap amount to maximize SSIM, it tunes to no lapping at all. Basically what's happening is that the extra bit rate cost caused by the forward lap scrambling things up is too great for the win of smoother basis functions on decompress to make up. Obviously in a few contrived cases it does help, such as on very smooth images at very high compression. (of course the large lap basis functions are a form of modeling - they will help any time the image is smooth over the larger area, and hurt when it is not).

The really silly thing about this is that areas where the image is very smooth over a large area are the cases we already handle very well!! Yeah sure naive JPEG looks awful, but even a deblocking filter after decompress can fix that case very easily. In areas that aren't smooth, lapping actually makes artifacts like ringing worse.

The other issue is I'm having a little trouble with lagrange bitstream optimization. Basically my DCT block coder does a form of "trellis quantization" (which I wrote about before) where it can selectively zero coefficients if it decides it gets an R/D win by doing so. Obviously this gives you a nice RMSE win at a given rate (by design it does so - any time it finds a coefficient to zero, it steps up the R/D slope). But what does this actually do?

Think about trying to make the best bit stream for a given rate. Say two bits per pixel. If we don't do any lagrange optimization at all, we might pick some quantizer, say Q = 16. Now we turn on lagrange optimization, it finds some coefficients to zero, that reduces the bit rate, so to get back to the target bit rate, we can use a lower quantizer. It searches for the right lagrange lambda by iterating a few times and we wind up with something like Q = 12 , and some values zeroed, and a better RMSE. What's happened is we got to use a lower quantizer, so we made more, larger, nonzero coefficients, and then we selectively zeroed a few that took the most R/D.

But what does this actually do to the image qualitatively? What it does is increase the quality everywhere (Q =16 goes to Q=12) , but then it stomps on the quality in a few isolated spots (trellis quantization zeros some coefficients). If you compare the two images, the lagrange optimized one looks better everywhere, but then is very smooth and blurred out in a few spots. Normally this is not a big deal and it's just a win, but sometimes I've found it actually looks really awful.

Even if you optimize for some perceptual metric like SSIM it doesn't detect how bad this is, because SSIM is still a local measurement and this is a nonlocal artifact. Your eyes very quickly pick out that part of the image has been blurred way more than the rest of it. (in other cases it does the same thing, but it's actually good; it sort of acts like a bilateral filter actually, it will give bits to the high contrast edges and kill coefficients in the texture part, so for like images of skin it does a nice job of keeping the edges sharp and just smoothing out the interior, as opposed to non-lagrange-optimized JPEG which allocates bits equally and will preserve the skin pore detail and make the edges all ringy and chopped up).

I guess the fix to this is some hacky/heuristic way to just force the lagrange optimization not to be too aggressive.

I guess this is also an example of a computer problem that I've observed many times in various forms : when you let a very aggressive optimizer run wild seeking some path to maximize some metric, it will do so, and if your metric does not perfectly measure exactly the thing that you actually want to optimize, you can get some very strange/bad results.

6/22/2009

06-22-09 - Redraw Dilemma

This apartment searching is really annoying me. I can't handle having "many balls in the air" ; when I put something on my todo list, I like to work at it until it's gone. God I fucking hate shit on my todo list (the fucking health care keeps reinserting itself on my todo list and it's pissing me off; they got me again today with some billing fuckup, but I digress...).

Anyway, it's reminding me of a concept I often think about. I'll call it "the redrawer's dilemma" but there must be a better/standard name for this.

The hypothetical game goes something like this :

You are given a bag with 100 numbers in it. You know the numbers are in [0,1000] but don't know how many of each number there are in the bag. You start by drawing a random number from the bag.

At each turn of play, you can either keep your current number (in which case that is your final score), or you can put your current number back in the bag and draw again, but drawing again costs you -1 that will be subtracted from your final score.

How do you play this game optimally?

There are two things that are interesting to me about this game in real life. One is that humans almost always play it incredibly badly, and the second is that when you finally decide to stop redrawing you're almost always unhappy about it (unless you got super lucky and draw a 900+ number).

The two classic human player errors in this game are the "I just started drawing, I shouldn't stand yet" and the "I can't stop now, I already passed on something better than this". The "I just started drawing, I shouldn't stand yet" guy draws something like an 800 on one of his early draws. He thinks dang that's really good, but maybe this bag just has lots of high numbers in it, I just started drawing, I should put some time into it. Now of course that reasoning is based in correct logic - if you have reason to believe that your chance of drawing higher is good enough to merit the cost of continued looking, then yes, do so, but just drawing more because "it's early" makes no sense - the game is totally non temporal, the cost of continuing drawing doesn't go up over time. This often leads into the "I can't stop now, I already passed on something better than this" guy, who's mainly motivated by pride and shame - he doesn't want to admit to himself that he made a big mistake passing early when he got a high number, so he has to keep drawing until he gets something better. He might draw an 800, then a whole mess of single digit numbers and he's thinking "oh fuck I blew it" and then he draws a 400. At that point he should stand and quit redrawing, but he can't, so he draws again.

The thing is, even if he played correctly and just took the 400 after passing on the 800, he would be really unhappy about. And if the early termination guy played correctly and just got an early 800 and didn't draw, he would be unhappy too, because he'd always be wondering if he could've done better.

The other game theory / logical fallacy that plagues me in these kind of things is "I'm already spending X I may as well spend X". First I was looking for places around $1500, then I bumped it to $1700, then $1900. Now I'm looking at places for $2500 cuz fuck it they're nicer and I was looking at places for $2000 so it's only $500 more.

In other news, hotpads is actually a pretty cool apartment search site. It seems they are just scraping craigslist and maybe some other classifieds sites, so it's not like they have anything new, but the map interface and search features and such are solid. One thing is really annoying me about it though - the wheel zooming in the map is totally broken, I keep trying to wheel zoom and it sends the map off the never never land. Urg!

In more random news, I've really enjoyed the "Wallander" series on PBS ; the crime stories are pretty dumb/ridiculous, but I like the muddled contemplative pace of it, and the washed out monochrome color palette.

6/21/2009

06-21-09 - Fast Exp & Log

So in an earlier post I wrote about approximation of log2 and Ryg commented with links to Robin Green's great GDC 2003 talk : part1 (pdf) and part2 (pdf) ( main page here ).

It's mostly solid, but in part 2 around page 40 he talks about "fastexp" and "bitlog" and my spidey senses got tingling. Either I don't understand, or he was just smoking crack through that section.

Let's look at "bitlog" first. Robin writes it very strangely. He writes :


A Mathematical Oddity: Bitlog
  A real mathematical oddity
  The integer log2 of a 16 bit integer
  Given an N-bit value, locate the leftmost nonzero bit.
  b = the bitwise position of this bit, where 0 = LSB.
  n = the NEXT three bits (ignoring the highest 1)

    bitlog(x) = 8x(b-1) + n

  Bitlog is exactly 8 times larger than log2(x)-1

Bitlog Example
 For example take the number 88
88 = 1011000
b = 6th bit
n = 011 = 3
bitlog(88) = 8*(6-1)+3
= 43
  (43/8)+1 = 6.375
  Log2(88) = 6.4594
  This relationship holds down to bitlog(8)

Okay, I just don't follow. He says it's "exact" but then shows an example where it's not exact. He also subtracts off 1 and then just adds it back on again. Why would you do this :

    bitlog(x) = 8x(b-1) + n

  Bitlog is exactly 8 times larger than log2(x)-1

When you could just say :

    bitlog(x) = 8xb + n

  Bitlog is exactly 8 times larger than log2(x)

??? Weird.

Furthermore this seems neither "exact" nor an "oddity". Obviously the position of the MSB is the integer part of the log2 of a number. As for the fractional part of the log2, this is not a particular good way to get it. Basically what's happening here is he takes the next 3 bits and uses them for linear interpolation to the next integer.

Written out verbosely :


x = int to get log2 of
b = the bitwise position of top bit, where 0 = LSB.

x >= (1 << b) && x < (2 << b)

fractional part :
f = (x - (1 << b)) / (1 << b)

f >= 0 && f < 1

x = 2^b * (1 + f)

correct log2(x) = b + log2(1+f)

approximate with b + f

note that "f" and "log2(1+f)" both go from 0 to 1, so it's exact at the endpoints
but wrong in the middle

So far as I can tell, Robin's method is actually like this :

uint32 bitlog_x8(uint32 val)
{
    if ( val <= 8 )
    {
        static const uint32 c_table[9] = { (uint32)-1 , 0, 8, 13, 16, 19, 21, 22, 24 };
        return c_table[val];
    }
    else
    {
        unsigned long index;
        
        _BitScanReverse(&index,(unsigned long)val);
    
        ASSERT( index >= 3 );
    
        uint32 bottom = (val >> (index - 3)) & 0x7;
        uint32 blog = (index << 3) | bottom;

        return blog;
    }
}

where I've removed the weird offsets of 1 and this just returns log2 times 8. You need the check for val <= 8 because shifting by negative amounts is fucked.

But you might well ask - why only use 3 bits ? And in fact you're right, I see no reason to use only 3 bits. In fact we can do a fixed point up to 27 bits : (we need to save 5 bits at the top to store the max possible integer part of the log2)


float bitlogf(uint32 val)
{
    unsigned long index;
    
    _BitScanReverse(&index,(unsigned long)val);

    uint32 vv = (val << (27 - index)) + ((index-1) << 27);

    return vv * (1.f/134217728); // 134217728 = 2^27
}

what we've done here is find the pos of the MSB, shift val up so the MSB is at bit 27, then we add the index of the MSB (we subtract one because the MSB it self starts the counting at one in the 27th bit pos). This makes a fixed point value with 27 bits of fractional part, the bits below the MSB act as the fractional bits. We scale to return a float, but you could of course do this with any # of fixed point bits and return a fixed point int.

But of course this is exactly the same kind of thing done in an int-to-float so we could use that too :


float bitlogf2(float fval)
{
    FloatAnd32 fi;
    
    fi.f = fval;
    
    float vv = (float) (fi.i - (127 << 23));
    
    return vv * (1.f/8388608); // 8388608 = 2^23
}

which is a lot like what I wrote about before. The int-to-float does the exact same thing we did manually above, finding the MSB and making the log2 and fractional part.

One note - all of these versions are exact for the true powers of 2, and they err consistently low for all other values. If you want to minimize the maximum error, you should bias them.

The maximum error of ( log2( 1 + f) - f ) occurs at f = ( 1/ln(2) - 1 ) = 0.442695 ; that error is 0.08607132 , so the correct bias is half that error : 0.04303566

Backing up in Robin's talk we can now talk about "fastexp". "fastexp" is doing "e^x" by using the floating point format again, basically he's just sticking x into the exponent part to get the int-to-float to do the 2^x. To make it e^x instead of 2^x you just scale x by 1/ln(2) , and again we use the same trick as with bitlog : we can do exact integer powers of two, to get the values in between we use the fractional bits for linear interpolation. Robin's method seems sound, it is :


float fastexp(float x)
{
    int i = ftoi( x * 8.f );
        
    FloatAnd32 f;
    f.i = i * 1512775 + (127 << 23) - 524288;
    
    // 1512775 = (2^20)/ln(2)
    // 524288 = 0.5*(2^20)

    return f.f;
}

for 3 bits of fractional precision. (note that Robin says to bias with 0.7*(2^20) ; I don't know where he got that; I get minimum relative error with 0.5)).

Anyway, that's all fine, but once again we can ask - why just 3 bits? Why not use all the bits of x as fractional bits? And if we put the multiply by 1/ln(2) in the float math before we convert to ints, it would be more accurate.

What we get is :


float fastexp2(float x)
{
    // 12102203.16156f = (2^23)/ln(2)
    int i = ftoi( x * 12102203.16156f );
    
    FloatAnd32 f;
    f.i = i + (127 << 23) - 361007;
    
    // 361007 = (0.08607133/2)*(2^23)

    return f.f;
}

and indeed this is much much more accurate. (max_rel_err = 0.030280 instead of 0.153897 - about 5X better).

I guess Robin's fastexp is preferrable if you already have your "x" in a fixed point format with very few fractional bits (3 bits in that particular case, but it's good for <= 8 bits). The new method is preferred if you have "x" in floating point or if "x" is in fixed point with a lot of fractional bits (>= 16).

ADDENDUM :

I found the Google Book where bitlog apparently comes from; it's Math toolkit for real-time programming By Jack W. Crenshaw ; so far as I can tell this book is absolute garbage and that section is full of nonsense and crack smoking.

ADDENDUM 2 :

it's obvious that log2 is something like :


x = 2^I * (1+f)

(I is an int, f is the mantissa)

log2(x) = I + log2(1+f)

log2(1+f) = f + f * (1-f) * C

We've been using log2(1+f) ~= f , but we know that's exact at the ends and wrong in the middle
so obvious we should add a term that humps in the middle.

If we solve for C we get :

C = ( log2(1+x) - x ) / x*(1-x)

Integrating on [0,1] gives C = 0.346573583

hence we can obviously do a better bitlog something like :

float bitlogf3(float fval)
{
    FloatAnd32 fi;
    
    fi.f = fval;
    
    float vv = (float) (fi.i - (127<<23));
    
    vv *= (1.f/8388608);
    
    //float frac = vv - ftoi(vv);
    
    fi.i = (fi.i & 0x7FFFFF) | (127<<23);
    
    float frac = fi.f - 1.f;
    
    const float C = 0.346573583f;
        
    return vv + C * frac * (1.f - frac);
}

old rants