6/21/2010

06-21-10 - RRZ PNG-alike notes

Okay I've futzed around with heuristics today and it's a bit of a dead end. There are just a few cases that you cannot ever catch with a heuristic.

There is one good heuristic that works 90% of the time :


do normal filter
try compress with LZ fast and MinMatchLen = 8
try compress with LZ normal and MinMatchLen = 3

if the {fast,8} wins then the image is almost certainly a "natural" image. Natural images do very well with long min match len, smooth filters, and simple LZ matchers.

If not, then it's some kind of weird synthetic thing. At that point, pretty much all bets are off. Synthetic images have the damnable property that certain patterns repeat, so they are very sensitive to whether the LZ can find those patterns after filtering and such. But, a good start is to try the no-filter case with normal LZ, and perhaps try the Adaptive, and you can use Loco or Non-Loco depending on whether the normal filter chose loco post-filter or not.

But there are some bitches. kodak_12 is a natural image, and we detect that right. The problem is the best mode {N,N+L,A,A+L} changes when you optimize the LZ parameters, and it changes by quite a lot actually. Many modes will show N+L or A+L winning by quite a lot, but the right mode is N and it wins by 10k.

ryg_t.train.03.bmp is the worst. It is a solid 10% better with the "Normal" mode, but this only shows up when you do the LZ optimal parse; at any other setting of LZ all the modes are very close, but for some reason there are some magic patterns that only occur in Normal mode which are very good for the optimal parse - all the other modes stay about the same size when you turn LZ up to optimal, but Normal filter gets way smaller.

Okay, some actually useful notes :

There are some notes on the PNG web pages that say the best way to choose the filter per row is with sum of abs. Oh yeah? I can beat it. First of all, doing sum of abs but adding a small penalty for non-zero helps a tiny bit. But the best thing is to do entropy per row, and add a penalty for non-zero. You're welcome.

The filters N and (N+W)/2 are almost never best as whole-image filters, but are actually helpful in the adaptive filter loop.

I reduced my filter set down to only 5 and it hurt very little. Having the extra filters is basically free in terms of the format, but is a pain in the ass to maintain if you need to write optimized SIMD decoders for every filter on every platform. So for my own sanity, a minimum set of filters is preferable.

BTW I should note that the fact that you have to tune minMatchLen and lzLevel is an indicator of the limitations of the optimal parse. If the optimal parse really found the best LZ stream, you should just run Optimal and let it pick what matches it wants. This is an example of it finding a local minimum which is not the global minimum. The problem is that the Huffman codes are severely different if you run with MML = 3 or 5 for example. Maybe there's a way around this problem; it requires more thought.

6/20/2010

06-20-10 - Windows 7 Niggles

1. When you edit a file name in explorer it by default doesn't include the extension. I constantly do "F2 , ctrl-C" to grab a file name, and then find later that I have the fucking file name without extension. .URL files are the worst, they for some reason just absolutely refuse to show me their extension. God fucking dammit, LEAVE SHIT ALONE.

2. Fucking folders can't all be set to Details as far as I can figure out. Yes yes yes I know you can do "set all folders to look like this one". But that only applies to folders that you have ever visited at least once. When you go to some random folder which you have never visited before - BOOM you're looking at fucking icons again. I despise icons. There must be a way to set the default options for folders, but I can't find it on web searches.

3. Fucking backspace not going up dir in Explorer really chaps my hide. I need to fix that. I guess I'll do that after I write my own AllSnap replacement (Grrr).

06-20-10 - Struct annoyance

It's a very annoying fact of coding life that references through a struct are much slower than loading the variable out to temps. (The compiler generally can't prove that the stores through the struct don't alias the struct itself, so it has to reload the member pointers every iteration.) So for example this is slow :

void rrMTF_Init(rrMTFData * pData)
{
    for (int i = 0; i < 256;  i++)
    {
        pData->m_MTFimage[i] = pData->m_MTFmap[i] = (U8) i;
    }
}

and you have to do this by hand :

void rrMTF_Init(rrMTFData * pData)
{
    U8 * pMTFimage = pData->m_MTFimage;
    U8 * pMTFMap = pData->m_MTFmap;
    for (int i = 0; i < 256;  i++)
    {
        pMTFimage[i] = pMTFmap[i] = (U8) i;
    }
}

Unfortunately there are plenty of cases where this is actually a significant big deal.

06-20-10 - Some notes on PNG

Before we go further, let's have a look at PNG for reference.

Base PNG by default does :


Filter 5 = "adaptive" - choose from [0..4] per row using minimum sum of abs

Zlib strategy "FILTER" , which just means minMatchLength = 4

So the pngcrush guys did some clever things. Basically the idea is to try all possible ways to write a PNG and see which is smallest. The things you can play with are :


Filter 0..5

Zlib Strategy ; they have a few hard-coded but really it comes down to min match length

Zlib compression level (oddly highest level is not always best)

PNG scan options (progressive, interleaves, bottom up vs top down, etc.)

It's well known that on weird synthetic images "no filter" beats any filter. The only way you can detect that is actually by trying an LZ compress, you cannot tell from statistics.

The clever thing in pngcrush is that they don't search that whole space, but still usually find the optimal (or close to optimal) settings. The way they do it is with a heuristic guided search; they identify things that they have to always test (the 5 default strategies they try) with LZ, then depending on which of those is best they try a few others, and then maybe a few more, then you're done. It's like based on which branch of the search space you walk off initially they know from testing where the optimum likely is.

"loco" here is pngcrush with the LOCO color space conversion (RGB -> (R-G),G,(B-G) ) from JPEG-LS. This is the only lossless color conversion you can do that is not range expanding (eg. stays in bytes) (* correction : not quite true, see comments, of course any lifting style transform can be non-range-expanding using modulo arithmetic; it does appear to be the only *useful* byte-to-byte color conversion though). (BTW LOCO is not allowed in compliant PNG, but it's such a big win that it's unfair to them not to pretend that PNG can do LOCO for purposes of this comparison).

name png pngcrush loco advpng crush+adv best
ryg_t.yello.01.bmp 421321 412303 386437 373573 373573 373573
ryg_t.train.03.bmp 47438 37900 36003 34260 34260 34260
ryg_t.sewers.01.bmp 452540 451880 429031 452540 451880 429031
ryg_t.font.01.bmp 44955 37857 29001 22514 22514 22514
ryg_t.envi.colored03.bmp 113368 97203 102343 113368 97203 97203
ryg_t.envi.colored02.bmp 63241 55036 65334 63241 55036 55036
ryg_t.concrete.cracked.01.bmp 378383 377831 309126 378383 377831 309126
ryg_t.bricks.05.bmp 506528 486679 375964 478709 478709 375964
ryg_t.bricks.02.bmp 554511 553719 465099 554511 553719 465099
ryg_t.aircondition.01.bmp 29960 29960 23398 20320 20320 20320
ryg_t.2d.pn02.bmp 29443 26025 27156 24750 24750 24750
kodak_24.bmp 705730 704710 572591 705730 704710 572591
kodak_23.bmp 557596 556804 483865 557596 556804 483865
kodak_22.bmp 701584 700576 580566 701584 700576 580566
kodak_21.bmp 680262 650956 547829 646806 646806 547829
kodak_20.bmp 505528 504796 439993 500885 500885 439993
kodak_19.bmp 671356 670396 545636 671356 670396 545636
kodak_18.bmp 780454 779326 631000 780454 779326 631000
kodak_17.bmp 624331 623431 510131 615723 615723 510131
kodak_16.bmp 573671 541748 481190 517978 517978 481190
kodak_15.bmp 612134 611258 516741 612134 611258 516741
kodak_14.bmp 739487 703036 590108 739487 703036 590108
kodak_13.bmp 890577 859429 688072 866745 859429 688072
kodak_12.bmp 569219 533864 477151 535591 533864 477151
kodak_11.bmp 635794 634882 519918 635794 634882 519918
kodak_10.bmp 593590 592738 500082 593590 592738 500082
kodak_09.bmp 583329 582489 493958 558418 558418 493958
kodak_08.bmp 787619 786491 611451 787619 786491 611451
kodak_07.bmp 566085 565281 486421 566085 565281 486421
kodak_06.bmp 667888 631478 540442 644928 631478 540442
kodak_05.bmp 807702 806538 638875 807702 806538 638875
kodak_04.bmp 637768 636856 532209 637768 636856 532209
kodak_03.bmp 540788 506321 464434 514336 506321 464434
kodak_02.bmp 617879 616991 508297 596342 596342 508297
kodak_01.bmp 779475 760251 588034 706982 706982 588034
bragzone_TULIPS.bmp 680124 679152 591881 680124 679152 591881
bragzone_SERRANO.bmp 153129 107167 107759 96932 96932 96932
bragzone_SAIL.bmp 807933 806769 623437 807933 806769 623437
bragzone_PEPPERS.bmp 426419 424748 376799 426419 424748 376799
bragzone_MONARCH.bmp 614974 614098 526754 614974 614098 526754
bragzone_LENA.bmp 474968 474251 476524 474968 474251 474251
bragzone_FRYMIRE.bmp 378228 252423 253967 230055 230055 230055
bragzone_clegg.bmp 483752 483056 495956 483752 483056 483056

There's no adv+loco because advpng and advmng both refuse to work on the "loco" bastardized semi-PNG.

BTW I should note that I should eat my hat a little bit over my own "PNG sucks" post. The thing is, yes, basic PNG is easy to beat and it has a lot of mistakes in the design, but the basic idea is fine, and they did a good job on the standard pretty quickly. The thing that really seals the deal is that once you make a flexible open standard, people will step in and find ways to play with it; while base PNG is pretty bad, PNG after optimization is not bad at all.

06-20-10 - Searching my PNG-alike

Okay so stealing some ideas from pngcrush let's check out the search space. I decide I'm going to examine various filters, Loco or no loco, LZ min match len, and LZ compress "level" (level = how hard it looks for matches).

Here are the results for my PNG-alike with the modes :

0 = no filter
Loco = no filter (in loco space)
Normal = select a single DPCM filter for the whole image
N+L = Normal in LOCO space
Adaptive = per row best DPCM filter
A+L = you get it by now

The left six columns are these modes with default LZ parameters (Fast match, min match len = 4). The right six columns are the same modes with LZ parameters optimized for each mode.

name [default LZ] 0 Loco Normal N+L Adaptive A+L [optimized LZ] 0 Loco Normal N+L Adaptive A+L
ryg_t.yello.01.bmp 458255 435875 423327 427031 438946 431678 372607 359799 392963 395327 415370 401618
ryg_t.train.03.bmp 56455 55483 69635 76211 68022 67678 36399 35195 31803 40155 37582 36638
ryg_t.sewers.01.bmp 599803 610463 452287 452583 466154 452834 593935 593759 421779 420091 443786 421166
ryg_t.font.01.bmp 42239 32207 38855 38855 53070 36746 33119 26911 35383 35383 40998 33798
ryg_t.envi.colored03.bmp 297631 309347 150183 165803 142658 157046 265755 278103 102487 114923 95394 109022
ryg_t.envi.colored02.bmp 109179 112623 100687 113867 89178 98374 90039 93535 62139 68027 54662 57514
ryg_t.concrete.cracked.01.bmp 481115 407759 355575 356911 408054 361602 384795 353235 299963 301907 342810 310342
ryg_t.bricks.05.bmp 551475 485271 418907 417655 492622 418406 469063 448279 372195 370459 429066 373310
ryg_t.bricks.02.bmp 665315 632623 482367 481347 538670 483106 590319 577699 455431 455203 522158 455426
ryg_t.aircondition.01.bmp 41635 29759 26011 26011 32866 25738 29023 25103 20547 20547 23946 20522
ryg_t.2d.pn02.bmp 25075 26539 28303 29259 28046 28974 22147 22723 25915 26179 26194 25790
kodak_24.bmp 826797 771829 640137 634141 726892 633308 723681 712285 567693 558085 684060 559564
kodak_23.bmp 835449 783941 576481 569981 608476 565172 724641 712001 481577 478041 551292 479240
kodak_22.bmp 898101 879949 655461 651213 722096 651000 803429 804073 577433 571301 689664 574252
kodak_21.bmp 781077 708861 617565 629401 705608 618424 647881 633069 549865 549025 665724 545584
kodak_20.bmp 609705 561957 495509 500161 537692 494484 501293 490849 434865 431745 506592 429556
kodak_19.bmp 822045 733053 630793 624897 697020 619064 673953 658345 550669 541845 660444 541424
kodak_18.bmp 941565 912081 691161 692425 789736 693804 850705 848353 618961 619949 764004 622628
kodak_17.bmp 758089 678233 597225 592941 660016 590292 617097 606045 507169 504961 610092 508672
kodak_16.bmp 650557 587537 543001 543001 607916 545244 522829 511833 466277 466277 536280 468136
kodak_15.bmp 759109 697257 595353 590321 648656 586324 643385 628481 511193 504213 593304 506728
kodak_14.bmp 891629 793553 661505 657357 745816 649928 749085 731569 584645 580301 707596 581520
kodak_13.bmp 997161 891637 729425 730901 878212 729580 853557 829057 677041 678613 802224 680196
kodak_12.bmp 672361 602825 545921 562749 606004 549292 539305 526793 465297 472693 539532 467088
kodak_11.bmp 758197 691697 604869 597125 665364 587264 639537 625145 523649 512685 608388 510200
kodak_10.bmp 747121 681213 592637 589561 635972 578888 625961 611553 504573 499753 576284 497400
kodak_09.bmp 688365 629429 587233 583245 627916 571652 565449 562329 501377 495637 567772 491896
kodak_08.bmp 1001257 882153 686961 684177 792916 684056 860269 825757 615657 610505 766620 610524
kodak_07.bmp 709829 673917 568649 563561 616820 561636 605177 600157 480705 473233 551136 473500
kodak_06.bmp 779709 694229 600145 600145 687824 601564 642525 626449 534037 534037 637180 534804
kodak_05.bmp 962357 873793 700581 695257 810688 694828 845905 823025 633581 623341 783400 624368
kodak_04.bmp 813869 729865 613241 606849 672176 607280 677345 660057 531533 522061 622904 528064
kodak_03.bmp 639809 586873 522049 542681 581880 527856 519549 510309 437765 452965 511900 443948
kodak_02.bmp 729069 649709 603853 591781 654408 585916 598913 584941 515437 502941 602364 500964
kodak_01.bmp 872669 747333 661481 655597 772808 653388 699945 682005 593001 582389 689436 586328
bragzone_TULIPS.bmp 1032589 1021309 646905 652213 701508 652128 966881 969913 565997 571377 662504 570796
bragzone_SERRANO.bmp 150142 151706 169982 169302 173229 175217 103462 103566 139306 139074 142609 143457
bragzone_SAIL.bmp 983473 941993 686013 686117 795152 686204 892301 887097 613845 609953 762420 610008
bragzone_PEPPERS.bmp 694247 679795 424603 423679 451006 424262 655987 650771 369291 366611 416426 368106
bragzone_MONARCH.bmp 887917 868533 600373 598985 654864 598976 810325 805725 507937 508085 598348 508096
bragzone_LENA.bmp 737803 733251 493703 498299 498274 502150 710215 704763 467103 475179 471586 477638
bragzone_FRYMIRE.bmp 342667 344807 419859 420811 420506 419026 241899 242355 335063 336859 335990 335894
bragzone_clegg.bmp 760493 799717 525077 541329 523580 536244 557529 571413 445897 468265 444736 465376

One important note :

The "Normal" filters include the option to do a post-filter Loco conversion. This is different than the "loco space" option in the modes above. Let me elaborate. "Loco" in the modes listed above means transform the image into Loco colorspace, and then proceed with filtering and LZ. Loco built into a filter mode means, on each pixel do the filter delta, then do loco conversion. This can be integrated directly into the DPCM pixel delta code, so it's just considered a filter type. In particular, in "loco space", then the N,W,NW neighbors are already in loco colorspace. When loco is part of the filter, the neighbors are in RGB space and the delta is converted after the fact. If everything was linear, these would be equivalent.

Okay, what do we see?

It's very surprising to me how much LZ optimization helps. In particular it surprises me that making the LZ search *worse* (by turning down compression "level") helps a lot; as does increasing match len; on natural images a min match len around 8 or 10 is usually best. (or even more, I forbid a min match len > 10 because it starts to hurt decode speed).

Well, we were hoping that we could pick the mode based on the default LZ parameters, and then just optimize the LZ parameters for that one mode. It is often the case that the best mode after optimization is the same as the best mode before optimization, but not always. When it is not the case, it is usually a small mistake. However, there is one case where it's a very bad mistake - on ryg_t.yello.01.bmp you would produce output of 393k instead of 360k.

Natural images are the easiest; for them you can pretty much just pick A+L (or N+L) and you're very close to the best even when you didn't pick the best. Synthetic images are harder; they are very sensitive to the exact mode.

We can also say that no filter + loco is almost always wrong, except for that same annoying one case. Unfortunately I don't see any heuristic that can detect when 0+loco needs to be checked. Similarly for adaptive + noloco.

Obviously there's a fundamental problem when the initial sizes are very close together, you can't really differentiate between the modes. When the sizes are very far apart then it is a reliable guess.

Let me briefly note things I could be searching that I'm not :

Rearranging pixels in various ways, eg. RGBRGB -> RRRGGGBB , or to whole planes; interleaving lines, different scan orders, etc. etc.

LSQR fits for predictors. This doesn't hurt decode speed a ton so it would fit in my design spec, I'm just getting sick of wasting my time on this so I'm not bothering with it.

Predictors on different regions instead of per row. eg. a predictor type per 16x16 tile or something.

Anything that hurts decode speed, eg. bigger predictors, adaptive predictors, non-LZ coder, etc.

Oh I'm also not looking at any pixel format conversions; I assume the client has put it in the pixel format they want and won't change it. Obviously some of the PNG optimizers can win by palettizing when not many colors are actually used, and of course there are lots of other pixel formats that might help, blah blah.

Oh while I'm at it, I should also note that my LZ is actually kind of crippled for this comparison. I divide the data stream into 256k chunks and compress them completely independently (no LZ window across the chunk boundary). This lets me seek on compressed data and decompress portions of it independently, but it is quite a bad penalty.

06-20-10 - PNG Comparo

Okay this is kind of bogus and I thought about not even posting it, but come on, you need the closure right? Everyone likes a comparo. First of all, why this is bogus : 1. PNG just cannot compete without better LOCO support. Here I am allowing the LOCO files, but they were not advpng/pngout optimized, and of course they're not really PNGs. 2. I have that crippling 256k chunking on my format. I guess if I wanted to do a fair comparo I should make a version of my shit which doesn't have LOCO and also removes my 256k chunking and compare that vs. no-loco PNG. God dammit now I have to do that.

Whatever, here you go :

RRZ heuristic = guided search to try to find the best set of options
RRZ best = actual best options for my thingy
png best = best of advpng/crush/loco

name RRZ heuristic RRZ best png best
ryg_t.yello.01.bmp 392963 359799 373573
ryg_t.train.03.bmp 35195 31803 34260
ryg_t.sewers.01.bmp 421779 420091 429031
ryg_t.font.01.bmp 26911 26911 22514
ryg_t.envi.colored03.bmp 95394 95394 97203
ryg_t.envi.colored02.bmp 54662 54662 55036
ryg_t.concrete.cracked.01.bmp 299963 299963 309126
ryg_t.bricks.05.bmp 370459 370459 375964
ryg_t.bricks.02.bmp 455203 455203 465099
ryg_t.aircondition.01.bmp 20522 20522 20320
ryg_t.2d.pn02.bmp 22147 22147 24750
kodak_24.bmp 559564 558085 572591
kodak_23.bmp 479240 478041 483865
kodak_22.bmp 574252 571301 580566
kodak_21.bmp 549865 545584 547829
kodak_20.bmp 429556 429556 439993
kodak_19.bmp 541424 541424 545636
kodak_18.bmp 618961 618961 631000
kodak_17.bmp 508672 504961 510131
kodak_16.bmp 466277 466277 481190
kodak_15.bmp 506728 504213 516741
kodak_14.bmp 581520 580301 590108
kodak_13.bmp 677041 677041 688072
kodak_12.bmp 465297 465297 477151
kodak_11.bmp 510200 510200 519918
kodak_10.bmp 497400 497400 500082
kodak_09.bmp 491896 491896 493958
kodak_08.bmp 610524 610505 611451
kodak_07.bmp 473500 473233 486421
kodak_06.bmp 534037 534037 540442
kodak_05.bmp 624368 623341 638875
kodak_04.bmp 522061 522061 532209
kodak_03.bmp 437765 437765 464434
kodak_02.bmp 500964 500964 508297
kodak_01.bmp 586328 582389 588034
bragzone_TULIPS.bmp 565997 565997 591881
bragzone_SERRANO.bmp 103462 103462 96932
bragzone_SAIL.bmp 613845 609953 623437
bragzone_PEPPERS.bmp 366611 366611 376799
bragzone_MONARCH.bmp 508096 507937 526754
bragzone_LENA.bmp 467103 467103 474251
bragzone_FRYMIRE.bmp 241899 241899 230055
bragzone_clegg.bmp 444736 444736 483056

PNG wins by a little bit on FRYMIRE , SERRANO , ryg_t.aircondition.01.bmp , ryg_t.font.01.bmp . I'm going to pretend that I don't know that because that's what sent me down this god damn pointless rabbit hole in the first place, I discovered that PNG beat me on a few files so I had to find out why and fix myself.

Anyway, something that would be more productive would be to write a fast PNG decoder. All the PNG decoders out there in the world are woefully slow. Let me tell you all how to write a fast PNG decoder :

1. First make sure your Zip decoder is fast. The standard ones are okay, but they do too much checking for end of buffer and "do you have enough bits" blah blah. The correct way to do that is to allocate your decompression buffers 64k aligned, and put some NO_ACCESS pages on each end. Then just let your Zip decoder run. Make sure it will never crash on bad input - it will just make bad output (this is relatively easy to do and doesn't require explicit checks, just careful coding to make sure all compressed bit streams decode to something). (A sketch of the guard-page allocation follows this list.)

2. The un-filtering for PNG needs to be unrolled for the exact data type and filter. You can do this in C very neatly using template loop inversion which I wrote about previously. For maximum speed however you really should do the un-filter with SIMD. It's a very nice easy case for SIMD, except for the god fucking awful pointless abortion that is the Paeth filter. (Also sketched after this list.)

3. Un-filtering and LZ decompress should be interleaved for cache hotness. You decode a bit, un-filter a bit, then stream out data in the final pixel format into client-ready packed plane. The zip window is only 32k and you only need one previous row to filter, so your whole set of read-write data should be less than 64k, and the data you stream out should be written to a separate buffer with NTQ write-combining style writes. Ideally your stream out supports enough pixel formats that it can write directly to whatever the client needs (X8R8G8B8 for D3D or whatever) so that memory doesn't have to be touched again. Because the output buffer is only written write combined you can decode directly into locked textures.
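For point 1, the guard-page allocation on Windows looks something like this (a sketch ; error handling and the matching free are omitted) :

#include <windows.h>

// reserve guard + buffer + guard, commit only the middle ; an overrun on
// corrupt input faults on a NO_ACCESS page instead of needing per-symbol
// bounds checks in the hot loop
static void * AllocWithGuardPages(size_t size)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    size_t page = si.dwPageSize;
    size_t padded = (size + page - 1) & ~(page - 1);

    unsigned char * base = (unsigned char *) VirtualAlloc(
        NULL, padded + 2*page, MEM_RESERVE, PAGE_NOACCESS);
    if ( ! base ) return 0;

    VirtualAlloc(base + page, padded, MEM_COMMIT, PAGE_READWRITE);

    // free with VirtualFree(ptr - page, 0, MEM_RELEASE)
    return base + page;
}

And for point 2, "template loop inversion" just means instantiating the row loop once per filter, so the filter branch happens per row instead of per pixel ; a sketch with only three of the five PNG filters :

typedef unsigned char U8;

enum EFilter { eNone, eSub, eUp };

template <EFilter t_filter>
static inline U8 Unfilter1(U8 raw, U8 west, U8 north)
{
    // this branch is resolved at compile time :
    if ( t_filter == eSub ) return (U8)(raw + west);
    if ( t_filter == eUp  ) return (U8)(raw + north);
    return raw;
}

template <EFilter t_filter>
static void UnfilterRow(U8 * row, const U8 * prev, int rowLen, int bpp)
{
    for (int x = 0; x < rowLen; x++)
    {
        U8 west = (x >= bpp) ? row[x - bpp] : 0;
        row[x] = Unfilter1<t_filter>(row[x], west, prev[x]);
    }
}

// one branch per row, not per pixel :
static void UnfilterRowDispatch(int filter, U8 * row, const U8 * prev, int rowLen, int bpp)
{
    switch ( filter )
    {
    case 0: UnfilterRow<eNone>(row,prev,rowLen,bpp); break;
    case 1: UnfilterRow<eSub >(row,prev,rowLen,bpp); break;
    case 2: UnfilterRow<eUp  >(row,prev,rowLen,bpp); break;
    // Average and Paeth go the same way
    }
}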

My guess is that this should be in the neighborhood of 80 MB/sec.

06-20-10 - KZip

While futzing around on PNG I discovered that there is this whole community of people who are hard-core optimizing PNG's for distribution sizes. They have these crazy tools like pngcrush/optipng/advancecomp/pngout.

AdvPNG is just the Zip encoder from 7zip (Igor Pavlov). It's a semi-optimal parser using a forward-pass multi-lookahead heuristic like Quantum. Some day I need to try to read the source code to figure out what he's doing exactly, but god damnit the code is ugly and it's not documented. Dammit Igor.

PNGOUT is a lossless "deflate" (zip) optimizer. The engine is the same as KZip by Ken Silverman. Unfortunately Ken has not documented his algorithm at all anywhere. Come on Ken! Nobody is gonna pay you for KZip! Just write about it!

Anyway, my guess from reading his brief description and looking at what it does is : KZip has some type of optimal parser. Who knows what kind of optimal parser; my guess, knowing a bit about how Ken codes, is that it is probably not an actual Storer-Szymanski optimal parser, but rather some kind of heuristic, perhaps like 7zip/LZMA. KZip also clearly has some kind of Huffman split point optimizer (similar to what I just did). Again just guessing from the command line options, it looks like his Huffman splitter is single pass and is just based on a heuristic that detects changes in the statistics and decides to put a split there. Hmm, I wish I'd found this months ago.

KZip appears to be the best zip optimizer in existence. Despite claims of being crazy slow I actually think it's quite fast by my standards. No surprise KZip is a lot smaller and faster than my optimal parser, but I do make smaller files. Ken ! Set your algorithm free! (ADDENDUM : whoops, that was only on PNG data; for some reason it's pretty fast on image data, but it's slow as balls in some other cases, not sure what's going on there; 7zip is a lot faster than kzip and the file sizes are usually very close (once in a while kzip does significantly better)).

For a while I've been curious to try my RAD LZ optimizer on a Zip token stream. It would be a nice sanity check to test it against KZip, but I'm not motivated enough to take the pain of figuring out how to write Zip tokens.

06-20-10 - Filters for PNG-alike

The problem :

Find a DPCM pixel prediction filter which uses only N,W,NW and does not range-expand (eg. ubytes stay in ubytes). (eg. like PNG).

We certainly could use a larger neighborhood, we could use adaptive predictors that evaluate the neighborhood for edges/etc., we would wind up with GAP from CALIC or something newer. We want to keep it simple so we can have a very fast decoder.

These filters :


// assumed helpers : RR_CLAMP_255(x) clamps to [0,255] ; RR_MIN3/RR_MAX3
// are three-way min/max ; Median3 is the median of three values ;
// RR_ABS is absolute value ; rr::clamp(x,lo,hi) clamps x to [lo,hi]

static int PredictPixel(int predictor, int N, int W, int NW)
{
    switch ( predictor )
    {
    case 0: // 0 predictor is the same as NONE
        return 0;
    
    case 1: // N
        return N;
        
    case 2: // W
        return W;
        
    case 3: // gradient // this tends to win on synthetic images
        {
        int pred = N + W - NW;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 4: // average
        return (N+W)>>1;
        
    case 5: // grad skewed towards average // this tends to win on natural images - before type 12 took over anyway
        {
        int pred = ( 3*N + 3*W - 2*NW + 1) >>2;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
    
    case 6: // grad skewed even more toward average
        {
        int pred = ( 5*N + 5*W - 2*NW + 3) >>3;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 7: // grad skewed N
        {
        int pred = (2*N + W - NW + 1)>>1;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 8: // grad skewed W
        {
        int pred = (2*W + N - NW + 1)>>1;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
    
    case 9: // new
        {
        int pred = (3*N + 2*W - NW + 1)>>2;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 10:    // new
        {
        int pred = (2*N + 3*W - NW + 1)>>2;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 11: // new
        return (N+W + 2*NW + 1)>>2;
        
    case 12: // ClampedGradPredictor
    {
        int grad = N + W - NW;
        int lo = RR_MIN3(N,W,NW);
        int hi = RR_MAX3(N,W,NW);
        return rr::clamp(grad,lo,hi);
    }   
    
    case 13: // median
        return Median3(N,W,NW);
    
    case 14:
    {
        // semi-Paeth
        // but only pick N or W 
        int grad = N + W - NW;
        // pick closer of N or W to grad
        if ( RR_ABS(grad - N) < RR_ABS(grad - W) )
            return N;
        else
            return W;
    }
    }

    return 0; // not reached ; all cases above return
}

perform like this :

name 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
ryg_t.train.03.bmp 56347 76447 70047 80619 76187 79811 77519 79771 76979 77399 76279 80383 75739 75875 74635
ryg_t.sewers.01.bmp 604843 475991 496067 468727 466363 455359 458111 458727 469203 456399 462727 496691 452663 495931 464915
ryg_t.font.01.bmp 42195 54171 53599 56135 73355 66955 70163 60199 63815 69499 70099 79983 59931 61219 56735
ryg_t.envi.colored03.bmp 316007 147315 136255 158063 139231 154551 148419 155379 149667 148683 143415 145871 148099 144851 153311
ryg_t.envi.colored02.bmp 111763 98835 87711 111367 93123 108867 100003 106275 95927 96391 97383 96907 99275 94371 100707
ryg_t.concrete.cracked.01.bmp 493143 416307 449899 426755 403779 402635 400923 408019 420291 398215 408299 437371 409055 440371 424087
ryg_t.bricks.05.bmp 568755 534267 524243 514563 509375 493923 497639 505267 499007 501079 497419 551687 499855 547935 511595
ryg_t.bricks.02.bmp 684515 590955 577207 557155 560391 537987 545019 551251 544555 549543 544635 602115 545919 600139 557659
ryg_t.aircondition.01.bmp 41595 33215 34279 33535 33199 32695 32683 32619 33207 32503 32987 35795 31507 35171 32243
ryg_t.2d.pn02.bmp 25419 27815 28827 29499 32203 32319 32443 29679 32531 32315 32575 34183 28307 31247 27919
ryg_gemein.bmp 797 801 173 153 565 585 553 521 205 633 529 813 153 825 141
kodak_24.bmp 843281 735561 760793 756509 744021 736037 735797 735989 753709 733105 741801 785077 719185 777905 727221
kodak_23.bmp 865985 593977 616885 617105 591501 592561 588837 595949 609437 586121 595793 620657 585453 617825 601713
kodak_22.bmp 919613 732613 756249 741441 719973 712757 711245 721365 732337 709621 719157 760065 706517 763481 724053
kodak_21.bmp 797741 753057 691653 746033 713469 716065 710321 737017 713845 720525 704129 744785 701733 744633 709445
kodak_20.bmp 624901 563261 538069 571633 540981 548329 541773 559437 548549 546565 540337 559545 539289 557105 545901
kodak_19.bmp 847781 724541 728337 732717 718309 712989 712217 716593 729505 712893 715053 752173 689953 756849 698881
kodak_18.bmp 958657 808577 820253 820429 783373 783089 777377 794965 799449 779117 782541 821717 783005 821025 802569
kodak_17.bmp 782061 664829 655845 685273 651025 657225 650477 666625 666645 653117 652293 680813 644849 675045 656597
kodak_16.bmp 664401 696493 595101 673961 646981 647441 643353 675157 640833 658721 631509 684517 628837 682217 629293
kodak_15.bmp 779289 639009 660137 677649 643985 651729 644809 650525 666229 641669 650377 673545 635441 662389 645817
kodak_14.bmp 912149 800681 742901 787273 754233 754325 749641 778777 758045 760761 744001 795029 743909 791341 755425
kodak_13.bmp 1015697 932969 898929 941109 890989 900533 890221 918073 905705 897865 888113 925757 894461 922701 907029
kodak_12.bmp 690537 641517 583717 641973 613461 617881 612461 635253 617913 620833 607177 644837 598393 638889 602213
kodak_11.bmp 773937 725081 674657 714169 697533 691425 690389 708197 695109 698369 685965 738609 671365 739385 676597
kodak_10.bmp 766701 650637 648349 665309 648997 641961 641481 651741 651785 643369 642133 685733 623437 684949 629193
kodak_09.bmp 703689 640025 628161 659637 635201 637657 633881 644193 647817 636281 634105 662125 615393 662801 621873
kodak_08.bmp 1026037 862013 888589 835329 884329 841405 859717 839193 861917 856097 865405 942073 792833 950005 792201
kodak_07.bmp 725525 654925 605753 629757 633945 622693 627053 643069 621433 634965 620645 673361 597821 666629 604337
kodak_06.bmp 796077 786789 680841 750253 731117 723657 722981 753989 714849 739405 709825 772685 703417 773613 705529
kodak_05.bmp 981245 850313 836165 845001 815801 814429 810709 832337 825073 815897 811729 852385 808993 853785 828145
kodak_04.bmp 845721 672369 679245 692469 659981 663625 657621 670629 678165 658845 661957 691945 653629 688253 670325
kodak_03.bmp 656841 605525 559081 619713 581713 596313 587633 607841 600725 592761 584449 601901 579221 592745 587409
kodak_02.bmp 761633 666601 663837 677385 648117 649785 644985 661253 662077 647553 647545 682089 640501 687305 654309
kodak_01.bmp 896361 850353 811873 828397 822169 807197 810217 827621 815245 818249 807665 869829 781173 875265 788725
bragzone_TULIPS.bmp 1052229 729141 757129 688121 703317 671417 683069 683653 696193 682109 691257 756241 675637 761477 704293
bragzone_SERRANO.bmp 150898 172286 160718 193602 255462 274994 278094 212478 254902 274478 277418 285990 171918 213214 167058
bragzone_SAIL.bmp 1004581 865725 826729 834385 811121 798037 798301 819669 807221 805785 796761 862209 795581 868857 813521
bragzone_PEPPERS.bmp 712087 469243 456511 438211 442695 428627 433227 437215 436595 436059 434359 471587 426191 474595 439575
bragzone_MONARCH.bmp 907737 670825 671745 644513 638201 623613 627513 640213 640485 630941 630941 676401 624925 679553 652849
bragzone_LENA.bmp 745299 484027 513747 506631 478875 481911 477131 481819 494431 474343 483007 498519 478559 498799 489203
bragzone_FRYMIRE.bmp 342567 432867 399963 481259 608567 664895 670755 544963 612075 666279 668187 693467 433755 456899 421263
bragzone_clegg.bmp 806117 691593 511625 518489 1208409 1105433 1167561 733217 951773 1158429 1165953 1289969 502845 1286365 505093

commentary :

The big surprise is that ClampedGradPredictor (#12) is the fucking bomb. In fact it's so good that it hides the behavior of other predictors. For example plain old Grad is never picked. Also predictor #5 (grad skewed towards average) was actually by far the best until #12 came along.

The other minor surprise is that W is actually best sometimes, and N is never best, and generally N is much worse than W. Now, it is no surprise that W is better than N - it is a well known fact that typical images have much stronger horizontal correlation than vertical, but I am surprised just how *big* the difference is.

More in the next post.

6/19/2010

06-19-10 - Verio spam filters outgoing mail

Verio, the host for cbloom.com, filters your outgoing emails through some spam filter and rejects them. You, the paying customer, have no control over this; you cannot disable it or add whitelists or anything. In its wisdom it decides that emails like this are spam :

Brownies have been placed in the kitchen.  Make them vanish, please.

After much emailing with Verio tech support I get these copy-paste responses :

----------------------------------------------------------------------
Outbound spam filters 
----------------------------------------------------------------------

We apologize for any inconvenience you might have experienced with this issue.

In response to your concern. The outgoing mails might have some contents that is not allowed by your filters.

and when I ask if there's any way to disable or get around it :

************************************************************************
Outbound spam filters 
************************************************************************

I am sorry but no. There is no way to whitelist outgoing messages, we have no support for getting around our internal, outgoing spam filters

This is obviously retarded. Either you're a spammer or you're not. Spammers should have their accounts banned and the rest of us should be allowed to send fucking email. If you are considering getting an account with Verio, don't. (their prices are also terrible by modern standards).

06-19-10 - Ranting Wussies

The disappointing thing about most ranters is that when you actually get them in person with the subject of their mockery, they turn into these polite normal boring reasonable conciliatory people who see the validity of the other side's position. You see Jon Stewart do this a lot, of course BSNYC and his ilk do it, etc. Stick to your fucking guns you woosies.

06-19-10 - NiftyP4 and Timeout

I've fiddled with the NiftyP4 code and have it almost working perfectly for my needs (thanks Jim!). One small niggle remains :

When I lose my net connection from home to work, VC will still hang longer than I'd like. This appears to be coming from the P4 command. The problem is that P4 hangs, and that Nifty waits on P4. Those issues in more detail :

P4 stalls way too long when it can't connect to the server. So far as I can tell there is no way to set this timeout variable in P4 (??). (net searching is a bit hard because Perforce does have a timeout variable, but that is for controlling how long client login sessions on the server last before they are reset). I'd like to set this timeout to like 1 second, currently it's 10 seconds or something. I basically either have a fast connection or not, the long timeout is never right.

Nifty stalls on P4. This is so that when you do something like "save all", it gets notification of the VC command and can check out any necessary files before VC tries to save them. So it runs the P4 command and waits before returning to let VC do its save.

So my hack solution for the moment is to make Nifty only wait for 500 millis for the P4 command and just return if it isn't done by then. This will then give you a "file is not writeable" popup error kind of thing and saves you from the horrible stalled out DevStudio.

BTW some notes for working on addins :

The easiest way to find the names of VC commands seems to be from Keyboard mapping. It appears that they are basically not documented at all. If you can't get it to hook up from the Keyboard mapping command name, your next best option is to trap *every* command and log its name, then go into VC and do the things you want to find the name for and see what comes out of your logs. (see links below)

Disabling an addin from the Tools->Addins menu does not actually unload it (sadly). You have to close VC and restart it to make it unload so that you can update the DLL.

The easiest way to debug an addin is just to make a .AddIn file in "C:\Users\you\Documents\Visual Studio 2005\Addins" and point it at the DLL that you build from your AddIn project. Then you can set up F5 in your AddIn project to run devenv. That way you debug the devenv session that loads your AddIn and you can set breakpoints etc.

See also :

Using EnableVSIPLogging to identify menus and commands with VS 2005 + SP1 - Dr. eX's Blog - Site Home - MSDN Blogs
Source Checkout - niftyplugins - Project Hosting on Google Code
Resources about Visual Studio .NET extensibility
Many Visual studio events missing from Events.SolutionEvents
HOWTO Getting Project and ProjectItem events from a Visual Studio .NET add-in.
HOWTO Capturing commands events from Visual Studio .NET add-ins
Binding to Visual Studio internal Commands from add-in

06-19-10 - Gmail doesn't let you send mails to yourself

I always CC myself when I write an email that I think is interesting (especially with my multiple email personalities, I often CC from one to another). I've been confounded for a while that when I use gmail, it seems to not deliver the emails to myself. Not only that, but it gives no error message or delivery failure or anything, it just silently doesn't send them.

See here for more people complaining.

06-19-10 - Amazon censors product reviews

It's hard for me to get my hackles too up about this, but it's a topic that's worth noting over and over, so I'm making myself write about it.

A while ago I decided since I'm buying all this stuff from Amazon, I should write some reviews so that other people can see what products are good and which aren't. I mainly wrote reviews for products that had zero reviews, and mainly in cases where the product was not clearly described or listed.

I thought nothing of it, until I went back and was reviewing some of my purchases and noticed that there was no review on some of the products I was pretty sure I had reviewed.

Well, it turns out that Amazon censored my review. Not only did they not publish it, but they *silently* don't publish it, they don't give you any notification or reason that it was declined, it just doesn't show up.

It took me a while emailing back and forth with support to get some answers, but it turns out most of the declines come down to a specific policy :

Amazon does not allow reviews to address errors in the listing.

Some of the reviews I wrote were about incorrect pictures or list prices. For example a picture might show a set of several brass wool scrubbers, but the shipped product is only one of them. I write a review to make it clear what you're getting. BOOM! Review does not show up.

Here are the current guidelines. In particular, the troublesome ones are :


# Details about availability, price, or alternate ordering/shipping
# Comments on other reviews visible on the page (because page visibility is subject to change without notice)

Off-topic information: 

* Feedback on the seller, your shipment experience or the packaging 

* Feedback about typos or inaccuracies in our catalog or product description

Granted, some amount of censorship is necessary or you would be flooded by spam and libel, but this goes rather too far. Not being allowed to write a review when what you get is not what they said you would get is a pretty big problem.

I can also understand them not wanting reviews that say "this is available cheaper at Walmart", but all my reviews that mentioned price were things like "listing says retail price is $100, in fact retail price is around $40".

It's also simply not true that they objectively enforce those standards; you will see plenty of positive reviews around Amazon that mention price in a positive way, things like "what a great product for only $10" - that review is allowed through, but if it's a negative review like "product is not worth $10" it won't be allowed through. All my reviews that mentioned price and were censored were negative reviews. For example, one of my reviews that was blocked was about a listing that showed a picture of a box full of cleaning pads and a list price of $40; the actual shipped product was one cleaning pad (not a box full) - review deleted.

Of course I should note that this is not unusual. Yelp of course censors reviews in roughly the same way ; both Amazon and Yelp really want you to write "stories" that attract "community" and generally make people hang out and buy product. They don't want factual information that helps customers. I had several reviews deleted from Yelp because they failed to be "personal" enough (they were short things like "this place sucks"). Yelp is also semi-corrupt, despite claims otherwise they are in fact in the pocket of sponsors, and will delete reviews or ban people who write too many negative reviews.

CNET and Chow are actually much worse. They will bald-facedly delete reviews and discussion threads that are critical of sponsors. Most of the advertising sponsored web forums, such as the 2+2 poker forums or car forums like 6speedonline are the same way as well. They will lock or delete threads that are critical of sponsors or the site administration.

In conclusion : the internet is fucked. It is now owned by corporations who censor and edit the content to create the message they want. You have to be very careful about what you read on these sites, because valuable negative information might have been deleted, and if you value your own content, you should not contribute to any one of these sites.

6/17/2010

06-17-10 - Suffix Array Neighboring Pair Match Lens

I suffix sort strings for my LZ coder. To find match lengths, I first construct the array of neighboring pair match lengths. You can then find the match length between any two indexes (i,j) as the MIN of all pair match lengths between them. I wrote about this before in the post on the LZ Optimal parse strategy: to find all matches against any given suffix, you find the position of that suffix in the sort order, then walk to its neighbors and keep doing MIN() with the pair match lengths.
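In code the query is just a min-walk (a sketch, using the same array names as the routine at the end of this post ; it requires i != j, and a range-min structure would make it O(1), the linear walk is just to show the idea) :

// RR_MIN as in the code below :
#define RR_MIN(a,b) ( (a) < (b) ? (a) : (b) )

// match length between the suffixes at positions i and j ;
// sortIndexInverse[] maps position -> sort rank, sortSameLen[k] is the
// match length between sort-order neighbors k and k+1 :
static int MatchLenBetween(int i, int j,
        const int * sortIndexInverse, const int * sortSameLen)
{
    int lo = sortIndexInverse[i];
    int hi = sortIndexInverse[j];
    if ( lo > hi ) { int t = lo; lo = hi; hi = t; }

    int len = sortSameLen[lo];
    for (int k = lo+1; k < hi; k++)
        len = RR_MIN(len, sortSameLen[k]);
    return len;
}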

Constructing the pair match lengths the naive way is O(N^2) on degenerate cases (eg. where the whole file is 0). I've had a note for a long time in my code that an O(N) solution should be possible, but I never got around to figuring it out. Anyway, I stumbled upon this :


from :

Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park

Algorithm GetHeight
input: A text A and its suffix array Pos

for i := 1 to n do
    Rank[Pos[i]] := i
od
h := 0
for i := 1 to n do
    if Rank[i] > 1 then
        k := Pos[Rank[i]-1]
        while A[i+h] = A[k+h] do
            h := h+1
        od
        Height[Rank[i]] := h
        if h > 0 then h := h-1 fi
    fi
od

The idea is obvious once you see it. You walk in order of the original character array index (not the sorted order). At index [i] you find a match of length L to its suffix-sorted neighbor. At index [i+1], then, you must be able to match to a length of at least L-1 by matching to that same neighbor stepped forward one, because stepping forward one char just removes one character from that suffix match. For example :


On "mississippi" :

At index 1 :

ississippi

matches :

issippi

with length 4.

Now step forward one.

At index 2 :

ssissippi

I know I can get a match of length 3 by matching the previous suffix "issippi" (stepped one)
so I can start my string compare at 3

And here's the C code :


static void MakeSortSameLen(int * sortSameLen, // fills out sortSameLen[0 .. sortWindowLen-2]
        const int * sortIndex,const int * sortIndexInverse,const U8 * sortWindowBase,int sortWindowLen)
{
    int n = sortWindowLen;
    const U8 * A = sortWindowBase;

    int h = 0;
    for(int i=0; i< n ;i ++)
    {
        int sort_i = sortIndexInverse[i];
        if ( sort_i > 0 )
        {
            int j = sortIndex[sort_i-1];
            int h_max = sortWindowLen - RR_MAX(i,j);
            while ( h < h_max && A[i+h] == A[j+h] )
            {
                h++;
            }
            sortSameLen[sort_i-1] = h;

            if ( h > 0 ) h--;
        }
    }
}
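Hypothetical usage glue, assuming the suffix sort and its inverse are already built :

// sortSameLen[k] = match len between sort-order neighbors k and k+1 :
int * sortSameLen = new int [ sortWindowLen - 1 ];
MakeSortSameLen(sortSameLen, sortIndex, sortIndexInverse, sortWindowBase, sortWindowLen);
// the match length between any two positions is then the MIN of
// sortSameLen over the interval between their sort ranks, as above.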

06-17-10 - Investment

I bought a Hattori knife a little while ago. It requires a lot of babying - it has some carbon steel so you have to keep it dry, the steel is quite soft so you have to make sure to only let it touch other soft things, you have to sharpen it with a whetstone every so often. It's an amazing knife, but you really have to invest a lot into it. The funny thing is that the fact that it requires so much work actually makes you more fond of it. It's like the knife needs you, it's great and that greatness comes from the work you put into it.

The Porsche is sort of the same way. I don't know if they actually do this intentionally, maybe they do if they're very clever, but the car is a bit temperamental, you have to baby it a bit, check the oil all the time, let it get up to temperature before thrashing it, etc. Of course that "quirkiness" is really "shittiness", a modern car should just run and not need to be babied, but that investment that you put in actually makes you feel closer to it, it makes you really fond of it.

It's well known that our attachment to children is based on the same principle. You put so much work into making a child that you then feel very committed to it.

Of course this is really all just a classic example of one of the major flaws in human intuitive reasoning - the fallacy of sunk cost. Just because you have previously put in lots of work to your child doesn't mean that you should continue to do so. You should ignore sunk cost and just evaluate future EV based on the current situation. If you're a rational human you should wake up each day and say to yourself "should I keep my kids today?".

BTW this is another rant but I do find it somewhat amusing that the most celebrated features of humanity are basically forms of irrationality. There's only good decision making and bad decision making. Irrational decision making is bad. Anyone who is "emotional" or "sticks to their guns" or is "brave" or "puts their family before everything else" or whatever is a bad decision maker. That's not to say that a rational decision maker will never do anything that might be described as "brave" or wouldn't "put their family first", but they will do it based on considering the consequences of each choice and selecting one; anyone who does it based only on "feeling" is dangerous and not to be trusted.

06-17-10 - Friday Code Review

Today I bring you a piece of my own code that had a nasty bug. This is from my buffered file reader, which I did some fiddling on a while ago and didn't test hard enough (there are now so many ways to use my code that it's really hard to make sure I test all the paths, which is a rant for another day).

Here's the whole routine :


   S64  OodleFile_ReadLarge(OodleFile * oof,void * into,S64 count)
   {
    // use up the buffer I have :
    U8 * into_ptr = (U8 *)into;
    S64 avail = (S64)( oof->ptrend - oof->ptr );
    S64 useBuf = RR_MIN(avail,count);
        
    memcpy(into_ptr,oof->ptr,(size_t)useBuf);
    oof->ptr += useBuf;
    into_ptr += useBuf;
    count -= useBuf;
    
    rrassert( oof->ptr <= oof->ptrend );
        
    if ( count > 0 )
    {
        //OodleFile_Flush(oof);
        rrassert( oof->ptr == oof->ptrend );
        
        S64 totRead = 0;
        
        while( count > 0 )
        {
            S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
            
            S32 read = oof->vfile->Read(into_ptr,curRead);
        
            if ( read == 0 )
                break; // fail !
        
            count    -= read;
            into_ptr += read;
        }
        
        return useBuf + totRead;
    }
    else
    {
        return useBuf;
    }
   }

But the problem is here :


        S64 totRead = 0;
        
        while( count > 0 )
        {
            S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
            
            S32 read = oof->vfile->Read(into_ptr,curRead);
            
            if ( read == 0 )
                break; // fail !
            
            count    -= read;
            into_ptr += read;
            // totRead += read;  // doh !!
        }
        
        return useBuf + totRead;

That's benign enough, but I think this is actually a nice simple example of a habit I am trying to get into :

never use another variable when one of the ones you have will do

In particular this loop could be written without the redundant counter :


        while( count > 0 )
        {
            S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
            
            S32 read = oof->vfile->Read(into_ptr,curRead);
            
            if ( read == 0 )
                break; // fail !
            
            count    -= read;
            into_ptr += read;
        }
        
        return useBuf + (S64)(into_ptr - (U8 *)into);

Don't step the pointer and also count how far it steps; do one or the other, and compute them from each other. I keep finding this over and over, and it's one of the few ways that I still consistently make bugs : if you have multiple variables that need to be in sync, it's inevitable that they will get out of sync over time and you will have bugs.

(BTW I just can't help going off topic of my own posts; oh well, here we go : I hate the fact that I have to have C-style casts in my code all over now, largely because of fucking 64 bit. Yeah yeah I could static_cast or whatever fucking thing but that is no better and much uglier to read. On the line with the _MIN , I know that a _MIN of two types can fit in the size of the smaller type, so it's safe, but it's disgusting that I'm using my own semantic knowledge to validate the cast, that's very dangerous. I could use my check_value_cast here but it is tedious sticking that everywhere you do any 32-64 bit casting or pointer-int casting. On the output I could use ptrdiff_t or whatever you portability tards want, but that only pushes the cast to somewhere else (if I Read return a type ptrdiff_t then the outside caller has to turn that into S64 or whatever). I decided it was better if my file IO routines always just work with 64 bit file positions and sizes, even on 32 bit builds. Obviously running on 32 bits you can't actually do a larger than 4B byte read into a single buffer, so I could use size_t or something for the size of the read here, but I hate types that change based on the build, I much prefer to just say "file pos & size is 64 bit always".)

6/15/2010

06-15-10 - Top down - Bottom up

"Top down bottom up" sounds like a ZZ-top song, but is actually different ways of attacking tree-like problems. Today I am thinking about how to decide where to split a chunk of data for Huffman compression. A quick overview of the problem :

You have an array of N characters. You want to Huffman compress them, which means you send the code lengths and then the codes. The idea is that you may do better if you split this array into several. For each array you send new code lengths. Obviously this lets you adapt to the local statistics better, but costs the space to send those code lengths. (BTW one way to do this in practice is to code using an N+1 symbol alphabet, where the extra symbol means "send new codes"). You want to pick the split points that give you lowest total length (and maybe also a penalty for more splits to account for the speed of decode).
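Concretely, the quantity being minimized is something like this (a sketch ; the flat 4 bits per present symbol is a made-up stand-in for however you actually transmit the code lengths) :

#include <math.h>

// approximate bits to code one chunk with its own Huffman table :
// the chunk's order-0 entropy plus the cost of sending the code lengths
static double ChunkCostBits(const int histo[256], int total)
{
    double bits = 0;
    int present = 0;
    for (int s = 0; s < 256; s++)
    {
        if ( histo[s] == 0 ) continue;
        present++;
        bits += histo[s] * -log2( (double) histo[s] / total );
    }
    return bits + present * 4.0; // hypothetical header cost
}

// a split is beneficial iff
//   ChunkCostBits(left) + ChunkCostBits(right) < ChunkCostBits(whole)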

The issue is how do you pick the splits? For N characters the possible number of splits is 2^(N-1) (eg. if N = 3, the ways are [012],[0][12],[01][2],[0][1][2]) so brute force is impossible.

The standard ways to solve this problem is either top down or bottom up. I always think of these in terms of Shannon-Fano tree building (top down) or Huffman tree building (bottom up).

The top down way is : take the N elements and pick the single best split point. Then repeat on each half. This is recursive top-down tree building. One disadvantage of this approach is that it can fail to make good splits when the good splits are "hidden". eg. data like {xxxxxoooooxxxxxoooooxxxxxoooooxxxxx} might not pick any splits, because the gain from one split is not significant enough to offset the cost of sending new codes, but if you actually went all the way down the tree you would find good splits by completely segregating each unique bit. I also sometimes call this the "greedy" approach.
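A sketch of the greedy top-down recursion, using ChunkCostBits() from above (the naive re-histogramming makes this quadratic ; real code would evaluate split candidates incrementally) :

#include <vector>

static double CostOfRange(const unsigned char * chars, int lo, int hi)
{
    int histo[256] = { 0 };
    for (int i = lo; i < hi; i++) histo[ chars[i] ] ++;
    return ChunkCostBits(histo, hi - lo);
}

// find the best single split of [lo,hi) ; recurse only if it helps :
static void SplitTopDown(const unsigned char * chars, int lo, int hi,
        std::vector<int> & splits)
{
    double best = CostOfRange(chars, lo, hi); // cost of not splitting
    int bestS = -1;
    for (int s = lo+1; s < hi; s++)
    {
        double c = CostOfRange(chars, lo, s) + CostOfRange(chars, s, hi);
        if ( c < best ) { best = c; bestS = s; }
    }
    if ( bestS < 0 ) return; // greedy stop : no beneficial split
    SplitTopDown(chars, lo, bestS, splits);
    splits.push_back(bestS);
    SplitTopDown(chars, bestS, hi, splits);
}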

So the next solution is bottom up. Start with every character as a group (merge up runs of identical characters first). Then find the two groups which give you the biggest gain to merge and merge those groups. Repeat until no merge is beneficial.
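And a sketch of the bottom-up merge loop with the same cost model (O(N^2) as written ; a heap keyed on merge gain fixes that) :

#include <vector>

struct Group { int histo[256]; int total; };

static Group Merged(const Group & a, const Group & b)
{
    Group m;
    m.total = a.total + b.total;
    for (int s = 0; s < 256; s++) m.histo[s] = a.histo[s] + b.histo[s];
    return m;
}

// repeatedly merge the adjacent pair with the biggest savings ;
// stop when no merge saves any bits :
static void MergeGroups(std::vector<Group> & groups)
{
    for (;;)
    {
        double bestGain = 0;
        int bestI = -1;
        for (int i = 0; i+1 < (int) groups.size(); i++)
        {
            Group m = Merged(groups[i], groups[i+1]);
            double gain = ChunkCostBits(groups[i].histo,   groups[i].total)
                        + ChunkCostBits(groups[i+1].histo, groups[i+1].total)
                        - ChunkCostBits(m.histo, m.total);
            if ( gain > bestGain ) { bestGain = gain; bestI = i; }
        }
        if ( bestI < 0 ) return; // no beneficial merge left
        groups[bestI] = Merged(groups[bestI], groups[bestI+1]);
        groups.erase(groups.begin() + bestI + 1);
    }
}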

In general it is often the case that bottom up solutions to these problems are better than top-down, but I don't think there is any strong reason why that should be true.

There is a whole class of problems like this, such as BSP/kd tree building and category tree building in the machine learning literature (actually this is identical to category tree building in machine learning, since entropy is a pretty good way to rate the quality of category trees, and an "overfit" penalty for making too many splits is equivalent to the code cost transmission penalty here). One of the good tricks when using the top-down approach is to do an extra "non-greedy" step. When you hit the point where no more splits are beneficial, go ahead and do another split anyway speculatively and then split its children. Sometimes by going ahead one step you get past the local minimum; eg. the classic case is trying to build a kd-tree for this kind of data :


  X  O

  O  X

Where do you put a split? You currently have 50/50 entropies, no split is beneficial. If you do one split and then force another, you find a perfect tree.

Anyway, because this class of problems is unsolvable, most of the techniques are just heuristic and ad-hoc. For example in many cases it can be helpful just to always do a few mid-point splits for N levels to get things started. For the specific case of kd-trees there have been a lot of great papers in the last 5 years out of the interactive ray tracing community.

06-15-10 - The end of ad-hoc computer programming

I think we are in an interesting period where the age of the "guerilla" "seat of the pants" , eg. ad-hoc, programmer is ending. For the last 30 years, you could be a great programmer without much CS or math background if you had good algorithmic intuition. You could figure out things like heuristics for top-down or bottom-up tree building. You could dive into problems without doing research on the prior art and often make up something competitive. Game programmers mythologize these guys as "ninjas" or "hackers" or various other terms.

That's just about over. The state of the art in most areas of coding is getting so advanced that ad-hoc approaches simply don't cut the mustard. The competition is using an approach that is provably optimal, or provably within X% of optimal, or provably optimal in a time-quality tradeoff sense.

And there are so many strong solutions to problems out there now that cutting your own is more and more just foolish and hopeless.

I wouldn't say that the end of the ad-hoc programmer has come just yet, but it is coming. And that's depressing.
