6/21/2010

06-21-10 - RRZ PNG-alike notes

Okay I've futzed around with heuristics today and it's a bit of a dead end. There are just a few cases that you cannot ever catch with a heuristic.

There is one good heuristic that works 90% of the time :


do normal filter
try compress with LZ fast and MinMatchLen = 8
try compress with LZ normal and MinMatchLen = 3

if the {fast,8} wins then the image is almost certainly a "natural" image. Natural images do very well with long min match len, smooth filters, and simple LZ matchers.

If not, then it's some kind of weird synthetic thing. At that point, pretty much all bets are off. Synthetic images have the damnable property that certain patterns repeat, so they are very sensitive to whether the LZ can find those patterns after filtering and such. But, a good start is to try the no-filter case with normal LZ, and perhaps try the Adaptive, and you can use Loco or Non-Loco depending on whether the normal filter chose loco post-filter or not.
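
In code, the whole heuristic is something like this sketch (rrLZCompressSize and the LZLEVEL_ names are made-up stand-ins, not my actual API) :

typedef unsigned char U8;
enum { LZLEVEL_FAST, LZLEVEL_NORMAL };

// hypothetical : returns compressed size of buf with the given LZ settings
extern int rrLZCompressSize(const U8 * buf, int size, int lzLevel, int minMatchLen);

bool LooksNatural(const U8 * normalFiltered, int size)
{
    int sizeFast8   = rrLZCompressSize(normalFiltered, size, LZLEVEL_FAST,   8);
    int sizeNormal3 = rrLZCompressSize(normalFiltered, size, LZLEVEL_NORMAL, 3);

    // natural images like long min match lens, smooth filters, and simple matchers :
    return sizeFast8 <= sizeNormal3;
}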

But there are some bitches. kodak_12 is a natural image, and we detect that right. The problem is the best mode {N,N+L,A,A+L} changes when you optimize the LZ parameters, and it changes by quite a lot actually. Many modes will show N+L or A+L winning by quite a lot, but the right mode is N and it wins by 10k.

ryg_t.train.03.bmp is the worst. It is a solid 10% better with the "Normal" mode, but this only shows up when you do the LZ optimal parse; at any other setting of LZ all the modes are very close, but for some reason there are some magic patterns that only occur in Normal mode which are very good for the optimal parse - all the other modes stay about the same size when you turn LZ up to optimal, but Normal filter gets way smaller.

Okay, some actually useful notes :

There are some notes on the PNG web pages that say the best way to choose the filter per row is with sum of abs. Oh yeah? I can beat it. First of all, doing sum of abs but adding a small penalty for non-zero helps a tiny bit. But the best thing is to do entropy per row, and add a penalty for non-zero. You're welcome.
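
In code, the per-row cost I mean is something like this (the penalty constant is made up; tune it on your own data) :

#include <math.h>

typedef unsigned char U8;

// cost of one row of filter residuals : total entropy in bits plus a penalty per non-zero
static double RowCost(const U8 * residual, int n)
{
    int histo[256] = { 0 };
    int numNonZero = 0;
    for (int i = 0; i < n; i++)
    {
        histo[ residual[i] ]++;
        if ( residual[i] != 0 ) numNonZero++;
    }
    double bits = 0;
    for (int s = 0; s < 256; s++)
    {
        if ( histo[s] == 0 ) continue;
        double p = (double) histo[s] / n;
        bits -= histo[s] * log(p);  // accumulates total nats
    }
    bits /= log(2.0);               // nats -> bits
    const double lambda = 0.5;      // non-zero penalty; this constant is a guess
    return bits + lambda * numNonZero;
}

// per row : filter with each candidate filter, keep the one with minimum RowCost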

The filters N and (N+W)/2 are almost never best as whole-image filters, but are actually helpful in the adaptive filter loop.

I reduced my filter set down to only 5 and it hurt very little. Having the extra filters is basically free in terms of the format, but is a pain in the ass to maintain if you need to write optimized SIMD decoders for every filter on every platform. So for my own sanity, a minimum set of filters is preferable.

BTW I should note that the fact that you have to tune minMatchLen and lzLevel is an indicator of the limitations of the optimal parse. If the optimal parse really found the best LZ stream, you should just run Optimal and let it pick what matches it wants. This is an example of it finding a local minimum which is not the global minimum. The problem is that the Huffman codes are severely different if you run with MML = 3 or 5 for example. Maybe there's a way around this problem; it requires more thought.

6/20/2010

06-20-10 - Struct annoyance

It's a very annoying fact of coding life that references through a struct are much slower than loading the variables out to temps. So for example this is slow :

void rrMTF_Init(rrMTFData * pData)
{
    for (int i = 0; i < 256;  i++)
    {
        pData->m_MTFimage[i] = pData->m_MTFmap[i] = (U8) i;
    }
}

and you have to do this by hand :

void rrMTF_Init(rrMTFData * pData)
{
    U8 * pMTFimage = pData->m_MTFimage;
    U8 * pMTFmap = pData->m_MTFmap;
    for (int i = 0; i < 256; i++)
    {
        pMTFimage[i] = pMTFmap[i] = (U8) i;
    }
}

Unfortunately there are plenty of cases where this is actually a big deal.

06-20-10 - Some notes on PNG

Before we go further, let's have a look at PNG for reference.

Base PNG by default does :


Filter 5 = "adaptive" - choose from [0..4] per row using minimum sum of abs

Zlib strategy "FILTER" , which just means minMatchLength = 4

So the pngcrush guys did some clever things. Basically the idea is to try all possible ways to write a PNG and see which is smallest. The things you can play with are :


Filter 0..5

Zlib Strategy ; they have a few hard-coded but really it comes down to min match length

Zlib compression level (oddly highest level is not always best)

PNG scan options (progressive/interlaced, bottom up vs top down, etc.)

It's well known that on weird synthetic images "no filter" beats any filter. The only way you can detect that is actually by trying an LZ compress; you cannot tell from statistics.

The clever thing in pngcrush is that they don't search that whole space, but still usually find the optimal (or close to optimal) settings. The way they do it is with a heuristic guided search; they identify things that they have to always test (the 5 default strategies they try) with LZ, then depending on which of those is best they try a few others, and then maybe a few more, and then you're done. Based on which branch of the search space you walk down initially, they know from testing where the optimum likely is.

"loco" here is pngcrush with the LOCO color space conversion (RGB -> (R-G),G,(B-G) ) from JPEG-LS. This is the only lossless color conversion you can do that is not range expanding (eg. stays in bytes) (* correction : not quite true, see comments, of course any lifting style transform can be non-range-expanding using modulo arithmetic; it does appear to be the only *useful* byte-to-byte color conversion though). (BTW LOCO is not allowed in compliant PNG, but it's such a big win that it's unfair to them not to pretend that PNG can do LOCO for purposes of this comparison).

name png pngcrush loco advpng crush+adv best
ryg_t.yello.01.bmp 421321 412303 386437 373573 373573 373573
ryg_t.train.03.bmp 47438 37900 36003 34260 34260 34260
ryg_t.sewers.01.bmp 452540 451880 429031 452540 451880 429031
ryg_t.font.01.bmp 44955 37857 29001 22514 22514 22514
ryg_t.envi.colored03.bmp 113368 97203 102343 113368 97203 97203
ryg_t.envi.colored02.bmp 63241 55036 65334 63241 55036 55036
ryg_t.concrete.cracked.01.bmp 378383 377831 309126 378383 377831 309126
ryg_t.bricks.05.bmp 506528 486679 375964 478709 478709 375964
ryg_t.bricks.02.bmp 554511 553719 465099 554511 553719 465099
ryg_t.aircondition.01.bmp 29960 29960 23398 20320 20320 20320
ryg_t.2d.pn02.bmp 29443 26025 27156 24750 24750 24750
kodak_24.bmp 705730 704710 572591 705730 704710 572591
kodak_23.bmp 557596 556804 483865 557596 556804 483865
kodak_22.bmp 701584 700576 580566 701584 700576 580566
kodak_21.bmp 680262 650956 547829 646806 646806 547829
kodak_20.bmp 505528 504796 439993 500885 500885 439993
kodak_19.bmp 671356 670396 545636 671356 670396 545636
kodak_18.bmp 780454 779326 631000 780454 779326 631000
kodak_17.bmp 624331 623431 510131 615723 615723 510131
kodak_16.bmp 573671 541748 481190 517978 517978 481190
kodak_15.bmp 612134 611258 516741 612134 611258 516741
kodak_14.bmp 739487 703036 590108 739487 703036 590108
kodak_13.bmp 890577 859429 688072 866745 859429 688072
kodak_12.bmp 569219 533864 477151 535591 533864 477151
kodak_11.bmp 635794 634882 519918 635794 634882 519918
kodak_10.bmp 593590 592738 500082 593590 592738 500082
kodak_09.bmp 583329 582489 493958 558418 558418 493958
kodak_08.bmp 787619 786491 611451 787619 786491 611451
kodak_07.bmp 566085 565281 486421 566085 565281 486421
kodak_06.bmp 667888 631478 540442 644928 631478 540442
kodak_05.bmp 807702 806538 638875 807702 806538 638875
kodak_04.bmp 637768 636856 532209 637768 636856 532209
kodak_03.bmp 540788 506321 464434 514336 506321 464434
kodak_02.bmp 617879 616991 508297 596342 596342 508297
kodak_01.bmp 779475 760251 588034 706982 706982 588034
bragzone_TULIPS.bmp 680124 679152 591881 680124 679152 591881
bragzone_SERRANO.bmp 153129 107167 107759 96932 96932 96932
bragzone_SAIL.bmp 807933 806769 623437 807933 806769 623437
bragzone_PEPPERS.bmp 426419 424748 376799 426419 424748 376799
bragzone_MONARCH.bmp 614974 614098 526754 614974 614098 526754
bragzone_LENA.bmp 474968 474251 476524 474968 474251 474251
bragzone_FRYMIRE.bmp 378228 252423 253967 230055 230055 230055
bragzone_clegg.bmp 483752 483056 495956 483752 483056 483056

There's no adv+loco because advpng and advmng both refuse to work on the "loco" bastardized semi-PNG.

BTW I should note that I should eat my hat a little bit over my own "PNG sucks" post. The thing is, yes, basic PNG is easy to beat and it has a lot of mistakes in the design, but the basic idea is fine, and they did a good job on the standard pretty quickly. The thing that really seals the deal is that once you make a flexible open standard, people will step in and find ways to play with it, and while base PNG is pretty bad, PNG after optimization is not bad at all.

06-20-10 - Searching my PNG-alike

Okay so stealing some ideas from pngcrush let's check out the search space. I decide I'm going to examine various filters, Loco or no loco, LZ min match len, and LZ compress "level" (level = how hard it looks for matches).

Here are the results for my PNG-alike with the modes :

0 = no filter
Loco = no filter (in loco space)
Normal = select a single DPCM filter for the whole image
N+L = Normal in LOCO space
Adaptive = per row best DPCM filter
A+L = you get it by now

The left six columns are these modes with default LZ parameters (Fast match, min match len = 4). The right six columns are the same modes with LZ parameters optimized for each mode.

name 0 Loco Normal N+L Adaptive A+L optimized LZs 0 Loco Normal N+L Adaptive A+L
ryg_t.yello.01.bmp 458255 435875 423327 427031 438946 431678 372607 359799 392963 395327 415370 401618
ryg_t.train.03.bmp 56455 55483 69635 76211 68022 67678 36399 35195 31803 40155 37582 36638
ryg_t.sewers.01.bmp 599803 610463 452287 452583 466154 452834 593935 593759 421779 420091 443786 421166
ryg_t.font.01.bmp 42239 32207 38855 38855 53070 36746 33119 26911 35383 35383 40998 33798
ryg_t.envi.colored03.bmp 297631 309347 150183 165803 142658 157046 265755 278103 102487 114923 95394 109022
ryg_t.envi.colored02.bmp 109179 112623 100687 113867 89178 98374 90039 93535 62139 68027 54662 57514
ryg_t.concrete.cracked.01.bmp 481115 407759 355575 356911 408054 361602 384795 353235 299963 301907 342810 310342
ryg_t.bricks.05.bmp 551475 485271 418907 417655 492622 418406 469063 448279 372195 370459 429066 373310
ryg_t.bricks.02.bmp 665315 632623 482367 481347 538670 483106 590319 577699 455431 455203 522158 455426
ryg_t.aircondition.01.bmp 41635 29759 26011 26011 32866 25738 29023 25103 20547 20547 23946 20522
ryg_t.2d.pn02.bmp 25075 26539 28303 29259 28046 28974 22147 22723 25915 26179 26194 25790
kodak_24.bmp 826797 771829 640137 634141 726892 633308 723681 712285 567693 558085 684060 559564
kodak_23.bmp 835449 783941 576481 569981 608476 565172 724641 712001 481577 478041 551292 479240
kodak_22.bmp 898101 879949 655461 651213 722096 651000 803429 804073 577433 571301 689664 574252
kodak_21.bmp 781077 708861 617565 629401 705608 618424 647881 633069 549865 549025 665724 545584
kodak_20.bmp 609705 561957 495509 500161 537692 494484 501293 490849 434865 431745 506592 429556
kodak_19.bmp 822045 733053 630793 624897 697020 619064 673953 658345 550669 541845 660444 541424
kodak_18.bmp 941565 912081 691161 692425 789736 693804 850705 848353 618961 619949 764004 622628
kodak_17.bmp 758089 678233 597225 592941 660016 590292 617097 606045 507169 504961 610092 508672
kodak_16.bmp 650557 587537 543001 543001 607916 545244 522829 511833 466277 466277 536280 468136
kodak_15.bmp 759109 697257 595353 590321 648656 586324 643385 628481 511193 504213 593304 506728
kodak_14.bmp 891629 793553 661505 657357 745816 649928 749085 731569 584645 580301 707596 581520
kodak_13.bmp 997161 891637 729425 730901 878212 729580 853557 829057 677041 678613 802224 680196
kodak_12.bmp 672361 602825 545921 562749 606004 549292 539305 526793 465297 472693 539532 467088
kodak_11.bmp 758197 691697 604869 597125 665364 587264 639537 625145 523649 512685 608388 510200
kodak_10.bmp 747121 681213 592637 589561 635972 578888 625961 611553 504573 499753 576284 497400
kodak_09.bmp 688365 629429 587233 583245 627916 571652 565449 562329 501377 495637 567772 491896
kodak_08.bmp 1001257 882153 686961 684177 792916 684056 860269 825757 615657 610505 766620 610524
kodak_07.bmp 709829 673917 568649 563561 616820 561636 605177 600157 480705 473233 551136 473500
kodak_06.bmp 779709 694229 600145 600145 687824 601564 642525 626449 534037 534037 637180 534804
kodak_05.bmp 962357 873793 700581 695257 810688 694828 845905 823025 633581 623341 783400 624368
kodak_04.bmp 813869 729865 613241 606849 672176 607280 677345 660057 531533 522061 622904 528064
kodak_03.bmp 639809 586873 522049 542681 581880 527856 519549 510309 437765 452965 511900 443948
kodak_02.bmp 729069 649709 603853 591781 654408 585916 598913 584941 515437 502941 602364 500964
kodak_01.bmp 872669 747333 661481 655597 772808 653388 699945 682005 593001 582389 689436 586328
bragzone_TULIPS.bmp 1032589 1021309 646905 652213 701508 652128 966881 969913 565997 571377 662504 570796
bragzone_SERRANO.bmp 150142 151706 169982 169302 173229 175217 103462 103566 139306 139074 142609 143457
bragzone_SAIL.bmp 983473 941993 686013 686117 795152 686204 892301 887097 613845 609953 762420 610008
bragzone_PEPPERS.bmp 694247 679795 424603 423679 451006 424262 655987 650771 369291 366611 416426 368106
bragzone_MONARCH.bmp 887917 868533 600373 598985 654864 598976 810325 805725 507937 508085 598348 508096
bragzone_LENA.bmp 737803 733251 493703 498299 498274 502150 710215 704763 467103 475179 471586 477638
bragzone_FRYMIRE.bmp 342667 344807 419859 420811 420506 419026 241899 242355 335063 336859 335990 335894
bragzone_clegg.bmp 760493 799717 525077 541329 523580 536244 557529 571413 445897 468265 444736 465376

One important note :

The "Normal" filters include the option to do a post-filter Loco conversion. This is different than the "loco space" option in the modes above. Let me elaborate. "Loco" in the modes listed above means transform the image into Loco colorspace, and then proceed with filtering and LZ. Loco built into a filter mode means, on each pixel do the filter delta, then do loco conversion. This can be integrated directly into the DPCM pixel delta code, so it's just considered a filter type. In particular, in "loco space", then the N,W,NW neighbors are already in loco colorspace. When loco is part of the filter, the neighbors are in RGB space and the delta is converted after the fact. If everything was linear, these would be equivalent.

Okay, what do we see?

It's very surprising to me how much LZ optimization helps. In particular it surprises me that making the LZ search *worse* (by turning down compression "level") helps a lot; as does increasing match len; on natural images a min match len around 8 or 10 is usually best. (or even more, I forbid a min match len > 10 because it starts to hurt decode speed).

Well, we were hoping that we could pick the mode based on the default LZ parameters, and then just optimize the LZ parameters for that one mode. It is often the case that the best mode after optimization is the same as the best mode before optimization, but not always. When it is not the case, it is usually a small mistake. However, there is one case where it's a very bad mistake - on ryg_t.yello.01.bmp you would make output of 393k instead of 360k.

Natural images are the easiest; for them you can pretty much just pick A+L (or N+L) and you're very close to best if you didn't get the best. Synthetic images are harder, they are very sensitive to the exact mode.

We can also say that no filter + loco is almost always wrong, except for that same annoying one case. Unfortunately I don't see any heuristic that can detect when 0+loco needs to be checked. Similarly for adaptive + noloco.

Obviously there's a fundamental problem when the initial sizes are very close together: you can't really differentiate between the modes. When the sizes are very far apart then it is a reliable guess.

Let me briefly note things I could be searching that I'm not :

Rearranging pixels in various ways, eg. RGBRGB -> RRRGGGBB , or to whole planes; interleaving lines, different scan orders, etc. etc.

LSQR fits for predictors. This doesn't hurt decode speed a ton so it would fit in my design spec, I'm just getting sick of wasting my time on this so I'm not bothering with it.

Predictors on different regions instead of per row. eg. a predictor type per 16x16 tile or something.

Anything that hurts decode speed, eg. bigger predictors, adaptive predictors, non-LZ coder, etc.

Oh I'm also not looking at any pixel format conversions; I assume the client has put it in the pixel format they want and won't change it. Obviously some of the PNG optimizers can win by palettizing when not many colors are actually used, and of course there are lots of other pixel formats that might help, blah blah.

Oh while I'm at it, I should also note that my LZ is actually kind of crippled for this comparison. I divide the data stream into 256k chunks and compress them completely independently (no LZ window across the chunk boundary). This lets me seek on compressed data and decompress portions of it independently, but it is quite a bad penalty.

06-20-10 - PNG Comparo

Okay this is kind of bogus and I thought about not even posting it, but come on, you need the closure right? Everyone likes a comparo. First of all, why this is bogus : 1. PNG just cannot compete without better LOCO support. Here I am allowing the LOCO files, but they were not advpng/pngout optimized , and of course they're not really PNGs. 2. I have that crippling 256k chunking on my format. I guess if I wanted to do a fair comparo I should make a version of my shit which doesn't have LOCO and also remove my 256k chunking and compare that vs. no-loco PNG. God dammit now I have to do that.

Whatever, here you go :

RRZ heuristic = guided search to try to find the best set of options
RRZ best = actual best options for my thingy
png best = best of advpng/crush/loco

RRZ heuristic RRZ best png best
ryg_t.yello.01.bmp 392963 359799 373573
ryg_t.train.03.bmp 35195 31803 34260
ryg_t.sewers.01.bmp 421779 420091 429031
ryg_t.font.01.bmp 26911 22514
ryg_t.envi.colored03.bmp 95394 97203
ryg_t.envi.colored02.bmp 54662 55036
ryg_t.concrete.cracked.01.bmp 299963 309126
ryg_t.bricks.05.bmp 370459 375964
ryg_t.bricks.02.bmp 455203 465099
ryg_t.aircondition.01.bmp 20522 20320
ryg_t.2d.pn02.bmp 22147 24750
kodak_24.bmp 559564 558085 572591
kodak_23.bmp 479240 478041 483865
kodak_22.bmp 574252 571301 580566
kodak_21.bmp 549865 545584 547829
kodak_20.bmp 429556 439993
kodak_19.bmp 541424 545636
kodak_18.bmp 618961 631000
kodak_17.bmp 508672 504961 510131
kodak_16.bmp 466277 481190
kodak_15.bmp 506728 504213 516741
kodak_14.bmp 581520 580301 590108
kodak_13.bmp 677041 688072
kodak_12.bmp 465297 477151
kodak_11.bmp 510200 519918
kodak_10.bmp 497400 500082
kodak_09.bmp 491896 493958
kodak_08.bmp 610524 610505 611451
kodak_07.bmp 473500 473233 486421
kodak_06.bmp 534037 540442
kodak_05.bmp 624368 623341 638875
kodak_04.bmp 522061 532209
kodak_03.bmp 437765 464434
kodak_02.bmp 500964 508297
kodak_01.bmp 586328 582389 588034
bragzone_TULIPS.bmp 565997 591881
bragzone_SERRANO.bmp 103462 96932
bragzone_SAIL.bmp 613845 609953 623437
bragzone_PEPPERS.bmp 366611 376799
bragzone_MONARCH.bmp 508096 507937 526754
bragzone_LENA.bmp 467103 474251
bragzone_FRYMIRE.bmp 241899 230055
bragzone_clegg.bmp 444736 483056

PNG wins by a little bit on FRYMIRE, SERRANO, ryg_t.aircondition.01.bmp, and ryg_t.font.01.bmp. I'm going to pretend that I don't know that, because that's what sent me down this god damn pointless rabbit hole in the first place: I discovered that PNG beat me on a few files, so I had to find out why and fix myself.

Anyway, something that would be more productive would be to write a fast PNG decoder. All the PNG decoders out there in the world are woefully slow. Let me tell you all how to write a fast PNG decoder :

1. First make sure your Zip decoder is fast. The standard ones are okay, but they do too much checking for end of buffer and do you have enough bits blah blah. The correct way to do that is to allocate your decompression buffers 64k aligned, and put some NO_ACCESS pages on each end. Then just let your Zip decoder run. Make sure it will never crash on bad input - it will just make bad output (this is relatively easy to do and doesn't require explicit checks, just careful coding to make sure all compressed bit streams decode to something).
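
A sketch of the guard-page setup on Win32 (page-sized guards here; VirtualAlloc reservations are 64k aligned already) :

#include <windows.h>

// allocate a buffer with NO_ACCESS pages on each end; a decoder that runs off
// either end faults immediately instead of silently corrupting memory
void * AllocWithGuardPages(SIZE_T size)
{
    const SIZE_T page = 4096;
    SIZE_T padded = ((size + page - 1) & ~(page - 1)) + 2 * page;
    char * base = (char *) VirtualAlloc(NULL, padded, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if ( ! base ) return NULL;
    DWORD old;
    VirtualProtect(base, page, PAGE_NOACCESS, &old);                  // front guard
    VirtualProtect(base + padded - page, page, PAGE_NOACCESS, &old);  // back guard
    return base + page;
}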

2. The un-filtering for PNG needs to be unrolled for the exact data type and filter. You can do this in C very neatly using template loop inversion which I wrote about previously. For maximum speed however you really should do the un-filter with SIMD. It's a very nice easy case for SIMD, except for the god fucking awful pointless abortion that is the Paeth filter.
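
The template loop inversion looks something like this (a sketch with made-up names; the real thing covers all the filters and bytes-per-pixel variants) :

typedef unsigned char U8;

enum EFilter { eFilter_Sub, eFilter_Up, eFilter_Avg };

// the filter is a compile-time constant, so the compiler kills the dead branches
// and you get one tight loop per filter :
template <EFilter t_filter>
static void UnFilterRow(U8 * row, const U8 * prev, int n, int bpp)
{
    for (int i = 0; i < n; i++)
    {
        int W = (i >= bpp) ? row[i - bpp] : 0;
        int N = prev ? prev[i] : 0;
        if      ( t_filter == eFilter_Sub ) row[i] = (U8)(row[i] + W);
        else if ( t_filter == eFilter_Up  ) row[i] = (U8)(row[i] + N);
        else if ( t_filter == eFilter_Avg ) row[i] = (U8)(row[i] + ((N + W) >> 1));
    }
}

// one switch per row instead of one per pixel :
void UnFilterRowDispatch(EFilter f, U8 * row, const U8 * prev, int n, int bpp)
{
    switch ( f )
    {
    case eFilter_Sub: UnFilterRow<eFilter_Sub>(row, prev, n, bpp); break;
    case eFilter_Up:  UnFilterRow<eFilter_Up >(row, prev, n, bpp); break;
    case eFilter_Avg: UnFilterRow<eFilter_Avg>(row, prev, n, bpp); break;
    }
}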

3. Un-filtering and LZ decompress should be interleaved for cache hotness. You decode a bit, un-filter a bit, then stream out data in the final pixel format into client-ready packed plane. The zip window is only 32k and you only need one previous row to filter, so your whole set of read-write data should be less than 64k, and the data you stream out should be written to a separate buffer with NTQ write-combining style writes. Ideally your stream out supports enough pixel formats that it can write directly to whatever the client needs (X8R8G8B8 for D3D or whatever) so that memory doesn't have to be touched again. Because the output buffer is only written write combined you can decode directly into locked textures.

My guess is that this should be in the neighborhood of 80 MB/sec.

06-20-10 - KZip

While futzing around on PNG I discovered that there is this whole community of people who are hard-core optimizing PNG's for distribution sizes. They have these crazy tools like pngcrush/optipng/advancecomp/pngout.

AdvPNG is just the Zip encoder from 7zip (Igor Pavlov). It's a semi-optimal parser using a forward-pass multi-lookahead heuristic like Quantum. Some day I need to try to read the source code to figure out what he's doing exactly, but god damnit the code is ugly and it's not documented. Dammit Igor.

PNGOUT is a lossless "deflate" (zip) optimizer. The engine is the same as KZip by Ken Silverman. Unfortunately Ken has not documented his algorithm at all anywhere. Come on Ken! Nobody is gonna pay you for KZip! Just write about it!

Anyway, my guess from reading his brief description and looking at what it does is : KZip has some type of optimal parser. Who knows what kind of optimal parser; my guess, knowing a bit about how Ken codes, is that it is probably not an actual Storer-Szymanski optimal parser, but rather some kind of heuristic, perhaps like 7zip/LZMA. KZip also clearly has some kind of Huffman split point optimizer (similar to what I just did). Again just guessing from the command line options, it looks like his Huffman splitter is single pass and is just based on a heuristic that detects changes in the statistics and decides to put a split there. Hmmm, I wish I'd found this months ago.

KZip appears to be the best zip optimizer in existence. Despite claims of being crazy slow I actually think it's quite fast by my standards. No surprise KZip is a lot smaller and faster than my optimal parser, but I do make smaller files. Ken ! Set your algorithm free! (ADDENDUM : whoops, that was only on PNG data; for some reason it's pretty fast on image data, but it's slow as balls in some other cases, not sure what's going on there; 7zip is a lot faster than kzip and the file sizes are usually very close (once in a while kzip does significantly better)).

For a while I've been curious to try my RAD LZ optimizer on a Zip token stream. It would be a nice sanity check to test it against KZip, but I'm not motivated enough to take the pain of figuring out how to write Zip tokens.

06-20-10 - Filters for PNG-alike

The problem :

Find a DPCM pixel prediction filter which uses only N,W,NW and does not range-expand (eg. ubytes stay in ubytes). (eg. like PNG).

We certainly could use a larger neighborhood, or adaptive predictors that evaluate the neighborhood for edges etc.; we would wind up with GAP from CALIC or something newer. We want to keep it simple so we can have a very fast decoder.
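
(The helpers used below, roughly - these are my reconstructions of the obvious definitions, not the actual cblib versions :)

#define RR_MIN(a,b)       ( (a) < (b) ? (a) : (b) )
#define RR_MAX(a,b)       ( (a) > (b) ? (a) : (b) )
#define RR_MIN3(a,b,c)    RR_MIN(RR_MIN(a,b),c)
#define RR_MAX3(a,b,c)    RR_MAX(RR_MAX(a,b),c)
#define RR_ABS(a)         ( (a) < 0 ? -(a) : (a) )
#define RR_CLAMP_255(x)   ( (x) < 0 ? 0 : ( (x) > 255 ? 255 : (x) ) )

// middle of three values :
static inline int Median3(int a, int b, int c)
{
    return RR_MAX( RR_MIN(a,b), RR_MIN( RR_MAX(a,b), c ) );
}

// rr::clamp(x,lo,hi) = RR_MIN( RR_MAX(x,lo), hi )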

These filters :


    case 0: // 0 predictor is the same as NONE
        return 0;
    
    case 1: // N
        return N;
        
    case 2: // W
        return W;
        
    case 3: // gradient // this tends to win on synthetic images
        {
        int pred = N + W - NW;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 4: // average
        return (N+W)>>1;
        
    case 5: // grad skewed towards average // this tends to win on natural images - before type 12 took over anyway
        {
        int pred = ( 3*N + 3*W - 2*NW + 1) >>2;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
    
    case 6: // grad skewed even more toward average
        {
        int pred = ( 5*N + 5*W - 2*NW + 3) >>3;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 7: // grad skewed N
        {
        int pred = (2*N + W - NW + 1)>>1;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 8: // grad skewed W
        {
        int pred = (2*W + N - NW + 1)>>1;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
    
    case 9: // new
        {
        int pred = (3*N + 2*W - NW + 1)>>2;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 10:    // new
        {
        int pred = (2*N + 3*W - NW + 1)>>2;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 11: // new
        return (N+W + 2*NW + 1)>>2;
        
    case 12: // ClampedGradPredictor
    {
        int grad = N + W - NW;
        int lo = RR_MIN3(N,W,NW);
        int hi = RR_MAX3(N,W,NW);
        return rr::clamp(grad,lo,hi);
    }   
    
    case 13: // median
        return Median3(N,W,NW);
    
    case 14:
    {
        // semi-Paeth
        // but only pick N or W 
        int grad = N + W - NW;
        // pick closer of N or W to grad
        if ( RR_ABS(grad - N) < RR_ABS(grad - W) )
            return N;
        else
            return W;
    }

perform like this :

name 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
ryg_t.train.03.bmp 56347 76447 70047 80619 76187 79811 77519 79771 76979 77399 76279 80383 75739 75875 74635
ryg_t.sewers.01.bmp 604843 475991 496067 468727 466363 455359 458111 458727 469203 456399 462727 496691 452663 495931 464915
ryg_t.font.01.bmp 42195 54171 53599 56135 73355 66955 70163 60199 63815 69499 70099 79983 59931 61219 56735
ryg_t.envi.colored03.bmp 316007 147315 136255 158063 139231 154551 148419 155379 149667 148683 143415 145871 148099 144851 153311
ryg_t.envi.colored02.bmp 111763 98835 87711 111367 93123 108867 100003 106275 95927 96391 97383 96907 99275 94371 100707
ryg_t.concrete.cracked.01.bmp 493143 416307 449899 426755 403779 402635 400923 408019 420291 398215 408299 437371 409055 440371 424087
ryg_t.bricks.05.bmp 568755 534267 524243 514563 509375 493923 497639 505267 499007 501079 497419 551687 499855 547935 511595
ryg_t.bricks.02.bmp 684515 590955 577207 557155 560391 537987 545019 551251 544555 549543 544635 602115 545919 600139 557659
ryg_t.aircondition.01.bmp 41595 33215 34279 33535 33199 32695 32683 32619 33207 32503 32987 35795 31507 35171 32243
ryg_t.2d.pn02.bmp 25419 27815 28827 29499 32203 32319 32443 29679 32531 32315 32575 34183 28307 31247 27919
ryg_gemein.bmp 797 801 173 153 565 585 553 521 205 633 529 813 153 825 141
kodak_24.bmp 843281 735561 760793 756509 744021 736037 735797 735989 753709 733105 741801 785077 719185 777905 727221
kodak_23.bmp 865985 593977 616885 617105 591501 592561 588837 595949 609437 586121 595793 620657 585453 617825 601713
kodak_22.bmp 919613 732613 756249 741441 719973 712757 711245 721365 732337 709621 719157 760065 706517 763481 724053
kodak_21.bmp 797741 753057 691653 746033 713469 716065 710321 737017 713845 720525 704129 744785 701733 744633 709445
kodak_20.bmp 624901 563261 538069 571633 540981 548329 541773 559437 548549 546565 540337 559545 539289 557105 545901
kodak_19.bmp 847781 724541 728337 732717 718309 712989 712217 716593 729505 712893 715053 752173 689953 756849 698881
kodak_18.bmp 958657 808577 820253 820429 783373 783089 777377 794965 799449 779117 782541 821717 783005 821025 802569
kodak_17.bmp 782061 664829 655845 685273 651025 657225 650477 666625 666645 653117 652293 680813 644849 675045 656597
kodak_16.bmp 664401 696493 595101 673961 646981 647441 643353 675157 640833 658721 631509 684517 628837 682217 629293
kodak_15.bmp 779289 639009 660137 677649 643985 651729 644809 650525 666229 641669 650377 673545 635441 662389 645817
kodak_14.bmp 912149 800681 742901 787273 754233 754325 749641 778777 758045 760761 744001 795029 743909 791341 755425
kodak_13.bmp 1015697 932969 898929 941109 890989 900533 890221 918073 905705 897865 888113 925757 894461 922701 907029
kodak_12.bmp 690537 641517 583717 641973 613461 617881 612461 635253 617913 620833 607177 644837 598393 638889 602213
kodak_11.bmp 773937 725081 674657 714169 697533 691425 690389 708197 695109 698369 685965 738609 671365 739385 676597
kodak_10.bmp 766701 650637 648349 665309 648997 641961 641481 651741 651785 643369 642133 685733 623437 684949 629193
kodak_09.bmp 703689 640025 628161 659637 635201 637657 633881 644193 647817 636281 634105 662125 615393 662801 621873
kodak_08.bmp 1026037 862013 888589 835329 884329 841405 859717 839193 861917 856097 865405 942073 792833 950005 792201
kodak_07.bmp 725525 654925 605753 629757 633945 622693 627053 643069 621433 634965 620645 673361 597821 666629 604337
kodak_06.bmp 796077 786789 680841 750253 731117 723657 722981 753989 714849 739405 709825 772685 703417 773613 705529
kodak_05.bmp 981245 850313 836165 845001 815801 814429 810709 832337 825073 815897 811729 852385 808993 853785 828145
kodak_04.bmp 845721 672369 679245 692469 659981 663625 657621 670629 678165 658845 661957 691945 653629 688253 670325
kodak_03.bmp 656841 605525 559081 619713 581713 596313 587633 607841 600725 592761 584449 601901 579221 592745 587409
kodak_02.bmp 761633 666601 663837 677385 648117 649785 644985 661253 662077 647553 647545 682089 640501 687305 654309
kodak_01.bmp 896361 850353 811873 828397 822169 807197 810217 827621 815245 818249 807665 869829 781173 875265 788725
bragzone_TULIPS.bmp 1052229 729141 757129 688121 703317 671417 683069 683653 696193 682109 691257 756241 675637 761477 704293
bragzone_SERRANO.bmp 150898 172286 160718 193602 255462 274994 278094 212478 254902 274478 277418 285990 171918 213214 167058
bragzone_SAIL.bmp 1004581 865725 826729 834385 811121 798037 798301 819669 807221 805785 796761 862209 795581 868857 813521
bragzone_PEPPERS.bmp 712087 469243 456511 438211 442695 428627 433227 437215 436595 436059 434359 471587 426191 474595 439575
bragzone_MONARCH.bmp 907737 670825 671745 644513 638201 623613 627513 640213 640485 630941 630941 676401 624925 679553 652849
bragzone_LENA.bmp 745299 484027 513747 506631 478875 481911 477131 481819 494431 474343 483007 498519 478559 498799 489203
bragzone_FRYMIRE.bmp 342567 432867 399963 481259 608567 664895 670755 544963 612075 666279 668187 693467 433755 456899 421263
bragzone_clegg.bmp 806117 691593 511625 518489 1208409 1105433 1167561 733217 951773 1158429 1165953 1289969 502845 1286365 505093

commentary :

The big surprise is that ClampedGradPredictor (#12) is the fucking bomb. In fact it's so good that it hides the behavior of other predictors. For example plain old Grad is never picked. Also predictor #5 (grad skewed towards average) was actually by far the best until #12 came along.

The other minor surprise is that W is actually best sometimes, and N is never best, and generally N is much worse than W. Now, it is no surprise that W is better than N - it is a well known fact that typical images have much stronger horizontal correlation than vertical, but I am surprised just how *big* the difference is.

More in the next post.

6/17/2010

06-17-10 - Suffix Array Neighboring Pair Match Lens

I suffix sort strings for my LZ coder. To find match lengths, I first construct the array of neighboring pair match lengths. You can then find the match length between any two indexes (i,j) as the MIN of all pair match lengths between them. I wrote about this before in the post on the LZ Optimal parse strategy; to find all matches against any given suffix, you find the position of that suffix in the sort order, then walk to its neighbors, keeping a running MIN() of the pair match lengths.
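
The walk looks like this (a sketch; sortSameLen is the array built by the code at the end of this post, ConsiderMatch is a hypothetical consumer) :

// all matches for the suffix at pos, by walking sort-order neighbors and MINing pair lens :
int sort_i = sortIndexInverse[pos];
int len = INT_MAX;
for (int k = sort_i - 1; k >= 0; k--)
{
    len = RR_MIN(len, sortSameLen[k]);  // match len between pos and suffix sortIndex[k]
    if ( len < minMatchLen ) break;     // once too short, it can only get shorter
    ConsiderMatch( sortIndex[k], len ); // hypothetical
}
// ... then the same walk in the other direction (k = sort_i .. n-2)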

Constructing the pair match lengths the naive way is O(N^2) on degenerate cases (eg. where the whole file is 0). I've had a note for a long time in my code that an O(N) solution should be possible, but I never got around to figuring it out. Anyway, I stumbled upon this :


from :

Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park

Algorithm GetHeight
input: A text A and its suffix array Pos

for i := 1 to n do
    Rank[Pos[i]] := i
od
h := 0
for i := 1 to n do
    if Rank[i] > 1 then
        j := Pos[Rank[i]-1]
        while A[i+h] = A[j+h] do
            h := h+1
        od
        Height[Rank[i]] := h
        if h > 0 then h := h-1 fi
    fi
od

The idea is obvious once you see it. You walk in order of the original character array index, (not the sorted order). At index [i] you find a match of length L to its suffix-sorted neighbor. At index [i+1] then, you must be able to match to a length of at least the same length-1 by matching to the same neighbor, because by stepping forward one char you've just removed one from that suffix match. For example :


On "mississippi" :

At index 1 :

ississippi

matches :

issippi

with length 4.

Now step forward one.

At index 2 :

ssissippi

I know I can get a match of length 3 by matching the previous suffix "issippi" (stepped one)
so I can start my string compare at 3

And here's the C code :


static void MakeSortSameLen(int * sortSameLen, // fills out sortSameLen[0 .. sortWindowLen-2]
        const int * sortIndex,const int * sortIndexInverse,const U8 * sortWindowBase,int sortWindowLen)
{
    int n = sortWindowLen;
    const U8 * A = sortWindowBase;

    int h = 0;
    for (int i = 0; i < n; i++)
    {
        int sort_i = sortIndexInverse[i];
        if ( sort_i > 0 )
        {
            int j = sortIndex[sort_i-1];
            int h_max = sortWindowLen - RR_MAX(i,j);
            while ( h < h_max && A[i+h] == A[j+h] )
            {
                h++;
            }
            sortSameLen[sort_i-1] = h;

            if ( h > 0 ) h--;
        }
    }
}

6/15/2010

06-15-10 - Top down - Bottom up

"Top down bottom up" sounds like a ZZ-top song, but is actually different ways of attacking tree-like problems. Today I am thinking about how to decide where to split a chunk of data for Huffman compression. A quick overview of the problem :

You have an array of N characters. You want to Huffman compress them, which means you send the code lengths and then the codes. The idea is that you may do better if you split this array into several. For each array you send new code lengths. Obviously this lets you adapt to the local statistics better, but costs the space to send those code lengths. (BTW one way to do this in practice is to code using an N+1 symbol alphabet, where the extra symbol means "send new codes"). You want to pick the split points that give you lowest total length (and maybe also a penalty for more splits to account for the speed of decode).

The issue is how do you pick the splits? For N characters the possible number of splits is 2^(N-1) (eg. if N = 3, the ways are [012],[0][12],[01][2],[0][1][2]) so brute force is impossible.

The standard ways to solve this problem are either top down or bottom up. I always think of these in terms of Shannon-Fano tree building (top down) or Huffman tree building (bottom up).

The top down way is : take the N elements and pick the single best split point. Then repeat on each half. This is recursive top-down tree building. One disadvantage of this approach is that it can fail to make good splits when the good splits are "hidden". eg. data like {xxxxxoooooxxxxxoooooxxxxxoooooxxxxx} might not pick any splits, because the gain from one split is not significant enough to offset the cost of sending new codes, but if you actually went all the way down the tree you would find good splits by completely segregating each unique bit. I also sometimes call this the "greedy" approach.
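
In code the greedy top-down splitter is roughly this (Cost() stands for entropy of the range plus the price of sending new code lengths; a real version restricts the candidate split points rather than trying all of them) :

#include <vector>

typedef unsigned char U8;

extern double Cost(const U8 * chars, int lo, int hi); // entropy + header cost; hypothetical

static void SplitGreedy(const U8 * chars, int lo, int hi, std::vector<int> & splits)
{
    double best = Cost(chars, lo, hi);  // cost with no split
    int bestPos = -1;
    for (int s = lo + 1; s < hi; s++)   // in practice : only a sparse set of candidates
    {
        double c = Cost(chars, lo, s) + Cost(chars, s, hi);
        if ( c < best ) { best = c; bestPos = s; }
    }
    if ( bestPos < 0 ) return;          // greedy : stop when no split is beneficial
    splits.push_back(bestPos);
    SplitGreedy(chars, lo, bestPos, splits);
    SplitGreedy(chars, bestPos, hi, splits);
}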

So the next solution is bottom up. Start with every character as a group (merge up runs of identical characters first). Then find the two groups which give you the biggest gain to merge and merge those groups. Repeat until no merge is beneficial.

In general it is often the case that bottom up solutions to these problems are better than top-down, but I don't think there is any strong reason why that should be true.

There are a whole class of problems like this, such as BSP/kd tree building and category tree building in the machine learning literature (actually this is identical to category tree building in machine learning, since entropy is a pretty good way to rate the quality of category trees, and an "overfit" penalty for making too many splits is equivalent to the code cost transmission penalty here). One of the good tricks if using the top-down approach is to do an extra "non-greedy" step. When you hit the point where no more splits are beneficial, go ahead and do another split anyway speculatively and then split its children. Sometimes by going ahead one step you get past the local minimum, eg. the classic case is trying to build a kd-tree for this kind of data :


  X  O

  O  X

Where do you put a split? You currently have 50/50 entropies, no split is beneficial. If you do one split and then force another, you find a perfect tree.

Anyway, because this class of problems is unsolvable, most of the techniques are just heuristic and ad-hoc. For example in many cases it can be helpful just to always do a few mid-point splits for N levels to get things started. For the specific case of kd-trees there have been a lot of great papers in the last 5 years out of the interactive ray tracing community.

06-15-10 - The end of ad-hoc computer programming

I think we are in an interesting period where the age of the "guerilla" "seat of the pants" , eg. ad-hoc, programmer is ending. For the last 30 years, you could be a great programmer without much CS or math background if you had good algorithmic intuition. You could figure out things like heuristics for top-down or bottom-up tree building. You could dive into problems without doing research on the prior art and often make up something competitive. Game programmers mythologize these guys as "ninjas" or "hackers" or various other terms.

That's just about over. The state of the art in most areas of coding is getting so advanced that ad-hoc approaches simply don't cut the mustard. The competition is using an approach that is provably optimal, or provably within X% of optimal, or provably optimal in a time-quality tradeoff sense.

And there are so many strong solutions to problems out there now that cutting your own is more and more just foolish and hopeless.

I wouldn't say that the end of the ad-hoc programmer has come just yet, but it is coming. And that's depressing.

6/07/2010

06-07-10 - Unicode CMD Code Page Checkup

I wrote a while ago about the horrible fuckedness of Unicode support on Windows :

cbloom rants 06-21-08 - 3
cbloom rants 06-15-08 - 2
cbloom rants 06-14-08 - 3

Part of the nastiness was that in Win32, command line apps get args in OEM code page, but most Windows APIs expect files in ANSI code page. (see my pages above - simply doing OEM to ANSI conversions is not the correct way to fix that) (also SetFileApisToOEM is a very bad idea, don't do that).

Here is what I have figured out on Win64 so far :

1. CMD is still using 8 bit characters. Technically they will tell you that CMD is a "true unicode console". Eh, sort of. It uses whatever the current code page is set to. Many of those code pages are destructive - they do not preserve the full unicode name. This is what causes the problem I have talked about before : many different unicode names can show up as the same file in CMD.

2. "chcp 65001" changes you code page to UTF-8 which is a non-destructive code page. This only works with some fonts that support it (I believe Lucida Concole works if you like that). The raster fonts do not support it.

3. printf() with unicode in the MSVC clib appears to just be written wrong; it does not do the correct code page translation. Even wprintf() passed in correct unicode file names does not do the code page translation correctly. It appears to me they are always converting to ANSI code page. On the other hand, WriteConsoleW does appear to be doing the code page translation correctly. (as usual the pedantic rule lawyers will spout a bunch of bullshit about the fact that printf is just fine the way it is and it doesn't do translations and just passes through binary; not true! if I give it 16 bit chars and it outputs 8 bit chars, clearly it is doing translation and it should let me control how!)


expected : printf with wide strings (unicode) would do translation to the console code page 
(as selected with chcp) so that  characters show up right.
(this should probably be done in the stdout -> console pipe)

observed : does not translate to CMD code page, appears to always use A code page even with
the console is set to another code page

4. CMD has a /U switch to enable unicode. This is not what you think, all it does is make the output of built-in commands unicode. Note that your command line apps might be piped to a unicode text file. To handle this correctly in your own console app, you need to detect that you are being piped to unicode and do unicode output instead of converting to console CP. Ugly ugly.

5. CMD display is still OEM code page by default. In the old days that was almost never changed, but nowadays more people are in fact changing it. To be polite, your app should use GetConsoleOutputCP() , you should *NOT* use SetConsoleOutputCP from a normal command line app because the user's font choice might not support the code page you want.

6. CMD argc/argv argument encoding is still in the console code page (not unicode). That is, if you run a command line app from CMD and auto-complete to select a file with a unicode name, you are passed the code page encoding of that unicode name. (eg. it will be bytewise identical to if you did FindFirstW and then did UnicodeToOem). This means GetCommandLineW() is still useless for command line apps - you cannot get back to the original unicode version of the command line string. It is possible for you to get started with unicode args (eg. if somebody launched you via CreateProcessW), in which case GetCommandLineW actually is useful, but that is so rare it's not really worth worrying about.


expected : GetCommandLineW (or some other method) would give you the original full unicode arguments
(in all cases)

observed : arguments are only available in CMD code page

7. If I then call system() from my app with the CMD code page name, it fails. If I find the Unicode original and convert it to Ansi, it is found. It appears that system() uses the ANSI code page (like other 8-bit file apis). ( system() just winds up being CreateProcess ). This means that if you just take your own command line that called you and do the same thing again with system() - it might fail. There appears to be no way to take a command line that works in CMD and run it from your app. _wsystem() seems to behave well, so that might be the cleanest way to proceed (presuming you are already doing the work of promoting your CMD code page arguments to proper unicode).


repro : write an app that takes your own full argc/argv array, and spawn a process with those same args 
(use an exe name that involved troublesome characters)

expected : same app runs again

observed : if A and Console code pages differ, your own app may not be found

8. Copy-pasting from CMD consoles seems to be hopeless. You cannot copy a chunk of unicode text from Word or something and paste it into a console and have it work (you would expect it to translate the unicode into the console's code page, but no). eg. you can't copy a unicode file name in explorer and paste it into a console. My oh my.


repro : copy-paste a file name from (unicode) Explorer into CMD

expected : unicode is translated into current CMD code page and thus usable for command line arguments

observed : does not work

9. "dir" seems to cheat. It displays chars that aren't in the OEM code page; I think they must be changing the code page to something else to list the files then changing it back (their output seems to be the same as mine in UTF8 code page). This is sort of okay, but also sort of fucked when you consider problem #8 : because of this dir can show file names which will then not be found if you copy-paste them to your command line!


repro : dir a file with strange characters in it.  copy-paste the text output from dir and type 
"dir <paste>" on the command line

expected : file is found by dir of copy-paste

observed : depending on code page, the file is not found

So far as I can tell there is no API to tell you the code page that your argc/argv was in. That's a pretty insane omission. (hmm, it might be GetConsoleCP , though I'm not sure about that). (I'm a little unclear about when exactly GetConsoleCP and GetConsoleOutputCP can not be the same; I think the answer is they are only different if output is piped to a file).

I haven't tracked down all the quirks yet, but at the moment my recommendation for best practices for command line apps goes like this :

1. Use GetConsoleCP() to find the input CP. Take your argc/argv and match any file arguments using FindFirstW to get the unicode original names. (I strongly advise using cblib/FileUtil for this as it's the only code I've ever seen that's even close to being correct). For arguments that aren't files, convert from the console code page to wchar.

2. Work internally with wchar. Use the W versions of the Win32 File APIs (not A versions). Use the _w versions of clib FILE APIs.

3. To printf, either just write your own printf that uses WriteConsoleW internally, or convert wide char strings to GetConsoleOutputCP() before calling into printf.
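
A sketch of #3, using GetConsoleMode to detect the pipe case (real code would size the conversion buffer properly and handle the unicode-pipe case from point 4 above) :

#include <windows.h>
#include <wchar.h>

void ConsolePrintW(const wchar_t * str)
{
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode, written;
    int len = (int) wcslen(str);
    if ( GetConsoleMode(h, &mode) )
    {
        // real console : WriteConsoleW does the code page translation correctly
        WriteConsoleW(h, str, len, &written, NULL);
    }
    else
    {
        // piped : convert to the console output CP and write plain bytes
        char buf[4096]; // sketch only
        int n = WideCharToMultiByte(GetConsoleOutputCP(), 0, str, len, buf, sizeof(buf), NULL, NULL);
        WriteFile(h, buf, n, &written, NULL);
    }
}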

For more information :

Console unicode output - Junfeng Zhang's Windows Programming Notes - Site Home - MSDN Blogs
windows cmd pipe not unicode even with U switch - Stack Overflow
Unicode output on Windows command line - Stack Overflow
INFO SetConsoleOutputCP Only Effective with Unicode Fonts
GetConsoleOutputCP Function (Windows)
BdP Win32ConsoleANSI

Addendum : I've updated cblib and chsh with new versions of everything that should now do all this at least semi-correctly.


BTW a semi-related rant :

WTF are you people who define these APIs not actually programmers? Why the fuck is it called "wcslen" and not "wstrlen" ? And how about just fucking calling it strlen and using the C++ overload capability? Here are some sane ways to do things :


typedef wchar_t wchar; // it's not fucking char_t

// yes int, not size_t mutha fucka
int wstrlen(const wchar * str) { return (int) wcslen(str); }
int strlen(const wchar * str) { return wstrlen(str); }

// wsizeof replaces sizeof for wchar arrays
#define wsizeof(str)    (sizeof(str)/sizeof(wchar))
// best to just always use countof for strings instead of sizeof
#define countof(str)	(sizeof(str)/sizeof(str[0]))

fucking be reasonable to your users. You ask too much and make me do stupid busy work with the pointless difference in names between the non-w and the w string function names.

Also the fact that wprintf exists and yet is horribly fucked is pretty awful. It's one of those cases where I'm tempted to do a


#define wprintf wprintf_does_not_do_what_you_want

kind of thing.

06-07-10 - Exceptions

Small note about exceptions for those who are not aware :

x64 totally changes the way exceptions on the PC are implemented. There are two major paradigms for exceptions :

1. Push/pop tracking and unwinding. Each time you enter a function some unwind info is pushed, and when you leave it is popped. When an exception fires it walks the current unwind stack, calling destructors and looking for handlers. This makes executables a lot larger because it adds lots of push/pop instructions, and it also adds a speed hit everywhere even when exceptions are never thrown because it still has to push/pop all the unwind info. MSVC/Win32 has done it this way. ("code driven")

2. Table lookup unwinding. A table is made at compile time of how to unwind each function. When an exception is thrown, the instruction pointer is used to find the unwind info in the table. This is obviously much slower at throw time, but much faster when not throwing, and also doesn't bloat the code size (if you don't count the table, which doesn't normally get loaded at all). I know Gcc has done this for a while (at least the SN Systems version that I used a while ago). ("data driven")

All x64 code uses method #2 now. This is facilitated by the new calling convention. Part of that is that the "frame pointer omission" optimization is forbidden - you must always push the original stack and return address, because that is what is walked to unwind the call sequence when a throw occurs. If you write your own ASM you have to provide prolog/epilog unwind info, which goes into the unwind table to tell it how to unwind your stack.

This has been rough and perhaps slightly wrong. Read here for more :

X64 Unwind Information - FreiK's WebLog - Site Home - MSDN Blogs
Uninformed - vol 4 article 1
Introduction to x64 debugging, part 3 - Nynaeve
Introduction to x64 debugging, part 2 - Nynaeve
Exception Handling (x64)

BTW many people (Pierre, Jeff, etc.) think the overhead of Win32's "code driven" style of exception handling is excessive. I agree to some extent, however, I don't think it's terribly inefficient. That is, it's close to the necessary amount of code and time that you have to commit if you actually want to catch all errors and be able to handle them. The way we write more efficient code is simply by not catching all errors (or not being able to continue or shut down cleanly from them). That is we decide that certain types of errors will just crash the app. What I'm saying is if you actually wrote error checkers and recovery for every place you could have a failure, it would be comparable to the overhead of exception handling. (ADDENDUM : nah this is just nonsense)

6/02/2010

06-02-10 - Some random Win64 junk

To build x64 on the command line, use

vcvarsall.bat amd64

Oddly the x64 compiler is still in "c:\program files (x86)" , not "c:\program files" where it's supposed to be. Also there's this "x86_amd64" thing, I was like "WTF is that?" , it's the cross-compiler to build x64 apps from x86 machines.

amd64 is a synonym for x64. IA64 is Itanium.

The actual "CL" command line to build x64 vs x86 can stay unchanged I think (not sure about this yet). For example you still link shell32.lib , and you still define "WIN32".

Another nasty case where 64 bit bytes your ass is in the old printf non-typed varargs problem. If you use cblib/safeprintf this shouldn't be too bad. Fucking printf formatting for 64 bit ints is not standardized. One nice thing on MSVC is you can use "%Id" for size_t sized things, so it switches between 32 and 64. For real 64 bit use "%I64d" (on MSVC anyway).
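
For example (MSVC-only formats) :

#include <stdio.h>
#include <string.h>

int main()
{
    size_t s = strlen("hello");      // pointer-sized : 32 or 64 bits depending on build
    __int64 big = (__int64)1 << 40;  // explicitly 64 bit (MSVC type)

    printf("%Iu\n", s);      // "I" = pointer-sized ("%Id" for the signed version)
    printf("%I64d\n", big);  // "I64" = explicitly 64 bit
    return 0;
}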

size_t really fucking annoys me because it's unsigned. When I do a strlen() I almost always want that to go into a signed int because I'm moving counters around that might go negative. So I either have to cast or I get tons of warnings. Really fucking annoying. So I use my own strlen32 and anyone who thinks I will call strlen on a string greater than 2 GB is smoking some crack.
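
strlen32 is roughly this, with an assert to back up the crack-smoking claim :

#include <string.h>
#include <assert.h>

static inline int strlen32(const char * str)
{
    size_t len = strlen(str);
    assert( len <= 0x7FFFFFFF ); // if you strlen a > 2 GB string you get what you deserve
    return (int) len;
}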

Here are the MSC vers :


MSVC++ 9.0  _MSC_VER = 1500
MSVC++ 8.0  _MSC_VER = 1400
MSVC++ 7.1  _MSC_VER = 1310
MSVC++ 7.0  _MSC_VER = 1300
MSVC++ 6.0  _MSC_VER = 1200
MSVC++ 5.0  _MSC_VER = 1100

Some important revs : "volatile" changes meaning in ver 1400 ; most of the intrinsics show up in 1300 but more show up in 1400. Template standard compliance changes dramatically in 1400.

06-02-10 - Directives in the right place

I'm annoyed with "volatile" and "restrict" and all that shit because I believe they put the directive in the wrong place - on the variable declaration - when really what you are trying to do is modify certain *actions* , so the qualifier should be on the *actions* not the variables.

eg. if I write some matrix multiply routine and I know I have made it memory-alias safe, I want to say "restrict" on the *code* chunk, not on certain variables.

Same thing with "volatile" in the Java/C++0x/MSVC>=2005 sense, it's really a command for the load or the store, eg. "this store must actually be written to memory and must be ordered against other stores" , not a descriptor of a variable.

06-02-10 - 64 bit transition

The broke ass 64/32 transition is pretty annoying. Some things that MS could/should have done to make it easier :

Don't sell Win7-32 for desktops. If you go to most PC makers (Dell, Lenovo, etc.) and spec a machine, the default OS will be Win7 - 32 bit. I'm sure lots of people don't think about that and just hit okay. That fucking sucks, they should be encouraging everyone to standardize to 64 bit.

Give me a 32+64 combined exe. This is trivial and obvious, I don't get why I don't have it. For most apps there's no need for a whole separate Program Files (x86) / Program Files install dir nonsense. Just let me make a .EXE that has the x86 and the x64 exes in it, and execute the right one. That way I can just distribute my app as a single EXE and clients get the right thing.

Let me call 32 bit DLL's from 64 bit code. As usual, people write a bunch of nonsense about why you can't do it, and it just isn't true. I can't see any real good reason why I can't call 32 bit DLL's from 64 bit code. There are a variety of solutions to the memory addressing range problem. The key thing is that the memory addresses we are passing around are *virtual* addresses, not physical, and each app can have different virtual address page tables. You could easily make the 32-bit to 64-bit thunk act like a process switch, which swaps the page tables. It's perfectly possible for an app with 32 bits of address space to access 64 bits of physical memory (of course this is exactly what 32 bit apps running on 64 bit windows do). Okay, that would be pretty ugly, but there's a very simple solution - reserve the lower 4G of virtual address space in 64 bit apps for communication with 32 bit apps. That is, make all the default calls to HeapAlloc go to > 4G , but give me a flag to alloc in the lower 4 G. Then if I want to call to a 32 bit DLL I have to pass it memory from the lower 4G space. Yes obviously you can't just build your old app and attach to a 32 bit DLL and have it just work, but it would still give me a way to access it when I need it (and there are plenty of other reasons why you wouldn't be able to just link in the 32 bit dll without thinking, eg. the sizes of many of your types have changed).

Right now if I have some old 32 bit DLL that I can't recompile for whatever reason and I need to get to it from my 64 bit app, the recommended solution is to make a little 32 bit server app that calls the DLL for me and use interprocess communication with named memory maps to give its data to the 64 bit app. That's okay and all but they easily could have packaged that into a much easier thing to do.

This code project article is actually one of the best I've seen on the transition. It reminds me of one of the weird broke-ass things I've seen : some of my favorite command line apps are still old 32 bit builds. They still work ... mostly. The problem is that they go through the fucking WOW64 directory remapping, so if you try to do anything with them in the Windows system directories you can get *seriously* fucked up. I really don't like the WOW64 directory remap thing. It only exists to fix broke-ass apps that manually look for stuff in "c:\program files" or whatever. I feel it could have been off by default and then turned on only for exe's that need it. I understand their way makes most 32 bit apps work out of the box which is what most people want, so that all makes sense and is fine, but it is nasty that my perfectly good 32 bit apps don't actually get to see the true state of the machine for no good reason.

To give an example of the most insane case : I use my own "dir" app. Before I rebuilt it, I was using the old 32 bit exe version. It literally lies to me about what files are in a given dir because it gets the WOW64 remap. That's whack. The 32 bit exes work just fine on Win64 and the only thing forcing me to rebuild a lot of these things is just to make them avoid the WOW64.