6/21/2010

06-21-10 - RRZ PNG-alike notes

Okay I've futzed around with heuristics today and it's a bit of a dead end. There are just a few cases that you cannot ever catch with a heuristic.

There is one good heuristic that works 90% of the time :


do normal filter
try compress with LZ fast and MinMatchLen = 8
try compress with LZ normal and MinMatchLen = 3

if the {fast,8} wins then the image is almost certainly a "natural" image. Natural images do very well with long min match len, smooth filters, and simple LZ matchers.

If not, then it's some kind of weird synthetic thing. At that point, pretty much all bets are off. Synthetic images have the damnable property that certain patterns repeat, so they are very sensitive to whether the LZ can find those patterns after filtering and such. But, a good start is to try the no-filter case with normal LZ, and perhaps try the Adaptive, and you can use Loco or Non-Loco depending on whether the normal filter chose loco post-filter or not.

But there are some bitches. kodak_12 is a natural image, and we detect that right. The problem is the best mode {N,N+L,A,A+L} changes when you optimize the LZ parameters, and it changes by quite a lot actually. Many modes will show N+L or A+L winning by quite a lot, but the right mode is N and it wins by 10k.

ryg_t.train.03.bmp is the worst. It is a solid 10% better with the "Normal" mode, but this only shows up when you do the LZ optimal parse; at any other setting of LZ all the modes are very close, but for some reason there are some magic patterns that only occur in Normal mode which are very good for the optimal parse - all the other modes stay about the same size when you turn LZ up to optimal, but Normal filter gets way smaller.

Okay, some actually useful notes :

There are some notes on the PNG web pages that say the best way to choose the filter per row is with sum of abs. Oh yeah? I can beat it. First of all, doing sum of abs but adding a small penalty for non-zero helps a tiny bit. But the best thing is to do entropy per row, and add a penalty for non-zero. You're welcome.

The filters N and (N+W)/2 are almost never best as whole-image filters, but are actually helpful in the adaptive filter loop.

I reduced my filter set down to only 5 and it hurt very little. Having the extra filters is basically free in terms of the format, but is a pain in the ass to maintain if you need to write optimized SIMD decoders for every filter on every platform. So for my own sanity, a minimum set of filters is preferable.

BTW I should note that the fact that you have to tune minMatchLen and lzLevel is an indicator of the limitations of the optimal parse. If the optimal parse really found the best LZ stream, you should just run Optimal and let it pick what matches it wants. This is an example of it finding a local minimum which is not the global minimum. The problem is that the Huffman codes are severely different if you run with MML = 3 or 5 for example. Maybe there's a way around this problem; it requires more thought.

6/20/2010

06-20-10 - Windows 7 Niggles

1. When you edit a file name in explorer it by default doesn't include the extension. I constantly do "F2 , ctrl-C" to grab a file name, and then find later that I have the fucking file name without extension. .URL files are the worst, they for some reason just absolutely refuse to show me their extension. God fucking dammit, LEAVE SHIT ALONE.

2. Fucking folders can't all be set to Details as far as I can figure out. Yes yes yes I know you can do "set all folders to look like this one". But that only applies to folders that you have ever visited at least once. When you go to some random folder which you have never visited before - BOOM you're looking at fucking icons again. I despise icons. There must be a way to set the default options for folders, but I can't find it on web searches.

3. Fucking backspace not going up dir in Explorer really chaps my hide. I need to fix that. I guess I'll do that after I write my own AllSnap replacement (Grrr).

06-20-10 - Struct annoyance

It's a very annoying fact of coding life that references through a struct are much slower than loading the variable out to temps. (The compiler generally can't prove that the stores through the struct don't alias the struct itself, so it has to reload the member pointers every iteration.) So for example this is slow :

void rrMTF_Init(rrMTFData * pData)
{
    for (int i = 0; i < 256;  i++)
    {
        pData->m_MTFimage[i] = pData->m_MTFmap[i] = (U8) i;
    }
}

and you have to do this by hand :

void rrMTF_Init(rrMTFData * pData)
{
    U8 * pMTFimage = pData->m_MTFimage;
    U8 * pMTFMap = pData->m_MTFmap;
    for (int i = 0; i < 256;  i++)
    {
        pMTFimage[i] = pMTFmap[i] = (U8) i;
    }
}

Unfortunately there are plenty of cases where this is actually a significant big deal.

06-20-10 - Some notes on PNG

Before we go further, let's have a look at PNG for reference.

Base PNG by default does :


Filter 5 = "adaptive" - choose from [0..4] per row using minimum sum of abs

Zlib strategy "FILTER" , which just means minMatchLength = 4

So the pngcrush guys did some clever things. Basically the idea is to try all possible ways to write a PNG and see which is smallest. The things you can play with are :


Filter 0..5

Zlib Strategy ; they have a few hard-coded but really it comes down to min match length

Zlib compression level (oddly highest level is not always best)

PNG scan options (progressive, interleaves, bottom up vs top down, etc.)

It's well known that on weird synthetic images "no filter" beats any filter. The only way you can detect that is actually by trying an LZ compress, you cannot tell from statistics.

The clever thing in pngcrush is that they don't search that whole space, but still usually find the optimal (or close to optimal) settings. The way they do it is with a heuristic guided search; they identify things that they have to always test (the 5 default strategies they try) with LZ, then depending on which of those is best they try a few others, and then maybe a few more, then you're done. It's like based on which branch of the search space you walk off initially they know from testing where the optimum likely is.

"loco" here is pngcrush with the LOCO color space conversion (RGB -> (R-G),G,(B-G) ) from JPEG-LS. This is the only lossless color conversion you can do that is not range expanding (eg. stays in bytes) (* correction : not quite true, see comments, of course any lifting style transform can be non-range-expanding using modulo arithmetic; it does appear to be the only *useful* byte-to-byte color conversion though). (BTW LOCO is not allowed in compliant PNG, but it's such a big win that it's unfair to them not to pretend that PNG can do LOCO for purposes of this comparison).

name png pngcrush loco advpng crush+adv best
ryg_t.yello.01.bmp 421321 412303 386437 373573 373573 373573
ryg_t.train.03.bmp 47438 37900 36003 34260 34260 34260
ryg_t.sewers.01.bmp 452540 451880 429031 452540 451880 429031
ryg_t.font.01.bmp 44955 37857 29001 22514 22514 22514
ryg_t.envi.colored03.bmp 113368 97203 102343 113368 97203 97203
ryg_t.envi.colored02.bmp 63241 55036 65334 63241 55036 55036
ryg_t.concrete.cracked.01.bmp 378383 377831 309126 378383 377831 309126
ryg_t.bricks.05.bmp 506528 486679 375964 478709 478709 375964
ryg_t.bricks.02.bmp 554511 553719 465099 554511 553719 465099
ryg_t.aircondition.01.bmp 29960 29960 23398 20320 20320 20320
ryg_t.2d.pn02.bmp 29443 26025 27156 24750 24750 24750
kodak_24.bmp 705730 704710 572591 705730 704710 572591
kodak_23.bmp 557596 556804 483865 557596 556804 483865
kodak_22.bmp 701584 700576 580566 701584 700576 580566
kodak_21.bmp 680262 650956 547829 646806 646806 547829
kodak_20.bmp 505528 504796 439993 500885 500885 439993
kodak_19.bmp 671356 670396 545636 671356 670396 545636
kodak_18.bmp 780454 779326 631000 780454 779326 631000
kodak_17.bmp 624331 623431 510131 615723 615723 510131
kodak_16.bmp 573671 541748 481190 517978 517978 481190
kodak_15.bmp 612134 611258 516741 612134 611258 516741
kodak_14.bmp 739487 703036 590108 739487 703036 590108
kodak_13.bmp 890577 859429 688072 866745 859429 688072
kodak_12.bmp 569219 533864 477151 535591 533864 477151
kodak_11.bmp 635794 634882 519918 635794 634882 519918
kodak_10.bmp 593590 592738 500082 593590 592738 500082
kodak_09.bmp 583329 582489 493958 558418 558418 493958
kodak_08.bmp 787619 786491 611451 787619 786491 611451
kodak_07.bmp 566085 565281 486421 566085 565281 486421
kodak_06.bmp 667888 631478 540442 644928 631478 540442
kodak_05.bmp 807702 806538 638875 807702 806538 638875
kodak_04.bmp 637768 636856 532209 637768 636856 532209
kodak_03.bmp 540788 506321 464434 514336 506321 464434
kodak_02.bmp 617879 616991 508297 596342 596342 508297
kodak_01.bmp 779475 760251 588034 706982 706982 588034
bragzone_TULIPS.bmp 680124 679152 591881 680124 679152 591881
bragzone_SERRANO.bmp 153129 107167 107759 96932 96932 96932
bragzone_SAIL.bmp 807933 806769 623437 807933 806769 623437
bragzone_PEPPERS.bmp 426419 424748 376799 426419 424748 376799
bragzone_MONARCH.bmp 614974 614098 526754 614974 614098 526754
bragzone_LENA.bmp 474968 474251 476524 474968 474251 474251
bragzone_FRYMIRE.bmp 378228 252423 253967 230055 230055 230055
bragzone_clegg.bmp 483752 483056 495956 483752 483056 483056

There's no adv+loco because advpng and advmng both refuse to work on the "loco" bastardized semi-PNG.

BTW I should note that I should eat my hat a little bit over my own "PNG sucks" post. The thing is, yes, basic PNG is easy to beat and it has a lot of mistakes in the design, but the basic idea is fine, and they did a good job on the standard pretty quickly. The thing that really seals the deal is that once you make a flexible open standard, people will step in and find ways to play with it; while base PNG is pretty bad, PNG after optimization is not bad at all.

06-20-10 - Searching my PNG-alike

Okay so stealing some ideas from pngcrush let's check out the search space. I decide I'm going to examine various filters, Loco or no loco, LZ min match len, and LZ compress "level" (level = how hard it looks for matches).

Here are the results for my PNG-alike with the modes :

0 = no filter
Loco = no filter (in loco space)
Normal = select a single DPCM filter for the whole image
N+L = Normal in LOCO space
Adaptive = per row best DPCM filter
A+L = you get it by now

The left six columns are these modes with default LZ parameters (Fast match, min match len = 4). The right six columns are the same modes with LZ parameters optimized for each mode.

name [default LZ] 0 Loco Normal N+L Adaptive A+L [optimized LZ] 0 Loco Normal N+L Adaptive A+L
ryg_t.yello.01.bmp 458255 435875 423327 427031 438946 431678 372607 359799 392963 395327 415370 401618
ryg_t.train.03.bmp 56455 55483 69635 76211 68022 67678 36399 35195 31803 40155 37582 36638
ryg_t.sewers.01.bmp 599803 610463 452287 452583 466154 452834 593935 593759 421779 420091 443786 421166
ryg_t.font.01.bmp 42239 32207 38855 38855 53070 36746 33119 26911 35383 35383 40998 33798
ryg_t.envi.colored03.bmp 297631 309347 150183 165803 142658 157046 265755 278103 102487 114923 95394 109022
ryg_t.envi.colored02.bmp 109179 112623 100687 113867 89178 98374 90039 93535 62139 68027 54662 57514
ryg_t.concrete.cracked.01.bmp 481115 407759 355575 356911 408054 361602 384795 353235 299963 301907 342810 310342
ryg_t.bricks.05.bmp 551475 485271 418907 417655 492622 418406 469063 448279 372195 370459 429066 373310
ryg_t.bricks.02.bmp 665315 632623 482367 481347 538670 483106 590319 577699 455431 455203 522158 455426
ryg_t.aircondition.01.bmp 41635 29759 26011 26011 32866 25738 29023 25103 20547 20547 23946 20522
ryg_t.2d.pn02.bmp 25075 26539 28303 29259 28046 28974 22147 22723 25915 26179 26194 25790
kodak_24.bmp 826797 771829 640137 634141 726892 633308 723681 712285 567693 558085 684060 559564
kodak_23.bmp 835449 783941 576481 569981 608476 565172 724641 712001 481577 478041 551292 479240
kodak_22.bmp 898101 879949 655461 651213 722096 651000 803429 804073 577433 571301 689664 574252
kodak_21.bmp 781077 708861 617565 629401 705608 618424 647881 633069 549865 549025 665724 545584
kodak_20.bmp 609705 561957 495509 500161 537692 494484 501293 490849 434865 431745 506592 429556
kodak_19.bmp 822045 733053 630793 624897 697020 619064 673953 658345 550669 541845 660444 541424
kodak_18.bmp 941565 912081 691161 692425 789736 693804 850705 848353 618961 619949 764004 622628
kodak_17.bmp 758089 678233 597225 592941 660016 590292 617097 606045 507169 504961 610092 508672
kodak_16.bmp 650557 587537 543001 543001 607916 545244 522829 511833 466277 466277 536280 468136
kodak_15.bmp 759109 697257 595353 590321 648656 586324 643385 628481 511193 504213 593304 506728
kodak_14.bmp 891629 793553 661505 657357 745816 649928 749085 731569 584645 580301 707596 581520
kodak_13.bmp 997161 891637 729425 730901 878212 729580 853557 829057 677041 678613 802224 680196
kodak_12.bmp 672361 602825 545921 562749 606004 549292 539305 526793 465297 472693 539532 467088
kodak_11.bmp 758197 691697 604869 597125 665364 587264 639537 625145 523649 512685 608388 510200
kodak_10.bmp 747121 681213 592637 589561 635972 578888 625961 611553 504573 499753 576284 497400
kodak_09.bmp 688365 629429 587233 583245 627916 571652 565449 562329 501377 495637 567772 491896
kodak_08.bmp 1001257 882153 686961 684177 792916 684056 860269 825757 615657 610505 766620 610524
kodak_07.bmp 709829 673917 568649 563561 616820 561636 605177 600157 480705 473233 551136 473500
kodak_06.bmp 779709 694229 600145 600145 687824 601564 642525 626449 534037 534037 637180 534804
kodak_05.bmp 962357 873793 700581 695257 810688 694828 845905 823025 633581 623341 783400 624368
kodak_04.bmp 813869 729865 613241 606849 672176 607280 677345 660057 531533 522061 622904 528064
kodak_03.bmp 639809 586873 522049 542681 581880 527856 519549 510309 437765 452965 511900 443948
kodak_02.bmp 729069 649709 603853 591781 654408 585916 598913 584941 515437 502941 602364 500964
kodak_01.bmp 872669 747333 661481 655597 772808 653388 699945 682005 593001 582389 689436 586328
bragzone_TULIPS.bmp 1032589 1021309 646905 652213 701508 652128 966881 969913 565997 571377 662504 570796
bragzone_SERRANO.bmp 150142 151706 169982 169302 173229 175217 103462 103566 139306 139074 142609 143457
bragzone_SAIL.bmp 983473 941993 686013 686117 795152 686204 892301 887097 613845 609953 762420 610008
bragzone_PEPPERS.bmp 694247 679795 424603 423679 451006 424262 655987 650771 369291 366611 416426 368106
bragzone_MONARCH.bmp 887917 868533 600373 598985 654864 598976 810325 805725 507937 508085 598348 508096
bragzone_LENA.bmp 737803 733251 493703 498299 498274 502150 710215 704763 467103 475179 471586 477638
bragzone_FRYMIRE.bmp 342667 344807 419859 420811 420506 419026 241899 242355 335063 336859 335990 335894
bragzone_clegg.bmp 760493 799717 525077 541329 523580 536244 557529 571413 445897 468265 444736 465376

One important note :

The "Normal" filters include the option to do a post-filter Loco conversion. This is different than the "loco space" option in the modes above. Let me elaborate. "Loco" in the modes listed above means transform the image into Loco colorspace, and then proceed with filtering and LZ. Loco built into a filter mode means, on each pixel do the filter delta, then do loco conversion. This can be integrated directly into the DPCM pixel delta code, so it's just considered a filter type. In particular, in "loco space", then the N,W,NW neighbors are already in loco colorspace. When loco is part of the filter, the neighbors are in RGB space and the delta is converted after the fact. If everything was linear, these would be equivalent.

Okay, what do we see?

It's very surprising to me how much LZ optimization helps. In particular it surprises me that making the LZ search *worse* (by turning down compression "level") helps a lot; as does increasing match len; on natural images a min match len around 8 or 10 is usually best. (or even more, I forbid a min match len > 10 because it starts to hurt decode speed).

Well, we were hoping that we could pick the mode based on the default LZ parameters, and then just optimize the LZ parameters for that one mode. It is often the case that the best mode after optimization is the same as the best mode before optimization, but not always. When it is not the case, it is usually a small mistake. However, there is one case where it's a very bad mistake - on ryg_t.yello.01.bmp you would produce output of 393k instead of 360k.

Natural images are the easiest; for them you can pretty much just pick A+L (or N+L) and you're very close to the best even when you didn't pick the best. Synthetic images are harder; they are very sensitive to the exact mode.

We can also say that no filter + loco is almost always wrong, except for that same annoying one case. Unfortunately I don't see any heuristic that can detect when 0+loco needs to be checked. Similarly for adaptive + noloco.

Obviously there's a fundamental problem when the initial sizes are very close together, you can't really differentiate between the modes. When the sizes are very far apart then it is a reliable guess.

Let me briefly note things I could be searching that I'm not :

Rearranging pixels in various ways, eg. RGBRGB -> RRRGGGBB , or to whole planes; interleaving lines, different scan orders, etc. etc.

LSQR fits for predictors. This doesn't hurt decode speed a ton so it would fit in my design spec, I'm just getting sick of wasting my time on this so I'm not bothering with it.

Predictors on different regions instead of per row. eg. a predictor type per 16x16 tile or something.

Anything that hurts decode speed, eg. bigger predictors, adaptive predictors, non-LZ coder, etc.

Oh I'm also not looking at any pixel format conversions; I assume the client has put it in the pixel format they want and won't change it. Obviously some of the PNG optimizers can win by palettizing when not many colors are actually used, and of course there are lots of other pixel formats that might help, blah blah.

Oh while I'm at it, I should also note that my LZ is actually kind of crippled for this comparison. I divide the data stream into 256k chunks and compress them completely independently (no LZ window across the chunk boundary). This lets me seek on compressed data and decompress portions of it independently, but it is quite a bad penalty.

06-20-10 - PNG Comparo

Okay this is kind of bogus and I thought about not even posting it, but come on, you need the closure right? Everyone likes a comparo. First of all, why this is bogus : 1. PNG just cannot compete without better LOCO support. Here I am allowing the LOCO files, but they were not advpng/pngout optimized, and of course they're not really PNGs. 2. I have that crippling 256k chunking on my format. I guess if I wanted to do a fair comparo I should make a version of my shit which doesn't have LOCO and also removes my 256k chunking and compare that vs. no-loco PNG. God dammit now I have to do that.

Whatever, here you go :

RRZ heuristic = guided search to try to find the best set of options
RRZ best = actual best options for my thingy
png best = best of advpng/crush/loco

name RRZ heuristic RRZ best png best
ryg_t.yello.01.bmp 392963 359799 373573
ryg_t.train.03.bmp 35195 31803 34260
ryg_t.sewers.01.bmp 421779 420091 429031
ryg_t.font.01.bmp 26911 26911 22514
ryg_t.envi.colored03.bmp 95394 95394 97203
ryg_t.envi.colored02.bmp 54662 54662 55036
ryg_t.concrete.cracked.01.bmp 299963 299963 309126
ryg_t.bricks.05.bmp 370459 370459 375964
ryg_t.bricks.02.bmp 455203 455203 465099
ryg_t.aircondition.01.bmp 20522 20522 20320
ryg_t.2d.pn02.bmp 22147 22147 24750
kodak_24.bmp 559564 558085 572591
kodak_23.bmp 479240 478041 483865
kodak_22.bmp 574252 571301 580566
kodak_21.bmp 549865 545584 547829
kodak_20.bmp 429556 429556 439993
kodak_19.bmp 541424 541424 545636
kodak_18.bmp 618961 618961 631000
kodak_17.bmp 508672 504961 510131
kodak_16.bmp 466277 466277 481190
kodak_15.bmp 506728 504213 516741
kodak_14.bmp 581520 580301 590108
kodak_13.bmp 677041 677041 688072
kodak_12.bmp 465297 465297 477151
kodak_11.bmp 510200 510200 519918
kodak_10.bmp 497400 497400 500082
kodak_09.bmp 491896 491896 493958
kodak_08.bmp 610524 610505 611451
kodak_07.bmp 473500 473233 486421
kodak_06.bmp 534037 534037 540442
kodak_05.bmp 624368 623341 638875
kodak_04.bmp 522061 522061 532209
kodak_03.bmp 437765 437765 464434
kodak_02.bmp 500964 500964 508297
kodak_01.bmp 586328 582389 588034
bragzone_TULIPS.bmp 565997 565997 591881
bragzone_SERRANO.bmp 103462 103462 96932
bragzone_SAIL.bmp 613845 609953 623437
bragzone_PEPPERS.bmp 366611 366611 376799
bragzone_MONARCH.bmp 508096 507937 526754
bragzone_LENA.bmp 467103 467103 474251
bragzone_FRYMIRE.bmp 241899 241899 230055
bragzone_clegg.bmp 444736 444736 483056

PNG wins by a little bit on FRYMIRE , SERRANO , ryg_t.aircondition.01.bmp , ryg_t.font.01.bmp . I'm going to pretend that I don't know that because that's what sent me down this god damn pointless rabbit hole in the first place, I discovered that PNG beat me on a few files so I had to find out why and fix myself.

Anyway, something that would be more productive would be to write a fast PNG decoder. All the PNG decoders out there in the world are woefully slow. Let me tell you all how to write a fast PNG decoder :

1. First make sure your Zip decoder is fast. The standard ones are okay, but they do too much checking for end of buffer and "do you have enough bits" blah blah. The correct way to do that is to allocate your decompression buffers 64k aligned, and put some NO_ACCESS pages on each end. Then just let your Zip decoder run. Make sure it will never crash on bad input - it will just make bad output (this is relatively easy to do and doesn't require explicit checks, just careful coding to make sure all compressed bit streams decode to something). (A sketch of the guard-page allocation follows this list.)

2. The un-filtering for PNG needs to be unrolled for the exact data type and filter. You can do this in C very neatly using template loop inversion which I wrote about previously. For maximum speed however you really should do the un-filter with SIMD. It's a very nice easy case for SIMD, except for the god fucking awful pointless abortion that is the Paeth filter. (Also sketched after this list.)

3. Un-filtering and LZ decompress should be interleaved for cache hotness. You decode a bit, un-filter a bit, then stream out data in the final pixel format into client-ready packed plane. The zip window is only 32k and you only need one previous row to filter, so your whole set of read-write data should be less than 64k, and the data you stream out should be written to a separate buffer with NTQ write-combining style writes. Ideally your stream out supports enough pixel formats that it can write directly to whatever the client needs (X8R8G8B8 for D3D or whatever) so that memory doesn't have to be touched again. Because the output buffer is only written write combined you can decode directly into locked textures.
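For point 1, the guard-page allocation on Windows looks something like this (a sketch ; error handling and the matching free are omitted) :

#include <windows.h>

// reserve guard + buffer + guard, commit only the middle ; an overrun on
// corrupt input faults on a NO_ACCESS page instead of needing per-symbol
// bounds checks in the hot loop
static void * AllocWithGuardPages(size_t size)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    size_t page = si.dwPageSize;
    size_t padded = (size + page - 1) & ~(page - 1);

    unsigned char * base = (unsigned char *) VirtualAlloc(
        NULL, padded + 2*page, MEM_RESERVE, PAGE_NOACCESS);
    if ( ! base ) return 0;

    VirtualAlloc(base + page, padded, MEM_COMMIT, PAGE_READWRITE);

    // free with VirtualFree(ptr - page, 0, MEM_RELEASE)
    return base + page;
}

And for point 2, "template loop inversion" just means instantiating the row loop once per filter, so the filter branch happens per row instead of per pixel ; a sketch with only three of the five PNG filters :

typedef unsigned char U8;

enum EFilter { eNone, eSub, eUp };

template <EFilter t_filter>
static inline U8 Unfilter1(U8 raw, U8 west, U8 north)
{
    // this branch is resolved at compile time :
    if ( t_filter == eSub ) return (U8)(raw + west);
    if ( t_filter == eUp  ) return (U8)(raw + north);
    return raw;
}

template <EFilter t_filter>
static void UnfilterRow(U8 * row, const U8 * prev, int rowLen, int bpp)
{
    for (int x = 0; x < rowLen; x++)
    {
        U8 west = (x >= bpp) ? row[x - bpp] : 0;
        row[x] = Unfilter1<t_filter>(row[x], west, prev[x]);
    }
}

// one branch per row, not per pixel :
static void UnfilterRowDispatch(int filter, U8 * row, const U8 * prev, int rowLen, int bpp)
{
    switch ( filter )
    {
    case 0: UnfilterRow<eNone>(row,prev,rowLen,bpp); break;
    case 1: UnfilterRow<eSub >(row,prev,rowLen,bpp); break;
    case 2: UnfilterRow<eUp  >(row,prev,rowLen,bpp); break;
    // Average and Paeth go the same way
    }
}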

My guess is that this should be in the neighborhood of 80 MB/sec.

06-20-10 - KZip

While futzing around on PNG I discovered that there is this whole community of people who are hard-core optimizing PNG's for distribution sizes. They have these crazy tools like pngcrush/optipng/advancecomp/pngout.

AdvPNG is just the Zip encoder from 7zip (Igor Pavlov). It's a semi-optimal parser using a forward-pass multi-lookahead heuristic like Quantum. Some day I need to try to read the source code to figure out what he's doing exactly, but god damnit the code is ugly and it's not documented. Dammit Igor.

PNGOUT is a lossless "deflate" (zip) optimizer. The engine is the same as KZip by Ken Silverman. Unfortunately Ken has not documented his algorithm at all anywhere. Come on Ken! Nobody is gonna pay you for KZip! Just write about it!

Anyway, my guess from reading his brief description and looking at what it does is : KZip has some type of optimal parser. Who knows what kind of optimal parser; my guess, knowing a bit about how Ken codes, is that it is probably not an actual Storer-Szymanski optimal parser, but rather some kind of heuristic, perhaps like 7zip/LZMA. KZip also clearly has some kind of Huffman split point optimizer (similar to what I just did). Again just guessing from the command line options, it looks like his Huffman splitter is single pass and is just based on a heuristic that detects changes in the statistics and decides to put a split there. Hmm, I wish I'd found this months ago.

KZip appears to be the best zip optimizer in existence. Despite claims of being crazy slow I actually think it's quite fast by my standards. No surprise KZip is a lot smaller and faster than my optimal parser, but I do make smaller files. Ken ! Set your algorithm free! (ADDENDUM : whoops, that was only on PNG data; for some reason it's pretty fast on image data, but it's slow as balls in some other cases, not sure what's going on there; 7zip is a lot faster than kzip and the file sizes are usually very close (once in a while kzip does significantly better)).

For a while I've been curious to try my RAD LZ optimizer on a Zip token stream. It would be a nice sanity check to test it against KZip, but I'm not motivated enough to take the pain of figuring out how to write Zip tokens.

06-20-10 - Filters for PNG-alike

The problem :

Find a DPCM pixel prediction filter which uses only N,W,NW and does not range-expand (eg. ubytes stay in ubytes). (eg. like PNG).

We certainly could use a larger neighborhood, we could use adaptive predictors that evaluate the neighborhood for edges/etc., we would wind up with GAP from CALIC or something newer. We want to keep it simple so we can have a very fast decoder.

These filters :


// assumed helpers : RR_CLAMP_255(x) clamps to [0,255] ; RR_MIN3/RR_MAX3
// are three-way min/max ; Median3 is the median of three values ;
// RR_ABS is absolute value ; rr::clamp(x,lo,hi) clamps x to [lo,hi]

static int PredictPixel(int predictor, int N, int W, int NW)
{
    switch ( predictor )
    {
    case 0: // 0 predictor is the same as NONE
        return 0;
    
    case 1: // N
        return N;
        
    case 2: // W
        return W;
        
    case 3: // gradient // this tends to win on synthetic images
        {
        int pred = N + W - NW;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 4: // average
        return (N+W)>>1;
        
    case 5: // grad skewed towards average // this tends to win on natural images - before type 12 took over anyway
        {
        int pred = ( 3*N + 3*W - 2*NW + 1) >>2;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
    
    case 6: // grad skewed even more toward average
        {
        int pred = ( 5*N + 5*W - 2*NW + 3) >>3;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 7: // grad skewed N
        {
        int pred = (2*N + W - NW + 1)>>1;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 8: // grad skewed W
        {
        int pred = (2*W + N - NW + 1)>>1;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
    
    case 9: // new
        {
        int pred = (3*N + 2*W - NW + 1)>>2;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 10:    // new
        {
        int pred = (2*N + 3*W - NW + 1)>>2;
        pred = RR_CLAMP_255(pred);
        return pred;
        }
        
    case 11: // new
        return (N+W + 2*NW + 1)>>2;
        
    case 12: // ClampedGradPredictor
    {
        int grad = N + W - NW;
        int lo = RR_MIN3(N,W,NW);
        int hi = RR_MAX3(N,W,NW);
        return rr::clamp(grad,lo,hi);
    }   
    
    case 13: // median
        return Median3(N,W,NW);
    
    case 14:
    {
        // semi-Paeth
        // but only pick N or W 
        int grad = N + W - NW;
        // pick closer of N or W to grad
        if ( RR_ABS(grad - N) < RR_ABS(grad - W) )
            return N;
        else
            return W;
    }
    }

    return 0; // not reached ; all cases above return
}

perform like this :

name 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
ryg_t.train.03.bmp 56347 76447 70047 80619 76187 79811 77519 79771 76979 77399 76279 80383 75739 75875 74635
ryg_t.sewers.01.bmp 604843 475991 496067 468727 466363 455359 458111 458727 469203 456399 462727 496691 452663 495931 464915
ryg_t.font.01.bmp 42195 54171 53599 56135 73355 66955 70163 60199 63815 69499 70099 79983 59931 61219 56735
ryg_t.envi.colored03.bmp 316007 147315 136255 158063 139231 154551 148419 155379 149667 148683 143415 145871 148099 144851 153311
ryg_t.envi.colored02.bmp 111763 98835 87711 111367 93123 108867 100003 106275 95927 96391 97383 96907 99275 94371 100707
ryg_t.concrete.cracked.01.bmp 493143 416307 449899 426755 403779 402635 400923 408019 420291 398215 408299 437371 409055 440371 424087
ryg_t.bricks.05.bmp 568755 534267 524243 514563 509375 493923 497639 505267 499007 501079 497419 551687 499855 547935 511595
ryg_t.bricks.02.bmp 684515 590955 577207 557155 560391 537987 545019 551251 544555 549543 544635 602115 545919 600139 557659
ryg_t.aircondition.01.bmp 41595 33215 34279 33535 33199 32695 32683 32619 33207 32503 32987 35795 31507 35171 32243
ryg_t.2d.pn02.bmp 25419 27815 28827 29499 32203 32319 32443 29679 32531 32315 32575 34183 28307 31247 27919
ryg_gemein.bmp 797 801 173 153 565 585 553 521 205 633 529 813 153 825 141
kodak_24.bmp 843281 735561 760793 756509 744021 736037 735797 735989 753709 733105 741801 785077 719185 777905 727221
kodak_23.bmp 865985 593977 616885 617105 591501 592561 588837 595949 609437 586121 595793 620657 585453 617825 601713
kodak_22.bmp 919613 732613 756249 741441 719973 712757 711245 721365 732337 709621 719157 760065 706517 763481 724053
kodak_21.bmp 797741 753057 691653 746033 713469 716065 710321 737017 713845 720525 704129 744785 701733 744633 709445
kodak_20.bmp 624901 563261 538069 571633 540981 548329 541773 559437 548549 546565 540337 559545 539289 557105 545901
kodak_19.bmp 847781 724541 728337 732717 718309 712989 712217 716593 729505 712893 715053 752173 689953 756849 698881
kodak_18.bmp 958657 808577 820253 820429 783373 783089 777377 794965 799449 779117 782541 821717 783005 821025 802569
kodak_17.bmp 782061 664829 655845 685273 651025 657225 650477 666625 666645 653117 652293 680813 644849 675045 656597
kodak_16.bmp 664401 696493 595101 673961 646981 647441 643353 675157 640833 658721 631509 684517 628837 682217 629293
kodak_15.bmp 779289 639009 660137 677649 643985 651729 644809 650525 666229 641669 650377 673545 635441 662389 645817
kodak_14.bmp 912149 800681 742901 787273 754233 754325 749641 778777 758045 760761 744001 795029 743909 791341 755425
kodak_13.bmp 1015697 932969 898929 941109 890989 900533 890221 918073 905705 897865 888113 925757 894461 922701 907029
kodak_12.bmp 690537 641517 583717 641973 613461 617881 612461 635253 617913 620833 607177 644837 598393 638889 602213
kodak_11.bmp 773937 725081 674657 714169 697533 691425 690389 708197 695109 698369 685965 738609 671365 739385 676597
kodak_10.bmp 766701 650637 648349 665309 648997 641961 641481 651741 651785 643369 642133 685733 623437 684949 629193
kodak_09.bmp 703689 640025 628161 659637 635201 637657 633881 644193 647817 636281 634105 662125 615393 662801 621873
kodak_08.bmp 1026037 862013 888589 835329 884329 841405 859717 839193 861917 856097 865405 942073 792833 950005 792201
kodak_07.bmp 725525 654925 605753 629757 633945 622693 627053 643069 621433 634965 620645 673361 597821 666629 604337
kodak_06.bmp 796077 786789 680841 750253 731117 723657 722981 753989 714849 739405 709825 772685 703417 773613 705529
kodak_05.bmp 981245 850313 836165 845001 815801 814429 810709 832337 825073 815897 811729 852385 808993 853785 828145
kodak_04.bmp 845721 672369 679245 692469 659981 663625 657621 670629 678165 658845 661957 691945 653629 688253 670325
kodak_03.bmp 656841 605525 559081 619713 581713 596313 587633 607841 600725 592761 584449 601901 579221 592745 587409
kodak_02.bmp 761633 666601 663837 677385 648117 649785 644985 661253 662077 647553 647545 682089 640501 687305 654309
kodak_01.bmp 896361 850353 811873 828397 822169 807197 810217 827621 815245 818249 807665 869829 781173 875265 788725
bragzone_TULIPS.bmp 1052229 729141 757129 688121 703317 671417 683069 683653 696193 682109 691257 756241 675637 761477 704293
bragzone_SERRANO.bmp 150898 172286 160718 193602 255462 274994 278094 212478 254902 274478 277418 285990 171918 213214 167058
bragzone_SAIL.bmp 1004581 865725 826729 834385 811121 798037 798301 819669 807221 805785 796761 862209 795581 868857 813521
bragzone_PEPPERS.bmp 712087 469243 456511 438211 442695 428627 433227 437215 436595 436059 434359 471587 426191 474595 439575
bragzone_MONARCH.bmp 907737 670825 671745 644513 638201 623613 627513 640213 640485 630941 630941 676401 624925 679553 652849
bragzone_LENA.bmp 745299 484027 513747 506631 478875 481911 477131 481819 494431 474343 483007 498519 478559 498799 489203
bragzone_FRYMIRE.bmp 342567 432867 399963 481259 608567 664895 670755 544963 612075 666279 668187 693467 433755 456899 421263
bragzone_clegg.bmp 806117 691593 511625 518489 1208409 1105433 1167561 733217 951773 1158429 1165953 1289969 502845 1286365 505093

commentary :

The big surprise is that ClampedGradPredictor (#12) is the fucking bomb. In fact it's so good that it hides the behavior of other predictors. For example plain old Grad is never picked. Also predictor #5 (grad skewed towards average) was actually by far the best until #12 came along.

The other minor surprise is that W is actually best sometimes, and N is never best, and generally N is much worse than W. Now, it is no surprise that W is better than N - it is a well known fact that typical images have much stronger horizontal correlation than vertical, but I am surprised just how *big* the difference is.

More in the next post.

6/19/2010

06-19-10 - Verio spam filters outgoing mail

Verio, the host for cbloom.com, filters your outgoing emails through some spam filter and rejects them. You, the paying customer, have no control over this; you cannot disable it or add whitelists or anything. In its wisdom it decides that emails like this are spam :

Brownies have been placed in the kitchen.  Make them vanish, please.

After much emailing with Verio tech support I get these copy-paste responses :

----------------------------------------------------------------------
Outbound spam filters 
----------------------------------------------------------------------

We apologize for any inconvenience you might have experienced with this issue.

In response to your concern. The outgoing mails might have some contents that is not allowed by your filters.

and when I ask if there's any way to disable or get around it :

************************************************************************
Outbound spam filters 
************************************************************************

I am sorry but no. There is no way to whitelist outgoing messages, we have no support for getting around our internal, outgoing spam filters

This is obviously retarded. Either you're a spammer or you're not. Spammers should have their accounts banned and the rest of us should be allowed to send fucking email. If you are considering getting an account with Verio, don't. (their prices are also terrible by modern standards).

06-19-10 - Ranting Wussies

The disappointing thing about most ranters is that when you actually get them in person with the subject of their mockery, they turn into these polite normal boring reasonable conciliatory people who see the validity of the other side's position. You see Jon Stewart do this a lot, of course BSNYC and his ilk do it, etc. Stick to your fucking guns you woosies.

06-19-10 - NiftyP4 and Timeout

I've fiddled with the NiftyP4 code and have it almost working perfectly for my needs (thanks Jim!). One small niggle remains :

When I lose my net connection from home to work, VC will still hang longer than I'd like. This appears to be coming from the P4 command. The problem is that P4 hangs, and that Nifty waits on P4. Those issues in more detail :

P4 stalls way too long when it can't connect to the server. So far as I can tell there is no way to set this timeout variable in P4 (??). (net searching is a bit hard because Perforce does have a timeout variable, but that is for controlling how long client login sessions on the server last before they are reset). I'd like to set this timeout to like 1 second, currently it's 10 seconds or something. I basically either have a fast connection or not, the long timeout is never right.

Nifty stalls on P4. This is so that when you do something like "save all", it gets notification of the VC command and can check out any necessary files before VC tries to save them. So it runs the P4 command and waits before returning to let VC do its save.

So my hack solution for the moment is to make Nifty only wait for 500 millis for the P4 command and just return if it isn't done by then. This will then give you a "file is not writeable" popup error kind of thing and saves you from the horrible stalled out DevStudio.

BTW some notes for working on addins :

The easiest way to find the names of VC commands seems to be from Keyboard mapping. It appears that they are basically not documented at all. If you can't get it to hook up from the Keyboard mapping command name, your next best option is to trap *every* command and log its name, then go into VC and do the things you want to find the name for and see what comes out of your logs. (see links below)

Disabling an addin from the Tools->Addins menu does not actually unload it (sadly). You have to close VC and restart it to make it unload so that you can update the DLL.

The easiest way to debug an addin is just to make a .AddIn file in "C:\Users\you\Documents\Visual Studio 2005\Addins" and point it at the DLL that you build from your AddIn project. Then you can set up F5 in your AddIn project to run devenv. That way you debug the devenv session that loads your AddIn and you can set breakpoints etc.

See also :

Using EnableVSIPLogging to identify menus and commands with VS 2005 + SP1 - Dr. eX's Blog - Site Home - MSDN Blogs
Source Checkout - niftyplugins - Project Hosting on Google Code
Resources about Visual Studio .NET extensibility
Many Visual studio events missing from Events.SolutionEvents
HOWTO Getting Project and ProjectItem events from a Visual Studio .NET add-in.
HOWTO Capturing commands events from Visual Studio .NET add-ins
Binding to Visual Studio internal Commands from add-in

06-19-10 - Gmail doesn't let you send mails to yourself

I always CC myself when I write an email that I think is interesting (especially with my multiple email personalities, I often CC from one to another). I've been confounded for a while that when I use gmail, it seems to not deliver the emails to myself. Not only that, but it gives no error message or delivery failure or anything, it just silently doesn't send them.

See here for more people complaining.

06-19-10 - Amazon censors product reviews

It's hard for me to get my hackles too up about this, but it's a topic that's worth noting over and over, so I'm making myself write about it.

A while ago I decided since I'm buying all this stuff from Amazon, I should write some reviews so that other people can see what products are good and which aren't. I mainly wrote reviews for products that had zero reviews, and mainly in cases where the product was not clearly described or listed.

I thought nothing of it, until I went back and was reviewing some of my purchases and noticed that there was no review on some of the products I was pretty sure I had reviewed.

Well, it turns out that Amazon censored my review. Not only did they not publish it, but they *silently* don't publish it, they don't give you any notification or reason that it was declined, it just doesn't show up.

It took me a while emailing back and forth with support to get some answers, but it turns out most of the declines come down to a specific policy :

Amazon does not allow reviews to address errors in the listing.

Some of the reviews I wrote were about incorrect pictures or list prices. For example a picture might show a set of several brass wool scrubbers, but the shipped product is only one of them. I write a review to make it clear what you're getting. BOOM! Review does not show up.

Here are the current guidelines. In particular, the troublesome ones are :


# Details about availability, price, or alternate ordering/shipping
# Comments on other reviews visible on the page (because page visibility is subject to change without notice)

Off-topic information: 

* Feedback on the seller, your shipment experience or the packaging 

* Feedback about typos or inaccuracies in our catalog or product description

Granted, some amount of censorship is necessary or you would be flooded by spam and libel, but this goes rather too far. Not being allowed to write a review when what you get is not what they said you would get is a pretty big problem.

I can also understand them not wanting reviews that say "this is available cheaper at Walmart", but all my reviews that mentioned price were things like "listing says retail price is $100, in fact retail price is around $40".

It's also simply not true that they objectively enforce those standards; you will see plenty of positive reviews around Amazon that mention price in a positive way, things like "what a great product for only $10" - that review is allowed through, but if it's a negative review like "product is not worth $10" it won't be allowed through. All my reviews that mentioned price and were censored were negative reviews. For example, one of my reviews that was blocked was about a listing that showed a picture of a box full of cleaning pads and a list price of $40; the actual shipped product was one cleaning pad (not a box full) - review deleted.

Of course I should note that this is not unusual. Yelp of course censors reviews in roughly the same way ; both Amazon and Yelp really want you to write "stories" that attract "community" and generally make people hang out and buy product. They don't want factual information that helps customers. I had several reviews deleted from Yelp because they failed to be "personal" enough (they were short things like "this place sucks"). Yelp is also semi-corrupt, despite claims otherwise they are in fact in the pocket of sponsors, and will delete reviews or ban people who write too many negative reviews.

CNET and Chow are actually much worse. They will bald-facedly delete reviews and discussion threads that are critical of sponsors. Most of the advertising sponsored web forums, such as the 2+2 poker forums or car forums like 6speedonline are the same way as well. They will lock or delete threads that are critical of sponsors or the site administration.

In conclusion : the internet is fucked. It is now owned by corporations who censor and edit the content to create the message they want. You have to be very careful about what you read on these sites, because valuable negative information might have been deleted, and if you value your own content, you should not contribute to any one of these sites.

6/17/2010

06-17-10 - Suffix Array Neighboring Pair Match Lens

I suffix sort strings for my LZ coder. To find match lengths, I first construct the array of neighboring pair match lengths. You can then find the match length between any two indexes (i,j) as the MIN of all pair match lengths between them. I wrote about this before in the post on the LZ Optimal parse strategy: to find all matches against any given suffix, you find the position of that suffix in the sort order, then walk to its neighbors and keep doing MIN() with the pair match lengths.
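In code the query is just a min-walk (a sketch, using the same array names as the routine at the end of this post ; it requires i != j, and a range-min structure would make it O(1), the linear walk is just to show the idea) :

// RR_MIN as in the code below :
#define RR_MIN(a,b) ( (a) < (b) ? (a) : (b) )

// match length between the suffixes at positions i and j ;
// sortIndexInverse[] maps position -> sort rank, sortSameLen[k] is the
// match length between sort-order neighbors k and k+1 :
static int MatchLenBetween(int i, int j,
        const int * sortIndexInverse, const int * sortSameLen)
{
    int lo = sortIndexInverse[i];
    int hi = sortIndexInverse[j];
    if ( lo > hi ) { int t = lo; lo = hi; hi = t; }

    int len = sortSameLen[lo];
    for (int k = lo+1; k < hi; k++)
        len = RR_MIN(len, sortSameLen[k]);
    return len;
}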

Constructing the pair match lengths the naive way is O(N^2) on degenerate cases (eg. where the whole file is 0). I've had a note for a long time in my code that an O(N) solution should be possible, but I never got around to figuring it out. Anyway, I stumbled upon this :


from :

Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park

Algorithm GetHeight
input: A text A and its suffix array Pos

for i := 1 to n do
    Rank[Pos[i]] := i
od
h := 0
for i := 1 to n do
    if Rank[i] > 1 then
        k := Pos[Rank[i]-1]
        while A[i+h] = A[k+h] do
            h := h+1
        od
        Height[Rank[i]] := h
        if h > 0 then h := h-1 fi
    fi
od

The idea is obvious once you see it. You walk in order of the original character array index (not the sorted order). At index [i] you find a match of length L to its suffix-sorted neighbor. At index [i+1], then, you must be able to match to a length of at least L-1 by matching to that same neighbor stepped forward one, because stepping forward one char just removes one character from that suffix match. For example :


On "mississippi" :

At index 1 :

ississippi

matches :

issippi

with length 4.

Now step forward one.

At index 2 :

ssissippi

I know I can get a match of length 3 by matching the previous suffix "issippi" (stepped one)
so I can start my string compare at 3

And here's the C code :


static void MakeSortSameLen(int * sortSameLen, // fills out sortSameLen[0 .. sortWindowLen-2]
        const int * sortIndex,const int * sortIndexInverse,const U8 * sortWindowBase,int sortWindowLen)
{
    int n = sortWindowLen;
    const U8 * A = sortWindowBase;

    int h = 0;
    for(int i=0; i< n ;i ++)
    {
        int sort_i = sortIndexInverse[i];
        if ( sort_i > 0 )
        {
            int j = sortIndex[sort_i-1];
            int h_max = sortWindowLen - RR_MAX(i,j);
            while ( h < h_max && A[i+h] == A[j+h] )
            {
                h++;
            }
            sortSameLen[sort_i-1] = h;

            if ( h > 0 ) h--;
        }
    }
}
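Hypothetical usage glue, assuming the suffix sort and its inverse are already built :

// sortSameLen[k] = match len between sort-order neighbors k and k+1 :
int * sortSameLen = new int [ sortWindowLen - 1 ];
MakeSortSameLen(sortSameLen, sortIndex, sortIndexInverse, sortWindowBase, sortWindowLen);
// the match length between any two positions is then the MIN of
// sortSameLen over the interval between their sort ranks, as above.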

06-17-10 - Investment

I bought a Hattori knife a little while ago. It requires a lot of babying - it has some carbon steel so you have to keep it dry, the steel is quite soft so you have to make sure to only let it touch other soft things, you have to sharpen it with a whetstone every so often. It's an amazing knife, but you really have to invest a lot into it. The funny thing is that the fact that it requires so much work actually makes you more fond of it. It's like the knife needs you, it's great and that greatness comes from the work you put into it.

The Porsche is sort of the same way. I don't know if they actually do this intentionally, maybe they do if they're very clever, but the car is a bit temperamental, you have to baby it a bit, check the oil all the time, let it get up to temperature before thrashing it, etc. Of course that "quirkiness" is really "shittiness", a modern car should just run and not need to be babied, but that investment that you put in actually makes you feel closer to it, it makes you really fond of it.

It's well known that our attachment to children is based on the same principle. You put so much work into making a child that you then feel very committed to it.

Of course this is really all just a classic example of one of the major flaws in human intuitive reasoning - the fallacy of sunk cost. Just because you have previously put in lots of work to your child doesn't mean that you should continue to do so. You should ignore sunk cost and just evaluate future EV based on the current situation. If you're a rational human you should wake up each day and say to yourself "should I keep my kids today?".

BTW this is another rant but I do find it somewhat amusing that the most celebrated features of humanity are basically forms of irrationality. There's only good decision making and bad decision making. Irrational decision making is bad. Anyone who is "emotional" or "sticks to their guns" or is "brave" or "puts their family before everything else" or whatever is a bad decision maker. That's not to say that a rational decision maker will never do anything that might be described as "brave" or wouldn't "put their family first", but they will do it based on considering the consequences of each choice and selecting one; anyone who does it based only on "feeling" is dangerous and not to be trusted.

06-17-10 - Friday Code Review

Today I bring you a piece of my own code that had a nasty bug. This is from my buffered file reader, which I did some fiddling on a while ago and didn't test hard enough (there are now so many ways to use my code that it's really hard to make sure I test all the paths, which is a rant for another day).

Here's the whole routine :


   S64  OodleFile_ReadLarge(OodleFile * oof,void * into,S64 count)
   {
    // use up the buffer I have :
    U8 * into_ptr = (U8 *)into;
    S64 avail = (S64)( oof->ptrend - oof->ptr );
    S64 useBuf = RR_MIN(avail,count);
        
    memcpy(into_ptr,oof->ptr,(size_t)useBuf);
    oof->ptr += useBuf;
    into_ptr += useBuf;
    count -= useBuf;
    
    rrassert( oof->ptr <= oof->ptrend );
        
    if ( count > 0 )
    {
        //OodleFile_Flush(oof);
        rrassert( oof->ptr == oof->ptrend );
        
        S64 totRead = 0;
        
        while( count > 0 )
        {
            S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
            
            S32 read = oof->vfile->Read(into_ptr,curRead);
        
            if ( read == 0 )
                break; // fail !
        
            count    -= read;
            into_ptr += read;
        }
        
        return useBuf + totRead;
    }
    else
    {
        return useBuf;
    }
   }

But the problem is here :


        S64 totRead = 0;
        
        while( count > 0 )
        {
            S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
            
            S32 read = oof->vfile->Read(into_ptr,curRead);
            
            if ( read == 0 )
                break; // fail !
            
            count    -= read;
            into_ptr += read;
            // totRead += read;  // doh !!
        }
        
        return useBuf + totRead;

That's benign enough, but I think this is actually a nice simple example of a habit I am trying to get into :

never use another variable when one of the ones you have will do

In particular this loop could be written without the redundant counter :


        while( count > 0 )
        {
            S32 curRead = (S32) RR_MIN(count,OODFILE_MAX_IO_SIZE);
            
            S32 read = oof->vfile->Read(into_ptr,curRead);
            
            if ( read == 0 )
                break; // fail !
            
            count    -= read;
            into_ptr += read;
        }
        
        return useBuf + (S64)(into_ptr - (U8 *)into);

Don't step the pointer and also count how far it steps; do one or the other, and compute them from each other. I keep finding this over and over, and it's one of the few ways that I still consistently make bugs : if you have multiple variables that need to be in sync, it's inevitable that they will get out of sync over time and you will have bugs.

(BTW I just can't help going off topic of my own posts; oh well, here we go : I hate the fact that I have to have C-style casts in my code all over now, largely because of fucking 64 bit. Yeah yeah I could static_cast or whatever fucking thing but that is no better and much uglier to read. On the line with the _MIN , I know that a _MIN of two types can fit in the size of the smaller type, so it's safe, but it's disgusting that I'm using my own semantic knowledge to validate the cast, that's very dangerous. I could use my check_value_cast here but it is tedious sticking that everywhere you do any 32-64 bit casting or pointer-int casting. On the output I could use ptrdiff_t or whatever you portability tards want, but that only pushes the cast to somewhere else (if I Read return a type ptrdiff_t then the outside caller has to turn that into S64 or whatever). I decided it was better if my file IO routines always just work with 64 bit file positions and sizes, even on 32 bit builds. Obviously running on 32 bits you can't actually do a larger than 4B byte read into a single buffer, so I could use size_t or something for the size of the read here, but I hate types that change based on the build, I much prefer to just say "file pos & size is 64 bit always".)

6/15/2010

06-15-10 - Top down - Bottom up

"Top down bottom up" sounds like a ZZ-top song, but is actually different ways of attacking tree-like problems. Today I am thinking about how to decide where to split a chunk of data for Huffman compression. A quick overview of the problem :

You have an array of N characters. You want to Huffman compress them, which means you send the code lengths and then the codes. The idea is that you may do better if you split this array into several. For each array you send new code lengths. Obviously this lets you adapt to the local statistics better, but costs the space to send those code lengths. (BTW one way to do this in practice is to code using an N+1 symbol alphabet, where the extra symbol means "send new codes"). You want to pick the split points that give you lowest total length (and maybe also a penalty for more splits to account for the speed of decode).
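Concretely, the quantity being minimized is something like this (a sketch ; the flat 4 bits per present symbol is a made-up stand-in for however you actually transmit the code lengths) :

#include <math.h>

// approximate bits to code one chunk with its own Huffman table :
// the chunk's order-0 entropy plus the cost of sending the code lengths
static double ChunkCostBits(const int histo[256], int total)
{
    double bits = 0;
    int present = 0;
    for (int s = 0; s < 256; s++)
    {
        if ( histo[s] == 0 ) continue;
        present++;
        bits += histo[s] * -log2( (double) histo[s] / total );
    }
    return bits + present * 4.0; // hypothetical header cost
}

// a split is beneficial iff
//   ChunkCostBits(left) + ChunkCostBits(right) < ChunkCostBits(whole)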

The issue is how do you pick the splits? For N characters the possible number of splits is 2^(N-1) (eg. if N = 3, the ways are [012],[0][12],[01][2],[0][1][2]) so brute force is impossible.

The standard ways to solve this problem is either top down or bottom up. I always think of these in terms of Shannon-Fano tree building (top down) or Huffman tree building (bottom up).

The top down way is : take the N elements and pick the single best split point. Then repeat on each half. This is recursive top-down tree building. One disadvantage of this approach is that it can fail to make good splits when the good splits are "hidden". eg. data like {xxxxxoooooxxxxxoooooxxxxxoooooxxxxx} might not pick any splits, because the gain from one split is not significant enough to offset the cost of sending new codes, but if you actually went all the way down the tree you would find good splits by completely segregating each unique bit. I also sometimes call this the "greedy" approach.
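A sketch of the greedy top-down recursion, using ChunkCostBits() from above (the naive re-histogramming makes this quadratic ; real code would evaluate split candidates incrementally) :

#include <vector>

static double CostOfRange(const unsigned char * chars, int lo, int hi)
{
    int histo[256] = { 0 };
    for (int i = lo; i < hi; i++) histo[ chars[i] ] ++;
    return ChunkCostBits(histo, hi - lo);
}

// find the best single split of [lo,hi) ; recurse only if it helps :
static void SplitTopDown(const unsigned char * chars, int lo, int hi,
        std::vector<int> & splits)
{
    double best = CostOfRange(chars, lo, hi); // cost of not splitting
    int bestS = -1;
    for (int s = lo+1; s < hi; s++)
    {
        double c = CostOfRange(chars, lo, s) + CostOfRange(chars, s, hi);
        if ( c < best ) { best = c; bestS = s; }
    }
    if ( bestS < 0 ) return; // greedy stop : no beneficial split
    SplitTopDown(chars, lo, bestS, splits);
    splits.push_back(bestS);
    SplitTopDown(chars, bestS, hi, splits);
}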

So the next solution is bottom up. Start with every character as a group (merge up runs of identical characters first). Then find the two groups which give you the biggest gain to merge and merge those groups. Repeat until no merge is beneficial.
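And a sketch of the bottom-up merge loop with the same cost model (O(N^2) as written ; a heap keyed on merge gain fixes that) :

#include <vector>

struct Group { int histo[256]; int total; };

static Group Merged(const Group & a, const Group & b)
{
    Group m;
    m.total = a.total + b.total;
    for (int s = 0; s < 256; s++) m.histo[s] = a.histo[s] + b.histo[s];
    return m;
}

// repeatedly merge the adjacent pair with the biggest savings ;
// stop when no merge saves any bits :
static void MergeGroups(std::vector<Group> & groups)
{
    for (;;)
    {
        double bestGain = 0;
        int bestI = -1;
        for (int i = 0; i+1 < (int) groups.size(); i++)
        {
            Group m = Merged(groups[i], groups[i+1]);
            double gain = ChunkCostBits(groups[i].histo,   groups[i].total)
                        + ChunkCostBits(groups[i+1].histo, groups[i+1].total)
                        - ChunkCostBits(m.histo, m.total);
            if ( gain > bestGain ) { bestGain = gain; bestI = i; }
        }
        if ( bestI < 0 ) return; // no beneficial merge left
        groups[bestI] = Merged(groups[bestI], groups[bestI+1]);
        groups.erase(groups.begin() + bestI + 1);
    }
}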

In general it is often the case that bottom up solutions to these problems are better than top-down, but I don't think there is any strong reason why that should be true.

There is a whole class of problems like this, such as BSP/kd tree building and category tree building in the machine learning literature (actually this is identical to category tree building in machine learning, since entropy is a pretty good way to rate the quality of category trees, and an "overfit" penalty for making too many splits is equivalent to the code cost transmission penalty here). One of the good tricks when using the top-down approach is to do an extra "non-greedy" step. When you hit the point where no more splits are beneficial, go ahead and do another split anyway speculatively and then split its children. Sometimes by going ahead one step you get past the local minimum; eg. the classic case is trying to build a kd-tree for this kind of data :


  X  O

  O  X

Where do you put a split? You currently have 50/50 entropies, no split is beneficial. If you do one split and then force another, you find a perfect tree.

Anyway, because this class of problems is unsolvable, most of the techniques are just heuristic and ad-hoc. For example in many cases it can be helpful just to always do a few mid-point splits for N levels to get things started. For the specific case of kd-trees there have been a lot of great papers in the last 5 years out of the interactive ray tracing community.

06-15-10 - The end of ad-hoc computer programming

I think we are in an interesting period where the age of the "guerilla" "seat of the pants" , eg. ad-hoc, programmer is ending. For the last 30 years, you could be a great programmer without much CS or math background if you had good algorithmic intuition. You could figure out things like heuristics for top-down or bottom-up tree building. You could dive into problems without doing research on the prior art and often make up something competitive. Game programmers mythologize these guys as "ninjas" or "hackers" or various other terms.

That's just about over. The state of the art in most areas of coding is getting so advanced that ad-hoc approaches simply don't cut the mustard. The competition is using an approach that is provably optimal, or provably within X% of optimal, or provably optimal in a time-quality tradeoff sense.

And there are so many strong solutions to problems out there now that cutting your own is more and more just foolish and hopeless.

I wouldn't say that the end of the ad-hoc programmer has come just yet, but it is coming. And that's depressing.
