Comments on cbloom rants: 05-18-09 - Lagrange Space-Speed Optimization

cbloom — 2009-05-18 19:35

"You really want to keep the number of symbols you need to read to decode your data low."

Yeah, but you also don't want too many symbols. The real "art" of compression design is to factor out the information that you can model well from the information that is really just random. In the very common case of highly predictable symbols you want to do super efficient things like RLE or other multi-symbol coding methods, and in the case of random bits you want to just send them as raw bits and not model them at all. It takes specific understanding of the problem to factor symbols into the information-containing part and the random part.

The "HuffA" code that I wrote a while ago (I think it's in crblib or something) takes the N most probable symbols and creates 2- or 3-symbol decode characters for Huffman.

The LZH that's in Oodle combines the Huffman trees for Length & Offset, so that most of the time it only does one single Huff decode and gets all the information it needs to decode a match.

LZX has some extra clever stuff that I might copy some day. It ensures that before each decode operation enough bits are available, so that it can then quickly do the full decode without updating the bit-IO structure.

I never found a good way to combine Huffman and arithmetic coding because they read bits so differently. It would be nice if you could use one or the other or both.

BTW I just read the "Group Testing of Wavelets" paper and I think it's very interesting.
It's a nice example of how you can do "entropy coding" without any Huffman or arithmetic coder, by just creating a code stream such that your output bits signal selections between categories that are equally likely.

ryg — 2009-05-18 18:32

For Huffman codes, another important efficiency consideration is simply the design of your alphabet. You really want to keep the number of symbols you need to read to decode your data low. The obvious gain is in runtime efficiency (every symbol decoded = one pass through the decoding loop, and the fewer the better), but it also affects the compression ratio: Huffman codes effectively round your symbol probabilities to (negative) powers of two, with a maximum probability of 1/2 (= a one-bit code), so if there's one symbol that's a lot more likely than the others, you're potentially wasting a lot of space. That's why JPEG/MPEG etc. use run-length coding, but only for zero runs: zeros are very frequent, so there'd be a lot of waste in the bitstream if you sent them all individually (and as an added bonus, decoding gets faster as well, since there are fewer symbols sent and more useful work done per decoded symbol). This bit waste is a lot smaller if you're using an arithmetic coder as a backend (there's still some due to truncation/renormalization errors, but that's around 0.01%-0.1% overhead depending on the algorithm, nowhere near the almost 50% you can get with Huffman and one symbol with probability very close to 1). That's why CABAC (the arithmetic coder in H.264) doesn't use any RLE, just a simple bitmap for "is this coeff zero".
You still have the larger number of coded symbols (and hence more work for the decoder), though.

One other thing that can be worthwhile is trying to decode two (syntactically distinct at the bitstream level) symbols at once. In the general case this adds a lot of overhead, but there are some common special cases where it's really useful: a common pattern is a Huffman code specifying the length of some raw bit string that immediately follows it. You find this for match offsets in ZIP/LZX/LZMA, the run/level codes in JPEG/MPEG, and so on. Normally, you first decode the Huffman-coded symbol and then read the extra bits (as specified by that symbol), but there's no deep reason why those have to be separate steps; you can just index your table with [huffman code]+[extra bits], and store the resulting decoded value in there. Not something worth doing for all codes (the longer codes with more extra bits would result in tons of table entries that provide little or no information), but a nice thing to do in your first-level table lookup, to accelerate decoding of your most frequent symbols.