03-01-13 - Zopfli

zopfli seems to make small zips. There's no description of the algorithm so I can't comment on it. But hey, if you want small zips it seems to be the current leader.

(update : I've had a little look, and it seems to be pretty straightforward, it's an optimal parser + huff reset searcher. There are various other prior tools to do this (kzip,deflopt,defluff,etc). It's missing some of the things that I've written about before here, such as methods of dealing with the huff-parse feedback; the code looks pretty clean, so if you want a good zip-encoder code it looks like a good place to start.)

I've written these things before, but I will summarize here how to make small zips :

1. Use an exact (windowed) string matcher.

cbloom rants 09-24-12 - LZ String Matcher Decision Tree

2. Optimal parser. Optimal parsing zip is super easy because it has no "repeat match", so you can use plain old backwards scan. You do have the huffman code costs, so you have to consider at least one match candidate for each codeword length.

cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 09-04-12 - LZ4 Optimal Parse

3. Deep huffman reset search. You can do this pretty easily by using some heuristics to set candidate points and then building a bottom-up tree. Zopfli seems to use a top-down greedy search. More huffman resets makes decode slower, so a good encoder should expose some kind space-speed tradeoff parameter (and/or a maximum number of resets).

cbloom rants 06-15-10 - Top down - Bottom up
cbloom rants 10-02-12 - Small note on Adaptive vs Static Modeling

4. Multi-parse. The optimal parser needs to be seeded in some way, with either initial code costs or some kind of heuristic parse. There may be multiple local minima, so the right way to do it is to run 4 seeds (or so) simultaneously with different strategies.

cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies

5. The only unsolved bit : huffman - parse feedback. The only solution I know to this is iteration. You should use some tricks like smoothing and better handling of the zero-frequency symbols, but it's just heuristics and iteration.

One cool thing to have would be a cheap way to compute incremental huffman cost.

That is, say you have some array of symbols. The symbols have a corresponding histogram and huffman code. The full huffman cost is :

fullcost(symbol set) = cost{ transmit code lengths } + sum[n] { codelen[n] * count[n] }
that is, the cost to send the code lengths + the cost of sending all the symbols with those code lengths.

You'd like to be able to do an incremental update of fullcost. That is, if you add one more symbol to the set, what is the delta of fullcost ?

*if* the huffman code lengths don't change, then the delta is just +codelen[symbol].

But, the addition of the symbol might change the code lengths, which causes fullcost to change in several ways.

I'm not sure if there's some clever fast way to do incremental updates; like when adding the symbol pushes you over the threshold to change the huffman tree, it often only changes some small local part of the tree, so you don't have to re-sum your whole histogram, just the changed part. Then you could slide your partition point across an array and find the optimal point quite quickly.

Now some ranting.

How sad is it that we're still using zip?

I've been thinking about writing my own super-zipper for many years, but I always stop myself because WTF is the point? I don't mean for the world, I guess I see that it is useful for some people, but it does nothing for *me*. Hey I could write some thing and probably no one would use it and I wouldn't get any reward from it and it would just be another depressing waste of some great effort like so many other things in my life.

It's weird to me that the best code in the world tends to be the type of code that's given away for free. The little nuggets of pure genius, the code that really has something special in it - that tends to be the free code. I'm thinking of compressors, hashers, data structures, the real gems. Now, I'm all for free code and sharing knowledge and so on, but it's not equal. We (the producers of those gems) are getting fucked on the deal. Apple and the financial service industry are gouging me in every possible immoral way, and I'm giving away the best work of my life for nothing. It's a sucker move, but it's too late. The only sensible play in a realpolitik sense of your own life optimization is to not work in algorithms.

Obviously anyone who claims that patents provide money to inventors is either a liar (Myhrvold etc.) or just has no familiarity with actual engineering. I often think about LZ77 as a case in point. The people who made money off LZ77 patents were PK and Stac, both of whom contributed *absolutely nothing*. Their variants were completely trivial obvious extensions of the original idea. Of course the real inventors (L&Z, and the modern variant is really due to S&S) didn't patent and got nothing. Same thing with GIF and LZW, etc. etc. perhaps v42 goes in there somewhere too; not a single one of the compression-patent money makers was an innovator. (and this is even igoring the larger anti-patent argument, which is that things like LZ77 would have been obvious to any number of researchers in the field at the time; it's almost always impossible to attribute scientific invention/discovery to an individual)


L_(//:) said...

Out of curiosity - what would You name greatest current failure of patent system?

(relevant: http://filmicgames.com/archives/778 says it's: arithmetic encoding in JPEG images)

Sergey Schetinin said...

In addition to the programming gems being free there's a lot of areas of improvement in our industry that are not being pursued because they don't bring any reward. For example programming languages are in a very sad state yet trying to create a new programming as a business is doomed to failure.

cbloom said...

@Sergey - good point. There's tons of work that should be done on making software easier to share and interoperate. We should have a standardized scripting language that every app can talk to. There should be standard plugin systems and the ability to plug your own code into 3rd party apps and libraries for customizations. etc. etc. If it doesn't sell more widgets to people on facebook it's not gonna get worked on.

cbloom said...

@L - I don't think jpeg-arith is even in the top 10 of terrible patents. There are total face-palming patents on things like text messaging, spread sheets, autocomplete, cloud computing, etc.

jpeg-arith is particularly retarded because it's one of those cases where the patent just killed the standard. (J2K was perhaps a bigger tragedy).

Neal Burns said...

I think academia and places like Microsoft Research (at one time, Xerox PARC and Bell Labs) are where you're supposed to get rewarded for work that benefits mankind. Otherwise it's just for the immortality and adoration.

I appreciate the stuff that you've posted. Especially this: http://cbloomrants.blogspot.com/2011/09/09-28-11-algorithm-next-index-with.html and this, which I saw today: http://cbloomrants.blogspot.com/2011/09/09-28-11-string-matching-with-suffix.html

I'm wondering if you noticed that the next-with-lower-value array (which I think is this), after being permuted to the original (not sorted) order*, looks to be a reversible transform. I.e., bijective for any permutation. It's closely related to the inversion vector (if you don't recognize this as such, it's the same concept as the "Inversion Frequencies" BWT algorithm), but with the benefit of being optimal in space and time. The best known inversion vector algorithm is superlinear. Obviously it's useful for LZ, but I think it's interesting in its own right.

* so that for any file position, it holds the position of the suffix that sorts next, for the initial part of the file only

old rants