(update : I've had a little look, and it seems to be pretty straightforward, it's an optimal parser + huff reset searcher. There are various other prior tools to do this (kzip,deflopt,defluff,etc). It's missing some of the things that I've written about before here, such as methods of dealing with the huff-parse feedback; the code looks pretty clean, so if you want a good zip-encoder code it looks like a good place to start.)
I've written these things before, but I will summarize here how to make small zips :
1. Use an exact (windowed) string matcher.
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
2. Optimal parser. Optimal parsing zip is super easy because it has no "repeat match", so you can use plain old backwards scan. You do have the huffman code costs, so you have to consider at least one match candidate for each codeword length.
cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 09-04-12 - LZ4 Optimal Parse
3. Deep huffman reset search. You can do this pretty easily by using some heuristics to set candidate points and then building a bottom-up tree. Zopfli seems to use a top-down greedy search. More huffman resets makes decode slower, so a good encoder should expose some kind space-speed tradeoff parameter (and/or a maximum number of resets).
cbloom rants 06-15-10 - Top down - Bottom up
cbloom rants 10-02-12 - Small note on Adaptive vs Static Modeling
4. Multi-parse. The optimal parser needs to be seeded in some way, with either initial code costs or some kind of heuristic parse. There may be multiple local minima, so the right way to do it is to run 4 seeds (or so) simultaneously with different strategies.
cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies
5. The only unsolved bit : huffman - parse feedback. The only solution I know to this is iteration. You should use some tricks like smoothing and better handling of the zero-frequency symbols, but it's just heuristics and iteration.
One cool thing to have would be a cheap way to compute incremental huffman cost.
That is, say you have some array of symbols. The symbols have a corresponding histogram and huffman code. The full huffman cost is :
fullcost(symbol set) = cost{ transmit code lengths } + sum[n] { codelen[n] * count[n] }
that is, the cost to send the code lengths + the cost of sending all the symbols with those code lengths.
You'd like to be able to do an incremental update of fullcost. That is, if you add one more symbol to the set, what is the delta of fullcost ?
*if* the huffman code lengths don't change, then the delta is just +codelen[symbol].
But, the addition of the symbol might change the code lengths, which causes fullcost to change in several ways.
I'm not sure if there's some clever fast way to do incremental updates; like when adding the symbol pushes you over the threshold to change the huffman tree, it often only changes some small local part of the tree, so you don't have to re-sum your whole histogram, just the changed part. Then you could slide your partition point across an array and find the optimal point quite quickly.
Now some ranting.
How sad is it that we're still using zip?
I've been thinking about writing my own super-zipper for many years, but I always stop myself because WTF is the point? I don't mean for the world, I guess I see that it is useful for some people, but it does nothing for *me*. Hey I could write some thing and probably no one would use it and I wouldn't get any reward from it and it would just be another depressing waste of some great effort like so many other things in my life.
It's weird to me that the best code in the world tends to be the type of code that's given away for free. The little nuggets of pure genius, the code that really has something special in it - that tends to be the free code. I'm thinking of compressors, hashers, data structures, the real gems. Now, I'm all for free code and sharing knowledge and so on, but it's not equal. We (the producers of those gems) are getting fucked on the deal. Apple and the financial service industry are gouging me in every possible immoral way, and I'm giving away the best work of my life for nothing. It's a sucker move, but it's too late. The only sensible play in a realpolitik sense of your own life optimization is to not work in algorithms.
Obviously anyone who claims that patents provide money to inventors is either a liar (Myhrvold etc.) or just has no familiarity with actual engineering. I often think about LZ77 as a case in point. The people who made money off LZ77 patents were PK and Stac, both of whom contributed *absolutely nothing*. Their variants were completely trivial obvious extensions of the original idea. Of course the real inventors (L&Z, and the modern variant is really due to S&S) didn't patent and got nothing. Same thing with GIF and LZW, etc. etc. perhaps v42 goes in there somewhere too; not a single one of the compression-patent money makers was an innovator. (and this is even igoring the larger anti-patent argument, which is that things like LZ77 would have been obvious to any number of researchers in the field at the time; it's almost always impossible to attribute scientific invention/discovery to an individual)