
09-09-10 - The sorry state of compression on Wikipedia

It disturbs me a little bit that PAQ has a huge Wikipedia page while PPM, which is a far more significant and fundamental algorithm, has almost nothing. Presumably that's because nobody is actively working on PPM at the moment, and the people working on PAQ are pimping it.

Similarly, there is no page at all for ROLZ or LZP (though there are decent pages on the German Wikipedia for ROLZ and the Polish Wikipedia for LZP).

A lot of Wikipedia is really well sorted these days (basic math and computer science are best covered), but there are gaps where people haven't gotten motivated. Compression is one of those gaps; somebody cool needs to go write a bunch of good articles.

Actually it's so shameful that maybe I will write it myself, but I don't really want to get sucked into that rat hole and waste my time on it.

What a proper PPM entry should cover:

1. The basic operation of the PPM algorithm (a toy sketch follows this list).

2. Theoretical description of finite context modeling; the theory of asymptotic optimality for finite context Markov sources; the probability structure of PPM as a cascade of models.

3. Exclusions (Update Exclusion, Lazy Exclusion, Partial Update Exclusion (PPMii))

4. The zero frequency problem & escape estimation. The PPMA, B, C, D variants; PPMP and PPMX. Clearest in the Korodi paper.

(Data Compression Using Adaptive Coding and Partial String Matching '84)
(A note on the PPM data compression algorithm '88)
(Probability estimation for PPM '95)
(Experiments on the Zero Frequency Problem '95)

5. PPM* (Unbounded Length Contexts for PPM '95)

6. Langdon's PPM/LZ equivalence

7. Modern PPM : PPMZ, PPMd, PPMii, Rkive

8. Local Order Estimation (PPMZ)

9. Secondary Estimation (PPMZ)

10. Blending (PPMii / Volf)

11. "Information Inheritance" (novel context intitialization) (PPMii) ( "PPM: One Step to Practicality" - Shkarin )

12. Non-simple contexts (skip contexts, position contexts) (Rkive)
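
To make item 1 (and the escape mechanism of item 4) concrete, here is a toy sketch of the PPM probability cascade with PPMC-style escape estimation. It is illustrative only - no exclusions, no arithmetic coder attached, and not any particular published implementation:

from collections import defaultdict
import math

MAX_ORDER = 2      # longest context length used
ALPHABET  = 256    # order -1 fallback is uniform over bytes

class ToyPPM:
    def __init__(self):
        # counts[order][context tuple][symbol] -> frequency
        self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(MAX_ORDER + 1)]

    def symbol_prob(self, history, sym):
        # Probability the cascade assigns to sym: a product of escape
        # probabilities for the orders that miss, times the symbol probability
        # at the order that finally predicts it.  PPMC: the escape count of a
        # context equals the number of distinct symbols seen in it.
        p = 1.0
        for order in range(min(MAX_ORDER, len(history)), -1, -1):
            ctx = tuple(history[len(history) - order:])
            table = self.counts[order][ctx]
            total = sum(table.values())
            distinct = len(table)
            if total == 0:
                continue                                    # never-seen context: fall through, no escape charged
            if sym in table:
                return p * table[sym] / (total + distinct)  # predicted at this order
            p *= distinct / (total + distinct)              # escape, drop to a shorter context
        return p / ALPHABET                                 # order -1: uniform over the alphabet

    def update(self, history, sym):
        # full updates at every order (i.e. no update exclusion)
        for order in range(0, min(MAX_ORDER, len(history)) + 1):
            ctx = tuple(history[len(history) - order:])
            self.counts[order][ctx][sym] += 1

model = ToyPPM()
data = b"abracadabra abracadabra"
hist, bits = [], 0.0
for b in data:
    bits += -math.log2(model.symbol_prob(hist, b))
    model.update(hist, b)
    hist.append(b)
print(f"{bits / len(data):.2f} bits/byte")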

http://www.arturocampos.com/ac_finite_context_modeling.html
http://chiranjivi.tripod.com/Finite.html
http://www.dogma.net/markn/articles/arith/part1.htm
http://homepage3.nifty.com/wpage/links/prediction_by_partial_matching.html
http://www.compression.ru/ds/
http://www.arturocampos.com/ac_ppmc.html
http://www.cs.waikato.ac.nz/~ihw/publications.html

Furthermore, the PAQ page needs to be changed to be more of an encyclopedia entry and less of a list of software version releases. In particular it should describe the algorithm and its relationship to earlier work: primarily CTW, to which it is obviously very similar, and the weighting/mixing schemes of Volf et al., as well as the very early binary Markov coders such as DMC.

(http://www.sps.ele.tue.nl/members/F.M.J.Willems/RESEARCH_files/CTW/ResearchCTW.htm)

While I'm at it: the LZ77 entry is pretty terrible (it makes a big deal about overlapping matches, which are actually pretty irrelevant, and doesn't discuss coding methods or optimal parse or anything interesting), CTW has no entry at all, and the Arithmetic Coding entry is really bizarre and badly written. A proper arithmetic coding entry should briefly describe the theoretical [0,1] interval shrinking, then move on to practical implementation issues and history: CACM87, DCC95, approximate arithmetic coders like the skew coder and the Howard/Vitter and ELS coders, the Q and QM coders, CABAC, and finish with the "range" coder. The separate entry on the range coder should be deleted; the example code on that page is actually wrong because it doesn't handle the underflow case, which is the whole interesting question.
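
To illustrate what I mean by the theoretical interval shrinking, here is a toy sketch in exact rational arithmetic - so none of the renormalization, carry, or underflow issues that make real coders interesting even come up (it's a sketch, not anything you would ship):

from fractions import Fraction

def encode_interval(symbols, ranges):
    # ranges maps each symbol s to its slice [lo_s, hi_s) of [0,1),
    # with widths equal to the symbol probabilities.
    low, high = Fraction(0), Fraction(1)
    for s in symbols:
        lo_s, hi_s = ranges[s]
        width = high - low
        high = low + width * hi_s   # shrink the interval to the symbol's slice
        low  = low + width * lo_s
    return low, high                # any number in [low, high) identifies the message

# 'a' has probability 0.7 -> [0, 0.7) ; 'b' has probability 0.3 -> [0.7, 1.0)
ranges = {"a": (Fraction(0), Fraction(7, 10)), "b": (Fraction(7, 10), Fraction(1))}
lo, hi = encode_interval("aab", ranges)
print(lo, hi)   # final width = 0.7 * 0.7 * 0.3, the product of the symbol probabilities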

18 comments:

Jon Olick said...

I nominate Mr. Bloom to make the compression wiki pages.

Jonathan said...

Second.

cbloom said...

Oh god damn it.

Autodidactic Asphyxiation said...

"Yeah, right. Its 2010 and we have C.Bloom writing fast huffman decoders, instead of posting HP entries or something.
I guess the next step would be RLE?.. It really disappoints the hell out of me."

http://encode.ru/threads/1116-Charles-Blooms-new-huffman-optimisation?p=22277&viewfull=1#post22277

You've let us all down. ;)

cbloom said...

Oh snap. Well, I kind of agree with him. I've been looking a lot recently at the stuff I did in 1995, and I'm pretty disappointed in the lack of progress in the last 15 years.

Spending time on practical issues might be good for making things that are actually useful to your corporate masters, but it doesn't take you into new territory.

Shelwien said...

Yes, but the "good old" things are not always so good in practice either. It's been a really long time, and scrupulously describing the details of PPM's development history won't (imho) help anyone develop a better compressor - we basically know that SEE is the best approach to escape estimation, etc. Also, at the moment the modern CM implementations are better than PPM in both speed and compression, and they still have a lot of room for improvement:
http://compressionratings.com/comp.cgi?ppmd+Jr1+-m256%20-r1%20-o8+ccm+1.30c+x%205+m1+x2-0.6+6%20enwik7.txt

cbloom said...

The point of an encyclopedia is not to describe current research. It's to record the body of knowledge that civilization has gained over time.

Most of the ideas of CM sadly have not moved beyond the ideas of PPMZ and PPMii and CTW, so a strong historical background in them is still very useful.

I've always found it to be very useful in really understanding modern techniques to go back to old techniques and see why they were abandoned.

cbloom said...

And to be clear, the reason PPM should have a solid entry (more so than PAQ/Context Mixing) is that PPM has a large body of work behind it, it is well understood, and the algorithms are described and many are open. It is also widely applicable to things other than very high compression archivers.

Shelwien said...

Sure, and I'd really appreciate it if new compression topics would somehow just appear on Wikipedia. But I'd appreciate it even more if the people who could write them, like you or Matt Mahoney, would instead work on something new, because there are still more unsolved tasks than solved ones in compression, and imho it's not really the right time to analyze the past in the hope of finding some missed opportunities.

"Most of the ideas of CM sadly have not moved
beyond the ideas of PPMZ and PPMii and CTW,"


I can't agree here, though it's a matter of taste. Some major points I can list: interpolated and/or multi-dimensional secondary estimation, adaptive logistic mixing (CTW's is not quite as adaptive), context history (especially bit history) analysis, automated parameter tuning (including context masks), state-machine counters, rangecoder complexity and speed optimizations, higher-level text models, etc.

Anyway, I've seen both old and new codecs, and the new ones don't have anything in common with the ones from 2004 and before - thanks to Matt Mahoney. Certainly it's not all good either - at the least, I believe it's better to use precise statistics (with trees) instead of randomized ones (with hashtables). But I also believe that pure PPM won't be of much help anymore - a similar tree-based CM would show better compression, and light bitwise CMs better speed. And then there are also BWT codecs.

"I've always found it to be very useful in
really understanding modern techniques to back
to old techniques and see why they were
abandoned."


I agree, but most of the time that requires redoing the whole research process from scratch (because most researchers never publish their sources, and we can't just trust that their results are unbiased). And considering PPM specifically, I don't see the point in doing that, as it only has historical value.

cbloom said...

"I can't agree here, though its a matter of taste.
Some major points that I can list: interpolated
and/or multi-dimensional secondary estimation,
adaptive logistic mixing (CTW's one is not quite
as adaptive), context history (especially bit
history) analysis, automated parameter tuning
(including context masks), state-machine counters,
rangecoder complexity and speed optimizations,
higher-level text models, etc.
"

You should write about these! Or encourage people who know more about them to do so.

Too much of the work on encode.ru is closed source with no descriptions!

cbloom said...

But what I mean by using the same ideas is:

take CTW, the idea of working in binary and blending context models by weighting rather than selecting

take Malcolm's ideas from Rkive such as using non-contiguous contexts like AxBx

take SEE from PPMZ and the PPMii ideas of percolating updates to neighbors to help with sparse statistics and slow learning

you pretty much have modern CM

yes obviously there are lots of little tweaks, but so far as my understanding goes there's nothing dramatic.

logistic weighting is a good idea; it's obvious in hindsight that the standard machine learning methods for weighting experts are applicable to weighting models. In fact I'm surprised that more techniques from machine learning haven't been used yet; the weighting in PAQ is very basic machine learning stuff as far as I can tell (simple linear online training), and modern ML has lots more tools to apply.
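
To be concrete about what I mean by simple linear online training, something like this - my paraphrase of the general idea, not PAQ's actual code:

import math

def stretch(p):
    return math.log(p / (1.0 - p))

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

class LogisticMixer:
    def __init__(self, n_models, lr=0.01):
        self.w = [0.0] * n_models   # one weight per model
        self.lr = lr
        self.st = [0.0] * n_models  # stretched inputs from the last mix

    def mix(self, probs):
        # combine the models' bit probabilities in the logistic (stretched) domain
        self.st = [stretch(min(max(p, 1e-6), 1.0 - 1e-6)) for p in probs]
        return squash(sum(w * s for w, s in zip(self.w, self.st)))

    def update(self, p_mixed, bit):
        # one gradient step on coding cost (cross-entropy): d cost / d w_i = (p - bit) * st_i
        err = bit - p_mixed
        self.w = [w + self.lr * err * s for w, s in zip(self.w, self.st)]

mixer = LogisticMixer(2)
for _ in range(1000):              # model 0 is right about a stream of 1-bits, model 1 says nothing
    p = mixer.mix([0.9, 0.5])
    mixer.update(p, 1)
print(mixer.w)                     # weight on model 0 grows; model 1's stays at 0 since stretch(0.5) = 0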

Shelwien said...

"You should write about these! Or encourage
people who know more about them to do so."


But I do write. And I always try to explain the things I know if anybody asks me; I even maintain an IRC channel for that.

"Too much of the work on encode.ru is closed
source with no descriptions!"


There are some, but they mostly don't matter - and anyway, executables are still way better than bpb results for randomly chosen files, like in DCC articles. Also, a few that did matter are not that closed anymore either :)

Shelwien said...

Erm, the link was "threads started by Shelwien, page 3"

Shelwien said...

"take CTW, the idea of working in binary and blending context models by weighting rather than selecting"


It's not the CTW idea at all, and it's trivial.

The CTW idea is to use string probabilities for prediction, i.e. logistic estimation (which appears because we can only handle string probabilities in the log domain). However, Matt's logistic mixing was taken from neural networks, not CTW.

Basically, both bitwise processing and "blending" are obvious when you try to make a compressor based on an NN (which is what Matt originally did).

Also, (linear) mixing is natural for another reason - if we have two applicable models which give predictions p1 and p2, and a probability w of p1 being right, then p = w*p1 + (1-w)*p2 follows straight from the definition.

As to logistic mixing, it's less obvious, but I found that the paq mixing formulas can actually be derived from the same source model based on submodel switching, simply by considering string probabilities.

"take Malcolm's ideas from Rkive such as using non-contiguous contexts like AxBx"


Nobody remembers that anymore, but most still use sparse contexts because it's natural with a hashtable-based coder.

"take SEE from PPMZ and the PPMii ideas of percolating updates to neighbors to help with sparse statistics and slow learning"


I agree about SEE, but PPMII's ideas are its initialization of new counters, dynamic ranking of symbols in tree nodes, unary coding, and SSE. And as to "fuzzy" updates - that's also one of those things which everybody invents on their own.

Btw, ppmonstr also has sparse contexts.

"you pretty much have modern CM"

Still, I don't see it. Modern CMs store different counters in different data structures (they have to be cache-friendly these days), mix submodel predictions with a different mixing function, and encode the bits with a different arithmetic coder (Matt's rangecoder is pretty unique, although I don't like it). So even if you can find some similarities, they're likely false.

"yes obviously there are lots of little tweaks, but so far as my understanding goes there's nothing dramatic."


Depends on where you look. Both old and new coders take data on input and write some function of it on output, and both are based on the same idea of using statistics to predict the data. But the implementations look completely different to me. And in fact their behaviour is fairly different too - the old coders compressed text considerably better; considerably enough for ppmonstr to still hold the best result on 1G text compression:
http://mattmahoney.net/dc/text.html
(durilca is ppmonstr with an attached text preprocessor)

"the weighting in PAQ is very basic machine learning stuff as far as I can tell (simple linear online training)"


It likely isn't that simple - that update formula was obtained by minimizing the prediction entropy; originally Matt used a different one, borrowed from NN theory, and it didn't work well enough.
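
(Roughly, the minimization goes like this - a sketch of the standard cross-entropy derivation, not necessarily Matt's exact formula. With stretched inputs $x_i = \ln\frac{p_i}{1-p_i}$ and mixed prediction $p = \sigma(\sum_i w_i x_i)$, where $\sigma(t) = 1/(1+e^{-t})$, the code length of bit $y$ is $-y\ln p - (1-y)\ln(1-p)$; its gradient is $\partial/\partial w_i = (p - y)\,x_i$, so gradient descent with a small rate $\eta$ gives the update $w_i \leftarrow w_i + \eta\,(y - p)\,x_i$.)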

"modern ML has lots more tools to apply."

http://prize.hutter1.net/
http://mailcom.com/challenge/
Good luck :)

Shelwien said...

Also, please consider continuing this at http://encode.ru/threads/1126-Compression-state-of-art-discussion-with-C.Bloom because it's somewhat troublesome to post here.

cbloom said...

I'm trying to post there but it tells me this:

"cbloom, you do not have permission to access this page. This could be due to one of several reasons:

1. Your user account may not have sufficient privileges to access this page. Are you trying to edit someone else's post, access administrative features or some other privileged system?
2. If you are trying to post, the administrator may have disabled your account, or it may be awaiting activation.
"

Will said...

Mr Bloom, Mr Mahoney, Shelwien and those who can contribute, please sit down at a summit or collaborate to lead the way forward!

Shelwien said...

It was apparently waiting for email confirmation - maybe it's in your spam folder? Anyway, I tried to fix it manually; I hope it works now.
