10-01-10 - Some Data Compression Corpora We Need Badly

If somebody wants a university project, these would be nice :

1. A lossless data compression corpus that is *broad* and also *representative*. That is, it should contain many types of data (probably 100-1000 files), some small, some large. Importantly, the type of correlation structure in the data should be very diverse (eg. not just a ton of different English text files or executables). Too many of the existing corpora are simply too small, and even the ones that are reasonably large are too self-redundant: they wind up missing whole types of data that do occur in the wild.

Finally, the thing that's really missing is a weighting number assigned to each file, so that each file is given importance based on its chance of occurrence in the wild. To get these numbers you could do a few different things - download every archive on thepiratebay and sample what's inside them (this gives you a sampling of the type of files people actually put in archives), or maybe put a snooper on the internet backbone and sample the total set of all data that flows over the internet. The point is that this sampling should be based on the actual frequency of various data types, not just an ad hoc composite.
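To make the weighting idea concrete, here is a minimal sketch of how a weighted corpus score might be computed. Everything in it is an assumption for illustration: the three byte strings stand in for real corpus files, the weights are made up rather than measured, `weighted_ratio` is a hypothetical helper, and zlib is just a convenient stand-in for whatever compressor is being tested.

```python
import zlib


def weighted_ratio(samples, compress=zlib.compress):
    """Return weighted compressed size / weighted raw size.

    samples: iterable of (data_bytes, weight) pairs, where weight is the
    assumed chance of that kind of data occurring in the wild.
    """
    raw = 0.0
    packed = 0.0
    for data, weight in samples:
        raw += weight * len(data)
        packed += weight * len(compress(data))
    return packed / raw


# Hypothetical stand-ins for corpus files, with made-up weights.
corpus = [
    (b"the quick brown fox jumps over the lazy dog. " * 200, 0.6),  # text-like
    (bytes(range(256)) * 40, 0.3),                                  # binary-like
    (b"\x00" * 10000, 0.1),                                         # degenerate
]

score = weighted_ratio(corpus)
print("weighted compressed/raw ratio: %.4f" % score)
```

Only the relative sizes of the weights matter here, so they don't need to sum to one. A compressor that does great on a rare file type and badly on a common one gets punished accordingly, which is the whole point.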

2. An image set with human quality metrics. Somebody needs to take a big set of test images (32-100), munge them in various ways by running them through various compressors (as well as damaging them in other ways that aren't well-known compressors), and then get actual human visual ratings on the damaged versions. Then provide all the damaged versions (or code to produce them) along with the human ratings.

If we had a test set like that, we could tweak our algorithmic approximations of human quality rating (eg. SSIM etc) until they reproduce what the actual humans say. This is not a test set for image compressors, it's a test set for image quality metric training, which is what we really need to take image compressors to the next level.
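One way to check how well a metric reproduces what the humans say is rank correlation between the metric's scores and the human ratings over the damaged images. The sketch below uses Spearman correlation, which is my choice of measure, not something the test set would dictate. The ratings and metric scores are made-up numbers purely for illustration, and `ranks`, `pearson` and `spearman` are hypothetical helpers written out in plain Python.

```python
def ranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_ratings):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(metric_scores), ranks(human_ratings))


# Made-up example: one human rating per damaged image (higher = better),
# and the scores two hypothetical metrics gave the same images.
human = [4.5, 3.9, 3.1, 2.0, 1.2]
metric_a = [0.98, 0.95, 0.90, 0.70, 0.40]
metric_b = [0.90, 0.95, 0.60, 0.70, 0.40]

print("metric_a vs humans: %.3f" % spearman(metric_a, human))
print("metric_b vs humans: %.3f" % spearman(metric_b, human))
```

Tweaking a metric then amounts to adjusting its parameters to push this number up on the rated set, ideally checking on held-out images so the metric isn't just memorizing the raters.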
