8/01/2008

08-01-08 - 1

Audio compression is broken. Here's a very easy way to see it.

1 GB can hold about 40,000 novels. (10^6 letters per novel at 0.2 bits per letter is 2*10^5 bits, about 25 KB per novel; 10^9 bytes / 25 KB = 40,000)

1 GB can hold about 15 CDs of 192 kbps MP3. (10^9 bytes * 8 bits / 192,000 bits per second / 60 / 45 minutes per CD = 15.4)
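To make the back-of-envelope math concrete, here's the arithmetic as a tiny script (same assumptions as above : 0.2 bits per letter for text, 192 kbps MP3, 45-minute CDs) :

    # capacity of 1 GB (10^9 bytes) under the assumptions above
    GB_BITS = 10**9 * 8

    # novels : 10^6 letters each at 0.2 bits per letter
    novel_bits = 10**6 * 0.2
    novels_per_gb = GB_BITS / novel_bits    # = 40,000

    # CDs : 45 minutes of audio at 192 kbps
    cd_bits = 192_000 * 60 * 45
    cds_per_gb = GB_BITS / cd_bits          # ~= 15.4

    print(novels_per_gb / cds_per_gb)       # ~= 2,600 novels per CD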

I think it's intuitively obvious that 2,600 novels contain more real information than one CD.

Here's another way of seeing it : an entire human genome is only 0.8 GB (about 3.2 billion base pairs at 2 bits each). I can write the code to make a musician, and I still have 0.2 GB left over in which I can store 10,000 songs as sheet music, as well as detailed instructions on how to manufacture instruments. This is a program to produce music (the fact that we can't currently execute this code is irrelevant to the theoretical consideration; it is executable in principle). I just encoded 10,000 songs in 1 GB, albeit with a lot of loss in a PSNR sense.
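Counting that out (3.2 billion base pairs is the standard round figure; the ~10 KB per song of sheet music is just my guess) :

    # human genome : ~3.2e9 base pairs at 2 bits each (A/C/G/T)
    genome_bytes = 3.2e9 * 2 / 8                 # = 0.8 GB
    leftover = 10**9 - genome_bytes              # = 0.2 GB

    # at ~10 KB per song of sheet music, the leftover holds 20,000 songs
    print(genome_bytes / 1e9, leftover / 10_000) # 0.8  20000.0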

Basically audio is jam packed full of useless detail. For example, if somebody sits and tries to strum a chord on a guitar exactly the same way over and over, the raw wav data for every one of those chords will look vastly different, but to our ear they all sound very similar; the differences are just randomness. If you had a chord generator that could randomly make a sound somewhere in the space of what he was playing, you could let the generator replace all those strums. In a raw PSNR sense they would be way, way off, but the experience would be identical to a human listener.
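Here's a toy version of that chord generator (my sketch, using a Karplus-Strong pluck as the stand-in synth) : every strum comes out wildly different sample-by-sample, while the magnitude spectra - roughly, what you hear - line up :

    import numpy as np

    def pluck(freq, sr=44100, dur=1.0, rng=None):
        # Karplus-Strong : a random noise burst fed through a lowpassed
        # feedback delay line sounds like a plucked string
        rng = rng if rng is not None else np.random.default_rng()
        delay = int(sr / freq)
        buf = rng.uniform(-1.0, 1.0, delay)      # the random excitation
        out = np.empty(int(sr * dur))
        for i in range(out.size):
            out[i] = buf[i % delay]
            buf[i % delay] = 0.5 * (buf[i % delay] + buf[(i + 1) % delay])
        return out

    rng = np.random.default_rng()
    chord = [196.0, 246.9, 293.7]                # G - B - D
    a = sum(pluck(f, rng=rng) for f in chord)    # one strum
    b = sum(pluck(f, rng=rng) for f in chord)    # "the same" strum again

    # sample-wise the two takes barely correlate ...
    print(np.corrcoef(a, b)[0, 1])               # small
    # ... but their spectra match - the harmonic structure is the same
    fa, fb = np.abs(np.fft.rfft(a)), np.abs(np.fft.rfft(b))
    print(np.corrcoef(fa, fb)[0, 1])             # much closer to 1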

Let me add a little more detail :

Anything that a compressor does not learn, it is transmitting. So, for example, text compressors are still not really learning grammar, which means they are implicitly transmitting the information about the grammar. That means that within the 0.2 bpc output, a lot of the bits are actually spent encoding the rules of grammar.

A basic audio compressor is also obviously transmitting tons of things it's not learning about the music. It's not learning which sound comes from which instrument, what those instruments tend to sound like (eg. a waveguide simulation that reproduces that sound), which parts are vocals and what the throat-simulation parameters of that vocalist are, etc.

However, even if the audio compressor learned all that stuff it would still be really wasteful. The problem is there's just so much irrelevant nonsense in audio. As a simple example, a pretty decent model for audio is a bunch of synth wave packets with effects. A wave packet is some amplitude curve applied to some periodic wave shape. Each of these has various parameters that can change over time, and each parameter also has random noise on it. The parameters of the noise (mean & sdev) are important, but the exact values that come out of the noise are not. Likewise, the phase of the basic wave shape at the base of the synth is irrelevant to how it sounds. But if you regenerate the exact same sound again with different random noise on the params and a different phase, the raw samples will come out looking totally different.
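A minimal sketch of one such wave packet (all the constants here are made up) : regenerate it with a new phase and new noise samples and the PSNR is awful even though nothing audible changed :

    import numpy as np

    sr = 44100
    t = np.arange(sr // 2) / sr                  # half a second

    def packet(freq, rng):
        # envelope * periodic wave shape + noise; the envelope and the
        # noise sdev are the real parameters, the phase and the exact
        # noise samples are the irrelevant nonsense
        env = np.exp(-4.0 * t)
        phase = rng.uniform(0.0, 2.0 * np.pi)
        wave = np.sign(np.sin(2.0 * np.pi * freq * t + phase))
        return env * wave + rng.normal(0.0, 0.05, t.size)

    rng = np.random.default_rng()
    x = packet(220.0, rng)
    y = packet(220.0, rng)   # same params, different phase & noise

    mse = np.mean((x - y) ** 2)
    psnr = 10.0 * np.log10(np.max(np.abs(x)) ** 2 / mse)
    print(psnr)              # just a few dB - way, way off - yet it sounds the same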

If for some reason you still disagree, here's another way to see this : you can go to [KONTROL] or something and download some techno DJ mix MP3s. These are generally around 80 MB an hour. However, obviously they can be reproduced from the original tracker data, which is note timing + effect parameters over time. The total data for all that is maybe a few MB, perhaps less - it's very tiny. Modern techno is not trivial like MIDI; the tools run a ton of complex processing with parameters that change over time. Now obviously techno is simpler than some guitar song, but the actual information content is not several orders of magnitude different.
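Back-of-envelope on the tracker side (all the numbers here are my guesses, deliberately overcounted) :

    # one hour of tracker data at 140 BPM on a 16th-note grid
    rows  = 60 * 140 * 4
    cells = rows * 16                 # 16 tracks, every single cell filled
    cell_bytes = 4                    # note, instrument, effect, effect param
    print(cells * cell_bytes / 1e6)   # ~2.2 MB, before pattern reuse shrinks it
    # versus 192_000 * 3600 / 8 / 1e6 ~= 86 MB for the MP3 of the same hour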
