07-01-08 - 5

My audio knowledge really sucks, but I feel like there's this massive gap between where scientists think they are in research and what's actually available to the consumer. Back in 1999 I was talking to guys about the MPEG-4 audio spec (BTW MPEG-4 is not ".mp4"), and it includes things like decomposition of music into separate instruments, modeling of instruments from various digital models of real sound generators, and encoding of notes as synth+error. That all sounds great, but where is it? The technology we're actually using is from 1992 (basic blocked frequency-space stuff).

It occurs to me that unmixing an audio track is basically the same problem as the "image doubler" I've written about before. If you consider the simple case that you're trying to unmix into 2 streams, then it's very very similar to the image doubler. Basically you have a constraint that A + B = I, that is, streams A and B add up to the input I. This is underconstrained of course, there are an infinite number of solutions, but that doesn't make it impossible.
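To make the underconstraint concrete, here's a tiny sketch (NumPy, with made-up sample values): pick any stream A you like and the constraint forces B, so every choice of A is a valid unmixing.

```python
import numpy as np

# Given input samples I, ANY candidate stream A yields a valid B = I - A,
# so there are infinitely many decompositions satisfying A + B = I.
I = np.array([1.0, 0.5, -0.25, 2.0])  # made-up input samples
rng = np.random.default_rng(0)
for _ in range(3):
    A = rng.standard_normal(I.shape)  # arbitrary candidate stream
    B = I - A                          # the other stream is then forced
    assert np.allclose(A + B, I)       # constraint holds every time
```

The constraint alone never rejects anything; all the work has to come from a prior over which streams are plausible.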

You assume that your component streams A & B come from some likely real-world audio-generating sources. That is, just like in all data compression, not all streams are equally likely, and in fact very very few streams are likely. Each way of unmixing a given sample has a certain probability of having been generated, and you just pick the single most likely way.
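As a toy version of "pick the single most likely way": suppose (hypothetically, this model is not from the post) the two streams have zero-mean Gaussian priors with different variances. Then the most likely split of one sample x into a + b has a closed form:

```python
# Toy MAP unmix of a single sample x into a + b, assuming zero-mean
# Gaussian priors on a and b with variances va and vb (a made-up prior).
# Maximizing P(a) * P(b) subject to a + b = x gives:
#   a = x * va / (va + vb),   b = x - a
def map_split(x, va, vb):
    a = x * va / (va + vb)
    return a, x - a

a, b = map_split(1.0, 4.0, 1.0)
print(a, b)  # 0.8 0.2 : the stream with the wider prior soaks up more
```

The point is just that once you commit to a prior, the underconstrained problem picks out one answer.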

This is of course exactly what our brain does when we listen to things. Say you listen to a track. It starts off with a few piano notes. Then there are a few guitar plucks. Then there's a piano note + guitar pluck at the same time. In fact, this superposition is a big jumbled mess that looks like neither of the two pieces. And of course it might not actually be a piano note + guitar pluck, maybe it's some weird other sound source that creates sounds that are very similar to the combination of guitar + piano. But our brain assumes that the most likely source is right and it unmixes the sounds. Instead of your brain thinking "hey here's this new sound I've never heard before" it thinks "hey that's a guitar and piano combined". Of course software could do this exact same thing.
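A crude software version of that "most likely source" guess, using made-up template spectra for the two instruments: score each known explanation against the observed mixture and keep the one that explains it best.

```python
import numpy as np

# Hypothetical template "spectra" for a piano note and a guitar pluck.
piano  = np.array([1.0, 0.2, 0.0, 0.1])
guitar = np.array([0.0, 0.7, 1.0, 0.3])
mix = piano + guitar  # the jumbled superposition we observe

# Try every known explanation; pick the one with the smallest residual,
# standing in for "the most likely source".
candidates = {
    "piano": piano,
    "guitar": guitar,
    "piano+guitar": piano + guitar,
}
best = min(candidates, key=lambda k: np.sum((candidates[k] - mix) ** 2))
print(best)  # piano+guitar
```

A real system would search combinations of a big learned dictionary instead of three hand-picked templates, but the decision rule is the same shape.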

Exploring this stuff for purposes of compression is no longer super compelling, but it is interesting for processing. As I've always said, compression research is always useful even if you don't really care about compression ratio, because achieving high compression of data is equivalent to achieving high understanding of the data.
