9/02/2008

09-02-08 - 1

We were talking about SVD and PCA and KLT and such yesterday and it made me realize I'm still a little fuzzy on that stuff and it could use a really simple explanation. They're actually all exactly the same thing, and they're also closely related to eigen decomposition. Basically the PCA (Principal Component Analysis) or SVD (Singular Value Decomposition) gives you a new coordinate space (aka a set of basis vectors) that decorrelate the input data and concentrate as much energy as possible on the primary axes. If you like, they find the subset of the space that your data primarily spans, and the magnitudes tell you to what extent your data really needs all the extra dimensions. If you prefer to think in terms of transform theory, they construct the best possible linear matrix transform for decorrelating and concentrating data (like what the DCT does for JPEG). The KLT is just the name of the transform when you use the PCA vectors as your matrix transform.

As usual the Wikipedia pages are good. I'm going to be a bit sloppy with my maths and just try to hand wave a bit so you get the concept.

After I posted this Ignacio pointed me at a very cool tutorial by Jon Shlens : his page contains that and some other very cool stuff; it's a good tutorial, similar to mine but more correct and it's much more readable since it's PDF and all that.

There are two ways of thinking about this. One is you have a bunch of K-element vectors, N of them, it's like a bunch of samples of an experiment; you can of course put them together in a KxN matrix; in this mode you are usually looking for K-element basis vectors. The other way to think about it is just that you have some KxN matrix and you wish to find an approximation of it. In either case we'll call this matrix of KxN things M.

You need K basis vectors to span your K element vectors, of course. You will also get K N-element "cobasis" vectors. Your basis transform on your vectors will be a KxK matrix. Obviously the identity matrix is no transform and it corresponds to using a basis like {1000},{0100},{0010},{0001}. That is, instead of thinking of your data element as {a,b,c} think of it as a*(100) + b*(010) + c*(001). That original basis is not necessarily the best, or even the "right" one.

We can think about this in terms of data compression. For example maybe you have a sound file with K tracks and N samples in each track. If you just compress the tracks one by one in that space, there's a lot of correlation between the tracks you're wasting. What you want is to find a different basis in which all the same parts of the tracks are mashed together and each track only has the new unique bits of that track. Then, for example if two tracks are exactly the same, a whole track will be zeros. In general you can model any linear relationship between tracks with a linear transform like a matrix multiply, eg. stuff like D = A + B - C can be completely modeled out.
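Here's a tiny numpy sketch of that idea (the track names and data are made up for illustration): four tracks where the fourth is exactly A + B - C, so after moving to the PCA basis one whole direction carries essentially zero energy.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1000
    A = rng.standard_normal(N)
    B = rng.standard_normal(N)
    C = rng.standard_normal(N)
    D = A + B - C                           # a track that is a linear combo of the others

    M = np.stack([A, B, C, D])              # K x N : K = 4 tracks, N samples each
    M = M - M.mean(axis=1, keepdims=True)   # remove the mean of each track

    cov = M @ M.T                           # the K x K covariance matrix (defined below)
    print(np.linalg.eigvalsh(cov))          # smallest eigenvalue is ~0 : one track is redundant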

The best possible linear basis transform is the PCA. The PCA finds the "principal components" of the set. These basis functions are orthogonal and completely decorrelate the dimensions (in the strict linear correlation sense; more on this later). We can figure out the PCA from this :

The "covariance" of dimension i with dimension j is :

Cij = Sum(n=1 to N) { M'in * M'jn }

(M' = M with the mean of each dimension subtracted off; note this is almost the "correlation" but the sdevs are not divided out). I'm going to drop the prime on the M and just assume that you took out the mean from now on, but make sure you do that.

Of course we can just write this same thing in matrix notation :

C = M' * M'^T

^T = Transpose
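In numpy terms, a little helper (the name is just for illustration), using the K x N layout from above and, like the text, not bothering to divide by N :

    import numpy as np

    def correlation_matrix(M):
        # M is K x N : K dimensions, N samples
        Mp = M - M.mean(axis=1, keepdims=True)   # subtract the mean of each dimension
        return Mp @ Mp.T                         # C[i,j] = Sum_n M'[i,n] * M'[j,n]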

The diagonal elements of C are the variances of the dimensions (if we had divided out the sdevs, the correlation of a dimension with itself would be exactly 1). C is symmetric, and the off-diagonal elements tell you the correlation of one dimension with another. We want a basis transform that makes the cross-correlations zero. That is we want to find :

D = B * C * B^T

where D is some diagonal matrix and B is a basis transform.

For those who know something about eigenvectors you can see that the basis transform we want is just the eigenvectors of C.

The eigenvector has the property (by definition) that

C * e_i = k_i * e_i

That is, multiplying by C gives you the same vector scaled by some number. That is, C does not rotate the eigenvector at all, it stays in the same direction. That means if the eigenvectors were our basis vectors, then multiplying by C on any vector would only scale the coordinates in each dimension, it would not ever munge one coordinate into another.
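A quick numerical check of that property (just a sketch; np.linalg.eigh is numpy's symmetric eigensolver, which is the right one here since C is symmetric) :

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.standard_normal((5, 200))              # K = 5 dimensions, N = 200 samples
    Mp = M - M.mean(axis=1, keepdims=True)
    C = Mp @ Mp.T

    k, E = np.linalg.eigh(C)                       # eigenvalues k, eigenvectors in the columns of E
    e0 = E[:, 0]
    print(np.allclose(C @ e0, k[0] * e0))          # True : C only scales e0, it doesn't rotate it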

Any matrix can always be written as a sum of outer products over a spanning set of vectors. That is :

C = Sum{ij} Cij * |i> * <j|

where |i> is the ith basis vector and <j| is the jth basis vector transposed.

If we write it in terms of the eigenvectors, then C must be diagonal, because we know C turns e_i into k_i * e_i with no component along any other e_j, thus all the off-diagonal terms must be zero, so

C = Sum{i} k_i * |e_i> * <e_i|

since <e_i|e_j> = delta_ij , that is, the eigenvectors are orthonormal. The basis matrix B that we wanted is just all the eigenvectors stacked up as rows, and the diagonal matrix D that we get is just the eigenvalues down the diagonal.

So, we see that the eigenvectors of C are a diagonal representation of C, which means that the non-self correlations are zero, which is exactly what we want, which means this B is exactly the basis transform we want for PCA.
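Concretely, a sketch of that statement : stack the eigenvectors as the rows of B and B * C * B^T comes out diagonal, with the eigenvalues down the diagonal; sorting them biggest-first gives the PCA order.

    import numpy as np

    rng = np.random.default_rng(2)
    M = rng.standard_normal((4, 500))
    Mp = M - M.mean(axis=1, keepdims=True)
    C = Mp @ Mp.T

    k, E = np.linalg.eigh(C)
    order = np.argsort(k)[::-1]                    # biggest eigenvalue (most energy) first
    k, E = k[order], E[:, order]

    B = E.T                                        # rows of B are the eigenvectors = the PCA basis
    D = B @ C @ B.T
    print(np.allclose(D, np.diag(k)))              # True : all the cross-correlations are zero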

Now, the eigenvalues also tell us how much energy is in each axis. If you look back at the definition of the correlation matrix, you see the diagonal terms are just the energy (the squared L2 norm) of the data in each dimension. Now if we look at C in terms of eigenvectors we can see the eigenvalues of C are just equal to that same energy measured along each eigenvector direction. If we sort the eigenvalues so k0 is the biggest, k1 is the next biggest, etc. this is the PCA.

Now let me look at the exact same thing a completely different way in terms of successive approximation.

You want to construct basis vectors which are orthogonal and decorrelating. You want the 0th to have most of the signal energy, then the 1st the next most, etc. You can do this just by learning them one by one.

First find a basis vector b such that the linear correlation of b to the vectors in M is maximized. That's actually the same as minimizing the least squares error of a fit with arbitrary coefficients :

Err = Sum{i in K, n in N} { (Min - an * bi)^2 }

for some vector a ; b is a basis vector that has K elements, a is a "cobasis" that has N elements. Either you can think about this as trying to find b and the a's are just arbitrary coefficients, or you can think of it as trying to approximate the matrix with an outer product of two vectors.
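Here's a sketch of that fit done by simple alternating least squares (illustrative only, not the fast way to do it, and the helper name is made up) : holding a fixed, the best b is a closed-form solve, and vice versa, so you just ping-pong until it settles.

    import numpy as np

    def fit_rank_one(M, iters=100):
        # find b (K elements) and a (N elements) minimizing Sum_{i,n} (M[i,n] - a[n]*b[i])^2
        K, N = M.shape
        b = np.ones(K)
        for _ in range(iters):
            a = M.T @ b / (b @ b)      # best coefficients a for this b
            b = M @ a / (a @ a)        # best basis vector b for these a
        return b, a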

This gives your first b. Then subtract the outer product |b> * <a| off of M and repeat the fit on what's left to get the next b, and so on. The result that we get is a representation of M in terms of a sequence of outer products :

M = Sum{i} |bi> * <ai|

The a are N-vectors and the b are K-vectors, our bases. If we normalize the a and b and pull out their lengths this is :

M = Sum{i} si * |b'i> * <a'i| , or in matrix form M = B^T * S * A

Where S is a KxK diagonal matrix of the si values, A is KxN (the normalized a'i stacked as rows), and B is KxK (the normalized b'i stacked as rows). Now the correlation matrix is

C = M * M^T = B^T * S * (A * A^T) * S * B = B^T * S*S * B

since the rows of A are orthonormal, so A * A^T is the identity.

This is exactly like our eigenvector decomposition of C, except we have S*S where the eigenvalues were. Well, they must be the same thing. So the B that we found by successive fitting is in fact exactly the same as the eigenvectors of the correlation matrix.

Well this successive fitting is just the SVD, the Singular Value Decomposition. The S are the singular values - and we see now that the S are the square roots of the eigenvalues of the correlation matrix. Or rather, if you square the singular values you get the energy (the squared L2 norm) of the data along each basis vector, and they are sorted from largest to smallest.
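You can check that relationship directly with numpy's SVD (a sketch; np.linalg.svd returns the singular values already sorted largest first) :

    import numpy as np

    rng = np.random.default_rng(3)
    M = rng.standard_normal((6, 300))
    Mp = M - M.mean(axis=1, keepdims=True)

    s = np.linalg.svd(Mp, compute_uv=False)        # singular values of M'
    k = np.linalg.eigvalsh(Mp @ Mp.T)[::-1]        # eigenvalues of the correlation matrix, sorted to match

    print(np.allclose(s * s, k))                   # True : squared singular values = eigenvalues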

What we have found is a K -> K orthogonal basis transform that takes us to a coordinate space where the dimensions are linearly decorrelated and as much energy as possible (in an L2 sense) is in the lowest dimensions.

This is often done as a way of reducing the number of dimensions you have to deal with. You can see how much L2 energy is in each dimension, and just say I'll use enough dimensions to capture 99% of the energy and just ignore the rest. It also lets you see when you have redundant dimensions. If one dimension can be written as a linear combo of another dimension, that's equivalent to saying you actually have lower dimensional data embedded in higher dimension space, like your data is actually planar but it's embedded in 3d space. The PCA vectors will be two primary vectors in the plane of the data, and one vector with very low energy orthogonal to them.
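For example, a sketch with made-up data : 3d points that actually lie in a plane. The PCA energies show two real dimensions and one that's essentially empty, and the cumulative energy tells you how many dimensions you need to keep.

    import numpy as np

    rng = np.random.default_rng(4)
    N = 1000
    u = rng.standard_normal(N)
    v = rng.standard_normal(N)
    # 3d data that really lives in the plane spanned by two fixed directions
    M = np.outer([1.0, 2.0, 0.5], u) + np.outer([0.0, 1.0, -1.0], v)    # K = 3, N samples

    Mp = M - M.mean(axis=1, keepdims=True)
    k = np.linalg.eigvalsh(Mp @ Mp.T)[::-1]         # energies, biggest first

    energy = np.cumsum(k) / k.sum()
    print(energy)                                   # reaches ~1.0 by the second entry : the third axis is empty
    print(np.searchsorted(energy, 0.99) + 1)        # 2 dimensions capture 99% of the energy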

When this is done in data compression, it's often called the Karhunen-Loève transform (KLT). PCA/SVD/KLT is obviously good for data compression not just because it decorrelates (linearly), but also because it puts as much energy as possible up front and then you can just truncate the stream.

The KLT is orthonormal, which means it's just a rotation. Orthonormal transforms are not required for reversibility; obviously any matrix that has an inverse is a reversible transform. The nice thing about Orthonormal is that the L2 norm of the data is preserved, which means that the magnitudes of the transformed data are in a space you can understand and compare to untransformed norms. That means you can measure the distortion in transformed space, you can quantize in transformed space, etc. and not have to worry about how that scales to the original space. In fact Wavelet transforms are not orthogonal, but are reversible, and the "short time fourier transform" that's sometimes used for audio compression is not even reversible the way it's often implemented. I should also note that if your input data is in integers, an arbitrary matrix may not map ints to ints and thus not be reversible. However, since the KLT is orthonormal it can be turned into a lifting transform that does map ints to ints, at the cost of not exactly decorrelating. (actually, most Wavelet transform implementations are not norm-preserving, which is fine, but the vast majority of implementations that I've seen screw up and fail to account for that when doing quantization and measuring error).
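The norm-preservation part is easy to check numerically (a sketch; B here is the same eigenvector basis as before) :

    import numpy as np

    rng = np.random.default_rng(5)
    M = rng.standard_normal((4, 100))
    Mp = M - M.mean(axis=1, keepdims=True)

    k, E = np.linalg.eigh(Mp @ Mp.T)
    B = E.T                                        # the KLT basis : an orthonormal matrix (a rotation)
    T = B @ Mp                                     # the transformed data

    # per-sample L2 norms are unchanged, so you can measure error / quantize in transform space
    print(np.allclose(np.linalg.norm(T, axis=0), np.linalg.norm(Mp, axis=0)))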

It's interesting to compare the KLT to something like the Fourier transform. The KLT has the property that the best subset of the data is the prefix. That is, the best possible 4 value representation is just the first 4 values, not ever the 5th or the 6th. (this is true over the whole set that the KLT was made for, not necessarily true on any one given sample).

The Fourier Transform, or more correctly the DFT, is also a basis transform that can be written as a matrix transform. (we don't actually implement the DFT as a matrix transform because a matrix multiply is N^2 and the DFT can be done in NlogN due to the magic of the FFT). The Fourier Transform also has the optimal prefix property (called the "least squares property") - that is the first K terms of the DFT are the best possible K terms of any trigonometric transform (that is, no later terms could be substituted). This is very useful because it means we can truncate the terms correctly, the first ones are always the most important. However, unlike the KLT, the DFT is not necessarily fully decorrelating, and that means that the energy in the lower dimensions is not maximized.
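Just to make the "it's a matrix" point concrete, here's a sketch that builds the DFT as an explicit N x N matrix and checks it against the FFT; doing it this way is O(N^2), the FFT factorization is what gets you NlogN.

    import numpy as np

    N = 8
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # the N x N DFT matrix

    x = np.random.default_rng(6).standard_normal(N)
    print(np.allclose(W @ x, np.fft.fft(x)))       # True : same transform, just done the slow way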

You could imagine taking a ton of images and gathering all the runs of 8 pixels into samples, so you have K=8 and N samples. Do the PCA to find the best spanning 8-vectors and use these instead of the DCT in something like JPEG. What you would find is that your PCA vectors would look very much like the DCT basis vectors. And you broke the magic NlogN fastness of the FFT.
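A sketch of that experiment with synthetic data (a smooth AR(1) signal standing in for real image rows, since I'm not going to ship a ton of images in a snippet) : the PCA basis of the 8-sample blocks lines up closely with the DCT-II basis.

    import numpy as np

    rng = np.random.default_rng(7)
    # a smooth, highly correlated signal as a stand-in for runs of 8 pixels
    steps = rng.standard_normal(80000)
    x = np.empty_like(steps)
    x[0] = steps[0]
    for t in range(1, len(steps)):
        x[t] = 0.95 * x[t - 1] + steps[t]

    M = x.reshape(-1, 8).T                         # K = 8, N blocks, as in the text
    Mp = M - M.mean(axis=1, keepdims=True)
    k, E = np.linalg.eigh(Mp @ Mp.T)
    pca_basis = E[:, ::-1].T                       # rows = PCA basis vectors, most energy first

    # DCT-II basis vectors, normalized, for comparison
    i = np.arange(8)
    dct_basis = np.cos(np.pi * np.outer(i, 2 * i + 1) / 16.0)
    dct_basis /= np.linalg.norm(dct_basis, axis=1, keepdims=True)

    # |cosine similarity| between matching PCA and DCT vectors : should come out close to 1
    print(np.abs(np.sum(pca_basis * dct_basis, axis=1)))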

BTW people in video games have long used the eigenvectors of the correlation matrix (aka the PCA) as basis vectors to find bounding volumes around points. This is complete crap. The PCA has absolutely zero connection to what axes would make good bounding volume axes. There are established very good algorithms for fitting bounding volumes that work in totally different ways.

Why don't people use KLT transforms more in data compression? Well, for one thing, you have to transmit your basis. If you use something like the DCT the basis is implicit, you're assuming some knowledge about the type of data you're working on which saves you bits in the stream. Also obviously it's slower, though the slowness on encode isn't as big a deal as the slowness on decode (there are lots of applications where slow encode would be fine). The main reason is that getting your transform perfect is actually not that important to modern data compression. If you are using a fancy quantizer and a fancy context coder on the back end, then mistakes in the transform can largely be hidden. For example, if your transform is not perfectly decorrelating, your entropy coder can largely model any correlation that's left over and hide that mistake. The simpler your back end is, the more your transform matters.
