I made a movie viewer app in Galaxy3 to view the Netflix movie database as points in 3d, located by the SVD to rank 3. Basically movies that are correlated are neighbors in a 3d space. I fade the names of the movies in & out such that when the font scale is near 1.0 they are opaque and as they get too big or too small they fade out. It doesn't work worth a damn, you can't find any good information by looking at it. Oh well.

The SVD was an okay way to get to 3d, but I was thinking about the general distance-embedding problem. Say you have N points in very high dimension (what dimension is perhaps unknown and you may not even know the coordinates of the points). What you do have are a bunch of point-point distances (more than N, but perhaps not all N*N of them). You wish to find an embedding in some lower dimensional space of K dimensions (such as K=3), such that the distances in K are as close as possible to the given distances.

Obviously if K >= the necessary dimension of the points, you can find an exact location that satisfies the desired distances. In general you can't satisfy them exactly.

You want to minimize the error

E = Sum_ij { ((Xi - Xj)^2 - Dij^2)^2 }Where Dij is the desired distance between points i and j, and you optimize the set of points X, and the sum is taken only over the pairs where Dij is defined (Dij may be sparse).

I don't think there's any way to solve this directly. What we can do is the old trick of solving for one Xi and pretending all the other X's are held constant, do that for all the X's, that's one step, and then keep stepping until it converges.

Starting from the slightly different error measure :

E = Sum_ij { (|Xi - Xj| - Dij)^2 }Take deritive in Xi and set to zero, do some algebra and you get :

Xi = Sum_j { Xj + Dij * Unit(Xi - Xj) } / Sum_j { 1 }Where Unit(V) = V / |V| , and the Sum is over terms where Dij is given and i != j

Now this doesn't look too useful because it's a solution for X in terms of X, but we can play a trick. The X on the right hand side is the X from the last "time step", the X on the left hand side is new X that we get for the current time step.

You can actually see what's happening here geometrically. This is like the Verlet constraint solver that game people use a lot. We have all these pair distance constraints of Dij. To solve for the new Xi, we take each "fixed" point the old Xj, and we make the point that satisfying the distance constraint by walking along the unit vector to the old Xi. Then we just average all these points that satisfied each constraint.

Now that we see that, we can play with this. In particular, we can weight each of the terms in this average. Rather than treat all the other Xj are equal, what we really care about are the points near our Xi. That is, I don't really care much about getting the very large distances exactly right, but I do care a lot about getting the closest points right. In particular I really want the identity of which point is my closest neighbor to be preserved by this mapping. Now that we think of this we can see our original choice if squared distance error was pretty poor. The squared distance error says that it's more important to turn 1004 into 1000 than it is to turn a 4 into a 1. Obviously that's wrong. We should have something that measures the fractional error, maybe

E = Sum_ij { (|Xi - Xj|/Dij - 1)^2 }

In any case, its obvious a decent weighting now would just be 1/Dij. That makes smaller D's much more important.

next Xi = Sum_j { Xj/Dij + Unit(Xi - Xj) } / Sum_j { 1/Dij }

Note that if any Dij are zero, this form snaps those points together exactly, which is good.

This iteration looks good, but there're a few niggles that still bothering me. This thing is pretty sensitive to the initial seeding and I don't have a really good way to do that. I have a hunch this iteration may be prone to getting stuck in local minima too.

I guess classically this problem is called Multidimensional Scaling . In 2000 in Science 290 there were two papers that seem to have shaken up this old problem. One introduced Isomap ( Isomap homepage ) , the other was Locally Linear Embedding . Both are pretty sexy.

## No comments:

Post a Comment