# cbloom rants

## 6/02/2008

### 06-02-08 - 4

Funk's "Hebbian" way of doing the SVD incrementally learns from each element of the residual matrix one by one. The BellKor paper uses my method of stepping rows, and points at this "EM PCA" paper as the source. I don't really get the point of that. You can derive the whole row SVD method very trivially using variational minimization.

You've already found the first N singular vectors and you wish to find the next one. You have a "residual" matrix Rij which is the original matrix minus the SVD so far. Think of the problem as a best fit approximation. The next rank SVD are two vectors Ui and Vj which you want to approximate the remaining residual as well as possible with the outer product Ui*Vj. So you have an error :

E = Sum_ij { Wij * (Rij - Ui*Vj)^2 }

Note that I've added a weight term, Wij, which measures the importance of each element. I won't say more about this but it's a cool thing to have. Also note that the Sum may be sparse, which is the same as having W = 0 for some terms. In particular there may be elements where Rij is unknown. For those we set Wij = 0 and they don't appear in the sum.

Now, we can just use this, but even better is to regularize. Since the sum is usually very sparse, we may have a bad problem with overfitting, which in practice usually means that instead of the U and V being around the right magnitudes they wind up with huge very large magnitude values that oscillate like crazy. To regularize we just add a penalty to the error for large U and V :

E = Sum_ij { Wij * (Rij - UiVj)^2 } + Sum_i alpha * Ui^2 + Sum_j beta * Vj^2

Where alpha and beta are regularization parameters. Their value must be tweaked.

To solve we just take the derivative in Ui and Vj and set the result to zero. It's some trivial algebra and you get :

Ui = Sum_j { Wij Rij Vj } / ( alpha + Sum_j { Wij Vj^2 } )
Vj = Sum_i { Wij Rij Ui } / ( beta + Sum_i { Wij Ui^2 } )

Now we can't just plug these together and solve, but we can do the hacky iteration solution. We have an equation for U only in V, and an equation for V only in U, so we step them over time :

U(t) = func( V(t-1) }
V(t) = func( U(t) }

Note that V is computed from the current U, not the previous one, so it's not a symmetric step. This is actually a very general hacky way to iterate hard equations. I talked about it here a while ago. In fact I mentioned before it's a cool trick for solving hard equations that are only in one variable by introducing a fake 2nd variable.

Now supposedly it's been proven that this iteration for the SVD actually converges and all that. That kind of rigor is beyond me, I'm afraid, but I have crufty experience with this kind of iteration. In practice I've found a few hacky options can help a lot.

Time centered parameter for V :

U(t) = func( V(t-1) }
V(t) = func( lerp(U(t-1),U(t),0.5) }

Incomplete step :

U(t) = lerp( U(t-1), func( V(t-1) ) , frac)
V(t) = lerp( V(t-1), func( U(t) ) , frac )

where frac is some fraction of a full step. This in theory doesn't change convergence but can make you converge must faster and prevent oscillation. I really don't like the way other guys have to cut off their iteration to avoid overfitting. For me, iterations should smoothly take me towards the ideal value, and I should only be cutting them off because I'm close enough and don't want to spend more CPU time getting closer. With some of the Netflix SVD implementations it's a huge hacky tweak parameter to do just the right amount of iteration.

Also note that you need some initial values. You actually only need one, V(0) then you can compute U(0) from that. All zeros is not an okay choice. All ones is not too bad. I usually use all random values in [0,1].

I should add that the BellKor guys do some other heuristic things that seem to improve their SVD. After they compute each pair of singular vectors, they scale down the residual before doing the next pass. This explicitly forces the later singular values to be smaller (something that should be happening automatically, but doesn't always work). It means that you're only modeling a portion of the error in the higher vectors, which turns out to be a good thing because a lot of that error is just noise and you don't want to model the noise.