To solve A x = b in the least squares sense (that is, minimize |Ax - b|^2) with weights W, you are just minimizing the error :
E = Sum_i { W_i * ( Sum_j { A_ij * x_j } - b_i )^2 }

E = Sum_i { ( Sum_j { sqrt(W_i) * A_ij * x_j } - sqrt(W_i) * b_i )^2 }

so set

A'_ij = A_ij * sqrt(W_i)
b'_i = b_i * sqrt(W_i)

then

E = Sum_i { ( Sum_j { A'_ij * x_j } - b'_i )^2 }

so we can just do a normal lsqr solve for x using A' and b'.
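For concreteness, here's a minimal C++ sketch of that row-scaling trick on made-up data (a weighted straight-line fit with two unknowns). Any ordinary least-squares solver could consume the A' and b' it builds; the tiny closed-form 2x2 normal-equations solve is only there to keep the example self-contained :

// Minimal sketch : multiply row i of A and entry i of b by sqrt(W_i), then
// hand the scaled system to any ordinary least-squares solver. Here the
// "solver" is just the 2x2 normal equations for a line fit; the data and
// weights are made up for illustration.
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    // observations : b_i ~ x0 + x1 * t_i , with per-row weights W_i
    std::vector<double> t = { 0.0, 1.0, 2.0, 3.0 };
    std::vector<double> b = { 0.1, 1.9, 4.2, 5.8 };
    std::vector<double> W = { 1.0, 4.0, 1.0, 0.25 };

    // A has rows [ 1 , t_i ] ; form A'_ij = A_ij * sqrt(W_i) , b'_i = b_i * sqrt(W_i)
    // then accumulate the normal equations (A'^T A') x = A'^T b'
    double ata00 = 0, ata01 = 0, ata11 = 0, atb0 = 0, atb1 = 0;
    for ( size_t i = 0; i < t.size(); i++ )
    {
        double s = std::sqrt( W[i] );
        double a0 = s * 1.0;      // A'_i0
        double a1 = s * t[i];     // A'_i1
        double bp = s * b[i];     // b'_i
        ata00 += a0*a0; ata01 += a0*a1; ata11 += a1*a1;
        atb0  += a0*bp; atb1  += a1*bp;
    }

    // solve the 2x2 system by Cramer's rule
    double det = ata00*ata11 - ata01*ata01;
    double x0 = ( atb0*ata11 - ata01*atb1 ) / det;
    double x1 = ( ata00*atb1 - atb0*ata01 ) / det;

    std::printf("weighted fit : b ~= %f + %f * t\n", x0, x1);
    return 0;
}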
If you have a Gaussian random variable, the probability of an event x is P(x), and the (natural log) bits to code that event is -ln(P(x)) :
P(x) = 1/(sigma * sqrt(2pi)) * e^( - 0.5 * ((x-c)/sigma)^2 )

H = -ln(P(x)) = ln(sigma * sqrt(2pi)) + 0.5 * ((x-c)/sigma)^2

H = const + ln(sigma) + 0.5 * ((x-c)/sigma)^2

And bits to code events is a good measure of how well events fit a model, which is useful for training a model to observed events. In particular if we ignore the constants and the ln term,
total H = Sum_i { (1/sigma_i^2) * (x_i - c_i)^2 }

the bit cost measure is just an L2 norm weighted by one over sigma^2 - which is intuitive : matching the prediction of very narrow Gaussians is much more important than matching the prediction of very wide ones.
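As a quick sanity check, here's a tiny self-contained C++ snippet (the values of c, sigma, and x are made up) that computes H = -ln(P(x)) directly and in the split form above, and prints the data-dependent term that gives the 1/sigma^2 weighting :

// Compute the cost to code one event under a Gaussian with mean c and
// deviation sigma, both as -ln(P(x)) and in the const + ln(sigma) + residual
// form; the part that depends on the data is the squared residual scaled by
// 1/sigma^2 (up to the constant factor of 1/2).
#include <cmath>
#include <cstdio>

int main()
{
    const double kPi = 3.14159265358979323846;
    const double c = 3.0, sigma = 0.5, x = 3.7;   // made-up model and observation

    double P = 1.0 / ( sigma * std::sqrt(2.0*kPi) )
             * std::exp( -0.5 * ((x-c)/sigma) * ((x-c)/sigma) );
    double H_direct = -std::log(P);

    double H_split = std::log( sigma * std::sqrt(2.0*kPi) )
                   + 0.5 * ((x-c)/sigma) * ((x-c)/sigma);

    std::printf("H = %f natural-log bits (split form %f)\n", H_direct, H_split);
    std::printf("data-dependent part = %f = 0.5 * (1/sigma^2) * (x-c)^2\n",
                0.5 * (x-c)*(x-c) / (sigma*sigma) );
    return 0;
}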
Now that I have autoprintf, I find this very handy :
#define lprintfvar(var) lprintf(STRINGIZE(var) " : ",var,"\n");
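For reference, here's a self-contained stand-in for that macro using only the standard library (lprintf / autoprintf are my own utilities, so std::cout substitutes for them here; STRINGIZE is the usual two-level stringizing macro) :

// Stand-alone analogue of lprintfvar : prints the variable's name, then its value.
#include <iostream>

#define STRINGIZE2(x) #x
#define STRINGIZE(x)  STRINGIZE2(x)

#define lprintfvar(var) std::cout << STRINGIZE(var) " : " << (var) << "\n"

int main()
{
    int blockSize = 4096;
    double ratio = 2.37;
    lprintfvar(blockSize);   // prints "blockSize : 4096"
    lprintfvar(ratio);       // prints "ratio : 2.37"
    return 0;
}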
Here's how I look at it:
The normal equations are A^T A x = A^T b, an orthogonal projection of b onto A's column space by taking inner products. The L^2 inner product of vectors and matrices is P^T Q, and with inner product matrix S you just throw S into the middle so that it's P^T S Q.
Applied to our normal equations for A and b, that works out to (A^T S A) x = A^T S b or equivalently (sqrt(S)^T A)^T (sqrt(S) A) x = (sqrt(S)^T A)^T (sqrt(S) b). Since S is symmetric (and hence sqrt(S) is also symmetric), this is A'^T A' x = A'^T b' where A' = sqrt(S) A and b' = sqrt(S) b, which is just the least-squares problem for A' and b'.
I started writing all this out because I thought you were missing some square roots in your A' and b' definition. But then I noticed you defined the skewed inner product with weight factors sqrt(W_i) rather than W_i, so we do get the same result.
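Here's a quick numeric check of that identity on a tiny made-up example with diagonal S (the weighted-lsqr case) : A'^T A' matches A^T S A entry for entry, and A'^T b' matches A^T S b.

// Verify numerically that with A' = sqrt(S) A and b' = sqrt(S) b (diagonal S),
// the normal equations built from A', b' equal A^T S A and A^T S b.
#include <cmath>
#include <cstdio>

int main()
{
    const int M = 3, N = 2;
    double A[M][N] = { {1, 2}, {3, 4}, {5, 6} };   // made-up
    double b[M]    = { 1, 0, 2 };
    double S[M]    = { 2.0, 0.5, 1.0 };            // diagonal weight matrix entries

    double lhsM[N][N] = {}, rhsM[N][N] = {};       // A^T S A  vs  A'^T A'
    double lhsV[N]    = {}, rhsV[N]    = {};       // A^T S b  vs  A'^T b'
    for ( int i = 0; i < M; i++ )
    {
        double s = std::sqrt( S[i] );
        for ( int j = 0; j < N; j++ )
        {
            lhsV[j] += A[i][j] * S[i] * b[i];
            rhsV[j] += ( s * A[i][j] ) * ( s * b[i] );
            for ( int k = 0; k < N; k++ )
            {
                lhsM[j][k] += A[i][j] * S[i] * A[i][k];
                rhsM[j][k] += ( s * A[i][j] ) * ( s * A[i][k] );
            }
        }
    }

    for ( int j = 0; j < N; j++ )
    {
        std::printf("A^T S b [%d] = %f , A'^T b' = %f\n", j, lhsV[j], rhsV[j]);
        for ( int k = 0; k < N; k++ )
            std::printf("A^T S A [%d][%d] = %f , A'^T A' = %f\n", j, k, lhsM[j][k], rhsM[j][k]);
    }
    return 0;
}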
You can also think of it like you're doing a least-squares solve in a scaled space (eg. "the y axis is twice as important/big as X and Z").
ReplyDelete"and with inner product matrix S you just throw S into the middle so that it's P^T S Q."
Mmm this is an interesting way of thinking about it.
It reminds me of the "kernel trick" for SVM, which is really a general method.
Anytime you see something that works like an inner product, you can replace it with something else that works like an inner product.
In fact, rather than just using this for weighted-lsqr I imagine you could do kernel-lsqr this way. ... searching ... yep, they're doing it, see for example "kernel conjugate gradient" which appears to be an open area of research right now.
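Here's a rough sketch of that idea, using plain kernel ridge regression (a direct solve on made-up 1D data) rather than the kernel conjugate gradient mentioned above - the kernel matrix K_ij = k(x_i,x_j) takes the place of the inner product :

// Fit f(x) = Sum_i alpha_i * k(x_i, x) by solving (K + lambda*I) alpha = y.
// Everything here (data, bandwidth, lambda) is made up for illustration.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Gaussian (RBF) kernel standing in for the dot product
static double kernel( double a, double b )
{
    const double bandwidth = 0.5;
    double d = a - b;
    return std::exp( -d*d / (2.0*bandwidth*bandwidth) );
}

int main()
{
    // made-up 1D samples of a nonlinear function
    std::vector<double> x = { 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0 };
    std::vector<double> y;
    for ( double xi : x ) y.push_back( std::sin(xi) );

    const int n = (int) x.size();
    const double lambda = 1e-3; // ridge term keeps the solve well conditioned

    // build K + lambda*I and solve (K + lambda*I) alpha = y by Gaussian elimination
    std::vector<std::vector<double>> K( n, std::vector<double>(n) );
    std::vector<double> alpha = y;
    for ( int i = 0; i < n; i++ )
        for ( int j = 0; j < n; j++ )
            K[i][j] = kernel( x[i], x[j] ) + ( i == j ? lambda : 0.0 );

    for ( int col = 0; col < n; col++ )
    {
        // partial pivot
        int piv = col;
        for ( int r = col+1; r < n; r++ )
            if ( std::fabs(K[r][col]) > std::fabs(K[piv][col]) ) piv = r;
        std::swap( K[col], K[piv] );
        std::swap( alpha[col], alpha[piv] );

        // eliminate below the pivot
        for ( int r = col+1; r < n; r++ )
        {
            double f = K[r][col] / K[col][col];
            for ( int c = col; c < n; c++ ) K[r][c] -= f * K[col][c];
            alpha[r] -= f * alpha[col];
        }
    }
    // back substitution
    for ( int r = n-1; r >= 0; r-- )
    {
        for ( int c = r+1; c < n; c++ ) alpha[r] -= K[r][c] * alpha[c];
        alpha[r] /= K[r][r];
    }

    // evaluate the fit at a point that wasn't in the training set
    double q = 1.25, f = 0.0;
    for ( int i = 0; i < n; i++ ) f += alpha[i] * kernel( x[i], q );
    std::printf("f(%f) = %f (sin = %f)\n", q, f, std::sin(q));
    return 0;
}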