The primary principal component is just a linear fit to the data; then the other axes are perp and point in the next primary axes of variance. This is why a lot of people heuristically think of it as the "best fit" to the data.
For doing 2-means or BSP splitting, the primary principal component of course tells you the axis to split along. To find the best BSP plane, you can search along that axis in O(NlogN) by sorting all the data by their dot along that axis and trying the plane between each consecutive pair of points. Since you're incrementally stepping you can just increment to count the number of points on each side, as well as the mean & sdev of points of each side.
You should remove the means to calculate the covariance, but of course you don't have to remove the means to transform the data. The mean is just multiplied by the basis transform and becomes the mean in the new coordinate system. If you do remove the mean, then of course the mean in the new space is also zero.
You should *NOT* divide by sdevs in the covariance. The true "correlation" does divide by sdevs to give you a relative correlation measure that's scaled to [-1,1], where 1 = 100% correlation. The "covariance" without the sdevs divided out will have units of [value]^2 , just like the variance. A good thing to display is the *root* of the covariance, since that's in [value] units. I actually find root-covariance a more useful measure than the true correlation, because it tells you how significant the value correlation is in the original scale, and it won't exaggerate up small wiggles; obviously this depends on the problem you are solving. eg.   has a root-covariance of 2, a correlation of 1.0 , while   has a root covariance of 8.