Here's the one sentence version of PatchMatch : Seed your match field with some random guess or shitty initial matches; improve by incrementally propagating match offsets to neighbors and trying small random deltas to find improvements. (it's an absolute classic spin network magnetic moment relaxation kind of problem).
Here's what I've been doing : start with a match field set to all nulls (no match found yet). Then incrementally fill it in with matches and propagate them to neighbors. It proceeds in a few steps like this :
Step 1. Use computer vision methods to find feature points in a frame. Match feature points to the previous frame. This bit is a bit tricky and tweaky, you only want to make matches that you're pretty confident in. Note that this matching is done based on a "characteristic" of the feature point which has no distance limit, and is also somewhat immune to rotation and scaling and such. Sometimes this step finds some very good correspondences between the frames, but it's sparse - it only has high confidence at a few places in the frame, so you can't use it to find all the block matches (and you wouldn't want to even if you could). Generally this finds around 100 vectors.
Step 2. Find "distinctive" spots in the frame. The goal is to find some spots that are not degenerate - eg. not flat patches, not straight edges. The idea is that these are places where we can likely find a good motion vector with high confidence, unlike degenerate areas where there are lots of equally good match vectors. I use two mechanisms to find distinctive spots : one is the computer vision feature points that were not already used in the first matching step. The second is to take the "cornerness" map of the image using a Harris or Hessian operator on the derivative of gaussians (this is a lot like an edge map, but it kills straight edges). Find the top 5% highest cornerness values that are local maxima, and use those as distinctive spots. All of the distinctive spots do a long distance brute force block match (something like radius = 16 or 32) to try to find a good motion vector for them.
Step 3 : Flood fill to fill in the gaps. We now have presumably good motion vectors at a few key points in the frame. Go to their neighbors and search for match vectors that are close to the neighboring one that we already found. Put that in the blank and push its neighbors to the queue to continue the flood fill.
Step 4 : Relaxation pass. (this is not critical). We now have a motion vector everywhere in the frame. For each match vector in the frame, look at its 4 neighbors. Examine match vectors that are near my 4 neighboring vectors. If one is better, replace self. Continue to next. Theoretically you should do this pass a few times, but I find 1 or 2 is very close to infinite.
The key thing is that motion is usually semi-coherent (but not fully coherent, because we are not really trying to find true motion here, but rather just the best matching block, which is a lot more random than true motion is). By finding very good motion vectors in seed spots where we have high confidence, we can propagate that good information out to places where we don't have as much confidence. This lets us avoid doing large brute-force searches.
BTW I really do not understand the point of all the "diamond search" type shit in the video compression literature. It seems to just find really shitty motion vectors and is not making good use of the possibilities in the bit stream. Especially with GPU video encoding in this modern age, doing plain old big chunks of brute force motion search is preferrable. (yes, I know it's for speed, but it's a poor way to optimize, and the high quality encoders are still non-realtime anyway, so if you're not realtime you may as well take some more time and do better; plus the vast majority of use of non-realtime video encoders is in an encode-once decode-many type of scenario which means you should spend a lot of cpu and encode as well as possible).
With this method I find motion vectors using local searches of radius 8-16 that are the same quality as brute force searches of radius 50-100, which makes it about two orders of magnitude faster (and higher quality, since nobody does brute force searches that far).
ADDENDUM : To give this post a bit more weight, here are some numbers on quality from my video coder vs. brute force search radius :
-s16 : rmse : 9.3725 , psnr : 28.7277 -s26 : rmse : 9.2404 , psnr : 28.8510 -s48 : rmse : 9.0279 , psnr : 29.0531 -s64 : rmse : 8.9171 , psnr : 29.1603 -s100 : rmse : 8.7842 , psnr : 29.2907 -s9999: rmse : 8.5294 , psnr : 29.5465(-s16 means it's searching a 33x33 grid for motion vectors) (-s9999 means it searches full frame).
The above described iterative feature point propagation method gets
-sfast: rmse : 8.8154 , psnr : 29.2600BTW for doing full-frame brute force search you obviously should use a block-space acceleration structure for high dimensional nearest neighbor search, like a kd-tree, a bd-tree (box decomposition) or vp-tree (vantage point). High dimensional spaces are nasty though; the typical idea of "find a cell then walk to its immediate neighbors" is not fast in high D because you have O(D) neighbors.