EDIT : hmm whaddaya know, I found one.
You gather something like 20 high quality images of different characteristics. The number of base images you should use depends on how many tests you can run - if you won't get a lot of tests don't use too many base images.
Create a bunch of distorted images in various ways. For each base image, you want to make something like 100 of these. You want distortions at something like 8 gross quality levels (something like "bit rates" 0.125 - 2.0 in log scale), and then a variety of distortions that look different at each quality level.
How you make these exactly is a question. You could of course run various compressors to make them, but that has some weird bias built in as testers are familiar with those compressors and their artifacts, so may have predispositions about how they view them. It might be valuable to also make some synthetically distorted images. Another idea would be to run images through multiple compressors. You could use some old/obscure compressors like fractals or VQ. One nice general way to make distortions is to fiddle with the coefficients in the transformed space, or to use transforms with synthesis filters that don't match the analysis (non-inverting transforms).
The test image resolution should be small enough that you can display two side by side without scaling. I propose that a good choice is 960 x 1080 , since then you can fit two side by side at 1920x1080 , which I believe is common enough that you can get a decent sample size. 960 divides by 64 evently but 1080 is actually kind of gross (it only divides up to 8), so 960x1024 might be better, or 960x960. That is annoyingly small for a modern image test, but I don't see a way around that.
There are a few different types of test possible :
Distorted pair testing :
The most basic test would be to show two distorted images side by side and say "which looks better". Simply have the use click one or the other, then show another pair. This lets testers go through a lot of images very quickly which will make the test data set larger. Obviously you randomize which images you show on the left or right.
To pick two distorted images to show which are useful to test against, you would choose two images which are roughly the same quality under some analytic metric such as MS-SSIM-SCIELAB. This maximizing the amount of information you are getting out of each test, because when you put up images where one is obviously better than another you aren't learning anything (* - but this a useful way to test the user for sanity, occasionally put up some image pairs that are purely randomly chosen, that way you get some comparisons where you know the answer and can test the viewer).
Single image no reference testing :
You just show a single distorted image and ask the viewer to rate its "quality" on a scale of 0-10. The original image is not shown.
Image toggle testing :
The distorted and original image are shown on top of each other and toggled automatically at N second intervals. The user rates it on a scale of 0-10.
Double image toggle testing :
Two different distorted images are chosen as in "distorted pair testing". Both are toggled against the original image. The user selects which one is better.
When somebody does the test, you want to record their IP or something so that you make sure the same person isn't doing it too many times, and to be able to associate all the numbers with one identity so that you can throw them out if they seem unreliable.
It seems like this should be easy to set up with the most minimal bit of web programming. You have to be able to host a lot of bandwidth because the images have to be uncompressed (PNG), and you have to provide the full set for download so people can learn from it.
Obviously once you have this data you try to make a synthetic measure that reproduces it. The binary "this is better than this" tests are easier to deal with than the numeric (0-10) ones - you can directly test against them. With the numeric tests you have to control for the bias of the rating on each image and the bias from each user (this is a lot like the Netflix Prize actually, you can see also papers on that).