02-04-08 - 3

My next project is to get back on the Image Doubler and see if I can actually make the predictive/learning doubler do something worthwhile. I went looking for a big repository of hi-res research/reference images a while ago and couldn't find a damn thing that was decent, it's all super low res or super small collections, like 16 pictures or less. Yesterday I had a "duh" moment. Torrents! Go to the torrent sites and filter for pictures. Of course there's a lot of stupid pictures of ugly girls, but there's also awesome stuff like a package of 800 photos of nature at 1920x1200. Each pixel in each color plane is a training sample, so that's 5.5 billion training samples right there which should hold me for a while.
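The sample-count arithmetic is easy to sanity-check; a quick sketch, assuming one training sample per pixel per color plane (the exact per-image accounting is my assumption, not spelled out above):

```python
# Back-of-envelope check on the training-sample count.
# 800 photos at 1920x1200, 3 color planes,
# one training sample per pixel per plane (assumed).
photos = 800
width, height = 1920, 1200
planes = 3

samples = photos * width * height * planes
print(samples)        # 5529600000
print(samples / 1e9)  # ~5.53 billion
```

So "5.5 billion" is right on the nose at ~5.53e9.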

Ideally I'd get uncompressed images so I don't have spurious JPEG artifacts gunking things up, but it's hella hard to find a good uncompressed image data set.

Ideally I would like an image training set which statistically exactly mirrored the overall statistics of all digital images in existence (weighted by the probability of a user interacting with that image). That is, if 32% of the neighborhoods in all the images in the universe were "smooth", then in my ideal training set 32% of neighborhoods would be smooth. The average entropy under various predictors would be the same, etc. Basically it would be an expectation-equivalent characteristic sample. Some poor graduate student needs to make that for all of us.
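One way to operationalize "fraction of smooth neighborhoods" is the share of small blocks whose pixel variance falls under a threshold; a toy sketch of that measurement, where the block size, threshold, and function name are all my own arbitrary choices:

```python
import numpy as np

def smooth_fraction(img, block=8, var_thresh=25.0):
    """Fraction of block x block neighborhoods whose pixel variance
    falls below var_thresh. Block size and threshold are arbitrary
    illustrative choices, not anything canonical."""
    h, w = img.shape
    count = total = 0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            total += 1
            if img[y:y+block, x:x+block].var() < var_thresh:
                count += 1
    return count / total

# Toy example: a flat image is entirely "smooth", pure noise is not.
flat = np.full((64, 64), 128.0)
noise = np.random.default_rng(0).uniform(0, 255, (64, 64))
print(smooth_fraction(flat))   # 1.0
print(smooth_fraction(noise))  # 0.0
```

With a statistic like this (plus per-predictor entropy measurements), you could at least test whether a candidate training set matches a target distribution, even if building the truly representative set remains that poor graduate student's problem.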
