
03-25-14 - deduper

So here's my little dedupe :

dedupe.zip (x64 exe)

This is a file-level deduper (not a block or sector deduper), i.e. it finds entire files that are identical.

dedupe.exe does not delete dupes or do anything else to them. Instead it outputs a batch file containing commands to act on the dupes. The intention is that you then view/edit that batch file and decide which files you actually want to delete or link or whatever.

It runs on Oodle, using multi-threaded dir enumeration and all that (see also tabdir).

It finds possible dupes first by checking file sizes. For any file size shared by more than one file, it hashes those files and collects possible dupes that have the same hash. Those candidates are then verified to be actual dupes with a byte-by-byte compare.
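
To make that concrete, here's a minimal sketch of the same size-then-hash-then-verify pipeline in Python (the real tool is an Oodle-based exe with multi-threaded enumeration; the batch-file output here is just illustrative, not necessarily the format dedupe.exe emits):

    import os, sys, hashlib, filecmp
    from collections import defaultdict

    def find_dupes(root):
        # 1) group files by size; only sizes shared by >1 file can hold dupes
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    pass
        for size, paths in by_size.items():
            if len(paths) < 2:
                continue
            # 2) hash the candidates to split each size group further
            by_hash = defaultdict(list)
            for path in paths:
                h = hashlib.sha1()
                with open(path, 'rb') as f:
                    for chunk in iter(lambda: f.read(1 << 20), b''):
                        h.update(chunk)
                by_hash[h.digest()].append(path)
            # 3) verify hash matches with a byte-by-byte compare
            for group in by_hash.values():
                keep = group[0]
                for other in group[1:]:
                    if filecmp.cmp(keep, other, shallow=False):
                        yield keep, other

    if __name__ == '__main__':
        # redirect stdout to a .bat, then review/edit it before running anything
        for keep, dupe in find_dupes(sys.argv[1]):
            print('REM dupe of "%s"' % keep)
            print('del "%s"' % dupe)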

Obviously deleting a bunch of dupes is dangerous, be careful, it's not my fault, etc.

Possible todos to make it faster :

1. Use an (optional) cache of hashes and modtimes to accelerate re-runs. One option is to store the hash of each file in an NTFS alternate data stream on that file (which lets the file be moved or renamed and keep its hash); probably better just to have an explicit cache file and not muck about with rarely-used OS features. (A sketch of such a cache follows this list.)

2. If there are only 2 or 3 files of a given size, I should just jump directly to the byte-by-byte compare. Currently they are first opened and hashed, and then if the hashes match they are byte-by-byte compared, which means an extra scan of those files.
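
For todo #1, a minimal sketch of what an explicit cache could look like (the JSON file name and per-entry layout are made up for illustration). An entry is reused only if the file's size and modtime are unchanged, so a moved or renamed file does get re-hashed; that's the tradeoff versus the NTFS-stream option.

    import os, json, hashlib

    CACHE_FILE = 'dedupe_hash_cache.json'   # hypothetical cache location

    def load_cache():
        try:
            with open(CACHE_FILE) as f:
                return json.load(f)
        except (OSError, ValueError):
            return {}

    def save_cache(cache):
        with open(CACHE_FILE, 'w') as f:
            json.dump(cache, f)

    def cached_hash(path, cache):
        st = os.stat(path)
        key = os.path.abspath(path)
        entry = cache.get(key)
        # reuse the stored hash only if size and modtime still match
        if entry and entry['size'] == st.st_size and entry['mtime'] == st.st_mtime:
            return entry['hash']
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        cache[key] = {'size': st.st_size, 'mtime': st.st_mtime, 'hash': h.hexdigest()}
        return h.hexdigest()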

10 comments:

AJ said...

Half of the files that are the same size are only coincidental matches, so a hash of a file's first and last disk-sector-sized reads is usually sufficient to uniquely identify it. Files that may be the same, say a number of .BMP files with the same pixel width/height, may differ only in their metadata, which is often kept in the first and/or last blocks too. Only if that fails to distinguish the file would reading and hashing the entire thing be required.
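
A minimal sketch of that first/last-block signature, with 4 KB standing in for the sector-sized read; only files whose signatures still collide would go on to a full read and hash:

    import os, hashlib

    def quick_signature(path, block=4096):
        # hash the first and last `block` bytes; files <= 2*block are hashed whole
        size = os.path.getsize(path)
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            if size <= 2 * block:
                h.update(f.read())
            else:
                h.update(f.read(block))
                f.seek(size - block)
                h.update(f.read(block))
        return size, h.digest()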

cbloom said...

Good point. Todo.

Only for large files, though. For small files (< 64k, certainly) you may as well go ahead and read the whole thing.

cbloom said...

I think the more general optimization is like this :

Compute hashes in 64k chunks.

So you're building a "signature" for each file.

Compute the hashes in a "dynamic programming" sense. That is, lazily evaluate them only as needed to differentiate files.

So you only compute the hash of the second 64k chunk when there are other files that match the first 64k hash.
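
A minimal sketch of that lazy chunked signature: each same-size group is split on the hash of one more 64k chunk only while files still collide, so later chunks are never hashed for files that have already been separated. Groups it yields would still go to the final byte-by-byte verify.

    import hashlib
    from collections import defaultdict

    CHUNK = 64 * 1024

    def chunk_hash(path, index):
        # hash of the index'th 64k chunk, computed only when requested
        with open(path, 'rb') as f:
            f.seek(index * CHUNK)
            data = f.read(CHUNK)
        return hashlib.sha1(data).digest() if data else None

    def refine(group, index=0):
        # split a same-size group on the next chunk hash, recursing only
        # into sub-groups that still collide
        if len(group) < 2:
            return
        buckets = defaultdict(list)
        for path in group:
            buckets[chunk_hash(path, index)].append(path)
        for digest, sub in buckets.items():
            if len(sub) < 2:
                continue          # unique at this chunk: not a dupe
            if digest is None:
                yield sub         # matched all the way to EOF: probable dupes
            else:
                yield from refine(sub, index + 1)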

Unknown said...

I wrote this sort of tool in Python, and I got excellent performance comparing file content directly without hashing.
My tool immediately split same-size groups into common-prefix groups whenever a difference was found.
This excludes files from comparison as early as possible, but most of the performance likely came from reading files in 1MB chunks, limiting and amortizing I/O calls.

What's a good upper limit for the number of simultaneously open files? If there are many files of the same size, closing files and reopening them before reading another chunk might be necessary.
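
A minimal sketch of that hash-free splitting (1 MB chunks as described; files are reopened for each chunk, which sidesteps the open-file limit at the cost of extra open/seek calls):

    from collections import defaultdict

    CHUNK = 1 << 20   # 1 MB reads

    def split_by_content(group, offset=0):
        # split a same-size group into sub-groups sharing the next chunk,
        # recursing only into sub-groups that are still identical so far
        if len(group) < 2:
            return
        buckets = defaultdict(list)
        for path in group:
            with open(path, 'rb') as f:
                f.seek(offset)
                data = f.read(CHUNK)
            buckets[data].append(path)
        for data, sub in buckets.items():
            if len(sub) < 2:
                continue          # differs from every other file: not a dupe
            if not data:
                yield sub         # reached EOF with all content identical
            else:
                yield from split_by_content(sub, offset + CHUNK)

Keying the buckets on the raw chunk bytes keeps up to one chunk per group member in memory at each level; a real tool would compare incrementally instead.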

MH said...

Fun! I wrote one of these. I sorted the list on file size, then opened each file and read the first 4k (or all of it if it was smaller), and used that as a quick filter, then did full file compares.

My unverified hope was that a first-4k read off a disk was fast.

Though now I want to do a fast deduper using SIFT on images, to catch resized or otherwise modified images.

cbloom said...

Image search is something I would love to work on.

The current "reverse image searches" are all ridiculously bad. You want to be able to find crops and re-compresses and so on.

Doesn't seem that hard.

Anonymous said...

Google reverse image search seems to do all that. http://twitter.com/nothings/status/421590856617824256

cbloom said...

When I've tried it in the past it's always massively failed. Like it couldn't even find minor crops.

Anonymous said...

Maybe I forgot to delete the tags from the image, and so it still said 'NSA' in it, and I totally misinterpreted it?

cbloom said...

Maybe they special case NSA images. I dunno. Personally I've always had much more success with TinEye's reverse image search. But even that is very easily fooled by crops and resizes.
