This is a file level deduper (not a block or sector deduper) eg. it finds entire files that are identical.
dedupe.exe does not delete dupes or do anything to them. Instead it outputs a batch file which contains commands to do something to the dupes. The intention is that you then go view/edit that batch file and decide which files you actually want to delete or link or whatever.
It runs on Oodle, using multi-threaded dir enumeration and all that. (see also tabdir )
It finds possible dedupes first by checking file sizes. For any file size where there is more than one file of that size, it then hashes and collects possible dupes that have the same hash. Those are then verified to be actual byte-by-byte dupes.
Obviously deleting a bunch of dupes is dangerous, be careful, it's not my fault, etc.
Possible todos to make it faster :
1. Use an (optional) cache of hashes and modtimes to accelerate re-runs. One option is to store the hash of each file in an NTFS extra data stream on that file (which allows the file to be moved or renamed and retain its hash); probably better just to have an explicit cache file and not muck about with rarely-used OS features.
2. If there are only 2 or 3 files of a given size, I should just jump directly to the byte-by-byte compare. Currently they are first opened and hashed, and then if the hashes match they are byte-by-byte compared, which means an extra scan of those files.