The problem of finding duplicate files is probably as old as megabyte-capacity hard drives. It is surprising how many duplicates accumulate on practically any system that has been in operation for some time; I would say it is astonishing when you find duplicate sources in your development tree. Well, as part of our cleanup and improvement process we decided to get rid of them. This presented the first problem: how do we find them?
There are many tools that can do this, but the off-the-shelf ones did not seem to be able to ignore white space. Of course diff can, but diffing all files pairwise is quadratic in the number of files and just took too long (at least my script was still running when I came back from lunch). You may ask whether files that differ only in white space are even likely. Well, think CRLF: the files we were considering have seen both Windows and Linux. I figured it would not take too long to write a program that does this. You may think C++ is an unlikely choice, but I am fairly familiar with it, and it has the STL and MD5 (from OpenSSL). The first version was short and fast. Then Viktor came up with the bright idea to make it open source. This of course scared me, since an ugly 50-line one-filer should be for my eyes only. So I polished it, documented it (even man pages) and tested it. All of this took much more effort than writing the first version. So now we have yet another tool for finding duplicates, with the distinguishing feature that it can ignore white space (to be precise, traditional white space, not Unicode white space).
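The core of the trick is easy to sketch. What follows is a minimal illustration, not the tool's actual code (the function name is mine, for illustration): it reads a file in chunks, drops traditional white space (anything isspace() accepts), and feeds what remains to OpenSSL's MD5, so two files that differ only in spacing or line endings produce the same digest.

```cpp
// Minimal sketch (not the actual tool's code): MD5 of a file's
// contents with traditional white space skipped, using OpenSSL.
#include <openssl/md5.h>

#include <cctype>
#include <cstdio>
#include <fstream>
#include <string>

// Hex MD5 digest of the file, ignoring white space, so that
// "a b\r\nc" and "abc" hash identically.
std::string md5_ignoring_whitespace(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    MD5_CTX ctx;
    MD5_Init(&ctx);

    char buf[8192];
    while (in.read(buf, sizeof(buf)) || in.gcount() > 0) {
        // Compact the chunk in place, dropping white-space bytes.
        std::streamsize n = in.gcount();
        std::streamsize kept = 0;
        for (std::streamsize i = 0; i < n; ++i)
            if (!std::isspace(static_cast<unsigned char>(buf[i])))
                buf[kept++] = buf[i];
        MD5_Update(&ctx, buf, kept);
    }

    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5_Final(digest, &ctx);

    char hex[2 * MD5_DIGEST_LENGTH + 1];
    for (int i = 0; i < MD5_DIGEST_LENGTH; ++i)
        std::sprintf(hex + 2 * i, "%02x", digest[i]);
    return std::string(hex, 2 * MD5_DIGEST_LENGTH);
}
```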
As soon as we released it, someone popped the question: "how does it compare to ...?" Alright, let's compare. fdupes always seems to be on top of the Google hits, and I also found two other popular ones: duff and fdf. I really expected these to run at about the same speed as mine; I was sure there could be nothing new under the sun when it comes to finding duplicates. What I found was surprising. If you compare files as plain binaries, you first throw away the ones with unique sizes, then take a checksum of the rest and throw them into a map. Since I was after white-space-ignoring comparisons, I could not use file size as a filter. So the obvious idea came: take a small (user-defined) prefix, calculate its checksum, and let that be the filter (sketched below). It made a hell of a difference, and it seemed to boost binary comparisons as well, especially when it came to large files of the same size. Here are the results of the comparison. Our tool seems to be much faster than the competition, and the one that keeps up with it most (and most consistently) is fdf, a Perl program. Go figure!
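To make that two-stage filter concrete, here is a hedged sketch; the helper names and the suggested prefix size are my inventions, not the tool's actual interface, and the two digest helpers are assumed to be variants of the snippet above. Files are first bucketed by the checksum of a small prefix, and only buckets holding more than one file pay for the full digest.

```cpp
// Illustrative sketch of the prefix filter (not the tool's actual code).
// Stage 1 buckets files by the digest of a small prefix; stage 2 runs
// the full digest only inside buckets that hold more than one file.
#include <map>
#include <string>
#include <vector>

// Assumed helpers, e.g. variants of md5_ignoring_whitespace above:
std::string md5_of_prefix(const std::string& path, std::size_t max_bytes);
std::string md5_of_whole_file(const std::string& path);

typedef std::map<std::string, std::vector<std::string> > Buckets;

// Returns groups of paths whose full digests match.
std::vector<std::vector<std::string> >
find_duplicates(const std::vector<std::string>& paths,
                std::size_t prefix_bytes /* user-defined, e.g. 4096 */)
{
    // Stage 1: cheap filter on the first prefix_bytes of each file.
    Buckets by_prefix;
    for (std::size_t i = 0; i < paths.size(); ++i)
        by_prefix[md5_of_prefix(paths[i], prefix_bytes)].push_back(paths[i]);

    // Stage 2: full checksum, but only where the prefix collided.
    std::vector<std::vector<std::string> > groups;
    for (Buckets::const_iterator b = by_prefix.begin();
         b != by_prefix.end(); ++b) {
        if (b->second.size() < 2)
            continue; // unique prefix digest: cannot have a duplicate
        Buckets by_full;
        for (std::size_t i = 0; i < b->second.size(); ++i)
            by_full[md5_of_whole_file(b->second[i])].push_back(b->second[i]);
        for (Buckets::const_iterator f = by_full.begin();
             f != by_full.end(); ++f)
            if (f->second.size() > 1)
                groups.push_back(f->second);
    }
    return groups;
}
```

The design point is that the cheap prefix digest plays the role that file size plays in a plain binary comparison: it discards most candidates before any file has to be read in full.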