István

Nov 27, 2007

ua: yet another tool for finding duplicates

  • News
  • Open Source

The problem of finding duplicate files is probably as old as megabyte-capacity hard drives. It is surprising how many duplicates accumulate on practically any system that has been in operation for some time, and it is downright astonishing to find duplicated source files in your own development tree. As part of our cleanup and improvement process we decided to get rid of them. That raises the first problem: how do we find them?

There are many tools that can do this, but the off-the-shelf ones did not seem to be able to ignore white space. Of course diff can, but diffing all files pairwise just took too long (my script was still running when I came back from lunch). You may ask whether files are really likely to differ only in white space. Well, think CRLF: the files we were considering have seen both Windows and Linux.

I figured it would not take too long to write a program that does this. You may think that C++ is an unlikely choice, but I am fairly familiar with it, and it has the STL and MD5 (from OpenSSL). The first version was short and fast. Then Viktor came up with the bright idea of making it open source. This of course scared me, since an ugly 50-line one-filer should be for my eyes only. So I polished it, documented it (man pages included) and tested it. All this of course took much more effort than writing the first version.

So now we have yet another tool for finding duplicates, with the distinguishing feature that it can ignore white space (to be precise, traditional white space, not Unicode white space).
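To make the CRLF case concrete, here is a minimal sketch in Python (not the C++ used in ua) of a checksum that ignores traditional white space. It assumes that "ignoring" means dropping the six ASCII white space bytes before hashing; whether ua removes them or only collapses runs is not stated here, and the function name `whitespace_insensitive_md5` is made up for illustration.

```python
import hashlib

# Traditional (ASCII) white space only: space, tab, LF, VT, FF, CR.
# Unicode white space is deliberately not handled.
ASCII_WHITESPACE = b" \t\n\v\f\r"


def whitespace_insensitive_md5(data: bytes) -> str:
    """Return the MD5 hex digest of `data` with ASCII white space removed.

    Assumption: white space is dropped entirely, so "a b" and "ab"
    hash to the same value.
    """
    stripped = data.translate(None, ASCII_WHITESPACE)
    return hashlib.md5(stripped).hexdigest()


# The same source file saved with Unix and with Windows line endings,
# plus one file that really is different.
unix_text = b"int main() {\n    return 0;\n}\n"
windows_text = b"int main() {\r\n    return 0;\r\n}\r\n"
other_text = b"int main() {\n    return 1;\n}\n"

same = whitespace_insensitive_md5(unix_text) == whitespace_insensitive_md5(windows_text)
different = whitespace_insensitive_md5(unix_text) != whitespace_insensitive_md5(other_text)
```

A plain MD5 of `unix_text` and `windows_text` differs because of the extra CR bytes, which is exactly why a byte-for-byte duplicate finder misses this pair.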

As soon as we released it, someone popped the question: "how does it compare to ..."? Alright, let's compare. fdupes always seems to be at the top of the Google results, and I also found two other popular tools: duff and fdf. I really expected these to run at about the same speed as mine. What I found was surprising; I had been sure there could be nothing new under the sun when it comes to finding duplicates.

If you compare binaries, you first throw away the files with unique sizes, then take a checksum of the rest and throw them into a map. Since I was after comparisons that ignore white space, I could not use file size as a filter: two files that differ only in white space can have different sizes. So the obvious idea came: take a small (user-defined) prefix, calculate its checksum and let that be the filter. It made a hell of a difference and seemed to speed up binary comparisons as well, especially when it came to large files of the same size.

Here are the results of the comparison. Our tool seems to be much faster than the competition, and the one that keeps up with it best (and most consistently) is fdf, a Perl program. Go figure!
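The prefix filter described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not ua's actual C++ code: the names `digest`, `group_by` and `find_duplicates`, the default `prefix_size` of 1024 bytes, and the choice to count the prefix after white space has been removed are all hypothetical.

```python
import hashlib
from collections import defaultdict

# Traditional (ASCII) white space: space, tab, LF, VT, FF, CR.
ASCII_WHITESPACE = b" \t\n\v\f\r"


def digest(path, limit=None, ignore_whitespace=True, chunk_size=65536):
    """MD5 of the first `limit` bytes of a file (all of it if limit is None).

    With ignore_whitespace=True the limit is counted after ASCII white
    space has been removed, so CRLF and LF files share a prefix.
    """
    md5 = hashlib.md5()
    remaining = limit
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            if ignore_whitespace:
                chunk = chunk.translate(None, ASCII_WHITESPACE)
            if remaining is not None:
                chunk = chunk[:remaining]
                remaining -= len(chunk)
            md5.update(chunk)
    return md5.hexdigest()


def group_by(paths, key):
    """Group paths by key(path) and keep only groups with several members."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def find_duplicates(paths, prefix_size=1024, ignore_whitespace=True):
    """Return lists of paths whose contents match."""
    duplicates = []
    # Stage 1: cheap filter, checksum of a short prefix only.
    by_prefix = group_by(
        paths, lambda p: digest(p, prefix_size, ignore_whitespace))
    for candidates in by_prefix:
        # Stage 2: full checksum, only for files whose prefixes collide.
        duplicates.extend(group_by(
            candidates, lambda p: digest(p, None, ignore_whitespace)))
    return duplicates
```

Calling `find_duplicates(list_of_paths, prefix_size=4096)` reads every file only up to the prefix in the first pass. A file is read in full only when another file shares its prefix checksum, which is where the gain on large, same-sized files would come from.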

