DFW Perl Mongers Winter 2013 Deduplication Hackathon

Can Perl Hack It?

Let's say you have 80,000 files taking up 100 GB of space and you'd like to do some housecleaning by weeding out the duplicates. Is Perl the right tool for this job? This is the question the DFW Perl Mongers Winter 2013 Deduplication Hackathon set out to answer.

It was really more of a programming contest than a hackathon in the classic sense. Participants wouldn't be fixing bugs or adding features to some neglected open source project; instead, they'd each be creating their own deduplication tool. I had some reservations about whether this was a good hackathon topic, but contest organizer and DFW.pm co-president Tommy Butler assured me that there was in fact a need for a deduplication tool that wasn't being met by existing tools (which are largely aimed at deduplicating image files), and that the contest would encourage participants to turn in reusable, packaged applications. So if nothing else, we'd at least end up with some potentially usable code on GitHub.

Usefulness aside, it did strike me as a fun problem to tackle.

The Random Advantage

I also pitched the idea of participating to my co-worker, Tim, and we started discussing architecture and implementation ideas for a possible contest entry. As I reviewed the contest specification, one detail caught my eye: the test data was described as “a 100 gigabyte mass of…randomly generated files and directories,” and it said the data used in the final competition run would also be random.

Random? Really? That's not a particularly good way to replicate a real-world deduplication situation. If that statement was accurate, and the code generating the data used a decent random number generator, then files of the same size would be either identical or wildly different, and telling them apart would require examining only a few bytes of each file. There would be no need to hash entire files. This is quite different from a real-world file system, where you are likely to find many nearly identical files that share the same headers or otherwise vary by only the few hundred bytes that changed between revisions.
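To make the shortcut concrete, here is a minimal sketch of the idea (not any contestant's actual code): bucket files by size, then compare only a short prefix of each same-sized file instead of hashing whole files. The directory argument and the 64-byte prefix length are arbitrary choices of mine.

    #!/usr/bin/env perl
    # Sketch of the "random data" shortcut: group files by size, then compare
    # only a small prefix of each file rather than hashing its full contents.
    use strict;
    use warnings;
    use File::Find;

    my $dir = shift // '.';
    my %by_size;

    # Bucket every regular file by its size; only same-size files can be duplicates.
    find(sub { push @{ $by_size{ -s $_ } }, $File::Find::name if -f $_ }, $dir);

    for my $size (keys %by_size) {
        my @files = @{ $by_size{$size} };
        next if @files < 2;

        # With truly random content, the first few bytes are enough to tell
        # identical files from different ones.
        my %by_prefix;
        for my $file (@files) {
            open my $fh, '<:raw', $file or next;
            read $fh, my $prefix, 64;
            close $fh;
            push @{ $by_prefix{$prefix} }, $file;
        }

        for my $group (grep { @$_ > 1 } values %by_prefix) {
            print join("\t", @$group), "\n";   # one line per group of likely duplicates
        }
    }

On real-world data this prefix check would only be a first-pass filter, since nearly identical files share the same leading bytes; on truly random data it is effectively the whole job.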

I wanted to confirm my suspicion about this before I started writing code that relied on this shortcut, so I sent off an email to contest organizer Tommy to get clarification on various details about the contest server environment. I asked “How high quality is the random data?” and Tommy responded “it’s mostly supplied by Perl 5.18’s int rand.”

I was intentionally being vague so as not to give away why I thought this was relevant, but did say “…pretty good quality random data actually permits taking some shortcuts…” to which Tommy responded, “the code is open to peer review by contest judges. … you can opt to be a contest judge (which I’d love) and you can study the randomization code for yourself and make suggestions/improvements.”

To Judge or Not to Judge

I exchanged a few more emails with Tommy about the contest specifications, and he kept suggesting the idea of judging. While it sounded more fun to code up an entry, I told Tommy that I thought the contest was “a way to boost a positive image for Perl,” so if he needed a judge more than he needed another contestant, I'd be happy to help out. In fact, The Perl Shop would be willing to sponsor the contest.

But what about Tim? By this point he was already off doing research into deduplication theory. Tommy assured me that he didn’t see any conflict of interest, as long as we ceased collaborating on the contest entry. So we instituted a proverbial “Chinese wall” and had no further discussion of the implementation of Tim’s contest entry or any insider details of the final contest execution environment.

At this point Tommy sent me his script for creating the 100 GB file system, which I reviewed, and then explained what I was getting at with my questions about the random data. It didn’t take much convincing to get Tommy to switch to using data built from real-world files. A new test file system was made available to the contestants immediately. A good thing, too, as others had caught on to optimization techniques for the random data and were deduping it in seconds instead of the expected minutes.

Other suggestions I offered up included:

  • adding a Perl 6 category;
  • adding a category for least memory use;
  • defining a standardized output format, so that validating the correct operation of an entry would be as simple as diffing its output against the reference design (see the sketch after this list);
  • adding a progress indicator, to make watching the judging sessions more interesting for the audience; Tommy rightfully felt it was a bit late in the contest to impose this as a requirement, so it was made optional.
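As a hedged illustration of the output-format suggestion: if every entry prints one line per duplicate group, with the paths within a group sorted and tab-separated and the groups themselves sorted, then two correct tools produce byte-identical output. The exact layout below is my own invention for illustration, not the format the contest actually adopted.

    # Normalize duplicate groups into a canonical, diff-friendly text form.
    use strict;
    use warnings;

    sub normalize_groups {
        my @groups = @_;                              # each group is an arrayref of paths
        my @lines  = map { join "\t", sort @$_ } @groups;
        return join "\n", sort(@lines), '';           # trailing newline for clean diffs
    }

    print normalize_groups(
        [ 'b/copy.txt', 'a/original.txt' ],
        [ 'photos/img1.jpg', 'backup/img1.jpg' ],
    );

With output normalized this way, checking an entry against the reference implementation is just a matter of running diff on the two results.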

In part 2 I'll cover how the contest judging went. In part 3 I'll look at the approach taken by each contestant. And finally, in part 4, I'll announce the winners.

Tom Metro is founder and Chief Consultant at The Perl Shop.