DFW Perl Mongers Winter 2013 Deduplication Hackathon: Judging

See part 1 for an introduction to the DFW Perl Mongers Winter 2013 Deduplication Hackathon.

Judgement Day

On Wednesday night Tommy would be presenting the results of the contest to DFW.pm. About a week prior he announced that contestants would have to submit their code by end of business on the preceding Monday. When the deadline arrived, a few contestants asked for more time, and, wanting to be accommodating, Tommy extended the deadline by 24 hours.

When Tuesday evening arrived, all of the contestants had submitted their code, but now Tommy had a scheduling conflict, which meant it wasn’t until about midnight that we got started on the trial runs.

About 30 people expressed interest in participating in the contest and signed up for the DFW.pm mailing list, with maybe a dozen taking the next step of getting an account on the development server. In the end we had 6 entries. Just as well, given the limited time we had to judge them.

The contest offered a bunch of categories in which you could win. We split them into two groups. The first group consisted of objective measures, like speed, memory consumption, lines of code, and Perl::Critic score. The second consisted of subjective measures, like documentation quality, most (useful) features, packaging (as a reusable application), and best effort (some evidence that a lot of time was put into the code). Given our time constraints, we would only be able to judge the objective measures prior to the DFW.pm meeting.
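For the curious, most of the objective measurements are mundane to gather. A sketch along these lines is enough to pull a Perl::Critic violation count and a naive line count for an entry (the severity level and the line-counting rules here are illustrative, not necessarily what we used):

    #!/usr/bin/env perl
    # Sketch: collect a couple of the objective stats for one entry.
    use strict;
    use warnings;
    use Perl::Critic;

    my $file = shift or die "usage: $0 <script.pl>\n";

    # Perl::Critic score: count policy violations at a chosen severity.
    my $critic     = Perl::Critic->new( -severity => 3 );
    my @violations = $critic->critique($file);

    # Lines of code: a naive count that skips blanks and full-line comments.
    open my $fh, '<', $file or die "can't read $file: $!\n";
    my $loc = grep { /\S/ && !/^\s*#/ } <$fh>;

    printf "%s: %d Perl::Critic violations, %d lines of code\n",
        $file, scalar @violations, $loc;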

So Tommy went through the procedure of resetting the virtual server used to run the contest code to a clean state, checked out the first submission from GitHub, and gave it a spin. Almost immediately it hit a segfault. Not off to a good start.

Fortunately the author of the code was still up and able to step in and debug. The problem was that one of the CPAN modules he used was choking on files that were zero bytes in length. A quick fix bypassed the problem. The VM was reset and we did take two. The run completed and the output matched the reference code. Perfect. We collected the other objective stats and moved on to the next entry.
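I don’t have his exact patch in front of me, but the guard for that sort of problem fits in a few lines: in a dedup tool, empty files can simply be kept away from whatever hashing module chokes on them (or all be reported as duplicates of one another). Something like:

    # Illustrative only -- not the author's actual fix.
    # @candidate_files and process_file() are hypothetical stand-ins.
    for my $file (@candidate_files) {
        next if -z $file;       # skip zero-byte files entirely
        process_file($file);    # hand everything else to the real logic
    }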

The next two we tried ran without error, but produced output that had numerous differences compared to the reference design. Per the contest rules, this meant they were disqualified, but we collected the other stats anyway.

One of these entries was aiming to win the low memory usage category. We measured it at only 10 KB of RAM, which seemed impossibly low. We repeated the measurement with similar results. Later, as a point of comparison, I measured the memory use of a trivial “hello world” Perl program, and even that far exceeded 10 KB. In retrospect, I can’t say whether the bad data was a result of operator error or a flaw in our memory measurement technique. (I had previously run several sanity checks on the valgrind memory profiler technique we used, and found it correlated well with what ps reported for a simple program that allocated a specified amount of memory and then slept so stats could be gathered.)
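That calibration program was nothing elaborate. This isn’t the exact script, but it captures the idea: allocate a requested number of megabytes, then sleep so the profiler and ps have something stable to sample.

    #!/usr/bin/env perl
    # Allocate roughly the requested number of megabytes, then sleep so
    # that ps, valgrind, or any other profiler can sample the process.
    # (Not the exact calibration script from the contest; just the idea.)
    use strict;
    use warnings;

    my $mb     = shift // 50;                   # megabytes to allocate
    my $buffer = 'x' x ( $mb * 1024 * 1024 );   # hold the data in a scalar
    print "holding roughly ${mb}MB; pid $$\n";
    sleep 600;                                  # keep the allocation alive

Comparing the profiler’s numbers against what ps reported for the same process, over a few different allocation sizes, is what gave me confidence in the technique for a simple case like that.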

The next two entries we tried produced gobs of permission denied errors. After some investigation we determined that this was a result of ACLs being used on some of the test files. The normal permissions on those files didn’t grant access to the user account used to execute the contestants’ code, but the ACL overrode that and granted access. This failure was perplexing. The two entries used different modules and techniques for accessing the files. And regardless, ACLs should be resolved by the kernel at a layer far below the Perl code. Why were these two entries tripped up by this and not the prior 3?
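I can offer one speculative illustration, emphatically not a diagnosis of these two entries: by default, Perl’s -r operator compares the traditional mode bits against the real user and group IDs, so a defensive readability pre-check can reject a file that an actual open() would read without complaint. The filetest pragma switches those operators to access(2), which the kernel resolves with ACLs taken into account.

    # Speculative illustration only.
    use strict;
    use warnings;

    my $file = shift or die "usage: $0 <file>\n";

    # Default behavior: -r consults only the stat() mode bits, so a file
    # readable solely via an ACL can look unreadable here.
    print '-r (mode bits):  ', ( -r $file ? "readable\n" : "not readable\n" );

    {
        # The filetest pragma makes -r call access(2), which honors ACLs.
        use filetest 'access';
        print '-r (access(2)): ', ( -r $file ? "readable\n" : "not readable\n" );
    }

    # An actual open() is always resolved by the kernel, ACLs and all.
    if ( open my $fh, '<', $file ) { print "open() succeeded\n" }
    else                           { print "open() failed: $!\n" }

If either entry, or one of the modules it used, did that kind of pre-flight check, it would explain the symptom. But that’s a guess.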

By this point we had spent almost 3 hours measuring code, and it had been a long night. The last entry was, fortunately, only 6 lines of Perl 6 code. Its author offered it more as a prototype illustrating what could be done with very concise code, and hadn’t attempted to implement all the functionality called for in the contest specification. We ran it anyway. We had a rule limiting runs to 30 minutes, and this code didn’t produce any output in that time, so it wasn’t a qualifying entry.

We had now evaluated all 6 entries, plus Tommy’s unofficial entry and the reference design. Only the latter two produced correct data on the first pass, and even with some assistance from an author in fixing one, we still had 4 other entries that didn’t qualify.

Tommy felt bad about the prospect of disqualifying them, considering some had spent weeks developing their code, so our last order of business was to send out emails to the disqualified authors, giving them one final opportunity to submit corrected code.

The next day, after some discussion with two of the contestants, Tommy decided to remove the ACLs from the test files and grant access using ordinary file permissions. The two entries that had been tripped up by the ACLs now passed. The other 3 contestants chose not to submit revised entries, leaving us with 3 qualified finalists.

Everyone Has an Opinion

That night Tommy presented the results of the objective measures at the DFW.pm meeting and had several of the finalists do walkthroughs of their code. I’ll cover some of the code and link to the videos in part 3, and present the results in part 4.

Once the meeting was behind us, both Tommy and I needed some time to catch up with our $day jobs before resuming the task of judging the more subjective categories. Here the opinions of the judges would matter. Still, I tried to break down each subjective category into several quantifiable attributes, or at least define some criteria for the qualitative rating assigned to each code sample.

For example, the documentation category was split into attributes like POD, help text, architecture description, and code comments. The first 3 were pass/fail (with some subjective notes added if two entrants tied), while code comments were scored as a count of comment lines, plus some subjective judgement as to whether they were useful.
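Counting comment lines, for instance, doesn’t require anything fancier than a short script. This is a sketch, not the exact tally script we used; it counts full-line # comments while skipping the shebang, POD, and anything after __END__ or __DATA__:

    #!/usr/bin/env perl
    # Sketch of a comment-line counter (illustrative, not the judging script).
    use strict;
    use warnings;

    my $file = shift or die "usage: $0 <script.pl>\n";
    open my $fh, '<', $file or die "can't read $file: $!\n";

    my ( $count, $in_pod ) = ( 0, 0 );
    while ( my $line = <$fh> ) {
        last if $line =~ /^__(?:END|DATA)__$/;    # stop at the data section
        $in_pod = 1 if $line =~ /^=\w+/;          # a POD directive starts POD
        if ($in_pod) {
            $in_pod = 0 if $line =~ /^=cut\b/;    # =cut ends the POD block
            next;
        }
        next if $. == 1 && $line =~ /^#!/;        # ignore the shebang
        $count++ if $line =~ /^\s*#/;             # full-line comments only
    }
    print "$file: $count comment lines\n";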

While we couldn’t eliminate the subjective nature of these categories, we could at least minimize it, making it more likely that Tommy and I would agree on the winner of each category. It turns out the system worked pretty well, and there was no controversy between us as to who won.

In retrospect, I’d definitely try to have fewer such subjective categories in future contests. They’re quite time-consuming to judge, and probably don’t add that much to the appeal of the contest.

In the next part we’ll take a look at the code.

Tom Metro is founder and Chief Consultant at The Perl Shop. More…