Unixwiz.net - Steve Friedl's Weblog: So does SCO have a case?

June 21, 2003

So does SCO have a case?

I've been following the whole SCO mess with much more of an open mind than most in the open source community: I do have respect for intellectual property, and I am more aware of the history of UNIX than most of those who weren't even born when I started using it (January 1981). I have been careful, in my own mind, to separate "SCO is going about this badly" from the question of whether they might have a valid claim.

I'm tending to think more and more that they do not, mainly because they won't reveal any of their evidence. They claim it's because they don't want people to rip out the offending code before it gets to trial, but I think this shows a misunderstanding of "internet memory" - you can't make anything go away on the internet even if you try (and the Scientologists have surely tried).

At any rate, The Inquirer ran an interesting piece (amusing screenshot above) on a way to actually compare pieces of source code for borrowings by taking fingerprints of both trees. Each source file would be broken up into five-line chunks and an MD5 checksum taken of each chunk. The checksums could then be published and compared without revealing the source code itself.
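To make the five-line-chunk idea concrete, here is a minimal sketch in Python. It is not the article's code: the function names are made up for illustration, and it assumes the chunks are consecutive and non-overlapping, which is only one way to read the proposal.

```python
import hashlib

def chunk_fingerprints(text, chunk_lines=5):
    """Return the MD5 hex digest of each consecutive block of chunk_lines lines.

    Assumption: chunks do not overlap; a sliding window would be another
    reasonable reading of the proposal.
    """
    lines = text.splitlines()
    digests = []
    for start in range(0, len(lines), chunk_lines):
        chunk = "\n".join(lines[start:start + chunk_lines])
        digests.append(hashlib.md5(chunk.encode("utf-8")).hexdigest())
    return digests

def shared_chunks(text_a, text_b, chunk_lines=5):
    """Return the set of chunk digests that appear in both texts."""
    return set(chunk_fingerprints(text_a, chunk_lines)) & set(
        chunk_fingerprints(text_b, chunk_lines)
    )
```

Only the digests need to be handed around, so each side can run `chunk_fingerprints` over its own tree and exchange the results. Note that a chunk matches only if every byte in it is identical.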

The approach proposed in the article was simpleminded - but a great start - and there is a much better approach. Former Yahoo! Chief Scientist Udi Manber (now at Amazon.com) came up with an algorithm for finding similarities in files, and it's based on a more advanced fingerprinting technique.

This paper, Methods for Identifying Versioned and Plagiarised Documents by Hoad and Zobel (RMIT, Australia), which references Udi's algorithm, describes an approach that appears to be a good candidate. In particular, it should be much better than simpleminded MD5 checksums of five-line chunks, which can be fooled by changing even a single character in a chunk.

The algorithm does more than answer "are they the same?": it can detect co-derivatives, such as two files that came from a single source. By producing a large file of these fingerprints for the System V source code, one could cross-compare it with the Linux source.

Interestingly, the algorithm allows one-to-many comparisons, so you'd not have to know that file1.c in the Linux source base implemented the same functionality as file2.c in System V - it will find all the likely matches in a huge n-by-n matrix.
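A rough sketch of how one-to-many matching on fingerprints can work, again in Python. To be clear, this is not Manber's algorithm or the Hoad and Zobel method, just a simplified illustration in the same spirit: hash short runs of tokens, keep a subset of the hashes, and look them up in an index. The tokenizer, the function names, and the parameters `k` and `keep_mod` are all assumptions of mine.

```python
import hashlib
import re
from collections import Counter, defaultdict

def fingerprints(text, k=4, keep_mod=1):
    """Hash every run of k consecutive tokens; keep hashes divisible by keep_mod.

    Tokenizing first means whitespace and line breaks do not affect the result.
    keep_mod=1 keeps everything; larger values thin out the fingerprint set.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", text)
    prints = set()
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k])
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        if value % keep_mod == 0:
            prints.add(value)
    return prints

def build_index(files, **kw):
    """Map each fingerprint to the set of file names that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for fp in fingerprints(text, **kw):
            index[fp].add(name)
    return index

def likely_matches(text, index, **kw):
    """Rank indexed files by how many fingerprints they share with text."""
    hits = Counter()
    for fp in fingerprints(text, **kw):
        for name in index.get(fp, ()):
            hits[name] += 1
    return hits.most_common()
```

The index is built once over one tree, and every file from the other tree is then run through `likely_matches`, so nobody has to guess in advance which files correspond. Because a small edit disturbs only the token runs that overlap it, a reformatted or lightly modified file should still share most of its fingerprints with the original.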

As an aside, I met Dr. Manber when I visited Yahoo!, and it was clear why people like him are "Chief Scientist" types and I will never be. When this man walks down the hall, algorithms fall out of his pocket: some of them are just stunning. I'm great at implementing algorithms, but thinking them up is not my strong point. *sniff*

And he's a really nice guy too. Yahoo!'s loss is Amazon's gain, that's for sure.

Posted by Steve at June 21, 2003 08:54 AM
