ESR to Shred SCO Claims?

Posted by michael on Tuesday September 09, 2003 @05:52PM from the woodchipper dept.

webmaven writes "According to this article in eWEEK, ESR has released a utility called comparator for analyzing the similarity of source code trees. The technical details are interesting, in that ESR says he is using an implementation of a refined version of the 'shred' algorithm, with higher performance (on machines with enough RAM) than other versions. ESR won't say whether he intends the comparator to be used to compare older Unix code to Linux so as to be able to refute SCO's claims, but it's obviously well suited for such a purpose. Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code (a neutral third party could generate the shreds on a company's premises, and leave without taking a copy of the source with them). I'll be interested to see if (or which of) the proprietary vendors allow their source trees to be 'shredded' for such comparisons, and whether this becomes a standard forensic technique in source-code copyright and trade-secret disputes."

Re:Is there really that much data there? (Score:4, Informative)

by Paradox ( 13555 ) writes: on Tuesday September 09, 2003 @06:00PM (#6915236) Homepage Journal

No. Hashes are one-way functions. So it'd be kinda pointless. Further, comparing two hashes for anything but equality is meaningless with most good hashing schemes (unless you're a cryptographer).

SCO may not know origin of code (Score:5, Informative)

by Malfourmed ( 633699 ) writes: on Tuesday September 09, 2003 @06:02PM (#6915267) Homepage

The Sydney Morning Herald continues its mainstream coverage of the SCO vs IBM roadshow by posting an article where Dr Warren Toomey, a Unix historian, says that SCO may not know the origin of their own code [smh.com.au].
Article text follows:
SCO may not know origin of code, says Australian UNIX historian
By Sam Varghese
September 9, 2003
More doubts have been cast on the heritage of System V Unix code, which the SCO Group claims as its own, by an Australian who runs the Unix Heritage Society. [tuhs.org]
Dr Warren Toomey, now a computer science lecturer at Bond University, said today: "I'd like to point out that SCO (the present SCO Group) probably doesn't have an idea where they got much of their code. The fact that I had to send SCO (the Santa Cruz Organisation or the old SCO) everything up to and including Sys III says an awful lot."
He said that even though SCO owned the copyright on Sys III, a few years ago it did not have a copy of the source code. "I was dealing with one of their people at the time, trying to get some code released under a reasonable licence. I sent them the code as a gesture because I knew they did not have a copy," he said with a chuckle.
Dr Toomey's statements come a few days after Greg Rose, an Australian Unix hacker from the 1970s, raised the possibility that there may be code contributed by people, including himself, which has made its way into System V Unix and is thus being used by companies like the SCO Group.
Dr Toomey said this was one reason why the code samples which the SCO Group had shown at its annual forum had turned out to be widely published code.
SCO was unaware of the origins of much of the code and this "explains how they could wheel out the old malloc() code and the BPF (Berkeley Packet Filter) code, not realising that both were now under BSD licences - and in fact they hadn't even written the BPF code," Dr Toomey said.
He said that there was lots of code which had been developed at the University of New South Wales in the 70s which went to AT&T and was incorporated into UNIX without any copyright notices.
"At that time the development that was going on was similar to open source - the only difference was that the developers all had to have copies of the code licensed from AT&T," he said.
Dr Toomey, who served 12 years with the Australian Defence Force Academy, an offshoot of the University of New South Wales, before joining Bond University, said he had source code for Unices from the 3rd version of UNIX which came out in 1974 to the present day. "I don't have Sys V code but there are people with licences for that code who are members of the Unix Heritage Society. We can compare code samples any time," he said.
He agreed that the codebase of Sys V was a terribly tangled mess. "It is very difficult to trace origins now. There is an awful lot of non-AT&T and non-SCO code in Sys V. There is a lot of BSD code there," he said.
In March, the SCO Group filed a billion-dollar lawsuit against IBM, for "misappropriation of trade secrets, tortious interference, unfair competition and breach of contract."
SCO also claimed that Linux was an unauthorised derivative of Unix and warned commercial Linux users that they could be legally liable for violation of intellectual copyright. SCO later expanded its claims against IBM to US$3 billion in June when it said it was withdrawing IBM's licence for its own Unix, AIX.
IBM has counter-sued SCO while Red Hat Linux has sued SCO to stop it from making "unsubstantiated and untrue public statements attacking Red Hat Linux and the integrity of the Open Source software development process."
-----
Wordforge writing contest now open: deadline 2003-03-28

Re:Nah... (Score:5, Informative)

by jedidiah ( 1196 ) writes: on Tuesday September 09, 2003 @06:04PM (#6915286) Homepage

Don't call it the "SCO kernel".

It is the SysV kernel.

Re:Nah... (Score:5, Informative)

by jonabbey ( 2498 ) * writes: <jonabbey@ganymeta.org> on Tuesday September 09, 2003 @06:06PM (#6915318) Homepage

if the GNU/Linux took ideas from the SCO kernel, SCO may be as eligible for compensation as if it were directly copied from SCO.

IANAL, but I don't believe this is so in the general case. Copyright protects only specific expression of ideas, not the ideas themselves.

If SCO had valid patents on some of this stuff, they'd have a point of legal leverage, but they don't from all reports.

Re:maybe... (Score:3, Informative)

by Anonymous Coward writes: on Tuesday September 09, 2003 @06:06PM (#6915322)

look how Microsoft is directly trying to bias the case with more one-sided news: check out (from today) this msn.com article from supposed tech analyst Jonathan Cohen [msn.com]

then read:
More on Jonathan Cohen [threenorth.com]

Microsoft MSN is a biased propaganda machine. It only shows one side of the facts (the lies).

Re:Can Someone Explain? (Score:5, Informative)

by Sterling Christensen ( 694675 ) writes: on Tuesday September 09, 2003 @06:06PM (#6915327)

From its manual:
"The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style."

This is actually a darn good idea (Score:5, Informative)

by RocketRick ( 648281 ) writes: on Tuesday September 09, 2003 @06:07PM (#6915335)

By computing MD5 hashes of consecutive (overlapping) line triplets, the shred algorithm makes it easy to identify copied code, without ever seeing the actual code. This might be a perfect way for companies to allow a third party to compare code, without giving away any trade secrets in the process.

Of course, since MD5 is a very good cryptographic hash function, *any* one-bit change in the source will result in, on average, half of the bits in the result being flipped. So, this method of identifying copied code would only work if the code had never been run through an obfuscator. It would also be defeatable by running the source through a script to have its variable names search-and-replaced with similar names (such as replacing every variable name with a new name consisting of the old name plus "_newname")....

In short, this might be a useful technique for allowing a third party to look for trivial wholesale copying of code, but it would be useless for finding a motivated miscreant, determined to steal code without being caught.
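To make the point above concrete, here is a minimal sketch (not taken from comparator itself) that counts how many of the 128 MD5 output bits change when a three-line snippet has one variable renamed. The snippet text and the helper names `md5_bits` and `differing_bits` are made up for illustration.

```python
import hashlib

def md5_bits(text: str) -> int:
    """Return the MD5 digest of text as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")

def differing_bits(a: str, b: str) -> int:
    """Count the bit positions where the two MD5 digests disagree."""
    return bin(md5_bits(a) ^ md5_bits(b)).count("1")

# Hypothetical three-line shred and a copy with one variable renamed.
original = "for (i = 0; i < count; i++)\n    total += buf[i];\nreturn total;\n"
renamed = original.replace("total", "total_newname")

print(differing_bits(original, original))  # identical input, identical digest
print(differing_bits(original, renamed))   # typically somewhere near 64 of 128
```

Because the digests of the original and the renamed copy share nothing useful, an equality test on shreds cannot tell "renamed" apart from "unrelated", which is exactly the limitation described above.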

Ups and downs (Score:5, Informative)

by autocracy ( 192714 ) writes: <slashdot2007&storyinmemo,com> on Tuesday September 09, 2003 @06:07PM (#6915342) Homepage

Upside: we can maybe help catch more stolen code.
Downside: Uh... it just came out... and it's making some big, big claims involving fuzzy logic. I think it's gonna need some testing first, eh?
Also, anybody else think it only works on larger sections of code than just say 10 lines?

Re:Who cares? (Score:3, Informative)

by jmv ( 93421 ) writes: on Tuesday September 09, 2003 @06:08PM (#6915354) Homepage

I think the difference is that a 3rd party that has access to the SysV source can compute the hashes and make them public without violating copyright. That way anyone can look for common lines with Linux and see where they came from (legal or not).
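A rough sketch of that workflow, assuming the digests are simply published as a JSON list (the real tool has its own file format, which is not reproduced here). The sample lines, the JSON layout and the name `shred_digests` are all invented for the example.

```python
import hashlib
import json

def shred_digests(lines, size=3):
    """MD5 hex digests of every run of `size` consecutive lines."""
    return [
        hashlib.md5("\n".join(lines[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(lines) - size + 1)
    ]

# Party with source access: publishes digests only, never the lines.
licensed_lines = ["a = b;", "c = d;", "e = f;", "g = h;"]
published = json.dumps(sorted(set(shred_digests(licensed_lines))))

# Anyone else: checks their own code against the published digests.
known = set(json.loads(published))
public_lines = ["x = y;", "a = b;", "c = d;", "e = f;"]
hits = [
    start + 1  # 1-based line number where a matching shred begins
    for start, digest in enumerate(shred_digests(public_lines))
    if digest in known
]
print(hits)  # [2]
```

The published string contains only hex digests, so the person doing the lookup learns where their own code matches but never sees the licensed text.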

Re:But the Important Question is... (Score:3, Informative)

by TMB ( 70166 ) writes: on Tuesday September 09, 2003 @06:10PM (#6915377)

From the README...

Besides the production C code, the distribution also includes working Python versions. These were used to prototype the concept.

No word on the latter... but it's ESR... so of course! ;-)

[TMB]

Re:Is there really that much data there? (Score:5, Informative)

by B'Trey ( 111263 ) writes: on Tuesday September 09, 2003 @06:32PM (#6915578)

RTFA. The code is split into overlapping "shreds" of three lines. For example, 7 lines of code would generate five hashes, consisting of the following lines:

1,2,3
2,3,4
3,4,5
4,5,6
5,6,7

Two source trees are shredded, then unique hashes are discarded. Anywhere there are three lines of code that are the same ANYWHERE in the source tree, it'll be spotted.

Now, it's trivial to defeat this if you're specifically aiming to do so. However, for existing source trees (such as nearly countless variations of *nix) that already exist and are duplicated in numerous places, it works nicely. It's impossible to go back and modify the tree because too many copies exist.
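The shredding step described above can be sketched in a few lines. This is an illustration of the idea only, not ESR's implementation; `shreds` and `common_shreds` are hypothetical names, and each "tree" here is just a list of lines.

```python
import hashlib
from collections import defaultdict

def shreds(lines, size=3):
    """Yield (first_line, last_line, digest) for each overlapping shred."""
    for i in range(len(lines) - size + 1):
        chunk = "\n".join(lines[i:i + size])
        yield i + 1, i + size, hashlib.md5(chunk.encode("utf-8")).hexdigest()

def common_shreds(tree_a, tree_b, size=3):
    """Map digest -> ([ranges in a], [ranges in b]), keeping shared shreds only."""
    seen = defaultdict(lambda: ([], []))
    for side, lines in enumerate((tree_a, tree_b)):
        for first, last, digest in shreds(lines, size):
            seen[digest][side].append((first, last))
    # Digests that occur on one side only are discarded here.
    return {d: r for d, r in seen.items() if r[0] and r[1]}

seven = [f"line {n}" for n in range(1, 8)]
print([(first, last) for first, last, _ in shreds(seven)])
# [(1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
```

Seven lines give five shreds, matching the list above, and in general n lines give n - size + 1 shreds.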

Re:Can Someone Explain? (Score:4, Informative)

by Bob the Hamster ( 705714 ) writes: on Tuesday September 09, 2003 @06:32PM (#6915581) Homepage Journal

And note that it is not comparing the MD5s of whole files, it is comparing MD5s of three-line "shreds" of files.

Better way to compare code (Score:2, Informative)

by Brikus ( 670587 ) writes: on Tuesday September 09, 2003 @07:03PM (#6915820)

Speaking of BSD, a better way of doing this comes from Berkeley too. It's a program called Moss [berkeley.edu] that is used by many universities to detect plagiarism in CS classes. I know from firsthand experience that this is a very powerful program. Unlike the shredding technique, things like changing variable names won't affect the comparison value Moss returns. It even does a pretty good job of noticing changes like replacing for loops with while loops.

One disadvantage it does have though is that it won't work with the MD5 checksums, although I'm a bit skeptical of how well that would work anyway.
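To be clear, the following is not how Moss works internally. It is only a toy sketch of one way a hash-based comparison could be made insensitive to renaming: replace every non-keyword identifier with a placeholder before hashing. The keyword list is deliberately tiny and the names `normalize` and `digest` are made up.

```python
import hashlib
import re

# Assumption: a very small keyword set, just enough for the example below.
C_KEYWORDS = {"for", "while", "if", "else", "return", "int", "char", "void"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def normalize(line: str) -> str:
    """Replace every non-keyword identifier with the placeholder ID."""
    out = []
    for tok in TOKEN.findall(line):
        if (tok[0].isalpha() or tok[0] == "_") and tok not in C_KEYWORDS:
            out.append("ID")
        else:
            out.append(tok)
    return " ".join(out)

def digest(lines) -> str:
    """MD5 of the normalized lines, joined with newlines."""
    joined = "\n".join(normalize(line) for line in lines)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

a = ["int total = 0;", "for (i = 0; i < n; i++)", "total += buf[i];"]
b = ["int sum_newname = 0;", "for (k = 0; k < len; k++)", "sum_newname += data[k];"]
print(digest(a) == digest(b))  # True: only the names differ
```

The trade-off is noise: once names are erased, any two loops with the same shape hash alike, so a scheme like this would flag far more innocent matches than plain shredding does.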

You guys are missing the point. (Score:5, Informative)

by LinuxParanoid ( 64467 ) * writes: on Tuesday September 09, 2003 @07:47PM (#6916149) Homepage Journal

Pardon me, but a lot of you guys are missing the point of this comparator.

1) There are people out there with legit source licenses to SVR5 source trees. And not just Unix OEMs. Various people in large companies with SVR4/5 source licenses etc.

2) Such people cannot release the source code, and may (if paranoid of how they interpret 'derived works') not want to publish hashed MD5 codes of SVR5.

3) However, it should be possible for a legit SVR5 source licensee to publish openly a list of areas of code that are similar across trees, without that list being either A) a derived work, B) violating their NDAs (um, do check the fine print first though) and C) spending tons of their own, presumably expensive time, digging through stuff.

Linux advocates can then sift through the resulting large pieces of code and doublecheck/crosscheck the origins of it. At the very least, we'd have a public list of suspicious areas of Linux and could sort them into parts that A) are BSD-licensed, B) are verified as original by a known Linux coder, or C) don't fall into the above categories and remain 'suspicious'. This presumably is what ESR is referring to by "various persons will apply it in useful ways. Yes, I'm being deliberately vague and tantalizing". Let's say that it's likely the percentages in A and B will be large.

Of course it's true that there could be code that this primitive tool doesn't catch. But SCO probably started their analysis by using tools like this also. Looking through millions of lines of code by hand is no cakewalk, so one will inevitably start with a tool like this in such an investigation. (Unless one is concerned about one specific predetermined critical/sensitive piece of code.)

Oh, and the other thing about this tool that is nice IMHO? It demonstrates a "good faith" effort on the part of Linux advocates and coders to correct the problem -- despite the barriers raised by SCO (no code release except via NDA).

Finally, running this tool across a Linux and a BSD release should turn up some data that is both interesting and relevant for this dispute. I'm almost tempted to try that myself.

--LP
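For anyone tempted to try the cross-tree run, the kind of report being discussed might look roughly like this. It is a sketch under simple assumptions (every file is treated as text, three-line shreds, no whitespace handling) and is not comparator's actual output format; `tree_shreds` and `report` are hypothetical names.

```python
import hashlib
import os

def tree_shreds(root, size=3):
    """Map digest -> list of (relative path, first line) across a source tree."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                lines = fh.read().splitlines()
            rel = os.path.relpath(path, root)
            for i in range(len(lines) - size + 1):
                chunk = "\n".join(lines[i:i + size]).encode("utf-8")
                digest = hashlib.md5(chunk).hexdigest()
                index.setdefault(digest, []).append((rel, i + 1))
    return index

def report(root_a, root_b, size=3):
    """Return sorted (location in a, location in b) pairs for shared shreds."""
    a, b = tree_shreds(root_a, size), tree_shreds(root_b, size)
    return sorted(
        (loc_a, loc_b)
        for digest in a.keys() & b.keys()
        for loc_a in a[digest]
        for loc_b in b[digest]
    )
```

Calling `report("tree_a", "tree_b")` on two unpacked source directories (the directory names are placeholders) yields file and line pairs only, which is the sort of list a licensee could hand over without handing over any code.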

Re:IT WILL NOT WORK! Here's technical reason why (Score:4, Informative)

by Megaslow ( 694447 ) * writes: on Tuesday September 09, 2003 @08:25PM (#6916401) Homepage

RTFM:
Name

comparator, filterator -- fast comparisons among large source trees
Synopsis

comparator -c [-d dir] [-o file] [-s shredsize] [-w] [-x] path...

[snip]

The -s option changes the shred size. Smaller shred sizes are more sensitive to small code duplications, but produce correspondingly noisier output. Larger ones will suppress both noise and small similarities.

[snip]

The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style.

Using the appropriate switches will address two of your points. I'm sure it wouldn't be extraordinarily difficult to modify the code to ignore other things such as string constants, variable names, etc.
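Here is a small sketch of what whitespace-insensitive shredding with a configurable shred size could look like, based only on the manual text quoted above. How comparator actually implements -w and -s is not shown in this thread, so treat the details (and the function name) as assumptions.

```python
import hashlib

def shreds_ignoring_whitespace(text, size=3):
    """Yield (first_line, last_line, digest) over non-blank lines only.

    All whitespace is stripped before hashing, but the reported line
    numbers are those of the original, unmodified file.
    """
    kept = []
    for number, line in enumerate(text.splitlines(), start=1):
        squeezed = "".join(line.split())  # drops spaces and tabs inside the line
        if squeezed:                      # blank lines are skipped entirely
            kept.append((number, squeezed))
    for i in range(len(kept) - size + 1):
        window = kept[i:i + size]
        chunk = "\n".join(s for _, s in window).encode("utf-8")
        yield window[0][0], window[-1][0], hashlib.md5(chunk).hexdigest()

tabs = "int f(void)\n{\n\treturn 1;\n}\n"
spaces = "int f(void)\n{\n\n    return  1;\n\n}\n"

print([(a, b) for a, b, _ in shreds_ignoring_whitespace(tabs)])    # [(1, 3), (2, 4)]
print([(a, b) for a, b, _ in shreds_ignoring_whitespace(spaces)])  # [(1, 4), (2, 6)]
```

The two files produce the same digests even though the indentation and blank lines differ, while the line ranges still point into each original file. Passing a smaller `size` gives more, shorter shreds (more sensitive, noisier); a larger one gives fewer.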

Re:Bah! FSS developers will never learn... (Score:3, Informative)

by joe_bruin ( 266648 ) writes: on Tuesday September 09, 2003 @09:01PM (#6916602) Homepage Journal

Line 7: for (i = 1000; i--;) {

Where's the limit test? Or did you mean:

for (i = 1000; ;i--) {

What the original poster had works correctly. i-- is a post-decrement: it evaluates to the value i had before the decrement, so the loop condition becomes false, and the loop ends, when that value is zero.
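The behaviour of that loop header can be traced by hand. Python has no `i--`, so this sketch emulates the C semantics explicitly; `run_loop` is a made-up helper and a start value of 3 is used to keep the trace short.

```python
def run_loop(start):
    """Emulate C's `for (i = start; i--;) { body }` and trace the body."""
    i = start
    seen = []
    while True:
        tested = i       # i-- evaluates to the value before the decrement...
        i -= 1           # ...and then i is decremented
        if tested == 0:  # a zero test value ends the loop
            break
        seen.append(i)   # the body sees the already-decremented i
    return seen, i

seen, final = run_loop(3)
print(seen, final)  # [2, 1, 0] -1
```

With a start value of 1000 the body runs exactly 1000 times, with i going from 999 down to 0.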

Re:IT WILL NOT WORK! Here's technical reason why (Score:4, Informative)

by miniver ( 1839 ) writes: on Tuesday September 09, 2003 @09:05PM (#6916635) Homepage

Download & read the source. Or just read the documentation [catb.org].

Comparator has the capability (-w) to ignore whitespace while generating the hash, while at the same time tracking the actual line numbers for purposes of merging and reporting. In my experience, most code-copiers are dumb and/or lazy -- to get past ESR's tool, the code-copier would have to (a) realize that they're violating a license, (b) not care, (c) be smart enough to realize that a pure cut-and-paste might get caught, and (d) be energetic enough to munge up the code logic and variables. While I'm sure there are people like that, I would argue that most of them wouldn't be interested in contributing the result to the community, and the code wouldn't get past Linus if they did. The more logical case is someone (or some company) who believes that they have a legitimate right to copy code from one kernel to another (BSD -> Linux / Linux -> SysV / SysV -> Linux) and thus doesn't feel the need to cover things up. Either of the SCO User Group examples would fit this category.

I've seen this before from ESR... (Score:3, Informative)

by Dr. Smeegee ( 41653 ) * writes: on Tuesday September 09, 2003 @09:46PM (#6916908) Homepage Journal

He developed a Callcenter Training Utility for our company in the early 80's. It used genetic algorithms to generate simulated customer complaints that were _very_ realistic, even to the point of using sample voices to "whine". Of course, the helpdesk trainees hated it...

But hey, the mewling was featureful.

Re:What if...? (Score:2, Informative)

by Anonymous Coward writes: on Tuesday September 09, 2003 @10:16PM (#6917113)

It's not that people around here think SCO is evil for saying their IP has been stolen. Some people around here think SCO is 'evil' for how they've handled the situation.

If I remember correctly, the open source community has made several offers to remove the tainted code if SCO would just say what code is in violation.

Doesn't even compile (Score:3, Informative)

by tvm662 ( 232083 ) writes: on Tuesday September 09, 2003 @10:50PM (#6917473)

Has anyone else tried to compile Eric's code?

>gcc --version
2.95.3

>make
/usr/bin/gcc -c -g main.c
main.c: In function `report_time':
main.c:311: parse error before `int'
main.c:312: parse error before `int'
main.c:316: `buf' undeclared (first use in this function)
main.c:316: (Each undeclared identifier is reported only once
main.c:316: for each function it appears in.)
main.c:317: `minutes' undeclared (first use in this function)
main.c:317: `seconds' undeclared (first use in this function)
make: *** [main.o] Error 1

Looks like Eric has been coding too much C++ or something. I'm not a C coder myself, so I might be wrong, but don't you have to declare all the variables at the top of a block of C code, before any statements? In report_time, he doesn't seem to have followed that rule: declarations are interleaved with statements such as "elapsed %= 3600;". Maybe he might check his code on a number of compilers before declaring he has "perfected it".

Eric here's my patch:

--- main.c 2003-09-10 00:28:37.000000000 -0300
+++ main.c.fixed 2003-09-10 00:29:55.000000000 -0300
@@ -306,12 +306,17 @@

if (mark_time)
{
- int elapsed = endtime - mark_time;
- int hours = elapsed/3600; elapsed %= 3600;
- int minutes = elapsed/60; elapsed %= 60;
- int seconds = elapsed;
+ int elapsed;
+ int hours;
+ int minutes;
+ int seconds;
char buf[BUFSIZ];

+ elapsed = endtime - mark_time;
+ hours = elapsed/3600; elapsed %= 3600;
+ minutes = elapsed/60; elapsed %= 60;
+ seconds = elapsed;
+
va_start(ap, legend);
vsprintf(buf, legend, ap);
fprintf(stderr, "%% %s: %dh %dm %ds\n", buf, hours, minutes, seconds);

Re:Who says ESR can't code? (Score:5, Informative)

by joeytsai ( 49613 ) writes: on Tuesday September 09, 2003 @11:40PM (#6917862) Homepage

Actually fetchmail proves that he can code.

You may want to check out "The Emperor Has No Clothes [1accesshost.com]", a look at ESR's real code contributions.

I found out myself (Score:3, Informative)

by jtheory ( 626492 ) writes: on Tuesday September 09, 2003 @11:53PM (#6917987) Homepage Journal

Okay, here it is (from the man page):

comparator works by first chopping the specified trees into overlapping shreds (by default 3 lines long) and computing the MD5 hash of each shred.

(Emphasis added)

Re:What a weird tool (Score:2, Informative)

by dazk ( 665669 ) writes: on Wednesday September 10, 2003 @12:17AM (#6918171)

Eric's tool lets you compare larger or smaller chunks. Single lines will easily match very often, so single-line matches are not the problem. The problem always lies in a sequence of lines. That's why you need overlapping sequences.

Re:maybe... (Score:3, Informative)

by Webmonger ( 24302 ) writes: on Wednesday September 10, 2003 @12:29AM (#6918249) Homepage

Hashes and lossy compression are different things. They're designed for completely different purposes and implemented for the purpose they serve. That's why LAME won't compress an mp3 to less than 8kbps, much less 128 bits. It's why md5sum doesn't have a --reproduce-original switch.

For a given input and parameters, any two (independently-developed) MP3 encoders will almost certainly produce different outputs. For a given input and parameters, different md5 implementations will produce the same result.
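The determinism claim is easy to check against the published MD5 test vectors (RFC 1321, appendix A.5). A minimal check using Python's standard hashlib:

```python
import hashlib

# Test vectors from the MD5 specification (RFC 1321, appendix A.5).
VECTORS = {
    b"": "d41d8cd98f00b204e9800998ecf8427e",
    b"a": "0cc175b9c0f1b6a831c399e269772661",
    b"abc": "900150983cd24fb0d6963f7d28e17f72",
}

for message, expected in VECTORS.items():
    digest = hashlib.md5(message).hexdigest()
    # Each hex character carries 4 bits, so 32 characters is 128 bits.
    print(message, digest == expected, len(digest) * 4)
```

Any conforming implementation has to reproduce these values, and the output is always 128 bits no matter how large the input is, which is why there is nothing to "decompress".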

Looks like "fair use" to me (Score:2, Informative)

by Anonymous Coward writes: on Wednesday September 10, 2003 @01:12AM (#6918490)

I don't know if the MD5 sums are a derivative work of the original source or not, but I would be inclined to think that they are.

Let's look at what the law says about fair use

Fair Use [copyright.gov]

The four factors are: (1) the purpose and character of the use, including whether such use is of commercial nature or is for nonprofit educational use; (2) the nature of the copyrighted work; (3) amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

It looks to me that under part (1), the MD5sums are a form of commentary or news reporting about the original work, not a replacement for the work. I don't know about (2). Under (3), the "amount" is definitely small, and the "substantiality" is low. And under (4), almost nobody who would buy the original work is going to substitute the MD5sum's instead, so the MD5sum's would have nil effect on the market for the original work.

So in my AC-IANAL opinion, distribution of the MD5sum's would be protected under American copyright law as a "fair use".

Re:MD5 easily fooled (Score:3, Informative)

by ESR ( 3702 ) writes: on Wednesday September 10, 2003 @04:45AM (#6919256) Homepage

But for changes of that kind, the burden of proof is heavy and on the party alleging infringement. The comparator tool isn't designed to try to catch such deliberate obfuscation, because that would get into murky territory near the boundary of expression and idea. Did you really think I failed to study the legal questions before I wrote this?

Re:No source = no copyright (Score:3, Informative)

by Raffaello ( 230287 ) writes: on Wednesday September 10, 2003 @08:35AM (#6920018)

You are missing the context to which the OP refers, which is Article I, Section 8 of the United States Constitution. This Section gives Congress the power:

"To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries; "
The "requirement" that the grantee of either a copyright or a patent publish the work in question follows from the first clause of the sentence, i.e. "To promote the progress of science and useful arts."

If grantees were allowed to keep their works secret, the grant would not be promoting the "progress of science and useful arts," since no other scientist or author would have access to their work.

The whole idea of patents and copyrights in the U.S. constitution is that the grantee goes public with the invention/work, thus letting others advance "science and the useful arts" by using the grantee's work. In exchange for disclosing this information (remember, the word "patent" means "public."), the grantee is given a legal monopoly on the right to profit from the invention/work for a limited period of time.

As the constitution sees it there are only two alternatives. Either the potential grantee keeps the work/invention a trade secret, never publishing it, but thereby giving up legal rights to time limited exclusive profitability, or, the potential grantee "promotes the sciences and useful arts" by publishing the work/invention, and thereby gains a time limited exclusive right to profit from it.

Re:maybe... (Score:4, Informative)

by Stephan Schulz ( 948 ) writes: <schulz@eprover.org> on Wednesday September 10, 2003 @09:58AM (#6920713) Homepage

I anticipate this tool will be useless more often than not, simply because the slightest systemic change would result in zero matches. Replacing tabs with spaces, two spaces with three, or even line-feeds with carriage-returns would yield 100% false negatives if you use this to identify copyright violations.

I've read the man page that comes with the program, and such things are taken care of. There is an option that will ignore horizontal and vertical white space for comparison purposes, and another one that ignores curly braces (possibly as bad a source of false negatives as formatting).
All in all, it seems to be quite a nice little tool.
