ESR to Shred SCO Claims?
webmaven writes "According to this article in eWEEK, ESR has released a utility called comparator for analyzing the similarity of source code trees. The technical details are interesting, in that ESR says he is using an implementation of a refined version of the 'shred' algorithm, with higher performance (on machines with enough RAM) than other versions. ESR won't say whether he intends the comparator to be used to compare older Unix code to Linux so as to be able to refute SCO's claims, but it's obviously well suited for such a purpose. Interestingly, as the shred algorithm can run reports on source trees using only the MD5 signature shreds (once generated), it is possible to use it to compare trees without direct access to the source code itself, leading to a possible use in comparing various proprietary source trees with each other and with Freely available code bases such as Linux and *BSD without requiring actual disclosure of the proprietary source code (a neutral third party could generate the shreds on a company's premises, and leave without taking a copy of the source with them). I'll be interested to see if (or which of) the proprietary vendors allow their source trees to be 'shredded' for such comparisons, and whether this becomes a standard forensic technique in source-code copyright and trade-secret disputes."
maybe... (Score:5, Funny)
Re:maybe... (Score:5, Interesting)
Re:maybe... (Score:5, Insightful)
If anyone is able to prove Microsoft is doing something illegal via the shared source initiative, they'll probably have to do it illegally.
Re:maybe... (Score:4, Insightful)
On the other hand, if all their code checks out, testing for that may violate their NDA, but it'd be difficult for them to show you checked their code if you don't mention it.
Re:maybe... (Score:5, Insightful)
A term in any contract, including any NDA, as stipulated by any party, which would obligate the other party to not report a violation of law, either statute or criminal, is PER SE unlawful and cannot be enforced within any jurisdiction of most first world countries. Any contract bearing such a stipulation would in fact be at significant risk of invalidating the ENTIRE contract, not just the unlawful provisions therein.
C//
No source = no copyright (Score:5, Insightful)
Copyright is misapplied to source code. Either REVEAL THE SOURCE or you only get protection on that which you "publish" - namely, the binary.
Put up or shut up; no source, no copyright on the source. You won't share it, you don't need it protected.
Re:No source = no copyright (Score:4, Interesting)
Re:No source = no copyright (Score:3, Interesting)
Patents, in order to be patented, need to be fully disclosed. That's inherent in the patent
Re:No source = no copyright (Score:3, Informative)
"To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries; "
The "requirement" that the grantee of either a copyright or a patent publish the work in question follows from the first clause of the sentence, i.e. "To promote the progress of sci
Idiocy... (Score:3, Insightful)
Did your momma have any children that learned to think?
Source code gets no copyright protection: corporations keep their source as a "trade secret" and only get protection on the executable. It is illegal to redistribute (copy) the executable, and the source is entirely within their control (and their responsibility). No real "furtherance of the arts" is accomplished except within the limited scope of usage of the t
Better yet, a reason to get MS to stop funding SCO (Score:3, Interesting)
More importantly, get something like this accepted in a court of law as a legitimate way to do an initial assessment of code yet still preserve a litigant's right to code privacy, and you are going to have not just MS but a number of big companies shaking in their boots. Not necessarily because they did steal anything but because they have to realiz
Re:maybe... (Score:4, Informative)
All in all, it seems to be quite a nice little tool.
Re:maybe... (Score:3, Informative)
For a given input and parameters, any two (independently-developed) MP3 encoders will almost certainly produce different outputs. For a given input and parameters, different md5 implementations will produce the same result.
Re:maybe... (Score:3, Informative)
then read:
More on Jonathan Cohen [threenorth.com]
Microsoft MSN a biased propaganda machine. Only shows one side of the facts (the lies).
Re:IT WILL NOT WORK! Here's technical reason why (Score:4, Funny)
like gcc?
Re:IT WILL NOT WORK! Here's technical reason why (Score:4, Informative)
Name
comparator, filterator -- fast comparisons among large source trees
Synopsis
comparator -c [-d dir] [-o file] [-s shredsize] [-w] [-x] path...
[snip]
The -s option changes the shred size. Smaller shred sizes are more sensitive to small code duplications, but produce correspondingly noisier output. Larger ones will suppress both noise and small similarities.
[snip]
The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style.
Using the appropriate switches will address two of your points. I'm sure it wouldn't be extraordinarily difficult to modify the code to ignore other things such as string constants, variable names, etc.
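A minimal sketch of what the quoted -w behaviour could look like, written in Python rather than comparator's own C. This is not ESR's code, and the function name significant_lines is made up for illustration: it strips all whitespace from every line, drops lines that end up empty, and keeps the original line number next to each surviving line so a report can still point into the real file.

```python
def significant_lines(text):
    """Return (line_number, squeezed_line) pairs, ignoring all whitespace.

    Hypothetical helper, not part of comparator. Line numbers are 1-based
    and refer to the original text, so skipped blank lines leave gaps.
    """
    result = []
    for number, line in enumerate(text.splitlines(), start=1):
        squeezed = "".join(line.split())   # remove spaces, tabs, etc.
        if squeezed:                        # skip lines that were only whitespace
            result.append((number, squeezed))
    return result


spaces_version = "int f(int x)\n{\n    return x + 1;\n}\n"
tabs_version = "int f( int x )\n{\n\n\treturn x+1;\n}\n"

print(significant_lines(spaces_version))
print(significant_lines(tabs_version))
```

Both inputs yield the same squeezed lines, while the second keeps its own line numbers (the blank line 3 is skipped). Note that this naive version also removes whitespace inside string literals, which may or may not be what you want.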
Re:IT WILL NOT WORK! Here's technical reason why (Score:4, Informative)
Download & read the source. Or just read the documentation [catb.org].
Comparator has the capability (-w) to ignore whitespace while generating the hash, while at the same time tracking the actual line numbers for purposes of merging and reporting. In my experience, most code-copiers are dumb and/or lazy -- to get past ESR's tool, the code-copier would have to (a) realize that they're violating a license, (b) not care, (c) be smart enough to realize that a pure cut-and-paste might get caught, and (d) be energetic enough to munge up the code logic and variables. While I'm sure there are people like that, I would argue that most of them wouldn't be interested in contributing the result to the community, and the code wouldn't get past Linus if they did. The more logical case is a person or company who believes they have a legitimate right to copy code from one kernel to another (BSD -> Linux / Linux -> SysV / SysV -> Linux) and thus doesn't feel the need to cover things up. Either of the SCO User Group examples would fit this category.
Re:Slightly less lazy (Score:3, Funny)
I found out myself (Score:3, Informative)
comparator works by first chopping the specified trees into overlapping shreds (by default 3 lines long) and computing the MD5 hash of each shred.
(Emphasis added)
SCO! (Score:4, Funny)
Re:SCO! (Score:3, Interesting)
Re:SCO! (Score:5, Insightful)
derivative work? (Score:5, Interesting)
Would these hashes of SCO source code be considered derivative works? That could have copyright implications...
Re:derivative work? (Score:5, Insightful)
Take a copyrighted work (Harry Potter and The Chamber of Secrets, for example).
Now, rearrange all the letters randomly, and pick (say) every 10th letter. Apply rot13 to the result, and print it.
Is this derivative work? If you think it is, then, yes, copyright holders should be able to control MD5 hashes produced from their work.
Re:derivative work? (Score:3, Insightful)
SCO's trade secrets --- it's all FUD (Score:5, Funny)
What if...? (Score:5, Insightful)
What if this ESR tool runs and finds commonality, and the research shows that, in fact, SCO's rights were breached. Remember, this type of analysis is a two-edged sword. The purpose of this ESR is to remove doubt... but remember doubt could be removed either direction.
So, given that hypothetical, what would people here think? Would you forgive SCO? Would you concede SCO's point, but think that SCO defended their rights in a very poor manner? (this, btw, is what I would probably do). Would you stick your fingers in your ears and refuse to accept the outcome, and believe in some vast -wing conspiracy?
Obviously, the Linux movement would carry on. I don't think the death of Linux is even worth discussion. Some recourse would happen, probably monetary damages, and the offending code would be removed.
My real curiosity is how people's attitudes or feelings would change (or not change) if it turns out SCO is right (however unlikely that is).
Re:What if...? (Score:3, Insightful)
Most of us are relying on common sense and don't really care whether a few lines of archaic code were copied. Given SCO's
1) previous sales of Linux
2) misinformation about owning Unix
3) waffling on what IP is violated
4) refusal to show copied code
5) frequent, inconsistent press releases
6) heavy insider trading
7) ridiculous licensing terms
8
Re:What if...? (Score:3, Insightful)
Here's how you defeat obfuscators (Score:3, Insightful)
Someone mod this up I think I'm on to something!
Re:The real question is: (Score:3, Insightful)
Here is the reason: the people that "stole" SCO's code (if indeed that happened) probably were not acting with ill intent. They probably thought they were doing genuine, valid reuse, in which case, why hide it? Obfuscating runs the risk of introducing new bugs.
OSS programmers, even the ones that cut corners, are not malicious in my experience. There are honest mistakes made, because, well, they are lone programmers, not lawyers, or prof
Is there really that much data there? (Score:4, Funny)
Re:Is there really that much data there? (Score:4, Informative)
Re:Is there really that much data there? (Score:5, Informative)
1,2,3
2,3,4
3,4,5
4,5,6
5,6,7
Two source trees are shredded, then unique hashes are discarded. Anywhere there are three lines of code that are the same ANYWHERE in the source tree, it'll be spotted.
Now, it's trivial to defeat this if you're specifically aiming to do so. However, for existing source trees (such as nearly countless variations of *nix) that already exist and are duplicated in numerous places, it works nicely. It's impossible to go back and modify the tree because too many copies exist.
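To make the 1,2,3 / 2,3,4 picture concrete, here is a rough Python sketch of the idea as described in this thread: overlapping three-line shreds, one MD5 per shred, and only hashes that occur in both trees are kept. It is not comparator's actual implementation; the function names are invented, and each "tree" is just a list of lines.

```python
import hashlib


def shred_hashes(lines, shred_size=3):
    """Map MD5 hex digest -> list of 1-based start lines, one entry per overlapping shred."""
    index = {}
    for start in range(len(lines) - shred_size + 1):
        shred = "\n".join(lines[start:start + shred_size])
        digest = hashlib.md5(shred.encode("utf-8")).hexdigest()
        index.setdefault(digest, []).append(start + 1)
    return index


def common_shreds(lines_a, lines_b, shred_size=3):
    """Return sorted (start_in_a, start_in_b) pairs for shreds present in both inputs."""
    index_a = shred_hashes(lines_a, shred_size)
    index_b = shred_hashes(lines_b, shred_size)
    shared = index_a.keys() & index_b.keys()   # hashes unique to one side are discarded
    return sorted((a, b) for h in shared for a in index_a[h] for b in index_b[h])


tree_a = ["line 1", "line 2", "line 3", "line 4", "line 5", "line 6", "line 7"]
tree_b = ["other", "line 3", "line 4", "line 5", "line 6", "different"]

print(common_shreds(tree_a, tree_b))
```

With these toy inputs the shreds starting at lines 3 and 4 of the first tree match those starting at lines 2 and 3 of the second. A real tool would walk directories and merge adjacent hits into ranges.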
Re:Is there really that much data there? (Score:4, Funny)
ESR ADMITS TO ENRON PRACTICES (Score:5, Funny)
But the Important Question is... (Score:3, Funny)
Answered My Own Question.. (Score:4, Funny)
"...has two advantages: one, it's amazingly fast..."
Guess not. ;-)
Re:But the Important Question is... (Score:3, Informative)
From the README...
No word on the latter... but it's ESR... so of course! ;-)
[TMB]
Doubt it will help (Score:5, Insightful)
Re:Doubt it will help (Score:5, Insightful)
Until the lines that are common are identified, it's impossible to defend against the accusations. Because of that, I bet Darling Darl won't allow it to be used. The question is, how to turn the inevitable refusal into something that shuts him (up|down).
Re:Doubt it will help (Score:5, Insightful)
If ESR is given the big list of MD5 sums of SCO's kernel by someone who has legitimate access to it, and he runs his shred tool to compare it to the Linux kernel, and a bunch of stuff turns up matching (as expected) he can still see WHAT was matching because he has the Linux sources.
So then he can look at that and say, "hmmm, it looks like part of this ethernet driver is the same, and this NAT implementation, and bits and pieces of the VFAT filesystem code..." and then, find out how those got to be the way they are in Linux.
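Roughly how that could work, as a Python sketch with invented names (shred_digest, matching_regions) and toy data standing in for the two kernels: one side hands over nothing but a set of MD5 digests, and whoever holds the open source can still list the line ranges of their own tree that matched.

```python
import hashlib


def shred_digest(lines):
    """MD5 hex digest of one shred (a short run of consecutive lines)."""
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


def matching_regions(published_digests, local_lines, shred_size=3):
    """Return 1-based (first_line, last_line) ranges of local code whose shreds
    appear in the published digest set. Overlapping or adjacent hits are merged."""
    regions = []
    for start in range(len(local_lines) - shred_size + 1):
        if shred_digest(local_lines[start:start + shred_size]) in published_digests:
            first, last = start + 1, start + shred_size
            if regions and first <= regions[-1][1] + 1:
                regions[-1] = (regions[-1][0], last)   # extend the previous range
            else:
                regions.append((first, last))
    return regions


# The party holding the closed tree publishes digests only, never the lines.
closed_lines = ["a = 1;", "b = 2;", "c = a + b;", "d = c * 2;", "secret();"]
published = {shred_digest(closed_lines[i:i + 3]) for i in range(len(closed_lines) - 2)}

# The open side has its own source and can see exactly what matched.
open_lines = ["init();", "a = 1;", "b = 2;", "c = a + b;", "d = c * 2;", "done();"]
print(matching_regions(published, open_lines))
```

The output is a list of ranges in the open tree only; working out where that code came from is still a job for a human with the revision history.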
If it can be proved that the matching code is totally legit in Linux, (which is what I would expect) then it follows that either (a) SCO actually stole stuff out of Linux, rather than the reverse, or (b) Linux and SCO both took the code from a third source, like BSD.
Otherwise, option (c) is that Linux actually contains code from SCO which it should not. But this is still an improvement on the current situation, because it would allow the Linux development team to FIX THE PROBLEM.
Either way, (sooner or later, depending on if Linux fixes are required) it will shoot SCO's claims so full of holes that any reputable journalist reporting on SCO's latest insane claims will have to mention that "... but the source code has been analyzed and all code in Linux similar to SCO's software has been shown to be completely legitimate...", or "... but all code in Linux which SCO might have had a valid issue about has been removed..."
SCO's big stick right now is FUD. Fear, Uncertainty, and Doubt. The shred tool can remove the uncertainty and doubt. Only SCO will still have the Fear.
Nah... (Score:4, Insightful)
Re:Nah... (Score:5, Informative)
It is the SysV kernel.
Re:Nah... (Score:4, Funny)
OK then, the GNU/SCO kernel.
Re:Nah... (Score:4, Interesting)
Re:Nah... (Score:5, Informative)
if the GNU/Linux took ideas from the SCO kernel, SCO may be as eligible for compensation as if it were directly copied from SCO.
IANAL, but I don't believe this is so in the general case. Copyright protects only specific expression of ideas, not the ideas themselves.
If SCO had valid patents on some of this stuff, they'd have a point of legal leverage, but they don't from all reports.
The truth is out there (Score:3, Interesting)
Breaking News! (Score:3, Funny)
Other uses? (Score:4, Interesting)
Of course, that also depends more on how differences are actually calculated. Still, could make an interesting project to relate OSes based on how much shared code they still retain and show it in a graphical tree format, ala "family tree." 8)
Who cares? (Score:3, Insightful)
Anyway -- who cares? There's no question there are plenty of common chunks between Linux and SCO-owned source. And that there are ways to find them. The question is what they are (which SCO isn't saying) and what their common origin is and where that origin falls in the murky history of the Unix codebase. It's not as if anyone has been saying, "We're helpless in the face of this computational problem. If only there were a way to compare large bodies of text for common elements!"
Never mind that there are probably people who can compare both codebases in their heads.
Maybe he's made some major algorithmic breakthrough. (I doubt it, but I'll leave that to the experts.) But this story is just him yapping again.
Re:Who cares? (Score:3, Informative)
SCO may not know origin of code (Score:5, Informative)
Article text follows:
SCO may not know origin of code, says Australian UNIX historian
By Sam Varghese
September 9, 2003
More doubts have been cast on the heritage of System V Unix code, which the SCO Group claims as its own, by an Australian who runs the Unix Heritage Society. [tuhs.org]
Dr Warren Toomey, now a computer science lecturer at Bond University, said today: "I'd like to point out that SCO (the present SCO Group) probably doesn't have an idea where they got much of their code. The fact that I had to send SCO (the Santa Cruz Operation or the old SCO) everything up to and including Sys III says an awful lot."
He said that even though SCO owned the copyright on Sys III, a few years ago it did not have a copy of the source code. "I was dealing with one of their people at the time, trying to get some code released under a reasonable licence. I sent them the code as a gesture because I knew they did not have a copy," he said with a chuckle.
Dr Toomey's statements come a few days after Greg Rose, an Australian Unix hacker from the 1970s, raised the possibility that there may be code contributed by people, including himself, which has made its way into System V Unix and is thus being used by companies like the SCO Group.
Dr Toomey said this was one reason why the code samples which the SCO Group had shown at its annual forum had turned out to be widely published code.
SCO was unaware of the origins of much of the code and this "explains how they could wheel out the old malloc() code and the BPF (Berkeley Packet Filter) code, not realising that both were now under BSD licences - and in fact they hadn't even written the BPF code," Dr Toomey said.
He said that there was lots of code which had been developed at the University of New South Wales in the 70s which went to AT&T and was incorporated into UNIX without any copyright notices.
"At that time the development that was going on was similar to open source - the only difference was that the developers all had to have copies of the code licensed from AT&T," he said.
Dr Toomey, who served 12 years with the Australian Defence Force Academy, an offshoot of the University of New South Wales, before joining Bond University, said he had source code for Unices from the 3rd version of UNIX which came out in 1974 to the present day. "I don't have Sys V code but there are people with licences for that code who are members of the Unix Heritage Society. We can compare code samples any time," he said.
He agreed that the codebase of Sys V was a terribly tangled mess. "It is very difficult to trace origins now. There is an awful lot of non-AT&T and non-SCO code in Sys V. There is a lot of BSD code there," he said.
In March, the SCO Group filed a billion-dollar lawsuit against IBM, for "misappropriation of trade secrets, tortious interference, unfair competition and breach of contract."
SCO also claimed that Linux was an unauthorised derivative of Unix and warned commercial Linux users that they could be legally liable for violation of intellectual copyright. SCO later expanded its claims against IBM to US$3 billion in June when it said it was withdrawing IBM's licence for its own Unix, AIX.
IBM has counter-sued SCO while Red Hat Linux has sued SCO to stop it from making "unsubstantiated and untrue public statements attacking Red Hat Linux and the integrity of the Open Source software development process."
-----
Wordforge writing contest now open: deadline 2003-03-28
In all fairness.. (Score:4, Insightful)
The SCO Group (not old SCO) hasn't written any code in SysV UNIX.
Anyway.. One could hope that when this is all over, the UNIX sources will be bought up from the carcass of SCOX and open-sourced, finally putting it out of its misery..
That is, as long as SysV UNIX doesn't have more stolen code in addition to the BSD code we all know about..
The sooner the zombie of UNIX is put to rest, the better for all the live Unices.
Re:In all fairness.. (Score:5, Insightful)
SCO's value is in acting as a totem against future companies who would try this same stunt....Their value is in their smoking carcass with Darl's charred head mounted prominently on a high pike...
At this point, there can be no compromise with people who commit fraud to inflate their stock price and to promote FUD.... I believe that Darl KNOWS that his claims are false...he deserves to fry....
I say, "smoking head on stake" for all the SCO/Canopy group members.... leave all the execs at SCO without a job and discredited like the MCI/ENRON execs....Leave all the investors holding worthless stock certs....Somebody needs to be an example, and SCO volunteered by inflating/changing/hyping/FUDing their claims.
I could have had a little sympathy for them if they had just filed their suit and shut-up until the trial....but at $17/share now, we need to destroy some wallets to remind everyone that it's not over till the gavel falls......
Be careful... (Score:5, Interesting)
Who says SCO gets to court first? (Score:5, Interesting)
Re:Be careful... (Score:4, Insightful)
Having many thousands of bright minds working on our side more than balances the advantage SCO can get by snooping on our discourse, if they can even come close to following it all, that is. We outnumber them; it's stupid not to capitalize on that.
Just think, if the word doesn't go out, there are many people who might not have come out of the woodwork to contribute their valuable input, historical recollection, interesting files, legal insight, whatever. We work in the open, we share information, we cooperate, we are many in number. They work in the dark, they trust nobody, they're afraid to ask for help, they are few. It's open source versus closed source all over again.
Also, we each do our own thinking, we try to come up with the part we can contribute, then we go looking for the best place to contribute it. Multiply by tens of thousands. Compare to a few fevered minds going over and over the same rotten thoughts then sending out marching orders. Seen two systems like that before? Right, it's a free market economy versus Soviet-style central planning. In the end, the free market won because it is more efficient.
With the help of the open source community, they are slowly changing their weapon of choice from a shotgun to a rifle.
A rifle will not help you much against a herd of 50,000 enraged penguins stampeding towards you at an average speed in excess of 100 miles per hour.
Dibs on naming the KDE GUI (Score:4, Funny)
This is actually a darn good idea (Score:5, Informative)
Of course, since MD5 is a very good cryptographic hash function, *any* one-bit change in the source will result in, on average, half of the bits in the result being flipped. So, this method of identifying copied code would only work if the code had never been run through an obfuscator. It would also be defeatable by running the source through a script to have its variable names search-and-replaced with similar names (such as replacing every variable name with a new name consisting of the old name plus "_newname")....
In short, this might be a useful technique for allowing a third party to look for trivial wholesale copying of code, but it would be useless for finding a motivated miscreant, determined to steal code without being caught.
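A quick way to see the effect for yourself, sketched in Python with the standard hashlib module. The helper name md5_bits_changed and the sample statement are made up for the demonstration; it simply counts how many of MD5's 128 output bits differ between two inputs.

```python
import hashlib


def md5_bits_changed(a, b):
    """Count how many of the 128 MD5 output bits differ between two byte strings."""
    x = int.from_bytes(hashlib.md5(a).digest(), "big")
    y = int.from_bytes(hashlib.md5(b).digest(), "big")
    return bin(x ^ y).count("1")


original = b"bp->m_addr = bp->m_addr + size;"
renamed = b"bp->m_addr_newname = bp->m_addr_newname + size;"
one_bit = bytes([original[0] ^ 1]) + original[1:]   # flip the lowest bit of the first byte

print(md5_bits_changed(original, original))   # identical input, so 0 bits differ
print(md5_bits_changed(original, one_bit))    # a single flipped input bit
print(md5_bits_changed(original, renamed))    # the "_newname" rename trick
```

For the last two calls you should see a count somewhere in the neighbourhood of 64, which is why a hash match only ever signals an exact match of the shredded text.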
Re:This is actually a darn good idea (Score:5, Insightful)
So, this method of identifying copied code would only work if the code had never been run through an obfuscator.
You've hit the nail on the head, possibly without knowing it. The source code needs to be run through an obfuscator *before* shredding. Actually, I'm thinking a special obfuscator, let me explain.
Let's take a piece of C source, not randomly chosen:
malloc(mp, size)
struct map *mp;
{
	register int a;
	register struct map *bp;

	for (bp = mp; bp->m_size; bp++) {
		if (bp->m_size >= size) {
			a = bp->m_addr;
			bp->m_addr =+ size;
			if ((bp->m_size =- size) == 0)
				do {
					bp++;
					(bp-1)->m_addr = bp->m_addr;
				} while ((bp-1)->m_size = bp->m_size);
			return(a);
		}
	}
	return(0);
}

Now, the structure of the code is 99% of what matters. Variable names can change, but few people would change anything beyond that. Let's modify the code in a couple of important ways. First, all variable names are changed to new names, on a per-line basis. Blank lines and unneeded blanks are all removed. Each statement is on its own line, and formatting styles (such as curly bracket placement) are standardized.

malloc(a, b)
struct a *b;
{
register int a;
register struct map *b;
for (a=b;a->c;a++)
{
if (a->b>= c)
{
a=b->c;
a->b=+c;
if ((a->b=-c)==0)
do
{
a++;
(a-1)->b=a->b;
}
while ((a-1)->b=a->b);
return(a);
}
}
return(0);
}

This might not be perfect, but it should do the trick. A programmer can change variable names, spacing, or format, but as long as the code is the same, it'll match. Obviously, changing the code would have an impact, but nearly every line would have to be changed for it to not match, and in a substantial way. That's literally not always possible to even do in a way that would trick this function.
Anyone want to write it?
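A very rough first cut at the per-line renaming step, in Python. Everything here is an assumption made for illustration: the keyword list is a small subset of C, identifiers become v0, v1, ... rather than a, b, c, function names are renamed along with variables, and comments and string literals are not handled at all.

```python
import re

# A deliberately small subset of C keywords; a real tool would need the full list.
C_KEYWORDS = {
    "if", "else", "for", "while", "do", "return", "struct", "register",
    "int", "char", "long", "unsigned", "void", "static", "sizeof",
}

# Identifiers, runs of digits, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize_line(line):
    """Rename identifiers to v0, v1, ... in order of first use on this line,
    and squeeze whitespace so formatting differences disappear."""
    names = {}
    out = []
    for token in TOKEN.findall(line):
        if re.match(r"[A-Za-z_]", token) and token not in C_KEYWORDS:
            token = names.setdefault(token, "v%d" % len(names))
        out.append(token)
    return " ".join(out)


print(normalize_line("a = bp->m_addr;"))
print(normalize_line("addr   =  entry -> m_addr ;"))
```

Both calls print the same normalized line, so shredding the normalized text instead of the raw text would survive renames and reformatting. Putting each statement on its own line and standardizing brace placement would need a real C tokenizer or parser on top of this.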
Michael
Ups and downs (Score:5, Informative)
Downside: Uh... it just came out... and it's making some big, big claims involving fuzzy logic. I think it's gonna need some testing first, eh?
Also, anybody else think it only works on larger sections of code than just say 10 lines?
Automating people's careers away (Score:5, Funny)
Slim to None (Score:5, Insightful)
Proprietary (closed) source companies have a tremendous advantage over open source software when it comes to violating intellectual property. Who will ever know if they did it? A source code "comparator" eliminates that crucial advantage.
Re:Slim to None (Score:5, Interesting)
But IBM already has a copy of SCO's code; they licensed it after all. They can release the output of "shred" without violating their agreements with SCO.
Re:Slim to None (Score:3, Interesting)
Not really.. Open-source software usually has a nice setup with mailing-lists, CVS, etc. Most of the code is well accounted-for. The same is not as true with a lot of proprietary software.
Remember, it's not enough that two pieces of code match to prove an infringement in
Results Will Appear "Tainted" (Score:5, Insightful)
It seems very important to me that "third parties" and experts who are not an integral part of the open-source movement validate that comparator works as intended and is effective at detecting code similarities. Hopefully we'll see some articles on respected sites in the next week or so with conclusive analyses of comparator. Not to mention a chance for someone to use it on SCO's code!
Oh, and "Yes, I'm being deliberately vague and tantalizing" is quite funny.
Re:Results Will Appear "Tainted" (Score:5, Insightful)
Then why would a reporter trust the press releases that SCO puts out on a daily basis?
The unfortunate reality is that they DO trust them. We may all think this is a joke here in our insular community, but the great majority of reporters report the press releases "as is". Then the analysts come along and refine those press releases into easily digestible chunks. Then the pundits come along with preconceptions based on those chunks. Ever wonder why the SCO stock keeps going up and up and up? It's because the only thing the general public knows about this issue has come from SCO.
Anything that can help get the truth before the public eye is a Good Thing(tm). A tool that can mathematically "prove" that SCO is lying is valuable, even if most reporters suspect a bias.
IBM has a project called History Flow (Score:5, Interesting)
This is perhaps a better project and it would be interesting to see this tool run against the source.
History Flow [ibm.com]. The following is from their website:
history flow
visualizing dynamic, evolving documents and the interactions of multiple collaborating authors:
Motivation
Most documents are the product of continual evolution. An essay may undergo dozens of revisions; source code for a computer program may undergo thousands. And as online collaboration becomes increasingly common, we see more and more ever-evolving group-authored texts. This site is a preliminary report on a simple visual technique, history flow, that provides a clear view of complex records of contributions and collaboration.
Would this really work? (Score:4, Insightful)
But if even one bit of the source is different, the MD5 hash will be quite different. So, the code slices have to be IDENTICAL. This is not a very good system because a simple find-replace could defeat it. A variable's name changed by one letter, or even capitalization, will defeat it.
Unless the code reveals much more complex tricks than ESR describes in the help file, this tool wouldn't be much use in the SCO case. Hell, it wouldn't be much use catching college class cheaters even.
Re:Would this really work? (Score:3, Insightful)
The tool ought to be able to highlight all those flagrant cases (if any) and the report generator would then generate something that would be perused by a human.
Re:Would this really work? (Score:3)
Verbatim would give a matching md5 sum, sysv code isn't tough to get your hands on (especially since IBM has it, as well as their own code they supposedly contributed). Making the md5 hashes will be a breeze.
It's been around for years (Score:3, Interesting)
Drop in the code you are interested in and it will tell you where it's found in a bunch of open source stuff, including the linux kernel.
It seems impossible to compare without leaking (Score:4, Insightful)
Think of the chance that any given line of source code in an arbitrary program is repeated somewhere else in a large open source program such as the Linux Kernel. This is even more true if some degree of fuzziness is added to handle changes such as adding or removing spaces in insignificant places, or removing comments (and there are many other things, like brace style, which affect multiple lines, so you might want to physically reformat between lines to a standard format).
If the number of lines is even only 1% that are found somewhere in the open source code base, I think a source who wants to keep their code base secret will have a big problem with someone computing the checksums. In reality, I wouldn't be surprised to see a much-higher percentage of lines leaked this way. And this is not the only way leaking can occur (think of application of simple cryptography).
I would not want to be the one publishing the checksums of the closed source due to possible legal liability. The checksums are a derived work in any case.
I can write such a utility also! (Score:5, Funny)
#include <stdio.h>

int main(void)
{
    printf("These source trees appear to be entirely different!\n");
    return 0;
}
Bah! FSS developers will never learn... (Score:4, Funny)
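#include <stdio.h>
#include <unistd.h>   /* sleep() */

int main()
{
    int i;

    printf("Comparing source trees...\n");
    sleep(2);
    printf("Check started.\n");
    /* i-- is tested before each pass, so the body sees 999 down to 0 */
    for (i = 1000; i--;) {
        printf(".");
        fflush(stdout);   /* show each dot immediately */
        sleep(1);
        if (i % 100 == 0)
            printf("\n%d0 percent remaining\n", i / 100);
    }
    printf("\n\nThese source trees appear to be entirely different!\n");
    return 0;
}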
Re:Bah! FSS developers will never learn... (Score:3, Informative)
What the original poster had works correctly. i-- is a post-decrement: it evaluates to the value i had before the decrement, so the loop condition fails once i is zero.
how about the Bible Code Algorithm? (Score:4, Funny)
Not as useful in court (Score:3, Interesting)
Open Source (Score:4, Interesting)
Thanks, ESR.
You guys are missing the point. (Score:5, Informative)
1) There are people out there with legit source licenses to SVR5 source trees. And not just Unix OEMs. Various people in large companies with SVR4/5 source licenses etc.
2) Such people cannot release the source code, and may (if paranoid of how they interpret 'derived works') not want to publish hashed MD5 codes of SVR5.
3) However, it should be possible for a legit SVR5 source licensee to publish openly a list of areas of code that are similar across trees, without that list being either A) a derived work, B) violating their NDAs (um, do check the fine print first though) and C) spending tons of their own, presumably expensive time, digging through stuff.
Linux advocates can then sift through the resulting large pieces of code and doublecheck/crosscheck the origins of it. At the very least, we'd have a public list of suspicious areas of Linux and could determine that certain parts are A) BSD-licensed, B) are verified as original by a known Linux coder, and C) don't fall into the above categories and remain 'suspicious'. This presumably is what ESR is referring to by "various persons will apply it in useful ways. Yes, I'm being deliberately vague and tantalizing". Let's say that it's likely the percentages in A and B will be large.
Of course it's true that there could be code that this primitive tool doesn't catch. But SCO probably started their analysis by using tools like this also. Looking through millions of lines of code by hand is no cakewalk, so one will inevitably start with a tool like this in such an investigation. (Unless one is concerned about one specific predetermined critical/sensitive piece of code.)
Oh, and the other thing about this tool that is nice IMHO? It demonstrates a "good faith" effort on the part of Linux advocates and coders to correct the problem -- despite the barriers raised by SCO (no code release except via NDA).
Finally, running this tool across a Linux and a BSD release should turn up some data that is both interesting and relevant for this dispute. I'm almost tempted to try that myself.
--LP
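For those who haven't met the idea: a "shred" comparison, as described in the story, hashes every run of a few consecutive lines and then compares only the hashes. A rough sketch of that idea follows. It is not ESR's code: the three-line window (`SHRED_LINES`), the function names, and the use of 32-bit FNV-1a in place of MD5 are all assumptions made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define SHRED_LINES 3   /* hypothetical window size, in lines */

/* 32-bit FNV-1a, continued from a running value h.
 * Stand-in for MD5, which the real tool is reported to use. */
unsigned long fnv1a(unsigned long h, const char *s)
{
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h = (h * 16777619UL) & 0xFFFFFFFFUL;
    }
    return h;
}

/* Hash the shred of SHRED_LINES lines starting at lines[start]. */
unsigned long shred_hash(const char *const *lines, size_t start)
{
    unsigned long h = 2166136261UL;   /* FNV offset basis */
    size_t i;

    for (i = 0; i < SHRED_LINES; i++) {
        h = fnv1a(h, lines[start + i]);
        h = fnv1a(h, "\n");           /* keep line boundaries significant */
    }
    return h;
}

/* Fill out[] with one hash per overlapping shred; returns the shred count.
 * out[] must have room for nlines - SHRED_LINES + 1 entries. */
size_t make_shreds(const char *const *lines, size_t nlines, unsigned long *out)
{
    size_t i, count;

    assert(lines != NULL && out != NULL);
    if (nlines < SHRED_LINES)
        return 0;
    count = nlines - SHRED_LINES + 1;
    for (i = 0; i < count; i++)
        out[i] = shred_hash(lines, i);
    return count;
}

/* Count the shreds in a[] whose hash also occurs in b[].
 * Only the hashes are needed here, not the source text. */
size_t count_common(const unsigned long *a, size_t na,
                    const unsigned long *b, size_t nb)
{
    size_t i, j, common = 0;

    for (i = 0; i < na; i++) {
        for (j = 0; j < nb; j++) {
            if (a[i] == b[j]) {
                common++;
                break;
            }
        }
    }
    return common;
}
```

Because `count_common` only sees the hash arrays, a licensee could run `make_shreds` privately and publish just the overlap. A real tool would presumably sort the hashes or use a hash table instead of this quadratic loop.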
Comparison algorithms? (Score:3, Interesting)
I'm interested in algorithms that could be used to compare code. Moss and CAP from Berkeley are not interesting because the algorithm is secret (AFAIK).
What algorithms other than ESR's comparator are there? (I recall but can't locate a recent comment on Slashdot that said something like "most plagiarism detection programs used by professors use the XXXX algorithm".)
I've seen this before from ESR... (Score:3, Informative)
But hey, the mewling was featureful.
Nobody has mentioned this yet ... (Score:4, Interesting)
In order that the method should not be fooled by simple changes, at least the following is required:
* White space must be ignored
* Comparison must be at the statement level, not the code line level
* Variable names must be replaced by standard placeholders
* Routine names, other than standard library calls, must be replaced by standard placeholders
* (Probably difficult) logic will be needed in the tool to detect and ignore noops: how do you deal with
The trouble is: a high proportion of the code sections thus simplified will fall into a relatively small number of possibilities, vulnerable to dictionary-type attacks. Thus, most of the code could be reconstructed, though admittedly as obfuscated source code. IMHO this provides a valid objection to its use.
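To make the dictionary-attack worry concrete: once statements are normalized, many of them collapse to a few common forms, so anyone can hash a list of likely forms and look for matches among the published hashes. A toy sketch, assuming 32-bit FNV-1a as a stand-in for MD5; the function names and the idea of hashing one statement at a time are hypothetical simplifications:

```c
#include <assert.h>
#include <stddef.h>

/* 32-bit FNV-1a of a string; stand-in for the MD5 a real tool would use. */
unsigned long fnv1a(const char *s)
{
    unsigned long h = 2166136261UL;   /* FNV offset basis */

    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h = (h * 16777619UL) & 0xFFFFFFFFUL;
    }
    return h;
}

/* Return the index of the candidate whose hash equals target,
 * or -1 if no candidate in the dictionary matches. */
int dictionary_lookup(unsigned long target,
                      const char *const *candidates, size_t n)
{
    size_t i;

    assert(candidates != NULL);
    for (i = 0; i < n; i++) {
        if (fnv1a(candidates[i]) == target)
            return (int)i;
    }
    return -1;
}
```

Given a published hash of the normalized statement `ID = ID + 1;`, a dictionary containing that string recovers it immediately; a statement that is not in the dictionary stays hidden. Hashing longer multi-line shreds makes the dictionary much larger, which is the usual defence.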
Doesn't even compile (Score:3, Informative)
>gcc --version
2.95.3
>make
main.c: In function `report_time':
main.c:311: parse error before `int'
main.c:312: parse error before `int'
main.c:316: `buf' undeclared (first use in this function)
main.c:316: (Each undeclared identifier is reported only once
main.c:316: for each function it appears in.)
main.c:317: `minutes' undeclared (first use in this function)
main.c:317: `seconds' undeclared (first use in this function)
make: *** [main.o] Error 1
Looks like Eric has been coding too much C++ or something. I'm not a C coder myself, so I might be wrong, but don't you have to declare all the variables at the top of a block of C code, before any other statements? In report_time, he doesn't seem to have followed that rule. Maybe he should check his code on a number of compilers before declaring he has "perfected it".
Eric, here's my patch:
--- main.c 2003-09-10 00:28:37.000000000 -0300
+++ main.c.fixed 2003-09-10 00:29:55.000000000 -0300
@@ -306,12 +306,17 @@
if (mark_time)
{
- int elapsed = endtime - mark_time;
- int hours = elapsed/3600; elapsed %= 3600;
- int minutes = elapsed/60; elapsed %= 60;
- int seconds = elapsed;
+ int elapsed;
+ int hours;
+ int minutes;
+ int seconds;
char buf[BUFSIZ];
+ elapsed = endtime - mark_time;
+ hours = elapsed/3600; elapsed %= 3600;
+ minutes = elapsed/60; elapsed %= 60;
+ seconds = elapsed;
+
va_start(ap, legend);
vsprintf(buf, legend, ap);
fprintf(stderr, "%% %s: %dh %dm %ds\n", buf, hours, minutes, seconds);
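For anyone wondering why gcc 2.95.3 chokes here: C89 requires every declaration in a block to come before the first statement, and the original code puts `elapsed %= 3600;` ahead of `int minutes`. Mixing the two only became legal in C99. A minimal sketch of the C89-safe shape, using a hypothetical helper `split_elapsed` and struct `hms` that are not part of comparator:

```c
#include <assert.h>

/* Hypothetical type and helper for illustration only. */
struct hms {
    long hours;
    long minutes;
    long seconds;
};

/* Split a non-negative duration in seconds into hours, minutes, seconds. */
struct hms split_elapsed(long elapsed)
{
    /* C89: all declarations come first ... */
    struct hms t;
    long rest;

    /* ... and statements only after the last declaration. */
    assert(elapsed >= 0);

    t.hours = elapsed / 3600;
    rest = elapsed % 3600;
    t.minutes = rest / 60;
    t.seconds = rest % 60;
    return t;
}
```

For example, `split_elapsed(3725)` gives 1 hour, 2 minutes and 5 seconds. Written this way the function builds under both a C89 compiler and a C99 one.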
Useful tool (Score:3, Insightful)
And, given the dataset it generates, it could be extended to do other useful things such as detect redundant or cut-'n-pasted code, including bugs of the "pasted it in twice" sort.
What respect? (Score:3, Interesting)
Re:Genius (Score:3, Insightful)
Re:Can Someone Explain? (Score:4, Interesting)
That way, if your variable is called numOfPorts and mine is called countOfPorts, the parsed code is the same for both, and differences like that become meaningless.
Even if not, SCO seems to be saying that much of the code is copy-n-paste anyways.
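A sketch of what that kind of renaming-proof normalization could look like, replacing every identifier that is not on a keep-list with the placeholder `ID`. This is an illustration only, not something comparator is known to do: the keep-list is deliberately tiny, the function names are made up, and string literals and comments are not treated specially.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Words left untouched. A real tool would list every C keyword
 * and any standard library names it wants to preserve. */
static const char *const keep[] = {
    "if", "else", "for", "while", "return", "int", "char", NULL
};

static int is_kept(const char *word, size_t len)
{
    size_t i;

    for (i = 0; keep[i] != NULL; i++) {
        if (strlen(keep[i]) == len && strncmp(keep[i], word, len) == 0)
            return 1;
    }
    return 0;
}

static int is_word_char(int c)
{
    return isalnum(c) || c == '_';
}

/* Copy src to dst with identifiers replaced by "ID".
 * dst must have room for 2 * strlen(src) + 1 bytes, since a
 * one-character identifier grows to two characters. */
void normalize_identifiers(const char *src, char *dst)
{
    size_t len;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (is_word_char((unsigned char)*src)) {
            /* Measure the whole run of word characters. */
            len = 1;
            while (is_word_char((unsigned char)src[len]))
                len++;
            if (isdigit((unsigned char)*src) || is_kept(src, len)) {
                memcpy(dst, src, len);   /* number or kept word: copy */
                dst += len;
            } else {
                memcpy(dst, "ID", 2);    /* identifier: placeholder */
                dst += 2;
            }
            src += len;
        } else {
            *dst++ = *src++;             /* punctuation and whitespace */
        }
    }
    *dst = '\0';
}
```

With this, `numOfPorts = numOfPorts + 1;` and `countOfPorts = countOfPorts + 1;` both normalize to `ID = ID + 1;`, so their hashes would match.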
Re:Can Someone Explain? (Score:5, Informative)
"The -w causes all whitespace in the file (including blank lines) to be ignored for comparison purposes (line numbers in the output report will nevertheless be correct). This is recommended for comparing C code; among other things it means the comparison won't be fooled by differences in indent style."
Re:Can Someone Explain? (Score:4, Informative)
But SCO's no ordinary rabbit! (Score:3, Funny)
Three. Three. And we'd better not risk another frontal assault. Their legal team is dynamite.
Linus:
Would it help to confuse it if we run away more?
Bruce Perens:
Oh, shut up and go change your firewall!
Alan Cox:
Let us taunt it! Darl may become so cross that he will make a mistake.
Bruce Perens:
Like what?
Alan Cox:
Well... ooh.
ESR:
Have we got bows?
Bruce Perens:
No.
ESR:
We have the Holy Hand Grenade.
Bruce Perens:
Yes, of course! The Holy Hand Grenade of Antioch! 'Tis
Re:fire the "laser" (Score:4, Interesting)
Re:Nonsensical idea (Score:5, Insightful)
Re:Finally ESR stops yapping and does some hacking (Score:4, Insightful)
Did you read the article? Those are some of the most self-aggrandizing quotes I've ever seen in real life. SCO lawyers should "be afraid" of him. He "perfected" the algorithm. His 1500-line program is a complete masterwork; both elegant beyond compare and a paragon of maintainability!
You don't ever see, say, Linus, Larry, or RMS talking themselves up like that.
Re:Is this really as useful as it seems? (Score:4, Insightful)
Straightforward copying of code is much easier to find, and much easier to prove as copying in court. If we look at all the instances of duplicate code and determine whether they are license violations or not, it will be a start to making SCO go away.
Comment removed (Score:5, Insightful)
Re:Who says ESR can't code? (Score:5, Informative)
You may want to check out "The Emperor Has No Clothes [1accesshost.com]", a look at ESR's real code contributions.
Re:MD5 easily fooled (Score:5, Interesting)
So, you've downloaded Comparator, and run tests, then.
I didn't need to, the following is in the readme:
He's wrong BTW (and he is smart enough to know it, which makes this a deliberate deception). A work is no less subject to copyright if someone does a global search and replace on a variable name.
Re:MD5 easily fooled (Score:3, Informative)
is heavy and on the party alleging infringement. The comparator tool isn't designed to try to catch such deliberate obfuscation, because that would get into murky territory near the boundary of expression and idea. Did you really think I failed to study the legal questions before I wrote this?