How GitHub Copilot Could Steer Microsoft Into a Copyright Storm (theregister.com)
An anonymous reader quotes a report from the Register: GitHub Copilot -- a programming auto-suggestion tool trained from public source code on the internet -- has been caught generating what appears to be copyrighted code, prompting an attorney to look into a possible copyright infringement claim. On Monday, Matthew Butterick, a lawyer, designer, and developer, announced he is working with Joseph Saveri Law Firm to investigate the possibility of filing a copyright claim against GitHub. There are two potential lines of attack here: is GitHub improperly training Copilot on open source code, and is the tool improperly emitting other people's copyrighted work -- pulled from the training data -- to suggest code snippets to users?
Butterick has been critical of Copilot since its launch. In June he published a blog post arguing that "any code generated by Copilot may contain lurking license or IP violations," and thus should be avoided. That same month, Denver Gingerich and Bradley Kuhn of the Software Freedom Conservancy (SFC) said their organization would stop using GitHub, largely as a result of Microsoft and GitHub releasing Copilot without addressing concerns about how the machine-learning model dealt with different open source licensing requirements.
Copilot's capacity to copy code verbatim, or nearly so, surfaced last week when Tim Davis, a professor of computer science and engineering at Texas A&M University, found that Copilot, when prompted, would reproduce his copyrighted sparse matrix transposition code. Asked to comment, Davis said he would prefer to wait until he has heard back from GitHub and its parent Microsoft about his concerns.
In an email to The Register, Butterick indicated there's been a strong response to news of his investigation. "Clearly, many developers have been worried about what Copilot means for open source," he wrote. "We're hearing lots of stories. Our experience with Copilot has been similar to what others have found -- that it's not difficult to induce Copilot to emit verbatim code from identifiable open source repositories. As we expand our investigation, we expect to see more examples.
"But keep in mind that verbatim copying is just one of many issues presented by Copilot. For instance, a software author's copyright in their code can be violated without verbatim copying. Also, most open-source code is covered by a license, which imposes additional legal requirements. Has Copilot met these requirements? We're looking at all these issues."
GitHub's documentation for Copilot warns that the output may contain "undesirable patterns" and puts the onus of intellectual property infringement on the user of Copilot, notes the report.
Bradley Kuhn of the Software Freedom Conservancy is less willing to set aside how Copilot deals with software licenses. "What Microsoft's GitHub has done in this process is absolutely unconscionable," he said. "Without discussion, consent, or engagement with the FOSS community, they have declared that they know better than the courts and our laws about what is or is not permissible under a FOSS license. They have completely ignored the attribution clauses of all FOSS licenses, and, more importantly, the more freedom-protecting requirements of copyleft licenses."
Brett Becker, assistant professor at University College Dublin in Ireland, told The Register in an email, "AI-assisted programming tools are not going to go away and will continue to evolve. Where these tools fit into the current landscape of programming practices, law, and community norms is only just beginning to be explored and will also continue to evolve." He added: "An interesting question is: what will emerge as the main drivers of this evolution? Will these tools fundamentally alter future practices, law, and community norms -- or will our practices, law and community norms prove resilient and drive the evolution of these tools?"
Simple solution (Score:2, Interesting)
Just make it impossible to copyright code.
Re: (Score:2)
Just make it impossible to copyright code.
If only there were a way to copy it in another direction, like Left ...
Re: (Score:1)
I think I meant the other way around, as in making it impossible to copyright code
Re: (Score:2)
I think I meant the other way around, as in making it impossible to copyright code
The GNU [wikipedia.org] license -- commonly called a Copyleft -- kinda has that effect. Similarly, the BSD [wikipedia.org] and MIT [wikipedia.org] licenses (copyrights) are very permissive, usually only requiring retention of the notice and authors ... (apologies if you already know all this...)
Re: (Score:3)
Absolutely not.
The GPL license (what you're calling the GNU) is not like that at all (and very different from the BSD/MIT ones).
If you use GPL code in your code you *must* also release your code under the GPL, OR a GPL-compatible license (of which there are very few). I
Re: (Score:2)
While I agree with what you posted, OP said "making it impossible to copyright code", which I took to (generally) mean to allow others to freely use, share, and reuse that code -- as copyright is often used to prevent sharing and/or reusing of code -- for free anyway. In that case, the GPL, BSD and MIT licenses accomplish that by either requiring or allowing the code to be reused, etc ... Whether or not I'm explaining myself adequately, your descriptions of things make me think we're on the same page.
Re: (Score:2)
If that were the case, there would be no problem. But no, copyleft is copyright (with bad licensing terms). Even permissive licenses still keep copyright. You need to specifically put things in the public domain, such as with a CC0 license, if you don't want the work to be copyrighted.
Re: (Score:2)
"CopyLeft" is a stupid, nonsensical term. Code cannot be both free and copyrighted. You can't have it both ways. It is either free (as in "freedom") or it isn't. The purpose of copyright is to place restrictions on something. That is the exact opposite of what FOSS is supposed to be.
In many countries code just is copyright by default and there's nothing the author can do about that. That doesn't mean there have to be restrictions, just that you get rid of those restrictions by giving an open license. See the Creative Commons CC-0 license for example where the summary is really simple but the legal text is quite complex.
There are two simple points of view here, and it's a question of whether you want your software to be good for people and society or you want it to be good for develop
Re: (Score:2)
>That is the exact opposite of what FOSS is supposed to be.
Nope. I think you get into the paradox of intolerance area when you don't have restrictions like the GPL.
I also think that some people use the GPL exactly to extract a cost to further development on the code they created. This is basically a contract. This is my code. Do this with it or GET THE F OUT. It's still free to use, and open for all to see. But taking the work itself and extending it should *also* be open, not closed. And without
Re: (Score:2)
Re: (Score:2)
Why? Does the original source vanish once copied?
The only thing that I agree with: don't claim it's yours when it's not, and don't claim it's someone else's when it's yours. That is trademark, and it protects people from lying.
Attribution, while it seems OK, becomes cumbersome when you include libraries that include libraries ...
Every modern invention/idea is built on other people's inventions/ideas. We have had fantastic innovation on the shoulders of others, and the creators owe society for that inspiration. Limite
Re: (Score:2)
It is still less work to do the attribution than it is to write those libraries yourself, so I don't see a problem.
Re: (Score:2)
Re: (Score:2)
A copyright protects the expression, presentation or arrangement of a creator’s ideas, but not the ideas themselves. Consider that many people could have the same idea, but they might express those ideas in vastly different ways. Those methods of expression are protected, but the shared idea is not.
Functionality is not supposed to be copyrightable, anyway, patents are for that. So unless the code is expressed in a creative way (and comments in the code that Copilot copied may fall under that), and the code's language did not limit how that functionality could be expressed, it should not be copyrightable.
Courts may rule otherwise - there seems to be no end to expanding "rights" of companies at the expense of individuals.
Re: (Score:2)
You also have the question if the CoPilot code could ever be considered copyrightable; it is simply acting as an interpretation of a generic idea.
As an example, the summary's sparse matrix code... just how much work was required to get CoPilot to spit it back out?
Re: Simple solution (Score:1)
There is a difference between creating an algorithm from scratch and what CoPilot does. All GitHub/Microsoft does is what your average coder does, search StackOverflow for a matching description and then copy/paste the actual code. It's not regenerating new code on demand.
The question is whether those particular pieces of code are art (copyrightable) or if they are a mathematical expression or a list of facts.
Re: (Score:2)
Though code shouldn't be patentable.
Re: (Score:1)
If it's the best way to code something straightforward or simple, obviously more than one person is going to come up with that method and probably have much of the same code to accomplish it. It shouldn't be able to be copyrighted. The people who copyright it are likely not even the first people to use it, just opportunists.
Re: (Score:2)
And instead allow it to be patented?
Seems like licensing should be concern #1 (Score:4, Insightful)
If CoPilot isn't handling licensing, it's not ready for release and should be avoided at all costs.
Licensing should rarely be concern #1 (Score:1)
You only license, once you've determined that what you want to do would otherwise be a copyright violation. If Microsoft believes they aren't violating copyright (and then if a court agrees with them) then they don't need any licensing, so whatever licenses are offered, are irrelevant.
This is a really weird situation. Look at the extremes and try to figure out where copilot is:
At one extreme, you literally look at the code and copy it.
At the other extreme is clean room design. Someone looks at the code, ex
Re: (Score:2)
The article linked shows that it's not just single isolated snippets: it copies multiple isolated snippets, up to the level that a whole file is clearly a derivative work of the original work.
Re: Seems like licensing should be concern #1 (Score:1)
Re: (Score:2)
No, this is not really a "bug", more a reveal of the true feature: CoPilot's database is a derivative work of the software it is trained on and so is subject to the GPL. Any software developed using CoPilot, even if the code is unrelated to any GPL package, is a derivative work of CoPilot and so is a derivative work of the GPL software used to train CoPilot.
Re: (Score:1, Troll)
Re:Bupkis (Score:4, Insightful)
Only problem with that is both sides want to keep copyrighting.
On the right, they see it as a money stream and are willing to spend big to protect that. On the left, they see it as a defence against the bullying of the right; especially when hiding the attribution for financial gain. An honesty keeper.
Re: (Score:2)
That doesn't really cover the full spectrum. The vast majority on the left and the right don't care about copyright at all. People who profit from copyright care about it a lot, and are willing to pay for it in campaign donations. A significant minority have a nuanced opinion opposing copyright, but generally aren't willing to pay for it in campaign donations.
tl;dr the establishment supports copyrights, the anti-establishment opposes them but most people don't care.
Re: (Score:2)
Look, they stole my for-loop!!
LOL this. Might not even be a good for-loop, but people become attached to such things and don't want to let them go.
Re: (Score:2)
>>Look, they stole my for-loop!!
>LOL this. Might not even be a good for-loop, but people become attached to such things and don't want to let them go.
Now, how about, Look they stole my copyrighted sparse matrix transposition code?
Yeah.
Re: Bupkis (Score:1)
Sparse matrix transposition code isn't exactly new, it is taught very early on in any data sciences. It is likely many people copied it or re-invented it before and introduced it to GPL/open source projects.
Wording (Score:3, Interesting)
GitHub Copilot -- a programming auto-suggestion tool trained from public source code on the internet -- has been caught generating what appears to be copyrighted code,
When individuals violate Microsoft's copyright, it's called "piracy" instead of copying.
When Microsoft violates an individual's copyright, it's called "generating" instead of copying (or piracy).
Re: (Score:2)
Re: (Score:2)
Re: Wording (Score:2)
He's making the whole "tools" vs "paraphernalia" shtick others noticed.
Re:Wording (Score:4, Funny)
When Microsoft violates an individual's copyright, it's called "generating" instead of copying (or piracy).
Sometimes they call it "innovating".
Copilot is Theft (Score:5, Insightful)
The argument being made for its legitimacy is effectively "we stole from so many people that it can't be illegal".
Re: (Score:1)
Re: (Score:2)
Isn't that similar to the argument being made for all those AI/ML generated images recently?
Re: (Score:2)
Isn't that similar to the argument being made for all those AI/ML generated images recently?
It absolutely is.
On the AI-generated image front one can argue that only the style is being copied: pointillism, surrealism, medieval painting, etc. That's not copyrightable: no artist has ever been sued for copyright infringement (successfully at least) just because their painting is in the same style as another artist's. But in the case of Copilot, is the generated code really original, or is it just composed of chunks of the source material? If the latter it would be a derivative work and thus copyright in
Re: Copilot is Theft (Score:1)
Functional aspects not copyrightable (Score:4, Insightful)
Only creative aspects of code are copyrightable, not functional aspects. An algorithm for sparse matrix transposition is going to be extremely functional and have little or no creative aspect and thus quite likely not protected by copyright.
Re: Functional aspects not copyrightable (Score:1)
But still, it might be easier to copyright that section, as it does "something", by copyrighting the entire block it does something in/to. People are trying to copyright syntax, which is very limited, and calling it style, which is applied to something that actually runs.
Re: (Score:2)
Yeah but even with a sparse matrix transposition, if you're smart you'll do a clean-room implementation if you're serious about avoiding copyright violations.
Integrate Clippy (Score:4, Funny)
all source code wants to be free (Score:4, Funny)
Garbage in? Garbage out. (Score:5, Interesting)
Forget the copyright dispute for a minute. CoPilot is fundamentally flawed for another reason, and that's because it is trained on unchecked, unreviewed code samples. Most code I come across in public repos is of horrible quality - student homework, experimental projects, online tutorials, people just tinkering with some new libraries, and so on. Of course, there are also some good projects out there, but according to Sturgeon's law "90% of everything is crap". Since AI/machine learning/neural network training is fundamentally dependent on the volume of data points used to strengthen relevant signals, all those outposts of good code become insignificant among all the other poop floating around, and the system starts suggesting crap code with all its functional, architectural and security issues.
Re: (Score:2)
So, Tay [wikipedia.org] has gone on to study CS?
Re:Garbage in? Garbage out. (Score:4, Funny)
And remember, you can apply Sturgeon's Law recursively as well on the other 10% as many times as you need.
They're probably already filtering the cruft (Score:2)
Re: (Score:2)
it still does give interesting suggestions for:
print("kill
or
print("women can't
Re: (Score:2)
>They're probably already filtering the cruft
I imagine they are having just as difficult a time doing this successfully as with any other AI classification project of human based inputs. "Good code" is like trying to determine pornography - "I know it when I see it".
Re: (Score:2)
Write your own source code! (Score:2)
It's quite satisfying to just do the work.
seems like just the thing they would want (Score:2, Informative)
Or are we talking about some other company instead of Microsoft?
LoB
I'm just going to leave this here... (Score:1)
Oh please (Score:3)
Also, most open-source code is covered by a license, which imposes additional legal requirements.
Considering the wholesale violations of "legal requirements" people perform when stealing music, videos, or games, this comment should never be included when discussing open source.
Re: (Score:2)
Data point (Score:4, Interesting)
The code in question appears to have been published in the book 'Direct Methods for Sparse Linear Systems' and does not have an obvious license or restriction in that text.
If I'm reading a book that has an example of how to e.g. handle file IO, I'd assume that it was acceptable to use that example code in my work. Am I wrong in assuming that?
If I am wrong, I owe a deep apology to the authors of the 'Turbo C++ Professional handbook'. The first non-basic dev work I ever did was built by stitching together examples from that text. (This was ~1991 and I didn't have internet access, so this book and the language reference manual were how I learned C++.)
Bias disclosure: I work for Microsoft in a non-development role unrelated to Visual Studio, CoPilot, or Github.
Re: Data point (Score:2)
Re: (Score:2)
You say "stupid and obviously" illegal.
I can search Google today by typing in things, and it brings back suggestions. Some of those suggestions may include copyrighted material that the author intended to be publicly available but still under their copyright. I can then decide to copy that material directly into my editor, and then use it. I made the decision, Google just made a suggestion when I searched for something.
Now, let's replace Google with Co-pilot:
I can search Co-pilot today by typing in thing
Re: (Score:2)
>Some of those suggestions may include copyrighted material that the author intended to be publicly available but still under their copyright.
>What's the difference?
Fair use, the search term is used to search, not to put in a paper. The code is designed to go into your project. It is blatantly illegal to be pulling copyrighted code through software and put it into another project.
>Some of those suggestions may include copyrighted material that the author intended to be publicly available but stil
Re:Data point (Score:4, Informative)
I don't have this book but there is probably a section somewhere stating the readers' rights and limitations to use the code from the book. It's standard and all books have them.
Re: (Score:3)
I looked but didn't find one. Ignoring that, even with the fuzziness on this specific case, there is still a problem here. The source that copilot kicks back needs to be sufficiently original that it isn't obviously someone else's work.
On a related note, are there any free or cheap IP scanning tools as mentioned in TFA?
Re: (Score:2)
Please explain why Microsoft did not train Copilot on their own code if the generated code cannot possibly infringe on the copyright of the source material.
Nat Friedman claims that "training ML systems on public data is fair use" but training it on their own code would have avoided this controversy entirely. Also it would have been the perfect source material for people who are mostly going to use it to write Windows applications.
Clearly Microsoft knows Copilot is likely to infringe on the copyright of th
Re: (Score:2)
You have as much data as I do. Same circus, different tent.
Is this type of thing an issue for dall-e generated images too? (I'm asking in ignorance; I genuinely don't know.)
Re: (Score:1)
Re: (Score:1)
Re: (Score:3)
Or I think it could be even better. Users should be able to select which licenses they agree to get their source code generated from. Then, as the system proposes snippets or whatever it does, it also generates a new license for the user. If the user selected GPL or LGPL or CC-BY-like licenses, the generated license lists the names of all the people whose code was used to train the dataset. Alternatively, because it's very likely to be an extremely long list, it could link to a webpage listing all those peo
Just imagine a GAN... (Score:2)
How Microsoft owning Github can break open source. (Score:2)
Now we see how Microsoft owning Github can be used to break open source. By adding a handy tool that helpfully "suggests" copyright-violating insertions, they can encourage developers to sprinkle the bulk of the open-source codebase with IP-law violations.
Interestingly, this can also be described as a form of "Embrace, extend, extinguish", though the boobytrap is of an entirely different nature.
Scènes à Faire (Score:3)
The problem with the argument that a code generator, whether or not trained on FOSS, emits copyrighted code is that it doesn't take into account the basic premise of "scènes à faire": given a set of design constraints and desired output, a given set of developers working on a specific function or procedure will likely create nearly-identical code.
Scènes à faire is a long-standing defense approach in copyright cases, albeit primarily in the intellectual-property worlds of graphic arts, photography, and other visual media. However, nothing says the approach can't be applied to the intellectual-property world of code design and deployment. And, given the propensity for developers and coders to reuse code under the dictum of "laziness is a virtue", trying to pin down a given code snippet for copyright violation, especially if created independently from the claimed copyrighted snippet, is likely to be a Sisyphean task.
Just my two cents' worth.
Re: (Score:2)
>The problem with the argument that a code generator, whether or not trained on FOSS, emits copyrighted code is that it doesn't take into account the basic premise of "scènes à faire": given a set of design constraints and desired output, a given set of developers working on a specific function or procedure will likely create nearly-identical code.
And now apply that to the copyrighted sparse matrix transposition code. Somehow I doubt your premise, given this evidence. Sure, there may be some
Not trained on Microsoft's own code (Score:2)
Free Software Radicals Should Be Ignored (Score:1)
Purchase, Poison, Purge (Score:1)