Programming Microsoft The Courts

How GitHub Copilot Could Steer Microsoft Into a Copyright Storm (theregister.com) 83

An anonymous reader quotes a report from the Register: GitHub Copilot -- a programming auto-suggestion tool trained from public source code on the internet -- has been caught generating what appears to be copyrighted code, prompting an attorney to look into a possible copyright infringement claim. On Monday, Matthew Butterick, a lawyer, designer, and developer, announced he is working with Joseph Saveri Law Firm to investigate the possibility of filing a copyright claim against GitHub. There are two potential lines of attack here: is GitHub improperly training Copilot on open source code, and is the tool improperly emitting other people's copyrighted work -- pulled from the training data -- to suggest code snippets to users?

Butterick has been critical of Copilot since its launch. In June he published a blog post arguing that "any code generated by Copilot may contain lurking license or IP violations," and thus should be avoided. That same month, Denver Gingerich and Bradley Kuhn of the Software Freedom Conservancy (SFC) said their organization would stop using GitHub, largely as a result of Microsoft and GitHub releasing Copilot without addressing concerns about how the machine-learning model dealt with different open source licensing requirements.

Copilot's capacity to copy code verbatim, or nearly so, surfaced last week when Tim Davis, a professor of computer science and engineering at Texas A&M University, found that Copilot, when prompted, would reproduce his copyrighted sparse matrix transposition code. Asked to comment, Davis said he would prefer to wait until he has heard back from GitHub and its parent Microsoft about his concerns.

In an email to The Register, Butterick indicated there's been a strong response to news of his investigation. "Clearly, many developers have been worried about what Copilot means for open source," he wrote. "We're hearing lots of stories. Our experience with Copilot has been similar to what others have found -- that it's not difficult to induce Copilot to emit verbatim code from identifiable open source repositories. As we expand our investigation, we expect to see more examples.

"But keep in mind that verbatim copying is just one of many issues presented by Copilot. For instance, a software author's copyright in their code can be violated without verbatim copying. Also, most open-source code is covered by a license, which imposes additional legal requirements. Has Copilot met these requirements? We're looking at all these issues."

GitHub's documentation for Copilot warns that the output may contain "undesirable patterns" and puts the onus of intellectual property infringement on the user of Copilot, notes the report.

Bradley Kuhn of the Software Freedom Conservancy is less willing to set aside how Copilot deals with software licenses. "What Microsoft's GitHub has done in this process is absolutely unconscionable," he said. "Without discussion, consent, or engagement with the FOSS community, they have declared that they know better than the courts and our laws about what is or is not permissible under a FOSS license. They have completely ignored the attribution clauses of all FOSS licenses, and, more importantly, the more freedom-protecting requirements of copyleft licenses."

Brett Becker, assistant professor at University College Dublin in Ireland, told The Register in an email, "AI-assisted programming tools are not going to go away and will continue to evolve. Where these tools fit into the current landscape of programming practices, law, and community norms is only just beginning to be explored and will also continue to evolve." He added: "An interesting question is: what will emerge as the main drivers of this evolution? Will these tools fundamentally alter future practices, law, and community norms -- or will our practices, law and community norms prove resilient and drive the evolution of these tools?"

  • Simple solution (Score:2, Interesting)

    by Z80a ( 971949 )

    Just make it impossible to copyright code.

    • Just make it impossible to copyright code.

      If only there were a way to copy it in another direction, like Left ...

      • by Z80a ( 971949 )

        I think I meant the other way around, as in making it impossible to copyright code

        • I think I meant the other way around, as in making it impossible to copyright code

          The GNU [wikipedia.org] license -- commonly called a Copyleft -- kinda has that effect. Similarly, the BSD [wikipedia.org] and MIT [wikipedia.org] licenses (copyrights) are very permissive, usually only requiring retention of the notice and authors ... (apologies if you already know all this...)

          • The GNU license -- commonly called a Copyleft -- kinda has that effect. Similarly, the BSD and MIT licenses (copyrights) are very permissive, usually only requiring retention of the notice and authors ... (apologies if you already know all this...)

            Absolutely not.

            The GPL license (what you're calling the GNU) is not like that at all (and very different from the BSD/MIT ones).

            If you use GPL code in your code you *must* also release your code under the GPL, or under a GPL-compatible license (of which there are very few). I

            • While I agree with what you posted, OP said "making it impossible to copyright code", which I took to (generally) mean allowing others to freely use, share, and reuse that code -- as copyright is often used to prevent sharing and/or reusing of code -- for free anyway. In that case, the GPL, BSD and MIT licenses accomplish that by either requiring or allowing the code to be reused, etc ... Whether or not I'm explaining myself adequately, your descriptions of things make me think we're on the same page.

          • by ET3D ( 1169851 )

            If that were the case, there would be no problem. But no, copyleft is copyright (with bad licensing terms). Even permissive licenses still keep copyright. You need to specifically put things in the public domain, such as with a CC0 license, if you don't want the work to be copyrighted.

    • by jbengt ( 874751 )
      Creative expression is copyrightable. [berkeley.edu]

      A copyright protects the expression, presentation or arrangement of a creator’s ideas, but not the ideas themselves. Consider that many people could have the same idea, but they might express those ideas in vastly different ways. Those methods of expression are protected, but the shared idea is not.

      Functionality is not supposed to be copyrightable, anyway, patents are for that. So unless the code is expressed in a creative way (and comments in the code that Copilot copied may fall under that), and the code's language did not limit how that functionality could be expressed, it should not be copyrightable.

      Courts may rule otherwise - there seems to be no end to expanding "rights" of companies at the expense of individuals.

      • You also have the question if the CoPilot code could ever be considered copyrightable; it is simply acting as an interpretation of a generic idea.

        As an example, the summary's sparse matrix code... just how much work was required to get CoPilot to spit it back out?

        • There is a difference between creating an algorithm from scratch and what CoPilot does. All GitHub/Microsoft does is what your average coder does, search StackOverflow for a matching description and then copy/paste the actual code. It's not regenerating new code on demand.

          The question is whether those particular pieces of code are art (copyrightable) or if they are a mathematical expression or a list of facts.

      • by raynet ( 51803 )

        Though code shouldn't be patentable.

    • This is the way.

      If it's the best way to code something straightforward or simple, obviously more than one person is going to come up with that method and probably have much of the same code to accomplish it. It shouldn't be able to be copyrighted. The people who copyright it are likely not even the first people to use it, just opportunists.
    • by HiThere ( 15173 )

      And instead allow it to be patented?

  • by 93 Escort Wagon ( 326346 ) on Wednesday October 19, 2022 @05:10PM (#62981345)

    If CoPilot isn't handling licensing, it's not ready for release and should be avoided at all costs.

    • by Anonymous Coward

      You only license, once you've determined that what you want to do would otherwise be a copyright violation. If Microsoft believes they aren't violating copyright (and then if a court agrees with them) then they don't need any licensing, so whatever licenses are offered, are irrelevant.

      This is a really weird situation. Look at the extremes and try to figure out where copilot is:

      At one extreme, you literally look at the code and copy it.

      At the other extreme is clean room design. Someone looks at the code, ex

  • Re: (Score:1, Troll)

    Comment removed based on user account deletion
    • Re:Bupkis (Score:4, Insightful)

      by evanh ( 627108 ) on Wednesday October 19, 2022 @05:32PM (#62981399)

      Only problem with that is both sides want to keep copyrighting.

      On the right, they see it as a money stream and are willing to spend big to protect that. On the left, they see it as a defence against the bullying of the right; especially when hiding the attribution for financial gain. An honesty keeper.

      • That doesn't really cover the full spectrum. The vast majority on the left and the right don't care about copyright at all. People who profit from copyright care about it a lot, and are willing to pay for it in campaign donations. A significant minority have a nuanced opinion opposing copyright, but generally aren't willing to pay for it in campaign donations.

        tl;dr the establishment supports copyrights, the anti-establishment opposes them but most people don't care.

    • Look, they stole my for-loop!!

      LOL this. Might not even be a good for-loop, but people become attached to such things and don't want to let them go.

      • by jvkjvk ( 102057 )

        >>Look, they stole my for-loop!!

        >LOL this. Might not even be a good for-loop, but people become attached to such things and don't want to let them go.

        Now, how about, Look they stole my copyrighted sparse matrix transposition code?

        Yeah.

        • Sparse matrix transposition code isn't exactly new; it is taught very early on in any data sciences. It is likely many people copied it or re-invented it before and introduced it to GPL/open source projects.

  • Wording (Score:3, Interesting)

    by Kunedog ( 1033226 ) on Wednesday October 19, 2022 @05:15PM (#62981363)

    GitHub Copilot -- a programming auto-suggestion tool trained from public source code on the internet -- has been caught generating what appears to be copyrighted code,

    When individuals violate Microsoft's copyright, it's called "piracy" instead of copying.

    When Microsoft violates an individual's copyright, it's called "generating" instead of copying (or piracy).

  • Copilot is Theft (Score:5, Insightful)

    by pimpsoftcom ( 877143 ) on Wednesday October 19, 2022 @05:30PM (#62981393) Journal

    The argument being made for its legitimacy is effectively "we stole from so many people that it can't be illegal."

    • Comment removed based on user account deletion
    • Isn't that similar to the argument being made for all those AI/ML generated images recently?

      • by fgouget ( 925644 )

        Isn't that similar to the argument being made for all those AI/ML generated images recently?

        It absolutely is.
        On the AI-generated image front one can argue that only the style is being copied: pointillism, surrealism, medieval painting, etc. That's not copyrightable: no artist has ever been sued for copyright infringement (successfully at least) just because their painting is in the same style as another artist's. But in the case of Copilot, is the generated code really original, or is it just composed of chunks of the source material? If the latter it would be a derivative work and thus copyright in

        • CoPilot does sometimes spit out duplicated code. Microsoft added checks for this and is working on preventing/minimizing this issue. But I find most of the time the code is original, and most notably Copilot will conform to my style conventions and the interfaces that I wrote (showing that it clearly understands my code to some degree). I think the issue is that if code is copy/pasted enough, Copilot might be fooled into thinking it's required boilerplate code.
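
For readers wondering what a duplication check of the kind mentioned in this thread might look like: the thread does not describe how GitHub's check works, so the following is only a hypothetical sketch of one common approach, token n-gram ("shingle") overlap. The function names, the tokenizer regex, and the window size `n=5` are all assumptions chosen for illustration and are not taken from Copilot.

```python
import re


def tokens(code):
    # Crude, language-agnostic tokenizer (an assumption for this sketch):
    # identifiers, integer literals, and single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def shingles(code, n=5):
    # Set of all runs of n consecutive tokens; empty if the code is shorter than n tokens.
    toks = tokens(code)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def overlap_ratio(suggestion, known_source, n=5):
    # Fraction of the suggestion's shingles that also occur in the known source.
    # A value near 1.0 hints at near-verbatim copying; it says nothing about
    # whether the copied material is actually protected by copyright.
    s = shingles(suggestion, n)
    if not s:
        return 0.0
    return len(s & shingles(known_source, n)) / len(s)
```

Because whitespace is discarded by the tokenizer, reformatting a snippet does not change its score, while renaming every identifier would. A real checker would need to handle that case and would compare against an index of many files rather than a single string.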
  • by LetterRip ( 30937 ) on Wednesday October 19, 2022 @05:32PM (#62981397)

    Only creative aspects of code are copyrightable, not functional aspects. An algorithm for sparse matrix transposition is going to be extremely functional and have little or no creative aspect and thus quite likely not protected by copyright.

    • But still, it might be easier to copyright that section, as it does "something", by copyrighting the entire block it does something in/to. People are trying to copyright syntax, which is very limited, and calling it style, which is applied to something that actually runs.

    • Yeah but even with a sparse matrix transposition, if you're smart you'll do a clean-room implementation if you're serious about avoiding copyright violations.
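
As background for the operation being argued about in this thread, here is a minimal, independently written sketch of sparse matrix transposition. It is not the code discussed in the article and makes no claim to resemble it. It assumes the matrix is stored in the common compressed sparse row (CSR) layout, and the function name and parameter names are illustrative choices. The approach is a plain counting sort on column indices.

```python
def csr_transpose(n_rows, n_cols, indptr, indices, data):
    """Transpose a CSR matrix; returns (indptr_t, indices_t, data_t) in CSR form."""
    nnz = indptr[n_rows]

    # Count the entries in each input column (each column becomes an output row).
    counts = [0] * n_cols
    for j in indices[:nnz]:
        counts[j] += 1

    # Prefix sums of the counts give the row pointers of the transpose.
    indptr_t = [0] * (n_cols + 1)
    for j in range(n_cols):
        indptr_t[j + 1] = indptr_t[j] + counts[j]

    # Scatter each entry into the next free slot of its output row.
    next_slot = indptr_t[:n_cols]  # slicing copies, so indptr_t is not modified
    indices_t = [0] * nnz
    data_t = [0] * nnz
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            dest = next_slot[j]
            indices_t[dest] = i
            data_t[dest] = data[k]
            next_slot[j] += 1
    return indptr_t, indices_t, data_t
```

For the 2x3 matrix [[1, 0, 2], [0, 3, 0]], stored as `indptr=[0, 2, 3]`, `indices=[0, 2, 1]`, `data=[1, 2, 3]`, the call `csr_transpose(2, 3, ...)` returns `([0, 1, 2, 3], [0, 1, 0], [1, 3, 2])`. How much room such a routine leaves for creative expression, as opposed to function, is exactly what the commenters here disagree about.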

  • by OrangAsm ( 678078 ) on Wednesday October 19, 2022 @05:36PM (#62981407)
    It looks like you're trying to place a backdoor into this application. Would you like to obfuscate this?
  • by FudRucker ( 866063 ) on Wednesday October 19, 2022 @05:36PM (#62981409)
    source code is becoming self aware and it was escaping copyright restrictions, see, i told you all source code wants to be free
  • by devslash0 ( 4203435 ) on Wednesday October 19, 2022 @05:45PM (#62981421)

    Forget the copyright dispute for a minute. CoPilot is fundamentally flawed for another reason, and that's because it is trained on unchecked, unreviewed code samples. Most code I come across in public repos is of horrible quality - student homework, experimental projects, online tutorials, people just tinkering with some new libraries, and so on. Of course, there are also some good projects out there, but according to Sturgeon's law "90% of everything is crap". Since AI/machine learning/neural network training is fundamentally dependent on the volume of data points used to strengthen relevant signals, all those outposts of good code become insignificant among all the other poop floating around, and the system starts suggesting crap code with all its functional, architectural and security issues.

    • by PPH ( 736903 )

      So, Tay [wikipedia.org] has gone on to study CS?

    • by dlingman ( 1757250 ) on Wednesday October 19, 2022 @07:30PM (#62981567)

      And remember, you can apply Sturgeon's Law recursively as well on the other 10% as many times as you need.

    • What makes you think that CoPilot is trained so indiscriminately? The need to curate, filter, and reduce bias in training data is well known in other applications of machine learning, such as natural language processing. (Researchers figured this out as soon as they saw people prompting their language models to emit hate speech and other forms of abuse.) I rather suspect that Microsoft is aware of this issue and has taken steps to weed out, or at least assign a lower weighting to, undesirable training da
      • by raynet ( 51803 )

        it still does give interesting suggestions for:
        print("kill
        or
        print("women can't

      • by jvkjvk ( 102057 )

        >They're probably already filtering the cruft

        I imagine they are having just as difficult a time doing this successfully as with any other AI classification project of human based inputs. "Good code" is like trying to determine pornography - "I know it when I see it".

    • Yea, good lord ya. I cannot tell you how many repos I have fought when I tried to use their libs. Just to find out they half worked or had fatal flaws. Just to save time on some personal projects:P
  • It's quite satisfying to just do the work.

  • We are talking about Microsoft correct? Spreading software with GPL license violations is something they paid millions to help Caldera attack opensource software users. They will just let it run until a court tells them to stop, it'll take them 8 months to undo the code and stop it and then 6 months later they will start posting how GPL violations have spread and why their Microsoft Licensed software is the safer bet.

    Or are we talking about some other company instead of Microsoft?
    LoB
  • by quonset ( 4839537 ) on Wednesday October 19, 2022 @07:23PM (#62981557)

    Also, most open-source code is covered by a license, which imposes additional legal requirements.

    Considering the wholesale violations of "legal requirements" people perform when stealing music, videos, or games, this comment should never be included when discussing open source.

    • Nitpick: I've never stolen music or movies, but I've downloaded my fair share. I've only once gotten an illegal software copy, 25 years ago, Windows 98SE, and installed it. The software I have installed and used ever since is predominantly Free Software and always legal. This was partially related to the crappiness of 98SE and the desire to be part of the positive communities around FLOSS. There is no such thing as general legal private copying of for-pay software, whereas for music, books and videos the
  • Data point (Score:4, Interesting)

    by ElizabethGreene ( 1185405 ) on Wednesday October 19, 2022 @09:43PM (#62981749)

    The code in question appears to have been published in the book 'Direct Methods for Sparse Linear Systems' and does not have an obvious license or restriction in that text.

    If I'm reading a book that has an example of how to e.g. handle file IO, I'd assume that it was acceptable to use that example code in my work. Am I wrong in assuming that?

    If I am wrong, I owe a deep apology to the authors of the 'Turbo C++ Professional handbook'. The first non-basic dev work I ever did was built by stitching together examples from that text. (This was ~1991 and I didn't have internet access, so this book and the language reference manual were how I learned C++.)

    Bias disclosure: I work for Microsoft in a non-development role unrelated to Visual Studio, CoPilot, or Github.

    • I agree with you, but: you did buy the book, which gives you some rights (IANAL) to use its contents. Copilot isn't compensating anyone for anything. Honestly, it is such a stupid and obviously illegal idea that only a huge corp like MS could get away with this. Anyone else would already have court decisions against them.
      • You say "stupid and obviously" illegal.

        I can search Google today by typing in things, and it brings back suggestions. Some of those suggestions may include copyrighted material that the author intended to be publicly available but still under their copyright. I can then decide to copy that material directly into my editor, and then use it. I made the decision, Google just made a suggestion when I searched for something.

        Now, let's replace Google with Co-pilot:

        I can search Co-pilot today by typing in thing

        • by jvkjvk ( 102057 )

          >Some of those suggestions may include copyrighted material that the author intended to be publicly available but still under their copyright.

          >What's the difference?

          Fair use, the search term is used to search, not to put in a paper. The code is designed to go into your project. It is blatantly illegal to be pulling copyrighted code through software and put it into another project.

          >Some of those suggestions may include copyrighted material that the author intended to be publicly available but stil

    • Re:Data point (Score:4, Informative)

      by ZiggyZiggyZig ( 5490070 ) on Thursday October 20, 2022 @04:31AM (#62982269)

      I don't have this book but there is probably a section somewhere stating the readers' rights and limitations to use the code from the book. It's standard and all books have them.

      • I looked but didn't find one. Ignoring that, even with the fuzziness on this specific case, there is still a problem here. The source that copilot kicks back needs to be sufficiently original that it isn't obviously someone else's work.

        On a related note, are there any free or cheap IP scanning tools as mentioned in TFA?

    • by fgouget ( 925644 )

      Please explain why Microsoft did not train Copilot on their own code if the generated code cannot possibly infringe on the copyright of the source material.
      Nat Friedman claims that "training ML systems on public data is fair use" but training it on their own code would have avoided this controversy entirely. Also it would have been the perfect source material for people who are mostly going to use it to write Windows applications.

      Clearly Microsoft knows Copilot is likely to infringe on the copyright of th

      • Please explain why Microsoft did not train Copilot on their own code if the generated code cannot possibly infringe on the copyright of the source material.

        You have as much data as I do. Same circus, different tent.

        Is this type of thing an issue for dall-e generated images too? (I'm asking in ignorance; I genuinely don't know.)

  • Comment removed based on user account deletion
    • Comment removed based on user account deletion
    • Or I think it could be even better. Users should be able to select which licenses they agree to get their source code generated from. Then, as the system proposes snippets or whatever it does, it also generates a new license for the user. If the user selected GPL or LGPL or CC-BY-like licenses, the generated license lists the names of all the people whose code was used to train the dataset. Alternatively, because it's very likely to be an extremely long list, it could link to a webpage listing all those peo

  • ...with the generator producing code, and the discriminator detecting if the code is copyrighted.
  • Now we see how Microsoft owning Github can be used to break open source. By adding a handy tool that helpfully "suggests" copyright-violating insertions, they can encourage developers to sprinkle the bulk of the open-source codebase with IP-law violations.

    Interestingly, this can also be described as a form of "Embrace, extend, extinguish", though the boobytrap is of an entirely different nature.

  • by TVmisGuided ( 151197 ) <alan.jump@g m a i l . c om> on Thursday October 20, 2022 @08:10AM (#62982569) Homepage

    The problem with the argument that a code generator, whether or not trained on FOSS, emits copyrighted code is that it doesn't take into account the basic premise of "scènes à faire": given a set of design constraints and desired output, a given set of developers working on a specific function or procedure will likely create nearly-identical code.

    Scènes à faire is a long-standing defense approach in copyright cases, albeit primarily in the intellectual-property worlds of graphic arts, photography, and other visual media. However, nothing says the approach can't be applied to the intellectual-property world of code design and deployment. And, given the propensity for developers and coders to reuse code under the dictum of "laziness is a virtue", trying to pin down a given code snippet for copyright violation, especially if created independently from the claimed copyrighted snippet, is likely to be a Sisyphean task.

    Just my two cents' worth.

    • by jvkjvk ( 102057 )

      >The problem with the argument that a code generator, whether or not trained on FOSS, emits copyrighted code is that it doesn't take into account the basic premise of "scènes à faire": given a set of design constraints and desired output, a given set of developers working on a specific function or procedure will likely create nearly-identical code.

      And now apply that to the copyrighted sparse matrix transposition code. Somehow I doubt your premise, given this evidence. Sure, there may be some

  • According to Microsoft the code generated by Copilot is not subject to the license of the code it has been trained on because it's too transformative. It's interesting then that, according to the article, Copilot was trained on open-source code but not on Microsoft's own source code. That's an odd choice for a tool that will mostly be used to generate code for the Microsoft ecosystem if they are not worried about Copilot violating the copyright of the code it was trained on.
  • The free software movement should not attempt to gut the fair use defense simply because it benefits them. In particular, the free software community has relied a lot on warranties disclaiming infringement. Hard to see a justification for applying that to Microsoft when the oss community itself relies on being able to disclaim it. I wonder if the Free Software movement can be ignored here. Or if they will do something radical. Hopefully they don't have enough political power to gut fair use. Moral of story
  • So now any project on GitHub could potentially have IP problems? Is this an accident?
