Programming AI Open Source

Free Software Foundation Will Fund Papers on Issues Around Microsoft's 'GitHub Copilot' (fsf.org) 111

GitHub's new "Copilot" tool (created by Microsoft and OpenAI) shares the autocompletion suggestions of an AI trained on code repositories. But can that violate the original coder's license? Now the Free Software Foundation (FSF) is calling for a closer look at these and many other issues...

"We already know that Copilot as it stands is unacceptable and unjust, from our perspective," they wrote in a blog post this week, arguing that Copilot "requires running software that is not free/libre (Visual Studio, or parts of Visual Studio Code), and Copilot is Service as a Software Substitute. These are settled questions as far as we are concerned."

"However, Copilot raises many other questions which require deeper examination..." The Free Software Foundation has received numerous inquiries about our position on these questions. We can see that Copilot's use of freely licensed software has many implications for an incredibly large portion of the free software community. Developers want to know whether training a neural network on their software can really be considered fair use. Others who may be interested in using Copilot wonder if the code snippets and other elements copied from GitHub-hosted repositories could result in copyright infringement. And even if everything might be legally copacetic, activists wonder if there isn't something fundamentally unfair about a proprietary software company building a service off their work.

With all these questions, many of them with legal implications that at first glance may have not been previously tested in a court of law, there aren't many simple answers. To get the answers the community needs, and to identify the best opportunities for defending user freedom in this space, the FSF is announcing a funded call for white papers to address Copilot, copyright, machine learning, and free software.

We will read the submitted white papers, and we will publish ones that we think help elucidate the problem. We will provide a monetary reward of $500 for the papers we publish.

They add that the following questions are of particular interest:
  • Is Copilot's training on public repositories infringing copyright? Is it fair use?
  • How likely is the output of Copilot to generate actionable claims of violations on GPL-licensed works?
  • How can developers ensure that any code to which they hold the copyright is protected against violations generated by Copilot?
  • Is there a way for developers using Copilot to comply with free software licenses like the GPL?
  • If Copilot learns from AGPL-covered code, is Copilot infringing the AGPL?
  • If Copilot generates code which does give rise to a violation of a free software licensed work, how can this violation be discovered by the copyright holder on the underlying work?
  • Is a trained artificial intelligence (AI) / machine learning (ML) model resulting from machine learning a compiled version of the training data, or is it something else, like source code that users can modify by doing further training?
  • Is the Copilot trained AI/ML model copyrighted? If so, who holds that copyright?
  • Should ethical advocacy organizations like the FSF argue for change in copyright law relevant to these questions?

This discussion has been archived. No new comments can be posted.

  • by ShanghaiBill ( 739463 ) on Saturday July 31, 2021 @05:42PM (#61642607)

    The Free Software Foundation should focus on making software free rather than looking for new ways to restrict it.

    • There has to be a way to shoehorn a GNU Hurd joke in here...

    • Re: (Score:3, Informative)

      If you read again, you will notice that this is what they're doing.
      • Re: (Score:1, Insightful)

        by gacattac ( 7156519 )

        Seems they are trying to restrict a tool that helps people write better code faster.

        • by tlhIngan ( 30335 ) <slashdot&worf,net> on Sunday August 01, 2021 @04:29AM (#61643413)

          Seems they are trying to restrict a tool that helps people write better code faster.

          You're kidding, right? The tool's a joke and practically useless.

          There are plenty of examples where its results are less than ... usable. Some of the more egregious examples include regurgitating GPL code into your project [reddit.com] (to be fair, it did include the license, but if you're not coding for GPL, may be problematic).

          Or such brilliant things like storing currency values as a float [twitter.com]. Which is supposed to be a "premier example" of good coding.

          Honestly, TheDailyWTF [thedailywtf.com] probably will have to include a special section for code created by Copilot.

          The problem is the training set they used for it just isn't very good. And with things like what we've seen, I'm not entirely sure it's that useful. I mean, the inclusion of GPL code is particularly egregious - if your code is incompatible with the GPL, the last thing you want is to have GPL code tossed in by a tool.
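The complaint above about storing currency values as a float can be illustrated with a minimal sketch. This is not the code from the linked example; the variable names are hypothetical, and it only shows the general rounding problem and two common alternatives (decimal arithmetic and integer minor units).

```python
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly,
# so sums of prices can drift away from the expected value.
float_total = 0.10 + 0.20
float_matches = float_total == 0.30  # False: float_total is not exactly 0.30

# Decimal keeps exact base-10 values when constructed from strings.
decimal_total = Decimal("0.10") + Decimal("0.20")
decimal_matches = decimal_total == Decimal("0.30")  # True

# Another common approach: store integer minor units (cents).
cents_total = 10 + 20

print(float_matches, decimal_matches, cents_total)
```

Which alternative fits depends on the application; the point is only that a plain float is a risky default for money.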

          • I don't think the tool's a joke or practically useless. I think there are quite a few mundane tasks it can help with, and by the way, don't forget that it's in beta; it's also the first version of this project. The applications of GPT-3 in general (which I'm pretty positive GH Copilot uses) can assist with all sorts of things. Are the results perfect? Far from it. However, as of right now, as a code completion tool it's just astonishing; it does feel good enough to be like a junior programmer

    • "Free as in Speech", not "Free as in Beer". The distinction is fundamental and at the core of the _free speech_ issues that the Free Software Foundation advocates, teaches, and publishes software for. They've consistently protected the right of people to see, use, and modify software. Where people in favor of "open source" have disagreed has normally been when individuals or companies seek to proprietize a project, to seal away parts of the software to sell or seal away from public view. The most blatant e

      • Free speech isn't sticky; if you say something, and I hear you, and I say the same thing, that's the power of free speech, the reason for free speech, the ebb and flow of the marketplace of ideas.

        Free as in speech is a lie; free as in "defended from capitalism."

        You have to point to lawsuits designed to stop certain speech to find examples of freedom to speak? With the Apache 2 license nobody gets to sue anybody, and everybody gets to copy the code. That's freedom.

        • > Free speech isn't sticky; if you say something, and I hear you, and I say the same thing, that's the power of free speech,

          The power of copyright is a distinct though related right. Let us not confuse them. Copyright is the power to control who may repeat your exact words, especially written words. It's designed to protect the authors and the publishers from wholesale plagiarism, and was developed in response to the invention of the printing press.

          Part of the difficulty which the Free Software Foundatio

          • Part of the difficulty which the Free Software Foundation addresses, successfully, is the popular tendency to copy someone else's words and claim them as your own, then to use copyright against others.

            lol they don't help with that at all! Copyright itself gives you that power, and your ability to enforce it depends entirely on your access to lawyers.

            And if you give your copyright over to the FSF, they do not actually enforce it at all, they use it to extort from large violators, and in the end never implement the promised terms. It is all lies, and horse shit.

            With the Apache 2 license, I have the exact same protections under the Copyright Act, and yet, no lies, no bullshit, no need for lawyers unless indeed so

            • > lol they don't help with that at all! Copyright itself gives you that power, and your ability to enforce it depends entirely on your access to lawyers.

              When the source code is kept secret, it's much more difficult to prove the copyright violation. Have you ever tried to enforce a software copyright for anything you published?

              • This was part of the difficulty with the SCO versus Linux users lawsuit. SCO refused to display the source code they claimed was infringed, for years. By compelling publishers of software to include access to the source code for those clients, it makes it much easier to trace a violation.

                The GPL also blocked Sourceforge from bundling spamware into GIMP: Sourceforge took over the idle source code repository for Windows compatible GIMP, inserted various adware and spew, and published it as a Windows GIMP pack

                • Did you know they make these things called "calendars?"

                  Take a look at what year it is. Then look up the SCO lawsuit. Then realize that's your best, most recent example.

                  It proves my point. Look at my user id. I know about the SCO lawsuit. I followed it here on slashdot.

                  • I'm old. The most infamous examples and public examples are not recent, but they are compelling. For me, I need to keep such discovered abuses more private and resolve them more discreetly, so I cannot post them here.

    • The FSF is focused on freeing the software itself from being locked away by malicious actors, because this makes it free for the users. And the users are the ones who matter. The code doesn't have feelings. The developers are in the minority, and if they aren't there to serve the users, fuck them anyway.

      • How do you "lock away" BSD? How would you even try? It is just a lie that the FSF tells, it isn't an actual thing they're doing.

        • Because it's easy to make things sound more nefarious if you are disingenuous. While I do support some of the FSF's goals they often get ridiculous when they do things like comparing their cause to slavery, as if slaves had a choice about when and whether to be slaves or not like computer users do over when and whether to use free or non-free software. But hey, it makes non-free software sound so much more evil!

          • As a developer my favorite thing about the Apache 2 license is that I have no control at all once I give it away. I have no temptation to outrage or manipulation. I can share future updates, or not, and that is it. Just so simple, and free. Use it or don't. Use it what I thought of, or something else. I don't have to care, I'm not the one using it for that! No calculation, no hyperbole, no temptation to a cause. Just some code.

            • Yes, it's altruism.
              • Well, it is whatever you want it to be.

                You give it away, some company uses it, they want paid support, they might want to hire you.

                You apply for some job, you want to talk about your open source in the interview, who are your users? A bunch of FSF neckbeards, or are companies using your code in their products? Which has more commercial value?

                Or if you don't care about any of that, maybe it was just "altruism," also known as, that warm fuzzy feeling when you do something Virtuous.

                Or maybe you just want there

      • The FSF is focused on freeing the software itself from being locked away by malicious actors, because this makes it free for the users.

        You mean preventing non-free derivative works. In this instance this "Copilot" thing is facilitating sharing of code, the exact thing the FSF claims to be in support of.

    • Restriction is what they mean by "freedom." When they say "software freedom," they do not mean the word software combined with the word freedom. They mean instead, freedom from unapproved choices.

      For end users this isn't really that noticeable, because people just want to download and use stuff without paying anything, which they can do.

      For developers... well, the vast majority of new open source projects that have users are Apache 2 licensed, or BSD.

      Remember, the "free software" people consider "open sourc

  • Auto complete, Error/Suggest,....and the like have been a thing for a while now? The internet is a cognitive entity and may have been from the beginning?
  • ... so they've seemed to choose GitHub. Did anyone here really think that Microsoft acquired GitHub for anything besides capitalist purposes? This is Microsoft we are talking about. Buy something for the purpose of making money off of it, not to improve what is bought.

    .
    How many products have Microsoft bought to kill them?

    • by ChatHuant ( 801522 ) on Saturday July 31, 2021 @07:36PM (#61642791)

      Did anyone here really think that Microsoft acquired GitHub for anything besides capitalist purposes? This is Microsoft we are talking about. Buy something for the purpose of making money off of it, not to improve what is bought.

      How many products have Microsoft bought to kill them?

      You don't really know what you're talking about, do you? Take a look here [wikipedia.org], for more information. I picked just a few counter-examples to your assertion:

      Forethought: purchased in 1987, their product became Powerpoint, still available after 35 years.
      SQL server - licensed originally from Sybase, 1989. Still going strong 32 years later.
      Flight Simulator - originally from Sublogic, 1982. Newest release 2020, 38 years after.
      LinkedIn - purchased by Microsoft in 2016, still up and running 6 years after.
      Navision - purchased in 2002, available in 2021 as Microsoft Dynamics, 19 years later.
      Visio, bought in 2000, still available in 2021

      There are some companies that MS bought then dropped (Nokia is probably the worst example, and Skype too), but comparing this to Google's approach is just silly.

      • by fahrbot-bot ( 874524 ) on Saturday July 31, 2021 @07:52PM (#61642811)

        ... Powerpoint, still available after 35 years.

        It's debatable as to whether this one is a good thing. :-)

        • "It looks like you're shitposting...." - Clippy
        • by bn-7bc ( 909819 )
          You can't blame the tool for the output created by users that really should not do presentations at all. That would be like blaming a nail gun for nailing your foot to something else when it was functioning correctly at the time. You could ofc argue that PowerPoint should make it harder for people to overload presentations with way too many transitions and other effects, or to flick through 70 slides in 25 minutes etc. (numbers might be slightly inflated/reduced). But a bad presentation would probably be bad an
      • by Guspaz ( 556486 )

        It's also worth noting that after buying GitHub, Microsoft went and started replacing everything with Git. Azure DevOps (the successor to TFS and VSTS) uses Git via GitHub's libgit2 as its primary source control system (TFSVC is still supported but Git is the default).

        • It started before github.

          I know a few Microsofties. What's well known is that Microsoft has a pretty strong dogfooding policy. What's moderately well known is they also had two of their own version control systems, both of which are terrible. There was a long-running and large internal fight about using git internally, which had been rumbling on for at least 5 years, maybe longer.

          Eventually team git won.

          • And imagine what a horrorshow those MS systems must have been in order to make git look good by comparison.
      • by bn-7bc ( 909819 )
        Well, Flight Simulator had a rather long pause where MS did nothing with the IP, from October 16 2006 (the release of fsx steam edition) until they started work on FS 2020, so while you might say that they have had the product for 38 years, it has certainly not been actively developed during all of that time. IIRC MS at one point officially said that they stopped all development on FS and fired most of the devs.
      • Skype wasn't dropped. I pay $14/m for an unlimited outbound bridge to the phone system in Thailand. Before that I was buying those shitty "phone cards" all the time.

      • Actually, I do know what I am talking about. Your cherry-picked examples notwithstanding.
    • How many products have Microsoft bought to kill them?

      I think Microsoft is more like Adobe - they buy companies to prevent competition, then gradually turn their products into crapware.

      • Microsoft has actually purchased many companies and products specifically with the intent of adding the functionality to Windows. WLBS used to be Wolfpack, many people said it was better when it was but the point is they didn't just throw it away. It became a Windows feature. Might have been superseded by now, I haven't kept up with clustering on Windows.

    • Of course they make acquisitions that they think will make them money. That is what companies _should_ do.

      Killing GitHub would _not_ make Microsoft money. It's the complete opposite, they make money by making it better. Consider some of the changes MS made since they bought Github:

      * They added GitHub actions which became the #1 CI/CD tool practically overnight.
      * Unlimited private repos for free
      * Fixed Lots of usability issues [github.blog]
      * Github Codespaces [github.com]
      * Automatically found and fixed millions of security issue

  • What about us? (Score:5, Insightful)

    by PhrostyMcByte ( 589271 ) <phrosty@gmail.com> on Saturday July 31, 2021 @07:31PM (#61642779) Homepage

    When I read a book, read source code, etc. and train my electrochemical neural network brain, and then create code later on using that neural network, does that constitute a copyright issue?

    No. No it doesn't. Unless the AI is spitting out verbatim copyrighted code, or something very close to it, then I don't see how this could be an issue.

    • Re:What about us? (Score:5, Interesting)

      by Waffle Iron ( 339739 ) on Saturday July 31, 2021 @08:37PM (#61642857)

      I'd argue that current computer neural networks do not work similarly enough to biological ones to count as creating novel works.

      Computer memory is much more precise than biological memory. Even if the code has been taken apart and stored in an unrecognizable fashion, the computer neural network is still likely to reconstitute the code in nearly verbatim chunks. This system almost certainly has no "understanding" of the problems it's solving. It's just matching patterns it has stored against what someone has typed, then mechanically reassembling the patterns.

      Your argument will probably hold if they ever develop AI to the point where it actually understands the problem statement and produces code that addresses requirements of the project just from that. Unfortunately, if that ever happens, software developers will all be out of a job anyway.

    • by godrik ( 1287354 )

      Unless the AI is spitting out verbatim copyrighted code, or something very close to it, then I don't see how this could be an issue.

      Well... in many cases it does spit out existing code verbatim..
        Comments, authorship, and license included.

      That's why they are concerned. It seems like a pretty clear case of infringement even in more moderate cases.
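To make "spitting out existing code verbatim" concrete, here is a rough sketch of how one might look for the longest run of identical lines shared between a suggestion and a known file. It is an illustration only, not how Copilot or any license scanner actually works; the function name `longest_shared_run` and both sample texts (including the license comment) are made up for the example.

```python
import difflib


def longest_shared_run(suggestion, known_source):
    """Return the longest run of consecutive non-blank lines found in both texts."""
    # Strip indentation and drop blank lines so trivial reformatting is ignored.
    a = [line.strip() for line in suggestion.splitlines() if line.strip()]
    b = [line.strip() for line in known_source.splitlines() if line.strip()]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


# Hypothetical "known" file with a license comment.
known_source = """\
# Copyright (C) Example Author (hypothetical notice)
# Licensed under the GPL
def clamp(value, low, high):
    return max(low, min(value, high))
"""

# Hypothetical suggestion that repeats part of it, comment included.
suggestion = """\
import math

# Licensed under the GPL
def clamp(value, low, high):
    return max(low, min(value, high))

print(clamp(5, 0, 3))
"""

shared = longest_shared_run(suggestion, known_source)
print(len(shared), "consecutive lines in common")
```

A long shared run, especially one that includes comments or a license notice, is a hint that a snippet was reproduced rather than independently written. Whether that amounts to infringement is the legal question the thread is arguing about, and this sketch does not answer it.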

      • How's that any different than the current practice of copying code out of github? Does "done with an ai..." somehow change the practice, let alone the obligations?

        • by godrik ( 1287354 )

          Copying code out of github in many cases IS copyright infringement. And FSF does fight it and help the copyright owner fight it in those cases too.

          What changes with copilot is that the legal status of the generated code is very unclear. To me, it seems that the machine learning model is derived work from the code and therefore should be GPLed and the code derived from the model is also derived work and should be GPLed too.

          Though, I am a random slashdoter, what do I know...

          • To me, it seems that the machine learning model is derived work from the code and therefore should be GPLed and the code derived from the model is also derived work and should be GPLed too.
            Luckily this is not the definition of "derived work".

            • by godrik ( 1287354 )

              Isn't it?
              If I print a GPL code it is derived work. And then if I scan the printed version, it is still derived work of the original work. And it is subject to the GPL.

              If I zip a piece of code and unzip it, it does not magically become not subject to the GPL.

              Provided when you run the code through their machine learning model it is still able to reproduce the code exactly as is including variable names, comments, license, and authorship, it seems pretty clear to me that their model just transform the code to

              • Isn't it?
                If I print a GPL code it is derived work. And then if I scan the printed version, it is still derived work of the original work.

                No, if you print it is a copy of the original work, and if you scan that, it is still a copy of the original work.

              • I suggest you simply read the relevant law.
                What "derived work" is, is a legal term. And it is completely clearly written in said law.

        • Copying code out of GitHub is usually not copyright infringement because the source code is provided with license terms. As long as you comply with those terms you can use the code. Many times code is copied without taking measures to comply with the required license obligations. For hobbyist work this may not be a big deal. But for commercial entities it can be a legal quagmire. So much so that there is an entire industry built around detecting and preventing this!
          • Copyright doesn't apply in any way until you actually "distribute" what you made. As I see it, Copilot distributes FOSS-licensed snippets to the coders who use it, yet it's the coder's responsibility to double-check all of the code's legal requirements before distributing code that was made with the assistance of this tech. If it's something you do purely for yourself, licensing concerns never apply since you don't distribute it.
            • True, but bear in mind that pushing it into a git repo that is not on your own machine and which others can pull from probably counts as distributing, so even private companies using distributed source control for internal projects may, technically, be distributing the source code.

              I mean, it would be hard to argue that distributed source control doesn't contain distributed code.

    • The difficulty for copyright is when you copy large portions verbatim. This happened to Helen Keller, quite accidentally by all accounts. It happens to new writers when editors pass submissions to different authors to see what they can do with the story, and they leave in too much of the original accidentally. The AI would have to be written cautiously to avoid just this problem, and it can be difficult for even a lawyer, judge, or editor to judge consistently.

    • by AmiMoJo ( 196126 )

      Unless the AI is spitting out verbatim copyrighted code, or something very close to it, then I don't see how this could be an issue.

      That's exactly what it's doing, right down to reproducing the copyright notice comments.

    • In Europe it would not be a problem.
      Work solely created by a computer/algorithm is not copyrightable. Not even by the author of that algorithm.
      So if you want to nitpick: it is a combined work of the programmer - copyrighted by him - and uncopyrighted snippets contributed by the AI.
      But who would care?
      A better programmer would have written the same - or better code - by himself.
      And coming for the same problem to the same solution, that is hardly a copyright issue.

  • by Anonymous Coward on Saturday July 31, 2021 @08:11PM (#61642839)

    ... that training the CoPilot expert system on freely-available public repositories absolutely qualifies as fair use under the current definition of the term.

    The other questions are mostly harder to answer definitively. For instance, the question of whether CoPilot suggesting verbatim re-use of code copyrighted under free software licenses (e.g., the GPL) constitutes violation of that copyright depends on whether the programmer who incorporates that code gives credit to the author of the copied stuff, and on whether he/she makes the resultant program free and open.

    Disclaimer: I am not a programmer. I am not YOUR programmer. If you need a programmer's services, hire one ...

    (Posted anonymously only so as not to undo positive mods to previous comments on this story.)

    --

    Check out my novel [amazon.com].

    • Seems to me GitHub Copilot gets to the heart of the question: can one copyright ideas? It's not giving snippets of code from code repositories, but the idea that goes with the intent the programmer is trying for.

      Also the tool is going through growing pains [www.fast.ai] so all this may be premature.

  • by imp ( 7585 ) on Saturday July 31, 2021 @09:39PM (#61642945) Homepage

    Once upon a time, AT&T said that if I read their proprietary sources to learn how Unix worked, my brain had been infected and I couldn't pass along that knowledge. That position lost in court. It sounds a bit like what FSF is saying here: Read GPL code, then the code you produce must be GPL'd.

  • GPL v4 (Score:4, Insightful)

    by backslashdot ( 95548 ) on Saturday July 31, 2021 @11:59PM (#61643169)

    If they are so mad about it, the FSF ought to add an AI-training exclusion in the next version of the GPL. Meanwhile, it was never stated in a GPL license that people can't use it to train their AI or use snippets of the code. As long as Microsoft discloses which code it used to train the AI they are in the clear.

    • by bn-7bc ( 909819 )
      Well, that would mean GPL v3.5 or 4, as the FSF has no possibility to change existing versions of the GPL unless all projects and related contributors accept it; that's why the Linux kernel stays on GPL v2
  • I can see an argument for making it so that all code generated by Copilot must be released under the GPL, and that too seems shaky to me.

    I don't see any cause for there being a copyright violation to train an AI with GPL code. FSF is being a bitch.

    • Being a birch? Grow up. You're just engaging in reflexive anti FSF rhetoric before engaging your brain.

      This is a huge unaddressed question across the whole industry. No one knows if using copyright data to train a network violates copyright. And no one knows how close you have to get to the original work before copyright is violated.

      You're not a lawyer and you don't know. In fact no lawyers do either, though they can make educated guesses. The only person who will ultimately get to decide this is a judge, o

  • If folk are using copilot, then perhaps the onus should be on them to do due diligence on the code it generates, to find the originating source and associated license?

    I mean, it's fairly easy to tell the difference between a fairly common algorithm vs. a big chunk of code that a coder could fairly easily be suspicious about - "wow, that's a pretty complex bit of auto-generated code!"

    However, if copilot isn't revealing *where* that code originated - in which repository - I guess that makes a users decision m

  • When a song includes a minimal sample from another song, it's considered copyright violation.

    Even if two songs are not digital clones of each other, reproducing the same patterns (as a direct song cover or as a very similar melody over very similar chords) is again copyright violation.

    Isn't the above exactly what Copilot does?

    (The AI process is irrelevant - if AI was used to help composers and producers write music and added for good taste a sample of U2 in a song or reproduced the melody of a Metallica son

    • There's only so many ways to write a function to calculate the nth Fibonacci number. What you have in the music industry would be akin to the first person to copyright such a function would be able to prevent everyone else from publishing a function using the same algorithm.

      Programming is not art. A function is not an artistic expression. Give a number of programmers same problem, and you will often find that two or more programmers independently arrive at very similar solutions.

      In software we often emphasi
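As a small illustration of the Fibonacci point above, here are two versions of the usual iterative function, written as if by two different programmers. Both are hypothetical examples made up for this sketch; they differ only in naming, which is what one would expect when the problem leaves little room for variation.

```python
def fib_a(n):
    """Return the nth Fibonacci number (fib_a(0) == 0, fib_a(1) == 1)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_b(count):
    """Same algorithm, as a second programmer might independently write it."""
    previous, current = 0, 1
    for _ in range(count):
        previous, current = current, previous + current
    return previous


first_ten = [fib_a(i) for i in range(10)]
print(first_ten)
print(fib_a(20) == fib_b(20))
```

Apart from identifiers the two bodies are line-for-line the same, so textual similarity alone says little about copying for functions this small.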

      • It seems to me that we need to establish when it becomes too "inspired" and when you could reasonably have arrived at the same formulation yourself.

        It seems to me that we need to do away with software patents so that software can be of the highest quality possible without having to worry about whether one has reinvented a wheel.

        One might argue that patents in general are holding back progress today, but software patents are clearly bananas. The same terms don't make sense as for physical inventions.

      • it becomes a burden to prove copyright infringement.
        Even accidentally copying bunches of code outside of a 'cleanroom implementation' is not a copyright violation, but an independent development.

    • The RIAA most likely would not care, as it would be an issue between Metallica and U2.

  • If Microsoft is allowed to get away with this, then I hope a lot of people take to the high seas to make them aware of just how hard it can be to fuck people over, even with all the tools a mega-corporation with the morals and ethics of a serial killer can bring to bear.

  • link [slashdot.org]: Haaaa!
