
Python 'Chardet' Package Replaced With LLM-Generated Clone, Re-Licensed

Ancient Slashdot reader ewhac writes: The maintainers of the Python package `chardet`, which attempts to automatically detect the character encoding of a string, announced the release of version 7 this week, claiming a speedup factor of 43x over version 6. In the release notes, the maintainers claim that version 7 is "a ground-up, MIT-licensed rewrite of chardet." Problem: The putative "ground-up rewrite" is actually the result of running the existing copyrighted codebase and test suite through the Claude LLM. In so doing, the maintainers claim that v7 now represents a unique work of authorship, and therefore may be offered under a new license. Versions 6 and earlier were licensed under the GNU Lesser General Public License (LGPL). Version 7 claims to be available under the MIT license.

The maintainers appear to be claiming that, under the Oracle v. Google decision, which found that cloning public APIs is fair use, their v7 is a fair use re-implementation of the `chardet` public API. However, there is no evidence to suggest their rewrite was done under "clean room" conditions, which traditionally has shielded cloners from infringement suits. Further, the copyrightability of LLM output has yet to be settled. Recent court decisions seem to favor the view that LLM output is not copyrightable, as the output is not primarily the result of human creative expression -- the endeavor copyright is intended to protect. Spirited discussion has ensued in issue #327 on `chardet`'s GitHub repo, raising the question: Can copyrighted source code be laundered through an LLM and come out the other end as a fresh work of authorship, eligible for a new copyright, copyright holder, and license terms? If this is found to be so, it would allow malicious interests to completely strip-mine the Open Source commons, and then sell it back to the users without the community seeing a single dime.
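For context, `chardet`'s public API is essentially a single `detect()` call that takes bytes and returns a dict with the guessed encoding and a confidence score. The toy sketch below is NOT chardet's actual algorithm (chardet uses statistical character-frequency models); it only illustrates, using byte-order-mark checks from the standard library, the shape of the problem and of the API under discussion:

```python
# Toy illustration of encoding detection: check for a byte-order mark (BOM).
# NOT chardet's algorithm -- just a stdlib-only sketch of the same API shape.
import codecs

def detect_bom(data: bytes) -> dict:
    """Return a chardet-style result dict based on BOM prefixes alone."""
    # UTF-32LE's BOM (FF FE 00 00) starts with UTF-16LE's (FF FE),
    # so the longer BOMs must be checked first.
    boms = [
        (codecs.BOM_UTF32_LE, "UTF-32LE"),
        (codecs.BOM_UTF32_BE, "UTF-32BE"),
        (codecs.BOM_UTF8, "UTF-8-SIG"),
        (codecs.BOM_UTF16_LE, "UTF-16LE"),
        (codecs.BOM_UTF16_BE, "UTF-16BE"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return {"encoding": name, "confidence": 1.0}
    # No BOM: fall back to a crude all-ASCII check.
    if all(b < 0x80 for b in data):
        return {"encoding": "ascii", "confidence": 0.5}
    return {"encoding": None, "confidence": 0.0}

print(detect_bom(codecs.BOM_UTF8 + "héllo".encode("utf-8")))
# → {'encoding': 'UTF-8-SIG', 'confidence': 1.0}
```

The real library does far more work for BOM-less input (the common case), which is exactly the statistically tuned code whose authorship is in dispute.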


  • by FictionPimp ( 712802 ) on Friday March 06, 2026 @02:07PM (#66026518) Homepage

    The U.S. Supreme Court has ruled that AI-generated artwork cannot be copyrighted because it lacks human authorship, reaffirming that copyright law requires works to be created by humans. This decision follows a case involving Stephen Thaler's AI-generated artwork, which was denied copyright protection by the U.S. Copyright Office.

    • by Uninvited Guest ( 237316 ) on Friday March 06, 2026 @02:22PM (#66026554)

      The U.S. Supreme Court has ruled that AI-generated artwork cannot be copyrighted because it lacks human authorship, reaffirming that copyright law requires works to be created by humans. This decision follows a case involving Stephen Thaler's AI-generated artwork, which was denied copyright protection by the U.S. Copyright Office.

*effectively ruled. The SCOTUS declined to take the appeal, leaving in place the lower appeals court ruling. The ruling doesn't specifically include source code, but there's nothing in the ruling (or in copyright law) to suggest an exception for AI-generated source code. It sure sounds like chardet v7 is in the public domain from creation, and cannot be restricted by any license.

      • by DarkOx ( 621550 ) on Friday March 06, 2026 @02:32PM (#66026568) Journal

        This isn't really the same question though.

        Now that ruling may imply the newly generated library can't be licensed at all because it can't be copyrighted.

        However the question is can you tell an LLM to re-implement some other IP and then claim that does not infringe on the original / isn't subject to its license.

        I honestly don't understand why it would not come back to the same rulings that have been made about clean room implementations in the past and the fact that you can't copyright an interface.

        In the closed source case where we can assume the original could not have been in the training set, and you give claude nothing but the API doc and say make me a library that behaves exactly like this description - I don't see how that could infringe on the original.

On the other hand, if you provide the source to the original, I don't see how it couldn't. Just like if I renamed all the characters in Harry Potter and used a thesaurus to replace every fifth word, J.K. Rowling would probably have little trouble suing me.

In the FOSS case, where it's downright probable the model was trained on the source, we are back to the unsettled questions of how much of the original content survives, how likely the model is to generate outputs that don't materially differ from the original, and the usual case-by-case disputes about when something is materially different...

One complication is that while the maintainers maintain that they didn't expose Claude to the original source code, and specifically told Claude not to look at it, is that really a thing, given the degree to which LLMs are trained on GitHub? Claude has, in all likelihood, actually seen the source code, and probably can't be relied upon to have "forgotten" what it looked like.

          I don't think that's necessarily an insurmountable hurdle, but someone examining the code who finds sections that are substantially the

          • by rta ( 559125 )

            and even if it hadn't "seen" it during training the models have shown to be quite morally flexible.

so telling it to NOT look at the answers to the test that are in the unlocked cabinet over there (the public github repo) while you step out of the room for 30 minutes would be about as effective as telling Bart Simpson.

            (ok... idk exactly what Claude would do in this case, but both various papers and my personal experience w/ other tasks show a serious desire to get away with whatever they can get away

You're solely focused on whether they're allowed to create a clone (because if you can't create it, then you couldn't license it), and are ignoring the other interesting implication of the idea that AI creations cannot be copyrighted in the US.

          What happens to all this "work" that people are using AI to create? Supposedly amplifying our productivity, and pushing companies to expect more from fewer people.

          What if someone copies it (unhappy employee before leaving). Is that illegal if it can't be protected IP? Or a

      • > It sure sounds like chardet v7 is in the public domain from creation, and cannot be restricted by any license. ...but only if it can be proven it's not based on the GPL'd work. So basically the maintainers can't relicense it, they can either maintain it's genuinely clean-room in which case there is no copyright and they can't license it under MIT or anything else (and probably need to warn potential users of potential issues applying copyright protection to software that uses it), or they can admit its

        • by ceoyoyo ( 59147 )

          Why would they need to warn anybody? If the code is public domain someone can use it freely.

          Looking at who the top maintainers are, I suspect the goal here was to remove the restrictions imposed by the GPL so the software could be easily used in closed source programs from their employers. The MIT license allows that. Public domain even more so.

          • > If the code is public domain someone can use it freely.

            The issue comes with combining it with other GenAI code. The impact of declaring GenAI stuff to be public domain is much wider than "Everyone can use it".

            • by ceoyoyo ( 59147 )

              It doesn't. If it's public domain, i.e. uncopyrightable, that's what it is. Anybody can use it for whatever they want. It doesn't infect the rest of your code the way the GPL does.

    • by EvilSS ( 557649 )
      The problem with that is that no one in Congress reads the /. summary.
    • by Ksevio ( 865461 )

      That's true, but I'm not sure how much that applies. There was undoubtedly human input and code is a bit different from artwork. If you take something that doesn't have copyright applied to it and then modify it, it seems like it would be able to be licensed

    • by Kisai ( 213879 )

      This absolutely will apply to code, however I feel the question of "laundered" is what is really important here.

      If one can merely launder a work through an LLM, to strip it of copyright, won't everyone do that to every creative work out there to effectively end copyright?

      What's to stop people from making ML "covers" of music? How is this different from code? (Before you ask, yes, people have already been doing this for both the musical and lyric component of songs, making AI generated tracks of artists who

  • Feels wrong (Score:4, Interesting)

    by liqu1d ( 4349325 ) on Friday March 06, 2026 @02:26PM (#66026560)
    I don't have a legal basis for this argument but it seems wrong to change the license on the project in such a manner. If the owner wanted to change it then it's fine but it appears the current maintainer isn't the owner. If they truly believe it's legally unique then it should be created under its own repo and stop providing updates to the GPL one.
Meh, just clone the new codebase 20 times, rename the projects chardet-ng, chardet2, ..... and make small modifications to each of them. Then write blogs that argue the merits of the forked versions and show sample code. The AIs will pick it up, and pretty soon they will recommend using the variants. Then you can start adding subtle bugs to each variant's codebase, and write more blogs complaining how chardet should not be used because it's buggy.

      Open Source depends on both licensing terms and community go

    • Like a purposeful attempt to cause the judicial system to consider all the implications of the precedents they're setting with other decisions.

      Or they truly believed they weren't stealing anything. That they either didn't imagine the old code could be used during this recreation, or they were careful to keep things separate and just haven't spent the time to explain all the ways (maybe to not give opposing lawyers anything to prepare from).

      But yeah. It's hard to believe this general process should be seen

    • but it appears the current maintainer isn't the owner

      Well the owner could simply delete the changes if they disagree with it and kick the maintainer off the project. A maintainer is given an implicit trust by the owner to manage the project. The owner retains power over their github project.

  • Why not just take their V7 and run it through Claude again, declare it to be V8, and release under the original copyright?

    Is there any reason to believe that the code would be *identical* after that second pass?

  • Not clean room (Score:4, Insightful)

    by F.Ultra ( 1673484 ) on Friday March 06, 2026 @03:08PM (#66026636)
This is clearly not a clean room, since the LLM was trained on the copyrighted source code, and it also is not just a reimplementation of the API, so Oracle v. Google does not apply.
Not hard to make it a clean-room implementation though with AI agents. You follow the same steps:
      1st AI gets the code and extrapolates a complete and fully nuanced functional spec.
      2nd AI gets the clean functional spec and writes net new code without ever having seen the original code.
      (largely unnecessary / could be 1st) 3rd AI performs parallel unit and functional testing against both versions of code and feeds back a list of exceptions and revisions for 2nd AI to make to net new.

      If you want extra sanit
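The third step above is essentially differential testing: feed the same generated inputs to both the original library and the regenerated clone, and flag any divergence for the second agent to fix. A minimal sketch, where `old_impl` and `new_impl` are hypothetical stand-ins for the original and regenerated code (here, two trivially equivalent ASCII checks):

```python
# Differential-testing sketch: compare two implementations on random inputs.
# old_impl / new_impl are placeholder stand-ins, not chardet itself.
import random
import string

def old_impl(s: str) -> bool:
    """Stand-in for the original library's behavior."""
    return all(ord(c) < 128 for c in s)

def new_impl(s: str) -> bool:
    """Stand-in for the regenerated clone's behavior."""
    return s.isascii()

def differential_test(n_cases: int = 1000, seed: int = 0) -> list:
    """Run both versions on random strings; return inputs where they disagree."""
    rng = random.Random(seed)
    alphabet = string.printable + "éüß漢"  # mix ASCII and non-ASCII
    failures = []
    for _ in range(n_cases):
        s = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 20)))
        if old_impl(s) != new_impl(s):
            failures.append(s)
    return failures

print(len(differential_test()))  # 0 divergences for these two stand-ins
```

An empty failure list shows behavioral equivalence on the sampled inputs; it says nothing, of course, about whether the clone is a derivative work, which is the legal question the thread is arguing about.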
I don't think you understand the OP's point. LLMs are trained on everything the model creator can get their hands on. That means all Internet-available open source (and much non-open-source) code, including chardet. An existing AI can't perform a clean-room implementation because even if you don't show it the code, *it's already seen it*. And since the training data is encoded in the weights in a very non-trivial manner, you can't specifically remove a set of training data after the fact. You'd have to train
        • Copyright isn't that infectious.

You'd never be allowed to write a book after university, as you would have been exposed to too much material, if that were the case.

          • by caseih ( 160668 )

The difference is most humans don't have a perfect memory. And those that do, if they were to put an entire extract from a copyrighted work in their own work, would be called on it. And people have been successfully sued for copyright infringement over creating something that was too much like something else they had heard or read, even if it was the product of their own mind.

            LLMs have been shown over and over again to be able to reproduce literary works word for word if you prompt them in just the right way.

            • by topham ( 32406 )

              Plagiarism and copyright violation aren't the same thing, contrary to some people's opinion.

LLMs don't produce a literary work verbatim, they produce extracts verbatim. I have two ex-gfs who can quote entire pages from books they've read (dozens, to hundreds). (It helps to be on the spectrum.)

But the ability to do so doesn't mean anything they produce is inherently a violation; it does increase the *risk*.

        • Take this training dataset and sanitize it of all references and code from that project and dataset over there. Gotcha, right. Then use the resulting dataset to train a new functionally equivalent model with 0.0000000001% of its training data missing.

          We'll call that a job for the 0th AI Agent.
        • LLMs are trained on everything the model creator can get their hands on.

So if you've ever read source code, you're not able to claim you made a clean room implementation, even if your implementation is nothing like the original? Sorry, but you're not interpreting that correctly. It's ultimately a question of derivative works or not. Just because you've read code doesn't make everything with similar functionality that you come up with a derivative. That's not how the original ruling applied.

Not if you are an LLM, no; if you are human, yes, because no one expects you to have complete 100% memory of terabytes of source code. An LLM, though, is a completely different thing.
Clean room implementation is *not* required.

      The idea of a clean room implementation is to remove the possibility of the resulting code being in violation, however, that's not actually required to avoid being in violation. That just makes it much easier to show good faith.

      If you implement a test suite, and then have the AI generate a version that complies with the test suite, it's entirely possible you are not in violation.

Semantic code validation would be prudent, but not necessarily required.

      (Replace all v

    • The process sounds functionally the same as running code through a compiler. You get compiled code that isn't character-for-character identical, but has the same functionality. And it requires the original source.

      Having an LLM obfuscate/refactor copyrighted code, by feeding it the copyrighted code, doesn't seem much different.

Looks like this demonstrates a need for updating licensing terms for open source code. Can this code be used as input to an LLM? If so, is the resulting code limited to a specific license? I can see things such as a "Claude-GNU LLM" being released where the LLM can output GNU-licensed code. This would be guaranteed by only using training material licensed to allow for it.

  • No. (Score:5, Informative)

    by Local ID10T ( 790134 ) <ID10T.L.USER@gmail.com> on Friday March 06, 2026 @03:48PM (#66026720) Homepage

    Can copyrighted source code be laundered through an LLM and come out the other end as a fresh work of authorship, eligible for a new copyright, copyright holder, and license terms?

    That is simply creating a derivative work. Derivative works generally are infringing (various exceptions exist: fair use, etc.).

    The maintainers appear to be claiming that, under the Oracle v. Google decision, which found that cloning public APIs is fair use, their v7 is a fair use re-implementation of the `chardet` public API.

    This is a misrepresentation of the finding in Oracle v. Google. The finding was that APIs are not subject to copyright because they are statements of facts (e.g. function "blah" takes input integer, returns character) and are intentionally published for interoperability (like listing phone numbers in a phone book so that they can be called). How the underlying code is implemented is a separate issue.

Code may not be subject to copyright if the function can only be implemented in a particular way. If there are many ways to do a thing, then the particular way it is done may be copyrighted -- and a different way of doing it would not be infringing. Either creating a different way of doing the thing when the original way is known, or "clean-rooming" -- creating a way of doing the thing knowing only the specifications -- would not infringe on the copyright, since similarity could be attributed to the obviousness of the method, and copyright protects creative expression.

    • by Sloppy ( 14984 )

      Code may not be subject to copyright if the function can only be implemented in a particular way.

      Yeah, that's what patents are for!

    • That is simply creating a derivative work. Derivative works generally are infringing (various exceptions exist: fair use, etc.).

No, implementing a functionality in a different way does not automatically make it derivative. A significant portion of the original work needs to be used for it to be derivative, and the courts explicitly rejected the idea that the function -- the API -- was copyrightable.

  • Surely this pretty much falls under the same plagiarism rules and laws as translating code from one language to another, no?
  • I took an illegally obtained copy of the source code to Windows. I put it through an LLM so that it would generate new source code. It does exactly what Windows does, but it's different code. I even had the LLM convert it to Pascal. I'm making this new source code available under an MIT licence. It's fine because it's all original work. /s
    • by jpatters ( 883 )

      You may not even need the source code, it is likely that AI models are pretty close to being able to produce a specification from binaries.

The release notes claim the new version has higher accuracy, meaning it returns different (better) answers for some input strings. It seems to me that its training can't be limited to the old code in order to achieve this. Still, I agree that if they fed the code to a program and told it "write a better version," then the copyright of the original code still applies. They may also have just generated a lot of strings and in some (many) cases fed them to the old program and told the AI to make a program that pr

  • If this is legal, I can train an LLM on not the source, but instead the binary code, of an existing program and create a non-infringing clone. Open source clone of Microsoft Word, anyone?
  • One can contemplate that it would be possible to do some sort of "clean room" implementation where you input some source code (or even an executable) to one AI system that then outputs a specification, and then feed the specification to a different AI system to produce a new source code output. However, the result shouldn't be copyrightable at all because it is not the result of human authorship.

I think this person should be criticized harshly.

  • The prompts provided to the LLM should be copyrightable as code and the code generated should be protected the same way compiled or intermediate code is protected.

    The issue at hand is how the model was trained. And the users of the LLM should be made clearly aware by the model trainer whether the user or the model trainer is responsible for the liability related to using other peoples code for training the model.

    That said, we should soon be seeing models that are trained using training courses rather than m
If running something through an LLM creates a new non-derivative work, it isn't necessarily limited to open source code. How long would it take to train an LLM to use (or be) a disassembler and then rewrite everything in some other language?

  • The project maintainers released a new version of the software. No problem there. They ran it through an LLM. Well, no problem there, right? That's what they're for. They changed the license. Is that unusual? Does it matter?

    I'm struggling to see how there are any issues here. What's the problem?
