More Than Half of GitHub Is Duplicate Code, Researchers Find

More Than Half of GitHub Is Duplicate Code, Researchers Find (theregister.co.uk) 115

Posted by msmash on Thursday November 23, 2017 @07:53PM from the redundant dept.

Richard Chirgwin, writing for The Register: Given that code sharing is a big part of the GitHub mission, it should come at no surprise that the platform stores a lot of duplicated code: 70 per cent, a study has found. An international team of eight researchers didn't set out to measure GitHub duplication. Their original aim was to try and define the "granularity" of copying -- that is, how much files changed between different clones -- but along the way, they turned up a "staggering rate of file-level duplication" that made them change direction. Presented at this year's OOPSLA (part of the late-October Association of Computing Machinery) SPLASH conference in Vancouver, the University of California at Irvine-led research found that out of 428 million files on GitHub, only 85 million are unique. Before readers say "so what?", the reason for this study was to improve other researchers' work. Anybody studying software using GitHub probably seeks random samples, and the authors of this study argued duplication needs to be taken into account.

More Than Half of GitHub Is Duplicate Code, Researchers Find

This discussion has been archived. No new comments can be posted.

Load All Comments

Search 115 Comments Log In/Create an Account

Comments Filter:

Dupes? (Score:3, Funny)

by Tablizer ( 95088 ) writes: on Thursday November 23, 2017 @08:02PM (#55613105) Journal

You you don't don't say say.

- Re: (Score:2)
  
  by Shikaku ( 1129753 ) writes:
  
  Can we get one for Slashdot too?
- Re: (Score:3)
  
  by Zaiff Urgulbunger ( 591514 ) writes:
  
  You you don't don't say say.
  You're forking kidding me!
- Re: (Score:2)
  
  by zifn4b ( 1040588 ) writes:
  
  You you don't don't say say.
  Too bad there isn't a server de-dupe solution for Slashdot to optimize space usage
- Re:The Facebook of code (Score:4, Funny)
  
  by glenebob ( 414078 ) writes: on Friday November 24, 2017 @03:20AM (#55614161)
  
  horrible JavaScript
  I found duplication in your post.
  
  - Re: (Score:2)
    
    by Zero__Kelvin ( 151819 ) writes:
    
    That's redundancy, not duplication.
    - Re: (Score:2)
      
      by ColdWetDog ( 752185 ) writes:
      
      Duplicate redundancy is even better.
- Re: (Score:2)
  
  by jellomizer ( 103300 ) writes:
  
  So if I made a Web Application on GIT Hub, and I attached my downloaded version of jquery to my project, would that count?
  These Libraries (espectially Javascript) are in source code form, and being so general purpose, they probably take up just as much space if not more then the app that uses it.
Git submodules = hard (Score:4, Interesting)

by brian.stinar ( 1104135 ) writes: on Thursday November 23, 2017 @08:09PM (#55613149) Homepage

Yeah, it can be rough to learn how to use Git submodules...
Honestly though, the few times I've directly integrated with someone else's code, it hasn't exactly been library-ready. There was a lot of massaging that had to be done the last time I did this, so a straight up duplication of their stuff was actually not a bad idea (AFTER I submitted them a PR to try and help manage this.) Their application wasn't designed as a library though, so I'm not sure what the right thing to do when you library-ify someone's code actually should be.

- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  Forks. That's the major reason for all the duplicate code. Actually, that's rather how git is supposed to work. The fact that it's only on the order of 19% unique files is surprising more that the number of unique files are so high. The other surprising part is just how badly we are when it comes to code that we're still working things as above. I can't count the number of times I've seen programs and realized I want to make a trivial change and how it's simply not possible without grabbing a bunch of
- Re: (Score:2)
  
  by serviscope_minor ( 664417 ) writes:
  
  Yeah, it can be rough to learn how to use Git submodules...
  Or, maybe they're using subtrees :)
  Honestly though, the few times I've directly integrated with someone else's code, it hasn't exactly been library-ready. There was a lot of massaging that had to be done the last time I did this, so a straight up duplication of their stuff was actually not a bad idea (AFTER I submitted them a PR to try and help manage this.) Their application wasn't designed as a library though, so I'm not sure what the right thing
More Than Half of GitHub Is Duplicate Code, Resear (Score:5, Funny)

by Baron_Yam ( 643147 ) writes: on Thursday November 23, 2017 @08:17PM (#55613167)

Richard Chirgwin, writing for The Register:
Given that code sharing is a big part of the GitHub mission, it should come at no surprise that the platform stores a lot of duplicated code: 70 per cent, a study has found. An international team of eight researchers didn't set out to measure GitHub duplication. Their original aim was to try and define the "granularity" of copying -- that is, how much files changed between different clones -- but along the way, they turned up a "staggering rate of file-level duplication" that made them change direction. Presented at this year's OOPSLA (part of the late-October Association of Computing Machinery) SPLASH conference in Vancouver, the University of California at Irvine-led research found that out of 428 million files on GitHub, only 85 million are unique. Before readers say "so what?", the reason for this study was to improve other researchers' work. Anybody studying software using GitHub probably seeks random samples, and the authors of this study argued duplication needs to be taken into account.

- Re: (Score:2)
  
  by Aighearach ( 97333 ) writes:
  
  That's the hilarious part; duplicating code is also most of the purpose of github!!
  Wetness detected in local river!
  - Re: (Score:2)
    
    by Half-pint HAL ( 718102 ) writes:
    
    That's the hilarious part; duplicating code is also most of the purpose of github!!
    Wetness detected in local river!
    How about reading the point made in TFS?
    The researchers did this study because Github is used as a source of data for identifying trends in computing. As they say, this duplication of code skews the results, and anyone wanting to draw serious conclusions from this data needs to account for this.
    The important data isn't the headline, it's... well... the data. I'm hoping there will be less (virtual) printing of sensationalist "JavaScript is the best language in the world" headlines due to this prompting peop
    - Re: (Score:2)
      
      by Aighearach ( 97333 ) writes:
      
      Thanks for pointing that out, I had no idea that the word "wet" fails to describe the local river with the maximum known precision! Golly.
      - Re: (Score:2)
        
        by Half-pint HAL ( 718102 ) writes:
        
        Óbh-óbh.
        Look, this is more like pointing out that you're measuring the total length of the world's rivers wrong when you measure the source of the Rio Negro and the Rio Amazon from source to sea, because for a fair portion of that length, the Rio Negro is the Amazon. If hydrological researchers were making such a fundamental error, someone would have to point it out.
        But code researchers were making a completely analogous error, and it needed quantified. And now it is.
        
        Re: (Score:2)
        
        by Aighearach ( 97333 ) writes:
        
        It is kind of like that, except in your example there is one mistake that goes away when you apply the fix, and in the story, it is still really fuzzy and the remaining code might even still be mostly copied.
        So it is like if you didn't have maps of the rivers, and didn't know which ones overlap, and so the data is complete crap, and then you find a fragmented map and now you know where some parts of a few of the rivers are. It is progress towards a good goal, but the data is still crap so far.
- Re: (Score:2)
  
  by Aighearach ( 97333 ) writes:
  
  If a project has very few dependencies, they might be following best practices to include the exact libraries they are using with the distribution. It isn't a one-size-fits-all situation.
  - Been there, done that (Score:2)
    
    by dallaylaen ( 756739 ) writes:
    
    In an open-source project aimed at in-house usage - I don't want my "customer" to suffer denial of service just because a 3rd party neither of us controls (or the internet provider) went down.
    I wonder what the proper procedure could be? Put it under /3rd-party? Add as a build-time dependency? Something else?..
    - Re: (Score:2)
      
      by Aighearach ( 97333 ) writes:
      
      I like the Ruby approach with Bundler. You can point to a remote repository, and then have it basically locked to a version and cached into the project. You get a lot of choices in how to manage it, and once it is set up it is easy for the developers working with it.
      In C if you put in a directory and integrate it into make then you're doing good! Never ever ever believe it is "good enough" to just document how to build it. If it ships with the project, it has to build with the project.
Downplay much (Score:2)

by lucm ( 889690 ) writes:

70% is a lot more than half. In this case the difference between half and 70% is a casual 129,000,000 duplicated files.
Kudos for not going in mega-clickbait mode, but still, "nearly 3/4 or more than 2/3" would be a better title.
- Re: (Score:2)
  
  by zifn4b ( 1040588 ) writes:
  
  70% is a lot more than half. In this case the difference between half and 70% is a casual 129,000,000 duplicated files.
  Kudos for not going in mega-clickbait mode, but still, "nearly 3/4 or more than 2/3" would be a better title.
  The files aren't duplicated with modern clustered file storage technology. They're only logically duplicated. That's why I don't see why this topic is of interest.
- Re: (Score:2)
  
  by lucm ( 889690 ) writes:
  
  There are many more crappy "me too"
  ^ trigger warning
How could more than half be duplicate? (Score:2)

by tie_guy_matt ( 176397 ) writes:

If half of the code is duplicate does that mean it is just a duplicate of the other half? If so then how would you know what the duplicate is and what the original is? Unless you count the duplicate code in with the original code in which case only one quarter of the code is a duplicate of the other quarter. Or maybe in my post thanksgiving carb haze I am over thinking this?
- Re: How could more than half be duplicate? (Score:4, Informative)
  
  by joelsherrill ( 132624 ) writes: on Thursday November 23, 2017 @09:06PM (#55613315) Homepage
  
  Even then, the original code may not be on GitHub. Peojexts like GCC, RTEMS and FreeBSD have the original code somewhere other than GitHub. So all of the code there for these and other projects is not original.
  
- Re: (Score:2)
  
  by Zero__Kelvin ( 151819 ) writes:
  
  " If so then how would you know what the duplicate is and what the original is?"
  Somebody should invent time and some way of recording it when a file is checked in, along with who is doing the commit! Either you have never used git or spent no time thinking before you posted.
why dont that make one common pool (Score:3)

by FudRucker ( 866063 ) writes: on Thursday November 23, 2017 @08:45PM (#55613257)

put all the code in there and link it to the associated github accounts, providing the code is 100% identical it should work, but they must consider forks and even one line of code in one file will make a lot of difference in the compiled software

- Re: (Score:3)
  
  by KiloByte ( 825081 ) writes:
  
  This could be a lot easier if you had content-addressable storage that refers to objects by their SHA1 hash.
  - Re: (Score:2)
    
    by angel'o'sphere ( 80593 ) writes:
    
    You mean like a git repository?
    - Re: why dont that make one common pool (Score:1)
      
      by zaphirplane ( 1457931 ) writes:
      
      Or the woo sh cvs
  - - Code isn't random (Score:2)
      
      by DrYak ( 748999 ) writes:
      
      No matter what Perl looks to you (even if it is valid code written by your cat walking across the keyboard), not every random jumble of noise is valid code.
      Yes, it is entirely possible that two files of size > sizeof(SHA1) (= 128 bits) will have the same hash.
      But on the other hand, it's very likely that none of them is valid code, but gibberish.
      Once you intersect both requirements (must share a hash and must be legit code) suddenly the probability drops a lot (because "must be code" is a very stringent
    - Re: (Score:2)
      
      by Zero__Kelvin ( 151819 ) writes:
      
      Files also have metadata. You calculate the hash and compare the metadata including size and creation date. If they are identical so are the files for all practical purposes.
      - Re: (Score:2)
        
        by KiloByte ( 825081 ) writes:
        
        Why would you care about creation date? Git doesn't preserve it, and even in tools that do, timestamps will usually be mangled when someone copies the code around (even cp doesn't default to -p). Thus, those 70% of copied files from the article would have almost all dates different even when byte-to-byte identical.
        A non-broken hash is enough. SHA1 is somewhat broken: known collisions exist, but those are not very interesting, and git has a modified version of SHA1 that detects the attack and uses an inco
        
        Re: (Score:2)
        
        by Zero__Kelvin ( 151819 ) writes:
        
        I agree, and certainly trust git to do it right. I was just attempting to add some extra extra assurance for the GP, that admittedly is unnecessary and as you say, upon further reflection it was a bad idea.
- Re: libraries (Score:1)
  
  by zaphirplane ( 1457931 ) writes:
  
  That is not how it should be done in Java (maven or gradle) Python pip in c/cpp you usually just expect the dependency tone handled by a human. I only know of Golang where people check in the vendored dep
  I assume you mean JS, thou Iâ(TM)m not sure why with npm exsisting
  - Re: (Score:2)
    
    by pjt33 ( 739471 ) writes:
    
    in c/cpp you usually just expect the dependency tone handled by a human
    Which is the biggest reason that I hate compiling other people's C. In the absence of a standard way to document dependencies, too many projects simply don't.
Excluding forks? (Score:3)

by Zaiff Urgulbunger ( 591514 ) writes: on Thursday November 23, 2017 @09:05PM (#55613309)

Do they mean (obv. I didn't read TFA) code is duplicated in non-forked code, or are they just observing that lots of projects will be forked by other users in order that they can play with it and post their pull requests to them?

'cos if it's the latter, then that's kind of obvious isn't it?

- Re: (Score:3)
  
  by Aighearach ( 97333 ) writes:
  
  They're saying, if you do research on software using github for your data, you have to take file duplication into account in your formulas.
  The problem, IMO, is that a lot of the rest is duplicated from somewhere else, but only one time on github, so the data is still polluted by duplication.
  - - Re: (Score:2)
      
      by Aighearach ( 97333 ) writes:
      
      No, you're just not understanding my words. I already addressed your point. So maybe try saying it again but leave out the "except that" that purports to talk about something I didn't address.
      I'm saying, the results would still be problematic. You're saying, "Good enough for me!" except phrasing it as if it is external to you.
- Re: (Score:2)
  
  by multi io ( 640409 ) writes:
  
  Do they mean (obv. I didn't read TFA) code is duplicated in non-forked code
  Yes they do mean that. The summary should've mentioned this. From https://dl.acm.org/citation.cf... [acm.org]:
  (abstract) [...] This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. [...]
Avoiding dependency hell (Score:2, Interesting)

by Anonymous Coward writes:

I wonder how much is just people trying to avoid dependency hell?
Because let's face it, when I just want "that one bit" of some gargantuan framework / solve-all / codeball-from-hell then I'd rather spend five minutes of disentangling and integrating than a lifetime playing in "follow the library".
Pull requests (Score:5, Informative)

by manu0601 ( 2221348 ) writes: on Thursday November 23, 2017 @09:36PM (#55613379)

No surprise here, this is how this stupid thing works: in order to submit a one-line bugfix, one have to fork the repository, patch, commit, pull request.

- Re: (Score:2)
  
  by Tony Isaac ( 1301187 ) writes:
  
  It's true that git stores snapshots.
  However, if you make a one-line change, it's not going to store new copies of every file in the repository. It only stores a new and old copy of the one file that changed.
  https://git-scm.com/book/en/v2... [git-scm.com]
  So yes, there is some duplication, but not the entire repository for each change.
  - Re: (Score:1)
    
    by Tenebrousedge ( 1226584 ) writes:
    
    No shit, Sherlock. If you thought anyone here needs to be told that then I hope that you were drunk instead of assuming that everyone else here is at your intellectual level.
- Re: (Score:2)
  
  by serviscope_minor ( 664417 ) writes:
  
  No surprise here, this is how this stupid thing works: in order to submit a one-line bugfix, one have to fork the repository, patch, commit, pull request.
  You don't have to fork it on github unless you want to use github's internal mechanisms. You can submit patches using any of the other mechanisms too, like a PR to an external repo, or a git-send email and so on and so forth.
  It is however rather convenient.
  - Re: (Score:3)
    
    by manu0601 ( 2221348 ) writes:
    
    You don't have to fork it on github unless you want to use github's internal mechanisms. You can submit patches using any of the other mechanisms too, like a PR to an external repo, or a git-send email and so on and so forth.
    I must be unlucky, but every time I did that, I was answered to send a pull request.
Or (Score:2)

by no-body ( 127863 ) writes:

reused/recycled code. One would be stupid to event/develop everything from the very beginning yet again...
- haven't looked at the study though, no time..
Makes sense (Score:3)

by barbariccow ( 1476631 ) writes: on Friday November 24, 2017 @12:32AM (#55613767)

Makes sense... it's called a fork. Several of my projects are forked more times than they contain files..

- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - So your PR is accessible, avoid tying to one drive (Score:2)
    
    by raymorris ( 2726007 ) writes:
    
    > You can clone/download all what you wish and enjoy it on your own machine, but why having publicly accessible codes which have been basically developed by other people
    There are a couple major reasons to make your version of the project accessible on the internet. Maybe the most important is so that other people can see your pull requests. As an example, I used to do a lot of work on some software called Moodle, which is used by many schools. Moodle has a mature development process, so any changes to
    - Re: (Score:2)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
      - Moodle is 5,000-10,000 files. Kernel is 24,000 (Score:2)
        
        by raymorris ( 2726007 ) writes:
        
        > corresponding file stops being identical
        Yep, the two or three or four files I change are no longer identical. The other 4,997 files in the project haven't changed, they are identical in both versions (forks). GitHub, presents my version of the *project*. It doesn't only show the differences and force users to download from someone else's fork, then apply my changes. They can just download my version of the project. (GitHub can also show the differences, if that's what someone wants to see.)
        That doe
        
        Re: (Score:2)
        
        by account_deleted ( 4530225 ) writes:
        
        Comment removed based on user account deletion
      - Re: (Score:2)
        
        by barbariccow ( 1476631 ) writes:
        
        If you're asking WHY do folks fork and NOT modify, it's to "lock" a version, and to be able to build in an automated way. Granted, git supports this via checking out a specific commit, but for some reason a LOT of folks find it better to fork it, and then clone off that fork. The only advantage I can think of is it protects you from the original deleting the project altogether.
        So imagine if you're developing a commercial software that uses LibraryA. You write it to how LibraryA looked when you pulled it and
        
        Re: (Score:2)
        
        by account_deleted ( 4530225 ) writes:
        
        Comment removed based on user account deletion
        
        Re: (Score:2)
        
        by barbariccow ( 1476631 ) writes:
        
        Lots of projects use "generic build tools" and unfortunately, this may be the easiest and safest way to integrate AND get the project that's under-budgeted by months out the door.
        
        Re: (Score:2)
        
        by account_deleted ( 4530225 ) writes:
        
        Comment removed based on user account deletion
Comment removed (Score:3)

by account_deleted ( 4530225 ) writes: on Friday November 24, 2017 @04:32AM (#55614283)

Comment removed based on user account deletion

- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re: (Score:2)
    
    by account_deleted ( 4530225 ) writes:
    
    Comment removed based on user account deletion
Keeping the Stats Up (Score:2)

by coofercat ( 719737 ) writes:

I'm doing my bit to keep the stats up though, There are no 'duplicates' of any of my code ;-)
Identical != Duplicate (Score:2)

by engineerErrant ( 759650 ) writes:

That's like calling identical twins "duplicate twins" and saying we should drop half of them in any study of population genetics.
If two code files are the same, that's not just noise - a person made that happen for some purpose. It makes no difference whether you find that "bad" or "sloppy" - it's a legitimate part of the in-use population.
Now, that doesn't mean some studies shouldn't still drop them - for example, if I'm studying the *writing* of code, I might want a sample of unique stretches of code that
Most used programming languages (Score:1)

by hvidstue ( 1260682 ) writes:

Also this could affect the surveys of what programming languages are most used.
At worst the current surveys only shows in which language programmers do most copy-paste code.
not surprising (Score:2)

by bobmajdakjr ( 2484288 ) writes:

a lot of bots and stupid people use the fork button to bookmark or make themselves look legit and most forks go nowhere. so yeah.
- Re: no surprise (Score:2)
  
  by hackwrench ( 573697 ) writes:
  
  I don't understand how you can come to that conclusion. Forking under your own account is the most natural way of interacting with the code base.
  - Re: no surprise (Score:5, Informative)
    
    by MightyYar ( 622222 ) writes: on Thursday November 23, 2017 @10:44PM (#55613537)
    
    And the only way to push a change back to a repository you don't control! You fork, push your change to your fork, then create a pull request. This is by design - I have no idea why this is in any way a surprise.
    
    - - Re: (Score:2)
        
        by MightyYar ( 622222 ) writes:
        
        Where does it say that? I just re-read it and I'm pretty sure you made that up. There is a mention of copy and paste also contributing at the bottom, but that's it.
        
        Re: (Score:1)
        
        by Anonymous Coward writes:
        
        I helped review the presentation and personally know the people that put together the paper, I can confirm they excluded forks
        
        Re: (Score:2)
        
        by MightyYar ( 622222 ) writes:
        
        Thanks for pointing that out - I followed the link to the abstract and then downloaded the paper, and you are correct. The Register article is misleading... lesson learned.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Dupes? (Score:3, Funny)

Re: (Score:2)

Re: (Score:3)

Re: (Score:2)

Re:The Facebook of code (Score:4, Funny)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Git submodules = hard (Score:4, Interesting)

Re: (Score:1)

Re: (Score:2)

More Than Half of GitHub Is Duplicate Code, Resear (Score:5, Funny)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Been there, done that (Score:2)

Re: (Score:2)

Downplay much (Score:2)

Re: (Score:2)

Re: (Score:2)

How could more than half be duplicate? (Score:2)

Re: How could more than half be duplicate? (Score:4, Informative)

Re: (Score:2)

why dont that make one common pool (Score:3)

Re: (Score:3)

Re: (Score:2)

Re: why dont that make one common pool (Score:1)

Code isn't random (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: libraries (Score:1)

Re: (Score:2)

Excluding forks? (Score:3)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Avoiding dependency hell (Score:2, Interesting)

Pull requests (Score:5, Informative)

Re: (Score:2)

Re: (Score:1)

Re: (Score:2)

Re: (Score:3)

Or (Score:2)

Makes sense (Score:3)

Re: (Score:2)

So your PR is accessible, avoid tying to one drive (Score:2)

Re: (Score:2)

Moodle is 5,000-10,000 files. Kernel is 24,000 (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Comment removed (Score:3)

Re: (Score:2)

Re: (Score:2)

Keeping the Stats Up (Score:2)

Identical != Duplicate (Score:2)

Most used programming languages (Score:1)

not surprising (Score:2)

Re: no surprise (Score:2)

Re: no surprise (Score:5, Informative)

Re: (Score:2)

Re: (Score:1)

Re: (Score:2)

Related Links Top of the: day, week, month.

Slashdot Top Deals