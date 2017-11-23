More Than Half of GitHub Is Duplicate Code, Researchers Find (theregister.co.uk) 25
Richard Chirgwin, writing for The Register: Given that code sharing is a big part of the GitHub mission, it should come as no surprise that the platform stores a lot of duplicated code: 70 per cent, a study has found. An international team of eight researchers didn't set out to measure GitHub duplication. Their original aim was to try and define the "granularity" of copying -- that is, how much files changed between different clones -- but along the way, they turned up a "staggering rate of file-level duplication" that made them change direction. Presented at this year's OOPSLA (part of the late-October Association for Computing Machinery SPLASH conference) in Vancouver, the University of California at Irvine-led research found that out of 428 million files on GitHub, only 85 million are unique. Before readers say "so what?", the reason for this study was to improve other researchers' work. Anybody studying software using GitHub probably seeks random samples, and the authors of this study argued duplication needs to be taken into account.
Dupes? (Score:2)
You you don't don't say say.
Re: (Score:2)
Can we get one for Slashdot too?
Re: (Score:2)
You're forking kidding me!
Re: no surprise (Score:1)
Git submodules = hard (Score:2)
Yeah, it can be rough to learn how to use Git submodules...
Honestly though, the few times I've directly integrated with someone else's code, it hasn't exactly been library-ready. There was a lot of massaging that had to be done the last time I did this, so a straight up duplication of their stuff was actually not a bad idea (AFTER I submitted them a PR to try and help manage this.) Their application wasn't designed as a library though, so I'm not sure what the right thing to do when you library-ify someone'
More Than Half of GitHub Is Duplicate Code, Resear (Score:2)
Re: (Score:2)
Downplay much (Score:2)
70% is a lot more than half. In this case the difference between half and 70% is a casual 129,000,000 duplicated files.
Kudos for not going in mega-clickbait mode, but still, "nearly 3/4 or more than 2/3" would be a better title.
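For anyone checking the arithmetic: the 129,000,000 figure seems to fall out of the file counts in the summary. A quick sketch, assuming "duplicated" means total files minus unique files (my reading, not something the summary spells out):

```python
# File counts as quoted in the summary.
total_files = 428_000_000
unique_files = 85_000_000

# Assumption: every file that is not unique counts as a duplicate.
duplicated = total_files - unique_files

# "Half of GitHub" in file terms.
half = total_files // 2

# Gap between the duplicated count and the "half" in the headline.
gap = duplicated - half
print(f"{gap:,}")
```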
How could more than half be duplicate? (Score:1)
If half of the code is duplicate, does that mean it is just a duplicate of the other half? If so, then how would you know which is the duplicate and which is the original? Unless you count the duplicate code in with the original code, in which case only one quarter of the code is a duplicate of the other quarter. Or maybe in my post-Thanksgiving carb haze I am overthinking this?
Re: How could more than half be duplicate? (Score:2)
Even then, the original code may not be on GitHub. Projects like GCC, RTEMS and FreeBSD have the original code somewhere other than GitHub. So all of the code there for these and other projects is not original.
Why don't they make one common pool (Score:2)
Re: (Score:2)
This could be a lot easier if you had content-addressable storage that refers to objects by their SHA1 hash.
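For anyone who hasn't seen the idea spelled out, here is a minimal sketch of a content-addressable store keyed by SHA-1. `ContentStore`, `put` and `get` are made-up names for illustration. Git's real object store is more involved: it hashes a small header together with the content, and compresses what it stores.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: each blob is keyed by its SHA-1 hex digest."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes, one entry per distinct content
        self.puts = 0    # how many times put() was called, duplicates included

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self.blobs.setdefault(key, data)
        self.puts += 1
        return key

    def get(self, key: str) -> bytes:
        return self.blobs[key]


store = ContentStore()
k1 = store.put(b"print('hello')\n")
k2 = store.put(b"print('hello')\n")  # same bytes as k1, e.g. a copied file
k3 = store.put(b"print('world')\n")

print(store.puts, "files submitted,", len(store.blobs), "stored")
```

Two identical files end up under one key, so the pool holds each distinct file exactly once no matter how many copies are submitted.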
Excluding forks? (Score:2)
'cos if it's the latter, then that's kind of obvious isn't it?