Microsoft

Microsoft Introduces GVFS (Git Virtual File System) (microsoft.com) 213

Saeed Noursalehi, principal program manager at Microsoft, writes in a blog post: We've been working hard on a solution that allows the Git client to scale to repos of any size. Today, we're introducing GVFS (Git Virtual File System), which virtualizes the file system beneath your repo and makes it appear as though all the files in your repo are present, but in reality only downloads a file the first time it is opened. GVFS also actively manages how much of the repo Git has to consider in operations like checkout and status, since any file that has not been hydrated can be safely ignored. And because we do this all at the file system level, your IDEs and build tools don't need to change at all! In a repo that is this large, no developer builds the entire source tree. Instead, they typically download the build outputs from the most recent official build, and only build a small portion of the sources related to the area they are modifying. Therefore, even though there are over 3 million files in the repo, a typical developer will only need to download and use about 50-100K of those files. With GVFS, this means that they now have a Git experience that is much more manageable: clone now takes a few minutes instead of 12+ hours, checkout takes 30 seconds instead of 2-3 hours, and status takes 4-5 seconds instead of 10 minutes. And we're working on making those numbers even better.
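The "download on first open" idea in the summary can be illustrated with a toy model. This is only a sketch under stated assumptions, not how GVFS is actually implemented: `LazyWorkTree`, `fake_fetch`, and `paths_to_scan` are hypothetical names, and the real system works at the file system level rather than in application code.

```python
from typing import Callable, Dict, List


class LazyWorkTree:
    """Toy model of a virtualized working tree (hypothetical, not GVFS itself)."""

    def __init__(self, listing: List[str], fetch: Callable[[str], bytes]) -> None:
        self._listing = list(listing)          # every path appears to be present
        self._fetch = fetch                    # hypothetical downloader callback
        self._hydrated: Dict[str, bytes] = {}  # contents fetched so far

    def listdir(self) -> List[str]:
        # Listing never triggers a download.
        return sorted(self._listing)

    def open(self, path: str) -> bytes:
        if path not in self._listing:
            raise FileNotFoundError(path)
        if path not in self._hydrated:
            # First open: download ("hydrate") the file and remember it.
            self._hydrated[path] = self._fetch(path)
        return self._hydrated[path]

    def paths_to_scan(self) -> List[str]:
        # A status-like operation only needs to look at hydrated files.
        return sorted(self._hydrated)


downloads: List[str] = []


def fake_fetch(path: str) -> bytes:
    # Stand-in for a network request; records what was downloaded.
    downloads.append(path)
    return f"contents of {path}".encode()


tree = LazyWorkTree(["a.c", "b.c", "c.c"], fake_fetch)
tree.open("b.c")
tree.open("b.c")  # second open is served from the local copy
```

In this model all three files are visible in the listing, but only `b.c` is ever downloaded, and a status-like scan only has to consider that one file.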
This discussion has been archived. No new comments can be posted.

  • Meh... (Score:4, Insightful)

    by the_skywise ( 189793 ) on Friday February 03, 2017 @09:47AM (#53795025)

    There aren't THAT many repos with over 3 million files in them.

    The great majority of projects I've been on have been around the 100k-300k range and doing a build (to properly test the product) required ALL of them.

    And even then, once you've got all of them the first time, GIT does the diffing automatically so it "scales" already.

    Maybe MS could put some of their vast R&D efforts into something more useful... like having their free Visual Studio Code editor handle files bigger than 1 GB?

    • by AmiMoJo ( 196126 )

      If your repo has 3 million files in it, you have bigger problems. Solving those seems better than trying to mitigate them.

      • Re: (Score:3, Informative)

        And if you have a million [acm.org]?

        • Re: (Score:2, Funny)

          million

          Billion, you fucking moron. lurn 2 rite.

          • by caseih ( 160668 )

            The link is apparently slashdotted so I can't view it, but I think you misread it. The ACM link apparently says there is a billion *lines of code* not a billion files in one repo. Big difference! The OP would appear to be right.

            • by caseih ( 160668 )

              Hmm. It appears the ACM cannot write headlines. The article finally loaded for me and it seems the headline is plain wrong, at least if the article is correct. It does say a billion files, and nowhere talks about lines of code. Sigh.

              • by cdrudge ( 68377 )

                The ACM article headline is correct. The post that mentions billions is correct. You just missed it in the article.

                Fourth paragraph (emphasis added):

                The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google's entire 18-year existence. The repository contains 86TB of data, including approximately two billion lines of code in nine million unique source files.

          • by unrtst ( 777550 )

            Don't be an ass.
            They were referring to file count, not lines of code.

            The repository contains 86TB of data, including approximately two billion lines of code in nine million unique source files.

          • by SirSlud ( 67381 )

            Oh noes! It turns out the moron is you!

            • Oh noes! It turns out the moron is you!

              Yep. I'm not generally that rude to other people. I had various attacks of stupidity and brain malfunction today.

      • I meant a billion in my other post.

    • Re:Meh... (Score:5, Interesting)

      by Transcendent ( 204992 ) on Friday February 03, 2017 @10:14AM (#53795191)

      Microsoft's repos *are* that large. That's why they implemented this.

      Microsoft Office's repository is over 1 TB in size. Yes, terabyte. For *office*. They absolutely cannot (could not, I suppose now) use Git on it.

      • by mwvdlee ( 775178 )

        Why are they that large in the first place?
        Do they also store all design files and compiler-generated files in the repo?

        • They likely store their comments as separate files - one per comment.

          (no, really... has no one in Redmond ever heard of making their shit modular?)

      • Re: (Score:2, Funny)

        by djbckr ( 673156 )
        I would propose that if a repo is that large, it should probably be broken into several smaller projects. Then you build what you need.
        • Re:Meh... (Score:5, Funny)

          by Anonymous Coward on Friday February 03, 2017 @11:07AM (#53795603)

          all right, you've clearly nominated yourself to untangling a 1TB repository. get on it bud.

          • In all seriousness, maybe they *should* get a team together and 'rip the bandage off' now, before another decade elapses and the thing gets even hairier...

        • by tepples ( 727027 )

          But if multiple applications in Office share a library, where do you put that library so that the build process for each Office application can see it? Are submodules or subtrees a good choice, and if "yes," which is more appropriate?

          • But if multiple applications in Office share a library, where do you put that library so that the build process for each Office application can see it? Are submodules or subtrees a good choice, and if "yes," which is more appropriate?

            You make that library a specific project, releasable on its own schedule, with a known distribution system that everyone can access for headers and binaries, and everyone uses releases of that project.

            I did that under SVN at a previous position. I had 1 large Qt-based project that generated about 30 static libraries, about 20 standard C/C++ static library projects, a common headers project for the standard C/C++ static libraries, and about 10-50 programs that used the libraries and headers. All-in-all, i

          • But if multiple applications in Office share a library, where do you put that library so that the build process for each Office application can see it? Are submodules or subtrees a good choice, and if "yes," which is more appropriate?

            Microsoft experimented with the submodules approach for Windows. Didn't work:

            "We started down at least 2 failed paths to scale Git. Probably the most extensive one was to use Git submodules to stitch together lots of repos into a single “super” repo. I won’t go into details but after 6 months of working on that we realized it wasn’t going to work – too many edge cases, too much complexity and fragility. We needed a bulletproof solution that would be well supported by almos

    • by tepples ( 727027 )

      As opposed to something like EFF's HTTPS Everywhere project, which stores its FAQ in its Git repository. If you want to suggest a change to the user manual, you have to fork the project on GitHub, clone your fork to your local PC, make changes, commit and push them to your fork, and then make a pull request on GitHub. Not having to spend bandwidth (and potentially pay overage fees) on cloning the whole thing to your local PC would make it easier to suggest changes.

  • by lucasnate1 ( 4682951 ) on Friday February 03, 2017 @09:48AM (#53795033) Homepage

    The whole point of git is that you have an identical copy on your machine. Why take away git's biggest advantage?

    • Microsoft are just getting efficient. They have simply skipped "Embrace".
      • Microsoft are just getting efficient. They have simply skipped "Embrace".

        No they didn't. For one thing, Git has been supported in TFS for four years now. And then there's this:

        "Among them, we learned the Git server has to be smart. It has to pack the Git files in an optimal fashion so that it doesn’t have to send more to the client than absolutely necessary – think of it as optimizing locality of reference. So we made lots of enhancements to the Team Services/TFS Git server. We also discovered that Git has lots of scenarios where it touches stuff it really does

    • by thegarbz ( 1787294 ) on Friday February 03, 2017 @10:20AM (#53795245)

      The whole point of git is that you have an identical copy on your machine. Why take away git's biggest advantage?

      Because its biggest advantage is also one of its greatest inefficiencies and frankly on a large project chances are you may not need it all. The whole point is you have an identical copy on your machine of what you're working on

      • The whole point of git is that you have an identical copy on your machine. Why take away git's biggest advantage?

        Because its biggest advantage is also one of its greatest inefficiencies and frankly on a large project chances are you may not need it all. The whole point is you have an identical copy on your machine of what you're working on

        So buy a bigger disk. They're cheap.

        Why did they do it? It's obvious: it's the bait on the hook to get you to break git and your open source projects (even CURRENT ones

        • by tepples ( 727027 )

          So buy a bigger disk. They're cheap.

          Not if you want both the speed of an SSD and enough capacity for your project in a laptop that's practical to carry.

          • It would be a good idea to look for a laptop with many storage slots, e.g. at least two M.2 slots, at least one of which accepts both M.2 PCIe and M.2 SATA; either a 2.5" bay or one more M.2; heck, UFS memory cards might be big as well a couple of years from now (perhaps M.2 to UFS adapters will be a thing)

        • by Kjella ( 173770 )

          Why did they do it? It's obvious: it's the bait on the hook to get you to break git and your open source projects (even CURRENT ones) that compete with them.

          Sounds like a non-starter for distributed development to me. I imagine this is to make git work differently in a corporate environment where for the average developer if the master repo/server goes down it's not your problem. And perhaps for infosec reasons on proprietary code, who made a complete copy of the source code. This seems more like Microsoft adapting to use open source tools instead of their own proprietary tools like TFS.

      • Because its biggest advantage is also one of its greatest inefficiencies and frankly on a large project chances are you may not need it all

        In this case call it something different. Git is known for that.

    • by AuMatar ( 183847 )

      When I use svn I have a copy of my branch on my local machine. I may not have every other branch or every part of the repo, but I have what I'm working on. I'm not sure what this is for other than companies that can't find a way to partition their version control between products.

      • Make a shallow clone, then. It will have everything you need to hack on the current code and to push it back.

        Not having the history breaks any advanced git workflow, though. The reason git won over svn and such is bisect, rebases and so on; svn is hardly better than a stack of daily tarballs.
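A shallow clone as suggested above can be scripted. The following is a minimal sketch that only builds and runs the standard `git clone --depth` command; the function names and the example URL are placeholders, not anything from the thread.

```python
import subprocess
from typing import List


def shallow_clone_cmd(url: str, dest: str, depth: int = 1) -> List[str]:
    """Build a `git clone` command that fetches only the newest `depth` commits."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return ["git", "clone", f"--depth={depth}", url, dest]


def shallow_clone(url: str, dest: str, depth: int = 1) -> None:
    """Run the shallow clone; raises CalledProcessError if git fails."""
    subprocess.run(shallow_clone_cmd(url, dest, depth), check=True)


# Placeholder URL and directory, shown only to illustrate the command shape.
cmd = shallow_clone_cmd("https://example.com/repo.git", "repo")
```

As the comment notes, the trade-off is that history-based workflows such as bisect only see the commits that were fetched.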

        • The reason git won over svn and such is bisect, rebases and so on; svn is hardly better than a stack of daily tarballs.

          Git and SVN are both excellent tools. Git won for FOSS because it is distributed; but if you need centralized control then SVN is hands down the best tool out there. Most FOSS projects need a DVCS; many companies do too; yet there are still many cases where a centralized VCS is the best choice - e.g. I wouldn't want to try to have to deal with export control of code with Git;

    • by Turmio ( 29215 )
      Go read the damn blog post and then ask yourself again if your concern is still valid in this case.
    • If you want an identical copy, just mirror the GVFS path to a non-GVFS path, and there's your local copy.

    • Re: (Score:3, Insightful)

      by Anonymous Coward

      The whole point of git is that you have an identical copy on your machine. Why take away git's biggest advantage?

      "A Clone now takes only minutes instead of 12+ Hours!"
      Ja, that's because you're NOT making a copy.

    • Re: (Score:3, Insightful)

      by Anonymous Coward

      No, the whole point of git is that every file version is immutable and referenced by a globally unique hash. This means that it doesn't matter where the actual data is located - until you need the actual data for some actual reason. This model has been copied by countless systems since git, because it is extremely robust and has multiple benefits, and none of those other systems expect the local user to download the entire database before he even begins work. Nonetheless, such systems can also support do
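The content-addressing point above can be made concrete. As a sketch, assuming Git's classic SHA-1 object format, a file version (a "blob") is named by hashing a short header plus its bytes; `git_blob_id` is a hypothetical helper name.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    # Git hashes the header "blob <size>\0" followed by the raw file bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Identical content always yields the same id, wherever it is stored.
same_a = git_blob_id(b"int main(void) { return 0; }\n")
same_b = git_blob_id(b"int main(void) { return 0; }\n")
different = git_blob_id(b"int main(void) { return 1; }\n")
```

Because the id depends only on the content, a client can refer to an object by id without having its bytes locally, which is the property the comment relies on.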

      • Forcing every developer in the same office to separately download a complete copy of the full history is inefficient. But then git does have a way to reference objects files from another path.

        For large (but probably not Windows large) git repos, you could add a "git alternate" reference to a network share for your ancient history. So long as you are careful in how you manage that folder, and never remove anything from it, this can work quite well.

        Giving each team a low latency, local mirror of this folder
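The "git alternate" mechanism mentioned above is a plain text file, `objects/info/alternates`, listing extra object directories to search. A minimal sketch of registering one follows; `add_alternate` is a hypothetical helper, and it assumes a non-bare repository with a `.git` directory.

```python
from pathlib import Path


def add_alternate(repo: Path, shared_objects: Path) -> Path:
    """Register `shared_objects` as an extra object directory for `repo`."""
    alternates = repo / ".git" / "objects" / "info" / "alternates"
    alternates.parent.mkdir(parents=True, exist_ok=True)
    existing = alternates.read_text().splitlines() if alternates.exists() else []
    entry = str(shared_objects)
    if entry not in existing:
        # One object-directory path per line; avoid duplicate entries.
        existing.append(entry)
        alternates.write_text("\n".join(existing) + "\n")
    return alternates
```

As the comment warns, objects must never be removed from the shared directory, since repositories that borrow from it would then be missing data.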

    • by tangent ( 3677 ) on Friday February 03, 2017 @10:54AM (#53795505) Homepage

      > Why take away git's biggest advantage?

      Because "clone now takes a few minutes instead of 12+ hours, checkout takes 30 seconds instead of 2-3 hours, and status takes 4-5 seconds instead of 10 minutes."

      That problem is not unique to Git. Jörg Sonnenberger [sonnenberger.org] tried importing the NetBSD repository into Fossil [sonnenberger.org], and "the rebuild step which (re)creates the internal meta data cache took 10h on a fast machine." There are ways to make Fossil skip the rebuild on clone, which results in a suboptimal DB, but it still takes hours to clone. NetBSD's project history goes back something like a quarter century; it's going to take time to pull and organize all that.

      DVCSes are great when you can afford their associated costs (namely, the very advantages you refer to), but for very large repos, those costs can be very high.

      Do you really need every single version going back a quarter century? And if you do, do you need it 5 minutes after the initial clone?

      One idea that's come up on the Fossil mailing list is to do a shallow clone initially, then trickle the back history in over time. I'd like a DVCS that gave me the past 30 days of history at the tip of every open branch, then over the next day or so back-filled the rest.
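With Git, the "trickle the back history in over time" idea can be approximated using `git fetch --deepen` and `git fetch --unshallow`. The sketch below only builds the command sequence; `backfill_plan`, the step size, and the number of rounds are assumptions, and scheduling the commands over a day is left out.

```python
from typing import List


def backfill_plan(step: int, rounds: int) -> List[List[str]]:
    """Commands that deepen a shallow clone in stages, then fetch the rest."""
    if step < 1 or rounds < 0:
        raise ValueError("step must be >= 1 and rounds must be >= 0")
    # Each round extends the shallow boundary by `step` more commits.
    plan = [["git", "fetch", f"--deepen={step}"] for _ in range(rounds)]
    # Final step: fetch whatever history remains.
    # Note: git reports an error for --unshallow if the repository is
    # already complete, so a real script would need to check for that first.
    plan.append(["git", "fetch", "--unshallow"])
    return plan


plan = backfill_plan(step=1000, rounds=2)
```

This deepens by commit count rather than by the 30-day window described above; a date-based starting point would need a different initial clone.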

    • The whole point of git is that you have an identical copy on your machine. Why take away git's biggest advantage?

      The issue is that it doesn't fit well with how VS works, which was based on how Visual SourceSafe (VSS - Microsoft's version of CVS) worked: it locked each file as it pulled that file from the repository when you opened it.

      Honestly, that's really the only reason I can see for why MS would want this. It makes it fit back into that old, broken model of locking files and tracking changes. Perhaps it has some benefit for how they track who did what/when, but it's not really something that is broken in git

  • by DrXym ( 126579 ) on Friday February 03, 2017 @09:49AM (#53795041)
    I had to use Clearcase as my source control system for one company I worked for. The idea was you set up a view spec (a bit like a branch), mapped a drive letter to it and you never had to pull again because it would always reflect that branch. Your local changes went over the top and when it was time to commit you could merge up and commit. In practice what it meant was the source code was constantly changing under your feet, and binaries were constantly stale or in a mystery state because you didn't know what they were compiled against. And because this was IBM software it was unusably slow across WANs, memory hungry and enjoyed triggering random blue screens.

    While a vfs sounds like a great idea, I think in theory it's only of use for very, very large repos. Even then I wonder if the exact same issues that made Clearcase suck would make it suck even with Git.

    • Re:Ah nostalgia (Score:5, Informative)

      by Anonymous Coward on Friday February 03, 2017 @10:16AM (#53795203)

      In practice what it meant was the source code was constantly changing under your feet, and binaries were constantly stale or in a mystery state because you didn't know what they were compiled against.

      Then you had a piss-poor release engineer who didn't understand how to construct config specs based on a stable baseline, label & promote stable builds regularly, and use clearmake properly, or manage dependencies and allow you to do a clean, fast local build.

      I love git, and I work with it daily, and the monorepo craze baffles the shit out of me, to be honest. But I used and supported ClearCase for 14 years at a large financial services company, and I can assure you that the problems you're complaining about are not limitations of the tool - they are limitations of your team's release engineers. ClearCase has many failings, but the issues you're describing simply reflect poor implementation and design choices.

      And because this was IBM software it was unusably slow across WANs, memory hungry and enjoyed triggering random blue screens.

      It stemmed from fundamental concepts cribbed from Apollo's DSEE environment. HP's acquisition of Apollo prompted what would then become the ClearCase team to leave Apollo/HP and form Pure, then they combined with Atria to form PureAtria, then Rational acquired PureAtria, and then IBM acquired Rational -- so ClearCase was a thing long before it was IBM software, and the features you're griping about were extant long before the IBM acquisition. The IBM era mostly saw them continue to focus on jamming ClearCase into their "Application Lifecycle Management" toolset, Rational Team Concert, wrapping everything in a ghastly blue Eclipse RCP client, and making it more of a pain in the ass to use.

      Dynamic views as you're talking about were not - and never were - intended for use across WANs, their Admin & Deploy guides specifically stated that it required a fast connection to a local server. If you wanted WAN connectivity, you either used RTC (Rational Team Client) to pull web views, or you used snapshot views, or you ponied up for MultiSite licenses and set up a sync scheme so that each site could have local copy on a VOB & View server they had a fast connection to.

      Again - poor implementation by your release team. It's like complaining that a hammer makes a giant hole in the drywall when you put screws in with it - it doesn't mean there's a problem with the hammer, it means there's a problem with the operator. If you use the tool in a way it's not intended to be used, then don't be surprised when it does a shitty job.

      • Re:Ah nostalgia (Score:5, Insightful)

        by AuMatar ( 183847 ) on Friday February 03, 2017 @10:24AM (#53795267)

        The fact that you needed a release team and release engineers to manage a ClearCase implementation is why it's considered one of the worst systems out there, remembered with hatred by almost everyone who used it. A version control system should be easily set up by one admin in an hour or two, and then usable without reams of documentation by any of the engineers. ClearCase failed that.

      • by DrXym ( 126579 )

        Then you had a piss-poor release engineer who didn't understand how to construct config specs based on a stable baseline, label & promote stable builds regularly, and use clearmake properly, or manage dependencies and allow you to do a clean, fast local build.

        Oh they had plenty of release engineers, and that sort of demonstrates what bullshit Clearcase was. It was so slow that every site needed its own set of engineers, own set of servers and own set of mirrors to replicate each repo. Something no sane source control system has ever required. Then they had to have scripts to periodically sync changes back and forth. Two teams at two sites had to sit around and wait for changes to appear, and of course view specs couldn't be shared, and occasionally syncs failed

      • by tepples ( 727027 )

        Let me take two guesses as to why you might see a monolithic repository:

        First, all applications with the potential to be shipped together may rely on common libraries, and the build process needs to know how to combine the libraries with the source code specific to each application. I'm under the impression that the logistics of this are similar when everything is in one repository.

        Second, paid hosts of private Git repositories used to bill users per repository, not (say) per gigabyte of storage or data tra

    • Re: (Score:3, Informative)

      by Anonymous Coward

      I had to use Clearcase as my source control system for one company I worked for. The idea was you set up a view spec (a bit like a branch), mapped a drive letter to it and you never had to pull again because it would always reflect that branch. Your local changes went over the top and when it was time to commit you could merge up and commit. In practice what it meant was the source code was constantly changing under your feet, and binaries were constantly stale or in a mystery state because you didn't know what they were compiled against. And because this was IBM software it was unusably slow across WANs, memory hungry and enjoyed triggering random blue screens.

      While a vfs sounds like a great idea, I think in theory it's only of use for very, very large repos. Even then I wonder if the exact same issues that made Clearcase suck would make it suck even with Git.

      To be fair to IBM, ClearCase had this behavior before the three mergers that made it part of IBM. (Pure + Atria -> PureAtria, PureAtria + Rational -> Rational, IBM + Rational -> IBM)

      I actually liked the concept of "wink-in" where derived objects that came from the same source objects and build environment could just be pulled from someone else's build instead of rebuilt. But the system as a whole required a zippy network.

      I don't hold out hope that a vfs on top of another scm solution would be eve
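The "wink-in" idea described above, reusing a derived object when the sources and build environment match, resembles a build cache keyed on its inputs. The sketch below swaps in content hashing as the matching rule, which is an assumption and not how ClearCase decides a match; `DerivedObjectCache` and `compile_fn` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, Iterable, List


class DerivedObjectCache:
    """Toy shared cache of build outputs keyed by sources and environment."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.builds = 0  # how many times we actually had to build

    @staticmethod
    def key(sources: Iterable[bytes], environment: str) -> str:
        digest = hashlib.sha256()
        # Hash the environment and each source separately so boundaries are unambiguous.
        digest.update(hashlib.sha256(environment.encode()).digest())
        for blob in sources:
            digest.update(hashlib.sha256(blob).digest())
        return digest.hexdigest()

    def get_or_build(
        self,
        sources: Iterable[bytes],
        environment: str,
        build: Callable[[List[bytes]], bytes],
    ) -> bytes:
        source_list = list(sources)
        k = self.key(source_list, environment)
        if k not in self._store:
            # No matching derived object yet: build it and share the result.
            self._store[k] = build(source_list)
            self.builds += 1
        return self._store[k]


cache = DerivedObjectCache()
compile_fn = lambda srcs: b"".join(srcs).upper()  # stand-in for a compiler
first = cache.get_or_build([b"int main(){}"], "cc-1.0 -O2", compile_fn)
second = cache.get_or_build([b"int main(){}"], "cc-1.0 -O2", compile_fn)
third = cache.get_or_build([b"int main(){}"], "cc-1.0 -O0", compile_fn)
```

Here the second request reuses the first build, while the third triggers a rebuild because the environment string differs.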

  • Ah, Microsoft (Score:2, Interesting)

    by Kierthos ( 225954 )

    "Hey, how can we do what GitHub does, only stupider?"

  • Just curious what the author of GIT has to say about this. He can point out the truth with absolute authority.

    (Reinvented a square wheel? Solved a non-problem? Cured a symptom?)

  • by Luthair ( 847766 ) on Friday February 03, 2017 @10:20AM (#53795239)
    If your developers aren't using all the files then you should probably split your repository.
    • by deKernel ( 65640 )

      Sometimes it just isn't all that simple. As an example, we have one product that comprises several Windows services as well as an ASP.NET front-end. Each of those services has a multitude of DLLs that are run-time configurable. As it is, we make an extended effort to share as much code as possible which would cause issues if we were to break up the repo into several smaller repos. So, if we had several smaller repos, and there is a fix/enhancement to one of the shared/reused components, then you are prone to

  • Lately they stole the name Neon from the KDE distribution, now they steal the name GVFS from GNOME. Who's next? Stealing something from the cinnamon desktop? Or maybe Some eXtended Filemanager for windows CE (XFCE)?

  • I really have to wonder what Microsoft is doing such that git status on a "normal" repository allegedly takes ten minutes (maybe NTFS just sucks, guys).

    But what's being unsaid throughout this is whether this works with a standard Git server, or whether it only works with a special Microsoft-kluged server. While the former is vaguely interesting, the latter merits only a derisive snort.

  • GVFS feels like a philosophical disconnect with GIT and a software tool created to work around a lack of software architecture. It probably would be a better idea to fix the software architecture problem.

    Polytron's tool chain supported partial local builds back in the 80's. We used Polymake and PVCS to build Comshare's EIS. If you changed just one C file, that was all that compiled on your system. Polymake basically had two paths it looked at for all dependencies and their lib command had a nice rep
  • If MS released the GVFS under an Open Source License, then MAYBE their recent posturing re Open Source and Linux has some sincerity to it.
    If they did not then it is probably more Embrace, Extend, Extinguish.
    softcodeer
