Programming

GitHub Claims Source Code Search Engine Is a Game Changer (theregister.com)

Thomas Claburn writes via The Register: GitHub has a lot of code to search -- more than 200 million repositories -- and says last November's beta version of a search engine optimized for source code has caused a "flurry of innovation." GitHub engineer Timothy Clem explained that the company has had problems getting existing technology to work well. "The truth is from Solr to Elasticsearch, we haven't had a lot of luck using general text search products to power code search," he said in a GitHub Universe video presentation. "The user experience is poor. It's very, very expensive to host and it's slow to index." In a blog post on Monday, Clem delved into the technology used to scour just a quarter of those repos, a code search engine built in Rust called Blackbird.

Blackbird currently provides access to almost 45 million GitHub repositories, which together amount to 115TB of code and 15.5 billion documents. Sifting through that many lines of code requires something stronger than grep, a common command line tool on Unix-like systems for searching through text data. Using ripgrep on an 8-core Intel CPU to run an exhaustive regular expression query on a 13GB file in memory, Clem explained, takes about 2.769 seconds, or 0.6GB/sec/core. [...] At 0.01 queries per second, grep was not an option. So GitHub front-loaded much of the work into precomputed search indices. These are essentially maps of key-value pairs. This approach makes it less computationally demanding to search for document characteristics like the programming language or word sequences by using a numeric key rather than a text string. Even so, these indices are too large to fit in memory, so GitHub built iterators for each index it needed to access. According to Clem, these lazily return sorted document IDs that represent the rank of the associated document and meet the query criteria.
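As a rough illustration of the index-plus-iterator idea described above, here is a minimal Python sketch. It is not Blackbird's implementation: the use of trigrams as index keys, the in-memory dictionaries, and the names `trigrams`, `build_index`, `intersect`, and `search` are all assumptions made for this example. It only shows how sorted posting lists can be intersected lazily so that matching document IDs come out in order, one at a time.

```python
from typing import Dict, Iterator, List


def trigrams(text: str) -> set:
    """Return the set of 3-character substrings of text (assumed key scheme)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each trigram to the sorted list of IDs of documents containing it."""
    index: Dict[str, List[int]] = {}
    for doc_id in sorted(docs):  # visiting IDs in order keeps postings sorted
        for gram in trigrams(docs[doc_id]):
            index.setdefault(gram, []).append(doc_id)
    return index


def intersect(a: Iterator[int], b: Iterator[int]) -> Iterator[int]:
    """Lazily yield the IDs that appear in both sorted iterators."""
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(a, None), next(b, None)
        elif x < y:
            x = next(a, None)
        else:
            y = next(b, None)


def search(index: Dict[str, List[int]], docs: Dict[int, str],
           query: str) -> Iterator[int]:
    """Yield IDs of documents containing query, in ascending ID order."""
    grams = sorted(trigrams(query))
    if not grams:
        return  # queries shorter than 3 characters are not handled here
    candidates: Iterator[int] = iter(index.get(grams[0], []))
    for gram in grams[1:]:
        candidates = intersect(candidates, iter(index.get(gram, [])))
    for doc_id in candidates:
        # Sharing every trigram does not guarantee a match, so verify.
        if query in docs[doc_id]:
            yield doc_id
```

In this sketch the document ID doubles as the sort key, which loosely mirrors the summary's remark that the IDs "represent the rank of the associated document"; a real system would stream the posting lists from disk rather than hold them in Python lists.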

To keep the search index manageable, GitHub relies on sharding (breaking the data up into multiple pieces using Git's content-addressable hashing scheme) and on delta encoding (storing data differences, or deltas, to reduce the data and metadata to be crawled). This works well because GitHub has a lot of redundant data (e.g. forks) -- its 115TB of data can be boiled down to 25TB through data-shaving techniques such as deduplication. The resulting system works much faster than grep -- 640 queries per second compared to 0.01 queries per second. And indexing occurs at a rate of about 120,000 documents per second, so processing 15.5 billion documents takes about 36 hours, or 18 for re-indexing since delta (change) indexing reduces the number of documents to be crawled.
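The sharding and indexing-time figures above can be sketched in a few lines of Python. The blob ID computation follows Git's standard object format (SHA-1 over a `blob <size>\0` header plus the content). Everything else is an assumption for illustration only: the summary does not say how Blackbird maps hashes to shards, so the modulo assignment, the shard count, and the names `git_blob_id`, `shard_for`, and `indexing_hours` are hypothetical. Delta encoding is not covered here.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the Git blob object ID (SHA-1) for the given file content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def shard_for(content: bytes, num_shards: int) -> int:
    """Pick a shard from the content hash (modulo scheme is an assumption)."""
    return int(git_blob_id(content), 16) % num_shards


def indexing_hours(num_docs: float, docs_per_second: float) -> float:
    """Time to index num_docs at a steady rate, in hours."""
    return num_docs / docs_per_second / 3600


if __name__ == "__main__":
    original = b"print('hello')\n"
    fork_copy = b"print('hello')\n"
    # Identical content hashes to the same ID, so it lands on the same
    # shard and only needs to be stored and indexed once.
    print(git_blob_id(original) == git_blob_id(fork_copy))
    print(shard_for(original, 32) == shard_for(fork_copy, 32))
    # 15.5 billion documents at about 120,000 documents per second.
    print(round(indexing_hours(15.5e9, 120_000), 1))
```

Because the shard is derived from the content rather than from the repository, a file duplicated across many forks is routed to one place, which is what makes the deduplication described in the summary cheap to apply.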


  • by bradley13 ( 1118935 ) on Wednesday February 08, 2023 @05:22AM (#63274905) Homepage

    Would a search engine for source code just be a better way for students to find online solutions to their homework? I'm having trouble imagining any other use-case for this.

    I mean, if you want a snippet (like "how do I safely encrypt passwords"), you can already find plenty of solutions online. Some of which may even work.

    If you need complete solutions, you won't be searching in code. You will use ordinary search engines: "ERP system", "Webshop", or whatever.

    What professional programmers are likely to need is somewhere in between. For example, one of my next projects is to create a solution that fetches data from measurement instruments, stores this in a database, and provides a customized graphical display of the data. I'm not going to find that pre-implemented anywhere.

    Seriously, what's the use case here?

    • by jlar ( 584848 ) on Wednesday February 08, 2023 @05:44AM (#63274943)

      "I mean, if you want a snippet (like "how do I safely encrypt passwords"), you can already find plenty of solutions online."

      Only if you are doing mainstream development. I used to be a researcher but I am now a developer/domain expert in a very narrow field. I have often needed an implementation of a scientific algorithm. They are never available on Stack Overflow or similar. But there are often methods implementing these specific algorithms buried deep down in some anonymous GitHub repository created by a researcher in the field. So, that would be one use case.

      Another less direct use case is that it may be useful for Github Copilot. But that is just me speculating.

      • by chill ( 34294 ) on Wednesday February 08, 2023 @07:06AM (#63275053) Journal

        So like Google Scholar [google.com], but for code?

        A repository of proven correct, bug-free implementations of various algorithms in a small variety of languages would be a very interesting development. Something like Stack Exchange, except only with bug-free code, proven correct, and free for the taking.

      • But how would you even search for that algorithm using code as your search term? Wouldn't you instead search GitHub for "Levenshtein distance" or something descriptive?
      • Using code in some anon repo isn't a great strategy, especially if it's doing something like encryption. A good way to get fired is to copy/paste code from code that hasn't been vetted anywhere.
        • by nasch ( 598556 )

          Using code in some anon repo isn't a great strategy

          Trusting code from an anonymous repo isn't a great strategy, but whether you write it yourself or find it on the internet, you're going to have to test it to make sure it works the way you want. If you can write a good set of tests to prove the code, does it matter where it came from? And if you can't... well you're just rolling dice either way.

    • You're absolutely right, but that also shows that there's a need for better, more widespread, and maybe even more granular package management across a number of languages. Every stack overflow search for some common code should be delegated to a library, or repository, of tested algorithm implementations with minimal/no dependencies.
    • by ShanghaiBill ( 739463 ) on Wednesday February 08, 2023 @06:11AM (#63274977)

      I'm having trouble imagining any other use-case for this.

      A "search engine for source code" is a pretty good description of Stackoverflow.

      You really can't think of a use case for that?

      I use Stackoverflow a dozen times per day. It is a major productivity booster.

    • by psmears ( 629712 ) on Wednesday February 08, 2023 @06:59AM (#63275047)
      In my experience this sort of thing can be very useful for debugging - e.g. "I have this obscure message in a logfile from some third-party software, show me the code that produced it so I can figure out what was the likely cause", or "My calls to <obscure undocumented library/kernel function> don't seem to be working as expected; show me examples of other code that calls this function".
    • by fazig ( 2909523 )
      It probably panders to the same crowd that relies strongly on stackoverflow, which are either beginners who still have a lot to learn or the bottom of the barrel developers that you shouldn't even let near the timer settings of your microwave oven.

      For professionals it could be handy for boiler-plate stuff and of course lend you a hand in debugging things.
    • by Anonymous Coward

      I've used it a fair bit in the last few months, it's useful for looking through the entire company's codebase for stupid shit like hard coded secrets, or for looking for guidance on how some internal bespoke tooling can be used in a given circumstance by looking for existing usage of it.

      It's probably not much use at a small company, but somewhere where you have hundreds of developers, thousands of projects, it's handy.

      I wouldn't describe it as a game changer though, it's not a significant uplift from any ex

    • by _merlin ( 160982 ) on Wednesday February 08, 2023 @08:44AM (#63275157) Homepage Journal

      Finding examples of real-world code using poorly-documented APIs. Often API documentation is absolutely terrible if it exists at all. If you can find commented real-world code using an API it can be a life-saver.

    • by altp ( 108775 ) on Wednesday February 08, 2023 @08:57AM (#63275177)

      My main use case for github search is finding code in the organization that I work in. If:

      * we see an error message in our centralized logging solution, and I need to know where it comes from
      * I want to find all the internal code that calls an internal micro service
      * we need to find product names across repos

      We have a /lot/ of repos (thousands). Downloading all of them and searching locally isn't really feasible.

      Github's default search is horrible, and strips most "special" characters out, and doesn't support regular expression searching. Sourcegraph is great for this now, but is extremely expensive at $90 per active user per month. Having a better search built into Github would be a game changer for organizations.

    • Exactly. I fail to see how this is a game changer. Maybe the methodology behind their indexing strategy is a game changer, and it can be applied somewhere else that's actually useful.
      • by tragedy ( 27079 )

        I mean, to me the summary read pretty much as: we wanted to search faster through a lot of data so we indexed it in a database. I mean, maybe they implemented their own nosql database, but it sounds pretty much like they're just replicating some of the functionality of existing relational databases. Also, the summary referred to "sharding". Maybe I'm wrong on the nomenclature, but I thought that was gaming specific terminology for distributed computing.

        • by nasch ( 598556 )

          Maybe I'm wrong on the nomenclature, but I thought that was gaming specific terminology for distributed computing.

          It's also used in nosql databases (or with MongoDB at least).

          • by tragedy ( 27079 )

            That makes sense. From what I can find, it was probably adopted by the nosql crowd from online gaming.

  • by Meneth ( 872868 ) on Wednesday February 08, 2023 @05:23AM (#63274907)
    They've learned how to build a free-text search index. Altavista says "welcome to the club".
    • by Entrope ( 68843 )

      If only he said something about that early in TFS....

      The truth is from Solr to Elasticsearch, we haven't had a lot of luck using general text search products to power code search

  • Do they really think google does a "grep" of every single web page in the world when somebody starts typing in the search box?

    There may be 200 million repositories on GitHub but there's probably only about 2000 different queries that make up the bulk of what people look for.

    You do the greps on demand and cache the results.

    • I think they're certain that index search does not work that way.

      In reading the summary, I think they tried grep because searching code is lexically different than searching language.

      Where do you break into "word" tokens something like a method? How can it know when it's important to index string literals vs. not? How do you index Python dictionary syntax without getting your indexer (or maybe the query engine) stuck on the key names vs. the code concepts you want to find? And then there's perl. *sigh*.

      So yo

    • And people are Hard at work building search engines to address that....

  • Why does the porridge bird lay his egg in the air?

  • The relationship between a function, atype somename(a,b,c) and function , btype something_here(a,b,c).....is not unique to any code. So searchable?

  • They don't use grep for their search engine. Noted.
  • Usually a reliable indicator that it is not. Likely the same here.

  • Written as a song or as a novel. By a great master, a genius.

    Unfortunately, it turned out that it is copy-pasted by huge groups of low-cost hack-workers with assistance of shameless marketers.
    • by tragedy ( 27079 )

      Written as a song or as a novel. By a great master, a genius.

      Unfortunately, it turned out that it is copy-pasted by huge groups of low-cost hack-workers with assistance of shameless marketers.

      That's just Sturgeon's law. Apply the same thing to architecture, for example. Most houses/buildings are not conceived and created by great masters. "Shameless marketers" describes most property developers and most of the real estate market pretty well also.

      This is perfectly natural, of course. The demand for buildings is high and the availability of great masters to produce them is low. Same is true in software. The demand for software is high and the supply of masters is low. Same is true of your examples

      • Not disagreeing totally but....the ones who get to design and build things like 4 story apartment complexes all over the country, just to be practical, are not taking the opportunity to make something unique and the pinnacle of craft? Utility is never impractical....

        • by tragedy ( 27079 )

          As I said, there are plenty of mediocre creators in many fields. Most work is mediocre. Mediocre work is not actually bad. It's generally perfectly functional, aesthetic, etc., just nothing to write home about. The referenced masters and geniuses are rare, so their work is rare. At the other end of the bell curve are the creators whose work does not manage to achieve mediocrity. That work, combined with the mediocre work, is the 90% of work that's "cr*p" according to Sturgeon's Law. A lot of it is actually no

          • I think the gap between masterwork and others is the fact giving people what they want relies on them, the people, actually knowing what they want. So a thing that performs a role is functional and becomes standardized. Then the art is in the lines between standards and challenging those standards will get the artist initial notice but the functionality of the vision may not be adaptable. So their work is ignored. Sometimes until long after the artist is gone then standards shift and the master is asknowled

  • This search engine will be underwhelming.

    Try searching for something on Windows. You might find it, you might not, it's anybody's guess. It doesn't matter whether the thing you are searching for exists or not, the search engine is clueless.

  • Last time I checked you couldn't search for specific code patterns inside repos or projects (collections of repos). That can be very well served by a simple recursive grep, and it would be tremendously powerful for many things.
