GitHub Claims Source Code Search Engine Is a Game Changer (theregister.com)
Thomas Claburn writes via The Register: GitHub has a lot of code to search -- more than 200 million repositories -- and says last November's beta of a search engine optimized for source code has caused a "flurry of innovation." GitHub engineer Timothy Clem explained that the company has had problems getting existing technology to work well. "The truth is from Solr to Elasticsearch, we haven't had a lot of luck using general text search products to power code search," he said in a GitHub Universe video presentation. "The user experience is poor. It's very, very expensive to host and it's slow to index." In a blog post on Monday, Clem delved into the technology used to scour just a quarter of those repos: a code search engine built in Rust called Blackbird.
Blackbird currently provides access to almost 45 million GitHub repositories, which together amount to 115TB of code and 15.5 billion documents. Sifting through that many lines of code requires something stronger than grep, a common command line tool on Unix-like systems for searching through text data. Using ripgrep on an 8-core Intel CPU to run an exhaustive regular expression query on a 13GB file in memory, Clem explained, takes about 2.769 seconds, or 0.6GB/sec/core. [...] At 0.01 queries per second, grep was not an option. So GitHub front-loaded much of the work into precomputed search indices. These are essentially maps of key-value pairs. This approach makes it less computationally demanding to search for document characteristics like the programming language or word sequences by using a numeric key rather than a text string. Even so, these indices are too large to fit in memory, so GitHub built iterators for each index it needed to access. According to Clem, these lazily return sorted document IDs that represent the rank of the associated document and meet the query criteria.
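The index-plus-lazy-iterator idea Clem describes can be sketched very roughly in Python (illustrative only -- Blackbird itself is written in Rust, and the index keys and document IDs below are invented):

```python
def intersect(a, b):
    """Lazily intersect two sorted iterators of document IDs.

    Nothing is materialized up front: IDs are pulled from each
    posting list only as the consumer asks for the next match.
    """
    a, b = iter(a), iter(b)
    try:
        x, y = next(a), next(b)
        while True:
            if x == y:
                yield x
                x, y = next(a), next(b)
            elif x < y:
                x = next(a)
            else:
                y = next(b)
    except StopIteration:
        return

# Toy precomputed index: numeric key -> sorted list of document IDs.
# A numeric key stands in for a document characteristic, e.g.
# key 1 = "language:rust", key 2 = some indexed word sequence.
index = {
    1: [2, 5, 9, 14],
    2: [5, 7, 14, 20],
}

# Documents matching both criteria, produced lazily in sorted order.
hits = list(intersect(index[1], index[2]))  # -> [5, 14]
```

Real posting lists are far too large for memory, which is exactly why the iterators matter: a query touching several indices only ever streams through them, never loads them whole.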
To keep the search index manageable, GitHub relies on sharding -- breaking the data up into multiple pieces using Git's content addressable hashing scheme -- and on delta encoding -- storing data differences (deltas) -- to reduce the data and metadata to be crawled. This works well because GitHub has a lot of redundant data (e.g. forks): its 115TB of data can be boiled down to 25TB through deduplication techniques. The resulting system works much faster than grep -- 640 queries per second compared to 0.01 queries per second. And indexing occurs at a rate of about 120,000 documents per second, so processing 15.5 billion documents takes about 36 hours, or 18 for re-indexing, since delta (change) indexing reduces the number of documents to be crawled.
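A toy sketch of how content-addressable hashing gives both sharding and deduplication essentially for free (illustrative Python, not GitHub's code; the shard count and blob contents are invented):

```python
import hashlib

NUM_SHARDS = 4

def blob_id(content: bytes) -> str:
    # Git-style content addressing: the ID is a hash of the content,
    # so identical blobs (e.g. unchanged files in forks) get the same ID.
    return hashlib.sha1(content).hexdigest()

def shard_for(content: bytes) -> int:
    # The hash also makes a cheap, stable shard key.
    return int(blob_id(content), 16) % NUM_SHARDS

shards = {i: set() for i in range(NUM_SHARDS)}
blobs = [
    b"fn main() {}",   # original repo
    b"fn main() {}",   # fork with the same file -- dedupes away
    b"print('hi')",    # distinct content
]
for blob in blobs:
    shards[shard_for(blob)].add(blob_id(blob))

total_indexed = sum(len(s) for s in shards.values())  # 2, not 3
```

The set-per-shard stands in for the dedup step: three input blobs collapse to two unique documents to crawl, which is the same effect, writ small, as GitHub's 115TB shrinking to 25TB.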
What is the use case here? (Score:3, Interesting)
Would a search engine for source code just be a better way for students to find online solutions to their homework? I'm having trouble imagining any other use-case for this.
I mean, if you want a snippet (like "how do I safely encrypt passwords"), you can already find plenty of solutions online. Some of which may even work.
If you need complete solutions, you won't be searching in code. You will use ordinary search engines: "ERP system", "Webshop", or whatever.
What professional programmers are likely to need is somewhere in between. For example, one of my next projects is to create a solution that fetches data from measurement instruments, stores this in a database, and provides a customized graphical display of the data. I'm not going to find that pre-implemented anywhere.
Seriously, what's the use case here?
Re:What is the use case here? (Score:5, Insightful)
"I mean, if you want a snippet (like "how do I safely encrypt passwords"), you can already find plenty of solutions online."
Only if you are doing mainstream development. I used to be a researcher but I am now a developer/domain expert in a very narrow field. I have often needed an implementation of a scientific algorithm. They are never available on Stack Overflow or similar. But there are often methods implementing these specific algorithms buried deep down in some anonymous GitHub repository created by a researcher in the field. So, that would be one use case.
Another, less direct use case is that it may be useful for GitHub Copilot. But that is just me speculating.
Re:What is the use case here? (Score:4, Interesting)
So like Google Scholar [google.com], but for code?
A repository of proven-correct, bug-free implementations of various algorithms in a small variety of languages would be a very interesting development. Something like Stack Exchange, except only with bug-free code, proven correct, and free for the taking.
Re: (Score:2)
Using code in some anon repo isn't a great strategy
Trusting code from an anonymous repo isn't a great strategy, but whether you write it yourself or find it on the internet, you're going to have to test it to make sure it works the way you want. If you can write a good set of tests to prove the code, does it matter where it came from? And if you can't... well you're just rolling dice either way.
Re: (Score:2)
Re:What is the use case here? (Score:4, Interesting)
I'm having trouble imagining any other use-case for this.
A "search engine for source code" is a pretty good description of Stack Overflow.
You really can't think of a use case for that?
I use Stack Overflow a dozen times per day. It is a major productivity booster.
Re:What is the use case here? (Score:5, Insightful)
Re: (Score:1)
For professionals it could be handy for boiler-plate stuff and of course lend you a hand in debugging things.
Re: (Score:1)
I've used it a fair bit in the last few months. It's useful for looking through the entire company's codebase for stupid shit like hard-coded secrets, or for guidance on how some internal bespoke tooling can be used in a given circumstance by looking for existing usage of it.
It's probably not much use at a small company, but somewhere where you have hundreds of developers, thousands of projects, it's handy.
I wouldn't describe it as a game changer, though; it's not a significant uplift from any ex
Re:What is the use case here? (Score:5, Insightful)
Finding examples of real-world code using poorly-documented APIs. Often API documentation is absolutely terrible if it exists at all. If you can find commented real-world code using an API it can be a life-saver.
Re:What is the use case here? (Score:4, Interesting)
My main use case for github search is finding code in the organization that I work in. If:
* we see an error message in our centralized logging solution, and I need to know where it comes from
* I want to find all the internal code that calls an internal microservice
* we need to find product names across repos
We have a /lot/ of repos (thousands). Downloading all of them and searching locally isn't really feasible.
GitHub's default search is horrible: it strips most "special" characters out and doesn't support regular expression searching. Sourcegraph is great for this now, but is extremely expensive at $90 per active user per month. Having a better search built into GitHub would be a game changer for organizations.
Re: (Score:2)
I mean, to me the summary read pretty much as: we wanted to search faster through a lot of data so we indexed it in a database. Maybe they implemented their own nosql database, but it sounds pretty much like they're just replicating some of the functionality of existing relational databases. Also, the summary referred to "sharding." Maybe I'm wrong on the nomenclature, but I thought that was gaming-specific terminology for distributed computing.
Re: (Score:2)
Maybe I'm wrong on the nomenclature, but I thought that was gaming specific terminology for distributed computing.
It's also used in nosql databases (or with MongoDB at least).
Re: (Score:2)
That makes sense. From what I can find, it was probably adopted by the nosql crowd from online gaming.
Impressive /s (Score:3)
Re: (Score:2)
If only he said something about that early in TFS....
The truth is from Solr to Elasticsearch, we haven't had a lot of luck using general text search products to power code search
Re: (Score:3)
https://slashdot.org/story/06/... [slashdot.org]
They allowed regex search, which was really nice.
Doing it wrong... (Score:2)
Do they really think google does a "grep" of every single web page in the world when somebody starts typing in the search box?
There may be 200 million repositories on GitHub, but there are probably only about 2000 different queries that make up the bulk of what people look for.
You do the greps on demand and cache the results.
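The parent's suggestion -- grep on demand, cache the hot queries -- could be sketched like this (toy Python; the corpus, cache size, and query are all invented):

```python
import re
from functools import lru_cache

# Stand-in for the searchable corpus: path -> file contents.
CORPUS = {
    "repo_a/main.py": "def login(user): ...",
    "repo_b/auth.rs": "fn login(user: &str) {}",
}

@lru_cache(maxsize=2000)  # the parent's ~2000 hot queries fit easily
def search(pattern: str) -> tuple:
    # The expensive on-demand "grep": a full scan of the corpus.
    # lru_cache memoizes it, so repeat queries skip the scan entirely.
    regex = re.compile(pattern)
    return tuple(path for path, text in CORPUS.items() if regex.search(text))

search("login")   # cache miss: scans the corpus
search("login")   # cache hit: returned without scanning
```

Whether this beats an index depends on how heavy the query tail is; a cold cache still pays the full per-query scan that GitHub measured at ~0.01 queries/second.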
Re: Doing it wrong... (Score:2)
I think they're certain that index search does not work that way.
In reading the summary, I think they tried grep because searching code is lexically different from searching natural language.
Where do you break something like a method into "word" tokens? How can it know when it's important to index string literals vs. not? How do you index Python dictionary syntax without getting your indexer (or maybe the query engine) stuck on the key names vs. the code concepts you want to find? And then there's perl. *sigh*.
So yo
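For what it's worth, one way engines sidestep exactly this tokenization problem is trigram indexing (the approach Russ Cox described for Google Code Search): index every three-character substring, so no language-aware tokenizer is needed at all. A rough, invented sketch:

```python
def trigrams(text: str) -> set:
    # Every 3-character substring -- works identically for Python,
    # Perl, string literals, or dictionary keys.
    return {text[i:i + 3] for i in range(len(text) - 2)}

docs = {
    1: "def parse(data): ...",
    2: "x = {'key': 'value'}",
}

# Inverted index: trigram -> set of document IDs containing it.
index = {}
for doc_id, text in docs.items():
    for tri in trigrams(text):
        index.setdefault(tri, set()).add(doc_id)

def candidates(query: str) -> set:
    # A doc can only match if it contains every trigram of the query.
    # Candidates still need a real scan afterwards to rule out
    # false positives (trigrams present but not contiguous).
    result = None
    for tri in trigrams(query):
        postings = index.get(tri, set())
        result = postings if result is None else result & postings
    return result or set()

candidates("parse")  # -> {1}
```

The trade-off is the verification pass: the trigram filter narrows 15 billion documents to a handful, and only those get the expensive exact (or regex) match.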
Re: Doing it wrong... (Score:1)
Maybe search by implementation -- how the code interacts with other code. Sounds like an API at that point, though, and that's been well covered...
Re: (Score:2)
Re: (Score:1)
And people are hard at work building search engines to address that....
Uh, Clem (Score:2)
Why does the porridge bird lay his egg in the air?
math is open source (Score:1)
The relationship between a function, atype somename(a,b,c), and a function, btype something_here(a,b,c), is not unique to any code. So searchable?
So that is Google's secret (Score:2)
"Vendor claims their product is great!" (Score:2)
Usually a reliable indicator that it is not. Likely the same here.
Code should be written. (Score:2)
Unfortunately, it turned out that it is copy-pasted by huge groups of low-cost hack-workers with assistance of shameless marketers.
Re: (Score:2)
Written as a song or as a novel. By a great master, a genius.
Unfortunately, it turned out that it is copy-pasted by huge groups of low-cost hack-workers with assistance of shameless marketers.
That's just Sturgeon's law. Apply the same thing to architecture, for example. Most houses/buildings are not conceived and created by great masters. "Shameless marketers" describes most property developers and most of the real estate market pretty well also.
This is perfectly natural, of course. The demand for buildings is high and the availability of great masters to produce them is low. The same is true in software: the demand for software is high and the supply of masters is low. Same is true of your examples.
Re: Code should be written. (Score:1)
Not disagreeing totally, but... the ones who get to design and build things like 4-story apartment complexes all over the country, just to be practical, are not taking the opportunity to make something unique and the pinnacle of craft? Utility is never impractical....
Re: (Score:2)
As I said, there are plenty of mediocre creators in many fields. Most work is mediocre. Mediocre work is not actually bad; it's generally perfectly functional, aesthetic, etc., just nothing to write home about. The referenced masters and geniuses are rare, so their work is rare. At the other end of the bell curve are the creators whose work does not manage to achieve mediocrity. That work, combined with the mediocre work, is the 90% of work that's "cr*p" according to Sturgeon's Law. A lot of it is actually no
Re: Code should be written. (Score:1)
I think the gap between masterwork and others is the fact that giving people what they want relies on them, the people, actually knowing what they want. So a thing that performs a role is functional and becomes standardized. Then the art is in the lines between standards, and challenging those standards will get the artist initial notice, but the functionality of the vision may not be adaptable. So their work is ignored. Sometimes until long after the artist is gone; then standards shift and the master is asknowled
Judging by Microsoft's search track record (Score:2)
This search engine will be underwhelming.
Try searching for something on Windows. You might find it, you might not, it's anybody's guess. It doesn't matter whether the thing you are searching for exists or not, the search engine is clueless.
Have they implemented searching within a repo? (Score:2)