Searchable C/C++ DB surpasses 275 million lines
Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."
Re:Statistics: (Score:3, Informative)
Overall Stats
Number of Packages: 10,931
Total Number of Files: 1,151,819
Total Lines of Code (No comments, no blank lines): 283,119,081
Total of All Lines: 420,355,464
Total Number of Functions: 7,782,468
Total Number of Functions Called: 69,500,700
Total Number of Macros: 9,947,564
Total Number of Classes: 209,361
Total Number of Comments: 38,125,107
Total Number of Structures: 554,178
Total Number of Unions: 19,687
Total Number of Includes: 5,904,187
Hit Refresh (Score:5, Informative)
-everphilski-
Re:What? Millions of code? (Score:5, Informative)
It's not a searchable database written in 275 million lines of code.
See also: Codase.com (Score:3, Informative)
-Mark
Koders.com (Score:3, Informative)
How about a potential buffer overflow index? (Score:5, Informative)
Re:Slashdot Block (Score:3, Informative)
For every problem, there is a solution that is simple, elegant and wrong. In every other market, the more demand there is, the higher the price/revenue/profit. Web servers are pretty much the only place where you lose more money the more popular you are (e-commerce sites and such not included). If so many people want the content, they can find a way to share it. Even then they're getting a bloody good deal, if you ask me. What exactly are you complaining about, that they aren't generous *enough*? Blocking slashdottings is a small price to pay compared to turning it into a [ad] pile [ad] of [ad] advertisements or subscription site. That is what you do if you "don't want a big bandwidth bill".
Re:Please check for this: comma in brackets in C++ (Score:4, Informative)
But from an OOP standpoint, it's impossible to create a data structure that "knows" you're using the [] operator twice. So if you overload the [] operator in an array structure to get multi-dimensional arrays, you have to nest single-dimension arrays, which is almost always inefficient because the rows (or columns, depending on whether you're row-major or column-major) are scattered around RAM (depending on where they were allocated), rather than sitting in one contiguous chunk as in C.
In other words, you can't do something like this in C++:
class SmartArray {
public:
SmartArray(int height, int width);
int operator[](const int &x, const int &y) const;
};
...
SmartArray a(5, 5);
a[12, 13];
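One common way around this, sketched here rather than taken from the thread, is to overload operator() with two arguments and keep every element in a single contiguous buffer. The member names (height_, width_, cells_, offset, data) are made up for illustration, and the row-major layout is an assumption:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical completion of the SmartArray idea above: one contiguous
// buffer, indexed with operator() instead of operator[].
class SmartArray {
public:
    SmartArray(int height, int width)
        : height_(height), width_(width),
          cells_(static_cast<std::size_t>(height) * static_cast<std::size_t>(width), 0) {}

    // Two-argument call operator: this is legal where a two-argument
    // operator[] was not.
    int& operator()(int row, int col) { return cells_[offset(row, col)]; }
    int operator()(int row, int col) const { return cells_[offset(row, col)]; }

    // Exposes the underlying buffer so the layout can be inspected.
    const int* data() const { return cells_.data(); }

private:
    // Row-major layout: element (row, col) lives at row * width + col.
    std::size_t offset(int row, int col) const {
        assert(row >= 0 && row < height_ && col >= 0 && col < width_);
        return static_cast<std::size_t>(row) * static_cast<std::size_t>(width_)
               + static_cast<std::size_t>(col);
    }

    int height_;
    int width_;
    std::vector<int> cells_;
};
```

With this, "SmartArray a(5, 5); a(2, 3) = 42;" stores the value at position 2 * 5 + 3 of one allocation, so no rows are scattered around the heap. Much later revisions of the language (C++23) do accept a multi-argument operator[], but that postdates this thread.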
Re:Please check for this: comma in brackets in C++ (Score:3, Informative)
To validate, I pulled out my copy of K&R 2nd edition (Actually a copy I once rescued from a trash bin, and my copy is only "Based on Draft-Proposed ANSI C"). In section 5.9 Pointers vs. Multi-dimensional Arrays it points out,
Re:Interesting stats (Score:5, Informative)
for your reading pleasure [vidarholen.net].... the linux kernel fuck count
Re:My vote is for... (Score:2, Informative)
ONLY K&R!!!!
Seriously, I am a K&R maniac, which caused me to get quite irritated at one of my professors, who once wrote "confusing braces" on a programming assignment I handed in. (It was a little confusing, but because I was being clever and efficient, not because of my braces preferences.)
I think the proportion of code written in K&R vs. The Incorrect Styles would be very interesting to see.
Re:Statistics: (Score:2, Informative)
http://csourcesearch.net/license/ [csourcesearch.net]
You can also click on any of those licenses and then on that page choose to only search for code found in that license.
Re:Choice of db? (Score:4, Informative)
I initially used their FULLTEXT indexing as well, but it dies a horrible death with a large number of rows or search terms. (The developers that live in #mysql on Freenode confirmed this)
So I had to hand off searching to Lucene, which worried me a great deal (it being Java), but as folks tell me, 'Java is not slow'.
They are right, Java is very fast at handling the searching and I've been very impressed.
Most searches in the Java database only take one or two seconds.
The MySQL query/join for additional info takes another 4 or 5 seconds.
Most searches take about 8 seconds to come up, even under no load.
I simply don't have enough RAM to keep the necessary MySQL indexes in RAM and use index only queries.
Proposed workaround doesn't work (Score:4, Informative)
You can force the conversion with
blah[ location(5), 5] = 10;
but that's not useful except to see what's happening.
You can't overload the built-in operators for built-in types. So overloading, outside of an object, "operator,(int, int)" won't work either.
Hence the need for a straightforward solution.
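To see what is happening, here is a rough sketch of the forced version. Only the names location and blah come from the post; the types index2d and grid, their members, and the row-major layout are assumptions made up for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper type; only the name "location" comes from the thread.
struct location {
    int value;
    explicit location(int v) : value(v) {}
};

// Hypothetical (row, column) pair produced by the overloaded comma.
struct index2d {
    int row;
    int col;
};

// Legal because the left operand is a class type. Declaring
// operator,(int, int) instead is a compile error.
inline index2d operator,(location row, int col) {
    return index2d{row.value, col};
}

class grid {
public:
    grid(int height, int width)
        : height_(height), width_(width),
          cells_(static_cast<std::size_t>(height) * static_cast<std::size_t>(width), 0) {}

    // Single-argument subscript that takes the pair built by operator,.
    int& operator[](index2d i) {
        assert(i.row >= 0 && i.row < height_ && i.col >= 0 && i.col < width_);
        return cells_[static_cast<std::size_t>(i.row) * static_cast<std::size_t>(width_)
                      + static_cast<std::size_t>(i.col)];
    }

    // Exposes the underlying buffer so the layout can be inspected.
    const int* data() const { return cells_.data(); }

private:
    int height_;
    int width_;
    std::vector<int> cells_;
};
```

Given "grid blah(10, 10);", the statement "blah[(location(5), 5)] = 10;" writes to row 5, column 5. The extra parentheses are deliberate: C++20 deprecated an unparenthesized comma inside [], and C++23 reads it as two subscript arguments. With two plain ints the built-in comma operator is used instead, and the result is just the second value.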
Re:histogram of C reserved words - well, B .... (Score:3, Informative)
auto was necessary in B for local variables, as a plain variable name by itself was a valid expression statement (as it is in C), not a declaration (IIRC).
1. foo() { auto bar;
2. foo() { static bar;
3. foo() { extrn bar;
4. foo() { bar;
All mean something different in B: the first three instances of bar are declarations, the fourth is an expression statement (and if I remember my B correctly, it is invalid as the first statement of foo(), because bar hasn't been declared one of auto, static, or extrn yet in this function).
In C, auto is completely redundant. Except, perhaps, in comments.
Ah, B. The days when programmers were programmers and data was data, and you could perform any operation you liked on any variable. Want to divide a pointer to a string by 3? Go ahead. Self-disciplined programmers don't need training wheels. Just a choice between auto, static and extrn.
Codase is much better with 250M of C/C++/JAVA code (Score:1, Informative)
quoted from their website:
"For the first time, to find relevant code, developers can simply type into a search box about the same code as they do in their daily development work. The Codase smart query system processes the input and then builds an internal query to feed into the search engine. Through this free style format, complex combinations of multiple search terms can be easily entered. For example, to find any main method that contains variable t and function calls of thread.start() and println, this query can be used: main() { var t; thread.start(); println; }",
http://www.codase.com/search/smart?join=main()+%7