Searchable C/C++ DB surpasses 275 million lines
Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."
Some statistics to get you started (Score:5, Funny)
The following "interesting statistics" come to mind:
You gotta get the variables searchable. Most critical for that last statistic. Also, I'm too lazy to learn Lucene Query Parser Syntax [apache.org], so the statistics for "Natalie Portman" may include references to "portman."
useful statistic (Score:5, Funny)
Re:useful statistic (Score:5, Funny)
Re:useful statistic (Score:2)
Re:useful statistic (Score:5, Funny)
Sounds like you should have written it in C++ instead of a laggard language like PHP
Re:useful statistic: parent: -1 troll (Score:4, Funny)
I know PHP is a great web language and that it probably isn't the cause of the slowdown. Heck, even Yahoo! uses it these days.
I was attempting (unsuccessfully, it seems) to make fun of the purists who insist that robust web applications must run on something compiled in order to reach acceptable performance under high load.
My vote is for... (Score:5, Insightful)
}
Re:My vote is for... (Score:2, Funny)
Re:My vote is for... (Score:2, Insightful)
if (cond) {
}
or this:
if (cond)
{
}
Re:My vote is for... (Score:5, Interesting)
etc
Re:My vote is for... (Score:5, Interesting)
(Partial to BSD style myself..)
Re:My vote is for... (Score:2)
Re:My vote is for... (Score:5, Funny)
or "// FIXME" (Score:5, Funny)
Re:My vote is for... (Score:2)
So I'd suspect lines with purely "}" and whitespace would be quite a few.
Similarity checking (Score:5, Funny)
Or similarities between different projects (Score:2)
Related: having found similar code sections, follow trends in them over time. Find where two programs copied the same code, but one has failed to implement what might be a bug fix or improvement in another, by looking at changes to the code over time.
Interesting stats (Score:5, Interesting)
Re:Interesting stats (Score:5, Informative)
for your reading pleasure [vidarholen.net].... the linux kernel fuck count
SCO (Score:2, Funny)
But then again, probably not...
One word (Score:2)
Statistics: (Score:5, Interesting)
2. Comment / command ratio
3. Number of curse word variable names
Re:Statistics: (Score:2, Insightful)
Re:Statistics: (Score:3, Informative)
Overall Stats
Number of Packages: 10,931
Total Number of Files: 1,151,819
Total Lines of Code (No comments, no blank lines): 283,119,081
Total of All Lines: 420,355,464
Total Number of Functions: 7,782,468
Total Number of Functions Called: 69,500,700
Total Number of Macros: 9,947,564
Total Number of Classes: 209,361
Total Number of Comments: 38,125,107
Total Number of Structures: 5
Re:Statistics: (Score:3, Funny)
So the code calls 61,718,232 functions which don't even exist?
But maybe they just meant "Total Number of Function Calls"
Measurements I have made (Score:5, Insightful)
Being able to search so much source is also very useful. I was involved in a discussion a while back about the frequency of use of Bessel functions in programs (I claimed rare). The handful of uses returned from your database helped back up my argument (dare I say prove it).
Keep up the good work!
Need to watch those stats (Score:3, Funny)
Re:Statistics: (Score:2)
1. Maximum brace nesting level for each function (might be difficult, but a good metric for determining the complexity of a function)
2. Percentages of control structures that are while, for, switch, if, etc.
3. Number of embedded constants that aren't 0 or 1
4. Count of references to each function/constant within a single project
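Item 1 is probably less difficult than it sounds if you only want brace depth. Here is a minimal sketch, assuming the input is one function's text; max_brace_depth is a made-up name, and the scanner skips comments and ordinary string/character literals but not raw strings or digit separators.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Deepest '{' nesting in `src`, ignoring braces inside comments and
// string/character literals. A sketch only: raw string literals,
// digit separators and preprocessor tricks are not handled.
int max_brace_depth(const std::string& src) {
    enum State { Code, LineComment, BlockComment, String, Char };
    State st = Code;
    int depth = 0, deepest = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        switch (st) {
        case Code:
            if (c == '/' && next == '/') { st = LineComment; ++i; }
            else if (c == '/' && next == '*') { st = BlockComment; ++i; }
            else if (c == '"') st = String;
            else if (c == '\'') st = Char;
            else if (c == '{') { ++depth; if (depth > deepest) deepest = depth; }
            else if (c == '}' && depth > 0) --depth;  // ignore unmatched '}'
            break;
        case LineComment:
            if (c == '\n') st = Code;
            break;
        case BlockComment:
            if (c == '*' && next == '/') { st = Code; ++i; }
            break;
        case String:
            if (c == '\\') ++i;               // skip the escaped character
            else if (c == '"') st = Code;
            break;
        case Char:
            if (c == '\\') ++i;               // skip the escaped character
            else if (c == '\'') st = Code;
            break;
        }
    }
    assert(depth >= 0);  // holds because unmatched '}' are ignored above
    return deepest;
}
```

The same loop could tally control keywords for item 2 while it is in the Code state.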
Re:Statistics: (Score:2)
5. ?
6. Profit
Bob.
(where 5 is a pretty good chance of getting counter-sued out of existence by IBM when the answer is some { and a few less }.)
Re:Statistics: (Score:2)
ratio (Score:5, Funny)
Re:ratio (Score:5, Funny)
Search -- foo -> Results 1 - 10 of about 26,600,000 for foo. (0.06 seconds)
Search -- bar -> Results 1 - 10 of about 385,000,000 for bar [definition]. (0.16 seconds)
Search -- foo bar -> Results 1 - 10 of about 7,900,000 for foo bar. (0.12 seconds)
'bar' wins. This intuitively makes sense, as who would want to go to the 'foo' for a drink, or eat an 'energy foo'? Could you imagine a lawyer being 'dis-fooed'?
Suggestion (Score:5, Funny)
How about a new server?
Slashdot Block (Score:3, Interesting)
Re:Slashdot Block (Score:3, Insightful)
If you don't want to pay a big bandwidth bill then don't run a webserver.
Re:Slashdot Block (Score:2, Insightful)
Re:Slashdot Block (Score:2)
If you want access to a web server, don't run a system that's known to give the provider big bandwidth bills.
At the end of the day, they don't owe you anything, and anything they offer you is a courtesy, not an obligation. If you don't like that, please feel free to go create and finance your own WWW.
Re:Slashdot Block (Score:3, Insightful)
That's a little harsh don't you think?
It's one thing to run a site and have reasonable expectations of having "enough" bandwidth for your projected traffic, and it's another thing to pay for a slashdotting on a
Re:Slashdot Block (Score:3, Informative)
For every problem, there is a solution that is simple, elegant and wrong. In every other market, the more demand there is, the higher the price/revenue/profit. Web servers are pretty much the only place where you lose more money the more popular you are (e-commerce sites and such not included). If so many people want the content, they can find a way to share it. Even then they're getting a bloody good deal, if you ask me. What exactly
Hit Refresh (Score:5, Informative)
-everphilski-
Re:Hit Refresh (Score:3, Funny)
Re:Slashdot Block (Score:2, Interesting)
Choice of db? (Score:4, Interesting)
I've used mysql for some small projects, but generally it doesn't handle
millions of rows (although the upper limit on rows can be patched with
some additional behaviors). So, for big dbs, I use postgresql.
How did you decide to use mysql? (Was it that the project started,
and grew, or did you know it would handle large numbers of rows
from the start)?
Just curious. This is probably going to be viewed as a flame by many
(particularly those who don't really use dbs very much, but use them
enough to have strong opinions).
Re:Choice of db? (Score:4, Informative)
I initially used their FULLTEXT indexing as well, but it dies a horrible death with a large number of rows or search terms. (The developers that live in #mysql on Freenode confirmed this)
So I had to hand off searching to Lucene, which worried me a great deal (it being Java), but as folks tell me, 'Java is not slow'.
They are right, Java is very fast at handling the searching and I've been very impressed.
Most searches in the Java database only take one or two seconds.
The MySQL query/join for additional info take another 4 or 5 seconds.
Most searches take about 8 seconds to come up, even under no load.
I simply don't have enough RAM to keep the necessary MySQL indexes in RAM and use index only queries.
Statistics TM (c) (Score:5, Interesting)
Also, I would really like to find "patient 0" for sourcecode. For example, is there a common library or utility function (perhaps Hex2Ascii?) that *everybody* uses? Well, who wrote it first?
And in a similar vein, who are the "top 5-10-100" authors of open source code by use, reuse, KLOC, etc.? Not of too much use unless I were awarding the Nobel prize for programming, or perhaps creating a list of individuals for the RIAA to sue, after they're done with their other useless lawsuits. :)
Interesting Statistics (Score:5, Interesting)
Amazon style statistics (Score:5, Interesting)
So show code with coloured background to the lines, from green to red, green being 'normal every day boiler plate' code, red would mean this code must be more specialised, or written by some half-wit l33t h4x0r at least.
I forgot what they called it, but they had 3/4 visible stats based on the semantics of the stuff, probably more under the 'hood (omg lol).
word. Oh some adhesion stats would rock!
please type the word in this image: adhesion
random letters - if you are visually impaired, please email us at pater@slashdot.org
The basics and more (Score:2, Insightful)
interesting stat (Score:3, Funny)
2) compile
3) execute
4) ???????
5) PROFIT!
Woman (Score:2, Funny)
Eh.
Unfortunately (Score:2)
Sounds kind of like the PMD scoreboard... (Score:5, Interesting)
PMD home page is here [sf.net], book site is here [pmdapplied.com].
cout "why bother" (Score:2)
1) look in my library (books)
2) do a deja search
3) ask smarter people than me
4) do a web search (usually on specific sites)
Find all buffer overflows please (Score:2)
Not working well -- TRY AGAIN LATER (Score:2)
I tried searching. Here's what I got:
XML Parsing Error: junk after document element Location: http://csourcesearch.net/performSearch.php?type=FunctionTypeReturned&search=(&ignoredRandomNumber=1133805159922.7798 [csourcesearch.net] Line Number 2, Column 1:Warning: mysql_connect() [function.mysql-connect [slashdot.org]]: Can't connect to MySQL server on '127.0.0.1' (4) in
^
Please check for this: comma in brackets in C++ (Score:5, Interesting)
So tab(i,j) is a function call with two arguments. But tab[i,j] is an invocation of the "comma operator", then a function call with one argument. The default "comma operator" ignores the first argument and returns the second. It once had some uses in C macros.
I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++. But there's a concern that somewhere, someone might have code that depends on the current semantics of the comma operator inside square brackets.
This new archive offers the opportunity to eliminate that possibility. So, do this search: Find, in non-comment standard C++ code, any occurrences of a comma operator within square brackets. Eliminate any where there are parentheses within the square brackets enclosing the comma. Can you find any? In any production code? In any open-source project? Anywhere?
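For anyone who hasn't run into the comma operator, a small sketch of the semantics described above. The function names are made up. The comma expression is parenthesized inside the brackets on purpose, since newer compilers may warn about or reject the unparenthesized form, and the (void) cast is only there to quiet "unused value" warnings.

```cpp
#include <cassert>

// What the built-in comma operator yields for (i, j): the left operand
// is evaluated and discarded, the right operand is the result.
int comma_result(int i, int j) {
    return ((void)i, j);
}

// Indexes a one-dimensional table with a comma expression as the subscript.
// Under the semantics discussed in this thread, this reads tab[j], not a
// two-dimensional element.
int index_with_comma(const int* tab, int i, int j) {
    return tab[((void)i, j)];
}

// Small self-check of both helpers.
void comma_demo() {
    int tab[4] = {10, 20, 30, 40};
    assert(comma_result(1, 3) == 3);
    assert(index_with_comma(tab, 1, 3) == tab[3]);
}
```

So a search of the archive would be looking for code that really wants the "discard i, use j" behavior inside brackets.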
Re:Please check for this: comma in brackets in C++ (Score:3, Insightful)
I'm no C++ expert, but isn't int array[row][col] a multidimensional array?
Re:Please check for this: comma in brackets in C++ (Score:2)
No, it's an array of pointers to an array of elements, which is not quite the same thing.
Arrays with multiple subscripts have many uses. Sparse array implementations, for example. People implement this now with code that looks like
tab(i,j) = 1;
This is valid C++, and with the right overloads, it compiles and runs, but it looks weird.
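A minimal sketch of what "the right overloads" could look like for a sparse table. SparseTable and its members are hypothetical names, and std::map is just one convenient backing store.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// Hypothetical sparse two-dimensional table of ints; unset cells read as 0.
class SparseTable {
public:
    SparseTable(std::size_t rows, std::size_t cols) : rows_(rows), cols_(cols) {}

    // Writable access: tab(i, j) = 1; creates the cell on first use.
    int& operator()(std::size_t i, std::size_t j) {
        assert(i < rows_ && j < cols_);
        return cells_[std::make_pair(i, j)];
    }

    // Read-only access never allocates a cell.
    int operator()(std::size_t i, std::size_t j) const {
        assert(i < rows_ && j < cols_);
        auto it = cells_.find(std::make_pair(i, j));
        return it == cells_.end() ? 0 : it->second;
    }

    // Number of cells that actually hold a value.
    std::size_t stored() const { return cells_.size(); }

private:
    std::size_t rows_, cols_;
    std::map<std::pair<std::size_t, std::size_t>, int> cells_;
};
```

With this, SparseTable tab(1000, 1000); tab(3, 7) = 1; stores a single cell, which is the function-call spelling the parent finds odd-looking.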
Re:Please check for this: comma in brackets in C++ (Score:3, Informative)
To validate, I pulled out my copy of K&R 2nd edition (Actually a copy I once rescued from a trash bin, and my copy is only "Based on Draft-Proposed ANSI C"). In section 5.9 Pointers vs. Multi-dimensional Arrays it points out,
Re:Please check for this: comma in brackets in C++ (Score:2)
Re:Please check for this: comma in brackets in C++ (Score:4, Interesting)
Proposed workaround doesn't work (Score:4, Informative)
You can force the conversion with
blah[ location(5), 5] = 10;
but that's not useful except to see what's happening.
You can't overload the built-in operators for built-in types. So overloading, outside of an object, "operator,(int, int)" won't work either.
Hence the need for a straightforward solution.
Re:Please check for this: comma in brackets in C++ (Score:4, Informative)
But from an OOP standpoint, it's impossible to create a data structure that "knows" you're using the [] operator twice. So if you overload the [] operator in an array structure to get multi-dimensional arrays, you have to nest single-dimension arrays, which is almost always inefficient because the rows (or columns, depending on whether you're row-major or column-major) are scattered around RAM (depending on where they were allocated), rather than sitting in one contiguous chunk as in C.
In other words, you can't do something like this in C++:
class SmartArray {
public:
SmartArray(int height, int width);
int operator[](const int &x, const int &y) const;
};
...
SmartArray a(5, 5);
a[12, 13];
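For comparison, a sketch of one common way to get a[i][j] over a single contiguous block without touching the comma operator: have operator[] hand back a pointer to the start of the row. Grid and its members are hypothetical names, and only the row index is bounds-checked here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense 2-D array stored as one contiguous block, row-major.
class Grid {
public:
    Grid(std::size_t height, std::size_t width)
        : height_(height), width_(width), data_(height * width, 0) {}

    // g[row] yields a pointer to the start of that row, so g[row][col]
    // is ordinary pointer indexing into the single allocation.
    int* operator[](std::size_t row) {
        assert(row < height_);
        return data_.data() + row * width_;
    }
    const int* operator[](std::size_t row) const {
        assert(row < height_);
        return data_.data() + row * width_;
    }

    // Direct view of the underlying storage.
    const int* raw() const { return data_.data(); }
    std::size_t width() const { return width_; }

private:
    std::size_t height_, width_;
    std::vector<int> data_;
};
```

Grid g(5, 5); g[2][3] = 42; writes into element 2 * 5 + 3 of the one allocation. Returning a small row-proxy object instead of a raw pointer would let you check the column index too.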
Re:Please check for this: comma in brackets in C++ (Score:4, Insightful)
This is a great counterexample to the GP. Changing the meaning
of the comma within square brackets would gain NOTHING and would
mean every existing compiler is now wrong.
The existing C array type is bad enough as it is, why make it
even more unwieldy by introducing a new variant? C++ is already
on the right track: discourage C arrays, and encourage container
classes that have things like bounds checking and automatic
memory allocation.
Re:Please check for this: comma in brackets in C++ (Score:3, Funny)
Also, C++ programmers are getting really old, and they don't handle change very well.
best_idea_ever (Score:4, Insightful)
Search for this bug (Score:2)
See also: Codase.com (Score:3, Informative)
-Mark
Koders.com (Score:3, Informative)
grep++ (Score:2)
I'd really love to see data
How about a potential buffer overflow index? (Score:5, Informative)
stats we'd like to see... (Score:5, Funny)
-# of ( ),{ },\
-time spent debugging/compiling
-total hours spent in production
-gallons of coffee consumed
-hours of daylight seen
-# of relationships destroyed
Code Styles (Score:5, Interesting)
Recycling code (Score:2)
That said, congrats on the milestone, and looking forward to hearing of more!
histogram of C reserved words (Score:5, Interesting)
to the, uh, national average.
1222 if
638 return
482 static
413 for
399 int
217 const
201 else
194 void
128 char
115 case
112 break
55 default
43 sizeof
37 do
35 switch
27 enum
24 struct
23 while
15 float
14 typedef
10 auto
7 unsigned
6 extern
1 long
Re:histogram of C reserved words (Score:2)
Re:histogram of C reserved words (Score:5, Funny)
2431 int
1802 goto
Re:histogram of C reserved words - well, B .... (Score:3, Informative)
auto was necessary in B for local variables, as a plain variable name by itself was a valid expression statement (as it is in C), not a declaration (IIRC).
1. foo() { auto bar;
2. foo() { static bar;
3. foo() { extrn bar;
4. foo() { bar;
All mean something different in B: the first three instances of bar are declarations, t
Finally! (Score:2)
Comments (Score:2)
And are there any haiku?
the known answer (Score:2)
0 row(s) returned
plagiarism at its finest [thinkgeek.com]
mod -1 lame
Don't mess around, learn from NLP folks (Score:5, Insightful)
This is a good opportunity to build complex statistics about the C++ grammar actually used in context. Learn from the NLP [wikipedia.org] people! Parse the whole thing, and start finding common subtrees in the grammar used. Look at common lexical entries between subtrees, so we can make a tool that can help recognize errors by comparing against commonly used C++ grammar fragments. Or do function completion based on what kind of function you look like you're writing. See if you can do alignment with similar languages and do statistical source translation [wikipedia.org]. If you keep information about comments used (and maybe apply some real NLP), you might even have a shot at automatically classifying functions based on their form, and documenting them with simple comments.
If that's too hard, try finding all n-grams [wikipedia.org] instead, at least under some length. That's a lot more useful than just individual tokens or strings.
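A minimal sketch of the n-gram counting idea, assuming the code has already been lexed and the tokens joined with spaces. ngram_counts is a made-up name, and a real tool would plug in a proper C++ lexer instead of splitting on whitespace.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Counts every run of `n` consecutive whitespace-separated tokens.
// Assumption: `tokens` is pre-tokenized source joined with single spaces.
std::map<std::string, int> ngram_counts(const std::string& tokens,
                                        std::size_t n) {
    assert(n > 0);
    std::vector<std::string> toks;
    std::istringstream in(tokens);
    for (std::string t; in >> t;) toks.push_back(t);

    std::map<std::string, int> counts;
    for (std::size_t i = 0; i + n <= toks.size(); ++i) {
        std::string gram = toks[i];
        for (std::size_t k = 1; k < n; ++k) gram += " " + toks[i + k];
        ++counts[gram];
    }
    return counts;
}
```

Running it with n = 2 over two nested for loops already shows pairs like "for (" and "= 0" turning up twice, which is the kind of pattern you'd mine at scale.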
With a lot of data, you can do very cool things. Don't mess around with string frequency counting. C++ is simple compared to English, do something interesting.
That's easy: search for known security holes (Score:2)
Though I don't develop much in C++ currently, and haven't had the time to do anything Linux wise in years, I would love to have an identified location for security-bug free algorithms, etc. that I could use if I need to do more C++ work in the future.
There is boost? (Score:2)
It can't be called 'comprehensive' after that...
Cyclomatic complexity... (Score:2)
TODOs (Score:2, Interesting)
Re:And then... (Score:5, Interesting)
Re:And then... (Score:3, Funny)
Re:Are you proud of 275 million lines of code? (Score:2)
Re:What? Millions of code? (Score:5, Informative)
It's not a searchable database written in 275 million lines of code.
Re:What? Millions of code? (Score:2)
Re:What? Millions of code? (Score:2)
Re:What? Millions of code? (Score:2)
Re:275+ million lines (Score:2)
Knowing that there are not so many women writing (or *sigh* reading) open source, I think it is very unlikely that adding such a line to your source code will get you anywhere. You could try though, and of course tell us what happened.
Re:275+ million lines (Score:3, Funny)
No, no, no.
You do not use lines 1..N on the same lady until it works. It's not like breaking encryption -- you don't get to try all the possible keys.
I have friends who have done this, and they swear it's a percentage game. Choose one line you like, and try it on women 1..N until it
Re:Wtf? (Score:3, Insightful)
A person who is a true programmer in his soul doesn't ask himself "why". Oftentimes the sheer joy of creating something from nothing is enough.
Re:the obvious answer (Score:2)
Re:Size doesn't matter (Score:3, Funny)
Re:Evolution data server and courier imap (Score:2)
...barely the lobby. (Score:2)