Searchable C/C++ DB surpasses 275 million lines
Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."
Interesting stats (Score:5, Interesting)
Wtf? (Score:1, Interesting)
Am I missing something here?
Statistics: (Score:5, Interesting)
2. Comment / command ratio
3. Number of curse word variable names
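A rough sketch of how the comment ratio suggested above could be measured, in Python. The function name `classify_lines` is hypothetical, and "command" is read here as "line containing code", which is an assumption. The scanner handles `//` and `/* */` comments plus string and character literals, but ignores the preprocessor, trigraphs, and backslash-continued line comments.

```python
def classify_lines(source: str):
    """Return (comment_line_count, code_line_count) for C/C++ source text.

    A line that holds both code and a comment is counted in both totals.
    """
    comment_lines, code_lines = set(), set()
    state = "code"  # one of: code, line_comment, block_comment, string, char
    line, i, n = 1, 0, len(source)
    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""
        if ch == "\n":
            if state == "line_comment":
                state = "code"  # a // comment ends at the newline
            line += 1
            i += 1
            continue
        if state == "code":
            if ch == "/" and nxt == "/":
                state = "line_comment"
                comment_lines.add(line)
                i += 2
                continue
            if ch == "/" and nxt == "*":
                state = "block_comment"
                comment_lines.add(line)
                i += 2
                continue
            if ch == '"':
                state = "string"
            elif ch == "'":
                state = "char"
            if not ch.isspace():
                code_lines.add(line)
        elif state == "block_comment":
            if ch == "*" and nxt == "/":
                state = "code"
                i += 2
                continue
            if not ch.isspace():
                comment_lines.add(line)
        elif state in ("string", "char"):
            code_lines.add(line)
            if ch == "\\" and nxt and nxt != "\n":
                i += 2  # skip the escaped character
                continue
            if (state == "string" and ch == '"') or (state == "char" and ch == "'"):
                state = "code"
        i += 1
    return len(comment_lines), len(code_lines)
```

Dividing the first number by the second gives a per-file ratio; aggregating across a whole archive would be a matter of summing both counts first.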
Slashdot Block (Score:3, Interesting)
Choice of db? (Score:4, Interesting)
I've used mysql for some small projects, but generally it doesn't handle
millions of rows (although the upper limit on rows can be patched with
some additional behaviors). So, for big dbs, I use postgresql.
How did you decide to use mysql? (Was it that the project started,
and grew, or did you know it would handle large numbers of rows
from the start)?
Just curious. This is probably going to be viewed as a flame by many
(particularly those who don't really use dbs very much, but use them
enough to have strong opinions).
Statistics TM (c) (Score:5, Interesting)
Also, I would really like to find "patient 0" for source code. For example, is there a common library or utility function (perhaps Hex2Ascii?) that *everybody* uses? Well, who wrote it first?
And in a similar vein, who are the "top 5-10-100" authors of open source code by use, reuse, KLOC, etc.? Not of too much use unless I were awarding the Nobel prize for programming, or perhaps creating a list of individuals for the RIAA to sue, after they're done with their other useless lawsuits. :)
Interesting Statistics (Score:5, Interesting)
Re:And then... (Score:5, Interesting)
Amazon style statistics (Score:5, Interesting)
So show code with a coloured background on the lines, from green to red: green being 'normal everyday boilerplate' code, while red would mean the code must be more specialised, or written by some half-wit l33t h4x0r at least.
I forgot what they called it, but they had 3 or 4 visible stats based on the semantics of the stuff, probably more under the 'hood (omg lol).
word. Oh some adhesion stats would rock!
please type the word in this image: adhesion
random letters - if you are visually impaired, please email us at pater@slashdot.org
Re:My vote is for... (Score:5, Interesting)
(Partial to BSD style myself..)
Re:Choice of db? (Score:1, Interesting)
Re:Slashdot Block (Score:2, Interesting)
Sounds kind of like the PMD scoreboard... (Score:5, Interesting)
PMD home page is here [sf.net], book site is here [pmdapplied.com].
Please check for this: comma in brackets in C++ (Score:5, Interesting)
So tab(i,j) is a function call with two arguments. But tab[i,j] is an invocation of the "comma operator", followed by a call to operator[] with one argument. The built-in "comma operator" evaluates its first operand, discards the result, and returns the second. It once had some uses in C macros.
I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++. But there's a concern that somewhere, someone might have code that depends on the current semantics of the comma operator inside square brackets.
This new archive offers the opportunity to eliminate that possibility. So, do this search: Find, in non-comment standard C++ code, any occurrences of a comma operator within square brackets. Eliminate any where there are parentheses within the square brackets enclosing the comma. Can you find any? In any production code? In any open-source project? Anywhere?
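The search described above could be approximated with a small text scanner. This is a sketch in Python, not how the archive's CodeWorker-based parser works. The name `commas_in_brackets` is hypothetical. It blanks out comments and string/character literals, then reports any comma whose innermost enclosing delimiter is `[`. It does not understand templates, so a comma inside angle brackets within a subscript would show up as a false positive and need checking by hand.

```python
import re

# Comments and literals are blanked out first so their contents are ignored.
_NOISE = re.compile(r"""
    //[^\n]*                  # line comment
  | /\*.*?\*/                 # block comment
  | "(?:\\.|[^"\\\n])*"       # string literal
  | '(?:\\.|[^'\\\n])*'       # character literal
""", re.VERBOSE | re.DOTALL)


def _blank(match):
    # Replace everything except newlines, so line numbers stay correct.
    return re.sub(r"[^\n]", " ", match.group())


def commas_in_brackets(source: str):
    """Return line numbers of commas sitting directly inside [ ... ]."""
    cleaned = _NOISE.sub(_blank, source)
    hits, stack, line = [], [], 1
    for ch in cleaned:
        if ch == "\n":
            line += 1
        elif ch in "([{":
            stack.append(ch)
        elif ch in ")]}":
            if stack:
                stack.pop()
        elif ch == "," and stack and stack[-1] == "[":
            # Innermost open delimiter is a square bracket: candidate hit.
            hits.append(line)
    return hits
```

Commas wrapped in parentheses or braces inside the brackets, as in `tab[f(i, j)]`, are skipped because the innermost delimiter is then `(` or `{`, which matches the exclusion asked for in the post.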
Code Styles (Score:5, Interesting)
histogram of C reserved words (Score:5, Interesting)
to the, uh, national average.
1222 if
638 return
482 static
413 for
399 int
217 const
201 else
194 void
128 char
115 case
112 break
55 default
43 sizeof
37 do
35 switch
27 enum
24 struct
23 while
15 float
14 typedef
10 auto
7 unsigned
6 extern
1 long
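For anyone wanting to produce a comparable list from their own code, here is a minimal Python sketch. How the poster generated the numbers above is not stated, so this is only one plausible approach. The names `keyword_histogram` and `format_histogram` are hypothetical, and the keyword set is the 32 C89 reserved words. Preprocessor directives are not treated specially, so `#if` and `#else` are counted as `if` and `else`.

```python
import re
from collections import Counter

# The 32 reserved words of C89.
C_KEYWORDS = frozenset("""
    auto break case char const continue default do double else enum extern
    float for goto if int long register return short signed sizeof static
    struct switch typedef union unsigned void volatile while
""".split())

# Comments and literals are removed so words inside them are not counted.
_NOISE = re.compile(r"""
    //[^\n]*                  # line comment
  | /\*.*?\*/                 # block comment
  | "(?:\\.|[^"\\\n])*"       # string literal
  | '(?:\\.|[^'\\\n])*'       # character literal
""", re.VERBOSE | re.DOTALL)


def keyword_histogram(source: str) -> Counter:
    """Count occurrences of each C reserved word in the source text."""
    cleaned = _NOISE.sub(" ", source)
    words = re.findall(r"[A-Za-z_]\w*", cleaned)
    return Counter(w for w in words if w in C_KEYWORDS)


def format_histogram(hist: Counter) -> str:
    """Render as 'count keyword' lines, most frequent first."""
    return "\n".join(f"{count:5d} {word}" for word, count in hist.most_common())
```

Summing the `Counter` objects from many files (they support `+`) would give the archive-wide totals to compare a single project against.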
TODOs (Score:2, Interesting)
Re:Please check for this: comma in brackets in C++ (Score:4, Interesting)
HACK, TODO, BUG & FIXME (Score:2, Interesting)
Pete
Re:My vote is for... (Score:5, Interesting)
etc
Re:Yet another source code search engine? (Score:2, Interesting)
This engine understands the code at a C/C++ syntax level, unlike koders.com, so you can better search for what you're after (comments, functions, macros, classes, etc.).
Also, this engine DOES allow you to click on words in the code, but only on includes and function or macro calls.
There are several things that are not that great about my site: it's a little slow, it doesn't support free-text searching or variable searching, and you can't copy search URLs for pasting (it uses XMLHttp and form POSTs).
But it's just me doing this thing, and I have limited time and most importantly limited money/hardware.
My wish is for Google to do their own, but index a LOT more code and have it be fast and friendly.
They certainly have the resources to do it, and it would be a great tool for coders to use. Maybe this will help fill a gap in the meantime.