Searchable C/C++ DB surpasses 275 million lines

Posted by Hemos on Monday December 05, 2005 @01:27PM from the interesting-applications dept.

Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."

Interesting stats (Score:5, Interesting)

by sparkes ( 125299 ) writes: on Monday December 05, 2005 @01:32PM (#14186097) Homepage Journal

How many lines contain expletives?

Wtf? (Score:1, Interesting)

by GeckoX ( 259575 ) writes: on Monday December 05, 2005 @01:32PM (#14186102)

What, you've created this wonderful piece of software and _now_ want to figure out what to do with it?

Am I missing something here?

Statistics: (Score:5, Interesting)

by duckpoopy ( 585203 ) writes: on Monday December 05, 2005 @01:32PM (#14186104) Journal

1. Lines per function
2. Comment / command ratio
3. Number of curse word variable names
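As a rough illustration of the comment ratio idea in this list, here is a minimal sketch that tallies comment lines and code lines in a piece of C/C++ source. The names `LineCounts` and `count_lines` are made up for this example, and string literals are not parsed, so treat it as a starting point rather than how the search engine actually does it.

```cpp
#include <sstream>
#include <string>

// Hypothetical result type: lines that hold a comment, lines that hold code.
struct LineCounts {
    int comment_lines = 0;
    int code_lines = 0;
};

// Classify each line of `source`. A line with both code and a comment is
// counted in both tallies. Simplification: string literals are not parsed,
// so "//" inside a string is misread as the start of a comment.
LineCounts count_lines(const std::string& source) {
    LineCounts counts;
    bool in_block = false;  // true while inside a /* ... */ comment
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        bool has_comment = false;
        bool has_code = false;
        for (std::size_t i = 0; i < line.size(); ++i) {
            if (in_block) {
                has_comment = true;
                if (line.compare(i, 2, "*/") == 0) { in_block = false; ++i; }
            } else if (line.compare(i, 2, "//") == 0) {
                has_comment = true;
                break;  // the rest of the line is a comment
            } else if (line.compare(i, 2, "/*") == 0) {
                has_comment = true;
                in_block = true;
                ++i;
            } else if (line[i] != ' ' && line[i] != '\t') {
                has_code = true;
            }
        }
        if (has_comment) ++counts.comment_lines;
        if (has_code) ++counts.code_lines;
    }
    return counts;
}
```

Dividing `comment_lines` by `code_lines` per file or per project would give the ratio; a lines-per-function count would need a real parser to find function boundaries.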

Slashdot Block (Score:3, Interesting)

by Yerase ( 636636 ) writes: <randall DOT hand AT gmail DOT com> on Monday December 05, 2005 @01:34PM (#14186125) Homepage

I love the GeSHi page, how it blocks everything from Slashdot. Set up a site to advertise a product, then restrict people from using it....
URLs on this server linked by slashdot.org will be refused. Permission is given to slashdot to mirror content as necessary for the purpose of providing its users access to the information on the site. Slashdot should not attempt to bypass the referer block. Use of the google cache page for the site is acceptable as long as the page(s) concerned have no more than 1 image.

Choice of db? (Score:4, Interesting)

by Anonymous Coward writes: on Monday December 05, 2005 @01:35PM (#14186137)

So, this is not a flame, but I'm curious about your choice of dbs.
I've used mysql for some small projects, but generally it doesn't handle
millions of rows (although the upper limit on rows can be patched with
some additional behaviors). So, for big dbs, I use postgresql.

How did you decide to use mysql? (Was it that the project started,
and grew, or did you know it would handle large numbers of rows
from the start)?

Just curious. This is probably going to be viewed as a flame by many
(particularly those who don't really use dbs very much, but use them
enough to have strong opinions).

Statistics TM (c) (Score:5, Interesting)

by chunews ( 924590 ) writes: on Monday December 05, 2005 @01:38PM (#14186162)

It would be interesting to see the number of different copyright notices contained within all that source code, and then to present the notices in groups, like GPL, GPL2, etc.
Also, I would really like to find "patient 0" for source code. For example, is there a common library or utility function (perhaps Hex2Ascii?) that *everybody* uses? Well, who wrote it first?
And in a similar vein, who are the "top 5-10-100" authors of open source code by use, reuse, KLOC, etc.? Not of too much use unless I were awarding the Nobel prize for programming, or perhaps creating a list of individuals for the RIAA to sue, after they're done with their other useless lawsuits. :)

Interesting Statistics (Score:5, Interesting)

by iso-cop ( 555637 ) writes: on Monday December 05, 2005 @01:39PM (#14186174)

In the software engineering world, people will be interested in all sorts of code metrics such as cyclomatic complexity, operator/operand counts, lines of code per module, and such as well as object oriented metrics for the C++ code (depth of inheritance, for example). If you can marry these sorts of metrics with defect data (bugs) for each of the modules then you have a useful data repository for predicting defects in source code. Keeping around different versions of modules changed is also valuable here. If you can gather information on how long it took to produce the module and how long it took to correct defects in the module you are getting even better. If you make it easy to reuse the C and C++ modules...even better.
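To make one of these metrics concrete, here is a minimal sketch that approximates cyclomatic complexity for a single function body by counting decision points and adding one. The function name `approx_cyclomatic_complexity` is made up for this example, and the token counting is a deliberate simplification: comments, string literals, and uses of `&&` as an rvalue reference are all miscounted, so a real tool would work from a parse tree.

```cpp
#include <cctype>
#include <string>

// Approximate cyclomatic complexity: 1 + number of decision points.
// Decision points counted here: if, for, while, case, catch, ?, && and ||.
int approx_cyclomatic_complexity(const std::string& body) {
    int decisions = 0;
    std::string token;  // identifier or keyword currently being read
    auto flush = [&] {
        if (token == "if" || token == "for" || token == "while" ||
            token == "case" || token == "catch") {
            ++decisions;
        }
        token.clear();
    };
    for (std::size_t i = 0; i < body.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(body[i]);
        if (std::isalnum(c) || c == '_') {
            token += body[i];
            continue;
        }
        flush();
        if (c == '?') {
            ++decisions;
        } else if ((c == '&' || c == '|') && i + 1 < body.size() &&
                   body[i + 1] == body[i]) {
            ++decisions;  // short-circuit operator && or ||
            ++i;
        }
    }
    flush();
    return decisions + 1;
}
```

Pairing a number like this with per-module defect counts is the kind of data set the comment describes.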

Re:And then... (Score:5, Interesting)

by Sembiance ( 124190 ) writes: on Monday December 05, 2005 @01:39PM (#14186176) Homepage

Advertise? No, I'm just a single coder doing this for fun and hope that some people will find it useful.

Amazon style statistics (Score:5, Interesting)

by tod_miller ( 792541 ) writes: on Monday December 05, 2005 @01:39PM (#14186180) Journal

I was very impressed with Amazon, who for each book say which phrases and words are particularly unique to that book. (Reminds me of that Google game where you try to get any two words with only 1 hit.)

So show code with coloured background to the lines, from green to red, green being 'normal every day boiler plate' code, red would mean this code must be more specialised, or written by some half-wit l33t h4x0r at least.

I forgot what they called it, but they had 3/4 visible stats based on the semantics of the stuff, probably more under the 'hood (omg lol).

word. Oh some adhesion stats would rock!

please type the word in this image: adhesion
random letters - if you are visually impaired, please email us at pater@slashdot.org

Re:My vote is for... (Score:5, Interesting)

by epiphani ( 254981 ) writes: <epiphani&dal,net> on Monday December 05, 2005 @01:43PM (#14186223)

Same type of thing, but indenting styles. K&R vs. BSD, etc. I'm curious how that breaks up.

(Partial to BSD style myself..)

Re:Choice of db? (Score:1, Interesting)

by Anonymous Coward writes: on Monday December 05, 2005 @01:44PM (#14186232)

Tell that to the 23 million row table I'm currently playing with - no tweaking or patching needed. What version of MySQL have you tried "large" databases with? (23 million rows isn't large.)

Re:Slashdot Block (Score:2, Interesting)

by wampus ( 1932 ) writes: on Monday December 05, 2005 @01:45PM (#14186238)

That's why I use Cacheout [thetechgurus.net]. It's a Firefox extension that adds a context menu item to coralize any link. Bypass the restriction AND not kill the site, all at the same time.

Sounds kind of like the PMD scoreboard... (Score:5, Interesting)

by tcopeland ( 32225 ) * writes: <tom.thomasleecopeland@com> on Monday December 05, 2005 @01:48PM (#14186268) Homepage

...that is, a static analysis of a bunch of Java SourceForge projects [sourceforge.net]. It does unused code and duplicate code detection... sometimes it finds some interesting things.

PMD home page is here [sf.net], book site is here [pmdapplied.com].

Please check for this: comma in brackets in C++ (Score:5, Interesting)

by Animats ( 122034 ) writes: on Monday December 05, 2005 @01:58PM (#14186354) Homepage

C++, for historical reasons dating back to C, has weird semantics for commas in brackets. The operator precedence for commas is different inside of "()" and "[]".
So tab(i,j) is a function call with two arguments. But tab[i,j] is an invocation of the "comma operator", then a subscript (a call to "operator[]") with one argument. The default "comma operator" evaluates and discards the first argument and returns the second. It once had some uses in C macros.
I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++. But there's a concern that somewhere, someone might have code that depends on the current semantics of the comma operator inside square brackets.
This new archive offers the opportunity to eliminate that possibility. So, do this search: Find, in non-comment standard C++ code, any occurrences of a comma operator within square brackets. Eliminate any where there are parentheses within the square brackets enclosing the comma. Can you find any? In any production code? In any open-source project? Anywhere?
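The search described here could be approximated with a simple delimiter stack: report every comma whose innermost enclosing delimiter is a square bracket. The sketch below is only a heuristic, and the function name `commas_in_subscripts` is made up for this example. It assumes comments and string/char literals have already been stripped, and it will flag false positives such as a comma inside template arguments within a subscript.

```cpp
#include <string>
#include <vector>

// Returns the offsets of every comma whose innermost enclosing delimiter is
// '['. Commas wrapped in parentheses inside the brackets are skipped, as the
// comment above asks. Assumes `code` contains no comments or literals.
std::vector<std::size_t> commas_in_subscripts(const std::string& code) {
    std::vector<std::size_t> hits;
    std::vector<char> open;  // stack of currently open delimiters
    for (std::size_t i = 0; i < code.size(); ++i) {
        const char c = code[i];
        if (c == '[' || c == '(' || c == '{') {
            open.push_back(c);
        } else if (c == ']' || c == ')' || c == '}') {
            if (!open.empty()) open.pop_back();
        } else if (c == ',' && !open.empty() && open.back() == '[') {
            hits.push_back(i);
        }
    }
    return hits;
}
```

For the input `tab[i,j] = f(a,b) + t[(i,j)];` only the comma in `tab[i,j]` is reported; the function-call comma and the parenthesized one are ignored.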

Code Styles (Score:5, Interesting)

by ionrock ( 516345 ) writes: on Monday December 05, 2005 @02:09PM (#14186443)

I would love to see if different code styles could be analyzed to see how many people use what sort of syntax style. There is camelCase and under_scores but it seems possible to find more complicated trends that might allow reviews to statistically determine what practices really help to make code better.
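A first cut at the camelCase versus under_scores tally could classify each identifier by the characters it contains. This is a minimal sketch with made-up names (`NameStyle`, `classify_identifier`); it lumps PascalCase in with camelCase and puts everything else, such as ALL_CAPS macros and single-word names, in a catch-all bucket.

```cpp
#include <cctype>
#include <string>

enum class NameStyle { CamelCase, UnderScore, Other };

// Classify one identifier:
//   UnderScore: has '_' and lowercase letters but no uppercase (under_scores)
//   CamelCase:  no '_' and a mix of upper and lowercase (camelCase, PascalCase)
//   Other:      everything else (MAX_SIZE, i, count, ...)
NameStyle classify_identifier(const std::string& name) {
    bool has_underscore = false;
    bool has_upper = false;
    bool has_lower = false;
    for (unsigned char c : name) {
        if (c == '_') has_underscore = true;
        else if (std::isupper(c)) has_upper = true;
        else if (std::islower(c)) has_lower = true;
    }
    if (has_underscore && has_lower && !has_upper) return NameStyle::UnderScore;
    if (!has_underscore && has_upper && has_lower) return NameStyle::CamelCase;
    return NameStyle::Other;
}
```

Running this over every declared identifier in a project and comparing the two counts would give a per-project style profile.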

histogram of C reserved words (Score:5, Interesting)

by jab ( 9153 ) writes: on Monday December 05, 2005 @02:14PM (#14186487) Homepage

I'd love to see how one of my programs (stats below) compares to the, uh, national average.

1222 if
638 return
482 static
413 for
399 int
217 const
201 else
194 void
128 char
115 case
112 break
55 default
43 sizeof
37 do
35 switch
27 enum
24 struct
23 while
15 float
14 typedef
10 auto
7 unsigned
6 extern
1 long

TODOs (Score:2, Interesting)

by mrshoe ( 697123 ) writes: on Monday December 05, 2005 @02:52PM (#14186841) Homepage

Counting the number of "TODO"s and "XXX"s in "production" open source code could be interesting.
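Counting such markers is a plain substring search. Here is a minimal sketch; the function name `count_markers` is made up for this example, and because it matches substrings anywhere in the text, an identifier such as `TODOS` would also count as a `TODO`. Restricting the search to comment text would tighten it up.

```cpp
#include <map>
#include <string>
#include <vector>

// Count non-overlapping occurrences of each marker string in `text`.
std::map<std::string, int> count_markers(const std::string& text,
                                         const std::vector<std::string>& markers) {
    std::map<std::string, int> counts;
    for (const std::string& marker : markers) {
        if (marker.empty()) continue;  // an empty marker would never advance
        int n = 0;
        for (std::size_t pos = text.find(marker); pos != std::string::npos;
             pos = text.find(marker, pos + marker.size())) {
            ++n;
        }
        counts[marker] = n;
    }
    return counts;
}
```

Calling it with `{"TODO", "XXX"}` over each file and summing the results would give the project-wide totals.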

Re:Please check for this: comma in brackets in C++ (Score:4, Interesting)

by The boojum ( 70419 ) writes: on Monday December 05, 2005 @03:19PM (#14187087)

I was just going to point this out. I even hacked up a simple example to show it:

struct location {
    int dimension, coordinates[ 20 ];
    location( int first_coordinate ) : dimension( 1 ) {
        coordinates[ 0 ] = first_coordinate;
    }
    // Overloaded comma: appends one more coordinate.
    location &operator,( int const right ) {
        coordinates[ dimension++ ] = right;
        return *this;
    }
};

struct array {
    int matrix[ 100 ][ 100 ];
    int &operator[]( location const &right ) {
        return matrix[ right.coordinates[ 1 ] ][ right.coordinates[ 0 ] ];
    }
};

int main( int argc, char **argv ) {
    array blah;
    // The left operand must be a location: with two plain ints the built-in
    // comma operator would run instead and coordinates[ 1 ] would never be set.
    blah[ location( 5 ), 5 ] = 10;
}

Proof of concept and it doesn't really do anything, but it compiles just fine. I don't see a problem here. A real implementation would probably do some clever stuff so that the optimizer can optimize away the intermediate data structure.
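One caveat for this approach: an overloaded comma operator is only considered when at least one operand has class or enumeration type. With two plain ints the built-in comma operator is used, which evaluates the left operand, discards it, and yields the right one. The sketch below shows both cases side by side; `Coord`, `Index2`, and the two function names are made up for this example, and the comma expressions are parenthesized so their meaning is unambiguous.

```cpp
// Hypothetical pair of coordinates produced by the overloaded comma.
struct Index2 {
    int first;
    int second;
};

// Hypothetical wrapper type, needed so that the overload can apply at all.
struct Coord {
    int value;
};

// Overloaded comma: combines a wrapped coordinate and an int into a pair.
Index2 operator,(Coord a, int b) { return Index2{a.value, b}; }

// Both operands are int, so this is the built-in comma operator:
// i is evaluated and discarded, and the result is j.
int builtin_comma(int i, int j) {
    return (static_cast<void>(i), j);
}

// The left operand has class type, so operator,(Coord, int) is called.
Index2 overloaded_comma(int i, int j) {
    return (Coord{i}, j);
}
```

So `builtin_comma(3, 7)` yields 7, while `overloaded_comma(3, 7)` yields the pair (3, 7).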

HACK, TODO, BUG & FIXME (Score:2, Interesting)

by Pete Brubaker ( 35550 ) writes: <{pbman96} {at} {hotmail.com}> on Monday December 05, 2005 @03:52PM (#14187410) Homepage Journal

I recently did a search on some of our codebase here at work to see how many times the above keywords remained in shipping code. I was a little surprised to see how many cases there were in our code. I think sometimes, maybe even most of the time, we as programmers overuse these words.

Pete

Re:My vote is for... (Score:5, Interesting)

by baadger ( 764884 ) writes: on Monday December 05, 2005 @03:57PM (#14187459)

There's an idea right there, how about some stats showing popularity of various coding conventions?
- Variables: under_score vs. camelCase
- Tabs vs. spaces
- "if (cond) {" vs. "if (cond)\n{"
- How many coders bother enclosing single conditionally executed statements with {}
- How many coders bother producing comments directly before or after function definitions, describing function implementation?
- Lines of comments to lines of code ratios
- Number of functions to lines of code ratios for various projects?
- Number of projects making use of global variables?
- C, to C++, to C# (if your engine covers it) project ratio
etc
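The brace placement item in this list, "if (cond) {" versus "if (cond)\n{", can be estimated line by line: an opening brace alone on its line suggests the next-line (BSD) style, and a line that ends in an opening brace after other text suggests the same-line (K&R) style. This is a minimal sketch with made-up names (`BraceStyleCounts`, `count_brace_styles`); it counts every such brace, including ones that open initializer lists, so it is only a rough signal.

```cpp
#include <sstream>
#include <string>

// Hypothetical tally of opening-brace placements.
struct BraceStyleCounts {
    int same_line = 0;  // "if (cond) {"
    int next_line = 0;  // "{" alone on its own line
};

BraceStyleCounts count_brace_styles(const std::string& source) {
    BraceStyleCounts counts;
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) continue;  // blank line
        const std::size_t last = line.find_last_not_of(" \t\r");
        if (line[last] != '{') continue;           // no trailing open brace
        if (first == last) ++counts.next_line;     // the brace is the whole line
        else ++counts.same_line;
    }
    return counts;
}
```

Comparing the two counts per project would show which convention dominates.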

Re:Yet another source code search engine? (Score:2, Interesting)

by Sembiance ( 124190 ) writes: on Monday December 05, 2005 @04:36PM (#14187815) Homepage

I just did it for fun, and hopefully some people might get some use out of it.

This engine understands the code at a C/C++ syntax level, unlike koders.com, so you can better search for what you're after (comments, functions, macros, classes, etc).

Also this engine DOES allow you to click on words in the code, but only includes and function or macro calls.

There are several things that are not that great about my site: it's a little slow, doesn't support free text searching nor variable searching, and you can't copy search URLs for pasting (uses XMLHttp and form POSTs).

But it's just me doing this thing, and I have limited time and most importantly limited money/hardware.

My wish is for Google to do their own but index a LOT more code and have it be fast and friendly :)

They certainly have the resources to do it and it would be a great tool for coders to use. Maybe this will help fill a gap in the meantime :)
