Searchable C/C++ DB surpasses 275 million lines

Posted by Hemos on Monday December 05, 2005 @01:27PM from the interesting-applications dept.

Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."

Re:Statistics: (Score:3, Informative)

by Anonymous Coward writes: on Monday December 05, 2005 @01:40PM (#14186181)

From the stats page if you cannot get to it...

Overall Stats
Number of Packages: 10,931
Total Number of Files: 1,151,819
Total Lines of Code (No comments, no blank lines): 283,119,081
Total of All Lines: 420,355,464
Total Number of Functions: 7,782,468
Total Number of Functions Called: 69,500,700
Total Number of Macros: 9,947,564
Total Number of Classes: 209,361
Total Number of Comments: 38,125,107
Total Number of Structures: 554,178
Total Number of Unions: 19,687
Total Number of Includes: 5,904,187
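
A few ratios fall straight out of these totals. The sketch below only restates the posted figures as constants and divides them; the constant and function names are made up for illustration and are not part of the site.

```cpp
#include <cassert>
#include <cstdint>

// Figures copied from the stats listing above.
constexpr std::int64_t kFiles     = 1151819;
constexpr std::int64_t kCodeLines = 283119081;  // no comments, no blank lines
constexpr std::int64_t kAllLines  = 420355464;
constexpr std::int64_t kFunctions = 7782468;
constexpr std::int64_t kCalls     = 69500700;

// Plain floating-point division; the denominator must be non-zero.
double ratio(std::int64_t numerator, std::int64_t denominator) {
    assert(denominator != 0);
    return static_cast<double>(numerator) / static_cast<double>(denominator);
}

double code_share()         { return ratio(kCodeLines, kAllLines); }  // share of lines that are code
double lines_per_file()     { return ratio(kAllLines, kFiles); }      // average file length
double calls_per_function() { return ratio(kCalls, kFunctions); }     // call sites per function
```

With the posted numbers this works out to roughly two thirds of all lines being code, about 365 lines per file, and close to 9 function calls per function counted.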

Hit Refresh (Score:5, Informative)

by everphilski ( 877346 ) writes: on Monday December 05, 2005 @01:44PM (#14186229) Journal

Just hit refresh and the webserver won't get the HTTP_REFERER (granted, you'll have to manually delete the text file he serves you).
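
For anyone wondering what such a block looks like on the server side, here is a minimal sketch. It assumes the site simply tests the Referer value for "slashdot.org"; the actual check is not shown anywhere in this thread, and the function name is hypothetical.

```cpp
#include <string>

// Hypothetical server-side check: returns true when the request should get
// the placeholder text file instead of the real page.
bool is_blocked_referer(const std::string& referer) {
    // An empty value models a request that carries no Referer header at all,
    // which is the case the comment above relies on.
    if (referer.empty()) {
        return false;
    }
    return referer.find("slashdot.org") != std::string::npos;
}
```

A request without the header passes this check, which is why the workaround depends on the browser not resending the original value.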

-everphilski-

Re:What? Millions of code? (Score:5, Informative)

by tgd ( 2822 ) writes: on Monday December 05, 2005 @01:49PM (#14186276)

It's a searchable database OF code from other products, containing 275 million lines you can search across.

It's not a searchable database written in 275 million lines of code.

See also: Codase.com (Score:3, Informative)

by kriegsman ( 55737 ) writes: on Monday December 05, 2005 @02:00PM (#14186371) Homepage

See also Codase.com [codase.com], another "Source Code Search Engine", which lets you search by method names, class names, variable names, free text, etc.

-Mark

Koders.com (Score:3, Informative)

by knipknap ( 769880 ) writes: on Monday December 05, 2005 @02:01PM (#14186376) Homepage

Don't know, koders.com [koders.com] supports a lot more languages and also lets you narrow your search to specific licenses. The few extra lines of code just don't seem to do it, especially because such measures highly depend on the chosen method.

How about a potential buffer overflow index? (Score:5, Informative)

by raddan ( 519638 ) writes: on Monday December 05, 2005 @02:07PM (#14186429)

You can start by seeing how often people use gets(), strcpy(), strcat(), etc... Look for all the fun little common mistakes that people make.
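
A rough sketch of how such a count could be taken over one source file. It is a plain regex scan, not the site's syntax-aware parser, so it also counts matches inside comments and string literals; the function name is made up for illustration.

```cpp
#include <map>
#include <regex>
#include <string>

// Counts textual call sites of gets(), strcpy() and strcat() in a source string.
// The word boundary keeps fgets() and my_strcat() from matching.
std::map<std::string, int> count_risky_calls(const std::string& source) {
    static const std::regex call_pattern(R"(\b(gets|strcpy|strcat)\s*\()");
    std::map<std::string, int> counts;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(source.begin(), source.end(), call_pattern); it != end; ++it) {
        ++counts[(*it)[1].str()];  // capture group 1 holds the function name
    }
    return counts;
}
```

Dividing the totals by lines of code per package would turn this into the kind of index the subject line asks for.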

Re:Slashdot Block (Score:3, Informative)

by Kjella ( 173770 ) writes: on Monday December 05, 2005 @02:25PM (#14186580) Homepage

If you don't want to pay a big bandwidth bill then don't run a webserver.

For every problem, there is a solution that is simple, elegant and wrong. In every other market, the more demand there is, the higher the price/revenue/profit. Web servers are pretty much the only place where you lose more money the more popular you are (e-commerce sites and such not included). If so many people want the content, they can find a way to share it. Even then they're getting a bloody good deal, if you ask me. What exactly are you complaining about, that they aren't generous *enough*? Blocking slashdottings is a small price to pay compared to turning it into a [ad] pile [ad] of [ad] advertisements or subscription site. That is what you do if you "don't want a big bandwidth bill".

Re:Please check for this: comma in brackets in C++ (Score:4, Informative)

by chris macura ( 899109 ) writes: on Monday December 05, 2005 @02:29PM (#14186615)

Yes, they are. But from an OOP standpoint, it's impossible to create a data structure that "knows" you're using the [] operator twice. So if you overload the [] operator in an array structure to get multi-dimensional arrays, you have to nest single-dimension arrays, which is almost always inefficient because the rows (or columns, depending on whether you're row-major or column-major) are scattered around RAM (depending on where they were allocated) rather than sitting in one contiguous chunk as in C. In other words, you can't do something like this in C++:

    class SmartArray {
    public:
        SmartArray(int height, int width);
        int operator[](const int &x, const int &y) const;
        // ...
    };
    ...
    SmartArray a(5, 5);
    a[12, 13];
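
One common way around this, which is not what the comment proposes, is to overload operator() instead of operator[], since the function-call operator may take two arguments. The sketch below keeps all cells in one contiguous block; the member names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class SmartArray {
public:
    SmartArray(std::size_t height, std::size_t width)
        : height_(height), width_(width), cells_(height * width, 0) {}

    // Two-argument access, row-major: element (row, col) lives at row * width + col.
    int& operator()(std::size_t row, std::size_t col) {
        assert(row < height_ && col < width_);
        return cells_[row * width_ + col];
    }
    int operator()(std::size_t row, std::size_t col) const {
        assert(row < height_ && col < width_);
        return cells_[row * width_ + col];
    }

    // Exposes the single contiguous block backing the array.
    const int* data() const { return cells_.data(); }

private:
    std::size_t height_;
    std::size_t width_;
    std::vector<int> cells_;
};
```

Usage looks like `SmartArray a(5, 5); a(2, 3) = 42;`, so the call site changes from brackets to parentheses but the storage stays in one chunk.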

Re:Please check for this: comma in brackets in C++ (Score:3, Informative)

by milgr ( 726027 ) writes: on Monday December 05, 2005 @02:38PM (#14186695)

The grandparent got it correct. C does support multidimensional arrays. I suspect that C++ does too.
To validate, I pulled out my copy of K&R 2nd edition (Actually a copy I once rescued from a trash bin, and my copy is only "Based on Draft-Proposed ANSI C"). In section 5.9 Pointers vs. Multi-dimensional Arrays it points out,

Newcomers to C are sometimes confused about the difference between a two-dimensional array and an array of pointers, such as name in the example above. Given the definitions

int a[10][20]; int *b[10];

then a[3][4] and b[3][4] are both syntactically legal references to a single int. But a is a true two-dimensional array: 200 int-sized locations have been set aside, and the conventional rectangular subscript calculation 20×row+col is used to find the element a[row,col]. For b, however, the definition only allocates 10 pointers and does not initialize them; initialization must be done explicitly, either statically or with code.
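
The quoted subscript calculation can be checked directly. This is a small sketch using the same two definitions; the two helper functions are added here for illustration only.

```cpp
#include <cstddef>

int a[10][20];  // true two-dimensional array: 200 ints in one block
int* b[10];     // ten pointers only; no rows are allocated here

// Byte offset of a[row][col] from the start of a, measured from real addresses.
std::ptrdiff_t measured_offset(int row, int col) {
    return reinterpret_cast<const char*>(&a[row][col]) -
           reinterpret_cast<const char*>(&a[0][0]);
}

// The same offset predicted by the rectangular formula 20*row + col.
std::ptrdiff_t predicted_offset(int row, int col) {
    return static_cast<std::ptrdiff_t>((20 * row + col) * sizeof(int));
}
```

For any valid row and column the two functions agree, and `sizeof(a)` is 200 ints while `sizeof(b)` is only 10 pointers.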

Re:Interesting stats (Score:5, Informative)

by moosesocks ( 264553 ) writes: on Monday December 05, 2005 @03:05PM (#14186951) Homepage

How many lines contain expletives?

for your reading pleasure [vidarholen.net].... the linux kernel fuck count

Re:My vote is for... (Score:2, Informative)

by Anonymous Coward writes: on Monday December 05, 2005 @03:36PM (#14187243)

K&R!!
ONLY K&R!!!!

Seriously, I am a K&R maniac, which caused me to get quite irritated at one of my professors, who once wrote "confusing braces" on a programming assignment I handed in. (It was a little confusing, but because I was being clever and efficient, not because of my braces preferences.)
I think the proportion of code written in K&R vs. The Incorrect Styles would be very interesting to see.

Re:Statistics: (Score:2, Informative)

by Sembiance ( 124190 ) writes: on Monday December 05, 2005 @04:07PM (#14187550) Homepage

You can see the license type broken down here:

http://csourcesearch.net/license/ [csourcesearch.net]

You can also click on any of those licenses and then on that page choose to only search for code found in that license.

Re:Choice of db? (Score:4, Informative)

by Sembiance ( 124190 ) writes: on Monday December 05, 2005 @04:18PM (#14187649) Homepage

I've used MySQL in the past for some projects at work, where the number of rows was several hundred million and it ran with no problems, so I knew it was capable of handling large row counts.

I initially used their FULLTEXT indexing as well, but it dies a horrible death with a large number of rows or search terms. (The developers that live in #mysql on Freenode confirmed this)

So I had to hand off searching to Lucene, which worried me a great deal (being Java) but as folks tell me 'Java is not slow'.
They are right, Java is very fast at handling the searching and I've been very impressed.
Most searches in the Java database only take one or two seconds.
The MySQL query/join for additional info takes another 4 or 5 seconds.

Most searches take about 8 seconds to come up, even under no load.

I simply don't have enough RAM to keep the necessary MySQL indexes in RAM and use index only queries.
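
The two-stage shape described above (a text index returns matching ids, then a second lookup joins in the extra info) can be sketched in a few lines. This uses in-memory maps as stand-ins and does not call Lucene or MySQL; every type and function name here is hypothetical.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct FunctionRow {
    std::string package;
    std::string file;
};

class TwoStageSearch {
public:
    void add(int id, const std::string& text, FunctionRow row) {
        std::istringstream words(text);
        std::string word;
        while (words >> word) {
            index_[word].insert(id);  // inverted index: word -> ids
        }
        rows_[id] = std::move(row);
    }

    // Stage 1 (the text index's job): ids whose text contains every query word.
    std::set<int> match_ids(const std::string& query) const {
        std::istringstream words(query);
        std::string word;
        std::set<int> result;
        bool first = true;
        while (words >> word) {
            const auto found = index_.find(word);
            if (found == index_.end()) {
                return {};
            }
            if (first) {
                result = found->second;
                first = false;
                continue;
            }
            std::set<int> kept;
            for (int id : result) {
                if (found->second.count(id) != 0) {
                    kept.insert(id);
                }
            }
            result = std::move(kept);
        }
        return result;
    }

    // Stage 2 (the database's job): fetch the stored row for each matching id.
    std::vector<FunctionRow> search(const std::string& query) const {
        std::vector<FunctionRow> out;
        for (int id : match_ids(query)) {
            out.push_back(rows_.at(id));
        }
        return out;
    }

private:
    std::map<std::string, std::set<int>> index_;
    std::map<int, FunctionRow> rows_;
};
```

In this split the second stage only ever does lookups by id, which is the part that benefits most from indexes that fit in RAM.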

Proposed workaround doesn't work (Score:4, Informative)

by Animats ( 122034 ) writes: on Monday December 05, 2005 @04:31PM (#14187773) Homepage

Yes, that compiles and runs, but it doesn't do what you think it does. Put in some debug print to see what's actually happening, which is this:
- "5,5" is evaluated using the built-in definition of ",", returning "5". The no-conversion built-in operator comma has higher priority than the conversion sequence involving a conversion to "location", then the use of the overloaded comma operator. So the built-in comma operator is used. See the discussion in the C++ ARM, section 13.2, "Argument matching", which says "consider an exact match better than any conversion".
- "5" is converted to type "location" by the constructor for "location", resulting in a "location" object with "dimension=1" and "coordinates[0]=5".
- This "location" object is passed to "operator[]", which then accesses "coordinates[1]", an uninitialized value, which it then uses as a subscript, returning a reference to an arbitrary memory location. So, instead of returning "&blah.matrix[5][5]", it returns "&blah.matrix[???][5]". The example program seems to run in VC++ only because that part of memory happens to be 0 at startup, so this returns "&blah.matrix[0][5]". In other circumstances, it might cause a crash.
- "10" is stored into the wrong location of "blah", or outside it, due to the bad reference generated above. This is where the buffer overflow occurs.
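
The first two steps can be reproduced in isolation. The sketch below is a reconstruction, not the program under discussion: `Location` is a hypothetical stand-in for the "location" type, the operands are 5 and 7 so the discarded one is visible, and the second coordinate is initialized to -1 instead of being left uninitialized so the demo has defined behavior.

```cpp
#include <cassert>

// Hypothetical stand-in for the "location" type described above.
struct Location {
    int dimension;
    int coordinates[2];
    // Implicit int -> Location conversion; -1 marks the second coordinate as unset.
    Location(int x) : dimension(1), coordinates{x, -1} {}
};

// Overloaded comma: appends a second coordinate to a Location.
Location operator,(Location loc, int y) {
    assert(loc.dimension < 2);
    loc.coordinates[loc.dimension++] = y;
    return loc;
}

// Both operands are int, so the built-in comma runs: 5 is discarded, 7 is the
// result, and only then is 7 converted to a one-dimensional Location.
Location from_plain_ints() {
    Location loc = (5, 7);
    return loc;
}

// The left operand is already a Location, so the overloaded comma is selected.
Location from_forced_conversion() {
    Location loc = (Location(5), 7);
    return loc;
}
```

Here `from_plain_ints()` yields dimension 1 with only the 7 stored, while `from_forced_conversion()` yields dimension 2 with both 5 and 7. Most compilers also warn that the left operand of the comma in `(5, 7)` has no effect, which is a useful hint that the overload is not being used.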
You can force the conversion with
blah[ location(5), 5] = 10;
but that's not useful except to see what's happening.
You can't overload the built-in operators for built-in types. So overloading, outside of an object, "operator,(int, int)" won't work either.
Hence the need for a straightforward solution.

Re:histogram of C reserved words - well, B .... (Score:3, Informative)

by ignavus ( 213578 ) writes: on Monday December 05, 2005 @06:36PM (#14189096)

auto is a throwback to B days (the language immediately before C). B had no data types (no int, float, double, etc) but did have storage types: auto, static, and extrn.

auto was necessary in B for local variables, as a plain variable name by itself was a valid expression statement (as it is in C), not a declaration (IIRC).

1. foo() { auto bar; ... }
2. foo() { static bar; ... }
3. foo() { extrn bar; ... }
4. foo() { bar; ... }

All mean something different in B: the first three instances of bar are declarations, the fourth is an expression statement (and if I remember my B correctly, it is invalid as the first statement of foo(), because bar hasn't been declared one of auto, static, or extrn yet in this function).

In C, auto is completely redundant. Except, perhaps, in comments.

Ah, B. The days when programmers were programmers and data was data, and you could perform any operation you liked on any variable. Want to divide a pointer to a string by 3? Go ahead. Self-disciplined programmers don't need training wheels. Just a choice between auto, static and extrn.

Codase is much better with 250M of C/C++/JAVA code (Score:1, Informative)

by Anonymous Coward writes: on Monday December 05, 2005 @11:35PM (#14190884)

http://www.codase.com/ [codase.com], a new search engine, seems to have a better user interface and performance. It also has a smart query search system to deal with complex queries,
quoted from their website:
"For the first time, to find relevant code, developers can simply type into a search box about the same code as they do in their daily development work. The Codase smart query system processes the input and then builds an internal query to feed into the search engine. Through this free style format, complex combinations of multiple search terms can be easily entered. For example, to find any main method that contains variable t and function calls of thread.start() and println, this query can be used: main() { var t; thread.start(); println; }",

http://www.codase.com/search/smart?join=main()+%7B var+t%3B+thread.start()%3B+println%3B+%7D&scope=jo in%2Fjoin&lang=*&project= [codase.com]
