
Searchable C/C++ DB surpasses 275 million lines

Posted by Hemos
from the interesting-applications dept.
Sembiance writes "I've been working on a C/C++ source code search database for the past year. It has recently surpassed 275 million lines of searchable open source C/C++ code. The search engine is C/C++ syntax aware so you can search for specific elements such as functions, macros, classes, comments, etc. The site is built upon many open source products including: MySQL and Lucene for the database, CodeWorker to parse the code, PHP and Apache for the website and GeSHi for syntax highlighting. I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code."
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Monday December 05, 2005 @01:28PM (#14186064)
    I'm currently looking for suggestions on what sort of 'interesting statistics' I could create from 275+ million lines of open source C/C++ code.


    The following "interesting statistics" come to mind:

    • Percentage of functions named "deepThroat" (0%)
    • Number of comments mentioning a "girlfriend" (11) or "wife" (29) to "Natalie Portman" (41)
    • How many variables named "penis" are of type "long" versus type "short" (unknowable!)


    You gotta get the variables searchable. Most critical for that last statistic. Also, I'm too lazy to learn Lucene Query Parser Syntax [apache.org], so the statistics for "Natalie Portman" may include references to "portman."
  • by kunzy (880730) on Monday December 05, 2005 @01:30PM (#14186081) Homepage
    the time from the frontpage article on /. to the death of your server?
  • My vote is for... (Score:5, Insightful)

    by Anonymous Coward on Monday December 05, 2005 @01:31PM (#14186090)
    How many lines consist of:
    }
  • by roguerez (319598) on Monday December 05, 2005 @01:31PM (#14186093) Homepage
    Find similarities with stuff like SCO.
    • Including those with incompatible licenses.

      Related: having found similar code sections, follow trends in them over time. Find where two programs copied the same code, but one has failed to implement what might be a bug fix or improvement in another, by looking at changes to the code over time.
  • Interesting stats (Score:5, Interesting)

    by sparkes (125299) on Monday December 05, 2005 @01:32PM (#14186097) Homepage Journal
    How many lines contain expletives?
  • SCO (Score:2, Funny)

    by cmburns69 (169686)
    With all that code indexed, maybe we'll finally be able to figure out what the heck SCO's talking about.

    But then again, probably not...
  • . . . well program, sloccount [dwheeler.com]. Of course, do some research and tweak the parameters to get a reasonably accurate result though.
  • Statistics: (Score:5, Interesting)

    by duckpoopy (585203) on Monday December 05, 2005 @01:32PM (#14186104) Journal
    1. Lines per function
    2. Comment / command ratio
    3. Number of curse word variable names
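As a rough illustration of the "comment / command ratio" idea, the sketch below sorts each line of a source text into code, comment-only, or blank. The names `LineStats` and `count_lines` are invented for this example, and it assumes that comment markers never appear inside string literals, so it is a starting point rather than a description of how the site computes its numbers.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical result type for this sketch.
struct LineStats {
    std::size_t code = 0;     // lines with at least one non-comment character
    std::size_t comment = 0;  // lines holding only comment text
    std::size_t blank = 0;    // everything else
};

// Classify every line of `source`.
// Simplification: "//" and "/*" inside string literals are treated as comments.
LineStats count_lines(const std::string& source) {
    LineStats stats;
    std::istringstream in(source);
    std::string line;
    bool in_block = false;  // true while inside a /* ... */ comment
    while (std::getline(in, line)) {
        bool has_code = false;
        bool has_comment = false;
        for (std::size_t i = 0; i < line.size(); ++i) {
            if (in_block) {
                has_comment = true;
                if (line.compare(i, 2, "*/") == 0) { in_block = false; ++i; }
            } else if (line.compare(i, 2, "//") == 0) {
                has_comment = true;
                break;  // rest of the line is comment text
            } else if (line.compare(i, 2, "/*") == 0) {
                has_comment = true;
                in_block = true;
                ++i;
            } else if (line[i] != ' ' && line[i] != '\t' && line[i] != '\r') {
                has_code = true;
            }
        }
        if (has_code) ++stats.code;
        else if (has_comment) ++stats.comment;
        else ++stats.blank;
    }
    return stats;
}
```

A line that mixes code with a trailing comment is counted as code here; the ratio itself is then `comment / code`, with a guard for a zero code count.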
    • Re:Statistics: (Score:2, Insightful)

      by gronofer (838299)
      4. The number of times the wheel has been reinvented.
    • Re:Statistics: (Score:3, Informative)

      by Anonymous Coward
      From the stats page if you cannot get to it...

      Overall Stats
      Number of Packages: 10,931
      Total Number of Files: 1,151,819
      Total Lines of Code (No comments, no blank lines): 283,119,081
      Total of All Lines: 420,355,464
      Total Number of Functions: 7,782,468
      Total Number of Functions Called: 69,500,700
      Total Number of Macros: 9,947,564
      Total Number of Classes: 209,361
      Total Number of Comments: 38,125,107
      Total Number of Structures: 5
      • Total Number of Functions: 7,782,468
        Total Number of Functions Called: 69,500,700

        So the code calls 61,718,232 functions which don't even exist?

        But maybe they just meant "Total Number of Function Calls" :-)
    • by derek_farn (689539) <<ku.oc.fosonk> <ta> <kered>> on Monday December 05, 2005 @01:47PM (#14186262) Homepage
      Source code usage measurements contain many surprises (i.e., developers don't always write what people think they do). Some statistics I have collected, on a smaller code base, are available here [coding-guidelines.com]. The source of the tools used to extract much of the data (at least for those tables and figures I produced) is available here [knosof.co.uk] (C only at the moment).

      Being able to search so much source is also very useful. I was involved in a discussion a while back about the frequency of use of Bessel functions in programs (I claimed rare). The handful of uses returned from your database helped back up my argument (dare I say prove it).

      Keep up the good work!

    • For example, "Lines of code" / "Lines of commenting" will always produce "Inf"
    • Some other real suggestions of useful statistics:
      1. Maximum brace nesting level for each function (might be difficult, but a good metric for determining the complexity of a function)
      2. Percentages of control structures that are while, for, switch, if, etc.
      3. Number of embedded constants that aren't 0 or 1
      4. Count of references to each function/constant within a single project
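The first suggestion in the list above, maximum brace nesting per function, can be approximated with very little code. This is a minimal sketch under the assumption that the function body has already been extracted as text; `max_brace_depth` is a made-up name, and braces inside strings or comments are counted as well, which a parser-based tool would avoid.

```cpp
#include <algorithm>
#include <string>

// Returns the deepest level of '{' nesting found in `body`.
// Simplification: braces inside string literals, character literals and
// comments are counted too.
int max_brace_depth(const std::string& body) {
    int depth = 0;
    int deepest = 0;
    for (char c : body) {
        if (c == '{') {
            ++depth;
            deepest = std::max(deepest, depth);
        } else if (c == '}' && depth > 0) {
            --depth;  // ignore unbalanced closing braces
        }
    }
    return deepest;
}
```

For a body such as `{ if (a) { while (b) { g(); } } }` the function reports a depth of 3.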
    • 4. How many lines belong to SCO.
      5. ?
      6. Profit

      Bob.

      (where 5 is a pretty good chance of getting counter-sued out of existence by IBM when the answer is some { and a few less }.)
  • ratio (Score:5, Funny)

    by FreeBSDbigot (162899) on Monday December 05, 2005 @01:33PM (#14186106)
    ... of "foo" to "bar."
    • Re:ratio (Score:5, Funny)

      by ahem (174666) on Monday December 05, 2005 @03:24PM (#14187127) Homepage Journal
      From google:

      Search -- foo -> Results 1 - 10 of about 26,600,000 for foo. (0.06 seconds)
      Search -- bar -> Results 1 - 10 of about 385,000,000 for bar [definition]. (0.16 seconds)
      Search -- foo bar -> Results 1 - 10 of about 7,900,000 for foo bar. (0.12 seconds)

      'bar' wins. This intuitively makes sense, as who would want to go to the 'foo' for a drink, or eat an 'energy foo'? Could you imagine a lawyer being 'dis-fooed'?
  • Suggestion (Score:5, Funny)

    by lbmouse (473316) on Monday December 05, 2005 @01:33PM (#14186120) Homepage
    "I'm currently looking for suggestions..."

    How about a new server?
  • Slashdot Block (Score:3, Interesting)

    by Yerase (636636) <randall.handNO@SPAMgmail.com> on Monday December 05, 2005 @01:34PM (#14186125) Homepage
    I love the GeSHi page, how it blocks everything from Slashdot. Set up a site to advertise a product, then restrict people from using it....
    URLs on this server linked by slashdot.org will be refused. Permission is given to slashdot to mirror content as necessary for the purpose of providing its users access to the information on the site. Slashdot should not attempt to bypass the referer block. Use of the google cache page for the site is acceptable as long as the page(s) concerned have no more than 1 image.
    • Re:Slashdot Block (Score:3, Insightful)

      by lowrydr310 (830514)
      This policy is employed for the sole purpose of avoiding a huge bandwidth bill that I would have to pay out of my own pocket. Anyone who would like this restriction to go away is more than welcome to send me bucketloads of cash.

      If you don't want to pay a big bandwidth bill then don't run a webserver.

      • Re:Slashdot Block (Score:2, Insightful)

        by b4k3d b34nz (900066)
        Why would anybody WANT to pay a big bandwidth bill? It's called being smart so that he doesn't get the shaft when he has to pay his utilities this month.
      • If you don't want to pay a big bandwidth bill then don't run a webserver.

        If you want access to a web server, don't run a system that's known to give the provider big bandwidth bills.

        At the end of the day, they don't owe you anything, and anything they offer you is a courtesy, not an obligation. If you don't like that, please feel free to go create and finance your own WWW.

      • Re:Slashdot Block (Score:3, Insightful)

        by gstoddart (321705)

        This policy is employed for the sole purpose of avoiding a huge bandwidth bill that I would have to pay out of my own pocket. Anyone who would like this restriction to go away is more than welcome to send me bucketloads of cash.

        If you don't want to pay a big bandwidth bill then don't run a webserver.

        That's a little harsh don't you think?

        It's one thing to run a site and have reasonable expectations of having "enough" bandwidth for your projected traffic, and it's another thing to pay for a slashdotting on a

      • Re:Slashdot Block (Score:3, Informative)

        by Kjella (173770)
        If you don't want to pay a big bandwidth bill then don't run a webserver.

        For every problem, there is a solution that is simple, elegant and wrong. In every other market, the more demand there is, the higher the price/revenue/profit. Web servers are pretty much the only place where you lose more money the more popular you are (e-commerce sites and such not included). If so many people want the content, they can find a way to share it. Even then they're getting a bloody good deal, if you ask me. What exactly
    • Hit Refresh (Score:5, Informative)

      by everphilski (877346) on Monday December 05, 2005 @01:44PM (#14186229) Journal
      Just hit refresh and the webserver won't get the HTTP_REFERRER (granted you'll have to manually delete the text file he serves you)

      -everphilski-
      • by sglane81 (230749)
        Actually, if you click refresh on a page from a link, it will resend the referrer as well. Most browsers do this. One more thing, you spelled HTTP_REFERRER correctly, which is wrong :) It's spelled HTTP_REFERER, only has one R. Reverse grammar nazi FTW?
    • Re:Slashdot Block (Score:2, Interesting)

      by wampus (1932)
      That's why I use Cacheout [thetechgurus.net]. It's a Firefox extension that adds a context menu item to coralize any link. Bypass the restriction AND not kill the site, all at the same time.
  • Choice of db? (Score:4, Interesting)

    by Anonymous Coward on Monday December 05, 2005 @01:35PM (#14186137)
    So, this is not a flame, but I'm curious about your choice of dbs.
    I've used mysql for some small projects, but generally it doesn't handle
    millions of rows (although the upper limit on rows can be patched with
    some additional behaviors). So, for big dbs, I use postgresql.

    How did you decide to use mysql? (Was it that the project started,
    and grew, or did you know it would handle large numbers of rows
    from the start)?

    Just curious. This is probably going to be viewed as a flame by many
    (particularly those who don't really use dbs very much, but use them
    enough to have strong opinions).
    • Re:Choice of db? (Score:4, Informative)

      by Sembiance (124190) on Monday December 05, 2005 @04:18PM (#14187649) Homepage
      I've used MySQL in the past for some projects at work, where the number of rows were several hundred million and ran with no problems so I knew it was capable of large row numbers.

      I initially used their FULLTEXT indexing as well, but it dies a horrible death with a large number of rows or search terms. (The developers that live in #mysql on Freenode confirmed this)

      So I had to hand off searching to Lucene, which worried me a great deal (being Java) but as folks tell me 'Java is not slow'.
      They are right, Java is very fast at handling the searching and I've been very impressed.
      Most searches in the Java database only take one or two seconds.
      The MySQL query/join for additional info take another 4 or 5 seconds.

      Most searches take about 8 seconds to come up, even under no load.

      I simply don't have enough RAM to keep the necessary MySQL indexes in RAM and use index only queries.
  • Statistics TM (c) (Score:5, Interesting)

    by chunews (924590) on Monday December 05, 2005 @01:38PM (#14186162)
    It would be interesting to see the number of different copyright notices contained within all that source code, and then to present the notices in groups, like GPL GPL2, etc..

    Also, I would really like to find "patient 0" for sourcecode. For example, is there a common library or utility function (perhaps Hex2Ascii?) that *everybody* uses? Well, who wrote it first?

    And in a similar vein, who are the "top 5-10-100" authors of open source code by use, reuse, KLOC, etc.. Not of too much use unless I were awarding the Nobel prize for programming, or perhaps creating a list of individuals for the RIAA to sue, after they're done with their other useless lawsuits. :)

  • by iso-cop (555637) on Monday December 05, 2005 @01:39PM (#14186174)
    In the software engineering world, people will be interested in all sorts of code metrics such as cyclomatic complexity, operator/operand counts, lines of code per module, and such as well as object oriented metrics for the C++ code (depth of inheritance, for example). If you can marry these sorts of metrics with defect data (bugs) for each of the modules then you have a useful data repository for predicting defects in source code. Keeping around different versions of modules changed is also valuable here. If you can gather information on how long it took to produce the module and how long it took to correct defects in the module you are getting even better. If you make it easy to reuse the C and C++ modules...even better.
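Of the metrics named above, cyclomatic complexity is the easiest to approximate from text alone: a common estimate is one plus the number of decision points in a function. The sketch below follows that estimate. The function name and the exact set of counted constructs are assumptions made for this example, and comments and string literals are not skipped.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: rough cyclomatic complexity of one function body,
// computed as 1 + the number of decision points found in the text.
int approx_cyclomatic_complexity(const std::string& body) {
    static const std::set<std::string> branch_keywords = {
        "if", "for", "while", "case", "catch"};
    int complexity = 1;
    std::size_t i = 0;
    while (i < body.size()) {
        const unsigned char c = static_cast<unsigned char>(body[i]);
        if (std::isalpha(c) || c == '_') {
            // Read a whole identifier or keyword.
            const std::size_t start = i;
            while (i < body.size() &&
                   (std::isalnum(static_cast<unsigned char>(body[i])) || body[i] == '_')) {
                ++i;
            }
            if (branch_keywords.count(body.substr(start, i - start)) > 0) ++complexity;
        } else if ((c == '&' || c == '|') && i + 1 < body.size() && body[i + 1] == body[i]) {
            ++complexity;  // short-circuit && or ||
            i += 2;
        } else if (c == '?') {
            ++complexity;  // conditional operator
            ++i;
        } else {
            ++i;
        }
    }
    return complexity;
}
```

Joining such a number with per-module defect counts, as the comment proposes, would need bug-tracker data that the source archive alone does not provide.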
  • by tod_miller (792541) on Monday December 05, 2005 @01:39PM (#14186180) Journal
    I was very impressed with Amazon, who for each book say which phrases and words were particularly unique to that book. (reminds me of that Google game where you try to get any two words with only 1 hit).

    So show code with coloured background to the lines, from green to red, green being 'normal every day boiler plate' code, red would mean this code must be more specialised, or written by some half-wit l33t h4x0r at least.

    I forgot what they called it, but they had 3/4 visible stats based on the semantics of the stuff, probably more under the 'hood (omg lol).

    word. Oh some adhesion stats would rock!

    please type the word in this image: adhesion
    random letters - if you are visually impaired, please email us at pater@slashdot.org
  • by PetriBORG (518266)
    Start with the basics, and then move on..
    1. Whitespace to code ratio
    2. Counts for each of the dirty 7
    3. Line counts that just contained () or {} or []
    4. A list of projects the code is from
    5. And then more interestingly, I'd like to run some sort of program on it to find similarities in code, to see how much one code base overlaps with another. It would be interesting to see if OSS actually does share code between projects or if it's all NIH (not invented here).
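One simple way to put a number on the overlap asked for in item 5 is to compare the sets of whitespace-stripped lines of two code bases. The sketch below does that with a Jaccard-style ratio. Both function names and the four-character cutoff are assumptions for illustration; serious clone detection would compare token sequences or syntax trees instead.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Collect the distinct lines of `source` with all whitespace removed.
// Very short lines (such as a lone brace) are skipped as uninformative.
std::set<std::string> normalised_lines(const std::string& source) {
    std::set<std::string> lines;
    std::istringstream in(source);
    std::string line;
    while (std::getline(in, line)) {
        std::string squeezed;
        for (char c : line) {
            if (!std::isspace(static_cast<unsigned char>(c))) squeezed += c;
        }
        if (squeezed.size() >= 4) lines.insert(squeezed);
    }
    return lines;
}

// Size of the intersection divided by size of the union, in [0, 1].
double line_overlap(const std::string& a, const std::string& b) {
    const std::set<std::string> la = normalised_lines(a);
    const std::set<std::string> lb = normalised_lines(b);
    if (la.empty() && lb.empty()) return 0.0;
    std::size_t shared = 0;
    for (const std::string& l : la) shared += lb.count(l);
    return static_cast<double>(shared) /
           static_cast<double>(la.size() + lb.size() - shared);
}
```

A value near 1 suggests wholesale copying, while a value near 0 is what the NIH hypothesis would predict.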
  • by bsdluvr (932942) on Monday December 05, 2005 @01:45PM (#14186241) Homepage
    1) randomly select 2000 lines of code
    2) compile
    3) execute
    4) ???????
    5) PROFIT!
  • Woman (Score:2, Funny)

    by chris_mahan (256577)
    I'd like to know whether the word "woman" appears anywhere, and if so, in what projects.

    Eh.
  • All the code was just /.'ed into oblivion. Time to start from the beginning all over again. :(
  • by tcopeland (32225) * <tom.thomasleecopeland@com> on Monday December 05, 2005 @01:48PM (#14186268) Homepage
    ...that is, a static analysis of a bunch of Java SourceForge projects [sourceforge.net]. It does unused code and duplicate code detection... sometimes it finds some interesting things.

    PMD home page is here [sf.net], book site is here [pmdapplied.com].
  • I'm curious, when people are looking for code, what do they do as a first resort? Maybe this should be a poll. Me, I'm a bit funny...
    1) look in my library (books)
    2) do a deja search
    3) ask smarter people than me
    4) do a web search (usually on specific sites)

  • I can only hope that this database has good metadata on which code fragments contain/don't contain various common species of exploits (buffer overflow, stack overflow, mal-formed input vulnerabilities, etc.). It would be nice to know which code fragments have all the needed input/size checking needed to be safe for exposure to the outside world and which are "for internal use only."
  • It is hosed.

    I tried searching. Here's what I got:

    XML Parsing Error: junk after document element
    Location: http://csourcesearch.net/performSearch.php?type=FunctionTypeReturned&search=(&ignoredRandomNumber=1133805159922.7798 [csourcesearch.net]
    Line Number 2, Column 1:
    Warning: mysql_connect() [function.mysql-connect [slashdot.org]]: Can't connect to MySQL server on '127.0.0.1' (4) in /home/csourcesearch.net/include/php/GraphXML.php on line 309
    ^
  • by Animats (122034) on Monday December 05, 2005 @01:58PM (#14186354) Homepage
    C++, for historical reasons dating back to C, has weird semantics for commas in brackets. The operator precedence for commas is different inside of "()" and "[]".

    So tab(i,j) is a function call with two arguments. But tab[i,j] is an invocation of the "comma operator", then a subscript with one argument. The default "comma operator" ignores the first argument and returns the second. It once had some uses in C macros.

    I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++. But there's a concern that somewhere, someone might have code that depends on the current semantics of the comma operator inside square brackets.

    This new archive offers the opportunity to eliminate that possibility. So, do this search: Find, in non-comment standard C++ code, any occurrences of a comma operator within square brackets. Eliminate any where there are parentheses within the square brackets enclosing the comma. Can you find any? In any production code? In any open-source project? Anywhere?
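The comma-operator behaviour described in this comment can be seen in a few lines. In this sketch the comma expression is wrapped in its own parentheses so that it is unmistakably the comma operator; under the rules the post describes, the bare `tab[i,j]` spelling evaluates the same way. The helper names are invented for the example.

```cpp
// Built-in comma operator: evaluates the left operand, discards it,
// and yields the right operand.
int comma_result(int i, int j) {
    return (static_cast<void>(i), j);
}

// Subscripting with a comma expression therefore uses only the last value.
int pick(const int (&tab)[4], int i, int j) {
    return tab[(static_cast<void>(i), j)];  // same element as tab[j]
}
```

With `tab = {10, 20, 30, 40}`, `pick(tab, 1, 3)` returns `tab[3]`, and the value of `i` plays no part.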

    • I've argued with the C++ committee about this. If "operator[]" had the same syntax as "operator()", we could have support for multidimensional arrays in C++.

      I'm no C++ expert, but isn't int array[row][col] a multidimensional array?
      • I'm no C++ expert, but isn't int array[row][col] a multidimensional array?

        No, it's an array of pointers to an array of elements, which is not quite the same thing.

        Arrays with multiple subscripts have many uses. Sparse array implementations, for example. People implement this now with code that looks like

        tab(i,j) = 1;

        This is valid C++, and with the right overloads, it compiles and runs, but it looks weird.

        • The grandparent got it correct. C does support multidimensional arrays. I suspect that C++ does too.

          To validate, I pulled out my copy of K&R 2nd edition (Actually a copy I once rescued from a trash bin, and my copy is only "Based on Draft-Proposed ANSI C"). In section 5.9 Pointers vs. Multi-dimensional Arrays it points out,

          Newcomers to C are sometimes confused about the difference between a two-dimensional array and an array of pointers, such as name in the example above. Given the definitions

          i

          • You're correct, but that's not what the original post is saying. The only way to provide a sparse-matrix class in C++ is with member functions. You can't do it by overloading [] to accept two arguments, e.g. array[2,4]. You have to use a member function, making it look like array.get(2,4), or perhaps overloading () for array(2,4). There's no way to write a matrix class that uses square brackets for indexing more than one dimension.
      • by chris macura (899109) on Monday December 05, 2005 @02:29PM (#14186615)
        Yes, they are.

        But from an OOP standpoint, it's impossible to create a data structure that "knows" you're using the [] operator twice. So if you overload the [] operator in an array structure, to get multi-dimensional arrays you have to nest single-dimension arrays, which is almost always inefficient because the rows (or columns, depending on whether you're row-major or column-major) are scattered around RAM (depending on where they were allocated), rather than sitting in one contiguous chunk as in C.

        In other words, you can't do something like this in C++:

        class SmartArray {
          public:
            SmartArray(int height, int width);

            int operator[](const int &x, const int &y) const;

            // ...
        };

        ...

        SmartArray a(5, 5);

        a[12, 13];
        • by Old Wolf (56093) on Monday December 05, 2005 @04:09PM (#14187566)
          You can do exactly that -- just write a(12,13) instead of a[12,13].
          This is a great counterexample to the GP. Changing the meaning
          of the comma within square brackets would gain NOTHING and would
          mean every existing compiler is now wrong.

          The existing C array type is bad enough as it is, why make it
          even more unwieldy by introducing a new variant? C++ is already
          on the right track: discourage C arrays, and encourage container
          classes that have things like bounds checking and automatic
          memory allocation.
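For reference, here is a minimal sketch of the `a(12,13)` approach described in this reply: a two-argument `operator()` over a single contiguous row-major buffer. The class name follows the parent post. The bounds check and the `int` element type are choices made for this example, not something the thread specifies.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Two-dimensional array stored as one contiguous row-major block.
class SmartArray {
public:
    SmartArray(std::size_t height, std::size_t width)
        : height_(height), width_(width), data_(height * width, 0) {}

    // Element access by (row, column); throws std::out_of_range on a bad index.
    int& operator()(std::size_t row, std::size_t col) {
        return data_[index(row, col)];
    }
    int operator()(std::size_t row, std::size_t col) const {
        return data_[index(row, col)];
    }

private:
    // Map (row, col) to an offset in the flat buffer, checking both bounds.
    std::size_t index(std::size_t row, std::size_t col) const {
        if (row >= height_ || col >= width_) {
            throw std::out_of_range("SmartArray index out of range");
        }
        return row * width_ + col;
    }

    std::size_t height_;
    std::size_t width_;
    std::vector<int> data_;
};
```

With `SmartArray a(5, 5);`, the statement `a(2, 3) = 7;` writes one element, while `a(12, 13)` throws instead of reading past the buffer.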
    • Well, the obscureness of the comma operator is used by C++ recruiters who think they are really "clever", and in "clever" C/C++ puzzles on usenet. If you took it away, how would you hire C++ programmers and how would you have fun on usenet?

      Also, C++ programmers are getting really old, and they don't handle change very well.

  • best_idea_ever (Score:4, Insightful)

    by l33t-gu3lph1t3 (567059) <arch_angel16.hotmail@com> on Monday December 05, 2005 @01:58PM (#14186359) Homepage
    charge for a premium service that allows Computer Science and Software Engineering profs to perform a somewhat intelligent search of the code to see just how much of their students' code is lifted off the 'net ;)
  • if( something = something ) ...
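A crude check for the `if( something = something )` pattern could look like the sketch below. It assumes the text between the parentheses of the `if` has already been extracted, and `has_bare_assignment` is a hypothetical name. Compound assignments such as `+=` are reported too, while `<<=` and `>>=` are missed.

```cpp
#include <cstddef>
#include <string>

// Returns true if `condition` contains an '=' that is not part of
// "==", "!=", "<=" or ">=".
bool has_bare_assignment(const std::string& condition) {
    for (std::size_t i = 0; i < condition.size(); ++i) {
        if (condition[i] != '=') continue;
        const char prev = i > 0 ? condition[i - 1] : '\0';
        const char next = i + 1 < condition.size() ? condition[i + 1] : '\0';
        if (next == '=') {
            ++i;  // "==": skip both characters
            continue;
        }
        if (prev == '!' || prev == '<' || prev == '>') continue;  // comparison
        return true;
    }
    return false;
}
```

Not every hit is a bug, since some code assigns inside a condition on purpose, so the count is an upper bound on mistakes.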
  • See also: Codase.com (Score:3, Informative)

    by kriegsman (55737) on Monday December 05, 2005 @02:00PM (#14186371) Homepage
    See also Codase.com [codase.com], another "Source Code Search Engine", which lets you search by method names, class names, variable names, free text, etc..

    -Mark
  • Koders.com (Score:3, Informative)

    by knipknap (769880) on Monday December 05, 2005 @02:01PM (#14186376) Homepage
    Don't know, koders.com [koders.com] supports a lot more languages and also lets you narrow your search to specific licenses. The few extra lines of code just don't seem to do it, especially because such measures highly depend on the chosen method.
  • I'm surprised that Perl's CPAN archive [cpan.org] doesn't have structured searching at smaller granularity than module name or freeform metadata. Maybe once the archives let us find code by content, we'll get version control databases that store each line in a record, each block as references in a separate table, maybe even referential integrity of variables as foreign keys. I'd love my editor to pull code from DB storage, padding whitespace only in the presentation layer per my preferences.

    I'd really love to see data
  • by raddan (519638) on Monday December 05, 2005 @02:07PM (#14186429)
    You can start by seeing how often people use gets(), strcpy(), strcat(), etc... Look for all the fun little common mistakes that people make.
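Counting calls to the functions named above can be sketched as a scan for those identifiers followed by an opening parenthesis. The function name `count_risky_calls` is an assumption for this example, the list contains only the three functions the comment mentions, and text inside comments or strings is counted as well.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Count apparent calls to gets, strcpy and strcat in `source`.
// Only whole identifiers match, so fgets( or my_strcpy( are not counted.
std::map<std::string, int> count_risky_calls(const std::string& source) {
    static const std::set<std::string> risky = {"gets", "strcpy", "strcat"};
    std::map<std::string, int> counts;
    std::size_t i = 0;
    while (i < source.size()) {
        const unsigned char c = static_cast<unsigned char>(source[i]);
        if (std::isalpha(c) || c == '_') {
            const std::size_t start = i;
            while (i < source.size() &&
                   (std::isalnum(static_cast<unsigned char>(source[i])) || source[i] == '_')) {
                ++i;
            }
            const std::string name = source.substr(start, i - start);
            // Allow whitespace between the name and the parenthesis.
            std::size_t j = i;
            while (j < source.size() && std::isspace(static_cast<unsigned char>(source[j]))) ++j;
            if (j < source.size() && source[j] == '(' && risky.count(name) > 0) {
                ++counts[name];
            }
        } else {
            ++i;
        }
    }
    return counts;
}
```

A call site is only a candidate for review: whether a given `strcpy` is actually unsafe depends on what is known about the buffer sizes.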
  • by digitaldc (879047) * on Monday December 05, 2005 @02:08PM (#14186438)
    -# of non-numerical constants
    -# of ( ),{ },\ /,#,; characters in code
    -time spent debugging/compiling
    -total hours spent in production
    -gallons of coffee consumed
    -hours of daylight seen
    -# of relationships destroyed
  • Code Styles (Score:5, Interesting)

    by ionrock (516345) on Monday December 05, 2005 @02:09PM (#14186443)
    I would love to see if different code styles could be analyzed to see how many people use what sort of syntax style. There is camelCase and under_scores but it seems possible to find more complicated trends that might allow reviews to statistically determine what practices really help to make code better.
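A first pass at the camelCase versus under_scores question is to classify each identifier by its shape. The sketch below does that; the enum, the function name and the exact rules (for example, treating `MAX_SIZE` and `PascalCase` as "other") are assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class NameStyle { CamelCase, UnderScores, Other };

// Classify one identifier by its spelling.
NameStyle classify_identifier(const std::string& name) {
    bool has_underscore = false;
    bool has_upper = false;
    bool has_lower = false;
    for (std::size_t i = 0; i < name.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (c == '_' && i > 0 && i + 1 < name.size()) {
            has_underscore = true;  // only interior underscores count
        } else if (std::isupper(c)) {
            has_upper = true;
        } else if (std::islower(c)) {
            has_lower = true;
        }
    }
    if (has_underscore && !has_upper) return NameStyle::UnderScores;
    if (!has_underscore && has_upper && has_lower &&
        std::islower(static_cast<unsigned char>(name[0]))) {
        return NameStyle::CamelCase;
    }
    return NameStyle::Other;
}
```

Tallying the results per project would show which style dominates; linking a style to code quality, as the comment hopes, would need some independent quality measure.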
  • How much of this open-source code DB is reusable? Are most of the lines things that have limited applications, or are most of them more general? I mean, if you have 275 million lines, but 175 million lines are code designed to solve one specific problem and can't be easily cross-applied, then it isn't as useful as the statement implies.

    That said, congrats on the milestone, and looking forward to hearing of more!
  • by jab (9153) on Monday December 05, 2005 @02:14PM (#14186487) Homepage
    I'd love to see how one of my programs (stats below) compares
    to the, uh, national average.

       1222 if
        638 return
        482 static
        413 for
        399 int
        217 const
        201 else
        194 void
        128 char
        115 case
        112 break
         55 default
         43 sizeof
         37 do
         35 switch
         27 enum
         24 struct
         23 while
         15 float
         14 typedef
         10 auto
          7 unsigned
          6 extern
          1 long
  • Now SCO will finally be able to find all the code that was stolen from them!
  • I'm dying to know... What percentage of the code is commentary?

    And are there any haiku?
  • select count(*) from sourcecode where comments > 0
    0 row(s) returned

    plagiarism at its finest [thinkgeek.com]

    mod -1 lame
  • by Xofer D (29055) on Monday December 05, 2005 @02:37PM (#14186681) Homepage Journal

    This is a good opportunity to build complex statistics about the C++ grammar actually used in context. Learn from the NLP [wikipedia.org] people! Parse the whole thing, and start finding common subtrees in the grammar used. Look at common lexical entries between subtrees, so we can make a tool that can help recognize errors by comparing against commonly used C++ grammar fragments. Or do function completion based on what kind of function you look like you're writing. See if you can do alignment with similar languages and do statistical source translation [wikipedia.org]. If you keep information about comments used (and maybe apply some real NLP), you might even have a shot at automatically classifying functions based on their form, and documenting them with simple comments.

    If that's too hard, try finding all n-grams [wikipedia.org] instead, at least under some length. That's a lot more useful than just individual tokens or strings.

    With a lot of data, you can do very cool things. Don't mess around with string frequency counting. C++ is simple compared to English, do something interesting.
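The n-gram suggestion above can be tried with a very small tokenizer. This is a sketch only: `tokenize` splits text into identifier-like runs and single punctuation characters, which is far cruder than a real C++ lexer, and both function names are invented for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Split source text into crude tokens: runs of letters, digits and '_',
// or single punctuation characters. Whitespace is dropped.
std::vector<std::string> tokenize(const std::string& source) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < source.size()) {
        const unsigned char c = static_cast<unsigned char>(source[i]);
        if (std::isspace(c)) {
            ++i;
        } else if (std::isalnum(c) || c == '_') {
            const std::size_t start = i;
            while (i < source.size() &&
                   (std::isalnum(static_cast<unsigned char>(source[i])) || source[i] == '_')) {
                ++i;
            }
            tokens.push_back(source.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, source[i]));
            ++i;
        }
    }
    return tokens;
}

// Count every run of `n` consecutive tokens, keyed by the tokens joined
// with single spaces.
std::map<std::string, int> count_ngrams(const std::vector<std::string>& tokens,
                                        std::size_t n) {
    std::map<std::string, int> counts;
    if (n == 0) return counts;
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
        std::string key;
        for (std::size_t k = 0; k < n; ++k) {
            if (k > 0) key += ' ';
            key += tokens[i + k];
        }
        ++counts[key];
    }
    return counts;
}
```

Replacing identifiers with a placeholder such as `ID` before counting would make the n-grams reflect grammar fragments rather than particular variable names, which is closer to what the comment is after.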

  • that permit things like buffer overflows, etc.

    Though I don't develop much in C++ currently, and haven't had the time to do anything Linux wise in years, I would love to have an identified location for security-bug free algorithms, etc. that I could use if I need to do more C++ work in the future.

  • This index doesn't even contain Boost (http://www.boost.org/ [boost.org]) and Loki libraries!

    It can't be called 'comprehensive' after that...
  • would be a nice feature to have, both average and per project/module basis.
  • TODOs (Score:2, Interesting)

    by mrshoe (697123)
    Counting the number of "TODO"s and "XXX"s in "production" open source code could be interesting.
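Counting markers such as "TODO" and "XXX" needs nothing more than a substring search. In this sketch `count_marker` is a made-up name, and matches anywhere in the text are counted, not only those inside comments.

```cpp
#include <cstddef>
#include <string>

// Count non-overlapping occurrences of `marker` in `text`.
std::size_t count_marker(const std::string& text, const std::string& marker) {
    if (marker.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t pos = text.find(marker); pos != std::string::npos;
         pos = text.find(marker, pos + marker.size())) {
        ++count;
    }
    return count;
}
```

Calling it once per marker and per file gives the per-project totals the comment asks about.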
