
No, It's Not Always Quicker To Do Things In Memory

itwbennett writes: It's a commonly held belief among software developers that avoiding disk access in favor of doing as much work as possible in-memory will result in shorter runtimes. To test this assumption, researchers from the University of Calgary and the University of British Columbia compared the efficiency of alternative ways to create a 1MB string and write it to disk. The results consistently found that doing most of the work in-memory to minimize disk access was significantly slower than just writing out to disk repeatedly (PDF).

  • It depends on the speed of your memory and the speed of your disk; SSDs are getting more common.

    • Re:It depends (Score:5, Insightful)

      by Lunix Nutcase ( 1092239 ) on Wednesday March 25, 2015 @12:04PM (#49336215)

      Even the slowest DDR3 SDRAM has more memory bandwidth and magnitudes faster access time.

      • Re:It depends (Score:5, Insightful)

        by ShanghaiBill ( 739463 ) on Wednesday March 25, 2015 @12:32PM (#49336533)

        Even the slowest DDR3 SDRAM has more memory bandwidth and magnitudes faster access time.

        Indeed. Their results make no sense. They are doing something weird. For instance, their paper says that concatenating a million one-byte strings into a single million-byte string takes 274 seconds. That should take much less than one second. Their code is listed at the end of the paper, and they seem to be assuming that "flush" means the data is actually written to disk. It does not. It just means the bytes were passed to the operating system.
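To make the flush point concrete, here is a minimal Java sketch, not the paper's code. The class name `FlushVersusSync` and the temporary file are made up for illustration. It shows the two distinct steps: `flush()` hands the bytes to the operating system, while `FileDescriptor.sync()` asks the OS to push its cached data to the device.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; the name and the temp file are illustrative only.
class FlushVersusSync {
    // Writes `count` copies of "a" and returns the resulting file size in bytes.
    static long writeAndSync(Path file, int count) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file.toFile());
             Writer writer = new OutputStreamWriter(out, StandardCharsets.US_ASCII)) {
            for (int i = 0; i < count; i++) {
                writer.write("a");
            }
            // flush() only pushes Java-side buffers down to the operating system.
            writer.flush();
            // sync() asks the OS to force its cached data out to the storage device.
            out.getFD().sync();
        }
        return Files.size(file);
    }

    // Runs the demo against a temporary file and returns the byte count.
    static long demo(int count) {
        try {
            Path file = Files.createTempFile("flush-demo", ".txt");
            try {
                return writeAndSync(file, count);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(1_000_000) + " bytes written");
    }
}
```

A benchmark that stops its timer right after `flush()` would be timing only the first step, so it says little about the disk itself.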

        The real story here, is that if you don't know how to write code properly, then string concatenation can be really slow.

        Was their paper peer reviewed?

        • by Anonymous Coward on Wednesday March 25, 2015 @12:54PM (#49336773)

          Was their paper peer reviewed?

          It just was. Why do you ask?

          lololol

        • Re:It depends (Score:5, Informative)

          by PacoSuarez ( 530275 ) on Wednesday March 25, 2015 @12:55PM (#49336793)

          [...] For instance, their paper says that concatenating a million one byte strings into a single million byte string takes 274 seconds. That should take much less than one second.

          I didn't RTFA, but after reading this I am certainly not going to. This C++ piece of code takes around 0.01 seconds to run on my computer:

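          #include <iostream>
          #include <string>

          // Append r to s one million times. std::string implementations
          // typically grow their buffer geometrically, so appends stay cheap.
          void build_string(std::string &s, const std::string &r) {
              for (int i = 0; i < 1000000; ++i)
                  s += r;
          }

          int main() {
              std::string s;
              build_string(s, "a");
              std::cout << s.length() << '\n';
          }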

        • by TiggertheMad ( 556308 ) on Wednesday March 25, 2015 @12:58PM (#49336817) Homepage Journal
          I just scanned the paper, because their claim seems to be idiotic. It looks like they are appending a single byte on the end of a string in memory and on disk. For the memory operation, this will result in a string copy since strings are immutable, vs. doing a one byte file append onto the disk. The former is increasingly expensive and the latter is a fixed cost, so after infinite operations, the disk cost becomes far less than the memory operation. If this is indeed their claim, and I am not missing something, then they should be collectively slapped for wasting our time by writing this paper. If this is really your use case, write some proper data structures to manage your data in a sane fashion.

          So yes, if you do stupid things, you can make bad engineering decisions look like good ones.
          • by Bengie ( 1121981 ) on Wednesday March 25, 2015 @01:43PM (#49337273)
            They should follow best practices and use StringBuilder and rerun their tests.
            • Many people are suggesting using string builder as an easy fix... If you think about this problem, that doesn't solve it as you approach infinite operations, it just pushes the cost crossover point way out (possibly beyond the limits of existing hardware, so it might be practically moot). Since they are doing silly comparisons like this, I would suggest just writing a linked list to store each byte as a counterexample that will provide more of an apples to apples comparison. Adding an element to a linked l
          • by ndykman ( 659315 )

            I strongly encourage people to email the authors and clue them in. Seriously, this makes me angry. As if CS didn't already have a reputation for being completely academic and out of touch. With things like this, no wonder people think you can learn to code in 10 weeks.

        • Re:It depends (Score:5, Insightful)

          by sjames ( 1099 ) on Wednesday March 25, 2015 @01:41PM (#49337251) Homepage Journal

          It makes perfect sense once you read the paper. The conclusion is technically correct but deceptive.

          The results apply in the case of Java and Python where strings are immutable objects. They also used buffered I/O handled by libc. When you concatenate immutable strings, you must allocate a new string large enough to hold both parts, then a memcpy from both of the parts is performed to construct it. The parts are eventually garbage collected.

          In contrast, writing to a file with buffered I/O means just copying the additional write buffer to the current end of the buffer and updating the accounting information.

          As a result, in both cases, only one actual filesystem transaction takes place writing out the complete string. Thus, the actual practical difference between the two methods is that the 'in memory' version copies the memory around many times while the 'disk i/o' one copies the data once (in multiple steps, but each byte sees one copy).

          That seems like a bit of a no-brainer, but the point is valid because many programmers may deceive themselves into thinking the 'in memory' method is faster because they don't take the file i/o buffering and the way immutable strings are handled into account.
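The buffering described above can be observed directly. The following is a sketch under assumptions: `CountingWriter` and `BufferedAppendDemo` are hypothetical names, and the counting sink stands in for the file layer instead of touching a real disk. It counts how many writes actually reach the layer underneath a `BufferedWriter`.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sink that counts how often the layer below the buffer is hit.
class CountingWriter extends Writer {
    long calls = 0; // number of write() calls that reached this sink
    long chars = 0; // total characters received

    @Override
    public void write(char[] cbuf, int off, int len) {
        calls++;
        chars += len;
    }

    @Override public void flush() { }
    @Override public void close() { }
}

class BufferedAppendDemo {
    // Performs `count` one-character writes through a buffer of `bufferSize` chars.
    static CountingWriter run(int count, int bufferSize) {
        CountingWriter sink = new CountingWriter();
        try (BufferedWriter writer = new BufferedWriter(sink, bufferSize)) {
            for (int i = 0; i < count; i++) {
                writer.write("a"); // lands in the in-memory buffer
            }
            writer.flush(); // pushes whatever is still buffered down to the sink
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink;
    }

    public static void main(String[] args) {
        CountingWriter sink = run(1_000_000, 8192);
        System.out.println(sink.chars + " chars in " + sink.calls + " downstream writes");
    }
}
```

Each character is copied into the buffer once and the sink is hit only when the buffer fills, which is the "each byte sees one copy" behaviour the comment describes.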

        • Re: (Score:3, Funny)

          Was their paper peer reviewed?

          I believe that it may have been beer reviewed.

        • Re:It depends (Score:4, Insightful)

          by Trailer Trash ( 60756 ) on Wednesday March 25, 2015 @02:01PM (#49337449) Homepage

          The real story here, is that if you don't know how to write code properly, then string concatenation can be really slow.

          Was their paper peer reviewed?

          I just reviewed it, but frankly, they're not my peers.

          They actually understand the problem and state it near the end of the paper. The issue is pretty simple and when I read the /. summary I knew what the problem was. They're appending single bytes to a string. In both chosen languages - Java and Python - strings are immutable so the "concatenation" is way the hell more complex than simply sticking a byte in a memory location. What it involves is creating a new string object to hold both strings together. So, there's the overhead of object creation, memory copying, etc. Yes, by the time you're done it's a lot of extra work for the CPU.

          I'm going to state this as nicely as I can: what they proved is that a complete moron can write code so stupidly that a modern CPU and RAM access can be slowed down to the extent that even disk access is faster. That's it.

          Even if you wrote this in C in the style in which they did it the program would be slow. Since there's no way to "extend" a C string, it would require determining the length of the current string (which involves scanning the string for a null byte), malloc'ing a new buffer with one more byte, copying the old string and then adding the new character and new null byte. Scanning and copying are both going to require an operation for each byte (yeah, it could be optimized to take advantage of the computer's word length) on each iteration, with that byte count growing by "1" each time.

          The sum of all integers up to N is N(N+1)/2. If N is 1,000,000 the sum is 500,000,500,000. So, counting bytes (looking for null) requires half a trillion operations and copying bytes requires another half trillion operations. Note that "operations" is multiple machine instructions for purposes of this discussion.

          Yeah, modern computers are fast, but when you start throwing around a trillion operations it's going to take some time.
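The arithmetic above is easy to check. This small Java sketch (the class name `CopyCost` is made up for illustration) computes the total both from the closed form and by brute-force summation, using `long` because the result does not fit in an `int`.

```java
// Hypothetical helper for checking the N(N+1)/2 figure quoted above.
class CopyCost {
    // Sum 1 + 2 + ... + n via the closed form; long avoids int overflow.
    static long closedForm(long n) {
        return n * (n + 1) / 2;
    }

    // Same total counted the slow way: iteration k touches k bytes.
    static long byLoop(long n) {
        long total = 0;
        for (long k = 1; k <= n; k++) {
            total += k;
        }
        return total;
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        System.out.println(closedForm(n)); // 500000500000
        System.out.println(byLoop(n));     // 500000500000
    }
}
```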

          Writing to disk will be faster for a number of reasons, mainly because the OS is going to buffer the writes (and know the length of the buffer) and handle it much much better. It's not doing a disk operation every time they do a write. If they were to flush to disk every time they would still be waiting for it to finish.

          There are a few notes, here. First, in Java and Python the string object likely holds a "length" value along with the actual character buffer. That would make it faster and not require all the operations the badly written C code that I describe above would require. But the overhead of objects, JVM, interpreter, etc. gets thrown into the mix. Second, if I were doing something like this in C I could keep the string length as part of a struct and at least make it that much faster. The point is that a good programmer wouldn't write code in this manner.

          Anyway, this "paper" proves nothing except that really bad code will always suck. One would have to be an idiot to write anything close to what they've done here in a real-life scenario. I know because I've cleaned up other people's code that's on the level of this junk...

          • Even if you wrote this in C in the style in which they did it the program would be slow. Since there's no way to "extend" a C string, it would require determining the length of the current string (which involves scanning the string for a null byte), malloc'ing a new buffer with one more byte, copying the old string and then adding the new character and new null byte. Scanning and copying are both going to require an operation for each byte (yeah, it could be optimized to take advantage of the computer's word length) on each iteration, with that byte count growing by "1" each time.

            Actually, you can "extend" a C-style string just fine in C - just replace the NULL byte with another byte. It's a common error in C programs to miss the NULL byte.

            This works because C doesn't do boundary checks and will gladly let you overwrite your stack or heap.

            Unlike Java, C doesn't try to protect you from yourself.

            • Well, yeah, but that's not going to work consistently. Worst case is if the string is on the stack you'll smash the stack and likely have a memory access error. If it's on the heap you'll likely get the error quicker.

              I wouldn't even think of writing a program in the manner in which their sample was written, but if I was trying to solve their basic "problem" there are better ways to go about it.

        • OK, so the authors are bad programmers and don't understand how string concatenation works. Strings are contiguous arrays, whereas disk files are made up of consecutive blocks, which are accessed through an index. If you want to append to a file, you may add a block, and modify the end of the index. But if you want to append to an array, you are forced to allocate a whole fresh array, because strings use fixed-size arrays.

          On the other hand, Java StringBuffers have amortized O(1) append cost. A StringB

    • Re:It depends (Score:5, Informative)

      by greg1104 ( 461138 ) <gsmith@gregsmith.com> on Wednesday March 25, 2015 @12:09PM (#49336281) Homepage

      SSDs and disk speed have nothing to do with this. None of these writes are hitting disk. All they've shown is that when you cache a write to disk, the operating system might add data to it more efficiently than the slow Python and Java string code can expand a string.

    • Re:It depends (Score:5, Insightful)

      by hcs_$reboot ( 1536101 ) on Wednesday March 25, 2015 @12:11PM (#49336301)
      RAM *is* faster (by far) than any persistent media (SSD, HD...). So whatever the test, the algorithm is probably bad.
    • Re:It depends (Score:4, Insightful)

      by Carewolf ( 581105 ) on Wednesday March 25, 2015 @12:24PM (#49336441) Homepage

      on the speed of your memory, and the speed of your disk, SSD's are getting more common.

      No, it doesn't. Memory is faster. If they get a result saying otherwise, they are doing it wrong, and are actually just measuring the performance of the in-memory cache speeding up the simplest implementation vs the performance of their own crappy implementation.

    • Re:It depends (Score:4, Informative)

      by jellomizer ( 103300 ) on Wednesday March 25, 2015 @12:48PM (#49336693)

      In general, writing to RAM is faster than writing to the disk. However, there are things that get in the way of both.
      1. OS memory management: You are growing a small in-memory string into a big one. Will the OS fragment the string when it runs into another program's reserved memory? Will it overwrite that memory (buffer overflow)? Will it find a larger contiguous block and copy the data there? Will it copy and move the data to a new location? Will this happen preemptively or only when the error condition occurs, and will it all happen on CPU cycles that are not shared with your app? Also, if you are low on memory the system may dump it to the disk anyway.

      2. OS disk management: A lot of the same concerns that memory management has. However, it is easier to find free space for a bunch of small requests than for one larger spot, so there may be more seek time.

      3. Disk caching: You tell the program to append to the disk. The OS sends the data to the drive and the drive responds, "Yeah, I got it." Then the OS goes back to handling your app; in the meantime your drive is actually spinning to save the data on the disk.

      4. How your compiler handles the memory. Data = Data + "STRING" vs. Data+="STRING" vs Data.Append("STRING") vs { DataVal2=malloc(6); DataVal2="STRING"; DataRec->Next = *DataVal2; } You could be spending O(n) time saving your memory where you could be doing it in O(1).

      Now, sometimes I do change my algorithm to write to the disk vs. handling it in memory. Mostly because the data I am processing is huge, and I would much rather sacrifice speed in order to ensure that the data gets written.

    • python and java (Score:5, Informative)

      by Spazmania ( 174582 ) on Wednesday March 25, 2015 @01:10PM (#49336949) Homepage

      They tested using strings in python and java, both of whose string libraries are very much overweight. And they tested by concatenating strings in a way that requires constant reallocations and memory copies versus pushing data to fixed size disk buffers in the OS cache.

      So... surprise! When writing data sequentially the C implementation of disk buffers is faster than the java and python implementations of strings.

  • by ubergeek65536 ( 862868 ) on Wednesday March 25, 2015 @11:48AM (#49336053)

    Sorry but you'll need to do it without using any memory. We need to make it fast.

    • by Anonymous Coward on Wednesday March 25, 2015 @12:42PM (#49336647)

      Sorry but you'll need to do it without using any memory. We need to make it fast.

      Memory bandwidth is about 20Gb/s. Disk bandwidth is about 0.05Gb/s. The performance consequences of this are obvious to anyone who knows how basic arithmetic works.

      The results they got are invalid because their test framework is broken. This is exactly why everyone should be forced to learn C/C++ or Assembler in college/university. The reason for the crap result is they did not preallocate their buffers so they wasted all their execution time allocating and reallocating larger buffers from the heap. The disk APIs have their own internal buffer implementations, that were not written by idiots, that manage this correctly which is the cause of the difference.
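The preallocation point can be illustrated with a toy buffer that counts how many characters get moved during reallocations. This is a sketch only: `GrowableBuffer` is a hypothetical class written for this example and makes no claim about how any real library or OS buffer is implemented.

```java
import java.util.Arrays;

// Hypothetical growable buffer, written only to count copies; not a library class.
class GrowableBuffer {
    private char[] data;
    private int length = 0;
    long copied = 0; // characters moved during reallocations

    GrowableBuffer(int initialCapacity) {
        data = new char[Math.max(1, initialCapacity)];
    }

    void append(char c) {
        if (length == data.length) {
            // Double the capacity so reallocations become rare.
            copied += length;
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[length++] = c;
    }

    int length() {
        return length;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        GrowableBuffer grown = new GrowableBuffer(16);   // starts small, doubles on demand
        GrowableBuffer presized = new GrowableBuffer(n); // allocated once up front
        for (int i = 0; i < n; i++) {
            grown.append('a');
            presized.append('a');
        }
        System.out.println("doubling: " + grown.copied + " chars copied");
        System.out.println("presized: " + presized.copied + " chars copied");
    }
}
```

With doubling, the total copied stays below 2n characters; with a presized buffer it is zero. Reallocating to the exact new size on every append is what pushes the cost toward n(n+1)/2.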

  • by s.petry ( 762400 ) on Wednesday March 25, 2015 @11:51AM (#49336067)

    I'll have to dig through their testing and methods, but this seems pretty fishy given the summary.

    Seek/Read/Write time of a disk is always slower than memory. No exceptions to the rule exist given current commodity hardware. Bus length to a disk is also much longer than to memory. Again, there are no exceptions given commodity hardware.

    Won't be the first time someone reported that the laws of physics don't exist for something, and I'm sure it won't be the last. Maybe someone with free mornings in the US can break it down better than the summary.

    • by LordLimecat ( 1103839 ) on Wednesday March 25, 2015 @11:54AM (#49336105)

      Tl; DR:

      They used python and java. Sort of hard to develop a meaningful thesis on general programming when you're that far up the abstraction stack. Who knows, maybe python and Java suck at memory management (GASP).

      • Re: (Score:3, Insightful)

        by s.petry ( 762400 )
        read their code, you will see the problems.
        • Re: (Score:2, Informative)

          by Anonymous Coward

          You don't even have to read the code. Reading what languages they used reveals the entire flaw. They used languages with expensive string operations when done in-memory which is the only reason why writing to a buffered cache and writing to disk is faster.

          • by danlip ( 737336 ) on Wednesday March 25, 2015 @12:30PM (#49336511)

            The language is not the problem, the code is terrible. They did String concatenation in the most expensive way possible. I'm pretty sure if you used a pre-sized StringBuilder it would be faster in memory.

            They also make some very novice benchmarking mistakes.

            This is actually a pretty good interview problem. Anyone who writes code like that should not be hired, even for a junior position.

            • Re: (Score:3, Informative)

              by Anonymous Coward

              Changed Java code to use StringBuilder instead of String += String. Results on my machine:
              1: 0.010625
              10: 0.002375
              100: 0.001

              Maybe somebody who studies Chemical and Biological Science is not a good developer

        • by Anonymous Coward on Wednesday March 25, 2015 @12:16PM (#49336341)

          Let me guess

          1. They used "" + "" instead of StringBuilder
          2. They didn't actually flush the file bytes to disk, so it's really a comparison of stupid programmer in-memory string cat and intelligent caching of file writes.
          3. They intentionally engineered a scenario that reported data that was contrary to reality in order to get clicks

          • Pretty much. This entire article is basically saying that if you do things in the most stupid way possible you can make it slow.

            • I think what they've proven is that there are so many layers in modern programming languages that most of what programmers do because it seems like a good idea probably generates terrible outcomes.

              This actually explains a lot about modern programs, and how 5 years later a machine with twice the resources takes the same amount of time to do something as 5 year old software.

              Because the bloat and inefficiencies added in those five years offset any other improvements. :-P

              • Except any half-decent Java developer uses StringBuilder, not + concat, because everyone knows the latter is slower and causes more objects to be created. The only thing they proved is that by purposefully doing something wrong you can make it crappy.

          • by halivar ( 535827 )

            1. Eyup.
            2. They actually did flush.
            3. Absolutely.

        • by bondsbw ( 888959 ) on Wednesday March 25, 2015 @12:23PM (#49336429)

          Specifically, the time measured to write to memory uses the following code:

            for (int i=0; i < numIter; i++) {
                    concatString += addString;
            }

          The time measured to write to disk uses the following code:

            for (int i=0; i < numIter; i++) {
                    writer.write(addString);
            }
            writer.flush();
            writer.close();

          In Java, strings are immutable. Each string concatenation produces a new string on the heap, and the old string is unchanged. So there are numIter strings created in memory, and I assume garbage collection will probably happen at some point once enough memory is used. O(n) reads and O(n) writes to the heap with O(n^2) memory usage plus an unknown number of garbage collections. This can cause considerable slowing of the in-memory algorithm.

          That algorithm is then compared with one that does numIter writes to a buffer, which is then flushed to disk at the end. O(n) writes to memory buffer (no need to re-read memory) using O(n) memory space, followed by O(1) writes to disk and O(n) disk space used.

          Granted, it's been over a decade since I took algorithms so I wouldn't doubt that someone can show how I am off, but this kind of thing should be simple to spot for anyone who has an undergrad CS degree.

          PS - I love how the paper makes this aside as if it doesn't matter tremendously:

          Java performance numbers did not change when the concatenation order was reversed in the code in Appendix 1. However, using a mutable data type such as StringBuilder or StringBuffer dramatically improved the results.
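For readers who want to try the mutable alternative the paper mentions, here is a hedged sketch that puts the two styles side by side. The class and method names (`ConcatStyles`, `withPlusEquals`, `withBuilder`) are invented for this example, and the timing in `main` is a rough illustration, not a rigorous benchmark.

```java
// Hypothetical comparison class; names are illustrative, not from the paper.
class ConcatStyles {
    // The pattern quoted above: each += builds a new String and copies the old one.
    static String withPlusEquals(String piece, int count) {
        String result = "";
        for (int i = 0; i < count; i++) {
            result += piece;
        }
        return result;
    }

    // Mutable alternative: StringBuilder appends into a pre-sized internal buffer.
    static String withBuilder(String piece, int count) {
        StringBuilder builder = new StringBuilder(piece.length() * count);
        for (int i = 0; i < count; i++) {
            builder.append(piece);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        int count = 100_000;
        long t0 = System.nanoTime();
        String a = withPlusEquals("a", count);
        long t1 = System.nanoTime();
        String b = withBuilder("a", count);
        long t2 = System.nanoTime();
        System.out.println("+=            : " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("StringBuilder : " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("same result   : " + a.equals(b));
    }
}
```

Both methods return the same string; only the amount of copying differs, which is why the choice of technique matters more here than the choice of storage medium.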

          • by halivar ( 535827 ) <bfelger@@@gmail...com> on Wednesday March 25, 2015 @12:43PM (#49336649)

            And this is why we should not teach CS101 in Java or Python. If they'd been forced to use C this whole experiment would have turned out differently. Even the professors are getting lazy, now.

            • by Coryoth ( 254751 )

              And this is why we should not teach CS101 in Java or Python. If they'd been forced to use C this whole experiment would have turned out differently.

              Not at all. If you wrote your C in memory string handling as stupidly as they wrote the Python and Java you will still get worse performance in C (e.g. each iteration malloc a new string and then strcpy and strcat into it, and free the old string; compared to buffered file writes you'll lose). It's about failing to understand how to write efficient code, not about which language you chose.

              • by Rakarra ( 112805 ) on Wednesday March 25, 2015 @01:51PM (#49337341)

                Not at all. If you wrote your C in memory string handling as stupidly as they wrote the Python and Java you will still get worse performance in C (e.g. each iteration malloc a new string and then strcpy and strcat into it, and free the old string; compared to buffered file writes you'll lose). It's about failing to understand how to write efficient code, not about which language you chose.

                Yes, but we're talking new programmers here. At least in C, you're forced to have to explicitly write inefficient code. New programmers know what malloc does (if they don't, they're behind in their classes). In Java and Python, things are done for you. That can be good! It frees you from a bit of micromanagement. But again, for a new programmer, it's not apparent that they're doing something especially inefficient because the work happens invisibly. It's obvious when you have to malloc() a whole new string buffer in C every time you append to a string. It's less obvious in Java when you just append and the runtime ends up creating a new buffer on the heap for you. ASM is perhaps a bit TOO low-level and weird to start a new programmer on, but I think a full OOP language like Java or scripting language like Python might be too high-level and encourage bad habits to develop. In my CS classes, C hit a pretty good sweet spot.

                Then again, you can program badly in any language, and C has its own perils.

            • by OverlordQ ( 264228 ) on Wednesday March 25, 2015 @03:50PM (#49338341) Journal

              THATS THE ENTIRE POINT OF THIS PAPER.

              It is easy to explain the results: In high-level languages such as Java and Python, a seemingly benign
              statement such as concatString += addString may actually involve executing many extra cycles behind
              the scenes. To concatenate two strings in a language such as C, if there is not enough space to expand
              the concatString to the size it needs to be to hold the additional bytes from addString, then the
              developer has to explicitly allocate new space with enough storage for the sum of the sizes of the two
              strings and copy concatString to the new location, and then finally perform the concatenation. In Java
              and Python strings are immutable, and any assignment will result in the creation of a new object and
              possibly copy operations, hence the overhead of the string operations. The disk-only code, although
              apparently writing to the disk excessively, is only triggering an actual write when operating system
              buffers are full. In other words, the operating system already lessons disk access times. A developer
              familiar with the language and system internals readily notices the causes of this observed behaviour,
              but this behaviour may be easily missed, as indicated by examining similar cases in production code.

          • The direct disc write should also manage to overlap writes to the stream object with flushes from it to the underlying drive. Except of course it doesn't, because they aren't writing enough data for the disc write to actually start before they're done. I'm also a little confused about why they think flush+close is synchronous; it's going to return instantly and flush data in the background. So they aren't even timing what they think they are.

            Back in the world of programmers with a clue, I did fix an in-memor

          • by bwcbwc ( 601780 )

            And there goes another grad student's research thesis up in smoke. CS departments need to have more courses that distinguish between abstract theory (raw algorithms) and software engineering (practical effects of choosing specific languages and features). It's clear the authors of this are in an ivory tower where every string type is the same type of construct in every language.

      • by Frnknstn ( 663642 ) on Wednesday March 25, 2015 @12:06PM (#49336243)

        It's not even the choice of tools, they seem to willfully misuse the languages to get poor results.

      • That is pretty much what the article suggests. Concatenating string involves creating objects, blah blah blah...

        I doubt you'd see the same "9000 times slower!" kind of results with standard C strings.

      • This was of course compounded by the fact that they did not follow the languages' own guidelines with regard to string concatenation. Nor did they demonstrate any clear understanding of how modern operating systems work. Sadly, this was all round a poor effort.
      • Looks like someone forgot to use StringBuilder

    • by s.petry ( 762400 )
      Not to Karma whore, but I already see two problems with their testing by reading their code samples. Lets see who else finds them. The simple answer is no, disk is not slower than memory. The long answer is yes, programmers can make it look that way.
      • Somewhat off-topic, but somewhat related:

        Many years ago, when I was doing my degree and computers were still steam powered, a friend and I were writing the same assignment.

        He worked for the university, and had a privileged account on the VAX. I had the loan of a 286 from a prof who no longer needed it and took pity on me.

        I, being constrained by physical memory, had to write a new kind of sparse array to hold my data. He, having access to lots more virtual memory heap than I, wrote a huge array which wasn'

      • by s.petry ( 762400 )

        Bah, I wrote the wrong thing...

        The simple answer is no, disk is not faster than memory.

    • by TheCarp ( 96830 )

      Except that frictionless spherical cows are not realistic even if they are very helpful in physics.

      When is the last time you actually talked to raw hardware? If it's recent, you are a special case, and likely write drivers... in which case, good for you.

      When you write "to disk" you are working in memory because it's going to be a buffered access, likely reads as well, especially if it is something you recently wrote.

      Exceptions will exist but, they are exceptions to the rule.

    • No, it's completely understandable and shouldn't even be thought of as strange to seasoned programmers.

      The critical issue is there's a difference between calling an I/O function like write, and actually manipulating the IDE control lines on a hard disk. Typically for the former, the operating system is sitting there buffering things up in a relatively simple, uncomplex, way - ie it has some memory allocated, a pointer, and when you call the function all it does is copy the bytes to the memory and increme

    • What they actually compared wasn't the speed of the disks, but the speed of the language runtime and OS file IO buffering routines!

      It wasn't really that surprising that concatenating Java or Python objects can be slower than letting the low-level runtime do the same task.

      If they had wanted to test the disk IO speed then they would have had to add at least some fflush() calls.

      It is trivial, in any language, to make your code faster than the actual disk transfer speed, but a lot harder to make it faster than

    • by w3woody ( 44457 )

      Really, what's happening is that they're performing repeated concatenations of various length strings--an operation that eventually becomes O(m*n) time, with m being the length of the string and n being the number of strings. (Concatenating strings in Java requires a new string to be created, then the contents of the two source strings copied into the new destination.) Appending a file, on the other hand, is only an O(n) operation, but has a very large constant time associated with it. So, in essence: TL;DR

  • by MobyDisk ( 75490 ) on Wednesday March 25, 2015 @11:53AM (#49336081) Homepage

    This is the dumbest research I've seen in 2015. There was actually no computation involved -- they just wanted to write a long string to disk. They concluded that adding the superfluous step of concatenating strings in memory, then writing to disk, was slower. Well duh! That's not what memory is for!

    • by c ( 8461 )

      Pretty much my thoughts. Writing to disk is slow, but it's also semi-async operation (in that much of the time, the job is offloaded to the I/O subsystem before the write is complete), which generally means the sooner you start writing your results the sooner you'll finish, and if you start early you can do computational work while the I/O is happening rather than spinning wheels while trying to write the whole thing in one go. All they seem to have done is add a pile of latency and may even have introduced

      • It's even more simple than that. Their "writes to disk" are just being stored in disk cache hence the "faster" speed. On the other hand, they do basically the most inefficient in-memory operations possible.

    • by Jaime2 ( 824950 )
      It's dumber than that. They didn't even do it right in Java. There is a note near the end of the paper that says "However, using a mutable data type such as StringBuilder or StringBuffer dramatically improved the results". They didn't present the numbers, but what they really meant was "The performance problems we saw were entirely due to our not using StringBuilder or StringBuffer, this paper shows no meaningful difference in performance between memory-then-disk and disk-only access once the algorithm is f
  • The price of ECC ram doesn't drop for years and years.

    • What they're saying is if you write bad code, it performs like shit. Did someone get a PhD from this?

      • What they're saying is if you write bad code, it performs like shit. Did someone get a PhD from this?

        Well, two biology majors did comprise 2/3rds of the contributors to this madness... I sure hope the Electrical and Computer engineer didn't get a PhD for this. There's no way to defend this with a straight face, if you ask me.

  • It's slower in languages with automatic memory management, or with a VM, which is no surprise.
    It would be much faster than disk if you wrote the time critical parts in a language designed for, you know, speed...

  • Generally, if you're looking to speed things up in RAM, it's not because you're concatenating a group of strings over and over; it's because your overall read time improves dramatically as well. The study also doesn't take into account IO controller overhead... for example, the overhead to write to RAM is generally mitigated in Intel chips as the northbridge is merged into the processor and takes advantage of cool things like predictive instructions by the ALU. PERC RAID controllers and HBAs are typically limite
  • This is a REALLY mind-bogglingly stupid test (or at least headline). Of course it is faster to immediately write stuff to disk as it becomes available than to build the string in memory and then flush it to disk. Keep the IO bus full while the next write is prepared.

    That doesn't change the fact that you should avoid touching the disk as much as possible, it just illustrates that if you must touch the disk, you should try to do it while the processor is busy doing other things (if possible).

  • for "idiotic premise"

  • Unless I have misread the paper, it seems that these folks have just found experimental proof that disk writes are buffered.

    "In Java and Python strings are immutable, and any assignment will result in the creation of a new object and possibly copy operations, hence the overhead of the string operations. The disk-only code, although apparently writing to the disk excessively, is only triggering an actual write when operating system buffers are full. In other words, the operating system already lessons disk access times."

    I'm guessing that this investigation started with someone making a bet while their thought processes were slightly impaired.
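The buffering point can be made concrete with a small Java sketch. It shows the difference between `flush()`, which at most hands bytes to the operating system, and `FileDescriptor.sync()`, which asks the OS to push them to the device. The class name `FlushVsSync` and the helper `writePieces` are hypothetical, and nothing here is taken from the paper's code.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: names and structure are illustrative only.
class FlushVsSync {

    // Writes `piece` to `file` `count` times and returns the resulting file size in bytes.
    static long writePieces(Path file, String piece, int count, boolean forceToDevice)
            throws IOException {
        byte[] bytes = piece.getBytes(StandardCharsets.US_ASCII);
        try (FileOutputStream out = new FileOutputStream(file.toFile())) {
            for (int i = 0; i < count; i++) {
                out.write(bytes); // handed to the OS, which may keep it in its own cache
            }
            out.flush(); // does not force anything onto the physical disk
            if (forceToDevice) {
                // Returns only after the OS reports the data written to the device.
                out.getFD().sync();
            }
        }
        return Files.size(file);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("flush-vs-sync", ".txt");
        int count = 100_000; // kept small so the run stays short

        long t0 = System.nanoTime();
        writePieces(file, "1", count, false);
        long t1 = System.nanoTime();
        writePieces(file, "1", count, true);
        long t2 = System.nanoTime();

        System.out.printf("flush only: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("with sync:  %d ms%n", (t2 - t1) / 1_000_000);
        Files.delete(file);
    }
}
```

A benchmark that stops the clock after `flush()` is mostly timing memory copies into OS buffers, so it says little about the disk itself.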

  • No, It's Not Always Quicker To Do Things In Memory

    The title ("No, It's Not Always Quicker To Do Things In Memory") should be modded Flamebait, Troll, or similar, if that were possible.

  • They're only examining the performance of concatenating immutable strings, versus the performance of writing to a (buffered) stream.

    This is a problem that's been known about for donkey's ages. It's just that computers are so stupidly powerful it's no longer an issue that many programmers ever have to confront.

    In VB6 you had to jump through hoops to do it properly, but it's such a common case in Java that the compiler will optimize repeated concatenations in a loop into using a StringBuilder instead. I presu

  • Basically the article doesn't give enough detail. It doesn't say whether the strings were created using the base string objects in Java/Python or using the much more efficient StringBuilder objects. The former would be horrendously slow. Also, what was the base setup of the machines being tested on? How much memory did they have? Did their disk controllers have built-in cache? What kind of disks were used?
  • The research tells us that repeatedly concatenating strings together is a bad thing... WE ALREADY KNOW THIS!!! Good grief, who taught these guys to code? The title of their paper "When In-Memory Computing is Slower than Heavy Disk Usage" implies heavy disk access where none exists. They actually go on to point out that it's the OS doing magic things that helps out, i.e. it's the OS using RAM to buffer the disk that keeps your app speedy. So erm... memory being used instead of disk then... the exact opposite

  • I'd argue it's always faster to do things in memory. In the case presented here they were *not*. In both cases being compared they were writing to disk. All they did was determine the better way (for their case) to write to disk.
  • by Alsee ( 515537 ) on Wednesday March 25, 2015 @01:12PM (#49336971) Homepage

    NEW SCIENTIFIC DISCOVERY!
    For n equal to one million, an O(n^2) algorithm is slower than an O(n) algorithm. Even when the O(n^2) algorithm is run in RAM, and the O(n) algorithm is disk writes being buffered and optimized by the operating system.

    I'll take my Nobel Prize now, thank you.

    -

  • by orlanz ( 882574 ) on Wednesday March 25, 2015 @01:21PM (#49337069)

    Ok, I read all the other "This is stupid" comments and my jaw kept dropping. I actually felt this was an April fools thing or something similar and that we were all missing something somewhere (and please let me know if I am... I REALLY need to know). I HAD to read the article and underlying paper, cause I just couldn't believe the absolute asinine stupidity of the test, let alone that it was being presented as research, or that the test itself was so flawed! So after all that, had to post. Summary for others, adding my voice to the crowd.

    ----------------
    Assumption: Software Developers avoid disk access cause they believe doing it in memory is faster. This is put in context of BI and bigdata.

    Testing: Create a program representing a common task that can be tested where one uses memory and the other uses diskspace.
    Memory Test:
    1) Create a string in memory.
    2) Add it multiple times into another string
    3) Write second string onto Disk
    4) Flush writes

    Disk Test:
    1) Create a string in memory
    2) Write it multiple times to Disk
    3) Flush writes

    Create code in Python and Java.

    Conclusion: Memory Test is so much slower than Disk Test! Additionally, the languages used have certain quirks to make it worse. Optimization helped a little but only on Linux. Therefore, programmers should reassess and understand their OS and programming languages before assuming this belief which is not true.
    ---------------
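For readers who want to try the two variants summarized above, here is a hedged Java sketch. It is a reconstruction from this summary, not the paper's code: the class name `MemoryVsDiskSketch`, the use of `BufferedWriter`, and the reduced iteration count are all assumptions.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical reconstruction of the two test variants; not the paper's actual code.
class MemoryVsDiskSketch {

    // "Memory Test": concatenate everything in memory, then write the result once.
    static void memoryThenDisk(Path file, String piece, int count) throws IOException {
        String all = "";
        for (int i = 0; i < count; i++) {
            all += piece; // each += builds a new String and copies the old contents
        }
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.US_ASCII)) {
            out.write(all);
            out.flush();
        }
    }

    // "Disk Test": write each piece as soon as it is available.
    // BufferedWriter batches these small writes in memory before they reach the OS.
    static void diskOnly(Path file, String piece, int count) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.US_ASCII)) {
            for (int i = 0; i < count; i++) {
                out.write(piece);
            }
            out.flush();
        }
    }

    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("memory-test", ".txt");
        Path b = Files.createTempFile("disk-test", ".txt");
        int count = 100_000; // assumption: smaller than the million iterations discussed here

        long t0 = System.nanoTime();
        memoryThenDisk(a, "1", count);
        long t1 = System.nanoTime();
        diskOnly(b, "1", count);
        long t2 = System.nanoTime();

        System.out.printf("memory-then-disk: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("disk-only:        %d ms%n", (t2 - t1) / 1_000_000);
    }
}
```

Both methods produce byte-identical files. The only difference is where the pieces are accumulated, which is the point several commenters make: the "disk" variant is also buffered in memory, just more efficiently.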

    Assumption & Testing idea... very good. I would have loved to know the unknown scenarios where this assumption should be questioned. Especially in the world of click&drag programming for workflows, ETLs, and report writing.

    But from there... it's all BS and stupidity. Basically the test tests if replicating the hard drive driver in memory and then using the driver to write to disk is faster than just using the driver to write to disk. Are you bloody serious?!?! That's like testing if 2+2 is greater than 2+0. And that is before we start looking at using Java and Python which do a ton of work in terms of memory management and build all types of stuff around data types. Before the fact that they wrote the Python code WRONG (that's the slow way of doing string or list concat). So they picked languages that write in memory O(n) extra times for the same data.

    This test would have come to the same conclusions in C, C++, or Assembly! But the folks wouldn't have been able to write code to see the microsecond time differences.

    So let's set the record straight. NO developer out there goes out of their way to just write to a memory file if it's simply going to flush to disk. It's not worth the extra lines of code, nor the lost CPU cycles in reading them. Especially since most operating systems do this already at multiple points along the data chain at the very low hardware & driver levels! If we have developers like this, we have a ton of bigger problems in software development than this little thing that will be solved by money.

    To test this belief properly, give me a scenario where you reuse the written to disk/memory stuff, transform it, and then write to disk. See which one is slower. If it's written properly, you will see that the underlying hardware systems will actually store stuff in cache or memory for you to help you speed it up! If you find proper scenarios where the memory part is slower, please let us know cause that is actually adding to the IT body of knowledge.

    God, as this was BigData related, I was hoping at least something along the lines of "In DB data processing and extract vs extract and client side processing". Give me the points along a curve where one is better/worse than the other. THAT would have been interesting.

  • by jetkust ( 596906 ) on Wednesday March 25, 2015 @01:29PM (#49337153)
    Maybe we should store our files in memory and load them into the harddrive to do calculations.
  • by viperidaenz ( 2515578 ) on Wednesday March 25, 2015 @02:21PM (#49337619)

    String concatString = "";
    for (int i = 0; i < numIter; i++) {
        concatString += addString;
    }

    That's going to create 1,000,000 StringBuilder objects, use them to append a single String each, and allocate 1,000,000 new String objects as well

    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < numIter; i++) {
        builder.append(addString);
    }
    String concatString = builder.toString();

    I bet $1,000,000 that code is faster.

    tl;dr; Researchers who don't know how Java works suck at writing Java benchmarks.
    String a = b + c;
    gets translated by the compiler to something like:
    String a = new StringBuilder(b).append(c).toString();

    It creates a new StringBuilder object along with its member variables, including a char array, and copies in the String passed to the constructor. append probably also has to expand the array, which means creating a new array and copying the old one into it, then copying the data from c to the end of the new array.
    toString then creates a new String object, copying the data again.

    If you write shit code, you get shit performance.
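The bet above is easy to check. Below is a minimal timing sketch in Java, offered as a rough illustration rather than a rigorous benchmark: the class and method names are hypothetical, the iteration count is reduced to keep the run short, and a single wall-clock measurement like this includes JIT warm-up and garbage collection noise.

```java
// Hypothetical micro-benchmark sketch; class and method names are illustrative.
class ConcatTiming {

    // Repeated += on an immutable String: each step copies everything built so far.
    static String withPlusEquals(String piece, int count) {
        String result = "";
        for (int i = 0; i < count; i++) {
            result += piece;
        }
        return result;
    }

    // One pre-sized StringBuilder: each character is copied into the buffer once,
    // then once more by toString().
    static String withBuilder(String piece, int count) {
        StringBuilder builder = new StringBuilder(piece.length() * count);
        for (int i = 0; i < count; i++) {
            builder.append(piece);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        int count = 100_000; // assumption: smaller than the million iterations discussed here

        long t0 = System.nanoTime();
        String a = withPlusEquals("1", count);
        long t1 = System.nanoTime();
        String b = withBuilder("1", count);
        long t2 = System.nanoTime();

        System.out.printf("+= loop:       %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("StringBuilder: %d ms%n", (t2 - t1) / 1_000_000);
        System.out.println("same result:   " + a.equals(b));
    }
}
```

For anything beyond a quick sanity check, a harness such as JMH would give more trustworthy numbers than `System.nanoTime()` around a single call.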

  • by ChaoticCoyote ( 195677 ) on Wednesday March 25, 2015 @03:38PM (#49338229) Homepage
    Slashdot has fallen far in credibility if it promotes sloppy research like the referenced article.
