Programming As If Performance Mattered (615 comments)

Junks Jerzey writes "Saw the essay 'Programming as if Performance Mattered', by James Hague, mentioned at the Lambda the Ultimate programming language weblog. This is the first modern and sensible spin on how optimization has changed over the years. The big 'gotcha' in the middle caught me by surprise. An inspiring read." Hague begins: "Are particular performance problems perennial? Can we never escape them? This essay is an attempt to look at things from a different point of view, to put performance into proper perspective."
  • by rpozz ( 249652 ) on Thursday May 06, 2004 @12:46AM (#9070697)
    Performance can be quite a major thing if you're doing a lot of forking/threading (i.e. in a daemon). If you create 100 threads, any memory leaks or bottlenecks are multiplied 100 times.

    However, a 0.1s delay after clicking an 'OK' button is perfectly acceptable. It all depends on what you're coding.
  • by fervent_raptus ( 664099 ) on Thursday May 06, 2004 @12:48AM (#9070704)
    this slashdot post would read:

    I just finished reading the essay "Programming as if Performance Mattered", by James Hague. The essay covers how compiler optimization has changed over the years. If you get bored, keep reading; there's a big 'gotcha' in the middle. Hague begins: "Will performance issues haunt us forever? This essay puts performance analysis in perspective."
  • by Anonymous Coward on Thursday May 06, 2004 @01:03AM (#9070763)
    Obviously you have never done any programming wrt cryptography. Optimization is *_N-E-V-E-R_* done in the hardware!!! The difference between using a good algorithm and a crappy one is the difference between 2 days for the program to run and fifty trillion centuries (literally). Hardware upgrades are merely incremental. Moore's law says speed doubles every 18 months, but doubling is a tiny incremental increase; if you want an exponential/logarithmic change, you have to use software. I'm not talking about "oh, twice as fast as a year ago", but "10,000 times as fast as that other software" or "1 billion times as fast".
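
    To put a number on it, compare naive modular exponentiation with square-and-multiply (an illustrative sketch added here, not from the parent post; real crypto code uses bignum libraries, and the 64-bit version below only works for moduli under 2^32):

    #include <cstdint>

    // O(exponent) multiplications -- hopeless for cryptographic-size exponents.
    uint64_t pow_mod_naive(uint64_t base, uint64_t exp, uint64_t mod) {
        uint64_t result = 1 % mod;
        for (uint64_t i = 0; i < exp; ++i)
            result = (result * base) % mod;
        return result;
    }

    // O(log exponent) multiplications -- the standard square-and-multiply trick.
    // Same answer, but for a 1024-bit exponent it is the difference between
    // "finishes instantly" and "never finishes".
    uint64_t pow_mod_fast(uint64_t base, uint64_t exp, uint64_t mod) {
        uint64_t result = 1 % mod;
        base %= mod;
        while (exp > 0) {
            if (exp & 1)
                result = (result * base) % mod;
            base = (base * base) % mod;
            exp >>= 1;
        }
        return result;
    }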
  • by metlin ( 258108 ) * on Thursday May 06, 2004 @01:08AM (#9070786) Journal
    Contrary to popular belief, managed code environments do optimize code a whole lot more than you would think!

    Joe Beda [eightypercent.net], the guy from Microsoft behind Avalon, had a discussion on Channel9 where he talked about why managed code is not that bad a thing after all.

    Like I mentioned in an earlier post, managed code helps optimize the code for some of the bad programmers out there who cannot do it themselves, and takes care of a lot of exceptions and other "troublesome" things :) So, in the long run, it may not be that bad a thing after all.

    There are two facets to optimization - one is optimization of the code per se, and the other is the optimization of the project productivity - and I think managed code environments do a fairly good job of the former and a very good job of the latter.

    My 0.02.
  • Optimization isn't really a hard topic. Should a programmer spend days nitpicking fifty lines of code that won't be used frequently? No. When initially writing code, should someone use Bogosort [wikipedia.org] instead of Quicksort [wikipedia.org]? I'll let you figure that one out.
    My biggest (reasonable) beef in the optimization area is software bloat. Programs like huge office suites full of excessive, poorly implemented crap that people won't use really tick me off. KISS. Even the stuff that has to be complicated.

    Of course, I'll always be a sucker for tweaking code for the fun of it, when I have the time. =)
  • by Jason Pollock ( 45537 ) on Thursday May 06, 2004 @01:38AM (#9070921) Homepage

    I am hearing a lot of people saying that you shouldn't optimise prior to the first release. However, it is very easy to select a design or architecture that caps your high-end performance. Therefore, there is some optimisation that needs to be done early.

    When you're architecting a system that is going to take tens of man years of effort to implement, you need to ensure that your system will scale.

    For example, a project I recently worked on hit a performance wall. We had left optimisation for later, always believing that it shouldn't be done until last. However, while the architecture chosen was incredibly nice and clear, it limited performance to a third of what was required. Back to the drawing board, and we just doubled the project cost - ouch.

    Even worse, there are performance differences on each platform! For example, did you know that throwing an exception is 10,000 times slower than a return statement on HP-UX 11? Solaris is only a little better, at two orders of magnitude. Linux is (I understand) a dead heat. (A rough way to measure this yourself is sketched at the end of this comment.)

    So, while low-level optimisation of statements is silly early in the project, you do need to ensure that the architecture you choose is going to meet your performance requirements. Some optimisations are definitely necessary early in the project.

    The article also talks about tool selection, suggesting that the extra CPU could be better used to support higher level languages like Erlang. If a system has CPU to spare, I agree, use what you can. The projects I work on always seem to be short of CPU cycles, disk write speed, and network speed. You name it, we're short of it. In fact, a large part of our marketing strategy is that we are able to deliver high performance on low end systems. What would happen to us if we dropped that edge? We're working with a company that has implemented a real-time billing system in Perl. Not a problem, until you try to send it 500 transactions/second. Their hardware budget? Millions to our 10s of thousands. Who do you think the customer likes more?

    Jason Pollock
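
    A rough way to measure the exception-vs-return gap on your own platform (a sketch only; the iteration count, function names, and timing details are made up for illustration):

    #include <chrono>
    #include <cstdio>

    static int fail_with_return() { return -1; }
    static void fail_with_throw() { throw -1; }

    int main() {
        using clock = std::chrono::steady_clock;
        const int iterations = 100000;
        volatile int sink = 0;   // keeps the compiler from discarding the loops

        auto t0 = clock::now();
        for (int i = 0; i < iterations; ++i)
            sink = sink + fail_with_return();
        auto t1 = clock::now();

        for (int i = 0; i < iterations; ++i) {
            try { fail_with_throw(); } catch (int e) { sink = sink + e; }
        }
        auto t2 = clock::now();

        std::printf("return: %lld us, throw: %lld us\n",
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
        return 0;
    }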
  • by scot4875 ( 542869 ) on Thursday May 06, 2004 @01:47AM (#9070950) Homepage
    An intelligent compiler (i.e. any modern compiler you'd be likely to use) will automatically inline the fred::setQ function, and then the peephole optimizer will reduce it down to the equivalent of myFred.q = 10;

    --Jeremy
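
    For readers without the parent's context, the kind of code being discussed looks roughly like this (the class is reconstructed here for illustration, not quoted from the parent post):

    // A trivial setter: a function call in the source, a plain store after
    // the optimizer has inlined it.
    class fred {
    public:
        void setQ(int value) { q = value; }
        int q;
    };

    void demo() {
        fred myFred;
        myFred.setQ(10);   // with optimization on, this compiles down to the
                           // equivalent of myFred.q = 10;
    }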
  • by joib ( 70841 ) on Thursday May 06, 2004 @01:48AM (#9070957)
    WTF are you talking about?

    I'm staring at the apt codebase on my screen just now, and it's all C++, baby. Ok, so there is a trivial amount of perl; sloccount summary:

    Totals grouped by language (dominant language first):
    cpp: 26481 (89.75%)
    sh: 2816 (9.54%)
    perl: 209 (0.71%)

    This is for apt-0.5.14, but I can't imagine that the newest version in unstable (0.5.24) would be that different.

    Now, if the rest of your story is true, that's mind-boggling. If the new teacher refused to grade your work (which, from your description, sounds very fine) just because he has a serious hard-on for Gentoo, I seriously believe you should have taken it up with the dean of the faculty instead of just swallowing it and later complaining on /..

    That being said, why choose apt in the first place? Now, I haven't profiled apt, but I'd guess it spends the majority of its time waiting on network I/O or waiting for dpkg to finish anyway.
  • by Anonymous Coward on Thursday May 06, 2004 @01:56AM (#9070988)
    I don't know why his example was so bad.

    A good example would be how to detect whether a king is in check in a chess program. There are a few different approaches. Some are fast, some are slow, and a compiler just cannot "optimize" a slow approach into a fast one. The function is called millions of times per second in a chess program, so you want it optimized.
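
    For what it's worth, the fast approaches trace attack lines outward from the king's square instead of generating every opponent move. A simplified fragment (added for illustration; it only covers rook/queen attacks along ranks and files, and the piece encoding is made up):

    enum Piece { EMPTY, W_KING, B_ROOK, B_QUEEN, OTHER };

    // Is the king on (kr, kc) attacked along a rank or file? A real routine
    // also checks diagonals, knights, pawns, and the enemy king, but the
    // shape is the same: scan outward from the king until a piece is hit.
    bool rook_or_queen_attacks_king(const Piece board[8][8], int kr, int kc) {
        const int dirs[4][2] = { {1,0}, {-1,0}, {0,1}, {0,-1} };
        for (const auto& d : dirs) {
            int r = kr + d[0], c = kc + d[1];
            while (r >= 0 && r < 8 && c >= 0 && c < 8) {
                if (board[r][c] != EMPTY) {
                    if (board[r][c] == B_ROOK || board[r][c] == B_QUEEN)
                        return true;   // first piece on the line gives check
                    break;             // any other piece blocks the line
                }
                r += d[0];
                c += d[1];
            }
        }
        return false;
    }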
  • by Anonymous Coward on Thursday May 06, 2004 @02:12AM (#9071050)
    One of the most amazing PC demos I ever saw was a 256 byte intro that ran under DOS (I forget the name of it).
    This one [pouet.net]?
  • by Anonymous Coward on Thursday May 06, 2004 @02:13AM (#9071052)
    The program you're looking for is "tube". If you want to get seriously impressed, have a look at "lattice" instead.
    However, you can't achieve the same easily in Linux, since
    a) putting pixels is more than just writing to 0a0000h, and
    b) the ELF format actually has some structure (IIRC a program that merely returns 42 takes 53 bytes and uses a fairly obscene amount of trickery to achieve that).

    These will probably bloat the Linux version to something like 512 bytes ;) Oh dear.
  • by Anonymous Coward on Thursday May 06, 2004 @02:19AM (#9071068)
    bzzzt... split infinitive, use of passive voice, etc.

    Sheesh. Don't make the corrections unless you know what you're talking about. Where's the split infinitive? In "I just finished reading..."? There's no split infinitive there.

    That post doesn't use passive voice, either -- "the essay" is a perfectly valid subject. Passive voice would be "compiler optimization is discussed."

    Hmph.
  • Re:What annoys me (Score:1, Informative)

    by Anonymous Coward on Thursday May 06, 2004 @02:21AM (#9071072)
    Indictment means the act of indicting, and to indict means to accuse of wrongdoing. There are also the legal definitions, but indictment is correct for common usage.
  • by Anonymous Coward on Thursday May 06, 2004 @02:36AM (#9071115)
    Here's a somewhat relevant anecdote.

    I interviewed at a company that makes a big deal about being super duper technical on their web site. They had a written coding problem as part of the interview. (A good sign!)

    They left me in a room with a non-networked PC, with instructions to get as far as possible in writing a Java program that takes an initial date and a number of days to add or subtract, and figures out what the resulting date would be. The test instructions contained a detailed explanation of the workings of the Gregorian calendar system. The PC had Windows and the JDK installed on it, and just about nothing else. They gave me a pretty short period of time to do it in - 15 or 20 minutes, if I remember right.

    First I had to call the interviewer back in to show him that there were about TEN different past solutions still sitting on the hard drive, tell him that I was going to delete them all while he watched, and ask him to start the clock again. (Lamers...)

    When I read the problem I realized that it was very easily solved using the java.util.GregorianCalendar class that comes with the JDK [sun.com]. I didn't remember exactly how to use it but fortunately the installed JDK on this PC also included the JDK source. I javadoc'd the source to GregorianCalendar and Calendar and read the docs, wrote my app, and tested it thoroughly. Of course it didn't take long to get it working, since the hard work was already done. I had to walk all over the office looking for the interviewer, who apparently wasn't expecting me to actually complete the task within the allotted time.

    When I reviewed my very short program with the proctor and explained all of the things that I had done in order to do it that way, he seemed upset, as though I cheated. I tried to make a case for the fact that I had passed up a chance to actually cheat and then been resourceful, but he wasn't convinced. I didn't do it exactly the slow and tedious way they wanted, so I was wrong. I pointed out that if I was on a project and caught a developer duplicating base JDK functionality due to plain ignorance of the class library, I'd consider that a *bad* thing, not an example of technical excellence.

    The rest of the interview went OK, but they eventually called me back and said there was a hiring freeze. Well maybe so, since it was in 2000 or 2001 (I don't remember exactly when), or maybe not. I wasn't exactly crushed.

    Since then, the hard-skills tests that I use when interviewing developer candidates include something like this for the relevant environment... kinda like you said. Something like "read in a text file, sort the lines, and print them out in sorted order". If their program includes a sort routine, BZZT, they failed the test.
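
    For what it's worth, the library-first version of that exercise is only a handful of lines (a sketch; the file name is made up):

    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // Read a text file, sort the lines, and print them in sorted order --
    // using the standard library instead of hand-rolling a sort routine.
    int main() {
        std::ifstream in("input.txt");   // hypothetical input file
        std::vector<std::string> lines;
        std::string line;
        while (std::getline(in, line))
            lines.push_back(line);
        std::sort(lines.begin(), lines.end());
        for (const std::string& l : lines)
            std::cout << l << '\n';
    }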

  • by ibullard ( 312377 ) on Thursday May 06, 2004 @02:47AM (#9071164)
    Quote:
    Even traditional disclaimers such as "except for video games, which need to stay close to the machine level" usually don't hold water any more.

    Yeah, as long as you write simple, 2D games [dadgum.com](like the author of the essay does) that would be true. Complex, 3D games are another matter. I write games for a living and even if you're within sight of cutting edge you're writing at least some assembly and spending a lot of time optimizing C++.

    Now I'm not knocking all he says or saying that good games need to be in C++ and assembly. Some games rely heavily on scripting languages to handle the game mechanics and world events. There's a lot less assembly code than there used to be. However, the core engine that handles graphics, physics, AI, and I/O is going to be written in C++ and assembly, and will be for the foreseeable future.

    If I published a game that required a 3GHz computer to display 576x576 images at 66fps, I'd be laughed off the internet. A PS2 has a 300MHz processor and needs to display a 512x448 image 30-60 times every second.

  • Re:uhh.. yeah (Score:3, Informative)

    by prockcore ( 543967 ) on Thursday May 06, 2004 @02:56AM (#9071197)
    this is a zillion times more efficient than if(a == 1) a = 0; else a = 1;

    This is the one time where I'll step up and say that VC actually does a few neat tricks for the ternary operator.
    c=(a>b)?0:1 /* or c=!(a>b), it's the same code */
    translates to
    cmp b,a
    sbb c,c
    inc c
    There are other variants of this; I'll leave it as an exercise for the reader to figure out what is going on.
  • by nimblebrain ( 683478 ) on Thursday May 06, 2004 @03:35AM (#9071346) Homepage Journal

    After years of developing, I really take to heart two things:

    1. Premature optimization often makes better optimizations down the line much more difficult
    2. It's 90% guaranteed that the slowdown isn't where or what you thought it was

    Profilers are the best thing to happen to performance since compilers - really. I encounter a number of truths, but also many myths, about what degrades performance. A few examples of each:

    Performance degraders

    • Mass object construction
    • Searching sequentially through large arrays
    • Repeated string concatenation (there are techniques to mitigate this; see the sketch after these lists)
    • Staying inside critical sections for too long

    Not performance degraders

    • Lots of object indirection
    • Lots of critical sections
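
    On the string concatenation point above, the usual mitigation looks like this (an illustrative sketch, not code from the project I described):

    #include <string>
    #include <vector>

    // Repeated "s = s + piece" can go quadratic because every + builds a new
    // string. Appending in place, ideally after a reserve(), stays linear.
    std::string join_slow(const std::vector<std::string>& parts) {
        std::string out;
        for (const std::string& p : parts)
            out = out + p;            // copies the whole result every time
        return out;
    }

    std::string join_fast(const std::vector<std::string>& parts) {
        std::size_t total = 0;
        for (const std::string& p : parts)
            total += p.size();
        std::string out;
        out.reserve(total);           // one allocation up front
        for (const std::string& p : parts)
            out += p;                 // appends in place
        return out;
    }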

    The "lots of object indirection" myth is one I encounter frequently. Object A calls Object B calls Object C, and it "intuitively" looks like it must be slow (Computer A calling Computer B, etc. would be slow), but even with stack frame generation, these are lightning fast compared with even the likes of "date to string" functions, never mind line-drawing commands or notification-sending.

    The reason that particular myth is dangerous is that it's the single most pervasive myth (IMHO) that leads to premature optimization. People take out layers of object indirection and make it harder to put in better solutions later. I had an object that recorded object IDs in a list and let you look them up later. If I had "flattened" that into the routine that needed it, I might have effected a 0.1% speed increase (typical range for many premature optimizations). As it stood, because it hid behind an interface (equivalent to an ABC for C++ folks), when I had finally implemented a unit-tested red/black tree, it was trivial (~5 minutes) to drop in the new functionality. That's not an isolated case, either.

    Mind you, I profiled the program to determine the slowdown first. Because so many of the searches on the list were misses (and therefore full scans), the search was taking up 98.6% of the entire operation. Switching to the red/black tree dropped the search down to 2.1%.
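
    In C++ terms, the "hide it behind an interface" idea looks roughly like this (a made-up sketch, not my actual code):

    #include <set>
    #include <vector>

    // Callers depend only on the abstract interface, so the linear-scan
    // implementation can later be swapped for a tree-based one without
    // touching any call sites.
    class IdIndex {                       // the interface / ABC
    public:
        virtual ~IdIndex() = default;
        virtual void add(int id) = 0;
        virtual bool contains(int id) const = 0;
    };

    class ListIdIndex : public IdIndex {  // simple first version
        std::vector<int> ids;
    public:
        void add(int id) override { ids.push_back(id); }
        bool contains(int id) const override {
            for (int existing : ids)
                if (existing == id) return true;
            return false;                 // a miss costs a full scan
        }
    };

    class TreeIdIndex : public IdIndex {  // drop-in replacement later
        std::set<int> ids;                // balanced tree, O(log n) lookups
    public:
        void add(int id) override { ids.insert(id); }
        bool contains(int id) const override { return ids.count(id) != 0; }
    };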

    All in all, if you have a slow program, profile it. There is no substitute for a well-written profiler. Stepping through in a debugger and "feeling" how long things take can point you in rough directions, but it will miss the things that take 50 ms out of the middle of each call to the operation you're checking. Manually inserted timing calls can be frustrating to maintain, and they can slow your program down enough that you can't narrow down the performance hit.

    gprof works well with gcc and its relatives (make sure to add -pg to your flags), but I'm not sure if there's a good open source option out there for people using other tools that doesn't require you to alter your source.

    In the Windows world, we recently got in the professional version of AQTime 3 [automatedqa.com]. It's an astounding package, allowing you numerous reports, pie charts and call graphs, saving the last few runs, calculating differences in performance between runs, allowing attachment to running processes, on top of a pretty nice way to define areas of the program to profile. The single nicest thing about it, though, is the performance. We turned on full profiling (that is, profiling all methods in all modules, including all class library and third party components) on the largest project we had, and it ran with perhaps a 30% slowdown. If you've used profilers before, you know how astounding that is ;)

    Profiling applications always surprises me. In one case, a space-making algorithm I was running on controls seemed a little pokey; I found out more than 50% of the time spent was on constantly verifying that the lists were sorted. Today, I was investigating a dialog that looked like it must hav

  • Re:uhh.. yeah (Score:1, Informative)

    by Anonymous Coward on Thursday May 06, 2004 @03:49AM (#9071384)
    XOR it up!

    int a;      /* a flag holding 0 or 1 */
    a ^= 1;     /* toggles a between 0 and 1 */

    One assembly instruction on most processors.
  • by BlackHawk-666 ( 560896 ) on Thursday May 06, 2004 @04:29AM (#9071502)
    My charge-out rate at my last company was 1000/day (about US$1500), so I'm going to say yes to the optimisation in this case, because it is cheaper. If it were going to take me 3-4 days, I'd say get the better hardware and try to keep the code as lean and fast as possible without wasting too much time trying to wring performance out of it.
  • by r6144 ( 544027 ) <r6k&sohu,com> on Thursday May 06, 2004 @05:22AM (#9071647) Homepage Journal
    Some good coding habits help the compiler do its job better, and also result in clearer (or at least not uglier) code.

    Example 1: in C, if you use "int" for a variable "x" that should have the type "unsigned", "x/4" will not be just a simple shift; instead, three or four instructions are involved. Indeed, it would be very hard for the compiler to infer that "x" is always non-negative and optimize for you, except in the simplest cases.

    Example 2: in floating-point math, "divide by 10" is not exactly the same as "multiply by 0.1", so many compilers (gcc 3.4 without "-ffast-math", icc8 by default, and probably the Java VM) won't optimize the former into the latter, even in the many cases where it wouldn't matter. This results in code that is 10-40 times slower on the P4.

    Example 3: in Haskell, since lazy evaluation has much more overhead than eager evaluation, compilers always try to optimize the former into the latter. However, in many cases it is impossible for the compiler to do that, since it can't decide whether using eager evaluation would prevent the evaluation from terminating.

    In short, it is good to rely on the compiler for the optimizations (such as register allocation) that it is known to do well, but what the compiler can do is very limited, since (1) it can't know your intent if you haven't expressed it, so (for example) it has to make sure that every floating-point operation conforms to very stringent error bounds, often at the cost of significant speed, even if you don't really care about that; and (2) some code-optimization problems take an exorbitant amount of time to solve, or may even be theoretically infeasible in general. Therefore, when writing code that is going to take significant CPU time, it is good to have habits that help the compiler, as long as the code isn't uglified too much.
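
    To make examples 1 and 2 concrete, here is a small sketch (illustrative only; the exact instruction sequences depend on the compiler and flags):

    // Example 1: dividing a signed int by 4 must round toward zero for
    // negative values, so the compiler emits extra adjustment instructions;
    // the unsigned version is a single shift.
    int div4_signed(int x)             { return x / 4; }
    unsigned div4_unsigned(unsigned x) { return x / 4u; }   // becomes x >> 2

    // Example 2: x / 10.0 and x * 0.1 do not round identically, so the
    // compiler will generally only rewrite the division as a multiplication
    // under something like -ffast-math, even when the caller doesn't care
    // about the last bit.
    double scale_div(double x) { return x / 10.0; }
    double scale_mul(double x) { return x * 0.1; }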

  • by julesh ( 229690 ) on Thursday May 06, 2004 @06:38AM (#9071825)
    Maybe then we wouldn't need dual-core 4-6 GHz CPUs and 2GB ram to run their new OS.

    The reason they're targeting this kind of system is that the hardware will probably be cheaper than Windows itself by the time Longhorn comes out.

    I'm sure they'll let you switch off the flash features that need it, though. All recent versions of Windows have been able to degrade to roughly the same performance standard as the previous version if you choose the right options.
  • Huh? (Score:3, Informative)

    by Theatetus ( 521747 ) * on Thursday May 06, 2004 @07:37AM (#9072009) Journal

    What are you talking about? I get paid to write open-source software. Where did you get the idea that open-source software is written entirely by volunteers?

  • by Theatetus ( 521747 ) * on Thursday May 06, 2004 @07:55AM (#9072064) Journal
    It doesn't matter how much hardware you throw at a problem if it needs to scale properly and you have an O(n^3) solution.

    Well, maybe you're not ignoring it since you said "if it needs to scale properly". But that's a very crucial "if", and the "scale properly" only refers to certain situations.

    If the array you need to sort might have several million members and you won't be sorting more than a few dozen of those arrays, yes you should use an O(n lg n) or whatever sort routine. OTOH, if the array itself is smaller (a few hundred members) but you have to sort several hundred thousand of them, quicksort or merge sort will be remarkably slow compared to the much-maligned bubble sort.

    Big-O notation is an asymptotic upper bound, not the function itself. For small datasets, not only is there no guarantee that the lower big-O algorithm will be faster, it's in fact often the case that the allegedly "less efficient" algorithm will actually be faster.
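
    One common compromise is a cutoff: recurse with quicksort, but finish small sub-ranges with a simple quadratic sort. A sketch (the cutoff of 32 is an arbitrary illustration; real libraries tune it):

    #include <algorithm>
    #include <vector>

    // Insertion sort: quadratic, but very fast on tiny ranges.
    static void insertion_sort(std::vector<int>& v, int lo, int hi) {
        for (int i = lo + 1; i <= hi; ++i) {
            int key = v[i], j = i - 1;
            while (j >= lo && v[j] > key) { v[j + 1] = v[j]; --j; }
            v[j + 1] = key;
        }
    }

    // Quicksort that hands small ranges to insertion sort.
    // Usage: quicksort_hybrid(v, 0, (int)v.size() - 1);
    static void quicksort_hybrid(std::vector<int>& v, int lo, int hi) {
        if (hi - lo < 32) {                 // small range: quadratic sort wins
            insertion_sort(v, lo, hi);
            return;
        }
        int pivot = v[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        while (i <= j) {                    // simple partition step
            while (v[i] < pivot) ++i;
            while (v[j] > pivot) --j;
            if (i <= j) { std::swap(v[i], v[j]); ++i; --j; }
        }
        if (lo < j) quicksort_hybrid(v, lo, j);
        if (i < hi) quicksort_hybrid(v, i, hi);
    }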

  • by BlackHawk-666 ( 560896 ) on Thursday May 06, 2004 @08:21AM (#9072164)
    I should have qualified my statement a little better, and I suspect qsort vs bubblesort is not the best illustration possible. Each sort algorithm has strengths and weaknesses, e.g. easy to implement but slow to run, ruthlessly difficult to code but fast as hell, good at sorting random data but worst-case behaviour on nearly sorted data. If qsort were always faster than every other algorithm, then we wouldn't still be talking about them. Qsort is generally faster than most other sorts, and its O(n log n) cost is an average, not a guarantee.

    When I was at uni our lecturer gave us an example from the QU campus where he used to lecture. There was a computer (remember, this is back in the eighties) that needed to sort rather a lot of data, and it took three days to do it with the qsort algorithm. The main problem was, I believe, memory restrictions, i.e. all the data could not fit into memory at once. It was recoded to use a different algorithm, one that could work from disk and in small chunks, and it ran orders of magnitude faster. The recoded algorithm was theoretically slower, but faster in actuality due to the nature of the data and the machine it had to run on.
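
    That "work from disk in small chunks" approach is essentially an external merge sort. A rough sketch (added for illustration; it assumes the data are text lines, and the file names and run size are made up):

    #include <algorithm>
    #include <fstream>
    #include <functional>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    // Phase 1: read fixed-size chunks, sort each in memory, write sorted runs.
    static int write_sorted_runs(const char* input, std::size_t run_size) {
        std::ifstream in(input);
        std::vector<std::string> buffer;
        std::string line;
        int runs = 0;
        auto flush = [&]() {
            if (buffer.empty()) return;
            std::sort(buffer.begin(), buffer.end());
            std::ofstream out("run" + std::to_string(runs++) + ".tmp");
            for (const std::string& s : buffer) out << s << '\n';
            buffer.clear();
        };
        while (std::getline(in, line)) {
            buffer.push_back(line);
            if (buffer.size() == run_size) flush();
        }
        flush();
        return runs;
    }

    // Phase 2: k-way merge of the runs, keeping only one line per run in memory.
    static void merge_runs(int runs, const char* output) {
        typedef std::pair<std::string, int> Entry;   // (current line, run index)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
        std::vector<std::ifstream> files(runs);
        for (int i = 0; i < runs; ++i) {
            files[i].open("run" + std::to_string(i) + ".tmp");
            std::string s;
            if (std::getline(files[i], s)) heap.push(Entry(s, i));
        }
        std::ofstream out(output);
        while (!heap.empty()) {
            Entry e = heap.top(); heap.pop();
            out << e.first << '\n';
            std::string s;
            if (std::getline(files[e.second], s)) heap.push(Entry(s, e.second));
        }
    }

    // Usage sketch: merge_runs(write_sorted_runs("big.txt", 100000), "sorted.txt");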

  • by yecrom2 ( 461240 ) on Thursday May 06, 2004 @10:52AM (#9073364)
    We were introduced to VTune during a 2-week trip to Intel. Profilers are good. VTune is the best one that I've found.

    The way we use it is to not even touch it until we have the feature completely working in the simplest form possible. Then we do some performance testing. If everything works well under load, we don't even bother profiling it. Otherwise, we run it in VTune and see what the bottleneck is. 90% of the time there is some kind of minor oversight. Occasionally there is an algorithmic change that needs to take place, like adding a secondary index to something, or making some temporaries thread-local.

    We run both event-counters and call-tracing, but I've found that call-tracing is far more accurate. The best use of VTune is to smite arrogant developers. The result of our trip to Intel was that one of our developers, who had to write everything from scratch, was shown that all of his "high performance components" were completely worthless.

    Just my $0.02.

    Matt
  • by An Onerous Coward ( 222037 ) on Thursday May 06, 2004 @11:21AM (#9073709) Homepage
    A project I was doing last semester had just what you described: thousands of arrays of twenty members each. I was still able to double the performance by switching from bubblesort to quicksort. Besides, you never know when those arrays are going to get bumped up from a few hundred members to a few thousand.

    I'm still a firm believer in the principle that bubblesort is never the way to go.
  • by s00p41337h4x0r ( 696697 ) on Thursday May 06, 2004 @04:09PM (#9076930)
    That's a well-known algorithm called "Bogosort [wikipedia.org]" in the Jargon File.

    The interesting thing about it is that it's one of the few algorithms with an expected running time of O(n!). If you're teaching an intro algorithms class it's easy to come up with examples of O(lg n), O(n), O(n lg n), O(n^2), and O(2^n) in lecture, but O(n!) is tricky. Useful as an extra credit question.
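
    For reference, the whole algorithm fits in a few lines (an illustrative sketch):

    #include <algorithm>
    #include <random>
    #include <vector>

    // Bogosort: shuffle until the data happen to be in order. Expected work
    // grows factorially, so only run it on tiny inputs.
    void bogosort(std::vector<int>& v) {
        std::mt19937 rng(std::random_device{}());
        while (!std::is_sorted(v.begin(), v.end()))
            std::shuffle(v.begin(), v.end(), rng);
    }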

  • by innosent ( 618233 ) <jmdority.gmail@com> on Thursday May 06, 2004 @11:16PM (#9080225)
    Exactly. He says it's not about (discrete) mathematics, but when it comes down to what a programmer is supposed to do, it's all discrete math. You have a Turing machine (albeit limited), and the whole point is to do what needs to be done correctly and as efficiently as possible. Some things still need to be written the same way they were in "1985", contrary to the author's view that optimal code doesn't matter.

    Yes, machines are faster now, by at least an order of magnitude, but optimizing poor code can speed things up more than engineering a new processor.
    Bubble sort a list of one billion data points (O(n^2) compares = k * 1e18) on a new 3GHz machine (assuming 1 compare per clock), and you need about 3.333e8 (333333333.33) seconds (about 10.5 years).
    Weak-heap (best), quick, or heap sort the same billion points (O(n log n) compares = k * ~3e10) on an old 486/33MHz (again assuming 1 compare per clock), and you need about 9.091e2 (909.09) seconds (about 15 minutes).

    There you go: the author can use the new Pentium 5s, and I guess the rest of us can go dig out our 486s. Sure, you don't often need to sort a billion records, but next time you do, make sure your algorithm is reasonable, then use a language that allows you to implement it. A one-size-fits-all library, garbage collection, and run-time language error checking might be good for rapid development, but doing things efficiently requires lower-level interaction, sometimes even below what C allows. Bubble sort 2 records and you won't see much benefit, but for performance-critical sections of code, sometimes it's better to optimize first and then use comments to make it readable than to write code that looks like comments. Even adding bounds checking to the sorts would at least double the time required, which in this case could mean anywhere from another 15 minutes up to the time left until you collect social security.
