Programming As If Performance Mattered
Junks Jerzey writes "Saw the essay 'Programming as if Performance Mattered', by James Hague, mentioned at the Lambda the Ultimate programming language weblog. This is the first modern and sensible spin on how optimization has changed over the years. The big 'gotcha' in the middle caught me by surprise. An inspiring read." Hague begins: "Are particular performance problems perennial? Can we never escape them? This essay is an attempt to look at things from a different point of view, to put performance into proper perspective."
Re:The question I always ask is (Score:2, Informative)
However, 0.1s delay after clicking an 'OK' button is perfectly acceptable. It all depends on what you're coding.
If everyone paid attention in English class... (Score:2, Informative)
I just finished reading the essay "Programming as if Performance Mattered", by James Hague. The essay covers how compiler optimization has changed over the years. If you get bored, keep reading; there's a big 'gotcha' in the middle. Hague begins: "Will performance issues haunt us forever? This essay puts performance analysis in perspective."
Re:Funny thing about performance (Score:2, Informative)
Re:Managed environments (Score:5, Informative)
Joe Beda [eightypercent.net], the guy from Microsoft behind Avalon, had a discussion on Channel9 where he talked about why managed code is not that bad a thing after all.
As I mentioned in an earlier post, managed code helps optimize the code for some of the bad programmers out there who cannot do it themselves, and takes care of a lot of exceptions and other "troublesome" things.
There are two facets to optimization - one is optimization of the code per se, and the other is the optimization of the project productivity - and I think managed code environments do a fairly good job of the former and a very good job of the latter.
My 0.02.
Optimizations in the Real World (Score:2, Informative)
My biggest (reasonable) beef in the optimization area is software bloat. Programs like huge office suites, stuffed with excessive, poorly implemented crap that people won't use, really tick me off. KISS, even for the stuff that has to be complicated.
Of course, I'll always be a sucker for tweaking code for the fun of it, when I have the time. =)
Performance is IMPORTANT (Score:5, Informative)
I am hearing a lot of people saying that you shouldn't optimise prior to the first release. However, it is very easy to select a design or architecture that caps your high-end performance. Therefore, some optimisation needs to be done early.
When you're architecting a system that is going to take tens of man years of effort to implement, you need to ensure that your system will scale.
For example, a project I recently worked on hit a performance wall. We had left optimisation for later, always believing that it shouldn't be done until last. However, while the architecture chosen was incredibly nice and clear, it limited performance to a third of what was required. Back to the drawing board; we just doubled the project cost. Ouch. Even worse, there are performance differences on each platform! For example, did you know that throwing an exception is 10,000 times slower than a return statement on HP/UX 11? Solaris is only a little better, at two orders of magnitude. Linux is (I understand) a dead heat.
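The ratio is easy to measure on your own platform. A rough sketch (class and method names here are mine, and JIT warm-up makes the absolute times unreliable, so treat the ratio, not the raw numbers, as the output of interest):

```java
// Rough microbenchmark: cost of signalling an error via an exception vs.
// a plain return value. Numbers vary wildly by platform and VM.
public class ThrowVsReturn {
    // Signal "negative input" with a sentinel return value.
    static int viaReturn(int x) {
        return x < 0 ? -1 : x;
    }

    // Signal the same condition by throwing and catching an exception.
    static int viaThrow(int x) {
        try {
            if (x < 0) throw new IllegalArgumentException();
            return x;
        } catch (IllegalArgumentException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        final int N = 100_000;
        int sink = 0;  // accumulate results so the loops can't be optimized away
        long t0 = System.nanoTime();
        for (int i = 1; i <= N; i++) sink += viaReturn(-i);
        long t1 = System.nanoTime();
        for (int i = 1; i <= N; i++) sink += viaThrow(-i);
        long t2 = System.nanoTime();
        System.out.println("return: " + (t1 - t0) + " ns, throw: " + (t2 - t1)
                + " ns (checksum " + sink + ")");
    }
}
```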
So, while low-level optimisation of statements is silly early in the project, you do need to ensure that the architecture you choose is going to meet your performance requirements. Some optimisations are definitely necessary early in the project.
The article also talks about tool selection, suggesting that the extra CPU could be better used to support higher level languages like Erlang. If a system has CPU to spare, I agree, use what you can. The projects I work on always seem to lack in CPU cycles, disk write speed, and network speed. You name it, we're short of it. In fact, a large part of our marketing strategy is that we are able to deliver high performance on low end systems. What would happen to us if we dropped that edge? We're working with a company that has implemented a real-time billing system in Perl. Not a problem, until you try and send it 500 transactions/second. Their hardware budget? Millions to our 10s of thousands. Who do you think the customer likes more?
Jason Pollock
Re:You don't optimize, that's the job of the compi (Score:3, Informative)
--Jeremy
Re:Article puts it all in perspective (Score:3, Informative)
I'm staring at the apt codebase on my screen just now, and it's all C++, baby. Ok, so there is a trivial amount of perl; sloccount summary:
Totals grouped by language (dominant language first):
cpp: 26481 (89.75%)
sh: 2816 (9.54%)
perl: 209 (0.71%)
This is for apt-0.5.14, but I can't imagine that the newest version in unstable (0.5.24) would be that different.
Now, if the rest of your story is true, that's mind-boggling. If the new teacher refused to judge your work (which, from your description, was very fine) just because he has a serious hard-on for Gentoo, I seriously believe you should have taken it up with the dean of the faculty instead of just swallowing it and later complaining on Slashdot.
That being said, why choose apt in the first place? Now, I haven't profiled apt, but I'd guess it spends the majority of its time waiting on network I/O or waiting for dpkg to finish anyway.
Re:You don't optimize, that's the job of the compi (Score:3, Informative)
A good example would be how to detect if a king is in check in a chess program. There are a few different approches. Some are fast, some are slow, and a compiler just cannot "optimize" a slow approach into a fast one. The function is called millions of times per second in a chess program, so you want it optimized.
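A sketch of the fast kind of approach: instead of generating every enemy move and seeing whether one captures the king, look outward from the king's own square, which touches at most a few dozen squares. The board representation and piece letters below are my own assumptions, not from any particular engine (a real one would use bitboards or incremental attack tables; enemy-king adjacency is omitted for brevity):

```java
// Is the white king attacked? Scan outward from the king's square rather
// than enumerating all black moves.
public class CheckTest {
    static final int[][] KNIGHT = {{1,2},{2,1},{2,-1},{1,-2},{-1,-2},{-2,-1},{-2,1},{-1,2}};
    static final int[][] DIAG   = {{1,1},{1,-1},{-1,1},{-1,-1}};
    static final int[][] LINE   = {{1,0},{-1,0},{0,1},{0,-1}};

    // board[rank][file], rank 0 = white's back rank, '.' = empty,
    // lowercase = black pieces, uppercase = white.
    static boolean whiteKingInCheck(char[][] b) {
        int kr = -1, kf = -1;
        for (int r = 0; r < 8; r++)
            for (int f = 0; f < 8; f++)
                if (b[r][f] == 'K') { kr = r; kf = f; }
        // Knight attacks: eight fixed offsets.
        for (int[] d : KNIGHT)
            if (pieceAt(b, kr + d[0], kf + d[1]) == 'n') return true;
        // Sliding attacks: bishops/queens on diagonals, rooks/queens on lines.
        if (slides(b, kr, kf, DIAG, 'b') || slides(b, kr, kf, LINE, 'r')) return true;
        // Black pawns attack toward rank 0.
        return pieceAt(b, kr + 1, kf - 1) == 'p' || pieceAt(b, kr + 1, kf + 1) == 'p';
    }

    // Walk each direction until a piece or the board edge is hit.
    static boolean slides(char[][] b, int kr, int kf, int[][] dirs, char piece) {
        for (int[] d : dirs) {
            int r = kr + d[0], f = kf + d[1];
            while (r >= 0 && r < 8 && f >= 0 && f < 8) {
                char c = b[r][f];
                if (c != '.') { if (c == piece || c == 'q') return true; break; }
                r += d[0]; f += d[1];
            }
        }
        return false;
    }

    static char pieceAt(char[][] b, int r, int f) {
        return (r >= 0 && r < 8 && f >= 0 && f < 8) ? b[r][f] : '.';
    }
}
```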
Re:the software taketh what the hardware giveth. (Score:3, Informative)
Re:the software taketh what the hardware giveth. (Score:3, Informative)
However, you can't achieve the same thing easily in Linux, since
a) putting pixels is more than just writing to 0a0000h, and
b) the ELF format actually has some structure. (IIRC, a program that merely returns 42 takes 53 bytes, and uses a quite obscene amount of trickery to achieve even that.)
These will probably bloat the Linux version to something like 512 bytes ;) Oh dear.
What are you talking about? (Score:1, Informative)
Sheesh. Don't make the corrections unless you know what you're talking about. Where's the split infinitive? In "I just finished reading..."? There's no split infinitive there.
That post doesn't use passive voice, either -- "the essay" is a perfectly valid subject. Passive voice would be "compiler optimization is discussed."
Hmph.
Re:What annoys me (Score:1, Informative)
Re:Funny thing about performance (Score:2, Informative)
I interviewed at a company that makes a big deal about being super duper technical on their web site. They had a written coding problem as part of the interview. (A good sign!)
They left me in a room with a non-networked PC, with instructions to get as far as possible in writing a program in Java to take an initial date and a number of days to add or subtract, and figure out what the resulting date would be. The test instructions contained a detailed explanation of the workings of the Gregorian calendar system. The PC had Windows and the JDK installed on it, and just about nothing else. They gave me a pretty short period of time to do it in - 15 or 20 minutes, if I remember right.
First, I had to call the interviewer back in so that I could show him that there were about TEN different past solutions still sitting on the hard drive, and that I was going to delete them all while he watched and ask him to start the clock again. (Lamers...)
When I read the problem I realized that it was very easily solved using the java.util.GregorianCalendar class that comes with the JDK [sun.com]. I didn't remember exactly how to use it but fortunately the installed JDK on this PC also included the JDK source. I javadoc'd the source to GregorianCalendar and Calendar and read the docs, wrote my app, and tested it thoroughly. Of course it didn't take long to get it working, since the hard work was already done. I had to walk all over the office looking for the interviewer, who apparently wasn't expecting me to actually complete the task within the allotted time.
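For the curious, the heart of such a program really is tiny once you know the class exists. A sketch (the method name and wrapper here are mine, not the actual interview code):

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

// Add or subtract days from a date, letting the JDK's Gregorian calendar
// implementation handle leap years and month boundaries.
public class DateShift {
    // month uses Calendar's zero-based constants; days may be negative
    static GregorianCalendar shift(int year, int month, int day, int days) {
        GregorianCalendar cal = new GregorianCalendar(year, month, day);
        cal.add(Calendar.DAY_OF_MONTH, days);  // rolls months/years as needed
        return cal;
    }
}
```

Whether 2004-02-28 plus one day is February 29th or March 1st is exactly the kind of Gregorian trivia the class already handles correctly.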
When I reviewed my very short program with the proctor and explained all of the things that I had done in order to do it that way, he seemed upset, as though I cheated. I tried to make a case for the fact that I had passed up a chance to actually cheat and then been resourceful, but he wasn't convinced. I didn't do it exactly the slow and tedious way they wanted, so I was wrong. I pointed out that if I was on a project and caught a developer duplicating base JDK functionality due to plain ignorance of the class library, I'd consider that a *bad* thing, not an example of technical excellence.
The rest of the interview went OK, but they eventually called me back and said there was a hiring freeze. Well maybe so, since it was in 2000 or 2001 (I don't remember exactly when), or maybe not. I wasn't exactly crushed.
Since then, the hard-skills tests that I use when interviewing developer candidates include something like this for the relevant environment... kinda like you said. Something like "read in a text file, sort the lines, and print them out in sorted order". If their program includes a hand-written sort routine, BZZT, they failed the test.
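A sketch of the kind of answer that passes, on the assumption that leaning on the library is the whole point (file handling kept minimal):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

// Read a text file, sort its lines, print them. No hand-rolled sort.
public class SortLines {
    static List<String> sorted(List<String> lines) {
        Collections.sort(lines);  // the JDK's tuned sort, not ours to rewrite
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String s : sorted(lines)) System.out.println(s);
    }
}
```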
This guy is out on a limb (Score:3, Informative)
Even traditional disclaimers such as "except for video games, which need to stay close to the machine level" usually don't hold water any more.
Yeah, as long as you write simple, 2D games [dadgum.com] (like the author of the essay does), that would be true. Complex 3D games are another matter. I write games for a living, and if you're even within sight of the cutting edge you're writing at least some assembly and spending a lot of time optimizing C++.
Now, I'm not knocking all he says, or saying that good games need to be in C++ and assembly. Some games rely heavily on scripting languages to handle the game mechanics and world events. There's a lot less assembly code than there used to be. However, the core engine that handles graphics, physics, AI, and I/O is going to be written in C++ and assembly, and will be for the foreseeable future.
If I published a game that required a 3GHz computer to display 576x576 images at 66fps, I'd be laughed off the internet. A PS2 has a 300MHz processor and needs to display a 512x448 image 30-60 times per second.
Re:uhh.. yeah (Score:3, Informative)
This is the one time where I'll step up and say that VC actually does a few neat tricks for the ternary operator, translating it to branch-free code in some cases. There are other variants of this; I'll leave it as an exercise for the reader to figure out what is going on.
Postmature optimization (Score:5, Informative)
After years of developing, I really take to heart two things:
Profilers are the best thing to happen to performance since compilers - really. I encounter a number of truths, but also many myths, about what degrades performance. A few examples of each:
Performance degraders
Not performance degraders
The "lots of object indirection" myth is one I encounter frequently. Object A calls Object B calls Object C, and it "intuitively" looks like it must be slow (Computer A calling Computer B, etc. would be slow), but even with stack frame generation, these are lightning fast compared with even the likes of "date to string" functions, never mind line-drawing commands or notification-sending.
The reason that particular myth is dangerous is that it's the single most pervasive myth (IMHO) that leads to premature optimization. People take out layers of object indirection and make it harder to put in better solutions later. I had an object that recorded object IDs in a list and let you look them up later. If I had "flattened" that into the routine that needed it, I might have effected a 0.1% speed increase (typical range for many premature optimizations). As it stood, because it hid behind an interface (equivalent to an ABC for C++ folks), when I had finally implemented a unit-tested red/black tree, it was trivial (~5 minutes) to drop in the new functionality. That's not an isolated case, either.
Mind you, I profiled the program to determine the slowdown first. Searching on the list, because so many were misses (therefore full scans), the search was taking up 98.6% of the entire operation. Switching to the red/black tree dropped the search down to 2.1%.
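A minimal sketch of the kind of interface swap described above (all the names here are invented for illustration): callers only ever see the interface, so replacing the linear scan with a balanced tree touches one line at the construction site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Record object IDs, look them up later. Callers depend only on IdIndex.
interface IdIndex {
    void record(long id);
    boolean contains(long id);
}

// The original: a plain list, so every miss is a full O(n) scan.
class ListIndex implements IdIndex {
    private final List<Long> ids = new ArrayList<>();
    public void record(long id) { ids.add(id); }
    public boolean contains(long id) { return ids.contains(id); }
}

// The drop-in replacement: TreeSet is a red/black tree, so O(log n) lookups.
class TreeIndex implements IdIndex {
    private final TreeSet<Long> ids = new TreeSet<>();
    public void record(long id) { ids.add(id); }
    public boolean contains(long id) { return ids.contains(id); }
}
```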
All in all, if you have a slow program, profile it. There is no substitute for a well-written profiler. Stepping through in a debugger and "feeling" how long things take, while it can point you in rough directions, will miss the things that take 50 ms out of the middle of each call to the operation you're checking. Manually inserting timing calls can be frustrating to maintain, or can slow your program down enough that you can't narrow down the performance hit.
gprof works well with gcc and its relatives (make sure to add -pg to your flags), but I'm not sure if there's a good open source option out there for people using other tools that doesn't require you to alter your source.
In the Windows world, we recently got in the professional version of AQTime 3 [automatedqa.com]. It's an astounding package, allowing you numerous reports, pie charts and call graphs, saving the last few runs, calculating differences in performance between runs, allowing attachment to running processes, on top of a pretty nice way to define areas of the program to profile. The single nicest thing about it, though, is the performance. We turned on full profiling (that is, profiling all methods in all modules, including all class library and third party components) on the largest project we had, and it ran with perhaps a 30% slowdown. If you've used profilers before, you know how astounding that is ;)
Profiling applications always surprises me. In one case, a space-making algorithm I was running on controls seemed a little pokey; I found out more than 50% of the time spent was on constantly verifying that the lists were sorted. Today, I was investigating a dialog that looked like it must hav
Re:uhh.. yeah (Score:1, Informative)
int a;
a ^= 1;
One assembly instruction on most processors.
Re:Throw hardware at it. (Score:3, Informative)
The compiler can't do all micro-optimizations (Score:3, Informative)
Example 1: in C, if you use "int" for a variable "x" that should have the type "unsigned", "x/4" will not compile to a simple shift; instead, three or four instructions are involved. It would be very hard for the compiler to infer that "x" is always non-negative and optimize for you, except in the simplest cases.
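A concrete illustration of why the rewrite is invalid for signed integers (shown in Java, which truncates division toward zero just as C does, while an arithmetic shift rounds toward negative infinity):

```java
// Signed division and arithmetic shift disagree on negative inputs,
// which is why the compiler can't blindly turn x/4 into x>>2.
public class DivShift {
    static int div4(int x)   { return x / 4;  }  // truncates toward zero
    static int shift2(int x) { return x >> 2; }  // rounds toward -infinity
    // div4(-7) == -1, but shift2(-7) == -2: same only when x >= 0.
}
```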
Example 2: in floating-point math, "divide by 10" is not exactly the same as "multiply by 0.1", thus many compilers (gcc 3.4 without "-ffast-math", icc8 by default, and probably the Java VM) won't optimize the former into the latter, even in the many cases where it won't matter. This results in code that is 10-40 times slower on the P4.
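A concrete illustration of the semantic gap (in Java, but the same holds for IEEE-754 doubles in any language): 0.1 has no exact binary representation, so dividing by 10 and multiplying by the rounded constant 0.1 can land on different doubles.

```java
// Dividing by 10 is not the same operation as multiplying by the double
// nearest to 0.1 - for some inputs the two round to different results.
public class DivMul {
    static double byDivide(double x)   { return x / 10.0; }
    static double byMultiply(double x) { return x * 0.1;  }
    // byDivide(3.0) == 0.3, byMultiply(3.0) == 0.30000000000000004
}
```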
Example 3: in Haskell, since lazy evaluation has much more overhead than eager evaluation, compilers always try to optimize the former into the latter. However, in many cases it is impossible for the compiler to do that, since it can't decide if using eager evaluation will prevent the evaluation from terminating.
In short, it is good to rely on the compiler to do the optimization (such as register allocation) that is known to be done well. But what the compiler can do is very limited, since (1) it can't know your intent if you haven't expressed it, so (for example) it has to make sure that every floating-point operation conforms to very stringent error bounds, often at the cost of significant speed, even if you don't really care about that; and (2) some code-optimization problems take extortionate time to solve, or might even be theoretically infeasible in general. Therefore, when writing code that is going to take significant CPU time, it is good to have some habits that help the compiler, as long as the code isn't uglified too much.
Re:The Longhorn developers... (Score:3, Informative)
The reason they're targeting this kind of system is that the hardware will probably be cheaper than Windows itself by the time Longhorn comes out.
I'm sure they'll let you switch off the flashy features that need it, though. All recent versions of Windows have been able to degrade to roughly the same performance standard as the previous version if you choose the right options.
Huh? (Score:3, Informative)
What are you talking about? I get paid to write open-source software. Where did you get the idea that open-source software is written entirely by volunteers?
You're ignoring the "gotcha" (Score:5, Informative)
Well, maybe you're not ignoring it since you said "if it needs to scale properly". But that's a very crucial "if", and the "scale properly" only refers to certain situations.
If the array you need to sort might have several million members and you won't be sorting more than a few dozen of those arrays, yes you should use an O(n lg n) or whatever sort routine. OTOH, if the array itself is smaller (a few hundred members) but you have to sort several hundred thousand of them, quicksort or merge sort will be remarkably slow compared to the much-maligned bubble sort.
Big-O notation is an asymptotic bound, not the function itself. For small datasets, not only is there no guarantee that the lower big-O algorithm will be faster; it's in fact often the case that the allegedly "less efficient" algorithm will actually be faster.
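A sketch of how this observation is usually exploited in practice: a simple O(n^2) sort (insertion sort here, a close cousin of bubble sort) has so little overhead on tiny arrays that production quicksorts, including the JDK's, switch to one below a small cutoff.

```java
import java.util.Arrays;
import java.util.Random;

// Insertion sort: quadratic in theory, but with almost no per-element
// overhead, which is why library sorts fall back to it for short ranges.
public class SmallSort {
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int v = a[i], j = i - 1;
            while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        int[] a = rng.ints(16).toArray();
        int[] b = a.clone();
        insertionSort(a);
        Arrays.sort(b);
        System.out.println(Arrays.equals(a, b));  // same answer either way
    }
}
```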
Re:Followed your link (Score:4, Informative)
When I was in uni our lecturer gave us an example from the QU campus where he used to lecture. There was a computer (remember, this is back in the eighties) that needed to sort rather a lot of data and it took three days to do it with the qsort algorithm. The main problem was, I believe, due to memory restrictions i.e. all the data could not fit into memory at once. It was recoded to use a different algorithm, one that could work from disk and in small chunks, and ran orders of magnitude faster. The recoded algorithm was theoretically slower, but faster in actuality due to the nature of the data and the machine it had to run on.
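The chunk-and-merge approach the parent describes can be sketched roughly like this (not the original code, obviously; chunk size and temp-file handling are simplified for illustration): sort runs that fit in memory, write them to disk, then stream a k-way merge that keeps only one line per run in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// External merge sort: phase 1 writes sorted runs, phase 2 merges them.
public class ExternalSort {
    static Path sortFile(Path input, int chunkLines) throws IOException {
        // Phase 1: break the input into sorted runs on disk.
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkLines) { runs.add(writeRun(chunk)); chunk.clear(); }
            }
            if (!chunk.isEmpty()) runs.add(writeRun(chunk));
        }
        // Phase 2: k-way merge, holding one pending line per run.
        Path out = Files.createTempFile("sorted", ".txt");
        BufferedReader[] readers = new BufferedReader[runs.size()];
        String[] current = new String[runs.size()];
        PriorityQueue<Integer> pq =
                new PriorityQueue<>((i, j) -> current[i].compareTo(current[j]));
        for (int i = 0; i < runs.size(); i++) {
            readers[i] = Files.newBufferedReader(runs.get(i));
            current[i] = readers[i].readLine();
            if (current[i] != null) pq.add(i);
        }
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            while (!pq.isEmpty()) {
                int i = pq.poll();            // run with the smallest pending line
                w.write(current[i]);
                w.newLine();
                current[i] = readers[i].readLine();
                if (current[i] != null) pq.add(i);
            }
        }
        for (BufferedReader r : readers) r.close();
        return out;
    }

    static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);              // this part happens in memory
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, chunk);
        return run;
    }
}
```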
Intel's VTune is your friend. (Score:3, Informative)
The way that we use it is to not even touch it until we have the feature completely working in the simplest form possible. Then we do some performance testing. If everything works well under load, we don't even bother profiling it. Otherwise, we run it in VTune and see what the bottleneck is. 90% of the time there is some type of minor oversight. Occasionally there is an algorithmic change that needs to take place, like adding a secondary index to something, or making some temporaries thread-local.
We run both event-counters and call-tracing, but I've found that call-tracing is far more accurate. The best use of VTune is to smite arrogant developers. The result of our trip to Intel was that one of our developers, who had to write everything from scratch, was shown that all of his "high performance components" were completely worthless.
Just my $0.02.
Matt
Re:You're ignoring the "gotcha" (Score:4, Informative)
I'm still a firm believer in the principle that bubblesort is never the way to go.
Re:Funny thing about performance (Score:2, Informative)
Interesting thing about it is that it is one of the few algorithms that has an expected running time of O(n!). If you're teaching an intro algorithms class it's easy to come up with examples of O(lg n), O(n), O(n lg n), O(n^2), and O(2^n) in lecture but O(n!) is tricky. Useful as an extra credit question.
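The parent doesn't name the algorithm, but the usual classroom example with factorial expected running time is bogosort (strictly, its expected cost is on the order of n*n! comparisons, since each of the expected n! shuffles takes an O(n) sortedness check). A sketch:

```java
import java.util.Collections;
import java.util.List;

// Bogosort: shuffle until sorted. Expected number of shuffles is n!,
// so only ever run it on tiny inputs (or as an exam question).
public class Bogosort {
    static void sort(List<Integer> a) {
        while (!isSorted(a)) Collections.shuffle(a);
    }

    static boolean isSorted(List<Integer> a) {
        for (int i = 1; i < a.size(); i++)
            if (a.get(i - 1) > a.get(i)) return false;
        return true;
    }
}
```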
Re:That guy doesn't know ANYTHING about performanc (Score:3, Informative)
Yes, machines are faster now, by at least an order of magnitude, but optimizing poor code can speed things up more than engineering a new processor.
Bubble Sort a list of one billion points of data (O(n^2) compares = k * (1e18)) on a new 3GHz machine (assuming 1 compare per clock), and you need about 3.333e8 (333333333.33) seconds (about 10.5 years).
Weak Heap (best), Quick, or Heap sort the same billion points (O(n*log n) compares = k * (~3e10)) on an old 486/33MHz (again assuming 1 compare per clock), and you need about 9.0909e3 (9090.909) seconds (about 2.5 hours).
There you go, the author can use the new Pentium 5s, and I guess the rest of us can go dig out our 486s. Sure, you don't often need to sort a billion records, but next time you do, make sure your algorithm is reasonable, then use a language that allows you to implement it. One-size-fits-all library, garbage collection, and run-time language error checking might be good for rapid development, but doing things efficiently requires lower-level interaction, sometimes even below what C allows. Bubble sort 2 records, and you won't see much benefit, but for performance critical sections of code, sometimes it's better to optimize first, then use comments to make it readable, than to write code that looks like comments. Even adding bounds checking to the sorts would at least double the time required, which in this case could mean anywhere from another 3 or 4 hours up to the time left until you collect social security.