Posted by michael from the better-but-more-expensive dept.
deadfx writes "Linux Journal is carrying an article authored by members of the Intel Compiler Lab examining specific optimizations employed by the compiler that allowed it to beat the gcc compiler on several benchmarks."
Wow. I doubt the techniques themselves are patented (then again, I wouldn't be surprised), but the article seems mostly about what it does, not how. Even if it's useless to GCC, I learned a couple of new things. Good article.
From what very little I understand, gcc has years of infrastructure focused on multiple machine instruction sets.
I'm pleasantly surprised gcc can do as well as it does considering that it can be built and run on some very unusual and dated pieces of hardware (although, from the recent release notes, it looks like some of the most obscure ones are slipping into oblivion.)
Many of the optimizations discussed in the article are being implemented in GCC. Actually, some (e.g., SSA) have been in a parallel development branch for quite a while now, where they can make major funky changes and not have to play by the "no breakage" rules that govern the release branches.
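(A quick illustration of SSA for those following along; the code and variable names are mine, not from the article. In static single assignment form every variable is assigned exactly once, which makes the flow of values explicit for the optimizer.)

int ssa_example(int a, int b, int c) {
    // Original, non-SSA shape:  x = a + b;  x = x * 2;  y = x + c;
    // SSA renames each assignment so each name has exactly one definition:
    const int x1 = a + b;   // first definition of x
    const int x2 = x1 * 2;  // second definition gets a fresh name
    const int y1 = x2 + c;  // this use can only mean x2
    return y1;
}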
Just last week there was a discussion about OpenMP and how it might be implemented in GCC.
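(For readers who haven't seen OpenMP: you annotate loops with pragmas and let the compiler generate the threading. A minimal sketch of the programmer-visible side; the enabling flag differs per compiler, and without it the pragma is simply ignored.)

// Each iteration is independent, so an OpenMP-aware compiler can
// split this loop across threads.
void scale(double *a, const double *b, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}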
A huge optimization that the article mentioned in the intro (but didn't go into in depth, I think) is interprocedural optimization. That can be hard in languages that don't compile the entire program at once; you really need help from the linker.
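(A tiny example of why, using hypothetical file names: with separate compilation the optimizer never sees the callee's body, so even trivial cross-file inlining has to wait until link time.)

// --- util.c --- compiled on its own, this becomes a real function:
int add_one(int x) { return x + 1; }

// --- main.c --- compiled separately, the compiler sees only the
// declaration below, so it must emit a call; it cannot fold
// add_one(41) into 42 without whole-program (interprocedural) help
// from the linker.
extern int add_one(int x);
int main(void) { return add_one(41); }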
it can be built and run on some very unusual and dated pieces of hardware (although, from the recent release notes, it looks like some of the most obscure ones are slipping into oblivion.)
Absolutely. If nobody's using those platforms, and we want to make a change that might affect them in a negative way -- but we can't tell, because we don't have that machine available to test on -- we have two options: make the change and hope it doesn't break, or deliberately mark the platform as "no one cares, so neither do we".
Actually, GCC may avoid some optimizations for political reasons. I apologize for the lack of details; I only skimmed the discussion, and it was a while ago, but it was on /., so it must be completely accurate. ;-)
My fuzzy recollection is that there is a class of optimizations that GCC does not use because the required intermediate representation could be used to circumvent the GPL. Anyone have a better recollection, or actual knowledge of the issue?
It's a shame that RMS fails to see the advantage of opening up GCC's data structures. Sure, there would be some cheaters who'd use it for personal gain (writing proprietary front-ends for the GCC backend), but there would be ten times as many people creating useful free languages and other free tools with it. It would also greatly aid in the learning of GCC itself (no simple task). It's a calculated risk, but in my opinion, one with more upside than downside potential. But who are we to say what RMS should do with his work? The FSF is the copyright holder, after all.
Apologies if I was not clear originally, but my recollection is that both patents and politics are responsible: politics prevents even the unencumbered techniques from being used.
The best way to get better performance out of GCC is to recompile your program - targeted to PowerPC, Alpha, MIPS or, hell... AMD. Screwing with GCC to get better x86 performance is like putting a turbo on a Honda Civic - just buy the damn Corvette.

Oh yeah, try getting docs on the UltraSparc 3

Give me a break! The stuff's in the Linux kernel source. Let me guess, you've never looked at the kernel source?
Look in linux/arch/sparc64/kernel/cpu.c. That's just detection of the CPU. If you look in that directory you will see code to do all kinds of low-level stuff on ALL the UltraSPARC processors, so get back under your bridge, troll. I tried to post some of the code but hit the damned lameness filter.
Like Cypress/Ross, Fujitsu-Siemens, Sun, Texas Instruments - and these are just the ones I have sitting on my desk right now. Also, these sparc chips are all 100% compatible with the target revision of the sparc standard, whereas the various x86-compatible chips have huge differences... take 3DNow! support, for instance.
It's only high performance with 4 gigabytes of RAM or less. Then you're back into segmentation and other hackery, just like in the "good old days" of MS-DOS. You young 'uns don't know how good you have it. In my day, your code had to fit inside 64k segments, and so did your data. No arrays bigger than 65533 bytes, and you had to do "far" jumps to code in other segments. Yes, addresses were 20 bits, but you had to make them up from two 16-bit chunks. So you needed 32 bits of storage to address 1 megabyte of RAM. Don't get me started on TSRs...
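(For the record, the arithmetic the old-timer is describing; the segment and offset values here are made up for the example.)

#include <cstdint>
#include <cstdio>

// Real-mode 8086: physical address = segment * 16 + offset.
// Two 16-bit chunks (32 bits of storage) produce a 20-bit address,
// which is how 1 megabyte got covered.
int main() {
    const std::uint16_t segment = 0xA000;
    const std::uint16_t offset  = 0x1234;
    const unsigned long physical = (unsigned long)segment * 16 + offset;
    std::printf("%05lX\n", physical); // prints A1234
    return 0;
}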
Wrong - the highest-performing microprocessor in the world is the Alpha EV7. Also consider the fact that it can address a full 64 bits of memory and scale almost linearly to 64 or 128 processors, two things the x86 architecture cannot do.
Are you sure? I thought the Alpha could only use 48-bit addresses. Is this something they changed in the EV67/EV7? It's still 256TB, but it isn't quite the same.
Yes, I know the CPU registers are 64-bit, but the way it handled virtual memory limited it to 48 usable bits.
Even so, the addresses are still stored as 64-bit longs, so with a processor upgrade and possibly minor tweaks to the OS, the address space could be extended to the full 64-bit range... I believe it was done this way for cost reasons, but I could be wrong...
That said, there aren't many uses that would require more than 256TB of RAM...
P4, 3.06GHz: 1130/1103 (int/fp)
EV7, 1.15GHz: 795/1124 (int/fp)

So the P4 is *1.4 times* as fast on integer maths, and two percent slower at floating point. Ouch. Care to rethink that statement about the EV7? Sure, it's 64-bit, and scalable, and a much nicer architecture to boot. But I wasn't arguing address space size, suitability for large-scale NUMA or SMP clustering, or ease of writing compiler backends, was I?
Peace, 'jfb
PS: Even more horrifying for fans of elegance in their microarchitectures:

Itanic^2, 1.0GHz: 810/1431 (int/fp)
1.4 times the integer rate on almost 3 times the clock rate - imagine the speed of a 3.06GHz EV7... which HP could easily do if they wanted. They could also improve their compiler somewhat, since it doesn't even support the EV7... but they're more concerned with promoting the Itanic.
HPaq is definitely up to their necks in Itanic (which, btw, is already faster than EV7), but even if they weren't, I find it hard to believe that they'd have the process to shrink EV7 further and clock it higher.
The truly scary thing is that there's plenty of headroom in the P4 core. Intel is talking about 7GHz on their next shrink. One hopes that AMD is able to keep up, if only to keep the price of gross nasty fast microprocessors within the grasp of the common man.
To me it would seem that the partial-redundancy optimization would only work if the code was like the examples they give. If you coded such that there was no partial redundancy, then it probably would not be any better than gcc.
i.e.
x[i] += a[i+j*n] + b[i+j*n];
could be coded as
l=i+j*n;
x[i] += a[l] + b[l];
Both would yield the same result; one would use more variables, but both would only do the calc once. Would this be any different?
Personally I can see not doing the calcs twice, but I'd think a good programmer would already code so he's not doing them twice.
Of course, and they make that pretty clear. But their much better memory disambiguation and analysis of the program allow them to apply this trick much more often than GCC can.
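(Here's the kind of case they mean; my example, not the article's. Whether a load can be reused across a store comes down to proving the store can't have touched it.)

// The second load of a[idx] can be eliminated only if the compiler
// proves the store to x[i] cannot have written a[idx], i.e. that
// x and a don't alias. Better memory disambiguation means more
// such loads become provably redundant.
void copy_twice(double *x, const double *a, int i, int j, int n) {
    const int idx = i + j * n;
    x[i]     = a[idx];
    x[i + 1] = a[idx];
}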
In fact,

a[x] = b[x] + c[x];

probably compiles into something like (in the C equivalent):

offset_b = x * 4 + b;
val_b = *offset_b;
offset_c = x * 4 + c;
val_c = *offset_c;
offset_a = x * 4 + a;
val_a = val_c + val_b;
*offset_a = val_a;

(Set aside the fact that C automatically scales pointer arithmetic for you. Also ignore for the moment the fact that x86 allows you to do a scaled multiply by 4 in the load instruction -- pretend we are accessing structures of some large or non-power-of-two size.)
Here the computation of x * 4 is redundant, even though we never wrote x * 4 in the original program.
The point is, dumb code doesn't just arise from dumb programmers, but from the compilation process. (Also imagine you are calling a macro that computes offsets for you, etc.) Anyway, every compiler implements this level of common subexpression elimination, even gcc, so don't worry!
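For completeness, after common subexpression elimination the expansion above collapses to something like this (same hypothetical pseudo-C, with x * 4 now computed once):

offset = x * 4;
offset_b = offset + b;
val_b = *offset_b;
offset_c = offset + c;
val_c = *offset_c;
offset_a = offset + a;
*offset_a = val_c + val_b;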
Something that's been bothering me ever since I was bitten by my first compiler bug several decades ago. (And couldn't get the bug acknowledged by the vendor because I couldn't get it to occur in a short code fragment).
WHY does every vendor have optimization on by default all the time? Optimization routines in compilers are the most likely places to have bugs, and they are often extraordinarily subtle bugs that are hard to reproduce. I have personally encountered bugs in which the insertion or deletion of a COMMENT affected the code, and certainly many of us have encountered bugs that could not be demonstrated in a short fragment because they only occurred when things got deeply nested and the compiler ran out of temporary registers.
Optimization also interferes with debuggers. I know that YOU are capable of doing up-front planning and writing bug-free code, but _I_ am not, and forcing me to recompile in order to use the debugger is one more hurdle in the way of getting the bugs out of the code.
Why isn't optimization off by default, and turned on only in specific modules, or certain SECTIONS of the code (with pragmas)--those specific sections that can be demonstrated to be time-critical?
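(For what it's worth, GCC did eventually grow exactly this kind of per-function control: the optimize attribute, added in GCC 4.4, lets you leave a file at -O0 for debugging and promote individual routines. A sketch:)

// Build the file with -O0 for debuggability...
// ...but promote this one well-tested hot spot to -O2.
__attribute__((optimize("O2")))
long dot(const int *a, const int *b, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += (long)a[i] * b[i];
    return sum;
}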
Some reasons we use optimisation by default:

Most compilers get more testing with optimisations enabled, so while there are bugs in optimisers, there are also latent bugs exposed when the optimiser is turned off.
With modern C++ code, turning the optimiser off completely can kill performance. At least some basic inlining is needed to make code work at tolerable speeds.
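(A typical example of what dies without inlining; the illustration is mine. With no inlining at all, each iteration pays two real function calls.)

#include <vector>

// At -O0, v.size() and v[i] stay as out-of-line calls on every
// iteration; with even basic inlining they collapse to a compare
// and a pointer dereference.
long sum(const std::vector<int>& v) {
    long total = 0;
    for (std::vector<int>::size_type i = 0; i < v.size(); i++)
        total += v[i];
    return total;
}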
Let's see: GCC has optimization off by default. VC++ projects by default give you two build configurations, one with no optimization. If you're using something advertised and marketed as an optimizing compiler (like ICC), then it's up to you to write your makefiles appropriately.
I'd like to see aggressive optimized subscript and iterator checking in gcc. We've been discussing this in the C++ standards newsgroup, and the consensus is that early detection of assertion failures is within the standard, so it's OK to do this.
Here's what I'm talking about. Consider
vector<int> tab(10); ...
int k=11; ...
int total=0;
for (unsigned int i=0; i<k; i++)
{ total += tab[i]; }
Now the usual way to check something like this would look like this:
vector<int> tab(10); ...
int k=11; ...
int total=0;
for (unsigned int i=0; i<k; i++)
{ assert(i < tab.size()); total += tab[i]; }
Note that the assert will fail on the iteration for which i=10. (The vector is indexed 0..9, remember.) The first ten iterations get executed, then there's an assertion failure; if the assert wasn't there, "undefined behavior" would occur at tab[10]. This is a valid check, but it adds overhead on every iteration.
If the compiler hoists the assert out of the loop, we get
vector<int> tab(10); ...
int k=11; ...
assert(k <= tab.size());
int total=0;
for (unsigned int i=0; i<k; i++)
{ total += tab[i]; }
and we've pulled the subscript check entirely out of the loop, eliminating almost all the cost of subscript checking. This is possible for a high percentage of subscripts in inner loops, where it matters. It's a huge win in performance, and it makes subscript checking feasible in production programs.
There was a philosophical argument that such an optimization is illegal, because it triggers an assertion failure well before the program has actually crashed. But the consensus was that if the program is inevitably headed for undefined or illegal behavior, the compiler can take any appropriate action, including detecting the problem early. So this is an OK optimization.

I'd like to see that in "gcc".
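(For anyone who wants to try it, the example as a self-contained program; built without NDEBUG, the assert fires on the iteration where i reaches 10, exactly as described.)

#include <cassert>
#include <vector>

int main() {
    std::vector<int> tab(10);
    const int k = 11;
    int total = 0;
    for (unsigned int i = 0; i < (unsigned int)k; i++) {
        assert(i < tab.size()); // fails when i == 10
        total += tab[i];
    }
    return total;
}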
Re:Patented? (Score:1)
Crossing my fingers and 'rtfa'ing now.
Re:Patented? (Score:1)
You should skip class less often!
Re:Patented? (Score:2)
"Can someone please steal this and put it in GCC for free???"
Way 2 go.