Are 64-bit Binaries Slower than 32-bit Binaries? 444
JigSaw writes "The modern dogma is that 32-bit applications are faster, and that 64-bit imposes a performance penalty. Tony Bourke decided to run a few tests on his SPARC to see if 64-bit binaries really do run slower than 32-bit binaries, and what the actual performance disparity would be."
architectural differences... (Score:4, Informative)
Re: OSNews (Score:2, Informative)
Opteron is faster in 64 bit (Score:5, Informative)
Re:Couldn't time fix this? (Score:4, Informative)
Not just kernels. All programs. However, this happens in the compiler, or in assembly code; not in "kernels" unless they are hand-written assembly kernels.
Basically this test is moot without using compilers optimized for the 64-bit chips.
--ken
Re:I'll save you guys the read. (Score:5, Informative)
They added more registers to an architecture that had very few of them. This is likely where most of the performance increase comes from in 64bit mode on the Opteron, not from the fact that it is 64bit.
Re:I'll save you guys the read. (Score:1, Informative)
No tricks. The benefit doesn't come from 64-bitness; it comes from other changes in the ISA when something is compiled in 64-bit mode. There are 8 more GPRs on AMD64 than on IA-32. More registers = fewer movs to/from cache = faster. Also the integrated memory controller can't hurt. Also it has 8 more SSE2 registers IIRC.
Note that none of these things is tied to AMD64 having 64-bit regs. Course there are plenty of 64-bit benefits too (PK ops like RSA are basically 4 times as fast on a 64-bit machine as compared to a 32-bit machine).
Re:Opteron is faster in 64 bit (Score:5, Informative)
Jebus christ. (Score:2, Informative)
This article sounds completely stupid. Someone didn't know that pulling 64 bits across the bus (reading/writing) can take longer than 32 bits? Never thought of the caches?
Just read the GCC Proceedings [linux.org.uk], there's explanations and benchmarks of the why/how/when of x86-64 in 32 vs 64-bit mode, both speed of execution and image size.
Re:Moving more data (Score:5, Informative)
By the same token, 32-bit code on systems with 64-bit wide data paths will move twice as many pointers in one bus cycle.
Today's CPUs almost completely decouple buses from ALU-level operations. Buses usually spend most of their time transferring entire cache lines to service cache misses, so if your pointers take up a bigger portion of a cache line, 64-bit code is still consuming more bus clock cycles per instruction on average no matter how wide your buses are.
BTW, 32-bit processors have been using 64-bit external data buses since the days of the Pentium-I.
Re:And 32 bit is slower than 16 bit (Score:1, Informative)
Yes, I remember the days of the first 386es and how much slower they were! I still have a 286-25 (I think...somewhere...) that kicked 386-33 butts.
Re:Something is wrong. (Score:1, Informative)
Operations concerning large integers were MADE for 64 bit
How? Explain please. Cryptographic algorithms perform logical operations on each individual bit. To set a single bit in a register, you have to do something like:
Since you're operating on a bitstream - at the bit level - and each bit operation depends on others - my question is: how, in any way, does it matter that rax is 64-bit rather than a 32-bit eax? For simple math, yes, 64-bit will help. For bitwise logic and crypto - no. If anything - besides overflow - a large register size is a hindrance for operations like this.
Re:gcc? (Score:4, Informative)
And before you start complaining, that comes from 3 years coding for a graphics company where every clock tick counts. We saw a MAJOR (like more than 20%) difference in execution speed of our binaries depending upon which compiler was used.
Hell, gcc didn't even get decent x86 (where x>4) support in a timely manner. Remember pgcc vs. gcc?
Re:I'd kill for a 64 bit platform... (Score:1, Informative)
We tried win2k3 and the /3gb switch, but we kept having very odd things happen.
Besides possible bugs in your code, that might be because /3GB only leaves 1GB for the OS which might not be enough in some situations. On W2K3 you can try /userva [microsoft.com].
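For reference, a boot.ini entry combining the two switches looks roughly like this (the ARC path and the /USERVA value of 3030 are illustrative examples, not a recommendation for your hardware):

```ini
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /fastdetect /3GB /USERVA=3030
```

The /USERVA value carves the split more finely than /3GB alone, giving some address space back to the kernel.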
Well.. MySQL4 loves 64-bit (Score:1, Informative)
I benched MySQL4 on a dual Athlon-MP system and it ran about 32% faster in 64-bit mode. Try it yourself is all I can say.
It was a sweet upgrade, as I had been using the server in 32-bit mode for the first couple of months I had it.
Re:architectural differences... (Score:5, Informative)
Probably applicable to the G5 as well (and Alpha, PA-RISC, MIPS), which like the SPARC has pretty much the same architecture for 32 bits and 64 bits.
The Itanic has an IA-32 subsystem hanging on it - performance is really poor compared to the main 64 bit core. The Opteron has more registers available in 64 bit mode than 32 bit mode and should show some performance improvements just for that reason.
As has been said mucho times - 64-bit processors really shine when you have lots of memory to work with. Having said that, one advantage of 64 bits is being able to memory-map a large file, which can result in better performance even with much less than 4 GB of memory - witness the MySQL tests.
of course, they are (Score:5, Informative)
64bit may help with speed only if software is written to take advantage of 64bit processing. But the main reason to use 64bit processing is for the larger address space and larger amount of memory you can address, not for speed. 4Gbytes of address space is simply too tight for many applications and software design started to suffer many years ago from those limitations. Among other things, on 32bit processors, memory mapped files have become almost useless for just the applications where they should be most useful: applications involving very large files.
Re:A Makefile? (Score:3, Informative)
How many times have we slapped around these types of people with our new technology trout only to hear "Yeah, but $OLD_TECHNOLOGY is STILL being developed, and it's cheap. Why should we bother with $NEW_TECHNOLOGY." Yeah yeah, I know that technically 64-bit isn't NEW, but to these guys...
Re:Moving more data (Score:5, Informative)
Modern processors (which actually stretches back at least 10 years) really want to run out of cache as much as possible, both for instruction and data access. And they've never wanted to do it more than now when in the x86 world, the processor core and L1 cache are operating at 3200MHz vs. 400MHz for the RAM.
One thing that has to happen is that you make a bet on locality of execution (again, both for instructions and data) and burst-load a section of memory into the caches (L2 and L1, and sometimes even L3). In implementation terms, it takes some time to charge up the address bus, so you increase bandwidth and execution speed by charging up address n, but doing a quick read of n+1, n+2, n+3, and more on the latest CPUs. You only have to wiggle the two low-order address lines for the extra reads, so you don't pay the pre-charge penalty that you would for random accesses in memory.
That's good if you're right about locality and bad if you're wrong. That's what predictive branching in the processor and compiler optimizations are all about - tailoring execution to stay in cache as much as possible.
On a 64-bit processor, those burst moves really are twice as big and they really do take longer (the memory technology isn't radically different between 32- and 64-bit architectures, although right now it would be odd to see a cost-cutting memory system on a 64-bit machine). If all the accesses of the burst are actually used in execution, then both systems will show similar performance (the 64-bit will have better performance on things like vector opcodes, but for regular stuff, 1 cycle is 1 cycle). If only half of the bursted data is used, then the higher overhead of the burst will penalize the 64-bit processor.
If you're running a character-based benchmark (I've never looked at gzip, but it seems like it must be char-based), then it's going to be hard for the 64-bit app and environment to be a win until you figure out some optimization that utilizes the technology. If your benchmark was doing matrix ops on 64-bit ints, then you'll probably find that the Opteron, Itanium, or UltraSparc will be pretty hard to touch.
A hammer isn't the right tool for every job as much as you'd like it to be. I actually think that the cited article was a reasonable practical test of performance, but extrapolating from that would be like commenting on pounding nails with a saw - it's just a somewhat irrelevant measure.
I guess I'm violently agreeing with renehollan's comment about speed bumps - apps that can benefit from an architectural change are as important as more concrete details such as compiler optimizations.
Re:Couldn't time fix this? (Score:3, Informative)
The benefit of a 64 bit processor is a larger address space and the ability to work on 64 bit data types much much faster than on a 32 bit system. More GPRs is an additional, separate benefit.
Re:This guy is a tool (Score:2, Informative)
Slower? It depends. (Score:5, Informative)
That's where the slowdown comes (plus some possible library issues, early 64-bit HP and Sun system libraries were very slow for some operations).
If your process resident memory size is the same in 64 and 32-bit mode, you should not see any slowdown. If you do, it's an issue with the library or the compiler (even though the compiler in this case is the same, the code generator is not, and there may be some low-level optimizations it does differently). If the resident size of the 64-bit application is larger, you are likely to see a slowdown, and the more memory-bound the program is, the larger it'll be.
Re:retarded. (Score:5, Informative)
They've at best proved a supposition about a single architecture/process/compiler family. They have not proved a general case. Did they test on amd64? Alpha? Mips? No? Then why are they making unwarranted generalizations? Ah, they're retarded.
Actually, they didn't make generalizations. He very specifically stated that he only tested on a 64-bit Sparc, and an older one at that. He pointed out that while you can make some general conclusions, you can and should run tests on other architectures.
He also pointed out that he only tested a few applications, not a whole bunch of them. He was questioning conventional wisdom and wanted to know if there was any fact behind it, and he determined that there was. He did not determine the entire scope of the facts, and he did not claim to do so.
Sorry, I found it to be an interesting read, but you really have to take the first page seriously when he says "I only tested these things, so I can only conclude based on these tests, and it doesn't prove the general case." If you ignore that, then yes, you'll wind up with what you took away from the article.
If its currently a win32 app... (Score:2, Informative)
Address Windowing Extensions (AWE) really are a good solution for your problem.
If you're doing Win32, but really want 64-bit, then consider Win64 [microsoft.com]. There are several OEMs [microsoft.com] providing it.
If your response is "can't afford it", then your .5 Terabyte database project is probably underfunded and likely to fail.
Re:There's always a trade-off (Score:2, Informative)
This is completely wrong. Clock rate is determined by your slowest pipe stage.
A modern P4 is a 20+ stage pipeline because they want to squeeze the logic into tiny little sections, so that they don't have any "big" pipe stages. This lets them ramp up the clock rate.
A 386-era design isn't going to be nearly that heavily pipelined. Since it has more logic per pipe stage, it will have a very slow clock rate by today's standards, even if you upgraded it to a modern fab process.
Plus, a 386 executes x86 instructions instead of "micro-ops" (the RISC-style instructions that are executed at the core of a modern Pentium). Those instructions "do more" and require more logic to begin with.
Re:of course, they are (Score:4, Informative)
That's an additional reason. There are probably many other places that neither of us has thought of that have been scaled up to make a true 64bit processor and that benefit 32bit applications running on the same hardware in 32 bit mode.
I'm beginning to wonder these days how much CPU speed even matters though.
It matters a great deal for digital photos, graphics, speech, handwriting recognition, imaging, and a lot of other things. And, yes, regular people are using those more and more.
Unless you are running photoshop, SETI, Raytracing, etc., you probably wouldn't notice if I replaced your 3GHz processor with a 1GHz.
You probably would. Try resizing a browser window with "images to fit" selected (default in IE, I believe). Notice how that one megapixel image resizes in real time? CPU-bound functionality has snuck in in lots of places.
Re:OSNews = UnNews? (Score:5, Informative)
One of the issues that people forget is that a 64-bit processor may be able to retire a set number of 64-bit, say, integer additions per clock cycle (NOTE: retiring an operation per clock cycle does NOT mean that the operation takes one clock cycle to perform). Well, the odds are that it will also retire the same number of 32-bit integer additions per clock cycle. It may take 5 clock cycles to do either-sized addition even. So, what do you have that is different? Well, on the SPARC, most simple operations are going to be similar in execution time. Regardless of the number of register windows that the particular architecture supports (which may come into play in some codes), you still basically have 32 registers for use in your computational kernel. The only real difference between many 32-bit and 64-bit versions of the code will be the amount of data that has to be moved around.
Where the 64-bit will help is when the 32-bit code has to synthesize 64-bit operations or has to do things like work on bit streams (not word/byte streams exactly) and can work on 64-bits at a time rather than doing really the same thing on 32-bits two times as much (128 bytes can be traversed in 32 32-bit operations or 16 64-bit operations - half the number of reads/operations).
All of this is pretty well understood by those who have dealt with this type of system before. However, the relative newcomer Opteron has an additional twist. In 64-bit mode, there are twice as many registers that can be used compared to 32-bit mode. This may (read: will) cause some code to run faster simply because more data can be kept in registers rather than memory, since even L1 cache is slower than a register.
Re:Something is wrong. (Score:3, Informative)
All public key systems currently in use depend on doing arithmetic on large integers. Let's start with the classical algorithms for addition/subtraction/multiplication/division.
The addition and subtraction algorithms are O(N) and multiplication/division is O(N^2), where N is the number of digits.
What is a digit? On a 32-bit processor, it will probably be 32 bits. On a 64-bit processor, it will probably be 64 bits.
What this means is that operating on large integers, say, 1024 bits, will be twice as fast on the 64-bit processor for addition/subtraction and 4 times as fast for multiplication/division.
Most large integer packages use Karatsuba multiplication instead of the classical algorithm. Karatsuba is O(N^1.58). On a 64-bit processor, that is 3 times faster than on a 32-bit processor.
Looking at it from the other direction, if on a 32-bit processor, using a given set of algorithms which are working in base B, you can do public key cryptography using N bits, then just by using the same algorithm, working in base B^2 on a 64-bit processor running at the same basic speed, you can in the same time do public key cryptography using 2N bits.
Re:Now I'm confused... (Score:4, Informative)
steve
Re:Slower? It depends. (Score:3, Informative)
I think the 20% increased size is the reason for the 20% worse performance, because memory access is often the bottleneck for real-life programs.
Summary of discussion (Score:3, Informative)
But there are several points:
1. The results for openssl are no good because openssl for sparc32 has critical parts written in asm, while for sparc64 it is generic C.
2. The results would be much better if you did it with Sun's cc, which is much better optimised for both sparc32 and sparc64.
3. The results, even if they were accurate, are good only for sparc32 vs sparc64. Basically, sparc64 is the same processor as sparc32, only wider.
I don't know what's the case for ppc32 vs ppc64, but when you look at x86 vs x86-64 (or amd64 as some prefer to call it) you have to take into account much larger number of registers, both GP and SIMD.
As a matter of fact, x86 is such a lousy architecture that it really doesn't have GP registers -- every register in an x86 processor has its own special purpose, distinct from the rest. It looks better in the case of FP and SIMD operations, but it's ints that most programs deal with. Just compile your average C code to asm and look at how much of it deals with swapping data between registers.
(well, full symmetry of registers for pure FP, non-SIMD operations was true until P4, when Intel decided to penalize the use of FP register stack and started to ``charge'' you for ``FP stack swap'' commands, which were ``free'' before, and are still free on amd processors)
x86-64, on the other hand, in 64-bit mode has twice as many registers with full symmetry between them, as well as even more SIMD registers. And more execution units accessible only in 64-bit mode.
But from these chaotic notes you can already see that writing a good comparison of different processors is a little bit more than ``hey, I've got some thoughts that I think are important and want to share''. And the hard work starts with a proper title for the story -- in this case it should be ``Are sparc64 binaries slower than sparc32 binaries?''.
Robert
Re:Not so simple for AMD64 (Score:1, Informative)
Re:6502? (Score:1, Informative)
I also *THINK* (can't find anything to back this up) that the Opteron and Itanium are not capable of addressing a full 64-bit address space.
So basically we already increased the data bus size; it's a messy solution, it increases page lookup times, and you can still only access any 4 gigs at once (kind of reminds me of the EMS and XMS systems that let you access addresses beyond 1 meg in real mode).
Clean 64 bit implementations will also give a major boost to both integer and floating point performance. See - http://www.digit-life.com/articles2/amd-hammer-fa