Are 64-bit Binaries Slower than 32-bit Binaries? 444

JigSaw writes "The modern dogma is that 32-bit applications are faster, and that 64-bit imposes a performance penalty. Tony Bourke decided to run a few tests on his SPARC to see if indeed 64-bit binaries ran slower than 32-bit binaries, and what the actual performance disparity would ultimately be."
  • by jusdisgi ( 617863 ) on Saturday January 24, 2004 @12:08AM (#8072833)
    I can only assume that this is only going to be limited to SPARC...I mean, we've already seen the major differences between Itanium and Opteron dealing with 32 bit apps, right? Or is this a different question, since Opteron gets to run 32bit effectively "native"? And, at this point, when running 32 bit apps on a 64 bit chip, just what can "native" mean anyway?
  • Re: OSNews (Score:2, Informative)

    by Ninwa ( 583633 ) <jbleau@gmail.com> on Saturday January 24, 2004 @12:12AM (#8072857) Homepage Journal
    Well neither of you have provided any actual evidence proving they rock.. or sock... o.O -tromps off to OSNews to check out their benchmarks- I shall be back ^_^
  • by citanon ( 579906 ) on Saturday January 24, 2004 @12:15AM (#8072874)
    But that's only because it has two extra execution units for 64 bit code. 64 bit software is not inherently faster. Most people here would know this, but I just thought I might preemptively clear up any confusion.
  • by Ken Broadfoot ( 3675 ) on Saturday January 24, 2004 @12:16AM (#8072889) Homepage Journal
    "Most "tech gurus" I've talked to at my university about the benefites of 64bit processing say that it is in part due to the increase of the number of registers (allowing you to use more at the same time, shortening the number of cycles needed)."

    Not just kernels. All programs.. however this happens in the compiler. Or assembly code. Not in "kernels" unless they are assembly code kernels..

    Basically this test is moot without using compilers optimized for the 64 bit chips..

    --ken
  • by ParisTG ( 106686 ) <tgwozdz@g[ ]l.com ['mai' in gap]> on Saturday January 24, 2004 @12:22AM (#8072912)
    Makes me wonder what tricks AMD has managed to pull out of their hat to increase 64 bit performance by 20-30%...

    They added more registers to an architecture that had very few of them. This is likely where most of the performance increase comes from in 64bit mode on the Opteron, not from the fact that it is 64bit.

  • by Anonymous Coward on Saturday January 24, 2004 @12:25AM (#8072919)
    Makes me wonder what tricks AMD has managed to pull out of their hat to increase 64 bit performance by 20-30%...

    No tricks. The benefit doesn't come from 64 bit-ness, it comes from other changes in the ISA when something is compiled in 64-bit mode. There are 8 more GPRs on AMD64 than on IA-32. More registers = less movs to/from cache = faster. Also the integrated memory controller can't hurt. Also it has 8 more SSE2 registers IIRC.

    Note that none of these things is tied to AMD64 having 64 bit regs. Course there are plenty of 64-bit benefits too (PK ops like RSA are basically instantly 4 times as fast on 64-bit machines as compared to a 32-bit machine).
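    A rough C sketch of the register-count point (hypothetical example; the exact allocation depends on the compiler): with more than about eight values live at once, a compiler targeting 32-bit x86 has to spill some of them to the stack, while the same function compiled for x86-64's 16 GPRs can usually keep them all in registers.

        /* Ten simultaneously-live values: more than the 8 GPRs of IA-32,
         * but comfortably within the 16 GPRs of x86-64. */
        long mix(const long *p)
        {
            long a = p[0], b = p[1], c = p[2], d = p[3];
            long e = p[4], f = p[5], g = p[6], h = p[7];
            long i = p[8], j = p[9];
            /* All ten stay live until the end, so an 8-register target
             * spills some of them to memory; a 16-register target need not. */
            return (a ^ b) + (c ^ d) + (e ^ f) + (g ^ h) + (i ^ j);
        }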
  • by fifirebel ( 137361 ) on Saturday January 24, 2004 @12:26AM (#8072924)
    Also because in 64-bit mode, the Opteron has access to more registers. The IA-32 architecture is so register-limited that throwing more registers at any task makes a huge difference.
  • Jebus christ. (Score:2, Informative)

    by eddy ( 18759 ) on Saturday January 24, 2004 @12:28AM (#8072934) Homepage Journal

    This article sounds completely stupid. Someone didn't know that pulling 64 bits across the bus (reading/writing) can take longer than 32 bits? Never thought of the caches?

    Just read the GCC Proceedings [linux.org.uk], there's explanations and benchmarks of the why/how/when of x86-64 in 32 vs 64-bit mode, both speed of execution and image size.

  • Re:Moving more data (Score:5, Informative)

    by Waffle Iron ( 339739 ) on Saturday January 24, 2004 @12:33AM (#8072959)
    *cough* wider data busses *cough*. 'course this does mean that 64 bit code on systems with 32 bit wide data paths will be slower

    By the same token, 32-bit code on systems with 64-bit wide data paths will move twice as many pointers in one bus cycle.

    Today's CPUs almost completely decouple buses from ALU-level operations. Buses usually spend most of their time transferring entire cache lines to service cache misses, so if your pointers take up a bigger portion of a cache line, 64-bit code is still consuming more bus clock cycles per instruction on average no matter how wide your buses are.

    BTW, 32-bit processors have been using 64-bit external data buses since the days of the Pentium-I.
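    To put numbers on the cache-line point (assuming a typical 64-byte line): a line holds 16 32-bit pointers but only 8 64-bit pointers, so pointer-heavy data structures touch twice as many lines. A quick check:

        #include <stdio.h>

        int main(void)
        {
            const unsigned line = 64;   /* typical cache line size, in bytes */
            printf("pointers per cache line: %u\n", line / (unsigned)sizeof(void *));
            /* Prints 16 for a 32-bit binary, 8 for a 64-bit binary. */
            return 0;
        }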

  • by Anonymous Coward on Saturday January 24, 2004 @12:46AM (#8073030)
    Don't forget (ugh) all the additional clock cycles needed for 16:16 and 16:32 bit addressing modes- (which we still have, unless you run an older Novell server which uses flat 32 memory mode.) Also, all the additional clock cycles needed to go from real to protected mode (stupid PC hardware stuff- but only on Windoze platforms) and "thunking"- also only on Windoze.

    Yes, I remember the days of the first 386es and how much slower they were! I still have a 286-25 (I think...somewhere...) that kicked 386-33 butts.
  • by Anonymous Coward on Saturday January 24, 2004 @12:51AM (#8073053)

    Operations concerning large integers were MADE for 64 bit

    How? Explain please. Cryptographic algorithms perform logical operations on each individual bit. To set a single bit in a register, you have to do something like:

    mov rbx, value      // value is 0 or 1
    shl rbx, 8          // shift left into bit position 8
    and rax, ~(1 << 8)  // mask off the target bit 8
    or rax, rbx         // or the value into the register
    [perform operation on rax, etc]
    Since you're operating on a bitstream - at the bit level - and each bit operation depends on others - my question is: how - in any way, does it matter that rax is 64-bit rather than a 32-bit eax? For simple math, yes, 64-bit will help. For bitwise logic and crypto - no. If anything - besides overflow - a large register size is a hindrance for operations like this.
  • Re:gcc? (Score:4, Informative)

    by PatMouser ( 1692 ) on Saturday January 24, 2004 @12:54AM (#8073066) Homepage
    Yup! It turns out poorly optimized code in 32 bit mode and I shudder to think what the 64 bit code would look like.

    And before you start complaining, that comes from 3 years coding for a graphics company where every clock tick counts. We saw a MAJOR (like more than 20%) difference in execution speed of our binaries depending upon which compiler was used.

    Hell, gcc didn't even get decent x86 (where x>4) support in a timely manner. Remember pgcc vs. gcc?
  • Re: OSNews (Score:3, Informative)

    by be-fan ( 61476 ) on Saturday January 24, 2004 @12:59AM (#8073087)
    Because he used the same compiler, in 32-bit and 64-bit mode???
  • by Anonymous Coward on Saturday January 24, 2004 @12:59AM (#8073090)
    Have you thought about using AWE [microsoft.com]? (Of course if you just used SQL Server instead of rolling your own database you'd get automatic AWE support...)

    We tried win2k3 and the /3gb switch, but we kept having very odd things happen.

    Besides possible bugs in your code, that might be because /3GB only leaves 1GB for the OS which might not be enough in some situations. On W2K3 you can try /userva [microsoft.com].

  • by destiney ( 149922 ) on Saturday January 24, 2004 @01:07AM (#8073122) Homepage

    I benched MySQL4 on a dual Athlon-MP system and it ran about 32% faster in 64-bit mode. Try it yourself is all I can say.

    It was a sweet upgrade as I had been using the server in 32-bit mode the first couple of months having it.

  • by calidoscope ( 312571 ) on Saturday January 24, 2004 @01:10AM (#8073134)
    I can only assume that this is only going to be limited to SPARC...

    Probably applicable to the G5 as well (and Alpha, PA-RISC, MIPS), which like the SPARC has pretty much the same architecture for 32 bits and 64 bits.

    The Itanic has an IA-32 subsystem hanging on it - performance is really poor compared to the main 64 bit core. The Opteron has more registers available in 64 bit mode than 32 bit mode and should show some performance improvements just for that reason.

    As has been said mucho times - 64 bit processors really shine when you have lots of memory to work with. Having said that, one advantage of 64 bits is being able to memory-map a large file, which can result in better performance even with much less than 4 GB of memory - witness the MySQL tests.

  • of course, they are (Score:5, Informative)

    by ajagci ( 737734 ) on Saturday January 24, 2004 @01:12AM (#8073137)
    Both 32bit and 64bit binaries running on the same processor get the same data paths and the same amount of cache on many processors. But, for one thing, 64bit binaries use up more cache memory for both code and data. So, yes, if you run 32bit binaries on a 64bit processor with a 32bit mode, then the 32bit binaries will generally run faster. But the reason why they run well and all the data paths are wide is because the thing is a 64bit processor in the first place--that's really what "64bit" means.

    64bit may help with speed only if software is written to take advantage of 64bit processing. But the main reason to use 64bit processing is for the larger address space and larger amount of memory you can address, not for speed. 4Gbytes of address space is simply too tight for many applications and software design started to suffer many years ago from those limitations. Among other things, on 32bit processors, memory mapped files have become almost useless for just the applications where they should be most useful: applications involving very large files.
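    A minimal sketch of the memory-mapped-file point (POSIX mmap; the multi-gigabyte file is hypothetical): the mapping below can only work in a 64-bit process, because a 32-bit process has at most 4 GB of virtual address space to fit it into.

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        /* Map an entire (possibly > 4 GB) file read-only; returns NULL on failure. */
        void *map_whole_file(const char *path, size_t *len_out)
        {
            int fd = open(path, O_RDONLY);
            if (fd < 0)
                return NULL;
            struct stat st;
            if (fstat(fd, &st) < 0) {
                close(fd);
                return NULL;
            }
            /* In a 32-bit build a multi-gigabyte mapping simply fails
             * (ENOMEM), no matter how much physical RAM the box has. */
            void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            close(fd);   /* the mapping remains valid after close */
            if (p == MAP_FAILED)
                return NULL;
            *len_out = (size_t)st.st_size;
            return p;
        }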
  • Re:A Makefile? (Score:3, Informative)

    by LoadWB ( 592248 ) on Saturday January 24, 2004 @01:13AM (#8073144) Journal
    I accept this article as dumbed down a bit for the lower end, non-guru user who is wooed by the 64-bit "revolution" but not technically savvy enough to understand the "32-bit faster than 64-bit" comments that continue to surface in many forums. If bean counters and cheap tech workers can be made to understand that there truly ARE benefits in 64-bit technology, then progress will not be held in place by beating the 32-bit horse to death -- even if it does run at hellaspeeds.

    How many times have we slapped around these types of people with our new technology trout only to hear "Yeah, but $OLD_TECHNOLOGY is STILL being developed, and it's cheap. Why should we bother with $NEW_TECHNOLOGY?" Yeah yeah, I know that technically 64-bit isn't NEW, but to these guys...
  • Re:Moving more data (Score:5, Informative)

    by dfung ( 68701 ) on Saturday January 24, 2004 @01:14AM (#8073145)
    Oh, now I'll *cough* a little too.

    Modern processors (which actually stretches back at least 10 years) really want to run out of cache as much as possible, both for instruction and data access. And they've never wanted to do it more than now when in the x86 world, the processor core and L1 cache are operating at 3200MHz vs. 400MHz for the RAM.

    One thing that has to happen is that you make a bet on locality of execution (again both for instructions and data) and burst load a section of memory into the caches (L2 and L1, and sometimes even L3). In implementation terms, it takes some time to charge up the address bus, so you increase bandwidth and execution speed by charging up address n, but doing a quick read of n+1, n+2, n+3, and more on the latest CPUs. You only have to wiggle the two low-order address lines for the extra reads, so you don't pay the pre-charge penalty that you would for accessing memory randomly.

    That's good if you're right about locality and bad if you're wrong. That's what predictive branching in the processor and compiler optimizations are all about - tailoring execution to stay in cache as much as possible.

    On a 64-bit processor, those burst moves really are twice as big and they really do take longer (the memory technology isn't radically different between 32- and 64-bit architectures, although right now it would be odd to see a cost-cutting memory system on a 64-bit machine). If all the accesses of the burst are actually used in execution, then both systems will show similar performance (the 64-bit will have better performance on things like vector opcodes, but for regular stuff, 1 cycle is 1 cycle). If only half of the bursted data is used, then the higher overhead of the burst will penalize the 64-bit processor.

    If you're running a character based benchmark (I've never looked at gzip, but it seems like it must be char based), then it's going to be hard for the 64-bit app and environment to be a win until you figure out some optimization that utilizes the technology. If your benchmark was doing matrix ops on 64-bit ints, then you'll probably find that the Opteron, Itanium, or UltraSparc will be pretty hard to touch.

    A hammer isn't the right tool for every job as much as you'd like it to be. I actually think that the cited article was a reasonable practical test of performance, but extrapolating from that would be like commenting on pounding nails with a saw - it's just a somewhat irrelevant measure.

    I guess I'm violently agreeing with renehollan's comment about speed bumps - apps that can benefit from an architectural change are as important as more concrete details such as compiler optimizations.
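    A small C illustration of the locality point (assuming 64-byte lines and 8-byte longs): the first loop uses every byte of each cache line it pulls in, while the second touches only one element per line, so most of every burst is wasted.

        #define N (1 << 20)
        static long data[N];

        long sum_sequential(void)   /* good locality: every burst fully used */
        {
            long s = 0;
            for (int i = 0; i < N; i++)
                s += data[i];
            return s;
        }

        long sum_strided(void)      /* poor locality: one element per 64-byte line */
        {
            long s = 0;
            for (int i = 0; i < N; i += 8)
                s += data[i];
            return s;
        }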
  • Re: OSNews (Score:5, Informative)

    by be-fan ( 61476 ) on Saturday January 24, 2004 @01:18AM (#8073156)
    GCC uses the same code generator for both Sparc32 and Sparc64.
  • by drinkypoo ( 153816 ) <drink@hyperlogos.org> on Saturday January 24, 2004 @01:34AM (#8073207) Homepage Journal
    64 bit architectures do not automatically have more general purpose registers than 32 bit ones. x86-64 happens to have twice as many GPRs as x86, but that's a special case.

    The benefit of a 64 bit processor is a larger address space and the ability to work on 64 bit data types much much faster than on a 32 bit system. More GPRs is an additional, separate benefit.

  • by pritchma ( 169341 ) on Saturday January 24, 2004 @01:55AM (#8073306)
    Dude, he knew you were going to write this comment and so he included page 4 just for you. *grin*
  • Slower? It depends. (Score:5, Informative)

    by BobaFett ( 93158 ) on Saturday January 24, 2004 @01:56AM (#8073313) Homepage
    Depends mainly on what data the test is using. If it's floating-point heavy, and uses double, then it always was 64-bit. On 64-bit hardware it'll gain the full-width data path and will be able to load/store 64-bit floating-point numbers faster, all things being equal. If it uses ints (not longs), it is and will stay 32-bit, and there will be no difference unless the hardware is capable of loading two 32-bit numbers at once, effectively splitting the memory bus in two (HP PA-RISC can do it, his old Sun cannot, newest Suns can, I don't know if Opterons can). Finally, if the test uses data types which convert from 32 to 64 bits it will become slower, but only if it does enough math on these types. The latter is important, since every half-complicated program uses pointers, explicitly or implicitly, but not every program does enough pointer arithmetic compared to other operations to make a difference. However, if it does, then it'll copy pointers in and out of main memory all the time, and you can fit half as many 64-bit pointers into the cache.
    That's where the slowdown comes from (plus some possible library issues; early 64-bit HP and Sun system libraries were very slow for some operations).
    If your process resident memory size is the same in 64 and 32-bit mode, you should not see any slowdown. If you do, it's an issue with the library or the compiler (even though the compiler in this case is the same, the code generator is not, and there may be some low-level optimizations it does differently). If the resident size of the 64-bit application is larger, you are likely to see a slowdown, and the more memory-bound the program is the larger it'll be.
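    A quick way to see which types actually change size between the two modes (assuming the usual ILP32 vs LP64 conventions used by Solaris and gcc; Win64 keeps long at 32 bits):

        #include <stdio.h>

        int main(void)
        {
            /* Typical output, ILP32 build: int=4 long=4 ptr=4 double=8
             *                  LP64 build: int=4 long=8 ptr=8 double=8 */
            printf("int=%zu long=%zu ptr=%zu double=%zu\n",
                   sizeof(int), sizeof(long), sizeof(void *), sizeof(double));
            return 0;
        }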
  • Re: OSNews (Score:5, Informative)

    by be-fan ( 61476 ) on Saturday January 24, 2004 @02:00AM (#8073332)
    On SPARC, there are no 64-bit-only optimizations. The only reason to use 64-bit math is either if you need 64-bit integers, or use 64-bit pointers. None of the benchmarks can use either (the MySQL benchmark could, but the machine only had 256MB of RAM), so 64-bit mode offers them nothing.
  • Re:retarded. (Score:5, Informative)

    by fucksl4shd0t ( 630000 ) on Saturday January 24, 2004 @02:09AM (#8073372) Homepage Journal

    They've at best proved a supposition about a single architecture/process/compiler family. They have not proved a general case. Did they test on amd64? Alpha? Mips? No? Then why are they making unwarranted generalizations? Ah, they're retarded.

    Actually, they didn't make generalizations. He very specifically stated that he only tested on a 64-bit Sparc, and an older one at that. He pointed out that while you can make some general conclusions, you can and should run tests on other architectures.

    He also pointed out that he only tested a few applications, not a whole bunch of them. He was questioning conventional wisdom and wanted to know if there was any fact behind it, and he determined that there was. He did not determine the entire scope of the facts, and he did not claim to do so.

    Sorry, I found it to be an interesting read, but you really have to take the first page seriously when he says "I only tested these things, so I can only conclude based on these tests, and it doesn't prove the general case." If you ignore that, then yes, you'll wind up with what you took away from the article.

  • by davegust ( 624570 ) <gustafson@ieee.org> on Saturday January 24, 2004 @02:10AM (#8073377)

    Address Windowing Extensions (AWE) really are a good solution for your problem.

    If you're doing Win32, but really want 64-bit, then consider Win64 [microsoft.com]. There are several OEMs [microsoft.com] providing it.

    If your response is "can't afford it", then your .5 Terabyte database project is probably underfunded and likely to fail.

  • by Anonymous Coward on Saturday January 24, 2004 @02:28AM (#8073433)
    "Using modern technology to build a 386 chip would result in one of the highest clock speeds ever but it would be practically useless."

    This is completely wrong. Clock rate is determined by your slowest pipe stage.

    A modern P4 is a 20+ stage pipeline because they want to squeeze the logic into tiny little sections, so that they don't have any "big" pipe stages. This lets them ramp up the clock rate.

    A 386-era design isn't going to be nearly that heavily pipelined. Since it has more logic per pipe stage, it will have a very slow clock rate by today's standards, even if you upgraded it to a modern fab process.

    Plus, a 386 executes x86 instructions instead of "micro-ops" (the RISC-style instructions that are executed at the core of a modern pentium). Those instructions "do more" and require more logic to begin with.
  • by ajagci ( 737734 ) on Saturday January 24, 2004 @02:58AM (#8073505)
    Having an extra ALU around will surely push more 32bit numbers through the pipe

    That's an additional reason. There are probably many other places that neither of us has thought of that have been scaled up to make a true 64bit processor and that benefit 32bit applications running on the same hardware in 32 bit mode.

    I'm beginning to wonder these days how much CPU speed even matters though.

    It matters a great deal for digital photos, graphics, speech, handwriting recognition, imaging, and a lot of other things. And, yes, regular people are using those more and more.

    Unless you are running photoshop, SETI, Raytracing, etc., you probably wouldn't notice if I replaced your 3GHz processor with a 1GHz.

    You probably would. Try resizing a browser window with "images to fit" selected (default in IE, I believe). Notice how that one megapixel image resizes in real time? CPU-bound functionality has snuck into lots of places.
  • Re:OSNews = UnNews? (Score:5, Informative)

    by fitten ( 521191 ) on Saturday January 24, 2004 @04:28AM (#8073790)
    Don't know how they could be exactly the same *except* for the word size. In order to process the two different word sizes, there will have to be differences in circuitry (ALU is wider, so are lots of things like the buffers between pipeline stages and such).

    One of the issues that people forget is that a 64-bit processor may be able to retire a set number of 64-bit, say, integer additions per clock cycle (NOTE: retiring an operation per clock cycle does NOT mean that the operation takes one clock cycle to perform). Well, the odds are that it will also retire the same number of 32-bit integer additions per clock cycle. It may take 5 clock cycles to do either sized addition even. So, what do you have that is different? Well, on the SPARC, most simple operations are going to be similar in execution time. Regardless of the number of register windows that the particular architecture supports (which may come into play in some codes), you still basically have 32 registers for use in your computational kernel. The only real difference between many 32-bit and 64-bit versions of the code will be the amount of data that has to be moved around.

    Where the 64-bit will help is when the 32-bit code has to synthesize 64-bit operations or has to do things like work on bit streams (not word/byte streams exactly) and can work on 64-bits at a time rather than doing really the same thing on 32-bits two times as much (128 bytes can be traversed in 32 32-bit operations or 16 64-bit operations - half the number of reads/operations).

    All of this is pretty well understood by those who have dealt with these types of systems before. However, the relative newcomer Opteron has an additional twist. In 64-bit mode, there are twice as many registers that can be used compared to 32-bit mode. This may (read: will) cause some codes to run faster simply because more data can be stored in registers rather than memory, since even L1 cache is a bit slower than a register.
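    A sketch of the bit-stream case mentioned above (hypothetical helper; buffers assumed to be a multiple of 8 bytes): XOR-ing one buffer into another word-at-a-time takes half as many loads, ops and stores with 64-bit words as with 32-bit words.

        #include <stddef.h>
        #include <stdint.h>

        /* XOR src into dst, one machine word at a time. With uint64_t words
         * this loop runs half as many times as it would with uint32_t words. */
        void xor_buffers(uint64_t *dst, const uint64_t *src, size_t nbytes)
        {
            size_t nwords = nbytes / sizeof(uint64_t);
            for (size_t i = 0; i < nwords; i++)
                dst[i] ^= src[i];
        }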
  • by harlows_monkeys ( 106428 ) on Saturday January 24, 2004 @04:38AM (#8073824) Homepage
    How? Explain please

    All public key systems currently in use depend on doing arithmetic on large integers. Let's start with the classical algorithms for addition/subtraction/multiplication/division.

    The addition and subtraction algorithms are O(N) and multiplication/division is O(N^2), where N is the number of digits.

    What is a digit? On a 32-bit processor, it will probably be 32 bits. On a 64-bit processor, it will probably be 64 bits.

    What this means is that operating on large integers, say, 1024 bits, will be twice as fast on the 64-bit processor for addition/subtraction and 4 times as fast for multiplication/division.

    Most large integer packages use Karatsuba multiplication instead of the classical algorithm. Karatsuba is O(N^1.58). On a 64-bit processor, that is 3 times faster than on a 32-bit processor.

    Looking at it from the other direction, if on a 32-bit processor, using a given set of algorithms which are working in base B, you can do public key cryptography using N bits, then just by using the same algorithm, working in base B^2 on a 64-bit processor running at the same basic speed, you can in the same time do public key cryptography using 2N bits.
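    A bare-bones sketch of the classical schoolbook addition described above (hypothetical limb type; real bignum libraries such as GMP hand-code this): with 64-bit limbs a 1024-bit number is 16 limbs instead of 32, so the O(N) addition loop runs half as many times and the O(N^2) multiplication does a quarter of the digit products.

        #include <stddef.h>
        #include <stdint.h>

        typedef uint64_t limb;   /* a 32-bit build would use uint32_t limbs instead */

        /* r = a + b, where all three numbers are n limbs long; returns the carry out. */
        limb bignum_add(limb *r, const limb *a, const limb *b, size_t n)
        {
            limb carry = 0;
            for (size_t i = 0; i < n; i++) {
                limb s = a[i] + carry;
                limb c1 = (s < carry);       /* carry out of a[i] + carry */
                r[i] = s + b[i];
                carry = c1 + (r[i] < s);     /* plus carry out of s + b[i] */
            }
            return carry;
        }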

  • by NerveGas ( 168686 ) on Saturday January 24, 2004 @05:01AM (#8073879)
    The performance increase comes from a combination of lower memory latency (built-in memory controller) and an increased number of registers. The small number of registers on x86 chips has always been one of the main gripes people have had about the architecture.

    steve
  • by pe1chl ( 90186 ) on Saturday January 24, 2004 @06:51AM (#8074115)
    That is why I am a bit astonished that he finds a 20% slowdown, then also examines the increased size of the executables, finds it is about 20%, and considers that a minor issue.

    I think the 20% increased size is the reason for the 20% worse performance, because memory access is often the bottleneck for real-life programs.
  • by Gadzinka ( 256729 ) <rrw@hell.pl> on Saturday January 24, 2004 @07:22AM (#8074204) Journal
    It's all in benchmark. It doesn't matter what you benchmark, only what you benchmark with ;)

    But there are several points:

    1. The results for openssl are no good because openssl for sparc32 has critical parts written in asm, while for sparc64 it is generic C.

    2. The results would be much better if you did it with Sun's cc, which is much better optimised for both sparc32 and sparc64.

    3. The results, even if they were accurate, are good only for sparc32 vs sparc64. Basically, sparc64 is the same processor as sparc32, only wider ;)

    I don't know what the case is for ppc32 vs ppc64, but when you look at x86 vs x86-64 (or amd64 as some prefer to call it) you have to take into account the much larger number of registers, both GP and SIMD.

    As a matter of fact, x86 is such a lousy architecture that it really doesn't have GP registers -- every register in an x86 processor has its own special purpose, different from the rest. It looks better in the case of FP and SIMD operations, but it's ints that most programs deal with. Just compile your average C code to asm and look how much of it deals with swapping data between registers.

    (well, full symmetry of registers for pure FP, non-SIMD operations was true until P4, when Intel decided to penalize the use of FP register stack and started to ``charge'' you for ``FP stack swap'' commands, which were ``free'' before, and are still free on amd processors)

    x86-64 on the other hand in 64bit mode has twice as many registers with full symmetry between them, as well as even more SIMD registers. And more execution units accessible only in 64bit mode.

    But from these chaotic notes you can already see that writing a good comparison of different processors is a little bit more than ``hey, I've some thoughts that I think are important and want to share''. And the hard work starts with a proper title for the story -- in this case it should be ``Are sparc64 binaries slower than sparc32 binaries?''.

    Robert
  • by Anonymous Coward on Saturday January 24, 2004 @10:34AM (#8074687)
    with gcc 3.3.2-r2 sizeof long on amd64 is 8 bytes..
  • Re:6502? (Score:1, Informative)

    by Anonymous Coward on Saturday January 24, 2004 @02:50PM (#8075982)
    Ever since the Pentium Pro, 36 bit address buses have been supported. If you look in your linux kernel options you'll see support for PAE (Physical Address Extension) and more than 4 gig of memory. This allows for up to 64 gig of memory. But the addressing mechanism is a bit messy.

    I also *THINK* (can't find anything to back this up) that the Opteron and Itanium are not capable of addressing a full 64 bit address space.

    So basically we already increased the address bus size; it's a messy solution, it increases page lookup times and you can still only access any 4 gig at once (kind of reminds me of the EMS and XMS systems to allow you to access addresses beyond 1 meg in real mode).

    Clean 64 bit implementations will also give a major boost to both integer and floating point performance. See - http://www.digit-life.com/articles2/amd-hammer-family/index.html

"What man has done, man can aspire to do." -- Jerry Pournelle, about space flight

Working...