Hardware Technology

How to Kill x86 and Thread-Level Parallelism

kid inputs: "There's an interesting article discussing how one might go about 'killing' x86. The article details a number of different technological solutions, from a clean 64-bit replacement (Alpha?), to a radically different VLIW approach (Itanium), and an evolutionary solution (Opteron). As is often the case in situations like these, market forces dictate which technologies become entrenched and whether or not they stay that way (VHS vs Beta, anyone?). Another article by the same author covers hardware multi-threading and exploiting thread level parallelism, like Intel's Hyperthreading or IBM's POWER4 with its dual-cores on a die. These types of implementations can really pay off if the software supports it. In the case of servers, most applications tend to be multi-user, and so are parallel in nature."
  • by zulux ( 112259 ) on Saturday January 31, 2004 @01:37PM (#8144850) Homepage Journal


    Post! First

    A From Litte A system endian!

    Rules! x86

    • Why can't there just be a recompiler for x86? Have a program that crawls through the executable, recompiling the instructions along the way, and at conditional jumps ignore the condition and recompile both possible paths. It doesn't seem too hard. Wouldn't this work? (A sketch of this kind of traversal appears at the end of this thread.)
      • It's called binary translation. It's not impossible; search on CiteSeer. But how do you tell (in the general case) the difference between code and data?
        • That's why I'm saying have it virtually execute the program: trace along it, jumping to all possible paths. Anything it executes will be code; everything else will be data. And have it take both paths of conditional jumps. Why wouldn't this work?
          • It would work for any code that does the exact same thing on every run. However, it would not work for any program that (a) depends on user input, (b) depends on input from other hardware, (c) uses a random number generator or is time-sensitive, or (d) is not guaranteed to terminate. Oh, come to think of it, it'll fail on any program that (e) uses certain information as both code and data, which is less rare than you might think.

            It's the same issue as decompiling, really; binary translation is decompile-recompile.
            • Programs do the exact same thing on every run. Jumps are jumps; they can be followed. Conditional jumps either jump or they don't, so you just follow both possible paths. You do have a point about code and data in one place, but that usually only happens in JITs, virtual machines, and program packers. For JITs and virtual machines you'll just have to have a port to the destination system (which is pretty likely to already exist if it's popular). You could have a plugin system that recognizes exe
              • Programs do the exact same thing on every run. Jumps are jumps; they can be followed.

                If only it were that simple. Ever heard of a "computed goto"?
                • If only it were that simple. Ever heard of a "computed goto"?

                  Computed gotos can be a problem, but they can be overcome or, in a new architecture, eliminated entirely.

                  A fully general solution requires JIT translation. Lay the code out in blocks with block metadata to indicate the state of translation. Pre-translate by starting at the entry point and following program flow through both sides of branches. As you note, a computed jump cannot be predicted for all cases. However, once a computed jump instruction

              • Echo does different things when I call it with the argument "foo" than when I call it with the argument "bar", no?
              • The dynamic linker is nothing more than a program that does the "same thing" every time it runs: what it does is equivalent to reading the requested executable, writing its contents to memory as data, then jumping to it. (The reality is both more and less complicated due to mmaping.) System-level programs and libraries, plus just-in-time environments like Java, do stuff like that all the time.

        • The Crusoe processor from Transmeta did just that. Intel's own chips do some form of conversion from x86 CISC into RISC microcode, though to a lesser degree than the Transmeta chips. They have to perform such conversions in order to keep up with the RISCs.
          The x86 is terrible. It sprang from a chip designed for calculators and still carries all of the baggage from that origin. This is why the Athlon 64 is so distressing. Sure, it's a more convenient transition, and I'm all for dethroning Intel, but it maintains
      • Do a Google search for HP's Dynamo. Fascinating stuff.
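
    What the thread above calls recompiling both sides of every conditional jump is known as recursive-traversal disassembly. Below is a minimal sketch of that worklist algorithm in C; the toy instruction set, encoding, and sample program are all invented for illustration, and the sketch gives up exactly where the thread says it must: at a computed jump, where only runtime (JIT) translation can help.

      /* Recursive-traversal code discovery over a toy instruction set.
         Follow both sides of every conditional branch; whatever is never
         reached is presumed data. */
      #include <stdio.h>

      enum op { FALL, JMP, JCC, CJMP, RET };    /* CJMP = computed jump */

      struct insn { enum op op; int target; }; /* target: branch destination */

      static const struct insn prog[] = {
          /* 0 */ { FALL, 0 },
          /* 1 */ { JCC,  5 },   /* conditional: follow BOTH 2 and 5 */
          /* 2 */ { FALL, 0 },
          /* 3 */ { JMP,  7 },
          /* 4 */ { FALL, 0 },   /* never reached: presumed data */
          /* 5 */ { FALL, 0 },
          /* 6 */ { CJMP, 0 },   /* targets unknowable statically */
          /* 7 */ { RET,  0 },
      };
      #define N (int)(sizeof prog / sizeof prog[0])

      int main(void) {
          int is_code[N] = {0};
          int work[N + 1], top = 0;
          work[top++] = 0;                       /* entry point */
          while (top > 0) {
              int pc = work[--top];
              while (pc >= 0 && pc < N && !is_code[pc]) {
                  is_code[pc] = 1;
                  if (prog[pc].op == JMP)       pc = prog[pc].target;
                  else if (prog[pc].op == JCC)  { work[top++] = prog[pc].target; pc++; }
                  else if (prog[pc].op == FALL) pc++;
                  else break;   /* RET ends the path; CJMP needs runtime help */
              }
          }
          for (int i = 0; i < N; i++)
              printf("%d: %s\n", i, is_code[i] ? "code" : "data?");
          return 0;
      }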
  • by Anonymous Coward
    Buy Apple :D
  • Don't forget (Score:4, Interesting)

    by Misinformed ( 741937 ) on Saturday January 31, 2004 @01:42PM (#8144888)
    The space shuttle still uses 16-bit x86s, and the financial system is reliant on v_e_r_y old systems which spew out dot-matrix-printed backups. Old systems survive today, and IMHO always will. It has to be organic.
    • Wrong, the shuttle does not use x86 processors.
      • I know of at least one instrument used on the space shuttle that *does* use x86 processors. A 286, to be exact. The reason for this is that the 286 (at least, the one they're using) is fully static, so it isn't affected by the radiation the way the dynamic components in newer processors are.

        Basically, if you use DRAM in space, the tiny capacitors inside end up getting disrupted by the ambient radiation, causing bits to get flipped.

        • When people talk about the space shuttle computers they usually mean the 5 computers that control vehicle flight.

          Anyway, lots of Intel Pentium and later class computers have flown on the shuttle. I don't think they have too much trouble with the radiation, despite being off-the-shelf models. The shuttle is still well protected from radiation at its typical altitude.
        • I was reading in ... ars?... that if you have a gigabyte of memory, you can expect about one random bit flip per week, just from quantum fluctuations. Of course, most bit flips are benign, occurring on pages marked clean and unused, or subsequently overwritten before being read.

          However, if desktop memory gets bigger, ECC RAM will become necessary. Memory size appears to have been constant at 256/512 MB for a while now, so the increase has slowed, if not stopped. (A toy example of how ECC corrects a flipped bit follows below.)
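
      For the curious, the single-error correction ECC RAM performs boils down to something like the following Hamming(7,4) sketch. Real ECC DIMMs use a wider SECDED code over 64-bit words, and the payload and flipped bit below are invented for illustration, but the idea is the same: a flipped bit yields a nonzero "syndrome" that is literally the position of the bad bit.

        /* Toy ECC: Hamming(7,4), the simplest single-error-correcting code. */
        #include <stdio.h>

        /* Encode 4 data bits d3..d0 into 7 bits, parity at positions 1,2,4. */
        static unsigned encode(unsigned d) {
            unsigned b3 = (d >> 0) & 1, b5 = (d >> 1) & 1,
                     b6 = (d >> 2) & 1, b7 = (d >> 3) & 1;
            unsigned p1 = b3 ^ b5 ^ b7;     /* covers positions 1,3,5,7 */
            unsigned p2 = b3 ^ b6 ^ b7;     /* covers positions 2,3,6,7 */
            unsigned p4 = b5 ^ b6 ^ b7;     /* covers positions 4,5,6,7 */
            return (p1 << 0) | (p2 << 1) | (b3 << 2) | (p4 << 3) |
                   (b5 << 4) | (b6 << 5) | (b7 << 6);
        }

        /* Decode, correcting a single flipped bit if present. */
        static unsigned decode(unsigned c) {
            unsigned bit[8];
            for (int i = 1; i <= 7; i++) bit[i] = (c >> (i - 1)) & 1;
            unsigned s1 = bit[1] ^ bit[3] ^ bit[5] ^ bit[7];
            unsigned s2 = bit[2] ^ bit[3] ^ bit[6] ^ bit[7];
            unsigned s4 = bit[4] ^ bit[5] ^ bit[6] ^ bit[7];
            unsigned syndrome = s1 | (s2 << 1) | (s4 << 2);
            if (syndrome) {              /* syndrome = position of bad bit */
                printf("corrected bit %u\n", syndrome);
                c ^= 1u << (syndrome - 1);
                for (int i = 1; i <= 7; i++) bit[i] = (c >> (i - 1)) & 1;
            }
            return bit[3] | (bit[5] << 1) | (bit[6] << 2) | (bit[7] << 3);
        }

        int main(void) {
            unsigned data = 0xB;              /* 4-bit payload: 1011 */
            unsigned word = encode(data);
            unsigned hit  = word ^ (1u << 4); /* "cosmic ray" flips position 5 */
            printf("sent %x, received-after-flip %x\n", data, decode(hit));
            return 0;
        }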
    • So does old software. I've seen many a checkout-like system that appeared to be running on DOS, and some terminals I walked up to (even though I may not have been supposed to reach them physically, they were relatively unguarded) responded directly to the three-fingered salute.

      And while old hardware still works, especially as long as you have software that's ported to it, old software does not. For that matter, since old hardware is so cheap, people who would keep using 16-bit processors should buy 32-bit
  • Let's kill x86! (Score:2, Insightful)

    by ObviousGuy ( 578567 )
    We should rewrite all of our COBOL programs in C while we're at it.

    Might as well compound the folly of tossing out a perfectly good instruction set with the folly of tossing out perfectly good source code.

    Update, don't reinvent. The desire to reinvent is a junior-engineer character flaw. It takes several experiences of spending long hours tracking down bugs in the new implementation, rather than simply updating some older code that worked fine, to unlearn it.
    • There are two opposing opinions on this matter:

      1. The mythical man-month: Plan to build one to throw away. You will anyhow.

      2. Hack something together. Extend it. It will work fine. (This approach works excellently in Common Lisp and proves deadly for Perl programs.)

      It is true that Intel's base instruction set has survived the last 18 years largely unchanged, and if you count the pre-80386 era, even longer. It is also true that it is proven and works. But if you have ever tried to write an assembler or disasse
      • Re:Let's kill x86! (Score:5, Informative)

        by Valar ( 167606 ) on Saturday January 31, 2004 @04:02PM (#8145900)
        The problem with keeping the x86 architecture and its ISA is that it carries around legacy burdens from the 286. Even the P4 still boots into real-address mode and has to be PUT into protected mode. There are hundreds and hundreds of instructions and over 100 registers (but still only 8 GPRs), many of which overlap in purpose or are used for entirely non-intuitive purposes (CMPX EAX, EAX). x86 is ready, at the least, for a real version 2 that isn't afraid to break compatibility in order to add major architectural advances (I wouldn't mind a register ring :).
        • Re:Let's kill x86! (Score:2, Insightful)

          by Anonymous Coward
          It has to be put into protected mode at boot-up? Wow, that must take like 10 ns at least, every single time you cycle the power! You're right, better just scrap the whole thing...
          • And of course, address translation doesn't cost anything in terms of die size, performance, or power consumption. And it never contributed to unnecessarily complicated code and operating systems. Obviously the engineers at AMD agree with you; otherwise, why would they have dumped the segmented memory model for x86-64... Oh wait.
            • And of course, address translation doesn't cost anything in terms of die size, performance, or power consumption

              Nope, it doesn't. It used to in the early 90s, but now we have transistors to spare. The ISA doesn't matter anymore; it's at most a second-order effect on die size, power, and performance. I design x86 processors for a living. It's a fact.

              There are a million tricks architects can play to get around poor ISAs. What are the fastest SPECint machines on the planet? Hmm...x86 machines!

              The only rea
        • At least they could dump 16-bit real mode altogether, but imagine all the problems this would create. Most noticeable: the BIOS (which is itself mostly written for real mode, apart from some half-hearted attempts to introduce protected-mode interfaces, which proved too buggy to use anyway) as we know it might (a) cease to exist or (b) grow into something that has at least some importance to the OS.
        • There's no reason that real mode can't be phased out. The first take on it will need a strap to determine whether the CPU starts in real or protected mode, mostly to avoid a great deal of chaos for the BIOS. There's no reason the CPU can't start in flat 32-bit mode with all segments set to 0-0xffffffff.

          Of course, LinuxBIOS spends as little time as possible in real mode before going to flat 32-bit mode, but other BIOSes will need more significant changes.

          It should be possible to phase in a new mode where the

    • "Update, don't reinvent. The desire to reinvent is a junior engineer character flaw. "
      That is not always true. There was a company called Wright Aircraft Engines, and yes, it was started by the Wright brothers. In the late '40s and early '50s they were one of the top engine makers. They did not want to waste time with those newfangled jet engines. Their piston engines were the standard in airliners, and they thought it would go on forever... It didn't.
      Sometimes starting over is a good thing.
      The x86 is also fa
  • h/w vs s/w (Score:2, Insightful)

    by StarBar ( 549337 )
    This is much like my day-to-day work. The h/w guys think they are gods and always blame us s/w guys for not utilizing the smartness of their designs fast enough. S/w compatibility is what counts for general-purpose systems, and it always will. You can cry your guts out about bad system design and segment hell, etc., and it will not help.
    • Re:h/w vs s/w (Score:3, Interesting)

      by oscast ( 653817 )
      I think the opposite: software should accommodate hardware. Software should be the commodity and hardware the primary asset.
      • I agree with that idealistic view, but software is much more complex than hardware, and as a pragmatist I can tell you that a completely new architecture will not take over x86's domination of the current market. It's just too expensive. If only the hardware guys could understand that too. They need to invent something that is 10 times better and 10 times cheaper to manufacture to stir the pot; twice the performance at half the price will not be good enough.
  • by caesar79 ( 579090 ) on Saturday January 31, 2004 @02:34PM (#8145308)
    "Throughput computing"..where the performance is measured not individually but in aggregate.
    See their media kit at http://www.sun.com/aboutsun/media/presskits/throughputcomputing/ [sun.com] for more details.

    However, I believe the whole idea is nothing new. AFAIK, there are only two ways to increase the performance of a processor (operations per second): either increase the IPC (instructions per cycle) by increasing parallelism, or decrease the cycle time by increasing the clock rate (GHz).

    Each method has its limits and follows the law of diminishing returns. For example, increasing the clock rate implies increasing the number of stages in the pipeline, and after, say, 10,000 stages the penalties imposed by flushing the pipeline might cancel out the increased GHz. Similarly, if you manage to place 100,000 cores on a chip, scheduling among those cores and providing real-time memory access for all of them becomes the bottleneck. Hence I take statements like "how to kill the x86" with a pinch of salt.

    Finally, it will be the fabrication (physical) technology that decides which of these dies. For example, if tomorrow someone comes up with a process that enables 100 GHz chips (think extensions of SOI, etc.), decreasing the cycle time will win. Similarly, if someone comes out with femtometre (10^-15 m) fabrication technology, then parallelism will win.
    • I just have to point out: 1 femtometer is about 100,000 times smaller than the nominal "diameter" of a hydrogen atom (roughly 1 angstrom).

      But who knows, maybe we will have superstring transistors in the future.
    • Especially what the article called vertical MT (or coarse-grained MT) was the basis of the coolest supercomputer evar: the Tera.

      It had no cache, no cache logic at all! It did have hardware support for a god-awful number of threads per CPU, though. Each time one of them stalled, it would switch threads and keep going. After about 60 or so cycles (this was a few years back), the memory read would be back, so if you had 64 threads per CPU you would never see a memory-latency-related stall. (A toy simulation of this trick follows at the end of this thread.)

      As all things extreme (think CM
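
    The switch-on-stall arithmetic above is easy to check with a toy simulation; the 60-cycle latency and the assumption that every instruction is a memory read are invented to echo the numbers in the parent post. Utilization saturates once the thread count covers the memory latency, which is why 64 threads sufficed:

      /* Toy model of Tera-style latency hiding: every instruction is a
         memory read with fixed latency, and the CPU switches to any
         ready hardware thread instead of stalling. */
      #include <stdio.h>

      #define MEM_LAT 60        /* cycles until a read returns (invented) */
      #define CYCLES  100000L

      static double utilization(int nthreads) {
          long ready_at[128] = {0};   /* cycle when each thread may issue */
          long issued = 0;
          int rr = 0;                 /* round-robin scan position */
          for (long t = 0; t < CYCLES; t++) {
              for (int k = 0; k < nthreads; k++) {
                  int i = (rr + k) % nthreads;
                  if (ready_at[i] <= t) {         /* runnable thread found */
                      ready_at[i] = t + MEM_LAT;  /* issue, then wait */
                      issued++;
                      rr = (i + 1) % nthreads;
                      break;                      /* one issue per cycle */
                  }
              }                       /* else: a memory-latency stall cycle */
          }
          return (double)issued / CYCLES;
      }

      int main(void) {
          for (int n = 1; n <= 128; n *= 2)
              printf("%3d threads -> %3.0f%% of cycles issue work\n",
                     n, 100.0 * utilization(n));
          return 0;
      }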
  • by ajagci ( 737734 ) on Saturday January 31, 2004 @02:40PM (#8145349)
    Two decades ago, the instruction set still mattered because it was closely tied to how the processor executed things. Today, we can put enough logic between the instruction stream and the processor that the instruction set makes no difference anymore.

    And VLIW in particular is quite unconvincing: processors should rely less on compilers, not impose a bigger burden on software writers.
    • Today, we can put enough logic between the instruction stream and the processor

      Wouldn't a simpler instruction set allow for a smaller decoder, which requires less die space and emits less heat?

      processors should rely less on compilers

      To the other extreme, do you propose a processor that can run Perl directly? What compromise would you find best?

    • I agree with you that CISC vs. RISC is not an issue any more. The decoded trace cache that the P4 has, for instance, takes the conversion off the critical path.

      x86 however has a ridiculously small number of registers. This means that you have to go to memory A LOT. It's easy to make register operations fast, extremely hard to make memory fast. The performance gap between memory and processors is constantly increasing.

      That's why x86-64 has 16 general purpose registers, Alpha - 64 and Itanium ... 128.

      Bottom

  • Cost-efficiency > * (Score:1, Interesting)

    by Anonymous Coward
    Since I use linux and it or its applications can be ported to most architectures you throw at it, I could theoretically have my pick of the litter for a future system. What I consider most is the bang-for-my-buck factor.

    Sure, I could spend $20 on eBay and get a Sparc Lunchbox, but there's not enough processing power in there for me. I could also go out and buy a year-old IBM mainframe, but I doubt any auction site will have them anywhere near my price range. I want something that's decent but also cheap. I
    • Yes, economy of scale determines who provides the most bang for the buck, but there are more dimensions to the purchasing decision than MIPS, MFLOPS, and $$. There are watts and hours and then, God forbid, intangibles.

      ARM and PPC have the best shot at displacing IA-32 and its best successor, AMD64, because they accommodate very real market segments. We keep waiting for commodity PPC hardware, but it never emerges because the OSS community isn't big enough to drive sales to economical volume; but some magical event co
  • The POWER4 is two full CPUs (cores) on a single die, not that crap that Intel puts out called Hyper-Threading, where you only have a single full CPU plus some extra logic to quickly swap over to another thread when needed.
    • Re:Power 4? (Score:2, Informative)

      by chez69 ( 135760 )
      The POWER5 will have two CPUs on a die, and both will behave like hyperthreaded Intel CPUs.

      So each 'CPU' will look like 4 logical CPUs.
    • Re:Power 4? (Score:2, Insightful)

      by beerman2k ( 521609 )
      Don't underestimate Intel. Unlike the Gnomes, they have a plan:

      Step 1: Hyperthreading
      Step 2: Multicore
      Step 3: Crush competition (i.e. Profit)

      • I wouldn't expect to see multicore in home PCs within the next five years, even if multicore becomes so cheap that Intel could start putting it in its Celeron chips. The limitation is that Microsoft charges for Windows licenses per core; a license for Windows XP Professional, which can handle two cores, costs much more than a license for Windows XP Home Edition, which can handle one. Wouldn't multicore require selling the machine with a more expensive version of Windows?

        I say "next five years"

  • by Animats ( 122034 ) on Saturday January 31, 2004 @03:21PM (#8145644) Homepage
    Depends on the goal. Here's an architecture for reliability. If vendors had to pay whenever a program crashed, we would have seen this years ago.
    • Channelized I/O. With current peripheral-bus-to-memory interfaces, peripherals can store anywhere in memory, so drivers impact system stability. It doesn't have to be that way; IBM got this right in mainframe design in the 1960s. You want an MMU between the peripherals and memory. Drivers then become non-privileged programs. Existing peripherals don't even need to know there's an MMU in the middle, just as programs don't. (A sketch of such an I/O MMU appears at the end of this thread.)
    • High-speed copy. Copying data in memory should be really fast, so fast that it's almost free, even if it takes copy-on-write hardware in the cache to do it. Why? Because then the temptation to put everything in one big address space decreases. With good interprocess communication (think QNX messaging, not CORBA or, horrors, SOAP), building programs out of components can actually work. This includes the OS: file systems, networking, and drivers should all be user programs.

      The neat hardware implementation of this would be to make all MOV instructions take nearly the same time, regardless of the amount of data moved. A MOV should result in a remapping of the source and destination memory in the cache system. Even if this were just implemented for aligned moves, it would be a big help. When your application's 8K buffer needs to be copied to the file system, that copy should be done by updating cache control info, not by really doing it.

    • Graphics MMUs. Get rid of the "window damage" approach and have real hardware support for overlapping windows. All that's needed is big sprites. Then programs don't have to know or care which window is on top. "Overlay planes" do some of this, but they're not general enough.

      With this, windowing becomes far simpler. Each window is maintained locally. Shared window management is reduced to screen space allocation, which is done by commanding the window MMU.

    • by RalphBNumbers ( 655475 ) on Saturday January 31, 2004 @05:15PM (#8146315)
      Actually, what you refer to under "Graphics MMUs" has been done for a while under OS X with Quartz and Quartz Extreme.
      Windows are drawn on OpenGL surfaces, and their layering is handled entirely by the GPU in Quartz Extreme; plain old Quartz does basically the same thing in software buffers. In either case, an app never has to do any redrawing when one of its windows is revealed; it's all handled by Quartz.
      And supposedly, whenever it eventually comes out, Longhorn will do more or less the same thing.

      Channelized I/O is probably a good idea, but it's either going to cost you some bandwidth (routing all I/O through an expanded version of current MMUs) or be expensive (a separate MMU for I/O). I'm not saying it might not be worth it in the long run, but it will take a bite out of price/performance in the short term for questionable immediate stability gains (one would hope that most people writing kernel-space drivers have the sense to KISS).

      High speed copy sounds really interesting, but I'm not sure how practical it is to add to current systems.

        • Channelized I/O is probably a good idea, but it's either going to cost you some bandwidth (routing all I/O through an expanded version of current MMUs) or be expensive (a separate MMU for I/O).

          It shouldn't hurt bandwidth. The problem with MMUs is latency, and adding a few hundred nanoseconds to I/O latency isn't going to hurt. I/O accesses have far more coherency than regular memory accesses, so you don't need much caching within the I/O MMU.

        The original Apollo Domain machines had an MMU between t

    • Explicit hardware support for overlapping windows is unnecessary. You don't really want the number of windows you can open to be limited by your video hardware, do you? It can be handled easily in software, using video-card acceleration features that are standard today. XFree86 still does things the old-fashioned "redraw when windows are exposed" way, but I don't think there's any technical reason why a new X server couldn't keep the contents of all windows in memory at the same time and never redraw due
      • There have been implementations with support for 8 windows, and that clearly wasn't enough. If you had support for, say, 1024, that probably would be enough, even if every icon took up a window slot. The transistor count for this isn't a big deal any more.

        The point here is that we're tied to some architectural decisions from an era when transistors were more expensive, and those decisions are worth a new look.
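
    Returning to the Channelized I/O bullet above, here is a minimal sketch of what an MMU between peripherals and memory does; the structures, names, and sizes are invented for illustration. The point is that a buggy or hostile DMA hits a translation check rather than kernel memory:

      /* Sketch of an I/O MMU: a device's DMA goes through a tiny page
         table, so an unmapped or read-only address faults instead of
         scribbling over the kernel. */
      #include <stdio.h>
      #include <string.h>

      #define PAGE_SIZE 4096
      #define IO_PAGES  8          /* size of the device's DMA window */

      struct io_pte { int present, writable; char *frame; };
      struct iommu  { struct io_pte pt[IO_PAGES]; };

      /* The driver, now just an unprivileged program, maps one buffer. */
      static void iommu_map(struct iommu *mu, int page, char *buf, int rw) {
          mu->pt[page].present  = 1;
          mu->pt[page].writable = rw;
          mu->pt[page].frame    = buf;
      }

      /* Every DMA from the device is translated and checked here. */
      static int dma_write(struct iommu *mu, unsigned io_addr,
                           const void *src, unsigned len) {
          unsigned page = io_addr / PAGE_SIZE, off = io_addr % PAGE_SIZE;
          if (page >= IO_PAGES || !mu->pt[page].present
              || !mu->pt[page].writable || off + len > PAGE_SIZE) {
              fprintf(stderr, "iommu: blocked DMA to 0x%x\n", io_addr);
              return -1;               /* fault instead of corruption */
          }
          memcpy(mu->pt[page].frame + off, src, len);
          return 0;
      }

      int main(void) {
          static char rxbuf[PAGE_SIZE];
          struct iommu mu = {0};
          iommu_map(&mu, 2, rxbuf, 1);  /* device may write I/O page 2 only */
          dma_write(&mu, 2 * PAGE_SIZE, "packet", 7);   /* allowed */
          dma_write(&mu, 5 * PAGE_SIZE, "stray", 6);    /* blocked */
          printf("rxbuf contains: %s\n", rxbuf);
          return 0;
      }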

  • Multiple chips (Score:3, Interesting)

    by Tablizer ( 95088 ) on Saturday January 31, 2004 @03:37PM (#8145750) Journal
    Why not define a new standard machine-code set and start making new chips with it? Old software can use the old chip and new software the new chip. Game consoles do something like this.

    Emulators can be implemented so that old chips can still run code for the new standard (and vice versa), just more slowly. For development, training, simple apps, and testing, that is usually fast enough. (A toy emulator loop follows at the end of this thread.)

    A box could come with both an x86 and an Alpha clone, for example. Eventually, over time, the x86 chip is no longer worth it, and the few old apps lying around just use emulation mode.
    • Re:Multiple chips (Score:2, Informative)

      by Lapzilla ( 739264 )
      IIRC, Apple did this with their DOS-compatible Macs. You could run both DOS and Mac OS at the same time, and there was some BIOS function that switched between the two.
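
    As a sketch of what such an emulation mode amounts to, here is a toy fetch-decode-execute loop in C; the four-opcode instruction set and the sample "legacy binary" are invented for illustration. A real x86 emulator is the same loop, scaled up and sped up with translation caches:

      /* A toy fetch-decode-execute loop, the core of any emulation mode
         for a retired ISA. */
      #include <stdio.h>

      enum { LOADI, ADD, PRINT, HALT };
      struct insn { int op, reg, arg; };

      int main(void) {
          int r[4] = {0};
          /* The "legacy binary": r0 = 2; r1 = 40; r1 += r0; print r1 */
          const struct insn prog[] = {
              { LOADI, 0,  2 },
              { LOADI, 1, 40 },
              { ADD,   1,  0 },      /* r1 += r0 */
              { PRINT, 1,  0 },
              { HALT,  0,  0 },
          };
          for (int pc = 0; ; pc++) {               /* fetch */
              const struct insn *i = &prog[pc];
              switch (i->op) {                     /* decode + execute */
              case LOADI: r[i->reg]  = i->arg;                     break;
              case ADD:   r[i->reg] += r[i->arg];                  break;
              case PRINT: printf("r%d = %d\n", i->reg, r[i->reg]); break;
              case HALT:  return 0;
              }
          }
      }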
