Why do CISC chips need to translate instructions?
May 20, 2008 9:25 AM

Why can't CISC instruction sets be converted to RISC/uops by an advanced assembler so that CISC chips, such as Intel's Atom, would be able to run with a simpler instruction decode hardware?

I just read this article from Ars Technica about Intel's new Atom processor. The author feels that the instruction decode hardware the Atom needs for its CISC architecture hurts its power consumption and increases its core size compared to the ARM Cortex-A9.

My question is, why do all of the decode at runtime, on-chip, with dedicated transistors, when it could be done by the assembler or by a post-assembler pass that translates x86 opcodes to uops? I'm suggesting a smaller hardware instruction decoder plus opcode-to-uop translation in software at compile/assemble time, not completely doing away with the decoder.

What stands in the way of doing the instruction decode well beforehand? Isn't the existing hardware logic a good starting point for the software version?

The only thing I can think of is that it would make it difficult to write self-modifying code. I'm not an EE, so a non-hardware CS or layman's answer would be best.
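
To make the kind of software pass I have in mind a bit more concrete, here's a rough sketch in Python (the opcode names and the uop decomposition are made up for illustration; I have no idea what Intel's real internal uop format looks like):

# A rough sketch of the kind of ahead-of-time pass I mean: expand CISC-style
# instructions into simple load/compute/store uops before the binary ever
# reaches the chip. The opcode names and the uop decomposition below are
# invented for illustration; Intel's real internal uop format isn't public.

def translate(insn):
    """Expand one pseudo-x86 instruction into a list of RISC-like uops."""
    op, dst, src = insn
    if op == "add_mem_reg":           # e.g. add [addr], reg (read-modify-write memory)
        return [
            ("LOAD",  "tmp0", dst),   # tmp0 <- mem[addr]
            ("ADD",   "tmp0", src),   # tmp0 <- tmp0 + reg
            ("STORE", dst, "tmp0"),   # mem[addr] <- tmp0
        ]
    if op == "mov_reg_reg":           # register-to-register moves are already uop-sized
        return [("MOVE", dst, src)]
    raise ValueError("unhandled opcode: %s" % (op,))

program = [("add_mem_reg", "0x1000", "eax"), ("mov_reg_reg", "ebx", "eax")]
for insn in program:
    print(insn, "->", translate(insn))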
posted by bdc34 to Computers & Internet (22 answers total) 2 users marked this as a favorite
 
hey! you just founded Transmeta!

Also, if it's a static recompilation, that's kind of a non-starter for commercial software. If you can recompile then you might as well simply retarget it to whatever instruction set you want. But if you mean on-the-fly, then yeah, people do that. The major issue is overhead and the amount of time needed to optimize efficiently. Depending on how you look at it you're essentially describing a virtual machine and then all the typical VM issues come into play.
posted by GuyZero at 9:37 AM on May 20, 2008 [1 favorite]


why do all of the decode at runtime

Purely to make the existing code base run without needing recompilation.

The x86 architecture has been well described as a pig in lipstick, but it's what 90% of the world's software was compiled to run on. So if your shiny new chip won't run x86 software without needing clever things done first, nobody is going to build a system around it.

You might want to look at the architecture of the Transmeta Crusoe if you're interested in alternative approaches to this kind of design constraint.
posted by flabdablet at 9:40 AM on May 20, 2008


Oh, jinx!
posted by flabdablet at 9:41 AM on May 20, 2008


Best answer: This is done for at least two reasons. First, to preserve compatibility with existing executables and second, to allow for more flexibility in the RISC core.

If the translation were done in software, then it would have to be done for every executable and library anyone would want to run on the RISC processor. There are two problems with this: first, it's time-consuming and second, there are copyright issues in creating a derivative work of the original binary.

Doing the translation in hardware also allows the RISC core to change without having to run everything through a new translator. New versions of the Atom may very well have new or different uops, and better op to uop optimizations may be discovered. By implementing all of it in hardware on the chip, the whole process is transparent to the end user.

Bear in mind that the op to uop translation process has been how x86 chips have worked since the Pentium Pro, so if there were a substantial benefit, I imagine Intel, AMD, VIA, etc. would have seized on it by now.

On preview: The Transmeta chips were a little different in that the software translation was still done on the fly and transparently rather than prior to execution and explicitly, but the basic idea of doing the translation in software was there. And, yeah, it didn't work, even in the mobile context.
posted by jedicus at 9:45 AM on May 20, 2008


What I'd like to see tried is a design that, given N pipeline stages each with the ability to run M simultaneous instructions, runs MxN parallel threads. Seems to me that this would give it comparable grunt to what a conventional superscalar architecture gets from all its clever instruction re-ordering stuff, without needing all the clever instruction re-ordering stuff; you wouldn't need to change the order you stuff instructions down the pipe if you're round-robin fetching them from N logically distinct code threads and each thread is operating with its own set of registers.
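
Something like this toy model, say, with a 4-stage pipeline fed round-robin from 4 threads (both numbers are just placeholders, not design points):

# Toy model: a 4-deep pipeline fed round-robin from 4 independent threads.
# One instruction enters per cycle, from the next thread in rotation, so no
# thread ever has two of its own instructions in flight at once, and the
# pipeline stays full without any reordering or prediction.

PIPELINE_DEPTH = 4
threads = [iter(["t%d_i%d" % (t, n) for n in range(3)]) for t in range(PIPELINE_DEPTH)]
pipeline = [None] * PIPELINE_DEPTH   # index 0 = fetch stage, index 3 = writeback

for cycle in range(12):
    tid = cycle % PIPELINE_DEPTH                    # round-robin thread selection
    pipeline.pop()                                  # the oldest instruction retires
    pipeline.insert(0, next(threads[tid], None))    # fetch the next instruction from that thread
    print("cycle %2d, thread %d: %s" % (cycle, tid, pipeline))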

When I first heard that Intel was bringing out a processor with "hyperthreading", I was hoping this is what they'd done, and was disappointed to find out that "hyper" means "two".
posted by flabdablet at 10:23 AM on May 20, 2008


Uh, wouldn't that have issues with branch prediction? I mean, one mispredicted branch and the whole MxN stages get killed - wouldn't they?
posted by GuyZero at 10:31 AM on May 20, 2008


It wouldn't even bother doing branch prediction. The point of branch prediction is to get the code that's most likely to be wanted flowing down the pipe ahead of time, so that the only time the pipeline runs dry is if the prediction was wrong. If the pipeline is always full anyway by virtue of having parts of N threads' worth of instructions being jammed down it at any instant, it never runs dry anyway, and you don't have to do the branch prediction.

Looking at instructions from any individual thread, you'd see instructions completing strictly sequentially, with no overlap at all. The outcome of the branch test would be known at the time the instruction following the branch is fetched, instead of needing to be guessed at. All the overlap would be between instructions in different threads, not between parts of instructions from the same thread.
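
To put rough numbers on that (with pipeline depth equal to thread count as my working assumption, not a requirement):

# Back-of-the-envelope check of that claim, assuming pipeline depth N and N
# interleaved threads issuing one instruction per cycle in rotation.
# Thread T fetches on cycles T, T+N, T+2N, ...; an instruction fetched on
# cycle c completes on cycle c + N - 1, one cycle before that thread's next
# fetch, so a branch is always resolved before its own thread needs the
# target address.

N = 4  # pipeline depth == number of interleaved threads
for t in range(N):
    fetch = t                 # cycle the branch instruction is fetched
    resolved = fetch + N - 1  # cycle the branch leaves the last pipeline stage
    next_fetch = fetch + N    # cycle the same thread fetches its next instruction
    assert resolved < next_fetch
    print("thread %d: branch fetched @%d, resolved @%d, next fetch @%d"
          % (t, fetch, resolved, next_fetch))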
posted by flabdablet at 10:43 AM on May 20, 2008


The problem with that, flabdablet, is that you need to have N fully parallel threads worth of instructions to run. That requires that the software you are running is fully parallel or you're running a lot of parallel separate tasks. Modern multi-core chips are moving to the model of simpler pipelines but lots of them, but this does not give you the behavior of "new chip==my code all runs faster now!" without significant redesigning of the software.
posted by ch1x0r at 5:01 PM on May 20, 2008


you need to have N fully parallel threads worth of instructions to run

which is pretty much the standard working load for any server OS. And on the desktop, with any modern CPU, available compute performance on any single task is typically grossly in excess of what's required; it's really only when multiple threads start contending that CPU performance becomes an issue.

The architecture would have a certain amount of inherent "speed-step" in it too; with N-1 threads executing HALT and only one executing code, most of the chip would not be changing state most of the time, and power consumption ought to drop a fair bit. It's a natural fit for stuff like codec graphs as well, so it might make a nice basis for a portable media player.

Modern multi-core chips are moving to the model of simpler pipelines but lots of them

which is not really what I'm advocating. I'm interested in the idea of a long, finely subdivided single pipeline serving multiple interleaved sets of state, in such a way that execution processing for any single set works about the same way it did on the 8048. No code reordering, no branch prediction, no intra-thread instruction overlap, no pipeline flush on exceptions. OK, you can have a cache - a big one if you want - but that's it for optimization :-)

I'm not trying to claim that this idea would Solve All Problems and make the world a better place - but I would be very interested to hear about the results of any research efforts in this line, if such exist.
posted by flabdablet at 7:19 PM on May 20, 2008


which is pretty much the standard working load for any server OS. And on the desktop, with any modern CPU, available compute performance on any single task is typically grossly in excess of what's required; it's really only when multiple threads start contending that CPU performance becomes an issue.

This may be the case some of the time, but I would argue it is probably more common to have a mostly idle system where the active tasks need as much CPU as they can possibly grab when they are running. Most of the systems I work with are of this nature. The reason superscalar processors exist is that most systems are not of a massively parallel nature. Additionally you start to incur serious memory performance problems when you are pulling unrelated pieces of memory into the same cache for multiple processes at once. That grossly worsens your performance, and, let's note, the power efficiency of your chip.

which is not really what I'm advocating. I'm interested in the idea of a long and very piecewise single pipeline serving multiple interleaved sets of state, in such a way that execution processing for any single set works about the same way it did on the 8048.

While I can't find anything directly speaking to this, I can promise you some grad student somewhere has published a paper or two on the topic. The reason this doesn't go anywhere is that it is incredibly hard to keep the pipeline filled if you have to rely on there being enough parallel work on the system to always have instructions ready to fetch.
posted by ch1x0r at 8:21 PM on May 20, 2008


Additionally you start to incur serious memory performance problems when you are pulling unrelated pieces of memory into the same cache for multiple processes at once

It seems to me that this, as well as your other points, apply equally strongly to multi-core CPUs, and there seems to be no shortage of those on the market.

If you ever run across one of those grad student papers, I'd be grateful for a heads-up; email in profile.
posted by flabdablet at 8:54 PM on May 20, 2008


It seems to me that this, as well as your other points, apply equally strongly to multi-core CPUs, and there seems to be no shortage of those on the market.

They do somewhat, although the cores on multi-core systems tend to have separate L1 caches, which can alleviate some of this. But yes, modern multi-core systems are going to rely more and more on parallelization of the code itself, which is rapidly becoming the big challenge in large-scale software system development. You should absolutely read the Piranha paper if you haven't; it's one of the major foundation papers for the current multi-core revolution and talks about some of your points.
posted by ch1x0r at 6:10 AM on May 21, 2008


this does not give you the behavior of "new chip==my code all runs faster now!" without significant redesigning of the software.

But most modern datacenters don't want faster. They want energy efficiency and spatial density.

VMware has rewritten the rules on this one - datacenters take 5-20 old servers and replace them with one box that has 2-4 quad-core processors. Wham. A fraction of the power usage. And anyone running a SaaS business is massively parallel from day one. The issue of designing desktop CPUs is solved. All these crazy new CPUs either target mobility or the datacenter where the rules are different.

I would argue it is probably more common to have a mostly idle system where the active tasks need as much CPU as they can possibly grab when they are running.

Yeah, well, I would argue no. The only major growth market for CPUs is datacenters (and mobile but the mobile people are cheap) so what you see being designed are datacenter CPUs. Lots of cores. Many parallel tasks. Often running a hypervisor.
posted by GuyZero at 7:51 AM on May 21, 2008


But most modern datacenters don't want faster. They want energy efficiency and spatial density.

But what do their users want? Are they willing to accept markedly slower processors? Current multi-cores still have pretty beefy processing available per core; the processing power per core hasn't grown like back in the day, but it hasn't drastically shrunk either. We aren't talking about multi-cores, though; we're talking about a processor that would presumably run each thread much slower in exchange for some power savings. I don't know that there is enough appetite out there currently for such a thing. Have you ever taken a code base built for a small multi-core machine and ported it to run comparably on a many-hundred slower core machine? It is not a pretty process, and not one most people are going to jump to do unless they absolutely have to.
posted by ch1x0r at 4:26 PM on May 21, 2008


I don't really see how one virtual processor per pipeline stage suddenly turns into a "many-hundred slower core machine". I think you're comparing apples with pineapples when it's fairer to compare them with oranges.

The motivation for this idea stems from the premise that there will generally be work available for a goodly handful of threads. Not hundreds or thousands of threads, but certainly tens.

If this is the case (and on a system running multiple VMs, it will be, in general) then it seems to me that doing things the way I suggested ought to be no slower than doing them the usual way, and may actually be marginally faster due to reducing the rate of thread context switches.

Context switches would actually be quite interesting to look at. There ought to be no need at all to flush the pipeline on an exception, or have any logic to handle unwinding partially completed instructions on an exception, since exceptions would be handled per virtual processor rather than per pipeline, and the virtual processors are plodding, stepwise, simple-minded beasts.

I also don't have much of an idea how much power is consumed by current chips in the logic that does instruction re-ordering, register renaming and branch prediction, none of which would be needed by the proposed architecture. All of these would be replaced by a 4-bit register per pipeline stage, identifying the VP that the stage is currently servicing.

Thanks for the Piranha paper - I'll take the time to give that a thorough reading soon. In the meantime, it's nice to see that I'm not totally out in left field here: "In comparison, CMP advocates using simpler processor cores at a potential loss in single-thread performance, but compensates in overall throughput by integrating multiple such cores." If multiple cores work, so should multiple virtual in-order processors per core.
posted by flabdablet at 7:23 PM on May 21, 2008


The motivation for this idea stems from the premise that there will generally be work available for a goodly handful of threads. Not hundreds or thousands of threads, but certainly tens.
If this is the case (and on a system running multiple VMs, it will be, in general) then it seems to me that doing things the way I suggested ought to be no slower than doing them the usual way, and may actually be marginally faster due to reducing the rate of thread context switches.


No slower? Are you quite sure that your system will not be slower? You are giving what is essentially a much slower processor to each thread. If the system used to be limited because there were far more threads to run than cores to run them, then yes, you might not be slower. But that is not currently the case for every process, and again, I would say that it is probably not the case for the majority of the processes out there. The performance drop for going to your simple pipeline would be huge; to counter this (and sell the box) you would have to put a shit ton of virtual processors in, which brings you basically to my many-hundred-slow-core box example.
posted by ch1x0r at 7:41 PM on May 21, 2008


Are you quite sure that your system will not be slower?

IF the premise is reasonable - IF there are more non-idle threads than pipeline stages - then the pipeline is going to stay full, and there should be no throughput penalty compared to a similarly powerful set of execution units served by a pipeline kept full by conventional means.

And I'm fairly sure that there's a large class of processing loads where there are, in general, at least a few tens of active threads most of the time. We don't need far more threads to run than processors to run them - we need at least as many threads as there are pipeline stages. Less than twenty per processor. Maybe less than ten. Depends how broken-down the pipeline stages are.
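
Crudely, the throughput story I'm relying on looks like this (the 20-stage figure is just an example):

# Crude utilization estimate under that premise: with S pipeline stages and T
# runnable threads, round-robin issue keeps at most min(T, S) of the S stages
# holding useful work, so throughput is roughly min(T, S) / S of peak.

def utilization(runnable_threads, pipeline_stages):
    return min(runnable_threads, pipeline_stages) / float(pipeline_stages)

for t in (1, 4, 10, 20, 40):
    print("%2d runnable threads, 20 stages: %3.0f%% of peak throughput"
          % (t, 100 * utilization(t, 20)))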

I also can't see how you'd counter a performance drop due to a lack of threads by adding a shit ton of virtual processors. If the processing mix doesn't even keep one real processor's worth of virtual processors busy, how is adding more VP's supposed to help?
posted by flabdablet at 8:28 PM on May 21, 2008


Hooray! It has a name! Thanks for the links, ch1x0r.
posted by flabdablet at 2:18 AM on May 22, 2008


It's even been sold!
posted by flabdablet at 2:21 AM on May 22, 2008


And given away!

It really is a shame that the x86 pig-in-lipstick is so incredibly dominant. Sigh.
posted by flabdablet at 2:30 AM on May 22, 2008


Glad I could help with the links.
I still think you're missing my key point, which is that any given thread will run slower on your system, so you would have to be currently running a system greatly limited by processor availability for this to provide equivalent performance. That's not to say there is no use for such a thing; as your searching proved, it has great potential for real-time systems, where total performance is less important than guaranteed performance.

I also can't see how you'd counter a performance drop due to a lack of threads by adding a shit ton of virtual processors. If the processing mix doesn't even keep one real processor's worth of virtual processors busy, how is adding more VP's supposed to help?

You're not countering a performance drop due to lack of threads by adding a bunch of virtual processors. You're adding those virtual processors to make the case to your customers that while each virtual processor is significantly slower than what they have now, there are so many available they don't have to wait to acquire one, and to get the kind of per-process (not thread, not processor) performance they had on a fast core box they "merely" need to make their processes more multi-threaded.
posted by ch1x0r at 6:19 AM on May 22, 2008


There's no reason I can think of why the multiple threads running on a given pipeline need all belong to the same process, if the MMU design is right.
posted by flabdablet at 6:42 AM on May 22, 2008

