some follow-on questions about Apple's 'rosetta' technology
January 15, 2006 3:54 AM

I'm curiouser and curiouser about this "Rosetta" technology which will bridge the gorge between here and an all-Intel world for Apple. I understand that it is "binary translation" and not exactly "emulation," but I have some follow-ups...

1) Supposedly, with "binary translation" chunks of code only need to be 'translated' once and can then be run without emulation afterward. How does that work, exactly? Can I power up my MacBook Pro, use all my apps all day, put it to sleep, and still enjoy the benefits of that day's work tomorrow? When I wake up in the morning, will all my "chunks of code" still be translated already? How persistent are those translations and what size "chunks" are we talking about?
2) So is Apple recompiling their OS and all its companion utilities and apps for the Intel platform? Or are things like iMovie and TextEdit and the System Preferences control panel all going to run under "Rosetta?"
3) What kind of historical predecessors exist for this "binary translation" technique, and how much of a noticeable effect might it have on everyday performance? It's great that the processors are "4x" as fast, but if every single thing they're running is bogged down in "binary translation," then it doesn't make much of a difference.
4) How much rigamarole do developers need to go through to release "Universal" editions of their software, which can run natively? Is it likely that they are even welcoming this change, so they don't have to worry about the PowerPC platform anymore?
5) What has Apple lost, processor-wise, in the transition? They spent a lot of time talking up Alti-Vec and the G5. Is the dual-core Intel chip all that?

I am a more or less average computer user and not particularly well-versed in chip technology or software development.
posted by scarabic to Computers & Internet (14 answers total) 1 user marked this as a favorite
 
First, you may find answers to your questions about binary translation here: Transitive is the company from which Apple licensed the technology they call Rosetta.

1) As I understand it, Rosetta stores emulated code on disk somehow, so that the penalty of translating instructions to x86 is only incurred once.

2) Apple has recompiled its entire OS and all companion apps and utilities for x86. Actually, most of it was native for x86 all along via the secret Marklar project at Apple. After the move to Intel was announced, PowerPC apps that were not yet ported (such as iTunes) were moved over.

3) I believe Transitive was the first to come up with the "translation" thing, which is really sophisticated emulation with caching of translated instructions.

4) If a developer has used Apple's Xcode platform to write their application, making a universal binary is often as easy as checking a single checkbox. To the extent that their application has platform-specific code, developers may also have to tweak their app somewhat -- an example of this would be apps that rely on PowerPC AltiVec.

If an application was written using the legacy CodeWarrior environment from Metrowerks (acquired by Motorola, whose semiconductor arm is now Freescale), it needs to be moved to Xcode before it can be compiled as a universal binary. This is much more time-consuming. Microsoft Office, for example, faces this migration problem.

5) Hopefully somebody more processor-savvy than I am will answer this one, though it's useful to remember that Apple has historically been great at talking up the hardware and software of the moment. Off the cuff, they've lost 64-bit for the time being, and have abandoned a rather elegant processor architecture for the morass of x86. Intel's new chips aren't bad, though. They're really an evolution of the Pentium III -- the Pentium M architecture was developed by a small team of engineers in Israel, and Intel has embraced it for all future desktop and mobile processors. The Pentium 4, which was a horrible architecture designed solely to hit ridiculously high clock speeds for marketing reasons, has been formally abandoned.
posted by killdevil at 4:23 AM on January 15, 2006


Oh, and also -- Apple talked up AltiVec a lot, but when you get right down to it Intel's SSE3 instructions are very, very similar. So much so that Apple has published a developer doc that provides step-by-step instructions for rewriting AltiVec code to target SSE3. I believe developers also now have the ability to target an Apple-provided API that provides a level of abstraction above Altivec/SSE3 and makes the appropriate calls on each platform.
posted by killdevil at 4:34 AM on January 15, 2006


1. The impression I get from Rosetta's documentation is that since Rosetta uses a cache to store translated code, you'll only incur the translation delay once for each application so long as A) the application remains open and B) the amount of translated code doesn't exceed the cache size.

2. Apple indicates that all of their applications are now being compiled as universal binaries. OS X, I'd assume, is compiled for the particular platform of the Mac it ships with if you're buying a new machine, or comes in two different versions if you're buying it off the shelf.

3. I don't honestly know.

4. See this document.

5. I don't honestly know.
posted by ubernostrum at 4:37 AM on January 15, 2006


You might want to check Apple's brief notes to developers on Rosetta.

1) Rosetta only caches frequently used translated code, and even then only until you quit the program. It's all done in memory, not disk. You don't lose it when you sleep, though. I doubt you'll have to worry about any of it though.

2) Everything that comes with the computer is Intel native, as far as anyone knows.

3) Java. Several implementations of Java do exactly what Rosetta does, converting Java bytecode to PowerPC or x86.

4) Most programs aren't written specifically for the PowerPC, so most of the time it is as simple as ticking the checkbox that creates an Intel version. The hard part is testing everything to double-check nothing's been broken. The errors caused can be very subtle.

5) The G5 has two dedicated full-precision floating point units. These are why some people wanted to build supercomputers out of them. They aren't that important to normal computing, though. Intel processors only have one, and it's partly responsible for running SSE code (roughly equivalent to AltiVec).

re: AltiVec. Motorola owns the patents on this kind of technology, so the AltiVec implementation was completely textbook and perfect. Intel just can't make anything as good.
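To make the caching idea in (1) concrete, here's a toy Python sketch. The function names and the dict-as-cache are purely illustrative assumptions, not how Rosetta is actually implemented; the point is just that each block of PowerPC code pays the translation cost once, and later executions reuse the cached native version:

```python
# Toy model of a translation cache: translate each block of "PowerPC"
# code once, then reuse the cached native version on every later run.
translation_cache = {}

def translate_block(ppc_block):
    """Stand-in for the expensive PPC -> x86 translation step."""
    return tuple("x86_" + insn for insn in ppc_block)

def run_block(addr, ppc_block):
    if addr not in translation_cache:          # pay the cost only on first hit
        translation_cache[addr] = translate_block(ppc_block)
    return translation_cache[addr]             # cheap lookup afterwards
```

Quitting the program corresponds to throwing `translation_cache` away, which is why the penalty comes back the next time you launch.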
posted by cillit bang at 4:42 AM on January 15, 2006


The other reason AltiVec is supposedly so great is that it has an unusually clean and simple programming interface. I think that's its big draw. For the end user there probably isn't much difference from SSE3.
posted by joegester at 6:47 AM on January 15, 2006


The comparison with Java is not fair. Java bytecode is at a higher level of abstraction than machine code, typically. So you go from one Java "instruction" to a fair number of machine instructions. In contrast, Rosetta needs to go from one (sometimes more than one, I would suspect) machine instruction to another machine instruction (again, sometimes more than one). So it's not as clean or efficient (although you could imagine a Rosetta-like technology that handled many different platforms using something like Java bytecode as an intermediate step to reduce the problem from n^2 to n).

I must admit that the difference may not be that great and I don't have a good knowledge of bytecode (or Rosetta) to back this up...
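The n^2-to-n point can be made concrete with a little arithmetic: translating directly between every ordered pair of n architectures needs n*(n-1) translators, while routing through one shared intermediate representation needs only a front end and a back end per architecture, i.e. 2n. A quick illustrative sketch:

```python
# With n architectures, direct pairwise translation needs n*(n-1)
# translators; a shared intermediate form needs only 2*n
# (one front end plus one back end per architecture).
def translators_needed(n, via_intermediate):
    return 2 * n if via_intermediate else n * (n - 1)

direct = translators_needed(5, False)   # 20 translators for 5 architectures
shared = translators_needed(5, True)    # only 10 with an intermediate step
```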
posted by andrew cooke at 7:23 AM on January 15, 2006


The G5 has two dedicated full-precision floating point units

More importantly, the G5 has a bus/memory interface that doesn't suck. The Power4 and Power5 spend more than half the die on bus interface and cache control, because IBM recognized that speeding up the clock without speeding up the bus meant that your fast processor would be running lots of no-ops waiting for data. Apple took this lesson to heart.

The G5 doesn't have a bus that matches the Power4, but the HyperTransport bus it does have is a very fast one compared to other desktop CPUs.

The Pentium 4, for example, stands as exactly how not to do it -- the processor is constantly stalling on memory, even with enormously large on-die caches.

As to the question...

Apple's good at this trick, having done it once already when they migrated from the 68000 series CPUs to the PowerPC. A big part of the trick is the on-the-fly cached instruction translation. Another part is, quite simply, that the dual-core Pentium M derivative they're using is just a fast processor, so that even when the OS has to translate instructions step by step, you can get reasonable performance -- esp. when you compare it to the slower G4 PowerPC that used to be in the platform. Remember, the whole reason for this switch is that nobody could get a G5 processor into a notebook.
posted by eriko at 7:37 AM on January 15, 2006


DEC Alpha workstations included binary translators when they ran NT for Alpha. That way they had access to x86 applications as well. Machine code does relatively simple things: move memory from one place to another, memory to a register, a register to memory, change the program counter on some condition, etc. The translator's job is to watch execution and determine when it makes sense to build up blocks of native code.

Apple did it in the past as well, when they went from the 68XXX family to the first incarnation of the PowerPC. They didn't cache stuff, however. The PowerPC was fast enough compared to the 68XXX that within a year or so the fastest Macintosh at running 68XXX code was a PowerPC machine.
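The "watch execution and compile the hot blocks" strategy those DEC translators used can be sketched in a few lines of Python. The threshold, names, and addr-keyed dicts here are all made up for illustration; real translators tune this heavily:

```python
# Toy hot-spot detector: interpret a code block until it proves hot,
# then pay the one-time cost of building a native version of it.
HOT_THRESHOLD = 10  # illustrative cutoff; real systems tune this

hit_counts = {}
compiled = {}

def execute(addr, interpret, compile_native):
    hit_counts[addr] = hit_counts.get(addr, 0) + 1
    if addr in compiled:
        return compiled[addr]()                 # fast native path
    if hit_counts[addr] >= HOT_THRESHOLD:
        compiled[addr] = compile_native(addr)   # build native code once
        return compiled[addr]()
    return interpret(addr)                      # slow path while still cold
```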
posted by substrate at 7:43 AM on January 15, 2006


How does this work? Here's how I'd do it (and I've written code like this in the past):
As you load PPC code, you cross-compile it to the target architecture. Effectively, you treat PPC object code as source code. Generate your target code in a manipulable format and start applying peephole optimizations, being careful not to damage entry points or branch targets.

Now the code can execute, but if you save it in a cache, you never have to do the first step again.

Things that make this easy: most processors do pretty much the same things: move memory from point a to point b, compare values, branch, manipulate the stack, etc. For most high-level, compiler-generated code, this is going to be very straightforward, since compilers tend to generate pretty simple code.

Things that make this hard: there are always some edge conditions that are really hard to deal with no matter what. For example, the 6809 processor has a "half carry" bit for making BCD easier, and it turns out to be a real pain to emulate even though it is only rarely used. People tend to write spaghetti assembly code to save instructions which means jumping into the middle of one routine and dicking with the stack to return somewhere else. If you have to assume that all code is written like this, it makes it harder to optimize.

Other things that make this task a "joy": the PPC architecture is rich with registers, but the x86 is register poor.
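A minimal sketch of the peephole pass described above, in Python over a toy instruction list. The push/pop pattern and register names are invented for illustration; the important detail is that it refuses to fold a pattern across a branch target, for exactly the spaghetti-code reason mentioned:

```python
# Toy peephole optimizer: scan a small window over the instruction
# stream and delete wasteful patterns, but never remove an instruction
# that is the target of a branch (someone might jump into the middle).
def peephole(code, branch_targets):
    out, i = [], 0
    while i < len(code):
        a = code[i]
        b = code[i + 1] if i + 1 < len(code) else None
        # pattern: a push immediately undone by a pop of the same register
        if (b is not None
                and a == ("push", "r1") and b == ("pop", "r1")
                and (i + 1) not in branch_targets):
            i += 2          # drop both instructions
            continue
        out.append(a)
        i += 1
    return out

code = [("push", "r1"), ("pop", "r1"), ("add", "r2")]
optimized = peephole(code, branch_targets=set())  # [("add", "r2")]
```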
posted by plinth at 8:31 AM on January 15, 2006


Sounds like it works the same way as Java bytecode translation, so it's not that complicated. Transmeta also has products that do this with Intel code, so it's not exactly an untested technology.

OTOH, Java bytecode was designed to be simple, while the PPC architecture was not, but that just makes it more work.
posted by delmoi at 8:42 AM on January 15, 2006


The comparison with Java is not fair. Java bytecode is at a higher level of abstraction than machine code, typically. So you go from one Java "instruction" to fair number of machine instructions.

There's some goofy stuff (like loading a class or throwing an exception, IIRC), but it's mostly pretty basic assembly. FWIW, on Intel you can throw an exception in one instruction (int), push and pop from the stack (push and pop), and even call a function (call) in assembly if you want. I don't think most compilers or OSes use those features.

I haven't paid attention to this low level hardware stuff for a long time. I didn't even know there was an SSE3 :P
posted by delmoi at 8:49 AM on January 15, 2006


The Pentium 4, for example, stands as exactly how not to do it -- the processor is constantly stalling on memory, even with enormously large on-die caches.

well... all processors are constantly stalling on memory accesses, even with large caches. memory latencies have not really improved that much over the last 10 years, but of course memory bandwidths have increased significantly.

the Pentium-M is actually a modified Pentium-III core attached to a pentium-4 bus. the P4 bus is a split-transaction bus unlike the P3 bus. if it were strictly true that there was nothing to gain from putting a faster/better frontside bus on the chip, the Pentium-M would not be any faster than the P3 at a given clock rate.

the problem with the P4 is that the pipeline depth is just crazy, and mispredicted branches or anything else that causes a pipeline stall/fill costs enormously. the P3 pipeline isn't nearly as deep, and this is probably the #1 contributor to its superior performance at a given clock rate.

the idea with the P4 pipeline was that you can clock it super fast compared to the P3 pipeline, but in practice the underlying silicon technology kind of erases this advantage... you can clock the Pentium-M fast enough in every situation that it outperforms most P4s.
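the mispredict-cost argument can be put in rough numbers with a toy CPI model. all figures below (branch frequency, mispredict rate, pipeline depths) are illustrative assumptions, not measurements; the point is only that the flush penalty scales with pipeline depth, so a deep pipeline pays more per mispredicted branch:

```python
# Toy CPI model: a mispredicted branch flushes the pipeline, and the
# refill cost roughly scales with pipeline depth.
def effective_cpi(base_cpi, branch_freq, mispredict_rate, pipeline_depth):
    """Average cycles per instruction including mispredict flush penalty."""
    penalty = pipeline_depth  # flushed stages must refill after a mispredict
    return base_cpi + branch_freq * mispredict_rate * penalty

# Hypothetical workload: ~20% branches, 5% of them mispredicted.
shallow = effective_cpi(1.0, 0.20, 0.05, 10)  # P3-like shallow pipeline
deep = effective_cpi(1.0, 0.20, 0.05, 31)     # P4-like very deep pipeline
```

with these made-up numbers the deep pipeline loses about 0.2 cycles per instruction to mispredicts alone, which is the kind of tax that erases a clock-speed advantage.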
posted by joeblough at 10:44 AM on January 15, 2006


Fat binaries are a different approach to this problem from virtual machines, and are how Apple handled the PowerPC transition.

Graphics will, as I understand it, be the big bottleneck. Judging by initial benchmarks, OS X on Intel beats PowerPC on everything else. But the whole big-endian/little-endian problem seems like it's now come full circle. One of the biggest hurdles to porting DirectX-based engines has come in translating that bit of low-level code, and now we have to deal with the performance penalty of translating it back to where it came from, in real time. Oy gevalt!
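For anyone who hasn't hit the endianness problem directly: the same 32-bit value is laid out in memory in opposite byte orders on big-endian PowerPC and little-endian x86, and that swap is exactly what the low-level porting work is about. A quick Python illustration using the standard struct module:

```python
import struct

value = 0x12345678
big = struct.pack(">I", value)     # byte layout on PowerPC (big-endian)
little = struct.pack("<I", value)  # byte layout on x86 (little-endian)

# The two layouts are exact byte reversals of each other...
assert big == little[::-1]
# ...so reading big-endian bytes as little-endian yields a different number.
assert struct.unpack("<I", big)[0] == 0x78563412
```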
posted by mkultra at 4:41 PM on January 15, 2006


Response by poster: http://www.macworld.com/2006/01/features/imaclabtest1/index.php

Macworld says that Intel-native versions of apps tend to run 10-15% faster on the Intel iMac than a G5.

But apps running in Rosetta run about 50% slower :(

Great answers, all. Thank you~!
posted by scarabic at 1:37 PM on January 19, 2006


This thread is closed to new comments.