What is Intel doing with all that die?
December 7, 2010 8:57 PM Subscribe
Why do the specs of some non-x86 processors seem so much better than those of the latest Intel chips, even though Intel's manufacturing is more cutting-edge?
Sorry for the extremely nerdy question, but I figured that someone here might be in the know.
Although Intel's processors are supposed to be right at the forefront, and I know a fair amount about them, I sometimes see other companies with specifications that seem so much better that it's hard to believe. The SPARC T3, for example, offers 16 cores, each with 8-way SMT for a total of 128 simultaneous threads, and that's on a 40-nm process. The highest-end versions of Intel's 32-nm Sandy Bridge chips (which lack an on-die GPU), anticipated toward the end of next year, will max out at 8 cores, each with 2-way SMT.
The die sizes I've seen for these two chips are similar, at about 370 mm². Sandy Bridge will have up to about 14 MB more cache than the SPARC T3, but given Intel's 32-nm SRAM cell size of 0.182 μm², it seems that only explains about 20 mm² of the difference. What is Intel doing that takes up so much space, where other companies have found a way to cram in twice as many cores?
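A quick back-of-the-envelope check of that ~20 mm² figure (a sketch counting raw 6T SRAM cells only; tag, ECC, and peripheral circuitry are ignored, and the cell size is the one quoted in the question):

    # Rough area of 14 MB of extra cache, counted as raw SRAM cells only.
    extra_cache_bytes = 14 * 1024 * 1024   # 14 MB more cache on Sandy Bridge
    bits = extra_cache_bytes * 8
    cell_area_um2 = 0.182                  # Intel 32 nm SRAM cell, from the question
    area_mm2 = bits * cell_area_um2 / 1e6  # 1 mm^2 = 1e6 um^2
    print(f"{area_mm2:.1f} mm^2")          # ~21.4 mm^2, matching the ~20 mm^2 estimate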
Mind you, if your problem can use all 128 threads, it's a beautiful thing. Earlier this afternoon I compressed a 16gig core file in 3 minutes of wall time which took 96 minutes of CPU time. Admittedly that was on a t5240 which has a pair of T2+ CPUs to get 128 threads, but that's the architecture the T3 is designed to simplify.
posted by Kyol at 9:17 PM on December 7, 2010
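The throughput math implied by that anecdote (a sketch; the figures are just the wall-clock and CPU times quoted above):

    # Parallel speedup implied by Kyol's compression run.
    cpu_time_min = 96.0   # total CPU time across all threads
    wall_time_min = 3.0   # elapsed wall-clock time
    hw_threads = 128      # 2x UltraSPARC T2+ in the t5240
    speedup = cpu_time_min / wall_time_min  # ~32x: 32 threads' worth of work at once
    utilization = speedup / hw_threads      # ~25% of the machine's thread capacity
    print(f"effective parallelism {speedup:.0f}x, utilization {utilization:.0%}")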
The number of cores is good for multi-threaded applications, but not so much for gaming, word processing, light web browsing, etc. (i.e., the market Intel's processors are mainly designed for). The number of cores definitely isn't the only measure of performance: clock speed and the architecture of your instruction set matter more when applications have a sequential bottleneck.
Beyond that, Intel doesn't have "performance" as their only metric for a processor design. There's also cost, energy efficiency, compatibility with the existing desktop landscape (not a concern for a SPARC), etc.
posted by oblio_one at 9:55 PM on December 7, 2010
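oblio_one's sequential-bottleneck point is essentially Amdahl's law; here is a minimal sketch of it (the 10% serial fraction is an assumed figure, purely for illustration):

    # Amdahl's law: speedup is capped by the serial fraction of the workload.
    def amdahl_speedup(serial_fraction: float, cores: int) -> float:
        # Speedup = 1 / (s + (1 - s) / N); the serial part never gets faster.
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

    for cores in (8, 16, 128):
        print(cores, round(amdahl_speedup(0.10, cores), 1))
    # 8 cores -> 4.7x, 16 -> 6.4x, 128 -> 9.3x; the hard cap is 1/0.10 = 10x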
You're not comparing apples to apples. SPARC is a RISC architecture; x86 is CISC. SPARC is relatively new (circa 2005); x86 has legacy cruft dating back to 1978. SPARC is optimized for applications written for multi-core; x86 is optimized for existing PC software.
I haven't seen any benchmarks comparing the two, but if you're running Windows or Mac OS, it doesn't matter how fast SPARC is, does it?
I'm not sure how you're calculating 20 mm² for the cache size, but on the block diagrams I've seen, the L3 cache alone takes up nearly half the die, and then there's the per-core L2 cache to consider.
posted by zanni at 1:26 AM on December 8, 2010
A lot of the design effort in commodity/desktop processors goes into making them energy efficient, making them work with the problems presented to them, and, mostly, making them cheap. If you can sell a processor for $1,000 (or $10,000), you don't mind if you have to fab up 10 of them to get one good one.
Also, the "big boy" processors have different internal structures that are harder to make work. A while back, anyway, they had TONS more internal registers than desktop chips, which made every clock cycle much more productive. In some cases, they didn't even have to context switch (push all the state of the current process back out to slow memory, load the new process's state in, process for a couple clock cycles, and then swap back again); they just loaded the other process into other registers and pointed the processor in that direction.
I remember a guy I worked with (who used to be a service tech for a mainframe company) explaining that his mainframes' processors might run at 100 MHz (at the time) but have a full-speed 32-bit bus to the memory and hard drives. That was a 20-foot cable lying on the floor. Desktop processors, meanwhile, were 700 MHz, but their internal buses were 16 MHz 16-bit ISA or 33 MHz 32-bit PCI. That matters because the processor has to sit there for whatever its multiplier (bus speed versus CPU speed) is, waiting for information to get pulled into memory.
And those kinds of speeds are REALLY expensive to implement. When you are paying $20,000 for a SPARCstation, or $1,000,000 for an IBM zSeries, you want the best. You'll also be willing to modify your own processes to take advantage of any new technologies in the machine. Intel has to innovate while still maintaining backward compatibility. When you are paying $299 for a laptop, you make some trade-offs. The $299 laptop is a fascinating piece of technology, but not because of pure performance.
SPARC started in 1989.
posted by gjc at 7:19 AM on December 8, 2010
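The multiplier penalty in gjc's anecdote works out like this (a sketch using the clock figures he quotes; it ignores caching entirely):

    # Cycles a CPU burns waiting per bus transaction, using gjc's example numbers.
    cpu_mhz, bus_mhz = 700, 33        # desktop CPU vs. a 33 MHz PCI bus
    stall_cycles = cpu_mhz / bus_mhz  # ~21 CPU cycles idle per bus cycle
    print(f"~{stall_cycles:.0f} CPU cycles per bus cycle")

    mainframe_cpu, mainframe_bus = 100, 100  # full-speed bus: no multiplier penalty
    print(mainframe_cpu / mainframe_bus)     # 1.0 -- the 100 MHz mainframe never waits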
Also, the "big boy" processors have different internal structures that are harder to make work. A while back, anyway, they had TONS more internal registers than desktop chips, which made every clock cycle much more productive. In some cases, they didn't even have to context switch (pull all the information from the current process back into slow memory, load the new process into memory, process for a couple clock cycles, and then back again), they just loaded the other process into other registers and point the processor in that direction.
I remember a guy I worked with (who used to be a service tech for a mainframe company) explain that his mainframes' processors might run at 100mhz (at the time) but have a full-speed 32 bit bus to the memory and hard drives. That was a 20 foot cable laying on the floor. Where desktop processors were 700mhz, but their internal busses were 16mhz, 16 bit ISA or 32 bit 33mhz PCI. That matters because the processor has to sit there for whatever its multiplier (bus speed versus cup speed) is, waiting for information to get pulled into memory.
And those kinds of speeds are REALLY expensive to implement. When you are paying $20,000 for a SPARC station, or $1,000,000 for an IBM z series, you want the best. You'll also be willing to modify your own processes to take advantage of any new technologies in the machine. Intel has to innovate while still maintaining backward compatibility. When you are paying $299 for a laptop, you make some trade-offs. The $299 laptop is a fascinating piece of technology, but not because of pure performance.
SPARC started in 1989.
posted by gjc at 7:19 AM on December 8, 2010
Response by poster: To maybe partially answer my own question: a little more research shows that SPARC does not do superscalar or out-of-order execution, which seems like a huge difference, and would also partially explain why SPARC would need all of that extra SMT to get any kind of in-core parallelism. That link also mentions Kyol's suggestion that some of that complexity reduction was done at the expense of clock speed. This is really interesting.
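That tradeoff, many simple in-order threads versus a few wide out-of-order cores, can be sketched with a toy throughput model (the cycle counts below are assumed, illustrative numbers; real pipelines are far messier):

    # Toy model: how SMT keeps a simple in-order core busy despite memory stalls.
    def utilization(threads: int, compute_cycles: int, miss_cycles: int) -> float:
        """Fraction of cycles doing work if each thread computes, then stalls,
        and the core round-robins to another ready thread on each stall."""
        busy = threads * compute_cycles
        return min(1.0, busy / (compute_cycles + miss_cycles))

    # With 20 compute cycles per 140-cycle memory miss:
    print(f"{utilization(1, 20, 140):.0%}")  # 1 thread:  ~12% busy
    print(f"{utilization(8, 20, 140):.0%}")  # 8 threads: 100% busy -- Niagara's bet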
In response to oblio_one and zanni's suggestion that the ISA has a lot to do with it, I didn't think the instruction decoder was very large in modern processors (either in space or execution time). The way I'm calculating the 20 mm² is just by multiplying the number of bits by the SRAM cell size. Maybe the extra stuff, coherency and associativity and whatnot, makes it substantially bigger than that, I don't know. I haven't seen a Sandy Bridge floorplan. It's hard to see how Intel's L2 cache could be relevant, since it's only 256 KB per core.
posted by Xezlec at 7:28 AM on December 8, 2010
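One way to quantify Xezlec's "extra stuff" hunch (a sketch with assumed numbers: the tag/ECC overhead and array-efficiency figures below are illustrative guesses, not Intel's):

    # Raw-cell estimate vs. one that charges for tags, ECC, and peripheral logic.
    cache_bytes = 14 * 1024 * 1024
    data_bits = cache_bytes * 8
    cell_um2 = 0.182                  # Intel 32 nm SRAM cell (from the question)

    raw_mm2 = data_bits * cell_um2 / 1e6
    tag_ecc_overhead = 1.15           # assume ~15% extra bits for tags and ECC
    array_efficiency = 0.60           # assume cells are ~60% of total array area
    real_mm2 = raw_mm2 * tag_ecc_overhead / array_efficiency

    print(f"raw cells: {raw_mm2:.0f} mm^2, with overheads: {real_mm2:.0f} mm^2")
    # raw cells: 21 mm^2, with overheads: ~41 mm^2 -- bigger, but not the whole gap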
SPARC started hitting the market in the late 80s, not 2005.
posted by jjb at 8:41 AM on December 8, 2010
CISC vs. RISC isn't the answer, largely because they're all RISC now. The execution cores of modern x86 chips operate with fixed length instructions, just like classic RISC cores:
"x86 instructions are decoded into 118-bit micro-operations (micro-ops). The micro-ops are RISC-like; that is, they encode an operation, two sources, and a destination." That's from a description of an x86 processor that Intel first sold in 1995. This design feature is integral to basically all modern x86 chips.
There are arguments to be made about load/store, but they quickly approach "true Scotsman" territory.
posted by NortonDC at 11:09 AM on December 8, 2010
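A toy illustration of the decoding NortonDC quotes above (the instruction and micro-op mnemonics are hypothetical; Intel's actual micro-op encodings are undocumented):

    # Toy decoder: one CISC memory-destination add becomes three RISC-like micro-ops.
    def decode_add_mem_reg(addr: str, reg: str) -> list[str]:
        """x86 'add [addr], reg' -> load / add / store, each encoding an
        operation, its sources, and a destination."""
        return [
            f"load  t0, [{addr}]",   # read the memory operand into a temp
            f"add   t0, t0, {reg}",  # the actual ALU operation
            f"store [{addr}], t0",   # write the result back
        ]

    for uop in decode_add_mem_reg("rbx+8", "eax"):
        print(uop)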
A little old (it's the Niagara/SPARC T1), but this gives an insight into SPARC's CPU philosophy:
Sun's UltraSPARC T1
posted by Kyol at 9:28 PM on January 4, 2011
This thread is closed to new comments.