CPU performance and clock speed have leveled out over the last several years. What does this mean for the industry?
Moore’s law
Strictly speaking, Moore’s law says that the number of transistors on a chip will double every 18 to 24 months. And that’s been true for the last 40 years. And it appears set to continue for another decade.
But Moore’s observation has been simplified to mean a doubling of performance every 18 to 24 months. And that too has been true. But not anymore.
Transistors and performance do not have a one-to-one relationship. Yes, clock speeds have improved from the 1 MHz 6502 processor in the original Apple II to over 3 GHz in the latest and greatest. But we’ve reached the end of the line in clock speed improvements: in a third of a nanosecond light moves about 4 inches or 10 cm. Not much distance when chips have miles of internal wiring.
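If you want to check the arithmetic, a couple of lines of Python will do it (and remember, signals in real wires move slower than light in a vacuum, so it's actually worse):

```python
# Sanity check: how far does light travel in one cycle of a 3 GHz clock?
c = 3.0e8                # speed of light in a vacuum, meters per second
clock_hz = 3.0e9         # 3 GHz clock
period_s = 1 / clock_hz  # one cycle: about a third of a nanosecond

distance_cm = c * period_s * 100
print(f"One cycle: {period_s * 1e9:.2f} ns, light travels {distance_cm:.0f} cm")
# Prints: One cycle: 0.33 ns, light travels 10 cm
```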
But clock speed isn’t the whole story. Chips now move data in 64- and 128-bit chunks, rather than the 6502’s 8-bit chunks. While there are experiments with Very Long Instruction Word (VLIW) architectures, as a practical matter we’re at the end of the line for wider data paths as well: 256 bits is as wide as personal and commercial processors can reasonably use.
More RAM? We’ve also been adding ever-larger on-chip caches that improve performance. But the marginal gain in cache-hit ratio shrinks as caches grow, and so do the performance benefits.
Multicore
We can’t make processors go faster. We can’t process more data per clock cycle. So how do we put twice as many transistors to work?
By stuffing more processors on a chip. And right now many of the brightest minds in computer science are struggling with the problem of getting usable work out of 8-, 12- or 16-core CPUs.
Dual and quad core processors work pretty well because our multitasking operating systems run a lot of background threads. Spreading those threads across multiple cores improves performance for everyone.
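To see why, here is a toy sketch in Python: the task below is a made-up stand-in for an OS background thread, but the effect is the same, since independent work spreads across cores with no cleverness required.

```python
# Toy demonstration: independent tasks spread across cores with no
# special effort, which is why even 2-4 cores help a multitasking OS.
# background_task is a hypothetical stand-in for real background work.
import time
from multiprocessing import Pool

def background_task(n: int) -> int:
    total = 0
    for i in range(5_000_000):   # compute-bound busywork
        total += i % (n + 1)
    return total

if __name__ == "__main__":
    start = time.time()
    with Pool(processes=4) as pool:   # one worker per core
        pool.map(background_task, range(8))
    print(f"8 tasks, 4 cores: {time.time() - start:.1f}s")
    # Run again with processes=1: the same work takes roughly 4x as long.
```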
But outside video, image, voice and scientific apps, most of what we do today – word processing, e-mail, web surfing, spreadsheets and presentations – doesn’t need multicore architectures. Certainly not 8 or more cores. Humans aren’t good multitaskers.
The wall
We’ve hit a technology wall. We can still double the number of transistors on a chip every couple of years. We can still double disk drive capacity every 2 to 3 years. We can build faster interconnects, such as QuickPath, Light Peak and 10 Gb Ethernet.
But the easy wins are over. Going forward, performance gains will be measured in single-digit percentages each year.
Implications
Information technology, like most of the US economy, is driven by consumer spending. So what happens when a new PC is only 20% faster than your fully-paid-for, three-year-old PC?
Digital Equipment Corporation, which pioneered the minicomputer in the 1960s, had a simple model for product improvements. A successful product would add functionality and performance at a constant cost. And DEC would offer the same functionality and performance at a declining cost. Here’s a graph of their model:
At some point the cost of producing a given level of functionality would be so low that distribution and marketing costs would dominate. Then volumes would migrate to the price/performance sweet spot and lower-volume products would die.
Today, we can no longer count on performance increases to open up new application territory. Therefore we will see differentiation move to what were once considered secondary characteristics.
- Power. The server space is just now grappling with power efficiency, but the mobile space has been pushing this metric for the last 15 years. That will continue for years to come.
- Integration. Open up an iPad or a MacBook Air and what do you see? A tiny PC board, a few chips and a huge set of batteries. Long battery life is what makes these products so convenient that they become part of everyday life.
- Functionality. Creatively integrating multiple applications, each with its own dedicated core, may enable consumer devices to collapse multistep workflows. Combine image capture, voice recognition, editing and compression, and consumers could capture, edit and post video from a single candy-bar-sized device, editing on the fly with spoken commands.
- Cost. The first low-res digital cameras cost hundreds of dollars, but today we build them into cheap cell phones.
The StorageMojo take
The days of Moore’s Law-driven application growth are over. The next step is to use our still-growing technical capabilities to refine what we already do.
The good news for the storage industry is that new data production will continue to grow rapidly. Always on, always available consumer data systems will create ever more demand for storage.
This is also another nail in the coffin of the RAID controller paradigm. Distributed multicore processing power requires distributed data protection and storage architectures.
When you can’t scale up, you have to scale out. Decomposable storage architectures will inevitably come to the fore.
Courteous comments welcome, of course. Oddly enough, the Apple ][ motherboard’s style was the same as today’s MacBook Air: a few chips on a PC board. Friends were always startled to see how empty my Apple ][’s case was.
One way to speed up processors and ASICs is to eliminate the clocks by using self-timed logic design.
The shortest clock period allowed in a design is determined by the worst-case path delay in the entire design. Worst case meaning the longest logic path delay, at worst-case temperature, worst-case process variation, etc.
The worst-case logic path delay, for example, could be the longest possible delay through a 32-bit adder, which is typically defined by the time it takes the carry input of bit 0 to ripple all the way through to the carry out of bit 31. Over the last 50 years there have been many, many adder designs that reduce this delay to only a portion of the full ripple carry. You know the ones: carry-lookahead, carry-skip, etc. The carry-lookahead designs are especially wasteful in terms of area and power.
In actuality the typical carry ripple in an add operation is more like 5 to 8 bits (I forget the exact number), not the entire 32-bit carry chain. But because we are using clocks, we must design for the absolute worst-case ripple. So in most add operations the vast majority of the circuitry in a carry-lookahead design is just burning power, contributing little, lying in reserve for that one worst-case add every so often.
A self-timed circuit, on the other hand, uses special ternary logic circuits that detect when each bit of the add is complete, so when the 5-to-8-bit ripple finishes, the operation is flagged as complete and the output is fed to the next stage of logic. Self-timed circuits also have the advantage that under worst-case or best-case conditions the circuit still functions; it just runs slower or faster. Evaluating worst- or best-case performance then becomes a statistical exercise of “is it fast enough?” rather than “will it function at all?”
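That typical-case claim is easy to check by simulation. Here is a minimal Python sketch (function name and trial count are mine) that measures how far carries actually ripple in random 32-bit adds, the statistic that would set a self-timed adder’s completion time:

```python
# Measure how far carries actually ripple in random 32-bit adds. In a
# self-timed adder, completion is signaled when the longest carry chain
# settles, so this statistic -- not the 32-bit worst case -- sets the speed.
import random

def longest_carry_chain(a: int, b: int, bits: int = 32) -> int:
    carry, run, longest = 0, 0, 0
    for i in range(bits):
        x, y = (a >> i) & 1, (b >> i) & 1
        carry = (x & y) | (x & carry) | (y & carry)  # ripple-carry rule
        run = run + 1 if carry else 0  # consecutive bits with a live carry
        longest = max(longest, run)
    return longest

random.seed(1)
trials = [longest_carry_chain(random.getrandbits(32), random.getrandbits(32))
          for _ in range(100_000)]
print(f"average longest carry chain: {sum(trials) / len(trials):.1f} bits")
# Averages out around 6-7 bits, right in the 5-to-8-bit range above and
# far short of the 32-bit worst case a clocked design must budget for.
```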
The big stumbling blocks to self-timed logic are the designers themselves, the EDA vendors and the silicon vendors. They’ve been doing things roughly the same way for 40 years. If anything, the options for silicon designers have been reduced over the years to just static CMOS logic. Why change now?
Perhaps now we’ll see a bit of improvement in how horribly inefficiently most programs make use of those GHz. When faced with the choice to optimize their single-core program, or step into the multi-core world, I see a significant number of folks going with the former.
On the semi-flipside – a large reason for that inefficiency is memory access latency (and improper understanding of it). I see memory bus speeds continuing to follow Moore’s law for a while, so effective program speed will still increase without major software changes. Likewise for storage performance via flash etc.
I’m surprised you didn’t get the storage take: more transistors per chip also means denser RAM/flash/etc, no? At what point do we switch from having fast on-chip cache to just having fast on-chip RAM? Also, I think we’re going to see more peripherals integrated onto the main socket, first of course in the embedded space where this is already happening with ‘system-on-a-chip’ packages, but again, at some point you throw a minor GPU on the silicon just because it’s that cheap.
PJ, while you are correct, it is in Intel’s interest to position themselves as a CPU vendor instead of a smart DRAM vendor. Samsung, OTOH, may find that very interesting.
Actually, processors have bottlenecks at their terminals: the major peripherals and memories can’t run at the processor’s speed.
The major innovations will be memories and peripherals that fill this gap!
A user may simply be reading a spreadsheet or a doc, but the computer is also playing music transferred over the internet to the media player, while the antivirus is still scanning the drive and verifying the music itself; in parallel, the printer is printing your job, someone is calling on your personal network, etc… Yes, people are multitaskers too!! The limit is your computer (not only the processor).
I think the implication of multicore for storage is going to be the need for true random-access storage. Multicore processors are going to be the catalyst that drives solid state storage. Many tasks running across many, many cores. Intel has already demonstrated 80 cores on a processor. 80 applications with 80 sets of I/O requirements. Only silicon storage can deal with that.
Dave is correct: Many cores will randomize I/O to an extent that only silicon storage will suffice. Spinning disks will fade into the background like tape before it – but never quite go away.
I agree with Dave. We are moving to another wave in IT. It is now time to remove the storage bottleneck, and only SSDs or similar technologies can do it. With nanosecond delays in RAM versus several milliseconds on disk, the gap is too wide.
The next generation of computers will treat SSDs like RAM, not like disk. There’s too much overhead in the legacy SCSI stack.
With a new kind of memory the CPU will be able to deal with data much faster than it does today. To achieve this we need to add new instructions to x86 or other architectures to support this new memory schema. I know some manufacturers are working on this…
Idle CPUs make VMware and other virtual server software happy. Once we solve the storage latency bottleneck, application vendors will have to clean up their software: lots of buffering and caching exist only to hide storage latency. That will take another 5 to 8 years after SSDs become part of the memory stack instead of just a disk architecture.
Kind of surprised you didn’t mention the biggest beneficiary of multicore in IT – virtualization. The ability to have dozens, and soon hundreds, of cores in a system is a massive boon for consolidation. I wrote an article not too long ago, “Testing the limits of virtualization”, which talks about this to some extent. The more cores you have, the more flexibility the hypervisor has in scheduling tasks. Parallelizing code to take advantage of multicore is a difficult task to be sure, though not as difficult as “shared nothing scale out” designs; those are far more complicated.
Clock for clock, CPUs have exploded in performance as well; using AMD as an example:
http://www.techopsguys.com/wp-content/uploads/2010/03/Over_Time_4P.jpg
As for multicore CPUs driving SSD adoption, I don’t really see that as a big driver myself. There are tons of apps out there that drive a LOT of CPU and minimal amounts of I/O. I’ve run several production VM systems with dozens of applications in dozens of virtual machines where the I/O is no more than 150 IOPS on average and maybe 500-1000 at peak (very rare peaks).
Conversely you can saturate hundreds of physical spindles with just a few CPU cores depending on the application and load.
Well said, Robin.
I have noticed that Intel’s 32nm process is not that mature. For example, the E5520 -> E5620 transition from 45nm to 32nm only gained us 133 MHz at the same TDP level, and at an even higher defect density (the E5620 has 33% of its cores disabled). The best TDP chips are still 45nm parts like the SU9600 (10W) used in the newest MacBook Air.
So client-side performance will be stuck at single-digit percentage growth per year, while Intel and AMD won’t let prices shrink too much either, because that reduces ASPs. So the only game in town is to use the transistor count to improve power utilization. Unfortunately, power efficiency isn’t x86’s game. I believe ARM will in the end take over as the preferred architecture for servers and clients (it already dominates the mobile world for a reason). The x86 architecture will have to spend a bigger portion of its transistor count on power optimization to get consumption down to cell-phone levels, and the game will surely get interesting.
Moore’s law is dead at the 15nm node. Unless Intel can pull a giant rabbit out of their rear end.
I think you’re correct about hitting a wall and agree that computers as a whole have maxed out and we’ve gotten all the performance we can out of the systems we currently have available.
We are at the edge of the cliff: either we will have minor upgrades to computers for the next 5 years, or a revolutionary change in computers as a whole. Sadly, I think the first is the case. I have not seen any revolutionary new ideas out there that would warrant companies spending limited resources to upgrade to new systems. (Going from, say, a quad-core to a six-core server just to get two more cores is not a good return on investment.)
Why should I spend my money to upgrade my 1-year-old or newer computer to something that is only going to give me less than a 10% increase in performance, if that? Or even a 2-year-old computer…
This is just my scattered thoughts on the whole mess.. 🙂
I found myself agreeing with “myself” just there 🙂
Sorry Other John, for my joking, but you’re on the money.
This was actually a chat I had with my elderly mom just now, this hour (oh, how the elderly suffer, but she knows enough that this kind of thing matters genetically): I do not believe the computer revolution has come yet. We need a generation who grow up with what were once multi-million-buck tools at their disposal, and for that to work we need an educative revolution, by which I mean making some sense of the amount of raw information out there.
So, Storage Guys, Fix This Already!
It’s an artificial bottleneck, and because fixing commercial hopelessness isn’t “neat” (there’s no Fields Medal or Nobel to be won), it just sucks more.
In the interim, why do I never have enough DIMM slots on any board? RAIDed RAM, Chipkill: this is ’90s tech and older technique. I can afford a TB of RAM, even really good RAM; the case is screamingly obvious, but the entry ticket is punitive. So we get “cloud” instead. If you’ve not seen it, I agree with Larry’s wonderful rant against “cloud”, which is on YouTube.
Yup, here you go: http://www.youtube.com/watch?v=8UYa6gQC14o
Let’s stop reinventing the ’70s for nostalgia’s sake, or because today’s designers were denied big compute when they were kids.
– john k
TS,
“Unless Intel can pull a giant rabbit out of their rear end.”
Funny thing is, I think they did just that, not a blink ago:
http://www.realworldtech.com/page.cfm?ArticleID=RWT012707024759
Nate,
I read your whole blog. Very cool.
But some guys are raising issues with network power:
http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_POA20101026_External.pdf
There is no reason why app developers cannot make their things work nicely on a big box, save laziness. A provocative statement for sure, but IT in general is healthily guilty of reinventing old skills in a new field.
best,
Oh dear, I shan’t initialize “Other John”! 🙂
– j
Robin,
it’s a funny thing also, as Intel is still living Grove’s move away from RAM.
They’ve got to sponsor eradicating the bottlenecks to their core tech.
At least, I think that’s how their SSD venture began.
Only, the tail might wag the dog.
Second time!
– jk
Terry,
I’m several nines certain I’m wrong here, but wasn’t one problem with clockless designs line voltage drops?
IOW you’ve got more going on than quantum jumps at even two-digit-nanometer scale … you’ve got routing problems.
I think you hit on that hard with your EDA comments, but it would rock if you had more to say.
– jk