Everyone in the data storage industry knows about the gap between I/Os per second of disk drives and processor I/O requirements. But there is a similar problem facing DRAM support of many-core chips.

William Wulf and Sally McKee named the problem “the memory wall” in their 1994 paper Hitting the Memory Wall: Implications of the Obvious (pdf), where they described it this way:

We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed – each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one.
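A toy calculation makes the “difference between diverging exponentials” concrete. The growth rates below are illustrative assumptions for the sketch, not figures from the paper:

```python
# Toy illustration of the Wulf-McKee argument: if both processor and
# DRAM speeds improve exponentially but at different rates, the gap
# between them is itself an exponential. Rates are assumed, not measured.
cpu_rate = 1.50   # assumed 50% per-year processor improvement
dram_rate = 1.07  # assumed 7% per-year DRAM improvement

for year in range(0, 21, 5):
    cpu = cpu_rate ** year
    dram = dram_rate ** year
    print(f"year {year:2d}: cpu x{cpu:8.1f}, dram x{dram:6.1f}, gap x{cpu / dram:7.1f}")
```

Whatever the exact rates, the shape is the same: the gap at year 20 is the square of the gap at year 10 – the disparity itself compounds.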

According to an article in IEEE Spectrum, that time is almost upon us. Sandia National Laboratories simulations predict that once there are more than 8 on-chip cores, conventional memory architectures will slow application performance.

Update 1: here’s the graph from Sandia. It took me quite a while to figure out what it was saying – thanks, commenter! – so I didn’t publish it in the original post. As I said on ZDnet this morning:

Performance roughly doubles from 2 cores to 4 (yay!), near flat to 8 (boo!) and then falls (hiss!).

Many-cores fall over the performance cliff.

End update 1.

James Peery of Sandia’s computation, computers, information and mathematics research group is quoted as saying “after about 8 cores, there is no improvement. At 16 cores, it looks like 2.” The memory wall’s impact is greatest on so-called informatics applications, where massive amounts of data must be processed – such as sifting through data to determine whether a nuclear proliferation failure has occurred.
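A deliberately crude model (my sketch, not Sandia’s simulation) shows why speedup can fall, rather than merely flatten, once cores outnumber the available memory bandwidth: the extra cores contribute no useful bandwidth but do add queuing contention that slows every core. The bandwidth limit and overhead constant here are made up to mimic the reported shape:

```python
def modeled_speedup(cores, mem_bw=8.0, overhead=0.375):
    # Toy contention model (an assumption, not Sandia's actual simulation).
    # Each core wants one unit of memory bandwidth; the chip supplies
    # mem_bw units. Cores beyond the limit add no useful bandwidth but
    # do add contention overhead that slows all cores down.
    served = min(cores, mem_bw)
    contention = 1.0 + overhead * max(0.0, cores - mem_bw)
    return served / contention

for n in (2, 4, 8, 16):
    print(f"{n:2d} cores -> speedup {modeled_speedup(n):.1f}")
```

With these invented parameters, speedup is linear up to 8 cores, then 16 cores deliver roughly the throughput of 2 – the qualitative cliff Peery describes.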

John von Neumann anticipated the point in his First Draft of a Report on the EDVAC (pdf):

This result deserves to be noted. It shows in a most striking way where the real difficulty, the main bottleneck, of an automatic very high speed computing device lies: At the memory.

Gee, bandwidth is important. I thought it was all IOPS.

Help is on the way
The Spectrum article notes that Sandia is investigating stacked memory architectures, popular in cell phones for space reasons, to get more memory bandwidth. Professor McKee has also worked on the Impulse project to build a smarter memory controller for

. . . critical commercial and military applications such as database management, data mining, image processing, sparse matrix operations, simulations, and streams-oriented multimedia applications.

Update 2: Turns out Rambus has a 1 TB/sec initiative underway. Goals include:

  • 1 TB/s memory bandwidth to a single system on a chip
  • Suitable for low-cost, high-volume manufacturing
  • Works for gaming, graphics and multi-core apps

The Terabyte Bandwidth Initiative is a research program, not a product announcement. Rambus plans to roll out some of the technologies in 2010 with next-gen memory specs. Courtesy of Rambus is this slide describing some of the issues:

Alas, it doesn’t look like Intel’s late-to-the-party on-board memory controller and Quick Path Interconnect in Nehalem will help us get ahead of the problem. And with multi-core, multi-CPU system designs, how do you keep the system from looking like a NUMA architecture?
End update 2.

The StorageMojo take
Given Intel’s need to create a market for many-core chips, expect significant investment in this engineering problem. It isn’t clear to what extent this affects consumer apps, so solutions that piggyback on existing consumer technologies – like stacked memory from cell phones – will be the economic way to slide this into consumer products like high-end game machines.

Expect more turbulence at the peak of the storage pyramid, which will further encourage the rethinking of storage architectures. That is a good thing for everyone in the industry.

Courteous comments welcome, of course. If anyone wants to make the case that von Neumann was wrong, I’m all ears.