The top storage challenges of the next decade

by Robin Harris on Wednesday, 6 July, 2016

StorageMojo recently celebrated its 10th anniversary, which got me thinking about the next decade.

Think of all the changes we’ve seen in the last 10 years:

  • Cloud storage and computing that put a price on IT’s head.
  • Scale-out object storage.
  • Flash: millions of IOPS in a few RU.
  • Deduplication.
  • 1,000 year optical discs.

There’s more, like new file systems, advanced erasure coding, data analytics, and remote storage management. All great stuff, making storage more reliable, robust, and easier to manage.

But hey, that was then. This is now.

Don’t worry: the next decade is shaping up to be even more exciting and disruptive than the last. OK, some of you should worry.

Grand challenges
For the next decade the storage industry has a new set of challenges. With the flood of data, especially video and IoT, we’ll need more capacity, at lower cost, using fewer human cycles than ever before.

That implies a number of new market opportunities for storage entrepreneurs. And more emerging storage technologies!

What are these grand challenges? Here’s my list in no particular order:

  • Data-centric infrastructure. Hyper-converged is a good start, but not the end-game.
  • Eliminate backup. Finally.
  • Fast object storage. Make scale-out advanced erasure codes fast and efficient enough to enable object stores to displace file servers.
  • Autonomous storage. Storage with enough AI to manage itself, including deleting data.
  • NVRAM optimized CPUs, I/O stacks and storage systems.
  • Much lower I/O latencies.
  • High density, low access time archives. Even more active than today’s “active” archives.
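Robin’s list only names the fast-object-storage challenge. As a toy illustration of the erasure-coding idea behind it, here is a minimal Python sketch using a single XOR parity shard – real object stores use Reed-Solomon or local-reconstruction codes, and all function names here are hypothetical:

```python
# Toy sketch of the erasure-coding idea behind scale-out object stores:
# split an object into k data shards plus parity so that a lost shard
# can be rebuilt from the survivors. Real systems use Reed-Solomon or
# local-reconstruction codes; this uses a single XOR parity shard.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list:
    """Split data into k equal-length shards plus one XOR parity shard."""
    shard_len = -(-len(data) // k)               # ceiling division
    padded = data.ljust(shard_len * k, b"\x00")  # pad to a multiple of k
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return shards + [parity]

def recover(shards, lost: int) -> bytes:
    """Rebuild the shard at index `lost` as the XOR of all survivors."""
    rebuilt = None
    for i, s in enumerate(shards):
        if i == lost or s is None:
            continue
        rebuilt = s if rebuilt is None else xor_bytes(rebuilt, s)
    return rebuilt
```

Losing any one of the k+1 shards – data or parity – is recoverable this way; tolerating several simultaneous losses efficiently is what the “advanced” codes in the list above add.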

The StorageMojo take
I expect to write about each of these in the coming years. But the fundamental driver is that we do IT for the information, not the infrastructure.

Now that the rate of performance improvement is slowing – especially in CPUs, but also in networks and storage – we are forced to focus on important second-order gains: reducing costs, tighter integration, and greater flexibility.

Yes, there are breakthrough technologies ahead. But the future will be won by smarter architectures, not brute force, solving the big challenges of future storage.

Courteous comments welcome, of course.


John Other July 7, 2016 at 1:36 am

Reminds me of the 3M computer – a megabyte of memory, a megapixel display, a MIPS of processing.

Here’s a go at selecting some magnitudes that may apply:

MegaIOPS is where it’s at. MegaIOPS will radically transform what we understand of whatever we consider primary “desktop” computing. Also programming and language models. Hence applications, I hazard. Of course, generally, we are looking for a spec that redefines capabilities and drives new uses.

Most shops don’t presently need petabytes; they’re someplace between terabytes and petabytes. Okay, petabyte, as it defines the scale that will come to be considered normal soon enough.

Milli, as in millisecond latency, complete stack.

Mega, for a million client endpoints per array, as an entry level.

Terabit/s networking is within current range, and could be pushed down the product lists.

Hecto-core is plausible in an IoT widget. For very basic “arrays”… so kilo-core? Nope, we have kilo-core servers now, so that has to go up an order. Mega-core.

Deca? Double-digit parity (or equivalent) redundancy? Some distributed schemes working at last-mile latency might effectively be much more redundant. Mega-parity? Think of IoT processors as disposable in every way, including cost, with local store as essential as cache is for a processor core. So I think mega-parity will be at least a reference idea in marketing. After all, if I drive to hundreds of destinations in a year, and over low-power radio networks access to authentication is slow, or costly in engineering around application latency (think of driving through gates at 10 mph, potentially), then it makes sense to have those locations replicate auth data, as nightmarish as that may be. So: thousands of copies of key data, or partial stores of essential data. Toll roads would do this, potentially. With NVRAM prices declining, and density not being an issue for small devices at all, I can imagine a toll “booth” widget holding credentials for all frequent travellers.

Micro… micro-watt. IoT wants silicon to sip power.

Dear me, all the language sounds stuffy… “Client” sounds outdated in the IoT age. “Array” is fast becoming descriptively obsolete, too. “Drive” seems anachronistic, when DIMMs could be the standard connector quite easily. “Store” seems too locality-bound.

Time for marketers to come up with new words! 🙂

M for Mega-IOPS
M for Mega – parity
M for Millisecond latency
M for millions of endpoints / clients supported
M for thousands of cores
M for milliwatt (capable of sleeping or sipping, or applying to components on bigger machines, if not total power draw)

P for Petabyte capacity
T for Terabit networks

Hmm, I get a PT7M machine ….

or just a 7M machine, because capacity and network bandwidth are arguably less descriptive constraints across a range of purposes.

Anyone think that is an outrageous spec target in due course?

John Other July 7, 2016 at 1:39 am

oops, arithmetic fail, above 🙁

It’s a 6M machine.

Which is a nice power of two improvement 🙂

John Other July 7, 2016 at 2:13 am

Sorry, guys, to hog the comments, but I think my milliwatt line of thought (meant as milliwatt, not microwatt – my typo above) will attract questions. I originally wrote, then edited out, a disclaimer: that I was thinking of component power-draw targets. This applies both in IoT widgets, for battery or low-power requirements, and in grid-fed kit, because of thermal constraints on many components together. There will be very dense machines indeed, to match some of these requirements. So my milliwatt spec is a more general power target – an industry-wide goal. I thought it should be included despite not strictly being a spec, if only because low-power engineering is an ongoing precondition for truly dense, complex systems – a cumulative, significant breakthrough at whatever power drain actually enables the other results, regardless of the number attached.

John Other July 7, 2016 at 2:42 am

I am so sorry, I should have drafted this properly; instead I just spilled my coffee, noting “M for thousands of cores”… Whoops!

I was thinking millions of threads; anyway, thread count matters greatly to current storage bounds. I think millions of threads is eminently conceivable when considering really large storage systems and numbers of IoT-like clients or microservice clients.

I do not know how good a fit throughput processing is with the kind of simpler cores GPUs use, but I think silicon could go this way, maybe through Intel Xeon Phi-style routes or more custom designs.

Modest internet businesses run fleets of thousands of servers. Just put a thousand threads per box, and you’re there – theoretically, with things we have today.

The key point I get from beginning to think this through (and I accept I should have applied more thought before posting – it warrants better examination, and I have just kept looking back and adding things at random while waiting for a conf call to set up…) is that one has to define the space or envelope, rather than a defined spec for a particular application. Just as we see storage companies proliferate now, applying themselves to slices through a multi-dimensional space of technologies, so figuring out what will be the next level, ten years out, is about defining the outer limits of that geometry. In managing color spaces, doing a balanced job not too far outside the intended space gets results. I am thinking the same will apply here. One or two co-ordinates may in reality fall a bit short of the proposed dimension co-ords, but the job will be done: storage reality (and I think this is going to converge ever faster with general computing problems and goals) will be changed.

Victor Engle July 7, 2016 at 2:07 pm

I like the storage-with-AI idea. That will probably be a pervasive theme in all areas of IT during the next decade or so. During the next 10 years, will many of the heavy-duty apps that are staples of enterprise computing transform so that they fit into a scale-out virtualized model instead of scaled-up big-iron mid-range systems? It seems that as soon as virtualization was generally accepted as an optimal host environment, 80-90% of hosts were consolidated into virtual environments, but the resource-intensive apps have been stuck on physical. If those apps were drastically redesigned they might fit nicely into virtualized environments. If so, adoption of public cloud tech would accelerate beyond the present rate.

Andy Lawrence July 8, 2016 at 9:58 am

I still think the major task over the next decade is to improve data connections. Related data today is often stored on many different devices (cloud, DAS, NAS, pocket flash drive, etc.) and managed by separate data management systems that don’t talk to each other.

It is too hard to write an application that has to communicate with a half dozen different systems to get all the information it needs. My last company had a single program that needed to talk with 5 different databases (some relational, some NoSQL) to get access to all its data. That was in addition to storing unstructured data within files that were stored all over the place.

Even among those systems that try to coalesce around unified APIs (e.g. SQL for databases or S3 for cloud), there are often differences that have to be addressed. It is simply too hard today to discover (i.e. search) information when it is spread across so many systems and be confident that you found everything.
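A minimal sketch of the hand-rolled unification layer such an application ends up writing – one lookup interface with per-backend adapters behind it. All class names here are hypothetical, and the dicts stand in for real SQL/S3 clients:

```python
# Minimal sketch of the unification layer an application hand-writes
# when its data spans several stores: one lookup interface, with
# per-backend adapters behind it. The dicts stub out real backends.

from abc import ABC, abstractmethod

class Store(ABC):
    @abstractmethod
    def get(self, key: str):
        ...

class SqlStore(Store):
    """Adapter hiding a relational backend (stubbed with a dict)."""
    def __init__(self, rows: dict):
        self.rows = rows
    def get(self, key: str):
        return self.rows.get(key)

class ObjectStore(Store):
    """Adapter hiding an S3-style object backend (stubbed with a dict)."""
    def __init__(self, blobs: dict):
        self.blobs = blobs
    def get(self, key: str):
        return self.blobs.get(key)

class Facade:
    """Ask every backend in turn; the application sees a single API."""
    def __init__(self, stores):
        self.stores = stores
    def get(self, key: str):
        for store in self.stores:
            value = store.get(key)
            if value is not None:
                return value
        return None
```

Every application with data in six systems ends up re-inventing some version of this facade – which is exactly the wasted effort a real cross-system data layer would eliminate.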

John Other July 12, 2016 at 4:42 am

I am tempted by Andy Lawrence’s observations to propose that we need generic Data Discovery Brokers: intermediaries that unearth not only metadata about storage encapsulation, of the kind we face when assembling large datasets for a purpose (e.g. NoSQL vs RDBMS vs “unstructured”), but that also obtain, and invest in new meta-stores, information about the “entropy” of metadata surrounding the data sets worked with, as they are used and created.

If I write a draft for a paper, then my browsing habits, pages visited, PDFs opened, queries to JSTOR, SQL I write to obtain data to work with – all these _user actions_ have high value in creating context. They describe not only the value of data (whether I query a production system for transaction records, query a “silo” for “data mining”, or use R to transform results input from a store, all point to the level of interest I am applying to the data) but also the location and “whereabouts” of pertinent information, which has value in seeking context for future use.

An analogy might be found in the way some video cards have hardware functions for streaming gameplay over Twitch or similar broadcasting services: the stream of user interaction with data is as valuable as the nature of the data itself, and even the value of data is proportionate to the user interaction with it, in ways quite beyond what can be divined from looking at hot sectors that one might push up a proverbial HSM to be more readily to hand.

*Who* accesses the data is also valuable. If my CFO spends a lot of time working with R on a cross-section of the sales ledger, I might infer an importance in the underlying data that goes far beyond anything we would do merely to improve access performance.

Moreover, content-protection systems already do a lot of use monitoring and access/authentication monitoring, and draw inferences from it that may trigger alerts or policy enforcement.

I mention content protection as an example where companies are happy to pay license fees and provide extra compute to enable a “layer of oversight”, for different reasons – one which could have a secondary use in observing data-usage trends from which we could impute capacity/performance allocation instructions.

A Data Discovery Broker would analyse user interaction across a network, across workstations, across user application space, and could be ambitious enough to query available formats so that data can be reprovisioned in alternative formats more suitable to the types of access anticipated.

Such a broker would offer suitably accredited users dashboards of active data in the organisation, with tools to reprovision, e.g., a sales ledger, or a forward foreign-exchange view of the sales ledger, in a variety of formats – so you could pull current active work areas as XLS (ideally through server-side Excel setups) and work through that, or have a big table join or complex query available as an array in R, or as a LINQ expression.

The job then would be to make these data copies available on the fastest layer of storage nearest to the user; to apply higher levels of redundancy for data protection to the original underlying data; to negotiate snapshots or specific rollback points from databases; to route In Action (sic) Data to storage networks that run at the lowest latencies / highest bandwidth. To do (please take this as a purely caffeinated metaphor) a kind of NUMA across the network for work in progress. To get what is useful to the right node.

The capability derives from what is being done to and done with the data, as it is in flight between the store and the applications that are using it.

I see it as a means to create a generic management tool – a data monger, also – but one underpinned by modern networks having incredible configurability and on-switch processing like FPGAs. I think network-switch use of FPGAs could extend the storage stack, making certain data available as a memory access, if the application affords that call. The Data Discovery Broker would be a plugin architecture integrating network optimisation of data in flight with visibility of important data, visibility of use context, and low-level recall of user interaction with that data (think streaming application interaction, like gameplay streaming for Twitch), which would have security potential also. Switch-level FPGAs or kernels could marshal the disparate stores that are relevant, such as cold store on S3 for archival accounts, ready to be staged closer to, e.g., the CFO who is getting interested in applying R to discover important relationships for her model projections.

I think storage is going to be ubiquitous, at every level.

I do not think we will have monolithic storage in normally accessed production use. I think today’s big Hitachi systems will continue to exist, but – despite being able to perform at amazing speed compared with their predecessors – they will be stores used for corporate integrity and continuity as much as anything else, if not more so.

I said Hitachi only because I see their niche in multi-site mainframe replication and how well they integrate that, giving them a piece of a different level of importance in infrastructure – but take your pick… Now, having said that, the plethora of new vendors are each taking their pick of new layers to be found at different architectural levels of integration, from network-sensitive to compute-sensitive, in terms of where they align. To me that meant the difference between super-fast IOPS over InfiniBand or similar, versus hyper-converged putting the latter-day platters close to enough cores to forget latencies.

So I believe there will remain a role for very high end storage that is all about business continuity, that can handle OS level clustering, that can manage replication across sites with real mainframe reliability.

But we will increasingly want to connect storage to the network and place cores at storage’s disposal, and so I see closer co-operation between IP switch vendors and commodity boxes – if only to sell the commodity boxes, which will not be cheap, because they will be maxed out for the foreseeable future to whatever Intel can provide. Those boxes will drop the hyper-converged approach the minute they can DMA storage resources intelligently. Storage *compute*, not merely storage-*protocol* compute, will move to the network interfaces/cards, or more likely start to exist in co-processors on multi-CPU boxes. The recent Xeon-D series have two 10Gb/s Ethernet controllers on die. Switches with significant processing, especially those with their own Linux kernels, will calculate optimal routes to raw storage silos, and even multicasting/multiplexing could be done at the network level for spreading bulk reads across the network. I think the outbound links to cloud storage and remote stores will be negotiated by CPUs with onboard NICs, and switches able to process protocols and assess latency and other issues concerning data in transit will manage those streams. What will pay for these systems is having “data dashboards” that create a C-suite-intelligible overview of important data across the entire corporate domain. Access and security, as well as basic storage allocation, will seep down the stack as well as up it, from drives all the way through to user software. New switches will report how it is happening in real time.

John Other July 12, 2016 at 5:32 am

I think Victor Engle has a thought not incompatible with my own.

AI, or heuristics, is likely to play a role in storage, very soon.

What I tried to extrapolate was the data from which AI or heuristics can learn what is opportunistically good for the storage network. In fact, my aim was to imagine a product that would encourage direct inputs about the validity of decisions regarding data optimisation. Ever since Excel spreadsheets have been able to reside on a client–server model, I have thought that I would like access to patterns of use and reuse of spreadsheets, for a variety of reasons.

I think there is another factor at play here: how will we all get sold 10-core desktops for general office use?

The only way is to seek to use the incredible amount of data generated by users around what they do for their “work”. I put work in inverted commas because we all understand that work is too often defined by the polished result of what we create for others in the organisation. But *everything* we do (oh, probably even reading XKCD comics!) has a relevance to – if it is not actually – the “real work” we undertake through our day at the office.

Google and the incredible array of advertisers tracking our every online movement have got one idea about monetising our every last interaction with the machine (browser!) we employ for our goals. I do not see why all the “leaked” or “incidental” usage we enact through, say, playing with some pivot tables cannot offer interesting insights that can be valuable to a company.

I certainly have NOT thought all this through, because it is what is cropping up in my thoughts as I write. I do not know how useful certain data may be in the contexts I posit above.

AI, or AI techniques, surely have roles to play, and quite possibly my idea above for the Data Discovery Broker is a non-starter – both because it is highly ambitious and because it could end up dealing only with really narrow cases. In which case, genuine AI may be required to interpret what is going on and what needs optimisation. But to obtain the data to analyse, I think storage must be treated as part of the network infrastructure. I see no reason why we could not end up with silos of DIMMs accessed by Ethernet DMA at microsecond total latencies, across LAN if not WAN campus spans. I think that proximity to the local silo – or rather, having knowledge of what is in the local silo, in a way advertisable to the network (suddenly I think IPv6 has a real role now, as does real multicast IP) – is the way to solve storage discovery. With NICs embedded in server CPUs (and within desktops – I am looking to build a Xeon-D desktop shortly, and think the only actual obstacle to that line of chips being accepted generally is pricing close to faster general-purpose Xeons), *applications* can be written to stop caring where their storage is. And *sharing* of on-motherboard DIMMs, especially NVRAM – with routes published via a switch whose kernel is aware of availability and security requirements, aware of alternate copies, aware of the need to copy out (with permissions intact) that DIMM-based copy, and able to advertise an address range over IP to other intelligent switches – means the OS could marshal all this extended address space just as it maps network drives today.

Much of my thinking out loud is really close to how some supercomputers work, in terms of moving data around memory to be closest to the processor. The difference, which is the be-all and end-all, is that this will not be about message passing or having DMA-aware applications; it will all be exported as one contiguous “drive” that a new workstation or server plugs into. Making backups may even become automatic: as read/write-cycle longevity and wear become less and less of a worry, I imagine almost peer-to-peer copying of caches into SSD-style silos, automatically, on booting a machine on the network. It will be a fabric store, of a kind – only not perforce a fabric store, because policy choices could be applied. I think policies, security, and data-value decisions will commonly be managed at the network. Data classification, use, utility, and obviously processing will happen on CPUs with on-die fast Ethernet or equivalent. There will be talkback between network cores and processors about how best to utilise storage silos, from DIMMs to whatever else we have plugged in. Applications – whether of the monolithic type, like SPSS or whatever we use, or micro-services involved in a federation of compute for a webserver – will be able to make system calls at OS level to pick and choose their needs, even take direct control over resources in a partition-style excision. It sounds to me like lots of grid-computing ideas will flourish again, as will DCOM/CORBA interfaces for the software.

Where ultimately “storage vendors” will be, in all of this, I have no clue.

I believe all storage vendors will have to radically rethink what they can achieve.

As Robin has so forcefully illustrated recently – and I am grateful for it being put so clearly – storage companies have now got to shed some serious amounts of legacy, associated with optimising spinning rust.

What they risk doing now is building up another legacy of optimisation that can be taken away by a variety of competitors in fields they are not much connected with. Intel could burn a storage core onto every chip. Network-switch vendors could write storage stacks into their Linux kernels. Data-management companies from other ends of the field, such as security and communications/IP integrity, could come up with must-have applications that initially justify expensive dedicated hardware (just as firewalls built an industry of hardware), and Microsoft could treat storage as a service at a micro scale, talking their FS over DMA to silicon directly. SSD manufacturers, even within heat and power constraints, could think again about direct access to their store silicon. Someone could even see all of this as something that can be done on a custom cloud setup – at least for much of the usability issues I’ve touched upon, if not raw performance.

I think that the true possibility of the flattening of latencies and equalising of bandwidth will result in *elegant* solutions winning. That is a paradigm change in a world that has revelled in down-and-dirty optimisation from its nascent moments.

Andy Lawrence July 13, 2016 at 8:02 am

I was trying to follow John Other’s train of thought about a ‘Data Discovery Broker’ but I got a little lost.

My original comment was about the need for ‘data connections’ that span across hardware systems and data managers. We have millions or billions of discrete pieces of information within our data sets. Each piece of information is related in some way to a variety of other pieces of information. For example, a sales report is related to a snapshot of the set of database tables from which the raw data was derived. In this case one of the pieces is stored by a file system while the other is stored within a relational database.

We need systems that can help us ‘connect the dots’ and find relationships between data points regardless of where they are stored or what data manager is used to store them. In order to accomplish this, we need a uniform way of associating extra metadata to those data points (e.g. tags) regardless of the data manager. The tagging mechanism can be a combination of manual processes and automated processes using some kind of AI.

Once the tags are in place, we can further make connections by lining up the tags. For example, I might have a document about a meeting with a client; an email I received from that client; and a copy of a white paper written by that client. If all three pieces of information have a metadata tag that specifies the client’s name, it is much easier to connect those dots. But it doesn’t work well if the metadata tags are hidden behind each of the walled gardens that have siloed off our data.
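The “line up the tags” step above can be sketched in a few lines of Python. Items live in different silos (file system, mail store, document store) but carry a uniform metadata tag set, so connecting the dots becomes a join on tag values – all item names and silo labels below are made up for illustration:

```python
# Sketch of tag-based "connect the dots": items from different silos
# carry a uniform metadata tag set, so relating them is a join on tag
# values. All item names and silo labels are hypothetical.

from collections import defaultdict

items = [
    {"id": "meeting-notes.docx", "silo": "filesystem", "tags": {"client": "Acme"}},
    {"id": "msg-4821",           "silo": "mailstore",  "tags": {"client": "Acme"}},
    {"id": "whitepaper.pdf",     "silo": "docstore",   "tags": {"client": "Acme"}},
    {"id": "invoice-77",         "silo": "erp",        "tags": {"client": "Globex"}},
]

def connect_by_tag(items: list, tag_key: str) -> dict:
    """Group item ids from any silo by a shared metadata tag value."""
    groups = defaultdict(list)
    for item in items:
        value = item["tags"].get(tag_key)
        if value is not None:
            groups[value].append(item["id"])
    return dict(groups)
```

Here `connect_by_tag(items, "client")` groups the meeting notes, the email, and the white paper under “Acme” regardless of which silo holds them – exactly the join that today’s walled gardens prevent.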

I am currently working on a system that makes connections like this possible.

Nicholas Jarvis December 4, 2017 at 3:31 pm

For those folks interested in AI/machine learning within a scale-out storage platform: please review this link below, for a platform that provides production performance (70µs and 450K random IOPS per node) and leverages high-capacity spinning disk, NVRAM, Intel Optane and Samsung Z-SSD today!
