The current economic free fall makes one thing clear: the days of solid-gold enterprise IT are numbered. Successful IT architects and managers must be experts at wringing the maximum business value from IT architecture and product choices.
But how, exactly, do you do that? The solid-gold strategy is easy: all the kit is the best money can buy; vendor service techs and SEs are on-site; everything is over-configured; redundancy is everywhere; and the sales team is golfing with the CIO.
But in tough economic times – which look to last a while thanks to Congressional economic illiteracy – the focus moves to getting the most bang for the buck. The problem is measuring that and convincing gimlet-eyed CFOs.
How about a little Clarity?
That’s where Clarity AP (Assessment & Planning) comes in. It uses Bayesian probability calculus to quantify system availability and recovery times.
The net net: you can calculate the availability of a system of discrete elements that each have their own availability probabilities. For example, you can compare the availability – and its complement, downtime – of a RAID 5 array to that of a RAID 6 array. Or an infrastructure composed of Big Iron arrays to one of clustered commodity boxes.
Compare a Tier IV data center to a Tier I. Add remote backup into the mix. Even, good heavens, cloud storage.
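To make the arithmetic concrete, here’s a crude back-of-the-envelope sketch – not Clarity AP’s model, just naive independent-failure math, with an annual drive failure rate and group size I picked out of the air – of how per-element probabilities roll up into an array-level comparison:

```python
# Back-of-the-envelope comparison of RAID 5 vs RAID 6 data-loss risk.
# This is NOT the Clarity AP model -- just naive independent-failure
# math with an assumed 3% annual per-drive failure probability and an
# 8-drive group, ignoring rebuild windows, correlation and human error.
from math import comb

def p_at_least(k, n, p):
    """Probability that at least k of n drives fail in a year, assuming independence."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_drives = 8    # drives in the group (assumed)
afr = 0.03      # annual failure probability per drive (assumed)

raid5_loss = p_at_least(2, n_drives, afr)   # RAID 5 loses data on the 2nd failure
raid6_loss = p_at_least(3, n_drives, afr)   # RAID 6 loses data on the 3rd failure

print(f"RAID 5 annual data-loss probability: {raid5_loss:.3%}")
print(f"RAID 6 annual data-loss probability: {raid6_loss:.3%}")
```

The absolute numbers are rough at best – the comments below explain why – but the relative gap between the two configurations is exactly the kind of comparison the business side wants to see.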
They laughed when I sat down at the piano, but when I started to play!
Of course, the problem isn’t the math, it’s getting the data to feed the model. Most such IT “tools” are designed by consultants to extract money, not data.
The general pitch is “great tool, now hire us for 3 weeks to gather the data on your infrastructure.” That $100,000 “study” is mostly guys counting boxes in racks – with a little analysis tacked on.
Clarity AP is different. It’s got a catalog of storage devices from EMC, HP, IBM and others that you can drag and drop into a configuration. Then there’s a long list of processes – from LTO backup to remote replication – as well.
In a few minutes you can build a model of a storage infrastructure. Then start changing elements and see the impact – financial as well as operational.
With very little effort you can show senior managers in charts and graphs where you are today, where you’d like to go, and how much it will reduce downtime and data loss.
Here’s a 4 minute StorageMojo Video White Paper on Clarity AP:
If you go to YouTube you can watch it in what they call HD. Click on the “watch in HD” link at the lower right corner of the player.
Brought to you by…
TwinStrata is a Boston-based startup whose 2 co-founders, Nicos Vekiarides and John Bates, have several decades of storage and DR experience between them. They co-founded StorageApps, which was acquired by HP in 2001.
The StorageMojo take
Everybody exhorts IT pros to be “business partners” to the LOBs, but nobody ever explains how. Clarity AP is a powerful tool for turning arcane technical details into dollars and cents.
All the business guys want to know is what they will get and how much it will cost. The civilians will thank you for, finally, speaking their language.
Courteous comments welcome, of course. Yes, TwinStrata hired me to make the video. But I didn’t agree to it until I saw that they had something cool. Try it out yourself and let me – and them – know what you think.
Robin,
In the past I’ve run the numbers using a variety of probabilistic techniques, compared them to actual experience, and come to the conclusion that the overall fault rate is orders of magnitude higher than the equipment fault rate, mostly due to human factors. So the numbers produced by a tool such as this can only be thought of as a lower bound.
The classic example I’ve used in the past is the bumbling CEO. He takes a potentially lucrative customer into the ‘special’ machine room (because he can), and waves his hand madly describing all the cool technology. In the process he rips out an entire row of fibres. (You can replace CEO with ‘authorized technician’ who trips on something and get the same effect).
My second example is equipment failure due to brownout. The odds that a second drive dies because its power is derived from the same feed as the first are much higher than in the non-correlated case.
These are not independent, random failures. The correlation between the failure of one element and another is strongly dependent on location in the rack, the position of the rack in the lab (near the cool stuff or not), etc. There is no way I can capture all these dependencies in a tool, and even if I could, there is no way to quantify them. (I suppose I could send 100 CEOs into a competitor’s machine room and count the number of disasters… but I digress).
Equally important are errors in process. If the guy with the golden backup tape is sick (and doesn’t take the tape home as expected), the tape sits in the machine for a week, right next to the other copy on the shelf. Now a fire takes out both copies. It’s not a double (random) failure.
Note that if I were to try to quantify the conditional probability that the second tape is destroyed along with the first (or fibre 2 is pulled out with fibre 1), that conditional probability approaches 1 for the scenarios I have described. So the problem is not separable: you must have the probability of the entire scenario; you cannot derive the risk from the individual failure rates. The number of correlated failure scenarios goes up super-exponentially with the number of elements, making the problem virtually intractable.
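To put even a crude number on that blow-up – this is the simplest possible counting, treating each of n elements as merely up or down and ignoring the dependency structure entirely:

```latex
% Each of n elements is up or down, so every subset of elements is a
% distinct joint failure scenario, and an unrestricted joint distribution
% over them needs 2^n - 1 independent probabilities.
% At n = 100 that is already about 1.3 x 10^30 scenarios.
\[
  \#\{\text{joint failure scenarios}\} = 2^{n},
  \qquad
  \#\{\text{free parameters}\} = 2^{n} - 1 .
\]
```

And that is before you try to account for which scenarios are correlated with which.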
I’m not trying to say the tool is not useful; it’s just the first step, and it establishes the lower bound. Of course, if the lower bound is too high for your comfort, no amount of process control can improve it.
I’d appreciate seeing any responses you or TwinStrata may have.
NonRandomGuy
NonRandomGuy:
A great, thoughtful comment, and I’m happy to address it. Everything you say is spot on.
In my own experience as an engineer for various storage vendors, I’ve too often run across the situation where an aggressive salesperson pushes a solution on a customer that is both more than the customer needs and inappropriate for the problem being solved. That kind of sale winds up wasting both my time and the customer’s time as we work out how, exactly, to shoehorn the technology into the customer’s business, and the only one who really benefits is the salesperson who gets to take home a nice fat commission check. The customer gets stuck with more than they need, at a much higher cost.
The design goal of Clarity was to introduce a tool to simplify a requirements-driven approach to storage planning. Instead of letting the size of the commission determine which solution is going to be sold, we try to encourage people to think about the business requirements that are driving the data center, and examine how the proposals fit those requirements. Towards that end, Clarity is very focused on the notions of relative risk and cost comparison. Understanding the differences between different mirroring approaches, for example, or different backup policies, is much easier for humans to do when they are able to quickly compare options. Minor changes can have significant impact across the data center and the business, and understanding how those changes propagate risk and cost is what Clarity is all about.
As you say, any model which attempts to describe reality is necessarily an approximation. Furthermore, there’s a strong tension between accuracy and ease of use: to get closer to reality, you need more information, which increases the overhead of using the model. If I said that I had a great tool for predicting failures, but you needed to input a complete catalog of every disk drive, wiring diagram, and rack layout in your data center, well, I don’t think it would be very useful. Furthermore, I’d question whether taking such an extreme set of measurements wouldn’t just give a false sense of accuracy: too many factors would still remain unaccounted for. Measured disk failure rates, for example, vary significantly within the same brand and model, and vendors don’t identify manufacturing batches. To have a truly accurate failure model, you’d need to catalog each disk by manufacturing date, *and* have a reliable measurement for each batch. Ugh.
Clarity’s model tries to replace such illusory hyper-accurate modeling with a set of assumptions that lead us to good ballpark estimates of overall storage system reliability. Furthermore, by basing our estimates on a set of common assumptions, we are able to arrive at very good comparisons between policy variations.
We started our modeling based on several data sets, including a couple that have been raised right here on StorageMojo: the FAST ’07 paper and the Google Labs reports. The most recent ACM Transactions on Storage also had a great paper from some of the NetApp people, which echoes your points perfectly. The Clarity model started there, and with our own experiences.
Here’s what we know: disk failures are not independent. From a high level, that’s obvious, since, if they all reside in a single data center, a disaster there will affect all of the systems and subsystems. What wasn’t obvious, at least unless you’d actually worked in a data center and seen the real world, was that even failures within a single rack were significantly correlated.
The beauty of the Bayesian approach is that we don’t have to assume independence, and in fact, Clarity doesn’t. We spent quite a bit of time going through various real-world results and devising a model that fit them to a reasonable approximation. We even incorporated your “bumbling CEO” (although I called it the “kamikaze squirrel”… I’ll let your imagination fill in the experience that led to that). We try to integrate many factors, such as the manageability of the array, the reliability of the firmware, and the ‘environment’ of the disks, to capture as many different interdependencies as is reasonable. The probabilistic fault tree gives us a clean, formal way to use informal estimates of these kinds of factors and arrive at a good ballpark.
Some of your specific points demonstrate exactly the kinds of policies that Clarity is meant to model: for example, Clarity will report a much lower reliability for a configuration in which tapes are stored on-site than for one in which they are shipped remotely. It even describes the difference between shipping tapes daily vs. weekly.
Where I do disagree with you is when you say that the problem is virtually intractable. The Bayesian approach is to identify conditional independence: given that there is not a fire, or a power outage, or one of a hundred other factors, the failure of a tape and the failure of a disk *are* independent. The joint distribution of failures across the entire data center *is* exponential, but thanks to the conditional independence of the subsystems, we have a tractable path to combining them. We don’t have to consider *all* of the possible scenarios simultaneously. We do have to describe how individual subsystems interrelate to lead to an overall event: the failure of the system to function. Networks based on probabilistic dependencies are really, really powerful tools for describing and combining real-world scenarios, precisely because they are good at propagating information.
Say, for example, that you are told that a disk has failed. If you are given no other information, then the tape’s state is *not* independent of that knowledge. Since you don’t know why the disk failed, the possibility that there has been a fire in the data center increases, and if the possibility of a fire increases, then the possibility that the tape has been destroyed increases, too. So, if you know that a disk has failed, then you have to worry a little bit more that the tape has failed, too. But if you are also given the knowledge that the data center is *not* on fire, then the state of the tape is suddenly separated from the state of the disk, and you can consider the failures independently again. That’s the kind of relationship that the Clarity model is based upon: trying to capture a broad, high-level view of the interrelationships between all of the policies and components.
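If it helps to see the mechanics, here’s a toy version of that fire/disk/tape example – the numbers are invented for illustration, and this is emphatically not the Clarity model, just brute-force enumeration over three variables:

```python
# Toy three-node network: Fire -> DiskFail, Fire -> TapeFail,
# with no direct disk-to-tape link. Numbers are invented for
# illustration only; this is not Clarity's model.
from itertools import product

P_FIRE = 0.001
P_DISK_FAIL = {True: 0.90, False: 0.02}   # P(disk fails | fire / no fire)
P_TAPE_FAIL = {True: 0.95, False: 0.01}   # P(tape fails | fire / no fire)

def joint(fire, disk, tape):
    """Joint probability of one complete scenario."""
    p = P_FIRE if fire else 1 - P_FIRE
    p *= P_DISK_FAIL[fire] if disk else 1 - P_DISK_FAIL[fire]
    p *= P_TAPE_FAIL[fire] if tape else 1 - P_TAPE_FAIL[fire]
    return p

def prob(tape_value, **evidence):
    """P(tape = tape_value | evidence), by brute-force enumeration."""
    num = den = 0.0
    for fire, disk, tape in product([True, False], repeat=3):
        world = {"fire": fire, "disk": disk, "tape": tape}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(fire, disk, tape)
        den += p
        if tape == tape_value:
            num += p
    return num / den

print("P(tape fails)                        =", prob(True))
print("P(tape fails | disk failed)          =", prob(True, disk=True))
print("P(tape fails | disk failed, no fire) =", prob(True, disk=True, fire=False))
print("P(tape fails | no fire)              =", prob(True, fire=False))
```

Run it and the second number jumps well above the first, while the last two come out identical – knowing there is no fire decouples the tape from the disk, which is conditional independence doing its job.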
The final result is just that: for most cases, we can provide an approximation that is useful in the decision-making process. Clarity is about determining how well a solution fits business requirements, rather than cramming a business into some one-size-fits-all solution.
Thanks for commenting. It’s clear that you’ve thought a lot about exactly the kinds of problems that Clarity tries to address, and I greatly appreciate your response. “It’s just the first step” should be our slogan… There is no substitute for human knowledge and experience. Clarity is a supplement.
@John:
In my experience doing site surveys, the gap between the actual environmental issues in data centers and what IT managers believe them to be is simply staggering.
The model for reliability needs to make huge assumptions about facilities – power management, cooling quality within each rack, ferrous particulate matter, the quality of staff training, etc. – that are simply impossible to ascertain through internal review.
From an anecdotal viewpoint: my best friend and I both run our personal equipment 24/7, and I probably run mine harder… but I pamper it. My best friend loses a personal hard drive every six months. I haven’t lost one in 20 years… knock on wood. This isn’t hyperbole.
Either I’m super lucky or there are small differences in maintenance and operating environment that have a massive impact on reliability. It’s not a question of hardware – we are both buying top-of-the-line gear. It’s a question of usage.
How are you modeling for the subtle unknown?
Jonathan:
You guys are gonna make me work? Geez.
Before I answer your question, I just want to reiterate: I believe that the best way to use Clarity is as a planning tool, comparing between the impacts of different scenarios. One of the reasons is simple: to a certain extent, the noise cancels out. Looking at two different scenarios sharing the same base assumptions will highlight the differences made by policy changes. Or, to use our analogies from up above, the same bumbling CEO will still be in the building with the same number of squirrels. We’d like to avoid having to count the squirrels or conduct experiments with CEOs.
There are a number of formal techniques for integrating both known unknowns and unknown unknowns into modeling, and we do use them. They mostly involve ways to either model broad, simplifying assumptions or to introduce controlled noise. The goal is not necessarily to be precise with the numbers, but to correctly capture the behavior of the system when changes are made: we should increase reliability at the correct rate when good changes are made, and decrease it at the right rate when bad changes are made. To illustrate, let me talk about disk drives:
We have no good reason to assume bad faith on the part of drive manufacturers when it comes to reporting reliability numbers. But the fact remains that there is a significant amount of variability in failure rates when their drives are embedded in real-world scenarios. The question that we started from was, “How can we develop a failure model that approximates the observations?”
A naïve approach would be to assume that, since putting a disk drive into a data center exposes it to factors that can increase its wear rate, we should simply adjust the failure rate of that drive upwards: if a drive manufacturer claimed a 0.05% AFR, we could use a fudge factor to adjust it to 0.1%. Or we could scour the world for data about that particular model and base our modeled rate on whatever empirical observations we can find.
What’s interesting, though, is that neither of those approaches explains what we see: drive failures are clustered. They exhibit significant spatial and temporal locality. Simply adjusting the failure rate, or calculating a new one, would not give us what we really need: all drive failures would just be more frequent, but remain independent. What we need is a way to describe the dependencies that are introduced when we put a drive into a box with a bunch of other drives, and then put that box into a room with a bunch of other boxes.
The Clarity model is based on a large set of random variables representing both tangibles and intangibles. When we look at disk drives, for example, we start with assigning a random variable representing the idealized physical hard drive: here’s the way the manufacturer claims that this drive model will work. For simplicity, we say that the drive either works, or it doesn’t. (We also make a simplifying assumption that if this idealized drive stops working, it will not start working at a later point.)
We then assign a new random variable to represent a real drive: an ideal drive “instantiated” and placed into a data center. There’s a simple causal link between the state of the real drive and the state of the ideal drive: if the ideal drive enters a non-working state, so does the real drive. Pretty straightforward so far, but here’s where it gets fun. A real drive is embedded in an environment, which is represented as a third random variable. An environment variable is much more opaque: we have no way to measure it or to determine what its state is. What it does, though, is represent a second causal influence on the state of the real drive. The state of the environment ranges from clean to hazardous, increasing the probability that a real drive embedded in that environment will fail. Here’s the kicker: all real drives which share a common real environment, e.g. a rack, share a common environment variable. Essentially, all drives within that environment get an extra chance to fail, even if their ideal drive is just fine.
If we can’t measure the environment, though, what good is it? That’s where the network comes into play. If a real drive fails, we can see two potential causes: either it wore out in accordance with our expectations of the idealized drive, or damage from the environment accelerated its failure rate. Since we can neither gather direct evidence about the state of the idealized drive nor the environment, we have to increase our probability estimate for *both*. When we increase our belief in the failure of the idealized drive, nothing much happens, as it is independent of all the other ideal (and therefore real) drives. But when we increase our belief that the environment is hazardous, that change propagates to all the other real drives in the environment.
Let me repeat that: seeing a drive fail changes our belief about the state of the environment which changes our belief about the reliability of all the other drives in that environment. Suddenly, we have the behavior that we were hoping for from the start: spatial and temporal locality of disk failures. When one drive fails, the probability that other drives located nearby will also fail soon is increased. The increase is modest, at least under the parameters that we derived, but it’s noticeable, and it fits much better with our real world observations.
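For the curious, here’s a stripped-down sketch of that structure – two real drives sharing one hidden environment variable. The parameters are made up for the sake of the example; they are not Clarity’s actual numbers:

```python
# Toy version of the ideal-drive / environment structure described above.
# Two real drives share one hidden environment variable; the parameters
# are invented for illustration and are not Clarity's actual numbers.
from itertools import product

P_HAZARDOUS = 0.05        # prior belief that the shared environment is bad
P_IDEAL_FAIL = 0.02       # annual failure probability of the idealized drive
P_ENV_KILL = {True: 0.20, False: 0.0}   # extra chance to fail, per drive,
                                        # given a hazardous / clean environment

def p_real_fail(ideal_failed, env_hazardous):
    """A real drive fails if its ideal drive failed or the environment got it."""
    if ideal_failed:
        return 1.0
    return P_ENV_KILL[env_hazardous]

def p_drive2_fails(drive1_observed_failed):
    """P(real drive 2 fails), optionally conditioned on drive 1 having failed."""
    num = den = 0.0
    for env, ideal1, ideal2 in product([True, False], repeat=3):
        p = P_HAZARDOUS if env else 1 - P_HAZARDOUS
        p *= P_IDEAL_FAIL if ideal1 else 1 - P_IDEAL_FAIL
        p *= P_IDEAL_FAIL if ideal2 else 1 - P_IDEAL_FAIL
        p1 = p_real_fail(ideal1, env)   # P(drive 1 fails | this world)
        p2 = p_real_fail(ideal2, env)   # P(drive 2 fails | this world)
        if drive1_observed_failed:
            p *= p1                     # weight worlds by the evidence
        den += p
        num += p * p2
    return num / den

print("P(drive 2 fails)                  =", p_drive2_fails(False))
print("P(drive 2 fails | drive 1 failed) =", p_drive2_fails(True))
```

With these invented numbers, observing the first failure roughly triples the estimate for its neighbor – the direction of the shift, not the magnitude, is the point.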
Of course, a disk is embedded in a RAID set, which is embedded in an array, which communicates with hosts, which are located in a data center, which communicates with other data centers, which send backups to vaults, and… the interrelationships are complex and the assumptions are myriad and noisy. We can’t get it right.
I sure do type a lot. I hope you think it was worth it. Thank you for giving me an opening to spout off about this…
John
@Jonathan:
The Bayesian method is a statistical method that can take the unknown into account. There are always factors that are not known or just not interesting. Using real-world data, it is possible to make a model of something you do not understand in detail. Then you add in some things you do know about (e.g., the differences between RAID 5 and RAID 6), and you’re able to compute probabilities based on that. Environmental factors, like random CEOs passing through the server room, can be taken into account as well; they will just add to the error margin of the predictions. The video doesn’t show this, but the software should (at least internally) know what the errors on the predicted costs are.
It is interesting to see its application to this topic, and I will keep an eye on the product.
It is a cool tool.
I am going to use it to help my clients decide how to use their assets to do business, from an operational perspective.
It is not perfect, but it is very useful – like a finite state machine for making realistic simulations.
The perfect tool to replace our experience does not exist yet.
That’s common sense.
There is no reason to be afraid of numbers or probabilities.
Only human beings make mistakes.
Knowledge guided by experience is the key, and tools like this should help my customers meet their business goals and move forward.
There is no holy-grail feature to expect from this tool; it is just there to help us feel more comfortable with the daily choices and technical decisions we make to do our job the best possible way.
Thank you for this nice tool for knowledge management and decision making – it will help a lot.
I am already waiting for the next release, and I am pretty sure that similar solutions from other vendors will follow.