A StorageMojo reader has a problem. Can you help?
Our mail hub (80,000+ mailboxes) is virtualized with vSphere 4.1 with Red Hat Enterprise Linux 5 x64 and Dovecot 2.0 [an open source IMAP/POP3 email server for Linux/UNIX-like systems]. We are using HP LeftHand Networks P4300 iSCSI storage in a “network RAID10 of RAID10 storage” setup for Dovecot indexes and multiple “network RAID1 of RAID5 storage” setups for actual mailboxes.
This is my take: our Dovecot indexes are getting hammered with lots of small I/O requests, about 8,000 IOPS continuous during 8-hour working days, 75% write. Indexes are fairly small (50 GB) and expected to grow to 100-150 GB, but need a lot of random I/O. We need real-time replication in storage (LeftHand is OK for us) and we think that SSD should shine in this situation. Bandwidth is not a problem (200-300 Mbit/s of index traffic); what we need is more IOPS.
The problem is the indexes, but our total mailbox capacity is expected to grow to 6 TB compressed using zlib compression in Dovecot.
We want to buy a storage appliance with the following requirements:
- vSphere 4.1 & 5 certified storage, VAAI enabled (if possible)
- iSCSI (1 gbps)
- High number of IOPS (at least 12,000+, most of them writes)
- Small size (200 GB)
- Fault tolerant (RAID, battery-backed write cache, power supply, fans, multiple gigabit uplinks, synchronous replication)
- Cheap (less than $30k for the full setup)
We want to buy at the beginning of 2012. Any product that fits?
The StorageMojo take
Suspect price will be the most significant limiter. But the respondent only needs index storage, not the whole shooting match. He’s pretty happy with LeftHand for mailbox storage.
But if we can solve both problems for him, why not? If he should relax some constraint, feel free to suggest it.
He’ll be watching the comments, so if you have questions please ask them. I’ll be following the comments as well.
Courteous comments welcome, of course. His email was edited for clarity.
I run a similar mail server but with Maildir; our issue is large directories for IMAP/webmail users. We’ve addressed this with NetApp FlexCache in “metadata mode”: file names, etc., are all stored in a 256 GB RAM cache on the filer. This looks like it will allow us to replace shelves of 15k RPM drives with one shelf of SATA for the store, since the big I/O hit is the metadata to build the mailbox itself.
Are the Dovecot indexes per node or shared across the mail cluster? If they are per node, have you thought about simply moving them to Fusion-io cards in each mail node? If they are shared, you could look at putting a Fusion-io card in an Openfiler/Linux machine and exposing it via 10 Gb Ethernet NFS/iSCSI. $30k for 12,000 IOPS is going to be a tall order, so rather than upgrading the whole iSCSI SAN, take the component that is giving you grief and speed it up. That’s the most cost-effective approach.
The $30k price point is a non-starter for that number of IOPS and the desire for VAAI integration. My best guess is that a price point of $50-65k would meet all the requirements. Given the tight integration with VMware, I’d look at Tintri.
The price is definitely the stopper if that’s the price of the whole solution.
Why is one of the specifications “Small size (200 GB)” — is this for the mailbox indexes, or for mailbox storage? I guess I’m not clear on the question’s parameters.
You might look at Scalable Informatics … I’m not sure that they’re VMware certified, but they are impressive machines that can provide what you’re looking for, and probably do it with spinning rust; their SSD stuff blows the barn doors off. Otherwise, I’d probably end up looking at something that was cheap DAS but VMware-compatible, like a Dell MD3200 (or MD3200i if you insist on iSCSI) packed with SSDs.
The 200 GB is just for indexes. We are using zlib-compressed mdbox for mailboxes, so there is no problem with lots of messages in a single directory; they are nicely packed into a few files (one file has all the e-mails of one day).
We only need ultra-high IOPS for the indexes, only 200 GB of high-performance iSCSI; the SAN we want to buy is only for that. We are evaluating buying a couple of Dell servers, packing each with a RAID 1 of SSDs and putting HP LeftHand VSA on top to “network-RAID” them, but there may be a better way of doing this.
How about getting better software? This is the true cause of the problem. Any mail server that uses a real database with real indexes for mail storage (like MS Exchange or Lotus Notes) will be vastly more efficient. A B-tree is a wonderful thing.
Paraphrasing Knuth, adding hardware can, in the best of cases, net you a 10-fold improvement. A good software fix can get you a 1000-fold improvement in performance.
You can leave the mailbox data on the LeftHand P4300 (which can easily scale)
and use HP LeftHand VSA on local SSDs for the indexes.
Same management console, Peer Motion, remote snapshots, and VAAI support.
Not sure about the price or your server configuration.
Bad design and equipment selection, for starters. Since you don’t reveal THE most important spec on the units, I assume you’re using the 12 TB SATA model?
Indexes should be on 15K drives, no exceptions. “Network RAID 10 or 1” is just silly. LH units are designed to be fault-resilient. The likelihood of an entire unit going up in smoke is vanishingly small. It’s only email, so a multi-hour outage is not a problem (get over it, people).
Nobody who uses Dell Equallogic bothers with network RAID so why do that with LH? Do you really think it’s that inferior?
If you absolutely must mirror each volume, use Linux MD RAID 1 with “write-behind” or DRBD in async mode on whatever Linux host the LUNs are presented to.
The more LUN queues the host OS has to work with, the better. If each LH unit has four 1 Gb Ethernet ports, there should be at least 4 LUNs (better yet 2x that) presented to fully maximize the Ethernet links. I suspect you did network RAID 10 to get bigger LUNs. Wrong way to do it. Use LVM on the Linux initiators.
If the Linux filesystem is ext3/4, run it in journaled mode and put the journal on an SSD or another LUN that isn’t shared with anything else, except perhaps other journals.
You have more than enough equipment. You’re just doing it wrong or in the case of the indexes, likely using the wrong drive type.
I built a 220-spindle EMC CX3-40c (8 iSCSI, 4 FC) from used equipment for $28,000 six weeks ago. One of my multi-billion-row database instances chews through 12,000 IOPS of 80+% random I/O on just over half my spindle count. Talk to BLTrading or MaximumMidrange or your favorite used EMC reseller.
A CX3-40 is a very old piece of hardware. You would be off contract with that right from the start, which isn’t really the way you want to go if availability is at all important.
I’m with Paul at the top. I’d get a small Netapp filer with a 256GB PAM card and a shelf of 1 or 2TB SATA drives. The indexes will be lightning fast and you can keep the mail on the same solution.
Thank you for your answers, some clarifications:
1º Software choice: Microsoft/Lotus CALs are much more expensive than the SAN we are trying to buy. Also, try to serve all those mailboxes with those suites: how much hardware do you need?
2º Hardware: all of our LeftHand nodes are P4300s with 8 SAS disks at 15,000 rpm. We have 5,000+ concurrent IMAP sessions. Mailboxes are on a striped LVM of LeftHands with RAID 5, and indexes are on a dedicated pair of LeftHands with RAID 10. We read about write barriers, noop/deadline Linux I/O elevators, multipathing, jumbo frames, delayed ACKs and queue depths with iSCSI. We did our homework.
3º Network RAID: regulations impose on us a DR plan with very little downtime. Our LHN nodes must be in separate buildings. This works as expected; that is the reason we are happy with LeftHand.
4º A pair of LeftHands (network RAID 1), each node with 8 SAS 15k rpm disks in RAID 10 and 512 MB of battery-backed cache, barely supports 8,000 IOPS (60-75% write) continuous. I have the graphs to prove it.
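For context, a rough back-of-envelope supports that observation. The figures below are generic assumptions, not measurements from Javier's systems: roughly 180 random IOPS per 15k spindle, a RAID 10 write penalty of 2, and no credit for controller cache or network RAID.

```python
# Back-of-envelope only: assumed ~180 random IOPS per 15k RPM spindle and a
# RAID 10 write penalty of 2; controller cache and network RAID are ignored.

def backend_iops(host_iops, write_fraction, write_penalty=2):
    """Back-end disk operations needed to absorb a given host workload."""
    reads = host_iops * (1 - write_fraction)
    writes = host_iops * write_fraction
    return reads + writes * write_penalty

def spindles_needed(host_iops, write_fraction, per_disk_iops=180):
    """Spindle count required if the cache gives no help at all."""
    return backend_iops(host_iops, write_fraction) / per_disk_iops

for target in (8_000, 12_000):
    print(f"{target:>6,} host IOPS at 70% write -> "
          f"{backend_iops(target, 0.7):>7,.0f} back-end IOPS, "
          f"~{spindles_needed(target, 0.7):.0f} spindles without cache help")
```

Even granting the write-back cache a lot of credit, 8 spindles per node is a long way from those numbers, which is why the interest in SSD for the index tier is reasonable.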
John, thank you for your answer. That is the config we are evaluating, but we wanted to ask the gentle readers @storagemojo.
Thank you again, John.
Is there any reason why iSCSI is a requirement? Is it for consistency with the rest of the environment? Personally I’d go with NFS as Dovecot handles it just fine [1] and you can have multiple VMs to spread load but use the Director [2] to send users to the same server for a given session to keep caches hot.
[1] http://wiki2.dovecot.org/MailLocation/SharedDisk
[2] http://wiki2.dovecot.org/Director
Either way, perhaps look at some of the ZFS-based appliance vendors out there (Nexenta, Oracle/Sun). They can generally do iSCSI (and/or NFS), and you can order units with flash, so that will definitely help with synchronous writes.
I’m at a place now where we’re using Isilon a lot, and we’re very, very happy with it. It’s probably a bit out of the OP’s price range, but definitely worth looking into for possible future use.
@Ryan Malayter: A “real database” will not necessarily solve this problem. We had Exchange fall over just last week when my manager’s mailbox/INBOX was moved from a version 2007 to a version 2010 server. He had ‘only’ 200,000 messages in it, and it hit some kind of database log limit. I’ve had Courier IMAP/POP Maildir accounts with 2.7 million messages in their inboxes (it was for an automated process and was left to run for 4+ years before we ran out of inodes on Solaris UFS).
FWIW, Dovecot does have “real” indexes [3], and its own native mailbox format [4] makes for very quick and efficient storage (the author has obviously learned from previous efforts).
[3] http://wiki2.dovecot.org/IndexFiles
[4] http://wiki2.dovecot.org/MailboxFormat
Honestly, if you’re running a Unix-based system for a service, your first choice for remote storage should be NFS. You should only go to other protocols (read: iSCSI) if for some reason NFS doesn’t work for a particular application (and even then, I’d assume it was something I was doing wrong). By going with NFS, and mounting directly from the guest, you don’t have to worry about vSphere compatibility either, which may open up your choice of vendors.
Buying a server with a few SSD drives and using the VSA is a nice solution which will fulfill the requirements Javier lists.
Take care, though, not to trust the IOPS numbers on the spec sheets blindly, and not to buy just the minimum amount of usable drive space: SSD performance can degrade as the drives get close to full, and generally the more spare space SSDs have, the longer they will last.
Then there’s some general stuff that Javier might have already checked on the current system, like partition alignment on the index filesystem, disabling last-access timestamps if possible, and moving the journal to another disk.
Also, I did hear that HP is releasing a version of the LeftHand with SSDs in the first quarter of next year. How they are going to implement it, I don’t know.
Have you tried disabling Dovecot index files?
No more thousands of small random I/Os.
Dovecot will generate indexes in memory when it opens a folder.
Depending on the server’s CPU/memory headroom, this might be a short-term solution.
@Jacob. Disabling indexes can be a real problem if we have to restart the system. Once rebooted, indexes must be recreated by reading the entire mailbox. At 8:00 AM we would have hundreds of concurrent users with fat mailboxes (4-5 GB) thrashing our systems.
@Elfar, thank you for your answers. We are using iSCSI for consistency with the rest of the environment; we have 100+ VMs and we wanted to use the same kind of storage for everything (we have 5 pairs of HP LeftHand: two pairs for mailbox storage, another pair for indexes, and the other two pairs for the other VMs).
@Elfar, the indexes are partition-aligned, noatime, nodiratime, ext4. Any news/rumor site about LeftHand+SSD you can share?
Regards
Javier
“80,000+ mailboxes (…) Dovecot indexes are getting hammered with lots of small I/O requests, about 8,000 IOPS continuous during 8-hour working days, 75% write”
6,000 writes/s may be easy to handle on a journaled FS, but 2,000 random read IOPS is a fair load, and the implied head moves break the stream of write operations.
The typical (arithmetic mean) mailbox is polled every 40 seconds (80,000 boxes / 2,000 read operations per second). Is it really necessary? Frequently polled data may be kept in a cache (RAM on a controller, the Linux buffer cache…), yet many/most client polls are still served through at least one (real, physical) read on a spindle.
You may be able to get some/many random reads out of the way by letting the clients poll less frequently. One poll every 3 minutes may be sufficient and would reduce the load to ~450 read operations/sec, reducing head moves. Using a proxy able to cache replies, directly answering “no new mail” for any mailbox already polled during the last X seconds, may be more practical and efficient. Bump X up and the density of read operations will fall, reducing load and enabling more efficient (no head moves!) write operations.
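That caching-proxy idea is easy to picture. Here is a minimal, hypothetical sketch of the “answer ‘no new mail’ from memory” logic; the class, the 180-second TTL, and the hook points are invented for illustration, and a real deployment would live inside an IMAP/POP3 proxy rather than a standalone script.

```python
import time

class PollCache:
    """Serve repeat mailbox polls from RAM instead of hitting the spindles."""

    def __init__(self, ttl_seconds=180):
        self.ttl = ttl_seconds
        self._entries = {}          # mailbox -> (timestamp, cached_status)

    def status(self, mailbox, real_check):
        """Return a cached status if it is fresh, otherwise do the real check."""
        now = time.monotonic()
        cached = self._entries.get(mailbox)
        if cached and now - cached[0] < self.ttl:
            return cached[1]                    # no disk I/O for this poll
        result = real_check(mailbox)            # the expensive spindle read
        self._entries[mailbox] = (now, result)
        return result

    def invalidate(self, mailbox):
        """Call on new mail delivery so the next poll sees it immediately."""
        self._entries.pop(mailbox, None)

# With 80,000 mailboxes and a 180 s TTL, back-end reads fall to at most
# 80_000 / 180, i.e. roughly 450 real checks per second, matching the estimate above.
```

Invalidating on delivery keeps new mail from appearing delayed, so only genuinely idle mailboxes are answered from the cache.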
Javier,
Another idea is to use:
1. Seanodes: the same protection levels as the LeftHand VSA but more cost effective; use it only for the indexes on local SSDs. As far as I tested, Seanodes provides great performance, scalability and data protection.
I was also very impressed with native Linux multipath across multiple Seanodes instances, which you can’t get with LeftHand.
2. Zimbra open source. I know it’s not storage related, but they do use Lucene indexes. Zimbra is more memory hungry, but I think it’s “easier” on the storage side; I am not completely sure about it, so maybe it’s worth checking.
3. Zimbra (not the open source version) has storage tiering built in and good pricing for education; I am not sure it will be as cost effective as LeftHand/Seanodes.
OF COURSE a CX3 is off contract. EMC “support” pricing ought to be justification for prison sentences, frankly. An equivalent CX4-240 solution was well into $200,000 because of support and software costs. I know several multi-national financial firms that run CX500’s and they do just fine.
I have one year of support from my EMC reseller. A CX is a single-board computer. It doesn’t break. And if it did, I could have a replacement board for $1,000 and in a jiffy. The hard drives are 300 GB/10K Fibre Channel at $100 each; 147 GB/15K drives are $50.
8,000 write IOPS of what size? 512 B or 4 KB? Scattered writes are SLOW no matter how fast the drive, so you need more IOPS (i.e., spindles) and intelligent destaging. My 12,000 IOPS figure is for 8 KB blocks.
NFS has all kinds of locking issues which Dovecot may or may not have addressed. You are also at the mercy of the NFS server’s consistency guarantees. NetApp is just fine. I can’t speak to the others.
Linux LVM striping is broken. Only MD striping works correctly. Only 8 spindles per LH node is a joke! You should be running at least 16, if not 24, before doubling for the paired node. Hardware RAID 10 of 8 drives is typically only 50% faster than RAID 10 of 4. If you have benchmarks saying otherwise, great. But Linux software RAID 0 gets a lot closer to a 100% improvement than 50%. Get a pair of 24-drive units, and make six RAID 10 groups of 4 drives each and 1 iSCSI LUN per RAID group. If Dovecot can intelligently spread the indexes across 6 filesystems, you’re good. Otherwise you may have to MD RAID 0 across them.
In order to contribute I feel that I need to know what is going on with the system, and the links that David provided helped. 80,000+ mailboxes equates to that many index files. I’m not sure how many drives you’ve set up in the network RAID 10 holding the indexes, but I understand the capacity utilization of these indexes to be small (200 GB) with 8k IOPS at a 1:3 R:W ratio.
As mentioned by other posters, the dollar figure is tight for a hardware solution like Fusion-io and the like. I know software solutions have been mentioned but nothing specific. There are software-only solutions where you can optimize the underlying storage, and considering the dollar amount this may be the best approach. The problem I see is that 80K+ indexes randomize the I/Os to the storage, and they are predominantly writes, which is another hindrance for storage. There are solutions that can take all those writes and serialize them into a sequential stream down to a small logging disk (roughly 10 GB in size on your fastest drive technology) before asynchronously destaging these writes to a slower tier of drives (imposing a logging architecture at the VMDK level). This can help minimize the amount of really fast, expensive storage needed; sequential write workloads in general are also handled better by storage. I know Virsto Software has this type of capability, which may be of interest.
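The “serialize random writes into a log, destage later” idea can be sketched in a few lines. This is only a conceptual illustration of log-structured write staging, not Virsto’s actual implementation; the block numbering and in-memory map are simplified.

```python
class WriteLog:
    """Turn scattered host writes into one sequential append stream."""

    def __init__(self):
        self.log = []        # append-only list of (block_no, data) records
        self.latest = {}     # block_no -> index of the newest record for it

    def write(self, block_no, data):
        """A random host write becomes a sequential append on the fast log device."""
        self.latest[block_no] = len(self.log)
        self.log.append((block_no, data))

    def destage(self, backing_store):
        """Later, push only the newest version of each block to the slow tier."""
        for block_no, idx in sorted(self.latest.items()):
            backing_store[block_no] = self.log[idx][1]
        self.log.clear()
        self.latest.clear()

# 1,000 scattered writes land on the log as a single sequential stream;
# the slow tier later receives them coalesced and in block order.
slow_tier = {}
staging = WriteLog()
for i in range(1_000):
    staging.write((i * 37) % 251, f"payload-{i}")
staging.destage(slow_tier)
print(len(slow_tier), "distinct blocks destaged from 1,000 writes")
```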
@natmaka a caching IMAP proxy could be an interesting solution to reduce the load from MUAs; worth a look. But I do not know if $MEGABOSSES will be happy if new mail is “delayed” a couple of minutes before appearing in their iPhones/Outlooks. Thank you for your answer, @natmaka.
@John: Dovecot has some “storage tiering” called “alternate storage” (http://wiki2.dovecot.org/Tools/Doveadm/Altmove). I can define rules to move “old” e-mails to cheaper/bulkier storage if needed.
Seanodes? I did not know about them. I will investigate, thank you for your response, @John.
@matt The biggest LeftHand node has only 12 disks. And I “feel” that buying all that spinning rust, several terabytes, only to use 100-200 GB is somewhat “wasteful”. Our take is high-IOPS, very-low-capacity, iSCSI, cheap, synchronously replicated storage.
We did some tests with MD RAID 0 vs. LVM striping and MD is somewhat faster, but not 50% faster. Any reference that supports your statement “Linux LVM striping is broken”?
Thank you for your answers, @matt
@David Magda. You suggest a possible solution: use several Dovecot instances/VMs and cluster them via the Director, each instance with its own indexes. But this would be a major overhaul of our mail backend, adding more complexity. We think that a storage-only solution is “cleaner” in this case. Thank you for your answer.
@Dan. Good point. I will read about Virsto. Thank you for your answer.
Javier, please check if the Coraid products (www.coraid.com) could meet your requirements.
Regards,
Cristian.
I would look at Scality (www.scality.com). We had a demo from them a few weeks ago, and this is exactly one of the use cases their product grew from. It is a software product that uses commodity hardware, can operate without a RAID card and is actually very inexpensive for what it does. Maybe it’ll fit for you, maybe it won’t, but it’s worth taking a look at.
The long and short of it is:
Need more money.
This is not doable or feasible with your budget. It just isn’t. For that capacity and workload combination, perceived fanboy-isms aside, the right answer is IBM SVC – Entry Edition will be plenty. The reason is, SVC will allow you to deploy multiple smaller arrays behind it while aggressively caching. The key to random IO remains spindles, and random IO is what SVC excels at. (Just don’t ask it to destroy records doing sequential.) Distributing the load across multiple arrays allows you to increase the spindle count significantly while keeping controller loads more reasonable. The SVC also scales backend IO linearly. That is to say, if you have 2 x 15K IOPS behind an SVC, SVC consumers see 30K IOPS presuming you striped the MDisks across both controllers.
All of that said, I’d tend toward an IBM SVC Entry Edition with a pair of DS3500’s for back-end storage with a pair of lower end FC switches. That gets you the needed migration path for host connectivity, iSCSI, and long-term scalability. Run out of capacity? Add more disks. Run out of performance? Add another DS3500. The SVC EE will scale to about 10 times the IOPS you’re seeing now, so consider it a long-term permanent fix (3+ years before node swap) which also permits non-disruptive upgrade plus transparent back-end migration and upgrade. Obviously, SVC also supports non-IBM arrays including EMC, 3PAR, Violin and a bunch of others. DS3500 is just pretty darn cheap and a reasonably solid performer.
But, as mentioned, it will cost you. Expect to be somewhere in the $100,000 ballpark when it’s all said and done. But it’s a long term investment rather than a quick fix.
While I don’t have any new suggestions vs. what’s already been offered, I was curious: are your mail servers running on VMDKs on top of VMFS? Or are you using RDMs through vSphere? Or are you using software iSCSI directly from the guest VMs?
Javier,
I presume Dovecot is configured with fsync=no?
@Jacob Marley. We have mail_fsync=optimized for LDA and mail_fsync=never for the rest.
@Phil. I do not agree with you. We do not need so much horsepower; we can scale our mailboxes horizontally by adding more LeftHand boxes and Linux LVM. We only need to scale up the IOPS count for the indexes. Thank you for answering, anyway.
@Cristian & @Kyle I will check out both Coraid & Scality, thank you for your answers.
@nate. Yes, we are running VMDKs on top of VMFS for consistency with the rest of the virtual infrastructure; it makes the DR strategy easier.
Javier
We did a similar exercise with a distributed DB system. We had 4 ESX servers, placed off-the-shelf 128 GB SSDs into them in RAID 10 and ran an HP VSA on each host. All ‘hot’ data was pushed to these SSD LUNs, giving us a massive boost in performance. As you already have physical P4300s, you may be able to get free VSA licenses from HP; if so, count on a total cost in the region of $3,000.
What about Nexsan? It solves most of the problems presented.
You could always look at the LeftHand Virtual SAN Appliance in a couple of Servers with Solid State drives in them. This would give you a compatible cluster with what you already have and a higher IOPS solution for the new requirement. I have not costed it but think you will likely get to your goal. @StorageOlogist
I’m surprised to see such complex solutions being advocated. What about a Dot Hill or Engenio array with SSDs?
I have a suggestion now 🙂 I came across this while doing some checking for something unrelated, and I don’t see it having been suggested by anyone else. I have not tried it, but it looks interesting, and since it is a VSA it is probably easy to test:
http://www.voltaire.com/Products/Application_Acceleration_Software/voltaire_storage_accelerator_vsa
There seem to be two modes to it – stand alone where it exports out iSCSI LUNs, or some sort of gateway mode where it can act as a “flash cache” for FC-based storage arrays.
Sounds like something I may want to check out for my upcoming project (which uses FC-based storage arrays with vSphere).
Javier,
You said…
“@nate. Yes, we are running VMDKs on top of VMFS for consistency reasons with the rest of the virtual infraestructure , makes easier the DR strategy.”
So your indexes are on a 2nd VMDK which is on a separate VMFS which is on the network RAID10 of LeftHand storage arrays, correct?
If so, when you say…
“@Elfar, indexes are partition-aligned, noatime, nodiratime, ext4 . Any news / rumor site about Lefthand+SSD you can share? ”
Do you mean the indexes are aligned to the partition on VMDK only?
OR
Do you mean the indexes are aligned to the partition on the VMDK, which is aligned to the VMFS, which is aligned to the LUN on the storage array?
I used to run a CentOS 5 Linux IMAP server, MTA and webmail on the same box. It had fewer active accounts, and it was Cyrus rather than Dovecot, and I/O wasn’t really a problem even on ext3 (most would want a newer filesystem, but we didn’t need to move away from ext3).
When you’ve got lots of synchronous I/O and a write-intensive but mixed and seek-y load, a nice tip is to use RAID 50 (or 60) for your volumes, and use large (but 1s.
My old mailserver used 15 drives in RAID 60 (1 hot spare) for volumes + 6 in RAID 10 for journals.
@Javier – as @Wes mentioned, what’s being bandied about is a very complex solution, both mechanically and from a management standpoint. Though I disagree on Engenio (now NetApp), a.k.a. the IBM DS5000 series. Those are overkill. When configured properly – which pretty much requires taking a vacation with Cthulhu – even a DS5100 can do 32K IOPS.
There are a couple of problems that are going to make any multi-controller system harder to manage and maintain without a single front end like SVC, which is why I advocated it. The combination of high capacity and high random I/O basically means stacking boxes and manually balancing loads. The desire to keep costs down rules out SSDs completely due to the capacity issue. Any multi-controller solution is going to paint you into a corner; when performance goes down again, you’re going to have to buy more and rebalance everything, or live with uneven loading and performance.
There are a couple of things that can be done as far as Dovecot goes to reduce load in the meantime. Personally, I’d switch to a PostgreSQL 9.1 backend for authentication. That can reduce replication complexity – just replicate the database between two systems – and can reduce I/O, since PostgreSQL has a very solid caching strategy. The double login caching will be a non-issue, but all logins combined are probably a minuscule percentage of total I/O.
The folks upstairs (and probably some others) will freak at the suggestion, but, I’d consider switching to FreeBSD. You’re in a situation with high network IO and high disk IO, and that’s where FreeBSD excels. It also opens the door to a ZFS or Geom solution, both of which are very stable and very fast. The obvious downside is that it’s FreeBSD – most folks don’t know it, and you can’t just call RedHat up when things break.
I don’t think an auto-tiering solution is realistic or feasible here, partly due to lack of data. For an auto-tiering solution to solve the problem, you’d need a known number of ‘power users’ with a known amount of ‘hot’ data. At 80,000 users, it’s possible that auto-tiering storage would churn data, resulting in a performance hit. Manually tiering the indexes to an internal SSD set (either individual disks in ZFS or hardware RAID) and using HAST [1] or ZFS remote mirroring would offer the most bang for your buck and only requires a minor Dovecot tweak. That has the potential to reduce randomness and the total disk hit fairly significantly at minor cost and headache. Or, as @David Magda mentioned, serve the indexes off these over NFS.
[1] http://www.freebsd.org/doc/en/books/handbook/disks-hast.html
Ultimately though, the main reason I suggested SVC isn’t purely because of Dovecot. The reason is because it sounds like there’s a lot of growth planned. SVC is less an application solution, and more a total solution to cover all disk needs for growth long term. It’s more of a “buy this once, only manage the disks behind it once a year, spend less than a week migrating to new disks in 3 years” answer than a “I need IOPS!” answer.
I was thinking more of the Engenio 2600 aka IBM DS3500 — pretty much the lowest-end tier one storage. After all, that’s pretty much all you can afford for $30K.
And if you read carefully, the requirement is for very *low* capacity — 200 GB. That’s why I suggested a single array, no SVC, and SSDs.
In the recent past Midas used a great slogan in an advertising campaign: “I’m not going to pay a lot for this muffler.” When implementing any type of solution that delivers 12K IOPS, be it high performance storage infrastructure or otherwise, ‘not paying a lot’ will start well above $30K. Add to that the requirements for fault tolerance, VMware, etc., and unfortunately anything under $30K is not realistic.
Further, some of the cheaper solutions may end up costing you more anyway and not deliver the performance promised. A case in point is adding SSDs to existing legacy storage. You need to be careful there. SSD media is Superman. Legacy storage today, largely based on 15-20 year old architecture that was built specifically for spinning disk, is Kryptonite to SSD media. Plug SSDs into legacy storage, especially in a heavy write environment, and wear-leveling issues will have you replacing them within 1 to 2 years as performance and capacity decrease while errors increase, along with your ‘initial investment’. If you really want the full benefit of SSD media, you need to employ it within an architecture that is purpose-built for SSD media. Only these architectures can deliver the consistent high performance SSD media promises. As my son always reminds me, there is no such thing as a shortcut. If a shortcut were the ‘way’, it would be called the ‘way’.
My suggestion?
• Don’t shortcut, do it the right way
• Secure a realistic budget
• Look to an SSD solution that scales, has built-in HA, and is software based (not built on proprietary hardware)
• Look for a solution that requires little change to your application or infrastructure environment, and is frictionless
Kaminario is such an SSD solution. It’s a SAN: non-proprietary, built on an x86 platform, no single point of failure, incorporating Fusion-io; a minimum configuration will exceed 100K IOPS, no problem, at sub-millisecond latency. No need to be an integrator or software engineer or spend another minute re-engineering code, etc. Instead just install Kaminario (in under a day), move your indices or whatever data is driving your I/O to Kaminario, and the problem is immediately solved. It cannot get any simpler than that.
[Ken is a Kaminario employee.]
When did Storagemojo turn from offering good advice to being a podium for sales reps to make recommendations under the guise of helping when they don’t even invest the time to read or understand the problem?
The “This is perfect for Nexsan” suggestion is a great one. Last time I checked, Nexsan flogged cheap bulk storage and was dabbling in SAS and SSD as well, but nothing that would make it the perfect fit. This guy needs 200 GB of CHEAP IOPS; sorry, not a match for Nexsan.
Next we have recommendations for IBM SVC, with the first comment being “not enough money”. No sh~t, Sherlock! Anything built by IBM will quickly blow the $30K budget, as will NetApp. These are non-starters.
Finally we have Kaminario – last time I checked this was some pretty fantastic equipment with a price tag to match. So no surprises that the $30K budget is unrealistic.
PLEASE read his original post: they are a big LeftHand shop and his plan was to use the VSA and some SSDs to keep consistency in management and skill set. This is a wise first try; just invest in a couple of good quality SSDs (e.g., not OCZ) and give it a whirl — it can’t get much worse than it is!
@Ken, SSD for mail seems like massive overkill; the requirement was a peak of 8k IOPS, and a Kaminario rig is probably not what this guy is looking for.
Why pay $250-300k when you could get what the OP wanted for around $60k with far more capacity?
Thank you, I appreciate the comments. “@Ken, SSD for mail seems like massive overkill; the requirement was a peak of 8k IOPS, and a Kaminario rig is probably not what this guy is looking for.”
Yes, you are right. Mail is typically not an application for which an SSD appliance, be it Kaminario or any other, would normally be considered. But having said that, the value proposition can still be compelling. The appliance can be installed in half a day, with no code changes and no application changes; just move the indices, tempdb, redo logs, whatever, to the appliance and you’re done. The problem (poor I/O processing) is immediately solved. No one can argue with the simplicity of that. Overkill? Perhaps (probably), if only in the sense that the appliance can do so much more. Price? I’m not sure where $250-$300K came from. That too is very unrealistic; you could acquire a handful for that. But in the end, the simplicity of solving the problem in a single day without having to do anything else to the environment is, if nothing else, a consideration. Thanks again.
What I saw a week ago in an ISP’s datacenter was a setup that, with little change, MAY fit. In a few words, it was two 1U Supermicro servers interconnected via an InfiniBand link. Each server had four Gb Ethernet ports, one 40 Gb IB card, and 8x Intel 64 GB SLC drives in RAID 10 (256 GB total) behind one of LSI’s “new” RAID controllers (based on the LSI 2208 processor) with FastPath (claiming SSD optimizations). Both had Xeons, lots of RAM and, of course, redundant power supplies. The price of both machines was said to be $19,600. I guess that for a few thousand more, licenses for Open-E or Nexenta with Active/Active or Active/Passive and an iSCSI target can be bought.
Javier,
There are many good ideas in this thread, but I think the best is your approach of complementing existing infrastructure by adding a product to handle the most demanding workloads. I’m a fan of this very pragmatic approach.
Astute Networks offers a “storage appliance” called ViSX G3 that adds a 100% flash memory datastore to VMware environments using 1G/10G Ethernet and the iSCSI protocol. It delivers direct storage and sustained performance: there is no caching, tiering, de-dup, compression, or other price/performance ‘strategies’ involved. ViSX G3 is VMware Ready and managed by vCenter and vSphere as a standard datastore after installation so you already know how to use one.
You can get a ViSX G3 for $29,000 MSRP and this includes warranty, support and service.
ViSX G3 provides more capacity (over 1TB) and faster performance (80,000+ IOPS) than your indexes need, but I think you will have other virtual machines that can use faster datastores too. Just use vCenter to find your most demanding VMs and use Storage vMotion to migrate them to the flash datastore. I’m sharing this because ViSX G3 is fairly new so you might not find it in your search for a solution to your problem.
I hope this helps.
[Full disclosure: I work for Astute Networks.]
Maybe you should focus less on the storage side and more on the application side.
Mail is a problem that you can easily distribute across multiple servers, even on the storage side.
Just set up pairs of Dovecot servers with DAS storage shelves and use DRBD to replicate the data.
Use SATA for mailboxes and 15K SAS for indexes.
You just need a database table where you map each mailbox to the right server pair. Then put a VM with a Dovecot IMAP/POP3 proxy in front of them, and now you have a single access point for all your customers.
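A minimal sketch of what that mapping lookup might look like (the table, column names and hostnames are invented; in practice the Dovecot proxy would typically do this lookup itself through its SQL passdb/userdb rather than via a script):

```python
# Invented schema and hostnames, for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mailbox_location (
        mailbox   TEXT PRIMARY KEY,   -- e.g. user@example.com
        pair_name TEXT NOT NULL,      -- which DRBD pair holds this mailbox
        imap_host TEXT NOT NULL       -- active node of that pair
    )
""")
conn.execute("INSERT INTO mailbox_location VALUES (?, ?, ?)",
             ("user@example.com", "pair03", "imap-pair03-a.internal"))

def backend_for(mailbox):
    """Which backend the proxy should forward this login to."""
    row = conn.execute(
        "SELECT imap_host FROM mailbox_location WHERE mailbox = ?",
        (mailbox,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no backend assigned for {mailbox}")
    return row[0]

print(backend_for("user@example.com"))   # -> imap-pair03-a.internal
```

Rebalancing a mailbox is then just moving the data and updating its row.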
In addition, use the Dovecot LMTP server on the IMAP/POP3 pairs to get new mail in from your MTAs.
With this setup you don’t need iSCSI, NFS or any fancy storage array; you just use cheap servers and DAS shelves. You get the highest performance from your local storage and you can easily scale the service to your demands.
Yet there are two problems with this approach.
First, you have to modify your mailbox management software to account for the distribution of the mailboxes across the various IMAP/POP3 pairs. That should not be a problem if you use open source or in-house developed software. You may also need some scripts to balance the mailboxes between the server pairs; maybe you can use the Dovecot dsync feature for this.
Second, if you want HA with DRBD you can only scale in pairs.
However, you can use DRBD in a crossed setup, so every node is master for one mailbox volume and slave for another.
PS: I think Rackspace uses a similar setup for their mail service.
We use commodity storage when we need to do something low-cost. We special-order Polywells, the 222 TB ones, with Areca controllers. You should get decent IOPS off of these controllers. I would recommend SAS for your random I/O needs.
http://www.polywell.com/us/storage/NetDisk2012A.asp
How many ESX servers do you have in your environment?
Have you considered using the new VMware Storage Appliance and replicating local SSD storage across 3 ESX servers (I believe 3 is the max that the VSA supports)?
VSA licensing would be $5,995.00, plus the cost of local SSD storage.
If you have drive limitations on your ESX servers you can always opt for a PCIe SSD drive like the 240 GB RevoDrive 3.
I have not tested the IOPS on the VSA, but I don’t see why it could not deliver the required IOPS when using SSD drives and a good replication link.
I would not normally recommend doing something like this when you should get a SAN, but the price point is difficult to meet. It is crucial that your SAN provider gives you the support you need and can meet your SLA. If you do choose to go with a non-mainstream SAN provider, make sure that you will get the support you need and that they have a good reputation; ask for recommendations from some of their other customers. If you do not choose a SAN provider that can deliver the support you need, you may as well build your own solution that can meet your budget.
Of course, I had not read SM in a while, so I am very late to the discussion.
The short version is that we have something that should fit fairly well (performance requirements, price, capacity) in our siFlash unit (link on the website).
We have arrays that are built with either the SSD variant of flash or the PCIe variant (or both if you require). 8k IOPS sustained isn’t that hard … actually quite easy. You can see a video of me demoing one of these units (an SLC-based PCIe flash unit) at SC11 here: http://youtu.be/WSP_MxpMVeE … this particular unit was Virident-based, but we build a number of different types as indicated. With our iSCSI target (VMware certified, see http://scst.sourceforge.net/) and your price point requirement, we can fairly easily hit your size and performance targets within your price target. BTW: that unit hitting 400k-ish IOPS is back in our lab, and we’ve tested it with iSCSI over 10GbE, as well as FC and InfiniBand. We have some other … er … pretty interesting results.
If you are still looking, and still interested, feel free to contact me at the email address. We might be able to set up a remote access test for you if you are interested. Same goes for anyone else interested in this sort of capability.
Short version: it’s very doable, and you are looking in the right direction for it.