A post last month in ACM’s Queue raised a disturbing point around block-level deduplication in flash SSDs: it could hose your file system.
De-dup is a Good Thing, right?
Researchers found that at least one SandForce SSD controller – the SF-1200 – does block-level deduplication by default. Many file systems write critical metadata to multiple blocks in case one copy gets corrupted. But what if, unbeknownst to you, your SSD de-duplicates those copies, leaving your file system with only one physical copy?
Yup, corruption of 1 block could wipe out your entire file system. And since all the “copies” point to the same corrupted block, there’s no way to recover.
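To make the failure mode concrete, here is a toy sketch of a content-addressed block store of the kind a deduplicating controller might use. Purely illustrative – it is not any vendor's actual design – but it shows how two "redundant" metadata copies collapse into one physical block:

```python
import hashlib

# Toy model of a content-addressed (deduplicating) block store.
physical = {}       # content hash -> the single stored copy
logical_map = {}    # logical block address -> content hash

def write_block(lba, data):
    key = hashlib.sha256(data).hexdigest()
    physical.setdefault(key, data)   # identical content is stored only once
    logical_map[lba] = key

superblock = b"critical filesystem metadata" + bytes(484)   # pad to 512 bytes
write_block(0, superblock)        # primary copy
write_block(8192, superblock)     # "redundant" copy -- same content, same hash
print(len(physical))              # 1: both logical copies share one physical block

# Corrupt that single physical block and every "copy" is gone:
physical[logical_map[0]] = b"\xff" * 512
assert physical[logical_map[8192]] != superblock   # the backup is corrupted too
```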
Most Unix superblock-based FSs and ZFS could be pooched by loss of a single block. NTFS also mirrors critical metafile info and could be vulnerable as well.
To be fair, AFAIK no one has reported this failure in the wild, so it is conjecture today. That said, it may have happened to people who didn’t realize what went wrong.
But in the world of storage, if something can happen it will, usually at the worst possible time. Have you seen a total data loss on an otherwise functioning SSD?
The StorageMojo take
I’ve made calls to a number of vendors to get their responses, including Sandforce, Intel, Texas Memory Systems and OCZ. With any luck we’ll soon have a 1st pass on who does what to your data.
Don’t panic: not all SSD controllers do this. Texas Memory Systems controllers don’t, partly because they don’t use MLC flash and partly because minimizing capacity use and maximizing data availability are conflicting goals – and they chose availability over capacity.
Also note that the SF-1200 is offered as a consumer-grade controller. It isn’t clear what SandForce does with the rest of their line, but their site repeatedly references their “DuraWrite” technology, which appears to include block-level dedup.
Just last week StorageMojo recommended faster adoption of SSDs in the enterprise – and still does. But this once again underlines the need for mirroring. The sooner we find these issues, the sooner they’ll be fixed.
Watch the comments for vendor info, and I’ll update this post with more info if and when it develops.
Update: Here is the SandForce response:
In the recent article by David Rosenthal he mentions a conversation with Kirk McKusick and the ZFS team at Sun Microsystems (Oracle). That conversation explains why it is critical that metadata not be lost or corrupted. He goes on to say that “If the stored metadata gets corrupted, the corruption will apply to all copies, so recovery is impossible.”
SandForce employs a feature called DuraWrite which enables flash memory to last longer through innovative patent-pending techniques. Although SandForce has not disclosed the specific operation of DuraWrite and its 100% lossless write reduction techniques, the concept of deduplication, compression, and data differencing is certainly related. Through all the years of development and OEM testing with our SSD manufacturers and top-tier storage users, there has not been a single reported failure of the DuraWrite engine. There is no more likelihood of DuraWrite losing data than if it were not present.
We completely agree that any loss of metadata is likely to corrupt access to the underlying data. That is why SandForce created RAISE (Redundant Array of Independent Silicon Elements) and includes it on every SSD that uses a SandForce SSD Processor. All storage devices include ECC protection to minimize the potential that a bit can be lost and corrupt data. Not only do SandForce SSD Processors employ ECC protection enabling a UBER (Uncorrectable Bit Error Rate) of greater than 10^-17, but if the ECC engine is unable to correct the bit error, RAISE will step in to correct a complete failure of an entire sector, page, or block.
This combination of ECC and RAISE protection provides a resulting UBER of 10^-29, which virtually eliminates the probability of data corruption. This combined protection is much higher than any other currently shipping SSD or HDD solution we know about. The fact that ZFS stores up to three copies of the metadata and optionally can replicate user data is not an issue. All data stored on a SandForce Driven SSD is viewed as critical and protected with the highest level of certainty.
Readers: how does that sound to you?
End update.
Update 2: Oddly enough, the SandForce web site specifies the SF-1200 controller at:
ECC Recovery: Up to 24 bytes correctable per 512-byte sector
Unrecoverable Read Errors: Less than 1 sector per 10^16 bits read
which is about where many enterprise disk drives are spec’d – and quite a bit less than 10^-29. Hmm-m.
End update 2.
Update 3:
Spoke to James Myers of Intel. He said that no current Intel SSD uses any form of compression, including dedup. He also cautioned against making too much of the risk: after all, you’d have to have an unrecoverable read error AND it would have to be that 1 critical block. Perhaps, he suggested, file systems that do use multiple copies of critical FS metadata could slightly alter the copies to eliminate the possibility of deduplication.
End update 3.
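For what it’s worth, here is a rough sketch of what Myers’ suggestion might look like: stamp each redundant metadata copy with its copy index so the copies no longer hash alike. The field layout below is hypothetical, purely for illustration:

```python
import hashlib

def make_metadata_copy(metadata: bytes, copy_index: int) -> bytes:
    # Hypothetical layout: prepend the copy's index as a per-copy "salt"
    return copy_index.to_bytes(4, "little") + metadata

metadata = b"superblock contents..."
copies = [make_metadata_copy(metadata, i) for i in range(3)]
hashes = {hashlib.sha256(c).hexdigest() for c in copies}
print(len(hashes))   # 3 -- no two copies hash alike, so a dedup engine keeps all three
```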
Courteous comments welcome, of course. TMS has been advertising on StorageMojo for a couple of years.
Might this also apply to any system which uses block deduplication underneath a LUN that is formatted with a host-side file system?
The host’s file system would assume that redundant meta-data blocks were physically separate, but in reality they may share the same disk block.
Is this really a problem? There is probably a good argument that enterprise class error detection and RAID protection would provide ample redundancy to the physical storage.
I agree with Mr. Pemberton’s last point, that any enterprise system will have more than one copy of the data on different physical devices. With consumer-level implementations, what’s the typical failure? If it’s a single bit, then the ECC will help; if it’s an entire chip (or section of chip), then if you have two copies on disk you might lose them both anyway.
I know in the old days of FAT, every system had two copies of the file allocation table, and it seems whenever one was corrupt the OS would pick the wrong one as the “good” one…
Steve, to be clear, an unrecoverable read error is just that: an entire block is not recoverable after ECC, parity protection, predictive correction, prayer, etc. So we are talking about critical FS metadata becoming unrecoverable from that device forever. So far I haven’t seen any evidence that SandForce does this across ALL their controllers – enterprise & consumer – but they very well could be, as they tout “DuraWrite” technology – whatever that might include – across their entire product line.
Steven, you are correct that any block level dedup would perform the same way. But even with RAID, ECC and all the other signal processing/data reconstruction magick engineers can dream up, there are still blocks that become unreadable/unrecoverable.
Most flash SSDs I’ve looked at are spec’d at 1 URE in every 10^15 bits read or better, so we’re talking 100 TB to 1 PB. With small capacity drives – say 160 GB or less – most drives will never see a URE – and only rarely will that URE hit a critical FS metadata block. But when it does, that drive is gone. That’s when mirroring saves the day.
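To spell out the arithmetic behind that 100 TB to 1 PB range (a quick sketch, assuming URE specs of 1 per 10^15 and 1 per 10^16 bits read):

```python
# Convert URE specs from bits read to terabytes read
bits_per_terabyte = 8 * 10**12
print(10**15 / bits_per_terabyte)   # 125.0  -> ~125 TB read per expected URE
print(10**16 / bits_per_terabyte)   # 1250.0 -> ~1.25 PB read per expected URE
```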
Robin, I wanted to answer the additional question you raised above. You are correct in the stated ECC specifications of the SF-1200 listed on our web site. That particular specification does not include the higher UBER provided by our RAISE technology, which takes over at the point where the ECC would not be able to correct the error. We list the UBER of the ECC alone so it can be easily compared to other SSDs which do not have the benefit of our RAISE technology. As I mentioned above, ECC only protects against bit errors, while RAISE protects against the loss of whole sectors, flash pages, or flash blocks.
RAISE is available in all current SandForce SF-1000 and SF-2000 family SSD processors for the enterprise, client, and embedded/industrial markets, providing protection for metadata and user data against errors beyond the ability of the ECC.
Well, I would like to assert that this is only a problem for ZFS when you put your data pool directly on the SSD. From my perspective that isn’t the most common way to deploy SSDs with ZFS. Most people use them as part of the hybrid storage pool concept, and in that configuration it simply doesn’t matter if your block gets deduplicated away. The separate ZIL is just a fallback for when power fails (things have already gone really strange when you have to read from your ZIL and hit an error because deduplicated ZIL blocks all point to the same corrupted block … I’m a really paranoid storage guy, but that’s far out even on my scale of things), and when the L2ARC goes corrupt the data is simply loaded again from disk … the rotating one.
That said, this isn’t a problem for SSDs alone … anything that deduplicates at the block level is haunted by this issue.
Thanks, Robin, for following up on my ACM Queue piece with the vendors. I’ve responded on my blog at http://blog.dshr.org/2011/06/more-on-de-duplicating-flash.html
In brief I’d regard SandForce’s claims with the same considerable skepticism as I regard the UBER specs. For the reasons, see my earlier piece in ACM Queue at http://queue.acm.org/detail.cfm?id=1985003
The arguments presented by Rosenthal are good to keep in mind when integrating deduplication at the block device level rather than in the file system. With Albireo, we have experience with integration at both the block level and the file system level; there are a number of trade-offs when considering at what point to add deduplication. For this specific case, a sensible approach might be a protocol extension that allows a file system (or application) to indicate blocks that are not to be deduplicated. Much as TRIM allows SSDs to operate more efficiently and reliably, a UNIQUE extension would allow critical file system metadata to be preserved as multiple copies.
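As a thought experiment, here is a sketch of how such a UNIQUE hint might behave at the block layer. The flag name and the toy store below are hypothetical; no shipping protocol defines this today:

```python
import hashlib

physical_blocks = []   # every stored copy
dedup_index = {}       # content hash -> slot in physical_blocks

def write(data: bytes, unique: bool = False) -> int:
    key = hashlib.sha256(data).hexdigest()
    if not unique and key in dedup_index:
        return dedup_index[key]          # ordinary data may be deduplicated
    physical_blocks.append(data)
    slot = len(physical_blocks) - 1
    if not unique:
        dedup_index[key] = slot
    return slot

meta = b"fs metadata " * 40
assert write(meta, unique=True) != write(meta, unique=True)   # two physical copies kept
assert write(b"user data") == write(b"user data")             # normal writes still dedup
```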
Getting SandForce’s response is very thorough reporting, Robin. The ACM Queue letter omits this perspective.
Contrary to what SandForce states, I believe you will find that RAISE has been disabled by SandForce’s partners on most client-based SF-2xxx SSD offerings.
http://www.google.com/patents/US20120054415
SandForce presumably uses some sort of differential information update. When a block is modified, you find the difference between the old data and the new data. If the difference is small, you can encode it in a smaller number of bits in the flash page. If you do the difference encoding, you cannot garbage-collect the old data unless you reassemble the new data and rewrite it to a different location.
Difference encoding requires more time (an extra read, processing, etc.), so you must not do it when the write buffer is close to full. The controller can always choose whether or not to do the difference encoding.
It is definitely not deduplication. You can think of it as compression.
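Here is a rough sketch of the difference-encoding idea described in that comment – one possible reading of it, not SandForce’s actual DuraWrite implementation:

```python
# If a rewritten block differs from the old version in only a few bytes,
# store just the (offset, value) deltas; otherwise store the full new block.
def encode_update(old: bytes, new: bytes, threshold: int = 32):
    deltas = [(i, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]
    if len(deltas) <= threshold:
        return ("delta", deltas)    # small change: cheap to store, but the old
                                    # block can't be garbage-collected yet
    return ("full", new)            # large change: rewrite the whole block

old = bytes(512)
new = bytearray(old)
new[100] = 0x42                     # a one-byte update
kind, payload = encode_update(old, bytes(new))
print(kind, len(payload))           # delta 1 -- far less than 512 bytes to write
```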
Having read this, and having read yesterday that BTRFS opts to forgo metadata replication on SSDs, it occurs to me that a way around this would be to use encryption. The on-board controller would be completely unable to perform deduplication in this case. This would have to be software-side encryption, as using the drive’s on-board encryption wouldn’t hide the data from the SSD controller. It would also increase write amplification, etc., so know what you’re in for.
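A quick illustration of why software-side encryption defeats content-based dedup: with a per-sector tweak (AES-XTS here, via the Python cryptography package), identical plaintext blocks written to different sectors produce different ciphertexts, so they never hash alike. A sketch only, not a recommendation for any particular encryption stack:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(64)   # AES-256-XTS uses a 512-bit key (two 256-bit halves)

def encrypt_sector(plaintext: bytes, sector_number: int) -> bytes:
    tweak = sector_number.to_bytes(16, "little")   # per-sector tweak
    encryptor = Cipher(algorithms.AES(key), modes.XTS(tweak)).encryptor()
    return encryptor.update(plaintext) + encryptor.finalize()

block = bytes(512)   # the same (all-zero) metadata block written twice
c1 = encrypt_sector(block, 100)
c2 = encrypt_sector(block, 5000)
assert c1 != c2   # identical plaintexts, distinct ciphertexts -> nothing to dedup
```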