Wednesday, December 22, 2010
I tested various real-time online deduplication implementations at the filesystem level under Linux, namely Opendedup, Lessfs and a couple of proprietary ones. The great thing, of course, is that it’s transparent: you can share data using all protocols at once (NFS, CIFS, iSCSI, FTP, whatever), meaning you can use your dedupe filesystem exactly like an ordinary one.
The main concern is performance, and particularly IOPS: some systems achieve decent throughput, but in every case the I/Os are essentially turned into random I/Os, so dedupe hurts performance badly any way you turn it.
Some will ask: why do all I/Os become random? Because while you’re sending data to the “deduplicator”, it checksums blocks of data as they arrive and compares them against its hash index of known blocks, discarding any block it has seen before. On a big filesystem this index won’t fit in memory, so it is continuously consulted on disk during writes; the access pattern becomes: write a bit sequentially, read the index, write some more data, read the index…
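The mechanism described above can be sketched in a few lines. This is a toy, hypothetical model, not the actual Opendedup or Lessfs implementation: it keeps the hash index in memory (the whole point of the paragraph is that real indexes don’t fit there), uses a fixed 4 KiB block size, and stands in for a filesystem with a simple class. The names (`Deduplicator`, `write`, `read`, `stored_bytes`) are all invented for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size; real systems vary


class Deduplicator:
    """Toy inline deduplicator: a hash index mapping block
    checksums to stored blocks, plus a per-file block recipe."""

    def __init__(self):
        self.index = {}    # checksum -> stored block (the hash index)
        self.recipes = {}  # filename -> list of checksums

    def write(self, name, data):
        recipe = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            h = hashlib.sha256(block).hexdigest()
            if h not in self.index:
                # previously unseen block: store it
                self.index[h] = block
            # duplicate blocks are merely referenced, not stored again
            recipe.append(h)
        self.recipes[name] = recipe

    def read(self, name):
        # each referenced block may live anywhere on disk,
        # which is why dedupe reads end up quasi-random
        return b"".join(self.index[h] for h in self.recipes[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.index.values())
```

For instance, writing two 12 KiB files that share their first 8 KiB stores only three unique 4 KiB blocks instead of six, while both files still read back intact.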
Reading is a different matter. Suppose about half the data is not unique: when you read what looks like sequential data, you are actually reading blocks that were written at different times and are therefore physically scattered around the disks. Reads are thus almost certainly quasi-random, seriously limiting performance.
An example from real benchmarks: I created 60 TB of semi-randomized data, occupying a total of 19 TB after combined compression and dedupe. Without dedupe the filesystem does about 350 MB/s write, 500 MB/s read, and 1500 IOPS. With dedupe the figures are 120 MB/s write (though it swings wildly, from 30 MB/s to 300 MB/s at times) and 200 MB/s read; IOPS, however, drop to an abysmal 40 to 50.
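The space savings in those figures work out as follows; this is just arithmetic over the numbers quoted above:

```python
logical_tb = 60   # data written
physical_tb = 19  # space actually occupied

ratio = logical_tb / physical_tb        # combined dedupe+compression ratio
savings = 1 - physical_tb / logical_tb  # fraction of space saved

print(f"{ratio:.2f}:1 reduction, {savings:.0%} space saved")
# roughly a 3.16:1 reduction, i.e. about 68% of the space saved
```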
My conclusion is that dedupe is generally not advisable for primary storage; as a “warm” archive or backup target, though, or for rarely accessed files, it is wonderfully effective.
posted at: 17:01 permalink