
Monday, 28 May 2018
The Great Cloud Filesystems Shoot-out, Round 1: LizardFS
LizardFS is a Free Software cloud file system under development since 2012. It was originally derived from MooseFS but, unlike MooseFS, there are no restricted features in the OSS version: the high-availability metadata server and advanced redundancy management are included.
Architecture
The file system is structured in a classic way. The different components are:
- the “Chunk Servers”, which provide the actual storage. Storage is based on a standard local file system.
- a “Master Server”, which takes on the classic role of a metadata server: it manages object naming and access, and directs clients to the “Chunk Server” holding the required data.
- a “Metalogger Server”, which keeps a log of the operations in progress so that the metadata can be recovered if the “Master Server” shuts down suddenly.
- optionally, one or more “Shadow Servers” which are backup metadata servers, in order to be able to take over from the “Master Server” in the event of a problem.
- and finally the FUSE client, which lets the file system be mounted locally.
Of course a physical server can simultaneously assume the functions of “Chunk Server” and “Master”, “Metalogger” or “Shadow Server”, as well as being a client.
Local storage will typically rely on RAID arrays. Redundancy at the server cluster level will therefore be added to any local RAID redundancy.
The cluster configuration is quite simple and described in the LizardFS documentation.
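As a rough illustration, a chunk server needs little more than the address of the master and the list of its local storage directories; something along these lines (the host name and paths are ours, not those of the tested setup):
# /etc/mfs/mfschunkserver.cfg
MASTER_HOST = master.example.lan
# /etc/mfs/mfshdd.cfg (one storage directory per line)
/mnt/chunk1
/mnt/chunk2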
Note that it is very easy to set different protection modes (redundancy “goals”) for different parts of the overall file system, with per-folder or even per-file granularity.
User Interface
LizardFS is used with fairly simple command line tools, and provides a convenient web interface to monitor the status of the cluster in all its aspects.
The administration essentially uses the lizardfs and lizardfs-admin commands to modify the cluster parameters.
The lizardfs-admin command communicates with the Master Server to change global parameters, for example:
lizardfs-admin list-goals [master ip] [master port]
The lizardfs command applies to file system objects and allows, for example, setting the redundancy level of a folder or a file. Example:
lizardfs setgoal -r important_file /mnt/lizardfs/IMPORTANT
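The setting can then be checked with the matching getgoal subcommand, for example:
lizardfs getgoal -r /mnt/lizardfs/IMPORTANT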
On the client side, mounting the storage is as simple as follows:
mfsmount -o big_writes,nosuid,nodev,noatime /mnt/lizardfs
To improve read performance, a few options can be added:
mfsmount -o big_writes,nosuid,nodev,noatime -o cacheexpirationtime=500 -o readaheadmaxwindowsize=4096 /mnt/lizardfs
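To make the mount persistent across reboots, an fstab entry in the MooseFS style should also work (a sketch; adjust the master address and the options to your setup):
mfsmount /mnt/lizardfs fuse mfsmaster=master.example.lan,big_writes,nosuid,nodev,noatime,_netdev 0 0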
Overall use is very simple for a seasoned Linux administrator.
The web interface presents all the reports as tabs; several tabs can be displayed simultaneously by clicking the small “+”.
Tested configuration
The tested configuration includes 5 “Chunk Servers”. One of them is also a Master, another a Metalogger and a third a Shadow Server. All are cluster clients. A sixth identical machine is also used as a client, either via the FUSE LizardFS module, or via Samba sharing of the LizardFS volume from one of the cluster members.
Each machine has 16 disks of 8 TB aggregated with Linux software RAID (md-raid), an 8-core Xeon D-1541 2.1 GHz processor and 16 GB of RAM, and is connected to the network over 10 Gigabit Ethernet.
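For reference, the two md-raid layouts used in the tests below could be created along these lines (a sketch; device names are illustrative, not those of the test machines):
mdadm --create /dev/md0 --level=6 --raid-devices=16 /dev/sd[b-q]   # single 16-disk RAID-6
mdadm --create /dev/md1 --level=0 --raid-devices=4 /dev/sd[b-e]    # one of the four 4-disk RAID-0 arrays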
The disk performance of the standalone servers was measured beforehand:
16 × 8 TB RAID-6
- write: 800 MB/s (CPU constrained)
- read: 2400 MB/s (1 stream), 1880 MB/s (4 streams)
4 × 8 TB RAID-0
- write: 1800 MB/s
- read: 2500 MB/s
In order to compare the relative performance of the local RAID-6 and the different redundancy modes of LizardFS, we performed a first test with disks configured as a RAID-6 array on the servers, and without any redundancy at the cluster level. For the following tests, we reconfigured the disks into 4 RAID-0 arrays of 4 disks, in order to maximize local disk performance, while comparing the different redundancy modes:
- no redundancy: for reference. Obviously not to be used in production!
- x2 replication: each data block is written to two different chunk servers. It is a “mirror” mode.
- XOR 3+1: a parity block is created for every 3 data blocks. Each of the 4 blocks is then written to a different chunk server. It is the equivalent of a “RAID-5” array of servers.
- Erasure Coding 3+2: two parity blocks are created for every 3 data blocks. Each of the 5 blocks is then written to a different chunk server. It is the equivalent of a “RAID-6” array of servers. (A sketch of how these modes can be declared as goals follows this list.)
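These redundancy modes correspond to goal definitions on the Master Server. A possible sketch of the matching entries in mfsgoals.cfg (the IDs and names are ours; check the commented example file shipped with LizardFS for the exact syntax):
# two copies (x2 replication)
10 mirror2 : _ _
# XOR, 3 data parts + 1 parity part
11 xor31 : $xor3
# erasure coding, 3 data parts + 2 parity parts
12 ec32 : $ec(3,2)
A folder is then assigned one of these goals with lizardfs setgoal, as shown earlier.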
Tests will focus on sequential performance.
The Tests
Test 1: RAID-0, no redundancy
The purpose of this test is to define the maximum performance, limited by network and disk bandwidth. This is our reference point, but of course such a configuration should never be used in production; it would be very risky.
Test 2: Local RAID-6, no cluster redundancy
This test evaluates the relative efficiency of the distributed parity calculation at the cluster level compared to a local parity calculation on each server. This mode allows the loss of one or two disks per server, but if an entire server failed the cluster would be inoperable.
Test 3: RAID-0, replication x2
This mode of replication will undoubtedly be favoured for the most important data. It is the simplest and in our case the fastest, especially in reading. The limiting factor here seems to be the network bandwidth between nodes.
Test 4: RAID-0, redundancy XOR 3+1
This redundancy mode allows the failure of a complete chunk (typically a server or RAID volume).
Test 5: RAID-0, redundancy Erasure Coding 3+2
This redundancy mode allows two chunks to fail. Although the documentation claims optimal performance in this mode, the high processor load limits performance, especially for writes.
Test 6: Local RAID-6, no redundancy, SMB sharing
The result of this test should be compared with that of test 2. Only one client was used, but it accessed the volume through an SMB mount rather than the LizardFS client. Performance is much lower, but still very satisfactory.
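For this test, the LizardFS volume mounted on one cluster member was re-exported with a plain Samba share; a minimal sketch of such a share (not our exact smb.conf):
[lizardfs]
   path = /mnt/lizardfs
   read only = no
   browseable = yes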
Conclusion
The ease of deployment and use of LizardFS makes it a good candidate for very massive storage systems, intended for archiving but also for production. In future tests, we will evaluate the behaviour in case of node failure and cluster extension by adding new systems. It will also be interesting to compare LizardFS to its main competitors.
posted at: 16:08
The Great Cloud Filesystems Shoot-out: Introduction
Long gone is the time when professionals with high data storage needs could be satisfied with a simple centralised system like a SAN or NAS. Nowadays, everyone is thinking about cloud, scalable architecture and capacity, redundancy…
Unfortunately, the number of software solutions, platforms, and open or proprietary systems available to build these newfangled storage systems is growing rapidly. What are the benefits and downsides of each solution? What are they best used for? Why choose this one over that one? That is the reason for this series of tests.
What’s a Cloud Filesystem?
A filesystem is a common way to organise data as files in a hierarchical tree of folders, generally adhering to the POSIX model. Usually the filesystem is hosted on a single storage device, like an SSD, a hard drive or a RAID array (which abstracts a set of physical drives into a virtual, unified storage volume).
A cluster filesystem allows several different computers (called “nodes”) to access the same filesystem (therefore the same folder tree and files). These cluster filesystems are divided into two families: SAN filesystems and Cloud filesystems (some solutions belong to both categories at the same time).
A SAN filesystem allows a set of systems connected through a storage area network (SAN) to share a common storage device (typically a large SAN-oriented RAID storage array) and to work simultaneously on the same tree of files. Redundancy and scalability are managed by the storage array subsystem. Some known solutions from this family are Quantum StorNext, HP (SGI) CXFS, IBM GPFS, Tiger Technology Tiger Store… In the Free Software/Open Source world, we have Red Hat GFS2, Oracle OCFS2… This family of filesystems is not what we'll be discussing here.
A Cloud storage solution gives several computers access to a shared storage volume, but instead of relying upon a shared, separate storage subsystem, it pools the local storage resources of each cluster member. There are many different cloud storage solutions; some can present their storage as a filesystem, while others cannot and primarily offer an “object-oriented interface”, targeted at large internet-based applications, which we won't cover here. Some have both a filesystem and an object interface, but in that case the object interface generally comes first.
Examples of cloud storage solutions (and their privileged access mode): Scality (object, file connector), Dell/EMC Isilon (file only), OpenIO (object only), CEPH (object, block, file), LizardFS (file), XtreemFS (file), MooseFS (file), RozoFS (file)…
A last group of cluster filesystems offers no data redundancy and is tailored for High Performance Computing and scientific applications; examples are Lustre, OrangeFS, BeeGFS, etc. These filesystems won't be detailed here.
Why use a Cloud Filesystem?
Most cloud filesystems have clear benefits in terms of scalability (ability to add more storage as needed without downtime) and redundancy (immunity to the loss of an entire storage node or even a complete datacentre).
We’ll be concentrating in this comparison on filesystems with the following features:
- ability to add a new node to expand the cluster capacity
- ability to withstand complete failure of at least one node
- ability to rebuild data from a failed node
Most of these filesystems work using a user-mode (FUSE) driver. The main benefit of FUSE drivers is that they are often also available for macOS and Windows, therefore giving client machines running these operating systems native access to the storage cluster instead of going through an additional file-sharing layer. The main downside of FUSE drivers is that they tend to be less robust than native Linux kernel drivers; for instance, under very low memory conditions, the driver may be killed by the “OOM killer”. Another limitation is that a FUSE filesystem cannot be shared using the kernel NFS server, which is faster than the user-mode NFS server.
The Competitors of this Shoot-out
We plan to test the following Cloud Filesystems:
- LizardFS
- RozoFS
- CEPH FS
- GlusterFS
The goal is to test these in an enterprise storage environment, for video production, post-production and archiving, rather than for purely internet-oriented uses like massive virtualisation, video streaming, etc.
posted at: 15:35