Distributed File Systems of the Future


    Feb 15, 2022

    With big data and artificial intelligence becoming the norm, storage in the data center is once again taking center stage. On the compute side, multi-node architectures for simulations, analyses, and forecasts have become established. Compute clusters consisting of hundreds or even thousands of servers demand ever greater performance from the storage environment. After all, CPUs need to be fed. What could be more obvious than carrying the scale-out approach over from compute to storage?

    This poses new challenges for distributed file systems.

    Two Options

    Essentially, there are two ways to make IT systems more powerful. The first is scale-up: you keep upgrading the individual node by adding or replacing CPUs, storage, and I/O capabilities. The second is scale-out: you simply add more nodes. For various reasons, including cost, flexibility, and technical capabilities, the trend today is increasingly toward scale-out. On the software side, the Message Passing Interface (MPI) is used to make the compute cluster behave as one complete system. Distributed file systems do the same job in the storage area. Their primary task is to make the aggregated performance of the individual storage nodes available to the compute processes.

    All of this sounds simple but can prove tricky to implement. If you write a large file, for instance, you can divide it into individual blocks and send them to multiple storage nodes in parallel. But this approach does not help if you have lots of small files, because a file that fits in a single block gains nothing from being split.
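    To make the block idea concrete, here is a minimal sketch of round-robin striping. The function names, the block size, and the node count are assumptions chosen for illustration; a real distributed file system would pick its own layout and send the blocks over the network rather than keep them in lists.

```python
def stripe(data: bytes, num_nodes: int, block_size: int) -> list[list[bytes]]:
    """Cut data into fixed-size blocks and deal them round-robin to nodes."""
    if num_nodes < 1 or block_size < 1:
        raise ValueError("num_nodes and block_size must be positive")
    placement: list[list[bytes]] = [[] for _ in range(num_nodes)]
    for index, offset in enumerate(range(0, len(data), block_size)):
        # Block i goes to node i mod num_nodes.
        placement[index % num_nodes].append(data[offset:offset + block_size])
    return placement


def reassemble(placement: list[list[bytes]]) -> bytes:
    """Read the blocks back in their original order."""
    blocks = []
    rows = max(len(node_blocks) for node_blocks in placement)
    for row in range(rows):
        for node_blocks in placement:
            if row < len(node_blocks):
                blocks.append(node_blocks[row])
    return b"".join(blocks)


large_file = bytes(range(256)) * 64      # 16 KiB, hypothetical "large" file
small_file = b"tiny"                     # fits in a single block

large_layout = stripe(large_file, num_nodes=4, block_size=1024)
small_layout = stripe(small_file, num_nodes=4, block_size=1024)

# How many nodes actually hold part of each file.
nodes_used_large = sum(1 for blocks in large_layout if blocks)
nodes_used_small = sum(1 for blocks in small_layout if blocks)
```

    In this sketch the large file keeps all four nodes busy, while the small file lands on a single node, which is exactly why striping alone does not help a small-file workload.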

    What might an optimally distributed file system look like?

    As you know, a file system is a hierarchy of folders (directories) and files. This gives us the first opportunity to split up workloads: different storage nodes are responsible for different directories and files. Something has to keep track of this, and in distributed file systems that is the metadata service (MDS). Among other things, it manages the hierarchy and knows which data is stored where. The (object) storage devices (OSDs) are responsible for actually storing the data on SSDs or HDDs.

    Moving on to the file level, large files should be distributed across multiple storage systems. Small files are each forwarded whole to a single OSD, with different files going to different OSDs.
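    The division of labor between MDS and OSDs can be sketched roughly as follows. Everything here is hypothetical: the class name, the hash-based choice of a starting OSD, and the size threshold are assumptions for illustration, not a description of how any particular product places data.

```python
import hashlib


class MetadataService:
    """Toy in-memory MDS: remembers which OSDs hold the blocks of each path."""

    def __init__(self, osd_ids: list[str], stripe_threshold: int, block_size: int):
        self.osd_ids = osd_ids
        self.stripe_threshold = stripe_threshold  # files above this are striped
        self.block_size = block_size
        self.layout: dict[str, list[str]] = {}    # path -> OSD id per block

    def _start_osd(self, path: str) -> int:
        # Deterministic hash so different paths start on different OSDs.
        digest = hashlib.sha256(path.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.osd_ids)

    def place(self, path: str, size: int) -> list[str]:
        """Decide where a new file goes and record the decision."""
        if size <= self.stripe_threshold:
            num_blocks = 1                             # small file: one OSD
        else:
            num_blocks = -(-size // self.block_size)   # ceiling division
        start = self._start_osd(path)
        osds = [self.osd_ids[(start + i) % len(self.osd_ids)]
                for i in range(num_blocks)]
        self.layout[path] = osds
        return osds

    def locate(self, path: str) -> list[str]:
        """Answer a client asking where the data for a path lives."""
        return self.layout[path]


mds = MetadataService(["osd0", "osd1", "osd2", "osd3"],
                      stripe_threshold=4096, block_size=4096)
small = mds.place("/project/notes.txt", size=100)
large = mds.place("/project/checkpoint.bin", size=6 * 4096)
```

    With these assumed parameters, the small file occupies one OSD and the large file is spread over all four. The sketch also shows why the MDS matters so much: if `layout` is lost, nothing else records how the blocks fit together.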

    Now we see that the MDS not only has a lot of work to do but also carries great responsibility. Without the MDS, we no longer know where the stored data is or how it is put together. You can address both concerns, load and availability, by running a synchronized MDS on each storage node. If one node fails, another can take over its tasks. This gives us many MDS instances, but we don’t yet know which one the clients should talk to.

    To answer that, we distribute the clients across the storage nodes via load balancing. A distributed lock mechanism prevents multiple clients from changing the same data at the same time. This ensures that write requests are serialized.
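    The effect of such a lock can be illustrated with a small single-process sketch. Threads stand in for clients and a per-path `threading.Lock` stands in for the distributed lock service; both are assumptions made to keep the example runnable, since a real lock manager has to coordinate across nodes and handle failures.

```python
import threading


class LockManager:
    """Stand-in for a distributed lock service: one lock per path."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: dict[str, threading.Lock] = {}

    def lock_for(self, path: str) -> threading.Lock:
        # The guard makes sure every client gets the same lock object per path.
        with self._guard:
            return self._locks.setdefault(path, threading.Lock())


def run_clients(num_clients: int, writes_per_client: int) -> int:
    """Let several clients update the same record and return the final value."""
    manager = LockManager()
    store = {"/shared/counter": 0}

    def client() -> None:
        for _ in range(writes_per_client):
            # Only one client at a time may read, modify, and write the record.
            with manager.lock_for("/shared/counter"):
                current = store["/shared/counter"]
                store["/shared/counter"] = current + 1

    threads = [threading.Thread(target=client) for _ in range(num_clients)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store["/shared/counter"]


final_value = run_clients(num_clients=8, writes_per_client=1000)
```

    Because every read-modify-write happens while the lock is held, no update is lost and the final value equals the total number of writes. Without the lock, two clients could read the same old value and one of the writes would silently disappear.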

    And there’s another option for HPC. A dedicated client driver can address the individual OSDs directly. It takes over the tasks of the MDS on the compute node itself and can therefore use the performance of the entire storage network.

    We’ve already explored the availability of the MDS, but what about the availability of the data?

    There are two options here. Some distributed file systems rely on the availability of the underlying hardware. In other words, they use RAID. More modern solutions use erasure coding. This provides availability not only across many disks but also across many nodes. In such a system, not only can individual HDDs or SSDs fail, but also multiple storage systems, all while retaining the storage efficiency inherent to erasure coding.
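    The principle can be shown with the simplest possible erasure code, a single XOR parity block. This is a deliberately reduced sketch: it survives the loss of only one block, whereas production systems typically use codes such as Reed-Solomon that tolerate several failures. The function names and the 4+2 layout in the efficiency calculation are assumptions for illustration only.

```python
def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equally sized blocks together byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def encode(data_blocks: list[bytes]) -> list[bytes]:
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe: list[bytes], lost_index: int) -> bytes:
    """Rebuild the block at lost_index from all surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


def storage_efficiency(data_blocks: int, parity_blocks: int) -> float:
    """Fraction of raw capacity that holds user data."""
    return data_blocks / (data_blocks + parity_blocks)


# Each block would live on a different storage node.
stripe = encode([b"node", b"fail", b"safe"])
rebuilt = recover(stripe, lost_index=1)   # pretend the second node is gone

ec_efficiency = storage_efficiency(4, 2)        # hypothetical 4+2 layout
replica_efficiency = storage_efficiency(1, 2)   # three full copies
```

    Here `rebuilt` is identical to the lost block, even though no node holds a copy of it. The efficiency figures show the appeal: a 4+2 layout keeps two thirds of the raw capacity for user data while tolerating two failures, compared with one third for three-way replication.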

    If you now package all this in state-of-the-art hardware, support a range of access protocols (NFS, SMB, S3, HDFS, and POSIX/MPI-IO), and add a very simple management interface, you have the basis for the Huawei OceanStor Pacific. In addition, there are many more services, such as QoS, snapshots, replication, and authorization management, each of which would be worth its own article.

    Finally, a very personal forecast. If someone were to ask me what storage in the data center of the future will look like, I would say it looks exactly like OceanStor Pacific: totally flexible, capable of handling any workload, easy to use, free of migrations, and scalable up to the exabyte range and, if necessary, back down again.

    Disclaimer: Any views and/or opinions expressed in this post by individual authors or contributors are their personal views and/or opinions and do not necessarily reflect the views and/or opinions of Huawei Technologies.

