How MindSpore Overcomes Bottlenecks in Model Training

    Mar 29, 2022

    Distributed training is an effective solution to the training bottlenecks that ultra-large-scale networks now face. These bottlenecks stem from a mismatch between current hardware performance and the computing power we demand of it. As machine learning has developed, with ever greater emphasis on the accuracy and generalization of neural networks, the amount of data and the number of parameters used in model training have grown exponentially, while the hardware of a single node has not kept pace with the growing demand for computing power.

    Distributed training enables horizontal resource scaling to provide greater overall computing power, while reducing the load on hardware resources such as the memory and CPU of each node. The Parameter Server framework is well suited to MindSpore’s distributed training. It addresses these performance challenges with distributed machine learning algorithms and supports both asynchronous and synchronous stochastic gradient descent (SGD) training. Model computation runs in worker processes and model updates run in server processes, so the two kinds of resources can scale out independently of each other.
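
    To make the worker/server split concrete, here is a short, framework-agnostic sketch in Python. The class and function names are hypothetical and serve only to illustrate how asynchronous and synchronous SGD differ on the server side; they are not MindSpore APIs.

        import numpy as np

        class Server:
            """Holds the global model weights and applies gradient updates."""
            def __init__(self, num_params, lr=0.01):
                self.weights = np.zeros(num_params)
                self.lr = lr

            def pull(self):
                # Workers download the latest weights through Pull.
                return self.weights.copy()

            def push(self, grad):
                # Asynchronous SGD: apply each worker's gradient as soon as it arrives.
                self.weights -= self.lr * grad

            def push_sync(self, grads):
                # Synchronous SGD: wait for gradients from all workers,
                # then apply their average in a single update.
                self.weights -= self.lr * np.mean(grads, axis=0)

        def worker_step(server, local_grad_fn):
            # A worker pulls the current weights, runs forward and backward
            # computation on its local batch, and pushes the resulting gradient.
            weights = server.pull()
            grad = local_grad_fn(weights)
            server.push(grad)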

    Additionally, in a large data center, it is not uncommon for one or more compute, network, or storage nodes to break down. The Parameter Server framework can isolate the failed nodes to prevent them from interrupting ongoing training jobs.

    MindSpore uses the open source, lightweight, and efficient ps-lite interfaces. The ps-lite architecture has three components: server, worker, and scheduler (a sketch of how a process takes on one of these roles follows the list below).

    • The server saves model weights and backward computation gradients, and updates the model using gradients pushed by workers.
    • The worker pulls the model most recently updated by the server through the Pull API, performs forward and backward computation on the network, and uploads the gradients from the backward computation to a server through the Push API.
    • The scheduler establishes the communication relationship between the server and worker.
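
    The sketch below, based on the MindSpore Parameter Server training tutorial, shows how a single process takes on one of these roles and enables Parameter Server mode. The exact environment variable and API names may differ between MindSpore versions, so treat this as an assumption to check against the documentation.

        import os
        from mindspore import context

        # Each process is launched with one of the three ps-lite roles, selected by
        # the MS_ROLE environment variable (together with MS_SCHED_HOST, MS_SCHED_PORT,
        # MS_SERVER_NUM, and MS_WORKER_NUM):
        #   MS_SCHED   - scheduler: brokers connections between servers and workers
        #   MS_PSERVER - server: stores weights and applies pushed gradients
        #   MS_WORKER  - worker: runs forward/backward passes and pushes gradients
        print("Running as:", os.environ.get("MS_ROLE"))

        # Enable Parameter Server training mode for this process.
        context.set_ps_context(enable_ps=True)

        # Storing a network's weights on the servers is requested on an existing
        # mindspore.nn.Cell, for example:
        # net.set_param_ps()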

    Building on the remote communication capability and the abstract Push/Pull primitives provided by ps-lite, MindSpore supports synchronous SGD for distributed training. Compared with the synchronous AllReduce training method, Parameter Server offers better flexibility and scalability, as well as node failover capability.

    Furthermore, using the Huawei Collective Communication Library (HCCL) on Ascend and the NVIDIA Collective Communication Library (NCCL) on GPUs, MindSpore also provides a hybrid training mode that combines Parameter Server and AllReduce: some weights can be stored and updated through Parameter Server, while the others are trained with the AllReduce algorithm.
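
    A minimal sketch of this hybrid mode follows, again based on the MindSpore tutorial rather than a verified configuration: the hypothetical WideDeepStyleNet, its layer sizes, and the decision to place only the embedding table on the parameter servers are illustrative assumptions.

        from mindspore import context, nn
        from mindspore.communication import init
        from mindspore.context import ParallelMode

        class WideDeepStyleNet(nn.Cell):
            """Hypothetical network: a large embedding table plus a small dense head."""
            def __init__(self):
                super().__init__()
                self.embedding = nn.Embedding(vocab_size=1000000, embedding_size=64)
                self.dense = nn.Dense(64, 1)

            def construct(self, ids):
                # Average the looked-up embeddings, then score them with the dense layer.
                return self.dense(self.embedding(ids).mean(axis=1))

        init()                                  # initialize HCCL/NCCL collectives
        context.set_ps_context(enable_ps=True)  # enable Parameter Server mode
        context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)

        net = WideDeepStyleNet()
        # Keep only the large embedding table on the parameter servers; the dense
        # layer's gradients are still synchronized with AllReduce.
        net.embedding.set_param_ps()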

    Figure 1:  Distributed training with Parameter Server

    See the MindSpore Parameter Server training tutorial for instructions on how to use the framework.

    Explore more content like this from the MindSpore community.



    Disclaimer: Any views and/or opinions expressed in this post by individual authors or contributors are their personal views and/or opinions and do not necessarily reflect the views and/or opinions of Huawei Technologies.
