Speaking Your Language: The Transformer in Machine Translation


    Feb 01, 2022

    Machine translation has emerged as a key topic in the AI field in recent years, and the race is on for large companies to launch their own machine translation services.

    Today, we’ll explore a major technique behind machine translation: The Transformer Model.

    The Transformer is a deep learning model that was first proposed in 2017. It adopts a “self-attention” mechanism, which improves performance on Neural Machine Translation (NMT) tasks relative to the traditional Recurrent Neural Network (RNN) model and allows training on Natural Language Processing (NLP) tasks to be accelerated.

    First, let’s take a brief look at the traditional machine learning model for machine translation, RNN.

    RNN Model

    Neural networks, and in particular RNNs, were once the leading approach for language-understanding tasks such as machine translation.

    Figure 1: RNN model (image source: Understanding LSTM Networks)

    RNNs can perform tasks on inputs of varying lengths, ranging from a single word to a whole document, which makes them well suited to natural language modeling. However, because RNNs produce their hidden state vectors through a recurrent computation, they must process the tokens of a sequence one at a time and in order, limiting the applicability of the RNN model.

    The two major weaknesses of the RNN model are:

    • RNNs scale poorly because it is fundamentally difficult to parallelize their state computations.
    • RNNs suffer from vanishing and exploding gradients, so they struggle to model long sequences with long-term dependencies.
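    To make the sequential bottleneck concrete, here is a minimal toy RNN cell in NumPy (the weights and sizes are invented for illustration). Each hidden state depends on the previous one, so the loop over time steps cannot be parallelized:

```python
import numpy as np

# A minimal sketch of an RNN cell with toy sizes, illustrating why
# state computation is inherently sequential: h_t depends on h_{t-1}.
rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 4, 8, 5

W_xh = rng.normal(scale=0.1, size=(d_in, d_hidden))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden-to-hidden weights
b_h = np.zeros(d_hidden)

xs = rng.normal(size=(seq_len, d_in))   # a toy input sequence
h = np.zeros(d_hidden)                  # initial hidden state

for x in xs:                            # each step must wait for the previous one
    h = np.tanh(x @ W_xh + h @ W_hh + b_h)

print(h.shape)  # (8,)
```

    Contrast this with self-attention below, where every position of the sequence is handled in a single matrix product.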

    Transformer Model

    The Transformer model, like RNN models, is designed to process sequential input data for natural language tasks, such as translation. However, unlike RNNs, the Transformer does not necessarily process the input data in sequential order. Instead, the self-attention mechanism (shown in Figure 2) identifies the context which gives meaning to each position in the input sequence, allowing more parallelization than RNN models and reducing the training time.
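    As a rough sketch of the idea, a single-head scaled dot-product self-attention step can be written in a few lines of NumPy. For brevity, this toy version skips the learned query/key/value projections that a real Transformer applies; note that all positions are processed at once via matrix products, which is what enables the parallelization mentioned above:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence X of shape (n, d).

    Toy single-head version: a real Transformer first projects X into
    queries, keys, and values with learned matrices; here X plays all
    three roles to keep the sketch short.
    """
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)            # pairwise similarity, scaled by sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X                       # each output mixes all positions

X = np.arange(12, dtype=float).reshape(4, 3)  # a toy sequence of 4 positions
out = self_attention(X)
print(out.shape)  # (4, 3)
```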

    Figure 2: Self-attention mechanism (image source: Attention Is All You Need)


    Figure 3: Transformer architecture (image source: Attention Is All You Need)

    Similar to the Sequence-to-Sequence (seq2seq) machine translation model, the Transformer model is also based on the encoder-decoder architecture. However, the Transformer differs from the seq2seq model in three ways:

    Transformer Block: The recurrent layer in seq2seq is replaced by a Transformer Block. In the encoder, each block contains a multi-head self-attention layer followed by a Position-Wise Feed-Forward network. In the decoder, an additional multi-head attention layer attends over the encoder’s output states.
    Add & Norm: The inputs and outputs of both the multi-head attention layer and the Position-Wise Feed-Forward network are processed by Add & Norm layers, each of which combines a residual connection with layer normalization.
    Positional Encoding: Since the self-attention layer does not distinguish the order of items in a given sequence, a positional encoding layer is used to add sequential information to each sequence item.
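    The sinusoidal positional encoding proposed in “Attention Is All You Need” can be sketched as follows (the sequence length and model dimension are arbitrary toy values):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention Is All You Need":

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]          # positions, shape (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)                # odd dimensions get cosine
    return pe                                  # added element-wise to embeddings

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```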

    How the Transformer works

    Using the Transformer involves three main stages: data preprocessing, model training, and model prediction.

    Data Preprocessing

    The data is preprocessed with a tokenizer before being fed into the Transformer model: the input text is split into tokens, and the generated tokens are then converted into the token IDs the model works with.
    For example, with the Hugging Face Transformers library (which supports PyTorch), tokenizers are instantiated using the “AutoTokenizer.from_pretrained” method in order to:

    1. Get the tokenizer that corresponds to the chosen pretrained model in a one-to-one mapping.
    2. Download the token vocabulary that the model needs when using the model’s specific tokenizer.
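    To illustrate just the tokenize-then-convert step, here is a toy sketch with an invented whitespace tokenizer and a hand-made vocabulary. A real pipeline would instead load both from a pretrained model, e.g. via AutoTokenizer.from_pretrained:

```python
# Toy vocabulary and tokenizer, invented for illustration only.
# Real tokenizers are subword-based and downloaded with the model.
vocab = {"<unk>": 0, "<bos>": 1, "<eos>": 2,
         "the": 3, "transformer": 4, "translates": 5, "sentences": 6}

def encode(text):
    tokens = text.lower().split()                       # naive whitespace tokenization
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]  # token -> ID
    return [vocab["<bos>"]] + ids + [vocab["<eos>"]]    # add special tokens

print(encode("The Transformer translates sentences"))
# [1, 3, 4, 5, 6, 2]
```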

    Model Training

    Teacher Forcing is a popular training method for neural machine translation. During training, it feeds the decoder the ground-truth token, rather than the model’s own prediction from the previous time step, as input, and thus reduces the training time.
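    Conceptually, the training loop looks like the sketch below. Here decode_step is a hypothetical stand-in for running the decoder network; the point is only that the next decoder input comes from the target sequence, not from the model’s prediction:

```python
# Conceptual sketch of teacher forcing. `decode_step` is a placeholder:
# a real implementation would run the decoder network here.
def decode_step(prev_token, state):
    return (prev_token + 1) % 10, state   # dummy "prediction" rule

target = [5, 7, 2, 9]         # ground-truth output tokens
state = None

# Teacher forcing: the decoder input at step t is the TRUE token from
# step t-1, regardless of what the model predicted. With the Transformer's
# non-recurrent decoder, all steps can then even be trained in parallel.
predictions = []
prev = 0                      # toy start-of-sequence id
for true_tok in target:
    pred, state = decode_step(prev, state)
    predictions.append(pred)  # the prediction is used only for the loss
    prev = true_tok           # feed the ground truth, not `pred`

print(predictions)  # [1, 6, 8, 3]
```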

    Model Prediction

    1. The encoder encodes the input sentence in the source language.
    2. The decoder takes the representation generated by the encoder, together with the start-of-sequence token, as its initial input.
    3. At each decoder time step, the token predicted at the previous time step is fed into the decoder as an input, so the output sequence is predicted token by token. When the end-of-sequence token is predicted, the prediction of the output sequence is complete.
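    The three steps above amount to a greedy decoding loop, sketched here with hypothetical stand-in functions for the encoder and decoder (the token ids and the placeholder prediction rule are invented for illustration):

```python
# Toy greedy-decoding loop matching the three prediction steps.
# `encode_source` and `decode_step` are placeholders, not a real model.
BOS, EOS = 1, 2                             # toy special-token ids

def encode_source(tokens):
    return sum(tokens)                      # placeholder "encoder representation"

def decode_step(enc_state, prev_token):
    nxt = prev_token + 3                    # placeholder next-token rule
    return nxt if nxt < 10 else EOS

src = [4, 5, 6]
enc_state = encode_source(src)              # step 1: encode the source sentence

output, prev = [], BOS                      # step 2: start from the start token
while True:                                 # step 3: predict token by token
    tok = decode_step(enc_state, prev)
    if tok == EOS:                          # stop at the end-of-sequence token
        break
    output.append(tok)
    prev = tok                              # feed the prediction back in

print(output)  # [4, 7]
```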

    MindSpore and the Transformer

    MindSpore is a deep learning framework that aims to deliver easy development, efficient execution, and all-scenario coverage. Meet MindSpore and learn how it supports the Transformer model.


    In this blog, we have given you an insight into the Transformer model for machine translation. If you’d like to learn more about the Transformer, we recommend the following learning resources, which are also the main references of this article.

    1. GitHub’s article on Dive into Deep Learning
    2. Google’s Attention Is All You Need
    3. Hugging Face’s guide to Transformers


    Disclaimer: Any views and/or opinions expressed in this post by individual authors or contributors are their personal views and/or opinions and do not necessarily reflect the views and/or opinions of Huawei Technologies.

