Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Kenton Murray, Jeffery Kinnison, Toan Q. Nguyen, Walter Scheirer, David Chiang

Introduction

Encoder-decoder based neural network models are the state of the art in machine translation. However, these models are highly sensitive to the choice of hyperparameters and architecture. This problem is exacerbated in very low-resource data settings, where the potential to overfit is high. Unfortunately, searching for good settings is computationally expensive. For instance, Britz et al. (2017) used over 250,000 GPU hours to compare various recurrent neural network based encoders and decoders for machine translation. Strubell et al. (2019) demonstrated that neural architecture search for a large NLP model emits over four times the carbon dioxide that a car emits over its entire lifetime.

Moreover, optimal settings are highly dependent on both the model and the task, which means that this process must be repeated often. As a case in point, the Transformer architecture has become the best performing encoder-decoder model for machine translation (Vaswani et al., 2017), displacing RNN-based models (Bahdanau et al., 2015) along with much conventional wisdom about how to train such models. Vaswani et al. ran experiments varying numerous hyperparameters of the Transformer, but only on high-resource datasets among linguistically similar languages. Popel and Bojar (2018) explored ways to train Transformer networks, but only on a high-resource dataset in one language pair. Less work has been devoted to finding best practices for smaller datasets and linguistically divergent language pairs.

In this paper, we apply auto-sizing (Murray and Chiang, 2015), which is a type of architecture search conducted during training, to the Transformer. We show that it is effective on very low-resource datasets and can reduce model size significantly, while being substantially faster than other architecture search methods. We make three main contributions.

1. We demonstrate the effectiveness of auto-sizing on the Transformer network by significantly reducing model size, even though the number of parameters in the Transformer is orders of magnitude larger than previous natural language processing applications of auto-sizing.

2. We demonstrate the effectiveness of auto-sizing on translation quality in very low-resource settings. On four out of five language pairs, we obtain improvements in BLEU over a recommended low-resource baseline architecture. Furthermore, we are able to do so an order of magnitude faster than random search.

3. We release GPU-enabled implementations of proximal operators used for auto-sizing. Previous authors (Boyd et al., 2010; Duchi et al., 2008) have given efficient algorithms, but they do not necessarily parallelize well on GPUs. Our variations are optimized for GPUs, implemented as a general toolkit, and released as open-source software (https://github.com/KentonMurray/ProxGradPytorch).

Hyperparameter Search

While the parameters of a neural network are optimized by gradient-based training methods, hyperparameters are values that are typically fixed before training begins, such as layer sizes and learning rates, and can strongly influence the outcome of training. Hyperparameter optimization is a search over the possible choices of hyperparameters for a neural network, with the objective of minimizing some cost function (e.g., error or time to convergence). Hyperparameters may be selected using a variety of methods, most often manual tuning, grid search (Duan and Keerthi, 2005), or random search (Bergstra and Bengio, 2012). Other methods, such as Bayesian optimization (Bergstra et al., 2011; Snoek et al., 2012), genetic algorithms (Benardos and Vosniakos, 2007; Friedrichs and Igel, 2005; Vose et al., 2019), and hypergradient updates (Maclaurin et al., 2015), attempt to direct the selection process based on the objective function. All of these methods require training a large number of networks with different hyperparameter settings.

In this work, we focus on a type of hyperparameter optimization called auto-sizing, introduced by Murray and Chiang (2015), which requires training only one network, once. Auto-sizing drives groups of weights in a parameter tensor to zero through regularization. Murray and Chiang (2015) focused on the narrow case of two hidden layers in a feed-forward neural network with a rectified linear unit activation. In this work, we look at the broader case of all of the non-embedding parameter matrices in the encoder and decoder of the Transformer network.

GPU Optimized Proximal Gradient Descent

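Murray and Chiang (2015) train a neural network while using a regularizer to prune units from the network, minimizing:

$$\mathcal{L}=-\sum_{f,e\ \text{in data}}\log P(e\mid f;W)+\lambda R(W)$$

where $W$ are the parameters of the model, $R$ is a regularizer, and $\lambda$ is the regularization coefficient. For simplicity, assume that the parameters form a single matrix $W$ of weights. Murray and Chiang (2015) try two regularizers:

$$R(W)=\sum_{i}\Bigl(\sum_{j}W_{ij}^{2}\Bigr)^{1/2}\qquad(\ell_{2,1})$$

$$R(W)=\sum_{i}\max_{j}|W_{ij}|\qquad(\ell_{\infty,1})$$

The optimization is done using proximal gradient descent (Parikh and Boyd, 2014), which alternates between stochastic gradient descent steps and proximal steps:

$$W\leftarrow W-\eta\nabla\bigl(-\log P(e\mid f;W)\bigr)$$

$$W\leftarrow\operatorname*{arg\,min}_{W'}\Bigl(\frac{1}{2\eta}\|W-W'\|^{2}+\lambda R(W')\Bigr)$$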
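As a rough illustration of this alternation, the sketch below performs one gradient step followed by one proximal step. It assumes a row-wise group $\ell_{2}$ penalty (the $\ell_{2,1}$ norm considered by Murray and Chiang (2015)), for which the proximal step is the standard group soft-thresholding operation. The function names `prox_l21_rows` and `proximal_sgd_step` are hypothetical and are not taken from the released toolkit; the code uses plain Python lists rather than GPU tensors.

```python
import math

def prox_l21_rows(W, t):
    """Group soft-thresholding: the proximal operator of t * sum_i ||W[i]||_2."""
    out = []
    for row in W:
        norm = math.sqrt(sum(x * x for x in row))
        # Rows whose norm is at most t are zeroed entirely; others shrink toward 0.
        scale = max(0.0, 1.0 - t / norm) if norm > 0 else 0.0
        out.append([scale * x for x in row])
    return out

def proximal_sgd_step(W, grad, eta, lam):
    """One gradient step with learning rate eta, then one proximal step."""
    stepped = [[w - eta * g for w, g in zip(w_row, g_row)]
               for w_row, g_row in zip(W, grad)]
    return prox_l21_rows(stepped, eta * lam)

W = [[3.0, 4.0], [0.3, 0.4]]
grad = [[0.0, 0.0], [0.0, 0.0]]  # zero gradient isolates the proximal step
W_new = proximal_sgd_step(W, grad, eta=1.0, lam=1.0)
print(W_new)  # the second row (norm 0.5) is pruned; the first (norm 5) shrinks
```

The second row is driven exactly to zero, which is the mechanism by which a whole unit is pruned rather than merely made small.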

The algorithm starts by taking the absolute value of each entry and sorting the entries in decreasing order. Figure 1a shows a histogram of sorted absolute values of an example $\mathbf{v}$. Intuitively, the goal of the algorithm is to cut a piece off the top with area $\eta\lambda$ (in the figure, shaded gray).

We can also imagine the same shape as a stack of horizontal layers (Figure 1b), each $i$ wide and $\delta_{i}$ high, with area $i\delta_{i}$; then $c_{i}$ is the cumulative area of the top $i$ layers. This view makes it easier to compute where the cutoff should be. Let $k$ be the index such that $\eta\lambda$ lies between $c_{k-1}$ and $c_{k}$. Then $b_{i}=\delta_{i}$ for $i<k$; $b_{k}=\frac{1}{k}(\eta\lambda-c_{k-1})$; and $b_{i}=0$ for $i>k$. In other words, $b_{i}$ is how much height of the $i$th layer should be cut off.

Although this algorithm is less efficient than the quickselect-like algorithm when run in serial, the sort in line 4 and the cumulative sums in lines 6 and 8 (Ladner and Fischer, 1980) can be parallelized to run in $O(\log n)$ passes each.
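The sort-and-prefix-sum procedure described above can be sketched as follows for a single vector. This is a serial, plain-Python illustration rather than the released GPU implementation, and it rests on two assumptions of ours: that $\delta_{i}$ is the gap between the $i$th and $(i+1)$th largest absolute values (with the last gap measured down to zero), and that the result is obtained by clipping every entry at the original maximum minus $\sum_{i}b_{i}$. The name `prox_linf` and the argument `budget` (standing for $\eta\lambda$) are hypothetical.

```python
from itertools import accumulate

def prox_linf(v, budget):
    """Cut area `budget` off the top of the sorted |v| histogram, then clip v."""
    n = len(v)
    if n == 0 or budget <= 0:
        return list(v)
    # Absolute values in decreasing order.
    s = sorted((abs(x) for x in v), reverse=True)
    # delta[i] is the height of the layer that is (i + 1) entries wide.
    delta = [s[i] - (s[i + 1] if i + 1 < n else 0.0) for i in range(n)]
    # c[i] is the cumulative area of the top (i + 1) layers (a prefix sum).
    c = list(accumulate((i + 1) * d for i, d in enumerate(delta)))
    if budget >= c[-1]:
        return [0.0] * n  # the whole histogram is removed
    # 0-based index of the layer in which the cut ends.
    k = next(i for i, area in enumerate(c) if area >= budget)
    c_prev = c[k - 1] if k > 0 else 0.0
    # Full heights of the layers above the cut, plus a partial height for layer k.
    b = delta[:k] + [(budget - c_prev) / (k + 1)]
    cap = s[0] - sum(b)
    return [max(-cap, min(cap, x)) for x in v]

print(prox_linf([3.0, -1.0, 2.0], 1.5))  # [1.75, -1.0, 1.75]
```

In the example, the removed area is $(3-1.75)+(2-1.75)=1.5$, matching the budget. On a GPU, the sort and the prefix sums would be replaced by parallel primitives (for example `torch.sort` and `torch.cumsum` in PyTorch), which is where the $O(\log n)$ pass count comes from.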

Transformer

The Transformer network, introduced by Vaswani et al. (2017), is a sequence-to-sequence model in which both the encoder and the decoder consist of stacked self-attention layers. Each layer of the decoder can attend to the previous layer of the decoder and the output of the encoder. The multi-head attention uses two affine transformations, followed by a softmax. Additionally, each layer has a position-wise feed-forward neural network (FFN) with a hidden layer of rectified linear units:

$$\mathrm{FFN}(x)=\max(0,\,xW_{1}+b_{1})\,W_{2}+b_{2}$$

The hidden layer size (number of columns of $W_{1}$) is typically four times the size of the model dimension. Both the multi-head attention and the feed-forward neural network have residual connections that allow information to bypass those layers.

Though the Transformer has demonstrated remarkable success on a variety of datasets, it is highly over-parameterized. For example, the English-German WMT ’14 Transformer-base model proposed in Vaswani et al. (2017) has more than 60M parameters. Whereas early NMT models such as Sutskever et al. (2014) have most of their parameters in the embedding layers, the added complexity of the Transformer, plus parallel developments reducing vocabulary size (Sennrich et al., 2016) and sharing embeddings (Press and Wolf, 2017), has shifted the balance. Nearly 31% of the English-German Transformer’s parameters are in the attention layers and 41% in the position-wise feed-forward layers.
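These proportions can be checked with a back-of-the-envelope count. The sketch below assumes the Transformer-base dimensions of Vaswani et al. (2017): a model dimension of 512, an FFN hidden size of 2048 (four times the model dimension), and 6 layers each in the encoder and decoder. It further assumes that every affine map carries a bias and it ignores layer normalization and embeddings; implementations differ on these details, so the totals are approximate.

```python
def attention_params(d_model):
    # Four d_model x d_model projections (query, key, value, output), with biases.
    return 4 * (d_model * d_model + d_model)

def ffn_params(d_model, d_ff):
    # Two affine maps: d_model -> d_ff and d_ff -> d_model, with biases.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

d_model, d_ff, n_layers = 512, 2048, 6

# Each encoder layer has one attention block (self-attention); each decoder
# layer has two (self-attention and attention over the encoder output).
n_attention_blocks = n_layers * 1 + n_layers * 2
n_ffn_blocks = n_layers * 2

attn_total = n_attention_blocks * attention_params(d_model)
ffn_total = n_ffn_blocks * ffn_params(d_model, d_ff)
print(attn_total, ffn_total)  # 18911232 25196544
```

Roughly 18.9M attention parameters and 25.2M feed-forward parameters are consistent with the stated 31% and 41% shares of a model with a little over 60M parameters.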

Accordingly, we apply the auto-sizing method to the Transformer network, and in particular to the two largest components, the feed-forward layers and the multi-head attentions (blue and orange rectangles in Figure 2). A difference from the work of Murray and Chiang (2015) is that there are residual connections that allow information to bypass the layers we are auto-sizing. If the regularizer drives all the neurons in a layer to zero, information can still pass through. Thus, auto-sizing can effectively prune out an entire layer.
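The effect of the residual connection can be seen in a toy example. The sketch below assumes the standard position-wise form $\mathrm{FFN}(x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2}$ of Vaswani et al. (2017), omits layer normalization, and sets the biases to zero along with the weights; the function names are hypothetical.

```python
def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN for one position: max(0, x W1 + b1) W2 + b2."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

def residual_ffn(x, W1, b1, W2, b2):
    """Sub-layer output with the residual connection: x + FFN(x)."""
    return [xi + fi for xi, fi in zip(x, ffn(x, W1, b1, W2, b2))]

d_model, d_ff = 3, 4
x = [0.5, -1.0, 2.0]

# A fully pruned FFN: every weight (and, in this sketch, every bias) is zero.
W1 = [[0.0] * d_ff for _ in range(d_model)]
W2 = [[0.0] * d_model for _ in range(d_ff)]
y = residual_ffn(x, W1, [0.0] * d_ff, W2, [0.0] * d_model)
print(y == x)  # True: the input passes through unchanged
```

With the sub-layer pruned, the block reduces to the identity, so later layers still receive the representation computed below it.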

Random Search

As an alternative to grid-based searches, random hyperparameter search has been demonstrated to be a strong baseline for neural network architecture searches, as it can search between grid points to increase the size of the search space (Bergstra and Bengio, 2012). In fact, Li and Talwalkar (2019) recently demonstrated that many architecture search methods do not beat a random baseline. In practice, randomly searching hyperparameter domains allows for an intuitive mixture of continuous and categorical hyperparameters, with no constraints on differentiability (Maclaurin et al., 2015) and no need to cast hyperparameter values into a single high-dimensional space to predict new values (Bergstra et al., 2011).

Experiments

All of our models are trained using the fairseq implementation of the Transformer (Gehring et al., 2017; https://github.com/pytorch/fairseq). Our GPU-optimized proximal gradient algorithms are implemented in PyTorch and are publicly available (https://github.com/KentonMurray/ProxGradPytorch). For the random hyperparameter search experiments, we use SHADHO (https://github.com/jeffkinnison/shadho), which defines the hyperparameter tree, generates hyperparameter settings from it, and manages distributed resources (Kinnison et al., 2018). Our SHADHO driver file and modifications to fairseq are also publicly available (https://bitbucket.org/KentonMurray/fairseq_autosizing).

We looked at four different low-resource language pairs, running experiments in five directions: Arabic-English, English-Arabic, French-English, Hausa-English, and Tigrinya-English. The Arabic and French data come from the IWSLT 2017 Evaluation Campaign (Mauro et al., 2012). The Hausa and Tigrinya data were provided by the LORELEI project with custom train/dev/test splits. For all languages, we tokenized and truecased the data using scripts from Moses (Koehn et al., 2007). For the Arabic systems, we transliterated the data using the Buckwalter transliteration scheme. All of our systems were run using subword units (BPE) with 16,000 merge operations on concatenated source and target training data (Sennrich and Haddow, 2016). We clip gradient norms at 0.1, use label-smoothed cross-entropy with a smoothing value of 0.1, and stop training early once the learning rate falls below $10^{-5}$. All of our experiments were done using the Adam optimizer (Kingma and Ba, 2015), a learning rate of $10^{-4}$, and dropout of 0.1. At test time, we decoded using a beam of 5 with length normalization (Boulanger-Lewandowski et al., 2013) and evaluated using case-sensitive, detokenized BLEU (Papineni et al., 2002).

The originally proposed Transformer model is too large for our data size: the model will overfit the training data. Instead, we use the recommended settings in fairseq for IWSLT German-English as a baseline, since two out of our four language pairs are also from IWSLT. This architecture has 6 layers in both the encoder and decoder, each with 4 attention heads. Our model dimension is $d_{model}=512$, and our FFN dimension is 1024.

Auto-sizing parameters

Random search parameters

As originally proposed, the Transformer network has 6 encoder layers, all identical, and 6 decoder layers, also identical. For our random search experiments, we sample the number of attention heads from $\{4,8,16\}$ and the model dimension ($d_{model}$) from $\{128,256,512,1024,2048\}$. Diverging from most implementations of the Transformer, we do not require the same number of encoder and decoder layers, but instead sample each from $\{2,4,6,8\}$. Within a layer, we also sample the size of the feed-forward network (FFN), varying our samples over $\{512,1024,2048\}$. This too differs from most Transformer implementations, which have identical layer hyperparameters.
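The search space above can be sketched as a simple sampler. This is an illustration using Python's standard `random` module, not the SHADHO driver used in our experiments; the dictionary keys and the function `sample_architecture` are hypothetical, and the sketch reflects one reading of the description, in which the number of heads and the model dimension are shared across the network while the FFN size is drawn independently for each layer.

```python
import random

SEARCH_SPACE = {
    "heads": [4, 8, 16],
    "d_model": [128, 256, 512, 1024, 2048],
    "layers": [2, 4, 6, 8],
    "ffn": [512, 1024, 2048],
}

def sample_architecture(rng):
    """Draw one Transformer configuration uniformly from the search space."""
    # Encoder and decoder depths are sampled independently of each other.
    enc_layers = rng.choice(SEARCH_SPACE["layers"])
    dec_layers = rng.choice(SEARCH_SPACE["layers"])
    return {
        "heads": rng.choice(SEARCH_SPACE["heads"]),
        "d_model": rng.choice(SEARCH_SPACE["d_model"]),
        # One FFN size per layer, so layers need not be identical.
        "encoder_ffn": [rng.choice(SEARCH_SPACE["ffn"]) for _ in range(enc_layers)],
        "decoder_ffn": [rng.choice(SEARCH_SPACE["ffn"]) for _ in range(dec_layers)],
    }

config = sample_architecture(random.Random(0))
print(config)
```

Every model dimension in the space is divisible by every candidate number of heads, so any sampled combination yields a whole-number head size.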

Auto-sizing vs. Random Search

Auto-sizing trains in a similar amount of time to the baseline system, whereas the cumulative training time for all of the models in random search is substantially longer. Furthermore, for Tigrinya-English and French-English, random search found models that were almost 10 and 5 times larger, respectively, than the auto-sized models.

Training times

One of the biggest downsides of searching over architectures using a random search process is that it is very expensive in both time and resources. In contrast, auto-sizing requires training only one model.

Auto-sizing Sub-Components

As seen above, on very low-resource data, auto-sizing is able to quickly learn smaller, yet better, models than the recommended low-resource Transformer architecture. Here, we look at the impact of applying auto-sizing to various sub-components of the Transformer network. In the section on GPU-optimized proximal gradient descent, following the work of Murray and Chiang (2015), auto-sizing is described as applying a group regularizer to our objective function. The relative weight of the regularizer, or regularization coefficient, is a hyperparameter denoted $\lambda$. In this section, we also look at the impact of varying the strength of this regularization coefficient.

Tables 4 and 5 show the impact that varying the regularization coefficient has on BLEU scores and model size across various model sub-components. Recall that each layer of the Transformer network has multi-head attention sub-components and a feed-forward network sub-component. We denote experiments applying auto-sizing only to the feed-forward network as “FFN”. We also experiment with auto-sizing the multi-head attention in conjunction with the FFN, which we denote “All”. A regularization coefficient of 0.0 refers to the baseline model without any auto-sizing. Columns that contain percentages give the proportion of rows, in the PyTorch parameters that auto-sizing was applied to, that were driven entirely to zero. In effect, these are neurons deleted from the model. Note that individual values in a row may be zero, but if even a single value remains, information can continue to flow through that row and it is not counted as deleted. Furthermore, percentages refer only to the parameters that auto-sizing was applied to, not the entire model. As such, given the prevalence of residual connections, a value of 100% does not mean that the entire model was deleted, but merely that specific parameter matrices were. More specific experimental conditions are described below.
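The counting rule described above can be sketched as follows. The function name `percent_rows_deleted` is hypothetical, and the matrix is a plain list of lists standing in for a PyTorch parameter; a row counts as deleted only if every entry is exactly zero.

```python
def percent_rows_deleted(matrix):
    """Percentage of rows whose entries are all exactly zero."""
    if not matrix:
        return 0.0
    deleted = sum(1 for row in matrix if all(x == 0 for x in row))
    return 100.0 * deleted / len(matrix)

W = [
    [0.0, 0.0, 0.0],   # entirely zero: counted as deleted
    [0.0, 0.2, 0.0],   # one surviving value: not counted
    [0.0, 0.0, 0.0],   # entirely zero: counted as deleted
    [0.5, -0.1, 0.3],  # dense row: not counted
]
print(percent_rows_deleted(W))  # 50.0
```

The second row shows why sparsity alone is not enough: a single nonzero weight keeps the unit connected, so it is not counted as deleted.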

FFN matrices

Auto-sizing only the feed-forward sub-component, and not the multi-head attention part, results in better BLEU scores, even when deleting all of the feed-forward network components. Impressively, this is with a model that has fully one-third fewer parameters in the encoder and decoder layers. The smaller model also reduces inference time and disk space.

Encoder vs. Decoder

Recall from Figure 2 that there are residual connections that allow information and gradients to flow around both the multi-head attention and feed-forward portions of the model. Here, we have the case that all layers of the encoder have been completely deleted. However, the decoder still attends over the source word and positional embeddings due to the residual connections. We hypothesize that for these smaller datasets there are too many parameters in the baseline model and over-fitting is an issue.

Random Search plus Auto-sizing

Conclusion

In this paper, we have demonstrated the effectiveness of auto-sizing on the Transformer network. On very low-resource datasets, auto-sizing was able to improve BLEU scores by up to 3.9 points while simultaneously deleting one-third of the parameters in the encoder and decoder layers. This was accomplished while being significantly faster than other search methods.

Additionally, we demonstrated how to apply proximal gradient methods efficiently using a GPU. Proximal gradient algorithms optimized in previous work lose much of their speed when the computations are moved off of a CPU and parallelized. Leveraging sorting and prefix summation, we reformulated these methods to be GPU efficient.

Overall, this paper has demonstrated the efficacy of auto-sizing on a natural language processing application with orders of magnitude more parameters than previous work. With a focus on speedy architecture search and an emphasis on optimized GPU algorithms, auto-sizing is able to improve machine translation on very low-resource language pairs without being resource- or time-intensive.

Acknowledgements

This research was supported in part by University of Southern California, subcontract 67108176 under DARPA contract HR0011-15-C-0115. We would like to thank Justin DeBenedetto for helpful comments.