High-Accuracy Low-Precision Training

Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R. Aberger, Kunle Olukotun, Christopher Ré

Introduction

Many machine learning training tasks can be written as an optimization problem over a finite sum of NN components

A standard way of solving these optimization problems over very large training datasets is by using stochastic gradient descent (SGD) . Given that training for deep neural networks can take weeks, it is important to produce results quickly and efficiently. For machine learning inference tasks, speed and efficiency has been greatly improved by the use of hardware accelerators such as Google’s TPU and Microsoft’s Project Brainwave . Much of the benefit of these accelerators comes from their use of low-precision arithmetic, which reduces the overall cost of computation by reducing the number of bits that need to be processed. These accelerators have been primarily used for inference and not training, partially because the effects of precision on training are not yet well understood. This motivates us to study how low-precision can be used to speed up algorithms for solving training problems like (1).

Unfortunately, the systems benefits of low-precision (LP) arithmetic come with a cost. The round-off or quantization error that results from converting numbers into a low-precision representation introduces noise that can affect the convergence rate and accuracy of SGD. Conventional wisdom says that, for training, low precision introduces a tradeoff of the number-of-bits used versus the statistical accuracy—the fewer bits used, the worse the solution will become. Theoretical upper bounds on the performance of low-precision SGD and empirical observations of implemented low-precision algorithms further confirm that current algorithms are limited by this precision-accuracy tradeoff.A simple way to avoid this and make an algorithm of arbitrary accuracy would be to increase the number of bits of precision as the algorithm converges. However, this is unsatisfying as it increases the cost of computation, and we want to be able to run on specialized low-precision accelerators that have a fixed bit width.

In this paper, we upend this conventional wisdom by showing that it is still possible to get high-accuracy solutions from low-precision training, as long as the problem is sufficiently well-conditioned. We do this with an algorithm called HALP which transcends the accuracy limitations of ordinary low-precision SGD. We address noise from gradient variance using a known technique called SVRG, stochastic variance-reduced gradient . To address noise from quantization, we introduce a new technique called bit centering. The intuition behind bit centering is that as we approach the optimum, the gradient gets smaller in magnitude and in some sense carries less information, so we should be able to compress it. By dynamically re-centering and re-scaling our low-precision numbers, we can lower the quantization noise asymptotically as the algorithm converges. We prove that, for strongly convex problems, HALP is able to produce arbitrarily accurate solutions with the same linear asymptotic convergence rate as SVRG, while using low-precision iterates with a fixed number of bits. Our theory also exposes a novel tradeoff between condition number κ\kappa and precision which suggests that the number of bits needed for linear convergence is b=log(O(κ))b=\log(O(\kappa)). Our contributions are as follows:

In Section 3, we introduce and study low-precision SVRG (LP-SVRG), which has no bit centering step. We prove that LP-SVRG converges at the same linear rate as SVRG, but (as conventional wisdom would predict) only converges down to an accuracy limit caused by the low-precision arithmetic.

In Section 4, we introduce HALP, High-Accuracy Low-Precision, and prove that it converges at the same linear rate as SVRG down to solutions of arbitrarily high accuracy, even though it uses a fixed number of bits of precision for its iterates.

In Section 5, we show that on a CPU, HALP can compute iterations up to {\color[rgb]{0,0,0}3}\times faster than plain SVRG on the MNIST dataset and up to {\color[rgb]{0,0,0}4}\times faster than plain SVRG on a synthetic dataset with 10,000 features. We also evaluate our method as a new algorithm for deep learning. We implement our algorithms in TensorQuant and show that when training deep modelsTo simulate training deep models in this paper we ran the computation at full-precision then quantized the updates using the algorithms presented in Sections 3 and 4. our validation performance can match SVRG and exceed low-precision SGD.

Our results about asymptotic convergence rates and time complexity, compared with standard rates for SGD and SVRG, are summarized in Table 1.

Related work

Motivated by the increasing time and energy cost of training large-scale deep learning models on clusters, several recent projects have investigated decreasing these costs using low-precision arithmetic. It has been folklore for many years that neural network inference could be done effectively even with 8-bit arithmetic , and there has been much work recently on compressing already-trained networks by (among other things) making some of the weights and activations low-precision . This interest in low-precision arithmetic for inference has also led to the development of new hardware accelerators for low-precision machine learning, such as Google’s TPU which is based on 8-bit low-precision multiplies .

Work has also been done on evaluating and guaranteeing the effectiveness of low-precision training. Researchers have gathered empirical evidence for low-precision training in specific settings, although these results have typically not produced empirical support for 8-bit training . Researchers have also proven bounds on the error that results from using low-precision computation on convex problems and non-convex matrix recovery problems . Recently, Zhang et al. has developed techniques called double sampling and optimal quantization which enable users to quantize the training dataset with provable guarantees on the accuracy for convex linear models. Using these techniques, they designed and evaluated a hardware accelerator that computes low-precision SGD efficiently. A similar evaluation of the hardware efficiency of low-precision methods on commodity hardware was done by De Sa et al. , which outlined how quantizing in different ways has different effects on accuracy and throughput when SGD is made low-precision. While these works showed that low-precision training has many benefits, they all observe that accuracy degrades as precision is decreased.

While SGD is a very popular algorithm, the number of iterations it requires to achieve an objective gap of ϵ\epsilon for a strongly convex problem is O(1/ϵ)O(1/\epsilon). In comparison, ordinary gradient descent (GD) has an asymptotic rate of O(log(1/ϵ))O(\log(1/\epsilon)), which is known as a linear rateThis is called a linear rate because the number of iterations required is linear in the number of significant figures of output precision needed., and is asymptotically much faster than SGD. There has been work on modifications to SGD that preserve its high computational throughput while also recovering the linear rate of SGD . SVRG is one such method, which recovers the linear rate of gradient descent for strongly convex optimization, while still using stochastic iterations ; recently it has been analyzed in the non-convex case as well and shown to be effective in some settings . These variance-reduced methods are interesting because they preserve the simple hardware-efficient updates of SGD, while recovering the statistically-efficient linear convergence rate of the more expensive gradient descent algorithm. While we are not the first to present theoretical results combining low-precision with SVRG (Alistarh et al. previously studied using low-precision numbers for communication among workers in parallel SGD and SVRG), to the best our knowledge we are the first to present empirical results of low-precision SVRG and to propose the additional bit centering technique.

Warmup: Mixing low-precision and SVRG

This is a generalization of standard fixed-point arithmetic, where the scale factor is allowed to be arbitrary rather than being restricted to powers of two. Low-precision numbers with the same scale factor can be easily added using integer addition, producing a new number with the same scale factor and the same number of bits.The result of an addition will have the same number of bits if saturating addition is used. If exact addition is desired, the number of bits must be increased somewhat to prevent overflow. Any two low-precision numbers can be multiplied using an integer multiply, producing a new number with a scale factor that is the product of the input scale factors, and a number of bits that is the sum of the two input bit-counts. If we restrict ourselves in constructing an algorithm to use mostly additions and multiplies of these forms, we can do most of our computation with efficient low-precision integer arithmetic.

Quantization. Now that we have described low-precision representations, we need some way to convert numbers to store them in these representations. For reasons that have been explored in other work , for our algorithms here we will use unbiased rounding (also known as randomized rounding or stochastic rounding). This involves using a quantization function QQ that chooses to round up or down at random such that for any xx that is in the interior of the domain of the low-precision representation, E[Q(x)]=x\mathbf{E}\left[Q(x)\right]=x. If xx is not in the interior of the domain, QQ outputs the closest representable value to xx (which will always be either the largest or smallest representable value). Here, we let Q(δ,b)Q_{(\delta,b)} denote the function that quantizes into the low-precision representation (δ,b)(\delta,b). When we use QQ to quantize a vector, we mean that all the components are quantized independently.

Theory. The natural next question is: how does using low-precision computation affect the convergence of the algorithm? One thing we can say immediately is that LP-SVRG will not converge asymptotically at a linear rate, as it will be limited to producing outputs in the low-precision representation (δ,b)(\delta,b): once it gets as close as possible to the solution in this representation, it can get no closer, and convergence will stop. The next-best thing we can hope for is that LP-SVRG will converge at a linear rate until it reaches this limit, at which point it will stop converging—and in fact this is what happens. But before we can prove this, we need to state some assumptions. First, we require that the objective ff is μ\mu-strongly convex

and the gradients fi\nabla f_{i} are all LL-Lipschitz continuous

In terms of these parameters, the condition number of the problem is defined as κ=L/μ\kappa=L/\mu. These assumptions are standard and are the same ones used for the analysis of SVRG . We also need to assume that the global solution ww^{*} to our problem is within the range of numbers that are representable in our low-precision representation. To ensure this, we require that for any jj,

This is easy to satisfy in practice if we have some bound on the magnitude of ww^{*}. Under these conditions we can provide convergence guarantees for LP-SVRG.

Suppose that we run LP-SVRG (Algorithm 2) under the above conditions, using option II for the epoch update. For any constant 0<γ<10<\gamma<1 (a parameter which controls how often we take full gradients), if we set our step size and epoch lengths to be

then the outer iterates of LP-SVRG will converge to an accuracy limit at a linear rate

Validation. To validate LP-SVRG empirically, we ran it on a synthetic linear regression problem. Figure 1 shows that LP-SVRG, with both 8-bit and 16-bit precision, tracks the convergence trajectory of full-precision SVRG until reaching an accuracy floor that is determined by the precision level. This behavior matches our theoretical predictions. The results also show that LP-SVRG matches or outperforms low-precision SGD (LP-SGD) both in terms of convergence speed and the eventual accuracy limit.

HALP: High-accuracy with low-precision

While LP-SVRG converged at a linear rate, it only converged down to a level of accuracy proportional to the delta of quantization of the low-precision representation chosen. In fact, this is a fundamental limitation of algorithms like LP-SVRG and LP-SGD: we cannot produce a solution that is asymptotically more accurate than the most accurate solution representable in the low-precision representation we have chosen. Since the low-precision representation (δ,b)(\delta,b) in the previous section is chosen a priori and is fixed throughout the algorithm, this accuracy limitation is impossible to overcome. While the use of SVRG allowed us to reach this minimum level of accuracy more quickly (at a linear rate, in fact) compared with low-precision SGD , it has not let us surpass this minimum.

In this section we develop an algorithm that can surpass this minimum level of accuracy, and converge to arbitrarily accurate solutions while still using low-precision arithmetic. First, we will introduce a technique called bit centering, which reduces the noise from quantization as the algorithm converges. Second, we will state and explain our algorithm, HALP. Third, we will validate HALP by showing theoretically and empirically that it can converge at a linear rate, just like full-precision SVRG. Finally, we will give some implementation details that show how HALP can be computed efficiently on a class of problems.

Bit centering. In standard SVRG, each outer iteration is conceptually rewriting the original objective (1) as

This means that if at each outer iteration we dynamically reset the low-precision representation to

Theory. Using the same conditions that we used for LP-SVRG, we can prove that HALP converges at a linear rate, to solutions of arbitrarily low error.

Suppose that we run HALP (Algorithm 3) under the standard conditions of strong convexity and Lipschitz continuity, using option II for the epoch update. For any constant 0<γ<10<\gamma<1, if we use a number of bits

and we set our step size and epoch lengths to be

then after running HALP the output will satisfy

This theorem shows that we can achieve a linear asymptotic convergence rate even with constant-bit-width low-precision computation in the inner loop. It also describes an interesting tradeoff between the precision and the condition number. As the condition number becomes larger while the precision stays fixed, we need to use longer and longer epochs (TT becomes larger), until eventually the algorithm might stop working altogether. This suggests that low-precision training should be combined with techniques to improve the condition number, such as preconditioning.

Validation. Just as we did for LP-SVRG, we validate these results empirically by running HALP on the same synthetic linear regression problem. Figure 1 shows that HALP, with both 8-bit and 16-bit precision, tracks the linear convergence trajectory of full-precision SVRG until their accuracy becomes limited by the error of the high-precision floating point numbers. Interestingly, even though all the algorithms in Figure 1 use 64-bit floating point numbers for their full-precision computation, HALP actually converges to a solution that is more accurate: this happens because when we are very close to the optimum, the quantization error of the floating-point numbers in SVRG actually exceeds that of the low-precision numbers used in HALP.

The second challenge, summing and scaling vectors to compute utu_{t}, requires a bit more care. This task reduces to

Evaluation

In this section, we empirically evaluate both LP-SVRG and HALP on standard training applications. Our goal here is to validate (1) that these new low-precision algorithms can lead to high-accuracy solutions and (2) that the increased throughput of these low-precision algorithms can lower the required end-to-end training time. In Section 5.1 we validate that both LP-SVRG and HALP are capable of achieving high-accuracy solutions (when compared to low-precision SGD) on both a convolutional neural network (CNN) and a recurrent neural networks (RNN) neural network. In Section 5.2 we validate that on multi-class logistic regression, HALP can lead to substantially higher accuracy solutions than LP-SVRG while executing each epoch up to {\color[rgb]{0,0,0}4}\times faster than full-precision SVRG running on a commodity CPU.

For deep learning, we show that (1) LP-SVRG can exhibit better training losses than low-precision SGD, and (2) HALP can have even better training losses than LP-SVRG.

Experimental Setup. To demonstrate the effectiveness of LP-SVRG and HALP, we empirically evaluate them on both a CNN and a RNN. To run these experiments, we extended the TensorQuant toolbox to simulate the quantization operations from Algorithm 2 and Algorithm 3. In more detail, our deep learning training results run the computation at full-precision, but simulate the quantization operations during the model update. We conduct the CNN experiment with a ResNet architecture with 16,32,16,32, and 6464 channels of 3×33\times 3 kernels in the first, second, and third building blocks. We train this model on the CIFAR10 image classification dataset. For the RNN experiment, we evaluate a character-level language modeling task on the TinyShakespeare dataset . The RNN model is a two-layer LSTM, where each layer contains with 128 hidden units. In our 8-bit experiments, we uniformly set learning rate to 0.5 for the CNN experiments and to 1.0 for the RNN experiments. We sweep scale δ\delta in grid {0.001,0.002,0.005,0.01,0.02,0.05}\{0.001,0.002,0.005,0.01,0.02,0.05\} for both the CNN and RNN experiments. For HALP runs, we sweep μ\mu in the grid {1,2,5,10,20,50}\{1,2,5,10,20,50\}. For each algorithm, we present the training loss curve and validation metric curve with the best values found from the grid search.

CNN Discussion. Figures 3(a) and 3(b) show that 8-bit HALP comes close to the training loss of full-precision SVRG, while significantly outperforming LP-SVRG and LP-SGD. In terms of validation accuracy, Figure 3(b) shows that 8-bit LP-SVRG produces a model with improved validation accuracies of 0.8% when compared to LP-SGD, and more importantly that 8-bit HALP reaches validation metrics that match the quality of full-precision SVRG.

LSTM Discussion. In Figures 3(c) and 3(d) we show that 8-bit LP-SVRG outperforms LP-SGD in training loss and validation perplexity on the LSTM model. We show that 8-bit HALP is able to achieve a training loss that is better than LP-SVRG and closely matches the results from full-precision SVRG. We also observe that the validation perplexity from 8-bit HALP is worse than the one from 8-bit SVRG; this phenomenon (a lower training loss not guaranteeing better generalization) is often observed in deep learning results.

2 Multi-Class Logistic Regression Results

We validate that HALP strictly outperforms LP-SVRG and LP-SGD on two logistic regression training applications while executing each iteration up to 4×4\times faster than full-precision (64-bit) SVRG.

Experimental Setup. To test the effectiveness of the HALP algorithm, we compared 8-bit HALP to 8-bit LP-SGD, 8-bit LP-SVRG, 64-bit SVRG, and 64-bit SGD implementations on (10 class) logistic regression classification tasks. We implemented 8 and 16-bit versions of each algorithm in C++ using AVX2 intrinsics. We present results on two datasets, MNIST with 784 features and 60,000 samples and a synthetic dataset with 10,000 features and 7,500 samples. The synthetic dataset was generated using scikit-learn’s make_classification dataset generator. In Section A.3 we present the complete experimental details.

Statistical Discussion. Figure 4 shows that HALP strictly outperforms both LP-SGD and LP-SVRG. Interestingly, Figure 4(a) shows that HALP outperforms all algorithms except full-precision SVRG on the MNIST dataset. This is expected because, due to its use of bit centering, HALP has lower-magnitude quantization noise than both LP-SGD and LP-SVRG. Amazingly, Figure 4(b) shows that sometimes HALP is capable of outperforming all learning algorithms including full-precision SVRG.

Performance Discussion. Table 2 shows that HALP outperforms SGD by up to {\color[rgb]{0,0,0}3}\times and SVRG by up to {\color[rgb]{0,0,0}4}\times while remaining within 25% of LP-SGD per epoch. On MNIST the performance gains of HALP are less pronounced as this dataset is purely compute bound—meaning we do not experience the memory bandwidth benefits of low-precision. On the larger synthetic dataset the application becomes more memory bound, and we correspondingly notice a larger performance improvement in the low-precision applications. To a certain degree these performance results are limited by the design of current CPU architectures (see Section A.3).

Conclusion

In this paper we presented HALP, a new SGD variant that is able to theoretically converge at a linear rate while using fewer bits. HALP leverages SVRG to reduce noise from gradient variance, and introduces bit centering to reduce noise from quantization. To validate the effectiveness of SVRG in low-precision computation, we show that both HALP and low-precision SVRG converge to high-accuracy solutions on LSTM, CNN, and multi-class logistic regression applications up to {\color[rgb]{0,0,0}4}\times faster than full-precision SVRG.

References

Appendix A Extended Evaluation

Our results in Theorem 2 exposed a relationship between the condition number and the performance of low-precision training. These results suggested that as the condition number is increased, there is a threshold (the threshold at which the conditions of the theorem hold) above which the performance of HALP is no longer guaranteed, and may become dramatically worse. Here, we validate this intuition empirically.

A.2 Extended Neural Network Evaluation

In Figure 6 we show 16-bit experiments on the LSTM and CNN neural networks presented in Section 5.1. We use the same experiment protocol as with Section 5.1, except the grid for δ\delta is

Our results for the most part follow the trends from the 8-bit results from Section 5.1. Still, we highlight here that 16-bit LP-SVRG benefits from variance reduction in this low-precision training. As such, LP-SVRG achieves a training loss that matches the performance of 32-bit full-precision SVRG. We show that 16-bit LP-SVRG also consistently outperforms 16-bit SGD in training loss.

A.3 Extended Multi-Class Logistic Regression Evaluation

The synthetic dataset was generated with n_informative=n_features and otherwise default parameters (besides those already mentioned like n_features). To select hyperparameters we ran a 100-point grid search for each training algorithm and present the configurations that minimized the gradient norm after training for 50 epochs. Tables 4 and 3 shows the exact hyperparameters used for the runs present. For all SVRG-based algorithms (SVRG, LP-SVRG, and HALP) we compute the full gradient every two epochs as done in Johnson and Zhang . Note our definition of epoch for the experimental results means one full pass over the dataset, whereas in the algorithmic definitions, it referred to one outer loop iteration of SVRG or HALP. To test statistical convergence we measure the gradient norm every other epoch because it goes to zero as the algorithm converges, and for strongly convex functions it is at most a constant factor away from other common metrics such as the objective gap and the distance to the optimum. To ensure robustness we run each algorithm with five different random seeds and present the average of the results. We ran all experiments on a single machine with a total of 56 cores on four Intel Xeon E7-4850 v3 CPUs and 1 TB of RAM. For the SVRG-based algorithms (SVRG, LP-SVRG, and HALP), we parallelized the computation of the full gradient as this is embarrassingly parallel (and does not affect the statistical results). We run these algorithms with 56 threads. For end-to-end performance, we measure the wallclock time for each epoch and report the average of 5 epoch timings.

Performance Discussion.

The stochastic quantization is expensive, even when implemented in AVX2 using the XORSHIFT pseudorandom number generator. Generally, we found we also need more instructions to operate on fixed-point types, limiting the throughput benefits we can achieve from using low precision. A common example arises with multiplication, where we need to perform an additional shift following the multiplication of two fixed-point types to maintain the same bit width. Moreover, current architectures do not support efficiently operating on types less than 8 bits, limiting our expected performance improvement to 8-bit and 16-bit fixed-point types. Still, these results are exciting and we hope will influence the design of future architectures and hardware accelerators.

Clipping Discussion.

In the body of the paper, we explained how the effect observed in Figure 4(b), where 8-bit HALP outperforms plain SVRG, can be attributed to a gradient-clipping-like effect caused by saturating arithmetic. Here, we provide more detailed evidence of this claim.

The hypothesis that we want to validate is that the improved performance of 8-bit HALP over SVRG is caused by the saturating arithmetic: the fact that the distance the iterates can move in a single epoch is bounded by the box of representable numbers, as illustrated in Figure 2. Additionally, we want to rule out the possibility that the quantization itself is causing this improvement, or that the noise caused by quantizing to a particular scale is responsible. To investigate this hypothesis, we ran two additional experiments. The first (which we call ‘Scale’) runs 16-bit HALP with the same scale (the same δ\delta) as 8-bit HALP would use at every iteration. To do this, we adjusted the value of μ\mu so that μ8(b811)=μ16(b1611)\mu_{8}(b^{8-1}-1)=\mu_{16}(b^{16-1}-1). The second (which we call ‘Clip’) runs 16-bit HALP with the same range of representable numbers (the same μ\mu) as 8-bit HALP. If our hypothesis is correct, we would expect the second 16-bit HALP run, which has the same gradient-clipping-like effect as 8-bit HALP, to mimic its performance and outperform SVRG. We would also expect the first 16-bit HALP run to be closer to SVRG’s performance, since it has a much larger range of representable numbers and so will not exhibit the same gradient-clipping-like effect.

Figure 7 presents the results of our experiment. Our hypothesis was validated, and in fact the 16-bit HALP runs are nearly indistinguishable from the trajectories we hypothesized they would take. This strongly suggests that the gradient-clipping-like effect is indeed responsible for the improved performance of 8-bit HALP over SVRG in Figure 4(b).

Appendix B Proofs

Before we prove the main theorems presented in the paper, we will prove the following lemmas, which will be useful later.

Under the above conditions where we quantize using the low-precision representation (δ,b)(\delta,b), for any ww,

First, observe that this entire inequality separates additively along dimensions. Therefore, it suffices to prove it just for the case of d=1d=1.

To prove it for d=1d=1, we can consider two cases separately. First, if ww is within the range representable by (δ,b)(\delta,b), then E[Q(δ,b)(w)]=w\mathbf{E}\left[Q_{(\delta,b)}(w)\right]=w. In this case,

Since ww is within the representable range, it will either be rounded up or down at random. Let zz be the rounded-down quantization of ww. Then Q(δ,b)(w)Q_{(\delta,b)}(w) will round to z+δz+\delta (the rounded-up quantization of ww) with probability wzδ\frac{w-z}{\delta}, and it will round to zz with probability z+δwδ\frac{z+\delta-w}{\delta}. This quantization is unbiased because

It follows that in the d=1d=1 case, when ww is on the interior of the representable region,

In the other case, when ww is on the exterior of the representable region, the quantization function Q(δ,b)Q_{(\delta,b)} just maps it to the nearest representable value. Since ww^{*} is in the interior of the representable region, this operation will make ww closer to ww^{*}. Thus,

and so it will certainly be the case that

We have now proved this inequality for all values of ww, when d=1d=1. The inequality now follows in full generality by summing up over dimensions. ∎

For completeness, we also re-state the proof of following lemma, which was presented as equation (8) in Johnson and Zhang .

Under the standard condition of Lipschitz continuity, if ii is sampled uniformly at random from {1,,N}\{1,\dots,N\}, then for any ww,

Clearly, if ii is sampled randomly as in the lemma statement, E[gi(w)]=f(w)\mathbf{E}\left[g_{i}(w)\right]=f(w). But also, ww^{*} must be the minimizer of gig_{i}, so for any ww

where the second inequality follows from the Lipschitz continuity property. Re-writing this in terms of fif_{i} and averaging over all the ii now proves the lemma statement. ∎

Now we are ready to prove Theorem 1. Our proof of this theorem follows the structure of the proof of the original SVRG convergence result in Johnson and Zhang .

We start by looking at the expected distance-squared to the optimum.

By Lemma 1, this can be bounded from above by

Applying the recursive definition of uk,tu_{k,t} from the algorithm statement produces

where this last inequality follows from convexity of the function ff. This second-order term can be further bounded by

where the first inequality holds because x+y22x2+2y2\left\|x+y\right\|^{2}\leq 2\left\|x\right\|^{2}+2\left\|y\right\|^{2} and the second holds because the variance is always upper bounded by the second moment. We can now apply Lemma 2 to this last expression, which produces

Substituting this into the expression above,

Summing this up across all TT iterations of an epoch produces

If we use option II to assign the next outer iterate, then

As a consequence of the strong convexity property,

This is the same as the analogous expression for SVRG, except for the additional term that is a function of δ\delta. Now, suppose we want to have an expected contraction factor of γ\gamma each epoch. That is, we want

This equation only has solutions when the discriminant is non-negative, that is, when

The minimal value of TT for which this will be able to hold will be when it holds with equality, or when

If we choose this TT, then the solution to the above quadratic equation is, by the quadratic formula, to set α\alpha such that

We can see that these are the settings of TT and α\alpha prescribed in the theorem statement. With these settings of TT and α\alpha, we get

where in the last line we use the fact that 1+γ21+\gamma\leq 2 and 2+γ22+\gamma\geq 2. Now subtracting the fixed point of this expression from both sides,

It follows by applying this statement recursively that

The analysis of the inner loop of HALP is identical to the analysis of LP-SVRG. By using the same argument as in the proof of Theorem 1, we can get that

Unlike for LP-SVRG, for HALP, the value of δ\delta changes over time. Specifically, it is assigned to

Next, suppose as before that we want to contract by a factor of γ\gamma in expectation each step. That is, we need

The analysis of this is identical to that in the proof of Theorem 1. By this same analysis, the minimal value of T^\hat{T} for which this will be able to hold will be when

In order for T^\hat{T} to have this magnitude, we need

If we assign α\alpha and TT in this way, as they are given in the theorem statement, then

and the result now follows by induction. ∎