Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization
Yuchen Zhang, Lin Xiao
Introduction
We are especially interested in developing efficient algorithms for solving problem (1) when the number of samples is very large. In this case, evaluating the full gradient or subgradient of the function is very expensive, thus incremental methods that operate on a single component function at each iteration can be very attractive. There have been extensive research on incremental (sub)gradient methods (e.g. ) as well as variants of the stochastic gradient method (e.g., ). While the computational cost per iteration of these methods is only a small fraction, say , of that of the batch gradient methods, their iteration complexities are much higher (it takes many more iterations for them to reach the same precision). In order to better quantify the complexities of various algorithms and position our contributions, we need to make some concrete assumptions and introduce the notion of condition number and batch complexity.
Let and be two positive real parameters. We make the following assumption:
Each is convex and differentiable, and its derivative is -Lipschitz continuous (same as being -smooth), i.e.,
In addition, the regularization function is -strongly convex, i.e.,
Under Assumption A, the gradient of each component function, , is also Lipschitz continuous, with Lipschitz constant , where . In other words, each is -smooth. We define a condition number
and focus on ill-conditioned problems where . In the statistical learning context, the regularization parameter is usually on the order of or (e.g., ), thus is on the order of or . It can be even larger if the strong convexity in is added purely for numerical regularization purposes (see Section 3). We note that the actual conditioning of problem (1) may be better than , if the empirical loss function by itself is strongly convex. In those cases, our complexity estimates in terms of can be loose (upper bounds), but they are still useful in comparing different algorithms for solving the same given problem.
To make fair comparisons with batch methods, we measure the complexity of stochastic or incremental gradient methods in terms of the number of equivalent passes over the dataset required to reach an expected precision . We call this measure the batch complexity, which are usually obtained by dividing their iteration complexities by . For example, the batch complexity of the stochastic gradient method is . The batch complexities of full gradient methods are the same as their iteration complexities.
By carefully exploiting the finite average structure in (1) and other similar problems, several recent work proposed new variants of the stochastic gradient or dual coordinate ascent methods and obtained the iteration complexity . Since their computational cost per iteration is , the equivalent batch complexity is of their iteration complexity, i.e., . This complexity has much weaker dependence on than the full gradient methods, and also much weaker dependence on than the stochastic gradient methods.
In this paper, we propose a stochastic primal-dual coordinate (SPDC) method, which has the iteration complexity
When , this is lower than the batch complexity mentioned above. Indeed, it is very close to a lower bound for minimizing finite sums recently established in .
2 Outline of the paper
Our approach is based on reformulating problem (1) as a convex-concave saddle point problem, and then devising a primal-dual algorithm to approximate the saddle point. More specifically, we replace each component function through convex conjugation, i.e.,
Under Assumption A, each is -strongly convex (since is -smooth; see, e.g., [15, Theorem 4.2.2]) and is -strongly convex. As a consequence, the saddle point problem (4) has a unique solution, which we denote by .
In Section 2, we present the SPDC method as well as its convergence analysis. It alternates between maximizing over a randomly chosen dual coordinate and minimizing over the primal variable . In order to accelerate the convergence, an extrapolation step is applied in updating the primal variable . We also give a mini-batch SPDC algorithm which is well suited for parallel computing.
In Section 3 and Section 4, we present two extensions of the SPDC method. We first explain how to solve problem (1) when Assumption A does not hold. The idea is to apply small regularizations to the saddle point function so that SPDC can still be applied, which results in accelerated sublinear rates. The second extension is a SPDC method with non-uniform sampling. The batch complexity of this algorithm has the same form as (3), but with , where , which can be much smaller than if there is considerable variation in the norms .
In Section 5, we discuss related work. In particular, the SPDC method can be viewed as a coordinate-update extension of the batch primal-dual algorithm developed by Chambolle and Pock . We also discuss two very recent work which achieve the same batch complexity (3).
In Section 7, we present experiment results comparing SPDC with several state-of-the-art optimization methods, including both batch algorithms and randomized incremental and coordinate gradient methods. On all scenarios we tested, SPDC has comparable or better performance.
The SPDC method
We give the details of the SPDC method in Algorithm 1. The dual coordinate update and primal vector update are given in equations (7) and (8) respectively. Instead of maximizing over and minimizing over directly, we add two quadratic regularization terms to penalize and from deviating from and . The parameters and control their regularization strength, which we will specify in the convergence analysis (Theorem 1). Moreover, we introduce two auxiliary variables and . From the initialization and the update rules (7) and (9), we have
Equation (10) obtains based on extrapolation from and . This step is similar to Nesterov’s acceleration technique [25, Section 2.2], and yields faster convergence rate.
The Mini-Batch SPDC method in Algorithm 2 is a natural extension of SPDC in Algorithm 1. The difference between these two algorithms is that, the Mini-Batch SPDC method may simultaneously select more than one dual coordinates to update. Let be the mini-batch size. During each iteration, the Mini-Batch SPDC method randomly picks a subset of indices of size , such that the probability of each index being picked is equal to . The following is a simple procedure to achieve this. First, partition the set of indices into disjoint subsets, so that the cardinality of each subset is equal to (assuming divides ). Then, during each iteration, randomly select a single index from each subset and add it to . Other approaches for mini-batch selection are also possible; see the discussions in .
In Algorithm 2, we also switched the order of updating and (comparing with Algorithm 1), to better illustrate that is obtained based on an extrapolation from to . However, this form is not recommended in implementation, because is usually a dense vector even if the feature vectors are sparse. Details on efficient implementation of SPDC are given in Section 6. In the following discussion, we do not make sparseness assumptions.
With a single processor, each iteration of Algorithm 2 takes time to accomplish. Since the updates of each coordinate are independent of each other, we can use parallel computing to accelerate the Mini-Batch SPDC method. Concretely, we can use processors to update the coordinates in the subset in parallel, then aggregate them to update . In terms of wall-clock time, each iteration takes time, which is the same as running one iteration of the basic SPDC algorithm. Not surprisingly, we will show that the Mini-Batch SPDC algorithm converges faster than SPDC in terms of the iteration complexity, because it processes multiple dual coordinates in a single iteration.
Since the basic SPDC algorithm is a special case of Mini-Batch SPDC with , we only present a convergence theorem for the mini-batch version. The expectations in the following results are taken with respect to the random variables , where denotes the random subset picked at the -th iteration of the SPDC method.
Assume that each is -smooth and is -strongly convex (Assumption A). Let be the unique saddle point of defined in (4), , and define
If the parameters and in Algorithm 2 are chosen such that
then for each , the Mini-Batch SPDC algorithm achieves
The proof of Theorem 1 is given in Appendix A. The following corollary establishes the expected iteration complexity of Mini-Batch SPDC for obtaining an -accurate solution.
Suppose Assumption A holds and the parameters , and are set as in (16). In order for Algorithm 2 to obtain
it suffices to have the number of iterations satisfy
Applying the inequality to the denominator above completes the proof. ∎
Recall the definition of the condition number in (2). Corollary 1 establishes that the iteration complexity of the Mini-Batch SPDC method for achieving (17) is
So a larger batch size leads to less number of iterations. In the extreme case of , we obtain a full batch algorithm, which has iteration or batch complexity . This complexity is also shared by the AFG methods (see Section 1.1), as well as the batch primal-dual algorithm of Chambolle and Pock (see discussions on related work in Section 5).
Since an equivalent pass over the dataset corresponds to iterations, the batch complexity (the number of equivalent passes over the data) of Mini-Batch SPDC is
The above expression implies that a smaller batch size leads to less number of passes through the data. In this sense, the basic SPDC method with is the most efficient one. However, if we prefer the least amount of wall-clock time, then the best choice is to choose a mini-batch size that matches the number of parallel processors available.
2 Convergence rate of primal-dual gap
In the previous subsection, we established iteration complexity of the Mini-Batch SPDC method in terms of approximating the saddle point of the minimax problem (4), more specifically, to meet the requirement in (17). Next we show that it has the same order of complexity in reducing the primal-dual objective gap , where is defined in (1) and
Under Assumption A, the function defined in (4) has a unique saddle point , and
Thus the result in Theorem 1 does not translate directly into a convergence bound on the primal-dual gap. We need to bound and by and , respectively, in the opposite directions. For this purpose, we need the following lemma, which we extracted from . We provide the proof in Appendix B for completeness.
Suppose Assumption A holds and the parameters , and are set as in (16). Let . Then for any , the iterates of Algorithm 2 satisfy
The function is strongly convex in with parameter , and is the minimizer. Similarly, is strongly convex in with parameter , and is minimized by . Therefore,
We bound the following weighted primal-dual gap
The first inequality above is due to Lemma 1, the second and fourth inequalities are due to the definition of , and the third inequality is due to (19). Taking expectations on both sides of the above inequality, then applying Theorem 1, we obtain
Since and , this implies the desired result. ∎
Extensions to non-smooth or non-strongly convex functions
To be concise, we only consider the case where neither is smooth nor is strongly convex. Formally, we assume that each and are convex and Lipschitz continuous, and has a saddle point . We choose a scalar and consider the modified saddle-point function:
Denote by the saddle-point of . We employ the Mini-Batch SPDC method (Algorithm 2) to approximate , treating as and as , which are all -strongly convex. We note that adding strongly convex perturbation on is equivalent to smoothing , which becomes -smooth (see, e.g., ). Letting , the parameters , and in (16) become
Although is not exactly the saddle point of , the following corollary shows that applying the SPDC method to the perturbed function effectively minimizes the original loss function . Similar results for the convergence of the primal-dual gap can also be established.
Assume that each is convex and -Lipschitz continuous, and is convex and -Lipschitz continuous. Define two constants:
Let be a shorthand notation. We have
Here, equations (i) and (vii) use the definition of the function , inequalities (ii) and (v) use the definition of the function , inequalities (iii) and (iv) use the fact that is the saddle point of , and inequality (vi) is due to the fact that is the saddle point of .
Since is -Lipschitz continuous, the domain of is in the interval , which implies (see, e.g., [38, Lemma 1]). Thus, we have
On the other hand, since is -Lipschitz continuous, Theorem 1 implies
The corollary is established by finding the smallest that satisfies inequality (23). ∎
There are two other cases that can be considered: when is not smooth but is strongly convex, and when is smooth but is not strongly convex. They can be handled with the same technique described above, and we omit the details here. In Table 1, we list the complexities of the Mini-Batch SPDC method for finding an -optimal solution of problem (1) under various assumptions. Similar results are also obtained in .
SPDC with non-uniform sampling
The basic idea is to use non-uniform sampling in picking the dual coordinate to update at each iteration. In Algorithm 3, we pick coordinate with the probability
where is a parameter. In other words, this distribution is a (strict) convex combination of the uniform distribution and the distribution that is proportional to the feature norms. Therefore, instances with large feature norms are sampled more frequently, controlled by . Simultaneously, we adopt an adaptive regularization in step (26), imposing stronger regularization on such instances. In addition, we adjust the weight of in (27) for updating the primal variable. As a consequence, the convergence rate of Algorithm 3 depends on the average norm of feature vectors, as well as the parameter . This is summarized in the following theorem.
Suppose Assumption A holds. Let . If the parameters in Algorithm 3 are chosen such that
Since is a bound on the convergence factor, we would like to make it as small as possible. For its expression in (29), it can be minimized by choosing
where is an average condition number. We have if . The value of decreases slowly to zero as the ratio grows, and increases to one as the ratio drops. Thus, we may choose a relatively uniform distribution for well conditioned problems, but a more aggressively weighted distribution for ill-conditioned problems.
For simplicity of presentation, we described in Algorithm 3 a weighted sampling SPDC method with single dual coordinate update, i.e., the case of . It is not hard to see that the non-uniform sampling scheme can also be extended to Mini-Batch SPDC with . Here, we omit the technical details.
Related Work
Chambolle and Pock considered a class of convex optimization problems with the following saddle-point structure:
When both and are strongly convex and the parameters , and are chosen appropriately, this algorithm obtains accelerated linear convergence rate [9, Theorem 3].
We can map the saddle-point problem (4) into the form of (30) by letting and
The SPDC method developed in this paper can be viewed as an extension of the batch method (31)-(33), where the dual update step (31) is replaced by a single coordinate update (7) or a mini-batch update (13). However, in order to obtain accelerated convergence rate, more subtle changes are necessary in the primal update step. More specifically, we introduced the auxiliary variable , and replaced the primal update step (32) by (8) and (14). The primal extrapolation step (33) stays the same.
To compare the batch complexity of SPDC with that of (31)-(33), we use the following facts implied by Assumption A and the relations in (34):
Based on these conditions, we list in Table 2 the equivalent parameters used in [9, Algorithm 3] and the batch complexity obtained in [9, Theorem 3], and compare them with SPDC.
The batch complexity of the Chambolle-Pock algorithm is , where the notation hides the factor. We can bound the spectral norm by the Frobenius norm and obtain
(Note that the second inequality above would be an equality if the columns of are normalized.) So in the worst case, the batch complexity of the Chambolle-Pock algorithm becomes
which matches the worst-case complexity of the AFG methods (see Section 1.1 and also the discussions in [20, Section 5]). This is also of the same order as the complexity of SPDC with (see Section 2.1). When the condition number , they can be worse than the batch complexity of SPDC with , which is .
If either or in (30) is not strongly convex, Chambolle and Pock proposed variants of the primal-dual batch algorithm to achieve accelerated sublinear convergence rates [9, Section 5.1]. It is also possible to extend them to coordinate update methods for solving problem (1) when either or is not strongly convex. Their complexities would be similar to those in Table 1.
Our algorithms and theory can be readily generalized to solve the problem of
We can also solve the primal problem (1) via its dual:
Because of the problem structure, coordinate ascent methods (e.g., ) can be more efficient than full gradient methods. In the stochastic dual coordinate ascent (SDCA) method , a dual coordinate is picked at random during each iteration and updated to increase the dual objective value. Shalev-Shwartz and Zhang showed that the iteration complexity of SDCA is , which corresponds to the batch complexity .
For more general convex optimization problems, there is a vast literature on coordinate descent methods; see, e.g., the recent overview by Wright . In particular, Nesterov’s work on randomized coordinate descent sparked a lot of recent activities on this topic. Richtárik and Takáč extended the algorithm and analysis to composite convex optimization. When applied to the dual problem (35), it becomes one variant of SDCA studied in . Mini-batch and distributed versions of SDCA have been proposed and analyzed in and respectively. Non-uniform sampling schemes have been studied for both stochastic gradient and SDCA methods (e.g., ).
Shalev-Shwartz and Zhang proposed an accelerated mini-batch SDCA method which incorporates additional primal updates than SDCA, and bears some similarity to our Mini-Batch SPDC method. They showed that its complexity interpolates between that of SDCA and AFG by varying the mini-batch size . In particular, for , it matches that of the AFG methods (as SPDC does). But for , the complexity of their method is the same as SDCA, which is worse than SPDC for ill-conditioned problems.
More recently, Lin et al. developed an accelerated proximal coordinate gradient (APCG) method for solving a more general class of composite convex optimization problems. When applied to the dual problem (35), APCG enjoys the same batch complexity \widetilde{\mathcal{O}}\bigl{(}1+\sqrt{\kappa/n}\bigr{)} as of SPDC. However, it needs an extra primal proximal-gradient step to have theoretical guarantees on the convergence of primal-dual gap [20, Section 5.1]. The computational cost of this additional step is equivalent to one pass of the dataset, thus it does not affect the overall complexity.
2 Other related work
Another way to approach problem (1) is to reformulate it as a constrained optimization problem
Suzuki considered a problem similar to (1), but with more complex regularization function , meaning that does not have a simple proximal mapping. Thus primal updates such as step (8) or (14) in SPDC and similar steps in SDCA cannot be computed efficiently. He proposed an algorithm that combines SDCA and ADMM (e.g., ), and showed that it has linear rate of convergence under similar conditions as Assumption A. It would be interesting to see if the SPDC method can be extended to their setting to obtain accelerated linear convergence rate.
Efficient Implementation with Sparse Data
Suppose that . For this case, the updates for each coordinate of are independent of each other. More specifically, can be computed coordinate-wise in closed form:
where denotes in Algorithm 1, or in Algorithm 2, or in Algorithm 3, and represents the -th coordinate of .
Although the dimension can be very large, we assume that each feature vector is sparse. We denote by the set of non-zero coordinates at iteration , that is, if for some index picked at iteration we have , then . If , then the SPDC algorithm (and its variants) updates without using the value of or . This can be seen from the updates in (7), (13) and (26), where the value of the inner product does not depend on the value of . As a consequence, we can delay the updates on and whenever without affecting the updates on , and process all the missing updates at the next time when .
Such a delayed update can be carried out very efficiently. We assume that is the last time when , and is the current iteration where we want to update and . Since implies , we have
Notice that is updated only at iterations where . The value of doesn’t change during iterations , so we have for . Substituting this equation into the recursive formula (38), we obtain
The update (39) takes time to compute. Using the same formula, we can compute and subsequently compute . Thus, the computational complexity of a single iteration in SPDC is proportional to , independent of the dimension .
where follows the definition in Section 6.1. If , then and equation (40) can be simplified as
Similar to the approach of Section 6.1, we delay the update of until . We assume to be the last iteration when , and let be the current iteration when we want to update . During iterations , the value of doesn’t change, so we have for . Using equation (44) and the invariance of for , we have an time algorithm to calculate , which we detail in Appendix D. The vector can be updated by the same algorithm since it is a linear combination of and . As a consequence, the computational complexity of each iteration in SPDC is proportional to , independent of the dimension .
Experiments
In this section, we compare the basic SPDC method (Algorithm 1) with several state-of-the-art optimization algorithms for solving problem (1). They include two batch-update algorithms: the accelerated full gradient (FAG) method [25, Section 2.2], and the limited-memory quasi-Newton method L-BFGS [29, Section 7.2]). For the AFG method, we adopt an adaptive line search scheme (e.g., ) to improve its efficiency. For the L-BFGS method, we use the memory size 30 as suggested by . We also compare SPDC with three stochastic algorithms: the stochastic average gradient (SAG) method , the stochastic dual coordinate descent (SDCA) method and the accelerated stochastic dual coordinate descent (ASDCA) method . We conduct experiments on a synthetic dataset and three real datasets.
We first compare SPDC with other algorithms on a simple quadratic problem using synthetic data. We generate i.i.d. training examples according to the model
In the form of problem (1), we have and . As a consequence, the derivative of is -Lipschitz continuous and is -strongly convex.
We evaluate the algorithms by the logarithmic optimality gap , where is the output of the algorithms after passes over the entire dataset, and is the global minimum. When the regularization coefficient is relatively large, e.g., or , the problem is well-conditioned and we observe fast convergence of the stochastic algorithms SAG, SDCA, ASDCA and SPDC, which are substantially faster than the two batch methods AFG and L-BFGS.
Figure 1 shows the convergence of the five different algorithms when we varied from to . As the plot shows, when the condition number is greater than , the SPDC algorithm also converges substantially faster than the other two stochastic methods SAG and SDCA. It is also notably faster than L-BFGS. These results support our theory that SPDC enjoys a faster convergence rate on ill-conditioned problems. In terms of their batch complexities, SPDC is up to times faster than AFG, and times faster than SAG and SDCA.
Theoretically, ASDCA enjoys the same batch complexity as SPDC up to a multiplicative constant factor. Figure 1 shows that the empirical performance of SPDC is substantially faster that of ASDCA for small . This may due to the fact that ASDCA follows an inner-outer iteration procedure, while SPDC is a single-loop algorithm, explaining why it is empirically more efficient.
2 Binary classification with real data
Here, is the smoothed hinge loss (see, e.g., ). It is easy to verify that the conjugate function of is for and otherwise.
The performance of the five algorithms are plotted in Figure 2 and Figure 3. In Figure 2, we compare SPDC with the two batch methods: AFG and L-BFGS. The results show that SPDC is substantially faster than AFG and L-BFGS for relatively large , illustrating the advantage of stochastic methods over batch methods on well-conditioned problems. As decreases to , the batch methods (especially L-BFGS) become comparable to SPDC.
Summarizing Figure 2 and Figure 3, the performance of SPDC are always comparable or better than the other methods in comparison.
Appendix A Proof of Theorem 1
We focus on characterizing the values of and after the -th update in Algorithm 2. For any , let be the value of if , i.e.,
Since is -smooth by assumption, its conjugate is -strongly convex (e.g., [15, Theorem 4.2.2]). Thus the function being maximized above is -strongly concave. Therefore,
Multiplying both sides of the above inequality by and re-arrange terms, we have
According to Algorithm 2, the set of indices to be updated are chosen randomly. For every specific index , the event happens with probability . If , then is updated to the value , which satisfies inequality (45). Otherwise, is assigned by its old value . Let be the sigma field generated by all random variables defined before round , and taking expectation conditioned on , we have
As a result, we can represent , , and in terms of the conditional expectations on , , and , respectively. Plugging these representations into inequality (45) and re-arranging terms, we obtain
Then summing over all indices and dividing both sides of the resulting inequality by , we have
where we used the shorthand notations (appeared in Algorithm 2)
Since only the dual coordinates with indices in are updated, we have
We also derive an inequality characterizing the relation between and . Since the function being minimized on the right-hand side of (14) has strong convexity parameter and is the minimizer, we have
Rearranging terms and taking expectation conditioned on , we have
In addition, we consider a particular combination of the saddle-point function values at different points. By the definition of in (4) and the notations in (48), we have
Next we add both sides of the inequalities (47) and (50) together, and then subtract equality (51) after taking expectation with respect to . This leads to the following inequality:
We need to lower bound the last term on the right-hand-side of the above inequality. To this end, we have
Recall that and, according to (16), . Therefore,
The above upper bounds on the absolute values imply
Combining the above two inequalities with (52) and (53), we obtain
Note that we have added the nonnegative term \theta\bigl{(}f(x^{(t)},y^{\star})-f(x^{\star},y^{\star})\bigr{)} to the left-hand side in (54) to ensure that each term on one side of the inequality has a corresponding term on the other side.
If the parameters , , and are chosen as in (16), that is,
Then the ratios between the coefficients of the corresponding terms on both sides of the inequality (54) are either equal to or bounded by . More specifically,
Therefore, if we define the following sequence,
Comparing the definition of in (15), we have
For , by letting , the last two terms in (56) for disappears. Moreover, we can show that the sum of the last three terms in (56) are nonnegative, and therefore we can replace with on the left-hand side of (55). To see this, we bound the absolute value of the last term:
where in the second inequality we used , in the equality we used , and in the last inequality we used . The above upper bound on absolute value implies
Appendix B Proof of Lemma 1
Assumption A implies that is smooth and is Lipschitz continuous with constant . We can bound the spectral norm with the Frobenius norm, i.e., , which results in . By definition of the saddle point, the gradient of at is . Therefore, we have
Combining the above inequality with , we have
Similarly, the second inequality can be shown by first writing , where
In this case, is Lipschitz continuous with constant . Again by definition of the saddle-point, we have . Therefore,
Recalling that , we conclude with
Appendix C Proof of Theorem 2
The proof of Theorem 2 follows similar steps for proving Theorem 1. We start by establishing relation between and between . Suppose that the quantity minimizes the function . Also notice that is a -strongly convex function minimized by , which implies
Then, following the same argument for establishing inequality (45) and plugging in inequality (57), we obtain
Note that with probability . Therefore, we have
where represents the sigma field generated by all random variables defined before iteration . Substituting the above equations into inequality (58), and averaging over , we have
where and have the same definition as in the proof of Theorem 1.
For the relation between and , we first notice that is a -strongly convex function minimized by , which implies
Following the same argument for establishing inequality (49) and plugging in inequality (60), we obtain
Taking expectation over both sides of inequality (C) and adding it to inequality (59) yields
where the matrix is a -by- matrix, whose -th row is equal to the vector .
Next, we lower bound the last term on the right-hand side of inequality (62). Indeed, it can be expanded as
Note that the probability given in (28) satisfies
Since the parameters and satisfies , we have and consequently
Combining the above two inequalities with lower bounds (62) and (63), we obtain
Recall that the parameters , , and are chosen to be
Plugging in these assignments and using the fact that , we find that
Therefore, if we define a sequence such that
then inequality (64) implies the recursive relation , which implies
To eliminate the last two terms on the left-hand side of inequality (65), we notice that
where in the equality we used . This implies
Substituting the above inequality into inequality (65) completes the proof.
Given at iteration , we present an efficient algorithm for calculating . We begin by examining the sign of .
If , then equation (69) implies for all . Therefore, we have the closed-form formula:
Finally, if , then equation (69) implies .
We look for the largest such that the right-hand side of equation (72) is positive, which is equivalent of
Thus, is the largest integer in such that inequality (73) holds. If , then is obtained by (72). Otherwise, we can calculate by formula (69), then resort to Case I or Case III, treating as .
where is the largest integer in such that the following inequality holds:
If , then is obtained by (74). Otherwise, we can calculate by formula (69), then resort to Case I or Case II, treating as .
Finally, we note that formula (69) implies the monotonicity of . As a consequence, the procedure of either Case I, Case II or Case III is executed for at most once. Hence, the algorithm for calculating has time complexity.