Less than a Single Pass: Stochastically Controlled Stochastic Gradient Method

Lihua Lei, Michael I. Jordan

Introduction

Optimizing the finite-sum convex objectives is ubiquitous in different areas:

where each $f_{i}(x)$ is a convex function. These problems are often solved by algorithms that either make use of full gradients (obtained by processing the entire dataset) or stochastic gradients (obtained by processing single data points or mini-batches of data points). The use of the former provides guarantees of eventual convergence and the latter yields advantages in terms of rate of convergence rate, scalability and simplicity of implementation . An impactful recent line of research has shown that a hybrid methodology that makes use of both full gradients and stochastic gradients can obtain the best of both worlds—guaranteed convergence at favorable rates, e.g. . The full gradients provide variance control for the stochastic gradients.

While this line of research represents significant progress towards the goal of designing scalable, autonomous learning algorithms, there remain some inefficiencies in terms of computation. With the definition of computation and communication cost in Section 2.1, the methods referred to above require $O(n\cdot C(\epsilon,d))$ computation to achieve an $\epsilon$ -approximate solution, where $n$ is the number of data points, $\epsilon$ is a target accuracy and $d$ is the dimension of the parameter vector. Some methods incur a $O(nd)$ storage cost . The linear dependence on $n$ is problematic in general. Clearly there will be situations in which accurate solutions can be obtained with less than a single pass through the data; indeed, some problems will require a constant number of steps. This will be the case, for example, if the data in a regression problem consist of a fixed number of pairs repeated a large number of times. For deterministic algorithms, the worst case analysis in shows that scanning at least a fixed proportion of the data is necessary; however, learning algorithms are generally stochastic and real-world learning problems are generally not worst case.

An equally important bottleneck for learning algorithms is the cost of communication. For large data sets that must be stored on disk or distributed across many computing nodes, the communication cost can be significant, even dominating the computation cost. For example, SVRG makes use of full gradient over the whole dataset which can incur prohibitive communication cost. There is an active line of research that focuses on communication costs; see, e.g. .

In this article, we present a variant of the stochastic variance reduced gradient (SVRG) method that we refer to as stochastically controlled stochastic gradient (SCSG). The basic idea behind SCSG—that of approximating the full gradient in SVRG via a subsample—has been explored by others, but we present several innovations that yield significant improvements both in theory and in practice. In contradistinction to SVRG, the theoretical convergence rate of SCSG has a sublinear regime in terms of both computation and communication. This regime is important in machine learning problems, notably in the common situation in which the sample size is large, ( $n\in[10^{4},10^{9}]$ ), while the required accuracy is low, $\epsilon\in[10^{-4},10^{-2}]$ . The analysis in this article shows that SCSG is able to achieve the target accuracy in this regime with potentially less than a single pass through the data.

In the regime of low accuracy, SCSG is never worse than the classical stochastic gradient descent (SGD). Although SCSG has the same dependence on the target accuracy as SGD, it has a potentially much smaller factor. In fact, the theoretical complexity of SGD depends on the uniform bound of $\nabla f_{i}(x)$ over the domain and the component index. This might be infinite even in the most common least square problems. By contrast, the complexity of SCSG depends on a new measure $\mathcal{H}(f)$ , defined in Section 2 and discussed in Section 4, which is finite and small for a large class of practical problems. In particular, $\mathcal{H}(f)=O(1)$ in many cases where SGD does not have theoretical guarantees to converge. The measure $\mathcal{H}(f)$ sheds light upon characterizing the difficulty of optimization problems in the form of a finite sum and reveals some intrinsic difference between finite-sum optimization and stochastic approximation, which is considered by other relevant works; e.g., streaming SVRG and dynaSAGA.

The remainder of the paper is organized as follows. In Section 2, we review SVRG, discuss several of its variants and we describe the SCSG algorithm. We provide a theoretical convergence analysis in Section 3. In Section 4, we give a comprehensive discussion on the difficulty measure $\mathcal{H}(f)$ . The empirical results on real datasets are presented in Section 5. Finally, we conclude our work and discuss potential extensions in Section 6. All technical proofs are relegated to the Appendices.

Notation, Assumptions and Algorithm

The assumption A1 on the smoothness of individual functions will be used throughout this paper.

$f_{i}$ is convex with $L$ -Lipschitz gradient

for some $L<\infty$ and all $i\in\{1,\ldots,n\}$ ;

The following assumption will be used in the context of strongly-convex objectives.

Note that we only require the strong convexity of $f$ instead of each component.

Let $x^{*}$ denote the minimizer of $f$ that minimizes $\mathcal{H}(f)$ in (2), then $\mathcal{H}(f)$ can be written as

Then $\frac{\mu}{2}\Delta_{x}\leq\Delta_{f}\leq\frac{L}{2}\Delta_{x}$ under assumption A1 and A2. A point $y$ , possibly random, is called an $\epsilon$ -approximated solution if

The expectation of the above distributions satisfy that

The stochastic variance reduced gradient (SVRG) method blends gradient descent and stochastic gradient descent, using the former to control the effect of the variance of the latter . We summarize SVRG in Algorithm 1.

Using the definition from Section 2.1, it is easy to see that the computation cost of SVRG is $O((n+m)T)$ . As shown in the convergence analysis of , $m$ is required to be $\Omega(\kappa)$ to guarantee convergence. Thus, the computation cost of SVRG is $O((n+\kappa)T)$ . The costs of the other algorithms considered in Table 1 can be obtain in a similar fashion. For comparison, we only present the results for smooth case (assumption A1).

Much of this work focuses on the strongly convex case. In the non-strongly convex setting one way to proceed is to add a $L_{2}$ regularizer $\frac{\lambda}{2}\|x\|^{2}$ . Tuning $\lambda$ , however, is subtle and requires multiple runs of the algorithm on a grid of $\lambda$ . For general convex functions an alternative approach has been presented by (they generate $N_{j}$ by a different scheme in line 4), which proves a computation complexity $O\left(\frac{n}{\epsilon}\right)$ . Another approach is discussed by , who improve the complexity to $O\left(n+\frac{\sqrt{n}}{\epsilon}\right)$ by scaling the stepsize as $O\left(\frac{1}{\sqrt{n}}\right)$ . However, their algorithm still relies on calculating a full gradient. Other variants of SVRG have been proposed in the distributed computing setting and in the stochastic setting .

2 SCSG

The average computation cost of SCSG is $BT+\sum_{j=1}^{n}N_{j}$ . By the law of large numbers and the expectation formula (4), this is close to $2BT$ . Table 1 summarizes the computation complexity as well as some other details of SCSG and 11 other existing popular algorithms. The table includes the computation cost of optimizing non-strongly-convex functions (column 1) and strongly convex functions (column 2). In practice, the amount of tuning is of major concern. For this reason, a fixed stepsize is usually preferred to a complicated stepsize scheme and it is better that the tuning parameter does not depend on unknown quantities; e.g., $\Delta_{x}$ or the total number of epochs $T$ . These issues are documented in column 3 and column 4. Moreover, many algorithms requires $\|\nabla f_{i}\|$ to be bounded, i.e. $f_{i}$ to be Lipschitz. However, this assumption is not realistic in many cases and it is better to discard it. To address this issue, we document it in column 5. To highlight the dependence on $\epsilon$ and $\kappa$ (or $\mu$ ), we implicitly assume that other parameters, e.g. $\Delta_{x},L$ , are $O(1)$ as a convention.

As seen from Table 1, SCSG and SGD are the only two methods which are able to reach an $\epsilon$ -approximate solution with potentially less than a single pass through the data; moreover, the number of accesses of the data is independent of the sample size $n$ . Comparing to SCSG, SGD requires each $f_{i}$ to be Lipschitz, which is not satisfied by least-square objectives. By contrast, as will be shown in Section 3, the computation cost of SCSG only depends on the quantity $\mathcal{H}(f)$ , which is relatively small in many cases. Furthermore, SGD either sets the stepsize based on unknown quantities like the total number of epochs $T$ or needs to use a time-varying sequence of stepsizes. This involves intensive tuning as opposed to a fixed stepsize.

On the other hand, SCSG is communication-efficient since it only needs to operate on mini-batches as SGD. This is particularly important in modern large-scale tasks. By contrast, those algorithms that require full gradients evaluation either need extra communication for synchronization or need extra computational cost for the asynchronous version to converge; See e.g. .

Convergence Analysis

In this section we present a convergence analysis of SCSG. We first state the following key lemma that connects our algorithm with the measure $\mathcal{H}$ defined in (2).

The proof, which appears in Appendix B, involves a standard technique for analyzing sampling without replacement. Obviously, $\mathcal{H}=O(1)$ if $\|\nabla f_{i}(x)\|$ is uniformly bounded as is often assumed in the literature. In section 4 we will present various other situations where $\mathcal{H}=O(1)$ .

Note that the extra variation vanishes when $B=n$ and in general is inversely proportional to the batch size. In the rest of this section, we will first discuss the case $B=n$ , which we refer to as R-SVRG (Randomized SVRG), to compare with the original SVRG. Later we will discuss the general case.

Let $B=n$ and assume that $\eta L\leq\frac{1}{3}$ , then

Based on Theorem 3.2, we first consider a constant stepsize $\eta$ scaled as $\frac{1}{L}$ .

Let $\eta=\frac{\theta}{L}$ with $\theta<\frac{1}{3}$ . Then under the assumption A1, with the output $\bar{x}_{T}$ ,

By scaling $\eta$ as $\frac{1}{\sqrt{n}}$ , R-SVRG is able to achieve the same complexity of , which is the best bound in the class of SVRG-type algorithms without acceleration techniques.

Let $\eta=\frac{\theta}{L\sqrt{n}}$ with $\theta\leq\frac{1}{3}$ . Then under the assumption A1, with the output $\bar{x}_{T}$ ,

Assume that $\eta L<\frac{1}{13}$ . Under the assumption A1,

which does not equal $\nabla f(x^{(j)}_{k})$ in general. Most novelty of our analysis lies in dealing with the extra bias. Fortunately, we found that the extra terms do not worsen the complexity by scaling $\eta$ as $\frac{1}{B}$ .

Assume that $\displaystyle\frac{\theta}{1-13\theta/B}\cdot\frac{9}{2\gamma}<1$ , then with the output $\bar{x}_{T}$ ,

Corollary 3.6 shows that SCSG is never worse than SGD and SVRG (with constant stepsize scaled as $\frac{1}{L}$ ). Compared with SGD whose complexity is $O\left(\frac{\mathcal{H}^{*}\Delta_{x}}{\epsilon^{2}}\right)$ where

SCSG has a factor $\mathcal{H}$ which is strictly smaller than $\mathcal{H}^{*}$ . It will be shown in Section 4, $\mathcal{H}$ can be much smaller than $\mathcal{H}^{*}$ even in the case where $\mathcal{H}^{*}=\infty$ .

Assume that $\theta\leq\frac{\gamma}{22},\gamma>\frac{9}{2}$ , then with the output $\bar{x}_{T}$ ,

Under the same settings of Corollary 3.8 except that setting

More Details on ℋ(f)ℋ𝑓\mathcal{H}(f)

The first attempt to describe the heterogeneity is through an unrealistic condition, called strong growth condition(), which requires

Under (13), proves that the stochastic gradient methods have the same convergence rate as the full gradient methods. However, (13) is unrealistic since it implies for any minimizer $x^{*}$ of $f$ , $x^{*}$ is the stationary point of all individual loss functions.

By contrast, our proposed measure $\mathcal{H}$ is well-behaved in most applications without awkward assumptions such as the bounded domain. Recall that

It can be viewed as a version of $\mathcal{H}^{*}$ which replaces the supremum by the value at a single point, when the optimum of $f(x)$ is unique. As a consequence, $\mathcal{H}\leq\mathcal{H}^{*}$ . In addition, when the strong growth condition (13) holds, $\|\nabla f_{i}(x^{*})\|=0$ for all $i$ and hence $\mathcal{H}^{*}=0$ . These simple facts show that $\mathcal{H}$ is strictly better than $\mathcal{G}$ and $\mathcal{H}^{*}$ as a measure of difficulty. We will show in the next two subsections that $\mathcal{H}$ can be controlled and estimated in almost all problems and is well-behaved in a wide range of applications.

Although being unrealistic, it is often assumed that $\|\nabla f_{i}(x)\|$ is uniformly bounded over the domain. This implies the boundedness of $\mathcal{H}$ directly and hence provides an example where the problem (1) is “easy”.

Let $G^{2}$ and $\mathcal{H}^{*}$ be defined in Table 2, then

Surprisingly, $\mathcal{H}$ can be bounded even without any assumption other than A1 by using an arbitrary reference point.

This entails that problem (1) with i.i.d. individual functions is “easy”. This is heuristically reasonable since the i.i.d. assumption, plus the moment conditions, forces the data to be highly homogeneous.

In fact, (17) can be proved under much broader settings. For example, when solving a linear equation $Ax=b$ , $f_{i}(x)$ can be set as $\frac{1}{2}(a_{i}^{T}x-b_{i})^{2}$ and no randomness is involved. If we set $x=0$ , then (16) implies that

Then $\mathcal{H}=O(1)$ provided $\|b\|^{2}=O(n)$ .

Another type of problems with $\mathcal{H}=O(1)$ involves pairwise comparisons, i.e.

where $Z_{1},\ldots,Z_{m}$ are independent samples. For example, in preference elicitation or sporting competitions where the data is collected as pairwise-comparisons, one can fit a Bradley-Terry model to obtain the underlying “score” that represents the quality of each unit. The objective function of the Bradley-Terry model is $\sum_{j,k}[W_{j,k}\beta_{j}-W_{j,k}\log(e^{\beta_{j}}+e^{\beta_{k}})]$ where $W_{j,k}$ is the number of times that the unit $j$ beats the unit $k$ (, ). Other examples that involve a similar structure are metric learning (, ) and convex relaxation of graph cuts (). In these cases, we can also bound $\mathcal{H}$ under mild conditions.

Finally, it is worth mentioning that (17) cannot be established for $\mathcal{H}^{*}$ unless the domain is compact and more regularity conditions, than the existence of second moment, are imposed to ensure that a certain version of uniform law of large number can be applied.

2 Estimating ℋ(f)ℋ𝑓\mathcal{H}(f) in Generalized Linear Models

Optimzation problems in machine learning are often generalized linear models where $f_{i}(x)=\rho(y_{i},a_{i}^{T}x)$ , with $a_{i}$ being the covariates and $y_{i}$ being the responses, for some convex loss function $\rho$ . Let $\rho_{2}(z,w)=\frac{\partial}{\partial w}\rho(z,w)$ . Then by definition

If $\rho_{2}(y_{i},a_{i}^{T}x)$ is uniformly bounded with $\rho_{2}(y_{i},a_{i}^{T}x^{*})^{2}\leq M_{1}$ , then

We will show in appendix E that $M_{1}=2$ for multi-class logistic regression, regardless of the number of classes. The same bound can also be derived for Huber regression , Probit model (), etc.. When the domain is unbounded, the (penalized) least square regression has an unbounded $\rho_{2}(z,w)$ . However notice that $x^{*}=(A^{T}A)^{-1}A^{T}y$ where $A=(a_{1},\ldots,a_{n})^{T}$ , one can easily show that

3 ℋ(f)ℋ𝑓\mathcal{H}(f) in Pathological Cases

The first example is due to the high dimension. When the dimension is comparable to $n$ , even the i.i.d. assumption cannot guarantee a good behavior of $\mathcal{H}$ , without further conditions, since the law of large number fails. The second example is due to the severe heterogeneity of components. In fact the $i$ -th component reaches its minimum at $x=i$ while the global function reaches its minimum at $x=\frac{n+1}{4}$ and thus most components behaves completely different from the global function.

Nevertheless, it is worth emphasizing that SGD also faces with the same issue in these two cases. More importantly, SCSG does not suffer from these undesirable properties since it will choose $B=n$ automatically; See Corollary 3.6 to Corollary 3.9.

Experiments

In this section, we illustrate the performance of SCSG by implementing it for multi-class logistic regression on the MNIST dataset http://yann.lecun.com/exdb/mnist/. We normalize the data into the range $ $by dividing each entry by$ 256$. No regularization term is added and so the function to be minimized is

The performance is measured by $\log_{10}\|\nabla f(x)\|^{2}$ versus the number of passes of data Although beging ideal to report $f(x)-f(x^{*})$ , it is not feasible in that $f^{*}$ is unknown. . For each algorithm mentioned later, we selects the best-tuned stepsize and then implement the algorithm for 10 times and report the average to avoid the random effect.

Here we compare SCSG with mini-batch SGD, with the batch size $B$ , and SVRG. Moreover, we consider three variants of SCSG:

(SCSGFixN) set $N_{j}\equiv B$ , instead of generated from a geometric distribution;

(SCSGNew) randomly pick $i_{k}\in\mathcal{I}_{j}$ , instead of from the whole dataset $[n]$ ;

(SCSGNewFixN) set $N_{j}\equiv B$ and randomly pick $i_{k}\in\mathcal{I}_{j}$ .

The first variant is to check whether geometric random variable is essential in practice; the second variant is to check whether running SGD from the whole dataset is necessary; and the third variant is the combination.

For all the variants of SCSG and SGD, we consider three batch sizes $B\in\{0.01n,0.05n,0.25n\}$ . The results are plotted in Figure 1, from which we make the following observations:

SCSG is able to reach an accurate solution very fast since all versions of SCSG are more efficient than SGD and SVRG in the first 5 passes. This confirms our theory;

SCSG with fixed $N_{j}$ is slightly more effective than the original SCSG. Thus the geometric random variable might not be essential in practice;

It makes no difference whether sampling from the whole dataset or sampling from the mini-batch when running the SGD steps in SCSG.

Based on our observations, we recommend implementing SCSGNewFixN as the default since 1) the fixed number of SGD steps stablizes the procedure; 2) sampling from the mini-batch reduces the communication cost incurred by accessing data from the whole dataset.

Discussion

We propose SCSG as a member of the SVRG family of algorithms, proving its superior performance in terms of both computation and communication cost. Both complexities are independent of sample size when the required accuracy is low, for various functions which are widely optimized in practice. The real data example also validates our theory.

We plan to explore several variants of SCSG in future work. For example, a non-uniform sampling scheme can be applied to SGD steps to leverage the Lipschitz constants $L_{i}$ as in SVRG. More interestingly, we can consider a better sampling scheme for $\mathcal{I}_{j}$ by putting more weight on influential observations. The proximal settings are also straightforward extensions of our current work.

As a final comment, we found that the previous complexity analysis focuses on the high-accuracy computation for which the dependence on the sample size $n$ and condition number $\kappa$ is of major concern. The low-accuracy regime is unfortunately under-studied theoretically even though it is commonly encountered in practice. We advocate taking all three parameters, namely $n$ , $\kappa$ and $\epsilon$ , into consideration and distinguishing the analyses for high-accuracy computation and low-accuracy computation as standard practice in the literature.

Acknowledgment

We thank the Chi Jin, Nathan Srebro and anonymous reviewers for their helpful comments, which greatly improved this work.

References

Appendix A Lemmas

Let $g$ be a convex function that satisfies the assumption $\textbf{A}1$ ,

Proof This is the standard Co-coercivity argument; See e.g. , Theorem 2.1.5 of .

Proof An elementary computation shows that

Using the fact that $(z+w)^{2}\leq 2z^{2}+2w^{2}$ , we have

Appendix B One-Epoch Analysis

First we prove a lemma that generalizes Lemma 3.1.

Further let $\mathcal{J}$ be a uniform random subset of $[M]$ with size $m$ . Then

Proof Let $W_{j}=I(j\in\mathcal{J})$ , then it is easy to see that

Then the sampling mean can be reformulated as

Proof [Lemma 3.1] Let $z_{i}=\nabla f_{i}(x^{*})$ . Then

As all other algorithms, we start from deriving a bound for the stochastic gradients $\nu^{(j)}_{k}$ . For convenience, we define $e_{j}$ as the bias of $\nu^{(j)}_{k}$ , i.e.

Under the assumption A1 and A2 with $\mu$ possibly equal to ,

By Lemma A.1 with $g=f_{i_{k}},x=x^{*},y=x^{(j)}_{k}$ ,

where the last line uses the fact that $i_{k}$ is independent of $(x^{(j)}_{k},x^{(j)}_{0})$ . Similarly, by Lemma A.1 with $g=f_{i_{k}},x=x^{(j)}_{0},y=x^{*}$ ,

where the last line uses the smoothness of $f$ . Putting the pieces together, we conclude that

Assume that $\eta L\leq 1/2$ . Then for any $j$ ,

Similarly, since $x^{(j)}_{0}$ is also independent of $\mathcal{I}_{j}$ ,

Now let $k=N_{j}$ and taking expectation with respect to $N_{j}$ , by Lemma A.2 and B.4,

The Cauchy-Schwartz inequality implies that

Proof Let $u=x^{*}$ . By Lemma B.2 and Lemma B.5, we have

Proof Using the fact that $\|z+w\|^{2}\leq 2\|z\|^{2}+2\|w\|^{2}$ , we have

Noticing that $\nabla f(x^{*})=0$ , by Lemma B.1,

On the other hand, by Lemma B.1 again, we obtain that

Putting the pieces together we prove the result.

Putting all the pieces together, we can derive the key inequality on the performance of SCSG within a single epoch.

Suppose $\eta L\leq\frac{1}{8}$ and $B\geq 8$ . Then under the assumption A1 and A2,

Using the fact that $2zw\leq\beta^{-1}z^{2}+\beta w^{2}$ for any $\beta>0$ , we have

Then by Corollary B.7 and Lemma B.8, we obtain that

Since $\eta L\leq\frac{1}{8}$ , the assumption A2 implies that

Finally, since $B\geq 8$ , we conclude that

Proof [of Lemma B.4] We prove the first claim by induction. When $j=0$ , the claim is obvious. Suppose we prove the claim for $j-1$ , i.e.

Let $y^{(j)}_{k}$ be another sequence constructed as follows:

On the other hand, showed in the proof of their Theorem 1 that

Since $f(y^{(j)}_{k})-f(x^{*})\geq 0$ and $1-2\eta L\geq 0$ , we have

Putting (29) and (30) together, and using the fact that $\|a+b\|_{2}^{2}\leq 2\|a\|_{2}^{2}+2\|b\|_{2}^{2}$ , we obtain that

Appendix C Analysis of R-SVRG

Throughout the rest of appendices, we will denote $T(\epsilon)$ and $T_{x}(\epsilon)$ by

Although we can directly apply Theorem B.9 with $B=n$ , the constants involved in the analysis are compromised. To sharpen the constants, we derive a counterpart of Theorem B.9 for R-SVRG.

Let $B=n$ and assume that $\eta L\leq\frac{1}{3}$ . Under the assumption A1 and A2,

Proof In this case, $e_{j}\equiv 0$ . By Lemma B.5 with $u=x^{*}$ and Lemma B.2,

Since $\eta L\leq\frac{1}{3}$ , the assumption A2 implies that

Based on Theorem C.1, we can prove the results in Section 3.1.

Summing the above inequality for $j=1,\ldots,T$ , we have

The result is then proved by noticing that

Proof [Corollary 3.3] In the non-strongly convex case, by part (1) of Theorem 3.2,

Recalling the definitions of $T(\epsilon)$ and $T_{x}(\epsilon)$ in (33) and (34), this implies that

In the strongly convex case, by part (2) of Theorem 3.2,

Note that $LQ_{0}=L\Delta_{x}+2\theta n\Delta_{f}$ , we obtain that

Proof [Corollary 3.4] By part (1) of Theorem 3.2, we have

Appendix D Analysis of SCSG

Proof [Theorem 3.5] By Theorem B.9, we have

Telescoping the above inequality for $j=1,\ldots,T$ , we have

where the last inequality uses $13\eta L\leq 1$ . By convexity,

Before proving the results in Section 3.2, we derive the computation complexity for arbitrary batch size $B$ with an appropriately scaled stepsize $\eta$ in the non-strongly convex case.

Assume A1 holds. Set $\eta=\frac{\alpha}{L}$ with

Under these conditions, $D_{2}$ is bounded by

Proof [Corollary 3.6] Let $\alpha=\theta/B$ . Then

where the last equality uses the fact that $\Delta_{f}\leq\frac{L\Delta_{x}}{2}$ and $\theta=\Theta(1)$ . As a consequence,

D.2 Convergence Analysis for Strongly Convex Objectives

Multiplying both sides of (37) by $(1+\xi)^{j-1}$ and summing over $j=T,T-1,\ldots,1$ , we obtain that

Proof [Corollary 3.8] By definition, $B\geq\gamma\kappa$ , and

and whenever $B<n$ , $B\geq\gamma\kappa>\kappa$ . Thus,

Proof [Corollary 3.9] Using the same argument as in the proof of Corollary 3.8, $\phi>0$ . By (10) in Theorem 3.7,

Appendix E Proof of Results in Section 4

Proof [Proposition 4.2] By Lemma A.1 in Appendix A

Averaging the above inequality for all $i$ results in

Proof [Proposition 4.3] In this case, the RHS of (16) is a form of order-two V-statistics () which can be written as

The above inequalities imply that $\mathcal{H}=O_{p}(1)$ in this case. The conclusion can be easily extended to the case where $f(x)$ can be written as a higher-order V-statistics.

Denote by $x$ the concatenation of $x_{1},\ldots,x_{K-1}$ as in Section 5. For multi-class logistic regression loss

It is easy to see that for any $i$ and $x$