Accelerated Gradient Descent via Long Steps
Benjamin Grimmer, Kevin Shu, Alex L. Wang
Introduction
When utilizing constant stepsizes, until recently, the best known guarantee was the textbook result that fixing ensures . This was improved by the tight convergence theory of Teboulle and Vaisbourd , showing a rate of
when the stepsizes . Utilizing nonconstant stepsizes monotonically converging up to , they further showed a rate approaching . These coefficient improvements were first conjectured by .
By utilizing nonconstant periodically long stepsizes, Grimmer showed improved convergence rates are possible outside the classic range of stepsizes . We refer to steps with as long steps since they go beyond the classic regime where descent on the objective value is guaranteed. Their strongest result, resulting from a computer-aided semidefinite programming proof technique, showed repeating a cycle of stepsizes ranging from to gives a rate of
Note, bounding (or a similar quantity) is natural for such long step methods as monotone decrease of the objective is no longer ensured. By considering longer and more complex patterns, increasing gains in the coefficient appear to follow. However, the reliance on numerically solving semidefinite programs with size depending on the pattern length limited this prior work’s ability to explore and prove continued improvements in convergence rates. Grimmer conjectured at least a rate would follow if one could design and analyze (algebraically) cyclic patterns of generic length.
We show greater gains are possible. By using nonconstant, nonperiodic stepsizes , we prove
Proving this relies on semidefinite programming-based analysis techniques and considers the overall effect of many iterations at once (rather than the one-step inductions typical to most first-order method analysis).
In related work, Das Gupta et. al. produced numerically globally optimal stepsize selections via a branch-and-bound procedure for gradient descent with a fixed number of steps . By fitting to asymptotics of their numerical guarantees [2, Figure 2], they conjectured a rate may be possible and may be best possible. Our work leaves open the gap between our rate and their conjecture, as well as the gap between their conjecture and the known lower bound for general gradient methods of .
Generally, studying accelerated convergence rates stemming from long steps can yield several advantages/insights beyond what classic momentum methods can provide. Understanding the acceleration stemming from long steps may yield insights into the fundamental mechanism enabling acceleration; we have shown that changing the update directions based on an auxiliary momentum sequence is not needed to beat . Hence, an acceleration can be attained by a method storing only one vector in memory at each step rather than two. Further, using long steps may partially mitigate the effects of inexact or stochastic gradients, known to hamper momentum methods , as no momentum term exists to propagate past errors into future steps. Lastly, we note that continued work in this direction may yield theoretical support for such cyclic long stepsize patterns used in neural network training .
Stronger guarantees for gradient descent with variable stepsizes are known in specialized settings, like -strongly convex minimization. Classically, gradient descent with constant stepsizes produces an -minimizer in iterations, where . Concurrent to this work, Altschuler and Parrilo recently showed an accelerated rate through the inclusion of long steps of (extending their prior preliminary results in ). Our convergence theory, using a different pattern of long steps, also improves on the classic , although at a weaker rate. Our Theorem 3.2 ensures that under our long stepsize selection, gradient descent has a rate. The silver ratio, , occurs prominently in both our analysis and theirs, indicating potential deeper connections.
Altschuler and Parrilo ’s faster accelerated rate for smooth strongly convex problems can be extended to give a guarantee for a modified gradient descent method for general smooth convex problems, the main focus of this work. Doing so requires running gradient descent on a modified objective function (whose choice depends on specifying a target accuracy and an initial distance to optimal). In contrast, our results show acceleration beyond is possible for gradient descent via long steps alone, i.e., without needing this modification and additional problem knowledge.
In the further specialized case of minimizing strongly convex quadratics, the optimal stepsizes were given by , which attain the optimal rate. For nonconvex optimization, exact worst-case guarantees for gradient descent with short steps were given by Abbaszadehpeivasti et al. . The potential use of longer steps (greater than ) in nonconvex settings is an interesting future direction but beyond the scope of this work.
In the remainder of this introduction, we define our stepsize selection which accelerates due to the inclusion of long steps. To prove our accelerated rates, Section 2 first reviews the semidefinite programming analysis technique of based on the performance estimation problem techniques of . Specifically, the proof of our accelerated convergence rates utilize the “straightforward” property of stepsize patterns. Section 3 proves the claimed convergence rate assuming that certain finite-length blocks within our nonperiodic stepsize pattern are straightforward. Section 4 shows that straightforwardness of a stepsize pattern can be certified by producing a feasible solution to an associated spectral set. Finally, we close the loop and show that appropriate blocks within our stepsize pattern are straightforward by constructing such certificates in Section 5. Appendices D and E verify the necessary conditions on our certificates. Several symbolically intense calculations or simplifications are deferred to the associated Mathematica notebook available at the Github repository https://github.com/ootks/GDLongSteps. This same Github repository also contains Julia code that computes our step size sequences and our associated certificates.
We define and inductively. For , define
Note that , so that has a unique root larger than and is well-defined.
The building block will be a pattern of length . Although this pattern has exponentially many stepsizes, it will contain only distinct values. This th building block stepsize pattern takes the form
This construction was arrived at through substantial computer-search over patterns with the necessary properties (see our Theorem 4.1). Although the above pattern may seem somewhat cryptic, given the values in , to produce the next pattern of the form (1.5) just requires specifying three new numbers , and . The values of the sequence follow a nice exponential pattern (1.3). Once this is set, the values for and are then determined entirely by a system of two equations in two variables that simply imposes two necessary conditions for straightforwardness of (see Section 3.2).
Following this construction, the first four building block patterns are, for example, given by
The values of , , and are plotted in Figure 1, showcasing their symmetries and fractal nature. Below we provide bounds on how the quantities , , and grow asymptotically (proof deferred in Appendix A.1).
1.2 Building the Proposed Nonconstant, Nonperiodic Stepsize Sequence
We construct our accelerated sequence of long stepsizes from rescaled versions of the stepsize building block patterns by some fixed scalar . We first apply a fixed number of times, then apply the pattern a fixed number of times, and so on. Each rescaled pattern will be applied
times where the associated parameter is defined as in Lemma 3.1.
We do not believe the choice of therein is as large as possible. Improvements on that parameter directly would improve our guarantees as fewer applications of each pattern would be needed. In fact, we propose a conjecture following our Lemma 3.1 on how should scale with (see 3.1). This conjecture is supported numerically and a proof of it would directly lead to an convergence rate guarantee.
As a result, the proposed nonconstant, nonperiodic stepsize sequence is
We denote the first iteration where stepsizes are drawn from the pattern by
Note the value of shrinks geometrically. As a result, the iteration counts where we switch to the next building block stepsize pattern grows geometrically. For example, setting , the proposed sequences of stepsizes would be of the form
Performance Estimation and Straightforwardness
Our proof machinery is built upon the performance estimation problem (PEP) ideas of . We first introduce this PEP line of work and associated semidefinite programs applied to our particular setting. Then, we introduce the improved semidefinite programming technique of , identifying a class of stepsize patterns, dubbed straightforward, for which the effects of long steps can be analyzed.
The performance estimation problem (PEP) results of establish that this problem can be relaxed (often tightly instead as a reformulation) to a finite-dimensional semidefinite minimization problem. Their PEP process of reformulations is carried out below, following the notation used in Grimmer , to introduce the needed notations here and for completeness.
Step 1: A QCQP reformulation. First, as proposed by Drori and Teboulle , one can discretize the infinite-dimensional problem defining over all possible objective values and gradients at the points with as done below. Using the interpolation theorem of Taylor et al. , this gives an exact reformulation rather than a relaxation, giving
where, without loss of generality, we have fixed .
Step 2: An SDP relaxation. Second, one can relax the nonconvex problem (2.2) to the following SDP as done in . Define
with the following notation for selecting columns and elements of and :
This notation ensures , , and Furthermore, for , define
Under an additional rank condition (that the problem dimension exceeds ), the QCQP (2.2) and SDP (2.3) are actually equivalent. However, this is not needed for our analysis, so we make no such assumption.
Step 3: The upper bounding dual SDP. Third, note the maximization SDP (2.3) is bounded above by its dual minimization SDP by weak duality, giving
Although it is not needed for our analysis, equality holds here as well (i.e., strong duality holds) due to [6, Theorem 6].
The dependence of this function on each parameter is made clear by considering broken into the following blocks
This first entry only depends on and only occurs in this first entry. The remainder of the first column and row are linear functions of only , denoted by . The remaining block of is linear in and affine in , denoted by .
2 Straightforward Stepsize Patterns
The performance estimation technique defined above does not provide a mechanism to give convergence rates for gradient descent as it only considers a fixed number of iterations . To enable these PEP semidefinite programs to provide convergence rate theorems, Grimmer proposed considering stepsize patterns where this worst-case function is bounded above by
Theorem 3.2 of showed that a stepsize pattern is straightforward with parameter if the following spectral set is nonempty
Hence proving a pattern is straightforward amounts to identifying a feasible solution to a semidefinite program. Grimmer used this to automate the search for long straightforward patterns, generating their constant factor convergence rate improvements for periodic stepsize sequences.
This straightforward structure is critical to our proof development as well. We will show that any rescaling of our building block patterns is straightforward. However, in contrast to this prior work, our proof of this will be entirely analytic and hence apply for all . The move from computer-generated certificates to exact algebraic formulas was essential to move the resulting performance-bound gains from being constant factor improvements to our accelerated big-O convergence.
Accelerated Convergence Rate Analysis
Our convergence rate analysis relies on three lemmas, (i) showing each rescaled building block pattern is straightforward, (ii) guaranteeing progress is made after applications of each pattern, and (iii) bounding the total number of steps in each pattern. Our first lemma requires substantial and nontrivial constructions and verification to prove. This is deferred to Sections 4 and beyond.
Each (scaled) building block pattern is straightforward. In particular, has parameter and for , has parameter
Note these parameters are chosen very conservatively. Any at most exponentially decaying lower bound suffices to give a rate strictly faster than . The slack in our above bound can be seen by numerically computing the largest such is nonempty (implying straightforwardness for ) for . The resulting numerical values are given below
The successive ratios in the numerical values of are decreasing and seem to approach . For example, . This suggests the following conjecture:
There exists such that for all , the building block is straightforward with .
As discussed previously, proving this conjecture would directly lead to an convergence rate guarantee. We also mention that if one could show the even stronger bound of , then our steplength schedule would have the nice property that each block is repeated only constant number of times and the resulting convergence rate guarantee, , would match the rate achieved by Altschuler and Parrilo with their modified gradient descent algorithm.
Our second lemma analyzes the objective gap of gradient descent with the stepsize sequence (1.7) after completing applications of . This lemma follows directly from Lemma 3.1. Recall denotes the iteration of gradient descent just after applications of has completed.
Each has .
Note this trivially holds for since and . We prove this inductively by showing that if , then . Suppose . Then the straightforwardness of ensures that the objective gap decreases with each application of the pattern . Namely, for any , we have
Solving this recurrence relation (of the standard form ) ensures that for all
Hence iteration , after applications of , has . ∎
To convert this bound to a convergence rate guarantee, we need a bound on , given below.
Let and . If , then and for ,
Trivially . Recall from Lemma 3.1 that the straightforwardness parameter is given by and if . Then for ,
From this, our accelerated convergence guarantee for gradient descent is immediate.
For any target accuracy and rescaling , gradient descent with stepsize sequence (1.7) has by some iteration
As in Lemma 3.3, let and for which if . Let be the last pattern with . Since monotonically decreases to zero, is finite. Lemma 3.3 bounds the number of steps before building block pattern is used by
A bound on the number of steps using pattern before an -minimizer follows from the convergence guarantee (3.1). Namely, the number of applications of the pattern needed is at most
Hence, if and so , an -minimizer is found by iteration
Otherwise if , then an -minimizer is found within iterations. Since requires , one can verify . ∎
This convergence rate improves if stronger lower bounds on are provided. Convergence theory matching the conjectured optimal rate of would follow immediately if one could show a bound with . This would give a rate of and has the nice property that would become a constant. Such an idealized, potentially optimal, accelerated stepsize pattern would just apply each a constant number of times.
Lemma 3.1 further enables direct analysis of gradient descent for strongly convex minimization. Let and . Note that -strong convexity of (defined as being convex) ensures .In fact, this is the only property of strong convexity used in our analysis here. So our convergence guarantee presented in Theorem 3.2 holds more generally for any problem satisfying only a quadratic growth bound. Observe that if , the objective gap contracts after applying the pattern with
Conversely, if , straightforwardness ensures that
Hence, every application of this pattern yields a contraction of at least
Given the condition number and fixing , one can select the stepsize pattern giving the best contraction. Consider . Supposing , we have , giving bounds of . Noting , one has . Then the guarantee (3.2) gives a contraction factor after applying the pattern of stepsizes of Since this contraction is attained after iterations, when amortized, the per-iteration contraction factor is
This gives the following convergence theorem. Note this accelerated rate is slower than that concurrently developed by . If one could improve the value of above to , our rate would improve to match theirs.
2 Tight Bounds on Straightforward Patterns
Some insight into the structure of straightforward stepsize sequences follows from considering particular “bad” problem instances. Since straightforward patterns always yield a descent, showing a failure to achieve a descent on any instance suffices to prove a pattern is not straightforward. Three elementary bounds of this form are given below.
Consider the one-dimensional objective function with . Then each gradient descent step is . Hence . To achieve any descent by the end of this pattern, this must be between one and minus one. ∎
Consider the one-dimensional Huber objective function if and otherwise. Letting , one has . Supposing , one has . Consequently, gradient descent failed to descend as . ∎
Without loss of generality, and . Consider the one-dimensional objective function for and otherwise. Suppose . Letting , we then have and . By our assumption, we then have . So no descent is achieved by applying the pattern and hence it is not straightforward. ∎
The first two of these bounds on how large straightforward patterns are actually tight in all of our proposed building block patterns . The selection of the middle stepsize is exactly the sum of all the other stepsizes plus two, matching Proposition 3.5. The selection of the one-quarter and three-quarters stepsize is exactly the choice making the product in Proposition 3.4 equal one (Appendix A.2 verifies this).
Arguments based on these bounds can show that our first two building block patterns have the longest total length possible among all straightforward patterns. We believe this is true for all of our proposed patterns but only provide proof for the global maximality for .
For , no straightforward pattern has . (Hence the rescaled patterns approach having maximum length.)
This is an immediate consequence of Proposition 3.5 with . ∎
For , no symmetric straightforward pattern has . (Hence the rescaled patterns approach having maximum length.)
Maximizing over the region constrained by the inequalities of Propositions 3.4, 3.5, and 3.6 ensures no pattern has (See Mathematica proof 3.1). ∎
A Spectral Certificate for Straightforwardness
All that remains to complete our analysis is the proof of Lemma 3.1. The remainder of this paper completes the substantial technical work needed to verify this. As an overview, the main result in this section (Theorem 4.1) gives a sufficient condition for straightforwardness that is more practical to verify than the nonemptiness of . Then Section 5 constructs certificates showing that the sufficient conditions of Theorem 4.1 are met for each . Finally, Sections D and E do the heavy algebraic work explicitly verifying these sufficient conditions are met.
This section’s main result, Theorem 4.1, considers patterns that may themselves not be straightforward, but for which is straightforward for any . Below, we show the set of straightforward stepsize patterns is star-convex with respect to the all zero stepsize pattern, justifying the search for such a rescaling theorem.
Suppose is nonempty and . Then, is nonempty for all and .
It suffices to prove the proposition with as the constraints defining only relax as decreases. The case follows by definition. Let and let . We claim that . The first four constraints defining hold. We will need the following fact to verify the remaining constraints: for any fixed , is an affine function in with a PSD constant term, . Then,
The other constraint holds similarly. We deduce that . ∎
If we additionally fix and assume that , , and is a lower-bound on the second-smallest eigenvalue of , then we may take any satisfying
We divide our proof into three parts. First in Section 4.1.1, we construct from satisfying the needed linear equality constraint. Then Section 4.1.2 shows a positive exists with . Finally, Section 4.1.3 improves this, providing a quantitative lower bound on .
To prove Theorem 4.1, we require some additional notation. Let
With this notation, we may decompose . Note that is bilinear in its arguments and is linear in .
Below, we verify that the second constraint defining , i.e., the linear constraint on , is satisfied for our construction. The lemma below explains what the first two linear constraints in the definition of require of and . Its proof is immediate from expanding the definition of .
The equation holds if and only if
The sum of the zeroth row of is one larger than the sum of the zeroth column of .
Suppose the support of is contained in , then the equation holds if and only if
Comparing Lemma 4.2 with our construction of , we see that the first constraints in (4.1) are satisfied. To show that the last constraint is satisfied as well, it is enough to show that the sum of all left-hand side expressions in (4.1) is zero, i.e., . This is done in the following lemma.
It holds that .
1.2 Existence of Associated Δ>0Δ0\Delta>0
The first three defining constraints of are satisfied regardless of and . Similarly, does not depend on or . Next, is nonnegative for all small enough by the assumption that for all and the observation that those are the only negative entries of . It remains to check that the two PSD constraints defining hold for all small enough.
is PSD by construction (in fact, rank-one). is PSD as maps nonnegative matrices to PSD matrices. We deduce that the first PSD constraint in the definition of holds regardless of and .
We next evaluate and . For the first expression,
For the second expression, we have .
We deduce that , or equivalently, the matrix has a positive component in the kernel of . Thus, the Schur complement lemma shows the existence of a positive satisfying the theorem statement.
1.3 A Quantitative Lower Bound on ΔΔ\Delta
For the second part of the theorem statement, we will assume , , . Additionally, we will assume that lower bounds the second-smallest eigenvalue of .
We will now repeat portions of the proof of the first claim more quantitatively to derive explicit bounds on . By the above arguments, it suffices to pick so that and .
First, as the superdiagonal entries of are defined as , we have that each of these entries is bounded in magnitude by (see Lemma 4.3). In particular, the requirement that is satisfied as long as
This is the first term in our bound on .
We now turn to the constraint .
where is the projection of onto the orthogonal complement of . To see that this is possible, note that and are orthogonal and unit norm. Note that is in the span of the first two basis vectors. Also note that the first and last vectors in this basis span the kernel of .
where the last inequality follows from our assumption .
Abusing notation, we will also write to denote the matrix written in this new basis. Then, can be bounded below by
We bound the minimum eigenvalue of the top-left two-by-two submatrix here as the determinant divided by the trace of the submatrix:
Plugging back into our lower bound on gives
We may also write in this basis in block form by first letting be the orthogonal projection of onto the subspace orthogonal to and writing
Note that from our previous section.
We now apply the Schur complement lemma to the bottom right entry of this second matrix, which yields that this matrix is PSD if and only if
The second inequality follows because is the orthogonal projection of onto a subspace.
We now bound and separately.
For this, we apply the triangle inequality to break the summation over all entries in into summations over just the th row and the first superdiagonal:
For the second term in the bound on , note that for all , we have
is a tridiagonal matrix. Using the bounds and , we have that all entries in this tridiagonal matrix have magnitude bounded by . We may bound the operator norm of this matrix as the sum of the operator norms of each diagonal. Thus,
For the second term in our bound on , a direct calculation shows
Thus, we may bound this quantity above by .
This implies that our final expression is PSD as long as
Invoking our bound on , we have that for any
Proof of Lemma 3.1 and Construction of Certificates
First for , we address the straightforwardness of with parameter individually. One can verify this by noting the below values of are a member of (see Mathematica proof 5.1):
For each , in light of Theorem 4.1, Lemma 3.1 follows if we can demonstrate certificates satisfying the sufficient conditions on straightforwardness therein for each . We do this in four parts. First, we provide a construction for our claimed certificates , which satisfy the first three conditions in Theorem 4.1. Next, we will derive lower bounds on and the second-smallest eigenvalue of :
Here, is a vector that will be constructed shortly. The proofs of these two facts are relatively tedious but ultimately straightforward. We defer the proofs of these two statements to Appendices D and E.
The proof of Lemma 3.1 will then be a direct application of Theorem 4.1 and the bounds stated in (5.1) and (5.2).
The remainder of this section provides our construction for the certificates and verifies the lower bound on the second smallest eigenvalue of .
Although our construction of only uses elementary arithmetic operations, it is still quite complicated. We first present as examples our certificates for . Together with Theorem 4.1, this proves the straightforwardness of each for .
2 Preliminary Definitions
We will begin with some auxilliary definitions that will be used in our definition of . For , we let denote the number of one’s in the binary expansion of and we let .
At times, it will be convenient to index entries of backwards from the bottom-right instead of top-left. We define . The value of will always be clear from context and we will simply write . Lemma B.5 lists some useful relationships between , , and , and .
For , recursively define to be a vector of length as follows: is the empty vector, and for ,
For , let be a vector of length as follows: and for ,
For , let be the vector of length as follows: and for ,
We now present our construction for . Throughout this section, fix . The matrix is a matrix that we construct below. We will index the rows and columns of by .
There are five cases for depending on . See Figure 2 for a depiction of the different cases.
Case 1: and is not a power of 2.
Case 2: and is not a power of 2.
While we find this vector concatenation notation to be more compact and easier to read than specifying each entry of separately, we will also give entry by entry.
Case 1: and is not a power of 2.
Case 2: and is not a power of 2.
In the remainder of this section, we fix and let . In this subsection only, we will let denote the second-smallest eigenvalue of its argument.
We recognize this as the Laplacian of the weighted graph on the vertices where the vertices and are connected with an edge with weight . We will lower-bound by identifying a simpler weighted graph that is dominated by our original graph and bounding its second-smallest eigenvalue instead.
The following lemma computes some lower bounds on the entries of . This will allow us to identify our simpler graph. Its proof simply requires checking the relevant entries of and is deferred to Section A.4.
We are now ready to prove a lower bound on .
Let . Then, the second smallest eigenvalue of is at least .
Let denote the Laplacian of the caterpillar graph. Lemma 5.1 implies that .
Now, suppose for the sake of contradiction that . For simplicity, let within this proof. We will deduce from this assumption that the “vertex-weighted” star graph that arises from contracting all path vertices in the caterpillar graph to a single vertex has small algebraic connectivity, from which we will derive a contradiction.
From this, we deduce that for each , that . Chaining these inequalities, for any pair of path vertices, the difference in values is bounded above by .
Next, let denote the weighted Laplacian on the leaf vertices and the vertex , that contains an edge of weight from to each leaf vertex. Given a leaf vertex , let denote the path vertex that was attached to in the caterpillar graph. Then,
Here, on the second line, we have used the inequality .
One may check (see Mathematica proof 5.2) that the expression in parentheses is for all .
On the other hand, (5.5) is the variational characterization of the second smallest eigenvalue of
The identity block in the bottom-right is where . Thus, the second smallest eigenvalue of is at least by Cauchy’s Interlacing Theorem, a contradiction.∎
Acknowledgements. Benjamin Grimmer’s work was supported in part by the Air Force Office of Scientific Research under award number FA9550-23-1-0531.
References
Appendix A Deferred Proofs and Calculations
First we verify , then we prove our bounds relating and , and finally we show . The defining equation of is that is the unique root larger than of . It is clear that , thus to show suffices to show that . We compute
To bound , we first claim that the sum of all ’s in is given by . This follows from noting each appears in a total of times and so the total value of terms in is
Recall also that is the sum of all other entries in plus two. Thus,
To get an upper bound on this quantity, note that . Thus,
To get a lower bound, observe that for all . Thus . The first set of bounds for follows from the identity . The final claimed inequality follows directly as .
To conclude, we show by showing . We compute
where the inequality exactly follows from our lower bound on .
We previously claimed in Section 3.2 that . We show this by induction below. The following lemma is useful in this calculation.
.
It is clear that which we have seen is equal to
As a base case, when , note that is the positive root of the polynomial
so that , and , which satisfies this equation. Now, assume that the equation holds for , i.e. . Expand this expression to be
Computing the product of all of the , this simplifies to
Combining our previous expressions, we obtain by Lemma A.1 that
Now, we note that by the defining equation of that
In this section, we fix a and let denote the construction given in Section 5.3. Our goal is to bound below for use in proving Lemma 3.1 via Theorem 4.1.
We prove this by bounding for any separately across our five cases. We will also make use of the easy that
Case 1: and is not a power of 2.
Case 4: . If , then this entry is . Otherwise, this entry is . As , we may bound the general case as
A.4 Bounding the Edge Weights in the Caterpillar graph
In Section 5.4, we required lower bounds on specific entries of . These lower bounds were stated in Lemma 5.1 and are proved below.
Fix and let .
Appendix B Useful Supporting Identities and Properties
Here are two recurrence relations for that are useful in various calculations. They say that certain entries (or rows) of are simply scalar multiples of other entries (or rows).
The claim holds if . In the remainder assume . Note that if and only if . As and we deduce that and and only if .
If , then the claim holds. In the remainder, assume . In this case, our definitions imply that
and since , and the number of one’s in the binary expansion of is exactly one more than the number of one’s in the binary expansion of , we see that
Comparing the two expressions yields our result. ∎
In the first case, we have that both and are defined according to Case 2 and that . Thus, these two rows (after the natural re-indexing) are scalar multiples of (and hence of each other). Using the identities Lemma B.5, we have
Comparing the two coefficients proves the claim.
Comparing the two coefficients proves the claim. ∎
B.2 Algebraic Properties of μ𝜇\mu
There are various algebraic properties of that we will use in this paper.
and taking the square root of both sides implies the last claim. ∎
Suppose , then , or equivalently, .
By Lemma A.1, we have . Applying this identity and combining, we get that this is equivalent to
which is the defining equation for . ∎
B.3 Properties of the revrev\operatorname{rev} operation
Suppose . Then, and
In particular, .
We recall that . For the first identity, note
Then, recall that the 2-adic valuation for the sum or difference of two numbers with different 2-adic valuations is the smaller of two. As , we have that the above quantity is equal to .
Finally, is the number of ones in the binary expansion of . This is equivalent to minus the number of ones in the binary expansion of . Now, consider the binary expansion of . The smallest position for which the binary expansion of is equal to one, i.e., , is the same as the smallest position for which the binary expansion of is equal to zero. The difference in the number of ones in their binary expansion is then . We have deduced that . ∎
The support of has a rich combinatorial structure, which we need to make use of extensively in our computations. We record some facts about this support and their proofs here. For now, let us fix , and let refer to .
From our definition of , if and only if or .
It is useful to us to understand for a fixed , which are the where . For a given , we let
Suppose that has the binary expansion , where and , then
We begin by showing that if for some where , then . Note that . This implies that
Now, we show the reverse direction, i.e., if and , then . Note that since . This implies that the binary expansion of can be expressed as
For the sake of contradiction, suppose that for some . In this case, let be the largest so that . If while , then
If , while , then
In either case, we reach a contradiction. ∎
Suppose with . Then, if and only if and . In particular, if , then . If in addition , then is the singleton set .
This is clear from the support of . ∎
Appendix D Proof of Theorem D.1
In this section, we will show that satisfies the first main condition of Theorem 4.1.
The sum of the zeroth row of is one larger than the sum of the zeroth column of .
The sum of the row of is one less than the sum of the column of .
The equivalence of the two statements follows from Lemma 4.2. Thus, it suffices to prove the three statements in the second claim. We show the first item in Lemma D.16. We show the second item in Lemma D.17, Lemma D.18, and Lemma D.19. We show the third item in Lemma D.20. ∎
In the remainder of this section, we fix and let and . Section D.1 computes the sums of each row of . Section D.2 computes the sums of each column of . Finally, Section D.3 proves lemmas claimed above. Various algebraic identities involving the entries of will be used in this section and proven in Appendix B.
Each row of is composed of various components; we will enumerate their sums here.
For , .
First, note that so that . We also have .
Note that the first entries of are identical to those of , and the same holds for the last entries. By induction, we may conclude that
See Mathematica proof D.1 for a proof of the second identity. ∎
For , the sum of the entries in is
We show this by induction: note that the sum of the entries in is 0, and so is this expression.
For , the sum of the entries in is
where is the sum of the entries in .
By Lemma D.1, and the induction hypothesis, this is
Note that , so that this becomes
For , the sum of the entries in is
A simple calculation shows that this holds for (see Mathematica proof D.2). Now suppose . Using Lemma D.1 and Lemma D.2, we see that the sum of the entries in is
This is equal to the claimed expression (see Mathematica proof D.3). ∎
For , the sum of the entries of is .
We proceed by induction: the sum of the entries of is .
By expanding the definition and applying Lemma D.1 and the inductive hypothesis, the sum of the entries of is . This is equivalent to the claimed expression by Lemma B.4. ∎
D.1.2 Computing Row Sums
We will give the sum of the entries in each row, dividing into the cases above.
Case 1: and is not a power of 2. The sum of the entries of is
Case 2: and is not a power of 2. The sum of the entries of is
Case 4: . The sum of the entries of is
Cases 1 and 2 follow directly by definition and Lemma D.3. The expressions for Cases 3 and 4 follow by adding up the partial sums computed in the previous subsection. See Mathematica proof D.4 and Mathematica proof D.5.
For Case 5, we combine the expressions for the partial sums (see Mathematica proof D.6) to get the row sum in the form
Combining these expressions proves the claim (see Mathematica proof D.7). ∎
D.2 Column Sums
denote the indices above and indices below in the support of the th column. The following lemmas give computational descriptions of these sets that will be useful in computing the column sums. We will give their proofs in Appendix C
Suppose that has the binary expansion , where and , then
Suppose with . Then, if and only if and . In particular, if , then . If in addition , then is the singleton set
D.2.2 Computing Column Sums
Fix . First suppose is not a power of and let so that . Let denote the number of ones in the binary expansion of . Then,
On the other hand, if for some , then
We begin with the case where is not a power of . Let be the binary expansion of . Since is the largest power of 2 dividing , it follows that
Case (i): Let . As is not a power of , we have that and . Thus, by definition of , we have
where is the number of ones in the binary expansion of and we note that . Now, note that if we sum over all of the form of case 2, there is exactly one such term for each possible value of from through . That is, if we add all such , we obtain (see Mathematica proof D.8)
Once again, if we add up all such terms, we collect one for each possible value of from to , yielding (see Mathematica proof D.9)
If is not a power of two, then adding the sums in the three cases yields (see Mathematica proof D.10)
Adding all of these terms yields (see Mathematica proof D.11). ∎
Fix so that where . If there are one’s in the binary expansion of , then
We will show this by induction on the number of ones in the binary expansion of .
If there are exactly 2 one’s in the binary expansion of , then
as can be seen in Mathematica proof D.12.
Now, we assume . Let , where has one’s in its binary expansion. We have by Lemma D.9 that . Once again we consider two cases: either , or it is not.
Assume that . This implies that , or equivalently that . In this case,
Now, suppose is not a power of 2. Since , and , Lemma B.1 implies that
The only element of and that is one less than a power of 2 is . If , then , and
Now assume that . This is equivalent to for some . Now, we have that
By Lemma B.1, we note that for all other than ,
It remains to consider and .
Note that and that is not a power of 2, so
Also, if , then
Note that , so that
Inspecting the support of , we have that , so we need only consider the sum of the entries corresponding to element of .
This formula also holds in the case where .
Summing up all entries in the column gives
Suppose and is not a power of . Then,
Let be the binary expansion of . Equivalently if is the binary expansion of , then for all . Note that so that . The set
Suppose and . Let . We have that . We enumerate the possible values of according to , and .
Now, suppose , then . In this case is defined according to Case 2, , , and . Then,
Now suppose satisfies . In this case, is defined according to Case 2, , and . Then,
When we sum up over entries of this form, the value of is in bijection with . Thus, the final part of this summation is
Finally, summing up all entries gives the desired claim (see Mathematica proof D.13). ∎
Let and suppose is not a power of . Then,
We will induct on . As is not a power of two, we have .
First, suppose . We compute directly that . Then
One can verify that the second expression coincides with the first expression when (see Mathematica proof D.14). Thus,
We can also verify that this expression coincides with the claimed expression for (see Mathematica proof D.15).
Now, suppose and let . We have that so that is not a power of two. Let . We will now apply Lemma D.7. There are two cases: where and where .
In the first case, and Lemma D.7 states that so that
For all , it must hold that . We may now apply Lemma B.2 to get
Now, consider the case . In this case, Lemma D.7 states that . Again, for all , it must hold that . We may now apply Lemma B.2 to get
We evaluate the two terms separately. First, note that . Thus,
For the second term, we apply the inductive hypothesis and Lemma B.2 to get
One can check that the sum of these two expressions is given by the claimed expression (see Mathematica proof D.16). ∎
D.3 Comparisons of Row and Column Sums
The sum of the zeroth row of is one larger than the sum of the zeroth column of .
The only entry of is 2. On the other hand, the only row which has an entry in column 0 is , and , which implies the lemma. ∎
For , the sum of the th row of is equal to the sum of the th column of .
We first show this assuming . It follows from our earlier work that the sum of the entries of the row is
On the other hand, the nonzero entries of the column are those indexed by and . The sum of these entries is
Thus, the row sum and the column sum are equal.
We next show this assuming is not a power of 2. If the number of one’s in the binary expansion of is , then the sum of the entries in the row of is
On the other hand, we have seen that the sum of the entries in the column is
We verify that these two expressions are equivalent in Mathematica proof D.17. ∎
Let and . The sum of the th row of is equal to the sum of the th column of .
By Lemmas D.5 and D.10, we have that the th row sum and column sum are both .∎
Suppose . Then, the sum of the entries in the th column of is equal to the sum of the entries in the th row of .
By Lemma D.5, this is also the associated row sum.
Now, suppose where and is not a power of . Then, by Lemmas D.14 and D.15, we have that the th column sum is
On the other hand, noting that , we have that the row sum is given by
These expressions are equal as is verified in Mathematica proof D.18. ∎
The sum of the row of is one less than the sum of the column of .
The final entry in the th column is where . Then and so the column sum is one. ∎
Appendix E Proof of Theorem E.1
In this section, we will show that satisfies the the second main condition of Theorem 4.1.
where is defined in Equation 5.3.
For the remainder of the section, we fix and let , , , and .
We perform the computation of entry by entry. The nature of the definition of requires us to break this computation down into a number of distinct cases. We give a summary of the possible cases in Table 1.
We show that for all in Lemmas E.2, E.4, E.3 and E.5.
We then need to consider the off-diagonal entries of . Note that is symmetric, so we only need to consider where .
We will break the remaining entries of into cases, mirroring the cases in the definition of . To reiterate, the cases we consider for an index are
and is not a power of 2.
and is not a power of 2.
We summarize the possible cases and where they are proved in Table 1.∎
We will make extensive use Theorem D.1 as well as the computed expressions for the partial row and column sums from Section D.1.1 and the computational descriptions of and in Section D.2.1.
In this subsection, we will consider the various diagonal entries of . We divide these into four cases: where , where , where , and where .
First, we present a lemma concerning the entries of .
We expand the entries of defined in Equation 2.4 and note that is zero on its th row and column:
If additionally, , then by Theorem D.1, we have that
If , then .
When , we have that the th row sum is , the th column sum is , and . Thus,
Now, suppose and for some . By Lemma D.5,
Subtracting these two expressions shows that (see Mathematica proof E.1).
On the other hand, if for some , then by Lemma D.5
Substituting these expressions for and into our expression for shows that it is equal to zero (see Mathematica proof E.2).∎
Let for some . We divide into cases: either is a power of two or it is not.
If for some , then by Lemma D.5,
We also have that , so that . On the other hand, we also have that
Now suppose that for some where not a power of 2. Then
By Lemma D.5, and the identity the identity (see Lemma B.5), we have
Here, we use the fact that the th column sum is equal to one, the th row is zero, and that . ∎
E.2 Off-Diagonal Entries of M𝑀M
The following lemma gives a description for where and will be used repeatedly throughout this subsection.
We will use the definition of and to pull out the nonzero entries of each sum. The first sum becomes
We consider a useful lemma, which can be shown by just considering the support of :
Fix . If then .
Also, if and is not a power of 2, then .
In all of the following calculations, we will consider the expression
We begin by taking care of the easier cases.
Suppose and is not a power of . Then,
The second identity holds as . We turn to the first identity. By Lemma E.6, we have
Note that if , then every term in this expression is zero. Thus we will assume . Now, we will write . Note that .
Combining these identities gives . ∎
Suppose and is not a power of . Then,
Let so that . Note, we must have . By Lemma E.6, we have
On the other hand, and . ∎
The only nonzero entry in the first two summations is
Substituting both expressions back in gives
Here, on the last line we have used Lemma B.3. This completes the proof as and . ∎
The second equality follows from the fact that .
We turn to the first equality. Let . First, note if , then and by Lemma E.7. Thus, we may assume in the remainder of this proof. We will break the rest of the proof into cases depending on how is defined, i.e., Cases 2, 4, and 5.
Case 2: Suppose for some for which is not a power of . By Lemma E.6,
By Lemma D.6, the nonzero summands in the sum correspond to indices where . We also have that . Combining these identities gives
Note that .
Case 4: Suppose . By Lemma E.6,
On the other hand, .
E.3 Off-Diagonal Entries in Case (1,1)11(1,1)
Our goal for this subsection will be to prove Lemma E.19, stating that the off-diagonal entries are zero for all , where neither nor are powers of two (equivalently, where ). We will prove Lemma E.19 inductively on the value of . We will make use of the following lemma
Fix such that and are both not powers of 2. If , so that and , and , then .
Now, note that because is not a power of 2,
Also note that because , and is not a power of 2,
The result follows by noting that and , so that
The following lemma will be the base case for the subsequent inductive proof and itself requires nontrivial calculations.
Suppose satisfy . Then, .
If then the result follows from Lemma E.7. So, assume that and where . Lemma E.17 shows that if , then . From now on, we will assume that .
We consider three cases, either ; and , and and .
In this case, and for some . By the assumption that and that , we have that . Now, considering the support of , we have that
Combining these values shows that (see Mathematica proof E.8).
Let , and let , where . We begin by noting
We break the remainder of the proof into two cases: where for some or where for some .
Combining these expressions show (see Mathematica proof E.9)
Now, we consider the other subcase, in which, for some ,
so that this sum is given by (E.3). Next, we have
Combining these expressions shows (see Mathematica proof E.10)
Let . There are two subcases: where and that where .
If , then , where by the assumption that . In this case,
Combining this expression with the identities
Now, consider the case in which . In this case, we have
It can be seen by Lemma D.6 that . That is,
One may check (see Mathematica proof E.11) that the two expressions coincide when . Thus, we may take the first expression in all cases.
Combining the previous expressions, we have that
Suppose , where neither nor is a power of . Then, .
By Lemma E.7, if , then . We may thus assume and let denote their common value.
We will show the result by induction on . The base case, where is shown in Lemma E.18. In the remainder, we assume that .
Now, set and . We see that and by considering the binary expansion of and . Thus, neither nor is a power of two.
We show the claim directly if . In this case, it holds that . In the sum
Here, the expression for follows from the fact that .
Now, assume that and denote their common value by . By the inductive hypothesis, we may assume that
We now wish to compare to . For this, we divide the summation defining into parts and then rearrange:
Note that , since is not a power of 2, and . Similarly, . We finally recall Lemma B.1, which shows that this expression is the same as
Here, there are two cases: either or it does not.
Here, we use the fact that and similarly . The first term in parentheses is zero by the definition of .
The second term in parantheses is also zero upon plugging in the various values of (see Mathematica proof E.13):
The second term is identically zero due to the identities (see Mathematica proof E.14)
It remains to show that the first term is also zero. Let . We have that and that . There are two final cases: where and where .
In the first case, we additionally have that and
In both cases, plugging the relevant values into the first term in parentheses shows that it is equal to zero. See Mathematica proof E.15 and Mathematica proof E.16.∎
E.4 Off-Diagonal Entries in Case (2,2)22(2,2)
Our goal for this subsection will be to prove Lemma E.8, stating that the off-diagonal entries of with are all 0. We start with a simplifying lemma:
Letting and , and fix so that neither nor are powers of 2. If , then .
Here, we use the fact that the series is telescoping to simplify the computation. ∎
The following lemma will be the base case for a subsequent inductive proof.
Here, in the second line, we use Lemma D.13 to simplify the first summation. The second summation uses Lemma B.5 to write (see Mathematica proof E.17).
The only possible nonzero term in the second summation occurs in row . Thus, the second summation is equal to
The only possible nonzero term in the second summation is
We deduce that regardless of whether , that
It remains to show that the two square-bracketed terms in (E.4) are zero.
This is identically zero as can be seen in Mathematica proof E.21.
This is identically zero as can be seen in Mathematica proof E.22.
Substituting this expression shows that the term in parentheses is zero (see Mathematica proof E.23).
By the inductive hypothesis, it holds that
We deal with the first square-bracketed term above. Applying Lemma B.2 and the identity we get from the inductive hypothesis, we may simplify this term to
There are two cases for the second term: either or .
Suppose . Then, the second term is
Now, suppose . Then, the second term is