Accelerated Gradient Descent via Long Steps

Benjamin Grimmer, Kevin Shu, Alex L. Wang

Introduction

When utilizing constant stepsizes, until recently, the best known guarantee was the textbook result that fixing hi=1h_{i}=1 ensures f(xT)f(x)LD2/2Tf(x_{T})-f(x_{\star})\leq LD^{2}/2T. This was improved by the tight convergence theory of Teboulle and Vaisbourd , showing a rate of

when the stepsizes hi=1h_{i}=1. Utilizing nonconstant stepsizes monotonically converging up to 22, they further showed a rate approaching LD2/8TLD^{2}/8T. These coefficient improvements were first conjectured by .

By utilizing nonconstant periodically long stepsizes, Grimmer showed improved convergence rates are possible outside the classic range of stepsizes (0,2)(0,2). We refer to steps with hi>2h_{i}>2 as long steps since they go beyond the classic regime hi(0,2)h_{i}\in(0,2) where descent on the objective value is guaranteed. Their strongest result, resulting from a computer-aided semidefinite programming proof technique, showed repeating a cycle of 127127 stepsizes h0,,h126h_{0},\dots,h_{126} ranging from 1.41.4 to 370.0370.0 gives a rate of

Note, bounding miniTf(xi)f(x)\min_{i\leq T}f(x_{i})-f(x_{\star}) (or a similar quantity) is natural for such long step methods as monotone decrease of the objective is no longer ensured. By considering longer and more complex patterns, increasing gains in the coefficient appear to follow. However, the reliance on numerically solving semidefinite programs with size depending on the pattern length limited this prior work’s ability to explore and prove continued improvements in convergence rates. Grimmer conjectured at least a O(1/Tlog(T))O(1/T\log(T)) rate would follow if one could design and analyze (algebraically) cyclic patterns of generic length.

We show greater gains are possible. By using nonconstant, nonperiodic stepsizes hih_{i}, we prove

Proving this relies on semidefinite programming-based analysis techniques and considers the overall effect of many iterations at once (rather than the one-step inductions typical to most first-order method analysis).

In related work, Das Gupta et. al. produced numerically globally optimal stepsize selections via a branch-and-bound procedure for gradient descent with a fixed number of steps TT\in. By fitting to asymptotics of their numerical guarantees [2, Figure 2], they conjectured a O(1/T1.178)O(1/T^{1.178}) rate may be possible and may be best possible. Our work leaves open the gap between our O(1/T1.0564)O(1/T^{1.0564}) rate and their conjecture, as well as the gap between their conjecture and the known lower bound for general gradient methods of O(1/T2)O(1/T^{2}) .

Generally, studying accelerated convergence rates stemming from long steps can yield several advantages/insights beyond what classic momentum methods can provide. Understanding the acceleration stemming from long steps may yield insights into the fundamental mechanism enabling acceleration; we have shown that changing the update directions based on an auxiliary momentum sequence is not needed to beat O(1/T)O(1/T). Hence, an acceleration can be attained by a method storing only one vector in memory at each step rather than two. Further, using long steps may partially mitigate the effects of inexact or stochastic gradients, known to hamper momentum methods , as no momentum term exists to propagate past errors into future steps. Lastly, we note that continued work in this direction may yield theoretical support for such cyclic long stepsize patterns used in neural network training .

Stronger guarantees for gradient descent with variable stepsizes are known in specialized settings, like μ\mu-strongly convex minimization. Classically, gradient descent with constant stepsizes hi=1/Lh_{i}=1/L produces an ϵ\epsilon-minimizer in O(κlog(1/ϵ))O(\kappa\log(1/\epsilon)) iterations, where κ=L/μ\kappa=L/\mu. Concurrent to this work, Altschuler and Parrilo recently showed an accelerated rate through the inclusion of long steps of O(κ0.786434log(1/ϵ))O(\kappa^{0.786434}\log(1/\epsilon)) (extending their prior preliminary results in ). Our convergence theory, using a different pattern of long steps, also improves on the classic O(κlog(1/ϵ))O(\kappa\log(1/\epsilon)), although at a weaker rate. Our Theorem 3.2 ensures that under our long stepsize selection, gradient descent has a O(κ0.94662log(1/ϵ))O(\kappa^{0.94662}\log(1/\epsilon)) rate. The silver ratio, 1+21+\sqrt{2}, occurs prominently in both our analysis and theirs, indicating potential deeper connections.

Altschuler and Parrilo ’s faster accelerated rate for smooth strongly convex problems can be extended to give a O(1/T1.271553)O(1/T^{1.271553}) guarantee for a modified gradient descent method for general smooth convex problems, the main focus of this work. Doing so requires running gradient descent on a modified objective function (whose choice depends on specifying a target accuracy and an initial distance to optimal). In contrast, our results show acceleration beyond O(1/T)O(1/T) is possible for gradient descent via long steps alone, i.e., without needing this modification and additional problem knowledge.

In the further specialized case of minimizing strongly convex quadratics, the optimal stepsizes were given by , which attain the optimal O(κ1/2log(1/ϵ))O(\kappa^{1/2}\log(1/\epsilon)) rate. For nonconvex optimization, exact worst-case guarantees for gradient descent with short steps hk(0,1]h_{k}\in(0,1] were given by Abbaszadehpeivasti et al. . The potential use of longer steps (greater than 2/L2/L) in nonconvex settings is an interesting future direction but beyond the scope of this work.

In the remainder of this introduction, we define our stepsize selection which accelerates due to the inclusion of long steps. To prove our accelerated rates, Section 2 first reviews the semidefinite programming analysis technique of based on the performance estimation problem techniques of . Specifically, the proof of our accelerated convergence rates utilize the “straightforward” property of stepsize patterns. Section 3 proves the claimed convergence rate assuming that certain finite-length blocks within our nonperiodic stepsize pattern are straightforward. Section 4 shows that straightforwardness of a stepsize pattern can be certified by producing a feasible solution to an associated spectral set. Finally, we close the loop and show that appropriate blocks within our stepsize pattern are straightforward by constructing such certificates in Section 5. Appendices D and E verify the necessary conditions on our certificates. Several symbolically intense calculations or simplifications are deferred to the associated Mathematica notebook available at the Github repository https://github.com/ootks/GDLongSteps. This same Github repository also contains Julia code that computes our step size sequences and our associated certificates.

We define αi\alpha_{i} and μi\mu_{i} inductively. For i0i\geq 0, define

Note that qi(1)=(βi+11)(μi1)<0q_{i}(1)=-(\beta_{i+1}-1)(\mu_{i}-1)<0, so that qiq_{i} has a unique root larger than 11 and αi\alpha_{i} is well-defined.

The building block h(k)\mathfrak{h}^{(k)} will be a pattern of length t(k)=2k+11t^{(k)}=2^{k+1}-1. Although this pattern has exponentially many stepsizes, it will contain only 2k2k distinct values. This kkth building block stepsize pattern takes the form

This construction was arrived at through substantial computer-search over patterns with the necessary properties (see our Theorem 4.1). Although the above pattern may seem somewhat cryptic, given the values in h(k1)\mathfrak{h}^{(k-1)}, to produce the next pattern h(k)\mathfrak{h}^{(k)} of the form (1.5) just requires specifying three new numbers βk2\beta_{k-2}, αk1\alpha_{k-1} and μk\mu_{k}. The values of the β\beta sequence follow a nice exponential pattern (1.3). Once this is set, the values for αk1\alpha_{k-1} and μk\mu_{k} are then determined entirely by a system of two equations in two variables that simply imposes two necessary conditions for straightforwardness of h(k)\mathfrak{h}^{(k)} (see Section 3.2).

Following this construction, the first four building block patterns are, for example, given by

The values of h(4)\mathfrak{h}^{(4)}, h(5)\mathfrak{h}^{(5)}, and h(6)\mathfrak{h}^{(6)} are plotted in Figure 1, showcasing their symmetries and fractal nature. Below we provide bounds on how the quantities αk\alpha_{k}, μk\mu_{k}, and Hk:=i=0t(k)1hi(k)H_{k}:=\sum_{i=0}^{t^{(k)-1}}\mathfrak{h}^{(k)}_{i} grow asymptotically (proof deferred in Appendix A.1).

1.2 Building the Proposed Nonconstant, Nonperiodic Stepsize Sequence

We construct our accelerated sequence of long stepsizes from rescaled versions of the stepsize building block patterns h(k)\mathfrak{h}^{(k)} by some fixed scalar η(0,1)\eta\in(0,1). We first apply (1η)h(0)(1-\eta)\mathfrak{h}^{(0)} a fixed number of times, then apply the pattern (1η)h(1)(1-\eta)\mathfrak{h}^{(1)} a fixed number of times, and so on. Each rescaled pattern (1η)h(k)(1-\eta)\mathfrak{h}^{(k)} will be applied

times where the associated parameter Δ(k)\Delta^{(k)} is defined as in Lemma 3.1.

We do not believe the choice of Δ(k)\Delta^{(k)} therein is as large as possible. Improvements on that parameter directly would improve our guarantees as fewer applications of each pattern would be needed. In fact, we propose a conjecture following our Lemma 3.1 on how Δ(k)\Delta^{(k)} should scale with kk (see 3.1). This conjecture is supported numerically and a proof of it would directly lead to an O(1/T1.119)O(1/T^{1.119}) convergence rate guarantee.

As a result, the proposed nonconstant, nonperiodic stepsize sequence is

We denote the first iteration where stepsizes are drawn from the pattern h(k)\mathfrak{h}^{(k)} by

Note the value of Δ(k)\Delta^{(k)} shrinks geometrically. As a result, the iteration counts where we switch to the next building block stepsize pattern IkI_{k} grows geometrically. For example, setting η=1/2\eta=1/2, the proposed sequences of stepsizes would be of the form

Performance Estimation and Straightforwardness

Our proof machinery is built upon the performance estimation problem (PEP) ideas of . We first introduce this PEP line of work and associated semidefinite programs applied to our particular setting. Then, we introduce the improved semidefinite programming technique of , identifying a class of stepsize patterns, dubbed straightforward, for which the effects of long steps can be analyzed.

The performance estimation problem (PEP) results of establish that this problem can be relaxed (often tightly instead as a reformulation) to a finite-dimensional semidefinite minimization problem. Their PEP process of reformulations is carried out below, following the notation used in Grimmer , to introduce the needed notations here and for completeness.

Step 1: A QCQP reformulation. First, as proposed by Drori and Teboulle , one can discretize the infinite-dimensional problem defining pL,D(δ)p_{L,D}(\delta) over all possible objective values fkf_{k} and gradients gkg_{k} at the points xkx_{k} with kIt:={,0,1,t}k\in I_{t}^{\star}:=\{\star,0,1,\dots t\} as done below. Using the interpolation theorem of Taylor et al. , this gives an exact reformulation rather than a relaxation, giving

where, without loss of generality, we have fixed x=0,f=0,g=0x_{\star}=0,f_{\star}=0,g_{\star}=0.

Step 2: An SDP relaxation. Second, one can relax the nonconvex problem (2.2) to the following SDP as done in . Define

with the following notation for selecting columns and elements of HH and FF:

This notation ensures xi=Hxix_{i}=H\mathbf{x}_{i}, gi=Hgig_{i}=H\mathbf{g}_{i}, and fi=Ffi.f_{i}=F\mathbf{f}_{i}. Furthermore, for i,jIti,j\in I_{t}^{\star}, define

Under an additional rank condition (that the problem dimension nn exceeds t+2t+2), the QCQP (2.2) and SDP (2.3) are actually equivalent. However, this is not needed for our analysis, so we make no such assumption.

Step 3: The upper bounding dual SDP. Third, note the maximization SDP (2.3) is bounded above by its dual minimization SDP by weak duality, giving

Although it is not needed for our analysis, equality holds here as well (i.e., strong duality holds) due to [6, Theorem 6].

The dependence of this function on each parameter is made clear by considering ZZ broken into the following blocks

This first entry only depends on vv and vv only occurs in this first entry. The remainder of the first column and row are linear functions of only λ\lambda, denoted by m(λ)m(\lambda). The remaining (t+1)×(t+1)(t+1)\times(t+1) block of ZZ is linear in λ\lambda and affine in h\mathfrak{h}, denoted by M(λ,h)M(\lambda,\mathfrak{h}).

2 Straightforward Stepsize Patterns

The performance estimation technique defined above does not provide a mechanism to give convergence rates for gradient descent as it only considers a fixed number of iterations tt. To enable these PEP semidefinite programs to provide convergence rate theorems, Grimmer proposed considering stepsize patterns where this worst-case function is bounded above by

Theorem 3.2 of showed that a stepsize pattern is straightforward with parameter Δ\Delta if the following spectral set is nonempty

Hence proving a pattern is straightforward amounts to identifying a feasible solution to a semidefinite program. Grimmer used this to automate the search for long straightforward patterns, generating their constant factor convergence rate improvements for periodic stepsize sequences.

This straightforward structure is critical to our proof development as well. We will show that any rescaling η(0,1)\eta\in(0,1) of our building block patterns (1η)h(k)(1-\eta)\mathfrak{h}^{(k)} is straightforward. However, in contrast to this prior work, our proof of this will be entirely analytic and hence apply for all kk. The move from computer-generated certificates to exact algebraic formulas was essential to move the resulting performance-bound gains from being constant factor improvements to our accelerated big-O convergence.

Accelerated Convergence Rate Analysis

Our convergence rate analysis relies on three lemmas, (i) showing each rescaled building block pattern is straightforward, (ii) guaranteeing progress is made after RkR_{k} applications of each pattern, and (iii) bounding the total number of steps in each pattern. Our first lemma requires substantial and nontrivial constructions and verification to prove. This is deferred to Sections 4 and beyond.

Each (scaled) building block pattern (1η)h(k)(1-\eta)\mathfrak{h}^{(k)} is straightforward. In particular, h(0)/2\mathfrak{h}^{(0)}/2 has parameter Δ=1/2\Delta=1/2 and for k>0k>0, h(k)/2\mathfrak{h}^{(k)}/2 has parameter

Note these parameters Δ(k)\Delta^{(k)} are chosen very conservatively. Any at most exponentially decaying lower bound suffices to give a rate strictly faster than O(1/T)O(1/T). The slack in our above bound can be seen by numerically computing the largest Δ\Delta such Sh(k)/2,Δ\mathcal{S}_{\mathfrak{h}^{(k)}/2,\Delta} is nonempty (implying straightforwardness for h(k)/2\mathfrak{h}^{(k)}/2) for k=1,5k=1,\dots 5. The resulting numerical values are given below

The successive ratios in the numerical values of Δ(k)\Delta^{(k)} are decreasing and seem to approach (1+2)25.83(1+\sqrt{2})^{2}\approx 5.83. For example, Δnumerical(4)/Δnumerical(5)5.93\Delta^{(4)}_{\mathtt{numerical}}/\Delta^{(5)}_{\mathtt{numerical}}\approx 5.93. This suggests the following conjecture:

There exists c>0c>0 such that for all k1k\geq 1, the building block h(k)/2\mathfrak{h}^{(k)}/2 is straightforward with Δ(k)c(1+2)2k\Delta^{(k)}\geq c(1+\sqrt{2})^{-2k}.

As discussed previously, proving this conjecture would directly lead to an O(1/T1.119)O(1/T^{1.119}) convergence rate guarantee. We also mention that if one could show the even stronger bound of Δ(k)=Ω((1+2)k)\Delta^{(k)}=\Omega((1+\sqrt{2})^{-k}), then our steplength schedule hh would have the nice property that each block h(k)\mathfrak{h}^{(k)} is repeated only constant number of times and the resulting convergence rate guarantee, O(1/T1.2715)O(1/T^{1.2715}), would match the rate achieved by Altschuler and Parrilo with their modified gradient descent algorithm.

Our second lemma analyzes the objective gap of gradient descent with the stepsize sequence (1.7) after completing RkR_{k} applications of (1η)h(k)(1-\eta)\mathfrak{h}^{(k)}. This lemma follows directly from Lemma 3.1. Recall Ik+1I_{k+1} denotes the iteration of gradient descent just after applications of h(k)\mathfrak{h}^{(k)} has completed.

Each k0k\geq 0 has f(xIk)f(x)LD2Δ(k)f(x_{I_{k}})-f(x_{\star})\leq LD^{2}\Delta^{(k)}.

Note this trivially holds for k=0k=0 since I0=0I_{0}=0 and f(x0)f(x)12LD2f(x_{0})-f(x_{\star})\leq\frac{1}{2}LD^{2}. We prove this inductively by showing that if δIkLD2Δ(k)\delta_{I_{k}}\leq LD^{2}\Delta^{(k)}, then δIk+1LD2Δ(k+1)\delta_{I_{k+1}}\leq LD^{2}\Delta^{(k+1)}. Suppose δIkLD2Δ(k)\delta_{I_{k}}\leq LD^{2}\Delta^{(k)}. Then the straightforwardness of (1η)h(k)(1-\eta)\mathfrak{h}^{(k)} ensures that the objective gap decreases with each application of the pattern (1η)h(k)(1-\eta)\mathfrak{h}^{(k)}. Namely, for any s0s\geq 0, we have

Solving this recurrence relation (of the standard form δs+1δsCδs2\delta_{s+1}\leq\delta_{s}-C\delta_{s}^{2}) ensures that for all s0s\geq 0

Hence iteration Ik+1I_{k+1}, after s=Rks=R_{k} applications of (1η)h(k)(1-\eta)\mathfrak{h}^{(k)}, has δIk+1LD2Δ(k+1)\delta_{I_{k+1}}\leq LD^{2}\Delta^{(k+1)}. ∎

To convert this bound to a convergence rate guarantee, we need a bound on IkI_{k}, given below.

Let c=17687682c=\frac{1}{768768\sqrt{2}} and d=(1+2)4d=(1+\sqrt{2})^{4}. If η=1/2\eta=1/2, then I0=0I_{0}=0 and for k>0k>0,

Trivially I0=0I_{0}=0. Recall from Lemma 3.1 that the straightforwardness parameter is given by Δ(0)=1/2\Delta^{(0)}=1/2 and Δ(k)=c/dk\Delta^{(k)}=c/d^{k} if k1k\geq 1. Then for l0l\geq 0,

From this, our accelerated convergence guarantee for gradient descent is immediate.

For any target accuracy 0<ϵ<12LD20<\epsilon<\frac{1}{2}LD^{2} and rescaling η=1/2\eta=1/2, gradient descent with stepsize sequence (1.7) has f(xT)f(x)ϵf(x_{T})-f(x_{\star})\leq\epsilon by some iteration

As in Lemma 3.3, let c=17687682c=\frac{1}{768768\sqrt{2}} and d=(1+2)4d=(1+\sqrt{2})^{4} for which Δ(k)=c/dk\Delta^{(k)}=c/d^{k} if k1k\geq 1. Let k(ϵ)k(\epsilon) be the last pattern kk with Δ(k)>ϵ/LD2\Delta^{(k)}>\epsilon/LD^{2}. Since Δ(k)\Delta^{(k)} monotonically decreases to zero, k(ϵ)k(\epsilon) is finite. Lemma 3.3 bounds the number of steps before building block pattern k(ϵ)k(\epsilon) is used by

A bound on the number of steps using pattern h(k(ϵ))\mathfrak{h}^{(k(\epsilon))} before an ϵ\epsilon-minimizer follows from the convergence guarantee (3.1). Namely, the number of applications of the pattern (1η)hk(ϵ)(1-\eta)\mathfrak{h}^{k(\epsilon)} needed is at most

Hence, if k(ϵ)>0k(\epsilon)>0 and so k(ϵ)=logd(cLD2/ϵ)k(\epsilon)=\lfloor\log_{d}(cLD^{2}/\epsilon)\rfloor, an ϵ\epsilon-minimizer is found by iteration

Otherwise if k(ϵ)=0k(\epsilon)=0, then an ϵ\epsilon-minimizer is found within LD2/ϵLD^{2}/\epsilon iterations. Since k(ϵ)=0k(\epsilon)=0 requires ϵ>LD2c/d\epsilon>LD^{2}c/d, one can verify LD2/ϵ<10.3280(LD2ϵ)0.94662LD^{2}/\epsilon<10.3280\left(\frac{LD^{2}}{\epsilon}\right)^{0.94662}. ∎

This convergence rate improves if stronger lower bounds on Δ(k)=c/dk\Delta^{(k)}=c/d^{k} are provided. Convergence theory matching the conjectured optimal rate of would follow immediately if one could show a bound with d=1+2d=1+\sqrt{2}. This would give a rate of O(1/ϵlog1+2(2))O(1/\epsilon^{\log_{1+\sqrt{2}}(2)}) and has the nice property that RkR_{k} would become a constant. Such an idealized, potentially optimal, accelerated stepsize pattern would just apply each h(k)\mathfrak{h}^{(k)} a constant number of times.

Lemma 3.1 further enables direct analysis of gradient descent for strongly convex minimization. Let Dk=sup{xxf(x)f(xk)}D_{k}=\sup\{\|x-x_{\star}\|\mid f(x)\leq f(x_{k})\} and t(k)=2k+11t^{(k)}=2^{k+1}-1. Note that μ\mu-strong convexity of ff (defined as fμ22f-\frac{\mu}{2}\|\cdot\|^{2} being convex) ensures μ2Dk2δk\frac{\mu}{2}D_{k}^{2}\leq\delta_{k}.In fact, this is the only property of strong convexity used in our analysis here. So our convergence guarantee presented in Theorem 3.2 holds more generally for any problem satisfying only a quadratic growth bound. Observe that if δst(k)LDk2Δ(k)\delta_{st^{(k)}}\leq LD_{k}^{2}\Delta^{(k)}, the objective gap contracts after applying the pattern (1η)h(k)(1-\eta)\mathfrak{h}^{(k)} with

Conversely, if δst(k)>LDk2Δ(k)\delta_{st^{(k)}}>LD_{k}^{2}\Delta^{(k)}, straightforwardness ensures that

Hence, every application of this pattern yields a contraction of at least

Given the condition number κ=L/μ\kappa=L/\mu and fixing η=1/2\eta=1/2, one can select the stepsize pattern giving the best contraction. Consider k(κ)=sup{kΔ(k)12κ}k(\kappa)=\sup\{k\mid\Delta^{(k)}\geq\frac{1}{2\kappa}\}. Supposing k(κ)>0k(\kappa)>0, we have Δ(k(κ))=c/dk(κ)\Delta^{(k(\kappa))}=c/d^{k(\kappa)}, giving bounds of 12κΔ(k(κ))d2κ\frac{1}{2\kappa}\leq\Delta^{(k(\kappa))}\leq\frac{d}{2\kappa}. Noting i=0t(k)1hi(k(κ))2(1+2)k\sum_{i=0}^{t^{(k)}-1}\mathfrak{h}^{(k(\kappa))}_{i}\geq 2(1+\sqrt{2})^{k}, one has i=0t(k)hi(k)2(2cκ/d)logd(1+2)\sum_{i=0}^{t^{(k)}}\mathfrak{h}^{(k)}_{i}\geq 2(2c\kappa/d)^{\log_{d}(1+\sqrt{2})}. Then the guarantee (3.2) gives a contraction factor after applying the pattern of stepsizes of 1(2cκ/d)logd(1+2)κ1.1-(2c\kappa/d)^{\log_{d}(1+\sqrt{2})}\kappa^{-1}. Since this contraction is attained after t(k(κ))=2k(κ)+112(2cκ/d)logd(2)t^{(k(\kappa))}=2^{k(\kappa)+1}-1\leq 2(2c\kappa/d)^{\log_{d}(2)} iterations, when amortized, the per-iteration contraction factor is

This gives the following convergence theorem. Note this accelerated rate is slower than that concurrently developed by . If one could improve the value of dd above to 1+21+\sqrt{2}, our rate would improve to match theirs.

2 Tight Bounds on Straightforward Patterns

Some insight into the structure of straightforward stepsize sequences follows from considering particular “bad” problem instances. Since straightforward patterns always yield a descent, showing a failure to achieve a descent on any instance suffices to prove a pattern is not straightforward. Three elementary bounds of this form are given below.

Consider the one-dimensional objective function f(x)=12x2f(x)=\frac{1}{2}x^{2} with x0=1x_{0}=1. Then each gradient descent step is xk+1=(1hk)xkx_{k+1}=(1-h_{k})x_{k}. Hence xt=i=0t1(1hi)x_{t}=\prod_{i=0}^{t-1}(1-\mathfrak{h}_{i}). To achieve any descent by the end of this pattern, this must be between one and minus one. ∎

Consider the one-dimensional Huber objective function f(x)=12x2f(x)=\frac{1}{2}x^{2} if x1|x|\leq 1 and f(x)=x12f(x)=|x|-\frac{1}{2} otherwise. Letting x0=j<ihj1x_{0}=-\sum_{j<i}\mathfrak{h}_{j}-1, one has xi=1x_{i}=-1. Supposing hijihj+2\mathfrak{h}_{i}\geq\sum_{j\neq i}\mathfrak{h}_{j}+2, one has xi+1jihj+1x_{i+1}\geq\sum_{j\neq i}\mathfrak{h}_{j}+1. Consequently, gradient descent failed to descend as xtj<ihj+1x_{t}\geq\sum_{j<i}\mathfrak{h}_{j}+1. ∎

Without loss of generality, i=0i=0 and i+=1i_{+}=1. Consider the one-dimensional objective function 12x2+12\frac{1}{2}x^{2}+\frac{1}{2} for x1x\leq 1 and xx otherwise. Suppose (1hi)(1hi+)j∉{i,i+}hj+10(1-\mathfrak{h}_{i})(1-\mathfrak{h}_{i_{+}})\geq\sum_{j\not\in\{i,i_{+}\}}\mathfrak{h}_{j}+1\geq 0. Letting x0=1x_{0}=1, we then have x1=(1h0)1x_{1}=(1-\mathfrak{h}_{0})\leq 1 and x2=(1h1)(1h0)x_{2}=(1-\mathfrak{h}_{1})(1-\mathfrak{h}_{0}). By our assumption, we then have x2,,xt1x_{2},\dots,x_{t}\geq 1. So no descent is achieved by applying the pattern and hence it is not straightforward. ∎

The first two of these bounds on how large straightforward patterns are actually tight in all of our proposed building block patterns h(k)\mathfrak{h}^{(k)}. The selection of the middle stepsize μk\mu_{k} is exactly the sum of all the other stepsizes plus two, matching Proposition 3.5. The selection of the one-quarter and three-quarters stepsize αk2\alpha_{k-2} is exactly the choice making the product in Proposition 3.4 equal one (Appendix A.2 verifies this).

Arguments based on these bounds can show that our first two building block patterns have the longest total length possible among all straightforward patterns. We believe this is true for all of our proposed patterns but only provide proof for the global maximality for k=0,1k=0,1.

For t=1t=1, no straightforward pattern has i=0t1hi2\sum_{i=0}^{t-1}\mathfrak{h_{i}}\geq 2. (Hence the rescaled patterns (1η)h(0)(1-\eta)\mathfrak{h}^{(0)} approach having maximum length.)

This is an immediate consequence of Proposition 3.5 with i=0i=0. ∎

For t=3t=3, no symmetric straightforward pattern has i=0t1hi8\sum_{i=0}^{t-1}\mathfrak{h_{i}}\geq 8. (Hence the rescaled patterns (1η)h(1)(1-\eta)\mathfrak{h}^{(1)} approach having maximum length.)

Maximizing i=02hi\sum_{i=0}^{2}h_{i} over the region constrained by the inequalities of Propositions 3.4, 3.5, and 3.6 ensures no pattern h=(a,b,a)h=(a,b,a) has 2a+b82a+b\geq 8 (See Mathematica proof 3.1). ∎

A Spectral Certificate for Straightforwardness

All that remains to complete our analysis is the proof of Lemma 3.1. The remainder of this paper completes the substantial technical work needed to verify this. As an overview, the main result in this section (Theorem 4.1) gives a sufficient condition for straightforwardness that is more practical to verify than the nonemptiness of Sh,Δ\mathcal{S}_{\mathfrak{h},\Delta}. Then Section 5 constructs certificates λ(k)\lambda^{(k)} showing that the sufficient conditions of Theorem 4.1 are met for each h(k)\mathfrak{h}^{(k)}. Finally, Sections D and E do the heavy algebraic work explicitly verifying these sufficient conditions are met.

This section’s main result, Theorem 4.1, considers patterns h\mathfrak{h} that may themselves not be straightforward, but for which (1η)h(1-\eta)\mathfrak{h} is straightforward for any η(0,1)\eta\in(0,1). Below, we show the set of straightforward stepsize patterns is star-convex with respect to the all zero stepsize pattern, justifying the search for such a rescaling theorem.

Suppose Sh,Δ\mathcal{S}_{\mathfrak{h},\Delta} is nonempty and Δ0\Delta\geq 0. Then, Sθh,ωΔ\mathcal{S}_{\theta\mathfrak{h},\omega\Delta} is nonempty for all θ(0,1]\theta\in(0,1] and ω\omega\in.

It suffices to prove the proposition with ω=1\omega=1 as the constraints defining Sθh,ωΔ\mathcal{S}_{\theta\mathfrak{h},\omega\Delta} only relax as ω\omega decreases. The case θ=1\theta=1 follows by definition. Let θ(0,1)\theta\in(0,1) and let (λ,γ)Sh,Δ(\lambda,\gamma)\in\mathcal{S}_{\mathfrak{h},\Delta}. We claim that (λ,θγ)Sθh,Δ/θ(\lambda,\theta\gamma)\in\mathcal{S}_{\theta\mathfrak{h},\Delta/\theta}. The first four constraints defining Sθh,Δ/θ\mathcal{S}_{\theta\mathfrak{h},\Delta/\theta} hold. We will need the following fact to verify the remaining constraints: for any fixed ξ0\xi\geq 0, θM(ξ,θh)\theta\mapsto M(\xi,\theta\mathfrak{h}) is an affine function in θ\theta with a PSD constant term, M(ξ,0)M(\xi,0). Then,

The other constraint holds similarly. We deduce that (λ,θγ)Sθh,Δ/θSθh,Δ(\lambda,\theta\gamma)\in\mathcal{S}_{\theta\mathfrak{h},\Delta/\theta}\subseteq\mathcal{S}_{\theta\mathfrak{h},\Delta}. ∎

If we additionally fix η=1/2\eta=1/2 and assume that t3t\geq 3, H8H\geq 8, and L(0,1]\mathfrak{L}\in(0,1] is a lower-bound on the second-smallest eigenvalue of W2(λ)W_{2}(\lambda), then we may take any Δ>0\Delta>0 satisfying

We divide our proof into three parts. First in Section 4.1.1, we construct γ\gamma from λ\lambda satisfying the needed linear equality constraint. Then Section 4.1.2 shows a positive Δ>0\Delta>0 exists with (λ,(1η)γ)S(1η)h,Δ(\lambda,(1-\eta)\gamma)\in\mathcal{S}_{(1-\eta)\mathfrak{h},\Delta}. Finally, Section 4.1.3 improves this, providing a quantitative lower bound on Δ\Delta.

To prove Theorem 4.1, we require some additional notation. Let

With this notation, we may decompose M(λ,h)=W1(λ,h)+W2(λ)M(\lambda,\mathfrak{h})=W_{1}(\lambda,\mathfrak{h})+W_{2}(\lambda). Note that W1W_{1} is bilinear in its arguments and W2W_{2} is linear in λ\lambda.

Below, we verify that the second constraint defining Sh,Δ\mathcal{S}_{\mathfrak{h},\Delta}, i.e., the linear constraint on γ\gamma, is satisfied for our construction. The lemma below explains what the first two linear constraints in the definition of Sh,Δ\mathcal{S}_{\mathfrak{h},\Delta} require of λ\lambda and γ\gamma. Its proof is immediate from expanding the definition of ai,ja_{i,j}.

The equation i,j=0tλi,jai,j=a,ta,0\sum_{i,j=0}^{t}\lambda_{i,j}a_{i,j}=a_{\star,t}-a_{\star,0} holds if and only if

The sum of the zeroth row of λ\lambda is one larger than the sum of the zeroth column of λ\lambda.

Suppose the support of γ\gamma is contained in {(,i):i[0,t]}{(i,i+1):i[0,t1]}\left\{(\star,i):\,i\in[0,t]\right\}\cup\left\{(i,i+1):\,i\in[0,t-1]\right\}, then the equation ijγi,jai,j=2i=0t1hia,0\sum_{i\neq j}\gamma_{i,j}a_{i,j}=2\sum_{i=0}^{t-1}\mathfrak{h}_{i}a_{\star,0} holds if and only if

Comparing Lemma 4.2 with our construction of γ\gamma, we see that the first tt constraints in (4.1) are satisfied. To show that the last constraint is satisfied as well, it is enough to show that the sum of all left-hand side expressions in (4.1) is zero, i.e., 2H=i=0tγ,i=2H1ϕ2H=\sum_{i=0}^{t}\gamma_{\star,i}=\sqrt{2H}\mathbf{1}^{\intercal}\phi. This is done in the following lemma.

It holds that 1ϕ=2H\mathbf{1}^{\intercal}\phi=\sqrt{2H}.

1.2 Existence of Associated Δ>0Δ0\Delta>0

The first three defining constraints of S(1η)h,Δ\mathcal{S}_{(1-\eta)\mathfrak{h},\Delta} are satisfied regardless of Δ\Delta and η\eta. Similarly, λ0\lambda\geq 0 does not depend on Δ\Delta or η\eta. Next, λ+Δ(1η)γ\lambda+\Delta(1-\eta)\gamma is nonnegative for all Δ>0\Delta>0 small enough by the assumption that λi,i+1>0\lambda_{i,i+1}>0 for all i[0,t1]i\in[0,t-1] and the observation that those are the only negative entries of γ\gamma. It remains to check that the two PSD constraints defining S(1η)h,Δ\mathcal{S}_{(1-\eta)\mathfrak{h},\Delta} hold for all Δ>0\Delta>0 small enough.

A\mathfrak{A} is PSD by construction (in fact, rank-one). B\mathfrak{B} is PSD as W2W_{2} maps nonnegative matrices to PSD matrices. We deduce that the first PSD constraint in the definition of S(1η)h,Δ\mathcal{S}_{(1-\eta)\mathfrak{h},\Delta} holds regardless of Δ\Delta and η\eta.

We next evaluate 1W1(γ,h)1\mathbf{1}^{\intercal}W_{1}(\gamma,\mathfrak{h})\mathbf{1} and 1W2(γ)1\mathbf{1}^{\intercal}W_{2}(\gamma)\mathbf{1}. For the first expression,

For the second expression, we have 1W2(γ)1=12i=0tγ,i=H\mathbf{1}^{\intercal}W_{2}(\gamma)\mathbf{1}=\frac{1}{2}\sum_{i=0}^{t}\gamma_{\star,i}=H.

We deduce that 1Q31=(1η)H>0\mathbf{1}^{\intercal}Q_{3}\mathbf{1}=(1-\eta)H>0, or equivalently, the matrix Q3Q_{3} has a positive component in the kernel of (1η)Q1+ηQ2(1-\eta)Q_{1}+\eta Q_{2}. Thus, the Schur complement lemma shows the existence of a positive Δ\Delta satisfying the theorem statement.

1.3 A Quantitative Lower Bound on ΔΔ\Delta

For the second part of the theorem statement, we will assume t3t\geq 3, H8H\geq 8, η=1/2\eta=1/2. Additionally, we will assume that L(0,1]\mathfrak{L}\in(0,1] lower bounds the second-smallest eigenvalue of W2(λ)W_{2}(\lambda).

We will now repeat portions of the proof of the first claim more quantitatively to derive explicit bounds on Δ\Delta. By the above arguments, it suffices to pick Δ\Delta so that λ+Δ2γ0\lambda+\frac{\Delta}{2}\gamma\geq 0 and 12Q1+12Q2+ΔQ30\frac{1}{2}Q_{1}+\frac{1}{2}Q_{2}+\Delta Q_{3}\succeq 0.

First, as the superdiagonal entries of γ\gamma are defined as γi,i+1=j=0iγ,j2H=2Hj=0iϕi2H\gamma_{i,i+1}=\sum_{j=0}^{i}\gamma_{\star,j}-2H=\sqrt{2H}\sum_{j=0}^{i}\phi_{i}-2H, we have that each of these entries is bounded in magnitude by 2H2H (see Lemma 4.3). In particular, the requirement that λ+Δ2γ0\lambda+\frac{\Delta}{2}\gamma\geq 0 is satisfied as long as

This is the first term in our bound on Δ\Delta.

We now turn to the constraint 12Q1+12Q2+ΔQ30\frac{1}{2}Q_{1}+\frac{1}{2}Q_{2}+\Delta Q_{3}\succeq 0.

where ξ\xi is the projection of (H,ϕ/2)(\sqrt{H},-\phi/\sqrt{2}) onto the orthogonal complement of (t+1,1t+1)(t+1)2+(t+1)\tfrac{(t+1,-\mathbf{1}_{t+1})}{\sqrt{(t+1)^{2}+(t+1)}}. To see that this is possible, note that (t+1,1t+1)(t+1)2+(t+1),\tfrac{(t+1,-\mathbf{1}_{t+1})}{\sqrt{(t+1)^{2}+(t+1)}}, ξξ2\frac{\xi}{\|\xi\|_{2}} and 1t+2t+2\tfrac{\mathbf{1}_{t+2}}{\sqrt{t+2}} are orthogonal and unit norm. Note that (H,ϕ/2)(\sqrt{H},-\phi/\sqrt{2}) is in the span of the first two basis vectors. Also note that the first and last vectors in this basis span the kernel of Q2Q_{2}.

where the last inequality follows from our assumption t3t\geq 3.

Abusing notation, we will also write QiQ_{i} to denote the matrix QiQ_{i} written in this new basis. Then, 12Q1+12Q2\frac{1}{2}Q_{1}+\frac{1}{2}Q_{2} can be bounded below by

We bound the minimum eigenvalue of the top-left two-by-two submatrix here as the determinant divided by the trace of the submatrix:

Plugging back into our lower bound on 12Q1+12Q2\frac{1}{2}Q_{1}+\frac{1}{2}Q_{2} gives

We may also write Q3Q_{3} in this basis in block form by first letting q3q_{3} be the orthogonal projection of Q31t+2t+2Q_{3}\tfrac{\mathbf{1}_{t+2}}{\sqrt{t+2}} onto the subspace orthogonal to 1t+1\mathbf{1}_{t+1} and writing

Note that 1t+2Q31t+2=12H\mathbf{1}_{t+2}^{\intercal}Q_{3}\mathbf{1}_{t+2}=\frac{1}{2}H from our previous section.

We now apply the Schur complement lemma to the bottom right entry of this second matrix, which yields that this matrix is PSD if and only if

The second inequality follows because q3q_{3} is the orthogonal projection of Q31t+2t+2Q_{3}\tfrac{\mathbf{1}_{t+2}}{\sqrt{t+2}} onto a subspace.

We now bound Q3op\left\|Q_{3}\right\|_{\textup{op}} and Q31t+22\left\|Q_{3}\mathbf{1}_{t+2}\right\|_{2} separately.

For this, we apply the triangle inequality to break the summation over all entries in γ\gamma into summations over just the \starth row and the first superdiagonal:

For the second term in the bound on Q3op\left\|Q_{3}\right\|_{\textup{op}}, note that for all i[0,t1]i\in[0,t-1], we have

is a tridiagonal matrix. Using the bounds γi,i+12H\left|\gamma_{i,i+1}\right|\leq 2H and hiH\mathfrak{h}_{i}\leq H, we have that all entries in this tridiagonal matrix have magnitude bounded by 2(2H)(1+H/2)3H22(2H)(1+H/2)\leq 3H^{2}. We may bound the operator norm of this matrix as the sum of the operator norms of each diagonal. Thus,

For the second term in our bound on Q31t+21\left\|Q_{3}\mathbf{1}_{t+2}\right\|_{1}, a direct calculation shows

Thus, we may bound this quantity above by 2H22H^{2}.

This implies that our final expression is PSD as long as

Invoking our bound on HH, we have that (λ,γ/2)Sh/2,Δ(\lambda,\gamma/2)\in\mathcal{S}_{\mathfrak{h}/2,\Delta} for any

Proof of Lemma 3.1 and Construction of Certificates

First for k=0k=0, we address the straightforwardness of h(0)/2=h^{(0)}/2= with parameter Δ(0)=1/2\Delta^{(0)}=1/2 individually. One can verify this by noting the below values of (λ,γ)(\lambda,\gamma) are a member of S,1/2\mathcal{S}_{,1/2} (see Mathematica proof 5.1):

For each k1k\geq 1, in light of Theorem 4.1, Lemma 3.1 follows if we can demonstrate certificates λ(k)\lambda^{(k)} satisfying the sufficient conditions on straightforwardness therein for each h(k)\mathfrak{h}^{(k)}. We do this in four parts. First, we provide a construction for our claimed certificates λ(k)\lambda^{(k)}, which satisfy the first three conditions in Theorem 4.1. Next, we will derive lower bounds on λi,i+1\lambda_{i,i+1} and the second-smallest eigenvalue of W2(λ)W_{2}(\lambda):

Here, wkw_{k} is a vector that will be constructed shortly. The proofs of these two facts are relatively tedious but ultimately straightforward. We defer the proofs of these two statements to Appendices D and E.

The proof of Lemma 3.1 will then be a direct application of Theorem 4.1 and the bounds stated in (5.1) and (5.2).

The remainder of this section provides our construction for the certificates λ(k)\lambda^{(k)} and verifies the lower bound on the second smallest eigenvalue of W2(λ)W_{2}(\lambda).

Although our construction of λ(k)\lambda^{(k)} only uses elementary arithmetic operations, it is still quite complicated. We first present as examples our certificates λ(k)\lambda^{(k)} for k=1,2k=1,2. Together with Theorem 4.1, this proves the straightforwardness of each (1η)h(k)(1-\eta)\mathfrak{h}^{(k)} for k=1,2k=1,2.

2 Preliminary Definitions

We will begin with some auxilliary definitions that will be used in our definition of λ\lambda. For i0i\geq 0, we let p(i)p(i) denote the number of one’s in the binary expansion of i+1i+1 and we let z(i)=log2(i+1)z(i)=\lfloor\log_{2}(i+1)\rfloor.

At times, it will be convenient to index entries of λ(k)\lambda^{(k)} backwards from the bottom-right instead of top-left. We define revk(t)2k+12t\operatorname{rev}_{k}(t)\coloneqq 2^{k+1}-2-t. The value of kk will always be clear from context and we will simply write rev(t)\operatorname{rev}(t). Lemma B.5 lists some useful relationships between ν(r+1)\nu(r+1), z(r)z(r), p(r)p(r) and ν(rev(r+1))\nu(\operatorname{rev}(r+1)), z(2k+22rev(r))z(2^{k+2}-2-\operatorname{rev}(r)) and p(rev(r))p(\operatorname{rev}(r)).

For k1k\geq 1, recursively define σ(k)\sigma^{(k)} to be a vector of length 2k112^{k-1}-1 as follows: σ(1)=\sigma^{(1)}=\emptyset is the empty vector, and for k2k\geq 2,

For k0k\geq 0, let ρk\rho_{k} be a vector of length 2k1+1+2k\lfloor 2^{k-1}\rfloor+1+2^{k} as follows: ρ0=\rho_{0}= and for k1k\geq 1,

For k0k\geq 0, let wkw_{k} be the vector of length 2k2^{k} as follows: w0=w_{0}= and for k1k\geq 1,

We now present our construction for λ(k)\lambda^{(k)}. Throughout this section, fix k1k\geq 1. The matrix λ(k)\lambda^{(k)} is a (2k+1+1)×(2k+1+1)(2^{k+1}+1)\times(2^{k+1}+1) matrix that we construct below. We will index the rows and columns of λ(k)\lambda^{(k)} by {,0,1,,2k+11}\{\star,0,1,\dots,2^{k+1}-1\}.

There are five cases for λi(k)\lambda^{(k)}_{i} depending on ii. See Figure 2 for a depiction of the different cases.

Case 1: i+1<2ki+1<2^{k} and i+1i+1 is not a power of 2.

Case 2: i+1>2ki+1>2^{k} and 2k+11i2^{k+1}-1-i is not a power of 2.

While we find this vector concatenation notation to be more compact and easier to read than specifying each entry of λ\lambda separately, we will also give λ\lambda entry by entry.

Case 1: i+1<2ki+1<2^{k} and i+1i+1 is not a power of 2.

Case 2: i+1>2ki+1>2^{k} and 2k+11i2^{k+1}-1-i is not a power of 2.

In the remainder of this section, we fix k1k\geq 1 and let λ=λ(k)\lambda=\lambda^{(k)}. In this subsection only, we will let L()\mathfrak{L}(\cdot) denote the second-smallest eigenvalue of its argument.

We recognize this as the Laplacian of the weighted graph on the vertices [0,2k+11][0,2^{k+1}-1] where the vertices ii and jj are connected with an edge with weight λi,j+λj,i\lambda_{i,j}+\lambda_{j,i}. We will lower-bound L(W2(λ))\mathfrak{L}(W_{2}(\lambda)) by identifying a simpler weighted graph that is dominated by our original graph and bounding its second-smallest eigenvalue instead.

The following lemma computes some lower bounds on the entries of λ\lambda. This will allow us to identify our simpler graph. Its proof simply requires checking the relevant entries of λ\lambda and is deferred to Section A.4.

We are now ready to prove a lower bound on L(W2(λ))\mathfrak{L}(W_{2}(\lambda)).

Let k1k\geq 1. Then, the second smallest eigenvalue of W2(λ)W_{2}(\lambda) is at least 1286(1+2)k\frac{1}{286}(1+\sqrt{2})^{-k}.

Let L\mathcal{L} denote the Laplacian of the caterpillar graph. Lemma 5.1 implies that L(W2(λ))L(L)\mathfrak{L}(W_{2}(\lambda))\geq\mathfrak{L}(\mathcal{L}).

Now, suppose for the sake of contradiction that L(W2(λ))1286(1+2)k\mathfrak{L}(W_{2}(\lambda))\leq\frac{1}{286}(1+\sqrt{2})^{-k}. For simplicity, let Ξ=1286(1+2)k\Xi=\frac{1}{286}(1+\sqrt{2})^{-k} within this proof. We will deduce from this assumption that the “vertex-weighted” star graph that arises from contracting all path vertices in the caterpillar graph to a single vertex has small algebraic connectivity, from which we will derive a contradiction.

From this, we deduce that for each i[1,k]i\in[1,k], that 12(x(2i1)x(2i11))2Ξ\frac{1}{\sqrt{2}}(x(2^{i}-1)-x(2^{i-1}-1))^{2}\leq\Xi. Chaining these inequalities, for any pair of path vertices, the difference in xx values is bounded above by k21/4Ξ1/2k2^{1/4}\Xi^{1/2}.

Next, let Lstar\mathcal{L}_{\textup{star}} denote the weighted Laplacian on the leaf vertices and the vertex oo, that contains an edge of weight 122(1+2)k\frac{1}{2\sqrt{2}}(1+\sqrt{2})^{-k} from oo to each leaf vertex. Given a leaf vertex ii, let parent(i)\textup{parent}(i) denote the path vertex that ii was attached to in the caterpillar graph. Then,

Here, on the second line, we have used the inequality (ab)22(ac)2+2(cb)2(a-b)^{2}\leq 2(a-c)^{2}+2(c-b)^{2}.

One may check (see Mathematica proof 5.2) that the expression in parentheses is <1<1 for all k1k\geq 1.

On the other hand, (5.5) is the variational characterization of the second smallest eigenvalue of

The identity block in the bottom-right is N×NN\times N where N2N\geq 2. Thus, the second smallest eigenvalue of Diag(μ)1/2LstarDiag(μ)1/2{\rm Diag}(\mu)^{-1/2}\mathcal{L}_{\textup{star}}{\rm Diag}(\mu)^{-1/2} is at least 122(1+2)k\frac{1}{2\sqrt{2}}(1+\sqrt{2})^{-k} by Cauchy’s Interlacing Theorem, a contradiction.∎

Acknowledgements. Benjamin Grimmer’s work was supported in part by the Air Force Office of Scientific Research under award number FA9550-23-1-0531.

References

Appendix A Deferred Proofs and Calculations

First we verify αkβk+1\alpha_{k}\leq\beta_{k+1}, then we prove our bounds relating HkH_{k} and μk\mu_{k}, and finally we show βkαk\beta_{k}\leq\alpha_{k}. The defining equation of αk\alpha_{k} is that αk\alpha_{k} is the unique root larger than 11 of qkq_{k}. It is clear that βk+11\beta_{k+1}\geq 1, thus to show αkβk+1\alpha_{k}\leq\beta_{k+1} suffices to show that qk(βk+1)>0q_{k}(\beta_{k+1})>0. We compute

To bound HkH_{k}, we first claim that the sum of all βi\beta_{i}’s in h(k)\mathfrak{h}^{(k)} is given by 2((1+2)k1)2k\sqrt{2}((1+\sqrt{2})^{k}-1)-2k. This follows from noting each βi\beta_{i} appears in h(k)\mathfrak{h}^{(k)} a total of 2(2ki11)2(2^{k-i-1}-1) times and so the total value of βi\beta_{i} terms in h(k)\mathfrak{h}^{(k)} is

Recall also that μk\mu_{k} is the sum of all other entries in h(k)\mathfrak{h}^{(k)} plus two. Thus,

To get an upper bound on this quantity, note that αiβi+1=1+(1+2)i\alpha_{i}\leq\beta_{i+1}=1+(1+\sqrt{2})^{i}. Thus,

To get a lower bound, observe that αi1\alpha_{i}\geq 1 for all ii. Thus Hk22((1+2)k1)H_{k}\geq 2\sqrt{2}((1+\sqrt{2})^{k}-1). The first set of bounds for μk\mu_{k} follows from the identity μk1=Hk/2\mu_{k}-1=H_{k}/2. The final claimed inequality follows directly as μkμk12(1+2)k22(1+2)k1=(22)(1+2)k1\mu_{k}-\mu_{k-1}\geq\sqrt{2}(1+\sqrt{2})^{k}-2\sqrt{2}(1+\sqrt{2})^{k-1}=(2-\sqrt{2})(1+\sqrt{2})^{k-1}.

To conclude, we show βkαk\beta_{k}\leq\alpha_{k} by showing qk(βk)0q_{k}(\beta_{k})\leq 0. We compute

where the inequality exactly follows from our lower bound on μk1\mu_{k}-1.

We previously claimed in Section 3.2 that i=0t1(hi(k)1)=1\prod_{i=0}^{t-1}(\mathfrak{h}^{(k)}_{i}-1)=1. We show this by induction below. The following lemma is useful in this calculation.

μi+1=μi+2(αi+βi+12)\mu_{i+1}=\mu_{i}+2(\alpha_{i}+\beta_{i+1}-2).

It is clear that μi+1=μi+2αi+2j=12i1πj(i),\mu_{i+1}=\mu_{i}+2\alpha_{i}+2\sum_{j=1}^{2^{i}-1}\pi^{(i)}_{j}, which we have seen is equal to

As a base case, when k=1k=1, note that α0\alpha_{0} is the positive root of the polynomial

so that α0=32\alpha_{0}=\frac{3}{2}, and h(1)=(32,5,32)h^{(1)}=(\frac{3}{2},5,\frac{3}{2}), which satisfies this equation. Now, assume that the equation holds for h(k1)\mathfrak{h}^{(k-1)}, i.e. i=02k11(hi(k1)1)=1\prod_{i=0}^{2^{k-1}-1}(\mathfrak{h}^{(k-1)}_{i}-1)=1. Expand this expression to be

Computing the product of all of the βi1\beta_{i}-1, this simplifies to

Combining our previous expressions, we obtain by Lemma A.1 that

Now, we note that by the defining equation of αk1\alpha_{k-1} that

In this section, we fix a k1k\geq 1 and let λ\lambda denote the construction λ(k)\lambda^{(k)} given in Section 5.3. Our goal is to bound mini[0,t1]λi,i+1\min_{i\in[0,t-1]}\lambda_{i,i+1} below for use in proving Lemma 3.1 via Theorem 4.1.

We prove this by bounding λi,i+1(k)\lambda^{(k)}_{i,i+1} for any i[0,2k+12]i\in[0,2^{k+1}-2] separately across our five cases. We will also make use of the easy that μz(i)1(1+2)z(i)+1\mu_{z(i)}-1\geq(1+\sqrt{2})^{z(i)+1}

Case 1: i+1<2ki+1<2^{k} and i+1i+1 is not a power of 2.

Case 4: i=2k1i=2^{k}-1. If k=0k=0, then this entry is β1/(μk1)\beta_{1}/(\mu_{k}-1). Otherwise, this entry is β0/(μk1)\beta_{0}/(\mu_{k}-1). As β1β0\beta_{1}\geq\beta_{0}, we may bound the general case as

A.4 Bounding the Edge Weights in the Caterpillar graph

In Section 5.4, we required lower bounds on specific entries of λ(k)\lambda^{(k)}. These lower bounds were stated in Lemma 5.1 and are proved below.

Fix k1k\geq 1 and let λ=λ(k)\lambda=\lambda^{(k)}.

Appendix B Useful Supporting Identities and Properties

Here are two recurrence relations for λ(k)\lambda^{(k)} that are useful in various calculations. They say that certain entries (or rows) of λ(k)\lambda^{(k)} are simply scalar multiples of other entries (or rows).

The claim holds if i=ji=j. In the remainder assume iji\neq j. Note that λi,j(k)0\lambda^{(k)}_{i,j}\neq 0 if and only if 2ν(i+1)1ji2ν(i+1)-2^{\nu(i+1)-1}\leq j-i\leq 2^{\nu(i+1)}. As (2z+j)(2z+i)=ji(2^{z^{\prime}}+j)-(2^{z^{\prime}}+i)=j-i and ν(2z+i+1)=ν(i+1)\nu(2^{z^{\prime}}+i+1)=\nu(i+1) we deduce that λi,j(k)0\lambda^{(k)}_{i,j}\neq 0 and and only if λ2z+i,2z+j(k)0\lambda^{(k)}_{2^{z^{\prime}}+i,2^{z^{\prime}}+j}\neq 0.

If λi,j(k)=0\lambda^{(k)}_{i,j}=0, then the claim holds. In the remainder, assume λi,j(k)0\lambda^{(k)}_{i,j}\neq 0. In this case, our definitions imply that

and since ν(2z+i+1)=ν(i+1)\nu(2^{z^{\prime}}+i+1)=\nu(i+1), and the number of one’s in the binary expansion of 2z+i2^{z^{\prime}}+i is exactly one more than the number of one’s in the binary expansion of ii, we see that

Comparing the two expressions yields our result. ∎

In the first case, we have that both λrev(r)(k)\lambda^{(k)}_{\operatorname{rev}(r)} and λrev(r)(k)\lambda^{(k)}_{\operatorname{rev}(r^{\prime})} are defined according to Case 2 and that ν(rev(r)+1)=ν(r+1)=ν(r+1)=ν(rev(r)+1)\nu(\operatorname{rev}(r^{\prime})+1)=\nu(r^{\prime}+1)=\nu(r+1)=\nu(\operatorname{rev}(r)+1). Thus, these two rows (after the natural re-indexing) are scalar multiples of ρν(r+1)\rho_{\nu(r+1)} (and hence of each other). Using the identities Lemma B.5, we have

Comparing the two coefficients proves the claim.

Comparing the two coefficients proves the claim. ∎

B.2 Algebraic Properties of μ𝜇\mu

There are various algebraic properties of μ\mu that we will use in this paper.

and taking the square root of both sides implies the last claim. ∎

Suppose k1k\geq 1, then 2(βk1)+(μk11)(μk1)=μk12(\beta_{k}-1)+\sqrt{(\mu_{k-1}-1)(\mu_{k}-1)}=\mu_{k}-1, or equivalently, 2(1+2)k1+(μk11)(μk1)=μk12(1+\sqrt{2})^{k-1}+\sqrt{(\mu_{k-1}-1)(\mu_{k}-1)}=\mu_{k}-1.

By Lemma A.1, we have μk1=μk2(αk1+βk2)\mu_{k-1}=\mu_{k}-2(\alpha_{k-1}+\beta_{k}-2). Applying this identity and combining, we get that this is equivalent to

which is the defining equation for αk1\alpha_{k-1}. ∎

B.3 Properties of the revrev\operatorname{rev} operation

Suppose 0r2k10\leq r\leq 2^{k}-1. Then, rev(rev(r))=r\operatorname{rev}(\operatorname{rev}(r))=r and

In particular, p(rev(r))+z(r)k=z(r)p(r)ν(r+1)+2p(\operatorname{rev}(r))+z(r)-k=z(r)-p(r)-\nu(r+1)+2.

We recall that rev(r)=2k+12r\operatorname{rev}(r)=2^{k+1}-2-r. For the first identity, note

Then, recall that the 2-adic valuation for the sum or difference of two numbers with different 2-adic valuations is the smaller of two. As ν(r+1)k\nu(r+1)\leq k, we have that the above quantity is equal to ν(r+1)\nu(r+1).

Finally, p(rev(r))p(\operatorname{rev}(r)) is the number of ones in the binary expansion of 2k+11r2^{k+1}-1-r. This is equivalent to k+1k+1 minus the number of ones in the binary expansion of rr. Now, consider the binary expansion of r+1r+1. The smallest position for which the binary expansion of r+1r+1 is equal to one, i.e., ν(r+1)\nu(r+1), is the same as the smallest position for which the binary expansion of rr is equal to zero. The difference in the number of ones in their binary expansion is then ν(r+1)1\nu(r+1)-1. We have deduced that p(rev(r))=k+1p(r1)=k+1(p(r)+ν(r+1)1)=kp(r)ν(r+1)+2p(\operatorname{rev}(r))=k+1-p(r-1)=k+1-(p(r)+\nu(r+1)-1)=k-p(r)-\nu(r+1)+2. ∎

The support of λ(k)\lambda^{(k)} has a rich combinatorial structure, which we need to make use of extensively in our computations. We record some facts about this support and their proofs here. For now, let us fix kk, and let λ\lambda refer to λ(k)\lambda^{(k)}.

From our definition of λ\lambda, λi,j0\lambda_{i,j}\neq 0 if and only if i>j>i2ν(i+1)1i>j>i-2^{\nu(i+1)-1} or i+2ν(i+1)>j>ii+2^{\nu(i+1)}>j>i.

It is useful to us to understand for a fixed jj, which are the ii where λi,j0\lambda_{i,j}\neq 0. For a given j2k+11j\leq 2^{k+1}-1, we let

Suppose that j[1,2k1]j\in[1,2^{k}-1] has the binary expansion j=a=0zba2aj=\sum_{a=0}^{z}b_{a}2^{a}, where bi{0,1}b_{i}\in\{0,1\} and bz=1b_{z}=1, then

We begin by showing that if i=a=rzba2a1i=\sum_{a=r}^{z}b_{a}2^{a}-1 for some r[0,z]r\in[0,z] where br=1b_{r}=1, then iSji\in S_{j}^{-}. Note that ν(i+1)=r\nu(i+1)=r. This implies that

Now, we show the reverse direction, i.e., if i<ji<j and i+2ν(i+1)ji+2^{\nu(i+1)}\geq j, then i=a=ν(i+1)zba2a1i=\sum_{a=\nu(i+1)}^{z}b_{a}2^{a}-1. Note that z(i+1)zz(i+1)\leq z since i<ji<j. This implies that the binary expansion of i+1i+1 can be expressed as

For the sake of contradiction, suppose that babab_{a}^{\prime}\neq b_{a} for some aν(i+1)a\geq\nu(i+1). In this case, let aa^{*} be the largest aa so that babab_{a}^{\prime}\neq b_{a}. If ba=0b_{a^{*}}^{\prime}=0 while ba=1b_{a^{*}}=1, then

If ba=1b_{a^{*}}^{\prime}=1, while ba=0b_{a^{*}}=0, then

In either case, we reach a contradiction. ∎

Suppose 2z1j<2z+112^{z}-1\leq j<2^{z+1}-1 with z<kz<k. Then, iSj+i\in S_{j}^{+} if and only if i[j+1,2k+12]i\in[j+1,2^{k+1}-2] and ij2ν(i+1)1i-j\leq 2^{\nu(i+1)-1}. In particular, if iSj+i\in S_{j}^{+}, then j<i2z+11j<i\leq 2^{z+1}-1. If in addition j=2z1j=2^{z}-1, then Sj+S_{j}^{+} is the singleton set {2z+11}\{2^{z+1}-1\}.

This is clear from the support of λ\lambda. ∎

Appendix D Proof of Theorem D.1

In this section, we will show that λ(k)\lambda^{(k)} satisfies the first main condition of Theorem 4.1.

The sum of the zeroth row of λ(k)\lambda^{(k)} is one larger than the sum of the zeroth column of λ(k)\lambda^{(k)}.

The sum of the 2k+112^{k+1}-1 row of λ(k)\lambda^{(k)} is one less than the sum of the 2k+112^{k+1}-1 column of λ(k)\lambda^{(k)}.

The equivalence of the two statements follows from Lemma 4.2. Thus, it suffices to prove the three statements in the second claim. We show the first item in Lemma D.16. We show the second item in Lemma D.17, Lemma D.18, and Lemma D.19. We show the third item in Lemma D.20. ∎

In the remainder of this section, we fix k1k\geq 1 and let h=h(k)h=\mathfrak{h}^{(k)} and λ=λ(k)\lambda=\lambda^{(k)}. Section D.1 computes the sums of each row of λ\lambda. Section D.2 computes the sums of each column of λ\lambda. Finally, Section D.3 proves lemmas claimed above. Various algebraic identities involving the entries of hh will be used in this section and proven in Appendix B.

Each row of λ\lambda is composed of various components; we will enumerate their sums here.

For k0k\geq 0, i=12k1πi(k)=βk+12\sum_{i=1}^{2^{k}-1}\pi^{(k)}_{i}=\beta_{k+1}-2.

First, note that π(0)=\pi^{(0)}=\emptyset so that i=10πi(0)=0\sum_{i=1}^{0}\pi^{(0)}_{i}=0. We also have β12=0\beta_{1}-2=0.

Note that the first 2k112^{k-1}-1 entries of π(k)\pi^{(k)} are identical to those of π(k1)\pi^{(k-1)}, and the same holds for the last 2k112^{k-1}-1 entries. By induction, we may conclude that

See Mathematica proof D.1 for a proof of the second identity. ∎

For k1k\geq 1, the sum of the entries in σ(k)\sigma^{(k)} is

We show this by induction: note that the sum of the entries in σ(1)\sigma^{(1)} is 0, and so is this expression.

For k>1k>1, the sum of the entries in σ(k)\sigma^{(k)} is

where Σk1\Sigma_{k-1} is the sum of the entries in σ(k1)\sigma^{(k-1)}.

By Lemma D.1, and the induction hypothesis, this is

Note that (βk12)+βk=(2+2)(1+2)k2(\beta_{k-1}-2)+\beta_{k}=(2+\sqrt{2})(1+\sqrt{2})^{k-2}, so that this becomes

For k0k\geq 0, the sum of the entries in ρk\rho_{k} is

A simple calculation shows that this holds for k=0k=0 (see Mathematica proof D.2). Now suppose k1k\geq 1. Using Lemma D.1 and Lemma D.2, we see that the sum of the entries in ρk\rho_{k} is

This is equal to the claimed expression (see Mathematica proof D.3). ∎

For k0k\geq 0, the sum of the entries of wkw_{k} is μk1\sqrt{\mu_{k}-1}.

We proceed by induction: the sum of the entries of w0w_{0} is 1=μ011=\sqrt{\mu_{0}-1}.

By expanding the definition and applying Lemma D.1 and the inductive hypothesis, the sum of the entries of wkw_{k} is βk2μk1+βkμk1+μk11\frac{\beta_{k}-2}{\sqrt{\mu_{k}-1}}+\frac{\beta_{k}}{\sqrt{\mu_{k}-1}}+\sqrt{\mu_{k-1}-1}. This is equivalent to the claimed expression by Lemma B.4. ∎

D.1.2 Computing Row Sums

We will give the sum of the entries in each row, dividing into the cases above.

Case 1: i+1<2ki+1<2^{k} and i+1i+1 is not a power of 2. The sum of the entries of λi\lambda_{i} is

Case 2: i+1>2ki+1>2^{k} and 2k+11i2^{k+1}-1-i is not a power of 2. The sum of the entries of λi\lambda_{i} is

Case 4: i=2k1i=2^{k}-1. The sum of the entries of λi\lambda_{i} is

Cases 1 and 2 follow directly by definition and Lemma D.3. The expressions for Cases 3 and 4 follow by adding up the partial sums computed in the previous subsection. See Mathematica proof D.4 and Mathematica proof D.5.

For Case 5, we combine the expressions for the partial sums (see Mathematica proof D.6) to get the row sum in the form

Combining these expressions proves the claim (see Mathematica proof D.7). ∎

D.2 Column Sums

denote the indices above jj and indices below jj in the support of the jjth column. The following lemmas give computational descriptions of these sets that will be useful in computing the column sums. We will give their proofs in Appendix C

Suppose that j[1,2k1]j\in[1,2^{k}-1] has the binary expansion j=a=0zba2aj=\sum_{a=0}^{z}b_{a}2^{a}, where bi{0,1}b_{i}\in\{0,1\} and bz=1b_{z}=1, then

Suppose 2z1j<2z+112^{z}-1\leq j<2^{z+1}-1 with z<kz<k. Then, iSj+i\in S_{j}^{+} if and only if i[j+1,2k+12]i\in[j+1,2^{k+1}-2] and ij2ν(i+1)1i-j\leq 2^{\nu(i+1)-1}. In particular, if iSj+i\in S_{j}^{+}, then j<i2z+11j<i\leq 2^{z+1}-1. If in addition j=2z1j=2^{z}-1, then Sj+S_{j}^{+} is the singleton set {2z+11}\{2^{z+1}-1\}

D.2.2 Computing Column Sums

Fix 1j2k11\leq j\leq 2^{k}-1. First suppose j+1j+1 is not a power of 22 and let zz so that 2z1<j<2z+112^{z}-1<j<2^{z+1}-1. Let pp denote the number of ones in the binary expansion of j+1j+1. Then,

On the other hand, if j=2z1j=2^{z}-1 for some z=1,,kz=1,\dots,k, then

We begin with the case where j+1j+1 is not a power of 22. Let j+1=a=0zba2aj+1=\sum_{a=0}^{z}b_{a}2^{a} be the binary expansion of j+1j+1. Since 2ν(j+1)2^{\nu(j+1)} is the largest power of 2 dividing j+1j+1, it follows that

Case (i): Let i=2z1i=2^{z}-1. As j+1j+1 is not a power of 22, we have that ν(j+1)<z\nu(j+1)<z and ν(ji)=ν(j+1)\nu(j-i)=\nu(j+1). Thus, by definition of λ\lambda, we have

where pip_{i} is the number of ones in the binary expansion of i+1i+1 and we note that ν(ji)=ν((j+1)(i+1))=ν(j+1)\nu(j-i)=\nu((j+1)-(i+1))=\nu(j+1). Now, note that if we sum over all ii of the form of case 2, there is exactly one such term for each possible value of pip_{i} from 22 through p1p-1. That is, if we add all such λi,j\lambda_{i,j}, we obtain (see Mathematica proof D.8)

Once again, if we add up all such terms, we collect one for each possible value of pip_{i} from pp to p+ν(j+1)1p+\nu(j+1)-1, yielding (see Mathematica proof D.9)

If j+1j+1 is not a power of two, then adding the sums in the three cases yields (see Mathematica proof D.10)

Adding all of these terms yields 12(μz1)\frac{1}{2}(\mu_{z}-1) (see Mathematica proof D.11). ∎

Fix jj so that 2z1<j<2z+112^{z}-1<j<2^{z+1}-1 where z<kz<k. If there are pp one’s in the binary expansion of j+1j+1, then

We will show this by induction on the number of ones in the binary expansion of j+1j+1.

If there are exactly 2 one’s in the binary expansion of j+1j+1, then

as can be seen in Mathematica proof D.12.

Now, we assume p>2p>2. Let j=2z+jj=2^{z}+j^{\prime}, where j<2z1j^{\prime}<2^{z}-1 has p1p-1 one’s in its binary expansion. We have by Lemma D.9 that Sj+={i+2z:iSj+}{2z+11}S_{j}^{+}=\{i^{\prime}+2^{z}:i^{\prime}\in S_{j^{\prime}}^{+}\}\cup\{2^{z+1}-1\}. Once again we consider two cases: either 2z+11{i+2z:iSj+}2^{z+1}-1\in\{i^{\prime}+2^{z}:i^{\prime}\in S_{j^{\prime}}^{+}\}, or it is not.

Assume that 2z+11{i+2z:iSj+}2^{z+1}-1\in\{i^{\prime}+2^{z}:i^{\prime}\in S_{j^{\prime}}^{+}\}. This implies that 2z1Sj+2^{z}-1\in S_{j^{\prime}}^{+}, or equivalently that z(j)=z1z(j^{\prime})=z-1. In this case,

Now, suppose iSj+i^{\prime}\in S^{+}_{j^{\prime}} is not a power of 2. Since z(i)=z(i)+1z(i)=z(i^{\prime})+1, and p(i)=p(i)+1p(i)=p(i^{\prime})+1, Lemma B.1 implies that

The only element of Sj+S^{+}_{j^{\prime}} and that is one less than a power of 2 is i=2z1i^{\prime}=2^{z}-1. If 2z12a>j>2z12a+12^{z}-1-2^{a}>j^{\prime}>2^{z}-1-2^{a+1}, then 2z+112a>j>2z+112a2^{z+1}-1-2^{a}>j>2^{z+1}-1-2^{a}, and

Now assume that 2z+11∉{i+2z:iSj+}2^{z+1}-1\not\in\{i^{\prime}+2^{z}:i^{\prime}\in S_{j^{\prime}}^{+}\}. This is equivalent to 2z+11>j>2z12^{z^{\prime}+1}-1>j^{\prime}>2^{z^{\prime}}-1 for some z<z1z^{\prime}<z-1. Now, we have that

By Lemma B.1, we note that for all iSj+i^{\prime}\in S_{j^{\prime}}^{+} other than i=2z+11i^{\prime}=2^{z^{\prime}+1}-1,

It remains to consider λ2z+2z+11,2z+j\lambda_{2^{z}+2^{z^{\prime}+1}-1,2^{z}+j^{\prime}} and λ2z+11,2z+j\lambda_{2^{z+1}-1,2^{z}+j^{\prime}}.

Note that 2z+12z1>j>2z+12z2^{z+1}-2^{z-1}>j>2^{z+1}-2^{z} and that j+1<2z1j^{\prime}+1<2^{z^{\prime}-1} is not a power of 2, so

Also, if 2z+112a>j>2z+112a+12^{z^{\prime}+1}-1-2^{a}>j^{\prime}>2^{z^{\prime}+1}-1-2^{a+1}, then

Note that λ2z+2z1,2z+j+λ2z+11,2z+j=(1+2)2(1+zz)μz+11μz+11λ2z1,j\lambda_{2^{z}+2^{z^{\prime}}-1,2^{z}+j^{\prime}}+\lambda_{2^{z+1}-1,2^{z}+j^{\prime}}=(1+\sqrt{2})^{2(1+z^{\prime}-z)}\frac{\mu_{z+1}-1}{\mu_{z+1}-1}\lambda_{2^{z^{\prime}}-1,j^{\prime}}, so that

Inspecting the support of λ(k)\lambda^{(k)}, we have that Sj+=S_{j}^{+}=\varnothing, so we need only consider the sum of the entries corresponding to element of Sj+S_{j}^{+}.

This formula also holds in the case where l=k1l=k-1.

Summing up all entries in the column gives

Suppose 0r<2k10\leq r<2^{k}-1 and r+1r+1 is not a power of 22. Then,

Let a=0kba2a\sum_{a=0}^{k}b_{a}2^{a} be the binary expansion of rev(r)\operatorname{rev}(r). Equivalently if a=0kca2a\sum_{a=0}^{k}c_{a}2^{a} is the binary expansion of r+1r+1, then ba=(1ca)b_{a}=(1-c_{a}) for all a[0,k]a\in[0,k]. Note that ck=0c_{k}=0 so that bk=1b_{k}=1. The set

Suppose τ[0,k]\tau\in[0,k] and bτ=1b_{\tau}=1. Let i=a=τkba2a1i=\sum_{a=\tau}^{k}b_{a}2^{a}-1. We have that 0ν(r+1)<z(r)<k0\leq\nu(r+1)<z(r)<k. We enumerate the possible values of λi,rev(r)(k)\lambda^{(k)}_{i,\operatorname{rev}(r)} according to τ[z(r)+1,k]\tau\in[z(r)+1,k], τ[ν(r+1)+1,z(r)1]\tau\in[\nu(r+1)+1,z(r)-1] and τ[0,ν(r+1)1]\tau\in[0,\nu(r+1)-1].

Now, suppose τ[0,ν(r+1)1]\tau\in[0,\nu(r+1)-1], then a=0τ1ba2a+1=2τ\sum_{a=0}^{\tau-1}b_{a}2^{a}+1=2^{\tau}. In this case λi(k)\lambda^{(k)}_{i} is defined according to Case 2, p(i)=k+1p(r)τp(i)=k+1-p(r)-\tau, ν(i+1)=τ\nu(i+1)=\tau, and z(2k+12i)=z(r)z(2^{k+1}-2-i)=z(r). Then,

Now suppose τ[ν(r+1)+1,z(r)1]\tau\in[\nu(r+1)+1,z(r)-1] satisfies bτ=1b_{\tau}=1. In this case, λi(k)\lambda^{(k)}_{i} is defined according to Case 2, ν(i+1)=τ\nu(i+1)=\tau, and z(2k+12i)=z(r)z(2^{k+1}-2-i)=z(r). Then,

When we sum up over entries of this form, the value of p(i)p(i) is in bijection with [kz(r)+1,kν(r+1)p(r)+1][k-z(r)+1,k-\nu(r+1)-p(r)+1]. Thus, the final part of this summation is

Finally, summing up all entries gives the desired claim (see Mathematica proof D.13). ∎

Let 0r<2k10\leq r<2^{k}-1 and suppose r+1r+1 is not a power of 22. Then,

We will induct on p(r)p(r). As r+1r+1 is not a power of two, we have p(r)2p(r)\geq 2.

First, suppose p(r)=2p(r)=2. We compute directly that Srev(r)+={rev(2z(r)1)}S_{\operatorname{rev}(r)}^{+}=\left\{\operatorname{rev}(2^{z(r)}-1)\right\}. Then

One can verify that the second expression coincides with the first expression when ν(r+1)+1=z(r)\nu(r+1)+1=z(r) (see Mathematica proof D.14). Thus,

We can also verify that this expression coincides with the claimed expression for i=Srev(r)+λi,rev(r)(k)\sum_{i=S^{+}_{\operatorname{rev}(r)}}\lambda^{(k)}_{i,\operatorname{rev}(r)} (see Mathematica proof D.15).

Now, suppose p(r)3p(r)\geq 3 and let r=r2zr^{\prime}=r-2^{z}. We have that p(r)=p(r)12p(r^{\prime})=p(r)-1\geq 2 so that r+1r^{\prime}+1 is not a power of two. Let z=z(r)z^{\prime}=z(r^{\prime}). We will now apply Lemma D.7. There are two cases: where z=z1z^{\prime}=z-1 and where z<z1z^{\prime}<z-1.

In the first case, z=z1z^{\prime}=z-1 and Lemma D.7 states that Srev(r)+={i2z:iSrev(r)+}S^{+}_{\operatorname{rev}(r)}=\left\{i-2^{z}:\,i\in S^{+}_{\operatorname{rev}(r^{\prime})}\right\} so that

For all iSrev(r)+i\in S^{+}_{\operatorname{rev}(r^{\prime})}, it must hold that rev(2z+11)<rev(r)<irev(2z1)\operatorname{rev}(2^{z^{\prime}+1}-1)<\operatorname{rev}(r^{\prime})<i\leq\operatorname{rev}(2^{z^{\prime}}-1). We may now apply Lemma B.2 to get

Now, consider the case z<z1z^{\prime}<z-1. In this case, Lemma D.7 states that Srev(r)+={rev(2z1)}{i2z:iSrev(r)+}S^{+}_{\operatorname{rev}(r)}=\left\{\operatorname{rev}(2^{z}-1)\right\}\cup\left\{i-2^{z}:\,i\in S^{+}_{\operatorname{rev}(r^{\prime})}\right\}. Again, for all iSrev(r)+i\in S^{+}_{\operatorname{rev}(r^{\prime})}, it must hold that rev(2z+11)<rev(r)<irev(2z1)\operatorname{rev}(2^{z^{\prime}+1}-1)<\operatorname{rev}(r^{\prime})<i\leq\operatorname{rev}(2^{z^{\prime}}-1). We may now apply Lemma B.2 to get

We evaluate the two terms separately. First, note that 2z<r+1<2z+12z12^{z^{\prime}}<r^{\prime}+1<2^{z^{\prime}+1}\leq 2^{z-1}. Thus,

For the second term, we apply the inductive hypothesis and Lemma B.2 to get

One can check that the sum of these two expressions is given by the claimed expression (see Mathematica proof D.16). ∎

D.3 Comparisons of Row and Column Sums

The sum of the zeroth row of λ\lambda is one larger than the sum of the zeroth column of λ\lambda.

The only entry of λ0\lambda_{0} is 2. On the other hand, the only row which has an entry in column 0 is λ1\lambda_{1}, and λ1,0=1\lambda_{1,0}=1, which implies the lemma. ∎

For i=1,,2k2i=1,\dots,2^{k}-2, the sum of the iith row of λ\lambda is equal to the sum of the iith column of λ\lambda.

We first show this assuming i=2z1i=2^{z}-1. It follows from our earlier work that the sum of the entries of the ithi^{th} row is

On the other hand, the nonzero entries of the ithi^{th} column are those indexed by SiS_{i}^{-} and 2z+112^{z+1}-1. The sum of these entries is

Thus, the row sum and the column sum are equal.

We next show this assuming i+1i+1 is not a power of 2. If the number of one’s in the binary expansion of i+1i+1 is p>1p>1, then the sum of the entries in the ithi^{th} row of λ\lambda is

On the other hand, we have seen that the sum of the entries in the ithi^{th} column is

We verify that these two expressions are equivalent in Mathematica proof D.17. ∎

Let k1k\geq 1 and i=2k1i=2^{k}-1. The sum of the iith row of λ\lambda is equal to the sum of the iith column of λ\lambda.

By Lemmas D.5 and D.10, we have that the (2k1)(2^{k}-1)th row sum and column sum are both μk12\frac{\mu_{k}-1}{2}.∎

Suppose 2kj<2k+112^{k}\leq j<2^{k+1}-1. Then, the sum of the entries in the jjth column of λ(k)\lambda^{(k)} is equal to the sum of the entries in the jjth row of λ(k)\lambda^{(k)}.

By Lemma D.5, this is also the associated row sum.

Now, suppose j=rev(r)j=\operatorname{rev}(r) where 0r<2k10\leq r<2^{k}-1 and r+1r+1 is not a power of 22. Then, by Lemmas D.14 and D.15, we have that the jjth column sum is

On the other hand, noting that p(rev(r))=kp(r)ν(r+1)+2p(\operatorname{rev}(r))=k-p(r)-\nu(r+1)+2, we have that the row sum is given by

These expressions are equal as is verified in Mathematica proof D.18. ∎

The sum of the 2k+112^{k+1}-1 row of λ\lambda is one less than the sum of the 2k+112^{k+1}-1 column of λ\lambda.

The final entry in the (2k+11)(2^{k+1}-1)th column is where i=2k1i=2^{k}-1. Then λi,j=1μk1\lambda_{i,j}=\frac{1}{\sqrt{\mu_{k}-1}} and so the column sum is one. ∎

Appendix E Proof of Theorem E.1

In this section, we will show that λ(k)\lambda^{(k)} satisfies the the second main condition of Theorem 4.1.

where ϕ(k)\phi^{(k)} is defined in Equation 5.3.

For the remainder of the section, we fix k1k\geq 1 and let h=h(k)h=\mathfrak{h}^{(k)}, λ=λ(k)\lambda=\lambda^{(k)}, M=M(λ,h)M=M(\lambda,h), and ϕ=ϕ(k)\phi=\phi^{(k)}.

We perform the computation of MM entry by entry. The nature of the definition of λ\lambda requires us to break this computation down into a number of distinct cases. We give a summary of the possible cases in Table 1.

We show that Mi,i=12ϕi2M_{i,i}=\frac{1}{2}\phi_{i}^{2} for all i[0,2k+11]i\in[0,2^{k+1}-1] in Lemmas E.2, E.4, E.3 and E.5.

We then need to consider the off-diagonal entries of MM. Note that MM is symmetric, so we only need to consider Mi,jM_{i,j} where i>ji>j.

We will break the remaining entries of MM into cases, mirroring the cases in the definition of λ\lambda. To reiterate, the cases we consider for an index ii are

i<2k1i<2^{k}-1 and i+1i+1 is not a power of 2.

i>2k1i>2^{k}-1 and 2k+1i12^{k+1}-i-1 is not a power of 2.

We summarize the possible cases and where they are proved in Table 1.∎

We will make extensive use Theorem D.1 as well as the computed expressions for the partial row and column sums from Section D.1.1 and the computational descriptions of Sj+S^{+}_{j} and SjS^{-}_{j} in Section D.2.1.

In this subsection, we will consider the various diagonal entries of MM. We divide these into four cases: Mi,iM_{i,i} where 0i<2k10\leq i<2^{k}-1, where i=2k1i=2^{k}-1, where 2k1<i<2k+112^{k}-1<i<2^{k+1}-1, and where i=2k+11i=2^{k+1}-1.

First, we present a lemma concerning the entries of MM.

We expand the entries of MM defined in Equation 2.4 and note that λ\lambda is zero on its \starth row and column:

If additionally, 0<i<2k+110<i<2^{k+1}-1, then by Theorem D.1, we have that

If i<2k1i<2^{k}-1, then Mi,i=12ϕi2=0M_{i,i}=\frac{1}{2}\phi_{i}^{2}=0.

When i=0i=0, we have that the iith row sum is 22, the iith column sum is 11, and hi=3/2h_{i}=3/2. Thus,

Now, suppose i>0i>0 and 2z<i+1<2z+12^{z}<i+1<2^{z+1} for some z[1,k1]z\in[1,k-1]. By Lemma D.5,

Subtracting these two expressions shows that Mi,i=0M_{i,i}=0 (see Mathematica proof E.1).

On the other hand, if i=2z1i=2^{z}-1 for some z[1,k1]z\in[1,k-1], then by Lemma D.5

Substituting these expressions for μz\mu_{z} and μz+1\mu_{z+1} into our expression for Mi,iM_{i,i} shows that it is equal to zero (see Mathematica proof E.2).∎

Let i=rev(r)i=\operatorname{rev}(r) for some r[0,2k1)r\in[0,2^{k}-1). We divide into cases: either r+1r+1 is a power of two or it is not.

If r=2a1r=2^{a}-1 for some a[0,k1]a\in[0,k-1], then by Lemma D.5,

We also have that Si+=S^{+}_{i}=\varnothing, so that Mi,i=βa+122(μa+11)M_{i,i}=\frac{\beta^{2}_{a+1}}{2(\mu_{a+1}-1)}. On the other hand, we also have that

Now suppose that i=rev(r)i=\operatorname{rev}(r) for some 0r<2k10\leq r<2^{k}-1 where r+1r+1 not a power of 2. Then

By Lemma D.5, and the identity the identity p(rev(r))=kp(r)ν(r+1)+2p(\operatorname{rev}(r))=k-p(r)-\nu(r+1)+2 (see Lemma B.5), we have

Here, we use the fact that the (2k+11)(2^{k+1}-1)th column sum is equal to one, the (2k+11)(2^{k+1}-1)th row is zero, and that ϕ2k+11=1\phi_{2^{k+1}-1}=1. ∎

E.2 Off-Diagonal Entries of M𝑀M

The following lemma gives a description for Mi,jM_{i,j} where iji\neq j and will be used repeatedly throughout this subsection.

We will use the definition of AA and CC to pull out the nonzero entries of each sum. The first sum becomes

We consider a useful lemma, which can be shown by just considering the support of λ\lambda:

Fix i<j<2k+11i<j<2^{k+1}-1. If z(i)>z(j)+1z(i)>z(j)+1 then Mi,j=0M_{i,j}=0.

Also, if z(i)=z(j)+1z(i)=z(j)+1 and i+1i+1 is not a power of 2, then Mi,j=0M_{i,j}=0.

In all of the following calculations, we will consider the expression

We begin by taking care of the easier cases.

Suppose 0r<2k10\leq r<2^{k}-1 and r+1r+1 is not a power of 22. Then,

The second identity holds as ϕr=0\phi_{r}=0. We turn to the first identity. By Lemma E.6, we have

Note that if r<2k11r<2^{k-1}-1, then every term in this expression is zero. Thus we will assume r>2k11r>2^{k-1}-1. Now, we will write 2k(r+1)=τ2^{k}-(r+1)=\tau. Note that 1τ2k111\leq\tau\leq 2^{k-1}-1.

Combining these identities gives Mr,2k1=0M_{r,2^{k}-1}=0. ∎

Suppose 0r<2k10\leq r<2^{k}-1 and r+1r+1 is not a power of 22. Then,

Let τ\tau so that 2τ<r+1<2τ+12^{\tau}<r+1<2^{\tau+1}. Note, we must have τ[0,k1]\tau\in[0,k-1]. By Lemma E.6, we have

On the other hand, ϕ2k1=μk1\phi_{2^{k}-1}=\sqrt{\mu_{k}-1} and ϕrev(r)=(wk)2kr1\phi_{\operatorname{rev}(r)}=(w_{k})_{2^{k}-r-1}. ∎

The only nonzero entry in the first two summations is

Substituting both expressions back in gives

Here, on the last line we have used Lemma B.3. This completes the proof as ϕrev(2r1)=βr+1μr+11\phi_{\operatorname{rev}(2^{r}-1)}=\frac{\beta_{r+1}}{\sqrt{\mu_{r+1}-1}} and ϕrev(2τ1)=βτ+1μτ+11\phi_{\operatorname{rev}(2^{\tau}-1)}=\frac{\beta_{\tau+1}}{\sqrt{\mu_{\tau+1}-1}}. ∎

The second equality follows from the fact that ϕ2k+11=1\phi_{2^{k+1}-1}=1.

We turn to the first equality. Let 0s2k+120\leq s\leq 2^{k+1}-2. First, note if 0s2k20\leq s\leq 2^{k}-2, then ϕs=0\phi_{s}=0 and Ms,2k+11=0M_{s,2^{k+1}-1}=0 by Lemma E.7. Thus, we may assume 2k1s2k+122^{k}-1\leq s\leq 2^{k+1}-2 in the remainder of this proof. We will break the rest of the proof into cases depending on how λs\lambda_{s} is defined, i.e., Cases 2, 4, and 5.

Case 2: Suppose s=rev(r)s=\operatorname{rev}(r) for some 0r<2k10\leq r<2^{k}-1 for which r+1r+1 is not a power of 22. By Lemma E.6,

By Lemma D.6, the nonzero summands in the sum correspond to indices i=rev(2τ1)i=\operatorname{rev}(2^{\tau}-1) where z(r)+1τkz(r)+1\leq\tau\leq k. We also have that hs=βν(r+1)h_{s}=\beta_{\nu(r+1)}. Combining these identities gives

Note that ϕrev(r)=βν(r+1)μz(r)+11\phi_{\operatorname{rev}(r)}=\frac{\beta_{\nu(r+1)}}{\sqrt{\mu_{z(r)+1}-1}}.

Case 4: Suppose s=2k1s=2^{k}-1. By Lemma E.6,

On the other hand, ϕ2k1=μk1\phi_{2^{k}-1}=\sqrt{\mu_{k}-1}.

E.3 Off-Diagonal Entries in Case (1,1)11(1,1)

Our goal for this subsection will be to prove Lemma E.19, stating that the off-diagonal entries Mi,jM_{i,j} are zero for all 0j<i<2k10\leq j<i<2^{k}-1, where neither i+1i+1 nor j+1j+1 are powers of two (equivalently, where p(i),p(j)2p(i),p(j)\geq 2). We will prove Lemma E.19 inductively on the value of min(p(i),p(j))\min(p(i),p(j)). We will make use of the following lemma

Fix i>ji>j such that i+1i+1 and j+1j+1 are both not powers of 2. If z(i)=z(j)z(i)=z(j), so that i=2z+ii=2^{z}+i^{\prime} and j=2z+jj=2^{z}+j^{\prime}, and z(j)<z(i)1z(j^{\prime})<z(i^{\prime})-1, then Mi,j=0M_{i,j}=0.

Now, note that because i+1i+1 is not a power of 2,

Also note that because j<2z+2z11j<2^{z}+2^{z-1}-1, and jj is not a power of 2,

The result follows by noting that hi=βν(i+1)h_{i}=\beta_{\nu(i+1)} and hj=βν(j+1)h_{j}=\beta_{\nu(j+1)}, so that

The following lemma will be the base case for the subsequent inductive proof and itself requires nontrivial calculations.

Suppose 0j<i<2k10\leq j<i<2^{k}-1 satisfy min{p(i),p(j)}=2\min\{p(i),p(j)\}=2. Then, Mi,j=0M_{i,j}=0.

If z(i)z(j)z(i)\neq z(j) then the result follows from Lemma E.7. So, assume that i=2z+ii=2^{z}+i^{\prime} and j=2z+jj=2^{z}+j^{\prime} where i,j<2zi^{\prime},j^{\prime}<2^{z}. Lemma E.17 shows that if z(i)>z(j)+1z(i^{\prime})>z(j^{\prime})+1, then Mi,j=0M_{i,j}=0. From now on, we will assume that z(i)z(j)+1z(i^{\prime})\leq z(j^{\prime})+1.

We consider three cases, either p(i)=p(j)=2p(i)=p(j)=2; p(i)=2p(i)=2 and p(j)>2p(j)>2, and p(i)>2p(i)>2 and p(j)=2p(j)=2.

In this case, i=2z+2r1i=2^{z}+2^{r}-1 and j=2z+2t1j=2^{z}+2^{t}-1 for some t<r<zt<r<z. By the assumption that j<ij<i and that z(i)>z(j+1)z(i^{\prime})>z(j^{\prime}+1), we have that t=r1t=r-1. Now, considering the support of λ\lambda, we have that

Combining these values shows that Mi,j=0M_{i,j}=0 (see Mathematica proof E.8).

Let i=2z+2r1i=2^{z}+2^{r}-1, and let j=2z+jj=2^{z}+j^{\prime}, where 2r11<j<2r12^{r-1}-1<j^{\prime}<2^{r}-1. We begin by noting

We break the remainder of the proof into two cases: where j=2r2a1j^{\prime}=2^{r}-2^{a}-1 for some a0a\geq 0 or where 2r2a1<j<2r2a112^{r}-2^{a}-1<j^{\prime}<2^{r}-2^{a-1}-1 for some a1a\geq 1.

Combining these expressions show (see Mathematica proof E.9)

Now, we consider the other subcase, in which, for some a1a\geq 1,

so that this sum is given by (E.3). Next, we have

Combining these expressions shows (see Mathematica proof E.10)

Let j=2z+2r1j=2^{z}+2^{r}-1. There are two subcases: where z(i)=r+1z(i^{\prime})=r+1 and that where z(i)=rz(i^{\prime})=r.

If z(i)=r+1z(i^{\prime})=r+1, then i=2z+2r+1+ii=2^{z}+2^{r+1}+i^{\prime\prime}, where i>0i^{\prime\prime}>0 by the assumption that p(i)>2p(i)>2. In this case,

Combining this expression with the identities

Now, consider the case in which z(i)=rz(i^{\prime})=r. In this case, we have

It can be seen by Lemma D.6 that Si{0,,j}={j,2z1}S_{i}^{-}\cap\{0,\dots,j\}=\{j,2^{z}-1\}. That is,

One may check (see Mathematica proof E.11) that the two expressions coincide when r=z1r=z-1. Thus, we may take the first expression in all cases.

Combining the previous expressions, we have that

Suppose 0j<i<2k10\leq j<i<2^{k}-1, where neither i+1i+1 nor j+1j+1 is a power of 22. Then, Mi,j=0M_{i,j}=0.

By Lemma E.7, if z(i)z(j)z(i)\neq z(j), then Mi,j=0M_{i,j}=0. We may thus assume z(i)=z(j)<kz(i)=z(j)<k and let zz denote their common value.

We will show the result by induction on min{p(i),p(j)}\min\{p(i),p(j)\}. The base case, where min{p(i),p(j)}=2\min\{p(i),p(j)\}=2 is shown in Lemma E.18. In the remainder, we assume that p(i),p(j)3p(i),p(j)\geq 3.

Now, set i=i2zi^{\prime}=i-2^{z} and j=j2zj^{\prime}=j-2^{z}. We see that p(i)=p(i)1p(i^{\prime})=p(i)-1 and p(j)=p(j)1p(j^{\prime})=p(j)-1 by considering the binary expansion of i+1i+1 and j+1j+1. Thus, neither i+1i^{\prime}+1 nor j+1j^{\prime}+1 is a power of two.

We show the claim directly if z(i)z(j)z(i^{\prime})\neq z(j^{\prime}). In this case, it holds that z(j)<z(i)z(j^{\prime})<z(i^{\prime}). In the sum

Here, the expression for λ2z+11,j\lambda_{2^{z+1}-1,j} follows from the fact that z(j)<zz(j^{\prime})<z.

Now, assume that z(i)=z(j)z(i^{\prime})=z(j^{\prime}) and denote their common value by zz^{\prime}. By the inductive hypothesis, we may assume that

We now wish to compare Mi,jM_{i,j} to Mi,jM_{i^{\prime},j^{\prime}}. For this, we divide the summation defining Mi,jM_{i,j} into parts and then rearrange:

Note that h2z+i=hih_{2^{z}+i^{\prime}}=h_{i^{\prime}}, since i+1i^{\prime}+1 is not a power of 2, and ν(2z+i+1)=ν(i+1)\nu(2^{z}+i^{\prime}+1)=\nu(i^{\prime}+1). Similarly, h2z+j=hjh_{2^{z}+j^{\prime}}=h_{j^{\prime}}. We finally recall Lemma B.1, which shows that this expression is the same as

Here, there are two cases: either z=z1z^{\prime}=z-1 or it does not.

Here, we use the fact that hi=hi=βν(i+1)h_{i}=h_{i^{\prime}}=\beta_{\nu(i+1)} and similarly hj=hj=βν(j+1)h_{j}=h_{j^{\prime}}=\beta_{\nu(j+1)}. The first term in parentheses is zero by the definition of σ\sigma.

The second term in parantheses is also zero upon plugging in the various values of λ\lambda (see Mathematica proof E.13):

The second term is identically zero due to the identities (see Mathematica proof E.14)

It remains to show that the first term is also zero. Let r=2z+11jr=2^{z^{\prime}+1}-1-j^{\prime}. We have that 1r<2z1\leq r<2^{z^{\prime}} and that τ=z(r1)[0,z1]\tau=z(r-1)\in[0,z^{\prime}-1]. There are two final cases: where r=2τr=2^{\tau} and where 2τ<r<2τ+12^{\tau}<r<2^{\tau+1}.

In the first case, we additionally have that ν(j+1)=τ\nu(j+1)=\tau and

In both cases, plugging the relevant λ\lambda values into the first term in parentheses shows that it is equal to zero. See Mathematica proof E.15 and Mathematica proof E.16.∎

E.4 Off-Diagonal Entries in Case (2,2)22(2,2)

Our goal for this subsection will be to prove Lemma E.8, stating that the off-diagonal entries of Mi,jM_{i,j} with i,j<2k1i,j<2^{k}-1 are all 0. We start with a simplifying lemma:

Letting h=h(k)h=\mathfrak{h}^{(k)} and λ=λ(k)\lambda=\lambda^{(k)}, and fix j<i2k1j<i\leq 2^{k}-1 so that neither i+1i+1 nor j+1j+1 are powers of 2. If z(i)z(j)z(i)\neq z(j), then M(h,λ)rev(i),rev(j)=12ϕrev(i)ϕrev(i)M(h,\lambda)_{\operatorname{rev}(i),\operatorname{rev}(j)}=\frac{1}{2}\phi_{\operatorname{rev}(i)}\phi_{\operatorname{rev}(i)}.

Here, we use the fact that the series is telescoping to simplify the computation. ∎

The following lemma will be the base case for a subsequent inductive proof.

Here, in the second line, we use Lemma D.13 to simplify the first summation. The second summation uses Lemma B.5 to write p(rev(2z+2t1))=ktp(\operatorname{rev}(2^{z}+2^{t}-1))=k-t (see Mathematica proof E.17).

The only possible nonzero term in the second summation occurs in row rev(2z1)\operatorname{rev}(2^{z}-1). Thus, the second summation is equal to

The only possible nonzero term in the second summation is

We deduce that regardless of whether z=z+1z=z^{\prime}+1, that

It remains to show that the two square-bracketed terms in (E.4) are zero.

This is identically zero as can be seen in Mathematica proof E.21.

This is identically zero as can be seen in Mathematica proof E.22.

Substituting this expression shows that the term in parentheses is zero (see Mathematica proof E.23).

By the inductive hypothesis, it holds that

We deal with the first square-bracketed term above. Applying Lemma B.2 and the identity we get from the inductive hypothesis, we may simplify this term to

There are two cases for the second term: either z=z1z^{\prime}=z-1 or z<z1z^{\prime}<z-1.

Suppose z=z1z^{\prime}=z-1. Then, the second term is

Now, suppose z<z1z^{\prime}<z-1. Then, the second term is