Recurrent Neural Networks as Weighted Language Recognizers

Yining Chen, Sorcha Gilroy, Andreas Maletti, Jonathan May, Kevin Knight

Introduction

Recurrent neural networks (RNNs) are an attractive apparatus for probabilistic language modeling Mikolov and Zweig (2012). Recent experiments show that RNNs significantly outperform other methods in assigning high probability to held-out English text Jozefowicz et al. (2016).

Roughly speaking, an RNN works as follows. At each time step, it consumes one input token, updates its hidden state vector, and predicts the next token by generating a probability distribution over all permissible tokens. The probability of an input string is simply obtained as the product of the predictions of the tokens constituting the string followed by a terminating token. In this manner, each RNN defines a weighted language; i.e. a total function from strings to weights. Siegelmann and Sontag (1995) showed that single-layer rational-weight RNNs with saturated linear activation can compute any computable function. To this end, a specific architecture with 886 hidden units can simulate any Turing machine in real-time (i.e., each Turing machine step is simulated in a single time step). However, their RNN encodes the whole input in its internal state, performs the actual computation of the Turing machine when reading the terminating token, and then encodes the output (provided an output is produced) in a particular hidden unit. In this way, their RNN allows “thinking” time (equivalent to the computation time of the Turing machine) after the input has been encoded.

We consider a different variant of RNNs that is commonly used in natural language processing applications. It uses ReLU activations, consumes an input token at each time step, and produces softmax predictions for the next token. It thus immediately halts after reading the last input token and the weight assigned to the input is simply the product of the input token predictions in each step.

Other formal models that are currently used to implement probabilistic language models such as finite-state automata and context-free grammars are by now well-understood. A fair share of their utility directly derives from their nice algorithmic properties. For example, the weighted languages computed by weighted finite-state automata are closed under intersection (pointwise product) and union (pointwise sum), and the corresponding unweighted languages are closed under intersection, union, difference, and complementation Droste et al. (2013). Moreover, toolkits like OpenFST Allauzen et al. (2007) and Carmelhttps://www.isi.edu/licensed-sw/carmel/ implement efficient algorithms on automata like minimization, intersection, finding the highest-weighted path and the highest-weighted string.

RNN practitioners naturally face many of these same problems. For example, an RNN-based machine translation system should extract the highest-weighted output string (i.e., the most likely translation) generated by an RNN, Sutskever et al. (2014); Bahdanau et al. (2014). Currently this task is solved by approximation techniques like heuristic greedy and beam searches. To facilitate the deployment of large RNNs onto limited memory devices (like mobile phones) minimization techniques would be beneficial. Again currently only heuristic approaches like knowledge distillation Kim and Rush (2016) are available. Meanwhile, it is unclear whether we can determine if the computed weighted language is consistent; i.e., if it is a probability distribution on the set of all strings. Without a determination of the overall probability mass assigned to all finite strings, a fair comparison of language models with regard to perplexity is simply impossible.

The goal of this paper is to study the above problems for the mentioned ReLU-variant of RNNs. More specifically, we ask and answer the following questions:

Consistency: Do RNNs compute consistent weighted languages? Is the consistency of the computed weighted language decidable?

Highest-weighted string: Can we (efficiently) determine the highest-weighted string in a computed weighted language?

Equivalence: Can we decide whether two given RNNs compute the same weighted language?

Minimization: Can we minimize the number of neurons for a given RNN?

Definitions and notations

A single-layer RNN RR is a 77-tuple Σ,N,h1,W,W,E,E\langle\Sigma,N,h_{-1},W,W^{\prime},E,E^{\prime}\rangle, in which

RNNs act in discrete time steps reading a single letter at each step. We now define the semantics of our RNNs.

Let R=Σ,N,h1,W,W,E,ER=\langle\Sigma,N,h_{-1},W,W^{\prime},E,E^{\prime}\rangle be an RNN, ss an input string of length nn and 0tn0\leq t\leq n a time step. We define

where hs,1=h1h_{s,-1}=h_{-1} and we use standard matrix product and point-wise vector addition,

In other words, each component hs,t(n)h_{s,t}(n) of the hidden state vector is the ReLU activation applied to a linear combination of all the components of the previous hidden state vector hs,t1h_{s,t-1} together with a summand WstW^{\prime}_{s_{t}} that depends on the tt-th input letter sts_{t}. Thus, we often specify hs,t(n)h_{s,t}(n) as linear combination instead of specifying the matrix WW and the vectors WaW^{\prime}_{a}. The semantics is then obtained by predicting the letters s1,,sns_{1},\dotsc,s_{n} of the input ss and the final terminator \$ and multiplying the probabilities of the individual predictions.

E(\,\cdot)=(M+1,\,-(M+1))andandE(a,\cdot)=(1,\,-1)$ and

E^{\prime}(\)=-MandandE^{\prime}(a)=0$.

In this case, we obtain the linear combinations

computing the next hidden state components. Given the initial activation, we thus obtain hs,t=σt,t1h_{s,t}=\sigma\langle t,t-1\rangle. Using this information, we obtain

Consequently, we assign weight eM1+eM\tfrac{e^{-M}}{1+e^{-M}} to input ε\varepsilon, weight 11+eMe1e1+e1\tfrac{1}{1+e^{-M}}\cdot\tfrac{e^{1}}{e^{1}+e^{1}} to aa, and, more generally, weight 11+eM12n\tfrac{1}{1+e^{-M}}\cdot\tfrac{1}{2^{n}} to ana^{n}.

Clearly the weight assigned by an RNN is always in the interval (0,1)(0,1), which enables a probabilistic view. Similar to weighted finite-state automata or weighted context-free grammars, each RNN is a compact, finite representation of a weighted language. The softmax-operation enforces that the probability is impossible as assigned weight, so each input string is principally possible. In practical language modeling, smoothing methods are used to change distributions such that impossibility (probability ) is removed. Our RNNs avoid impossibility outright, so this can be considered a feature instead of a disadvantage.

The hidden state hs,th_{s,t} of an RNN can be used as scratch space for computation. For example, with a single neuron nn we can count input symbols in ss via:

Here the letter-dependent summand WaW^{\prime}_{a} is universally 11. Similarly, for an alphabet Σ={a1,,am}\Sigma=\{a_{1},\dotsc,a_{m}\} we can use the method of Siegelmann and Sontag (1995) to encode the complete input string ss in base m+1m+1 using:

where c\colon\Sigma_{\}\to\{0,\dotsc,m\}isabijection.Inprinciple,wecanthusstoretheentireinputstring(ofunboundedlength)inthehiddenstatevalueis a bijection. In principle, we can thus store the entire input string (of unbounded length) in the hidden state valueh_{s,t}(n),butourRNNmodeloutputsweightsateachstepandterminatesimmediatelyoncethefinaldelimiter, but our RNN model outputs weights at each step and terminates immediately once the final delimiter\$isread.Itmustassignaprobabilitytoastringincrementallyusingthechainruledecompositionis read. It must assign a probability to a string incrementally using the chain rule decompositionp(s_{1}\dotsm s_{n})=p(s_{1})\cdot\ldots\cdot p(s_{n}\mid s_{1}\dotsm s_{n-1})$.

Let us illustrate our notion of RNNs on some additional examples. They all use the alphabet Σ={a}\Sigma=\{a\} and are illustrated and formally specified in Figure 1. The first column shows an RNN R1R_{1} that assigns R1(an)=2(n+1)R_{1}(a^{n})=2^{-(n+1)}. The next-token prediction matrix ensures equal values for aa and \ateverytimestep.ThesecondcolumnshowstheRNNat every time step. The second column shows the RNNR_{2},whichwealreadydiscussed.Inthebeginning,itheavilybiasesthenextsymbolpredictiontowards, which we already discussed. In the beginning, it heavily biases the next symbol prediction towardsa,butcountersitstartingat, but counters it starting att=1.ThethirdRNN. The third RNNR_{3}usesanothercountingmechanismwithuses another counting mechanism withh_{s,t}=\sigma\langle t-100,t-101,t\rangle.ThefirsttwocomponentsareReLUthresholdedtozerountil. The first two components are ReLU-thresholded to zero untilt>101,atwhichpointtheyoverwhelmthebiastowards, at which point they overwhelm the bias towardsaturningallfuturepredictionstoturning all future predictions to\$$.

Consistency

We first investigate the consistency problem for an RNN RR, which asks whether the recognized weighted language RR is indeed a probability distribution. Consequently, an RNN RR is consistent if sΣR(s)=1\sum_{s\in\Sigma^{*}}R(s)=1. We first show that there is an inconsistent RNN, which together with our examples shows that consistency is a nontrivial property of RNNs. For comparison, all probabilistic finite-state automata are consistent, provided no transitions exit final states. Not all probabilistic context-free grammars are consistent; necessary and sufficient conditions for consistency are given by Booth and Thompson (1973). However, probabilistic context-free grammars obtained by training on a finite corpus using popular methods (such as expectation-maximization) are guaranteed to be consistent Nederhof and Satta (2006).

We immediately use a slightly more complex example, which we will later reuse.

with the single-letter alphabet Σ={a}\Sigma=\{a\}, the neurons {1,2,3,n,n}N\{1,2,3,n,n^{\prime}\}\subseteq N, initial activation h1(i)=0h_{-1}(i)=0 for all i{1,2,3,n,n}i\in\{1,2,3,n,n^{\prime}\}, and the following linear combinations:

Then hs,t(1)=0h_{s,t}(1)=0 for all tTt\leq T and hs,t(1)=1h_{s,t}(1)=1 otherwise. In addition, we have hs,t(2)=t+1h_{s,t}(2)=t+1 and hs,t(3)=σ3(tT1)h_{s,t}(3)=\sigma\langle 3(t-T-1)\rangle. Hence we have

of predicting \increasesovertimeandeventually(forincreases over time and eventually (fort\gg 3T)faroutweighstheprobabilityofpredicting) far outweighs the probability of predictinga$. Consequently, in this case the RNN is consistent (see Lemma 16 in the appendix).

We have seen in the previous example that consistency is not trivial for RNNs, which takes us to the consistency problem for RNNs:

Given an RNN RR, return “yes” if RR is consistent and “no” otherwise.

We recall the following theorem, which, combined with our example, will prove that consistency is unfortunately undecidable for RNNs.

Let MM be an arbitrary deterministic Turing machine. There exists an RNN

with saturated linear activation, input alphabet Σ={a}\Sigma=\{a\}, and 11 designated neuron nNn\in N such that for all sΣs\in\Sigma^{*} and 0ts0\leq t\leq\lvert s\rvert

hs,t(n)=0h_{s,t}(n)=0 if MM does not halt on ε\varepsilon, and

if MM does halt on empty input after TT steps, then

for all n,nNn,n^{\prime}\in N and a\in\Sigma\cup\{\\},where, wheren_{1}andandn_{2}arethetwoneuronscorrespondingtoare the two neurons corresponding tonandandn^{\prime}_{1}andandn^{\prime}_{2}arethetwoneuronscorrespondingtoare the two neurons corresponding ton^{\prime}$ (see Lemma 17 in the appendix).

Let MM be an arbitrary deterministic Turing machine. There exists an RNN

with input alphabet Σ={a}\Sigma=\{a\} and 22 designated neurons n1,n2Nn_{1},n_{2}\in N such that for all sΣs\in\Sigma^{*} and 0ts0\leq t\leq\lvert s\rvert

hs,t(n1)hs,t(n2)=0h_{s,t}(n_{1})-h_{s,t}(n_{2})=0 if MM does not halt on ε\varepsilon, and

if MM does halt on empty input after TT steps, then

We can now use this corollary together with the RNN RR of Example 3 to show that the consistency problem is undecidable. To this end, we simulate a given Turing machine MM and identify the two designated neurons of Corollary 5 as nn and nn^{\prime} in Example 3. It follows that MM halts if and only if RR is consistent. Hence we reduced the undecidable halting problem to the consistency problem, which shows the undecidability of the consistency problem.

The consistency problem for RNNs is undecidable.

As mentioned in Footnote 2, probabilistic context-free grammars obtained after training on a finite corpus using the most popular methods are guaranteed to be consistent. At least for 2-layer RNNs this does not hold.

A two-layer RNN trained to a local optimum using Back-propagation-through-time (BPTT) on a finite corpus is not necessarily consistent.

The first layer of the RNN RR with a single alphabet symbol aa uses one neuron nn^{\prime} and has the following behavior:

The second layer uses neuron nn and takes hs,t(n)h_{s,t}(n^{\prime}) as input at time tt:

Let the training data be {a}\{a\}. Then the objective we wish to maximize is simply R(a)R(a). The derivative of this objective with respect to each parameter is , so applying gradient descent updates does not change any of the parameters and we have converged to an inconsistent RNN. ∎

It remains an open question whether there is a single-layer RNN that also exhibits this behavior.

Highest-weighted string

For deterministic probabilistic finite-state automata or context-free grammars only one path or derivation exists for any given string, so the identification of the highest-weighted string is the same task as the identification of the most probable path or derivation. However, for nondeterministic devices, the highest-weighted string is often harder to identify, since the weight of a string is the sum of the probabilities of all possible paths or derivations for that string. A comparison of the difficulty of identifying the most probable derivation and the highest-weighted string for various models is summarized in Table 1, in which we marked our results in bold face.

We present various results concerning the difficulty of identifying the highest-weighted string in a weighted language computed by an RNN. We also summarize some available algorithms. We start with the formal presentation of the three studied problems.

Best string: Given an RNN RR and c(0,1)c\in(0,1), does there exist sΣs\in\Sigma^{*} with R(s)>cR(s)>c?

Consistent best string: Given a consistent RNN RR and c(0,1)c\in(0,1), does there exist sΣs\in\Sigma^{*} with R(s)>cR(s)>c?

As usual the corresponding optimization problems are not significantly simpler than these decision problems. Unfortunately, the general problem is also undecidable, which can easily be shown using our example.

The best string problem for RNNs is undecidable.

using Lemma 14 in the appendix. Consequently, a string with weight above 0.120.12 exists if and only if MM halts, so the best string problem is also undecidable. ∎

If we restrict the RNNs to be consistent, then we can easily decide the best string problem by simple enumeration.

The consistent best string problem for RNNs is decidable.

Since RR is consistent, limiSi=1\lim_{i\to\infty}S_{i}=1, so this algorithm is guaranteed to terminate and it obviously decides the problem. ∎

Next, we investigate the length wRmax\lvert w_{R}^{\text{max}}\rvert of the shortest string wRmaxw_{R}^{\text{max}} of maximal weight in the weighted language RR generated by a consistent RNN RR in terms of its (binary storage) size R\lvert R\rvert. As already mentioned by Siegelmann and Sontag (1995) and evidenced here, only small precision rational numbers are needed in our constructions, so we assume that RcN2\lvert R\rvert\leq c\cdot\lvert N\rvert^{2} for a (reasonably small) constant cc, where NN is the set of neurons of RR. We show that no computable bound on the length of the best string can exist, so its length can surpass all reasonable bounds.

In the previous section (before Theorem 6) we presented an RNN RMR_{M} that simulates an arbitrary (single-track) Turing machine MM with nn states. By Siegelmann and Sontag (1995) we have RMc(4n+16)\lvert R_{M}\rvert\leq c\cdot(4n+16). Moreover, we observed that this RNN RMR_{M} is consistent if and only if the Turing machine MM halts on empty input. In the proof of Theorem 8 we have additionally seen that the length wRmax\lvert w_{R}^{\text{max}}\rvert of its best string exceeds the number TMT_{M} of steps required to halt.

so ff clearly cannot be computable and no computable function gg can provide bounds for ff. ∎

Finally, we investigate the difficulty of the best string problem for consistent RNN restricted to solutions of polynomial length.

Identifying the best string of polynomial length in a consistent RNN is NP-complete and APX-hard.

Proof sketch. Clearly, we can guess an input string of polynomial length, run the RNN, and verify whether its weight exceeds the given bound in polynomial time. Therefore the problem is trivially in NP. For NP-hardness, we reduce from the 0-1 Integer Linear Programming Feasibility Problem:

Suppose we are given an instance of the above problem. We construct an instance of the consistent best string of polynomial length problem with input R,c\langle R,c\rangle. Our construction ensures that the only length at which a string can have weight greater than cc is nn. Thus, if there is any string whose weight is greater than cc, the given instance of 0-1 Integer Linear Programming Problem is feasible; otherwise it is not.

Our reduction is a Polynomial-Time Approximation Scheme (PTAS) reduction and preserves approximability. Since 0-1 Integer Linear Programming Feasibility is NP-complete and the corresponding maximization problem is APX-complete, consistent best string of polynomial length is NP-complete and APX-hard, meaning there is no PTAS to find the best string bounded by polynomial length (i.e. the best we can hope for in polynomial time is a constant-factor approximation algorithm) unless P == NP.

If we assume that the solution length is bounded by some finite number, we can convert algorithms from de la Higuera and Oncina (2013) for computing the most probable string in PFSAs for use in RNNs. Such algorithms would be similar to beam search Lowerre (1976) used most widely in practice.

Equivalence

We prove that equivalence of two RNNs is undecidable. For comparison, equivalence of two deterministic WFSAs can be tested in time O(Σ(QA+QB)3)O(|\Sigma|(|Q_{A}|+|Q_{B}|)^{3}), where QA|Q_{A}|, QB|Q_{B}| are the number of states of the two WFSAs and Σ|\Sigma| is the size of the alphabet Cortes et al. (2007); equivalence of nondeterministic WFSAs are undecidable Griffiths (1968). The decidability of language equivalence for deterministic probabilistic push-downtown automata (PPDA) is still open Forejt et al. (2014), although equivalence for deterministic unweighted push-downtown automata (PDA) is decidable Sénizergues (1997).

The equivalence problem is formulated as follows:

Given two RNNs RR and RR^{\prime}, return “yes” if R(s)=R(s)R(s)=R^{\prime}(s) for all sΣs\in\Sigma^{*}, and “no” otherwise.

The equivalence problem for RNNs is undecidable.

Minimization

We look next at minimization of RNNs. For comparison, state-minimization of a deterministic PFSA is O(ElogQ)O(|E|\log{|Q|}) where E|E| is the number of transitions and Q|Q| is the number of states Aho et al. (1974). Minimization of a non-deterministic PFSA is PSPACE-complete Jiang and Ravikumar (1993).

We focus on minimizing the number of hidden neurons (N|N|) in RNNs:

Given RNN RR and non-negative integer nn, return “yes” if \exists RNN RR^{\prime} with number of hidden units Nn|N^{\prime}|\leq n such that R(s)=R(s)R(s)=R^{\prime}(s) for all sΣs\in\Sigma^{*}, and “no” otherwise.

We reduce from the Halting Problem. Suppose Turing Machine MM decides the minimization problem. For any Turing Machine MM^{\prime}, construct the same RNN RR as in Theorem 12. We run MM on input R,0\langle R,0\rangle. Note that an RNN with no hidden unit can only output constant Es,tE^{\prime}_{s,t} for all tt. Therefore the number of hidden units in RR can be minimized to if and only if it always outputs E^{\prime}_{s,t}(a)=E^{\prime}_{s,t}(\)=1/2.If. IfMreturnsyes,returns “yes”,M^{\prime}doesnothaltondoes not halt on\epsilon$, else it halts. Therefore minimization is undecidable. ∎

Conclusion

We proved the following hardness results regarding RNN as a recognizer of weighted languages:

Finding the highest-weighted string for an arbitrary RNN is undecidable.

Finding the highest-weighted string for a consistent RNN is decidable, but the solution length can surpass all computable bounds.

Restricting to solutions of polynomial length, finding the highest-weighted string is NP-complete and APX-hard.

Testing equivalence of RNNs and minimizing the number of neurons in an RNN are both undecidable.

Although our undecidability results are upshots of the Turing-completeness of RNN Siegelmann and Sontag (1995), our NP-completeness and APX-hardness results are original, and surprising, since the analogous hardness results in PFSA relies on the fact that there are multiple derivations for a single string Casacuberta and de la Higuera (2000). The fact that these results hold for the relatively simple RNNs we used in this paper suggests that the case would be the same for more complicated models used in NLP, such as long short term memory networks (LSTMs; Hochreiter and Schmidhuber 1997).

Our results show the non-existence of (efficient) algorithms for interesting problems that researchers using RNN in natural language processing tasks may have hoped to find. On the other hand, the non-existence of such efficient or exact algorithms gives evidence for the necessity of approximation, greedy or heuristic algorithms to solve those problems in practice. In particular, since finding the highest-weighted string in RNN is the same as finding the most-likely translation in a sequence-to-sequence RNN decoder, our NP-completeness and APX-hardness results provide some justification for employing greedy and beam search algorithms in practice.

Acknowledgments

This work was supported by DARPA (W911NF-15-1-0543 and HR0011-15-C-0115). Andreas Maletti was financially supported by DFG Graduiertenkolleg 1763 (QuantLA).

References

Appendix

Identifying the best string of polynomial length in a consistent RNN is NP-complete and APX-hard.

Clearly, we can guess an input string of polynomial length, run the RNN, and verify whether its weight exceeds the given bound in polynomial time. Therefore the problem is trivially in NP. For NP-hardness we now reduce from the 0-1 Integer Linear Programming Feasibility Problem to our problem:

Suppose we are given an instance of the above problem. Construct an instance of the consistent best string with polynomial length problem with input R,c\langle R,c\rangle, where:

Let d=i=1k(hs,t(gi)hs,t(li))d=\sum_{i=1}^{k}{(h_{s,t}(g_{i})-h_{s,t}(l_{i}))}, δ2=1k+2\delta_{2}=\frac{1}{k+2}. We pick a big enough positive rational number β\beta so that if we define δ1=12eβ+1\delta_{1}=\frac{1}{2e^{\beta}+1},

When t=n+1t=n+1, one can verify that we can set

since the range of dd is a finite set of values {0,1,2,,k}\{0,1,2,\dots,k\}.

c=(1δ12)n(kδ2)c=(\frac{1-\delta_{1}}{2})^{n}(k\delta_{2})

so we can pick β\beta such that its length written in binary

is logarithmic in nn and kk. So the weights in matrices E,EE,E^{\prime} that produce β\beta are polynomial in nn and kk. Same is true for the weights that produce γd\gamma_{d}. cc written in binary has length

which is polynomial in nn and kk. So our construction is polynomial.

We now prove that if we can solve the R,c\langle R,c\rangle-instance of consistent best string of polynomial length in polynomial time, we can also solve the given instance of 0-1 Integer Linear Programming Feasibility in polynomial time.

By our design, at time 1tn1\leq t\leq n, RR reads a binary string x{0,1}nx\in\{0,1\}^{n} into neurons 1,2,,n1,2,\dots,n while predicting almost half-half probability for either 0 or 1 and infinitesimal probability δ1\delta_{1} for termination. Therefore no string with length less then nn has weight greater than cc.

At time t=n+1t=n+1, since j=1nAijhs,t1(j)Bi\sum_{j=1}^{n}{A_{ij}h_{s,t-1}(j)}-B_{i} is an integer, hs,t(gi)hs,t(li)h_{s,t}(g_{i})-h_{s,t}(l_{i}) is the indicator for whether the ii-th constraint is satisfied:

Therefore dd is the total number of clauses satisfied by a given setting of x=(x1,x2,,xn)x=(x_{1},x_{2},\dots,x_{n}) (0dk0\leq d\leq k). The termination probability at t=n+1t=n+1 is (d+1)δ2=d+1k+2(d+1)\delta_{2}=\frac{d+1}{k+2}. If all kk clauses are satisfied, this setting of xx would have termination probability 1δ21-\delta_{2} and therefore weight (1δ12)n(1δ2)>c(\frac{1-\delta_{1}}{2})^{n}(1-\delta_{2})>c. If fewer than kk clauses are satisfied, xx would have weight at most cc.

When tn+2t\geq n+2, RR continues to assign almost half-half probability for either 0 or 1 and infinitesimal probability for termination. Therefore any string of length greater than n+1n+1 has a weight smaller than ϵ\epsilon. From that point on the output vector is constant, so the RNN is consistent. Notice that the weights of strings monotonically decrease with length except for at length nn.

Therefore our construction ensures that the only length at which a string can have weight greater than cc is nn. Thus, if there is any string whose weight is greater than cc, the given instance of 0-1 Integer Linear Programming Problem is feasible; otherwise it is not.

Define the maximum number of clauses satisfied by all assignments of x{0,1}nx\in\{0,1\}^{n}:

By our construction, when dmax1d_{max}\geq 1, the highest-weighted string will occur at length nn, and has weight (1δ12)n(dmax+1)δ2=(1δ12)ndmax+1k+2(\frac{1-\delta_{1}}{2})^{n}(d_{max}+1)\delta_{2}=(\frac{1-\delta_{1}}{2})^{n}\frac{d_{max}+1}{k+2} which is proportional to dmax+1d_{max}+1. The empty string has the highest weight among all strings of length not equal to nn. Its weight is δ1<(1δ12)nδ2\delta_{1}<(\frac{1-\delta_{1}}{2})^{n}\delta_{2} which will always be less than the weight of any length-nn string corresponding to a setting of variables satisfying at least 1 constraint ((1δ12)n(2δ2)\geq(\frac{1-\delta_{1}}{2})^{n}(2\delta_{2})).

Therefore, given any rational number ζ=η1η2>0\zeta=\frac{\eta_{1}}{\eta_{2}}>0 (η1,η2[k]\eta_{1},\eta_{2}\in[k]), define δ(ζ)=η1η2+1\delta(\zeta)=\frac{\eta_{1}}{\eta_{2}+1}. If binary string ss is a (1δ(ζ))(1-\delta(\zeta))-approximation to consistent best string with polynomial length, then reading ss as a vector of nn variables x=(x1,x2,,xn)x=(x_{1},x_{2},\dots,x_{n}), xx would be a (1ζ)(1-\zeta)-approximation to 0-1 Integer Linear Programming Maximum Satisfiability (the optimization version of the problem, i.e., finding a setting of variables to satisfy the greatest number of constraints).

Thus our reduction is PTAS-reduction and preserves approximability. Since 0-1 Integer Linear Programming Feasibility is NP-complete and APX-complete, consistent best string of polynomial length is NP-complete and APX-hard, meaning there is no Polynomial-Time Approximation Scheme to find the best string bounded by polynomial length (i.e. the best we can hope for in polynomial time is a constant-factor approximation algorithm) unless P == NP. ∎

where (1;ek)(-1;e^{-k})_{\infty} is the infinite eke^{-k}-Pochhammer symbol.

where the final equality utilizes Lemma 14. ∎

We set hs,1(n)=h1(n)h_{s,-1}(n)=h_{-1}(n) for all nNn\in N and hs,1(n)=h1(n)h^{\prime}_{s,-1}(n^{\prime})=h^{\prime}_{-1}(n^{\prime}) for all nNn^{\prime}\in N^{\prime}. Then trivially hs,1(n1)hs,1(n2)=h1(n)0=hs,1(n)h^{\prime}_{s,-1}(n_{1})-h^{\prime}_{s,-1}(n_{2})=h_{-1}(n)-0=h_{s,-1}(n). Moreover,

Hence hs,t(n1)hs,t(n2)=hs,t(n)h^{\prime}_{s,t}(n_{1})-h^{\prime}_{s,t}(n_{2})=h_{s,t}(n) as required. ∎