Recurrent Neural Networks as Weighted Language Recognizers

Yining Chen, Sorcha Gilroy, Andreas Maletti, Jonathan May, Kevin Knight

Introduction

Recurrent neural networks (RNNs) are an attractive apparatus for probabilistic language modeling Mikolov and Zweig (2012). Recent experiments show that RNNs significantly outperform other methods in assigning high probability to held-out English text Jozefowicz et al. (2016).

Roughly speaking, an RNN works as follows. At each time step, it consumes one input token, updates its hidden state vector, and predicts the next token by generating a probability distribution over all permissible tokens. The probability of an input string is simply obtained as the product of the predictions of the tokens constituting the string followed by a terminating token. In this manner, each RNN defines a weighted language; i.e. a total function from strings to weights. Siegelmann and Sontag (1995) showed that single-layer rational-weight RNNs with saturated linear activation can compute any computable function. To this end, a specific architecture with 886 hidden units can simulate any Turing machine in real-time (i.e., each Turing machine step is simulated in a single time step). However, their RNN encodes the whole input in its internal state, performs the actual computation of the Turing machine when reading the terminating token, and then encodes the output (provided an output is produced) in a particular hidden unit. In this way, their RNN allows “thinking” time (equivalent to the computation time of the Turing machine) after the input has been encoded.

We consider a different variant of RNNs that is commonly used in natural language processing applications. It uses ReLU activations, consumes an input token at each time step, and produces softmax predictions for the next token. It thus immediately halts after reading the last input token and the weight assigned to the input is simply the product of the input token predictions in each step.
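The product-of-predictions view described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `next_token_probs` is a hypothetical callable standing in for one RNN step followed by softmax, and the string `"$"` plays the role of the terminating token.

```python
END = "$"  # terminating token (assumed spelling for this sketch)

def string_weight(next_token_probs, s):
    """Weight of s: the product of the predicted probabilities of its
    tokens, followed by the predicted probability of the terminator."""
    weight = 1.0
    for i, token in enumerate(s):
        # Prediction made after reading the prefix s[:i].
        weight *= next_token_probs(s[:i])[token]
    # The model halts immediately after predicting the terminator.
    return weight * next_token_probs(s)[END]

def uniform_unary(prefix):
    # Hypothetical toy model over the alphabet {a}: it always predicts
    # "a" and "$" with equal probability, whatever the prefix is.
    return {"a": 0.5, END: 0.5}
```

With the toy model, a string of $n$ letters receives weight $2^{-(n+1)}$, since each of the $n+1$ predictions contributes a factor of one half.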

Other formal models that are currently used to implement probabilistic language models, such as finite-state automata and context-free grammars, are by now well understood. A fair share of their utility directly derives from their nice algorithmic properties. For example, the weighted languages computed by weighted finite-state automata are closed under intersection (pointwise product) and union (pointwise sum), and the corresponding unweighted languages are closed under intersection, union, difference, and complementation Droste et al. (2013). Moreover, toolkits like OpenFST Allauzen et al. (2007) and Carmel (https://www.isi.edu/licensed-sw/carmel/) implement efficient algorithms on automata such as minimization, intersection, and finding the highest-weighted path and the highest-weighted string.

RNN practitioners naturally face many of these same problems. For example, an RNN-based machine translation system should extract the highest-weighted output string (i.e., the most likely translation) generated by an RNN Sutskever et al. (2014); Bahdanau et al. (2014). Currently this task is solved by approximation techniques like heuristic greedy and beam searches. To facilitate the deployment of large RNNs onto limited-memory devices (like mobile phones), minimization techniques would be beneficial. Again, currently only heuristic approaches like knowledge distillation Kim and Rush (2016) are available. Meanwhile, it is unclear whether we can determine if the computed weighted language is consistent; i.e., if it is a probability distribution on the set of all strings. Without a determination of the overall probability mass assigned to all finite strings, a fair comparison of language models with regard to perplexity is simply impossible.

The goal of this paper is to study the above problems for the mentioned ReLU-variant of RNNs. More specifically, we ask and answer the following questions:

Consistency: Do RNNs compute consistent weighted languages? Is the consistency of the computed weighted language decidable?

Highest-weighted string: Can we (efficiently) determine the highest-weighted string in a computed weighted language?

Equivalence: Can we decide whether two given RNNs compute the same weighted language?

Minimization: Can we minimize the number of neurons for a given RNN?

Definitions and notations

A single-layer RNN $R$ is a $7$ -tuple $\langle\Sigma,N,h_{-1},W,W^{\prime},E,E^{\prime}\rangle$ , in which

RNNs act in discrete time steps reading a single letter at each step. We now define the semantics of our RNNs.

Let $R=\langle\Sigma,N,h_{-1},W,W^{\prime},E,E^{\prime}\rangle$ be an RNN, $s$ an input string of length $n$ and $0\leq t\leq n$ a time step. We define

where $h_{s,-1}=h_{-1}$ and we use standard matrix product and point-wise vector addition,

In other words, each component $h_{s,t}(n)$ of the hidden state vector is the ReLU activation applied to a linear combination of all the components of the previous hidden state vector $h_{s,t-1}$ together with a summand $W^{\prime}_{s_{t}}$ that depends on the $t$-th input letter $s_{t}$. Thus, we often specify $h_{s,t}(n)$ as a linear combination instead of specifying the matrix $W$ and the vectors $W^{\prime}_{a}$. The semantics is then obtained by predicting the letters $s_{1},\dotsc,s_{n}$ of the input $s$ and the final terminator $\$$ and multiplying the probabilities of the individual predictions.

$E(\$,\cdot)=(M+1,\,-(M+1))$ and $E(a,\cdot)=(1,\,-1)$ and

$E^{\prime}(\$)=-M$ and $E^{\prime}(a)=0$.

In this case, we obtain the linear combinations

computing the next hidden state components. Given the initial activation, we thus obtain $h_{s,t}=\sigma\langle t,t-1\rangle$ . Using this information, we obtain

Consequently, we assign weight $\tfrac{e^{-M}}{1+e^{-M}}$ to input $\varepsilon$ , weight $\tfrac{1}{1+e^{-M}}\cdot\tfrac{e^{1}}{e^{1}+e^{1}}$ to $a$ , and, more generally, weight $\tfrac{1}{1+e^{-M}}\cdot\tfrac{1}{2^{n}}$ to $a^{n}$ .
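The weights stated for this example can be checked numerically. The sketch below takes the closed form $h_{s,t}=\sigma\langle t,t-1\rangle$ as given (rather than deriving it from $W$ and $W^{\prime}$) and applies the prediction parameters $E$ and $E^{\prime}$ listed above. The function names and the value chosen for $M$ in the usage note are assumptions made for illustration only.

```python
import math

def relu(x):
    return max(0.0, x)

def hidden_state(t):
    # Closed form stated in the text: h_{s,t} = sigma<t, t-1>.
    return relu(t), relu(t - 1)

def next_token_probs(t, M):
    """Softmax prediction at time step t for the unary example."""
    h1, h2 = hidden_state(t)
    score_end = (M + 1) * h1 - (M + 1) * h2 - M  # E($,.) h + E'($)
    score_a = h1 - h2                            # E(a,.) h + E'(a)
    z = math.exp(score_end) + math.exp(score_a)
    return {"$": math.exp(score_end) / z, "a": math.exp(score_a) / z}

def weight(n, M):
    """Weight assigned to the string a^n."""
    w = 1.0
    for t in range(n):
        w *= next_token_probs(t, M)["a"]
    return w * next_token_probs(n, M)["$"]
```

For instance, with the arbitrary choice `M = 5.0`, `weight(0, M)` agrees with $\tfrac{e^{-M}}{1+e^{-M}}$ and `weight(n, M)` agrees with $\tfrac{1}{1+e^{-M}}\cdot\tfrac{1}{2^{n}}$ for $n\geq 1$, up to floating-point error. Summing these weights over all $n$ gives $1$.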

Clearly the weight assigned by an RNN is always in the interval $(0,1)$, which enables a probabilistic view. Similar to weighted finite-state automata or weighted context-free grammars, each RNN is a compact, finite representation of a weighted language. The softmax operation ensures that probability $0$ is impossible as an assigned weight, so each input string is possible in principle. In practical language modeling, smoothing methods are used to change distributions such that impossibility (probability $0$) is removed. Our RNNs avoid impossibility outright, so this can be considered a feature instead of a disadvantage.

The hidden state $h_{s,t}$ of an RNN can be used as scratch space for computation. For example, with a single neuron $n$ we can count input symbols in $s$ via:

Here the letter-dependent summand $W^{\prime}_{a}$ is universally $1$ . Similarly, for an alphabet $\Sigma=\{a_{1},\dotsc,a_{m}\}$ we can use the method of Siegelmann and Sontag (1995) to encode the complete input string $s$ in base $m+1$ using:

where $c\colon\Sigma_{\$}\to\{0,\dotsc,m\}$ is a bijection. In principle, we can thus store the entire input string (of unbounded length) in the hidden state value $h_{s,t}(n)$, but our RNN model outputs weights at each step and terminates immediately once the final delimiter $\$$ is read. It must assign a probability to a string incrementally using the chain rule decomposition $p(s_{1}\dotsm s_{n})=p(s_{1})\cdot\ldots\cdot p(s_{n}\mid s_{1}\dotsm s_{n-1})$.
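The two scratch-space tricks can be imitated outside an RNN with plain integer arithmetic. In the sketch below, the counting update adds the constant summand $1$ at every step, as described above. The encoding update $h\mapsto\sigma\langle(m+1)\cdot h+c(s_{t})\rangle$ is an assumed reading of "base $m+1$", since the exact formula is not reproduced in this text. The code table `code` and the helper `decode` are hypothetical.

```python
def relu(x):
    return x if x > 0 else 0

def count_symbols(s):
    """Single neuron whose letter-dependent summand is always 1."""
    h = 0
    for _ in s:
        h = relu(h + 1)
    return h

def encode(s, code, m):
    """Assumed base-(m+1) update: h <- relu((m+1)*h + c(letter))."""
    h = 0
    for letter in s:
        h = relu((m + 1) * h + code[letter])
    return h

def decode(h, code, m):
    """Invert encode, assuming no input letter is mapped to digit 0."""
    letter_of = {digit: letter for letter, digit in code.items()}
    letters = []
    while h > 0:
        h, digit = divmod(h, m + 1)
        letters.append(letter_of[digit])
    return "".join(reversed(letters))
```

With the hypothetical table `code = {"$": 0, "a": 1, "b": 2}` and `m = 2`, the string `"abba"` is encoded as a single integer from which `decode` recovers the string. Reserving digit $0$ for the terminator is a choice made here so that leading letters are not lost.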

Let us illustrate our notion of RNNs on some additional examples. They all use the alphabet $\Sigma=\{a\}$ and are illustrated and formally specified in Figure 1. The first column shows an RNN $R_{1}$ that assigns $R_{1}(a^{n})=2^{-(n+1)}$. The next-token prediction matrix ensures equal values for $a$ and $\$$ at every time step. The second column shows the RNN $R_{2}$, which we already discussed. In the beginning, it heavily biases the next symbol prediction towards $a$, but counters it starting at $t=1$. The third RNN $R_{3}$ uses another counting mechanism with $h_{s,t}=\sigma\langle t-100,t-101,t\rangle$. The first two components are ReLU-thresholded to zero until $t>101$, at which point they overwhelm the bias towards $a$, turning all future predictions to $\$$.

Consistency

We first investigate the consistency problem for an RNN $R$ , which asks whether the recognized weighted language $R$ is indeed a probability distribution. Consequently, an RNN $R$ is consistent if $\sum_{s\in\Sigma^{*}}R(s)=1$ . We first show that there is an inconsistent RNN, which together with our examples shows that consistency is a nontrivial property of RNNs. For comparison, all probabilistic finite-state automata are consistent, provided no transitions exit final states. Not all probabilistic context-free grammars are consistent; necessary and sufficient conditions for consistency are given by Booth and Thompson (1973). However, probabilistic context-free grammars obtained by training on a finite corpus using popular methods (such as expectation-maximization) are guaranteed to be consistent Nederhof and Satta (2006).
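For a single-letter alphabet, the sum in the definition of consistency can be explored numerically, because the only strings are $a^{0},a^{1},a^{2},\dotsc$. The sketch below accumulates the mass of all strings up to a given length from a function `term_prob(t)` that returns the probability of predicting $\$$ at time step $t$. Both sample functions are hypothetical toy models chosen for illustration, not RNNs taken from this paper.

```python
import math

def mass_up_to(term_prob, max_len):
    """Sum of R(a^n) for n = 0..max_len over a unary alphabet."""
    total = 0.0
    survive = 1.0  # probability of having predicted only a's so far
    for t in range(max_len + 1):
        p = term_prob(t)
        total += survive * p     # weight of the string a^t
        survive *= 1.0 - p       # continue with another a
    return total

def fair(t):
    # Toy model: terminator and a are always equally likely.
    return 0.5

def fading(t):
    # Toy model: terminator score -(t+1) against score 0 for a, so
    # termination becomes exponentially less likely over time.
    return 1.0 / (1.0 + math.exp(t + 1))
```

For `fair` the accumulated mass approaches $1$ as `max_len` grows. For `fading` it stays well below $1$ however large `max_len` is chosen, because the termination probabilities shrink fast enough to leave positive mass on never terminating. A finite computation of this kind can only suggest inconsistency; it does not decide it.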

We immediately use a slightly more complex example, which we will later reuse.

with the single-letter alphabet $\Sigma=\{a\}$ , the neurons $\{1,2,3,n,n^{\prime}\}\subseteq N$ , initial activation $h_{-1}(i)=0$ for all $i\in\{1,2,3,n,n^{\prime}\}$ , and the following linear combinations:

Then $h_{s,t}(1)=0$ for all $t\leq T$ and $h_{s,t}(1)=1$ otherwise. In addition, we have $h_{s,t}(2)=t+1$ and $h_{s,t}(3)=\sigma\langle 3(t-T-1)\rangle$ . Hence we have

of predicting $\$$ increases over time and eventually (for $t\gg 3T$) far outweighs the probability of predicting $a$. Consequently, in this case the RNN is consistent (see Lemma 16 in the appendix).

We have seen in the previous example that consistency is not trivial for RNNs, which takes us to the consistency problem for RNNs:

Given an RNN $R$ , return “yes” if $R$ is consistent and “no” otherwise.

We recall the following theorem, which, combined with our example, will prove that consistency is unfortunately undecidable for RNNs.

Let $M$ be an arbitrary deterministic Turing machine. There exists an RNN

with saturated linear activation, input alphabet $\Sigma=\{a\}$ , and $1$ designated neuron $n\in N$ such that for all $s\in\Sigma^{*}$ and $0\leq t\leq\lvert s\rvert$

$h_{s,t}(n)=0$ if $M$ does not halt on $\varepsilon$ , and

if $M$ does halt on empty input after $T$ steps, then

for all $n,n^{\prime}\in N$ and $a\in\Sigma\cup\{\$\}$, where $n_{1}$ and $n_{2}$ are the two neurons corresponding to $n$ and $n^{\prime}_{1}$ and $n^{\prime}_{2}$ are the two neurons corresponding to $n^{\prime}$ (see Lemma 17 in the appendix).

Let $M$ be an arbitrary deterministic Turing machine. There exists an RNN

with input alphabet $\Sigma=\{a\}$ and $2$ designated neurons $n_{1},n_{2}\in N$ such that for all $s\in\Sigma^{*}$ and $0\leq t\leq\lvert s\rvert$

$h_{s,t}(n_{1})-h_{s,t}(n_{2})=0$ if $M$ does not halt on $\varepsilon$ , and

if $M$ does halt on empty input after $T$ steps, then

We can now use this corollary together with the RNN $R$ of Example 3 to show that the consistency problem is undecidable. To this end, we simulate a given Turing machine $M$ and identify the two designated neurons of Corollary 5 as $n$ and $n^{\prime}$ in Example 3. It follows that $M$ halts if and only if $R$ is consistent. Hence we reduced the undecidable halting problem to the consistency problem, which shows the undecidability of the consistency problem.

The consistency problem for RNNs is undecidable.

As mentioned above, probabilistic context-free grammars obtained after training on a finite corpus using the most popular methods are guaranteed to be consistent. At least for two-layer RNNs this does not hold.

A two-layer RNN trained to a local optimum using backpropagation through time (BPTT) on a finite corpus is not necessarily consistent.

The first layer of the RNN $R$ with a single alphabet symbol $a$ uses one neuron $n^{\prime}$ and has the following behavior:

The second layer uses neuron $n$ and takes $h_{s,t}(n^{\prime})$ as input at time $t$ :

Let the training data be $\{a\}$. Then the objective we wish to maximize is simply $R(a)$. The derivative of this objective with respect to each parameter is $0$, so applying gradient descent updates does not change any of the parameters and we have converged to an inconsistent RNN. ∎

It remains an open question whether there is a single-layer RNN that also exhibits this behavior.

Highest-weighted string

For deterministic probabilistic finite-state automata or context-free grammars only one path or derivation exists for any given string, so the identification of the highest-weighted string is the same task as the identification of the most probable path or derivation. However, for nondeterministic devices, the highest-weighted string is often harder to identify, since the weight of a string is the sum of the probabilities of all possible paths or derivations for that string. A comparison of the difficulty of identifying the most probable derivation and the highest-weighted string for various models is summarized in Table 1, in which we marked our results in bold face.

We present various results concerning the difficulty of identifying the highest-weighted string in a weighted language computed by an RNN. We also summarize some available algorithms. We start with the formal presentation of the three studied problems.

Best string: Given an RNN $R$ and $c\in(0,1)$ , does there exist $s\in\Sigma^{*}$ with $R(s)>c$ ?

Consistent best string: Given a consistent RNN $R$ and $c\in(0,1)$ , does there exist $s\in\Sigma^{*}$ with $R(s)>c$ ?

As usual the corresponding optimization problems are not significantly simpler than these decision problems. Unfortunately, the general problem is also undecidable, which can easily be shown using our example.

The best string problem for RNNs is undecidable.

using Lemma 14 in the appendix. Consequently, a string with weight above $0.12$ exists if and only if $M$ halts, so the best string problem is also undecidable. ∎

If we restrict the RNNs to be consistent, then we can easily decide the best string problem by simple enumeration.

The consistent best string problem for RNNs is decidable.

Since $R$ is consistent, $\lim_{i\to\infty}S_{i}=1$ , so this algorithm is guaranteed to terminate and it obviously decides the problem. ∎
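The enumeration procedure itself is not reproduced in this text, so the following is one plausible reading of it rather than the authors' exact algorithm. It enumerates strings by increasing length, keeps the running total $S_{i}$ of the weights seen so far, answers "yes" as soon as a string of weight above $c$ appears, and answers "no" once the unseen mass $1-S_{i}$ is at most $c$. The callable `weight` is a hypothetical stand-in for evaluating the RNN on a string.

```python
from fractions import Fraction
from itertools import count, product

def exists_string_above(weight, alphabet, c):
    """Decide whether some string has weight > c, assuming the weights
    of all strings sum to exactly 1 (consistency) and c > 0."""
    seen_mass = 0
    for length in count(0):
        for letters in product(alphabet, repeat=length):
            s = "".join(letters)
            w = weight(s)
            if w > c:
                return True, s
            seen_mass += w
            # Every unseen string weighs at most the unseen mass.
            if 1 - seen_mass <= c:
                return False, None

def halving(s):
    # Toy consistent unary language: a^n has weight 2^-(n+1).
    return Fraction(1, 2 ** (len(s) + 1))
```

Termination relies on consistency: the unseen mass tends to $0$, so it eventually drops to $c$ or below. On an inconsistent input the loop may run forever, which matches the undecidability of the unrestricted problem. Exact rational weights are used in the toy example to avoid rounding issues in the stopping test.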

Next, we investigate the length $\lvert w_{R}^{\text{max}}\rvert$ of the shortest string $w_{R}^{\text{max}}$ of maximal weight in the weighted language $R$ generated by a consistent RNN $R$ in terms of its (binary storage) size $\lvert R\rvert$ . As already mentioned by Siegelmann and Sontag (1995) and evidenced here, only small precision rational numbers are needed in our constructions, so we assume that $\lvert R\rvert\leq c\cdot\lvert N\rvert^{2}$ for a (reasonably small) constant $c$ , where $N$ is the set of neurons of $R$ . We show that no computable bound on the length of the best string can exist, so its length can surpass all reasonable bounds.

In the previous section (before Theorem 6) we presented an RNN $R_{M}$ that simulates an arbitrary (single-track) Turing machine $M$ with $n$ states. By Siegelmann and Sontag (1995) we have $\lvert R_{M}\rvert\leq c\cdot(4n+16)$ . Moreover, we observed that this RNN $R_{M}$ is consistent if and only if the Turing machine $M$ halts on empty input. In the proof of Theorem 8 we have additionally seen that the length $\lvert w_{R}^{\text{max}}\rvert$ of its best string exceeds the number $T_{M}$ of steps required to halt.

so $f$ clearly cannot be computable and no computable function $g$ can provide bounds for $f$ . ∎

Finally, we investigate the difficulty of the best string problem for consistent RNNs restricted to solutions of polynomial length.

Identifying the best string of polynomial length in a consistent RNN is NP-complete and APX-hard.

Proof sketch. Clearly, we can guess an input string of polynomial length, run the RNN, and verify whether its weight exceeds the given bound in polynomial time. Therefore the problem is trivially in NP. For NP-hardness, we reduce from the 0-1 Integer Linear Programming Feasibility Problem:

Suppose we are given an instance of the above problem. We construct an instance of the consistent best string of polynomial length problem with input $\langle R,c\rangle$ . Our construction ensures that the only length at which a string can have weight greater than $c$ is $n$ . Thus, if there is any string whose weight is greater than $c$ , the given instance of 0-1 Integer Linear Programming Problem is feasible; otherwise it is not.

Our reduction is a Polynomial-Time Approximation Scheme (PTAS) reduction and preserves approximability. Since 0-1 Integer Linear Programming Feasibility is NP-complete and the corresponding maximization problem is APX-complete, consistent best string of polynomial length is NP-complete and APX-hard, meaning there is no PTAS to find the best string bounded by polynomial length (i.e. the best we can hope for in polynomial time is a constant-factor approximation algorithm) unless P $=$ NP.

If we assume that the solution length is bounded by some finite number, we can convert algorithms from de la Higuera and Oncina (2013) for computing the most probable string in PFSAs for use in RNNs. Such algorithms would be similar to beam search Lowerre (1976) used most widely in practice.
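As a point of comparison, a length-bounded beam search over next-token predictions can be sketched as follows. This is a generic heuristic and not the algorithm of de la Higuera and Oncina (2013). The callable `next_probs`, which maps a prefix to a dictionary of next-token probabilities including the terminator `"$"`, is a hypothetical stand-in for an RNN step.

```python
import heapq

END = "$"  # terminating token (assumed spelling for this sketch)

def beam_search(next_probs, beam_width, max_len):
    """Return (weight, string) for the best complete string found
    among strings of length at most max_len."""
    beam = [(1.0, "")]          # (probability of prefix, prefix)
    best = (0.0, None)
    for _ in range(max_len + 1):
        candidates = []
        for p, prefix in beam:
            probs = next_probs(prefix)
            finished = p * probs[END]   # weight of prefix as a string
            if finished > best[0]:
                best = (finished, prefix)
            for token, q in probs.items():
                if token != END:
                    candidates.append((p * q, prefix + token))
        # Keep only the most probable prefixes for the next round.
        beam = heapq.nlargest(beam_width, candidates)
    return best

def toy_model(prefix):
    # Hypothetical model: reluctant to stop before two letters are read.
    if len(prefix) < 2:
        return {"a": 0.5, "b": 0.3, END: 0.2}
    return {"a": 0.05, "b": 0.05, END: 0.9}
```

Because prefixes outside the beam are discarded, the search can miss the true highest-weighted string; a wider beam trades running time for a smaller chance of that happening.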

Equivalence

We prove that equivalence of two RNNs is undecidable. For comparison, equivalence of two deterministic WFSAs can be tested in time $O(|\Sigma|(|Q_{A}|+|Q_{B}|)^{3})$, where $|Q_{A}|$, $|Q_{B}|$ are the number of states of the two WFSAs and $|\Sigma|$ is the size of the alphabet Cortes et al. (2007); equivalence of nondeterministic WFSAs is undecidable Griffiths (1968). The decidability of language equivalence for deterministic probabilistic pushdown automata (PPDA) is still open Forejt et al. (2014), although equivalence for deterministic unweighted pushdown automata (PDA) is decidable Sénizergues (1997).

The equivalence problem is formulated as follows:

Given two RNNs $R$ and $R^{\prime}$ , return “yes” if $R(s)=R^{\prime}(s)$ for all $s\in\Sigma^{*}$ , and “no” otherwise.

The equivalence problem for RNNs is undecidable.
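In practice, undecidability means that a finite comparison can refute equivalence but never confirm it. The sketch below searches for a distinguishing string up to a length bound. The callables `weight_r` and `weight_r2` are hypothetical stand-ins for evaluating two RNNs, and the tolerance `tol` is an assumption needed only because the toy weights are floating-point numbers.

```python
from itertools import product

def find_difference(weight_r, weight_r2, alphabet, max_len, tol=1e-12):
    """Return the first string of length <= max_len on which the two
    weight functions differ by more than tol, or None if none is found."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            s = "".join(letters)
            if abs(weight_r(s) - weight_r2(s)) > tol:
                return s
    return None

def halving(s):
    # Toy unary language: a^n has weight 2^-(n+1).
    return 2.0 ** -(len(s) + 1)

def late_deviation(s):
    # Toy language that agrees with halving on strings shorter than 3.
    return halving(s) if len(s) < 3 else halving(s) / 2
```

A result of `None` only says that no difference exists up to the bound: the two toy languages agree on all strings shorter than three letters and differ afterwards, so a bound of two misses the difference.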

Minimization

We look next at minimization of RNNs. For comparison, state-minimization of a deterministic PFSA takes time $O(|E|\log{|Q|})$, where $|E|$ is the number of transitions and $|Q|$ is the number of states Aho et al. (1974). Minimization of a nondeterministic PFSA is PSPACE-complete Jiang and Ravikumar (1993).

We focus on minimizing the number of hidden neurons ( $|N|$ ) in RNNs:

Given an RNN $R$ and a non-negative integer $n$, return “yes” if there exists an RNN $R^{\prime}$ with $|N^{\prime}|\leq n$ hidden units such that $R(s)=R^{\prime}(s)$ for all $s\in\Sigma^{*}$, and “no” otherwise.

We reduce from the Halting Problem. Suppose Turing Machine $M$ decides the minimization problem. For any Turing Machine $M^{\prime}$, construct the same RNN $R$ as in Theorem 12. We run $M$ on input $\langle R,0\rangle$. Note that an RNN with no hidden unit can only output constant $E^{\prime}_{s,t}$ for all $t$. Therefore the number of hidden units in $R$ can be minimized to $0$ if and only if it always outputs $E^{\prime}_{s,t}(a)=E^{\prime}_{s,t}(\$)=1/2$. If $M$ returns “yes”, $M^{\prime}$ does not halt on $\epsilon$; otherwise it halts. Therefore minimization is undecidable. ∎

Conclusion

We proved the following hardness results regarding RNNs as recognizers of weighted languages:

Finding the highest-weighted string for an arbitrary RNN is undecidable.

Finding the highest-weighted string for a consistent RNN is decidable, but the solution length can surpass all computable bounds.

Restricting to solutions of polynomial length, finding the highest-weighted string is NP-complete and APX-hard.

Testing equivalence of RNNs and minimizing the number of neurons in an RNN are both undecidable.

Although our undecidability results are upshots of the Turing-completeness of RNNs Siegelmann and Sontag (1995), our NP-completeness and APX-hardness results are original and surprising, since the analogous hardness results for PFSAs rely on the fact that there are multiple derivations for a single string Casacuberta and de la Higuera (2000). The fact that these results hold for the relatively simple RNNs we used in this paper suggests that the case would be the same for more complicated models used in NLP, such as long short-term memory networks (LSTMs; Hochreiter and Schmidhuber 1997).

Our results show the non-existence of (efficient) algorithms for interesting problems that researchers using RNNs in natural language processing tasks may have hoped to find. On the other hand, the non-existence of such efficient or exact algorithms gives evidence for the necessity of approximation, greedy, or heuristic algorithms to solve those problems in practice. In particular, since finding the highest-weighted string in an RNN is the same as finding the most likely translation in a sequence-to-sequence RNN decoder, our NP-completeness and APX-hardness results provide some justification for employing greedy and beam search algorithms in practice.

Acknowledgments

This work was supported by DARPA (W911NF-15-1-0543 and HR0011-15-C-0115). Andreas Maletti was financially supported by DFG Graduiertenkolleg 1763 (QuantLA).

References

Appendix

Identifying the best string of polynomial length in a consistent RNN is NP-complete and APX-hard.

Clearly, we can guess an input string of polynomial length, run the RNN, and verify whether its weight exceeds the given bound in polynomial time. Therefore the problem is trivially in NP. For NP-hardness we now reduce from the 0-1 Integer Linear Programming Feasibility Problem to our problem:

Suppose we are given an instance of the above problem. Construct an instance of the consistent best string with polynomial length problem with input $\langle R,c\rangle$ , where:

Let $d=\sum_{i=1}^{k}{(h_{s,t}(g_{i})-h_{s,t}(l_{i}))}$ , $\delta_{2}=\frac{1}{k+2}$ . We pick a big enough positive rational number $\beta$ so that if we define $\delta_{1}=\frac{1}{2e^{\beta}+1}$ ,

When $t=n+1$ , one can verify that we can set

since the range of $d$ is a finite set of values $\{0,1,2,\dots,k\}$ .

$c=(\frac{1-\delta_{1}}{2})^{n}(k\delta_{2})$

so we can pick $\beta$ such that its length written in binary

is logarithmic in $n$ and $k$. So the weights in matrices $E,E^{\prime}$ that produce $\beta$ are polynomial in $n$ and $k$. The same is true for the weights that produce $\gamma_{d}$. $c$ written in binary has length

which is polynomial in $n$ and $k$ . So our construction is polynomial.

We now prove that if we can solve the $\langle R,c\rangle$ -instance of consistent best string of polynomial length in polynomial time, we can also solve the given instance of 0-1 Integer Linear Programming Feasibility in polynomial time.

By our design, at time $1\leq t\leq n$, $R$ reads a binary string $x\in\{0,1\}^{n}$ into neurons $1,2,\dots,n$ while predicting almost half-half probability for either 0 or 1 and infinitesimal probability $\delta_{1}$ for termination. Therefore no string with length less than $n$ has weight greater than $c$.

At time $t=n+1$ , since $\sum_{j=1}^{n}{A_{ij}h_{s,t-1}(j)}-B_{i}$ is an integer, $h_{s,t}(g_{i})-h_{s,t}(l_{i})$ is the indicator for whether the $i$ -th constraint is satisfied:

Therefore $d$ is the total number of clauses satisfied by a given setting of $x=(x_{1},x_{2},\dots,x_{n})$ ( $0\leq d\leq k$ ). The termination probability at $t=n+1$ is $(d+1)\delta_{2}=\frac{d+1}{k+2}$ . If all $k$ clauses are satisfied, this setting of $x$ would have termination probability $1-\delta_{2}$ and therefore weight $(\frac{1-\delta_{1}}{2})^{n}(1-\delta_{2})>c$ . If fewer than $k$ clauses are satisfied, $x$ would have weight at most $c$ .

When $t\geq n+2$ , $R$ continues to assign almost half-half probability for either 0 or 1 and infinitesimal probability for termination. Therefore any string of length greater than $n+1$ has a weight smaller than $\epsilon$ . From that point on the output vector is constant, so the RNN is consistent. Notice that the weights of strings monotonically decrease with length except for at length $n$ .

Therefore our construction ensures that the only length at which a string can have weight greater than $c$ is $n$ . Thus, if there is any string whose weight is greater than $c$ , the given instance of 0-1 Integer Linear Programming Problem is feasible; otherwise it is not.
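The arithmetic behind this argument can be replayed without building the RNN. The sketch below computes $d$, the weight $(\frac{1-\delta_{1}}{2})^{n}(d+1)\delta_{2}$ of a length-$n$ string, and the threshold $c$. Two points are assumptions, because the corresponding displays are not reproduced in this text: constraints are read as $\sum_{j}A_{ij}x_{j}\geq B_{i}$, and the neurons $g_{i}$ and $l_{i}$ are taken to hold $\sigma\langle\text{slack}+1\rangle$ and $\sigma\langle\text{slack}\rangle$ respectively. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def relu(v):
    return v if v > 0 else 0

def satisfied_count(A, B, x):
    """d: number of constraints sum_j A[i][j]*x[j] >= B[i] that hold."""
    d = 0
    for row, b in zip(A, B):
        slack = sum(a * xj for a, xj in zip(row, x)) - b
        # For integer slack this difference is 1 iff slack >= 0.
        d += relu(slack + 1) - relu(slack)
    return d

def string_weight(A, B, x, delta1):
    """Weight of the length-n string x: ((1-delta1)/2)^n (d+1) delta2."""
    n, k = len(x), len(B)
    delta2 = Fraction(1, k + 2)
    return ((1 - delta1) / 2) ** n * (satisfied_count(A, B, x) + 1) * delta2

def threshold(n, k, delta1):
    """c = ((1-delta1)/2)^n * k * delta2."""
    return ((1 - delta1) / 2) ** n * k * Fraction(1, k + 2)

def is_feasible(A, B, delta1):
    """Brute force: some length-n string has weight above c."""
    n, k = len(A[0]), len(B)
    c = threshold(n, k, delta1)
    return any(string_weight(A, B, x, delta1) > c
               for x in product((0, 1), repeat=n))
```

The brute-force loop in `is_feasible` is exponential in $n$ and serves only to confirm that "some string has weight above $c$" coincides with feasibility on small instances. The value of `delta1` is passed in exactly as a `Fraction`, and any value in $(0,1)$ works for this check since it cancels in the comparison.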

Define the maximum number of clauses satisfied by all assignments of $x\in\{0,1\}^{n}$ :

By our construction, when $d_{max}\geq 1$ , the highest-weighted string will occur at length $n$ , and has weight $(\frac{1-\delta_{1}}{2})^{n}(d_{max}+1)\delta_{2}=(\frac{1-\delta_{1}}{2})^{n}\frac{d_{max}+1}{k+2}$ which is proportional to $d_{max}+1$ . The empty string has the highest weight among all strings of length not equal to $n$ . Its weight is $\delta_{1}<(\frac{1-\delta_{1}}{2})^{n}\delta_{2}$ which will always be less than the weight of any length- $n$ string corresponding to a setting of variables satisfying at least 1 constraint ( $\geq(\frac{1-\delta_{1}}{2})^{n}(2\delta_{2})$ ).

Therefore, given any rational number $\zeta=\frac{\eta_{1}}{\eta_{2}}>0$ ( $\eta_{1},\eta_{2}\in[k]$ ), define $\delta(\zeta)=\frac{\eta_{1}}{\eta_{2}+1}$ . If binary string $s$ is a $(1-\delta(\zeta))$ -approximation to consistent best string with polynomial length, then reading $s$ as a vector of $n$ variables $x=(x_{1},x_{2},\dots,x_{n})$ , $x$ would be a $(1-\zeta)$ -approximation to 0-1 Integer Linear Programming Maximum Satisfiability (the optimization version of the problem, i.e., finding a setting of variables to satisfy the greatest number of constraints).

Thus our reduction is a PTAS reduction and preserves approximability. Since 0-1 Integer Linear Programming Feasibility is NP-complete and the corresponding maximization problem is APX-complete, consistent best string of polynomial length is NP-complete and APX-hard, meaning there is no Polynomial-Time Approximation Scheme to find the best string bounded by polynomial length (i.e. the best we can hope for in polynomial time is a constant-factor approximation algorithm) unless P $=$ NP. ∎

where $(-1;e^{-k})_{\infty}$ is the infinite $e^{-k}$ -Pochhammer symbol.

where the final equality utilizes Lemma 14. ∎

We set $h_{s,-1}(n)=h_{-1}(n)$ for all $n\in N$ and $h^{\prime}_{s,-1}(n^{\prime})=h^{\prime}_{-1}(n^{\prime})$ for all $n^{\prime}\in N^{\prime}$ . Then trivially $h^{\prime}_{s,-1}(n_{1})-h^{\prime}_{s,-1}(n_{2})=h_{-1}(n)-0=h_{s,-1}(n)$ . Moreover,

Hence $h^{\prime}_{s,t}(n_{1})-h^{\prime}_{s,t}(n_{2})=h_{s,t}(n)$ as required. ∎