CKConv: Continuous Kernel Convolution For Sequential Data

David W. Romero, Anna Kuzina, Erik J. Bekkers, Jakub M. Tomczak, Mark Hoogendoorn

Introduction

Recurrent Neural Networks (RNNs) have long governed tasks with sequential data (Rumelhart et al., 1985; Hochreiter & Schmidhuber, 1997; Chung et al., 2014). Their main ingredient is the recurrent unit: a network component with a recurrence formulation which grants RNNs the ability to be unrolled for arbitrarily many steps and handle sequences of arbitrary size. In practice, however, the effective memory horizon of RNNs, i.e., the number of steps the network retains information from, has proven to be surprisingly small, most notably due to the vanishing gradients problem (Hochreiter, 1991; Bengio et al., 1994). Interestingly, the very recurrent nature that allows RNNs to be unrolled for arbitrarily many steps is also responsible for vanishing gradients (Pascanu et al., 2013b). This, in turn, hinders learning from the far past and induces a small effective memory horizon.

Convolutional Neural Networks (CNNs) (LeCun et al., 1998) have proven a strong alternative to recurrent architectures as long as relevant input dependencies fall within their memory horizon, e.g., Conneau et al. (2016); Oord et al. (2016); Dai et al. (2017); Dauphin et al. (2017); Bai et al. (2018a). CNNs avoid the training instability and vanishing / exploding gradients characteristic of RNNs by avoiding back-propagation through time (Werbos, 1990) altogether. However, these architectures model convolutional kernels as a sequence of independent weights. As a result, their memory horizon must be defined a-priori, and larger memory horizons induce a proportional growth of the model size.

In this work, we provide a solution to these limitations. We propose to view a convolutional kernel as a continuous function parameterized by a small neural network instead of a sequence of independent weights. The resulting Continuous Kernel Convolution (CKConv) enjoys the following properties:

CKConvs can define arbitrarily large memory horizons within a single operation. Consequently, Continuous Kernel Convolutional Neural Networks (CKCNNs) detach their memory horizon from (i) the depth of the network, (ii) the dilation factors used, and (iii) the size of the network.

CKConvs do not rely on any form of recurrence. As a result, CKCNNs (i) can be trained in parallel, and (ii) do not suffer from vanishing / exploding gradients or small effective memory horizons.

Continuous convolutional kernels can be evaluated at arbitrary positions. Consequently, CKConvs and CKCNNs can be readily used on irregularly sampled data, and data at different resolutions.
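As a minimal sketch of these properties (not the authors' implementation), the snippet below generates a kernel from a small sine network and applies it causally. The helper names `init_kernel_mlp`, `kernel_mlp` and `ckconv`, the two-layer network, the initialisation, the value `omega_0=30.0` and the normalisation of positions to $[-1, 1]$ are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_kernel_mlp(hidden=32, c_in=1, c_out=1):
    # Hypothetical initialisation; the paper's scheme is not reproduced here.
    return {
        "W1": rng.normal(size=(1, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(size=(hidden, c_out * c_in)) / np.sqrt(hidden),
        "b2": np.zeros(c_out * c_in),
        "shape": (c_out, c_in),
    }

def kernel_mlp(params, rel_pos, omega_0=30.0):
    """Map relative positions (N,) to kernel values (N, c_out, c_in)."""
    h = np.sin(omega_0 * (rel_pos[:, None] @ params["W1"] + params["b1"]))
    out = h @ params["W2"] + params["b2"]
    return out.reshape(len(rel_pos), *params["shape"])

def ckconv(x, params):
    """Causal convolution of x (T, c_in) with a kernel as long as the input."""
    T = x.shape[0]
    rel_pos = np.linspace(-1.0, 1.0, T)   # assumed normalisation of positions
    psi = kernel_mlp(params, rel_pos)     # (T, c_out, c_in)
    y = np.zeros((T, psi.shape[1]))
    for t in range(T):
        for tau in range(t + 1):          # only past and present inputs
            y[t] += psi[tau] @ x[t - tau]
    return y

params = init_kernel_mlp(c_in=2, c_out=3)
n_params = sum(v.size for k, v in params.items() if k != "shape")
y_short = ckconv(rng.normal(size=(50, 2)), params)
y_long = ckconv(rng.normal(size=(200, 2)), params)
```

The same `params` produce a kernel spanning 50 or 200 steps, so the memory horizon grows with the input while `n_params` stays fixed. The explicit double loop is kept for readability only.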

Related Work

Continuous kernel formulation. Continuous formulations for convolutional kernels were introduced to handle irregularly sampled 3D data locally (Schütt et al., 2017; Simonovsky & Komodakis, 2017; Wang et al., 2018; Wu et al., 2019). As discrete convolutions learn independent weights for specific relative positions, they cannot handle irregularly sampled data effectively. Subsequent work focuses on point-cloud applications (Fuchs et al., 2020; Hu et al., 2020; Shi et al., 2019; Thomas et al., 2018). Other approaches include Monte Carlo approximations of continuous operations (Finzi et al., 2020). Our work proposes a new broad flavor of applications for which continuous kernels are advantageous.

Implicit neural representations. Implicit neural representations construct continuous data representations by encoding the input in the weights of a neural network (Mescheder et al., 2019; Park et al., 2019; Sitzmann et al., 2020). This leads to numerous advantages over conventional (discrete) data representations, e.g., memory efficiency, analytic differentiability, with interesting properties for several applications, e.g., generative modelling (Dupont et al., 2021; Schwarz et al., 2020).

The Convolution and Common Kernel Parameterizations

Values $\mathbf{x}(\tau)$ falling outside of $\mathcal{X}$ are padded by a constant value often defined as zero (Fig. 2(a)).

In practice, causal convolutions are easily implemented via asymmetrical padding. In this work, we consider causal convolutions as default. Nevertheless, our analyses are also valid for centered ones.
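The asymmetrical padding mentioned above can be sketched as follows. The function name `causal_conv1d` and the example values are illustrative assumptions; the point is that padding $k-1$ zeros on the left only makes each output depend on current and past inputs.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """y[t] = sum_{tau=0}^{k-1} kernel[tau] * x[t - tau], with x[t] = 0 for t < 0."""
    k = len(kernel)
    x_padded = np.concatenate([np.zeros(k - 1), x])   # pad on the left only
    # np.convolve flips the kernel (true convolution); 'valid' keeps len(x) outputs.
    return np.convolve(x_padded, kernel, mode="valid")

x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, 0.25, 0.25])
y = causal_conv1d(x, kernel)
```

A centered convolution would instead split the padding between both ends of the sequence.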

Dilated convolutional kernels. To alleviate these limitations, previous works propose to interleave kernel weights with zeros in order to cover larger memory horizons without additional weights (Fig. 2(c)). This formulation alleviates some of the previous limitations, but introduces additional ones:

Dilated kernels are unable to model dependencies of input values falling in the interleaved regions.

Several authors use dilated convolutions with varying dilation factors as a function of depth, e.g., (Bai et al., 2018a; Dai et al., 2017; Oord et al., 2016; Romero et al., 2020). By carefully selecting layer-wise dilation factors, one can ensure that some kernel hits each input within the memory horizon of the network. However, due to the extreme sparsity of the formulation, it is difficult to estimate the effective amount of processing applied to the input. In addition, this layout ties together (i) the memory horizon, (ii) the depth, and (iii) the layer-wise dilation factors of the network, which effectively constrains the flexibility of the neural architecture design.
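To make the coupling between memory horizon, depth and dilation factors concrete, the sketch below builds a dilated kernel and computes the horizon of a stack of dilated layers. It assumes one causal convolution per layer; architectures with several convolutions per block yield a different count. The helper names are hypothetical.

```python
import numpy as np

def dilate(kernel, dilation):
    """Interleave (dilation - 1) zeros between consecutive kernel weights."""
    dilated = np.zeros((len(kernel) - 1) * dilation + 1)
    dilated[::dilation] = kernel
    return dilated

def memory_horizon(kernel_size, dilations):
    """Receptive field of a stack with one causal dilated convolution per layer."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilated = dilate(np.array([1.0, 2.0, 3.0]), dilation=4)
horizon = memory_horizon(kernel_size=3, dilations=[1, 2, 4, 8])
```

Under these assumptions, extending the horizon requires adding layers or changing the dilation schedule, which is the constraint described above.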

In contrast to the (dilated) discrete convolutions presented in this section, our proposed formulation allows handling arbitrarily long sequences with arbitrarily large, dense memory horizons in a single layer and under a fixed parameter budget.

Continuous Kernel Convolution

In this section, we introduce our approach. First, we define it formally, analyze its properties, illustrate its connection to recurrent units, and elaborate on the functional family they can describe. Next, we discuss concrete parameterizations of continuous convolutional kernels, illustrate their connection to implicit neural representations, and show that our final kernels are able to fit complex functions.

Irregularly sampled data. CKConvs are able to handle irregularly-sampled and partially observed data. To this end, it is sufficient to sample MLPψ at positions for which the input signal is known and perform the convolution operation with the sampled kernel. For very non-uniformly sampled inputs, an inverse density function over the samples can be incorporated in order to provide an unbiased estimation of the convolution response (see Appx. A.1, Wu et al. (2019) for details).

We note that the previous features are hardly attainable by regular architectures, with an exception being RNNs with continuous-time interpretations, e.g., Gu et al. (2020a); Kidger et al. (2020).

(Linear) recurrent units are continuous kernel convolutions. Consider a recurrent unit of the form:

The functional family of continuous kernel convolutions. From the previous observation, we can conclude that CKConvs are not only more general than discrete convolutions, but that the functional family they describe is also more general than that of (linear) recurrent units (Fig. 3).

2 The Continuous Convolutional Kernel MLPψ

What kind of kernels can MLPψ generate? Our method relies on the assumption that the neural network MLPψ is able to model complex dependencies densely among all elements within the memory horizon. That is, it assumes that MLPψ is able to generate arbitrary convolutional kernels.

Experiments

We validate our approach against several existing models and across several tasks selected from the corresponding papers. Specifically, we benchmark its ability to handle long-term dependencies, data at different resolutions and irregularly-sampled data. A complete description of the datasets used as well as additional experiments and ablation studies can be found in the Appendix (Appx. C, D). Our code is publicly available at https://github.com/dwromero/ckconv.

Network details. We parameterize our convolutional kernels as 3-layer SIRENs. Weight normalization (Salimans & Kingma, 2016) leads to better and faster convergence when applied to the layers in MLPψ, and we use it across all experiments. All our CKCNNs follow the structure shown in Fig. 8 and vary only in the number of blocks and channels. We use two residual blocks for all experiments reported in this section. Specifications on the architectures and hyperparameters used are given in Appx. E. We speed up the convolution operations in our networks via the convolution theorem: $(f*\psi)=\mathcal{F}^{-1}\big\{\mathcal{F}\{f\}\cdot\overline{\mathcal{F}\{\psi\}}\big\}$, with $\mathcal{F}$ the Fourier transform.
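The speed-up via the convolution theorem can be sketched with NumPy's real FFT. This is an illustrative version under stated assumptions, not the authors' code: the function name `fft_causal_conv` is hypothetical, and the sketch computes a plain causal convolution, whereas the conjugate in the formula above corresponds to the cross-correlation convention (the two differ by a flip of the kernel).

```python
import numpy as np

def fft_causal_conv(x, psi):
    """Causal convolution y[t] = sum_{tau<=t} x[t - tau] * psi[tau] via the FFT.

    Assumes len(psi) <= len(x).
    """
    n = len(x)
    size = 2 * n   # zero-pad so the circular convolution does not wrap around
    spectrum = np.fft.rfft(x, size) * np.fft.rfft(psi, size)
    return np.fft.irfft(spectrum, size)[:n]

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
psi = rng.normal(size=1000)          # kernel as long as the input
direct = np.convolve(x, psi)[:1000]  # O(N^2) reference
fast = fft_causal_conv(x, psi)       # O(N log N)
```

For kernels as long as the input, the direct sum scales quadratically with the sequence length, which is why the FFT route pays off for global kernels.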

Stress experiments. First, we validate that CKCNNs can readily model memory horizons of different lengths. To this end, we evaluate if a shallow CKCNN is able to solve the Copy Memory and Adding Problem tasks (Hochreiter & Schmidhuber, 1997) for sequences of varying size. Success is achieved if 100% accuracy, or a loss $\leq$ 1e-4, is obtained for the copy memory and adding problem, respectively. Random predictions for the adding problem lead to a loss of approx. 0.17.

Our results show that a shallow CKCNN is able to solve both problems for all sequence lengths considered without requiring structural modifications (Tab. 3). Recurrent architectures are not able to solve the copy problem at all and could solve the adding problem only up to 200 steps. TCNs with $k{=}7$, $n{=}7$ were able to solve both tasks for up to 1000 steps. However, larger lengths were out of reach as their memory horizon is constrained a priori. To handle larger sequences, TCNs must modify their network structure based on prior knowledge regarding the expected length of the input sequence.

Shallow CKCNNs outperform recurrent, self-attention and convolutional models on sMNIST and pMNIST (Tab. 1). On sMNIST, a small CKCNN (100k params.) achieves state-of-the-art results with a model 80$\times$ smaller than the current state-of-the-art. A wider CKCNN (1m params.) slightly increases this result further. On pMNIST, we see an improvement of 0.8% over the best model of size $\leq$100k, and our wider shallow CKCNN achieves state-of-the-art on this dataset. For sCIFAR10, our small CKCNN obtains similar results to a self-attention model 5$\times$ bigger, and our wider variant improves performance by an additional 1%. Our best results are obtained with an even wider model (2.5m params.), with which an accuracy of 65.59% is obtained. On Char-level PTB, a CKCNN with 3m parameters outperforms all models considered as well as the state-of-the-art, Mogrifier LSTMs (Melis et al., 2019), while being 13.3$\times$ smaller.

Time-series modeling. Next, we evaluate CKCNNs on time-series data. To this end, we consider the CharacterTrajectories (CT) (Bagnall et al., 2018) and the Speech Commands (SC) (Warden, 2018) datasets. We follow Kidger et al. (2020) to obtain a balanced classification dataset with precomputed mel-frequency cepstrum coefficients. In addition, we evaluate the ability of CKCNNs to model long-term dependencies by training on the raw SC dataset (SC_raw), whose records have length 16k.

We compare CKCNNs with representative sequential models with continuous-time interpretations: GRU-ODE (De Brouwer et al., 2019), GRU-$\Delta t$ (Kidger et al., 2020), ODE-RNN (Rubanova et al., 2019), and NCDE (Kidger et al., 2020). Continuous-time sequential models were selected as they are the only sequential methods also able to handle irregularly-sampled data, and data at different resolutions. Our results show that shallow CKCNNs outperform all continuous-time models considered for both the CT and SC datasets (Tab. 3). In addition, CKCNNs obtain promising results on SC_raw, which validates their ability to handle very-long-term dependencies. In fact, CKCNNs trained on SC_raw are able to outperform several Neural ODE models trained on the preprocessed data (SC).

In addition, we observed that the neural ODE methods considered in Tab. 3 were prohibitively slow for long sequences. For instance, NCDEs were 228$\times$ slower than a CKCNN of equivalent size on SC_raw, taking 17 hours per epoch to train. Consequently, training an NCDE on SC_raw for a matching number of epochs would take more than 212 days to conclude. In order to provide results for these models, we train them under the same computational budget as CKCNNs. This is enough to train them for a single epoch. All obtained results are at best only marginally better than random.
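The timing figures above can be cross-checked with a few lines of arithmetic. The sketch treats the reported numbers (17 hours per epoch, a 228$\times$ slowdown, 212 days) as exact. The resulting epoch count is an inference from those numbers, not a value stated in the text.

```python
ncde_hours_per_epoch = 17.0
slowdown = 228.0
total_days = 212.0

# Time per epoch for a CKCNN of equivalent size, implied by the slowdown factor.
ckcnn_minutes_per_epoch = ncde_hours_per_epoch * 60.0 / slowdown

# Number of NCDE epochs that fit in the quoted 212 days.
epochs_in_total_days = total_days * 24.0 / ncde_hours_per_epoch

# Days needed for 300 NCDE epochs (300 is a hypothetical round figure).
days_for_300_epochs = 300 * ncde_hours_per_epoch / 24.0
```

Under these assumptions a CKCNN epoch takes roughly four and a half minutes, and the 212-day figure is consistent with a training run of about 300 epochs.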

Irregularly-sampled data. To conclude, we validate CKCNNs for irregularly-sampled data. To this end, we consider the PhysioNet sepsis challenge (Reyna et al., 2019) as well as the CT dataset with drops of 30%, 50% and 70% of the data as in Kidger et al. (2020). In addition, we provide results under the same methodology for the SC_raw dataset. As in Kidger et al. (2020), we add an additional channel to the input to indicate whether the value at that position is known.
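The data-dropping protocol with an additional observation channel can be sketched as below. Several details are assumptions made for illustration rather than taken from the text: whole time steps are dropped at once, dropped values are filled with zeros, and the helper name `drop_and_mask` is hypothetical.

```python
import numpy as np

def drop_and_mask(x, drop_rate, rng):
    """x: (T, C). Returns (T, C + 1): zero-filled input plus an observation mask."""
    T = x.shape[0]
    n_drop = int(round(drop_rate * T))
    dropped = rng.choice(T, size=n_drop, replace=False)
    mask = np.ones(T)
    mask[dropped] = 0.0                      # 0 marks an unknown position
    return np.concatenate([x * mask[:, None], mask[:, None]], axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(180, 3))                # stand-in for a 3-channel time series
x_masked = drop_and_mask(x, drop_rate=0.3, rng=rng)
```

The extra channel lets the network distinguish a genuine zero from a missing value.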

Our results show that CKCNNs outperform NCDEs and obtain state-of-the-art performance on the PhysioNet dataset. In addition, CKCNNs exhibit stable performance for varying quantities of missing data, and perform better than several models explicitly developed to this end (Tab. 4). On the CT dataset, NCDEs perform slightly better than CKCNNs for large data drop rates. However, we argue that our method is still advantageous due to the gains in training speed (see Section 6 for details).

Discussion and Limitations

Parameter-efficient large convolutional kernels. CKConvs construct large complex kernels with a fixed parameter budget. For large input sequences, this results in large savings in the number of parameters required to construct global kernels with conventional CNNs. For sequences from the pMNIST (length = 784) and SC_raw (length = 16000) datasets, a conventional CNN with global kernels would require 2.14m and 46.68m parameters, respectively, for a model equivalent to our CKCNN (100k). In other words, our kernel parameterization allows us to construct CKCNNs that are 21.84 and 445.71 times smaller than corresponding conventional CNNs for these datasets. Detailed exploration of the effect of our efficient continuous kernel parameterizations on optimization, overfitting and generalization is an interesting direction for future research.
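As a quick consistency check on these figures, the reported parameter counts and reduction factors can be combined to recover the implied CKCNN sizes. The sketch assumes the quoted numbers are rounded values; the dictionary names are illustrative.

```python
# Parameter counts of conventional CNNs with global kernels, as quoted above.
conventional_params = {"pMNIST": 2.14e6, "SC_raw": 46.68e6}
# Reported reduction factors for the corresponding CKCNNs.
reduction_factor = {"pMNIST": 21.84, "SC_raw": 445.71}

# Implied CKCNN sizes: conventional size divided by the reduction factor.
implied_ckcnn_params = {
    name: conventional_params[name] / reduction_factor[name]
    for name in conventional_params
}
```

Both implied sizes land close to 100k parameters, in line with the "CKCNN (100k)" label. The CKCNN size barely changes between the two datasets, while the conventional model grows with the sequence length.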

Is depth important? Shallow global memory horizons. Our results are obtained with CKCNNs built with two residual blocks only. Additional experiments (Appx. D.2) indicate that our models do not benefit from larger depth, and suggest that CKCNNs do not rely on very deep features. Though further analysis is required to draw consistent conclusions, it is intriguing to explore if it is sufficient to equip neural networks with global memory horizons even if this happens in a shallow manner.

High-frequency components. Interestingly, our kernels often contain frequency components higher than the resolution of the grid used during training (Fig. 9). As a result, transitions to finer resolutions benefit from smoothing (see Appx. E.3). Nevertheless, we believe that, if tuned properly, these high-frequency components might prove advantageous for tasks such as super-resolution and compression.

Faster continuous-time models. CKCNNs rely on convolutions, and thus can be executed in parallel. As a result, CKCNNs can be trained faster than recurrent architectures. This difference becomes more pronounced with concurrent continuous-time models for sequential data, which are based on neural ODEs and are at least 5$\times$ slower than RNNs (Kidger et al., 2020). At the cost of higher memory consumption, CKCNNs can be further sped up by using the convolution theorem.

Memory requirements. Although CKCNNs can be deployed and trained in parallel, they must store the convolution responses at each layer and for all input positions. This induces a linear memory complexity with regard to the sequence length, and largely contrasts with recurrent continuous-time models, whose memory complexity is constant. The memory consumption of the operation increases further if the convolution theorem is applied, because it requires multiplying the Fourier transforms of the input and the kernel, and taking the result back to the temporal representation. On the other hand, large convolutional kernels seem to allow CNNs to perform well without using many layers, which has a positive effect on memory consumption.

Selection of $\omega_{0}$. We observe that CKCNNs are very susceptible to the selection of $\omega_{0}$. For instance, performance on pMNIST may vary from 98.54 to 65.22 depending on the value of $\omega_{0}$. Consequently, finding a good value of $\omega_{0}$ induces an important cost in hyperparameter search (see Appx. E.4). $\omega_{0}$ acts as a prior on the variability of the target function. However, it is not obvious which value of $\omega_{0}$ is optimal for the internal (unknown) features of a network. Learning layer-wise $\omega_{0}$ values yielded sub-optimal results, and better results were obtained by using a predefined $\omega_{0}$ value across all layers.
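The role of $\omega_{0}$ as a prior on variability can be illustrated with a single sine unit $\sin(\omega_{0} t)$ evaluated on $t\in[-1,1]$. This is a simplified sketch: unit weight, zero bias, the position range, and the tested values of $\omega_{0}$ are assumptions chosen for illustration, not settings from the experiments.

```python
import numpy as np

def count_sign_changes(values):
    """Number of times consecutive samples change sign."""
    signs = np.sign(values)
    return int(np.sum(signs[:-1] * signs[1:] < 0))

# An even number of grid points avoids sampling t = 0 exactly.
t = np.linspace(-1.0, 1.0, 10_000)
oscillations = {
    omega_0: count_sign_changes(np.sin(omega_0 * t))
    for omega_0 in (1.0, 10.0, 100.0)
}
```

The number of zero crossings grows roughly linearly with $\omega_{0}$, so a value that is too small yields kernels that are too smooth for the target, while a value that is too large favours rapidly oscillating ones.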

Conclusion and Future Work

We introduced the Continuous Kernel Convolution (CKConv), a simple, yet powerful approach able to model global long-term dependencies effectively in a parameter-efficient manner. Aside from the ability to get good accuracy, CKConvs are readily able to handle irregularly-sampled data, and data at different resolutions. CKCNNs achieve state-of-the-art results on multiple datasets, and often surpass neural architectures designed for particular settings, e.g., for irregularly-sampled data.

We are intrigued about the potential of CKCNNs for tasks in which (global) long-term dependencies play a crucial role, e.g., audio, video, reinforcement learning, (autoregressive) generative modeling. The usage of CKConvs to model long-term interactions in images is also very promising. In addition, CKConvs provide a convenient way to study the effect of the receptive field size of convolutional architectures, as no network modifications are required for different sizes. Our findings may also be useful for specific problems with irregularly-sampled data, e.g., medical, point clouds. We are also excited about structural advances of CKConvs, for instance, attentive versions of CKCNNs, or formulations that further improve computation and parameter efficiency.

Alleviating limitations. Reducing the memory consumption of CKConvs is vital for its application on a broad range of scenarios, e.g., embedded devices. Moreover, finding kernel parameterizations more stable to hyperparameter changes is desirable to reduce the need for hyperparameter search.

What is the best implicit kernel parameterization for convolutional kernels? Despite the success of SIRENs, we believe that better kernel parameterizations might still be constructed, e.g., with Random Fourier Features (Tancik et al., 2020). Aside from improvements in implicit neural representations, which are directly transferable to CKConvs, we consider it important to analyze the effect that having unknown, changing target objectives has on the approximation. A thorough empirical study of possible kernel parameterizations is an important direction for future research. A parameterization with which additional desiderata, e.g., smoothness, can be imposed is also desirable.

Reproducibility Statement

We believe in reproducibility. In order to make our paper reproducible, we have released the source code used in our experiments to the public. In addition to the code, our repository includes the explicit command lines used to execute each of our experiments, as well as the corresponding pretrained models. Appx. E provides the experimental details of our approach. This section includes details regarding the hardware used, the specification of the neural architecture as well as the inputs of MLPψ. It also states the method used for hyperparameter tuning and the hyperparameters of our final models. Details regarding the smoothing of high-frequency artifacts are also provided in this section. Details regarding the datasets and any preprocessing steps used are provided in Appx. C. The proofs of our claims can be found in Appx. A.

Acknowledgements

We gratefully acknowledge Gabriel Dernbach for interesting analyses on the knot distribution of ReLU networks. We thank Emiel van Krieken and Ali el Hasouni as well for interesting questions and motivating comments at the beginning of this project.

David W. Romero is financed as part of the Efficient Deep Learning (EDL) programme (grant number P16-25), partly funded by the Dutch Research Council (NWO) and Semiotic Labs. Anna Kuzina is funded by the Hybrid Intelligence Center, a 10-year programme funded by the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research. Erik J. Bekkers is financed by the research programme VENI (grant number 17290) funded by the Dutch Research Council. All authors sincerely thank everyone involved in funding this work.

This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.

References

Appendix

Appendix A Properties of CKConvs

CKConvs can readily handle irregularly-sampled and partially observed data. This is a result of the convolutional kernel MLPψ being able to be sampled at arbitrary positions. For very non-uniformly sampled inputs, however, the corresponding sampling of the convolutional kernel can provide a biased estimation of the operation. To overcome this, one can follow the strategy proposed by Wu et al. (2019), which we summarize here for completeness.

where $s(\tau)$ depicts the inverse sample density of the input at point $\tau$. Intuitively, $s(\tau)$ controls the contribution of points $x(\tau)$ to the output response. If multiple points are close to one another, their contribution should be smaller than the contribution of points in regions where the sample distribution is much sparser. This provides a Monte Carlo estimate of $(x*\psi)$ from biased samples. In particular, one has that:

With $s(\tau)=\tfrac{1}{p(\tau)}$, Eq. 8 provides an unbiased estimation of $(x*\psi)$.
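The effect of the inverse-density weighting can be checked numerically. The sketch below is an illustration under stated assumptions, not a reproduction of Eq. 8: the signal $x$, the kernel $\psi$, the evaluation point $t=1$ and the sampling density $p(\tau)=0.5+\tau$ on $[0,1]$ are all made-up examples.

```python
import numpy as np

def x(tau):
    return np.sin(2 * np.pi * tau)   # example input signal (assumption)

def psi(s):
    return np.exp(-s)                # example kernel (assumption)

t = 1.0

def integrand(tau):
    return x(tau) * psi(t - tau)

# Reference value of (x * psi)(t) = int_0^1 x(tau) psi(t - tau) dtau (midpoint rule).
grid = (np.arange(100_000) + 0.5) / 100_000
reference = integrand(grid).mean()

# Non-uniform sample positions with density p(tau) = 0.5 + tau (inverse-CDF sampling).
rng = np.random.default_rng(0)
u = rng.random(200_000)
tau = (-1.0 + np.sqrt(1.0 + 8.0 * u)) / 2.0
p = 0.5 + tau

naive = integrand(tau).mean()             # ignores the sample density: biased
weighted = (integrand(tau) / p).mean()    # s(tau) = 1 / p(tau): unbiased
```

Here `naive` converges to the density-weighted integral rather than to the convolution, whereas `weighted` approaches `reference` as the number of samples grows.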

A.2 Data Sampled at Different Sampling Rates

Consequently, CKCNNs (i) can be deployed at sampling rates different than those seen during training, and (ii) can be trained on data with varying spatial resolutions. The latter is important for tasks in which data can be given at different resolutions such as super-resolution and segmentation.

Proof. To prove the previous statement, we start with the continuous definition of the convolution:

where we assume for simplicity and without loss of generality that the functions $x$, $\psi$ are scalar-valued.

As a result, we have that both approximations are approximately equal to the continuous integral at positions $t$ defined on both discrete grids. By equating both approximations, we obtain that:
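As a numerical illustration of this argument (a sketch with assumed example functions, not the authors' derivation), the snippet below approximates the same convolution integral with Riemann sums at two sampling rates. The functions `x` and `psi`, the evaluation point and the rates 100 and 400 are hypothetical choices.

```python
import numpy as np

def x(t):
    return np.sin(2 * np.pi * t)     # example input signal (assumption)

def psi(t):
    return np.exp(-3.0 * t)          # example kernel (assumption)

def discrete_sum(t, sample_rate):
    """Plain discrete convolution sum at time t, without the step size."""
    n = int(round(t * sample_rate))
    tau = np.arange(n) / sample_rate
    return np.sum(x(tau) * psi(t - tau))

def riemann_conv(t, sample_rate):
    """Riemann-sum approximation of int_0^t x(tau) psi(t - tau) dtau."""
    return discrete_sum(t, sample_rate) / sample_rate

sr1, sr2 = 100, 400
approx_1 = riemann_conv(1.0, sr1)
approx_2 = riemann_conv(1.0, sr2)
# The unscaled sums differ by the ratio of sampling rates.
rescaled = discrete_sum(1.0, sr2) * sr1 / sr2
```

Both Riemann sums agree closely, while the plain discrete sums only match after rescaling by the ratio of the sampling rates.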

A.3 Linear Recurrent Units Are CKConvs

Interesting insights can be obtained by drawing connections between convolutions and recurrent units. In particular, we can show that linear recurrent units are equal to a CKConv with a particular family of convolutional kernels: exponential functions. Besides providing a generalization to recurrent units, this equality provides a fresh and intuitive view to the analysis of vanishing and exploding gradients.

Here, $\mathbf{h}(-1)$ is the initial state of the hidden representation. We see that in fact it corresponds to a convolution between an input signal $\mathbf{x}$ and a convolutional kernel $\bm{\psi}$ (we discard $\mathbf{h}(-1)$ as it only describes the initialization of $\mathbf{h}$), given by:

Drawing this equality yields some important insights:

The cause of the exploding and vanishing gradients. Eqs. 12-14 intuitively depict the root of the exploding and vanishing gradient problem. It stems from sequence elements $\mathbf{x}(t-\tau)$, $\tau$ steps back in the past, being multiplied with an effective convolutional weight $\bm{\psi}(\tau){=}\mathbf{W}^{\tau}\mathbf{U}$. For eigenvalues $\lambda$ of $\mathbf{W}$ other than one, the resulting convolutional kernel $\bm{\psi}$ can only represent functions that either grow ($\lambda{>}1$) or decrease ($\lambda{<}1$) exponentially as a function of the sequence length (Figs. 3(a), 3(b)). As a result, the contribution of input values in the past either rapidly fades away or governs the updates of the model parameters. As exponentially growing gradients lead to divergence, the eigenvalues of $\mathbf{W}$ for converging architectures are often smaller than 1. This explains the small effective memory horizon of recurrent networks.

Linear recurrent units are a subclass of CKConvs. Linear recurrent units can be described as a convolution between the input and a very specific class of convolutional kernels: exponential functions (Eq. 13). In general, however, convolutional kernels are not restricted to this functional class. This can be seen in conventional (discrete) convolutions, whose kernels are able to model complex functions within their memory horizon. Unfortunately, discrete convolutions use a predefined, small kernel size, and thus possess a restricted memory horizon. This is equivalent to imposing an effective magnitude of zero to all input values outside the memory horizon (Fig. 3(c)). CKConvs, on the other hand, are able to define arbitrarily large memory horizons. For memory horizons of size equal to the input length, CKConvs are able to model complex functions upon the entire input (Fig. 3(d)).

In conclusion, we illustrate that CKConvs are also a generalization of (linear) recurrent architectures which allows for parallel training and enhanced expressivity.
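The equivalence discussed in this section can be verified numerically. The sketch assumes the linear recurrence $\mathbf{h}(t)=\mathbf{W}\mathbf{h}(t-1)+\mathbf{U}\mathbf{x}(t)$ with $\mathbf{h}(-1)=\mathbf{0}$ and the kernel $\bm{\psi}(\tau)=\mathbf{W}^{\tau}\mathbf{U}$. The dimensions and the scaling of $\mathbf{W}$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, T = 4, 3, 20
W = 0.3 * rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)   # small weights: contractive in practice
U = rng.normal(size=(d_h, d_x))
x = rng.normal(size=(T, d_x))

# Recurrent form: h(t) = W h(t-1) + U x(t), with h(-1) = 0.
h_rec = np.zeros((T, d_h))
h_prev = np.zeros(d_h)
for t in range(T):
    h_prev = W @ h_prev + U @ x[t]
    h_rec[t] = h_prev

# Convolutional form: h(t) = sum_{tau=0}^{t} psi(tau) x(t - tau), psi(tau) = W^tau U.
psi = np.stack([np.linalg.matrix_power(W, tau) @ U for tau in range(T)])
h_conv = np.zeros((T, d_h))
for t in range(T):
    for tau in range(t + 1):
        h_conv[t] += psi[tau] @ x[t - tau]

# Frobenius norm of the kernel at each lag: decays exponentially here.
kernel_norms = np.linalg.norm(psi, axis=(1, 2))
```

Both forms produce the same hidden states, and `kernel_norms` shrinks with the lag, which mirrors the fading contribution of distant inputs described above.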

Appendix B A Spline Interpretation Of ReLU and Sine Networks

Sitzmann et al. (2020) motivate the usage of Sine nonlinearities for implicit neural representations. However, there is no clear understanding as to why Sine nonlinearities are better suited for this task than (smooth) piece-wise nonlinearities. Here, we provide an interpretation of this phenomenon from a spline function approximation perspective.

The importance of initialization. There is an important distinction between implicit neural representations and conventional neural applications regarding the assumed distribution of the input. Conventional applications assume the distribution of the input features to be centered around the origin. This is orthogonal to implicit neural representations, where the spatial distribution of the output, i.e., the value of the function being implicitly represented, is uniformly distributed.

An improved initialization scheme. Following the previous reasoning, we explore inducing a uniformly distributed initialization of the knots. However, we observe that finding an initialization with an exponential number of knots is a cumbersome and unstable procedure. In fact, it is not always possible, and, whenever possible, it strongly restricts the values the weights $\mathbf{W}^{(l)}$ can assume.

A possible explanation to these astonishing results can be provided via our prior analysis:

Appendix C Dataset Description

Copy Memory Problem. The copy memory task consists of sequences of length $T+20$, for which the first 10 values are chosen randomly among the digits $\{1,\ldots,8\}$, the subsequent $T-1$ digits are set to zero, and the last 11 entries are filled with the digit 9. The goal is to generate an output of the same size as the input, filled with zeros everywhere except for the last 10 values, for which the model is expected to predict the first 10 elements of the input sequence.
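A data generator following this description can be sketched as below. The function name `copy_memory_batch` and the batch layout are assumptions; only the sequence structure is taken from the text.

```python
import numpy as np

def copy_memory_batch(T, batch_size, rng):
    """Inputs and targets of length T + 20 for the copy memory task."""
    data = rng.integers(1, 9, size=(batch_size, 10))   # digits in {1, ..., 8}
    x = np.zeros((batch_size, T + 20), dtype=np.int64)
    x[:, :10] = data          # first 10 values: random digits
    x[:, -11:] = 9            # last 11 entries: delimiter digit 9
    y = np.zeros_like(x)
    y[:, -10:] = data         # targets: the first 10 inputs, recalled at the end
    return x, y

rng = np.random.default_rng(0)
x, y = copy_memory_batch(T=100, batch_size=4, rng=rng)
```

The $T-1$ zeros between the digits and the delimiter set the number of steps over which the model must retain the information.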

The Adding Problem. The adding problem consists of input sequences of length $T$ and depth 2. The first dimension is filled with random values, whereas the second dimension is set to zeros except for two elements marked by 1. The objective is to sum the random values for which the second dimension is equal to 1. Simply predicting the sum to be 1 results in an MSE of about 0.1767.
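The task can be sketched with the generator below, assuming the random values are drawn uniformly from $[0, 1]$, which is the common setup for this benchmark but is an assumption here. The function name `adding_problem_batch` is hypothetical.

```python
import numpy as np

def adding_problem_batch(T, batch_size, rng):
    """Inputs of shape (batch, T, 2) and scalar targets for the adding problem."""
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))   # assumed range [0, 1]
    markers = np.zeros((batch_size, T))
    for i in range(batch_size):
        idx = rng.choice(T, size=2, replace=False)          # two marked positions
        markers[i, idx] = 1.0
    x = np.stack([values, markers], axis=-1)
    y = (values * markers).sum(axis=1)                      # sum of the marked values
    return x, y

rng = np.random.default_rng(0)
x, y = adding_problem_batch(T=200, batch_size=20_000, rng=rng)
baseline_mse = np.mean((y - 1.0) ** 2)   # loss of always predicting 1
```

Under the uniform assumption, the constant prediction gives a loss of about 0.17, in the vicinity of the value quoted above.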

Sequential and Permuted MNIST. The MNIST dataset (LeCun et al., 1998) consists of 70k gray-scale $28\times 28$ handwritten digits divided into training and test sets of 60k and 10k samples, respectively. The sequential MNIST dataset (sMNIST) presents MNIST images as a sequence of 784 pixels for digit classification. Consequently, good predictions require preserving long-term dependencies up to 784 steps in the past: much longer than most language modelling tasks (Bai et al., 2018b).

The permuted MNIST dataset (pMNIST) additionally permutes the order of the sMNIST sequences at random. Consequently, models can no longer rely on local features to perform classification. As a result, the classification problem becomes more difficult and the importance of long-term dependencies more pronounced.
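The construction of both sequence datasets can be sketched as follows. Random arrays stand in for the MNIST images, and the row-major pixel order, the single fixed permutation shared by all samples and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random(size=(5, 28, 28))             # stand-in for MNIST images

# sMNIST: flatten each image into a sequence of 784 pixels with one channel.
smnist = images.reshape(len(images), 784, 1)

# pMNIST: apply one fixed random permutation to every sequence.
perm = np.random.default_rng(42).permutation(784)
pmnist = smnist[:, perm, :]

# The permutation is invertible, so no pixel information is lost.
restored = pmnist[:, np.argsort(perm), :]
```

After the permutation, neighbouring steps of the sequence no longer correspond to neighbouring pixels, which removes the local structure mentioned above.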

Sequential CIFAR10. The CIFAR10 dataset (Krizhevsky et al., 2009) consists of 60k real-world $32\times 32$ RGB images uniformly drawn from 10 classes divided into training and test sets of 50k and 10k samples, respectively. Analogously to the sMNIST dataset, the sequential CIFAR10 (sCIFAR10) dataset presents CIFAR10 images as a sequence of 1024 pixels for image classification. This dataset is more difficult than sMNIST, as (i) even larger memory horizons are required to solve the task, and (ii) more complex structures and intra-class variations are present in the images (Trinh et al., 2018).

CharacterTrajectories. The CharacterTrajectories dataset is part of the UEA time series classification archive (Bagnall et al., 2018). It consists of 2858 time series of different lengths and 3 channels representing the $x,y$ positions and the pen tip force while writing a Latin alphabet character in a single stroke. The goal is to classify which of the 20 different characters was written using the time series data. The maximum length of the time series is 182.

Speech Commands. The Speech Commands dataset (Warden, 2018) consists of 105809 one-second audio recordings of 35 spoken words sampled at 16kHz. Following Kidger et al. (2020), we extract 34975 recordings from ten spoken words to construct a balanced classification problem. We refer to this dataset as SC_raw. In addition, we utilize the preprocessing steps of Kidger et al. (2020) and extract mel-frequency cepstrum coefficients from the raw data. The resulting dataset, named SC, consists of time series of length 161 and 20 channels.

PhysioNet. The PhysioNet 2019 challenge on sepsis prediction (Goldberger et al., 2000; Reyna et al., 2019) is an irregularly sampled, partially observed dataset consisting of 40335 time series of variable length describing the stay of patients within an ICU. Time series are made out of 5 static features, e.g., age, and 34 time-dependent features, e.g., respiration rate, creatinine blood concentration, and 10.3% of the values are observed. We follow Kidger et al. (2020) and consider the first 72 hours of a patient's stay to predict whether sepsis is developed over the course of their entire stay (which can extend for a month for some patients).

PennTreeBank. The PennTreeBank (PTB) (Marcinkiewicz, 1994) is a language corpus which consists of 5,095K characters for training, 396K for validation and 446K for testing. At the character level used in our experiments, the vocabulary size is 50 characters (the size of the alphabet, including the end-of-string character). We follow Bai et al. (2018a) in performing the character-level language modeling task on this dataset.

Appendix D Ablation Studies

In this section, we perform an ablative study of our approach. Specifically, we analyze the effect of multiple components of our network, and provide additional comparisons with alternative architectures. Specifications on the architectures and hyperparameters used are given in Appx. E.

Our results (Fig. 7) illustrate that Sine provides markedly better approximation capabilities than all other nonlinearities considered. In particular, we observe that Sine is the only nonlinearity able to reconstruct very nonlinear and very non-smooth functions, while all other alternatives fail to do so.

D.2 Going Deeper with CKCNNs

The experimental results shown in Sec. 5 are obtained with shallow CKCNNs composed of 2 residual blocks only. An interesting question is whether going deeper can be used to improve the performance of CKCNNs. To analyze this, we compare deep and shallow CKCNNs with the same architecture for equal width, and equal number of parameters.

Our results (Tab. 7) indicate that deep CKCNNs do not provide improvements over shallow CKCNNs. In fact, deep CKCNNs of fixed size underperform their shallow counterparts. This is an interesting result, as it shows that shallow CKCNNs do not strongly rely on depth-wise compositionality of features, which is largely considered indispensable in deep learning.

Analysis of the results. The dynamics governing these results are not yet fully understood. However, our findings may lead to two different conclusions, both of which we consider important for the development and understanding of CKCNNs and deep learning in general:

Outcome I: Deep CKCNNs. The first possible outcome is that our current parameterization does not correctly leverage depth. In this case, efforts to construct proper deep CKCNNs will likely lead to performance improvements over the current architectures, and thus have the potential to advance the state-of-the-art further.

Outcome II: Depth is not needed when global memory horizons are provided with shallow networks. The second possible outcome is that depth is used mainly as a means to construct global memory horizons. Consequently, neural networks do not have to be very deep at all provided that global memory horizons are defined by shallow neural networks. Interestingly, this conclusion is in line with the predominant design of recurrent architectures, for which a moderate number of layers is used, e.g., Pascanu et al. (2013a); Graves et al. (2013); Gu et al. (2020b; a). This possible outcome is very exciting, as depth is largely considered indispensable in the deep learning community.

Appendix E Experimental Details

In this section, we provide extended details on our implementation as well as the exact architectures and optimization schemes used in our experiments.

Our models follow the structure shown in Fig. 8 and vary only in the number of channels. We use layer normalization (Ba et al., 2016) in our backbone network, and use the Adam optimizer (Kingma & Ba, 2014) across all our experiments. Our code is implemented in PyTorch and is publicly available at link removed for the sake of the double-blind review. We utilize wandb (Biewald, 2020) to log our results, and use NVIDIA TITAN RTX GPUs throughout our experiments.

Hyperparameter tuning. We tune the hyperparameters of our models via the bayes method given in wandb Sweeps, which selects hyperparameter values via a Gaussian process over the results obtained so far. We perform tuning on a validation dataset until a predefined maximum number of runs of 100 is exhausted. Further improvements upon our results may be obtained by leveraging more sophisticated tuning methods as well as additional runs.
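As a rough illustration of such a setup, the dictionary below sketches what a Bayesian sweep configuration could look like. The metric name, the parameter names and their ranges are placeholders chosen for this example, not the values used in our experiments.

```python
# Hypothetical sweep configuration. Only "method": "bayes" and the budget of
# 100 runs come from the text; every name and range below is a placeholder.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "lr": {"min": 1e-4, "max": 1e-2},        # continuous range
        "dropout": {"values": [0.0, 0.1, 0.2]},  # discrete choices
    },
}
MAX_RUNS = 100  # predefined maximum number of runs

# With wandb installed and a training function `train` defined (assumed here),
# the sweep could then be launched along these lines:
#   import wandb
#   sweep_id = wandb.sweep(sweep=sweep_config, project="my-project")
#   wandb.agent(sweep_id, function=train, count=MAX_RUNS)
```

The launch calls are left as comments so that the snippet runs without wandb or a training loop.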

Selecting ω0. CKCNNs are very susceptible to the value of ω0. In order to obtain a reasonable ω0, we first perform a random search on a large interval of ω0 values. After a few runs, we stop the random search and select the subinterval in which the validation accuracy is most promising. Next, we restart the random search on this subinterval and repeat the process until an ω0 value is obtained for which the validation accuracy is sufficiently high. Surprisingly, we found optimal values of ω0 to be always enclosed in the interval $$ even for very long sequences as in SC_raw.
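The coarse-to-fine procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in our experiments: the function `refine_omega0`, the callable `evaluate`, the initial bounds, and the rule for picking the next subinterval (the span of the best few candidates) are all hypothetical. In practice the subinterval was selected by inspecting validation accuracy, and `evaluate` would train a model and return its validation accuracy.

```python
import random

def refine_omega0(evaluate, low, high, rounds=3, samples_per_round=8,
                  keep=3, target=None, seed=0):
    """Coarse-to-fine random search over omega_0 (illustrative sketch).

    evaluate: hypothetical callable mapping omega_0 -> validation score.
    low, high: placeholder bounds of the initial search interval.
    target: optional score at which the search stops early.
    """
    rng = random.Random(seed)
    best_omega, best_score = None, float("-inf")
    for _ in range(rounds):
        # Random search on the current interval.
        candidates = [rng.uniform(low, high) for _ in range(samples_per_round)]
        trials = sorted(((w, evaluate(w)) for w in candidates),
                        key=lambda t: t[1], reverse=True)
        if trials[0][1] > best_score:
            best_omega, best_score = trials[0]
        if target is not None and best_score >= target:
            break
        # Restart on the subinterval spanned by the most promising candidates.
        top = [w for w, _ in trials[:keep]]
        low, high = min(top), max(top)
    return best_omega, best_score, (low, high)

# Toy objective standing in for "train and report validation accuracy".
toy_objective = lambda w: -abs(w - 30.0)
best_omega, best_score, final_interval = refine_omega0(toy_objective, 0.0, 1000.0)
```

Each round only shrinks the interval, so the number of rounds and samples per round trade off search cost against the risk of discarding a good region too early.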

E.2 Accounting for Spatial Displacements of the Sampled Convolutional Kernels

(n = 1) → [1, 2, 3, …, 180, 181, 182]
(n = 2) → [1, 3, 5, …, 177, 179, 181]
(n = 4) → [1, 5, 9, …, 173, 177, 181]
(n = 8) → [1, 9, 17, …, 161, 169, 177]

Recall that MLPψ takes normalized relative positions as input, which are computed based on the max input length seen during training. However, some of these subsampling transitions change the max value of the sequence, e.g., for (n = 8) the maximum is given by 177, whereas for (n = 1) this value corresponds to 182. Consequently, a naive approach would consider the last position in each subsampled sequence to correspond to the maximum normalized relative position 1. This effectively induces a spatial displacement and a re-scaling of the sampled convolutional kernel used during training.

This misalignment is automatically handled under the hood in our CKConv implementation. Nevertheless, we highlight this subtle phenomenon to prevent it in future applications.
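The effect can be reproduced with a few lines of arithmetic. The sketch below is only an illustration and does not reproduce our CKConv implementation: the helpers `subsample` and `normalize` are hypothetical, and normalization is simplified to a division by a reference maximum, which is an assumption made for this example.

```python
def subsample(max_len, n):
    """1-indexed positions kept when taking every n-th step of a sequence."""
    return list(range(1, max_len + 1, n))

def normalize(positions, reference_max):
    """Scale positions by a reference maximum (simplified normalization)."""
    return [p / reference_max for p in positions]

TRAIN_MAX = 182  # maximum input length seen during training

for n in (1, 2, 4, 8):
    pos = subsample(TRAIN_MAX, n)
    # Naive: the last kept position is treated as the maximum position 1.
    naive = normalize(pos, pos[-1])
    # Aligned: every subsampled grid shares the training-time reference.
    aligned = normalize(pos, TRAIN_MAX)
    print(f"n={n}: last position={pos[-1]}, "
          f"naive max={naive[-1]:.4f}, aligned max={aligned[-1]:.4f}")
```

For (n = 8) the naive variant stretches position 177 to 1, whereas the aligned variant keeps it at 177/182, so the kernel is sampled at the same locations as during training.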

E.3 Dealing with High-Frequency Components

Interestingly, our experiments revealed that our continuous kernels often contain frequency components higher than the resolution of the sampling grid used during training (Fig. 9). As these high-frequency components are not observed during training, they hurt performance when the kernels are evaluated at higher resolutions.
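A toy example may help to see why such components go unnoticed during training. In the sketch below, which uses plain sinusoids rather than learned kernels (the grid sizes and frequencies are arbitrary choices for illustration), a component above the Nyquist limit of the coarse grid coincides with a low-frequency one on that grid and only becomes visible once the same functions are sampled on a finer grid.

```python
import math

def sample(freq, num_points):
    """Sample sin(2*pi*freq*t) at num_points equally spaced t in [0, 1)."""
    return [math.sin(2 * math.pi * freq * k / num_points)
            for k in range(num_points)]

def max_gap(xs, ys):
    """Largest pointwise difference between two sampled signals."""
    return max(abs(a - b) for a, b in zip(xs, ys))

COARSE, FINE = 16, 64            # training grid vs. higher-resolution grid
low_freq = 3                     # below the coarse Nyquist limit (8 cycles)
high_freq = low_freq + COARSE    # 19 cycles: above it, aliases onto low_freq

# On the coarse grid both signals coincide up to floating-point error.
coarse_gap = max_gap(sample(low_freq, COARSE), sample(high_freq, COARSE))
# On the fine grid the high-frequency component is clearly different.
fine_gap = max_gap(sample(low_freq, FINE), sample(high_freq, FINE))
```

Here `coarse_gap` is numerically zero while `fine_gap` is large, mirroring how unobserved high-frequency content changes the sampled kernel at higher resolutions.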

E.4 Hyperparameters and Experimental Details

In this section, we provide further specifications of the hyperparameter configurations with which our models are trained. An overview of these hyperparameters is provided in Tab. 8.

Copy Memory. We set the number of channels of our CKCNN so as to roughly match the number of parameters of the GRU and TCN networks of Bai et al. (2018a). This is obtained with 10 hidden channels at every layer. We observe that the time to convergence grows proportionally to the length of the sequence considered. Whereas for sequences of length 100 convergence was observed after as few as 10 epochs, for sequences of length 6000 approximately 250 epochs were required. The maximum number of epochs is set to 50, 50, 100, 200 and 300 for sequences of size 100, 200, 1000, 3000 and 6000, respectively. We observe that different values of ω0 are optimal for different sequence lengths. The optimal ω0 values found are 19.20, 34.71, 68.69, 43.65 and 69.97 for the corresponding sequence lengths.

Adding Problem. We set the number of channels of our CKCNN so as to roughly match the number of parameters of the GRU and TCN networks of Bai et al. (2018a). This is obtained with 25 hidden channels at every layer. Similarly to the Copy Memory task, we observe that the time to convergence grows proportionally to the length of the sequence considered. Interestingly, this task was much easier to solve for our models, with convergence for sequences of length 6000 observed after 38 epochs. The maximum number of epochs is set to 20, 20, 30, 50 and 50 for sequences of size 100, 200, 1000, 3000 and 6000, respectively. We observe that different values of ω0 are optimal for different sequence lengths. The optimal ω0 values found are 14.55, 18.19, 2.03, 2.23 and 4.3 for the corresponding sequence lengths.

sMNIST, pMNIST and sCIFAR10. We construct two models of different sizes for these datasets: CKCNN and CKCNN-Big. The first is constructed to obtain a parameter count close to 100K. The second is constructed to obtain a parameter count close to 1M. The parameters utilized for these datasets are summarized in Tab. 8. Despite our efforts, we observed that our models heavily overfit on sCIFAR10. Combinations of weight decay, dropout and weight dropout were not enough to counteract overfitting.

CT, SC and SC_raw. The parameters utilized for classification on these datasets are summarized in Tab. 8. For hyperparameters regarding experiments with irregularly-sampled data, please refer to Tab. 9. Any parameter value not specified in Tab. 9 can be safely considered to be the one listed for the corresponding dataset in Tab. 8.

PennTreeBank. For character-level language modeling on the PTB dataset, we use the hyperparameters specified in Tab. 8. We use an embedding of size 100, following the TCN model from Bai et al. (2018a).