FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes

David W. Romero, Robert-Jan Bruintjes, Jakub M. Tomczak, Erik J. Bekkers, Mark Hoogendoorn, Jan C. van Gemert

Introduction

The kernel size of a convolutional layer defines the region from which features are computed, and is a crucial design choice. Commonly, small kernels (up to 7px) are used almost exclusively and are combined with pooling to model long-term dependencies (Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Tan & Le, 2019). Recent works indicate, however, that CNNs benefit from using convolutional kernels (i) of varying size at different layers (Pintea et al., 2021; Tomen et al., 2021), and (ii) at the same resolution as the data (Peng et al., 2017; Cordonnier et al., 2019; Romero et al., 2021). Unfortunately, most CNNs represent convolutional kernels as tensors of discrete weights and their size must be fixed prior to training. This makes exploring different kernel sizes at different layers difficult and time-consuming due to (i) the large search space, and (ii) the large number of weights required to construct large kernels.

A more efficient way to tune different kernel sizes at different layers is to learn them during training. Existing methods define a discrete weighted set of basis functions, e.g., shifted Delta-Diracs (Fig. 2(b), Dai et al. (2017)) or Gaussian functions (Fig. 2(c), Jacobsen et al. (2016); Shelhamer et al. (2019); Pintea et al. (2021)). During training they learn dilation factors over the basis functions to increase the kernel size, which crucially limits the bandwidth of the resulting kernels.

In this work, we present the Flexible Size Continuous Kernel Convolution (FlexConv), a convolutional layer able to learn high bandwidth convolutional kernels of varying size during training (Fig. 1). Instead of using discrete weights, we provide a continuous parameterization of convolutional kernels via a small neural network (Romero et al., 2021). This parameterization allows us to model continuous functions of arbitrary size with a fixed number of parameters. By multiplying the response of the neural network with a Gaussian mask, the size of the kernel can be learned during training (Fig. 2(a)). This allows us to produce detailed kernels of small sizes (Fig. 3), and tune kernel sizes efficiently.

FlexConvs can be deployed at higher resolutions than those observed during training, simply by using a more densely sampled grid of kernel indices. However, the high bandwidth of the kernel can lead FlexConv to learn kernels that show aliasing at higher resolutions, if the kernel bandwidth exceeds the Nyquist frequency. To solve this problem, we propose to parameterize convolutional kernels as Multiplicative Anisotropic Gabor Networks (MAGNets). MAGNets are a new class of Multiplicative Filter Networks (Fathony et al., 2021) that allows us to analyze and control the frequency spectrum of the generated kernels. We use this analysis to regularize FlexConv against aliasing. With this regularization, FlexConvs can be directly deployed at higher resolutions with minimal accuracy loss. Furthermore, MAGNets provide higher descriptive power and faster convergence speed than existing continuous kernel parameterizations (Schütt et al., 2017; Finzi et al., 2020; Romero et al., 2021). This leads to important improvements in classification accuracy (Sec. 4).

Our experiments show that CNNs with FlexConvs, coined FlexNets, achieve state-of-the-art results across several sequential datasets, match the performance of recent works with learnable kernel sizes with less compute, and are competitive with much deeper ResNets (He et al., 2016) when applied on image benchmark datasets. Thanks to the ability of FlexConvs to generalize across resolutions, FlexNets can be efficiently trained at low resolution to save compute, e.g., $16\times 16$ CIFAR images, and be deployed on the original data resolution with marginal accuracy loss, e.g., $32\times 32$ CIFAR images.

We introduce the Flexible Size Continuous Kernel Convolution (FlexConv), a convolution operation able to learn high bandwidth convolutional kernels of varying size end-to-end.

Our proposed Multiplicative Anisotropic Gabor Networks (MAGNets) allow for analytic control of the properties of the generated kernels. This property allows us to construct analytic alias-free convolutional kernels that generalize to higher resolutions, and to train FlexNets at low resolution and deploy them at higher resolutions. Moreover, MAGNets show higher descriptive power and faster convergence speed than existing kernel parameterizations.

CNN architectures with FlexConvs (FlexNets) obtain state-of-the-art results across several sequential datasets, and match recent works with learnable kernel size on CIFAR-10 with less compute.

Related Work

Adaptive kernel sizes. Loog & Lauze (2017) regularize the scale of convolutional kernels for filter learning. For image classification, adaptive kernel sizes have been proposed via learnable pixel-wise offsets (Dai et al., 2017), learnable padding operations (Han et al., 2018), learnable dilated Gaussian functions (Shelhamer et al., 2019; Xiong et al., 2020; Tabernik et al., 2020; Nguyen, 2020) and scalable Gaussian derivative filters (Pintea et al., 2021; Tomen et al., 2021; Lindeberg, 2021). These approaches either dilate discrete kernels (Fig. 2(b)), or use discrete weights on dilated basis functions (Fig. 2(c)). Using dilation crucially limits the bandwidth of the resulting kernels. In contrast, FlexConvs are able to construct high bandwidth convolutional kernels of varying size with a fixed parameter count. Larger kernels are obtained simply by passing more positions to the kernel network (Fig. 1).

Recently, Romero et al. (2021) introduced the Continuous Kernel Convolution (CKConv) as a tool to model long-term dependencies. CKConv uses a continuous kernel parameterization to construct convolutional kernels as big as the input signal with a constant parameter cost. Contrarily, FlexConvs jointly learn the convolutional kernel as well as its size. This leads to important advantages in terms of expressivity (Fig. 3), convergence speed and compute costs of the operation.

Implicit neural representations. Parameterizing a convolutional kernel via a neural network can be seen as learning an implicit neural representation of the underlying convolutional kernel (Romero et al., 2021). Implicit neural representations construct continuous data representations by encoding data in the weights of a neural network (Park et al., 2019; Sitzmann et al., 2020; Fathony et al., 2021).

We replace the SIREN (Sitzmann et al., 2020) kernel parameterization used in Romero et al. (2021) by our Multiplicative Anisotropic Gabor Networks: a new class of Multiplicative Filter Networks (Fathony et al., 2021). MFNs allow for analytic control of the resulting representations, and allow us to construct analytic alias-free convolutional kernels. The higher expressivity and convergence speed of MAGNets lead to accuracy improvements in CNNs using them as kernel parameterization.

Method

In this section, we introduce our approach. First, we introduce FlexConv and the Gaussian mask. Next, we introduce our Multiplicative Anisotropic Gabor Networks (MAGNets) and provide a description of our regularization technique used to control the spectral components of the generated kernel.

To learn the kernel size during training, FlexConvs define their convolutional kernels $\bm{\psi}$ as the product of the output of a neural network MLPψ with a Gaussian mask of local support. The neural network MLPψ parameterizes the kernel, and the Gaussian mask parameterizes its size (Fig. 1).
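The construction can be illustrated with a minimal 1D sketch. Everything named here is an assumption for illustration: `kernel_net` is a hypothetical stand-in for the kernel network MLPψ, the mask is assumed to be an unnormalized Gaussian with parameters `mu` and `sigma`, and the cropping threshold of 0.1 is a placeholder value rather than a setting taken from the text.

```python
import math

def kernel_net(x):
    # Hypothetical stand-in for MLP_psi: any smooth function of a
    # continuous coordinate x in [-1, 1] serves the illustration.
    return math.sin(6.0 * x) + 0.5 * math.cos(15.0 * x)

def gaussian_mask(x, mu, sigma):
    # Assumed mask form: unnormalized Gaussian with learnable mean and width.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def flexconv_kernel(size, mu=0.0, sigma=0.2, threshold=0.1):
    # Sample `size` positions uniformly in [-1, 1].
    xs = [-1.0 + 2.0 * i / (size - 1) for i in range(size)]
    masked = [(x, kernel_net(x) * gaussian_mask(x, mu, sigma)) for x in xs]
    # Crop positions where the mask falls below the threshold, so the
    # effective kernel size follows sigma.
    return [(x, v) for x, v in masked if gaussian_mask(x, mu, sigma) >= threshold]

small = flexconv_kernel(size=33, sigma=0.1)   # narrow mask, few taps survive
large = flexconv_kernel(size=33, sigma=0.4)   # wide mask, many taps survive
dense = flexconv_kernel(size=65, sigma=0.1)   # same mask on a denser grid
```

In this sketch the parameter count of `kernel_net` is independent of `size`: a wider mask or a denser grid only changes how many positions are evaluated, which mirrors how a continuous kernel can be resampled at a higher resolution.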

3.2 Multiplicative Anisotropic Gabor Networks (MAGNets)

In this section, we formalize our proposed parameterization for the kernel MLPψ. We start by introducing Multiplicative Filter Networks (Fathony et al., 2021), and present our MAGNets next.

Multiplicative Anisotropic Gabor Networks (MAGNets). Our MAGNet formulation is based on the observation that isotropic Gabor functions, i.e., with equal $\gamma$ for the horizontal and vertical directions, are undesirable as basis for the construction of MFNs. Whenever a frequency is required along a certain direction, an isotropic Gabor function automatically introduces that frequency in both directions. As a result, other bases must counteract this frequency in the direction where the frequency is not required, and thus the capacity of the MFN is not used optimally (Daugman, 1988).

Following the original formulation of the 2D Gabor functions (Daugman, 1988), we alleviate this limitation by using anisotropic Gabor functions instead:

The resulting Multiplicative Anisotropic Gabor Network (MAGNet) obtains better control upon frequency components introduced to the approximation, and demonstrates important improvements in terms of descriptive power and convergence speed (Sec. 4).
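A minimal sketch of a single anisotropic Gabor basis function is given here. The functional form (a Gaussian envelope with a separate scale per axis, multiplied by a sinusoid) is an assumption based on the description above, and all parameter names are illustrative rather than taken from the reference implementation.

```python
import math

def anisotropic_gabor(x, y, gamma_x, gamma_y, w_x, w_y,
                      mu_x=0.0, mu_y=0.0, b=0.0):
    # Gaussian envelope with an independent scale per axis.
    envelope = math.exp(-0.5 * ((gamma_x * (x - mu_x)) ** 2
                                + (gamma_y * (y - mu_y)) ** 2))
    # Sinusoidal carrier with frequency (w_x, w_y) and phase b.
    return envelope * math.sin(w_x * x + w_y * y + b)

def isotropic_gabor(x, y, gamma, w_x, w_y, **kwargs):
    # The isotropic case ties both axes to a single gamma.
    return anisotropic_gabor(x, y, gamma, gamma, w_x, w_y, **kwargs)

# With zero carrier frequency and phase pi/2 the value equals the envelope,
# which makes the per-axis scaling directly visible.
along_x = anisotropic_gabor(0.5, 0.0, 4.0, 1.0, 0.0, 0.0, b=math.pi / 2)
along_y = anisotropic_gabor(0.0, 0.5, 4.0, 1.0, 0.0, 0.0, b=math.pi / 2)
```

Here `along_x` decays much faster than `along_y` because `gamma_x` is larger than `gamma_y`; an isotropic function cannot express that difference.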

3.3 Analytic Alias-free MAGNets

Note however, that Eq. 9 holds approximately. This is due to aliasing artifacts which can appear if the frequencies in the learned kernel surpass the Nyquist criterion of the target resolution. Consequently, an anti-aliased parameterization is vital to construct kernels that generalize well to high resolutions.

Towards alias-free implicit neural representations. We observe that SIRENs as well as unconstrained MFNs and MAGNets exhibit aliasing when deployed on resolutions higher than the training resolution, which hurts performance of the model. An example kernel with aliasing is shown in Fig. 8.

To combat aliasing, we would like to control the representation learned by MAGNets. MAGNets –and MFNs in general– construct implicit neural representations that can be seen as a linear combination of basis functions. This property allows us to analytically derive and study the properties of the resulting neural representation. Here, we use this property to derive the maximum frequency of MAGNet-generated kernels, so as to regularize MAGNets against aliasing during training. We analytically derive the maximum frequency of a MAGNet, and penalize it whenever it exceeds the Nyquist frequency of the training resolution. We note that analytic derivations are difficult for other implicit neural representations, e.g., SIRENs, due to stacked layer-wise nonlinearities.

Maximum frequency of MAGNets. The maximum frequency component of a MAGNet is given by:

Effect of the FlexConv mask. The Gaussian mask used to localize the response of the MAGNet also has an effect on the frequency spectrum. Hence, the maximum frequency of a FlexConv kernel is:

Here, $k$ depicts the size of the FlexConv kernel before applying the Gaussian mask, and is equal to the size of the input signal. In practice, we implement Eq. 25 by regularizing the individual MAGNet layers, as is detailed in Appx. A.2. To verify our method, Fig. 8 (Appx. A.1) shows that the frequency components of FlexNet kernels are properly regularized for aliasing.
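One way to realize a penalty that is active only above the Nyquist frequency is a hinge on the excess. The quadratic form, the function names and the toy gradient loop are assumptions made for illustration; the regularizer actually used is the one detailed in Appx. A.2.

```python
def aliasing_penalty(f_max, f_nyquist):
    # Zero while the highest kernel frequency stays at or below Nyquist,
    # quadratic in the excess above it.
    excess = max(f_max, f_nyquist) - f_nyquist
    return excess ** 2

def penalty_gradient(f_max, f_nyquist):
    # Derivative of the penalty with respect to f_max.
    return 2.0 * max(f_max - f_nyquist, 0.0)

# Toy illustration: gradient steps pull an offending frequency back
# towards the Nyquist limit and then stop acting on it.
f = 12.0
for _ in range(50):
    f -= 0.1 * penalty_gradient(f, 8.0)
```

Because the penalty vanishes below the limit, it does not constrain kernels that are already alias-free.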

Experiments

We evaluate FlexConv across classification tasks on sequential and image benchmark datasets, and validate the ability of MAGNets to approximate complex functions. A complete description of the datasets used is given in Appx. B. Appx. D.2 reports the parameters used in all our experiments. Our code is publicly available at https://github.com/rjbruin/flexconv.

Bandwidth of methods with learnable sizes. First, we compare the bandwidth of MAGNet against N-Jet (Pintea et al., 2021) by optimizing each to fit simple targets: (i) Gabor filters of known frequency, (ii) random noise and (iii) an $11\times 11$ AlexNet kernel from the first layer (Krizhevsky et al., 2012). Fig. 4 shows that, even with 9 orders of Gaussian derivatives, N-Jets cannot fit high frequency signals in large kernels. Crucially, N-Jet models require many Gaussian derivative orders to model high frequency signals in large kernels: a hyperparameter which proportionally increases their inference time and parameter count. MAGNets, on the other hand, accurately model large high frequency signals. This allows FlexNets to learn large kernels with high frequency components.

Expressivity of MLP parameterizations. Next, we compare the descriptive power and convergence speed of MAGNets, Gabor MFNs, Fourier MFNs and SIRENs for image approximation. To this end, we fit the images in the Kodak dataset (Kodak, 1991) with each of these methods. Our results (Tab. 5) show that MAGNets outperform all other methods, and converge faster to good approximations.

4.2 Classification Tasks

Network specifications. Here, we specify our networks for all our classification experiments. We parameterize all our convolutional kernels as the superposition of a 3-layer MAGNet and a learnable anisotropic Gaussian mask. We construct two network instances for sequential and image datasets respectively: FlexTCNs and FlexNets. Both are constructed by taking the structure of a baseline network –TCN (Bai et al., 2018a) or CIFARResNet (He et al., 2016)–, removing all internal pooling layers, and replacing convolutional kernels by FlexConvs. The FlexNet architecture is shown in Fig. 10 and varies only in the number of channels and blocks, e.g., FlexNet-16 has 7 blocks. Akin to Romero et al. (2021) we utilize the convolution theorem, i.e., convolution in the Fourier domain, to speed up convolutions with large kernels.
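The idea behind Fourier-domain convolution can be sketched in a few lines. This is a didactic example under stated assumptions, not the FlexNet implementation: it uses a naive O(n²) DFT from the standard library where a real implementation would call an FFT routine, and the helper names are hypothetical.

```python
import cmath

def dft(signal, inverse=False):
    # Naive discrete Fourier transform; an FFT computes the same result faster.
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def direct_conv(x, h):
    # Full linear convolution computed from the definition.
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out

def fourier_conv(x, h):
    # Zero-pad both inputs to the full output length so that the circular
    # convolution implied by the DFT equals the linear convolution.
    n = len(x) + len(h) - 1
    xf = dft(list(x) + [0.0] * (n - len(x)))
    hf = dft(list(h) + [0.0] * (n - len(h)))
    product = [a * b for a, b in zip(xf, hf)]
    return [v.real for v in dft(product, inverse=True)]

signal = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.0, -1.0]
via_definition = direct_conv(signal, kernel)
via_fourier = fourier_conv(signal, kernel)
```

With an FFT, the transform-domain route costs O(n log n) regardless of kernel length, whereas the direct route scales with the product of signal and kernel lengths, which is why the former pays off only for large kernels.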

Mask initialization. We initialize the FlexConv masks to be small. Preliminary experiments show this leads to better performance, faster execution, and faster training convergence. For sequences, the mask center is initialized at the last kernel position to prioritize the last information seen.

Time series and sequential data. First we evaluate FlexTCNs on sequential classification datasets, for which long-term dependencies play an important role. We validate our approach on intrinsically discrete data: sequential MNIST, permuted MNIST (Le et al., 2015), sequential CIFAR10 (Chang et al., 2017), noise-padded CIFAR10 (Chang et al., 2019), as well as time-series data: CharacterTrajectories (CT) (Bagnall et al., 2018), SpeechCommands (Warden, 2018) with raw waveform (SC_raw) and MFCC input representations (SC).

Our results are summarized in Tables 1 and 3. FlexTCNs with two residual blocks obtain state-of-the-art results on all tasks considered. In addition, depth further improves performance. FlexTCN-6 improves the current state-of-the-art on sCIFAR10 and npCIFAR10 by more than 6%. On the difficult SC_raw dataset –with sequences of length 16000–, FlexTCN-6 outperforms the previous state-of-the-art by 20.07%: a remarkable improvement.

Furthermore, we conduct ablation studies by changing the parameterization of MLPψ, and switching off the learnable kernel size ("CKTCNs") and considering global kernel sizes instead. CKTCNs and FlexTCNs with MAGNet kernels outperform corresponding models with all other kernel parameterizations: SIRENs (Sitzmann et al., 2020), MGNs and MFNs (Fathony et al., 2021). Moreover, we see a consistent improvement with respect to CKCNNs (Romero et al., 2021) by using learnable kernel sizes. This shows that both MAGNets and learnable kernel sizes contribute to the performance of FlexTCNs. Note that in 1D, MAGNets are equivalent to MGNs. However, MAGNets consistently perform better than MGNs. This improvement in accuracy is a result of our MAGNet initialization.

Image classification. Next, we evaluate FlexNets for image classification on CIFAR-10 (Krizhevsky et al., 2009). Additional experiments on Imagenet-32, MNIST and STL-10 can be found in Appx. C.

Table 3 shows our results on CIFAR-10. FlexNets are competitive with pooling-based methods such as CIFARResNet (He et al., 2016) and outperform DCNs (Tomen et al., 2021), a method with learnable kernel sizes. In addition, we compare using N-Jet layers of order three (as in Pintea et al. (2021)) in FlexNets against using MAGNet kernels. We observe that N-Jet layers lead to worse performance, and are significantly slower than FlexConv layers with MAGNet kernels. The low accuracy of N-Jet layers is likely linked to the fact that FlexNets do not use pooling. Consequently, N-Jets are forced to learn large kernels with high frequencies, which we show N-Jets struggle to learn in Sec. 4.1.

To illustrate the effect of learning kernel sizes, we also compare FlexNets against FlexNets with large and small discrete convolutional kernels (Tab. 3). Using small kernel sizes is parameter efficient, but is not competitive with FlexNets. Large discrete kernels on the other hand require a copious amount of parameters and lead to significantly worse performance. These results indicate that the best solution is somewhere in the middle and varying kernel sizes can learn the optimal kernel size for the task at hand.

Similar to the sequential case, we conduct ablation studies on image data with learnable and non-learnable kernel sizes and different kernel parameterizations. Table 3 shows that FlexNets outperform CKCNNs with corresponding kernel parameterizations. In addition, a clear difference in performance is apparent for MAGNets with respect to other parameterizations. These results corroborate that both MAGNets and FlexConvs contribute to the performance of FlexNets. Moreover, Tab. 3 illustrates the effect of the two contributions of MAGNet over MGN: anisotropic Gabor filters, and our improved initialization. Our results on image data are consistent with our previous results for sequential data (Tabs. 1, 3) and illustrate the value of the proposed improvements in MAGNets.

4.3 Alias-free FlexNets

Figure 5 shows accuracy change between ten source and target resolution combinations on CIFAR-10, both for including and excluding the FlexConv mask in the aliasing regularization. We train at the source resolution for 100 epochs, before testing the model at the target resolution with the upsampling described in Sec. 3.3. Next, we adjust $f_{\textrm{Nyq}}(k)$ to the target resolution, and finetune each model for 100 epochs at the target resolution.

We find that regularizing just $f^{+}_{\textrm{MAGNet}}$ yields a trade-off. It increases the accuracy difference between low and high resolution inference, but also increases the fine-tune accuracy at the target resolution. We therefore choose to, by default, regularize $f^{+}_{\textrm{MAGNet}}$ only.

Results of our alias-free FlexNet training on CIFAR-10 are in Table 4. We observe that the performance of a FlexNet trained without aliasing regularization largely breaks down when the dataset is upscaled. However, with our aliasing regularization most of the performance is retained.

Comparatively, FlexNet retains more of the source resolution performance than FlexNets with N-Jet layers, while baselines degrade drastically at the target resolution. Fig. 8 shows the effect of aliasing regularization on the frequency components of FlexConv.

Training at lower resolutions saves compute. We can train alias-free FlexNets at lower resolutions. To verify that this saves compute, we time the first 32 batches of training a FlexNet-7 on CIFAR-10 at the native resolution, and compare against training on $16\times 16$ images (downsampled before training). On $16\times 16$ images, each batch takes 179ms ( $\pm$ 7ms). On $32\times 32$ images, each batch takes 222ms ( $\pm$ 9ms). Training at the native resolution is therefore 24% slower per batch than training at half resolution; equivalently, training FlexNets alias-free at half the native CIFAR-10 resolution saves roughly 19% of the per-batch training time.
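The relative cost follows directly from the two per-batch timings. The short check below computes it both ways, since the percentage depends on which timing is taken as the reference; the variable names are illustrative.

```python
t_low = 179.0    # ms per batch on 16x16 images (from the text)
t_high = 222.0   # ms per batch on 32x32 images (from the text)

# Fraction of native-resolution time saved by training at low resolution.
saving = (t_high - t_low) / t_high
# Extra time needed at native resolution relative to low resolution.
slowdown = (t_high - t_low) / t_low

saving_pct = round(100 * saving)
slowdown_pct = round(100 * slowdown)
```

Relative to the 222ms baseline the reduction is about 19%, while relative to the 179ms low-resolution timing the difference is about 24%.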

Discussion

Learned kernel sizes match conventional priors. Commonly, CNNs use architectures of small kernels and pooling layers. This allows convolutions to build a progressively growing receptive field. With learnable kernel sizes, FlexNet could learn a different prior over receptive fields, e.g., large kernels first, and small kernels next. However, FlexNets learn to increase kernel sizes progressively (Fig. 6), and match the network design that has been popular since AlexNet (Krizhevsky et al., 2012).

Mask initialization as a prior for feature importance. The initial values of the FlexConv mask can be used to prioritize information at particular input regions. For instance, initializing the center of the mask on the first element of sequential FlexConvs can be used to prioritize information from the far past. This prior is advantageous for tasks such as npCIFAR10. We observe that using this prior on npCIFAR10 leads to much faster convergence and better results (68.33% acc. w/ FlexTCN-2).

MAGNet regularization as prior induction. MAGNets allow for analytic control of the properties of the resulting representations. We use this property to generate alias-free kernels. However, other desiderata could be induced, e.g., smoothness, for the construction of implicit neural representations.

Limitations

Dynamic kernel sizes: computation and memory cost of convolutions with large kernels. Performing convolutions with large convolutional kernels is a compute-intensive operation. FlexConvs are initialized with small kernel sizes and their inference cost is relatively small at the start of training. However, despite the cropping operations used to improve computational efficiency (Figs. 1, 3, Tab. 3), the inference time may up to double as the learned masks increase in size. At the cost of more memory, convolutions can be sped up by performing them in the frequency domain. However, we observe that this does not bring gains for the image data considered because FFT convolutions are faster only for very large convolutional kernels (in the order of hundreds of pixels).

Conclusion

We propose FlexConv, a convolutional operation able to learn high bandwidth convolutional kernels of varying size during training at a fixed parameter cost. We demonstrate that FlexConvs are able to model long-term dependencies without the need for pooling, and that shallow pooling-free FlexNets achieve state-of-the-art performance on several sequential datasets, match the performance of recent works with learned kernel sizes with less compute, and are competitive with much deeper ResNets on image benchmark datasets. In addition, we show that our alias-free convolutional kernels allow FlexNets to be deployed at higher resolutions than seen during training with minimal precision loss.

Future work. MAGNets give control over the bandwidth of the kernel. We anticipate that this control has more uses, such as fighting sub-sampling aliasing (Zhang, 2019; Kayhan & Gemert, 2020; Karras et al., 2021). With the ability to upscale FlexNets to different input image sizes comes the possibility of transfer learning representations between previously incompatible datasets, such as CIFAR-10 and Imagenet. In a similar vein, the automatic adaptation of FlexConv to the kernel sizes required for the task at hand may make it possible to generalize the FlexNet architecture across different tasks and datasets. Neural architecture search (Zoph & Le, 2016) could see benefits from narrowing the search space to exclude kernel size and pooling layers. In addition, we envisage additional improvements from structural developments of FlexConvs such as attentive FlexNets.

Reproducibility Statement

We hope to inspire others to use and reproduce our work. We publish the source code of this work, for which the link is provided in Sec. 4.2. Sec. 4 and Appx. D.1 detail FlexNet, its hyperparameters and optimization procedure. The full derivation of the aliasing regularization objective is included in Appx. A.1. We report means over multiple runs for many experiments, to ensure the reported results are fair and reproducible, and do not rely on tuning of the random seed. All datasets used in our experiments are publicly available. If any questions remain, we welcome one and all to contact the corresponding author.

Acknowledgments

We thank Nergis Tömen for her valuable insights regarding signal processing principles for FlexConv, and Silvia-Laura Pintea for explanations and access to code of her work Pintea et al. (2021). We thank Yerlan Idelbayev for the use of the CIFARResNet code.

This work is co-supported by the Qualcomm Innovation Fellowship granted to David W. Romero. David W. Romero sincerely thanks Qualcomm for its support. David W. Romero is financed as part of the Efficient Deep Learning (EDL) programme (grant number P16-25), partly funded by the Dutch Research Council (NWO). Robert-Jan Bruintjes is financed by the Dutch Research Council (NWO) (project VI.Vidi.192.100). All authors sincerely thank everyone involved in funding this work.

This work was partially carried out on the Dutch national infrastructure with the support of SURF Cooperative. We used Weights & Biases (Biewald, 2020) for experiment tracking and visualizations.

References

Appendix A Alias-free FlexConv regularization

In this section we provide the complete derivation and analysis for our FlexConv regularization against aliasing. First, we derive the analytic maximum frequency component of a FlexConv kernel. Next, we compute the Nyquist frequency of a FlexConv kernel, and subsequently show how to combine the previous results into a regularization term to train alias-free FlexConvs.

In order to make FlexConv alias-free (Sec. 3.3), we need to compute the maximum frequency component of the kernels generated by a MAGNet, so that we can regularize it during training. In this section we analytically derive this maximum frequency component from the parameters of the MAGNet.

Recall that MAGNets generate a kernel $\bm{\psi}(x,y)$ through a succession of anisotropic Gabor filters and linear layers (Sec. 3.2, Eqs. 2–7):

To analyse the maximum frequency component $f^{+}_{\textrm{MAGNet}}$ , we analyse the frequency components of the Gabor filters used in MAGNet, and retain their maximum. We then plug the found frequency component into the analysis of Fathony et al. (2021) to show how the frequency responses of Gabor filters and linear layers interact in MFNs. Finally, we add the effect of the FlexConv Gaussian mask to our analysis to obtain the maximum frequency component of the final FlexConv kernel $f^{+}_{\textrm{FlexConv}}$ .

The other assumption we made before was to work with single-channel outputs. MAGNets however use multi-channel outputs with independent Gaussian terms. The maximum frequency of multi-channel Gaussian envelopes is given by:

Figure 9 illustrates the frequency spectrum of an example Gabor filter.

Maximum frequency component of a MAGNet. Fathony et al. (2021) characterize the expansion of each term of the isotropic Gabor layers in MFNs in the final MFN output. In Eq. 25, Fathony et al. (2021) demonstrate that the MFN representation contains a set of sine frequencies $\bm{\overline{\omega}}$ given by:

Visualization of regularized kernels. Fig. 8 shows example kernels from FlexNets trained with aliasing regularization. The frequency domain plots confirm the accuracy of our frequency component regularization.

A.2 Regularizing the frequency response of FlexConv

Knowing the sampling rate in terms of the kernel size allows us to express the Nyquist frequency in terms of the (pre-masked) kernel size:

Note that the kernel size in a FlexConv is initialized to be equal to the resolution of the data, if it is odd. For even resolutions, it corresponds to the resolution of the data plus one.
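The relation between data resolution, initial kernel size and Nyquist frequency can be sketched as follows. The kernel-size rule is the one stated above; the Nyquist expression additionally assumes that kernel coordinates span the interval [-1, 1], which is not stated in this section, and both function names are hypothetical.

```python
def initial_kernel_size(resolution):
    # Odd resolutions are used as-is; even resolutions are increased by
    # one (presumably so that the kernel has a center position).
    return resolution if resolution % 2 == 1 else resolution + 1

def nyquist_frequency(kernel_size):
    # Assumption: k samples spread over [-1, 1] give a sampling rate of
    # (k - 1) / 2 per unit length; the Nyquist frequency is half of that.
    return (kernel_size - 1) / 4.0

k_cifar = initial_kernel_size(32)        # native CIFAR-10 resolution
k_half = initial_kernel_size(16)         # half-resolution training
f_cifar = nyquist_frequency(k_cifar)
f_half = nyquist_frequency(k_half)
```

Under these assumptions, a kernel trained at half resolution is constrained to half the bandwidth of a kernel trained at native resolution, which is the limit a regularizer would enforce before upsampling.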

Constructing the regularization term. We train FlexConv with a regularization term on the frequency response of the generated kernel to ensure that aliasing effects do not distort the performance of the model when it is inferred at a higher resolution. This section details the implementation of the regularization function.

In the code, we refer to this method as the together method, versus the summed method of Eq. 25. In preliminary experiments, we observed improved performance of anti-aliasing training when using the together method. All of our anti-aliasing experiments therefore use the together setting.

Appendix B Dataset Description

Kodak dataset. The Kodak dataset (Kodak, 1991) consists of 24 natural images of size $768\times 512$ . This dataset is a popular benchmark used for compression and image fitting methods.

B.2 Sequential Datasets

Sequential and Permuted MNIST. The sequential MNIST dataset (sMNIST) (Le et al., 2015) takes the $28{\times}28$ images from the original MNIST dataset (LeCun et al., 1998), and presents them as a sequence of 784 pixels. The goal of this task is to perform digit classification given the representation of the last sequence element of a sequential model. Consequently, good predictions require the model to preserve long-term dependencies up to 784 steps in the past.

The permuted MNIST dataset (pMNIST) additionally changes the order of all the sMNIST sequences by a random permutation. Consequently, models can no longer rely on local features to construct good feature representations. As a result, the classification problem becomes more difficult, and the importance of long-term dependencies more pronounced.
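Both constructions amount to a flattening step followed, for pMNIST, by one fixed permutation shared across the dataset. The sketch below uses a dummy image and illustrative helper names; the seed value is arbitrary, and row-major flattening is an assumption.

```python
import random

def to_sequence(image):
    # Flatten a 2D image (list of rows) into one pixel sequence, row by row.
    return [pixel for row in image for pixel in row]

def make_permutation(length, seed=0):
    # A single fixed random permutation, reused for every sequence.
    rng = random.Random(seed)
    perm = list(range(length))
    rng.shuffle(perm)
    return perm

def permute(sequence, perm):
    return [sequence[i] for i in perm]

# Dummy 28x28 "image" whose pixel values encode their own position.
image = [[r * 28 + c for c in range(28)] for r in range(28)]
seq = to_sequence(image)
perm = make_permutation(len(seq))
pseq = permute(seq, perm)
```

Since the same permutation is applied to every sample, the task remains learnable, but neighbouring sequence elements no longer correspond to neighbouring pixels.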

Sequential and Noise-Padded CIFAR10. The sequential CIFAR10 dataset (sCIFAR10) (Chang et al., 2017) takes the $32{\times}32$ images from the original CIFAR10 dataset (Krizhevsky et al., 2009) and presents them as a sequence of 1,024 pixels. The goal of this task is to perform image classification given the representation of the last sequence element of a sequential model. This task is more difficult than sMNIST, as a larger memory horizon is required to solve the task and more complex structures and intra-class variations are present in the data (Bai et al., 2018b).

The noise-padded CIFAR10 dataset (npCIFAR10) (Chang et al., 2019) flattens the images from the original CIFAR10 dataset (Krizhevsky et al., 2009) along their rows to create a sequence of length 32, and 96 channels (32 rows $\times$ 3 channels). Next, these sequences are concatenated with 968 entries of noise to form the final sequences of length 1000. As for sCIFAR10, the goal of the task is to perform image classification given the representation of the last sequence element of a sequential model.
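The construction can be sketched as follows. The nested-list image layout, the uniform noise distribution and the helper name are assumptions for illustration; the text only specifies the sequence lengths and the channel count.

```python
import random

def noise_padded_sequence(image, total_length=1000, seed=0):
    # `image` is a list of 32 rows; each row holds 32 pixels and each pixel
    # holds 3 channel values. Every row becomes one time step with
    # 32 * 3 = 96 channels.
    steps = [[value for pixel in row for value in pixel] for row in image]
    channels = len(steps[0])
    rng = random.Random(seed)
    # Append noise steps (uniform noise assumed) up to the target length.
    while len(steps) < total_length:
        steps.append([rng.random() for _ in range(channels)])
    return steps

image = [[[0.5, 0.5, 0.5] for _ in range(32)] for _ in range(32)]
sequence = noise_padded_sequence(image)
```

All informative content sits in the first 32 steps, so a model that classifies from the last element must carry it across 968 steps of noise.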

CharacterTrajectories. The CharacterTrajectories dataset is part of the UEA time series classification archive (Bagnall et al., 2018). It consists of 2858 time series of length 182 and 3 channels representing the $x,y$ positions, and the tip force of a pen while writing Latin alphabet characters in a single stroke. The goal is to classify the written character into one of 20 classes using the time series data.

B.3 Image Benchmark Datasets

MNIST. The MNIST handwritten digits dataset (LeCun & Cortes, 2010) consists of 70,000 gray-scale handwritten digits of size $28{\times}28$ , divided into training and test sets of 60,000 and 10,000 images, respectively. The goal of the task is to classify these digits as one of the ten possible digits $(0,1,\ldots,8,9)$ .

CIFAR-10. The CIFAR-10 dataset (Krizhevsky et al., 2009) consists of 60,000 natural images from 10 classes of size $32{\times}32$ , divided into training and test sets of 50,000 and 10,000 images, respectively.

STL-10. The STL-10 dataset (Coates et al., 2011) is a subset of the ImageNet dataset (Krizhevsky et al., 2012) consisting of 5,000 natural images from 10 classes of size $96{\times}96$ , divided into training and test sets of 4,500 and 500 images, respectively.

Appendix C Additional Experiments

CIFAR-10. Tab. 6 shows all results for our CIFAR-10 experiments, including more ablations.

ImageNet-32. Results for the ImageNet-32 experiment are shown in Table 7. FlexNets are slightly worse than CIFARResNet-32 (He et al., 2016) with slightly fewer parameters. However, the results reported by Chrabaszcz et al. (2017) for Wide ResNets (Zagoruyko & Komodakis, 2016) outperform FlexNets by a significant margin.

MNIST and STL-10. We additionally report results on MNIST (Tab. 9) and STL-10 (Tab. 10). We choose these datasets for the difference in image sizes of the training data. Although performance on MNIST is quite saturated, we are competitive with state-of-the-art methods. On STL-10 we are significantly worse than the baseline CIFARResNet from Luo et al. (2020), though with significantly fewer parameters. We were not able to prepare a more relevant baseline for this experiment.

Appendix D Experimental Details

FlexBlock. Each FlexBlock consists of two FlexConvs with BatchNorm (Ioffe & Szegedy, 2015) and dropout (Srivastava et al., 2014) ( $d=0.2$ ) as well as a residual connection. The width of a block $i$ is determined by scaling a base amount $c$ by progressively increasing factors: $c_{i}=[c,c\times 1.5,c\times 1.5,c\times 2.0,c\times 2.0](i)$ . The default configuration of FlexNet uses $c=22$ . In FlexNet-N-Jet models, we scale $c$ to match the number of parameters of the FlexNet in the comparison.
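The width rule can be written out as a short helper. The function name is hypothetical, and truncating fractional channel counts to integers is an assumption, since the text does not say how non-integer widths are handled; for the base widths used here the products are already whole numbers.

```python
WIDTH_FACTORS = [1.0, 1.5, 1.5, 2.0, 2.0]  # scaling factors listed in the text

def block_widths(c, factors=WIDTH_FACTORS):
    # Channel count of block i is the base width c times the i-th factor.
    # int() truncation is an assumption for non-integer products.
    return [int(c * f) for f in factors]

default_widths = block_widths(22)   # default FlexNet base width
wider_widths = block_widths(24)     # base width with c = 24
```

For the default configuration this yields widths of 22, 33, 33, 44 and 44 channels.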

CIFAR-10. In FlexNet-16 models for CIFAR-10 we use $c=24$ to approximate the parameter count of CIFARResNets in the experiment.

D.2 Optimization

We use Adam (Kingma & Ba, 2014) to optimize FlexNet. Unless otherwise specified, we use a learning rate of $0.01$ with a cosine annealing scheme (Loshchilov & Hutter, 2016) with five warmup epochs. We use a different learning rate of $0.1\times$ the regular learning rate for the FlexConv Gaussian mask parameters. We do not use weight decay, unless otherwise specified.
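The schedule can be sketched as a plain function of the epoch index. Linear warmup and annealing down to zero are assumptions, as the text only specifies cosine annealing with five warmup epochs; the function names are illustrative and the optimizer itself is not shown.

```python
import math

def learning_rate(epoch, total_epochs, base_lr=0.01, warmup_epochs=5):
    # Linear warmup (assumed) over the first `warmup_epochs` epochs.
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing from base_lr towards zero over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def mask_learning_rate(epoch, total_epochs, **kwargs):
    # The Gaussian mask parameters use 0.1x the regular learning rate.
    return 0.1 * learning_rate(epoch, total_epochs, **kwargs)

schedule = [learning_rate(e, 350) for e in range(350)]
mask_schedule = [mask_learning_rate(e, 350) for e in range(350)]
```

In a framework with parameter groups, the same effect would typically be obtained by placing the mask parameters in their own group with a scaled base learning rate.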

Kodak. We overfit on each image of the dataset for 20,000 iterations. To this end, we use a learning rate of 0.01 without any learning rate scheme. We observe that SIRENs diverge with this learning rate and thus, reduce the learning rate to 0.001 for these models.

CIFAR-10. We train for 350 epochs with a batch size of 64. We use the data augmentation from He et al. (2016) when training CIFAR-10: a four pixel padding, followed by a random 32 pixel crop and a random horizontal flip.
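The augmentation pipeline can be sketched on a single-channel image stored as nested lists. Zero-valued padding and a flip probability of 0.5 are assumptions (the text states only the padding width, crop size and the presence of a random flip), and the function name is hypothetical.

```python
import random

def augment(image, pad=4, crop=32, rng=random):
    # `image` is a list of rows, each a list of scalar pixel values
    # (single channel for brevity).
    height, width = len(image), len(image[0])
    # Zero-pad `pad` pixels on every side (padding value assumed).
    blank = [0] * (width + 2 * pad)
    padded = [list(blank) for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [list(blank) for _ in range(pad)]
    # Random crop back to crop x crop pixels.
    top = rng.randint(0, height + 2 * pad - crop)
    left = rng.randint(0, width + 2 * pad - crop)
    out = [row[left:left + crop] for row in padded[top:top + crop]]
    # Random horizontal flip (probability 0.5 assumed).
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]
    return out

image = [[1] * 32 for _ in range(32)]
augmented = augment(image, rng=random.Random(0))
```

The output has the same spatial size as the input, so the augmentation can be applied per sample without changing the network.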