MAXIM: Multi-Axis MLP for Image Processing

Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

Introduction

Image processing tasks, such as restoration and enhancement, are important computer vision problems that aim to produce a desired output from a degraded input. Different types of degradation may require different treatments, such as denoising, deblurring, super-resolution, dehazing, and low-light enhancement. Given the increased availability of curated large-scale training datasets, recent high-performing approaches based on carefully designed convolutional neural networks (CNNs) have demonstrated state-of-the-art (SOTA) performance on many tasks.

Improving the architectural design of the underlying model is one of the keys to improving performance on most computer vision tasks, including image restoration. Numerous researchers have invented or borrowed individual modules or building blocks and applied them to low-level vision tasks, including residual learning, dense connections, hierarchical structures, multi-stage frameworks, and attention mechanisms.

Recent research explorations on Vision Transformers (ViT) have exemplified their great potential as alternatives to the go-to CNN models. The elegance of ViT has also motivated similar model designs with simpler global operators such as MLP-Mixer, gMLP, GFNet, and FNet, to name a few. Despite successful applications to many high-level tasks, the efficacy of these global models on low-level enhancement and restoration problems has not been studied extensively. The pioneering works on Transformers for low-level vision directly applied full self-attention, which only accepts relatively small patches of fixed sizes (e.g., 48 $\times$ 48). Such a strategy will inevitably cause patch boundary artifacts when applied on larger images using cropping. Local-attention based Transformers ameliorate this issue, but they are also constrained to have limited sizes of receptive field, or to lose non-locality, which is a compelling property of Transformers and MLP models relative to hierarchical CNNs.

To overcome these issues, we propose a generic image processing network, dubbed MAXIM, for low-level vision tasks. A key design element of MAXIM is a multi-axis approach (Sec. 3.2) that captures both local and global interactions in parallel. By mixing information on a single axis for each branch, this MLP-based operator becomes ‘fully-convolutional’ and scales linearly with respect to image size, which significantly increases its flexibility for dense image processing tasks. We also define and build a pure MLP-based cross-gating module, which adaptively gates the skip-connections in the neck of MAXIM using the same multi-axis approach and further boosts performance. Inspired by recent restoration models, we develop a simple but effective multi-stage, multi-scale architecture consisting of a stack of MAXIM backbones. MAXIM achieves strong performance on a range of image processing tasks while requiring few parameters and FLOPs. Our contributions are:

A novel and generic architecture for image processing, dubbed MAXIM, using a stack of encoder-decoder backbones, supervised by a multi-scale, multi-stage loss.

A multi-axis gated MLP module tailored for low-level vision tasks, which always enjoys a global receptive field, with linear complexity relative to image size.

A cross gating block that cross-conditions two separate features, which is also global and fully-convolutional.

Extensive experiments show that MAXIM achieves SOTA results on more than 10 datasets including denoising, deblurring, deraining, dehazing, and enhancement.

Related Work

Restoration models. Driven by recent enormous efforts on building vision benchmarks, learning-based models, especially CNN models, have been developed that attain state-of-the-art performance on a wide variety of image enhancement tasks. These performance gains can be mainly attributed to novel architecture designs and/or task-specific modules and units. For instance, UNet has incubated many successful encoder-decoder designs for image restoration that improve on earlier single-scale feature processing models. Advanced components developed for high-level vision tasks have been brought into low-level vision tasks as well. Residual and dense connections, multi-scale feature learning, attention mechanisms, and non-local networks are good examples. Recently, multi-stage networks have attained promising results relative to the aforementioned single-stage models on the challenging deblurring and deraining tasks. These multi-stage frameworks are generally inspired by their success on higher-level problems such as pose estimation, action segmentation, and image generation.

Low-level vision Transformers. Transformers were originally proposed for NLP tasks, where multi-head self-attention and feed-forward MLP layers are stacked to capture non-local interactions between words. Dosovitskiy et al. coined the term Vision Transformer (ViT) and demonstrated the first pure Transformer model for image recognition. Several recent studies explored Transformers for low-level vision problems, e.g., the pioneering pre-trained image processing Transformer (IPT). Similar to ViT, IPT directly applies vanilla Transformers to image patches. Another line of work presented a spatial-temporal convolutional self-attention network that exploits local information for video super-resolution. More recently, Swin-IR and UFormer apply efficient window-based local attention models on a range of image restoration tasks.

MLP vision models. More recently, several authors have argued that when using a patch-based architecture as in ViT, the necessity of complex self-attention mechanisms becomes questionable. For instance, MLP-Mixer adopts a simple token-mixing MLP to replace self-attention in ViT, resulting in an all-MLP architecture. gMLP applies a spatial gating unit on visual tokens. ResMLP adopts an affine transformation as a substitute for Layer Normalization for acceleration. Very recent techniques such as FNet and GFNet demonstrate that the simple Fourier Transform can be used as a competitive alternative to either self-attention or MLPs.

Our Approach: MAXIM

We present, to the best of our knowledge, the first effective general-purpose MLP architecture for low-level vision, which we call Multi-AXIs MLP for image processing (MAXIM). Unlike previous low-level Transformers, MAXIM has several desirable properties that make it attractive for image processing tasks. First, MAXIM expresses global receptive fields on arbitrarily large images with linear complexity. Second, it directly supports arbitrary input resolutions, i.e., it is fully convolutional. Lastly, it provides a balanced design of local (Conv) and global (MLP) blocks, outperforming SOTA methods without requiring large-scale pre-training.

The MAXIM backbone (Fig. 2a) follows the encoder-decoder design principles that originated with UNet . We have observed that operators having small footprints such as Conv3x3 are essential to the performance of UNet-like networks. Thus, we rely on a hybrid model design for each block (Fig. 2b) – Conv for local, and MLP for long-range interactions – to make the most of them.

To allow long-range spatial mixing at different scales, we insert the multi-axis gated MLP block (MAB) into each encoder, decoder, and bottleneck (Fig. 2b), followed by a residual channel attention block (RCAB) (LayerNorm-Conv-LeakyReLU-Conv-SE). Inspired by the gated filtering of skip connections, we extend the gated MLP (gMLP) to build a cross gating block (CGB, Fig. 2c), an efficient 2nd-order alternative to cross-attention (3rd-order correlations), to let two distinct features interact with, or condition, each other. We leverage the global features from the Bottleneck (Fig. 2a) to gate the skip connections, while propagating the refined global features upwards to the next CGB. Multi-scale feature fusion (red and blue lines) is used to aggregate multi-level information in the Encoder $\rightarrow$ CGB and CGB $\rightarrow$ Decoder dataflow.

3.2 Multi-Axis Gated MLP

Our work is inspired by multi-axis blocked self-attention, which performs attention on more than a single axis. The attentions performed on two axes on blocked images correspond to two forms of sparse self-attention, namely regional and dilated attention. Despite capturing local and global information in parallel, this module cannot accommodate image restoration or enhancement tasks where the test images are often of arbitrary sizes.

We improve the ‘multi-axis’ concept for image processing tasks by building a (split-head) multi-axis gated MLP block (MAB), as shown in Fig. 3. Instead of applying multi-axis attention in a single layer, we first split the heads in half, and each half is partitioned independently. In the local branch, the half head of a feature of size $(H,W,C/2)$ is blocked into a tensor of shape $(\frac{H}{b}\times\frac{W}{b},b\times b,C/2)$, representing a partition into non-overlapping windows, each of size $(b\times b)$; in the global branch, the other half head is gridded into the shape $(d\times d,\frac{H}{d}\times\frac{W}{d},C/2)$ using a fixed $(d\times d)$ grid, with each window having size $(\frac{H}{d}\times\frac{W}{d})$. For visualization, we set $b=2$, $d=2$ in Fig. 3. To make the block fully convolutional, we apply the gated MLP (gMLP) block only on a single axis of each branch (the 2nd axis for the local branch and the 1st axis for the global branch), while sharing parameters on the other spatial axes. Intuitively, applying multi-axis gMLPs in parallel corresponds to local and global (dilated) mixing of spatial information, respectively. Finally, the processed heads are concatenated and projected to reduce the number of channels, and the result is combined with the input through a long skip-connection. It is worth noting that this approach gives our model an advantage over methods that process fixed-size image patches, since it avoids patch boundary artifacts.
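The two partitions described above can be illustrated with a short NumPy sketch. It is a shape-level illustration only: the function names `block_partition` and `grid_partition` are hypothetical, the batch dimension is omitted, and $H$ and $W$ are assumed to be divisible by $b$ and $d$.

```python
import numpy as np

def block_partition(x, b):
    """(H, W, C) -> (H/b * W/b, b*b, C): non-overlapping b x b windows."""
    h, w, c = x.shape
    x = x.reshape(h // b, b, w // b, b, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/b, W/b, b, b, C)
    return x.reshape((h // b) * (w // b), b * b, c)

def grid_partition(x, d):
    """(H, W, C) -> (d*d, H/d * W/d, C): a fixed d x d grid of windows."""
    h, w, c = x.shape
    x = x.reshape(d, h // d, d, w // d, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (d, d, H/d, W/d, C)
    return x.reshape(d * d, (h // d) * (w // d), c)

# Toy feature map with H=8, W=12, C=4, split into two heads of C/2 channels.
x = np.arange(8 * 12 * 4, dtype=np.float32).reshape(8, 12, 4)
local_half, global_half = np.split(x, 2, axis=-1)

blocks = block_partition(local_half, b=2)     # mix along axis 1 (inside a window)
grid = grid_partition(global_half, d=2)       # mix along axis 0 (across windows)
print(blocks.shape, grid.shape)
```

In this layout, the mixed axis has length $b^2$ in the local branch and $d^2$ in the global branch, so the size of any weight matrix applied along it does not depend on $H$ or $W$.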

Complexity analysis. The computational complexity of our proposed Multi-Axis gMLP block (MAB) is:

which is linear with respect to image size $HW$ , while other global models like ViT, Mixer, and gMLP are quadratic.
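The linear scaling can be checked with a rough operation count. The sketch below is an assumption-laden estimate rather than the paper's complexity expression: it counts only the multiply-accumulates of the two spatial projections (one $b^2\times b^2$ matrix per window in the local branch, one $d^2\times d^2$ matrix per within-window position in the global branch), ignores the channel projections, and uses the hypothetical helper names `spatial_mixing_macs` and `full_mixing_macs`.

```python
def spatial_mixing_macs(h, w, c, b, d):
    """Multiply-accumulates of the local and global spatial projections."""
    half = c // 2
    # Local branch: (H/b * W/b) windows, each mixed by a (b*b x b*b) matrix.
    local = (h // b) * (w // b) * (b * b) ** 2 * half
    # Global branch: (H/d * W/d) positions, each mixed by a (d*d x d*d) matrix.
    global_ = (h // d) * (w // d) * (d * d) ** 2 * half
    return local + global_

def full_mixing_macs(h, w, c):
    """Dense token mixing over all H*W positions, shown for contrast."""
    return (h * w) ** 2 * c

for side in (64, 128, 256):
    print(side,
          spatial_mixing_macs(side, side, 32, 8, 8),
          full_mixing_macs(side, side, 32))
```

Under these assumptions, doubling the image side multiplies the multi-axis count by 4 (linear in $HW$) and the dense token-mixing count by 16 (quadratic in $HW$).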

Universality of the multi-axis approach. Our proposed parallel multi-axis module (Fig. 3) presents a principled way to apply 1D operators on 2D images in a scalable manner. It also allows for significant flexibility and universality. For example, a straightforward replacement of a gMLP with a spatial MLP, self-attention, or even the Fourier Transform leads to a family of MAXIM variants (see Sec. 4.3D), all of which are global and fully convolutional. It is also easily extensible to any future 1D operator that may be defined on, e.g., language models.

3.3 Cross Gating MLP Block

A common improvement over UNet is to leverage contextual features to selectively gate feature propagation in skip-connections , which is often achieved by using cross-attention . Here we build an effective alternative, namely cross-gating block (CGB, Fig. 2c), as an extension of MAB (Sec. 3.2) which can only process a single feature. CGB can be regarded as a more general conditioning layer that interacts with multiple features . We follow similar design patterns as those used in MAB.

where $\sigma$ is the $\mathsf{GELU}$ activation , $\mathsf{LN}$ is Layer Normalization , and $\mathbf{W}_{1},\mathbf{W}_{2}$ are MLP projection matrices. The multi-axis blocked gating weights are computed from $\mathbf{X}_{2},\mathbf{Y}_{2}$ , respectively, but applied reciprocally:

where $[\cdot,\cdot]$ denotes concatenation. Here $(\mathbf{z_{1}},\mathbf{z_{2}})$ are two independent heads split from $\mathbf{z}$ along the channel dimension, where $\mathbf{z}$ represents the projected features $\mathbf{x}$ after activation:

and $\mathbf{W}_{3},\mathbf{W}_{4}$ are spatial projection matrices applied on the 2nd and 1st axis of the blocked/gridded features having fixed window size $b\times b$ ( $\mathsf{Block}_{b}$ ), and fixed grid size of $d\times d$ ( $\mathsf{Grid}_{d}$ ), respectively. Finally, we adopt residual connection from the inputs, following an output channel-projection that maintains the same channel dimensions as the inputs ( $\mathbf{X}_{1},\mathbf{Y}_{1}$ ), using projection matrices $\mathbf{W}_{7}$ , $\mathbf{W}_{8}$ , denoted by

The complexity of CGB is also tightly-bounded by Eq. 1.

3.4 Multi-Stage Multi-Scale Framework

We further adopt a multi-stage framework because we find it more effective, as compared to scaling up the model width or height (see ablation Sec. 4.3A). We deem full resolution processing a better approach than a multi-patch hierarchy , since the latter would potentially induce boundary effects across patches. To impose stronger supervision, we apply a multi-scale approach at each stage to help the network learn. We leverage the supervised attention module to propagate attentive features progressively along the stages. We leverage the cross-gating block (Sec. 3.3) for cross-stage feature fusion. We refer the reader to Fig. 9 for details.

where $\mathbf{T}_{n}$ denotes (bilinearly-rescaled) multi-scale target images, and $\mathcal{L}_{char}$ is the Charbonnier loss :

where we set $\epsilon=10^{-3}$ . $\mathcal{L}_{freq}$ is the frequency reconstruction loss that enforces high-frequency details :

where $\mathcal{F}(\cdot)$ represents the 2D Fast Fourier Transform. We used $\lambda=0.1$ as the weighting factor in all experiments.
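As a rough illustration of how the two loss terms combine, the NumPy sketch below uses the stated $\epsilon=10^{-3}$ and $\lambda=0.1$. The reduction is an assumption: the Charbonnier term is averaged per pixel and the frequency term is a mean absolute difference of 2D FFTs, which are common conventions but may differ from the exact normalization used here. The function names and the flat list of (prediction, target) pairs standing in for the stage and scale sums are hypothetical.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Smooth L1-like penalty: mean over pixels of sqrt(diff^2 + eps^2)."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))

def frequency_loss(pred, target):
    """Mean absolute difference between 2D FFTs over the spatial axes."""
    diff = np.fft.fft2(pred, axes=(0, 1)) - np.fft.fft2(target, axes=(0, 1))
    return float(np.mean(np.abs(diff)))

def total_loss(preds, targets, lam=0.1):
    """Sum of L_char + lam * L_freq over all (stage, scale) output pairs."""
    return sum(charbonnier_loss(p, t) + lam * frequency_loss(p, t)
               for p, t in zip(preds, targets))

rng = np.random.default_rng(0)
target = rng.random((16, 16, 3))
noisy = target + 0.1 * rng.standard_normal(target.shape)
print(total_loss([noisy], [target]), total_loss([target], [target]))
```

With a perfect prediction the frequency term vanishes and the Charbonnier term reduces to $\epsilon$, which is why the loss stays differentiable at zero error.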

Experiments

We aim to build a generic backbone for a broad spectrum of image processing tasks. Thus, we evaluated MAXIM on five different tasks: (1) denoising, (2) deblurring, (3) deraining, (4) dehazing, and (5) enhancement (retouching) on 17 different datasets (summarized in Tab. 8). More comprehensive results and visualizations can be found in Sec. A.6.

Datasets and metrics. We measured PSNR and SSIM between ground-truth and predicted images to make quantitative comparisons. We used SIDD and DND for denoising; GoPro, HIDE, and RealBlur for deblurring; and the combined dataset Rain13k for deraining. RESIDE is used for dehazing, while Five-K and LOL are used for enhancement.
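For readers unfamiliar with the metric, PSNR can be computed as in the minimal sketch below. The data range (`max_val=1.0`) and the absence of any color-space conversion are assumptions; the function name `psnr` is hypothetical and this is not the evaluation script used for the reported numbers.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

target = np.zeros((8, 8, 3))
pred = np.full((8, 8, 3), 0.1)   # uniform error of 0.1, so MSE = 0.01
print(psnr(pred, target))
```

Because the scale is logarithmic, a fixed dB gain corresponds to a fixed ratio in mean squared error rather than a fixed difference.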

Training details. Our proposed MAXIM model is end-to-end trainable and requires neither large-scale pretraining nor progressive training. The network is trained on $256\!\times\!256$ randomly cropped patches. We train for a different number of iterations on each task. We used random horizontal and vertical flips, $90^{\circ}$ rotation, and MixUp with probability $0.5$ for data augmentation. We used the Adam optimizer with an initial learning rate of $2\!\times\!10^{-4}$, which is steadily decreased to $10^{-7}$ with cosine annealing decay. When testing, we padded the input images so that both dimensions are a multiple of $64$, using symmetric padding on both sides. After inference, we cropped the padded image back to its original size. More training details on each task can be found in Sec. A.1.
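The learning-rate decay can be sketched as a standard half-cosine schedule between the two stated endpoints. This is a minimal illustration assuming no warm-up and no restarts; the function name `cosine_lr` and the `total_steps` argument are hypothetical.

```python
import math

def cosine_lr(step, total_steps, lr_max=2e-4, lr_min=1e-7):
    """Half-cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```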

Architectural configuration. We designed two MAXIM variants: a two-stage model called MAXIM-2S, and a three-stage model, MAXIM-3S, for different tasks. We start with $32$ initial channels for feature extraction, with 3 downsampling layers, where the features contract from $256^{2}\times 32$ , $128^{2}\times 64$ , $64^{2}\times 128$ , to $32^{2}\times 256$ processed by two Bottlenecks (Fig. 2a), then symmetrically expanded back to full resolution. The number of parameters and required FLOPs of MAXIM-2S and MAXIM-3S, when applied on a $256\times 256$ image are shown in the last two rows of Tab. 7A.
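The feature shapes listed above follow a simple rule: each downsampling layer halves the spatial size and doubles the channel count. A small sketch, with the hypothetical helper name `encoder_shapes`, makes the progression explicit for a $256\times 256$ input.

```python
def encoder_shapes(size=256, channels=32, num_downsamples=3):
    """Return (height, width, channels) at each encoder depth."""
    shapes = [(size, size, channels)]
    for _ in range(num_downsamples):
        size //= 2        # spatial resolution is halved
        channels *= 2     # channel width is doubled
        shapes.append((size, size, channels))
    return shapes

for shape in encoder_shapes():
    print(shape)
```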

4.2 Main Results

Denoising. We report in Tab. 1 numerical comparisons on the SIDD and DND datasets. As may be seen, our method outperformed previous SOTA techniques, e.g., MIRNet, by 0.24 dB of PSNR on SIDD while obtaining competitive PSNR (39.84 dB) on DND. Fig. 4 shows visual results on SIDD. Our method clearly removes real noise while maintaining fine details, yielding visually more pleasing results than the other methods.

Deblurring. Tab. 2 shows the quantitative comparison of MAXIM-3S against SOTA deblurring methods on two synthetic blur datasets: GoPro and HIDE. Our method achieves a 0.15 dB gain in PSNR over the previous best model, HINet. Notably, the GoPro-trained MAXIM-3S model generalizes extremely well to the HIDE dataset, setting a new SOTA PSNR of 32.83 dB. We also evaluated on real-world blurry images from RealBlur under two settings: (1) directly applying the GoPro-trained model to RealBlur, and (2) fine-tuning the model on RealBlur. Under setting (1), MAXIM-3S ranked first on the RealBlur-J subset while obtaining a top-two result on RealBlur-R. Fig. 5 shows visual comparisons of the evaluated models on GoPro, HIDE, and RealBlur, respectively. It may be observed that our model recovers text extremely well, which may be attributed to the use of the multi-axis MLP module within each block, which globally aggregates repeated patterns across various scales.

Deraining. Following previous work, we computed the performance metrics using the Y channel (in YCbCr color space). Tab. 5 shows quantitative comparisons with previous methods. As may be seen, our model improved over the SOTA performance on all datasets. The average PSNR gain of our model over the previous best model, HINet, is 0.24 dB. Fig. 6 shows some challenging examples, which demonstrate that our method consistently delivered faithfully recovered images without introducing noticeable visual artifacts.

Dehazing. We report our comparisons against SOTA models in Tab. 5. Our model surpassed the previous best model by 0.94 dB and 0.62 dB of PSNR on the SOTS indoor and outdoor sets. Fig. 7 shows that our model recovered images of better quality on both flat regions as well as textures, while achieving a harmonious global tone.

Enhancement / Retouching. As Tab. 6 illustrates, our model achieved the best PSNR and SSIM values on FiveK and LOL , respectively. As the top row of Fig. 8 suggests, MAXIM recovered diverse naturalistic colors as compared to other techniques. Regarding the bottom example, while MIRNet obtained a higher PSNR, we consistently observed that our model attains visually better quality with sharper details and less noise. Moreover, the far more perceptually relevant SSIM index indicates a significant advantage of MAXIM-2S relative to MIRNet.

Other benchmarks. Due to space limitations, we detail the outcomes of our experiments on the REDS deblurring and the Raindrop removal task in Sec. A.5.

4.3 Ablation

We conduct extensive ablation studies to validate the proposed multi-axis gated MLP block, cross-gating block, and multi-stage multi-scale architecture. The evaluations were performed on the GoPro dataset trained on image patches of size $256\times 256$ for $10^{6}$ iterations. We used the MAXIM-2S model as the test-bed for Ablation-A and -B.

A. Individual components. We conducted an ablation by progressively adding (1) inter-stage cross-gating blocks (CGBIS), (2) a supervised attention module (SAM), (3) cross-stage cross-gating blocks (CGBCS), and (4) multi-scale supervision (MS-Sp). Tab. 7A indicates PSNR gains of 0.25, 0.63, 0.36, and 0.26 dB for the respective components.

C. Why multi-stage? Towards understanding this, we scaled up MAXIM in terms of width (channels), depth (downscaling steps), and the number of stages. Tab. 7C suggests that packing the backbone into multi-stages yields the best performance vs. complexity tradeoff (32.44 dB, 22.2 M, 339.2 G), compared to making it wider or deeper.

D. Beyond gMLP: the MAXIM families. As described in Sec. 3.2, our proposed multi-axis approach (Fig. 3) offers a scalable way of applying any 1D operator on (high-resolution) images, with linear complexity relative to image size, while remaining fully convolutional. We conducted a pilot study using MAXIM-1S and -2S on SIDD to explore the MAXIM families: MAXIM-FFT, -MLP, -gMLP (modeled in this paper), and -SA, where we use the Fourier Transform filter, spatial MLP, gMLP, and self-attention on spatial axes using the same multi-axis approach (Fig. 3). As Tab. 7D shows, the gMLP and self-attention variants achieved the best performance, while the FFT and MLP families were more computationally efficient. We leave deeper explorations to future work.

Conclusion

We have presented a generic network for restoration or enhancement tasks, dubbed MAXIM, inspired by recently popular MLP-based global models. Our work suggests an effective and efficient approach for applying gMLP to low-level vision tasks to gain global attention, a missing attribute of basic CNNs. Our gMLP instantiation of the MAXIM family significantly advances the state of the art in several image enhancement and restoration tasks with moderate complexity. We demonstrate a few applications, but many more possibilities beyond the scope of this work could benefit significantly from MAXIM. Our future work includes exploring more efficient models for extremely high-resolution image processing, as well as training large models that can adapt to multiple tasks.

Broader impacts. The proposed model can be used as an effective tool to enhance and retouch daily photos. However, enhancement techniques such as denoising and deblurring can be misused maliciously, raising privacy concerns. Models trained on specific data may express bias. Researchers should handle these issues responsibly.

Acknowledgment

We thank Junjie Ke, Mauricio Delbracio, Sungjoon Choi, Irene Zhu, Innfarn Yoo, Huiwen Chang, and Ce Liu for valuable discussions and feedback.

Appendix A

Image Denoising. We trained our model on $320$ high-resolution images provided in SIDD and evaluated on 1,280 ( $256\times 256$ ) and 1,000 ( $512\times 512$ ) images provided by authors of SIDD and DND , respectively. The results on DND were obtained via the online server . We cropped the training images into $512\times 512$ patches with a stride of 256 to prepare the training patches. We trained the MAXIM-3S model for 600k steps with a batch size of 256.

Image Deblurring. We trained our model on 2,103 image pairs from GoPro . To demonstrate generalization ability, we evaluated our GoPro trained model on 1,111 pairs of the GoPro evaluation set, 2,025 images in the HIDE dataset , as well as the RealBlur dataset , which contains 980 paired images of camera JPEG output and RAW images, respectively. We cropped training images from GoPro into $512\times 512$ patches with a stride of 128 to generate training patches. We trained our MAXIM-3S model over 600k steps with a batch size of 256. For evaluation on RealBlur setting (2) (see main paper), we loaded the GoPro pre-trained checkpoint and fine-tuned for 70k and 15k iterations on RealBlur-J and RealBlur-R, respectively. Additionally, we trained our model on 24,000 images from the REDS dataset of the NTIRE 2021 Image Deblurring Challenge Track 2 JPEG artifacts . For evaluation, we followed the settings in the NTIRE 2021 Challenge on Image Deblurring , i.e., we used 300 images in the validation set of REDS. We trained from scratch for 10k epochs on REDS .

Image Deraining. Following prior work, we used a composite training set containing 13,712 clean-rain image pairs collected from multiple datasets. Evaluation was performed on five test sets: Rain100H, Rain100L, Test100, Test1200, and Test2800. We trained our MAXIM-2S model over 500k steps with a batch size of 512. For the raindrop removal task, we trained MAXIM-2S on 861 pairs of training images in the Raindrop dataset for 80k steps with a batch size of 512, and evaluated on testset A (58 images) and testset B (239 images), respectively.

Image Enhancement. We used the MIT-Adobe FiveK dataset for the retouching evaluation: the first 4,500 images for training and the remaining 500 for testing. We cropped training images into $512\times 512$ patches with a stride of 256. We also used the LOL dataset, which includes 500 pairs of images for low-light enhancement. We trained our model on 485 training images and evaluated on 15 test images. We trained for 14k and 180k steps on FiveK and LOL, respectively.

A.2 Architecture Details

The detailed specifications of the Encoder part for a single-stage MAXIM are shown in Tab. 9. We also provide the input and output shapes of each block and layer. Here Conv3x3_s1_w32 means a Conv layer with 3x3 kernels, stride 1, and 32 channels. MAB and RCAB are the two major components in Encoder / Decoder / Bottleneck. Note that in Bottleneck blocks, we use (Conv1x1) layers to replace Conv3x3 in RCAB.

The Decoder part of MAXIM is symmetric with respect to Tab. 9 and has the same configuration. For the CGB necks, we used $b=d=16$ for depths 1 and 2, while $b=d=8$ is adopted for depth 3. That is, we set the block and grid sizes to $16$ for high-resolution stages (i.e., feature size $\geq 128$) and $8$ for low-resolution stages (i.e., feature size $<128$). Consequently, both dimensions of the input image need to be divisible by 64, so images are padded to a multiple of 64 during inference.
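The padding requirement can be illustrated with a NumPy sketch that pads an image symmetrically on both sides to the next multiple of 64 and then crops the result back. The helper names `pad_to_multiple` and `crop_back` are hypothetical, and splitting the padding evenly between the two sides is an assumption about how "both sides" is handled.

```python
import numpy as np

def pad_to_multiple(img, multiple=64):
    """Symmetric-pad an (H, W, C) image so H and W are multiples of `multiple`."""
    h, w = img.shape[:2]
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    top, left = pad_h // 2, pad_w // 2
    pads = ((top, pad_h - top), (left, pad_w - left), (0, 0))
    return np.pad(img, pads, mode="symmetric"), (top, left)

def crop_back(img, offset, size):
    """Undo pad_to_multiple given the (top, left) offset and original (H, W)."""
    top, left = offset
    h, w = size
    return img[top:top + h, left:left + w]

img = np.random.default_rng(0).random((100, 130, 3))
padded, offset = pad_to_multiple(img)
restored = crop_back(padded, offset, img.shape[:2])
print(padded.shape, restored.shape)
```

In practice the model output, not the padded input, would be cropped; the round trip here only checks that the offsets are consistent.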

A.2.2 Comparison with Other MLPs

In Fig. 10, we show a visual comparison of the approximated effective receptive fields among recent MLP models: MLP-Mixer , gMLP , Swin-Mixer , and our proposed MAXIM. Our approach achieves sparse interactions to obtain both local (red in Fig. 10c) and global dilated (green) spatial communications. Moreover, as shown in Tab. 10, unlike previous MLP models, MAXIM obtains both global and fully-convolutional properties with a linear complexity with respect to the number of pixels $N$ .

A.3 JAX Implementations

Here we provide a JAX implementation of the key component of MAXIM, namely the multi-axis gated MLP block (MAB), in Algorithm 1.
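As a complement, the following NumPy sketch walks through the data flow of a multi-axis gated MLP block as described in Sec. 3.2. It is a shape-level illustration under several assumptions and should not be read as the released implementation: weights are random stand-ins for learned parameters, the batch dimension and biases are omitted, the tanh approximation of GELU is used, and all function names (`dense`, `spatial_gate`, `partition`, `unpartition`, `branch`, `multi_axis_gmlp`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def dense(x, out_dim):
    """Stand-in for a learned channel projection (random weights)."""
    return x @ rng.normal(scale=0.02, size=(x.shape[-1], out_dim))

def spatial_gate(x, axis):
    """gMLP-style gate: split channels into (u, v), mix v along `axis`."""
    u, v = np.split(x, 2, axis=-1)
    v = layer_norm(v)
    n = v.shape[axis]
    w = rng.normal(scale=0.02, size=(n, n))   # shared over all other axes
    v = np.moveaxis(np.moveaxis(v, axis, -1) @ w, -1, axis)
    return u * (v + 1.0)

def partition(x, size, grid):
    h, w, c = x.shape
    if grid:   # (d*d, H/d * W/d, C)
        x = x.reshape(size, h // size, size, w // size, c).transpose(0, 2, 1, 3, 4)
        return x.reshape(size * size, -1, c)
    # block: (H/b * W/b, b*b, C)
    x = x.reshape(h // size, size, w // size, size, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, size * size, c)

def unpartition(x, size, grid, h, w):
    c = x.shape[-1]
    if grid:
        x = x.reshape(size, size, h // size, w // size, c)
    else:
        x = x.reshape(h // size, w // size, size, size, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

def branch(x, size, grid):
    """One branch: expand channels, gate along one spatial axis, project back."""
    h, w, c = x.shape
    y = gelu(dense(layer_norm(x), 2 * c))
    y = partition(y, size, grid)
    y = spatial_gate(y, axis=0 if grid else 1)   # 1st axis: global, 2nd: local
    y = unpartition(y, size, grid, h, w)
    return x + dense(y, c)

def multi_axis_gmlp(x, b=8, d=8):
    h, w, c = x.shape
    z = gelu(dense(layer_norm(x), c))
    z_global, z_local = np.split(z, 2, axis=-1)  # two heads of C/2 channels
    z_global = branch(z_global, d, grid=True)
    z_local = branch(z_local, b, grid=False)
    out = dense(np.concatenate([z_global, z_local], axis=-1), c)
    return x + out                               # long skip-connection

x = rng.normal(size=(16, 24, 8))
print(multi_axis_gmlp(x).shape)
```

The spatial weight matrices have shapes $b^2\times b^2$ and $d^2\times d^2$ regardless of the input resolution, which is why the same function can be called on inputs of different sizes as long as they are divisible by $b$ and $d$.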

A.4 Performance vs. Complexity

We demonstrate the performance vs. complexity trade-off in Tab. 11, compared with other competing methods on all tasks. As can be seen, our model obtains state-of-the-art performance at very moderate complexity. On denoising, for example, MAXIM-3S has only $21\%$ of the FLOPs and $70\%$ of the parameters of MIRNet; on deblurring, our MAXIM-3S model requires only $25\%$ of the number of parameters of the previous best model, HINet, and merely $19\%$ of the number of parameters of the Transformer model IPT. It is also worth noting that, unlike IPT, our model requires no large-scale pre-training to obtain leading performance, making it attractive for low-level tasks where datasets are often limited in scale.

A.5 Additional Experiments

Due to limited space in the main paper, we also show experimental results on deblurring and raindrop removal.

Deblurring on REDS . Tab. 12 shows quantitative comparisons of MAXIM-3S against the winning solution, HINet , and a leading model, MPRNet on the REDS dataset of NTIRE 2021 Image Deblurring Challenge Track 2 JPEG artifacts . The metrics are computed and averaged on 300 validation images. Our MAXIM-3S model surpasses HINet by 0.1 dB of PSNR.

Raindrop removal . Apart from the rain streak removal task reported in the main paper, we also evaluated our MAXIM model on the raindrop removal task. As can be seen in Tab. 13, our model achieved the best performance: 31.87 dB and 25.74 dB PSNR on Raindrop testset A and B.

A.6 More Visual Comparisons

Denoising. Fig. 12 shows denoising results of our model compared with SOTA models on SIDD . Our model recovers more details, yielding visually pleasant outputs.

Deblurring. The visual results on GoPro , HIDE , RealBlur-J , and REDS are shown in Fig. 13, Fig. 14, Fig. 15, and Fig. 16, respectively. Our model outperformed other competing methods on both synthetic and real-world deblurring benchmarks.

Deraining. Qualitative comparisons of our model against SOTA methods on deraining are shown in Fig. 17, Fig. 18, Fig. 19, and Fig. 20.

Raindrop removal. We provide visual comparisons of the raindrop removal task on the Raindrop testset A and B in Fig. 21 and Fig. 22.

Dehazing. We provide dehazing comparisons on the SOTS indoor and outdoor sets in Fig. 23 and Fig. 24.

Retouching. Fig. 25 shows additional retouching comparisons of our model with competing methods on the Five-K dataset.

Low-light enhancement. Fig. 26 demonstrates the evaluations on the LOL test set for low-light enhancement.

A.7 Weight Visualizations

Fig. 11 visualizes the spatial projection matrices of the block gMLP and the grid gMLP layers of each stage of MAXIM-3S trained on GoPro. Similar to prior work, we also observed that the weights after learning exhibit locality and spatial invariance. Surprisingly, the global grid gMLP layer also learns to perform ‘local’ operations (but on the uniform dilated grid). The spatial weights of block gMLP and grid gMLP in the same layer often demonstrate similar or coupled shapes, which may be attributed to the parallel-branch design in the multi-axis gMLP block. However, we have not observed a clear trend in how these filters vary across stages.

A.8 Limitations and Discussions

One potential limitation of our model, which is shared with existing SOTA methods, is its relatively inadequate generalization to real-world examples. This can perhaps be attributed to the training examples provided by existing synthesized image restoration benchmarks. Creating more realistic, large-scale datasets through data-generation schemes could mitigate this shortcoming. Also, we observe that our model tends to slightly overfit certain benchmarks, because we did not apply strong regularization (e.g., dropout) during training. Even though we find that regularization may result in a small reduction in performance on the benchmarks we evaluated, it is worth exploring in the future as a way to improve the generalization of our restoration models.

It is worth mentioning that our model is able to generate high quality sharp images, which are visually comparable to the state-of-the-art generative models . Notably, our model produces more conservative results without hallucinating many nonexistent details, delivering more reliable results than generative models.