Reti-Diff: Illumination Degradation Image Restoration with Retinex-based Latent Diffusion Model

Chunming He, Chengyu Fang, Yulun Zhang, Tian Ye, Kai Li, Longxiang Tang, Zhenhua Guo, Xiu Li, Sina Farsiu

Introduction

Illumination degradation image restoration (IDIR) seeks to enhance the visibility and contrast of degraded images while mitigating the adverse effects of deteriorated illumination, e.g., indefinite noise and variable color deviation. IDIR has been investigated in various domains, including low-light image enhancement, underwater image enhancement, and backlit image enhancement. By addressing illumination degradation, the enhanced images are expected to exhibit improved visual quality, making them more suitable for decision-making or subsequent tasks like nighttime object detection and segmentation.

Traditional IDIR approaches primarily rely on manually crafted enhancement techniques with limited generalization capabilities. Leveraging the robust feature extraction capabilities of convolutional neural networks and transformers, a series of deep learning-based methods have been proposed and have achieved remarkable success in the IDIR domain. However, as depicted in Fig. 1, they still face challenges in complex illumination degradation scenarios due to their constrained generative capacity.

To overcome these challenges, deep generative models, such as generative adversarial networks and variational autoencoders, have gained popularity for addressing the IDIR problem, owing to their potent generative abilities. Recently, the diffusion model (DM) has been introduced to the IDIR field for high-quality image restoration, as it offers accurate results without mode collapse, a common issue with other generative models. However, the existing DM-based method, Diff-Retinex, applies DM directly to image-level generation, leading to two main challenges: (1) This method incurs high computational costs, as predicting the image-level distribution requires a large number of inference steps. (2) The enhanced results may exhibit pixel misalignment with the original clean image in terms of restored details, resulting in suboptimal performance in distortion-based metrics like PSNR. Nevertheless, the latter issue can be effectively addressed by transformer-based methods, which excel in modeling long-range dependencies.

To tackle the aforementioned problems, we propose introducing the latent diffusion model (LDM) to address the IDIR problem. By applying DM in the low-dimensional compact latent space, we effectively alleviate the computational burden. Additionally, we incorporate LDM into transformers to prevent pixel misalignment in the generated image, which is often observed in existing deep generative models. Unlike existing LDM-based methods that solely use the priors extracted from the RGB domain, our method allows LDM to learn Retinex information from the reflectance and illumination domains, which are decomposed from the RGB domain. This approach enables us to simultaneously enhance image details using the reflectance prior and correct color distortions with the illumination prior, thereby improving restoration under illumination degradation.

With this inspiration, we propose the first LDM-based solution, Reti-Diff, to solve the IDIR problem. As shown in Fig. 2, Reti-Diff comprises two main parts: the Retinex-based LDM (RLDM) and the Retinex-guided transformer (RGformer). Initially, RLDM is employed to generate Retinex priors, which are then integrated into RGformer to produce visually appealing results. Following a common training strategy, we propose a two-phase approach, where we first pretrain Reti-Diff and subsequently optimize RLDM. In phase I, we introduce a Retinex prior extraction (RPE) module to compress the ground-truth image into highly compact Retinex priors, specifically the reflectance prior and the illumination prior. These priors are then sent to RGformer to guide feature decomposition and the generation of reflectance and illumination features. Afterward, RGformer employs the Retinex-guided multi-head cross attention (RG-MCA) and dynamic feature aggregation (DFA) modules to refine and aggregate the decomposed features, ultimately producing enhanced images with coherent content and ensuring robustness and generalization in extreme degradation scenarios. In phase II, we train RLDM in the reflectance and illumination domains to directly estimate Retinex priors from the degraded image, enhancing the quality of the generated priors and further improving reconstruction quality when jointly optimized with RGformer. Our contributions are summarized as follows:

We propose a novel DM-based framework, Reti-Diff, for the IDIR task. To the best of our knowledge, this is the first application of the latent diffusion model to tackle the IDIR problem.

We propose to let RLDM learn Retinex knowledge and extract reflectance and illumination priors, thus facilitating detail restoration and illumination correction.

We propose RGformer to integrate extracted Retinex priors to decompose features into reflectance and illumination components and then utilize RG-MCA and DFA to refine and aggregate the decomposed features, ensuring robustness and generalization in complex illumination degradation scenarios.

Extensive experiments on three IDIR tasks verify our superiority to existing methods in terms of image quality and favorability in downstream applications, including object detection and segmentation.

Related Works

Illumination Degradation Image Restoration. Early IDIR methods can be broadly categorized into three main approaches: histogram equalization (HE), gamma correction (GC), and Retinex theory. HE-based and GC-based methods focus on directly amplifying the low-contrast regions but overlook illumination factors. Retinex-based variants develop priors to constrain the solution space for reflectance and illumination maps. However, these methods still rely on hand-crafted priors, limiting their ability to generalize effectively.

With the rapid development of deep learning, approaches based on CNNs and transformers have achieved remarkable success in IDIR. For instance, LLNet proposed a sparse denoising structure to enhance illumination and suppress noise. DIE integrated Retinex cues into a learning-based structure, presenting a one-stage Retinex-based solution for color correction. To enhance generative capacity, Diff-Retinex introduced DM to the IDIR field by directly applying it to image-level generation. However, Diff-Retinex entails significant computational costs and may lead to pixel misalignment issues with the original input, particularly concerning restored image details.

Diffusion Models. Diffusion models (DMs) have demonstrated considerable success in various domains, including density estimation and data generation. Such a probabilistic generative model adopts a parameterized Markov chain to optimize the variational lower bound on the likelihood function, enabling it to generate target distributions with greater accuracy than other generative models, e.g., GANs and VAEs. Recently, DMs have been introduced by Diff-Retinex to solve the IDIR problem. However, when directly applied to image-level generation, this approach introduces computational burdens and issues related to pixel misalignment. To overcome this, we propose employing LDM to estimate priors within a low-dimensional latent space. We then integrate these priors into transformer-based restoration techniques, thus reducing the computational burden and preventing pixel misalignment. Besides, unlike existing LDM-based methods that solely rely on priors extracted from the RGB domain, our method allows LDM to acquire Retinex information from the reflectance and illumination domains. This enables us to simultaneously enhance image details using the reflectance prior and correct color distortions with the illumination prior.

Methodology

In this paper, we leverage the latent diffusion model (LDM) to generate compact priors that guide illumination degradation image restoration (IDIR) tasks. This approach effectively reduces redundant computations and mitigates pixel misalignment issues often associated with traditional diffusion models (DM). At the same time, it preserves the network’s generative capability, allowing it to address complex degradation scenarios. However, existing LDM-based image restoration techniques primarily utilize priors extracted from the RGB domain for guidance, which limits their ability to fully exploit prior knowledge tailored to illumination degradation scenarios, ultimately impacting their restoration performance. In response to this challenge, we introduce Reti-Diff, which harnesses Retinex priors extracted from the illumination and reflectance domains to steer the image restoration process. By doing so, the extracted Retinex prior representation serves as dynamic modulation parameters, allowing for the simultaneous enhancement of restoration details via the reflectance prior and the correction of color distortion through the illumination prior. This ensures the generation of visually compelling results while favorably impacting downstream tasks.

As shown in Fig. 2, our Reti-Diff comprises two parts: the Retinex-guided transformer (RGformer) and the Retinex-based latent diffusion model (RLDM). Following prior work, Reti-Diff adopts a two-phase training strategy, involving the initial pretraining of Reti-Diff and the subsequent optimization of RLDM. In this section, we provide an in-depth explanation of the two-phase training approach and illustrate the entire restoration process.

We first pretrain Reti-Diff to encode the ground truth into compact priors with Retinex prior extraction (RPE) module and use them to guide RGformer for restoration.

where $\text{down}(\cdot)$ denotes downsampling, for which we use the PixelUnshuffle operation. We then send the Retinex priors, i.e., $\mathbf{Z}_{\mathbf{R}}$ and $\mathbf{Z}_{\mathbf{L}}$, to RGformer to serve as dynamic modulation parameters for detail restoration and color correction.
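To make the downsampling step concrete, the following is a minimal pure-Python sketch of what a PixelUnshuffle rearrangement does: it trades spatial resolution for channels without discarding any pixel. The function name `pixel_unshuffle`, the nested-list image layout, and the factor `r` are illustrative assumptions; the downsampling factor used by Reti-Diff is not stated in this section.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [channel][row][col]


def pixel_unshuffle(x: Image, r: int) -> Image:
    """Rearrange a (C, H*r, W*r) image into (C*r*r, H, W).

    Follows the channel ordering of PyTorch's PixelUnshuffle: output channel
    c*r*r + i*r + j holds the input pixels at rows h*r + i and columns w*r + j.
    """
    channels = len(x)
    height, width = len(x[0]), len(x[0][0])
    if height % r or width % r:
        raise ValueError("height and width must be divisible by r")
    out_h, out_w = height // r, width // r
    out: Image = []
    for c in range(channels):
        for i in range(r):          # row offset inside each r x r block
            for j in range(r):      # column offset inside each r x r block
                out.append([[x[c][h * r + i][w * r + j] for w in range(out_w)]
                            for h in range(out_h)])
    return out
```

With `r = 2`, a single-channel 4×4 input becomes four 2×2 channels, one per position inside each 2×2 block.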

where channels are split into multi-head for attention calculation. By doing so, RG-MCA introduces explicit guidance to fully exploit Retinex knowledge at the feature level and use cross attention mechanism to implicitly model the Retinex theory and refine the decomposed features, which helps to restore missing details and correct color distortion.
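As a rough illustration of the mechanism named here, the sketch below implements generic scaled dot-product cross attention with the channel dimension split into heads. It is not the RG-MCA formulation itself: the learned projections and the Retinex-prior modulation are omitted, attention is taken over tokens, and the names `cross_attention` and `softmax` are chosen for this example only.

```python
import math
from typing import List

Matrix = List[List[float]]  # indexed as [token][channel]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query: Matrix, key: Matrix, value: Matrix, heads: int) -> Matrix:
    """Cross attention where queries come from one feature and keys/values from another.

    Channels are split into `heads` equal groups and each group attends independently.
    """
    channels = len(query[0])
    if channels % heads:
        raise ValueError("channels must be divisible by heads")
    d = channels // heads
    out = [[0.0] * channels for _ in query]
    for h in range(heads):
        lo, hi = h * d, (h + 1) * d
        for qi, q in enumerate(query):
            # Similarity of this query to every key, using only this head's channels.
            scores = [sum(a * b for a, b in zip(q[lo:hi], k[lo:hi])) / math.sqrt(d)
                      for k in key]
            weights = softmax(scores)
            # Each output channel is a weighted average of the value tokens.
            for c in range(lo, hi):
                out[qi][c] = sum(w * v[c] for w, v in zip(weights, value))
    return out
```

In a Retinex setting, one plausible use is to draw `query` from one decomposed feature and `key`/`value` from the other, so that each branch is refined with information from its counterpart.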

Optimization. To facilitate the extraction of Retinex priors, the RPE module and RGformer are jointly trained by a reconstruction loss with the $L_{1}$ norm $\|\cdot\|_{1}$:

The utilization of Eq. 7 serves a dual purpose: it not only facilitates the learning of Retinex theory by the split features but also enhances the overall restoration capacity.

In Phase I, the final loss $L_{P1}$ is defined as:

2 Retinex-based Latent Diffusion Model

In Phase II, we train the Retinex-based latent diffusion model (RLDM) to predict Retinex priors. Different from conventional latent diffusion models that are typically trained on the RGB domain, we introduce two RLDMs with a Siamese structure and train them on distinct domains: the reflectance domain and the illumination domain, respectively. This approach, grounded in Retinex theory, equips our RLDM to generate a more generative reflectance prior, denoted as $\hat{\mathbf{Z}}_{\mathbf{R}}$, to enhance image details, and a more harmonized illumination prior $\hat{\mathbf{Z}}_{\mathbf{L}}$ for color correction. For brevity, we provide a detailed derivation for $\hat{\mathbf{Z}}_{\mathbf{R}}$ herein, while further elaboration on $\hat{\mathbf{Z}}_{\mathbf{L}}$ can be found in the supplementary material (Supp). Note that RLDM is built upon conditional denoising diffusion probabilistic models, comprising both a forward diffusion process and a reverse denoising process.

Diffusion process. In the diffusion process, we first use the pretrained RPE to extract the reflectance prior $\mathbf{Z}_{\mathbf{R}}$, which is treated as the starting point of the forward Markov process, i.e., $\mathbf{Z}_{\mathbf{R}}=\mathbf{Z}_{\mathbf{R}}^{0}$. We then gradually add Gaussian noise to $\mathbf{Z}_{\mathbf{R}}$ over $T$ iterations, and each iteration can be defined as:

where $t=1,\cdots,T$. $\mathbf{Z}_{\mathbf{R}}^{t}$ denotes the noisy prior at time step $t$, $\beta^{t}$ is the predefined factor that controls the noise variance, and $\mathcal{N}$ is the Gaussian distribution. Following prior work, Eq. 9 can be simplified as follows:
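Assuming the standard DDPM formulation, which is consistent with the symbols $\beta^{t}$, $\mathcal{N}$, and $\bar{\alpha}^{t}$ used in this section, the single-step transition (Eq. 9) and its simplified closed form can be sketched as:

```latex
% Single-step forward transition (standard DDPM form, assumed for Eq. 9)
q\left(\mathbf{Z}_{\mathbf{R}}^{t} \mid \mathbf{Z}_{\mathbf{R}}^{t-1}\right)
  = \mathcal{N}\left(\mathbf{Z}_{\mathbf{R}}^{t};\ \sqrt{1-\beta^{t}}\,\mathbf{Z}_{\mathbf{R}}^{t-1},\ \beta^{t}\mathbf{I}\right)

% Closed form obtained by composing the first t transitions
q\left(\mathbf{Z}_{\mathbf{R}}^{t} \mid \mathbf{Z}_{\mathbf{R}}^{0}\right)
  = \mathcal{N}\left(\mathbf{Z}_{\mathbf{R}}^{t};\ \sqrt{\bar{\alpha}^{t}}\,\mathbf{Z}_{\mathbf{R}}^{0},\ \left(1-\bar{\alpha}^{t}\right)\mathbf{I}\right)
```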

where $\alpha^{t}=1-\beta^{t}$ and $\bar{\alpha}^{t}=\prod_{i=1}^{t}\alpha^{i}$.
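The quantities $\alpha^{t}$ and $\bar{\alpha}^{t}$ can be computed directly from a noise schedule. The sketch below assumes a linear schedule from 0.1 to 0.99 (the values reported in the implementation details) and the standard DDPM closed form for sampling a noisy prior from $\mathbf{Z}_{\mathbf{R}}^{0}$; the function names and the flat-list representation of the prior are illustrative assumptions.

```python
import math
import random
from typing import List


def linear_betas(steps: int, beta_start: float = 0.1, beta_end: float = 0.99) -> List[float]:
    """Linearly spaced noise variances beta^1..beta^T."""
    if steps == 1:
        return [beta_end]  # arbitrary choice for the degenerate one-step case
    gap = (beta_end - beta_start) / (steps - 1)
    return [beta_start + gap * i for i in range(steps)]


def alpha_bars(betas: List[float]) -> List[float]:
    """Cumulative products of alpha^t = 1 - beta^t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def noisy_prior(z0: List[float], t: int, bars: List[float], rng: random.Random) -> List[float]:
    """Draw Z^t ~ N(sqrt(abar^t) Z^0, (1 - abar^t) I) for a 1-based step t."""
    abar = bars[t - 1]
    return [math.sqrt(abar) * z + math.sqrt(1.0 - abar) * rng.gauss(0.0, 1.0) for z in z0]
```

With four steps, $\bar{\alpha}^{t}$ falls from 0.9 at the first step to well below 0.01 at the last, so the final noisy prior is close to pure Gaussian noise.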

Reverse process. In the reverse process, RLDM aims to extract the reflectance prior from pure Gaussian noise. Thus, RLDM samples a Gaussian random noise map $\mathbf{Z}_{\mathbf{R}}^{T}$ and then gradually denoises it, running backward from $\mathbf{Z}_{\mathbf{R}}^{T}$ to $\mathbf{Z}_{\mathbf{R}}^{0}$:
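A minimal sketch of such a backward pass is given below, assuming the standard DDPM posterior-mean update and omitting the stochastic variance term for simplicity. The `predict_noise` callable stands in for a trained denoising network and is hypothetical; in a conditional model it would also receive the conditional vector, which here is assumed to be captured inside the callable.

```python
import math
from typing import Callable, List

Vector = List[float]
# Hypothetical noise predictor: (noisy prior, 1-based step t) -> estimated noise.
NoisePredictor = Callable[[Vector, int], Vector]


def reverse_process(z_T: Vector, betas: List[float], predict_noise: NoisePredictor) -> Vector:
    """Apply the DDPM posterior-mean update from step T down to step 1."""
    alphas = [1.0 - b for b in betas]
    bars, running = [], 1.0
    for a in alphas:
        running *= a
        bars.append(running)
    z = list(z_T)
    for t in range(len(betas), 0, -1):
        alpha, abar = alphas[t - 1], bars[t - 1]
        eps = predict_noise(z, t)
        # Remove the predicted noise, then rescale to the previous step.
        scale = (1.0 - alpha) / math.sqrt(1.0 - abar)
        z = [(zi - scale * ei) / math.sqrt(alpha) for zi, ei in zip(z, eps)]
    return z
```

If the predictor returned the exact noise at every step, this loop would recover $\mathbf{Z}_{\mathbf{R}}^{0}$ exactly; in practice the quality of the estimated prior depends on the learned predictor.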

Optimization. Having obtained the predicted priors $\hat{\mathbf{Z}}_{\mathbf{R}}$ and $\hat{\mathbf{Z}}_{\mathbf{L}}$, which are generated by two Siamese RLDMs with specific weights, we propose the diffusion loss to supervise them:

To improve restoration quality, we jointly train RPE, RGformer, and RLDM. Thus, the loss in Phase II is

3 Inference

In the inference phase, given the LQ input $\mathbf{I}_{LQ}$, Reti-Diff first uses $\widetilde{\text{RPE}}$ to extract the conditional vectors $\mathbf{V}_{\mathbf{R}}$ and $\mathbf{V}_{\mathbf{L}}$, and then generates the predicted Retinex priors $\hat{\mathbf{Z}}_{\mathbf{R}}$ and $\hat{\mathbf{Z}}_{\mathbf{L}}$ with two RLDMs. Under the guidance of the Retinex priors, RGformer generates the restored HQ image $\mathbf{I}_{HQ}$. Benefiting from our Retinex-based framework, $\mathbf{I}_{HQ}$ enjoys richer texture details and more harmonized illumination.
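The data flow of this inference procedure can be summarized in a few lines. The sketch below only wires the stages together; every callable passed in (`extract_conditions`, the two prior samplers, and `rgformer`) is a hypothetical stand-in for the corresponding trained component, not an actual API.

```python
from typing import Callable, Tuple, TypeVar

Image = TypeVar("Image")
Cond = TypeVar("Cond")
Prior = TypeVar("Prior")


def restore(
    i_lq: Image,
    extract_conditions: Callable[[Image], Tuple[Cond, Cond]],
    sample_reflectance_prior: Callable[[Cond], Prior],
    sample_illumination_prior: Callable[[Cond], Prior],
    rgformer: Callable[[Image, Prior, Prior], Image],
) -> Image:
    """Chain the inference stages: conditions -> Retinex priors -> restored image."""
    v_r, v_l = extract_conditions(i_lq)          # conditional vectors from the LQ input
    z_r_hat = sample_reflectance_prior(v_r)      # reflectance-branch RLDM
    z_l_hat = sample_illumination_prior(v_l)     # illumination-branch RLDM
    return rgformer(i_lq, z_r_hat, z_l_hat)      # restoration guided by both priors
```

Keeping the two prior samplers as separate arguments mirrors the description of two RLDMs with their own weights.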

Experiment

Our Reti-Diff is implemented in PyTorch on four RTX 3090 Ti GPUs and is optimized by Adam with momentum terms $(0.9, 0.999)$. In phases I and II, we train the network for 300K iterations; the learning rate is initially set to $2\times 10^{-4}$ and gradually reduced to $1\times 10^{-6}$ with cosine annealing. Following prior work, random rotation and flips are used for augmentation. Reti-Diff mainly comprises RLDM and RGformer. For RLDM, the channel number $C^{\prime}$ is set to 64. The total time step $T$ is set to 4 and the hyperparameters $\beta^{1:T}$ linearly increase from $\beta^{1}=0.1$ to $\beta^{T}=0.99$. RGformer adopts a 4-level cascade encoder-decoder structure. The number of transformer blocks, the number of attention heads, and the channel number are set separately for each level, from level 1 to level 4.
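For reference, a single-cycle cosine annealing schedule with the reported endpoints can be written in closed form. The sketch below assumes one cosine cycle spanning the full 300K iterations with no warm-up or restarts, which the text does not specify; the function name is illustrative.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int = 300_000,
                       lr_max: float = 2e-4, lr_min: float = 1e-6) -> float:
    """Learning rate at a given iteration under single-cycle cosine annealing."""
    # Clamp so that steps outside [0, total_steps] hold the boundary value.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The schedule starts at `lr_max`, passes through the midpoint of the two rates halfway through training, and ends at `lr_min`.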

2 Comparative Evaluation

Low-light Image Enhancement. We conduct a comprehensive evaluation on four datasets: LOL-v1, LOL-v2-real, LOL-v2-syn, and SID. We follow the training protocol of prior work. Our assessment involves four metrics: PSNR, SSIM, FID, and BIQE. Note that larger PSNR and SSIM scores, as well as smaller FID and BIQE scores, denote superior performance. We compare our approach against 17 cutting-edge enhancement techniques and report the results in Tab. 1. As depicted in Tab. 1, our method emerges as the top performer across all datasets and significantly outperforms the second-best method (Diff-Retinex) by $13.2\%$. These results underscore the superiority of our Reti-Diff. Fig. 3 presents qualitative results, showcasing our capacity to generate enhanced images with corrected illumination and enhanced texture, even in extremely challenging conditions. In contrast, existing methods struggle to reach the same quality in regions such as the boundaries of power lines, the color harmonization of lakes, and the detailed textures of wooded areas. Notably, results from the compared methods are generated with their provided models under the same settings to ensure fairness.
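As a reminder of why larger PSNR is better, the sketch below computes PSNR from the mean squared error between a reference and a restored image. It assumes flattened images with intensities in `[0, max_value]`; the function name and this representation are illustrative and not tied to any particular evaluation toolkit.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is a per-pixel distortion measure, a restored image whose details are slightly shifted relative to the reference is penalized even if it looks plausible, which is the pixel-misalignment concern raised for image-level diffusion.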

Underwater Image Enhancement. We extend our evaluation to two widely used underwater image enhancement datasets: UIEB and LSUI. In addition to PSNR and SSIM, we employ two metrics specifically tailored for underwater images, namely UCIQE and UIQM, to assess the performance of the ten enhancement approaches. In all cases, higher values indicate superior performance. The results are presented in Tab. 3. As showcased in Tab. 3, our method achieves the highest overall performance and outperforms the second-best method (PUGAN) by $4.48\%$. A qualitative analysis is presented in Fig. 4, illustrating our method’s ability to correct underwater color aberrations and highlight fine texture details.

Backlit Image Enhancement. Following CLIP-LIT, we select the BAID dataset for training the network with an image size of $256\times 256$. In addition to PSNR and SSIM, our evaluation incorporates LPIPS and FID. In this case, lower LPIPS and FID scores denote superior performance. The evaluation results are reported in Tab. 3. As demonstrated in Tab. 3, our method excels in all metrics and generally outperforms the second-best method (CLIP-LIT) by $6.03\%$. Furthermore, a visual comparison in Fig. 5 provides additional evidence of our superiority in detail reconstruction and color correction. For fairness, all methods are trained on training data cropped to $256\times 256$.

3 Ablation Study

We conduct ablation studies on the low-light image enhancement task with the LOL-v2-syn dataset.

Effect of RLDM. As illustrated in Tab. 5, we ablate RLDM by directly removing RLDM or retraining RLDM in the RGB domain, i.e., w/o Retinex, rather than in the reflectance and illumination domains (in this case, RGformer is guided by one RGB prior instead of the Retinex priors). The two modifications result in significant drops in performance. This outcome underscores the critical role of RLDM in enhancing the restoration process. Furthermore, to assess the generalizability of RLDM, we conducted additional experiments by replacing our RGformer with two transformer-based frameworks, namely Res (Restormer) and Ret (Retformer). Note that the training settings are kept consistent with our Reti-Diff. The results are presented in Tab. 5. Tab. 5 reveals that RLDM significantly improves the performance of both frameworks, where “Gain” is the average gain of PSNR and SSIM. This demonstrates that our RLDM serves as a plug-and-play module with strong generalization capabilities.

Effect of RGformer. We conduct an analysis to assess the impact of our RGformer, and the results are presented in Tab. 5. In this study, we systematically removed critical components, such as DFA, RG-MCA, and the auxiliary decoder $D_{a}(\cdot)$, from the model architecture. The outcomes of this ablation study clearly indicate that the performance deteriorates when these components are removed, highlighting their essential role in the system. Additionally, in Tab. 5, we conduct an evaluation to affirm the significance of joint training in our approach. This analysis reinforces the importance of the joint training process.

Ablations on iteration number. The number of iterations in the diffusion model plays a crucial role in determining the method’s efficiency. To explore this, we conducted experiments with different iteration numbers for Reti-Diff, specifically $T$ values selected from the set $\{1,2,4,8,16,32\}$. We adjusted the $\beta^{t}$ parameter as defined in Eq. 9 accordingly. The results in terms of PSNR for different iterations, as shown in Fig. 6, illustrate that Reti-Diff exhibits rapid convergence and generates stable guidance priors with just $4$ iterations. This efficiency is attributed to our application of the diffusion model within the compact latent space.

4 User Study and Downstream Tasks

User Study. We conduct a user study to assess the subjective visual perception of low-light image enhancement. In this study, 29 human subjects are invited to assign scores to the enhanced results based on four criteria: (1) The presence of underexposed or overexposed regions. (2) The existence of color distortion. (3) The occurrence of undesired noise or artifacts. (4) The inclusion of essential structural details. Participants rate the results on a scale from 1 (worst) to 5 (best). Each low-light image is presented alongside its enhanced results, with the names of the enhancement methods concealed from the evaluators. The scores are reported in Tab. 7. As shown in Tab. 7, our method receives the highest scores across all four datasets, which highlights our effectiveness in generating visually appealing results.

Low-light Object Detection. The enhanced images are expected to have better downstream performance than the original ones. We first verify this on low-light object detection. Following prior work, all compared methods are evaluated on ExDark with YOLOv3, which is trained from scratch. As shown in Tab. 7, our Reti-Diff exhibits a substantial advantage over existing methods, as evidenced by higher average precision (AP) scores. Notably, the mean AP of our method surpasses that of the second-best performing method, Retformer, by a margin of 12.1 AP points, which serves as compelling evidence of our efficacy in facilitating high-level vision understanding.

Low-light Image Segmentation. We extend our experimentation to include image segmentation tasks, specifically semantic segmentation and concealed object segmentation. For semantic segmentation, following prior work, we apply image darkening to samples from the VOC dataset. We then employ Mask2Former, a state-of-the-art segmentor, to perform segmentation on the enhanced results of these darkened images. The evaluation metric used is Intersection over Union (IoU), and the results are presented in Tab. 9. As shown in Tab. 9, our method achieves the highest performance across all classes, surpassing the second-best method by a substantial margin of $15.7\%$.
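The IoU metric used above can be illustrated for a single class with binary masks. The sketch below is a minimal version operating on flattened masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
from typing import Sequence


def iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Intersection over Union of two flattened binary masks (1 = class pixel)."""
    if len(pred) != len(target):
        raise ValueError("masks must be the same size")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0
```

Per-class scores of this kind are typically averaged over classes to obtain a single segmentation score.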

We further venture into concealed object segmentation (COS) on COD10K and NC4K, a challenging segmentation task aimed at delineating objects with inherent background similarity. We also apply image darkening and enlist the cutting-edge COS segmentor, FEDER, to perform segmentation on the enhanced results. We evaluate the results using four metrics: mean absolute error ($M$), adaptive F-measure ($F_{\beta}$), mean E-measure ($E_{\phi}$), and structure measure ($S_{\alpha}$), which are presented in Tab. 9. As depicted in Tab. 9, our method exhibits superior performance compared to the second-best method, SNR-Net, with a notable margin of $2.1\%$ on average. Collectively, the exceptional results achieved in these two segmentation tasks substantiate our proficiency in recovering image-level illumination degraded information.

Future Works

In future work, we will consider using multimodal data to improve image reconstruction performance, such as using infrared images to aid low-light visible image enhancement. In addition, we will explore whether our approach is downstream task-friendly with more segmentation algorithms.

Conclusions

To mitigate pixel misalignment, our approach adopts DM within a compact latent space to generate guidance priors. Specifically, we introduce RLDM to extract Retinex priors, which are subsequently supplied to RGformer for feature decomposition. This process ensures precise detail reconstruction and effective illumination correction. RGformer then refines and aggregates the decomposed features, enhancing the robustness in handling complex degradation scenarios. Our approach is extensively validated through experiments, establishing the clear superiority of Reti-Diff.

References