An Online Plug-and-Play Algorithm for Regularized Image Reconstruction

Yu Sun, Brendt Wohlberg, Ulugbek S. Kamilov

Introduction

where $d$ is the data-fidelity term that penalizes the mismatch to the measurements and $r$ is the regularizer that imposes prior knowledge regarding the unknown image. Some popular imaging priors include nonnegativity, transform-domain sparsity, and self-similarity .

Over the past two decades, a substantial effort has been devoted to combining the best regularizers with efficient optimization algorithms. The large dimensionality of the imaging data and nondifferentiability of many regularizers has led to the widespread adoption of proximal algorithms —such as variants of iterative shrinkage/thresholding algorithm (ISTA) and alternating direction method of multipliers (ADMM) . These algorithms avoid differentiating the regularizer by using a mathematical concept known as the proximal operator, which is itself an optimization problem equivalent to regularized image denoising.

The mathematical equivalence of the proximal operator to denoising has recently inspired Venkatakrishnan et al. to introduce the powerful plug-and-play priors (PnP) framework for image reconstruction. The key idea in PnP is to replace the proximal operator in an iterative algorithm with a state-of-the-art image denoiser, such as BM3D , WNNM , or TNRD , which does not necessarily have a corresponding regularization objective. This implies that PnP methods generally lose interpretability as optimization problems. Nonetheless, the framework has gained in popularity due to its effectiveness in a range of applications in the context of imaging inverse problems . In particular, the effectiveness of PnP was demonstrated beyond the original ADMM formulation to other proximal algorithms such as primal-dual splitting and ISTA .

All current PnP algorithms are iterative batch procedures, which means that they use the full set of measurements at every iteration. This effectively precludes their application to very large datasets common in three-dimensional (3D) imaging or in imaging of dynamic objects . In this paper, we address this limitation by proposing a new online extension for PnP-ISTA. By using only a subset of the measurements at a time, the proposed algorithm scales to datasets that would otherwise be prohibitively large for batch processing. More specifically, the key contributions of this paper are as follows.

We present a detailed theoretical convergence analysis of batch PnP-ISTA under a set of explicit assumptions. Our analysis complements the recent theoretical results on PnP-ADMM by Sreehari et al. and Chan et al. in two major ways. We show that for PnP-ISTA the symmetric gradient assumption from is not necessary, while the bounded denoiser assumption from is not sufficient to establish the convergence.

We extend the traditional batch PnP framework with our novel online algorithm called PnP-SGD. We prove the theoretical convergence of the algorithm to the same set of fixed points as batch PnP-ISTA and PnP-ADMM. This makes PnP-SGD a powerful and theoretically sound alternative for large-scale image reconstruction. We also illustrate its applicability with several numerical simulations on image reconstruction problems encountered in diffraction tomography .

Background

In this section, we provide the background material that forms the foundation to our contributions. We first review the problem of regularized image reconstruction and then introduce more recent results related to the PnP algorithms.

Practical inverse problems are often ill-posed, which often leads to the formulation in (1). In such cases, one of the most popular data-fidelity terms is least-squares

According to definition (4), the proximal operator corresponds to an image denoiser formulated as regularized optimization. Note also that when the values for $\{q_{k}\}$ in Algorithm 1 are adapted as

A careful inspection of ISTA and ADMM reveals a fundamental conceptual difference between the algorithms in their treatment of the data-fidelity. While ISTA relies on the gradient $\nabla d$ , ADMM relies on the proximal operator $\mathsf{prox}_{\gamma d}$ . For a large class of linear and nonlinear inverse problems, the gradient of the data-fidelity is significantly easier to evaluate compared to its proximal operator. As an example, for least-squares we have

The matrix inversion in (7) can make ADMM updates computationally expensive for problems where the measurement matrix is not easily invertible.

The theoretical analysis in this paper is closely related to the convergence results established for first-order methods by Nesterov and Beck and Teboulle . In particular, our work is related to inexact proximal-gradient optimization that was extensively investigated by several researchers . We extend this prior work beyond traditional optimization, where denoising operators do not necessarily correspond to proximal operators of a given objective. To achieve this, we adopt the monotone operator theory , which enables a unified analysis of PnP methods by expressing them as finding zeros of some operator.

Using denoisers as priors

Both ISTA and ADMM have modular structures in the sense that the prior on the image is only imposed via the proximal operator. Additionally, since the proximal operator is mathematically equivalent to regularized image denoising, the powerful idea of Venkatakrishnan et al. was to consider replacing it with a more general denoising operator $\mathsf{denoise}_{\sigma}(\cdot)$ of controllable strength ${\sigma>0}$ . In order to be backward compatible with the traditional optimization formulation, this strength parameter is often scaled with the step-size as $\sigma=\sqrt{\gamma\lambda}$ , for some parameter $\lambda>0$ .

The original formulation of PnP relies on ADMM. However, recent results have shown that it can be as effective when used with other proximal algorithms or with another class of algorithms known as approximate message passing (AMP) . AMP-based algorithms have been shown to be effective for problems where ${\bm{H}}$ is large and random , but are also known to be unstable for general matrices ${\bm{H}}$ . Therefore, in this paper, our focus will be exclusively on the variants of PnP based on ISTA and ADMM, summarized in Algorithm 3 and 4, respectively.

A different but related approach to denoiser-driven regularization was recently proposed by Romano et al. . They proposed the regularization by denoising (RED) framework, where an explicit regularizer is constructed as

Remarkably, they also showed that under some conditions, the gradient of the regularizer has a very simple expression. More recently, Reehorst and Schniter have provided additional insight into RED by establishing conditions for the existence of explicit regularizers based on denoising operators. The key difference between PnP and RED is that the former does not seek to define an explicit regularization functional, but relies on the fixed points of a given denoising operator for regularization. This generality of the PnP framework makes it widely applicable, but also substantially complicates its theoretical analysis.

Another recent related framework is the consensus equilibrium (CE) by Buzzard et al. . Given multiple sources of information (defined via image denoisers or other similar mappings), CE proposes to fuse them by computing a specific equilibrium point. The CE framework extends the traditional consensus optimization to operators that are not necessarily proximal operators and formulates a new variant of PnP that can handle multiple denoising functions. In this paper, we will restrict our attention to the traditional PnP formulation under ISTA-based optimization.

Batch Algorithm

In this section, we present a detailed theoretical convergence analysis of batch PnP-ISTA. The results are based on the fixed point analysis of Algorithm 3 and rely on basic convex and monotone analysis, summarized in Appendix 7.1.

The central building block of PnP-ISTA is the following denoiser-gradient operator

which first computes the gradient-step with respect to the function $d$ and then denoises the result with a given denoiser. Throughout this paper, we assume that the function $d$ is convex and has a Lipschitz continuous gradient with constant $L>0$ . We are interested in convergence of Algorithm 3 to the set of fixed points of the operator $\mathsf{P}$

Note that when $\mathsf{denoise}_{\sigma}$ is the proximal oprerator of some convex function, $\mathsf{fix}(\mathsf{P})$ coincides with the set of solutions of (1).

Let $\mathsf{denoise}_{\sigma}(\cdot)=\mathsf{prox}_{\gamma r}(\cdot)$ for $\gamma,\sigma>0$ . Then ${\bm{x}^{\ast}}\in\mathsf{fix}(\mathsf{P})$ if and only if it minimizes $f=d+r$ .

Our central goal, however, is to generalize $\mathsf{denoise}_{\sigma}$ beyond proximal operators. The key assumption that we adopt for our analysis is that the denoiser is averaged (see Appendix A).

Consider an operator $\mathsf{denoise}_{\sigma}$ and a constant ${\theta\in(0,1)}$ . $\mathsf{denoise}_{\sigma}$ is $\theta$ -averaged if and only if the operator $(1-1/\theta)\mathsf{I}+(1/\theta)\mathsf{denoise}_{\sigma}$ , where $\mathsf{I}$ denotes the identity operator, is nonexpansive.

The class of averaged operators is a superset of proximal operators and a subset of nonexpansive operators. In fact, the proximal operator is an averaged operator with $\theta=1/2$ . Note that given any nonexpansive denoiser, it is always possible to make it averaged by defining a damped operator ${\mathsf{D}\,\triangleq\,(1-\theta)\mathsf{I}+\theta\mathsf{denoise}_{\sigma}}$ , with $\theta\in(0,1)$ , which has the same set of fixed points as $\mathsf{denoise}_{\sigma}$ .

We analyze PnP-ISTA under the following assumptions:

The function $d$ is convex and differentiable with a Lipschitz continuous gradient of constant $L>0$ .

$\mathsf{denoise}_{\sigma}$ is $\theta$ -averaged with $\theta\in(0,1)$ for any $\sigma>0$ .

We can then establish the following convergence result.

Run PnP-ISTA for $t\geq 1$ iterations under Assumption 1 with the step ${\gamma\in(0,1/L]}$ and ${q_{k}=1}$ for all ${k\in\{1,\dots,t\}}$ . Then, for ${\bm{x}^{\ast}}\in\mathsf{fix}(\mathsf{P})$ , we have that

The direct consequence of Proposition 2 is that

that is under Assumption 1, the iterates of PnP-ISTA can get arbitrarily close to the set of fixed points $\mathsf{fix}(\mathsf{P})$ with rate $O(1/t)$ . Note that the result is expressed in terms of the distance to ${\bm{x}}=\mathsf{P}({\bm{x}})$ as PnP-ISTA is not necessarily minimizing an objective function.

Recently, Meinhardt et al. have showed that for continuous denoisers, the fixed-points of several PnP algorithms coincide. The following proposition is a minor variation of their result tailored for PnP-ADMM.

Under Assumption 1, the set of fixed-points of PnP-ADMM coincides with $\mathsf{fix}(\mathsf{P})$ .

In the context of the work by Sreehari et al. , the propositions above indicate that the symmetric gradient assumption is not necessary for the convergence of PnP-ISTA. Moreover, both PnP-ISTA and PnP-ADMM are equivalent in the sense that they have the same set of solutions specified by $\mathsf{fix}(\mathsf{P})$ .

The bounded denoiser assumption (8) is a more relaxed assumption on the denoising operator and was used to analyze PnP-ADMM. However, we argue that it is not sufficient to guarantee the convergence of PnP-ISTA. The following proposition builds on a specific counter example.

Definition 1 makes verifying that a denoiser is averaged equivalent to verifying nonexpansiveness of some operator. As was argued in several recent publications the task is more difficult for some denoisers than it is for others and there exist denoisers for which this condition does not hold. However, all recently designed denoisers for PnP from satisfy our assumptions. As an example, the modified nonlocal means (NLM) filter from is by definition an averaged operator.

Online Algorithm

We now introduce our second key contribution: the new online variant of PnP-ISTA called PnP-SGD. We additionally prove its convergence for averaged denoisers.

In many imaging applications, the data-fidelity term $d$ consists of a large number of component functions

where each $d_{i}$ typically depends only on the subset of the measurements ${\bm{y}}$ . Note that in the notation (13), the expectation is taken over a uniformly distributed random variable ${i\in\{1,\dots,I\}}$ . The computation of the gradient of $d$ ,

scales with the total number of components $I$ , which means that when the latter is large, the classical batch PnP algorithms may become impractical in terms of speed or memory requirements. The central idea of PnP-SGD, summarized in Algorithm 5, is to approximate the gradient at every iteration with an average of $B\ll I$ component gradients

where $i_{1},\dots,i_{B}$ are independent random variables that are distributed uniformly over $\{1,\dots,I\}$ . The minibatch size parameter $B\geq 1$ controls the number of gradient components used at every iteration.

We analyze PnP-SGD under the following assumptions:

The functions $d_{i}$ are all convex and differentiable with the same Lipschitz constant $L>0$ .

$\mathsf{denoise}_{\sigma}$ is $\theta$ -averaged with $\theta\in(0,1)$ for any $\sigma>0$ .

At every iteration, the gradient estimate is unbiased and has a bounded variance:

Note that Assumption 2(a) implies that the complete data-fidelity term $d$ is also convex and has a Lipschitz continuous gradient of constant $L$ . The key difference between Assumption 1 and Assumption 2 is the last condition. The fact that the minibatch gradient is unbiased is the direct consequence of (15). The bounded variance assumption is a standard assumption used in the analysis of online and stochastic algorithms .

Run PnP-SGD for $t\geq 1$ iterations under Assumption 2 with the step ${\gamma\in(0,1/L]}$ and ${q_{k}=1}$ for all ${k\in\{1,\dots,t\}}$ . Then, for ${\bm{x}^{\ast}}\in\mathsf{fix}(\mathsf{P})$ , we have that

where $\mathsf{P}(\cdot)$ is given by (10).

This result shows that the convergence in expectation of PnP-SGD to an element of $\mathsf{fix}(\mathsf{P})$ is proportional to the step-size $\gamma$ and inversely proportional to the mini-batch size $B$ . By controlling these two parameters, we can obtain the following convergence rates.

Consider Proposition 5 with the following fixed (i.e., independent of iteration $k$ ) parameters.

For $\gamma=1/(L\sqrt{t})$ and $B=1$ , we have that

For $\gamma=1/(L\sqrt{t})$ and $B=t$ , we have that

Corollary 1(c) implies the worst-case convergence rate

which means that under Assumption 2 and with a particular selection of parameters $B$ and $\gamma$ , the iterates of PnP-SGD (in expectation) can get arbitrarily close to $\mathsf{fix}(\mathsf{P})$ as $O(1/t)$ .

Numerical Simulations

We now empirically validate PnP-SGD in the context of diffraction tomography (DT) using three popular denoisers: TV , BM3D , and TNRD . Our goal is not to justify the PnP framework, as its benefits have been well illustrated in prior work , but to focus on the aspects that relate to online processing of data. Therefore, we first discuss empirical convergence of PnP-SGD, and then highlight the benefit of using it for processing a large number of measurements.

DT is a technique used to form an image of the distribution of dielectric permittivity within an object from multiple of measurements of light it scatters . This problem is common in a number of applications—including ultrasound and optical microscopy —and is known to be highly data-intensive. A typical reconstruction task uses hundreds or thousands of measurements for forming a single image. As is common in DT, we adopt the first-Born approximation , which leads to the linear inverse problem formulation of image reconstruction.

Note that PnP-SGD is applicable beyond DT and our choice of the latter is only due to the fact that image reconstruction in DT requires the processing of a large number of distinct measurements. Additionally, our focus is not on the experimental application of DT, but rather on the demonstration of our online algorithm for image reconstruction. Hence, we restrict our study here to image reconstruction from purely simulated DT data, which enables optimal parameter tuning and quantitative comparisons.

The objects we reconstruct correspond to the eight standard grayscale images shown Fig. 1. The physical size of an image is set to 18 cm $\times$ 18 cm, discretized to a grid of $256\times 256$ . The wavelength of the illumination was set to $\lambda=0.84$ cm and the background medium was assumed to be air with $\epsilon_{b}=1$ . We additionally set the number of transmitters to $N=60$ , distributed uniformly along a circle of radius $1.6$ meters, and for each illumination, the corresponding scattered field is measured by $M=360$ receivers around the object. The simulated measurements were additionally corrupted by an additive white Gaussian noise (AWGN) corresponding to 40 dB of input signal-to-noise ratio (SNR). SNR is also used as a quantitative metric for numerically evaluating the reconstruction quality in the experiments. We use the term average SNR to indicate the SNR averaged over all the test images. In each experiment, all algorithmic hyperparameters were optimized for the best SNR performance with respect to the ground truth test image.

Convergence of PnP-SGD

One of the key conclusions of Proposition 5 is that the final accuracy of PnP-SGD to a fixed point is proportional to the step size and inversely proportional to the minibatch size. In order to numerically evaluate the convergence, we define the distance to $\mathsf{fix}(\mathsf{P})$ at the $k$ th iteration as

where $\mathsf{P}$ is given by (10). As the sequence $\{{\bm{x}}^{k}\}$ approaches $\mathsf{fix}(\mathsf{P})$ , $\mathsf{dist}({\bm{x}}^{k})$ approaches zero.

Fig. 2 and Fig. 3 empirically evaluate the evolution of the distance to a fixed point for different step and minibatch sizes, respectively. PnP-SGD under BM3D is run until convergence with $\gamma\in\{1/L,1/(4L),1/(16L)\}$ and $B\in\{10,20,30\}$ . Here, the quantity ${L>0}$ denotes the Lipschitz constant, which, for linear inverse problems, corresponds to the squared largest singular value of the measurement matrix . We show the performance of both basic and accelerated variants of PnP-SGD, where the latter is obtained by setting $\{q_{k}\}$ as in (5). The plots clearly illustrate the improvement in final accuracy for smaller $\gamma$ and larger $B$ , which is consistent with Proposition 5. Additionally, they indicate that the convergence is significantly improved when using the accelerated variant of the algorithm. Note that our theoretical analysis does not predict monotonic reduction of the distance, which also seems to be consistent with the empirical performance of PnP-SGD. In Fig. 4, we provide a reference plot showing the performance of PnP-SGD under TV, which is a valid proximal operator and hence is known to be a $1/2$ -averaged operator. We can again observe that the convergence behavior of PnP-SGD is consistent with Proposition 5. Finally, the summary in Table 1, highlights the same convergence trends for all three algorithms, where both $\gamma$ and $B$ control the accuracy of PnP-SGD.

Benefits of online processing

We now highlight the higher efficiency of PnP-SGD against PnP-ISTA and PnP-ADMM for larger number of measurements. Specifically, we consider two scenarios where: (a) the total time budget is fixed; (b) the number of measurements is fixed. While we use BM3D as our plug-in operator of choice, we note that our observations here directly generalize to any other denoiser.

Fig. 5 compares the average reconstruction SNR of PnP-SGD, PnP-FISTA, and PnP-ADMM for a fixed run-time. The batch algorithms use the full $60$ illuminations at every iteration, while PnP-SGD uses only $B=10$ illuminations per iteration. This gives PnP-SGD a significantly lower per iteration cost compared to the batch algorithms. Specifically, the average per iteration time for PnP-SGD, PnP-FISTA, and PnP-ADMM was 8.86 seconds, 44.94 seconds, and 382.83 seconds, respectively. The higher cost of PnP-ADMM is the result of the forward model inversion in (7). This figure illustrates that, in practice, even with $B=10$ , the solution of PnP-SGD is sufficiently close to that of the batch algorithm. Additionally, PnP-SGD achieves a significant speedup due to the reduction in per-iteration complexity. This indicates to the potential of the algorithm for efficient image reconstruction from a large number of measurements.

Fig. 6 compares the average reconstruction SNR of PnP-SGD, PnP-FISTA, and PnP-ADMM for a fixed per-iteration measurement budget. Both batch algorithm are allowed to use only 10 (top figure) or 30 (bottom figure) uniformly distributed illuminations. Similarly, PnP-SGD uses the same number of illuminations per iteration, but randomly cycles through all the measurements. This means that in each figure both PnP-SGD and PnP-FISTA have the same per-iteration computational complexity. The computational complexity of PnP-ADMM is higher due to the need to invert the measurement matrix. Table 2 shows the final SNR obtained by all three algorithms on each individual image in the dataset. Additionally, two visual illustrations on Monarch and Parrot are shown in Fig. 7. As expected, PnP-SGD achieves dramatically higher SNR compared to batch algorithms, since it makes use of the full set of measurements. Additionally, we note the comparable final SNR performance of PnP-SGD with $B=10$ and $B=30$ , with the latter leading to a faster convergence speed. These results again highlight the potential of PnP-SGD for large-scale PnP image reconstruction.

To conclude this section, let us put the results here in the context of our theoretical analysis. Proposition 5 reveals that PnP-SGD converges to the same set of fixed points $\mathsf{fix}(\mathsf{P})$ as PnP-ISTA and PnP-ADMM, up to a term that depends on the minibatch size $B\geq 1$ . Larger $B$ leads to a higher accuracy of PnP-SGD with respect to $\mathsf{fix}(\mathsf{P})$ , which was empirically confirmed in Fig. 3. The SNR results here additionally reveal that even with a relatively small $B$ , PnP-SGD is accurate in terms of image quality. For example, in Table 2, we can observe that the average SNR difference between PnP-SGD with $B=10$ and $B=30$ is within 0.2 dB of each other. Additionally, in Fig. 5, we observe that the batch and online algorithms approximately achieve the same final SNR performance. These observations suggest that while there is an order of magnitude difference in accuracy between $B=10$ and $B=30$ when measured in terms of the distance to a fixed point (see Fig. 3), the difference is relatively mild when measured in terms of image quality (see Fig. 7), with smaller $B$ nearly matching the image quality of the batch algorithm.

Conclusion

The online PnP algorithm developed in this paper is beneficial in the context of large-scale image reconstruction, when the amount of data is too large to be processed jointly. We presented an in-depth theoretical convergence analysis for both batch and online variants of PnP-ISTA. Our work represents a substantial extension of the current convergence theory of PnP-algorithms for image reconstruction. Related experiments are also presented to empirically confirm the proposed propositions and to elucidate the higher efficiency of PnP-SGD in different representative situations. Future work will aim to apply the algorithm to other image reconstruction tasks, relax some of the assumptions, and extend the theoretical results in this paper to ADMM and FISTA.

Appendix

We start by reviewing the key concepts useful for our analysis. A more complete description of these ideas can be found in literature .

An operator $\mathsf{F}$ is Lipschitz continuous with a constant $L>0$ if

When $L=1$ , $\mathsf{F}$ is said to be nonexpansive.

It is straightforward to show that given two operators $\mathsf{F}_{1}$ and $\mathsf{F}_{2}$ with Lipschitz constants $L_{1}$ and $L_{2}$ , respectively, the composition $\mathsf{F}\,\triangleq\,\mathsf{F}_{2}\circ\mathsf{F}_{1}$ has Lipschitz constant $L=L_{1}L_{2}$ . This means that the composition of two nonexpansive operators is also nonexpansive.

Note that the iteration of a nonexpansive operator does not necessarily converge. To see this consider a nonexpansive operator $\mathsf{F}=-\mathsf{I}$ , where $\mathsf{I}$ is the identity. However, the Krasnosel’skii-Mann theorem (see Theorem 5.15 in ) states that the iteration of the damped operator ${\mathsf{D}\,\triangleq\,(1-\alpha)\mathsf{I}+\alpha\mathsf{F}}$ , for $\alpha\in(0,1)$ , will converge to $\mathsf{fix}(\mathsf{F})$ . This idea is further formalized with the definition of the following class of operators.

For a constant $\alpha\in(0,1)$ , we say that the operator $\mathsf{D}$ is $\alpha$ -averaged, if there exists a nonexpansive operator $\mathsf{F}$ such that $\mathsf{D}=(1-\alpha)\mathsf{I}+\alpha\mathsf{F}.$

An important result from convex analysis is that the proximal operator is $(1/2)$ -averaged (see p. 132 in ). Similarly, when $d$ is convex and has a Lipschitz continuous gradient of constant $L$ , the gradient-step operator $\mathsf{G}_{\gamma}$ is $(\gamma L/2)$ -averaged for any ${\gamma\in(0,2/L)}$ (see p. 17 in ). As stated next, the composition of two averaged operators is also averaged.

Let $\mathsf{F}_{1}$ be $\alpha_{1}$ -averaged and $\mathsf{F}_{2}$ be $\alpha_{2}$ -averaged. Then, the composite operator $\mathsf{F}\,\triangleq\,\mathsf{F}_{2}\circ\mathsf{F}_{1}=\mathsf{F}_{2}\mathsf{F}_{1}$ is

The direct consequence of this theorem, is that the composition of the proximal operator and the gradient-step is also an averaged operator. The following classical result was used in Definition 1 and is central for our subsequent analysis.

For a nonexpansive operator $\mathsf{D}$ and a constant $\alpha\in(0,1)$ , the following are equivalent:

$(1-1/\alpha)\mathsf{I}+(1/\alpha)\mathsf{D}$ is nonexpansive.

Proof of Proposition 1

Proposition 1 is a direct consequence of the well-known fixed-point interpretation of ISTA (see p. 150 in ). We provide the proof here for completeness by using the following characterization of the proximal operator

Proof of Proposition 2

As mentioned in Appendix 7.1, the iterative application of an averaged operator is well known as Krasnosel’skii-Mann iteration and its convergence has been extensively discussed in literature . Below, we use this theory to establish a novel convergence result for PnP-ISTA.

From our assumptions, the denoiser $\mathsf{D}_{\sigma}$ is $\theta$ -averaged and the gradient-step operator $\mathsf{G}_{\gamma}$ is $(\gamma L/2)$ -averaged for any $\gamma\in(0,2/L)$ . From Proposition 6, we have that their composition ${\mathsf{P}=\mathsf{D}_{\sigma}\circ\mathsf{G}_{\gamma}}$ is

averaged. Consider a single iteration ${{\bm{x}}^{+}=\mathsf{P}({\bm{x}})}$ , then we have for any ${\bm{x}^{\ast}}\in\mathsf{fix}(\mathsf{P})$ that

where we used Proposition 7(c) and the fact that ${\bm{x}^{\ast}}=\mathsf{P}({\bm{x}^{\ast}})$ . By considering the iteration $k\geq 1$ and rearranging the terms, we obtain

By averaging this inequality over $t\geq 1$ iterations and dropping the last term $\|{\bm{x}}^{t}-{\bm{x}^{\ast}}\|_{2}^{2}$ , we obtain

To obtain the result that depends on $\theta\in(0,1)$ , we note that for any $\gamma\in(0,1/L]$ , we can write

Proof of Proposition 3

Proposition 3 is a variation of the result in . For completeness, we provide a proof based on the fixed-point interpretation of ADMM (see p. 157 in ).

First note that both $\mathsf{D}_{\sigma}$ and $\mathsf{prox}_{\gamma d}$ are continuous (since they are nonexpansive). Fixed points ${\bm{x}^{\ast}},{\bm{z}^{\ast}},{\bm{s}^{\ast}}$ of PnP-ADMM satisfy

From (22c), we conclude that ${\bm{z}^{\ast}}={\bm{x}^{\ast}}$ . By using the smoothness of $d$ and the characterization (20) in (22a), we obtain

Finally, by using this in (22b), we obtain

which means that ${\bm{x}^{\ast}}={\bm{z}^{\ast}}\in\mathsf{fix}(\mathsf{P})$ and completes the proof.

Proof of Proposition 4

This function is convex and has a Lipschitz continuous gradient with constant $L=1$

where $\mathsf{sgn}(\cdot)$ denotes the sign function. We also consider the denoiser defined as

where $c>0$ is some constant independent of $\sigma>0$ . Since

this denoiser satisfies the definition of boundedness in (8). Then, for $q_{k}=1$ , a single iteration of PnP-ISTA can be re-written as

where we assume any $\gamma\in(0,1)$ . By combining these equations, we obtain

where we used the fact that $\mathsf{sgn}(x)=\mathsf{sgn}(z)$ and expressed ${z=|z|\,\mathsf{sgn}(z)}$ . For $|z|\leq 1-\sigma\sqrt{c}$ , we have that

On the other hand, for $|z|>1-\sigma\,\sqrt{c}$ , we have that

This means that the iterates of PnP-ISTA satisfy

Proof of Proposition 5

We define the full proximal-gradient operator

and its online variant over a minibatch of size ${B\geq 1}$

where in the third row we used the nonexpansiveness of $\mathsf{D}_{\sigma}$ . Consider a single iteration ${\bm{x}}^{k}=\widehat{\mathsf{P}}({\bm{x}}^{k-1})$ , then we have for any ${\bm{x}^{\ast}}\in\mathsf{fix}(\mathsf{P})$ that

where we used Proposition 7(c) and the Cauchy-Schwarz inequality. Note that due to nonexpansiveness of the operator $\mathsf{P}$ , we have that

By taking a conditional expectation of (7.6) and using these bounds, we obtain

By averaging the inequality over $t\geq 1$ iterations, taking the total expectation, and dropping the last term, we obtain

where we used the law of total expectation. By using the inequality (21), we can rewrite this expression as