Convolutional Dictionary Learning: A Comparative Review and New Algorithms

Cristina Garcia-Cardona, Brendt Wohlberg

I Introduction

Sparse representations have become one of the most widely used and successful models for inverse problems in signal processing, image processing, and computational imaging. The reconstruction of a signal $\mathbf{s}$ from a sparse representation $\mathbf{x}$ with respect to dictionary matrix $D$ is linear, i.e. $\mathbf{s}\approx D\mathbf{x}$ , but computing the sparse representation given the signal, referred to as sparse coding, usually involves solving an optimization problemWe do not consider the analysis form of sparse representations in this work, focusing instead on the more common synthesis form.. When solving problems involving images of any significant size, these representations are typically independently applied to sets of overlapping image patches due to the intractability of learning an unstructured dictionary matrix $D$ mapping to a vector space with the dimensionality of the number of pixels in an entire image.

The convolutional form of sparse representations replaces the unstructured dictionary $D$ with a set of linear filters $\{\mathbf{d}_{m}\}$ . In this case the reconstruction of $\mathbf{s}$ from representation $\{\mathbf{x}_{m}\}$ is $\mathbf{s}\approx\sum_{m}\mathbf{d}_{m}\ast\mathbf{x}_{m}$ , where $\mathbf{s}$ can be an entire image instead of a small image patch. This form of representation was first introduced some time ago under the label translation-invariant sparse representations , but has recently enjoyed a revival of interest as convolutional sparse representations, inspired by deconvolutional networks (see [5, Sec. II]). This interest was spurred by the development of more efficient methods for the computationally-expensive convolutional sparse coding (CSC) problem , and has led to a number of applications in which the convolutional form provides state-of-the-art performance .

The current leading CSC algorithms are all based on the Alternating Direction Method of Multipliers (ADMM) , which decomposes the problem into two subproblems, one of which is solved by soft-thresholding, and the other having a very efficient non-iterative solution in the DFT domain . The design of convolutional dictionary learning (CDL) algorithms is less straightforward. These algorithms adopt the usual approach for standard dictionary learning, alternating between a sparse coding step that updates the sparse representation of the training data given the current dictionary, and a dictionary update step that updates the current dictionary given the new sparse representation. It is the inherent computational cost of the latter update that makes the CDL problem more difficult than the CSC problem.

Most recent batch-modeWe do not consider the very recent online CDL algorithms in this work. CDL algorithms share the structure introduced in (and described in more detail in ), the primary features of which are the use of Augmented Lagrangian methods and the solution of the most computationally expensive subproblems in the frequency domain. Earlier algorithms exist (see [5, Sec. II.D] for a thorough literature review), but since they are less effective, we do not consider them here, focusing on subsequent methods:

Proposed a number of improvements on the algorithm of , including more efficient sparse representation and dictionary updates, and a different Augmented Lagrangian structure with better convergence properties (examined in more detail in ).

Proposed a number of dictionary update methods that lead to CDL algorithms with better performance than that of .

Proposed a CDL algorithm that allows the inclusion of a spatial mask in the data fidelity term by exploiting the mask decoupling technique .

Proposed an alternative masked CDL algorithm that has much lower memory requirements than that of , and that converges faster in some contexts.

Unfortunately, due to the absence of any thorough performance comparisons between all of them (for example, provides comparisons with but not ), as well as due to the absence of a careful exploration of the optimum choice of algorithm parameters in most of these works, it is difficult to determine which of these methods truly represents the state of the art in CDL.

Three other very recent methods do not receive the same thorough attention as those listed above. The algorithm of addresses a variant of the CDL problem that is customized for neural signal processing and not relevant to most imaging applications, and appeared while we were finalizing this paper, so that it was not feasible to include them in our analysis or our main set of experimental comparisons. However, since the authors of have made an implementation of their method publicly available, we do include this method in some additional performance comparisons in Sec. SV to SVII of the Supplementary Material.

The main contributions of the present paper are:

Providing a thorough performance comparison among the different methods proposed in , allowing reliable identification of the most effective algorithms.

Demonstrating that two of the algorithms proposed in , with very different derivations, are in fact closely related and fall within the same class of algorithm.

Proposing a new approach for the CDL problem without a spatial mask that outperforms all existing methods in a serial processing context.

Proposing new approaches for the CDL problem with a spatial mask that respectively outperform existing methods in serial and parallel processing contexts.

Carefully examining the sensitivity of the considered CDL algorithms to their parameters, and proposing simple heuristics for parameter selection that provide good performance.

II Convolutional Dictionary Learning

CDL is usually posed in the form of the problem

where the constraint on the norms of filters $\mathbf{d}_{m}$ is required to avoid the scaling ambiguity between filters and coefficientsThe constraint $\left\lVert\mathbf{d}_{m}\right\rVert_{2}\leq 1$ is frequently used instead of $\left\lVert\mathbf{d}_{m}\right\rVert_{2}=1$ . In practice this does not appear to make a significant difference to the solution.. The training images $\mathbf{s}_{k}$ are considered to be $N$ dimensional vectors, where $N$ is the number of pixels in each image, and we denote the number of filters and the number of training images by $M$ and $K$ respectively. This problem is non-convex in both variables $\{\mathbf{d}_{m}\}$ and $\{\mathbf{x}_{m,k}\}$ , but is convex in $\{\mathbf{x}_{m,k}\}$ with $\{\mathbf{d}_{m}\}$ constant, and vice versa. As in standard (non-convolutional) dictionary learning, the usual approach to minimizing this functional is to alternate between updates of the sparse representation and the dictionary. The design of a CDL algorithm can therefore be decomposed into three components: the choice of sparse coding algorithm, the choice of dictionary update algorithm, and the choice of coupling mechanism, including how many iterations of each update should be performed before alternating, and which of their internal variables should be transferred across when alternating.

While a number of greedy matching pursuit type algorithms were developed for translation-invariant sparse representations [5, Sec. II.C], recent algorithms have largely concentrated on a convolutional form of the standard Basis Pursuit DeNoising (BPDN) problem

This form, which we will refer to as Convolutional BPDN (CBPDN), can be written as

If we define $D_{m}$ such that $D_{m}\mathbf{x}_{m}=\mathbf{d}_{m}\ast\mathbf{x}_{m}$ , and

we can rewrite the CBPDN problem in standard BPDN form Eq. (2). The Multiple Measurement Vector (MMV) version of CBPDN, for multiple images, can be written as

where $\mathbf{s}_{k}$ is the $k^{\text{th}}$ image, and $\mathbf{x}_{m,k}$ is the coefficient map corresponding to the $m^{\text{th}}$ dictionary filter and the $k^{\text{th}}$ image. By defining

we can rewrite Eq. (5) in the standard BPDN MMV form,

Where possible, we will work with this form of the problem instead of Eq. (5) since it simplifies the notation, but the reader should keep in mind that $D$ , $X$ , and $S$ denote the specific block-structured matrices defined above.

The most effective solution for solving Eq. (5) is currently based on ADMMIt is worth noting, however, that a solution based on FISTA with the gradient computed in the frequency domain, while generally less effective than the ADMM solution, exhibits a relatively small performance difference for the larger $\lambda$ values typically used for CDL [5, Sec. IV.B]. , which solves problems of the form

where penalty parameter $\rho$ is an algorithm parameter that plays an important role in determining the convergence rate of the iterations, and $\mathbf{u}$ is the dual variable corresponding to the constraint $A\mathbf{x}+B\mathbf{y}=\mathbf{c}$ . We can apply ADMM to problem Eq. (7) by variable splitting, introducing an auxiliary variable $Y$ that is constrained to be equal to the primary variable $X$ , leading to the equivalent problem

Step Eq. (15) involves simple arithmetic, and step Eq. (14) has a closed-form solution

where $\mathcal{S}_{\gamma}(\cdot)$ is the soft-thresholding function [30, Sec. 6.5.2]

Since $D^{T}D$ is a very large matrix, it is impractical to solve this linear system using the approaches that are effective when $D$ is not a convolutional dictionary. It is possible, however, to exploit the FFT for efficient implementation of the convolution via the DFT convolution theorem. Transforming Eq. (18) into the DFT domain gives

where $\hat{Z}$ denotes the DFT of variable $Z$ . Due to the structure of $\hat{D}$ , which consists of concatenated diagonal matrices $\hat{D}_{m}$ , linear system Eq. (19) can be decomposed into a set of $NK$ independent linear systems , each of which has a left hand side consisting of a diagonal matrix plus a rank-one component, which can be solved very efficiently by exploiting the Sherman-Morrison formula .

II-B Dictionary Update

In developing the dictionary update, it is convenient to switch the indexing of the coefficient map from $\mathbf{x}_{m,k}$ to $\mathbf{x}_{k,m}$ , writing the problem as

which is a convolutional form of Method of Optimal Directions (MOD) with a constraint on the filter normalization. As for CSC, we will develop the algorithms for solving this problem in the spatial domain, but will solve the critical sub-problems in the frequency domain. We want to solve for $\{\mathbf{d}_{m}\}$ with a relatively small support, but when computing convolutions in the frequency domain, we need to work with $\mathbf{d}_{m}$ that have been zero-padded to the common spatial dimensions of $\mathbf{x}_{k,m}$ and $\mathbf{s}_{k}$ . The most straightforward way of dealing with this complication is to consider the $\mathbf{d}_{m}$ to be zero-padded and add a constraint that requires that they be zero outside of the desired support. If we denote the projection operator that zeros the regions of the filters outside of the desired support by $P$ , we can write a constraint set that combines this support constraint with the normalization constraint as

Introducing the indicator function $\iota_{C_{\text{PN}}}$ of the constraint set $C_{\text{PN}}$ , where the indicator function of a set $S$ is defined as

allows Eq. (22) to be written in unconstrained form

Defining $X_{k,m}$ such that $X_{k,m}\mathbf{d}_{m}=\mathbf{x}_{k,m}\ast\mathbf{d}_{m}$ and

Algorithms for solving this problem will be discussed in Sec. III. A common feature of most of these methods is the need to solve a linear system that includes the data fidelity term $(1/2)\left\lVert X\mathbf{d}-\mathbf{s}\right\rVert_{2}^{2}$ . As in the case of the $X$ step Eq. (13) for CSC, this problem can be solved in the frequency domain, but there is a critical difference: $\hat{X}^{H}\hat{X}$ is composed of independent components of rank $K$ instead of rank 1, so that the very efficient Sherman Morrison solution cannot be directly exploited. It is this property that makes the dictionary update inherently more computationally expensive than the sparse coding stage, complicating the design of algorithms, and leading to the present situation in which there is far less clarity as to the best choice of dictionary learning algorithm than there is for the choice of the sparse coding algorithm.

II-C Update Coupling

Both the sparse coding and dictionary update stages are typically solved via iterative algorithms, and many of these algorithms have more than one working variable that can be used to represent the current solution. The major design choices in coupling the alternating optimization of these two stages are therefore:

how many iterations of each subproblem to perform before switching to the other subproblem, and

which working variable from each subproblem to pass across to the other subproblem.

Since these issues are addressed in detail in , we only summarize the conclusions here:

When both subproblems are solved by ADMM algorithms, most authors have coupled the subproblems via the primary variables (corresponding, for example, to $X$ in Eq. (12)) of each ADMM algorithm.

This choice tends to be rather unstable, and requires either multiple iterations of each subproblem before alternating, or very large penalty parameters, which can lead to slow convergence.

The alternative strategy of coupling the subproblems via the auxiliary variables (corresponding, for example, to $Y$ in Eq. (12)) of each ADMM algorithm tends to be more stable, not requiring multiple iterations before alternating, and converging faster.

III Dictionary Update Algorithms

Since the choice of the best CSC algorithm is not in serious dispute, the focus of this work is on the choice of dictionary update algorithm.

The simplest approach to solving Eq. (28) via an ADMM algorithm is to apply the variable splitting

for which the corresponding ADMM iterations are

It is clear from the geometry of the problem that

or, if the normalization $\left\lVert\mathbf{d}_{m}\right\rVert_{2}\leq 1$ is desired instead,

Step Eq. (30) involves solving the linear system

which can be expressed in the DFT domain as

This linear system can be decomposed into a set of $N$ independent linear systems, but in contrast to Eq. (19), each of these has a left hand side consisting of a diagonal matrix plus a rank $K$ component, which precludes direct use of the Sherman-Morrison formula .

We consider three different approaches to solving these linear systems:

An obvious approach to solving Eq. (37) without having to explicitly construct the matrix $\hat{X}^{H}\hat{X}+\sigma I$ is to apply an iterative method such as Conjugate Gradient (CG). The experiments reported in indicated that solving this system to a relative residual tolerance of $10^{-3}$ or better is sufficient for the dictionary learning algorithm to converge reliably. The number of CG iterations required can be substantially reduced by using the solution from the previous outer iteration as an initial value.

III-A2 Iterated Sherman-Morrison

Since the independent linear systems into which Eq. (37) can be decomposed have a left hand side consisting of a diagonal matrix plus a rank $K$ component, one can iteratively apply the Sherman-Morrison formula to obtain a solution . This approach is very effective for small to moderate $K$ , but performs poorly for large $K$ since the computational cost is $\mathcal{O}(K^{2})$ .

III-A3 Spatial Tiling

When $K=1$ in Eq. (37), the very efficient solution via the Sherman-Morrison formula is possible. As pointed out in , a larger set of training images can be spatially tiled to form a single large image, so that the problem is solved with $K^{\prime}=1$ .

III-B Consensus Framework

In this section it is convenient to introduce different block-matrix and vector notation for the coefficient maps and dictionary, but we overload the usual symbols to emphasize their corresponding roles. We define $X_{k}$ as in Eq. (25), but define

where $\mathbf{d}_{m,k}$ is distinct copy of dictionary filter $m$ corresponding to training image $k$ .

As proposed in , we can pose problem Eq. (28) in the form of an ADMM consensus problem [17, Ch. 7]

which can be written in standard ADMM form as

where $E=\left(\begin{array}[]{ccc}I&I&\ldots\end{array}\right)^{T}$ .

Since $X$ is block diagonal, Eq. (41) can be solved as the $K$ independent problems

each of which can be solved via the same efficient DFT-domain Sherman-Morrison method used for Eq. (13). Subproblem Eq. (42) can be expressed as [17, Sec. 7.1.1]

III-C 3D / Frequency Domain Consensus

Like spatial tiling (see Sec. III-A3), the “3D” method proposed in maps the dictionary update problem with $K>1$ to an equivalent problem for which $K^{\prime}=1$ . The “3D” method achieves this by considering an array of $K$ 2D training images as a single 3D training volume. The corresponding dictionary filters are also inherently 3D, but the constraint is modified to require that they are zero other than in the first 3D slice (this can be viewed as an extension of the constraint that the spatially-padded filters are zero except on their desired support) so that the final results is a set of 2D filters, as desired.

While ADMM consensus and “3D” were proposed as two entirely distinct methods , it turns out they are closely related: the “3D” method is ADMM consensus with the data fidelity term and constraint expressed in the DFT domain. Since the notation is a bit cumbersome, the point will be illustrated for the $K=2$ case, but the argument is easily generalized to arbitrary $K$ .

When $K=2$ , the dictionary update problem can be expressed as

which can be rewritten as the equivalent problemEquivalence when the constraints are satisfied is easily verified by multiplying out the matrix-vector product in the data fidelity term in Eq. (59).

where the constraint can also be written as

The general form of the matrix in Eq. (59) is a block-circulant matrix constructed from the blocks $X_{k}$ . Since the multiplication of the dictionary block vector by the block-circulant matrix is equivalent to convolution in an additional dimension, this equivalent problem represents the “3D” method.

Now, define the un-normalized $2\times 2$ block DFT matrix operating in this extra dimension as

and apply it to the objective function and constraint, giving

Since the DFT diagonalises a circulant matrix, this is

In this form the problem is an ADMM consensus problem in variables

III-D FISTA

The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) , an accelerated proximal gradient method, has been used for CSC , and in a recent online CDL algorithm , but has not previously been considered for the dictionary update of a batch-mode dictionary learning algorithm.

The FISTA iterations for solving Eq. (28) are

where $t^{0}=1$ , and $L>0$ is a parameter controlling the gradient descent step size. Parameter $L$ can be computed adaptively by using a backtracking step size rule , but in the experiments reported here we used a constant $L$ for simplicity. The gradient of the data fidelity term $(1/2)\left\lVert X\mathbf{d}-\mathbf{s}\right\rVert_{2}^{2}$ in Eq. (93) is computed in the DFT domain

as advocated in for the FISTA solution of the CSC problem, and the $\mathbf{y}^{(i+1)}$ variable is taken as the result of the dictionary update.

IV Masked Convolutional Dictionary Learning

When we wish to learn a dictionary from data with missing samples, or have reason to be concerned about the possibility of boundary artifacts resulting from the circular boundary conditions associated with the computation of the convolutions in the DFT domain, it is useful to introduce a variant of Eq. (1) that includes a spatial mask , which can be represented by a diagonal matrix $W$

As in Sec. II, we separately consider the minimization of this functional with respect to $\{\mathbf{x}_{m,k}\}$ (sparse coding) and $\{\mathbf{d}_{m}\}$ (dictionary update).

A masked form of the MMV CBPDN problem Eq. (7) can be expressed as the problemFor simplicity, the notation presented here assumes a fixed mask $W$ across all columns of $DX$ and $S$ , but the algorithm is easily extended to handle a different $W_{k}$ for each column $k$ .

There are two different methods for solving this problem. The one, proposed in , exploits the mask decoupling technique , involving applying an alternative variable splitting to give the ADMM problemThis is a variant of the earlier problem formulation , which was $\displaystyle\operatorname*{arg\,min}_{X}\;(1/2)\left\lVert WY_{1}-WS\right\rVert_{F}^{2}+\lambda\|Y_{0}\|_{1}\;\text{ s.t. }\;Y_{0}=X\;,\;Y_{1}=DX\;.$

where the constraint can also be written as

The functional minimized in Eq. (107) is of the same form as Eq. (13), and can be solved via the same frequency domain method, the solution to Eq. (108) is as in Eq. (16), and the solution to Eq. (109) is given by

The other method for solving Eq. (98) involves appending an impulse filter to the dictionary and solving the problem in a way that constrains the coefficient map corresponding to this filter to be zero where the mask is unity, and to be unconstrained where the mask is zero . Both approaches provide very similar performance , the major difference being that the former is a bit more complicated to implement, while the latter is restricted to addressing problems where $W$ has only zero or one entries. We will use the mask decoupling approach for the experiments reported here since it does not require any restrictions on $W$ .

IV-B Dictionary Update

The dictionary update requires solving the problem

Algorithms for solving this problem are discussed in the following section.

V Masked Dictionary Update Algorithms

Problem Eq. (113) can be solved via the splitting

where the constraint can also be written as

V-B Extended Consensus Framework

In this section we re-use the variant notation introduced in Sec. III-B. The masked dictionary update Eq. (113) can be solved via a hybrid of the mask decoupling and ADMM consensus approaches, which can be formulated as

where the constraint can also be written as

or, expanding the block components of $\mathbf{d}$ , $\mathbf{g}_{1}$ , and $\mathbf{s}$ ,

Steps Eq. (154), (155), and (157) have the same form, and can be solved in the same way, as steps Eq. (41), (42), and (43) respectively of the ADMM algorithm in Sec. III-B, and steps Eq. (156) and (158) have the same form, and can be solved in the same way, as the corresponding steps in the ADMM algorithm of Sec. V-A.

V-C FISTA

Problem Eq. (113) can be solved via FISTA as described in Sec. III-D, but the calculation of the gradient term is complicated by the presence of the spatial mask. This difficulty can be handled by transforming back and forth between spatial and frequency domains so that the convolution operations are computed efficiently in the frequency domain, while the masking operation is computed in the spatial domain, i.e.

where $F$ and $F^{-1}$ represent the DFT and inverse DFT transform operators, respectively.

VI Multi-Channel CDL

As discussed in , there are two distinct ways of defining a convolutional representation of multi-channel data: a single-channel dictionary together with a distinct set of coefficient maps for each channel, or a multi-channel dictionary together with a shared set of coefficient mapsOne might also use a mixture of these two approaches, with the channels partitioned into subsets, each of which is assigned a distinct dictionary channel, but this option is not considered further here.. Since the dictionary learning problem for the former case is a straightforward extension of the single-channel problems discussed above, here we focus on the latter case, which can be expressed asMulti-channel CDL is presented in this section as an extension of the CDL framework of Sec. II and III. Application of the same extension to the masked CDL framework of Sec. IV is straightforward, and is supported in our software implementations .

where $\mathbf{d}_{c,m}$ is channel $c$ of the $m^{\text{th}}$ dictionary filter, and $\mathbf{s}_{c,k}$ is channel $c$ of the $k^{\text{th}}$ training signal. We will denote the number of channels by $C$ . As before, we separately consider the sparse coding and dictionary updates for alternating minimization of this functional.

Defining $D_{c,m}$ such that $D_{c,m}\mathbf{x}_{m,k}=\mathbf{d}_{c,m}\ast\mathbf{x}_{m,k}$ , and

we can write the sparse coding component of Eq. (160) as

For the FISTA solution, we compute the gradient of the data fidelity term $(1/2)\sum_{c,k}\big{\lVert}D_{c}\mathbf{x}_{k}-\mathbf{s}_{c,k}\big{\rVert}_{2}^{2}$ in Eq. (162) in the DFT domain

In contrast to the ADMM methods, the multi-channel problem is not significantly more challenging than the single channel case, since it simply involves an additional sum over the $C$ channels.

VI-B Dictionary Update

In developing the dictionary update it is convenient to re-index the variables in Eq. (160), writing the problem as

Defining $X_{k,m}$ , $X_{k}$ , $X$ and $C_{\text{PN}}$ as in Sec. II-B, and

or in simpler formThe definition of $\iota_{C_{\text{PN}}}(\cdot)$ is overloaded here in that the specific projection from which $C_{\text{PN}}$ is defined depends on the matrix structure of its argument.

It is clear that the structure of $X$ is the same as in the single-channel case and that the solutions for the different channel dictionaries $\mathbf{d}_{c}$ are independent, so that the dictionary update in the multi-channel case is no more computationally challenging than in the single channel case.

VI-C Relationship between K𝐾K and C𝐶C

The above discussion reveals an interesting dual relationship between the number of images, $K$ , in coefficient map set $X$ , and the number of channels, $C$ , in dictionary $D$ . When solving the CDL problem via proximal algorithms such as ADMM or FISTA, $C$ controls the rank of the most expensive subproblem of the convolutional sparse coding stage in the same way that $K$ controls the rank of the main subproblem of the convolutional dictionary update. In addition, algorithms that are appropriate for the large $K$ case of the dictionary update are also suitable for the large $C$ case of sparse coding, and vice versa.

VII Results

In this section we compare the computational performance of the various approaches that have been discussed, carefully selecting optimal parameters for each algorithm to ensure a fair comparison.

Before proceeding to the results of the computational experiments, we summarize the dictionary learning algorithms that will be compared. Instead of using the complete dictionary learning algorithm proposed in each prior work, we consider the primary contribution of these works to be in the dictionary update method, which is incorporated into the CDL algorithm structure that was demonstrated in to be most effective: auxiliary variable coupling with a single iteration for each subproblemIn some cases, slightly better time performance can be obtained by performing a few iterations of the sparse coding update followed by a single dictionary update, but we do not consider this complication here. before alternating. Since the sparse coding stages are the same, the algorithm naming is based on the dictionary update algorithms.

The following CDL algorithms are considered for problem Eq. (1) without a spatial mask {LaTeXdescription}

The CDL algorithm uses the dictionary update proposed in , but the more effective variable coupling and alternation strategy discussed in .

The CDL algorithm uses the dictionary update technique proposed in , but the substantially more effective variable coupling and alternation strategy discussed in .

The algorithm is the same as Cns, but with a parallel implementation of both the sparse coding and dictionary update stagesŠorel and Šroubek observe that the ADMM consensus problem is inherently parallelizable [17, Ch. 7], but do not actually implement the corresponding CDL algorithm in parallel form to allow the resulting computational gain to be quantified empirically.. All steps of the CSC stage are completely parallelizable in the training image index $k$ , as are the $\mathbf{d}$ and $\mathbf{h}$ steps of the dictionary update, the only synchronization point being in the $\mathbf{g}$ step, Eq. (42), where all the independent dictionary estimates are averaged and projected (see Eq. (46)) to update the consensus variable that all the processes share.

The CDL algorithm uses the dictionary update proposed in , but the more effective variable coupling and alternation strategy discussed in .

Not previously considered for this problem.

The following dictionary learning algorithms are considered for problem Eq. (97) with a spatial mask {LaTeXdescription}

Not previously considered for this problem.

The CDL algorithm is based on a new dictionary update constructed as a hybrid of the dictionary update methods proposed in and , with the effective variable coupling and alternation strategy discussed in .

The algorithm is the same as M-Cns, but with a parallel implementation of both the sparse coding and dictionary update. All steps of the CSC stage and the $\mathbf{d}$ , $\mathbf{g}_{1}$ , $\mathbf{h}_{0}$ , and $\mathbf{h}_{1}$ steps of the dictionary update are completely parallelizable in the training image index $k$ , the only synchronization point being in the $\mathbf{g}_{0}$ step, Eq. (155), where all the independent dictionary estimates are averaged and projected to update the consensus variable that all the processes share.

Not previously considered for this problem.

In addition to the algorithms listed above, we investigated Stochastic Averaging ADMM (SA-ADMM) , as proposed for CDL in . Our implementation of a CDL algorithm based on this method was found to have promising computational cost per iteration, but its convergence was not competitive with some of the other methods considered here. However, since there are a number of algorithm details that are not provided in (CDL is not the primary topic of that work), it is possible that our implementation omits some critical components. These results are therefore not included here in order to avoid making an unfair comparison.

We do not compare with the dictionary learning algorithm in because the algorithms of both and were both reported to be substantially faster. We do not include the algorithms of either and in our main set of experiments because we do not have implementations that are practical to run over the large number of different training image sets and parameter choices that are used in these experiments, but we do include these algorithms in some additional performance comparisons in Sec. SVII of the Supplementary Material. Multi-channel CDL problems are not included in our main set of experiments due to space limitations, but some relevant experiments are provided in Sec. SVIII of the Supplementary Material.

VII-B Computational Complexity

The per-iteration computational complexities of the methods are summarized in Table I. Instead of just specifying the dominant terms, we include all major contributing terms to provide a more detailed picture of the computational cost. All methods scale linearly with the number of filters, $M$ , and with the number of images, $K$ , except for the ISM variants, which scale as $\mathcal{O}(K^{2})$ . The inclusion of the dependency on $K$ for the parallel algorithms provides a very conservative view of their behavior. In practice, there is either no scaling or very weak scaling with $K$ when the number of available cores exceeds $K$ , and weak scaling with $K$ when it exceeds the number of available cores. Memory usage depends on the method and implementation, but all the methods have an $\mathcal{O}(KMN)$ memory requirement for their main variables.

VII-C Experiments

We used training sets of 5, 10, 20, and 40 images. These sets were nested in the sense that all images in a set were also present in all of the larger sets. The parent set of 40 images consisted of greyscale images of size 256 $\times$ 256 pixels, derived from the MIRFLICKR-1M datasetThe image data directly included in the MIRFLICKR-1M dataset is of very low resolution since the dataset is primarily targeted at image classification tasks. We therefore identified and downloaded the original images that were used to construct the MIRFLICKR-1M dataset. by cropping, rescaling, and conversion to greyscale. An additional set of 20 images, of the same size and from the same source, was used as a test set to allow comparison of generalization performance, taking into account possible differences in overfitting effects between the different methods.

The 8 bit greyscale images were divided by 255 so that pixel values were within the interval , and were highpass filtered (a common approach for convolutional sparse representations [42, Sec. 3]) by subtracting a lowpass component computed by Tikhonov regularization with a gradient term [37, pg. 3], with regularization parameter $\lambda=5.0$ .

The results reported here were computed using the Python implementation of the SPORCO library on a Linux workstation equipped with two Xeon E5-2690V4 CPUs.

VII-D Optimal Penalty Parameters

To ensure a fair comparison between the methods, the optimal penalty parameters for each method and training image set were selected via a grid search, of CDL functional values obtained after 100 iterations, over $(\rho,\sigma)$ values for the ADMM dictionary updates, and over $(\rho,L)$ values for the FISTA dictionary updates. The grid resolutions were

10 logarithmically spaced points in $[10^{-1},10^{4}]$

15 logarithmically spaced points in $[10^{-2},10^{5}]$

15 logarithmically spaced points in $[10^{1},10^{5}]$

VII-E Performance Comparisons

We compare the performance of the methods in learning a dictionary of 64 filters of size 8 $\times$ 8 for sets of 5, 10, 20 and 40 images, setting the sparsity parameter $\lambda=0.1$ , and using the parameters determined by the grid searches for each method. To avoid complicating the comparisons, we used fixed penalty parameters $\rho$ and $\sigma$ , without any adaptation methods [5, Sec. III.D], and did not apply relaxation methods [17, Sec. 3.4.3][5, Sec. III.D] in any of the ADMM algorithms. Similarly, we used a fixed $L$ for FISTA, without applying any backtracking step-size adaptation rule. Performance in terms of the convergence rate of the CDL functional, with respect to both iterations and computation time, is compared in Figs. 1 – 3. The time scaling with $K$ of all the methods is summarized in Fig. 4(a).

For the $K=5$ case, all the methods have quite similar performance in terms of functional value convergence with respect to iterations. For the larger training set sizes, CG and ISM have somewhat better performance with respect to iterations, but ISM has very poor performance with respect to time. CG has substantially better time scaling, depending on the relative residual tolerance. We ran our experiments for CG with a fixed tolerance of $10^{-3}$ , resulting in computation times that are comparable with those of the other methods. A smaller tolerance leads to better convergence with respect to iterations, but substantially worse time performance.

The “3D” method behaves similarly to ADMM consensus, as expected from the relationship established in Sec. III-C, but has a larger memory footprint. The spatial tiling method (Tiled), on the other hand, tends to have slower convergence with respect to both iterations and time than the other methods. We do not further explore the performance of these methods since they do not provide substantial advantages over the others.

Both parallel (Cns-P) and regular consensus (Cns) have the same evolution of the CBPDN functional, Eq. (5), with respect to iterations, but the former requires much less computation time, and is the fastest method overall. Moreover, parallel consensus exhibits almost ideal parallelizability, with some overhead for $K=5$ , but scaling linearly for $K\in$ , and with very competitive computation times. FISTA is also very competitive, achieving good results in less time than any of the other serial methods, and even outperforming the time performance of Cns-P for the $K=40$ case shown in Fig. 3. We believe that this variation of relative performance with $K$ is due to the unstable dependence of the CDL functional on $L$ that is illustrated, for example, in Fig. 10(b) in the Supplementary Material. This functional decreases slowly as $L$ is decreased, but then increases very rapidly after the minimum is reached, due to the constraint on $L$ discussed in Sec. VII-G2.

All experiments with algorithms that include a spatial mask set the mask to the identity ( $W=I$ ) to allow comparison with the performance of the algorithms without a spatial mask. Plots comparing the evolution of the masked CBPDN functional, Eq. (98), over 1000 iterations and problem sizes of $K\in\{5,20,40\}$ are displayed in Figs. 5 – 7, respectively. The time scaling of all the masked methods is summarized in Fig. 4(b).

While the convergence performance with iterations of the masked version of the FISTA algorithm, M-FISTA, is mixed (providing the worst performance for $K=5$ and $K=20$ , but the best performance for $K=40$ ), it consistently provides good performance in terms of convergence with respect to computation time, despite the additional FFTs discussed in Sec. V-C. The parallel hybrid mask decoupling/consensus method, M-Cns-P, is the other competitive approach for this problem, providing the best time performance for $K=5$ and $K=20$ , while lagging slightly behind M-FISTA for $K=40$ .

In contrast with the corresponding mask-free variants, M-CG and M-ISM have worse performance in terms of both time and iterations. This suggests that M-CG requires a value for the relative residual tolerance smaller than $10^{-3}$ to produce good results, but this would be at the expense of much longer computation times. With the exception of CG, for which the cost of computing the masked version increases for $K\geq 20$ , the computation time for the masked versions is only slightly worse than the mask-free variants (Fig. 4). In general, using the masked versions leads to a marginal decrease in convergence rate with respect to iterations, and a small increase in computation time.

VII-F Evaluation on the Test Set

To provide a comparison that takes into account any possible differences in overfitting and generalization properties of the dictionaries learned by the different methods, we ran experiments over a 20 image test set that is not used during learning. For all the methods discussed, we saved the dictionaries at 50 iteration intervals (including the final one obtained at 1000 iterations) while training. These dictionaries were used to sparse code the images in the test set with $\lambda=0.1$ , allowing evaluation of the evolution of the test set CBPDN functional as the dictionaries change during training. Results for the dictionaries learned while training with $K=20$ and $K=40$ images are shown in Figs. 8 and 9 respectively, and corresponding results for the algorithms with a spatial mask are shown in Figs. 10 and 11 respectively. Note that the time axis in these plots refers to the run time of the dictionary learning code used to generate the relevant dictionary, and not to the run time of the sparse coding on the test set.

As expected, independent of the method, the dictionaries obtained for training with 40 images exhibit better performance than the ones trained with 20 images. Overall, performance on training is a good predictor of performance in testing, which suggests that the functional value on a sufficiently large training set is a reliable indicator of dictionary quality.

VII-G Penalty Parameter Selection

The grid searches performed for determining optimal parameters ensure a fair comparison between the methods, but they are not convenient as a general approach to parameter selection. In this section we show that it is possible to construct heuristics that allow reliable parameter selection for the best performing CDL methods considered here.

Estimates of parameter scaling properties with respect to $K$ are derived in Sec. SIII in the Supplementary Material. For the CDL problem without a spatial mask, these scaling properties are derived for the sparse coding problem, and for the dictionary updates based on ADMM with an equality constraint, ADMM consensus, and FISTA. These estimates indicate that the scaling of the penalty parameter $\rho$ for the convolutional sparse coding is $\mathcal{O}(1)$ , the scaling of the penalty parameter $\sigma$ for the dictionary update is $\mathcal{O}(K)$ for the ADMM with equality constraint and $\mathcal{O}(1)$ for ADMM consensus, and the scaling of the step size $L$ for FISTA is $\mathcal{O}(K)$ . Derivations for the Tiled and 3D methods do not lead to a simple scaling relationship, and are not included.

For the CDL problem with a spatial mask, these scaling properties are derived for the sparse coding problem, and for the dictionary updates based on ADMM with a block-constraint, and extended ADMM consensus. The scaling of the penalty parameter $\rho$ for the masked version of convolutional sparse coding is $\mathcal{O}(1)$ , the scaling of the penalty parameter $\sigma$ for the dictionary update in the extended consensus framework is $\mathcal{O}(1)$ , while there is no simple rule of the $\sigma$ scaling in the block-constraint ADMM of Sec. V-A.

VII-G2 Parameter Selection Guidelines

The derivations discussed above indicate that the optimal algorithm parameters should be expected to be either constant or linear in $K$ . For the parameters of the most effective CDL algorithms, i.e. CG, Cns, FISTA, and M-Cns, we performed additional computational experiments to estimate the constants in these scaling relationships. Cns-P and M-Cns-P have the same parameter dependence as their serial counterparts, and are therefore not evaluated separately. Similarly, M-FISTA is not included in these experiments because it has the same functional evolution as FISTA for the identity mask $W=I$ .

For each training set size $K\in\{5,10,20\}$ , we constructed an ensemble of 20 training sets of that size by random selection from the 40 image training set. For each CDL algorithm and each $K$ , the dependence of the convergence behavior on the algorithm parameters was evaluated by computing 500 iterations of the CDL algorithm for all 20 members of the ensemble of size $K$ , and over grids of $(\rho,\sigma)$ values for the ADMM dictionary updates, and $(\rho,L)$ values for the FISTA dictionary updates. The parameter grids consisted of 10 logarithmically spaced points in the ranges specified in Table III. These parameter ranges were set such that the corresponding functional values remained within 0.1% to 1% of their optimal values.

We normalized the results for each training set by dividing by the minimum of the functional for that set, and computed statistics over these normalized values for all sets of the same size, $K$ . These statistics, which are reported as box plots in Sec. SIV of the Supplementary Material, were also aggregated into contour plots of the median (across the ensemble of training images sets of the same size) of the normalized CDL functional values, displayed in Fig. 12. (Results for ISM are the same as for CG and are not shown.) In each of these contour plots, the horizontal axis corresponds to the number of training images, $K$ , and the vertical axis corresponds to the parameter of interest. The scaling behavior of the optimal parameter with $K$ can clearly be seen in the direction of the valley in the contour plots. Parameter selection guidelines obtained by manual fitting of the constant or linear scaling behavior to these contour plots are plotted in red, and are also summarized in Table IV.

In Fig. 12(f), the guideline for $\sigma$ for M-Cns does not appear to follow the path of the 1.001 level curves. We did not select the guideline to follow this path because (i) the theoretical estimate of the scaling properties of this parameter with $K$ in Sec. SIII-G of the Supplementary Material is that it is constant, and (ii) the path suggested by the 1.001 level curves leads to a logarithmically decreasing curve that would reach negative parameter values for sufficiently large $K$ . We do not have a reliable explanation for the unexpected behavior of the 1.001 level curves, but suspect that it may be related to the loss of diversity of training image sets for $K=20$ , since each of these sets of 20 images was chosen from a fixed set of 40 images. It is also worth noting that the upper level curves for larger functional values, e.g. 1.002, do not follow the same unexpected decreasing path.

To guarantee convergence of FISTA, the inverse of the gradient step size, $L$ , has to be greater than or equal to the Lipschitz constant of the gradient of the functional . In Fig. 12(h), the level curves below the guideline correspond to this potentially unstable regime where the functional value surface has a large gradient. The gradient of the surface is much smaller above the guideline, indicating that convergence is not very sensitive to the parameter value in this region. We chose the guideline precisely to be more biased towards the stable regime.

The parameter selection guidelines presented in this section should only be expected to be reliable for training data with similar characteristics to those used in our experiments, i.e. natural images pre-processed as described in Sec. VII-C, and for the same or similar sparsity parameter, i.e. $\lambda=0.1$ . Nevertheless, since the scaling properties derived in Sec. SIII of the Supplementary Material remain valid, it is reasonable to expect that similar heuristics, albeit with different constants, would hold for different training data or sparsity parameter settings.

VIII Conclusions

Our results indicate that two distinct approaches to the dictionary update problem provide the leading CDL algorithms. In a serial processing context, the FISTA dictionary update proposed here outperforms all other methods, including consensus, for CDL with and without a spatial mask. This may seem surprising when considering that ADMM outperforms FISTA on the CSC problem, but is easily understood when taking into account the critical difference between the linear systems that need to be solved when tackling the CSC and convolutional dictionary update problems via proximal methods such as ADMM and FISTA. In the case of CSC, the major linear system to be solved has a frequency domain structure that allows very efficient solution via the Sherman-Morrison formula, providing an advantage to ADMM. In contrast, except for the $K=1$ case, there is no such highly efficient solution for the convolutional dictionary update, giving an advantage to methods such as FISTA that employ gradient descent steps rather than solving the linear system.

In a parallel processing context, the consensus dictionary update proposed in used together with the alternative CDL algorithm structure proposed in leads to the CDL algorithm with the best time performance for the mask-free CDL problem, and the hybrid mask decoupling/consensus dictionary update proposed here provides the best time performance for the masked CDL problem. It is interesting to note that, despite the clear suitability of the ADMM consensus framework for the convolutional dictionary update problem, a parallel implementation is essential to outperforming other methods; in a serial processing context it is significantly outperformed by the FISTA dictionary update, and even the CG method is competitive with it.

We have also demonstrated that the optimal algorithm parameters for the leading methods considered here tend to be quite stable across different training sets of similar type, and have provided reliable heuristics for selecting parameters that provide good performance. It should be noted, however, that FISTA appears to be more sensitive to the $L$ parameter than the ADMM methods are to the penalty parameter.

The additional experiments reported in the Supplementary Material indicate that the FISTA and parallel consensus methods are scalable to relatively large training sets, e.g. 100 images of $512\times 512$ pixels. The computation time exhibits linear scaling in the number of training images, $K$ , and the number of dictionary filters, $M$ , and close to linear scaling in the number of pixels in each image, $N$ . The limited experiments involving color dictionary learning indicate that the additional computational cost compared with greyscale dictionary learning is moderate. Comparisons with the publicly available implementations of complete CDL methods by other authors indicate that:

The method of Heide et al. does not scale well to training images sets of even moderate size, exhibiting very slow convergence with respect to computation time.

While the consensus CDL method proposed here gives very good performance, the consensus method of Šorel and Šroubek converges much more slowly, and does not learn dictionaries with properly normalized filtersIt is not clear whether this is due to weaknesses in the algorithm, or to errors in the implementation..

The method of Papyan et al. converges rapidly with respect to the number of iterations, and appears to scale well with training set size, but is slower than the FISTA and parallel consensus methods with respect to time, and the resulting dictionaries do not offer competitive performance to the leading methods proposed here in terms of performance on testing image sets.

In the interest of reproducible research, software implementations of the algorithms considered here have been made publicly available as part of the SPORCO library .

References

SI Introduction

This document provides additional detail and results that were omitted from the main document due to space restrictions. All citations refer to the References section of the main document.

SII Penalty Parameter Grid Search

The penalty parameter grid searches discussed in Sec. VII-D in the main document generate 2D surfaces representing the CDL functional value after a fixed number of iterations, plotted against the parameters for the sparse coding and dictionary update components of the dictionary learning algorithm. The surfaces corresponding to the coarse grids for the set of 20 training images are shown here in Figs. S1 – S3.

SIII Analytic Derivation of Penalty Parameter Scaling

In order to estimate the scaling properties of the algorithm parameters with respect to the training set size, $K$ , we consider the case in which the training set size is changed by replication of the same data. By removing the complexities associated with the characteristics of individual images, this simplified scenario allows analytic evaluation of the conditions under which an equivalent problem is obtained when the set size, $K$ , is changed. In practice, changing $K$ involves introducing different training images, and we cannot expect that these scaling properties will hold exactly, but they represent the best possible estimate that depends only on $K$ and not on the properties of the training images themselves.

We will also make use of the invariance of the indicator function under scalar multiplication

which is due to the $\{0,\infty\}$ range of this function.

The augmented Lagrangian for the ADMM solution to CSC problem Eq. (12) in the main document is

where we omit the final term, $-\frac{\rho}{2}\left\lVert U\right\rVert_{F}^{2}$ , which does not effect the minimizer of this functional. For $K=1$ we have $S=\mathbf{s}$ , $X=\mathbf{x}$ , $Y=\mathbf{y}$ , and $U=\mathbf{u}$ . If we construct the $K=2$ case by replicating the training data, we have $S^{\prime}=\left(\begin{array}[]{cc}\mathbf{s}&\mathbf{s}\end{array}\right)$ , $X^{\prime}=\left(\begin{array}[]{cc}\mathbf{x}&\mathbf{x}\end{array}\right)$ , $Y^{\prime}=\left(\begin{array}[]{cc}\mathbf{y}&\mathbf{y}\end{array}\right)$ , and $U^{\prime}=\left(\begin{array}[]{cc}\mathbf{u}&\mathbf{u}\end{array}\right)$ , and the augmented Lagrangian is

For this problem, the augmented Lagrangian for the $K=2$ case is just twice the augmented Lagrangian for the $K=1$ case, with the same penalty parameter $\rho$ . Therefore we expect that the optimal penalty parameter should remain constant when changing the number of training images $K$ .

SIII-B Equality Constrained ADMM Dictionary Update

The augmented Lagrangian for the ADMM solution to the dictionary update problem Eq. (29) in the main document is

where we omit the final term, $-\frac{\sigma}{2}\left\lVert\mathbf{h}\right\rVert_{2}^{2}$ , which does not effect the minimizer of this functional. We assume that the variables in the above equation represent the $K=1$ case, and construct the $K=2$ case by replicating the training data, i.e.

$\mathbf{g}^{\prime}=\mathbf{g}$ , and $\mathbf{h}^{\prime}=\mathbf{h}$ . The corresponding augmented Lagrangian is

For this problem, the augmented Lagrangian for the $K=2$ case is twice the augmented Lagrangian for the $K=1$ case when the penalty parameter is also twice the penalty parameter used for the $K=1$ case. Therefore we expect that the optimal penalty parameter should scale linearly when changing the number of training images $K$ .

SIII-C Consensus ADMM Dictionary Update

The augmented Lagrangian for the ADMM Consensus form of the dictionary update problem Eq. (39) in the main document is

where we omit the final term, $-\frac{\sigma}{2}\left\lVert\mathbf{h}\right\rVert_{2}^{2}$ , which does not effect the minimizer of this functional, and

We assume that the variables in the above equation represent the $K=1$ case, with $E=I$ , and construct the $K=2$ case by replicating the training data, i.e.

$\mathbf{g}^{\prime}=\mathbf{g}$ , and $E^{\prime}=\left(\begin{array}[]{cc}I&I\end{array}\right)^{T}$ . The corresponding augmented Lagrangian is

For this problem, the augmented Lagrangian for the $K=2$ case is just twice the augmented Lagrangian for the $K=1$ case, with the same penalty parameter $\sigma$ . Therefore we expect that the optimal penalty parameter should remain constant when changing the number of training images $K$ .

SIII-D FISTA Dictionary Update

The FISTA solution to the dictionary update problem requires computing the gradient of the data fidelity term in the DFT domain (Eq. (96) in the main document)

We assume that the variables in the above equation represent the $K=1$ case, and construct the $K=2$ case by replicating the training data, i.e.

For this problem, the gradient in the DFT domain for the $K=2$ case is just twice the gradient in the DFT domain for the $K=1$ case. To obtain the same solution we need the gradient step to be the same, which requires that the gradient step parameter be reduced by a factor of two to compensate for the doubling of the gradient. Therefore we expect that the optimal parameter $L$ , which is the inverse of the gradient step size, should scale linearly when changing the number of training images $K$ .

SIII-E Mask Decoupling ADMM Sparse Coding

The augmented Lagrangian for the ADMM solution to the masked form of the MMV CBPDN problem Eq. (99) in the main document is

which does not effect the minimizer of this functional. We assume that the variables in the above equation represent the $K=1$ case, and construct the $K=2$ case by replicating the training data, i.e. $S^{\prime}=\left(\begin{array}[]{cc}\mathbf{s}&\mathbf{s}\end{array}\right)$ , $X^{\prime}=\left(\begin{array}[]{cc}\mathbf{x}&\mathbf{x}\end{array}\right)$ , $Y_{0}^{\prime}=\left(\begin{array}[]{cc}Y_{0}&Y_{0}\end{array}\right)$ , $Y_{1}^{\prime}=\left(\begin{array}[]{cc}Y_{1}&Y_{1}\end{array}\right)$ , $U_{0}^{\prime}=\left(\begin{array}[]{cc}U_{0}&U_{0}\end{array}\right)$ , $U_{1}^{\prime}=\left(\begin{array}[]{cc}U_{1}&U_{1}\end{array}\right)$ , and $0^{\prime}=\left(\begin{array}[]{cc}\mathbf{0}&\mathbf{0}\end{array}\right)$ . The corresponding augmented Lagrangian is

SIII-F Mask Decoupling ADMM Dictionary Update

The augmented Lagrangian for the Block-Constraint ADMM solution of the masked dictionary update problem Eq. (114) in the main document is

$\mathbf{d}^{\prime}=\mathbf{d}$ , $\mathbf{g}_{0}^{\prime}=\mathbf{g}_{0}$ , and $\mathbf{h}_{0}^{\prime}=\mathbf{h}_{0}$ . The corresponding augmented Lagrangian is

For this problem, the augmented Lagrangian for the $K=2$ case has terms that are twice the augmented Lagrangian for the $K=1$ case, as well as a term that is the same as for the $K=1$ case. Therefore, there is no simple rule to scale the optimal penalty parameter $\sigma$ when changing the number of training images $K$ .

It is, however, worth noting that a scaling relationship could be obtained by replacing the constraint $\mathbf{g}^{\prime}_{0}=\mathbf{d}^{\prime}$ with the equivalent constraint $2\mathbf{g}_{0}=2\mathbf{d}$ (or, more generally, $K\mathbf{g}_{0}=K\mathbf{d}$ ) and appropriate rescaling of the scaled dual variable $\mathbf{h}_{0}$ , so that the problematic term above, $(\sigma/2)\left\lVert\mathbf{g}^{\prime}_{0}-\mathbf{d}^{\prime}+\mathbf{h}^{\prime}_{0}\right\rVert_{2}^{2}$ , exhibits the same scaling as the other terms.

SIII-G Hybrid Consensus Masked Dictionary Update

The augmented Lagrangian for the ADMM consensus solution of the masked dictionary update problem Eq. (122) in the main document is

which does not effect the minimizer of this functional. We assume that the variables in the above equation represent the $K=1$ case, with $E=I$ , and construct the $K=2$ case by replicating the training data, i.e.

$\mathbf{g}_{0}^{\prime}=\mathbf{g}_{0}$ , and $E^{\prime}=\left(\begin{array}[]{cc}I&I\end{array}\right)^{T}$ . The corresponding augmented Lagrangian is

SIV Experimental Sensitivity Analysis

Experiments to determine the median stability of the optimal parameters across an ensemble of training sets of the same size are discussed in Sec. VII-G2 in the main document. The corresponding results are plotted here in Figs. S4 – S15. The box plots represent median, quartiles, and the full range of variation of the normalized functional values obtained at each parameter value for the 20 different image subsets at each of the sizes $K\in\{5,10,20\}$ . The red lines connect the medians of the distributions at each parameter value.

It can be seen in Figs. 10(b), 11(b), and 12(b) that FISTA has very skewed sensitivity plots for $L$ , the inverse of the gradient step size. This is related to the requirement, mentioned in the main document, that $L$ has to be greater than or equal to the Lipschitz constant of the gradient of the functional to guarantee convergence of the algorithm. Although this constant is not always computable , in these experiments we are able to estimate the threshold that indicates the change in behavior expected when $L$ becomes greater than the Lipschitz constant. The variation of the normalized functional values is comparable to those for other methods and other parameters for values of $L$ greater than this threshold. However, for values of $L$ smaller than the threshold, the instability causes a much larger variance in the normalized functional values. We decided to clip the large vertical ranges resulting from the very large variances to the left of these plots in order to more clearly display the scaling in the useful range of $L$ . As a result, some of the interquartile range boxes to the left are incomplete, or just the lower part of the full range of variation is visible.

SV Large Training Set Experiments

In order to evaluate the performance of the methods for larger training sets and images of different sizes, we performed additional experiments, including comparisons with the original implementations of competing algorithms. We used training sets of 25, 100 and 400 images of sizes 1024 $\times$ 1024 pixels, 512 $\times$ 512 pixels and 256 $\times$ 256 pixels, respectively. These combinations of number, $K$ , and size, $N$ , of images were chosen to maintain a constant number of pixels in the training set, which provides a useful way of simultaneously exploring performance variations with respect to both $N$ and $K$ . All of these images were derived from images in the MIRFLICKR-1M dataset and pre-processed (scaling and highpass filtering) in the same way, as described in Sec. VII-C in the main document.

All the results using the methods discussed and analyzed in the main document were computed using the Python implementation of the SPORCO library on a Linux workstation equipped with two Xeon E5-2690V4 CPUs. We also include comparisons with the method proposed by Papyan et al. , using their publicly available Matlab and C implementationAvailable from http://vardanp.cswp.cs.technion.ac.il/wp-content/uploads/sites/62/2015/12/SliceBasedCSC.rar.

We tried to include the publicly available Matlab implementations of the methods proposed by Šorel and ŠroubekAvailable from https://github.com/michalsorel/convsparsecoding and by Heide et al. Available from http://www.cs.ubc.ca/labs/imager/tr/2015/FastFlexibleCSC in these comparisons, but were unable to obtain acceptable resultsThe methods were very slow, with partial results after running for 4 days still being noisy and far from convergence.. We therefore omit these methods from the comparisons here, including them only in a separate set of experiments on a smaller data set, reported in Sec. SVII below.

In all of these experiments we learned a dictionary of 100 filters of size 11 $\times$ 11, setting the sparsity parameter $\lambda=0.1$ . We set the parameters for our methods according to the scaling rules discussed in Sec. VII-G2 in the main document, using fixed penalty parameters $\rho$ and $\sigma$ without any adaptation methods. In contrast to the experiments reported in the main document, relaxation methods [17, Sec. 3.4.3][5, Sec. III.D] were used, with $\alpha=1.8$ .

We used the default parameters from the demonstration scripts distributed with each of the publicly available Matlab implementations by the authors of , , and . Our efforts to adjust the default parameters for the implementations of the methods of , and to obtain better results were unsuccessful, at least in part due to the slow convergence of the methods and the absence of any parameter selection discussion or guidelines provided by the authors.

During training, the dictionaries were saved at 25 iteration intervals to allow evaluation on an independent test set, which consisted of the same additional set of 20 images, of size 256 $\times$ 256 pixels, that was used for this purpose for the experiments reported in the main document. This evaluation was performed by sparse coding of the images in the test set, for $\lambda=0.1$ , computing the evolution of the CBPDN functional over the series of dictionaries. This not only allows comparison of generalization performance, taking into account possible differences in overfitting effects between the different methods, but also allows for a fair comparison between the methods, avoiding the difficulty of comparing the training functional values that are computed differently by different implementationsAll of our implementations calculate the functional values in the same way, but the implementations by other authors adopt slightly different approaches..

Results for the training objective function are shown in Fig. S16 for $K=25$ with $1024\times 1024$ images, in Fig. S17 for $K=100$ with $512\times 512$ images, and in Fig. S18 for $K=400$ with 2 $56\times 256$ images. It is clear that Cns-P consistently achieves the best performance, converging smoothly to a slightly smaller functional value than the other two methods in all the cases except for Fig. S16. It also exhibits the fastest convergence of the methods compared. In contrast, FISTA results are less stable, presenting some wild oscillations at the beginning and some small oscillations at the end, but nevertheless achieving similar final functional values to Cns-P. The method of Papyan et al. has very rapid convergence in terms of iterations, but its time performance is the worst of the three methods.

The FISTA instability can be automatically corrected by using the backtracking step-size adaptation rule (see Sec. III-D in main document). However, due to the uni-directional correction of the backtracking rule that always increases $L$ (i.e. it always decreases the gradient step size), the evolution of the functional is smooth, but also tends to converge to a larger functional value. A reasonable approach for methods that do not converge monotonically, such as FISTA, is to consider the solution at each time step as the best solution obtained until that step, as opposed to the solution specifically for that step, which has the effect of smoothing the functional value evolution. In all our experiments, we used a fixed $L$ value, set in accordance with the parameter rules described in the main document, and report actual convergence without any post processing since this more accurately illustrates the real FISTA behavior and the tradeoff between convergence smoothness and final functional value determined by parameter $L$ .

Testing results obtained for the additional 20 images of size $256\times 256$ are displayed in Fig. S19, for $K=25,1024\times 1024$ images, in Fig. S20 for $K=100,512\times 512$ images and in Fig. S21 for $K=400,256\times 256$ images. Note again that, as in the comparisons in the main document, the time axis in these plots refers to the run time of the dictionary learning code used to generate the relevant dictionary, and not to the run time of the sparse coding on the test set.

All the testing plots show that the methods perform as expected from the training comparison, with Cns-P achieving better performance also in the test set, followed by FISTA. Results for the method of Papyan et al. are always worse, and do not match the functional values achieved either by Cns-P or FISTA. For all methods, testing results are better for the dictionary filters obtained when training with $K=400,256\times 256$ images (Fig. S21), followed by the dictionary filters obtained when training with $K=100,512\times 512$ images (Fig. S20), with the worst results obtained for the dictionary filters obtained when training with $K=25,1024\times 1024$ images (Fig. S19). In particular, the Cns-P functional increases near the end of the evolution in Fig. S19. We believe that this is due to overfitting effects for the $K=100$ and $K=25$ cases, resulting from the mismatch between training and validation image sizes. Additional experiments (results not shown) confirmed that the functional decreases monotonically when the size of the images in the testing set corresponds to the size of the images in training set. Nevertheless, we decided to use the same testing set for all of these experiments so that the corresponding functionals would be comparable across the different training sets.

It can be see from Fig. 22(a) that the time per iteration for both Cns-P and FISTA decreases very slowly with increasing $K$ and decreasing $N$ , i.e. it is roughly linear in $NK$ , the number of pixels in the training image set. Since the results in Fig. 4 show that these algorithms scale linearly with $K$ , this implies that the algorithms have approximately linear scaling with $N$ as well. The slight deviation from linearity can be attributed to the $N\log N$ complexity of the FFTs used in these algorithms (see the computational complexity analysis in Table I in the main document). The method of Papyan et al. seems to be more sensitive to the scaling in $K$ , with time per iteration increasing as $K$ increases (which is not evident from the complexity analysis, see Table I below), and requires more time per iteration than Cns-P or FISTA.

SV-B CDL with Spatial Mask

Comparisons for CDL with a spatial mask were performed with a random mask with values in $\{0,1\}$ , with 25% zero entries with a uniform random distribution. Three different random masks were generated, one for the set of images of 1024 $\times$ 1024 pixels, one for the set of 512 $\times$ 512 pixels, and one for the set of 256 $\times$ 256 pixels. All the methods used the same randomly generated masks. The corresponding results are shown in Fig. S23 for $K=25,1024\times 1024$ images, in Fig. S24 for $K=100,512\times 512$ images and in Fig. S25 for $K=400,256\times 256$ images. These resemble the results obtained for the unmasked variants, with M-Cns-P yielding the fastest convergence and smallest final masked CBPDN functional values, followed by M-FISTA. M-FISTA is still initially unstable in some cases, but its convergence becomes much smoother than the unmasked variant by the end of the learning. Since both M-Cns-P and M-FISTA converge to a similar functional value in learning, it is difficult to see the differences in computation time in the plots, but M-Cns-P is almost $2/3$ faster than M-FISTA. The functional values for the masked method of Papyan et al. are inaccurate since the mask is not taken into account in the calculation.

A fair comparison can, however, be made by evaluating the CBPDN functional, Eq. (5), when sparse coding the test set with the dictionary filters learned in training. The results are shown in Fig. S26, for $K=25,1024\times 1024$ images, in Fig. S27 for $K=100,512\times 512$ images and in Fig. S28 for $K=400,256\times 256$ images. Again, note that testing results for the case of $K=400,256\times 256$ are better for all the methods, and that for our methods there are some overfitting effects for the $K=100$ and $K=25$ cases, although these are less significant than those for the unmasked ones. Also, it is clear that testing results for M-Cns-P and M-FISTA are much better than for the masked method of Papyan et al. .

It can be seen from Fig. 22(b) that M-Cns-P and M-FISTA exhibit similar behavior to the corresponding unmasked variants in that the time per iteration is almost constant when the product of $N$ and $K$ remains unchanged. The difference in the time per iteration between unmasked and masked variants is larger for M-FISTA than for M-Cns-P. Conversely, the time per iteration between unmasked and masked variants decreases for the method of Papyan et al., for smaller $K$ and larger $N$ , while it increases slightly for larger $K$ and smaller $N$ . This behavior is not expected from the complexity analysis.

Finally, it is worth noting that, while we do not quantify the optimality of the parameters selected via the guidelines discussed in Sec. VII-G of the main document, they do appear to provide good performance even for the substantially larger problems, considered here, than those used to develop these guidelines. In contrast, we found parameter selection to be problematic for the methods proposed by other authors discussed in Sec. SVII.

SVI Scaling with Dictionary Size

In this section we compare the scaling with respect to the number of filters, $M$ , of our two leading methods (Cns-P and FISTA) and the method of Papyan et al. . Dictionaries with $M\in\{50,100,200,500\}$ filters of size 11 $\times$ 11 were learned, over 500 iterations, from the training set of $K=40$ , 256 $\times$ 256 greyscale images described in the main document. The time per iteration for the three methods is compared in Fig. S29, which shows that all three methods exhibit linear scaling (modulo the outlier at $M=100$ for the method of Papyan et al.) with the number of filters.

These experiments do not address the issue of filter size. While the performance of the DFT-domain methods proposed here is roughly independent of the filter size, spatial domain methods such as that of Papyan et al. become more expensive as the filter size increases. In addition, multi-scale dictionaries are easily supported by the DFT-domain methods, but are much more difficult to support for spatial domain methods.

SVII Additional Algorithm Comparisons

We used the same training set as the previous section ( $K=40$ , 256 $\times$ 256 greyscale images) to compare the performance between our two leading methods (Cns-P and FISTA) and the competing methods proposed by Heide et al. and by Papyan et al. , and the consensus method proposed by Šorel and Šroubek . Our methods are implemented in Python, those of Heide et al. , and of Šorel and Šroubek are implemented in Matlab, and that of Papyan et al. is implemented in Matlab and C.

We compared the performance of the methods in learning a dictionary of 100 filters of size 11 $\times$ 11, setting the sparsity parameter $\lambda=0.1$ . We set the parameters for our methods according to the scaling rules discussed in Sec. VII-G2 in the main document, using fixed penalty parameters $\rho$ and $\sigma$ without any adaptation methods. Relaxation methods [17, Sec. 3.4.3][5, Sec. III.D] were used, with $\alpha=1.8$ . The parameters for the competing methods were set from the default parameters included in their respective demonstration scripts. As before, the additional set of 20 images of size 256 $\times$ 256 pixels was used as a test set to evaluate the dictionaries learned. Again, we report the evolution of the CBPDN functional Eq. (5) for the test set to provide a meaningful comparison, independent of the training functional evaluation implemented by each method, which use slightly different expressions, sometimes calculated with un-normalized dictionaries.

Comparisons for training are shown in Figs. S30 and S31. Performance is comparable for Cns-P, FISTA and the method of Papyan et al., with FISTA initially exhibiting oscillatory behavior. Since the methods of Šorel and Šroubek, and of Heide et al. perform multiple inner iterationsSet to 10 and 5 inner iterations in the demonstration scripts provided by Heide et al., and Šorel and Šroubek respectively. of the sparse coding and dictionary learning subproblems for each outer iteration, the iteration counts for these methods are reported as the product of inner and outer iterations. The method of Heide et al. starts with a very large functional value and is slow to convergeWe were unable to coerce this code to run for a full 500 iterations (50 outer iterations with 10 inner iterations) by any adjustment of stopping conditions and tolerances.. The consensus method of Šorel and Šroubek appears to achieve significantly lower functional values than the other methods, but these results are not comparable since their dictionary filters are not properly normalized. The final dictionaries computed are displayed in Fig. S32.

Sparse coding results on the test set are shown in Fig. S33. Note that Cns-P and FISTA produce the smallest CBPDN functional values, followed by the method of Papyan et al., while results for the methods of Šorel and Šroubek as well as Heide et al. are much worse. Since the functional value evolution for the method of Heide et al. is highly oscillatory, at each iteration we plot the best functional value obtained up until that point instead of the functional value for that iteration. In terms of time evolution, it is clear that Cns-P is the fastest to converge, followed by FISTA and the method of Papyan et al.. The methods of Šorel and Šroubek and of Heide et al. are slow even for this relatively small dataset.

The per-iteration computational complexities of the methods, including both sparse coding and convolutional dictionary learning subproblems, are summarized in Table I. The complexity expressions for the methods of Papyan et al. and Heide et al. are reproduced from those provided in those works, and the expression provided by Šorel and Šroubek is modified to make explicit the dependence on the number of images $K$ (for the sparse coding subproblem) and the internal ADMM iterations $P$ . Our methods have mostly linear scaling in the problem size variables, with the exception of the image size, $N$ , for which the scaling is $N\log N$ , which is shared by all of the methods that compute convolutions in the frequency domain. The corresponding scaling of the spatial domain method of Papyan et al. is $Nn$ , where $n$ is the number of samples in each filter kernel, i.e. the additional $\log N$ scaling with image size of the frequency domain methods is replaced with a linear scaling with filter size. This suggests that frequency domain methods are to be preferred for images of moderate size and moderate to large filter kernels, while spatial domain methods have an advantage for very large images and small filter kernels.

SVIII Multi-Channel Experiments

In this section we report on an experiment intended to demonstrate the multi-channel CDL capability discussed in Sec. VI of the main document. We only provide results for the two leading approaches proposed here (Cns-P and FISTA), and do not compare with the algorithms of Heide et al. , Šorel and Šroubek , or Papyan et al. since none of the corresponding publicly available implementations support multi-channel CDL. All of the color images used for these experiments were derived from images in the MIRFLICKR-1M dataset and pre-processed (cropping, scaling and highpass filtering per channel) in the same way (except for conversion to greyscale) as described in Sec. VII-C in the main document. The parameters of the Cns-P method were set using the parameter selection rules for the single channel problem, without any additional tuning. These rules were also used to set the parameters of the FISTA method, but the rule for $L$ was multiplied by 3 for a more stable convergence.

A dictionary of $M=64$ filters of size $8\times 8$ and $C=3$ channels was learned from a set of $K=40$ color images of size $256\times 256$ , using a sparsity parameter setting $\lambda=0.1$ . The results for this experiment are reported in Fig. S34. Comparing with single-channel dictionary learning results for a dictionary of the same size, and a training image set of the same number of images of the same size, reported in Fig. 3 in the main document, it can be seen that Cns-P requires about $2/3$ of time to compute the greyscale result compared to the color result, while FISTA requires about $3/4$ of time to compute the greyscale result compared to the color result. This additional cost for learning a color dictionary from color images is quite moderate considering that three times more training data is used.

Similarly to the other experiments, we saved the dictionaries at regular intervals during training and used an additional set of 10 color images, of size 256 $\times$ 256 pixels and from the same source, for testing. We compared the methods by sparse coding the color images in the test set, with $\lambda=0.1$ , and computing the evolution of the CBPDN functional over the series of multi-channel dictionaries. Fig. S35 shows that Cns-P performs slightly better than FISTA in testing too, although, Cns-P convergence is less smooth in the final stages compared to the single-channel cases, perhaps due to suboptimal parameter selection. Further evaluation of the multi-channel performance, including parameter selection guidelines, is left for future work.