Exploring Simple Siamese Representation Learning

Xinlei Chen, Kaiming He

Introduction

Recently there has been steady progress in un-/self-supervised representation learning, with encouraging results on multiple visual tasks. Despite various original motivations, these methods generally involve certain forms of Siamese networks. Siamese networks are weight-sharing neural networks applied on two or more inputs. They are natural tools for comparing (including but not limited to “contrasting”) entities. Recent methods define the inputs as two augmentations of one image, and maximize the similarity subject to different conditions.

An undesired trivial solution for Siamese networks is that all outputs “collapse” to a constant. There have been several general strategies for preventing Siamese networks from collapsing. Contrastive learning, e.g., as instantiated in SimCLR, repulses different images (negative pairs) while attracting the same image’s two views (positive pairs). The negative pairs preclude constant outputs from the solution space. Clustering is another way of avoiding constant output, and SwAV incorporates online clustering into Siamese networks. Beyond contrastive learning and clustering, BYOL relies only on positive pairs, but it does not collapse when a momentum encoder is used.

In this paper, we report that simple Siamese networks can work surprisingly well with none of the above strategies for preventing collapsing. Our model directly maximizes the similarity of one image’s two views, using neither negative pairs nor a momentum encoder. It works with typical batch sizes and does not rely on large-batch training. We illustrate this “SimSiam” method in Figure 1.

Thanks to the conceptual simplicity, SimSiam can serve as a hub that relates several existing methods. In a nutshell, our method can be thought of as “BYOL without the momentum encoder”. Unlike BYOL but like SimCLR and SwAV, our method directly shares the weights between the two branches, so it can also be thought of as “SimCLR without negative pairs”, and “SwAV without online clustering”. Interestingly, SimSiam is related to each method by removing one of its core components. Even so, SimSiam does not cause collapsing and can perform competitively.

We empirically show that collapsing solutions do exist, but a stop-gradient operation (Figure 1) is critical to prevent such solutions. The importance of stop-gradient suggests that there should be a different underlying optimization problem that is being solved. We hypothesize that there are implicitly two sets of variables, and SimSiam behaves like alternating between optimizing each set. We provide proof-of-concept experiments to verify this hypothesis.

Our simple baseline suggests that the Siamese architectures can be an essential reason for the common success of the related methods. Siamese networks can naturally introduce inductive biases for modeling invariance, as by definition “invariance” means that two observations of the same concept should produce the same outputs. Analogous to convolutions, which are a successful inductive bias via weight-sharing for modeling translation-invariance, the weight-sharing Siamese networks can model invariance w.r.t. more complicated transformations (e.g., augmentations). We hope our exploration will motivate people to rethink the fundamental roles of Siamese architectures for unsupervised representation learning.

Related Work

Siamese networks are general models for comparing entities. Their applications include signature and face verification, tracking, one-shot learning, and others. In conventional use cases, the inputs to Siamese networks are from different images, and the comparability is determined by supervision.

Contrastive learning.

The core idea of contrastive learning is to attract the positive sample pairs and repulse the negative sample pairs. This methodology has been recently popularized for un-/self-supervised representation learning. Simple and effective instantiations of contrastive learning have been developed using Siamese networks.

In practice, contrastive learning methods benefit from a large number of negative samples. These samples can be maintained in a memory bank. In a Siamese network, MoCo maintains a queue of negative samples and turns one branch into a momentum encoder to improve consistency of the queue. SimCLR directly uses negative samples coexisting in the current batch, and it requires a large batch size to work well.

Clustering.

Another category of methods for unsupervised representation learning is based on clustering. These methods alternate between clustering the representations and learning to predict the cluster assignment. SwAV incorporates clustering into a Siamese network, by computing the assignment from one view and predicting it from another view. SwAV performs online clustering under a balanced partition constraint for each batch, which is solved by the Sinkhorn-Knopp transform.

While clustering-based methods do not define negative exemplars, the cluster centers can act as negative prototypes. Like contrastive learning, clustering-based methods require either a memory bank, large batches, or a queue to provide enough samples for clustering.

BYOL.

BYOL directly predicts the output of one view from another view. It is a Siamese network in which one branch is a momentum encoder. MoCo and BYOL do not directly share the weights between the two branches, though in theory the momentum encoder should converge to the same status as the trainable encoder; we view these models as Siamese networks with “indirect” weight-sharing. The BYOL paper hypothesizes that the momentum encoder is important for BYOL to avoid collapsing, and it reports failure when the momentum encoder is removed (0.3% accuracy, Table 5 of the BYOL paper). In BYOL’s arXiv v3 update, it reports 66.9% accuracy with 300-epoch pre-training when removing the momentum encoder and increasing the predictor’s learning rate by 10×. Our work was done concurrently with this arXiv update. Our work studies this topic from different perspectives, with better results achieved. Our empirical study challenges the necessity of the momentum encoder for preventing collapsing. We discover that the stop-gradient operation is critical. This discovery can be obscured by the use of a momentum encoder, which is always accompanied by stop-gradient (as it is not updated by its parameters’ gradients). While the moving-average behavior may improve accuracy with an appropriate momentum coefficient, our experiments show that it is not directly related to preventing collapsing.

Method

Our architecture (Figure 1) takes as input two randomly augmented views $x_1$ and $x_2$ from an image $x$. The two views are processed by an encoder network $f$ consisting of a backbone (e.g., ResNet) and a projection MLP head. The encoder $f$ shares weights between the two views. A prediction MLP head, denoted as $h$, transforms the output of one view and matches it to the other view. Denoting the two output vectors as $p_1 \triangleq h(f(x_1))$ and $z_2 \triangleq f(x_2)$, we minimize their negative cosine similarity:

$$\mathcal{D}(p_1, z_2) = -\frac{p_1}{\lVert p_1 \rVert_2} \cdot \frac{z_2}{\lVert z_2 \rVert_2}, \tag{1}$$

where $\lVert \cdot \rVert_2$ is the $\ell_2$-norm.

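With $p_2 \triangleq h(f(x_2))$ and $z_1 \triangleq f(x_1)$ defined analogously, the symmetrized loss is:

$$\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, z_2) + \frac{1}{2}\mathcal{D}(p_2, z_1). \tag{2}$$

This is defined for each image, and the total loss is averaged over all images. Its minimum possible value is $-1$.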
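As a minimal illustration of the negative cosine similarity used as the per-pair loss, the following plain-Python sketch computes it for two vectors given as lists. The function name `neg_cosine` is ours and is not taken from the paper; a real implementation would operate on batched tensors.

```python
import math


def neg_cosine(p, z):
    """Negative cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)


# Aligned vectors reach the minimum, orthogonal vectors give zero,
# and opposite vectors give the maximum.
aligned = neg_cosine([1.0, 2.0], [2.0, 4.0])
orthogonal = neg_cosine([1.0, 0.0], [0.0, 1.0])
opposite = neg_cosine([1.0, 0.0], [-1.0, 0.0])
```

Because both vectors are normalized, the value depends only on the angle between them, which is why the loss is bounded below by the value attained for perfectly aligned outputs.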

An important component for our method to work is a stop-gradient (stopgrad) operation (Figure 1). We implement it by modifying (1) as:

$$\mathcal{D}(p_1, \texttt{stopgrad}(z_2)). \tag{3}$$

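This means that $z_2$ is treated as a constant in this term. Similarly, the form in (2) is implemented as:

$$\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, \texttt{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(p_2, \texttt{stopgrad}(z_1)). \tag{4}$$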

Here the encoder on $x_2$ receives no gradient from $z_2$ in the first term, but it receives gradients from $p_2$ in the second term (and vice versa for $x_1$).

The pseudo-code of SimSiam is in Algorithm 1.
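As a rough companion to the pseudo-code, the sketch below assembles the symmetrized loss from two views in plain Python. It covers only the forward computation: `stopgrad` is an identity placeholder here, because stop-gradient changes only the backward pass (in PyTorch it would correspond to calling `detach` on the tensor). The names `simsiam_loss`, `neg_cosine`, and `stopgrad`, and the toy `f` and `h` in the usage lines, are illustrative assumptions rather than the paper's code.

```python
import math


def neg_cosine(p, z):
    """Negative cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)


def stopgrad(z):
    """Forward-pass placeholder: stop-gradient leaves the value unchanged.

    An autodiff framework would additionally block gradients through z.
    """
    return z


def simsiam_loss(x1, x2, f, h):
    """Symmetrized loss for one image given its two augmented views."""
    z1, z2 = f(x1), f(x2)   # encoder outputs (backbone + projection MLP)
    p1, p2 = h(z1), h(z2)   # predictor outputs
    return 0.5 * neg_cosine(p1, stopgrad(z2)) + 0.5 * neg_cosine(p2, stopgrad(z1))


# Toy usage with hypothetical stand-ins for the encoder and predictor.
def toy_f(x):
    return [2.0 * v for v in x]


def toy_h(z):
    return [v + 0.1 for v in z]


loss = simsiam_loss([1.0, 0.5], [0.9, 0.6], toy_f, toy_h)
```

In a real training loop the loss would be averaged over a batch, back-propagated, and followed by an optimizer update of both the encoder and the predictor.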

Unless specified, our explorations use the following settings for unsupervised pre-training:

Optimizer. We use SGD for pre-training. Our method does not require a large-batch optimizer such as LARS. We use a learning rate of $lr \times \text{BatchSize}/256$ (linear scaling), with a base $lr = 0.05$. The learning rate has a cosine decay schedule. The weight decay is 0.0001 and the SGD momentum is 0.9.
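The learning-rate rule can be made concrete with a short sketch. The linear scaling follows the formula in the text; the cosine schedule below assumes the common half-cosine form that decays to zero over training and ignores warm-up, since the text states only that a cosine decay is used. The function names are ours.

```python
import math


def scaled_lr(batch_size, base_lr=0.05):
    """Linear scaling rule: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256


def cosine_lr(step, total_steps, init_lr):
    """Assumed half-cosine decay from init_lr at step 0 to 0 at total_steps."""
    return 0.5 * init_lr * (1.0 + math.cos(math.pi * step / total_steps))


# Default batch size of 512: the initial learning rate is twice the base lr.
init_lr = scaled_lr(512)
schedule = [cosine_lr(s, 100, init_lr) for s in range(0, 101, 25)]
```

Under this assumed form the learning rate is halved at the midpoint of training and reaches zero at the end.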

The batch size is 512 by default, which is friendly to typical 8-GPU implementations. Other batch sizes also work well (Sec. 4.3). We use batch normalization (BN) synchronized across devices.

Projection MLP. The projection MLP (in $f$) has BN applied to each fully-connected (fc) layer, including its output fc. Its output fc has no ReLU. The hidden fc is 2048-d. This MLP has 3 layers.

Prediction MLP. The prediction MLP ($h$) has BN applied to its hidden fc layers. Its output fc does not have BN (ablation in Sec. 4.4) or ReLU. This MLP has 2 layers. The dimension of $h$’s input and output ($z$ and $p$) is $d = 2048$, and $h$’s hidden layer’s dimension is 512, making $h$ a bottleneck structure (ablation in supplement).

We use ResNet-50 as the default backbone. Other implementation details are in supplement. We perform 100-epoch pre-training in ablation experiments.

Experimental setup.

We do unsupervised pre-training on the 1000-class ImageNet training set without using labels. The quality of the pre-trained representations is evaluated by training a supervised linear classifier on frozen representations in the training set, and then testing it in the validation set, which is a common protocol. The implementation details of linear classification are in supplement.

Empirical Study

In this section we empirically study the SimSiam behaviors. We pay special attention to what may contribute to the model’s non-collapsing solutions.

Figure 2 presents a comparison on “with vs. without stop-gradient”. The architectures and all hyper-parameters are kept unchanged, and stop-gradient is the only difference.

As a comparison, if the output $z$ has a zero-mean isotropic Gaussian distribution, we can show that the std of $z / \lVert z \rVert_2$ is $\frac{1}{\sqrt{d}}$. Here is an informal derivation: denote $z / \lVert z \rVert_2$ as $z'$, that is, $z'_i = z_i / (\sum_{j=1}^{d} z_j^2)^{\frac{1}{2}}$ for the $i$-th channel. If each $z_j$ is drawn i.i.d. from a Gaussian distribution, $z_j \sim \mathcal{N}(0, 1)$ for all $j$, then $z'_i \approx z_i / d^{\frac{1}{2}}$ and $\text{std}[z'_i] \approx 1 / d^{\frac{1}{2}}$. The blue curve in Figure 2 (middle) shows that with stop-gradient, the std value is near $\frac{1}{\sqrt{d}}$. This indicates that the outputs do not collapse, and they are scattered on the unit hypersphere.
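The informal derivation can be checked numerically. The sketch below draws i.i.d. standard Gaussian vectors, normalizes each to unit length, computes the per-channel std over samples, and averages over channels. The dimension and sample count are arbitrary small values chosen to keep the simulation fast; they are not the paper's settings, and the function name is ours.

```python
import math
import random


def mean_channel_std(d, n, seed=0):
    """Average per-channel std of n l2-normalized d-dim Gaussian vectors."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in z))
        samples.append([v / norm for v in z])
    stds = []
    for i in range(d):
        column = [s[i] for s in samples]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        stds.append(math.sqrt(var))
    return sum(stds) / d


d = 256
measured = mean_channel_std(d, n=400)
reference = 1.0 / math.sqrt(d)
ratio = measured / reference  # expected to be close to 1
```

A collapsed model would instead produce identical outputs for all samples, so every per-channel std would be zero.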

Figure 2 (right) plots the validation accuracy of a k-nearest-neighbor (kNN) classifier. This kNN classifier can serve as a monitor of the progress. With stop-gradient, the kNN monitor shows a steadily improving accuracy.
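A kNN monitor of this kind can be sketched as follows. This is a simplified, unweighted cosine-similarity kNN with majority voting, written only for illustration; the classifier used in the experiments may differ in its weighting scheme and in the choice of k. The function names and the toy feature bank are hypothetical.

```python
import math
from collections import Counter


def _normalize(v):
    """Scale a vector to unit l2-norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def knn_predict(query, bank, labels, k=3):
    """Predict a label by majority vote among the k most similar features."""
    q = _normalize(query)
    scored = []
    for feature, label in zip(bank, labels):
        f = _normalize(feature)
        scored.append((sum(a * b for a, b in zip(q, f)), label))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy feature bank: frozen training features with their class labels.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
pred_a = knn_predict([1.0, 0.2], bank, labels)
pred_b = knn_predict([0.0, 1.0], bank, labels)
```

Since no parameters are trained, such a monitor can be evaluated cheaply during pre-training to track representation quality.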

The linear evaluation result is in the table in Figure 2. SimSiam achieves a nontrivial accuracy of 67.7%. This result is reasonably stable as shown by the std of 5 trials. Solely removing stop-gradient, the accuracy becomes 0.1%, which is the chance-level guess in ImageNet.

The introduction of stop-gradient implies that there should be another underlying optimization problem that is being solved. We propose a hypothesis in Sec. 5.

4.2 Predictor

In Table 1 we study the predictor MLP’s effect.

The model does not work if $h$ is removed (Table 1a), i.e., if $h$ is the identity mapping. Actually, this observation can be expected if the symmetric loss (4) is used. Now the loss is $\frac{1}{2}\mathcal{D}(z_1, \texttt{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(z_2, \texttt{stopgrad}(z_1))$. Its gradient has the same direction as the gradient of $\mathcal{D}(z_1, z_2)$, with the magnitude scaled by $1/2$. In this case, using stop-gradient is equivalent to removing stop-gradient and scaling the loss by $1/2$. Collapsing is observed (Table 1a).
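The claim about the gradient direction can be verified with a short derivation. As a sketch, assume $z_1 = f_\theta(x_1)$ and $z_2 = f_\theta(x_2)$ share the parameters $\theta$, write $\partial_1 \mathcal{D}$ and $\partial_2 \mathcal{D}$ for the partial derivatives of $\mathcal{D}$ with respect to its first and second argument (this notation is ours), and use the symmetry $\mathcal{D}(a, b) = \mathcal{D}(b, a)$, which holds for the cosine similarity and implies $\partial_1 \mathcal{D}(z_2, z_1) = \partial_2 \mathcal{D}(z_1, z_2)$:

```latex
\begin{align*}
\nabla_\theta \Big[ \tfrac{1}{2}\mathcal{D}\big(z_1, \texttt{stopgrad}(z_2)\big)
                  + \tfrac{1}{2}\mathcal{D}\big(z_2, \texttt{stopgrad}(z_1)\big) \Big]
&= \tfrac{1}{2} \Big(\tfrac{\partial z_1}{\partial \theta}\Big)^{\!\top} \partial_1 \mathcal{D}(z_1, z_2)
 + \tfrac{1}{2} \Big(\tfrac{\partial z_2}{\partial \theta}\Big)^{\!\top} \partial_1 \mathcal{D}(z_2, z_1) \\
&= \tfrac{1}{2} \Big[ \Big(\tfrac{\partial z_1}{\partial \theta}\Big)^{\!\top} \partial_1 \mathcal{D}(z_1, z_2)
 + \Big(\tfrac{\partial z_2}{\partial \theta}\Big)^{\!\top} \partial_2 \mathcal{D}(z_1, z_2) \Big] \\
&= \tfrac{1}{2} \nabla_\theta \mathcal{D}(z_1, z_2).
\end{align*}
```

Each stop-gradient term contributes only the derivative through its first argument, and the two terms together recover the full chain rule for $\mathcal{D}(z_1, z_2)$ up to the factor $1/2$. The argument relies on the two terms being symmetric, so it does not carry over once a non-identity $h$ is applied to one side.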

We note that this derivation on the gradient direction is valid only for the symmetrized loss. But we have observed that the asymmetric variant (3) also fails if $h$ is removed, while it can work if $h$ is kept (Sec. 4.6). These experiments suggest that $h$ is helpful for our model.

If $h$ is fixed at its random initialization, our model does not work either (Table 1b). However, this failure is not about collapsing. The training does not converge, and the loss remains high. The predictor $h$ should be trained to adapt to the representations.

We also find that $h$ with a constant $lr$ (without decay) can work well and produce even better results than the baseline (Table 1c). A possible explanation is that $h$ should adapt to the latest representations, so it is not necessary to force it to converge (by reducing $lr$) before the representations are sufficiently trained. In many variants of our model, we have observed that $h$ with a constant $lr$ provides slightly better results. We use this form in the following subsections.

4.3 Batch Size

Table 2 reports the results with a batch size from 64 to 4096. When the batch size changes, we use the same linear scaling rule ($lr \times \text{BatchSize}/256$) with base $lr = 0.05$. We use 10 epochs of warm-up for batch sizes $\geq 1024$. Note that we keep using the same SGD optimizer (rather than LARS) for all batch sizes studied.

Our method works reasonably well over this wide range of batch sizes. Even a batch size of 128 or 64 performs decently, with a drop of 0.8% or 2.0% in accuracy. The results are similarly good when the batch size is from 256 to 2048, and the differences are at the level of random variations.

This behavior of SimSiam is noticeably different from SimCLR and SwAV. All three methods are Siamese networks with direct weight-sharing, but SimCLR and SwAV both require a large batch (e.g., 4096) to work well.

We also note that the standard SGD optimizer does not work well when the batch is too large (even in supervised learning), and our result is lower with a 4096 batch. We expect a specialized optimizer (e.g., LARS) will help in this case. However, our results show that a specialized optimizer is not necessary for preventing collapsing.

4.4 Batch Normalization

Table 3 compares the configurations of BN on the MLP heads. In Table 3a we remove all BN layers in the MLP heads (10-epoch warmup is used specifically for this entry). This variant does not cause collapse, although the accuracy is low (34.6%). The low accuracy is likely because of optimization difficulty. Adding BN to the hidden layers (Table 3b) increases accuracy to 67.4%.

Further adding BN to the output of the projection MLP (i.e., the output of $f$) boosts accuracy to 68.1% (Table 3c), which is our default configuration. In this entry, we also find that the learnable affine transformation (scale and offset) in $f$’s output BN is not necessary, and disabling it leads to a comparable accuracy of 68.2%.

Adding BN to the output of the prediction MLP $h$ does not work well (Table 3d). We find that this is not about collapsing. The training is unstable and the loss oscillates.

In summary, we observe that BN is helpful for optimization when used appropriately, which is similar to BN’s behavior in other supervised learning scenarios. But we have seen no evidence that BN helps to prevent collapsing: actually, the comparison in Sec. 4.1 (Figure 2) has exactly the same BN configuration for both entries, but the model collapses if stop-gradient is not used.

4.5 Similarity Function

Besides the cosine similarity function (1), our method also works with cross-entropy similarity. We modify $\mathcal{D}$ as: $\mathcal{D}(p_1, z_2) = -\texttt{softmax}(z_2) \cdot \log \texttt{softmax}(p_1)$. Here the softmax function is along the channel dimension. The output of softmax can be thought of as the probabilities of belonging to each of $d$ pseudo-categories.
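The cross-entropy similarity can be written out for a single pair of vectors as a small sketch. The helper names are ours, and the max-subtraction inside the softmax is a standard numerical-stability step that the text does not mention.

```python
import math


def softmax(v):
    """Softmax over the channel dimension of a single vector."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(v):
    """Log-softmax computed via the log-sum-exp trick."""
    m = max(v)
    lse = m + math.log(sum(math.exp(a - m) for a in v))
    return [a - lse for a in v]


def cross_entropy_similarity(p, z):
    """D(p, z) = -softmax(z) . log softmax(p)."""
    target = softmax(z)
    log_pred = log_softmax(p)
    return -sum(t * lp for t, lp in zip(target, log_pred))


# When p and z agree, D equals the entropy of softmax(z);
# any mismatch between them can only increase it.
matched = cross_entropy_similarity([0.0, 0.0], [0.0, 0.0])
mismatched = cross_entropy_similarity([2.0, 0.0], [0.0, 0.0])
```

Unlike the cosine form, this function is not symmetric in its two arguments: the target side enters through the softmax and the prediction side through the log-softmax.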

We simply replace the cosine similarity with the cross-entropy similarity, and symmetrize it using (4). All hyper-parameters and architectures are unchanged, though they may be suboptimal for this variant. Here is the comparison:

The cross-entropy variant can converge to a reasonable result without collapsing. This suggests that the collapsing prevention behavior is not just about the cosine similarity.

This variant helps to set up a connection to SwAV, which we discuss in Sec. 6.2.

4.6 Symmetrization

Thus far our experiments have been based on the symmetrized loss (4). We observe that SimSiam’s behavior of preventing collapsing does not depend on symmetrization. We compare with the asymmetric variant (3) as follows:

The asymmetric variant achieves reasonable results. Symmetrization is helpful for boosting accuracy, but it is not related to collapse prevention. Symmetrization makes one more prediction for each image, and we may roughly compensate for this by sampling two pairs for each image in the asymmetric version (“2×”). This makes the gap smaller.

4.7 Summary

We have empirically shown that in a variety of settings, SimSiam can produce meaningful results without collapsing. The optimizer (batch size), batch normalization, similarity function, and symmetrization may affect accuracy, but we have seen no evidence that they are related to collapse prevention. It is mainly the stop-gradient operation that plays an essential role.

Hypothesis

We discuss a hypothesis on what is implicitly optimized by SimSiam, with proof-of-concept experiments provided.

Our hypothesis is that SimSiam is an implementation of an Expectation-Maximization (EM) like algorithm. It implicitly involves two sets of variables, and solves two underlying sub-problems. The presence of stop-gradient is the consequence of introducing the extra set of variables.

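We consider a loss function of the following form:

$$\mathcal{L}(\theta, \eta) = \mathbb{E}_{x, \mathcal{T}}\Big[\big\lVert \mathcal{F}_\theta(\mathcal{T}(x)) - \eta_x \big\rVert_2^2\Big]. \tag{5}$$

Here $\mathcal{F}_\theta$ is a network parameterized by $\theta$, $\mathcal{T}$ is a random augmentation, $x$ is an image, and $\eta_x$ is a variable associated with the image $x$. The expectation is over the distribution of images and augmentations. With this formulation, we consider solving:

$$\min_{\theta, \eta} \mathcal{L}(\theta, \eta). \tag{6}$$

Here the problem is w.r.t. both $\theta$ and $\eta$. This formulation is analogous to k-means clustering. The variable $\theta$ is analogous to the clustering centers: it is the learnable parameters of an encoder. The variable $\eta_x$ is analogous to the assignment vector of the sample $x$ (a one-hot vector in k-means): it is the representation of $x$.

Also analogous to k-means, the problem in (6) can be solved by an alternating algorithm, fixing one set of variables and solving for the other set. Formally, we can alternate between solving these two subproblems:

$$\theta^{t} \leftarrow \arg\min_{\theta} \mathcal{L}(\theta, \eta^{t-1}) \tag{7}$$

$$\eta^{t} \leftarrow \arg\min_{\eta} \mathcal{L}(\theta^{t}, \eta) \tag{8}$$

Here $t$ is the index of alternation and “$\leftarrow$” means assigning.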
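The alternating scheme can be illustrated with the k-means analogy itself. The sketch below runs k-means on 1-D points: one step fixes the centers and solves for the assignments, the other fixes the assignments and solves for the centers. It is meant only to show the "fix one set of variables, solve for the other" pattern; it is not the SimSiam algorithm, and the function name, toy data, and initial centers are ours.

```python
def kmeans_1d(points, centers, num_iters=10):
    """Alternate between assignment and center updates on 1-D data."""
    centers = list(centers)
    assignments = [0] * len(points)
    for _ in range(num_iters):
        # Step A: centers fixed, solve for the assignment of each point.
        assignments = [
            min(range(len(centers)), key=lambda c: (x - centers[c]) ** 2)
            for x in points
        ]
        # Step B: assignments fixed, solve for each center (mean of its points).
        for c in range(len(centers)):
            members = [x for x, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    return centers, assignments


points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
centers, assignments = kmeans_1d(points, centers=[0.0, 1.0])
```

In each step the variables held fixed act as constants, which mirrors the role that stop-gradient plays when one sub-problem is solved by gradient descent.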

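One can use SGD to solve the sub-problem (7). The stop-gradient operation is a natural consequence, because the gradient does not back-propagate to $\eta^{t-1}$, which is a constant in this subproblem.

Solving for $\eta$.

The sub-problem (8) can be solved independently for each $\eta_x$: for each image $x$, we minimize $\mathbb{E}_{\mathcal{T}}\big[\lVert \mathcal{F}_{\theta^t}(\mathcal{T}(x)) - \eta_x \rVert_2^2\big]$, where $\mathcal{F}_{\theta^t}$ is the encoder with parameters $\theta^t$ and the expectation is over the augmentation $\mathcal{T}$. Because this is a mean squared error, the solution is:

$$\eta^{t}_{x} \leftarrow \mathbb{E}_{\mathcal{T}}\big[\mathcal{F}_{\theta^{t}}(\mathcal{T}(x))\big]. \tag{9}$$

This indicates that $\eta_x$ is assigned with the average representation of $x$ over the distribution of augmentation.

One-step alternation.

SimSiam can be approximated by one-step alternation between (7) and (8). We approximate (9) by sampling the augmentation only once, denoted as $\mathcal{T}'$, and ignoring the expectation:

$$\eta^{t}_{x} \leftarrow \mathcal{F}_{\theta^{t}}(\mathcal{T}'(x)). \tag{10}$$

Inserting it into the sub-problem (7), we have:

$$\theta^{t+1} \leftarrow \arg\min_{\theta} \mathbb{E}_{x, \mathcal{T}}\Big[\big\lVert \mathcal{F}_{\theta}(\mathcal{T}(x)) - \mathcal{F}_{\theta^{t}}(\mathcal{T}'(x)) \big\rVert_2^2\Big]. \tag{11}$$

Now $\theta^t$ is a constant in this sub-problem, and $\mathcal{T}'$ implies another view due to its random nature. This formulation exhibits the Siamese architecture. Moreover, if we implement (11) by reducing the loss with one SGD step, then we can approach the SimSiam algorithm: a Siamese network naturally with stop-gradient applied.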

Predictor.

Our above analysis does not involve the predictor $h$. We further assume that $h$ is helpful in our method because of the approximation due to (10).

Symmetrization.

5.2 Proof of concept

We design a series of proof-of-concept experiments that stem from our hypothesis. They are methods different from SimSiam, and they are designed to verify our hypothesis.

We have hypothesized that the SimSiam algorithm is like alternating between (7) and (8), with an interval of one step of SGD update. Under this hypothesis, it is likely for our formulation to work if the interval has multiple steps of SGD.

In this variant, we treat $t$ in (7) and (8) as the index of an outer loop; and the sub-problem in (7) is updated by an inner loop of $k$ SGD steps. In each alternation, we pre-compute the $\eta_x$ required for all $k$ SGD steps using (10) and cache them in memory. Then we perform $k$ SGD steps to update $\theta$. We use the same architecture and hyper-parameters as SimSiam. The comparison is as follows:

Here, “1-step” is equivalent to SimSiam, and “1-epoch” denotes the $k$ steps required for one epoch. All multi-step variants work well. The 10-/100-step variants even achieve better results than SimSiam, though at the cost of extra pre-computation. This experiment suggests that the alternating optimization is a valid formulation, and SimSiam is a special case of it.
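The control flow of the multi-step variant can be sketched as a generic loop. The callables `compute_eta` and `sgd_step` are hypothetical placeholders for the target computation and for one optimizer update; the toy versions in the usage lines only track which parameter snapshot produced each cached target, so the example shows the schedule rather than any actual learning.

```python
def alternate(batches, k, theta, compute_eta, sgd_step):
    """Alternate between caching targets and k SGD steps on theta."""
    for start in range(0, len(batches), k):
        chunk = batches[start:start + k]
        # Outer step: pre-compute and cache the targets for the next k
        # batches using the current parameters.
        cached_eta = [compute_eta(theta, batch) for batch in chunk]
        # Inner loop: k SGD steps against the cached, constant targets.
        for batch, eta in zip(chunk, cached_eta):
            theta = sgd_step(theta, batch, eta)
    return theta


# Toy stand-ins: theta is a step counter, and each target simply records
# the value of theta at the time it was computed.
used_targets = []


def toy_compute_eta(theta, batch):
    return theta


def toy_sgd_step(theta, batch, eta):
    used_targets.append(eta)
    return theta + 1


final_theta = alternate(list(range(6)), 3, 0, toy_compute_eta, toy_sgd_step)
multi_step_targets = list(used_targets)
```

With `k=1` every target is computed from the parameters of the immediately preceding step, which corresponds to the "1-step" setting; larger `k` reuses targets from an older snapshot for several updates.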

Expectation over augmentations.

5.3 Discussion

Our hypothesis is about what the optimization problem can be. It does not explain why collapsing is prevented. We point out that SimSiam and its variants’ non-collapsing behavior still remains as an empirical observation.

Here we briefly discuss our understanding on this open question. The alternating optimization provides a different trajectory, and the trajectory depends on the initialization. It is unlikely that the initialized $\eta$, which is the output of a randomly initialized network, would be a constant. Starting from this initialization, it may be difficult for the alternating optimizer to approach a constant $\eta_x$ for all $x$, because the method does not compute the gradients w.r.t. $\eta$ jointly for all $x$. The optimizer seeks another trajectory (Figure 2 left), in which the outputs are scattered (Figure 2 middle).

Comparisons

We compare with the state-of-the-art frameworks in Table 4 on ImageNet linear evaluation. For fair comparisons, all competitors are based on our reproduction, and “+” denotes improved reproduction vs. the original papers (see supplement). For each individual method, we follow the hyper-parameter and augmentation recipes in its original paper. In our BYOL reproduction, the 100, 200(400), 800-epoch recipes follow the 100, 300, 1000-epoch recipes in the BYOL paper: $lr$ is {0.45, 0.3, 0.2}, $wd$ is {1e-6, 1e-6, 1.5e-6}, and the momentum coefficient is {0.99, 0.99, 0.996}. All entries are based on a standard ResNet-50, with two 224×224 views used during pre-training.

Table 4 shows the results and the main properties of the methods. SimSiam is trained with a batch size of 256, using neither negative samples nor a momentum encoder. Despite its simplicity, SimSiam achieves competitive results. It has the highest accuracy among all methods under 100-epoch pre-training, though its gain from training longer is smaller. It has better results than SimCLR in all cases.

Transfer Learning.

In Table 5 we compare the representation quality by transferring them to other tasks, including VOC object detection and COCO object detection and instance segmentation. We fine-tune the pre-trained models end-to-end in the target datasets. We use the public codebase from MoCo for all entries, and search the fine-tuning learning rate for each individual method. All methods are based on 200-epoch pre-training in ImageNet using our reproduction.

Table 5 shows that SimSiam’s representations are transferable beyond the ImageNet task. It is competitive among these leading methods. The “base” SimSiam in Table 5 uses the baseline pre-training recipe as in our ImageNet experiments. We find that another recipe of $lr = 0.5$ and $wd = \text{1e-5}$ (with similar ImageNet accuracy) can produce better results in all tasks (Table 5, “SimSiam, optimal”).

We emphasize that all these methods are highly successful for transfer learning—in Table 5, they can surpass or be on par with the ImageNet supervised pre-training counterparts in all tasks. Despite many design differences, a common structure of these methods is the Siamese network. This comparison suggests that the Siamese structure is a core factor for their general success.

6.2 Methodology Comparisons

Beyond accuracy, we also compare the methodologies of these Siamese architectures. Our method serves as a hub to connect these methods. Figure 3 abstracts these methods. The “encoder” subsumes all layers that can be shared between both branches (e.g., backbone, projection MLP, prototypes). The components in red are those missing in SimSiam. We discuss the relations next.

Relation to SimCLR

SimCLR relies on negative samples (“dissimilarity”) to prevent collapsing. SimSiam can be thought of as “SimCLR without negatives”.

To have a more thorough comparison, we append the prediction MLP $h$ and stop-gradient to SimCLR. We append the extra predictor to one branch and stop-gradient to the other branch, and symmetrize this by swapping. Here is the ablation on our SimCLR reproduction:

Neither the stop-gradient nor the extra predictor is necessary or helpful for SimCLR. As we have analyzed in Sec. 5, the introduction of the stop-gradient and extra predictor is presumably a consequence of another underlying optimization problem. It is different from the contrastive learning problem, so these extra components may not be helpful.

Relation to SwAV

SimSiam is conceptually analogous to “SwAV without online clustering”. We build up this connection by recasting a few components in SwAV. (i) The shared prototype layer in SwAV can be absorbed into the Siamese encoder. (ii) The prototypes were weight-normalized outside of gradient propagation in the original SwAV; we instead implement this by full gradient computation. This modification produces similar results to the original SwAV, but it enables end-to-end propagation in our ablation. (iii) The similarity function in SwAV is cross-entropy. With these abstractions, a highly simplified SwAV illustration is shown in Figure 3.

SwAV applies the Sinkhorn-Knopp (SK) transform on the target branch (which is also symmetrized). The SK transform is derived from online clustering: it is the outcome of clustering the current batch subject to a balanced partition constraint. The balanced partition can avoid collapsing. Our method does not involve this transform.

We study the effect of the prediction MLP $h$ and stop-gradient on SwAV. Note that SwAV applies stop-gradient on the SK transform, so we ablate by removing it. Here is the comparison on our SwAV reproduction:

Adding the predictor does not help either. Removing stop-gradient (so the model is trained end-to-end) leads to divergence. As a clustering-based method, SwAV is inherently an alternating formulation. This may explain why stop-gradient should not be removed from SwAV.

Relation to BYOL

Our method can be thought of as “BYOL without the momentum encoder”, subject to many implementation differences. The momentum encoder may be beneficial for accuracy (Table 4), but it is not necessary for preventing collapsing. Given our hypothesis in Sec. 5, the $\eta$ sub-problem (8) can be solved by other optimizers, e.g., a gradient-based one. This may lead to a temporally smoother update on $\eta$. Although not directly related, the momentum encoder also produces a smoother version of $\eta$. We believe that other optimizers for solving (8) are also plausible, which can be a future research problem.

Conclusion

We have explored Siamese networks with simple designs. The competitiveness of our minimalist method suggests that the Siamese shape of the recent methods can be a core reason for their effectiveness. Siamese networks are natural and effective tools for modeling invariance, which is a focus of representation learning. We hope our study will attract the community’s attention to the fundamental role of Siamese networks in representation learning.

Appendix A Implementation Details

Our implementation follows the practice of existing works.

Data augmentation. We describe data augmentation using the PyTorch notations. Geometric augmentation is RandomResizedCrop with scale in $[0.2, 1.0]$ and RandomHorizontalFlip. Color augmentation is ColorJitter with {brightness, contrast, saturation, hue} strength of {0.4, 0.4, 0.4, 0.1} with an applying probability of 0.8, and RandomGrayscale with an applying probability of 0.2. Blurring augmentation has a Gaussian kernel with std in $[0.1, 2.0]$.

Initialization. The convolution and fc layers follow the default PyTorch initializers. Note that by default PyTorch initializes fc layers’ weight and bias by a uniform distribution $\mathcal{U}(-\sqrt{k}, \sqrt{k})$ where $k = \frac{1}{\text{in\_channels}}$. Models with substantially different fc initializers (e.g., a fixed std of 0.01) may not converge. Moreover, similar to prior implementations, we initialize the scale parameters as 0 in the last BN layer for every residual block.

Weight decay. We use a weight decay of 0.0001 for all parameter layers, including the BN scales and biases, in the SGD optimizer. This is in contrast to implementations that exclude BN scales and biases from weight decay in their LARS optimizer.
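To give a sense of scale for the default fc initializer, the sketch below evaluates the uniform bound $\sqrt{k}$ and the resulting std for two input widths that appear in the MLP heads. The function names are ours, `in_features` stands for the quantity the text calls in_channels, and the only fact used beyond the text is that a uniform distribution on $[-b, b]$ has std $b/\sqrt{3}$.

```python
import math


def default_fc_init_bound(in_features):
    """Bound b of U(-b, b), with b = sqrt(1 / in_features)."""
    return math.sqrt(1.0 / in_features)


def uniform_std(bound):
    """Standard deviation of the uniform distribution U(-bound, bound)."""
    return bound / math.sqrt(3.0)


for fan_in in (2048, 512):
    bound = default_fc_init_bound(fan_in)
    print(fan_in, round(bound, 4), round(uniform_std(bound), 4))
```

The std produced by this rule changes with the input width of each layer, whereas a fixed-std initializer applies the same scale to every layer.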

Linear evaluation.

Appendix B Additional Ablations on ImageNet

The following table reports the SimSiam results vs. the output dimension $d$:

It benefits from a larger $d$ and gets saturated at $d = 2048$. This is unlike existing methods whose accuracy is saturated when $d$ is 256 or 512.

In this table, the prediction MLP’s hidden layer dimension is always 1/4 of the output dimension. We find that this bottleneck structure is more robust. If we set the hidden dimension to be equal to the output dimension, the training can be less stable or fail in some variants of our exploration. We hypothesize that this bottleneck structure, which behaves like an auto-encoder, can force the predictor to digest the information. We recommend using this bottleneck structure for our method.

Appendix C Reproducing Related Methods

Our comparison in Table 4 is based on our reproduction of the related methods. We re-implement the related methods as faithfully as possible following each individual paper. In addition, we are able to improve SimCLR, MoCo v2, and SwAV by small and straightforward modifications: specifically, we use 3 layers in the projection MLP in SimCLR and SwAV (vs. originally 2), and use symmetrized loss for MoCo v2 (vs. originally asymmetric). Table C.1 compares our reproduction of these methods with the original papers’ results (if available). Our reproduction has better results for SimCLR, MoCo v2, and SwAV (denoted as “+” in Table 4), and has at least comparable results for BYOL.

Appendix D CIFAR Experiments

Figure D.1 shows the kNN classification accuracy (left) and the linear evaluation (right). Similar to the ImageNet observations, SimSiam achieves a reasonable result and does not collapse. We compare with SimCLR trained with the same setting. Interestingly, the training curves are similar between SimSiam and SimCLR. SimSiam is slightly better by 0.7% under this setting.