Linking Image and Text with 2-Way Nets

Aviv Eisenschtat, Lior Wolf

Introduction

Computer vision emerged from its roots in image processing when researchers began to seek an understanding of the scene behind the image. Linking visual data $X$ with an external data source $Y$ is, therefore, the defining task of computer vision. When applying machine learning tools to solve such tasks, we often consider the outside source $Y$ to be univariate, e.g., in image classification. A more general scenario is the one in which $Y$ is also multidimensional. Examples of such view to view linking include matching between video and concurrent audio, matching an image with its textual description, matching images from two fixed views, etc.

The classical method of matching vectors between two different domains is Canonical Correlation Analysis (CCA). The algorithm has been generalized in many ways: regularization was added, kernels were introduced, versions for more than two sources were developed, etc. Recently, with the advent of deep learning methods, deep versions were created and showed promise.

The current deep CCA methods optimize the CCA loss on top of a deep neural network architecture. In this work, an alternative is presented in which a network is built to map one source $X$ to another source $Y$ and back. This architecture, which bears similarities to the encoder-decoder framework, employs the Euclidean loss.

The Euclidean loss is harder to optimize than classification losses such as the cross entropy loss. We, therefore, introduce a number of contributions that are critical to the success of our method. These include: (i) a mid-way loss term that helps support the training of the hidden layers; (ii) a decorrelation regularization term that links the problem back to CCA; (iii) modified batch normalization layers; (iv) a regularization of the scale parameter that ensures that the variance does not diminish from one layer to the next; (v) a tied dropout method; and (vi) a method for dealing with high-dimensional data.

Taken together, we are able to present a general and robust method. In an extensive set of experiments, we present clear advantages over both the classical and recent methods.

Previous work

Canonical Correlation Analysis (CCA) is a statistical method for computing a linear projection for two views into a common space which maximizes their correlation. CCA plays a crucial role in many computer vision applications including multiview analysis, multimodal human behavior analysis, action recognition, and linking text with images. There are a large number of CCA variants including: regularized CCA, Nonparametric canonical correlation analysis (NCCA), and Kernel canonical correlation analysis (KCCA), a method for producing non-linear, non-parametric projections using the kernel trick. Recently, randomized non-linear component analysis (RCCA) emerged as a low-rank approximation of KCCA.

While CCA is restricted to linear projections, KCCA is restricted to a fixed kernel. Neither method scales well with the size of the dataset and the size of the representations. A number of methods based on Deep Learning were recently proposed that aim to overcome these drawbacks. Deep canonical correlation analysis processes the pairs of inputs through two network pipelines and compares the results of each pipeline via the CCA loss.

Two subsequent works extend this approach to the task of image and text matching. The first employs the same model and training process, while the latter employs a different training scheme on the same architecture. Unlike both, we present a novel deep model for matching images and text.

Other deep CCA methods, including ours, are inspired by a family of encoding/decoding unsupervised generative models that aim to capture a meaningful representation of input $x$ by applying a non-linear encoding function $E(x)$, decoding the encoded signal using a non-linear decoding function $D(x)$, and minimizing the squared L2 distance between the original input and the decoded output. Some of the auto-encoder based algorithms add noise to the input or enforce a desired property using a regularization term.

Correlation Networks (CorrNet) and Deep canonically correlated autoencoders (DCCAE) expand the auto-encoder scheme by considering two input views and two output views. The encoding is shared between the two views (CorrNet) or the differences in the encodings are minimized (DCCAE). In both cases, it serves as a common bottleneck. Our model goes from one view to the other (in both directions) and not from each view to a reconstructed view.

The CCA loss is used by both CorrNet and DCCAE. The latter contribution explicitly states that the L2 loss is inferior to the CCA loss term. Our network, however, uses L2 successfully. This reinforces the need to apply the methods we propose in this work in order to enable effective training based on the L2 loss. To this end, we introduce innovative techniques based on common practices in deep learning, adapted to the problem at hand. These techniques include: dropout, batch normalization, and leaky ReLUs. While the latter is applied as is, the former two need to be carefully modified for our networks.

Dropout is a regularization method developed to reduce over-fitting in deep neural networks by zeroing a group of neurons at each training iteration. This stochastic elimination reduces the co-adaptation between neurons in the same layer and simulates the training of an ensemble of networks with shared weights.

Batch Normalization is used as a stabilizing mechanism for training a neural network by scaling the output of a hidden layer to zero mean and unit variance. This scaling lowers the change of distribution between neurons throughout the network and helps to speed up the training process.

Rectified Linear Unit (ReLU) is a non-linear activation function that does not suffer from the saturation phenomenon, which the classical sigmoids suffer from. Conventional ReLU zeroes negative activations, and as a result, no gradient is produced for many of the neurons. A few variants of ReLU were, therefore, proposed that reduce the effect of negative activations, but do not zero them completely. Similar to some of these variants and unlike others, we do not train the leakiness parameter and instead set it to a constant value.

As one of our contributions, we add a regularization term that removes the pairwise covariances of the learned features. A similar term was recently reported as part of a classification system (unrelated to modeling correlations between vectors). We adapt the terminology of that work when describing our bi-directional term.

The Network Model

This section contains a detailed description of our proposed model, which we term the 2-way net (code can be found at https://github.com/aviveise/2WayNet). The model utilizes the L2 loss in order to create a bi-directional mapping between two vector spaces. The absence of a correlation based loss (such as in DeepCCA and CorrNet) makes this model simpler. Like other regression problems, there are inherent challenges in obtaining meaningful solutions. These challenges are further amplified by the multivariate and layered structure of the performed regression. We, therefore, modify the problem in various ways, each contributing to the overall success.

Our proposed architecture is illustrated in Fig. 1. It contains two reconstruction channels. Both channels contain $k$ hidden layers, $\{h_{1},h_{2},...,h_{k}\}$ and $\{\hat{h}_{1},\hat{h}_{2},...,\hat{h}_{k}\}$. Let $H_{i}(x)$ and $\hat{H}_{i}(y)$ denote the output of each channel at layer $i$ given network inputs $x$ and $y$, respectively. The model is optimized to minimize the Euclidean loss between $\hat{H}_{i}(y)$ and $x$, and between $H_{i}(x)$ and $y$. The two channels share weights and dropout functions, as explained in Section 3.5.

In our experiments, in order to compare with previous work, we use the correlation as the success metric. As the Lemma below shows, there is a connection between the correlation of two vectors and their Euclidean distance. This connection also depends on the variance of the vectors.

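Lemma 1. Given two zero-mean $n$-dimensional vectors $x$ and $y$ with standard deviations $\sigma_{x}$ and $\sigma_{y}$, their correlation $c$ satisfies

$$c=\frac{1}{2\sigma_{x}\sigma_{y}}\left(\sigma_{x}^{2}+\sigma_{y}^{2}-\frac{1}{n}\left\lVert x-y\right\rVert^{2}\right).$$

Proof. Given two $n$-dimensional vectors $x$ and $y$ we consider the squared Euclidean distance, which for zero-mean vectors expands to

$$\left\lVert x-y\right\rVert^{2}=\sum_{j=1}^{n}x_{j}^{2}-2\sum_{j=1}^{n}x_{j}y_{j}+\sum_{j=1}^{n}y_{j}^{2}=n\sigma_{x}^{2}+n\sigma_{y}^{2}-2\sum_{j=1}^{n}x_{j}y_{j}.$$

For zero mean variables, the correlation between $x$ and $y$ is given by $c=\frac{1}{n}\frac{\sum_{j=1}^{n}(x_{j}y_{j})}{\sigma_{x}\sigma_{y}}$. Solving the expansion above for $\sum_{j}x_{j}y_{j}$ and substituting results in what had to be proven. ∎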
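The relation between correlation and squared distance can be checked numerically. The sketch below assumes the identity $c=(\sigma_{x}^{2}+\sigma_{y}^{2}-\lVert x-y\rVert^{2}/n)/(2\sigma_{x}\sigma_{y})$ for zero-mean vectors, which follows from expanding the squared distance. The function name and the synthetic data are illustrative assumptions, not part of the released code.

```python
import math
import random


def correlation_and_identity(x, y):
    """Return (direct correlation, correlation recovered from the squared distance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    x = [xi - mx for xi in x]  # center so that the zero-mean assumption holds
    y = [yi - my for yi in y]
    sx = math.sqrt(sum(xi * xi for xi in x) / n)
    sy = math.sqrt(sum(yi * yi for yi in y) / n)
    direct = sum(xi * yi for xi, yi in zip(x, y)) / (n * sx * sy)
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    via_distance = (sx ** 2 + sy ** 2 - sq_dist / n) / (2 * sx * sy)
    return direct, via_distance


# Synthetic, partially correlated pair (illustrative data only).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(1000)]
y = [0.7 * xi + 0.5 * rng.gauss(0, 1) for xi in x]
direct, via_distance = correlation_and_identity(x, y)
```

The two returned values agree up to floating point error, which illustrates why a smaller Euclidean distance at fixed variances implies a higher correlation.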

Given a batch of samples from views $x$ and $y$, we measure the correlation between the outputs of two matching layers, $\{h_{j}(x_{1}),...,h_{j}(x_{n})\}$ and $\{\hat{h}_{j}(y_{1}),...,\hat{h}_{j}(y_{n})\}$, as the sum of correlations between the activations of each matching neuron. The Lemma below extends Lemma 1 and shows that the sum of correlations which we aim to maximize is bounded by a function of the Euclidean loss between the two representations.

Given two matching hidden layers, $h_{j}$ and $\hat{h}_{j}$, with $m$ neurons each, let $a_{k}$ be the activation vector of neuron $k$ from $h_{j}$ with standard deviation $\sigma_{a_{k}}$, and let $b_{k}$ be the activation vector of neuron $k$ from $\hat{h}_{j}$ with standard deviation $\sigma_{b_{k}}$. Each vector is produced by feeding a batch of samples of size $n$ from views $x$ and $y$ through channels $H$ and $\hat{H}$ respectively. The sum of correlations $C$ is bounded by:

We will define $G_{m}=\sum_{k=1}^{m}\left\lVert a_{k}-b_{k}\right\rVert^{2}$ and $f_{k}=\sigma_{a_{k}}^{-1}\sigma_{b_{k}}^{-1}$ . Using Abel transform:

Note that $\sigma_{a_{k}}\sigma_{b_{k}}$ is positive and $\left\lVert a_{k}-b_{k}\right\rVert^{2}$ is non-negative for all $k$, which makes the above inequalities valid. Inserting Eq. 4 into Eq. 3 results in what had to be proven. ∎
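As a hedged sketch of how a bound of this type follows, assume zero-mean activations and apply the per-neuron relation between correlation and squared distance to each pair $(a_{k},b_{k})$. The elementary inequality in the second line stands in for the Abel transform step, and the constants are not claimed to match the exact statement of the Lemma.

```latex
C=\sum_{k=1}^{m}c_{k}
 =\frac{1}{2}\sum_{k=1}^{m}\frac{\sigma_{a_{k}}^{2}+\sigma_{b_{k}}^{2}}{\sigma_{a_{k}}\sigma_{b_{k}}}
  -\frac{1}{2n}\sum_{k=1}^{m}f_{k}\left\lVert a_{k}-b_{k}\right\rVert^{2}
\\
\sum_{k=1}^{m}f_{k}\left\lVert a_{k}-b_{k}\right\rVert^{2}
 \;\geq\;\Big(\min_{k}f_{k}\Big)\,G_{m}
 \;=\;\frac{G_{m}}{\max_{k}\left(\sigma_{a_{k}}\sigma_{b_{k}}\right)}
\\
C\;\leq\;\frac{1}{2}\sum_{k=1}^{m}\frac{\sigma_{a_{k}}^{2}+\sigma_{b_{k}}^{2}}{\sigma_{a_{k}}\sigma_{b_{k}}}
 -\frac{G_{m}}{2n\,\max_{k}\left(\sigma_{a_{k}}\sigma_{b_{k}}\right)}
```

In this form, the first term is smallest (equal to $m$) when the matching standard deviations agree, and the second term ties the bound to the Euclidean loss $G_{m}$ scaled by the variances.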

From the above Lemma, we can conclude that minimizing the L2 loss while maximizing the variance of each neuron activation results in maximization of the sum of correlations.

Solving this regression problem tends to eliminate the variance of the output representations. To overcome this limitation, we add two instruments. The first is a batch normalization (BN) layer after each hidden layer. The settings of the batch normalization layer differ from the common settings in order to suit this model. The second is a regularization of the gamma parameter that the batch normalization layer introduces. More details can be found below.

To the loss term, we add regularization terms. The first is weight decay $R_{w}=\sum\|W\|^{2}$ . A second regularization term is added in order to reduce the cross correlations between the network activations of the same layer. The property we encourage is inherent to CCA-based solutions where decorrelation is enforced. In our network solutions, we add a soft regularization term. During training, we consider the $N$ samples of a single batch $\{(x_{i},y_{i})\}_{i=1}^{N}$ and consider the set of mid-network activations $\{(H^{j}(x_{i}),\hat{H}^{j}(y_{i}))\}_{i=1}^{N}$ . The decorrelation regularization term is given by:

where $C_{h}=\frac{1}{N}\sum_{i}H^{j}(x_{i})^{\top}H^{j}(x_{i})$ is the covariance estimator for $H^{j}(x)$ and $C_{\hat{h}}=\frac{1}{N}\sum_{i}\hat{H}^{j}(y_{i})^{\top}\hat{H}^{j}(y_{i})$ is the covariance estimator for $\hat{H}^{j}(y)$ . This regularization term is minimized when the off-diagonal coefficients of both $C_{h}$ and $C_{\hat{h}}$ are zero.
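A minimal sketch of the decorrelation term, assuming it takes the common form $\frac{1}{2}(\lVert C\rVert_{F}^{2}-\lVert \mathrm{diag}(C)\rVert_{2}^{2})$ summed over both channels. This exact form and its normalization are assumptions; only the covariance estimators and the property of being minimized at zero off-diagonal entries come from the text. Function names are illustrative.

```python
def covariance(acts):
    """(1/N) * sum_i h_i^T h_i for a batch of N row vectors (no mean subtraction, as in the text)."""
    n, m = len(acts), len(acts[0])
    return [[sum(row[p] * row[q] for row in acts) / n for q in range(m)]
            for p in range(m)]


def off_diagonal_penalty(cov):
    """Half the sum of squared off-diagonal entries of a square matrix."""
    m = len(cov)
    return 0.5 * sum(cov[p][q] ** 2 for p in range(m) for q in range(m) if p != q)


def decov_regularizer(h, h_hat):
    """Sum of the off-diagonal penalties of both channels' mid-network activations."""
    return off_diagonal_penalty(covariance(h)) + off_diagonal_penalty(covariance(h_hat))


# Toy batches: the first has uncorrelated features, the second perfectly correlated ones.
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
correlated = [[1.0, 1.0], [-1.0, -1.0]]
penalty = decov_regularizer(decorrelated, correlated)
```

Only the correlated batch contributes to `penalty`, since the covariance of the first batch is already diagonal.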

3.2 Batch normalization layers

As shown above, in order to maximize the correlation we need not only to minimize the Euclidean loss but also to increase the variance of each neuron’s output. This is done by introducing a batch normalization layer customized to meet the model’s needs.

Given a vector of activations $a=[a_{1},\dots,a_{d}]$ produced by one of the network’s hidden layers for a given batch of inputs, we normalize $a$ to produce $a^{\prime}=[a^{\prime}_{1},\dots,a^{\prime}_{d}]$ , where ${a_{k}^{\prime}}=\frac{a_{k}-\mu_{k}}{\sigma_{k}}$ and $\mu_{k}$ and $\sigma^{2}_{k}$ are the mean and variance of neuron $k$ on the given batch. This is followed by scaling and shifting by learned parameters to produce ${a_{k}^{\prime\prime}}={\gamma}_{k}{a_{k}^{\prime}}+{\beta}_{k}$ . The BN layer mitigates the loss of variance by enforcing unit variance and by removing the influence of the weights of the hidden layer on the output’s variance.
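The normalization and the learned scale and shift can be sketched for a single neuron as follows. The small constant `eps` is an assumption added for numerical stability and does not appear in the description above; the function name is illustrative.

```python
import math


def batch_norm(column, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one neuron's activations over a batch, then scale by gamma and shift by beta."""
    n = len(column)
    mu = sum(column) / n
    var = sum((a - mu) ** 2 for a in column) / n
    return [gamma * (a - mu) / math.sqrt(var + eps) + beta for a in column]


# Activations of one neuron over a batch of four samples (illustrative values).
out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
out_mean = sum(out) / len(out)
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)
```

After the layer, the batch mean equals $\beta_{k}$ and the batch variance is approximately $\gamma_{k}^{2}$, regardless of the scale of the incoming weights. This is why the value of $\gamma_{k}$ directly controls the output variance.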

BN layers are usually placed before the non-linearity or on the input of the layer as a preprocessing phase. This setting poses several problems. First, ReLU lowers the variance of the output, which is counterproductive to our goal. Second, applying ReLU after BN has the effect of zeroing every activation $a_{k}$ that falls below the batch mean shifted by a term proportional to $\beta_{k}/\gamma_{k}$. Typically, $\beta_{k}$ is initialized to zero and, for a symmetric activation distribution, half of the activations are zeroed. When employing a bi-directional network, the zeroing effect occurs in both directions.

Let $u_{i}$ and $v_{i}$ denote the $d$-dimensional activation vectors of two matching layers for sample $i$. Let $s_{i}=\{k|u_{i}(k)>\mu_{k}\}$ be the group of indices of the values in $u_{i}$ that are larger than their population mean. Let $\hat{s}_{i}=\{k|v_{i}(k)>\hat{\mu}_{k}\}$ be the equivalent for the vectors $v_{i}$. We observe the intersection $s_{i}\cap\hat{s}_{i}$, which is the group of active neurons, following a threshold at the mean value on both $u_{i}$ and $v_{i}$.

As the Lemma below shows, even if the correlation $\rho_{k}$ is relatively high, the size of the intersection set $s_{i}\cap\hat{s}_{i}$ is closer to the value $d/4$ obtained for randomly permuted vectors than to the maximal value of $d/2$ .

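To estimate the size of the intersection, let us look at the quadrant probability $p$ that both $u_{i}(k)$ and $v_{i}(k)$ exceed their means, which, for jointly Gaussian activations with correlation $\rho_{k}$, is given analytically by $p=\frac{1}{4}+\frac{\arcsin(\rho_{k})}{2\pi}$.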

Even in the case of a correlation as high as 0.6, the intersection will include only about 35% of the neurons. For neurons $k$ not in this intersection, either both sides $u_{i}(k)$ and $v_{i}(k)$ are zero, meaning that no backpropagation occurs, or only one neuron is active, in which case only that side is updated and the update is a simple shrinking effect, since the loss is the magnitude of the activation.
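The 35% figure can be reproduced with a short calculation. The sketch assumes jointly Gaussian activations, for which the quadrant probability is $\frac{1}{4}+\frac{\arcsin(\rho)}{2\pi}$; the function names are illustrative.

```python
import math


def quadrant_probability(rho):
    """P(u > mean_u and v > mean_v) for a bivariate Gaussian pair with correlation rho."""
    return 0.25 + math.asin(rho) / (2 * math.pi)


def expected_joint_active(d, rho):
    """Expected number of neurons, out of d, that are above their mean on both sides."""
    return d * quadrant_probability(rho)


p_uncorrelated = quadrant_probability(0.0)  # d/4 regime of randomly permuted vectors
p_moderate = quadrant_probability(0.6)      # the case discussed in the text
p_perfect = quadrant_probability(1.0)       # d/2 regime of identical vectors
```

At $\rho=0.6$ the probability is roughly 0.35, much closer to the value of 0.25 for uncorrelated vectors than the correlation itself might suggest.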

In order to break this symmetry, we choose to employ the BN after the non-linearity. This allows the network to choose weights that result in mostly positive activations, which remain positive after the ReLU activation units.

3.3 Highly leaky ReLU

Another method to prevent the harmful effects of zeroing is to use leaky ReLU as our non-linear function. Leaky ReLU was first introduced in order to overcome the difficulties that arise from the elimination of the gradients from neurons with negative activation. In the 2-Way network, this effect is amplified, and we find leaky ReLU units to be extremely important. Formally, a leaky ReLU is defined as:

$$f(x)=\begin{cases}x & \text{if } x>0\\ ax & \text{otherwise}\end{cases}$$

where $a<1$ is the leakiness coefficient and is fixed during both training and testing. In all of our experiments, we use a leakiness coefficient of 0.3. This value was selected on the validation set of the Flickr8k experiment described in Section 4 and is used for all experiments.
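For concreteness, a minimal sketch of the activation with the fixed leakiness coefficient of 0.3 used in the experiments. The function name and default argument are illustrative choices.

```python
def leaky_relu(x, a=0.3):
    """Identity for positive inputs; negative inputs are scaled by the fixed coefficient a."""
    return x if x > 0 else a * x


positive_out = leaky_relu(2.0)
negative_out = leaky_relu(-2.0)
```

Unlike the conventional ReLU (the case $a=0$), a negative input still yields a non-zero output and a non-zero gradient of $a$.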

Using leaky ReLU helps to reduce the effect discussed in Section 3.2 but does not replace the need for performing BN after the non-linearity. As Lemma 3 shows, more than half of the neurons will be multiplied by the leakiness coefficient while their matching neuron will not. This asymmetric scaling adds an artificial distance between the matching neurons, which, in turn, increases the L2 loss and reduces the training efficiency.

3.4 Variance injection

Applying BN on the output of each hidden layer is not enough. The variance can still vanish during training. The problem is that the $\gamma$ factor introduced by each BN layer can be arbitrary and can diminish during training, resulting in low variance. To encourage high variance, we introduce a novel regularization term of the form $R_{\gamma}=\sum_{j,k}(1/\gamma_{jk})^{2}$ , where $\gamma_{jk}$ is the scaling parameter for neuron $k$ in layer $j$ .

This regularization term is enough to force the network to avoid solutions with low variance and to seek more informative output. This is demonstrated experimentally in the ablation study of Section 4.
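The regularizer $R_{\gamma}=\sum_{j,k}(1/\gamma_{jk})^{2}$ and its per-parameter gradient can be sketched as follows. The nested-list layout of the $\gamma$ values and the function names are assumptions made for illustration.

```python
def gamma_regularizer(gammas):
    """R_gamma: sum over layers j and neurons k of (1 / gamma_jk)^2."""
    return sum((1.0 / g) ** 2 for layer in gammas for g in layer)


def gamma_regularizer_grad(g):
    """Derivative of (1/g)^2 with respect to g, i.e. -2 / g^3."""
    return -2.0 / g ** 3


# Two layers with two and one neurons respectively (illustrative values).
example_gammas = [[1.0, 2.0], [0.5]]
r_gamma = gamma_regularizer(example_gammas)
grad_small = gamma_regularizer_grad(0.5)
grad_large = gamma_regularizer_grad(2.0)
```

For positive $\gamma$ the gradient is negative, so a gradient descent step increases $\gamma$, and the push is far stronger for small values (here $-16$ at $\gamma=0.5$ versus $-0.25$ at $\gamma=2$). This matches the stated goal of keeping the scale, and hence the variance, from vanishing.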

The compound loss term we employ is of the form:

where $\lambda_{w}$, $\lambda_{decov}$, and $\lambda_{\gamma}$ are the regularization coefficients. While it may seem that three regularization tradeoff hyperparameters would make selecting the parameter values difficult, the opposite is true: in all of our varied set of experiments $\lambda_{\gamma}=\lambda_{w}$, and $\lambda_{decov}$ is either set to a very high value of $1/2$ or, for small datasets, to $1/20$ (see Section 4). Moreover, adding these terms makes the network much less sensitive to the selection of $\lambda_{w}$ and allows us to train with a much higher learning rate.
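As a hedged sketch, the compound objective can be written by combining the terms named in this paper: the two reconstruction losses $L_{x}$ and $L_{y}$, the mid-way loss $L_{h}$ between matching hidden layers, and the three regularizers. The unit weights on the three data terms are an assumption.

```latex
L = L_{x} + L_{y} + L_{h}
  + \lambda_{w} R_{w}
  + \lambda_{decov} R_{decov}
  + \lambda_{\gamma} R_{\gamma}
```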

3.5 Tied dropout

Dropout is a form of regularization method that simulates the training of multiple networks with shared weights. Dropout zeros neurons by element-wise multiplying the output of a hidden layer consisting of $d$ neurons for a batch of $n$ samples with a random matrix $B$ of size $d\times n$ . Each element of $B$ is drawn independently from a Bernoulli distribution with a parameter $p$ .

Since dropout eliminates random neurons, it prevents co-adaptation of neurons, which is a desirable property for correlation analysis. However, using dropout, as is, in our proposed model is harmful. This is because the 2-Way network aims to enhance correlations between parallel layers $h^{j}$ and $\hat{h}^{j}$ . The elimination of neurons independently in the hidden layers creates an artificial loss, even for a perfect matching.

Let $p$ be the dropout parameter for layer $j$, and assume that the same parameter is applied in both directions. With probability $(1-p)^{2}$, a pair of matching neurons is active on both sides and learning occurs with the true gradient. With probability $p^{2}$, the pair of matching neurons is silent on both sides and no learning occurs. With probability $2p(1-p)$, only one neuron is active, resulting in a shrinking effect on the other neuron. Here, too, shrinking of activations can be damaging since it might lead to a state of constant representation.

For a dropout probability of $p=0.5$ , half of the gradients would stem from a match which is silent on exactly one side, and the harmful effect is clearly seen in Section 4.
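The three cases can be tabulated for any dropout parameter. The sketch below assumes independent masks with the same parameter $p$ on both sides, as in the analysis above; the function and key names are illustrative.

```python
def pair_state_probabilities(p):
    """Probabilities of the joint state of two matching neurons under independent dropout."""
    return {
        "both_active": (1 - p) ** 2,   # learning occurs with the true gradient
        "both_silent": p ** 2,         # no learning occurs
        "one_sided": 2 * p * (1 - p),  # only one side is active: shrinking effect
    }


states = pair_state_probabilities(0.5)
```

For $p=0.5$ the one-sided case has probability 0.5, twice that of the case in which both neurons are active.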

To overcome this problem, we introduce a tied dropout layer, in which the same random matrix $B^{j}$ is applied to pairs of matching hidden layers: $h^{j}$ and $\hat{h}^{j}$ , $j=1..K$ . This sharing eliminates the artifacts introduced by the conventional dropout while preserving the benefits of the stochastic process and helps avoid over-fitting.

Using a tied dropout layer changes the distribution of the activations. In order to match the distribution at test time, we incorporate a scaling factor at train time.

Assume that the activations of a single neuron are zero-centered. As discussed below, most post BN activations are almost exactly centered. In this case, the variance of the neuron activations is simply the sum of the squared activations. During training, only a ratio $1-p$ of the activations contributes to the variance. Therefore, we divide the activations, at train time, by $\sqrt{1-p}$ .
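A minimal sketch of a tied dropout step for one sample, assuming $p$ is the probability of dropping a neuron and that the surviving activations are divided by $\sqrt{1-p}$ at train time. The function name and the use of Python's `random.Random` are illustrative choices, not the released implementation.

```python
import math
import random


def tied_dropout(h, h_hat, p, rng):
    """Apply one shared Bernoulli mask to two matching layers of equal size."""
    keep = [rng.random() >= p for _ in h]  # each neuron is dropped with probability p
    scale = 1.0 / math.sqrt(1.0 - p)       # train-time variance correction
    out = [a * scale if k else 0.0 for a, k in zip(h, keep)]
    out_hat = [b * scale if k else 0.0 for b, k in zip(h_hat, keep)]
    return out, out_hat


rng = random.Random(0)
out, out_hat = tied_dropout([1.0] * 8, [2.0] * 8, p=0.5, rng=rng)
```

Because the mask is shared, a neuron is either kept on both sides or silenced on both sides, so the one-sided case that causes the shrinking effect cannot occur.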

3.6 Training high dimensional inputs

Some of the experiments shown below contain high dimensional data. High dimensional input directly increases the number of parameters and can cause over-fitting as well as an increase in training time and memory usage. To lower the number of parameters, we introduce a new type of layer, which we term a locally dense layer. Such a layer of size $n$ is composed of $m$ different dense layers $\bar{h}_{1},...,\bar{h}_{m}$ of size $\frac{n}{m}$ each. Input $x$ of size $d_{x}$ is divided into $m$ different parts of size $\frac{d_{x}}{m}$ and each part $x_{i}$ is connected to one of the dense layers $\bar{h}_{i}$. The outputs of all inner hidden layers are concatenated, thus producing the locally dense layer’s output. To the output, we add a regular bias term $b$ of size $n$. Using this layer reduces the number of parameters by a factor of $m$ compared to a conventional dense layer. In the experiments below, when dealing with high dimensional input, we use a locally dense layer with two inner dense layers.
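The layer and its parameter saving can be sketched in plain Python. The sketch assumes that $d_{x}$ and $n$ are divisible by $m$; the function names and the sizes in the example are illustrative assumptions.

```python
def dense_weight_count(d_x, n):
    """Number of weights (excluding biases) in a conventional dense layer."""
    return d_x * n


def locally_dense_weight_count(d_x, n, m):
    """m inner dense layers, each mapping d_x/m inputs to n/m outputs."""
    return m * (d_x // m) * (n // m)


def locally_dense_forward(x, weights, bias):
    """weights[i] is a (d_x/m) x (n/m) matrix for part i; outputs are concatenated, then the bias is added."""
    m = len(weights)
    part = len(x) // m
    out = []
    for i, w in enumerate(weights):
        chunk = x[i * part:(i + 1) * part]
        cols = len(w[0])
        out.extend(sum(chunk[r] * w[r][c] for r in range(part)) for c in range(cols))
    return [o + b for o, b in zip(out, bias)]


full = dense_weight_count(36000, 16000)
local = locally_dense_weight_count(36000, 16000, m=2)

# Tiny forward pass: 4 inputs, 2 outputs, m = 2 inner layers.
y = locally_dense_forward([1.0, 2.0, 3.0, 4.0],
                          [[[1.0], [1.0]], [[1.0], [-1.0]]],
                          [0.5, 0.5])
```

With $m=2$ the weight count is halved, and each output in the toy example depends only on its own half of the input.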

Experiments

We first present a detailed analysis of the two datasets most commonly used in the literature for examining recent CCA variants: MNIST half matching and X-Ray Microbeam Speech data (XRMB). We then provide additional experiments on the problem of image to sentence matching, showing state of the art results on the Flickr8k, Flickr30k and COCO datasets.

We follow the conventional way of evaluating the performance of CCA variants and compute the sum of the correlations of the top $c$ shared (canonical) representation variables found. The datasets used for this comparison are MNIST and XRMB. In both MNIST and XRMB experiments, we set $\lambda_{decov}=\lambda_{W}=\lambda_{\gamma}=0.05$ . For training, we used stochastic gradient descent with a learning rate of 0.0001 which was halved every 20 epochs. A momentum of 0.9 is used and a tied dropout probability of 0.5.
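The learning rate schedule described above can be written as a step decay. Counting epochs from zero and applying the halving at epochs 20, 40, and so on is an assumption about how "every 20 epochs" is implemented; the function name is illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, halve_every=20):
    """Step decay: start at base_lr and halve it after every `halve_every` epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)


lr_start = learning_rate(0)
lr_epoch_20 = learning_rate(20)
lr_epoch_45 = learning_rate(45)
```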

MNIST half matching The MNIST handwritten digits dataset contains 60,000 images of handwritten digits for training and 10,000 images for testing. Each image is cut vertically into two halves, resulting in 392 features each. The goal is to maximize the correlation of the top $c=50$ canonical variables. The model used is composed of three layers of size 392, 50 and 392 respectively, noted as 392-50-392. The middle layer was taken as the output.

X-Ray Microbeam Speech data The XRMB dataset contains simultaneous acoustic and articulatory recordings. The articulatory data is represented as a 112 dimensional vector. The acoustic data are the MFCCs for the same frames, yielding a 273 dimensional vector at each point in time. For benchmarking, 30,000 random samples are used for training, 10,000 for cross-validation and 10,000 for testing. The correlation is measured across the $c=112$ top correlated canonical variables. The same training configuration of the MNIST experiment was used for the XRMB dataset. For XRMB, we tested our model using hidden layer configuration of 560-280-112-680-1365.

Tab. 1 contains correlation comparisons on the MNIST and XRMB datasets of six CCA variants besides our proposed method. As can be seen, our method (“2WayNet”) outperforms all literature methods by a large margin on the XRMB dataset. On the MNIST dataset, in which the literature results are closer to the maximal value of 50, our method is able to regain half of the remaining correlation.

4.2 Image annotation and search

We next evaluate the proposed model on the sentence-image matching task. In this task, each dataset contains a set of images and five matching sentences per image. For each dataset, we test our model on two tasks: searching for an image given a query sentence, and matching a sentence given an image. We measure our performance on three datasets, Flickr8k, Flickr30k and COCO, containing 8,000, 30,000 and 123,000 images, respectively.

Images are represented by the representation layer of the VGG network as vectors of size 4096. Sentences are represented using previously published code. Among the available text encodings, we employ the concatenation of the Fisher Vector encoding (GMM) and the Fisher Vector of the HGLMM distribution. Each sentence is thus represented as a 36,000D vector. Going from the image to the much larger sentence representation, we trained networks containing two conventional hidden layers of sizes 2000 and 3000 and an additional locally dense layer of 16000 neurons and $m=2$ for the Flickr30k and COCO datasets. For Flickr8k, due to the relatively small dataset, we used a dense layer of 4000 neurons. Correlation is used as a similarity measure between images and sentences. To this end, we use the middle network representations from each channel, resulting in a representation vector of size 3000.

The Flickr8k dataset is provided with training, validation, and test splits. For Flickr30k and COCO, no splits are given, and we use the same splits as in previous work. $\lambda_{decov}$ is set to a value of $1/2$, which eliminated almost all off-diagonal covariances at the middle layer. The other parameters are set as in the MNIST and XRMB experiments.

Tab. 2 compares our results to the state-of-the-art methods on the image-sentence matching task. We also report results that we computed for the RCCA method. The open implementations of the various deep CCA methods do not seem to scale well enough for this benchmark. Our proposed method achieves the best performance across almost all scores, especially in the image annotation task, where we improved by a large margin on all three datasets, and especially when considering the top result (r@1).
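The recall scores can be computed from a query-by-candidate similarity matrix. The sketch below assumes the standard definition of recall@K, the fraction of queries for which at least one correct candidate is ranked among the top K. The function name and the toy scores are illustrative and are not results from the paper.

```python
def recall_at_k(similarity, relevant, k):
    """similarity[q][c]: score of candidate c for query q; relevant[q]: set of correct candidate indices."""
    hits = 0
    for scores, rel in zip(similarity, relevant):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if rel.intersection(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Three queries and three candidates with made-up similarity scores.
toy_similarity = [[0.9, 0.1, 0.3],
                  [0.2, 0.4, 0.8],
                  [0.5, 0.6, 0.1]]
toy_relevant = [{0}, {1}, {1}]
r_at_1 = recall_at_k(toy_similarity, toy_relevant, 1)
r_at_2 = recall_at_k(toy_similarity, toy_relevant, 2)
```

In this setting the similarity entries would be the correlations between the middle-layer representations of an image and a sentence, and `relevant` can hold several indices per query since each image has five matching sentences.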

4.3 Ablation analysis

We perform an ablation analysis aimed at isolating the effect of the various architectural novelties suggested. Experiments were conducted on the Flickr8k, Flickr30k, MNIST and XRMB datasets. Each experiment uses the baseline configuration used in previous experiments with only one alteration.

Batch Normalization For this experiment, we used different settings for the BN layer. The configuration settings include: (1) without BN, (2) with conventional BN (before ReLU) without regularizing $\gamma$, (3) with post-ReLU BN, without regularizing $\gamma$, (4) using BN before the ReLU with $\lambda_{\gamma}=0.05$, and (5) our proposed method: BN applied only after ReLU with $\lambda_{\gamma}=0.05$. Tab. 3 reports the performance of the various configurations in terms of correlation and the mean variance of all features on the validation set.

As Tab. 3 shows, batch normalization has a profound effect on the network’s results. The results without batch normalization were obtained with a lower learning rate, since a higher learning rate prevented the training from converging. We can also see that using the $1/\gamma$ regularization term significantly increases the variance of the hidden representation, which, in turn, stabilizes the training process and improves correlation. The effect studied in Section 3.2 is clearly visible in the ablation study: positioning the BN layer after the leaky ReLU prevents unbalanced representations, as can be seen by the difference in variances, and significantly increases the correlation of the two representations. Tab. 4 contains r@1 results for the same experiments on the Flickr8k dataset. As in Tab. 3, our suggested configuration achieves the best recall rates.

Tied Dropout We trained the same base configuration as described above. We tested our proposed method with a conventional dropout instead of a tied dropout, and with dropout removed altogether. In all experiments, the dropout probability $p$ was set at 0.5.

As can be seen, the performance drops when using the conventional dropout instead of the proposed tied dropout layer. The benefits of the tied dropout layer are most significant on the large datasets Flickr8k and Flickr30k, where over-fitting is likely. The shrinking effect discussed in Section 3.5 is clearly visible and is manifested as low variance of the output of the model based on conventional dropout, compared to a much higher variance when using the tied dropout.

Leaky ReLU We also tested the contribution of other parameters to the model’s performance. One of the major benefits came from using the leaky ReLU non-linearity. Using conventional ReLU resulted in a large correlation loss of about $33\%$ (1192 total correlation) for Flickr8k.

Loss terms Another aspect we tested is the effect of the various loss terms on correlation and recall rates. Removing the $L_{h}$ term results in a $31\%$ (1230) decrease in correlation. This agrees with Lemma 2, which links the output’s correlation and the $L_{h}$ loss term. While the $L_{h}$ loss increases the output’s correlation, the reconstruction loss terms $L_{x}$ and $L_{y}$ decrease it. Removing them both increases correlation by $56\%$ (2752). While the correlation produced between the two views is higher without the two reconstruction losses, the dimensions of each representation are highly correlated, resulting in a decrease of $87\%$ in image search and $91\%$ in image annotation performance as measured by recall@1: from the full method’s performance of 29.3 and 43.4 for the tasks of image search and image annotation to 4.0 and 3.9 respectively.

Regularization The effect of $R_{\gamma}$ can be viewed in Tab. 3. Removing the $R_{decov}$ term results in a decrease of all measures. Image search r@1 and r@5 results decrease by $14\%$ and $10\%$ respectively, and the image annotation r@1 and r@5 results decrease by $10\%$ and $8\%$ respectively. Moreover, the correlation is reduced by $4\%$.

Locally dense layer To test the effect of the proposed locally dense layer, we trained our model on Flickr30k with a regular dense layer of the same size (16000 neurons) and with a regular dense layer of half the size. Image annotation r@1 (r@5) results degrade by $7\%$ ($3\%$) and image search results by $1\%$ ($1\%$) when using a conventional dense layer of 16000 neurons. Using a dense layer of half the size results in a drop of $13\%$ ($9\%$) for image annotation r@1 (r@5) rates and $11\%$ ($8\%$) for image search r@1 (r@5) rates.

Parameter sensitivity: Fig. 2(a) shows the effect of different leakiness coefficient values on the correlation as measured on the validation sets of the MNIST and XRMB data sets. The results were obtained by training the network using leakiness coefficients ranging between 0 and 0.7. As can be seen, there is a large region of values that provide better performance than the conventional zero-leakiness ReLU. Fig. 2(b) shows the effect of the regularization weight $\lambda_{\gamma}$ that controls the learned variance of the BN layer. The value used in our experiments seems to be beneficial and lies at a relatively wide high-performance plateau.

Conclusions

In this paper, we present a method for linking paired samples from two sources. The method significantly outperforms all literature methods in the highly applicable and well studied domain of correlation analysis, including the classical methods, their modern variants, and the recent deep correlation methods. We are unique in that we employ a tied 2-way architecture, reconstructing each view from the other, and unlike most methods, we employ the Euclidean loss. In order to promote effective training, we introduce a series of contributions that are aimed at maintaining the variance of the learned representations. Each of these modifications is provided with an analysis that explains its role, and together they work hand in hand to provide the complete architecture, which is highly accurate.

Our method is generic and can be employed in any computer vision domain in which two data modalities are used. In addition, our contributions could also help in training univariate regression problems. In the literature, the Euclidean loss is often combined with other losses, or replaced by an alternative loss in order to mitigate the challenges of training regression problems. Our variance injection method can be easily incorporated into any existing network.

As future work, we would like to continue exploring the use of tied 2-Way networks for matching views from different domains. In almost all of our trained networks, the biases of the batch normalization layers in the solutions tend to have very low values. These biases can probably be eliminated altogether. In addition, in many encoder/decoder schemes, layers are added gradually during training. It is possible to adapt such a scheme to our framework, adding hidden layers in the middle of the network one by one.