A Neural Algorithm of Artistic Style

Leon A. Gatys, Alexander S. Ecker, Matthias Bethge

Methods

The results presented in the main text were generated on the basis of the VGG-Network, a Convolutional Neural Network that rivals human performance on a common visual object recognition benchmark task and was introduced and extensively described in its original publication. We used the feature space provided by the 16 convolutional and 5 pooling layers of the 19-layer VGG-Network. We did not use any of the fully connected layers. The model is publicly available and can be explored in the Caffe framework. For image synthesis we found that replacing the max-pooling operation by average pooling improves the gradient flow and yields slightly more appealing results, which is why the images shown were generated with average pooling.
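The pooling swap can be illustrated on a single feature map. The following is a minimal sketch in plain Python, assuming 2×2 windows with stride 2 and even map dimensions; the function names are illustrative and are not taken from the released model. One plausible reading of the gradient-flow remark is that average pooling passes a share of the gradient to every input in a window, whereas max pooling passes it only to the largest input.

```python
def pool2x2(fmap, reduce_fn):
    """Apply a 2x2, stride-2 pooling window to a 2-D list.

    Assumes even height and width (an assumption of this sketch).
    """
    pooled = []
    for r in range(0, len(fmap), 2):
        row = []
        for c in range(0, len(fmap[0]), 2):
            window = [fmap[r][c], fmap[r][c + 1],
                      fmap[r + 1][c], fmap[r + 1][c + 1]]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def max_pool(fmap):
    """Max pooling: keeps only the largest response in each window."""
    return pool2x2(fmap, max)


def avg_pool(fmap):
    """Average pooling: every response in the window contributes equally."""
    return pool2x2(fmap, lambda window: sum(window) / len(window))


# Toy 4x4 feature map, invented for illustration.
fmap = [[1.0, 3.0, 0.0, 2.0],
        [5.0, 7.0, 4.0, 6.0],
        [8.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0]]

pooled_max = max_pool(fmap)
pooled_avg = avg_pool(fmap)
```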

Generally each layer in the network defines a non-linear filter bank whose complexity increases with the position of the layer in the network. Hence a given input image $\vec{x}$ is encoded in each layer of the CNN by the filter responses to that image. A layer with $N_{l}$ distinct filters has $N_{l}$ feature maps each of size $M_{l}$, where $M_{l}$ is the height times the width of the feature map. So the responses in a layer $l$ can be stored in a matrix $F^{l}\in\mathbb{R}^{N_{l}\times M_{l}}$ where $F_{ij}^{l}$ is the activation of the $i^{\text{th}}$ filter at position $j$ in layer $l$. To visualise the image information that is encoded at different layers of the hierarchy (Fig 1, content reconstructions) we perform gradient descent on a white noise image to find another image that matches the feature responses of the original image. So let $\vec{p}$ and $\vec{x}$ be the original image and the image that is generated, and $P^{l}$ and $F^{l}$ their respective feature representations in layer $l$. We then define the squared-error loss between the two feature representations

$$\mathcal{L}_{content}(\vec{p},\vec{x},l)=\frac{1}{2}\sum_{i,j}\left(F_{ij}^{l}-P_{ij}^{l}\right)^{2}.$$

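The derivative of this loss with respect to the activations in layer $l$ equals

$$\frac{\partial\mathcal{L}_{content}}{\partial F_{ij}^{l}}=\begin{cases}\left(F^{l}-P^{l}\right)_{ij} & \text{if } F_{ij}^{l}>0\\ 0 & \text{if } F_{ij}^{l}<0,\end{cases}$$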

from which the gradient with respect to the image $\vec{x}$ can be computed using standard error back-propagation. Thus we can change the initially random image $\vec{x}$ until it generates the same response in a certain layer of the CNN as the original image $\vec{p}$. The five content reconstructions in Fig 1 are from layers ‘conv1_1’ (a), ‘conv2_1’ (b), ‘conv3_1’ (c), ‘conv4_1’ (d) and ‘conv5_1’ (e) of the original VGG-Network.
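The content reconstruction procedure can be sketched on a toy problem. The example below is not the VGG-Network: it assumes a hypothetical single rectified linear layer, $F=\max(0,W\vec{x})$, with an invented weight matrix, and it assumes the squared-error loss carries a factor of one half so that its derivative is simply the feature difference on active units. The step size, step count and all function names are illustrative choices.

```python
import random


def relu_layer(W, x):
    """Toy stand-in for one CNN layer: F = max(0, W x), W given as a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]


def content_loss(F, P):
    """Squared-error loss 0.5 * sum (F - P)^2 (the 0.5 factor is an assumption)."""
    return 0.5 * sum((f - p) ** 2 for f, p in zip(F, P))


def content_grad_features(F, P):
    """dL/dF: the difference (F - P) on active units, zero on inactive ones."""
    return [(f - p) if f > 0 else 0.0 for f, p in zip(F, P)]


def backprop_to_input(W, grad_F):
    """Chain rule through F = relu(W x): dL/dx = W^T grad_F (grad_F is already masked)."""
    return [sum(W[i][j] * grad_F[i] for i in range(len(W)))
            for j in range(len(W[0]))]


def reconstruct(W, p, steps=500, lr=0.1, seed=0):
    """Gradient descent from a random start until the features of x match those of p."""
    rng = random.Random(seed)
    P = relu_layer(W, p)                      # target features of the original
    x = [rng.uniform(0.0, 1.0) for _ in p]    # "white noise" initialisation
    for _ in range(steps):
        F = relu_layer(W, x)
        grad_x = backprop_to_input(W, content_grad_features(F, P))
        x = [xi - lr * gi for xi, gi in zip(x, grad_x)]
    return x, content_loss(relu_layer(W, x), P)


# Invented weights and "image"; three filters acting on a two-pixel input.
W = [[1.0, 0.5],
     [0.2, 1.0],
     [0.7, 0.3]]
p = [0.8, 0.3]

x_hat, final_loss = reconstruct(W, p)
```

In this toy setting the layer has more filters than input pixels and all units stay active, so matching the features recovers the input almost exactly. In a deep network the match is generally not unique, which is why reconstructions from higher layers need not reproduce the exact pixel values.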

On top of the CNN responses in each layer of the network we built a style representation that computes the correlations between the different filter responses, where the expectation is taken over the spatial extent of the input image. These feature correlations are given by the Gram matrix $G^{l}\in\mathbb{R}^{N_{l}\times N_{l}}$, where $G_{ij}^{l}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$:

$$G_{ij}^{l}=\sum_{k}F_{ik}^{l}F_{jk}^{l}.$$
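The inner products between vectorised feature maps can be sketched in a few lines of plain Python. The feature values below are invented, and `gram_matrix` is an illustrative name. The second computation shows that permuting the spatial positions leaves the Gram matrix unchanged, which is one way to see that this representation discards the spatial arrangement of the image.

```python
def gram_matrix(F):
    """Gram matrix of an N x M feature matrix F (list of N rows of length M).

    Entry [i][j] is the inner product of feature maps i and j over all positions.
    """
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in F] for fi in F]


# Two feature maps (N = 2) with three spatial positions each (M = 3).
F = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]
G = gram_matrix(F)

# Apply the same permutation of spatial positions to every feature map.
F_shuffled = [[row[k] for k in (2, 0, 1)] for row in F]
G_shuffled = gram_matrix(F_shuffled)
```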

To generate a texture that matches the style of a given image (Fig 1, style reconstructions), we use gradient descent from a white noise image to find another image that matches the style representation of the original image. This is done by minimising the mean-squared distance between the entries of the Gram matrix from the original image and the Gram matrix of the image to be generated. So let $\vec{a}$ and $\vec{x}$ be the original image and the image that is generated, and $A^{l}$ and $G^{l}$ their respective style representations in layer $l$. The contribution of that layer to the total loss is then

$$E_{l}=\frac{1}{4N_{l}^{2}M_{l}^{2}}\sum_{i,j}\left(G_{ij}^{l}-A_{ij}^{l}\right)^{2}$$

and the total style loss is

$$\mathcal{L}_{style}(\vec{a},\vec{x})=\sum_{l=0}^{L}w_{l}E_{l}$$

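where $w_{l}$ are weighting factors of the contribution of each layer to the total loss (see below for specific values of $w_{l}$ in our results). The derivative of $E_{l}$ with respect to the activations in layer $l$ can be computed analytically:

$$\frac{\partial E_{l}}{\partial F_{ij}^{l}}=\begin{cases}\frac{1}{N_{l}^{2}M_{l}^{2}}\left(\left(F^{l}\right)^{\mathrm{T}}\left(G^{l}-A^{l}\right)\right)_{ji} & \text{if } F_{ij}^{l}>0\\ 0 & \text{if } F_{ij}^{l}<0.\end{cases}$$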

The gradients of $E_{l}$ with respect to the activations in lower layers of the network can be readily computed using standard error back-propagation. The five style reconstructions in Fig 1 were generated by matching the style representations on layer ‘conv1_1’ (a), ‘conv1_1’ and ‘conv2_1’ (b), ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (c), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (d), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (e).
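The per-layer style loss and its analytic derivative can be checked numerically on a small example. The sketch below assumes the layer loss is the sum of squared Gram differences normalised by $4N_{l}^{2}M_{l}^{2}$; under that assumption the derivative on active units is $((G^{l}-A^{l})F^{l})_{ij}/(N_{l}^{2}M_{l}^{2})$, which uses the symmetry of the Gram matrices. All feature values and function names are invented for illustration, and the finite-difference routine serves only as a sanity check.

```python
def gram_matrix(F):
    """Gram matrix: inner products between the rows (feature maps) of F."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in F] for fi in F]


def style_layer_loss(F, A):
    """E_l = sum_ij (G_ij - A_ij)^2 / (4 N^2 M^2); the normalisation is an assumption."""
    N, M = len(F), len(F[0])
    G = gram_matrix(F)
    squared = sum((G[i][j] - A[i][j]) ** 2 for i in range(N) for j in range(N))
    return squared / (4.0 * N ** 2 * M ** 2)


def style_layer_grad(F, A):
    """Analytic dE_l/dF_ij = ((G - A) F)_ij / (N^2 M^2) where F_ij > 0, else 0."""
    N, M = len(F), len(F[0])
    G = gram_matrix(F)
    D = [[G[i][j] - A[i][j] for j in range(N)] for i in range(N)]
    scale = 1.0 / (N ** 2 * M ** 2)
    return [[scale * sum(D[i][k] * F[k][j] for k in range(N)) if F[i][j] > 0 else 0.0
             for j in range(M)]
            for i in range(N)]


def numeric_grad(F, A, eps=1e-6):
    """Central finite differences on E_l, used only to verify the analytic form."""
    grad = []
    for i in range(len(F)):
        row = []
        for j in range(len(F[0])):
            plus = [r[:] for r in F]
            minus = [r[:] for r in F]
            plus[i][j] += eps
            minus[i][j] -= eps
            row.append((style_layer_loss(plus, A) - style_layer_loss(minus, A)) / (2 * eps))
        grad.append(row)
    return grad


# Features of the generated image (all active) and the Gram matrix of a style image.
F = [[1.0, 2.0, 0.5],
     [0.3, 1.0, 3.0]]
A = gram_matrix([[0.5, 1.0, 1.5],
                 [2.0, 0.1, 0.4]])

analytic = style_layer_grad(F, A)
numeric = numeric_grad(F, A)
max_diff = max(abs(a - n)
               for row_a, row_n in zip(analytic, numeric)
               for a, n in zip(row_a, row_n))
```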

To generate the images that mix the content of a photograph with the style of a painting (Fig 2) we jointly minimise the distance of a white noise image from the content representation of the photograph in one layer of the network and the style representation of the painting in a number of layers of the CNN. So let $\vec{p}$ be the photograph and $\vec{a}$ be the artwork. The loss function we minimise is

$$\mathcal{L}_{total}(\vec{p},\vec{a},\vec{x})=\alpha\mathcal{L}_{content}(\vec{p},\vec{x})+\beta\mathcal{L}_{style}(\vec{a},\vec{x})$$

where $\alpha$ and $\beta$ are the weighting factors for content and style reconstruction respectively. For the images shown in Fig 2 we matched the content representation on layer ‘conv4_2’ and the style representations on layers ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ ($w_{l}=1/5$ in those layers, $w_{l}=0$ in all other layers). The ratio $\alpha/\beta$ was either $1\times 10^{-3}$ (Fig 2 B,C,D) or $1\times 10^{-4}$ (Fig 2 E,F). Fig 3 shows results for different relative weightings of the content and style reconstruction loss (along the columns) and for matching the style representations only on layer ‘conv1_1’ (A), ‘conv1_1’ and ‘conv2_1’ (B), ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (C), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (D), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (E). The factor $w_{l}$ was always equal to one divided by the number of active layers with a non-zero loss-weight $w_{l}$.
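The weighting scheme can be written out as a short sketch. The layer names and the ratio $\alpha/\beta$ follow the text; the per-layer loss values and the content loss value are placeholders invented for illustration, and the function names are hypothetical. Only the ratio $\alpha/\beta$ is stated in the text, so fixing $\beta=1$ here is an arbitrary choice.

```python
def layer_weights(all_layers, active_layers):
    """w_l = 1 / (number of active layers) on active layers, 0 on all others."""
    share = 1.0 / len(active_layers)
    return {name: (share if name in active_layers else 0.0) for name in all_layers}


def style_loss(layer_losses, weights):
    """Weighted sum of the per-layer style losses E_l."""
    return sum(weights[name] * layer_losses[name] for name in layer_losses)


def total_loss(content, style, alpha, beta):
    """Joint objective: alpha * content loss + beta * style loss."""
    return alpha * content + beta * style


STYLE_LAYERS = ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]

# Style matched on the first three layers only, as in case (C) of Fig 3.
weights = layer_weights(STYLE_LAYERS, STYLE_LAYERS[:3])

# Placeholder per-layer losses E_l and content loss, invented for illustration.
E = {"conv1_1": 2.0, "conv2_1": 4.0, "conv3_1": 6.0, "conv4_1": 8.0, "conv5_1": 10.0}
content = 50.0

ratio = 1e-3          # alpha / beta, one of the two values used for Fig 2
beta = 1.0            # arbitrary; only the ratio is specified
alpha = ratio * beta

loss = total_loss(content, style_loss(E, weights), alpha, beta)
```

Because only the ratio matters for the location of the minimum, rescaling $\alpha$ and $\beta$ together changes the loss value but not the image that minimises it.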

This work was funded by the German National Academic Foundation (L.A.G.), the Bernstein Center for Computational Neuroscience (FKZ 01GQ1002) and the German Excellency Initiative through the Centre for Integrative Neuroscience Tübingen (EXC307) (M.B., A.S.E., L.A.G.).

References and Notes