A Neural Algorithm of Artistic Style

Leon A. Gatys, Alexander S. Ecker, Matthias Bethge

Methods

The results presented in the main text were generated on the basis of the VGG-Network, a Convolutional Neural Network that rivals human performance on a common visual object recognition benchmark task and was introduced and extensively described in the original VGG publication. We used the feature space provided by the 16 convolutional and 5 pooling layers of the 19-layer VGG-Network. We do not use any of the fully connected layers. The model is publicly available and can be explored in the Caffe framework. For image synthesis we found that replacing the max-pooling operation by average pooling improves the gradient flow and yields slightly more appealing results, which is why the images shown were generated with average pooling.
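
As a rough illustration of the pooling swap described above, the following NumPy sketch contrasts max pooling with average pooling on a single feature map. The function name `pool2x2`, the non-overlapping 2x2 window with stride 2, and the toy input are assumptions made for illustration; they are not taken from the released model definition.

```python
import numpy as np

def pool2x2(feature_map, mode="avg"):
    """Pool a 2-D feature map over non-overlapping 2x2 windows (stride 2)."""
    h, w = feature_map.shape
    # Drop a trailing odd row/column so the map tiles evenly (a choice of this sketch).
    fm = feature_map[: h - h % 2, : w - w % 2]
    # After the reshape, axes 1 and 3 index the positions inside each 2x2 window.
    windows = fm.reshape(fm.shape[0] // 2, 2, fm.shape[1] // 2, 2)
    if mode == "avg":
        return windows.mean(axis=(1, 3))
    if mode == "max":
        return windows.max(axis=(1, 3))
    raise ValueError("mode must be 'avg' or 'max'")

# Hypothetical 4x4 feature map.
fmap = np.array([[1.0, 2.0, 0.0, 0.0],
                 [3.0, 4.0, 0.0, 8.0],
                 [1.0, 1.0, 2.0, 2.0],
                 [1.0, 1.0, 2.0, 2.0]])

avg_out = pool2x2(fmap, "avg")   # every input contributes to the output
max_out = pool2x2(fmap, "max")   # only the largest input per window contributes
```

Under average pooling every position in a window receives an equal share of the back-propagated gradient, whereas max pooling routes it only to the maximal position; this is one plausible reading of the improved gradient flow noted above.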

Generally each layer in the network defines a non-linear filter bank whose complexity increases with the position of the layer in the network. Hence a given input image $\vec{x}$ is encoded in each layer of the CNN by the filter responses to that image. A layer with $N_l$ distinct filters has $N_l$ feature maps each of size $M_l$, where $M_l$ is the height times the width of the feature map. So the responses in a layer $l$ can be stored in a matrix $F^l \in \mathcal{R}^{N_l \times M_l}$ where $F^l_{ij}$ is the activation of the $i^{th}$ filter at position $j$ in layer $l$. To visualise the image information that is encoded at different layers of the hierarchy (Fig 1, content reconstructions) we perform gradient descent on a white noise image to find another image that matches the feature responses of the original image. Let $\vec{p}$ and $\vec{x}$ be the original image and the generated image, and $P^l$ and $F^l$ their respective feature representations in layer $l$. We then define the squared-error loss between the two feature representations

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2 .$$

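The derivative of this loss with respect to the activations in layer $l$ equals

$$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} \left( F^l - P^l \right)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 , \end{cases}$$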

from which the gradient with respect to the image $\vec{x}$ can be computed using standard error back-propagation. Thus we can change the initially random image $\vec{x}$ until it generates the same response in a certain layer of the CNN as the original image $\vec{p}$. The five content reconstructions in Fig 1 are from layers ‘conv1_1’ (a), ‘conv2_1’ (b), ‘conv3_1’ (c), ‘conv4_1’ (d) and ‘conv5_1’ (e) of the original VGG-Network.
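
The content loss and its derivative with respect to the layer activations can be sketched in a few lines of NumPy. This is a minimal illustration, not the original implementation: the function names `content_loss` and `content_grad` and the small hand-written matrices `F` and `P` are hypothetical, and the network that would produce the activations and carry the gradient back to the image is omitted.

```python
import numpy as np

def content_loss(F, P):
    """Squared-error loss between generated (F) and original (P) feature matrices."""
    return 0.5 * np.sum((F - P) ** 2)

def content_grad(F, P):
    """Derivative of the content loss w.r.t. F, set to zero where F is not positive."""
    return np.where(F > 0, F - P, 0.0)

# Toy activations for a layer with N_l = 2 filters and M_l = 3 positions (made-up values).
F = np.array([[0.0, 1.5, 0.2],
              [2.0, 0.0, 0.7]])
P = np.array([[0.5, 1.0, 0.0],
              [1.0, 0.3, 0.7]])

loss = content_loss(F, P)
grad = content_grad(F, P)

# Finite-difference check on one strictly positive entry, F[0, 1].
eps = 1e-6
F_plus = F.copy()
F_plus[0, 1] += eps
numeric = (content_loss(F_plus, P) - content_loss(F, P)) / eps
```

For positive activations the finite-difference estimate `numeric` agrees with the analytic entry `grad[0, 1]` up to the step size, which is a quick sanity check before wiring such a loss into back-propagation.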

On top of the CNN responses in each layer of the network we built a style representation that computes the correlations between the different filter responses, where the expectation is taken over the spatial extent of the input image. These feature correlations are given by the Gram matrix $G^l \in \mathcal{R}^{N_l \times N_l}$, where $G^l_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk} .$$
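
A minimal NumPy sketch of the Gram matrix follows, assuming the activations of one layer are stored as an $N_l \times M_l$ matrix with one vectorised feature map per row. The function name `gram_matrix`, the toy matrix and the permutation are illustrative assumptions.

```python
import numpy as np

def gram_matrix(F):
    """Inner products between all pairs of vectorised feature maps (rows of F)."""
    return F @ F.T

# Toy layer with N_l = 2 feature maps and M_l = 3 spatial positions (made-up values).
F = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
G = gram_matrix(F)

# Shuffling the spatial positions (columns) leaves the Gram matrix unchanged.
perm = [2, 0, 1]
G_shuffled = gram_matrix(F[:, perm])
```

Because the sum runs over all positions, `G_shuffled` equals `G`: the representation records which filters respond together but not where in the image they do so.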

To generate a texture that matches the style of a given image (Fig 1, style reconstructions), we use gradient descent from a white noise image to find another image that matches the style representation of the original image. This is done by minimising the mean-squared distance between the entries of the Gram matrix from the original image and the Gram matrix of the image to be generated. Let $\vec{a}$ and $\vec{x}$ be the original image and the generated image, and $A^l$ and $G^l$ their respective style representations in layer $l$. The contribution of that layer to the total loss is then

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$

and the total loss is

$$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} w_l E_l$$

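where $w_l$ are weighting factors of the contribution of each layer to the total loss (see below for specific values of $w_l$ in our results). The derivative of $E_l$ with respect to the activations in layer $l$ can be computed analytically:

$$\frac{\partial E_l}{\partial F^l_{ij}} = \begin{cases} \frac{1}{N_l^2 M_l^2} \left( (F^l)^{\mathrm{T}} \left( G^l - A^l \right) \right)_{ji} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 . \end{cases}$$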
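
The analytic form can be checked with a short chain-rule calculation. The sketch below takes the Gram matrix as $G^l_{ab} = \sum_k F^l_{ak} F^l_{bk}$ and the layer loss as $E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{a,b} (G^l_{ab} - A^l_{ab})^2$, and treats $A^l$ as constant. The index names $a, b$ and the Kronecker delta $\delta$ are notation introduced here for the derivation only.

```latex
\begin{aligned}
\frac{\partial G^l_{ab}}{\partial F^l_{ij}}
  &= \delta_{ai} F^l_{bj} + \delta_{bi} F^l_{aj}, \\
\frac{\partial E_l}{\partial F^l_{ij}}
  &= \frac{1}{4 N_l^2 M_l^2} \sum_{a,b} 2 \left( G^l_{ab} - A^l_{ab} \right)
     \left( \delta_{ai} F^l_{bj} + \delta_{bi} F^l_{aj} \right) \\
  &= \frac{1}{2 N_l^2 M_l^2}
     \left[ \sum_{b} \left( G^l - A^l \right)_{ib} F^l_{bj}
          + \sum_{a} \left( G^l - A^l \right)_{ai} F^l_{aj} \right] \\
  &= \frac{1}{N_l^2 M_l^2} \left( \left( G^l - A^l \right) F^l \right)_{ij}
   = \frac{1}{N_l^2 M_l^2} \left( (F^l)^{\mathrm{T}} \left( G^l - A^l \right) \right)_{ji} .
\end{aligned}
```

The last line uses the symmetry of $G^l - A^l$, which makes the two sums in the bracket equal. The restriction to $F^l_{ij} > 0$ is then applied on top of this expression.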

The gradients of $E_l$ with respect to the activations in lower layers of the network can be readily computed using standard error back-propagation. The five style reconstructions in Fig 1 were generated by matching the style representations on layer ‘conv1_1’ (a), ‘conv1_1’ and ‘conv2_1’ (b), ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (c), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (d), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (e).
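
The per-layer style loss and its analytic derivative can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the original code: the function names, the toy activation matrices `F_style` and `F`, and the finite-difference check are all hypothetical, and back-propagation through the network to the image is left out.

```python
import numpy as np

def gram_matrix(F):
    """Inner products between all pairs of rows (vectorised feature maps) of F."""
    return F @ F.T

def style_layer_loss(F, A):
    """E_l: squared distance between Gram matrices, scaled by 1 / (4 N^2 M^2)."""
    N, M = F.shape
    G = gram_matrix(F)
    return np.sum((G - A) ** 2) / (4.0 * N**2 * M**2)

def style_layer_grad(F, A):
    """Derivative of E_l w.r.t. F, set to zero where F is not positive."""
    N, M = F.shape
    G = gram_matrix(F)
    # Entry [i, j] of (G - A) @ F equals entry [j, i] of F^T (G - A), since G - A is symmetric.
    grad = ((G - A) @ F) / (N**2 * M**2)
    return np.where(F > 0, grad, 0.0)

# Toy activations with N_l = 2 filters and M_l = 3 positions (made-up values).
F_style = np.array([[1.0, 0.0, 2.0],
                    [0.0, 3.0, 1.0]])   # activations for the style image
F = np.array([[2.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])         # activations for the generated image
A = gram_matrix(F_style)

E_l = style_layer_loss(F, A)
grad = style_layer_grad(F, A)

# Finite-difference check on the strictly positive entry F[1, 0].
eps = 1e-6
F_plus = F.copy()
F_plus[1, 0] += eps
numeric = (style_layer_loss(F_plus, A) - E_l) / eps
```

In this toy case the two Gram matrices differ only in their lower-right entry, so only the second row of `grad` is non-zero, and `numeric` matches `grad[1, 0]` up to the step size.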

To generate the images that mix the content of a photograph with the style of a painting (Fig 2) we jointly minimise the distance of a white noise image from the content representation of the photograph in one layer of the network and the style representation of the painting in a number of layers of the CNN. Let $\vec{p}$ be the photograph and $\vec{a}$ be the artwork. The loss function we minimise is

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{style}(\vec{a}, \vec{x})$$

where $\alpha$ and $\beta$ are the weighting factors for content and style reconstruction respectively. For the images shown in Fig 2 we matched the content representation on layer ‘conv4_2’ and the style representations on layers ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ ($w_l = 1/5$ in those layers, $w_l = 0$ in all other layers). The ratio $\alpha/\beta$ was either $1 \times 10^{-3}$ (Fig 2 B,C,D) or $1 \times 10^{-4}$ (Fig 2 E,F). Fig 3 shows results for different relative weightings of the content and style reconstruction loss (along the columns) and for matching the style representations only on layer ‘conv1_1’ (A), ‘conv1_1’ and ‘conv2_1’ (B), ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (C), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (D), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (E). The factor $w_l$ was always equal to one divided by the number of active layers with a non-zero loss-weight $w_l$.
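
How the weighted terms combine can be sketched as below. The layer names and the ratio $\alpha/\beta = 1 \times 10^{-3}$ follow the text; everything else is an assumption of this sketch: the function names, the dictionaries of per-layer activations, the tiny feature-map shapes, and the random toy activations that stand in for real network responses. Setting $\alpha = 1$ and $\beta = 10^{3}$ is just one way to realise the stated ratio.

```python
import numpy as np

def content_loss(F, P):
    """Squared-error loss between two feature matrices."""
    return 0.5 * np.sum((F - P) ** 2)

def style_layer_loss(F, A):
    """E_l: scaled squared distance between the Gram matrix of F and the target A."""
    N, M = F.shape
    G = F @ F.T
    return np.sum((G - A) ** 2) / (4.0 * N**2 * M**2)

def total_loss(feats_x, feats_p, grams_a, content_layer, style_layers, alpha, beta):
    """alpha * L_content + beta * L_style with equal weights on the active style layers."""
    w = 1.0 / len(style_layers)   # w_l = 1 / (number of active layers)
    l_content = content_loss(feats_x[content_layer], feats_p[content_layer])
    l_style = sum(w * style_layer_loss(feats_x[l], grams_a[l]) for l in style_layers)
    return alpha * l_content + beta * l_style

content_layer = "conv4_2"
style_layers = ["conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1"]

# Hypothetical (N_l, M_l) shapes, far smaller than the real network's.
shapes = {"conv1_1": (2, 16), "conv2_1": (3, 9), "conv3_1": (4, 4),
          "conv4_1": (5, 4), "conv4_2": (5, 4), "conv5_1": (5, 1)}

rng = np.random.default_rng(0)

def toy_features():
    """Random rectified activations standing in for real CNN responses."""
    return {name: np.maximum(rng.normal(size=s), 0.0) for name, s in shapes.items()}

feats_p, feats_a, feats_x = toy_features(), toy_features(), toy_features()
grams_a = {l: feats_a[l] @ feats_a[l].T for l in style_layers}

alpha, beta = 1.0, 1e3   # alpha / beta = 1e-3
loss = total_loss(feats_x, feats_p, grams_a, content_layer, style_layers, alpha, beta)
```

Changing `style_layers` to a shorter list, for example only `["conv1_1", "conv2_1"]`, automatically rescales the per-layer weight to one divided by the number of active layers, as described above.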

This work was funded by the German National Academic Foundation (L.A.G.), the Bernstein Center for Computational Neuroscience (FKZ 01GQ1002) and the German Excellency Initiative through the Centre for Integrative Neuroscience Tübingen (EXC307) (M.B., A.S.E., L.A.G.).

References and Notes