# What is artistic style?

This project started with the challenge posed by this Kaggle competition. Essentially, I wanted to determine the likelihood that any two pieces of art were produced by the same person.

As a starting point, I looked at the now-famous “style transfer” paper. After reading this paper, I learned that its novelty comes from breaking down a piece of art into two separate components: content and style. Let’s dig a little deeper into how content and style are defined in the context of this paper.

#### Pretrained VGG Network

In the case of either content or style, the ideas in the paper come about from running an image through a GIANT neural net. They use the 19-layer VGG neural net. This network contains 16 convolutional layers, with 5 pooling operations mixed in, followed by 3 fully connected layers. It took me a long time to fully understand what was happening to the image at each stage of the network, so I drew this picture, with the sizes of the inputs and outputs of each layer labeled:

The style transfer paper makes a few modifications to this network, namely, they do not use any of the fully connected layers, and their pooling operations are average pooling instead of max pooling. For the purposes of defining the content difference and style difference between two images, we are only interested in 5 of the convolutional layers: conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1. From now on, when we talk about the $l$th layer of the network, we mean conv$l$_1.
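To make the layer sizes concrete, here is a small sketch that walks a 224 x 224 input through the convolutional part of the network and prints the output shape of the five layers used later. It assumes stride-1 'SAME' convolutions (which keep height and width), 2 x 2 pooling with stride 2, and the standard VGG-19 channel widths; the helper name `conv_l_1_shapes` is made up for this illustration.

```python
# Channel widths of the five VGG-19 convolutional blocks (standard VGG-19 values).
BLOCK_CHANNELS = [64, 128, 256, 512, 512]

def conv_l_1_shapes(height=224, width=224):
    """Return {layer_name: (height, width, channels)} for conv1_1 ... conv5_1.

    Assumes stride-1 'SAME' convolutions (spatial size unchanged) and
    2x2 pooling with stride 2 after each block (spatial size halved, rounded up).
    """
    shapes = {}
    for block, channels in enumerate(BLOCK_CHANNELS, start=1):
        shapes['conv%d_1' % block] = (height, width, channels)
        # The pooling layer at the end of the block halves height and width.
        height = (height + 1) // 2
        width = (width + 1) // 2
    return shapes

for name, shape in conv_l_1_shapes().items():
    print(name, shape)
```

Under these assumptions conv1_1 produces 64 feature maps of 224 x 224 entries each, while conv5_1 produces 512 feature maps of only 14 x 14 entries each.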

For my project, I wanted to try out using Tensorflow, so I had to implement the VGG network in Tensorflow using the pre-trained weights and biases.

Get the weights and biases from the VGG network:

    import numpy as np
    import scipy.io
    import tensorflow as tf

    VGG_MODEL = 'imagenet-vgg-verydeep-19.mat'
    vgg = scipy.io.loadmat(VGG_MODEL)  # assumed: the .mat file is read with scipy.io.loadmat
    vgg_layers = vgg['layers']

    def weights_and_biases(layer_index):
        # Index into the nested MATLAB struct to reach the layer's parameters
        W = tf.constant(vgg_layers[0][layer_index][0][0][2][0][0])
        b = vgg_layers[0][layer_index][0][0][2][0][1]
        b = tf.constant(np.reshape(b, (b.size)))  # need to reshape b from size (n,1) to (n,)
        return W, b


Set up the VGG network in Tensorflow:

    graph = tf.Graph()

    with graph.as_default():
        tf_image = tf.placeholder(tf.float32, shape=(1, 224, 224, 3))

        output = {}
        W,b = weights_and_biases(0)
        output['conv1_1'] = tf.nn.conv2d(tf_image, W, [1,1,1,1], 'SAME') + b
        output['relu1_1'] = tf.nn.relu(output['conv1_1'])
        W,b = weights_and_biases(2)
        output['conv1_2'] = tf.nn.conv2d(output['relu1_1'], W, [1,1,1,1], 'SAME') + b
        output['relu1_2'] = tf.nn.relu(output['conv1_2'])
        output['pool1'] = tf.nn.avg_pool(output['relu1_2'], ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')
        W,b = weights_and_biases(5)
        output['conv2_1'] = tf.nn.conv2d(output['pool1'], W, [1,1,1,1], 'SAME') + b
        output['relu2_1'] = tf.nn.relu(output['conv2_1'])
        W,b = weights_and_biases(7)
        output['conv2_2'] = tf.nn.conv2d(output['relu2_1'], W, [1,1,1,1], 'SAME') + b
        output['relu2_2'] = tf.nn.relu(output['conv2_2'])
        output['pool2'] = tf.nn.avg_pool(output['relu2_2'], ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')
        W,b = weights_and_biases(10)
        output['conv3_1'] = tf.nn.conv2d(output['pool2'], W, [1,1,1,1], 'SAME') + b
        output['relu3_1'] = tf.nn.relu(output['conv3_1'])
        W,b = weights_and_biases(12)
        output['conv3_2'] = tf.nn.conv2d(output['relu3_1'], W, [1,1,1,1], 'SAME') + b
        output['relu3_2'] = tf.nn.relu(output['conv3_2'])
        W,b = weights_and_biases(14)
        output['conv3_3'] = tf.nn.conv2d(output['relu3_2'], W, [1,1,1,1], 'SAME') + b
        output['relu3_3'] = tf.nn.relu(output['conv3_3'])
        W,b = weights_and_biases(16)
        output['conv3_4'] = tf.nn.conv2d(output['relu3_3'], W, [1,1,1,1], 'SAME') + b
        output['relu3_4'] = tf.nn.relu(output['conv3_4'])
        output['pool3'] = tf.nn.avg_pool(output['relu3_4'], ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')
        W,b = weights_and_biases(19)
        output['conv4_1'] = tf.nn.conv2d(output['pool3'], W, [1,1,1,1], 'SAME') + b
        output['relu4_1'] = tf.nn.relu(output['conv4_1'])
        W,b = weights_and_biases(21)
        output['conv4_2'] = tf.nn.conv2d(output['relu4_1'], W, [1,1,1,1], 'SAME') + b
        output['relu4_2'] = tf.nn.relu(output['conv4_2'])
        W,b = weights_and_biases(23)
        output['conv4_3'] = tf.nn.conv2d(output['relu4_2'], W, [1,1,1,1], 'SAME') + b
        output['relu4_3'] = tf.nn.relu(output['conv4_3'])
        W,b = weights_and_biases(25)
        output['conv4_4'] = tf.nn.conv2d(output['relu4_3'], W, [1,1,1,1], 'SAME') + b
        output['relu4_4'] = tf.nn.relu(output['conv4_4'])
        output['pool4'] = tf.nn.avg_pool(output['relu4_4'], ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')
        W,b = weights_and_biases(28)
        output['conv5_1'] = tf.nn.conv2d(output['pool4'], W, [1,1,1,1], 'SAME') + b
        output['relu5_1'] = tf.nn.relu(output['conv5_1'])
        W,b = weights_and_biases(30)
        output['conv5_2'] = tf.nn.conv2d(output['relu5_1'], W, [1,1,1,1], 'SAME') + b
        output['relu5_2'] = tf.nn.relu(output['conv5_2'])
        W,b = weights_and_biases(32)
        output['conv5_3'] = tf.nn.conv2d(output['relu5_2'], W, [1,1,1,1], 'SAME') + b
        output['relu5_3'] = tf.nn.relu(output['conv5_3'])
        W,b = weights_and_biases(34)
        output['conv5_4'] = tf.nn.conv2d(output['relu5_3'], W, [1,1,1,1], 'SAME') + b
        output['relu5_4'] = tf.nn.relu(output['conv5_4'])
        output['pool5'] = tf.nn.avg_pool(output['relu5_4'], ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')


#### Defining Content and Style Difference

The content difference between two input images $X$ and $Y$ at layer $l$ with $N_l$ channels is defined as follows. Let $X_{i,j}^{k_l}$ be the value of the $(i, j)$th entry in the activation for $X$ in channel $k_l$ of layer $l$ (where $k_l$ ranges over the $N_l$ channels), and similarly for $Y$. Then the content difference between the two is defined to be

$\mathcal{L}_{content}(X, Y, l) = \sum_{i,j, k_l} \frac{1}{2} \left( X_{i,j}^{k_l} - Y_{i,j}^{k_l} \right)^2$

This is essentially just half the squared L2 distance between the activations of the two images in a given layer.
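As a quick illustration of the formula, here is a minimal NumPy sketch. The function name `content_difference` and the tiny arrays are made up for the example; in practice the inputs would be the activations of one layer for the two images.

```python
import numpy as np

def content_difference(act_x, act_y):
    """Half the sum of squared differences between two activation arrays."""
    act_x = np.asarray(act_x, dtype=np.float64)
    act_y = np.asarray(act_y, dtype=np.float64)
    return 0.5 * np.sum((act_x - act_y) ** 2)

# Tiny made-up activations of shape (height, width, channels) = (1, 2, 2).
x = np.array([[[1.0, 2.0], [3.0, 4.0]]])
y = np.array([[[1.0, 0.0], [3.0, 1.0]]])
print(content_difference(x, y))  # 0.5 * (0 + 4 + 0 + 9) = 6.5
```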

The style difference is a little more involved. The idea is that to capture style, we are interested in the relationship between the various channels in a given convolutional layer. So, for an image $X$ in a layer $l$, we first form its Gram matrix $G(X, l)$, which is defined to be the matrix of all inner products of the flattened output channels of that layer. If layer $l$ has $N_l$ channels, $G(X, l)$ is an $N_l \times N_l$ matrix:

$G(X, l)_{i, j} = F_i^l \cdot F_j^l$

where $F_i^l$ is the flattened $i$th channel in the activation of $X$ in layer $l$ and the operation between $F_i^l$ and $F_j^l$ is the usual dot product. The style loss between images $X$ and $Y$ in layer $l$ is then defined to be the squared L2 distance between their Gram matrices, normalized by the size of the layer:

$\mathcal{L}_{style}(X, Y, l) = \frac{1}{4N_l^2M_l^2} \sum_{i,j} \left( G(X,l)_{i,j} - G(Y,l)_{i,j} \right) ^2$

where $M_l$ is the height times width of each feature map in convolutional layer $l$. The total style loss between $X$ and $Y$ is

$\mathcal{L}_{style}(X, Y) = \sum_{l=1}^5 w_l \mathcal{L}_{style}(X, Y, l)$

where the $w_l$ are weights chosen by the programmer. I used the following Python/Tensorflow code to compute the style difference between two input images:

    def gram_matrix(F, N, M):
        # F is the output of the given convolutional layer on a particular input image
        # N is the number of feature maps (channels) in the layer
        # M is the number of entries (height * width) in each feature map
        Ft = tf.reshape(F, (M, N))
        return tf.matmul(tf.transpose(Ft), Ft)

    def loss_by_layer(a, x):
        # a and x are the activations of the same layer for the two images
        N = a.shape[3]
        M = a.shape[1] * a.shape[2]
        A = gram_matrix(a, N, M)
        G = gram_matrix(x, N, M)
        loss = (1.0 / (4 * N**2 * M**2)) * tf.reduce_sum(tf.pow(G - A, 2))
        return loss

    def style_difference(im1, im2, session):
        weights = [1.0, 1.0, 1.0, 1.0, 1.0]
        total_loss = 0
        for i in range(1, 6):
            layer = 'conv' + str(i) + '_1'
            total_loss += weights[i-1] * loss_by_layer(
                session.run(output[layer], feed_dict={tf_image: im1}),
                session.run(output[layer], feed_dict={tf_image: im2}))
        # total_loss is a tensor; it still has to be evaluated to get a number
        return total_loss
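
For checking the arithmetic by hand, the same Gram matrix and per-layer loss can be sketched in plain NumPy. This is only an illustrative cross-check, not the code used for the results in this post: the function names `gram_matrix_np` and `layer_style_loss_np` and the tiny input array are made up, and the inputs are assumed to have shape (1, height, width, channels) like the layer outputs above.

```python
import numpy as np

def gram_matrix_np(activation):
    """Gram matrix of an activation array of shape (1, height, width, channels)."""
    channels = activation.shape[-1]
    flat = activation.reshape(-1, channels)   # shape (M, N): one column per channel
    return flat.T @ flat                      # shape (N, N): all channel inner products

def layer_style_loss_np(act_x, act_y):
    """Squared Gram-matrix distance with the 1 / (4 N^2 M^2) normalization."""
    n = act_x.shape[-1]
    m = act_x.shape[1] * act_x.shape[2]
    diff = gram_matrix_np(act_x) - gram_matrix_np(act_y)
    return np.sum(diff ** 2) / (4.0 * n**2 * m**2)

# Made-up activations: a 2 x 2 feature map with 2 channels.
a = np.arange(8, dtype=np.float64).reshape(1, 2, 2, 2)
print(gram_matrix_np(a))               # [[56. 68.] [68. 84.]]
print(layer_style_loss_np(a, a))       # identical inputs give 0.0
print(layer_style_loss_np(a, 2 * a))
```

Here channel 0 flattens to (0, 2, 4, 6) and channel 1 to (1, 3, 5, 7), so for example the top-left Gram entry is 0 + 4 + 16 + 36 = 56.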


#### Style difference to determine same artist probability?

Initially, my idea to solve the Kaggle challenge was to use this metric of style difference as an indicator of whether two images were likely produced by the same artist. Naively, I thought that we could disregard content completely, because an artist should generally use the same style regardless of the subject he or she is painting. This turned out to be really, really wrong.

Here’s how I realized I was wrong: I took pairs of images, some by the same artist and some by different artists, scaled them down to size 224 x 224 as the VGG network expects, and examined the style difference between them. The following pair of images had a very large style difference (9.9587228e+10), but were actually produced by the same artist (George Stefanescu). The artwork on the left was produced in 1966, while the one on the right was done in 2001.

In fact, after examining 50 pairs of images by the same artist and 50 pairs by different artists and plotting their style differences using matplotlib, I saw no noticeable relationship between style difference and whether a pair shared an artist (red +’s are pairs with the same artist, blue o’s are pairs with different artists).

So clearly, style difference used in this way is not effective for measuring whether two images were produced by the same artist.

#### Ok, so what is “style”?

A more careful reading of the style transfer paper shows that, for its authors, style really means the texture of an image, or features “that capture its general appearance in terms of colour and localized structures.” So if an artist switches to a wildly different color scheme across several periods of their career, as in the above example, then those images would be considered to have very different styles, even though they were produced by the same person.

How exactly can we visualize what style is being captured by the Gram matrices? As explained in the paper, given a target image, we can start with a randomly generated white noise image, compute its style difference from the target image, and then minimize that style difference by performing gradient descent on the white noise image to arrive at one that is similar in style to the target image but with none of the content. This took a bit of messing around with Tensorflow for me to figure out how to set it up properly, so here’s a summary of how the process works:

• Set up the graph the same as above, but now the input image is a Variable instead of Placeholder
• Initialize the input image variable
• Assign the desired style image to the image variable
• Construct the style difference tensor: this will measure the style difference between the image input variable on future runs of the session and the desired style image, whose Gram matrices are now considered fixed
• Define an optimizer to minimize the style difference tensor
• Initialize the input image variable again to clear out the style image and assign it a randomly generated white noise image
• Run the optimizer for the desired number of iterations and watch the input image variable change
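
The steps above depend on the TensorFlow graph and the pretrained weights, so they are hard to show in a few self-contained lines. As a toy stand-in, the NumPy sketch below runs the same kind of loop on a small made-up feature matrix instead of an image: it fixes the Gram matrix of a "style" target, starts from white noise, and takes plain gradient descent steps on the style difference. Everything here (the 16 x 3 feature size, the learning rate, optimizing features directly rather than pixels through VGG, and plain gradient descent in place of a TensorFlow optimizer) is an assumption made for illustration.

```python
import numpy as np

def gram(features):
    # features has shape (M, N): M positions, N channels.
    return features.T @ features

def style_loss_and_grad(features, target_gram):
    m, n = features.shape
    diff = gram(features) - target_gram
    loss = np.sum(diff ** 2) / (4.0 * n**2 * m**2)
    # Gradient of the loss with respect to the features (diff is symmetric).
    grad = features @ diff / (n**2 * m**2)
    return loss, grad

rng = np.random.default_rng(0)
target_features = rng.normal(size=(16, 3))     # stands in for the style image
target_gram = gram(target_features)            # considered fixed from here on

features = rng.normal(size=(16, 3))            # white-noise starting point
initial_loss, _ = style_loss_and_grad(features, target_gram)

learning_rate = 2.0
for step in range(2000):
    loss, grad = style_loss_and_grad(features, target_gram)
    features -= learning_rate * grad           # plain gradient descent step

final_loss, _ = style_loss_and_grad(features, target_gram)
print(initial_loss, final_loss)                # the loss should have gone down
```

Note that the optimized features end up with a Gram matrix close to the target's without having to equal the target features themselves, which mirrors how a style reconstruction matches texture but not content.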

Below you can see the style reconstructions of the Stefanescu image on the right above. Each reconstruction uses one more convolutional layer of the network than the previous one (the first uses the first layer only, the second uses the first and second, and so on). Each image was produced with 2000 iterations.

We notice that color is a predominant feature of each style representation, and that the more layers we include, the better sense we get of the structures of localized features.

One interesting thing to note is that the optimizer doesn’t care about keeping the pixel values in the desired range of 0 to 255. So before displaying the images, we need to clip any values outside of this range.
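
A minimal sketch of that clipping step with NumPy is shown below. The helper name `to_displayable` and the sample values are made up, and converting to 8-bit integers by truncation is just one reasonable choice for display.

```python
import numpy as np

def to_displayable(image):
    """Clip pixel values to [0, 255] and convert to 8-bit integers."""
    return np.clip(image, 0, 255).astype(np.uint8)

# Made-up pixel values, some of them outside the valid range.
raw = np.array([[-12.5, 0.0, 100.7], [255.0, 300.2, 254.9]])
print(to_displayable(raw))  # [[  0   0 100] [255 255 254]]
```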

#### Summary

From this experiment, I now know that a computer’s interpretation of style, at least as measured by these Gram matrices, relies heavily on the color and texture of an image, which is not necessarily a good indicator of what we think of as a particular artist’s style. Hopefully I can continue thinking of more effective ways to solve the Kaggle problem, but in the meantime, I learned a lot about convolutional neural networks, Tensorflow, and artistic style.