The neural network creates paintings in the style of famous artists. Image stylization with neural networks: no mysticism, just math

20.06.2019

In the most ordinary photographs, numerous and not always distinguishable entities appear - most often, for some reason, dogs. Such images began to flood the Internet in June 2015, when Google launched DeepDream, one of the first open services based on neural networks and designed for image processing.

It works approximately like this: the algorithm analyzes a photograph, finds fragments in it that remind it of familiar objects, and distorts the image in accordance with that data.

At first the project was released as open source, and later online services built on the same principles appeared on the Internet. One of the most convenient and popular is Deep Dream Generator: processing a small photo there takes only about 15 seconds (previously, users had to wait more than an hour).

How do neural networks learn to create such images? And why, by the way, are they called that?

Neural networks imitate in their design the real neural networks of a living organism, but they do so with the help of mathematical algorithms. Having created a basic structure, you can train it using machine learning methods. If we are talking about pattern recognition, then thousands of images need to be passed through the neural network. If the neural network's task is different, then the training examples will be different too.

Algorithms for playing chess, for example, analyze chess games. AlphaGo, the algorithm from Google's DeepMind, took the same path with the Chinese game of Go - which was hailed as a breakthrough, since Go is much more complex and non-linear than chess.

    You can play around with a simplified neural network model and better understand its principles.

    YouTube also has a series of accessible hand-drawn videos about how neural networks work.

Another popular service is Dreamscope, which can not only dream about dogs, but also imitate various painting styles. Image processing here is also very simple and fast (about 30 seconds).

Apparently, the algorithmic part of the service is a modification of the Neural style program, which we have already discussed.

More recently, a program has appeared that realistically colorizes black-and-white images. Earlier versions of similar programs did the job much less well, and it was considered a great achievement if at least 20% of people could not tell the difference between a real picture and a computer-colored one.

Moreover, colorization here takes only about 1 minute.

The same development company also launched a service that recognizes different types of objects in pictures.

These services may seem like just fun entertainment, but in fact, everything is much more interesting. New technologies enter the practice of human artists and change our understanding of art. Perhaps soon people will have to compete with machines in the field of creativity.

Teaching pattern recognition algorithms is a task that AI developers have been struggling with for a long time. Therefore, programs that colorize old photographs and draw dogs in the sky can be considered part of a larger and more intriguing process.

Greetings, Habr! Surely you have noticed that the topic of restyling photos into various artistic styles is being actively discussed on these Internets of yours. Reading all those popular articles, you might think that magic is going on under the hood of these applications and that the neural network really fantasizes and redraws the image from scratch. It just so happened that our team faced a similar task: as part of an internal corporate hackathon we built video stylization, because an app for photos already existed. In this post, we'll figure out how the network "redraws" images and look at the articles that made this possible. I recommend reading the previous post before this material and, in general, getting familiar with the basics of convolutional neural networks. You will find some formulas, some code (I will give examples on Theano and Lasagne), as well as a lot of pictures. This post is arranged in the chronological order in which the articles appeared and, accordingly, the ideas themselves. Sometimes I will dilute it with our recent experience. Here's a boy from hell to grab your attention.


Visualizing and Understanding Convolutional Networks (28 Nov 2013)

First of all, it is worth mentioning the article in which the authors were able to show that a neural network is not a black box, but quite an interpretable thing (by the way, today this can be said not only about convolutional networks for computer vision). The authors decided to learn how to interpret the activations of hidden-layer neurons; for this they used the deconvolutional network (deconvnet) proposed several years earlier (incidentally, by the same Zeiler and Fergus who authored this publication as well). A deconvolutional network is essentially the same network with convolutions and poolings applied in reverse order. The original work on deconvnet used the network in an unsupervised learning mode to generate images. This time, the authors used it simply for a reverse pass from the features obtained after a forward pass through the network back to the original image. The result is an image that can be interpreted as the signal that caused a given activation in the neurons. Naturally, the question arises: how do we make a reverse pass through a convolution and a nonlinearity? And all the more so through max-pooling, which is certainly not an invertible operation. Let's look at all three components.

Reverse ReLu

In convolutional networks, the ReLU activation function, ReLU(x) = max(0, x), is often used; it makes all activations on a layer non-negative. Accordingly, when passing back through the nonlinearity, it is also necessary to obtain non-negative results. For this, the authors propose using the same ReLU. From the Theano architecture point of view, it is necessary to override the gradient function of the operation (the infinitely valuable notebook is in the Lasagne Recipes; from there you can glean the details of what the ModifiedBackprop class is).

class ZeilerBackprop(ModifiedBackprop):
    def grad(self, inputs, out_grads):
        (inp,) = inputs
        (grd,) = out_grads
        # return (grd * (grd > 0).astype(inp.dtype),)  # explicitly rectify
        return (self.nonlinearity(grd),)               # use the given nonlinearity
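To actually plug such a class into an existing network, one can do something along the lines of that same notebook: find every layer that uses the rectify nonlinearity and swap it for the modified one. A minimal sketch, assuming `net` is the usual Lasagne dictionary of layers with the output layer stored under "prob" (both are assumptions on my part, not something fixed by the article):

import lasagne

relu = lasagne.nonlinearities.rectify
relu_layers = [layer for layer in lasagne.layers.get_all_layers(net["prob"])
               if getattr(layer, "nonlinearity", None) is relu]
modded_relu = ZeilerBackprop(relu)        # the class defined above
for layer in relu_layers:
    layer.nonlinearity = modded_relu      # every ReLU now uses the overridden gradient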

Reverse Convolution

Here it is a little more complicated, but everything is logical: it is enough to apply the transposed version of the same convolution kernel, but to the outputs of the reverse ReLU instead of the previous layer's outputs used in the forward pass. I'm afraid that in words this is not so obvious, so let's look at a visualization of this procedure (you will find even more visualizations of convolutions elsewhere).


Convolution when stride=1

Convolution when stride=1 reverse version

Convolution when stride=2

Convolution when stride=2 reverse version
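If the pictures are not enough, here is a tiny numerical sketch of the same idea (my own illustration, not code from the article): for a stride-1 "valid" cross-correlation, which is the forward operation of a conv layer, the reverse pass is a "full" convolution with the same kernel, which restores the spatial size of the input.

import numpy as np
from scipy.signal import correlate2d, convolve2d

x = np.random.rand(8, 8).astype("float32")   # input feature map
k = np.random.rand(3, 3).astype("float32")   # convolution kernel
y = correlate2d(x, k, mode="valid")          # forward pass: 8x8 -> 6x6
x_back = convolve2d(y, k, mode="full")       # "reverse" pass: 6x6 -> 8x8
assert x_back.shape == x.shape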

Reverse Pooling

This operation (unlike the previous ones) is generally not invertible. But we would still like to pass through the maximum somehow during the reverse pass. To do this, the authors suggest using a map of where the maximum was during the forward pass (max location switches). During the reverse pass, the input signal is unpooled in such a way as to approximately preserve the structure of the original signal; here it is really easier to see than to describe.
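A rough NumPy illustration of those switches (again my own sketch, not the authors' code): during the forward pass we remember where each maximum was, and during the reverse pass we put the pooled values back exactly there, filling the rest with zeros.

import numpy as np

def max_pool_with_switches(x, size=2):
    """Max pooling over size x size blocks that also records the positions of the maxima."""
    h, w = x.shape
    pooled = np.zeros((h // size, w // size), dtype=x.dtype)
    switches = np.zeros_like(x, dtype=bool)
    for i in range(0, h, size):
        for j in range(0, w, size):
            patch = x[i:i + size, j:j + size]
            k = np.unravel_index(np.argmax(patch), patch.shape)
            pooled[i // size, j // size] = patch[k]
            switches[i + k[0], j + k[1]] = True
    return pooled, switches

def unpool_with_switches(pooled, switches, size=2):
    """Put each pooled value back where its maximum came from; zeros elsewhere."""
    upsampled = np.repeat(np.repeat(pooled, size, axis=0), size, axis=1)
    return np.where(switches, upsampled, 0).astype(pooled.dtype)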



Result

The visualization algorithm is extremely simple:

  1. Make a forward pass.
  2. Select the layer we are interested in.
  3. Fix the activation of one or more neurons and zero out the rest.
  4. Make a reverse pass back to the input image (a rough sketch follows this list).
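In pseudo-code form (the callables `forward_to` and `backward_from` are hypothetical placeholders just to make the steps concrete; they are not an API from the article or from Lasagne):

import numpy as np

def visualize_neuron(forward_to, backward_from, image, layer, neuron_index):
    """forward_to(image, layer) -> activations; backward_from(acts, layer) -> image-space signal."""
    acts = forward_to(image, layer)           # steps 1-2: forward pass up to the chosen layer
    mask = np.zeros_like(acts)                # step 3: keep one neuron's activation, zero the rest
    mask[neuron_index] = 1.0
    return backward_from(acts * mask, layer)  # step 4: reverse pass through the deconvnet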

Each gray square in the image below corresponds to a visualization of a filter (the one used for convolution) or of the weights of one neuron, and each colored image is the part of the original image that activates the corresponding neuron. For clarity, neurons within one layer are grouped into thematic groups. In general, it suddenly turned out that the neural network learns exactly what Hubel and Wiesel wrote about in their work on the structure of the visual system, for which they were awarded the Nobel Prize in 1981. Thanks to this article, we got a visual representation of what a convolutional neural network learns at each layer. It is this knowledge that would later make it possible to manipulate the contents of the generated image, but that was still far off; the next few years went into improving the methods of "trepanation" of neural networks. In addition, the authors of the article proposed a way to analyze how best to build the architecture of a convolutional neural network to achieve better results (they did not win ImageNet 2013, but made it into the top; UPD: it turns out they did win, Clarifai is what they are).


Feature visualization


Here is an example of activation visualization using deconvnet; today this result looks rather so-so, but back then it was a breakthrough.


Saliency Maps using deconvnet

Deep Inside Convolutional Networks: Visualizing Image Classification Models and Saliency Maps (19 Apr 2014)

This article is devoted to methods of visualizing the knowledge contained in a convolutional neural network. The authors propose two visualization methods based on gradient descent.

Class Model Visualization

So, imagine that we have a neural network trained to solve a classification problem into a certain number of classes. Denote by S_c(I) the activation value of the output neuron that corresponds to class c for an input image I. Then the following optimization problem gives us exactly the image that maximizes the selected class:

    I* = arg max_I ( S_c(I) - λ ||I||² )

This problem is easy to solve using Theano. Usually we ask the framework to take the derivative with respect to the model parameters, but this time we assume that the parameters are fixed and take the derivative with respect to the input image. The following function selects the maximum value of the output layer and returns a function that calculates the derivative with respect to the input image.


def compile_saliency_function(net):
    """
    Compiles a function to compute the saliency maps and predicted classes
    for a given minibatch of input images.
    """
    inp = net["input"].input_var
    outp = lasagne.layers.get_output(net["fc8"], deterministic=True)
    max_outp = T.max(outp, axis=1)
    saliency = theano.grad(max_outp.sum(), wrt=inp)
    max_class = T.argmax(outp, axis=1)
    return theano.function([inp], [saliency, max_class])

You have probably seen strange images with dogs on the Internet - DeepDream. In the original article, the authors use the following process to generate images that maximize the selected class (a minimal sketch of this loop follows the list):

  1. Initialize the initial image with zeros.
  2. Calculate the value of the derivative with respect to this image.
  3. Update the image by adding the image obtained from the derivative to it.
  4. Return to step 2 or exit the loop.
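A minimal sketch of that loop, assuming `saliency_fn` is a compiled function like `compile_saliency_function` above that returns the gradient with respect to the input and the predicted class; the step size and number of iterations here are arbitrary examples:

import numpy as np

def ascend_class_image(saliency_fn, shape=(1, 3, 224, 224), n_steps=100, step=1.0):
    img = np.zeros(shape, dtype="float32")    # 1. initialize with zeros
    for _ in range(n_steps):
        grad, _ = saliency_fn(img)            # 2. derivative with respect to the image
        img += step * grad                    # 3. add it to the image
    return img                                # 4. exit the loop after n_steps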

The resulting images are:




But what if we initialize the first image with a real photo and start the same process? And if at each iteration we choose a random class, zero out the rest and calculate the value of the derivative, then we get exactly this kind of deep dream.


Caution: 60 MB


Why are there so many dog faces and eyes? It's simple: out of the 1000 ImageNet classes, almost 200 are dogs, and dogs have eyes. There are also a lot of classes that simply contain people.

Class Saliency Extraction

If we initialize this process with a real photo, stop after the first iteration and draw the value of the derivative, then we get an image which, when added to the original, increases the activation value of the selected class.


Saliency Maps Using Derivative


Again, the result is "so-so". It is important to note that this is a new way of visualizing activations (nothing prevents us from fixing activation values not on the last layer but on any layer of the network and taking the derivative with respect to the input image). The next article will combine both previous approaches and give us the tool needed to set up style transfer, which will be described later.

Striving for Simplicity: The All Convolutional Net (13 Apr 2015)

This article is generally not about visualization, but about the fact that replacing pooling with convolution with a large stride does not lead to a loss of quality. But as a by-product of their research, the authors proposed a new way of visualizing features, which they applied to analyze more precisely what the model learns. Their idea is as follows: if we simply take the derivative, then the features that were negative in the input image do not pass back through deconvolution (because of the ReLU applied to the input image), and this leads to negative values appearing in the backpropagated image. On the other hand, if you use deconvnet, then another ReLU is applied to the derivative of ReLU - this does not let negative values pass back, but, as you saw, the result is "so-so". But what if we combine these two methods?




class GuidedBackprop(ModifiedBackprop):
    def grad(self, inputs, out_grads):
        (inp,) = inputs
        (grd,) = out_grads
        dtype = inp.dtype
        return (grd * (inp > 0).astype(dtype) * (grd > 0).astype(dtype),)

Then you get a completely clean and interpretable image.


Saliency Maps using Guided Backpropagation

Go deeper

Now let's think about what this gives us. Let me remind you that each convolutional layer is a function that receives a three-dimensional tensor as input and also produces a three-dimensional tensor as output, possibly of a different dimension d x w x h; the depth d is the number of neurons in the layer, and each of them generates a feature map of size w (width) x h (height).


Let's try the following experiment on the VGG-19 network:



conv1_2

Yes, you see almost nothing, because the receptive field is very small: this is the second 3x3 convolution, so the total receptive field is 5x5 (two stacked 3x3 convolutions with stride 1: 1 + 2 + 2 = 5). But zooming in, we see that the feature is just a gradient detector.




conv3_3


conv4_3


conv5_3


pool5


And now imagine that instead of the maximum over a feature map, we take the derivative of the sum of all elements of the feature map with respect to the input image. Then obviously the receptive field of this group of neurons covers the entire input image. For the early layers we will see bright maps, from which we conclude that these are color detectors, then gradients, then edges, and so on toward more and more complex patterns. The deeper the layer, the dimmer the resulting image. This is explained by the fact that deeper layers detect more complex patterns, and a complex pattern occurs less often than a simple one, so the activation map fades. The first way is suitable for understanding layers with complex patterns, and the second is just right for simple ones.
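A hedged sketch of this second variant in Theano/Lasagne, reusing the same `net` dictionary as in the earlier snippets (the layer name and the channel index are arbitrary examples):

import theano
import lasagne

inp = net["input"].input_var
feat = lasagne.layers.get_output(net["conv1_1"], deterministic=True)
channel_sum = feat[:, 0].sum()   # sum of all elements of one feature map
sum_saliency_fn = theano.function([inp], theano.grad(channel_sum, wrt=inp))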


conv1_1


conv2_2


conv4_3


You can download a more complete collection of activation visualizations for several images.

A Neural Algorithm of Artistic Style (2 Sep 2015)

So, a couple of years have passed since the first successful "trepanation" of a neural network. We (in the sense of humanity) have a powerful tool in our hands that allows us to understand what a neural network learns, as well as to remove what we would not really like it to learn. The authors of this article develop a method that makes one image produce activation maps similar to those of some target image, and possibly even more than one target; this is the basis of stylization. We feed white noise to the input and, in an iterative process similar to the one in deep dream, bring this image to one whose feature maps are similar to those of the target image.

content loss

As already mentioned, each layer of the neural network produces a three-dimensional tensor of some dimension.




Let's denote the output of the l-th layer for the input image x as F^l, and for the content image c that we aspire to as P^l. Then, if we minimize the weighted (over layers) sum of squared residuals between them, where the contribution of a single layer l is

    L_content(c, x, l) = 1/2 * Σ_{i,j} (F^l_{ij} - P^l_{ij})²,

you get exactly what you need. Maybe.

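As a sketch (close to what the accompanying notebook does, but hedged: `P` and `X` are assumed to be dictionaries mapping layer names to symbolic layer outputs for the content and generated images), the content loss for one layer can be written like this:

def content_loss(P, X, layer):
    p = P[layer]   # features of the content image
    x = X[layer]   # features of the generated image
    return 1. / 2 * ((x - p) ** 2).sum()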
For experimenting with this article, you can use this magical notebook, where the calculations can be run both on the GPU and on the CPU. The GPU is used to calculate the features of the neural network and the value of the cost function. Theano produces a function eval_grad that calculates the gradient of the objective function with respect to the input image x. This is then fed into L-BFGS and the iterative process starts.


# Initialize with a noise image
generated_image.set_value(floatX(np.random.uniform(-128, 128, (1, 3, IMAGE_W, IMAGE_W))))
x0 = generated_image.get_value().astype("float64")
xs = []
xs.append(x0)

# Optimize, saving the result periodically
for i in range(8):
    print(i)
    scipy.optimize.fmin_l_bfgs_b(eval_loss, x0.flatten(), fprime=eval_grad, maxfun=40)
    x0 = generated_image.get_value().astype("float64")
    xs.append(x0)

If we run the optimization of such a function, then we will quickly get an image similar to the target one. Now we can recreate images from white noise that look like some content image.


Content Loss: conv4_2



Optimization process




It is easy to notice two features of the resulting image:

  • the colors are lost - this is the result of the fact that in this particular example only the conv4_2 layer was used (in other words, its weight w was non-zero, while the weights of the other layers were zero); as you remember, it is the early layers that contain information about colors and gradient transitions, while the later ones contain information about larger details, which is exactly what we observe: the colors are lost, but the content is not;
  • some houses "drifted", i.e. straight lines are slightly curved - this is because the deeper the layer, the less information about the spatial position of a feature it contains (a result of applying convolutions and poolings).

Adding early layers immediately corrects the situation with colors.


Content Loss: conv1_1, conv2_1, conv4_2


Hopefully by now you've got the feeling that you have control over what gets redrawn onto the white noise image.

style loss

And now we got to the most interesting: how can we convey the style? What is style? Obviously, the style is not what we optimized in Content Loss, because it contains a lot of information about the spatial positions of the features. So the first thing to do is somehow remove this information from the views received on each layer.


The author proposes the following method. Let's take the tensor at the output of some layer, unroll it along the spatial coordinates and calculate the covariance matrix between the feature maps. Let's denote this transformation as G. What have we really done? It can be said that we counted how often features within a feature map co-occur in pairs, or, in other words, we approximated the distribution of features in the feature maps with a multivariate normal distribution.




Then the Style Loss is introduced as follows, where s is some image with style:



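A hedged Theano sketch of the transformation G and of the style loss, close in spirit to the Lasagne recipe (again, `A` and `X` are assumed dictionaries of symbolic layer outputs for the style and generated images):

import theano.tensor as T

def gram_matrix(x):
    # unroll the spatial dimensions and compute the feature-map "covariance"
    x = x.flatten(ndim=3)
    return T.tensordot(x, x, axes=([2], [2]))

def style_loss(A, X, layer):
    a, x = A[layer], X[layer]
    G_a, G_x = gram_matrix(a), gram_matrix(x)
    N = a.shape[1]                 # number of feature maps
    M = a.shape[2] * a.shape[3]    # size of one feature map
    return 1. / (4 * N ** 2 * M ** 2) * ((G_x - G_a) ** 2).sum()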
Shall we try for Vincent? In principle, we get something expected - noise in the style of Van Gogh, information about the spatial arrangement of features is completely lost.


Vincent




What if we put a photo instead of a style image? You get already familiar features, familiar colors, but the spatial position is completely lost.


Photo with style loss


Surely you wondered why we compute the covariance matrix and not something else. After all, there are many ways to aggregate features so that the spatial coordinates are lost. This is really an open question, and if you take something very simple, the result will not change dramatically. Let's check this: instead of the covariance matrix, we will simply compute the average value of each feature map.
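For example, a "simple style loss" that matches only per-feature-map means could look like this (same assumed dictionaries of layer outputs as above; a sketch of the experiment described here, not the article's exact code):

def simple_style_loss(A, X, layer):
    a_mean = A[layer].mean(axis=[2, 3])   # mean of each feature map of the style image
    x_mean = X[layer].mean(axis=[2, 3])   # mean of each feature map of the generated image
    return ((x_mean - a_mean) ** 2).sum()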




simple style loss

Combined loss

Naturally, there is a desire to mix these two cost functions. Then we will generate from white noise an image that retains features of the content image (which are bound to spatial coordinates) and also contains "style" features that are not tied to spatial coordinates; i.e., we will hopefully keep the details of the content image intact in their places, but redrawn with the right style.



In fact, there is also a regularizer, but we will omit it for simplicity. It remains to answer the following question: which layers (and weights) should be used in the optimization? I'm afraid I don't have an answer to this question, and neither do the authors of the article. They suggest using the following, but this does not mean at all that another combination will work worse - the search space is too large. The only rule that follows from an understanding of the model is that it makes no sense to take neighboring layers, because their features will not differ much from each other; therefore, one layer from each conv*_1 group is added to the style.


# Define loss function
losses = []

# content loss
losses.append(0.001 * content_loss(photo_features, gen_features, "conv4_2"))

# style loss
losses.append(0.2e6 * style_loss(art_features, gen_features, "conv1_1"))
losses.append(0.2e6 * style_loss(art_features, gen_features, "conv2_1"))
losses.append(0.2e6 * style_loss(art_features, gen_features, "conv3_1"))
losses.append(0.2e6 * style_loss(art_features, gen_features, "conv4_1"))
losses.append(0.2e6 * style_loss(art_features, gen_features, "conv5_1"))

# total variation penalty
losses.append(0.1e-7 * total_variation_loss(generated_image))

total_loss = sum(losses)

The final model can be presented in the following form.




And here is the result of the houses with Van Gogh.



Attempt to control the process

Let's recall the previous parts: as early as two years before the current article, other scientists had been exploring what a neural network really learns. Armed with all these articles, you can generate visualizations of features of different styles, different images, different resolutions and sizes, and try to understand which layers to take with which weights. But even re-weighting the layers does not give full control over what is happening. The problem here is more conceptual: we are optimizing the wrong function! How so, you ask? The answer is simple: this function minimizes the residual... well, you get the idea. But what we really want is to like the image. The convex combination of content and style loss functions is not a measure of what our mind considers beautiful. It has been observed that if you continue styling for too long, the cost function naturally falls lower and lower, but the aesthetic beauty of the result drops sharply.




Okay, there is one more problem. Let's say we found a layer that extracts the features we need - say, some triangular textures. But this layer still contains many other features, such as circles, which we really do not want to see in the resulting image. Generally speaking, if we could hire a million Chinese people, we could visualize all the features of a style image and, by exhaustive search, just mark the ones we need and include only them in the cost function. But for obvious reasons, it's not that easy. What if, instead, we simply remove from the style image all the circles we don't want to appear in the result? Then the corresponding neurons that respond to circles simply will not fire, and, of course, nothing of the kind will appear in the resulting picture. It's the same with colors. Imagine a bright image with lots of colors. The distribution of colors will be very smeared across the space, and the distribution of the resulting image will be the same, but during the optimization process the peaks that were in the original will probably be lost. It turns out that a simple decrease in the bit depth of the color palette solves this problem: the distribution density of most colors will be near zero, with large peaks in a few areas. By manipulating the original in Photoshop, we are thus manipulating the features that are extracted from the image. It is easier for a person to express their desires visually than to try to formulate them in the language of mathematics. For now. As a result, designers and managers, armed with Photoshop and scripts for visualizing features, achieved a result three times faster than what the mathematicians and programmers did.
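A hedged illustration of that bit-depth trick using Pillow and NumPy (the file name and the number of bits kept are just examples, not anything from the original pipeline):

import numpy as np
from PIL import Image

bits = 3                                          # keep 3 of 8 bits per channel
img = np.asarray(Image.open("style.jpg"), dtype=np.uint8)
posterized = (img >> (8 - bits)) << (8 - bits)    # zero out the low-order bits
Image.fromarray(posterized).save("style_posterized.jpg")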


An example of manipulating the color and size of features


And you can immediately take a simple image as a style



results








And here is a short video, but only with the right texture

Texture Networks: Feed-forward Synthesis of Textures and Stylized Images (10 Mar 2016)

It would seem we could stop here, if not for one nuance. The above stylization algorithm takes a very long time. If we take an implementation where lbfgs is run on the CPU, the process takes about five minutes. If you rewrite it so that the optimization runs on the GPU, the process takes 10-15 seconds. That's no good. Perhaps the authors of this and the next article thought about the same thing. Both publications came out independently, 17 days apart, almost a year after the previous article. The authors of the current article, like the authors of the previous one, were engaged in texture generation (if you drop the content term and keep only the Style Loss, that is approximately what you get). They suggested optimizing not an image obtained from white noise, but a neural network that generates a stylized image.




Now, since the stylization process no longer includes any optimization, only a forward pass needs to be done, and optimization is required only once, to train the generator network. This article uses a hierarchical generator, where each subsequent z is larger than the previous one and is sampled from noise in the case of texture generation, and from some image database for training the stylizer. It is critical to use something other than the training part of ImageNet, because the features inside the Loss network are computed by a network trained precisely on that training part.



Perceptual Losses for Real-Time Style Transfer and Super-Resolution (27 Mar 2016)

As the name implies, the authors, who were only 17 days late with the idea of a generating network, were busy increasing the resolution of images. They seem to have been inspired by the success of residual learning on the latest ImageNet.




Accordingly, the residual block and the conv block.



Thus, now in addition to styling control, we also have a fast generator in our hands (thanks to these two articles, the generation time for one image is measured in tens of ms).

The ending

We used the information from the reviewed articles and the authors' code as a starting point for creating another stylization application - the first application for stylizing video:



It generates something like this.


Ever since German researchers from the University of Tübingen presented their idea of transferring the style of famous artists to other photographs in August 2015, services that monetize this opportunity have begun to appear. Instapainting launched on the Western market, and on the Russian market its complete copy, Ostagram, appeared.


Despite the fact that Ostagram launched back in December, it began to quickly gain popularity on social networks only in mid-April. At the same time, as of April 19, the project's VKontakte group had fewer than a thousand members.

To use the service, you need to prepare two images: a photo that needs to be processed, and a picture with an example of a style to overlay on the original picture.

The service has a free version: it creates an image in a minimum resolution of up to 600 pixels along the longest side of the image. The user receives the result of only one of the iterations of applying the filter to the photo.

There are two paid versions: Premium produces an image up to 700 pixels along the longest side and applies 600 iterations of neural network processing to the image (the more iterations, the more interesting and intensive the processing). One such picture will cost 50 rubles.

In the HD version, you can adjust the number of iterations: 100 will cost 50 rubles, and 1000 - 250 rubles. In this case, the image will have a resolution of up to 1200 pixels on the longest side, and it can be used for printing on canvas: Ostagram offers this service with delivery from 1800 rubles.

In February, Ostagram representatives announced that they would not accept image-processing requests from users "from countries with developed capitalism", but later opened access to photo processing to VKontakte users from all over the world. Judging by the Ostagram code published on GitHub, it was developed by Sergey Morugin, a 30-year-old resident of Nizhny Novgorod.

TJ contacted the commercial director of the project, who introduced himself as Andrey. According to him, Ostagram appeared before Instapainting, but was inspired by a similar project called Vipart.

Ostagram was developed by a group of students from the Alekseev Nizhny Novgorod State Technical University (NNSTU): after initial testing on a narrow circle of friends at the end of 2015, they decided to make the project public. Initially, image processing was completely free, and the plan was to earn money by selling printed paintings. According to Andrey, printing turned out to be the biggest problem: photos of people processed by a neural network rarely look pleasing to the human eye, and the end client often has to tweak the result for a long time before applying it to canvas, which requires a lot of machine resources.

For image processing, the creators of Ostagram wanted to use Amazon cloud servers, but after the influx of users, it became clear that the cost of them would exceed a thousand dollars a day with a minimal return on investment. Andrey, who is also an investor in the project, rented server facilities in Nizhny Novgorod.

The project's audience is about a thousand people a day, but on some days it reached 40 thousand thanks to referrals from foreign media, which noticed the project before domestic ones did (Ostagram even managed to collaborate with European DJs). At night, when traffic is low, image processing can take about 5 minutes, while during the day it can take up to an hour.

While earlier foreign users' access to image processing was deliberately limited (the plan was to start monetization in Russia), Ostagram is now counting more on a Western audience.

To date, the prospects for payback are conditional. If each user paid 10 rubles for processing, then perhaps it would pay off. […]

It is very difficult to monetize in our country: our people are ready to wait a week, but they will not pay a penny for it. Europeans are more favorable to this - in terms of paying for speeding up, improving quality - so the orientation goes to that market.

Andrey, Ostagram representative

According to Andrey, the Ostagram team is working on a new version of the site with a greater focus on social features: "It will look like one well-known service, but what can you do." Representatives of Facebook in Russia have already shown interest in the project, but things have not yet come to negotiations about a sale.

Service work examples

In the feed on the Ostagram website, you can also see what combination of images resulted in the final shots: often this is even more interesting than the result itself. At the same time, filters - pictures used as an effect for processing - can be saved for further use.


