This is the fourth and last week of the fourth course of DeepLearning.AI’s Deep Learning Specialization offered on Coursera. This week we go over two special applications of CNNs in computer vision: face recognition and neural style transfer. It introduces important new concepts that will be useful even beyond the context of CNNs.

This week’s topics are as follows:

  • Face Recognition
  • Neural Style Transfer


Face Recognition

What is Face Recognition?

Let’s start by going over the important distinction between face verification and face recognition.

In the case of face verification, we get a pair of things: an input image and a name or ID. The output of such a system is a binary choice: either the image corresponds to the name or ID, or it does not. On the other hand, in the case of face recognition, we only get a single thing: an image. The output of such a system is whether the input image matches any of the $K$ identities already stored in our system’s database.

This means that face verification is a $1 \to 1$ procedure, while face recognition is a $1 \to K$ procedure. The latter is about whether we find a match in the database or not, which usually means doing $K$ comparisons in the worst case.

So far it doesn’t sound too bad, but here’s the real issue. Imagine that you’re implementing a face recognition system at your company’s building. How many images can you get of each person? Ideally you’d like to get as many as possible, but that’s unfeasible; even worse, in practice, we usually only have access to one or at most two pictures of each person. Having only one training example for each “class” is what defines one-shot learning; having a few is usually called few-shot learning.

One Shot Learning

So imagine that we go around our company and finally get a face picture of everyone who should be allowed into the building. We want to train a model so that when one of the allowed people walks in the door, the model recognizes them from that single picture. Obviously, we also want to keep unwanted people out!

The most immediate idea is to train a classifier with a CNN architecture. If we have $K$ employees, then the output layer of our CNN will be a softmax with $K + 1$ units, the extra one for when the input doesn’t match anyone. The issue is that this CNN will have terrible performance because of the size of our training data; remember, we only have $K$ pictures. Even worse, what happens when we hire someone else? We would have to retrain the whole network every time. There has to be a better way.

The better way is to move away from classification and instead learn a similarity function. A similarity function is a function that takes in a pair of elements and outputs their similarity. You might have heard of Jaccard similarity or cosine similarity. In this case, we want to learn a function that takes a pair of images:

$$\text{d}(\text{img}_1, \text{img}_2) = \text{degree of difference between the images}$$

Then we can set some threshold $\tau$ and binarize our similarity function:

$$\text{Verification}(\text{img}_1, \text{img}_2) = \begin{cases} \text{Same} & \text{if } \text{d}(\text{img}_1, \text{img}_2) \leq \tau \\ \text{Different} & \text{otherwise} \end{cases}$$

Hopefully, when our colleague Alice walks in the door of the building, the picture taken by the device at the entrance will have the lowest difference when compared to Alice’s picture in our database, amongst all employees. Also, someone who is not our colleague should not have a difference smaller than $\tau$ with anybody in our employee set. We will see how these kinks are ironed out when we get to triplet loss. Let’s first see how we can learn such a function $\text{d}(\text{img}_1, \text{img}_2)$.
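To make the two setups concrete, here is a minimal sketch of how verification ($1 \to 1$) and recognition ($1 \to K$) could use such a distance function. Everything here is a hypothetical placeholder: `d` is the function we will learn in the next sections, `database` maps each identity to its single stored picture, and the threshold value is arbitrary.

```python
def verify(input_image, claimed_id, database, d, tau=0.7):
    """Face verification (1-to-1): does input_image match the claimed identity?"""
    reference_image = database[claimed_id]         # the single stored picture
    return d(input_image, reference_image) <= tau  # small difference => same person


def recognize(input_image, database, d, tau=0.7):
    """Face recognition (1-to-K): which identity, if any, does input_image match?"""
    best_id, best_dist = None, float("inf")
    for identity, reference_image in database.items():  # K comparisons in the worst case
        dist = d(input_image, reference_image)
        if dist < best_dist:
            best_id, best_dist = identity, dist
    return best_id if best_dist <= tau else None
```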

Siamese Network

The idea behind Siamese networks is pretty straightforward: we run a pair of images $(\text{img}_1, \text{img}_2)$ through the same CNN with a fully connected layer at the end, and we then use these outputs to compare the images. Let’s dig deeper into what this means.

Remember that CNNs usually reduce the spatial dimensions of the input volume while increasing the channel dimension. Imagine that we have a CNN that takes an image of a face with dimensions $100 \times 100 \times 3$ and, after reducing this volume through some number of convolutional and max pooling layers, produces a $128$ dimensional vector. This is the same as a vanilla CNN, except that we stop right before the fully connected layer would be fed into a softmax layer for classification.

Let’s call our input image $x^{(1)}$. Let’s call the result of running $x^{(1)}$ through our CNN, transforming the $100 \times 100 \times 3$ input into a $128$ element vector, $f(x^{(1)})$, so that $f(x^{(1)}) \in \mathbb{R}^{128}$. This $128$ dimensional vector is an encoding of the input; it is simply some particular representation of the original image containing a face.

If we let $x^{(1)}$ be the image of our employee in the database, and $x^{(2)}$ be the image of the face of the person who just walked in, you might imagine what we want to do. We’ll run $x^{(2)}$ through our CNN and encode it in the same way, so that we get $f(x^{(2)})$. Notice that $f(x^{(2)}) \in \mathbb{R}^{128}$ as well, and that the encoding was generated with the same CNN as $f(x^{(1)})$.

Now we can redefine our similarity function $\text{d}(\text{img}_1, \text{img}_2)$ in terms of these numerical encodings, so that our new similarity function is:

$$d(x^{(1)}, x^{(2)}) = \|f(x^{(1)}) - f(x^{(2)})\|_2^2$$

This is our old friend, the squared L2-norm of the difference between the two encodings.
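As a rough sketch in NumPy, assuming `encoder` stands in for the shared CNN $f(\cdot)$ that maps a $100 \times 100 \times 3$ image to a $128$ dimensional vector (in reality a trained network, here just a hypothetical callable):

```python
import numpy as np

def d(img1, img2, encoder):
    """Squared L2 distance between the encodings of two face images."""
    f1 = encoder(img1)  # f(x^(1)), a 128-d vector
    f2 = encoder(img2)  # f(x^(2)), produced by the *same* CNN weights
    return float(np.sum((f1 - f2) ** 2))
```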

You might be thinking, why not compare the two images with the L2-norm directly, pixel by pixel? Think about what happens when you compare two images of the same person but with different lighting, hairstyle, makeup, etc.: the raw pixel distance would be large even though it’s the same person, so it won’t work.

But how do we train such a network? Remember, we want a CNN that takes as input an image and generates a $128$ dimensional encoding $f(x^{(i)})$. But of course, not just any encoding! We want that $128$ dimensional encoding to have certain properties:

  • If $x^{(i)}, x^{(j)}$ are the same person, we want $\|f(x^{(i)}) - f(x^{(j)})\|_2^2$ to be small.
  • If $x^{(i)}, x^{(j)}$ are different persons, we want $\|f(x^{(i)}) - f(x^{(j)})\|_2^2$ to be large.

It turns out that we can use back-propagation to learn such parameters, as long as we can define a loss function with these properties, i.e. one that encourages encodings that are good for comparing images of faces. This is where triplet loss comes in.

Triplet Loss

Triplet loss is called so because it uses three elements: an anchor, a positive and a negative example. The anchor is a picture of one of our employees. The positive example is another picture of the same employee. The negative example is simply a picture of someone else. That is, the distance between the anchor and the positive example should be low, while the distance between the anchor and the negative example should be high. We will use the letters $A, P, N$ to refer to the anchor, positive and negative examples respectively.

Remember the two properties we wanted out of our encodings: the squared L2-norm of the difference between two encodings should be small if they are of the same person, and large if they are not. At the very least, we want the distance between the anchor and the negative to be larger than the distance between the anchor and the positive example. In math, we want:

$$\begin{aligned} \|f(A) - f(P)\|_2^2 &\leq \|f(A) - f(N)\|_2^2 \\ \|f(A) - f(P)\|_2^2 - \|f(A) - f(N)\|_2^2 &\leq 0 \end{aligned}$$

There’s an issue with this approach: there is a trivial solution. If the network learns to output the same encoding for every image (for example, all zeros), both distances are $0$ and the inequality is trivially satisfied. To prevent our network from learning this solution, we add a margin, similar to the one used in support vector machines. The margin, called $\alpha$, is a hyperparameter, so that we have the following:

$$\|f(A) - f(P)\|_2^2 - \|f(A) - f(N)\|_2^2 + \alpha \leq 0$$

Setting $\alpha$ allows us to specify how much larger the distance between the anchor and the negative must be compared to the distance between the anchor and the positive example.

We are ready to define our loss function, given three images $A, P, N$:

$$\mathcal{L}(A, P, N) = \max\left(\|f(A) - f(P)\|_2^2 - \|f(A) - f(N)\|_2^2 + \alpha,\ 0\right)$$

We use the $\max$ operator here because as long as we have pushed the difference between the two squared distances (plus the margin) below $0$, we have done well and the loss is $0$. Otherwise, we have done poorly, and the loss is exactly how far above zero that difference is.
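Here is a minimal NumPy sketch of the loss for a single triplet, operating directly on precomputed encodings; in a real system this would be computed inside the framework so gradients flow back into the encoder’s weights.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss on the encodings f(A), f(P), f(N) (e.g. 128-d vectors)."""
    pos_dist = np.sum((f_a - f_p) ** 2)  # ||f(A) - f(P)||^2, should be small
    neg_dist = np.sum((f_a - f_n) ** 2)  # ||f(A) - f(N)||^2, should be large
    return max(pos_dist - neg_dist + alpha, 0.0)  # zero loss once the margin is satisfied
```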

We can also now define our cost function, over some $m$ training samples:

$$J = \sum_{i=1}^m \mathcal{L}(A^{(i)}, P^{(i)}, N^{(i)})$$

Notice that to form the triplets, we need at least two pictures of the same person! For example, we could have a training set of 10,000 images of 1,000 people, with each person appearing more than once. You might be wondering: sure, but how do we actually pick the triplets? It turns out that this is very important.

If we choose the $A, P, N$ triplets at random, then the constraint is easily satisfied, but our network will not learn much. We need to choose triplets that are “hard” to train on, i.e. where $\text{d}(A, P) \approx \text{d}(A, N)$. By doing this, we force the network to deal with the harder cases, where people look similar but are not the same person. The details on how to build triplets are described in the FaceNet paper by Schroff, et al.
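As a rough illustration of the idea (not the exact FaceNet procedure), given the encodings and labels of a mini-batch, we could pick for each anchor/positive pair the negative whose distance to the anchor is closest to $\text{d}(A, P)$:

```python
import numpy as np

def pick_hard_triplets(encodings, labels):
    """Return (anchor, positive, negative) index triples that are 'hard' to separate."""
    # Pairwise squared distances between all encodings in the batch.
    dists = np.sum((encodings[:, None, :] - encodings[None, :, :]) ** 2, axis=-1)
    labels = np.asarray(labels)
    triplets = []
    for a in range(len(labels)):
        negatives = np.where(labels != labels[a])[0]
        if len(negatives) == 0:
            continue
        for p in np.where(labels == labels[a])[0]:
            if p == a:
                continue  # the positive must be a different picture of the same person
            # Negative whose anchor distance is most similar to d(A, P).
            n = negatives[np.argmin(np.abs(dists[a, negatives] - dists[a, p]))]
            triplets.append((a, p, n))
    return triplets
```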

Face Verification and Binary Classification

It turns out that the triplet loss approach is not the only way to build a face recognition system. We could also take the outputs of the two CNNs, i.e. the $128$ dimensional embeddings of the pictures, feed them to a logistic regression unit, and perform binary classification on them, estimating whether the two images show the same person or not. The final logistic layer would look like this:

$$\hat{y} = \sigma \left( \sum_{k=1}^{128} w_k \left| f(x^{(i)})_k - f(x^{(j)})_k \right| + b \right)$$

Where we are still using the embeddings, but now we compute an element-wise absolute difference, each component multiplied by its own weight.
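A minimal NumPy sketch of this verification head, using the element-wise absolute difference between the two embeddings; the weights `w` and bias `b` would be learned jointly with the encoder and are placeholders here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(f_i, f_j, w, b):
    """f_i, f_j: 128-d embeddings of the two images. Returns P(same person)."""
    features = np.abs(f_i - f_j)             # element-wise |f(x^(i))_k - f(x^(j))_k|
    return sigmoid(np.dot(w, features) + b)  # weighted sum plus bias through a sigmoid
```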

In practice, and for both approaches, we can reduce the system’s latency by precomputing the encodings of our employees, and only computing the encoding of the person we are trying to recognize on the fly.

Neural Style Transfer

What is Neural Style Transfer?

You might still remember back when neural style transfer was the latest, hottest thing in machine learning, and perhaps even more so the pointed questions it raised about intellectual property rights. Today, many companies use Stable Diffusion, a text-to-image system, as an interface for performing neural style transfer.

Neural style transfer, in a nutshell, is being able to imbue an image with another style. For example, we might have a picture of our cats, but we’d like to make that picture look like it was painted by Rembrandt. Neural style transfer allows us to “transfer” the style of a Rembrandt painting into a picture of our cat, so that it looks like Rembrandt himself painted our furry friend.

We will be using the notation $C$ to refer to the content image, in our case our cat. The letter $S$ will represent the style image, in our case a Rembrandt painting. Finally, we will use $G$ to refer to the generated image, that is, our cat in the style of Rembrandt.

What are deep CNNs learning?

Before getting into neural style transfer, we must understand, at a high level, how the input changes as it passes through the layers of a CNN. There is an amazing paper by Zeiler and Fergus (2013) in which they come up with novel ways to visualize what visual features the filters are learning at each of the layers.

The gist of it is that the filters in the shallower (earlier) layers of the network learn to pick out basic features in our image; think vertical lines, diagonal lines, etc. As we progress deeper into the CNN, the filters start to learn more abstract features, such as concentric circles, combinations of colors and lines, etc. Even later on, we see that some filters specialize in certain regions of the face: noses, eyes, and so on.

This is important to keep in mind, especially in the context of neural style transfer, because we can choose to give a unique weight to each layer in the combination of content and style.

Cost Function

Similar to all other applications of supervised learning, we need to establish a cost function that will guide the optimization process. Let’s get started.

We have our content image $C$ and our style image $S$, and we’d like to generate an image $G$ which is some mixture of both $C$ and $S$. We will define our cost function in terms of these elements:

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

Let’s unpack the formula. Our total cost is a function of two separate costs. The first one is the content cost: how far our generated image is from the original content image, which in our example is a picture of a cat. The second one is the style cost: how far our generated image is from the original style image, which in our example is a Rembrandt painting. We have two hyperparameters, $\alpha$ and $\beta$, which allow us to adjust the mixture between the two costs.

But how do we get $G$? We start by initializing it randomly. In practice, we add some random noise to the original content image $C$, but imagine that we start $G$ completely at random, that is, as a random noise picture of dimensions $100 \times 100 \times 3$. Then we can use gradient descent to update our random picture $G$:

$$G := G - \frac{\partial}{\partial G} J(G)$$

Notice that we are not updating any parameters! We are directly updating the pixel values of our generated image $G$ at each step of gradient descent, or whatever garden-variety optimization algorithm we choose to use. Let’s now break down each of the components of the cost function.
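A minimal sketch of this loop, written with PyTorch for the automatic differentiation; `content_cost` and `style_cost` are hypothetical differentiable functions (built on a pre-trained CNN, as described in the next sections), and the hyperparameter values are arbitrary.

```python
import torch

def generate_image(C, S, content_cost, style_cost,
                   alpha=10.0, beta=40.0, steps=200, lr=0.01):
    """Optimize the pixels of G directly; C and S are image tensors."""
    # Start from a noisy copy of the content image and make its pixels trainable.
    G = (C + 0.1 * torch.randn_like(C)).requires_grad_(True)
    optimizer = torch.optim.SGD([G], lr=lr)  # we optimize pixels, not network weights
    for _ in range(steps):
        optimizer.zero_grad()
        J = alpha * content_cost(C, G) + beta * style_cost(S, G)
        J.backward()       # gradients of J with respect to the pixels of G
        optimizer.step()   # G := G - lr * dJ/dG
    return G.detach()
```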

Content Cost Function

Remember, our cost function was:

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

We will focus on the $J_{content}(C, G)$ component.

Say that we are using a pre-trained CNN, such as VGG-19, and that we focus on some layer $l$. Keeping in mind what happens at each layer of a CNN, we will hedge our bets and pick some $l$ that’s in the middle: not too deep and not too shallow. We will focus on the activations for this layer, $a^{[l]}$.

We will run both images, $C$ and $G$, through the network and collect the activations at layer $l$, giving us $a^{[l](C)}$ and $a^{[l](G)}$. The idea is that if $a^{[l](C)}$ and $a^{[l](G)}$ are similar, then the images have similar content; the filters are picking up similar activations for the features they specialize in at that particular layer.

This comparison is done with our old friend, the squared L2-norm of the difference:

$$J^{[l]}_{content}(C, G) = \frac{1}{4\, n_H^{[l]} n_W^{[l]} n_C^{[l]}} \sum_{\text{all entries}} \left( a^{[l](C)} - a^{[l](G)} \right)^2$$

The normalization factor was chosen by the authors and takes into account the dimensions of $a^{[l]}$.

This will make our optimization algorithm set pixel values in $G$ that minimize this difference, so that the generated image has similar content to the content image.
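Here is a minimal NumPy sketch of the per-layer content cost; `a_C` and `a_G` are the activation volumes $a^{[l](C)}$ and $a^{[l](G)}$ with shape $(n_H, n_W, n_C)$, which in practice would come from a pre-trained CNN such as VGG-19.

```python
import numpy as np

def content_cost_layer(a_C, a_G):
    """Content cost at one layer, given the two activation volumes (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a_C.shape
    return np.sum((a_C - a_G) ** 2) / (4 * n_H * n_W * n_C)
```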

Style Cost Function

Calculating the content cost is so far not that crazy; doing element-wise squared differences is a reasonable approach for the content. But how can we quantify “style” and compare our generated image $G$ to our style image $S$ to see how far off they are from each other? Here is where things get fascinating.

The authors define “style” as the pair-wise correlation between the activations across channels. Since each channel learns a different feature, style is defined as the interaction between these features, more specifically their covariance. If one filter picks up stark lines and another one picks up a certain color, then how strongly those two features occur together (their positive or negative covariance) is one of the elements of style. When we do a pair-wise comparison between all channels we get a matrix, which we will call $G^{[l]}$, the style matrix. This matrix tells us how the channels covary with each other, and this, in a sense, is the essence of style.

We want to compute the style matrix for both $S$, our style image, and $G$, our generated image, and repeat the comparison we did for the content, but this time on the style matrix, or Gram matrix, of both $G$ and $S$. Both style matrices $G^{[l](S)}$ and $G^{[l](G)}$ have the same dimensions: $n_C^{[l]} \times n_C^{[l]}$.

Let’s start with the style matrix of $S$. We define $a^{[l](S)}_{i,j,k}$ as one entry of the activations at layer $l$ when using $S$ as the input. We will construct $G^{[l](S)}$, the Gram matrix of these activations. The entry $kk'$ of $G^{[l](S)}$ is defined as:

$$G^{[l](S)}_{kk'} = \sum_{i=1}^{n_H} \sum_{j=1}^{n_W} a^{[l](S)}_{i,j,k}\, a^{[l](S)}_{i,j,k'}$$

We will repeat the same, but for $G$, our generated image:

$$G^{[l](G)}_{kk'} = \sum_{i=1}^{n_H} \sum_{j=1}^{n_W} a^{[l](G)}_{i,j,k}\, a^{[l](G)}_{i,j,k'}$$

Let’s recap:

  • We have two images, $S$ and $G$.
  • We defined the “style” of an image as the Gram matrix of the activation volume, that is, the pair-wise channel covariance.
  • We calculated the Gram or style matrix from the activations of layer $l$ for both $G$ and $S$. That is, we ran both $G$ and $S$ through the network up to the same layer and got some output; it is from this output that we calculate the style matrix for $G$ and $S$ separately.

Now, we compare the two in the same fashion we did for the content cost. The style cost is defined as:

$$J^{[l]}_{style}(S, G) = \frac{1}{\left(2 n_H^{[l]} n_W^{[l]} n_C^{[l]} \right)^2} \sum_{k=1}^{n_C^{[l]}} \sum_{k'=1}^{n_C^{[l]}} \left( G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)}\right)^2$$

Where again, the normalization factor in front was set by the authors.
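Putting the last two formulas together, here is a minimal NumPy sketch of the Gram matrix and the per-layer style cost; `a_S` and `a_G` are activation volumes of shape $(n_H, n_W, n_C)$ at layer $l$.

```python
import numpy as np

def gram_matrix(a):
    """Style matrix G^[l]: entry (k, k') sums a[i, j, k] * a[i, j, k'] over all positions."""
    n_H, n_W, n_C = a.shape
    unrolled = a.reshape(n_H * n_W, n_C)  # one column per channel, spatial dims unrolled
    return unrolled.T @ unrolled          # shape (n_C, n_C)

def style_cost_layer(a_S, a_G):
    """Per-layer style cost between the style image S and the generated image G."""
    n_H, n_W, n_C = a_S.shape
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    return np.sum((G_S - G_G) ** 2) / (2 * n_H * n_W * n_C) ** 2
```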

A final thing to notice about the style cost: it is indexed by $l$, so we can calculate it at every layer $l$. The authors define the style cost at the network level as:

$$J_{style}(S, G) = \sum_{l=1}^L \lambda^{[l]} J^{[l]}_{style}(S, G)$$

Using a $\lambda^{[l]}$ parameter for each layer allows us to mix the more basic features of the shallower layers with the more abstract features of the deeper layers. If we want the generated image to follow the style softly, we choose larger weights for the deeper layers and smaller ones for the shallower layers. On the other hand, if we want our generated image to follow the style image strongly, we do the opposite: smaller weights for the deeper layers and larger weights for the shallower layers.
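In code, the per-layer costs are simply combined with the chosen weights; a tiny sketch, assuming `per_layer_costs[l]` was computed as above and `lambdas[l]` holds the $\lambda^{[l]}$ weights we picked:

```python
def total_style_cost(per_layer_costs, lambdas):
    """Weighted sum of the per-layer style costs, J_style(S, G)."""
    return sum(lam * cost for lam, cost in zip(lambdas, per_layer_costs))
```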

Now we can come back to the cost function we defined earlier:

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

That’s it, this is the last week of the CNN course. The next course is on sequence models, which, to me, are much more interesting in terms of applications. The programming exercises for this week were fantastic, and I highly suggest that you do them.

Next week’s post is here, and it’s the first week in the sequence models course.