This is the fourth and last week of the fourth course of DeepLearning.AI’s Deep Learning Specialization offered on Coursera. This week we go over two special applications of CNNs in computer vision: face recognition and neural style transfer. It introduces important new concepts that will be useful even beyond the context of CNNs.

This week’s topics are as follows:

  • Face Recognition
  • Neural Style Transfer


Face Recognition

What is Face Recognition?

Let’s start by going over the important distinction between face verification and face recognition.

In the case of face verification, we get a pair of things: an input image and a name or ID. The output of such a system is a binary choice: either the image corresponds to the name or ID, or it does not. On the other hand, in the case of face recognition, we only get a single thing: an image. The output of such a system is whether the input image matches any of the $K$ identities already stored in our system’s database.

This means that face verification is a $1 \to 1$ procedure, while face recognition is a $1 \to K$ procedure. The latter is about whether we find a match in the database or not, which usually means doing $K$ comparisons in the worst case.

So far it doesn’t sound too bad, but here’s the real issue. Imagine that you’re implementing a face recognition system at your company’s building. How many images can you get of each person? Ideally you’d like to get as many as possible, but that’s unfeasible; even worse, in practice, we usually only have access to one or at most two pictures of each person. Having only one training example for each “class” is what defines one-shot learning; having a few is usually called few-shot learning.

One Shot Learning

So imagine that we go around our company and finally get a face picture of everyone who should be allowed into the building. We want to train a model so that when one of the allowed people walks in the door, the model recognizes them from that single picture. Obviously, we also want to keep unwanted people out!

The most immediate idea is to train a classifier with a CNN architecture. If we have $K$ employees, then the output layer of our CNN will be a softmax with $K + 1$ units, the extra one for when the input doesn’t match anyone. The issue is that this CNN will have terrible performance because of the size of our training data; remember, we only have $K$ pictures. Even worse, what happens when we hire someone else? We would have to retrain the whole network every time. There has to be a better way.

The better way is to move away from classification and instead learn a similarity function. A similarity function is a function that takes in a pair of elements and outputs their similarity. You might have heard of Jaccard similarity or cosine similarity. In this case, we want to learn a function that takes a pair of images:

$$\text{d}(\text{img}_1, \text{img}_2) = \text{degree of difference between the images}$$

Then we can set some threshold $\tau$ and binarize our similarity function:

$$\text{Verification}(\text{img}_1, \text{img}_2) = \begin{cases} \text{Same} & \text{if } \text{d}(\text{img}_1, \text{img}_2) \leq \tau \\ \text{Different} & \text{otherwise} \end{cases}$$

Hopefully, when our colleague Alice walks in the door of the building, the picture taken by the device at the entrance will have the lowest difference when compared to Alice’s picture in our database, amongst all employees. Also, someone who is not our colleague should not have a difference smaller than $\tau$ with anybody in our employee set. We will see how these kinks are ironed out when we get to triplet loss. Let’s first see how we can learn such a function $\text{d}(\text{img}_1, \text{img}_2)$.
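To make the two setups concrete, here is a minimal sketch of how verification ($1 \to 1$) and recognition ($1 \to K$) could use such a distance function. Everything here is a hypothetical placeholder: `d` is the function we will learn in the next sections, `database` maps each identity to its single stored picture, and the threshold value is arbitrary.

```python
def verify(input_image, claimed_id, database, d, tau=0.7):
    """Face verification (1-to-1): does input_image match the claimed identity?"""
    reference_image = database[claimed_id]         # the single stored picture
    return d(input_image, reference_image) <= tau  # small difference => same person


def recognize(input_image, database, d, tau=0.7):
    """Face recognition (1-to-K): which identity, if any, does input_image match?"""
    best_id, best_dist = None, float("inf")
    for identity, reference_image in database.items():  # K comparisons in the worst case
        dist = d(input_image, reference_image)
        if dist < best_dist:
            best_id, best_dist = identity, dist
    return best_id if best_dist <= tau else None
```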

Siamese Network

The idea behind Siamese networks is pretty straightforward: we run a pair of images $(\text{img}_1, \text{img}_2)$ through the same CNN with a fully connected layer at the end, and we then use these outputs to compare the images. Let’s dig deeper into what this means.

Remember that CNNs usually reduce the spatial dimensions of the input volume while increasing the channel dimension. Imagine that we have a CNN that takes an image of a face with dimensions $100 \times 100 \times 3$ and, after reducing this volume through some number of convolutional and max pooling layers, produces a $128$ dimensional vector. This is the same as a vanilla CNN, except that we stop right before the fully connected layer would be fed into a softmax layer for classification.

Let’s call our input image $x^{(1)}$. Let’s call the result of running $x^{(1)}$ through our CNN, transforming the $100 \times 100 \times 3$ input into a $128$ element vector, $f(x^{(1)})$, so that $f(x^{(1)}) \in \mathbb{R}^{128}$. This $128$ dimensional vector is an encoding of the input; it is simply some particular representation of the original image containing a face.

If we let $x^{(1)}$ be the image of our employee in the database, and $x^{(2)}$ be the image of the face of the person who just walked in, you might imagine what we want to do. We’ll run $x^{(2)}$ through our CNN and encode it in the same way, so that we get $f(x^{(2)})$. Notice that $f(x^{(2)}) \in \mathbb{R}^{128}$ as well, and that the encoding was generated with the same CNN as $f(x^{(1)})$.

Now we can redefine our similarity function $\text{d}(\text{img}_1, \text{img}_2)$ in terms of these numerical encodings, so that our new similarity function is:

$$d(x^{(1)}, x^{(2)}) = \|f(x^{(1)}) - f(x^{(2)})\|_2^2$$

This is our old friend, the squared L2-norm of the difference between the two encodings.
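As a rough sketch in NumPy, assuming `encoder` stands in for the shared CNN $f(\cdot)$ that maps a $100 \times 100 \times 3$ image to a $128$ dimensional vector (in reality a trained network, here just a hypothetical callable):

```python
import numpy as np

def d(img1, img2, encoder):
    """Squared L2 distance between the encodings of two face images."""
    f1 = encoder(img1)  # f(x^(1)), a 128-d vector
    f2 = encoder(img2)  # f(x^(2)), produced by the *same* CNN weights
    return float(np.sum((f1 - f2) ** 2))
```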

You might be thinking, why not compare the two images with the L2-norm directly, pixel by pixel? Think about what happens when you compare two images of the same person but with different lighting, hairstyle, makeup, etc.: the raw pixel distance would be large even though it’s the same person, so it won’t work.

But how do we train such a network? Remember, we want a CNN that takes as input an image and generates a $128$ dimensional encoding $f(x^{(i)})$. But of course, not just any encoding! We want that $128$ dimensional encoding to have certain properties:

  • If $x^{(i)}, x^{(j)}$ are the same person, we want $\|f(x^{(i)}) - f(x^{(j)})\|_2^2$ to be small.
  • If $x^{(i)}, x^{(j)}$ are different persons, we want $\|f(x^{(i)}) - f(x^{(j)})\|_2^2$ to be large.

It turns out that we can use back-propagation to learn such parameters, as long as we can define a loss function with these properties, i.e. one that encourages encodings that are good for comparing images of faces. This is where triplet loss comes in.

Triplet Loss

Triplet loss is called so because it uses three elements: an anchor, a positive and a negative example. The anchor is a picture of one of our employees. The positive example is another picture of the same employee. The negative example is simply a picture of someone else. That is, the distance between the anchor and the positive example should be low, while the distance between the anchor and the negative example should be high. We will use the letters $A, P, N$ to refer to the anchor, positive and negative examples respectively.

Remember the two properties we wanted out of our encodings: the squared L2-norm of the difference between two encodings should be small if they are of the same person, and large if they are not. At the very least, we want the distance between the anchor and the negative to be larger than the distance between the anchor and the positive example. In math, we want:

$$\begin{aligned} \|f(A) - f(P)\|_2^2 &\leq \|f(A) - f(N)\|_2^2 \\ \|f(A) - f(P)\|_2^2 - \|f(A) - f(N)\|_2^2 &\leq 0 \end{aligned}$$

There’s an issue with this approach: there is a trivial solution. If the network learns to output the same encoding for every image (for example, all zeros), both distances are $0$ and the inequality is trivially satisfied. To prevent our network from learning this solution, we add a margin, similar to the one used in support vector machines. The margin, called $\alpha$, is a hyperparameter, so that we have the following:

$$\|f(A) - f(P)\|_2^2 - \|f(A) - f(N)\|_2^2 + \alpha \leq 0$$

Setting $\alpha$ allows us to specify how much larger the distance between the anchor and the negative must be compared to the distance between the anchor and the positive example.

We are ready to define our loss function, given three images $A, P, N$:

$$\mathcal{L}(A, P, N) = \max\left(\|f(A) - f(P)\|_2^2 - \|f(A) - f(N)\|_2^2 + \alpha,\ 0\right)$$

We use the $\max$ operator here because as long as we have pushed the difference between the two squared distances (plus the margin) below $0$, we have done well and the loss is $0$. Otherwise, we have done poorly, and the loss is exactly how far above zero that difference is.
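Here is a minimal NumPy sketch of the loss for a single triplet, operating directly on precomputed encodings; in a real system this would be computed inside the framework so gradients flow back into the encoder’s weights.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss on the encodings f(A), f(P), f(N) (e.g. 128-d vectors)."""
    pos_dist = np.sum((f_a - f_p) ** 2)  # ||f(A) - f(P)||^2, should be small
    neg_dist = np.sum((f_a - f_n) ** 2)  # ||f(A) - f(N)||^2, should be large
    return max(pos_dist - neg_dist + alpha, 0.0)  # zero loss once the margin is satisfied
```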

We can also now define our cost function, over some $m$ training samples:

$$J = \sum_{i=1}^m \mathcal{L}(A^{(i)}, P^{(i)}, N^{(i)})$$

Notice that to form the triplets, we need at least two pictures of the same person! For example, we could have a training set of 10,000 images of 1,000 people, with each person appearing more than once. You might be wondering: sure, but how do we actually pick the triplets? It turns out that this is very important.

If we choose the $A, P, N$ triplets at random, then the constraint is easily satisfied, but our network will not learn much. We need to choose triplets that are “hard” to train on, i.e. where $\text{d}(A, P) \approx \text{d}(A, N)$. By doing this, we force the network to deal with the harder cases, where people look similar but are not the same person. The details on how to build triplets are described in the FaceNet paper by Schroff, et al.
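As a rough illustration of the idea (not the exact FaceNet procedure), given the encodings and labels of a mini-batch, we could pick for each anchor/positive pair the negative whose distance to the anchor is closest to $\text{d}(A, P)$:

```python
import numpy as np

def pick_hard_triplets(encodings, labels):
    """Return (anchor, positive, negative) index triples that are 'hard' to separate."""
    # Pairwise squared distances between all encodings in the batch.
    dists = np.sum((encodings[:, None, :] - encodings[None, :, :]) ** 2, axis=-1)
    labels = np.asarray(labels)
    triplets = []
    for a in range(len(labels)):
        negatives = np.where(labels != labels[a])[0]
        if len(negatives) == 0:
            continue
        for p in np.where(labels == labels[a])[0]:
            if p == a:
                continue  # the positive must be a different picture of the same person
            # Negative whose anchor distance is most similar to d(A, P).
            n = negatives[np.argmin(np.abs(dists[a, negatives] - dists[a, p]))]
            triplets.append((a, p, n))
    return triplets
```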

Face Verification and Binary Classification

It turns out that the triplet loss approach is not the only way to build a face recognition system. We could also take the outputs of the two CNNs, i.e. the $128$ dimensional embeddings of the pictures, feed them to a logistic regression unit, and perform binary classification on them, estimating whether the two images show the same person or not. The final logistic layer would look like this:

$$\hat{y} = \sigma \left( \sum_{k=1}^{128} w_k \left| f(x^{(i)})_k - f(x^{(j)})_k \right| + b \right)$$

Where we are still using the embeddings, but now we compute an element-wise absolute difference, each component multiplied by its own weight.
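A minimal NumPy sketch of this verification head, using the element-wise absolute difference between the two embeddings; the weights `w` and bias `b` would be learned jointly with the encoder and are placeholders here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(f_i, f_j, w, b):
    """f_i, f_j: 128-d embeddings of the two images. Returns P(same person)."""
    features = np.abs(f_i - f_j)             # element-wise |f(x^(i))_k - f(x^(j))_k|
    return sigmoid(np.dot(w, features) + b)  # weighted sum plus bias through a sigmoid
```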

In practice, and for both approaches, we can reduce the system’s latency by precomputing the encodings of our employees, and only computing the encoding of the person we are trying to recognize on the fly.

Neural Style Transfer

What is Neural Style Transfer?

You might still remember back when neural style transfer was the latest, hottest thing in machine learning, and perhaps even more so the pointed questions it raised about intellectual property rights. Today, many companies use Stable Diffusion, a text-to-image system, as an interface for performing neural style transfer.

Neural style transfer, in a nutshell, is being able to imbue an image with another style. For example, we might have a picture of our cats, but we’d like to make that picture look like it was painted by Rembrandt. Neural style transfer allows us to “transfer” the style of a Rembrandt painting into a picture of our cat, so that it looks like Rembrandt himself painted our furry friend.

We will be using the notation $C$ to refer to the content image, in our case our cat. The letter $S$ will represent the style image, in our case a Rembrandt painting. Finally, we will use $G$ to refer to the generated image, that is, our cat in the style of Rembrandt.

What are deep CNNs learning?

Before getting into neural style transfer, we must understand, at a high level, how the input changes as it passes through the layers of a CNN. There is an amazing paper by Zeiler and Fergus (2013) in which they come up with novel ways to visualize what visual features the filters are learning at each of the layers.

The gist of it is that the filters in the shallower (earlier) layers of the network learn to pick out basic features in our image; think vertical lines, diagonal lines, etc. As we progress deeper into the CNN, the filters start to learn more abstract features, such as concentric circles, combinations of colors and lines, etc. Even later on, we see that some filters specialize in certain regions of the face: noses, eyes, and so on.

This is important to keep in mind, especially in the context of neural style transfer, because we can choose to give a unique weight to each layer in the combination of content and style.

Cost Function

Similar to all other applications of supervised learning, we need to establish a cost function that will guide the optimization process. Let’s get started.

We have our content image $C$ and our style image $S$, and we’d like to generate an image $G$ which is some mixture of both $C$ and $S$. We will define our cost function in terms of these elements:

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

Let’s unpack the formula. Our total cost is a function of two separate costs. The first one is the content cost: how far our generated image is from the original content image, which in our example is a picture of a cat. The second one is the style cost: how far our generated image is from the original style image, which in our example is a Rembrandt painting. We have two hyperparameters, $\alpha$ and $\beta$, which allow us to adjust the mixture between the two costs.

But how do we get $G$? We start by initializing it randomly. In practice, we add some random noise to the original content image $C$, but imagine that we start $G$ completely at random, that is, as a random noise picture of dimensions $100 \times 100 \times 3$. Then we can use gradient descent to update our random picture $G$:

$$G := G - \frac{\partial}{\partial G} J(G)$$

Notice that we are not updating any parameters! We are directly updating the pixel values of our generated image $G$ at each step of gradient descent, or whatever garden-variety optimization algorithm we choose to use. Let’s now break down each of the components of the cost function.
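A minimal sketch of this loop, written with PyTorch for the automatic differentiation; `content_cost` and `style_cost` are hypothetical differentiable functions (built on a pre-trained CNN, as described in the next sections), and the hyperparameter values are arbitrary.

```python
import torch

def generate_image(C, S, content_cost, style_cost,
                   alpha=10.0, beta=40.0, steps=200, lr=0.01):
    """Optimize the pixels of G directly; C and S are image tensors."""
    # Start from a noisy copy of the content image and make its pixels trainable.
    G = (C + 0.1 * torch.randn_like(C)).requires_grad_(True)
    optimizer = torch.optim.SGD([G], lr=lr)  # we optimize pixels, not network weights
    for _ in range(steps):
        optimizer.zero_grad()
        J = alpha * content_cost(C, G) + beta * style_cost(S, G)
        J.backward()       # gradients of J with respect to the pixels of G
        optimizer.step()   # G := G - lr * dJ/dG
    return G.detach()
```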

Content Cost Function

Remember, our cost function was:

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

We will focus on the $J_{content}(C, G)$ component.

Say that we are using a pre-trained CNN, such as VGG-19, and that we focus on some layer $l$. Keeping in mind what happens at each layer of a CNN, we will hedge our bets and pick some $l$ that’s in the middle: not too deep and not too shallow. We will focus on the activations for this layer, $a^{[l]}$.

We will run both images, $C$ and $G$, through the network and collect the activations at layer $l$, giving us $a^{[l](C)}$ and $a^{[l](G)}$. The idea is that if $a^{[l](C)}$ and $a^{[l](G)}$ are similar, then the images have similar content; the filters are picking up similar activations for the features they specialize in at that particular layer.

This comparison is done with our old friend, the squared L2-norm of the difference:

$$J^{[l]}_{content}(C, G) = \frac{1}{4\, n_H^{[l]} n_W^{[l]} n_C^{[l]}} \sum_{\text{all entries}} \left( a^{[l](C)} - a^{[l](G)} \right)^2$$

The normalization factor was chosen by the authors and takes into account the dimensions of $a^{[l]}$.

This will make our optimization algorithm set pixel values in $G$ that minimize this difference, so that the generated image has similar content to the content image.
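Here is a minimal NumPy sketch of the per-layer content cost; `a_C` and `a_G` are the activation volumes $a^{[l](C)}$ and $a^{[l](G)}$ with shape $(n_H, n_W, n_C)$, which in practice would come from a pre-trained CNN such as VGG-19.

```python
import numpy as np

def content_cost_layer(a_C, a_G):
    """Content cost at one layer, given the two activation volumes (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a_C.shape
    return np.sum((a_C - a_G) ** 2) / (4 * n_H * n_W * n_C)
```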

Style Cost Function

Calculating the content cost is so far not that crazy; doing element-wise squared differences is a reasonable approach for the content. But how can we quantify “style” and compare our generated image $G$ to our style image $S$ to see how far off they are from each other? Here is where things get fascinating.

The authors define “style” as the pair-wise correlation between the activations across channels. Since each channel learns a different feature, style is defined as the interaction between these features, more specifically their covariance. If one filter picks up stark lines and another one picks up a certain color, then how strongly those two features occur together (their positive or negative covariance) is one of the elements of style. When we do a pair-wise comparison between all channels we get a matrix, which we will call $G^{[l]}$, the style matrix. This matrix tells us how the channels covary with each other, and this, in a sense, is the essence of style.

We want to compute the style matrix for both $S$, our style image, and $G$, our generated image, and repeat the comparison we did for the content, but this time on the style matrix, or Gram matrix, of both $G$ and $S$. Both style matrices $G^{[l](S)}$ and $G^{[l](G)}$ have the same dimensions: $n_C^{[l]} \times n_C^{[l]}$.

Let’s start with the style matrix of $S$. We define $a^{[l](S)}_{i,j,k}$ as one entry of the activations at layer $l$ when using $S$ as the input. We will construct $G^{[l](S)}$, the Gram matrix of these activations. The entry $kk'$ of $G^{[l](S)}$ is defined as:

$$G^{[l](S)}_{kk'} = \sum_{i=1}^{n_H} \sum_{j=1}^{n_W} a^{[l](S)}_{i,j,k}\, a^{[l](S)}_{i,j,k'}$$

We will repeat the same, but for $G$, our generated image:

$$G^{[l](G)}_{kk'} = \sum_{i=1}^{n_H} \sum_{j=1}^{n_W} a^{[l](G)}_{i,j,k}\, a^{[l](G)}_{i,j,k'}$$

Let’s recap:

  • We have two images, $S$ and $G$.
  • We defined the “style” of an image as the Gram matrix of the activation volume, that is, the pair-wise channel covariance.
  • We calculated the Gram or style matrix from the activations of layer $l$ for both $G$ and $S$. That is, we ran both $G$ and $S$ through the network up to the same layer and got some output; it is from this output that we calculate the style matrix for $G$ and $S$ separately.

Now, we compare the two in the same fashion we did for the content cost. The style cost is defined as:

$$J^{[l]}_{style}(S, G) = \frac{1}{\left(2 n_H^{[l]} n_W^{[l]} n_C^{[l]} \right)^2} \sum_{k=1}^{n_C^{[l]}} \sum_{k'=1}^{n_C^{[l]}} \left( G_{kk'}^{[l](S)} - G_{kk'}^{[l](G)}\right)^2$$

Where again, the normalization factor in front was set by the authors.
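Putting the last two formulas together, here is a minimal NumPy sketch of the Gram matrix and the per-layer style cost; `a_S` and `a_G` are activation volumes of shape $(n_H, n_W, n_C)$ at layer $l$.

```python
import numpy as np

def gram_matrix(a):
    """Style matrix G^[l]: entry (k, k') sums a[i, j, k] * a[i, j, k'] over all positions."""
    n_H, n_W, n_C = a.shape
    unrolled = a.reshape(n_H * n_W, n_C)  # one column per channel, spatial dims unrolled
    return unrolled.T @ unrolled          # shape (n_C, n_C)

def style_cost_layer(a_S, a_G):
    """Per-layer style cost between the style image S and the generated image G."""
    n_H, n_W, n_C = a_S.shape
    G_S, G_G = gram_matrix(a_S), gram_matrix(a_G)
    return np.sum((G_S - G_G) ** 2) / (2 * n_H * n_W * n_C) ** 2
```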

A final thing to notice about the style cost: it is indexed by $l$, so we can calculate it at every layer $l$. The authors define the style cost at the network level as:

$$J_{style}(S, G) = \sum_{l=1}^L \lambda^{[l]} J^{[l]}_{style}(S, G)$$

Using a $\lambda^{[l]}$ parameter for each layer allows us to mix the more basic features of the shallower layers with the more abstract features of the deeper layers. If we want the generated image to follow the style softly, we choose larger weights for the deeper layers and smaller ones for the shallower layers. On the other hand, if we want our generated image to follow the style image strongly, we do the opposite: smaller weights for the deeper layers and larger weights for the shallower layers.
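In code, the per-layer costs are simply combined with the chosen weights; a tiny sketch, assuming `per_layer_costs[l]` was computed as above and `lambdas[l]` holds the $\lambda^{[l]}$ weights we picked:

```python
def total_style_cost(per_layer_costs, lambdas):
    """Weighted sum of the per-layer style costs, J_style(S, G)."""
    return sum(lam * cost for lam, cost in zip(lambdas, per_layer_costs))
```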

Now we can come back to the cost function we defined earlier:

$$J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$$

That’s it, this is the last week of the CNN course. The next course is on sequence models, which, to me, are much more interesting in terms of applications. The programming exercises for this week were fantastic, and I highly suggest that you do them.

Next week’s post is here, and it’s the first week in the sequence models course.