This is the second week of the fourth course of DeepLearning.AI’s Deep Learning Specialization offered on Coursera. This week is largely a literature review, going over different architectures and approaches that have made large contributions to the field.

This week’s topics are:

  • Case studies: classic networks (LeNet5, AlexNet, and VGG16), ResNets, $1 \times 1$ convolutions, the inception network, and MobileNet.
  • Practical advice for using ConvNets: transfer learning and data augmentation.

Case Studies

We should obviously keep up with the computer vision literature if we are interested in implementing new ideas. However, since CNNs have so many hyperparameters and settings, it’s essential to pay attention to the empirically justified advances occurring in the field. Since so many computer vision tasks are similar, many of the core ideas of new approaches can be applied, sometimes identically and sometimes with minor adaptations, to new applications. Finally, in the age of big data and cheap compute, we can get away with virtually free-riding on somebody else’s compute by using their pre-trained model. Let’s start with the “classic” networks that hit the field from the late 90s through the early 10s.

Classic Networks

LeNet5

LeNet5 is a CNN architecture introduced by LeCun et al. in 1998; the paper is here. The application is the recognition of handwritten digits. This paper was one of the first to make a serious splash in the field. Because of this, some of the authors’ design decisions were made without the benefit of the lush and active field we have today. For example, the use of non-ReLU activation functions and of average pooling instead of max pooling are things you don’t really see today. Still, there are some key ideas that the authors introduced.

A big idea here is that of progressively reducing the height and width of the input volume in each layer, while simultaneously increasing the depth. That is, $n_H, n_W$ go down the layers while $n_C$ goes up. Another idea is to slap fully connected layers, two consecutive ones in this paper, before a softmax layer for the classification. The last four layers in LeNet5 go from a $5 \times 5 \times 16$ convolutional volume to a $120$ hidden unit fully connected layer, then into an $84$ hidden unit fully connected layer, and finally into a softmax layer with $10$ classes. LeNet5’s architecture has about $60,000$ parameters, which is minuscule compared to modern approaches, and even to AlexNet.
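
To make these layer sizes concrete, here is a minimal Keras sketch of a LeNet5-style network (a modernized re-implementation for illustration, not the authors’ original code), keeping the tanh-style activations and average pooling described above:

```python
import tensorflow as tf

# A LeNet5-style sketch: 32x32 grayscale input, two conv/average-pool stages,
# then the 120 -> 84 -> 10 fully connected head described above (~60k parameters).
lenet5 = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(6, kernel_size=5, activation="tanh"),   # 28x28x6
    tf.keras.layers.AveragePooling2D(pool_size=2),                 # 14x14x6
    tf.keras.layers.Conv2D(16, kernel_size=5, activation="tanh"),  # 10x10x16
    tf.keras.layers.AveragePooling2D(pool_size=2),                 # 5x5x16
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="tanh"),
    tf.keras.layers.Dense(84, activation="tanh"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
lenet5.summary()  # total parameter count comes out to roughly 60,000
```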

AlexNet

This architecture was introduced by Krizhevsky, et al. at the ImageNet competition in 2012; the paper is here. AlexNet is very similar to LeNet5, but it’s much bigger, with around $60$ million parameters, about $1,000$ times more than LeNet5. The authors dealt with the increased computational cost by training in parallel across two GPUs, and also used ReLU activation functions and local response normalization; the latter is a normalization technique applied point-wise across the channels of a volume and is not commonly used today.

VGG16

This architecture was introduced by Simonyan and Zisserman in 2015. Compared to AlexNet, where there are many convolutional layers with different filter sizes and strides, the authors focused on network depth. By using the same hyperparameters across many layers ($3 \times 3$ filters with stride $1$ and “same” padding, and $2 \times 2$ max pooling with stride $2$), they were able to train a deeper network of around 16 to 19 layers. The network is still very big relative to AlexNet and LeNet5, with around $138$ million parameters.
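
To illustrate the “same hyperparameters everywhere” idea, here is a minimal sketch of the repeated VGG-style pattern; the first two blocks of VGG16 are shown, and the rest of the network just keeps stacking the same pattern before the fully connected head:

```python
import tensorflow as tf

def vgg_block(x, num_convs, filters):
    """A VGG-style block: repeated 3x3 'same' convolutions, then 2x2 max pooling."""
    for _ in range(num_convs):
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(x)

inputs = tf.keras.layers.Input(shape=(224, 224, 3))
x = vgg_block(inputs, num_convs=2, filters=64)   # 112x112x64
x = vgg_block(x, num_convs=2, filters=128)       # 56x56x128
# ...VGG16 keeps stacking blocks (256, 512, 512 filters) before the dense layers.
model = tf.keras.Model(inputs, x)
```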

ResNets

We move away from “classic” networks by introducing a key component: the skip connection. He, et al. introduce residual networks, or ResNets, which use skip connections to improve performance. A skip connection is simple: the activations from layer $l$ are added to $z^{[l+2]}$, before the non-linearity, so that the activation formula for layer $l+2$ is:

$$ a^{[l+2]} = g^{[l+2]}(z^{[l+2]} + a^{[l]}) $$

To read more about the authors’ rationale for coming up with this idea, you can read the original paper. However, there are some key results that arise from using skip connections.

In theory, adding more layers to a plain vanilla network should only help with respect to the training error: more capacity means less error. In reality, however, adding more layers usually hurts the performance of our optimization algorithm. Depth offers diminishing returns, and even negative returns after some point, in terms of training error. Skip connections make training error decrease (roughly) monotonically with depth, allowing us to train much deeper networks, often $100$ layers or more, without this tradeoff between depth and training set performance.

Why do ResNets Work?

The key idea as to why ResNets have the almost magical property of undoing the link between depth and training set performance is that they allow the network to efficiently learn the identity function. Remember that the identity function is a function that always returns its input. Intuitively, this means that in the worst case, using skip connections will be no worse than not using them. In practice, using skip connections is actually better because we add model capacity without paying the optimization “cost” that extra layers carry in a plain network.

Mathematically, think about the following. We already established that a skip connection simply adds the activations from two layers back to the pre-activations of the current layer:

$$ a^{[l+2]} = g^{[l+2]}(z^{[l+2]} + a^{[l]}) $$

We can expand $z^{[l+2]}$:

$$ a^{[l+2]} = g^{[l+2]}(W^{[l+2]}a^{[l+1]}+b^{[l+2]} + a^{[l]}) $$

If $W^{[l+2]}, b^{[l+2]} = 0$, then:

$$ \begin{aligned} a^{[l+2]} &= g^{[l+2]}(a^{[l]}) \\ a^{[l+2]} &= a^{[l]} \end{aligned} $$

This means that if adding an extra layer actually hurts the network, the network can learn parameters that “undo” the additional layer! You might be wondering why the last equality, $g^{[l+2]}(a^{[l]}) = a^{[l]}$, holds: it holds because we use ReLU as the activation function for all layers, so $a^{[l]} \geq 0$, and ReLU is the identity function for non-negative inputs; pretty neat.

This only works if $a^{[l]}$ and $z^{[l+2]}$ have the same dimensions. If this is not the case, you can multiply $a^{[l]}$ by a matrix $W_s$, in practice a $1 \times 1$ convolution, to make sure that the dimensions align.
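
Here is a minimal Keras sketch of a residual block under these assumptions: an identity shortcut when the dimensions match, and a $1 \times 1$ convolution on the shortcut when they don’t (the layer sizes are just placeholders):

```python
import tensorflow as tf

def residual_block(a_l, filters, downsample=False):
    """a[l] -> two conv layers -> add the shortcut -> ReLU, i.e. g(z[l+2] + a[l])."""
    strides = 2 if downsample else 1
    x = tf.keras.layers.Conv2D(filters, 3, strides=strides, padding="same",
                               activation="relu")(a_l)
    z = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)  # z[l+2], activation applied after the add

    shortcut = a_l
    if downsample or a_l.shape[-1] != filters:
        # Dimensions don't match: use a 1x1 convolution (W_s) to align them.
        shortcut = tf.keras.layers.Conv2D(filters, 1, strides=strides)(a_l)

    return tf.keras.layers.ReLU()(tf.keras.layers.Add()([z, shortcut]))  # a[l+2]

inputs = tf.keras.layers.Input(shape=(56, 56, 64))
out = residual_block(inputs, filters=64)                  # identity shortcut
out = residual_block(out, filters=128, downsample=True)   # 1x1 convolution on the shortcut
```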

Networks in Networks | 1x1 Convolutions

This idea is pretty cool, and it’s used in a lot of the other architectures discussed later in this week’s content. You might already have thought about this: what happens if we convolve with a filter of size $1$, that is $f=1$, and use a “same” padding? Lin et al. propose this approach and call it a “network in network” approach. Let’s think about the basic case first: when the number of channels $n_C = 1$. In that case, convolving a matrix with a $1\times1$ filter is equivalent to matrix-scalar multiplication; we just multiply every entry $M[i, j]$ by the scalar value in the filter, to get $M^*[i, j] = M[i,j] \times f_0$. But what happens if the number of channels is not one? What happens if we have some volume as an input?

Say that we have an input volume with dimensions $6 \times 6 \times 32$, and we convolve it with a filter of dimensions $1 \times 1 \times 32$. After applying a ReLU to the convolution, each entry in the feature map will be the ReLU applied to a dot product. Specifically, the dot product between the filter and a $1 \times 1$ slice of the input volume. We can think of the $1 \times 1 \times 32$ filter as a rod or column, which is dotted with each “rod” in the input volume, where the length of the rod spans the channels dimension. Because each entry in the feature map is a dot product followed by a non-linearity, this is exactly what a unit in a fully connected layer computes, applied at every spatial position; hence the name: network in network.

Okay, but why use it? Beyond having the ability to represent a hidden layer across the channels dimension of the input in a CNN context, there is another powerful application of this approach: dimensionality reduction. If we have a $28 \times 28 \times 192$ input, and we convolve it with $32$ different $1 \times 1$ filters, then we will get a $28 \times 28 \times 32$ output. Notice that the dimensionality reduction occurs only in the channel dimension and not the height and width of the input volume. The idea of “shrinking” or “compressing” the input volume is a key idea of inception networks.
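
A quick sketch of this channel-wise compression in Keras, using the numbers above:

```python
import tensorflow as tf

# 32 filters of size 1x1x192: shrinks the channel dimension, keeps height and width.
inputs = tf.keras.layers.Input(shape=(28, 28, 192))
compressed = tf.keras.layers.Conv2D(32, kernel_size=1, activation="relu")(inputs)
print(compressed.shape)  # (None, 28, 28, 32)
```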

Inception Network

Szegedy, et al. use the idea of $1 \times 1$ convolutions as a solution to a computational problem introduced by a new type of layer they propose: the inception layer. Inception layers allow us to use multiple convolutional operations within a layer and stack the results. Let’s say that we have input dimensions $28 \times 28 \times 192$, and we want to use multiple filter sizes: $1 \times 1, 3 \times 3, 5 \times 5$. Additionally, we also want to apply max pooling to the input. Why do this? We don’t know which filter size to use ex ante, so we’d like to try them all. Let’s say that we use:

  • $64$, $1 \times 1$ filters. Giving us an output of $28 \times 28 \times 64$.
  • $128$, $3 \times 3$ filters. Giving us an output of $28 \times 28 \times 128$.
  • $32$, $5 \times 5$ filters. Giving us an output of $28 \times 28 \times 32$.
  • Max pooling with padding to get a “same” output, followed by $32$, $1 \times 1$ filters to shrink the channels. Giving us an output of $28 \times 28 \times 32$.

Notice that the heights and widths match up, so that we can actually stack all of these outputs together into a $28 \times 28 \times 256$ volume.
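
Here is a minimal sketch of such an inception module in Keras, with one branch per filter size plus the pooling branch, all concatenated along the channel dimension (this is the version without the bottlenecks discussed next):

```python
import tensorflow as tf

def inception_module(x):
    """Four parallel branches whose 28x28 outputs are stacked along the channels."""
    branch_1x1 = tf.keras.layers.Conv2D(64, 1, padding="same", activation="relu")(x)
    branch_3x3 = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    branch_5x5 = tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu")(x)
    branch_pool = tf.keras.layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(x)
    branch_pool = tf.keras.layers.Conv2D(32, 1, activation="relu")(branch_pool)  # 192 -> 32 channels
    return tf.keras.layers.Concatenate()([branch_1x1, branch_3x3, branch_5x5, branch_pool])

inputs = tf.keras.layers.Input(shape=(28, 28, 192))
out = inception_module(inputs)
print(out.shape)  # (None, 28, 28, 256), i.e. 64 + 128 + 32 + 32 channels
```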

The problem, of course, is computational cost. Focusing on the $5 \times 5$ filters: when using $32$ filters, we need to do a lot of multiplications and summations. Each of the $32$ filters has dimensions $5 \times 5 \times 192$. The feature map has dimensions $28 \times 28 \times 32$. For each of these entries we need to compute $5 \times 5 \times 192 = 4800$ multiplications, and we have $28 \times 28 \times 32 = 25,088$ of these; for a total of $4800 \times 25,088 = 120,422,400$, or around $120$ million operations. And these are just the $5 \times 5$ filters! Here is where $1 \times 1$ convolutions can help as dimensionality reduction.

Before doing the convolution operations, we “compress” the input using $1 \times 1$ convolutions. If the input is $28 \times 28 \times 192$, and we use $16$, $1 \times 1$ filters, we will get an output with dimensions $28 \times 28 \times 16$. The layer doing the “compressing” is called the bottleneck layer. How many operations are we doing now? There are two steps: the bottleneck layer and the convolutional layer.

The bottleneck layer takes a $28 \times 28 \times 192$ input and convolves it with $16$ different $1 \times 1 \times 192$ filters. The feature map dimensions should be $28 \times 28 \times 16$ and each of these numbers is the result of $1 \times 1 \times 192 = 192$ multiplications. For a total of $28 \times 28 \times 16 \times 192 = 2,408,448$ or around $2.4$ million. The second part takes a $28 \times 28 \times 16$ input and convolves it with $32$ different $5 \times 5 \times 16$ filters. The feature map dimensions should be $28 \times 28 \times 32$ and each of these numbers is the result of $5 \times 5 \times 16 = 400$ multiplications. For a total of $28 \times 28 \times 32 \times 400 = 10,035,200$ or around $10$ million. Adding both we get around $12.4$ million, which is much less than the original $120$ million. Of course, the number of filters in the bottleneck layer regulates the computational ratio between the two approaches. We now have a way to use dimensionality reduction to cut computational cost.
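
A quick sanity check of these numbers in Python, counting only the multiplications for the $5 \times 5$ branch, with and without the bottleneck:

```python
H, W, C_in = 28, 28, 192

# Direct 5x5 convolution with 32 filters.
direct = (5 * 5 * C_in) * (H * W * 32)

# 1x1 bottleneck down to 16 channels, then 5x5 convolution with 32 filters.
bottleneck = (1 * 1 * C_in) * (H * W * 16)
conv_after = (5 * 5 * 16) * (H * W * 32)

print(direct)                    # 120422400 (~120 million)
print(bottleneck + conv_after)   # 12443648  (~12.4 million)
```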

Inception Network Architecture

The authors then use inception layers as the building blocks of the network, stacking consecutive inception layers into a very deep network, interspersed with max-pooling blocks to regulate the volume dimensions. Another interesting idea is that they add softmax output branches not only at the end of the network, but also attached to some intermediate layers. These auxiliary classifiers keep the intermediate inception layers from drifting too far from their actual job, which is classification. We don’t use the output from these inner softmax layers for the final classification; they are there solely to “ground” (regularize) the optimization process.

MobileNet

As we probably know by now, CNNs offer big computational savings compared to fully connected neural networks. However, the larger architectures in the classic section still have tens to hundreds of millions of parameters. The ideas behind MobileNet were fueled by the computational constraints of mobile devices, such as cellphones or tablets, where most devices still have single-digit gigabytes of RAM. Think even of security cameras; there are clear benefits to running classification models locally on embedded systems instead of relying on networking. One of the key ideas used by Howard, et al. is that of depthwise-separable convolutions.

Let’s start by reviewing normal convolutions. Imagine that we have a $6 \times 6 \times 3$ input volume, and we convolve it with a $3 \times 3$ filter. Each filter is also a volume, a $3 \times 3 \times 3$ volume. The output feature map will have a dimension of $4 \times 4 \times n_C$ where $n_C$ is the number of filters; we get this number using our trusty formula from the first week. Imagine that we use $5$ such $3 \times 3 \times 3$ filters. To calculate the computational cost we multiply the number of filter parameters times the number of filter positions times the number of filters:

$$ \text{Computational Cost} = \text{\# filter parameters} \times \text{\# filter positions} \times \text{\# of filters} $$

In our case each filter has $3 \times 3 \times 3 = 27$ parameters, there are $4 \times 4$ filter positions, one for each entry in the feature map output, and we have $5$ such filters. Totaling $3 \times 3 \times 3 \times 4 \times 4 \times 5 = 27 \times 16 \times 5 = 2160$ multiply operations. Using depthwise-separable convolutions we can reduce this number. Depthwise-separable convolutions factor regular convolutions into two operations: depth-wise convolution, and point-wise convolution. Let’s start with depth-wise convolution.

Depth-wise Convolution

Depth-wise convolution is the same as regular convolution, but instead of convolving the input volume with filters that are also volumes, we convolve it with as many 2D filters as there are channels in the input, one filter per channel. If our input volume has dimensions $6 \times 6 \times 3$, then we will do depth-wise convolution with $3$ different $3 \times 3$ filters, each applied separately to its own channel. We use $3$ filters because $n_C = 3$. In our case the computational cost is:

$$ \begin{aligned} \text{Computational Cost} &= \text{\# filter parameters} \times \text{\# filter positions} \times \text{\# of filters} \\ \text{Computational Cost} &= (3 \times 3) \times (4 \times 4) \times 3 \\ \text{Computational Cost} &= 9 \times 16 \times 3 \\ \text{Computational Cost} &= 432 \end{aligned} $$

About $20\%$ of the original computational cost, but we’re not done yet. We got an output of $4 \times 4 \times 3$, and we want an output of $4 \times 4 \times 5$ since we were originally using $5$ filters. We go from a $4 \times 4 \times 3$ input to a $4 \times 4 \times 5$ output feature map via point-wise convolution.

Point-wise Convolution

Point-wise convolution is where the $1 \times 1$ convolution idea comes back. Our input is of dimension $4 \times 4 \times 3$, but we want a $4 \times 4 \times 5$ output; we can get that by stacking $5$ different $1 \times 1 \times 3$ filters in a volume. That is, convolving our $4 \times 4 \times 3$ input with $5$ different $1 \times 1 \times 3$ filters, we will get an output with dimensions $4 \times 4 \times 5$. Instead of projecting our features into a lower dimensional space, as is the case in dimensionality reduction, we are doing the opposite; we are linearly combining the features into more of them. The computational cost of point-wise convolution is:

$$ \begin{aligned} \text{Computational Cost} &= \text{\# filter parameters} \times \text{\# filter positions} \times \text{\# of filters} \\ \text{Computational Cost} &= (1 \times 1 \times 3) \times (4 \times 4) \times 5 \\ \text{Computational Cost} &= 3 \times 16 \times 5 \\ \text{Computational Cost} &= 240 \end{aligned} $$

About $10\%$ of the original computation cost, and this time we are done. We went from $2160$ to $240 + 432 = 672$ multiply operations, which is a lot better. In this toy example that’s roughly a third of the cost; for typical layer sizes, with many more filters, depth-wise separable convolutions end up being around $10$ times cheaper than regular convolutions. If you’re thinking that such a computational boon does not come free of charge, then you are right. We get cheaper computation by making a key assumption: that spatial (width and height) structure and cross-channel (depth) structure can be factored apart and learned separately. That might sound like a bold assumption in computer vision; intuitively, color is not independent of shape. However, in practice it works extremely well in terms of model performance, and many applications are willing to pay a small accuracy hit for roughly $10$ times cheaper compute.
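
Here is a minimal Keras sketch of the two-step factorization, matching the $6 \times 6 \times 3$ example above (Keras also bundles both steps into a single SeparableConv2D layer):

```python
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(6, 6, 3))

# Depth-wise step: one 3x3 filter per input channel, applied channel by channel.
x = tf.keras.layers.DepthwiseConv2D(kernel_size=3, activation="relu")(inputs)  # 4x4x3

# Point-wise step: five 1x1x3 filters linearly combine the channels.
x = tf.keras.layers.Conv2D(5, kernel_size=1, activation="relu")(x)             # 4x4x5

model = tf.keras.Model(inputs, x)
model.summary()
```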

MobileNet Architecture

The original MobileNet (v1) used $13$ depthwise-separable convolution layers followed by pooling, fully connected, and softmax layers, with great success. The network offered about the same performance as networks with normal convolutional layers for about a tenth of the computational price. This, of course, was not enough; the second version (v2) uses a more sophisticated approach.

MobileNet v2 composes more steps into each convolutional block. First, we add skip connections, the same way we did in ResNets. We also add two additional steps: expansion before the depthwise convolution, and projection after it. How do we expand and project? Using $1 \times 1$ convolutions, of course!

Sandler, et al. borrow the bottleneck idea from inception networks: the expansion and projection steps are both $1 \times 1$ convolutions. Say that we start with an $n \times n \times 3$ input. Before doing the depthwise convolution, we expand this volume using $18$ different $1 \times 1$ filters ($18$ is just a number picked for the example). We get an output of $n \times n \times 18$. This step constitutes expansion. Now, we do a depthwise convolution (with “same” padding) on the $n \times n \times 18$ volume, and we get an $n \times n \times 18$ output. Finally, we project the volume into a lower dimensional space via point-wise convolution, using $3$ different $1 \times 1$ filters ($3$ just matches the input channels), getting an $n \times n \times 3$ output. This step constitutes projection.

This is pretty nifty: we take some volume and expand it, linearly combining its features to generate new ones. Afterwards we do the cheap version of convolution, the depthwise convolution, learning new features from the expanded ones. Finally, we apply dimensionality reduction via projection to keep memory usage low. By using expansion, we allow the network to learn a richer function; by using projection, we keep the memory usage low. In practice the MobileNet v2 paper uses this “bottleneck” block (expansion, depthwise convolution, and projection) 17 times before running through the usual pooling, fully connected, and softmax layers.
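
Here is a minimal sketch of such a bottleneck block, using the toy channel numbers above; the real network also adds batch normalization and ReLU6 activations, and only uses the skip connection when the input and output dimensions match:

```python
import tensorflow as tf

def bottleneck_block(a_in, expand_channels=18, out_channels=3):
    """Expansion (1x1) -> depthwise 3x3 -> projection (1x1) -> residual add."""
    x = tf.keras.layers.Conv2D(expand_channels, 1, activation="relu")(a_in)       # n x n x 18
    x = tf.keras.layers.DepthwiseConv2D(3, padding="same", activation="relu")(x)  # n x n x 18
    x = tf.keras.layers.Conv2D(out_channels, 1)(x)  # linear projection, n x n x 3
    return tf.keras.layers.Add()([a_in, x])         # skip connection, as in ResNets

inputs = tf.keras.layers.Input(shape=(32, 32, 3))   # n = 32 just for concreteness
out = bottleneck_block(inputs)
print(out.shape)  # (None, 32, 32, 3)
```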

Practical Advice for Using ConvNets

Transfer Learning

As we’ve seen, networks get pretty big and computationally expensive. Many papers release their models as open-source software via public repositories in version control systems. You can use transfer learning, covered in Structuring Machine Learning Models | Week 2. The key idea is to “freeze” all the layers except the last one from a pre-trained model, and retrain the last layer using our own data. Sometimes you can freeze fewer layers, let’s say the first $90\%$ of the layers, although this number is not set in stone. Think about what each layer is doing: freezing the earlier layers means reusing the lower-level features from the pre-trained model, while freezing the later layers means reusing the higher-level features from the pre-trained model. Where in the feature hierarchy the optimal overlap lies depends on each combination of application and pre-trained model.
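
As a sketch of what this looks like in Keras, assuming we want to reuse a MobileNetV2 pre-trained on ImageNet for a hypothetical $5$-class task:

```python
import tensorflow as tf

# Load a model pre-trained on ImageNet, without its original softmax head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # "freeze" every pre-trained layer

# Add our own head and train only these new parameters on our data.
inputs = tf.keras.layers.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)  # 5 classes is just an example

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```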

Data Augmentation

Data augmentation means generating new training samples from our original training samples via some process. Within the context of computer vision, this is usually done via mirroring and random cropping. Color shifting is also possible, for example via PCA color augmentation, a technique used in the AlexNet paper. There are also computational approaches that implement data augmentation concurrently with training: one or more CPU threads load and distort images while the model trains on the GPU. Concurrency in scientific computing is a huge field, and it makes sense for it to show up in any computationally intensive application.
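
Here is a minimal sketch of mirroring and random cropping inside a tf.data pipeline, where the augmentation runs on CPU threads in parallel with training; the tiny in-memory dataset is just a stand-in for real images:

```python
import tensorflow as tf

def augment(image, label):
    """Mirror horizontally at random and take a random 224x224 crop."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_crop(image, size=[224, 224, 3])
    return image, label

# Stand-in dataset yielding (256x256x3 image, label) pairs.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([8, 256, 256, 3]), tf.zeros([8], dtype=tf.int32)))

train_ds = (dataset
            .map(augment, num_parallel_calls=tf.data.AUTOTUNE)  # augment on CPU threads
            .batch(4)
            .prefetch(tf.data.AUTOTUNE))                        # overlap with training
```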

Next week’s post is here.