Final week of this course. Again, this week is pretty technical, and a lot of the learning happens while coding up your own examples in the weekly programming assignments. The purpose of this week is to extend the previous weeks' ideas to $L$-layered networks.

This week’s topics are:

  • Deep L-Layer neural network
  • Getting your matrix dimensions right
  • Why deep representations?
  • Parameters and Hyperparameters

Deep L-Layer neural network

The number of hidden layers in a neural network determines whether it is “shallow” or “deep”. Exactly how many layers counts as deep or shallow is not set in stone.

More notation is introduced to have an explicit way of communicating two things:

  1. The number of layers
  2. The “height” of each layer, i.e. how many hidden units are in each layer.

Therefore, the following notation is introduced:

  • $L$ is the number of layers in your network. This includes the output layer, but it does not include the input layer (i.e. your features). The simple reason is that the input layer is usually layer $0$, so that your output layer is layer $L$.
  • $n^{[l]}$ denotes the number of hidden units or nodes in layer $l$. This is the “height” of your hidden layer.
  • $a^{[l]}$ is the corresponding activation for layer $l$. Remember that layers can have different activation functions so that $A^{[l]} = g^{[l]}(Z^{[l]})$, where $g^{[l]}$ is the activation function used in layer $l$.

A key thing to remember is that for any layer, you calculate $A^{[l]}$ as: $$ \begin{equation} Z^{[l]} = W^{[l]}A^{[l - 1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}(Z^{[l]}) \end{equation} $$

Remember that the input layer is usually denoted $A^{[0]} = X$.
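
To make the per-layer computation concrete, here is a minimal NumPy sketch of the forward pass through an $L$-layer network. The `layer_dims` list, the ReLU/sigmoid choice of activations, and the random initialization are all assumptions made for illustration; in the actual assignments the parameters are initialized once and then learned.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_pass(X, layer_dims, seed=0):
    """Compute A[l] = g[l](W[l] A[l-1] + b[l]) for l = 1..L.

    layer_dims = [n_x, n_1, ..., n_L]; X has shape (n_x, m).
    Parameters are generated randomly here only to illustrate the shapes;
    in practice they are learned via training.
    """
    rng = np.random.default_rng(seed)
    A = X                        # A[0] = X
    L = len(layer_dims) - 1      # input layer is not counted in L
    for l in range(1, L + 1):
        W = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        b = np.zeros((layer_dims[l], 1))
        Z = W @ A + b                           # linear step
        A = sigmoid(Z) if l == L else relu(Z)   # g[l] can differ per layer
    return A

# Hypothetical example: 3 features, two hidden layers of 4 units, 1 output unit.
X = np.random.rand(3, 5)                     # (n_x, m) with m = 5 samples
print(forward_pass(X, [3, 4, 4, 1]).shape)   # (1, 5)
```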

This is pretty much all there is to deeper networks. The caveat, however, is that this applies to fully-connected, feed-forward neural networks. In the fourth course, we will go over different architectures which try to represent information in more novel ways.

Getting your matrix dimensions right

If you’re implementing this from scratch, making sure that your dimensions are right is usually the first step in debugging. In other architectures such as CNNs it’s more involved, and therefore it becomes more important to keep track of the dimensions, at least in your head. Here are the key things to keep in mind:

The dimensions of arrays in vector land:

  • $W^{[l]} = (n^{[l]}, n^{[l - 1]})$
  • $b^{[l]} = (n^{[l]}, 1)$
  • $z^{[l]} = a^{[l]} = (n^{[l]}, 1)$

In a single equation:

$$ \begin{equation} \underset{(n^{[l]}, 1)}{z^{[l]}} = \underset{(n^{[l]}, n^{[l -1]})}{W^{[l]}}\underset{(n^{[l - 1]}, 1)}{a^{[l - 1]}} + \underset{(n^{[l]}, 1)}{b^{[l]}} \end{equation} $$

Now, in matrix land:

  • $W^{[l]} = (n^{[l]}, n^{[l - 1]})$ (remains the same)
  • $b^{[l]} = (n^{[l]}, 1)$ (remains the same)
  • $Z^{[l]} = A^{[l]} = (n^{[l]}, m)$, where $m$ is the number of training samples.

Again, in a single equation:

$$ \begin{equation} \underset{(n^{[l]}, m)}{Z^{[l]}} = \underset{(n^{[l]}, n^{[l - 1]})}{W^{[l]}}\underset{(n^{[l - 1]}, m)}{A^{[l-1]}} + \underset{(n^{[l]}, 1)}{b^{[l]}} \end{equation} $$

Notice that adding $b^{[l]}$ to the product $W^{[l]}A^{[l-1]}$ is done via broadcasting with NumPy!

A note about taking derivatives: each derivative has the same dimensions as the array it corresponds to, e.g. $dW^{[l]}$ has the same shape as $W^{[l]}$ and $db^{[l]}$ the same shape as $b^{[l]}$.
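
The small NumPy check below makes the broadcasting and the shape bookkeeping explicit; `n_prev`, `n_l`, and `m` are made-up sizes chosen only to show the dimensions, and `dZ` is just a stand-in for a real gradient.

```python
import numpy as np

n_prev, n_l, m = 4, 3, 5             # n[l-1], n[l], number of samples (made up)

W = np.random.randn(n_l, n_prev)     # (n[l], n[l-1])
b = np.zeros((n_l, 1))               # (n[l], 1)
A_prev = np.random.randn(n_prev, m)  # (n[l-1], m)

Z = W @ A_prev + b                   # b is broadcast across the m columns
assert Z.shape == (n_l, m)

# Gradients keep the shapes of the arrays they correspond to.
dZ = np.random.randn(*Z.shape)                  # stand-in for a real gradient
dW = (dZ @ A_prev.T) / m
db = np.sum(dZ, axis=1, keepdims=True) / m
assert dW.shape == W.shape and db.shape == b.shape
```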

Why deep representations?

We have been mentioning how deep neural networks amount to automatic feature generation, and how this is what really sets deep neural networks apart from other contemporaneous models. Therefore, a key point is that they do not just need to be “big”; they need to be deep, i.e. have several hidden layers. How many? Keep reading.

The basic idea is that as you run the features through your network, they are combined into more abstract features. The example used in the course is going from raw audio to phonemes, from phonemes to words, and from words to sentences; and of course the magic is that you, the programmer, don’t need to know what these are, but the machinery figures them out by itself through optimization.

An important idea mentioned in the course is the relationship between depth and height, that is, the number of hidden units per layer. First, think of a neural network as trying to learn a function $f(x)$ that maps a vector $x \mapsto y$. Now, there are some functions that you can estimate using “small” deep $L$-layer networks. However, if you want to use a shallower network and keep the same level of performance in the estimation, you will need exponentially more hidden units.

More precisely, if you have $n$ features and you want to compute the exclusive or (XOR) of all of them, a deep network only needs a depth on the order of $O(\log n)$, with a total number of nodes on the order of $n$. On the other hand, if you’re forced to use a shallow network with a single hidden layer, you will need $O(2^n)$ hidden units. In summary, adding layers to your network is much more computationally efficient than growing the size of a hidden layer. Mhaskar, Liao and Poggio show this in much more detail in their paper.
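
As a rough illustration of the depth argument, the sketch below computes the XOR of $n$ bits with a balanced tree of pairwise XOR “gates”, so the number of levels grows like $\log_2 n$ while the total number of gates stays roughly linear in $n$. This is only an illustrative reduction, not a neural network implementation.

```python
# XOR of n bits via a balanced tree of pairwise XOR "gates":
# depth ~ log2(n) levels, ~n - 1 gates in total.
def xor_tree(bits):
    level, depth = list(bits), 0
    while len(level) > 1:
        # Pair up neighbours; an odd leftover element passes through unchanged.
        level = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)] + \
                (level[-1:] if len(level) % 2 else [])
        depth += 1
    return level[0], depth

value, depth = xor_tree([1, 0, 1, 1, 0, 1, 0, 1])  # n = 8
print(value, depth)  # parity of the bits, depth = 3 = log2(8)
```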

Parameters and Hyperparameters

The parameters $W^{[l]}, b^{[l]}$ are the things you derive via training. On the other hand, hyperparameters are fixed during training and must be set before training starts.

A good way to think about the difference between parameters and hyperparameters comes from Tamara Broderick’s slides. In machine learning we usually want to evaluate a hypothesis $h(x)$ that maps $x \mapsto y$. A hypothesis can belong to a hypothesis class. A hypothesis class $H$ used by a learning algorithm is the set of all classifiers considered by it. 1 For example a linear classifier considers all classifiers whose decision boundary is linear. During training the algorithm will search within the hypothesis class for a particular hypothesis that minimizes the cost. Hyperparameters are related to the hypothesis class, while parameters are related to a particular hypothesis.

Hyperparameters are usually selected via hyperparameter tuning, which is the topic of the next course.

Some hyperparameters in neural networks are:

  • Learning rate $\alpha$: the step-size of gradient descent.
  • Number of iterations: how many times the model will observe the entire training set.
  • Number of hidden layers.
  • Size of hidden layers.
  • Which activation functions to use, and on which layers.

As you can probably see, there are a lot of hyperparameters to search over, especially considering the size of the set of all combinations.
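
To get a feel for how quickly that set grows, here is a small sketch that enumerates a grid of candidate hyperparameter values with `itertools.product`; the particular values in the grid are made up purely for illustration.

```python
from itertools import product

# Hypothetical grid of candidate values for each hyperparameter.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "num_iterations": [1000, 3000],
    "num_hidden_layers": [1, 2, 3],
    "hidden_layer_size": [8, 16, 32],
    "activation": ["relu", "tanh"],
}

combinations = list(product(*grid.values()))
print(len(combinations))  # 3 * 2 * 3 * 3 * 2 = 108 settings to try
```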

The post for the first week of the next course can be found here.


  1. Hypothesis Class | Carnegie Mellon University ↩︎