This is the first week of the third course of DeepLearning.AI’s Deep Learning Specialization offered on Coursera. This course is less technical than the previous two, and focuses instead on general principles and intuition related to machine learning projects.

This week’s topics are:

  • Introduction to ML Strategy
  • Setting Up our Goal
  • Comparing to Human-Level Performance

Introduction to ML Strategy

Why ML Strategy

Whenever we are working on a machine learning project, and after we have completed our first iteration on the approach, there might be many things to try next. Should we get more data? Try regularization? Try a bigger network? So many things to choose from. This is why we need high-level heuristics to guide our strategy.

Orthogonalization

Because there are so many things that could be changed simultaneously, a key idea is that of orthogonalization. In the context of computer science, orthogonality refers to a system design property whereby changing one thing changes that one thing only and nothing else. For example, changing the brightness setting on your phone changes only the brightness, and not whether your phone is on silent or not. We can say that these two components are orthogonal.

In our machine learning setting, we want to define some things that we can change in an orthogonal way. For example, we want to be able to reduce bias, and reduce bias only, without any side effects. Some of the goals we want to achieve during a project are (in order):

  1. Fit training set well on cost function.
  2. Fit dev set well on cost function.
  3. Fit test set well on cost function.
  4. Performs well in the real world.

We can make progress toward each of these goals in an orthogonal way via different processes.

As a counter-example to orthogonalization, think about early stopping. With early stopping you fit your training set less well but at the same time (hopefully) fit your dev set better; that is changing two things at once. This is why early stopping is usually not a suggested approach in the course.

Setting Up our Goal

Single Number Evaluation Metric

The key idea to making progress in our project is to be able to quickly compare different approaches. We can do this by defining a single real-valued evaluation metric.

As an example, suppose you have two classifiers, $A$ and $B$, with the following performance metrics:

  • $A$ - Precision: $95\%$, Recall: $90\%$
  • $B$ - Precision: $98\%$, Recall: $85\%$

Remember that precision is the proportion of true positives among all predicted positives (true positives plus false positives), while recall is the proportion of true positives among all actual positives (true positives plus false negatives).

You might have an application where recall is more important than precision: classifying flight risk when setting bail. In this case your single metric would be recall. However, if you’re not sure, how can you compare classifiers $A$ and $B$?

In this case we can combine precision and recall into the F1 score, which is the harmonic mean of precision and recall.
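To make the comparison concrete, here is a minimal sketch in plain Python (using the precision and recall numbers from the example above) of how the F1 score collapses the two metrics into one:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs for classifiers A and B from the example above
print(f"A: F1 = {f1_score(0.95, 0.90):.4f}")  # ~0.9243
print(f"B: F1 = {f1_score(0.98, 0.85):.4f}")  # ~0.9104 -> A wins on the single metric
```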

Another example is wanting a model that performs well across different geographical regions. If you have $K$ geographical regions you might have $K$ different evaluation metrics, one for each region. The solution here is to take the average of each classifier’s metric across all regions and pick the best one.
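A minimal sketch of the same idea, with made-up per-region error rates for two hypothetical classifiers:

```python
# Hypothetical error rates (%) per geographical region for two classifiers
errors = {
    "A": {"US": 3.0, "China": 7.0, "India": 5.0, "Other": 9.0},
    "B": {"US": 5.0, "China": 6.0, "India": 5.5, "Other": 7.0},
}

# Collapse the K per-region metrics into a single number: the average error
averages = {name: sum(regions.values()) / len(regions) for name, regions in errors.items()}
best = min(averages, key=averages.get)
print(averages)                    # {'A': 6.0, 'B': 5.875}
print(f"Best classifier: {best}")  # B has the lower average error
```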

The key takeaway here is to have a single evaluation metric, and also to have a dev set. With these two things you can quickly iterate over different approaches and pick the best performing one.

Satisficing and Optimizing Metrics

Satisficing is a decision-making strategy introduced by the legendary Herbert A. Simon in 1947. It differs from optimizing in that a satisficing metric doesn’t have to be the best possible, just good enough.

In our setting, an optimizing metric might be the performance of our classifier, whereas latency or training time might be satisficing metrics. Being able to clearly identify which goals are which will help us iterate faster.

We could also treat our model’s performance as a satisficing metric: we are okay with up to some number of bad predictions over some period of time.

The main takeaway is that if you have $N$ metrics, you should have $1$ optimizing metric and $N-1$ satisficing metrics. This is because optimizing two metrics at once is complicated, and the two are not guaranteed to be aligned.
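As a sketch of how the “$1$ optimizing, $N-1$ satisficing” rule plays out, here is a hypothetical model-selection step where accuracy is the optimizing metric and latency is a satisficing one (all numbers made up):

```python
# Hypothetical candidate models: accuracy (optimizing) and latency (satisficing)
models = {
    "small":  {"accuracy": 0.90, "latency_ms": 20},
    "medium": {"accuracy": 0.94, "latency_ms": 80},
    "large":  {"accuracy": 0.95, "latency_ms": 300},
}

MAX_LATENCY_MS = 100  # satisficing threshold: just has to be good enough

# Keep only the models that satisfy the latency constraint...
feasible = {k: v for k, v in models.items() if v["latency_ms"] <= MAX_LATENCY_MS}
# ...then pick the best one on the single optimizing metric
best = max(feasible, key=lambda k: feasible[k]["accuracy"])
print(best)  # "medium": the most accurate model that is still fast enough
```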

Train/Dev/Test Distributions

Very simple: make sure our dev and test sets come from the same distribution. If we train a classifier to predict defaulting on a loan, having high-income zip codes in our dev set and low-income zip codes in our test set is a terrible idea. The dev and test sets should reflect the data we expect to get in the future and that matters for the main application of our machine learning model. In practice this usually means randomly shuffling your data before splitting it.
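A minimal sketch of that shuffle-then-split step with NumPy (assuming the data fits in memory and no special stratification is needed):

```python
import numpy as np

def shuffle_and_split(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the data so dev and test come from the same distribution, then split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]
    n_dev = int(len(X) * dev_frac)
    n_test = int(len(X) * test_frac)
    dev = (X[:n_dev], y[:n_dev])
    test = (X[n_dev:n_dev + n_test], y[n_dev:n_dev + n_test])
    train = (X[n_dev + n_test:], y[n_dev + n_test:])
    return train, dev, test
```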

Size of Dev and Test Sets

This was covered before as well.

The key takeaway is to make sure that our test set is big enough to give us high confidence that the result obtained is not due to random chance. This could be a lot less than $30\%$ of your data if you have a lot of data.
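A rough way to reason about “big enough”: if we treat each test example as an independent coin flip, the standard error of the measured accuracy shrinks like $1/\sqrt{n}$. A back-of-the-envelope sketch:

```python
import math

def accuracy_half_width(accuracy, n_test, z=1.96):
    """Approximate 95% confidence half-width of an accuracy estimate on n_test i.i.d. examples."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_test)

# Around 90% accuracy, 10,000 test examples pin the estimate down to roughly +/- 0.6%
for n in (100, 1_000, 10_000, 100_000):
    print(n, f"+/- {accuracy_half_width(0.9, n):.2%}")
```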

In some settings not having a test set might be okay, but it will never be better than having a test set.

When to Change Dev/Test Sets and Metrics?

If we find that our algorithm is performing “well” on our metric but has undesirable side effects when actually used, we should definitely change our splits and/or metrics.

The example in the course goes over a cat classifier, where classifier $A$ performs better than classifier $B$. However, $A$ also shows some illicit content to some users, while $B$ does not. Clearly you should use $B$. But how can we incorporate this new preference, not showing illicit content, into our metric?

The idea is to first think about how to express the new preference in the metric, e.g. as a penalty for illicit content. Afterwards we can separately (orthogonalization) think about how to do well on this new metric.

The key takeaway here is not to marry a metric and/or data split. We should be able to rapidly pivot if we gain new information about our problem and how our model is performing in relation to it.

Comparing to Human-Level Performance

Why Human-level performance?

Humans are very skilled at some tasks. Remember there are three performance levels for any classification task:

  1. Bayes error rate: The theoretical minimum error that any classifier could achieve, due to the inherent noise and ambiguity in the data. It’s generally not $0$.
  2. Human-level error: The error of the best human(s) at the task. It can never be lower than the Bayes error rate.
  3. Training error: The error of our model on the training set. It might be lower or higher than human-level error; if it drops below the Bayes error rate, the model is overfitting the training set.

Knowing the human-level performance is extremely useful, because it allows us to reason about our training error. If our model is doing better than human-level error, then we might be close to the Bayes error rate, and squeezing out further training-set improvements might result in overfitting. On the other hand, if we are still worse than human-level performance, then we should keep trying to improve our model’s performance on the training set. Whether our model is above or below human-level performance tells us what we should focus on next.

Avoidable Bias

The idea of comparing our model’s performance to that of humans is very powerful. In general, human-level performance is very close to the Bayes error rate, which means that we can usually treat human-level performance as the best we can do. The difference between our training error and human-level error (assuming the latter is a proxy for the Bayes error rate) is called avoidable bias. It’s called avoidable because, in general, we should be able to perform at least as well as humans.

To cement this idea, and how the comparison can affect the direction of your project, let’s go over a simple example. Imagine that we have some classification task with the following performance numbers:

  • Human-level performance: $1\%$, Training Error: $8\%$, Dev error: $10\%$
    • Because there is still about $8\% - 1\% = 7\%$ avoidable bias, we should focus on reducing bias. Train a larger model, train for longer, reduce regularization, etc.
  • Human-level performance: $7.5\%$, Training Error: $8\%$, Dev error: $10\%$
    • Because the avoidable bias ($8\% - 7.5\% = 0.5\%$) is smaller than the difference between the dev error and the training error ($10\% - 8\% = 2\%$), which is our measure of variance, we should focus on reducing variance.

So how come we reached different conclusions for two classifiers with the same training and dev errors? The only thing that changed is the human-level performance. Because we are treating human-level performance as a proxy for the Bayes error rate, if the human-level performance changes, everything changes. If we don’t know the human-level performance, we must make an educated guess about it.

The key thing here is to look at the difference between training error and human-level error, which we call avoidable bias. On the other hand, the difference between the dev error and the training error is a measure of variance. We should tackle whichever is greater first.
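A tiny sketch of this decision rule, using the error numbers from the example above:

```python
def diagnose(human_error, train_error, dev_error):
    """Compare avoidable bias (train - human) against variance (dev - train)."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "reduce bias" if avoidable_bias > variance else "reduce variance"
    return avoidable_bias, variance, focus

print(diagnose(human_error=1.0, train_error=8.0, dev_error=10.0))  # (7.0, 2.0, 'reduce bias')
print(diagnose(human_error=7.5, train_error=8.0, dev_error=10.0))  # (0.5, 2.0, 'reduce variance')
```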

Understanding Human-level Performance

When talking about human-level performance, we usually mean the best performing individual human or group of humans. We don’t care whether it takes a whole team of experts; we just want to know the best that humans can do. We care about this for two reasons:

  1. It tells us how low we should expect our training error to be.
  2. It will give us an upper bound of the Bayes error rate.

If we have a bias problem (we haven’t reached human-level performance yet), we should focus on that first. If we are relatively close to human-level performance but have a variance problem, we should focus on that instead.

Surpassing Human-level Performance

We have been talking about human-level performance for a while now, and you might be thinking: haven’t some models beaten humans? Yes, but only in certain problems. Of course the set of problems that have been solved with better-than-human performance is changing every day.

Within our context, the thing to keep in mind is that once our model is doing better than human-level performance, we might be very close to Bayes error rate; therefore any improvements in the training error might be very costly, and might actually make our model overfit.

Improving your Model Performance

Okay, so we introduced all these topics above; now how do we put them together?

When doing supervised learning, there are two fundamental assumptions, and how well each of them is being met tells us what to do next:

  1. We can fit the training set pretty well:
    • This means that we have eliminated all or almost all avoidable bias.
  2. The training set performance generalizes pretty well to the dev/test set.
    • This means that the variance of our model is acceptably low.

So if we still have an avoidable bias problem we can try the following:

  • Train a bigger model
  • Train longer and/or with better optimization algorithms
  • Try a specialized NN architecture and/or improve hyperparameters

If we have solved the bias problem but still have a variance problem, we can try the following (a short code sketch covering both cases follows the list):

  • Try getting more data
  • Add more regularization
  • Try a specialized NN architecture and/or improve hyperparameters
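The course doesn’t prescribe a framework, but as a hedged illustration of how the knobs in the two lists above map onto code, here is a minimal Keras sketch of a feed-forward classifier whose capacity (bias side) and regularization (variance side) can be dialed independently; the layer sizes and penalty values are made up:

```python
import tensorflow as tf

def build_model(hidden_units=128, l2=0.0, dropout=0.0, num_classes=10):
    """A simple classifier whose capacity and regularization can be tuned independently."""
    reg = tf.keras.regularizers.l2(l2) if l2 > 0 else None
    layers = [tf.keras.layers.Dense(hidden_units, activation="relu", kernel_regularizer=reg)]
    if dropout > 0:
        layers.append(tf.keras.layers.Dropout(dropout))
    layers.append(tf.keras.layers.Dense(num_classes, activation="softmax"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# Avoidable-bias problem: more capacity, then train longer (more epochs in model.fit)
bigger_model = build_model(hidden_units=512)

# Variance problem: keep the capacity but add L2 and dropout regularization (or get more data)
regularized_model = build_model(hidden_units=512, l2=1e-3, dropout=0.3)
```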

Next week’s post is here.