This is the second week of the third course of DeepLearning.AI’s Deep Learning Specialization offered on Coursera. This course is less technical than the previous two, and focuses instead on general principles and intuition related to machine learning projects.
This week’s topics are:
- Error Analysis
- Mismatched Training and Dev/Test Sets
- Learning from Multiple Tasks
- End-to-end Deep Learning
Error Analysis Link to heading
Carrying out Error Analysis Link to heading
One of the things that we can do when our model is performing worse than human-level performance is to carry out error analysis. Error analysis is just a fancy name for trying to ascertain where the errors are coming from. This is critical because it lets us quickly come up with a “ceiling”, or upper bound, on the improvement a particular strategy can deliver.
Let’s say that our metric is accuracy and that our classifier is achieving $90\%$ accuracy on the dev set. One thing that can help is to take a look at the samples where the algorithm is not performing correctly. Say, for example, that our classifier is a cat classifier, and that it misclassifies some breeds of dogs as cats. Deciding whether to focus on this problem is key to iterating fast.
The main idea is to look at what proportion of the errors come from the dog problem. If we grab $\approx 100$ misclassified examples from the dev set, we count how many of them are dogs. If only $5\%$ of those $100$ misclassified examples are dogs, then even if we fixed all of them we would only reduce our error from $10\%$ to $9.5\%$. Maybe this is not worth it. Of course, the picture changes if $50\%$ or more of our misclassified examples are dogs.
We can extend this idea and evaluate multiple sources of error in parallel. We simply grab our $100$ misclassified examples from the dev set and tag each example as belonging to one or more categories of issues. Then we compute which category accounts for the largest share of errors and focus on that.
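As a concrete illustration, here is a minimal tally sketch in Python, assuming we have already gone through the misclassified examples by hand; the tag names are made up:

```python
from collections import Counter

# Hand-assigned tags for ~100 misclassified dev examples (made-up data).
# Each inner list holds the issue categories spotted for one example.
tags_per_example = [
    ["dog"],
    ["blurry"],
    ["dog", "blurry"],
    ["great_cat"],
    # ... roughly 100 rows in practice
]

counts = Counter(tag for tags in tags_per_example for tag in tags)
n_errors = len(tags_per_example)
for tag, count in counts.most_common():
    # The share of errors per tag is the ceiling on improvement for that fix.
    print(f"{tag}: {count}/{n_errors} errors ({count / n_errors:.0%})")
```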
The key takeaway from error analysis is to figure out which approach has the biggest bang for our buck; it’s usually the one addressing the most prevalent category of errors.
Cleaning up Incorrectly Labeled Data Link to heading
If we suspect, or can even confirm, that our training set has mislabeled samples, then it might be cause for concern. There is a lot of literature in econometrics that investigates measurement or observational error and its effect on causal inference. Since we are interested in prediction rather than causal inference, we won’t go too much into the details. However, a key concept from that field is characterizing the “issues” in our data. The most important question is whether the issues are random or systematic. For example, if the mislabeled examples are random with respect to the other features, then that’s about as good as we can hope for. However, if all the mislabeled examples are related to black cats, then it’s definitely an issue.
The key is that if the mislabeled examples are in our training set, and the mislabeling occurs at random, then we can rely on our model being reasonably robust to it.
We can implement the same approach described in the previous section and add a tag for errors whose source is incorrectly labeled data. We might consider engaging in the usually costly process of relabeling if a majority of the errors come from mislabeled data. A key thing is to make sure that both the dev and test set go through the same changes, i.e. we don’t want to fix labels on the dev set but not on the test set. Whether the training set comes from a different distribution than the dev/test set is a topic covered in the following section.
Build our First System Quickly, then Iterate Link to heading
This should be pretty intuitive: following the same guidelines as agile development, we should get a baseline running as fast as possible, which usually means a simple baseline. Being able to diagnose the first round of errors quickly, and then engaging in bias/variance and error analysis, allows us to iterate quickly. Many projects fail to even get off the ground due to unjustifiable (sometimes theoretical) complexity introduced before the first ideas are ever tried.
Mismatched Training and Dev/Test Sets Link to heading
Training and Testing on Different Distributions Link to heading
We have repeatedly worried about having the dev and test set come from different distributions. How bad is it that our training set comes from a different distribution from our dev and test set? Let’s go over the example shown in the course.
Why would we want to have different distributions between training and dev/test sets? Because doing this might allow us to use a lot more data.
Say that we have our cat app, where users can upload pictures, and we classify them as a cat or not. Say that we scraped the web for cat images, and we have $200,000$ such images, called $D_{old}$; all of which are pretty high quality. These images were used for our original train/dev/test splits. However, now our users upload their own images of cats, which are usually blurry and of lesser quality; we have $10,000$ of these user-generated images, called $D_{new}$. How can we incorporate this new data, which comes from a different distribution than our original $200,000$ samples, into our pipeline? Let’s go over the different options (a small splitting sketch follows the list):
- Mix $D_{old}$ and $D_{new}$ into a single dataset ($210,000$ samples), shuffle, and split again:
- Good: All three of our train/dev/test splits now come from the same distribution.
- Bad: This is bad (more bad than good) because our $10,000$ user images are spread very thinly across the splits. In expectation, only about $4.8\%$ of the data in each split comes from the new distribution, so the dev/test sets mostly reflect the web images rather than the images we actually care about.
- Put half of the new data ($5,000$ images) into the training set ($205,000$ samples total) and split the remaining $5,000$ into dev and test sets of $2,500$ each:
- Good: This better reflects where we want to “aim” with our model. We want to do well on the images that our users upload.
- Bad: Our training set comes from a different distribution than the dev/test sets. But this is not as bad as not “aiming” where we actually want to.
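To make the second option concrete, here is a hypothetical splitting sketch; the lists simply stand in for (image, label) pairs and the sizes match the example above:

```python
import random

# Stand-ins for (image, label) pairs; the sizes match the example above.
web_images = [(f"web_{i}.jpg", 1) for i in range(200_000)]    # D_old
user_images = [(f"user_{i}.jpg", 1) for i in range(10_000)]   # D_new

random.seed(0)
random.shuffle(user_images)

# Train on all web images plus half of the user images; dev and test come
# exclusively from the user distribution we actually care about.
train = web_images + user_images[:5_000]   # 205,000 samples
dev   = user_images[5_000:7_500]           # 2,500 samples
test  = user_images[7_500:]                # 2,500 samples
```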
The key takeaway here is three-fold:
- Always make sure that our test/dev splits come from the same distribution.
- Our dev/test set should reflect the main application of our model. In our case it’s doing well on images that our users upload and not on random images from the web.
- Having a train set that comes from a different distribution than our dev/test sets can be justified in cases like the example above.
The last point has a particular caveat, which is discussed next.
Bias and Variance with Mismatched Data Distributions Link to heading
If our training set distribution is different from our dev/test distribution, then the bias/variance analysis we have been discussing will no longer be directly applicable. Let’s revisit the bias/variance analysis.
Assume that human-level error $\approx 0$, and that our cat classifier has the following performance:
- When train/dev come from the same distribution:
- Training error: $1\%$
- Dev error: $10\%$
In the case that our training and dev sets come from the same distribution, we might diagnose this as a variance problem. We are overfitting the training data, and we are not able to generalize to unseen data. However, we cannot apply the same reasoning when the training and dev sets come from different distributions. We cannot say if the error is coming from a variance problem, i.e. not being able to generalize, or from the fact that the dev set comes from a different distribution than the training set. These two things are no longer orthogonal when our training and dev sets come from different distributions.
It turns out that further splitting our data can help us determine which one of the two issues is driving our dev error. We can generate a new split, called the training-dev split, which is separate from the train, dev and test sets. The key is that the training-dev set must come from the same distribution as the training set. This means that both the training and training-dev sets come from the same distribution, which can be different from the distribution of the dev/test sets. We can extend our bias/variance analysis over all these splits to better understand our model’s performance.
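Continuing the hypothetical splitting sketch from above, carving out a training-dev set could look something like this (the $10,000$ size is arbitrary):

```python
import random

# `train` stands in for the 205,000-sample training list from the earlier
# sketch; the 10,000 figure for the train-dev set is arbitrary.
train = [(f"img_{i}.jpg", 1) for i in range(205_000)]

random.seed(0)
random.shuffle(train)
train_dev = train[:10_000]   # same distribution as train, never trained on
train = train[10_000:]       # used for the actual gradient updates
```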
Let’s revisit the example above but with a made up train-dev error:
- When train/train-dev come from same distribution and dev comes from another distribution:
- Training error: $1\%$
- Train-dev error: $9\%$
- Dev error: $10\%$
What does it mean that our model is performing well on the training set but not that well on the train-dev set? The only difference is that our model has not seen the train-dev set. Remember that the train and train-dev sets come from the same distribution. This must mean that we have a variance problem, since the model is not able to generalize to unseen data. Why is it not a bias problem? Because our training error is very close to human-level performance. What about the difference between the train-dev error and the dev error?
Let’s look at a similar example:
- When train/train-dev come from same distribution and dev comes from another distribution:
- Training error: $1\%$
- Train-dev error: $1.5\%$
- Dev error: $10\%$
It doesn’t look like we have either a bias or a variance problem. Not a bias problem, because our model performs very close to human-level on the training set. Not a variance problem, because the gap between our training and train-dev errors is very small. What about the gap in performance between the train-dev and the dev set? This error is a data mismatch error, because the only difference between the train-dev set and the dev set is that they come from different distributions. It’s not that our classifier cannot generalize; we are simply evaluating its performance on a task it wasn’t trained to perform well on.
So, in general, the differences in performance between the data splits give us a sense of where the issue lies (a small sketch computing these gaps follows the list):
- Human-level error: We are assuming that this is approximately equal to Bayes error rate.
- Training error: The difference between training error and human-level error is the amount of avoidable bias. If this is high, then we have a bias problem.
- Train-dev error: The difference between the train-dev error and the training error measures the inability of our model to generalize to unseen data. If this is high, then we have a variance problem.
- Dev error: The difference between the dev error and the train-dev error is the error attributable to data mismatch. Remember, the main difference between the dev set and the train-dev set is that they come from different distributions, which means this error goes beyond the model’s ability to generalize. We will see ways of tackling this in the next section.
- Test error: The difference between the test error and the dev error is the degree to which our model has overfit the dev set. Remember that both our dev and test sets should come from the same distribution, so the gap between the two is analogous to the gap between the training and train-dev errors: a lack of generalization, i.e. a variance problem.
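As a small sketch of this attribution, using the made-up numbers from the examples above plus a hypothetical test error:

```python
# Made-up error rates; human-level error is assumed to be approximately 0
# as above, and the test error is hypothetical.
human_error     = 0.000   # proxy for Bayes error
train_error     = 0.010
train_dev_error = 0.090
dev_error       = 0.100
test_error      = 0.105

avoidable_bias = train_error - human_error        # bias problem if large
variance       = train_dev_error - train_error    # failure to generalize
data_mismatch  = dev_error - train_dev_error      # distribution shift
dev_overfit    = test_error - dev_error           # overfitting the dev set

print(f"avoidable bias:  {avoidable_bias:.1%}")
print(f"variance:        {variance:.1%}")
print(f"data mismatch:   {data_mismatch:.1%}")
print(f"dev overfitting: {dev_overfit:.1%}")
```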
The key takeaway is that there are cases when we might actually want mismatched training and dev/test set distributions. Usually, having more data will help our model perform better; especially if the alternative means throwing away a lot of data. However, by doing this we are introducing a new kind of error: data mismatch error. When our training set comes from a different distribution than our dev/test sets, our traditional bias/variance analysis will be “off” by some amount. This “off” amount comes from the data mismatch; that is, the degree to which our training set and dev/test sets come from different distributions. This error should be totally expected. What can we expect from training a model to classify cats using pictures of lions and then evaluating it on taxonomic drawings of lions?
Addressing Data Mismatch Link to heading
If our training and dev/test sets come from different distributions and our improved bias/variance analysis indicates that we have a data mismatch issue, what can we do?
Unfortunately there are no systematic ways of addressing this problem. At the end of the day, we have data that comes from two different distributions. However, there are a couple of things we can do that might help us understand and reduce the difference between the distributions.
The first thing to do is to carry out error analysis and try to understand how the two distributions differ. Notice that this is a manual rather than a systematic approach, and it can be problematic for high-dimensional data, where many more things can differ.
Another thing to do is to make the training data “more similar” to the dev/test data, or to get more data that comes from the distribution of the dev/test sets.
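As a hypothetical illustration for the cat-app example, we could degrade the high-quality web images so that they look more like blurry user uploads; this is a crude form of data synthesis, and the file names, blur radius, and target resolution below are all made up:

```python
from PIL import Image, ImageFilter

# Degrade a high-quality web image so it resembles a blurry user upload.
# File names, blur radius, and resolution are made up for illustration.
img = Image.open("web_cat_0001.jpg")
blurry = img.filter(ImageFilter.GaussianBlur(radius=2))
blurry = blurry.resize((256, 256))   # mimic lower-resolution uploads
blurry.save("synthetic_user_cat_0001.jpg")
```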
There are some data synthesis techniques that the course discusses in a very shallow manner, but I think these come with a lot of pitfalls if we are not experts.
Learning from Multiple Tasks Link to heading
Transfer Learning Link to heading
Time for transfer learning. We might have heard of this before, since it has become very popular in the generative AI field. The concept is pretty simple: imagine that we trained a cat classifier. It might be the case that the model we learned, or even parts of it, could be used successfully for another computer vision task, such as generating medical diagnoses from medical images.
The main idea is that if we trained some model on data $X_{cats}, Y_{cats}$, which in our case is images and labels of cat pictures, then we can reuse this model to perform classification in another domain, with different data $X_{medicine}, Y_{medicine}$. The first step, training on the cat classification task, is called pre-training. The second stage, training on medical images, is called fine-tuning.
Practically, we carry out pre-training the same way we would for any other application. However, for fine-tuning, we have two approaches. The approach we take depends on the amount of data that we have for fine-tuning, relative to pre-training.
If we have a lot of medical images, we might retrain the entire network; we wouldn’t start with random weights but from the weights of our pre-trained model. If we don’t have that much data, we can retrain only the last layer of the network, the output layer, during fine-tuning. In either case, the output layer is retrained from scratch, that is: we initialize the output layer with random weights and train it via fine-tuning. The difference between the approaches is simply whether we retrain only the output layer or the entire network.
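As a minimal sketch of the low-data case, assuming PyTorch and a recent torchvision ResNet-18 as the pre-trained model (neither of which the course prescribes):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a model pre-trained on a large, generic image dataset.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers so their weights stay fixed during fine-tuning.
for param in model.parameters():
    param.requires_grad = False

# Replace the output layer with a freshly initialized one for the new task
# (e.g. 2 classes for a hypothetical binary medical diagnosis).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new output layer's parameters are updated during fine-tuning.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```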
Why and how can this work? We might imagine that, within a domain such as image recognition, the neural networks we train for different applications, such as cat classification or medical diagnosis, have a huge overlap. Remember that neural networks amount to learned feature generation, and the abstract features for two similar applications will overlap substantially. This is why a model trained on cat classification might work well for medical diagnosis: a lot of the low-level features, such as recognizing shapes, edges, etc., carry over from one domain to the other.
So when does transfer learning make sense? An obvious case is when we don’t have a lot of data for a particular task. In our example, we might have a lot of data for cat recognition, but not so much for medical imaging. Transfer learning makes sense when we can pre-train on a lot of data and fine-tune for a specific application for which we have less data than is available for pre-training.
More specifically, transfer learning makes sense when:
- Both tasks have the same inputs, e.g. images.
- We have a lot more data for pre-training than fine-tuning.
- Abstract/low level features learned during pre-training could be helpful for fine-tuning.
Many large language models (LLMs) rely on transfer learning today. Huge models are trained on vast data scraped and curated from the internet. These models are gigantic and therefore contain a lot of the generalizable information that’s needed to parse and represent language. They can then be fine-tuned for different applications such as finance, customer support or making memes.
Multitask Learning Link to heading
Multitask learning is a similar approach to transfer learning, similar in the sense that two different tasks might benefit from sharing low-level features. The difference is that instead of having two stages, pre-training and fine-tuning, multitask learning trains one model that does many things at once.
The example in the course is that of object recognition in an autonomous driving system. In this scenario we want a system that recognizes stop signs, pedestrians, and many other objects related to driving. It turns out that we can approach this problem similarly to a softmax classifier. However, the key difference is that our loss function sums across the tasks for each example, instead of using only a single label per sample as a softmax classifier does. There are more technical details in the course, but they are covered very shallowly.
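A minimal sketch of such a multitask loss, assuming PyTorch and four hypothetical driving-related labels per image (stop sign, pedestrian, car, traffic light):

```python
import torch
import torch.nn as nn

# One shared network produces one logit per task for each image.
# Shapes: logits (batch, 4), labels (batch, 4) with entries in {0, 1}.
logits = torch.randn(8, 4)                    # hypothetical batch of 8 images
labels = torch.randint(0, 2, (8, 4)).float()  # made-up multi-hot labels

# One binary cross-entropy per task, combined over tasks and the batch
# (averaged here, whereas the course's formulation sums over tasks).
loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, labels)
```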
The key takeaway is that instead of having $N$ models when we have $N$ tasks, we can approach the $N$ tasks with a single model. The core idea is that all the tasks share a lot of the feature representations. If this is not the case, then multitask learning is not a sound way to go. We also, perhaps intuitively by now, need to have about the same amount of data for each task if we expect the model to perform equally across each task. In practice, transfer learning is used a lot more than multitask learning. This is due to a combination of the availability of data for a given task, and the fact that different parties can do the pre-training and fine-tuning separately.
End-to-end Deep Learning Link to heading
What is End-to-end Deep Learning? Link to heading
End-to-end deep learning is in a way a commodification of deep learning applications. If we take a deep learning application that is composed of several steps (models) and replace it with a single step (model), then this is called end-to-end deep learning. It came as a response to complicated, hand-crafted pipelines. If we think of speech recognition as an example, an end-to-end approach goes directly from audio to a transcript. This is in contrast to a pipeline where we go from audio, to features, to phonemes, and so on.
The key takeaway is that end-to-end requires a lot of data in general, depending on how efficient the non-end-to-end process is. An example of success is machine translation. Originally the process of machine translation was composed of many steps. Today, however, larger and larger transformer models can be trained directly on the task of translation.
Whether to use End-to-end Deep Learning Link to heading
The good thing about end-to-end learning is that we don’t need to rely as much on manual feature engineering; we let the data speak for itself instead of forcing hand-designed representations onto the model, which is also why we need more data. On the other hand, many hand-designed components are the result of intense research and can give us large efficiency gains. Whether to use end-to-end deep learning depends on whether we can keep the baby and throw away the bathwater, which is not always possible. Finally, end-to-end approaches are a lot more data hungry, so depending on the amount of data available to us, it might not even be a feasible approach.
Next week’s post is here.