This is the second week of the fifth course of DeepLearning.AI’s Deep Learning Specialization offered on Coursera. This week we go a little deeper into natural language applications with sequence models, and we also discuss word embeddings: an amazing technique for extracting semantic meaning from words.

This week’s topics are:


Introduction to Word Embeddings

Word Representation

Word embeddings are a way of representing words. The approach borrows from dimensionality reduction, and combines it with optimization. These two things allow us to create new word representations that are empirically good with respect to some task. Let’s go over how this is possible.

We have been using one-hot encoding for words in our vocabulary. This works great as a way of representing words for a mathematical model; i.e. we can build a pretty decent language model with it. The issue, however, is two-fold. First, a vocabulary size of $10,000$ is tiny in practice; with realistically sized vocabularies, do we really need to calculate the softmax over all these words? The second issue is that of semantics: at no point does the model realize that apple, orange and banana are similar things. Let’s look at the latter first.

In a one-hot encoding all elements are $0$ except for the element indexed by the position of the word in our vocabulary. If our vocabulary size $|V| = 10,000$, then we represent each word with $9,999$ zeros and one $1$ at the index of that word. Let’s use the following indices for the words in our example:

  • $i_{man} = 5391$
  • $i_{woman} = 9853$
  • $i_{king} = 4914$
  • $i_{queen} = 7157$
  • $i_{apple} = 456$
  • $i_{orange} = 6257$

So $o_{man} = o_{5391}$ is the one-hot encoding of the word man. Remember that $o_w \in \mathbb{R}^{10,000}$.

How could our algorithm handle semantic meaning? The same way we handle most anything else: with matrices and linear algebra. Currently, our words live in $\mathbb{R}^{10,000}$ space. That is, they are $10,000$ dimensional one-hot vectors, where all the entries are $0$ except for one. Let’s think about the dot-product, which is the inner product of Euclidean space. Algebraically, the dot product is the sum of the products of the corresponding entries of the two sequences of numbers. Geometrically, it is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them. 1 Let’s focus on the geometric definition.

If the dot-product between two vectors is $0$ then the vectors are either orthogonal or one of them is the zero vector $\vec{0}$. Otherwise, the sign and magnitude of the dot product tell us about the angle between the vectors. What does it mean that two vectors are orthogonal? It means that they are perpendicular to each other. What does that mean? It depends on the application: in optics, orthogonality is related to polarization states; in statistics, orthogonal random variables are those whose covariance is $0$. What happens when we dot two one-hot vectors for different words?

Since, by definition, each one-hot vector has its $1$ in a unique position, the dot product between any pair $o_i, o_j$ where $i \neq j$ will be $0$. This means that the vector space defined by all of our one-hot vectors has one dimension per word, and since dimensions are orthogonal to each other, all word vectors are orthogonal to each other. Which in turn means that our representation imposes the assumption that there is literally no relationship between any pair of words. Even pairs such as $(\text{happy}, \text{happier})$.
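To make this concrete, here is a minimal numpy sketch (using the vocabulary size and word indices from the example above) showing that the dot product between two different one-hot vectors is always zero, while a vector dotted with itself gives one:

```python
import numpy as np

V = 10_000  # vocabulary size from the example

def one_hot(index, size=V):
    """Return a one-hot vector with a 1 at `index` and 0 everywhere else."""
    o = np.zeros(size)
    o[index] = 1.0
    return o

o_man, o_woman = one_hot(5391), one_hot(9853)

print(np.dot(o_man, o_woman))  # 0.0 -> orthogonal: no notion of similarity between words
print(np.dot(o_man, o_man))    # 1.0 -> each word is only "similar" to itself
```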

The question is then: how can we project the $\mathbb{R}^{10,000}$ vector space into a lower dimensional space, say $\mathbb{R}^{300}$? We could try to use dimensionality reduction techniques, such as PCA, but they will not work that well here. This is because all dimensions are orthogonal: there’s no “common” information with which to encode the higher dimensional space into a lower dimensional one. We need something different, something learnable. We need word embeddings!

Say that we define $4$ dimensions: Gender, Royal, Age, Food. We’d like to map each of our $\mathbb{R}^{10,000}$ one-hot vectors into $\mathbb{R}^{4}$. Let’s map the words we had before into these dimensions by hand:

|        | Man   | Woman | King  | Queen | Apple | Orange |
|--------|-------|-------|-------|-------|-------|--------|
| Gender | -1    | 1     | -0.95 | 0.97  | 0.00  | 0.01   |
| Royal  | 0.01  | 0.02  | 0.93  | 0.95  | -0.01 | 0.00   |
| Age    | 0.03  | 0.02  | 0.7   | 0.69  | 0.03  | -0.02  |
| Food   | 0.09  | 0.01  | 0.02  | 0.01  | 0.95  | 0.97   |

The numbers are fictitious, but set by hand to illustrate the example. The important thing is to notice that by embedding these words in our four new dimensions we have captured semantic meaning, and now the vectors live in “semantic” space. This, in turn, allows us to define distance, and therefore similarity. For example, we should expect our embeddings to comply with $\|e_{apple} - e_{queen}\| > \|e_{apple} - e_{orange}\|$, where $e_{w}$ is the embedding vector, in our case in $\mathbb{R}^{4}$, of the word $w$. That is, the distance between apple and queen should be greater than the distance between apple and orange, because apples and oranges are both fruits.
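As a quick sanity check, here is a small numpy sketch using the hand-made table above; it confirms that apple is much closer to orange than to queen in this toy “semantic” space:

```python
import numpy as np

# Values follow the table above; each vector is (Gender, Royal, Age, Food).
e = {
    "man":    np.array([-1.00,  0.01,  0.03, 0.09]),
    "woman":  np.array([ 1.00,  0.02,  0.02, 0.01]),
    "king":   np.array([-0.95,  0.93,  0.70, 0.02]),
    "queen":  np.array([ 0.97,  0.95,  0.69, 0.01]),
    "apple":  np.array([ 0.00, -0.01,  0.03, 0.95]),
    "orange": np.array([ 0.01,  0.00, -0.02, 0.97]),
}

dist = lambda a, b: np.linalg.norm(e[a] - e[b])  # Euclidean distance in embedding space
print(dist("apple", "queen"))   # ~1.78: unrelated concepts are far apart
print(dist("apple", "orange"))  # ~0.06: both fruits, so they are close
```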

How we learn these embeddings, and how many dimensions are enough, is something we go over in a bit.

Remember that embeddings are synonymous with encodings. We want efficient word encodings that are much smaller but denser than our one-hot vectors, which are big but sparse.

Using Word Embeddings

So how do we use these embeddings? It turns out that learning good word embeddings takes a lot of text: between $1$ and $100$ billion words in the corpus. That’s just the data requirement; training costs are also going to be significant. Instead, we can be lazy and use pre-trained embeddings from somewhere online. After we have these embeddings, we can employ transfer learning and fine-tune a neural network model to our task. We could even fine-tune the embeddings themselves with our data.

This approach is similar to that of face encoding. In the face recognition approach, we used a Siamese network to learn some $\mathbb{R}^{128}$ dimensional face embedding that is a good embedding for comparing two faces. It would even work with a picture that was not in the training set; a necessity for the face recognition application. In the case of word embeddings we are doing the same, except for two things: the encodings are good encodings with respect to a language model, and the cardinality of the inputs is fixed; it’s equal to the size of the vocabulary used to learn the embeddings. Remember that we can use a unique token to represent unknown words, which would also get its own embedding representation.
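As a rough sketch of what this transfer-learning setup might look like in Keras, assuming we have already downloaded some pre-trained vectors (the `pretrained` dictionary and the tiny `word_to_index` vocabulary below are hypothetical placeholders):

```python
import numpy as np
import tensorflow as tf

emb_dim = 300
word_to_index = {"<unk>": 0, "the": 1, "orange": 2}        # hypothetical vocabulary
pretrained = {"the": np.random.randn(emb_dim),             # stand-ins for downloaded vectors
              "orange": np.random.randn(emb_dim)}

# Assemble the embedding weights (rows = words, columns = embedding dimensions).
weights = np.random.randn(len(word_to_index), emb_dim) * 0.01
for word, idx in word_to_index.items():
    if word in pretrained:
        weights[idx] = pretrained[word]

embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(word_to_index),
    output_dim=emb_dim,
    embeddings_initializer=tf.keras.initializers.Constant(weights),
    trainable=False,  # freeze the pre-trained embeddings, or set True to fine-tune them
)
```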

Properties of Word Embeddings

We ended up with word embeddings by wanting to have a better representation of words that captures semantic meaning. But what does “capturing semantic meaning” really mean, and how is that helpful? One of the things that we can do with semantics is to reason about analogies: $a$ is to $b$ as $c$ is to $d$; this seems to be an ability that helps with NLP tasks.

Let’s say that we have some embedding $e_w \in \mathbb{R}^{300}$. Since our embedding vectors are no longer orthogonal, we can use linear algebra to describe relationships in this new space, such as length, distance and so on. In the paper that introduces this idea, the authors write in the abstract:

… For example, the male/female relationship is automatically learned, and with the induced vector representations, “King - Man + Woman” results in a vector very close to “Queen.”

This means that there is some direction in the embedding space that represents the semantic meaning of “gender”, and along this direction the difference between “King” and “Man”, plus “Woman”, approximates “Queen”. We can do math with semantic concepts. Since we can do math, let’s pose the analogy: “Man” is to “Woman” as “King” is to $w$. Mathematically:

$$ \begin{aligned} e_{man} - e_{woman} &\approx e_{king} - e_{w} \\ e_{w} &\approx e_{king} - e_{man} + e_{woman} \end{aligned} $$

If we have a way to compare two vectors, i.e. a similarity function, we can search over all our words $w \in V$ in linear time comparing each $e_w$ to $e_{king} - e_{man} + e_{woman}$, and then picking the most similar as $e_w$. How can we say if two vectors are similar?

Cosine Similarity

Let’s keep in mind that the dot product is related to the angle between two vectors. If $x \cdot y = 0$, then $x \perp y$, that is, they are either orthogonal or $x$ or $y = \vec{0}$. In the case that they are orthogonal, the angle between them is $90\degree$. What if $x \cdot y \neq 0$? It can be anything in $(-\infty, \infty)$ except $0$. For similarity functions we prefer a bounded interval such as $[-1, 1]$ over $(-\infty, \infty)$. We can shrink the range to that interval by normalizing by something. We can use the product of the Euclidean norms of the two vectors $x, y$ to normalize the dot product. What does the product of the lengths of two vectors have to do with the dot product, and the angle between the vectors?

The Euclidean dot product formula is:

$$ A \cdot B = ||A|| \ ||B|| \cos{\theta} $$

Which is where the cosine similarity formula comes from:

$$ S_c(A, B) := \cos{\theta} = \frac{A \cdot B}{||A|| \ ||B||} $$

Remember the cosine of the relevant angles: $\cos(90\degree) = 0$ and $\cos(0\degree) = 1$, so orthogonal vectors get a similarity of $0$ and parallel vectors a similarity of $1$.

We can redefine our search problem for analogy reasoning as:

$$ \argmax_{w \in V} S_c(e_w, e_{king} - e_{man} + e_{woman}) $$
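A minimal numpy sketch of this search; the toy random embeddings at the bottom are only there to make the snippet runnable, while with real pre-trained vectors we would expect the answer to be “queen”:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def solve_analogy(a, b, c, e):
    """Find w such that a is to b as c is to w, e.g. man : woman :: king : ?"""
    target = e[c] - e[a] + e[b]                    # e_king - e_man + e_woman
    candidates = (w for w in e if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine_similarity(e[w], target))

rng = np.random.default_rng(0)  # hypothetical toy embeddings
e = {w: rng.standard_normal(300) for w in ["man", "woman", "king", "queen", "apple", "orange"]}
print(solve_analogy("man", "woman", "king", e))
```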

The analogy reasoning is just a show trick. It’s a way of exemplifying what embeddings are doing. In practice, using word embeddings allows our models to learn about the connections between words in semantic space, which is very useful for performance.

Embedding Matrix

So, to formalize what an embedding is, let’s go step by step and calculate some embedding $e_w$. Let $E$ be an embedding matrix of dimensions $(300, 10000)$. It’s a mapping from 10,000 dimensional space to 300 dimensional space. The 10,000 comes from our vocabulary size; that is, for each word in our vocabulary $w \in V$ we will have a 300 dimensional vector $e_w$ which represents the embedding of $w$, with $e_w \in \mathbb{R}^{300}$. Remember that the one-hot vector is the same idea but in more dimensions: $o_w \in \mathbb{R}^{10000}$. What happens if we multiply $E o_w$? The output dimension should be $(300, 1)$ since:

$$ \underset{(300, 10000)}{E}\underset{(10000, 1)}{o_w} = \underset{(300, 1)}{e_w} $$

Notice that because of the way matrix multiplication works, only one column of $E$ is selected, and it’s exactly the $w_{th}$ one. Doing the full multiplication just to select a column is computationally wasteful, even after learning $E$, which is why in practice we look up the $w_{th}$ column of $E$ directly, and why many pre-trained embeddings are distributed as a dictionary mapping $w \to e_w$ without the multiplication step.
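A quick numpy check of this, with a randomly initialized $E$ of the same shape:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, vocab_size, w = 300, 10_000, 6257      # w = index of "orange" from the example

E = rng.standard_normal((emb_dim, vocab_size))  # embedding matrix (300, 10000)
o_w = np.zeros(vocab_size)
o_w[w] = 1.0                                    # one-hot vector (10000,)

e_w_slow = E @ o_w   # full matrix-vector product: ~3 million multiplications
e_w_fast = E[:, w]   # direct column lookup: no arithmetic at all

print(np.allclose(e_w_slow, e_w_fast))  # True: both give the same embedding
```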

Word Embeddings

Learning Word Embeddings

How can we learn the embedding matrix $E$? We can learn it via supervised learning. Say we have some vocabulary $V$ with size $|V| = 10,000$, and $o_w$ represents the one-hot encoding of the word with index $w$. A training pair consists of a phrase and the next word, such as $X_i=\text{“I want a glass of orange”}$ and $Y_i=\text{“juice”}$.

We can randomly initialize $E$ and define $e_w = Eo_w$. The idea is to take each $o_w^{<t>}$ in our training pair and map it to $e_w^{<t>}$ via $E$. Finally, we feed all the $e_w^{<t>}$ into a fully connected layer followed by a softmax. The idea is to correctly classify the next word by optimizing the network parameters $\theta$, one of which is the embedding matrix $E$, using the cross-entropy loss.
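A sketch of this language model in Keras, assuming a fixed-size context window; the hidden-layer size and optimizer are illustrative choices, not the course’s exact ones:

```python
import tensorflow as tf

vocab_size, emb_dim, context_len = 10_000, 300, 6  # "I want a glass of orange" -> "juice"

inputs = tf.keras.Input(shape=(context_len,), dtype="int32")   # word indices of the context
x = tf.keras.layers.Embedding(vocab_size, emb_dim)(inputs)     # this layer plays the role of E
x = tf.keras.layers.Flatten()(x)                               # stack the context embeddings
x = tf.keras.layers.Dense(128, activation="relu")(x)           # fully connected layer
outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)  # next-word distribution

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```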

In practice, this means that the algorithm will learn embeddings of words with the explicit goal of next-word prediction. Since the quality of the embeddings is directly related to word prediction, the algorithm has full incentive to encode relations between similar words into the 300 features. This brings us to another important point.

So far we have been describing words such as “Man” or “Queen” in terms of semantic topics. We do this not because the learned embedding dimensions will correspond to topics we understand, but to aid the explanation. In practice, deciphering what an embedding dimension really means is hard, and it amounts to guesswork.

Different extensions of this basic model generalize how example pairs are generated for the supervised learning task. We can define a context $c$ and a target word $t$. We were using $c$ to be all the words leading up to $t$, but we are not required to do this. We can use a context of a few words on the left and right of the target word, or even just the last word. The Word2Vec algorithm is an extension of this idea, and it has a variation that uses negative sampling for generating the pairs. Let’s discuss the Word2Vec approach first.

Word2Vec

In Word2Vec we keep using the supervised learning approach, but we pick just a pair of words $c, t$, the context and target respectively. Notice that these do not need to be adjacent words. In the skip-gram model, we pick a context word $c$ and generate the target $t$ by picking a neighboring word at random within some window. Therefore, the classification task is to classify which words are more likely to occur near each other. We will see a more sophisticated way of sampling in negative sampling, but for now let’s keep in mind that we are still doing essentially the same as before:

$$ o_c \to E \to e_c \to \text{softmax} \to \hat{y} $$

The softmax output $\hat{y}$ is the estimate of the probability that we see $t$ given $c$: $P(t\mid c)$. It’s defined as:

$$ P(t \mid c) = \frac{e^{\theta_t^Te_c}}{\sum_{j=1}^{10000}e^{\theta_j^Te_c}} $$

And our loss is:

$$ \mathcal{L}(\hat{y}, y) = - \sum_{i=1}^{10000} y_i \log \hat{y}_i $$
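In numpy, the forward pass and loss for a single $(c, t)$ pair might look like this sketch, with $\theta$ stored as a matrix that holds one row $\theta_j$ per vocabulary word (all values randomly initialized here):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 10_000, 300
E = rng.standard_normal((emb_dim, vocab_size)) * 0.01       # embedding matrix
theta = rng.standard_normal((vocab_size, emb_dim)) * 0.01   # softmax parameters, one theta_j per word

def skipgram_loss(c, t):
    """Cross-entropy loss of predicting target word t from context word c."""
    e_c = E[:, c]                               # e_c = E o_c, via a direct column lookup
    logits = theta @ e_c                        # theta_j^T e_c for every word j
    logits -= logits.max()                      # for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()   # softmax over the whole vocabulary
    return -np.log(p[t])                        # -log P(t | c)

print(skipgram_loss(c=6257, t=1234))            # e.g. context "orange", some target index
```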

So far so good, and this will work. But there’s a problem: calculating the softmax for $10,000$ classes might be feasible, but what about bigger vocabularies? What about a million? It’s no longer feasible. There are a few solutions.

The hierarchical softmax is an alternative to negative sampling, where the idea is also to reduce the computational complexity of the $\text{softmax}$ function down from $O(|V|)$. The hierarchical softmax has a time complexity of $O(\log |V|)$. The approach is based on Huffman coding, a coding scheme used for lossless data compression. Chris McCormick has an amazing tutorial on his website where you can learn a lot more about word2vec. Negative sampling and the hierarchical softmax serve the same purpose, with negative sampling being more intuitive to understand.

Negative Sampling

In negative sampling we still have a context word and a target: $c, t$. Except we will create fake example pairs, negative samples, in addition to the real pair. For example, if the real pair is (orange, juice), we might make $k=4$ negative samples:

  1. (orange, king)
  2. (orange, book)
  3. (orange, the)
  4. (orange, of)

Notice that the context $c$ remains fixed, while the target $t$ is sampled at random. For smaller datasets we want to pick $k \in [5, 20]$ while for larger datasets we can get away with $k \in [2, 5]$. Think of $k$ as a sampling size parameter, which reduces the standard error of our estimates as it goes up.

The next thing is to label all positive samples (real) with a $1$, that is $y=1$, while we give a $0$ label to the negative samples, that is $y = 0$. The idea is for the model to learn to classify which pairs are positive and which are negative. That is, we want to estimate:

$$ P(y=1 \mid c, t) = \sigma(\theta_t^Te_c) $$

This is equivalent to having $10,000$ different binary classifiers at the end of the network, but only training $k + 1$ of them (the positive pair plus the $k$ negatives) on each example. Therefore, we can think of negative sampling as an estimator of the $\text{softmax}$ function.

But how do we sample the negative examples? The authors recommend a distribution that’s somewhere in between the empirical unigram distribution and the uniform distribution $\frac{1}{|V|}$. Using the former will over-sample stop words because they are very common, while the latter will treat extremely uncommon words the same as stop words. The actual probability mass function, where $f(w_i)$ is the observed frequency of word $w_i$, is given by:

$$ P(w_i) = \frac{f(w_i)^{\frac{3}{4}}}{\sum_{j=1}^{10000}f(w_j)^{\frac{3}{4}}} $$
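A rough numpy sketch of both ingredients: sampling $k$ negatives from the $f(w)^{3/4}$ distribution and computing the logistic loss over the positive pair plus the negatives (the word counts here are random stand-ins for real corpus frequencies):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, k = 10_000, 300, 4

counts = rng.integers(1, 1_000, size=vocab_size).astype(float)  # stand-in unigram counts f(w)
p_neg = counts ** 0.75
p_neg /= p_neg.sum()                                            # P(w_i) from the formula above

E = rng.standard_normal((emb_dim, vocab_size)) * 0.01           # context embeddings e_c
theta = rng.standard_normal((vocab_size, emb_dim)) * 0.01       # one binary classifier theta_t per word
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(c, t):
    """Logistic loss for one positive pair (c, t) plus k sampled negative targets."""
    e_c = E[:, c]
    negatives = rng.choice(vocab_size, size=k, replace=False, p=p_neg)
    loss = -np.log(sigmoid(theta[t] @ e_c))                     # positive pair, label y = 1
    loss -= np.log(sigmoid(-(theta[negatives] @ e_c))).sum()    # negative pairs, label y = 0
    return loss

print(negative_sampling_loss(c=6257, t=1234))
```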

GloVe Word Vectors

Global Vectors for Word Representation (GloVe) is another approach to learn $E$ and $e_w$, the embedding matrix and word embeddings. It does something very similar to word2vec.

In word2vec we were picking $c, t$, two words that appear in proximity. Here we will be more explicit and define $X_{ij}$ as the number of times that word $j$ appears in the context of word $i$. Notice that depending on how we define the context, $X_{ij}$ could be equal to $X_{ji}$. If the context is a window around the target the counts will be symmetric, while if we only look backwards they will not.

In this case we want to minimize the following:

$$ \mathcal{L} = \sum_{i=1}^{|V|} \sum_{j=1}^{|V|} f(X_{ij})(\theta_i^Te_j + b_i + b_j - \log X_{ij})^2 $$

$f(X_{ij})$ is a weighting function that lets us deal with taking $\log 0$ (pairs that never co-occur contribute nothing) and also lets us give neither too much weight to extremely frequent pairs, such as those involving stop words, nor too little to rare ones.

It turns out that $\theta_i$ and $e_i$ play symmetric roles, since they are both optimized in the same way. This means that in practice, to calculate the final embedding of word $w$ we take the average of the two:

$$ e_w^{(final)} = \frac{\theta_w + e_w}{2} $$
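A numpy sketch of the objective given a co-occurrence matrix $X$; the weighting function $f(x) = \min\big((x/x_{max})^{3/4}, 1\big)$ is the one proposed in the GloVe paper, while the sizes and the random $X$ are only there to make the sketch runnable:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 1_000, 50                       # small sizes so the sketch runs quickly
X = rng.poisson(1.0, size=(vocab_size, vocab_size))   # stand-in co-occurrence counts X_ij

theta = rng.standard_normal((vocab_size, emb_dim)) * 0.01
e = rng.standard_normal((vocab_size, emb_dim)) * 0.01
b_theta, b_e = np.zeros(vocab_size), np.zeros(vocab_size)

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: f(0) = 0 removes the log(0) terms, and weights are capped at 1."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss():
    mask = X > 0                                          # only co-occurring pairs contribute
    logX = np.where(mask, np.log(np.maximum(X, 1)), 0.0)
    residual = theta @ e.T + b_theta[:, None] + b_e[None, :] - logX
    return np.sum(f(X) * residual ** 2)

print(glove_loss())

# After training, average the two symmetric parameter sets to get the final embeddings.
e_final = (theta + e) / 2
```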

Applications Using Word Embeddings

Sentiment Classification

Okay, enough about embeddings. How can we use them? The same way we use one-hot encodings! Let’s walk through the steps from a basic NN to using an RNN.

We could take a review, such as “This dessert is excellent”, and want to classify it into five classes, the review’s star rating. We do the same thing again: map each $w$ into $o_w$ and then into $e_w$ via an embedding matrix $E$. After we have $e_{this}, \dots, e_{excellent}$ we could average these vectors element-wise, resulting in a single $300$ dimensional vector if we use $300$ dimensional embeddings. We could then run this through a softmax with $K = 5$ and generate a prediction for the review score. This will work okay, but we will run into the same issue we ran into originally when introducing RNNs: the model cannot use temporal structure to affect the prediction.

We can simply plug-and-play an RNN using the same embeddings, and generate a prediction at the end, exactly like the many-to-one RNN approach. This will work a lot better for reviews that say things like: “A great waste of my very, good, excellent time”. The former model might score this review highly because of the presence of positive sentiment words, but an RNN might be more adept at figuring out that “waste” negates all the other positive words.
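A Keras sketch of both variants; the layer sizes are illustrative, and the embedding layer could also be initialized from pre-trained vectors as shown earlier:

```python
import tensorflow as tf

vocab_size, emb_dim, num_classes = 10_000, 300, 5        # 5 classes = star ratings

inputs = tf.keras.Input(shape=(None,), dtype="int32")    # a review as a sequence of word indices
embedded = tf.keras.layers.Embedding(vocab_size, emb_dim)(inputs)

# Variant 1: average the word embeddings element-wise, ignoring word order.
avg = tf.keras.layers.GlobalAveragePooling1D()(embedded)
avg_model = tf.keras.Model(inputs, tf.keras.layers.Dense(num_classes, activation="softmax")(avg))

# Variant 2: many-to-one RNN; the final LSTM state summarizes the whole review.
last_state = tf.keras.layers.LSTM(128)(embedded)
rnn_model = tf.keras.Model(inputs, tf.keras.layers.Dense(num_classes, activation="softmax")(last_state))
```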

De-biasing Word Embeddings

You might have heard of instances where earlier versions of machine translation systems would translate non-gendered words in English, such as “Doctor”, into gendered versions in the target language, for example “Doctor” in Spanish as opposed to “Doctora”, the female version of the noun. This is a reflection of the corpus the model was trained on. Algorithmic bias is a contentious topic because it reveals biases we either forgot were there or do not like to be reminded of. In this case, male doctors were much more frequent than female doctors in the target-language corpus, and as such, the language model learned to estimate that empirical distribution. This example, though, is probably among the most benign instances of algorithmic bias in terms of its real-world outcomes.

As predictive models make their way into diverse aspects of life, the more important the outcome associated with a prediction, the more care we must take with the implicit biases in the training data. One high-stakes example is that of bail decisions, where algorithmic decision aids have been supported by quasi-experimental evidence as an improvement over judges deciding by themselves, as measured by the change in defendants’ welfare.

In economics, we differentiate taste-based discrimination from statistical discrimination. The former emphasizes the role of prejudice in decision-making, while the latter is more mechanical: in the absence of perfect information, people must make guesses based on observable attributes in order to make rational decisions. A very interesting study of statistical discrimination occurred in the context of “ban the box” (BTB) policies. BTB policies prevent employers from asking about a job applicant’s criminal record until late in the hiring process, under the idea that asking up front is unfair to people with criminal records. The study, however, finds the opposite effect: in the absence of information, employers engage in statistical discrimination against demographic groups that include more ex-offenders. The study finds that BTB policies decrease the probability of employment by $3.4$ percentage points for young, low-skilled black men.

Is there a way we could de-bias training data so that the predictive model aligns better with our ethical and moral compass? For the case of machine translation, a group of authors did just that.

With our knowledge of embeddings it’s pretty intuitive to understand the basics of the approach by Bolukbasi, et al. in Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. The first step is to identify the bias direction. We do this by taking pairs of intrinsically gendered words, such as man and woman, and averaging the bias across all such pairs we find. The bias for the pair (male, female), for example, is defined as $e_{male} - e_{female}$. After we have identified the bias direction, we neutralize: for every word that is not definitional with respect to gender, such as doctor, we project away the component along the bias direction. Finally, we equalize pairs: definitional pairs, such as (grandmother, grandfather), should end up equidistant from the non-definitional words, such as doctor.
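A simplified numpy sketch of these three steps; the toy embeddings and the list of gendered pairs are hypothetical, and the equalization below is a simplified version of the paper’s formula, assuming unit-length embeddings:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def bias_direction(e, pairs=(("man", "woman"), ("he", "she"))):
    """Average the differences of intrinsically gendered pairs to estimate the bias direction."""
    return unit(np.mean([e[a] - e[b] for a, b in pairs], axis=0))

def neutralize(v, g):
    """Remove the component of v along the bias direction g."""
    return v - np.dot(v, g) * g

def equalize(v1, v2, g):
    """Make a definitional pair symmetric about the bias direction (simplified)."""
    mu = neutralize((v1 + v2) / 2, g)                      # shared, gender-neutral part
    r = np.sqrt(max(1.0 - np.linalg.norm(mu) ** 2, 0.0))   # assumes unit-length embeddings
    s = 1.0 if np.dot(v1, g) >= 0 else -1.0                # keep each word on its original side
    return mu + s * r * g, mu - s * r * g

rng = np.random.default_rng(0)                             # hypothetical toy embeddings
words = ["man", "woman", "he", "she", "doctor", "grandmother", "grandfather"]
e = {w: unit(rng.standard_normal(50)) for w in words}

g = bias_direction(e)
e["doctor"] = neutralize(e["doctor"], g)                   # doctor should carry no gender
e["grandmother"], e["grandfather"] = equalize(e["grandmother"], e["grandfather"], g)
print(np.dot(e["doctor"], g))                              # ~0 after neutralizing
```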

Next week’s post is here.


  1. Dot-Product | Wikipedia ↩︎