Chapter 9: RNN and CNN

80. Turning words into numeric IDs

We want to turn the words from the data we created in Problem 51 into numeric IDs. Assign a numeric ID to each word that occurs twice or more in the data, so that the most frequent word is assigned the ID 1, the second-most frequent word the ID 2, and so on. Then, implement a function that returns the ID for a given word, or returns 0 if the word occurs fewer than two times in the data.
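A minimal sketch of one way to do this, assuming the data from Problem 51 is available as a list of whitespace-tokenized titles named `train_titles` (the variable name and the tokenization are assumptions):

```python
from collections import Counter

# Count word frequencies over the (assumed) training titles.
counter = Counter(word for title in train_titles for word in title.split())

# Keep only words that occur twice or more; the most frequent word gets ID 1.
frequent = [(w, c) for w, c in counter.most_common() if c >= 2]
word2id = {w: i for i, (w, _) in enumerate(frequent, start=1)}

def get_id(word: str) -> int:
    """Return the numeric ID of a word, or 0 if it occurs fewer than twice."""
    return word2id.get(word, 0)
```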

81. Prediction with an RNN

We have a sequence of numeric word IDs \(\boldsymbol{x} = (x_1, x_2, \dots, x_T)\), corresponding to the words in a given sentence. Here, \(T\) is the number of words in the sequence (i.e., the sentence) and \(x_t \in \mathbb{R}^{V}\) is a one-hot vector (\(V\) is the number of words in the vocabulary). Use an RNN (Recurrent Neural Network) to implement a model that predicts a category \(y\) from a given sequence of word IDs \(\boldsymbol{x}\).

\[\overrightarrow{h}_0 = 0, \\ \overrightarrow{h}_t = {\rm \overrightarrow{RNN}}(\mathrm{emb}(x_t), \overrightarrow{h}_{t-1}), \\ y = {\rm softmax}(W^{(yh)} \overrightarrow{h}_T + b^{(y)}),\]

where \(\mathrm{emb}(x) \in \mathbb{R}^{d_w}\) is a word embedding (a function that maps a one-hot vector to a dense word vector), and \(\overrightarrow{h}_t \in \mathbb{R}^{d_h}\) is the hidden state at time step \(t\). \({\rm \overrightarrow{RNN}}(x,h)\) is an RNN cell that takes as input the embedding of the word at time step \(t\) and the hidden state of the previous time step, and produces the hidden state for the current time step. \(W^{(yh)} \in \mathbb{R}^{L \times d_h}\) is a linear transformation for predicting the category from the hidden state, and \(b^{(y)} \in \mathbb{R}^{L}\) is a bias term (\(d_w\), \(d_h\), and \(L\) are the dimensionality of the word embeddings, the dimensionality of the hidden state, and the number of categories, respectively). While there are several choices for the RNN cell \({\rm \overrightarrow{RNN}}(x,h)\), a simple example is as follows:

\[{\rm \overrightarrow{RNN}}(x,h) = g(W^{(hx)} x + W^{(hh)}h + b^{(h)}),\]

where \(W^{(hx)} \in \mathbb{R}^{d_h \times d_w}\), \(W^{(hh)} \in \mathbb{R}^{d_h \times d_h}\), and \(b^{(h)} \in \mathbb{R}^{d_h}\) are the parameters of the RNN cell, and \(g\) is an activation function (e.g., \(\tanh\) or ReLU).

For this problem, you do not have to train the model; it is enough to randomly initialize the network and compute \(y\). For the dimensionalities, use settings such as \(d_w = 300\) and \(d_h = 50\) (the same applies to the later problems).
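A minimal sketch of such a model in PyTorch, assuming `word2id` from Problem 80 and using `nn.RNN` (which applies \(\tanh\) by default) as the RNN cell; the class name, the dummy input, and the choice of four categories are assumptions:

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, d_w=300, d_h=50):
        super().__init__()
        # padding_idx=0 reserves ID 0 for the rare/unknown words of Problem 80
        self.emb = nn.Embedding(vocab_size, d_w, padding_idx=0)
        self.rnn = nn.RNN(d_w, d_h, batch_first=True)   # g = tanh by default
        self.fc = nn.Linear(d_h, num_classes)           # W^(yh), b^(y)

    def forward(self, x):                 # x: (batch, T) word IDs
        e = self.emb(x)                   # (batch, T, d_w)
        _, h_T = self.rnn(e)              # h_T: (1, batch, d_h), final hidden state
        return self.fc(h_T.squeeze(0))    # (batch, L) unnormalized scores

# Compute y for one (untrained) example; softmax turns the scores into probabilities.
model = RNNClassifier(vocab_size=len(word2id) + 1, num_classes=4)
ids = torch.tensor([[5, 12, 7]])          # a dummy ID sequence (assumption)
y = torch.softmax(model(ids), dim=-1)
```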

82. Training with Stochastic Gradient Descent

Use Stochastic Gradient Descent (SGD) to train the model we built in Problem 81. During training, print the loss and accuracy on the training data and on the validation data. Stop training according to an appropriate criterion (e.g., after 10 epochs).
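A minimal sketch of such a training loop, assuming `model` is the classifier from Problem 81 and that `train_loader` and `valid_loader` yield `(ids, label)` batches (these names are assumptions):

```python
import torch
import torch.nn as nn

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()     # combines log-softmax and negative log-likelihood

def evaluate(loader):
    model.eval()
    total_loss, correct, n = 0.0, 0, 0
    with torch.no_grad():
        for ids, label in loader:
            logits = model(ids)
            total_loss += criterion(logits, label).item() * len(label)
            correct += (logits.argmax(dim=-1) == label).sum().item()
            n += len(label)
    return total_loss / n, correct / n

for epoch in range(10):               # stop after 10 epochs
    model.train()
    for ids, label in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(ids), label)
        loss.backward()
        optimizer.step()
    print(epoch, evaluate(train_loader), evaluate(valid_loader))
```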

83. Mini-batch Training, GPU Training

Modify the code from Problem 82 so that it computes the loss and gradients for a mini-batch of \(B\) instances at a time. You may choose the mini-batch size \(B\) (e.g., \(B=32\)). Then train the model on a GPU.
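A minimal sketch of mini-batching and GPU placement, assuming a `train_dataset` of `(id_tensor, label)` pairs (an assumption); variable-length sequences within a batch are padded with ID 0:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate(batch):
    # Pad each batch to the length of its longest sequence.
    ids, labels = zip(*batch)
    return pad_sequence(ids, batch_first=True, padding_value=0), torch.tensor(labels)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate)

for ids, label in train_loader:
    ids, label = ids.to(device), label.to(device)   # move the batch to the GPU
    loss = criterion(model(ids), label)
    # ... backward/step as in Problem 82
```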

84. Add Pretrained Word Embeddings

Use pretrained word embeddings (e.g., the Google News embeddings, which have been trained on approx. 100 billion tokens) to initialize \(\mathrm{emb}(x)\) and train the model.
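A minimal sketch of initializing the embedding layer with the Google News vectors via gensim, assuming the model from Problem 81 and a local copy of the vectors (the file path is an assumption):

```python
import torch
from gensim.models import KeyedVectors

# Load the pretrained 300-dimensional word2vec vectors (path is an assumption).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

weights = model.emb.weight.data          # (vocab_size, 300)
for word, idx in word2id.items():
    if word in kv:                       # copy the pretrained vector when available
        weights[idx] = torch.from_numpy(kv[word].copy())
```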

85. Bi-directional RNN and Multi-layer RNN

Encode the input text using both forward and backward RNNs and train the model.

\[\overleftarrow{h}_{T+1} = 0, \\ \overleftarrow{h}_t = {\rm \overleftarrow{RNN}}(\mathrm{emb}(x_t), \overleftarrow{h}_{t+1}), \\ y = {\rm softmax}(W^{(yh)} [\overrightarrow{h}_T; \overleftarrow{h}_1] + b^{(y)})\]

Here, \(\overrightarrow{h}_t \in \mathbb{R}^{d_h}\) and \(\overleftarrow{h}_t \in \mathbb{R}^{d_h}\) are the hidden state vectors at time step \(t\) obtained by the forward and backward RNNs, respectively, and \({\rm \overleftarrow{RNN}}(x,h)\) is an RNN cell that computes the hidden state for the previous time step from the input \(x\) and the hidden state \(h\) of the following time step. \(W^{(yh)} \in \mathbb{R}^{L \times 2d_h}\) is a matrix for predicting the category from the hidden state vectors, and \(b^{(y)} \in \mathbb{R}^{L}\) is a bias term. \([a; b]\) denotes the concatenation of the two vectors \(a\) and \(b\).

In addition, experiment with multi-layered bidirectional RNNs.
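A minimal sketch of a bidirectional, multi-layer variant using `nn.RNN` with `bidirectional=True` (the class name and the choice of two layers are assumptions; `nn.LSTM` would be a drop-in alternative):

```python
import torch
import torch.nn as nn

class BiRNNClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, d_w=300, d_h=50, num_layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_w, padding_idx=0)
        self.rnn = nn.RNN(d_w, d_h, num_layers=num_layers,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * d_h, num_classes)   # W^(yh) in R^{L x 2d_h}

    def forward(self, x):
        e = self.emb(x)
        _, h_n = self.rnn(e)                        # h_n: (num_layers * 2, batch, d_h)
        # The last two entries are the top layer's final forward and backward
        # states, i.e. the concatenation [h->_T; h<-_1].
        h = torch.cat([h_n[-2], h_n[-1]], dim=-1)   # (batch, 2*d_h)
        return self.fc(h)
```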

86. Convolutional Neural Networks (CNN)

Let us consider the word sequence \(\boldsymbol{x} = (x_1, x_2, \dots, x_T)\). Here, \(T\) is the number of words in the sequence, and \(x_t \in \mathbb{R}^{V}\) is a one-hot vector representing the ID of the corresponding word (\(V\) is the vocabulary size).

Implement a model that predicts a category \(y\) from the given sequence \(\boldsymbol{x}\) using a Convolutional Neural Network (CNN).

The configuration of the CNN is as follows:

  • Word embedding dimension: \(d_w\)
  • Convolution filter size: 3 tokens
  • Convolution stride: 1 token
  • Convolution padding: Yes
  • Size of the vector at each time step after the convolution: \(d_h\)
  • Apply max pooling after the convolution to represent the input sentence as a \(d_h\)-dimensional hidden vector.

That is, the feature vector \(p_t \in \mathbb{R}^{d_h}\) at time \(t\) is expressed via the following equation:

\[p_t = g(W^{(px)} [\mathrm{emb}(x_{t-1}); \mathrm{emb}(x_t); \mathrm{emb}(x_{t+1})] + b^{(p)})\]

Here, \(W^{(px)} \in \mathbb{R}^{d_h \times 3d_w}\) and \(b^{(p)} \in \mathbb{R}^{d_h}\) are parameters of the CNN, \(g\) is an activation function (e.g., \(\tanh\) or ReLU), and \([a; b; c]\) is the concatenation of the vectors \(a\), \(b\), and \(c\). The matrix \(W^{(px)}\) has \(3d_w\) columns because the linear transformation is applied to the concatenation of the word embeddings of three tokens.

In max pooling, the maximum over all time steps is taken for each dimension of the feature vectors, yielding the feature vector \(c \in \mathbb{R}^{d_h}\) of the input document. Letting \(c[i]\) denote the \(i\)-th dimension of the vector \(c\), max pooling is given by the following equation:

\[c[i] = \max_{1 \leq t \leq T} p_t[i]\]

Lastly, predict the category \(y\) by applying a linear transformation with the matrix \(W^{(yc)} \in \mathbb{R}^{L \times d_h}\) and the bias \(b^{(y)} \in \mathbb{R}^{L}\) to the document's feature vector, followed by the softmax function:

\[y = {\rm softmax}(W^{(yc)} c + b^{(y)})\]

As in Problem 81, you do not have to train the model; it is fine to compute \(y\) with randomly initialized weights.
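A minimal sketch of this CNN in PyTorch, where `nn.Conv1d` with `kernel_size=3` and `padding=1` implements the three-token window (the class name, the use of ReLU, and the dummy input are assumptions):

```python
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, d_w=300, d_h=50):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_w, padding_idx=0)
        self.conv = nn.Conv1d(d_w, d_h, kernel_size=3, stride=1, padding=1)  # W^(px), b^(p)
        self.fc = nn.Linear(d_h, num_classes)                                # W^(yc), b^(y)

    def forward(self, x):                       # x: (batch, T) word IDs
        e = self.emb(x).transpose(1, 2)         # (batch, d_w, T) for Conv1d
        p = torch.relu(self.conv(e))            # p_t = g(W^(px)[...] + b^(p)), g = ReLU
        c = p.max(dim=-1).values                # max pooling over time steps
        return self.fc(c)                       # (batch, L) unnormalized scores

# Compute y with randomly initialized weights, without training.
model = CNNClassifier(vocab_size=len(word2id) + 1, num_classes=4)
y = torch.softmax(model(torch.tensor([[5, 12, 7]])), dim=-1)
```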

87. CNN Learning via Stochastic Gradient Descent

Use Stochastic Gradient Descent (SGD) to train the model you constructed in Problem 86. As in Problem 82, print the loss and accuracy on the training data and on the validation data during training, and stop training according to a suitable criterion (e.g., after 10 epochs).

88. Hyper-parameter Tuning

Modify the code from Problem 85 or Problem 87 to build a high-performance category classifier by tuning the neural network's architecture and hyper-parameters.
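One simple approach is a grid search over a few hyper-parameters. A minimal sketch, assuming a `train_and_evaluate` helper built from the loops in Problems 82–83 that returns validation accuracy (the helper and the grids are assumptions):

```python
import itertools

grid = {
    "lr": [0.01, 0.05, 0.1],
    "d_h": [50, 100, 300],
    "batch_size": [16, 32, 64],
}

best_acc, best_config = 0.0, None
for lr, d_h, batch_size in itertools.product(*grid.values()):
    model = BiRNNClassifier(vocab_size=len(word2id) + 1, num_classes=4, d_h=d_h)
    # train_and_evaluate is a hypothetical helper wrapping the Problem 82/83 loops
    acc = train_and_evaluate(model, lr=lr, batch_size=batch_size)
    if acc > best_acc:
        best_acc, best_config = acc, (lr, d_h, batch_size)
print(best_config, best_acc)
```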

89. Transfer Learning from a Pretrained Language Model

Starting from a pretrained language model (e.g., BERT), build a model that classifies news article headlines into their categories.
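A minimal sketch of the forward pass with Hugging Face Transformers, assuming the `bert-base-uncased` checkpoint and four categories (both are assumptions); fine-tuning then follows the same loop as Problem 82:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4)   # classification head on top of BERT

# Tokenize a batch of headlines (a dummy headline here) and run a forward pass.
batch = tokenizer(["Fed official says weak data caused by weather"],
                  padding=True, truncation=True, return_tensors="pt")
logits = model(**batch).logits
y = torch.softmax(logits, dim=-1)
```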