Chapter 8: Neural networks

In this chapter, we implement a neural network model for the text classification task. We then apply the model to the news classification dataset that we used in Chapter 6. You may want to use a deep learning framework such as PyTorch, TensorFlow, or Chainer.

70. Generating Features through Word Vector Summation

Let us consider converting the dataset from problem 50 into feature vectors. Specifically, we want to create a matrix \(X\) (the sequence of feature vectors of all instances) and a vector \(Y\) (the sequence of gold labels of all instances):

\[X = \begin{pmatrix} \boldsymbol{x}_1 \\ \boldsymbol{x}_2 \\ \vdots \\ \boldsymbol{x}_n \end{pmatrix} \in \mathbb{R}^{n \times d}, \quad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{N}^{n}\]

Here, \(n\) represents the number of instances in the training data. \(\boldsymbol{x}_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{N}\) represent the feature vector and the target (gold) label of the \(i\)-th instance (\(i \in \{1, \dots, n\}\)), respectively. Note that the task is to classify a given headline into one of the following four categories: “Business”, “Science”, “Entertainment”, and “Health”. Let \(\mathbb{N}_4\) denote the set of natural numbers smaller than 4 (including zero), so that the gold label of a given instance can be written as \(y_i \in \mathbb{N}_4\). Let \(L\) denote the number of labels (here, \(L = 4\)).

The feature vector of the \(i\)-th instance, \(\boldsymbol{x}_i\), is computed as follows:

\[\boldsymbol{x}_i = \frac{1}{T_i} \sum_{t=1}^{T_i} \mathrm{emb}(w_{i,t}),\]

where the \(i\)-th instance consists of \(T_i\) tokens \((w_{i,1}, \dots, w_{i,T_i})\) and \(\mathrm{emb}(w) \in \mathbb{R}^{d}\) represents the (size-\(d\)) word vector corresponding to word \(w\). In other words, the \(i\)-th headline is represented as the average of the word vectors of all words in the headline. For the word embeddings, use pretrained word vectors of dimension \(300\) (i.e., \(d = 300\)).

The gold label of the \(i\)-th instance, \(y_i\), is defined as follows:

\[y_i = \begin{cases} 0 & (\mbox{if article }\boldsymbol{x}_i\mbox{ belongs to Business category}) \\ 1 & (\mbox{if article }\boldsymbol{x}_i\mbox{ belongs to Science category}) \\ 2 & (\mbox{if article }\boldsymbol{x}_i\mbox{ belongs to Entertainment category}) \\ 3 & (\mbox{if article }\boldsymbol{x}_i\mbox{ belongs to Health category}) \\ \end{cases}\]

Note that you do not have to strictly follow the definition above as long as there is a one-to-one mapping between category names and label indices.

Based on the specifications above, create the following matrices and vectors and save them into binary files:

  • Training data feature matrix: \(X_{\rm train} \in \mathbb{R}^{N_t \times d}\)
  • Training data label vector: \(Y_{\rm train} \in \mathbb{N}^{N_t}\)
  • Validation data feature matrix: \(X_{\rm valid} \in \mathbb{R}^{N_v \times d}\)
  • Validation data label vector: \(Y_{\rm valid} \in \mathbb{N}^{N_v}\)
  • Test data feature matrix: \(X_{\rm test} \in \mathbb{R}^{N_e \times d}\)
  • Test data label vector: \(Y_{\rm test} \in \mathbb{N}^{N_e}\)

Here, \(N_t, N_v, N_e\) represent the numbers of instances in the training, validation, and test data, respectively.
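A minimal sketch in Python, assuming the pretrained 300-dimensional GoogleNews word2vec binary from Chapter 7 and tab-separated `train.txt` / `valid.txt` / `test.txt` splits from problem 50, each line holding a category code and a headline; the file names, column layout, and single-letter category codes are assumptions:

```python
import numpy as np
from gensim.models import KeyedVectors

# Pretrained 300-dimensional word2vec vectors (file name is an assumption).
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Assumed mapping from the dataset's category codes to label indices.
label_index = {"b": 0, "t": 1, "e": 2, "m": 3}

def vectorize(path):
    """Turn one 'category TAB headline' file into (X, Y) arrays."""
    features, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            category, headline = line.rstrip("\n").split("\t")
            # Average the vectors of the words the embedding model knows.
            vecs = [wv[w] for w in headline.split() if w in wv]
            features.append(np.mean(vecs, axis=0) if vecs
                            else np.zeros(300, dtype=np.float32))
            labels.append(label_index[category])
    return np.vstack(features).astype(np.float32), np.array(labels, dtype=np.int64)

for split in ("train", "valid", "test"):
    X, Y = vectorize(f"{split}.txt")
    np.save(f"X_{split}.npy", X)  # feature matrix
    np.save(f"Y_{split}.npy", Y)  # label vector
```

Headlines whose words are all out of vocabulary are mapped to the zero vector here; you may prefer to drop them instead.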

71. Building Single Layer Neural Network

Load the matrices and vectors from problem 70, and compute the following on the training data:

\[\hat{\boldsymbol{y}}_1 = {\rm softmax}(\boldsymbol{x}_1 W), \\ \hat{Y} = {\rm softmax}(X_{[1:4]} W)\]

Here, \({\rm softmax}\) refers to the softmax function, and \(X_{[1:4]} \in \mathbb{R}^{4 \times d}\) is the vertical concatenation of \(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3, \boldsymbol{x}_4\):

\[X_{[1:4]} = \begin{pmatrix} \boldsymbol{x}_1 \\ \boldsymbol{x}_2 \\ \boldsymbol{x}_3 \\ \boldsymbol{x}_4 \\ \end{pmatrix}\]

The matrix \(W \in \mathbb{R}^{d \times L}\) is the weight of the single-layer neural network. You may initialize the weight randomly for now (we will update the parameters in later problems). Note that \(\hat{\boldsymbol{y}}_1 \in \mathbb{R}^L\) represents a probability distribution over the categories. Similarly, \(\hat{Y} \in \mathbb{R}^{4 \times L}\) stacks the probability distributions for the training instances \(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3, \boldsymbol{x}_4\).
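A sketch with PyTorch, reusing the `.npy` files saved by the problem 70 sketch:

```python
import numpy as np
import torch

X_train = torch.from_numpy(np.load("X_train.npy"))  # shape (N_t, 300)

d, L = 300, 4
torch.manual_seed(0)
W = torch.randn(d, L)  # randomly initialized weight matrix

y_hat_1 = torch.softmax(X_train[:1] @ W, dim=-1)  # distribution for x_1
Y_hat = torch.softmax(X_train[:4] @ W, dim=-1)    # distributions for x_1..x_4
print(y_hat_1)
print(Y_hat)
```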

72. Calculating loss and gradients

Calculate the cross-entropy loss and its gradients with respect to the matrix \(W\), both on the single training sample \(\boldsymbol{x}_1\) and on the set of samples \(\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3, \boldsymbol{x}_4\). The loss on a single sample is calculated using the following formula:

\[l_i = -\log[\text{probability that sample } \boldsymbol{x}_i \text{ is classified as } y_i]\]

The cross-entropy loss for a set of samples is the average of the losses of each sample included in the set.
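A sketch using PyTorch autograd; note that `F.cross_entropy` expects raw scores (logits), so it is applied to \(XW\) before the softmax:

```python
import numpy as np
import torch
import torch.nn.functional as F

X_train = torch.from_numpy(np.load("X_train.npy"))
Y_train = torch.from_numpy(np.load("Y_train.npy"))

W = torch.randn(300, 4, requires_grad=True)

# Loss and gradient for the single sample x_1.
loss_1 = F.cross_entropy(X_train[:1] @ W, Y_train[:1])
loss_1.backward()
print(loss_1.item(), W.grad)

# Loss and gradient for x_1..x_4 (the mean of the per-sample losses).
W.grad = None  # discard the gradient accumulated above
loss_4 = F.cross_entropy(X_train[:4] @ W, Y_train[:4])
loss_4.backward()
print(loss_4.item(), W.grad)
```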

73. Learning with stochastic gradient descent

Update the matrix \(W\) using stochastic gradient descent (SGD). Terminate training with an appropriate criterion, for example, “stop after 100 epochs”.
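A sketch of a per-sample SGD loop in PyTorch; the learning rate and the 100-epoch stopping criterion are arbitrary choices:

```python
import numpy as np
import torch
from torch import nn, optim

X_train = torch.from_numpy(np.load("X_train.npy"))
Y_train = torch.from_numpy(np.load("Y_train.npy"))

model = nn.Linear(300, 4, bias=False)  # a single layer: y = softmax(xW)
loss_fn = nn.CrossEntropyLoss()        # applies log-softmax internally
optimizer = optim.SGD(model.parameters(), lr=1e-1)

for epoch in range(100):  # termination criterion: stop after 100 epochs
    for x, y in zip(X_train, Y_train):  # one sample at a time (pure SGD)
        optimizer.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch + 1}: last-sample loss {loss.item():.4f}")
```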

74. Measuring accuracy

Compute the classification accuracy on both the training and evaluation data using the matrix obtained in problem 73.
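A sketch that reuses `model`, `X_train`, and `Y_train` from the problem 73 sketch; `X_test.npy` / `Y_test.npy` follow the naming assumed in problem 70:

```python
import numpy as np
import torch

def accuracy(model, X, Y):
    with torch.no_grad():                      # no gradients needed here
        predictions = model(X).argmax(dim=-1)  # most probable category
    return (predictions == Y).float().mean().item()

X_test = torch.from_numpy(np.load("X_test.npy"))
Y_test = torch.from_numpy(np.load("Y_test.npy"))
print("train accuracy:", accuracy(model, X_train, Y_train))
print("test accuracy:", accuracy(model, X_test, Y_test))
```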

75. Plotting loss and accuracy

Modify the code from problem 73 so that the loss and accuracy on both the training and evaluation data are plotted on a graph after each epoch. Use this graph to monitor the progress of training. A sketch of the plotting step follows below.
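One way to draw the plots with matplotlib, assuming that lists `train_losses`, `valid_losses`, `train_accs`, and `valid_accs` are appended to after every epoch of the problem 73 loop (these variable names are assumptions):

```python
import matplotlib.pyplot as plt

epochs = range(1, len(train_losses) + 1)
fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
# Left panel: loss curves for both splits.
ax_loss.plot(epochs, train_losses, label="train")
ax_loss.plot(epochs, valid_losses, label="valid")
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("loss")
ax_loss.legend()
# Right panel: accuracy curves for both splits.
ax_acc.plot(epochs, train_accs, label="train")
ax_acc.plot(epochs, valid_accs, label="valid")
ax_acc.set_xlabel("epoch")
ax_acc.set_ylabel("accuracy")
ax_acc.legend()
fig.savefig("learning_curve.png")
```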

76. Checkpoints

Modify the code from problem 75 to write out a checkpoint to a file after each epoch. A checkpoint should include the parameter values, such as the weight matrices, and the internal state of the optimization algorithm.
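With PyTorch, `state_dict()` exposes both the model parameters and the optimizer state; a sketch of the per-epoch saving step, to be placed inside the training loop:

```python
import torch

# At the end of each epoch: save the model parameters and the optimizer
# state (e.g., momentum buffers) so that training can be resumed later.
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, f"checkpoint_{epoch:03d}.pt")
```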

77. Mini-batches

Modify the code from problem 76 to compute the loss/gradient and update the matrix \(W\) for every \(B\) samples (a mini-batch). Compare the time required for one training epoch as \(B\) is varied over \(1, 2, 4, 8, \dots\).
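A sketch using PyTorch's `DataLoader` to form mini-batches, reusing `model`, `loss_fn`, `optimizer`, `X_train`, and `Y_train` from the earlier sketches; batch sizes beyond 8 are just examples:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(X_train, Y_train)
for B in (1, 2, 4, 8, 16, 32):
    loader = DataLoader(dataset, batch_size=B, shuffle=True)
    start = time.time()
    for x, y in loader:  # one epoch with mini-batches of size B
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"B={B}: {time.time() - start:.2f} s per epoch")
```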

78. Training on a GPU

Modify the code from problem 77 so that it runs on a GPU.
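With PyTorch, this amounts to moving the parameters and each mini-batch to the GPU; a sketch, reusing `loader`, `loss_fn`, and `optimizer` from the problem 77 sketch:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # moves the parameters in place for an nn.Module

for x, y in loader:
    x, y = x.to(device), y.to(device)  # move each mini-batch as well
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```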

79. Multilayer Neural Networks

Modify the code from problem 78 to build a higher-performing classifier by changing the architecture of the neural network. Try introducing bias terms and multiple layers.
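One possible architecture as a sketch: bias terms (PyTorch's `nn.Linear` includes them by default), a hidden layer with a ReLU non-linearity, and dropout for regularization; the hidden size and dropout rate are assumptions to tune on the validation data:

```python
import torch
from torch import nn

# A deeper network replacing the single linear layer of problem 73.
model = nn.Sequential(
    nn.Linear(300, 128),  # hidden layer with bias
    nn.ReLU(),
    nn.Dropout(0.2),      # regularization; rate is an assumption
    nn.Linear(128, 4),    # output scores for the four categories
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
```

The rest of the training loop from problems 77 and 78 can be reused unchanged, since `nn.CrossEntropyLoss` still applies to the output scores.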