This chapter introduces the concept of the word vectors (i.e., word embeddings). Create the following problems.
60. Loading word vectors
Download word vectors that are pretrained on Google News dataset (approx. 100 billion words). The file contains word vectors of 3 million words/phrases, whose dimentionalities are 300. Print out the word vector of the term “United States”. Note that “United States” is represented as “United_States” in the file.
61. Word similarity
Compute the cosine similarity between “United States” and “U.S.”
62. Top-10 most similar words
Find the top-10 words that have the highest cosine similarity with the word “United States” and print out the similarity score.
63. Analogy based on the additive composition
Subtract the vector of “Madrid” from the vector of “Spain” and then add the vector of “Athens”. Compute the top-10 most similar words with the output vector.
64. Analogy data experiment
Download word analogy evaluation dataset. Compute the vector as follows: vec(word in second column) - vec(word in first column) + vec(word in third column). From the output vector, (1) find the most similar word and (2) compute the similarity score with the word. Append the most similar word and its similarity to each row of the downloaded file.
65. Accuracy score on the analogy task
From the output of the problem 64, compute the accuracy score on both the semantic analogy and the syntactic analogy.
66. Evaluation on WordSimilarity-353
Download the test data from The WordSimilarity-353 Test Collection. Compute the spearman’s rank correlation coefficient between two similarity rank scores: (1) similarity computed from word vectors and (2) similarity evaluated by the human.
67. k-means clustering
Extract the word vectors of the country names. Apply k-means clustering where k=5.
68. Ward’s method clustering
Apply hierarchical clustering to the word vectors of the country names. Use Ward’s method for the distance metric between two clusters. Visualize the clustering result as the dendrogram.
69. t-SNE Visualization
Visualize the word vector space of the country names by t-SNE.