Chapter 7: Word Vector

This chapter introduces the concept of word vectors (i.e., word embeddings). Write programs that solve the following problems.

60. Loading word vectors

Download word vectors that were pretrained on the Google News dataset (approx. 100 billion words). The file contains vectors for 3 million words and phrases, each with 300 dimensions. Print the word vector of the term “United States”. Note that “United States” is represented as “United_States” in the file.
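As a starting point, the sketch below reads the binary word2vec layout using only the standard library. It assumes the file starts with a text header line ("vocabulary size, dimensionality"), followed by one entry per word: the word, a space, the vector as little-endian 32-bit floats, and a newline. The function name `read_word2vec_binary` and the two-word toy file are made up for illustration. For the real 3-million-word file a library loader, for example gensim's `KeyedVectors.load_word2vec_format(path, binary=True)`, is likely to be far faster.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Read word vectors from a binary word2vec stream into a dict (sketch)."""
    # Header line: "<vocabulary size> <dimensionality>\n"
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        # The word is stored as raw bytes terminated by a single space.
        word_bytes = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous entry
                word_bytes += ch
        # The vector follows as `dim` little-endian 32-bit floats.
        values = struct.unpack(f"<{dim}f", stream.read(4 * dim))
        vectors[word_bytes.decode("utf-8")] = values
    return vectors


# Toy file built in memory: 2 words, 3 dimensions, made-up values.
toy_file = io.BytesIO(
    b"2 3\n"
    + b"United_States " + struct.pack("<3f", 0.5, -1.0, 0.25) + b"\n"
    + b"U.S. " + struct.pack("<3f", 0.5, -0.5, 0.0) + b"\n"
)
toy_vectors = read_word2vec_binary(toy_file)
print(toy_vectors["United_States"])
```

With the real file, the stream would come from `open(path, "rb")`, where `path` points to the downloaded (and decompressed) vector file.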

61. Word similarity

Compute the cosine similarity between “United States” and “U.S.”
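Cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below implements that definition directly. The two three-dimensional vectors are made-up stand-ins for the real 300-dimensional vectors of “United_States” and “U.S.”.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors standing in for the pretrained ones.
united_states = [0.5, -1.0, 0.25]
us_abbreviation = [0.5, -0.5, 0.0]
similarity = cosine_similarity(united_states, us_abbreviation)
print(similarity)
```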

62. Top-10 most similar words

Find the 10 words that have the highest cosine similarity with “United States”, and print each word together with its similarity score.
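One simple way to approach this is a linear scan: score every other word in the vocabulary against the query and keep the best `topn`. The sketch below does this with `heapq.nlargest`. The helper name `most_similar`, the dict-of-lists vocabulary, and the toy values are assumptions for illustration. On the full 3-million-word vocabulary a vectorized implementation would be much faster.

```python
import heapq
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(vectors, query, topn=10):
    """Return up to `topn` (word, similarity) pairs, best first, excluding the query."""
    query_vec = vectors[query]
    scored = (
        (cosine_similarity(query_vec, vec), word)
        for word, vec in vectors.items()
        if word != query
    )
    return [(word, score) for score, word in heapq.nlargest(topn, scored)]


# Made-up toy vocabulary; real vectors have 300 dimensions.
toy_vectors = {
    "United_States": [1.0, 0.0],
    "U.S.": [0.9, 0.1],
    "America": [0.8, 0.3],
    "banana": [0.0, 1.0],
}
neighbours = most_similar(toy_vectors, "United_States", topn=10)
for word, score in neighbours:
    print(f"{word}\t{score:.4f}")
```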

63. Analogy based on the additive composition

Subtract the vector of “Madrid” from the vector of “Spain” and then add the vector of “Athens”. Find the 10 words most similar to the resulting vector.
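The arithmetic can be sketched as vec(“Spain”) - vec(“Madrid”) + vec(“Athens”), followed by a similarity ranking. The helper name `analogy` and the toy vectors are made up. Excluding the three input words from the ranking is an assumption of this sketch (the problem statement does not require it), made because the inputs themselves often rank near the top.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, a, b, c, topn=10):
    """Rank words by similarity to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)  # assumption: skip the input words
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up toy vectors: the third coordinate separates countries from capitals.
toy_vectors = {
    "Madrid": [1.0, 0.0, 0.0],
    "Spain": [1.0, 0.0, 1.0],
    "Athens": [0.0, 1.0, 0.0],
    "Greece": [0.0, 1.0, 1.0],
    "Italy": [0.5, 0.5, 1.0],
}
result = analogy(toy_vectors, "Madrid", "Spain", "Athens", topn=10)
for word, score in result:
    print(f"{word}\t{score:.4f}")
```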

64. Analogy data experiment

Download the word analogy evaluation dataset. For each row, compute the vector vec(word in second column) - vec(word in first column) + vec(word in third column). From the resulting vector, (1) find the most similar word and (2) compute its similarity score. Append the most similar word and its similarity to each row of the downloaded file.
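The per-row processing might look like the sketch below. It assumes the dataset is a text file in which lines starting with ":" are category headers and every other line holds four whitespace-separated words. It also assumes every word is in the vocabulary and, as a choice of this sketch, skips the three query words when searching for the nearest word. The function name `annotate_analogies` and the toy data are illustrative.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def annotate_analogies(lines, vectors):
    """Append the predicted word and its similarity to every analogy row."""
    annotated = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(":") or not line.strip():
            annotated.append(line)  # category header or blank line: keep as is
            continue
        first, second, third = line.split()[:3]
        target = [
            b - a + c
            for a, b, c in zip(vectors[first], vectors[second], vectors[third])
        ]
        # Assumption: the three query words are not allowed as answers.
        candidates = [w for w in vectors if w not in (first, second, third)]
        best = max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))
        score = cosine_similarity(target, vectors[best])
        annotated.append(f"{line} {best} {score:.4f}")
    return annotated


# Made-up toy vectors and a two-line excerpt in the assumed file format.
toy_vectors = {
    "Madrid": [1.0, 0.0, 0.0],
    "Spain": [1.0, 0.0, 1.0],
    "Athens": [0.0, 1.0, 0.0],
    "Greece": [0.0, 1.0, 1.0],
    "Italy": [0.5, 0.5, 1.0],
}
toy_lines = [": capital-common-countries", "Madrid Spain Athens Greece"]
annotated = annotate_analogies(toy_lines, toy_vectors)
print("\n".join(annotated))
```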

65. Accuracy score on the analogy task

Using the output of problem 64, compute the accuracy separately for the semantic analogies and the syntactic analogies.
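A possible way to split the score is sketched below. It assumes the annotated rows have the layout "w1 w2 w3 expected predicted similarity" and that syntactic categories are the ones whose header name starts with "gram"; both are assumptions about the data, not something the problem statement specifies. The toy rows and their similarity values are made up.

```python
def analogy_accuracy(annotated_lines):
    """Return accuracy per analogy type from rows annotated with a prediction."""
    counts = {"semantic": [0, 0], "syntactic": [0, 0]}  # [correct, total]
    kind = None
    for line in annotated_lines:
        if line.startswith(":"):
            # Assumption: syntactic category names start with "gram".
            name = line[1:].strip()
            kind = "syntactic" if name.startswith("gram") else "semantic"
            continue
        cols = line.split()
        if kind is None or len(cols) < 5:
            continue  # skip rows before the first header or malformed rows
        counts[kind][1] += 1
        if cols[3] == cols[4]:  # expected word equals predicted word
            counts[kind][0] += 1
    return {
        k: (correct / total if total else None)
        for k, (correct, total) in counts.items()
    }


# Made-up annotated rows in the assumed layout.
toy_rows = [
    ": capital-common-countries",
    "Athens Greece Madrid Spain Spain 0.91",
    "Athens Greece Rome Italy France 0.55",
    ": gram3-comparative",
    "good better bad worse worse 0.73",
]
accuracy = analogy_accuracy(toy_rows)
print(accuracy)
```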

66. Evaluation on WordSimilarity-353

Download the test data from The WordSimilarity-353 Test Collection. Compute Spearman’s rank correlation coefficient between two similarity rankings: (1) the similarity computed from the word vectors and (2) the similarity judged by humans.
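Spearman's coefficient is the Pearson correlation of the two rank sequences, with tied values sharing their average rank. The sketch below spells that out with the standard library. The four score pairs are made up; a library routine such as `scipy.stats.spearmanr` would normally be the more convenient choice.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for four word pairs.
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.8, 0.6, 0.7, 0.1]
rho = spearman(human_scores, model_scores)
print(rho)
```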

67. k-means clustering

Extract the word vectors of the country names. Apply k-means clustering where k=5.
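To illustrate what k-means does, here is a minimal sketch of Lloyd's algorithm: assign every point to its nearest centroid, recompute the centroids, and repeat until the assignment stops changing. The initialization (a seeded random sample of the points), the function name `kmeans`, and the toy two-dimensional "country vectors" are assumptions of this sketch. The toy run uses k=2 because it has only six points; the problem itself asks for k=5, and a library implementation such as scikit-learn's `KMeans` would be a natural choice for the real vectors.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up 2-D stand-ins for country vectors, forming two obvious groups.
names = ["Spain", "Greece", "Italy", "Japan", "China", "Korea"]
points = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
labels, centroids = kmeans(points, k=2)
for name, label in zip(names, labels):
    print(name, label)
```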

68. Ward’s method clustering

Apply hierarchical clustering to the word vectors of the country names. Use Ward’s method as the linkage criterion, i.e., as the measure of distance between two clusters. Visualize the clustering result as a dendrogram.
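Ward's method repeatedly merges the pair of clusters whose union causes the smallest increase in within-cluster sum of squares, which for clusters A and B equals |A||B| / (|A| + |B|) times the squared distance between their centroids. The naive sketch below records the merge order under that criterion. The function name `ward_merges` and the toy points are made up, and the recorded cost is this increase rather than any particular library's distance scale. Drawing the dendrogram is left to a plotting library, for example `scipy.cluster.hierarchy.linkage` with `method="ward"` followed by `dendrogram`.

```python
def ward_merges(points):
    """Agglomerative clustering with Ward's criterion (naive sketch).

    Returns a list of merges (id_a, id_b, cost, size). Original points have
    ids 0..n-1; each merge creates a new cluster with the next free id.
    """
    clusters = {i: ([i], list(p)) for i, p in enumerate(points)}  # members, centroid
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                (members_a, cen_a), (members_b, cen_b) = clusters[a], clusters[b]
                na, nb = len(members_a), len(members_b)
                dist2 = sum((p - q) ** 2 for p, q in zip(cen_a, cen_b))
                cost = na * nb / (na + nb) * dist2  # increase in sum of squares
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        members_a, cen_a = clusters.pop(a)
        members_b, cen_b = clusters.pop(b)
        na, nb = len(members_a), len(members_b)
        centroid = [(na * p + nb * q) / (na + nb) for p, q in zip(cen_a, cen_b)]
        clusters[next_id] = (members_a + members_b, centroid)
        merges.append((a, b, cost, na + nb))
        next_id += 1
    return merges


# Made-up 2-D stand-ins for four country vectors.
toy_points = [[0, 0], [0, 1], [10, 0], [10, 1]]
merges = ward_merges(toy_points)
for merge in merges:
    print(merge)
```

The pairwise search is repeated at every step, so this version is only practical for small inputs.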

69. t-SNE Visualization

Visualize the word vector space of the country names using t-SNE.