The zip archive alice.zip contains the novel Alice’s Adventures in Wonderland by Lewis Carroll, as distributed on Project Gutenberg, in the form of a text file.
Apply a part-of-speech (POS) tagger to the text file, and store the result in another file.
Implement programs that read the POS tagging result and perform the tasks listed below.
For your convenience, the zip archive also includes alice.txt.conll, the novel with part-of-speech labels predicted by Stanford CoreNLP.
It may be a good idea to use matplotlib or Gnuplot for problems 37, 38, and 39.
30. Reading the result
Implement a program that reads the result of part-of-speech tagging.
Here, represent a sentence as a list of mapping objects, each of which maps keys to a token's surface form, lemma (base form), and part-of-speech tag.
Use this representation in the rest of the problems.
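A minimal sketch of such a reader follows. It assumes the tagger output is tab-separated with the surface form, lemma, and POS tag in the second, third, and fourth columns, and that a blank line separates sentences; check the actual layout of alice.txt.conll before relying on this. The key names `text`, `lemma`, and `pos` and the function name `read_conll` are choices made for this illustration.

```python
from typing import Dict, Iterable, List

Token = Dict[str, str]


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Parse CoNLL-style lines into a list of sentences (lists of token dicts)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumption: a blank line marks a sentence boundary.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # skip lines that do not have the assumed columns
        # Assumption: columns are index, surface form, lemma, POS tag, ...
        current.append({"text": cols[1], "lemma": cols[2], "pos": cols[3]})
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the assumed layout, not taken from the real file.
SAMPLE = (
    "1\tAlice\tAlice\tNNP\n"
    "2\twas\tbe\tVBD\n"
    "3\tbeginning\tbegin\tVBG\n"
    "\n"
    "1\tShe\tshe\tPRP\n"
    "2\tran\trun\tVBD\n"
)
sentences = read_conll(SAMPLE.splitlines())
print(len(sentences), sentences[0][1])
```

To read the real file, pass an open file object instead of the sample lines, for example `read_conll(open("alice.txt.conll", encoding="utf-8"))`.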
31. Verbs
Extract surface forms of all verbs appearing in the text.
32. Base forms of verbs
Extract lemmas of all verbs appearing in the text.
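Collecting verb surface forms and collecting verb lemmas differ only in which field is read, so one helper can serve both. The sketch below assumes Penn Treebank tags (verb tags start with `VB`) and the hypothetical keys `text`, `lemma`, and `pos`; the sample sentences are hand-made, not real tagger output.

```python
from typing import Dict, List

Token = Dict[str, str]

# Hypothetical hand-made sample in the assumed representation.
sentences: List[List[Token]] = [
    [
        {"text": "Alice", "lemma": "Alice", "pos": "NNP"},
        {"text": "was", "lemma": "be", "pos": "VBD"},
        {"text": "beginning", "lemma": "begin", "pos": "VBG"},
    ],
    [
        {"text": "She", "lemma": "she", "pos": "PRP"},
        {"text": "ran", "lemma": "run", "pos": "VBD"},
    ],
]


def extract_verbs(sentences: List[List[Token]], key: str) -> List[str]:
    """Collect the given field of every token whose tag starts with 'VB'."""
    return [
        tok[key]
        for sent in sentences
        for tok in sent
        if tok["pos"].startswith("VB")  # assumption: Penn Treebank verb tags
    ]


surface_forms = extract_verbs(sentences, "text")
lemmas = extract_verbs(sentences, "lemma")
print(surface_forms)
print(lemmas)
```

Wrapping either result in `set(...)` gives the distinct forms if duplicates are not wanted.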
33. A of B
Extract noun phrases in the form of “A of B”, where A and B are nouns.
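One possible approach is to slide a window of three tokens over each sentence and keep the windows that match noun, "of", noun. The sketch assumes Penn Treebank tags (noun tags start with `NN`) and the hypothetical keys `text` and `pos`; the sample is hand-tagged for illustration.

```python
from typing import Dict, List

Token = Dict[str, str]


def tok(text: str, pos: str) -> Token:
    """Build a token dict; the lemma is omitted because it is not needed here."""
    return {"text": text, "pos": pos}


# Hypothetical hand-tagged sample.
sentences: List[List[Token]] = [
    [tok("the", "DT"), tok("Queen", "NNP"), tok("of", "IN"),
     tok("Hearts", "NNPS"), tok("made", "VBD"), tok("tarts", "NNS")],
    [tok("out", "IN"), tok("of", "IN"), tok("sight", "NN")],
]


def extract_a_of_b(sentences: List[List[Token]]) -> List[str]:
    """Return every 'A of B' string where A and B carry a noun tag."""
    phrases = []
    for sent in sentences:
        # zip over three shifted views yields each consecutive token triple.
        for a, mid, b in zip(sent, sent[1:], sent[2:]):
            if (a["pos"].startswith("NN")
                    and mid["text"].lower() == "of"
                    and b["pos"].startswith("NN")):
                phrases.append(f'{a["text"]} of {b["text"]}')
    return phrases


phrases = extract_a_of_b(sentences)
print(phrases)
```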
34. A B
Extract the longest noun phrase consisting of consecutive nouns.
35. Frequency of words
Obtain the list of words and their frequencies of occurrence, sorted in descending order of frequency.
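A short sketch using `collections.Counter`, whose `most_common()` method returns (word, count) pairs from most to least frequent. It counts surface forms exactly as they appear under the hypothetical `text` key; whether to lowercase, count lemmas instead, or drop punctuation is left open by the task and is not decided here.

```python
from collections import Counter
from typing import Dict, List, Tuple

Token = Dict[str, str]

# Hypothetical sample; only the surface form is needed for counting.
sentences: List[List[Token]] = [
    [{"text": w} for w in "the cat saw the rabbit".split()],
    [{"text": w} for w in "the rabbit ran".split()],
]


def word_frequencies(sentences: List[List[Token]]) -> List[Tuple[str, int]]:
    """Return (word, count) pairs sorted by descending count."""
    counts = Counter(tok["text"] for sent in sentences for tok in sent)
    return counts.most_common()


frequencies = word_frequencies(sentences)
for word, count in frequencies:
    print(word, count)
```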
36. Top-ten frequent words
Visualize the top-ten frequent words and their frequencies with a chart (e.g., bar chart).
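A possible sketch with matplotlib: compute the top words with `Counter.most_common(n)` and pass them to `Axes.bar`. The function names `top_words` and `plot_bar` and the `text` key are choices made for this illustration, and matplotlib is imported inside the plotting function so the counting part runs without it.

```python
from collections import Counter
from typing import Dict, List, Tuple

Token = Dict[str, str]

# Hypothetical sample; replace with the sentences read from the tagged file.
sentences: List[List[Token]] = [
    [{"text": w} for w in "the cat saw the rabbit".split()],
    [{"text": w} for w in "the rabbit ran".split()],
]


def top_words(sentences: List[List[Token]], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent surface forms with their counts."""
    counts = Counter(tok["text"] for sent in sentences for tok in sent)
    return counts.most_common(n)


def plot_bar(pairs: List[Tuple[str, int]], title: str) -> None:
    """Draw a bar chart of (word, count) pairs."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    words, freqs = zip(*pairs)
    fig, ax = plt.subplots()
    ax.bar(words, freqs)
    ax.set_xlabel("word")
    ax.set_ylabel("frequency")
    ax.set_title(title)
    plt.show()


top = top_words(sentences, n=10)
print(top)
# plot_bar(top, "Top-ten frequent words")  # uncomment to display the chart
```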
37. Top-ten words co-occurring with ‘Alice’
Extract the list of words that co-occur with the word “Alice”. Visualize with a chart (e.g., bar chart) the top-ten words co-occurring with the word “Alice” and their frequencies.
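The task does not define co-occurrence. The sketch below assumes that two words co-occur when they appear in the same sentence, which is only one possible definition (a fixed-size token window is another). The function name and the `text` key are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Token = Dict[str, str]

# Hypothetical sample; replace with the sentences read from the tagged file.
sentences: List[List[Token]] = [
    [{"text": w} for w in "Alice saw the rabbit".split()],
    [{"text": w} for w in "the rabbit ran".split()],
    [{"text": w} for w in "Alice followed the rabbit".split()],
]


def cooccurring_words(sentences: List[List[Token]], target: str = "Alice") -> Counter:
    """Count words that share a sentence with the target word."""
    counts: Counter = Counter()
    for sent in sentences:
        words = [tok["text"] for tok in sent]
        if target in words:
            # Assumption: sentence-level co-occurrence; the target is excluded.
            counts.update(w for w in words if w != target)
    return counts


cooc = cooccurring_words(sentences)
top_cooc = cooc.most_common(10)
print(top_cooc)
```

The resulting pairs can be drawn as a bar chart, for instance with matplotlib's `Axes.bar`.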
38. Histogram
Draw a histogram of word frequencies: the x-axis is the frequency of a word, ranging from 1 to the largest frequency of any word in the corpus, and the y-axis is the number of unique words that occur with that frequency.
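The quantity on the y-axis is a count of counts: for each frequency value, how many distinct words have it. A sketch under the same assumptions as before (surface forms under a hypothetical `text` key, matplotlib imported only when plotting):

```python
from collections import Counter
from typing import Dict, List

Token = Dict[str, str]

# Hypothetical sample; replace with the sentences read from the tagged file.
sentences: List[List[Token]] = [
    [{"text": w} for w in "the cat saw the rabbit".split()],
    [{"text": w} for w in "the rabbit ran".split()],
]

word_counts = Counter(tok["text"] for sent in sentences for tok in sent)

# Maps a frequency to the number of unique words with that frequency.
freq_of_freq = Counter(word_counts.values())
print(sorted(freq_of_freq.items()))


def plot_histogram(word_counts: Counter) -> None:
    """Draw a histogram with one bin per frequency value."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    values = list(word_counts.values())
    fig, ax = plt.subplots()
    # Bin edges at every integer from 1 to max + 1 give one bin per frequency.
    ax.hist(values, bins=range(1, max(values) + 2))
    ax.set_xlabel("frequency")
    ax.set_ylabel("number of unique words")
    plt.show()


# plot_histogram(word_counts)  # uncomment to display the chart
```

On a real corpus most words are rare, so a logarithmic y-axis (`ax.set_yscale("log")`) may make the chart easier to read.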
39. Zipf’s law
Plot a log-log graph with the x-axis being rank order and the y-axis being frequency.
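A sketch of the rank-frequency plot: sort the word counts in descending order, number them from 1, and draw both axes on a logarithmic scale with `Axes.loglog`. Under Zipf's law the points fall roughly on a straight line. The function names and the `text` key are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Token = Dict[str, str]

# Hypothetical sample; replace with the sentences read from the tagged file.
sentences: List[List[Token]] = [
    [{"text": w} for w in "the cat saw the rabbit".split()],
    [{"text": w} for w in "the rabbit ran".split()],
]


def rank_frequency(sentences: List[List[Token]]) -> Tuple[List[int], List[int]]:
    """Return ranks (1 = most frequent) and the matching frequencies."""
    counts = Counter(tok["text"] for sent in sentences for tok in sent)
    freqs = sorted(counts.values(), reverse=True)
    ranks = list(range(1, len(freqs) + 1))
    return ranks, freqs


def plot_zipf(ranks: List[int], freqs: List[int]) -> None:
    """Plot frequency against rank on log-log axes."""
    import matplotlib.pyplot as plt  # imported here so counting works without it

    fig, ax = plt.subplots()
    ax.loglog(ranks, freqs, marker=".", linestyle="none")
    ax.set_xlabel("rank")
    ax.set_ylabel("frequency")
    plt.show()


ranks, freqs = rank_frequency(sentences)
print(list(zip(ranks, freqs)))
# plot_zipf(ranks, freqs)  # uncomment to display the chart
```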