Chapter 4: POS tagging
The zip archive alice.zip contains the novel Alice’s Adventures in Wonderland by Lewis Carroll, as distributed by Project Gutenberg, in the text file alice.txt.
Apply a part-of-speech (POS) tagger to the text file, and store the result in another file.
Implement programs that read the POS tagging result and perform the tasks below.
For your convenience, the zip archive also includes alice.txt.conll, the novel with part-of-speech labels predicted by Stanford CoreNLP.
It may be a good idea to use matplotlib or Gnuplot for problems 36 through 39, which ask for charts.
30. Reading the result
Implement a program that reads the result of part-of-speech tagging.
Here, represent a sentence as a list of mapping objects, each of which stores the surface form, lemma (base form), and part-of-speech tag of one token under the keys text, lemma, and pos, respectively.
Use this representation in the rest of the problems.
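The representation described above can be sketched as follows. This is a minimal example, not a definitive reader: it assumes that alice.txt.conll is tab-separated with the columns index, word, lemma, and POS tag first, and that a blank line separates sentences. Check the actual file before relying on that layout. The function name read_conll and the sample text are hypothetical.

```python
import io


def read_conll(lines):
    """Yield each sentence as a list of {'text', 'lemma', 'pos'} dicts."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumption: a blank line marks a sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # skip lines that do not match the assumed layout
        # Assumed column layout: index, word, lemma, POS, then other fields.
        sentence.append({"text": cols[1], "lemma": cols[2], "pos": cols[3]})
    if sentence:
        yield sentence


# Hypothetical sample in the assumed layout, standing in for alice.txt.conll.
sample = io.StringIO(
    "1\tAlice\tAlice\tNNP\n"
    "2\twas\tbe\tVBD\n"
    "3\tbeginning\tbegin\tVBG\n"
    "\n"
    "1\tShe\tshe\tPRP\n"
    "2\tsighed\tsigh\tVBD\n"
)
sentences = list(read_conll(sample))
print(sentences[0][1])
```

With the real data, the same function can be given an open file object, for example one obtained from open("alice.txt.conll", encoding="utf-8").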
31. Verbs
Extract surface forms of all verbs appearing in the text.
32. Base forms of verbs
Extract lemmas of all verbs appearing in the text.
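Problems 31 and 32 differ only in which key is read, so one sketch can cover both. It assumes the tagger emits Penn Treebank tags, in which verb tags start with "VB" (VB, VBD, VBG, and so on); adjust the test if your tag set differs. The function name verbs and the sample sentences are hypothetical.

```python
def verbs(sentences, key="text"):
    """Return the chosen field of every token whose tag starts with 'VB'."""
    return [
        tok[key]
        for sent in sentences
        for tok in sent
        if tok["pos"].startswith("VB")  # assumption: Penn Treebank verb tags
    ]


# Hypothetical sample in the representation from problem 30.
sentences = [
    [
        {"text": "Alice", "lemma": "Alice", "pos": "NNP"},
        {"text": "was", "lemma": "be", "pos": "VBD"},
        {"text": "beginning", "lemma": "begin", "pos": "VBG"},
    ],
    [
        {"text": "She", "lemma": "she", "pos": "PRP"},
        {"text": "sighed", "lemma": "sigh", "pos": "VBD"},
    ],
]

print(verbs(sentences))           # surface forms (problem 31)
print(verbs(sentences, "lemma"))  # base forms (problem 32)
```

The lists keep duplicates and text order; wrap the result in set() if only the distinct forms are wanted.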
33. A of B
Extract noun phrases in the form of “A of B”, where A and B are nouns.
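One way to approach this is to slide a three-token window over each sentence and keep the windows that match noun, "of", noun. The sketch below assumes Penn Treebank tags (noun tags start with "NN"); the names a_of_b and parse and the sample sentences are hypothetical.

```python
def parse(tagged):
    # Hypothetical helper: builds sample tokens from "word/TAG" strings.
    pairs = (item.split("/") for item in tagged.split())
    return [{"text": w, "lemma": w.lower(), "pos": p} for w, p in pairs]


def a_of_b(sentences):
    """Return 'A of B' strings where A and B are both tagged as nouns."""
    phrases = []
    for sent in sentences:
        # zip over three shifted views gives every window of three tokens.
        for a, mid, b in zip(sent, sent[1:], sent[2:]):
            if (
                a["pos"].startswith("NN")  # assumption: Penn Treebank noun tags
                and mid["text"].lower() == "of"
                and b["pos"].startswith("NN")
            ):
                phrases.append(f'{a["text"]} of {b["text"]}')
    return phrases


sentences = [
    parse("the/DT Queen/NNP of/IN Hearts/NNPS made/VBD some/DT tarts/NNS"),
    parse("she/PRP thought/VBD of/IN home/NN"),
]
print(a_of_b(sentences))
```

The second sample sentence is there to show that "of" preceded by a verb is not matched.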
34. A B
Extract the longest noun phrase consisting of consecutive nouns.
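The task can be read in more than one way. The sketch below takes it to mean: collect every maximal run of two or more consecutive nouns, from which the single longest run can then be picked. It assumes Penn Treebank tags (noun tags start with "NN"); the names noun_runs and parse and the sample sentence are hypothetical.

```python
from itertools import groupby


def parse(tagged):
    # Hypothetical helper: builds sample tokens from "word/TAG" strings.
    pairs = (item.split("/") for item in tagged.split())
    return [{"text": w, "lemma": w.lower(), "pos": p} for w, p in pairs]


def noun_runs(sentences, min_len=2):
    """Return maximal runs of consecutive nouns, joined by spaces."""
    runs = []
    for sent in sentences:
        # groupby splits the sentence into alternating noun / non-noun blocks,
        # so each noun block is already as long as it can be.
        for is_noun, group in groupby(sent, key=lambda t: t["pos"].startswith("NN")):
            words = [t["text"] for t in group]
            if is_noun and len(words) >= min_len:
                runs.append(" ".join(words))
    return runs


sentences = [
    parse("the/DT White/NNP Rabbit/NNP took/VBD a/DT waistcoat/NN pocket/NN watch/NN"),
]
runs = noun_runs(sentences)
longest = max(runs, key=lambda r: len(r.split()))
print(runs, longest)
```

Setting min_len=1 would also return single nouns, which may or may not be what the problem intends.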
35. Frequency of words
Obtain the list of words and their frequencies of occurrence, sorted in descending order of frequency.
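A minimal sketch using collections.Counter from the standard library, whose most_common() method returns (word, count) pairs in descending order of count. Whether to count surface forms or lemmas, and whether to drop punctuation, are choices the problem leaves open; here surface forms are counted by default. The function name and the sample sentences (with placeholder tags) are hypothetical.

```python
from collections import Counter


def word_frequencies(sentences, key="text"):
    """Return (word, count) pairs sorted by descending count."""
    counts = Counter(tok[key] for sent in sentences for tok in sent)
    return counts.most_common()


# Hypothetical sample; the "X" tags are placeholders, not real tagger output.
raw = ["the cat saw the dog", "the dog ran"]
sentences = [
    [{"text": w, "lemma": w, "pos": "X"} for w in line.split()] for line in raw
]
freqs = word_frequencies(sentences)
print(freqs[:2])
```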
36. Top-ten frequent words
Visualize the top-ten frequent words and their frequencies with a chart (e.g., bar chart).
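A possible sketch with matplotlib: count the words, take the ten most frequent, and pass them to a bar chart. The function names, the output file name top_words.png, and the sample sentences are hypothetical. matplotlib is imported inside the plotting function so that the counting part also runs where matplotlib is not installed.

```python
from collections import Counter


def top_words(sentences, n=10, key="text"):
    """Return the n most frequent words as (word, count) pairs."""
    counts = Counter(tok[key] for sent in sentences for tok in sent)
    return counts.most_common(n)


def plot_bar(pairs, path="top_words.png"):
    """Draw a bar chart of (word, count) pairs and save it to path."""
    import matplotlib.pyplot as plt  # imported lazily; see the note above

    words, freqs = zip(*pairs)
    fig, ax = plt.subplots()
    ax.bar(words, freqs)
    ax.set_xlabel("word")
    ax.set_ylabel("frequency")
    fig.savefig(path)
    plt.close(fig)


# Hypothetical sample; the "X" tags are placeholders, not real tagger output.
raw = ["the cat saw the dog", "the dog ran"]
sentences = [
    [{"text": w, "lemma": w, "pos": "X"} for w in line.split()] for line in raw
]
top = top_words(sentences, n=2)
print(top)
# plot_bar(top)  # uncomment to write top_words.png
```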
37. Top-ten words co-occurring with ‘Alice’
Extract the list of words that co-occur with the word “Alice”. Visualize with a chart (e.g., bar chart) the top-ten words co-occurring with the word “Alice” and their frequencies.
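The problem does not define co-occurrence. The sketch below assumes that two words co-occur when they appear in the same sentence; a fixed-size window around the target would be another reasonable choice. The function name and the sample sentences (with placeholder tags) are hypothetical.

```python
from collections import Counter


def cooccurring(sentences, target="Alice", key="text"):
    """Count words that share a sentence with the target word."""
    counts = Counter()
    for sent in sentences:
        words = [tok[key] for tok in sent]
        if target in words:
            # Assumption: every other word in the sentence counts once
            # per occurrence; the target itself is excluded.
            counts.update(w for w in words if w != target)
    return counts


# Hypothetical sample; the "X" tags are placeholders, not real tagger output.
raw = ["Alice saw the Rabbit", "the Rabbit ran", "Alice followed the Rabbit"]
sentences = [
    [{"text": w, "lemma": w, "pos": "X"} for w in line.split()] for line in raw
]
counts = cooccurring(sentences)
print(counts.most_common(10))
```

The pairs returned by counts.most_common(10) have the same shape as a top-ten frequency list, so the same kind of bar chart can be drawn from them.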
38. Histogram
Draw a histogram of word frequencies: the x-axis is the frequency, ranging from 1 to the largest frequency of any word in the corpus, and the y-axis is the number of distinct words that occur with that frequency.
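The quantity on the y-axis is a "frequency of frequencies": first count each word, then count how many words share each count. A minimal sketch follows; the function names, the output file name histogram.png, and the sample sentences are hypothetical, and matplotlib is imported inside the plotting function so that the counting part runs without it.

```python
from collections import Counter


def frequency_histogram(sentences, key="text"):
    """Map each frequency to the number of distinct words having it."""
    word_counts = Counter(tok[key] for sent in sentences for tok in sent)
    return Counter(word_counts.values())


def plot_histogram(hist, path="histogram.png"):
    """Draw frequency (x) against number of distinct words (y)."""
    import matplotlib.pyplot as plt  # imported lazily; see the note above

    xs = sorted(hist)
    fig, ax = plt.subplots()
    ax.bar(xs, [hist[x] for x in xs])
    ax.set_xlabel("frequency")
    ax.set_ylabel("number of distinct words")
    fig.savefig(path)
    plt.close(fig)


# Hypothetical sample; the "X" tags are placeholders, not real tagger output.
raw = ["the cat saw the dog", "the dog ran"]
sentences = [
    [{"text": w, "lemma": w, "pos": "X"} for w in line.split()] for line in raw
]
hist = frequency_histogram(sentences)
print(dict(hist))
# plot_histogram(hist)  # uncomment to write histogram.png
```

On real text most words occur only once or twice, so a logarithmic y-axis may make the chart easier to read.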
39. Zipf’s law
Plot a log-log graph with the x-axis being rank order and the y-axis being frequency.
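A minimal sketch: sort the word counts in descending order, number them from rank 1, and plot the two sequences on logarithmic axes with matplotlib's Axes.loglog. The function names, the output file name zipf.png, and the sample sentences are hypothetical; matplotlib is imported inside the plotting function so that the ranking part runs without it.

```python
from collections import Counter


def rank_frequency(sentences, key="text"):
    """Return (ranks, frequencies), with frequencies sorted descending."""
    counts = Counter(tok[key] for sent in sentences for tok in sent)
    freqs = sorted(counts.values(), reverse=True)
    ranks = list(range(1, len(freqs) + 1))
    return ranks, freqs


def plot_zipf(ranks, freqs, path="zipf.png"):
    """Plot frequency against rank on log-log axes and save to path."""
    import matplotlib.pyplot as plt  # imported lazily; see the note above

    fig, ax = plt.subplots()
    ax.loglog(ranks, freqs, marker=".", linestyle="none")
    ax.set_xlabel("rank")
    ax.set_ylabel("frequency")
    fig.savefig(path)
    plt.close(fig)


# Hypothetical sample; the "X" tags are placeholders, not real tagger output.
raw = ["the cat saw the dog", "the dog ran"]
sentences = [
    [{"text": w, "lemma": w, "pos": "X"} for w in line.split()] for line in raw
]
ranks, freqs = rank_frequency(sentences)
print(ranks, freqs)
# plot_zipf(ranks, freqs)  # uncomment to write zipf.png
```

If the text follows Zipf’s law, the points should fall roughly on a straight line with negative slope.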