NLP100 2020

The zip archive ai.en.zip contains the article “Artificial intelligence” from English Wikipedia.

ai.en.txt: the text extracted from the Wikipedia article
ai.en.txt.json: the text annotated with dependency trees (in JSON format)
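
As a minimal sketch, the two files could be read straight from the archive with Python's standard library. The function name load_archive is hypothetical, and the sketch assumes that the members are UTF-8 encoded and can be identified by their base file names (the directory layout inside the zip is not stated here).

```python
import json
import posixpath
import zipfile


def load_archive(zip_source):
    """Return (text, annotations) read from the zip archive.

    zip_source may be a path such as "ai.en.zip" or a file-like object.
    Members are matched by base name, because the directory layout
    inside the archive is an assumption.
    """
    with zipfile.ZipFile(zip_source) as zf:
        by_name = {posixpath.basename(n): n for n in zf.namelist()}
        # Plain text extracted from the Wikipedia article.
        text = zf.read(by_name["ai.en.txt"]).decode("utf-8")
        # Annotations in JSON format, parsed into Python objects.
        annotations = json.loads(zf.read(by_name["ai.en.txt.json"]).decode("utf-8"))
    return text, annotations
```

For example, text, annotations = load_archive("ai.en.zip") would return the article text as a string and the parsed JSON annotations, provided the archive is in the current directory.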

We used WikiExtractor to extract the text ai.en.txt from the original MediaWiki article (in XML format). The part-of-speech tags and dependency trees were produced by Stanford CoreNLP. The commands below roughly outline how ai.en.txt and ai.en.txt.json were obtained.

# Extract text from ai.en.xml (the article in XML format).
$ python WikiExtractor.py ai.en.xml
# Prepare ai.en.txt by editing the output from the tool, e.g., text/AA/wiki_00

# Apply Stanford CoreNLP to ai.en.txt
$ ./corenlp.sh -annotators tokenize,ssplit,pos,lemma,depparse -outputFormat json -file ai.en.txt
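
The JSON produced by the command above can then be traversed in Python. The following is a sketch only: it assumes the usual layout of CoreNLP's JSON output (a top-level "sentences" list whose items carry "tokens" and "basicDependencies"), and the helper names iter_tokens and iter_dependencies are hypothetical. Check the key names against the actual file before relying on them.

```python
import json


def iter_tokens(doc):
    """Yield (word, lemma, pos) for every token in every sentence."""
    for sentence in doc.get("sentences", []):
        for token in sentence.get("tokens", []):
            yield token["word"], token["lemma"], token["pos"]


def iter_dependencies(doc, layer="basicDependencies"):
    """Yield (relation, governor word, dependent word) for every edge.

    The layer name is an assumption about the CoreNLP output; other
    layers may be present in the file as well.
    """
    for sentence in doc.get("sentences", []):
        for edge in sentence.get(layer, []):
            yield edge["dep"], edge["governorGloss"], edge["dependentGloss"]


def load_annotations(path):
    """Parse the annotated JSON file, e.g. ai.en.txt.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

With these helpers, list(iter_dependencies(load_annotations("ai.en.txt.json"))) would give one tuple per dependency edge in the article.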

These files are distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported license.