Chapter 3: Regular Expression
The file enwiki-country.json.gz stores Wikipedia articles in the format:
- Each line stores a Wikipedia article in JSON format
- Each JSON document has key-value pairs:
- Title of the article as the value for the
title
key - Body of the article as the value for the
text
key
- Title of the article as the value for the
- The entire file is compressed by gzip
Write codes that perform the following jobs.
20. Read JSON documentsPermalink
Read the JSON documents and output the body of the article about the United Kingdom. Reuse the output in problems 21-29.
21. Lines with category namesPermalink
Extract lines that define the categories of the article.
22. Category namesPermalink
Extract the category names of the article.
23. Section structurePermalink
Extract section names in the article with their levels. For example, the level of the section is 1 for the MediaWiki markup "== Section name =="
.
24. Media referencesPermalink
Extract references to media files linked from the article.
25. InfoboxPermalink
Extract field names and their values in the Infobox “country”, and store them in a dictionary object.
26. Remove emphasis markupsPermalink
In addition to the process of the problem 25, remove emphasis MediaWiki markups from the values. See Help:Cheatsheet.
27. Remove internal linksPermalink
In addition to the process of the problem 26, remove internal links from the values. See Help:Cheatsheet.
28. Remove MediaWiki markupsPermalink
In addition to the process of the problem 27, remove MediaWiki markups from the values as much as you can, and obtain the basic information of the country in plain text format.
29. Country flagPermalink
Obtain the URL of the country flag by using the analysis result of Infobox. (Hint: convert a file reference to a URL by calling imageinfo in MediaWiki API)