In this chapter, we tackle the task of news classification. The task is to classify a given news headline to one of the following categories: “Business”, “Science”, “Entertainment” and “Health”. News Aggregator Data Set provided by Fabio Gasparetti, is the dataset we use in this chapter.
50. Download and Preprocess Dataset
Download News Aggregator Data Set and create training data (
train.txt), validation data (
valid.txt) and test data (
test.txt) as follows:
- Unpack the downloaded zip file and read
- Extract the articles such that the publisher is one of the followings: “Reuters”, “Huffington Post”, “Businessweek”, “Contactmusic.com” and “Daily Mail”.
- Randomly shuffle the extracted articles.
- Split the extracted articles in the following ratio: the training data (80%), the validation data (10%) and the test data (10%). Then save them into files
test.txt, respectively. In each file, each line should contain a single instance. Each instance should contain both the name of the category and the article headline. Use
Tabto separate each field.
After creating the dataset, check the number of instances contained in each category.
51. Feature extraction
Extract a set of features from the training, validation and test data, respectively.
Save the features into files as follows:
Design the features that are useful for the news classification.
The minimum baseline for the features is the tokenized sequence of the news headline.
Use the training data from the problem 51 and train the logistic regression model.
Use the logistic regression model from the problem 52. Create a program that predicts the category of a given news headline and computes the prediction probability of the model.
54. Accuracy score
Compute the accuracy score of the logistic regression model from the problem 52 on both the training data and the test data.
55. Confusion matrix
Create the confusion matrix of the logistic regression model from the problem 52 for both the training data and the test data.
56. Precision, recall and F1 score
Compute the precision, recall and F1 score of the logistic regression model from the problem 52. First, compute these metrics for each category. Then summarize the score of each category using (1) micro-average and (2) macro-average.
57. Feature weights
Use the logistic regression model from the problem 52. Check the feature weights and list the 10 most important features and 10 least important features.
When training a logistic regression model, one can control the degree of overfitting by manipulating the regularization parameters. Use different regularization parameters to train the model. Then, compute the accuracy score on the training data, validation data and test data. Summarize the results on the graph, where x-axis is the regularization parameter and y-axis is the accuracy score.
59. Hyper-parameter tuning
Use different training algorithms and parameters to train the model for the news classification. Search for the training algorithms and parameters that achieves the best accuracy score on the validation data. Then compute its accuracy score on the test data.