Chapter 6: Machine Learning

In this chapter, we tackle the task of news classification. The task is to classify a given news headline to one of the following categories: “Business”, “Science”, “Entertainment” and “Health”. News Aggregator Data Set provided by Fabio Gasparetti, is the dataset we use in this chapter.

50. Download and Preprocess Dataset

Download News Aggregator Data Set and create training data (train.txt), validation data (valid.txt) and test data (test.txt) as follows:

  1. Unpack the downloaded zip file and read readme.txt.
  2. Extract the articles such that the publisher is one of the followings: “Reuters”, “Huffington Post”, “Businessweek”, “Contactmusic.com” and “Daily Mail”.
  3. Randomly shuffle the extracted articles.
  4. Split the extracted articles in the following ratio: the training data (80%), the validation data (10%) and the test data (10%). Then save them into files train.txt, valid.txt and test.txt, respectively. In each file, each line should contain a single instance. Each instance should contain both the name of the category and the article headline. Use Tab to separate each field.

After creating the dataset, check the number of instances contained in each category.

51. Feature extraction

Extract a set of features from the training, validation and test data, respectively. Save the features into files as follows: train.feature.txt, valid.feature.txt and test.feature.txt. Design the features that are useful for the news classification. The minimum baseline for the features is the tokenized sequence of the news headline.

52. Training

Use the training data from the problem 51 and train the logistic regression model.

53. Prediction

Use the logistic regression model from the problem 52. Create a program that predicts the category of a given news headline and computes the prediction probability of the model.

54. Accuracy score

Compute the accuracy score of the logistic regression model from the problem 52 on both the training data and the test data.

55. Confusion matrix

Create the confusion matrix of the logistic regression model from the problem 52 for both the training data and the test data.

56. Precision, recall and F1 score

Compute the precision, recall and F1 score of the logistic regression model from the problem 52. First, compute these metrics for each category. Then summarize the score of each category using (1) micro-average and (2) macro-average.

57. Feature weights

Use the logistic regression model from the problem 52. Check the feature weights and list the 10 most important features and 10 least important features.

58. Regularization

When training a logistic regression model, one can control the degree of overfitting by manipulating the regularization parameters. Use different regularization parameters to train the model. Then, compute the accuracy score on the training data, validation data and test data. Summarize the results on the graph, where x-axis is the regularization parameter and y-axis is the accuracy score.

59. Hyper-parameter tuning

Use different training algorithms and parameters to train the model for the news classification. Search for the training algorithms and parameters that achieves the best accuracy score on the validation data. Then compute its accuracy score on the test data.