Madhavan Mukund



Data Mining and Machine Learning,
Jan-Apr 2019

Assignment 2: Naïve Bayes text classification

3 March, 2019
Due 16 March, 2019



The Task

The "Reuters-21578 Text Categorization Collection" data set from the UCI Machine Learning Repository contains 21578 news articles that appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories. The URL for the UCI repository is http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection.

The articles are distributed across 22 SGML files: the first 21 files contain 1000 articles each, and the last file contains the remaining 578. An article may have no topics assigned, or may have multiple topics. Details about the SGML format are given at http://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/README.txt.

Your task is to build a naïve Bayesian classifier for the Reuters data to assign topics to articles, using the topic tags in the training set. Note that an article may be assigned more than one topic. You are free to choose a strategy to assign multiple topic labels to a document.
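As an illustration of one possible multi-label strategy (not the required one), the sketch below trains a separate two-class multinomial naïve Bayes model per topic and assigns every topic whose "yes" score beats its "no" score. It uses only the Python standard library with add-one (Laplace) smoothing. The names tokenize, BinaryNaiveBayes, train_topic_models and predict_topics, and the toy training data, are hypothetical and chosen only for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


class BinaryNaiveBayes:
    """Multinomial naive Bayes for a single topic (1) versus the rest (0)."""

    def __init__(self):
        self.doc_counts = [0, 0]                   # documents per class
        self.word_counts = [Counter(), Counter()]  # token counts per class
        self.vocab = set()

    def fit(self, token_lists, labels):
        for tokens, label in zip(token_lists, labels):
            y = int(label)
            self.doc_counts[y] += 1
            self.word_counts[y].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_score(self, tokens, y):
        # Smoothed log prior plus smoothed log likelihood of each known token.
        score = math.log((self.doc_counts[y] + 1) / (sum(self.doc_counts) + 2))
        denom = sum(self.word_counts[y].values()) + len(self.vocab)
        for token in tokens:
            if token in self.vocab:  # tokens never seen in training are skipped
                score += math.log((self.word_counts[y][token] + 1) / denom)
        return score

    def predict(self, tokens):
        return self.log_score(tokens, 1) > self.log_score(tokens, 0)


def train_topic_models(articles):
    """articles: list of (text, topics) pairs. Returns {topic: model}."""
    token_lists = [tokenize(text) for text, _ in articles]
    all_topics = sorted({t for _, topics in articles for t in topics})
    return {
        topic: BinaryNaiveBayes().fit(
            token_lists, [topic in topics for _, topics in articles]
        )
        for topic in all_topics
    }


def predict_topics(models, text):
    """Return every topic whose model prefers 'yes' over 'no' (possibly none)."""
    tokens = tokenize(text)
    return [topic for topic, model in models.items() if model.predict(tokens)]


# Toy training data, invented for this example (not taken from Reuters-21578).
toy_articles = [
    ("cocoa prices rise in ghana", ["cocoa"]),
    ("wheat and grain harvest rises", ["grain", "wheat"]),
    ("grain exports fall", ["grain"]),
    ("cocoa harvest shipments delayed", ["cocoa"]),
    ("bank raises interest rates", []),
]
models = train_topic_models(toy_articles)
print(predict_topics(models, "cocoa shipments from ghana"))
```

A per-topic design like this allows an article to receive zero, one, or several labels. Other strategies, such as a single multi-class model that returns the top few topics above a threshold, are equally acceptable.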

Preprocessing the SGML data to extract details about individual articles and their topic labels is part of the task.
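A minimal sketch of this preprocessing step is given below. It assumes the tag layout described in the README (each article enclosed in REUTERS tags, topics listed as D elements inside TOPICS, and the text in TITLE and BODY). It uses regular expressions from the Python standard library rather than a full SGML parser. The function name parse_articles and the sample string are hypothetical; the sample is hand-written in the style of the collection, not copied from it.

```python
import re
from html import unescape

REUTERS_RE = re.compile(r"<REUTERS\b([^>]*)>(.*?)</REUTERS>", re.DOTALL)
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
TOPIC_ITEM_RE = re.compile(r"<D>(.*?)</D>")
TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.DOTALL)
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def _first(pattern, text):
    """Return the first match with entities decoded, or '' if absent."""
    match = pattern.search(text)
    return unescape(match.group(1)).strip() if match else ""


def parse_articles(sgml_text):
    """Extract attributes, topics, title and body of each article."""
    articles = []
    for attrs, content in REUTERS_RE.findall(sgml_text):
        topics_block = TOPICS_RE.search(content)
        topics = TOPIC_ITEM_RE.findall(topics_block.group(1)) if topics_block else []
        articles.append({
            "attrs": dict(ATTR_RE.findall(attrs)),  # e.g. LEWISSPLIT, NEWID
            "topics": topics,                       # may be empty
            "title": _first(TITLE_RE, content),
            "body": _first(BODY_RE, content),       # '' if there is no BODY
        })
    return articles


# Hand-written sample in the style of the collection (illustrative only).
sample = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week.</BODY>
</TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TEST" NEWID="2">
<TOPICS></TOPICS>
<TEXT>
<TITLE>STANDARD OIL &lt;SRD> TO FORM UNIT</TITLE>
</TEXT>
</REUTERS>"""

for article in parse_articles(sample):
    print(article["attrs"].get("NEWID"), article["topics"], article["title"])
```

When reading the actual files, it may be safer to open them with encoding="latin-1", since some copies of the collection are reported to contain bytes that are not valid UTF-8. Treat this as an assumption to verify against your download.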


Solving the Task

  • You can use any programming language, including Python and R. You can make use of standard packages for analytics and machine learning. Clearly document any external packages used by your code.

  • Submit via Moodle a single archive (zip, tar.gz, …) containing:

    • The code you used to solve the assignment.

    • A link to the output produced by your code. Do not include the output itself in this submission; upload it to a cloud storage service and share the link.

    • A short write-up describing how your code ran on the data sets: the parameters used, time taken, space required, and anything else of interest.

  • You can work in groups of two. Each group makes a single submission to Moodle. Use either person's Moodle account to submit. The submission should mention the names of the two partners.

  • There will be a short oral presentation and question/answer session for each group.