Madhavan Mukund



Data Mining and Machine Learning,
Aug-Nov 2017

Assignment 2: Supervised Learning

18 October 2017, due 31 October 2017



The Task

The "Census Income" data set from the UCI Machine Learning Repository contains income information for over 48,000 individuals taken from the 1994 US census. The original URL for the UCI repository is http://archive.ics.uci.edu/ml/datasets/Census+Income.

The task is to predict whether a person makes over $50K a year.

More information about the dataset is available here. This includes a description of the columns in the table.

The actual dataset is available in CSV (comma-separated values) format, in two parts: a training set and a test set.

Note: You should ignore the attribute fnlwgt when building the classifier. This attribute describes the sampling weight of each entry and is only useful if you are trying to extrapolate aggregate statistics from this dataset.
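As a minimal sketch of how fnlwgt might be dropped while loading, the snippet below uses only Python's standard csv module. The helper name load_without, the reduced column set, and the sample rows are illustrative assumptions; the real files have more columns and may not include a header row, in which case the column names would have to be supplied separately.

```python
import csv
import io

# Illustrative rows with a reduced, hypothetical column set.
# The real files have more columns and may lack a header row.
SAMPLE = """age, workclass, fnlwgt, education, income
39, State-gov, 77516, Bachelors, <=50K
50, Self-emp-not-inc, 83311, Bachelors, <=50K
"""


def load_without(stream, ignored=("fnlwgt",)):
    """Read CSV rows as dicts, leaving out the ignored columns."""
    # skipinitialspace strips the blank that follows each comma.
    reader = csv.DictReader(stream, skipinitialspace=True)
    return [
        {name: value for name, value in row.items() if name not in ignored}
        for row in reader
    ]


rows = load_without(io.StringIO(SAMPLE))
print(rows[0])
```

With a real file, the same helper could be called on an open file handle instead of the in-memory sample.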

In this assignment you have to build two classifiers for this data set: a decision tree and a naive Bayesian classifier.

  • First build the classifiers using the training data as given (about 2/3 of the total) and validate them against the test data (about 1/3 of the total).

  • Then merge the training and test data and do 10-fold cross validation to evaluate your classifiers.
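
The 10-fold cross validation step can be sketched as follows, in plain Python. This is one possible arrangement, not a prescribed solution: the function names k_fold_indices and cross_validate are hypothetical, and the majority-class "classifier" is only a stand-in for the decision tree or naive Bayesian classifier you actually build. Standard machine learning packages offer equivalent utilities.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return the mean accuracy over k folds (assumes len(X) >= k)."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Train on the other k-1 folds, then score on the held-out fold.
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Stand-in classifier (an assumption for illustration only):
# it always predicts the most common label seen during training.
def fit_majority(X, y):
    return Counter(y).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Toy data, not taken from the census files.
X_toy = list(range(20))
y_toy = ["<=50K"] * 15 + [">50K"] * 5
score = cross_validate(X_toy, y_toy, fit_majority, predict_majority)
print(f"mean accuracy over 10 folds: {score:.2f}")
```

Replacing fit_majority and predict_majority with your own training and prediction functions keeps the evaluation loop unchanged for both classifiers.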


Solving the Task

  • You can use any programming language, including Python and R. You can make use of standard packages for analytics and machine learning. Clearly document any external packages used by your code.

  • Submit via Moodle a single archive (zip, tar.gz, …) containing:

    • The code you used to solve the assignment.

    • A link to the output produced by your code. Do not include the output itself in the submission. Save it somewhere in the cloud and provide a link.

    • A short write-up describing how your code ran on the data sets: the parameters used, time taken, space required, and anything else of interest.

  • You can work in groups of two or three. Each group makes a single submission to Moodle. Use any one person's Moodle account to submit. The submission should mention the names of the partners.

  • There will be a short oral presentation and question/answer session for each group.