Data Mining and Machine Learning

Aug-Nov, 2013

Assignment 2: Clustering

5 November, 2013

The "Bag of Words" data set from the UCI Machine Learning Repository contains five text collections in the form of bags-of-words. The original URL for the UCI repository is http://archive.ics.uci.edu/ml/datasets/Bag+of+Words.

Your task is to cluster the documents into K clusters using K-means clustering.

Use the "standard" TF-IDF vector representation for each document in collection XYZ in terms of the words specified in vocab.XYZ.txt. (Use the TF-IDF definition from Web Data Mining by Bing Liu, Chapter 6.2.2, page 189.)
Use cosine similarity to measure the distance between documents.
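The two points above can be sketched as follows. This is only an illustration, and it assumes that the definition in Liu's book is the max-normalised one: tf = f / (largest count in the same document) and idf = log(N / df), where N is the number of documents and df is the number of documents containing the word. Please check this against the book before relying on it. The function names and the dictionary layout `{docID: {wordID: count}}` are hypothetical choices made for this sketch.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {docID: {wordID: count}} -> {docID: {wordID: tf-idf weight}}.

    Assumption: N is taken as the number of documents present in `docs`;
    use the header value D instead if some documents have no words at all.
    """
    n_docs = len(docs)
    df = Counter()                      # df[w] = number of documents containing w
    for counts in docs.values():
        df.update(counts.keys())
    vectors = {}
    for doc_id, counts in docs.items():
        max_f = max(counts.values())    # largest raw count in this document
        vectors[doc_id] = {
            w: (f / max_f) * math.log(n_docs / df[w])
            for w, f in counts.items()
        }
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u                     # iterate over the shorter vector
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                      # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Tiny made-up collection: three documents over three word IDs.
toy_docs = {1: {1: 2, 2: 1}, 2: {1: 1, 3: 4}, 3: {3: 1}}
vecs = tfidf_vectors(toy_docs)
print(cosine_similarity(vecs[1], vecs[3]))  # documents 1 and 3 share no words
print(cosine_similarity(vecs[2], vecs[3]))
```

A word that occurs in every document gets weight 0 under this definition, so it does not influence the similarity. The base of the logarithm only rescales all weights by the same factor and therefore does not change cosine values.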

Report your results on the three smaller datasets (Enron emails, NIPS full papers, KOS blog entries) for different values of K and different choices of initial centroids.
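One possible shape for the clustering loop is sketched below. It is not a prescribed solution: it assumes documents are already given as sparse weight vectors (dicts from word ID to weight), normalises them to unit length so that the dot product equals the cosine, and recomputes each centroid as the normalised sum of its members (a variant often called spherical K-means). The function names and the `seed` and `max_iter` parameters are hypothetical; varying `seed` is one simple way to try different initial centroids.

```python
import math
import random

def normalize(vec):
    """Scale a sparse vector to unit Euclidean length (zero vectors are copied)."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {w: x / norm for w, x in vec.items()} if norm > 0 else dict(vec)

def dot(u, v):
    """Sparse dot product; equals the cosine when both vectors have unit length."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(w, 0.0) for w, x in u.items())

def kmeans_cosine(vectors, k, seed=0, max_iter=100):
    """Return ({docID: cluster index}, list of centroids)."""
    ids = sorted(vectors)
    unit = {d: normalize(vectors[d]) for d in ids}
    rng = random.Random(seed)
    # Initial centroids: k distinct documents chosen at random.
    centroids = [dict(unit[d]) for d in rng.sample(ids, k)]
    assignment = {}
    for _ in range(max_iter):
        # Assignment step: each document joins its most similar centroid.
        new_assignment = {
            d: max(range(k), key=lambda c: dot(unit[d], centroids[c]))
            for d in ids
        }
        if new_assignment == assignment:
            break                       # no document moved, so we have converged
        assignment = new_assignment
        # Update step: centroid = normalised sum of the member vectors.
        for c in range(k):
            total = {}
            for d in ids:
                if assignment[d] == c:
                    for w, x in unit[d].items():
                        total[w] = total.get(w, 0.0) + x
            if total:                   # an empty cluster keeps its old centroid
                centroids[c] = normalize(total)
    return assignment, centroids

# Made-up example: documents 1 and 2 share word 1, documents 3 and 4 share word 3.
toy = {1: {1: 1.0}, 2: {1: 2.0, 2: 0.1}, 3: {3: 1.0}, 4: {3: 3.0, 4: 0.2}}
labels, _ = kmeans_cosine(toy, k=2, seed=1)
print(labels)
```

Centroids are much denser than individual documents, so for the larger collections it may be worth storing them as arrays rather than dicts.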

In addition to the actual output, report the time it took to complete the job and, in case your program did not terminate for a given dataset and combination of K and initial centroids, report how long you tried before you gave up.
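A simple way to collect the timing information is sketched below, assuming the program is organised so that one iteration can be run at a time. The helper `timed_run` and its parameters are hypothetical names for this sketch; `step` stands for any function that performs one iteration and returns True once the run has finished.

```python
import time

def timed_run(step, time_limit_s):
    """Call step() until it returns True or the time limit is exceeded.

    Returns (finished, elapsed_seconds). The limit is only checked between
    iterations, so a single very long iteration can overshoot it.
    """
    start = time.perf_counter()
    while True:
        if step():
            return True, time.perf_counter() - start
        elapsed = time.perf_counter() - start
        if elapsed > time_limit_s:
            return False, elapsed

# Stand-in for a real clustering iteration: "converges" on the third call.
calls = {"n": 0}

def fake_step():
    calls["n"] += 1
    return calls["n"] >= 3

finished, elapsed = timed_run(fake_step, time_limit_s=5.0)
print("finished" if finished else "gave up", f"after {elapsed:.3f} s")
```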

The three smaller datasets from this repository that you need for this assignment are also available locally at http://www.cmi.ac.in/~madhavan/courses/datamining13-aug/assignment2/bag-of-words.

In each of the text collections, each document is summarized as a bag (multiset) of words. The individual documents are identified by document IDs and the words are identified by word IDs.

After some cleaning up, in each collection the vocabulary of unique words has been truncated to only keep words that occurred more than ten times overall in that collection.

For each collection XYZ:

vocab.XYZ.txt is the vocabulary file, listing all words that appear in the collection XYZ, one word per line. Each word has an implicit wordID that is its line number in this file, starting with 1 (the word on line 1 has wordID 1, the word on line 2 has wordID 2, ...).
docword.XYZ.txt lists out the number of times each word in vocab.XYZ.txt occurs in each document (only non-zero counts are recorded).
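The implicit word IDs can be recovered while reading the vocabulary file, for example as in the sketch below. The function name `read_vocab` is a hypothetical choice, and the sample text stands in for a real vocab.XYZ.txt file.

```python
import io

def read_vocab(lines):
    """Map wordID (1-based line number) to the word on that line."""
    return {i: line.strip() for i, line in enumerate(lines, start=1)}

# An open file object can be passed in the same way as this in-memory sample.
sample_vocab = io.StringIO("aardvark\nabacus\nzebra\n")
vocab = read_vocab(sample_vocab)
print(vocab[1], vocab[3])
```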

The file docword.XYZ.txt begins with 3 header lines:
```
   D
   W
   NNZ
```
where D is the number of documents in the collection, W is the number of words whose frequency is counted (i.e., W is the number of words in vocab.XYZ.txt) and NNZ is the number of non-zero frequency entries for this collection (i.e., NNZ is 3 less than the number of lines in docword.XYZ.txt).

This is followed by NNZ lines of the form
```
   docID wordID count
```
where count is the number of times the word with id wordID appears in the document with id docID. Remember that only non-zero counts are recorded.
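The format described above can be read with a few lines of code. The sketch below is one way to do it; the function name `read_docword` and the nested-dictionary layout are hypothetical choices, and the sample text is made up to keep the example small.

```python
import io

def read_docword(lines):
    """Parse docword-format text.

    Returns (D, W, NNZ, docs) where docs[docID][wordID] = count.
    Documents with no recorded words do not appear in docs.
    """
    it = iter(lines)
    n_docs = int(next(it))              # header line 1: D
    n_words = int(next(it))             # header line 2: W
    nnz = int(next(it))                 # header line 3: NNZ
    docs = {}
    seen = 0
    for line in it:
        parts = line.split()
        if not parts:
            continue                    # tolerate blank lines
        doc_id, word_id, count = map(int, parts)
        docs.setdefault(doc_id, {})[word_id] = count
        seen += 1
    if seen != nnz:
        raise ValueError(f"expected {nnz} entries, found {seen}")
    return n_docs, n_words, nnz, docs

# An open file object can be passed in the same way as this in-memory sample.
sample = io.StringIO("2\n3\n3\n1 1 2\n1 3 1\n2 2 5\n")
D, W, NNZ, docs = read_docword(sample)
print(D, W, NNZ, docs)
```

Storing only the non-zero counts keeps memory use proportional to NNZ rather than to D times W, which matters for the larger collections.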

As usual, a K-itemset of words is a collection of words of size K that occur together in the same document. Your assignment is to write a program to find all K-itemsets of words occurring with frequency F, where K, F and the name of the dataset to use should be parameters to your program.

The datasets are of different sizes. Report your results on the three smaller datasets (Enron emails, NIPS full papers, KOS blog entries) for different values of K and F. In addition to the actual output, report the time it took to complete the job and, in case your program did not terminate for a given dataset and combination of K and F, report how long you tried before you gave up.

Information about the datasets in the repository

Enron Emails:
orig source: www.cs.cmu.edu/~enron
```
D=39861
W=28102
N=6,400,000 (approx)
```
NIPS full papers:
orig source: books.nips.cc
```
D=1500
W=12419
N=1,900,000 (approx)
```
KOS blog entries:
orig source: dailykos.com
```
D=3430
W=6906
N=467714
```
NYTimes news articles:
orig source: ldc.upenn.edu
```
D=300000
W=102660
N=100,000,000 (approx)
```
PubMed abstracts:
orig source: www.pubmed.gov
```
D=8200000
W=141043
N=730,000,000 (approx)
```

Last updated Tue 15 Oct, 2013