Homework 2

Use usps.bz2 and usps.t.bz2 as your training and testing sets, respectively. Transform them into the input format of the random forest subroutine, then train a model and use it to predict the labels of the test set.

The format of the training and testing data files is:

label index1:value1 index2:value2 ...
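
As an illustration of the format above, here is a minimal R sketch that splits one such line into a label and its index/value pairs. The helper name parse_line and the sample line are made up for illustration; they are not part of the assignment or of the USPS files.

```r
# Hypothetical helper: split one "label index:value ..." line into its parts.
parse_line <- function(line) {
  tokens <- strsplit(line, " +")[[1]]   # split on runs of spaces
  tokens <- tokens[nzchar(tokens)]      # drop empty tokens (e.g. from leading spaces)
  pairs  <- strsplit(tokens[-1], ":", fixed = TRUE)
  list(
    label = as.numeric(tokens[1]),
    index = as.integer(vapply(pairs, `[`, "", 1)),  # text before each ":"
    value = as.numeric(vapply(pairs, `[`, "", 2))   # text after each ":"
  )
}

# Made-up sample line, not taken from the data files.
p <- parse_line("3 1:-0.5 2:0.25 4:1")
str(p)
```

Indices that do not appear on a line (index 3 in the sample) are simply absent, so a dense representation has to fill them in, presumably with zeros.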

The difficulty of this homework lies in the data transformation. Because R is an interpreted language, a careless implementation of the transformation can be extremely slow (much slower than the training and testing themselves). The purpose of this homework is therefore to let you try different ways of implementing the transformation. You may want to demonstrate several possible approaches and report their computation times.
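
One possible way to set up such a comparison is sketched below. It rests on several assumptions: that the random forest routine accepts a dense numeric matrix plus a label vector (the assignment does not say which routine or package is used), that there are 256 features, and that synthetic lines stand in for the real files. The function names to_matrix_loop and to_matrix_vec are hypothetical. The sketch times an element-by-element loop against a version that splits all lines at once and fills the matrix with a single indexed assignment.

```r
# Synthetic data in the "label index:value ..." format (made up, not USPS).
set.seed(1)
n_features <- 256   # assumed number of features; adjust to the real data
make_line <- function() {
  idx <- sort(sample(n_features, 50))
  paste(sample(0:9, 1),
        paste(idx, round(runif(50), 3), sep = ":", collapse = " "))
}
lines <- replicate(500, make_line())

# Way 1: parse line by line and fill the matrix one element at a time.
to_matrix_loop <- function(lines, p) {
  x <- matrix(0, nrow = length(lines), ncol = p)
  y <- numeric(length(lines))
  for (i in seq_along(lines)) {
    tokens <- strsplit(lines[i], " ", fixed = TRUE)[[1]]
    y[i] <- as.numeric(tokens[1])
    for (tok in tokens[-1]) {
      kv <- strsplit(tok, ":", fixed = TRUE)[[1]]
      x[i, as.integer(kv[1])] <- as.numeric(kv[2])
    }
  }
  list(x = x, y = y)
}

# Way 2: split everything at once, then fill the matrix in one assignment
# using a two-column (row, column) index matrix.
to_matrix_vec <- function(lines, p) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  y      <- as.numeric(vapply(tokens, `[`, "", 1))
  feats  <- lapply(tokens, `[`, -1)
  rows   <- rep(seq_along(feats), lengths(feats))
  kv     <- matrix(unlist(strsplit(unlist(feats), ":", fixed = TRUE)),
                   ncol = 2, byrow = TRUE)
  x <- matrix(0, nrow = length(lines), ncol = p)
  x[cbind(rows, as.integer(kv[, 1]))] <- as.numeric(kv[, 2])
  list(x = x, y = y)
}

t_loop <- system.time(a <- to_matrix_loop(lines, n_features))
t_vec  <- system.time(b <- to_matrix_vec(lines, n_features))

stopifnot(isTRUE(all.equal(a, b)))   # both ways must give the same result
print(c(loop = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]]))
```

The timings depend on the machine and the data size, so none are quoted here. On the real files, the lines could be read with readLines(bzfile("usps.bz2")) before being passed to either function.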

Write a short report (<= 2 pages, in English) describing what you find.


Last modified: Sun Feb 22 17:03:51 CST 2004