Homework 3
In homework 2 you may have seen that the performance
after selecting 200 features is still not good.
Remember that we used the software libsvm
with the following parameters:
svm-train -c 32 -g 0.0001220703125 thrombin
(that is, C = 2^5 and g = 2^-13).
We suspect that maybe we did not select good parameters.
We would like to try the following two things:
- Try different combinations of C and g on the training
data and predict the test data, then report the best
result you can get. For example, if you test
g = [2^4, 2^3, ..., 2^-10] and C = [2^12, 2^11, ..., 2^-2],
then there are 15 x 15 = 225 combinations. For each combination,
train on the training data, predict the test data, and write down
the result. Then report the best of all 225 results.
This can be considered the best result obtainable
under the current setting.
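The grid search above can be sketched as follows. This is a minimal
sketch: the exact command lines and file names (thrombin training/test
files, output path) are assumptions about your local setup, so the
actual svm-train/svm-predict calls are left as comments.

```python
# Sketch of the 15 x 15 grid search over (C, g) described above.
import itertools

# g = 2^4, 2^3, ..., 2^-10 and C = 2^12, 2^11, ..., 2^-2
g_values = [2.0 ** e for e in range(4, -11, -1)]
c_values = [2.0 ** e for e in range(12, -3, -1)]
combinations = list(itertools.product(c_values, g_values))
print(len(combinations))  # 15 x 15 = 225 combinations

for C, g in combinations:
    # For each pair, one would run (hypothetical paths/file names):
    #   svm-train -c {C} -g {g} thrombin
    #   svm-predict thrombin.test thrombin.model out
    # then record the weighted accuracy on the test data.
    pass
# Finally, report the best of the 225 recorded results.
```

Each libsvm run is independent, so if several machines are available
the 225 runs can simply be split among them.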
- Conduct five-fold cross validation on the training
data using different criteria. See whether any
criterion returns parameters close
to the optimal ones you found above.
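Five-fold cross validation partitions the training data into five
folds and, for each fold in turn, trains on the other four and
validates on the held-out one. A small sketch of the index splitting
(the helper name is an assumption; any scoring criterion can then be
applied to the five validation predictions):

```python
def five_fold_splits(n, k=5):
    """Yield (train_idx, val_idx) pairs that partition range(n)
    into k folds; each fold serves as the validation set once."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    idx = list(range(n))
    start = 0
    for size in fold_sizes:
        val = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, val
        start += size
```

Running this grid-by-grid for each (C, g) pair gives a
cross-validation score per combination, which can then be compared
with the test-set optimum found in the first part.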
For consistency, please use the following
200-feature training and
testing files prepared by Yien (yien@csie):
training
and
testing.
Running so many combinations may take a few hours, so you
will want to start this homework as early as possible.
Write a short report (<= 2 pages) in English about what you find.
Note that the error rate is counted in a different
way: from the KDD Cup homepage, "if
there are 10 actives and 100 inactives in the test set, then each
active will effectively count 10 times as much as each
inactive."
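One way to implement this weighting: with 10 actives and 100
inactives, each active carries weight 100/10 = 10, so the two classes
contribute equally overall and the score reduces to the average of the
per-class accuracies. A sketch (the function name and the 1/-1 labels
for active/inactive are assumptions):

```python
def weighted_accuracy(y_true, y_pred, active=1):
    """Accuracy where actives are up-weighted by the inactive/active
    count ratio, per the KDD Cup rule quoted above; equivalent to
    averaging the accuracy on each class."""
    actives = [(t, p) for t, p in zip(y_true, y_pred) if t == active]
    inactives = [(t, p) for t, p in zip(y_true, y_pred) if t != active]
    acc_active = sum(t == p for t, p in actives) / len(actives)
    acc_inactive = sum(t == p for t, p in inactives) / len(inactives)
    return (acc_active + acc_inactive) / 2
```

For example, predicting both actives correctly but only half the
inactives gives (1.0 + 0.5) / 2 = 0.75, not the plain accuracy 4/6.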
For calculating cross-validation accuracy using different
criteria, you can modify the program svm-train.c,
in particular lines 144 to 157.
Last modified: Mon Oct 29 19:11:36 CST 2001