Homework 2

The first problem of KDD Cup 2001 is a classification task: predicting molecular bioactivity for drug design, specifically binding to thrombin.

The training file contains 1910 instances with more than 100,000 features. It is available in sparse format here (in zip format). For testing we have 634 instances, which are available here (in zip format).

We used the software libsvm with the following parameters

 svm-train -c 32 -g 0.0001220703125 thrombin

but the results were not good.

The winner said that he selected only 200 features. We suspect that this, rather than other factors such as scaling or the classifier used, is the main reason we did not get good performance.

Download libsvm and run the training and testing twice: once using all features and once using only 200 of them. Then compare the results and write a short report (<= 2 pages) in English about what you find.
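
The winner's selection method is not described here, so the following is only a minimal sketch of one way to cut a file in libsvm sparse format down to k features. The ranking criterion (difference in how often a feature is nonzero among actives versus inactives), the assumption that actives carry a positive label, and the function names parse_line, select_top_k and project_line are all illustrative choices, not part of the assignment.

```python
from collections import Counter


def parse_line(line):
    """Parse one libsvm sparse line: '<label> <index>:<value> ...'."""
    parts = line.split()
    label = parts[0]
    feats = {}
    for token in parts[1:]:
        idx, val = token.split(":")
        feats[int(idx)] = val  # keep the value as text so it is written back unchanged
    return label, feats


def select_top_k(lines, k):
    """Return the k feature indices (sorted ascending) whose nonzero frequency
    differs most between actives and inactives.

    Assumption: a positive label marks an active compound.
    The criterion itself is a hypothetical stand-in for the winner's method.
    """
    pos_count, neg_count = Counter(), Counter()
    n_pos = n_neg = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label, feats = parse_line(line)
        nonzero = [i for i, v in feats.items() if float(v) != 0.0]
        if float(label) > 0:
            n_pos += 1
            pos_count.update(nonzero)
        else:
            n_neg += 1
            neg_count.update(nonzero)
    all_feats = set(pos_count) | set(neg_count)
    score = {
        i: abs(pos_count[i] / max(n_pos, 1) - neg_count[i] / max(n_neg, 1))
        for i in all_feats
    }
    # Highest score first; ties broken by the smaller index for determinism.
    ranked = sorted(all_feats, key=lambda i: (-score[i], i))
    return sorted(ranked[:k])


def project_line(line, keep):
    """Rewrite one line so that it contains only the kept features,
    renumbered 1..len(keep) in ascending order of the original index."""
    new_index = {old: new for new, old in enumerate(sorted(keep), start=1)}
    label, feats = parse_line(line)
    tokens = [f"{new_index[i]}:{v}" for i, v in sorted(feats.items()) if i in new_index]
    return " ".join([label] + tokens)
```

The indices should be chosen from the training file only, and the same list then applied with project_line to both the training and the test file, so that the two reduced files share one feature numbering.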

Note that the error rate is counted in a different way. From the KDD Cup homepage: "if there are 10 actives and 100 inactives in the test set, then each active will effectively count 10 times as much as each inactive."
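
One reading of the quoted rule is that each active is weighted by (number of inactives) / (number of actives), which makes the weighted error equal to the average of the error rate on actives and the error rate on inactives. The sketch below follows that reading. It is not the official scoring program, and the function name weighted_error and the label convention (1 for active) are assumptions made for illustration.

```python
def weighted_error(y_true, y_pred, active=1):
    """Weighted error rate in which each active counts
    (n_inactive / n_active) times as much as each inactive.

    With that weight, the result simplifies to the mean of the
    per-class error rates.
    """
    n_active = sum(1 for t in y_true if t == active)
    n_inactive = len(y_true) - n_active
    if n_active == 0 or n_inactive == 0:
        raise ValueError("both classes must be present in y_true")
    # Actives predicted as inactive.
    missed_active = sum(
        1 for t, p in zip(y_true, y_pred) if t == active and p != active
    )
    # Inactives predicted as active.
    missed_inactive = sum(
        1 for t, p in zip(y_true, y_pred) if t != active and p == active
    )
    return 0.5 * (missed_active / n_active + missed_inactive / n_inactive)


# The example from the quote: 10 actives and 100 inactives.
y_true = [1] * 10 + [-1] * 100
all_inactive = [-1] * 110  # a classifier that never predicts "active"
print(weighted_error(y_true, all_inactive))
```

Under this measure, a classifier that labels every test compound inactive gets a weighted error of 0.5, although its unweighted error is only 10/110. This is why plain accuracy can look acceptable on this data while the score used for the competition is poor.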

If the result is still not good using the 200 features, we may have to change the parameters. That will then be another homework later.

Last modified: Sun Oct 28 20:02:13 CST 2001