You need MATLAB and the software LIBLINEAR. The MATLAB/Octave interface of LIBLINEAR must be built.
You need to download newbinary.m and rcv1_lineart_col.m to the MATLAB interface directory (matlab/) of LIBLINEAR.
In addition, you need to download the RCV1 data sets, rcv1_train.mat and rcv1_test.mat, to the same directory.
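Before running the experiments, it may help to confirm that everything is in place. The following is a minimal sketch, assuming the current directory is LIBLINEAR's matlab/ directory and that building the interface produced the MEX functions train and predict. The variable names are illustrative only.

```matlab
% Run from LIBLINEAR's matlab/ directory.
% Assumption: the interface has already been built, which provides
% the MEX functions "train" and "predict".
required_files = {'newbinary.m', 'rcv1_lineart_col.m', ...
                  'rcv1_train.mat', 'rcv1_test.mat'};
for i = 1:numel(required_files)
    % exist(name, 'file') returns 2 for an ordinary file that is present.
    if exist(required_files{i}, 'file') ~= 2
        fprintf('Missing file: %s\n', required_files{i});
    end
end

mex_functions = {'train', 'predict'};
for i = 1:numel(mex_functions)
    % exist(name) returns 3 when name resolves to a MEX-file.
    if exist(mex_functions{i}) ~= 3
        fprintf('MEX function not found: %s (build the interface first)\n', ...
                mex_functions{i});
    end
end
```

If nothing is printed, the files and the built interface were found.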
The script rcv1_lineart_col.m optimizes the macro-average F-measure by implementing the method "SVM.1" described in:
David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.
For a more detailed study of this approach, please see R.-E. Fan and C.-J. Lin. A Study on Threshold Selection for Multi-label Classification, 2007.
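The basic idea of tuning a per-category decision threshold for the F-measure can be illustrated with a simplified sketch. This is not the code in rcv1_lineart_col.m, and it omits details of "SVM.1" such as the cross-validation procedure. The function name best_f_threshold and its arguments are hypothetical; the function would be saved as best_f_threshold.m. Given decision values and true labels for one category on held-out data, it returns the threshold that gives the highest F1.

```matlab
function [best_t, best_f] = best_f_threshold(dec_values, labels)
% dec_values: decision values for one category (held-out data)
% labels:     true labels, +1 (positive) or -1 (negative)
% An instance is predicted positive when its decision value >= best_t.
    dec_values = dec_values(:);
    labels = labels(:);

    [sorted_dec, idx] = sort(dec_values, 'descend');
    is_pos = (labels(idx) > 0);
    n_pos = sum(is_pos);

    % Consider predicting the top-k instances as positive, k = 1..n.
    tp = cumsum(is_pos);              % true positives among the top k
    k = (1:numel(sorted_dec))';       % number of predicted positives
    f = 2 * tp ./ (k + n_pos);        % F1 = 2TP / (2TP + FP + FN)

    % With tied decision values, only cut after the last tied instance.
    valid = [sorted_dec(1:end-1) ~= sorted_dec(2:end); true];
    f(~valid) = -Inf;

    [best_f, best_k] = max(f);
    best_t = sorted_dec(best_k);
end
```

For example, best_f_threshold([2.0; 1.0; 0.5; -1.0], [1; -1; 1; -1]) returns a threshold of 0.5 with an F1 of 0.8, since accepting the top three instances yields two true positives and one false positive.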
We show results of using L1-loss SVM, L2-loss SVM, and logistic regression. This code gives results for three category sets: "Topics", "Industries", and "Regions" (see Table 5 in Lewis et al.). Type

    rcv1_lineart_col('topics', 'l2svm_dual')
    rcv1_lineart_col('industries', 'l2svm_dual')
    rcv1_lineart_col('regions', 'l2svm_dual')

You will get results similar to the following (total time is based on LIBLINEAR 1.21 and an Intel Core 2 Quad Q6600 2.40 GHz computer):

(for topics:)
INFO: microaverage: 0.812878
INFO: macroaverage: 0.617562
INFO: Total Time: 298.085084 seconds

(for industries:)
INFO: microaverage: 0.532126
INFO: macroaverage: 0.304381
INFO: Total Time: 845.156376 seconds

(for regions:)
INFO: microaverage: 0.870443
INFO: macroaverage: 0.601623
INFO: Total Time: 589.156516 seconds

We calculate the microaverage and the macroaverage F-measure for categories with one or more positive training examples, i.e., the "1+train(101)" row in Table 5 of the paper.
We use 3-fold cross-validation instead of the 5-fold cross-validation used in the paper.
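To make the difference between the two reported averages concrete, here is a minimal sketch of how micro- and macro-averaged F-measures can be computed from 0/1 label matrices. It is an illustration, not the code used by rcv1_lineart_col.m. The function name micro_macro_f is hypothetical (it would be saved as micro_macro_f.m), and treating F1 as 0 for a category with no true or predicted positives is an assumption of this sketch.

```matlab
function [micro_f, macro_f] = micro_macro_f(true_labels, pred_labels)
% true_labels, pred_labels: n-by-c 0/1 (or logical) matrices,
% one row per document and one column per category.
    true_labels = logical(true_labels);
    pred_labels = logical(pred_labels);

    tp = sum(true_labels & pred_labels, 1);    % per-category true positives
    fp = sum(~true_labels & pred_labels, 1);   % per-category false positives
    fn = sum(true_labels & ~pred_labels, 1);   % per-category false negatives

    % Macro-average: compute F1 per category, then average over categories.
    denom = 2 * tp + fp + fn;
    f_per_cat = zeros(size(denom));
    nz = denom > 0;                            % assumption: F1 = 0 if undefined
    f_per_cat(nz) = 2 * tp(nz) ./ denom(nz);
    macro_f = mean(f_per_cat);

    % Micro-average: pool the counts over all categories, then compute F1.
    micro_denom = 2 * sum(tp) + sum(fp) + sum(fn);
    if micro_denom > 0
        micro_f = 2 * sum(tp) / micro_denom;
    else
        micro_f = 0;
    end
end
```

With true_labels = [1 0; 1 0; 1 0; 0 1] and pred_labels = [1 0; 1 0; 1 1; 0 0], the first category is predicted perfectly (F1 = 1) and the second is entirely wrong (F1 = 0), so the macro-average is 0.5 while the micro-average is 0.75. Frequent categories dominate the micro-average, which is consistent with the micro-averages above being higher than the macro-averages.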
To use logistic regression, simply replace 'l2svm_dual' with 'lr':

    rcv1_lineart_col('topics', 'lr')
    rcv1_lineart_col('industries', 'lr')
    rcv1_lineart_col('regions', 'lr')

The results are listed below:

(for topics:)
INFO: microaverage: 0.807876
INFO: macroaverage: 0.596911
INFO: Total Time: 872.522069 seconds

(for industries:)
INFO: microaverage: 0.487655
INFO: macroaverage: 0.264600
INFO: Total Time: 2803.757595 seconds

(for regions:)
INFO: microaverage: 0.859337
INFO: macroaverage: 0.546168
INFO: Total Time: 1894.421539 seconds

To use L1-loss SVM, replace 'l2svm_dual' with 'l1svm_dual'.
You may swap the training and test sets by adding 'swap' as a third argument, for example:

    rcv1_lineart_col('topics', 'l2svm_dual', 'swap')

The results are:

(for topics:)
INFO: microaverage: 0.854851
INFO: macroaverage: 0.686879
INFO: Total Time: 13244.108812 seconds

(for industries:)
INFO: microaverage: 0.748794
INFO: macroaverage: 0.602279
INFO: Total Time: 38438.749033 seconds

(for regions:)
INFO: microaverage: 0.911866
INFO: macroaverage: 0.584129
INFO: Total Time: 30556.409249 seconds

Authors: Rong-En Fan, Cheng-Yu Lee, and Xiang-Rui Wang