Reproducing RCV1 results

Generating RCV1 in LIBSVM format

This directory contains scripts for generating the data. See the file INSTRUCTION for details.

Installation

You need MATLAB and LIBLINEAR, and LIBLINEAR's MATLAB/Octave interface must be built.
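One way to build the interface is sketched below. It assumes that LIBLINEAR has been unpacked into a directory named liblinear/ (a hypothetical path) and that its matlab/ directory ships a make.m script. Check the README in that directory if your release differs.

```matlab
% Run inside MATLAB or Octave.
% 'liblinear/matlab' is an assumed path; use the location where you unpacked LIBLINEAR.
cd('liblinear/matlab');
make;   % runs make.m in the current directory to compile the MEX files

% exist() reports 3 for a compiled MEX function.
if exist('train') ~= 3 || exist('predict') ~= 3
    error('The LIBLINEAR MEX functions train/predict were not found.');
end
```

If exist('train') returns 2 instead, a different function named train (for example from another toolbox) may be shadowing the LIBLINEAR one.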

Download newbinary.m and rcv1_lineart_col.m into matlab/, the MATLAB interface directory of LIBLINEAR.

In addition, download the RCV1 data sets rcv1_train.mat and rcv1_test.mat to the same directory.


RCV1

The script rcv1_lineart_col.m optimizes the macro-average F-measure by implementing the method "SVM.1" described in:

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.

For a more detailed study of this approach, please see

R.-E. Fan and C.-J. Lin. A Study on Threshold Selection for Multi-label Classification, 2007.
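The general idea of per-category threshold selection can be illustrated with a small sketch. It is not the code in rcv1_lineart_col.m and leaves out the cross-validation and any further heuristics of "SVM.1". It only shows how a threshold that maximizes the F-measure (assumed here to be F1) can be chosen from decision values on held-out data. The decision values and labels are made up for illustration.

```matlab
% Hypothetical validation data for one category (not RCV1 data).
dec = [1.2 0.7 0.3 -0.1 -0.4 -0.9];   % decision values
y   = [1   1   -1  1    -1   -1];     % true labels (+1 / -1)

% Sort instances by decreasing decision value.
[sorted_dec, idx] = sort(dec, 'descend');
sorted_y = y(idx);

% Counts obtained if exactly the top k instances are predicted positive.
tp = cumsum(sorted_y == 1);    % true positives
fp = cumsum(sorted_y == -1);   % false positives
fn = sum(y == 1) - tp;         % false negatives

% F1 for every cut k = 1..n (the "predict nothing" cut k = 0 is ignored here).
f = 2*tp ./ (2*tp + fp + fn);

% Pick the cut with the highest F1; predict positive when dec >= threshold.
[best_f, k] = max(f);
threshold = sorted_dec(k);

fprintf('best F1 = %f at threshold %f\n', best_f, threshold);
```

With these values the best cut keeps the top four instances, so the threshold is the fourth-largest decision value.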

We show results of using L1-loss SVM, L2-loss SVM and logistic regression. This code gives results for three category sets: "Topics", "Industries", and "Regions" (see Table 5 in Lewis et al.).

Type

rcv1_lineart_col('topics', 'l2svm_dual')
rcv1_lineart_col('regions', 'l2svm_dual')
rcv1_lineart_col('industries', 'l2svm_dual')

You will get results similar to the following (total time was measured with LIBLINEAR 1.21 on an Intel Core 2 Quad Q6600 2.40 GHz computer):

(for topics:)
INFO: microaverage: 0.812878
INFO: macroaverage: 0.617562
INFO: Total Time: 298.085084 seconds

(for industries:)
INFO: microaverage: 0.532126
INFO: macroaverage: 0.304381
INFO: Total Time: 845.156376 seconds

(for regions:)
INFO: microaverage: 0.870443
INFO: macroaverage: 0.601623
INFO: Total Time: 589.156516 seconds

We compute the micro-average and macro-average F-measure over categories with one or more positive training examples, i.e., the "1+train(101)" row in Table 5 of the paper.
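As a reminder of how the two averages differ, here is a minimal sketch. It assumes the F-measure is the usual F1 and uses made-up per-category counts. The variable names are illustrative and are not taken from the script.

```matlab
% Hypothetical per-category counts for three categories (not RCV1 data).
tp = [8 1 3];   % true positives
fp = [2 1 0];   % false positives
fn = [2 3 1];   % false negatives

% Per-category F1: 2*tp / (2*tp + fp + fn).
% All denominators are nonzero in this example.
f = 2*tp ./ (2*tp + fp + fn);

% Macro-average: mean of the per-category F1 values.
macroaverage = mean(f);

% Micro-average: F1 computed from the counts pooled over all categories.
microaverage = 2*sum(tp) / (2*sum(tp) + sum(fp) + sum(fn));

fprintf('INFO: microaverage: %f\n', microaverage);
fprintf('INFO: macroaverage: %f\n', macroaverage);
```

The macro-average weights every category equally, so rare categories with low F1 pull it down. This is consistent with the macro-average being lower than the micro-average in the results above.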

We use 3-fold cross-validation instead of the 5-fold cross-validation used in the paper.
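For readers unfamiliar with the procedure, a 3-fold split can be sketched as follows. This is an assumption about how folds might be assigned, not the splitting code used in rcv1_lineart_col.m. The instance count and variable names are hypothetical.

```matlab
n = 10;       % hypothetical number of training instances
nfold = 3;    % number of folds

% Assign each instance to one of the folds, in random order,
% so that fold sizes differ by at most one.
perm = randperm(n);
fold = zeros(1, n);
fold(perm) = mod(0:n-1, nfold) + 1;

n_val = zeros(1, nfold);
for i = 1:nfold
    val_idx   = find(fold == i);   % held-out instances for this fold
    train_idx = find(fold ~= i);   % remaining instances used for training
    % Here one would train on train_idx and collect decision values on val_idx.
    n_val(i) = numel(val_idx);
end
```

Each instance is held out exactly once, so decision values for the whole training set are available when the threshold is selected.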

To use logistic regression, replace 'l2svm_dual' with 'lr':

rcv1_lineart_col('topics', 'lr')
rcv1_lineart_col('industries', 'lr')
rcv1_lineart_col('regions', 'lr')

The results are listed below:

(for topics:)
INFO: microaverage: 0.807876
INFO: macroaverage: 0.596911
INFO: Total Time: 872.522069 seconds

(for industries:)
INFO: microaverage: 0.487655
INFO: macroaverage: 0.264600
INFO: Total Time: 2803.757595 seconds

(for regions:)
INFO: microaverage: 0.859337
INFO: macroaverage: 0.546168
INFO: Total Time: 1894.421539 seconds

To use L1-loss SVM, replace 'l2svm_dual' with 'l1svm_dual'.

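You may swap the training and testing sets by passing 'swap' as a third argument (shown here for topics; the results below cover all three category sets):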

rcv1_lineart_col('topics', 'l2svm_dual', 'swap')

The results are:

(for topics:)
INFO: microaverage: 0.854851
INFO: macroaverage: 0.686879
INFO: Total Time: 13244.108812 seconds

(for industries:)
INFO: microaverage: 0.748794
INFO: macroaverage: 0.602279
INFO: Total Time: 38438.749033 seconds

(for regions:)
INFO: microaverage: 0.911866
INFO: macroaverage: 0.584129
INFO: Total Time: 30556.409249 seconds

Authors: Rong-En Fan, Cheng-Yu Lee, and Xiang-Rui Wang


Please contact Chih-Jen Lin with any questions.