LIBSVM Tools: Multi-label classification

Last modified: Wed May 28 07:12:33 CST 2008

This page provides different tools for multi-label classification that are based on LIBSVM or LIBLINEAR. Comments are welcome. Please properly cite our work if you find them useful. This supports our future development. -- Chih-Jen Lin

Disclaimer: We do not take any responsibility for damage or other problems caused by using this software and these data sets.


Label Combination

A simple way to handle multi-label classification is to treat each "label set" as a single class and then train/test a multi-class problem. The script trans_class.py transforms data to multi-class sets:
Usage: ./trans_class.py training_file [testing_file]
"training_file" and "testing_file" are the original multi-label sets. The script generates three temporary files: "tmp_train" and "tmp_test" are multi-class sets, and "tmp_class" contains the mapping information.

After training/testing multi-class sets, the script measure.py (you also need subr.py) gives three measures: exact match ratio, microaverage F-measure and macroaverage F-measure.

Usage: ./measure.py testing_file testing_output_file training_class
In our calculation, when TP=FP=FN=0, F-measure is defined as 0.
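The three measures follow the standard definitions; a minimal sketch of the micro- and macro-average F-measure from per-label confusion counts (not the actual measure.py):

```python
# F-measure from per-label counts (tp, fp, fn), with F defined as 0
# when tp = fp = fn = 0, as in measure.py.

def f_measure(tp, fp, fn):
    return 0.0 if tp + fp + fn == 0 else 2.0 * tp / (2 * tp + fp + fn)

def micro_macro(counts):
    """counts: list of (tp, fp, fn), one tuple per label."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f_measure(tp, fp, fn)                          # pool counts first
    macro = sum(f_measure(*c) for c in counts) / len(counts)  # average per-label F
    return micro, macro
```

Micro-averaging pools the counts over all labels before computing F, so frequent labels dominate; macro-averaging gives every label equal weight.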

Example: (data from LIBSVM data sets)

% trans_class.py rcv1subset_topics_train_2.svm rcv1subset_topics_test_2.svm 
% svm-train -t 0 tmp_train
% svm-predict tmp_test tmp_train.model o
% measure.py rcv1subset_topics_test_2.svm o tmp_class 
You may try other multi-class methods available in BSVM.

Author: Wen-Hsien Su


Binary Approach

This approach extends the one-against-all multi-class method to multi-label classification. For each label, it builds a binary classification problem: instances associated with that label are in one class, and all others are in the other. The script binary.py (you also need subr.py) implements this approach. To use,

Usage: ./binary.py [parameters for svm-train] training_file testing_file
"training_file" and "testing_file" are multi-label sets. You need to install LIBSVM and set suitable paths (see variables svmtrain_exe and svmpredict_ete in the script).

After training/testing, binary.py gives three measures: exact match ratio, microaverage F-measure and macroaverage F-measure. In our calculation, when TP=FP=FN=0, F-measure is defined as 0. Note that in the predicted results, a test instance may not be associated with any label.
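The one-against-all (binary relevance) loop can be sketched as follows. This is only an illustration of the idea, assuming generic train/predict callables; binary.py itself drives the LIBSVM executables:

```python
# Sketch of the binary (one-against-all) approach: one binary problem
# per label; a test instance receives every label whose binary
# classifier predicts +1 (possibly none).

def binary_relevance(train_sets, x_train, x_test, train, predict):
    """train_sets[i]: set of labels of the i-th training instance.

    train(y, x) -> model and predict(model, xs) -> list of +/-1 are
    placeholders for any binary classifier (e.g. an SVM).
    """
    all_labels = sorted(set().union(*train_sets))
    predictions = [set() for _ in x_test]
    for label in all_labels:
        # +1 for instances carrying this label, -1 for the rest
        y = [1 if label in s else -1 for s in train_sets]
        model = train(y, x_train)
        for i, p in enumerate(predict(model, x_test)):
            if p == 1:
                predictions[i].add(label)
    return predictions
```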

Example: (data from LIBSVM data sets)

% binary.py -t 0 rcv1subset_train_2.svm rcv1subset_test_2.svm 

Author: Rong-En Fan


Read multi-label datasets in LIBSVM format to MATLAB

You need to download read_sparse_ml.c and compile this MATLAB-C interface using the Makefile. Type 'make' under Unix systems:

$ make
or run the following command within MATLAB:
matlab> mex -largeArrayDims read_sparse_ml.c

To load a set such as 'rcv1train.svm' into MATLAB, first launch MATLAB. Then type:

>> [y, x, map] = read_sparse_ml('rcv1train.svm');
The "y" matrix represents the labels of each instance. y(i,j) is 1 if the i-th instance has the label j, otherwise it is 0. The "x" matrix is the data. The "map" matrix stores the mapping between the internal label j and the label found in the dataset.

Author: Rong-En Fan. Minor improvement by Chun-Heng Huang, April 2013.


Generate libsvm format of RCV1

In this directory, you can find some scripts for generating the data. Please check the file INSTRUCTION for details.

RCV1

The script rcv1_lineart_col.m optimizes the macro-average F-measure by implementing the method "SVM.1" described in:

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.

For a more detailed study of this approach, please see

R.-E. Fan and C.-J. Lin. A Study on Threshold Selection for Multi-label Classification, 2007.
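The core of such threshold-selection methods is, for each category, to pick a cutoff on the decision values that maximizes F-measure. A simplified sketch (for illustration only; the actual SVM.1 procedure selects thresholds using cross-validation scores rather than a single held-out set):

```python
# Simplified per-category threshold selection: scan candidate cutoffs
# on the decision values and keep the one with the highest F-measure.

def best_threshold(scores, y):
    """scores: decision values; y: 1/0 true labels for one category."""
    best_f, best_t = -1.0, 0.0
    for t in sorted(set(scores)):           # each distinct score is a candidate
        tp = sum(1 for s, l in zip(scores, y) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, y) if s >= t and l == 0)
        fn = sum(1 for s, l in zip(scores, y) if s < t and l == 1)
        f = 0.0 if tp + fp + fn == 0 else 2.0 * tp / (2 * tp + fp + fn)
        if f > best_f:
            best_f, best_t = f, t
    return best_t, best_f
```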

We show results of using L1-loss SVM, L2-loss SVM and logistic regression. This code gives results for three category sets: "Topics", "Industries", and "Regions" (see Table 5 in Lewis et al.). You need MATLAB and the software LIBLINEAR. You must put rcv1_lineart_col.m in the MATLAB interface directory matlab/ of LIBLINEAR. In addition, you need to download the RCV1 data sets rcv1_train.mat and rcv1_test.mat to the same directory.

Type

rcv1_lineart_col('topics', 'l2svm_dual')
rcv1_lineart_col('regions', 'l2svm_dual')
rcv1_lineart_col('industries', 'l2svm_dual')
You will get results similar to the following (Total time is based on liblinear 1.21 and an Intel C2Q Q6600 2.40G computer):
(for topics:)
INFO: microaverage: 0.812878
INFO: macroaverage: 0.617562
INFO: Total Time: 298.085084 seconds

(for industries:)
INFO: microaverage: 0.532126
INFO: macroaverage: 0.304381
INFO: Total Time: 845.156376 seconds

(for regions:)
INFO: microaverage: 0.870443
INFO: macroaverage: 0.601623
INFO: Total Time: 589.156516 seconds
We calculate the microaverage and the macroaverage F-measure for categories with one or more positive training examples, i.e., the "1+train(101)" row in Table 5 of the paper.

We use 3-fold cross validation instead of 5-fold CV in the paper.

To use logistic regression, simply replace 'l2svm_dual' with 'lr':

rcv1_lineart_col('topics', 'lr')
rcv1_lineart_col('industries', 'lr')
rcv1_lineart_col('regions', 'lr')
The results are listed below:
(for topics:)
INFO: microaverage: 0.807876
INFO: macroaverage: 0.596911
INFO: Total Time: 872.522069 seconds

(for industries:)
INFO: microaverage: 0.487655
INFO: macroaverage: 0.264600
INFO: Total Time: 2803.757595 seconds

(for regions:)
INFO: microaverage: 0.859337
INFO: macroaverage: 0.546168
INFO: Total Time: 1894.421539 seconds

To use L1-loss SVM, replace 'l2svm_dual' with 'l1svm_dual'.

You may swap the training and testing sets by

rcv1_lineart_col('topics', 'l2svm_dual', 'swap')
Results are
(for topics:)
INFO: microaverage: 0.854851
INFO: macroaverage: 0.686879
INFO: Total Time: 13244.108812 seconds

(for industries:)
INFO: microaverage: 0.748794
INFO: macroaverage: 0.602279
INFO: Total Time: 38438.749033 seconds

(for regions:)
INFO: microaverage: 0.911866
INFO: macroaverage: 0.584129
INFO: Total Time: 30556.409249 seconds

Author: Rong-En Fan, Cheng-Yu Lee, and Xiang-Rui Wang


Please contact Chih-Jen Lin with any questions.