This page provides different tools for multi-label classification that are based on LIBSVM or LIBLINEAR. Comments are welcome. Please properly cite our work if you find them useful. This supports our future development. -- Chih-Jen Lin
Disclaimer: We do not take any responsibility for damage or other problems caused by using this software and these data sets.
Usage: ./trans_class.py training_file [testing_file]
"training_file" and "testing_file" are the original multi-label sets. The script generates three temporary files: "tmp_train" and "tmp_test" are multi-class sets, and "tmp_class" contains the mapping information.
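The transformation can be illustrated with a minimal sketch. It assumes that each distinct label set is mapped to one multi-class id and that the mapping is kept for later evaluation. The helper name to_multiclass and the id numbering are hypothetical; trans_class.py itself may work differently.

```python
def to_multiclass(label_sets):
    """Map each distinct label set to one multi-class id (hypothetical helper)."""
    mapping = {}      # frozenset of labels -> class id
    class_ids = []
    for labels in label_sets:
        key = frozenset(labels)              # order of labels does not matter
        if key not in mapping:
            mapping[key] = len(mapping) + 1  # ids start at 1 (arbitrary choice)
        class_ids.append(mapping[key])
    return class_ids, mapping


# Four instances; the first and third carry the same label set {1, 3}.
train_labels = [[1, 3], [2], [3, 1], [2, 4]]
class_ids, mapping = to_multiclass(train_labels)
print(class_ids)
```

In this sketch the returned mapping plays the role that "tmp_class" plays for the script: it lets a predicted class id be translated back to a set of labels.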
After training/testing multi-class sets, the script measure.py (you also need subr.py) gives three measures: exact match ratio, microaverage F-measure and macroaverage F-measure.
Usage: ./measure.py testing_file testing_output_file training_class
In our calculation, when TP=FP=FN=0, F-measure is defined as 0.
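The three measures can be sketched as follows. This is an illustration under common definitions (micro-average from summed counts, macro-average as the mean of per-label values), not the code of measure.py. The function names are hypothetical; the convention F=0 when TP=FP=FN=0 follows the statement above.

```python
def f_measure(tp, fp, fn):
    """F-measure from counts; defined as 0 when TP=FP=FN=0."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2.0 * tp / denom


def evaluate(true_sets, pred_sets, labels):
    """Return (exact match ratio, micro-average F, macro-average F)."""
    pairs = list(zip(true_sets, pred_sets))
    exact = sum(t == p for t, p in pairs) / len(pairs)
    total_tp = total_fp = total_fn = 0
    per_label = []
    for label in labels:
        tp = sum(label in t and label in p for t, p in pairs)
        fp = sum(label not in t and label in p for t, p in pairs)
        fn = sum(label in t and label not in p for t, p in pairs)
        per_label.append(f_measure(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    micro = f_measure(total_tp, total_fp, total_fn)  # pooled counts
    macro = sum(per_label) / len(per_label)          # mean over labels
    return exact, micro, macro


# Label 4 never occurs, so it contributes F = 0 to the macro-average.
true_sets = [{1, 2}, {2}, {3}]
pred_sets = [{1, 2}, {1}, {3}]
exact, micro, macro = evaluate(true_sets, pred_sets, labels=[1, 2, 3, 4])
print(exact, micro, macro)
```

The example shows why the TP=FP=FN=0 convention matters: a label that is absent from both the true and the predicted sets lowers the macro-average but leaves the micro-average unchanged.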
Example: (data from LIBSVM data sets)
% trans_class.py rcv1subset_topics_train_2.svm rcv1subset_topics_test_2.svm
% svm-train -t 0 tmp_train
% svm-predict tmp_test tmp_train.model o
% measure.py rcv1subset_test_2.svm o tmp_class
You may try other multi-class methods available in BSVM.
Author: Wen-Hsien Su
This approach extends the one-against-all multi-class method for multi-label classification. For each label, it builds a binary-class problem so instances associated with that label are in one class and the rest are in another class. The script binary.py (you also need subr.py) implements this approach. To use,
Usage: ./binary.py [parameters for svm-train] training_file testing_file
"training_file" and "testing_file" are multi-label sets. You need to install LIBSVM and set suitable paths (see the variables svmtrain_exe and svmpredict_exe in the script).
After training/testing, binary.py gives three measures: exact match ratio, microaverage F-measure and macroaverage F-measure. In our calculation, when TP=FP=FN=0, F-measure is defined as 0. In the prediction outcome, it is possible that a test instance is not associated with any label.
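The decomposition into binary problems can be sketched as below. This is a minimal illustration, not the code of binary.py: the helper names are hypothetical, and the per-label predictions in the example are made up rather than produced by svm-train and svm-predict.

```python
def binary_problems(label_sets):
    """Build one +1/-1 target vector per label (one-against-all style)."""
    labels = sorted(set().union(*label_sets))
    return {label: [1 if label in s else -1 for s in label_sets]
            for label in labels}


def combine_predictions(per_label_pred, n_instances):
    """Collect the labels predicted as +1 for each instance."""
    return [{label for label, pred in per_label_pred.items() if pred[i] == 1}
            for i in range(n_instances)]


label_sets = [{1, 3}, {2}, {1}]
targets = binary_problems(label_sets)
print(targets)

# Hypothetical outputs of three binary classifiers on three test instances.
per_label_pred = {1: [1, -1, -1], 2: [-1, -1, -1], 3: [1, -1, -1]}
predicted = combine_predictions(per_label_pred, 3)
print(predicted)
```

Because every binary classifier decides independently, an instance can be rejected by all of them; the second and third instances above end up with an empty label set.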
Example: (data from LIBSVM data sets)
% binary.py -t 0 rcv1subset_train_2.svm rcv1subset_test_2.svm
Author: Rong-En Fan
You need to download read_sparse_ml.c and compile this MATLAB-C interface by using the Makefile. Type 'make' under Unix systems:
$ make
To load a set such as 'rcv1train.svm' into MATLAB, first launch MATLAB. Then type:
>> [y, x, map] = read_sparse_ml('rcv1train.svm');
The "y" matrix represents the labels of each instance. y(i,j) is 1 if the i-th instance has the label j, otherwise it is 0. The "x" matrix is the data. The "map" matrix stores the mapping between the internal label j and the label found in the dataset.
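For readers without MATLAB, the layout of the three outputs can be illustrated with a short Python sketch. It assumes the usual multi-label LIBSVM text format, where each line starts with comma-separated labels followed by index:value pairs, and treats a line that starts directly with a feature as having no label. The function parse_multilabel is hypothetical and is not a port of read_sparse_ml.c; columns are 0-based here, unlike in MATLAB.

```python
def parse_multilabel(lines):
    """Parse lines such as '3,7 1:0.5 4:1.0' into (y, x, label_map)."""
    label_lists, x = [], []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if ":" in tokens[0]:
            labels, feats = [], tokens  # assumed: instance without labels
        else:
            labels = [int(v) for v in tokens[0].split(",")]
            feats = tokens[1:]
        label_lists.append(labels)
        x.append({int(k): float(v) for k, v in (f.split(":") for f in feats)})
    # label_map[j] is the original label stored in column j of y.
    label_map = sorted({l for labels in label_lists for l in labels})
    col = {label: j for j, label in enumerate(label_map)}
    y = [[0] * len(label_map) for _ in label_lists]
    for i, labels in enumerate(label_lists):
        for label in labels:
            y[i][col[label]] = 1
    return y, x, label_map


lines = ["3,7 1:0.5 4:1.0", "7 2:2.0", "1:1.5"]
y, x, label_map = parse_multilabel(lines)
print(y, label_map)
```

Here y is a dense indicator matrix and x a list of sparse rows stored as dictionaries, which is enough to show how the internal column index relates to the labels found in the file.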
Author: Rong-En Fan
The script rcv1_lineart_col.m optimizes the macro-average F-measure by implementing the method "SVM.1" described in:
David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.
For a more detailed study of this approach, please see R.-E. Fan and C.-J. Lin. A Study on Threshold Selection for Multi-label Classification, 2007.
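The core idea of threshold selection can be sketched in a few lines. This is a simplified illustration only: for one label, it scans candidate thresholds on held-out decision values and keeps the one with the highest F-measure. The cross validation and any further details of "SVM.1" described in the cited papers are omitted, and the function name best_threshold is hypothetical; this is not the code of rcv1_lineart_col.m.

```python
def best_threshold(scores, targets):
    """Pick the decision threshold that maximizes F-measure for one label.

    scores  : decision values of a binary classifier on validation data
    targets : true labels, +1 or -1
    An instance is predicted positive when its score >= threshold.
    """
    def f_at(th):
        tp = sum(s >= th and t == 1 for s, t in zip(scores, targets))
        fp = sum(s >= th and t != 1 for s, t in zip(scores, targets))
        fn = sum(s < th and t == 1 for s, t in zip(scores, targets))
        denom = 2 * tp + fp + fn
        return 0.0 if denom == 0 else 2.0 * tp / denom  # F = 0 if all counts are 0

    # Candidate thresholds: every distinct decision value, in increasing order.
    best_th, best_f = None, -1.0
    for th in sorted(set(scores)):
        f = f_at(th)
        if f > best_f:  # keep the lowest threshold among ties
            best_th, best_f = th, f
    return best_th, best_f


scores = [-1.2, -0.3, 0.1, 0.8]
targets = [-1, 1, -1, 1]
th, f = best_threshold(scores, targets)
print(th, f)
```

In this toy example the default threshold 0 would miss the positive instance with score -0.3, whereas the selected threshold recovers it at the cost of one false positive.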
We show results of using L1-loss SVM, L2-loss SVM and logistic regression. This code gives results for three category sets: "Topics", "Industries", and "Regions" (see Table 5 in Lewis et al.). You need MATLAB and the software LIBLINEAR. You must put rcv1_lineart_col.m in the MATLAB interface directory matlab/ of LIBLINEAR. In addition, you need to download the RCV1 data sets rcv1_train.mat and rcv1_test.mat to the same directory. Type
rcv1_lineart_col('topics', 'l2svm_dual')
rcv1_lineart_col('regions', 'l2svm_dual')
rcv1_lineart_col('industries', 'l2svm_dual')
You will get results similar to the following (Total time is based on LIBLINEAR 1.21 and an Intel C2Q Q6600 2.40G computer):
(for topics:)
INFO: microaverage: 0.812878
INFO: macroaverage: 0.617562
INFO: Total Time: 298.085084 seconds
(for industries:)
INFO: microaverage: 0.532126
INFO: macroaverage: 0.304381
INFO: Total Time: 845.156376 seconds
(for regions:)
INFO: microaverage: 0.870443
INFO: macroaverage: 0.601623
INFO: Total Time: 589.156516 seconds
We calculate the microaverage and the macroaverage F-measure for categories with one or more positive training examples, i.e., the "1+train(101)" row in Table 5 of the paper.
We use 3-fold cross validation instead of the 5-fold CV used in the paper.
To use logistic regression, simply replace 'l2svm_dual' with 'lr':
rcv1_lineart_col('topics', 'lr')
rcv1_lineart_col('industries', 'lr')
rcv1_lineart_col('regions', 'lr')
The results are listed below:
(for topics:)
INFO: microaverage: 0.807876
INFO: macroaverage: 0.596911
INFO: Total Time: 872.522069 seconds
(for industries:)
INFO: microaverage: 0.487655
INFO: macroaverage: 0.264600
INFO: Total Time: 2803.757595 seconds
(for regions:)
INFO: microaverage: 0.859337
INFO: macroaverage: 0.546168
INFO: Total Time: 1894.421539 seconds
To use L1-loss SVM, replace 'l2svm_dual' with 'l1svm_dual'.
You may swap the training and testing sets with:
rcv1_lineart_col('topics', 'l2svm_dual', 'swap')
The results are:
(for topics:)
INFO: microaverage: 0.854851
INFO: macroaverage: 0.686879
INFO: Total Time: 13244.108812 seconds
(for industries:)
INFO: microaverage: 0.748794
INFO: macroaverage: 0.602279
INFO: Total Time: 38438.749033 seconds
(for regions:)
INFO: microaverage: 0.911866
INFO: macroaverage: 0.584129
INFO: Total Time: 30556.409249 seconds
Author: Rong-En Fan, Cheng-Yu Lee, and Xiang-Rui Wang