Training and Prediction for Linear Classifiers
For a step-by-step tutorial, see
For the documentation on some commonly used command line flags, see
For the complete set of command line flags, see
Using CLI via an Example
Step 1. Data Preparation
Create a data sub-directory within LibMultiLabel and go to this sub-directory.
mkdir -p data/rcv1
cd data/rcv1
Linear methods take either textual or bag-of-words numeric data as inputs. For this example, the data will be in LibMultiLabel Format, a textual data format. Download and uncompress the RCV1 dataset with
wget -O train.txt.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_train.txt.bz2
wget -O test.txt.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_test.txt.bz2
bzip2 -d *.bz2
Browse an instance of the data with
head -n 1 train.txt
# Output: 2286 E11 ECAT M11 M12 MCAT recov recov recov recov excit excit bring mexic mexic [...]
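To make the layout of this line concrete, the following minimal Python sketch splits one line into an ID, a list of labels, and the text. It assumes that fields are separated by tabs and labels by spaces (the terminal output above does not show the separators) and that the ID column may be absent. `Instance` and `parse_line` are hypothetical names used for illustration and are not part of LibMultiLabel.

```python
from typing import List, NamedTuple, Optional


class Instance(NamedTuple):
    """One parsed line. The type and field names are hypothetical."""
    instance_id: Optional[str]
    labels: List[str]
    text: str


def parse_line(line: str) -> Instance:
    # Assumption: fields are tab-separated and labels are space-separated.
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:
        instance_id, label_field, text = fields
    elif len(fields) == 2:
        # Assumption: the ID column may be omitted.
        instance_id = None
        label_field, text = fields
    else:
        raise ValueError(
            f"expected 2 or 3 tab-separated fields, got {len(fields)}"
        )
    return Instance(instance_id, label_field.split(), text)


# The visible prefix of the line shown above, with tabs assumed between fields.
sample = (
    "2286\tE11 ECAT M11 M12 MCAT\t"
    "recov recov recov recov excit excit bring mexic mexic"
)
instance = parse_line(sample)
print(instance.instance_id, instance.labels)
```

Under these assumptions, the first field (2286) is the instance ID, the second field holds the five labels, and the rest of the line is the preprocessed text.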
If you want to use numeric data in LibSVM Format instead, you may do so with
wget -O train.svm.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_train.svm.bz2
wget -O test.svm.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_combined_test.svm.bz2
bzip2 -d *.bz2
head -n 1 train.svm
# Output: 34,59,93,94,102 864:0.0497399253756197 1523:0.044664135988103 1681:0.0673871572152868 [...]
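The numeric line above can be read as a comma-separated label list followed by index:value feature pairs. The sketch below parses one such line in plain Python. It assumes integer labels and integer feature indices, as in the line shown, and `parse_svm_line` is a hypothetical helper for illustration rather than a LibMultiLabel function.

```python
from typing import Dict, List, Tuple


def parse_svm_line(line: str) -> Tuple[List[int], Dict[int, float]]:
    """Split one line into (labels, {feature_index: value})."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    if ":" in tokens[0]:
        # Assumption: a line with no labels starts directly with a feature.
        labels: List[int] = []
        feature_tokens = tokens
    else:
        labels = [int(label) for label in tokens[0].split(",")]
        feature_tokens = tokens[1:]
    features: Dict[int, float] = {}
    for token in feature_tokens:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return labels, features


# The visible prefix of the line shown above.
sample = (
    "34,59,93,94,102 864:0.0497399253756197 "
    "1523:0.044664135988103 1681:0.0673871572152868"
)
labels, features = parse_svm_line(sample)
print(labels, sorted(features))
```

Only the nonzero features appear in the line, so a dictionary keyed by feature index is a natural in-memory representation for a single instance.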
See Dataset Formats for more details on the data formats.
Step 2. Training and Prediction via an Example
Next, move back to the root directory and run the main script
cd ../..
python3 main.py --config example_config/rcv1/l2svm.yml
This trains an L2-regularized L2-loss SVM and evaluates the model on the test set.
Training and (Optional) Prediction
To train and evaluate a model, use
python3 main.py --config CONFIG_PATH \
--training_file TRAINING_DATA_PATH \
--test_file TEST_DATA_PATH \
--linear \
--liblinear_options=LIBLINEAR_OPTIONS \
--linear_technique MULTILABEL_OR_MULTICLASS_TECHNIQUE \
--data_format DATA_FORMAT
config: Path to a configuration file. Command line options may be specified here instead. See Command Line Options for more details.
The linear classifiers are based on LIBLINEAR, and its options may be passed through liblinear_options.
training_file: The path to training data.
test_file: The path to test data. If given, the trained model is also evaluated on the test data.
linear: This option specifies that linear models should be run, as opposed to neural network models.
liblinear_options: An option string for LIBLINEAR. For example
--liblinear_options='-s 2 -B 1 -e 0.0001 -q'
linear_technique: An option for multi-label or multi-class techniques. It should be one of 1vsrest (one-vs-rest), thresholding (thresholding), cost_sensitive (cost-sensitive), and binary_and_multiclass (binary_and_multiclass).
data_format: The data format. It should be one of txt (LibMultiLabel format) and svm (LibSVM format). See Dataset Formats for more details on accepted data formats.
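When the training command is launched from a script, it can help to assemble the argument list programmatically and reject unknown values early. The following is a minimal sketch that uses only the flags and allowed values listed above. `build_train_command` is a hypothetical helper, not part of LibMultiLabel, and it only builds the argument list without running anything.

```python
import shlex
from typing import List, Optional

# Allowed values as listed in the option descriptions above.
LINEAR_TECHNIQUES = {
    "1vsrest", "thresholding", "cost_sensitive", "binary_and_multiclass",
}
DATA_FORMATS = {"txt", "svm"}


def build_train_command(
    config: str,
    training_file: str,
    test_file: Optional[str] = None,
    liblinear_options: Optional[str] = None,
    linear_technique: Optional[str] = None,
    data_format: Optional[str] = None,
) -> List[str]:
    """Return the argument list for a linear training run."""
    if linear_technique is not None and linear_technique not in LINEAR_TECHNIQUES:
        raise ValueError(f"unknown linear_technique: {linear_technique}")
    if data_format is not None and data_format not in DATA_FORMATS:
        raise ValueError(f"unknown data_format: {data_format}")

    command = [
        "python3", "main.py",
        "--config", config,
        "--training_file", training_file,
    ]
    if test_file is not None:
        command += ["--test_file", test_file]
    command.append("--linear")
    if liblinear_options is not None:
        # Kept as a single argument so the spaces inside the option
        # string are not split into separate arguments.
        command.append(f"--liblinear_options={liblinear_options}")
    if linear_technique is not None:
        command += ["--linear_technique", linear_technique]
    if data_format is not None:
        command += ["--data_format", data_format]
    return command


command = build_train_command(
    config="example_config/rcv1/l2svm.yml",
    training_file="data/rcv1/train.txt",
    test_file="data/rcv1/test.txt",
    liblinear_options="-s 2 -B 1 -e 0.0001 -q",
    linear_technique="1vsrest",
    data_format="txt",
)
# shlex.join quotes the option string so the line can be pasted into a shell.
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the LibMultiLabel root directory. Passing a list rather than a single string avoids shell quoting problems with the LIBLINEAR option string.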
Prediction
To predict on a test set with a previously trained model, use
python3 main.py --config CONFIG_PATH \
--test_file TEST_DATA_PATH \
--eval \
--linear \
--data_format DATA_FORMAT \
--checkpoint_path CHECKPOINT_PATH
where CHECKPOINT_PATH is a path to a linear_pipeline.pickle file.