Training and Prediction for Linear Classifiers

For a step-by-step tutorial, see

For the documentation on some commonly used command line flags, see

For the complete set of command line flags, see


Using CLI via an Example

Step 1. Data Preparation

Create a data sub-directory within LibMultiLabel and go to this sub-directory.

mkdir -p data/rcv1
cd data/rcv1

Linear methods take either textual or bag-of-words numeric data as inputs. For this example, the data will be in LibMultiLabel Format, a textual data format. Download and uncompress the RCV1 dataset with

wget -O train.txt.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_train.txt.bz2
wget -O test.txt.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_test.txt.bz2
bzip2 -d *.bz2

Browse an instance of the data with

head -n 1 train.txt
# Output: 2286    E11 ECAT M11 M12 MCAT   recov recov recov recov excit excit bring mexic mexic [...]
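
To make the layout of that line concrete, here is a minimal Python sketch that splits one instance into its parts. It assumes the fields are tab-separated (ID, space-separated labels, text) and that the ID column may be omitted; `Instance` and `parse_line` are hypothetical helper names for illustration, not part of LibMultiLabel.

```python
from typing import List, NamedTuple, Optional


class Instance(NamedTuple):
    """One parsed line of a LibMultiLabel-format file (hypothetical helper type)."""
    doc_id: Optional[str]
    labels: List[str]
    text: str


def parse_line(line: str) -> Instance:
    # Assumption: fields are tab-separated, either "ID<TAB>labels<TAB>text"
    # or "labels<TAB>text" when the ID column is omitted.
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:
        doc_id, labels, text = fields
    elif len(fields) == 2:
        doc_id = None
        labels, text = fields
    else:
        raise ValueError(f"expected 2 or 3 tab-separated fields, got {len(fields)}")
    # Labels are separated by spaces within their field.
    return Instance(doc_id, labels.split(), text)


# Shortened copy of the first training instance shown above.
sample = "2286\tE11 ECAT M11 M12 MCAT\trecov recov recov recov excit excit bring mexic mexic"
instance = parse_line(sample)
print(instance.doc_id, instance.labels, len(instance.text.split()))
```

The text field is left as a single string here; tokenization and feature extraction are handled by the library when it reads the file.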

To use numeric data in LibSVM Format instead, download it and browse an instance with

wget -O train.svm.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_train.svm.bz2
wget -O test.svm.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_combined_test.svm.bz2
bzip2 -d *.bz2
head -n 1 train.svm
# Output: 34,59,93,94,102  864:0.0497399253756197 1523:0.044664135988103 1681:0.0673871572152868 [...]
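
As an illustration of how such a line is structured, the following Python sketch separates the labels from the sparse features. It assumes the first whitespace-separated token is a comma-separated label list and every other token is an `index:value` pair; `parse_svm_line` is a hypothetical helper, not a LibMultiLabel function.

```python
from typing import Dict, List, Tuple


def parse_svm_line(line: str) -> Tuple[List[str], Dict[int, float]]:
    # Assumption: the first token holds comma-separated labels and each
    # remaining token is an "index:value" feature pair.
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    if ":" in tokens[0]:
        # No label field: the line starts directly with a feature pair.
        labels: List[str] = []
        pairs = tokens
    else:
        labels = tokens[0].split(",")
        pairs = tokens[1:]
    features: Dict[int, float] = {}
    for pair in pairs:
        index, value = pair.split(":", 1)
        features[int(index)] = float(value)
    return labels, features


# Shortened copy of the first training instance shown above.
sample = "34,59,93,94,102  864:0.0497399253756197 1523:0.044664135988103 1681:0.0673871572152868"
labels, features = parse_svm_line(sample)
print(labels, sorted(features))
```

Labels are kept as strings in this sketch so that no assumption is made about whether they are numeric.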

See Dataset Formats for more details on the data formats.

Step 2. Training and Prediction via an Example

Next, move back to the root directory and run the main script

cd ../..
python3 main.py --config example_config/rcv1/l2svm.yml

This trains an L2-regularized L2-loss SVM and evaluates the model on the test set.


Training and (Optional) Prediction

To train and evaluate a model, use

python3 main.py --config CONFIG_PATH \
                --training_file TRAINING_DATA_PATH \
                --test_file TEST_DATA_PATH \
                --linear \
                --liblinear_options=LIBLINEAR_OPTIONS \
                --linear_technique MULTILABEL_OR_MULTICLASS_TECHNIQUE \
                --data_format DATA_FORMAT

The linear classifiers are based on LIBLINEAR, and its options may be passed through liblinear_options.

  • config: Path to a configuration file. Any command line option may be specified in this file instead of on the command line. See Command Line Options for more details.

  • training_file: The path to training data.

  • test_file: The path to test data. If test data is provided, the trained model is also evaluated on it.

  • linear: This option specifies that linear models should be run, as opposed to neural network models.

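  • liblinear_options: An option string passed to LIBLINEAR. For example, the following selects the L2-regularized L2-loss SVC primal solver (-s 2), adds a bias term of 1 (-B 1), sets the stopping tolerance to 0.0001 (-e 0.0001), and runs in quiet mode (-q):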

    --liblinear_options='-s 2 -B 1 -e 0.0001 -q'
    
  • linear_technique: The multi-label or multi-class technique. It should be one of 1vsrest (one-vs-rest), thresholding, cost_sensitive (cost-sensitive), or binary_and_multiclass.

  • data_format: The data format. It should be either txt (LibMultiLabel format) or svm (LibSVM format). See Dataset Formats for more details on accepted data formats.
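
When launching training from a script, it can help to assemble the command as an argument list and validate the option values first. The Python sketch below does that using only the flags and allowed values described above. `build_train_command` is a hypothetical helper, and its default values (including the LIBLINEAR option string and the data paths in the usage) are illustrative assumptions, not the tool's own defaults.

```python
import shlex
from typing import List, Optional

# Allowed values as listed in the option descriptions above.
LINEAR_TECHNIQUES = {"1vsrest", "thresholding", "cost_sensitive", "binary_and_multiclass"}
DATA_FORMATS = {"txt", "svm"}


def build_train_command(
    config: str,
    training_file: str,
    test_file: Optional[str] = None,
    liblinear_options: str = "-s 2 -B 1 -e 0.0001 -q",  # illustrative default
    linear_technique: str = "1vsrest",                  # illustrative default
    data_format: str = "txt",                           # illustrative default
) -> List[str]:
    if linear_technique not in LINEAR_TECHNIQUES:
        raise ValueError(f"unknown linear_technique: {linear_technique}")
    if data_format not in DATA_FORMATS:
        raise ValueError(f"unknown data_format: {data_format}")
    command = ["python3", "main.py", "--config", config,
               "--training_file", training_file]
    if test_file is not None:
        # Evaluation on test data happens only when a test file is given.
        command += ["--test_file", test_file]
    command += [
        "--linear",
        # Kept as one argument so the spaces inside the option string survive.
        f"--liblinear_options={liblinear_options}",
        "--linear_technique", linear_technique,
        "--data_format", data_format,
    ]
    return command


command = build_train_command(
    config="example_config/rcv1/l2svm.yml",
    training_file="data/rcv1/train.txt",
    test_file="data/rcv1/test.txt",
)
# shlex.join quotes the option string so the line can be pasted into a shell.
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` from the LibMultiLabel root directory; passing a list rather than a single string avoids shell quoting problems with the LIBLINEAR option string.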

Prediction

To predict on a test set with a previously trained model, use

python3 main.py --config CONFIG_PATH \
                --test_file TEST_DATA_PATH \
                --eval \
                --linear \
                --data_format DATA_FORMAT \
                --checkpoint_path CHECKPOINT_PATH

where CHECKPOINT_PATH is the path to a linear_pipeline.pickle file from a previously trained model.