Training, Prediction, and Hyper-parameter Search for Neural Networks

For users who are just getting started, see:

If you are already familiar with the basic operations, see:


Using CLI via an Example

Step 1. Data Preparation

Create a data sub-directory within LibMultiLabel and go to this sub-directory.

mkdir -p data/rcv1
cd data/rcv1

Download the RCV1 dataset in LibMultiLabel format from LIBSVM Data with the following commands.

wget -O train.txt.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_train.txt.bz2
wget -O test.txt.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_test.txt.bz2

Uncompress data files and change the directory back to LibMultiLabel.

bzip2 -d *.bz2
cd ../..

See Dataset Formats if you want to use your own dataset.
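To sanity-check the downloaded files, a minimal sketch of reading one line is shown below. It assumes the common LibMultiLabel text layout — tab-separated fields ending with a space-separated label list and the raw text (an optional leading id column may also be present); check Dataset Formats for the authoritative description.

```python
# Sketch: split one LibMultiLabel-format line into labels and text.
# Assumption: tab-separated fields, the last two being the label list
# (space-separated) and the raw document text.
def parse_line(line):
    fields = line.rstrip("\n").split("\t")
    labels, text = fields[-2], fields[-1]
    return labels.split(), text

labels, text = parse_line("GCAT M14\tthe quick brown fox")
print(labels)  # ['GCAT', 'M14']
```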

Step 2. Training and Prediction via an Example

Train a CNN model and predict the test set with an example config. Add --cpu to run the program on the CPU.

python3 main.py --config example_config/rcv1/kim_cnn.yml

Training and (Optional) Prediction

In the training procedure, you can build a model from scratch or start from pre-obtained information such as a checkpoint, word embeddings, or a vocabulary file.

python3 main.py --config CONFIG_PATH \
                [--checkpoint_path CHECKPOINT_PATH] \
                [--embed_file EMBED_NAME_OR_EMBED_PATH] \
                [--vocab_file VOCAB_CSV_PATH]
  • config: path to a YAML file that configures the parameters. Run python3 main.py --help for details.

If a model was previously trained by this package, the training procedure can resume from it.

  • checkpoint_path: specify the path to a pre-trained model.

To use your own word embeddings or vocabulary set, specify the following parameters:

  • embed_file: either the name of a pretrained embedding defined in torchtext or the path to your own word-embedding file, where each line contains a word followed by its vector. Example:

the 0.04656 0.21318 -0.0074364 ...
a -0.29712 0.094049 -0.096662 ...
an -0.3206 0.43316 -0.086867 ...

  • vocab_file: the path to a predefined vocabulary file that contains one word per line. Example:

the
a
an
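The two file layouts above are plain text and easy to inspect. As a hedged sketch (not LibMultiLabel's own loader), the following shows how files in these assumed layouts could be parsed, taking the lines as iterables:

```python
# Sketch: parse embedding and vocabulary data in the layouts shown
# above. These helpers are illustrative, not part of LibMultiLabel.
def parse_embeddings(lines):
    # Each line: a word followed by its vector components.
    embeddings = {}
    for line in lines:
        parts = line.split()
        embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings

def parse_vocab(lines):
    # One word per line; skip blank lines.
    return [line.strip() for line in lines if line.strip()]

emb = parse_embeddings(["the 0.04656 0.21318", "a -0.29712 0.094049"])
print(emb["the"])  # [0.04656, 0.21318]
print(parse_vocab(["the\n", "a\n", "an\n"]))  # ['the', 'a', 'an']
```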

For validation, you can evaluate the model with a set of evaluation metrics. Set monitor_metrics to define the metrics printed on the screen. The argument val_metric is the metric for selecting the best model; namely, the model from the epoch with the best validation score is returned after training. If you do not specify a validation set in the configuration file via val_file or a training-validation split ratio via val_size, we split the training data into training and validation sets with an 80-20 split. Example lines in a configuration file:

monitor_metrics: [P@1, P@3, P@5]
val_metric: P@1

If test_file is specified, the model with the highest val_metric will be used to predict the test set.
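For intuition on the P@K metrics used above, a minimal sketch of precision at k follows. This is the standard textbook definition (fraction of the top-k scored labels that are relevant), not LibMultiLabel's internal implementation:

```python
def precision_at_k(scores, relevant, k):
    """Standard precision@k sketch.

    scores:   dict mapping label -> predicted score
    relevant: set of true labels for the instance
    """
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(label in relevant for label in top_k) / k

scores = {"A": 0.9, "B": 0.7, "C": 0.4, "D": 0.1}
print(precision_at_k(scores, {"A", "C"}, 1))  # 1.0  (top-1 is relevant)
print(precision_at_k(scores, {"A", "C"}, 3))  # 2/3 (two of top-3 relevant)
```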

Prediction

To deploy/evaluate a model (i.e., a pre-obtained checkpoint), you can predict a test set by the following command.

python3 main.py --eval \
                --config CONFIG_PATH \
                --checkpoint_path CHECKPOINT_PATH \
                --test_file TEST_DATA_PATH \
                --save_k_predictions K \
                --predict_out_path PREDICT_OUT_PATH
  • Use --save_k_predictions to save the top K predictions for each instance in the test set. K=100 if not specified.

  • Use --predict_out_path to specify the file for storing the predicted top-K labels/scores.
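As a hedged sketch for post-processing, the helper below parses one line of a top-K prediction file, assuming each line holds space-separated label:score pairs; inspect your own predict_out_path output to confirm the exact layout before relying on this:

```python
# Hypothetical sketch: parse one prediction line of the assumed form
# "label:score label:score ...". Not LibMultiLabel's own reader.
def parse_prediction_line(line):
    pairs = []
    for token in line.split():
        # rsplit tolerates labels that themselves contain ':'
        label, score = token.rsplit(":", 1)
        pairs.append((label, float(score)))
    return pairs

print(parse_prediction_line("GCAT:0.93 M14:0.87"))
# [('GCAT', 0.93), ('M14', 0.87)]
```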