Training, Prediction, and Hyper-parameter Search for Neural Networks
Using the CLI via an Example
Step 1. Data Preparation
Create a data sub-directory within LibMultiLabel and go to this sub-directory.
mkdir -p data/rcv1
cd data/rcv1
Download the RCV1 dataset in LibMultiLabel format from LIBSVM Data with the following commands.
wget -O train.txt.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_train.txt.bz2
wget -O test.txt.bz2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel/rcv1_topics_test.txt.bz2
Uncompress the data files and change the directory back to LibMultiLabel.
bzip2 -d *.bz2
cd ../..
See Dataset Formats here if you want to use your own dataset.
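To sanity-check the downloaded files before training, you can peek at the first training instance; the exact layout is described on the Dataset Formats page.
head -n 1 train.txt | cut -c -200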
Step 2. Training and Prediction via an Example
Train a CNN model and predict the test set with an example config. Add --cpu
to run the program on the CPU.
python3 main.py --config example_config/rcv1/kim_cnn.yml
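For example, the same run forced onto the CPU:
python3 main.py --config example_config/rcv1/kim_cnn.yml --cpu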
Training and (Optional) Prediction
In the training procedure, you can build a model from scratch or start from pre-obtained information such as a checkpoint or pre-trained word embeddings.
python3 main.py --config CONFIG_PATH \
[--checkpoint_path CHECKPOINT_PATH] \
[--embed_file EMBED_NAME_OR_EMBED_PATH] \
[--vocab_file VOCAB_CSV_PATH]
config: configure parameters in a YAML file. Run python3 main.py --help to see all available options.
If a model was previously trained by this package, the training procedure can start from it.
checkpoint_path: specify the path to a saved checkpoint of a pre-trained model.
To use your own word embeddings or vocabulary set, specify the following parameters (a combined command is sketched after the examples below):
embed_file: choose one of the pretrained embeddings defined in torchtext, or specify the path to your word embeddings, with each line containing a word followed by its vector. Example:
the 0.04656 0.21318 -0.0074364 ...
a -0.29712 0.094049 -0.096662 ...
an -0.3206 0.43316 -0.086867 ...
vocab_file: set the file path to a predefined vocabulary set that contains lines of words.
the
a
an
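A hedged example putting both options together; glove.6B.300d is one of the pretrained embedding names defined in torchtext, and the vocabulary path below is a placeholder for your own file:
python3 main.py --config example_config/rcv1/kim_cnn.yml \
                --embed_file glove.6B.300d \
                --vocab_file data/rcv1/vocab.csv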
For validation, you can evaluate the model with a set of evaluation metrics.
Set monitor_metrics to define which metrics to print on the screen.
The argument val_metric is the metric for selecting the best model.
Namely, the model from the epoch with the best validation result is returned after training.
If you specify neither a validation set via val_file nor a training-validation split ratio via val_size in the configuration file, we split the training data into training and validation sets with an 80-20 split.
Example lines in a configuration file:
monitor_metrics: [P@1, P@3, P@5]
val_metric: P@1
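The validation data itself can be given explicitly or split off automatically. A hedged sketch of the two alternatives (the valid.txt path is a placeholder):
# either point to an explicit validation file ...
val_file: data/rcv1/valid.txt
# ... or let LibMultiLabel hold out part of the training data (20% here)
val_size: 0.2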
If test_file
is specified, the model with the highest val_metric
will be used to predict the test set.
Prediction
To deploy/evaluate a model (i.e., a pre-obtained checkpoint), you can predict a test set with the following command.
python3 main.py --eval \
--config CONFIG_PATH \
--checkpoint_path CHECKPOINT_PATH \
--test_file TEST_DATA_PATH \
--save_k_predictions K \
--predict_out_path PREDICT_OUT_PATH
Use --save_k_predictions to save the top K predictions for each instance in the test set (K=100 if not specified).
Use --predict_out_path to specify the file for storing the predicted top-K labels/scores.
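A hedged concrete invocation; the checkpoint path below is hypothetical and should point to a checkpoint produced by your own training run:
python3 main.py --eval \
                --config example_config/rcv1/kim_cnn.yml \
                --checkpoint_path runs/rcv1_kim_cnn_example/best_model.ckpt \
                --test_file data/rcv1/test.txt \
                --save_k_predictions 5 \
                --predict_out_path predictions.txt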
Hyper-parameter Search
Parameter selection is known to be extremely important in machine learning practice; see a powerful reminder in “this paper”. Here we leverage Ray Tune, a Python library for hyper-parameter tuning, to select parameters. Because of Ray Tune's dependencies, first make sure your Python version is not greater than 3.8. Then, install the related packages with:
pip3 install -Ur requirements_parameter_search.txt
We provide a program search_params.py
to demonstrate how to run LibMultiLabel with Ray Tune. An example is as follows:
python3 search_params.py --config example_config/rcv1/cnn_tune.yml \
--search_alg basic_variant
config: configure all parameters in a YAML file. You can define a continuous, a discrete, or other types of search space (see a list here). An example of configuring the parameters follows (a rough mapping to Ray Tune's API is sketched after this list):
dropout: ['grid_search', [0.2, 0.4, 0.6, 0.8]] # grid search
num_filter_per_size: ['choice', [350, 450, 550]] # discrete
learning_rate: ['uniform', 0.2, 0.8] # continuous
activation: tanh # not for hyper-parameter search
search_alg: specify a search algorithm supported by Ray Tune. We support basic_variant (e.g., grid/random), bayesopt, and optuna. You can also define search_alg in the config file. For example, if you want to run grid search over learning_rate, the config looks like this:
search_alg: basic_variant
learning_rate: ['grid_search', [0.2, 0.4, 0.6, 0.8]]
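For reference, here is a rough sketch of how the YAML search-space entries above correspond to Ray Tune's search-space API; this is an illustration of the mapping, not code taken from search_params.py:
from ray import tune

# Hedged illustration: LibMultiLabel's YAML search-space entries expressed
# directly with Ray Tune's search-space primitives.
search_space = {
    "dropout": tune.grid_search([0.2, 0.4, 0.6, 0.8]),    # grid search
    "num_filter_per_size": tune.choice([350, 450, 550]),  # discrete
    "learning_rate": tune.uniform(0.2, 0.8),              # continuous
    "activation": "tanh",                                 # fixed, not searched
}
To pick a different search algorithm from the command line instead, pass it via --search_alg, e.g.:
python3 search_params.py --config example_config/rcv1/cnn_tune.yml \
                         --search_alg optuna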
After the search process, the program applies the best hyper-parameters to obtain the final model.
The re-training process by default adds the validation set to the training data.
Our empirical analysis shows that this setting improves test results.
If you do not want to incorporate the validation data into training, you can specify the option no_merge_train_val.
In either case, the optimization starts from scratch and runs for the number of epochs that led to the best validation results during the hyper-parameter search.
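One hedged way to set this, assuming search_params.py accepts it as a command-line flag (it may also be settable in the config file):
python3 search_params.py --config example_config/rcv1/cnn_tune.yml \
                         --no_merge_train_val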
For more information on this section, please refer to Grid Search over Parameters.