Hyperparameter Search for Neural Networks
The performance of a model depends on the choice of hyperparameters. The following example demonstrates how the BiGRU model performs differently on the EUR-Lex data set with two parameter sets. Data sets can be downloaded from the LIBSVM data sets page.
Direct Trying Some Parameters
First, train a BiGRU model with the default configuration file, modified only slightly to change the learning rate. Some important parameters are listed as follows.
learning_rate: 0.001
network_config:
  embed_dropout: 0.4
  post_encoder_dropout: 0.4
  rnn_dim: 512
  rnn_layers: 1
The training command is:
python3 main.py --config example_config/EUR-Lex/bigru_lwan.yml
After training for 50 epochs, the checkpoint with the best validation performance is stored for testing. The average P@1 score on the test data set is 80.36%.
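Here P@1 (precision at 1) is the fraction of test instances whose top-scored label is truly relevant. As a quick illustration of the metric with made-up scores and labels (this is not LibMultiLabel's own evaluation code):

import numpy as np

def precision_at_k(scores, labels, k=1):
    # indices of the k highest-scored labels for each instance
    top_k = np.argsort(-scores, axis=1)[:, :k]
    # 1 where a top-k predicted label is actually relevant, 0 otherwise
    hits = np.take_along_axis(labels, top_k, axis=1)
    return hits.sum(axis=1).mean() / k

# toy example with 2 instances and 4 labels
scores = np.array([[0.9, 0.1, 0.3, 0.2],
                   [0.2, 0.8, 0.7, 0.1]])
labels = np.array([[1, 0, 0, 1],
                   [0, 0, 1, 0]])
print(precision_at_k(scores, labels, k=1))  # 0.5: only the first instance's top label is relevant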
Next, the learning_rate is changed to 0.003 while other parameters are kept the same.
learning_rate: 0.003
network_config:
  embed_dropout: 0.4
  post_encoder_dropout: 0.4
  rnn_dim: 512
  rnn_layers: 1
With the same training command, the P@1 score of the second parameter set is about 78.65%, roughly 2% lower than that of the first one. This demonstrates the importance of parameter selection.
For more striking examples of the importance of parameter selection, you can see this paper.
Grid Search over Parameters
In the configuration file, we specify a grid search over the following parameters.
learning_rate: ['grid_search', [0.003, 0.001, 0.0003]]
network_config:
  embed_dropout: ['grid_search', [0, 0.2, 0.4, 0.6, 0.8]]
  post_encoder_dropout: ['grid_search', [0, 0.2, 0.4]]
  rnn_dim: ['grid_search', [256, 512, 1024]]
  rnn_layers: 1
embed_cache_dir: .vector_cache
We set embed_cache_dir to .vector_cache to avoid repeatedly downloading pre-trained embeddings for each configuration.
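Each grid_search entry is expanded into every combination of the listed values, so this configuration defines 3 × 5 × 3 × 3 = 135 candidate parameter sets. The following rough sketch only counts the grid; it is not how search_params.py enumerates or schedules the trials:

from itertools import product

# candidate values copied from the configuration above
learning_rates = [0.003, 0.001, 0.0003]
embed_dropouts = [0, 0.2, 0.4, 0.6, 0.8]
post_encoder_dropouts = [0, 0.2, 0.4]
rnn_dims = [256, 512, 1024]

# every combination of the values is one training/validation trial
grid = list(product(learning_rates, embed_dropouts, post_encoder_dropouts, rnn_dims))
print(len(grid))  # 135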
Then the training command is:
python3 search_params.py --config example_config/EUR-Lex/bigru_lwan_tune.yml
The process finds the best parameter set of learning_rate=0.0003, embed_dropout=0.6, post_encoder_dropout=0.2, and rnn_dim=256.
After the search, the program applies the best parameters and trains the final model on the combined training and validation sets. The average P@1 score on the test set is 81.99%, better than the result without a hyperparameter search. For more details about this ‘re-training’ step, please refer to the Re-train or not section.
Early Stopping of the Parameter Search
Searching over the entire parameter space is time-consuming. To save time, LibMultiLabel incorporates some early-stopping techniques implemented in Ray.
Here we demonstrate an example of applying an ASHA (Asynchronous Successive Halving Algorithm) Scheduler.
First, uncomment the following lines in the configuration file:
scheduler:
  time_attr: training_iteration
  max_t: 50
  grace_period: 10
  reduction_factor: 3
  brackets: 1
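These options roughly correspond to the arguments of Ray Tune's ASHAScheduler. The sketch below only illustrates that mapping; it is not the exact wiring inside search_params.py:

from ray.tune.schedulers import ASHAScheduler

scheduler = ASHAScheduler(
    time_attr="training_iteration",  # progress is measured in training iterations (epochs)
    max_t=50,                        # a trial runs for at most 50 iterations
    grace_period=10,                 # every trial runs at least 10 iterations before it can be stopped
    reduction_factor=3,              # roughly the best 1/3 of trials survive at each rung
    brackets=1,                      # a single successive-halving bracket
)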
Under the same computing environment and with the same command, the best parameter set of learning_rate=0.001, embed_dropout=0.4, post_encoder_dropout=0.2, and rnn_dim=512 is found in 26% of the time taken by the grid search, while the average test P@1 score is similar to the result without early stopping.
A summary of results is in the following table. Eight Nvidia Tesla V100 GPUs were used in this experiment.
| Methods | Macro-F1 | Micro-F1 | P@1 | P@5 | Training Time (GPU) |
|---|---|---|---|---|---|
| w/o parameter selection | 20.79 | 54.91 | 80.36 | 53.89 | 42.5 minutes |
| w/ parameter selection (grid search) | 24.43 | 57.99 | 81.99 | 57.57 | 23.0 hours |
| w/ parameter selection (ASHA) | 23.07 | 58.03 | 82.33 | 57.07 | 5.89 hours |
Re-train or not
In the Grid Search over Parameters section, we split the available data into training and validation sets for the hyperparameter search. Methods like SVM usually train the final model with the best hyperparameters on the combined training and validation sets. This approach maximizes the information used for model learning, and we refer to it as the “re-train” strategy.
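As a minimal, generic sketch of the strategy (using a scikit-learn SVM on toy single-label data, purely for illustration and unrelated to LibMultiLabel's implementation):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# toy data standing in for the training/validation split
x, y = make_classification(n_samples=600, n_features=20, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=0)

# select the hyperparameter on the validation set
best_c, best_score = None, -np.inf
for c in [0.1, 1.0, 10.0]:
    score = LinearSVC(C=c).fit(x_train, y_train).score(x_val, y_val)
    if score > best_score:
        best_c, best_score = c, score

# "re-train": refit with the selected hyperparameter on training + validation data
final_model = LinearSVC(C=best_c).fit(np.vstack([x_train, x_val]),
                                      np.concatenate([y_train, y_val]))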
Since re-training is usually beneficial, we have incorporated the strategy into search_params.py. When the hyperparameter search is done, the re-training process is executed automatically by default, as in the Grid Search over Parameters section.
Though not recommended, you can use the argument --no_retrain to disable the re-training process.
python search_params.py --config example_config/EUR-Lex/bigru_lwan.yml --no_retrain
By doing so, the model achieving the best validation performance during the parameter search is returned. In this case, the P@1 score with re-training is approximately 2% higher than without re-training. The following test results illustrate the advantage of re-training.
| Methods | Macro-F1 | Micro-F1 | P@1 | P@5 |
|---|---|---|---|---|
| w/o re-training after hyperparameter search | 22.95 | 56.37 | 80.08 | 56.24 |
| w/ re-training after hyperparameter search | 24.43 | 57.99 | 81.99 | 57.57 |
In a different scenario, you may want to skip the parameter search but still re-train the model with hyperparameters of your choice. The following example shows how to do this.
Let’s train a BiGRU model using the configuration file from the Direct Trying Some Parameters section, where the learning rate is set to 0.001. Note that because no validation set is specified in the configuration file, the training data is partitioned into training and validation subsets to assess the performance at each epoch.
python main.py --config example_config/EUR-Lex/bigru_lwan.yml
Using the model obtained at the epoch with the best validation P@5, the test performance is:
| Macro-F1 | Micro-F1 | P@1 | P@5 |
|---|---|---|---|
| 20.79 | 54.91 | 80.36 | 53.89 |
To get the epoch with the best validation performance, the following code snippet reads the log, extracts the performance metrics for each epoch, and identifies the optimal epoch:
import json
import numpy as np

# The log file, which records the configuration and the validation performance of
# each epoch, is saved in the 'runs' directory by default.
with open('your_log_path_for_the_first_step.json', 'r') as r:
    log = json.load(r)

# collect the validation metric of every epoch and locate the best one
log_metric = np.array([l[log["config"]["val_metric"]] for l in log["val"]])
optimal_idx = log_metric.argmax()  # if your validation metric is a loss, use argmin() instead
best_epoch = optimal_idx.item() + 1  # epochs are numbered from 1
print(best_epoch)
In this case, the optimal epoch should be 42.
We then specify --merge_train_val to include the validation set for training and specify the number of epochs with --epochs. Note that options given on the command line override those in the configuration file. Because there is no validation set, only the model at the last epoch is returned.
python main.py --config example_config/EUR-Lex/bigru_lwan.yml --epochs 42 --merge_train_val
Similar to the previous case, the test performance improves after re-training:
| Macro-F1 | Micro-F1 | P@1 | P@5 |
|---|---|---|---|
| 22.65 | 57.06 | 83.10 | 56.34 |