Hyperparameter Search for Neural Networks

The performance of a model depends on the choice of hyperparameters. The following example demonstrates how differently a BiGRU model performs on the EUR-Lex data set under two parameter sets. The data set can be downloaded from the LIBSVM datasets.

Directly Trying Some Parameters

First, train a BiGRU model using the default configuration file, with a small modification to the learning rate. Some important parameters are listed below.

learning_rate: 0.001
network_config:
  embed_dropout: 0.4
  post_encoder_dropout: 0.4
  rnn_dim: 512
  rnn_layers: 1

The training command is:

python3 main.py --config example_config/EUR-Lex/bigru_lwan.yml

After training for 50 epochs, the checkpoint with the best validation performance is stored for testing. The average P@1 score on the test data set is 80.36%.
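
As a side note, P@1 denotes the precision at the top-ranked label, i.e., how often the highest-scoring predicted label is one of the instance's true labels; P@5, reported later, averages the precision over the top five predicted labels. The snippet below is only a minimal sketch of how P@k can be computed from a score matrix, using hypothetical toy data; it is not the evaluation code used by the training script.

import numpy as np

def precision_at_k(y_true, y_score, k):
    # Indices of the k highest-scoring labels for each instance.
    top_k = np.argsort(-y_score, axis=1)[:, :k]
    # 1 where a top-k label is a true label, 0 otherwise.
    hits = np.take_along_axis(y_true, top_k, axis=1)
    # Per-instance precision over the top k, averaged over all instances.
    return hits.sum(axis=1).mean() / k

# Hypothetical toy data: 2 instances, 4 labels.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0]])
y_score = np.array([[0.9, 0.2, 0.7, 0.1],
                    [0.3, 0.8, 0.4, 0.2]])
print(precision_at_k(y_true, y_score, 1))  # 1.0: the top-1 label is correct for both instances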

Next, the learning_rate is changed to 0.003 while other parameters are kept the same.

learning_rate: 0.003
network_config:
  embed_dropout: 0.4
  post_encoder_dropout: 0.4
  rnn_dim: 512
  rnn_layers: 1

Using the same training command, the second parameter set achieves a P@1 score of about 78.65%, roughly 2% lower than the first. This demonstrates the importance of parameter selection.

For more striking examples of the importance of parameter selection, see this paper.

Grid Search over Parameters

In the configuration file, we specify a grid search over the following parameters.

learning_rate: ['grid_search', [0.003, 0.001, 0.0003]]
network_config:
  embed_dropout: ['grid_search', [0, 0.2, 0.4, 0.6, 0.8]]
  post_encoder_dropout: ['grid_search', [0, 0.2, 0.4]]
  rnn_dim: ['grid_search', [256, 512, 1024]]
  rnn_layers: 1
embed_cache_dir: .vector_cache

We set the embed_cache_dir to .vector_cache to avoid downloading pre-trained embeddings repeatedly for each configuration.
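
The grid above contains 3 × 5 × 3 × 3 = 135 parameter combinations, each of which is trained and evaluated on the validation set. As a rough sketch of what the search enumerates (not the actual implementation of search_params.py), the combinations can be listed with itertools.product:

from itertools import product

# Candidate values copied from the grid-search configuration above.
learning_rate = [0.003, 0.001, 0.0003]
embed_dropout = [0, 0.2, 0.4, 0.6, 0.8]
post_encoder_dropout = [0, 0.2, 0.4]
rnn_dim = [256, 512, 1024]

grid = list(product(learning_rate, embed_dropout, post_encoder_dropout, rnn_dim))
print(len(grid))  # 135 combinations to train and validate
print(grid[0])    # (0.003, 0, 0, 256)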

Then the training command is:

python3 search_params.py --config example_config/EUR-Lex/bigru_lwan_tune.yml

The process finds the best parameter set of learning_rate=0.0003, embed_dropout=0.6, post_encoder_dropout=0.2, and rnn_dim=256.

After the search, the program applies the best parameters and trains the final model with the validation set added to the training data. The average P@1 score on the test set is 81.99%, better than the result without a hyperparameter search. For more details about this ‘re-training’ step, please refer to the Re-train or not section.

Re-train or not

In the Grid Search over Parameters section, we split the available data into training and validation sets for the hyperparameter search. Methods like SVM usually train the final model with the best hyperparameters on the combined training and validation sets. This approach makes the most of the available data for model learning, and we refer to it as the “re-train” strategy.
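
Conceptually, the strategy amounts to the two steps sketched below. This is only a schematic illustration with dummy placeholder functions, not the actual code of search_params.py.

from itertools import product

# Dummy stand-ins for the real training and validation steps.
def train(params, data):
    return {"params": params, "data": data}

def validate(model):
    # A real run would return the validation metric (e.g., P@5); here we fake a score.
    return model["params"]["embed_dropout"]

candidates = [{"learning_rate": lr, "embed_dropout": d}
              for lr, d in product([0.003, 0.001], [0.2, 0.4])]

# Step 1: search, i.e., train on the training split and select by validation performance.
best_params = max(candidates, key=lambda p: validate(train(p, "training split")))

# Step 2: re-train a final model on the combined training and validation data.
final_model = train(best_params, "training split + validation split")
print(best_params, final_model["data"])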

Since re-training is usually beneficial, the strategy has been incorporated into search_params.py. Once the hyperparameter search finishes, re-training is executed automatically by default, as in the Grid Search over Parameters section above.

Though not recommended, you can use the argument --no_retrain to disable the re-training process.

python3 search_params.py --config example_config/EUR-Lex/bigru_lwan_tune.yml --no_retrain

By doing so, the model achieving the best validation performance during the parameter search is returned instead. In this case, P@1 with re-training is approximately 2% higher than without re-training. The following test results illustrate the advantage of re-training.

Methods                                        Macro-F1   Micro-F1   P@1     P@5
w/o re-training after hyperparameter search    22.95      56.37      80.08   56.24
w/ re-training after hyperparameter search     24.43      57.99      81.99   57.57

In a different scenario, you may want to skip the parameter search but still re-train the model with hyperparameters you have chosen. The following example shows how to do this.

Let’s train a BiGRU model using the configuration file from the Directly Trying Some Parameters section, where the learning rate is set to 0.001. Note that because no validation set is specified in the configuration file, the training data is partitioned into training and validation subsets to assess the performance at each epoch.

python3 main.py --config example_config/EUR-Lex/bigru_lwan.yml

Using the model from the epoch with the best validation P@5, the test performance is:

Macro-F1   Micro-F1   P@1     P@5
20.79      54.91      80.36   53.89

To get the epoch with the best validation performance, the following code snippet reads the log, extracts the performance metrics for each epoch, and identifies the optimal epoch:

import json
import numpy as np

# The log file, which records the configuration and the validation performance
# of each epoch, is saved in the 'runs' directory by default.
with open('your_log_path_for_the_first_step.json', 'r') as r:
    log = json.load(r)

# Collect the validation metric (e.g., P@5) recorded for every epoch.
log_metric = np.array([l[log["config"]["val_metric"]] for l in log["val"]])
optimal_idx = log_metric.argmax()  # if your validation metric is a loss, use argmin() instead
best_epoch = optimal_idx.item() + 1  # convert the 0-based index to an epoch number
print(best_epoch)

In this case, the optimal epoch is 42. We then specify --merge_train_val to include the validation set in training and set the number of epochs with --epochs. Note that options given on the command line override those in the configuration file. Because there is no validation set in this run, the model from the last epoch is the one returned.

python3 main.py --config example_config/EUR-Lex/bigru_lwan.yml --epochs 42 --merge_train_val

As in the previous case, the test performance improves after re-training:

Macro-F1   Micro-F1   P@1     P@5
22.65      57.06      83.10   56.34