Hyperparameter Search for Neural Networks
==========================================

The performance of a model depends on the choice of hyperparameters.
The following example demonstrates how the BiGRU model performs differently on the EUR-Lex data set with two parameter sets.
Datasets can be downloaded from the `LIBSVM datasets `_.

Direct Trying Some Parameters
-----------------------------

First, train a BiGRU model with the `default configuration file `_, slightly modifying the learning rate.
The important parameters are listed below.

.. code-block:: yaml

    learning_rate: 0.001
    network_config:
      embed_dropout: 0.4
      post_encoder_dropout: 0.4
      rnn_dim: 512
      rnn_layers: 1

The training command is:

.. code-block:: bash

    python3 main.py --config example_config/EUR-Lex/bigru_lwan.yml

After training for 50 epochs, the checkpoint with the best validation performance is stored for testing.
The average P@1 score on the test data set is 80.36%.

Next, the ``learning_rate`` is changed to 0.003 while the other parameters are kept the same.

.. code-block:: yaml

    learning_rate: 0.003
    network_config:
      embed_dropout: 0.4
      post_encoder_dropout: 0.4
      rnn_dim: 512
      rnn_layers: 1

With the same training command, the P@1 score of the second parameter set is about 78.65%,
which is about 1.7 percentage points lower than the first one.
This demonstrates the importance of parameter selection.
For more striking examples on the importance of parameter selection, you can see `this paper `_.

.. _Parameter Selection for Neural Networks:

Grid Search over Parameters
---------------------------

In the configuration file, we specify a grid search on the following parameters.

.. code-block:: yaml

    learning_rate: ['grid_search', [0.003, 0.001, 0.0003]]
    network_config:
      embed_dropout: ['grid_search', [0, 0.2, 0.4, 0.6, 0.8]]
      post_encoder_dropout: ['grid_search', [0, 0.2, 0.4]]
      rnn_dim: ['grid_search', [256, 512, 1024]]
      rnn_layers: 1
    embed_cache_dir: .vector_cache

We set ``embed_cache_dir`` to ``.vector_cache`` to avoid downloading the pre-trained embeddings again for each configuration.

The search command is:

.. code-block:: bash

    python3 search_params.py --config example_config/EUR-Lex/bigru_lwan_tune.yml

The process finds the best parameter set of ``learning_rate=0.0003``, ``embed_dropout=0.6``, ``post_encoder_dropout=0.2``, and ``rnn_dim=256``.

After the search, the program applies the best parameters and trains the final model on the combined training and validation sets.
The average P@1 score of this model is 81.99% on the test set, better than the result without a hyperparameter search.
For more details about re-training, please refer to the `Re-train or not`_ section.

Early Stopping of the Parameter Search
--------------------------------------

Searching over the entire parameter space is time consuming.
To save time, LibMultiLabel has incorporated some early stopping techniques implemented in `Ray `_.
Here we demonstrate an example of applying an `ASHA (Asynchronous Successive Halving Algorithm) Scheduler `_.

First, uncomment the following lines in the `configuration file `_:

.. code-block:: yaml

    scheduler:
      time_attr: training_iteration
      max_t: 50
      grace_period: 10
      reduction_factor: 3
      brackets: 1

Under the same computing environment and with the same command, the best parameter set of ``learning_rate=0.001``, ``embed_dropout=0.4``, ``post_encoder_dropout=0.2``, and ``rnn_dim=512`` is found in about 26% of the time taken by the grid search (5.89 hours versus 23.0 hours),
while the average test P@1 score is similar to the result without early stopping.

The results are summarized in the following table. Eight Nvidia Tesla V100 GPUs were used in this experiment.

.. list-table::
   :widths: 50 25 25 25 25 50
   :header-rows: 1

   * - Methods
     - Macro-F1
     - Micro-F1
     - P@1
     - P@5
     - Training Time (GPU)

   * - w/o parameter selection
     - 20.79
     - 54.91
     - 80.36
     - 53.89
     - 42.5 minutes

   * - w/ parameter selection (grid search)
     - 24.43
     - 57.99
     - 81.99
     - 57.57
     - 23.0 hours

   * - w/ parameter selection (ASHA)
     - 23.07
     - 58.03
     - 82.33
     - 57.07
     - 5.89 hours

Re-train or not
---------------

In the `Grid Search over Parameters`_ section, we split the available data into training and validation sets for the hyperparameter search.
Methods such as SVM usually train the final model with the best hyperparameters on the combined training and validation sets.
This approach maximizes the use of the available information for model learning, and we refer to it as the "re-train" strategy.

.. However, when applied in deep learning, merging the validation set into the training
.. set means that the optimization process, which previously relied on the validation set for
.. termination, no longer works. While there's no definitively proven best termination criterion
.. , a typical approach is to determine the optimal epoch during
.. hyperparameter search based on the number of training steps that led to the best
.. validation performance. This optimal epoch serves as a stopping criterion
.. when training the model with all available data. This strategy has been shown
.. to provide stable improvements while mitigating the risk of overfitting.

Since re-training is usually beneficial, we have incorporated the strategy into ``search_params.py``.
When the hyperparameter search is done, the re-training process is executed automatically by default, as in the `Grid Search over Parameters`_ section.

Though not recommended, you can use the argument ``--no_retrain`` to disable the re-training process.

.. code-block:: bash

    python3 search_params.py --config example_config/EUR-Lex/bigru_lwan_tune.yml --no_retrain

In that case, the model achieving the best validation performance during the parameter search is returned.
Here, re-training improves P@1 by about 1.9 percentage points compared to the model without re-training.
The following test results illustrate the advantage of re-training.

.. list-table::
   :widths: 50 25 25 25 25
   :header-rows: 1

   * - Methods
     - Macro-F1
     - Micro-F1
     - P@1
     - P@5

   * - w/o re-training after hyperparameter search
     - 22.95
     - 56.37
     - 80.08
     - 56.24

   * - w/ re-training after hyperparameter search
     - 24.43
     - 57.99
     - 81.99
     - 57.57

If you want to skip the parameter search but still re-train the model with your chosen hyperparameters, the following example shows how.

Let's train a BiGRU model using the configuration file from the `Direct Trying Some Parameters`_ section, where the learning rate is set to 0.001.
Please note that because the validation set is not specified in the configuration file, the training data set is partitioned into training and validation subsets to assess the performance at each epoch.

.. code-block:: bash

    python3 main.py --config example_config/EUR-Lex/bigru_lwan.yml

Using the model obtained at the epoch of the best validation PR@5, the test performance is:

.. list-table::
   :widths: 25 25 25 25
   :header-rows: 1

   * - Macro-F1
     - Micro-F1
     - P@1
     - P@5

   * - 20.79
     - 54.91
     - 80.36
     - 53.89

To get the epoch with the best validation performance, the following code snippet reads the log, extracts the validation metric for each epoch, and identifies the optimal epoch:

.. code-block:: python

    import json
    import numpy as np

    # The log file, which records the configuration and the validation
    # performance of each epoch, is saved in the 'runs' directory by default.
    with open('your_log_path_for_the_first_step.json', 'r') as r:
        log = json.load(r)

    # Collect the validation metric of every epoch, in training order.
    val_metric = log["config"]["val_metric"]
    log_metric = np.array([epoch_log[val_metric] for epoch_log in log["val"]])

    # If your validation metric is a loss, use argmin() instead.
    optimal_idx = log_metric.argmax()
    best_epoch = optimal_idx.item() + 1  # epochs are counted from 1
    print(best_epoch)

In this case, the optimal epoch should be 42.
We then specify ``--merge_train_val`` to include the validation set for training and set the number of epochs with ``--epochs``.
Note that options given explicitly on the command line override those in the configuration file.
Because there is no validation set, only the model at the last epoch is returned.

.. code-block:: bash

    python3 main.py --config example_config/EUR-Lex/bigru_lwan.yml --epochs 42 --merge_train_val

As in the previous case, the test performance improves after re-training:

.. list-table::
   :widths: 25 25 25 25
   :header-rows: 1

   * - Macro-F1
     - Micro-F1
     - P@1
     - P@5

   * - 22.65
     - 57.06
     - 83.10
     - 56.34
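
The two manual steps above (reading the best epoch from the log, then re-running ``main.py`` with ``--epochs`` and ``--merge_train_val``) can be combined into a small helper.
The following is only a minimal sketch, not part of LibMultiLabel: the function names ``find_best_epoch`` and ``retrain_command`` and the ``mode`` parameter are hypothetical, and the log layout (``log["config"]["val_metric"]`` and one entry per epoch in ``log["val"]``) is assumed to match the snippet above.
It uses only the standard library and a toy in-memory log rather than a real log file.

.. code-block:: python

    def find_best_epoch(log, mode="max"):
        """Return the 1-based epoch with the best validation score.

        Assumed log layout (taken from the snippet above):
        log["config"]["val_metric"] names the metric, and log["val"]
        holds one dict of metric values per epoch, in training order.
        """
        if mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")
        metric = log["config"]["val_metric"]
        scores = [epoch_log[metric] for epoch_log in log["val"]]
        if not scores:
            raise ValueError("the log contains no validation records")
        # Use 'min' when the validation metric is a loss.
        pick = max if mode == "max" else min
        # On ties, the earliest epoch is kept.
        best_idx = pick(range(len(scores)), key=scores.__getitem__)
        return best_idx + 1


    def retrain_command(config_path, epochs):
        """Build the re-training command shown above as an argument list."""
        return [
            "python3", "main.py",
            "--config", config_path,
            "--epochs", str(epochs),
            "--merge_train_val",
        ]


    if __name__ == "__main__":
        # Toy log with made-up scores, for illustration only.
        toy_log = {
            "config": {"val_metric": "P@1"},
            "val": [{"P@1": 0.50}, {"P@1": 0.80}, {"P@1": 0.70}],
        }
        best = find_best_epoch(toy_log)
        print(" ".join(retrain_command("example_config/EUR-Lex/bigru_lwan.yml", best)))

With a real run, the toy dictionary would be replaced by the result of ``json.load`` on the log file, and the printed command could be executed from the shell.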