Neural Network API

The neural network module libmultilabel.nn consists of three submodules. The submodule libmultilabel.nn.networks is a collection of classes that define the neural networks. The other two submodules, libmultilabel.nn.data_utils and libmultilabel.nn.nn_utils, provide utilities for processing data and for training a neural network model.


libmultilabel.nn.data_utils

libmultilabel.nn.data_utils.get_dataset_loader(data, classes, device, max_seq_length=500, batch_size=1, shuffle=False, data_workers=4, add_special_tokens=True, *, tokenizer=None, word_dict=None)[source]

Create a PyTorch DataLoader.

Parameters
  • data (list[dict]) – List of training instances with index, label, and tokenized text.

  • classes (list) – List of labels.

  • device (torch.device) – One of cuda or cpu.

  • max_seq_length (int, optional) – The maximum number of tokens of a sample. Defaults to 500.

  • batch_size (int, optional) – Size of training batches. Defaults to 1.

  • shuffle (bool, optional) – Whether to shuffle training data before each epoch. Defaults to False.

  • data_workers (int, optional) – Number of worker processes (CPU cores) used for data pre-processing. Defaults to 4.

  • add_special_tokens (bool, optional) – Whether to add the special tokens. Defaults to True.

  • tokenizer (transformers.PreTrainedTokenizerBase, optional) – HuggingFace’s tokenizer of the transformer-based pretrained language model. Defaults to None.

  • word_dict (torchtext.vocab.Vocab, optional) – A vocab object used by the word tokenizer to map tokens to indices. Defaults to None.

Returns

A PyTorch DataLoader.

Return type

torch.utils.data.DataLoader
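
Example: a minimal sketch of building a training DataLoader for a word-based model. The variables datasets, classes, and word_dict are assumed to come from load_datasets, load_or_build_label, and load_or_build_text_dict (documented below); the batch size is an illustrative choice.

    from libmultilabel.nn import data_utils
    from libmultilabel.nn.nn_utils import init_device

    # `datasets`, `classes`, and `word_dict` are assumed to have been prepared
    # with load_datasets, load_or_build_label, and load_or_build_text_dict.
    device = init_device()  # cuda when a GPU is available, otherwise cpu
    train_loader = data_utils.get_dataset_loader(
        data=datasets["train"],
        classes=classes,
        device=device,
        max_seq_length=500,
        batch_size=16,        # illustrative value
        shuffle=True,         # reshuffle the training data every epoch
        word_dict=word_dict,  # word-level vocabulary; no HuggingFace tokenizer here
    )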

libmultilabel.nn.data_utils.load_datasets(training_data=None, test_data=None, val_data=None, val_size=0.2, merge_train_val=False, tokenize_text=True, remove_no_label_data=False)[source]

Load data from the specified data paths or the given dataframes. If val_data does not exist but val_size > 0, the validation set will be split from the training dataset.

Parameters
  • training_data (Union[str, pandas.DataFrame], optional) – Path to training data or a dataframe.

  • test_data (Union[str, pandas.DataFrame], optional) – Path to test data or a dataframe.

  • val_data (Union[str, pandas.DataFrame], optional) – Path to validation data or a dataframe.

  • val_size (float, optional) – Training-validation split: a ratio in [0, 1] or an integer for the size of the validation set. Defaults to 0.2.

  • merge_train_val (bool, optional) – Whether to merge the training and validation data. Defaults to False.

  • tokenize_text (bool, optional) – Whether to tokenize text. Defaults to True.

  • remove_no_label_data (bool, optional) – Whether to remove training/validation instances that have no labels. Defaults to False.

Returns

A dictionary of datasets.

Return type

dict
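
Example: a minimal sketch with hypothetical file paths. Each of the data arguments may also be a pandas DataFrame, and when no validation file is given the validation split is taken from the training data according to val_size.

    from libmultilabel.nn import data_utils

    # Hypothetical paths to data in LibMultiLabel's text format.
    datasets = data_utils.load_datasets(
        training_data="data/rcv1/train.txt",
        test_data="data/rcv1/test.txt",
        val_size=0.2,  # no val_data given, so 20% of the training set becomes validation
    )
    print(datasets.keys())  # typically dict_keys(['train', 'val', 'test'])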

libmultilabel.nn.data_utils.load_or_build_text_dict(dataset, vocab_file=None, min_vocab_freq=1, embed_file=None, embed_cache_dir=None, silent=False, normalize_embed=False)[source]

Build or load the vocabulary from the training dataset or the predefined vocab_file. The pretrained embedding can be either from a self-defined embed_file or from one of the vectors defined in torchtext.vocab.pretrained_aliases (https://github.com/pytorch/text/blob/main/torchtext/vocab/vectors.py).

Parameters
  • dataset (list) – List of training instances with index, label, and tokenized text.

  • vocab_file (str, optional) – Path to a file holding vocabularies. Defaults to None.

  • min_vocab_freq (int, optional) – The minimum frequency needed to include a token in the vocabulary. Defaults to 1.

  • embed_file (str) – Path to a file holding pre-trained embeddings, or the name of a pretrained alias defined in torchtext.vocab.pretrained_aliases.

  • embed_cache_dir (str, optional) – Path to a directory for storing cached embeddings. Defaults to None.

  • silent (bool, optional) – Enable silent mode. Defaults to False.

  • normalize_embed (bool, optional) – Whether to normalize the embedding of each word to a unit vector. Defaults to False.

Returns

A vocab object which maps tokens to indices and the pre-trained word vectors of shape (vocab_size, embed_dim).

Return type

tuple[torchtext.vocab.Vocab, torch.Tensor]
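
Example: a minimal sketch that builds the vocabulary from the training split loaded above. "glove.6B.300d" is one of the torchtext pretrained aliases; a path to a custom embedding file can be used instead.

    from libmultilabel.nn import data_utils

    # `datasets` is assumed to come from load_datasets above.
    word_dict, embed_vecs = data_utils.load_or_build_text_dict(
        dataset=datasets["train"],
        min_vocab_freq=1,
        embed_file="glove.6B.300d",  # torchtext pretrained alias; downloaded if not cached
        normalize_embed=False,
    )
    print(len(word_dict), embed_vecs.shape)  # vocabulary size and (vocab_size, embed_dim)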

libmultilabel.nn.data_utils.load_or_build_label(datasets, label_file=None, include_test_labels=False)[source]

Obtain the label set either by loading a label file or from the given datasets. The label set contains labels in the training and validation sets. Labels in the test set are included only when include_test_labels is True.

Parameters
  • datasets (dict) – A dictionary of datasets. Each dataset contains a list of instances with index, label, and tokenized text.

  • label_file (str, optional) – Path to a file holding all labels. Defaults to None.

  • include_test_labels (bool, optional) – Whether to include labels in the test dataset. Defaults to False.

Returns

A list of labels sorted in alphabetical order.

Return type

list
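
Example: a minimal sketch that collects the label set from the datasets loaded above.

    from libmultilabel.nn import data_utils

    # `datasets` is assumed to come from load_datasets above.
    classes = data_utils.load_or_build_label(datasets, include_test_labels=False)
    print(len(classes), classes[:3])  # number of labels and the first few, in alphabetical order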

libmultilabel.nn.nn_utils

libmultilabel.nn.nn_utils.init_device(use_cpu=False)[source]

Initialize the device to CPU if use_cpu is True, and to GPU otherwise.

Parameters

use_cpu (bool, optional) – Whether to use CPU or not. Defaults to False.

Returns

One of cuda or cpu.

Return type

torch.device
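
Example:

    from libmultilabel.nn.nn_utils import init_device

    device = init_device()                   # torch.device("cuda") when a GPU is available
    cpu_device = init_device(use_cpu=True)   # force torch.device("cpu")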

libmultilabel.nn.nn_utils.init_model(model_name, network_config, classes, word_dict=None, embed_vecs=None, init_weight=None, log_path=None, learning_rate=0.0001, optimizer='adam', momentum=0.9, weight_decay=0, lr_scheduler=None, scheduler_config=None, val_metric=None, metric_threshold=0.5, monitor_metrics=None, multiclass=False, loss_function='binary_cross_entropy_with_logits', silent=False, save_k_predictions=0)[source]

Create a Model instance that handles the initialization and training of a neural network.

Parameters
  • model_name (str) – Name of the model to be used, such as KimCNN.

  • network_config (dict) – Configuration for defining the network.

  • classes (list) – List of class names.

  • word_dict (torchtext.vocab.Vocab, optional) – A vocab object used by the word tokenizer to map tokens to indices. Defaults to None.

  • embed_vecs (torch.Tensor, optional) – The pre-trained word vectors of shape (vocab_size, embed_dim). Defaults to None.

  • init_weight (str, optional) – Weight initialization method from torch.nn.init. For example, the init_weight of torch.nn.init.kaiming_uniform_ is kaiming_uniform. Defaults to None.

  • log_path (str) – Path to a directory holding the log files and models.

  • learning_rate (float, optional) – Learning rate for optimizer. Defaults to 0.0001.

  • optimizer (str, optional) – Optimizer name (i.e., sgd, adam, or adamw). Defaults to ‘adam’.

  • momentum (float, optional) – Momentum factor for SGD only. Defaults to 0.9.

  • weight_decay (int, optional) – Weight decay factor. Defaults to 0.

  • lr_scheduler (str, optional) – Name of the learning rate scheduler. Defaults to None.

  • scheduler_config (dict, optional) – The configuration for learning rate scheduler. Defaults to None.

  • val_metric (str, optional) – The metric to select the best model for testing. Used by some of the schedulers. Defaults to None.

  • metric_threshold (float, optional) – The decision value threshold over which a label is predicted as positive. Defaults to 0.5.

  • monitor_metrics (list, optional) – Metrics to monitor while validating. Defaults to None.

  • multiclass (bool, optional) – Enable multiclass mode. Defaults to False.

  • silent (bool, optional) – Enable silent mode. Defaults to False.

  • loss_function (str, optional) – Loss function name (i.e., binary_cross_entropy_with_logits, cross_entropy). Defaults to ‘binary_cross_entropy_with_logits’.

  • save_k_predictions (int, optional) – Save top k predictions on test set. Defaults to 0.

Returns

A class that implements MultiLabelModel for initializing and training a neural network.

Return type

Model
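
Example: a minimal sketch that creates a KimCNN model. The network_config keys and the monitor_metrics values below are illustrative assumptions; they must match the chosen network's constructor arguments and the metrics supported by the library.

    from libmultilabel.nn import nn_utils

    # `classes`, `word_dict`, and `embed_vecs` are assumed to come from
    # load_or_build_label and load_or_build_text_dict above.
    model = nn_utils.init_model(
        model_name="KimCNN",
        # Assumed placeholder keys; they must match the network's constructor arguments.
        network_config={"embed_dropout": 0.2, "post_encoder_dropout": 0.2,
                        "filter_sizes": [2, 4, 8], "num_filter_per_size": 128},
        classes=classes,
        word_dict=word_dict,
        embed_vecs=embed_vecs,
        learning_rate=0.0003,
        monitor_metrics=["Micro-F1", "Macro-F1", "P@1"],
    )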

libmultilabel.nn.nn_utils.init_trainer(checkpoint_dir, epochs=10000, patience=5, early_stopping_metric='P@1', val_metric='P@1', silent=False, use_cpu=False, limit_train_batches=1.0, limit_val_batches=1.0, limit_test_batches=1.0, save_checkpoints=True)[source]

Initialize a PyTorch Lightning trainer.

Parameters
  • checkpoint_dir (str) – Directory for saving models and logs.

  • epochs (int) – Number of epochs to train. Defaults to 10000.

  • patience (int) – Number of epochs to wait for improvement before early stopping. Defaults to 5.

  • early_stopping_metric (str) – The metric to monitor for early stopping. Defaults to ‘P@1’.

  • val_metric (str) – The metric to select the best model for testing. Defaults to ‘P@1’.

  • silent (bool) – Enable silent mode. Defaults to False.

  • use_cpu (bool) – Disable CUDA. Defaults to False.

  • limit_train_batches (Union[int, float]) – Fraction (float) or number of batches (int) of the training dataset to use. Defaults to 1.0.

  • limit_val_batches (Union[int, float]) – Fraction (float) or number of batches (int) of the validation dataset to use. Defaults to 1.0.

  • limit_test_batches (Union[int, float]) – Fraction (float) or number of batches (int) of the test dataset to use. Defaults to 1.0.

  • save_checkpoints (bool) – Whether to save the last and the best checkpoint or not. Defaults to True.

Returns

A PyTorch Lightning trainer.

Return type

lightning.Trainer
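
Example: a minimal sketch that creates a trainer and runs training. The checkpoint directory is a hypothetical path, and model, train_loader, and val_loader are assumed to come from init_model and get_dataset_loader above.

    from libmultilabel.nn import nn_utils

    trainer = nn_utils.init_trainer(
        checkpoint_dir="runs/example",  # hypothetical output directory
        epochs=50,
        patience=5,
        early_stopping_metric="P@1",
        val_metric="P@1",
    )
    # `model` comes from init_model; the loaders come from get_dataset_loader.
    trainer.fit(model, train_loader, val_loader)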

libmultilabel.nn.nn_utils.set_seed(seed)[source]

Set random seeds for NumPy and PyTorch.

Parameters

seed (int) – Random seed.
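
Example:

    from libmultilabel.nn.nn_utils import set_seed

    set_seed(1337)  # fix the NumPy and PyTorch random seeds for reproducibility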