Neural Network API
The neural network module libmultilabel.nn contains three submodules. libmultilabel.nn.networks is a collection of classes that define the neural networks. The other two, libmultilabel.nn.data_utils and libmultilabel.nn.nn_utils, are utilities for processing data and training a neural network model.
libmultilabel.nn.data_utils
- libmultilabel.nn.data_utils.get_dataset_loader(data, classes, device, max_seq_length=500, batch_size=1, shuffle=False, data_workers=4, add_special_tokens=True, *, tokenizer=None, word_dict=None)
Create a PyTorch DataLoader.
- Parameters
data (list[dict]) – List of training instances with index, label, and tokenized text.
classes (list) – List of labels.
device (torch.device) – One of cuda or cpu.
max_seq_length (int, optional) – The maximum number of tokens of a sample. Defaults to 500.
batch_size (int, optional) – Size of training batches. Defaults to 1.
shuffle (bool, optional) – Whether to shuffle training data before each epoch. Defaults to False.
data_workers (int, optional) – Number of CPU worker processes used for data pre-processing. Defaults to 4.
add_special_tokens (bool, optional) – Whether to add the special tokens. Defaults to True.
tokenizer (transformers.PreTrainedTokenizerBase, optional) – HuggingFace’s tokenizer of the transformer-based pretrained language model. Defaults to None.
word_dict (torchtext.vocab.Vocab, optional) – A vocab object for word tokenizer to map tokens to indices. Defaults to None.
- Returns
A PyTorch DataLoader.
- Return type
torch.utils.data.DataLoader
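To illustrate how max_seq_length, batch_size, and shuffle interact, here is a minimal pure-Python sketch. It is not the library's implementation: the helper name iter_batches, the seed argument, and the "text" key are hypothetical, and the real function returns a torch.utils.data.DataLoader.

```python
import random


def iter_batches(data, max_seq_length=500, batch_size=1, shuffle=False, seed=0):
    """Yield batches of instances, truncating each token list to max_seq_length.

    Hypothetical helper for illustration only; the "text" key is an assumption.
    """
    order = list(range(len(data)))
    if shuffle:
        # Shuffle the visiting order, not the caller's list.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = []
        for i in order[start:start + batch_size]:
            instance = dict(data[i])  # shallow copy so the input stays untouched
            instance["text"] = instance["text"][:max_seq_length]
            batch.append(instance)
        yield batch


data = [
    {"index": 0, "label": ["sports"], "text": ["a", "b", "c", "d"]},
    {"index": 1, "label": ["politics"], "text": ["e", "f"]},
    {"index": 2, "label": ["sports", "tech"], "text": ["g"]},
]
batches = list(iter_batches(data, max_seq_length=3, batch_size=2))
```

With three instances and batch_size=2, the last batch holds a single instance, and the first instance is cut to three tokens.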
- libmultilabel.nn.data_utils.load_datasets(training_data=None, test_data=None, val_data=None, val_size=0.2, merge_train_val=False, tokenize_text=True, remove_no_label_data=False)
Load data from the specified data paths or the given dataframe. If val_data does not exist but val_size > 0, the validation set will be split from the training dataset.
- Parameters
training_data (Union[str, pandas.DataFrame], optional) – Path to training data or a dataframe.
test_data (Union[str, pandas.DataFrame], optional) – Path to test data or a dataframe.
val_data (Union[str, pandas.DataFrame], optional) – Path to validation data or a dataframe.
val_size (float, optional) – Training-validation split: a ratio in [0, 1] or an integer for the size of the validation set. Defaults to 0.2.
merge_train_val (bool, optional) – Whether to merge the training and validation data. Defaults to False.
tokenize_text (bool, optional) – Whether to tokenize text. Defaults to True.
remove_no_label_data (bool, optional) – Whether to remove training/validation instances that have no labels. Defaults to False.
- Returns
A dictionary of datasets.
- Return type
dict
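The two forms of val_size (a ratio in [0, 1] or an integer count) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper split_train_val, its seed argument, the rounding rule, and the "train" / "val" keys are hypothetical, and a random split is assumed.

```python
import random


def split_train_val(instances, val_size=0.2, seed=0):
    """Split instances into training and validation parts.

    Hypothetical helper: a float val_size is read as a ratio in [0, 1],
    an int as the absolute number of validation instances.
    """
    n = len(instances)
    if isinstance(val_size, float):
        if not 0.0 <= val_size <= 1.0:
            raise ValueError("a float val_size must lie in [0, 1]")
        n_val = round(n * val_size)
    else:
        n_val = int(val_size)
        if not 0 <= n_val <= n:
            raise ValueError("an integer val_size must lie in [0, len(instances)]")
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    return {"train": shuffled[n_val:], "val": shuffled[:n_val]}


by_ratio = split_train_val(list(range(10)), val_size=0.2)
by_count = split_train_val(list(range(10)), val_size=3)
```

Note that val_size=1 (an integer) and val_size=1.0 (a float) mean different things under this reading: one validation instance versus the whole training set.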
- libmultilabel.nn.data_utils.load_or_build_text_dict(dataset, vocab_file=None, min_vocab_freq=1, embed_file=None, embed_cache_dir=None, silent=False, normalize_embed=False)
Build or load the vocabulary from the training dataset or the predefined vocab_file. The pretrained embedding can be either from a self-defined embed_file or from one of the vectors defined in torchtext.vocab.pretrained_aliases (https://github.com/pytorch/text/blob/main/torchtext/vocab/vectors.py).
- Parameters
dataset (list) – List of training instances with index, label, and tokenized text.
vocab_file (str, optional) – Path to a file holding vocabularies. Defaults to None.
min_vocab_freq (int, optional) – The minimum frequency needed to include a token in the vocabulary. Defaults to 1.
embed_file (str, optional) – Path to a file holding pre-trained embeddings. Defaults to None.
embed_cache_dir (str, optional) – Path to a directory for storing cached embeddings. Defaults to None.
silent (bool, optional) – Enable silent mode. Defaults to False.
normalize_embed (bool, optional) – Whether to normalize the embedding of each word to a unit vector. Defaults to False.
- Returns
A vocab object which maps tokens to indices and the pre-trained word vectors of shape (vocab_size, embed_dim).
- Return type
tuple[torchtext.vocab.Vocab, torch.Tensor]
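The effect of min_vocab_freq and normalize_embed can be sketched in plain Python. The helpers build_vocab and normalize_rows are hypothetical, as is the "text" key. The sketch also leaves out anything the library may add on top, such as special tokens or loading pretrained vectors.

```python
import math
from collections import Counter


def build_vocab(dataset, min_vocab_freq=1):
    """Map each token seen at least min_vocab_freq times to an index.

    Hypothetical helper; assumes each instance stores its tokens under "text".
    """
    counter = Counter(token for instance in dataset for token in instance["text"])
    kept = sorted(token for token, count in counter.items() if count >= min_vocab_freq)
    return {token: index for index, token in enumerate(kept)}


def normalize_rows(vectors):
    """Scale every non-zero vector to unit Euclidean length."""
    normalized = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        # Leave all-zero vectors unchanged to avoid dividing by zero.
        normalized.append([x / norm for x in vec] if norm > 0 else list(vec))
    return normalized


dataset = [
    {"text": ["deep", "learning", "model"]},
    {"text": ["deep", "model"]},
]
vocab = build_vocab(dataset, min_vocab_freq=2)
unit = normalize_rows([[3.0, 4.0], [0.0, 0.0]])
```

Here "learning" occurs only once, so it is dropped when min_vocab_freq=2.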
- libmultilabel.nn.data_utils.load_or_build_label(datasets, label_file=None, include_test_labels=False)
Obtain the label set from loading a label file or from the given data sets. The label set contains labels in the training and validation sets. Labels in the test set are included only when include_test_labels is True.
- Parameters
datasets (dict) – A dictionary of datasets. Each dataset contains a list of instances with index, label, and tokenized text.
label_file (str, optional) – Path to a file holding all labels. Defaults to None.
include_test_labels (bool, optional) – Whether to include labels in the test dataset. Defaults to False.
- Returns
A list of labels sorted in alphabetical order.
- Return type
list
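The rule described above (training and validation labels always, test labels only on request) can be sketched as below. The helper collect_labels, the split names "train", "val", and "test", and the "label" key are assumptions made for illustration. Loading from label_file is not covered.

```python
def collect_labels(datasets, include_test_labels=False):
    """Return the sorted label set of the given datasets.

    Hypothetical helper; assumes each instance stores a list under "label".
    """
    splits = ["train", "val"]
    if include_test_labels:
        splits.append("test")
    labels = set()
    for split in splits:
        for instance in datasets.get(split, []):
            labels.update(instance["label"])
    return sorted(labels)


datasets = {
    "train": [{"label": ["b", "a"]}, {"label": ["a"]}],
    "val": [{"label": ["c"]}],
    "test": [{"label": ["d", "a"]}],
}
without_test = collect_labels(datasets)
with_test = collect_labels(datasets, include_test_labels=True)
```

A label that appears only in the test set, such as "d" here, is absent from the result unless include_test_labels is True.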
libmultilabel.nn.nn_utils
- libmultilabel.nn.nn_utils.init_device(use_cpu=False)
Initialize the device: CPU if use_cpu is True, otherwise GPU.
- Parameters
use_cpu (bool, optional) – Whether to use CPU or not. Defaults to False.
- Returns
One of cuda or cpu.
- Return type
torch.device
- libmultilabel.nn.nn_utils.init_model(model_name, network_config, classes, word_dict=None, embed_vecs=None, init_weight=None, log_path=None, learning_rate=0.0001, optimizer='adam', momentum=0.9, weight_decay=0, lr_scheduler=None, scheduler_config=None, val_metric=None, metric_threshold=0.5, monitor_metrics=None, multiclass=False, loss_function='binary_cross_entropy_with_logits', silent=False, save_k_predictions=0)
Initialize a Model class for initializing and training a neural network.
- Parameters
model_name (str) – Model to be used such as KimCNN.
network_config (dict) – Configuration for defining the network.
classes (list) – List of class names.
word_dict (torchtext.vocab.Vocab, optional) – A vocab object for word tokenizer to map tokens to indices. Defaults to None.
embed_vecs (torch.Tensor, optional) – The pre-trained word vectors of shape (vocab_size, embed_dim). Defaults to None.
init_weight (str, optional) – Weight initialization method from torch.nn.init, given without the trailing underscore. For example, use kaiming_uniform for torch.nn.init.kaiming_uniform_. Defaults to None.
log_path (str, optional) – Path to a directory holding the log files and models. Defaults to None.
learning_rate (float, optional) – Learning rate for optimizer. Defaults to 0.0001.
optimizer (str, optional) – Optimizer name (i.e., sgd, adam, or adamw). Defaults to ‘adam’.
momentum (float, optional) – Momentum factor for SGD only. Defaults to 0.9.
weight_decay (int, optional) – Weight decay factor. Defaults to 0.
lr_scheduler (str, optional) – Name of the learning rate scheduler. Defaults to None.
scheduler_config (dict, optional) – The configuration for learning rate scheduler. Defaults to None.
val_metric (str, optional) – The metric to select the best model for testing. Used by some of the schedulers. Defaults to None.
metric_threshold (float, optional) – The decision value threshold over which a label is predicted as positive. Defaults to 0.5.
monitor_metrics (list, optional) – Metrics to monitor while validating. Defaults to None.
multiclass (bool, optional) – Enable multiclass mode. Defaults to False.
silent (bool, optional) – Enable silent mode. Defaults to False.
loss_function (str, optional) – Loss function name (i.e., binary_cross_entropy_with_logits, cross_entropy). Defaults to ‘binary_cross_entropy_with_logits’.
save_k_predictions (int, optional) – Save top k predictions on test set. Defaults to 0.
- Returns
A class that implements MultiLabelModel for initializing and training a neural network.
- Return type
Model
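As a rough illustration of metric_threshold and save_k_predictions, the sketch below turns per-label scores into predictions. The helpers predict_labels and top_k_predictions are hypothetical and not part of the library. The scores are assumed to be on the same scale as the threshold already.

```python
def predict_labels(scores, classes, metric_threshold=0.5):
    """Return the labels whose score is strictly above the threshold."""
    return [label for label, score in zip(classes, scores) if score > metric_threshold]


def top_k_predictions(scores, classes, k):
    """Return the k highest-scoring (label, score) pairs, best first."""
    ranked = sorted(zip(classes, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


classes = ["economy", "politics", "sports"]
scores = [0.9, 0.5, 0.1]
positive = predict_labels(scores, classes)
top_two = top_k_predictions(scores, classes, k=2)
```

A score equal to the threshold is not counted as positive in this sketch, following the wording "over which" in the parameter description.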
- libmultilabel.nn.nn_utils.init_trainer(checkpoint_dir, epochs=10000, patience=5, early_stopping_metric='P@1', val_metric='P@1', silent=False, use_cpu=False, limit_train_batches=1.0, limit_val_batches=1.0, limit_test_batches=1.0, save_checkpoints=True)
Initialize a PyTorch Lightning trainer.
- Parameters
checkpoint_dir (str) – Directory for saving models and log.
epochs (int, optional) – Number of epochs to train. Defaults to 10000.
patience (int, optional) – Number of epochs to wait for improvement before early stopping. Defaults to 5.
early_stopping_metric (str, optional) – The metric to monitor for early stopping. Defaults to ‘P@1’.
val_metric (str, optional) – The metric to select the best model for testing. Defaults to ‘P@1’.
silent (bool, optional) – Enable silent mode. Defaults to False.
use_cpu (bool, optional) – Disable CUDA. Defaults to False.
limit_train_batches (Union[int, float], optional) – How much of the training dataset to use: a float is a fraction of the batches, an int is a number of batches. Defaults to 1.0.
limit_val_batches (Union[int, float], optional) – How much of the validation dataset to use: a float is a fraction of the batches, an int is a number of batches. Defaults to 1.0.
limit_test_batches (Union[int, float], optional) – How much of the test dataset to use: a float is a fraction of the batches, an int is a number of batches. Defaults to 1.0.
save_checkpoints (bool, optional) – Whether to save the last and the best checkpoints. Defaults to True.
- Returns
A PyTorch Lightning trainer.
- Return type
lightning.trainer
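The meaning of patience can be sketched with a small stand-alone function. It is a simplified model of early stopping, not the trainer's actual callback: the helper stopping_epoch is hypothetical, and it assumes a metric where higher is better (as with P@1) and no minimum improvement margin.

```python
def stopping_epoch(metric_history, patience=5):
    """Return the 0-based epoch at which training would stop, or None.

    Hypothetical helper: stops once the monitored metric has failed to
    improve on its best value for `patience` consecutive epochs.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, value in enumerate(metric_history):
        if value > best:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


history = [0.50, 0.60, 0.59, 0.58, 0.60, 0.57]
stopped_at = stopping_epoch(history, patience=3)
never_stopped = stopping_epoch(history, patience=5)
```

In this history the best value 0.60 is reached at epoch 1. Epochs 2, 3, and 4 bring no strict improvement, so with patience=3 the sketch stops at epoch 4.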