Linear Classifier API

Train and Predict

The linear methods are built on LibLinear. The simplest usage is:

import libmultilabel.linear as linear

model = linear.train_1vsrest(train_y, train_x, options)
preds = linear.predict_values(model, test_x)

libmultilabel.linear.train_1vsrest(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) → FlatModel[source]

Train a linear model for multi-label data using a one-vs-rest strategy.

Parameters

y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
multiclass (bool, optional) – A flag indicating if the dataset is multiclass.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
verbose (bool, optional) – Output extra progress information. Defaults to True.

Returns

A model which can be used in predict_values.
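
As a rough illustration of the one-vs-rest strategy, the plain-Python sketch below splits a 0/1 label matrix into one binary problem per class. It only shows the decomposition. The helper name one_vs_rest_targets and the +1/-1 encoding are assumptions made for this example and are not part of libmultilabel, which trains on sparse matrices through LibLinear.

```python
# Plain-Python illustration of the one-vs-rest decomposition.
# y is a 0/1 label matrix: rows are instances, columns are classes.
y = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]


def one_vs_rest_targets(y):
    """Return one +1/-1 target vector per class (hypothetical helper)."""
    num_classes = len(y[0])
    return [[1 if row[j] == 1 else -1 for row in y] for j in range(num_classes)]


binary_problems = one_vs_rest_targets(y)
for class_index, targets in enumerate(binary_problems):
    # Each class gets its own binary problem over all instances.
    print(f"class {class_index}: {targets}")
```

Each of these binary problems would be handed to a binary linear solver, giving one weight vector per class.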

libmultilabel.linear.train_thresholding(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) → FlatModel[source]

Train a linear model for multi-label data using a one-vs-rest strategy and cross-validation to pick decision thresholds optimizing the sum of Macro-F1 and Micro-F1. It outperforms train_1vsrest in most aspects, at the cost of longer training time because of the internal cross-validation.

This method is the micromacro-freq approach from the CIKM 2023 paper “On the Thresholding Strategy for Infrequent Labels in Multi-label Classification” (see Section 4.3 and Supplementary D).
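
To give a feel for what picking a decision threshold means, here is a deliberately simplified sketch. It is not the micromacro-freq approach and not the library's implementation: it handles a single label, optimizes plain F1, and uses one set of held-out decision values instead of cross-validation. The function names and data are invented for the example.

```python
def f1_score(targets, predictions):
    """F1 for one label from 0/1 targets and 0/1 predictions."""
    tp = sum(1 for t, p in zip(targets, predictions) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(targets, predictions) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(targets, predictions) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(decision_values, targets):
    """Try each observed decision value as a threshold and keep the best F1."""
    best = (0.0, -1.0)  # (threshold, f1)
    for candidate in sorted(set(decision_values)):
        predictions = [1 if v >= candidate else 0 for v in decision_values]
        score = f1_score(targets, predictions)
        if score > best[1]:
            best = (candidate, score)
    return best


# Held-out decision values and true 0/1 labels for a single class (made up).
decision_values = [-1.2, -0.4, -0.1, 0.3, 0.9]
targets = [0, 1, 0, 1, 1]

threshold, score = best_threshold(decision_values, targets)
default_score = f1_score(targets, [1 if v >= 0 else 0 for v in decision_values])
print(threshold, round(score, 3), default_score)
```

In this toy data the tuned threshold is below zero and yields a higher F1 than the default threshold of 0, which is the kind of gain thresholding aims for on infrequent labels.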

Parameters

y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
multiclass (bool, optional) – A flag indicating if the dataset is multiclass.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
verbose (bool, optional) – Output extra progress information. Defaults to True.

Returns

A model which can be used in predict_values.

libmultilabel.linear.train_cost_sensitive(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) → FlatModel[source]

Train a linear model for multi-label data using a one-vs-rest strategy and cross-validation to pick an optimal asymmetric misclassification cost for Macro-F1. It outperforms train_1vsrest in most aspects at the cost of higher time complexity. See the user guide for more details.

Parameters

y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
multiclass (bool, optional) – A flag indicating if the dataset is multiclass.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
verbose (bool, optional) – Output extra progress information. Defaults to True.

Returns

A model which can be used in predict_values.

libmultilabel.linear.train_cost_sensitive_micro(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) → FlatModel[source]

Train a linear model for multi-label data using a one-vs-rest strategy and cross-validation to pick an optimal asymmetric misclassification cost for Micro-F1. It outperforms train_1vsrest in most aspects at the cost of higher time complexity. See the user guide for more details.

Parameters

y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
multiclass (bool, optional) – A flag indicating if the dataset is multiclass.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
verbose (bool, optional) – Output extra progress information. Defaults to True.

Returns

A model which can be used in predict_values.

libmultilabel.linear.train_binary_and_multiclass(y: csr_matrix, x: csr_matrix, multiclass: bool = True, options: str = '', verbose: bool = True) → FlatModel[source]

Train a linear model for binary and multi-class data.

Parameters

y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
multiclass (bool, optional) – A flag indicating if the dataset is multiclass.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
verbose (bool, optional) – Output extra progress information. Defaults to True.

Returns

A model which can be used in predict_values.

libmultilabel.linear.train_tree(y: csr_matrix, x: csr_matrix, options: str = '', K=100, dmax=10, verbose: bool = True) → TreeModel[source]

Train a linear model for multi-label data using a divide-and-conquer strategy. The algorithm used is based on https://github.com/xmc-aalto/bonsai.

Parameters

y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
K (int, optional) – Maximum degree of nodes in the tree. Defaults to 100.
dmax (int, optional) – Maximum depth of the tree. Defaults to 10.
verbose (bool, optional) – Output extra progress information. Defaults to True.

Returns

A model which can be used in predict_values.

libmultilabel.linear.predict_values(model, x: csr_matrix) → ndarray[source]

Calculate the decision values associated with x, equivalent to model.predict_values(x).

Parameters

model – A model returned from a training function.
x (sparse.csr_matrix) – A matrix with dimension number of instances * number of features.

Returns

A matrix with dimension number of instances * number of classes.

Return type

np.ndarray
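
The shapes involved can be sketched in plain Python. For a linear model, the decision value of an instance for a class is the inner product of its feature vector with that class's weight vector. The sketch below assumes a weight table of shape number of features * number of classes and ignores any bias term; the function name and the dict-based sparse rows are stand-ins for this example, not the library's data structures.

```python
def decision_values(x_rows, weights):
    """x_rows: list of {feature_index: value}; weights: features x classes."""
    num_classes = len(weights[0])
    result = []
    for row in x_rows:
        scores = [0.0] * num_classes
        # Only non-zero features contribute, as with a sparse matrix row.
        for feature_index, value in row.items():
            for class_index in range(num_classes):
                scores[class_index] += value * weights[feature_index][class_index]
        result.append(scores)
    return result


# 3 features, 2 classes (made-up weights).
weights = [[1.0, -1.0], [0.5, 0.0], [-2.0, 1.0]]
# 2 instances stored sparsely.
x_rows = [{0: 1.0, 2: 1.0}, {1: 2.0}]

values = decision_values(x_rows, weights)
print(values)  # one row per instance, one column per class
```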

libmultilabel.linear.get_topk_labels(preds: ndarray, label_mapping: ndarray, top_k: int = 5) → tuple[numpy.ndarray, numpy.ndarray][source]

Get labels and scores of top k predictions from decision values.

Parameters

preds (np.ndarray) – A matrix of decision values with dimension number of instances * number of classes.
label_mapping (np.ndarray) – An ndarray of class labels that maps each index (from 0 to num_class-1) to its label.
top_k (int, optional) – The number of classes to predict per instance. Defaults to 5.

Returns

Two 2D ndarrays: the first contains the predicted labels and the second the corresponding scores. Both have dimension number of instances * top_k.
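
The selection described above can be sketched with plain lists. This is an illustration of the idea only, assuming the top k classes are taken in descending order of decision value; topk_labels is a hypothetical stand-in, not the library function, and it uses nested lists in place of ndarrays.

```python
def topk_labels(preds, label_mapping, top_k=5):
    """Return (labels, scores) of the top_k highest decision values per row."""
    labels, scores = [], []
    for row in preds:
        # Indices of the classes sorted by decision value, highest first.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:top_k]
        labels.append([label_mapping[j] for j in order])
        scores.append([row[j] for j in order])
    return labels, scores


preds = [[0.2, -1.0, 1.5], [-0.3, 0.8, 0.1]]
label_mapping = ["cat", "dog", "bird"]  # index -> label

labels, scores = topk_labels(preds, label_mapping, top_k=2)
print(labels)
print(scores)
```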

libmultilabel.linear.get_positive_labels(preds: ndarray, label_mapping: ndarray) → tuple[list[list[str]], list[list[float]]][source]

Get all labels and scores with positive decision value.

Parameters

preds (np.ndarray) – A matrix of decision values with dimension number of instances * number of classes.
label_mapping (np.ndarray) – An ndarray of class labels that maps each index (from 0 to num_class-1) to its label.

Returns

Two 2D lists: the first contains the predicted labels and the second the corresponding scores.
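
Because the number of positive decision values differs per instance, the result is a list of lists, not a rectangular array. A minimal plain-Python sketch of that behaviour follows; positive_labels is a hypothetical stand-in for illustration, and keeping labels in index order is an assumption.

```python
def positive_labels(preds, label_mapping):
    """Return (labels, scores) for every strictly positive decision value."""
    labels, scores = [], []
    for row in preds:
        keep = [j for j, value in enumerate(row) if value > 0]
        labels.append([label_mapping[j] for j in keep])
        scores.append([row[j] for j in keep])
    return labels, scores


preds = [[0.2, -1.0, 1.5], [-0.3, 0.8, 0.1], [-1.0, -2.0, -0.5]]
label_mapping = ["cat", "dog", "bird"]

labels, scores = positive_labels(preds, label_mapping)
print(labels)  # the last instance has no positive decision value
```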

class libmultilabel.linear.FlatModel(name: str, weights: np.matrix, bias: float, thresholds: float | np.ndarray, multiclass: bool)[source]

A model returned from a training function.

predict_values(x: csr_matrix) → ndarray[source]

Calculate the decision values associated with x.

Parameters: x (sparse.csr_matrix) – A matrix with dimension number of instances * number of features.
Returns: A matrix with dimension number of instances * number of classes.
Return type: np.ndarray

class libmultilabel.linear.TreeModel(root: Node, flat_model: FlatModel, weight_map: ndarray)[source]

A model returned from train_tree.

predict_values(x: csr_matrix, beam_width: int = 10) → ndarray[source]

Calculate the probability estimates associated with x.

Parameters

x (sparse.csr_matrix) – A matrix with dimension number of instances * number of features.
beam_width (int, optional) – Number of candidates considered during beam search. Defaults to 10.

Returns

A matrix with dimension number of instances * number of classes.

Return type

np.ndarray
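
The role of beam_width can be illustrated with a small, generic beam search over a label tree. This is a sketch under stated assumptions, not the TreeModel implementation: the tree layout, node names, and additive per-node scores are all invented for the example.

```python
def beam_search(tree, node_scores, beam_width):
    """tree: {node: [children]}; nodes without an entry are leaves.

    Returns {leaf: accumulated score} for the leaves reached by the beam.
    """
    beam = [("root", 0.0)]
    leaves = {}
    while beam:
        candidates = []
        for node, score in beam:
            for child in tree.get(node, []):
                candidates.append((child, score + node_scores[child]))
        # Keep only the best `beam_width` candidates at this level.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = []
        for node, score in candidates[:beam_width]:
            if node in tree:
                beam.append((node, score))  # internal node: keep expanding
            else:
                leaves[node] = score  # leaf: record its path score
    return leaves


tree = {"root": ["a", "b"], "a": ["l1", "l2"], "b": ["l3", "l4"]}
node_scores = {"a": -0.25, "b": -2.0, "l1": -0.5, "l2": -1.0, "l3": -0.125, "l4": -3.0}

narrow = beam_search(tree, node_scores, beam_width=1)
wide = beam_search(tree, node_scores, beam_width=2)
print(narrow, wide)
```

A wider beam scores more leaves per instance at a higher prediction cost; labels that the beam never reaches receive no score from the search.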

Load Dataset

Load dataset in LibSVM or LibMultiLabel formats.

Parameters

data_format (str) – The data format used: ‘svm’ for LibSVM format, ‘txt’ for LibMultiLabel format in a file, and ‘dataframe’ for LibMultiLabel format in a dataframe.
train_path (str | pd.DataFrame, optional) – Training data file or dataframe in LibMultiLabel format. Ignored if eval is True. Defaults to None.
test_path (str | pd.DataFrame, optional) – Test data file or dataframe in LibMultiLabel format. Ignored if test_data doesn’t exist. Defaults to None.
label_path (str, optional) – Path to a file holding all labels. Defaults to None.

Returns

The training and/or test data, with keys ‘train’ and ‘test’ respectively. The data has keys ‘x’ for input features and ‘y’ for labels.

Return type

dict[str, dict[str, sparse.csr_matrix | str]]
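
To make the shape of the returned dictionary concrete, the sketch below parses a few lines in the multi-label LibSVM style (comma-separated labels followed by index:value pairs) and arranges them under the ‘train’, ‘x’ and ‘y’ keys. It is only an approximation for illustration: parse_svm_line is a hypothetical helper, labels are assumed to be integers, and features are kept as dicts where the library would use a sparse matrix.

```python
def parse_svm_line(line):
    """Parse '<labels> <index>:<value> ...' with comma-separated labels."""
    head, *pairs = line.split()
    if ":" in head:
        # No label field: the first token is already a feature pair.
        pairs.insert(0, head)
        labels = []
    else:
        labels = [int(label) for label in head.split(",")]
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return labels, features


lines = ["3,7 1:0.5 4:1.0", "2 2:0.25", "5:1.0"]
parsed = [parse_svm_line(line) for line in lines]

# Same nesting as described above: split name -> 'x' / 'y'.
dataset = {
    "train": {
        "x": [features for _, features in parsed],
        "y": [labels for labels, _ in parsed],
    }
}
print(dataset["train"]["y"])
```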

Preprocessor

class libmultilabel.linear.Preprocessor(include_test_labels: bool = False, remove_no_label_data: bool = False, tfidf_params: dict[str, str] = {})[source]

Preprocessor is used to preprocess input data in LibSVM or LibMultiLabel formats. The same Preprocessor has to be used for both training and test datasets; see save_pipeline and load_pipeline for more details.

__init__(include_test_labels: bool = False, remove_no_label_data: bool = False, tfidf_params: dict[str, str] = {})[source]

Initializes the preprocessor.

Parameters

include_test_labels (bool, optional) – Whether to include labels in the test dataset. Defaults to False.
remove_no_label_data (bool, optional) – Whether to remove training instances that have no labels. Defaults to False.
tfidf_params (dict[str, str], optional) – A set of parameters for sklearn.feature_extraction.text.TfidfVectorizer. If empty, default parameters will be used.

fit(dataset: dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]) → Preprocessor[source]

Fit the preprocessor according to the training and test datasets, and pre-defined labels if given.

Parameters: dataset (dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]) – The training and test datasets along with possibly pre-defined labels, with keys ‘train’, ‘test’, and ‘labels’ respectively. The dataset must have keys ‘x’ for input features, and ‘y’ for actual labels. It also contains ‘data_format’ to indicate the data format used.
Returns: An instance of the fitted preprocessor.
Return type: Preprocessor

fit_transform(dataset)[source]

Fit the preprocessor according to the training and test datasets, and pre-defined labels if given. Then convert x and y in the training and test datasets according to the fitted preprocessor.

Parameters: dataset (dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]) – The training and test datasets along with labels, with keys ‘train’, ‘test’, and ‘labels’ respectively. The dataset has keys ‘x’ for input features and ‘y’ for labels. It also contains ‘data_format’ to indicate the data format used.
Returns: The transformed dataset.
Return type: dict[str, dict[str, sparse.csr_matrix]]

transform(dataset: dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]])[source]

Convert x and y in the training and test datasets according to the fitted preprocessor.

Parameters: dataset (dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]) – The training and test datasets along with labels, with keys ‘train’, ‘test’, and ‘labels’ respectively. The dataset has keys ‘x’ for input features and ‘y’ for labels. It also contains ‘data_format’ to indicate the data format used.
Returns: The transformed dataset.
Return type: dict[str, dict[str, sparse.csr_matrix]]
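
The reason one fitted preprocessor must serve both splits is that the label-to-column mapping is fixed at fit time. The plain-Python sketch below illustrates this for labels only. It is a conceptual stand-in, not the Preprocessor implementation: fit_labels and binarize are hypothetical helpers, and dropping labels unseen at fit time is an assumption of this example.

```python
def fit_labels(train_y, test_y, include_test_labels=False):
    """Collect the label set that defines the columns of the 0/1 matrix."""
    labels = {label for row in train_y for label in row}
    if include_test_labels:
        labels |= {label for row in test_y for label in row}
    return sorted(labels)


def binarize(y, label_mapping):
    """Turn lists of labels into 0/1 rows using a fixed label mapping."""
    index = {label: j for j, label in enumerate(label_mapping)}
    rows = []
    for labels in y:
        row = [0] * len(label_mapping)
        for label in labels:
            if label in index:  # assumption: labels unseen at fit time are dropped
                row[index[label]] = 1
        rows.append(row)
    return rows


train_y = [["a", "c"], ["b"]]
test_y = [["c", "d"]]

mapping = fit_labels(train_y, test_y)  # columns: a, b, c
mapping_with_test = fit_labels(train_y, test_y, include_test_labels=True)

print(binarize(train_y, mapping))
print(binarize(test_y, mapping))
print(binarize(test_y, mapping_with_test))
```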

Load and Save Pipeline

libmultilabel.linear.save_pipeline(checkpoint_dir: str, preprocessor: Preprocessor, model)[source]

Save preprocessor and model to checkpoint_dir/linear_pipeline.pickle.

Parameters

checkpoint_dir (str) – The directory to save to.
preprocessor (Preprocessor) – A Preprocessor.
model – A model returned from one of the training functions.

libmultilabel.linear.load_pipeline(checkpoint_path: str) → tuple[libmultilabel.linear.preprocessor.Preprocessor, Any][source]

Load preprocessor and model from checkpoint_path.

Parameters: checkpoint_path (str) – The path to a previously saved pipeline.
Returns: A tuple of the preprocessor and model.
Return type: tuple[Preprocessor, Any]
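
Note the asymmetry in the two signatures: saving takes a directory, loading takes the path of the saved file. The stdlib sketch below mirrors that pattern with pickle. It is only an analogy: save_pair, load_pair, the file name, and the stored layout are assumptions for this example and say nothing about the library's actual file contents.

```python
import os
import pickle
import tempfile


def save_pair(checkpoint_dir, preprocessor, model, filename="pipeline.pickle"):
    """Pickle both objects into one file inside checkpoint_dir; return its path."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, filename)
    with open(path, "wb") as f:
        pickle.dump({"preprocessor": preprocessor, "model": model}, f)
    return path


def load_pair(checkpoint_path):
    """Load the (preprocessor, model) pair back from a saved file."""
    with open(checkpoint_path, "rb") as f:
        bundle = pickle.load(f)
    return bundle["preprocessor"], bundle["model"]


# Plain dicts stand in for a fitted preprocessor and a trained model.
with tempfile.TemporaryDirectory() as tmp:
    path = save_pair(tmp, {"labels": ["a", "b"]}, {"weights": [[0.5, -1.0]]})
    preprocessor, model = load_pair(path)

print(preprocessor, model)
```

Keeping the two objects in one file makes it harder to pair a model with a preprocessor it was not trained with.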

Metrics

Metrics are specified by their names in compute_metrics and get_metrics. The possible metric names are:

'P@K', where K is a positive integer
'R@K', where K is a positive integer
'RP@K', where K is a positive integer
'NDCG@K', where K is a positive integer
'Macro-F1'
'Micro-F1'

Their definitions are given in the implementation document.
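
For orientation, the sketch below computes P@K and R@K using the commonly used definitions: the fraction of the K highest-scoring classes that are relevant, and the fraction of relevant classes found among those K, each averaged over instances. These definitions are an assumption here; the implementation document remains the authority, and the function names are invented for the example.

```python
def topk_indices(scores, k):
    """Indices of the k highest decision values, highest first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def precision_at_k(preds, target, k):
    total = 0.0
    for scores, labels in zip(preds, target):
        hits = sum(labels[j] for j in topk_indices(scores, k))
        total += hits / k
    return total / len(preds)


def recall_at_k(preds, target, k):
    total = 0.0
    for scores, labels in zip(preds, target):
        hits = sum(labels[j] for j in topk_indices(scores, k))
        # Guard against instances with no relevant label.
        total += hits / max(sum(labels), 1)
    return total / len(preds)


preds = [[0.9, 0.1, -0.5], [-0.2, 0.4, 0.3]]  # decision values
target = [[1, 0, 1], [0, 0, 1]]  # 0/1 labels

p_at_2 = precision_at_k(preds, target, 2)
r_at_2 = recall_at_k(preds, target, 2)
print(p_at_2, r_at_2)
```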

libmultilabel.linear.compute_metrics(preds: ndarray, target: ndarray, monitor_metrics: list[str], multiclass: bool = False) → dict[str, float][source]

Compute metrics with decision values and labels. See get_metrics and MetricCollection if decision values and labels are too large to hold in memory.

Parameters

preds (np.ndarray) – A matrix of decision values with dimensions number of instances * number of classes.
target (np.ndarray) – A 0/1 matrix of labels with dimensions number of instances * number of classes.
monitor_metrics (list[str]) – A list of metric names.
multiclass (bool, optional) – Enable multiclass mode. Defaults to False.

Returns

A dictionary of metric values.

Return type

dict[str, float]

libmultilabel.linear.get_metrics(monitor_metrics: list[str], num_classes: int, multiclass: bool = False) → MetricCollection[source]

Get a collection of metrics by their names. See MetricCollection for more details.

Parameters

monitor_metrics (list[str]) – A list of metric names.
num_classes (int) – The number of classes.
multiclass (bool, optional) – Enable multiclass mode. Defaults to False.

Returns

A metric collection of the list of metrics.

Return type

MetricCollection

class libmultilabel.linear.MetricCollection(metrics)[source]

A collection of metrics created by get_metrics. MetricCollection computes metric values in two steps. First, batches of decision values and labels are added with update(). After all instances have been added, compute() computes the metric values from the accumulated batches.

compute() → dict[str, float][source]

Compute the metrics from the accumulated batches of decision values and labels.

Returns: A dictionary of metric values.
Return type: dict[str, float]

reset()[source]: Clear the accumulated batches of decision values and labels.

update(preds: ndarray, target: ndarray)[source]

Add a batch of decision values and labels.

Parameters

preds (np.ndarray) – A matrix of decision values with dimensions number of instances * number of classes.
target (np.ndarray) – A 0/1 matrix of labels with dimensions number of instances * number of classes.
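
The two-step pattern of update() followed by compute() can be sketched with a tiny accumulator for a single metric. This is a minimal illustration, not the MetricCollection class: RunningPrecisionAt1 is a hypothetical name, and it keeps running counts where the real class may store whatever state each metric needs.

```python
class RunningPrecisionAt1:
    """Accumulates P@1 over batches without keeping the batches in memory."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear everything accumulated so far.
        self.hits = 0
        self.count = 0

    def update(self, preds, target):
        # preds: decision values; target: 0/1 labels; both instances x classes.
        for scores, labels in zip(preds, target):
            best = max(range(len(scores)), key=lambda j: scores[j])
            self.hits += labels[best]
            self.count += 1

    def compute(self):
        return {"P@1": self.hits / self.count}


collection = RunningPrecisionAt1()
collection.update([[0.9, 0.1], [0.2, 0.7]], [[1, 0], [1, 0]])  # first batch
collection.update([[-0.1, 0.3]], [[0, 1]])  # second batch
result = collection.compute()
print(result)
```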

libmultilabel.linear.tabulate_metrics(metric_dict: dict[str, float], split: str) → str[source]

Convert a dictionary of metric values into a pretty formatted string for printing.

Parameters

metric_dict (dict[str, float]) – A dictionary of metric values.
split (str) – Name of the data split.

Returns

Pretty formatted string.

Return type

str

Grid Search with Sklearn Estimators

class libmultilabel.linear.MultiLabelEstimator(options: str = '', linear_technique: str = '1vsrest', scoring_metric: str = 'P@1', multiclass: bool = False)[source]

Customized sklearn estimator for the multi-label classifier.

Parameters

options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
linear_technique (str, optional) – Multi-label technique defined in utils.LINEAR_TECHNIQUES. Defaults to ‘1vsrest’.
scoring_metric (str, optional) – The scoring metric. Defaults to ‘P@1’.
multiclass (bool, optional) – Enable multiclass mode. Defaults to False.

__init__(options: str = '', linear_technique: str = '1vsrest', scoring_metric: str = 'P@1', multiclass: bool = False)[source]

fit(X: csr_matrix, y: csr_matrix)[source]

predict(X: csr_matrix) → ndarray[source]

score(X: csr_matrix, y: csr_matrix) → float[source]

class libmultilabel.linear.GridSearchCV(estimator, param_grid: dict, n_jobs=None, **kwargs)[source]

A customized sklearn.model_selection.GridSearchCV class for LibLinear. The usage is similar to sklearn’s, except that the parameter scoring is unavailable. Instead, specify scoring_metric in MultiLabelEstimator in the Pipeline.

Parameters

estimator (estimator object) – An estimator for grid search.
param_grid (dict) – Search space for a grid search containing a dictionary of parameters and their corresponding list of candidate values.
n_jobs (int, optional) – Number of CPU cores to run in parallel. Defaults to None.
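
A param_grid maps each parameter name to its list of candidate values, and a grid search evaluates every combination. The stdlib sketch below expands such a grid to show what gets tried. The step name clf in the keys is hypothetical (it depends on how the Pipeline is built), the option strings are ordinary liblinear flags chosen for illustration, and expand_grid is not a library function.

```python
from itertools import product

# Keys follow sklearn's '<step>__<parameter>' convention for pipelines.
param_grid = {
    "clf__options": ["-s 1 -c 0.5", "-s 1 -c 1", "-s 1 -c 2"],
    "clf__scoring_metric": ["P@1"],
}


def expand_grid(param_grid):
    """List every combination of candidate values as a dict of settings."""
    keys = sorted(param_grid)
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


candidates = expand_grid(param_grid)
for candidate in candidates:
    print(candidate)
```

The number of fits grows with the product of the list lengths times the number of cross-validation folds, so it pays to keep each list short.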

__init__(estimator, param_grid: dict, n_jobs=None, **kwargs)[source]

set_fit_request(*, groups: Union[bool, None, str] = '$UNCHANGED$') → GridSearchCV

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters: groups (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for groups parameter in fit.
Returns: self – The updated object.
Return type: object