Linear Classifier API
Train and Predict
The linear methods are based on LibLinear. The simplest usage is:
model = linear.train_1vsrest(train_y, train_x, options)
predict = linear.predict_values(model, test_x)
- libmultilabel.linear.train_1vsrest(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) → FlatModel
Train a linear model for multi-label data using a one-vs-rest strategy.
- Parameters
y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
multiclass (bool, optional) – A flag indicating if the dataset is multiclass.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
verbose (bool, optional) – Output extra progress information. Defaults to True.
- Returns
A model which can be used in predict_values.
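The shapes expected for y and x can be illustrated with a small sketch. The toy data below is hypothetical, and the construction uses only SciPy and NumPy; it shows one plausible way to build the inputs, not how LibMultiLabel itself prepares them.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: 3 instances, 4 features, 3 classes.
label_lists = [[0, 2], [1], [0, 1, 2]]  # class indices per instance
num_classes = 3

# Build the 0/1 label matrix y (number of instances * number of classes).
rows = [i for i, labels in enumerate(label_lists) for _ in labels]
cols = [c for labels in label_lists for c in labels]
y = csr_matrix(
    (np.ones(len(rows)), (rows, cols)),
    shape=(len(label_lists), num_classes),
)

# Build the feature matrix x (number of instances * number of features).
x = csr_matrix(np.array([
    [0.5, 0.0, 1.2, 0.0],
    [0.0, 0.3, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.7],
]))
```

Matrices of this form are what the training functions in this section take as y and x.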
- libmultilabel.linear.train_thresholding(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) → FlatModel
Train a linear model for multi-label data using a one-vs-rest strategy, with cross-validation to pick decision thresholds that optimize the sum of Macro-F1 and Micro-F1. Outperforms train_1vsrest in most aspects, at the cost of higher time complexity due to the internal cross-validation.
This method is the micromacro-freq approach from this CIKM 2023 paper: “On the Thresholding Strategy for Infrequent Labels in Multi-label Classification” (see Section 4.3 and Supplementary D).
- Parameters
y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
multiclass (bool, optional) – A flag indicating if the dataset is multiclass.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
verbose (bool, optional) – Output extra progress information. Defaults to True.
- Returns
A model which can be used in predict_values.
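The idea of choosing a decision threshold from held-out scores can be sketched for a single label. This is a deliberate simplification: it maximizes plain F1 for one label on one validation set, whereas train_thresholding optimizes the sum of Macro-F1 and Micro-F1 with cross-validation. The function names and data are hypothetical.

```python
import numpy as np

def f1_score(pred, target):
    """F1 for boolean prediction and target vectors; 0.0 if undefined."""
    tp = np.sum(pred & target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def pick_threshold(scores, target):
    """Return the candidate threshold with the highest F1 on held-out scores."""
    # Start from the default threshold of 0.
    best_t, best_f1 = 0.0, f1_score(scores >= 0.0, target)
    for t in np.unique(scores):
        f1 = f1_score(scores >= t, target)  # predict positive when score >= t
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Hypothetical held-out decision values and true labels for one class.
scores = np.array([-1.2, -0.4, -0.1, 0.3, 0.9])
target = np.array([False, True, True, True, False])
threshold, f1 = pick_threshold(scores, target)
```

On this toy data, lowering the threshold from 0 to -0.4 recovers two positives that the default threshold misses.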
- libmultilabel.linear.train_cost_sensitive(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) → FlatModel
Train a linear model for multi-label data using a one-vs-rest strategy, with cross-validation to pick an optimal asymmetric misclassification cost for Macro-F1. Outperforms train_1vsrest in most aspects, at the cost of higher time complexity. See the user guide for more details.
- Parameters
y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
multiclass (bool, optional) – A flag indicating if the dataset is multiclass.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
verbose (bool, optional) – Output extra progress information. Defaults to True.
- Returns
A model which can be used in predict_values.
- libmultilabel.linear.train_cost_sensitive_micro(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) → FlatModel
Train a linear model for multi-label data using a one-vs-rest strategy, with cross-validation to pick an optimal asymmetric misclassification cost for Micro-F1. Outperforms train_1vsrest in most aspects, at the cost of higher time complexity. See the user guide for more details.
- Parameters
y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
multiclass (bool, optional) – A flag indicating if the dataset is multiclass.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
verbose (bool, optional) – Output extra progress information. Defaults to True.
- Returns
A model which can be used in predict_values.
- libmultilabel.linear.train_binary_and_multiclass(y: csr_matrix, x: csr_matrix, multiclass: bool = True, options: str = '', verbose: bool = True) → FlatModel
Train a linear model for binary and multi-class data.
- Parameters
y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
multiclass (bool, optional) – A flag indicating if the dataset is multiclass.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
verbose (bool, optional) – Output extra progress information. Defaults to True.
- Returns
A model which can be used in predict_values.
- libmultilabel.linear.train_tree(y: csr_matrix, x: csr_matrix, options: str = '', K=100, dmax=10, verbose: bool = True) → TreeModel
Train a linear model for multi-label data using a divide-and-conquer strategy. The algorithm used is based on https://github.com/xmc-aalto/bonsai.
- Parameters
y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes.
x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features.
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
K (int, optional) – Maximum degree of nodes in the tree. Defaults to 100.
dmax (int, optional) – Maximum depth of the tree. Defaults to 10.
verbose (bool, optional) – Output extra progress information. Defaults to True.
- Returns
A model which can be used in predict_values.
- libmultilabel.linear.predict_values(model, x: csr_matrix) → ndarray
Calculate the decision values associated with x, equivalent to model.predict_values(x).
- Parameters
model – A model returned from a training function.
x (sparse.csr_matrix) – A matrix with dimension number of instances * number of features.
- Returns
A matrix with dimension number of instances * number of classes.
- Return type
np.ndarray
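As a rough illustration of what decision values look like for a linear model, the sketch below multiplies a feature matrix by a weight matrix. It assumes a plain linear scoring rule with no bias term or thresholds; the weights are hypothetical, and this is not the library's implementation.

```python
import numpy as np

# Hypothetical weights: one row per feature, one column per class.
weights = np.array([
    [1.0, -0.5],
    [-1.0, 0.5],
    [0.5, 1.0],
])

# Two instances with three features each.
x = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
])

# Decision values: number of instances * number of classes.
decision_values = x @ weights
```

Each row of decision_values holds one score per class for the corresponding instance.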
- libmultilabel.linear.get_topk_labels(preds: ndarray, label_mapping: ndarray, top_k: int = 5) → tuple[numpy.ndarray, numpy.ndarray]
Get labels and scores of top k predictions from decision values.
- Parameters
preds (np.ndarray) – A matrix of decision values with dimension (number of instances * number of classes).
label_mapping (np.ndarray) – An ndarray of class labels that maps each index (from 0 to num_class-1) to its label.
top_k (int) – Determines how many classes per instance should be predicted.
- Returns
Two 2D ndarrays, the first containing the predicted labels and the second containing the corresponding scores. Both have dimension (num_instances * top_k).
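The selection described above can be sketched with NumPy. The helper topk_labels_sketch is hypothetical and only mirrors the documented inputs and outputs; the library's own implementation may differ.

```python
import numpy as np

def topk_labels_sketch(preds, label_mapping, top_k=5):
    """Return (labels, scores) of the top_k largest decision values per row."""
    # Column indices sorted by descending decision value, truncated to top_k.
    idx = np.argsort(-preds, axis=1)[:, :top_k]
    scores = np.take_along_axis(preds, idx, axis=1)
    return label_mapping[idx], scores

preds = np.array([
    [0.2, -1.0, 1.5],
    [-0.3, 0.8, 0.1],
])
label_mapping = np.array(["cat", "dog", "bird"])
labels, scores = topk_labels_sketch(preds, label_mapping, top_k=2)
# labels -> [["bird", "cat"], ["dog", "bird"]]
# scores -> [[1.5, 0.2], [0.8, 0.1]]
```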
- libmultilabel.linear.get_positive_labels(preds: ndarray, label_mapping: ndarray) → tuple[list[list[str]], list[list[float]]]
Get all labels and scores with positive decision value.
- Parameters
preds (np.ndarray) – A matrix of decision values with dimension number of instances * number of classes.
label_mapping (np.ndarray) – An ndarray of class labels that maps each index (from 0 to num_class-1) to its label.
- Returns
Two 2D lists, the first containing the predicted labels and the second containing the corresponding scores.
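A minimal sketch of the same idea, assuming a label counts as predicted when its decision value is strictly greater than zero. The helper name is hypothetical. Because the number of positive labels varies per instance, the results are nested lists rather than arrays.

```python
import numpy as np

def positive_labels_sketch(preds, label_mapping):
    """Return per-instance lists of labels and scores with positive decision value."""
    labels, scores = [], []
    for row in preds:
        mask = row > 0  # keep only classes with a positive decision value
        labels.append(label_mapping[mask].tolist())
        scores.append(row[mask].tolist())
    return labels, scores

preds = np.array([
    [0.2, -1.0, 1.5],
    [-0.3, -0.8, -0.1],
])
label_mapping = np.array(["cat", "dog", "bird"])
labels, scores = positive_labels_sketch(preds, label_mapping)
# labels -> [["cat", "bird"], []]
# scores -> [[0.2, 1.5], []]
```

Note that an instance with no positive decision values yields empty lists.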
- class libmultilabel.linear.FlatModel(name: str, weights: np.matrix, bias: float, thresholds: float | np.ndarray, multiclass: bool)
A model returned from a training function.
- class libmultilabel.linear.TreeModel(root: Node, flat_model: FlatModel, weight_map: ndarray)
A model returned from train_tree.
- predict_values(x: csr_matrix, beam_width: int = 10) → ndarray
Calculate the probability estimates associated with x.
- Parameters
x (sparse.csr_matrix) – A matrix with dimension number of instances * number of features.
beam_width (int, optional) – Number of candidates considered during beam search. Defaults to 10.
- Returns
A matrix with dimension number of instances * number of classes.
- Return type
np.ndarray
Load Dataset
- libmultilabel.linear.load_dataset(data_format: str, train_path: str | pd.DataFrame | None = None, test_path: str | pd.DataFrame | None = None, label_path: str | None = None) → dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]
Load dataset in LibSVM or LibMultiLabel formats.
- Parameters
data_format (str) – The data format used: ‘svm’ for LibSVM format, ‘txt’ for LibMultiLabel format in a file, and ‘dataframe’ for LibMultiLabel format in a dataframe.
train_path (str | pd.DataFrame, optional) – Training data file or dataframe in LibMultiLabel format. Ignored if eval is True. Defaults to None.
test_path (str | pd.DataFrame, optional) – Test data file or dataframe in LibMultiLabel format. Ignored if test_data doesn’t exist. Defaults to None.
label_path (str, optional) – Path to a file holding all labels. Defaults to None.
- Returns
The training and/or test data, with keys ‘train’ and ‘test’ respectively. The data has keys ‘x’ for input features and ‘y’ for labels.
- Return type
dict[str, dict[str, sparse.csr_matrix | str]]
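For orientation, the multi-label LibSVM text format puts comma-separated labels first, followed by index:value feature pairs. The parser below is a hypothetical sketch that assumes integer labels and 1-based feature indices; it is not the loader used by load_dataset.

```python
def parse_svm_line(line):
    """Parse one line of multi-label LibSVM format: 'l1,l2 idx:val idx:val ...'."""
    parts = line.strip().split()
    if parts and ":" not in parts[0]:
        # The first token holds the comma-separated labels.
        labels = [int(label) for label in parts[0].split(",")]
        feature_tokens = parts[1:]
    else:
        # No label token: the instance has no labels.
        labels = []
        feature_tokens = parts
    features = {}
    for token in feature_tokens:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return labels, features

lines = ["1,3 2:0.5 7:1.0", "2 1:0.25", " 4:2.0"]
parsed = [parse_svm_line(line) for line in lines]
```

The third line shows an instance without labels, which is the kind of instance the Preprocessor option remove_no_label_data refers to.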
Preprocessor
- class libmultilabel.linear.Preprocessor(include_test_labels: bool = False, remove_no_label_data: bool = False, tfidf_params: dict[str, str] = {})
Preprocessor is used to preprocess input data in LibSVM or LibMultiLabel formats. The same Preprocessor has to be used for both training and test datasets; see save_pipeline and load_pipeline for more details.
- __init__(include_test_labels: bool = False, remove_no_label_data: bool = False, tfidf_params: dict[str, str] = {})
Initializes the preprocessor.
- Parameters
include_test_labels (bool, optional) – Whether to include labels in the test dataset. Defaults to False.
remove_no_label_data (bool, optional) – Whether to remove training instances that have no labels. Defaults to False.
tfidf_params (dict[str, str], optional) – A set of parameters for sklearn.TfidfVectorizer. If empty, default parameters will be used.
- fit(dataset: dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]) → Preprocessor
Fit the preprocessor according to the training and test datasets, and pre-defined labels if given.
- Parameters
dataset (dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]) – The training and test datasets along with possibly pre-defined labels, with keys ‘train’, ‘test’, and ‘labels’ respectively. The dataset must have keys ‘x’ for input features and ‘y’ for actual labels. It also contains ‘data_format’ to indicate the data format used.
- Returns
An instance of the fitted preprocessor.
- Return type
Preprocessor
- fit_transform(dataset)
Fit the preprocessor according to the training and test datasets, and pre-defined labels if given. Then convert x and y in the training and test datasets according to the fitted preprocessor.
- Parameters
dataset (dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]) – The training and test datasets along with labels, with keys ‘train’, ‘test’, and ‘labels’ respectively. The dataset has keys ‘x’ for input features and ‘y’ for labels. It also contains ‘data_format’ to indicate the data format used.
- Returns
The transformed dataset.
- Return type
dict[str, dict[str, sparse.csr_matrix]]
- transform(dataset: dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]])
Convert x and y in the training and test datasets according to the fitted preprocessor.
- Parameters
dataset (dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]) – The training and test datasets along with labels, with keys ‘train’, ‘test’, and ‘labels’ respectively. The dataset has keys ‘x’ for input features and ‘y’ for labels. It also contains ‘data_format’ to indicate the data format used.
- Returns
The transformed dataset.
- Return type
dict[str, dict[str, sparse.csr_matrix]]
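Why the same fitted preprocessor must be reused for training and test data can be seen in a small label-binarization sketch. The class LabelBinarizerSketch is hypothetical and covers only the label side, not the feature side. It ignores labels not seen during fit, which is an assumption about behaviour rather than a description of Preprocessor.

```python
import numpy as np

class LabelBinarizerSketch:
    """Map label lists to a 0/1 matrix using a label index fixed at fit time."""

    def fit(self, label_lists):
        self.classes_ = sorted({label for labels in label_lists for label in labels})
        self.index_ = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, label_lists):
        y = np.zeros((len(label_lists), len(self.classes_)), dtype=int)
        for row, labels in enumerate(label_lists):
            for label in labels:
                if label in self.index_:  # labels unseen during fit are ignored here
                    y[row, self.index_[label]] = 1
        return y

train_labels = [["sports"], ["politics", "economy"]]
test_labels = [["economy"], ["sports", "weather"]]

binarizer = LabelBinarizerSketch().fit(train_labels)
y_train = binarizer.transform(train_labels)
y_test = binarizer.transform(test_labels)
```

Because the column order is fixed at fit time, column j means the same class in y_train and y_test. Fitting a second binarizer on the test labels would produce a different column order and make the decision values meaningless.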
Load and Save Pipeline
- libmultilabel.linear.save_pipeline(checkpoint_dir: str, preprocessor: Preprocessor, model)
Save preprocessor and model to checkpoint_dir/linear_pipeline.pickle.
- Parameters
checkpoint_dir (str) – The directory to save to.
preprocessor (Preprocessor) – A Preprocessor.
model – A model returned from one of the training functions.
- libmultilabel.linear.load_pipeline(checkpoint_path: str) → tuple[libmultilabel.linear.preprocessor.Preprocessor, Any]
Load preprocessor and model from checkpoint_path.
- Parameters
checkpoint_path (str) – The path to a previously saved pipeline.
- Returns
A tuple of the preprocessor and model.
- Return type
tuple[Preprocessor, Any]
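The save and load round trip can be sketched with the standard pickle module. The helper names, the file name pipeline_sketch.pickle, and the stand-in objects are all hypothetical; the sketch only illustrates storing the preprocessor and model together so that they are restored as a matching pair.

```python
import os
import pickle
import tempfile

def save_pipeline_sketch(checkpoint_dir, preprocessor, model,
                         filename="pipeline_sketch.pickle"):
    """Pickle the preprocessor and model together into checkpoint_dir."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, filename)
    with open(path, "wb") as f:
        pickle.dump({"preprocessor": preprocessor, "model": model}, f)
    return path

def load_pipeline_sketch(checkpoint_path):
    """Load the (preprocessor, model) pair saved by save_pipeline_sketch."""
    with open(checkpoint_path, "rb") as f:
        pipeline = pickle.load(f)
    return pipeline["preprocessor"], pipeline["model"]

# Stand-in objects; a real pipeline would hold a fitted Preprocessor and a model.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_pipeline_sketch(tmp_dir, {"vocab": ["a", "b"]}, {"weights": [0.1, 0.2]})
    preprocessor, model = load_pipeline_sketch(path)
```

As with any pickle file, a saved pipeline should only be loaded from a trusted source.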
Metrics
Metrics are specified by their names in compute_metrics and get_metrics. The possible metric names are:
- 'P@K', where K is a positive integer
- 'R@K', where K is a positive integer
- 'RP@K', where K is a positive integer
- 'NDCG@K', where K is a positive integer
- 'Macro-F1'
- 'Micro-F1'
Their definitions are given in the implementation document.
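As an illustration of a ranking metric, the sketch below computes precision at K using the commonly used definition: the fraction of the K highest-scored classes that are true labels, averaged over instances. The authoritative definition is the one in the implementation document; the function name and data here are hypothetical.

```python
import numpy as np

def precision_at_k(preds, target, k):
    """Average fraction of the top-k predicted classes that are true labels."""
    topk = np.argsort(-preds, axis=1)[:, :k]          # top-k class indices per row
    hits = np.take_along_axis(target, topk, axis=1)   # 1 where a top-k class is relevant
    return float(hits.sum(axis=1).mean() / k)

preds = np.array([
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.3],
])
target = np.array([
    [1, 0, 0],
    [0, 1, 1],
])
p_at_1 = precision_at_k(preds, target, 1)  # 1.0
p_at_2 = precision_at_k(preds, target, 2)  # 0.75
```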
- libmultilabel.linear.compute_metrics(preds: ndarray, target: ndarray, monitor_metrics: list[str], multiclass: bool = False) → dict[str, float]
Compute metrics with decision values and labels. See get_metrics and MetricCollection if decision values and labels are too large to hold in memory.
- Parameters
preds (np.ndarray) – A matrix of decision values with dimensions number of instances * number of classes.
target (np.ndarray) – A 0/1 matrix of labels with dimensions number of instances * number of classes.
monitor_metrics (list[str]) – A list of metric names.
multiclass (bool, optional) – Enable multiclass mode. Defaults to False.
- Returns
A dictionary of metric values.
- Return type
dict[str, float]
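To make the difference between the two F1 variants concrete, the sketch below computes both from decision values and a 0/1 target matrix. It assumes a class is predicted when its decision value is positive and scores a label with no true or predicted positives as 0; the library's handling of such edge cases may differ. The function name is hypothetical.

```python
import numpy as np

def f1_sketch(preds, target):
    """Return Macro-F1 (mean of per-label F1) and Micro-F1 (F1 over pooled counts)."""
    pred = preds > 0
    true = target == 1
    tp = (pred & true).sum(axis=0).astype(float)
    fp = (pred & ~true).sum(axis=0).astype(float)
    fn = (~pred & true).sum(axis=0).astype(float)
    denom = 2 * tp + fp + fn
    # Per-label F1, defined as 0 where the denominator is 0.
    per_label = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    micro_denom = denom.sum()
    micro = float(2 * tp.sum() / micro_denom) if micro_denom > 0 else 0.0
    return {"Macro-F1": float(per_label.mean()), "Micro-F1": micro}

preds = np.array([
    [1.2, -0.5, 0.3],
    [-0.7, 0.4, 0.6],
    [0.8, -0.1, -0.9],
])
target = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
])
metrics = f1_sketch(preds, target)
```

Macro-F1 weights every label equally, so errors on infrequent labels affect it more than they affect Micro-F1.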
- libmultilabel.linear.get_metrics(monitor_metrics: list[str], num_classes: int, multiclass: bool = False) → MetricCollection
Get a collection of metrics by their names. See MetricCollection for more details.
- Parameters
monitor_metrics (list[str]) – A list of metric names.
num_classes (int) – The number of classes.
multiclass (bool, optional) – Enable multiclass mode. Defaults to False.
- Returns
A metric collection of the list of metrics.
- Return type
MetricCollection
- class libmultilabel.linear.MetricCollection(metrics)
A collection of metrics created by get_metrics. MetricCollection computes metric values in two steps. First, batches of decision values and labels are added with update(). After all instances have been added, compute() computes the metric values from the accumulated batches.
- compute() → dict[str, float]
Compute the metrics from the accumulated batches of decision values and labels.
- Returns
A dictionary of metric values.
- Return type
dict[str, float]
- update(preds: ndarray, target: ndarray)
Add a batch of decision values and labels.
- Parameters
preds (np.ndarray) – A matrix of decision values with dimensions number of instances * number of classes.
target (np.ndarray) – A 0/1 matrix of labels with dimensions number of instances * number of classes.
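The two-step update and compute pattern can be sketched with a single streaming metric. The class StreamingPrecisionAt1 is hypothetical and tracks only P@1; it shows why only running counts, rather than all decision values, need to be kept in memory.

```python
import numpy as np

class StreamingPrecisionAt1:
    """Accumulate P@1 over batches without storing the decision values."""

    def __init__(self):
        self.hits = 0
        self.count = 0

    def update(self, preds, target):
        top1 = preds.argmax(axis=1)  # highest-scored class per instance
        self.hits += int(target[np.arange(len(top1)), top1].sum())
        self.count += len(top1)

    def compute(self):
        return {"P@1": self.hits / self.count if self.count else 0.0}

metric = StreamingPrecisionAt1()
batches = [
    (np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[1, 0], [1, 0]])),
    (np.array([[0.3, 0.7]]), np.array([[0, 1]])),
]
for preds, target in batches:
    metric.update(preds, target)
result = metric.compute()
```

Two of the three instances have a true label ranked first, so the accumulated P@1 is 2/3.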
- libmultilabel.linear.tabulate_metrics(metric_dict: dict[str, float], split: str) → str
Convert a dictionary of metric values into a pretty formatted string for printing.
- Parameters
metric_dict (dict[str, float]) – A dictionary of metric values.
split (str) – Name of the data split.
- Returns
Pretty formatted string.
- Return type
str
Grid Search with Sklearn Estimators
- class libmultilabel.linear.MultiLabelEstimator(options: str = '', linear_technique: str = '1vsrest', scoring_metric: str = 'P@1', multiclass: bool = False)
Customized sklearn estimator for the multi-label classifier.
- Parameters
options (str, optional) – The option string passed to liblinear. Defaults to ‘’.
linear_technique (str, optional) – Multi-label technique defined in utils.LINEAR_TECHNIQUES. Defaults to ‘1vsrest’.
scoring_metric (str, optional) – The scoring metric. Defaults to ‘P@1’.
- class libmultilabel.linear.GridSearchCV(estimator, param_grid: dict, n_jobs=None, **kwargs)
A customized sklearn.model_selection.GridSearchCV class for Liblinear. The usage is similar to sklearn’s, except that the parameter scoring is unavailable. Instead, specify scoring_metric in MultiLabelEstimator in the Pipeline.
- Parameters
estimator (estimator object) – An estimator for grid search.
param_grid (dict) – Search space for a grid search containing a dictionary of parameters and their corresponding list of candidate values.
n_jobs (int, optional) – Number of CPU cores run in parallel. Defaults to None.
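What a param_grid expands to can be shown with a short standard-library sketch. The pipeline step names tfidf and clf are hypothetical, as is the helper expand_grid; the step__parameter naming follows the sklearn Pipeline convention, and -s and -c are liblinear's solver and cost options.

```python
from itertools import product

def expand_grid(param_grid):
    """Enumerate every combination of candidate values in a parameter grid."""
    keys = sorted(param_grid)
    return [
        dict(zip(keys, values))
        for values in product(*(param_grid[key] for key in keys))
    ]

# Hypothetical search space: two option strings and two TF-IDF settings.
param_grid = {
    "clf__options": ["-s 2 -c 0.5", "-s 2 -c 1"],
    "tfidf__min_df": [1, 2],
}
candidates = expand_grid(param_grid)
```

A grid search evaluates each of the four candidates with cross-validation and keeps the one with the best score under the chosen scoring_metric.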
- set_fit_request(*, groups: Union[bool, None, str] = '$UNCHANGED$') → GridSearchCV
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- Parameters
groups (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for groups parameter in fit.
- Returns
self – The updated object.
- Return type
object