Linear Classifier API
Train and Predict
The linear methods are based on LibLinear. The simplest usage is:
import libmultilabel.linear as linear
model = linear.train_1vsrest(train_y, train_x, options=options)
predict = linear.predict_values(model, test_x)
- libmultilabel.linear.train_1vsrest(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) -> FlatModel
- Train a linear model for multi-label data using a one-vs-rest strategy.
- Parameters
- y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes. 
- x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features. 
- multiclass (bool, optional) – A flag indicating if the dataset is multiclass. 
- options (str, optional) – The option string passed to liblinear. Defaults to ‘’. 
- verbose (bool, optional) – Output extra progress information. Defaults to True. 
 
- Returns
- A model which can be used in predict_values. 
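The input shapes described above can be sketched as follows. The toy label lists and feature values are made up for illustration, and `train_and_predict` is a hypothetical helper that only wraps the two documented calls; LibMultiLabel is imported lazily so the data-building part runs without it.

```python
import numpy as np
from scipy import sparse

# Toy data (made up): 4 instances, 3 features, 3 classes.
label_lists = [[0], [1, 2], [2], [0, 1]]
num_classes = 3

# Build the 0/1 label matrix y with shape (instances, classes).
rows = [i for i, labels in enumerate(label_lists) for _ in labels]
cols = [c for labels in label_lists for c in labels]
y = sparse.csr_matrix(
    (np.ones(len(rows)), (rows, cols)),
    shape=(len(label_lists), num_classes),
)

# Feature matrix x with shape (instances, features).
x = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.5],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 1.5],
    [1.0, 1.0, 0.0],
]))


def train_and_predict(train_y, train_x, test_x, options=""):
    """Hypothetical helper: train one-vs-rest, return decision values."""
    # Lazy import: only needed when the helper is actually called.
    import libmultilabel.linear as linear

    # options is passed by keyword because multiclass precedes it
    # in the documented signature.
    model = linear.train_1vsrest(train_y, train_x, options=options)
    return linear.predict_values(model, test_x)
```

Calling `train_and_predict(y, x, x)` should return an array with one row per instance and one column per class, provided LibMultiLabel is installed.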
 
- libmultilabel.linear.train_thresholding(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) -> FlatModel
- Train a linear model for multi-label data using a one-vs-rest strategy and cross-validation to pick decision thresholds optimizing the sum of Macro-F1 and Micro-F1. Outperforms train_1vsrest in most aspects at the cost of higher time complexity due to an internal cross-validation.
- This method is the micromacro-freq approach from the CIKM 2023 paper “On the Thresholding Strategy for Infrequent Labels in Multi-label Classification” (see Section 4.3 and Supplementary D).
- Parameters
- y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes. 
- x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features. 
- multiclass (bool, optional) – A flag indicating if the dataset is multiclass. 
- options (str, optional) – The option string passed to liblinear. Defaults to ‘’. 
- verbose (bool, optional) – Output extra progress information. Defaults to True. 
 
- Returns
- A model which can be used in predict_values. 
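To give a feel for why a tuned decision threshold can help, here is a deliberately simplified sketch: it scans candidate thresholds for a single label and keeps the one with the best F1 on the given scores. This is not the micromacro-freq procedure (which uses cross-validation and a combined Macro-F1 plus Micro-F1 objective); the function name and toy data are hypothetical.

```python
import numpy as np


def best_f1_threshold(scores, truth):
    """Scan candidate thresholds for one label; return (threshold, F1)."""
    best_t, best_f1 = 0.0, -1.0
    # Each distinct score is a candidate; predict positive when score >= t.
    for t in np.unique(scores):
        pred = scores >= t
        tp = int(np.sum(pred & truth))
        fp = int(np.sum(pred & ~truth))
        fn = int(np.sum(~pred & truth))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1


# Toy decision values and ground truth for one label (made up).
scores = np.array([-1.2, -0.4, -0.1, 0.3, 0.9])
truth = np.array([False, True, False, True, True])
threshold, f1 = best_f1_threshold(scores, truth)
```

On this toy data, the selected threshold is below zero, so a positive instance with a slightly negative decision value is recovered that a fixed threshold of 0 would miss.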
 
- libmultilabel.linear.train_cost_sensitive(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) -> FlatModel
- Train a linear model for multi-label data using a one-vs-rest strategy and cross-validation to pick an optimal asymmetric misclassification cost for Macro-F1. Outperforms train_1vsrest in most aspects at the cost of higher time complexity. See the user guide for more details.
- Parameters
- y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes. 
- x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features. 
- multiclass (bool, optional) – A flag indicating if the dataset is multiclass. 
- options (str, optional) – The option string passed to liblinear. Defaults to ‘’. 
- verbose (bool, optional) – Output extra progress information. Defaults to True. 
 
- Returns
- A model which can be used in predict_values. 
 
- libmultilabel.linear.train_cost_sensitive_micro(y: csr_matrix, x: csr_matrix, multiclass: bool = False, options: str = '', verbose: bool = True) -> FlatModel
- Train a linear model for multi-label data using a one-vs-rest strategy and cross-validation to pick an optimal asymmetric misclassification cost for Micro-F1. Outperforms train_1vsrest in most aspects at the cost of higher time complexity. See the user guide for more details.
- Parameters
- y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes. 
- x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features. 
- multiclass (bool, optional) – A flag indicating if the dataset is multiclass. 
- options (str, optional) – The option string passed to liblinear. Defaults to ‘’. 
- verbose (bool, optional) – Output extra progress information. Defaults to True. 
 
- Returns
- A model which can be used in predict_values. 
 
- libmultilabel.linear.train_binary_and_multiclass(y: csr_matrix, x: csr_matrix, multiclass: bool = True, options: str = '', verbose: bool = True) -> FlatModel
- Train a linear model for binary and multi-class data.
- Parameters
- y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes. 
- x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features. 
- multiclass (bool, optional) – A flag indicating if the dataset is multiclass. 
- options (str, optional) – The option string passed to liblinear. Defaults to ‘’. 
- verbose (bool, optional) – Output extra progress information. Defaults to True. 
 
- Returns
- A model which can be used in predict_values. 
 
- libmultilabel.linear.train_tree(y: csr_matrix, x: csr_matrix, options: str = '', K=100, dmax=10, verbose: bool = True) -> TreeModel
- Train a linear model for multi-label data using a divide-and-conquer strategy. The algorithm used is based on https://github.com/xmc-aalto/bonsai.
- Parameters
- y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes. 
- x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features. 
- options (str) – The option string passed to liblinear. 
- K (int, optional) – Maximum degree of nodes in the tree. Defaults to 100. 
- dmax (int, optional) – Maximum depth of the tree. Defaults to 10. 
- verbose (bool, optional) – Output extra progress information. Defaults to True. 
 
- Returns
- A model which can be used in predict_values. 
- Return type
- TreeModel
 
- libmultilabel.linear.train_ensemble_tree(y: csr_matrix, x: csr_matrix, options: str = '', K: int = 100, dmax: int = 10, n_trees: int = 3, verbose: bool = True, seed: int = None) -> EnsembleTreeModel
- Train an ensemble of tree models (Parabel/Bonsai-style).
- Parameters
- y (sparse.csr_matrix) – A 0/1 matrix with dimensions number of instances * number of classes. 
- x (sparse.csr_matrix) – A matrix with dimensions number of instances * number of features. 
- options (str, optional) – The option string passed to liblinear. Defaults to ‘’. 
- K (int, optional) – Maximum degree of nodes in the tree. Defaults to 100. 
- dmax (int, optional) – Maximum depth of the tree. Defaults to 10. 
- n_trees (int, optional) – Number of trees in the ensemble. Defaults to 3. 
- verbose (bool, optional) – Output extra progress information. Defaults to True. 
- seed (int, optional) – The base random seed for the ensemble. Defaults to None, which will use 42. 
 
- Returns
- An ensemble model which can be used for prediction. 
- Return type
- EnsembleTreeModel 
 
- libmultilabel.linear.predict_values(model, x: csr_matrix) -> ndarray
- Calculate the decision values associated with x, equivalent to model.predict_values(x).
- Parameters
- model – A model returned from a training function. 
- x (sparse.csr_matrix) – A matrix with dimension number of instances * number of features. 
 
- Returns
- A matrix with dimension number of instances * number of classes. 
- Return type
- np.ndarray 
 
- libmultilabel.linear.get_topk_labels(preds: ndarray, label_mapping: ndarray, top_k: int = 5) -> tuple[numpy.ndarray, numpy.ndarray]
- Get labels and scores of top k predictions from decision values.
- Parameters
- preds (np.ndarray) – A matrix of decision values with dimension number of instances * number of classes.
- label_mapping (np.ndarray) – An ndarray of class labels that maps each index (from 0 to num_class-1) to its label.
- top_k (int) – Determine how many classes per instance should be predicted. 
 
- Returns
- Two 2d ndarray with first one containing predicted labels and the other containing corresponding scores. Both have dimension (num_instances * top_k). 
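The top-k selection described above can be sketched in plain NumPy. This is an illustration of the documented inputs and outputs, not the library's implementation; `topk_labels_sketch` and the toy labels are hypothetical.

```python
import numpy as np


def topk_labels_sketch(preds, label_mapping, top_k=5):
    """Return (labels, scores) of the top_k highest decision values per row."""
    # Column indices of the top_k largest values per row, best first.
    idx = np.argsort(-preds, axis=1, kind="stable")[:, :top_k]
    scores = np.take_along_axis(preds, idx, axis=1)
    # Map column indices back to their class labels.
    return label_mapping[idx], scores


preds = np.array([[0.2, -1.0, 1.5], [-0.3, 0.8, 0.1]])
label_mapping = np.array(["cat", "dog", "fish"])
labels, scores = topk_labels_sketch(preds, label_mapping, top_k=2)
```

Both returned arrays have shape (num_instances, top_k), matching the description of the return value.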
 
- libmultilabel.linear.get_positive_labels(preds: ndarray, label_mapping: ndarray) -> tuple[list[list[str]], list[list[float]]]
- Get all labels and scores with positive decision value.
- Parameters
- preds (np.ndarray) – A matrix of decision values with dimension number of instances * number of classes.
- label_mapping (np.ndarray) – An ndarray of class labels that maps each index (from 0 to num_class-1) to its label.
 
- Returns
- Two 2d lists with first one containing predicted labels and the other containing corresponding scores. 
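A minimal sketch of the same idea in NumPy, assuming "positive" means strictly greater than zero. It illustrates the shape of the result only and is not the library's code; `positive_labels_sketch` and the toy labels are hypothetical.

```python
import numpy as np


def positive_labels_sketch(preds, label_mapping):
    """Return per-instance lists of labels and scores with decision value > 0."""
    labels, scores = [], []
    for row in preds:
        keep = row > 0  # boolean mask of positively scored classes
        labels.append(label_mapping[keep].tolist())
        scores.append(row[keep].tolist())
    return labels, scores


preds = np.array([[0.2, -1.0, 1.5], [-0.3, -0.8, -0.1]])
label_mapping = np.array(["cat", "dog", "fish"])
labels, scores = positive_labels_sketch(preds, label_mapping)
```

Unlike the top-k variant, the inner lists can have different lengths, and an instance with no positive decision value yields an empty list.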
 
- class libmultilabel.linear.FlatModel(name: str, weights: numpy.matrix | scipy.sparse._csr.csr_matrix, bias: float, thresholds: float | numpy.ndarray, multiclass: bool)
- A model returned from a training function. 
- class libmultilabel.linear.TreeModel(root: Node, flat_model: FlatModel, node_ptr: ndarray)
- A model returned from train_tree.
 - predict_values(x: csr_matrix, beam_width: int = 10) -> ndarray
- Calculate the probability estimates associated with x.
- Parameters
- x (sparse.csr_matrix) – A matrix with dimension number of instances * number of features. 
- beam_width (int, optional) – Number of candidates considered during beam search. Defaults to 10. 
 
- Returns
- A matrix with dimension number of instances * number of classes. 
- Return type
- np.ndarray 
 
 
Load Dataset
- libmultilabel.linear.load_dataset(data_format: str, train_path: str | pandas.core.frame.DataFrame | None = None, test_path: str | pandas.core.frame.DataFrame | None = None, label_path: str | None = None) -> dict[str, dict[str, scipy.sparse._csr.csr_matrix | list[list[int]] | list[str]]]
- Load dataset in LibSVM or LibMultiLabel formats.
- Parameters
- data_format (str) – The data format used: ‘svm’ for LibSVM format, ‘txt’ for LibMultiLabel format in a file, and ‘dataframe’ for LibMultiLabel format in a dataframe.
- train_path (str | pd.DataFrame, optional) – Training data file or dataframe in LibMultiLabel format. Ignored if eval is True. Defaults to None. 
- test_path (str | pd.DataFrame, optional) – Test data file or dataframe in LibMultiLabel format. Ignored if test_data doesn’t exist. Defaults to None. 
- label_path (str, optional) – Path to a file holding all labels. Defaults to None. 
 
- Returns
- The training and/or test data, with keys ‘train’ and ‘test’ respectively. The data has keys ‘x’ for input features and ‘y’ for labels. 
- Return type
- dict[str, dict[str, sparse.csr_matrix | str]] 
 
Preprocessor
- class libmultilabel.linear.Preprocessor(include_test_labels: bool = False, remove_no_label_data: bool = False, tfidf_params: dict[str, str] = {})
- Preprocessor is used to preprocess input data in LibSVM or LibMultiLabel formats. The same Preprocessor has to be used for both training and test datasets; see save_pipeline and load_pipeline for more details.
 - __init__(include_test_labels: bool = False, remove_no_label_data: bool = False, tfidf_params: dict[str, str] = {})
- Initializes the preprocessor.
- Parameters
- include_test_labels (bool, optional) – Whether to include labels in the test dataset. Defaults to False. 
- remove_no_label_data (bool, optional) – Whether to remove training instances that have no labels. Defaults to False. 
- tfidf_params (dict[str, str], optional) – A set of parameters for sklearn.TfidfVectorizer. If empty, default parameters will be used. 
 
 
 - fit(dataset: dict[str, dict[str, scipy.sparse._csr.csr_matrix | list[list[int]] | list[str]]]) -> Preprocessor
- Fit the preprocessor according to the training and test datasets, and pre-defined labels if given.
- Parameters
- dataset (dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]) – The training and test datasets along with possibly pre-defined labels, with keys ‘train’, ‘test’, and ‘labels’ respectively. The dataset must have keys ‘x’ for input features and ‘y’ for actual labels. It also contains ‘data_format’ to indicate the data format used. 
- Returns
- An instance of the fitted preprocessor. 
- Return type
- Preprocessor
 
 - fit_transform(dataset)
- Fit the preprocessor according to the training and test datasets, and pre-defined labels if given. Then convert x and y in the training and test datasets according to the fitted preprocessor.
- Parameters
- dataset (dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]) – The training and test datasets along with labels, with keys ‘train’, ‘test’, and ‘labels’ respectively. The dataset has keys ‘x’ for input features and ‘y’ for labels. It also contains ‘data_format’ to indicate the data format used. 
- Returns
- The transformed dataset. 
- Return type
- dict[str, dict[str, sparse.csr_matrix]] 
 
 - transform(dataset: dict[str, dict[str, scipy.sparse._csr.csr_matrix | list[list[int]] | list[str]]])
- Convert x and y in the training and test datasets according to the fitted preprocessor.
- Parameters
- dataset (dict[str, dict[str, sparse.csr_matrix | list[list[int]] | list[str]]]) – The training and test datasets along with labels, with keys ‘train’, ‘test’, and ‘labels’ respectively. The dataset has keys ‘x’ for input features and ‘y’ for labels. It also contains ‘data_format’ to indicate the data format used. 
- Returns
- The transformed dataset. 
- Return type
- dict[str, dict[str, sparse.csr_matrix]] 
 
 
Load and Save Pipeline
- libmultilabel.linear.save_pipeline(checkpoint_dir: str, preprocessor: Preprocessor, model)
- Save preprocessor and model to checkpoint_dir/linear_pipeline.pickle.
- Parameters
- checkpoint_dir (str) – The directory to save to. 
- preprocessor (Preprocessor) – A Preprocessor. 
- model – A model returned from one of the training functions. 
 
 
- libmultilabel.linear.load_pipeline(checkpoint_path: str) -> tuple[libmultilabel.linear.preprocessor.Preprocessor, Any]
- Load preprocessor and model from checkpoint_path.
- Parameters
- checkpoint_path (str) – The path to a previously saved pipeline. 
- Returns
- A tuple of the preprocessor and model. 
- Return type
- tuple[Preprocessor, Any] 
 
Metrics
Metrics are specified by their names in compute_metrics and get_metrics.
The possible metric names are:
- 'P@K', where K is a positive integer
- 'R@K', where K is a positive integer
- 'RP@K', where K is a positive integer
- 'NDCG@K', where K is a positive integer
- 'Macro-F1'
- 'Micro-F1'
Their definitions are given in the implementation document.
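For orientation, the sketch below computes P@K and R@K under their commonly used definitions: the fraction of the top K predictions that are correct, and the fraction of true labels found in the top K, each averaged over instances. The implementation document remains the authoritative definition; the function name is hypothetical, and treating an instance with no true labels as having recall 0 is an assumption.

```python
import numpy as np


def precision_recall_at_k(preds, target, k):
    """Average P@K and R@K over instances (common definitions, assumed)."""
    # Column indices of the k highest decision values per instance.
    topk = np.argsort(-preds, axis=1, kind="stable")[:, :k]
    # Number of true labels among the top k for each instance.
    hits = np.take_along_axis(target, topk, axis=1).sum(axis=1)
    n_true = target.sum(axis=1)
    p_at_k = float(np.mean(hits / k))
    # Assumption: guard against division by zero for unlabeled instances.
    r_at_k = float(np.mean(hits / np.maximum(n_true, 1)))
    return p_at_k, r_at_k


preds = np.array([[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]])
target = np.array([[1, 0, 0], [0, 1, 1]])
p_at_2, r_at_2 = precision_recall_at_k(preds, target, k=2)
```

In this toy case the first instance has one hit in its top 2 and the second has two, so P@2 averages 0.5 and 1.0, while every true label is retrieved and R@2 is 1.0.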
- libmultilabel.linear.compute_metrics(preds: ndarray, target: ndarray, monitor_metrics: list[str], multiclass: bool = False) -> dict[str, float]
- Compute metrics with decision values and labels. See get_metrics and MetricCollection if decision values and labels are too large to hold in memory.
- Parameters
- preds (np.ndarray) – A matrix of decision values with dimensions number of instances * number of classes. 
- target (np.ndarray) – A 0/1 matrix of labels with dimensions number of instances * number of classes. 
- monitor_metrics (list[str]) – A list of metric names. 
- multiclass (bool, optional) – Enable multiclass mode. Defaults to False. 
 
- Returns
- A dictionary of metric values. 
- Return type
- dict[str, float] 
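As a rough illustration of what Macro-F1 and Micro-F1 measure on such inputs, the sketch below assumes a label is predicted when its decision value is positive and that a label with no true or predicted positives contributes an F1 of 0. Both choices are assumptions for this example, not statements about the library; `f1_scores_sketch` is a hypothetical name.

```python
import numpy as np


def f1_scores_sketch(preds, target):
    """Return (macro_f1, micro_f1) from decision values and 0/1 labels."""
    pred = preds > 0  # assumption: predict a label when its value is positive
    truth = target > 0
    # Per-label counts of true positives, false positives, false negatives.
    tp = (pred & truth).sum(axis=0)
    fp = (pred & ~truth).sum(axis=0)
    fn = (~pred & truth).sum(axis=0)
    denom = 2 * tp + fp + fn
    # Per-label F1 = 2TP / (2TP + FP + FN); 0 where the denominator is 0.
    per_label = np.divide(2 * tp, denom, out=np.zeros(len(tp)), where=denom > 0)
    macro = float(per_label.mean())
    # Micro-F1 pools the counts over all labels before computing F1.
    total = int(denom.sum())
    micro = float(2 * tp.sum() / total) if total > 0 else 0.0
    return macro, micro


preds = np.array([[1.0, -1.0, 0.5], [-0.5, 2.0, 0.3], [0.7, -0.2, -0.1]])
target = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1]])
macro_f1, micro_f1 = f1_scores_sketch(preds, target)
```

Macro-F1 weights every label equally, so rare labels influence it as much as frequent ones, whereas Micro-F1 is dominated by the labels with the most instances.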
 
- libmultilabel.linear.get_metrics(monitor_metrics: list[str], num_classes: int, multiclass: bool = False) -> MetricCollection
- Get a collection of metrics by their names. See MetricCollection for more details.
- Parameters
- monitor_metrics (list[str]) – A list of metric names. 
- num_classes (int) – The number of classes. 
- multiclass (bool, optional) – Enable multiclass mode. Defaults to False. 
 
- Returns
- A metric collection of the list of metrics. 
- Return type
- MetricCollection
 
- class libmultilabel.linear.MetricCollection(metrics)
- A collection of metrics created by get_metrics. MetricCollection computes metric values in two steps. First, batches of decision values and labels are added with update(). After all instances have been added, compute() computes the metric values from the accumulated batches.
 - compute() -> dict[str, float]
- Compute the metrics from the accumulated batches of decision values and labels.
- Returns
- A dictionary of metric values. 
- Return type
- dict[str, float] 
 
 - update(preds: ndarray, target: ndarray)
- Add a batch of decision values and labels.
- Parameters
- preds (np.ndarray) – A matrix of decision values with dimensions number of instances * number of classes. 
- target (np.ndarray) – A 0/1 matrix of labels with dimensions number of instances * number of classes. 
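The two-step update()/compute() pattern can be illustrated with a toy accumulator for P@1. `RunningPrecisionAt1` is a hypothetical class written for this example; it mimics the pattern only and says nothing about how MetricCollection is implemented.

```python
import numpy as np


class RunningPrecisionAt1:
    """Hypothetical accumulator mimicking the update()/compute() pattern."""

    def __init__(self):
        self.hits = 0   # instances whose top-scored label is a true label
        self.count = 0  # instances seen so far

    def update(self, preds, target):
        # Only running totals are kept, so batches can be discarded afterwards.
        top1 = preds.argmax(axis=1)
        self.hits += int(target[np.arange(len(top1)), top1].sum())
        self.count += len(top1)

    def compute(self):
        return {"P@1": self.hits / self.count}


metric = RunningPrecisionAt1()
# First batch: two instances; second batch: one instance.
metric.update(np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[1, 0], [1, 0]]))
metric.update(np.array([[0.3, 0.4]]), np.array([[0, 1]]))
result = metric.compute()
```

Because only the totals are stored, memory use stays constant no matter how many batches are added, which is the reason to prefer this pattern when the full matrices do not fit in memory.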
 
 
 
- libmultilabel.linear.tabulate_metrics(metric_dict: dict[str, float], split: str) -> str
- Convert a dictionary of metric values into a pretty formatted string for printing.
- Parameters
- metric_dict (dict[str, float]) – A dictionary of metric values. 
- split (str) – Name of the data split. 
 
- Returns
- Pretty formatted string. 
- Return type
- str 
 
Grid Search with Sklearn Estimators
- class libmultilabel.linear.MultiLabelEstimator(options: str = '', linear_technique: str = '1vsrest', scoring_metric: str = 'P@1', multiclass: bool = False)
- Customized sklearn estimator for the multi-label classifier.
- Parameters
- options (str, optional) – The option string passed to liblinear. Defaults to ‘’. 
- linear_technique (str, optional) – Multi-label technique defined in utils.LINEAR_TECHNIQUES. Defaults to ‘1vsrest’. 
- scoring_metric (str, optional) – The scoring metric. Defaults to ‘P@1’. 
 
 
- class libmultilabel.linear.GridSearchCV(estimator, param_grid: dict, n_jobs=None, **kwargs)
- A customized sklearn.model_selection.GridSearchCV class for Liblinear. The usage is similar to sklearn’s, except that the parameter scoring is unavailable. Instead, specify scoring_metric in MultiLabelEstimator in the Pipeline.
- Parameters
- estimator (estimator object) – An estimator for grid search. 
- param_grid (dict) – Search space for a grid search containing a dictionary of parameters and their corresponding list of candidate values. 
- n_jobs (int, optional) – Number of CPU cores run in parallel. Defaults to None.
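A hedged sketch of how these two classes might be combined, using only the constructor arguments documented above. The pipeline step names (`tfidf`, `clf`), the candidate values in the grid, and the helper `build_search` are assumptions chosen for illustration; the imports are done lazily so the grid itself can be inspected without scikit-learn or LibMultiLabel installed.

```python
# Search space: keys follow sklearn's "<step name>__<parameter>" convention.
# The candidate values are illustrative, not recommendations.
param_grid = {
    "clf__options": ["-s 2 -c 0.5", "-s 2 -c 1", "-s 2 -c 2"],
    "tfidf__max_features": [10000, 20000],
}


def build_search(n_jobs=None):
    """Hypothetical helper: build a grid search over a TF-IDF + linear pipeline."""
    # Lazy imports: only needed when the helper is actually called.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    import libmultilabel.linear as linear

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        # The scoring metric is set on the estimator, since the usual
        # scoring parameter of the grid search is unavailable.
        ("clf", linear.MultiLabelEstimator(
            options="-s 2",
            linear_technique="1vsrest",
            scoring_metric="P@1",
        )),
    ])
    return linear.GridSearchCV(pipeline, param_grid, n_jobs=n_jobs)
```

With this grid, the search would evaluate six parameter combinations; as with sklearn's grid search, the returned object is then fitted on the training data.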