.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_linear_tree_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_linear_tree_tutorial.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_linear_tree_tutorial.py:


Handling Data with Many Labels Using Linear Methods
====================================================

When the number of labels is very large, the training time of the standard
``train_1vsrest`` method may be unpleasantly long.
The ``train_tree`` method in LibMultiLabel can vastly reduce the training time on such data sets.

To illustrate this speedup, we will use the EUR-Lex dataset, which contains 3,956 labels.
The data in the following example is downloaded to the directory ``data/eur-lex``.

Users can use the following command to easily apply the ``train_tree`` method:

.. code-block:: bash

    $ python3 main.py --training_file data/eur-lex/train.txt --test_file data/eur-lex/test.txt --linear --linear_technique tree

Besides the CLI, users can also use the API to apply the ``train_tree`` method.
Below is an example.

.. GENERATED FROM PYTHON SOURCE LINES 24-46

.. code-block:: Python

    import math
    import time

    import libmultilabel.linear as linear

    datasets = linear.load_dataset("txt", "data/eur-lex/train.txt", "data/eur-lex/test.txt")
    preprocessor = linear.Preprocessor()
    datasets = preprocessor.fit_transform(datasets)

    training_start = time.time()
    # the standard one-vs-rest method for multi-label problems
    ovr_model = linear.train_1vsrest(datasets["train"]["y"], datasets["train"]["x"])
    training_end = time.time()
    print("Training time of one-versus-rest: {:10.2f}".format(training_end - training_start))

    training_start = time.time()
    # the train_tree method for fast training on data with many labels
    tree_model = linear.train_tree(datasets["train"]["y"], datasets["train"]["x"])
    training_end = time.time()
    print("Training time of tree-based: {:10.2f}".format(training_end - training_start))

.. GENERATED FROM PYTHON SOURCE LINES 47-57

On a machine with an AMD-7950X CPU, the ``train_1vsrest`` function took `578.30` seconds,
while the ``train_tree`` function only took `144.37` seconds.

.. note::

    The ``train_tree`` function in this tutorial is based on the work of :cite:t:`SK20a`.

``train_tree`` achieves this speedup by approximating ``train_1vsrest``.
To check whether the approximation performs well, we'll compute some metrics on the test set.

.. GENERATED FROM PYTHON SOURCE LINES 57-69

.. code-block:: Python

    ovr_preds = linear.predict_values(ovr_model, datasets["test"]["x"])
    tree_preds = linear.predict_values(tree_model, datasets["test"]["x"])

    target = datasets["test"]["y"].toarray()

    ovr_score = linear.compute_metrics(ovr_preds, target, ["P@1", "P@3", "P@5"])
    print("Score of 1vsrest:", ovr_score)

    tree_score = linear.compute_metrics(tree_preds, target, ["P@1", "P@3", "P@5"])
    print("Score of tree:", tree_score)

.. GENERATED FROM PYTHON SOURCE LINES 70-82

:math:`P@K`, a ranking-based criterion, is a metric often used for data with a large number of labels.

.. code-block::

    Score of 1vsrest: {'P@1': 0.833117723156533, 'P@3': 0.6988357050452781, 'P@5': 0.585666235446313}
    Score of tree: {'P@1': 0.8217335058214748, 'P@3': 0.692539887882708, 'P@5': 0.578835705045278}

For this data set, ``train_tree`` gives a slightly lower :math:`P@K`,
but has a significantly faster training time.
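To make the numbers above concrete, the following is a minimal NumPy sketch of how :math:`P@K`
can be computed from the decision values: for each test instance, take the :math:`K` highest-scoring
labels and measure the fraction of them that are relevant, then average over instances.
The helper ``precision_at_k`` below is only for illustration;
``linear.compute_metrics`` shown above is the supported way to evaluate models.

.. code-block:: Python

    import numpy as np

    def precision_at_k(preds, target, k):
        # indices of the k highest decision values for each instance
        top_k = np.argsort(-preds, axis=1)[:, :k]
        # count how many of the top-k labels are relevant for each instance
        hits = np.take_along_axis(target, top_k, axis=1).sum(axis=1)
        # average the per-instance precision over all instances
        return float(np.mean(hits / k))

    print("P@5 of 1vsrest:", precision_at_k(ovr_preds, target, 5))
    print("P@5 of tree:", precision_at_k(tree_preds, target, 5))

Up to tie-breaking among equal decision values, this reproduces the :math:`P@5` entries reported above.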
Typically, the speedup of ``train_tree`` over ``train_1vsrest`` increases with the number of labels.

For even larger data sets, we may not be able to store the entire ``preds`` and ``target`` in memory at once.
In this case, the metrics can be computed in batches.

.. GENERATED FROM PYTHON SOURCE LINES 82-102

.. code-block:: Python

    def metrics_in_batches(model):
        batch_size = 256
        num_instances = datasets["test"]["x"].shape[0]
        num_batches = math.ceil(num_instances / batch_size)

        metrics = linear.get_metrics(["P@1", "P@3", "P@5"], num_classes=datasets["test"]["y"].shape[1])

        for i in range(num_batches):
            preds = linear.predict_values(model, datasets["test"]["x"][i * batch_size : (i + 1) * batch_size])
            target = datasets["test"]["y"][i * batch_size : (i + 1) * batch_size].toarray()
            metrics.update(preds, target)

        return metrics.compute()


    print("Score of 1vsrest:", metrics_in_batches(ovr_model))
    print("Score of tree:", metrics_in_batches(tree_model))


.. _sphx_glr_download_auto_examples_plot_linear_tree_tutorial.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: plot_linear_tree_tutorial.ipynb <plot_linear_tree_tutorial.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: plot_linear_tree_tutorial.py <plot_linear_tree_tutorial.py>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_