.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_linear_gridsearch_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_linear_gridsearch_tutorial.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_linear_gridsearch_tutorial.py:


Hyperparameter Search for Linear Methods
=============================================================

This guide helps users tune the hyperparameters of the feature generation step
and the linear model. Here we show an example of tuning a linear text
classifier on the RCV1 dataset.

Start by loading and preprocessing the data without using ``Preprocessor``:

.. GENERATED FROM PYTHON SOURCE LINES 9-17

.. code-block:: Python

    from sklearn.preprocessing import MultiLabelBinarizer
    from libmultilabel import linear

    datasets = linear.load_dataset("txt", "data/rcv1/train.txt", "data/rcv1/test.txt")
    binarizer = MultiLabelBinarizer(sparse_output=True)
    y = binarizer.fit_transform(datasets["train"]["y"]).astype("d")

.. GENERATED FROM PYTHON SOURCE LINES 18-21

We format the labels into a 0/1 sparse matrix with ``MultiLabelBinarizer``.
Next, we construct a ``Pipeline`` object that will be used for hyperparameter
search later.

.. GENERATED FROM PYTHON SOURCE LINES 21-32

.. code-block:: Python

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline(
        [
            ("tfidf", TfidfVectorizer(max_features=20000, min_df=3)),
            ("clf", linear.MultiLabelEstimator(options="-s 2 -m 4", linear_technique="1vsrest", scoring_metric="P@1")),
        ]
    )

.. GENERATED FROM PYTHON SOURCE LINES 33-45

The vectorizer ``TfidfVectorizer`` is used in the ``Pipeline`` to generate
TF-IDF features from raw texts.
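As a quick, standalone illustration of these two preprocessing pieces, the sketch below runs ``MultiLabelBinarizer`` and ``TfidfVectorizer`` (with default settings, unlike the pipeline above) on a made-up toy corpus — the texts and labels are invented for illustration and are not from the RCV1 data:

```python
# Toy sketch of the preprocessing used in this tutorial (made-up data).
# MultiLabelBinarizer turns per-document label lists into a 0/1 indicator
# matrix; TfidfVectorizer turns raw strings into sparse TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "interest rates rise again",
    "new striker signs for the club",
    "rates and markets react to the club deal",
]
labels = [["economy"], ["sports"], ["economy", "sports"]]

binarizer = MultiLabelBinarizer(sparse_output=True)
y = binarizer.fit_transform(labels).astype("d")
print(binarizer.classes_)  # column order of the label matrix
print(y.toarray())         # one row per document, one column per label

vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(texts)
print(x.shape)             # (number of documents, vocabulary size)
```

Each row of ``y`` marks the labels of one document, e.g. the third document gets a 1 in both the ``economy`` and ``sports`` columns.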
As for the estimator ``MultiLabelEstimator``, the argument ``options`` is a
LIBLINEAR option (see *train Usage* in the LIBLINEAR README), and
``linear_technique`` is one of the linear techniques: ``1vsrest``,
``thresholding``, ``cost_sensitive``, ``cost_sensitive_micro``, and
``binary_and_multiclass``.
We can specify aliases for the components used by the pipeline. For example,
``tfidf`` is the alias of ``TfidfVectorizer`` and ``clf`` is the alias of the
estimator.

To search for the best setting, we employ ``GridSearchCV``. The usage is
similar to sklearn's, except that the parameter ``scoring`` is not available.
Please specify ``scoring_metric`` in ``linear.MultiLabelEstimator`` instead.

.. GENERATED FROM PYTHON SOURCE LINES 45-51

.. code-block:: Python

    liblinear_options = ["-s 2 -c 0.5", "-s 2 -c 1", "-s 2 -c 2", "-s 1 -c 0.5", "-s 1 -c 1", "-s 1 -c 2"]
    parameters = {
        "clf__options": liblinear_options,
        "tfidf__max_features": [10000, 20000, 40000],
        "tfidf__min_df": [3, 5],
    }
    clf = linear.GridSearchCV(pipeline, parameters, cv=5, n_jobs=4, verbose=1)
    clf = clf.fit(datasets["train"]["x"], y)

.. GENERATED FROM PYTHON SOURCE LINES 52-57

Here we check the combinations of six feature generation settings (three values
of ``max_features`` times two values of ``min_df``) and six LIBLINEAR options
for the linear classifier. Each key in ``parameters`` should follow sklearn's
convention, starting with the estimator's alias followed by two underscores
(i.e., ``clf__``). We specify ``n_jobs=4`` to run four tasks in parallel.
After the grid search finishes, we can get the best parameters with the
following code:

.. GENERATED FROM PYTHON SOURCE LINES 57-61

.. code-block:: Python

    for param_name in sorted(parameters.keys()):
        print(f"{param_name}: {clf.best_params_[param_name]}")

.. GENERATED FROM PYTHON SOURCE LINES 62-74
The best parameters are::

    clf__options: -s 2 -c 0.5 -m 1
    tfidf__max_features: 10000
    tfidf__min_df: 5

Note that in the above code, the ``refit`` argument of ``GridSearchCV`` is
enabled by default, meaning that the best configuration will be trained on the
whole dataset after the hyperparameter search. We refer to this as the retrain
strategy. After fitting ``GridSearchCV``, the retrained model is stored in
``clf``.

We can apply the ``predict`` function of the ``GridSearchCV`` object to use the
estimator trained under the best hyperparameters for prediction. Then we use
``linear.compute_metrics`` to calculate the test performance.

.. GENERATED FROM PYTHON SOURCE LINES 74-85

.. code-block:: Python

    # For testing, we also need to read in the data first and format the test
    # labels into a 0/1 sparse matrix.
    y = binarizer.transform(datasets["test"]["y"]).astype("d").toarray()
    preds = clf.predict(datasets["test"]["x"])
    metrics = linear.compute_metrics(
        preds,
        y,
        monitor_metrics=["Macro-F1", "Micro-F1", "P@1", "P@3", "P@5"],
    )
    print(metrics)

.. GENERATED FROM PYTHON SOURCE LINES 86-89

The result of the best parameters will look similar to::

    {'Macro-F1': 0.5296621774388927, 'Micro-F1': 0.8021279986938116, 'P@1': 0.9561621216872636, 'P@3': 0.7983185389507189, 'P@5': 0.5570921518306848}

.. _sphx_glr_download_auto_examples_plot_linear_gridsearch_tutorial.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: plot_linear_gridsearch_tutorial.ipynb <plot_linear_gridsearch_tutorial.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: plot_linear_gridsearch_tutorial.py <plot_linear_gridsearch_tutorial.py>`

.. only:: html

    .. rst-class:: sphx-glr-signature

    Gallery generated by Sphinx-Gallery