Tweaking Feature Generation for Linear Methods

In both the API and the CLI, LibMultiLabel handles feature generation for linear methods by default. Unless your task requires it, you do not need to customize feature generation as described in this tutorial.

This tutorial demonstrates how to customize feature generation for linear methods through an API example, using the rcv1 dataset.

from libmultilabel import linear

# Load the raw training and test sets from text files.
datasets = linear.load_dataset("txt", "data/rcv1/train.txt", "data/rcv1/test.txt")

# Options that control how the TF-IDF features are generated.
tfidf_params = {
    "max_features": 20000,
    "min_df": 3,
    "ngram_range": (1, 3),
}

# Fit the preprocessor on the data, then convert the text to numerical features.
preprocessor = linear.Preprocessor(tfidf_params=tfidf_params)
preprocessor.fit(datasets)
datasets = preprocessor.transform(datasets)

The tfidf_params argument of Preprocessor specifies how the TF-IDF features are generated. In this example, we adjust max_features, min_df, and ngram_range. For an explanation of these three and the other available options, refer to the scikit-learn documentation for TfidfVectorizer. Users can also try other methods of generating features, such as word embeddings.
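
To build intuition for the three options, the following is a simplified pure-Python sketch of the vocabulary selection they control. It is not LibMultiLabel or scikit-learn code: the helpers ngrams and select_vocabulary and the four-document toy corpus are hypothetical, tokenization is a plain whitespace split, and min_df is treated only as an absolute document count. How ties at the max_features cutoff are broken is also an assumption of this sketch, so the example avoids them.

```python
from collections import Counter


def ngrams(tokens, ngram_range):
    """Return all word n-grams whose length lies in ngram_range (inclusive)."""
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


def select_vocabulary(docs, max_features=None, min_df=1, ngram_range=(1, 1)):
    """Hypothetical helper: choose the terms that would become feature columns."""
    doc_freq = Counter()   # number of documents that contain each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for doc in docs:
        terms = ngrams(doc.lower().split(), ngram_range)
        term_freq.update(terms)
        doc_freq.update(set(terms))

    # min_df: drop terms that appear in fewer than min_df documents.
    kept = [t for t in term_freq if doc_freq[t] >= min_df]

    # max_features: keep only the most frequent terms across the corpus.
    if max_features is not None:
        kept = sorted(kept, key=lambda t: -term_freq[t])[:max_features]
    return sorted(kept)


docs = [
    "oil prices rise",
    "oil prices fall",
    "oil prices rise again",
    "stock markets rise",
]

# Defaults: unigrams only, no frequency cutoff, no cap on vocabulary size.
default_vocab = select_vocabulary(docs)
# Settings analogous to the tutorial, scaled down for the toy corpus.
custom_vocab = select_vocabulary(docs, max_features=4, min_df=2, ngram_range=(1, 3))

print(default_vocab)
print(custom_vocab)
```

With the custom settings, n-grams seen in only one document (for example "stock markets") are dropped by min_df, and the cap then keeps the four most frequent remaining terms, one of which is the bigram "oil prices".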

Finally, we use the generated numerical features to train and evaluate the model. The remaining steps are the same as in the quickstarts. Please refer to them for details.
