
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_dataset_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_dataset_tutorial.py:

Using Data not in Default Forms
===================================================

Datasets come in many structures and formats.
To use LibMultiLabel with any of them, the data must first be converted to a form the library accepts.
In this tutorial, we convert a Hugging Face data set as an example.
Before we start, note that the LibMultiLabel format consists of IDs (optional), labels, and raw texts.
`Here `_ are more details.

To begin, install the Hugging Face ``datasets`` library with the following command::

    pip3 install datasets

We load the data set with Hugging Face's ``load_dataset`` API and process it with pandas' ``DataFrame``.
Import both libraries first.

.. GENERATED FROM PYTHON SOURCE LINES 19-23

.. code-block:: Python


    import pandas
    from datasets import load_dataset


.. GENERATED FROM PYTHON SOURCE LINES 24-26

In this example we use the ``emoji`` dataset from ``tweet_eval``.
It can be loaded with the following code.

.. GENERATED FROM PYTHON SOURCE LINES 26-32

.. code-block:: Python


    hf_datasets = dict()
    hf_datasets["train"] = load_dataset("tweet_eval", "emoji", split="train")
    hf_datasets["val"] = load_dataset("tweet_eval", "emoji", split="validation")
    hf_datasets["test"] = load_dataset("tweet_eval", "emoji", split="test")


.. GENERATED FROM PYTHON SOURCE LINES 33-38

Convert to LibMultiLabel format
--------------------------------------------------
We convert each Hugging Face split to a pandas structure with the function ``DataFrame``.
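
As a rough sketch of what this conversion does, the toy example below builds a ``DataFrame`` from two made-up records (they are illustrative only and not taken from ``tweet_eval``).
It assumes each row of a split behaves like a dictionary with ``text`` and ``label`` keys; passing ``columns=["label", "text"]`` keeps only those fields and places labels before texts.

.. code-block:: Python

    import pandas

    # Made-up records standing in for rows of a Hugging Face split (assumption:
    # each row is a dictionary with "text" and "label" keys).
    records = [
        {"text": "first toy tweet", "label": 3},
        {"text": "second toy tweet", "label": 7},
    ]

    # columns=["label", "text"] selects these two fields and fixes their order.
    toy_frame = pandas.DataFrame(records, columns=["label", "text"])

    print(list(toy_frame.columns))

The ``columns`` argument only selects and orders fields; it does not change their values.
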
To be consistent with our `linear model quickstart `_, which does not need a validation set,
we merge the training and validation sets with ``pandas.concat`` and call ``reset_index`` to add the row indices as a new column.

.. GENERATED FROM PYTHON SOURCE LINES 38-45

.. code-block:: Python


    for split in ["train", "val", "test"]:
        hf_datasets[split] = pandas.DataFrame(hf_datasets[split], columns=["label", "text"])
    hf_datasets["train"] = pandas.concat([hf_datasets["train"], hf_datasets["val"]], axis=0, ignore_index=True)
    hf_datasets["train"] = hf_datasets["train"].reset_index()
    hf_datasets["test"] = hf_datasets["test"].reset_index()


.. GENERATED FROM PYTHON SOURCE LINES 46-55

After conversion, the data set looks like this::

    >>> print(hf_datasets['train'].loc[[0]]) #print first row
    ...
    index  label  text
    0      12     Sunday afternoon walking through Venice in the...

Next, we train a linear model on the data set and make predictions.
See our `linear model quickstart `_ for a detailed explanation.
The difference from the quickstart is that the ``data_format`` option should be ``dataframe``, because the data set is now a dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 55-62

.. code-block:: Python


    import libmultilabel.linear as linear

    datasets = linear.load_dataset("dataframe", hf_datasets["train"], hf_datasets["test"])
    preprocessor = linear.Preprocessor()
    datasets = preprocessor.fit_transform(datasets)


.. GENERATED FROM PYTHON SOURCE LINES 63-66

If you want to use a deep learning model instead, use ``load_datasets`` from ``libmultilabel.nn.data_utils`` and pass it the dataframes we created.
Here is the corresponding modification of our `Bert model quickstart <../auto_examples/plot_bert_quickstart.html>`_.

.. GENERATED FROM PYTHON SOURCE LINES 66-70

.. code-block:: Python


    from libmultilabel.nn.data_utils import load_datasets

    datasets = load_datasets(hf_datasets["train"], hf_datasets["test"], tokenize_text=False)


.. _sphx_glr_download_auto_examples_plot_dataset_tutorial.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_dataset_tutorial.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_dataset_tutorial.py `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_