Installation and Dataset Formats

To work with the command line interface, first install LibMultiLabel from source as described below.

The supported dataset formats are the LibMultiLabel format and the LibSVM format.

Then the following modules are available.


Install LibMultiLabel from Source

  • Environment

    • Python: 3.8+

    • CUDA: 11.8, 12.1 (if training neural networks on a GPU)

    • PyTorch: 2.0.1+

Creating a virtual environment is optional but highly recommended. For example, you can follow the Miniconda installation guide and then create a virtual environment as follows.

conda create -n LibMultiLabel python=3.8
conda activate LibMultiLabel

  • Clone the repository:

git clone https://github.com/ntumlgroup/LibMultiLabel.git
cd LibMultiLabel

  • Install the default dependencies with:

pip3 install -r requirements.txt

  • If you are using neural networks, install additional dependencies with:

pip3 install -r requirements_nn.txt

If you have a different version of CUDA, follow the installation instructions for PyTorch LTS on the PyTorch website.
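As a quick sanity check after installation, the sketch below compares the running interpreter and, if it is installed, PyTorch against the versions listed in the environment requirements above. The names `parse_version` and `check_environment` are hypothetical helpers written for this illustration and are not part of LibMultiLabel; the version parsing is deliberately simple and only looks at the first three numeric components.

```python
import importlib.util
import sys
from itertools import takewhile

MIN_PYTHON = (3, 8)     # minimum Python version listed above
MIN_TORCH = (2, 0, 1)   # minimum PyTorch version listed above


def parse_version(text):
    """Turn a version string such as '2.0.1+cu118' into a tuple of ints."""
    core = text.split("+")[0]  # drop a local build suffix such as '+cu118'
    parts = []
    for piece in core.split(".")[:3]:
        # Keep only the leading digits of each component ('0a0' -> 0).
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def check_environment():
    """Return a small report; None means PyTorch is not installed."""
    report = {"python_ok": sys.version_info[:2] >= MIN_PYTHON}
    if importlib.util.find_spec("torch") is None:
        report["torch_ok"] = None
        report["cuda_available"] = None
    else:
        import torch

        report["torch_ok"] = parse_version(torch.__version__) >= MIN_TORCH
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    print(check_environment())
```

A value of `False` for `cuda_available` is not necessarily an error: per the requirements above, CUDA is only needed when training neural networks on a GPU.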


Dataset Formats

The input data used to build the training, validation, and test sets must follow a specific format. For neural networks, the only accepted format is the LibMultiLabel format. For linear methods, both the LibMultiLabel format and the LibSVM format are accepted. More sample data sets in these formats can be downloaded from the LIBSVM Data site.

LibMultiLabel Format

The LibMultiLabel format stores an optional ID, the labels, and the raw text of each sample in a single file, using tabs and line endings as delimiters. A file must satisfy the following requirements:

  • one sample per line

  • ID, labels, and texts are separated by <TAB> (the ID column is optional)

  • labels are split by spaces

  • each field should not contain any <TAB>

An example with the ID column:

2286<TAB>E11 ECAT M11 M12 MCAT<TAB>recov recov recov recov excit ...
2287<TAB>C24 CCAT<TAB>uruguay uruguay compan compan compan ...

An example without the ID column:

E11 ECAT M11 M12 MCAT<TAB>recov recov recov recov excit ...
C24 CCAT<TAB>uruguay uruguay compan compan compan ...
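To make the rules above concrete, here is a minimal sketch of how one line in this format could be split into its fields. The function name `parse_libmultilabel_line` is hypothetical and is not LibMultiLabel's own loader; it also assumes that `<TAB>` in the examples stands for a literal tab character.

```python
def parse_libmultilabel_line(line):
    """Split one line into (sample_id, labels, text).

    sample_id is None when the optional ID column is absent.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) == 3:
        sample_id, label_field, text = fields
    elif len(fields) == 2:
        sample_id = None
        label_field, text = fields
    else:
        raise ValueError(
            f"expected 2 or 3 tab-separated fields, got {len(fields)}"
        )
    # Labels are separated by spaces within their field.
    return sample_id, label_field.split(), text


if __name__ == "__main__":
    with_id = "2287\tC24 CCAT\turuguay uruguay compan compan compan\n"
    without_id = "C24 CCAT\turuguay uruguay compan compan compan\n"
    print(parse_libmultilabel_line(with_id))
    print(parse_libmultilabel_line(without_id))
```

Because the number of tab-separated fields is the only way to tell whether the ID column is present, a stray tab inside the text would be misread as an extra column, which is why no field may contain a tab.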

LibSVM Format

The LibSVM format stores the labels and sparse numerical features of each sample in a single file, using commas, spaces, colons, and line endings as delimiters. A file must meet the following criteria:

  • one sample per line

  • labels and features are separated by a space

  • labels are split by commas

  • features are split by spaces

  • each feature is specified as index:value, with index starting from 1

Some sample lines are as follows:

1,3,5 1:0.1 9:0.2 13:0.3
2,4,6 2:0.4 10:0.5 14:0.4
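The criteria above can be illustrated with a short parsing sketch. The function `parse_libsvm_line` is a hypothetical helper, not the reader LibMultiLabel uses internally. It keeps labels as strings, stores features as a dictionary from index to value, and treats a line whose first token already contains a colon as having no labels; that last behavior is an assumption made for this example and is not stated in the format description.

```python
def parse_libsvm_line(line):
    """Parse one line into (labels, features).

    labels is a list of strings; features maps 1-based index -> float value.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    # Assumption: a first token with ':' means the label list is empty.
    if ":" in tokens[0]:
        labels, feature_tokens = [], tokens
    else:
        labels, feature_tokens = tokens[0].split(","), tokens[1:]

    features = {}
    for token in feature_tokens:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        if index < 1:
            raise ValueError(f"feature index must start from 1, got {index}")
        features[index] = float(value_text)
    return labels, features


if __name__ == "__main__":
    print(parse_libsvm_line("1,3,5 1:0.1 9:0.2 13:0.3"))
```

A dictionary is used here only to keep the example dependency-free; a real pipeline would more likely build a sparse matrix from the same index and value pairs.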