Handling Data with Many Labels Using Neural Networks
====================================================

Because time and space complexities grow linearly with the number of labels, training models in the original label space is inefficient. We adopt AttentionXML :cite:p:`RY19a` to address this issue by training models in a reduced label space.

Usage
-----

It is recommended to run AttentionXML through a configuration file. For example, to test AttentionXML on Wiki10-31K, run:

.. code-block:: console

    $ python main.py -c example_config/Wiki10-31K/attentionxml.yml

Training Process
----------------

Roughly speaking, training AttentionXML takes two steps. First, a model (model 0) is trained to predict clusters of labels rather than individual labels. Then, a second model (model 1) is trained to predict labels. Both models consist of a BiLSTM layer with label-wise attention; a minimal sketch of this attention mechanism is given at the end of this page.

One significant distinction between AttentionXML and one-vs-all algorithms is that when training model 1, AttentionXML updates only the weights (in the attention layer) tied to a subset of the original label space during backpropagation, thereby speeding up training.

.. image:: images/AttentionXML_training.png
    :width: 70%
    :align: center

Hyperparameters
---------------

AttentionXML introduces two extra hyperparameters that users need to know:

* **cluster_size**: The maximum number of labels in a cluster.
* **beam_width**: Predicted clusters from model 0 are selected by beam search; the beam width is the number of clusters that are kept. See the sketch at the end of this page for how it bounds the set of candidate labels.

Performance
-----------

We compared BiLSTM and AttentionXML because they have similar architectures. The dataset, Wiki10-31K, has 30,938 classes, which makes training in a one-vs-all manner expensive. Both models were trained on an NVIDIA A100 GPU. Their test results are shown below; note the difference in running time.

.. list-table::
    :widths: 80 60 60 60 60
    :header-rows: 1
    :stub-columns: 1

    * - Model
      - P@1
      - P@3
      - P@5
      - Time (min)
    * - BiLSTM
      - 84.48
      - 75.91
      - 66.88
      - 87.1
    * - AttentionXML
      - 87.44
      - 77.70
      - 67.85
      - 29.9
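To make the label-wise attention step concrete, below is a minimal PyTorch sketch. It is a simplified illustration rather than the library's actual implementation: the class name ``LabelwiseAttention`` and all dimensions are assumptions, and the per-label scoring is applied directly to the attended representations, whereas AttentionXML places fully-connected layers in between.

.. code-block:: python

    import torch
    import torch.nn as nn

    class LabelwiseAttention(nn.Module):
        """A sketch of label-wise attention over BiLSTM outputs.

        Each label has its own attention vector, so each label attends
        to the token positions most relevant to it.
        """

        def __init__(self, hidden_dim: int, num_labels: int):
            super().__init__()
            # One attention vector per label: (num_labels, hidden_dim).
            self.attention = nn.Linear(hidden_dim, num_labels, bias=False)
            # Per-label output weights used to score each label.
            self.output = nn.Linear(hidden_dim, num_labels)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # hidden_states: (batch, seq_len, hidden_dim) from the BiLSTM.
            # Attention weights, normalized over the sequence dimension:
            # (batch, seq_len, num_labels).
            weights = torch.softmax(self.attention(hidden_states), dim=1)
            # Label-specific document representations:
            # (batch, num_labels, hidden_dim).
            label_repr = torch.einsum("bsl,bsh->blh", weights, hidden_states)
            # One score per label: (batch, num_labels). In model 1, this is
            # computed and backpropagated only for the candidate labels
            # chosen by beam search, not for the full label space.
            return (label_repr * self.output.weight).sum(dim=-1) + self.output.bias

    # Example: a BiLSTM encoder followed by label-wise attention.
    encoder = nn.LSTM(input_size=300, hidden_size=256,
                      batch_first=True, bidirectional=True)
    attention = LabelwiseAttention(hidden_dim=512, num_labels=30938)

    tokens = torch.randn(2, 100, 300)  # (batch, seq_len, embedding_dim)
    states, _ = encoder(tokens)        # (batch, seq_len, 2 * 256)
    logits = attention(states)         # (batch, 30938)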
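Similarly, the interaction between ``beam_width`` and ``cluster_size`` at prediction time can be sketched as follows: model 0 scores the clusters, the top ``beam_width`` clusters are kept, and only their member labels (at most ``beam_width * cluster_size`` of them) are passed to model 1. The function name and the ``cluster_to_labels`` mapping below are hypothetical, introduced only for illustration.

.. code-block:: python

    import torch

    def select_candidate_labels(cluster_scores: torch.Tensor,
                                cluster_to_labels: list[list[int]],
                                beam_width: int) -> list[int]:
        """Keep the labels that belong to the top-``beam_width`` clusters.

        cluster_scores:    (num_clusters,) scores produced by model 0.
        cluster_to_labels: member label ids of each cluster; each inner
                           list holds at most ``cluster_size`` labels.
        """
        top_clusters = torch.topk(cluster_scores, k=beam_width).indices
        # Model 1 then scores at most beam_width * cluster_size labels
        # instead of the full label space.
        return [label for c in top_clusters.tolist()
                for label in cluster_to_labels[c]]

    # Toy example: four clusters of size <= 2; keep the best two clusters.
    scores = torch.tensor([0.1, 0.9, 0.4, 0.7])
    clusters = [[0, 1], [2, 3], [4], [5, 6]]
    print(select_candidate_labels(scores, clusters, beam_width=2))  # [2, 3, 5, 6]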