This directory contains the tools and data needed to build a version
of the eurlex dataset identical to the one used by AttentionXML.
This version is based on the original version provided at http://www.ke.tu-darmstadt.de/resources/eurlex.

Files
=====

- download.sh:
    This script is responsible for downloading all source data required.
    Run this before any other script:
    ```
    ./download.sh
    ```

- preprocess_text.py:
    This script produces the stemmed text data. The generated files include
    `eurlex_texts_train.txt` and `eurlex_texts_test.txt`.
    The Python packages `beautifulsoup4`, `pandas`, `tqdm` and `PyLucene` have
    to be installed for this to work properly.
    Notice that PyLucene cannot be installed through pip and has to be built and installed manually
    (see https://lucene.apache.org/pylucene/install.html).
    After running './download.sh', simply run it with Python:
    ```
    python preprocess_text.py
    ```
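
    The normalization this script performs is essentially: lowercase the text,
    split it into tokens, drop stop words, and stem what remains. A minimal,
    dependency-free sketch of that idea (the stop-word set here is a
    hypothetical stand-in for english.stop; stemming is omitted because the
    real script delegates it to PyLucene):
    ```
    import re

    # Hypothetical stand-in for the english.stop word list.
    STOP_WORDS = {"the", "of", "and", "a", "to", "in"}

    def normalize(text, stop_words=STOP_WORDS):
        # Lowercase and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", text.lower())
        # Drop stop words; the real script would also stem each token.
        return [t for t in tokens if t not in stop_words]

    print(normalize("The Council of the European Union adopted a regulation"))
    ```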

- tfidf.py:
    This script generates the tf-idf features from the tokenized text of the
    original dataset (not from the text produced by preprocess_text.py).
    The tf-idf features are placed in `eurlex_tfidf_train.svm` and `eurlex_tfidf_test.svm`.
    The Python packages `scikit-learn`, `scipy`, `numpy` and `tqdm` have to be
    installed for this to work properly.
    After running './download.sh', simply run it with Python:
    ```
    python tfidf.py
    ```
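
    A rough sketch of what such a tf-idf step can look like with scikit-learn,
    using a toy corpus in place of the real tokenized documents and
    placeholder labels (both assumptions; the actual feature extraction
    details live in tfidf.py):
    ```
    import numpy as np
    from sklearn.datasets import dump_svmlight_file
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus standing in for the tokenized EUR-Lex documents.
    train_docs = ["fishery quota regulation", "agricultural subsidy regulation"]
    test_docs = ["fishery subsidy"]

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_docs)  # fit on training text only
    X_test = vectorizer.transform(test_docs)

    # Write the same output format as the real script (placeholder labels).
    dump_svmlight_file(X_train, np.zeros(len(train_docs)), "eurlex_tfidf_train.svm")
    dump_svmlight_file(X_test, np.zeros(len(test_docs)), "eurlex_tfidf_test.svm")
    ```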

- perm.pkl:
    This file contains the permutation that is applied to the samples.
    After the permutation, the first 15449 samples form the training set
    and the remaining samples form the test set.
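    Assuming perm.pkl holds a permutation of the sample indices (for example a
    NumPy array), applying it and splitting could look like the following
    sketch, which uses a toy permutation instead of loading the real file:
    ```
    import numpy as np

    # Toy permutation; the real one would be loaded with
    # pickle.load(open("perm.pkl", "rb")).
    n_samples, n_train = 10, 6  # the real split uses 15449 training samples
    perm = np.random.permutation(n_samples)

    samples = [f"doc_{i}" for i in range(n_samples)]
    shuffled = [samples[i] for i in perm]
    train, test = shuffled[:n_train], shuffled[n_train:]
    ```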

- eurovocs.txt:
    This file maps each EUROVOC label to its index (= line number - 1).
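    Reading such a mapping is straightforward; a small sketch using a toy
    stand-in for the file contents (hypothetical labels, not the real ones):
    ```
    # Toy stand-in for the contents of eurovocs.txt (one label per line).
    lines = ["fisheries policy", "state aid", "public health"]

    # The label on line k (1-based) has index k - 1.
    label_to_index = {label: i for i, label in enumerate(lines)}
    index_to_label = dict(enumerate(lines))
    ```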

- english.stop:
    This file contains the list of stop words that are removed from the
    source text.