These are the tools needed to extract the raw texts and labels from the **Wiki10+** dataset.

Files
=====

- partition.txt: 
    This file specifies whether each sample belongs to the training or test set,
    following the partition used in AttentionXML.
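    For reference, here is a minimal sketch of consuming such a partition file in Python.
    The format of one `train`/`test` marker per line (one per sample, in order) is an
    assumption for illustration, not something confirmed by this repository:
        ```python
        # Hypothetical sketch: collect sample indices by their train/test
        # marker. Assumes one marker per line, aligned with sample order.
        def split_indices(lines):
            train, test = [], []
            for i, tag in enumerate(line.strip() for line in lines):
                (train if tag == "train" else test).append(i)
            return train, test

        sample = ["train", "train", "test", "train"]  # toy stand-in for partition.txt
        train_idx, test_idx = split_indices(sample)
        # train_idx == [0, 1, 3], test_idx == [2]
        ```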

- tags.txt: 
    This file (obtained from the Extreme Classification Repository) contains the subset of 31k labels.

- download.sh: 
    This script is responsible for downloading the original **Wiki10+** dataset. 
    The downloaded data will be in `./data`.
    
    Use this script before `preprocess.py`:
        ```
        ./download.sh
        ```

- preprocess.py: 
    This script is responsible for extracting the texts and labels from the downloaded data
    and requires the module `beautifulsoup4` to work.

    Execute the script after the data has been downloaded:
        ```
        python preprocess.py
        ```
    
    The script should generate the files `wiki10_31k_raw_texts_train.txt` and `wiki10_31k_raw_texts_test.txt` in `./data/`.
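
    A minimal sketch of the kind of extraction `preprocess.py` performs, assuming
    each downloaded sample is an HTML page parsed with `beautifulsoup4`; the toy
    HTML below is illustrative, and the actual parsing logic may differ:
        ```python
        # Hypothetical sketch: strip HTML markup from a page to obtain raw text.
        from bs4 import BeautifulSoup

        html = "<html><body><h1>Title</h1><p>Some article text.</p></body></html>"
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(separator=" ", strip=True)
        # text == "Title Some article text."
        ```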

- tfidf.py:
    This script is responsible for creating the bag-of-words (TF-IDF) features after the raw texts have been extracted.
    It requires the module `scikit-learn` to work.

    Execute the script after running `preprocess.py`:
        ```
        python tfidf.py
        ```
    
    The script should generate the files `wiki10_31k_tfidf_train.svm`, `wiki10_31k_tfidf_test.svm`, and `vocab.txt` in `./data/`.
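
    A minimal sketch of the TF-IDF step, assuming `tfidf.py` uses scikit-learn's
    `TfidfVectorizer`; the toy texts, placeholder labels, and output filenames
    below are illustrative, not the real Wiki10-31K data:
        ```python
        # Hypothetical sketch: fit TF-IDF features on raw texts, then save
        # the features in SVMlight format and the learned vocabulary.
        from sklearn.datasets import dump_svmlight_file
        from sklearn.feature_extraction.text import TfidfVectorizer

        train_texts = ["the quick brown fox", "jumped over the lazy dog"]
        labels = [0, 1]  # placeholder labels; the real dataset is multi-label

        vectorizer = TfidfVectorizer()
        features = vectorizer.fit_transform(train_texts)

        dump_svmlight_file(features, labels, "tfidf_train.svm")
        with open("vocab.txt", "w") as f:
            f.write("\n".join(vectorizer.get_feature_names_out()))
        ```

    At prediction time the test texts would be passed through `transform` (not
    `fit_transform`) so that they share the training vocabulary.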
