This directory contains the tools and data needed to build a version of the LexGLUE data sets.
The data sets are first downloaded from Hugging Face at https://huggingface.co/datasets/lex_glue
and processed according to the LexGLUE repository at https://github.com/coastalcph/lex-glue.

The problem types and the associated data sets of LexGLUE are listed below.
Multi-class classification: SCOTUS, LEDGAR.
Multi-label classification: ECtHR (A), ECtHR (B), EUR-LEX, UNFAIR-ToS.
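
The two problem types differ in the shape of the target. The sketch below illustrates this, assuming integer label ids; the function name `encode_target` and the problem-type strings are hypothetical and are not taken from 'process_data.py'.

```python
def encode_target(labels, num_classes, problem_type):
    """Encode the label ids of one document.

    Multi-class: exactly one label, returned as a single class index.
    Multi-label: any number of labels, returned as a multi-hot list.
    """
    if problem_type == "multiclass":
        if len(labels) != 1:
            raise ValueError("a multi-class example must have exactly one label")
        return labels[0]
    if problem_type == "multilabel":
        vector = [0] * num_classes
        for label in labels:
            vector[label] = 1
        return vector
    raise ValueError(f"unknown problem type: {problem_type}")


# A multi-class example (as in SCOTUS or LEDGAR) has one label id.
single = encode_target([3], num_classes=5, problem_type="multiclass")

# A multi-label example (as in ECtHR, EUR-LEX, or UNFAIR-ToS) may have
# several label ids, or none at all.
several = encode_target([0, 2], num_classes=4, problem_type="multilabel")
```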

Files
=====

- process_data.py:
    This script produces the LexGLUE data sets.
    The generated directory 'lexglue_data' contains six subdirectories
    named ecthr_a, ecthr_b, scotus, eurlex, ledgar, and unfair_tos.
    Each data set directory provides both the raw texts and the tf-idf features.
    The Python packages `datasets` and `scikit-learn` must be installed for the script to work.
    Run it with Python:
    ```
    python process_data.py
    ```

- ocp.py:
    This file defines the global settings used by 'process_data.py'.
    The settings include lists of data sets and splits intended to be downloaded,
    the problem type of each data set, and the naming of each split.
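
As a rough illustration of the kind of settings described for 'ocp.py', the sketch below collects the data set names, problem types, and split names in one place. Every variable name, the renaming of the `validation` split to `valid`, and the output file naming are assumptions made for this example; the actual contents of 'ocp.py' may differ.

```python
from pathlib import Path

# Hypothetical settings; names and values are illustrative only.
PROBLEM_TYPE = {
    "ecthr_a": "multilabel",
    "ecthr_b": "multilabel",
    "scotus": "multiclass",
    "eurlex": "multilabel",
    "ledgar": "multiclass",
    "unfair_tos": "multilabel",
}
DATASETS = list(PROBLEM_TYPE)

# Maps each Hugging Face split name to an assumed output name.
SPLIT_NAMES = {"train": "train", "validation": "valid", "test": "test"}


def output_files(root="lexglue_data"):
    """Return one assumed raw-text path per data set and split."""
    return [
        Path(root) / name / f"{name}_{split}.txt"
        for name in DATASETS
        for split in SPLIT_NAMES.values()
    ]
```

Keeping these values in a separate module lets the processing script loop over data sets and splits without hard-coding them.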
