Environment
===========

- OS: Linux (Ubuntu 16.04)
- Python: 3.6+
- CUDA: 10.2 (if a GPU is used)
- PyTorch: 1.8+
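
As a quick sanity check, the requirements above can be verified from Python. This is a minimal sketch and not part of the original repository: the helper names `parse_version` and `meets_minimum` are hypothetical, and the PyTorch check is skipped if `torch` is not installed.

```python
import re
import sys


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    # "1.8.1+cu102" -> (1, 8, 1); suffixes such as "+cu102" are ignored.
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "1.8" and "1.8.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    print("Python", python_version, "ok:", meets_minimum(python_version, "3.6"))
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("PyTorch", torch.__version__, "ok:", meets_minimum(torch.__version__, "1.8"))
        print("CUDA available:", torch.cuda.is_available())
```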

Data processing
===============

- Follow the instructions on the PhysioNet website (https://mimic.physionet.org/gettingstarted/access/) to obtain access to the MIMIC-III dataset.

- Process the dataset with this Jupyter Notebook (https://github.com/jamesmullenbach/caml-mimic/blob/master/notebooks/dataproc_mimic_III.ipynb) provided by Mullenbach et al. (2018). The notebook produces the following files, which are needed to run our code.
    - train_50.csv, test_50.csv, and dev_50.csv: Training, test, and validation sets.
    - vocab.csv: Vocabulary with tokens appearing in at least three training samples.
    - processed_full.embed: Pre-trained word embeddings for the tokens in vocab.csv.

- Convert the processed data and place the output in the data directory with `scripts/process_data.py`.
> python scripts/process_data.py
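
To make the vocab.csv criterion concrete (tokens appearing in at least three training samples), here is a minimal sketch of a document-frequency filter. It is not the code from the notebook by Mullenbach et al.; the function name `build_vocab`, the whitespace tokenization, and the toy notes are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(documents, min_doc_count=3):
    """Keep tokens that occur in at least `min_doc_count` documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each token once per document (document frequency, not term frequency).
        doc_freq.update(set(text.split()))
    return sorted(tok for tok, count in doc_freq.items() if count >= min_doc_count)


# Toy stand-ins for training samples (hypothetical text, not MIMIC-III data).
notes = [
    "patient admitted with chest pain",
    "chest pain resolved patient discharged",
    "patient reports pain pain pain",
    "no chest findings",
]
vocab = build_vocab(notes)
print(vocab)
```

Repeating a token within one sample (as in the third note) does not raise its count, since the threshold is applied to the number of samples containing the token.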


Code for training and prediction
================================

We use LibMultiLabel (https://github.com/ASUS-AICS/LibMultiLabel) for multi-label text classification. The experiments in the paper were run with a specific commit.

The LibMultiLabel files are included in this directory, though you can also obtain them with:

> git clone https://github.com/ASUS-AICS/LibMultiLabel.git && cd LibMultiLabel
> git checkout 315cdd4ffb036623a278cd213bd2b84c8cbf2e76

If you clone the repository yourself, you must change "epochs: 100" to "epochs: 200" in LibMultiLabel/example_config/MIMIC-50/*.yml.
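
The change to the config files can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `set_epochs` is hypothetical, and it assumes each config contains a line of the form `epochs: 100`. Review the modified files before running experiments.

```python
import re
from pathlib import Path


def set_epochs(config_dir, old=100, new=200):
    """Rewrite `epochs: <old>` to `epochs: <new>` in every .yml file of config_dir."""
    # Match only a full "epochs: <old>" line; indentation and line endings are kept.
    pattern = re.compile(
        rf"^([ \t]*epochs:[ \t]*){old}(?=[ \t\r]*$)", flags=re.MULTILINE
    )
    changed = []
    for path in sorted(Path(config_dir).glob("*.yml")):
        text = path.read_text()
        updated, count = pattern.subn(rf"\g<1>{new}", text)
        if count:
            path.write_text(updated)
            changed.append(path.name)
    return changed


if __name__ == "__main__":
    # Path taken from the instruction above; adjust it to your checkout location.
    print(set_epochs("LibMultiLabel/example_config/MIMIC-50"))
```

The function returns the names of the files it modified, so an empty list means no file contained the expected line.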

Grid search for parameter selection
===================================

To reproduce the results in the paper, run
> sh scripts/run_grid.sh caml
> sh scripts/run_grid.sh cnn
> sh scripts/generate_tables.sh

Results are written to cnn_all.tex and caml_all.tex.
- cnn_all.tex: CNN results for three seeds.
- caml_all.tex: CAML results for three seeds.

These two files are the source of the tables in the supplementary materials.

- cnn_caml_avg.tex: This file is the source of Table 4 in the paper.

Note that after evaluating a grid of parameters, we select the setting with the best validation result and report its corresponding test performance.
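
The selection rule described above can be sketched as follows. This is an illustration only, not the logic of scripts/generate_tables.sh: the record fields (`params`, `val`, `test`), the function name `select_by_validation`, and all numbers are made-up placeholders, and a higher score is assumed to be better.

```python
def select_by_validation(runs):
    """Return the run with the best validation score; test scores play no part in the choice."""
    return max(runs, key=lambda run: run["val"])


# Hypothetical grid results (placeholder numbers, not values from the paper).
runs = [
    {"params": {"learning_rate": 0.001, "dropout": 0.2}, "val": 0.61, "test": 0.60},
    {"params": {"learning_rate": 0.0003, "dropout": 0.2}, "val": 0.64, "test": 0.62},
    {"params": {"learning_rate": 0.001, "dropout": 0.4}, "val": 0.63, "test": 0.65},
]

best = select_by_validation(runs)
print("selected:", best["params"], "test score:", best["test"])
```

In this toy grid the third setting has the highest test score but is not selected, because only the validation score decides which setting is reported.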