Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
prepare_dataset.ipynb	prepare_dataset.ipynb
prepare_subtasks.ipynb	prepare_subtasks.ipynb

Data Preparation

Code for our paper published in NeurIPS 2022 [arXiv]:

Retrieve, Reason, and Refine: Generating Accurate and Faithful Patient Instructions

Fenglin Liu, Bang Yang, Chenyu You, Xian Wu, Shen Ge, Zhangdaihong Liu, Xu Sun*, Yang Yang*, and David A. Clifton.

Updates

[2022.10.25] We release our data at Google Drive. The data looks like

data
├── diagnose-procedure-medication
│   ├── admDxMap_mimic3.pk      # Source: D_ICD_DIAGNOSES.csv
│   ├── admMedMap_mimic3.pk     # Source: prescriptions.csv
│   ├── admPrMap_mimic3.pk      # Source: D_ICD_PROCEDURES.csv
│   └── readme.txt
├── splits                      # Source: NOTEEVENTS.csv
│   ├── train.csv               # obtained by prepare_dataset.ipynb
│   ├── val.csv                 # obtained by prepare_dataset.ipynb
│   ├── test.csv                # obtained by prepare_dataset.ipynb
│   └── subtasks                
│       ├── age                 # Source: NOTEEVENTS.csv
│       │   └── ...             # obtained by prepare_subtasks.ipynb
│       ├── sex                 # Source: NOTEEVENTS.csv
│       │   └── ...             # obtained by prepare_subtasks.ipynb
│       └── diseases            # Source: NOTEEVENTS.csv, D_ICD_DIAGNOSES.csv
│           └── ...             # obtained by prepare_subtasks.ipynb
└── vocab                       
    ├── special_tokens_map.json # obtained by prepare_dataset.ipynb
    ├── tokenizer_config.json   # obtained by prepare_dataset.ipynb
    └── vocab.txt               # obtained by prepare_dataset.ipynb

Directly Use Our Released Data for Patient Instruction Generation

Clone the repo: git clone https://github.com/AI-in-Hospitals/Patient-Instructions.git
For clarity, we define DATA_ROOT by running export DATA_ROOT=$(pwd)/Patient-Instructions/data
Download our released dataset (data.zip, 132MB), unzip it and move evering in the data folder to DATA_ROOT:

unzip data.zip
mv data/* $DATA_ROOT

Reproduce Our Data Preparation Process: Preliminaries

Visit the official website of MIMIC-III, follow guidelines and download the raw dataset (a .zip file, ~6.6G)
Unzip the downloaded archive file, you should see the following structure:

.
└── mimic-iii-clinical-database-1.4
    ├── ADMISSIONS.csv.gz
    ├── CALLOUT.csv.gz
    ├── CAREGIVERS.csv.gz
    ├── CHARTEVENTS.csv.gz
    ├── CPTEVENTS.csv.gz
    ├── DATETIMEEVENTS.csv.gz
    ├── DIAGNOSES_ICD.csv.gz
    ├── DRGCODES.csv.gz
    ├── D_CPT.csv.gz
    ├── D_ICD_DIAGNOSES.csv.gz
    ├── D_ICD_PROCEDURES.csv.gz
    ├── D_ITEMS.csv.gz
    ├── D_LABITEMS.csv.gz
    ├── ICUSTAYS.csv.gz
    ├── INPUTEVENTS_CV.csv.gz
    ├── INPUTEVENTS_MV.csv.gz
    ├── LABEVENTS.csv.gz
    ├── LICENSE.txt
    ├── MICROBIOLOGYEVENTS.csv.gz
    ├── NOTEEVENTS.csv.gz
    ├── OUTPUTEVENTS.csv.gz
    ├── PATIENTS.csv.gz
    ├── PRESCRIPTIONS.csv.gz
    ├── PROCEDUREEVENTS_MV.csv.gz
    ├── PROCEDURES_ICD.csv.gz
    ├── README.md
    ├── SERVICES.csv.gz
    ├── SHA256SUMS.txt
    ├── TRANSFERS.csv.gz
    ├── checksum_md5_unzipped.txt
    └── checksum_md5_zipped.txt

Run the code below to obtain NOTEEVENTS.csv and D_ICD_DIAGNOSES.csv and move them to DATA_ROOT

gzip -d ./mimic-iii-clinical-database-1.4/NOTEEVENTS.csv.gz
mv NOTEEVENTS.csv $DATA_ROOT

gzip -d ./mimic-iii-clinical-database-1.4/D_ICD_DIAGNOSES.csv.gz
mv D_ICD_DIAGNOSES.csv $DATA_ROOT

Prepare Dataset

Run prepare_dataset.ipynb, where we provide step-by-step instructions. After that, you should see the following structure:

$DATA_ROOT
├── subjects_with_PI
│   └── ...
├── patient_instructions
│   └── ...
├── health_records
│   └── ...
├── processed_patient_instructions
│   └── ...
├── processed_health_records
│   └── ...
├── info
│   └── ...
├── splits
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
└── vocab
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

Prepare Subtasks

Besides evaluating on the full test set, we also divide the test set into different groups based on age, sex, and diseases. Run prepare_subtasks.ipynb, and you will see:

$DATA_ROOT
├── splits
│   ├── train.csv
│   ├── val.csv
│   ├── test.csv
|   └── subtasks                
│       ├── age                 
│       │   └── test
│       │       ├── 0_55.txt    # age between [0, 55)
│       │       ├── 55_70.txt   # age between [55, 70)
│       │       └── 70_200.txt  # age between [70, 200)
│       ├── sex                 
│       │   └── test
│       │       ├── f.txt       # female
│       │       └── m.txt       # male
│       └── diseases            
│           └── test            # higher the rank, more frequent the disease
│               ├── D_250.txt   # rank 5: Diabetes mellitus   
│               ├── D_272.txt   # rank 2: Hyperlipidemia          
│               ├── D_276.txt   # rank 6: Acidosis          
│               ├── D_285.txt   # rank 7: Anemia          
│               ├── D_401.txt   # rank 1: Hypertension (most frequent)        
│               ├── D_414.txt   # rank 4: Coronary atherosclerosis of native coronary artery          
│               ├── D_427.txt   # rank 3: Atrial fibrillation          
│               ├── D_428.txt   # rank 8: Congestive heart failure          
│               ├── D_518.txt   # rank 9: Acute respiratory failure          
│               └── D_584.txt   # rank 10: Acute kidney failure          
└── ...

Bugs or Questions?

If you encounter any problems when using the code, or want to report a bug, you can open an issue or email {yangbang@pku.edu.cn,fenglinliu98@pku.edu.cn}. Please try to specify the problem with details so we can help you better and quicker!

Citation

Please consider citing our papers if our code or datasets are useful to your work, thanks sincerely!

@inproceedings{liu2022retrieve,
   title={Retrieve, Reason, and Refine: Generating Accurate and Faithful Patient Instructions},
   author={Liu, Fenglin and Yang, Bang and You, Chenyu and Wu, Xian and Ge, Shen and Liu, Zhangdaihong and Sun, Xu and Yang, Yang and Clifton, David A},
   booktitle={Advances in Neural Information Processing Systems},
   year={2022}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Data Preparation

Updates

Directly Use Our Released Data for Patient Instruction Generation

Reproduce Our Data Preparation Process: Preliminaries

Prepare Dataset

Prepare Subtasks

Bugs or Questions?

Citation

FilesExpand file tree

data

Directory actions

More options

Directory actions

More options

Latest commit

History

data

Folders and files

parent directory

README.md

Data Preparation

Updates

Directly Use Our Released Data for Patient Instruction Generation

Reproduce Our Data Preparation Process: Preliminaries

Prepare Dataset

Prepare Subtasks

Bugs or Questions?

Citation