Initial commit
dvs23 committed Dec 21, 2023
commit fc66410 (0 parents)
Showing 1,626 changed files with 257,165 additions and 0 deletions.
.gitignore (new file, 243 additions)
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf

# AWS User-specific
.idea/**/aws.xml

# Generated files
.idea/**/contentModel.xml

# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml

# Gradle
.idea/**/gradle.xml
.idea/**/libraries

# Gradle and Maven with auto-import
# When using Gradle or Maven with auto-import, you should exclude module files,
# since they will be recreated, and may cause churn. Uncomment if using
# auto-import.
# .idea/artifacts
# .idea/compiler.xml
# .idea/jarRepositories.xml
# .idea/modules.xml
# .idea/*.iml
# .idea/modules
# *.iml
# *.ipr

# CMake
cmake-build-*/

# Mongo Explorer plugin
.idea/**/mongoSettings.xml

# File-based project format
*.iws

# IntelliJ
out/

# mpeltonen/sbt-idea plugin
.idea_modules/

# JIRA plugin
atlassian-ide-plugin.xml

# Cursive Clojure plugin
.idea/replstate.xml

# SonarLint plugin
.idea/sonarlint/

# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties

# Editor-based Rest Client
.idea/httpRequests

# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser

*.json
*.out
*epochs/
results/
old/
README.md (new file, 93 additions)
# Artifact "Comparing Generative and Extractive Approaches to Information Extraction from Abstracts Describing Randomized Clinical Trials"

## Setup

We suggest using Python 3.10.12 or newer to run this artifact; older versions have not been tested. The following instructions should work on most major Linux distributions.

Start by setting up a virtual environment and installing the required packages. To do so, run the following at the top level of a clone of this repository:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
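
If in doubt about your interpreter version, a quick check (assuming `python` resolves to the same interpreter used above):

```bash
python --version  # expect Python 3.10.12 or newer
```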

All scripts in this artifact automatically activate the virtual environment (assuming it is named and located as shown above) and set `PYTHONPATH` accordingly. If you want to execute a single Python file manually, activate the virtual environment yourself and point `PYTHONPATH` at the `src/` subdirectory:

```bash
source venv/bin/activate
export PYTHONPATH=./src/
python some/python/file.py
```
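
Equivalently, the variables can be set inline for a single invocation, without altering the current shell session; a minimal sketch (the script path is a placeholder):

```bash
# Inline PYTHONPATH plus the venv's interpreter avoids activating the environment.
PYTHONPATH=./src/ ./venv/bin/python some/python/file.py
```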

## Artifact Structure

* `data/` - annotated datasets of type 2 diabetes and glaucoma RCT abstracts used in the paper
* `scripts-extractive/` - scripts to execute training and evaluation of the extractive approach
  * `all_runs.txt` - list of the commands necessary to start the full hyperparameter-search training run (a sketch for executing it follows this list)
  * `extractive-dm2.sh` - given a model name, executes an extractive hyperparameter-optimization training run with 30 trials for the type 2 diabetes dataset
  * `extractive-gl.sh` - given a model name, executes an extractive hyperparameter-optimization training run with 30 trials for the glaucoma dataset
  * `extractive-best.sh` - given the path to a `best_params.pkl` file generated by `src/eval_summary.py`, executes 10 training runs with the best parameters found during hyperparameter optimization
  * `eval-extr.sh` - executes evaluation for the extractive part of a directory of trained models, CHANGE PATH IN FILE TO ACTUAL LOCATION OF RESULTS!
* `scripts-generative/` - scripts to execute training and evaluation of the generative approach
  * same scripts as for the extractive approach, but for the generative one
* `src/` - source code of both approaches used in the paper
  * `extractive_approach/` - source code of the extractive approach (training entry point: `training.py`)
  * `generative_approach/` - source code of the generative approach (training entry point: `training.py`)
  * `template_lib/` - general classes and functions to load and use the dataset
  * `full_eval.py` - runs the evaluation for a whole given training-results directory
  * `eval_summary.py` - generates a summary of the evaluated training results of the hyperparameter search
  * `eval_summary_best.py` - generates a summary of the evaluated training results of the 10 training runs executed separately with the best hyperparameters
  * `main.py` - can be used to play around with the loaded datasets; contains code to list and count slot fillers of "Journal"
* `requirements.txt` - Python requirements of this project
* `sort_results.sh` - expects training to have been executed in the top directory of the project; sorts models etc. into folders grouped by approach, disease, and model, CHANGE PATH IN FILE TO ACTUAL LOCATION OF RESULTS!
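
As referenced above, a hypothetical driver for `all_runs.txt`, assuming each of its lines is a complete shell command (inspect the file first; this sketch is not part of the artifact):

```bash
# Run every command listed in all_runs.txt sequentially, echoing each before execution.
while IFS= read -r cmd; do
  echo ">>> $cmd"
  eval "$cmd"
done < scripts-extractive/all_runs.txt
```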

## Replication Steps

1. Go to the top directory of this project

2. Execute all hyperparameter-optimization training runs, i.e.:
```bash
scripts-extractive/extractive-dm2.sh 'allenai/longformer-base-4096' | tee train-longformer-dm2.txt
scripts-extractive/extractive-dm2.sh 'allenai/led-base-16384' | tee train-led-dm2.txt
scripts-extractive/extractive-dm2.sh 'google/flan-t5-base' | tee train-t5-dm2.txt
scripts-extractive/extractive-gl.sh 'allenai/longformer-base-4096' | tee train-longformer-gl.txt
scripts-extractive/extractive-gl.sh 'allenai/led-base-16384' | tee train-led-gl.txt
scripts-extractive/extractive-gl.sh 'google/flan-t5-base' | tee train-t5-gl.txt
scripts-generative/generative-dm2.sh 'allenai/led-base-16384' | tee train-led-dm2-gen.txt
scripts-generative/generative-dm2.sh 'google/flan-t5-base' | tee train-t5-dm2-gen.txt
scripts-generative/generative-gl.sh 'allenai/led-base-16384' | tee train-led-gl-gen.txt
scripts-generative/generative-gl.sh 'google/flan-t5-base' | tee train-t5-gl-gen.txt
```

3. Sort the results into folders by running `sort_results.sh` (change the paths in the file!)

4. Run the evaluation for the extractive and generative models (change the paths in the files!):
```bash
scripts-extractive/eval-extr.sh
scripts-generative/eval-gen.sh
```

5. Generate the evaluation summary for the hyperparameter optimization (first activate the virtual environment and set `PYTHONPATH` as shown above!). To generate case-study data, append `--casestudy`. Run the command twice: some tables are only saved to pickle files on the first run and are printed on the second.
```bash
python src/eval_summary.py --results /path/to/results/folder/
```
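
For example, a full pass including the case-study data might look like this (the results path is a placeholder):

```bash
python src/eval_summary.py --results /path/to/results/folder/ --casestudy
# The second invocation prints the tables that the first run only saved to pickle files.
python src/eval_summary.py --results /path/to/results/folder/ --casestudy
```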

6. Run training again with the best parameters found:
```bash
scripts-extractive/extractive-best.sh /path/to/best_params.pkl
scripts-generative/generative-best.sh /path/to/best_params.pkl
```

7. Sort the new results into folders with `sort_results.sh`, making sure they end up in a different directory so that you can distinguish the hyperparameter-optimization results from the best-parameter training runs. Do not forget to also copy the `config_*.json` files from the original results into the directories of the new results (e.g. using `cp --parents`; a sketch follows), as they are required for running the evaluation.
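
A sketch of that copy step with hypothetical locations for the two result trees; `cp --parents` recreates the relative directory structure below the target:

```bash
# From inside the hyperparameter-optimization results tree, copy every
# config_*.json into the corresponding subdirectory of the new results tree.
cd /path/to/hyperparameter/results/
find . -name 'config_*.json' -exec cp --parents {} /path/to/best-params/results/ \;
```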

8. Run the evaluation for the new extractive and generative models (change the paths in the files accordingly!):
```bash
scripts-extractive/eval-extr.sh
scripts-generative/eval-gen.sh
```

9. Generate the evaluation summary with means and standard deviations for the best-parameter training runs (first activate the virtual environment and set `PYTHONPATH` as shown above!). As before, run the command twice: some tables are only saved to pickle files on the first run and are printed on the second.
```bash
python src/eval_summary_best.py --results /path/to/results/folder/
```
data/dataset_splits/dm2_test_ids.txt (new file, 20 additions)
27740719
17992639
30019498
19821654
29246950
25125506
22672586
23250357
24939043
21781152
24199686
15677776
17130197
23116881
29110647
20092584
25208756
25036533
24356792
24824197
