Initial commit
dvs23 committed Dec 21, 2023
commit fc66410 (0 parents)
Showing 1,626 changed files with 257,165 additions and 0 deletions.
.gitignore (new file, 243 additions)
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf

# AWS User-specific
.idea/**/aws.xml

# Generated files
.idea/**/contentModel.xml

# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml

# Gradle
.idea/**/gradle.xml
.idea/**/libraries

# Gradle and Maven with auto-import
# When using Gradle or Maven with auto-import, you should exclude module files,
# since they will be recreated, and may cause churn. Uncomment if using
# auto-import.
# .idea/artifacts
# .idea/compiler.xml
# .idea/jarRepositories.xml
# .idea/modules.xml
# .idea/*.iml
# .idea/modules
# *.iml
# *.ipr

# CMake
cmake-build-*/

# Mongo Explorer plugin
.idea/**/mongoSettings.xml

# File-based project format
*.iws

# IntelliJ
out/

# mpeltonen/sbt-idea plugin
.idea_modules/

# JIRA plugin
atlassian-ide-plugin.xml

# Cursive Clojure plugin
.idea/replstate.xml

# SonarLint plugin
.idea/sonarlint/

# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties

# Editor-based Rest Client
.idea/httpRequests

# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser

*.json
*.out
*epochs/
results/
old/
README.md (new file, 93 additions)
# Artifact "Comparing Generative and Extractive Approaches to Information Extraction from Abstracts Describing Randomized Clinical Trials"

## Setup

We suggest using Python 3.10.12 or newer to run this artifact; older versions have not been tested. The following instructions should work on most major Linux distributions.

Start by setting up a virtual environment and installing the required packages. To do so, run the following at the top level of a clone of this repository:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
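
If in doubt about your interpreter version, a quick check (assuming `python` resolves to the same interpreter used above):

```bash
python --version  # expect Python 3.10.12 or newer
```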

All scripts in this artifact automatically activate the virtual environment (assuming it is named and located as shown above) and set `PYTHONPATH` accordingly. If you want to execute a single Python file manually, activate the virtual environment yourself and point `PYTHONPATH` at the `src/` subdirectory:

```bash
source venv/bin/activate
export PYTHONPATH=./src/
python some/python/file.py
```
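
Equivalently, the variables can be set inline for a single invocation, without altering the current shell session; a minimal sketch (the script path is a placeholder):

```bash
# Inline PYTHONPATH plus the venv's interpreter avoids activating the environment.
PYTHONPATH=./src/ ./venv/bin/python some/python/file.py
```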

## Artifact Structure

* `data/` - annotated datasets of type 2 diabetes and glaucoma RCT abstracts used in the paper
* `scripts-extractive/` - scripts to execute training and evaluation of the extractive approach
  * `all_runs.txt` - list of the commands necessary to start the full hyperparameter-search training run (a sketch for executing it follows this list)
  * `extractive-dm2.sh` - given a model name, executes an extractive hyperparameter-optimization training run with 30 trials for the type 2 diabetes dataset
  * `extractive-gl.sh` - given a model name, executes an extractive hyperparameter-optimization training run with 30 trials for the glaucoma dataset
  * `extractive-best.sh` - given the path to a `best_params.pkl` file generated by `src/eval_summary.py`, executes 10 training runs with the best parameters found during hyperparameter optimization
  * `eval-extr.sh` - executes evaluation for the extractive part of a directory of trained models, CHANGE PATH IN FILE TO ACTUAL LOCATION OF RESULTS!
* `scripts-generative/` - scripts to execute training and evaluation of the generative approach
  * same scripts as for the extractive approach, but for the generative one
* `src/` - source code of both approaches used in the paper
  * `extractive_approach/` - source code of the extractive approach (training entry point: `training.py`)
  * `generative_approach/` - source code of the generative approach (training entry point: `training.py`)
  * `template_lib/` - general classes and functions to load and use the dataset
  * `full_eval.py` - runs the evaluation for a whole given training-results directory
  * `eval_summary.py` - generates a summary of the evaluated training results of the hyperparameter search
  * `eval_summary_best.py` - generates a summary of the evaluated training results of the 10 training runs executed separately with the best hyperparameters
  * `main.py` - can be used to play around with the loaded datasets; contains code to list and count slot fillers of "Journal"
* `requirements.txt` - Python requirements of this project
* `sort_results.sh` - expects training to have been executed in the top directory of the project; sorts models etc. into folders grouped by approach, disease, and model, CHANGE PATH IN FILE TO ACTUAL LOCATION OF RESULTS!
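
As referenced above, a hypothetical driver for `all_runs.txt`, assuming each of its lines is a complete shell command (inspect the file first; this sketch is not part of the artifact):

```bash
# Run every command listed in all_runs.txt sequentially, echoing each before execution.
while IFS= read -r cmd; do
  echo ">>> $cmd"
  eval "$cmd"
done < scripts-extractive/all_runs.txt
```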

## Replication Steps

1. Go to the top directory of this project

2. Execute all hyperparameter-optimization training runs, i.e.:
```bash
scripts-extractive/extractive-dm2.sh 'allenai/longformer-base-4096' | tee train-longformer-dm2.txt
scripts-extractive/extractive-dm2.sh 'allenai/led-base-16384' | tee train-led-dm2.txt
scripts-extractive/extractive-dm2.sh 'google/flan-t5-base' | tee train-t5-dm2.txt
scripts-extractive/extractive-gl.sh 'allenai/longformer-base-4096' | tee train-longformer-gl.txt
scripts-extractive/extractive-gl.sh 'allenai/led-base-16384' | tee train-led-gl.txt
scripts-extractive/extractive-gl.sh 'google/flan-t5-base' | tee train-t5-gl.txt
scripts-generative/generative-dm2.sh 'allenai/led-base-16384' | tee train-led-dm2-gen.txt
scripts-generative/generative-dm2.sh 'google/flan-t5-base' | tee train-t5-dm2-gen.txt
scripts-generative/generative-gl.sh 'allenai/led-base-16384' | tee train-led-gl-gen.txt
scripts-generative/generative-gl.sh 'google/flan-t5-base' | tee train-t5-gl-gen.txt
```

3. Sort the results into folders by running `sort_results.sh` (change the paths in the file!)

4. Run the evaluation for the extractive and generative models (change the paths in the files!):
```bash
scripts-extractive/eval-extr.sh
scripts-generative/eval-gen.sh
```

5. Generate the evaluation summary for the hyperparameter optimization (first activate the virtual environment and set `PYTHONPATH` as shown above!). To generate case-study data, append `--casestudy`. Run the command twice: some tables are only saved to pickle files on the first run and are printed on the second.
```bash
python src/eval_summary.py --results /path/to/results/folder/
```
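
For example, a full pass including the case-study data might look like this (the results path is a placeholder):

```bash
python src/eval_summary.py --results /path/to/results/folder/ --casestudy
# The second invocation prints the tables that the first run only saved to pickle files.
python src/eval_summary.py --results /path/to/results/folder/ --casestudy
```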

6. Run training again with the best parameters found:
```bash
scripts-extractive/extractive-best.sh /path/to/best_params.pkl
scripts-generative/generative-best.sh /path/to/best_params.pkl
```

7. Sort the new results into folders with `sort_results.sh`, making sure they end up in a different directory so that you can distinguish the hyperparameter-optimization results from the best-parameter training runs. Do not forget to also copy the `config_*.json` files from the original results into the directories of the new results (e.g. using `cp --parents`; a sketch follows), as they are required for running the evaluation.
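
A sketch of that copy step with hypothetical locations for the two result trees; `cp --parents` recreates the relative directory structure below the target:

```bash
# From inside the hyperparameter-optimization results tree, copy every
# config_*.json into the corresponding subdirectory of the new results tree.
cd /path/to/hyperparameter/results/
find . -name 'config_*.json' -exec cp --parents {} /path/to/best-params/results/ \;
```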

8. Run the evaluation for the new extractive and generative models (change the paths in the files accordingly!):
```bash
scripts-extractive/eval-extr.sh
scripts-generative/eval-gen.sh
```

9. Generate the evaluation summary with means and standard deviations for the best-parameter training runs (first activate the virtual environment and set `PYTHONPATH` as shown above!). As before, run the command twice: some tables are only saved to pickle files on the first run and are printed on the second.
```bash
python src/eval_summary_best.py --results /path/to/results/folder/
```
data/dataset_splits/dm2_test_ids.txt (new file, 20 additions)
27740719
17992639
30019498
19821654
29246950
25125506
22672586
23250357
24939043
21781152
24199686
15677776
17130197
23116881
29110647
20092584
25208756
25036533
24356792
24824197
