Commit

upgrade dependencies and retrain models

plandes committed Jan 25, 2025
1 parent 9903496 commit 820e192
Showing 15 changed files with 183 additions and 16 deletions.
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,21 @@ and this project adheres to [Semantic Versioning](http://semver.org/).
## [Unreleased]


### Removed
- Support for Python 3.10.

### Added
- Conda environment file (`src/python/environment.yml`) to reduce startup
  latency and improve reproducibility.

### Changed
- Upgrade to [zensols.mimic] version 1.8.0 and [zensols.deeplearn] version
1.13.3. The latter includes dependencies for PyTorch 2.1.2 and HuggingFace
transformers 4.48.1.
- Retrained the models (0.1.1) with the new dependencies and uploaded them to
  Zenodo; they are downloaded automatically on first use.


## [1.8.0] - 2024-04-14
### Changed
- Release newly trained models with better performance and smaller file size
40 changes: 40 additions & 0 deletions README.md
@@ -58,6 +58,14 @@ available.

## Installation

Because this library has many dependencies and moving parts, it is best to
create a new environment using [conda]:

```bash
conda env create -f src/python/environment.yml
conda activate mimicsid
```
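
Once the environment is active, a minimal sanity check (not part of the
repository's documented steps) is:

```bash
# the environment file pins Python 3.11, so expect a 3.11.x version here
python --version
```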

The library can be installed with pip from the [pypi] repository:
```bash
pip3 install zensols.mimicsid
```

@@ -192,6 +200,19 @@ changes of each version might present language parsing differences such as
sentence chunking, metrics are most likely statistically insignificant.


#### Version 0.1.1

This version was released to accommodate Zensols framework upgrades. In the
tables below, wF1, mF1 and MF1 are the weighted, micro and macro F1 scores,
and acc is the accuracy.

| Name | Type | Id | wF1 | mF1 | MF1 | acc |
|-------------------------------|---------|----------------------------------------|-------|-------|-------|-------|
| `BiLSTM-CRF_tok (fastText)` | Section | bilstm-crf-tok-fasttext-section-type | 0.921 | 0.929 | 0.787 | 0.929 |
| `BiLSTM-CRF_tok (GloVE 300D)` | Section | bilstm-crf-tok-glove-300d-section-type | 0.939 | 0.944 | 0.841 | 0.944 |
| `BiLSTM-CRF_tok (fastText)` | Header | bilstm-crf-tok-fasttext-header | 0.996 | 0.996 | 0.961 | 0.996 |
| `BiLSTM-CRF_tok (GloVE 300D)` | Header | bilstm-crf-tok-glove-300d-header | 0.996 | 0.996 | 0.962 | 0.996 |



#### Version 0.1.0

Adding biomedical NER improved the `0.1.0` models (see [Model
@@ -210,6 +231,8 @@ macro F1 of 0.8163.

#### Version 0.0.3

This version was released to accommodate Zensols framework upgrades.

| Name | Type | Id | wF1 | mF1 | MF1 | acc |
|-------------------------------|---------|----------------------------------------|-------|-------|-------|-------|
| `BiLSTM-CRF_tok (fastText)` | Section | bilstm-crf-tok-fasttext-section-type | 0.911 | 0.917 | 0.792 | 0.917 |
@@ -220,6 +243,8 @@ macro F1 of 0.8163.

#### Version 0.0.2

This version was released to accommodate Zensols framework upgrades.

| Name | Type | Id | wF1 | mF1 | MF1 | acc |
|-------------------------------|---------|----------------------------------------|-------|-------|-------|-------|
| `BiLSTM-CRF_tok (fastText)` | Section | bilstm-crf-tok-fasttext-section-type | 0.918 | 0.925 | 0.797 | 0.925 |
@@ -310,6 +335,20 @@ For the header model use:

## Training Production Models

TL;DR: if you're feeling lucky:

1. Create a Conda environment with `src/python/environment.yml`.
1. Update the new *model* version in:
   * [resources/default.conf](resources/default.conf) for property
     `msid_model:version`.
   * [dist-resources/app.conf](dist-resources/app.conf) for property
     `deeplearn_model_packer:version`.
1. Run detached from the console since it will take about a day to train all
   four models: `nohup src/bin/all.sh > train.log 2>&1 &` (see the
   consolidated sketch after this list).
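
Concretely, the steps above might look like the following session (a sketch;
edit the two version properties by hand, substituting the new model version):

```bash
# create and activate the training environment
conda env create -f src/python/environment.yml
conda activate mimicsid

# after bumping msid_model:version in resources/default.conf and
# deeplearn_model_packer:version in dist-resources/app.conf, train all
# four models detached from the console (roughly a day)
nohup src/bin/all.sh > train.log 2>&1 &

# follow the training progress
tail -f train.log
```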

However, there are many moving parts and libraries, so there is much that can
go wrong. More in-depth training instructions follow.

To train models used in your own projects, train the model on both the
training and test sets. This still leaves the validation set to determine
when to save the model, i.e. on epochs where the loss decreases:
@@ -428,6 +467,7 @@ Copyright (c) 2022 - 2025 Paul Landes

[MedCat]: https://github.com/CogStack/MedCAT
[spaCy]: https://spacy.io
[conda]: https://docs.anaconda.com/miniconda/

[mednlp package]: https://github.com/plandes/mednlp
[mimic package]: https://github.com/plandes/mimic
3 changes: 1 addition & 2 deletions dist-resources/app.conf
@@ -1,6 +1,6 @@
# version of the pretrained model (must match with resources/default.conf)
[deeplearn_model_packer]
version = 0.1.0
version = 0.1.1

[cli]
apps = list: ${cli_config_default:apps}, deeplearn_fac_batch_app,
@@ -52,7 +52,6 @@ config_files = list:
    resource(zensols.mimic): resources/obj.conf,
    resource(zensols.mimic): resources/decorator.conf,
    resource(zensols.mimicsid): resources/anon.conf,
    resource(zensols.mimicsid): resources/lang.yml,
    resource(zensols.mimicsid): dist-resources/obj.conf,
    ^{config_path}, ^{override}

2 changes: 1 addition & 1 deletion dist-resources/model.conf
@@ -9,7 +9,7 @@ out_features = ${deepnlp_default:num_labels}
[model_settings]
learning_rate = 0.01
scale_gradient_params = dict: {'max_norm': 0.5, 'norm_type': 2.}
reduce_outcomes = None
reduce_outcomes = none
batch_iteration_class_name = zensols.deeplearn.model.SequenceBatchIterator
scheduler_class_name = torch.optim.lr_scheduler.ReduceLROnPlateau
prediction_mapper_name = feature_prediction_mapper
2 changes: 1 addition & 1 deletion resources/default.conf
@@ -16,4 +16,4 @@ doc_parser = mednlp_combine_biomed_medcat_doc_parser
[msid_model]
# the version of the model to (maybe download) and use; this must match with
# the version in dist-resources/app.conf in the deeplearn_model_packer section
version = 0.1.0
version = 0.1.1
2 changes: 1 addition & 1 deletion resources/obj.conf
@@ -7,7 +7,7 @@
[msid_model_info]
version = '${msid_model:version}'.replace('.', '_')
# TODO: update zenodo link
zenodo_base_url = https://zenodo.org/record/10971167/files
zenodo_base_url = https://zenodo.org/record/14736865/files

# section
[msid_model_section_id_resource]
10 changes: 7 additions & 3 deletions src/bin/all.sh
@@ -1,9 +1,13 @@
#!/bin/bash
#@meta {desc: 'build production models from scratch', date: '2024-04-11'}

# before starting, make sure to increment:
# - resources/default.conf msid_model:version
# - dist-resources/app.conf deeplearn_model_packer:version

./src/bin/preprocess.sh && \
cp config/system.conf config/system-sensitive-data.conf && \
cat /dev/null > config/system.conf && \
./src/bin/package.sh && \
./dist summary -c config/glove300.conf --validation -o stage/model-performance.csv && \
mv config/system-sensitive-data.conf config/system.conf
./src/bin/package.sh > package.log && \
mv config/system-sensitive-data.conf config/system.conf && \
mv package.log stage
11 changes: 8 additions & 3 deletions src/bin/package.sh
Original file line number Diff line number Diff line change
@@ -74,9 +74,14 @@ function create_checksums() {
}

# output the models' performance metrics
function dump_results() {
    $BIN summary -c ${CONF_DIR}/${model}.conf \
function write_results() {
    log "writing performance metrics..."
    # use the first of the (space separated) trained models' configurations
    model="${MODELS%% *}"
    $BIN ressum -c ${CONF_DIR}/${model}.conf \
        --validation -o stage/model-performance.csv
    if [ $? -ne 0 ] ; then
        fail "model training failed"
    fi
}

# do all
Expand All @@ -87,7 +92,7 @@ function main() {
    verify
    package
    create_checksums
    dump_results
    write_results
}

main
6 changes: 6 additions & 0 deletions src/bin/preprocess.sh
@@ -21,6 +21,11 @@ function clean() {
    rm -rf target model data stage
}

function deps() {
    echo "installing additional dependencies needed for training"
    pip install -r src/python/requirements-train.txt
}

function preempt() {
    echo "parsing admissions, notes and docs"
    if [ $FAST -eq 1 ] ; then
@@ -42,6 +47,7 @@ function batch() {
function main() {
    confirm
    clean
    deps
    preempt
    batch
}
70 changes: 70 additions & 0 deletions src/bin/readme-results.py
@@ -0,0 +1,70 @@
#!/usr/bin/env python

from typing import Iterable, Dict
from dataclasses import dataclass, field
import sys
import re
from pathlib import Path
from io import TextIOBase
import pandas as pd
from tabulate import tabulate
from zensols.config import Dictable


@dataclass
class ResultSummarizer(Dictable):
    """Summarize model performance results as README markdown tables."""

    result_path: Path = field()
    """The directory containing the model result CSV files."""

    def _to_readme_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
        # model names have the form '<name> (Header|Section[ Type]): 1'
        type_re: re.Pattern = re.compile(
            r'^(.+) (Header|Section)(?: Type)?: 1$')
        # map result columns to the header names used in the README tables
        cols: Dict[str, str] = {
            'name': 'Name',
            'type': 'Type',
            'resid': 'Id',
            'wF1v': 'wF1',
            'mF1v': 'mF1',
            'MF1v': 'MF1',
            'accv': 'acc'
        }
        for col in 'wF1v mF1v MF1v accv'.split():
            df[col] = df[col].round(3)
        # drop the trailing '-1' run suffix from the result ID
        df['resid'] = df['resid'].apply(
            lambda s: re.sub(r'^(.+)-1$', r'\1', s))
        # split the model name into a type column and a backtick quoted name
        df['type'] = df['name'].apply(
            lambda s: re.sub(type_re, r'\2', s))
        df['name'] = df['name'].apply(
            lambda s: re.sub(type_re, r"`\1`", s))
        df = df.sort_values('type name'.split(), ascending=False)
        df = df[list(cols.keys())]
        df = df.rename(columns=cols)
        return df

    def _to_readme_table(self, df: pd.DataFrame) -> str:
        # render an org table, then replace '+' with '|' for markdown
        tab: str = tabulate(
            df,
            headers=df.columns,
            tablefmt='orgtbl',
            showindex=False)
        return tab.replace('+', '|')

    def write(self, depth: int = 0, writer: TextIOBase = sys.stdout):
        res_paths: Iterable[Path] = filter(
            lambda p: p.suffix == '.csv', self.result_path.iterdir())
        for path in res_paths:
            df: pd.DataFrame = pd.read_csv(path, index_col=0)
            df = self._to_readme_dataframe(df)
            self._write_line(f'{path}:', depth, writer)
            self._write_block(self._to_readme_table(df), depth, writer)
            self._write_divider(depth, writer)


def main():
    summarizer = ResultSummarizer(Path('stage'))
    summarizer.write()


if __name__ == '__main__':
    main()
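
Judging from `main`, the script is meant to be run from the repository root
after training leaves its CSV summaries in `stage/`; something like:

```bash
python src/bin/readme-results.py
```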
27 changes: 27 additions & 0 deletions src/python/environment.yml
@@ -0,0 +1,27 @@
name: mimicsid
channels:
  - defaults
dependencies:
  - python==3.11.11
  - numpy==1.25.2
  - nmslib==2.1.1
  - pip
  - pip:
      ## third party
      - torch==2.1.2
      - transformers~=4.48.1
      ## framework
      - zensols.util==1.15.1
      - zensols.nlp==1.12.1
      - zensols.dbpg==1.4.0
      # deep learning
      - zensols.deeplearn==1.13.3
      - zensols.deepnlp==1.17.1
      # clinical
      - zensols.mednlp~=1.8.0
      - zensols.mimic==1.8.0
      ## models
      - https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl
      - https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.6.0/en_core_web_md-3.6.0-py3-none-any.whl
      - https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_md-0.5.3.tar.gz
      - https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_ner_bionlp13cg_md-0.5.3.tar.gz
2 changes: 1 addition & 1 deletion src/python/requirements-test.txt
@@ -1 +1 @@
zensols.deepnlp~=1.15.0
zensols.deepnlp~=1.17.0
2 changes: 2 additions & 0 deletions src/python/requirements-train.txt
@@ -0,0 +1,2 @@
zensols.dbpg~=1.4.0
zensols.deepnlp~=1.17.0
5 changes: 2 additions & 3 deletions src/python/requirements.txt
@@ -1,3 +1,2 @@
zensols.deeplearn~=1.13.0
zensols.mednlp~=1.8.0
zensols.mimic~=1.8.0
zensols.deeplearn==1.13.3
zensols.mimic==1.8.0
