We want to create a webdataset made up of the 10M images from iNat21, EOL and BIOSCAN.
Here are the steps:
- Get a list of all images.
- Choose a validation split.
- Write the splits in the webdataset format.
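A webdataset shard is just a tar archive in which members sharing a basename form one sample. As a sketch of the last step, this writes one shard using only the standard library (the real pipeline presumably uses the `webdataset` package; the key format and metadata fields here are made-up assumptions):

```python
import io
import json
import tarfile

def write_wds_shard(shard_path, samples):
    """Write (key, image_bytes, metadata) samples into one webdataset shard.

    Members sharing a basename (here <key>.jpg and <key>.json) form one sample.
    """
    with tarfile.open(shard_path, "w") as tar:
        for key, image_bytes, metadata in samples:
            for suffix, payload in (
                (".jpg", image_bytes),
                (".json", json.dumps(metadata).encode()),
            ):
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# Hypothetical sample: the key naming and metadata fields are illustrative only.
write_wds_shard("shard-000000.tar", [
    ("sample-000000", b"\xff\xd8not-a-real-jpeg", {"species": "Panthera leo"}),
])
```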
Edit the `--tag` from `v2` to `v3`, then submit the mapping job:

```sh
sbatch slurm/make-dataset-mapping.sh
```

You have to pass a different `mapping.sqlite` file if you used a different tag in the previous step.
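If you want to sanity-check the resulting `mapping.sqlite`, the standard `sqlite3` module is enough. The table and column names below are illustrative assumptions, not the script's actual schema; inspect your file with `.schema` first:

```python
import sqlite3

# Illustrative schema only: the real tables in mapping.sqlite are created by
# the mapping script, so check them with `sqlite3 mapping.sqlite .schema`.
con = sqlite3.connect(":memory:")  # replace with the path to your mapping.sqlite
con.execute("CREATE TABLE images (image_id TEXT PRIMARY KEY, source TEXT, path TEXT)")
con.executemany(
    "INSERT INTO images VALUES (?, ?, ?)",
    [
        ("inat21-0000001", "inat21", "/fake/path/a.jpg"),
        ("eol-0000001", "eol", "/fake/path/b.jpg"),
    ],
)
# How many images came from each source dataset?
counts = dict(con.execute("SELECT source, COUNT(*) FROM images GROUP BY source"))
print(counts)
```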
```sh
python scripts/evobio10m/make_splits.py \
  --db /fs/ess/PAS2136/open_clip/data/evobio10m-v2/mapping.sqlite \
  --seed 17 \
  --val-split 5
```
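The `--seed`/`--val-split` flags suggest a deterministic, seeded split. One common way to implement this (an assumption about `make_splits.py`, not its actual code) is to hash each image id together with the seed and send ~5% of hashes to val:

```python
import hashlib

def in_val_split(image_id: str, seed: int = 17, val_percent: int = 5) -> bool:
    """Deterministically send ~val_percent% of ids to the val split.

    Hashing (seed, image_id) is stable across runs and machines, so the
    split never drifts. This is an assumed mechanism, not make_splits.py's
    actual code.
    """
    digest = hashlib.sha256(f"{seed}-{image_id}".encode()).digest()
    # Map the first 4 hash bytes onto [0, 100) and compare to the cutoff.
    return int.from_bytes(digest[:4], "big") % 100 < val_percent

ids = [f"img-{i:07d}" for i in range(10_000)]
val_fraction = sum(in_val_split(i) for i in ids) / len(ids)
print(val_fraction)  # close to 0.05
```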
Edit the `--tag` from `v2` to `v3` if you did so in step 1, then submit the webdataset job:

```sh
sbatch slurm/make-dataset-wds.sh
```

You must point the `--shardlist` argument to both the train and the val splits.
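The `--shardlist` values use brace-range patterns such as `shard-{000003..000031}.tar`. A minimal expander for that single `{start..end}` form (the real scripts may rely on a fuller implementation such as the `braceexpand` package):

```python
import re

def expand_braces(pattern: str) -> list[str]:
    """Expand one webdataset-style brace range, e.g. shard-{000003..000031}.tar.

    Only a single numeric {start..end} range is handled; zero-padding is
    preserved from the start value's width.
    """
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if m is None:
        return [pattern]
    start, end = m.group(1), m.group(2)
    width = len(start)
    return [
        pattern[: m.start()] + str(i).zfill(width) + pattern[m.end():]
        for i in range(int(start), int(end) + 1)
    ]

shards = expand_braces("val/shard-{000003..000005}.tar")
print(shards)
# ['val/shard-000003.tar', 'val/shard-000004.tar', 'val/shard-000005.tar']
```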
```sh
python scripts/evobio10m/check_wds.py \
  --shardlist '/fs/ess/PAS2136/open_clip/data/evobio10m-v2.1/224x224/val/shard-{000003..000031}.tar' \
  --workers 8
```
This will print the paths of any bad shards, for example:

```
/fs/ess/PAS2136/open_clip/data/evobio10m-v2.1/224x224/val/shard-000003.tar
/fs/ess/PAS2136/open_clip/data/evobio10m-v2.1/224x224/val/shard-000004.tar
/fs/ess/PAS2136/open_clip/data/evobio10m-v2.1/224x224/val/shard-000013.tar
```
Delete these files and re-run the jobs to regenerate them.
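To spot corrupt shards locally without re-running the full job, you can at least verify that each tar is readable end to end (a cheap stand-in for `check_wds.py`; it does not try to decode the images):

```python
import io
import tarfile

def find_bad_shards(paths):
    """Return the shard paths that cannot be read end to end as tar files."""
    bad = []
    for path in paths:
        try:
            with tarfile.open(path) as tar:
                for member in tar:
                    handle = tar.extractfile(member)
                    if handle is not None:
                        handle.read()  # force reading; truncated data raises
        except (tarfile.TarError, OSError, EOFError):
            bad.append(path)
    return bad

# Demo with one valid shard and one garbage file.
with tarfile.open("good.tar", "w") as tar:
    info = tarfile.TarInfo("a.txt")
    info.size = 5
    tar.addfile(info, io.BytesIO(b"hello"))
with open("bad.tar", "wb") as f:
    f.write(b"this is not a tar archive")

print(find_bad_shards(["good.tar", "bad.tar"]))  # ['bad.tar']
```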
There are various configuration options hidden throughout the codebase. Here are some to look out for:
- `src/imageomics/eol.py:VernacularNameLookup` has a default argument of `data/eol/vernacularnames.csv`. If your `vernacularnames.csv` is not in `data/eol`, you will have to edit this.
- `src/imageomics/eol.py:EolNameLookup` has many default arguments for different files.
- `src/imageomics/evobio10m.py:eol_root_dir`, `inat21_root_dir` and `bioscan_root_dir` all point to specific image folders on OSC.
- `src/imageomics/evobio10m.py:get_output_dir` returns the default location for evobio10m datasets.
- `scripts/evobio10m/make_wds.py:seen_in_training_json` and `unseen_in_training_json` are the seen and unseen species used in the rare species benchmarks. These are version-controlled, so they should be in the correct location in your repo.