Note: OpenFlamingo currently only works with Torch 2.0.1. Using it means downgrading your torch installation, which will break features in the trainer. I recommend making a separate environment for OpenFlamingo captioning instead. Run through the normal install, then run pip install open-flamingo
in the separate environment to downgrade torch, and use that environment for OpenFlamingo only.
python caption_fl.py --data_root input --min_new_tokens 20 --max_new_tokens 30 --num_beams 3 --model "openflamingo/OpenFlamingo-9B-vitl-mpt7b"
This script uses two example image/caption pairs located in the /example
folder to prime the system to caption, then captions the images in the input folder. It will save a .txt
file of the same base filename with the caption in the same folder.
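The output naming described above can be sketched as follows. This is a minimal illustration, not the script's actual code: `caption_path_for` is a hypothetical helper name, and it assumes the caption file simply swaps the image extension for .txt.

```python
from pathlib import Path


def caption_path_for(image_path: Path) -> Path:
    """Return the caption file path for an image (hypothetical helper).

    Assumes the caption sits next to the image and shares its base
    filename, with the final extension replaced by .txt.
    """
    return image_path.with_suffix(".txt")


if __name__ == "__main__":
    # Example: an image in the input folder and its caption file
    print(caption_path_for(Path("input") / "cat.jpg"))
```

Only the last extension is replaced, so a file such as photo.final.png would map to photo.final.txt under this assumption.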
This script currently requires an Ampere or newer GPU because it uses bfloat16.
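To check the GPU requirement before installing anything else, a sketch like the one below may help. It assumes that "Ampere or newer" corresponds to CUDA compute capability 8.0 or higher. The function names are hypothetical and not part of the captioning script.

```python
from typing import Tuple


def is_ampere_or_newer(capability: Tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability is 8.0 or higher.

    Assumption: Ampere corresponds to compute capability 8.x.
    """
    major, _minor = capability
    return major >= 8


def local_gpu_is_ampere_or_newer() -> bool:
    """Check the first CUDA device, returning False if torch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    return is_ampere_or_newer(torch.cuda.get_device_capability(0))


if __name__ == "__main__":
    print("Ampere or newer GPU found:", local_gpu_is_ampere_or_newer())
```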
Trying out different example image/caption pairs will influence how the system captions the input images. Adding more examples slows processing.
Supported models:
openflamingo/OpenFlamingo-3B-vitl-mpt1b
Small model, requires 8 GB VRAM at num_beams 3, or 12 GB at num_beams 16
openflamingo/OpenFlamingo-9B-vitl-mpt7b
Large model, requires 24 GB VRAM at num_beams 3, or 36.7 GB at num_beams 32
The small model with more beams (e.g. 16) performs well with details and should not be immediately discounted.
The larger model is more accurate with proper names (e.g. identifying well-known celebrities, objects, or locations) and seems to exhibit a larger vocabulary.
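As a rough guide to choosing between the two models, the VRAM figures above can be turned into a small lookup. This is only a sketch: `pick_model` is a hypothetical helper, and it uses the num_beams 3 figures as minimums, so higher beam counts will need more memory than this check accounts for.

```python
from typing import Optional

# Minimum VRAM in GB at num_beams 3, taken from the figures listed above
MODEL_MIN_VRAM_GB = {
    "openflamingo/OpenFlamingo-9B-vitl-mpt7b": 24.0,
    "openflamingo/OpenFlamingo-3B-vitl-mpt1b": 8.0,
}


def pick_model(vram_gb: float) -> Optional[str]:
    """Return the largest listed model that fits in vram_gb, or None."""
    # Try the most demanding model first
    for name, needed in sorted(
        MODEL_MIN_VRAM_GB.items(), key=lambda item: item[1], reverse=True
    ):
        if vram_gb >= needed:
            return name
    return None


if __name__ == "__main__":
    for vram in (6, 12, 24):
        print(vram, "GB ->", pick_model(vram))
```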
Primary params:
--num_beams 3
increasing uses more VRAM and runs slower, may improve detail, but can increase hallucinations
--min_new_tokens 20 and --max_new_tokens 35
control the length of the caption
Other settings:
--force_cpu
forces use of the CPU even if a CUDA device is present
--temperature 1.0
controls the randomness used when choosing the next token
--repetition_penalty 1.0
penalizes repeating tokens/words; adjust up if you see repeated terms
--length_penalty 1.0
penalizes longer captions
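To see how the flags above fit together, here is a minimal argparse sketch. It is not the source of caption_fl.py: the flag names come from this section, but the defaults are assumed from the values shown next to each flag, and `build_parser` and `generation_kwargs` are hypothetical names. The dictionary keys mirror the keyword names used by Hugging Face text generation, on the assumption that the script forwards them in that form.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags described above (defaults are assumptions)."""
    p = argparse.ArgumentParser(description="OpenFlamingo captioning flags (sketch)")
    p.add_argument("--data_root", type=str, default="input")
    p.add_argument("--model", type=str,
                   default="openflamingo/OpenFlamingo-9B-vitl-mpt7b")
    p.add_argument("--num_beams", type=int, default=3)
    p.add_argument("--min_new_tokens", type=int, default=20)
    p.add_argument("--max_new_tokens", type=int, default=35)
    p.add_argument("--temperature", type=float, default=1.0)
    p.add_argument("--repetition_penalty", type=float, default=1.0)
    p.add_argument("--length_penalty", type=float, default=1.0)
    p.add_argument("--force_cpu", action="store_true")
    return p


def generation_kwargs(args: argparse.Namespace) -> dict:
    """Collect the generation settings into one dict, with a basic sanity check."""
    if args.min_new_tokens > args.max_new_tokens:
        raise ValueError("--min_new_tokens must not exceed --max_new_tokens")
    return {
        "num_beams": args.num_beams,
        "min_new_tokens": args.min_new_tokens,
        "max_new_tokens": args.max_new_tokens,
        "temperature": args.temperature,
        "repetition_penalty": args.repetition_penalty,
        "length_penalty": args.length_penalty,
    }


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print(generation_kwargs(parsed))
```

Keeping the generation settings in one dictionary makes it easy to log the exact configuration next to each batch of captions when comparing beam counts or penalties.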