Skip to content

Latest commit

 

History

History
34 lines (20 loc) · 2.09 KB

CAPTION.md

File metadata and controls

34 lines (20 loc) · 2.09 KB

Captioning tools

Open-Flamingo

Note: Open-Flamingo currently only works on Torch 2.0.1. If you want to use it, you will have to backdate your torch installation, which will break features in the trainer. I recommend making a separate environment for Open Flamingo captioning instead. You can run through normal install, then pip install open-flamingo in the separate envirment to back date torch and make that install open-flamingo only.

python caption_fl.py --data_root input --min_new_tokens 20 --max_new_tokens 30 --num_beams 3 --model "openflamingo/OpenFlamingo-9B-vitl-mpt7b"

This script uses two example image/caption pairs located in the /example folder to prime the system to caption, then captions the images in the input folder. It will save a .txt file of the same base filename with the caption in the same folder.

This script currently requires an AMPERE or newer GPU due to using bfloat16.

Trying out different example image/caption pairs will influence how the system captions the input images. Adding more examples slows processing.

Supported models:

  • openflamingo/OpenFlamingo-3B-vitl-mpt1b Small model, requires 8 GB VRAM a num_beams 3, or 12 GB at num_beams 16
  • openflamingo/OpenFlamingo-9B-vitl-mpt7b Large model, requires 24 GB VRAM at num_beams 3, or 36.7gb at num_beams 32

The small model with more beams (ex. 16) performs well with details and should not be immediately discounted.

The larger model is more accurate with proper names (i.e. identifying well-known celebrities, objects, or locations) and seems to exhibit a larger vocabulary.

Primary params:

  • --num_beams 3 increasing uses more VRAM and runs slower, may improve detail, but can increase hallicunations
  • --min_new_tokens 20 and --max_new_tokens 35 control the length of the caption

Other settings:

  • --force_cpu forces to use CPU even if a CUDA device is present
  • --temperature 1.0 relates to randomness used for next token chosen
  • --repetition_penalty 1.0 penalizes repeating tokens/words, can adjust up if you see repeated terms
  • --length_penalty 1.0 penalizes longer captions