This script processes comic panel images to generate captions and descriptive item lists using state-of-the-art vision-language models from Hugging Face. It can:
- Process full comic pages and automatically crop panels
- Generate detailed captions for each panel
- Extract lists of relevant items/details
- Handle batch processing with customizable splits
- Save results in both raw and processed formats
Hardware requirements:
- CUDA-capable GPU (required)
- RAM requirements depend on chosen model and batch size
- Actual VRAM usage may vary based on implementation and settings
Note: Memory requirements can vary significantly based on:
- Model quantization settings
- Batch size configuration
- Input image resolution
- System configuration and available optimizations
Software prerequisites:
- Python 3.8+
- CUDA Toolkit 11.8 or higher
- Git LFS (for model downloads)
```bash
# Core dependencies
pip install torch torchvision
pip install transformers pillow pandas numpy tqdm

# Model-specific dependencies (FlashAttention-2 is published on PyPI as flash-attn)
pip install flash-attn bitsandbytes accelerate
```
To check installed packages:

```bash
pip list | grep -E "torch|transformers|pillow|pandas|numpy|tqdm|flash-attn|bitsandbytes|accelerate"
```
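As a quick sanity check, a minimal snippet can confirm that PyTorch sees the GPU (assumes torch is installed as above; the captioning script requires a CUDA-capable GPU):

```python
import torch

# Verify that PyTorch was built with CUDA support and a GPU is visible.
print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```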
Expected directory layout:

```
data/
├── predicts.caps/
│   └── [model-name]-cap/        # Model predictions are written here
└── datasets.unify/
    └── [subdb]/                 # Comic subdatabases
        └── [comic_no]/          # Individual books
            └── [page_no].jpg    # Full comic pages
benchmarks/
└── captioning/
    ├── prompts.py               # Model prompts
    ├── generate_captions.py     # Main script
    └── postprocessing.py        # Postprocessing script
```
Usage:

```bash
python benchmarks/captioning/generate_captions.py --model MODEL_NAME --num_splits N --index I [options]
```
- `--model`: Model choice (`minicpm2.6`, `qwen2`, `florence2`, `idefics2`, `idefics3`)
- `--num_splits`: Total number of dataset splits
- `--index`: Index of the current split (0-based)
- `--override`: Override existing processed files
- `--batch_size`: Batch size for processing (default varies by model)
- `--num_workers`: Number of DataLoader workers (default: 4)
- `--save_txt`: Save raw results as text files
- `--save_csv`: Extract and save results to CSV files
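For reference, the `--num_splits`/`--index` pair selects one slice of the annotation table. A minimal sketch of the idea (the helper name and exact slicing are illustrative; the script's internal logic may differ):

```python
import numpy as np
import pandas as pd

def select_split(df: pd.DataFrame, num_splits: int, index: int) -> pd.DataFrame:
    # Hypothetical helper: return the `index`-th of `num_splits` near-equal,
    # contiguous slices of the annotations (index is 0-based).
    assert 0 <= index < num_splits, "index must satisfy 0 <= index < num_splits"
    bounds = np.linspace(0, len(df), num_splits + 1, dtype=int)
    return df.iloc[bounds[index]:bounds[index + 1]]

annotations = pd.read_csv("compiled_panels_annotations.csv")
split = select_split(annotations, num_splits=4, index=0)  # first of four splits
```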
```bash
# Process 1/4 of the dataset with MiniCPM (medium VRAM usage)
python benchmarks/captioning/generate_captions.py --model minicpm2.6 --num_splits 4 --index 0 --batch_size 1 --save_txt --save_csv

# Process with Qwen2VL (requires ~40GB VRAM)
python benchmarks/captioning/generate_captions.py --model qwen2 --num_splits 4 --index 1 --batch_size 2 --save_txt --save_csv

# Process with Florence2 (smallest VRAM usage)
python benchmarks/captioning/generate_captions.py --model florence2 --num_splits 4 --index 2 --batch_size 16 --save_txt --save_csv
```
Required columns in `compiled_panels_annotations.csv`:

```csv
subdb,comic_no,page_no,panel_no,x1,y1,x2,y2
subdb1,book1,1,1,100,100,300,300
```

- `subdb`: Subdatabase identifier
- `comic_no`: Book identifier
- `page_no`: Page number (will be zero-padded to 3 digits)
- `panel_no`: Panel number within the page
- `x1,y1,x2,y2`: Panel bounding box coordinates
Input image requirements:
- Format: JPG
- Location: `data/datasets.unify/[subdb]/[comic_no]/[page_no].jpg`
- Content: Full comic pages (panels are cropped automatically, as sketched below)
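To make the path convention and cropping concrete, here is a minimal sketch of how one annotation row maps to a cropped panel (assumes PIL and pandas; the helper name is illustrative, not the script's actual API):

```python
from pathlib import Path
from PIL import Image
import pandas as pd

def crop_panel(row: pd.Series, root: str = "data/datasets.unify") -> Image.Image:
    # Page numbers are zero-padded to 3 digits, e.g. page 1 -> "001.jpg".
    page_path = Path(root) / row["subdb"] / str(row["comic_no"]) / f"{int(row['page_no']):03d}.jpg"
    page = Image.open(page_path)
    # (x1, y1, x2, y2) is the panel bounding box on the full page.
    return page.crop((row["x1"], row["y1"], row["x2"], row["y2"]))

annotations = pd.read_csv("compiled_panels_annotations.csv")
panel = crop_panel(annotations.iloc[0])
```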
File: `utils/prompt.py`
- Contains model-specific prompting templates
- Required templates: `base_prompt`, `minicpm26_prompt`, `idefics2_prompt`
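Structurally, the module is expected to expose plain string templates; the sketch below shows the shape only (the placeholder wording is illustrative, not the actual prompts):

```python
# utils/prompt.py -- structural sketch; the real template wording differs.
base_prompt = (
    "Describe this comic panel in detail, then list the relevant items it contains."
)

# Model-specific variants adapt the base prompt to each model's chat format.
minicpm26_prompt = base_prompt
idefics2_prompt = base_prompt
```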
Output structure:

```
data/predicts.caps/[model-name]-cap/
├── results/                  # Raw results (if --save_txt)
│   └── subdb_book_page_panel.txt
├── N_I_caption.csv           # Processed captions (if --save_csv)
└── N_I_list.csv              # Processed lists (if --save_csv)
```

CSV format (the last column is `caption` in `N_I_caption.csv` and `items` in `N_I_list.csv`):

```csv
subdb,comic_no,page_no,panel_no,caption/items
subdb1,book1,1,1,"A person walking down the street"
...
```
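Downstream, the processed CSVs join back to the annotations on the four shared key columns. A small sketch, assuming `num_splits=4`, `index=0`, and a `minicpm2.6-cap` output directory (the directory name follows the `[model-name]-cap` pattern above):

```python
import pandas as pd

# Attach generated captions from one split back to the panel annotations.
captions = pd.read_csv("data/predicts.caps/minicpm2.6-cap/4_0_caption.csv")
annotations = pd.read_csv("compiled_panels_annotations.csv")
merged = annotations.merge(
    captions, on=["subdb", "comic_no", "page_no", "panel_no"], how="left"
)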
Supported models:

- **MiniCPM-V-2.6**
  - Model ID: `openbmb/MiniCPM-V-2_6`
  - Parameters: 8.1B
  - Default batch size: 16
  - Architecture: Vision-Language model with instruction tuning
- **Qwen2VL**
  - Model ID: `Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4`
  - Parameters: 12.6B (quantized from 72B)
  - Default batch size: 8
  - Architecture: Vision-Language model with instruction tuning
- **Florence2**
  - Model ID: `microsoft/Florence-2-large-ft`
  - Parameters: 0.77B
  - Default batch size: 64
  - Architecture: Vision-Language model with contrastive learning
- **Idefics2**
  - Model ID: `HuggingFaceM4/idefics2-8b`
  - Parameters: 8.4B
  - Default batch size: 16
  - Architecture: Vision-Language model with open-vocabulary detection
- **Idefics3**
  - Model ID: `HuggingFaceM4/Idefics3-8B-Llama3`
  - Parameters: 8.46B
  - Default batch size: 16
  - Architecture: Vision-Language model based on Llama3
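Collecting the list above into code, a configuration map like the following relates the `--model` choices to Hugging Face IDs and default batch sizes (a sketch; the script's internal registry may be structured differently):

```python
# Sketch of a --model registry; values taken from the model list above.
MODELS = {
    "minicpm2.6": {"model_id": "openbmb/MiniCPM-V-2_6", "batch_size": 16},
    "qwen2": {"model_id": "Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4", "batch_size": 8},
    "florence2": {"model_id": "microsoft/Florence-2-large-ft", "batch_size": 64},
    "idefics2": {"model_id": "HuggingFaceM4/idefics2-8b", "batch_size": 16},
    "idefics3": {"model_id": "HuggingFaceM4/Idefics3-8B-Llama3", "batch_size": 16},
}
```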
Common issues:

- **CUDA Out of Memory**
  - Reduce the batch size
  - Try a smaller model
  - Free up GPU memory
  - Use `nvidia-smi` (or the PyTorch snippet after this list) to monitor GPU memory
- **Missing Files**
  - Verify the directory structure
  - Check file permissions
  - Ensure all required files exist
- **Model Loading Errors**
  - Check your internet connection
  - Verify Hugging Face authentication (`huggingface-cli login`)
  - Clear the transformers cache if needed
  - Ensure Git LFS is installed

If problems persist:
- Check the error message for specific details
- Verify all prerequisites are met
- Ensure input data follows the required format
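For in-process memory monitoring, PyTorch exposes its CUDA allocator statistics directly; a minimal sketch:

```python
import torch

def log_gpu_memory(tag: str = "") -> None:
    # Reports memory held by PyTorch's CUDA allocator; nvidia-smi shows the
    # process-level total, which is usually somewhat higher.
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"[{tag}] allocated: {allocated:.2f} GB, reserved: {reserved:.2f} GB")

log_gpu_memory("after model load")
torch.cuda.empty_cache()  # release cached blocks back to the driver
log_gpu_memory("after empty_cache")
```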