This script processes comic panel images to generate captions and descriptive item lists using state-of-the-art vision-language models from Hugging Face. It can:
- Process full comic pages and automatically crop panels
- Generate detailed captions for each panel
- Extract lists of relevant items/details
- Handle batch processing with customizable splits
- Save results in both raw and processed formats
Hardware requirements:
- CUDA-capable GPU (required)
- RAM requirements depend on chosen model and batch size
- Actual VRAM usage may vary based on implementation and settings
Note: Memory requirements can vary significantly based on:
- Model quantization settings
- Batch size configuration
- Input image resolution
- System configuration and available optimizations
Software prerequisites:
- Python 3.8+
- CUDA Toolkit 11.8 or higher
- Git LFS (for model downloads)
```bash
# Core dependencies
pip install torch torchvision
pip install transformers pillow pandas numpy tqdm

# Model-specific dependencies (FlashAttention-2 is published on PyPI as flash-attn)
pip install flash-attn bitsandbytes accelerate
```
To check installed packages:

```bash
pip list | grep -E "torch|transformers|pillow|pandas|numpy|tqdm|flash-attn|bitsandbytes|accelerate"
```
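As a quick sanity check, a minimal snippet can confirm that PyTorch sees the GPU (assumes torch is installed as above; the captioning script requires a CUDA-capable GPU):

```python
import torch

# Verify that PyTorch was built with CUDA support and a GPU is visible.
print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```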
Expected directory layout:

```
data/
├── predicts.caps/
│   └── [model-name]-cap/        # Model predictions are written here
└── datasets.unify/
    └── [subdb]/                 # Comic subdatabases
        └── [comic_no]/          # Individual books
            └── [page_no].jpg    # Full comic pages
benchmarks/
└── captioning/
    ├── prompts.py               # Model prompts
    ├── generate_captions.py     # Main script
    └── postprocessing.py        # Postprocessing script
```
Usage:

```bash
python benchmarks/captioning/generate_captions.py --model MODEL_NAME --num_splits N --index I [options]
```
- `--model`: Model choice (`minicpm2.6`, `qwen2`, `florence2`, `idefics2`, `idefics3`)
- `--num_splits`: Total number of dataset splits
- `--index`: Index of the current split (0-based)
- `--override`: Override existing processed files
- `--batch_size`: Batch size for processing (default varies by model)
- `--num_workers`: Number of DataLoader workers (default: 4)
- `--save_txt`: Save raw results as text files
- `--save_csv`: Extract and save results to CSV files
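For reference, the `--num_splits`/`--index` pair selects one slice of the annotation table. A minimal sketch of the idea (the helper name and exact slicing are illustrative; the script's internal logic may differ):

```python
import numpy as np
import pandas as pd

def select_split(df: pd.DataFrame, num_splits: int, index: int) -> pd.DataFrame:
    # Hypothetical helper: return the `index`-th of `num_splits` near-equal,
    # contiguous slices of the annotations (index is 0-based).
    assert 0 <= index < num_splits, "index must satisfy 0 <= index < num_splits"
    bounds = np.linspace(0, len(df), num_splits + 1, dtype=int)
    return df.iloc[bounds[index]:bounds[index + 1]]

annotations = pd.read_csv("compiled_panels_annotations.csv")
split = select_split(annotations, num_splits=4, index=0)  # first of four splits
```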
```bash
# Process 1/4 of the dataset with MiniCPM (medium VRAM usage)
python benchmarks/captioning/generate_captions.py --model minicpm2.6 --num_splits 4 --index 0 --batch_size 1 --save_txt --save_csv

# Process with Qwen2VL (requires ~40GB VRAM)
python benchmarks/captioning/generate_captions.py --model qwen2 --num_splits 4 --index 1 --batch_size 2 --save_txt --save_csv

# Process with Florence2 (smallest VRAM usage)
python benchmarks/captioning/generate_captions.py --model florence2 --num_splits 4 --index 2 --batch_size 16 --save_txt --save_csv
```
Required columns in `compiled_panels_annotations.csv`:

```csv
subdb,comic_no,page_no,panel_no,x1,y1,x2,y2
subdb1,book1,1,1,100,100,300,300
```

- `subdb`: Subdatabase identifier
- `comic_no`: Book identifier
- `page_no`: Page number (will be zero-padded to 3 digits)
- `panel_no`: Panel number within the page
- `x1,y1,x2,y2`: Panel bounding box coordinates
Input image requirements:
- Format: JPG
- Location: `data/datasets.unify/[subdb]/[comic_no]/[page_no].jpg`
- Content: Full comic pages (panels are cropped automatically, as sketched below)
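To make the path convention and cropping concrete, here is a minimal sketch of how one annotation row maps to a cropped panel (assumes PIL and pandas; the helper name is illustrative, not the script's actual API):

```python
from pathlib import Path
from PIL import Image
import pandas as pd

def crop_panel(row: pd.Series, root: str = "data/datasets.unify") -> Image.Image:
    # Page numbers are zero-padded to 3 digits, e.g. page 1 -> "001.jpg".
    page_path = Path(root) / row["subdb"] / str(row["comic_no"]) / f"{int(row['page_no']):03d}.jpg"
    page = Image.open(page_path)
    # (x1, y1, x2, y2) is the panel bounding box on the full page.
    return page.crop((row["x1"], row["y1"], row["x2"], row["y2"]))

annotations = pd.read_csv("compiled_panels_annotations.csv")
panel = crop_panel(annotations.iloc[0])
```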
File: `utils/prompt.py`
- Contains model-specific prompting templates
- Required templates: `base_prompt`, `minicpm26_prompt`, `idefics2_prompt`
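Structurally, the module is expected to expose plain string templates; the sketch below shows the shape only (the placeholder wording is illustrative, not the actual prompts):

```python
# utils/prompt.py -- structural sketch; the real template wording differs.
base_prompt = (
    "Describe this comic panel in detail, then list the relevant items it contains."
)

# Model-specific variants adapt the base prompt to each model's chat format.
minicpm26_prompt = base_prompt
idefics2_prompt = base_prompt
```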
Output structure:

```
data/predicts.caps/[model-name]-cap/
├── results/                  # Raw results (if --save_txt)
│   └── subdb_book_page_panel.txt
├── N_I_caption.csv           # Processed captions (if --save_csv)
└── N_I_list.csv              # Processed lists (if --save_csv)
```

CSV format (the last column is `caption` in `N_I_caption.csv` and `items` in `N_I_list.csv`):

```csv
subdb,comic_no,page_no,panel_no,caption/items
subdb1,book1,1,1,"A person walking down the street"
...
```
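Downstream, the processed CSVs join back to the annotations on the four shared key columns. A small sketch, assuming `num_splits=4`, `index=0`, and a `minicpm2.6-cap` output directory (the directory name follows the `[model-name]-cap` pattern above):

```python
import pandas as pd

# Attach generated captions from one split back to the panel annotations.
captions = pd.read_csv("data/predicts.caps/minicpm2.6-cap/4_0_caption.csv")
annotations = pd.read_csv("compiled_panels_annotations.csv")
merged = annotations.merge(
    captions, on=["subdb", "comic_no", "page_no", "panel_no"], how="left"
)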
Supported models:

- **MiniCPM-V-2.6**
  - Model ID: `openbmb/MiniCPM-V-2_6`
  - Parameters: 8.1B
  - Default batch size: 16
  - Architecture: Vision-Language model with instruction tuning
- **Qwen2VL**
  - Model ID: `Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4`
  - Parameters: 12.6B (quantized from 72B)
  - Default batch size: 8
  - Architecture: Vision-Language model with instruction tuning
- **Florence2**
  - Model ID: `microsoft/Florence-2-large-ft`
  - Parameters: 0.77B
  - Default batch size: 64
  - Architecture: Vision-Language model with contrastive learning
- **Idefics2**
  - Model ID: `HuggingFaceM4/idefics2-8b`
  - Parameters: 8.4B
  - Default batch size: 16
  - Architecture: Vision-Language model with open-vocabulary detection
- **Idefics3**
  - Model ID: `HuggingFaceM4/Idefics3-8B-Llama3`
  - Parameters: 8.46B
  - Default batch size: 16
  - Architecture: Vision-Language model based on Llama3
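Collecting the list above into code, a configuration map like the following relates the `--model` choices to Hugging Face IDs and default batch sizes (a sketch; the script's internal registry may be structured differently):

```python
# Sketch of a --model registry; values taken from the model list above.
MODELS = {
    "minicpm2.6": {"model_id": "openbmb/MiniCPM-V-2_6", "batch_size": 16},
    "qwen2": {"model_id": "Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4", "batch_size": 8},
    "florence2": {"model_id": "microsoft/Florence-2-large-ft", "batch_size": 64},
    "idefics2": {"model_id": "HuggingFaceM4/idefics2-8b", "batch_size": 16},
    "idefics3": {"model_id": "HuggingFaceM4/Idefics3-8B-Llama3", "batch_size": 16},
}
```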
Common issues:

- **CUDA Out of Memory**
  - Reduce the batch size
  - Try a smaller model
  - Free up GPU memory
  - Use `nvidia-smi` (or the PyTorch snippet after this list) to monitor GPU memory
- **Missing Files**
  - Verify the directory structure
  - Check file permissions
  - Ensure all required files exist
- **Model Loading Errors**
  - Check your internet connection
  - Verify Hugging Face authentication (`huggingface-cli login`)
  - Clear the transformers cache if needed
  - Ensure Git LFS is installed

If problems persist:
- Check the error message for specific details
- Verify all prerequisites are met
- Ensure input data follows the required format
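For in-process memory monitoring, PyTorch exposes its CUDA allocator statistics directly; a minimal sketch:

```python
import torch

def log_gpu_memory(tag: str = "") -> None:
    # Reports memory held by PyTorch's CUDA allocator; nvidia-smi shows the
    # process-level total, which is usually somewhat higher.
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"[{tag}] allocated: {allocated:.2f} GB, reserved: {reserved:.2f} GB")

log_gpu_memory("after model load")
torch.cuda.empty_cache()  # release cached blocks back to the driver
log_gpu_memory("after empty_cache")
```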