Urdu Image Caption Generation

This repository contains the implementation of a Transformer-based model for Urdu Image Caption Generation, presented in the study "A Transformer-based Urdu Image Caption Generation." The project aims to generate syntactically, contextually, and semantically correct captions in Urdu for given images. It addresses the challenges of working with low-resource languages like Urdu by leveraging state-of-the-art deep learning architectures.

Overview

Image caption generation sits at the intersection of Natural Language Processing (NLP) and Computer Vision (CV): given an image, the model produces a textual description of it. Most research targets resource-rich languages such as English; this project targets Urdu, a low-resource language.

The study proposes three Seq2Seq-based architectures for Urdu caption generation:

CNN + Attention LSTM with Word2Vec embeddings.
CNN + Transformer encoder-decoder.
Vision Transformer (ViT) + RoBERTa encoder-decoder.

Key Features

Custom Urdu-translated subset of the Flickr8k dataset.
Comparison of three deep learning models.
Reports state-of-the-art BLEU and BERT-F1 scores; the best model (ViT + RoBERTa) reaches BLEU-1 86.0, BLEU-4 59.0, and BERT-F1 90.6.
Demonstrates the effectiveness of pretraining and transformer-based approaches for low-resource languages.

Dataset

Flickr8k Urdu Subset

Total Images: 1800
Total Captions: 9000 Urdu captions
Format: Each image is paired with 5 Urdu captions.
Preprocessing: Includes punctuation removal, sentence splitting, and tokenization.
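
The preprocessing steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's actual pipeline: the punctuation set, the choice of sentence terminators, and the function names (`split_sentences`, `tokenize`, `preprocess`) are all assumptions made for the example.

```python
import re

# Characters to strip. This set (Urdu and ASCII punctuation) is an assumption,
# not the list used in the paper.
PUNCTUATION = "۔،؟؛!\"'()[]{}:;,.?-"
_STRIP_TABLE = str.maketrans("", "", PUNCTUATION)

# Sentence terminators: Urdu full stop (۔), Arabic question mark (؟), and ASCII marks.
SENTENCE_END = re.compile(r"[۔؟!?.]+")


def split_sentences(text):
    """Split raw text into sentences on Urdu and ASCII terminators."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def tokenize(sentence):
    """Remove punctuation, then split on whitespace."""
    return sentence.translate(_STRIP_TABLE).split()


def preprocess(text):
    """Sentence splitting -> punctuation removal -> tokenization."""
    token_lists = [tokenize(sentence) for sentence in split_sentences(text)]
    return [tokens for tokens in token_lists if tokens]


if __name__ == "__main__":
    sample = "ایک لڑکا گیند سے کھیل رہا ہے۔ کتا، گھاس پر دوڑ رہا ہے؟"
    for tokens in preprocess(sample):
        print(tokens)
```

Whitespace tokenization is the simplest choice that fits the description; a subword tokenizer would replace `tokenize` for the Transformer-based models.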

Methodology

Proposed Models

CNN + Attention LSTM

Incorporates a ResNet50-based encoder for feature extraction.
Decoder uses LSTM with a soft attention mechanism.
Uses Word2Vec embeddings pretrained on Urdu text.
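
To make the soft attention step concrete, here is a small pure-Python sketch. At each decoding step the decoder state scores every image region, the scores are normalized with a softmax, and the context vector is the weighted average of the region features. The dot-product scoring function is an assumption chosen for brevity; the paper's model may use a learned (additive) scoring layer instead, and the names `softmax` and `soft_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, hidden):
    """Return (context, weights) for one decoding step.

    features: list of region feature vectors from the CNN encoder.
    hidden:   current decoder hidden state, same dimension as each feature.
    Scoring is a plain dot product (an assumption, see the note above).
    """
    scores = [sum(f * h for f, h in zip(region, hidden)) for region in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, features))
        for d in range(dim)
    ]
    return context, weights


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]
    context, weights = soft_attention(regions, hidden=[2.0, 0.0])
    print(weights, context)
```

Because the weights sum to 1 and are differentiable, the whole step can be trained end to end, which is what distinguishes soft attention from hard (sampled) attention.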

CNN + Transformer

Employs InceptionV3 as the encoder and a Transformer-based decoder.
The Transformer's self-attention improves performance over the attention-LSTM model (see Results).
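
As a rough illustration of self-attention in a caption decoder, the sketch below implements scaled dot-product attention with a causal mask in pure Python. It is deliberately simplified: queries, keys, and values are the token vectors themselves (no learned projections, no multiple heads), and the function name `causal_self_attention` is hypothetical rather than taken from this repository.

```python
import math


def causal_self_attention(tokens):
    """Scaled dot-product self-attention with a causal mask.

    tokens: list of T vectors of dimension d.
    Position t attends only to positions 0..t, so a caption token never
    sees the tokens that come after it.
    Simplification: queries = keys = values = tokens (no learned weights).
    """
    dim = len(tokens[0])
    outputs = []
    for t, query in enumerate(tokens):
        visible = tokens[: t + 1]  # causal mask: drop future positions
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, visible)) for i in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    print(causal_self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

The causal mask is what lets the decoder be trained on whole captions in parallel while still generating one token at a time at inference.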

Vision Transformer (ViT) + RoBERTa

Uses Vision Transformer (ViT) for image encoding.
RoBERTa, pretrained on Urdu text, serves as the decoder.
Achieves the highest BLEU-1, BLEU-4, and BERT-F1 scores of the three models.
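
A ViT encoder does not see the image as a grid of convolutional features; it first cuts the image into fixed-size patches and treats each flattened patch as one input token. The sketch below shows only that patching step, on a single-channel image stored as nested lists. The 224-pixel input and 16-pixel patch size in the usage example are common ViT defaults and are assumptions here, since this README does not state the configuration; `image_to_patches` is a hypothetical helper name.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (nested lists) into flattened square patches.

    Patches are returned in row-major order: left to right, top to bottom.
    Each patch becomes one token for the Transformer encoder.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ])
    return patches


if __name__ == "__main__":
    # Assumed configuration: 224 x 224 input, 16 x 16 patches.
    blank = [[0] * 224 for _ in range(224)]
    patches = image_to_patches(blank, 16)
    print(len(patches), len(patches[0]))
```

In a full model each flattened patch is then linearly projected and given a position embedding before entering the encoder; the decoder attends to the resulting sequence of patch embeddings while generating the caption.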

Results

The models were evaluated using BLEU, BERT-F1, and LASER metrics. The ViT + RoBERTa model outperformed the other two on every score in the table below (LASER scores are not listed here).

Model	BLEU-1	BLEU-4	BERT-F1
CNN + Attention LSTM	71.8	29.1	81.9
CNN + Transformer	78.9	37.6	87.0
ViT + RoBERTa	86.0	59.0	90.6
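
For readers unfamiliar with the BLEU columns, the following pure-Python sketch computes sentence-level BLEU-N from clipped n-gram precision and a brevity penalty, with multiple references per image as in this dataset. It is an illustration under assumptions: no smoothing is applied, uniform n-gram weights are used, and the scores are on a 0 to 1 scale, whereas the table values appear to be on a 0 to 100 scale. The evaluation tooling actually used for the paper is not stated in this README.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n):
    """BLEU-N with uniform weights and no smoothing (0.0 if any precision is 0)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


if __name__ == "__main__":
    refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
    print(bleu("the cat is on the mat".split(), refs, max_n=4))
    print(bleu("the cat".split(), refs, max_n=1))
```

BLEU-1 rewards choosing the right words, while BLEU-4 also requires matching runs of up to four words, which is why the BLEU-4 column is consistently lower than BLEU-1 for every model.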

[Figure: comparison plot of the three models]

Contributing

Contributions are welcome! Please open an issue or submit a pull request to improve the project.

License

This project is licensed under the MIT License. See the LICENSE file for details.

References

"A Transformer-based Urdu Image Caption Generation," Springer Nature.
Additional references and datasets are mentioned in the paper.

Acknowledgments

Special thanks to the authors and contributors of the research paper for their groundbreaking work in advancing Urdu NLP and CV tasks.

Repository: MuhammadHadiofficial/urdu_caption_generator

Folders and files

Latest commit

History

Repository files navigation

Urdu Image Caption Generation

Table of Contents

Overview

Key Features

Dataset

Flickr8k Urdu Subset

Methodology:

Proposed Models

CNN + Attention LSTM

CNN + Transformer

Vision Transformer (ViT) + Roberta

Results

Contributing

License

References

Acknowledgments

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages