This is an implementation for "Hierarchical LSTMs with Adaptive Attention for Visual Captioning" and it is extension of the paper "Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning" accepted in International Joint Conference on Artificial Intelligence (IJCAI) 2017 https://github.com/zhaoluffy/hLSTMat. This work has been accepted by IEEE Transactions of Pattern Analysis and Machine Intelligence (TPAMI), 2019.
Three corresponding version are: concatenation fusion, dynamic fusion and ensemble fusion.
Python 2.7.6
Theano 0.8.2
You need to download pretrained resnet model for extracting features. We provide our extracted ResNet video feature and processed caption in:https://drive.google.com/open?id=1HymvVvAEygM6UJm41dQkQ4IbTWcHT0iQ. Download this dataset and replace RAB_FEATURE_BASE_PATH in config.py with your feature path and replace RAB_DATASET_BASE_PATH in config.py with your processed data path. Besides, you should assign where to store your result in config.py.
If you'd like to evaluate BLEU/METEOR/CIDER scores during training. Don't forget to download coco-caption:https://github.com/tylin/coco-caption and Jobman:http://deeplearning.net/software/jobman/install.html.
Also you should add coco-caption path to $PYTHONPATH and add jobman path to $PYTHONPATH as well.
If you have any question, drop us email at:[email protected]
@article{gao2019hierarchical,
title={Hierarchical LSTMs with adaptive attention for visual captioning},
author={Gao, Lianli and Li, Xiangpeng and Song, Jingkuan and Shen, Heng Tao},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={42},
number={5},
pages={1112--1131},
year={2019},
publisher={IEEE}
}