Exploring a Spectrum of Deep Learning Models for Automated Image Captioning: A Comprehensive Survey

Sushma Jaiswal
Harikumar Pallthadka
Rajesh P. Chinchewadi
Tarun Jaiswal

Abstract

Automatic caption generation from images has emerged as a fundamental and challenging problem at the intersection of computer vision and natural language processing. This paper presents a comprehensive survey of the techniques, methodologies, and advancements in automatic image caption generation, with the primary objective of providing an extensive review of the state-of-the-art models, evaluation metrics, datasets, and applications associated with this domain. The survey begins by elucidating the underlying principles of image feature extraction and caption generation. Various neural network architectures are discussed in detail, including Convolutional Neural Networks (CNNs) for encoding visual features and recurrent models such as Long Short-Term Memory (LSTM) networks for decoding them into sentences. The paper also examines the integration of attention mechanisms and reinforcement learning strategies that enhance the quality and relevance of generated captions. A thorough treatment of evaluation metrics, encompassing both automated and human-centric approaches, is presented for assessing generated captions quantitatively and qualitatively, and prominent datasets that have driven progress in the field are highlighted to give a deeper understanding of current challenges and trends. Furthermore, the paper discusses practical applications and real-world use cases where automatic caption generation plays a pivotal role, including accessibility, multimedia indexing, and assistive technologies. Throughout, diverse end-to-end learning frameworks are examined and contrasted using established evaluation metrics to gauge their applicability across different research domains. The discussion concludes by outlining open challenges and future directions, aiming to inspire further research and innovation in automatic caption generation from images.
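
To make the encoder-decoder paradigm discussed above concrete, the following minimal PyTorch sketch pairs a pretrained CNN image encoder with an LSTM caption decoder, in the spirit of Show and Tell (Vinyals et al., 2015). It is an illustrative sketch rather than the implementation of any surveyed model; the embedding width, hidden size, and vocabulary size are assumed placeholder values.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class CNNEncoder(nn.Module):
        """Encode an image into a single feature vector with a pretrained CNN."""
        def __init__(self, embed_size=256):
            super().__init__()
            resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
            self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier head
            self.project = nn.Linear(resnet.fc.in_features, embed_size)

        def forward(self, images):                        # images: (B, 3, 224, 224)
            with torch.no_grad():                         # keep the pretrained backbone frozen
                feats = self.backbone(images).flatten(1)  # (B, 2048)
            return self.project(feats)                    # (B, embed_size)

    class LSTMDecoder(nn.Module):
        """Produce caption logits word by word, conditioned on the image feature."""
        def __init__(self, vocab_size, embed_size=256, hidden_size=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_size)
            self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
            self.out = nn.Linear(hidden_size, vocab_size)

        def forward(self, image_feats, captions):         # captions: (B, T) token ids
            # Feed the image feature as the first input step, then the word embeddings.
            inputs = torch.cat([image_feats.unsqueeze(1), self.embed(captions)], dim=1)
            hidden_states, _ = self.lstm(inputs)
            return self.out(hidden_states)                # (B, T+1, vocab_size)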
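
The attention mechanisms surveyed (e.g., Show, Attend and Tell) replace the single global image vector with a context vector recomputed at each decoding step as a weighted sum over spatial CNN features. Below is a minimal sketch of additive (Bahdanau-style) attention under assumed dimensions; the feature grid shape and layer widths are hypothetical:

    import torch
    import torch.nn as nn

    class AdditiveAttention(nn.Module):
        """Score each spatial feature against the decoder state; return a weighted sum."""
        def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=256):
            super().__init__()
            self.w_feat = nn.Linear(feat_dim, attn_dim)
            self.w_hidden = nn.Linear(hidden_dim, attn_dim)
            self.score = nn.Linear(attn_dim, 1)

        def forward(self, feats, hidden):
            # feats: (B, L, feat_dim) spatial grid features; hidden: (B, hidden_dim)
            energy = torch.tanh(self.w_feat(feats) + self.w_hidden(hidden).unsqueeze(1))
            alpha = torch.softmax(self.score(energy), dim=1)  # (B, L, 1) weights over regions
            context = (alpha * feats).sum(dim=1)              # (B, feat_dim) context vector
            return context, alpha.squeeze(-1)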
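
On the evaluation side, automated n-gram overlap metrics such as BLEU can be computed with standard tooling. The snippet below uses NLTK's sentence-level BLEU purely as an illustration; benchmark practice is corpus-level BLEU against multiple human references, as on MS COCO.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Two human reference captions and one generated candidate, pre-tokenized.
    references = [["a", "dog", "runs", "across", "the", "grass"],
                  ["a", "dog", "is", "running", "on", "the", "grass"]]
    candidate = ["a", "dog", "runs", "on", "the", "grass"]

    # BLEU-4 with smoothing, since short captions often have zero 4-gram overlap.
    score = sentence_bleu(references, candidate,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU-4: {score:.3f}")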

Article Details

How to Cite
[1] Sushma Jaiswal, Harikumar Pallthadka, Rajesh P. Chinchewadi, and Tarun Jaiswal, “Exploring a Spectrum of Deep Learning Models for Automated Image Captioning: A Comprehensive Survey”, Int. J. Comput. Eng. Res. Trends, vol. 10, no. 12, pp. 1-11, Dec. 2023.
Section
Survey

References

Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. Show and Tell: A Neural Image Caption Generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156-3164, 2015.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning (ICML), pages 2048-2057, 2015.

Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3242-3250, 2017.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6077-6086, 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS), pages 5998-6008, 2017.

Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh. Hierarchical Question-Image Co-Attention for Visual Question Answering. In Advances in Neural Information Processing Systems (NIPS), pages 289-298, 2016.

Minh-Thang Luong, Hieu Pham, Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1412-1421, 2015.

Lun Huang, Wenmin Wang, Yaxian Xia, Jie Chen. Adaptively Aligned Image Captioning via Adaptive Attention Time. arXiv preprint arXiv:1909.09060, 2019.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, Jiebo Luo. Image Captioning with Semantic Attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4651-4659, 2016.

Yang Li, Lukasz Kaiser, Samy Bengio, Si Si. Area Attention. In Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR 97, Long Beach, California, 2019.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR), 2021.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every Picture Tells a Story: Generating Sentences from Images. In Proceedings of the 11th European Conference on Computer Vision: Part V (ECCV), pages 15-29, 2010.

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, Tamara L Berg. Baby talk: Understanding and generating simple image descriptions. In Proceedings of the 24th CVPR, pages 1609-1616, 2011.

R. Elliott, A. Rottenberg, and B. Stankiewicz. An AI System That Describes Its Understanding of Visual Information. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1483-1489, 1993.

Vicente Ordonez, Girish Kulkarni, Tamara Lee Berg. Im2Text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems (NIPS), pages 1143-1151, 2011.

Micah Hodosh, Peter Young, Julia Hockenmaier. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research, vol. 47, pages 853-899, 2013.

Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh. Neural Baby Talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7219-7228, 2018.

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv preprint arXiv:1405.0312, 2014.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. Transactions of the Association for Computational Linguistics, vol. 2, pages 67-78, 2014.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv preprint arXiv:1602.07332, 2016.

Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Matt Deitke, Kiana Ehsani, Daniel Gordon, Yuke Zhu, Aniruddha Kembhavi, Abhinav Gupta, Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv preprint arXiv:1712.05474, 2017.

Abhishek Dutta, Ankush Gupta, and Andrew Zisserman. VGG Image Annotator (VIA): A Simple and Efficient Tool for Annotation of Images and Videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 242-257, 2016.

Holger Caesar, Jasper Uijlings, Vittorio Ferrari. COCO-Stuff: Thing and Stuff Classes in Context. arXiv preprint arXiv:1612.03716, 2018.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311-318, 2002.

Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72, 2005.

Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74-81, 2004.

Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh. CIDEr: Consensus-Based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566-4575, 2015.

Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould. SPICE: Semantic Propositional Image Caption Evaluation. In European Conference on Computer Vision (ECCV), pages 382-398, 2016.

Antoine Bosselut, Omer Levy, Ari Holtzman, Corin Ennis, Dieter Fox, Yejin Choi. Simulating Action Dynamics with Neural Process Networks. arXiv preprint arXiv:1805.09921, 2018.

Xinyu Xiao, Lingfeng Wang, Kun Ding, Shiming Xiang, and Chunhong Pan. Deep Hierarchical Encoder-Decoder Network for Image Captioning. IEEE Transactions on Multimedia, April 2018.

Cheng Wang, Haojin Yang, and Christoph Meinel. Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning. ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 14, no. 2s, article 40, April 2018.

Maofu Liu, Lingjun Li, Huijun Hu, Weili Guan, Jing Tian. Image Caption Generation with Dual Attention Mechanism. Information Processing and Management, vol. 57, Elsevier, November 2019.

Yuting Su, Yuqian Li, Ning Xu, An-An Liu. Hierarchical Deep Neural Network for Image Captioning. Neural Processing Letters, Springer Nature, 2019.

Xiaodan Zhang, Shengfeng He, Xinhang Song, Rynson W.H. Lau, Jianbin Jiao, Qixiang Ye. Image Captioning via Semantic Element Embedding. Neurocomputing, Elsevier, June 2019.

Junhao Liu, Kai Wang, Chunpu Xu, Zhou Zhao, Ruifeng Xu, Ying Shen, Min Yang. Interactive Dual Generative Adversarial Networks for Image Captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

Eric Ke Wang, Xun Zhang, Fan Wang, Tsu-Yang Wu, and Chien-Ming Chen. Multilayer Dense Attention Model for Image Caption. IEEE Access, June 2019.

Ruifan Li, Haoyu Liang, Yihui Shi, Fangxiang Feng, Xiaojie Wang. Dual-CNN: A Convolutional Language Decoder for Paragraph Image Captioning. Neurocomputing, Elsevier, February 2020.

Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, and William Cohen. Encode, Review, and Decode: Reviewer Module for Caption Generation, 2016.

Marco Pedersoli, Thomas Lucas, Cordelia Schmid, and Jakob Verbeek. Areas of Attention for Image Captioning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1251-1259, 2017.

Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training. In IEEE International Conference on Computer Vision (ICCV), pages 4155-4164, 2017.

Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. Deep Reinforcement Learning-based Image Captioning with Embedding Reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1151-1159, 2017.

Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic Compositional Networks for Visual Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1141-1150, 2017.

Li Zhang, Flood Sung, Feng Liu, Tao Xiang, Shaogang Gong, Yongxin Yang, and Timothy M. Hospedales. Actor-Critic Sequence Training for Image Captioning. In 31st Conference on Neural Information Processing Systems (NIPS), 2017.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-Critical Sequence Training for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1179-1195, 2017.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). In International Conference on Learning Representations (ICLR), 2015.

Xu Jia, Efstratios Gavves, Basura Fernando, and Tinne Tuytelaars. Guiding the Long-Short Term Memory Model for Image Caption Generation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2407-2415, 2015.

Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den Hengel. Image Captioning and Visual Question Answering Based on Attributes and External Knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pages 1367-1381, 2018.

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473, 2014.
