An Extensive Analysis of Image Captioning Models, Evaluation Measures, and Datasets
Abstract
Image captioning, the challenging interdisciplinary task of generating insightful and detailed captions for images, lies at the intersection of computer vision and natural language processing. In this paper, we present a comprehensive examination of image captioning models, evaluation metrics, and datasets. We provide a thorough review of popular image captioning models, from conventional methods to the most recent developments based on deep learning and attention mechanisms, examining their design, underlying assumptions, and capabilities, and emphasizing how they help produce coherent and contextually appropriate captions. Furthermore, we analyse in detail well-known evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr, clarifying their role in assessing the quality of generated captions against ground-truth references. We also discuss the critical role that datasets play in image captioning research, with particular attention to prominent datasets such as COCO, Flickr30k, and Conceptual Captions, investigating their diversity, scale, and annotations and emphasizing their impact on model training and evaluation. Our objective is to provide researchers, practitioners, and enthusiasts with a valuable resource for understanding the state of image captioning, thereby facilitating the development of innovative models and improved evaluation techniques.
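To make the evaluation discussion more concrete, the sketch below illustrates the core idea behind BLEU, namely clipped n-gram precision combined with a brevity penalty, for a candidate caption scored against a set of reference captions. This is a minimal, self-contained illustration rather than the paper's own code or an official implementation; the function names and example captions are hypothetical, and the smoothing, tokenization, and corpus-level aggregation used by standard toolkits are omitted.

```python
# Minimal illustrative sketch of BLEU-style scoring (clipped n-gram precision
# plus a brevity penalty). Not an official implementation: names and the
# example captions below are hypothetical; real evaluations should use an
# established toolkit such as the COCO caption evaluation code.
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate caption against references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # Clip each candidate n-gram count by its maximum count in any reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Geometric mean of 1..max_n clipped precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: the geometric mean collapses if any precision is zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: compare the candidate length with the closest reference length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * geo_mean


if __name__ == "__main__":
    refs = [["a", "dog", "is", "running", "along", "the", "beach"],
            ["a", "dog", "runs", "on", "the", "beach"]]
    cand = ["a", "dog", "is", "running", "on", "the", "beach"]
    print(f"BLEU-4 (unsmoothed): {bleu(cand, refs):.3f}")
```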