A Deep Learning Model for Automatic Image Captioning using GRU and Attention Mechanism


Sushma Jaiswal
Harikumar Pallthadka
Rajesh P. Chinchewadi
Tarun Jaiswal

Abstract

Automatic image captioning, at the intersection of computer vision and natural language processing, aims to generate accurate and meaningful descriptions of images. In this paper, we propose a novel captioning method that combines a deep learning model based on Gated Recurrent Units (GRUs) with an attention mechanism. The attention mechanism dynamically weights the image features during caption generation, allowing the model to focus on the image regions relevant to each word and thereby align the generated words more closely with the visual content. The GRU, a type of recurrent neural network, models the sequential structure of natural language and captures the dependencies between words in the output caption. Trained on a large collection of images paired with captions, the network learns to produce coherent, contextually appropriate descriptions for diverse images. We evaluate the proposed model with standard metrics, including BLEU, METEOR, ROUGE, and CIDEr, and show that it generates high-quality captions. Our experiments further show that the method outperforms baseline approaches, confirming the benefit of combining a GRU decoder with attention for image captioning. Because the generated captions are not only accurate but also reflect a deeper understanding of the visual content, the approach is highly relevant to real-world applications such as image interpretation, accessibility, and content recommendation.
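To make the described architecture concrete, the following is a minimal sketch (not the authors' released code) of a GRU decoder with additive, Bahdanau-style attention over pre-extracted CNN region features; all class names, parameter names, and dimensions here are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn as nn


class AttentionGRUDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256,
                 hidden_dim=512, attn_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive attention: score_i = v^T tanh(W_f f_i + W_h h)
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        # The GRU consumes the word embedding concatenated with the attended context.
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions, hidden=None):
        # features: (B, R, feat_dim) region features; captions: (B, T) token ids (teacher forcing)
        B, T = captions.shape
        if hidden is None:
            hidden = features.new_zeros(B, self.gru.hidden_size)
        logits = []
        for t in range(T):
            # Attention weights over the R image regions, conditioned on the current hidden state.
            scores = self.score(torch.tanh(self.feat_proj(features) +
                                           self.hidden_proj(hidden).unsqueeze(1)))  # (B, R, 1)
            alpha = torch.softmax(scores, dim=1)
            context = (alpha * features).sum(dim=1)           # (B, feat_dim)
            word = self.embed(captions[:, t])                 # (B, embed_dim)
            hidden = self.gru(torch.cat([word, context], dim=1), hidden)
            logits.append(self.out(hidden))
        return torch.stack(logits, dim=1)                     # (B, T, vocab_size)

In training, the per-step logits would typically be scored against the shifted ground-truth caption with a cross-entropy loss; at inference time, the loop instead feeds back the previously generated token (greedy or beam search).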
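On the evaluation side, a minimal sketch of corpus-level BLEU scoring with NLTK is shown below; the tokenized captions are illustrative placeholders, and the paper's full evaluation pipeline (including METEOR, ROUGE, and CIDEr) is not reproduced here.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One image: two tokenized reference captions and one tokenized model output.
references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["a", "dog", "is", "running", "along", "the", "shore"]]]
hypotheses = [["a", "dog", "runs", "along", "the", "beach"]]

# BLEU-4 with uniform n-gram weights; smoothing avoids zero scores
# when a higher-order n-gram has no match in the references.
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")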

Article Details

How to Cite
[1] Sushma Jaiswal, Harikumar Pallthadka, Rajesh P. Chinchewadi, and Tarun Jaiswal, “A Deep Learning Model for Automatic Image Captioning using GRU and Attention Mechanism”, Int. J. Comput. Eng. Res. Trends, vol. 11, no. 1, pp. 28–36, Jan. 2024.
Section
Research Articles

References

1. Iandola, Forrest N., Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint arXiv:1602.07360 (2016).

2. Ma, Ningning, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. "ShuffleNet V2: Practical guidelines for efficient CNN architecture design." In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116-131. 2018.

3. Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. "MobileNetV2: Inverted residuals and linear bottlenecks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520. 2018.

4. Tan, Mingxing, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. "MnasNet: Platform-aware neural architecture search for mobile." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820-2828. 2019.

5. Zoph, Barret, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. "Learning transferable architectures for scalable image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697-8710. 2018.

6. Chen, Yunpeng, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. "Dual path networks." In Advances in Neural Information Processing Systems, pp. 4467-4475. 2017.

7. Xie, Saining, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. "Aggregated residual transformations for deep neural networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492-1500. 2017.

8. Zagoruyko, Sergey, and Nikos Komodakis. "Wide residual networks." arXiv preprint arXiv:1605.07146 (2016).

9. Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132-7141. 2018.

10. Zhang, Xingcheng, Zhizhong Li, Chen Change Loy, and Dahua Lin. "PolyNet: A pursuit of structural diversity in very deep networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 718-726. 2017.

11. Szegedy, Christian, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. "Inception-v4, Inception-ResNet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).

12. He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. 2016.

13. Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

14. Huang, Gao, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. "Densely connected convolutional networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708. 2017.

15. Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).

16. Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).

17. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems (NIPS), pp. 1097-1105. 2012.

18. Liu, Maofu, Lingjun Li, Huijun Hu, Weili Guan, and Jing Tian. "Image caption generation with dual attention mechanism." Information Processing & Management 57, no. 2 (2020): 102178. doi: 10.1016/j.ipm.2019.102178.

19. Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), (pp. 740-755).

20. Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 67-78.

21. Rashtchian, C., Young, P., Hodosh, M., & Hockenmaier, J. (2010). Collecting image annotations using Amazon’s Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, (pp. 139-147).

22. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., ... & Fei-Fei, L. (2017). Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32-73.

23. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Berg, A. C. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211-252.

24. Caesar, H., Uijlings, J., & Ferrari, V. (2018). COCO-Stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (pp. 1209-1218).

25. Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018). Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. arXiv preprint arXiv:1806.06357.

26. Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), (pp. 311-318).

27. Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, (pp. 65-72).

28. Vedantam, R., Zitnick, C. L., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (pp. 4566-4575).

29. Lin, C. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, (pp. 74-81).

30. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2016). SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision (ECCV), (pp. 382-398).

31. Martin, J., & Doddington, G. (1997). The DET curve in assessment of detection task performance. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), (Vol. 2, pp. 1895-1898).

32. Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA), (pp. 223-231).
