A Deep Learning Model for Automatic Image Captioning using GRU and Attention Mechanism


Sushma Jaiswal
Harikumar Pallthadka
Rajesh P. Chinchewadi
Tarun Jaiswal

Abstract

Automatic image captioning, at the intersection of computer vision and natural language processing, aims to generate accurate and meaningful descriptions of images. In this paper, we propose a novel captioning method that combines a deep learning model based on Gated Recurrent Units (GRUs) with an attention mechanism. The attention mechanism dynamically weights the image features during caption generation, allowing the model to focus on the image regions relevant to each word and thereby align the generated words more closely with the visual content. The GRU, a type of recurrent neural network, models the sequential structure of natural language and captures the dependencies between words in the output caption. Trained on a large collection of images paired with captions, the network learns to produce coherent, contextually appropriate descriptions for diverse images. We evaluate the proposed model with standard metrics, including BLEU, METEOR, ROUGE, and CIDEr, and show that it generates high-quality captions. Our experiments further show that the method outperforms baseline approaches, confirming the benefit of combining a GRU decoder with attention for image captioning. Because the generated captions are not only accurate but also reflect a deeper understanding of the visual content, the approach is highly relevant to real-world applications such as image interpretation, accessibility, and content recommendation.
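To make the described architecture concrete, the following is a minimal sketch (not the authors' released code) of a GRU decoder with additive, Bahdanau-style attention over pre-extracted CNN region features; all class names, parameter names, and dimensions here are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn as nn


class AttentionGRUDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256,
                 hidden_dim=512, attn_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive attention: score_i = v^T tanh(W_f f_i + W_h h)
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        # The GRU consumes the word embedding concatenated with the attended context.
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions, hidden=None):
        # features: (B, R, feat_dim) region features; captions: (B, T) token ids (teacher forcing)
        B, T = captions.shape
        if hidden is None:
            hidden = features.new_zeros(B, self.gru.hidden_size)
        logits = []
        for t in range(T):
            # Attention weights over the R image regions, conditioned on the current hidden state.
            scores = self.score(torch.tanh(self.feat_proj(features) +
                                           self.hidden_proj(hidden).unsqueeze(1)))  # (B, R, 1)
            alpha = torch.softmax(scores, dim=1)
            context = (alpha * features).sum(dim=1)           # (B, feat_dim)
            word = self.embed(captions[:, t])                 # (B, embed_dim)
            hidden = self.gru(torch.cat([word, context], dim=1), hidden)
            logits.append(self.out(hidden))
        return torch.stack(logits, dim=1)                     # (B, T, vocab_size)

In training, the per-step logits would typically be scored against the shifted ground-truth caption with a cross-entropy loss; at inference time, the loop instead feeds back the previously generated token (greedy or beam search).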
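On the evaluation side, a minimal sketch of corpus-level BLEU scoring with NLTK is shown below; the tokenized captions are illustrative placeholders, and the paper's full evaluation pipeline (including METEOR, ROUGE, and CIDEr) is not reproduced here.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One image: two tokenized reference captions and one tokenized model output.
references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["a", "dog", "is", "running", "along", "the", "shore"]]]
hypotheses = [["a", "dog", "runs", "along", "the", "beach"]]

# BLEU-4 with uniform n-gram weights; smoothing avoids zero scores
# when a higher-order n-gram has no match in the references.
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")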

Article Details

How to Cite
[1] Sushma Jaiswal, Harikumar Pallthadka, Rajesh P. Chinchewadi, and Tarun Jaiswal, “A Deep Learning Model for Automatic Image Captioning using GRU and Attention Mechanism”, Int. J. Comput. Eng. Res. Trends, vol. 11, no. 1, pp. 28–36, Jan. 2024.
Section
Research Articles

References

1. Iandola, Forrest N., Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint arXiv:1602.07360 (2016).

2. Ma, Ningning, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. "ShuffleNet V2: Practical guidelines for efficient CNN architecture design." In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116-131. 2018.

3. Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. "MobileNetV2: Inverted residuals and linear bottlenecks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520. 2018.

4. Tan, Mingxing, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. "MnasNet: Platform-aware neural architecture search for mobile." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820-2828. 2019.

5. Zoph, Barret, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. "Learning transferable architectures for scalable image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697-8710. 2018.

6. Chen, Yunpeng, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. "Dual path networks." In Advances in Neural Information Processing Systems, pp. 4467-4475. 2017.

7. Xie, Saining, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. "Aggregated residual transformations for deep neural networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492-1500. 2017.

8. Zagoruyko, Sergey, and Nikos Komodakis. "Wide residual networks." arXiv preprint arXiv:1605.07146 (2016).

9. Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132-7141. 2018.

10. Zhang, Xingcheng, Zhizhong Li, Chen Change Loy, and Dahua Lin. "PolyNet: A pursuit of structural diversity in very deep networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 718-726. 2017.

11. Szegedy, Christian, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. "Inception-v4, Inception-ResNet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).

12. He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. 2016.

13. Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

14. Huang, Gao, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. "Densely connected convolutional networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708. 2017.

15. Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).

16. Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).

17. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems (NIPS), pp. 1097-1105. 2012.

18. Liu, Maofu, Lingjun Li, Huijun Hu, Weili Guan, and Jing Tian. "Image caption generation with dual attention mechanism." Information Processing & Management 57, no. 2 (2020): 102178. doi: 10.1016/j.ipm.2019.102178.

19. Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), (pp. 740-755).

20. Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 67-78.

21. Rashtchian, C., Young, P., Hodosh, M., & Hockenmaier, J. (2010). Collecting image annotations using Amazon’s Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, (pp. 139-147).

22. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., ... & Fei-Fei, L. (2017). Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32-73.

23. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Berg, A. C. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211-252.

24. Caesar, H., Uijlings, J., & Ferrari, V. (2018). COCO-Stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (pp. 1209-1218).

25. Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018). Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. arXiv preprint arXiv:1806.06357.

26. Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), (pp. 311-318).

27. Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, (pp. 65-72).

28. Vedantam, R., Zitnick, C. L., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (pp. 4566-4575).

29. Lin, C. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, (pp. 74-81).

30. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2016). SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision (ECCV), (pp. 382-398).

31. Martin, J., & Doddington, G. (1997). The DET curve in assessment of detection task performance. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), (Vol. 2, pp. 1895-1898).

32. Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA), (pp. 223-231).
