Pre-Trained CNN Architecture Analysis for Transformer-Based Indonesian Image Caption Generation Model

Rifqi Mulyawan - Universitas Amikom Yogyakarta, Yogyakarta, Indonesia
Andi Sunyoto - Universitas Amikom Yogyakarta, Yogyakarta, Indonesia
Alva Hendi Muhammad - Universitas Amikom Yogyakarta, Yogyakarta, Indonesia

Classification and object recognition methods have significantly advanced computer vision. The Convolutional Neural Network (CNN) is the approach most often applied to visual problems, especially image classification. In the popular state-of-the-art (SOTA) task of generating a caption for an image, a CNN is typically used as an encoder that extracts image features. Instead of performing classification directly, the extracted features are passed from the encoder to a decoder that generates the caption sequence, so the CNN layers specific to the classification task are not required. This study aims to determine which pre-trained CNN architecture performs best at extracting image features when a state-of-the-art Transformer model is used as the decoder. Unlike the original Transformer architecture, our model operates in a vector-to-sequence rather than sequence-to-sequence manner. The Indonesian Flickr8k and Flickr30k datasets were used in this research. Evaluations were carried out on several pre-trained architectures: ResNet18, ResNet34, ResNet50, ResNet101, VGG16, EfficientNet-B0, EfficientNet-B1, and GoogLeNet. Both qualitative model inference results and quantitative evaluation scores were analyzed. The test results show that the ResNet50 architecture produces stable sequence generation with the highest accuracy. Our experiments also show that fine-tuning the encoder can significantly increase the model's evaluation score. Future work may explore larger Indonesian image captioning datasets, such as Flickr30k, MS COCO 14, and MS COCO 17, as well as newer Transformer-based methods, to obtain a better Indonesian automatic image captioning model.


Deep Neural Network; Convolutional Neural Network; Indonesian Image Captioning; Transformer; Attention Mechanism





This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

JOIV : International Journal on Informatics Visualization
ISSN 2549-9610  (print) | 2549-9904 (online)
Organized by the Society of Visual Informatics, the Institute of Visual Informatics - UKM, and the Soft Computing and Data Mining Centre - UTHM
