BibTex Citation Data :
@article{JOIV1387, author = {Rifqi Mulyawan and Andi Sunyoto and Alva Hendi Muhammad}, title = {Pre-Trained CNN Architecture Analysis for Transformer-Based Indonesian Image Caption Generation Model}, journal = {JOIV : International Journal on Informatics Visualization}, volume = {7}, number = {2}, year = {2023}, keywords = {Deep Neural Network; Convolutional Neural Network; Indonesian Image Captioning; Transformer; Attention Mechanism}, abstract = {Classification and object recognition in image processing have significantly improved computer vision tasks. The method is often used for visual problems, especially in image classification utilizing the Convolutional Neural Network (CNN). In the popular state-of-the-art (SOTA) task of generating a caption for an image, a CNN is often used as an encoder for feature extraction. Instead of performing direct classification, the extracted features are sent from the encoder to the decoder to generate the sequence, so the CNN layers related to the classification task are not required. This study aims to determine which pre-trained CNN architecture performs best at extracting image features when a state-of-the-art Transformer model is used as the decoder. Unlike the original Transformer architecture, we implemented a vector-to-sequence approach instead of sequence-to-sequence. The Indonesian Flickr8k and Flickr30k datasets were used in this research. Evaluations were carried out on several pre-trained architectures, including ResNet18, ResNet34, ResNet50, ResNet101, VGG16, Efficientnet_b0, Efficientnet_b1, and Googlenet. Both the qualitative model inference results and the quantitative evaluation scores were analyzed. The test results show that the ResNet50 architecture produces stable sequence generation with the highest accuracy. Experiments also show that fine-tuning the encoder can significantly increase the model's evaluation score.
As for future work, further exploration with larger datasets such as Flickr30k, MS COCO 14, MS COCO 17, and other Indonesian image captioning datasets, as well as the implementation of newer Transformer-based methods, could yield a better Indonesian automatic image captioning model.}, issn = {2549-9904}, pages = {487--493}, doi = {10.30630/joiv.7.2.1387}, url = {https://joiv.org/index.php/joiv/article/view/1387} }

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
__________________________________________________________________________
JOIV : International Journal on Informatics Visualization
ISSN 2549-9610 (print) | 2549-9904 (online)
Organized by Society of Visual Informatics, and Institute of Visual Informatics - UKM and Soft Computing and Data Mining Centre - UTHM
W : http://joiv.org
E : joiv@pnp.ac.id, hidra@pnp.ac.id, rahmat@pnp.ac.id