Image Captioning with Style Using Generative Adversarial Networks

Dennis Setiawan - Computer Science Department, School of Computer Science, Bina Nusantara University, Palmerah, Jakarta 11480, Indonesia
Maria Astrid Saffachrissa - Computer Science Department, School of Computer Science, Bina Nusantara University, Palmerah, Jakarta 11480, Indonesia
Shintia Tamara - Computer Science Department, School of Computer Science, Bina Nusantara University, Palmerah, Jakarta 11480, Indonesia
Derwin Suhartono - Computer Science Department, School of Computer Science, Bina Nusantara University, Palmerah, Jakarta 11480, Indonesia


Image captioning research, which initially focused on describing images factually, is now moving toward incorporating sentiments or styles to produce natural captions that resemble human-written ones. The problem this research addresses is that captions produced by existing models are rigid and unnatural because they lack sentiment. The purpose of this research is to design a reliable image captioning model that incorporates style, based on the state-of-the-art SeqCapsGAN architecture. The datasets used are MS COCO and SentiCaps. The research is carried out through literature study and experiments. While many previous studies compare their work without accounting for differences in the components and parameters used, this research takes a different approach to find more reliable configurations and provide more detailed insight into the models’ behavior. It also carries out further experiments on the generator, which has not been thoroughly investigated. Experiments cover combinations of feature extractor (VGG-19 and ResNet-50), discriminator model (CNN and Capsule), optimizer (Adam, Nadam, and SGD), batch size (8, 16, 32, and 64), and learning rate (0.001 and 0.0001), explored through a grid search. The results give more insight into the models’ behavior and yield a configuration and results better than the baseline. Our findings suggest that comparative studies of image recognition models in the image captioning context, automated metrics, and larger datasets suited to stylized image captioning may be needed to advance research in this field.
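
As a rough illustration of the size of the search space described above, the following sketch enumerates every combination of the listed options. The dictionary keys and the function name `enumerate_configs` are hypothetical labels chosen for this example; the abstract does not state whether every combination was actually trained, so the count assumes a full Cartesian product.

```python
from itertools import product

# Option values as listed in the abstract; the key names are illustrative only.
SEARCH_SPACE = {
    "feature_extractor": ["VGG-19", "ResNet-50"],
    "discriminator": ["CNN", "Capsule"],
    "optimizer": ["Adam", "Nadam", "SGD"],
    "batch_size": [8, 16, 32, 64],
    "learning_rate": [0.001, 0.0001],
}


def enumerate_configs(space):
    """Return one dict per point in the Cartesian product of all options."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = enumerate_configs(SEARCH_SPACE)
print(len(configs))  # 2 * 2 * 3 * 4 * 2 = 96
```

Under that assumption, an exhaustive grid would require 96 training runs, which is why each configuration in such a study is usually compared under an otherwise fixed setup.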


Stylized image captioning; SeqCapsGAN; sentiments or styles; Generative Adversarial Network (GAN); capsule; discriminator; generator.

