Karonese Sentiment Analysis: A New Dataset and Preliminary Result

Ichwanul Muslim Karo Karo - Universiti Tun Hussein Onn, Johor, 86400, Malaysia
Mohd Farhan Md Fudzee - Universiti Tun Hussein Onn, Johor, 86400, Malaysia
Shahreen Kasim - Universiti Tun Hussein Onn, Johor, 86400, Malaysia
Azizul Azhar Ramli - Universiti Tun Hussein Onn, Johor, 86400, Malaysia

DOI: http://dx.doi.org/10.30630/joiv.6.2-2.1119


The number of active social media users keeps growing, and these users come from varied backgrounds. Active users commonly write in their local or national language to share their thoughts, ideas, perspectives, and opinions, to describe their social conditions, and to socialize. Karonese is a non-English language spoken mostly in North Sumatra, Indonesia, with a distinctive morphology and phonology. Sentiment analysis is frequently applied to local and national languages to obtain an overview of broader public opinion on a particular topic. Good Karonese sentiment analysis (KSA) requires good-quality Karonese resources, and the scarcity of such resources is an obstacle to KSA research. This work provides a Karonese dataset collected from multi-domain social media. To complete the dataset for sentiment analysis, sentiment labels were annotated by Karonese transcribers. Three kinds of experiments were then carried out: KSA using machine learning alone, and KSA using machine learning with each of two feature extraction methods, bag-of-words (BoW) and TF-IDF. The machine learning algorithms are Logistic Regression, Naïve Bayes, Support Vector Machine (SVM), and K-Nearest Neighbor. Feature extraction improves model performance in the range of 0.1 to 7.4 percent. Overall, TF-IDF contributes more than BoW as a feature extraction method. The combination of SVM with TF-IDF achieves the highest performance: accuracy of 58.1 percent, precision of 58.5 percent, recall of 57.2 percent, and F1 score of 57.84 percent.
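
As a rough illustration of how the two feature extraction methods named in the abstract differ, the sketch below builds bag-of-words and TF-IDF vectors using only the Python standard library. It is not the authors' pipeline: the three English placeholder sentences are invented for the example and are not drawn from the Karonese dataset, and the smoothed IDF formula and L2 normalisation are one common variant assumed here, since the abstract does not state which weighting was used.

```python
import math
from collections import Counter

# Toy English placeholder documents (hypothetical; NOT from the Karonese dataset).
docs = [
    "the food is good",
    "the service is bad",
    "good food good price",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

def bow_vector(tokens):
    """Bag-of-words: raw count of each vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def idf(term):
    """Assumed smoothed IDF variant: ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log((1 + len(tokenized)) / (1 + df)) + 1

def tfidf_vector(tokens):
    """TF-IDF: term count scaled by IDF, then L2-normalised."""
    counts = Counter(tokens)
    raw = [counts[t] * idf(t) for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

bow = [bow_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]

for doc, b, w in zip(docs, bow, tfidf):
    print(doc, b, [round(x, 3) for x in w])
```

BoW only counts occurrences, so a word that appears in many documents weighs the same as a rare one. TF-IDF gives rarer terms a larger weight, which may be one reason it can help a classifier more than raw counts.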


Karonese sentiment analysis; support vector machine; k-nearest neighbor; logistic regression; naïve Bayes
