Fuzzy Soft Set Clustering for Categorical Data

Iwan Tri Yanto - Universitas Ahmad Dahlan, Yogyakarta, Indonesia
Ani Apriani - Institut Teknologi Nasional Yogyakarta, Yogyakarta, Indonesia
Rofiul Wahyudi - Universitas Ahmad Dahlan, Yogyakarta, Indonesia
Cheah WaiShiang - Universiti Malaysia Sarawak, Malaysia
Suprihatin - Universitas Ahmad Dahlan, Yogyakarta, Indonesia
Rahmat Hidayat - Politeknik Negeri Padang, Padang, Indonesia

DOI: http://dx.doi.org/10.62527/joiv.8.1.2364


Clustering categorical data is difficult because categorical data lacks a natural order and can contain groups of data that are related only in specific dimensions. Conventional clustering methods, such as k-means, cannot be applied directly to categorical data. Numerous clustering algorithms for categorical data, for instance fuzzy k-modes and its enhancements, have been developed to overcome this issue. However, these approaches still produce clusters with low purity and weak intra-cluster similarity. Furthermore, transforming categorical attributes into binary values can be computationally costly. This research proposes a fuzzy clustering technique for categorical data based on soft set theory and the multinomial distribution. Many algorithms can be used to group categorical data in a fuzzy-based manner, but they do not always improve cluster purity or response times. As a solution, hard categorical data clustering through the multinomial distribution is suggested: a multi-soft set is produced using a rotated-based soft set, and the data are then clustered using a multivariate multinomial distribution. A comparison with established baseline algorithms shows that the proposed approach performs better in terms of purity, rank index, and response times, with improvements of up to 97.53%.
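The pipeline outlined in the abstract (decompose the categorical table into a multi-soft set, then cluster with a multivariate multinomial model) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the update equations, so the sketch substitutes a standard EM-style soft assignment for a mixture of independent categorical distributions. The function names `multi_soft_set`, `soft_multinomial_clustering`, and `purity`, and the parameters `k`, `n_iter`, `seed`, and `eps`, are all hypothetical.

```python
import numpy as np


def multi_soft_set(data):
    """Decompose a categorical table into one Boolean-valued soft set per attribute.

    data: array-like of shape (n_objects, n_attributes) holding category labels.
    Returns a list of (values, indicator) pairs, where indicator[i, v] is 1.0
    when object i takes the value values[v] on that attribute.
    """
    data = np.asarray(data, dtype=object)
    soft_sets = []
    for j in range(data.shape[1]):
        values, codes = np.unique(data[:, j].astype(str), return_inverse=True)
        indicator = np.eye(len(values))[codes.ravel()]
        soft_sets.append((values, indicator))
    return soft_sets


def soft_multinomial_clustering(soft_sets, k, n_iter=50, seed=0, eps=1e-9):
    """Assumed stand-in for the paper's method: EM-style soft memberships
    under a mixture of independent categorical (multinomial) attributes."""
    rng = np.random.default_rng(seed)
    n = soft_sets[0][1].shape[0]
    u = rng.dirichlet(np.ones(k), size=n)  # random memberships, rows sum to 1
    for _ in range(n_iter):
        weights = u.sum(axis=0) / n  # cluster mixing proportions
        log_lik = np.tile(np.log(weights + eps), (n, 1))
        for _, ind in soft_sets:
            # Membership-weighted frequency of each category value per cluster.
            theta = u.T @ ind + eps
            theta /= theta.sum(axis=1, keepdims=True)
            log_lik += ind @ np.log(theta).T
        # Normalise in log space for numerical stability.
        log_lik -= log_lik.max(axis=1, keepdims=True)
        u = np.exp(log_lik)
        u /= u.sum(axis=1, keepdims=True)
    return u


def purity(labels_true, labels_pred):
    """Fraction of objects that belong to the majority true class of their cluster."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    total = 0
    for c in np.unique(labels_pred):
        _, counts = np.unique(labels_true[labels_pred == c], return_counts=True)
        total += counts.max()
    return total / len(labels_true)


if __name__ == "__main__":
    table = [["red", "small"], ["red", "small"], ["red", "small"],
             ["blue", "large"], ["blue", "large"], ["blue", "large"]]
    memberships = soft_multinomial_clustering(multi_soft_set(table), k=2)
    print(memberships.round(3))
    print(purity([0, 0, 0, 1, 1, 1], memberships.argmax(axis=1)))
```

Taking `argmax` over each membership row converts the soft assignment to a hard partition, which is what the purity measure is evaluated on in this sketch.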


Function of multinomial distribution; clustering; categorical data; multi-soft set

