Enhancing Code Similarity with Augmented Data Filtering and Ensemble Strategies

Gyeongmin Kim - Korea University, Seoul 02841, Republic of Korea
Minseok Kim - Minds lab Inc., Seongnam 13493, Republic of Korea
Jaechoon Jo - Hanshin University, Osan 18101, Republic of Korea

DOI: http://dx.doi.org/10.30630/joiv.6.3.1259


Although COVID-19 has severely affected the global economy, information technology (IT) employees managed to perform most of their work from home. Telecommuting and remote work have increased the demand for IT services in various market sectors, including retail, entertainment, education, and healthcare. Consequently, computer and information experts are also in demand. However, producing IT experts is difficult during a pandemic owing to limitations such as the reduced enrollment of international students. Therefore, research on increasing software productivity is essential; this study proposes a code similarity determination model that utilizes augmented data filtering and ensemble strategies. This algorithm is the first automated development system for increasing software productivity that addresses the current situation: a worldwide shortage of software developers. Pre-trained language models (PLMs) dramatically improve performance in various downstream natural language processing (NLP) tasks. Unlike general-purpose PLMs, CodeBERT and GraphCodeBERT are PLMs that have learned both natural and programming languages. Hence, they are suitable as code similarity determination models. The data filtering process consists of three steps: (1) deduplication of the data, (2) deletion of intersections, and (3) an exhaustive search. The Best Matching 25 (BM25) algorithm and its length-normalized variant (BM25L) were used to construct positive and negative pairs. The performance of the model was evaluated using the 5-fold cross-validation ensemble technique. Experiments demonstrate the effectiveness of the proposed method quantitatively. Moreover, we expect this method to be optimal for increasing software productivity in various NLP tasks.
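The BM25 retrieval step used to construct positive and negative pairs can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name `bm25_scores`, the default parameters (k1 = 1.2, b = 0.75), and the toy token lists are hypothetical, and a real pipeline would tokenize code submissions with the model's own tokenizer before scoring.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against the query with BM25.

    High-scoring documents for the same problem can serve as positive
    pairs; high-scoring documents for a *different* problem make hard
    negative pairs.
    """
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    # Document frequency of each term (count a term once per document).
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

BM25L, as used in the paper, additionally corrects BM25's penalty on very long documents; the sketch above shows only the standard scoring function.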


Code similarity; language model; software productivity; CodeBERT; cross-validated ensemble.
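The cross-validated ensemble named among the keywords can be sketched generically as follows: train one model per fold on the remaining folds, then average the per-fold predictions (soft voting). The callables `train_fn` and `predict_fn` are hypothetical placeholders for fine-tuning and inference with a model such as CodeBERT; this is not the paper's training code.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_ensemble_predict(train_fn, predict_fn, X, y, X_test, k=5):
    """Train one model per fold on the other k-1 folds, then average the
    k models' predictions on X_test (soft voting)."""
    folds = kfold_indices(len(X), k)
    fold_preds = []
    for i in range(k):
        train_idx = [j for fold in (folds[:i] + folds[i + 1:]) for j in fold]
        model = train_fn([X[j] for j in train_idx], [y[j] for j in train_idx])
        fold_preds.append(predict_fn(model, X_test))
    # Element-wise mean across the k prediction vectors.
    return [sum(p) / k for p in zip(*fold_preds)]
```

Averaging the fold models' probabilities, rather than picking a single best fold, uses all of the training data while keeping each model's validation split honest.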


J. Devlin, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

A. Radford, et al. Language models are unsupervised multitask learners. OpenAI blog 1.8 (2019): 9.

Z. Yang, et al. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems 32 (2019).

Z. Dai, et al. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860 (2019).

A. Vaswani, et al. Attention is all you need. Advances in neural information processing systems 30 (2017).

G. Kim, et al. AI Student: A Machine Reading Comprehension System for the Korean College Scholastic Ability Test. Mathematics 10.9 (2022): 1486.

S. Lee, G. Kim, and H. Lim, Verification of educational goal of reading area in Korean SAT through natural language processing techniques, Journal of the Korea Convergence Society, vol. 13, no. 1, pp. 81–88, Jan. 2022.

G. Kim, et al. Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network. International journal of machine learning and cybernetics 11.10 (2020): 2341-2355.

K. Kim, et al. GREG: A global level relation extraction with knowledge graph embedding. Applied Sciences 10.3 (2020): 1181.

Z. Feng, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online. Association for Computational Linguistics.

D. Guo, et al. GraphCodeBERT: Pre-training code representations with data flow. International Conference on Learning Representations (ICLR), 2021.

K. Clark, et al. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020).

T. Breaux and J. Moritz. 2021. The 2021 software developer shortage is coming. Commun. ACM 64, 7 (July 2021), 39–41. https://doi.org/10.1145/3440753

P. Tambe, X. Ye, and P. Cappelli (2020). Paying to Program? Engineering Brand and High-Tech Wages. Management Science 66(7): 3010-3028. https://doi.org/10.1287/mnsc.2019.3343

H. Husain, et al. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).

Y. Li, et al. Competition-level code generation with AlphaCode. arXiv preprint arXiv:2203.07814 (2022).

Z. Feng, et al. Flowchart-based cross-language source code similarity detection. Scientific Programming (2020).

S. Ducasse, M. Rieger and S. Demeyer, A language independent approach for detecting duplicated code, Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM 99). Software Maintenance for Business Change (Cat. No.99CB36360), 1999, pp. 109-118, doi: 10.1109/ICSM.1999.792593.

B. S. Baker, On finding duplication and near-duplication in large software systems, Proceedings of 2nd Working Conference on Reverse Engineering, 1995, pp. 86-95, doi: 10.1109/WCRE.1995.514697.

I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna and L. Bier, Clone detection using abstract syntax trees, Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272), 1998, pp. 368-377, doi: 10.1109/ICSM.1998.738528.

J. Mayrand, C. Leblanc and E. M. Merlo, Experiment on the automatic detection of function clones in a software system using metrics, 1996 Proceedings of International Conference on Software Maintenance, 1996, pp. 244-253, doi: 10.1109/ICSM.1996.565012.

P. Anupriya. Code Clone Detection Using Code2Vec. University of California, Irvine, 2020.

U. Alon, et al. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 3.POPL (2019): 1-29.

S. Robertson, et al. Okapi at TREC-3. NIST Special Publication Sp 109 (1995): 109.

Y. Lv and C. Zhai. 2011. When documents are very long, BM25 fails! In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, New York, NY, USA.

K. Yang, et al. Cross-validated ensemble methods in natural language inference. Annual Conference on Human and Language Technology, 2019.