ON INFORMATICS VISUALIZATION Enhancing Code Similarity with Augmented Data Filtering and Ensemble Strategies

— Although COVID-19 has severely affected the global economy, information technology (IT) employees managed to perform most of their work from home. Telecommuting and remote work have promoted a demand for IT services in various market sectors, including retail, entertainment, education, and healthcare. Consequently, computer and information experts are also in demand. However, producing IT, experts is difficult during a pandemic owing to limitations, such as the reduced enrollment of international students. Therefore, researching increasing software productivity is essential; this study proposes a code similarity determination model that utilizes augmented data filtering and ensemble strategies. This algorithm is the first automated development system for increasing software productivity that addresses the current situation—a worldwide shortage of software dramatically improves performance in various downstream natural language processing tasks (NLP). Unlike general-purpose pre-trained language models (PLMs), CodeBERT and GraphCodeBERT are PLMs that have learned both natural and programming languages. Hence, they are suitable as code similarity determination models. The data filtering process consists of three steps: (1) deduplication of data, (2) deletion of intersection, and (3) an exhaustive search. The best mating (BM) 25 and length normalization of BM25 (BM25L) algorithms were used to construct positive and negative pairs. The performance of the model was evaluated using the 5-fold cross-validation ensemble technique. Experiments demonstrate the effectiveness of the proposed method quantitatively. Moreover, we expect this method to be optimal for increasing software productivity in various NLP tasks.


I. INTRODUCTION
Pretrained language models (PLMs) such as Bidirectional Encoder Representations from Transformers (BERT) [1], Generative Pre-Training (GPT) [2], and eXtra Long NETwork (XLNET) [3], [4] have contributed to tremendous performance improvements in downstream tasks, such as machine reading comprehension, entity recognition, and relation extraction, of natural language processing (NLP) in recent years. These PLMs are based on the transformer architecture [5], and they are pretrained on a large unsupervised text corpus and then finetuned with training data in downstream tasks such as question answering [6], [7], named entity recognition [8] and relation extraction [9]. This pre-training and finetuning model approach in NLP has expedited pre-training development.
CodeBERT [10] and GraphCodeBERT [11] are language models trained in a programming language (PL) and natural language (NL). These are PLMs that have been optimized such as for code search and document generation. Both models were pretrained on the CodeSearchNet dataset containing functions in six PLs as well as NLs' documents.
In the pre-training process, the former model was trained on objective functions, including standard masked language modeling (MLM), and replaced token detection (RTD) [12]. The latter model was trained on code representations using the semantic structure of the code. These code-aware specific PLMs, which have learned code representations, differ from general PLMs that have not learned code as they can classify the similarity when inputting a PL. Moreover, these PLMs are considered more general.
Recently, a serious worldwide shortage of software developers and experts who can supply quality software has become evident [13], [14]. As the supply of software developers is limited, having a method prepared in advance for analyzing, developing, and maintaining software based on an automated method is essential for increasing software productivity. In addition, many studies are being conducted in this field worldwide. As a first step, an algorithm that determines whether two codes can produce the same results is crucial for enhancing software productivity.
Our research heavily involves the use of deep bimodal architectures trained on both PL and NL. Specifically, we used CodeBERT, GraphCodeBERT, CodeBERT-MLM (developed by Microsoft Research), and CodeBERTaPy. CodeBERT and CodeBERT-MLM were among the first research to create models that capture the semantic connection between NL and PL, such as Python, Java, JavaScript, etc. In the pre-training step, the input data consists of two sequences concatenated with a special token i.e. , , … , , , … , where segment , , … , denotes NL text and the segment , , … , denotes a certain PL. This input is then passed through a standard BERT architecture, and the outputs of the model are the contextual vector representation of each token for both the language and the code component and the [CLS] token embedding containing the pooled sequence information. The pre-training objective functions include MLM and RTD, as we mentioned above. Both bimodal and unimodal data were used to train CodeBERT, and bimodal data was obtained from GitHub repositories where a code is paired with its description. CodeBERT [10] was then finetuned for downstream tasks such as code search and codeto-text generation, and it achieved state-of-the-art performances in both tasks. GraphCodeBERT [11] followed CodeBERT, and it is a graph-based pre-trained model based on Transformer [5] for PL. It is one of the first works that leverages code structure to learn code representation to improve code understanding dramatically. The pre-training tasks for GraphCodeBERT are unique in that they include not only masked language modeling but also data flow edge prediction and the variable-alignment across source code and data flow to align representation between source code and data flow.
Furthermore, the CodeSearchNet challenge [15] launched in 2020 actively encourages research in retrieving relevant code given an NL query. The organizers provide a corpus consisting of 99 NL queries with approximately 4000 expert relevance annotations of likely results. It also consists of automatically generated NL queries for 2 million code functions. A leaderboard was run, encouraging competition and advancement in this area of research. Another notable project that made use of deep learning for code understanding is the AlphaCode project [16] initiated by Deep Mind. AlphaCode aims to train deep learning models that understand code for code generation, specifically generating code solutions in various PLs to competitive coding problems from CodeForces. They approached the problem as a sequence-tosequence task, where given a problem statement in NL, a solution code string was generated. Thus, the researchers of AlphaCode made use of an encoder-decoder Transformer architecture [17] to solve the problem. This project also made its training dataset public, which came in very handy for us to use as an additional source of training data when fine-tuning our CodeBERT models for detecting code similarity.
Despite the advancements in code and natural language understanding, there are not many existing works of literature on utilizing these architectures to classify code performing similar tasks directly. Instead, widely used approaches are text-based detection [18], token-based detection [19], which extracts a sequence of tokens using compiler-style source code transformation that is subsequently used for similarity matching, and tree-based detection [20], where the code is transformed to abstract syntax trees and are later used in tree sub matching algorithms for similarity identification and metrics based detection [21] that extracts a number of metrics from the source code fragments then comparing these metrics for detecting similarity. Another research is about using flow charts [22] to detect code similarity, but it is applied for crosslanguage source code detection (i.e., detecting similarity even when source codes are written in different PLs). The aforementioned techniques do not rely on deep learning, but recently relatively simple deep learning-based code embedding techniques such as Code2Vec [23] were explored to embed codes for similar or clone code detection.
For most code search systems being used in practice, it appears that simpler methods that do not involve deep neural networks, such as a cross encoder or a bi-encoder structure for classifying similarity, seem to be preferred. Our work shows that fine-tuning deep pre-trained language models that understand code performs efficiently on the task of code similarity detection and thus provides an incentive for using deep pre-trained models for future code similarity detection systems.
In this study, we propose a language model system that can determine code similarity using an automated method. The three-step strategic data filtering process for effectively training a language model can eliminate duplication between training and testing. In addition, positive and negative pairs were generated using the best matching (BM25) [24] and length normalization of BM25 (BM25L) [25] algorithms of the Okapi system with proven performance. Moreover, the cross-validation ensemble method [26] enabled an accurate and effective performance in the model training process in which data imbalance existed.
The contributions are as follows:  Three data filtering strategies for removing duplication in the training dataset are introduced. In addition, positive and negative pairs are obtained using the BM25 and BM25L algorithms.  CodeBERT and graphcodebert, pretrained on code representations, are finetuned. Then, the crossvalidation ensemble, verified when inferencing, is applied to extract a more effective performance.  The effectiveness of the proposed method is verified using quantitative experiments for single and ensemble models.

II. MATERIALS AND METHOD
In this section, we describe the following three crucial procedures (data preprocessing process, three filtering strategies, forming positive, and negative pairs), and finally, a cross-validated ensemble method for achieving performance from the ensembled results of a single model. Fig. 1 shows the overall schematic of the process.

A. Data Preprocessing
Before the code data is fed as an input sequence of the model, cleaning the raw text must be performed because of the characteristic of the code. The code may be written in an unorganized manner by the coder or contain text information that is irrelevant to compilation. Especially, the existing typo information in the code is unnecessary, and it can act as a fatal factor in the model training procedure. First, unnecessary new lines and spaces are removed. Comments (sequences marked with '#') written to help understand the code are also somewhat unnecessary for our training process. Hence, they are removed along with white spaces.

B. Data Filtering Strategies
Duplication of data may occur when using different datasets as training and test data. Therefore, a model trained with these duplicated features cannot perform an accurate evaluation. Before enhancing the model with additional data, we observed data duplication between the additional training data and the test data. Hence, filtering was performed with three strategies. Algorithm 1 describes the three-step data filtering strategies in Fig. 1 more effectively.

1) Deduplication of Data:
First, the data corresponding to A and B, and the TEST codes are loaded. The first filtering strategy filters overlapped sequence data using a hash table (HT). This method records all values of A in the HT and then checks whether these values exist in B; this is the most common and widely used method. For the first filtering of Algorithm 1 (1~6 lines), (1) the sequence input to the DEDUPLICATION function and initialized HT creates a new HT from the A, and (2) the new HT is compared with B to create First_filtered_codes (6 lines). Most duplicate data are filtered out in this process.
2) Deletion of Intersection: Second, the purpose of the Second Filtering is to filter out data not completely filtered out by the HT owing to reasons such as trailing space. This includes spaces such as white spaces ' ' in the right or left edge of sentences, tabs '\t', In the second filtering of Algorithm 1 (7~10 lines), (1) the SIMPLIFY function concatenates all newlines existing in the character strings of the code. (2) All spaces and newlines before and after the character strings are removed. (3) Filtering is performed once more by taking the intersection of these filtered sequences of the test code and using first_filtered_codes to generate second_filtered_codes.

3) Exhaustive Search:
Most of the duplicated data are removed by the second filtering process. However, a method that completely eliminates duplication is required. Therefore, an exhaustive search is performed, and s and TEST_codes are mutually compared in the third filtering of Algorithm 1 to remove the few remaining duplicate data traces (11~20 lines). Now, the third_filtered_codes, which are used as the training data, are finally generated.

C. Forming positive and negative pairs
Then, the dataset that has been trimmed with the previous three preprocessing strategies should be composed of positive and negative pairs for the training process. One type of training data is positive pairs which can determine whether the two codes are similar, and the other is negative pairs which can determine whether the two codes are not similar. For negative samples, ranks are assigned after sorting the data in descending order based on the confidence score. BM25 algorithm of the Okapi system, which is based on probabilistic retrieval research, is a ranking function utilized by search engines to rank matching sentences according to their relevance to a given query. For length normalization of BM25, BM25L, a newer variant to boost scores of very long documents and more effective than BM25, was proposed [25]. The Okapi BM25 scores a document with respect to query sentence containing keywords , … , as follows: The inverse document frequency (IDF) means how common or rare a word is in the total document set. is consisted of IDF weight of the query term and , is the number of documents containing , which is calculated as above. Using these algorithms, the volume of the original dataset was dramatically expanded, and we experimented with demonstrating the improved efficiency of the code similarity evaluation model. Algorithm

D. Cross-Validation Ensemble
The training process of the language models is divided into training, validating, and testing datasets, respectively. For evaluation of our model, any data imbalance-related issues can not be allowed; if there are, an accurate measure is impossible. In Fig. 2, the 5-fold cross-validation can train each model in the training step, alternating each set 5 times. We finally divided the overall training dataset by 8:2, finetuned the model with each train & valid features, and evaluated it on the test dataset. Fig. 2 A method of 5-fold cross-validation. The models were ensembled by voting from each # ) * 5 , ,6,7,89 and resulting in these models' training and testing Consequently, a more accurate performance evaluation can be performed. We utilized the method of evaluating the final performance by recording the checkpoint that indicates the best performance for each fold and creating an ensemble of the determinations made by the 5 models generated in this method in the testing process.

III. RESULTS AND DISCUSSION
This section describes the quantitative experimental method and experimental results that prove the effectiveness of our proposed method.

A. Data
The experiment was conducted by randomly extracting only 50% of the CodeNet data, owing to resource problems, presented at a code similarity competition, which was a NLP AI contest held in Korea in May 2022. The problems were numbered, and there were two solution codes (1 and 2) for solving each problem.  Table 1 shows the count and volume of the competition and CodeNet data used in the experiment. Positive and negative pairs were created to increase the volume of the data. In addition, the validation loss was calculated every 1000 steps by classifying the data as training and validation data. When two codes are input, the accuracy is used to measure the performance of the binary classification task, which determines whether the two codes solve the same problem.

E. Experiment Setup
The experiment environment is as follows. The graphcodebert and codebert-mlm models were used for the proposed method. Both models were trained on a large dataset used for researching the probing capability of code-based PLMs. These PLMs can learn the general-purpose representation as well as code search and documentation generation tasks. Moreover, these PLMs are language models, and studies have been recently conducted on these models for downstream tasks related to code as in Table 2.  Table 3 lists the hyperparameters for finetuning the two models using the proposed method. These models were trained using an A100 GPU and by setting the learning rate to 2e-5, epsilon to 1e-5, the optimizer to AdamW, max epochs to 2, batch size to 32, and max sequence length to 512.  F. Results Table 3 shows the quantitative results of the code similarity evaluation model. The first column shows the single models used in the experiment, graphcodebert and codebert-mlm, and cross-validation ensemble of the two models. In addition, the competition-only and competition + CodeNet data were used as the training data. The similarity was measured by applying the BM25 and BM25L similarity algorithms of the Okapi system to each dataset to generate the positive and negative pairs.
The results show a consistent performance improvement when the BM25L algorithm was applied, unlike with the BM25 algorithm. In addition, the graphcodebert model (the first row of Models) outperformed the codebert-mlm model (the second row of Models). Moreover, the cross-validation ensemble of the two models achieved a better performance than the two single models. In particular, the cross-validation ensemble model finetuned with the competition + CodeNet dataset achieved the highest score with a performance of 98.8785.

IV. CONCLUSION
In this study, we proposed an effective code similarity representation model. Our approach of data filtering strategies: deduplication, delete intersection, and exhaustive search, can eliminate data redundancy and contributes to efficient model training. Subsequently, the positive and negative pairs were formed by applying the BM25 and BM25L algorithms to the filtered dataset. Then, the model was finetuned with the 5-fold cross-validation method. Finally, the best performance was achieved by performing a final inference on the test dataset by the five models, each of which was trained with the cross-validated ensemble method. The performance improvement was demonstrated quantitatively by comparing our approach with conventional methods. In addition, the ensemble was proven effective on both single and augmented data. Our approach to improving the productivity of this software is crucial as the impact of labor-supply shortages in IT can negatively affect both productivity and innovation, and we believe our methods will be leveraged in various NLP tasks.