
Abstract— The instability of mRNA and the advantages of mRNA vaccines have encouraged many experts worldwide to tackle the degradation problem. Machine learning models have been widely applied in bioinformatics and healthcare to gain insights from biological data. Thus, machine learning plays an important role in predicting the degradation rate of mRNA vaccine candidates. Stanford University held the OpenVaccine Challenge on Kaggle to gather top solutions to this problem, with the multi-column root mean square error (MCRMSE) as the main performance metric. The Nucleic Transformer has been proposed by other researchers as a deep learning solution that utilizes a self-attention mechanism and a Convolutional Neural Network (CNN). Hence, this paper aims to enhance the performance of the existing Nucleic Transformer by utilizing the AdaBelief or RangerAdaBelief optimizer together with a proposed decoder that places a normalization layer between two linear layers. Based on the experimental results, the enhanced Nucleic Transformer outperforms the existing solution. In this study, the AdaBelief optimizer performs better than the RangerAdaBelief optimizer, even though the latter possesses Ranger's advantages. The advantages of the proposed decoder are only apparent when data is limited; when data is sufficient, the performance is similar but still better than the linear decoder if and only if the AdaBelief optimizer is used. As a result, the combination of the AdaBelief optimizer and the proposed decoder performs best, with a 2.79% and 1.38% performance boost in public and private MCRMSE, respectively.


I. INTRODUCTION
Vaccines have been a trend in disease prevention, delivered by injecting inactivated pathogens [1] or genetic material such as DNA, mRNA, or protein. The downsides of inactivated-pathogen vaccines are the inefficiency of their development and deployment and their inapplicability to non-infectious diseases such as cancer, where a gene mutation occurs and the cells carrying the mutated gene start to divide and grow out of control, rather than the body being infected by bacteria or viruses [1]. The nucleic acid therapeutics approach tackles these problems because such vaccines are safe and efficient, and their production is scalable [1]. For example, the mRNA vaccine is safe because it is a non-infectious molecule and undergoes degradation by normal cellular processes. Besides, mRNA can be modified to be more stable and highly translatable [1]. mRNA can also be produced more cheaply and at larger scale through the high yield of in vitro transcription reactions [1]-[4]. Machine learning models have been widely applied in bioinformatics and healthcare to gain insights from biological data. Deep learning solutions such as Artificial Neural Networks (ANN) have been implemented as promising solutions in various fields of study. ANN variants such as the Convolutional Neural Network (CNN) [5]-[7] have been used to analyze images such as patient CT scans, training on them to predict possible illnesses that have yet to be diagnosed. Long Short-Term Memory (LSTM) has also been utilized in analyzing and studying the nature of nucleotide sequences, such as identifying binding sites or searching for motifs.
The Transformer [8] is one of the state-of-the-art deep learning architectures. It involves a stack of encoders to analyze and decipher the inputs, mostly sentences composed in natural languages, and a stack of decoders to transform the encoded inputs into decoded outputs. Several models, such as recurrent neural networks (RNN) [9] and autoencoders [4], have been used in Natural Language Processing (NLP). However, the transformer is more advantageous because it can process multiple inputs in parallel, depending on the number of encoders and decoders in the predefined architecture. Hence, every encoder contributes its own learned features when reconstructing a sentence in another language. Unfortunately, only limited studies show the implementation of transformers in biological systems.
Stanford University held a competition named OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction, aiming to gather solutions worldwide. The Nucleic Transformer [10]-[12] is one of the promising solutions in the competition. Other solutions submitted to the competition include XGBoost [13], HistGradientBoostingRegressor [14], regularized LSTM [15]-[16], and a GNN + Attention + CNN ensemble [17]. Among these, only the Nucleic Transformer and the ensemble solution achieve remarkably low prediction loss. This indicates that predicting degradation is a complex problem, requiring an ensemble or stacking of models.
There are several challenges to be emphasized when working with machine learning solutions. The possible difficulties are lack of data, overfitting, imbalanced data, and model interpretability [17]. Transfer learning [18]-[21] and data augmentation [22]-[24] can overcome data-hungry problems. Data augmentation is also able to minimize overfitting of the model. Weight decay, batch normalization [25]-[26], and dropout can be utilized in the model. Penalizing over-confident output from the model is considered another solution for overfitting. To balance an imbalanced dataset, smaller categories can be up-sampled or larger categories down-sampled. Deep learning approaches may seem like black-box operations, but they can be made interpretable, meaning we can understand what happens while the model is training [27]-[29]. For instance, backpropagation-based approaches [30] and perturbation-based approaches [31] can interpret the model. Therefore, this paper is motivated to propose an improved optimizer and decoder to boost the performance of mRNA degradation prediction. Next, Section 2 discusses the method to be implemented, Section 3 presents the results and discussion, and Section 4 ends with a conclusion.

II. MATERIAL AND METHOD
The enhanced Nucleic Transformer inherits from the existing Nucleic Transformer, with a proposed decoder and the use of the AdaBelief/RangerAdaBelief optimizer instead of the Ranger optimizer. The proposed decoder consists of two linear layers with a sigmoid activation function in between, acting as a normalization layer before the final output; its architecture is detailed later in Fig. 14. Predicting the degradation rates requires several steps, from preparing the data for the transformer model to generating the prediction results. First, exploratory data analysis (EDA) is done once. The EDA step studies the training dataset's characteristics and explores the effect of data filtering. The filtered data is then reshaped into the specific dimension to be fed into the transformer model later. Afterward, the training and validation split is done through a stratified 10-fold cross-validation technique. The training and testing datasets are read and reshaped for the pre-training step, similar to the training step. However, they contain two different sequence lengths, 107 and 130, which requires splitting them into short and long sequences [32]. In the enhanced Nucleic Transformer model, five Nucleic Transformer encoder layers and a proposed decoder layer are used to process the data. In the pre-training step, the long sequences are processed first, followed by the short sequences. The long and short sequences are randomly mutated or masked at random positions, and the objective of the pre-training step is to let the model predict the true sequence, structure, and loop type of every masked or mutated sample; a sketch of this corruption step is given below. The weights and state of the pre-trained model are then saved and loaded during the training step; in this way, the pre-trained model captures the general rules of mRNA secondary structure. Then, the model's latest state is loaded during the training step. This time the sequences have the same length and can be processed in the transformer model. Validation is done right after every training epoch. Again, the trained model is saved and loaded during testing. Fig. 1 shows the steps of operating the Nucleic Transformer from the beginning of data analysis and preparation until the mRNA degradation rate predictions.
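As referenced above, the pre-training corruption can be illustrated with a short, hedged PyTorch sketch. The token encoding, mask id, and corruption rate below are illustrative assumptions, not the authors' exact settings.

```python
import torch

# Hedged sketch of the pre-training corruption step: random positions of an
# encoded RNA sequence are masked or mutated, and the model must recover the
# original tokens. Token ids, the mask id, and the corruption rate are
# illustrative assumptions, not the authors' exact settings.
NUCLEOTIDES = 4          # A, C, G, U mapped to ids 0..3 (assumed encoding)
MASK_ID = NUCLEOTIDES    # extra id reserved for the mask token

def corrupt_sequence(seq_ids: torch.Tensor, rate: float = 0.15) -> torch.Tensor:
    """Mask or mutate a fraction `rate` of positions in `seq_ids`."""
    corrupted = seq_ids.clone()
    selected = torch.rand(seq_ids.shape) < rate
    masked = selected & (torch.rand(seq_ids.shape) < 0.5)
    mutated = selected & ~masked
    corrupted[masked] = MASK_ID
    # A mutation may occasionally reproduce the original nucleotide.
    corrupted[mutated] = torch.randint(0, NUCLEOTIDES, (int(mutated.sum()),))
    return corrupted

# Usage: the uncorrupted ids serve as the reconstruction target.
seq = torch.randint(0, NUCLEOTIDES, (107,))   # one short (107 nt) sequence
noisy = corrupt_sequence(seq)
```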

A. Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is a process often used by data scientists to analyze and explore datasets to identify the characteristics of the data. Usually, EDA helps data scientists discover data patterns, noise or outliers, and anomalies before making any assumptions. EDA ensures that the results are valid and applicable to our research objectives. Typical analyses include standard deviations, confidence intervals, data point distributions, etc. In this section, the dataset from OpenVaccine on Kaggle is explored and analyzed to study the features of mRNA vaccine candidates and the trends of their degradation properties. There are five target labels, namely the five degradation properties provided in the dataset. The relationships between the degradation properties and the predicted loop type of the mRNA secondary structure are explored. As shown in Fig. 3, there are two features, signal_to_noise and SN_filter, both describing the signal-to-noise ratio of each sequence. Bin and Kai [16] used this feature to filter the sequences used in the training and validation process. The result of the data filtering is shown in Fig. 4, where 2257 out of 2400 training samples had signal_to_noise values greater than 0.25, and the number of samples with an SN_filter value of 0 was reduced from 811 to 668. The training and testing datasets are diverse because they were obtained from laboratory experiments; the testing dataset does not include the five degradation properties, which are to be predicted. In the training dataset, all five degradation properties and their respective experimental error values exist for each position. The dataset provides data grouping through the feature SN_filter, and the data frame is split accordingly. Afterward, the values of each type of predicted loop are grouped into a particular list. There are seven types of loops in the dataset: bulge (B), dangling end (E), hairpin (H), internal (I), multiloop (M), stem (S), and external (X) loop. This subsection aims to observe the distribution of values of all five degradation properties in every loop type, which helps show how each kind of loop contributes to the stability of the mRNA secondary structure. The five degradation properties are reactivity, deg_Mg_pH10, deg_pH10, deg_Mg_50C, and deg_50C. A sketch of this per-loop-type grouping is shown below.
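The grouping just described can be sketched as follows. The column names follow the OpenVaccine train.json file, but the snippet is only an illustrative reconstruction of the analysis, not the original EDA code.

```python
import pandas as pd

# Hedged reconstruction of the per-loop-type grouping: every scored position
# pairs its degradation value with the loop-type character at that position,
# then the values are summarised per loop type and per property. Column names
# follow the OpenVaccine train.json file (JSON-lines format).
TARGETS = ["reactivity", "deg_Mg_pH10", "deg_pH10", "deg_Mg_50C", "deg_50C"]

train = pd.read_json("train.json", lines=True)

rows = []
for _, sample in train.iterrows():
    for target in TARGETS:
        for pos, value in enumerate(sample[target]):
            rows.append({"loop_type": sample["predicted_loop_type"][pos],
                         "property": target,
                         "value": value})
per_loop = pd.DataFrame(rows)

# Mean and standard deviation of each property within each loop type.
summary = per_loop.groupby(["property", "loop_type"])["value"].agg(["mean", "std"])
print(summary)
```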
From Fig. 7 and Fig. 8, several observations can be made. The standard deviations of both the full and filtered training datasets remained largely unchanged compared with Fig. 5 and Fig. 6. In the full training dataset, when SN_filter is 0, the values span between 0 and 3, which tells that the sequences classified with an SN_filter value of 0 are statistically more dispersed than those with an SN_filter value of 1. However, in the filtered training dataset, when SN_filter is 0, the values span between 0 and 1, similar to when SN_filter is 1. Comparing their standard deviations with those when SN_filter is 1, the value for the stem loop (S) increased across all five degradation properties. According to the criteria of signal-to-noise filtering, the minimum value across all five properties must be greater than -0.5, and the average signal-to-noise must be greater than 1.0. The deg_[condition] values depict the likelihood of decay at a base after incubation under the particular condition; the higher the value, the higher the possibility of that base degrading. The standard deviation divergence when SN_filter = 0 may be due to the predicted loop type of those sequences. Since all dataset entries are obtained from prediction (the mRNA secondary structure) and laboratory experiments (the decay rates across all five properties), the results are inevitably mixed with noise. For the decay rates, the errors of all five properties are provided over the length of seq_scored. For the predicted loop type, He et al. [10] provided another six biophysical models at temperatures of 37°C and 50°C, since the degradation properties to be predicted involve different temperatures. However, these biophysical models cannot predict secondary structure at different pH values, and there are no degradation rates across all five properties for these extra models. In short, relying only on the structure information provided by the dataset is insufficient for predicting the degradation rates across all five conditions. Overall, the stem loop (S) is the most stable loop across all five properties, and the dangling end (E) is the least stable loop when SN_filter is 1. The dataset filtering process improves the data quality, as shown from Fig. 7 to Fig. 8 when SN_filter is 0.

B. Data Filtering and Splitting
The training and testing datasets are acquired from the OpenVaccine competition on Kaggle. This step is important for the training process, as it affects the performance of the Nucleic Transformer. The filtering criteria follow the existing method by He et al. [10]. The original training dataset is read and filtered following the criteria in Fig. 9. After the filtering process, the filtered dataset contains 2257 out of 2400 samples. Then, the stratified 10-fold splitting process is performed, generating two outcomes: the indices of the split training (2031) and validation (226) datasets. The validation dataset is further filtered by keeping only those samples with a signal_to_noise value greater than 1, and the finalized validation indices are then generated; the amount may vary from 208 to 214 samples. A sketch of this filtering and splitting step is given below.
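A hedged sketch of this step follows. The stratification key (SN_filter) and the use of scikit-learn's StratifiedKFold are assumptions for illustration; the thresholds (signal_to_noise > 0.25 for training, > 1 for validation) follow the text above.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Assumed reconstruction of the filtering and stratified 10-fold split.
train = pd.read_json("train.json", lines=True)

# Keep sequences with signal_to_noise > 0.25 (2257 of 2400 samples remain).
filtered = train[train["signal_to_noise"] > 0.25].reset_index(drop=True)

# Stratify on SN_filter (assumed key) so both groups appear in every fold.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(filtered, filtered["SN_filter"])):
    # The validation subset is further restricted to signal_to_noise > 1.
    val_idx = [i for i in val_idx if filtered.loc[i, "signal_to_noise"] > 1]
    print(f"fold {fold}: {len(train_idx)} training / {len(val_idx)} validation samples")
```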

1) Data Preparation:
First, the filtered and split datasets are loaded into object instances, as illustrated in Fig. 11. In the figure, the pre-training data requires only the sequences and bppm data because it learns only the rules of mRNA secondary structure by predicting the true sequences of the inputs. For the training and validation steps, the labels and error weights provided by the original dataset are loaded into other object instances. These object instances, holding either the full sequence data or the train and validation split data, are split into batches for the pre-training and training processes. Indices 0 to 11 indicate the six biophysical models with two temperature results for each model. The data loader provided by PyTorch is used to shuffle and split the dataset into batches. All batches except the last have the same size; the remaining samples automatically form the last batch instead of being discarded. Then, all batches are shuffled and fed into the transformer model in every epoch. A minimal sketch of this batching step is shown below.
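The batching behaviour described above can be illustrated with a minimal PyTorch sketch; the tensor shapes and batch size are placeholders, not the original configuration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative tensors: 100 encoded sequences of length 107 and their bppm
# stacks (12 channels = 6 biophysical models x 2 temperatures, assumed shape).
sequences = torch.randint(0, 4, (100, 107))
bppms = torch.rand(100, 12, 107, 107)

dataset = TensorDataset(sequences, bppms)
# shuffle=True reshuffles every epoch; drop_last=False keeps the smaller
# final batch instead of discarding it, as described in the text.
loader = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=False)

for seq_batch, bppm_batch in loader:
    pass  # each batch is fed to the transformer model
```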

2) Pre-training Model:
The model structure is identical to the existing Nucleic Transformer workflow in the pre-trained model, as illustrated in Fig. 12. To pre-train the model, the model takes the mutated or masked sequences and the respective bppm data. The objective here is to allow the model to predict the true sequences of the mutated or masked inputs with the help of the bppm data. The embedding layer embeds the sequence input into the model dimension dmodel or ninp = 256. The embedded input has a dimension of 256 * 3 because the sequence input contains the nucleotide sequence, structure, and loop type, and each is embedded into a dimension of 256. The projection is a PyTorch Linear layer that transforms the embedded sequence inputs of dimension 256 * 3 into a dimension of 256. The embedded and projected sequence inputs and the bppm data are then forwarded into the Nucleic Transformer encoder stack. ConvTransformerEncoderLayer is a self-customized layer with convolutions and self-attention mechanisms, where the value k represents the kernel size of the convolutions that perform kmer-to-kmer interaction mappings. Each layer generates processed encoded sequences, attention weights, and processed bppm, and these products become the inputs of the next encoder layer until the last layer. A sketch of the embedding and projection stage is given below.
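The following hedged sketch shows only the embedding and projection; the vocabulary sizes are assumptions (including a mask token), and the ConvTransformerEncoderLayer stack itself is omitted.

```python
import torch
import torch.nn as nn

# Hedged sketch of the embedding and projection stage: nucleotide, structure,
# and loop-type tokens are each embedded into ninp = 256 dimensions,
# concatenated (256 * 3), and projected back to ninp before entering the
# encoder stack. Vocabulary sizes are illustrative assumptions.
ninp = 256
nuc_emb = nn.Embedding(5, ninp)      # A, C, G, U + mask
struct_emb = nn.Embedding(4, ninp)   # '(', ')', '.' + mask
loop_emb = nn.Embedding(8, ninp)     # B, E, H, I, M, S, X + mask
projection = nn.Linear(ninp * 3, ninp)

def embed(nuc, struct, loop):
    """nuc/struct/loop: LongTensors of shape (batch, seq_len)."""
    x = torch.cat([nuc_emb(nuc), struct_emb(struct), loop_emb(loop)], dim=-1)
    return projection(x)             # shape (batch, seq_len, ninp)

dummy = torch.zeros(2, 107, dtype=torch.long)
encoded_input = embed(dummy, dummy, dummy)
```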
Afterward, the processed encoded sequences are decoded in the decoder. The decoder in the pre-training setting has three different linear layer configurations used to predict the sequence, structure, and loop type. The loss is calculated between the predicted sequence, structure, and loop type and the true sequence data, using CrossEntropyLoss as the loss function. The AdaBelief or RangerAdaBelief optimizer is used. An epoch ends when all the batches of sequence data are processed. In the next epoch, all sequence data is shuffled again and split into batches to ensure the model does not memorize the input sequences. After all epochs are done, the model is saved and later loaded in the training step. Fig. 13 shows the example output of the pre-training process when the number of epochs is 5. The proposed decoder has a normalization layer between two linear layers: a sigmoid layer is used as the normalization layer to transform the output of the first linear layer into values between 0 and 1. Compared to the linear decoder used in the existing Nucleic Transformer, the proposed decoder with normalization can prevent a certain degree of overfitting [2] when training the model. Fig. 14 shows the architecture of the proposed decoder, where ninp is the input size and nclass is the number of predicted classes; a minimal sketch follows.
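The sketch below reproduces only the layer ordering named in the text (linear, sigmoid, linear); the hidden width of the first layer is assumed equal to ninp.

```python
import torch
import torch.nn as nn

# Minimal sketch of the proposed decoder: a sigmoid normalization layer
# between two linear layers. The hidden width of the first layer is assumed
# equal to ninp; only the layer ordering is taken from the text.
class ProposedDecoder(nn.Module):
    def __init__(self, ninp: int, nclass: int):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Linear(ninp, ninp),    # first linear layer
            nn.Sigmoid(),             # squashes activations into (0, 1)
            nn.Linear(ninp, nclass),  # second linear layer -> final output
        )

    def forward(self, x):
        return self.decode(x)

# nclass = 5 degradation properties in the training step; in pre-training the
# decoder heads instead predict sequence, structure, and loop-type classes.
decoder = ProposedDecoder(ninp=256, nclass=5)
out = decoder(torch.rand(2, 107, 256))   # (batch, seq_len, nclass)
```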

3) Training and Validation of Model:
In the training step, the workflow is similar to the pre-training structure up to the Nucleic Transformer encoder. The decoder is modified to implement a sigmoid layer between the linear layers, as described above. The decoder generates the predicted degradation rates, and the loss is calculated with the self-customized loss function weighted_MCRMSE using the true degradation rates and the error_weights, all provided by the original training dataset; error_weights assist the weight-updating process. The calculated loss is then backpropagated to update the model weights, and the AdaBelief or RangerAdaBelief optimizer is used. During the validation step, the flow is similar to the training step except for the calculation of the loss and the omission of the weight-updating process; this time error_weights is unnecessary because the model weights no longer need to be updated. The validation process is triggered once all the training dataset batches have been processed in an epoch; in practice, it may also be run only after several epochs have been completed. The example output of the training and validation process is shown in Fig. 17. The details of the proposed decoder have been given in the previous subsection.

4) Prediction Testing:
The last process, mRNA degradation rate prediction on the test dataset, is done in the same manner as the validation process. The predicted results are then saved in a .csv file for submission to the OpenVaccine Challenge on Kaggle. The submission returns the public and private MCRMSE scores of the predicted results, which are compared with the scores obtained from the existing Nucleic Transformer model. Fig. 19 illustrates the full proposed Nucleic Transformer framework.

III. RESULTS AND DISCUSSION
A. Pre-training Performance
The measures employed for performance analysis are the public and private MCRMSE. The difference between these two metrics is that the private MCRMSE is measured using 91% of the test data, while the remaining 9% is used for the public MCRMSE. Fig. 20 displays the loss per epoch of every combination of optimizers and decoders. Based on the figure, the existing solution using the Ranger optimizer and the linear decoder (blue line) has a relatively higher loss than the AdaBelief/RangerAdaBelief optimizers. Comparing the RangerAdaBelief and AdaBelief optimizers, the AdaBelief optimizer shows a significant improvement with both decoders. Initially, the combination of the AdaBelief optimizer and the linear decoder performs best. However, at the end of pre-training, the combination of the AdaBelief optimizer and the proposed decoder slightly outperforms the former combination. A hedged sketch of the MCRMSE metric and the weighted training loss is given below.
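The sketch below shows the column-wise MCRMSE computation and a weighted training variant. The unweighted form follows the competition definition; the exact weighting scheme of the original weighted_MCRMSE is not reproduced, so the weighted function is an illustrative assumption.

```python
import torch

# Hedged sketch of the column-wise MCRMSE metric and a weighted training
# variant. Only the unweighted form matches the leaderboard metric; the
# weighted form is an assumption for illustration.
def mcrmse(pred, target):
    """pred, target: (batch, seq_len, 5). Mean of the per-property RMSEs."""
    mse_per_property = ((pred - target) ** 2).mean(dim=(0, 1))
    return mse_per_property.sqrt().mean()

def weighted_mcrmse(pred, target, error_weights):
    """error_weights: (batch, seq_len, 5); larger weight = more reliable label."""
    weighted_sq_err = error_weights * (pred - target) ** 2
    return weighted_sq_err.mean(dim=(0, 1)).sqrt().mean()

pred, target, weights = (torch.rand(8, 68, 5) for _ in range(3))
print(mcrmse(pred, target).item(), weighted_mcrmse(pred, target, weights).item())
```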

B. Evaluation of Enhanced Nucleic Transformer
Table 2 presents the results of the AdaBelief optimizer with both the proposed and linear decoders. The AdaBelief optimizer with the proposed decoder achieves the best performance among the experiments on both public and private MCRMSE. Comparing the proposed decoder to the linear decoder while using the AdaBelief optimizer, the proposed decoder performs slightly better in private MCRMSE. From the perspective of pre-training performance, the AdaBelief optimizer with the linear decoder outperforms the other combinations at the beginning of the pre-training process, which further shows that the AdaBelief optimizer is capable of fast convergence early in model training. However, toward the end, the AdaBelief optimizer with the proposed decoder overtakes the former combination, indicating that the AdaBelief optimizer also possesses strong generalization capability. At the same time, the model becomes more complex when implementing the proposed decoder, which increases the number of model parameters. By normalizing the values before the final output, the performance of the Nucleic Transformer can be slightly improved. However, the proposed decoder shows its advantage only in the public MCRMSE, which covers only 9% of the test dataset. In other words, the advantages of the proposed decoder can only be shown when data is limited. When the data is sufficient, the performance is similar but still better than the linear decoder if and only if the AdaBelief optimizer is used.

IV. CONCLUSION
This paper addresses the enhancements that can be made to the existing Nucleic Transformer. The enhanced Nucleic Transformer consists of the AdaBelief optimizer and the proposed decoder, which applies a normalization layer between two linear layers. The enhanced version of the Nucleic Transformer ultimately provides a 2.79% and 1.38% performance boost in public and private MCRMSE, respectively.