Deep Convolutional Neural Networks Transfer Learning Comparison on Arabic Handwriting Recognition System

Abstract: Around 27 languages and more than 420 million people worldwide use Arabic letters, which makes Arabic one of the most widely used languages. A challenge of Arabic script, however, is that the shape of a letter changes with its position in a word. Arabic handwriting recognition is important for applications such as education and communication. For example, during a pandemic most education moves online, which makes it difficult to recognize students' Arabic handwriting. This paper aims to build a model that recognizes Arabic handwriting by comparing several CNN architectures, trained with transfer learning, on the Hijja and AHCD Arabic handwriting datasets. Transfer learning reuses a model already trained on one dataset as the starting point for another dataset, and it suits small datasets because it can improve accuracy even when little training data is available. The datasets were split into 60%, 20%, and 20% for training, validation, and testing. Each model uses data augmentation and 50% dropout on the fully connected layer to reduce overfitting. The CNN architectures compared in this study are ResNet, DenseNet, VGG16, VGG19, InceptionV3, and MobileNet. The models were compiled and trained with various parameters. The best model for classifying the AHCD and Hijja datasets is VGG16 with the Adam optimizer and a 0.0001 learning rate. This research is expected to show which model performs best for classifying Arabic handwriting.


I. INTRODUCTION
Arabic is one of the most used languages in the world as a source of documentation. A challenge of Arabic script is that the shape of a letter changes with its position in a word. Around 27 languages use Arabic letters, and Arabic is used by more than 420 million people worldwide, making it the 6th most-used language [1]. Arabic handwriting recognition is needed in various applications, especially during the pandemic, when most education has moved online and recognizing students' Arabic handwriting has become harder [2]. Therefore, this paper aims to build a model that recognizes Arabic handwriting. We hope this model can support learning and teaching Arabic during the pandemic.
Several CNN architectures have been used to develop models that recognize Arabic handwriting on the IFN/ENIT Arabic database [1]. The method begins by using the Hierarchical Agglomerative Clustering (HAC) algorithm to divide the database into clusters that are only loosely related to one another. Cluster members are then ordered using a newly proposed ranking method. The ranking technique first computes the Pyramid Histogram of Oriented Gradients (PHoG) and then measures divergence with the Kullback-Leibler method. Only the matching classes with the highest rankings are passed to the classification process. The proposed clustering and ranking stages use only 11% of the entire database, which reduces computational complexity and improves classification results. The AlexNet architecture produced the best model in that study, reaching 99% accuracy with a 0.01 learning rate. Modeling took 18.15 minutes.
Another CNN-based model has been proposed to recognize Arabic handwriting on the AHCD and Hijja datasets [7]. The proposed model uses 80% dropout, a 0.001 learning rate, and the Adam optimizer. It achieved 97% and 88% accuracy on the AHCD and Hijja datasets, respectively. In another study, Arabic letters and characters are recognized using a Deep Convolutional Neural Network (DCNN) and an SVM by comparing the input templates to pre-stored templates using a fully connected DCNN and a dropout SVM.
This work also considers the correctness of the categorized templates and the recognized handwritten Arabic characters while calculating the corrected classification rate (CRR). The error classification rate (ECR) is also calculated. The experimental findings show that the suggested algorithm can recognize, identify, and verify handwritten Arabic characters. To identify similar Arabic letters, the suggested system uses a clustering method based on K-means to address the issue of multi-stroke Arabic characters. The comparative evaluation against the state of the art shows a system accuracy of 95.07% CRR and 4.93% ECR [2].
Another approach to recognizing Arabic characters is to use a DCNN based on Beta-elliptic parameters and Fuzzy Elementary Perceptual Codes. That paper uses two databases, LMCA and MAYASTROUN, and the proposed approach achieved 98.90% accuracy [4]. Arabic writing can also be recognized in photos of natural scenes using a CNN-RNN model. That study presents a CNN-RNN model with an attention mechanism for Arabic image text recognition. The CNN creates feature sequences from an input image, and these sequences are fed through a bidirectional RNN to produce feature sequences in the correct order. With the bidirectional RNN, some text segmentation preprocessing can be skipped. The attention mechanism then lets the model select the relevant information from the feature sequences to generate the output. The model is trained end to end with standard backpropagation. The proposed method obtains 87% accuracy on the Alif and Activ databases [5].
A cursive handwritten Arabic text recognition system has been developed using different deep-learning architectures and modeling choices. The approach starts with implementing adaptive data augmentation to promote class diversity, to prevent imbalanced data sets. This algorithm assigns a weight to each word in the database lexicon, which is calculated based on the average probability of each class in a word. The models proposed are implemented in two databases, IFN/ENIT and AHDB. The highest performance was achieved by the BLSTM model with accuracies of 98.99% and 98.10% on IFN/ENIT and AHDB databases, respectively [6].
The CNN-trained model has been used to classify the AHCD dataset, which consists of 16,800 handwritten Arabic characters that are split into 13,440 training images and 3,360 testing images. This paper uses a combination of feature extractors and a trainable classifier. The proposed model achieved a 94.9% classification accuracy rate on testing images [8].
Building on the papers reviewed above and the various methods they use, this research creates an Arabic handwriting recognition model to help recognize students' handwriting, so that learning Arabic during a pandemic becomes easier. The model is built using CNN-based architectures.
This study compares the performance of six CNN architectures in recognizing Arabic handwriting: VGG16, VGG19, InceptionV3, MobileNet, ResNet, and DenseNet. Because deep learning methods perform better as the dataset grows [9], this study applies data augmentation to increase the sample variance of the dataset. Transfer learning is also applied to increase the model's accuracy [10] [11]. The last method used to increase the model's performance is fine-tuning. Fine-tuning is a transfer learning technique and performs better than a manually created model [12].
This research therefore not only compares CNN architectures but also improves the performance of the models using three methods: data augmentation, transfer learning, and fine-tuning. The model recognizes Arabic writing from input images of hijaiyah letters taken from the public Hijja and AHCD datasets. This study is expected to show how well each CNN architecture classifies Arabic handwriting and how the architectures compare with one another.
This study consists of 4 sections. Section 1 contains an introduction regarding the background and research conducted. Section 2 contains the materials and methods used in the research. Section 3 contains the results and discussions. Furthermore, section 4 contains conclusions about this research.

II. MATERIALS AND METHODS
A. CNN Architectures
The CNN was first introduced by LeCun in the 1980s. It is one of the most used deep learning methods for processing visual data [13]. The primary uses of CNNs are in data analytics, natural language processing, and image and signal processing [14]. The CNN has played an important role in the history of deep learning as a successful example of applying ideas about how the brain works to machine learning. It is also one of the models that performs well in commercial use [15].
A CNN is a type of feedforward neural network that uses a convolutional architecture to extract features from data [16]. CNNs organize neurons in layers, which makes them capable of learning hierarchical representations like any other neural network model. The neurons in the layers are connected through weights and biases [17]. Many recent contributions to CNN structure have gone into building deep CNNs (DCNNs). By deepening the network, deep CNNs can learn additional features. However, as the network depth increases, degradation and vanishing gradient issues arise [18]. Important information may be lost when the input signal or its gradient is propagated across numerous layers. For this reason, numerous recent publications have proposed designs that keep the network deep while providing short paths between layers [1]. The most popular CNN architectures are Residual Networks (ResNets), DenseNet, VGG16 [1], VGG19, InceptionV3 [19], and MobileNet [20].
The number and kind of layers vary across these CNN architectures. The differences depend on the type of application, the volume of data, and the complexity of the task. The layer types include input, convolution, batch normalization, pooling, dropout, and output layers [1]. As explained in the following subsections, these architectures have been used for various purposes, notably text recognition.
1) ResNet: ResNet is an architecture that can have up to thousands of layers. Stacking so many layers is intended to let the network learn more complex features accurately. ResNet has an advantage over other architectures in that performance does not suffer as the design becomes deeper [21]. In a plain network, stacking more layers causes degradation and other harmful effects, and ResNet is one method for preventing this. The architecture is built from residual blocks. In ResNet, a layer receives input from the residual unit as well as directly from the preceding layer [22]. So that layers can be stacked without adding parameters or computational complexity, the identity x is added to the residual block's output [1].
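The identity shortcut described above can be illustrated with a minimal sketch in plain Python. It assumes the block computes y = F(x) + x element-wise, which is the usual formulation of a residual unit; the names `residual_block`, `residual_fn`, and `zero_residual` are hypothetical and do not come from the paper.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return F(x) + x, where F is the residual mapping learned by the block."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for an identity shortcut")
    # The shortcut adds the unchanged input; it has no weights of its own.
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x: Vector) -> Vector:
    """A residual mapping that has learned nothing (illustrative only)."""
    return [0.0 for _ in x]


if __name__ == "__main__":
    # With a zero residual the block reduces to the identity function.
    print(residual_block([1.0, -2.0, 3.0], zero_residual))
```

When the residual mapping outputs zeros, the block simply passes its input through, which is one way to see why adding such blocks need not degrade a deeper network.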
2) DenseNet: The DenseNet architecture is created by replacing the main unit of the ResNet architecture with the dense block. The output of each layer in a dense block is passed to all the layers that follow it [23]. DenseNet builds feature learning models using dense blocks as the primary building component [24]. DenseNet connects all layers in a dense block to one another to provide maximum information flow between layers [25]. A conventional CNN with L layers has L connections, whereas a dense CNN has L(L+1)/2 direct connections. Each layer takes as input the feature maps computed by every layer before it. This is regarded as a very effective remedy for the vanishing-gradient issue. The final feature maps are created by concatenating the feature maps from all earlier layers [1].
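The connection counts quoted for DenseNet can be checked with a short sketch. It assumes that layer 0 stands for the block input and that every layer receives the block input plus the outputs of all earlier layers; the function names are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple


def dense_block_connections(num_layers: int) -> List[Tuple[int, int]]:
    """List (source, target) pairs; source 0 is the block input, layers are 1..L."""
    return [
        (source, target)
        for target in range(1, num_layers + 1)
        for source in range(target)  # block input plus every earlier layer
    ]


def plain_chain_connections(num_layers: int) -> List[Tuple[int, int]]:
    """A conventional CNN: each layer receives only its predecessor's output."""
    return [(target - 1, target) for target in range(1, num_layers + 1)]


if __name__ == "__main__":
    L = 5
    print(len(plain_chain_connections(L)))   # L connections
    print(len(dense_block_connections(L)))   # L * (L + 1) / 2 connections
```

For L = 5 the plain chain has 5 connections and the dense block has 15, matching L(L+1)/2.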
3) VGG16: VGG16 is one of the VGGNet models and uses 16 layers in its architecture. Normally, VGG16 uses five convolutional blocks connected to three fully connected classifier layers. The output layer uses a sigmoid activation function when there are two or fewer categories and a softmax activation function when there are three or more categories in the dataset [26]. The VGG-16 network was trained on the ImageNet database. This extensive training results in outstanding accuracy even with small image datasets [27].
4) VGG19: VGG19 is similar to VGG16 and other VGGNet variants. The difference is three additional convolutional layers that help identify patterns in images [19].
5) InceptionV3: The InceptionV3 architecture consists of 48 layers and is a development of GoogLeNet (InceptionV1). The Inception-V3 model is a deep CNN that was trained on a computer with a basic configuration [28]. The architecture comprises convolutional and fully connected (FC) layers with average and max pooling, and dropout after the pooling layer. It uses batch normalization and a softmax output layer [19].
6) MobileNet: MobileNet is a CNN architecture for mobile devices [29]. It is built from depthwise separable convolutions, each made up of two convolutional layers: a 3x3 depthwise convolution layer and a 1x1 pointwise convolution layer [30]. Counting depthwise and pointwise convolutions as separate layers, MobileNet has 28 layers [20].
1 Hijja dataset is available at https://github.com/israksu/Hijja2
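To make the MobileNet description more concrete, the sketch below compares the number of weights in a standard convolution with a 3x3 depthwise convolution followed by a 1x1 pointwise convolution. The formulas are the standard ones for these layer types with biases ignored; the channel counts (32 in, 64 out) are arbitrary illustrative values, not taken from the paper.

```python
def standard_conv_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights in a standard convolution: every output channel sees every input channel."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * in_channels  # one kernel x kernel filter per input channel
    pointwise = in_channels * out_channels     # 1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    standard = standard_conv_params(3, 32, 64)
    separable = depthwise_separable_params(3, 32, 64)
    print(standard, separable, round(separable / standard, 4))
```

With these example sizes the separable version needs 2,336 weights instead of 18,432, roughly one eighth, which is why the factorization suits mobile devices.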

B. Dataset
This section describes the datasets used in this paper. There are two datasets, Hijja¹ and AHCD². Hijja is a free, publicly available dataset of single Arabic letters collected from Arabic-speaking school children between the ages of 7 and 12. It contains 47,434 characters written by 591 participants in different forms. The data were collected in Riyadh, Saudi Arabia, from January 2019 to April 2019 [7]. AHCD is a free, public collection of Arabic letter data. The dataset contains 16,800 characters written by sixty individuals whose ages ranged from 19 to 40, 90% of whom were right-handed [8].

C. Proposed Method
To compare the performance of each CNN architecture with transfer learning, our first step is to prepare the datasets. This paper uses the same datasets as the previous paper, namely AHCD and Hijja. It also follows the same dataset split configuration: 60% for training, 20% for validation, and 20% for testing [7]. Each selected CNN architecture is trained using transfer learning with pre-trained weights from ImageNet. Data augmentation is also applied to all models with the following parameters. After the dataset has been split and augmented, we build a model from each architecture without its original FC layer and add our own FC layer to classify the Hijja and AHCD datasets. The FC layers use 50% dropout to reduce overfitting. Once the models have been built, they are trained in two steps: first only the FC layers are trained, and then all layers are trained.
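The 60/20/20 split can be sketched as follows. This is a minimal illustration, not the authors' procedure: the paper does not say how samples were shuffled or how rounding was handled, so the seeded shuffle, the integer rounding, and the name `split_indices` are all assumptions.

```python
import random
from typing import List, Tuple


def split_indices(n_samples: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and split them 60% / 20% / 20%."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = n_samples * 60 // 100       # integer arithmetic avoids float rounding
    n_val = n_samples * 20 // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    for name, size in [("AHCD", 16800), ("Hijja", 47434)]:
        train, val, test = split_indices(size)
        print(name, len(train), len(val), len(test))
```

Under these assumptions AHCD splits into 10,080 / 3,360 / 3,360 samples. A stratified split that preserves class proportions would be a reasonable alternative, but the paper does not state which was used.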
The FC layers are trained first with the convolution layers frozen, meaning that the convolution weights are not updated during training. Once the FC layers are trained, all layers are trained with a small learning rate. The FC layers are trained for 20 epochs of feature extraction, and all layers are then trained for 80 epochs of fine-tuning, which makes 100 epochs in total to train the whole model. The training parameters differ according to the dataset, and this whole process constitutes transfer learning with fine-tuning.
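The two-step schedule can be summarized with a small framework-independent sketch. It only tracks which layers would be updated in each phase; the `Layer` and `Model` classes and the layer names are hypothetical stand-ins, and no real training is performed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layer:
    name: str
    trainable: bool = True


@dataclass
class Model:
    base: List[Layer]  # pre-trained convolutional layers
    head: List[Layer]  # newly added FC layers

    def trainable_names(self) -> List[str]:
        return [layer.name for layer in self.base + self.head if layer.trainable]


def set_base_trainable(model: Model, trainable: bool) -> None:
    """Freeze or unfreeze every layer of the pre-trained base."""
    for layer in model.base:
        layer.trainable = trainable


def training_plan(model: Model) -> List[Tuple[str, int, List[str]]]:
    """Return (phase name, epochs, layers updated) for both phases."""
    plan = []
    set_base_trainable(model, False)  # phase 1: only the FC head is updated
    plan.append(("feature_extraction", 20, model.trainable_names()))
    set_base_trainable(model, True)   # phase 2: all layers, small learning rate
    plan.append(("fine_tuning", 80, model.trainable_names()))
    return plan


if __name__ == "__main__":
    model = Model(base=[Layer("conv1"), Layer("conv2")], head=[Layer("fc1"), Layer("out")])
    for phase, epochs, names in training_plan(model):
        print(phase, epochs, names)
```

In a deep learning framework the same idea is usually expressed by toggling a per-layer trainable flag and re-compiling the model before the second phase.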

Each model is trained with the parameters in Table 3 according to the model's dataset. After this training, the model is re-compiled for fine-tuning with two optimizers, Adam and Stochastic Gradient Descent (SGD). Each optimizer is used with three learning rates: 0.001, 0.0001, and 0.00001. In summary, each architecture is re-compiled for fine-tuning with the following optimizers and learning rate configuration. Once all the models with the various architectures and parameters are trained, they are compared and sorted based on their validation and testing accuracy.
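The fine-tuning configurations described above form a small grid, which can be enumerated as in the sketch below. The architecture, dataset, optimizer, and learning rate values come from the text; the assumption that every combination is run on both datasets, and the function name `fine_tuning_grid`, are illustrative.

```python
from itertools import product
from typing import Dict, List, Union

ARCHITECTURES = ["ResNet", "DenseNet", "VGG16", "VGG19", "InceptionV3", "MobileNet"]
DATASETS = ["AHCD", "Hijja"]
OPTIMIZERS = ["Adam", "SGD"]
LEARNING_RATES = [1e-3, 1e-4, 1e-5]

Config = Dict[str, Union[str, float]]


def fine_tuning_grid() -> List[Config]:
    """Enumerate every dataset / architecture / optimizer / learning-rate combination."""
    return [
        {"dataset": d, "architecture": a, "optimizer": o, "learning_rate": lr}
        for d, a, o, lr in product(DATASETS, ARCHITECTURES, OPTIMIZERS, LEARNING_RATES)
    ]


if __name__ == "__main__":
    grid = fine_tuning_grid()
    per_architecture = len(OPTIMIZERS) * len(LEARNING_RATES)
    print(per_architecture, len(grid))
```

Under these assumptions each architecture is fine-tuned in 6 configurations per dataset, for 72 runs in total.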

III. RESULT AND DISCUSSION
This paper uses two datasets, Hijja and AHCD. Each dataset is divided into training, validation, and testing sets of 60%, 20%, and 20%, respectively. The datasets are then augmented with the configuration described in the previous section, and the augmented data are used as input for the models. Each model uses a different selected CNN architecture, and each is trained for 20 epochs of feature extraction, in which only the FC layer is updated, followed by 80 epochs of fine-tuning. Each model uses the layer configuration in Figure 3, with 50% dropout on the FC layer and a number of neurons that depends on the selected CNN architecture, see Table 5.

Architecture   FC layer 1 neurons   FC layer 2 neurons   Output neurons (AHCD)   Output neurons (Hijja)
DenseNet       1024                 512                  28                      29
InceptionV3    1024                 512                  28                      29
MobileNet      1024                 512                  28                      29
ResNet50       1024                 512                  28                      29
VGG16          512                  256                  28                      29
VGG19          512                  256                  28                      29

After the model has been built, we train it in two steps. First, we train the FC layer with the convolution layers frozen, which keeps the weights of the convolution layers unchanged while the newly created FC layer is trained. The FC layer is trained with the epochs and steps per epoch from the configuration in Table 2. Once the FC layer has been trained, we re-compile the model with the parameters from Table 3 and continue training with the new parameters. After training is complete, we can compare each model's performance.

The training results for the selected CNN architectures can be seen in Table 5, Table 6, and Table 7. On the AHCD dataset, VGG16 has the highest validation and testing accuracy, and VGG19 has the highest training accuracy. MobileNet has the lowest training accuracy, and DenseNet has the lowest validation and testing accuracy. On the Hijja dataset, VGG16 has the highest training, validation, and testing accuracy, while MobileNet has the lowest training, validation, and testing accuracy.
Table 5 and Table 6 show the validation and testing accuracy of each trained model. Based on Table 5 and Table 6, VGG16 with the Adam optimizer and a fine-tuning learning rate of 0.0001 is the best architecture for classifying the AHCD and Hijja datasets. Table 7 shows the top model of each architecture. The results show that the models perform better overall on the AHCD dataset. This could be because Hijja, which consists of Arabic script written by children, is more difficult to classify. This study agrees with previous research [6], in which the Adam optimizer worked better than SGD. This can be seen in Table 7, where the top model for each architecture uses the Adam optimizer, and the best fine-tuning learning rate is 0.0001. Based on the results that have been presented, the following are the differences obtained from previous relevant studies.

IV. CONCLUSIONS
The results report each model's training, validation, and testing accuracy. Table 5 and Table 6 show the models sorted by validation and testing accuracy, and Table 7 shows the top model from each architecture. The models perform better overall on the AHCD dataset. This may be because Hijja, which consists of Arabic writing by children, is harder to classify. This paper reaches the same conclusion as the previous paper, which found that Adam works better than SGD [7]: Table 7 shows that all the top models from each architecture use the Adam optimizer instead of SGD. Table 7 also shows that the best fine-tuning learning rate for CNN transfer learning architectures using the Adam optimizer is 0.0001. The choice of best architecture may be biased, because all models share the same parameters and these may favor certain architectures. With the parameters from Table 3 and Table 4, however, we conclude that VGG16 with the Adam optimizer and a 0.0001 fine-tuning learning rate is the best architecture for classifying the Hijja and AHCD datasets.