The Effect of Layer Batch Normalization and Dropout of CNN model Performance on Facial Expression Classification

Abstract: One of the applications of face recognition is facial expression recognition, in which a machine recognizes facial expression patterns from observed data. This study used two convolutional neural network (CNN) models, model A and model B. Model A had no batch normalization or dropout layers, while model B used both. Each model used a four-layer arrangement with ReLU and Softmax activations and two fully connected layers to classify five facial expressions: angry, happy, normal, sad, and shock. The research method is divided into data analysis, pre-processing (gray scaling), the Convolutional Neural Network (CNN), and model validation testing. The study obtained an accuracy of 64.8% on training data and 63.3% on validation data. The dropout and batch normalization layers kept the training and validation data stable, so that no overfitting occurred. The batch size of the training data was divided into 50% with 200 iterations to lighten the load on each training model, and a learning rate of 0.001 was used to update the weight values, so that training was fast without crossing the minimum error limit. The accuracy of facial expression classification at a camera-to-face distance of about 30 cm, indoors with sufficiently bright lighting, was 78%.


I. INTRODUCTION
One long-investigated problem in computer vision is the general classification of objects in an image. The goal is to duplicate the human ability to understand image information so that a computer can recognize objects in an image as humans do. The feature engineering process is limited because it applies only to certain datasets and cannot be generalized to any type of image, owing to differences in viewing angle, scale, lighting conditions, object deformation, and other factors [1].
The face is a multi-signal, multi-message system: it can display one kind of signal that carries many messages [2]. Facial expressions can be classified into basic expressions: anger, pleasure, fear, surprise, disgust, and sadness [3]. A further challenge in classifying facial emotions is that emotional intensity ranges from subtle to pronounced across a sequence of images or a video [4].
The Convolutional Neural Network (CNN) architectural model is configured to initialize parameters so as to speed up network training, and the CNN model is used for construction [5]. In [6], CNN-based recognition consists of two phases, the training phase and the testing phase. Both phases use only features in the form of eye locations to carry out the pre-processing stage, which includes spatial normalization, synthetic sampling, image cropping, sampling, and intensity normalization. The two phases differ slightly in that the testing phase skips synthetic sample generation, namely the creation of duplicate images to which random perturbations such as translation, rotation, and skew have been applied. This step enlarges the database in order to increase accuracy in the training phase. In addition, the output of the training phase is the set of CNN weights, which are later used in the testing phase to output an expression type [6]. In [7], the system detects and recognizes faces at a distance of 0-40 cm; beyond 40 cm, it cannot detect and recognize faces optimally. In [8], a convolutional neural network is used to recognize facial expressions on the JAFFE dataset under several parameter variations, using a 3-layer training architecture. In [9], a CNN-based classification model for recognizing emotions in video is presented, with data distribution equalization applied to improve model performance.
Facial Expression Recognition (FER) has become a current topic of study in human-computer interaction, with applications including digital entertainment, customer service, driver monitoring, and emotional robotics. Many studies and methods have been developed to increase accuracy and reduce loss during model training, including the addition of dropout and batch normalization layers. This study aims to identify the effect of these two layers by comparing a model that includes them with a model that does not. The Convolutional Neural Network (CNN) method is a technique for processing two-dimensional data, in which the data are processed not in one dimension only but in two dimensions [6]. CNN is a machine learning method developed from the Multi-Layer Perceptron (MLP) and designed to process two-dimensional data; it has been combined with Haar Cascade for face detection and recognition in real-time photos [10]. An early neural network model for a mechanism of visual pattern recognition was proposed in [11]. A good image classification model must be invariant to the cross-product of all these variations [12]. Training deep neural networks is complicated by the fact that the distribution of each layer's inputs changes during training [13]. Deep learning performs very well in computer vision, for example in classifying objects in an image [1].

When a face is detected, the system creates a red bounding box, and the face image is then passed to the feature extraction process. Classification with the CNN method is also relatively robust to parameter changes, as the accuracy results are not greatly affected by changes in the confusion level. The ultimate goal of this study is to help detect the facial expressions angry, happy, normal, sad, and shock in real time and to determine the accuracy of emotion recognition from facial expressions under varying indoor lighting and camera distance.

II. MATERIALS AND METHODS
The procedure of this study is presented in Figure 1 below.

Fig. 1 The Procedure of the Study

A. Data Analysis
This study used facial expression recognition data from Kaggle (https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge). The data consist of two columns, class and array: the class column gives the image class (angry, happy, sad, shocked, or normal), while the array column holds the pixel values of each image. The data comprise 24256 records for training and 3006 for validation. The images were collected randomly with respect to angle, position, and light intensity. Table I below shows the facial expression class data. The processed data were divided into training and validation data. The model with the best accuracy was stored in a file with the extension h5 containing the weights of the evaluated parameters. The validation data were then used to test and compare the produced models.
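The two-column record layout described above can be illustrated with a short sketch. It assumes, as in the CSV distributed for the Kaggle challenge, that the array column is a space-separated string of grayscale pixel values; the function name `parse_record` and the tiny 2x2 example image are hypothetical and are used only for illustration.

```python
def parse_record(label, pixel_string, width, height):
    """Turn one (class, array) record into a label and a 2-D image.

    Assumption: pixel_string holds width * height space-separated integers.
    """
    values = [int(v) for v in pixel_string.split()]
    if len(values) != width * height:
        raise ValueError("pixel count does not match width * height")
    # Slice the flat pixel list into rows of `width` pixels each.
    image = [values[r * width:(r + 1) * width] for r in range(height)]
    return label, image


# Hypothetical 2x2 record, far smaller than a real face image.
label, image = parse_record("happy", "0 255 128 64", width=2, height=2)
print(label, image)  # happy [[0, 255], [128, 64]]
```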

B. Pre-processing Gray scaling
Before facial expression detection, each image was first normalized to a 1.0 rotational scale, converted to grayscale, and resized to 64x64 pixels.
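The grayscale conversion and resizing steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the luminance weights 0.299, 0.587, and 0.114 are the common ITU-R BT.601 coefficients, nearest-neighbour sampling stands in for whatever interpolation the study used, and dividing by 255 is one possible reading of the normalization step. The study itself reports using OpenCV, which is not used here.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def resize_nearest(image, new_h, new_w):
    """Resize a 2-D image with nearest-neighbour sampling (assumed method)."""
    old_h, old_w = len(image), len(image[0])
    return [[image[i * old_h // new_h][j * old_w // new_w]
             for j in range(new_w)]
            for i in range(new_h)]


def preprocess(rgb_image, size=64):
    """Grayscale, resize to size x size, and scale pixels to [0, 1]."""
    gray = to_grayscale(rgb_image)
    resized = resize_nearest(gray, size, size)
    return [[p / 255.0 for p in row] for row in resized]


# Hypothetical 2x2 colour image: white top row, black bottom row.
tiny = [[(255, 255, 255), (255, 255, 255)],
        [(0, 0, 0), (0, 0, 0)]]
prepared = preprocess(tiny, size=64)
print(len(prepared), len(prepared[0]))  # 64 64
```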

C. Convolutional Neural Network (CNN)
Machine learning and deep learning have achieved strong performance in image recognition [23]. The CNN working procedure is almost the same as that of the MLP, but each neuron in a CNN is represented in two-dimensional form, whereas in the MLP it has only one dimension. In a CNN, the data propagated through the network are two-dimensional, the linear operation is a convolution, and the weights are a collection of four-dimensional convolution filters defined by input neurons, output neurons, height, and width. Due to the nature of the convolution process, CNN can only be used on data with a two-dimensional structure, such as images. Based on the LeNet architecture, a CNN consists of four layers: convolution layer, subsampling layer, fully connected layer, and activation function [1]. CNN is a type of neural network that can be applied to images [19].
Parameter implementation of the Convolutional Neural Network (CNN): This study used two architectural models. The first used batch normalization and dropout layers, and the second used neither batch normalization nor dropout layers. The two models were compared in terms of accuracy and training error. The CNN architectures are presented in Figures 2 and 3 below, using equation (1).

1) Convolution:
The convolution stage repeatedly multiplies the input data by a function (filter) to obtain a feature map. The filter values used in this study were randomly initialized following a normal distribution, with the size and number of filters as described in Figures 3 and 4 above.
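As an illustration of how a filter slides over the input to produce a feature map, the following sketch implements a single-channel, no-padding convolution in plain Python. As in most deep learning frameworks, the kernel is not flipped (strictly a cross-correlation). The helper names, the unit standard deviation of the random filter, and the 3x3 example are assumptions made for illustration and are not taken from the study.

```python
import random


def random_filter(size, seed=0, std=1.0):
    """Draw a size x size filter from a normal distribution (std is assumed)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(size)] for _ in range(size)]


def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image without padding; return the feature map."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Multiply the k x k window by the kernel and sum the products.
            for di in range(k):
                for dj in range(k):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
diagonal_difference = [[1, 0],
                       [0, -1]]
print(convolve2d(image, diagonal_difference))  # [[-4.0, -4.0], [-4.0, -4.0]]
```

With a randomly initialized filter, `convolve2d(image, random_filter(2))` yields a feature map of the same 2x2 shape whose values depend on the seed.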

2) Fully Connected layer (FC):
The feature map produced by feature extraction is a multidimensional array, so it must be flattened or reshaped into a vector before it can be used as the input of the fully connected (FC) layer. In the FC layer, all activation neurons of the previous layer are connected to the neurons of the next layer, as in an ordinary artificial neural network. Each dimension can be connected to all neurons in the fully connected layer. The FC layer is commonly used in the multi-layer perceptron method and processes the data so that they can be classified based on the differences between layers. The number of layers used affects the processing time and the accuracy on test data [15].
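The flattening step and a fully connected layer followed by Softmax can be sketched as below. The weights, biases, and the two-class output are hypothetical values chosen for illustration; the study's models use two fully connected layers and five output classes.

```python
import math


def flatten(feature_maps):
    """Reshape a list of 2-D feature maps into a single 1-D vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def dense(vector, weights, biases):
    """Fully connected layer: every input is connected to every output neuron."""
    return [sum(w * x for w, x in zip(w_row, vector)) + b
            for w_row, b in zip(weights, biases)]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


feature_maps = [[[1, 2], [3, 4]]]      # one hypothetical 2x2 feature map
vector = flatten(feature_maps)         # [1, 2, 3, 4]
weights = [[1, 0, 0, 0],               # hypothetical weights, two outputs
           [0, 0, 0, 1]]
biases = [0.0, 0.5]
scores = dense(vector, weights, biases)
print(scores)                          # [1.0, 4.5]
print(softmax(scores))
```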
3) ReLU Activation:
With ReLU, stochastic gradient descent converges faster than with sigmoid or tanh. ReLU can also be implemented simply by thresholding at zero. However, ReLU units can be fragile during training and may stop functioning. For example, the study found that 40% of the networks were not functioning when the learning rate was initialized too high, at lr=0.01 rather than lr=0.001. The equation of ReLU is [15]: $f(x) = \max(0, x)$. As it only applies a threshold at zero, if $x \le 0$ then $f(x) = 0$, and if $x > 0$ then $f(x) = x$ [20].
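The thresholding behaviour of ReLU is short enough to state directly in code. The gradient helper is added only to illustrate why a unit whose input stays non-positive stops updating; the function names are chosen for this sketch.

```python
def relu(x):
    """ReLU activation: threshold the input at zero."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_grad(x))
# A unit whose input is always <= 0 receives zero gradient and stops learning.
```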

4) Pooling Layer:
Pooling is an important concept in CNN. It lowers the computational burden by reducing the number of connections between convolutional layers, and several pooling methods have been used in CNNs [17].
The process is the same as in the convolution layer: a filter is moved over the input with an adjustable stride, as described above, and the resulting output size is given by the following equation [18]:

$$\frac{n - f}{s} + 1 \tag{2}$$

The notation in the above equation is defined as follows: n = input size of the feature map, f = filter size, and s = stride.
The main goal of the pooling layer is to reduce the number of parameters of the input tensor, which helps reduce overfitting, extracts representative features from the input tensor, and reduces computation, thereby aiding efficiency [21].
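The relation between n, f, and s and the pooled output can be checked with a short sketch. It assumes the usual no-padding output size (n - f)/s + 1 and uses max pooling with a 2x2 filter and stride 2 as an example; the study does not state which pooling variant or filter size was used.

```python
def pooled_size(n, f, s):
    """Output size for input size n, filter size f, stride s (no padding)."""
    return (n - f) // s + 1


def max_pool2d(feature_map, f=2, s=2):
    """Keep the maximum of each f x f window, moving s steps at a time."""
    out_h = pooled_size(len(feature_map), f, s)
    out_w = pooled_size(len(feature_map[0]), f, s)
    return [[max(feature_map[i * s + di][j * s + dj]
                 for di in range(f) for dj in range(f))
             for j in range(out_w)]
            for i in range(out_h)]


print(pooled_size(64, 2, 2))  # 32

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(max_pool2d(fmap))       # [[6, 8], [14, 16]]
```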

5) Layer Batch Normalization:
The change in the distributions of the internal nodes of a deep network during training is referred to as Internal Covariate Shift, and eliminating it offers a promise of faster training. Batch Normalization was proposed in [13] as a mechanism that takes a step toward reducing internal covariate shift and, in doing so, dramatically accelerates the training of deep neural networks. It accomplishes this through a normalization step that fixes the means and variances of layer inputs. Batch Normalization also benefits the gradient flow through the network by reducing the dependence of gradients on the scale of the parameters or their initial values. This allows much higher learning rates to be used without the risk of divergence. Furthermore, batch normalization regularizes the model and reduces the need for dropout [13].
To reduce internal covariate shift and increase model accuracy so that the resulting model is efficient, a Batch Normalization (BN) layer can be used: after the convolution layer, the BN layer redistributes the values automatically at the time of layer activation. In this study, a BN layer was added in each of the dimensions, because between connected BN layers the input distribution range of each layer remains the same regardless of changes in the previous layer.
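The normalization step that fixes the mean and variance of layer inputs can be sketched for a single neuron over one mini-batch. The scale `gamma`, shift `beta`, and the small constant `eps` follow the usual formulation of batch normalization; their default values here are assumptions for illustration and are not settings reported by the study.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one neuron's activations over a mini-batch.

    gamma and beta are the learnable scale and shift; eps avoids division
    by zero. All default values are illustrative assumptions.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


activations = [1.0, 2.0, 3.0]  # hypothetical mini-batch for one neuron
normalized = batch_norm(activations)
print(normalized)              # values centred on 0 with variance close to 1
```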
The normalization applied by the BN layer can be written as [4]:

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

where $x^{(k)}$ = input of the k-th neuron, $\mathrm{E}[x^{(k)}]$ = population mean of the input of each k-th neuron, and $\mathrm{Var}[x^{(k)}]$ = variance of the input of each k-th neuron.

6) Dropout:
Dropout is a technique that prevents overfitting and provides a way of approximately combining many different network architectures efficiently [14]. Regularization of the training model can be achieved with a dropout layer, which limits excessive training error. In principle, each neuron is connected to the others after activation, which can cause overfitting and a large training error. Dropout randomly disables connections to neurons according to a value p, where p applies to the hidden layer that is connected to other layers. The standard parameter for the dropout layer is 0.5, although the optimal probability of retention is usually closer to 1 than 0.5 [14]. The term dropout also appears in education research with a different meaning: the dropout rate of online university classes is higher than that of offline classes because students must manage and control their study time without the help of a professor or manager, so professors and managers must support students in a timely way to avoid the risk of dropout in online university classes [24].
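A minimal sketch of dropout applied to a vector of activations follows. It treats the value 0.5 as the probability of retaining a unit and rescales the surviving activations by 1/keep_prob during training (the "inverted dropout" variant), so that nothing needs rescaling at test time. Both choices are assumptions made for illustration, not details reported by the study.

```python
import random


def dropout(activations, keep_prob=0.5, rng=None, training=True):
    """Randomly zero activations during training (inverted dropout).

    keep_prob is the assumed probability of retaining a unit. Kept
    activations are divided by keep_prob so their expected value is unchanged.
    """
    if not training:
        return list(activations)   # no units are dropped at test time
    rng = rng or random.Random()
    dropped = []
    for a in activations:
        keep = rng.random() < keep_prob
        dropped.append(a / keep_prob if keep else 0.0)
    return dropped


hidden = [1.0] * 10                # hypothetical hidden-layer activations
print(dropout(hidden, keep_prob=0.5, rng=random.Random(0)))
print(dropout(hidden, training=False))
```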

III. RESULTS AND DISCUSSION
Dropout and batch normalization are well-recognized approaches to tackling challenges such as overfitting during training [22]. Training and validation testing for classifying emotional expressions on a person's face used hardware consisting of an AMD-87410 processor, 12 GB RAM, and AMD Radeon R5 2.2 GHz graphics, and software consisting of Google Colab, Visual Studio Code, and Anaconda, with hardware libraries, NumPy, and OpenCV.

A. Model Validation Testing
This study used two models, A and B. Model A does not use batch normalization and dropout layers, and model B uses both. Comparisons of how the two models work are presented in Tables II and III.

Based on Figure 4, in model A all networks are active or connected, but this affects backpropagation: each weight becomes difficult to update, which causes overfitting, where the training result is good but the result on data that have not been trained is not. In Figure 5, model B used batch normalization, and the results showed that some networks are not activated, which reduces constraints in the weight update process and prevents overfitting.

B. Comparison between the Training Models during Iteration
Model A and Model B were compared at the 50 and 100 iteration stages. The comparison between the two models at each iteration can be seen in Figure 6. Based on Table IV, the model reached the 100th iteration stage, at which the agreement between the training data and the validation data was checked to create a graph. Generally, the validation error is lower than the training error, and there is no large gap between the two. The graph can be seen in Figure 7.

With reference to Figure 4, the added batch normalization layer normalizes the input by adjusting and scaling activations and reduces internal covariate shift. The dropout layer prevents overfitting and accelerates the learning process. Dropout refers to disabling units in the hidden and visible layers of the network.

C. Comparison of Accuracy
Based on Table V, model A shows a large difference between training and validation accuracy of about 6.7 points. Meanwhile, model B does not show a large difference between training and validation accuracy (0.7 points). See also Table VI below.

Figure 6 shows the effect of the number of iterations, the batch size, and the added batch normalization layers, which regulate backpropagation so that weight changes in each training step do not cause overfitting. The dropout layer temporarily deactivates neurons to reduce changes in weight values. Thus, large-scale weight increases on individual neurons can be adjusted by the model.

D. Modeling
This system is divided into two stages of classifying facial expressions as follows:
1) Input: At this stage, the face was recorded while the person faced the camera. The system began with frame acquisition and continued with grayscale pre-processing. Human form detection was then performed using a frame difference that detects movement. If no motion is detected, the system returns to the video frame acquisition process. If motion is detected, the system proceeds to face detection using the Viola-Jones method with the Haar Cascade Classifier algorithm. When a face is detected, the system creates a red bounding box, and the face region is saved and passed to the feature extraction process.
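The frame-difference check that decides whether the system proceeds to face detection can be sketched as follows. Frames are assumed to be grayscale images stored as lists of rows; the two threshold values and the function name `motion_detected` are hypothetical and would need tuning for a real camera.

```python
def motion_detected(prev_frame, curr_frame,
                    pixel_threshold=25, min_changed_fraction=0.01):
    """Report motion when enough pixels differ between two grayscale frames.

    Both thresholds are illustrative assumptions, not values from the study.
    """
    changed = 0
    total = 0
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        for p, c in zip(prev_row, curr_row):
            total += 1
            if abs(c - p) > pixel_threshold:
                changed += 1
    return total > 0 and changed / total >= min_changed_fraction


still = [[10, 10, 10, 10] for _ in range(4)]
moved = [row[:] for row in still]
moved[1][2] = 200   # one pixel changes strongly between the two frames

print(motion_detected(still, still))  # False: stay in frame acquisition
print(motion_detected(still, moved))  # True: continue to face detection
```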
2) Classification: This study used the Convolutional Neural Network method for feature extraction. Convolutional neural networks (CNNs) have shown great performance in various fields, such as image classification [16]. The output of the input stage is then processed against the features stored as a model. Before that, it passes through grayscale pre-processing, resizing, and layer-wise extraction of each image's data using the pre-trained model and the filters obtained from the training process. The stages of training and testing in recognizing facial expressions are as follows:
- Image pre-processing: before the face detection process, the image is converted to grayscale and resized.
- Feature learning: after pre-processing, features are extracted at each layer so that the features of each image are learned using the CNN method.
- Classification: after the feature learning process, classification is performed on each processed image feature.
- Face detection: the study used the Viola-Jones method.

1) Lighting Condition Testing:
To test the lighting conditions in the room and the camera distance, pictures were taken directly with a webcam. The indoor lighting was divided into three light sources: sun, room light, and camera light. The distance from the camera to the face was 30 cm and 60 cm. Based on Figure 7, the best results were obtained indoors using sunlight, 11-watt lamps, and camera light at a camera-to-face distance of 30 cm, with an average of 78%.

2) Facial Expression Testing
The test covered the facial emotional expressions angry, happy, normal, sad, and shocked, and was conducted directly with a webcam at camera-to-face distances of 30 cm and 60 cm. The highest averages were obtained for the happy (93.7%) and angry (81.5%) expressions at a distance of 30 cm. The detailed results can be seen in Figure 8.

3) Testing the Running Application:
In testing the running application, the application detected objects and focused on faces. Thus, the application can show facial expression information for the presented object, as shown in Figure 9.

IV. CONCLUSION
Of the two models used in the study, the first did not use dropout and batch normalization layers, and the second used both. The first model could not keep the data stable against overfitting, while the second model maintained the stability of the training and validation data during testing so that the data did not overfit. Model validation testing obtained an accuracy of 64.8% on training data and 63.3% on validation data. In the testing process, the batch size of the training data was divided into 50% with 200 iterations to lighten the load on each training model. Using a learning rate of 0.001 to correct the weight values made training faster without crossing the minimum error limit. The accuracy in classifying facial expressions at a camera distance of 30 cm from the object, indoors with bright lighting, reached 78%. Further research is suggested to use data augmentation and combined deep learning methods to obtain higher accuracy, together with a higher-quality camera.