Implementation of Convolutional Neural Network and Long Short-Term Memory Algorithms in Human Activity Recognition Based on Visual Processing Video

Human Activity Recognition (HAR) is an active research topic, especially for identifying human movement in video-based security surveillance and for detecting symptoms of an illness from movement. In this research, HAR is the key to better understanding the semantics contained in video in order to find patterns of human movement, especially in sports. This study applies a combination of the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) algorithms, with several variations of the dropout and batch size parameter values, to convert the patterns in the video into image form and produce a HAR model. The convolution layers extract spatial features from each frame. The extracted features are fed to the LSTM layer to model the temporal sequence of human movement. In this way, the network learns spatiotemporal features directly through end-to-end training to produce a robust model. The test data are 10 sports activities obtained from related research at the University of Central Florida (UCF). The results show reasonably good performance, with a loss of 0.4 and an accuracy of 0.94, although some sports activities were still misclassified because their movements are similar. Further research should reduce the loss value, which is still high enough that activities with similar movements are occasionally misclassified.


I. INTRODUCTION
Analyzing human activities and detecting events in video is often done manually by repeatedly observing human motion in CCTV footage, which consumes considerable time and energy. Video-based HAR is a challenging research field today, particularly for observing sports activities [1], [2], [3]. HAR is a key to understanding the semantics of video content [4]. Recognizing human activities in video, with applications such as video content capture, area surveillance for office security, private services, and human-computer interaction, has gained significant attention in computer vision [5], [6].
Deep learning is among the most widely used techniques in machine learning and offers powerful features for identifying human actions and behavior from video. The Convolutional Neural Network (CNN) is a deep learning architecture used in HAR that applies convolution operations to learn from video frames during training [7].
Activity recognition that uses the pre-trained weights of a machine learning architecture to represent video frames visually at the training stage affects how features are determined, such as the difference between visual and temporal features. One deep learning approach to HAR uses Bi-Directional Long Short-Term Memory (BiLSTM) and Dilated Convolutional Neural Network (DCNN) architectures, focusing on input frame features that are effective for recognizing several human actions in a video [8]. Recurrent convolutional architectures can be trained end-to-end for large-scale visual understanding, covering activity recognition and text, image, and video description. A camera-based RNN approach to HAR, applied to health services and social care, achieved 99.55% average accuracy and recognized twelve human activities [9].
One HAR approach uses a skeleton joints technique and focuses on noise that makes joint points irrelevant and unsteady, which is the main cause of performance deterioration. Another study used VGG-16 transfer learning to obtain image features, with classification by a CNN, for human activity recognition. The accuracy reported by that study reached 96.95% [10].
A subsequent study examined a two-stream structure that uses LSTM for the spatial data stream when extracting motion from video, utilizing the spatial and temporal features of RGB frames. Furthermore, other HAR studies focused on multi-class classification to increase accuracy at low computational cost and to reduce model complexity by eliminating processes required by advanced feature techniques [11], [12].
Liu et al. [13] studied an architecture that combines the 3DCNN and LSTM methods. The proposed model can stack video frames, extract temporal and spatial features, and train on action videos to achieve reliable recognition performance. The LSTM acts as a bridge between frames at different times, carrying information from previous frames forward.
This study builds on that previous research by implementing a combination of the CNN and LSTM methods. Convolutional layers extract the spatial features of the action from video frames, and these features become the input to the LSTM, which forms the temporal part of the HAR network architecture. In this way, the network is expected to learn spatiotemporal features directly through end-to-end training, yielding a reliable HAR model for identifying human actions.

II. MATERIALS AND METHOD
Several previous studies have addressed HAR on video using various methods, such as the study entitled Human action recognition using attention-based LSTM network with dilated CNN features [4]. That study focuses on action recognition in video using pre-trained weights, and notes that the way a machine learning architecture represents video frames visually at the training stage can affect the resulting features, such as the difference between visual and temporal features. It proposes a HAR architecture that uses Bi-Directional Long Short-Term Memory (BiLSTM) and a Dilated Convolutional Neural Network (DCNN) with reduced video detection, focusing on input frame features that can effectively recognize various human actions in videos.
Related research discusses human action recognition using LSTM and fully connected LSTM with different input data. That study aims to discover features for action recognition in video with the STDAN network architecture, which combines Convolutional LSTM and Fully Connected LSTM. Another study is entitled Long-term Recurrent Convolutional Networks for Visual Recognition and Description [5]. It focuses on a class of recurrent convolutional architectures that can be trained end-to-end and are suitable for large-scale visual recognition tasks, including activity recognition and text, image, and video description.
Park et al. [6] used a depth-camera-based RNN method for HAR in health and social care services. The proposed method achieves an average recognition accuracy of 99.55% and can reliably recognize twelve human activities. Arif et al. [7] proposed combining the 3D-CNN method with LSTM. In their approach, a 3D convolutional network condenses the raw motion information from the video into a representation called a motion map. Combining motion maps with video frames allows the length of the training videos to be increased iteratively. A linear weighted fusion scheme then fuses the network features into spatiotemporal features, and an LSTM encoder-decoder carries out the final prediction.
Another HAR study used the skeleton joints technique, entitled Human Activity Recognition Based On Optimal Skeleton Joints Using Convolutional Neural Networks [8]. It focuses on noise that causes irrelevant and immovable joint points, which is the main cause of decreased HAR performance [8]. Zheng et al. [9] used the VGG-16 transfer learning method to obtain deep image features, with a CNN-based classifier, for human activity recognition. The accuracy of their proposed method reached 96.95%. Zhao et al. [10] improved a two-stream model for human action recognition, examining a two-stream structure that uses LSTM for the spatial data stream to extract video motion from the spatial and temporal features of RGB frames [28], [29].
Other HAR research focuses on multi-class classification to increase accuracy at low computational cost [11], [12], [30], [31], [32] and to reduce model complexity by eliminating processes required by advanced feature techniques. Some previous studies designed architectures that combine the 3DCNN and LSTM methods [13], [26], [27]. In these studies, the proposed model is capable of stacking video frames and extracting temporal and spatial features [14], [15], [16], as well as training on action video data to achieve good recognition performance [17], [18]. The LSTM links frames at different times so that information from previous frames is retained [19], [20], [21].
Based on the related research, each HAR method has advantages and disadvantages for designing machine learning model architectures that identify human movement. This study was designed based on the results of those previous studies and uses a combination of the CNN and LSTM methods. In the HAR process, the convolution layers extract the spatial features of the movement from each frame, and these features are fed to the LSTM layer [22], [23], [24]. The LSTM then forms the temporal part of the HAR network architecture. In this way, the network learns spatiotemporal features directly through end-to-end training to produce a good HAR model for identifying human movement [25].

III. RESULTS AND DISCUSSION
The research data used in this study is the UCF50 dataset. The research builds a machine learning model that combines the CNN and LSTM algorithms for HAR and tests quality parameters using the confusion matrix method, which indicates how well the chosen algorithms, hyperparameters, and model architecture perform. The research stages are presented as a whole in Figure 1.

The modeling process begins with designing the machine learning model architecture, shown in Figure 2. Four convolutional layers use ReLU activation, which clips values at zero to determine whether a neuron is active, so that only neurons related to the objects are selected. These layers are followed by MaxPooling2D, which reduces the spatial size of the input and therefore the number of parameters in later layers, and by Dropout, which reduces overfitting. The convolution and flattening layers are wrapped in a Time Distributed layer, which is used to process sequence or time-series data and applies the wrapped layer to every temporal slice of the input. The features extracted by the Conv2D layers are converted to a vector by the Flatten layer and fed to the LSTM layer. The Dense (fully connected) layer uses the SoftMax activation function on the output of the LSTM layer to predict the action being performed.

The system is tested with testing data to determine the performance of the model using the confusion matrix method. The confusion matrix is used to determine the accuracy, precision, recall, F1-score, and support values. The equations used to determine accuracy, precision, recall, and F1-score from the confusion matrix are given next. The confusion matrix is presented as a whole in Table 1.

In the formulas that follow, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.

Accuracy is a performance value based on the degree of closeness between the predicted values and the actual values. It illustrates how often the model classifies correctly. Accuracy is calculated with formula 1:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

Precision is the ratio of the amount of relevant information selected by the system to the total amount of information selected. It can be interpreted as the match between the information requested and the predictions provided by the model. Precision is calculated with formula 2:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

Recall is the ratio of the amount of relevant information the system selects to the total amount of relevant information available. It describes the success of the model in finding relevant information. Recall is calculated with formula 3:

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]

F1-Score is a metric that measures model performance as the harmonic mean of precision and recall. F1-Score is calculated with formula 4:

\[ \text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]

The model's design consists of several processes for classifying human activities, starting with the input data extracted from the videos, which serves as the learning resource. The input data is loaded after the dataset has been divided into training, validation, and test sets. The model is built using a combination of the CNN and LSTM algorithms. After the design stage is complete, the model is trained to produce a machine learning model that can classify human activities. The machine learning model is presented as a whole in Figure 3. The model in this study uses four convolution layers, MaxPooling2D layers, dropout layers, a flatten layer, an LSTM layer, and a dense layer, as seen in Figure 4.6.
The convolutional and flatten layers are wrapped in a Time Distributed layer, which handles sequence or time-series data by applying the wrapped layer to every temporal slice of the input during training.
The input is an image with a height of 64 pixels, a width of 64 pixels, and three RGB channels. The data are processed by the first convolutional layer (see Figure 3 point 1), which extracts features with 16 filters, a 3x3 kernel, 'same' padding, and ReLU as the activation function.
The second convolutional layer (see Figure 3 point 2) extracts image features with 32 filters, a 3x3 kernel, 'same' padding, and ReLU as the activation function. As in the previous layer, MaxPooling2D with a pool size of 4x4 is applied here, and a dropout layer is added.
The third convolutional layer receives a 4x4 pixel image as input. This layer (see Figure 3 point 3) extracts image features with 64 filters, a 3x3 kernel, 'same' padding, and ReLU as the activation function. MaxPooling2D with a pool size of 2x2 and a dropout layer are added in this layer. The fourth convolutional layer is identical to the previous one, except that it processes a 2x2 pixel image as input.
The features extracted by the Conv2D layers are converted to a vector by the Flatten layer, and the result becomes the input to the LSTM layer. In the LSTM layer (see Figure 3 point 5), the result is reduced to 16 outputs, which are passed to the Dense layer, whose size matches the number of classes (labels) in the dataset. The activation function used in the Dense (fully connected) layer is SoftMax, which calculates the probability of every label from the LSTM output to predict the action performed by the human.
The model treats the labels as categorical data and therefore uses the categorical cross-entropy loss. The optimizer is Adam, chosen because it generally performs well on larger amounts of data and is efficient in computation time and memory usage.
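The layer sizes described above can be checked with a small shape trace. The following is a minimal sketch in plain Python, not the authors' code. It assumes that pooling uses a stride equal to the pool size, that the first block also uses a 4x4 pool (inferred from the 4x4 input of the third layer), and that the fourth block uses 64 filters and a 2x2 pool ("identical to the previous one"). The function names are hypothetical; the 20 frames, 16 LSTM outputs, and 10 classes are the values stated in the text.

```python
from typing import Dict, List, Tuple

# (filters, pool_size) for each convolution block. The first block's 4x4 pool
# and the fourth block's settings are assumptions inferred from the text.
CONV_BLOCKS: List[Tuple[int, int]] = [(16, 4), (32, 4), (64, 2), (64, 2)]


def trace_frame_shapes(height: int = 64, width: int = 64,
                       channels: int = 3) -> List[Tuple[int, int, int]]:
    """Return the (height, width, channels) shape after each conv block."""
    shapes = [(height, width, channels)]
    for filters, pool in CONV_BLOCKS:
        # A 3x3 convolution with 'same' padding keeps height and width and
        # sets the channel count to the number of filters. Max pooling with
        # stride equal to the pool size then divides height and width.
        height, width = height // pool, width // pool
        shapes.append((height, width, filters))
    return shapes


def trace_sequence_shapes(frames: int = 20, lstm_units: int = 16,
                          classes: int = 10) -> Dict[str, Tuple[int, ...]]:
    """Shapes seen by the LSTM and Dense layers for one video."""
    h, w, c = trace_frame_shapes()[-1]
    flat = h * w * c  # Flatten turns each frame's feature map into a vector
    return {
        "time_distributed_flatten": (frames, flat),  # one vector per frame
        "lstm": (lstm_units,),                       # last LSTM output
        "dense_softmax": (classes,),                 # one score per class
    }


if __name__ == "__main__":
    for shape in trace_frame_shapes():
        print(shape)
    print(trace_sequence_shapes())
```

Under these assumptions the third block receives a 4x4 feature map and the fourth a 2x2 one, which agrees with the description, and each frame is reduced to a 64-element vector before the LSTM.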
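To make the SoftMax output and the categorical cross-entropy loss concrete, the following is a minimal sketch for a single sample in plain Python. The function names and the `eps` clipping constant are illustrative assumptions, not part of the study's implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot: Sequence[float],
                              probs: Sequence[float],
                              eps: float = 1e-12) -> float:
    """Loss for one sample: -sum(y_i * log(p_i)) over all classes."""
    # eps guards against log(0); it is an implementation choice of this sketch.
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))


if __name__ == "__main__":
    # Ten equal scores give a uniform distribution over ten classes.
    probs = softmax([0.0] * 10)
    label = [1.0] + [0.0] * 9
    print(categorical_cross_entropy(label, probs))
```

With ten classes and a uniform prediction the loss equals ln(10), about 2.30, which gives a rough reference point for an untrained ten-class classifier.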

A. Training Process
Parameter configurations are needed in the training process to obtain the optimum machine learning model. The experiment was conducted with various parameter values for the dropout layer and batch size. The six model variations are shown in Table 1, and the visualization can be seen in Figure 4. According to Table 1, models A and D, which do not apply a dropout layer, have higher accuracy and lower loss than the other models but need more time for training. A smaller batch size also requires a longer execution time. Figure 4 shows an increasing trend for accuracy and validation accuracy and, in contrast, a decreasing trend for loss and validation loss. The validation trend was less stable than the training trend because the model had not seen the validation data during training. If the validation trend deviates consistently, the training process is stopped early, as in models D and E, to avoid overfitting.
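The early stopping behaviour described above can be sketched as a simple patience rule on the validation loss. This is an illustrative sketch only: the text does not state which metric was monitored or how many epochs were tolerated, so the `patience` and `min_delta` parameters, the function name, and the loss values in the usage example are all assumptions.

```python
from typing import Optional, Sequence


def early_stop_epoch(val_losses: Sequence[float], patience: int = 3,
                     min_delta: float = 0.0) -> Optional[int]:
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss   # new best value: reset the counter
            waited = 0
        else:
            waited += 1   # no improvement this epoch
            if waited >= patience:
                return epoch
    return None


if __name__ == "__main__":
    # Made-up validation losses, for illustration only.
    history = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.60]
    print(early_stop_epoch(history, patience=3))
```

A larger `patience` tolerates the noisier validation curve noted above at the cost of more training epochs.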

B. Results Evaluation
A confusion matrix was used to analyze the performance of the machine learning model. It gives the quality parameters of the classification model by counting the number of true and false predictions for all classes. Table 2 shows the classification report for the testing dataset. Higher accuracy was obtained by variations A and D, which did not apply a dropout layer. Variations with a larger batch size yielded a lower loss value, and vice versa, because a larger batch size lets the training process converge more gradually toward accurate predictions.
The confusion matrix compares the actual labels with the predictions of the trained model on the testing dataset. In the confusion matrix graphics, variations A and D, which did not apply dropout layers, performed best, with the highest number of true positives across classes, even though some samples were still misclassified, such as jumping jack being classified as basketball. This happened because the motions of the two activities are very similar.
The following confusion matrix evaluation describes how the performance measures were calculated, illustrated in detail for one output class, namely class 0. Comparing accuracy across all variations in the experiment, variations A and D have higher accuracy than the others. Likewise, when accuracy is averaged over all classes, variations A and D still have the highest average accuracy in the entire experiment, as previously seen in Table III.

Figure 5 shows the evaluation result from the multiclass confusion matrix. The columns represent actual classes, while the rows represent predicted classes. To obtain the accuracy, precision, recall, and F1-score values, TP, FP, TN, and FN must be determined first. As an illustration, we calculate these values for class 0. The TP value is taken from cell(1,1), which contains 31 samples that were correctly classified. Figure 5 shows that the TP values for all classes lie on the diagonal cells. The TN value of class 0 is the sum of all cells except those in column 1 and row 1, which gives 362 samples. The FP value of class 0 is the sum of cell(1,2) through cell(1,10), which gives 4 samples. Finally, the FN value is the sum of cell(2,1) through cell(10,1), which gives 3 samples. Based on formulas 1, 2, 3, and 4, the accuracy, precision, recall, and F1-score of class 0 are 98.25%, 88.57%, 91.18%, and 89.86%, respectively.

The evaluation for Variation C is depicted in Figure 7, where the TP, FP, TN, and FN values of class 0 are 32, 3, 358, and 7, respectively. Using the same formulas, the accuracy is 97.50%, the precision is 91.43%, the recall is 82.05%, and the F1-score is 86.49%.
The fourth confusion matrix is the evaluation of Variation D, shown in Figure 8, where the TP, FP, TN, and FN values are 32, 3, 361, and 4, giving an accuracy of 98.25%, a precision of 91.43%, a recall of 88.89%, and an F1-score of 90.14% for class 0. The subsequent evaluation is for Variation E, which obtained an accuracy of 98.00%, a precision of 94.29%, a recall of 84.62%, and an F1-score of 89.19% for class 0. These values follow from TP 33, FP 2, TN 359, and FN 6. The last evaluation is for Variation F, where the confusion matrix gives an accuracy of 95.25%, a precision of 57.14%, a recall of 83.33%, and an F1-score of 67.80% for class 0. These results are based on TP, FP, TN, and FN values of 20, 15, 361, and 4, respectively.

Based on the experiment results, the model performance in training and testing was affected by the use of the dropout layer and by the batch size. The main function of the dropout layer is to prevent overfitting. However, a larger dropout rate can leave the model unable to fit the training data because it reduces the model's capacity. Furthermore, removing neurons in the hidden and visible layers of the network degrades the model's results.
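The class 0 figures reported above can be reproduced from the stated TP, FP, TN, and FN counts. The following is a minimal sketch in plain Python that derives the four counts from a confusion matrix whose rows are predicted classes and whose columns are actual classes, as described for Figure 5, and then applies the four formulas. The function names are hypothetical, and no zero-division handling is included.

```python
from typing import Dict, List, Tuple


def class_counts(matrix: List[List[int]], k: int) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for class k.

    Rows are predicted classes and columns are actual classes.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                 # rest of row k
    fn = sum(row[k] for row in matrix) - tp  # rest of column k
    tn = total - tp - fp - fn                # everything outside row/column k
    return tp, fp, tn, fn


def class_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1-score as percentages.

    Assumes tp + fp > 0 and tp + fn > 0.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": 100 * (tp + tn) / (tp + fp + tn + fn),
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * 2 * precision * recall / (precision + recall),
    }


# Class 0 counts (TP, FP, TN, FN) as reported in the text.
REPORTED = {
    "Figure 5": (31, 4, 362, 3),
    "Variation C": (32, 3, 358, 7),
    "Variation D": (32, 3, 361, 4),
    "Variation E": (33, 2, 359, 6),
    "Variation F": (20, 15, 361, 4),
}

if __name__ == "__main__":
    for name, counts in REPORTED.items():
        rounded = {m: round(v, 2) for m, v in class_metrics(*counts).items()}
        print(name, rounded)
```

Rounded to two decimals, the computed values match the percentages given in the text for each set of counts, which supports the reading that every class 0 evaluation covers 400 test samples.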

C. Threats to Validity
The experiment was performed only on the UCF50 dataset with ten types of sports activities. The dataset consists of videos with three color channels at 64x64 pixels, and the model processes a sequence of 20 frames per video. The limited size of the dataset in this experiment restricted how well the model could learn from the training data, which is one reason the experiment's variations did not reach optimum results. The experiment was run in a Google Colaboratory environment with shared hardware, which affected the experiment's performance. For subsequent research, a dedicated machine with suitable specifications is highly recommended to obtain maximum performance.
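As an illustration of how a fixed 20-frame sequence might be drawn from videos of different lengths, the following sketch picks evenly spaced frame indices. The text does not say how the frames were selected, so even spacing, the clamping of short videos, and the function name are assumptions of this sketch rather than the study's procedure.

```python
from typing import List


def sample_frame_indices(total_frames: int,
                         sequence_length: int = 20) -> List[int]:
    """Pick sequence_length frame indices spread evenly over a video."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # Distance between sampled frames; at least 1 for very short videos.
    step = max(total_frames // sequence_length, 1)
    # Clamp so short videos repeat their last frame instead of overrunning.
    return [min(i * step, total_frames - 1) for i in range(sequence_length)]


if __name__ == "__main__":
    print(sample_frame_indices(200))
    print(sample_frame_indices(10))
```

Each selected frame would then be resized to 64x64 pixels before being passed to the model, as stated above.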

IV. CONCLUSION
The study produced six models with several variations in parameter values, specifically for the dropout layer and batch size. According to the experiment results, the highest accuracy, 0.94 with a loss of 0.4, was obtained by the variation without a dropout layer and with a batch size of four. The lowest accuracy was obtained by the variation with a dropout rate of 0.4 and a batch size of 8, with an accuracy of 0.84 and a loss of 0.57.
In the experiment results, accuracy and validation accuracy increase, while loss and validation loss decrease, which shows that the model performs well. In both training and testing, the variations without a dropout layer obtained high accuracy and low loss but needed more execution time for training. The models that implemented the dropout layer showed the opposite behavior.
Because the loss value is still high and activities with similar motion are still misclassified, future studies should focus on parameter and hyperparameter tuning with a sufficiently large dataset. Transfer learning should also be considered, for example using pre-trained architectures such as VGGNet, ResNet, and DenseNet, to obtain the optimum result.