Modified LeNet-5 Architecture to Classify High Variety of Tourism Object: A Case Study of Tourism Object for Education in Tinalah Village

This research aims to modify a CNN (Convolutional Neural Network) based on LeNet-5 to reduce overfitting in object classification on a Tinalah Tourism Village dataset. Tinalah Tourism Village has many objects that can be identified for tourism education and an enhanced tourist experience. While these objects, spread across the different sites of Tinalah, do vary, some share similarities in their histogram patterns. If a picture is reduced to the input size preferred by LeNet-5, it inevitably loses some of its information, which makes pictures too similar and reduces accuracy. In order to learn and classify objects, this research modifies the LeNet-5 architecture to provide better performance on larger input images. The previous state-of-the-art architectures overfitted our dataset: the training accuracy was far higher than the testing accuracy. We introduced dropout layers to reduce overfitting, increased the size of the dense layers, and added a convolution layer. We then compared the modified LeNet-5 with other state-of-the-art architectures, namely LeNet-5 and AlexNet. Results showed that the modified LeNet-5 outperformed the other architectures, especially in testing accuracy on the Tinalah dataset, reaching 0.913 (91.3%). This research discusses the dataset, the modified LeNet-5 architecture, and a performance comparison between state-of-the-art CNN architectures. In further research, our CNN architecture can be developed by involving a transfer learning mechanism to provide greater accuracy.


I. INTRODUCTION
Tinalah Tourism Village is counted among the top 50 tourism villages in Yogyakarta province, Indonesia. The village, located in the Sungai Tinalah and Menoreh Mountain regions of Southern Java, carries a vision of becoming a nature-based education destination as part of its tourism package. The village also offers many historical sites and stories, such as the living site of Prince Diponegoro and a World War Two encryption site, as well as many geological attractions. At those sites, there are many objects and locations with story-telling content that could be delivered to tourists to provide historical experience and knowledge.
Objects are scattered throughout the Tinalah village site and vary in their shape, color, and background. There were 21 objects to be identified, including trees (coconut, cassava, bananas, kemuning, kalpataru, cangkring, penjalinan, ketapang kencana, and duwet), rocks (such as limestone, coral, and volcanic rock), traditional kitchen utensils (tampah tumpeng, telik, gejok lesung), paintings, stalagmites, a micro-hydro machine, as well as museum items such as dioramas and historical soil. Classifying those objects is challenging because the image classes vary widely and the images are raw photographs with various backgrounds. Several clusters of images, such as traditional kitchen utensils and trees, show high similarity within the cluster and high diversity across clusters.
Furthermore, as a social impact, this research has already contributed to the development of a mobile application that performs image recognition and storytelling as one of the tourist attractions. In order to learn and classify objects via the mobile application (i.e., from object photos taken by smartphone), this paper explores a deep learning algorithm that uses a CNN (Convolutional Neural Network) as its backbone architecture.
LeNet-5 is claimed to have promising performance as a CNN architecture for image recognition [1], [2]. Therefore, this research investigated the LeNet-5 architecture and modified it. Our modified LeNet-5 shows promising performance and outperforms the AlexNet and LeNet-5 architectures. This research provides an adjusted LeNet-5 architecture for multi-class image classification with high variety between classes. This research also shows how a more complex deep learning architecture contributes to overfitting.

A. CNN Development
CNN is already widespread across many sectors in object recognition due to its accuracy [3], [4]. CNN also performs better than its predecessor, the support vector machine-based algorithm [5]. With its ability to perform well across audio, real-time video, and image recognition alike, CNN possesses noticeable advantages over other methods [6]-[8]. The detection of pedestrian gait is becoming a topic of interest for CNN in real-time video processing [9], [10].
CNN for image recognition also shows promising advances in detecting biological objects, assisting with hand gestures and sign language, and recognizing plant illness, remote sensing imagery, agriculture, and health conditions [11]-[16]. Presently, within the health sector, CNN has already provided outstanding detection performance. Modification and combination solutions have also been developed for CNN to help researchers reach their highest point of accuracy. Ensemble learning has become one of the options for providing a more stable CNN by performing voting mechanisms across multiple CNN architectures, but it results in higher computational effort [24], [25]. Other modifications, with less computation involved, have been achieved by combining heuristic algorithms to adjust hyperparameters in LeNet-5, resulting in higher accuracy [26]. Heuristic algorithms have also been used to develop CNN architectures and show promising performance. However, this approach was time-consuming and needed more resources [27]. Transfer learning could reduce time and effort while achieving higher performance, yet a similar model must be available first [28]. The development of CNN architectures is thus becoming more practical.
Numerous studies modify their backbone CNN architecture and add layers to adapt to their specific case studies and achieve a better accuracy level with a lower MSE (Mean Square Error). Deleting a convolution layer from the backbone architecture, changing the activation function, modifying the pooling layer, replacing the fully connected layer with an SVM, and expanding the input image are also modification options that have resulted in higher accuracy while keeping computational effort in check [8], [29]-[34]. Furthermore, modifications to improve CNN performance have also targeted the activation functions, such as the ReLU and Softmax layers [35]. A hybrid pooling layer has also been used to increase the performance of the CNN [36].
This research modifies the LeNet algorithm to provide better performance when processing larger input images and utilizes a dropout layer to reduce overfitting. In the pre-processing stage, we applied random rotation and random zoom techniques to increase the variety of the training images. The research then investigated the data collection process in the Tourism Village, which provides our ground truth data, and evaluated the performance of the LeNet-5 algorithm against the modified LeNet-5 architecture. By comparing our enhanced algorithm to the traditional LeNet-5, we highlight the improvements of the suggested architecture.

B. CNN for Object Classification
CNN is one of the 'most watched' deep learning algorithms currently available for solving multiple kinds of tasks, such as classification, object localisation, prediction, and regression. Currently, CNN shows promising performance in organisation and image classification, with 'classification', for this research, defined as 'images classified into a determined class'. Like other supervised neural network algorithms, CNN consists of neurons that are connected with a 'weight' at every connection. Within the architecture, CNN also employs an input layer that represents the input nodes, hidden layers that handle complexity, and an output layer. As an additional layer, the convolution layer is embedded in the conventional neural network architecture. CNN is therefore categorized as deep learning because of the depth of the layers that make up its architecture.
The convolutional layer is a filtering layer that sums up features through an 'n x n' filter and converts them into a differently sized output called a feature map. Like traditional neural networks, a convolution layer works by multiplying the input data with its 'weight'. Furthermore, the convolutional layer applies an 'n'-dimensional array of weights to an 'n'-dimensional input array, which explains why convolutional layers work for arrays of 1, 2, and 3 dimensions. A one-dimensional array is typically used for a structured dataset, and a two-dimensional array for unstructured datasets, such as image and voice. A three-dimensional array finds its most frequent use in video processing (where time becomes an additional dimension) or when processing three-dimensional images [37]. Combinations of convolution layers with different dimensions, like 2D and 3D, are also possible and have been used to enhance performance in hyperspectral image processing [38].

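$$F(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n) \qquad (1)$$

where $I$ is the input, $K$ is the convolution filter, and $F$ is the resulting feature map. Convolution works by calculating the dot operation between each filter-sized patch of the input and the values in the convolution filter, and puts the result on the feature map (illustrated in Figure 1). In image classification, the input layer consists of 3 color layers: red, green, and blue.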
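The dot operation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in this research: it assumes a single-channel input, no padding, and a stride of one, and the names `conv2d_valid`, `image`, and `kernel` as well as the example values are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1

    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot operation between the filter and the input patch at (i, j).
            total = 0
            for m in range(k_h):
                for n in range(k_w):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 single-channel input and 2x2 filter.
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```

For an RGB input, the same sum would additionally run over the three color layers, with one filter slice per layer.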
There are also pooling layers, where the input is shrunk and filtered based on the maximum, minimum, or average value within each pooling window. Figure 2 shows a pooling layer with a 2x2 filter with three strides. A max pooling layer searches for the maximum value within the filter window.
The final layer is the fully connected layer. This layer processes the final output from the convolutional and pooling layers through a dot operation, similar to traditional neural networks. In this layer, activation functions are employed to determine the output from the resulting dot product, such as the Rectified Linear Unit (ReLU) and Softmax activation functions utilized in this research.
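As a rough sketch of the max pooling step, the following plain-Python function takes the maximum of each window. The window size and stride are left as parameters; the default stride of 2 is an assumption chosen for the example, and the function name and input values are hypothetical.

```python
def max_pool2d(feature_map, pool_size=2, stride=2):
    """Return the maximum of every pool_size x pool_size window, moving by `stride`."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, height - pool_size + 1, stride):
        row = []
        for j in range(0, width - pool_size + 1, stride):
            window = [
                feature_map[i + m][j + n]
                for m in range(pool_size)
                for n in range(pool_size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map.
example_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

pooled = max_pool2d(example_map)
print(pooled)  # [[6, 4], [8, 9]]
```

Replacing `max` with `min` or an average over `window` gives the other two pooling variants mentioned above.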
The ReLU activation function, as implemented in this study, is:

$$f(x) = \max(0, x)$$

where $x$ stands for the input value; anything less than 0 is converted to 0. The Softmax activation function, in turn, is an activation function within a neural network that converts a vector of numbers into a probability vector. It calculates the probability through the exponential function:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where $z$ is the input vector and $K$ is the number of classes.

Dropout is a regularization mechanism in which neurons are randomly dropped from the neural network during the training process to reduce the architecture's complexity. A complex architecture can lead to a condition known as overfitting; therefore, reducing this complexity helps the architecture avoid overfitting.
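The three mechanisms above can be illustrated with a short plain-Python sketch. It is not the code used in this research: the dropout variant shown is the common "inverted" form, which rescales the surviving activations (an assumption, since the exact variant and dropout rate are not stated here), and all function names and input values are hypothetical.

```python
import math
import random


def relu(x):
    """Rectified Linear Unit: negative inputs become 0."""
    return max(0.0, x)


def softmax(logits):
    """Convert a vector of numbers into a probability vector."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(activations, rate, rng):
    """Inverted dropout for training: zero each activation with probability
    `rate` and rescale the survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Hypothetical values for illustration only.
hidden = [relu(x) for x in [-1.5, 0.0, 2.0, 3.5]]
probabilities = softmax([2.0, 1.0, 0.1])
dropped = dropout(hidden, rate=0.5, rng=random.Random(0))

print(hidden)
print(probabilities)
print(dropped)
```

At prediction time dropout is switched off, so the full network is used.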

C. Dataset Development and Specification
This research started by selecting suitable objects for recognition from the Tinalah Tourism Village site. Each object must have a story behind it, something that can give an experience and educate while simultaneously entertaining and enhancing the tourist experience. After completing our observation and interviews with local tour guides in Tinalah Tourism Village, we determined that 21 objects fit the parameters of our research. Understanding the dataset class of each object assists in CNN architecture development [39]. Our dataset was photographed at three different times to capture different lighting intensities. Similar images were then removed from the set because they could lead to overfitting. In order to obtain different results for the same object, we also changed the angle of the shots and used eight different phone cameras. This research set the pixel size to 180 x 250, larger than the LeNet-5 preferred size, due to the color and shape similarities of the trees, plants, and traditional kitchen utensils pictured. Figures 3 through 6 show that tree pictures share a color histogram in which green is the most prominent, meaning that green dominates each picture. In Figures 7 to 9, red, green, and blue closely align, showing that the kitchen utensils have color schemes mostly dominated by grey and brown tones. Therefore, because the colors are so similar, preserving the shape of an object by providing a larger pixel size is important. When pictures of a similar color, like plants and trees, are shrunk to a smaller size, there is less difference between pictures, making them harder to differentiate. However, enlarging the pixel size has consequences for accuracy, because it makes overfitting more likely.
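The kind of per-channel histogram comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the pixel values are hypothetical (they are not taken from the Tinalah dataset), the bin count of 4 is arbitrary, and the function names are invented for this example.

```python
def channel_histograms(pixels, bins=4):
    """Count 8-bit pixel values per color channel into equal-width bins."""
    width = 256 // bins
    histograms = {"red": [0] * bins, "green": [0] * bins, "blue": [0] * bins}
    for red, green, blue in pixels:
        for name, value in (("red", red), ("green", green), ("blue", blue)):
            histograms[name][min(value // width, bins - 1)] += 1
    return histograms


def dominant_channel(pixels):
    """Return the channel with the largest summed intensity."""
    names = ("red", "green", "blue")
    totals = [sum(pixel[c] for pixel in pixels) for c in range(3)]
    return names[totals.index(max(totals))]


# Hypothetical leaf-like pixels given as (R, G, B) tuples.
leaf_pixels = [(30, 160, 40), (50, 200, 60), (20, 120, 30), (70, 220, 90)]

histograms = channel_histograms(leaf_pixels)
print(histograms["green"])            # [0, 1, 1, 2]
print(dominant_channel(leaf_pixels))  # green
```

Two classes whose histograms look alike under this kind of summary must be separated by shape, which is why the input resolution matters.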

D. The LeNet-5 Architecture Performance in Tinalah Dataset
In this research, we used the LeNet-5 architecture as our backbone architecture. LeNet-5 has shown exceptional performance for image recognition [1], [2] and can serve as the base for developing a new and promising architecture through modification [26], [32], [33]. Adding, editing, and deleting layers are the options for architecture adjustment. The performance of LeNet-5 is shown in Table 4. These experiments were repeated ten times, with every experiment using 150. The aim is to record the best training accuracy (accuracy), training loss (loss), test accuracy (value accuracy), and test loss (value loss). Experiment number 6 shows the best performance in value accuracy and accuracy. The overfitting value, or 'gap', between the two is 0.1279447675, or 12.79447675 percent. The objective of the modifications is to reduce the difference between the accuracy and the value accuracy without significantly reducing the accuracy.
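The overfitting value used in this section is simply the difference between training accuracy and value accuracy. A minimal sketch of that arithmetic is given below; the three runs are hypothetical placeholders and not the results reported in Table 4, and selecting the run with the smallest gap is one possible criterion chosen for illustration.

```python
def overfitting_gap(accuracy, val_accuracy):
    """Gap between training accuracy and test (value) accuracy."""
    return accuracy - val_accuracy


# Hypothetical runs for illustration only; these are NOT the paper's results.
runs = [
    {"accuracy": 0.99, "val_accuracy": 0.86},
    {"accuracy": 0.98, "val_accuracy": 0.88},
    {"accuracy": 0.97, "val_accuracy": 0.84},
]

gaps = [overfitting_gap(run["accuracy"], run["val_accuracy"]) for run in runs]

# Index of the run with the smallest gap, i.e. the least overfitting.
least_overfit_run = min(range(len(runs)), key=lambda index: gaps[index])

for index, gap in enumerate(gaps):
    print(f"run {index}: gap = {gap:.4f} ({gap * 100:.2f} percent)")
print("least overfit run:", least_overfit_run)
```

A gap close to zero means the network performs about as well on unseen data as on the training data.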

E. The Modification of LeNet-5
This research prefers to use a larger input size of 128x128 pixels. Therefore, multiple updates need to be performed to ensure maximum accuracy for the current dataset. Table 5 shows the updates this research made to LeNet-5. First, the feature map size in the first 2D convolution layer was updated to 16. Second, the feature map size in the second convolution layer was doubled to 32x32. This research also added a convolution layer with a map size of 64x64. These extensions of the convolution layers were developed to help the fully connected layer process larger input images.
In order to handle larger image sizes, the fully connected layer also needed to grow in size. It was found that the more connections the fully connected layer uses, the better it resolves higher-resolution images. Consequently, overfitting becomes more likely. Overfitting means that the network fits the training data almost exactly but is inaccurate on the test data.
To resolve this problem, the current research uses a dropout layer that randomly deletes connections between two neurons, so that the number of connections is reduced while the number of neurons is kept. However, utilizing the dropout layer may fail if the current number of neurons cannot sustain the accuracy. Figure 11 shows that the gap between best accuracy and val_accuracy has a reduction rate of 0.011, whereas the best accuracy fell dramatically to 0.9064. Across the ten experiments undertaken for this study, the best accuracy was 0.9631, found in the tenth experiment, with a gap of 0.062, which is still smaller than that of LeNet-5 without dropout layers. In order to maintain the accuracy and minimize the gap between val_accuracy and accuracy, this research suggests increasing the number of nodes in the fully connected layers to 700 and 300, respectively. The dropout layer should be placed in 3 locations: first, between the convolutional layers and the fully connected layers, to help reduce the connections between them; second, between the first and second dense layers; and third, between the second dense layer and the output layer. This LeNet-5 modification, which adapts to the expanded input size and applies dropout layers, resulted in the increase in performance shown in Figures 10 and 11. Furthermore, the chart in Figure 10 demonstrates a steady accuracy value across all ten experiments, where the accuracy did not fall much below that of the original LeNet-5. The highest accuracy reached by this algorithm is 0.993525922, achieved in the first and eighth experiments. This accuracy differs from that of the original LeNet algorithm by only 0.003984034. On the other hand, the value accuracy increases in the proposed LeNet modification. Figure 11 shows that the best value accuracy is 0.913043499 and the lowest is 0.897698224. These numbers show an increase in performance when compared with LeNet-5.
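To make the effect of these modifications more concrete, the sketch below traces output shapes and parameter counts through one possible reading of the modified architecture. Several details are assumptions made only for this illustration and are not stated in this section: the values 16, 32, and 64 are interpreted as the number of feature maps per convolution layer, each convolution uses a 5x5 kernel without padding and is followed by 2x2 pooling, the input is a 128x128 RGB image, and the output layer has 21 units (one per object class). The function name `trace_modified_lenet` is hypothetical.

```python
def trace_modified_lenet(input_size=128, channels=3, filters=(16, 32, 64),
                         dense_units=(700, 300), num_classes=21, kernel=5):
    """Return (layer name, output shape, parameter count) for each layer."""
    layers = []
    size, depth = input_size, channels

    for count in filters:
        # Assumed: 'valid' convolution followed by 2x2 pooling with stride 2.
        params = (kernel * kernel * depth + 1) * count
        size = size - kernel + 1
        layers.append((f"conv2d_{count}", (size, size, count), params))
        size = size // 2
        layers.append(("pooling", (size, size, count), 0))
        depth = count

    units = size * size * depth
    layers.append(("flatten", (units,), 0))
    # Dropout 1: between the convolutional part and the fully connected part.
    layers.append(("dropout", (units,), 0))

    for count in dense_units:
        layers.append((f"dense_{count}", (count,), (units + 1) * count))
        # Dropout 2 and 3: after the first and after the second dense layer.
        layers.append(("dropout", (count,), 0))
        units = count

    layers.append(("softmax_output", (num_classes,), (units + 1) * num_classes))
    return layers


layers = trace_modified_lenet()
for name, shape, params in layers:
    print(f"{name:<16} {str(shape):<16} {params:>10,}")
print("total parameters:", sum(params for _, _, params in layers))
```

Under these assumptions, almost all trainable parameters sit in the first dense layer, which is consistent with placing dropout around the fully connected layers.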
Figure 12 shows the difference between the value accuracy and the accuracy. The best performance, in the fifth experiment, reaches the lowest difference at 0.078988373. This means that the architecture overfits less, that is, its predictions on the test data are not far from those on the training data, and every overfitting value is lower than in the LeNet-5 experiment.
As noted, this research also compared the suggested architecture with other state-of-the-art architectures currently available, namely AlexNet and LeNet-5. Table 6 compares the performance of the three architectures, with the modified LeNet-5 outperforming the others in value accuracy, value loss, and overfitting value. Our architecture, the modified LeNet-5, has the lowest overfitting, with a slight accuracy reduction that can be ignored. Whilst AlexNet showed promising accuracy, its gap to the value accuracy is too high. Moreover, AlexNet uses more connections, with 4036 units in each dense layer, and more convolution layers, resulting in a higher computational effort.

IV. CONCLUSION
LeNet-5 and AlexNet are sophisticated CNN architectures that are widely used for image recognition across the industry. On the Tinalah dataset, both overfit: the value accuracy, which measures performance on the testing data, differs markedly from the accuracy, which measures performance on the training data.
Our suggested architecture, the modified LeNet-5, outperforms both the AlexNet and LeNet-5 architectures on the Tinalah dataset. It shows noticeable performance differences in value accuracy, value loss, and overfitting value; specifically, our architecture reaches 0.913043499 in value accuracy and 0.078988373 in overfitting value.
For Tinalah object recognition, our architecture's performance still needs to be increased, because the desired value accuracy is more than 0.97. This performance could be reached by using a more comprehensive dataset. Furthermore, future research should also involve transfer learning to improve accuracy and reduce execution time. Transfer learning could provide a basic model for the architecture. The challenge will be to find a proper model to serve as the basis.

Fig. 10 Accuracy chart of the failure dropout layer

III. RESULTS AND DISCUSSION

TABLE I DATASET CLASS

TABLE III ARCHITECTURE OF LENET-5

TABLE IV PERFORMANCE OF LENET-5