ON INFORMATICS VISUALIZATION

Abstract: Transportation technology develops every day, which increases the number of vehicles and their users. This growth benefits the economy but also has negative impacts, such as accidents and crime on the highway. In 2018, the number of accidents in Indonesia reached 109,215 cases with a death toll of 29,472 people, mostly caused by late treatment of the casualties. In the same year, there were 8,423 mugging and 90,757 snatch theft cases in Indonesia, of which only 23.99% were reported. This low reporting rate is mostly caused by a lack of awareness and knowledge about where to report. Therefore, a quick-response surveillance system is needed. In this study, an audio-based accident and crime detection system was built using a neural network. To improve the system's robustness, we enhance our dataset by mixing it with noises that are likely to occur on the road. The system was tested with several values of segment duration, bandpass filter cut-off frequency, feature extraction, architecture, and threshold to obtain optimal accuracy and performance. Based on the tests, the best accuracy was obtained by a convolutional neural network architecture using a 200 ms segment duration, a 0.5 overlap ratio, 100 Hz and 12,000 Hz as bandpass cut-off frequencies, and a threshold value of 0.9. With these parameters, our system achieves 93.337% accuracy. In the future, we hope to implement this system in a real environment.


I. INTRODUCTION
Emergency situations happen rarely and unpredictably, but their probability is never zero. They force individuals or groups to shift their focus to handling the situation [1]. The most challenging part of an emergency is keeping calm and responding quickly and appropriately [2]. In this research, we focus on accident and crime situations. The advancement of transportation technology affects the number of vehicles and their passengers. In 2018, 146,858,759 vehicles were recorded in Indonesia, classified as passenger cars, buses, freight cars, and motorcycles [3], [4]. The increase in vehicles also affects the number of accidents. In Indonesia, 109,215 accidents and 29,472 deaths were recorded in 2018 [5]. Most of the deaths were caused by late treatment of the casualties [6].
The other emergency situation we focus on is crime. The Central Bureau of Statistics reported that in 2018, 8,423 mugging and 90,757 snatch theft cases occurred in Indonesia, of which 23.99% were reported. This low reporting rate is due to a lack of awareness and knowledge about where to report [7]. For that reason, a reliable system that is able to detect accidents and crimes is needed. Other research that uses audio recognition mostly focuses on accident or impulsive sound detection that withstands environmental noise [8]-[18].
In this research, we propose an audio-based accident and crime detection system and tune several parameters in the overall process to obtain the optimal result. To increase its accuracy and robustness, we enhance the dataset by mixing the raw audio with several noises found in the real environment.

II. MATERIALS AND METHOD
Our proposed method is divided into two major processes: dataset creation and inference. The dataset creation process builds a dataset mixed with various noises to improve inference accuracy and robustness, while the inference process recognizes from the audio whether the condition is normal, an accident, or a crime.

A. Dataset Creation
We collect audio data labeled as car crash, engine idling, gunshot, rain, road traffic, scream, thunderstorm, and wind from various sources, including other publications [8], [19]-[21] and YouTube. The chosen labels represent the normal, accident, and crime conditions. The collected data were then resampled to 44,100 Hz and enhanced by mixing them with environmental noises such as rain, road traffic, thunderstorm, and wind. The enhancement process was done using the following rules.
The mixing process changes the sound of the data according to the noise used. These changes provide wider data coverage, which benefits use in real scenes [22]. This process gives a total of 5,352 audio recordings divided into eight labels. Twenty recordings from each label were randomly set aside as test data, and the rest were divided into training and validation data with a 7:3 ratio. As a result, we used 3,635 training, 1,557 validation, and 160 test recordings.
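The noise mixing and the split arithmetic above can be illustrated with a minimal sketch. The mixing rules themselves are not reproduced here, so the signal-to-noise-ratio (SNR) parameterization, the function names `mix_with_noise` and `split_counts`, and the choice to round the training share up are all assumptions made for illustration.

```python
import numpy as np


def mix_with_noise(clean, noise, snr_db):
    """Add `noise` to `clean` at a target SNR in dB (assumed mixing rule)."""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Repeat or trim the noise so it covers the whole clean recording.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    # Choose the gain so that clean_power / (gain^2 * noise_power) = 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


def split_counts(total=5_352, n_labels=8, test_per_label=20,
                 train_parts=7, val_parts=3):
    """Reproduce the reported split sizes; rounding up is an assumption."""
    n_test = n_labels * test_per_label
    remaining = total - n_test
    # Ceiling division for the training share of the 7:3 split.
    n_train = -(-remaining * train_parts // (train_parts + val_parts))
    return n_train, remaining - n_train, n_test
```

With the default arguments, `split_counts()` gives 3,635 training, 1,557 validation, and 160 test recordings, matching the figures reported above.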

B. Inference
The inference method consists of segmentation, bandpass filter, Short Time Fourier Transform (STFT), Mel spectrogram, neural network, and thresholding.

1) Segmentation:
This process slices the audio into smaller segments to reduce the processing load and speed up the response for each segment. A good segmentation process is required because the important information in the audio is mostly not located at the same part of a segment [23]. Thus, this research tunes two segmentation parameters: segment duration and overlap ratio. We also use overlapped segmentation, which gives higher accuracy than non-overlapped segmentation for the recognition system because it has less correlation with its adjacent segments [24].

Fig. 2 Audio segmentation
As shown in Fig. 2, each segment overlaps its adjacent segments. In this research, we test our system with the various segmentation parameters listed in TABLE IV.
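As an illustration of overlapped segmentation, the sketch below slices a signal into fixed-length segments whose start points advance by a fraction of the segment length. The function name `segment_audio` and the choice to drop an incomplete trailing segment are assumptions, not details taken from the tested system.

```python
import numpy as np


def segment_audio(audio, fs=44_100, duration_ms=200, overlap_ratio=0.5):
    """Slice `audio` into overlapping segments of `duration_ms` milliseconds."""
    audio = np.asarray(audio)
    seg_len = int(round(fs * duration_ms / 1000))
    # The hop is the part of a segment that is not shared with the next one.
    hop = int(round(seg_len * (1.0 - overlap_ratio)))
    if seg_len <= 0 or hop <= 0:
        raise ValueError("segment length and hop must be positive")
    if len(audio) < seg_len:
        return np.empty((0, seg_len), dtype=audio.dtype)
    # Assumption: an incomplete trailing segment is dropped, not padded.
    starts = range(0, len(audio) - seg_len + 1, hop)
    return np.stack([audio[s:s + seg_len] for s in starts])
```

At 44,100 Hz, a 200 ms segment holds 8,820 samples and a 0.5 overlap ratio gives a hop of 4,410 samples, so one second of audio yields nine segments.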

2) Bandpass filter:
This process reduces noise based on its frequency. A bandpass filter can generate fine samples, increasing the system's robustness [25]. The key point of this process is the cut-off frequency used. In this research, we use a 4th-order bandpass filter with various combinations of cut-off frequencies. The impact of the bandpass filter on the audio data is shown in the spectrogram in TABLE III.

3) STFT:
This process extends the Fourier transform by calculating the Fourier transform coefficients over short time fractions [26]. STFT was chosen for its speed and lack of redundant data. STFT is a key component of signal processing systems with a wide application range, such as medicine, industrial measurement and control, and audio signal analysis [27]. The STFT step consists of three subprocesses as follows.
Framing is the process of capturing a smaller piece of the segment. In this research, we split each segment into smaller frames with a frame width of 1,764 samples and a hop length of 441 samples (40 ms and 10 ms at 44,100 Hz).
Windowing is a process to avoid spectral leakage by reducing spikes at the start and end of the frame. One of the windowing methods is the Hann window (1). We used the Hann window with a window width equal to the frame width.
Fast Fourier Transform (FFT) is an efficient algorithm for calculating the discrete Fourier transform. We use the FFT with a Fourier width equal to the window and frame width. The FFT equation is shown in (2).
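For reference, the textbook forms of the Hann window and the discrete Fourier transform, written with the symbols listed in the nomenclature, are sketched below. The symmetric K - 1 denominator in (1), the symbols X and x for the Fourier series and the input samples, and the reading of k as the output index in (2) are assumptions, since only the symbol descriptions are available here.

```latex
% (1) Hann window of width K (symmetric form, assumed)
w(k) = \frac{1}{2}\left(1 - \cos\frac{2\pi k}{K - 1}\right),
\qquad 0 \le k \le K - 1

% (2) Discrete Fourier transform of width N, as computed by the FFT
X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N},
\qquad 0 \le k \le N - 1
```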
This process converts time-domain audio data into a spectrogram based on user parameters.
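Steps 2 and 3 can be sketched together as follows. The Butterworth design, the causal `sosfilt` filtering, and the function names are assumptions, since the filter family is not stated. SciPy doubles the order of a bandpass design, so the sketch passes half of the requested order to `scipy.signal.butter`; whether the filter order was counted this way in the tested system is also an assumption.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 44_100      # sampling rate used for the dataset
FRAME = 1_764    # frame, window, and Fourier width in samples
HOP = 441        # hop length in samples


def bandpass(x, low_hz=100.0, high_hz=12_000.0, fs=FS, order=4):
    """Butterworth bandpass (assumed filter family) of overall order `order`."""
    # butter() returns a bandpass of order 2 * N, hence N = order // 2.
    sos = butter(order // 2, [low_hz, high_hz], btype="bandpass",
                 fs=fs, output="sos")
    return sosfilt(sos, x)


def stft_power(x, frame=FRAME, hop=HOP):
    """Power spectrogram built from framing, Hann windowing, and FFT."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) < frame:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(x) - frame) // hop
    # Framing: row i holds samples [i * hop, i * hop + frame).
    idx = np.arange(frame)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]
    # Windowing: symmetric Hann window as wide as the frame.
    window = np.hanning(frame)
    # FFT of each real-valued frame; only non-negative frequencies are kept.
    spectrum = np.fft.rfft(frames * window, n=frame, axis=1)
    return np.abs(spectrum) ** 2
```

For a 200 ms segment (8,820 samples), this produces 17 frames of 883 frequency bins each, with a bin spacing of 25 Hz.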

4) Mel Spectrogram:
Human perception of frequency is approximately logarithmic, which means that human hearing has a higher resolution at low frequencies than at high frequencies. To mimic this property of the hearing system, we convert the spectrogram into the Mel scale. A Mel spectrogram combines the Mel scale with the spectrogram to represent frequency and amplitude over time [28]. We use a total of 128 Mel bands for each spectrogram.
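The conversion to 128 Mel bands can be sketched with a triangular filterbank applied to a power spectrogram. The Mel formula used here (the common 2595 log10(1 + f / 700) variant), the 0 Hz to Nyquist range, and the function names are assumptions; the exact Mel implementation behind the reported results is not stated.

```python
import numpy as np


def hz_to_mel(f):
    """Hz to Mel using the 2595 * log10(1 + f / 700) variant (assumed)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank(n_mels=128, n_fft=1_764, fs=44_100, f_min=0.0, f_max=None):
    """Triangular filters spaced evenly on the Mel scale, shape (n_mels, bins)."""
    if f_max is None:
        f_max = fs / 2.0
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # n_mels + 2 edge points define n_mels overlapping triangles.
    hz_points = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max),
                                      n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = hz_points[m:m + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def to_mel(power_spec, bank):
    """Project a (frames, bins) power spectrogram onto the Mel bands."""
    return power_spec @ bank.T
```

Because the filters are evenly spaced in Mel, they are narrow at low frequencies and wide at high frequencies, which mirrors the resolution of human hearing described above.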

5) Neural Network:
The most used architectures for audio recognition are Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Deep Neural Network (DNN), with the best results mostly achieved by RNN and DNN [29]. In this research, we test and compare all the mentioned architectures as the classification system. We used the Mel spectrogram as input and eight outputs corresponding to the eight labels of the enhanced dataset. The same training parameters were applied to all tested architectures. By combining different parameters from the processes above, we created twenty-four models to find the optimal parameters and performance of the system.

6) Thresholding:
Thresholding was used to reduce false positive results from the neural network. The neural network's output was mapped into three types of output using the mapping rules shown in TABLE V and then thresholded based on its confidence value. The three types of output represent the outcome of accident and crime detection. To obtain optimal accuracy, the threshold value was tuned by a trial-and-error process [30].
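A minimal sketch of the mapping and thresholding step is given below. The label-to-output mapping is inferred from the label descriptions in the dataset section rather than copied from TABLE V, and the rule of falling back to "normal" when an accident or crime prediction is below the threshold is an assumption about how the confidence value is used. The names `LABEL_TO_OUTPUT` and `decide` are hypothetical.

```python
# Assumed mapping from the eight dataset labels to the three output types.
LABEL_TO_OUTPUT = {
    "car crash": "accident",
    "gunshot": "crime",
    "scream": "crime",
    "engine idling": "normal",
    "rain": "normal",
    "road traffic": "normal",
    "thunderstorm": "normal",
    "wind": "normal",
}


def decide(probabilities, threshold=0.9):
    """Map per-label confidences to 'normal', 'accident', or 'crime'.

    `probabilities` is a dict from label name to confidence in [0, 1].
    """
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    output = LABEL_TO_OUTPUT[label]
    # Assumed rule: an alarm is raised only when the network is confident
    # enough; otherwise the segment is treated as normal.
    if output != "normal" and confidence < threshold:
        return "normal"
    return output
```

For example, `decide({"gunshot": 0.95, "rain": 0.05})` yields "crime", while the same call with a gunshot confidence of 0.6 falls back to "normal" under the 0.9 threshold.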

III. RESULTS AND DISCUSSION
In this section, we describe the results of the experiments on our proposed method. The experiments are divided into neural network classification, thresholding, and comparison sections. The neural network classification section contains the model, architecture, segmentation, and filter analysis, and the thresholding section contains the thresholding analysis. The comparison section provides a brief comparison with other related research.

A. Neural Network Classification
We test our models with 160 test recordings consisting of the 20 audio recordings from each label that we prepared before. The test result is shown in the following table. The result shows that accuracy varies between models, from model R_3, with the lowest accuracy at 64.03%, to model D_8, with the best classification accuracy at 79.81%.
The next step is to find the optimal architecture type. We analyzed the average accuracy of the tested architectures: CNN, RNN, and DNN. The result is shown in TABLE VII. Based on TABLE VII, the RNN architecture gives the lowest average accuracy at 70.97%, and DNN gives the best performance with an average accuracy of 72.34%. The average accuracies of the architectures differ only slightly from one another.
In TABLE VIII, we analyzed the impact of the segmentation parameters on system accuracy to determine the optimal parameter values of the system. TABLE VIII shows that models with a 1000 ms segment duration give a better average accuracy than models with a 200 ms segment duration. A shorter segment duration means less data to be processed, and the data in a 200 ms segment is mostly insufficient to analyze properly.
Behavior analysis of the bandpass filter parameters was done to obtain the optimal cut-off frequency range for the accident and crime detection system. Based on the analysis, we found that the best result was obtained from models with 100 Hz and 12,000 Hz cut-off frequencies, which reached an average accuracy of 73.34%. This suggests that accident and crime audio mostly occurs at frequencies between 100 Hz and 12,000 Hz.

B. Thresholding
We apply the thresholding process to the best model of each architecture, which are C_4, R_8, and D_8. Various threshold values were used to find the optimum performance for each selected model. We then compare the accuracy of each selected model with and without thresholding. The accuracy of model C_4 increases from 72.59% to 93.34%, an improvement of 20.75 percentage points. Fig. 7 shows that the accuracy of model R_8 increases from 77.54% to 92.31%, an improvement of 14.77 percentage points. The accuracy of model D_8 increases from 79.81% to 85.3%, an improvement of 5.49 percentage points. Overall, model C_4 gains the most because most of its prediction errors fall on labels that map to the same output type under the mapping rules in TABLE V, so they no longer count as errors after mapping.

C. Comparison
We compare our proposed method with the methods from Sammarco et al. [8], Gatto et al. [9], and Arslan et al. [11] in terms of accuracy. The comparison details are shown in TABLE X. Our proposed method performs better than the methods from Sammarco et al. [8] and Gatto et al. [9] but does not outperform the method presented by Arslan et al. [11].

IV. CONCLUSION
From the experiments, we can conclude that our proposed method can recognize accidents and crimes from audio data with an accuracy of 85.3-93.34%. The thresholding process improves the accuracy. The optimal parameters are a CNN architecture, a 200 ms segment duration, a 0.5 overlap ratio, 100 Hz and 12,000 Hz as bandpass cut-off frequencies, and a threshold value of 0.9. In the future, we hope to improve our method with more data and implement it in an embedded system to test its accuracy and robustness in a real environment.

NOMENCLATURE
w window coefficients
k sample index (discrete)
K window width
sample discrete Fourier series
N Fourier width
input samples
n sample index (continuous)