Mel Frequency Cepstral Coefficients (MFCC) Method and Multiple Adaline Neural Network Model for Speaker Identification

Abstract: Speech recognition technology makes human interaction with computers more accessible. The speaker recognition process has two phases: extracting voice features and identifying the speaker's voice pattern from the voice characteristics of each speaker. The speakers are men and women whose voices are recorded and stored in a computer database. Mel Frequency Cepstrum Coefficients (MFCC) with 13 coefficients are used at the feature extraction stage. MFCC is based on the variation of the human ear's critical bandwidth with frequency (linear at low frequencies and logarithmic at high frequencies). Each sound frame is converted to the Mel frequency scale and processed with several triangular filters to obtain the cepstrum coefficients. At the speech pattern recognition stage, the system uses a Madaline (Many Adaline, the plural form of Adaline) artificial neural network (ANN) to compare the features of the test sound with the features of the training voices, which were supplied as training data. The Madaline network is trained with BFGS Quasi-Newton Backpropagation and a goal parameter of 0.0001. The results indicate that the Madaline model of artificial neural networks is not recommended for identification research. The recognition rate for speakers in the database reached only 61% over ten tests. Only 14% of the tests with speakers outside the database were rejected, while 84% of the tests with speakers outside the database who said different words from the training data were rejected. The results of this model can be used as a reference for creating an Android-based real-time system.


I. INTRODUCTION
Voice recognition determines the identity of a voice's owner by comparing the features of the input voice with the features of each speaker inside and outside an existing database [1]. In some conditions, voice recognition becomes essential in human-computer interaction [2]. One of the mathematical tools used to recognize the different characteristics of the human voice is the Fast Fourier Transform (FFT) [3]-[5]. The FFT transforms a time-domain signal into a frequency-domain signal, which is then stored in digital form as a frequency-based signal spectrum [6].
Much research on voice identification has used artificial neural network methods such as Self-Organizing Maps (SOM), Backpropagation, and other learning rules [7]. Several speech features are commonly used to extract speaker characteristics, including Linear Predictive Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and Linear Predictive Cepstral Coefficients (LPCC) [8]-[11]. MFCC gives good results for feature extraction in sound and images [12]. MFCC has also been combined with other methods, for example Self-Organizing Maps (SOM), to produce a high level of recognition [13]. One such study determined speech recognition accuracy using the SOM artificial neural network with MFCC for voice feature extraction [14]. Speaker recognition systems have also been widely studied using several approaches [15]-[17]. In addition, speaker verification systems have been widely developed and researched [18]-[20].
Adaline and Madaline artificial neural network models have been applied to the classification of goiter, and their effectiveness has been compared. Several other studies using Adaline networks have also been carried out [21][22]. Madaline network performance is slightly superior to that of the Adaline network [23]. The recognition rate for identifying a speaker's voice has also been examined at various Signal-to-Noise Ratio (SNR) values. For sound with an SNR from 20 dB to 80 dB, the success rate follows the SNR value: the greater the SNR, the higher the identification success rate [24].
Building on this prior research, this study identifies the owner of a spoken voice. Based on the spoken words, the simulation identifies the owner of the sound pattern. Mel Frequency Cepstrum Coefficient (MFCC) feature extraction is the first step in the identification procedure. This approach determines the cepstrum coefficients based on how a person hears: the frequency scale is linear for low frequencies and logarithmic for high frequencies. After the cepstrum is obtained, training and testing are carried out using the Madaline (Many Adaline) Artificial Neural Network (ANN). This method is a development of the Perceptron and is the plural form of the Adaline (Adaptive Linear) ANN model. Unlike the Perceptron, the Adaline modifies its weights with the delta rule, commonly called the Least Mean Square (LMS) method.
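The delta (LMS) rule mentioned above can be illustrated with a minimal sketch. The function names, the learning rate of 0.1, and the toy input are illustrative assumptions and are not taken from the study; the sketch only shows that each update moves the linear output of one Adaline unit toward its target.

```python
def adaline_output(weights, bias, x):
    """Linear (net) output of one Adaline unit."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def delta_rule_update(weights, bias, x, target, lr=0.1):
    """One delta-rule (LMS) step: w <- w + lr * (target - net) * x."""
    error = target - adaline_output(weights, bias, x)
    new_weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias + lr * error
    return new_weights, new_bias, error


# Toy example (hypothetical data): repeat the update on a single pattern.
weights, bias = [0.0, 0.0], 0.0
x, target = [1.0, 0.5], 1.0
errors = []
for _ in range(20):
    weights, bias, error = delta_rule_update(weights, bias, x, target)
    errors.append(abs(error))
```

The absolute error shrinks at every step because the weight change is proportional to the error itself, which is the property that distinguishes the delta rule from the thresholded Perceptron update.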

II. MATERIAL AND METHOD
Voice identification is carried out to determine how accurately the speech pattern produced by its owner is recognized [25]. Features of the spoken voice are extracted with the Mel Frequency Cepstrum Coefficient (MFCC) approach and then recognized by the Madaline model of Artificial Neural Networks (ANN). The system design process begins with researching and analyzing the system to be built. The steps of the system design are described below.

A. Voice Recording
The voice is recorded as input using the GoldWave software program [26]. Fig. 2 shows the sound recording process. A speaker's voice signal is recorded using a microphone connected to a laptop. Each recording is made in GoldWave with a duration of 5 seconds per utterance, a sampling rate (Fs) of 16000 Hz, and a mono channel. The 25 speakers are divided into ten speakers included in the database and 15 speakers outside the database. The speakers in the database consist of eight men and two women who say the words "Telkom Laboratory." The speakers outside the database comprise eight men and two women who say the same words, "Telkom Laboratory," and five other people who say different words, namely "electrical engineering." Each utterance is saved as an audio file in ".wav" format, named after the speaker and followed by a pronunciation order index. For the data in the database, the recording was done 25 times. Fifteen of the 25 recordings are used as training and testing data, while the other ten are used only as test data. For the data outside the database, ten utterances were recorded and used as test data. The recording results can be seen in Fig. 3.

B. Feature Extraction (MFCC)
Voice signal features in this study are extracted using MFCC. The parameters of the MFCC are:
• Input: the voice input comes from each speaker and is saved in a wav file. Each speaker had ten file records, and each file is processed in turn.
• Sampling rate: the number of samples taken in one second. This study used a sampling rate of 16000 Hz [27].
• Time frame: the desired duration of one frame (in milliseconds). The time frame used is 50 ms.
• Lap: the overlap between consecutive frames, which consists of N/2 samples.
• Cepstrum coefficient: the desired number of cepstrum coefficients output for each frame. This study uses 13 coefficients, a number obtained from the spectrum of the dominant frequencies of the voice data.
The stages of the MFCC process are as follows:
1) Frame Blocking: The result of voice recording is an analog signal in the time domain that varies over time [28]. It must therefore be cut into short time slots within which it can be considered invariant. One frame contains 800 samples, and consecutive frames overlap by 400 samples, or 50% of the frame. Fig. 4 shows the results of the FFT process from voice recordings, and Fig. 5 shows the results of the frame-blocking process from sound recordings. With a sampling frequency of 16000 Hz, the voice signal is cut into 50-millisecond frames. The calculation is as follows:
Sampling rate (Fs) = 16000 Hz
Time frame (Ts) = 50 ms or 0.05 s
Frame size (N) = 16000 * 0.05 = 800 samples
Overlapping (M) = 800/2 = 400 samples
The voice signal is then cut into pieces of 800 samples, each overlapping the next by 400 samples. Each piece is called a frame, so one frame holds 800 of the 80000 samples in a recording.
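The frame-blocking arithmetic above can be checked with a short sketch. The function name `frame_signal` and the placeholder signal are illustrative assumptions; dropping an incomplete trailing frame is also an assumption, since the text does not say how the end of the recording is handled.

```python
def frame_signal(signal, frame_size, hop):
    """Split a sequence of samples into overlapping frames.

    A trailing piece shorter than frame_size is dropped (an assumption).
    """
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frames.append(signal[start:start + frame_size])
    return frames


fs = 16000                           # sampling rate in Hz
frame_ms = 50                        # time frame in milliseconds
frame_size = fs * frame_ms // 1000   # N = 800 samples
hop = frame_size // 2                # M = 400 samples (50% overlap)

# Placeholder for a 5-second recording: 80000 sample indices, not real audio.
signal = list(range(fs * 5))
frames = frame_signal(signal, frame_size, hop)
```

Under these assumptions a 5-second recording yields 199 frames, and the second half of each frame is identical to the first half of the next one.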

2) Hamming Window:
Cutting the sound signal into frames can cause errors in the Fourier transform process, because framing introduces signal discontinuities at the frame edges. A Hamming window is applied to reduce this discontinuity effect, especially at the beginning and end of each frame [30]. Fig. 6 shows the voice data after the Hamming Window process for voice with an SNR of 80 dB.
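As a minimal sketch of the windowing step, the commonly used Hamming definition w(n) = 0.54 - 0.46 cos(2πn/(N-1)) can be applied sample by sample. The coefficients 0.54 and 0.46 are the conventional values and are assumed here, since the text does not state them; the function names are illustrative.

```python
import math


def hamming_window(n_samples):
    """Conventional Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
        for n in range(n_samples)
    ]


def apply_window(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]


window = hamming_window(800)                  # one window per 800-sample frame
windowed = apply_window([1.0] * 800, window)  # constant test frame
```

The window tapers both ends of the frame toward 0.08 while leaving the middle almost unchanged, which is how it softens the discontinuities introduced by frame blocking.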

3) Fast Fourier Transform: The Fast Fourier Transform (FFT) converts the input voice signal from the time domain into the frequency domain [29]. The signal to be converted is the output of the preceding frame-blocking and windowing steps, and each frame is processed by the FFT. Fig. 7 shows the sound data after the Fourier transform process.
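The transform of one frame can be sketched with a textbook radix-2 FFT. Zero-padding the 800-sample frame to 1024 points is an assumption made so that the length is a power of two; the study does not state its FFT length, and the function names and the 1000 Hz test tone are illustrative.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def power_spectrum(frame, n_fft=1024):
    """Zero-pad a frame to n_fft points and return |X[k]|^2 for k = 0..n_fft/2."""
    if len(frame) > n_fft:
        raise ValueError("frame is longer than n_fft")
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    spectrum = fft(padded)
    return [abs(c) ** 2 for c in spectrum[: n_fft // 2 + 1]]


# Test tone: one 50 ms frame of a 1000 Hz sine sampled at 16000 Hz.
fs = 16000
frame = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(800)]
spectrum = power_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * fs / 1024
```

With a 1024-point transform each bin spans 15.625 Hz, so the test tone should appear at bin 64.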

4) Mel Frequency Wrapping: Mel frequency wrapping filters the spectrum of each frame. Signals that have passed through the FFT process are filtered using a filter bank. The frequency scale of the filter bank follows the concept of human hearing, which is why the Mel scale is often used as an extraction parameter in sound signal processing. Fig. 8 shows the triangular filter bank. The mapping between the hertz and Mel scales is linear for frequencies below 1000 Hz and logarithmic for frequencies above 1000 Hz [31]. Equations 1 and 2 give the formulas for Mel frequency wrapping. Equation 1 converts the frequency scale to the Mel scale.
Equation 2 converts the Mel scale back to the frequency scale.
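As a hedged illustration of the two conversions, the sketch below uses one widely used form of the Mel mapping, mel = 2595 log10(1 + f/700), and its inverse. This particular formula is an assumption: the study's Equations 1 and 2 are not reproduced here and may use a different variant.

```python
import math


def hz_to_mel(freq_hz):
    """Frequency in Hz to Mel, using the common 2595 * log10(1 + f/700) form."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse mapping: Mel back to frequency in Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


mel_at_1khz = hz_to_mel(1000.0)            # close to 1000 Mel by construction
round_trip = mel_to_hz(hz_to_mel(4000.0))  # should recover 4000 Hz
```

In this form 1000 Hz maps to approximately 1000 Mel, and equal steps in Hz produce smaller and smaller steps in Mel as the frequency rises.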
Next, a filter bank of M triangular filters is formed; this study uses M = 20. Fig. 9 shows the sound characteristics after the filter bank process for sound without noise. The filtering produces 20 parameters, one for each triangular filter [32].
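A bank of 20 triangular filters can be sketched as follows. The 1024-point FFT length, the 0 to 8000 Hz range, the unit peak height, and the 2595/700 Mel formula are all assumptions chosen for illustration; the study specifies only the number of filters and the 16000 Hz sampling rate.

```python
import math


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters=20, n_fft=1024, fs=16000):
    """Triangular filters whose centers are equally spaced on the Mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(fs / 2)
    mel_points = [mel_max * i / (n_filters + 1) for i in range(n_filters + 2)]
    bin_points = [int(math.floor((n_fft + 1) * mel_to_hz(m) / fs))
                  for m in mel_points]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = bin_points[m - 1], bin_points[m], bin_points[m + 1]
        row = [0.0] * n_bins
        for k in range(left, center):      # rising edge
            row[k] = (k - left) / (center - left)
        for k in range(center, right):     # peak and falling edge
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank


def apply_filter_bank(power_spectrum, bank):
    """One output value per filter: the weighted sum of the power spectrum."""
    return [sum(p * w for p, w in zip(power_spectrum, row)) for row in bank]


bank = mel_filter_bank()
energies = apply_filter_bank([1.0] * 513, bank)  # flat test spectrum
```

Because the centers are equally spaced in Mel, the triangles are narrow at low frequencies and become wider toward 8000 Hz.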

5) Cepstrum: The cepstrum is obtained by converting the log Mel spectrum from the frequency domain back into the time domain using the Discrete Cosine Transform (DCT), which produces a matrix of size (number of frames) x (number of coefficients) [33]. This is the last step and is carried out after the filter bank process. Fig. 10 shows the sound characteristics after the DCT process for noiseless sound. In this case, 13 dominant cepstrum parameters are used.
The feature extraction result using MFCC is a feature matrix of size n x k, where n is the number of frames and k is the number of coefficients. The coefficients are then averaged over the frames, so every utterance yields a matrix of the same size, namely 1 x k. This cepstrum result is used as input to the Madaline process.
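The cepstrum and averaging steps can be sketched as below, using an unnormalized DCT-II of the log filter-bank outputs. The DCT scaling, the use of the natural logarithm, the small floor that avoids log(0), and keeping the coefficients with indices 0 to 12 are assumptions; the study states only that 13 coefficients are kept and then averaged.

```python
import math


def dct_cepstrum(filter_energies, n_coeffs=13):
    """Unnormalized DCT-II of the log filter-bank energies."""
    n_filters = len(filter_energies)
    log_energies = [math.log(max(e, 1e-10)) for e in filter_energies]
    return [
        sum(le * math.cos(math.pi * k * (m + 0.5) / n_filters)
            for m, le in enumerate(log_energies))
        for k in range(n_coeffs)
    ]


def average_over_frames(cepstra):
    """Collapse an n x k list of per-frame coefficients into one 1 x k vector."""
    n_frames = len(cepstra)
    n_coeffs = len(cepstra[0])
    return [sum(frame[k] for frame in cepstra) / n_frames
            for k in range(n_coeffs)]


# Hypothetical filter-bank outputs for two frames (20 filters each).
frame_a = [math.e] * 20          # log energy of 1 in every filter
frame_b = [math.e ** 3] * 20     # log energy of 3 in every filter
cepstra = [dct_cepstrum(frame_a), dct_cepstrum(frame_b)]
feature_vector = average_over_frames(cepstra)
```

The resulting 13-element vector has the same length for every utterance regardless of the number of frames, which is what allows it to feed a fixed-size network input.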

C. Madaline Artificial Neural Network Process
Before training the Madaline model of artificial neural networks, the network architecture must be considered [34]. The architecture was chosen with a constructive approach, starting from a small Adaline network with one or more hidden layers. The hidden layer is obtained from the activation and threshold function equations of the Adaline Neural Network model. The number of hidden units and weights is then increased until the desired solution is obtained.
Each neuron in the input layer receives a feature extraction result from the MFCC method and has a predetermined weight.
The number of neurons in the input layer corresponds to the number of variables selected as network input plus one bias neuron. Fig. 11 shows that the number of input layer neurons is between 1 and 10, according to the number of speakers used as input data in the system, plus one bias neuron. The initial weights and biases are initialized with small random numbers between 0 and 1. The initial weights affect whether the network reaches a local or a global minimum and how long it takes to converge. Initial weights that are too large make the derivative of the activation function very small, which in turn makes the weight changes very small. The size of the input layer weight matrix is 10 x 13.
The following network parameters must also be set:
1) Learning rate: The learning rate was varied from 0.01 to 0.99 during training. The default learning rate is generally 0.01.
2) Goal parameter: The performance goal is the target value of the performance function. Iteration stops when the value of the performance function is less than or equal to the performance goal.
3) Maximum number of iterations: The maximum number of epochs performed during the training process. Iteration stops when the number of epochs exceeds this maximum.
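How the three parameters interact can be sketched with a small training loop. The loop uses the plain delta (LMS) rule on a single Adaline unit as a stand-in, not the BFGS Quasi-Newton training used in this study, and the function name, the toy data, and the mean squared error as performance function are illustrative assumptions.

```python
import random


def train_adaline(samples, targets, lr=0.1, goal=1e-4, max_epochs=1000, seed=0):
    """Train one Adaline unit; stop at the goal or at the epoch limit."""
    rng = random.Random(seed)
    # Initial weights and bias: small random numbers between 0 and 1.
    weights = [rng.uniform(0.0, 1.0) for _ in samples[0]]
    bias = rng.uniform(0.0, 1.0)
    mse = float("inf")
    epochs_run = 0
    for epoch in range(1, max_epochs + 1):
        epochs_run = epoch
        squared_error = 0.0
        for x, t in zip(samples, targets):
            error = t - (bias + sum(w * xi for w, xi in zip(weights, x)))
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
            squared_error += error * error
        mse = squared_error / len(samples)
        if mse <= goal:               # performance goal reached
            reason = "goal"
            break
    else:                             # loop ended without reaching the goal
        reason = "max_epochs"
    return weights, bias, mse, epochs_run, reason


# Toy data (hypothetical): targets follow t = 2*x1 - x2 + 0.5 exactly.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.5, -0.5, 2.5, 1.5]
weights, bias, mse, epochs_run, reason = train_adaline(samples, targets)
capped = train_adaline(samples, targets, max_epochs=1)
```

Training ends as soon as either condition is met, so a looser goal shortens training at the cost of a larger residual error, while the epoch limit bounds the run when the goal cannot be reached.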

A. Network Training with the Madaline Neural Network Model
Network training is carried out to assess the system's performance and to find the smallest error value during training by changing several parameters.

1) Finding the type of training that meets the target: Table 1 shows that the best type of training for the Madaline Neural Network is BFGS Quasi-Newton Backpropagation. With this training type, the outputs matched the target with few iterations and a small error. The learning rate (lr) was also varied from 0.1 to 0.9 for BFGS Quasi-Newton Backpropagation, but this did not affect the training process, neither the outputs nor the resulting errors.

2) Finding the best goal parameter: Table 2 shows the output of each training run as the goal parameter changes from 0.01 to 0.00001. Judging by the output of the two Adaline networks in training, the best goal parameter value is 0.0001.
3) The resulting weights at the 15th training: The weights listed in Table 3 are then used in the testing process.
Table 4 shows that the Madaline ANN cannot adequately recognize some voices, so the recognition percentage is only 61%.

C. Test Results Outside the Database
Table 5 shows the accuracy with which the Madaline artificial neural network (ANN) model rejects the voices of speakers outside the database. The table shows that some voices were still recognized by the Madaline ANN, so the rejection percentage was only 14%. Testing the same words from outside the database did not produce good results: of the ten people who each spoke ten times, only one was rejected entirely. Likewise, testing with different words, namely "electrical engineering," results in imperfect rejection. Although the rejection percentage is fairly good, some voices are still recognized. This is attributed to the large amount of training data used in the training process and to silent segments that were still read in during the sound recording process.

IV. CONCLUSION
Testing with the voices of speakers in the database obtained a recognition rate of 61%. Testing with speakers outside the database rejected only 14% of the test data, while testing with speakers outside the database who said different words resulted in a rejection rate of 84%. In addition, the Madaline artificial neural network is not well suited to identification and is better suited to classification and prediction research. This paper has described the process of identifying speakers based on the words they speak. MFCC is a human voice feature extraction method based on the filter response of the ear, which is linear at low frequencies and logarithmic at high frequencies. The Madaline-based speaker recognition algorithm gives reasonable results for one-dimensional system identification, although it is not superior. In future work, identification should be attempted in real time with MFCC and the Madaline algorithm or other algorithms.
The results of identifying the speaker's speech are not good. This is presumably because the Madaline algorithm used is of type I. In Madaline type I, the input layer is connected directly to the output layer, so the weight update depends only on the error at the output. Further research could identify the speaker's utterance with the Madaline type II algorithm, in which a hidden neuron layer is added between the input and output layers. With the added hidden layer, each neuron in the Adaline network is expected to update its weights better in the presence of disturbances. The Madaline algorithm should be applied to one-dimensional cases that the Adaline algorithm can handle. The two algorithms can be used for weather-related prediction, grayscale image pattern identification, or facial feature prediction, with the characteristic variables represented as one-dimensional vectors.

Fig. 4 The result of the FFT process of sound recording

Fig. 7 Voice data after going through the FFT process

Fig. 8 Triangular filter bank process results

TABLE I NETWORK TRAINING WITH CHANGING TYPES OF TRAINING

TABLE III NETWORK TRAINING WITH CHANGING TYPES OF TRAINING

Testing outside the Database with Different Words
Table 6 shows that the test with different pronunciations showed imperfect rejection, which was 84%.

TABLE VI TESTING OUTSIDE THE DATABASE WITH DIFFERENT PRONUNCIATION