ON INFORMATICS VISUALIZATION

Distributed denial of service (DDoS) attacks target systems and servers in various forms, including traffic overload, request overload, and website breakdowns. Heuristic-based DDoS attack detection is a combination of anomaly-based and pattern-based methods, and it is one of three DDoS attack detection techniques available. The pattern-based method compares a sequence of data packets sent across a computer network against a set of criteria; however, it cannot identify modern attack types. Anomaly-based methods rely on the behavior patterns that occur in a system; however, they are difficult to apply because the accuracy is still low and the false positives are relatively high. Therefore, this study proposes feature selection based on Hybrid N-Gram Heuristic Techniques. The research starts with the conversion process, packet extraction, and hex payload analysis, focusing on the HTTP protocol. The results show that Hybrid N-Gram heuristic-based feature selection achieves an accuracy rate of 99.86% for 4-Gram on the CIC-2017 dataset with the SVM algorithm and the CSDPayload+N-Gram feature, 100% for 4-Gram on the MIB-2016 dataset with the SVM algorithm and the CSPayload+N-Gram feature, and 100% for 4-Gram on the H2N-Payload dataset with the SVM algorithm and the CSDPayload+N-Gram feature. As a comparison, the KNN algorithm for 4-Gram has an accuracy rate of 99.44%, and the Neural Network algorithm has an accuracy rate of 100% for 4-Gram. Thus, the best algorithm for DDoS detection is SVM with Hybrid N-Gram (4-Gram).


I. INTRODUCTION
Currently, DDoS is a type of cyber-attack that can target any website, be it a personal website, school website, online shop, or even an enterprise-level website. These attacks continue to evolve as technology evolves. The attacks target layer two through layer seven, the latter being where the server receives and responds to Hypertext Transfer Protocol (HTTP) requests and loads the website page. This category of attacks tends to be challenging to identify and overcome because the attacks resemble natural web traffic.
Statistical features, such as the quantity, size, and length of data packets, are commonly used by researchers to analyze traffic. The traditional method of traffic analysis for detecting DDoS attacks is to convert packet units to flow units based on sets of packets that share the same 5-tuple (Source IP, Source Port, Destination IP, Destination Port, Transport Layer Protocol) [1]. However, detecting DDoS attacks based on the HTTP protocol receives little attention because the traffic cannot be analyzed until flow generation is complete. Calculating the statistical information of each flow also adds cost. Several methods can detect both types of DDoS attack, bandwidth depletion and resource depletion. The primary focus of DDoS detection has been bandwidth depletion, which relates to the quantity and kind of packets sent and received and has a high false positive rate. While some scholars have lately used machine learning approaches to identify network attacks and other anomalies, others have published methodologies based on statistical analysis of Management Information Base (MIB) and Canadian Institute for Cybersecurity (CIC) data. Other studies reviewed related work on anomaly detection using the SNMP-MIB and CIC-2017 datasets, such as the research by Alkasassbeh et al. [2], which proposes a new dataset that includes modern attack types not used in previous studies. The suggested approach uses 91 MIB traffic features from 5 categories (IP, ICMP, TCP, UDP, and SNMP) that are periodically gathered from the targets and attackers participating in the attack. Ping Flood, Targa3, and UDP Flood are three DDoS attacks that use controlled traffic loads.
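The packet-to-flow conversion described above can be illustrated with a minimal Python sketch. The `Packet` record, its field names, and the sample addresses are hypothetical and chosen only for illustration; the statistics computed (packet count, byte count, mean size) are examples of the flow features the text mentions, not the exact feature set of [1].

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical minimal packet record; field names are illustrative.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    size: int  # packet size in bytes


def group_into_flows(packets):
    """Group packets sharing the same 5-tuple and compute per-flow statistics."""
    sizes_by_flow = defaultdict(list)
    for p in packets:
        key = (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.protocol)
        sizes_by_flow[key].append(p.size)
    return {
        key: {
            "packets": len(sizes),
            "bytes": sum(sizes),
            "mean_size": sum(sizes) / len(sizes),
        }
        for key, sizes in sizes_by_flow.items()
    }


# Synthetic sample: two packets of one flow and one packet of another.
packets = [
    Packet("10.0.0.1", 40000, "10.0.0.9", 80, "TCP", 60),
    Packet("10.0.0.1", 40000, "10.0.0.9", 80, "TCP", 540),
    Packet("10.0.0.2", 40001, "10.0.0.9", 80, "TCP", 60),
]
flow_stats = group_into_flows(packets)
```

Because the statistics only become available once all packets of a flow have been seen, the sketch also shows why flow-based analysis cannot start before flow generation is complete.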
In addition, the pattern recognition of DDoS attacks in an IDS has two disadvantages. First, because of weaknesses in TCP/IP [3], DDoS attacks are easy for hackers to launch, while victims find them hard to notice. DDoS attacks have also developed new techniques; an example is the SYN-Flood attack. In general, a single SYN packet is a legal packet of network activity, so it is difficult for an IDS to detect it as a strange artifact. Therefore, it is difficult for an IDS to generate a warning about whether a SYN-Flood is attacking the network [4]. Second, false-positive alerts frequently occur in signature-based IDS when standard network patterns are wrongly identified as DDoS attacks. As a result, when a DDoS attack occurs, it is imperative to identify it quickly and take mitigation measures to secure networks that cannot function properly.
For resource depletion attacks, for example, [5] proposes Payload Based Signature Generation to detect DDoS attacks based on the similarity of two payloads, using a Similarity-Based Classification approach. Classification based on similarity treats payloads as strings and investigates methods for correlating those payloads based on similarity in structure and content. This classification aims to group related payloads that belong to an attack, in its different variants, apart from other traffic.

A. DDoS Detection Method Categories
The first DDoS attack in 2000 caused damage to the websites of companies including Amazon, CNN, eBay, and Yahoo, which were unavailable for several hours. Researchers in network security are constantly looking at ways to stop attacks like this one. Several techniques exist for detecting and preventing DDoS attacks, including statistical, knowledge-based, soft computing-based, data mining-based, and machine learning-based techniques [6]. As in previous research [7], byte-level HTTP traffic analysis offers a practical solution to the problems of network intrusion detection and traffic analysis.
This study emphasizes statistical-based and knowledge-based methods with the N-Gram technique for detecting attacks on the HTTP protocol [8]. The types of attacks detected are general attacks, Shell Code, and the CLET dataset. Attacks are detected based on the results of attack simulations in which HTTP requests are made to the server by sending normal packets as raw data. The subsequent request inserts the shell code of the attack, and the normal raw data is then compared with the attack data by calculating the Chi-Square Distance and Pattern Counting values. This technique provides a better detection rate, with 0.1107 milliseconds for 1-Gram, 0.6599 milliseconds for 2-Gram, 14.9650 milliseconds for 3-Gram, 18.0545 milliseconds for 4-Gram, and 37.8059 milliseconds for 5-Gram, and it is faster and more efficient than HMM-based techniques [9]. However, the analysis uses the outdated DARPA'99 dataset, so new types of attacks cannot be detected in that study.
An intelligent method for identifying DDoS attack patterns was developed through network packet analysis and the application of machine learning [10]. The Center for Applied Internet Data Analysis provided many network packets for analysis in that study. The authors use the SVM technique with a radial (Gaussian) kernel to build the detection system. The study set up 4,000 IP addresses, 2,000 from the attacker pool and 2,000 from the victim pool, and four attributes as test data.
The detection system has an overall accuracy rate of 85% for detecting DDoS attacks and an accuracy rate of 98.7% using five features. The results show that the SVM-based system was successfully trained with the recommended features to identify DDoS attacks with high accuracy.
Improved detection of Distributed Denial of Service attacks has been proposed based on fast entropy methods and flow-based analysis [11]. Compared to traditional computation, Fast Entropy and flow-based analysis significantly reduce computation time while maintaining high detection accuracy. Entropy is calculated per flow, on demand, after network traffic analysis. A DDoS attack is detected when the difference between the entropy of a flow and the average entropy value over that period exceeds a threshold, which is adapted to traffic pattern conditions to improve detection accuracy. The paper suggests three techniques for identifying DDoS: Fast Entropy, flow aggregation, and adaptive threshold. The adaptive threshold technique improves detection accuracy while reducing computation time compared to traditional entropy. For example, the resulting value of 7.46 for the relationship between 192.95.27.190 and 71.126.222.64 is significant compared to the other relationships. However, because this approach performs forward tracking, previously found packets cannot be inspected again.
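The thresholding idea above can be sketched as follows. This is a simplified illustration that uses plain Shannon entropy rather than the Fast Entropy formulation of [11], and the parameters `k`, `warmup`, and `min_std` are assumptions introduced here, not values from that work.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def detect_windows(windows, k=3.0, warmup=3, min_std=0.05):
    """Flag windows whose entropy deviates from the running mean of past
    unflagged windows by more than k standard deviations."""
    history, flags = [], []
    for window in windows:
        h = shannon_entropy(window)
        flagged = False
        if len(history) >= warmup:
            mean = sum(history) / len(history)
            std = math.sqrt(sum((x - mean) ** 2 for x in history) / len(history))
            # The threshold adapts to the spread of recent traffic;
            # min_std avoids a zero threshold on perfectly stable traffic.
            flagged = abs(h - mean) > k * max(std, min_std)
        flags.append(flagged)
        if not flagged:
            history.append(h)
    return flags


# Synthetic example: destination labels per time window. Three balanced
# windows are followed by one window concentrated on a single destination.
normal_window = list("abcd") * 5
attack_window = ["a"] * 20
flags = detect_windows([normal_window, normal_window, normal_window, attack_window])
```

A sharp drop in entropy, as in the last window, is what a flood concentrated on one target would look like under this simplified model.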
Machine learning-based methods for identifying DDoS attacks have been researched [2], [12], [13], along with new information about recently used and unresearched attack variants. The dataset consists of 27 attributes and five classes. Network Simulator (NS2) is used in that work because it gives good results and reflects real traffic reasonably well. Data were recorded for many attacks targeting the application and network layers. Three machine learning techniques, Multilayer Perceptron (MLP), Random Forest, and Naive Bayes, were used to classify Smurf, UDP-Flood, HTTP-Flood, and SIDDOS attack types from the obtained dataset. According to the experimental findings, the MLP classifier is the most accurate, achieving a maximum accuracy of 98.63%.
A previous study has demonstrated a deep learning-based DDoS detection system to identify multi-vector attacks involving TCP, UDP, and ICMP in a custom SDN environment [14]. The suggested approach has 95.65% accuracy in classifying certain DDoS attacks. Compared to other studies, it classifies traffic as normal or attack with 99.82% accuracy while generating very few false positives. However, as noted in its proposals for future research, the NIDS in that study cannot yet identify attacks at the application layer, especially on raw data.
To reduce the false alarm rate, Khreich et al. [15] developed a new feature extraction technique that integrates frequency and temporal information from system call traces with a one-class Support Vector Machine (OC-SVM) detector. To train the OC-SVM detector, the proposed method first divides system call traces into n-grams of variable length and maps them to a fixed-size sparse feature vector. The results on system call datasets show that, for up to six grams, this feature vector performs better than the term vector model (using the most common weighting strategy) suggested in related work. The anomaly detection system, which uses OC-SVM with a Gaussian kernel trained on these feature vectors, achieves higher detection accuracy, with lower false alarm rates, than Markovian and n-gram-based models and more sophisticated anomaly detection methods. While keeping the temporal link between events, the suggested feature extraction approach provides a new and suitable data representation of event traces for popular one-class machine learning techniques.
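One possible way to map variable-length n-grams to a fixed-size sparse vector is feature hashing, sketched below. This is an assumption made for illustration: the source does not state how [15] performs the mapping, and the function name, the vector dimension, and the sample system calls are hypothetical.

```python
import zlib


def hashed_ngram_vector(calls, max_n=3, dim=64):
    """Map all n-grams of length 1..max_n to a sparse count vector of size dim.

    The vector is stored as {index: count}; indices come from a CRC32 hash of
    the n-gram, so different n-grams may collide in the same slot.
    """
    vector = {}
    for n in range(1, max_n + 1):
        for i in range(len(calls) - n + 1):
            gram = " ".join(calls[i:i + n])
            index = zlib.crc32(gram.encode("utf-8")) % dim
            vector[index] = vector.get(index, 0) + 1
    return vector


# Hypothetical system call trace.
trace = ["open", "read", "read", "write", "close"]
vector = hashed_ngram_vector(trace, max_n=3, dim=64)
```

Keeping n-grams of several lengths in one vector retains some ordering information between events, which is the property the paragraph above attributes to the approach.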
In contrast to Snort, the packets are matched against legal user access patterns to web pages rather than attack patterns. The test's findings show that the attack detection accuracy was 94.07% at a threshold value of 0.85. This intrusion detection system is more resistant to zero-day attacks since it can identify different attacks without first describing existing attacks.
The research conducted by Sridharan [16] continues the work of Oza [7], which states that web applications generate malicious HTTP requests that provide a platform for attacking machines vulnerable to exploits. The network intrusion detection system must identify such malicious traffic based on traffic analysis. Previous research has shown that the N-Gram technique can be applied to detect HTTP attacks. That study analyzes the payload size by calculating the Chi-Square Distance, the Pattern Counting technique, and the Ad-hoc N-Gram technique. The results show that 2-Gram has an AUC value of 0.98 and a detection accuracy of 98.16% for generic attacks, shellcode attacks, and the CLET attack dataset, but the research focuses only on the size of the payload and on 2-Gram to 3-Gram.

B. N-Gram Heuristic Techniques
Any problem-solving strategy that employs a realistic approach or numerous shortcuts to achieve answers that might not be ideal but are adequate given a constrained timeline or deadline is known as a heuristic or heuristic-based strategy [17]. Heuristics-based approaches are adaptable and used for quick decisions, especially when working with complex data and when finding the best solution is impossible or impracticable [18]. An N-Gram is a sequence of N items drawn from a collection of text or words. Depending on how it is used, the items can be letters, words, or sentences [19], [20]. An N-Gram of size one is called a unigram, one of size two a bigram, and one of size three a trigram. Larger sizes are referred to as four-grams, five-grams, and so on. The working principle of the N-Gram can be seen in Figure 1. Figure 1 shows a 4-Gram window shifting over the string "Microsoft word" with the space ignored: the shift starts at "micr" and ends at "word". The value F is obtained from all shifts, and several F values may share the same pattern, so that further analysis can be carried out [22].
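The 4-Gram shift over "Microsoft word" can be reproduced with a short Python sketch. The function name is hypothetical, and treating F as the frequency count of each pattern is an assumption based on the description above.

```python
from collections import Counter


def char_ngrams(text, n):
    """Slide a window of length n over text, ignoring spaces and letter case."""
    s = text.replace(" ", "").lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


grams = char_ngrams("Microsoft word", 4)
# Frequency of each pattern (assumed to correspond to the value F in Figure 1).
frequencies = Counter(grams)
```

For this string the window produces ten 4-Grams, the first being "micr" and the last "word", which matches the shift described for Figure 1.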
The fields of information retrieval [23] and statistical natural language processing [24] have both used N-Grams in the past. The technique recovers a set of symbols from the input stream using a sliding window of length n. At every position, a sequence of length n is taken into consideration.
The feature set S is formally defined as the set of all feasible sequences of length n. Chi-squared Distance is a technique for calculating the separation between two histograms: the expected frequency distribution observed for benign traffic and the frequency distribution of an unknown payload. Let X = [x1, x2, ..., xn] and Y = [y1, y2, ..., yn]. First, the two histograms must be normalized, which requires that each of them sums to one. The value χ² is then determined between the two frequency distributions. The approach consists of two phases, training and testing.
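For reference, both definitions can be written in their commonly used forms. These forms are assumptions based on the standard definitions rather than formulas taken from the source; in particular, the symbol $\Sigma$ for the alphabet (for example, the byte values 0 to 255) is introduced here, and n denotes the sequence length in the first line and the number of histogram bins in the second.

```latex
S = \Sigma^{n} = \{\, (s_1, \dots, s_n) : s_j \in \Sigma \,\}

\chi^2(X, Y) = \frac{1}{2} \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}
```

Bins for which $x_i + y_i = 0$ contribute nothing and are skipped.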
Where: n = number of unique values in the histogram; xi = normalized value of the observed histogram; yi = normalized value of the normal histogram. For example, in implementing the N-Gram technique, raw data from the HTTP protocol is used. The analyzed payload can be seen in Figure 2. Since a normalcy model can be created automatically from the N-Grams present in a packet payload, the use of N-Grams does not require experts in the relevant subject to construct features [25]. To show how the technique works, consider the artificial payload x = "ooddod", where the set of all possible symbols is restricted to "o" and "d". If n = 2, the sequences that can be extracted are "oo", "od", "dd", "do", and "od", respectively.
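The "ooddod" example and the distance between two normalized histograms can be sketched in Python as follows. The chi-squared distance is implemented in its common symmetric form, which is an assumption since the source formula is not shown; all function names are hypothetical.

```python
from collections import Counter


def ngrams(payload, n):
    """Extract all length-n sequences with a sliding window."""
    return [payload[i:i + n] for i in range(len(payload) - n + 1)]


def normalized_histogram(grams):
    """Relative frequency of each n-gram, so that the values sum to one."""
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def chi_squared_distance(x, y):
    """Symmetric chi-squared distance between two normalized histograms
    (assumed form: 0.5 * sum((x_i - y_i)^2 / (x_i + y_i)))."""
    keys = set(x) | set(y)  # every key has a nonzero value in x or y
    return 0.5 * sum(
        (x.get(k, 0.0) - y.get(k, 0.0)) ** 2 / (x.get(k, 0.0) + y.get(k, 0.0))
        for k in keys
    )


bigrams = ngrams("ooddod", 2)
observed = normalized_histogram(bigrams)
same = chi_squared_distance(observed, observed)
disjoint = chi_squared_distance({"oo": 1.0}, {"dd": 1.0})
```

In this form the distance is zero for identical histograms and reaches its maximum of one when the two histograms share no pattern.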
In addition, Bazrafshan et al. [26] explain that N-Gram is one of the features that can be used in feature selection, as shown in Figure 3.
• API/System calls: Applications frequently communicate with the operating system through application programming interface (API) calls. API call sequences are one of the most effective techniques to model the actions of malicious software [27]. Examples include API calls for network access, such as setWifiEnabled() and execHTTPRequest(), and ZwOpenKeyEx.
• Opcode: An "opcode" is the part of a machine language instruction that designates the action to execute. A program is made up of an organized set of assembly instructions. An instruction is a pair consisting of an operational code and a list of operands. Opcodes can be found in all programming languages, with examples in machine languages such as push, mov, call, and StartupInfo [28].
• N-Gram: N-Grams are all substrings of length N of a larger string [29]. As an illustration, the string "ATTACK" can be divided into the 3-Grams "ATT", "TTA", "TAC", and "ACK". Several investigations have been conducted over the last ten years to identify unknown malware based on its binary code content. This study examines N-Grams based on the hex value of the HTTP protocol's content in DDoS attacks [24], [30].
• Control flow graph: The Control Flow Graph (CFG), a graph that depicts the control flow of programs, has been extensively utilized in software analysis for many years [31], [32], [33]. A CFG is a directed graph where every node corresponds to a program statement and every edge to the control flow between statements (i.e., what happens after what). Statements may be assignments, copy statements, branches, etc.
• Hybrid features: Two key aspects affect how well machine learning classifiers perform: features and algorithms. Thus, the hybrid feature combines feature selection algorithms and attack characteristic features to help machine learning models produce the best classification and predictions [34], [35], [36].

III. RESULTS AND DISCUSSION
This section discusses the results of data packet construction using the N-Gram technique. Two types of payloads are extracted, normal payloads and DDoS payloads. The first stage is preparing data packets containing DDoS packets and normal packets originating from CIC-2017, MIB-2016, and H2N-Payload, then extracting the hex payload using online tools and the Python programming language.

A. Preparation Dataset Result
The identified payload is extracted from the raw data for additional analysis. The following results show the raw data before it is converted into hexadecimal form. This raw data is taken from the CIC-2017 dataset in the format of a PCAP file and then extracted using the scapy module, as shown in Figure 4.
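A minimal sketch of this extraction step is shown below. It assumes scapy is installed and that HTTP traffic uses TCP port 80; the function names and the port default are hypothetical, and this is not the study's actual extraction script.

```python
def payload_to_hex(payload: bytes) -> str:
    """Return the payload as a lowercase hexadecimal string (two digits per byte)."""
    return payload.hex()


def extract_http_payloads(pcap_path: str, http_port: int = 80):
    """Yield the hex payload of every TCP segment to or from http_port.

    Assumption: HTTP is identified by port number only.
    """
    # Imported lazily so that payload_to_hex works without scapy installed.
    from scapy.all import rdpcap, TCP, Raw

    for pkt in rdpcap(pcap_path):
        if pkt.haslayer(TCP) and pkt.haslayer(Raw):
            tcp = pkt[TCP]
            if http_port in (tcp.sport, tcp.dport):
                yield payload_to_hex(bytes(pkt[Raw].load))


# Conversion of a raw HTTP request line to its hexadecimal form.
sample_hex = payload_to_hex(b"GET / HTTP/1.1\r\n")
```

Calling `list(extract_http_payloads("capture.pcap"))` on a capture file (the file name is a placeholder) would return one hex string per HTTP-carrying segment.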

B. Payload Identifications
Next, the payload is identified and reconstructed using the N-Gram technique described in the following section. The results of extracting the payload from a CIC-2017 dataset data packet using the Hex Packet Decoder tool (gasmi.net) can be seen in Figure 6. The data packets to be analyzed are separated into several fields, each with its own description. Payload separation is performed with the scapy module in a program developed in Python, which shows that the payload of the HTTP protocol can be separated from the other data packet fields.
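To make the field separation concrete, the sketch below splits a hex-encoded Ethernet/IPv4/TCP frame into addresses, ports, and payload using only the standard header layouts. It stands in for what a decoder tool or scapy does internally; the function name is hypothetical, the frame is synthetic (zeroed MAC addresses and checksums), and IPv6, VLAN tags, and fragmentation are deliberately not handled.

```python
def split_frame(hex_frame: str) -> dict:
    """Separate the header fields and the TCP payload of an Ethernet II frame."""
    raw = bytes.fromhex(hex_frame)
    if int.from_bytes(raw[12:14], "big") != 0x0800:
        raise ValueError("not an IPv4 frame")
    ip = raw[14:]                      # Ethernet II header is 14 bytes
    ip_header_len = (ip[0] & 0x0F) * 4  # IHL field, in 32-bit words
    total_len = int.from_bytes(ip[2:4], "big")
    if ip[9] != 6:
        raise ValueError("not a TCP segment")
    tcp = ip[ip_header_len:total_len]
    tcp_header_len = (tcp[12] >> 4) * 4  # data offset, in 32-bit words
    return {
        "src_ip": ".".join(str(b) for b in ip[12:16]),
        "dst_ip": ".".join(str(b) for b in ip[16:20]),
        "src_port": int.from_bytes(tcp[0:2], "big"),
        "dst_port": int.from_bytes(tcp[2:4], "big"),
        "payload_hex": tcp[tcp_header_len:].hex(),
    }


# Build a synthetic frame carrying a short HTTP request.
payload = b"GET / HTTP/1.1\r\n\r\n"
eth_header = bytes(6) + bytes(6) + b"\x08\x00"
tcp_header = (40000).to_bytes(2, "big") + (80).to_bytes(2, "big") + bytes(8) + b"\x50\x18" + bytes(6)
ip_total = 20 + len(tcp_header) + len(payload)
ip_header = (b"\x45\x00" + ip_total.to_bytes(2, "big") + bytes(4) + b"\x40\x06" + bytes(2)
             + bytes([10, 0, 0, 1]) + bytes([10, 0, 0, 9]))
frame_hex = (eth_header + ip_header + tcp_header + payload).hex()
fields = split_frame(frame_hex)
```

Only the `payload_hex` field is passed on to the N-Gram analysis; the address and port fields serve to identify the packet.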

C. Result in N-Gram Pattern Formation
Data payloads that include DDoS attack patterns are identified and analyzed by separating them into 2-Gram, 3-Gram, 4-Gram, 5-Gram, and 6-Gram and calculating the frequency of each payload packet string. After all three datasets have been converted, the payload pattern is determined using the N-Gram technique, ranging from 2-Gram to 6-Gram, as in the following payload example:
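A minimal Python sketch of the pattern formation step is given here. It assumes that the window slides over the hex string one character at a time, which the source does not state explicitly; the function names and the sample hex string are illustrative.

```python
from collections import Counter


def ngram_frequencies(hex_payload, n):
    """Count how often each length-n pattern occurs in the hex payload."""
    return Counter(hex_payload[i:i + n] for i in range(len(hex_payload) - n + 1))


def ngram_profile(hex_payload, sizes=range(2, 7)):
    """For each N from 2 to 6, map every pattern to (count, percentage)."""
    profile = {}
    for n in sizes:
        counts = ngram_frequencies(hex_payload, n)
        total = sum(counts.values())
        profile[n] = {gram: (c, 100.0 * c / total) for gram, c in counts.items()}
    return profile


# "474554202f" is the hex form of the bytes "GET /".
profile = ngram_profile("474554202f")
```

The per-pattern counts and percentages produced here are the inputs to the distance calculation in the next subsection.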

D. Result Calculation of Chi-square Distance
The Chi-Square Distance method is used to determine the distance between regular packets and the packets being analyzed. After extracting the hex payload and creating the payload string shifts, the pattern occurrence frequency, percentage, and Chi-Square Distance are calculated for 2-Gram, 3-Gram, 4-Gram, 5-Gram, and 6-Gram. The steps for calculating CSD manually based on this formula are as follows. The Pearson Chi-Square Test is carried out as a threshold determination to decide the status of the observed payload, based on the following hypotheses: H0, the packet is a DDoS packet, and H1, the packet is not a DDoS attack (normal payload). D² is the Chi-Square Distance between two payloads. χ² is the chi-square table value at the significance level α = 0.05 with b-1 degrees of freedom, where b is the number of unique patterns that appear in the reference packet (Normal/DDoS). The Chi-Square Distance between the analyzed packet and the reference packet is then compared with the chi-square table value for α = 0.05 and b-1 degrees of freedom. In this example, the calculated Chi-Square Distance is 0.327, and the value of χ²(0.05, 146) is 176.293. Since the Chi-Square Distance is less than the χ² value, the payload is a DDoS attack.
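The threshold decision can be sketched as below. Instead of a table lookup, the critical value is approximated with the Wilson-Hilferty formula, which is a swapped-in technique chosen to keep the sketch dependency-free (an exact value could be obtained from a statistics library such as SciPy). The function names are hypothetical, and the reference packet is assumed to be a DDoS packet, as in the example above.

```python
from statistics import NormalDist


def chi2_critical(df, alpha=0.05):
    """Approximate upper-tail chi-square critical value (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * a ** 0.5) ** 3


def classify_against_ddos_reference(distance, unique_patterns, alpha=0.05):
    """Keep H0 (DDoS) when the distance is below the critical value
    for b - 1 degrees of freedom, where b = unique_patterns."""
    critical = chi2_critical(unique_patterns - 1, alpha)
    return "DDoS" if distance < critical else "Normal"


# Distance from the worked example, with b chosen so that b - 1 = 146.
label = classify_against_ddos_reference(0.327, unique_patterns=147)
```

Because the distance is compared with a critical value that grows with the number of unique patterns, small distances to the DDoS reference keep H0, which mirrors the decision described in the text.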

E. Experimentation Summary
This study uses three datasets, CIC-2017, MIB-2016, and H2N-Payload, to detect DDoS attacks. N-Gram technique analysis is used to determine whether a packet is malicious. The analysis is based on string patterns in each payload, ranging from 1-Gram to 6-Gram. The frequency of occurrence of patterns in each string is used to calculate the Chi-Square Distance and Cosine Similarity values, which become new features in this study. Both values are calculated on the three datasets, and the results serve as the values of all features. Once values are acquired for each feature, the accuracy on each dataset is assessed using the SVM model. Each feature is evaluated on the CIC-2017, MIB-2016, and H2N-Payload datasets. After each feature has been evaluated individually, a combined test of the two features is run. The combination of these two features is called a hybrid.
Based on Table 3, the 4-Gram feature is the best feature for classifying each payload. The accuracy rate for the CSDPayload+N-Gram feature is 99.86%, for the CSPayload+N-Gram feature 99.37%, and for the CSDPayload+CSPayload feature 99.65%, whereas the accuracy without N-Gram is 84.16% [23]. Compared with the research in [23], it was concluded that DDoS attack detection with the N-Gram technique improves by 15.70% over detection without N-Gram, and the other features also show a significant increase in accuracy.
Table 4 shows that the accuracy rate for the CSDPayload+N-Gram feature is 99.98%, for the CSPayload+N-Gram feature 100%, and for the CSDPayload+CSPayload feature 99.74%, whereas the accuracy without N-Gram is 97.90% [37]. Compared with that research, it was concluded that DDoS attack detection with the N-Gram technique improves by 15.82% over detection without N-Gram, and the other features also show a significant increase in accuracy.
From Table 6, when evaluating the performance of the N-Gram technique in detecting DDoS attacks on the CIC-2017 dataset, the accuracy on the 4-Gram subset increases by 15.70% for the CSDPayload+N-Gram feature, 15.21% for the CSPayload+N-Gram feature, and 15.49% for the Hybrid Payload+N-Gram feature. In comparison, for the MIB-2016 dataset, the accuracy on the 4-Gram subset increases by 2.08% for the CSDPayload+N-Gram feature, 2.10% for CSPayload+N-Gram, and 1.84% for Hybrid+N-Gram.
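The construction of the two features and their hybrid combination can be sketched as follows. The chi-squared distance uses the common symmetric form (an assumption), the function names and pattern counts are illustrative, and the classifier itself is left out.

```python
import math


def normalize(counts):
    """Turn pattern counts into relative frequencies that sum to one."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def chi_squared_distance(x_counts, y_counts):
    """Symmetric chi-squared distance between two normalized histograms."""
    x, y = normalize(x_counts), normalize(y_counts)
    keys = set(x) | set(y)
    return 0.5 * sum(
        (x.get(k, 0.0) - y.get(k, 0.0)) ** 2 / (x.get(k, 0.0) + y.get(k, 0.0))
        for k in keys
    )


def cosine_similarity(x_counts, y_counts):
    """Cosine of the angle between two pattern-count vectors."""
    keys = set(x_counts) | set(y_counts)
    dot = sum(x_counts.get(k, 0) * y_counts.get(k, 0) for k in keys)
    norm_x = math.sqrt(sum(v * v for v in x_counts.values()))
    norm_y = math.sqrt(sum(v * v for v in y_counts.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def hybrid_features(observed_counts, reference_counts):
    """Hybrid feature vector: [Chi-Square Distance, Cosine Similarity]."""
    return [
        chi_squared_distance(observed_counts, reference_counts),
        cosine_similarity(observed_counts, reference_counts),
    ]


# Illustrative 4-Gram counts for a reference payload and an identical observation.
reference = {"4745": 3, "7454": 2, "5420": 1}
observed = {"4745": 3, "7454": 2, "5420": 1}
features = hybrid_features(observed, reference)
```

Feature vectors of this kind, one per payload and N-Gram size, could then be passed to a classifier such as an SVM for the accuracy evaluation described above.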

The results show that feature selection for detecting DDoS attacks with the Hybrid N-Gram heuristic technique achieves a 4-Gram accuracy rate of 99.86% on the CIC-2017 dataset with the SVM algorithm and the CSDPayload+N-Gram feature, a 4-Gram accuracy rate of 100% on the MIB-2016 dataset with the SVM algorithm and the CSPayload+N-Gram feature, and a 4-Gram accuracy rate of 100% on the H2N-Payload dataset with the SVM algorithm and the CSDPayload+N-Gram feature.

F. Algorithm Comparison and Results
The comparison in this study uses two additional algorithms on the same dataset. A comparison against algorithms other than SVM is needed to measure the performance of the proposed N-Gram technique. The comparison covers the KNN algorithm and the Neural Network. Table 8 shows the experimental results on the H2N-Payload dataset with the Neural Network algorithm on the combined features CSDPayload+N-Gram and CSPayload+N-Gram (Hybrid N-Gram). The highest accuracy in detecting DDoS attacks was 99.33% for 4-Gram, compared with the other N-Gram levels. Based on the evaluation using three machine learning algorithms, the best algorithm for selecting features to improve the detection of DDoS attacks is SVM with 4-Gram, with an accuracy rate of up to 100%.
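As an illustration of one of the comparison algorithms, a minimal k-nearest-neighbor classifier over two-dimensional hybrid features is sketched below. The training values are toy numbers invented for the example, not results from the study, and the function name and the choice of k = 3 are assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors.

    train is a list of (feature_vector, label) pairs; distances are Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy [Chi-Square Distance, Cosine Similarity] vectors, invented for illustration.
train = [
    ([0.05, 0.98], "Normal"),
    ([0.08, 0.95], "Normal"),
    ([0.10, 0.93], "Normal"),
    ([0.60, 0.40], "DDoS"),
    ([0.70, 0.35], "DDoS"),
    ([0.65, 0.30], "DDoS"),
]
prediction_a = knn_predict(train, [0.62, 0.38])
prediction_b = knn_predict(train, [0.07, 0.96])
```

Because KNN decides by local neighborhoods, its accuracy can shift when the feature values change with the N-Gram size, which is one plausible reading of the variation reported for the comparison algorithms.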

IV. CONCLUSION
As explained in the introduction, network security is important today because data protection in an organization is mandatory and involves corporate confidentiality. One crucial aspect is the availability of data when it is accessed, but sometimes the data is unavailable due to server disturbances, one of which is a DDoS attack. Attacks known as denial-of-service (DoS) use the internet to attack vital web services. By sending the target a substantial amount of unsolicited traffic to use up connections or bandwidth, this attack seeks to lower the quality of service that a genuine service provides. DoS attacks are becoming more common, increasing the risk to servers and other devices connected to the internet. DDoS attacks have been happening for some time. In the past, only a few defense systems could stop single-source attacks, so better traceability was needed to prevent or repel attack sources. However, many systems today are vulnerable to attackers due to the rapid growth of the internet.
Therefore, this study proposes a DDoS attack detection technique using a hybrid N-Gram heuristic technique. The research shows that this technique can detect attacks by recognizing the percentage of two network class conditions (Normal and DDoS). For the CIC-2017 dataset, the SVM algorithm with the CSDPayload+N-Gram feature achieves a 4-Gram accuracy rate of 99.86%. For MIB-2016, the SVM algorithm with the CSPayload+N-Gram feature achieves a 100.00% accuracy rate for 4-Gram, and for the H2N-Payload dataset, the SVM algorithm with the CSDPayload+N-Gram feature achieves 100% accuracy for 4-Gram. In contrast, the KNN algorithm for 3-Gram has an accuracy rate of 99.44%, and the Neural Network algorithm has an accuracy rate of 100% for 4-Gram. Thus, the best algorithm for detecting DDoS is SVM, whereas the KNN and Neural Network algorithms are less consistent in classifying because their accuracy varies across the 1-Gram to 6-Gram features.

Fig. 2 An example of an HTTP packet payload

Fig. 3 Hybrid Methods Features [26]
As shown in Figure 3, the features used in the heuristics-based method consist of API Calls, CFG, N-Gram, Operation Code, and Hybrid features.

Fig. 4 Sample Raw Data CIC-2017 Dataset
There are three steps to identify and analyze the raw data in a data packet: the first is to identify the IP address, the second is to identify the network protocol, and the third is to analyze the payload. All parts are converted from text to hexadecimal, as shown in Figure 4.

Fig. 5 Payload Raw
Figures 4 and 5 are the results of the data collection process in this study. All data packets in each dataset will be analyzed in depth, focusing on the payload.

Fig. 6 Payload hex
Figure 6, which is marked as the result of the identification of the payload of data packets, both normal data packets and

Table 7 shows the experimental results on the H2N-Payload dataset with the KNN algorithm on the combined features of CSD + Cosine Similarity (Hybrid N-Gram). The highest accuracy obtained in detecting DDoS attacks in this study was 91.97% for 1-Gram, compared with the other N-Gram levels.