Network Intrusion Detection with 1D Convolutional Neural Networks

Computer network assets are exposed to various cyber threats in today's digital era. Network Anomaly Detection Systems (NADS) play a vital role in protecting digital assets within the purview of network security. Intrusion detection data are imbalanced and high-dimensional, which degrades models' performance in classifying malicious traffic. This paper uses a denoising autoencoder (DAE) for feature selection to reduce the data dimension. To balance the data, the authors combine an over-sampling technique, adaptive synthetic sampling (ADASYN), with a cluster-based under-sampling method built on the K-means clustering algorithm. A one-dimensional convolutional neural network (1D-CNN) then performs the classification. The performance of the proposed model is evaluated on the UNSW-NB15 and NSL-KDD datasets. The experimental results show that the model produces detection rates of 98.79% and 97.23% on UNSW-NB15 for binary and multiclass classification, respectively. On NSL-KDD, the model yields detection rates of 98.52% for binary classification and 98.16% for multiclass classification.


Introduction
The pervasive use of digital devices in today's digitally connected world, especially the IoT paradigm across many aspects of life [1], brings vast amounts of information into cyberspace. At the same time, this development creates hidden dangers [2,3] for digital assets. Protecting digital assets is therefore a critical challenge and motivates network security scholars to research the domain. In recent years, many network security incidents involving diverse attack methods have affected personal and commercial systems [4], showing that both the number and the variety of attacks have increased dramatically. Traditional first-line network security defences, such as encryption methods and firewalls, cannot cope with every dimension of network security. Therefore, an effective tool to detect network attacks is highly needed. NADS is a promising method to identify new intrusive activities and unauthorized access to digital assets.
Moreover, network traffic data are huge, which raises the computational cost of anomaly detection [5]. In the real world, network traffic data are also imbalanced, which delays a model's convergence and biases its predictions [6,7]. Re-sampling techniques are widely employed to balance data. Re-sampling includes over-sampling and under-sampling, each with its pros and cons. Over-sampling keeps all information but increases the size of the data.
In contrast, under-sampling decreases the data size but causes some information loss, depending on the type and proportion of sampling. The simplest methods are random over-sampling (ROS) and random under-sampling (RUS). Other representative methods are BalanceCascade [8], the Synthetic Minority Oversampling Technique (SMOTE) [9], and Adaptive Synthetic sampling (ADASYN) [10]. Previous works often recommend over-sampling; for example, the authors of [11] showed that over-sampling does not cause over-fitting and is the best way to handle imbalance problems in deep learning. To address the problems mentioned earlier, we propose a feature selection technique using a DAE to reduce the data dimensionality.
To handle the imbalance in network traffic data, we use a combined method of ADASYN over-sampling and cluster-based under-sampling with the K-means algorithm. The combined method avoids the growth in data size through under-sampling while retaining as many informative samples as possible through over-sampling. This imbalance processing approach is then fused with a 1D CNN to perform binary and multiclass classification. The performance of the model is evaluated on the NSL-KDD and UNSW-NB15 datasets. Our main contributions in this study are: a. We present a combined ADASYN and K-means-based clustering method to handle imbalance issues in network traffic data.
b. We propose a flow-based NADS that integrates this imbalance processing method with a 1D CNN model. Our model outperforms the state of the art in detection rate while dramatically reducing the false alarm rate, making it a useful reference for future NADS design and development.
The remainder of this paper consists of several sections. Section 2 briefly describes related work. Next, in Section 3, a description of the datasets is given. The proposed method is explained in Section 4. Section 5 details experimental results and analysis, and finally, Section 6 concludes the paper.

Related Work
This section briefly summarises the literature most related to our work. The authors of [12] employed a DAE for feature selection and a Multilayer Perceptron (MLP) for classification, achieving an accuracy of 98.80%; the method was evaluated on the UNSW-NB15 dataset. The authors of [13] proposed a conditional variational autoencoder for network anomaly detection in the IoT domain. Their method handles feature reconstruction when data are incomplete, and they claimed improved performance with lower complexity than other unsupervised methods. The authors of [14] proposed a Recurrent Neural Network (RNN) based network intrusion detection system and claimed that deep learning-based intrusion detection performs better when processing big data. The authors of [15] proposed deep learning-based distributed attack detection in an IoT environment, particularly using edge devices; the method produced an accuracy of 98.27% on the NSL-KDD dataset. Baig et al. [16] developed a multiclass classification method for network intrusion detection using a cascaded artificial neural network, which yields an accuracy of 86.40% on UNSW-NB15. Kwon et al. [17] used a fully connected neural network architecture for NIDS, tested on the NSL-KDD dataset; the reported detection rate ranged from 92.9% to 95.3% across the different files of the dataset. The authors of [18] proposed an intrusion detection system based on a Deep Neural Network (DNN), evaluated on different datasets for binary and multiclass classification. With the rapid growth of technological advancement, data in cyberspace keep getting larger. Shallow learning with traditional machine learning (ML) relies on a high level of human involvement in data preparation, which may not be suitable in real-world environments [19,14], and these techniques tend to produce low accuracy [14]. In recent years, deep learning has demonstrated success in solving real-world problems, including in cyber-security, owing to its ability to automatically capture features and correlations in large datasets [14].

Dataset
We use the UNSW-NB15 and NSL-KDD datasets to evaluate the performance of the proposed model.

UNSW-NB15
The UNSW-NB15 dataset [20,21] was generated by the University of New South Wales in 2015. The researchers employed three virtual servers and used the Bro tool to extract 49-dimensional features, including two labels. The dataset has 2.54 million network traffic samples covering nine types of attacks; it contains more attack types than KDD, and its features are richer. UNSW-NB15 suffers from a serious class imbalance: from Table 1, which details the distribution of each class, we can calculate that 87.35% of the data is normal traffic and the remaining 12.65% comprises all attack types. We use the entire dataset for the experiment and divide it into training, testing, and validation sets at a ratio of 70%, 20%, and 10%, respectively.

NSL-KDD
NSL-KDD is a refined version of the KDD CUP 99 dataset [22]. The records in NSL-KDD were chosen carefully to avoid the redundancy issues of the previous version, and it contains only a moderate number of records. The NSL-KDD dataset is distributed as several files with different formats [23]; in this experiment, we used KDDTrain+ and KDDTest+. From Table 1, which details the distribution of each class, we can calculate that 51.88% of the data is normal traffic and the remaining 48.12% comprises all attack types. The dataset is imbalanced, though not as severely as UNSW-NB15. We divide it into training, testing, and validation sets at 70%, 20%, and 10%, respectively. Further details on NSL-KDD are available in [24]. It is one of the most commonly used datasets in NIDS research.
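For clarity, the sketch below shows one way to realize the 70%/20%/10% split described above using scikit-learn; the stratification and the random seed are our assumptions, not details given in the paper.

```python
# Hypothetical sketch of the 70/20/10 train/test/validation split.
from sklearn.model_selection import train_test_split

# First carve off 70% for training, then split the remaining 30%
# into 20% test and 10% validation (1/3 of the remainder).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=1/3, stratify=y_rest, random_state=42)
```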

The Proposed Method
The proposed method mainly consists of data pre-processing, class imbalance handling, classification, and evaluation. Figure 1 illustrates the architecture of the model.
The data pre-processing includes one-hot encoding, data standardization, and feature selection. The second step is the combined imbalance processing using ADASYN and the K-means algorithm. Classification is performed by the 1D-CNN, and the final step is the evaluation of the model.

Data Pre-processing
First, we dropped unnecessary features such as "srcip", "sport", "dstip", "dsport", "stime", and "ltime" [12] from the UNSW-NB15 dataset before the data pre-processing step. Data pre-processing consists of three main steps: one-hot encoding, data standardization, and feature selection. One-hot encoding converts nominal features into binary vectors [4]. Data standardization brings all features to a common scale without distorting the differences in their ranges of values. Feature selection reduces the number of variables to an optimal set by eliminating redundant or unnecessary ones. UNSW-NB15 and NSL-KDD have three nominal features each: "proto", "state", and "service" in UNSW-NB15, and "protocol type", "service", and "flag" in NSL-KDD. After one-hot encoding, the feature dimension of UNSW-NB15 increases to 202 and that of NSL-KDD to 121. One-hot encoding is likewise applied to the class labels of both datasets. Next, we standardize all the remaining features according to Equation (1), mapping them to a Normal Distribution, also called the Gaussian Distribution, with a mean of 0 and a variance of 1:
x′ = (x − µ) / δ (1)

where x′ is the standardized feature, x is the original feature, µ is the mean, and δ is the standard deviation. The general form of the Gaussian Distribution is given in Equation (2):
f(x) = (1 / (δ√(2π))) exp(−(x − µ)² / (2δ²)) (2)

where x is the original feature, µ is the mean, and δ is the standard deviation. Finally, we perform feature selection using the DAE on both datasets and select the top 15 features for each. Table 2 lists the names of the 15 selected features for both datasets.
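To make the feature selection step concrete, the following is a hedged sketch of one plausible DAE-based ranking: train a denoising autoencoder on the standardized features and score each input by the magnitude of its encoder weights, keeping the top 15. The paper does not specify how scores are derived from the DAE, so the ranking criterion, bottleneck size, and noise level here are our assumptions.

```python
# Hypothetical DAE feature-selection sketch (Keras); ranking criterion assumed.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def dae_select_features(X, n_keep=15, noise_std=0.1, epochs=50):
    n = X.shape[1]
    inp = keras.Input(shape=(n,))
    noisy = layers.GaussianNoise(noise_std)(inp)  # corrupt inputs (train-time only)
    code = layers.Dense(32, activation="relu", name="encoder")(noisy)  # bottleneck size assumed
    out = layers.Dense(n, activation="linear")(code)  # reconstruct the clean inputs
    dae = keras.Model(inp, out)
    dae.compile(optimizer="adam", loss="mse")
    dae.fit(X, X, batch_size=256, epochs=epochs, verbose=0)
    # Score each input feature by the total magnitude of its encoder weights.
    W = dae.get_layer("encoder").get_weights()[0]  # shape: (n, 32)
    scores = np.abs(W).sum(axis=1)
    return np.argsort(scores)[::-1][:n_keep]       # indices of the top features
```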

Data Imbalance Processing
As Table 1 shows, some classes in UNSW-NB15 have very few instances; for example, worms, shellcode, and backdoors have few samples, and similarly, U2R and R2L have far fewer samples than the other classes in NSL-KDD. We present a combined over-sampling technique using ADASYN and a cluster-based under-sampling technique using the K-means algorithm to balance the data. Over-sampling alone increases the data size, which ultimately increases the model's computational cost and can affect its accuracy. Under-sampling decreases the data size but loses information by eliminating some informative records. To overcome these shortcomings, we combine both re-sampling techniques. For the minority classes, we use ADASYN, an improved version of SMOTE. The key idea behind ADASYN is that it uses a density distribution as a criterion to automatically decide how many synthetic samples to generate for each minority example [10]. It works similarly to SMOTE, with a minor improvement: it adds small random values to the generated points, making them more realistic. Instead of all synthetic samples lying on lines between parent points, they carry more variance, i.e. they are scattered. Next, in the cluster-based under-sampling step, we divide the data of each majority class into 10 clusters and randomly draw a portion from each cluster so that the portions sum to the threshold to which the minority classes were over-sampled. Algorithm 1, ADASYN-KM, describes our imbalance processing method.
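A minimal sketch of this ADASYN-KM procedure follows, assuming imbalanced-learn's ADASYN and scikit-learn's KMeans; the helper name, the equal per-cluster quota, and the single `target` threshold for all classes are our simplifications of Algorithm 1.

```python
# Hypothetical sketch of ADASYN-KM: over-sample small classes with ADASYN,
# then shrink large classes to the same threshold via K-means clusters.
import numpy as np
from sklearn.cluster import KMeans
from imblearn.over_sampling import ADASYN

def adasyn_km(X, y, target, n_clusters=10, seed=42):
    counts = {c: int((y == c).sum()) for c in np.unique(y)}
    # 1) ADASYN: raise every class below `target` to (roughly) `target` samples.
    strategy = {c: target for c, n in counts.items() if n < target}
    X, y = ADASYN(sampling_strategy=strategy, random_state=seed).fit_resample(X, y)
    # 2) Cluster-based under-sampling: for each class above `target`,
    #    cluster it into 10 groups and draw an equal share from each cluster.
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        if len(idx) <= target:
            keep.extend(idx)
            continue
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(X[idx])
        quota = target // n_clusters
        for k in range(n_clusters):
            members = idx[labels == k]
            take = min(quota, len(members))
            keep.extend(rng.choice(members, size=take, replace=False))
    keep = np.asarray(keep)
    return X[keep], y[keep]
```

Note that ADASYN needs enough minority neighbours to interpolate between; extremely small classes (e.g. worms) may require lowering its `n_neighbors` parameter.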

Convolutional Neural Networks
Neural networks (NNs) are a subset of ML at the heart of deep learning algorithms. A neural network is composed of layers of nodes: an input layer, one or possibly more hidden layers, and an output layer. The nodes are connected to one another, and each connection carries a weight and each node a threshold. A node is activated if its output value crosses the threshold, in which case it passes data to the next layer; otherwise, no data is passed on. Convolutional neural networks (CNN) extend feedforward neural networks with three main layer types: convolutional, pooling, and fully connected. The first layer is a convolutional layer, followed by further convolutional or pooling layers, and the final layer of a CNN is the fully connected layer.
• Pooling layers: Pooling layers down-sample the feature maps produced by the convolutional layers [25,26]. Max pooling selects the maximum value, and average pooling calculates the average value, within the receptive field. In this work, we use max pooling, as shown in Equation (3):

p = max(x) (3)

where x is the vector of input data within the receptive field after the activation function, and p is the pooled output.
• Fully connected layers: The fully connected layer performs the classification task based on the features extracted by the previous layers and their filters. The fully connected layer usually applies a softmax activation function to classify inputs, while convolutional and pooling layers tend to use ReLU functions. We developed our model as a six-layer 1D CNN; Figure 1 illustrates the complete pipeline. In the classification part of the architecture, a max-pooling layer follows every two convolutional layers for the first four layers. The dense layers integrate the locally learned features into global features: the first dense layer has 64 units, and the final dense layer performs the classification or prediction. The parameters vary from dataset to dataset, depending on the number of class labels and the type of classification (binary or multiclass). The data are simply mapped into a two-dimensional array as the input to the network.

Experimental Results and Analysis
We implemented the proposed CNN-based NADS model in Python and conducted the experiments on a machine running the 64-bit Windows 11 Pro operating system. Detailed information on the experimental setup is given in Table 3. Class imbalance processing was applied to the training set only. The batch size was set to 256, and the number of epochs ranged from 100 to 200.

Evaluation Metrics
The performance of the model is evaluated by Accuracy, Precision, Recall, f-measure, and false alarm rate (false positive rate), which are formulated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
f-measure = 2 × (Precision × Recall) / (Precision + Recall)
FAR = FP / (FP + TN)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
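These metrics can be computed directly from the confusion matrix; the snippet below is a small helper for the binary case (scikit-learn has no built-in FAR, so it is derived from FP and TN).

```python
# Computing Accuracy, Precision, Recall, f-measure, and FAR for binary labels.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate_binary(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary")
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,          # the detection rate reported in this paper
        "f-measure": f1,
        "far": fp / (fp + tn),     # false alarm rate = FP / (FP + TN)
    }
```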

Classification
The number of convolution kernels and the learning rate directly affect classification results in a CNN-based model [27,28]. We performed experiments with several convolution kernel configurations and different learning rates to obtain a better result; this experiment was run on UNSW-NB15 for multiclass classification. We use the "nadam" optimizer and the "categorical crossentropy" loss function for the entire experiment. In our model, the numbers of convolution kernels in the first four layers of the 1D CNN are 64 64 256 256. We use max-pooling twice to down-sample the outputs of the convolution layers. The activation function for the output layer is softmax, and for the remaining layers it is ReLU. To avoid over-fitting, a dropout with a rate of 0.2 is applied after each pooling layer. We tested six different convolution kernel configurations with seven learning rates. Table 4 presents a comparative study of the different convolution kernels across the sets of learning rates. We observe that the kernel configuration 64 64 256 256 with a learning rate of 0.1 performs best in Recall, f1-score, Train-loss, and Test-loss, with scores of 97.23%, 97.64%, 0.30%, and 0.07%, respectively. Its Accuracy is 0.02% lower than with a learning rate of 0.01, and its Precision is 0.02% lower than that of the 64 64 128 128 configuration with a learning rate of 0.1. The FAR is 0.48%, which is higher by 0.8% than with a learning rate of 0.002. Based on this experiment, we can claim that the 64 64 256 256 kernel configuration outperforms the others on most metrics, including Accuracy, Recall, f1-score, FAR, Train-loss, and Test-loss, with the exceptions of Precision and computational time.
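For illustration, the following is a minimal Keras sketch consistent with the architecture described above. The layer widths (64 64 256 256 kernels, dense 64), dropout rate, activations, optimizer, and loss follow the text; the kernel size of 3, "same" padding, and an input of the 15 selected features are our assumptions.

```python
# Hypothetical sketch of the six-layer 1D CNN described in the text.
from tensorflow import keras
from tensorflow.keras import layers

def build_1d_cnn(n_features=15, n_classes=10, lr=0.1):
    model = keras.Sequential([
        keras.Input(shape=(n_features, 1)),
        layers.Conv1D(64, 3, padding="same", activation="relu"),
        layers.Conv1D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.2),                       # dropout after each pooling layer
        layers.Conv1D(256, 3, padding="same", activation="relu"),
        layers.Conv1D(256, 3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),       # local-to-global feature integration
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Nadam(learning_rate=lr),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Training settings from the text: batch size 256, 100-200 epochs, e.g.
# model.fit(X_train, y_train, batch_size=256, epochs=100,
#           validation_data=(X_val, y_val))
```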
To evaluate the effectiveness of our model across datasets and classification types, we ran the same experiment for binary and multiclass classification on both datasets. Table 5 summarises the results for all metrics on both datasets. Table 6 presents the per-class performance of the model on the UNSW-NB15 dataset. The detection rates for the minority classes shellcode and worms are good considering the very small number of samples in those classes.

Discussion
The experimental results show that combining an imbalance processing technique with a 1D-CNN-based classifier significantly improves the detection rate and reduces the FAR. The main reason is that, instead of random re-sampling, we employed a combined method of ADASYN over-sampling and cluster-based under-sampling using the K-means algorithm. We manually set the number of clusters to 10 based on the better results it gave in a small preliminary test. Cluster-based under-sampling prevents the information loss that can occur when samples are eliminated randomly, and in this way it contributes to good performance.
On the other hand, ADASYN over-sampling generates synthetic points that are more realistic than randomly generated ones. There are many under-sampling approaches, such as the K-means algorithm, random under-sampling, and Gaussian-based clustering; similarly, there are many over-sampling techniques, such as random over-sampling and SMOTE. We selected ADASYN rather than SMOTE to generate more realistic samples: SMOTE generates samples linearly, while ADASYN works similarly but adds small random perturbations to produce more realistic samples. To demonstrate the effectiveness of our model, we compare our results with previous works in Table 7, using the Accuracy, Precision, Recall, f-measure, and FAR metrics. The table shows that our model performs much better than the previous works. Reducing the false positive rate (false alarm rate) is one of the key challenges in network anomaly detection, and our method lowers it significantly.

Conclusions
We proposed a combined approach to overcome the class imbalance issue in network traffic data. The technique combines ADASYN over-sampling and cluster-based under-sampling using the K-means algorithm. We developed a 1D CNN-based network anomaly detector with 64 64 256 256 convolution kernels, a learning rate of 0.1, a softmax activation function for the output layer, ReLU activation for all other layers, and max-pooling followed by dropout. We evaluated our model on two commonly used datasets, UNSW-NB15 and NSL-KDD, conducting both binary and multiclass classification on each. A comparative study with six convolution kernel configurations and seven learning rates shows how our chosen configuration performs against the alternatives. The experimental results show that the model produces detection rates of 98.79% and 97.23% on UNSW-NB15 for binary and multiclass classification, respectively, the highest among all tested scenarios. On NSL-KDD, the model yields detection rates of 98.52% for binary classification and 98.16% for multiclass classification. We also compared our method with previous work on multiclass classification; our model performs better than state-of-the-art models and points to a promising direction for future network anomaly detection on large-scale, imbalanced datasets. In future research, we plan to explore further imbalance processing methods on a distributed platform to improve detection performance and reduce processing time.