[Research-type paper] Flow Sequence-Based Anonymity Network Traffic Identification with Residual GCN

Flow Sequence-Based Anonymity Network Traffic Identification with Residual Graph Convolutional Networks

Chinese title: Residual Graph Convolutional Networks for Anonymous Network Traffic Identification Based on Flow Sequences
Conference: 2022 IEEE/ACM 30th International Symposium on Quality of Service (IWQoS)
Publication date: 2022-06-10
Authors: Ruijie Zhao, Xianwen Deng, Yanhao Wang, Libo Chen, Ming Liu, Zhi Xue, and Yijun Wang
BibTeX citation:

@article{shen2021accurate,
  title={Accurate decentralized application identification via encrypted traffic analysis using graph neural networks},
  author={Shen, Meng and Zhang, Jinpeng and Zhu, Liehuang and Xu, Ke and Du, Xiaojiang},
  journal={IEEE Transactions on Information Forensics and Security},
  volume={16},
  pages={2367--2380},
  year={2021},
  publisher={IEEE}
}

Abstract

Identifying anonymous services from network traffic is an important task for network management and security. Recent deep learning-based work has achieved good results in traffic analysis, especially methods based on flow sequences (FS), which exploit the information and characteristics of multiple traffic flows.

However, these models still face a severe challenge: they lack a mechanism for modeling the relationships between flows, so unrelated flows in an FS are wrongly treated as clues for identifying a flow.

In this paper, we propose an FS-based anonymous network traffic identification framework that leverages the relationships between flows for FS feature extraction using a Residual Graph Convolutional Network (ResGCN). Furthermore, we design a practical scheme to preprocess the raw data of real traffic, which further improves identification performance and efficiency.

Experimental results on two real traffic datasets show that our method outperforms the current state-of-the-art methods by a large margin.

Existing problems

  1. Some current AI-based traffic analysis methods ignore key relationships between flows, and therefore mistake unrelated flows in a flow sequence for clues when identifying traffic. Because of the structural limitations of current DL models (such as CNN and LSTM), the following two relationships cannot be taken into account during feature extraction.

Key relationships:

Attribute relationship. The forward flow (F-flow) generated by an application request and the corresponding reverse flow (R-flow) have an attribute relationship.
Time relationship. The time relationship reflects the time interval between flows: the longer the interval, the lower the correlation between the two flows.

Solution:

Graph Convolutional Networks (GCNs) connect neighboring nodes through edges with different weights and update each node's feature representation by aggregating the features of its neighbors. This provides a way to connect flows according to the relationships between them, which addresses the problem above.

Paper contributions

  1. The attribute and time relationships between flows are used to achieve more reasonable and effective flow sequence feature extraction for traffic identification. We hypothesize that graph convolutional networks (GCNs) suit this purpose and propose a new ResGCN model to identify different network services.
  2. A practical scheme is devised to handle real-world raw traffic data. It covers traffic segmentation, generating and enriching flow features from raw traffic, and LightGBM-based feature combination selection, which prevents unimportant features from degrading model performance and efficiency.
  3. The framework is evaluated on two real traffic datasets. Experimental results show that the method has good classification performance and is suitable for identifying different network services.

1. Overview of the problem and approach

  • Flow Generator
    Since the same type of network behavior usually lasts for a period of time, building flow sequences for analysis is an effective way to improve classification performance. Clearly, the flows in each sequence are related to one another (through attribute and time relationships). If these relationships are taken into account during feature extraction, the design of the classifier network becomes more reasonable and effective. Unfortunately, no prior work combines these relationships for feature extraction.

    Difficulties:

    1. During flow generation, many variables affect the result. Traffic can be segmented by different durations or sizes, which directly affects the computed statistics, so a good segmentation parameter has to be found.
    2. The generated flows carry many statistical features, but some of them may not be meaningful for traffic identification, so the more important features need to be selected.

    Solution considerations:

    1. Effectiveness: the classifier can use the generated flows to achieve excellent classification performance.
    2. Timeliness: anonymous services can be identified as soon as possible.
  • LightGBM for feature selection

    In terms of effectiveness, unimportant features can act as noise under certain conditions and degrade the identification results.
    In terms of timeliness, removing these features reduces the statistics that must be computed during flow generation and so speeds it up. In addition, lower-dimensional features reduce the complexity and the classification time of the feature extraction network.

    Reasons to use LightGBM:

    1. LightGBM is an ML algorithm based on Gradient Boosted Decision Trees (GBDT). Since GBDT algorithms rank the importance of features during training, it is well suited to feature selection.
    2. Traditional GBDT-based algorithms (such as XGBoost and pGBRT) are time-consuming because they have to scan all data instances of each feature to find the best split point. LightGBM greatly reduces this cost through the Gradient-based One-Side Sampling (GOSS) algorithm. (The main idea of GOSS is that instances with larger gradients contribute more to the information gain. Therefore, to keep the information gain estimate accurate when downsampling, instances with large gradients are all retained, while instances with small gradients are randomly sampled at a fixed ratio. Because many small-gradient instances are dropped, the amount of computation is greatly reduced.)
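The GOSS idea described above can be sketched in a few lines of NumPy. This is only an illustration of the sampling step, not LightGBM's internal code; the function name `goss_sample` and the rates `top_rate` and `other_rate` are hypothetical names chosen for this sketch.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Pick instances GOSS-style and return (indices, weights).

    Illustrative sketch only: top_rate and other_rate are assumed names
    for the fractions of large-gradient and small-gradient instances kept.
    """
    rng = np.random.default_rng(seed)
    gradients = np.asarray(gradients, dtype=float)
    n = len(gradients)
    top_n = int(top_rate * n)      # large-gradient instances, always kept
    rand_n = int(other_rate * n)   # small-gradient instances, sampled

    order = np.argsort(-np.abs(gradients))  # largest |gradient| first
    top_idx = order[:top_n]
    rest_idx = order[top_n:]
    sampled_idx = rng.choice(rest_idx, size=rand_n, replace=False)

    # Sampled small-gradient instances are up-weighted so that the
    # information gain estimate stays approximately unbiased.
    weights = np.ones(top_n + rand_n)
    weights[top_n:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, sampled_idx]), weights
```

With 100 instances and the default rates, 20 large-gradient instances are kept as they are and 10 of the remaining 80 are sampled and up-weighted, so only 30 instances enter the split search.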
  • GCN for feature extraction

    In this study, the flow sequence is regarded as a graph in which each flow is a node, and edges are formed according to the relationships between flows. For the mathematical derivation and usage of GCNs, see another article of mine.

  • Method of the paper

    1. Step 1: Configure the mirror port of the switch, use the traffic capture tool tcpdump to capture real-time traffic, and save it as a series of pcap files. The capture frequency depends on the actual situation.
    2. Step 2: Use the flow generator to analyze the pcap files quickly and in real time, extracting flows according to preset rules. A flow sequence is composed of multiple consecutive flows, and a graph of the relationships between these flows is also generated in this step. Then a LightGBM-based feature selection method is used to select the optimal feature combination.
    3. Step 3: The proposed ResGCN classifier uses the generated relationship graph to identify anonymous network traffic effectively.

2. Anonymous network traffic identification framework


  1. Raw traffic data processing scheme

    1. Generate flows:

    Task: extract flows from pcap files

    In real traffic, services such as BitTorrent can produce flows with very long durations. If such long flows are not segmented, classification efficiency suffers.

    Traffic segmentation schemes fall into two categories:
    (1) Time-based: set an upper limit on the duration of a flow
    (2) Size-based: set an upper limit on the size of a flow
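As a rough illustration of the two segmentation schemes, the sketch below splits the packets of one flow either by elapsed time or by accumulated bytes. The packet representation `(timestamp, size)` and both function names are assumptions made for this example; the paper's flow generator may apply its limits differently.

```python
def segment_by_time(packets, max_duration=10.0):
    """Split (timestamp, size) packets into segments lasting < max_duration seconds."""
    segments, current, start = [], [], None
    for ts, size in packets:
        # Close the current segment once the time limit is reached.
        if start is not None and ts - start >= max_duration:
            segments.append(current)
            current, start = [], None
        if start is None:
            start = ts
        current.append((ts, size))
    if current:
        segments.append(current)
    return segments


def segment_by_size(packets, max_bytes):
    """Split (timestamp, size) packets into segments of at most max_bytes bytes."""
    segments, current, total = [], [], 0
    for ts, size in packets:
        # Close the current segment if this packet would exceed the byte limit.
        if current and total + size > max_bytes:
            segments.append(current)
            current, total = [], 0
        current.append((ts, size))
        total += size
    if current:
        segments.append(current)
    return segments
```

A long-lived flow is thus cut into several shorter flows, each of which gets its own statistical features.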

    Therefore, the flow generator first produces rich features from raw traffic using a time-based or size-based segmentation scheme. Each feature is then standardized to improve the reliability of the data:
    $z = (x - \mu) / \sigma$
    where $\mu$ and $\sigma$ are the mean and standard deviation of the feature.
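A minimal sketch of this standardization, applied column by column to a feature matrix. The helper name `standardize` and the small `eps` guard for constant features are additions for this example and are not taken from the paper.

```python
import numpy as np

def standardize(features, eps=1e-12):
    """Apply z = (x - mu) / sigma to every column of a [flows, features] array."""
    features = np.asarray(features, dtype=float)
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    # eps avoids division by zero for features that never vary (assumption).
    return (features - mu) / np.maximum(sigma, eps)
```

In practice the mean and standard deviation would normally be computed on the training set only and reused for the test set.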

    2. Compose flow sequences

    Task: group the generated flows into flow sequences

    If a flow sequence contains too few flows, it carries insufficient information and the ideal classification performance cannot be reached.
    If a flow sequence contains too many flows, the amount of computation increases and efficiency drops.

    Therefore, each flow sequence is set to contain eight consecutive flows.
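Grouping flows into sequences can be sketched as below. Whether the sequences overlap and what happens to leftover flows is not stated here, so this example assumes non-overlapping groups and drops an incomplete trailing group; `compose_flow_sequences` is a hypothetical helper name.

```python
def compose_flow_sequences(flows, seq_len=8):
    """Group consecutive flows into non-overlapping sequences of seq_len flows.

    Assumption for this sketch: an incomplete trailing group is discarded.
    """
    usable = len(flows) - len(flows) % seq_len
    return [flows[i:i + seq_len] for i in range(0, usable, seq_len)]
```

For example, 20 flows ordered by generation time yield two sequences of eight flows, and the last four flows are left out.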

    3. Build graph structures

    Task: generate a graph for each flow sequence

    Graph generation is the key to applying GCN successfully. The graph structure is meant to enable a more efficient and reasonable analysis of flow sequences. Clearly, the flows in each sequence are related in several ways. The graph structure is constructed from the following two aspects.

    • Attribute Relationship Graph (ARG):
      The relationship between the F-flow and the R-flow generated by the same application request is defined as an attribute relationship.

    Triple: (flow index, transmitted bytes, received bytes)
    flow index: generated from (source/destination IP, source/destination port, protocol)

    Conditions for connecting two flows:
    (1) Both flows have the same flow index
    (2) The numbers of bytes sent and received are swapped between the two flows

    Based on these triples, each F-flow is connected to its R-flow, and the attribute relationship weight is set to 1.
    $G_a(V, E) = \text{3TupleMatching}$
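The triple-matching rule can be sketched as an adjacency matrix construction. The input layout (one `(flow_index, bytes_sent, bytes_received)` triple per flow) and the name `build_arg` are assumptions for this example.

```python
import numpy as np

def build_arg(flows):
    """Build the ARG adjacency matrix from (flow_index, bytes_sent, bytes_received) triples."""
    n = len(flows)
    adj = np.zeros((n, n))
    for i in range(n):
        idx_i, sent_i, recv_i = flows[i]
        for j in range(i + 1, n):
            idx_j, sent_j, recv_j = flows[j]
            # Same flow index, and sent/received byte counts are swapped.
            if idx_i == idx_j and sent_i == recv_j and recv_i == sent_j:
                adj[i, j] = adj[j, i] = 1.0
    return adj
```

Each matched F-flow/R-flow pair produces a symmetric edge of weight 1; all other entries stay 0.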

    • Time Relationship Graph (TRG):
      The consecutive flows are ordered by their generation time. The closer two flows are in generation time, the stronger their correlation and the higher the weight.

    Conditions for connecting two flows:
    (1) Both are F-flows or both are R-flows

    Suppose the $a$-th flow after sorting is $flow_a$ and the $b$-th flow is $flow_b$. The distance between the two flows is $|b - a|$ and the initial weight is $1/|b - a|$, i.e. $G_t(V, E) = distance^{-1}$

    To set the time relationship weights of each flow more reasonably, they are passed through the Softmax function so that the new weights sum to 1. Assuming a flow has $n$ time relationship weights, this can be expressed as: $w'_1, w'_2, \ldots, w'_n = \mathrm{Softmax}(w_1, w_2, \ldots, w_n)$
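One way to read the TRG construction is sketched below: for each flow, the inverse-distance weights to all flows of the same direction are computed and then passed through a softmax. Applying the softmax per flow (per row) is an interpretation made for this example, and `build_trg` is a hypothetical name.

```python
import numpy as np

def build_trg(directions):
    """Build the TRG adjacency matrix.

    directions: list of 'F' or 'R', one per flow, ordered by generation time.
    """
    n = len(directions)
    adj = np.zeros((n, n))
    for a in range(n):
        # Only flows with the same direction (both F or both R) are connected.
        nbrs = [b for b in range(n) if b != a and directions[b] == directions[a]]
        if not nbrs:
            continue
        w = np.array([1.0 / abs(b - a) for b in nbrs])  # initial weights
        e = np.exp(w - w.max())                         # numerically stable softmax
        adj[a, nbrs] = e / e.sum()
    return adj
```

Because the softmax is applied row by row, the resulting matrix is generally not symmetric, and nearer flows of the same direction receive larger weights.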

    After building the ARG and TRG, the adjacency matrices of the two graphs are normalized to obtain the fused graph.

    4. Select feature combination

    Task: use LightGBM to compute the importance of each flow feature for feature selection

    Input: all flows, shape: [flows_num, features_num]
    Output: the label of each flow (its category), shape: [flows_num, 1]
    Intermediate product: the importance of each flow feature
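The selection step could look roughly like the sketch below, which trains a LightGBM classifier, reads its feature importances, and keeps the top-ranked columns. The helper names, the use of gain-based importance, and the number of trees are assumptions for this example and are not taken from the paper.

```python
import numpy as np

def lightgbm_importances(X, y, n_estimators=100):
    """Train a LightGBM classifier on flows X and labels y, return per-feature importance."""
    # Imported lazily so select_top_k below can be used without LightGBM installed.
    import lightgbm as lgb
    model = lgb.LGBMClassifier(n_estimators=n_estimators, importance_type="gain")
    model.fit(X, y)
    return model.feature_importances_


def select_top_k(importances, k):
    """Return the column indices of the k most important features, best first."""
    order = np.argsort(-np.asarray(importances, dtype=float), kind="stable")
    return order[:k]
```

A typical use would be `cols = select_top_k(lightgbm_importances(X_train, y_train), k)` followed by `X_train[:, cols]`, with the same columns reused for the test set.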

  2. Flow sequence-based ResGCN classifier

    Input: FS & FS graph. A flow sequence consists of 8 consecutive flows, and each flow is a node in the graph.

    ResGCN block: consists of 2 GCN units.

    GCN unit: includes two key components, the Generated Feature Interaction (GFI) module and the Related Flow Interaction (RFI) module

    • GFI is a fully connected layer without bias. It applies a linear transformation to the features of each flow (such as packet interval, packet size, etc.), allowing different features to interact.
    • RFI lets related flows exchange information according to the relationship graph.

    Using the fused graph described above, these two modules enable effective information interaction across different features and across related flows.

    The paper is not very clear here; my understanding is:
    (1) Two graphs are obtained above, ARG and TRG. If the adjacency matrix of ARG is $A_1$ and the adjacency matrix of TRG is $A_2$, the fused adjacency matrix is $A = A_1 + A_2$
    (2) Let the feature matrix of a flow sequence be $H$ (shape [8, features_num])
    (3) $H$ first passes through the GFI module, a fully connected layer without bias: $H' = HW$, where $W$: [features_num, embedding_num] and $H'$: [8, embedding_num]
    (4) It then passes through the RFI module:
    $H'^{(l+1)} = \delta(\widetilde{D}^{-1/2}\widetilde{A}\widetilde{D}^{-1/2}H'^{(l)}W'^{(l)})$
    where $\widetilde{A} = A + I$ is the adjacency matrix with self-loops, $\widetilde{D}$ is its degree matrix, and $\delta$ is the activation function.
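The forward pass described in this interpretation can be sketched with NumPy. Everything here is illustrative: ReLU is assumed as the activation, the skip connection is assumed to wrap both GCN units of a block, and the function names are hypothetical; the paper's actual layer layout may differ.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_unit(H, A_hat, W_gfi, W_rfi):
    """GFI (bias-free linear map) followed by RFI (graph propagation) and ReLU."""
    H_gfi = H @ W_gfi                           # features of each flow interact
    return np.maximum(A_hat @ H_gfi @ W_rfi, 0.0)  # related flows exchange information


def resgcn_block(H, A_hat, weights):
    """Two GCN units plus a skip connection (assumed placement).

    weights: two (W_gfi, W_rfi) pairs that keep the feature width unchanged.
    """
    out = H
    for W_gfi, W_rfi in weights:
        out = gcn_unit(out, A_hat, W_gfi, W_rfi)
    return H + out
```

With `A = A_1 + A_2` from the two graphs, `A_hat = normalize_adjacency(A)` is computed once per flow sequence and shared by all blocks.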

    A dropout layer is used to improve the generalization ability of the model and reduce overfitting. Finally, the output of the dropout layer is flattened.

    Flows are classified by a 3-layer MLP with 2 hidden layers and 1 output layer. The first hidden layer is a linear layer with an output size of 220 followed by a rectified linear unit (ReLU). The second hidden layer has the same structure but an output size of 110.

    Summary:

    Input layer -> 4 * ResGCN block -> dropout -> flatten -> linear1 (xx*220) -> ReLU -> linear2 (220*110) -> ReLU -> output (110*category)
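To get a feel for how the unspecified flattened width (the "xx" above) drives the size of the classification head, the parameter count of the three linear layers can be computed as follows. The helper name is hypothetical, and bias terms are assumed to be present, as is the default for linear layers in PyTorch.

```python
def mlp_head_parameters(flat_dim, num_classes, hidden=(220, 110)):
    """Count weights and biases of the linear1 -> linear2 -> output head."""
    sizes = [flat_dim, *hidden, num_classes]
    # Each linear layer has in*out weights plus out biases (assumption: bias=True).
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
```

For instance, if each of the 8 flows ended up with a 64-dimensional embedding (a made-up value), `flat_dim` would be 8 * 64 = 512, and almost all head parameters would sit in linear1.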

    Note: According to correspondence with the author, the GCN model is a graph-level model, i.e. only one label is produced per graph. The original pcap file therefore has to be split by label into multiple pcap files of different categories; Tranalyzer2 is then used to extract flow features from each pcap file, and the graph structures are generated afterwards.

3. Evaluation

Evaluation objectives:

  • RQ1: How effective is the traffic data processing scheme when dealing with real-world raw traffic?
  • RQ2: How well does ResGCN identify different network services in real network traffic?
  • RQ3: Does ResGCN achieve better performance than state-of-the-art methods?
  1. Evaluation dataset

    Anonymous network traffic analysis: SJTU-AN21 dataset
    Real-world traffic dataset: ISCXVPN2016 dataset


  2. Experiment settings

    • RQ1: The performance of the proposed raw traffic data processing method is evaluated on different traffic segmentation schemes and feature selection methods, and the optimal combination is determined for subsequent experiments.
    • RQ2: The training process of ResGCN is analyzed, and the confusion matrix of the classification results on the test dataset is discussed.
    • RQ3: The classification performance of ResGCN is compared with current traffic classification methods on the test dataset.

    Experimental environment:

    Software: Python 3.7 + PyTorch.
    Hardware: Intel® Core™ [email protected] GHz, 64 GB RAM, NVIDIA GeForce RTX3090 GPU.

    Evaluation metrics:

    Model performance: Recall, Precision, F1
    Model complexity: FLOPs
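For reference, the three performance metrics can be derived from a confusion matrix as sketched below. Macro averaging over classes is an assumption of this example (the averaging used in the paper is not stated here), and `macro_metrics` is a hypothetical helper name.

```python
import numpy as np

def macro_metrics(conf):
    """Return macro-averaged (precision, recall, F1).

    conf[i, j] = number of samples of true class i predicted as class j.
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    predicted = conf.sum(axis=0)   # column sums: samples predicted as each class
    actual = conf.sum(axis=1)      # row sums: samples that belong to each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision.mean(), recall.mean(), f1.mean()
```

Classes that are never predicted or never occur get a score of 0 here instead of raising a division error.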

    Hyperparameter settings:

    epoch: >=100
    optimizer: SGD
    initial learning rate: 0.01
    batch_size: 80
    momentum: 0.9

  3. Effectiveness of the raw traffic data processing scheme (answer to RQ1)

    Three influencing factors: flow segmentation method, feature combination, and flow sequence length

    1. Flow segmentation method
      Six segmentation schemes were evaluated (time-based: 5 s, 10 s, 15 s; size-based: 5 MB, 10 MB, 15 MB).

    2. Feature combination
      Three feature selection methods were compared (PCA-based, XGBoost-based, and LightGBM-based).
      Feature selection is performed on the training dataset, and the accuracy of the method is evaluated on the test dataset.

    3. Flow sequence length
      Fixed beforehand at 8, i.e. 8 consecutive flows after traffic segmentation form one flow sequence.


    The figure shows that, on both datasets, the 10 s time-based segmentation scheme gives the best classification performance and the 15 MB scheme the worst. Analysis of the data produced by the 15 MB scheme also shows that many of the resulting flows last a very long time, which clearly weakens the temporal and spatial correlation between flows. This supports the paper's claim that the longer the time interval, the weaker the correlation between two flows.

    The table shows that PCA is faster for feature selection, but its accuracy is lower than LightGBM's. Weighing speed against accuracy, LightGBM is chosen for feature selection.

  4. ResGCN classifier performance (answer to RQ2)

    [Figure: training process]
    Note: The x-axis in the original figure is labeled "Number of Features", which I believe is a mistake, so I changed it to "Number of Epoch".

    [Figure: confusion matrix]

    [Figure: ResGCN ablation experiment, i.e. the model's performance after removing one of its modules]

  5. Comparison with other methods (answer to RQ3)

  1. Traditional ML-based classification methods often give unsatisfactory results, which indicates that they have limited ability to classify complex network traffic.
  2. The classification performance of the 2D-CNN and 3D-CNN models (which read pcap files directly without computing statistical features) is very limited.
  3. CNN, LSTM, and LDAE all use statistical features and flow sequences and achieve significant performance improvements. However, because they do not mine the internal relationships within a flow sequence, they still cannot achieve high accuracy.
  4. The LAttn model learns the intrinsic relationships within flow sequences through an attention mechanism, which further improves performance.
  5. FS-Net obtains a better feature representation of encrypted traffic by introducing a reconstruction loss, which effectively improves classification performance. However, its structure also brings a large number of parameters.

ResGCN designs the model structure from a new perspective: it uses the generated relationship graph to extract features from flow sequences, which significantly improves the classification performance on anonymous network traffic.

  6. Model complexity

    We analyzed model complexity, parameter count, model size, and speed. These measures matter for deployments where a smaller model (or a faster one that can analyze traffic in real time) is more important than raw classification performance, since a model that is too large cannot run on some devices. Because memory and CPU usage are easily disturbed by other programs, FLOPs are used to reflect model complexity and estimate the hardware cost of running the model.

    The number of parameters also affects memory usage during inference. Model size is the disk space the model occupies, and speed is the number of flows the model can process per millisecond.

    1. Both 2D-CNN and 3D-CNN models are very complex due to the use of 2D convolutions for feature extraction.
    2. The gate structure of LSTM-based models is complex, resulting in very low computational efficiency.
    3. C4.5 is fast, but its sub-par classification performance limits its deployment.

    Benefiting from the efficient feature extraction of GCN on flow sequences, ResGCN achieves accurate classification without a large number of parameters.

Conclusion

In this paper, we propose a novel framework for identifying network traffic based on flow sequences. It uses ResGCN to identify different anonymous network services by exploiting the attribute and time relationships between flows. Moreover, as an end-to-end real-time traffic identification method, the framework handles real traffic effectively: it covers traffic segmentation, generating and enriching flow features from raw traffic, and LightGBM-based feature combination selection, which prevents unimportant features from degrading model performance and efficiency. Experimental results show that, thanks to its structure, the ResGCN classifier achieves higher classification accuracy, lower complexity, and faster classification speed.

1. Highlights of the paper

  1. Accuracy:

    (1) The optimal parameters for the model are determined by evaluating various settings, such as the traffic segmentation method (for the time-based method, how many seconds per segment) and the number of selected features.
    (2) The ResGCN method is proposed to improve accuracy: graphs are built from both the time and the attribute perspective, so that the model can fully extract the information between flows.

  2. Real-time performance:

    (1) LightGBM is used to select features, so that feature extraction is fast enough.
    (2) Benefiting from the effective feature extraction of GCN on flow sequences, ResGCN achieves accurate classification without a large number of parameters.

2. Disadvantages of the paper

  1. The model applies only to classifying normal network application traffic and does not consider malicious traffic. A module that separates normal from abnormal traffic could be added to extend the model.
  2. The robustness of the model needs further study, i.e. whether the model can still classify traffic correctly when an attacker injects carefully crafted traffic.

3. Tools

  1. Traffic analysis tool Tranalyzer2, installation tutorial
  2. Data mining tool Weka, installation tutorial

4. Datasets

SJTU-AN21 dataset:https://github.com/iZRJ/The-SJTU-AN21-Dataset
ISCXVPN2016:https://www.unb.ca/cic/datasets/vpn.html

Origin blog.csdn.net/Dajian1040556534/article/details/129315656