【GNN+encrypted traffic】TFE-GNN: A Temporal Fusion Encoder Using GNN for Fine-grained Encrypted Traffic Classification

Paper title

English title : TFE-GNN: A Temporal Fusion Encoder Using Graph Neural Networks for Fine-grained Encrypted Traffic Classification
Conference : WWW '23: The ACM Web Conference 2023
Publication date : 2023-04-30
First author : Haozhen Zhang
BibTeX citation :

@inproceedings{zhang2023tfe,
  title={TFE-GNN: A Temporal Fusion Encoder Using Graph Neural Networks for Fine-grained Encrypted Traffic Classification},
  author={Zhang, Haozhen and Yu, Le and Xiao, Xi and Li, Qing and Mercaldo, Francesco and Luo, Xiapu and Liu, Qixu},
  booktitle={Proceedings of the ACM Web Conference 2023},
  pages={2066--2075},
  year={2023}
}

Summary

Encrypted traffic classification is receiving wide attention from researchers and industrial companies. However, existing methods only extract low-level features, fail to handle short flows because their statistical properties are unreliable, or treat headers and payloads equally and fail to mine the potential correlations between bytes. Therefore, this paper proposes a byte-level traffic graph construction method based on pointwise mutual information (PMI), and a temporal fusion encoder model based on graph neural networks (TFE-GNN) for feature extraction. In particular, the authors design dual embedding layers, a GNN-based traffic graph encoder, and a cross-gated feature fusion mechanism, which first embed header and payload bytes separately and then fuse them together to obtain a stronger feature representation. Experimental results on two real datasets demonstrate that TFE-GNN outperforms multiple state-of-the-art methods in fine-grained encrypted traffic classification tasks.

existing problems

  1. Encrypted traffic detection methods that rely on traffic statistical features require handcrafted feature engineering and may fail in some cases because low-level statistics are unreliable or unstable. Shorter flows show higher deviations than longer ones, since the statistical characteristics of short flows are almost non-existent.
  2. Most current GNN-based methods [1, 11, 21, 25, 29] construct graphs based on correlations between packets, which is in effect another use of statistical features and thus suffers from the same problems.
  3. GNN models that use packet bytes as features also have two major disadvantages: (1) Headers and payloads are mixed. Existing approaches simply treat a packet's header and payload equally, ignoring the difference in meaning between them. (2) Raw bytes are not fully utilized. Although packet bytes are used, most methods treat each packet as a node and only use its raw bytes as the node feature, so the packet content is not fully exploited.

paper contribution

  1. For the first time, a byte-level traffic graph construction method based on pointwise mutual information (PMI) is proposed. By converting packet byte sequences into graphs, byte-level traffic graphs support traffic classification from a different perspective.
  2. TFE-GNN is proposed, which processes packet headers and payloads separately and encodes each byte-level traffic graph into an overall representation vector for the packet. TFE-GNN therefore works with packet-level representation vectors instead of low-level ones.
  3. To evaluate the performance of the proposed TFE-GNN, it is compared with several existing methods on the self-collected WWT dataset and the public ISCX datasets [5, 15]. The results show that TFE-GNN outperforms these methods on user behavior classification.

How the paper solves the above problems:

  1. Use packet bytes instead of statistical features for feature engineering

Task of the paper:

Classify encrypted traffic, graph-level classification + RNN

1. Preliminaries

  1. Graph structure
  • $G = \{V, E, X\}$: $G$ denotes the graph, $V$ the set of nodes, $E$ the set of edges, and $X$ the node feature matrix.
  • $A$: the adjacency matrix of $G$, where $a_{i,j}$ indicates the connection between the $i$-th and $j$-th nodes.
  • $N(v)$: the neighborhood of node $v$.
  • $d_l$: the embedding dimension of layer $l$.
  • $TS = [P_{t_1}, P_{t_2}, ..., P_{t_n}],\ t_1 \le t_2 \le ... \le t_n$: $P_{t_i}$ denotes a single packet with a timestamp, $n$ is the sequence length of the traffic segment, and $t_1, t_n$ are the start and end times of the traffic segment, respectively.
  2. Encrypted traffic classification
  • $M$: number of training samples
  • $N$: number of traffic categories
  • $bs_i^j = [b_1^{ij}, b_2^{ij}, ..., b_m^{ij}]$: $m$ is the byte sequence length, and $b_k^{ij}$ is the value of the $k$-th byte of the $j$-th byte sequence of the $i$-th traffic sample
  • $s_i = [bs_1^i, bs_2^i, ..., bs_n^i]$: $n$ is the sequence length, and $bs_j^i$ is the $j$-th byte sequence of the $i$-th traffic sample. Here $s_i$ can be seen as the $TS$ defined above

Example:

A certain $s_i$ is [[0x01, 0x02, 0x03, ..., 0x10], [0x21, 0x32, 0x73, ..., 0x68], ..., [0x55, 0x65, ..., 0x79]]. $s_i$ contains multiple packets, each represented by its byte sequence; here $bs_1^i$ is [0x01, 0x02, 0x03, ..., 0x10] and $b_1^{i1}$ is 0x01.

2. Byte-level Traffic Graph Construction

  • Node: a byte value. Note that the same byte value shares the same node, so the number of nodes never exceeds 256, which keeps the graph at a manageable scale.
  • Correlation between bytes: pointwise mutual information (PMI) is used to model the similarity between two bytes; the similarity between byte $i$ and byte $j$ is denoted $PMI(i, j)$.
  • Edge: edges are constructed according to the PMI value. A positive PMI value indicates high semantic correlation between two bytes; a zero or negative value indicates little or no semantic correlation. Therefore, an edge is created only between two bytes with a positive PMI value.
  • Node features: the initial feature of each node is its byte value, with dimension 1 and range [0, 255].
  • Graph construction: since $PMI(i, j) = PMI(j, i)$, the graph is undirected.
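To make this concrete, below is a minimal sketch (my own illustration, not the authors' code) that counts byte co-occurrences within a sliding window and keeps only positive-PMI pairs as undirected edges; the window size of 5 matches the experimental setting reported later.

```python
import itertools
from collections import Counter
from math import log

import numpy as np

def build_byte_graph(byte_seq, window_size=5):
    """Build a byte-level graph: nodes are distinct byte values (at most 256),
    and an undirected edge links two values whose PMI over sliding windows
    is positive, where PMI(i, j) = log(p(i, j) / (p(i) * p(j)))."""
    windows = [byte_seq[k:k + window_size]
               for k in range(max(1, len(byte_seq) - window_size + 1))]
    n_win = len(windows)
    single = Counter()  # number of windows containing byte value i
    pair = Counter()    # number of windows containing both i and j
    for w in windows:
        uniq = sorted(set(w))
        single.update(uniq)
        pair.update(itertools.combinations(uniq, 2))

    nodes = sorted(single)                      # distinct byte values
    index = {b: k for k, b in enumerate(nodes)}
    edges = [(index[i], index[j])
             for (i, j), c_ij in pair.items()
             if log(c_ij * n_win / (single[i] * single[j])) > 0]
    x = np.array(nodes, dtype=np.float32)[:, None]  # initial feature: the byte value
    return x, edges
```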

3. Dual Embedding

The necessity of dual embeddings: byte values are often used as initial features for further vector embedding, and two bytes with different values correspond to two different embedding vectors. However, the meaning of a byte varies not only with the byte value itself, but also with the part of the byte sequence it sits in. In other words, two bytes with the same value can have completely different meanings in a packet's header and in its payload: the payload carries the transport content of the packet, while the header is the first part of the packet, describing that content. If two bytes with the same value in the header and the payload were mapped to the same embedding vector, the resulting confusion of meanings would make it difficult for the model to converge to optimal values for these embedding parameters.

Following this principle, the header and payload of the packet are processed separately, and byte-level traffic graphs (a byte-level traffic header graph and a byte-level traffic payload graph) are constructed for the two parts. Dual embedding uses two embedding layers that do not share parameters to embed the initial byte-value features of the two graphs into high-dimensional embedding vectors, respectively.

So there are two embedding matrices:

  • $E_{header} \in R^{K \times d_0}$: where $K$ is the number of nodes and $d_0$ is the embedding dimension
  • $E_{payload} \in R^{K \times d_0}$
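A minimal PyTorch sketch of the dual embedding, assuming each branch is a plain lookup table over the 256 possible byte values with an illustrative dimension $d_0$ (the paper does not spell out the embedding layer, as noted under the weaknesses below):

```python
import torch
import torch.nn as nn

class DualEmbedding(nn.Module):
    """Two embedding tables that do not share parameters, so the same
    byte value maps to different vectors in the header and the payload."""
    def __init__(self, num_values: int = 256, d0: int = 64):  # d0 is an assumption
        super().__init__()
        self.header_emb = nn.Embedding(num_values, d0)
        self.payload_emb = nn.Embedding(num_values, d0)

    def forward(self, header_bytes: torch.Tensor, payload_bytes: torch.Tensor):
        # header_bytes / payload_bytes: integer node features in [0, 255]
        return self.header_emb(header_bytes), self.payload_emb(payload_bytes)
```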

4. Traffic Graph Encoder with Cross-gated Feature Fusion

Model: 4-layer stacked GraphSAGE
Key points:

  1. Handle header and payload separately, using unshared parameters
  2. Concatenate the outputs of the 4 GraphSAGE layers as the final node embedding $h_v^{final}$, specifically: $h_v^{final} = concat(h_v^{(1)}, h_v^{(2)}, h_v^{(3)}, h_v^{(4)})$
  3. The readout layer uses mean pooling to obtain the header and payload graph embeddings $g_h$ and $g_p$, as follows: $g = \frac{h_1^{final} \oplus ... \oplus h_{|V|}^{final}}{|V|}$, where $|V|$ is the number of graph nodes
  • Cross-gated feature fusion
    Since features are extracted separately from the traffic header graph and the traffic payload graph, yielding the final representations $g_h$ and $g_p$ of the two graphs, the goal is to establish a reasonable relationship between $g_h$ and $g_p$ so as to obtain an overall representation of the packet bytes.

    Specific method: two linear layers + two activation layers are used to obtain the gating vectors $s_h$ and $s_p$; since the final activation layer is a sigmoid, the elements of $s_h$ and $s_p$ lie in [0, 1].

    $s_h = sigmoid(w_{h_2}^T PReLU(w_{h_1}^T g_h + b_{h_1}) + b_{h_2})$
    $s_p = sigmoid(w_{p_2}^T PReLU(w_{p_1}^T g_p + b_{p_1}) + b_{p_2})$
    $z = concat(s_h \odot g_p, s_p \odot g_h)$
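A sketch of one encoder branch plus the fusion step, assuming PyTorch Geometric for GraphSAGE and mean pooling; layer widths and the inter-layer activation/normalization here are simplifications, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import SAGEConv, global_mean_pool

class GraphEncoder(nn.Module):
    """One branch (header or payload): 4 stacked GraphSAGE layers whose
    outputs are concatenated into h_v^final, then mean-pooled into g."""
    def __init__(self, d0: int, d: int):
        super().__init__()
        self.convs = nn.ModuleList([SAGEConv(i, d) for i in (d0, d, d, d)])

    def forward(self, x, edge_index, batch):
        outs = []
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index))   # activation simplified here
            outs.append(x)
        h_final = torch.cat(outs, dim=-1)         # concat of the 4 layer outputs
        return global_mean_pool(h_final, batch)   # readout: g_h or g_p

class CrossGatedFusion(nn.Module):
    """Gates s_h, s_p from two linear layers (PReLU, then sigmoid); each
    gate modulates the *other* branch's embedding before concatenation."""
    def __init__(self, dim: int):
        super().__init__()
        def gate():
            return nn.Sequential(nn.Linear(dim, dim), nn.PReLU(),
                                 nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_h, self.gate_p = gate(), gate()

    def forward(self, g_h, g_p):
        s_h, s_p = self.gate_h(g_h), self.gate_p(g_p)
        return torch.cat([s_h * g_p, s_p * g_h], dim=-1)  # z = concat(...)
```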

5. Downstream tasks

Since the raw bytes of each packet in a traffic segment have been encoded into a representation vector $z$, the segment-level classification task can be viewed as a time-series prediction task. The models used include LSTM and Transformer.
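A sketch of the downstream step under the LSTM option (hidden size and the last-state readout are assumptions; the Transformer variant would swap in an encoder plus pooling):

```python
import torch
import torch.nn as nn

class SegmentClassifier(nn.Module):
    """Treat the per-packet vectors z of a segment as a time series and
    classify the whole segment from the last LSTM hidden state."""
    def __init__(self, z_dim: int, hidden: int, num_classes: int):
        super().__init__()
        self.lstm = nn.LSTM(z_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, z_seq: torch.Tensor) -> torch.Tensor:
        # z_seq: [batch_size, num_packets, z_dim]
        _, (h_n, _) = self.lstm(z_seq)
        return self.head(h_n[-1])  # logits: [batch_size, num_classes]
```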

6. Model summary

  1. Input: raw pcap traffic
  2. Traffic preprocessing: divide the traffic into segments, each segment represented as $s_i = [bs_1^i, bs_2^i, ..., bs_n^i]$.
  3. Graph construction: build graphs from each packet; for example, $bs_1^i$ yields two graphs (a header graph $G_h$ and a payload graph $G_p$). Each byte in the packet is a node (identical byte values share a node), and edges are constructed according to the PMI values between bytes. Thus one packet segment contains multiple graphs; for example, $s_i$ contains $2n$ graphs ($n$ header graphs and $n$ payload graphs).
  4. For the two graphs generated from each packet, embedding is performed separately, producing two embedding matrices: $E_{header} \in R^{K \times d_0}$ and $E_{payload} \in R^{K \times d_0}$.
  5. Feed the two embedding matrices into two 4-layer stacked GraphSAGE models that do not share parameters; the output is the concatenation of each layer's output.

Input: $E_{header}$: [batch_size, num_nodes, nodes_embedding_input], $E_{payload}$: [batch_size, num_nodes, nodes_embedding_input]
Output: $h_{header}^{final}$: [batch_size, num_nodes, nodes_embedding_output*4], $h_{payload}^{final}$: [batch_size, num_nodes, nodes_embedding_output*4]

  6. Perform mean pooling on the output to obtain a graph representation.

Input: $h_{header}^{final}$: [batch_size, num_nodes, nodes_embedding_output*4], $h_{payload}^{final}$: [batch_size, num_nodes, nodes_embedding_output*4]
Output: $g_h$: [batch_size, nodes_embedding_output*4], $g_p$: [batch_size, nodes_embedding_output*4]

  7. Cross-gated feature fusion

Input: $g_h$: [batch_size, nodes_embedding_output*4], $g_p$: [batch_size, nodes_embedding_output*4]
Output: $z$: [batch_size, nodes_embedding_output*8] (the concatenation in the fusion step doubles the dimension)

  8. Downstream task (in fact an NLP-style sequence prediction problem; $s_i$ is similar to a paragraph of an essay)

Input: $[[z_1^1, z_2^1, ..., z_m^1], [z_1^2, z_2^2, ..., z_m^2], ..., [z_1^n, z_2^n, ..., z_m^n]]$, i.e., a batch of $n$ segments, each a sequence of $m$ packet vectors $z$
Labels: $label = [label_1, label_2, ..., label_n]$, one label per segment

In general, a graph-level GNN is applied first to generate a vector $z$ (similar to a word vector in NLP) for each packet, and then an RNN is attached, turning the original problem into a sequence prediction problem.
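Putting the pieces together, here is a toy forward pass that wires the sketches above end to end on random dummy graphs (all shapes and sizes are illustrative only):

```python
import torch
from torch_geometric.data import Batch, Data

def random_graph(n_nodes: int = 20, n_edges: int = 40) -> Data:
    # Random byte-value nodes and edges, standing in for a real PMI graph.
    return Data(x=torch.randint(0, 256, (n_nodes,)),
                edge_index=torch.randint(0, n_nodes, (2, n_edges)))

emb = DualEmbedding(d0=64)
enc_h, enc_p = GraphEncoder(64, 128), GraphEncoder(64, 128)
fuse = CrossGatedFusion(4 * 128)                   # g_h, g_p are 4*128 wide
clf = SegmentClassifier(z_dim=2 * 4 * 128, hidden=256, num_classes=6)

z_seq = []
for _ in range(3):                                 # 3 packets in the segment
    gh_b = Batch.from_data_list([random_graph()])  # header graph
    gp_b = Batch.from_data_list([random_graph()])  # payload graph
    xh, xp = emb(gh_b.x, gp_b.x)
    g_h = enc_h(xh, gh_b.edge_index, gh_b.batch)   # [1, 4*128]
    g_p = enc_p(xp, gp_b.edge_index, gp_b.batch)
    z_seq.append(fuse(g_h, g_p))                   # [1, 2*4*128]

logits = clf(torch.stack(z_seq, dim=1))            # [1, 3, z_dim] -> [1, 6]
```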

Alternatively, we can regard the traffic classification problem directly as a text classification problem, with the following correspondence:

  • pcap file — an article
  • packet segment — a paragraph in an article
  • packet — a sentence in a paragraph in an article
  • Every byte in a packet — a word in a sentence

The work done by the graph neural network above is to obtain an appropriate vector representation for each packet (each "sentence").

7. Experiments

The experiments address the following research questions:

  • RQ1: How useful is each component (section 4.3)?
  • RQ2: Which GNN architecture performs best (Section 4.4)?
  • RQ3: How complex is the TFE-GNN model (Section 4.5)?
  • RQ4: To what extent will changes in hyperparameters affect the effectiveness of TFE-GNN (Section 4.6)?
  1. Experimental settings
  • Datasets:

    • ISCX VPNnonVPN : includes encrypted traffic (VPN) and non-encrypted traffic (non-VPN), 6 user behavior categories
    • ISCX Tor-nonTor : contains encrypted traffic (Tor) and non-encrypted traffic (non-Tor), 8 user behavior categories; because this dataset lacks the concept of a flow, it is divided into 60-second non-overlapping blocks, each treated as a flow for training.
    • WWT (self-collected data set) : WhatsApp: 12 user behavior categories; WeChat: 9 user behavior categories; Telegram: 6 user behavior categories. In addition, the start and end timestamps of each user behavior sample are recorded for traffic segmentation.

    With stratified sampling, the training set : test set ratio is 9:1.

  • Preprocessing:

    1. For each dataset, we define and filter out two kinds of "outlier" samples:

    (1) Empty flows or segments: traffic flows or segments in which no packet has a payload. An empty flow or segment contains no payload, so no corresponding graph can be constructed. In practice, such samples are often used merely to establish connections between clients and servers and carry little discriminative information for classification.
    (2) Overly long flows or segments: flows or segments whose length (i.e., number of packets) exceeds 10,000. Such samples contain too many packets and may include a large number of bad or retransmitted packets caused by a temporarily poor network environment or other reasons. In most cases they introduce too much noise, so they are also treated as outliers and removed.
    (3) Furthermore, for each remaining sample, bad packets and retransmitted packets are removed.

    2. Remove the Ethernet header.
    3. Remove the source and destination IP addresses and port numbers, to eliminate interference from this sensitive information.
  • Implementation details and baselines:

    Parameters:

    • Maximum number of packets per segment: 50
    • Maximum payload byte length and maximum header byte length: 150 and 40, respectively
    • PMI window size: 5
    • Epochs: 1520
    • Learning rate: 1e-2, gradually decayed to 1e-4
    • Optimizer: Adam
    • Batch size: 512
    • Warmup: 0.1
    • Dropout: 0.2

    Evaluation metrics:

    • Overall accuracy (AC), precision (PR), recall (RC), macro F1-score (F1)

    Baselines:

    • Methods based on traditional feature engineering : AppScanner[31], CUMUL[23], K-FP (K-Fingerprinting)[8], FlowPrint[32], GRAIN[43], FAAR[19], ETC-PS[40]
    • Methods based on deep learning : FS-Net[18], EDC[16], FFB[44], MVML[4], DF[30], ET-BERT[17]
    • Graph neural network-based methods : GraphDApp[29], ECD-GNN[11]
  2. Comparative experiments

    (comparison result tables omitted)

  3. Ablation experiments (RQ1)

  • H: header
  • P: payload
  • dual: dual embedding
  • JKN: concatenation of the four GraphSAGE layer outputs (jumping-knowledge style)
  • CGFF: cross-gated feature fusion
  • A&N: activation functions and batch normalization
  4. GNN architecture variants (RQ2)
    (architecture comparison table omitted)
  5. Model complexity analysis (RQ3)

(complexity comparison table omitted)

  6. Model sensitivity analysis (RQ4)
    This is essentially a hyperparameter selection problem: searching for the optimal hyperparameters.
    (sensitivity analysis figures omitted)

Summary

Paper content

  1. Methods learned

    Theoretical approach:

    1. Besides statistical features, one can try starting from byte features
  2. Pros and cons of the paper

    Pros:

    1. Instead of analyzing statistical features, the method starts from byte features, avoiding the problem that statistical features of short flows are hard to obtain
    2. The method can handle not only encrypted traffic but also unencrypted traffic

    Cons:

    1. For the dual embedding, there is no concrete description of how byte values are initially embedded into high-dimensional vectors, which makes reproduction difficult.
    2. Limited graph construction: the graph topology of this model is fixed before training, which may lead to suboptimal performance. Furthermore, TFE-GNN cannot handle the byte noise implicit in the raw bytes of each packet.
    3. Temporal information implicit in the byte sequence is unused: when constructing the byte-level traffic graph, no explicit temporal characteristics of the byte sequence are introduced. (I did not fully understand this point.)
  3. Innovative ideas

    1. When constructing the graph, identical byte values share a node, but no weight is assigned according to the frequency of each byte value. I think nodes corresponding to more frequent byte values should be given larger weights, making them more important nodes in the graph. For details, see https://blog.csdn.net/Dajian1040556534/article/details/130113702.

