Real-Time Anomaly Detection Based on Machine Learning

Purpose

Monitor network traffic in real time.

The ultimate goal of this research is to build models with different algorithms and to compare their accuracy in detecting anomalous behavior in NetFlow data.

Method (model)

LADS uses the One-Class SVM algorithm to construct a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the functional margin): the larger the margin, the lower the generalization error of the classifier.

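SVM (support vector machine): a supervised binary classification algorithm. The One-Class variant used by LADS is instead trained on samples of a single class (legitimate traffic).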
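A minimal sketch of the idea, assuming scikit-learn's `OneClassSVM` and a single feature (the integer offset of each IP from a reference address). The reference IP, the training block, the outside addresses, and the `nu` value are all made up for illustration and are not taken from the paper.

```python
import ipaddress

import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical reference ("modelled") IP; every feature is an offset from it.
REFERENCE_IP = int(ipaddress.IPv4Address("172.18.21.0"))


def ip_offset(ip: str) -> float:
    """Signed integer distance between an IP and the reference IP."""
    return float(int(ipaddress.IPv4Address(ip)) - REFERENCE_IP)


# Train only on legitimate addresses (here: one made-up /24 block).
train_ips = [f"172.18.21.{host}" for host in range(1, 255)]
X_train = np.array([[ip_offset(ip)] for ip in train_ips])

# nu is an upper bound on the fraction of training errors (assumed value).
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

# Addresses far outside the training block.
outside_ips = ["8.8.8.8", "10.0.0.1", "192.168.1.1"]
X_outside = np.array([[ip_offset(ip)] for ip in outside_ips])

outside_pred = model.predict(X_outside)            # -1 means anomalous
train_scores = model.decision_function(X_train)    # higher = more normal
outside_scores = model.decision_function(X_outside)
print(dict(zip(outside_ips, outside_pred)))
```

Working with offsets from a reference IP rather than raw 32-bit integers keeps the feature values small, which is kinder to the kernel computation. Addresses close to the edge of the training block may still be accepted, depending on `gamma` and `nu`.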

Dataset

The evaluation uses a valid dataset containing over 1.4 million packets (captured using NetFlow v5 and v9).

The network has two capture points, one in Room 1 and another in Room 2, both running Wireshark. The same program was used to create two packet capture files that were later converted into NetFlow v5 and NetFlow v9, with a maximum time interval of 3 minutes for each flow, and stored using the nfdump tools. Table 1 summarizes the information of the dataset used.

Before the training process, we cleaned the dataset so that broadcast, multicast and non-internal IP addresses were discarded. We split the dataset into two:

Dataset 1: IP addresses from Room 1 excluding broadcast and multicast (6,000 flows); and Dataset 2: keeping only internal traffic inside Room 1 (5,800 flows).
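The cleaning step could look roughly like the sketch below, using Python's standard `ipaddress` module. The internal network `172.18.0.0/16`, the flow dictionary layout, and the function names are assumptions made for this example; the paper does not describe how the filtering was implemented.

```python
import ipaddress

# Assumption: the internal network is 172.18.0.0/16 (made up for this sketch).
INTERNAL_NET = ipaddress.ip_network("172.18.0.0/16")
LIMITED_BROADCAST = ipaddress.IPv4Address("255.255.255.255")


def is_usable(ip: str) -> bool:
    """True for internal unicast addresses only."""
    addr = ipaddress.IPv4Address(ip)
    if addr.is_multicast or addr == LIMITED_BROADCAST:
        return False
    if addr == INTERNAL_NET.broadcast_address:
        return False
    return addr in INTERNAL_NET


def clean_flows(flows):
    """Keep flows whose source and destination are both usable."""
    return [f for f in flows if is_usable(f["src"]) and is_usable(f["dst"])]


flows = [
    {"src": "172.18.21.4", "dst": "172.18.21.9"},      # internal, kept
    {"src": "172.18.21.4", "dst": "224.0.0.251"},      # multicast, dropped
    {"src": "172.18.21.4", "dst": "255.255.255.255"},  # broadcast, dropped
    {"src": "172.18.21.4", "dst": "8.8.8.8"},          # external, dropped
]
cleaned = clean_flows(flows)
print(cleaned)  # [{'src': '172.18.21.4', 'dst': '172.18.21.9'}]
```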

Feature selection

  • IP address distance: For this experiment we transformed all IP addresses into integer values (e.g., 172.18.21.4 is transformed to 2886866180) and took the difference between the integer of the modelled IP and the integer of the newly observed IP. LADS was trained with Datasets 1 and 2. We used the One-Class SVM algorithm to compute the distance from the closest to the farthest IP address. A model is created from these data and a region is defined accordingly. During testing, we added seven IP addresses from outside the range used during training. All IPs from the block are considered legitimate and all those that fall outside the boundaries are considered anomalous. Results are shown in Figure 2.
  • IP address and protocol distance
  • we split the IP address into four octets and each octet is treated as a separate feature (e.g., 172.18.21.4 is transformed to 172, 18, 21, 4)
  • we transformed each IP into its corresponding binary form using the Label Binarizer encoding method
  • IP Location: we evaluate whether the IP address of the analyzed instance is a source or a destination, and allocate a value of zero or one accordingly (i.e., 0 if it is a source IP, 1 if it is a destination IP).
  • IP Distance: we compute the distance between the modelled IP and the new one (as performed in previous experiments).
  • IP Knowledge: we evaluate whether the IP address of the analyzed instance is known or unknown, and allocate a value of zero or one accordingly (i.e., 0 if it is a known IP, 1 if it is an unknown IP).
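The encodings in the list above can be sketched with the standard library alone. All function names here are hypothetical, and taking the absolute value of the difference as the "distance" is an assumption; the paper only says the two integers are subtracted.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Dotted-quad IPv4 address as a 32-bit integer."""
    return int(ipaddress.IPv4Address(ip))


def ip_distance(modelled_ip: str, observed_ip: str) -> int:
    """Absolute integer distance between two IPs (assumed definition)."""
    return abs(ip_to_int(modelled_ip) - ip_to_int(observed_ip))


def ip_octets(ip: str) -> list:
    """Four octets, each usable as a separate feature."""
    return [int(part) for part in ip.split(".")]


def flow_features(ip, is_destination, modelled_ip, known_ips):
    """[IP Location, IP Distance, IP Knowledge] for one analyzed instance."""
    return [
        1 if is_destination else 0,    # 0 = source IP, 1 = destination IP
        ip_distance(modelled_ip, ip),  # distance to the modelled IP
        0 if ip in known_ips else 1,   # 0 = known IP, 1 = unknown IP
    ]


known = {"172.18.21.4", "172.18.21.9"}
print(ip_to_int("172.18.21.4"))  # 2886866180
print(ip_octets("172.18.21.4"))  # [172, 18, 21, 4]
print(flow_features("172.18.21.9", True, "172.18.21.4", known))  # [1, 5, 0]
print(flow_features("8.8.8.8", False, "172.18.21.4", known)[2])  # 1
```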

Training

Since the training process builds a model based on the distance among IPs within the dimensional space in which they are embedded, an IP transformation is required. To reduce the dimensionality of the dataset used by LADS while keeping as much as possible of the information carried by the samples, Principal Component Analysis (PCA) is used to find alternative features that maintain around 99% of the data variance, meaning that around 99% of the information carried by the original dataset is retained.
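As a rough illustration of the 99% variance criterion, the sketch below uses scikit-learn's `PCA` with a fractional `n_components`, which keeps the smallest number of components whose cumulative explained variance reaches that fraction. The data is synthetic (four correlated columns generated from two latent factors) and is not the paper's feature set.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 4 observed features driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.0, 1.0, 1.0],
                   [0.0, 1.0, 1.0, -1.0]])
X = latent @ mixing + 1e-3 * rng.normal(size=(200, 4))

# Keep as many components as needed to explain at least 99% of the variance.
pca = PCA(n_components=0.99, svd_solver="full")
X_reduced = pca.fit_transform(X)

retained = pca.explained_variance_ratio_.sum()
print(X.shape[1], "->", X_reduced.shape[1], "features")
print(f"variance retained: {retained:.4f}")
```

The reduced matrix `X_reduced` would then be passed to the One-Class SVM in place of the original features.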

Testing

Four features were tested.

Results

Results show that a combination of multiple features (i.e., IP source, IP destination, distance between IPs, IP known, IP unknown) provides more accurate results and considerably reduces the false rates in the analysis performed.
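The notes do not define "false rates"; assuming they refer to the false positive rate (legitimate traffic flagged as anomalous) and the false negative rate (anomalies missed), they could be computed as in this sketch. The labels follow the +1 / -1 convention of One-Class SVM predictions, and the sample labels are made up.

```python
def false_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate).

    Labels: +1 = legitimate, -1 = anomalous. A "positive" is an anomaly alert.
    """
    pairs = list(zip(y_true, y_pred))
    false_pos = sum(1 for t, p in pairs if t == 1 and p == -1)
    false_neg = sum(1 for t, p in pairs if t == -1 and p == 1)
    n_legit = sum(1 for t in y_true if t == 1)
    n_anom = sum(1 for t in y_true if t == -1)
    fpr = false_pos / n_legit if n_legit else 0.0
    fnr = false_neg / n_anom if n_anom else 0.0
    return fpr, fnr


# Made-up labels: four legitimate and two anomalous instances.
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, -1, 1, -1, 1]
fpr, fnr = false_rates(y_true, y_pred)
print(fpr, fnr)  # 0.25 0.5
```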

Terminology

Live Anomaly Detection System (denoted by LADS): the real-time anomaly detection system described above. Manual inspection: checking traffic by hand.


Reposted from blog.csdn.net/AcSuccess/article/details/102178335