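Applying Deep Neural Networks to Intrusion Detection Systems (IDS)
Operating system: Ubuntu 16.04 LTS, 64-bit
GPU: GTX 1060 3GB
Development environment: Python 2.7, MATLAB R2016a
Deep learning framework: TensorFlow 1.1.0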
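0. Overview
Intrusion detection is the discovery of intrusive behavior: it collects and analyzes information from a number of key points in a computer network or computer system to determine whether there are violations of the security policy or signs of attack.
An intrusion detection system (IDS) is a standalone system that performs these functions, and it plays a very important role in keeping networked systems secure.
A traditional IDS works by analyzing and extracting intrusion patterns and attack characteristics and building rule and pattern libraries for detection. As a result it falls clearly short in detection accuracy and intelligence, and it requires a great deal of manual effort. The aim here is to improve on this with deep learning.
On this basis, the data is first preprocessed, and then a new deep neural network is designed and applied to intrusion detection. Compared with traditional methods, the detection rate improves significantly and the false alarm rate drops.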
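1. IDS model
Data collection and processing module: collects the intrusion data and preprocesses the feature data as required by the downstream model. This typically includes data filtering, numeric encoding, and normalization.
Feature learning module: uses the NDNN model to extract network features from a large volume of training traffic, iteratively optimizing the parameters of each layer, and saves the trained NDNN model.
Intrusion recognition and classification module: uses the saved NDNN model to recognize and classify unseen data and puts the classification result into a standard form. If a record is classified as an attack, a response is triggered and an alert is raised.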
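2. Data preprocessing
The KDD99 dataset, provided by MIT Lincoln Laboratory, is a widely used intrusion detection competition dataset.
The NSL-KDD dataset is an improved version of KDD99: redundant and duplicate records are removed, and the numbers of records in the training and test sets are more reasonable.
The KDD99 training set contains about 5 million connection records, and the test set about 3 million.
Each connection in the dataset is described by 41 features:
Basic features of individual TCP connections (9 features, 1-9)
Content features of TCP connections (13 features, 10-22)
Time-based traffic statistics (9 features, 23-31)
Host-based traffic statistics (10 features, 32-41)
Each record has 42 attributes: 3 string-valued features, 38 numeric features, and 1 label. Each connection is labeled as either normal or attack. Attacks are divided into 4 major categories covering 39 attack types: Probe (scanning and probing), DoS (denial of service), U2R (unauthorized access to local superuser privileges), and R2L (unauthorized remote access). The test set contains some attack types that never appear in the training set, which makes it possible to test how well the system generalizes.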
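The grouping of raw labels into the five classes can be sketched as a lookup table. This is only an illustration: the table lists the 22 attack labels of the KDD99 training set (test-only labels are left out), and the names `ATTACK_CATEGORY` and `to_category` are hypothetical, not taken from the original code.

```python
# Hypothetical lookup from raw KDD99 labels to the five classes.
# Only the 22 attack labels of the training set are listed here.
_GROUPS = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "Probe": ["ipsweep", "nmap", "portsweep", "satan"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop",
            "phf", "spy", "warezclient", "warezmaster"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
}

# Invert the grouping so each raw label points at its category
ATTACK_CATEGORY = {}
for category, labels in _GROUPS.items():
    for label in labels:
        ATTACK_CATEGORY[label] = category


def to_category(raw_label):
    """Map a raw label such as 'smurf.' to one of the five classes."""
    label = raw_label.strip().rstrip(".")
    if label == "normal":
        return "Normal"
    # Labels missing from the table (e.g. test-only attacks) stay unknown
    return ATTACK_CATEGORY.get(label, "Unknown")


print(to_category("smurf."))   # DoS
print(to_category("normal."))  # Normal
```

Returning "Unknown" for unlisted labels keeps the sketch honest about the attack types that only occur in the test set.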
(1) Encoding: the three string-valued features and the label in the last column are converted to numbers. The label is one-hot encoded.
Protocol type: 1 icmp; 2 tcp; 3 udp; 4 others.
Service: 1 domain_u; 2 ecr_i; 3 eco_i; 4 finger; 5 ftp_data; 6 ftp; 7 http; 8 hostnames; 9 imap; 10 login;
11 mtp; 12 netstat; 13 other; 14 private; 15 smtp; 16 systat; 17 telnet; 18 time; 19 uucp; 20 others.
Flag: 1 REJ; 2 RSTO; 3 RSTR; 4 S0; 5 S3; 6 SF; 7 SH; 8 others.
(2) Normalization: each numeric attribute is scaled to [0, 1] with y = (x - x_min) / (x_max - x_min), where x_min and x_max are the minimum and maximum of that attribute.
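The two preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the original code: the helper names `encode` and `min_max` are hypothetical, and it assumes that the minimum and maximum of each encoded column are the smallest and largest codes in the tables (1 and 4, 1 and 20, 1 and 8).

```python
# Integer codes copied from the tables above. Any value that is not listed
# falls back to the "others" code. The dataset files spell two of the
# values as domain_u and S0.
PROTOCOL_CODES = {"icmp": 1, "tcp": 2, "udp": 3}            # others -> 4
SERVICE_NAMES = ["domain_u", "ecr_i", "eco_i", "finger", "ftp_data", "ftp",
                 "http", "hostnames", "imap", "login", "mtp", "netstat",
                 "other", "private", "smtp", "systat", "telnet", "time",
                 "uucp"]                                     # others -> 20
SERVICE_CODES = {name: i + 1 for i, name in enumerate(SERVICE_NAMES)}
FLAG_CODES = {"REJ": 1, "RSTO": 2, "RSTR": 3, "S0": 4,
              "S3": 5, "SF": 6, "SH": 7}                     # others -> 8


def encode(value, codes, others):
    """Step (1): replace a string value by its integer code."""
    return codes.get(value, others)


def min_max(x, x_min, x_max):
    """Step (2): y = (x - x_min) / (x_max - x_min)."""
    if x_max == x_min:
        return 0.0  # constant column: avoid dividing by zero
    return (x - x_min) / float(x_max - x_min)


# Assumption: every code occurs in the data, so min/max are the code bounds
protocol = min_max(encode("icmp", PROTOCOL_CODES, 4), 1, 4)
service = min_max(encode("ecr_i", SERVICE_CODES, 20), 1, 20)
flag = min_max(encode("SF", FLAG_CODES, 8), 1, 8)
print(protocol, service, flag)
```

Under these assumptions the three values come out as 0, 1/19, and 5/7, which agrees with the first symbolic entries of the preprocessed sample record shown later in this post.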
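3. Network model
Neural network model
1. ReLU activation: it mitigates the vanishing-gradient problem that the Sigmoid function is prone to, and its derivative is simple to compute.
2. Adaptive Adam optimizer: after bias correction, the effective step size in each iteration stays within a bounded range, so parameter updates are relatively stable. Adam computes a separate adaptive learning rate for each parameter, needs little memory, and converges quickly. It also helps avoid problems such as a vanishing learning rate, slow convergence, or large fluctuations in the loss caused by high-variance parameter updates.
3. Softmax activation: typically used in networks with multiple output neurons, as a competitive multi-output classifier. Each output lies between 0 and 1, the outputs of all neurons sum to 1, and each output represents the probability of one class.
4. Fully connected layer: a fully connected layer with 5 nodes precedes the softmax layer and maps the 100-dimensional output of the last hidden layer to 5 dimensions, so that the input and output of the softmax layer have the same dimension.
Neural network model selection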
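The four components listed above can be illustrated with a small pure-Python sketch. It is not the TensorFlow model used in this post. The random weights stand in for trained parameters, all function names are hypothetical, and the Adam hyperparameters (`lr`, `beta1`, `beta2`, `eps`) are the commonly used defaults rather than values taken from the text.

```python
import math
import random


def relu(values):
    """ReLU: max(0, x); its derivative is 1 for x > 0 and 0 otherwise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer; weights holds one row per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of a single scalar parameter at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


random.seed(0)
# Stand-in for the 100-dimensional output of the last hidden layer
hidden = relu([random.uniform(-1, 1) for _ in range(100)])
# 5-node fully connected layer: 100 inputs -> 5 logits
weights = [[random.uniform(-0.1, 0.1) for _ in range(100)] for _ in range(5)]
biases = [0.0] * 5
probs = softmax(dense(hidden, weights, biases))
print(len(probs), round(sum(probs), 6))

# On the first step the bias-corrected update has magnitude close to lr
theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(theta)
```

The last call shows why the step size is bounded: after bias correction the first update moves the parameter by roughly `lr`, whatever the raw gradient magnitude.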
Types of detected intrusion

| | Predicted: Attack | Predicted: Normal |
|---|---|---|
| Actual: Attack | TP | FN |
| Actual: Normal | FP | TN |

The specific definitions of the five metrics are as follows:
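Detection rate (DR) = recall (R) = (number of attack records detected / total number of attack records) × 100%, i.e. TP / (TP + FN)
False detection rate (FDR) = (number of normal records misclassified as attacks / total number of normal records) × 100%, i.e. FP / (FP + TN)
Missed alarm rate (MAR) = (1 - DR) × 100%, with DR taken as a fraction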
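In terms of the TP, FN, FP, and TN counts of the confusion matrix, the three definitions above can be computed as in the following sketch. The function name `detection_metrics` and the counts in the example are hypothetical and chosen only for illustration. The values are returned as fractions rather than percentages.

```python
def detection_metrics(tp, fn, fp, tn):
    """Return (DR, FDR, MAR) as fractions from confusion-matrix counts."""
    dr = tp / float(tp + fn)    # detected attacks / all attacks
    fdr = fp / float(fp + tn)   # normal flagged as attack / all normal
    mar = 1.0 - dr              # missed attacks / all attacks
    return dr, fdr, mar


# Hypothetical counts: 100 attack records and 900 normal records
dr, fdr, mar = detection_metrics(tp=90, fn=10, fp=5, tn=895)
print(round(dr, 4), round(fdr, 4), round(mar, 4))  # 0.9 0.0056 0.1
```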
| Algorithm | DR | FDR | MAR |
|---|---|---|---|
| Adaboost [28] | 0.8340 | 0.1740 | 0.1660 |
| Auto-encoder Network [29] | 0.9890 | 0.0110 | 0.0110 |
| LSSVM-IDS + FMIFS [33] | 0.9946 | 0.0013 | 0.0054 |
| LSSVM-IDS + MIFS (β=0.3) [33] | 0.9938 | 0.0023 | 0.0062 |
| LSSVM-IDS + FLCFS [33] | 0.9847 | 0.0061 | 0.0153 |
| LSSVM-IDS + All features [33] | 0.9916 | 0.0097 | 0.0084 |
| Unoptimized DBN-PNN [20] | 0.9931 | - | 0.0069 |
| Optimized DBN-PNN [20] | 0.9914 | - | 0.0086 |
| PCA-PNN [20] | 0.9828 | - | 0.0172 |
| PNN [20] | 0.9904 | - | 0.0096 |
| Proposed algorithm | 0.9995 | 0.0003 | 0.0005 |

Related intrusion detection algorithms based on deep neural networks

| References | Methods | Performance |
|---|---|---|
| Fiore et al. [17] | Restricted Boltzmann Machine (RBM) | Accuracy: around 94% |
| K. Do et al. [18] | An ensemble of Deep Belief Nets (DBNs) | Detection F-score on mixed data is around 72%. |
| Khaled et al. [19] | RBM together with DBNs | Detection rate is 97.9%. |
| G. Zhao et al. [20] | DBNs with probabilistic neural network (PNN) | Detection accuracy is about 99%. Detection rate is about 90%. |
| Niyaz et al. [16] | Self-taught learning (STL) | Accuracy rate is more than 98%, a little lower than 99%. F-measure can achieve 98.84%. |
| S. Potluri et al. [21] | Accelerated Deep Neural Network (DNN) | The highest detection accuracy is 97.7%. |
| Roy et al. [22] | Deep Neural Network (DNN) | Better than SVM in intrusion detection. |
| Yin et al. [15] | Recurrent neural networks (RNN-IDS) | Superior to traditional machine learning classification methods. |

Taking one connection record as an example, the raw data looks like this:
0,icmp,ecr_i,SF,1032,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,511,511,0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,smurf.
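The same record after preprocessing (41 encoded and normalized features followed by the 5-element one-hot label):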
0.0 0.0 0.0526315789474 0.714285714286 1.48837071923e-06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0
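Splitting a raw record into its parts can be sketched as follows. The record string is the one shown above. The function name `split_record` is hypothetical, and the field positions assume the standard column order (protocol_type, service, and flag in positions 2 to 4, label last).

```python
# Raw record copied from the example above
RAW_RECORD = ("0,icmp,ecr_i,SF,1032,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,511,511,"
              "0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,"
              "1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,smurf.")


def split_record(line):
    """Split one CSV record into string-valued features, numeric features, and label."""
    fields = line.strip().split(",")
    features = fields[:-1]
    label = fields[-1].rstrip(".")  # labels end with a trailing dot
    symbolic = {
        "protocol_type": features[1],
        "service": features[2],
        "flag": features[3],
    }
    # Everything except the three string-valued columns is numeric
    numeric = [float(x) for i, x in enumerate(features) if i not in (1, 2, 3)]
    return symbolic, numeric, label


symbolic, numeric, label = split_record(RAW_RECORD)
print(symbolic, len(numeric), label)
```

The string-valued part then goes through the encoding step and every column through min-max scaling, while the label is mapped to its attack category before one-hot encoding.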
Details of the KDD 99 dataset

| Intrusion category | Number of training data | Number of testing data |
|---|---|---|
| Probe | 3723 | 384 |
| DoS | 356691 | 34767 |
| U2R | 41 | 11 |
| R2L | 1024 | 102 |
| Normal | 88515 | 8763 |

| Intrusion category | DR | FDR | MAR |
|---|---|---|---|
| Probe | 0.9896 | 0.0001 | 0.0104 |
| DoS | 0.9997 | 0 | 0.0003 |
| U2R | 0.9091 | 0.0001 | 0.0909 |
| R2L | 0.9804 | 0.0001 | 0.0196 |
| Overall | 0.9995 | 0.0003 | 0.0005 |

| Intrusion category | Recall | Accuracy | F-measure | Precision |
|---|---|---|---|---|
| Probe | 0.9896 | 0.9818 | 0.9909 | 0.9922 |
| DoS | 0.9997 | 0.9990 | 0.9998 | 0.9999 |
| U2R | 0.9091 | 0.8182 | 0.8333 | 0.7692 |
| R2L | 0.9804 | 0.9706 | 0.9756 | 0.9709 |
| Normal | 0.9995 | 0.9997 | 0.9997 | 0.9999 |

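The F-measure values are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The text does not state this definition, so it is treated here as an assumption and checked only against the U2R row of the KDD 99 results; the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0  # both zero: define F as 0 to avoid dividing by zero
    return 2.0 * precision * recall / (precision + recall)


# U2R row of the KDD 99 results: precision 0.7692, recall 0.9091
print(round(f_measure(0.7692, 0.9091), 4))  # 0.8333
```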
Details of the NSL-KDD dataset

| Intrusion category | Number of training data | Number of testing data |
|---|---|---|
| Probe | 10422 | 1235 |
| DoS | 41407 | 4520 |
| U2R | 41 | 11 |
| R2L | 896 | 98 |
| Normal | 61110 | 6233 |

| Intrusion category | DR | FDR | MAR |
|---|---|---|---|
| Probe | 0.9935 | 0.0009 | 0.0065 |
| DoS | 0.9940 | 0 | 0.0060 |
| U2R | 0.9091 | 0.0002 | 0.0909 |
| R2L | 0.9796 | 0.0005 | 0.0204 |
| Overall | 0.9935 | 0.0016 | 0.0065 |

| Intrusion category | Recall | Accuracy | F-measure | Precision |
|---|---|---|---|---|
| Probe | 0.9935 | 0.9773 | 0.9927 | 0.9920 |
| DoS | 0.9940 | 0.9867 | 0.9959 | 0.9978 |
| U2R | 0.9091 | 0.8182 | 0.6452 | 0.5000 |
| R2L | 0.9796 | 0.9694 | 0.9412 | 0.9057 |
| Normal | 0.9935 | 0.9984 | 0.9959 | 0.9983 |

Features of an original intrusion data record

| Description | Feature | Data attributes |
|---|---|---|
| Basic features of individual TCP connections | duration | continuous |
| | protocol_type | symbolic |
| | service | symbolic |
| | flag | symbolic |
| | src_bytes | continuous |
| | dst_bytes | continuous |
| | land | symbolic |
| | wrong_fragment | continuous |
| | urgent | continuous |
| Content features within a connection suggested by domain knowledge | hot | continuous |
| | num_failed_logins | continuous |
| | logged_in | symbolic |
| | num_compromised | continuous |
| | root_shell | continuous |
| | su_attempted | continuous |
| | num_root | continuous |
| | num_file_creations | continuous |
| | num_shells | continuous |
| | num_access_files | continuous |
| | num_outbound_cmds | continuous |
| | is_host_login | symbolic |
| | is_guest_login | symbolic |
| Traffic features computed using a two-second time window | count | continuous |
| | srv_count | continuous |
| | serror_rate | continuous |
| | srv_serror_rate | continuous |
| | rerror_rate | continuous |
| | srv_rerror_rate | continuous |
| | same_srv_rate | continuous |
| | diff_srv_rate | continuous |
| | srv_diff_host_rate | continuous |
| Traffic features computed in and out of a host | dst_host_count | continuous |
| | dst_host_srv_count | continuous |
| | dst_host_same_srv_rate | continuous |
| | dst_host_diff_srv_rate | continuous |
| | dst_host_same_src_port_rate | continuous |
| | dst_host_srv_diff_host_rate | continuous |
| | dst_host_serror_rate | continuous |
| | dst_host_srv_serror_rate | continuous |
| | dst_host_rerror_rate | continuous |
| | dst_host_srv_rerror_rate | continuous |