Application system load analysis and disk capacity prediction

Full code and analysis

Experiment introduction

Experimental background

Large enterprises typically run their own office automation systems. In daily operation, an application system places load on the underlying software and hardware, which significantly affects its performance: if any underlying resource is overloaded, the application may slow down or even become paralyzed. It is therefore necessary to monitor the running status of servers, databases, middleware, and storage devices and to stay aware of the current application system load, so that problems can be prevented in advance and the system keeps running safely and stably.

Purpose

  • Use time series analysis on historical disk data to predict the used space of the application system's server disks.

  • Set different warning levels according to user needs, compare the predicted values with the capacity values, and issue early-warning judgments, providing system administrators with customized alerts.

Analytical Methods and Processes

Because storage usage is strongly correlated over time and historical data influences future development, this experiment uses the time series analysis method to predict and analyze the used space of the disk.

Model introduction

In this experiment, we use the time series analysis method to model the data.
First, let's get to know the ARIMA model. ARIMA, the Autoregressive Integrated Moving Average model (sometimes rendered as the summation autoregressive moving average model), is one of the standard time series forecasting methods.
In ARIMA(p, d, q), AR is "autoregression" and p is the number of autoregressive terms; MA is "moving average" and q is the number of moving-average terms; and d is the order of differencing required to make the series stationary.

The process of applying the ARIMA model:
- Identify the stationarity of the time series from its time series plot, autocorrelation function (ACF) plot, and partial autocorrelation function (PACF) plot.
- Stationarize non-stationary data, e.g. by differencing, until the ACF and PACF values of the processed series are no longer significantly different from zero.
- Build the corresponding model based on the identified features. After stationarizing, if the PACF cuts off and the ACF tails off, fit an AR model; if the PACF tails off and the ACF cuts off, fit an MA model; if both the PACF and the ACF tail off, the series fits an ARMA model.
- Make predictions using the tested model.

So, how do we identify whether a time series is stationary?

Stationarity Test

There are generally three ways to identify whether a series is stationary: the plot-inspection ("viewing") method, the autocorrelation and partial autocorrelation coefficients, and the unit root (ADF) test.

Let's first briefly introduce the viewing method:

Viewing method
[Figure: time series plots of (a) a stationary series and (b) a non-stationary series]

The "diagram" here refers to a timing diagram, that is, a timing diagram that changes over time. The graph a of a stationary series fluctuates up and down around a constant. The non-stationary figure b is with a clear trend of increasing or decreasing.

Autocorrelation and Partial Autocorrelation

Two concepts are involved here: truncation (the coefficients cut off to zero after some lag) and tailing (the coefficients decay gradually without all becoming zero).

[Figure: autocorrelation and partial autocorrelation plots of an example series]

For a stationary series, both the autocorrelation and partial autocorrelation plots are either tailed or truncated. Truncation means that beyond a certain lag the coefficients are all (approximately) zero. How do we read this? In the partial autocorrelation plot above, the coefficient at lag 1 is still very large, 0.914; at lag 2 it suddenly drops to 0.050, and the following values are all very small, close to 0. That is truncation. Tailing, by contrast, means the coefficients decay gradually but do not all become 0. The autocorrelation plot above is neither tailed nor truncated: it has a triangularly symmetric shape, which is typical of a monotonic trend.
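If you want to draw these plots yourself, statsmodels provides ready-made helpers. A minimal sketch on synthetic data (the series variable here is illustrative; any pandas Series, such as the disk usage values loaded later, works the same way):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# 'series' is illustrative synthetic data; substitute any real time series
series = pd.Series(np.random.default_rng(1).normal(size=100).cumsum())
plot_acf(series, lags=20)    # autocorrelation plot
plot_pacf(series, lags=20)   # partial autocorrelation plot
plt.show()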

Unit root test (ADF)
If the p-value of the unit root test is less than 0.05, the series is considered stationary.

How do we deal with a non-stationary series?

Difference processing

Differencing replaces each value with the difference between adjacent values, which eliminates some of the fluctuation and makes the data stationary.
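In pandas, differencing is a one-liner. A minimal sketch on toy values:

import pandas as pd

series = pd.Series([1.0, 2.5, 4.1, 7.2, 11.0])  # toy data
diff1 = series.diff(1).dropna()   # first-order difference: x_t - x_{t-1}
diff2 = diff1.diff(1).dropna()    # difference again if still non-stationary
print(diff1.round(1).tolist())    # [1.5, 1.6, 3.1, 3.8]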

Data extraction

To extract the disk data, performance records are selected by the attribute identification number (TARGET_ID) and the indicator collection time (COLLECTTIME). This experiment extracts the disk data of a database server in the financial management system from 2014-10-01 to 2014-11-16.

Data exploration and analysis

In this experiment, the time series analysis method is used for modeling, which requires exploring the stationarity of the data.

The stationarity of the data can be assessed initially from its time series plot. The server disk usage is examined for periodicity at a daily granularity.

Create a new file and add the following code:

import pandas as pd

dataPath = './data/discdata.xls'
data = pd.read_excel(dataPath)

# Plot time series of C- and D-drive usage
import matplotlib.pyplot as plt
import matplotlib

# Axis font configuration: SimHei can render CJK glyphs
matplotlib.rc('font', **{'family': 'serif', 'serif': ['SimHei']})
# Render minus signs correctly when a CJK font is active
plt.rcParams['axes.unicode_minus'] = False

data['COLLECTTIME'] = pd.to_datetime(data['COLLECTTIME'])
# C-drive records of server 184; .copy() avoids SettingWithCopyWarning
data1 = data[(data['ENTITY'] == 'C:\\') & (data['TARGET_ID'] == 184)].copy()
# Use the collection time as the index, modifying data1 in place
data1.set_index('COLLECTTIME', inplace=True)

data2 = data[(data['ENTITY'] == 'D:\\') & (data['TARGET_ID'] == 184)].copy()
data2.set_index('COLLECTTIME', inplace=True)

print(data.head())
print(data1.head())
print(data2.head())

After running, you can see the data we will be visualizing.

Add the following code to the source file:

plt.plot(data1.index, data1['VALUE'], 'ro-')
plt.title('Time series of C-drive used space')
plt.xlabel('Date')
plt.ylabel('Disk usage (KB)')
# Rotate the date tick labels for readability
plt.xticks(rotation=30)
plt.show()

After running, you can see the time series plot of the used space on the C drive.

Continue to add the following code:

plt.plot(data2.index, data2['VALUE'], 'ko-')
plt.title('Time series of D-drive used space')
plt.xlabel('Date')
plt.ylabel('Disk usage (KB)')
plt.xticks(rotation=30)
plt.show()

After running, you can see the time series plot of the used space on the D drive.

As the plots show, disk usage is not cyclical: it grows slowly over time, following a trend line.

The preliminary judgment is therefore that the data is non-stationary.

Data cleaning

In actual business, the monitoring system collects disk information at fixed times every day. In general, however, the capacity attribute of a disk is a fixed value, so the raw disk data contains duplicate capacity records.

During data cleaning, these duplicate disk-capacity records are removed, and the disk capacity of each server is treated as a fixed value, which is convenient for model-based early warning.

Open the file and add the following code:

# Drop rows that are duplicated in every column except the last,
# eliminating the repeated disk-capacity records
data.drop_duplicates(data.columns[:-1], inplace=True)

This removes the duplicate records from the original data.

Attribute construction

In the stored data, disk capacity is measured in KB. Since the disk information of each server can be distinguished by the three attributes NAME, TARGET_ID, and ENTITY, and these three values are constant for each server, they can be combined to construct a new attribute.

Open the file and add the following code:

# New column names combine the server name, TARGET_ID, and ENTITY
data3 = pd.DataFrame(index=data1.index,
                     columns=['SYS_NAME', 'CWXT_DB:184:C:\\', 'CWXT_DB:184:D:\\', 'COLLECTTIME'])
data3['SYS_NAME'] = data1['SYS_NAME']
data3['CWXT_DB:184:C:\\'] = data1['VALUE']   # C-drive used space
data3['CWXT_DB:184:D:\\'] = data2['VALUE']   # D-drive used space
data3['COLLECTTIME'] = data1.index
data3.to_excel('./data/data_processed.xls')
print(data3.head())

After running, you can see the data after attribute construction.

Capacity forecasting model

The processed data is divided into two parts: modeling sample data and model validation data. The last 5 records are held out as validation data, and the remaining records form the modeling sample, as sketched below.
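A minimal sketch of this split (the model-building code later performs the same operation):

import pandas as pd

df = pd.read_excel('./data/data_processed.xls')  # produced in the attribute-construction step
train = df.iloc[:-5]   # modeling sample data
valid = df.iloc[-5:]   # validation data: the last 5 records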

(1) First, test the stationarity of the observations. If the series is not stationary, apply differencing until the differenced data is stationary.

(2) Once the data is stationary, perform a white noise test. If the series is not white noise, carry out model identification to determine whether it fits an AR, MA, or ARMA model, and use the Bayesian information criterion (BIC) to determine the model order, i.e., the p and q parameters of the ARIMA model.

(3) After the model is identified, run a model check: test whether the model's residual series is white noise. If it fails the test, re-identify the model; for a model that passes, estimate its parameters by maximum likelihood.

(4) Use the model for prediction and analyze the error between actual and predicted values. If the error is small, the model fits well and the process can end; otherwise the parameters need to be re-estimated.

Stationarity Test

To confirm that the original data series contains no stochastic or deterministic trend, a stationarity test must be performed on the data; otherwise spurious regression ("pseudo-regression") can occur. In this section, the ADF method is used for the stationarity test.

Note that in this experiment we only use the C-drive capacity for model construction and prediction; you can repeat the process with the D-drive capacity as an exercise.
Create a new file model.py and add the following code:

# -*- coding: utf-8 -*-
import pandas as pd

# ADF test
# Parameter initialization
discfile = './data/data_processed.xls'

data4 = pd.read_excel(discfile)
# Hold out the last 5 records for validation
data = data4.iloc[:len(data4) - 5]

# Stationarity test
from statsmodels.tsa.stattools import adfuller as ADF

diff = 0
adf = ADF(data['CWXT_DB:184:C:\\'])
# adf[1] is the p-value; p < 0.05 means the series is stationary
while adf[1] >= 0.05:
    diff = diff + 1
    adf = ADF(data['CWXT_DB:184:C:\\'].diff(diff).dropna())

print('The original series becomes stationary after %s-order differencing; p-value = %s' % (diff, adf[1]))

After running, you can see that the original series becomes stationary after a first-order difference.

The value of d is therefore determined to be 1.

White noise test

To verify whether the useful information in the series has been fully extracted, a white noise test is performed. If the series tests as white noise, its useful information has already been extracted and what remains is random disturbance, which cannot be predicted or exploited. In this experiment, the Ljung-Box (LB) statistic is used for the white noise test.

Open the file model.py and add the following code:

# White noise test
# Ljung-Box (LB) statistic
from statsmodels.stats.diagnostic import acorr_ljungbox

[[lb], [p]] = acorr_ljungbox(data['CWXT_DB:184:C:\\'], lags=1)
if p < 0.05:
    print('The original series is non-white noise; p-value = %s' % p)
else:
    print('The original series is white noise; p-value = %s' % p)

[[lb], [p]] = acorr_ljungbox(data['CWXT_DB:184:C:\\'].diff(1).dropna(), lags=1)
if p < 0.05:
    print('The first-difference series is non-white noise; p-value = %s' % p)
else:
    print('The first-difference series is white noise; p-value = %s' % p)

After running, you can see the white noise test results for the original series and its first difference.

A first-order difference is still required to eliminate the non-stationarity of the original data.

Model identification

In this section, the maximum likelihood method is used to estimate each candidate model's parameters, and the BIC information criterion is used to determine the model order, i.e., the p and q parameters, so as to select the optimal model.

At this point, we have already determined that d = 1 in the ARIMA model.

Open the model.py file and add the following code:

# Model identification
# Determine the best p and q (d is already fixed at 1)
#xdata = data['CWXT_DB:184:D:\\']
xdata = data['CWXT_DB:184:C:\\']
from statsmodels.tsa.arima_model import ARIMA

# Order selection: the order generally does not exceed length/10
pmax = int(len(xdata) / 10)
qmax = int(len(xdata) / 10)
bic_matrix = []  # matrix of BIC values
for p in range(pmax + 1):
    tmp = []
    for q in range(qmax + 1):
        try:
            tmp.append(ARIMA(xdata, (p, 1, q)).fit().bic)
        except Exception:
            tmp.append(None)  # some orders fail to converge
    bic_matrix.append(tmp)
bic_matrix = pd.DataFrame(bic_matrix)
# stack() moves the columns into the index, so idxmin() returns the (p, q) pair
p, q = bic_matrix.stack().astype('float64').idxmin()
print('p and q with the smallest BIC: %s, %s' % (p, q))
# Results: C drive -> (0, 0); D drive -> (1, 1)

After running, you can see that the optimal model for the C-drive capacity is ARIMA(0,1,0).
The optimal model for the D-drive capacity is ARIMA(1,1,1).
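A side note: the statsmodels.tsa.arima_model.ARIMA class used here belongs to older statsmodels releases and was removed in statsmodels 0.13. On a newer version, an equivalent fit looks like the following sketch (not a drop-in replacement for every call in this experiment):

from statsmodels.tsa.arima.model import ARIMA  # newer statsmodels API

model = ARIMA(xdata, order=(p, 1, q)).fit()
print(model.bic)          # BIC, usable in the same order-selection loop
print(model.forecast(5))  # 5-step-ahead forecast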

Model checking

After the model is determined, check whether its residual series is white noise. If it is not, the residuals still contain useful information, and the model needs to be modified or the remaining information extracted.

Open the file model.py and add the following code:

# Model checking
lagnum = 12  # number of lags for the residual white noise test
from statsmodels.tsa.arima_model import ARIMA

arima = ARIMA(xdata, (p, 1, q)).fit()
xdata_pred = arima.predict(typ='levels')    # in-sample prediction on the original scale
pred_error = (xdata_pred - xdata).dropna()  # residuals

from statsmodels.stats.diagnostic import acorr_ljungbox

lb, p_l = acorr_ljungbox(pred_error, lags=lagnum)
h = (p_l < 0.05).sum()  # p < 0.05 indicates non-white noise at that lag
if h > 0:
    print('Model ARIMA(%s,1,%s) fails the white noise test' % (p, q))
else:
    print('Model ARIMA(%s,1,%s) passes the white noise test' % (p, q))

After running, you can see that the ARIMA(0,1,0) model for the C-drive capacity passes the white noise test.

Model prediction

Use the tested model to forecast the next 5 values and compare them with the actual values.

The last 5 records were not used when modeling; here they serve as validation data for the predictions, with the unit converted to GB.

Open the file model.py and add the following code:


# Model prediction
# forecast() predicts 5 steps ahead; element [0] is the forecast array
test_predict = arima.forecast(5)[0]

# Compare predictions with the held-out actual values
test_data = pd.DataFrame(columns=['actual capacity', 'predicted capacity'])
test_data['actual capacity'] = data4[(len(data4) - 5):]['CWXT_DB:184:C:\\']
test_data['predicted capacity'] = test_predict
test_data = test_data.applymap(lambda x: '%.2f' % x)
print(test_data)

After running, you can see the predicted values alongside the actual values.

Model evaluation

At this point, our model has been constructed. To evaluate the effectiveness of the time series forecasting model, this experiment uses three statistical indicators to measure its prediction accuracy: mean absolute error (MAE), root mean square error (RMSE), and mean absolute percent error (MAPE). These indicators reflect prediction accuracy from different angles.
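With prediction error e_i being the difference between predicted and actual values over the n validation points, the three indicators are, in LaTeX notation:

\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|e_i|, \qquad
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}e_i^2}, \qquad
\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{|e_i|}{y_i}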

Combined with the actual business analysis, the error threshold is set to 1.5.

Open the file model.py and add the following code:


# Error metrics
# Column operations: convert the formatted strings back to floats
test_data['predicted capacity'] = test_data['predicted capacity'].astype(float)
test_data['actual capacity'] = test_data['actual capacity'].astype(float)
# Divide by 10**6 to convert KB to GB
abs_ = (test_data['predicted capacity'] - test_data['actual capacity']).abs() / 10**6
mae_ = abs_.mean()                # mean absolute error (GB)
rmse_ = ((abs_**2).mean())**0.5   # root mean square error (GB)
# MAPE is a unit-free ratio, so the actual values are converted to GB as well
mape_ = (abs_ / (test_data['actual capacity'] / 10**6)).mean()

print('MAE: %0.4f\nRMSE: %0.4f\nMAPE: %0.6f' % (mae_, rmse_, mape_))

After running, you can see that the errors between the actual and predicted values are all less than the error threshold. The model's prediction accuracy is therefore within the acceptable range of the actual business, and the model can be used for prediction.
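Finally, to serve the early-warning purpose stated at the beginning, the forecast can be compared with the disk's total capacity. A hypothetical sketch (the capacity value and the 80%/90% warning levels below are illustrative, not taken from the experiment's data):

# Hypothetical early-warning check on the forecast values (units: KB)
capacity_kb = 52428800  # illustrative total C-drive capacity (50 GB in KB)
for day, used_kb in enumerate(test_predict, start=1):
    ratio = used_kb / capacity_kb
    if ratio >= 0.9:
        level = 'critical'
    elif ratio >= 0.8:
        level = 'warning'
    else:
        level = 'normal'
    print('day %d: %.1f%% used -> %s' % (day, ratio * 100, level))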

