1. Dataset

This hands-on exercise uses the Blast Furnace Gas Cycle Power Plant (Combined Cycle Power Plant, CCPP) dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html).

Independent variables	AT: ambient temperature
	V: exhaust vacuum
	AP: ambient pressure
	RH: relative humidity
Dependent variable	PE: net hourly electrical energy output (power generation), a continuous variable

2. Load the dataset

1. Read the data and display the statistics of the data.

import pandas as pd
ccpp = pd.read_excel('./CCPP.xlsx')  # named ccpp to match the rest of the code
print(ccpp.describe())

2. Draw a scatter plot matrix of the variables to observe the correlations between them.

import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(ccpp)
plt.show()

From the scatter plots above, there appears to be no obvious linear relationship between PE (power generation) and either AP or RH.

3. Modeling

This exercise uses an OLS (ordinary least squares) regression model.

1. Split the training set (for modeling) and test set (for model evaluation).

from sklearn.model_selection import train_test_split
train, test = train_test_split(ccpp, test_size=0.2, random_state=14)
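To make the idea of the split concrete, here is a minimal pure-Python sketch. The helper simple_train_test_split is hypothetical and is not scikit-learn's implementation; it only assumes that the rows are shuffled with a fixed seed and that 20% of them are set aside for testing.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=14):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_test = round(len(rows) * test_size)       # number of test rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))                          # stand-in for 10 observations
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))          # 8 2
```

Fixing the seed plays the same role as random_state: the same rows end up in the test set on every run, so model comparisons stay reproducible.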

2. Modeling

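import statsmodels.api as sm
# Fit on the training set only, so the test set stays unseen for evaluation
fit_train = sm.formula.ols('PE~AT+V+AP', data=train).fit()
print(fit_train.summary())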

Explanation of the output of fit_train.summary():

part1:

1. Dep. Variable: the dependent variable, PE (power generation) in this regression analysis.

2. Model: the model type, ordinary least squares (OLS) in this regression analysis.

3. No. Observations: the number of observations, i.e. the sample size.

4. Df Residuals: the degrees of freedom of the residuals, Df = N - K (N = number of observations, K = number of independent variables + 1).

5. Df Model: the degrees of freedom of the model, Df = K - 1 (K = number of independent variables + 1).

6. R-squared: the coefficient of determination, which tells us what proportion of the variation in the dependent variable is explained by the independent variables.

7. Adj. R-squared: adds a penalty term to R-squared. When an irrelevant independent variable is added to the existing model, R-squared never decreases, whereas Adj. R-squared penalizes the extra variable and does not necessarily increase, which prevents a misleading impression of improvement.

8. F-statistic: used for the significance test of the whole model, i.e. whether the linear combination of the independent variables is effective in explaining the dependent variable.

9. AIC: the Akaike Information Criterion measures the goodness of fit of a statistical model while penalizing complexity. It favors the model that explains the data best with the fewest free parameters, that is, the best fit achieved with the fewest independent variables.
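The quantities in items 4 to 8 are tied together by a few formulas. The sketch below recomputes them by hand for a toy one-variable fit; the data and the helper fit_summary_stats are made up for illustration, and the formulas assume an OLS fit that includes an intercept.

```python
def fit_summary_stats(y, y_hat, k):
    """Return R-squared, adjusted R-squared and F for an OLS fit with intercept.

    k is the number of independent variables (the intercept is not counted).
    """
    n = len(y)
    y_mean = sum(y) / n
    sst = sum((yi - y_mean) ** 2 for yi in y)               # total sum of squares
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual sum of squares
    df_model, df_resid = k, n - k - 1                       # Df Model, Df Residuals
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
    f_stat = ((sst - sse) / df_model) / (sse / df_resid)
    return r2, adj_r2, f_stat

# Toy data with one independent variable, fitted by least squares
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
x_mean, y_mean = sum(x) / len(x), sum(y) / len(y)
slope = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
         / sum((a - x_mean) ** 2 for a in x))
intercept = y_mean - slope * x_mean
y_hat = [intercept + slope * a for a in x]

r2, adj_r2, f_stat = fit_summary_stats(y, y_hat, k=1)
print(round(r2, 4), round(adj_r2, 4), round(f_stat, 1))
```

Adjusted R-squared is always below R-squared here because of the (n - 1) / (n - k - 1) factor, which grows as more variables are added.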

part2:

1. Intercept: the constant term, i.e. the intercept of the regression line. It is the predicted value of the dependent variable when all independent variables are 0, and it also absorbs the average effect of variables left out of the model.

2. coef: the coefficient of each term. If you are familiar with derivatives, you can think of it as the rate of change of Y with respect to X, holding the other variables fixed.

3. std err: the standard error of the coefficient estimate, i.e. the estimated standard deviation of its sampling distribution (not the standard deviation of the data).

4. t: the t statistic, equal to coef divided by std err, which measures the statistical significance of the coefficient.

5. P>|t|: the p-value, i.e. the probability of obtaining a result at least as extreme as the observed one when the null hypothesis H0 (the coefficient is 0) is true. When the p-value is less than 0.05, the coefficient passes the significance test and the null hypothesis is rejected.

6. [0.025, 0.975]: the lower and upper bounds of the 95% confidence interval of the coefficient.
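The relationship between coef, std err, t, and the confidence interval can be illustrated for a regression with a single independent variable. This is a rough sketch on made-up data; as a simplifying assumption it uses the normal quantile instead of the t quantile that statsmodels uses, which is only a good approximation for large samples.

```python
import math
from statistics import NormalDist

# Toy data (hypothetical): y is roughly 2 * x
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 4.1, 6.2, 7.9, 10.1, 12.2, 13.8, 16.1]
n = len(x)

x_mean, y_mean = sum(x) / n, sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

slope = sxy / sxx                                  # coef
intercept = y_mean - slope * x_mean                # Intercept
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sigma2 = sum(e ** 2 for e in residuals) / (n - 2)  # residual variance, Df Residuals = n - 2
std_err = math.sqrt(sigma2 / sxx)                  # std err of the slope
t_value = slope / std_err                          # t = coef / std err

# Assumption: normal approximation of the 97.5% quantile
z = NormalDist().inv_cdf(0.975)
ci = (slope - z * std_err, slope + z * std_err)    # approximate [0.025, 0.975]
print(round(slope, 4), round(std_err, 4), round(t_value, 1), ci)
```

A large |t| means the coefficient is many standard errors away from 0, which corresponds to a small p-value and a confidence interval that excludes 0.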

3. Hypothesis testing of regression model

The purpose of testing the regression model is to test whether the model is valid, including whether the overall model is valid and whether a single variable is statistically significant.

1. Significance test of the model (F test): checks whether the independent variables, taken together, really affect the variation of the dependent variable.

1) State the null hypothesis and the alternative hypothesis:

Null hypothesis: all partial regression coefficients of the model are 0.

Alternative hypothesis: the partial regression coefficients of the model are not all 0, that is, at least one independent variable contributes to the linear combination that explains the dependent variable.

2) Construct the F statistic: the better the model fits, the larger the F statistic.

Computing the F statistic in Python: after the regression model is fitted, the actual value of F can be read from model.fvalue. The theoretical (critical) value of F can be calculated with the scipy.stats module, where dfn is the numerator degrees of freedom (the number of independent variables p) and dfd is the denominator degrees of freedom (n - p - 1, with n the sample size).

Generally speaking, if the actual value of F is greater than the theoretical value, the null hypothesis is rejected, that is, the model is significant and its partial regression coefficients are not all 0. The actual value of F can also be read from model.summary().

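from scipy.stats import f

# fit_train is the model fitted above
print("Actual F value:", fit_train.fvalue)
p = fit_train.df_model   # number of independent variables
n = fit_train.nobs       # number of observations used in the fit
f_theory = f.ppf(q=0.95, dfn=p, dfd=n-p-1)
print("Theoretical F value:", f_theory)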

In this exercise, the F statistic reported by the model is greater than the theoretical value of F, so the model passes the significance test and the null hypothesis is rejected (that is, the regression coefficients of the model are considered not all 0). However, passing the overall significance test does not mean that every variable is important to the dependent variable, so a significance test of each partial regression coefficient is also required.

2. Significance test of the regression coefficients (t test): checks whether a single independent variable is effective in the model.

The t test also has a null hypothesis, namely that the coefficient of the variable is 0, i.e. the variable is not statistically significant.
Running the t test in Python is very simple: call model.summary() and read the p-value of each variable. If p is less than 0.05, the variable is statistically significant at the 5% significance level.

In the summary output, the values in the P>|t| column are all less than 0.05, so all variables pass the significance test. This shows that AT, V, and AP all affect the variation of the power output PE, so there is no need to remove any of them from the model.

4. Variable screening

Multiple linear regression can select its independent variables with several methods, including forward selection, backward elimination, and stepwise selection. One of the criteria these three methods use to add or remove variables is the AIC criterion, also known as the minimum information criterion. The smaller the AIC value, the better the model.
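To see why AIC rewards fit but penalizes extra variables, the sketch below evaluates the usual Gaussian form AIC = -2 * logL + 2 * (number of parameters) for hypothetical residual sums of squares. The numbers are invented for illustration, and constants in the formula may differ between libraries.

```python
import math

def gaussian_aic(sse, n, n_params):
    """AIC = -2 * logL + 2 * n_params for a linear model with normal errors."""
    log_likelihood = -n / 2 * (math.log(2 * math.pi) + math.log(sse / n) + 1)
    return -2 * log_likelihood + 2 * n_params

# Hypothetical models on n = 100 observations
aic_base = gaussian_aic(sse=120.0, n=100, n_params=3)      # current model
aic_useful = gaussian_aic(sse=90.0, n=100, n_params=4)     # new variable cuts SSE a lot
aic_useless = gaussian_aic(sse=119.5, n=100, n_params=4)   # new variable barely helps

print(round(aic_base, 2), round(aic_useful, 2), round(aic_useless, 2))
```

In forward selection, only the "useful" candidate would be accepted, because it is the only one that lowers the AIC of the current model.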

The x.join(y) method: y is an iterable of strings, and x is the separator inserted between its elements to form a single string.

For example, '+'.join(['AT', 'V', 'AP']) outputs 'AT+V+AP' (the strings are concatenated with the + sign).

def forward_select(data, response):
    """Forward selection: repeatedly add the variable that lowers AIC the most."""
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = float('inf'), float('inf')
    while remaining:
        aic_with_candidates = []
        # Fit one model per remaining candidate and record its AIC
        for candidate in remaining:
            formula = '{} ~ {}'.format(response, ' + '.join(selected + [candidate]))
            aic = sm.formula.ols(formula=formula, data=data).fit().aic
            aic_with_candidates.append((aic, candidate))
        # Sort in descending order so that pop() returns the smallest AIC
        aic_with_candidates.sort(reverse=True)
        best_new_score, best_candidate = aic_with_candidates.pop()
        if current_score > best_new_score:
            # The best candidate lowers the AIC: keep it and continue
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            print('aic is {}, continuing!'.format(current_score))
        else:
            # No candidate lowers the AIC any further: stop
            print('forward selection over!')
            break
    formula = '{} ~ {}'.format(response, ' + '.join(selected))
    print('final formula is {}'.format(formula))
    model = sm.formula.ols(formula=formula, data=data).fit()
    return model

data_for_select = train[['PE', 'AT', 'V', 'AP']]
fit_train_1 = forward_select(data=data_for_select, response='PE')
print(fit_train_1.rsquared)

Model prediction results

pred = fit_train_1.predict(test[['AT', 'V', 'AP']])
print(test['PE'])
print(pred)

5. Visualization

plt.style.use('ggplot')

# x axis: actual values, y axis: predicted values
plt.scatter(test['PE'], pred, label='Observations', color='b')
# Reference line y = x: points on this line are predicted perfectly
plt.plot([test['PE'].min(), test['PE'].max()], [test['PE'].min(), test['PE'].max()], 'r--', lw=2, label='Reference line (y = x)')
plt.title('Actual vs. predicted values')
plt.xlabel('Actual value')
plt.ylabel('Predicted value')
plt.legend(loc='upper left')
plt.show()

6. Diagnosis of the regression model

The partial regression coefficients of this linear regression model are estimated by ordinary least squares (OLS), and OLS relies on several assumptions, specifically:

There is a linear relationship between the independent variables and the dependent variable;

There is no multicollinearity among the independent variables;

The residuals of the regression model follow a normal distribution;

The residuals of the regression model have homogeneous variance (that is, the variance is constant);

The residuals of the regression model are independent of each other.

1. Linear correlation test

A linear relationship can be identified graphically or with the Pearson correlation coefficient. The Python code is as follows:

# Scatter plots between all pairs of variables
sns.pairplot(ccpp)
plt.show()

# Correlation coefficients between power generation and the other variables
print(ccpp.corrwith(ccpp['PE']))

From the returned results, the correlation coefficients between PE (power generation) and AT and V are high, while the correlation coefficients between PE and AP and RH are smaller.

In general, when the absolute value of the Pearson correlation coefficient is below 0.4, the correlation between the variables is weak; between 0.4 and 0.6 it is moderate; and above 0.6 it is strong.
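As an illustration of these thresholds, the following sketch computes the Pearson correlation coefficient by hand and labels its strength. The helper names and the toy data are hypothetical; the cut-offs are the ones stated above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def strength(r):
    """Classify |r| with the 0.4 / 0.6 rule of thumb."""
    r = abs(r)
    if r < 0.4:
        return 'weak'
    if r < 0.6:
        return 'moderate'
    return 'strong'

x = [1, 2, 3, 4, 5]
y_strong = [10, 8, 6.5, 4, 2]   # decreases almost linearly with x
y_weak = [3, 1, 4, 1, 5]        # no clear linear pattern

print(strength(pearson_r(x, y_strong)), strength(pearson_r(x, y_weak)))
```

Note that the classification uses the absolute value, so a strongly negative correlation counts as strong.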

The comparison shows only a weak correlation between PE and RH, so RH is not included in the model. Of course, a weak linear correlation does not mean there is no relationship between the variables; the relationship may be quadratic, logarithmic, and so on, so it is generally also necessary to test for such relationships and apply variable transformations.

2. Multicollinearity test

There must be no strong collinearity between the independent variables. If this problem occurs in multiple linear regression, the estimates of the regression coefficients and the intercept become very unstable. Multicollinearity can be identified with the variance inflation factor (VIF). If VIF > 10, the variable is considered to be collinear with the others. Once multicollinearity is found, you can consider deleting variables or choosing a different model (for example, ridge regression).

The variance inflation factor (VIF) measures the severity of multicollinearity in a multiple linear regression model. It is the ratio of the variance of a regression coefficient estimate to the variance it would have if the independent variables were not linearly related.
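The VIF of a variable j equals 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing variable j on the other independent variables. With only two independent variables, R_j^2 is simply their squared correlation, which allows a small hand calculation. The sketch below uses made-up data and a hypothetical helper, and only covers this two-variable special case.

```python
def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    var1 = sum((a - m1) ** 2 for a in x1)
    var2 = sum((b - m2) ** 2 for b in x2)
    r_squared = cov ** 2 / (var1 * var2)   # squared correlation of x1 and x2
    return 1 / (1 - r_squared)

x1 = [1, 2, 3, 4, 5, 6]
x2_unrelated = [3, 5, 1, 6, 2, 4]                  # almost uncorrelated with x1
x2_collinear = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1]     # roughly 2 * x1

vif_low = vif_two_predictors(x1, x2_unrelated)
vif_high = vif_two_predictors(x1, x2_collinear)
print(round(vif_low, 3), round(vif_high, 1))
```

A VIF close to 1 means the variable carries information of its own, while a value far above 10 signals that it is almost a copy of another variable.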

Use patsy's dmatrices function to build the design matrices from a formula. This is similar to selecting columns of a DataFrame and then using the .values attribute to obtain an np.ndarray, but the patsy workflow is more streamlined. Moreover, the result from patsy can be passed directly to the least squares functions, which is common in data analysis.

The default value of the return_type parameter of patsy.dmatrices is 'matrix', which returns DesignMatrix objects. Here return_type='dataframe' is used so that dmatrices returns the dependent variable PE and the independent variables AT, V, and AP, together with the intercept term (a column of 1s), as data frames. The variance_inflation_factor function provided by the statsmodels module is then imported to calculate the VIF.

import patsy
from statsmodels.stats.outliers_influence import variance_inflation_factor

y, X = patsy.dmatrices('PE~AT+V+AP', data=ccpp, return_type='dataframe')
# Append +0 to the formula to remove the intercept term
# y, X = patsy.dmatrices('PE~AT+V+AP+0', data=ccpp, return_type='dataframe')
print(y)
print(X)

vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['features'] = X.columns
print(vif)

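The results show that the VIF values of the independent variables are all less than 10 (the value reported for the Intercept column is not a measure of collinearity and can be ignored), so there is no serious multicollinearity among the independent variables.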

3. Outlier detection

Before detecting outliers, we need to fit a linear regression model to the existing data. The code is as follows:

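import numpy as np
from sklearn.metrics import mean_squared_error

# Build the linear model between PE and AT, V, and AP
fit = sm.formula.ols('PE~AT+V+AP', data=ccpp).fit()
print(fit.summary())
# Compute the RMSE of the model
pred = fit.predict()
print(np.sqrt(mean_squared_error(ccpp.PE, pred)))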

The last line prints the RMSE value before outlier detection.
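For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch with made-up numbers (the helper rmse is hypothetical and independent of scikit-learn):

```python
import math

def rmse(actual, predicted):
    """Root mean square error of two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [3, 5, 7]
predicted = [2, 5, 9]
print(rmse(actual, predicted))   # sqrt((1 + 0 + 4) / 3)
```

Because the errors are squared before averaging, a few large errors (such as those caused by outliers) raise the RMSE noticeably.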

Now start the outlier detection. Outliers are generally identified with the leverage values (the diagonal of the hat matrix), DFFITS values, studentized residuals, Cook's distance, and covratio values. The criteria are listed first, followed by the script that computes these values:

(1) When the leverage value (hat matrix diagonal) > 2(p + 1)/n, the sample may be abnormal (where p is the number of independent variables and n is the number of observations);

(2) When the absolute DFFITS value > 2*sqrt((p + 1)/n), the sample may be abnormal;

(3) When the absolute value of the studentized residual is greater than 2, the sample may be abnormal;

(4) For Cook's distance there is no clear threshold. Generally speaking, the larger the value, the more likely the sample is abnormal;

(5) For the covratio value, the farther it is from 1, the more likely the sample is abnormal.

# Influence measures of the fitted model
outliers = fit.get_influence()

leverage = outliers.hat_matrix_diag              # leverage (hat matrix diagonal)
dffits = outliers.dffits[0]                      # DFFITS values
resid_stu = outliers.resid_studentized_external  # studentized residuals
cook = outliers.cooks_distance[0]                # Cook's distance
covratio = outliers.cov_ratio                    # covratio values

# Combine the measures and append them to the original data
concat1 = pd.concat([pd.Series(leverage, name='leverage'),
                     pd.Series(dffits, name='dffits'),
                     pd.Series(resid_stu, name='resid_stu'),
                     pd.Series(cook, name='cook'),
                     pd.Series(covratio, name='covratio')], axis=1)
ccpp_outliers = pd.concat([ccpp, concat1], axis=1)
print(ccpp_outliers.head(5))
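Criterion (1) can be checked by hand in the simplest case. For a regression with one independent variable and an intercept, the leverage of observation i is 1/n + (x_i - mean(x))^2 / Sxx. The sketch below applies the 2(p + 1)/n rule with p = 1 to made-up data; the helper leverage_simple is hypothetical.

```python
def leverage_simple(x):
    """Hat matrix diagonal for a one-variable regression with intercept."""
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]     # the last value lies far from the rest
h = leverage_simple(x)

p, n = 1, len(x)
threshold = 2 * (p + 1) / n             # 2(p + 1)/n
flagged = [i for i, hi in enumerate(h) if hi > threshold]
print(flagged, round(sum(h), 6))
```

The leverage values always sum to p + 1, so the threshold is simply twice the average leverage.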

Here we use the studentized residuals as the criterion, because they incorporate the information in the hat matrix and DFFITS. Samples whose absolute studentized residual exceeds 2 are deleted, which removes 3.7% of the samples.

# Share of samples whose absolute studentized residual exceeds 2
outliers_ratio = (np.abs(ccpp_outliers['resid_stu']) > 2).mean()
print(outliers_ratio)
# Keep only the samples that are not flagged as outliers
ccpp_outliers = ccpp_outliers.loc[np.abs(ccpp_outliers.resid_stu) <= 2]

The results show that there are indeed outliers in the sample, accounting for 3.7% of the observations. Outliers can be handled in the following three ways:

(1) When the proportion of outliers is very low (for example, within 5%), they can simply be deleted;

(2) When the proportion of outliers is high, a dummy variable can be derived from them, that is, 1 for outliers and 0 for all other samples;

(3) Extract the outliers and model them separately.

After deleting the outliers, we refit the model. Compared with the model before deletion, the new model is better: both information criteria (AIC and BIC) are smaller, and the RMSE (root mean square error) drops from 4.89 to 4.26.

fit2 = sm.formula.ols('PE~AT+V+AP', data=ccpp_outliers).fit()
print(fit2.summary())
pred2 = fit2.predict()
print(np.sqrt(mean_squared_error(ccpp_outliers['PE'], pred2)))

4. Normality test

When the residuals of the model satisfy the normality assumption, the t values of the partial regression coefficients and the F value of the model are guaranteed to be valid. Therefore, after the model is built, its residuals should be tested for normality. There are two types of normality tests: qualitative graphical methods (histogram, P-P plot, and Q-Q plot) and quantitative non-parametric methods (Shapiro test and K-S test).

The resid attribute returns a vector containing the residual of each observation.

stats.norm.pdf(x, mu, sigma) computes the probability density function of the normal distribution with mean mu and standard deviation sigma at x.
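The density that norm.pdf evaluates is exp(-((x - mu) / sigma)^2 / 2) / (sigma * sqrt(2 * pi)). As a small sanity check, the sketch below writes this formula out with the standard library and compares it with statistics.NormalDist; the helper normal_pdf is hypothetical.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))

by_hand = normal_pdf(0.5, mu=1.0, sigma=2.0)
reference = NormalDist(mu=1.0, sigma=2.0).pdf(0.5)
print(by_hand, reference)
```

In the histogram code, mu and sigma are replaced by the mean and standard deviation of the residuals, so the red curve shows what the residuals would look like if they were exactly normal.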

from scipy.stats import norm
from matplotlib import mlab

# Normality check of the residuals (histogram method)
resid = fit2.resid

plt.hist(resid,
         density=True,
         bins=100,
         color='steelblue',
         edgecolor='k')

plt.title('Histogram of residuals')
plt.ylabel('Density')

# Data for the normal distribution curve
x1 = np.linspace(resid.min(), resid.max(), 1000)
normed = norm.pdf(x1, resid.mean(), resid.std())
# Draw the normal distribution curve
plt.plot(x1, normed, 'r-', linewidth=2, label='Normal distribution curve')

# Data for the kernel density curve
x2 = np.linspace(resid.min(), resid.max(), 1000)
kde = mlab.GaussianKDE(resid)
# Draw the kernel density curve
plt.plot(x2, kde(x2), 'k-', linewidth=2, label='Kernel density curve')

plt.legend(loc='best')
plt.show()

As can be seen from the above figure, the kernel density curve basically agrees with the theoretical normal distribution curve, and the residual basically obeys the normal distribution.

# Normality check of the residuals (P-P plot and Q-Q plot)
pp_qq_plot = sm.ProbPlot(resid)
pp_qq_plot.ppplot(line='45')
plt.title('P-P plot')

pp_qq_plot.qqplot(line='q')
plt.title('Q-Q plot')
plt.show()

Judging from the P-P plot and the Q-Q plot, some sample points do not fall on the reference line, but most of them still follow it closely.

from scipy import stats
standard_resid = (resid-np.mean(resid))/np.std(resid)
print(stats.kstest(standard_resid, 'norm'))
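The K-S statistic is the largest vertical gap between the empirical distribution function of the standardized residuals and the standard normal distribution function. The following sketch computes that gap by hand on two hypothetical samples; it reproduces only the statistic, not the p-value that stats.kstest also reports.

```python
import math
from statistics import NormalDist, mean, pstdev

def ks_statistic(sample):
    """Largest gap between the empirical CDF of the standardized sample and N(0, 1)."""
    mu, sigma = mean(sample), pstdev(sample)      # population std, like np.std
    z = sorted((v - mu) / sigma for v in sample)
    n = len(z)
    cdf = NormalDist().cdf
    d_plus = max((i + 1) / n - cdf(v) for i, v in enumerate(z))
    d_minus = max(cdf(v) - i / n for i, v in enumerate(z))
    return max(d_plus, d_minus)

# Hypothetical samples: evenly spaced normal quantiles, and a skewed version of them
bell = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [math.exp(v) for v in bell]

d_bell = ks_statistic(bell)
d_skewed = ks_statistic(skewed)
print(round(d_bell, 3), round(d_skewed, 3))
```

The bell-shaped sample gives a statistic close to 0, while the skewed sample gives a much larger one, which is the situation in which the normality hypothesis would be rejected.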

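Because the Shapiro normality test requires a sample size of less than 5000 and this data set is larger than that, the K-S test is used above. If the residuals do not pass the normality test, a Box-Cox transformation of the dependent variable can be tried: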

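# Estimate the Box-Cox lambda by maximum likelihood
lamd = stats.boxcox_normmax(ccpp_outliers['PE'], method='mle')
# Store the transformed values in a new column so that the original PE is kept
ccpp_outliers['PE_bc'] = stats.boxcox(ccpp_outliers['PE'], lamd)
fit_bc = sm.formula.ols('PE_bc~AT+V+AP', data=ccpp_outliers).fit()
print(fit_bc.summary())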
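The Box-Cox transformation itself is a simple formula: (y^lambda - 1) / lambda for lambda != 0, and log(y) for lambda = 0, defined for positive y. A minimal sketch of the formula follows; the helper boxcox_transform is hypothetical and does not estimate lambda, which is what boxcox_normmax does.

```python
import math

def boxcox_transform(y, lam):
    """Apply the Box-Cox transformation with a given lambda to positive values."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

values = [1.0, 4.0, 9.0]
print(boxcox_transform(values, 0.5))   # square-root-like transform
print(boxcox_transform(values, 1))     # lambda = 1 only shifts the data by 1
print(boxcox_transform(values, 0))     # lambda = 0 is the natural logarithm
```

A lambda of 1 leaves the shape of the distribution unchanged, so an estimated lambda close to 1 suggests that the transformation is not needed.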

5. Residual variance homogeneity test

In a linear regression model that fits well, there should be no apparent relationship or trend between the residuals and the fitted values. If the residuals are heteroscedastic, the estimated partial regression coefficients are no longer efficient, and the predictions of the model may even be inaccurate. Therefore, after modeling, it is necessary to verify whether the residual variance is homogeneous. There are two ways to do so: the graphical method and statistical tests.

Homogeneity of variance means that the variances are equal; this precondition must be met in both the t test and the analysis of variance.

When comparing two or more groups, homogeneity of variance is easy to understand: compare the variance of each group and see whether they are about the same size. If the differences are too large, the variances are considered heterogeneous (unequal); if not, they are considered homogeneous (equal). Whether a difference counts as large or small has to be decided by a statistical test, hence the tests for homogeneity of variance.

Judgment: if the points are scattered randomly with no pattern, the basic assumption is met; if there is an obvious pattern or trend, there is heteroscedasticity.

plt.style.use('ggplot')
# Scatter plot of standardized residuals against predicted values
plt.scatter(fit2.predict(), (fit2.resid - fit2.resid.mean()) / fit2.resid.std(), color='b', s=20)
plt.xlabel('Predicted value')
plt.ylabel('Standardized residual')

# Add a horizontal reference line
plt.axhline(y=0, color='r', linewidth=2)
plt.show()

No obvious pattern or trend is found in the figure, so it can be considered that there is no significant heteroscedasticity. (Criterion: if the residuals are evenly distributed on both sides of the reference line, heteroscedasticity is weak; if the distribution is clearly uneven, heteroscedasticity is obvious.)

# ===========Statistical tests for homogeneity of variance===============
# White's Test
print(sm.stats.diagnostic.het_white(fit2.resid, exog=fit2.model.exog))
# Breusch-Pagan
print(sm.stats.diagnostic.het_breuschpagan(fit2.resid, exog_het=fit2.model.exog))

According to the test results, for both the White test and the Breusch-Pagan test the p-value is far below the 0.05 threshold, so the null hypothesis (that the residual variance is constant) is rejected and the residuals do not satisfy the homogeneity assumption. If the residuals of the model are not homoscedastic, two remedies can be considered: model transformation and weighted least squares. The weighted least squares method is demonstrated here.

# Three kinds of weights
w1 = 1 / np.abs(fit2.resid)
w2 = 1 / fit2.resid ** 2
ccpp_outliers['loge2'] = np.log(fit2.resid ** 2)
# Third weight: from a regression of log(residual^2) on the independent variables
model = sm.formula.ols('loge2~AT+V+AP', data=ccpp_outliers).fit()
w3 = 1 / (np.exp(model.predict()))

# Fit the model
fit3 = sm.formula.wls('PE~AT+V+AP', data=ccpp_outliers, weights=w1).fit()
# Heteroscedasticity test
het3 = sm.stats.diagnostic.het_breuschpagan(fit3.resid, exog_het=fit3.model.exog)
# AIC
fit3.aic

fit4 = sm.formula.wls('PE~AT+V+AP', data=ccpp_outliers, weights=w2).fit()
het4 = sm.stats.diagnostic.het_breuschpagan(fit4.resid, exog_het=fit4.model.exog)
fit4.aic

fit5 = sm.formula.wls('PE~AT+V+AP', data=ccpp_outliers, weights=w3).fit()
het5 = sm.stats.diagnostic.het_breuschpagan(fit5.resid, exog_het=fit5.model.exog)
fit5.aic

# fit2 model
het2 = sm.stats.diagnostic.het_breuschpagan(fit2.resid, exog_het=fit2.model.exog)
fit2.aic

print('fit2 heteroscedasticity test statistic: %.2f, p-value: %.4f' % (het2[0], het2[1]))
print('fit3 heteroscedasticity test statistic: %.2f, p-value: %.4f' % (het3[0], het3[1]))
print('fit4 heteroscedasticity test statistic: %.2f, p-value: %.4f' % (het4[0], het4[1]))
print('fit5 heteroscedasticity test statistic: %.2f, p-value: %.4f\n' % (het5[0], het5[1]))

print('AIC of fit2: %.2f' % fit2.aic)
print('AIC of fit3: %.2f' % fit3.aic)
print('AIC of fit4: %.2f' % fit4.aic)
print('AIC of fit5: %.2f' % fit5.aic)

The comparison shows that, although three different sets of weights were used, none of the models passes the test for homogeneity of residual variance. The fit4 model nevertheless looks the most promising: compared with fit2, its AIC is smaller (though of course this may come with overfitting problems).
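For intuition about what the weights do, weighted least squares minimizes the sum of w_i * (y_i - a - b * x_i)^2, so observations with small weights pull less on the fitted line. The sketch below solves the one-variable case in closed form on made-up data; the helper weighted_line_fit is hypothetical and is not how statsmodels implements WLS.

```python
def weighted_line_fit(x, y, w):
    """Minimize sum(w_i * (y_i - a - b * x_i) ** 2) for a single predictor."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw    # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw    # weighted mean of y
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope                  # intercept, slope

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 20]        # the last point lies far from the line y = 2x + 1

_, ols_slope = weighted_line_fit(x, y, [1, 1, 1, 1, 1])      # equal weights = OLS
_, wls_slope = weighted_line_fit(x, y, [1, 1, 1, 1, 0.01])   # noisy point down-weighted
print(round(ols_slope, 3), round(wls_slope, 3))
```

With equal weights the noisy point drags the slope away from 2; giving it a small weight brings the slope back close to 2. Weights such as 1/|residual| follow the same idea: observations with large residuals count less.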

6. Residual independence test

The residuals are required to be independent because the dependent variable y is required to be independent: in the model only y and the residual term are random variables, while the independent variables x are treated as known. Combined with the normality assumption, this means the residuals are independent and identically distributed normal variables. The independence of the residuals can be tested with the Durbin-Watson statistic, which is already included in the summary output of the model. The closer the value is to 2, the stronger the indication that the residuals are independent. Generally speaking, in real data sets, samples of time series may be correlated, while samples of other data sets are basically independent.
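The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch on made-up residual sequences (the helper durbin_watson below is written by hand for illustration):

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

positively_correlated = [1, 1, 1, -1, -1, -1]   # neighbors tend to share a sign
alternating = [1, -1, 1, -1, 1, -1]             # neighbors always flip sign

dw_pos = durbin_watson(positively_correlated)
dw_alt = durbin_watson(alternating)
print(round(dw_pos, 3), round(dw_alt, 3))
```

Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation, which is why a value near 2 is read as a sign of independence.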

Next, draw scatter plots of the predicted values against the actual values for the five models. If the points lie very close to the reference line, the model fits well.

model_list = {'fit': fit,
              'fit2': fit2,
              'fit3': fit3,
              'fit4': fit4,
              'fit5': fit5}

for i, (name, model) in enumerate(model_list.items()):
    plt.subplot(2, 3, i + 1)
    # fit was trained on the full data, the other models on the data without outliers
    actual = ccpp['PE'] if i == 0 else ccpp_outliers['PE']
    predicted = model.predict()
    plt.scatter(predicted, actual, color='b', edgecolors='white', s=20)
    plt.plot([predicted.min(), predicted.max()],
             [actual.min(), actual.max()],
             'r-', linewidth=3)
    plt.title('Model: %s' % name)
    plt.xlabel('Predicted value')
    plt.ylabel('Actual value')
plt.show()

Linear regression practice (2)--Use the ols model to predict the power generation of blast furnace gas