Multicollinearity in regression models: harms, causes, diagnostic criteria, and solutions (with a note on regression coefficients)

1. Multicollinearity       

        Multicollinearity refers to exact or highly correlated linear relationships among the explanatory variables of a linear regression model.

        For example: a regression model contains two variables, age and years of work experience. Common sense tells us that the older a person is, the more years of work experience they tend to have. Since the two variables are likely to be highly correlated, the model may suffer from multicollinearity.
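This can be checked numerically. Below is a minimal sketch with synthetic data (the data-generating rule is made up for illustration): generating experience as roughly age minus schooling plus noise yields a Pearson correlation very close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: experience roughly tracks age minus
# years of schooling, plus individual noise.
age = rng.uniform(22, 65, size=500)
experience = age - 22 + rng.normal(0, 2, size=500)

r = np.corrcoef(age, experience)[0, 1]  # Pearson correlation
print(round(r, 3))  # close to 1: the two regressors are nearly collinear
```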

2. The hazards of multicollinearity

        a. Model estimates become distorted, hard to pin down, or unstable; in particular, the standard errors of the regression coefficients may increase;

        b. Parameter estimates become inaccurate and their variances become large; this is the underlying cause of the distorted estimates in (a). Section 3 explains why;

        c. It becomes impossible to judge the influence of an individual variable or to compute each feature's contribution;

        d. As a result, significance tests on the independent variables can become misleading: a variable that should be significant may appear insignificant, and an insignificant one may appear significant.
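Hazard (b), the inflated variance of the coefficient estimates, can be seen directly in a small simulation (a sketch on synthetic data; all numbers are illustrative): fit the same model repeatedly with uncorrelated versus highly correlated regressors and compare how much the estimate of the first slope scatters across samples.

```python
import numpy as np

rng = np.random.default_rng(1)

def coef_spread(rho, n_trials=200, n=100):
    """Standard deviation of the OLS estimate of the first slope
    across repeated samples, with two regressors correlated at rho."""
    estimates = []
    for _ in range(n_trials):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)  # true slopes: 1, 1
        X = np.column_stack([np.ones(n), x1, x2])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        estimates.append(beta[1])
    return float(np.std(estimates))

spread_indep = coef_spread(rho=0.0)
spread_collinear = coef_spread(rho=0.95)
print(spread_indep, spread_collinear)  # the collinear case scatters far more
```

The true coefficients are identical in both settings; only the correlation between the regressors changes, yet the collinear fit produces far less stable estimates.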

3. Why collinearity leads to inaccurate model estimates

        Minimizing the loss function is meant to find a set of optimal regression coefficients, which can loosely be understood as the model's parameters. Multicollinearity makes these parameter estimates inaccurate, which in turn distorts the model's estimates.

        A brief aside on the concept of the regression coefficient:

        In a regression equation, the regression coefficient is the parameter measuring the influence of an independent variable x on the dependent variable y: it reflects the expected change in y when x changes by one unit.

        The larger the absolute value of the regression coefficient, the greater the influence of x on y. A positive coefficient means y increases as x increases; a negative coefficient means y decreases as x increases.

        For example, in the regression equation Y = bX + a, the slope b is called the regression coefficient: every time X changes by one unit, Y changes by b units on average.
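A minimal numeric check of this reading, on toy data constructed to follow Y = 2X + 1 exactly:

```python
import numpy as np

# Toy data generated exactly from Y = bX + a with b = 2 and a = 1.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0

b, a = np.polyfit(X, Y, 1)  # least-squares fit of a degree-1 polynomial
print(b, a)  # slope b = 2: each unit increase in X raises Y by 2 on average
```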

More generally, the regression coefficient can be understood from the perspective of linear regression.

        Suppose the relationship between a variable y and variables x = (x_1, x_2, ..., x_n) is y = f(x) + \varepsilon. Then f(x) is called the regression of y on x, or the regression function. Usually, under normality, if f(x) is a linear function of x, f(x) = \beta_0 + \beta^{T} x, then \beta_0 is the regression constant and \beta = (\beta_1, \beta_2, ..., \beta_n) is the vector of regression coefficients.

        Back to the hazards of multicollinearity:

        If the model exhibits multicollinearity, then at least two independent variables, A and B, are highly or perfectly correlated: their trends move together, and when one changes, the other changes in a similar way. The stronger the correlation, the harder it is to attribute a change in Y to A alone, because the data contain almost no cases where A changes while B stays fixed, so the individual effects of A and B cannot be separated.

        As a result, confidence in the estimated coefficients drops, and the stability and performance of the model suffer.

4. Judgment criteria

        a. The Pearson correlation coefficient measures the degree of linear correlation between continuous variables. A value above 0.8 can be taken to indicate multicollinearity. For continuous-discrete or discrete-discrete variable pairs, other measures can be used (to be covered in a later article);

        b. Add or delete a variable and observe whether the regression coefficients change substantially. A large change indicates that the estimated coefficients are unreliable or unstable;

        c. If the F-test passes and the coefficient of determination is large, but the t-tests on individual coefficients are not significant, multicollinearity may be present;

        d. If the sign of a regression coefficient contradicts domain knowledge or the actual analysis results, multicollinearity is also a possibility.
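Criterion (b) can be demonstrated with synthetic data (a sketch; the data-generating process is made up for illustration): after adding a near-duplicate regressor, the coefficient of the original variable can move noticeably.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)  # nearly a copy of x1
y = 3.0 * x1 + rng.normal(size=n)           # y truly depends on x1 only

def slope_of_x1(*cols):
    """OLS coefficient of x1, fit with an intercept and the given regressors."""
    X = np.column_stack([np.ones(n), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_alone = slope_of_x1(x1)       # x1 by itself: stable, near the true 3
b_joint = slope_of_x1(x1, x2)   # with its near-duplicate: unreliable
print(round(b_alone, 2), round(b_joint, 2))
```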

Criteria a through d above are all informal, subjective checks; there is also a formal diagnostic!

Examine the variance inflation factor (VIF) in the regression output: VIF_j = 1 / (1 - R_j^2), where R_j^2 is the coefficient of determination from regressing the j-th explanatory variable on all the others. Multicollinearity inflates the variance of the parameter estimates, and the larger the VIF, the stronger the collinearity. The usual rule of thumb is that a VIF above 10 indicates multicollinearity; some literature uses a stricter threshold of 5.
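The VIF needs nothing beyond ordinary least squares. A minimal sketch on synthetic data (the variables and numbers here are illustrative):

```python
import numpy as np

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j of X on all the other columns plus an intercept."""
    y = X[:, j]
    others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    resid = y - others @ np.linalg.lstsq(others, y, rcond=None)[0]
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.81) * rng.normal(size=n)  # correlated with x1
x3 = rng.normal(size=n)                                  # independent
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 2) for j in range(3)])  # x1, x2 elevated; x3 near 1
```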

5. Solutions: how to eliminate multicollinearity

        a. Keep one variable and delete the others that are highly correlated with it; stepwise regression is the most widely used way to do this;

        b. Introduce L1 or L2 regularization, which shrinks the parameter variances, lowers the VIF, and can handle multicollinearity;

        c. Feature merging or feature combination: linearly combine the correlated variables into one;

        d. Feature dimensionality reduction, e.g. PCA;

        e. The difference method: for time-series data with a linear model, transform the original model into a difference model (reposted from Baidu Encyclopedia; I don't fully understand it yet).
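Remedies (b), (d), and (e) can each be sketched in a few lines on synthetic data (all names and numbers here are illustrative, not from the original sources):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

# (b) L2 regularization: the closed-form ridge solution
#     (X'X + lam*I)^-1 X'y shrinks the coefficient vector, which
#     stabilizes it when X'X is nearly singular. Intercept omitted
#     for brevity (the data are roughly centered).
def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = ridge(X, y, lam=0.0)
b_ridge = ridge(X, y, lam=10.0)

# (d) PCA via SVD of the centered data: with two nearly collinear
#     columns, the first principal component carries almost all the
#     variance and can replace them as a single regressor.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

# (e) First differencing: two series that share a time trend are almost
#     perfectly correlated in levels, but not after differencing.
t = np.arange(100, dtype=float)
z1 = t + rng.normal(0, 1, 100)
z2 = t + rng.normal(0, 1, 100)
r_levels = np.corrcoef(z1, z2)[0, 1]
r_diff = np.corrcoef(np.diff(z1), np.diff(z2))[0, 1]

print(np.round(b_ols, 2), np.round(b_ridge, 2))
print(round(explained[0], 3), round(r_levels, 3), round(r_diff, 3))
```

Ridge shrinks the coefficient vector relative to OLS, the first principal component absorbs nearly all the variance of the two collinear columns, and the level correlation collapses once the common trend is differenced away.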

The above content draws on: Regression Coefficient (Baidu Encyclopedia), Multicollinearity (Baidu Encyclopedia), "Come quick!! How much do you know about multicollinearity?" (Zhihu), plus my own study and understanding. If you can, the original articles are worth reading.


Origin blog.csdn.net/xiao_ling_yun/article/details/129571018