Manually calculate the mean, variance, covariance, Pearson coefficient

Project GitHub address: bitcarmanlee easy-algorithm-interview-and-practice
Stars and comments are welcome; let's learn and improve together.

1. Mean, variance

Suppose an array contains n elements x_1, ..., x_n. Its mean and variance are computed as follows.
Mean:
$$\bar x = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Variance (population variance):
$$var = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar x)^2$$

Variance (sample variance):
$$S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar x)^2$$

The only difference is the divisor: the population (overall) variance divides by n, while the sample variance divides by n-1.
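
As a small worked sketch of the two formulas above, the pure-Python helpers below compute the mean and both variances. The names `mean` and `variance` and the `ddof` parameter (named after NumPy's convention) are illustrative choices, not part of the original text.

```python
def mean(data):
    """Arithmetic mean: sum of the elements divided by their count."""
    return sum(data) / len(data)


def variance(data, ddof=0):
    """Variance with divisor n - ddof.

    ddof=0 gives the population variance (divide by n),
    ddof=1 gives the sample variance (divide by n - 1).
    """
    n = len(data)
    if n - ddof <= 0:
        raise ValueError("need more than ddof data points")
    m = mean(data)
    return sum((x - m) ** 2 for x in data) / (n - ddof)


a = [5, 6, 16, 9]
print(mean(a))              # 9.0
print(variance(a))          # 18.5
print(variance(a, ddof=1))  # 74 / 3, about 24.67
```

For this array the deviations from the mean are -4, -3, 7 and 0, so the squared deviations sum to 74; dividing by 4 or by 3 gives the two results.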

2. Covariance and Pearson's coefficient

Covariance measures the degree to which two random variables vary together, so computing it requires two variables as input. Variance is a special case of covariance: it is the covariance of a variable with itself.

$$cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{n-1}$$

Dividing by n-1 gives the sample covariance; dividing by n gives the population (overall) covariance.
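
The choice of divisor can be sketched in a few lines of plain Python. The function name `covariance`, its `ddof` parameter and the sample data are assumptions made for illustration only.

```python
def covariance(xs, ys, ddof=1):
    """Covariance with divisor n - ddof (ddof=1: sample, ddof=0: population)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (n - ddof)


x = [1, 2, 3, 4]
y = [2, 4, 7, 9]

print(covariance(x, y))          # 4.0, sample covariance (divide by n - 1)
print(covariance(x, y, ddof=0))  # 3.0, population covariance (divide by n)
print(covariance(x, x))          # 5 / 3, the sample variance of x
```

The last call illustrates that the covariance of a variable with itself is its variance.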

The Pearson correlation coefficient between two variables is their covariance divided by the product of their standard deviations:
$$\rho_{x,y} = \frac{cov(X, Y)}{\sigma_x \sigma_y} = \frac{E[(X - \mu_x)(Y - \mu_y)]}{\sigma_x \sigma_y}$$

The sample Pearson coefficient is usually denoted by the letter r:

$$r = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \overline{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \overline{Y})^2}}$$
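
The formula for r contains no 1/n or 1/(n-1) factor, because the same factor would appear in the numerator and the denominator and cancel. A minimal sketch that evaluates the sums directly is shown below; the name `pearson_r` and the test data are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson coefficient computed from sums of deviations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2 * v + 1 for v in x]))   # increasing line, r is 1 up to rounding
print(pearson_r(x, [-3 * v + 4 for v in x]))  # decreasing line, r is -1 up to rounding
print(pearson_r(x, [2, 1, 4, 3, 5]))          # noisy increasing data, r is about 0.8
```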

The Pearson coefficient r is a dimensionless number that measures linear correlation, and its value lies in [-1, 1]. r = 1 indicates a perfect positive linear correlation, and r = -1 a perfect negative linear correlation. r = 0, which occurs exactly when the covariance is 0, means the two variables are uncorrelated, or more precisely "linearly uncorrelated": there is no linear correlation between the random variables X and Y. It does not mean that no (non-linear) functional relationship exists between them, and X and Y are not necessarily statistically independent.
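
The last point can be illustrated with a small sketch: y = x^2 on points symmetric around zero is fully determined by x, yet its Pearson coefficient is 0. The helper `pearson_r` and the data are assumptions chosen for this illustration.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson coefficient of two equally long sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]  # y depends on x, but not linearly

print(pearson_r(x, y))   # 0.0
```

The positive and negative deviations of x cancel each other in the numerator, so r is 0 even though knowing x fixes y exactly.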

3. Code implementation

With the theory covered, let's look at a concrete implementation.

import numpy as np
import math
import pandas as pd

def t1():
    a = [5, 6, 16, 9]
    print(np.mean(a))
    print()

    # Population variance: divide by n
    var = np.var(a)
    print(var)
    # Compute the variance by hand
    var2 = [math.pow(x-np.mean(a), 2) for x in a]
    print(np.mean(var2))
    print()

    # Sample variance: divide by n-1
    sample_var = np.var(a, ddof=1)
    print(sample_var)
    print()

    # Standard deviation (population first, then sample)
    std = np.std(a)
    std2 = np.std(a, ddof=1)
    print(std)
    print(std2)


t1()

Output result:

9.0

18.5
18.5

24.666666666666668

4.301162633521313
4.96655480858378

Let’s take a look at the calculation of Pearson’s coefficient.

def t2():
    a = [i for i in range(10)]
    b = [2*x + 0.1 for x in a]

    dic = {"x": a, "y": b}
    df = pd.DataFrame(dic)
    print(df)
    print()
    print("dfcorr is: ", df.corr())

    # Compute the Pearson coefficient by hand
    n = len(a)
    abar = sum(a)/float(len(a))
    bbar = sum(b)/float(len(b))
    covab = sum([(x-abar)*(y-bbar) for (x, y) in zip(a, b)]) / n

    stda = math.sqrt(sum([math.pow(x-abar, 2) for x in a]) / n)
    stdb = math.sqrt(sum([math.pow(y-bbar, 2) for y in b]) / n)

    print()
    print(covab)
    print(stda)
    print(stdb)
    print()
    corrnum = covab / (stda * stdb)
    print("corrnum is: ", corrnum)


t2()

Output result:

   x     y
0  0   0.1
1  1   2.1
2  2   4.1
3  3   6.1
4  4   8.1
5  5  10.1
6  6  12.1
7  7  14.1
8  8  16.1
9  9  18.1

dfcorr is:       x    y
x  1.0  1.0
y  1.0  1.0

16.5
2.8722813232690143
5.7445626465380295

corrnum is:  0.9999999999999998

Here x and y are two variables with y = 2x + 0.1, so they are perfectly linearly correlated. pandas therefore reports 1.0, and the manual calculation gives 0.9999999999999998, which differs from 1 only by floating-point rounding.


Origin blog.csdn.net/bitcarmanlee/article/details/111365936