Using Python to calculate variance and JS divergence across different dimensions of a data set

In data mining, we often need to measure how different or similar the dimensions of a data set are. This usually comes down to quantifying the difference between two samples or distributions. For continuous variables, for example, the Kolmogorov-Smirnov (KS) test and relative entropy (KL divergence) are common choices.

This article shows how to use Python with Pandas, NumPy, and SciPy to calculate two metrics for each dimension of a data set: the variance of the gap between predicted and actual values, and the JS (Jensen-Shannon) divergence between the predicted and actual distributions. Together they give a picture of how much each dimension fluctuates.
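As a rough illustration of the two methods just mentioned, the sketch below compares two synthetic samples with SciPy's two-sample KS test and with relative entropy computed over a shared histogram. The sample data, the bin count, and all variable names are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two synthetic continuous samples (illustrative data, not from this article)
a = rng.normal(loc=0.0, scale=1.0, size=1000)
b = rng.normal(loc=0.5, scale=1.0, size=1000)

# Two-sample KS test: compares the empirical CDFs of the two samples
ks_stat, p_value = stats.ks_2samp(a, b)

# Relative entropy needs discrete distributions, so bin both samples
# on the same set of bin edges first
edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=20)
hist_a, _ = np.histogram(a, bins=edges)
hist_b, _ = np.histogram(b, bins=edges)

# A small constant keeps empty bins from producing an infinite divergence
eps = 1e-9
kl_ab = stats.entropy(hist_a + eps, hist_b + eps, base=2)

print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.3g}")
print(f"KL(a || b): {kl_ab:.3f} bits")
```

Note that relative entropy is not symmetric (KL(a || b) generally differs from KL(b || a)), which is one reason a symmetric measure such as JS divergence is convenient.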

Code:

First, define a function JS_divergence() to calculate the JS divergence between two distributions:

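import numpy as np
import scipy.stats as ss

def JS_divergence(p, q, base):
    # Normalize first; otherwise M is not the average of the two distributions
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    M = (p + q) / 2
    return 0.5 * ss.entropy(p, M, base=base) + 0.5 * ss.entropy(q, M, base=base)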
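As a quick sanity check of the definition, the sketch below uses a hypothetical variant named js_divergence that normalizes its inputs explicitly, then compares it against SciPy's scipy.spatial.distance.jensenshannon. SciPy returns the JS distance, which is the square root of the divergence, so the reference value is squared. The test vectors are assumptions chosen for illustration.

```python
import numpy as np
import scipy.stats as ss
from scipy.spatial.distance import jensenshannon

def js_divergence(p, q, base=2):
    # Hypothetical helper: normalize so that m is the average of two
    # proper probability distributions
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = (p + q) / 2
    return 0.5 * ss.entropy(p, m, base=base) + 0.5 * ss.entropy(q, m, base=base)

# Same shape after normalization, so the divergence should be zero
same = js_divergence([1, 2, 3], [2, 4, 6])
# No overlap at all, so the divergence should reach its base-2 maximum of 1
disjoint = js_divergence([1, 0], [0, 1])

# Raw counts on different scales
p, q = [100000, 20000, 1000], [85000, 10000, 1500]
ours = js_divergence(p, q)
# Square SciPy's JS distance to get the divergence
reference = jensenshannon(p, q, base=2) ** 2

print(same, disjoint, np.isclose(ours, reference))
```

With base 2 the JS divergence always lies between 0 and 1, which makes values from different dimensions directly comparable.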

Next, define a function compute_metrics() that, for a given dimension, calculates the variance of the difference pred - actual and the JS divergence between pred and actual:

import pandas as pd
import numpy as np

def compute_metrics(df, dim):
    # Select the rows for this dimension once and reuse them
    sub = df[df['dimension'] == dim]
    # Variance of the prediction error within the dimension
    var = np.var(sub['pred'] - sub['actual'])
    # JS divergence (base 2) between predicted and actual values
    js_div = JS_divergence(sub['pred'], sub['actual'], 2)

    return [var, js_div]

Then store the data in a Pandas DataFrame and call compute_metrics() for each dimension:

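lists = [
    ['distribution module', 'featured', 100000, 85000],
    ['distribution module', 'following', 20000, 10000],
    ['distribution module', 'discover', 1000, 1500],
    ['user category', 'children', 2000, 2000],
    ['user category', 'youth', 30000, 19500],
    ['user category', 'middle-aged', 69000, 50000],
    ['user category', 'elderly', 20000, 25000],
]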

df = pd.DataFrame(lists, columns=['dimension', 'indicator', 'pred', 'actual'])

# Compute the variance and JS divergence for each dimension
metrics = {}
for dim in df['dimension'].unique():
    metrics[dim] = compute_metrics(df, dim)

print(pd.DataFrame(metrics, index=['Var', 'JS_Div']))

The printed table has one column per dimension, with the variance in the Var row and the JS divergence in the JS_Div row.
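The same table can also be built by iterating over groupby() groups, which avoids filtering the DataFrame once per dimension. The sketch below is one possible variant, not the article's method: the helper summarize() is hypothetical, the data reuses the numbers from the example with English labels, and the JS divergence is taken as the square of SciPy's jensenshannon distance. It also sorts dimensions by JS divergence so the one with the largest distribution shift comes first.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon

rows = [
    ['distribution module', 'featured', 100000, 85000],
    ['distribution module', 'following', 20000, 10000],
    ['distribution module', 'discover', 1000, 1500],
    ['user category', 'children', 2000, 2000],
    ['user category', 'youth', 30000, 19500],
    ['user category', 'middle-aged', 69000, 50000],
    ['user category', 'elderly', 20000, 25000],
]
df = pd.DataFrame(rows, columns=['dimension', 'indicator', 'pred', 'actual'])

def summarize(group):
    # Hypothetical helper: metrics for the rows of one dimension
    pred = group['pred'].to_numpy(dtype=float)
    actual = group['actual'].to_numpy(dtype=float)
    return pd.Series({
        'Var': np.var(pred - actual),                          # population variance
        'JS_Div': jensenshannon(pred, actual, base=2) ** 2,    # distance squared
    })

# One column per dimension, rows 'Var' and 'JS_Div'
table = pd.DataFrame({dim: summarize(g) for dim, g in df.groupby('dimension')})

# Rank dimensions by how far the predicted distribution is from the actual one
ranked = table.T.sort_values('JS_Div', ascending=False)
print(ranked)
```

Because the variance depends on the raw scale of the values while the JS divergence compares normalized shares, the two metrics can rank dimensions differently, so it helps to look at both.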

Summary:

This article described how to use Python and the Pandas library to calculate the variance and JS divergence for different dimensions of a data set. These metrics indicate how much each dimension fluctuates, which is useful for tasks such as spotting abnormal dimensions and general data analysis. I hope you find it helpful!

Origin blog.csdn.net/weixin_44976611/article/details/130955024