Reading the CSV file
Creating the DataFrame schema
Grouped counts with the .groupby(...) method
Descriptive statistics of the numerical columns with the .describe() method
Skewness & dispersion
Reference: https://blog.csdn.net/weixin_39599711/article/details/79072691
import pyspark.sql.types as typ
Next, we read the data in.
# Read the CSV file
fraud = sc.textFile('ccFraud.csv.gz')
# Grab the header row
header = fraud.first()
# Drop the header, split each row on commas, and convert every element to an integer:
fraud = fraud \
    .filter(lambda row: row != header) \
    .map(lambda row: [int(elem) for elem in row.split(',')])
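The filter-and-map step above boils down to plain Python string handling; here is a minimal local sketch of the same parsing logic (the sample rows are made up for illustration):

```python
# Hypothetical sample of raw lines, mimicking the ccFraud.csv layout
lines = [
    '"custID","gender","state"',  # header row
    '1,1,35',
    '2,2,2',
]

header = lines[0]

# Same logic as the RDD pipeline: drop the header, split on commas,
# and convert every field to an integer
rows = [
    [int(elem) for elem in line.split(',')]
    for line in lines
    if line != header
]
print(rows)  # [[1, 1, 35], [2, 2, 2]]
```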
Next, we create the schema for our DataFrame.
# Create the schema for the DataFrame (h[1:-1] strips the quotes around each header name):
fields = [
    typ.StructField(h[1:-1], typ.IntegerType(), True)
    for h in header.split(',')
]
schema = typ.StructType(fields)
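The `h[1:-1]` slice removes the surrounding double quotes from each header token; a quick standalone illustration (the header string here is a made-up example):

```python
# A hypothetical quoted header line, as sc.textFile would return it
header = '"custID","gender","state"'

# Slice off the first and last character (the quotes) of each token
names = [h[1:-1] for h in header.split(',')]
print(names)  # ['custID', 'gender', 'state']
```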
Finally, we create our DataFrame.
# Create our DataFrame:
fraud_df = spark.createDataFrame(fraud, schema)
Now that the DataFrame is ready, we can calculate basic descriptive statistics for our dataset.
# Inspect the schema:
fraud_df.printSchema()
root
|-- custID: integer (nullable = true)
|-- gender: integer (nullable = true)
|-- state: integer (nullable = true)
|-- cardholder: integer (nullable = true)
|-- balance: integer (nullable = true)
|-- numTrans: integer (nullable = true)
|-- numIntlTrans: integer (nullable = true)
|-- creditLine: integer (nullable = true)
|-- fraudRisk: integer (nullable = true)
For categorical columns, we count the frequencies of their values using the .groupby(...) method.
# Group and count with the .groupby(...) method:
fraud_df.groupby('gender').count().show()
+------+-------+
|gender| count|
+------+-------+
| 1|6178231|
| 2|3821769|
+------+-------+
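.groupby(...).count() is the distributed analogue of a simple frequency count; conceptually, on a small local sample (made-up values) it does the same job as collections.Counter:

```python
from collections import Counter

# Hypothetical local sample of the gender column
genders = [1, 2, 1, 1, 2, 1]

# Frequency of each distinct value, like .groupby('gender').count()
freq = Counter(genders)
print(freq)  # Counter({1: 4, 2: 2})
```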
For the truly numerical features we can use the .describe() method.
numerical = ['balance', 'numTrans', 'numIntlTrans']
# Two things stand out in the descriptive statistics below:
# 1) All of the features are positively skewed: the maximum values are several times larger than the means.
# 2) The coefficient of variation is very high, approaching or even exceeding 1, which indicates widely spread, highly volatile data.
# Notes:
# Positively skewed: mean > median. A handful of extremely large values pulls the mean up ("averaged away"), while the median is barely affected by extremes, so the median is the more robust estimate of central tendency here.
# Negatively skewed: the reverse.
# Coefficient of variation = standard deviation / mean
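The mean-versus-median point is easy to see on a tiny made-up sample: one extreme value drags the mean well past the median.

```python
import statistics

# Hypothetical balances: mostly small, one extreme outlier
balances = [100, 200, 300, 400, 50_000]

mean = statistics.mean(balances)      # 10200 -- pulled up by the outlier
median = statistics.median(balances)  # 300   -- barely affected
assert mean > median                  # the signature of positive skew

# Coefficient of variation = stddev / mean
cv = statistics.pstdev(balances) / mean
print(round(cv, 2))  # 1.95 -- far above 1: highly dispersed data
```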
# Use the .describe() method to compute descriptive statistics for the numerical columns:
desc = fraud_df.describe(numerical)
# Display the summary statistics:
desc.show()
+-------+-----------------+------------------+-----------------+
|summary| balance| numTrans| numIntlTrans|
+-------+-----------------+------------------+-----------------+
| count| 10000000| 10000000| 10000000|
| mean| 4109.9199193| 28.9351871| 4.0471899|
| stddev|3996.847309737077|26.553781024522852|8.602970115863767|
| min| 0| 0| 0|
| max| 41485| 100| 60|
+-------+-----------------+------------------+-----------------+
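Plugging the means and standard deviations from the table above into cv = stddev / mean confirms the dispersion note: balance and numTrans sit just under 1, while numIntlTrans is well above it.

```python
# (mean, stddev) pairs copied from the describe() output above
stats = {
    'balance':      (4109.9199193, 3996.847309737077),
    'numTrans':     (28.9351871, 26.553781024522852),
    'numIntlTrans': (4.0471899, 8.602970115863767),
}

# Coefficient of variation = stddev / mean
for name, (mean, stddev) in stats.items():
    print(name, round(stddev / mean, 2))
# balance 0.97, numTrans 0.92, numIntlTrans 2.13
```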
Here's how you check skewness (we will do it for the 'balance' feature only).
# Check the skewness of a given feature:
fraud_df.agg({'balance': 'skewness'}).show()
# Other commonly used aggregation functions include: avg(), count(), countDistinct(), first(),
# kurtosis(), max(), mean(), min(), skewness(), stddev(), stddev_pop(), stddev_samp(), sum(),
# sumDistinct(), var_pop(), var_samp() and variance().
+------------------+
| skewness(balance)|
+------------------+
|1.1818315552995033|
+------------------+
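A skewness near 1.18, as above, means a pronounced right tail. The underlying quantity is moment-based skewness, m3 / m2**1.5; a pure-Python sketch of that definition on a tiny made-up sample (this mirrors the formula, not Spark's exact distributed implementation):

```python
def population_skewness(xs):
    """Moment-based skewness: m3 / m2**1.5 (zero for symmetric data)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

# A right-tailed (positively skewed) made-up sample
sample = [1, 2, 3, 4, 100]
print(population_skewness(sample) > 0)  # True -- right tail => positive skewness
```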