Reading the CSV file
Creating the DataFrame schema
Grouped counts with the .groupby(...) method
Descriptive statistics of the numerical columns with the .describe() method
Skewness & dispersion
Reference: https://blog.csdn.net/weixin_39599711/article/details/79072691
import pyspark.sql.types as typ
Next, we read the data in.
# Read the CSV file
fraud = sc.textFile('ccFraud.csv.gz')
# Grab the header row
header = fraud.first()
# Drop the header, split each row on commas, and convert every element to an integer:
fraud = fraud \
    .filter(lambda row: row != header) \
    .map(lambda row: [int(elem) for elem in row.split(',')])
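The filter-and-map step above boils down to plain Python string handling; here is a minimal local sketch of the same parsing logic (the sample rows are made up for illustration):

```python
# Hypothetical sample of raw lines, mimicking the ccFraud.csv layout
lines = [
    '"custID","gender","state"',  # header row
    '1,1,35',
    '2,2,2',
]

header = lines[0]

# Same logic as the RDD pipeline: drop the header, split on commas,
# and convert every field to an integer
rows = [
    [int(elem) for elem in line.split(',')]
    for line in lines
    if line != header
]
print(rows)  # [[1, 1, 35], [2, 2, 2]]
```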
Next, we create the schema for our DataFrame.
# Create the schema for the DataFrame (h[1:-1] strips the quotes around each header name):
fields = [
    typ.StructField(h[1:-1], typ.IntegerType(), True)
    for h in header.split(',')
]
schema = typ.StructType(fields)
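The `h[1:-1]` slice removes the surrounding double quotes from each header token; a quick standalone illustration (the header string here is a made-up example):

```python
# A hypothetical quoted header line, as sc.textFile would return it
header = '"custID","gender","state"'

# Slice off the first and last character (the quotes) of each token
names = [h[1:-1] for h in header.split(',')]
print(names)  # ['custID', 'gender', 'state']
```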
Finally, we create our DataFrame.
# Create our DataFrame:
fraud_df = spark.createDataFrame(fraud, schema)
Now that the DataFrame is ready, we can calculate basic descriptive statistics for our dataset.
# Inspect the schema:
fraud_df.printSchema()
root
|-- custID: integer (nullable = true)
|-- gender: integer (nullable = true)
|-- state: integer (nullable = true)
|-- cardholder: integer (nullable = true)
|-- balance: integer (nullable = true)
|-- numTrans: integer (nullable = true)
|-- numIntlTrans: integer (nullable = true)
|-- creditLine: integer (nullable = true)
|-- fraudRisk: integer (nullable = true)
For categorical columns, we count the frequencies of their values using the .groupby(...) method.
# Group and count with the .groupby(...) method:
fraud_df.groupby('gender').count().show()
+------+-------+
|gender| count|
+------+-------+
| 1|6178231|
| 2|3821769|
+------+-------+
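.groupby(...).count() is the distributed analogue of a simple frequency count; conceptually, on a small local sample (made-up values) it does the same job as collections.Counter:

```python
from collections import Counter

# Hypothetical local sample of the gender column
genders = [1, 2, 1, 1, 2, 1]

# Frequency of each distinct value, like .groupby('gender').count()
freq = Counter(genders)
print(freq)  # Counter({1: 4, 2: 2})
```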
For the truly numerical features we can use the .describe() method.
numerical = ['balance', 'numTrans', 'numIntlTrans']
# Two things stand out in the descriptive statistics below:
# 1) All of the features are positively skewed: the maximum values are several times larger than the means.
# 2) The coefficient of variation is very high, approaching or even exceeding 1, which indicates widely spread, highly volatile data.
# Notes:
# Positively skewed: mean > median. A handful of extremely large values pulls the mean up ("averaged away"), while the median is barely affected by extremes, so the median is the more robust estimate of central tendency here.
# Negatively skewed: the reverse.
# Coefficient of variation = standard deviation / mean
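The mean-versus-median point is easy to see on a tiny made-up sample: one extreme value drags the mean well past the median.

```python
import statistics

# Hypothetical balances: mostly small, one extreme outlier
balances = [100, 200, 300, 400, 50_000]

mean = statistics.mean(balances)      # 10200 -- pulled up by the outlier
median = statistics.median(balances)  # 300   -- barely affected
assert mean > median                  # the signature of positive skew

# Coefficient of variation = stddev / mean
cv = statistics.pstdev(balances) / mean
print(round(cv, 2))  # 1.95 -- far above 1: highly dispersed data
```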
# Use the .describe() method to compute descriptive statistics for the numerical columns:
desc = fraud_df.describe(numerical)
# Display the summary statistics:
desc.show()
+-------+-----------------+------------------+-----------------+
|summary| balance| numTrans| numIntlTrans|
+-------+-----------------+------------------+-----------------+
| count| 10000000| 10000000| 10000000|
| mean| 4109.9199193| 28.9351871| 4.0471899|
| stddev|3996.847309737077|26.553781024522852|8.602970115863767|
| min| 0| 0| 0|
| max| 41485| 100| 60|
+-------+-----------------+------------------+-----------------+
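Plugging the means and standard deviations from the table above into cv = stddev / mean confirms the dispersion note: balance and numTrans sit just under 1, while numIntlTrans is well above it.

```python
# (mean, stddev) pairs copied from the describe() output above
stats = {
    'balance':      (4109.9199193, 3996.847309737077),
    'numTrans':     (28.9351871, 26.553781024522852),
    'numIntlTrans': (4.0471899, 8.602970115863767),
}

# Coefficient of variation = stddev / mean
for name, (mean, stddev) in stats.items():
    print(name, round(stddev / mean, 2))
# balance 0.97, numTrans 0.92, numIntlTrans 2.13
```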
Here's how you check skewness (we will do it for the 'balance' feature only).
# Check the skewness of a given feature:
fraud_df.agg({'balance': 'skewness'}).show()
# Other commonly used aggregation functions include: avg(), count(), countDistinct(), first(),
# kurtosis(), max(), mean(), min(), skewness(), stddev(), stddev_pop(), stddev_samp(), sum(),
# sumDistinct(), var_pop(), var_samp() and variance().
+------------------+
| skewness(balance)|
+------------------+
|1.1818315552995033|
+------------------+
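A skewness near 1.18, as above, means a pronounced right tail. The underlying quantity is moment-based skewness, m3 / m2**1.5; a pure-Python sketch of that definition on a tiny made-up sample (this mirrors the formula, not Spark's exact distributed implementation):

```python
def population_skewness(xs):
    """Moment-based skewness: m3 / m2**1.5 (zero for symmetric data)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

# A right-tailed (positively skewed) made-up sample
sample = [1, 2, 3, 4, 100]
print(population_skewness(sample) > 0)  # True -- right tail => positive skewness
```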