pandas Data Discretization
Discretization
- 0 Prepare the data
- 1 Bin the values
- 2 Discretize (convert to one-hot encoding)
import pandas as pd
height_list = [165, 174, 160, 180, 159, 163, 192, 184]
data = pd.Series(height_list)
data
0 165
1 174
2 160
3 180
4 159
5 163
6 192
7 184
dtype: int64
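Step 1: Binning
(1) Automatic binning: qcut(Series, number of bins)
qcut chooses quantile-based bin edges, so each bin holds roughly the same number of values.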
sr = pd.qcut(data, 3)
sr
0 (163.667, 178.0]
1 (163.667, 178.0]
2 (158.999, 163.667]
3 (178.0, 192.0]
4 (158.999, 163.667]
5 (158.999, 163.667]
6 (178.0, 192.0]
7 (178.0, 192.0]
dtype: category
Categories (3, interval[float64]): [(158.999, 163.667] < (163.667, 178.0] < (178.0, 192.0]]
Check how many values fall into each bin with value_counts()
sr.value_counts()
(178.0, 192.0] 3
(158.999, 163.667] 3
(163.667, 178.0] 2
dtype: int64
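To see exactly where qcut placed the boundaries, the bin edges can be requested along with the binned result. The following is a minimal sketch using the same height data; the variable names `binned` and `edges` are chosen here for illustration only.

```python
import pandas as pd

height_list = [165, 174, 160, 180, 159, 163, 192, 184]
data = pd.Series(height_list)

# retbins=True additionally returns the computed bin edges as a NumPy array
binned, edges = pd.qcut(data, 3, retbins=True)

# Three bins are delimited by four edges
print(edges)

# Sort by interval so the counts are listed from the lowest bin to the highest
print(binned.value_counts().sort_index())
```

Because the edges come from quantiles of the data, they move whenever the data changes, which is the main difference from the fixed edges passed to cut.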
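(2) Custom binning: cut(Series, [bin edges])
By default each interval is open on the left and closed on the right, so 165 falls into (150, 165].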
sr_cut = pd.cut(data, [150, 165, 180, 195])
sr_cut.value_counts()
(150, 165] 4
(180, 195] 2
(165, 180] 2
dtype: int64
pd.get_dummies(sr_cut, prefix='自定义身高')
| | 自定义身高_(150, 165] | 自定义身高_(165, 180] | 自定义身高_(180, 195] |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 |
| 5 | 1 | 0 | 0 |
| 6 | 0 | 0 | 1 |
| 7 | 0 | 0 | 1 |
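The custom bins above can also be given readable names, which carry over into the one-hot column names. This is a small sketch rather than part of the original walkthrough: the label names `short`, `medium`, `tall` and the prefix `height` are hypothetical choices.

```python
import pandas as pd

data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# Hypothetical label names; cut needs exactly one label per bin
labels = ["short", "medium", "tall"]
sr_labeled = pd.cut(data, [150, 165, 180, 195], labels=labels)

# Each height is now replaced by the label of the bin it falls into
print(sr_labeled.tolist())

# The labels replace the interval text in the generated column names
dummies = pd.get_dummies(sr_labeled, prefix="height")
print(dummies.columns.tolist())
```

Named bins tend to be easier to read downstream than interval strings such as `(150, 165]`, at the cost of hiding the exact edges.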
Step 2: Discretize with get_dummies(binned data, prefix="prefix")
- This converts the binned data into one-hot encoding, also called dummy variables.
pd.get_dummies(sr, prefix='自动身高')
| | 自动身高_(158.999, 163.667] | 自动身高_(163.667, 178.0] | 自动身高_(178.0, 192.0] |
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 |
| 5 | 1 | 0 | 0 |
| 6 | 0 | 0 | 1 |
| 7 | 0 | 0 | 1 |
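In practice the one-hot columns are usually placed next to the original values. One possible way to do that, sketched here under the assumption that the Series is named `height` (a name not used in the original), is to concatenate along the column axis.

```python
import pandas as pd

data = pd.Series([165, 174, 160, 180, 159, 163, 192, 184], name="height")
sr = pd.qcut(data, 3)

# dtype=int requests 0/1 columns; some pandas versions return booleans by default
dummies = pd.get_dummies(sr, prefix="height", dtype=int)

# Both objects share the same index, so concat lines the rows up one to one
result = pd.concat([data, dummies], axis=1)
print(result)
```

Every row has exactly one 1 across the encoded columns, because each height belongs to exactly one bin.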