Stratified random sampling by column in pandas

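The goal is a multi-step process: first draw a stratified random sample of IDs by category, then pull the samples that belong to the selected IDs. This involves: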

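Stratified random sampling by category: first, group the data by the category column and randomly select a fixed number of IDs from each category.
Fetching all related samples for the selected IDs: extract from the original dataset every row whose ID is among those selected.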
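
As a compact illustration of these two steps, here is a minimal sketch that assumes pandas 1.1 or later, where `GroupBy.sample` is available. The frame `toy` and the names `picked_ids` and `picked_rows` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical toy data: six IDs split across two categories
toy = pd.DataFrame({
    'id': [101, 102, 103, 104, 105, 106],
    'category': ['a', 'b', 'a', 'b', 'a', 'b'],
})

# Step 1: draw 2 IDs from each category (GroupBy.sample needs pandas >= 1.1)
picked_ids = toy.groupby('category')['id'].sample(n=2, random_state=0)

# Step 2: keep every row whose ID was picked
picked_rows = toy[toy['id'].isin(picked_ids)]
print(picked_rows)
```

Which IDs are drawn depends on the seed, but each category always contributes exactly two of them.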

The code below shows how to implement these steps with pandas.

import pandas as pd

# Build a sample DataFrame: 20 unique IDs alternating between two categories
data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    'value': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T'],
    'category': ['cat1', 'cat2', 'cat1', 'cat2', 'cat1', 'cat2', 'cat1', 'cat2', 'cat1', 'cat2',
                 'cat1', 'cat2', 'cat1', 'cat2', 'cat1', 'cat2', 'cat1', 'cat2', 'cat1', 'cat2']
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

def stratified_sample_ids(df, category_col, num_ids_per_category, random_state=None):
    """Randomly pick num_ids_per_category values of 'id' from each category."""
    sampled_ids = []
    for category, group in df.groupby(category_col):
        # The same random_state is passed for every group, so with an integer
        # seed, groups of equal size are sampled at the same positions.
        sampled_ids.extend(group['id'].sample(n=num_ids_per_category, random_state=random_state).tolist())
    return sampled_ids

def get_samples_by_ids(df, ids):
    """Return every row of df whose 'id' is in ids, in the original row order."""
    return df[df['id'].isin(ids)]

# Randomly select the requested number of IDs
num_ids_per_category = 3  # pick 3 IDs at random from each category
sampled_ids = stratified_sample_ids(df, 'category', num_ids_per_category, random_state=42)

# Fetch all related samples
sampled_df = get_samples_by_ids(df, sampled_ids)

print("\nIDs selected by stratified random sampling per category:")
print(sampled_ids)

print("\nAll related samples for the selected IDs:")
print(sampled_df)

Output

Original DataFrame:
    id value category
0    1     A     cat1
1    2     B     cat2
2    3     C     cat1
3    4     D     cat2
4    5     E     cat1
5    6     F     cat2
6    7     G     cat1
7    8     H     cat2
8    9     I     cat1
9   10     J     cat2
10  11     K     cat1
11  12     L     cat2
12  13     M     cat1
13  14     N     cat2
14  15     O     cat1
15  16     P     cat2
16  17     Q     cat1
17  18     R     cat2
18  19     S     cat1
19  20     T     cat2

IDs selected by stratified random sampling per category:
[17, 3, 11, 18, 4, 12]

All related samples for the selected IDs:
    id value category
2    3     C     cat1
3    4     D     cat2
10  11     K     cat1
11  12     L     cat2
16  17     Q     cat1
17  18     R     cat2
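
In the example above each ID appears in exactly one row, so fetching rows by ID returns only the sampled rows. The second step matters more when one ID owns several rows. The sketch below assumes that setup: it samples over unique (category, ID) pairs, shares a single `RandomState` across groups so that each category gets its own draw, and caps the draw at the group size. The helper `sample_unique_ids_per_category` and the `events` data are hypothetical, not part of the original code.

```python
import numpy as np
import pandas as pd

def sample_unique_ids_per_category(df, category_col, id_col, n_per_category, seed=None):
    """Hypothetical helper: draw up to n_per_category distinct IDs from each category."""
    rng = np.random.RandomState(seed)  # one generator shared by all groups
    # Keep one row per (category, id) pair so a multi-row ID cannot be drawn twice
    pairs = df[[category_col, id_col]].drop_duplicates()
    picked = []
    for _, group in pairs.groupby(category_col):
        n = min(n_per_category, len(group))  # small categories contribute all their IDs
        picked.extend(group[id_col].sample(n=n, random_state=rng).tolist())
    return picked

# Illustrative data: each ID may own several rows and belongs to one category
events = pd.DataFrame({
    'id':       [1, 1, 2, 2, 2, 3, 4, 4, 5, 6, 6, 7],
    'category': ['cat1', 'cat1', 'cat2', 'cat2', 'cat2', 'cat1',
                 'cat2', 'cat2', 'cat1', 'cat2', 'cat2', 'cat1'],
    'value':    list('ABCDEFGHIJKL'),
})

ids = sample_unique_ids_per_category(events, 'category', 'id', n_per_category=2, seed=42)
subset = events[events['id'].isin(ids)]  # all rows of every sampled ID
print(ids)
print(subset)
```

Here `subset` can contain more rows than `ids` has entries, because every row of a sampled ID is kept.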

For a SQL version, see https://blog.csdn.net/u013069552/article/details/140685182. It runs on a big-data platform, so processing is relatively fast.
