Sampling Methods for Imbalanced Data

Category: Other | Posted: 2019-02-27 21:38:07

References:

https://towardsdatascience.com/dealing-with-class-imbalanced-datasets-for-classification-2cc6fad99fd9

a. Undersampling

Say you have 40,000 positive samples and 2,000 negative samples in your dataset. We will use this as our running example from here on. You can randomly pick 2,000 of the 40,000 positive samples, keep all 2,000 negative samples, and train and validate your model on only these 4,000 samples. This lets you use any classification algorithm in the usual way. The method is easy to implement and runs fast. The downside is that you discard the other 38,000 positive samples, so that data goes down the drain.
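
A minimal sketch of random undersampling on the running example, using only the Python standard library. The function name `random_undersample` and the toy `(feature, label)` pairs are assumptions made for illustration; they are not from the referenced article.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep all minority samples plus an equally sized random subset
    of the majority samples (drawn without replacement)."""
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, k=len(minority))
    balanced = kept_majority + list(minority)
    rng.shuffle(balanced)  # avoid having the classes grouped together
    return balanced

# Toy stand-ins for the running example: (feature, label) pairs.
positives = [(i, 1) for i in range(40_000)]  # majority class
negatives = [(i, 0) for i in range(2_000)]   # minority class

balanced = random_undersample(positives, negatives)
n_pos = sum(1 for _, label in balanced if label == 1)
n_neg = len(balanced) - n_pos
print(n_pos, n_neg)  # 2000 2000
```

The seed is fixed only to make the draw reproducible; the 38,000 positive samples that were not drawn are simply never seen by the model.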

To overcome this, you can create an ensemble of models in which each model is trained and validated separately on a different set of 2,000 positive samples together with all 2,000 negative samples. On your test set, you then take a majority vote across these models. This lets you take all of your data into account without causing an imbalance. You can even use different algorithms for different sets, which can make the ensemble more robust. However, this is more computationally expensive.
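
The ensemble idea can be sketched as follows. This is an illustration under stated assumptions: the 40,000 positive samples are cut into 20 disjoint chunks of 2,000 (one possible reading of "a different set"), the data is synthetic 1-D Gaussian noise, and the nearest-class-mean model is a hypothetical stand-in for whatever classifier you would really train.

```python
import random
from statistics import mean

def split_majority(majority, chunk_size, seed=0):
    """Shuffle the majority class and cut it into disjoint chunks."""
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    return [shuffled[i:i + chunk_size]
            for i in range(0, len(shuffled) - chunk_size + 1, chunk_size)]

def fit_centroid_model(pos_x, neg_x):
    """Stand-in classifier: remembers the mean feature value of each class."""
    return mean(pos_x), mean(neg_x)

def predict_one(model, x):
    pos_mean, neg_mean = model
    return 1 if abs(x - pos_mean) < abs(x - neg_mean) else 0

def majority_vote(models, x):
    votes = sum(predict_one(m, x) for m in models)
    # With an even number of models a tie is possible; sending ties to
    # class 0 is an arbitrary choice made for this sketch.
    return 1 if 2 * votes > len(models) else 0

rng = random.Random(42)
positives = [rng.gauss(2.0, 1.0) for _ in range(40_000)]  # majority, label 1
negatives = [rng.gauss(-2.0, 1.0) for _ in range(2_000)]  # minority, label 0

# Each model sees one chunk of positives and all of the negatives.
chunks = split_majority(positives, chunk_size=len(negatives))
models = [fit_centroid_model(chunk, negatives) for chunk in chunks]

print(len(models))  # 20
print(majority_vote(models, 3.0), majority_vote(models, -3.0))  # 1 0
```

Every positive sample ends up in exactly one model's training set, which is how the ensemble uses all of the data while each individual model still trains on a balanced subset. The cost is 20 training runs instead of one.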

b. Oversampling

In this method, you generate more samples of your minority class. You can do this either by fitting a generative model and then sampling new points from it, or by simply picking existing samples with replacement. A number of oversampling techniques exist, such as SMOTE and ADASYN; you will have to see which works best for your use case. Oversampling can itself be computationally expensive. Its major advantage is that a single model can take all of your data into consideration at once, and it also helps you generate new data.
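
The simpler of the two options, picking existing samples with replacement, might look like the sketch below. The function name `random_oversample` and the toy data are assumptions for illustration only.

```python
import random

def random_oversample(minority, target_size, seed=0):
    """Grow the minority class to target_size by re-drawing existing
    samples with replacement. No new points are invented."""
    rng = random.Random(seed)
    n_extra = max(0, target_size - len(minority))
    extra = rng.choices(minority, k=n_extra)  # sampling with replacement
    return list(minority) + extra

# Running example: 2,000 negative samples grown to match 40,000 positives.
negatives = [(i, 0) for i in range(2_000)]
oversampled = random_oversample(negatives, target_size=40_000)

print(len(oversampled))       # 40000
print(len(set(oversampled)))  # 2000 distinct samples, the rest are copies
```

Because every added row is an exact copy, this variant adds no new information; it only changes how much weight the minority class carries during training. Techniques such as SMOTE and ADASYN create new points instead of copies.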

SMOTE algorithm
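
The core step of SMOTE is to place a new point on the line segment between a minority sample and one of its k nearest minority-class neighbours. The sketch below is a simplified illustration of that step, not a reference implementation: it picks base points at random rather than looping over every minority sample, uses brute-force neighbour search, and the name `smote_sketch` and the toy 2-D data are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Brute-force k nearest minority neighbours, excluding the point itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class: 50 points in the unit square.
rng = random.Random(1)
minority = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(50)]

synthetic = smote_sketch(minority, n_new=100)
print(len(synthetic))  # 100
```

Each synthetic point lies between two existing minority points, so the new data stays inside the region the minority class already occupies. For real work, a maintained library implementation is a safer choice than this sketch.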

ADASYN algorithm
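
ADASYN interpolates in the same way as SMOTE, but it decides adaptively how many synthetic points each minority sample gets: samples whose neighbourhood is dominated by the majority class receive more. The sketch below illustrates that weighting under simplifying assumptions (brute-force neighbour search, interpolation partners taken from the k nearest minority neighbours, simple rounding of the per-sample counts). The name `adasyn_sketch` and the toy data are hypothetical.

```python
import math
import random

def adasyn_sketch(minority, majority, n_new, k=3, seed=0):
    rng = random.Random(seed)
    # Minority points come first, so index i below refers to minority[i].
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]

    # Step 1: difficulty of each minority point = share of majority
    # points among its k nearest neighbours in the whole dataset.
    ratios = []
    for i, base in enumerate(minority):
        others = [item for j, item in enumerate(labelled) if j != i]
        nearest = sorted(others, key=lambda item: math.dist(base, item[0]))[:k]
        ratios.append(sum(is_majority for _, is_majority in nearest) / k)

    total = sum(ratios)
    if total == 0:
        return []  # no minority point has any majority neighbour

    # Step 2: share out n_new in proportion to difficulty. Rounding means
    # the final count can differ slightly from n_new.
    counts = [round(n_new * r / total) for r in ratios]

    # Step 3: SMOTE-style interpolation towards a nearby minority point.
    synthetic = []
    for i, (base, count) in enumerate(zip(minority, counts)):
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        for _ in range(count):
            neighbour = rng.choice(neighbours)
            gap = rng.random()
            synthetic.append(tuple(b + gap * (n - b)
                                   for b, n in zip(base, neighbour)))
    return synthetic, counts

# Four "easy" minority points in a tight cluster, one "hard" point
# at (10, 10) that is surrounded by majority points.
minority = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
majority = [(10, 11), (11, 10), (9, 10), (10, 9)]

synthetic, counts = adasyn_sketch(minority, majority, n_new=10)
print(counts)  # [0, 0, 0, 0, 10]
```

In this toy setup all ten synthetic points start from the hard point at (10, 10), because it is the only minority sample with majority-class neighbours. Plain SMOTE would have spread the new points across all five samples.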

Reposted from www.cnblogs.com/wuxiangli/p/10447109.html
