Machine Learning Feature Engineering - Class-Related Statistical Features

Notes on Section 4.3 of Yulao's "Machine Learning Algorithm Competition in Action", which mainly covers how to perform target encoding and build category-related statistical features.

It is difficult for machine learning models to recognize complex patterns on their own, especially interaction information between different features, so features need to be constructed based on intuition or business understanding.

For category-related statistical features, consider:

  • Target encoding
    • Combines categorical features with the target variable
  • count, nunique, ratio
    • Commonly used methods for constructing categorical features
  • Categorical feature cross-combination
    • Combinations between categorical features

Target encoding

Target encoding encodes a categorical feature with statistics of the target variable (usually the label). For different tasks:

  • Classification: the number of positive and negative samples, or the positive-to-negative ratio
  • Regression: the mean, median, or maximum of the target variable

Intuitive understanding: group by the categorical feature, then replace (map) each category with the statistics of the target variable within that group.
In pandas: df.groupby('categorical_feature')['target_variable'].aggregation
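The groupby-then-map idiom above can be sketched on toy data (the column names "city" and "label" are illustrative, not from the book):

```python
import pandas as pd

# Toy data: a categorical feature and a binary target.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "label": [1, 0, 1, 1, 0, 0],
})

# Group by the categorical feature, aggregate the target,
# then map the per-category statistic back onto each row.
order_label = df.groupby("city")["label"].mean()
df["city_te"] = df["city"].map(order_label)
```

Each row of `city_te` now holds the mean label of its city group (0.5 for A, 2/3 for B, 0.0 for C).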

The sample code given in the book is as follows:

from sklearn.model_selection import KFold

folds = KFold(n_splits=5, shuffle=True, random_state=2020)
for col in columns:
    colname = col + '_kfold'
    # 5-fold cross-validation
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(train)):
        tmp = train.iloc[trn_idx]
        # target statistics come only from the training folds
        order_label = tmp.groupby([col])['label'].mean()
        # map only the validation-fold rows (assumes a default integer index)
        train.loc[val_idx, colname] = train[col].iloc[val_idx].map(order_label)
        ......
    # encode the test-set categorical feature using the full training set
    order_label = train.groupby([col])['label'].mean()
    test[colname] = test[col].map(order_label)

Analysis:

  1. In the 5-fold cross-validation, the statistics of the target variable (label) are computed on the other four folds to prevent information leakage, i.e. order_label = tmp.groupby([col])['label'].mean() — note that it is tmp here. When the categorical feature is then replaced, only the validation-fold rows should be mapped, i.e. train.loc[val_idx, colname] = train[col].iloc[val_idx].map(order_label).
  2. When processing the test data, the group aggregation runs over the entire training set.
  3. The example computes the mean of the target variable, which suits a regression problem. For classification problems, a statistic such as the positive-to-negative sample ratio can be written as order_label = tmp.groupby([col])['label'].agg(lambda x: x.sum() / (1 - x).sum()).
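A minimal runnable sketch of the positive-to-negative ratio aggregation from point 3, on toy data (the column names "cat" and "label" are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B"],
    "label": [1, 1, 0, 0, 0],
})

# Positive-to-negative ratio per category (labels are 0/1).
# Note: if a category has no negative samples, the denominator is zero;
# in practice a smoothing term is often added to the denominator.
ratio = df.groupby("cat")["label"].agg(lambda x: x.sum() / (1 - x).sum())
```

Category A has 2 positives and 1 negative (ratio 2.0); B has 0 positives and 2 negatives (ratio 0.0).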

Count, nunique, ratio

For the three statistics count, nunique, and ratio, the book gives helpful examples. count (the count feature) simply counts how often each categorical value appears. Constructing nunique and ratio is more involved and usually combines multiple categorical features. For example, in advertising click-through-rate prediction, consider the user ID and the ad ID: nunique reflects the breadth of a user's interest in ads, i.e. how many distinct ad IDs a user ID has seen; ratio reflects the user's preference for a particular ad, i.e. the ratio of the user's clicks on a certain ad ID to the user's clicks on all ad IDs.
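The count, nunique, and ratio features for the ad example can be sketched with pandas transform (a hypothetical click log; the column names user_id and ad_id are assumptions for illustration):

```python
import pandas as pd

# Hypothetical click log: each row is one click by a user on an ad.
clicks = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "ad_id":   [10, 10, 20, 10, 30],
})

# count: how many clicks each user made in total
clicks["user_count"] = clicks.groupby("user_id")["ad_id"].transform("count")

# nunique: how many distinct ads each user has seen (breadth of interest)
clicks["user_ad_nunique"] = clicks.groupby("user_id")["ad_id"].transform("nunique")

# ratio: clicks on this particular ad / all clicks by this user
pair_count = clicks.groupby(["user_id", "ad_id"])["ad_id"].transform("count")
clicks["user_ad_ratio"] = pair_count / clicks["user_count"]
```

User 1 clicked 3 times on 2 distinct ads, so the ratio for ad 10 is 2/3 and for ad 20 is 1/3.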

Cross-combination of categorical features: the main thing to watch is the cardinality of the combined category, so no further details here.
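A common way to cross-combine two low-cardinality categorical features is string concatenation, optionally followed by a count encoding of the new feature (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M"],
    "city":   ["A", "A", "B", "B", "A"],
})

# Concatenate two categorical features into a new combined category.
# The cardinality of the result is up to |gender| * |city|, so this
# only makes sense for low-cardinality inputs.
df["gender_city"] = df["gender"].astype(str) + "_" + df["city"].astype(str)

# The combined feature can then be encoded like any other category,
# e.g. with a count feature.
df["gender_city_count"] = df.groupby("gender_city")["gender_city"].transform("count")
```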


Origin blog.csdn.net/qq_38869560/article/details/130494180