XGBoost Official Blog: Feature Selection and Engineering in XGBoost

Author: Zen and the Art of Computer Programming

1 Introduction

1.1 What is XGBoost?

XGBoost (Extreme Gradient Boosting) is an open-source ensemble learning library that implements gradient-boosted decision trees. It is efficient, accurate, and reliable, and is widely used on competition platforms such as Kaggle and Alibaba's Tianchi. Its main advantages are as follows:

  1. Highly parallelizable, enabling fast training on very large datasets;
  2. Handles many kinds of features, including continuous values, categorical variables, and missing values;
  3. Supports custom loss functions, which can be used to control how sensitive the model is to outliers;
  4. Trains quickly and is well suited to high-dimensional sparse data.
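
To make the point about custom loss functions concrete: a gradient boosting objective is defined by the first and second derivatives of the loss with respect to the prediction. The sketch below computes both for the pseudo-Huber loss, a common outlier-robust choice that the original text does not name. The function name `pseudo_huber_grad_hess` and the `delta` parameter are illustrative assumptions, and the code uses plain Python lists instead of the library's own data structures. In the XGBoost Python package, a custom objective is typically supplied as a callable through the `obj` argument of `xgboost.train`; this example only illustrates the math and is not wired to the library.

```python
import math


def pseudo_huber_grad_hess(preds, labels, delta=1.0):
    """Return per-sample gradients and Hessians of the pseudo-Huber loss.

    Loss for residual r = pred - label:
        L(r) = delta**2 * (sqrt(1 + (r / delta)**2) - 1)
    It behaves like squared error for small residuals and like absolute
    error for large ones, so outliers have a bounded gradient.
    """
    grads, hessians = [], []
    for pred, label in zip(preds, labels):
        residual = pred - label
        scale = 1.0 + (residual / delta) ** 2
        # First derivative: r / sqrt(1 + (r / delta)^2)
        grads.append(residual / math.sqrt(scale))
        # Second derivative: (1 + (r / delta)^2) ^ (-3/2)
        hessians.append(1.0 / (scale * math.sqrt(scale)))
    return grads, hessians
```

With `delta=1.0`, a residual of 100 yields a gradient just under 1, whereas squared error would yield a gradient proportional to 100. This bounded gradient is what limits the influence of outliers.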

1.2 Why feature selection?

Machine learning models often rely on a range of feature extraction methods to uncover patterns in the data. The number of features used directly affects model performance: removing redundant or irrelevant features can reduce training time and overfitting, while removing informative ones hurts accuracy. How to select the subset of candidate features that gives the best model has therefore become an active research problem.

There are three main approaches to feature selection:

  1. Filter method: score each feature independently of any model and drop those with low variance or low correlation with the target, keeping the most representative ones. This method is simple and fast, but it may discard important information.
  2. Wrapper method: train a base classifier on candidate feature subsets and select features according to its predictive performance. Because features are evaluated together, this method captures more global feature information than the filter method, but training takes longer.
  3. Embedded method: feature selection happens as part of training the base classifier, which uses heuristic rules tied to its objective function to determine which features matter most. For example, Lasso regression shrinks the coefficients of unimportant features to zero and keeps the features with nonzero coefficients, while random forest selects
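
As an illustration of the filter method in the list above, here is a minimal sketch in pure Python that drops features with near-zero variance or with a weak Pearson correlation to the target. The function names, the thresholds `min_variance` and `min_abs_corr`, and the toy data are assumptions chosen for the example, not values from the original article.

```python
import math
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def filter_select(columns, target, min_variance=1e-8, min_abs_corr=0.3):
    """Keep feature names that pass both a variance and a correlation filter."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # (near-)constant feature carries no information
        if abs(pearson(values, target)) < min_abs_corr:
            continue  # weak linear relationship with the target
        kept.append(name)
    return kept


# Toy data (hypothetical): one constant, one informative, one uncorrelated feature.
target = [1, 2, 3, 4, 5]
columns = {
    "constant": [7, 7, 7, 7, 7],
    "signal": [2, 4, 6, 8, 10],
    "noise": [1, -1, 0, -1, 1],
}
selected = filter_select(columns, target)  # ['signal']
```

Each feature is scored in isolation here, which is why the approach is cheap. It is also why it can miss features that are only useful in combination with others, the weakness the wrapper and embedded methods aim to address.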

Origin blog.csdn.net/universsky2015/article/details/131875046