The Past and Present of Boosting Algorithms (Part II)

WeChat official account: AIKaggle
Suggestions and criticism are welcome; for resources, leave a message on the account. If AIKaggle has helped you, tips are appreciated.

This series traces the development of Boosting algorithms, starting from the basic boosting idea and covering AdaBoost, GBDT/GBRT, XGBoost, LightGBM, CatBoost, and ThunderGBM, introducing the principles, frameworks, and derivations of the Boosting family. Part I of "The Past and Present of Boosting Algorithms" introduced AdaBoost and gradient boosted trees; this installment describes in detail the XGBoost algorithm proposed by Tianqi Chen; the next installment will cover LightGBM, CatBoost, and ThunderGBM. XGBoost was proposed in 2014 by Tianqi Chen at the University of Washington. It performs a second-order Taylor expansion of the loss function, so both the first and second derivatives are used during optimization, and, following the principle of structural risk minimization, it adds the complexity of the tree model to the loss function as a regularization term. If you are interested in machine learning algorithms and practical cases, follow the official account AIKaggle for updates.

A Snapshot

  • XGBoost is a variant of GBDT. The biggest difference is that XGBoost takes a second-order Taylor expansion of the objective function, from which it obtains the optimal weights of the leaf nodes of the next tree to be fitted (once the tree structure is determined), and from the loss function it obtains how much each candidate node split reduces the loss, choosing the splitting attribute according to that reduction.
  • The second-order expansion of the loss function and the node-splitting procedure are closely related. The algorithm first traverses all attributes of every node to be split; assuming the node is split at a value \(a\) of some attribute, the Taylor-expansion formula determines the corresponding leaf weights of the resulting tree structure, from which the reduction in loss can be computed. Combining these, the attribute whose split reduces the loss the most is selected as the splitting attribute of the current node, and so on until a stopping condition is met. A rough end-to-end sketch of this additive training scheme is given below.
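To make the snapshot concrete, below is a rough Newton-boosting sketch of this additive training scheme. It is my own illustration, not XGBoost's implementation: it assumes squared-error loss, ignores the regularization terms \(\lambda\) and \(\gamma\), and borrows sklearn's `DecisionTreeRegressor` as a stand-in for XGBoost's tree learner (fitting a weighted tree to the Newton step \(-g/h\) minimizes the second-order approximation of the loss).

```python
# A rough sketch of second-order (Newton) boosting, NOT XGBoost's actual code:
# each round computes per-sample gradients g and hessians h, then fits a tree
# to the Newton step -g/h with sample weights h (lambda/gamma omitted here).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
eta, n_rounds = 0.1, 50            # shrinkage and number of boosting rounds
pred = np.zeros_like(y)            # \hat y^{(0)}
trees = []

for t in range(n_rounds):
    g = pred - y                   # first derivative of 1/2*(pred - y)^2
    h = np.ones_like(y)            # second derivative is constant for squared error
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, -g / h, sample_weight=h)   # weighted fit to the Newton step
    pred += eta * tree.predict(X)          # add the new tree with shrinkage
    trees.append(tree)

print("training MSE:", np.mean((pred - y) ** 2))
```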

Analysis of the objective function

Preliminary understanding of the objective function

  • The objective function:
    \[Obj^{(t)} = \sum_{i=1}^{n} l(y_i, \hat y_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + constant\]
    where \(Obj^{(t)}\) denotes the objective function at the \(t\)-th iteration, \(\Omega(f_t)\) is the regularization term (including the \(L_1\) and \(L_2\) regularization terms), and \(constant\) is a constant term.

Use a Taylor expansion to approximate the original objective:

  • Taylor expansion: \[f(x+\Delta x) \simeq f(x) + f'(x)\Delta x + \frac{1}{2}f''(x)\Delta x^2\]

  • Define (a worked example for squared-error loss is given at the end of this subsection):
    \[g_i = \partial_{\hat y_i^{(t-1)}}l(y_i, \hat y_i^{(t-1)})\]
    \[h_i = \partial^2_{\hat y_i^{(t-1)}}l(y_i, \hat y_i^{(t-1)})\]

  • Then the objective function can be written as (Equation 1):

\[Obj^{(t)} \simeq \sum_{i=1}^{n}[l(y_i, \hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_if^2_t(x_i)]+\Omega(f_t)+constant \]

  • After the Taylor expansion it is clear that the final objective function depends only on the first and second derivatives of the error function at each data point. We set the objective function aside for now and briefly study the tree structure, in order to make the regularization term \(\Omega(f_t)\) explicit.

  • \(x = (x_1, x_2, ..., x_n) \in X \subset R^n\) is a data point, and \(f\), the mapping determined by a tree structure, maps a data point to a real number:
    \[f: X \rightarrow R\]
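As a quick sanity check (my own addition, not in the original), substituting the squared-error loss \(l(y_i, \hat y_i) = \frac{1}{2}(y_i - \hat y_i)^2\) into the definitions of \(g_i\) and \(h_i\) above gives
\[g_i = \hat y_i^{(t-1)} - y_i, \qquad h_i = 1,\]
so for this loss the second-order expansion reduces to fitting the residuals with a constant Hessian.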

The complexity of the tree

  • Next, define the complexity of the tree.

  • We refine the definition of \(f\) by splitting the tree into a structure part \(q\) and a leaf-weight part \(w\). Below is a concrete example; starting from the figure, we explain how the mapping \(f\) can be expressed in terms of \(w\) and \(q\). The structure function \(q\) maps an input \(x\) to the index of a leaf, and \(w\) assigns a score to the leaf node with that index.

  • As an example, call the little boy in blue sample point \(x_1\). Since \(q\) maps a data point to a leaf index, \(q(x_1) = 1\) (corresponding to Leaf 1); \(w\) maps a leaf index to the score of the corresponding leaf node, so \(w(1) = +2\); composing the two steps, \(w(q(x_1)) = w(1) = 2\).

  • The complexity is defined to include the number of leaf nodes in the tree and the squared \(L_2\) norm of the output scores of the leaf nodes. Of course, this is not the only possible definition, but trees learned with this definition generally perform quite well. The figure also shows an example of computing the complexity.

  • To explain the figure: the tree has three leaf nodes, so the first term of the complexity \(\Omega(f_t)\) is \(\gamma T = 3\gamma\), and the second term is \(\frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2 = \frac{1}{2}\lambda\times((+2)^2+(0.1)^2+(-1)^2) = \frac{1}{2}\lambda\times(4+0.01+1)\) (a small numeric check follows after this list).

  • Substituting this regularization term from the figure into Equation 1, we can rewrite the objective function and tie it more closely to the tree structure.
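As a tiny numeric check of the complexity term in the three-leaf example above (the values of \(\gamma\) and \(\lambda\) below are only illustrative placeholders):

```python
# Omega(f) = gamma*T + 0.5*lambda*sum_j w_j^2 for the three-leaf example:
# leaf weights +2, 0.1 and -1, with illustrative gamma = lambda = 1.
leaf_weights = [2.0, 0.1, -1.0]
gamma, lam = 1.0, 1.0
T = len(leaf_weights)
omega = gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)
print(omega)  # 3*1 + 0.5*1*(4 + 0.01 + 1) = 5.505
```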

Rewrite the objective function

  • Ignoring the constant term, the objective function can be rewritten as:

\[Obj^{(t)}\simeq \sum_{i=1}^{n}[g_if_t(x_i)+\frac12h_if_t^2(x_i)]+\Omega(f_t)\]

  • Next, express the mapping \(f\) in terms of \(w\) and \(q\):
    \[Obj^{(t)}= \sum_{i=1}^{n}[g_iw_{q(x_i)}+\frac12h_iw_{q(x_i)}^2]+\gamma T+\frac12\lambda\sum_{j=1}^{T}w_j^2\]
  • Introduce the set \(I_j\) of sample points on each leaf. For example, Leaf 3 in the figure above contains three sample points: the mother wearing an apron, the grandfather, and the grandmother, so \(I_3\) is the set containing these three sample points {mother in apron, grandfather, grandmother}. Now change the summation over sample points into a summation over leaf nodes (this step may take some time to digest; feel free to skip it at first if it is unclear):
    \[Obj^{(t)}= \sum_{j=1}^{T}[(\sum_{i\in I_j}g_i)w_j+\frac12(\sum_{i\in I_j}h_i+\lambda)w_j^2]+\gamma T\]

  • Written in this form, the objective contains \(T\) mutually independent single-variable quadratic functions, so we can define
    \[G_j = \sum_{i \in I_j}g_i, H_j = \sum_{i \in I_j} h_i\]

  • (analogous to the gradient and the Hessian), and the final formula simplifies to
    \[Obj^{(t)} = \sum_{j=1}^{T} [(\sum_{i\in I_j}g_i)w_j+\frac12(\sum_{i\in I_j}h_i+\lambda)w_j^2]+\gamma T\]

\[Obj^{(t)} = \sum_{j=1}^{T} [G_jw_j+\frac12(H_j+\lambda)w_j^2]+\gamma T\]

Solving for the optimum (extremum)

  • To find the optimum (extremum) of the objective function, i.e. to minimize it, we differentiate the objective with respect to \(w_j\) (a zero derivative is a necessary condition for an extremum; if the objective is convex it is also a sufficient condition, and comparing with the boundary cases determines the optimum), which gives:

\[w_j^* = -\frac{G_j}{H_j+\lambda}\]

  • Substituting the optimal \(w_j\) back into the objective gives:

\[Obj = -\frac12 \sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda}+\gamma T\]
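The two formulas above are easy to turn into code. The sketch below (my own illustration, not library code) computes \(G_j\), \(H_j\), the optimal leaf weights \(w_j^*\), and the structure score \(Obj\) once a tree structure, i.e. a leaf assignment for every sample, is fixed:

```python
import numpy as np

def structure_score(g, h, leaf_idx, lam=1.0, gamma=0.0):
    """Given per-sample gradients g, hessians h and the leaf index of each
    sample, return the optimal leaf weights w_j* = -G_j/(H_j+lambda) and
    Obj = -1/2 * sum_j G_j^2/(H_j+lambda) + gamma*T."""
    T = leaf_idx.max() + 1
    G = np.bincount(leaf_idx, weights=g, minlength=T)   # G_j = sum of g_i in leaf j
    H = np.bincount(leaf_idx, weights=h, minlength=T)   # H_j = sum of h_i in leaf j
    w = -G / (H + lam)
    obj = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * T
    return w, obj

# toy usage: six samples assigned to three leaves
g = np.array([1.0, -2.0, 0.5, 0.5, -1.0, 1.5])
h = np.ones(6)
leaf_idx = np.array([0, 0, 1, 1, 2, 2])
print(structure_score(g, h, leaf_idx, lam=1.0, gamma=1.0))
```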

Shrinkage (learning rate) and column subsampling

  • Besides the regularization term, XGBoost also uses shrinkage and column subsampling to prevent overfitting. Shrinkage scales each newly added tree by a weight factor \(\eta\) at every boosting step, much like the learning rate in stochastic optimization; it reduces the influence of each individual tree and leaves room for the trees still to be added (which improves the model). Feature (column) subsampling, as used in random forests, avoids the heavy cost of enumerating every feature to compute all possible gains and finds an approximately optimal split point; column subsampling also prevents overfitting and speeds up parallelization.

Example of computing the scoring function

  • \(Obj\) represents, for a given tree structure, the largest reduction we can achieve in the objective; we can call it the structure score (Structure Score).
  • The figure below illustrates how the structure score is computed:

  • Once the structure score can be computed, we can construct a single tree of the XGBoost model. What remains is to explain how nodes are split to grow that single tree, which is done with a greedy method.

A greedy method for enumerating tree structures

The greedy method

  • At each step, we try to add a split to an existing leaf:

  • For each expansion we still need to enumerate all possible split points. How can this be done efficiently? Suppose we enumerate all conditions of the form \(x<a\); for a particular split \(a\), we need the sum of the derivatives to the left of \(a\) and the sum of the derivatives to the right of \(a\).

  • When building the tree structure, each attempt adds a split to an existing leaf node, and the split with the best gain is chosen. For a concrete split, the gain can be computed by the following formulas:

\[Gain = Obj(I) - [Obj(I_L)+Obj(I_R)]\]

\[Gain = \frac12[\frac{(\sum_{i\in I_L}g_i)^2}{\sum_{i \in I_L} h_i+\lambda}+\frac{(\sum_{i\in I_R}g_i)^2}{\sum_{i \in I_R} h_i+\lambda} - \frac{(\sum_{i\in I}g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\gamma\]

\[Gain = \frac12[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda} - \frac{G^2}{H+\lambda}]-\gamma\]

  • Here \(I\) is the set of all samples at the node before the split, \(I_L\) is the sample set of the left child and \(I_R\) that of the right child, so \(I = I_L \cup I_R\).

  • Notice that for all candidate values of \(a\), a single left-to-right scan over the sorted feature values is enough to accumulate the gradient sums \(G_L\) and \(G_R\) for every split; each candidate split is then scored with the formula above (a code sketch of this scan is given after this list).

  • Looking at this objective, a second thing worth noting is that introducing a split does not necessarily improve it, because there is a penalty for adding a new leaf. Optimizing this objective therefore corresponds to pruning the tree: when the gain brought by a split is smaller than a threshold, the split can be pruned away. Once the objective is derived formally, strategies such as score computation and pruning emerge naturally, rather than as ad-hoc heuristics.
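Below is a sketch of the left-to-right scan for a single numeric feature, using the gain formula above. It is only an illustration under the stated assumptions (pre-sorted dense values, no missing-value handling), not XGBoost's exact split-finding code:

```python
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan the sorted values of one feature from left to right, accumulating
    G_L and H_L, and return the best (gain, threshold) according to
    Gain = 1/2*[G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - G^2/(H+lam)] - gamma."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        GL += g[i]
        HL += h[i]
        if x[i] == x[i + 1]:           # cannot split between equal feature values
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2
    return best_gain, best_thr
```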

End-to-end model evaluation

The XGBoost implementation

  • XGBoost provides weighted classification and ranking objective functions, supports Python, R and Julia, and integrates with local data pipelines such as sklearn. For distributed systems, XGBoost also supports Hadoop, MPI, Flink and Spark.
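For reference, a minimal training sketch with the xgboost Python package; the dataset here is a synthetic stand-in, and the hyperparameter values are only illustrative:

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "eta": 0.1,                # shrinkage / learning rate
    "max_depth": 4,
    "lambda": 1.0,             # L2 regularization on leaf weights
    "gamma": 0.0,              # minimum loss reduction required to make a split
    "colsample_bytree": 0.8,   # column subsampling
}
bst = xgb.train(params, dtrain, num_boost_round=100)
pred = bst.predict(dtrain)     # predicted probabilities on the training data
```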

Datasets

Classification

Ranking

Out-of-core

Distributed


