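Machine Learning Algorithm Series (1): SVM

Before deep learning took off, SVM was one of the most popular machine learning algorithms, and it remains a common entry point for machine learning enthusiasts.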

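I. SVM

1.1 Introduction

A Support Vector Machine (SVM) is a machine learning algorithm for classification.

Given a training set $S = \{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$, Figure 1 shows the problem an SVM has to solve. We write the hyperplane as $w \cdot x - b = 0$, where $w$ is the normal vector of the hyperplane. The goal is to find the maximum-margin hyperplane that separates the points with $y_i = 1$ from the points with $y_i = -1$. This means $y_i (w \cdot x_i - b) \ge 1$ for all $1 \le i \le m$.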

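So the optimization problem can be written as:
maximize:
$\frac{2}{\|w\|}$

which is equivalent to minimizing:
$\frac{1}{2}\|w\|^2$

subject to $y_i (w \cdot x_i - b) \ge 1$ for all $1 \le i \le m$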

Figure 1: the problem an SVM needs to solve

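In fact, this can be viewed as a loss minimization problem with a penalty term. Ultimately, we want to find the minimizer of the following problem:

$\min_{w} \ \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{(x,y)\in S} \ell(w;(x,y))$

where $\lambda$ is the regularization parameter and $\ell(w;(x,y))$ is the hinge loss function (written here without the bias term $b$):

$\ell(w;(x,y)) = \max\{0,\ 1 - y\langle w, x\rangle\}$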
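To make the hinge loss concrete, here is a minimal Scala sketch. The object name `HingeLossSketch` and the helpers `dot` and `hinge` are hypothetical names chosen for illustration, and feature vectors are plain `Array[Double]` values rather than MLlib vectors.

```scala
object HingeLossSketch {
  // Inner product <w, x> of two equally sized vectors.
  def dot(w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length, "dimension mismatch")
    w.indices.map(i => w(i) * x(i)).sum
  }

  // Hinge loss max(0, 1 - y * <w, x>) for a label y in {+1, -1}.
  def hinge(w: Array[Double], x: Array[Double], y: Double): Double =
    math.max(0.0, 1.0 - y * dot(w, x))
}
```

With w = (1, -1), the point x = (2, 0) with y = +1 has margin 2 and loss 0; x = (0.5, 0) with y = +1 lies inside the margin and has loss 0.5; x = (1, 0) with y = -1 is misclassified and has loss 2.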

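We can solve this optimization problem with gradient descent.

The objective function is:

$f(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{(x,y)\in S} \max\{0,\ 1 - y\langle w, x\rangle\}$

So the (sub)gradient at iteration $t$ is:

$\nabla_t = \lambda w_t - \frac{1}{m}\sum_{(x,y)\in S} \mathbb{1}[\,y\langle w_t, x\rangle < 1\,]\; y\,x$

We can then update $w$, where $\eta_t$ is the step size:

$w_{t+1} \leftarrow w_t - \eta_t \nabla_t$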
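The full-batch update can be written out on plain arrays. This is a minimal sketch, assuming the objective is (λ/2)‖w‖² plus the mean hinge loss with no bias term; `BatchSubgradientSketch`, `subgradient`, and `step` are hypothetical names, not part of any library.

```scala
object BatchSubgradientSketch {
  type Example = (Array[Double], Double) // (features x, label y in {+1, -1})

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Subgradient of lambda/2 * ||w||^2 + (1/m) * sum of hinge losses.
  def subgradient(w: Array[Double], data: Seq[Example], lambda: Double): Array[Double] = {
    val grad = w.map(_ * lambda) // regularizer term: lambda * w
    val m = data.size.toDouble
    // Only examples with y * <w, x> < 1 contribute -y * x / m.
    for ((x, y) <- data if y * dot(w, x) < 1.0; i <- grad.indices)
      grad(i) -= y * x(i) / m
    grad
  }

  // One descent step: wNext = w - eta * subgradient.
  def step(w: Array[Double], data: Seq[Example], lambda: Double, eta: Double): Array[Double] = {
    val g = subgradient(w, data, lambda)
    w.indices.map(i => w(i) - eta * g(i)).toArray
  }
}
```

Starting from w = (0, 0) on the two points ((1, 0), +1) and ((-1, 0), -1), both examples violate the margin, the subgradient is (-1, 0), and one step with eta = 0.5 gives w = (0.5, 0).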

1.2 SGD

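As the previous section shows, every iteration needs all the data points to compute the gradient, which becomes expensive as the dataset grows. This is why large-scale gradient descent typically uses SGD (stochastic gradient descent): each iteration uses only a subset of the data rather than all of it, which reduces the computation per iteration.

So the objective function now becomes:

$f(w; A_t) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{k}\sum_{(x,y)\in A_t} \max\{0,\ 1 - y\langle w, x\rangle\}$

where $A_t \subset S$ and $|A_t| = k$. At each iteration, we take a subset of the data points.

The gradient is then:

$\nabla_t = \lambda w_t - \frac{1}{k}\sum_{(x,y)\in A_t} \mathbb{1}[\,y\langle w_t, x\rangle < 1\,]\; y\,x$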
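A mini-batch version only changes where the examples come from. The sketch below assumes sampling without replacement via `scala.util.Random`, which is one reasonable choice rather than the scheme any particular library uses; `MiniBatchSketch`, `sampleBatch`, and `miniBatchSubgradient` are hypothetical names.

```scala
import scala.util.Random

object MiniBatchSketch {
  type Example = (Array[Double], Double) // (features x, label y in {+1, -1})

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Draw a mini-batch of k examples without replacement.
  def sampleBatch(data: Seq[Example], k: Int, rng: Random): Seq[Example] =
    rng.shuffle(data).take(k)

  // Subgradient estimated on the mini-batch only:
  // lambda * w - (1/k) * sum of y * x over margin violators in the batch.
  def miniBatchSubgradient(w: Array[Double], batch: Seq[Example], lambda: Double): Array[Double] = {
    val grad = w.map(_ * lambda)
    val k = batch.size.toDouble
    for ((x, y) <- batch if y * dot(w, x) < 1.0; i <- grad.indices)
      grad(i) -= y * x(i) / k
    grad
  }
}
```

The cost of one gradient evaluation now scales with the batch size k instead of the full dataset size, at the price of a noisier estimate.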

1.3 Pegasos and MLlib implementation

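Pegasos is an implementation of SVM training based on gradient descent. Spark MLlib also provides a gradient descent implementation of SVM, which differs slightly from Pegasos, mainly in the step size used for the update.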

$w_{t+1} \leftarrow w_t - \eta_t \nabla_t$

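In the Pegasos algorithm, the step size is:

$\eta_t = \frac{1}{\lambda t}$

while in MLlib it is:

$\eta_t = \frac{\alpha}{\sqrt{t}}$

where $\alpha$ is the step size parameter (stepSize in the MLlib code).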
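To compare how the two schedules decay, here is a small sketch. It assumes the Pegasos schedule 1/(λt) from the Pegasos paper and the MLlib schedule stepSize/√iter that appears in SquaredL2Updater; `StepSizeSketch` and its method names are hypothetical.

```scala
object StepSizeSketch {
  // Pegasos schedule: eta_t = 1 / (lambda * t), for t = 1, 2, ...
  def pegasos(lambda: Double, t: Int): Double = 1.0 / (lambda * t)

  // MLlib schedule (as in SquaredL2Updater): eta_t = alpha / sqrt(t).
  def mllib(alpha: Double, t: Int): Double = alpha / math.sqrt(t)
}
```

For example, with lambda = 0.01 and alpha = 1.0, the Pegasos step starts at 100 and drops to 1 at t = 100, while the MLlib step starts at 1 and drops to 0.1. The Pegasos step depends on the regularization strength, whereas the MLlib step is controlled by a separate parameter.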

II. SGD in Spark

2.1 treeAggregate

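The main advantage of computing SGD with Spark is that the gradients can be computed in a distributed fashion and then summed. In Spark, this is done with the RDD method treeAggregate. Aggregate can be viewed as a generalized combination of Map and Reduce. treeAggregate is defined as: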

RDD.treeAggregate(zeroValue: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U,
      depth: Int = 2): U

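The second parameter list takes three arguments, of which the first two matter most here:

seqOp: computes the subgradient within each partition.
combOp: merges the values produced by seqOp or by a previous level of combOp.
depth: controls the depth of the aggregation tree.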
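The roles of seqOp and combOp can be illustrated without a cluster. The sketch below is a local stand-in written for this article, not Spark's implementation: each inner `Seq` plays the role of one partition, partial results are merged pairwise in a binary tree, and the `depth` parameter is left out. `TreeAggregateSketch` and `treeAggregateLocal` are hypothetical names.

```scala
object TreeAggregateSketch {
  // Local imitation of the treeAggregate pattern on nested collections.
  // zeroValue is by-name so every partition starts from a fresh accumulator.
  def treeAggregateLocal[T, U](partitions: Seq[Seq[T]], zeroValue: => U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U): U = {
    // Step 1: fold every partition independently with seqOp.
    var level: Seq[U] = partitions.map(_.foldLeft(zeroValue)(seqOp))
    // Step 2: merge partial results pairwise, level by level, with combOp.
    while (level.size > 1)
      level = level.grouped(2).map(_.reduce(combOp)).toVector
    level.headOption.getOrElse(zeroValue)
  }
}
```

A (sum, count) accumulator over three partitions mirrors the (gradient, loss, count) accumulator used by gradient descent: calling `treeAggregateLocal(Seq(Seq(1.0, 2.0), Seq(3.0), Seq(4.0, 5.0)), (0.0, 0L))((acc, v) => (acc._1 + v, acc._2 + 1), (a, b) => (a._1 + b._1, a._2 + b._2))` yields a sum of 15.0 and a count of 5.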

tree aggregate

2.2 Implementation

SGD is an optimization algorithm, and many machine learning algorithms can be solved with it, so Spark provides an abstraction for it.

class SVMWithSGD private (
    private var stepSize: Double,
    private var numIterations: Int,
    private var regParam: Double,
    private var miniBatchFraction: Double)
  extends GeneralizedLinearAlgorithm[SVMModel] with Serializable {

  private val gradient = new HingeGradient()
  private val updater = new SquaredL2Updater()
  @Since("0.8.0")
  override val optimizer = new GradientDescent(gradient, updater)
    .setStepSize(stepSize)
    .setNumIterations(numIterations)
    .setRegParam(regParam)
    .setMiniBatchFraction(miniBatchFraction)

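As shown, SVMWithSGD extends GeneralizedLinearAlgorithm and defines optimizer to determine how the solution is found; optimizer is the SGD implementation. As described in the previous section, a linear SVM is a linear model with the hinge loss and an L2 penalty term, which is why HingeGradient and SquaredL2Updater are passed to GradientDescent.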

class HingeGradient extends Gradient {
  override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
    val dotProduct = dot(data, weights)
    // Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x)))
    // Therefore the gradient is -(2y - 1)*x
    val labelScaled = 2 * label - 1.0
    if (1.0 > labelScaled * dotProduct) {
      val gradient = data.copy
      scal(-labelScaled, gradient)
      (gradient, 1.0 - labelScaled * dotProduct)
    } else {
      (Vectors.sparse(weights.size, Array.empty, Array.empty), 0.0)
    }
  }

  override def compute(
      data: Vector,
      label: Double,
      weights: Vector,
      cumGradient: Vector): Double = {
    val dotProduct = dot(data, weights)
    // Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x)))
    // Therefore the gradient is -(2y - 1)*x
    val labelScaled = 2 * label - 1.0
    if (1.0 > labelScaled * dotProduct) {
      axpy(-labelScaled, data, cumGradient)
      1.0 - labelScaled * dotProduct
    } else {
      0.0
    }
  }
}

/**
 * :: DeveloperApi ::
 * Updater for L2 regularized problems.
 *          R(w) = 1/2 ||w||^2
 * Uses a step-size decreasing with the square root of the number of iterations.
 */
@DeveloperApi
class SquaredL2Updater extends Updater {
  override def compute(
      weightsOld: Vector,
      gradient: Vector,
      stepSize: Double,
      iter: Int,
      regParam: Double): (Vector, Double) = {
    // add up both updates from the gradient of the loss (= step) as well as
    // the gradient of the regularizer (= regParam * weightsOld)
    // w' = w - thisIterStepSize * (gradient + regParam * w)
    // w' = (1 - thisIterStepSize * regParam) * w - thisIterStepSize * gradient
    val thisIterStepSize = stepSize / math.sqrt(iter)
    val brzWeights: BV[Double] = weightsOld.asBreeze.toDenseVector
    brzWeights :*= (1.0 - thisIterStepSize * regParam)
    brzAxpy(-thisIterStepSize, gradient.asBreeze, brzWeights)
    val norm = brzNorm(brzWeights, 2.0)

    (Vectors.fromBreeze(brzWeights), 0.5 * regParam * norm * norm)
  }
}
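The arithmetic of SquaredL2Updater can be checked by hand on plain arrays. This is a minimal sketch that mirrors the formulas in the comments above without Breeze or MLlib types; `L2UpdateSketch` and `update` are hypothetical names.

```scala
object L2UpdateSketch {
  // Same arithmetic as the updater above, on plain arrays:
  // w' = (1 - eta_t * regParam) * w - eta_t * gradient, with eta_t = stepSize / sqrt(iter).
  def update(
      weights: Array[Double],
      gradient: Array[Double],
      stepSize: Double,
      iter: Int,
      regParam: Double): (Array[Double], Double) = {
    val eta = stepSize / math.sqrt(iter)
    val updated = weights.indices.map { i =>
      (1.0 - eta * regParam) * weights(i) - eta * gradient(i)
    }.toArray
    val squaredNorm = updated.map(v => v * v).sum
    // Second element: regularization value 0.5 * regParam * ||w'||^2.
    (updated, 0.5 * regParam * squaredNorm)
  }
}
```

With weights (1.0, 2.0), gradient (0.5, -0.5), stepSize 1.0, iter 4 and regParam 0.5, the step size is 0.5, the shrink factor is 0.75, the new weights are (0.5, 1.75), and the regularization value is 0.828125.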

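The GradientDescent code snippet below shows the main execution logic of GradientDescent. The loop runs at most numIterations times, stopping early on convergence, to obtain the final w.

First, data.sample draws a subset of the samples according to miniBatchFraction. treeAggregate then runs over that subset. In seqOp, gradient.compute adds each sample's contribution to the running gradient sum c._1; for HingeGradient this is axpy(-labelScaled, data, cumGradient), applied only when y⟨w,x⟩ < 1, i.e. when the sample is misclassified or falls inside the margin. In combOp, the partial gradient sums are merged with c1._1 += c2._1. Once gradientSum is available, updater.compute derives the step size for the iteration and applies the averaged gradient to the weights, which for SquaredL2Updater is brzAxpy(-thisIterStepSize, gradient.asBreeze, brzWeights).

GradientDescent code snippet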

while (!converged && i <= numIterations) {
  val bcWeights = data.context.broadcast(weights)
  // Sample a subset (fraction miniBatchFraction) of the total data
  // compute and sum up the subgradients on this subset (this is one map-reduce)
  val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
      seqOp = (c, v) => {
        // c: (grad, loss, count), v: (label, features)
        val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
        (c._1, c._2 + l, c._3 + 1)
      },
      combOp = (c1, c2) => {
        // c: (grad, loss, count)
        (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
      })

  if (miniBatchSize > 0) {
    /**
     * lossSum is computed using the weights from the previous iteration
     * and regVal is the regularization value computed in the previous iteration as well.
     */
    stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
    val update = updater.compute(
      weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
      stepSize, i, regParam)
    weights = update._1
    regVal = update._2

    previousWeights = currentWeights
    currentWeights = Some(weights)
    if (previousWeights != None && currentWeights != None) {
      converged = isConverged(previousWeights.get,
        currentWeights.get, convergenceTol)
    }
  } else {
    logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero")
  }
  i += 1