logistic regression(1)

憋了好久，终于找到些不一样的东西分享出来，是关于逻辑回归的实现。其实有不少革命先驱已经分享了各种不同语言来实现逻辑回归，那么我的独特啥地方呢？作为 Rstudio 的忠实用户和粉丝，基于 Rcpp 包来完成的，算法实现用 C++ , 算法调用交还给 R , 其实也就是简单模拟一下算法开发。

具体逻辑回归算法的推导，就不罗嗦了，网络上有大把大把详尽的资源。这里要引入两个作为结果的公式，它们是算法的核心，算法实现围绕这两个公式：

$J (θ) = - 1 m \sum i = 1 n (y i l o g h θ (x i) + (1 - y i) l o g (1 - h θ (x i)))$ $\begin{align} J(\theta) = -\frac{1} {m}\sum_{i=1}^n (y_i logh_\theta(x_i) + (1 - y_i)log(1 - h_\theta(x_i))) \end{align}$
$θ j : = θ j - α 1 m \sum i = 1 m (h θ (x (i)) - y i) x (i) j$ $\begin{align} \theta_j:=\theta_j - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y_i)x_{j}^{(i)} \end{align}$

一个是逻辑回归的损失函数，一个是批量梯度下降参数的迭代公式，整个算法核心部分就是依靠这两个公式来，这里特殊说明字母 i 代表数据集的行，字母 j 代表数据集的列，清楚这些，实现代码时候就不会搞混了。
Rstudio 中可创建cpp 文件，不多说，直接上代码：

#include <Rcpp.h>
#include <math.h>

using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
double sigmoid(double z)
{
  return exp(z) / (1+exp(z));
}

// [[Rcpp::export]]
NumericVector Logitis(NumericMatrix x, const int &nrow, const int &ncol, 
                      NumericVector y, double learning_rate, int max_iter, double min_error) {
  // i nrow  
  // j ncol 
  // x = i X j  
  int iter = 0;
  double final_loss = 10;
  // inital weight
  NumericVector weight(ncol); 
  for (int j = 0; j < ncol; ++j) {
    weight[j] = 1; 
  }
  // set iter times and control error
  while (iter < max_iter && final_loss > min_error){
    // Batch gradient descent
    NumericVector sum(ncol);
    // calculate sum
    for (int k = 0;k < ncol; ++k){
       for (int i = 0;i < nrow; ++i){
         double h = 0;
           for (int j = 0;j < ncol;j++){
              h += weight[j] * x(i, j);
           }
         sum[k] += (y[i] - sigmoid(h)) * x(i, k);
        }
    }
    // update weight
    for (int k = 0; k < ncol; ++k){
       weight[k] += learning_rate * ((double) 1 / nrow) * sum[k];
    }
    }
    // Square error
    double loss = 0;
    final_loss = 0;
    for (int i = 0;i < nrow;++i){
      double h = 0;
      for (int j = 0;j < ncol;j++){
        h += weight[j] * x(i, j);
      }
      loss += (y[i] * log10(sigmoid(h)) + (1 - y[i]) * log10(1 - sigmoid(h)));
    }
    final_loss = - ((double) 1 / nrow) * loss;
    cout << "error: "<< final_loss << endl;
    iter++;
  }
  return weight;
}
// [[Rcpp::export]]
LogicalVector Logitis_predict(NumericMatrix x, NumericVector weight, 
                              const int &nrow, const int &ncol) {
  // call Logitis and train model
  LogicalVector Y(nrow);
  for (int i = 0; i < nrow; i++){
    double h = 0;
    for (int j = 0; j < ncol; j++){
      h += weight[j] * x(i, j);
    }
    if (sigmoid(h) >= 0.5) {
      Y[i] = 1;
    } else {
      Y[i] = 0;
    }
  }
  return Y;
}

这里sigmoid是逻辑回归的经典S曲线函数， Logitis来实现批量梯度下降训练出模型的权重， Logitis_predict使用模型权重来做预测，这里Logitis以最大迭代数和损失函数来控制迭代次数来保证模型最后收敛。还有这里NumericMatrix，NumericVector，LogicalVector这是 Rcpp 特有的类类型用来与R无缝对接，下面来看看它们如何无缝对接。

library(Rcpp)
# load Cpp file
sourceCpp(file = "logitis.cpp")
# DataSet
DataSet <- iris[1:100, ]
DataSet[, 5] <- ifelse(DataSet$Species == "setosa", 1, 0)
index <-  sample(2, nrow(DataSet), replace = TRUE, prob = c(0.7, 0.3))
# train data
trainData <- DataSet[index == 1, ]
train_X <- as.matrix(trainData[, -5])
train_Y <- trainData[, 5]
# test data
testData <- DataSet[index == 2, ]
test_X <- as.matrix(testData[, -5])
test_Y <- testData[, 5]
for (i in c(0, 10, 15, 20)){
  # weight
  weight <- Logitis(train_X, nrow = nrow(train_X), ncol = ncol(train_X), train_Y, 0.1, i, 0.1)
  # predict 
  pred_y <- Logitis_predict(test_X, weight, nrow = nrow(test_X), ncol = ncol(test_X))
  # Evaluation
  evaluation_result <- pred_y == test_Y
  cat("Accuracy: ", length(evaluation_result[evaluation_result == TRUE]) / length(test_Y), "\n")
}

这里其实 R 作为主程序启动， C++ 代码保存在 logitis.cpp 文件中，通过Rcpp包下的函数sourceCpp来加载cpp文件中的定义的函数到全局环境下让 R 来调用，因为算法本身是针对二分类问题，所以截取了iris 数据集的前两类来测试算法的效果。

Accuracy:  0.5454545 
error: 2.49176
error: 1.75373
error: 1.03305
error: 0.472978
error: 0.322857
error: 0.298091
error: 0.277371
error: 0.258426
error: 0.24134
error: 0.225961
Accuracy:  0.6969697 
error: 2.49176
error: 1.75373
error: 1.03305
error: 0.472978
error: 0.322857
error: 0.298091
error: 0.277371
error: 0.258426
error: 0.24134
error: 0.225961
error: 0.212105
error: 0.199599
error: 0.188288
error: 0.178033
error: 0.168713
Accuracy:  0.969697 
error: 2.49176
error: 1.75373
error: 1.03305
error: 0.472978
error: 0.322857
error: 0.298091
error: 0.277371
error: 0.258426
error: 0.24134
error: 0.225961
error: 0.212105
error: 0.199599
error: 0.188288
error: 0.178033
error: 0.168713
error: 0.16022
error: 0.152461
error: 0.145354
error: 0.138829
error: 0.132821
Accuracy:  1 
error: 2.49176
error: 1.75373
error: 1.03305
error: 0.472978
error: 0.322857
error: 0.298091
error: 0.277371
error: 0.258426
error: 0.24134
error: 0.225961
error: 0.212105
error: 0.199599
error: 0.188288
error: 0.178033
error: 0.168713
error: 0.16022
error: 0.152461
error: 0.145354
error: 0.138829
error: 0.132821
error: 0.127278
error: 0.122151
error: 0.117397
error: 0.112981
error: 0.108869
error: 0.105033
error: 0.101447
error: 0.0980892
Accuracy:  1

当迭代次数为零时，基本是瞎蒙的预测，但是随着迭代次数的增加模型误差率在逐步减小，准确率也在提升，最终达到了100%。
因为数据量本身不大，加上数据比较特殊，所以有这样的结果，使用 R 的 glm 函数来训练一个模型对比一下效果。

gmodel <- glm(Species~., data=trainData, family = binomial(link = "logit"))
temp <- predict(gmodel, as.data.frame(test_X))
rpred_y <- (exp(temp) / (1 + temp)) >= 0.5
table(rpred, test_Y)
#            test_Y
# rpred         0  1
#        FALSE 15  0
#        TRUE   0 18

观察混淆矩阵，发现准确率也是100%，所以并不是我模型训练的好，而是数据集本身就比较特殊，可分划分性强，不过也间接说明了我实现的模型和原生态的效果一下，(沾沾自喜一下)，回头再通过比较大的数据量来对比测试一下它们的性能差距好了。

猜你喜欢