Spark calculation tools

Vector

vectors.txt

1 2.3 4.5
3 3.1 5.6
4 3.2 7.8

Read the vectors.txt file and convert RDD[String] -> RDD[Vector]:

package com.yasuofenglei

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Driver01 {
  def main(args: Array[String]): Unit = {
    //Create a dense vector. The elements are Double; Int arguments are converted automatically.
    val v1=Vectors.dense(1,2,3.1,4,5)
    //A dense vector can also be built from an Array[Double].
    val v2=Vectors.dense(Array[Double](1,2,3,4))

    val conf=new SparkConf()
    conf.setMaster("local").setAppName("vectors")
    val sc=new SparkContext(conf)
    val data=sc.textFile("d://data/ml/vectors.txt",2)
    //RDD[String] -> RDD[Array[Double]] -> RDD[Vector]
    val r1=data.map{_.split(" ").map{num => num.toDouble}}
      .map{arr => Vectors.dense(arr)}
    r1.foreach{println}
    
  }
}
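
Besides dense vectors, MLlib also provides sparse vectors through Vectors.sparse, which stores only the non-zero positions and values. A minimal sketch (the object name here is just for illustration):

import org.apache.spark.mllib.linalg.Vectors

object SparseVectorExample {
  def main(args: Array[String]): Unit = {
    //A 5-dimensional sparse vector with non-zero values only at indices 0 and 3,
    //equivalent to the dense vector (9.0, 0.0, 0.0, 7.0, 0.0)
    val sv = Vectors.sparse(5, Array(0, 3), Array(9.0, 7.0))
    println(sv)                        //prints the size, indices and values
    println(sv.toArray.mkString(","))  //9.0,0.0,0.0,7.0,0.0
  }
}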

Labeled vector (LabeledPoint)

package com.yasuofenglei

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Driver02 {
  def main(args: Array[String]): Unit = {
    val v1=Vectors.dense(2.2,3.1,4.5)
    //Create a labeled point; the first argument (1) is the label value
    val lb1 = LabeledPoint(1, v1)
    println(lb1)
    println(lb1.features)//the feature vector inside the labeled point
    println(lb1.label)//the label value

    val conf=new SparkConf()
    conf.setMaster("local").setAppName("labeledpoint")
    val sc=new SparkContext(conf)
    val data=sc.textFile("d://data/ml/vectors.txt",2)
    
    
    val r1=data.map{line=>
      val info=line.split(" ")
      val label=info.head.toDouble
      val features=info.drop(1).map{
        num => num.toDouble
      }
      LabeledPoint(label,Vectors.dense(features))
    }
    
    r1.foreach{println}
  }
}

Spark statistics utilities (Statistics.colStats)

package com.yasuofenglei

import org.apache.spark._
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics

/**
 * Spark statistics utility: Statistics.colStats computes column summary statistics.
 */
object Driver03 {
  def main(args: Array[String]): Unit = {
    val conf=new SparkConf().setMaster("local").setAppName("statistic")
    val sc=new SparkContext(conf)
    val r1=sc.makeRDD(List(1,2,3,4,5))
    //RDD[Int]->RDD[Vector]
    val r2=r1.map{ num => Vectors.dense(num)}
    val result=Statistics.colStats(r2)
    
    println(result.max)
    println(result.min)
    println(result.mean)//column mean
    println(result.variance)//column variance
    println(result.count)
    println(result.numNonzeros)//number of non-zero elements per column
    println(result.normL1)//L1 norm per column (sum of absolute values, the Manhattan length)
    println(result.normL2)//L2 norm per column (the Euclidean length)
    
  }
}

Euclidean distance

Euclidean distance (Euclidean Distance) is, in two- and three-dimensional space, the ordinary straight-line distance between two points.

The Euclidean distance between two points a(x1,y1) and b(x2,y2) on a two-dimensional plane:

d(a,b) = sqrt((x1-x2)^2 + (y1-y2)^2)

The Euclidean distance between two points a(x1,y1,z1) and b(x2,y2,z2) in three-dimensional space:

d(a,b) = sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)

The Euclidean distance between two n-dimensional vectors a(x11,x12,...,x1n) and b(x21,x22,...,x2n):

d(a,b) = sqrt(sum over k of (x1k-x2k)^2)

It can also be expressed as a vector operation:

d(a,b) = sqrt((a-b) · (a-b))

R calculates Euclidean distance

The function that calculates distances in R is dist(). The most direct usage is dist(x, method="euclidean"), which computes the Euclidean distance. Other possible values of method are "maximum", "manhattan", "canberra", "binary" and "minkowski"; adjust this parameter to get the distance you need.

Case 1: R calculates the Euclidean distance

a = c(1,1,3)
b = c(4,5,1)
distance=sqrt(sum((a-b)^2));distance
[1] 5.385165

or:

dist(rbind(a,b), method="euclidean")
a
b 5.385165

That is, the distance between sample a and sample b is: 5.385165
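
For comparison, the same distance can be computed with Spark MLlib, whose Vectors object provides sqdist (the squared Euclidean distance); a minimal sketch:

import org.apache.spark.mllib.linalg.Vectors

val a = Vectors.dense(1, 1, 3)
val b = Vectors.dense(4, 5, 1)
//sqdist returns the squared Euclidean distance, so take the square root
val d = Math.sqrt(Vectors.sqdist(a, b))
println(d) //about 5.385165, matching the R result above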

Case 2: R calculates the Euclidean distance between multiple samples

a=matrix(rnorm(15,0,1),nrow=3,ncol=5)
a
[,1] [,2] [,3] [,4] [,5]
[1,] 1.8094360 -0.7377281 -2.2285413 0.6091852 -0.1709287
[2,] -0.7308505 -0.3415692 -0.7755661 1.4363829 -0.5686896
[3,] 1.2290613 1.7541220 -0.8617373 0.3623487 -1.1996104
dist(a,method="euclidean")

result:

         1        2
2 3.092508
3 3.087624 3.129251

The distance between the first row of samples and the second row of samples: 3.092508
The distance between the first row of samples and the third row of samples: 3.087624
The distance between the second row of samples and the third row of samples: 3.129251

Manhattan distance (Manhattan Distance)

Taxi distance, or Manhattan distance, is a term coined by Hermann Minkowski in the 19th century. It is a geometric term used in metric geometry, denoting the sum of the absolute differences of the coordinates of two points in a standard coordinate system.

(Figure omitted: the red line represents the Manhattan distance; the green line represents the Euclidean, straight-line, distance; the blue and yellow lines represent equivalent Manhattan distances.)

The Manhattan distance is the distance between two points in the north-south direction plus the distance in the east-west direction, that is,
d(i,j)=|xi-xj|+|yi-yj|.

For a town whose streets run on a regular grid due north-south and due east-west, the distance from one point to another is the distance travelled in the north-south direction plus the distance travelled in the east-west direction, which is why the Manhattan distance is also called the taxi distance. The Manhattan distance is not invariant under rotation: when the coordinate axes are rotated, the distance between points changes.

In early computer graphics the screen was made of pixels, whose coordinates are integers, because floating-point arithmetic was expensive, slow and subject to rounding error. Computing the Euclidean distance AB directly requires floating-point calculation, whereas using the two legs AC and CB (the Manhattan distance) only requires addition and subtraction. This greatly speeds up the calculation, and no matter how many times it is accumulated there is no rounding error.

For example, on a plane, the Manhattan distance between point i at coordinates (x1, y1) and point j at coordinates (x2, y2) is:
d(i,j)=|X1-X2|+|Y1-Y2|.

Case: R calculates the Manhattan distance between two points

a<-c(1,2)
b<-c(5,8)
dist(rbind(a,b),method="manhattan")
a
b 10

Result: the Manhattan distance between sample a and sample b is 10.
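
The same Manhattan distance in Scala, in the style of the comprehensive case further below (a minimal sketch):

val a = Array(1.0, 2.0)
val b = Array(5.0, 8.0)
//sum of the absolute coordinate differences
val manhattan = a.zip(b).map { case (x, y) => Math.abs(x - y) }.sum
println(manhattan) //10.0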

Chebyshev distance

In mathematics, the Chebyshev distance (Chebyshev Distance, or L∞ metric) is a metric on a vector space in which the distance between two points is defined as the maximum of the absolute differences of their coordinates.

In chess, the king can move to any of the 8 adjacent squares by one step. How many steps does the king need to walk from the grid (x1, y1) to the grid (x2, y2)? Try it yourself.
You will find that the minimum number of steps is always max( | x2-x1 |, | y2-y1 |) steps. This distance measurement method is called Chebyshev distance.
Case: R computes the Chebyshev distance of the king's move from f6 to c4
a<-c(3,4)
b<-c(6,6)
dist(rbind(a,b),method="maximum")
a
b 3

That is, the king needs at least 3 steps to reach c4.
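
The Chebyshev distance is just as easy to express in Scala (a minimal sketch): it is the largest absolute coordinate difference.

val a = Array(3.0, 4.0)
val b = Array(6.0, 6.0)
val chebyshev = a.zip(b).map { case (x, y) => Math.abs(x - y) }.max
println(chebyshev) //3.0
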
Comprehensive case

package com.yasuofenglei

object Driver04 {
  def main(args: Array[String]): Unit = {
    val a1=Array(1,2)
    val a2=Array(4,7)
    //Compute the Euclidean distance between a1 and a2.
    //zip pairs up the coordinates; Math.sqrt() takes the square root.
    val r1=a1.zip(a2).map(x => (x._1-x._2)*(x._1-x._2)).sum
    println(Math.sqrt(r1))
    
    //Compute the cosine of the angle between the two vectors: dot product divided by the product of their norms.
    val a1a2Numerator=a1.zip(a2).map{x => x._1*x._2}.sum
    val a1Norm=Math.sqrt(a1.map{x => x*x}.sum)
    val a2Norm=Math.sqrt(a2.map{x => x*x}.sum)
    
    val a1a2Cos=a1a2Numerator/(a1Norm*a2Norm)
    println(a1a2Cos)
    
  }
}
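
For these inputs the program should print roughly 5.831 (the Euclidean distance, the square root of 34) and roughly 0.998 (the cosine of the angle between the two vectors).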

Least squares method

Introduction

The least squares method (also known as the method of least squares) is a mathematical optimization technique. It finds the best-fitting function for the data by minimizing the sum of squared errors, making it easy to estimate unknown parameters so that the sum of squared errors between the fitted values and the actual data is as small as possible. The least squares method can also be used for curve fitting, and some other optimization problems can be expressed in least squares form by minimizing an energy or maximizing an entropy.

Background story

The least squares method (LSE) is a relatively old method that grew out of the application needs of astronomy and geodesy; in the early development of mathematical statistics these two sciences played a very important role, and the Danish statistician Anders Hald called them the "mothers of mathematical statistics". Since then it has been widely used in scientific experiments and engineering for nearly three hundred years. The American historian of statistics S. M. Stigler pointed out that the least squares method was the dominant theme of 19th-century mathematical statistics. By 1815 the method had become a standard tool of astronomy and geodesy in France, Italy and Prussia, and by 1825 it was in widespread use in Britain.

Giuseppe Piazzi

Going back to 1801, the Italian astronomer Giuseppe Piazzi discovered the first asteroid, Ceres. After tracking it for 40 days, Piazzi lost sight of Ceres as it moved behind the sun. Scientists all over the world then tried to relocate Ceres using Piazzi's observational data, but searches based on most of their calculated orbits came to nothing.

The 24-year-old Gauss also calculated the orbit of Ceres, and the German astronomer Heinrich Olbers rediscovered Ceres based on the orbit Gauss had computed. Gauss described his method in his 1809 book "Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections", in which he claimed to have been using the least squares method since 1799; this sparked a priority dispute with Legendre.

After studying the original documents, modern scholars believe the two may well have invented the method independently, but Legendre was the first to publish it in written form. Nevertheless, today's textbooks and monographs often attribute the invention to Gauss. Besides Gauss's greater fame, the main reason is the importance of his normal error theory to the method: Legendre explained the advantages of the least squares method in his work but gave no error analysis, and without an error theory we do not know what errors result from using the method. In 1823, under the assumption of independent, identically distributed errors e1, ..., en, Gauss proved an optimality property of least squares: among all linear unbiased estimators, the least squares estimator has the smallest variance.

The least squares method as it is used today was proposed by A. M. Legendre in 1805 in his book "New Methods for the Determination of the Orbits of Comets". Its main idea is to choose the unknown parameters so as to minimize the sum of squared differences between the theoretical values and the observed values:

minimize sum((observed value - theoretical value)^2)
Principle and derivation

Let's look at the simplest, linear case. For a data set (xi, yi) (i = 1, ..., n) we want to find a trend line (the dotted line in the omitted figure) that expresses the direction in which the points (xi, yi) are heading. Writing the line as y = kx + b, we solve for the k and b that make the error sum of squares (the residual sum of squares, RSS) smallest, which gives the corresponding optimal solution.
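
The derivation shown in the omitted figures follows the standard argument: set the partial derivatives of RSS(k, b) = sum((yi - k*xi - b)^2) with respect to k and b to zero, which gives k = sum((xi - meanX)*(yi - meanY)) / sum((xi - meanX)^2) and b = meanY - k*meanX. A minimal Scala sketch of this closed-form solution on made-up data:

object LeastSquaresSketch {
  def main(args: Array[String]): Unit = {
    val xs = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val ys = Array(2.1, 3.9, 6.2, 8.0, 9.8) //roughly y = 2x
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    //k = sum((xi - meanX)(yi - meanY)) / sum((xi - meanX)^2), b = meanY - k*meanX
    val k = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum /
            xs.map(x => (x - meanX) * (x - meanX)).sum
    val b = meanY - k * meanX
    println(s"k = $k, b = $b") //k is close to 2, b close to 0
  }
}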

Supplementary modelling concepts

  • Establish a target equation to fit the sample data. The target equation is not fixed: it can be a linear equation or a nonlinear equation.
  • Solve for the coefficients of the target equation. The goal is to find the optimal coefficients, so a cost function (loss function) for the target equation is needed.
  • The optimal solution is obtained by minimizing the loss function.
  • The loss function is not fixed either: a different target equation leads to a different loss function.

Forecast commodity demand case

lritem.txt

100|5 1000
75|7 600
80|6 1200
70|6 500
50|8 30
65|7 400
90|5 1300
100|4 1100
110|3 1300
60|9 300

For the commodity demand forecasting case we build a multiple linear regression model.

  • A regression model is used for prediction. Types of regression model: 1 least squares regression, 2 gradient descent regression, 3 ridge regression, 4 Lasso regression.
  • If there is only one independent variable the regression is simple (univariate); if there is more than one, it is multiple (multivariate).
  • The target equation can be a line, a plane or a hyperplane. The form of a linear equation is fixed: for example the line y=β1X1+β0, the plane Y=β1X1+β2X2+β0, the hyperplane Y=β1X1+β2X2+…+βnXn+β0.
  • The target equation can also be nonlinear. The form of a nonlinear equation is not fixed and it is harder to solve. A sketch of such a model for lritem.txt follows this list.
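
A minimal sketch of the modelling step. It assumes that in each line of lritem.txt the field before '|' is the demand (the label Y) and the two remaining fields are the features (for example price and advertising spend); this column interpretation, the file path and the training parameters are assumptions for illustration only.

package com.yasuofenglei

import org.apache.spark._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

object DriverLrItem {
  def main(args: Array[String]): Unit = {
    val conf=new SparkConf().setMaster("local").setAppName("lritem")
    val sc=new SparkContext(conf)
    //Assumed format per line: label|feature1 feature2, e.g. "100|5 1000"
    val data=sc.textFile("d://data/ml/lritem.txt")
    val parseData=data.map{line=>
      val parts=line.split("\\|")
      val label=parts(0).toDouble
      val features=parts(1).split(" ").map{_.toDouble}
      LabeledPoint(label,Vectors.dense(features))
    }
    //Train a multiple linear regression with SGD; iteration count and step size are placeholders that would need tuning
    val model=LinearRegressionWithSGD.train(parseData,100,0.0001)
    println(model.weights)
  }
}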

Gradient descent

The least squares method applies when the model equation has an analytical (closed-form) solution. If an equation has no analytical solution, least squares cannot be used; the true solution can then only be approximated by a numerical (iterative) solution. (The omitted figure showed an example of such an equation, whose coefficients cannot be written as closed-form expressions of the variables.) The gradient descent method is therefore more widely applicable than the least squares method.

The gradient descent method approaches the true solution through a numerical procedure (many iterations) and eventually converges to it. Many models are solved with gradient descent under the hood, for example the logistic regression model and the error back-propagation of BP neural networks. Gradient descent is therefore applied more widely than least squares, because least squares only works when the model equation has an analytical solution, while model equations in production environments are usually more complicated and often have none.
Geometrically, the gradient is the direction in which a function changes fastest.
If you follow the positive gradient direction, you find the maximum of the function fastest. -> Gradient ascent method
If you follow the negative gradient direction, you find the minimum of the function fastest. -> Gradient descent method
We usually use the gradient descent method because, by using it to find the minimum of the loss function (RSS), we obtain the coefficients corresponding to the minimum RSS.
The algorithm of the gradient descent method (a plain-Scala sketch of these steps follows the list):

  1. Randomly choose an initial position for the coefficients θ.
  2. Multiply the step size (chosen by the programmer) by the gradient of the loss function to get the descent distance, then update θ.
  3. Repeat step 2 until the loss function converges to its minimum (min RSS), which gives the optimal solution.
    There are two key elements in this process: 1 the step size and 2 the gradient of the loss function.
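A minimal, self-contained sketch of these steps for a one-variable line y = k*x + b (plain Scala, not the Spark API; the data and the step size are made up for illustration):

object GradientDescentSketch {
  def main(args: Array[String]): Unit = {
    val xs = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val ys = Array(2.1, 3.9, 6.2, 8.0, 9.8)
    var k = 0.0
    var b = 0.0
    val step = 0.01        //step size, chosen by the programmer
    val iterations = 1000  //number of iterations
    for (_ <- 1 to iterations) {
      //gradient of RSS = sum((k*x + b - y)^2) with respect to k and b
      val gradK = xs.zip(ys).map { case (x, y) => 2 * (k * x + b - y) * x }.sum
      val gradB = xs.zip(ys).map { case (x, y) => 2 * (k * x + b - y) }.sum
      k -= step * gradK    //move against the gradient
      b -= step * gradB
    }
    println(s"k = $k, b = $b") //ends up near the least squares solution
  }
}

The Spark program below does the same kind of fitting with MLlib's LinearRegressionWithSGD on the testSGD.txt data.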
package com.yasuofenglei.sgd

import org.apache.spark._
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
/* Sample data (testSGD.txt); format per line: Y,X1 X2
1,0 1
2,0 2
3,0 3
5,1 4
7,6 1
9,4 5
6,3 3
 */
object Driver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("sgd")
    val sc=new SparkContext(conf)
    
    val data=sc.textFile("d://data/ml/testSGD.txt")
    //Step 1: convert RDD[String] -> RDD[LabeledPoint]
    val parseData=data.map{line=>
      val info=line.split(",")
      val Y=info(0).toDouble
      val X1=info(1).split(" ")(0).toDouble
      val X2=info(1).split(" ")(1).toDouble
      LabeledPoint(Y,Vectors.dense(X1,X2))
    }
    //parseData.foreach{println}
    //Build the model. Parameters: number of iterations, step size.
    //Too few iterations may stop before convergence, leaving a large error; too many wastes CPU and makes the computation expensive.
    //Too small a step size means many iterations may still not converge; too large a step size makes the solution oscillate around the true one without converging.
    //In short: use a fairly large iteration count and a small step size (rule of thumb: 0.05~0.5).
    val model=LinearRegressionWithSGD.train(parseData,20,0.05)
    
    //Extract the model's coefficients for the independent variables
    val coef=model.weights
    //Use the model to predict
    //Predict on the training samples themselves; the input must be of type RDD[Vector], each vector holding (X1, X2)
    val predict=model.predict(parseData.map{x=> x.features})
    predict.foreach{println}
  }
}

There are three variants of the gradient descent method

  1. Batch gradient descent (BGD)
    Every coefficient update uses all the samples in the calculation.
    The advantage is that only a few iterations are needed to converge.
    The disadvantage is that with a large sample size a single iteration takes a long time.
    This method is generally not used in production environments (because of the large amount of data).
  2. Stochastic gradient descent (SGD)
    Every coefficient update randomly selects a single sample from all the samples.
    The advantage is that each update takes very little time;
    the disadvantage is that more iterations are needed to converge.
    This is what production environments currently use to solve for the coefficients.
  3. Mini-batch gradient descent (MBGD)
    A compromise between the two methods above: each update uses a small batch of samples.

Logistic regression model

Logistic regression models can handle a discrete response, so this model is applied to classification problems. The linear regression model studied earlier handles a continuous response and cannot deal with discrete data.
Basic concepts

  • Continuous data: within a given interval, it can take any real value.
  • Discrete data: within a given interval, it can take only a finite number of values.

Sigmoid function

The role of this function: it maps any real value into the interval (0, 1), so any continuous value can be discretized to 0 or 1 by thresholding the result.

f(x) = 1 / (1 + e^(-x))

In the formula above, e is a transcendental number, approximately equal to 2.718; it is the limit of (1 + 1/n)^n as n approaches infinity.
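
A one-line Scala version of the sigmoid function, for illustration:

def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

println(sigmoid(0.0))   //0.5
println(sigmoid(10.0))  //close to 1
println(sigmoid(-10.0)) //close to 0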

package com.yasuofenglei.logistic

import org.apache.spark._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
/*
Build a logistic regression model. Typical use: binary classification,
where the dependent variable is discrete with two possible values, 1 or 0.
 */
/* Sample data (logistic.txt); the first three columns are features, the last column is the label:
17	1	1	1
44	0	0	1
48	1	0	1
55	0	0	1
75	1	1	1
35	0	1	0
42	1	1	0
57	0	0	0
28	0	1	0
20	0	1	0
38	1	0	0
45	0	1	0
47	1	1	0
52	0	0	0
55	0	1	0
68	1	0	1
18	1	0	1
68	0	0	1
48	1	1	1
17	0	0	1
 * */
object Driver {
  def main(args: Array[String]): Unit = {
    val conf= new SparkConf().setMaster("local").setAppName("logistic");
    val sc=new SparkContext(conf)
    val data=sc.textFile("d://data/ml/logistic.txt")
    //Step 1: transform the data into the form required for modelling
    //RDD[String] -> RDD[LabeledPoint]
    val parseData=data.map{line => 
      val info=line.split("\t")
      val Y=info.last.toDouble;
      val featuresArray=info.take(3).map{_.toDouble};
      LabeledPoint(Y,Vectors.dense(featuresArray))
    }
    //parseData.foreach{println}
    //Step 2: build the logistic regression model. The commented-out version below solves for the coefficients with stochastic gradient descent.
//    val model= LogisticRegressionWithSGD.train(parseData, 500,1) //its parameters are hard to tune
    //Instead, build the logistic regression model with a quasi-Newton method (L-BFGS) to solve for the coefficients.
    /*This algorithm approaches the true solution numerically (it is iterative), converges quickly, and needs no step size.
     * Advantages: few iterations and no step size to specify.
     * Disadvantages: each iteration is computationally expensive, so for very large data sets SGD is still recommended.
     * */
    val model=new LogisticRegressionWithLBFGS().run(parseData) 
    //Model coefficients
    val coef=model.weights
    //Step 3: predict (classify) by feeding the training samples back into the model
    val predict=model.predict(parseData.map{x=>x.features})
    predict.foreach{println}
    /*
     * Prediction data (testLogistic.txt):
18	0	0
35	1	1
40	0	1
22	1	0
     * */
    val testData=sc.textFile("d://data/ml/testLogistic.txt")
    val parseTestData=testData.map{line=>line.split("\t").map{num => num.toDouble}}
      .map{arr=>Vectors.dense(arr)}
      
    val testPredict=model.predict(parseTestData)
    testPredict.foreach { println}
  }
}
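
To get a rough sense of how well the model fits, the training labels can be compared with the predictions. A minimal sketch, not part of the original code, which assumes parseData and model as defined above:

    //fraction of training samples whose predicted class equals the actual label
    val accuracy=parseData.map{p =>
      if(model.predict(p.features) == p.label) 1.0 else 0.0
    }.mean()
    println(s"training accuracy = $accuracy")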

Origin blog.csdn.net/yasuofenglei/article/details/100766811