2. Basic statistics

Summary statistics

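We provide column summary statistics for RDD[Vector] through the colStats function available in Statistics.
Example code
colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
For details on the API, refer to the MultivariateStatisticalSummary Scala docs.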

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val observations = sc.parallelize(
  Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0),
    Vectors.dense(3.0, 30.0, 300.0)
  )
)

// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)  // a dense vector containing the mean value for each column
println(summary.variance)  // column-wise variance
println(summary.numNonzeros)  // number of nonzeros in each column
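
To make the fields of the summary concrete, here is a minimal plain-Scala sketch (no Spark required) that computes the same kinds of column statistics for the three observations above. The names `ColumnSummary` and `columnSummary` are illustrative and not part of spark.mllib, and the variance is assumed to be the unbiased sample variance (dividing by n - 1).

```scala
// Illustrative helper, not part of spark.mllib: per-column statistics over in-memory rows.
final case class ColumnSummary(
    mean: Double, variance: Double, numNonzeros: Long, max: Double, min: Double)

def columnSummary(rows: Seq[Array[Double]]): Seq[ColumnSummary] = {
  require(rows.nonEmpty, "need at least one row")
  val n = rows.size
  // Transpose so that each inner sequence holds one column.
  rows.map(_.toSeq).transpose.map { col =>
    val mean = col.sum / n
    // Assumption: unbiased sample variance (n - 1 in the denominator); 0.0 for a single row.
    val variance =
      if (n > 1) col.map(x => (x - mean) * (x - mean)).sum / (n - 1) else 0.0
    ColumnSummary(mean, variance, col.count(_ != 0.0).toLong, col.max, col.min)
  }
}

val rows = Seq(
  Array(1.0, 10.0, 100.0),
  Array(2.0, 20.0, 200.0),
  Array(3.0, 30.0, 300.0)
)

val summaries = columnSummary(rows)
summaries.zipWithIndex.foreach { case (s, i) => println(s"column $i: $s") }
```

Unlike this sketch, colStats works on a distributed RDD, so the whole data set never has to fit in memory on one machine.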

Correlations

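Calculating the correlation between two series of data is a common operation in statistics. spark.mllib provides the flexibility to calculate pairwise correlations among many series. The correlation methods currently supported are Pearson's and Spearman's.
Example code
Statistics provides methods to calculate correlations between series. Depending on the type of input, two RDD[Double]s or one RDD[Vector], the output will be a Double or the correlation Matrix, respectively.
For details on the API, refer to the Statistics Scala docs.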

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val seriesX: RDD[Double] = sc.parallelize(Array(1, 2, 3, 3, 5))  // a series
// must have the same number of partitions and cardinality as seriesX
val seriesY: RDD[Double] = sc.parallelize(Array(11, 22, 33, 33, 555))

// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
// method is not specified, Pearson's method will be used by default.
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
println(s"Correlation is: $correlation")

val data: RDD[Vector] = sc.parallelize(
  Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0),
    Vectors.dense(5.0, 33.0, 366.0))
)  // note that each Vector is a row and not a column

// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method
// If a method is not specified, Pearson's method will be used by default.
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
println(correlMatrix.toString)
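
As a rough illustration of what the two methods measure, the following plain-Scala sketch computes Pearson's correlation directly and Spearman's correlation as Pearson's correlation of the ranks (ties receive the average rank). The helpers `pearson`, `averageRanks`, and `spearman` are hypothetical names written for this example; they are not how spark.mllib implements `Statistics.corr`.

```scala
// Pearson's correlation: covariance divided by the product of the standard deviations.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.nonEmpty, "series must be non-empty and equally long")
  val meanX = xs.sum / xs.size
  val meanY = ys.sum / ys.size
  val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
  val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
  cov / math.sqrt(varX * varY)
}

// 1-based ranks; tied values share the average of the positions they occupy.
def averageRanks(values: Seq[Double]): Seq[Double] = {
  val ranks = Array.ofDim[Double](values.size)
  values.zipWithIndex.sortBy(_._1).zipWithIndex
    .groupBy { case ((value, _), _) => value }
    .values
    .foreach { tied =>
      val avgRank = tied.map { case (_, sortedPos) => sortedPos + 1.0 }.sum / tied.size
      tied.foreach { case ((_, originalIdx), _) => ranks(originalIdx) = avgRank }
    }
  ranks.toSeq
}

// Spearman's correlation: Pearson's correlation applied to the ranks.
def spearman(xs: Seq[Double], ys: Seq[Double]): Double =
  pearson(averageRanks(xs), averageRanks(ys))

val xs = Seq(1.0, 2.0, 3.0, 3.0, 5.0)
val ys = Seq(11.0, 22.0, 33.0, 33.0, 555.0)

println(s"Pearson:  ${pearson(xs, ys)}")
println(s"Spearman: ${spearman(xs, ys)}")
```

For these two series the values increase together, so the rank-based Spearman correlation is higher than the Pearson correlation, which is pulled down by the outlier 555.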

Stratified sampling

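Unlike the other statistics functions in spark.mllib, the stratified sampling methods sampleByKey and sampleByKeyExact can be performed on RDDs of key-value pairs. For stratified sampling, the keys can be thought of as labels and the values as specific attributes. For example, the key can be man or woman, or a document ID, and the respective values can be the list of ages of the people in the population or the list of words in the document. The sampleByKey method flips a coin to decide whether an observation is sampled, so it requires one pass over the data and provides an expected sample size. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used in sampleByKey, but provides the exact sample size with 99.99% confidence.
sampleByKeyExact is currently not supported in Python.
Example code
sampleByKeyExact() allows users to sample exactly ⌈f_k ⋅ n_k⌉ items for every key k ∈ K, where f_k is the desired fraction for key k, n_k is the number of key-value pairs for key k, and K is the set of keys. Sampling without replacement requires one additional pass over the RDD to guarantee the sample size, whereas sampling with replacement requires two additional passes.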

// an RDD[(K, V)] of any key value pairs
val data = sc.parallelize(
  Seq((1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')))

// specify the exact fraction desired from each key
val fractions = Map(1 -> 0.1, 2 -> 0.6, 3 -> 0.3)

// Get an approximate sample from each stratum
val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)
// Get an exact sample from each stratum
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)
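
To see which sizes the formula ⌈f_k ⋅ n_k⌉ yields for this data set, and what a coin-flip sample looks like, here is a small plain-Scala sketch on an in-memory collection. It only mimics the idea; the variable names and the fixed seed are assumptions made for the example, and Spark's actual implementation differs.

```scala
import scala.util.Random

val pairs = Seq((1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f'))
val fractions = Map(1 -> 0.1, 2 -> 0.6, 3 -> 0.3)

// n_k: the number of key-value pairs for each key k.
val countsByKey: Map[Int, Int] =
  pairs.groupBy(_._1).map { case (k, group) => k -> group.size }

// Target size per stratum for exact sampling: ceil(f_k * n_k).
val exactSizes: Map[Int, Int] =
  countsByKey.map { case (k, n) => k -> math.ceil(fractions(k) * n).toInt }
println(s"exact sizes per key: $exactSizes")

// Coin-flip (Bernoulli) sampling: keep each pair independently with probability f_k.
// The sample size is only f_k * n_k in expectation. The seed is arbitrary.
val rng = new Random(42L)
val approxSample = pairs.filter { case (k, _) => rng.nextDouble() < fractions(k) }
println(s"coin-flip sample: $approxSample")
```

With these fractions the exact method targets 1, 2, and 1 items for keys 1, 2, and 3, while the coin-flip sample may contain more or fewer items per key on any given run.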

Hypothesis testing

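Hypothesis testing is a powerful tool in statistics for determining whether a result is statistically significant, that is, whether or not it occurred by chance. spark.mllib currently supports Pearson's chi-squared (χ²) tests for goodness of fit and independence. The input data type determines whether the goodness of fit test or the independence test is conducted. The goodness of fit test requires an input of type Vector, whereas the independence test requires a Matrix as input.
spark.mllib also supports the input type RDD[LabeledPoint] to enable feature selection via chi-squared independence tests.
Example code
Statistics provides methods to run Pearson's chi-squared tests. The following example demonstrates how to run and interpret hypothesis tests.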

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult
import org.apache.spark.rdd.RDD

// a vector composed of the frequencies of events
val vec: Vector = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25)

// compute the goodness of fit. If a second vector to test against is not supplied
// as a parameter, the test runs against a uniform distribution.
val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
// summary of the test including the p-value, degrees of freedom, test statistic, the method
// used, and the null hypothesis.
println(s"$goodnessOfFitTestResult\n")

// a contingency matrix. Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val mat: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

// conduct Pearson's independence test on the input contingency matrix
val independenceTestResult = Statistics.chiSqTest(mat)
// summary of the test including the p-value, degrees of freedom
println(s"$independenceTestResult\n")

val obs: RDD[LabeledPoint] =
  sc.parallelize(
    Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 0.0)),
      LabeledPoint(-1.0, Vectors.dense(-1.0, 0.0, -0.5)
      )
    )
  ) // (label, feature) pairs.

// The contingency table is constructed from the raw (label, feature) pairs and used to conduct
// the independence test. Returns an array containing the ChiSqTestResult for every feature
// against the label.
val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
featureTestResults.zipWithIndex.foreach { case (k, v) =>
  println(s"Column ${(v + 1)} :")
  println(k)
}  // summary of the test
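
The test statistic behind both tests is the sum over all cells of (observed - expected)² / expected. The sketch below recomputes it in plain Scala for the vector `vec` and the matrix `mat` used above. It is an illustration under the usual textbook definitions, with hypothetical helper names; the p-value is omitted because it needs the chi-squared CDF, which the Scala standard library does not provide.

```scala
// Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected over all cells.
def chiSqStatistic(observed: Seq[Double], expected: Seq[Double]): Double =
  observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum

// Goodness of fit against a uniform distribution with the same total as the observations.
val observed = Seq(0.1, 0.15, 0.2, 0.3, 0.25)
val uniformExpected = Seq.fill(observed.size)(observed.sum / observed.size)
val gofStatistic = chiSqStatistic(observed, uniformExpected)
val gofDegreesOfFreedom = observed.size - 1

// Independence test on the contingency table ((1, 2), (3, 4), (5, 6)).
// Under independence, the expected cell value is rowSum * colSum / total.
val table = Seq(Seq(1.0, 2.0), Seq(3.0, 4.0), Seq(5.0, 6.0))
val rowSums = table.map(_.sum)
val colSums = table.transpose.map(_.sum)
val total = rowSums.sum
val expectedCells = for (r <- rowSums; c <- colSums) yield r * c / total // row-major order
val indepStatistic = chiSqStatistic(table.flatten, expectedCells)
val indepDegreesOfFreedom = (rowSums.size - 1) * (colSums.size - 1)

println(s"goodness of fit: statistic = $gofStatistic, df = $gofDegreesOfFreedom")
println(s"independence:    statistic = $indepStatistic, df = $indepDegreesOfFreedom")
```

Both statistics are small relative to their degrees of freedom, which is consistent with failing to reject the null hypothesis in either test.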

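Additionally, spark.mllib provides a one-sample, two-sided implementation of the Kolmogorov-Smirnov (KS) test for equality of probability distributions. By providing the name of a theoretical distribution (currently only the normal distribution is supported) and its parameters, or a function that computes the cumulative distribution according to a given theoretical distribution, the user can test the null hypothesis that their sample is drawn from that distribution. If the user tests against the normal distribution (distName = "norm") but does not provide distribution parameters, the test initializes to the standard normal distribution and logs an appropriate message.
Example code
Statistics provides methods to run a one-sample, two-sided Kolmogorov-Smirnov test. The following example demonstrates how to run and interpret the hypothesis test.
For details on the API, refer to the Statistics Scala docs.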

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25))  // an RDD of sample data

// run a KS test for the sample versus a standard normal distribution
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
// summary of the test including the p-value, test statistic, and null hypothesis;
// if our p-value indicates significance, we can reject the null hypothesis.
println(testResult)
println()

// perform a KS test using a cumulative distribution function of our making
val myCDF = Map(0.1 -> 0.2, 0.15 -> 0.6, 0.2 -> 0.05, 0.3 -> 0.05, 0.25 -> 0.1)
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
println(testResult2)
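
The KS statistic itself is the largest vertical distance between the empirical CDF of the sample and the theoretical CDF. The following plain-Scala sketch computes it for the same sample. To stay self-contained it tests against a Uniform(0, 1) CDF, which is an assumption chosen for this example (the Scala standard library has no normal CDF); `ksStatistic` is a hypothetical helper, not Spark's implementation.

```scala
// One-sample, two-sided KS statistic: D = sup_x |F_n(x) - F(x)|.
// At each sorted sample point the empirical CDF jumps from i/n to (i + 1)/n,
// so both sides of the jump are compared with the theoretical CDF.
def ksStatistic(sample: Seq[Double], cdf: Double => Double): Double = {
  require(sample.nonEmpty, "sample must not be empty")
  val n = sample.size.toDouble
  sample.sorted.zipWithIndex.map { case (x, i) =>
    val theoretical = cdf(x)
    val above = (i + 1) / n - theoretical // empirical CDF just after the jump
    val below = theoretical - i / n       // empirical CDF just before the jump
    math.max(above, below)
  }.max
}

// Assumed theoretical distribution for this sketch: Uniform(0, 1).
val uniformCdf: Double => Double = x => math.min(1.0, math.max(0.0, x))

val sample = Seq(0.1, 0.15, 0.2, 0.3, 0.25)
val d = ksStatistic(sample, uniformCdf)
println(s"KS statistic D = $d")
```

All five sample values lie below 0.3, so the empirical CDF reaches 1.0 where the uniform CDF is only 0.3, which gives a large distance.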

Streaming significance testing

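spark.mllib provides online implementations of some tests to support use cases such as A/B testing. These tests may be performed on a Spark Streaming DStream[(Boolean, Double)], where the first element of each tuple indicates the control group (false) or the treatment group (true), and the second element is the value of an observation.
Streaming significance testing supports the following parameters: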

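peacePeriod - the number of initial data points from the stream to ignore, used to mitigate novelty effects.
windowSize - the number of past batches to perform hypothesis testing over. Setting it to 0 performs cumulative processing using all prior batches.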

Example code
StreamingTest provides streaming hypothesis testing.

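import org.apache.spark.mllib.stat.test.{BinarySample, StreamingTest}

// ssc (a StreamingContext) and dataDir (the input directory) are assumed to be defined already
val data = ssc.textFileStream(dataDir).map(line => line.split(",") match {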
  case Array(label, value) => BinarySample(label.toBoolean, value.toDouble)
})

val streamingTest = new StreamingTest()
  .setPeacePeriod(0)
  .setWindowSize(0)
  .setTestMethod("welch")

val out = streamingTest.registerStream(data)
out.print()
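
The example above selects "welch" as the test method. As a hedged illustration of what Welch's t-test computes for one set of (group, value) observations, the plain-Scala sketch below derives the t statistic and the Welch-Satterthwaite degrees of freedom. The observations are made up for the example, the helper names are hypothetical, and this is not Spark's streaming implementation.

```scala
final case class WelchResult(t: Double, degreesOfFreedom: Double)

// Welch's t-test for two groups with possibly unequal variances.
def welchTTest(control: Seq[Double], treatment: Seq[Double]): WelchResult = {
  require(control.size > 1 && treatment.size > 1, "each group needs at least two observations")
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size
  def sampleVariance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
  }
  // Squared standard errors of the two group means.
  val vc = sampleVariance(control) / control.size
  val vt = sampleVariance(treatment) / treatment.size
  val t = (mean(treatment) - mean(control)) / math.sqrt(vc + vt)
  // Welch-Satterthwaite approximation of the degrees of freedom.
  val df = math.pow(vc + vt, 2) /
    (vc * vc / (control.size - 1) + vt * vt / (treatment.size - 1))
  WelchResult(t, df)
}

// Made-up observations: false = control group, true = treatment group.
val observations: Seq[(Boolean, Double)] = Seq(
  (false, 1.0), (false, 2.0), (false, 3.0), (false, 4.0),
  (true, 2.0), (true, 4.0), (true, 6.0), (true, 8.0)
)
val (treatmentObs, controlObs) = observations.partition(_._1)
val result = welchTTest(controlObs.map(_._2), treatmentObs.map(_._2))
println(result)
```

A streaming test repeats a computation of this kind as new batches arrive, restricted to the batches selected by peacePeriod and windowSize.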

Random data generation

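Random data generation is useful for randomized algorithms, prototyping, and performance testing. spark.mllib supports generating random RDDs with i.i.d. values drawn from a given distribution: uniform, standard normal, or Poisson.
Example code
RandomRDDs provides factory methods to generate random double RDDs or vector RDDs. The following example generates a random double RDD whose values follow the standard normal distribution N(0, 1), and then maps it to N(1, 4).
For details on the API, refer to the RandomRDDs Scala docs.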

import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

val sc: SparkContext = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
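
The transform works because if X follows N(0, 1), then a + b⋅X follows N(a, b²); with a = 1 and b = 2 this gives mean 1 and variance 4. The plain-Scala sketch below checks this empirically with `scala.util.Random` instead of an RDD. The seed and sample size are arbitrary choices for the example.

```scala
import scala.util.Random

val rng = new Random(2024L) // fixed, arbitrarily chosen seed for reproducibility
val u = Seq.fill(100000)(rng.nextGaussian()) // draws from N(0, 1)
val v = u.map(x => 1.0 + 2.0 * x)            // shifted and scaled to N(1, 4)

def mean(xs: Seq[Double]): Double = xs.sum / xs.size
def variance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}

println(f"u: mean = ${mean(u)}%.3f, variance = ${variance(u)}%.3f")
println(f"v: mean = ${mean(v)}%.3f, variance = ${variance(v)}%.3f")
```

Note that the 4 in N(1, 4) is the variance, so the scale factor applied to each value is the standard deviation 2.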

Kernel density estimation

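Kernel density estimation is a technique useful for visualizing empirical probability distributions without requiring assumptions about the particular distribution that the observed samples are drawn from. It computes an estimate of the probability density function of a random variable, evaluated at a given set of points. It achieves this estimate by expressing the PDF of the empirical distribution at a particular point as the mean of the PDFs of normal distributions centered around each of the samples.
Example code
KernelDensity provides methods to compute kernel density estimates from an RDD of samples. The following example demonstrates how to do so.
For details on the API, refer to the KernelDensity Scala docs.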

import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

// an RDD of sample data
val data: RDD[Double] = sc.parallelize(Seq(1, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9))

// Construct the density estimator with the sample data and a standard deviation
// for the Gaussian kernels
val kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0)

// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
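
The description above translates almost directly into code: the estimate at a point x is the mean of the normal PDFs centered at each sample, with the bandwidth as their standard deviation. The following plain-Scala sketch applies that formula to the same sample and evaluation points. `gaussianKde` is a hypothetical helper written for this example, not the spark.mllib implementation.

```scala
// Gaussian kernel density estimate:
//   f(x) = (1 / n) * sum_i N(x; mean = x_i, stddev = bandwidth)
def gaussianKde(sample: Seq[Double], bandwidth: Double)(points: Seq[Double]): Seq[Double] = {
  require(sample.nonEmpty && bandwidth > 0.0, "need samples and a positive bandwidth")
  val norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.Pi))
  points.map { x =>
    sample.map { xi =>
      val z = (x - xi) / bandwidth
      norm * math.exp(-0.5 * z * z) // normal PDF centered at xi, evaluated at x
    }.sum / sample.size
  }
}

val sample = Seq(1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0)
val densities = gaussianKde(sample, bandwidth = 3.0)(Seq(-1.0, 2.0, 5.0))
println(densities)
```

The estimate is highest at 5.0, near the middle of the sample, and lowest at -1.0, which lies outside the observed range.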

TIAN_R

scala-MLlib official documentation---spark.mllib package--Basic statistics