Spark machine learning: MLlib data types (Scala), vectors and matrices

Contents
1. Local vector
2. Labeled point
3. Sparse data
4. Local matrix
5. Distributed matrix
5.1) Row-oriented distributed matrix (RowMatrix)
5.2) Indexed-row matrix (IndexedRowMatrix)
5.3) Coordinate matrix (CoordinateMatrix)
1. Local vector
Vector is the base class for local vectors; MLlib provides two implementations, DenseVector and SparseVector. The recommended way to create a local vector is through the factory methods of Vectors. (Note: Scala imports scala.collection.immutable.Vector by default, so to use MLlib's Vector you must explicitly import org.apache.spark.mllib.linalg.Vector.)

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Create a sparse vector (1.0, 0.0, 3.0) by specifying the indices and values of its nonzero entries.

val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
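All three constructions above describe the same logical vector; only the storage differs. A minimal sketch verifying this (the object name and the `toArray` comparison are ours, added for illustration):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object VectorEquivalence {
  def main(args: Array[String]): Unit = {
    val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
    val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
    val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
    // Each sparse vector materializes to the same dense array (1.0, 0.0, 3.0).
    println(sv1.toArray.sameElements(dv.toArray))  // true
    println(sv2.toArray.sameElements(dv.toArray))  // true
  }
}
```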

2. Labeled point
A labeled point is a local vector paired with a label; it is represented by the case class LabeledPoint.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

3. Sparse data
In practice, sparse data is very common. MLlib can read training examples stored in LIBSVM format, the default format of LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector:

label index1:value1 index2:value2 ...

The indices are one-based and given in ascending order. After loading, they are converted to zero-based indices. For example, the line `1 1:0.5 3:2.0` describes label 1.0 with a three-or-more-dimensional sparse feature vector whose zero-based entries 0 and 2 are 0.5 and 2.0.
Training examples stored in LIBSVM format are read with MLUtils.loadLibSVMFile.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

4. Local matrix
A local matrix has integer row and column indices and double values, and is stored on a single machine. MLlib supports dense matrices (not sparse ones), whose entries are stored in a single double array in column-major order.
Matrix is the base class for local matrices; MLlib provides one implementation, DenseMatrix. The recommended way to create a local matrix is through the factory methods of Matrices:

import org.apache.spark.mllib.linalg.{Matrix, Matrices}
// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))

val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
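Because storage is column-major, the flat array Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0) fills the first column with 1.0, 3.0, 5.0 and the second with 2.0, 4.0, 6.0. A small sketch indexing the matrix (the object name is ours; only the Matrices factory shown above is assumed):

```scala
import org.apache.spark.mllib.linalg.Matrices

object ColumnMajorLayout {
  def main(args: Array[String]): Unit = {
    // 3 rows, 2 columns, entries given in column-major order.
    val dm = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
    println(dm.numRows)  // 3
    println(dm.numCols)  // 2
    // apply(i, j) reads row i, column j (zero-based).
    println(dm(0, 1))    // 2.0
    println(dm(2, 1))    // 6.0
  }
}
```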

5. Distributed matrix
A distributed matrix has long-typed row and column indices and double values, stored distributively in one or more RDDs. For very large distributed matrices, choosing the right storage format is important: converting a distributed matrix to a different format may require a global shuffle, which is expensive. Three types of distributed matrices have been implemented so far.
The most basic type is RowMatrix, a row-oriented distributed matrix whose row indices have no particular meaning, e.g. a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. For a RowMatrix we assume the number of columns is not huge, so that a single local vector can reasonably be communicated to the driver node and stored and operated on at a single node.
An IndexedRowMatrix is similar to a RowMatrix, but it has meaningful row indices, which can be used to identify rows and to execute joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries. Note: because we cache the matrix size, the underlying RDDs of a distributed matrix must be deterministic. In general, using non-deterministic RDDs can lead to errors.

5.1) Row-oriented distributed matrix (RowMatrix)
A RowMatrix is a row-oriented distributed matrix whose row indices have no particular meaning, e.g. a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range, and in practice should be much smaller.
A RowMatrix can be created from an RDD[Vector] instance. Then we can compute its column summary statistics.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows: RDD[Vector] = ... // an RDD of local vectors

// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()
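The summary statistics mentioned above can be obtained with computeColumnSummaryStatistics. A minimal sketch; the local SparkContext setup and the sample rows are ours, not from the original, and in a real job `sc` would already exist:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object RowMatrixStats {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("RowMatrixStats"))
    // Three sample rows, forming a 3 x 2 matrix.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0),
      Vectors.dense(2.0, 20.0),
      Vectors.dense(3.0, 30.0)))
    val mat = new RowMatrix(rows)
    val summary = mat.computeColumnSummaryStatistics()
    println(summary.mean)         // per-column means
    println(summary.variance)     // per-column variances
    println(summary.numNonzeros)  // per-column nonzero counts
    sc.stop()
  }
}
```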

5.2) Indexed-row matrix (IndexedRowMatrix)
An IndexedRowMatrix is similar to a RowMatrix, but its row indices are meaningful: it is essentially a dataset of rows carrying index information (an RDD of indexed rows). Each row consists of a long index and a local vector. An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector). Dropping the row index information turns an IndexedRowMatrix into a RowMatrix.

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
val rows: RDD[IndexedRow] = … // an RDD of indexed rows

// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()

5.3) Coordinate matrix (CoordinateMatrix)
A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a triple (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.
A CoordinateMatrix can be created from an RDD[MatrixEntry] instance, where MatrixEntry is a wrapper over (Long, Long, Double). A CoordinateMatrix can be converted to an IndexedRowMatrix with sparse rows by calling toIndexedRowMatrix. Other computations on a CoordinateMatrix are not currently supported.

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = … // an RDD of matrix entries

// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
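Putting the pieces together, entries can be built directly with the MatrixEntry case class. A minimal end-to-end sketch; the local SparkContext setup and the sample entries are ours, for illustration only:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

object CoordinateMatrixDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("CoordinateMatrixDemo"))
    // A sparse matrix given as (row, col, value) entries; the dimensions
    // are inferred from the largest indices seen, here 4 x 5.
    val entries = sc.parallelize(Seq(
      MatrixEntry(0, 0, 1.0),
      MatrixEntry(1, 3, 2.0),
      MatrixEntry(3, 4, 3.0)))
    val mat = new CoordinateMatrix(entries)
    println(mat.numRows())  // 4
    println(mat.numCols())  // 5
    // Rows of the converted matrix are sparse vectors.
    val indexedRowMatrix = mat.toIndexedRowMatrix()
    sc.stop()
  }
}
```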
Origin blog.csdn.net/m0_37611613/article/details/104315191