Spark MLlib Core Fundamentals: Vectors and Matrices

Posted by sunbow0 on 2015-04-23

1. Spark MLlib Core Fundamentals: Vectors and Matrices

1.1 Vector

1.1.1 dense vector

Source definition:

  /**

   * Creates a dense vector from its values.

   */

  @varargs

  def dense(firstValue: Double, otherValues: Double*): Vector =

    new DenseVector((firstValue +: otherValues).toArray)

 

  // A dummy implicit is used to avoid signature collision with the one generated by @varargs.

  /**

   * Creates a dense vector from a double array.

   */

  def dense(values: Array[Double]): Vector = new DenseVector(values)

Usage:

scala>   import org.apache.spark.mllib.linalg.{Matrices, Vectors}

scala>   val A1 = (1 to 5).toArray.map {f => f.toDouble}

A1: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)

scala>   val V1 = Vectors.dense(A1)

V1: org.apache.spark.mllib.linalg.Vector = [1.0,2.0,3.0,4.0,5.0]

scala>   val V2 = Vectors.dense(2.0, 2.0, 2.0, 2.0, 2.0, 2.0)

V2: org.apache.spark.mllib.linalg.Vector = [2.0,2.0,2.0,2.0,2.0,2.0]

1.1.2 sparse vector

Source definition:

/**

   * Creates a sparse vector providing its index array and value array.

   *

   * @param size vector size.

   * @param indices index array, must be strictly increasing.

   * @param values value array, must have the same length as indices.

   */

  def sparse(size: Int, indices: Array[Int], values: Array[Double]): Vector =

    new SparseVector(size, indices, values)

  // Other overloads (bodies omitted here):

  def sparse(size: Int, elements: Seq[(Int, Double)]): Vector

  def sparse(size: Int, elements: JavaIterable[(JavaInteger, JavaDouble)]): Vector

Usage:

scala>   val S1 = Vectors.sparse(5, Array(0, 1, 2, 3, 4), Array(1.0, 2.0, 3.0, 4.0, 5.0))

S1: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,2.0,3.0,4.0,5.0])

scala>   val S2 = Vectors.sparse(5, Seq((0, 1.0), (1, 2.0), (2,3.0), (3,4.0), (4,5.0)))

S2: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,2.0,3.0,4.0,5.0])
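The (size, indices, values) representation used by the sparse vector can be illustrated with a small plain-Scala sketch. It does not use Spark; `SparseSketch` and `toDense` are hypothetical names chosen for this illustration, not MLlib API.

```scala
// Plain-Scala sketch (no Spark dependency); the names are illustrative only.
object SparseSketch {
  // Expand a (size, indices, values) triple into a dense array.
  // Positions not listed in `indices` stay at 0.0.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val out = new Array[Double](size) // initialised to 0.0
    var k = 0
    while (k < indices.length) {
      out(indices(k)) = values(k)
      k += 1
    }
    out
  }
}
```

For example, `SparseSketch.toDense(5, Array(1, 3), Array(2.0, 4.0))` yields the array 0.0, 2.0, 0.0, 4.0, 0.0, which shows why a sparse vector only pays for the non-zero entries.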

1.2 Matrix

1.2.1 dense matrix

Source definition:

/**

   * Creates a column-major dense matrix.

   *

   * @param numRows number of rows

   * @param numCols number of columns

   * @param values matrix entries in column major

   */

  def dense(numRows: Int, numCols: Int, values: Array[Double]): Matrix = {

    new DenseMatrix(numRows, numCols, values)

  }

Usage:

scala>   val A2 = (1 to 25).toArray.map { f => f.toDouble }

A2: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0)

 

scala>   val M1 = Matrices.dense(5, 5, A2)

M1: org.apache.spark.mllib.linalg.Matrix =

1.0  6.0   11.0  16.0  21.0 

2.0  7.0   12.0  17.0  22.0 

3.0  8.0   13.0  18.0  23.0 

4.0  9.0   14.0  19.0  24.0 

5.0  10.0  15.0  20.0  25.0 
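The printed matrix follows from the column-major layout: the values array fills the first column top to bottom, then the second column, and so on. A minimal plain-Scala sketch of that index mapping follows; `ColumnMajorSketch`, `at` and `rows` are hypothetical helper names, not part of MLlib.

```scala
// Plain-Scala sketch of column-major indexing (no Spark dependency).
object ColumnMajorSketch {
  // Entry (i, j) of a column-major matrix with numRows rows lives at values(i + j * numRows).
  def at(numRows: Int, values: Array[Double], i: Int, j: Int): Double =
    values(i + j * numRows)

  // Rebuild the rows in the order they are printed.
  def rows(numRows: Int, numCols: Int, values: Array[Double]): Seq[Seq[Double]] =
    (0 until numRows).map(i => (0 until numCols).map(j => at(numRows, values, i, j)))
}
```

With the 25 values from the example, entry (0, 1) is 6.0 and the last row is 5.0, 10.0, 15.0, 20.0, 25.0, matching the output above.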

 

1.2.2 sparse matrix

Source definition:

   /**

   * Creates a column-major sparse matrix in Compressed Sparse Column (CSC) format.

   *

   * @param numRows number of rows

   * @param numCols number of columns

   * @param colPtrs the index corresponding to the start of a new column

   * @param rowIndices the row index of the entry

   * @param values non-zero matrix entries in column major

   */

  def sparse(

     numRows: Int,

     numCols: Int,

     colPtrs: Array[Int],

     rowIndices: Array[Int],

     values: Array[Double]): Matrix = {

    new SparseMatrix(numRows, numCols, colPtrs, rowIndices, values)

  }

  /**

   * Column-major sparse matrix.

   * The entry values are stored in Compressed Sparse Column (CSC) format.

   * For example, the following matrix

   * {{{

   *   1.0 0.0 4.0

   *   0.0 3.0 5.0

   *   2.0 0.0 6.0

   * }}}

   * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`,

   * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`.

Usage:

scala>   val M2 = Matrices.sparse(3, 3, Array(0, 2, 3, 6), Array(0, 2, 1, 0, 1, 2), Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

M2: org.apache.spark.mllib.linalg.Matrix =

3 x 3 CSCMatrix

(0,0) 1.0

(2,0) 2.0

(1,1) 3.0

(0,2) 4.0

(1,2) 5.0

(2,2) 6.0
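To make the CSC arrays less opaque, the sketch below decodes colPtrs, rowIndices and values back into (row, column, value) triples. It is plain Scala without Spark, and `CscSketch.entries` is a hypothetical helper written for this illustration.

```scala
// Plain-Scala sketch of decoding the CSC format (no Spark dependency).
object CscSketch {
  // Column j owns the slice colPtrs(j) until colPtrs(j + 1) of rowIndices and values.
  def entries(
      colPtrs: Array[Int],
      rowIndices: Array[Int],
      values: Array[Double]): Seq[(Int, Int, Double)] =
    for {
      j <- 0 until colPtrs.length - 1
      k <- colPtrs(j) until colPtrs(j + 1)
    } yield (rowIndices(k), j, values(k))
}
```

Applied to the arrays passed to Matrices.sparse above, it produces the same six (row, column) value pairs, in the same order, as the printed CSCMatrix.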

 

1.3 distributed Matrix

1.3.1 RowMatrix

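A row matrix is a row-oriented distributed matrix: its rows are stored in an RDD, and each row is a local vector. This resembles the data matrix of multivariate statistics. Because each row is a local vector, the number of columns is limited by the integer range, although in practice the column count is usually much smaller.

1. Creating a RowMatrix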

/**

 * @param rows rows stored as an RDD[Vector]

 * @param nRows number of rows. A non-positive value means unknown, and then the number of rows will be determined by the number of records in the RDD `rows`.

 * @param nCols number of columns. A non-positive value means unknown, and then the number of columns will be determined by the size of the first row.

 */

new RowMatrix(rows: RDD[Vector])

new RowMatrix(rows: RDD[Vector], nRows: Long, nCols: Int)

scala>   val rdd1= sc.parallelize(Array(Array(1.0,2.0,3.0,4.0),Array(2.0,3.0,4.0,5.0),Array(3.0,4.0,5.0,6.0))).map(f => Vectors.dense(f))

rdd1: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[38] at map at <console>:31

scala>   import org.apache.spark.mllib.linalg.distributed.RowMatrix

scala>   val RM = new RowMatrix(rdd1)

RM: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@6196e877

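2. RowMatrix methods

1)columnSimilarities(threshold: Double): CoordinateMatrix

Computes the cosine similarity between each pair of columns using a sampling approach. The threshold parameter controls the sampling: similarities above it are estimated more reliably, and setting it to 0 gives the exact result.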

val simic1 = RM.columnSimilarities(0.5)

 

2)columnSimilarities(): CoordinateMatrix

Computes the cosine similarity between each pair of columns exactly, without sampling.

val simic2 = RM.columnSimilarities()

 

3)computeColumnSummaryStatistics(): MultivariateStatisticalSummary

Computes summary statistics for each column.

val simic3 = RM.computeColumnSummaryStatistics()

simic3.max

simic3.min

simic3.mean

 

4)computeCovariance(): Matrix

Computes the covariance between columns and returns the covariance matrix.

val cc1 = RM.computeCovariance

cc1: org.apache.spark.mllib.linalg.Matrix =

1.0  1.0  1.0  1.0 

1.0  1.0  1.0  1.0 

1.0  1.0  1.0  1.0 

1.0  1.0  1.0  1.0 
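The all-ones result can be checked by hand. The sketch below is a plain-Scala illustration of the sample covariance (dividing by n - 1) on local data; it is not how Spark computes it in a distributed setting, and `CovarianceSketch` is a hypothetical name.

```scala
// Plain-Scala sketch of the sample covariance between columns (no Spark dependency).
object CovarianceSketch {
  // rows: one array per observation; returns a d x d matrix as nested arrays.
  def covariance(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    val n = rows.length
    val d = rows.head.length
    // Column means.
    val mean = Array.tabulate(d)(j => rows.map(_(j)).sum / n)
    // Sum of products of deviations, divided by n - 1 (sample covariance).
    Array.tabulate(d, d) { (a, b) =>
      rows.map(r => (r(a) - mean(a)) * (r(b) - mean(b))).sum / (n - 1)
    }
  }
}
```

In the example every column is the sequence 1, 2, 3 shifted by a constant, so all deviations are -1, 0, 1 and every covariance equals (1 + 0 + 1) / 2 = 1.0.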

 

5)computeGramianMatrix(): Matrix

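Computes the Gramian matrix `A^T A`.

Given a real matrix A, the matrix A^T A is the Gramian matrix of the column vectors of A, and the matrix A A^T is the Gramian matrix of the row vectors of A.

val cg1 = RM.computeGramianMatrix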

cg1: org.apache.spark.mllib.linalg.Matrix =

14.0  20.0  26.0  32.0 

20.0  29.0  38.0  47.0 

26.0  38.0  50.0  62.0 

32.0  47.0  62.0  77.0 
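The Gramian entries above are dot products between columns of the data. A minimal local sketch in plain Scala, with the hypothetical name `GramianSketch` and no claim about Spark's distributed implementation:

```scala
// Plain-Scala sketch of the Gramian matrix A^T A (no Spark dependency).
object GramianSketch {
  // rows: the rows of A; entry (a, b) of the result is the dot product of columns a and b.
  def gramian(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    val d = rows.head.length
    Array.tabulate(d, d)((a, b) => rows.map(r => r(a) * r(b)).sum)
  }
}
```

For the three example rows, entry (0, 0) is 1 + 4 + 9 = 14.0 and entry (3, 3) is 16 + 25 + 36 = 77.0, in agreement with the printed matrix.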

 

6)computePrincipalComponents(k: Int): Matrix

Computes the top k principal components. Rows are samples and columns are variables.

val pc1 = RM.computePrincipalComponents(3)

pc1: org.apache.spark.mllib.linalg.Matrix =

-0.5000000000000002  0.8660254037844388    1.6653345369377348E-16 

-0.5000000000000002  -0.28867513459481275  0.8164965809277258     

-0.5000000000000002  -0.28867513459481287  -0.40824829046386296   

-0.5000000000000002  -0.28867513459481287  -0.40824829046386296  

 

7)computeSVD(k: Int, computeU: Boolean = false, rCond: Double = 1e-9):

 SingularValueDecomposition[RowMatrix, Matrix]

Computes the singular value decomposition of the matrix.

  val svd = RM.computeSVD(4, true)

  val U = svd.U

  U.rows.foreach(println)

  val s = svd.s

  val V = svd.V

 

8)multiply(B: Matrix): RowMatrix

Matrix multiplication: multiplies this matrix on the right by a local matrix B.

A.multiply(B) => A * B
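Conceptually, right-multiplying a row-distributed matrix by a local matrix can be done one row at a time, since each output row depends only on the matching input row and on B. The sketch below shows this on local collections; it is an illustration under that assumption, `MultiplySketch` is a hypothetical name, and B is assumed to be stored column-major as in section 1.2.1.

```scala
// Plain-Scala sketch of row-wise right multiplication (no Spark dependency).
object MultiplySketch {
  // rows: the rows of A, each of length n.
  // b: an n x p local matrix stored in column-major order.
  def multiply(rows: Seq[Array[Double]], n: Int, p: Int, b: Array[Double]): Seq[Array[Double]] =
    rows.map { r =>
      // Output entry j is the dot product of the row with column j of B.
      Array.tabulate(p)(j => (0 until n).map(i => r(i) * b(i + j * n)).sum)
    }
}
```

Multiplying the rows (1, 2) and (3, 4) by the 2 x 1 matrix of ones gives the rows (3) and (7); the result has as many rows as A and as many columns as B.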

9)numCols(): Long

The number of columns of the matrix.

10)numRows(): Long

The number of rows of the matrix.

11)rows: RDD[Vector]

The matrix as an RDD, stored row by row.

 

1.3.2 IndexedRowMatrix 

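This is the second type of distributed matrix. It is very similar to a RowMatrix, except that its row indices are meaningful. It is backed by an RDD of indexed rows, in which each row is represented by its index (long-typed) and a local vector. An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector), and it can be converted to a RowMatrix by dropping its row indices.

As with RowMatrix, this resembles the data matrix of multivariate statistics. Because each row is a local vector, the number of columns is limited by the integer range, although in practice it is usually much smaller.

It is created and used in much the same way as a RowMatrix.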

1.3.3 DistributedMatrix

Represents a distributively stored matrix backed by one or more RDDs.

 

1.3.4 CoordinateMatrix

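This is the third type of distributed matrix. A coordinate matrix is also a distributed matrix backed by an RDD. As the name suggests, each entry is a tuple (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A coordinate matrix is a good fit when the matrix is very large and very sparse. It is created from an RDD[MatrixEntry] instance, where MatrixEntry is a wrapper over (Long, Long, Double). A coordinate matrix can be converted to an IndexedRowMatrix.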
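The relationship between coordinate entries and indexed rows can be sketched on local collections. The code below is plain Scala without Spark; `Entry` is a hypothetical stand-in for MatrixEntry, and `toIndexedRows` is an illustrative helper that builds dense rows for simplicity, which is an assumption of this sketch rather than a description of MLlib's conversion.

```scala
// Plain-Scala sketch of grouping coordinate entries into indexed rows (no Spark dependency).
object CoordinateSketch {
  // Hypothetical stand-in for MLlib's MatrixEntry: (row index, column index, value).
  final case class Entry(i: Long, j: Long, value: Double)

  // Group entries by row index and fill one row per index that actually appears.
  // Row indices without any entry are simply absent from the result.
  def toIndexedRows(entries: Seq[Entry], numCols: Int): Map[Long, Array[Double]] =
    entries.groupBy(_.i).map { case (i, es) =>
      val row = new Array[Double](numCols) // initialised to 0.0
      es.foreach(e => row(e.j.toInt) = e.value)
      i -> row
    }
}
```

Only the non-zero entries are stored in the coordinate form, which is why it suits very sparse data; the grouped form trades that for direct access to whole rows.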

 

1.3.5 BlockMatrix

/**

 * :: Experimental ::

 *

 * Represents a distributed matrix in blocks of local matrices.

 *

 * @param blocks The RDD of sub-matrix blocks ((blockRowIndex, blockColIndex), sub-matrix) that

 *               form this distributed matrix. If multiple blocks with the same index exist, the

 *               results for operations like add and multiply will be unpredictable.

 * @param rowsPerBlock Number of rows that make up each block. The blocks forming the final

 *                     rows are not required to have the given number of rows

 * @param colsPerBlock Number of columns that make up each block. The blocks forming the final

 *                     columns are not required to have the given number of columns

 * @param nRows Number of rows of this matrix. If the supplied value is less than or equal to zero,

 *              the number of rows will be calculated when `numRows` is invoked.

 * @param nCols Number of columns of this matrix. If the supplied value is less than or equal to

 *              zero, the number of columns will be calculated when `numCols` is invoked.

 */

A distributed block matrix. See: http://de.wikipedia.org/wiki/Blockmatrix
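The role of rowsPerBlock and colsPerBlock can be illustrated with simple index arithmetic. The sketch below is plain Scala without Spark; `BlockSketch`, `locate` and `numBlocks` are hypothetical names, and the mapping shown is the natural one implied by the parameter descriptions above rather than code taken from MLlib.

```scala
// Plain-Scala sketch of block index arithmetic (no Spark dependency).
object BlockSketch {
  // Map a global entry (i, j) to
  // ((blockRowIndex, blockColIndex), (rowInBlock, colInBlock)).
  def locate(i: Long, j: Long, rowsPerBlock: Int, colsPerBlock: Int): ((Int, Int), (Int, Int)) =
    (((i / rowsPerBlock).toInt, (j / colsPerBlock).toInt),
     ((i % rowsPerBlock).toInt, (j % colsPerBlock).toInt))

  // Number of blocks along one dimension; the final block may be smaller than perBlock.
  def numBlocks(n: Long, perBlock: Int): Int =
    math.ceil(n.toDouble / perBlock).toInt
}
```

With 2 x 2 blocks, the global entry (5, 2) falls in block (2, 1) at local position (1, 0), and a dimension of length 5 needs 3 blocks, the last of which holds a single row or column.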

 

 
