Spark MLlib Core Fundamentals: Vectors and Matrices

Posted by sunbow0 on 2015-04-23

1. Spark MLlib Core Fundamentals: Vectors and Matrices

1.1 Vector

1.1.1 dense vector

Source definition:

/**
 * Creates a dense vector from its values.
 */
@varargs
def dense(firstValue: Double, otherValues: Double*): Vector =
  new DenseVector((firstValue +: otherValues).toArray)

// A dummy implicit is used to avoid signature collision with the one generated by @varargs.
/**
 * Creates a dense vector from a double array.
 */
def dense(values: Array[Double]): Vector = new DenseVector(values)

Usage:

scala> import org.apache.spark.mllib.linalg.Vectors

scala> val A1 = (1 to 5).toArray.map { f => f.toDouble }
A1: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)

scala> val V1 = Vectors.dense(A1)
V1: org.apache.spark.mllib.linalg.Vector = [1.0,2.0,3.0,4.0,5.0]

scala> val V2 = Vectors.dense(2.0, 2.0, 2.0, 2.0, 2.0, 2.0)
V2: org.apache.spark.mllib.linalg.Vector = [2.0,2.0,2.0,2.0,2.0,2.0]

1.1.2 sparse vector

Source definition:

/**
 * Creates a sparse vector providing its index array and value array.
 *
 * @param size vector size.
 * @param indices index array, must be strictly increasing.
 * @param values value array, must have the same length as indices.
 */
def sparse(size: Int, indices: Array[Int], values: Array[Double]): Vector =
  new SparseVector(size, indices, values)

// Further overloads (signatures only):
def sparse(size: Int, elements: Seq[(Int, Double)]): Vector
def sparse(size: Int, elements: JavaIterable[(JavaInteger, JavaDouble)]): Vector

Usage:

scala> val S1 = Vectors.sparse(5, Array(0, 1, 2, 3, 4), Array(1.0, 2.0, 3.0, 4.0, 5.0))
S1: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,2.0,3.0,4.0,5.0])

scala> val S2 = Vectors.sparse(5, Seq((0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0), (4, 5.0)))
S2: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,2.0,3.0,4.0,5.0])
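The example above stores every position, so it does not show what the sparse format saves. The following is a minimal plain-Scala sketch (no Spark dependency) of how a (size, indices, values) triple maps back to a dense array. The names `SparseSketch` and `toDense` are hypothetical and are not part of MLlib; the checks mirror the constraints stated in the source comment above.

```scala
// Hypothetical helper for illustration only; not MLlib code.
object SparseSketch {
  /** Expands a (size, indices, values) triple into a dense array, filling gaps with 0.0. */
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    require(indices.sliding(2).forall(p => p.length < 2 || p(0) < p(1)),
      "indices must be strictly increasing")
    require(indices.forall(i => i >= 0 && i < size), "index out of range")
    val dense = new Array[Double](size)            // all positions start at 0.0
    for ((i, v) <- indices.zip(values)) dense(i) = v
    dense
  }
}
```

For instance, `SparseSketch.toDense(5, Array(1, 3), Array(2.0, 4.0))` yields `Array(0.0, 2.0, 0.0, 4.0, 0.0)`: only two of the five entries had to be stored.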

1.2 Matrix

1.2.1 dense matrix

Source definition:

/**
 * Creates a column-major dense matrix.
 *
 * @param numRows number of rows
 * @param numCols number of columns
 * @param values matrix entries in column major
 */
def dense(numRows: Int, numCols: Int, values: Array[Double]): Matrix = {
  new DenseMatrix(numRows, numCols, values)
}

Usage:

scala> import org.apache.spark.mllib.linalg.Matrices

scala> val A2 = (1 to 25).toArray.map { f => f.toDouble }
A2: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0)

scala> val M1 = Matrices.dense(5, 5, A2)
M1: org.apache.spark.mllib.linalg.Matrix =
1.0  6.0   11.0  16.0  21.0
2.0  7.0   12.0  17.0  22.0
3.0  8.0   13.0  18.0  23.0
4.0  9.0   14.0  19.0  24.0
5.0  10.0  15.0  20.0  25.0
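The printed matrix shows the effect of column-major storage: the first five values of the array fill the first column, not the first row. A minimal plain-Scala sketch of the index arithmetic follows; `ColumnMajorSketch` and `entry` are hypothetical names used only for illustration.

```scala
// Hypothetical helper for illustration only; not MLlib code.
object ColumnMajorSketch {
  /** Returns entry (i, j) of a numRows x numCols matrix stored column by column. */
  def entry(numRows: Int, numCols: Int, values: Array[Double], i: Int, j: Int): Double = {
    require(values.length == numRows * numCols, "values must hold numRows * numCols entries")
    require(i >= 0 && i < numRows && j >= 0 && j < numCols, "index out of range")
    values(i + j * numRows)   // skip j full columns, then move i rows down
  }
}
```

With the 25-element array from the example, entry (1, 2) is `values(1 + 2 * 5)`, that is `12.0`, which matches the second row, third column of the printed matrix.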

1.2.2 sparse matrix

Source definition:

/**
 * Creates a column-major sparse matrix in Compressed Sparse Column (CSC) format.
 *
 * @param numRows number of rows
 * @param numCols number of columns
 * @param colPtrs the index corresponding to the start of a new column
 * @param rowIndices the row index of the entry
 * @param values non-zero matrix entries in column major
 */
def sparse(
    numRows: Int,
    numCols: Int,
    colPtrs: Array[Int],
    rowIndices: Array[Int],
    values: Array[Double]): Matrix = {
  new SparseMatrix(numRows, numCols, colPtrs, rowIndices, values)
}

/**
 * Column-major sparse matrix.
 * The entry values are stored in Compressed Sparse Column (CSC) format.
 * For example, the following matrix
 * {{{
 *   1.0 0.0 4.0
 *   0.0 3.0 5.0
 *   2.0 0.0 6.0
 * }}}
 * is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`,
 * `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`.
 */

Usage:

scala> val M2 = Matrices.sparse(3, 3, Array(0, 2, 3, 6), Array(0, 2, 1, 0, 1, 2), Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
M2: org.apache.spark.mllib.linalg.Matrix =
3 x 3 CSCMatrix
(0,0) 1.0
(2,0) 2.0
(1,1) 3.0
(0,2) 4.0
(1,2) 5.0
(2,2) 6.0
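The CSC layout is easier to follow when decoded by hand: column j owns the storage positions from `colPtrs(j)` up to, but not including, `colPtrs(j + 1)`. The plain-Scala sketch below walks those ranges; `CscSketch` and `entries` are hypothetical names, and the code only illustrates the format rather than reproducing MLlib internals.

```scala
// Hypothetical helper for illustration only; not MLlib code.
object CscSketch {
  /** Lists the stored entries of a CSC matrix as (row, column, value) triples. */
  def entries(
      numCols: Int,
      colPtrs: Array[Int],
      rowIndices: Array[Int],
      values: Array[Double]): Seq[(Int, Int, Double)] = {
    require(colPtrs.length == numCols + 1, "colPtrs needs numCols + 1 entries")
    require(rowIndices.length == values.length, "one row index per stored value")
    for {
      j <- 0 until numCols
      k <- colPtrs(j) until colPtrs(j + 1)   // storage positions belonging to column j
    } yield (rowIndices(k), j, values(k))
  }
}
```

Applied to the arrays passed to `Matrices.sparse` above, this produces the triples (0,0,1.0), (2,0,2.0), (1,1,3.0), (0,2,4.0), (1,2,5.0), (2,2,6.0), in the same order as the printed CSCMatrix.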

1.3 distributed Matrix

1.3.1 RowMatrix

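A RowMatrix is a row-oriented distributed matrix backed by an RDD of its rows, where each row is a local vector. This is similar to a data matrix in multivariate statistics. Because each row is represented by a local vector, the number of columns is limited to the Int range, although in practice it is usually much smaller.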

1. Creating a RowMatrix

/**
 * @param rows rows stored as an RDD[Vector]
 * @param nRows number of rows. A non-positive value means unknown, and then the number of rows will
 *              be determined by the number of records in the RDD `rows`.
 * @param nCols number of columns. A non-positive value means unknown, and then the number of
 *              columns will be determined by the size of the first row.
 */
new RowMatrix(rows: RDD[Vector])
new RowMatrix(rows: RDD[Vector], nRows: Long, nCols: Int)

scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix

scala> val rdd1 = sc.parallelize(Array(Array(1.0, 2.0, 3.0, 4.0), Array(2.0, 3.0, 4.0, 5.0), Array(3.0, 4.0, 5.0, 6.0))).map(f => Vectors.dense(f))
rdd1: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[38] at map at <console>:31

scala> val RM = new RowMatrix(rdd1)
RM: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@6196e877

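2. RowMatrix methods

1) columnSimilarities(threshold: Double): CoordinateMatrix

Computes the cosine similarity between each pair of columns using a sampling-based approximation controlled by threshold. A higher threshold lowers the computational cost, but the returned similarities are estimates.

val simic1 = RM.columnSimilarities(0.5)

2) columnSimilarities(): CoordinateMatrix

Computes the cosine similarity between each pair of columns exactly, by brute force.

val simic2 = RM.columnSimilarities()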
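To make the quantity concrete, here is a plain-Scala sketch of the exact cosine similarity between two columns of a small local matrix. It covers only the brute-force definition; the sampling scheme used when a threshold is given is not reproduced. `CosineSketch` and `columnCosine` are hypothetical names.

```scala
// Hypothetical helper for illustration only; not MLlib code.
object CosineSketch {
  /** Exact cosine similarity between columns i and j of a matrix given as a sequence of rows. */
  def columnCosine(rows: Seq[Array[Double]], i: Int, j: Int): Double = {
    val dot   = rows.map(r => r(i) * r(j)).sum              // inner product of the two columns
    val normI = math.sqrt(rows.map(r => r(i) * r(i)).sum)   // Euclidean norm of column i
    val normJ = math.sqrt(rows.map(r => r(j) * r(j)).sum)   // Euclidean norm of column j
    dot / (normI * normJ)
  }
}
```

For the three rows used to build RM, columns 0 and 1 are (1, 2, 3) and (2, 3, 4), so the similarity is 20 / sqrt(14 * 29), roughly 0.9926.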

3) computeColumnSummaryStatistics(): MultivariateStatisticalSummary

Computes column-wise summary statistics.

val simic3 = RM.computeColumnSummaryStatistics()
simic3.max    // per-column maximum
simic3.min    // per-column minimum
simic3.mean   // per-column mean
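As a rough local equivalent, the per-column minimum, maximum and mean can be sketched in plain Scala as follows. `ColumnStatsSketch`, `Stats` and `summarize` are hypothetical names; the real summary object exposes more statistics than these three.

```scala
// Hypothetical helper for illustration only; not MLlib code.
object ColumnStatsSketch {
  final case class Stats(min: Array[Double], max: Array[Double], mean: Array[Double])

  /** Computes per-column min, max and mean of a matrix given as a sequence of rows. */
  def summarize(rows: Seq[Array[Double]]): Stats = {
    require(rows.nonEmpty, "need at least one row")
    // Transpose: one sequence of values per column.
    val cols = rows.head.indices.map(j => rows.map(r => r(j)))
    Stats(
      cols.map(c => c.reduce((a, b) => math.min(a, b))).toArray,
      cols.map(c => c.reduce((a, b) => math.max(a, b))).toArray,
      cols.map(c => c.sum / c.length).toArray)
  }
}
```

For the three rows used to build RM, this gives min = (1, 2, 3, 4), max = (3, 4, 5, 6) and mean = (2, 3, 4, 5).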

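4) computeCovariance(): Matrix

Computes the covariance between each pair of columns and returns the covariance matrix.

scala> val cc1 = RM.computeCovariance()
cc1: org.apache.spark.mllib.linalg.Matrix =
1.0  1.0  1.0  1.0
1.0  1.0  1.0  1.0
1.0  1.0  1.0  1.0
1.0  1.0  1.0  1.0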
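Every entry is 1.0 here because each column of the example is the sequence (c, c + 1, c + 2), so all columns deviate from their means in the same way. The plain-Scala sketch below computes the sample covariance (divisor n - 1) directly from the definition. `CovarianceSketch` and `covariance` are hypothetical names, and this local computation is not necessarily how MLlib evaluates it internally.

```scala
// Hypothetical helper for illustration only; not MLlib code.
object CovarianceSketch {
  /** Sample covariance matrix (divisor n - 1) of the columns of a matrix given as rows. */
  def covariance(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    val n = rows.length
    require(n > 1, "need at least two rows")
    val d = rows.head.length
    val mean = Array.tabulate(d)(j => rows.map(r => r(j)).sum / n)
    Array.tabulate(d, d) { (i, j) =>
      rows.map(r => (r(i) - mean(i)) * (r(j) - mean(j))).sum / (n - 1)
    }
  }
}
```

For the three example rows, each entry evaluates to ((-1)(-1) + 0 + (1)(1)) / 2 = 1.0.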

5) computeGramianMatrix(): Matrix

Computes the Gramian matrix `A^T A`.

For a real matrix A, `A^T A` is the Gram matrix of the column vectors of A, and `A A^T` is the Gram matrix of its row vectors.

scala> val cg1 = RM.computeGramianMatrix()
cg1: org.apache.spark.mllib.linalg.Matrix =
14.0  20.0  26.0  32.0
20.0  29.0  38.0  47.0
26.0  38.0  50.0  62.0
32.0  47.0  62.0  77.0
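Because `A^T A` equals the sum of the outer products of the rows of A, each row can contribute independently, which suits row-wise distributed storage. A plain-Scala sketch of that sum follows; `GramianSketch` and `gramian` are hypothetical names, and the sketch is a local illustration rather than the distributed implementation.

```scala
// Hypothetical helper for illustration only; not MLlib code.
object GramianSketch {
  /** Computes A^T A for a matrix A given as rows, as a sum of per-row outer products. */
  def gramian(rows: Seq[Array[Double]]): Array[Array[Double]] = {
    val d = rows.head.length
    val g = Array.ofDim[Double](d, d)   // d x d accumulator, initially all zeros
    for (r <- rows; i <- 0 until d; j <- 0 until d) g(i)(j) += r(i) * r(j)
    g
  }
}
```

For the three example rows, the first row of the result is (14, 20, 26, 32) and the last diagonal entry is 77, in agreement with the output above.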

6) computePrincipalComponents(k: Int): Matrix

Computes the top k principal components. Rows are observations and columns are variables.

val pc1 = RM.computePrincipalComponents(3)

pc1: org.apache.spark.mllib.linalg.Matrix =

-0.5000000000000002 0.8660254037844388 1.6653345369377348E-16

-0.5000000000000002 -0.28867513459481275 0.8164965809277258

-0.5000000000000002 -0.28867513459481287 -0.40824829046386296

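7) computeSVD(k: Int, computeU: Boolean = false, rCond: Double = 1e-9): SingularValueDecomposition[RowMatrix, Matrix]

Computes the singular value decomposition of the matrix, keeping the top k singular values.

val svd = RM.computeSVD(4, true)
val U = svd.U             // left singular vectors as a RowMatrix (available because computeU = true)
U.rows.foreach(println)
val s = svd.s             // singular values as a local Vector
val V = svd.V             // right singular vectors as a local Matrix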

8) multiply(B: Matrix): RowMatrix

Multiplies this matrix by a local matrix B on the right.

A.multiply(B) computes A * B
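Right multiplication fits row-wise storage because every row of the product depends on only one row of A together with the whole of B. A plain-Scala sketch of that per-row computation follows, with B stored column-major as in section 1.2.1. `MultiplySketch` and its parameter names are hypothetical.

```scala
// Hypothetical helper for illustration only; not MLlib code.
object MultiplySketch {
  /** Multiplies each row of A (length n) by a local n x k matrix B stored column-major. */
  def multiply(rows: Seq[Array[Double]], b: Array[Double], n: Int, k: Int): Seq[Array[Double]] = {
    require(b.length == n * k, "B must hold n * k entries")
    rows.map { r =>
      require(r.length == n, "row length must match the number of rows of B")
      // Entry c of the output row is the dot product of r with column c of B.
      Array.tabulate(k)(c => (0 until n).map(i => r(i) * b(i + c * n)).sum)
    }
  }
}
```

With rows (1, 2) and (3, 4) and B given column-major as `Array(1.0, 1.0, 0.0, 2.0)`, the result rows are (3, 4) and (7, 8).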

9) numCols(): Long

The number of columns of the matrix.

10) numRows(): Long

The number of rows of the matrix.

11) rows: RDD[Vector]

The rows of the matrix as an RDD[Vector].

1.3.2 IndexedRowMatrix

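An IndexedRowMatrix is the second kind of distributed matrix. It is very similar to a RowMatrix, except that its rows carry meaningful row indices. It is backed by an RDD of indexed rows, where each row is represented by its index (long-typed) and a local vector. An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector), and it can be converted to a RowMatrix by dropping its row indices.

As with RowMatrix, each row is a local vector, so the number of columns is limited to the Int range, although in practice it is usually much smaller.

It is created and used in much the same way as a RowMatrix.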
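The relationship between the two types can be pictured with a small local model in plain Scala. The `IndexedRow` case class below is a hypothetical stand-in that uses an `Array[Double]` in place of an MLlib vector, and `dropIndices` and `rowAt` are illustrative helpers, not MLlib methods.

```scala
// Hypothetical local model for illustration only; not MLlib code.
object IndexedRowSketch {
  // Stand-in for an indexed row: a Long row index plus the row's values.
  final case class IndexedRow(index: Long, vector: Array[Double])

  /** Drops the row indices, which is what the conversion to a plain row matrix amounts to. */
  def dropIndices(rows: Seq[IndexedRow]): Seq[Array[Double]] = rows.map(_.vector)

  /** Looks up a row by its index; rows need not be stored in index order. */
  def rowAt(rows: Seq[IndexedRow], index: Long): Option[Array[Double]] =
    rows.find(_.index == index).map(_.vector)
}
```

The indices make it possible to address a row by position even when the rows are stored in arbitrary order; once they are dropped, only the row contents remain.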

1.3.3 DistributedMatrix

Represents a distributively stored matrix backed by one or more RDDs.

1.3.4 CoordinateMatrix

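A CoordinateMatrix is the third kind of distributed matrix, and it is also backed by an RDD. As the name suggests, each entry is a tuple (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. It is a good fit when both dimensions of the matrix are very large and the matrix is very sparse. A CoordinateMatrix is created from an RDD[MatrixEntry] instance, where MatrixEntry is a wrapper over (Long, Long, Double). It can be converted to an IndexedRowMatrix.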
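The conversion to indexed rows can be pictured as grouping the entries by row index. The plain-Scala sketch below does this on a local collection; `CoordinateSketch`, its local `MatrixEntry` case class and `toIndexedRows` are hypothetical, and dense rows are used purely to keep the example short.

```scala
// Hypothetical local model for illustration only; not MLlib code.
object CoordinateSketch {
  // Stand-in for a coordinate entry: (row index, column index, value).
  final case class MatrixEntry(i: Long, j: Long, value: Double)

  /** Groups entries by row index and expands each group into a dense row of length numCols. */
  def toIndexedRows(entries: Seq[MatrixEntry], numCols: Int): Map[Long, Array[Double]] =
    entries.groupBy(_.i).map { case (i, es) =>
      val row = new Array[Double](numCols)        // unspecified positions stay 0.0
      es.foreach(e => row(e.j.toInt) = e.value)
      i -> row
    }
}
```

Rows without any entry simply do not appear in the result, which is why the coordinate form stays compact for very sparse matrices.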

1.3.5 BlockMatrix

/**
 * :: Experimental ::
 * Represents a distributed matrix in blocks of local matrices.
 *
 * @param blocks The RDD of sub-matrix blocks ((blockRowIndex, blockColIndex), sub-matrix) that
 *               form this distributed matrix. If multiple blocks with the same index exist, the
 *               results for operations like add and multiply will be unpredictable.
 * @param rowsPerBlock Number of rows that make up each block. The blocks forming the final
 *                     rows are not required to have the given number of rows
 * @param colsPerBlock Number of columns that make up each block. The blocks forming the final
 *                     columns are not required to have the given number of columns
 * @param nRows Number of rows of this matrix. If the supplied value is less than or equal to zero,
 *              the number of rows will be calculated when `numRows` is invoked.
 * @param nCols Number of columns of this matrix. If the supplied value is less than or equal to
 *              zero, the number of columns will be calculated when `numCols` is invoked.
 */

A distributed block matrix. See: http://de.wikipedia.org/wiki/Blockmatrix
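The block coordinates described in the comment follow from integer division. The plain-Scala sketch below maps a global entry position to its block index and its offset inside that block; `BlockSketch`, `locate` and `numRowBlocks` are hypothetical names, not BlockMatrix methods.

```scala
// Hypothetical helper for illustration only; not MLlib code.
object BlockSketch {
  /** Maps a global entry (i, j) to ((blockRowIndex, blockColIndex), (rowInBlock, colInBlock)). */
  def locate(i: Long, j: Long, rowsPerBlock: Int, colsPerBlock: Int): ((Int, Int), (Int, Int)) = {
    require(i >= 0 && j >= 0 && rowsPerBlock > 0 && colsPerBlock > 0)
    (((i / rowsPerBlock).toInt, (j / colsPerBlock).toInt),
     ((i % rowsPerBlock).toInt, (j % colsPerBlock).toInt))
  }

  /** Number of block rows needed to cover nRows rows; the last block row may be shorter. */
  def numRowBlocks(nRows: Long, rowsPerBlock: Int): Int =
    math.ceil(nRows.toDouble / rowsPerBlock).toInt
}
```

With 3 rows and 2 columns per block, entry (7, 2) lands in block (2, 1) at offset (1, 0), and a matrix with 7 rows needs 3 block rows, the last of which holds a single row.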
