Spark MLlib Core Fundamentals: Vectors and Matrices
1. Spark MLlib Core Fundamentals: Vectors and Matrices
1.1 Vector
1.1.1 dense vector
Source definition:
/**
* Creates a dense vector from its values.
*/
@varargs
def dense(firstValue: Double, otherValues: Double*): Vector =
new DenseVector((firstValue +: otherValues).toArray)
// A dummy implicit is used to avoid signature collision with the one generated by @varargs.
/**
* Creates a dense vector from a double array.
*/
def dense(values: Array[Double]): Vector = new DenseVector(values)
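Usage:
scala> import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
scala> val A1 = (1 to 5).toArray.map { f => f.toDouble }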
A1: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
scala> val V1 = Vectors.dense(A1)
V1: org.apache.spark.mllib.linalg.Vector = [1.0,2.0,3.0,4.0,5.0]
scala> val V2 = Vectors.dense(2.0, 2.0, 2.0, 2.0, 2.0, 2.0)
V2: org.apache.spark.mllib.linalg.Vector = [2.0,2.0,2.0,2.0,2.0,2.0]
1.1.2 sparse vector
Source definition:
/**
* Creates a sparse vector providing its index array and value array.
*
* @param size vector size.
* @param indices index array, must be strictly increasing.
* @param values value array, must have the same length as indices.
*/
def sparse(size: Int, indices: Array[Int], values: Array[Double]): Vector =
new SparseVector(size, indices, values)
// Other overloads (bodies omitted):
def sparse(size: Int, elements: Seq[(Int, Double)]): Vector
def sparse(size: Int, elements: JavaIterable[(JavaInteger, JavaDouble)]): Vector
Usage:
scala> val S1 = Vectors.sparse(5, Array(0, 1, 2, 3, 4), Array(1.0, 2.0, 3.0, 4.0, 5.0))
S1: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,2.0,3.0,4.0,5.0])
scala> val S2 = Vectors.sparse(5, Seq((0, 1.0), (1, 2.0), (2,3.0), (3,4.0), (4,5.0)))
S2: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,2.0,3.0,4.0,5.0])
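To make the meaning of the (size, indices, values) triple concrete, here is a minimal sketch in plain Scala that expands it into a dense array. The helper name sparseToDense is hypothetical and is not part of MLlib; it only illustrates the storage scheme described in the source comment above.

```scala
// Hypothetical helper (not an MLlib API): expand a sparse triple into a dense array.
def sparseToDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  require(indices.length == values.length, "indices and values must have the same length")
  val dense = Array.fill(size)(0.0)          // every position not listed in indices stays 0.0
  for ((i, v) <- indices.zip(values)) dense(i) = v
  dense
}

// A length-5 vector with non-zero entries only at positions 0 and 3
val d1 = sparseToDense(5, Array(0, 3), Array(1.0, 4.0))
```

When most entries are zero, the sparse form stores only the two index/value pairs instead of all five doubles, which is the reason to prefer it for high-dimensional, mostly empty data.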
1.2 Matrix
1.2.1 dense matrix
Source definition:
/**
* Creates a column-major dense matrix.
*
* @param numRows number of rows
* @param numCols number of columns
* @param values matrix entries in column major
*/
def dense(numRows: Int, numCols: Int, values: Array[Double]): Matrix = {
new DenseMatrix(numRows, numCols, values)
}
Usage:
scala> val A2 = (1 to 25).toArray.map { f => f.toDouble }
A2: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0)
scala> val M1 = Matrices.dense(5, 5, A2)
M1: org.apache.spark.mllib.linalg.Matrix =
1.0 6.0 11.0 16.0 21.0
2.0 7.0 12.0 17.0 22.0
3.0 8.0 13.0 18.0 23.0
4.0 9.0 14.0 19.0 24.0
5.0 10.0 15.0 20.0 25.0
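The printed matrix shows that the values array fills the matrix column by column. A minimal sketch of the column-major index arithmetic, in plain Scala, follows; the helper name entryAt is hypothetical and not an MLlib method.

```scala
// Hypothetical helper (not an MLlib API): look up entry (i, j) in a column-major array.
// Column j starts at offset j * numRows, and row i is the i-th element inside that column.
def entryAt(values: Array[Double], numRows: Int, i: Int, j: Int): Double =
  values(i + j * numRows)

// Same 25 values as A2 above
val vals = (1 to 25).map(_.toDouble).toArray

val topOfSecondColumn = entryAt(vals, 5, 0, 1)   // row 0, column 1 of M1
val bottomRight = entryAt(vals, 5, 4, 4)         // row 4, column 4 of M1
```

This is why passing a row-by-row array to Matrices.dense yields the transpose of what one might expect.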
1.2.2 sparse matrix
Source definition:
/**
* Creates a column-major sparse matrix in Compressed Sparse Column (CSC) format.
*
* @param numRows number of rows
* @param numCols number of columns
* @param colPtrs the index corresponding to the start of a new column
* @param rowIndices the row index of the entry
* @param values non-zero matrix entries in column major
*/
def sparse(
numRows: Int,
numCols: Int,
colPtrs: Array[Int],
rowIndices: Array[Int],
values: Array[Double]): Matrix = {
new SparseMatrix(numRows, numCols, colPtrs, rowIndices, values)
}
/**
* Column-major sparse matrix.
* The entry values are stored in Compressed Sparse Column (CSC) format.
* For example, the following matrix
* {{{
* 1.0 0.0 4.0
* 0.0 3.0 5.0
* 2.0 0.0 6.0
* }}}
* is stored as `values: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]`,
* `rowIndices=[0, 2, 1, 0, 1, 2]`, `colPointers=[0, 2, 3, 6]`.
Usage:
scala> val M2 = Matrices.sparse(3, 3, Array(0, 2, 3, 6), Array(0, 2, 1, 0, 1, 2), Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
M2: org.apache.spark.mllib.linalg.Matrix =
3 x 3 CSCMatrix
(0,0) 1.0
(2,0) 2.0
(1,1) 3.0
(0,2) 4.0
(1,2) 5.0
(2,2) 6.0
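The CSC layout can be hard to read from the three arrays alone. The following plain-Scala sketch decodes them back into a dense row-by-row array, under the assumption that colPtrs(j) until colPtrs(j + 1) gives the positions in rowIndices and values that belong to column j. The helper name cscToDense is hypothetical and not part of MLlib.

```scala
// Hypothetical helper (not an MLlib API): decode CSC arrays into a dense array of rows.
def cscToDense(
    numRows: Int,
    numCols: Int,
    colPtrs: Array[Int],
    rowIndices: Array[Int],
    values: Array[Double]): Array[Array[Double]] = {
  val dense = Array.fill(numRows, numCols)(0.0)
  for {
    j <- 0 until numCols                       // walk the columns in order
    k <- colPtrs(j) until colPtrs(j + 1)       // stored entries belonging to column j
  } dense(rowIndices(k))(j) = values(k)
  dense
}

// Same arguments as M2 above
val m = cscToDense(3, 3, Array(0, 2, 3, 6), Array(0, 2, 1, 0, 1, 2),
  Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
```

Decoding the arrays used for M2 reproduces the 3 x 3 matrix given in the source comment, with rows (1.0, 0.0, 4.0), (0.0, 3.0, 5.0) and (2.0, 0.0, 6.0).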
1.3 distributed Matrix
1.3.1 RowMatrix
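A RowMatrix is a row-oriented distributed matrix backed by an RDD of its rows, so the rows are stored across the cluster and each row is a local vector. This is similar to a data matrix in multivariate statistics. Because each row is a local vector, the number of columns is limited to the Int range, although in practice the column count is usually much smaller.
1. Creating a RowMatrix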
* @param rows rows stored as an RDD[Vector]
* @param nRows number of rows. A non-positive value means unknown, and then the number of rows will be determined by the number of records in the RDD `rows`.
* @param nCols number of columns. A non-positive value means unknown, and then the number of columns will be determined by the size of the first row.
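new RowMatrix(rows: RDD[Vector])
new RowMatrix(rows: RDD[Vector], nRows: Long, nCols: Int)
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val rdd1 = sc.parallelize(Array(Array(1.0,2.0,3.0,4.0),Array(2.0,3.0,4.0,5.0),Array(3.0,4.0,5.0,6.0))).map(f => Vectors.dense(f))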
rdd1: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[38] at map at <console>:31
scala> val RM = new RowMatrix(rdd1)
RM: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@6196e877
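2. RowMatrix methods
1) columnSimilarities(threshold: Double): CoordinateMatrix
Computes the cosine similarity between each pair of columns using a sampling approach. The threshold parameter trades accuracy for cost: a higher threshold means less computation but less reliable estimates for pairs whose similarity falls below it.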
val simic1 = RM.columnSimilarities(0.5)
2) columnSimilarities(): CoordinateMatrix
Computes the cosine similarity between each pair of columns exactly, without sampling.
val simic2 = RM.columnSimilarities()
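As a point of reference for what a single column-pair similarity means, here is a minimal plain-Scala sketch of cosine similarity between two columns of a small in-memory matrix. It is not how MLlib computes the result on a cluster; the name cosineOfColumns and the local array sim are assumptions introduced only for illustration.

```scala
// Same three rows as rdd1 above, held locally for illustration
val sim = Array(
  Array(1.0, 2.0, 3.0, 4.0),
  Array(2.0, 3.0, 4.0, 5.0),
  Array(3.0, 4.0, 5.0, 6.0))

// Hypothetical helper (not an MLlib API): cosine similarity of columns p and q.
def cosineOfColumns(a: Array[Array[Double]], p: Int, q: Int): Double = {
  val dot = a.map(r => r(p) * r(q)).sum                // inner product of the two columns
  val normP = math.sqrt(a.map(r => r(p) * r(p)).sum)   // Euclidean norm of column p
  val normQ = math.sqrt(a.map(r => r(q) * r(q)).sum)   // Euclidean norm of column q
  dot / (normP * normQ)
}

val c01 = cosineOfColumns(sim, 0, 1)   // equals 20 / sqrt(14 * 29)
```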
3) computeColumnSummaryStatistics(): MultivariateStatisticalSummary
Computes column-wise summary statistics.
val simic3 = RM.computeColumnSummaryStatistics()
simic3.max
simic3.min
simic3.mean
4) computeCovariance(): Matrix
Computes the covariance between each pair of columns and returns the covariance matrix.
val cc1 = RM.computeCovariance
cc1: org.apache.spark.mllib.linalg.Matrix =
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
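The all-ones result can be checked by hand. The sketch below recomputes the covariance matrix in plain Scala, assuming the unbiased estimator that divides by n - 1, which is consistent with the output shown. The name sampleCovariance and the local array data are hypothetical and are not part of MLlib.

```scala
// Same three rows as rdd1 above, held locally for illustration
val data = Array(
  Array(1.0, 2.0, 3.0, 4.0),
  Array(2.0, 3.0, 4.0, 5.0),
  Array(3.0, 4.0, 5.0, 6.0))

// Hypothetical helper (not an MLlib API): sample covariance between every pair of columns.
def sampleCovariance(rows: Array[Array[Double]]): Array[Array[Double]] = {
  val n = rows.length
  val d = rows.head.length
  val means = Array.tabulate(d)(j => rows.map(_(j)).sum / n)   // column means
  Array.tabulate(d, d) { (a, b) =>
    // sum of products of deviations from the mean, divided by n - 1
    rows.map(r => (r(a) - means(a)) * (r(b) - means(b))).sum / (n - 1)
  }
}

val cov = sampleCovariance(data)
```

Every column here is the first column shifted by a constant, so each column's deviations from its mean are (-1, 0, 1) and every covariance equals (1 + 0 + 1) / 2 = 1.0.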
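5) computeGramianMatrix(): Matrix
Computes the Gramian matrix `A^T A`.
Given a real matrix A, the matrix `A^T A` is the Gramian matrix of the column vectors of A, while `A A^T` is the Gramian matrix of the row vectors of A.
val cg1 = RM.computeGramianMatrix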
cg1: org.apache.spark.mllib.linalg.Matrix =
14.0 20.0 26.0 32.0
20.0 29.0 38.0 47.0
26.0 38.0 50.0 62.0
32.0 47.0 62.0 77.0
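The Gramian above can be reproduced with a few lines of plain Scala: entry (p, q) of `A^T A` is the inner product of columns p and q of A. The name gramian and the local array a are hypothetical; this is a sketch of the definition, not MLlib's distributed implementation.

```scala
// Same three rows as rdd1 above, held locally for illustration
val a = Array(
  Array(1.0, 2.0, 3.0, 4.0),
  Array(2.0, 3.0, 4.0, 5.0),
  Array(3.0, 4.0, 5.0, 6.0))

// Hypothetical helper (not an MLlib API): Gramian matrix of the columns of a.
def gramian(a: Array[Array[Double]]): Array[Array[Double]] = {
  val d = a.head.length
  // entry (p, q) sums, over all rows, the product of that row's p-th and q-th values
  Array.tabulate(d, d)((p, q) => a.map(r => r(p) * r(q)).sum)
}

val g = gramian(a)
```

For example, the top-left entry is 1*1 + 2*2 + 3*3 = 14.0, matching the first value of cg1.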
6) computePrincipalComponents(k: Int): Matrix
Computes the top k principal components. Rows are observations and columns are variables.
val pc1 = RM.computePrincipalComponents(3)
pc1: org.apache.spark.mllib.linalg.Matrix =
-0.5000000000000002 0.8660254037844388 1.6653345369377348E-16
-0.5000000000000002 -0.28867513459481275 0.8164965809277258
-0.5000000000000002 -0.28867513459481287 -0.40824829046386296
-0.5000000000000002 -0.28867513459481287 -0.40824829046386296
7) computeSVD(k: Int, computeU: Boolean = false, rCond: Double = 1e-9): SingularValueDecomposition[RowMatrix, Matrix]
Computes the singular value decomposition of the matrix.
val svd = RM.computeSVD(4, true)
val U = svd.U
U.rows.foreach(println)
val s = svd.s
val V = svd.V
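8) multiply(B: Matrix): RowMatrix
Matrix multiplication: multiplies this matrix by a local matrix B on the right.
A.multiply(B) => A * B
9) numCols(): Long
The number of columns in the matrix.
10) numRows(): Long
The number of rows in the matrix.
The matrix can also be accessed as an RDD stored by row, through the rows member used in U.rows above.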
1.3.2 IndexedRowMatrix
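This is the second kind of distributed matrix. An IndexedRowMatrix is very similar to a RowMatrix, except that it carries meaningful row indices. It is backed by an RDD of indexed rows, in which each row is represented by its index (long-typed) and a local vector. An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector), and it can be converted to a RowMatrix by dropping its row indices.
As with RowMatrix, this resembles a data matrix in multivariate statistics. Because each row is a local vector, the number of columns is limited to the Int range, although in practice the column count is usually much smaller.
It is created and used in much the same way as a RowMatrix.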
1.3.3 DistributedMatrix
Represents a distributively stored matrix backed by one or more RDDs.
1.3.4 CoordinateMatrix
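This is the third kind of distributed matrix. A CoordinateMatrix is also a distributed matrix backed by an RDD. As the name suggests, each entry is a tuple (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix is a good choice when the matrix is both very large and very sparse. It is created from an RDD[MatrixEntry] instance, where MatrixEntry is a wrapper over (Long, Long, Double). A CoordinateMatrix can be converted to an IndexedRowMatrix.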
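To illustrate the coordinate format and what a conversion to indexed rows involves conceptually, here is a minimal plain-Scala sketch that groups entries by row index. The Entry case class is a hypothetical stand-in for MatrixEntry, and the grouping runs on a local Seq rather than an RDD, so this shows the idea only and is not MLlib's implementation.

```scala
// Hypothetical stand-in for MLlib's MatrixEntry (not the real class)
case class Entry(i: Long, j: Long, value: Double)

// Three non-zero entries of a large, mostly empty matrix
val entries = Seq(Entry(0L, 1L, 2.0), Entry(2L, 0L, 5.0), Entry(0L, 3L, 1.5))

// Group by row index; each row keeps its (column, value) pairs sorted by column,
// which is the information a sparse row vector would need.
val byRow: Map[Long, Seq[(Long, Double)]] =
  entries.groupBy(_.i).map { case (i, es) =>
    i -> es.map(e => (e.j, e.value)).sortBy(_._1)
  }
```

Rows with no entries, such as row 1 here, simply do not appear in the result, which is why the coordinate format suits very sparse matrices.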
1.3.5 BlockMatrix
/**
* :: Experimental ::
*
* Represents a distributed matrix in blocks of local matrices.
*
* @param blocks The RDD of sub-matrix blocks ((blockRowIndex, blockColIndex), sub-matrix) that
* form this distributed matrix. If multiple blocks with the same index exist, the
* results for operations like add and multiply will be unpredictable.
* @param rowsPerBlock Number of rows that make up each block. The blocks forming the final
* rows are not required to have the given number of rows
* @param colsPerBlock Number of columns that make up each block. The blocks forming the final
* columns are not required to have the given number of columns
* @param nRows Number of rows of this matrix. If the supplied value is less than or equal to zero,
* the number of rows will be calculated when `numRows` is invoked.
* @param nCols Number of columns of this matrix. If the supplied value is less than or equal to
* zero, the number of columns will be calculated when `numCols` is invoked.
*/
A distributed block matrix. See: http://de.wikipedia.org/wiki/Blockmatrix