Spark - Clustering
Official documentation: https://spark.apache.org/docs/2.2.0/ml-clustering.html
This section covers the clustering algorithms in MLlib.
Contents:
- K-means
  - Input columns
  - Output columns
- Latent Dirichlet allocation (LDA)
- Bisecting k-means
- Gaussian Mixture Model (GMM)
  - Input columns
  - Output columns
K-means
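k-means is one of the most commonly used clustering algorithms. It groups data points into a predefined number of clusters, k.
KMeans is implemented as an Estimator and generates a KMeansModel as the base model.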
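To make the idea concrete, here is a minimal pure-Python sketch of the classic Lloyd iteration that k-means is built on. It is an illustration only, not Spark's implementation: the function name `lloyd_kmeans`, its parameters, and the toy 2-D points are all hypothetical, and the starting centers are passed in explicitly instead of being chosen by an initialization scheme.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def lloyd_kmeans(points, centers, max_iter=20):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points.
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(xs) / len(members) for xs in zip(*members)) if members else old
            for members, old in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Final labels: index of the nearest center for every point.
    labels = [min(range(len(centers)), key=lambda i: dist(p, centers[i])) for p in points]
    return centers, labels


# Made-up toy data: two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers, labels = lloyd_kmeans(points, centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The label of each point plays the same role as the integer written to the prediction column: it is the index of the nearest cluster center.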
Input columns
Param name | Type(s) | Default | Description |
---|---|---|---|
featuresCol | Vector | features | Feature vector |
Output columns
Param name | Type(s) | Default | Description |
---|---|---|---|
predictionCol | Int | prediction | Predicted cluster center |
Example
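from pyspark.ml.clustering import KMeans
from pyspark.sql import SparkSession
# Reuses the active session (e.g. in the pyspark shell) or creates a new one.
spark = SparkSession.builder.appName("KMeansExample").getOrCreate()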
# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)
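# Evaluate clustering by computing Within Set Sum of Squared Errors (WSSSE).
# Note: computeCost matches the Spark 2.2.0 API linked above; it was deprecated in
# Spark 2.4 and removed in 3.0, where ClusteringEvaluator is the suggested replacement.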
wssse = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(wssse))
# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
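The WSSSE value printed by the example can be understood with a small pure-Python sketch: for every point, take the squared distance to its nearest center and add everything up. The helper name `wssse` and the toy numbers below are hypothetical and only illustrate the definition; they do not reproduce Spark's code path.

```python
def wssse(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Made-up points and centers, chosen so the arithmetic is easy to follow.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centers = [(0.5, 0.0), (11.0, 0.0)]
print(wssse(points, centers))  # 0.25 + 0.25 + 1.0 + 1.0 = 2.5
```

A lower WSSSE means tighter clusters, but it always decreases as k grows, so it is mainly useful for comparing runs with the same k.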
LDA
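LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model. Expert users can cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.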
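from pyspark.ml.clustering import LDA
from pyspark.sql import SparkSession
# Reuses the active session (e.g. in the pyspark shell) or creates a new one.
spark = SparkSession.builder.appName("LDAExample").getOrCreate()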
# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")
# Trains a LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)
ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))
# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)
# Shows the result
transformed = model.transform(dataset)
transformed.show(truncate=False)
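Two of the quantities in the example above can be illustrated without Spark. The sketch below assumes that describing a topic amounts to ranking its term weights, and that the per-token log perplexity bound is the negated log-likelihood bound divided by the number of tokens in the corpus. Both helper names (`top_terms`, `log_perplexity`) and all numbers are hypothetical.

```python
def top_terms(topic_weights, max_terms):
    """Return (term_index, weight) pairs for the highest-weighted terms of one topic."""
    ranked = sorted(enumerate(topic_weights), key=lambda iw: iw[1], reverse=True)
    return ranked[:max_terms]


def log_perplexity(log_likelihood, token_count):
    """Per-token bound: negated log-likelihood bound divided by the token count."""
    return -log_likelihood / token_count


# Made-up weights of a single topic over a 5-term vocabulary.
topic = [0.05, 0.40, 0.10, 0.30, 0.15]
print(top_terms(topic, 3))          # [(1, 0.4), (3, 0.3), (4, 0.15)]
print(log_perplexity(-800.0, 200))  # 4.0
```

Under that assumption a higher log-likelihood bound gives a lower perplexity bound, so the two printed metrics move in opposite directions when the model improves.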
Bisecting k-means
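Bisecting k-means is a hierarchical clustering algorithm that uses a divisive (top-down) approach: all data points start in a single cluster, and clusters are split recursively until the specified number of clusters is reached.
Bisecting k-means is often faster than regular K-means, but it generally produces a different clustering.
BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.
Compared with K-means, the final result of bisecting K-means is much less sensitive to the choice of initial cluster centers (each split still starts from seeded centers, which is why the example sets a seed). This is one reason the two algorithms usually produce different results.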
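The divisive procedure described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Spark's implementation: it always splits the cluster with the largest sum of squared errors, whereas Spark's own rule for choosing which clusters to split may differ, and every name here (`bisecting_kmeans`, `two_means`, `sse`) is hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def two_means(points, rng, max_iter=20):
    """Split one cluster into two halves with a plain 2-means loop."""
    centers = rng.sample(points, 2)  # two randomly chosen starting centers
    halves = ([], [])
    for _ in range(max_iter):
        halves = ([], [])
        for p in points:
            closer_to_second = sq_dist(p, centers[1]) < sq_dist(p, centers[0])
            halves[int(closer_to_second)].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split; the caller treats the cluster as not divisible
        new_centers = [mean(halves[0]), mean(halves[1])]
        if new_centers == centers:
            break
        centers = new_centers
    return halves


def bisecting_kmeans(points, k, seed=1):
    """Start with one cluster and keep bisecting until there are k clusters."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) >= 2]
        if not candidates:
            break
        target = max(candidates, key=sse)  # simplification: split the loosest cluster
        left, right = two_means(target, rng)
        if not left or not right:
            break
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters


# Made-up 1-D data with three well-separated groups.
data = [(0.0,), (0.1,), (0.2,), (9.0,), (9.1,), (9.2,), (20.0,), (20.1,)]
clusters = bisecting_kmeans(data, k=3)
print(sorted(sorted(c) for c in clusters))
```

Because every split only has to separate one cluster into two, the outcome on well-separated data like this is the same three groups whichever starting points the seed selects.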
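from pyspark.ml.clustering import BisectingKMeans
from pyspark.sql import SparkSession
# Reuses the active session (e.g. in the pyspark shell) or creates a new one.
spark = SparkSession.builder.appName("BisectingKMeansExample").getOrCreate()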
# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)
# Evaluate clustering.
cost = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(cost))
# Shows the result.
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)
Gaussian Mixture Model(GMM)
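A Gaussian Mixture Model represents a composite distribution in which each point is drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.ml implementation uses the expectation-maximization (EM) algorithm to induce the maximum-likelihood model from the given data.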
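The EM procedure mentioned above can be sketched for the simplest case, a mixture of 1-D Gaussians. This is a hedged illustration only: Spark fits multivariate Gaussians, while the hypothetical `em_gmm_1d` below handles scalars, takes its starting parameters explicitly, and clamps variances to a small floor to stay numerically safe.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, means, variances, weights, n_iter=50, min_var=1e-6):
    """Run a fixed number of EM iterations and return (weights, means, variances)."""
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate each component from its responsibilities.
        for j in range(len(means)):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances


# Made-up data: two tight groups around 0.1 and 9.1.
xs = [0.0, 0.1, 0.2, 9.0, 9.1, 9.2]
weights, means, variances = em_gmm_1d(
    xs, means=[1.0, 8.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print([round(w, 3) for w in weights], [round(m, 3) for m in means])
```

The per-point responsibilities computed in the E-step correspond to the per-cluster probabilities that a fitted mixture model reports for each row.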
Input columns
Param name | Type(s) | Default | Description |
---|---|---|---|
featuresCol | Vector | features | Feature vector |
Output columns
Param name | Type(s) | Default | Description |
---|---|---|---|
predictionCol | Int | prediction | Predicted cluster center |
probabilityCol | Vector | probability | Probability of each cluster |
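The relationship between the two output columns can be shown with a tiny sketch: assuming the prediction is the index of the largest entry in the probability vector, each row's cluster follows directly from its probabilities. The helper `predict_cluster` and the sample vectors are hypothetical.

```python
def predict_cluster(probability):
    """Index of the most probable cluster for one row's probability vector."""
    return max(range(len(probability)), key=probability.__getitem__)


# Made-up probability vectors for three rows and k = 2 clusters.
rows = [[0.98, 0.02], [0.10, 0.90], [0.50, 0.50]]
print([predict_cluster(p) for p in rows])  # [0, 1, 0]
```

In this sketch a tie resolves to the lower index; keeping the full probability vector is what lets downstream code detect such ambiguous rows.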
Example
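from pyspark.ml.clustering import GaussianMixture
from pyspark.sql import SparkSession
# Reuses the active session (e.g. in the pyspark shell) or creates a new one.
spark = SparkSession.builder.appName("GaussianMixtureExample").getOrCreate()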
# loads data
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(dataset)
print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)