Chapter 8: Clustering --- Discovering Groups
1. k-means clustering
With k-means you specify in advance the number of clusters to form; for example, clustering 1,000 people into 5 groups.
2. Hierarchical clustering
There is no need to specify the final number of clusters. Instead, the algorithm starts by treating every instance as its own cluster; on each iteration it merges the two most similar clusters, and this repeats until only a single cluster remains.
3. Single-linkage clustering
The distance between two clusters is defined as the shortest distance from any member of one cluster to any member of the other.
4. Complete-linkage clustering
The distance between two clusters is defined as the longest distance from any member of one cluster to any member of the other.
5. Average-linkage clustering
The distance between two clusters is defined as the average distance from the members of one cluster to the members of the other. (A small sketch of all three linkage criteria follows.)
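To make the three definitions concrete, here is a minimal sketch (illustrative only; these function names are not from the book's code). The criteria differ solely in how they reduce the set of pairwise member-to-member distances between two clusters:

import math

def euclidean(a, b):
    """Euclidean distance between two numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def clusterDistance(clusterA, clusterB, method='single'):
    """Distance between two clusters (lists of points) under the three
    linkage criteria: 'single', 'complete', or 'average'."""
    dists = [euclidean(a, b) for a in clusterA for b in clusterB]
    if method == 'single':        # shortest member-to-member distance
        return min(dists)
    elif method == 'complete':    # longest member-to-member distance
        return max(dists)
    else:                         # 'average': mean of all pairwise distances
        return sum(dists) / len(dists)

# e.g. clusterDistance([(0, 0), (1, 0)], [(3, 0), (5, 0)], 'single') -> 2.0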
6. Implementing a hierarchical clustering algorithm
from queue import PriorityQueue
import math

"""
Example code for hierarchical clustering
"""

def getMedian(alist):
    """get median value of list alist"""
    tmp = list(alist)
    tmp.sort()
    alen = len(tmp)
    if (alen % 2) == 1:
        return tmp[alen // 2]
    else:
        return (tmp[alen // 2] + tmp[(alen // 2) - 1]) / 2


def normalizeColumn(column):
    """Normalize column using Modified Standard Score"""
    median = getMedian(column)
    asd = sum([abs(x - median) for x in column]) / len(column)
    result = [(x - median) / asd for x in column]
    return result


class hClusterer:
    """ this clusterer assumes that the first column of the data is a label
    not used in the clustering. The other columns contain numeric data"""

    def __init__(self, filename):
        file = open(filename)
        self.data = {}
        self.counter = 0
        self.queue = PriorityQueue()
        lines = file.readlines()
        file.close()
        header = lines[0].split(',')
        self.cols = len(header)
        self.data = [[] for i in range(len(header))]
        for line in lines[1:]:
            cells = line.split(',')
            toggle = 0
            for cell in range(self.cols):
                if toggle == 0:
                    self.data[cell].append(cells[cell])
                    toggle = 1
                else:
                    self.data[cell].append(float(cells[cell]))
        # now normalize number columns (that is, skip the first column)
        for i in range(1, self.cols):
            self.data[i] = normalizeColumn(self.data[i])

        ###
        ### I have read in the data and normalized the
        ### columns. Now for each element i in the data, I am going to
        ### 1. compute the Euclidean Distance from element i to all the
        ###    other elements. This data will be placed in neighbors,
        ###    which is a Python dictionary. Let's say i = 1, and I am
        ###    computing the distance to the neighbor j and let's say j
        ###    is 2. The neighbors dictionary for i will look like
        ###    {2: ((1,2), 1.23), 3: ((1, 3), 2.3)... }
        ###
        ### 2. find the closest neighbor
        ###
        ### 3. place the element on a priority queue, called simply queue,
        ###    based on the distance to the nearest neighbor (and a counter
        ###    used to break ties).

        # now push distances on queue
        rows = len(self.data[0])
        for i in range(rows):
            minDistance = 99999
            nearestNeighbor = 0
            neighbors = {}
            for j in range(rows):
                if i != j:
                    dist = self.distance(i, j)
                    if i < j:
                        pair = (i, j)
                    else:
                        pair = (j, i)
                    neighbors[j] = (pair, dist)
                    if dist < minDistance:
                        minDistance = dist
                        nearestNeighbor = j
            # create nearest Pair
            if i < nearestNeighbor:
                nearestPair = (i, nearestNeighbor)
            else:
                nearestPair = (nearestNeighbor, i)
            # put instance on priority queue
            self.queue.put((minDistance, self.counter,
                            [[self.data[0][i]], nearestPair, neighbors]))
            self.counter += 1

    def distance(self, i, j):
        sumSquares = 0
        for k in range(1, self.cols):
            sumSquares += (self.data[k][i] - self.data[k][j])**2
        return math.sqrt(sumSquares)

    def cluster(self):
        done = False
        while not done:
            topOne = self.queue.get()
            nearestPair = topOne[2][1]
            if not self.queue.empty():
                nextOne = self.queue.get()
                nearPair = nextOne[2][1]
                tmp = []
                ##
                ## I have just popped two elements off the queue,
                ## topOne and nextOne. I need to check whether nextOne
                ## is topOne's nearest neighbor and vice versa.
                ## If not, I will pop another element off the queue
                ## until I find topOne's nearest neighbor. That is what
                ## this while loop does.
                ##
                while nearPair != nearestPair:
                    tmp.append((nextOne[0], self.counter, nextOne[2]))
                    self.counter += 1
                    nextOne = self.queue.get()
                    nearPair = nextOne[2][1]
                ##
                ## this for loop pushes back the elements I popped off in
                ## the above while loop.
                ##
                for item in tmp:
                    self.queue.put(item)

                if len(topOne[2][0]) == 1:
                    item1 = topOne[2][0][0]
                else:
                    item1 = topOne[2][0]
                if len(nextOne[2][0]) == 1:
                    item2 = nextOne[2][0][0]
                else:
                    item2 = nextOne[2][0]

                ## curCluster is, perhaps obviously, the new cluster
                ## which combines cluster item1 with cluster item2.
                curCluster = (item1, item2)

                ## Now I am doing two things. First, finding the nearest
                ## neighbor to this new cluster. Second, building a new
                ## neighbors list by merging the neighbors lists of item1
                ## and item2. If the distance between item1 and element 23
                ## is 2 and the distance between item2 and element 23 is 4,
                ## the distance between element 23 and the new cluster will
                ## be 2 (i.e., the shortest distance; this clusterer is
                ## single-linkage).
                ##
                minDistance = 99999
                nearestPair = ()
                nearestNeighbor = ''
                merged = {}
                nNeighbors = nextOne[2][2]
                for (key, value) in topOne[2][2].items():
                    if key in nNeighbors:
                        if nNeighbors[key][1] < value[1]:
                            dist = nNeighbors[key]
                        else:
                            dist = value
                        if dist[1] < minDistance:
                            minDistance = dist[1]
                            nearestPair = dist[0]
                            nearestNeighbor = key
                        merged[key] = dist
                if merged == {}:
                    return curCluster
                else:
                    self.queue.put((minDistance, self.counter,
                                    [curCluster, nearestPair, merged]))
                    self.counter += 1


def printDendrogram(T, sep=3):
    """Print dendrogram of a binary tree. Each tree node is represented by a
    length-2 tuple. printDendrogram is written and provided by David Eppstein
    2002. Accessed on 14 April 2014:
    http://code.activestate.com/recipes/139422-dendrogram-drawing/ """

    def isPair(T):
        return type(T) == tuple and len(T) == 2

    def maxHeight(T):
        if isPair(T):
            h = max(maxHeight(T[0]), maxHeight(T[1]))
        else:
            h = len(str(T))
        return h + sep

    activeLevels = {}

    def traverse(T, h, isFirst):
        if isPair(T):
            traverse(T[0], h - sep, 1)
            s = [' '] * (h - sep)
            s.append('|')
        else:
            s = list(str(T))
            s.append(' ')

        while len(s) < h:
            s.append('-')

        if (isFirst >= 0):
            s.append('+')
            if isFirst:
                activeLevels[h] = 1
            else:
                del activeLevels[h]

        A = list(activeLevels)
        A.sort()
        for L in A:
            if len(s) < L:
                while len(s) < L:
                    s.append(' ')
                s.append('|')

        print(''.join(s))

        if isPair(T):
            traverse(T[1], h - sep, 0)

    traverse(T, maxHeight(T), -1)


# change the path in the following line to match where dogs.csv is on your machine
filename = '//Users/raz/Dropbox/guide/data/dogs.csv'

hg = hClusterer(filename)
cluster = hg.cluster()
printDendrogram(cluster)
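As a quick sanity check of the Modified Standard Score normalization used above (an illustrative check, not part of the book's listing): for the toy column [2, 4, 4, 6], the median is 4 and the absolute standard deviation is (2 + 0 + 0 + 2) / 4 = 1, so each value should map to its signed distance from the median:

print(normalizeColumn([2, 4, 4, 6]))   # -> [-2.0, 0.0, 0.0, 2.0]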
7. Give it a try
The dataset can be downloaded from Carnegie Mellon University (CMU) at: http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html
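Assuming you save that data locally in the same label-first CSV format as dogs.csv (the filename cereal.csv below is hypothetical; use whatever name you saved it under), the same clusterer runs unchanged:

hg = hClusterer('cereal.csv')   # hypothetical local filename
cluster = hg.cluster()
printDendrogram(cluster)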
k-means is the most popular clustering algorithm. It proceeds as follows (a worked toy iteration follows the steps):
1. Select k random instances to be the initial centroids.
2. REPEAT
3. Assign each instance to its nearest centroid, forming k clusters.
4. Update each centroid by computing the mean point of its cluster.
5. UNTIL the centroids stop changing, or change very little.
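To make steps 3 and 4 concrete, here is one assignment-and-update iteration on a toy 2-D dataset (a hedged sketch; none of these names appear in the book's code):

import math

# toy 2-D dataset and two initial centroids (step 1)
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids = [(1, 1), (9, 8)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# step 3: assign each point to its nearest centroid
memberOf = [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]
print(memberOf)        # -> [0, 0, 1, 1]

# step 4: move each centroid to the mean point of its cluster
centroids = [tuple(sum(p[d] for p, m in zip(points, memberOf) if m == c) /
                   memberOf.count(c)
                   for d in range(2))
             for c in range(len(centroids))]
print(centroids)       # -> [(1.0, 1.5), (8.5, 8.0)]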
Before returning to the k-means code, note that k-means is a hill-climbing algorithm: like a climber who always takes the next step uphill, it can get stuck on a local peak, so it cannot guarantee reaching the globally optimal clustering.
SSE, or scatter
To measure the quality of a set of clusters, we use the sum of the squared error (SSE), also called scatter. It is computed as follows: for each point, square the distance from that point to its centroid, then add up all of those squared distances.
Suppose we run k-means twice on the same dataset, choosing different random initial centroids each time. Is the first result or the second one better? To answer that, we compute the SSE of both clusterings; the result with the smaller SSE is the better one.
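As a minimal sketch of that computation (the function and variable names here are illustrative, not part of the book's clusterer), SSE can be computed from the points, the centroids, and a cluster-membership list:

import math

def computeSSE(points, centroids, memberOf):
    """Sum of squared error: for each point, square the distance to the
    centroid of the cluster it belongs to, and total those squares."""
    total = 0.0
    for i, point in enumerate(points):
        centroid = centroids[memberOf[i]]
        dist = math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))
        total += dist ** 2
    return total

# two clusters: points 0 and 1 belong to centroid 0, point 2 to centroid 1
print(computeSSE([(1, 1), (1, 3), (9, 9)], [(1, 2), (9, 9)], [0, 0, 1]))  # 2.0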
Now for the k-means implementation itself:
import math
import random

"""
Implementation of the K-means algorithm
for the book "A Programmer's Guide to Data Mining"
http://www.guidetodatamining.com
"""

def getMedian(alist):
    """get median of list"""
    tmp = list(alist)
    tmp.sort()
    alen = len(tmp)
    if (alen % 2) == 1:
        return tmp[alen // 2]
    else:
        return (tmp[alen // 2] + tmp[(alen // 2) - 1]) / 2


def normalizeColumn(column):
    """normalize the values of a column using Modified Standard Score
    that is (each value - median) / (absolute standard deviation)"""
    median = getMedian(column)
    asd = sum([abs(x - median) for x in column]) / len(column)
    result = [(x - median) / asd for x in column]
    return result


class kClusterer:
    """ Implementation of kMeans Clustering
    This clusterer assumes that the first column of the data is a label
    not used in the clustering. The other columns contain numeric data
    """

    def __init__(self, filename, k):
        """ k is the number of clusters to make
        This init method:
        1. reads the data from the file named filename
        2. stores that data by column in self.data
        3. normalizes the data using Modified Standard Score
        4. randomly selects the initial centroids
        5. assigns points to clusters associated with those centroids
        """
        file = open(filename)
        self.data = {}
        self.k = k
        self.counter = 0
        self.iterationNumber = 0
        # used to keep track of % of points that change cluster membership
        # in an iteration
        self.pointsChanged = 0
        # Sum of Squared Error
        self.sse = 0
        #
        # read data from file
        #
        lines = file.readlines()
        file.close()
        header = lines[0].split(',')
        self.cols = len(header)
        self.data = [[] for i in range(len(header))]
        # we are storing the data by column.
        # For example, self.data[0] is the data from column 0.
        # self.data[0][10] is the column 0 value of item 10.
        for line in lines[1:]:
            cells = line.split(',')
            toggle = 0
            for cell in range(self.cols):
                if toggle == 0:
                    self.data[cell].append(cells[cell])
                    toggle = 1
                else:
                    self.data[cell].append(float(cells[cell]))
        self.datasize = len(self.data[1])
        self.memberOf = [-1 for x in range(len(self.data[1]))]
        #
        # now normalize number columns
        #
        for i in range(1, self.cols):
            self.data[i] = normalizeColumn(self.data[i])

        # select random centroids from existing points
        random.seed()
        self.centroids = [[self.data[i][r] for i in range(1, len(self.data))]
                          for r in random.sample(range(len(self.data[0])),
                                                 self.k)]
        self.assignPointsToCluster()

    def updateCentroids(self):
        """Using the points in the clusters, determine the centroid
        (mean point) of each cluster"""
        members = [self.memberOf.count(i) for i in range(len(self.centroids))]
        self.centroids = [[sum([self.data[k][i]
                                for i in range(len(self.data[0]))
                                if self.memberOf[i] == centroid]) / members[centroid]
                           for k in range(1, len(self.data))]
                          for centroid in range(len(self.centroids))]

    def assignPointToCluster(self, i):
        """ assign point to cluster based on distance from centroids"""
        minDistance = 999999
        clusterNum = -1
        for centroid in range(self.k):
            dist = self.euclideanDistance(i, centroid)
            if dist < minDistance:
                minDistance = dist
                clusterNum = centroid
        # here is where I will keep track of changing points
        if clusterNum != self.memberOf[i]:
            self.pointsChanged += 1
        # add square of distance to running sum of squared error
        self.sse += minDistance**2
        return clusterNum

    def assignPointsToCluster(self):
        """ assign each data point to a cluster"""
        self.pointsChanged = 0
        self.sse = 0
        self.memberOf = [self.assignPointToCluster(i)
                         for i in range(len(self.data[1]))]

    def euclideanDistance(self, i, j):
        """ compute distance of point i from centroid j"""
        sumSquares = 0
        for k in range(1, self.cols):
            sumSquares += (self.data[k][i] - self.centroids[j][k-1])**2
        return math.sqrt(sumSquares)

    def kCluster(self):
        """the method that actually performs the clustering
        As you can see this method repeatedly
        updates the centroids by computing the mean point of each cluster and
        re-assigns the points to clusters based on these new centroids,
        until the number of points that change cluster membership is less than 1%.
        """
        done = False
        while not done:
            self.iterationNumber += 1
            self.updateCentroids()
            self.assignPointsToCluster()
            #
            # we are done if fewer than 1% of the points change clusters
            #
            if float(self.pointsChanged) / len(self.memberOf) < 0.01:
                done = True
        print("Final SSE: %f" % self.sse)

    def showMembers(self):
        """Display the results"""
        for centroid in range(len(self.centroids)):
            print("\n\nClass %i\n========" % centroid)
            for name in [self.data[0][i] for i in range(len(self.data[0]))
                         if self.memberOf[i] == centroid]:
                print(name)


##
## RUN THE K-MEANS CLUSTERER ON THE DOG DATA USING K = 3
##
# change the path in the following to match where dogs.csv is on your machine
km = kClusterer('../../data/dogs.csv', 3)
km.kCluster()
km.showMembers()
Try it on more datasets.