chapter8：聚類－－－群組發現

CopperDong發表於2017-10-06

APT聚類

一、k-means聚類

　　指定最後聚成的簇的數目，比如，將１０００人聚成５組

二、層次聚類

　　不需要指定最終聚成的簇的數目。取而代之的是，演算法一開始將每個例項看成一個簇，然後在演算法的每次迭代中都將最相似的兩個簇合並在一起，該過程不斷重複直到只剩下一個簇為止。

三、單連線聚類

兩個簇的距離定義為一個簇的所有成員到另一個簇的所有成員之間最短的那個距離。

四、全連線聚類

　　兩個簇的距離定義為一個簇的所有成員到另一個簇的所有成員之間最長的那個距離

五、平均連線聚類

　　兩個簇的距離定義為一個簇的所有成員到另一個簇的所有成員之間的平均距離

六、程式設計實現一個層次聚類演算法

from queue import PriorityQueue
import math


"""
Example code for hierarchical clustering
"""

def getMedian(alist):
    """get median value of list alist"""
    tmp = list(alist)
    tmp.sort()
    alen = len(tmp)
    if (alen % 2) == 1:
        return tmp[alen // 2]
    else:
        return (tmp[alen // 2] + tmp[(alen // 2) - 1]) / 2
    

def normalizeColumn(column):
    """Normalize column using Modified Standard Score"""
    median = getMedian(column)
    asd = sum([abs(x - median) for x in column]) / len(column)
    result = [(x - median) / asd for x in column]
    return result

class hClusterer:
    """ this clusterer assumes that the first column of the data is a label
    not used in the clustering. The other columns contain numeric data"""
    
    def __init__(self, filename):
        file = open(filename)
        self.data = {}
        self.counter = 0
        self.queue = PriorityQueue()
        lines = file.readlines()
        file.close()
        header = lines[0].split(',')
        self.cols = len(header)
        self.data = [[] for i in range(len(header))]
        for line in lines[1:]:
            cells = line.split(',')
            toggle = 0
            for cell in range(self.cols):
                if toggle == 0:
                   self.data[cell].append(cells[cell])
                   toggle = 1
                else:
                    self.data[cell].append(float(cells[cell]))
        # now normalize number columns (that is, skip the first column)
        for i in range(1, self.cols):
                self.data[i] = normalizeColumn(self.data[i])

        ###
        ###  I have read in the data and normalized the 
        ###  columns. Now for each element i in the data, I am going to
        ###     1. compute the Euclidean Distance from element i to all the 
        ###        other elements.  This data will be placed in neighbors,
        ###        which is a Python dictionary. Let's say i = 1, and I am 
        ###        computing the distance to the neighbor j and let's say j 
        ###        is 2. The neighbors dictionary for i will look like
        ###        {2: ((1,2), 1.23),  3: ((1, 3), 2.3)... }
        ###
        ###     2. find the closest neighbor
        ###
        ###     3. place the element on a priority queue, called simply queue,
        ###        based on the distance to the nearest neighbor (and a counter
        ###        used to break ties.



        # now push distances on queue        
        rows = len(self.data[0])              

        for i in range(rows):
            minDistance = 99999
            nearestNeighbor = 0
            neighbors = {}
            for j in range(rows):
                if i != j:
                    dist = self.distance(i, j)
                    if i < j:
                        pair = (i,j)
                    else:
                        pair = (j,i)
                    neighbors[j] = (pair, dist)
                    if dist < minDistance:
                        minDistance = dist
                        nearestNeighbor = j
                        nearestNum = j
            # create nearest Pair
            if i < nearestNeighbor:
                nearestPair = (i, nearestNeighbor)
            else:
                nearestPair = (nearestNeighbor, i)
                
            # put instance on priority queue    
            self.queue.put((minDistance, self.counter,
                            [[self.data[0][i]], nearestPair, neighbors]))
            self.counter += 1
    

    def distance(self, i, j):
        sumSquares = 0
        for k in range(1, self.cols):
            sumSquares += (self.data[k][i] - self.data[k][j])**2
        return math.sqrt(sumSquares)
            

    def cluster(self):
         done = False
         while not done:
             topOne = self.queue.get()
             nearestPair = topOne[2][1]
             if not self.queue.empty():
                 nextOne = self.queue.get()
                 nearPair = nextOne[2][1]
                 tmp = []
                 ##
                 ##  I have just popped two elements off the queue,
                 ##  topOne and nextOne. I need to check whether nextOne
                 ##  is topOne's nearest neighbor and vice versa.
                 ##  If not, I will pop another element off the queue
                 ##  until I find topOne's nearest neighbor. That is what
                 ##  this while loop does.
                 ##

                 while nearPair != nearestPair:
                     tmp.append((nextOne[0], self.counter, nextOne[2]))
                     self.counter += 1
                     nextOne = self.queue.get()
                     nearPair = nextOne[2][1]
                 ##
                 ## this for loop pushes the elements I popped off in the
                 ## above while loop.
                 ##                 
                 for item in tmp:
                     self.queue.put(item)
                     
                 if len(topOne[2][0]) == 1:
                    item1 = topOne[2][0][0]
                 else:
                     item1 = topOne[2][0]
                 if len(nextOne[2][0]) == 1:
                    item2 = nextOne[2][0][0]
                 else:
                     item2 = nextOne[2][0]
                 ##  curCluster is, perhaps obviously, the new cluster
                 ##  which combines cluster item1 with cluster item2.
                 curCluster = (item1, item2)

                 ## Now I am doing two things. First, finding the nearest
                 ## neighbor to this new cluster. Second, building a new
                 ## neighbors list by merging the neighbors lists of item1
                 ## and item2. If the distance between item1 and element 23
                 ## is 2 and the distance betweeen item2 and element 23 is 4
                 ## the distance between element 23 and the new cluster will
                 ## be 2 (i.e., the shortest distance).
                 ##

                 minDistance = 99999
                 nearestPair = ()
                 nearestNeighbor = ''
                 merged = {}
                 nNeighbors = nextOne[2][2]
                 for (key, value) in topOne[2][2].items():
                    if key in nNeighbors:
                        if nNeighbors[key][1] < value[1]:
                             dist =  nNeighbors[key]
                        else:
                            dist = value
                        if dist[1] < minDistance:
                             minDistance =  dist[1]
                             nearestPair = dist[0]
                             nearestNeighbor = key
                        merged[key] = dist
                    
                 if merged == {}:
                    return curCluster
                 else:
                    self.queue.put( (minDistance, self.counter,
                                     [curCluster, nearestPair, merged]))
                    self.counter += 1
                               
                        
                         


def printDendrogram(T, sep=3):
    """Print dendrogram of a binary tree.  Each tree node is represented by a
    length-2 tuple. printDendrogram is written and provided by David Eppstein
    2002. Accessed on 14 April 2014:
    http://code.activestate.com/recipes/139422-dendrogram-drawing/ """
	
    def isPair(T):
        return type(T) == tuple and len(T) == 2
    
    def maxHeight(T):
        if isPair(T):
            h = max(maxHeight(T[0]), maxHeight(T[1]))
        else:
            h = len(str(T))
        return h + sep
        
    activeLevels = {}

    def traverse(T, h, isFirst):
        if isPair(T):
            traverse(T[0], h-sep, 1)
            s = [' ']*(h-sep)
            s.append('|')
        else:
            s = list(str(T))
            s.append(' ')

        while len(s) < h:
            s.append('-')
        
        if (isFirst >= 0):
            s.append('+')
            if isFirst:
                activeLevels[h] = 1
            else:
                del activeLevels[h]
        
        A = list(activeLevels)
        A.sort()
        for L in A:
            if len(s) < L:
                while len(s) < L:
                    s.append(' ')
                s.append('|')

        print (''.join(s))    
        
        if isPair(T):
            traverse(T[1], h-sep, 0)

    traverse(T, maxHeight(T), -1)




filename = '//Users/raz/Dropbox/guide/data/dogs.csv'

hg = hClusterer(filename)
cluster = hg.cluster()
printDendrogram(cluster)

七、試試看

　　該資料集從卡內基梅隆大學（ＣＭＵ）下載，地址為：http://lib.stat.cmu.edu/DASL/Datafiles/Ceraals.html

k-means聚類是最流行的聚類演算法的

　　　１、選擇ｋ個隨機例項作為初始中心點；

　　　２、ＲＥＰＥＡＴ

　　　３、將每個例項分配給最近的中心點從而形成ｋ個簇；

　　　４、通過計算每個簇的平均點來更新中心點

　　　５、ＵＮＴＩＬ中心點不再改變或改變不大

　　先放一下k-means聚類

　爬山法

　　　無法保證會達到全域性最優的結果　　　　　　　　

　ＳＳＥ或散度

　　　為度量某個簇集合的質量，我們使用誤差平方和(sum of the squared error）或者稱為散度的指標

　　　其計算過程如下：對每個點，計算它到屬於它的中心點之間的距離平方，然後將這些距離平方加起來求和

　　　假定我們在同一資料集上執行兩次k-means演算法，每次選擇的隨機初始點不同。那麼是第一次還是第二次得到的結果更好呢？為回答這個問題，我們計算兩次聚類結果的ＳＳＥ值，結果顯示ＳＳＥ值小的那個結果更好。

　　以下開始k-means

import math
import random 


"""
Implementation of the K-means algorithm
for the book A Programmer's Guide to Data Mining"
http://www.guidetodatamining.com

"""

def getMedian(alist):
    """get median of list"""
    tmp = list(alist)
    tmp.sort()
    alen = len(tmp)
    if (alen % 2) == 1:
        return tmp[alen // 2]
    else:
        return (tmp[alen // 2] + tmp[(alen // 2) - 1]) / 2
    

def normalizeColumn(column):
    """normalize the values of a column using Modified Standard Score
    that is (each value - median) / (absolute standard deviation)"""
    median = getMedian(column)
    asd = sum([abs(x - median) for x in column]) / len(column)
    result = [(x - median) / asd for x in column]
    return result


class kClusterer:
    """ Implementation of kMeans Clustering
    This clusterer assumes that the first column of the data is a label
    not used in the clustering. The other columns contain numeric data
    """
    
    def __init__(self, filename, k):
        """ k is the number of clusters to make
        This init method:
           1. reads the data from the file named filename
           2. stores that data by column in self.data
           3. normalizes the data using Modified Standard Score
           4. randomly selects the initial centroids
           5. assigns points to clusters associated with those centroids
        """
        file = open(filename)
        self.data = {}
        self.k = k
        self.counter = 0
        self.iterationNumber = 0
        # used to keep track of % of points that change cluster membership
        # in an iteration
        self.pointsChanged = 0
        # Sum of Squared Error
        self.sse = 0
        #
        # read data from file
        #
        lines = file.readlines()
        file.close()
        header = lines[0].split(',')
        self.cols = len(header)
        self.data = [[] for i in range(len(header))]
        # we are storing the data by column.
        # For example, self.data[0] is the data from column 0.
        # self.data[0][10] is the column 0 value of item 10.
        for line in lines[1:]:
            cells = line.split(',')
            toggle = 0
            for cell in range(self.cols):
                if toggle == 0:
                   self.data[cell].append(cells[cell])
                   toggle = 1
                else:
                    self.data[cell].append(float(cells[cell]))
                    
        self.datasize = len(self.data[1])
        self.memberOf = [-1 for x in range(len(self.data[1]))]
        #
        # now normalize number columns
        #
        for i in range(1, self.cols):
                self.data[i] = normalizeColumn(self.data[i])

        # select random centroids from existing points
        random.seed()
        self.centroids = [[self.data[i][r]  for i in range(1, len(self.data))]
                           for r in random.sample(range(len(self.data[0])),
                                                 self.k)]
        self.assignPointsToCluster()

            

    def updateCentroids(self):
        """Using the points in the clusters, determine the centroid
        (mean point) of each cluster"""
        members = [self.memberOf.count(i) for i in range(len(self.centroids))]
        self.centroids = [[sum([self.data[k][i]
                                for i in range(len(self.data[0]))
                                if self.memberOf[i] == centroid])/members[centroid]
                           for k in range(1, len(self.data))]
                          for centroid in range(len(self.centroids))] 
            
        
    
    def assignPointToCluster(self, i):
        """ assign point to cluster based on distance from centroids"""
        min = 999999
        clusterNum = -1
        for centroid in range(self.k):
            dist = self.euclideanDistance(i, centroid)
            if dist < min:
                min = dist
                clusterNum = centroid
        # here is where I will keep track of changing points
        if clusterNum != self.memberOf[i]:
            self.pointsChanged += 1
        # add square of distance to running sum of squared error
        self.sse += min**2
        return clusterNum

    def assignPointsToCluster(self):
        """ assign each data point to a cluster"""
        self.pointsChanged = 0
        self.sse = 0
        self.memberOf = [self.assignPointToCluster(i)
                         for i in range(len(self.data[1]))]
        

        
    def euclideanDistance(self, i, j):
        """ compute distance of point i from centroid j"""
        sumSquares = 0
        for k in range(1, self.cols):
            sumSquares += (self.data[k][i] - self.centroids[j][k-1])**2
        return math.sqrt(sumSquares)

    def kCluster(self):
        """the method that actually performs the clustering
        As you can see this method repeatedly
            updates the centroids by computing the mean point of each cluster
            re-assign the points to clusters based on these new centroids
        until the number of points that change cluster membership is less than 1%.
        """
        done = False
 
        while not done:
            self.iterationNumber += 1
            self.updateCentroids()
            self.assignPointsToCluster()
            #
            # we are done if fewer than 1% of the points change clusters
            #
            if float(self.pointsChanged) / len(self.memberOf) <  0.01:
                done = True
        print("Final SSE: %f" % self.sse)

    def showMembers(self):
        """Display the results"""
        for centroid in range(len(self.centroids)):
             print ("\n\nClass %i\n========" % centroid)
             for name in [self.data[0][i]  for i in range(len(self.data[0]))
                          if self.memberOf[i] == centroid]:
                 print (name)
        
##
## RUN THE K-MEANS CLUSTERER ON THE DOG DATA USING K = 3
###
# change the path in the following to match where dogs.csv is on your machine
km = kClusterer('../../data/dogs.csv', 3)
km.kCluster()
km.showMembers()

試試看更多的資料集　　

譜聚類的python實現
2020-08-23
聚類Python
MVO優化DBSCAN實現聚類
2020-11-02
優化聚類
C均值聚類 C實現 Python實現
2020-12-05
聚類Python
聚類分析
2024-03-20
聚類
聚類(part3)--高階聚類演算法
2020-10-11
聚類演算法
聚類之K均值聚類和EM演算法
2019-05-13
聚類演算法
群俠匯聚經典再現《新射鵰群俠傳》今日全平臺震撼上線
2020-08-27
MMM全連結聚類演算法實現
2024-05-25
聚類演算法
聚類演算法與K-means實現
2021-09-08
聚類演算法
NLPIR平臺的文字聚類模組完美契合行業需求
2019-11-21
聚類行業
怎麼實現iMessage群發
2024-03-31
【scipy 基礎】--聚類
2023-11-01
聚類
聚類演算法
2020-04-26
聚類演算法
k-means聚類
2023-01-30
聚類
09聚類演算法-層次聚類-CF-Tree、BIRCH、CURE
2018-12-11
聚類演算法
04聚類演算法-程式碼案例一-K-means聚類
2018-12-08
聚類演算法
聚類分析-案例：客戶特徵的聚類與探索性分析
2020-09-28
聚類特徵
[開發教程]第22講：Bootstrap按鈕群組
2019-05-11
boot
unit3 文字聚類
2018-05-11
聚類
譜聚類原理總結
2022-01-18
聚類
密度聚類。Clustering by fast search and
2021-09-09
聚類AST
python手動實現K_means聚類(指定K值）
2020-11-19
Python聚類
推薦系統中的產品聚類：一種文字聚類的方法
2020-01-02
聚類
程式設計實現DBSCAN密度聚類演算法，並以西瓜資料集4.0為例進行聚類效果分析
2022-12-01
程式設計聚類演算法
【Python機器學習實戰】聚類演算法（1）——K-Means聚類
2021-12-06
Python機器學習聚類演算法
利用python的KMeans和PCA包實現聚類演算法
2019-09-15
PythonPCA聚類演算法
Spark構建聚類模型（二）
2018-12-11
Spark聚類模型
聚類演算法綜述
2018-12-09
聚類演算法
sklearn建模及評估（聚類）
2019-09-03
聚類
OPTICS聚類演算法原理
2020-05-14
聚類演算法
非完整資料聚類初探
2021-06-10
聚類
初探DBSCAN聚類演算法
2021-05-22
聚類演算法
資料探勘-層次聚類
2020-12-02
聚類
14聚類演算法-程式碼案例六-譜聚類(SC)演算法案例
2018-12-16
聚類演算法
FCM聚類演算法詳解(Python實現iris資料集)
2019-02-27
聚類演算法Python
可伸縮聚類演算法綜述（可伸縮聚類演算法開篇）
2018-10-30
聚類演算法
【Python機器學習實戰】聚類演算法（2）——層次聚類(HAC)和DBSCAN
2021-12-16
Python機器學習聚類演算法
CentOS 使用者與群組
2021-10-25
CentOS
前端架構思想：聚類分層
2018-10-19
前端架構聚類

chapter8：聚類－－－群組發現

一、k-means聚類

二、層次聚類

三、單連線聚類

四、全連線聚類

五、平均連線聚類

六、程式設計實現一個層次聚類演算法

七、試試看

爬山法

ＳＳＥ或散度

以下開始k-means

試試看更多的資料集

相關文章

　爬山法

　ＳＳＥ或散度

　　以下開始k-means

試試看更多的資料集