1. Community detection
(Newman and Girvan 2004) A community is a subgraph containing nodes which are more densely linked to each other than to the rest of the graph or, equivalently, a graph has a community structure if the number of links inside each subgraph is higher than the number of links between those subgraphs.
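1. That is, we first need to confirm that the business scenario we want to solve shows a clear clustering pattern: the nodes (which may be abstract) form some kind of group structure instead of being scattered randomly without any pattern;
2. At the same time, this clustering structure has to be "meaningful", in the sense that the clustering itself can be translated into some higher-level business behavior.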
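Intuitively, the general goal of community detection is to find the "blocks" (clusters) or "communities" in a network.
Consider another example, a word association network, i.e. a network built from word associations and collocations:
This network grows outward from the word "bright" and eventually forms four groups: Colors, Light, Astronomy & Intelligence.
These four labels summarize the characteristics of their communities fairly well (somewhat like clustering); in addition, the words at the center of each community, such as color, Sun and Smart, are also quite representative (akin to automatic summary extraction).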
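The next question is which metric can describe this density. The Louvain method relies on a numerical quantity, modularity (essentially an objective function); with this objective function in hand we arrive at the method based on modularity optimization discussed next.
Note that there are many different community detection algorithms. The Fast Unfolding (Louvain) method discussed here is only one of them, and the density-based evaluation it uses is likewise only one line of thinking; do not rigidly assume that this is the only way to detect communities.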
Relevant Link:
https://stackoverflow.com/questions/21814235/how-can-modularity-help-in-network-analysis
http://iopscience.iop.org/article/10.1088/1742-5468/2008/10/P10008/fulltext/
https://www.researchgate.net/publication/1913681_Fast_Unfolding_of_Communities_in_Large_Networks?enrichId=rgreq-d403e26a5cb211b7053c36946c71acb3-XXX&enrichSource=Y292ZXJQYWdlOzE5MTM2ODE7QVM6MTAxOTUyNjc5NTc5NjY3QDE0MDEzMTg4MjE3ODA%3D&el=1_x_3&_esc=publicationCoverPdf
https://www.jianshu.com/p/4ebe42dfa8ec
https://blog.csdn.net/u011089523/article/details/79090453
https://blog.csdn.net/google19890102/article/details/48660239
"Fast Unfolding of Communities in Large Networks"
2. The Louvain algorithm model
0x1: Definition of modularity - the value Q that describes how tightly knit the communities are
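Modularity is a measure of how good a partition of a network into communities is. Its physical meaning is the difference between the number of edges among nodes inside communities and the number expected in a random setting. Its value lies in [-1/2, 1), and it is defined as follows:

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

A is the adjacency matrix and $A_{ij}$ is the weight of the edge between node i and node j; when the network is unweighted, every edge weight can be taken as 1;
$k_i = \sum_j A_{ij}$ is the sum of the weights of all edges attached to node i (its degree), and likewise for $k_j$;
$c_i$ is the community of node i;
the function $\delta(c_i, c_j)$ returns 1 if node i and node j are in the same community and 0 otherwise;
$m = \frac{1}{2} \sum_{i,j} A_{ij}$ is the sum of all edge weights in the network.

Grouping the sum by community gives the equivalent form

$$Q = \sum_{C} \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 \right]$$

where $\Sigma_{in}$ is the sum of the weights of the edges inside community C (each internal edge counted from both of its endpoints, matching the sum over i, j above) and $\Sigma_{tot}$ is the sum of the weights of all edges attached to nodes in community C.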
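The formula for modularity Q embodies the following idea: for each community, take the weight of the edges inside the community and subtract a term built from the total weight of all edges attached to the community's nodes, which is what a random graph with the same degrees would be expected to give. This is easier to see for an undirected graph: it is the degree contributed by edges inside the community compared against the total degree of the nodes in the community.
After a round of iteration, if Q as a whole has not changed, stop iterating; otherwise keep iterating until convergence.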
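As a minimal sketch of the definition above (not taken from the original post), the following Python computes Q in two ways on a hypothetical toy graph of two triangles joined by one edge: once as the pairwise sum over i, j and once grouped by community with Σin and Σtot. The function names, the (u, v, weight) edge-list layout and the toy graph are assumptions made for illustration, and the graph is assumed to have no self-loops.

```python
from collections import defaultdict


def modularity_pairwise(edges, community):
    """Q = 1/(2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
    adj = defaultdict(float)     # A_ij, stored symmetrically
    degree = defaultdict(float)  # k_i
    m = 0.0                      # sum of all edge weights
    for u, v, w in edges:
        adj[(u, v)] += w
        adj[(v, u)] += w
        degree[u] += w
        degree[v] += w
        m += w
    nodes = list(degree)
    q = 0.0
    for i in nodes:
        for j in nodes:
            if community[i] == community[j]:
                q += adj.get((i, j), 0.0) - degree[i] * degree[j] / (2 * m)
    return q / (2 * m)


def modularity_by_community(edges, community):
    """Q = sum_C [ Sigma_in/(2m) - (Sigma_tot/(2m))^2 ]."""
    sigma_in = defaultdict(float)   # internal weight, counted from both endpoints
    sigma_tot = defaultdict(float)  # total degree of the community's nodes
    m = 0.0
    for u, v, w in edges:
        m += w
        sigma_tot[community[u]] += w
        sigma_tot[community[v]] += w
        if community[u] == community[v]:
            sigma_in[community[u]] += 2 * w
    return sum(sigma_in[c] / (2 * m) - (sigma_tot[c] / (2 * m)) ** 2
               for c in sigma_tot)


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
EDGES = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 1.0)]
COMMUNITY = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

if __name__ == "__main__":
    print(modularity_pairwise(EDGES, COMMUNITY))
    print(modularity_by_community(EDGES, COMMUNITY))
```

On this toy graph both forms agree and give Q = 5/14, roughly 0.357: each triangle has Σin = 6 and Σtot = 7 with m = 7.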
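0x2: Modularity gain delta Q
The gain obtained by moving an isolated node i into community C is

$$\Delta Q = \frac{k_{i,in}}{m} - \frac{\Sigma_{tot}\, k_i}{2m^2}$$

$k_{i,in}$ is the sum of the weights of the edges from node i into cluster C;
$\Sigma_{tot}$ is the total weight of the edges incident to cluster C; $k_i$ is the total weight of the edges incident to node i;
m is the sum of all edge weights in the network.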
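In the first phase of the algorithm, deciding which community a node i should join means finding the neighboring community with the largest delta Q. The algorithm is discussed in detail later; for now it is enough to remember that delta Q plays a role similar to information gain in a decision tree: it steers the whole model toward ever larger modularity.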
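To make the role of delta Q concrete, here is a small sketch that checks a closed form of the gain against a direct recomputation of Q. It assumes the form ΔQ = k_i,in / m - Σtot * k_i / (2 m^2) for moving an isolated node i into community C, where k_i,in counts each edge between i and C once. The helper names and the toy graph are hypothetical and not part of the original post.

```python
from collections import defaultdict


def modularity(edges, community):
    """Q = sum_C [ Sigma_in/(2m) - (Sigma_tot/(2m))^2 ], no self-loops assumed."""
    sigma_in = defaultdict(float)
    sigma_tot = defaultdict(float)
    m = 0.0
    for u, v, w in edges:
        m += w
        sigma_tot[community[u]] += w
        sigma_tot[community[v]] += w
        if community[u] == community[v]:
            sigma_in[community[u]] += 2 * w  # counted from both endpoints
    return sum(sigma_in[c] / (2 * m) - (sigma_tot[c] / (2 * m)) ** 2
               for c in sigma_tot)


def delta_q(k_i_in, sigma_tot, k_i, m):
    """Assumed closed form of the gain for moving an isolated node into C."""
    return k_i_in / m - sigma_tot * k_i / (2 * m * m)


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
EDGES = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 1.0)]

# Before: node 2 is isolated in its own community; after: it joins {0, 1}.
BEFORE = {0: "a", 1: "a", 2: "solo", 3: "b", 4: "b", 5: "b"}
AFTER = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

# For node 2 and C = {0, 1}: k_i_in = 2, Sigma_tot = 4, k_i = 3, m = 7.
GAIN = delta_q(k_i_in=2.0, sigma_tot=4.0, k_i=3.0, m=7.0)
DIRECT = modularity(EDGES, AFTER) - modularity(EDGES, BEFORE)

if __name__ == "__main__":
    print(GAIN, DIRECT)
```

Because the gain only needs k_i,in, Σtot and k_i, a move can be evaluated without recomputing Q over the whole graph, which is what makes the first phase cheap.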
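3. Louvain algorithm strategy
1. Two hosts show similar outbound packet-sending patterns
2. Two hosts share an accumulated event log sequence
3. Two attack payloads have similar word-frequency features and can be regarded as the same group of exploitation techniques
4. Similar raw network traffic is observed at the network gateway; conversely, known labeled traffic features can also be used for supervised clustering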
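If communities are built with a heuristic/greedy "one step, one node" approach, O9, O10 and O11 are added to community D first, because in each such iteration the tightness inside community D (whether measured by node density or by edge modularity) keeps increasing, which satisfies the algorithm's check condition.
After that, O1 through O8 are also added to community D one by one, for the same reason as O9, O10 and O11.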
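1. Pay special attention to what the weight actually represents!
A <-> B: weight (similarity) = 0.1
B <-> C: weight (similarity) = 0.1
However, A and C may well be two completely different samples, so the transitive conclusion that A and C belong to the same community does not hold.
In essence, this is a question of how to do feature engineering for the weights between graph nodes. The extracted features must uniquely represent the pattern of the sample itself; non-unique cases such as 2+8 = 5+5 must not occur, i.e. two samples with different probability distributions must not end up with the same feature vector.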
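The two pitfalls above can be illustrated with a short sketch. It assumes cosine similarity as the edge weight, which the original post does not specify, and uses three hypothetical feature vectors: A is similar to B, B is similar to C, yet A and C share nothing. It also shows the 2+8 = 5+5 collision when a sum is used as the only feature.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical samples: B overlaps with both A and C, A and C do not overlap.
A = (1.0, 0.0)
B = (1.0, 1.0)
C = (0.0, 1.0)

SIM_AB = cosine_similarity(A, B)
SIM_BC = cosine_similarity(B, C)
SIM_AC = cosine_similarity(A, C)

# A lossy feature: two different samples collapse to the same value.
SAMPLE_1 = (2, 8)
SAMPLE_2 = (5, 5)
COLLIDES = sum(SAMPLE_1) == sum(SAMPLE_2)

if __name__ == "__main__":
    print(SIM_AB, SIM_BC, SIM_AC)
    print(COLLIDES)
```

A and C end up linked only through B, so any weight that is meant to express "same group" should be checked for this kind of chaining before the graph is handed to a community detection algorithm.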
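4. Louvain algorithm flow
2) Start the first-phase iteration - moving nodes between communities:
For each node i, try in turn to assign node i to the community of each of its neighbors, compute the modularity change ΔQ between before and after the assignment, and record the neighbor with the largest ΔQ. If max ΔQ > 0, assign node i to the community of that neighbor; otherwise leave it where it is;
3) Repeat 2) - keep evaluating node moves between communities:
until no node changes its community any more, i.e. the moves between communities have finished; this can be understood as the Local Maximization of the current round having been reached;
4) Second phase - Rebuilding Graph:
In the first phase of this round, a community C gained a new node i while the old community of i lost a node, so the whole graph has to be rebuilt: each community is collapsed into a single new node, the weights of edges between two communities are summed into one edge, and edges inside a community become a self-loop on the new node.
5) Repeat 2) - start the next round of first/second phase, until the modularity no longer changes.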
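Step 4) above, rebuilding the graph, can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation shown later in this post: the function name `aggregate`, the (u, v, weight) edge layout and the integer community labels are hypothetical, and edges are treated as undirected by normalizing each community pair.

```python
from collections import defaultdict


def aggregate(edges, community):
    """Collapse every community into one super-node.

    Edges between two communities are summed into a single edge; edges
    inside a community become a self-loop on the super-node.
    """
    merged = defaultdict(float)
    for u, v, w in edges:
        cu, cv = community[u], community[v]
        key = (cu, cv) if cu <= cv else (cv, cu)  # undirected: order the pair
        merged[key] += w
    return [(a, b, w) for (a, b), w in sorted(merged.items())]


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
EDGES = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 1.0)]
COMMUNITY = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

SUPER_EDGES = aggregate(EDGES, COMMUNITY)

if __name__ == "__main__":
    print(SUPER_EDGES)
```

The total edge weight is preserved by the rebuild (7 before and after in this example), so m stays the same across rounds.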

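ΔQ splits into two parts:

$$\Delta Q = \left[ \frac{\Sigma_{in} + 2k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]$$

The first part is the modularity of community c after node i has been added to it; the second part is the modularity of node i as a standalone community together with that of community c. Here $\Sigma_{in}$ counts each internal edge of c from both endpoints, which is why $k_{i,in}$ enters with a factor of 2.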
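Combining steps 2) and 3) with the ΔQ described above, one run of the first phase can be sketched as below. This is a compact illustration under stated assumptions rather than the implementation from section 5: the names `first_phase` and `delta_q` are hypothetical, the graph is assumed undirected without self-loops, and a node only moves when the gain is strictly larger than the gain of staying where it was.

```python
from collections import defaultdict


def delta_q(k_i_in, sigma_tot, k_i, m):
    """Gain of putting an isolated node into a community (assumed closed form)."""
    return k_i_in / m - sigma_tot * k_i / (2 * m * m)


def first_phase(edges):
    """Local moving: sweep over the nodes until no node changes community."""
    neighbors = defaultdict(dict)  # neighbors[u][v] = edge weight
    m = 0.0
    for u, v, w in edges:
        neighbors[u][v] = neighbors[u].get(v, 0.0) + w
        neighbors[v][u] = neighbors[v].get(u, 0.0) + w
        m += w
    degree = {u: sum(nbrs.values()) for u, nbrs in neighbors.items()}
    community = {u: u for u in neighbors}  # every node starts alone
    sigma_tot = dict(degree)               # Sigma_tot per community label

    moved = True
    while moved:
        moved = False
        for node in sorted(neighbors):
            old = community[node]
            # Take the node out of its community before comparing candidates.
            sigma_tot[old] -= degree[node]
            # k_i,in towards every neighboring community.
            weight_to = defaultdict(float)
            for nbr, w in neighbors[node].items():
                weight_to[community[nbr]] += w
            # Staying put is the baseline every other candidate has to beat.
            best = old
            best_gain = delta_q(weight_to[old], sigma_tot[old], degree[node], m)
            for c, k_in in sorted(weight_to.items()):
                gain = delta_q(k_in, sigma_tot[c], degree[node], m)
                if gain > best_gain + 1e-12:
                    best, best_gain = c, gain
            community[node] = best
            sigma_tot[best] += degree[node]
            if best != old:
                moved = True
    return community


# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
EDGES = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 1.0)]
RESULT = first_phase(EDGES)

if __name__ == "__main__":
    print(RESULT)
```

On the toy graph the sweeps settle with each triangle in its own community. The community labels themselves are arbitrary; only the grouping matters.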
Relevant Link:
https://blog.csdn.net/xuanyuansen/article/details/68941507
https://www.cnblogs.com/fengfenggirl/p/louvain.html
http://www.cnblogs.com/allanspark/p/4197980.html
https://github.com/gephi/gephi/wiki
https://blog.csdn.net/qq547276542/article/details/70175157
5. A Python implementation of the Louvain method to find communities in large networks
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'''
    Implements the Louvain method.
    Input: a weighted undirected graph
    Output: a (partition, modularity) pair where modularity is maximum
'''


class PyLouvain:

    @classmethod
    def from_file(cls, path):
        '''
            Builds a graph from _path.
            _path: a path to a file containing "node_from node_to" edges (one per line)
        '''
        with open(path, 'r') as f:
            lines = f.readlines()
        nodes = {}
        edges = []
        for line in lines:
            n = line.split()
            if not n:
                break  # the first blank line ends the edge list
            nodes[n[0]] = 1
            nodes[n[1]] = 1
            w = 1
            if len(n) == 3:
                w = int(n[2])  # optional third column: integer edge weight
            edges.append(((n[0], n[1]), w))
        # rebuild graph with successive identifiers
        nodes_, edges_ = in_order(nodes, edges)
        print("%d nodes, %d edges" % (len(nodes_), len(edges_)))
        return cls(nodes_, edges_)

    @classmethod
    def from_gml_file(cls, path):
        '''
            Builds a graph from _path.
            _path: a path to a file following the Graph Modeling Language specification
        '''
        with open(path, 'r') as f:
            lines = f.readlines()
        nodes = {}
        edges = []
        current_edge = (-1, -1, 1)
        in_edge = 0
        for line in lines:
            words = line.split()
            if not words:
                break
            if words[0] == 'id':
                nodes[int(words[1])] = 1
            elif words[0] == 'source':
                in_edge = 1
                current_edge = (int(words[1]), current_edge[1], current_edge[2])
            elif words[0] == 'target' and in_edge:
                current_edge = (current_edge[0], int(words[1]), current_edge[2])
            elif words[0] == 'value' and in_edge:
                current_edge = (current_edge[0], current_edge[1], int(words[1]))
            elif words[0] == ']' and in_edge:
                # note: the edge is stored with weight 1; the parsed value is not used
                edges.append(((current_edge[0], current_edge[1]), 1))
                current_edge = (-1, -1, 1)
                in_edge = 0
        nodes, edges = in_order(nodes, edges)
        print("%d nodes, %d edges" % (len(nodes), len(edges)))
        return cls(nodes, edges)

    def __init__(self, nodes, edges):
        '''
            Initializes the method.
            _nodes: a list of ints
            _edges: a list of ((int, int), weight) pairs
        '''
        self.nodes = nodes
        self.edges = edges
        # precompute m (sum of the weights of all links in network)
        #            k_i (sum of the weights of the links incident to node i)
        self.m = 0
        self.k_i = [0 for n in nodes]
        self.edges_of_node = {}
        self.w = [0 for n in nodes]
        for e in edges:
            self.m += e[1]
            self.k_i[e[0][0]] += e[1]
            self.k_i[e[0][1]] += e[1]  # there's no self-loop initially
            # save edges by node
            if e[0][0] not in self.edges_of_node:
                self.edges_of_node[e[0][0]] = [e]
            else:
                self.edges_of_node[e[0][0]].append(e)
            if e[0][1] not in self.edges_of_node:
                self.edges_of_node[e[0][1]] = [e]
            elif e[0][0] != e[0][1]:
                self.edges_of_node[e[0][1]].append(e)
        # access community of a node in O(1) time
        self.communities = [n for n in nodes]
        self.actual_partition = []

    def apply_method(self):
        '''
            Applies the Louvain method.
        '''
        network = (self.nodes, self.edges)
        best_partition = [[node] for node in network[0]]
        best_q = -1
        while True:
            partition = self.first_phase(network)
            q = self.compute_modularity(partition)
            partition = [c for c in partition if c]
            # clustering initial nodes with partition
            if self.actual_partition:
                actual = []
                for p in partition:
                    part = []
                    for n in p:
                        part.extend(self.actual_partition[n])
                    actual.append(part)
                self.actual_partition = actual
            else:
                self.actual_partition = partition
            if q == best_q:
                # modularity did not change in this round: treat as converged and stop
                break
            network = self.second_phase(network, partition)
            best_partition = partition
            best_q = q
        return (self.actual_partition, best_q)

    def compute_modularity(self, partition):
        '''
            Computes the modularity of the current network.
            _partition: a list of lists of nodes
        '''
        q = 0
        m2 = self.m * 2
        for i in range(len(partition)):
            q += self.s_in[i] / m2 - (self.s_tot[i] / m2) ** 2
        return q

    def compute_modularity_gain(self, node, c, k_i_in):
        '''
            Computes the modularity gain of having node in community _c.
            _node: an int
            _c: an int
            _k_i_in: the sum of the weights of the links from _node to nodes in _c
        '''
        return 2 * k_i_in - self.s_tot[c] * self.k_i[node] / self.m

    def first_phase(self, network):
        '''
            Performs the first phase of the method.
            _network: a (nodes, edges) pair
        '''
        # make initial partition
        best_partition = self.make_initial_partition(network)
        while True:
            improvement = 0
            for node in network[0]:
                node_community = self.communities[node]
                # default best community is its own
                best_community = node_community
                best_gain = 0
                # remove _node from its community
                best_partition[node_community].remove(node)
                best_shared_links = 0
                for e in self.edges_of_node[node]:
                    if e[0][0] == e[0][1]:
                        continue
                    if ((e[0][0] == node and self.communities[e[0][1]] == node_community)
                            or (e[0][1] == node and self.communities[e[0][0]] == node_community)):
                        best_shared_links += e[1]
                self.s_in[node_community] -= 2 * (best_shared_links + self.w[node])
                self.s_tot[node_community] -= self.k_i[node]
                self.communities[node] = -1
                communities = {}  # only consider neighbors of different communities
                for neighbor in self.get_neighbors(node):
                    community = self.communities[neighbor]
                    if community in communities:
                        continue
                    communities[community] = 1
                    shared_links = 0
                    for e in self.edges_of_node[node]:
                        if e[0][0] == e[0][1]:
                            continue
                        if ((e[0][0] == node and self.communities[e[0][1]] == community)
                                or (e[0][1] == node and self.communities[e[0][0]] == community)):
                            shared_links += e[1]
                    # compute modularity gain obtained by moving _node to the community of _neighbor
                    gain = self.compute_modularity_gain(node, community, shared_links)
                    if gain > best_gain:
                        best_community = community
                        best_gain = gain
                        best_shared_links = shared_links
                # insert _node into the community maximizing the modularity gain
                best_partition[best_community].append(node)
                self.communities[node] = best_community
                self.s_in[best_community] += 2 * (best_shared_links + self.w[node])
                self.s_tot[best_community] += self.k_i[node]
                if node_community != best_community:
                    improvement = 1
            if not improvement:
                break
        return best_partition

    def get_neighbors(self, node):
        '''
            Yields the nodes adjacent to _node.
            _node: an int
        '''
        for e in self.edges_of_node[node]:
            if e[0][0] == e[0][1]:  # a node is not neighbor with itself
                continue
            if e[0][0] == node:
                yield e[0][1]
            if e[0][1] == node:
                yield e[0][0]

    def make_initial_partition(self, network):
        '''
            Builds the initial partition from _network.
            _network: a (nodes, edges) pair
        '''
        partition = [[node] for node in network[0]]
        self.s_in = [0 for node in network[0]]
        self.s_tot = [self.k_i[node] for node in network[0]]
        for e in network[1]:
            if e[0][0] == e[0][1]:  # only self-loops
                self.s_in[e[0][0]] += e[1]
                self.s_in[e[0][1]] += e[1]
        return partition

    def second_phase(self, network, partition):
        '''
            Performs the second phase of the method.
            _network: a (nodes, edges) pair
            _partition: a list of lists of nodes
        '''
        nodes_ = [i for i in range(len(partition))]
        # relabelling communities
        communities_ = []
        d = {}
        i = 0
        for community in self.communities:
            if community in d:
                communities_.append(d[community])
            else:
                d[community] = i
                communities_.append(i)
                i += 1
        self.communities = communities_
        # building relabelled edges
        edges_ = {}
        for e in network[1]:
            ci = self.communities[e[0][0]]
            cj = self.communities[e[0][1]]
            try:
                edges_[(ci, cj)] += e[1]
            except KeyError:
                edges_[(ci, cj)] = e[1]
        edges_ = [(k, v) for k, v in edges_.items()]
        # recomputing k_i vector and storing edges by node
        self.k_i = [0 for n in nodes_]
        self.edges_of_node = {}
        self.w = [0 for n in nodes_]
        for e in edges_:
            self.k_i[e[0][0]] += e[1]
            self.k_i[e[0][1]] += e[1]
            if e[0][0] == e[0][1]:
                self.w[e[0][0]] += e[1]
            if e[0][0] not in self.edges_of_node:
                self.edges_of_node[e[0][0]] = [e]
            else:
                self.edges_of_node[e[0][0]].append(e)
            if e[0][1] not in self.edges_of_node:
                self.edges_of_node[e[0][1]] = [e]
            elif e[0][0] != e[0][1]:
                self.edges_of_node[e[0][1]].append(e)
        # resetting communities
        self.communities = [n for n in nodes_]
        return (nodes_, edges_)


def in_order(nodes, edges):
    '''
        Rebuilds a graph with successive nodes' ids.
        _nodes: a dict of int
        _edges: a list of ((int, int), weight) pairs
    '''
    # rebuild graph with successive identifiers
    nodes = list(nodes.keys())
    nodes.sort()
    i = 0
    nodes_ = []
    d = {}
    for n in nodes:
        nodes_.append(i)
        d[n] = i
        i += 1
    edges_ = []
    for e in edges:
        edges_.append(((d[e[0][0]], d[e[0][1]]), e[1]))
    return (nodes_, edges_)
  node
  [
    id 16
    label "Betrayal"
    value "c"
  ]
  node
  [
    id 17
    label "Shut Up and Sing"
    value "c"
  ]
  node
  [
    id 18
    label "Meant To Be"
    value "n"
  ]
  node
  [
    id 19
    label "The Right Man"
    value "c"
  ]
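For readers who want to inspect node records like the ones above, the following is a small hypothetical helper that extracts (id, label, value) triples with a regular expression. It is not how the PyLouvain class parses GML (that code only reads the `id` fields of nodes), and the pattern assumes the three fields always appear in the order id, label, value.

```python
import re

# Assumed record shape: node [ id <int> label "<text>" value "<text>" ]
NODE_PATTERN = re.compile(
    r'node\s*\[\s*id\s+(\d+)\s+label\s+"([^"]*)"\s+value\s+"([^"]*)"\s*\]'
)


def parse_nodes(text):
    """Return a list of (id, label, value) tuples found in a GML fragment."""
    return [(int(node_id), label, value)
            for node_id, label, value in NODE_PATTERN.findall(text)]


SAMPLE = '''
  node
  [
    id 16
    label "Betrayal"
    value "c"
  ]
  node
  [
    id 17
    label "Shut Up and Sing"
    value "c"
  ]
'''

PARSED = parse_nodes(SAMPLE)

if __name__ == "__main__":
    for record in PARSED:
        print(record)
```

Because the pattern only uses `\s` between tokens, it accepts both the one-field-per-line layout and a record written on a single line.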
Relevant Link:
http://www.cnblogs.com/allanspark/p/4197980.html
https://arxiv.org/pdf/0803.0476.pdf
https://github.com/LittleHann/pylouvain
https://www.jianshu.com/p/e543dc63454f
6. Other community detection algorithms
Relevant Link:
http://blog.sina.com.cn/s/blog_63891e610101722t.html
https://www.zhihu.com/question/29042018
https://wenku.baidu.com/view/36fa145a3169a4517623a313.html