数据挖掘实验二 :聚类技术—复杂网络社团检测

实验二 :聚类技术—复杂网络社团检测

实验内容

  1. 导入karate.gml中的空手道网络数据;
  2. 根据网络结构特征给出节点相似性度量指标;
  3. 采用层次聚类过程对网络数据进行聚类;
  4. 计算模块性指标Q值,当Q值最大时输出聚类结果;
  5. 采用Cytoscape工具,可视化聚类结果。

分析及设计

  1. 导入数据包:
  2. 用python的networks包中的read_ gml方法读取”图”的数据;
  3. 观察图的信息:34个顶点、78条边的无向图;
  4. 构建节点相似度矩阵:
  5. 无向图的节点相似度矩阵 S i , j = ∣ N i ∩ N j ∣ ∣ N i ∪ N j ∣ S_{i,j}={{|N_i \cap N_j|}\over{|N_i \cup N_j|}}S i ,j ​=∣N i ​∪N j ​∣∣N i ​∩N j ​∣​
    • N i N_i N i ​代表i i i所连的全部点的集合;
    • N j N_j N j ​代表j j j所连的全部点的集合;
  6. 用平均相似度定义聚类密度:
  7. d e n s i t y = 1 C i ≠ j ∑ i , j ∈ C , i ≠ j S i , j density = {1 \over C_{i \neq j} \sum {i,j \in C, i \neq j}{S{i,j}}}d e n s i t y =C i ​=j ​∑i ,j ∈C ,i ​=j ​S i ,j ​1 ​
  8. 划分簇集:
  9. 贪心算法:
    • 随机取一个点作为簇开始;
    • 遍历所以剩余的点使每次加入一个使簇变化密度最小且满足大于设定的密度阀值threshold的点;
    • 重复上述过程,直到所有点都有划分。
      数据挖掘实验二 :聚类技术—复杂网络社团检测
  10. 设置不同的密度阀值,划分簇集后计算Q值,输出Q最大的簇集,并用matplotlib.pylot绘图;

; 详细实现

  1. 导入数据包.py
import networkx as nx
import matplotlib.pyplot as plt

gml_path = './数据包/karate.gml'
G = nx.read_gml(gml_path, label='id')

G_nn = G.number_of_nodes()
G_en = G.number_of_edges()

  1. 相似度矩阵.py
from 实验二复杂网络社团检测 import 导入数据包 as shuju
import numpy as np

G = shuju.G
'''Q:字典型,key为一个节点,value为与该节点连的所有节点的集合'''
Q = {i: set(G[i]) for i in G}

Sim = np.zeros((shuju.G_nn+1, shuju.G_en+1))
'''
    无向图中点的相似度:
        Sim(i,j) = (i与j相同邻居个数,即Q[i]与Q[j]的交集)/(与i、j连接的所有点个数,即Q[i]与Q[j]的并集)
    对于无向图的两个点来说,有越多相同的邻居说明它们可能越相似;
    但是如果一个点有很多邻居的话, 则该点可能只是流行或者热门,相似的可能性小了;
'''
for i in G:
    for j in G:
        Sim[i, j] = len(Q[i] & Q[j]) / len(Q[i] | Q[j])

  1. 密度函数.py
from 实验二复杂网络社团检测 import 相似度矩阵 as sim
Sim = sim.Sim

def density_avg(club):
    if len(club) == 1:
        return 1.0
    density = 0.0
    for i in club:
        for j in club:
            if i != j:
                density += Sim[i, j]
    density /= len(club) ** 2 - len(club)
    return density

  1. 划分簇集.py
import random
from 实验二复杂网络社团检测 import 密度函数 as md, 导入数据包 as shuju
d_avg = md.density_avg
G = shuju.G

'''
    贪心算法:
        随机取一个点作为簇开始,
        遍历所以剩余的点使每次加入一个使簇变化密度最小且满足大于设定的密度阀值threshold的点,
        重复上述过程,直到所有点都有划分。
'''
def find_clubs(threshold=0.20, density_fun=d_avg):

    clubs = []

    candidate = list(G.nodes)

    one_club = []

    while(len(candidate) > 0):
        if len(one_club) == 0:

            picked = candidate[random.randint(0, len(candidate) - 1)]

            candidate.remove(picked)
            one_club.append(picked)

            one_density = density_fun(one_club)

            if(len(candidate) == 0):
                clubs.append(one_club)
        else:
            min_det = float('inf')
            min_id = -1

            for id in candidate:
                new_club = one_club + [id, ]

                new_density = density_fun(new_club)

                det = one_density - new_density
                min_det, min_id = (det, id) if det < min_det else (min_det, min_id)

            new_density = one_density - min_det

            if new_density < threshold:
                clubs.append(one_club)

                one_club = []
            else:

                candidate.remove(min_id)
                one_club.append(min_id)

                if(len(candidate) == 0):
                    clubs.append(one_club)
                    one_club = []

    return clubs
  1. 模块度Q.py
from 实验二复杂网络社团检测 import 相似度矩阵 as sim
import numpy as np

def Qf(club, E):

    C = np.zeros((len(club), len(club)))
    ai = 0
    aj = 0
    for key, value in E.items():
        for i in value:
            if key < i:
                for j in range(0, len(club)):
                    if key in club[j]:
                        ai = j
                    if i in club[j]:
                        aj = j
                C[ai][aj] += 1
                C[aj][ai] = C[ai][aj]

    [rows, cols] = C.shape
    edges_num = 0
    sum_duijiao = 0
    sum_fengzi = 0
    sum_i = 0
    for i in range(rows):
        for j in range(cols):
            if i == j:
                edges_num += C[i][j]
                sum_duijiao += C[i][j]
            if i < j:
                edges_num += C[i][j]
    for j in range(cols):
        for i in range(rows):
            if i == j:
                sum_i += C[i][j]*2
            else:
                sum_i += C[i][j]
        sum_fengzi += sum_i ** 2
        sum_i = 0

    Q = sum_duijiao / edges_num - sum_fengzi / ((2 * edges_num)**2)
    return Q

  1. 绘图.py
from 实验二复杂网络社团检测 import 划分簇集 as cj, 模块度Q as mk, 相似度矩阵 as sim
import matplotlib.pyplot as plt
import networkx as nx
import random
import numpy as np
G = cj.G

def getRandomColor():
    colorArr = ['1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F']
    color = "#"

    for i in range(6):
        color += colorArr[random.randint(0, 14)]
    return color

def draw_clubs(clubs):
    pos = nx.spring_layout(G)
    Q = mk.Qf(clubs, sim.Q)
    plt.title('Q=%.2f' % Q)
    for club in clubs:
        print(club)
        nx.draw(G, pos=pos, nodelist=club, node_color=getRandomColor(), with_labels=True)

def find_the_best_Qclubs(density_fun):

    thresholds = np.arange(0.00, 1.00, 0.01)

    Qmax = -1
    for i in range(0, len(thresholds)):
        clubs = cj.find_clubs(thresholds[i], density_fun)
        Q = mk.Qf(clubs, sim.Q)
        if Q > Qmax:
            Qmax = Q

            best_clubs = clubs

    return best_clubs

def show(density_fun):
    plt.figure(figsize=(12, 10))
    clubs = find_the_best_Qclubs(density_fun)

    draw_clubs(clubs)
    plt.show()

show(cj.d_avg)

实验结果

数据挖掘实验二 :聚类技术—复杂网络社团检测

Original: https://blog.csdn.net/Input_print/article/details/123099485
Author: OutlierLi
Title: 数据挖掘实验二 :聚类技术—复杂网络社团检测

原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/550702/

转载文章受原作者版权保护。转载请注明原作者出处!

(0)

大家都在看

亲爱的 Coder【最近整理,可免费获取】👉 最新必读书单  | 👏 面试题下载  | 🌎 免费的AI知识星球