[Machine Learning and scikit-learn-26]: Algorithms - Clustering - Finding the Best Silhouette Coefficient with KMeans

Author homepage (文火冰糖的硅基工坊): the CSDN blog of 文火冰糖 (王文兵)

Article URL: https://blog.csdn.net/HiWangWenBing/article/details/123515923

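Contents

Preface:

Chapter 1 Silhouette coefficients for a given number of clusters

1.1 Unsorted silhouette coefficients

1.2 Sorted silhouette coefficients: horizontal view

1.3 Sorted silhouette coefficients: vertical view

Chapter 2 Silhouette coefficients for n_clusters = 4 (sorted)

Chapter 3 Silhouette plots for different numbers of clusters

3.1 Code

3.2 n_clusters = 2

3.3 n_clusters = 3

3.4 n_clusters = 4

3.5 n_clusters = 5

3.6 n_clusters = 6

3.7 n_clusters = 7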

Preface:


Chapter 1 Silhouette coefficients for a given number of clusters

1.1 Unsorted silhouette coefficients

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

# X is assumed to be the (500, 2) feature matrix prepared beforehand;
# this article does not show how it is built.

# Build the model and fit it
n_clusters = 4
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
clusterer = clusterer.fit(X)

# Cluster label assigned to every sample
cluster_labels = clusterer.labels_

# Average silhouette score over all samples
silhouette_avg = silhouette_score(X, cluster_labels)

print("For n_clusters =", n_clusters)
print("The average silhouette_score is :", silhouette_avg)

# Silhouette coefficient of each individual sample
sample_silhouette_values = silhouette_samples(X, cluster_labels)
print("The sample silhouette_score is :", sample_silhouette_values.shape)

# Plot the silhouette coefficient of every sample, in sample order
x_data = np.linspace(0, 500, 500)
y_data = sample_silhouette_values
plt.scatter(x_data, y_data)
# Shade the area between y = 0 and each sample's silhouette value
plt.fill_between(x_data, 0, y_data, facecolor='green', alpha=0.3)

Output:
For n_clusters = 4
The average silhouette_score is : 0.6505186632729437
The sample silhouette_score is : (500,)
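
For reference, the value that silhouette_samples returns for a sample i is s(i) = (b - a) / max(a, b), where a is the mean distance from i to the other members of its own cluster and b is the smallest mean distance from i to the members of any other cluster. The sketch below recomputes this by hand and compares it with scikit-learn. The data set (X_demo, built with make_blobs) and the helper manual_silhouette are assumptions made only for this illustration; they are not the article's X or code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

# Assumption: a small synthetic 2-D data set standing in for the article's X.
X_demo, _ = make_blobs(n_samples=60, centers=3, cluster_std=1.0, random_state=0)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(X_demo)


def manual_silhouette(X, labels):
    """Hypothetical helper: s(i) = (b - a) / max(a, b) for every sample."""
    dist = pairwise_distances(X)  # Euclidean distance by default
    unique_labels = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            # A cluster with a single member keeps the score 0.
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from i to itself is 0, so divide by n_own - 1)
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == k].mean()
                for k in unique_labels if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


manual_values = manual_silhouette(X_demo, labels_demo)
sklearn_values = silhouette_samples(X_demo, labels_demo)
print(np.allclose(manual_values, sklearn_values))
```

Values close to 1 mean a sample sits well inside its cluster, values near 0 mean it lies between two clusters, and negative values suggest it may have been assigned to the wrong cluster.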


1.2 Sorted silhouette coefficients: horizontal view

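# Plot the silhouette coefficients of all samples after sorting
x_data = np.linspace(0, 500, 500)
# np.sort returns a sorted copy, so sample_silhouette_values keeps its
# sample order and stays aligned with cluster_labels
y_data = np.sort(sample_silhouette_values)
plt.scatter(x_data, y_data)
plt.fill_between(x_data, 0, y_data, facecolor='green', alpha=0.3)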
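
One point worth keeping in mind when sorting here: assigning a NumPy array to a second name does not copy it, and ndarray.sort() sorts in place. Sorting through an alias therefore reorders the per-sample silhouette values and breaks their alignment with the cluster labels, which later per-cluster indexing depends on. The small sketch below shows the effect; the arrays are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical silhouette values and cluster labels for five samples.
values = np.array([0.7, 0.2, 0.9, 0.4, 0.6])
labels = np.array([0, 1, 0, 1, 0])

sorted_copy = np.sort(values)            # new array; `values` keeps sample order
cluster0_before = values[labels == 0].copy()

alias = values                           # same underlying array, not a copy
alias.sort()                             # sorts `values` in place as well
cluster0_after = values[labels == 0]

print(cluster0_before)  # the values that really belong to cluster 0
print(cluster0_after)   # different values: the alignment with labels is lost
```

Using np.sort (or sorting an explicit .copy()) keeps the original array usable for per-cluster selection such as values[labels == i].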


1.3 Sorted silhouette coefficients: vertical view

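# Plot the sorted silhouette coefficients with the axes swapped
x_data = np.linspace(0, 500, 500)
# Sort a copy so that sample_silhouette_values keeps its sample order
y_data = np.sort(sample_silhouette_values)
plt.scatter(y_data, x_data)
plt.fill_between(y_data, 0, x_data, facecolor='green', alpha=0.3)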


Chapter 2 Silhouette coefficients for n_clusters = 4 (sorted)

# 1. Create the figure
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(9+9, 7)
# The x-axis is the silhouette coefficient. Its theoretical range is [-1, 1];
# in this example the values fall in about [0, 1]
ax1.set_xlim([-0.1, 1])

# y-axis range: one row per sample plus a 10-row gap around each cluster
print("Number of samples:", X.shape[0])
ax1.set_ylim([0, X.shape[0] + (n_clusters + 1) * 10])

# 2. Draw the silhouette coefficients of each cluster
# Baseline offset, so the first band does not sit directly on the x-axis
y_lower = 10
for i in range(n_clusters):
    # Silhouette coefficients of the samples in cluster i
    ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

    # Sort them so the band is drawn as a smooth shape rather than in sample order
    ith_cluster_silhouette_values.sort()

    # Number of samples in cluster i
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    print(size_cluster_i)

    # Upper bound of the band: lower bound plus the cluster size
    y_upper = y_lower + size_cluster_i

    # Map the cluster index to a color
    color = cm.nipy_spectral(float(i)/n_clusters)

    # Fill from x = 0 to the silhouette values over y in [y_lower, y_upper)
    ax1.fill_betweenx(np.arange(y_lower, y_upper)      # y values
                      ,ith_cluster_silhouette_values   # x values
                      ,facecolor=color
                      ,alpha=0.7
                     )
    # Label each band on the y-axis with its cluster index
    ax1.text(-0.05 , y_lower + 0.5 * size_cluster_i , str(i))
    y_lower = y_upper + 10

# Titles and axis labels
ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")
# Draw a vertical line at the average silhouette score
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
# Hide the y-axis ticks
ax1.set_yticks([])
# Set the x-axis ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

# 3. Plot the actual distribution of the samples
colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
ax2.scatter(X[:, 0], X[:, 1] ,marker='o' ,s=8 ,c=colors)
centers = clusterer.cluster_centers_

# Mark the cluster centers with red crosses
ax2.scatter(centers[:, 0], centers[:, 1], marker='x', c="red", alpha=1, s=200)
ax2.set_title("The visualization of the clustered data.")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")
plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
              "with n_clusters = %d" % n_clusters),
             fontsize=14, fontweight='bold')
plt.show()

Output:
Number of samples: 500
123
125
124
128
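
The cluster sizes printed above can be complemented with a few per-cluster numbers that the silhouette plot shows only visually. The following is a minimal sketch: cluster_summary is a hypothetical helper, and the demo arrays are made-up values, not results from the article's data.

```python
import numpy as np


def cluster_summary(sample_values, labels):
    """Hypothetical helper.

    Returns {cluster: (size, mean silhouette, count of negative values)}.
    """
    summary = {}
    for k in np.unique(labels):
        vals = sample_values[labels == k]
        summary[int(k)] = (int(vals.size),
                           float(vals.mean()),
                           int((vals < 0).sum()))
    return summary


# Made-up silhouette values for six samples in two clusters.
demo_values = np.array([0.8, 0.6, -0.1, 0.5, 0.7, 0.3])
demo_labels = np.array([0, 0, 0, 1, 1, 1])

demo_summary = cluster_summary(demo_values, demo_labels)
for cluster, (size, mean, negatives) in demo_summary.items():
    print(cluster, size, round(mean, 3), negatives)
```

With the article's variables, the call would presumably be cluster_summary(sample_silhouette_values, cluster_labels), provided sample_silhouette_values is still in its original sample order.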


Chapter 3 Silhouette plots for different numbers of clusters

3.1 Code

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

# Show how the silhouette coefficients are distributed for different numbers of clusters
# (X is the same feature matrix used in Chapter 1)
for n_clusters in [2,3,4,5,6,7]:
    # Prepare the figure
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, X.shape[0] + (n_clusters + 1) * 10])

    # Fit the model
    clusterer = KMeans(n_clusters=n_clusters, random_state=10).fit(X)

    # Cluster labels
    cluster_labels = clusterer.labels_

    # Average silhouette coefficient
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)

    # Silhouette coefficient of each sample for this number of clusters
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    # Draw the silhouette plot, one band per cluster
    y_lower = 10
    for i in range(n_clusters):
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i)/n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper)
                          ,ith_cluster_silhouette_values
                          ,facecolor=color
                          ,alpha=0.7
                         )
        ax1.text(-0.05
                 , y_lower + 0.5 * size_cluster_i
                 , str(i))
        y_lower = y_upper + 10

    # Title and axis labels of the silhouette plot
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # Draw a vertical line at the average silhouette score
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)

    # Scatter plot of the samples, colored by cluster
    ax2.scatter(X[:, 0], X[:, 1]
                ,marker='o'
                ,s=8
                ,c=colors
               )
    centers = clusterer.cluster_centers_
    # Mark the cluster centers with red crosses
    ax2.scatter(centers[:, 0], centers[:, 1], marker='x',
                c="red", alpha=1, s=200)

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()

3.2 n_clusters = 2

For n_clusters = 2 The average silhouette_score is : 0.7049787496083262


3.3 n_clusters = 3

For n_clusters = 3 The average silhouette_score is : 0.5882004012129721


3.4 n_clusters = 4

For n_clusters = 4 The average silhouette_score is : 0.6505186632729437


3.5 n_clusters = 5

For n_clusters = 5 The average silhouette_score is : 0.56376469026194


3.6 n_clusters = 6

For n_clusters = 6 The average silhouette_score is : 0.4504666294372765


3.7 n_clusters = 7

For n_clusters = 7 The average silhouette_score is : 0.39092211029930857
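
To compare the six runs at a glance, the average scores reported above can be ranked directly. The snippet below is a small sketch that only reuses the numbers printed in sections 3.2 to 3.7; the variable names are chosen for this illustration.

```python
# Average silhouette scores copied from the outputs reported above.
reported_scores = {
    2: 0.7049787496083262,
    3: 0.5882004012129721,
    4: 0.6505186632729437,
    5: 0.56376469026194,
    6: 0.4504666294372765,
    7: 0.39092211029930857,
}

# Rank the candidate cluster counts from highest to lowest average score.
ranking = sorted(reported_scores, key=reported_scores.get, reverse=True)
best_n_clusters = ranking[0]

print("ranking:", ranking)                                  # [2, 4, 3, 5, 6, 7]
print("best n_clusters by average score:", best_n_clusters)  # 2
```

By the average score alone, n_clusters = 2 ranks first and n_clusters = 4 second. The average is only one criterion, though: the per-cluster silhouette plots (band thickness, and whether every cluster reaches past the average line) are usually read alongside it before settling on a value.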


Original: https://blog.csdn.net/HiWangWenBing/article/details/123515923
Author: 文火冰糖的硅基工坊
Title: [Machine Learning and scikit-learn-26]: Algorithms - Clustering - Finding the Best Silhouette Coefficient with KMeans
