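K-means and DBSCAN

K-means

1. The number of clusters is specified by the user
To choose the best number of clusters, use the elbow method, which is based on the within-cluster sum of squared errors (SSE, also called distortion). Plot the distortion for a range of k values and pick the k at the "elbow": the point where the distortion stops dropping sharply as k increases (equivalently, where it rises sharply as k decreases).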
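
The elbow method described above can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the synthetic three-group data, the range of k values, and the variable names (ks, inertias, drops) are all assumptions made for the example. It relies on scikit-learn's KMeans, whose fitted inertia_ attribute holds the within-cluster SSE.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.5, random_state=0)

ks = range(1, 8)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster SSE for this k

for k, sse in zip(ks, inertias):
    print(f"k={k}: SSE={sse:.1f}")

# Reduction in SSE gained by going from k to k+1 clusters; the elbow is
# where these reductions become small.
drops = -np.diff(inertias)
print(drops)
```

Plotting inertias against ks (for example with plt.plot) gives the usual elbow chart; on data like this the curve should flatten after k=3.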

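DBSCAN

Density-based clustering.
Density: how tightly packed the samples are. It is assessed with a radius and a minimum sample count: if the number of samples within the specified radius of a point reaches or exceeds the given minimum, the neighborhood of that point is considered dense.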
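
The density test above can be sketched with plain NumPy. The helper name core_point_mask and the toy points are hypothetical, introduced only for illustration. The sketch assumes the scikit-learn convention in which a point counts itself among its neighbors and the comparison is "at least min_samples"; this is only the density check, not the full DBSCAN algorithm.

```python
import numpy as np

def core_point_mask(X, eps, min_samples):
    """Return a boolean mask marking samples whose eps-neighborhood is dense."""
    # Pairwise Euclidean distances, shape (n_samples, n_samples).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Count neighbors within eps; each point counts itself (distance 0).
    neighbor_counts = (dists <= eps).sum(axis=1)
    return neighbor_counts >= min_samples

# Three points close together and one far away (toy data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
mask = core_point_mask(X, eps=0.2, min_samples=3)
print(mask)  # the three nearby points are dense, the distant one is not
```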

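K-means vs. DBSCAN:

Advantages:
DBSCAN does not require the number of clusters to be specified in advance;
DBSCAN can find clusters of arbitrary shape;
DBSCAN can identify noise points;
DBSCAN is largely insensitive to the order of the samples in the dataset, although border samples lying between clusters may be assigned to either cluster depending on that order;
Disadvantages:
DBSCAN does not handle high-dimensional data well;
Clustering quality is poor when the density of the samples is uneven and the spacing between clusters varies widely.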
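
As a small illustration of the first and third advantages, the sketch below runs scikit-learn's DBSCAN without giving it a cluster count and then reads the number of clusters and the noise points off the labels. The data (two tight groups plus one distant outlier) and the eps and min_samples values are assumptions chosen for the example. scikit-learn marks noise with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
# Two tight groups plus one far-away outlier (synthetic data for illustration).
group_a = rng.normal(loc=[0.0, 0.0], scale=0.05, size=(30, 2))
group_b = rng.normal(loc=[3.0, 3.0], scale=0.05, size=(30, 2))
outlier = np.array([[10.0, 10.0]])
X = np.vstack([group_a, group_b, outlier])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Noise is labeled -1, so it is excluded when counting clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```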

Evaluation metrics

(1) silhouette_score (silhouette coefficient)
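The silhouette coefficient combines cohesion (how close a sample is to the other samples in its own cluster) and separation (how far it is from the samples in other clusters).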
Values lie in [-1, 1]; the larger the value, the better the clustering.
A value close to 0 indicates that the clusters overlap.
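
To make the cohesion/separation idea concrete, here is a hedged sketch that computes the silhouette coefficient by hand and compares it with scikit-learn's silhouette_score. It uses the standard per-sample definition s(i) = (b - a) / max(a, b), where a is the mean distance to the other samples in the same cluster and b is the smallest mean distance to the samples of any other cluster. The function name silhouette_by_hand and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_by_hand(X, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all samples (Euclidean distance)."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself from its own cluster
        if not same.any():
            scores.append(0.0)  # convention: a one-sample cluster scores 0
            continue
        a = dists[i, same].mean()  # cohesion
        b = min(dists[i, labels == k].mean()  # separation
                for k in np.unique(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
print(silhouette_by_hand(X, labels), silhouette_score(X, labels))
```

The pairwise-distance matrix makes this quadratic in the number of samples, so it is meant for understanding the metric rather than for large datasets.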

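(2) inertia

inertia_ is an attribute of a fitted K-means model object and serves as an unsupervised evaluation metric when no ground-truth class labels are available. It is the sum of squared distances from each sample to its nearest cluster center.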
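Inertia tends to decrease as the number of clusters increases, so a lower value at a larger k does not by itself indicate a better clustering.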
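
A quick way to see what inertia measures is to recompute it from the fitted centers and labels. The sketch below does that on four made-up points and compares the result with the inertia_ attribute of scikit-learn's KMeans; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious pairs of points (toy data).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Center assigned to each sample, then the sum of squared distances to it.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```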

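(3) Rand index

The Rand index requires the ground-truth class assignment C. Let K be the clustering result, a the number of sample pairs that are in the same group in both C and K, and b the number of sample pairs that are in different groups in both C and K. With n samples, the index is RI = (a + b) / (n(n-1)/2), i.e. the fraction of sample pairs on which C and K agree.
Values lie in [0, 1]; a larger value means the clustering result agrees more closely with the ground truth.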
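
The pair-counting definition can be sketched in a few lines of standard-library Python. The function name rand_index and the label lists are hypothetical, and the loop over all pairs is written for clarity rather than speed.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree."""
    a = 0  # pairs in the same group in both labelings
    b = 0  # pairs in different groups in both labelings
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            a += 1
        elif not same_true and not same_pred:
            b += 1
    return (a + b) / len(pairs)

C = [0, 0, 1, 1]
print(rand_index(C, [1, 1, 0, 0]))  # 1.0: same grouping, label names swapped
print(rand_index(C, [0, 0, 0, 1]))  # 0.5: a=1, b=2 out of 6 pairs
```

Note that the index depends only on which samples are grouped together, not on the label values themselves.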

(4) Mutual information (MI)
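
As a minimal sketch, mutual information between a ground-truth labeling and a clustering result can be computed with scikit-learn's mutual_info_score, and a version scaled to [0, 1] with normalized_mutual_info_score. The toy labels below are made up for illustration.

```python
import math
from sklearn import metrics

labels_true = [0, 0, 1, 1]
labels_pred = [1, 1, 0, 0]  # same grouping, label names swapped

mi = metrics.mutual_info_score(labels_true, labels_pred)
nmi = metrics.normalized_mutual_info_score(labels_true, labels_pred)

# For two balanced classes matched perfectly, MI equals ln(2) (natural log).
print(mi, math.log(2), nmi)
```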

(5) Homogeneity score, completeness score, v_measure_score
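
These three scores also need ground-truth labels. As a hedged illustration, scikit-learn's homogeneity_completeness_v_measure returns all of them at once. In the made-up example below, every predicted cluster contains samples of a single class (perfect homogeneity), but each class is split across two clusters (imperfect completeness); the V-measure is the harmonic mean of the two.

```python
from sklearn import metrics

labels_true = [0, 0, 1, 1]
labels_pred = [0, 1, 2, 3]  # every sample in its own cluster

h, c, v = metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
print(f"homogeneity={h:.3f} completeness={c:.3f} v_measure={v:.3f}")
```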

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn import datasets
# %matplotlib inline  # Jupyter-only magic; uncomment when running in a notebook

X1, y1 = datasets.make_moons(n_samples=1000,
                            noise=0.1,
                            random_state=16)
X2, y2 = datasets.make_blobs(n_samples=1000,
                            n_features=2,
                            centers=[[1.2,1.2]],
                            cluster_std=[0.1],
                            random_state=16)

X = np.concatenate((X1, X2))
plt.figure(figsize=(10,7))
plt.title('origin')
plt.plot(X[:,0], X[:,1], 'o', markersize=6)
plt.show()

from sklearn.cluster import KMeans, DBSCAN

# Use KMeans
y_pred = KMeans(n_clusters=3, random_state=9).fit_predict(X)
plt.figure(figsize=(10,7))
plt.scatter(X[:,0], X[:,1], s=25, c=y_pred)
plt.title('k-means:k=3')
plt.show()
print(metrics.silhouette_score(X, y_pred))

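# Use DBSCAN
y_pred = DBSCAN(eps=0.15, min_samples=10).fit_predict(X) # eps is the neighborhood radius; min_samples is the minimum number of samples within that radius for a point to count as a core point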
plt.figure(figsize=(10, 7))
plt.scatter(X[:,0], X[:,1], s=25, c=y_pred)
plt.title('DBSCAN')
plt.show()
print(metrics.silhouette_score(X, y_pred))

Original: https://blog.csdn.net/jinselizhi/article/details/114397507
Author: 谁怕平生太急
Title: Kmeans和DBSCAN
