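This post collects the algorithms commonly used in clustering projects, suggests a project structure that makes them easy to call, and implements common clustering metrics as functions for evaluating the results. The GitHub link is at the end of the post for discussion and study.
"Things of a kind come together; people of a mind fall into the same group."
Several clustering algorithms are introduced below.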
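This post adds a weighted K-means algorithm. Sometimes different features need different weights, which can be chosen according to feature importance. The change is made in the Euclidean distance computation: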
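import numpy as np

def euclidean_distance(one_sample, X):
    '''
    Weighted squared Euclidean distance from one sample to each cluster center.
    :param one_sample: a single input sample
    :param X: all cluster centers, one per row
    :return: list of distances from the sample to every cluster center
    '''
    one_sample = one_sample.reshape(1, -1)
    X = X.reshape(X.shape[0], -1)
    distances = []
    # Per-feature weights; this hard-coded list assumes exactly three features.
    w = [1, 0.2, 0.2]
    n = X.shape[0]
    for i in range(n):
        subs = one_sample - X[i]
        dimension2 = np.power(subs, 2)
        w_dimension2 = np.multiply(w, dimension2)
        # No square root is taken; the ranking of the centers is unaffected.
        w_distance2 = np.sum(w_dimension2, axis=1)[0]
        distances.append(w_distance2)
    return distances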
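To show where the weighted distance fits, here is a minimal sketch of a Lloyd-style K-means loop that uses it in the assignment step. The names `weighted_sq_distances` and `weighted_kmeans`, the random initialization, and the stopping rule are assumptions made for illustration, not the author's implementation. Because the weights act per feature, the center update can stay the plain per-cluster mean.

```python
import numpy as np


def weighted_sq_distances(X, centers, w):
    """Weighted squared Euclidean distance from every row of X to every center."""
    diff = X[:, None, :] - centers[None, :, :]   # shape (n_samples, k, n_features)
    return np.sum(w * diff ** 2, axis=2)         # shape (n_samples, k)


def weighted_kmeans(X, k, w, n_iter=100, seed=0):
    """Hypothetical helper: K-means that assigns points by weighted distance."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct samples picked at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(weighted_sq_distances(X, centers, w), axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:                 # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment so that labels match the returned centers.
    labels = np.argmin(weighted_sq_distances(X, centers, w), axis=1)
    return labels, centers
```

A call such as `weighted_kmeans(X, k=2, w=[1, 0.2, 0.2])` reuses the weights from the function above; `w` must have one entry per feature.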
AP clustering, also known as Affinity Propagation clustering, is a clustering algorithm published in Science in 2007. Recommended reading: Affinity Learning for Mixed Data Clustering
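The advantages of the AP algorithm are:
1) The final number of clusters does not need to be specified;
2) Existing data points serve as the final cluster centers, rather than newly generated centers;
3) The model is insensitive to initial values: running AP several times gives exactly the same result, so no random initialization step is needed (again in contrast to the random initial values of K-Means);
4) The initial similarity matrix is not required to be symmetric;
5) Compared with k-centers clustering, its squared error is lower; compared with K-means, it is more robust and more accurate, but the algorithm has higher complexity and takes longer to run.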
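The core idea is to use the diagonal value s(k, k) of the similarity matrix S as the criterion for whether point k can become a cluster center: the larger this value, the more likely the point is to become a center. This value is also called the preference p. The number of clusters depends on p. If every data point is considered equally likely to be a cluster center, p should take the same value for all points. Taking the mean of the input similarities as p yields a moderate number of clusters; taking the minimum yields fewer clusters.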
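As a rough illustration of how candidate values of p can be read off the data, the sketch below builds a similarity matrix and computes the mean, median, and minimum of its off-diagonal entries. It assumes negative squared Euclidean distance as the similarity; the helper names `similarity_matrix`, `preference_candidates`, and `with_preference` are hypothetical. The median is included because it is what scikit-learn's `AffinityPropagation` uses when no `preference` is given.

```python
import numpy as np


def similarity_matrix(X):
    """Pairwise similarities s(i, k) = -||x_i - x_k||^2 (larger means more similar)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return -np.sum(diff ** 2, axis=2)


def preference_candidates(S):
    """Mean, median and minimum of the off-diagonal similarities."""
    off_diag = S[~np.eye(S.shape[0], dtype=bool)]
    return {
        "mean": float(np.mean(off_diag)),
        "median": float(np.median(off_diag)),
        "min": float(np.min(off_diag)),
    }


def with_preference(S, p):
    """Copy S and write the shared preference p onto its diagonal s(k, k)."""
    S = S.copy()
    np.fill_diagonal(S, p)
    return S
```

Any of these values could then be passed as `preference` to `AffinityPropagation`; a lower p should generally give fewer clusters.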
The main part of the code is as follows:
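from sklearn.cluster import AffinityPropagation

def AP_clustering(data):
    '''
    :param data: samples to cluster, shape (n_samples, n_features)
    :return: number of cluster centers found for each preference value
    '''
    center_num = []
    # Try the preference values -20, -25, ..., -45.
    for i in range(-20, -50, -5):
        ap = AffinityPropagation(preference=i).fit(data)
        cluster_centers_indices = ap.cluster_centers_indices_
        labels = ap.labels_
        n_clusters_ = len(cluster_centers_indices)
        center_num.append(n_clusters_)
        print('Predicted number of cluster centers: %d' % n_clusters_)
    return center_num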
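Spectral clustering is easy to implement in Python. Most of the code is for parameter search; the clustering itself needs only a single function call. The code is as follows: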
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering
from sklearn.metrics import calinski_harabasz_score

def chose_para(X):
    # X is assumed to be a 2-D array of shape (n_samples, n_features).
    # Grid-search gamma and n_clusters, keeping the best Calinski-Harabasz score.
    best = None
    for gamma in (0.01, 0.1, 1, 10):
        for k in (2, 3, 4):
            y_pred = SpectralClustering(n_clusters=k, gamma=gamma).fit_predict(X)
            score = calinski_harabasz_score(X, y_pred)
            print("Calinski-Harabasz Score with gamma=", gamma, "n_cluster=", k, "score=", score)
            if best is None or score > best['score']:
                best = {'gamma': gamma, 'n_cluster': k, 'score': score}
    print("max score:\n", best)
    # Refit with the best parameters and plot the first two features.
    y_pred = SpectralClustering(n_clusters=best['n_cluster'], gamma=best['gamma']).fit_predict(X)
    plt.title('SpectralClustering of blobs')
    plt.scatter(X[:, 0], X[:, 1], marker='.', c=y_pred)
    plt.show()
    return y_pred
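The parameter search above ranks each setting by its Calinski-Harabasz score. To make clear what that number measures, here is a small NumPy sketch of the standard definition: between-cluster dispersion divided by within-cluster dispersion, each scaled by its degrees of freedom. The function name `calinski_harabasz` is hypothetical and the code is illustrative only; in practice `sklearn.metrics.calinski_harabasz_score` would be used.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Ratio of between-cluster to within-cluster dispersion (higher is better)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_samples = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    if k < 2 or k >= n_samples:
        raise ValueError("need 2 <= number of clusters < number of samples")
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        # Size-weighted squared distance of the cluster center from the global mean.
        between += len(members) * np.sum((center - overall_mean) ** 2)
        # Squared distances of the members from their own center.
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n_samples - k))
```

Compact, well-separated clusters give a large numerator and a small denominator, which is why the search keeps the setting with the highest score.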
Original: https://blog.csdn.net/North_City_/article/details/117996724
Author: _Tunan
Title: A Summary of Clustering Algorithms (with Code) (聚类算法汇总(附代码))